August 21 – 25, 2023
IFSC/USP
Time zone America/Sao_Paulo

Deep variational anomaly generation: an approach to testing molecular representation robustness

Aug 21, 2023, 14:00
1h 30m
Salão de Eventos USP

Basic 14:00 - 15:30

Description

Real-world datasets in domains ranging from telecommunications to healthcare often contain anomalous or outlier data that deviate significantly from the norm. These anomalies are usually filtered out before modeling to ensure data quality, which has led to the development of robust anomaly detection models that can be deployed for data cleaning or to raise alarms in dynamic information processing systems, for example in web browsing, spam detection, and credit card fraud detection. In some cases, however, the anomalies themselves are the focus of investigation, and attention shifts from anomaly detection to anomaly generation. Where scarce training data limits the predictive potential of detection models, generating anomalies to populate synthetic training datasets has emerged as a promising way to address the scarcity. (1)
The same idea can be applied to molecular representations: progress in maximizing their robustness calls for equally advanced methods for testing that robustness. We propose using deep learning, specifically variational autoencoders (VAEs), to generate anomalies in a recently developed molecular string representation called SELF-referencIng Embedded Strings (SELFIES). (2-3) The objective was to test the robustness of the SELFIES representation, which is assumed to be 100% valid when converted to SMILES notation. By exploring a hyperspherical latent space, we demonstrated that a VAE trained on SELFIES representations can generate molecules that violate this validity assumption, outperforming a set of null models on the same task.
We propose the VAE and the associated anomaly generation methodology as an effective means of testing the robustness of molecular representations. Furthermore, we discuss potential sources of invalidity in the SELFIES representation (latest version 2.1.1) and suggest validating modifications to address these issues. This discussion aims to invite further discourse on SELFIES and molecular string representations, fostering continuous improvement and development in the field.
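
The abstract does not say how the hyperspherical latent space was explored, so the following is only a minimal Python sketch under stated assumptions: latent points are taken to lie on a unit hypersphere, new points are drawn uniformly or along great-circle paths between two points, and the share of decoded strings that fail a validity check is then measured. The names `sample_on_hypersphere`, `slerp`, `invalid_fraction`, `decode`, and `is_valid` are hypothetical. In a real experiment `decode` would stand for the trained VAE decoder followed by SELFIES-to-SMILES conversion, and `is_valid` for a chemistry toolkit check; neither is implemented here.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def sample_on_hypersphere(dim: int, rng: random.Random) -> Vector:
    """Draw a point uniformly from the unit hypersphere in `dim` dimensions.

    Normalizing an isotropic Gaussian vector yields a uniform direction.
    """
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:  # guard against a (vanishingly unlikely) zero vector
            return [x / norm for x in v]


def slerp(a: Sequence[float], b: Sequence[float], t: float) -> Vector:
    """Spherical linear interpolation between unit vectors `a` and `b`."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two points
    if omega < 1e-9:  # nearly identical points: nothing to interpolate
        return list(a)
    s = math.sin(omega)
    if s < 1e-9:  # antipodal points: the great-circle path is not unique
        raise ValueError("cannot interpolate between antipodal points")
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def invalid_fraction(
    points: Sequence[Sequence[float]],
    decode: Callable[[Sequence[float]], str],  # hypothetical: latent -> string
    is_valid: Callable[[str], bool],           # hypothetical: validity check
) -> float:
    """Fraction of latent points whose decoded string fails `is_valid`."""
    if not points:
        raise ValueError("need at least one latent point")
    decoded = [decode(z) for z in points]
    return sum(1 for s in decoded if not is_valid(s)) / len(decoded)


# Walk along a great circle between two random latent points.
rng = random.Random(0)
start = sample_on_hypersphere(8, rng)
end = sample_on_hypersphere(8, rng)
path = [slerp(start, end, i / 10) for i in range(11)]
```

Passing `path` (or a batch of uniform samples) to `invalid_fraction` for both the VAE and a null model would give the kind of comparison the abstract describes. The decoder and validity check are passed in as callables so that the sketch stays independent of any particular model or chemistry library.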

References

1 LAPTEV, N. Anogen: deep anomaly generator. In: OUTLIER DETECTION DE-CONSTRUCTED WORKSHOP, 2018, London. Proceedings [...]. New York: ACM, 2018. 3 p.

2 GÓMEZ-BOMBARELLI, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, v. 4, n. 2, p. 268-276, Feb. 2018.

3 KRENN, M. et al. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Machine Learning: Science and Technology, v. 1, n. 4, p. 045024-1-045024-8, 2020.

I certify that the persons named as author and co-author are aware of their nomination. Yes
Keywords Variational auto-encoder. SELFIES. Anomaly generation.
Advisor and co-advisor Advisor Rafael Victório Carvalho Guido. Co-advisor Alexandre Victor Fassio
Subarea 1 Crystallography
Subarea 2 (optional) Drug Design
Funding agency CAPES
Grant number 88887.357974/2019-00
Level Doctorate
Copyright grant Yes

Primary author

Victor Nogueira (Instituto de Física de São Carlos - USP)

Co-authors

Rishabh Sharma (University of California - UC)
Michael Keiser (University of California - UC)
Rafael Victório Carvalho Guido (Instituto de Física de São Carlos - USP)
