Reymond Research Group

University of Bern

Nested TMAPs to Visualize Billions of Molecules

Check out our new paper, "Nested TMAPs to Visualize Billions of Molecules", in the Journal of Chemical Information and Modeling!

Abstract

Here, we present a visualization and clustering framework enabling the exploration of billion-sized chemical data sets, exemplified with the REAL database of 9.6 billion make-on-demand molecules. We represent molecules as 42-dimensional MQN (molecular quantum numbers) fingerprints describing molecular structures with counts for different atom and bond types, polar groups and topological features, and cluster the data set by applying Product Quantization and PQk-Means. We retrieve the molecule closest to the cluster centroid as a representative for each cluster and compute a tree-map (TMAP) displaying these representatives organized by MQN-similarity. Each cluster representative in this primary TMAP is linked to a nested secondary TMAP displaying the corresponding cluster content organized by the ECFP4 substructure fingerprint similarity. This nested TMAP approach can be computed on a single workstation and gives direct access to the entire data set down to single molecular structures in two clicks. A nested TMAP for the REAL database is accessible at https://chelombus.gdb.tools/databases/real-database.
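To make the clustering steps in the abstract more concrete, here is a minimal, standard-library-only sketch of product quantization followed by picking a representative per cluster. It is an illustration under stated assumptions, not the code used in the paper: the toy data are random integers standing in for 42-dimensional MQN vectors, the settings `m=6` sub-spaces and `k=2` centroids per sub-space are arbitrary, squared Euclidean distance is assumed, and molecules that share the same PQ code are grouped as a crude stand-in for PQk-Means. All function names are hypothetical.

```python
import random


def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumption of this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means on a small list of points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the previous centroid if a group ends up empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def train_pq(data, m, k):
    """Learn one k-means codebook per sub-space."""
    return [kmeans([split(v, m)[s] for v in data], k, seed=s) for s in range(m)]


def encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest codebook centroid."""
    return tuple(
        min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
        for sub, cb in zip(split(vec, len(codebooks)), codebooks)
    )


def representative(members, data):
    """Return the index of the cluster member closest to the cluster centroid."""
    centroid = [sum(col) / len(members) for col in zip(*(data[i] for i in members))]
    return min(members, key=lambda i: sq_dist(data[i], centroid))


# Toy data: 200 random 42-dimensional count vectors (NOT real MQN fingerprints).
rng = random.Random(0)
data = [[rng.randint(0, 30) for _ in range(42)] for _ in range(200)]

codebooks = train_pq(data, m=6, k=2)
codes = [encode(v, codebooks) for v in data]

# Stand-in for PQk-Means: group molecules that share the same PQ code.
clusters = {}
for i, code in enumerate(codes):
    clusters.setdefault(code, []).append(i)

# One representative per cluster, as in the primary TMAP described above.
reps = {code: representative(members, data) for code, members in clusters.items()}
```

In this sketch each 42-dimensional vector is compressed to six small integers, which is the property that makes clustering billions of molecules feasible in memory. The representatives in `reps` correspond to the points that would be laid out in the primary map, with each cluster's members forming a secondary map.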

Authors: Alejandro Flores Sepúlveda, Jean-Louis Reymond