GDB Databases
About
GDB-11 enumerates small organic molecules up to 11 atoms of C, N, O and F following simple chemical stability and synthetic feasibility rules.
GDB-13 enumerates small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules. With 977 468 314 structures, GDB-13 was the largest publicly available small organic molecule database at the time of its release.
GDB-17 enumerates 166 billion molecules up to 17 atoms of C, N, O, S and halogens.
GDB-20 enumerates small organic molecules up to 20 atoms of C, N, O, S, F, Cl, Br, and I, following simple chemical stability and synthetic feasibility rules. By combining systematic graph enumeration with machine learning–based generation, 12,092,137,338 unique molecules are assembled as part of the newest GDB-20 database, representing a subset of an estimated 32 trillion possible molecules within this chemical space.

How to cite
To cite GDB-11, please reference:
Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physico-chemical properties, compound classes and drug discovery. Fink, T.; Reymond, J.-L. J. Chem. Inf. Model. 2007, 47, 342-353.
Virtual Exploration of the Small Molecule Chemical Universe below 160 Daltons. Fink, T.; Bruggesser, H.; Reymond, J.-L. Angew. Chem. Int. Ed. 2005, 44, 1504-1508.
To cite GDB-13, please reference:
970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Blum, L. C.; Reymond, J.-L. J. Am. Chem. Soc. 2009, 131, 8732-8733.
To cite GDB-17, please reference:
Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. J. Chem. Inf. Model. 2012, 52, 2864-2875.
To cite GDB-20, please reference:
Sampling a GDB-20 Database of 32 Trillion Drug-Like Molecules by Generative Artificial Intelligence. Buehler, Y.; Javor, S.; Reymond, J.-L. ChemRxiv 2026.
Download
The GDB databases are hosted on the open-access repository Zenodo. You can download the databases and their subsets using the links below. All molecules are stored as dearomatized, canonical SMILES and compressed as tar/gz archives (Windows users can open the archives with 7-Zip). To view the structures as drawings, we suggest MarvinView, a free chemical structure viewer from ChemAxon.
| Set | Link | Size |
|---|---|---|
| GDB-20 (Generative models: https://github.com/Ye-Buehler/GDB-ML ; Zenodo: https://zenodo.org/communities/reymond-gdb20) | ||
| GDB-20s (HAC1-17) | GDB20.HAC1-17.tar.gz | 2.8 GB |
| GDB-20s (HAC18) | GDB20.HAC18.tar.gz | 13.7 GB |
| GDB-20s (HAC19) | GDB20.HAC19.tar.gz | 17.9 GB |
| GDB-20s (HAC20) | GDB20.HAC20.tar.gz | 35.3 GB |
| GDB-20s (50 million) | GDB20.50000000.smi.gz | 484 MB |
| GDB-20s (50 thousand) | GDB20.50000.smi.gz | 470 KB |
| GDB-17 | ||
| GDB-17-Set (50 million) | GDB17.50000000.smi.gz | 484 MB |
| Lead-like Set (100-350 MW & 1-3 clogP) (11 million) | GDB17.50000000LL.smi.gz | 75 MB |
| Lead-like Set (100-350 MW & 1-3 clogP) without small rings (3-4 ring atoms) (0.8 million) | GDB17.50000000LLnoSR.smi.gz | 55 MB |
| GDB-13 | ||
| Entire GDB-13 (including all C/N/O/Cl/S molecules) | gdb13.tgz | 2.6 GB |
| GDB-13 Subsets (together, the subsets below make up the entire GDB-13 above) | ||
| Graph subset (saturated hydrocarbons) | gdb13.g.tgz | 1.1 MB |
| Skeleton subset (unsaturated hydrocarbons) | gdb13.sk.tgz | 14 MB |
| Only carbon & nitrogen containing molecules | gdb13.cn.tgz | 443 MB |
| Only carbon & oxygen containing molecules | gdb13.co.tgz | 299 MB |
| Only carbon & nitrogen & oxygen containing molecules | gdb13.cno.tgz | 1.8 GB |
| Chlorine & sulphur containing molecules | gdb13.cls.tgz | 189 MB |
| GDB-13 Subsets (for details, please refer to Table 2 in J. Comput. Aided Mol. Des. 2011, 25, 637-647) | ||
| GDB-13 Subset AB (~635 million) | AB.smi.gz | 2.4 GB |
| GDB-13 Subset ABC (~441 million) | ABC.smi.gz | 1.7 GB |
| GDB-13 Subset ABCD (~277 million) | ABCD.smi.gz | 1.1 GB |
| GDB-13 Subset ABCDE (~140 million) | ABCDE.smi.gz | 565 MB |
| GDB-13 Subset ABCDEF (~43 million) | ABCDEF.smi.gz | 171 MB |
| GDB-13 Subset ABCDEFG (~13 million) | ABCDEFG.smi.gz | 50 MB |
| GDB-13 Subset ABCDEFGH (~1.4 million) | ABCDEFGH.smi.gz | 6.2 MB |
| GDB-13 Random Sample, annotated with frequency and log-likelihood (please refer to "Exploring the GDB-13 chemical space using deep generative models") | ||
| GDB-13 Random Sample (1 Million) | gdb13.1M.freq.ll.smi.gz | 14.8 MB |
| GDB-13s | ||
| GDB-13s | GDB-13s.smi.gz | 423.0 MB |
| FDB-17 | ||
| FDB-17 | FDB-17-fragmentset.smi.gz | 62.2 MB |
| GDB4c | ||
| GDB4c (SMILES) | GDB4c.smi.gz | 6.2 MB |
| GDB4c3D (SMILES) | GDB4c3D.smi.gz | 161 MB |
| GDB4c3D (SDF) | GDB4c3D.sdf.tar.gz | 2 GB |
| Other | ||
| GDBMedChem (SMILES) | GDBMedChem.smi | 353.6 MB |
| GDBChEMBL (SMILES) | GDBChEMBL.smi | 276 MB |
| GDB-13 random selection (1 million) | gdb13.rand1M.smi.gz | 7.2 MB |
| Fragment-like subset (Rule of three) | gdb13.frl.tgz | 1.2 GB |
| Dark matter universe up to 9 heavy atoms | dmu9.tgz | 87 MB |
| GDB-11 | ||
| Entire GDB-11 (including all C/N/O/F molecules) | gdb11.tgz | 122 MB |
| Fragrance-like Subsets (for details, please refer to Ruddigkeit et al. Journal of Cheminformatics 2014, 6:27) | ||
| FragranceDB (SuperScent + Flavornet) | FragranceDB.smi | 56 KB |
| TasteDB (SuperSweet + BitterDB) | TasteDB.smi | 44 KB |
| FragranceDB.FL (Fragrance-like subset of FragranceDB) | FragranceDB.FL.smi | 32 KB |
| ChEMBL.FL (Fragrance-like subset of ChEMBL) | ChEMBL.FL.smi | 452 KB |
| PubChem.FL (Fragrance-like subset of PubChem) | PubChem.FL.smi | 20 MB |
| ZINC.FL (Fragrance-like subset of ZINC) | ZINC.FL.smi | 1.3 MB |
| GDB-13.FL (Fragrance-like subset of GDB-13) | GDB-13.FL.smi.gz | 165 MB |
| Scaffold Hopping: A diversity-driven fragment library (extracted from GDB-17), indexed for ReCore from BioSolveIT, is available. | ||
| 3D Scaffold Hopping tool | ReCore | 165 MB |
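
Because the `.smi.gz` files listed above can hold millions of SMILES strings, it is usually more practical to stream them than to decompress them fully. The following Python sketch is not part of the GDB distribution. It assumes that each line of a file holds one SMILES string, optionally followed by whitespace-separated annotation columns, and that "HAC" in the GDB-20s archive names stands for heavy atom count. The function names `iter_smiles`, `heavy_atom_count` and `hac_histogram` are hypothetical, and the regex-based atom counter is a simplification that only covers the organic-subset elements and bracket atoms; a cheminformatics toolkit would be more robust.

```python
import gzip
import re
from collections import Counter
from typing import Iterator

# One atom token: a bracket atom, a two-letter halogen, or a one-letter
# organic-subset atom (upper case, or lower case for aromatic atoms).
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]")
# Bracket hydrogens such as [H], [2H] or [H+] are not heavy atoms.
_BRACKET_H = re.compile(r"\[\d*H[+-]?\d*\]")


def iter_smiles(path: str) -> Iterator[str]:
    """Yield the first whitespace-separated field of each non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                yield fields[0]


def heavy_atom_count(smiles: str) -> int:
    """Count non-hydrogen atoms by tokenizing the SMILES string."""
    return sum(
        1 for token in _ATOM.findall(smiles) if not _BRACKET_H.fullmatch(token)
    )


def hac_histogram(path: str) -> Counter:
    """Tally how many molecules in the file have each heavy atom count."""
    return Counter(heavy_atom_count(smiles) for smiles in iter_smiles(path))
```

With a downloaded sample such as `GDB20.50000.smi.gz`, calling `hac_histogram("GDB20.50000.smi.gz")` would return the size distribution of the sample, provided the file follows the assumed one-molecule-per-line layout. The `.tar.gz` and `.tgz` archives would first need to be unpacked, for example with Python's standard `tarfile` module.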
Tagsfree Encoding System for Combinatorial Peptide Libraries
About
TAGSFREE is a program for designing split-and-mix peptide libraries that can be decoded by amino acid analysis. The analysis is independent of peptide topology (linear, branched, cyclic) and amino acid type (natural or non-natural, including beta-amino acids).
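
To illustrate what decoding by amino acid analysis implies, the Python sketch below checks a small split-and-mix design for ambiguity. It is not the TAGSFREE program or its algorithm. It rests on the assumption that amino acid analysis reports only which residues are present and how many of each, so a library is decodable when no two members share the same composition. The function names and the representation of a design as a list of per-position building-block choices are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence, Tuple

Peptide = Tuple[str, ...]
Composition = Tuple[Tuple[str, int], ...]


def composition(peptide: Sequence[str]) -> Composition:
    """Return sorted (residue, count) pairs, ignoring order and connectivity."""
    return tuple(sorted(Counter(peptide).items()))


def enumerate_library(positions: Sequence[Sequence[str]]) -> List[Peptide]:
    """List every member of a split-and-mix library, one residue per position."""
    return list(product(*positions))


def find_ambiguous(
    positions: Sequence[Sequence[str]],
) -> Dict[Composition, List[Peptide]]:
    """Group library members by composition and keep groups with collisions."""
    groups: Dict[Composition, List[Peptide]] = {}
    for peptide in enumerate_library(positions):
        groups.setdefault(composition(peptide), []).append(peptide)
    return {comp: members for comp, members in groups.items() if len(members) > 1}
```

Under this assumption, a design such as `[["A", "G"], ["L", "V"]]` has no collisions, whereas `[["A", "G"], ["A", "G"]]` does, because the members A-G and G-A have the same composition. Since the composition ignores how residues are connected, the same check applies to linear, branched and cyclic peptides.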

How to cite
Any work based on the TAGSFREE method must cite the following publications:
A General Method for Designing Combinatorial Peptide Libraries Decodable by Amino Acid Analysis. Kofoed, J.; Reymond, J.-L. J. Comb. Chem. 2007, 9, 1046-1052.
Identification of protease substrates by combinatorial profiling on TentaGel beads. Kofoed, J.; Reymond, J.-L. Chem. Commun. 2007, 48, 4453-4455.