The Chemical Space Project
Chemical space describes the ensemble of all molecules that are possible by assembling atoms through covalent bonds. This concept is particularly relevant in drug discovery, where new molecular entities are constantly needed to develop new drugs addressing unmet medical needs. In our research we design cheminformatics methods for enumerating, mapping and virtually screening the chemical space of small organic molecules and peptides (BigChem). We then implement these methods to choose, synthesize and test molecules in the laboratory, with the goal of identifying promising bioactive compounds that might later be developed as drugs. We are currently working on small molecules targeting ion channels and transporters (NCCR TransCure, HiQScreen Sàrl), on new peptide-based antibiotics against multi-drug resistant bacteria (B5 Platform project), and on peptide dendrimers for DNA/RNA transfection (MMBio). One further project concerns the design and synthesis of peptide and carbohydrate model substrates to study protein glycosylation enzymes (TransGlyco).
GDB stands for “Generated DataBase”, which is a list of all small organic molecules that would be possible following simple rules of chemical stability and synthetic feasibility. We started this project by enumerating GDB-11, which lists 26.4 million molecules consisting of up to 11 atoms of carbon, nitrogen, oxygen, and fluorine. As computational power and memory kept increasing, we extended the enumeration to GDB-13, containing almost one billion molecules of up to 13 atoms, and GDB-17, containing 166 billion molecules of up to 17 atoms. These databases are by no means exhaustive because the rules chosen to select molecules are quite simple. Nevertheless, the GDB provides a fascinating overview of chemical space, in particular of the vast areas that so far remain completely unexplored experimentally. For example, GDB is very abundant in 3D-shaped molecules containing asymmetric and quaternary carbon centers, which are structural features frequently encountered in complex natural products but not in typical drug molecules.
Why are so many small molecules still unknown? One simple answer is in the numbers: not that many molecules have been synthesized to date. For instance, while GDB-17 contains 166 billion molecules of up to 17 atoms, only approximately 100 million entries are recorded in total in the CAS registry, and the vast majority of these molecules are larger than 17 atoms. A second answer lies in our lack of imagination: synthetic chemistry is taught using a rather limited set of molecules, leaving many innovative possibilities out of sight. For example, the chiral polycyclic hydrocarbons from GDB-11 shown below both contain three interlocked norbornane units; however, only the one shown at right has been reported: it was obtained as the trioxa analog by acid-catalyzed cyclization of barrelene triepoxide.
Peptide Dendrimers and Bicyclic Peptides
Peptides and proteins are key elements in the complex machinery of living cells and organisms. At the molecular level they consist of chains of tens (for peptides) to hundreds (for proteins) of amino acids connected by peptide bonds, which usually fold into a complex 3D-structure.
If one considers peptides and proteins as graphs in which graph nodes are occupied by amino acids and graph edges by peptide bonds, one realizes that they consist of only one type of graph, a linear chain of varying length. If we now introduce branching points in this chain by forming peptide bonds both at the α-position and at the side chain of lysine or glutamic acid residues, a diversity of branched topologies becomes possible, creating a vast and completely new chemical space waiting to be explored. These branched peptides can be assembled by solid-phase peptide synthesis (SPPS), a method which is used to produce peptides in the laboratory as well as on an industrial scale. We are particularly aiming at branched peptides of 10 to 40 residues, which occupy the size range of large natural products and might exhibit biological activities not accessible to small organic molecules.
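The graph view described above can be illustrated with a minimal Python sketch. It is not the group's software: the function names and the two toy peptides are hypothetical, and residues are treated simply as labeled nodes joined by peptide bonds. A residue with three or more bonded neighbors is counted as a branching point.

```python
from collections import defaultdict


def build_peptide_graph(bonds):
    """Build an adjacency map from a list of (residue, residue) peptide bonds."""
    graph = defaultdict(set)
    for a, b in bonds:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def branching_points(graph):
    """Residues bonded to three or more neighbors act as branching points."""
    return sorted(node for node, nbrs in graph.items() if len(nbrs) >= 3)


def is_linear_chain(graph):
    """True for a chain with two end residues and no branching.

    Assumes the graph is connected (a single molecule).
    """
    degrees = [len(nbrs) for nbrs in graph.values()]
    return degrees.count(1) == 2 and all(d <= 2 for d in degrees)


# Hypothetical linear tetrapeptide: Ala1-Gly2-Ser3-Leu4
linear = build_peptide_graph([
    ("Ala1", "Gly2"),
    ("Gly2", "Ser3"),
    ("Ser3", "Leu4"),
])

# Hypothetical branched peptide: Lys2 forms peptide bonds through its
# carboxyl group, its alpha-amino group, and its side-chain amino group.
branched = build_peptide_graph([
    ("Lys2", "Gly1"),  # bond from the lysine carboxyl group
    ("Ala3", "Lys2"),  # bond to the lysine alpha-amino group
    ("Ser4", "Lys2"),  # bond to the lysine side-chain amino group
])

print(is_linear_chain(linear), branching_points(linear))
print(is_linear_chain(branched), branching_points(branched))
```

In this toy representation the linear peptide has no branching point, while the single lysine in the second peptide is enough to turn the chain into a branched topology.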
Our exploration of branched peptides started with the synthesis of peptide dendrimers. The tree-like topology of dendrimers had been well studied with organic polymers but not at all with peptides of defined sequences. We demonstrated that SPPS is suitable for synthesizing very pure peptide dendrimers in good preparative yields in a divergent approach with up to three successive branching points and 37 amino acids. We discovered peptide dendrimers acting as artificial enzymes, as well as lectin-binding glycopeptide dendrimers acting as inhibitors of bacterial biofilms. We have also designed cell-penetrating dendrimers for drug delivery, as well as peptide dendrimers acting as efficient reagents for DNA transfection, which we are currently optimizing for nucleic acid transfection in diverse applications.
Our current focus is on a class of antimicrobial peptide dendrimers to address multi-drug resistant pathogens, which are a pressing public health threat. One of our peptide dendrimers exhibits particularly strong antimicrobial activity against a variety of multi-drug resistant pathogens, including Pseudomonas aeruginosa and Acinetobacter baumannii, without eliciting resistance. Furthermore, this antimicrobial peptide dendrimer is non-toxic and resistant to degradation in serum.
To extend our investigation of branched peptides we recently established a robust SPPS approach to form bicyclic peptides that contain a lysine and a glutamate residue as bridgeheads. These define a vast new class of constrained peptides, comprising 3.1×10^19 members distributed in 97 bicyclic graphs of up to 15 residues when using only proteinogenic amino acids. Compared to peptide dendrimers, these bicyclic peptides possess a more compact and well-defined structure, and are also largely resistant to serum degradation. Using a proteomics-based photoaffinity labeling approach we identified a bicyclic peptide with micromolar binding affinity to calmodulin as a proof of principle of protein targeting for this new class of peptides.
Chemical Space Maps
The chemical space of small molecules in GDB, as well as that of branched peptides, is striking in its size. Indeed, most discussions about chemical space in the literature focus entirely on the number of compounds. However, once a large number of molecules has been explicitly written out as SMILES representing their chemical structure, as is the case for our GDB, the real and more complex problem becomes the description of this database in a form that makes its contents understandable and useful.
Our solution to that problem consisted of selecting 42 molecular descriptors counting atoms, bonds, polar groups and topological features, which we called Molecular Quantum Numbers (MQN). The corresponding 42-dimensional MQN property space organizes molecules following molecular size, structural rigidity, and polarity, in a manner comparable to the periodic table of elements. Its logical organization is clearly visible in 2D maps obtained by projection in the principal component plane. Each pixel is color-coded according to the value and standard deviation of a selected molecular property for the molecules in that pixel, which is a key innovative step in our visualization method.
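The mapping idea can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the group's implementation: the 42-column descriptor matrix is random placeholder data rather than real MQN values, the function names and grid size are hypothetical, and the per-pixel value is taken here to be the mean of the chosen property.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto their leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def pixel_stats(coords, values, n_bins=10):
    """Bin 2D coordinates into an n_bins x n_bins grid.

    Returns the per-pixel mean, standard deviation and molecule count of
    `values`; empty pixels get NaN for mean and standard deviation.
    """
    mins = coords.min(axis=0)
    spans = coords.max(axis=0) - mins
    spans[spans == 0] = 1.0  # avoid division by zero on a degenerate axis
    idx = np.minimum(((coords - mins) / spans * n_bins).astype(int), n_bins - 1)
    flat = idx[:, 0] * n_bins + idx[:, 1]
    size = n_bins * n_bins
    counts = np.bincount(flat, minlength=size).astype(float)
    sums = np.bincount(flat, weights=values, minlength=size)
    squares = np.bincount(flat, weights=values ** 2, minlength=size)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = sums / counts
        var = squares / counts - mean ** 2
    std = np.sqrt(np.clip(var, 0, None))
    shape = (n_bins, n_bins)
    return mean.reshape(shape), std.reshape(shape), counts.reshape(shape)


# Placeholder data: 500 "molecules" with 42 integer descriptors each.
rng = np.random.default_rng(0)
descriptors = rng.integers(0, 20, size=(500, 42)).astype(float)
prop = descriptors[:, 0]  # hypothetical property used for coloring

coords = pca_project(descriptors)
mean_map, std_map, count_map = pixel_stats(coords, prop)
```

The `mean_map` and `std_map` arrays could then be passed to any plotting library, for example by mapping the mean to hue and the standard deviation to saturation; that color scheme is one possible choice, not a description of the published maps.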
We have created interactive versions of these colorful chemical space maps in 2D and 3D to visualize molecules in each pixel on mouseover, for GDB as well as for further public databases such as DrugBank, ChEMBL, PubChem and ZINC. We have further extended this concept in an interactive web page called similarity mapplet, which creates interactive similarity maps comparing user-defined sets of molecules with the entire ChEMBL database. Most recently we have extended the method to produce a chemical space map of the Protein Data Bank, containing all experimentally determined 3D structures of proteins and peptides, using a new molecular fingerprint called 3DP, which we designed to represent the molecular shape and pharmacophores of proteins. We are currently completing a similar approach to map the chemical space of our branched peptides.
Molecules that are close to one another in MQN space tend to be structurally related and exhibit similar biological activities. Therefore MQN proximity searches can be used for ligand-based virtual screening (LBVS) to identify new analogs of a reference drug. MQN-similarity searching is extremely fast and is the only LBVS method applicable to GDB-17. Due to our interest in LBVS in very large databases we have developed additional fingerprints, in particular SMIfp, APfp, Xfp and 3DXfp, which allow similarity searches by molecular shape and pharmacophore similarity and are available for use from our tools webpage. Recently we also developed a direct 3D-shape and pharmacophore scoring function for LBVS called xLOS (atom category extended Ligand Overlap Score).
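A proximity search of this kind can be sketched as a nearest-neighbor lookup over descriptor vectors. The example below is a minimal illustration, not the group's search tool: it assumes the city-block (Manhattan) distance as the proximity measure, uses toy 4-dimensional integer vectors in place of real 42-dimensional MQN values, and the function name is hypothetical.

```python
import numpy as np


def nearest_by_city_block(query, database, k=3):
    """Return indices and distances of the k database rows closest to `query`.

    Distance is the city-block (Manhattan) distance between descriptor vectors.
    """
    distances = np.abs(database - query).sum(axis=1)
    order = np.argsort(distances, kind="stable")[:k]
    return order, distances[order]


# Toy descriptor vectors (placeholders, not real MQN values).
database = np.array([
    [6, 0, 1, 0],
    [6, 1, 1, 0],
    [12, 2, 3, 1],
    [7, 0, 1, 0],
    [20, 4, 5, 2],
])
reference = np.array([6, 0, 1, 0])  # descriptor vector of the reference molecule

idx, dist = nearest_by_city_block(reference, database, k=3)
print(idx.tolist(), dist.tolist())  # [0, 1, 3] [0, 1, 1]
```

Because the distance is a sum of absolute differences over a short integer vector, it is cheap to evaluate, which is consistent with the speed claimed for MQN searches; scanning a database of billions of molecules would still need indexing or chunked processing that this sketch does not attempt.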
In the early stages of drug discovery one tries to identify strong and selective modulators of a particular bioassay, targeting for example inhibition of an enzyme, channel or transporter, or a directly observable biological phenomenon such as cell survival. One can use high-throughput screening (HTS) of several hundred thousand molecules to perform this step; however, HTS is very expensive and often not applicable. In our laboratory we use a simpler and faster approach based on virtual screening. We use our LBVS methods to select analogs of compounds already known to interact with the chosen biological target, or a related one, for experimental testing. Typically we search through the 12 million commercially available drug-like molecules in the ZINC database and purchase between 50 and 500 compounds for experimental testing. We follow up on initial hits with additional rounds of LBVS and testing, and eventually synthesize further analogs of the best inhibitors to understand the structure-activity relationship and optimize activity and selectivity. This approach allowed us to discover the first selective inhibitor of TRPV6, a calcium channel overexpressed in various cancers, as well as a new and highly selective inhibitor of Aurora A, a kinase cancer target. Beyond screening commercial catalogs, we have also used our LBVS methods to search for bioactive compounds in our GDB databases, in particular modulators of NMDA and nicotinic acetylcholine receptors.
Protein Glycosylation Enzymes
Following on previous activities in developing reagents for enzyme assays and glycosidase inhibitors, we are currently collaborating with Markus Aebi and Kaspar Locher at ETHZ to study protein glycosylation enzymes, in particular oligosaccharyltransferases and oligosaccharide flippases. In this project we design and synthesize fluorescent peptide and lipid-linked oligosaccharide model substrates to enable biochemical and structural studies.