Molecule Cloud methodology
The principle of the Molecule Cloud method is very simple. The most common substructure features present in the analyzed data set are identified and then displayed in such a way that their size corresponds to their frequency. The most common substructures are rendered the largest and therefore immediately catch the eye of an observer. Molecules are represented by their scaffolds, i.e. the cores remaining when all non-ring substituents are removed. The scaffold as the central part of a molecule is one of the basic concepts of medicinal chemistry, and scaffolds play an important role in several drug discovery techniques such as combinatorial chemistry and scaffold hopping [7]. Molecules without any rings are represented by their major chain, i.e. the longest chain, containing the largest number of heteroatoms. Reducing molecules to scaffolds simplifies the analysis considerably. For example, 35 million molecules from the PubChem database are represented by about 3.9 million scaffolds (50% of which are singletons, present only once in the database) and about 150 thousand chains.
Once the most common scaffolds and chains are identified, they need to be visualized with their size scaled according to their frequency in the parent database. It is well known that the frequency of various substructure features such as scaffolds, substituents or linkers in molecular databases follows a power law (the so-called “long tail” distribution) [8]. This means that only a few scaffolds in a database are very common, while there are many rare scaffolds, including a large number of singletons. Scaffold frequencies therefore need to be transformed to a logarithmic scale before they are used as a scaling factor. Benzene is a special case: in practically all large data sets it is clearly the most frequent scaffold. Even after logarithmic transformation of the frequencies, benzene would be disproportionately large and would not contribute any useful information, so in many cases it is advisable not to display it. Removal of benzene is similar to the removal of common stop words in classical text clouds.
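The logarithmic scaling and the optional removal of benzene can be illustrated with a minimal Python sketch. It is not the authors' Java code: the function name `scale_sizes`, the size limits `min_size` and `max_size`, the linear mapping of log-frequencies onto that range, and the use of the SMILES string `c1ccccc1` to identify benzene are all assumptions made for illustration.

```python
import math


def scale_sizes(freqs, min_size=20.0, max_size=200.0, skip=("c1ccccc1",)):
    """Map scaffold frequencies to display sizes on a logarithmic scale.

    freqs: dict of scaffold SMILES -> occurrence count.
    skip:  scaffolds left out of the cloud (benzene by default, treated
           like a stop word); matching is by exact string, an assumption.
    """
    kept = {s: n for s, n in freqs.items() if s not in skip and n > 0}
    if not kept:
        return {}
    logs = {s: math.log(n) for s, n in kept.items()}
    lo, hi = min(logs.values()), max(logs.values())
    span = hi - lo
    sizes = {}
    for s, value in logs.items():
        # Position of this scaffold between the rarest (0) and most common (1).
        t = (value - lo) / span if span > 0 else 1.0
        sizes[s] = min_size + t * (max_size - min_size)
    return sizes
```

With a linear mapping a scaffold that is 100 times more frequent than another would be drawn 100 times larger; on the logarithmic scale the difference stays visible without the rare scaffolds shrinking to nothing.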
In our experience the optimal number of substructures to display in a Molecule Cloud is between 100 and 250. Such a cloud usually contains 30 to 50 large, easily recognizable structures; the rest are smaller structures that visually fill the image. Of course, when the graph is displayed in a larger area (for example as a poster), the number of structures that can be displayed is proportionally larger.
The greatest challenge in generating the Molecule Cloud is an aesthetically pleasing layout of the molecules in the graph. This is achieved by a two-pass layout algorithm developed specially for this purpose. In the first pass, molecules are placed in the display area in such a way that their overlap is minimal. The process starts by sorting the molecule images according to their size and placing the largest one in the center. The remaining molecules are then processed one by one: each molecule is tentatively placed at every point of a dense grid of predefined layout points, and an “overlap score” is calculated for each placement. The molecule is finally placed at the position with the best score (i.e. the position where the overlap with other structures is minimal), and the next molecule is processed. This procedure is illustrated in Figure 2, which shows the layout after placement of 10, 25, 50 and 200 molecules. The “overlap score” used to identify the best position for a molecule is calculated as the sum of the overlap areas between molecular frames and the sum of the distances between molecule centers.
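The first pass can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: rectangles are represented as `(cx, cy, w, h)` tuples, "size" is taken to be the frame area, the grid spacing `step` and the relative weight `dist_weight` of the distance term are hypothetical parameters, and ties between equally good grid points are broken by scan order.

```python
import math


def overlap_area(a, b):
    """Overlap area of two frames given as (cx, cy, w, h)."""
    dx = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    dy = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    return dx * dy if dx > 0 and dy > 0 else 0.0


def placement_score(cand, placed, dist_weight=1.0):
    """Sum of overlap areas plus weighted sum of center distances (lower is better)."""
    area = sum(overlap_area(cand, p) for p in placed)
    dist = sum(math.hypot(cand[0] - p[0], cand[1] - p[1]) for p in placed)
    return area + dist_weight * dist


def initial_layout(sizes, width, height, step=10):
    """Greedy placement: largest frame in the center, the rest on grid points."""
    order = sorted(sizes, key=lambda wh: wh[0] * wh[1], reverse=True)
    placed = []
    for w, h in order:
        if not placed:
            placed.append((width / 2, height / 2, w, h))
            continue
        best_score, best_rect = None, None
        # Try every grid point and keep the one with the lowest score.
        for gx in range(0, width + 1, step):
            for gy in range(0, height + 1, step):
                cand = (gx, gy, w, h)
                score = placement_score(cand, placed)
                if best_score is None or score < best_score:
                    best_score, best_rect = score, cand
        placed.append(best_rect)
    return placed
```

The distance term keeps each new frame as close to the already placed ones as the overlap term allows, which is what makes the cloud grow outwards from the center rather than scatter across the image.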
In most cases this initial layout already provides quite a good placement of molecules. To improve it further, a second layout pass is performed, namely an iterative optimization loop. In this pass the molecules are moved slightly, one by one, in the direction that improves the total “overlap score”. Convergence is fast, and the final layout is reached after a few seconds. During the optimization, slight repulsive forces are also placed in the corners of the drawing area to produce an aesthetically more pleasing “oval” display instead of completely filling the available image rectangle.
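A minimal Python sketch of such a refinement loop is given below. It is an illustration only and makes several assumptions that the text does not specify: frames are `(cx, cy, w, h)` tuples, the improving direction is found by trying eight fixed moves of length `step`, the corner repulsion has the hypothetical form `corner_strength / (1 + distance)`, and the loop stops when no single move lowers the total score or after `max_rounds` rounds.

```python
import math
from itertools import combinations


def overlap_area(a, b):
    """Overlap area of two frames given as (cx, cy, w, h)."""
    dx = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    dy = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    return dx * dy if dx > 0 and dy > 0 else 0.0


def total_score(rects, width, height, dist_weight=1.0, corner_strength=1000.0):
    """Pairwise overlap and distance terms plus repulsion from the four corners."""
    corners = [(0, 0), (width, 0), (0, height), (width, height)]
    score = 0.0
    for a, b in combinations(rects, 2):
        score += overlap_area(a, b)
        score += dist_weight * math.hypot(a[0] - b[0], a[1] - b[1])
    for cx, cy, _, _ in rects:
        for x, y in corners:
            # Assumed repulsion form: grows as a frame approaches a corner.
            score += corner_strength / (1.0 + math.hypot(cx - x, cy - y))
    return score


def refine(rects, width, height, step=2.0, max_rounds=200):
    """Move frames one by one in small steps while the total score decreases."""
    rects = list(rects)
    moves = [(dx, dy) for dx in (-step, 0.0, step) for dy in (-step, 0.0, step)
             if (dx, dy) != (0.0, 0.0)]
    best = total_score(rects, width, height)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(rects)):
            cx, cy, w, h = rects[i]
            for dx, dy in moves:
                trial = rects[:i] + [(cx + dx, cy + dy, w, h)] + rects[i + 1:]
                score = total_score(trial, width, height)
                if score < best:
                    # Accept the first improving move for this frame.
                    rects, best, improved = trial, score, True
                    break
        if not improved:
            break
    return rects
```

Recomputing the full score for every trial move is quadratic in the number of frames; a practical implementation would update only the terms involving the moved frame, but the simpler form keeps the sketch short.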
The Molecule Cloud layout algorithm has been implemented in Java. The layout itself does not require any “chemical intelligence”; it operates simply on the rectangles representing molecules. The required molecular processing capabilities, particularly parsing of SMILES and molecule depiction, are defined by a Java interface and may be implemented using any cheminformatics toolkit. As input, the program requires only a list of SMILES codes of the structures to display together with their frequencies, and the desired size of the final image. We tested the algorithm using two cheminformatics engines, the depiction engine from Molinspiration [9] and the recently released Novartis open source Avalon Cheminformatics Toolkit [10]. The Java source code of the Molecule Cloud layout algorithm is available to interested parties from the corresponding author under the terms of the BSD license. The distribution also provides instructions on how to interface the program with the Avalon Cheminformatics Toolkit.
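The separation between layout and chemistry can be illustrated with a short Python sketch. It does not reproduce the actual Java interface, whose method names are not given here: the protocol `DepictionToolkit`, its method `frame_size`, the stand-in class `FixedBoxToolkit`, and the whitespace-separated "SMILES frequency" input format are all hypothetical.

```python
from typing import List, Protocol, Tuple


class DepictionToolkit(Protocol):
    """Hypothetical toolkit interface: the layout only needs frame sizes."""

    def frame_size(self, smiles: str, scale: float) -> Tuple[float, float]:
        """Return (width, height) of the depicted molecule at the given scale."""
        ...


class FixedBoxToolkit:
    """Stand-in toolkit that ignores chemistry and returns a scaled square."""

    def frame_size(self, smiles: str, scale: float) -> Tuple[float, float]:
        return (scale, scale)


def parse_input(lines) -> List[Tuple[str, int]]:
    """Parse lines of the assumed form '<SMILES> <frequency>'."""
    entries = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"expected 'SMILES frequency', got: {line!r}")
        entries.append((parts[0], int(parts[1])))
    return entries


def frames(entries, toolkit: DepictionToolkit, scale: float = 50.0):
    """Turn parsed entries into the rectangles the layout operates on."""
    return [toolkit.frame_size(smiles, scale) for smiles, _ in entries]
```

Because the layout code sees only the returned rectangles, swapping one depiction engine for another requires no change to the layout itself.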
Application examples
In this section Molecule Cloud diagrams are presented for several popular publicly available data sets. PubChem [11] is the largest publicly available molecular structure database. In June 2012 it contained nearly 33 million unique structures. The Molecule Cloud of PubChem is shown in Figure 3. In this image, scaffolds of bioactive molecules are indicated by a magenta background, where the color intensity is proportional to the ratio of bioactive molecules to all molecules containing this scaffold. Bioactive molecules were identified by the PubChem advanced search as molecules having activity better than 10 μM in any PubChem assay.
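The mapping from the bioactive ratio to a background color can be sketched in Python as shown below. The exact color model used for Figure 3 is not described, so the linear blend from white (ratio 0) to full magenta (ratio 1) and the function name `bioactivity_shade` are assumptions made for illustration.

```python
def bioactivity_shade(n_bioactive, n_total):
    """Return an (R, G, B) background for a scaffold.

    The magenta intensity is proportional to n_bioactive / n_total:
    white when no molecule with this scaffold is bioactive,
    full magenta when all of them are.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    ratio = n_bioactive / n_total
    # Magenta is (255, 0, 255); fading the green channel blends toward white.
    green = round(255 * (1.0 - ratio))
    return (255, green, 255)
```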
A very useful resource for drug discovery, particularly for researchers in academia, is the ZINC database [12]. ZINC, created and maintained by John Irwin from the University of California, San Francisco, is a collection of commercially available compounds that may be used in virtual screening. The Molecule Cloud showing the most common scaffolds present in about 12 million ZINC structures is shown in Figure 4.
The last example shows the results of an analysis of molecules in the ChEMBL database [13]. ChEMBL is a database of molecules extracted from medicinal chemistry journals and other sources. It also contains biological activities and information about the respective targets. This database is an extremely useful source of information, particularly for scientists in academia, providing the type of information that was previously available only to researchers in the pharmaceutical industry. The Molecule Cloud visualizing the most common scaffolds of more than 350,000 bioactive molecules (having activity below 10 μM) from ChEMBL is shown in Figure 5. In this figure, information about targets is also displayed using color. The following six target classes were considered: GPCRs, ion channels, nuclear receptors, kinases, proteases and other enzymes. The scaffold box is colored if at least 70% of the molecules containing this scaffold show activity on a particular target class. Scaffolds exhibiting multiple types of activity are not colored. Among these, one can see several known “privileged scaffolds” [14], including biphenyl, indole, quinoline and purine.
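The coloring rule can be expressed as a short Python sketch. The data representation (a dictionary of per-class molecule counts for one scaffold) and the function name `dominant_target_class` are hypothetical; the sketch also assumes that a scaffold for which more than one class reaches the threshold is left uncolored.

```python
TARGET_CLASSES = (
    "GPCRs",
    "ion channels",
    "nuclear receptors",
    "kinases",
    "proteases",
    "other enzymes",
)


def dominant_target_class(class_counts, n_molecules, threshold=0.7):
    """Return the target class used to color a scaffold box, or None.

    class_counts: dict of target class -> number of molecules containing
                  the scaffold that are active on that class.
    n_molecules:  total number of molecules containing the scaffold.
    """
    if n_molecules <= 0:
        raise ValueError("n_molecules must be positive")
    hits = [
        cls for cls in TARGET_CLASSES
        if class_counts.get(cls, 0) / n_molecules >= threshold
    ]
    # Color only when exactly one class dominates; otherwise leave uncolored.
    return hits[0] if len(hits) == 1 else None
```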
The Molecule Cloud can, of course, also be used to visualize structural elements other than scaffolds. This is illustrated in Figure 6, where the most common substituents (up to 15 atoms) from the ChEMBL database are shown. Although the majority of the substituents are the same as in other databases [15], the Molecule Cloud visualization provides a clear advantage over the classical molecule grid display.