Mapping and classifying molecules from a high-throughput structural database

High-throughput computational materials design promises to greatly accelerate the process of discovering new materials and compounds, and of optimizing their properties. The large databases of structures and properties that result from computational searches, as well as the agglomeration of data of heterogeneous provenance leads to considerable challenges when it comes to navigating the database, representing its structure at a glance, understanding structure–property relations, eliminating duplicates and identifying inconsistencies. Here we present a case study, based on a data set of conformers of amino acids and dipeptides, of how machine-learning techniques can help addressing these issues. We will exploit a recently-developed strategy to define a metric between structures, and use it as the basis of both clustering and dimensionality reduction techniques—showing how these can help reveal structure–property relations, identify outliers and inconsistent structures, and rationalise how perturbations (e.g. binding of ions to the molecule) affect the stability of different conformers. Electronic supplementary material The online version of this article (doi:10.1186/s13321-017-0192-4) contains supplementary material, which is available to authorized users.

In this section we will describe a specific combination of atomic structure representations and of analysis techniques that can be used to deal effectively with large databases of materials and molecules.We will however also briefly summarize some of the alternative approaches that could be used to substitute different components of our tool chain.

A. Fingerprints and Structural Dissimilarity
The most crucial and basic element in any structural analysis algorithm is to introduce a metric to measure (dis)similarity between two atomic configurations.Many options are available, with different levels of complexity and generality, starting from the commonly used Root Mean Square (RMS) distance.In order to deal with symmetry operations or condensed phase structures, several "fingerprint" frameworks have been developed 8,[28][29][30][31][32][33][34][35][36][37][38][39][40] , that assign a unique vector of order parameters to each molecular or crystalline configuration: a metric can then be easily built by taking some norm of the difference between fingerprint vectors.Any of these distances could be taken as the basis of the classification and mapping algorithms that we will describe in what follows.
In this paper we will use instead a very flexible framework (REMatch-SOAP) that is based on the definition of an environ-ment similarity matrix C i j (A, B), which contains the complete information on the pair-wise similarity of the environment of each of the atoms within the molecules A and B. In our framework, the similarity between two local environments X A i and X B j is computed using the SOAP kernel 29 C i j (A, B) = k(X A i , X B j ). (1) The REMatch kernel is then defined as the following weighted combination of the elements of C(A, B) Kγ (A, B) = Tr P γ C(A, B), where the optimal combination is obtained by searching over the doubly stochastic matrices U (N, N) the one that minimizes the discrepancy between matching pairs of environments subject to a regularization based on the matrix information-entropy E(P) = − ∑ i j P i j ln P i j 41 .Once a kernel between two configurations has been defined, it is then possible to introduce a kernel distance that we will use as the metric for representing and clustering structures from a database.As discussed in Ref. 26 , the choice of a SOAP kernel as the definition of an environment similarity provides at the same time great generality -it can be seamlessly applied to both molecules and solids -and elbowroom for fine-tuning -ranging from setting an appropriate cutoff distance to circumscribe an environment to the introduction of an alchemical similarity kernel that translates the notion that different chemical species can behave similarly with respect to the properties of interest.

B. Mapping the Structural Landscape of a Database
The dissimilarity between the N atomic configurations in a database contains a large amount of information on the structural relations between the database items.However, this information is not readily interpretable, as it is encoded as a N 2 matrix of numbers.Several methods are available to process dissimilarity information into a form that can be understood more intuitively.A first approach involves building a low-dimensional "map", where each point corresponds to one of the structures in the database and where the (Euclidean) distances between points represents the information on the pairwise dissimilarity matrix.
Several methods have been proposed over the years to solve this dimensionality reduction problem, starting from principal component analysis 42 and the equivalent linear multidimensional scaling 43 , and proceeding to non-linear generalizations of the idea, such as ISOMAP 44 , diffusion maps 45 , kernel PCA 46 .In this manuscript, we will use sketchmap [22][23][24] , a method in which one iteratively optimizes the objective function that measures the mismatch of the dissimilarity between atomic configurations D(X i , X j ) with the dissimilarity (typically just the Euclidean distance) between the corresponding low-dimensional projections {x i }.The procedure is very similar to multidimensional scaling, except for the appearance of the transformations F and f , which are non-linear sigmoid functions of the form: The non-linear transformation focuses the optimization of ?? on the most significant distances (typically those of the order of σ ), and disregards local distortions (e.g.induced by thermal fluctuations or by incomplete convergence of a geometry optimization) and the relation between completely unrelated portions of configuration landscape.The maps that we report in this work will be labeled synthetically using the notation σ -A B-a b, where A and B denote the exponents used for the high-dimensional function F, a and b denote the exponents for the low-dimensional function f , and σ the threshold for the switching function.The choice of these parameters of the sigmoid functions are discussed in detail elsewhere 24 .In practice A, B, a and b have relatively small effect on the projection, and can be optimized and kept fixed for systems belonging to the same family.Since the structures we consider here are minimum-energy configurations, and there are no thermal fluctuations that should be filtered out, we set A = a = 1 (so that at short range the algorithm will still try to represent distances faithfully) and set the long-range exponents to B = b = 4.The parameter σ is the one to which sketchmap is most sensitive, and needs to be tuned for each system separately.To automate the process of building sketchmaps of large amount of subsets of the database, we have used a simple heuristic procedure for determining the value of σ automatically.Following the prescriptions in Ref. 24 , we first compute the histogram of distances in the dissimilarity matrix of each molecular set, and detect the dissimilarity value (D max ) corresponding to the peak value of the histogram.We then set the value of σ to 0.8D max .

C. Hierarchical Clustering Representation
As we will demonstrate below, sketchmap provides a remarkably informative two-dimensional representation of structures in a data set, making it possible to identify groups of similar configurations, outliers, as well as to investigate structure-property relations.An alternative approach to navigate a set of structures based on the dissimilarity matrix is to use clustering algorithms, that identify groups of objects having similar properties to hint at the presence of recurring motifs underlying the behavior of the system.
Clustering models based on connectivity information such as hierarchical (or agglomerative) clustering 49 are particularly suited for this purpose and we will focus only on this type of clustering in this paper.Starting from each configurations as its own cluster, the hierarchical clustering algorithms iteratively aggregate clusters together based on some assessment of their similarity.Similarity between two individual structures can be obviously measured by their distance D(A, B).The distance between two clusters, however, can be defined in many different ways.In our study, we will use in particular the RMS dissimilarity between the pair of members of the two clusters.The linkage distance ∆ between two clusters X = {X i } and Y = {Y i } is then defined as: where N X and N Y are the total number of configurations within each cluster.D(X,Y ) is the dissimilarity between the two configurations, as computed e.g. from the REMatch-SOAP kernel.The complexity of this type of clustering (O(N 2 log(N))) is relatively cheaper than dimensionality reduction algorithms like sketchmap (O(N 3 )).Both procedures can be greatly accelerated by selecting a subset of the configurations (e.g. by farthest point sampling, with the possibility of weighting based on density information 24 ) which is then subject to either dimensionality reduction or hierarchical clustering.
The results of a hierarchical clustering procedure can be represented in a "dendrogram" plot, that conveys visually the sequence of agglomerative clustering operations and the linkage distance at each step.The x-axis represents the index of the structures, sorted in a way so that at each stage the clusters being joint are adjacent.Each merge operation is represented by a line joining the two underlying clusters, with the y position of the line representing the linkage distance for that pair, as defined by ??.In this kind of representation, at the bottom of the dendrogram, each structure can be thought of as an individual cluster containing only one item.Clusters are then merged iteratively, selecting at each step the pair of clusters that are closest to each other.This operation is repeated until all the clusters collapse into one single group that encompasses all the structures in the database, thus completing the dendrogram.To avoid overcrowding the bottom of the plot, one can hide the part that corresponds to very small linkage distances, while still graphically visualising the size of the clusters by drawing bars that encompass the associated structures.Since the "leaves" of this dendrogram correspond to individual configurations, it is possible to complement the dendrogram with color-coded bar plots that represent the value of different properties of each structure, thereby giving a clear picture of the relation between structural clustering and the different properties.
In order to understand the basic motifs of a particular cluster X, it is very useful to select one of its structures that is as representative as possible of the entire subset.In case where stability estimates are available, such structure may be the lowest-energy structure in the cluster.For a definition that is based purely on conformational or configurational information, the most representative structure RS (X) could be defined, as the item having the minimum mean square dissimilarity with respect to all other members of X, i.e.

RS (X) = argmin
Representative structures can be defined at each level of the hierarchy, and can therefore be very useful in navigating the database, and understanding what are its most crucial structural features.The spread of the cluster around RS (X), can be used to quantify the range of structural landscape that is covered by the cluster.
Another important aspect of database analysis is 'outlier detection' [55][56][57][58][59][60] .An "outlier" configuration is defined as a configuration which is different from most of the configurations in the database.Outlier configurations are very important as they are likely to have unique structural motif in the whole database and are thus interesting for structure prediction applications.They also could represent chemical changes or indicate inconsistent configurations which are likely to be "errors" in the database.
In the following sections, we will present examples of how these different analyses can be applied to different subsets of structures taken from a database of amino acid and dipeptide conformers.

II. ANALYSIS OF A DATABASE
This work is based on a first-principles derived structure/energy data set with conformers of twenty proteinogenic amino acids and dipeptides, as well as their interactions with a series of divalent cations 27 (Ca 2+ , Ba 2+ , Sr 2+ , Cd 2+ , Pb 2+ , Hg 2+ ).The potential-energy surfaces (PES) of 280 systems were explored in a wide relative energy range of up to 4 eV (390 kJ/mol), summing up to an overall of 45,892 stationary points on the respective potential-energy surfaces 15 .The underlying energetics were calculated by applying densityfunctional theory (DFT) in the generalized gradient approximation corrected for long-range van der Waals interactions [61][62][63].A number of theory-theory and theory-experiment comparisons have shown the applicability of the method to amino acid and peptide systems 15,[64][65][66][67][68][69] .Such dominantly manually curated data is a great starting point for studying materials and molecules across chemical space 70 , yet automated techniques are needed to extract unbiased and hypothesis-free trends and to check consistency of the underlying data.
In this study we focus on the amino acid lysine (in short Lys) and investigate basic structural motifs of three forms, see Fig. 1.Furthermore, the machine learning techniques introduced in this work are used to detect the impact of perturbations (here Ca 2+ cations) on the structural properties of the unperturbed systems.Finally, we demonstrate how the approach can also be applied to discover inconsistencies and outliers in the database.Hierarchical classifications and sketchmap projections for all the proteinogenic amino acids in the database are given in the Supporting Information.

Lysine Dipeptide
We take as our first example a subset of the database containing 2080 conformers of lysine dipeptide.As discussed in the previous section, we start by constructing the (dis)similarity matrix using the SOAP-REMatch kernel.In ?? the dendrogram plot as well as sketchmaps have been shown along with five properties, energy and four dihedral angles, using the same color scales in both the sketchmap and dendrogram representations.In the sketchmap each circular 'disk' represents a conformer.Whereas in the case of the dendrogram plot, structures are represented by vertical lines at the bottom of the plot.The strong correlation between energy and conformational parameters on one side, and clustering and position on the map on the other, testifies how the the REMatch-SOAP kernel induces a meaningful classification of the structures in this dataset.
While both clustering and sketchmap show clearly that the dataset is composed of groups of structurally-related conformers, the agnostic nature of the underlying metric does not disclose immediately the structural features that most transparently differentiate between different clusters.Comparing the representative structures from the main clusters allowed us to quickly identify candidate structural motifs that could be used to rationalize the layout of the conformational landscape.By colorcoding the dendrogram and the sketchmaps according to these indicators one can readily highlight the key correlations.When considering existing literature on the stability of oligopeptides, the two structural parameters that are most often considered as the key coordinates to navigate the conformational landscape are the Ramachandran dihedral angles φ and ψ, that determine the structure of the backbone around the side chain bearing Cα atom 71 under the assumption of peptide bonds being solely in trans conformation.While fine-grained clusters are clearly homogeneous with respect to the φ and ψ angles, it is clear that for the present systems they are not those that determine the clear-cut branching at the top of the dendrogram.An analysis of the representative structures for the two main clusters (a) and (b) shows that the two molecules differ by the isomerization of the N-terminal peptide bond.Further splitting of these two clusters, i.e.(a) into clusters (d) and (e), and (b) into (f) and (g), depends on the isomerization of the C-terminal peptide bond.We can confirm this attribution of the main features of the dataset by color-coding the map and the dendrogram follow-ing the dihedral angles ω 1 and ω 2 .The four main clusters are largely homogeneous with respect to peptide bond isomerization, and are then further subdivided based on φ and ψ.This observation deserves some further comment.Peptide bonds in naturally-occurring proteins are believed to almost exclusively exist in trans conformation with the exception of prolyl peptide bonds where a smaller energy difference to trans increases the chance for cis onformers 72,73 .This view is supported by the analysis of protein structures deposited in the protein databases where cis conformations are found for about 5% of the prolyl peptide bonds, but less than 0.1% for the others 74 .X-ray crystallographic structure represent however merely frozen snapshots of structural dynamics.The ab initio structure search protocol, instead, does consider the peptide bond torsions as variable and intentionally allowed simulations to overcome the isomerization barrier.Consequently, the dataset contains representatives of all four combinations of cis and trans conformers.Since these transitions are strongly bimodal, and reflect in significant changes of the favorable side chain conformations, they constitute the most significant feature to classify the conformers.As expected, the most stable conformers are largely in a trans-trans conformation.However, the large parts of conformational space of that is occupied by conformers with 1 or 2 cis peptide bonds suggests that cis isomers might play a role in the dynamics of peptides and proteins.If the analysis had been performed solely under the assumption of the importance of the Ramachandran φ and ψ dihedrals, it would have missed one of the main features of the structural landscape that is critical to characterize the relation between structure and energetics.One could then proceed further with the analysis, focusing for instance on small clusters containing low-energy structures such as that represented by the conformer (c).All the structure in this group are trans-trans isomers, that in addition have φ ≈ −90 degrees and ψ ≈ 90 degrees, allowing for the formation of a H-bond between the side chain N 3 and H 1 , and a favorable arrangement of the N 2 donating a H-bond to the carbonyl O 1 as shown in ??.Having access to the combined information on energetics, and on the grouping of structures with similar geometry makes it easier to rationalize the energy ordering of the structures, without having to separately juxtapose all of the low-lying conformers but focusing on a few representative structures.

Protonated Lysine Dipeptide
As the second example we considered a dataset containing 897 conformers of gas phase protonated lysine dipeptide.We follow the same steps as described in the previous example in order to find the most basic structural motifs for this system.?? shows the dendrogram, the sketchmap and a few color coded properties for this system to show their correlation with the classification.The most prominent feature for this molecule, which is evident in both the dendrogram and the sketch maps, is the presence of a group of outliers, that are clearly separated from the bulk of the conformers.Inspection of the cluster centroid (g) clarifies the structural basis of this separation.Whereas in most of the structures the excess charge lies on the lysine side chain as a NH + 3 group, conformers in this cluster experienced a proton transfer event, with the excess proton attached to one of the carbonyl oxygen O 1 , stabilized by H-bonding to N 2 .This is a result of the database generation where ab initio replica-exchange molecular dynamics including high T trajectories where used for structure sampling during which protons can eventually transfer.
Moving on to the main cluster of structures, we can see that similar to our previous example of the neutral dipeptide and again due to the unbiased sampling protocol and the high en- bonds and to less clear-cut grouping based on peptide-bond isomerism than for the neutral dipeptide.This is apparent for instance from the presence of subclusters such as that represented by conformer (f), in which the bent side chain leads to the formation of two H-bonds between NH + 3 group and the carbonyl oxygens.H-bonds also dominate the partitioning of cluster (b), that is split into two groups -one of which is still best represented by the same conformer, and one that is epitomised by (c).Once again, inspection of these structural representatives reveals the organising principle behind the classification: (c)-like struc-tures have an extended side chain, and are dominated by interactions among the peptide bond moieties, whereas (b)-like structures have a well-formed H-bond between the side chain and one of the two backbone O atoms.This structural pattern can be emphasized by color-coding conformers based on the parameter cates bending of the side chain and the formation of a H-bond between O and N.As it is clear from the sketchmap representation, there is a very strong correlation between the bending of the charged side chain and the energy of a conformer, with all of the structures within 0.5 eV of the ground state feature this sidechain to backbone H-bonds.
A scenario with such a clear-cut partitioning is less likely to happen with oligopeptides in a polar solvent like water, where intramolecular H-bonds that introduce strain compete with Hbonds with the surrounding water molecules, without the need of bending the structure.The analysis techniques we introduce in this work would be ideally suited to rationalize the changes in the (free) energetics of biological molecules when moving from the gas phase to (micro)solvated environments or to organic/inorganic interfaces.

Uncapped Lysine
Our third example is a dataset containing 733 conformers of the un-capped lysine molecule in the gas phase.We follow the same steps as described in the previous examples to construct the dendrogram shown in ??.The map has a simple structure, with few well-separated groups.Being a smaller molecule with fewer degrees of freedom, the Ramachandran angles are not defined.Still, the dihedral angles in the vicinity of the C α atom display local structural correlation but once again they are not the main organizing factor that can rationalize the clustering.By juxtaposing representative conformers from the main clusters we could identify a better order parameter, that correlates strongly with H-bond patterns within the molecule.Namely, the distance (D H ) between the H atom in the OH group of the carboxy function and the N atom in the backbone (N 1 ) discriminates well between structures based on H-bonding patterns 70 of type I between N 1 H→O 2 (e.g.conformer (b)) and of type II with a H-bond O 1 H→N 1 (e.g.conformer (a)).It can be seen from both the dendrogram and the sketchmaps that one could identify several further subgroups based on particular values of D H , representing specific orientations.Conformers (c) and (d) represent small groups of conformers having specific relative orientation between the OH and NH 2 groups.Conformer (e) is representative of a small outlier group with a well-defined bend of the side chain, leading to the formation of a further H-bond between the N 1 atom in the amino acid moiety and N 2 , in the side chain.The lysine side chain is very flexible, and the distance between N and C α only plays a role in defining the fine-grained structure of the dataset, but is minimally correlated with the most prominent features.
While it appears that even in this case we could identify the basic structural motifs that characterize the conformational landscape of this system, the correlation with energy is very poor.There are several instances in both the dendrogram and the sketchmap where two conformers that are detected as struc-turally very similar while they display very different stability.Understanding whether this inconsistency signals a problem with our analysis brings us to the topic of outlier detection and consistency checks, that we will discuss in details in Section II C.

B. Understanding the Impact of Perturbations on conformational Space
Having elucidated the essential structural motifs that underlie the organization of a set of molecular conformers, one could also wonder how changes in the thermodynamic conditions, or another external perturbation such as solvation, the addition or subtraction of an electron 75 or that of an atom [76][77][78] , modify the conformations of the molecule and their stability.The database 15,27 that we are using here as an example contains, in addition to bare oligopeptides, sets of locally-stable conformers in the presence of cations of six different species, namely Ca 2+ , Ba 2+ , Sr 2+ , Cd 2+ , Pb 2+ and Hg 2+ .We take the example of Ca 2+ to describe how one can characterize its impact on the conformational space of the three molecular systems that we have discussed in our previous examples.We start by calculating the dissimilarity of all the conformers containing cations with their pure counterpart.In order to make the comparison on the same footings, we did not include the location of the cation in defining the SOAP kernels, so that the presence of Ca 2+ only enters by distorting molecular geometries and/or altering their relative stability.Using this information, we then projected the cation-containing dataset on the top of the sketchmap of structures for the bare molecule.This is done using sketchmap out-of-sample embedding, and we refer our reader to see the relevant literature [22][23][24] for more details about the method.In ?? we show the resulting projection, colored according to the stability of the conformers, on top of the sketchmap of the pure molecule shown in grey color as a reference.A close proximity of projected conformers with a pure conformer signifies their structural similarity.Segregation of the projected conformers with the cation in some area of the reference sketchmap, represents the structural bias introduced by the strong electrostatic interaction with Ca 2+ .
In the case of neutral lysine dipeptide (??-a), the presence of the Ca 2+ ion induces overall relatively small distortions of the stable conformers, that get pushed towards the outer region of the map but are still clearly related to the locally stable structures for the bare molecule.Energies are dramatically changed, with the most stable cluster in the original map begin completely absent in the presence of the cation.These observations highlight the importance of sampling high-energy conformers during high-throughput structure searches, since the relative stability can be modulated strongly by external perturbations.In particular, cis conformers become energetically more competitive and are topologically closer to the global minima.In the case of protonated lysine dipeptide (??-b), the same analysis shows an even clearer pattern.All the conformers with Ca 2+ ions are projected in the lower part of the sketchmap, that correspond to conformers with an extended side chain (see ??).The Ca 2+ ion preferably binds to the peptide O atoms, and the electrostatic repulsion with the protonated lysine residue strongly favors extended conformers, contrary to what observed in the case of the bare molecule.b) are agglomerated according to the orientation of H 1 and the oxygen atom it is bonded to with respect to N 1 which is well described by the distance D H .The sub-cluster (e) is composed of 'outlier' structures showing an H-bond between N 2 and an hydrogen of N 1 resulting in a folded side chain structural motif.Finally, the outlier cluster (f) contains a H-bond between the carboxy H and the side-chain NH 2 , that can be seen as a precursor to the zwitterionic form.
Finally, one sees that for molecular lysine (??-c) the addition of Ca 2+ leads to conformers with very different structural motifs from those seen with the bare molecule, which is apparent in the sketchmap projection being concentrated far away from the unperturbed conformers.In fact, inspection of the structures shows that in most cases Ca 2+ triggers the transition to the zwitterionic form, with the cation coupled to the carboxylate group, and the side chain protonated NH 3 + extending as far as possible away from it.In analogy with what observed for Lennard-Jones clusters 24 and solvated polypeptide segments 79 , sketchmaps proved to be a powerful tool to analyze the response of the system to external perturbations and changes in the boundary conditions, and -in this specific example -do draw connections between different subsets of a high-throughput molecular database.

C. Identifying Outliers and Checking for Consistency
The tools we introduced in this work are useful to address other important issues in data-driven science, such as outlier detection and consistency checks.We have already discussed the importance of detecting groups of "outlier" structures that are very different from the bulk of the dataset.These unusual items often signal the occurrence of unexpected effects that go beyond the original goal of the database construction effort.In the case of protonated lysine dipeptide, looking for outliers allowed us to reveal the presence of conformers with different chemical connectivity, or of strong H-bonds between the backbone and the charged side chain.Similar observations can also be made in the case of the bare lysine molecule (??).Also in this case, one can observe a branch at the topmost level of the dendrogram, containing only two conformers.These are the only two cases where a H-bond is formed between the N of the side chain and the H atom of the OH group in the backbone.In the sketchmap, these two conformers are projected on the top, clearly isolated from rest of the groups, and bear the most resemblance to the zwitterionic conformers that are stabilized in the presence of a divalent cation.Obviously, the definition of a group of "outliers" can be more nuanced, and refer to small groups of structures appearing at deeper levels in the hierarchy.Overall, the possibility of clustering together the structures in a large dataset and inspect a few representative conformers, rather than hundreds or thousands, greatly facilitates the task of identifying trends and spotting interesting or unexpected structures.
Outliers can signal interesting or important trends, but can also be a red flag for the presence of errors.The importance of database integrity has long been recognized by computer sci-FIG.6.This figure compares the homogeneity of clusters from the protonated lysine dipeptide (see a) and the bare lysine uncapped (see b) with respect to properties of their elements.The homogeneity of a cluster is probed using the standard deviation with respect to the distance between each cluster elements, σ D , and the conformational energy, σ E .The outliers of uncapped lysine (b) were manually highlighted in orange.
entists [80][81][82][83] , and several tools are available to monitor and correct inconsistencies from the technical point of view, in terms of reliability of storing and retrieving data [55][56][57][58][59][60] .The issue is also crucial when it comes to the maintenance of automaticallygenerated databases, and to repositories that store data of heterogeneous provenance [1][2][3][4][5][6][7] .In these cases, problems have generally little to do with the integrity of the storage, but rather with the consistency of the simulation details of different sets of calculations.Inconsistencies should here manifest themselves in the presence of structures that are geometrically very similar, but are associated to very different values of particular properties.
For example the lysine molecule dataset shows signs of this kind of issues, with energies that vary wildly within clusters that are very homogeneous in structure.This problem can be seen from the maps, i.e. when comparing the energy-colored sketchmap in ?? to the respective maps for the other systems.However, a more robust and easy-to-automate approach to identify structure/property inconsistencies starts from the hierarchical clusters, and compares the structural variability within each cluster σ D (??) with the variance of a given property, in this case energy, σ E .Looking, for example, at a glassy energy landscape 84 , one can observe configurations that are very different from a structural point of view, but have similar energy, giving rise to clusters with large σ D and small σ E .The data points in ?? each represent individual clusters of lysine dipeptide and uncapped lysine, respectively, and illustrate their variation in energy and structure.In the case of lysine dipeptide (??a) one sees a clear correlation between the structural and energetical variation of the clusters.The two quantities σ D and σ E are not necessarily strongly correlated, but in general clusters that contain very similar structures also have a low spread in energy.For uncapped lysine (??b), however, one can identify a group of points (which we manually highlighted in orange for clarity) that has a distinctively different behavior, with σ E converging to a constant value other than zero as σ D decreases.This kind of feature indicates that the metric that is used to classify structures cannot detect one specific effect that has a dramatic impact on energetics, signaling either a failure of the metric or, as in this case, an inconsistency in the generated data.Further investigation of the lysine molecule dataset revealed that a subset of structures that had been generated at a lower level of theory in the initial stages of the structure-search procedure made their way by mistake into the final dataset.Using this measure of cluster homogeneity on all systems of the amino acid database (see Supporting Information) revealed similar problems also for other molecules, for example Cys, Glu, and Arg.Thanks to this analysis we will be able to identify and rectify mistakes in all the affected datasets and subsequently update the on-line repository 27 .

III. CONCLUSION
The increasing use of high-throughput computational screening of materials and molecules, and the compilations of more-orless curated databases of the resulting structures and properties, is making more and more urgent to adapt "big data" techniques to the problems that are specific to this field.In this work we have demonstrated how a toolbox of algorithms ranging from hierarchical clustering to non-linear dimensionality reduction can be used to navigate molecular databases, taking as a paradigmatic example some subsets of a database of oligopeptide structures in the gas phase.The software that was used to compute similarity data between molecules, as well as to generate dendrograms and sketch-maps, are open-source and available for download 85,86 .
We find that the use of REMatch-SOAP, a general and unbiased metric to compare different structures based on the combination of pair-wise similarity between molecular environments, makes these techniques particularly insightful.Rather than simply reflecting preconceived notions of what would be the key structural parameters to differentiate molecular conformers, this metric reveals for instance the importance of peptide bond isomerization in describing the high-energy portion of conformational space of oligopeptides, the possibility of changes in chem-ical connectivity in the course of the ab initio structural search, and the interplay between hydrogen-bonding, backbone dihedrals an electrostatic interactions.Sketchmaps and hierarchical clustering proved to be complementary tools, with representative structures from the main clusters providing an easy way to compare visually different groups of conformers, and the lowdimensional map providing a quick, intuitive tool to verify hypotheses and visualize structure-property correlations.
Assumption-free first-principles molecular-structure search for data generation in combination with dimensionality reduction and clustering for data analysis provide a powerful tool box for studying structure formation trends.We could highlight the presence of large portions of configurational space that consist of cis isomers of the peptide bond.Albeit energetically unfavorable, these conformers may play an important role in the dynamics of polypeptides.By comparing isolated molecules with their complexes with Ca 2+ , we can also reveal how a strong electrostatic perturbation modifies the energetic landscape of a small molecule -be it by shifting the stability of different conformers, or triggering the formation of new structures that are absent in the absence of a cation.Furthermore, we also demonstrate the importance of automated analysis techniques in assessing the integrity and internal consistence of a database, by successfully identifying a subset of structures associated with ill-converged energetics.
All of the techniques we discussed should be readily extendable to heterogeneous databases of molecules and solids, where we expect that the possibility of defining an alchemical kernel within the REMatch-SOAP metric will make it possible to tune the relative weight of composition and structure in determining the notion of similarity.By simplifying the analysis and the interpretation of computational datasets containing thousands or millions of hypothetical compounds, these methods will be crucial to unleash the full potential of computational materials design.

A. Availability of data and materials
The oligopeptide database used as an example in this paper is already available online 27 .All the data required to generate the figures in this paper and same analysis data for other oligopeptides in the database are provided in the supplementary information.Similarity matrix data is not included in SI due to size limitations, but is freely available from from our group repository 85 .

B. Competing interests
We confirm that none of the authors have any competing interests in the manuscript.

FIG. 2 .
FIG. 2. Representation of the similarity matrix corresponding to the lysine dipeptide dataset using the agglomerative clustering algorithm (top) and the sketchmap algorithm (bottom, projection parameters shown following the scheme σ -A B-a b).A few representative structures (see ??) of interesting clusters are shown (right) and their corresponding position on the sketchmaps and dendrogram representation is highlighted.The five sketchmaps are colored according to the conformational energy and the backbone dihedral angles φ, ψ, ω 1 and ω 2 .The dendrogram shows the clustering hierarchy of the structures of the dataset.Each structure is vertically aligned with its properties shown using color bars below the dendrogram.The dendrogram is cut at a linkage distance of 0.1 since structural properties are very similar below this threshold, and the clusters that are merged at this level are shown as thick gray bars separated by light-gray lines.Clusters composed of only one structure are drawn as a black line reaching the bottom of the dendrogram.The main structural motifs of this set of structures are governed by the peptide bond dihedral angles ω 1 and ω 2 .The two main clusters (a) and (b) are showing a global correlation with the angle ω 2 while the angle ω 1 splits them into two well correlated sub-clusters (d), (e) and (f), (g) respectively.The cluster (c) is highlighted as an example containing 'outlier' structures of low conformational energy.
FIG. 3. Representation of the similarity matrix corresponding to the protonated lysine dipeptide dataset using the agglomerative clustering algorithm (top) and the sketchmap algorithm (bottom, projection parameters shown following the scheme σ -A B-a b).A few representative structures (see ??) of interesting clusters are shown (right) and their corresponding position on the sketchmaps and dendrogram representation is highlighted.The six sketchmaps are colored according to the conformational energy, the minimal distance between O 1 or O 2 with N 3 called D ON , and the backbone dihedral angles φ, ψ, ω 1 and ω 2 .The dendrogram shows the clustering hierarchy of the structures of the dataset.Each structure is vertically aligned with its properties shown using color bars below the dendrogram.The dendrogram is cut at a linkage distance of 0.1 since structural properties are very similar below this threshold, and the clusters that are merged at this level are shown as thick gray bars separated by light-gray lines.Clusters composed of only one structure are drawn as a black line reaching the bottom of the dendrogram.The main structural motifs of this set of structures are governed by the dihedral angles ω 1 and ω 2 and the distance D ON .The two main clusters (a) and (b) are showing a global correlation with the angle ω 2 while the angle ω 1 splits them into well correlated sub-clusters (e.g.sub-clusters (d) and (e)).The other important sub-clustering parameter is the distance D ON , e.g.sub-clusters (c) and (b), which also correlates well with the separation between low and high conformational energy shown on the sketchmaps.Two sub-clusters are particular: (g) is a clear 'outlier' due to a chemical change and (f) features a H-bonding pattern with the side chain NH + 3 pointing to both carboxy groups that sets this cluster apart from all others.

FIG. 4 .
FIG. 4. Representation of the similarity matrix corresponding to the lysine uncapped dataset using the agglomerative clustering algorithm (top) and the sketchmap algorithm (bottom, projection parameters shown following the scheme σ -A B-a b).A few representative structures (see ??) of interesting clusters are shown (right) and their corresponding position on the sketchmaps and dendrogram representation is highlighted.The five sketchmaps are colored according to the conformational energy, the distance between N 1 and the hydrogen in the carboxilic group H 1 (labelled D H ), the distance between N 2 and C α (labelled D CN ), and the dihedral angles α 1 and α 2 which are respectively computed with the following atoms (N 1 ,C α ,C 2 ,C 3 ) and (C 1 ,C α ,C 2 ,C 3 ).The dendrogram shows the clustering hierarchy of the structures of the dataset.Each structure is vertically aligned with its properties shown using color bars below the dendrogram.The dendrogram is cut at a linkage distance of 0.1 since structural properties are very similar below this threshold, and the clusters that are merged at this level are shown as thick gray bars separated by light-gray lines.Clusters composed of only one structure are drawn as a black line reaching the bottom of the dendrogram.The main structural motifs of the database are governed by the distance D H .The two main clusters (a) and (b) are agglomerated according to the orientation of H 1 and the oxygen atom it is bonded to with respect to N 1 which is well described by the distance D H .The sub-cluster (e) is composed of 'outlier' structures showing an H-bond between N 2 and an hydrogen of N 1 resulting in a folded side chain structural motif.Finally, the outlier cluster (f) contains a H-bond between the carboxy H and the side-chain NH 2 , that can be seen as a precursor to the zwitterionic form.

FIG. 5 .
FIG.5.The out-of-sample embedding of conformers with Ca 2+ ion on the sketchmap of their pure counterpart, for the three systems we discussed in above: lysine dipeptide (a), protonated lysine dipeptide (b) and molecular lysine (c) systems.The projected conformers are colored with their energy where as the sketchmap on which they are projected are kept all in grey color.The location of the projected conformers allows us to understand how the conformational space of the pure conformers are affected due to presence of the Ca 2+ ion.
. and M.C. would like to acknowledge support from the NCCR MARVEL.M.C., T.I. and C.B. would like to acknowledge funding from the MPG-EPFL center for molecular nanoscience.D. Authors' contributions M.C. and S.D. designed the calculations and developed the methods.F.M. and S.D. performed calculations and analysis, and prepared the materials for the manuscript.C.B. and T.I. provided insights on the implications of the analysis of the oligopeptide database.All the authors contributed to the writing of the manuscript.