QSAR DataBank - an approach for the digital organization and archiving of QSAR model information

Background: Research efforts in the field of descriptive and predictive Quantitative Structure-Activity Relationships (QSAR) or Quantitative Structure–Property Relationships (QSPR) produce around one thousand scientific publications annually. The materials and results are communicated mainly through printed media. Printed media in their present form have obvious limitations when it comes to effectively representing mathematical models, including complex and non-linear ones, and large bodies of associated numerical chemical data. They do not support secondary information extraction or reuse efforts, while in silico studies pose additional requirements for the accessibility, transparency and reproducibility of the research. This gap can and should be bridged by introducing domain-specific digital data exchange standards and tools. The current publication presents a formal specification of the quantitative structure-activity relationship data organization and archival format called the QSAR DataBank (QsarDB for short, or QDB for shorter still).
Results: The article describes the QsarDB data schema, which formalizes QSAR concepts (objects and the relationships between them), and the QsarDB data format, which formalizes their presentation for computer systems. The utility and benefits of QsarDB have been thoroughly tested by solving everyday QSAR and predictive modeling problems, with examples in the field of predictive toxicology, and QsarDB can be applied to a wide variety of other endpoints. The work is accompanied by an open source reference implementation and tools.
Conclusions: The proposed open data, open source, and open standards design is open to public and proprietary extensions on many levels. Selected use cases exemplify the benefits of the proposed QsarDB data format. General ideas for future development are discussed.


Example QDB archive
The following section gives an overview of authoring an example QDB archive. The software used in this tutorial is available as executables [1] and source code [2].

2.1. Dataset conversion
The CSV database file was manually extracted from Randic et al. [3]. The data table (Table 5) contains six columns in total. The respective CSV file can be downloaded from GitHub [4].
The data table conversion application maps the first column to the Compound name attribute and the second column to the Property values cargo, and ignores the remaining four columns. It prompts the user for the Property identifier and name attributes. The conversion yields a QDB archive that contains a compound registry with 58 Compounds and a property registry with a single Property.
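The column mapping described above can be sketched in a few lines of Python. This is only an illustration and not the conversion application itself: the function name, the in-memory dictionary layout, the sequential numeric Compound identifiers and the absence of a header row are all assumptions, and the real tool writes a QDB archive rather than Python objects.

```python
import csv
import io


def convert_table(csv_text, property_id, property_name):
    """Map column 1 to Compound names and column 2 to Property values.

    All further columns are ignored. The returned dictionary is a
    hypothetical stand-in for the compound and property registries.
    Assumes the CSV text has no header row.
    """
    compounds = []
    values = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        # Assumption: Compounds get sequential numeric identifiers.
        compound_id = str(len(compounds) + 1)
        compounds.append({"id": compound_id, "name": row[0].strip()})
        values[compound_id] = float(row[1])
    return {
        "compounds": compounds,
        "property": {"id": property_id, "name": property_name, "values": values},
    }
```

Calling `convert_table(text, "bp", "Boiling point")` on a six-column table would keep only the first two columns of every row, which mirrors the behaviour described in the text.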
The QDB archive is opened in the curator application in order to verify that all Compound names are correctly interpreted by the MarvinBeans library. The visual inspection of the displayed molecular graphs succeeds once the shorthands "M" and "E" are expanded to "methyl" and "ethyl", respectively. Some of the Compounds feature stereoisomerism (optical isomerism). However, the original publication [3] does not contain any stereochemistry information. Therefore, contrary to what is recommended, the Compound InChI attribute is generated as a non-standard InChI by activating the InChI generation option "SUU". The idea is to inform QDB archive users that the QDB archive developers were well aware that they were working with incomplete data (and did not, for example, lose the stereochemistry information themselves).
$ java -cp curation-toolkit-1.0.0.jar org.qsardb.toolkit.curation.IUPACNameGenerator --dir P:/example/qdb
Compounds are given classification labels in order to make the formulation of meaningful subsets of data easier. Firstly, all chemical structures are classified as "primary", "secondary" or "tertiary" depending on the type of the alcohol. The data set appears to be fairly balanced, because there are 24 primary, 23 secondary and 11 tertiary alcohols.
Secondly, chemical structures that feature stereoisomerism (optical isomerism) are classified as "tetrahedral" or "tetrahedral-multi" depending on whether they feature a single tetrahedral stereocenter (i.e. an enantiomer) or two or more stereocenters (i.e. a diastereomer), respectively. There are 25 enantiomers and 2 diastereomers, all of unknown configuration.
Enantiomers are known to have identical physical and chemical properties in a symmetric (achiral) environment, so they can be used safely for QSAR/QSPR modeling. Diastereomers, however, are known to have differing properties. The original publication [3] does not provide any cues for resolving the configuration of stereocenters. Both Compounds "24" and "56" have four possible stereoisomers. It is not known whether the reported Property value corresponds to one particular stereoisomer or is the average over all four. These two Compounds were labelled as "tetrahedral-multi-SUU" and excluded from the later model development process.
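The stereochemistry labelling rule can be written down as a small function. This is a minimal sketch that assumes the number of tetrahedral stereocenters has already been determined by some cheminformatics toolkit; the function names are hypothetical, and the 2^n count is only an upper bound, because symmetric (meso) structures have fewer distinct stereoisomers.

```python
def classify_stereo(n_stereocenters):
    """Return the classification label for a given stereocenter count."""
    if n_stereocenters < 0:
        raise ValueError("stereocenter count cannot be negative")
    if n_stereocenters == 0:
        return None  # no stereoisomerism, no label
    if n_stereocenters == 1:
        return "tetrahedral"
    return "tetrahedral-multi"


def max_stereoisomers(n_stereocenters):
    """Upper bound on the number of stereoisomers (2 to the power n)."""
    return 2 ** n_stereocenters
```

For two stereocenters the bound is four, which matches the four stereoisomers mentioned for Compounds "24" and "56".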
Compound InChI attributes are used to generate Compound SMILES structure cargos (identifier "smiles"). The SMILES generation workflow is specified in the Ant build files "smiles.xml" and "common.xml" [6]. The user is responsible for updating the local copy of the "smiles.properties" configuration file to match the current environment. This operation completes the work with the compound registry.

Edit Property definition
The definition of Property "bp" is updated by editing the contents of the property registry. The Property endpoint attribute is specified according to the QMRF classification system as "1. Physicochemical effects 1.2. Boiling point". Obviously, this specification alone is too vague for practical applications. The Property description attribute clarifies that this Property represents a normal boiling point, which is the temperature at which the vapour pressure of the liquid equals 1 atm (101 325 Pa).
$ java -cp prediction-toolkit-1.0.0.jar org.qsardb.toolkit.prediction.PropertyRegistryManager --dir P:/example/qdb set-attribute --id bp --endpoint "1. Physicochemical effects 1.2. Boiling point"
The Property unit is stored as Property UCUM cargo. The content of the UCUM cargo is the plain text string "Celsius". Earlier versions of UCUM suggested the use of the shorthand "Cel", but the current version of the Java JSR-275 library does not recognize it.
$ java -cp prediction-toolkit-1.0.0.jar org.qsardb.toolkit.prediction.PropertyRegistryManager --dir P:/example/qdb attach-ucum --id bp --unit Celsius
The bibliographic reference of the original publication [3] is stored as Property BibTeX cargo. The base version of the BibTeX database was retrieved from Google Scholar (via the "Import into BibTeX" function). It is composed of a single BibTeX data entry, which is given a more concise identifier and extended with the "doi" field. Finally, the BibTeX database is re-formatted for better readability. The original publication [3] declares that the reported Property values have been collected from previous publications, but does not specify their bibliographic references. This limits the usefulness of the Property references cargo, which has to list one and the same BibTeX entry identifier "randic2004vcit" on every row.

Descriptor definition and calculation
The descriptor registry is populated with 274 whole-molecule Descriptor definitions included in the CDK library. The Descriptor identifier and name attributes and the BODO cargo are derived programmatically from the underlying CDK descriptor class. The handling of CDK descriptor classes that return array-valued results is notably different from those that return single-valued results. For example, the CDK descriptor class "org.openscience.cdk.qsar.ChiPathDescriptor" returns an array that contains 16 elements, whose symbolic names run first from "SP-0" to "SP-7", and then from "VP-0" to "VP-7". A new Descriptor definition is created for every array element. The identifier attribute is derived from the symbolic name by replacing potentially troublesome dot ('.') and hyphen ('-') characters with the underscore character ('_') and fixing irregularities in the capitalization of letters. The name attribute is derived from the simple name of the CDK descriptor class by replacing the suffix "Descriptor" with the symbolic name in square brackets.
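The derivation of the identifier and name attributes can be illustrated with plain string operations. The sketch below is an assumption-laden illustration rather than the toolkit's code: the function names are hypothetical, and the capitalization fix-up mentioned above is left out because the text does not specify its rules.

```python
def descriptor_identifier(symbolic_name):
    """Replace dot and hyphen characters with underscores."""
    return symbolic_name.replace(".", "_").replace("-", "_")


def descriptor_name(class_name, symbolic_name):
    """Replace the "Descriptor" suffix of the simple class name
    with the symbolic name in square brackets."""
    simple_name = class_name.rsplit(".", 1)[-1]  # drop the package prefix
    suffix = "Descriptor"
    if simple_name.endswith(suffix):
        simple_name = simple_name[: -len(suffix)]
    return f"{simple_name}[{symbolic_name}]"


print(descriptor_identifier("SP-1"))
# → SP_1
print(descriptor_name("org.openscience.cdk.qsar.ChiPathDescriptor", "SP-1"))
# → ChiPath[SP-1]
```

The identifier "SP_1" produced here is the one that appears later in the model formula "bp ~ SP_1".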
This descriptor was removed manually.

2.5. Model development
The model development aims to find the best one-parameter regression equation. It is known that the melting and boiling points of normal-chain aliphatic alcohols correlate very well with the length (i.e. the number of carbon atoms) of the alkyl chain. It is reasonable to expect that during the regression analysis the best performing Descriptor(s) would demonstrate considerably higher degrees of variability than that. All 121 low to medium variability Descriptors are purged from the descriptor registry.
$ java -cp prediction-toolkit-1.0.0.jar org.qsardb.toolkit.prediction.DescriptorRegistryManager --dir P:/example/qdb purge --categories 5
The model development takes place in the R environment using the "rQsarDB" package [7]. The model and prediction registries are created and populated with Container instances externally, because the "rQsarDB" package does not implement such functionality yet. Currently, Models and Predictions are assigned simple numeric identifiers. For more complex work it is advisable to switch to more expressive textual identifiers in order to improve the readability and robustness of R scripts. For example, an R formula object that is declared as "bp ~ SP_1" is much easier to grasp than one that is declared as "1 ~ 42". The R script for performing model training and validation is provided in Section 3.
The process starts with loading the 58 Compounds together with their Property and Descriptor values from the QDB archive into R's native "data.frame" data structure. As explained above, Compounds "24" and "56" have to be filtered out of the data set by the label "tetrahedral-multi-SUU". The partitioning of the remaining 56 Compounds between the training subset and the external validation subset (in an approximate ratio of 85/15) is performed using the special-purpose "caret" package [8].
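The filter-then-split step can be sketched as follows. Note the assumptions: this is a plain seeded random split written in Python, not the partitioning algorithm of the "caret" package used in the actual workflow; the record layout and function name are hypothetical; and rounding the training size up is a choice made here so that 56 Compounds give a 48/8 split.

```python
import math
import random


def filter_and_split(compounds, excluded_label, train_fraction=0.85, seed=0):
    """Drop labelled compounds, then split the rest at random."""
    kept = [c for c in compounds if excluded_label not in c["labels"]]
    shuffled = kept[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = math.ceil(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical stand-in for the 58 Compounds of the example archive.
compounds = [
    {"id": str(i), "labels": ["tetrahedral-multi-SUU"] if i in (24, 56) else []}
    for i in range(1, 59)
]
training, validation = filter_and_split(compounds, "tetrahedral-multi-SUU")
print(len(training), len(validation))
# → 48 8
```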
The training subset contains 48 Compounds. The process starts with variable selection. The highest correlation coefficient is obtained with Descriptor "SP_1" (R = 0.9708), closely followed by another Descriptor "XLogP" (R = 0.9703). The selected Descriptor "SP_1" is the first order atom connectivity index (aka simple chi index). It accounts both for the length and branching of the alkyl chain, and the position of the hydroxy group. Most importantly, it should be readily calculable for all sorts of new aliphatic alcohols.
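Variable selection by correlation coefficient can be sketched in pure Python as below. The function names and the toy numbers are hypothetical and are not values from the archive; ranking by the absolute value of the Pearson coefficient is also an assumption, since the text only speaks of the "highest correlation coefficient".

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def best_descriptor(descriptor_values, property_values):
    """Return the descriptor identifier with the largest |R|."""
    return max(
        descriptor_values,
        key=lambda name: abs(pearson_r(descriptor_values[name], property_values)),
    )


# Toy numbers for illustration only.
toy_descriptors = {"d1": [1.0, 2.0, 3.0, 4.0], "d2": [1.0, 3.0, 2.0, 4.0]}
toy_property = [2.0, 4.0, 6.0, 8.0]
print(best_descriptor(toy_descriptors, toy_property))
# → d1
```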
The mathematical relationship has the form "bp = 37.2784912416731 * SP_1 + 17.5525407795743" and it is stored in Model PMML cargo. The mathematical relationship is given in the original numeric precision (i.e. 15 significant digits). However, all Prediction values have been rounded to a single decimal place (before being stored in Prediction values cargos) for practical reasons.
The training subset is used for leave-one-out cross-validation (LOO-CV). The R environment has several built-in and extension packages that provide high-level LOO-CV estimates (e.g. raw or adjusted prediction errors). However, there are no packages that would provide low-level LOO-CV primitives (e.g. ensemble of n regression equations). The solution is to locally develop a LOO-CV function with the desired properties. Owing to the great expressiveness of the R programming language it can be achieved in a couple of lines of code.
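The idea of a low-level LOO-CV primitive that returns the whole ensemble of n regression equations can be sketched as follows. The original workflow implements this in R; the Python version below is only an illustration of the technique for a one-parameter least-squares model, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_cv(xs, ys):
    """Leave-one-out cross-validation for a one-parameter model.

    Returns the n fitted (slope, intercept) pairs together with the
    prediction made for each left-out sample. Expects Python lists.
    """
    equations = []
    predictions = []
    for i in range(len(xs)):
        # Refit the model without sample i, then predict sample i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        equations.append((slope, intercept))
        predictions.append(slope * xs[i] + intercept)
    return equations, predictions
```

Keeping the individual equations, rather than only an aggregated error estimate, makes it possible to store each left-out prediction as an ordinary Prediction value.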
The external validation subset contains 8 Compounds. The prediction goes without problems and delivers more than satisfactory results. The type attribute of Predictions "2" and "3" is "validation". The exact subtype can be determined by analyzing their intersection with the training subset. Prediction "2" is fully contained in Prediction "1", which means that its subtype is internal validation. Prediction "3" is fully disjoint from Prediction "1", which means that its subtype is external validation.
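The subtype rule amounts to two set comparisons. In the minimal sketch below the function name and the "mixed" fallback for partially overlapping sets are assumptions that go beyond the text, which only covers the fully contained and fully disjoint cases.

```python
def validation_subtype(prediction_ids, training_ids):
    """Classify a validation Prediction by its overlap with the training set."""
    prediction_ids = set(prediction_ids)
    training_ids = set(training_ids)
    if prediction_ids <= training_ids:
        return "internal"  # fully contained in the training subset
    if prediction_ids.isdisjoint(training_ids):
        return "external"  # shares no Compounds with the training subset
    return "mixed"  # hypothetical label for partial overlap


print(validation_subtype({"1", "2"}, {"1", "2", "3"}))
# → internal
print(validation_subtype({"7", "8"}, {"1", "2", "3"}))
# → external
```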
The Model can be used for making predictions on new aliphatic alcohols. For example, the normal boiling point of "heptan-2-ol" is predicted to be 158.1 degrees Celsius.
The example QDB archive is published alongside QsarDB software [9].
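The heptan-2-ol prediction quoted above can be checked by hand. The sketch below assumes that Descriptor "SP_1" equals the textbook first-order simple chi (connectivity) index, i.e. the sum of 1/sqrt(d_i * d_j) over all bonds of the hydrogen-suppressed graph, where d is the number of heavy-atom neighbours; whether the CDK implementation matches this definition in every detail is an assumption. The atom labels and function names are hypothetical, while the regression coefficients are those of the Model.

```python
import math


def simple_chi_1(bonds):
    """First-order simple chi index of a hydrogen-suppressed graph."""
    degree = {}
    for a, b in bonds:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sum(1.0 / math.sqrt(degree[a] * degree[b]) for a, b in bonds)


def predict_bp(sp_1):
    """Apply the one-parameter regression equation of the Model."""
    return 37.2784912416731 * sp_1 + 17.5525407795743


# Heavy-atom bonds of heptan-2-ol: a seven-carbon chain with OH on C2.
HEPTAN_2_OL = [
    ("C1", "C2"), ("C2", "O"), ("C2", "C3"), ("C3", "C4"),
    ("C4", "C5"), ("C5", "C6"), ("C6", "C7"),
]
sp_1 = simple_chi_1(HEPTAN_2_OL)
print(round(predict_bp(sp_1), 1))
# → 158.1
```

Under the stated assumption the index evaluates to about 3.770, and the rounded prediction agrees with the value reported in the text.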