Template-based combinatorial enumeration of virtual compound libraries for lipids

A variety of software packages are available for the combinatorial enumeration of virtual libraries for small molecules, starting from specifications of core scaffolds with attachment points and lists of R-groups as SMILES or SD files. Although SD files include atomic coordinates for core scaffolds and R-groups, it is not possible to control the 2-dimensional (2D) layout of the enumerated structures generated for virtual compound libraries because different packages generate different 2D representations for the same structure. We have developed a software package called LipidMapsTools for the template-based combinatorial enumeration of virtual compound libraries for lipids. Virtual libraries are enumerated for the specified lipid abbreviations using matching lists of pre-defined templates and chain abbreviations, instead of core scaffolds and lists of R-groups provided by the user. 2D structures of the enumerated lipids are drawn in a specific and consistent fashion adhering to the framework for representing lipid structures proposed by the LIPID MAPS consortium. LipidMapsTools is lightweight, relatively fast and has no external dependencies. It is an open source package and freely available under the terms of the modified BSD license.


Background
The combinatorial virtual library enumeration methodology is routinely used during the early stages of the small molecule drug discovery cycle. Virtual compound libraries containing a large number of molecules are generated and ranked based on various calculated/predicted characteristics such as physicochemical properties, activity, specificity, solubility, etc. A set of top ranked compounds is selected and synthesized/acquired for further investigation using experimental techniques [1][2][3][4][5][6][7]. A variety of software packages are available for the combinatorial enumeration of virtual compound libraries. These tools fall into three broad categories: open source or freely available packages [8][9][10][11][12]; commercially available packages [13][14][15][16][17][18][19][20][21]; and proprietary software packages implemented for internal use on top of custom or commercial software libraries [22][23][24][25]. Although implementation details might differ, all virtual library enumeration packages deploy a similar general strategy to generate virtual compound libraries. A core scaffold along with attachment points for R-groups is specified and lists of R-groups are provided by the user. Options to incorporate linkers between the core scaffold and R-groups are also available in some packages. The core scaffold, R-groups and linkers are specified either as SMILES [26,27] or SD [28] files. All possible structures are enumerated by the combinatorial attachment of R-groups to the core scaffold along with the placement of any linkers between them, and a virtual compound library is generated as a SMILES or SD file. The 2D structure representations generated for the compounds in virtual libraries are rather arbitrary.
Although input SD files contain 2D atomic coordinate information for core scaffolds and R-groups, it is not possible to specify the exact orientation of R-groups around scaffolds for the structures enumerated for virtual libraries in any available software package, to the best of our knowledge. Different software packages end up generating completely different orientations of R-groups around scaffolds due to different internal strategies deployed for their optimal placement in the enumerated structures. Consequently, 2D structure layouts for the enumerated structures are not always consistent across software packages.
We have developed a software package called LipidMapsTools for the combinatorial enumeration of virtual compound libraries for lipids. Virtual libraries are enumerated for the specified lipid abbreviations using matching lists of pre-defined templates and chain abbreviations, instead of core scaffolds, linkers and lists of R-groups provided by the user as SMILES or SD files. 2D structures of the enumerated lipids are drawn in a specific fashion; their representation is consistent and adheres to the framework for representing lipid structures proposed by the LIPID MAPS consortium [29,30]. The structure data for the enumerated virtual library is written to an SD file along with additional ontological information such as abbreviation, systematic name, category, main class, sub class, etc. LipidMapsTools is capable of generating large virtual compound libraries for lipids with minimal input from the user.

Methodology
We previously developed the LIPID MAPS Structure Database (LMSD) [31], which contains structures and annotations of biologically relevant lipids. It is a relational database and currently contains over 37,000 structures. All lipids in the LMSD have been classified, named and drawn according to the comprehensive classification, nomenclature and drawing system [32,33] proposed by the LIPID MAPS consortium. Based on this classification system, lipids are divided into eight categories: fatty acyls (FA), glycerolipids (GL), glycerophospholipids (GP), sphingolipids (SP), sterol lipids (ST), prenol lipids (PR), saccharolipids (SL) and polyketides (PK). Each category is further divided into classes and subclasses. Figure 1 shows a representative lipid structure for each lipid category along with its LIPID MAPS ID (LM ID) and name.
In general, the acid/acyl or head group is drawn on the right side and the hydrophobic radyl chain is shown on the left. For two out of the eight lipid categories, GL and GP, the radyl hydrocarbon chains (the sn1, sn2 and sn3 chains at positions 1, 2 and 3 on the glycerol backbone, following the stereospecific numbering (sn) scheme) are drawn with the chain termini on the left using attachment points on the template backbone, and the appropriate head groups are shown on the right (Figure 2). The term radyl chain is used to represent acyl, alkyl or 1Z-alkenyl chains. For sphingolipids (SP), the two chains corresponding to a long chain base and an N-acyl chain are also drawn with the chain termini on the left of the ceramide template backbone and the head groups are on the right.
The three lipid categories of GL, GP and SP along with the cardiolipins (CL), a lipid class under GP, have a fixed backbone with chains and head groups attached to the specific attachment points on the backbone. These characteristics make these types of lipids amenable to the template-based combinatorial enumeration of virtual compound libraries, using the pre-defined lists of most likely chains and templates containing the appropriate head groups.

Lipid abbreviations
The lipid abbreviation format is illustrated in Figure 2. The sphingolipid abbreviation format includes the specifications of a long chain base and an N-acyl chain on the ceramide backbone along with the specification of a head group: Headgroup(LongChainBase/NAcylChain). One of the three letters d, t or m must precede the chain length specifier of the long chain base; the format of the rest of the long chain base and N-acyl chain abbreviation is similar to the format of the chain abbreviation for other lipid categories. The letters t and m are used to represent 4R-hydroxy and 3-keto groups at positions 4 and 3 respectively in the long chain base. Representative examples of the lipid abbreviation format for SP are: Cer(d18:1(4E)/14:0), Cer(t18:0/18:2(9Z,12Z)) and Cer(m14:0/16:1(9Z)). These abbreviations correspond to N-(tetradecanoyl)-sphing-4-enine, N-(9Z,12Z-octadecadienoyl)-4R-hydroxy-sphinganine and N-(9Z-hexadecenoyl)-3-keto-tetradecasphinganine.
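The Headgroup(LongChainBase/NAcylChain) format described above can be split into its parts with a few lines of Python. This is only an illustrative sketch, not the LipidMapsTools implementation (which is written in Perl); the function name split_sp_abbreviation and the regular expression are assumptions made for this example, and wild cards are not handled.

```python
import re

# Letters allowed before the chain length of a long chain base (from the text).
LCB_PREFIXES = ("d", "t", "m")

# Headgroup(LongChainBase/NAcylChain), e.g. Cer(d18:1(4E)/14:0)
SP_ABBREV_RE = re.compile(r"^(?P<head>[^()/\s]+)\((?P<chains>.+)\)$")


def split_sp_abbreviation(abbrev):
    """Return (head_group, lcb_prefix, lcb_chain, n_acyl_chain).

    Hypothetical helper for illustration; wild cards are not supported.
    """
    # Tolerate stray spaces such as "Cer (m14:0/16:1(9Z))".
    match = SP_ABBREV_RE.match(abbrev.replace(" ", ""))
    if match is None:
        raise ValueError(f"not of the form Headgroup(LCB/NAcyl): {abbrev!r}")
    parts = match.group("chains").split("/")
    if len(parts) != 2:
        raise ValueError(f"expected exactly two chains: {abbrev!r}")
    lcb, n_acyl = parts
    if lcb[:1] not in LCB_PREFIXES:
        raise ValueError(f"long chain base must start with d, t or m: {lcb!r}")
    return match.group("head"), lcb[0], lcb[1:], n_acyl
```

For example, split_sp_abbreviation("Cer(d18:1(4E)/14:0)") separates the head group Cer, the prefix d, the long chain base 18:1(4E) and the N-acyl chain 14:0.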
The chain abbreviation format consists of the specifications for chain length, number of double bonds along with their geometry, and substituents. The chain length specification is mandatory; all other specifications are optional. The substituent specification includes its name, position in the chain and an optional value for stereochemistry (R or S). The stereochemistry for the substituents is determined using CIP (Cahn-Ingold-Prelog) [34][35][36] priority rules. For example, the acyl chain specification 16:0 corresponds to hexadecanoyl, and 20:4(7E,10E,13E,16E) implies 7E,10E,13E,16E-eicosatetraenoyl. A hydroxyl group at position 6 with R stereochemistry in the 18:2(2E,4E) acyl chain, corresponding to 2E,4E-octadecadienoyl, is shown as 18:2(2E,4E)(6OH(R)). Table 1 shows representative examples of the sn chain specifications available during the combinatorial enumeration of virtual compound libraries for GL, GP and CL. Complete lists of the sn chain specifications for GL, GP and CL are shown in Additional file 1: Table S1; the long chain bases and N-acyl chain lists are shown in Additional file 1: Table S2 and Table S3; and Additional file 1: Table S4 provides the list of all available substituents.
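A minimal sketch of how a chain abbreviation of this form might be parsed is given below. It is written in Python for illustration only and does not reproduce the Perl code in the package; the names Chain and parse_chain are hypothetical, and the sketch assumes that multiple substituents are separated by commas and that stereochemistry may be written in either round or square brackets.

```python
import re
from dataclasses import dataclass, field

CHAIN_RE = re.compile(
    r"^(?P<length>\d+)"                          # chain length (mandatory)
    r"(?::(?P<db>\d+))?"                         # number of double bonds
    r"(?:\((?P<geom>\d+[EZ](?:,\d+[EZ])*)\))?"   # geometry, e.g. (2E,4E)
    r"(?:\((?P<subs>.+)\))?$"                    # substituents, e.g. (6OH(R))
)
# Position, name and optional R/S stereochemistry of one substituent.
SUB_RE = re.compile(
    r"(?P<pos>\d+)(?P<name>[A-Za-z]+)(?:[(\[](?P<stereo>[RS])[)\]])?"
)


@dataclass
class Chain:
    length: int
    double_bonds: int = 0
    geometry: list = field(default_factory=list)      # [(position, "E" or "Z")]
    substituents: list = field(default_factory=list)  # [(position, name, stereo)]


def parse_chain(abbrev):
    """Parse e.g. '18:2(2E,4E)(6OH(R))' into a Chain (illustrative only)."""
    m = CHAIN_RE.match(abbrev)
    if m is None:
        raise ValueError(f"invalid chain abbreviation: {abbrev!r}")
    chain = Chain(int(m.group("length")), int(m.group("db") or 0))
    if m.group("geom"):
        chain.geometry = [(int(g[:-1]), g[-1]) for g in m.group("geom").split(",")]
    if m.group("subs"):
        chain.substituents = [
            (int(s.group("pos")), s.group("name"), s.group("stereo"))
            for s in SUB_RE.finditer(m.group("subs"))
        ]
    return chain
```

With this sketch, parse_chain("18:2(2E,4E)(6OH(R))") yields a chain of length 18 with two double bonds at positions 2 and 4 (both E) and a hydroxyl substituent at position 6 with R stereochemistry.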
Wild card specifications of chain lengths and double bonds along with their geometry are supported in the lipid abbreviation format, in order to specify a set of lipids to enumerate from the pre-defined lists of most likely sn chain abbreviations and head groups. Allowed wild card characters are: * (asterisk), + (plus), - (minus), > (greater than) and < (less than). The wild card character * is used for the specification of chain length and number of double bonds along with their geometry. It refers to all available chain abbreviations for lipids. The wild card characters + and - refer to even and odd chain lengths. The wild card characters > and < are used along with a number after them to indicate chain lengths greater than or less than a specified number; these are only valid for chain length specifications. For example, the lipid abbreviation TG(*/*/*) or TG(*:*/*:*/*:*) represents all possible triradylglycerolipid structures. The abbreviation DG(*:2/*:1(9Z)/0:0) corresponds to all possible diradylglycerolipid structures containing 2 double bonds in sn1 chains and a specific double bond in sn2 chains. The abbreviation PC(*->10<20:*/*+>16<24:*) implies all possible GP structures containing the phosphocholine head group, sn1 chains with odd chain lengths greater than 10 and less than 20, and sn2 chains with even chain lengths greater than 16 and less than 24.
Enumeration starts with the identification of a lipid template for the specified lipid abbreviation. The abbreviation is analyzed to identify the presence or absence of chains and head groups. The chain abbreviations are parsed to retrieve the specified chain length and number of double bonds along with their geometry. Any wild cards specified for chains and head groups are identified and marked for further combinatorial enumeration. An appropriate template or set of templates is selected from a pre-defined list of templates that match the specified lipid abbreviations.
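The wild card rules for chain lengths can be illustrated with a small matcher. The sketch below is in Python and is only an interpretation of the rules as stated in the text, not the package's own code; the function name length_matches and the regular expression are assumptions for this example, and the candidate range of 2 to 39 is used purely as sample input.

```python
import re

# "*" optionally followed by "+" (even) or "-" (odd), then optional ">N" and "<M".
LEN_SPEC_RE = re.compile(r"^\*(?P<parity>[+-])?(?:>(?P<gt>\d+))?(?:<(?P<lt>\d+))?$")


def length_matches(spec, length):
    """Return True if a chain length satisfies a chain length specification.

    spec is either an explicit number ("16") or a wild card such as
    "*", "*+", "*->10<20" or "*+>16<24" (hypothetical helper).
    """
    spec = spec.replace(" ", "")
    if spec.isdigit():
        return length == int(spec)
    m = LEN_SPEC_RE.match(spec)
    if m is None:
        raise ValueError(f"invalid chain length specification: {spec!r}")
    if m.group("parity") == "+" and length % 2 != 0:
        return False  # "+" keeps even chain lengths only
    if m.group("parity") == "-" and length % 2 == 0:
        return False  # "-" keeps odd chain lengths only
    if m.group("gt") is not None and length <= int(m.group("gt")):
        return False
    if m.group("lt") is not None and length >= int(m.group("lt")):
        return False
    return True


candidates = range(2, 40)
sn1_lengths = [n for n in candidates if length_matches("*->10<20", n)]
sn2_lengths = [n for n in candidates if length_matches("*+>16<24", n)]
```

Applied to the PC example above, sn1_lengths contains the odd lengths 11 to 19 and sn2_lengths the even lengths 18 to 22. In the package the candidates would come from the pre-defined chain lists rather than from a plain numeric range.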
Workflow for the template-based combinatorial enumeration of virtual compound libraries. An appropriate lipid template structure is selected from a pre-defined list of templates for the specified lipid abbreviation. 2D structures of lipid templates are stored internally as MDL MOL strings and annotated with information regarding atom numbers and atomic coordinates for attachment points, number of existing carbon atoms in chains, head group name, etc. After an appropriate template has been identified and chains selected for the specified lipid abbreviation, a virtual compound library is generated by the combinatorial enumeration of all selected chains at appropriate attachment points on the template. An SD file is written out containing structure data along with abbreviation, systematic name, lipid category, main class, sub class, etc.
The templates are internally stored as MDL MOL strings containing structure data, along with a mapping of each template ID to additional information such as atom numbers and atomic coordinates for attachment points, number of chain carbons in the template, head group name, lipid category, etc. Examples of available templates for GL, GP, CL and SP are shown in Figure 4, Figure 5, Figure 6 and Figure 7 respectively (the complete lists of templates for GP and SP are available in Additional file 1: Figure S1 and Figure S2). After an appropriate template or set of templates has been identified for the specified lipid abbreviation, chain abbreviations that match the specified abbreviation are retrieved from the pre-defined list of most likely chain abbreviations spanning chain lengths from 2 to 39 for GL, GP and CL, and the most likely long chain bases and N-acyl groups for SP. Appropriate head groups are selected from the lists of supported head groups for GP and SP (Additional file 1: Table S9 and Additional file 1: Table S10). A virtual compound library for lipids is generated by the combinatorial enumeration of all selected chains at appropriate attachment points on the lipid templates. An SD file is written out containing structure data along with additional ontological data such as lipid abbreviation, systematic name, lipid category, main class, sub class, etc. Table 2 shows representative examples of the lipid abbreviations and commands for generating virtual compound libraries for various lipid categories. In addition to the complete enumeration of all possible structures for various lipid categories using wild card characters for chain lengths, number of double bonds along with their geometries and head groups, the LipidMapsTools software package allows the generation of subsets of these virtual libraries corresponding to specific chain lengths, number of double bonds with specific double bond geometry and head groups.
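The combinatorial step itself amounts to taking the Cartesian product of the chains selected for each attachment point. The Python sketch below illustrates this at the level of abbreviations only; it does not build MOL structures or write SD files, and the function name enumerate_library and the sample chain lists are assumptions made for this example.

```python
from itertools import product


def enumerate_library(head_group, chains_by_position):
    """Yield one lipid abbreviation per combination of chains.

    chains_by_position holds one list of chain abbreviations per attachment
    point on the template, in sn order (hypothetical helper).
    """
    for combo in product(*chains_by_position):
        yield f"{head_group}({'/'.join(combo)})"


# Sample input: two chains selected for sn1 and two for sn2.
sn1_chains = ["16:0", "18:1(9Z)"]
sn2_chains = ["18:0", "20:4(5Z,8Z,11Z,14Z)"]
library = list(enumerate_library("PC", [sn1_chains, sn2_chains]))
```

Here library holds four abbreviations, from PC(16:0/18:0) to PC(18:1(9Z)/20:4(5Z,8Z,11Z,14Z)). In the package each combination would additionally be turned into a 2D structure by attaching the chains to the template at its annotated attachment points.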

Implementation
LipidMapsTools is implemented using the Perl [37,38] programming language and is available on a variety of platforms. The software architecture (Figure 8) consists of a bin sub-directory under the lipidmapstools root directory containing command line scripts for generating virtual compound libraries from the lipid abbreviations corresponding to the specific lipid categories, which in turn make use of the functionality available in Perl modules residing in the lib sub-directory to generate the structures along with the additional ontological information. The modules in the lib directory are divided into two categories: specific and generic modules. The specific modules contain functionality for virtual library generation corresponding to the specific lipid categories; the generic modules implement the core functionality such as chain abbreviation parsing, structure generation by attaching chains to the template, etc. The command line scripts make use of the specific modules to generate virtual compound libraries for lipids, which in turn rely on the core functionality to perform specific tasks. The separation of the core functionality from the lipid specific functionality facilitates the maintenance and enhancement of the LipidMapsTools package. Two external modules, TextUtil.pm and FileUtil.pm, available from the open source package MayaChemTools [39], are also distributed with LipidMapsTools. These modules provide functionality for processing data from text files and various file manipulation utilities.
The templates for lipid categories are stored in each category specific module as MDL MOL strings corresponding to template structure data, along with a mapping of each template ID to additional information such as atom numbers and atomic coordinates for attachment points, number of chain carbons in the template, head group name, lipid category, main class, sub class, etc. No external MDL MOL data files are needed.

Results and discussion
LipidMapsTools provides command line scripts for the combinatorial enumeration of virtual compound libraries for lipids from the specified lipid abbreviations. Virtual compound libraries are generated by the combinatorial enumeration of most likely chains around the specific templates, with chain lengths varying from 2 to 39 and containing a specific number of double bonds along with their geometry. Radyl chains that are alkyl or 1Z-alkenyl rather than acyl are skipped at the sn2 and sn3 positions during the combinatorial enumeration wherever they are not permitted by the LIPID MAPS classification scheme for lipids. For example, the LIPID MAPS classification scheme doesn't contain any sub classes for alkyl and 1Z-alkenyl chains at sn2 and sn3 positions for the glycerolipids, and the structures corresponding to these chains at sn2 and sn3 positions are not generated.
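The skipping rule for glycerolipids can be sketched as a filter applied before the combinatorial product is formed. The Python example below is illustrative only. It assumes that alkyl chains are written with an O- prefix and 1Z-alkenyl chains with a P- prefix, a notation that is not spelled out in the text above, and the function names are hypothetical.

```python
from itertools import product


def is_acyl(chain):
    # Assumption: "O-" marks an alkyl chain and "P-" a 1Z-alkenyl chain;
    # any other chain abbreviation is treated as acyl.
    return not chain.startswith(("O-", "P-"))


def allowed_gl_combinations(sn1, sn2, sn3):
    """Combine chains for a glycerolipid, keeping only acyl chains at sn2 and sn3."""
    sn2_acyl = [c for c in sn2 if is_acyl(c)]
    sn3_acyl = [c for c in sn3 if is_acyl(c)]
    return list(product(sn1, sn2_acyl, sn3_acyl))


chains = ["16:0", "O-16:0", "P-16:0"]
combos = allowed_gl_combinations(chains, chains, chains)
```

With the three sample chains, only three of the 27 unrestricted combinations remain, because alkyl and 1Z-alkenyl chains are kept at sn1 but dropped at sn2 and sn3.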
Virtual compound libraries containing all possible structures for GL, GP, CL and SP are generated using the commands GLStrGen.pl "*(*/*/*)", GPStrGen.pl "*(*/*)", CLStrGen.pl "CL(1'-[*/*],3'-[*/*])" and SPStrGen.pl "*(*/*)" respectively. These command line scripts generate SD files containing the 2D structure data for all the enumerated structures along with additional information such as abbreviation, systematic name, chain length and double bond count, lipid category, main class, sub class, etc. Subsets of the complete virtual libraries containing specific chain lengths and head groups are generated by specifying them explicitly in the lipid abbreviations. For example, the command GPStrGen.pl "PC(*:*/*:*)" generates a subset of the GP virtual library containing all possible structures with the phosphocholine (PC) head group. An SP virtual library containing structures with the sphingomyelin (SM) head group and with long chain bases and N-acyl chains of length greater than 15 and less than 21 is generated by the following command: SPStrGen.pl "SM(*>15<21:*/*>15<21:*)". The complete lists of available head groups for GP and SP are shown in Additional file 1: Table S7 and Table S8.
The cardiolipins, a class of the glycerophospholipids, may contain up to 4 sn chains. Because all combinations of the 4 sn chains are enumerated, the complete CL virtual library is very large and takes a substantial amount of time to generate (Table 3). A subset of the CL virtual library containing sn chain
The current focus of LipidMapsTools for the combinatorial enumeration of virtual compound libraries is on the mammalian lipids. The pre-defined lists of radyl chains and long chain bases contain the specifications for chain lengths with degrees of unsaturation that are most likely to occur in the mammalian lipids. Although these pre-defined lists are quite comprehensive, it is impossible to cover all the scenarios, not only in terms of novel mammalian lipids but also different types of radyl chains or long chain bases that may be present in non-mammalian species such as plants, insects, bacteria, fungi and marine organisms. LipidMapsTools is designed to allow addition of new radyl chains and long chain bases in a relatively straightforward manner. The core Perl modules ChainAbbrev.pm and SPChainAbbrev.pm contain the pre-defined lists of radyl chains and long chain bases respectively. After the pre-defined lists in the appropriate modules have been updated, the newly added radyl chains or long chain bases are available for the enumeration of virtual compound libraries through the command line scripts.
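To see why the complete CL library grows so quickly, note that the number of enumerated structures is the product of the number of chains available at each position. The Python sketch below uses an illustrative count of 50 chains per position; this figure is an assumption chosen for the example and is not taken from the package or from Table 3.

```python
def library_size(chains_per_position):
    """Number of structures when every combination of chains is enumerated."""
    size = 1
    for count in chains_per_position:
        size *= count
    return size


# Illustrative chain count only, not the size of the pre-defined lists.
n_chains = 50
two_chain_size = library_size([n_chains] * 2)   # e.g. a lipid with sn1 and sn2
four_chain_size = library_size([n_chains] * 4)  # e.g. a cardiolipin with 4 sn chains
```

With 50 candidate chains per position, two positions give 2,500 combinations while four positions give 6,250,000, which is why restricting even one chain position reduces the enumeration substantially.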
Figure 7 Representative examples of template structures for sphingolipids (SP). Additional file 1: Figure S2 shows the complete list of supported template structures for SP. A representative example of an abbreviation matching a template is shown below the template structure. The head group abbreviations are: ceramide, 4R-hydroxy-ceramide, 3-keto-ceramide, CerP (ceramide-1-phosphate), PE-Cer (ceramide-1-phosphoethanolamine), SM (sphingomyelin), GlcCer (1-β-glucosyl-ceramide).
LipidMapsTools also provides the capability to generate individual lipid structures containing arbitrary specifications for radyl chains or long chain bases which are not present in the pre-defined lists available in the package, without requiring any customization by the user. This functionality is available in the scripts provided with the LipidMapsTools package through a command line option. For the lipid abbreviations containing arbitrary specifications, the structure generation methodology used in the LipidMapsTools package skips the step to confirm the presence of the specified radyl chains or long chain bases in the pre-defined lists of most likely chain lengths, and proceeds to generate the structure as long as the format of the specified abbreviation is valid. For example, the command SPStrGen.pl -ChainAbbrevMode Arbitrary "SM(d30:4(4E,8E,12E,16E)/34:4(5Z,8Z,11Z,14Z)(16OH[R]))" parses the arbitrary specifications for the long chain base and N-acyl chain not present in the pre-defined lists available in the package, generates the appropriate sphingomyelin structure and writes it out to an SD file. This functionality facilitates the generation of individual structures for both mammalian and non-mammalian species containing radyl chains or long chain bases currently not present in the LipidMapsTools package.
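The difference between the default behaviour and the arbitrary mode can be pictured as two levels of validation: a format check that always applies, and a membership check against the pre-defined lists that is skipped in arbitrary mode. The Python sketch below is only an illustration of that idea; the function name chain_is_accepted, the regular expression and the sample list are assumptions and do not reflect the package's internal code.

```python
import re

# Loose format check: length:double_bonds, optional geometry, optional substituents.
CHAIN_FORMAT_RE = re.compile(r"^\d+:\d+(\(\d+[EZ](,\d+[EZ])*\))?(\(.+\))?$")


def chain_is_accepted(chain, known_chains, arbitrary=False):
    """Decide whether a chain abbreviation may be used for structure generation.

    The format must always be valid; membership in the pre-defined list is
    only required when arbitrary is False (hypothetical helper).
    """
    if CHAIN_FORMAT_RE.match(chain) is None:
        return False
    return arbitrary or chain in known_chains


known = {"16:0", "18:1(9Z)"}  # stand-in for a pre-defined chain list
n_acyl = "34:4(5Z,8Z,11Z,14Z)(16OH[R])"
default_result = chain_is_accepted(n_acyl, known)
arbitrary_result = chain_is_accepted(n_acyl, known, arbitrary=True)
```

In this example the N-acyl chain from the command above is rejected by default because it is not in the sample list, but accepted in arbitrary mode because its format is valid.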
The capability to generate individual structures from a specific lipid abbreviation is quite useful for on-the-fly structure generation when populating databases and for online structure display.

Conclusions
The LipidMapsTools software package has been developed for the template-based combinatorial enumeration of virtual compound libraries for lipids. A set of command line scripts is provided to enumerate all possible structures corresponding to the specified lipid abbreviations without any additional input requirements from the user. It is relatively straightforward to generate subsets of complete virtual libraries by explicit specifications of chains and head groups in the lipid abbreviations. 2D structures of the enumerated lipids are drawn in a specific fashion; their representation is consistent and adheres to the framework for representing lipid structures proposed by the LIPID MAPS consortium. The customization and enhancement of existing functionality along with development of new functionality is facilitated by the modular nature of the software architecture. LipidMapsTools is under continuous development and we anticipate the addition of new templates along with radyl chains and long chain bases for both mammalian and non-mammalian lipid species in future versions of the package.
Glycerolipids (GL) GLStrGen.pl "*(*/*/*)" 971,268 181.6