Molecule database framework: a framework for creating database applications with chemical structure search capability

Background Research in organic chemistry generates samples of novel chemicals together with their properties and other related data. The involved scientists must be able to store this data and search it by chemical structure. There are commercial solutions for common needs like chemical registration systems or electronic lab notebooks. However for specific requirements of in-house databases and processes no such solutions exist. Another issue is that commercial solutions have the risk of vendor lock-in and may require an expensive license of a proprietary relational database management system. To speed up and simplify the development for applications that require chemical structure search capabilities, I have developed Molecule Database Framework. The framework abstracts the storing and searching of chemical structures into method calls. Therefore software developers do not require extensive knowledge about chemistry and the underlying database cartridge. This decreases application development time. Results Molecule Database Framework is written in Java and I created it by integrating existing free and open-source tools and frameworks. The core functionality includes: • Support for multi-component compounds (mixtures) • Import and export of SD-files • Optional security (authorization) For chemical structure searching Molecule Database Framework leverages the capabilities of the Bingo Cartridge for PostgreSQL and provides type-safe searching, caching, transactions and optional method level security. Molecule Database Framework supports multi-component chemical compounds (mixtures). Furthermore the design of entity classes and the reasoning behind it are explained. By means of a simple web application I describe how the framework could be used. I then benchmarked this example application to create some basic performance expectations for chemical structure searches and import and export of SD-files. Conclusions By using a simple web application it was shown that Molecule Database Framework successfully abstracts chemical structure searches and SD-File import and export to simple method calls. The framework offers good search performance on a standard laptop without any database tuning. This is also due to the fact that chemical structure searches are paged and cached. Molecule Database Framework is available for download on the projects web page on bitbucket: https://bitbucket.org/kienerj/moleculedatabaseframework.


Background
In organic chemistry research the chemists synthesize novel compounds that exhibit the desired properties and perform better than existing compounds. The commercial objective is to sell these new and superior compounds directly or in new products containing them. During the process from the initial creation of a new compound until it is commercialized, a lot of data related to it is generated such as physiochemical properties, analytical-and toxicological data. All this data needs to be stored and scientists must be able to search it by chemical structure.
The chemist performs a chemical reaction and enters the relevant chemical structures, the exact procedure and outcome into his/her electronic lab notebook (ELN) so that other chemists can find it and repeat the reaction if necessary. After this new compound is synthesized, a new entry for it is generated in the chemical registration system. Unfortunately there is not much scientific literature concerning how and what such registration systems do [1]. The main purpose is to assign each compound a unique in-house reference number that then can be used to cross-reference it. There are many commercial software solutions for ELNs and chemical registrations such as those from PerkinElmer (CambridgeSoft), Accelrys, ChemAxon or Dotmatics to name a few. However these solutions may be costly especially if they also require a commercial relational database management system (RDBMS). An exception is ChemAxon Compound Registration [2] which runs on MySQL. Another issue is that they are closed-source and if such systems are used in the context of scientific experiments, the experiment is not fully reproducible by other scientists unless they also have access to a license for the same system. Because these systems need to be highly configurable, they are also very complex. So configuring and administrating such a system requires a lot of very specific expertise. Consequentially it can be advantageous to build your own solution especially for organisations that already have an in-house software development department. A custom solution does not have to be highly configurable as it can be tailored to your needs. This includes administrative interfaces for user management and specific tasks like informing all users about new features. Such tools can greatly reduce administrative overhead.
Another common requirement is in-house compound databases and processes for which no commercial solutions are available. These databases might not store information on novel compounds but are still important in the discovery process. For example, an organisation wishes to gather and store information on commercially available products from competitors or other suppliers such as purity, pricing and other properties. With PerkinElmer's BioSAR application you can search for a chemical structure in any database table in oracle that was indexed by their cartridge. However it offers no easy solution for data entry. Either a trained expert needs to enter the data or a custom web interface for easy data entry must be developed. Dotmatics Browser application is very similar to BioSAR in this regard. In general there seems to be a need to create custom chemical structure searchable database applications [3,4].
As a solution for these issues Molecule Database Framework (MDF) was created. MDF can be used to create chemical structure search enabled database applications. One of the advantages compared to directly using a chemical database cartridge is, that all code to connect, access and query the database is provided by MDF. It abstracts the storing and querying of chemical structures into method calls. The software developers using MDF do not require extensive knowledge of either chemistry or the underlying database cartridge.
MDF has been designed with a flexible domain model. An application can have many types of chemical compounds like a registration compound and an inventory compound which can have different properties. Samples of these different types of chemical compounds can be associated with a container identified by a unique barcode. Such containers can then be searched by chemical structure.
The design and functionality of MDF makes it possible to build many different types of systems like registration systems, inventory systems and simple compound databases or all of these together in one single integrated solution.

Implementation
Open-Source software components Molecule Database Framework was developed by integrating free, open-source software components (Table 1). These components are backed by a commercial organisation offering services or are wide-spread and have a large base of contributors like PostgreSQL. This means all these components are well maintained and should be future proof.
PostgreSQL is the RDBMS used for MDF. It is Open-Source and available for different platforms (Linux, BSD, Solaris, Windows, OS X). Bingo PostgreSQL Cartridge is an extension to Post-greSQL that enables chemical structure search within PostgreSQL. Bingo offers functionality that not all commercial cartridges have like advanced tautomer search and resonance substructure search [5]. It is also available for Oracle and SQL Server. Bingo is responsible for the chemical structure searches in MDF.
The Indigo Toolkit is a chemistry toolkit written in C++ with application programming interfaces (API) for various other languages like Python, Java or .Net. Indigo is used for format conversion, import and export of SD-Files and for generating chemical structure depictions including highlighted substructures.
The Spring Framework is a framework for creating a diverse range of Java applications. It can be seen as an alternative to Java EE. Its core module is an inversion of control container for injecting dependencies at runtime. For example this means that the behaviour of an application can be changed by using a different implementation of an interface. Which implementation to use can be configured in an Extensible Markup Language (XML) file and does not require any code changes. To activate such a configuration change, the application must be restarted.
MDF also uses the Spring Data JPA and Spring Security modules. Spring Data JPA is for fast creation of Java Persistence API (JPA) based data access layers. Spring Security is used for method level security and securityaware database queries (results are limited to content that a user is allowed to see on the database level).

Software layers
MDF consists of the 4 layers: the database using Post-greSQL, Object Relational Mapping (ORM) provided by Hibernate using JPA, the data access layer implemented with Spring Data JPA and a service layer for transaction support and security ( Figure 1).

Entity classes
MDF uses JPA with Hibernate as its implementation. All entity classes are therefore annotated with JPA annotations. The entity classes are shown as an UML Class diagram in Figure 2.

Base entity
BaseEntity is an abstract class that manages metadata like creation and last modified date and who created and modified the entity. It contains the abstract getId() method to retrieve the unique database id. BaseEntity is further responsible for the optimistic locking feature of JPA. It contains a column annotated with @Version. If an entity is updated, the numeric version column is incremented by 1. Update statements always contain the version in the where-clause and if a concurrent update occurs, they do not match and the concurrent modification is detected.

Chemical structure
ChemicalStructure entity holds the chemical structure data (SMILES [6] or molfile) and the structure key (InChikey). A ChemicalStructure is unique and immutable and managed by MDF. Users operate on Chemical Compounds and not on ChemicalStructures directly. Unique means that when a new ChemicalCompound is saved, the framework checks if the referenced Chemical Structures already exist and if yes, reuses them. The existence check is done by using the standard InChikey which has a unique constraint in the database. Immutable means that if a ChemicalCompound is updated and one of the ChemicalStructures was changed, MDF will automatically check if the updated ChemicalStructure already exists and will use it or create a new Chemical Structure. The old one will remain unchanged.
While the Bingo cartridge offers a molecular weight search without explicitly storing it in the database, this cannot yet be combined with a chemical structure search. Therefore MDF automatically calculates the molecular weight of each ChemicalStructure and saves it to the database. This allows structure searches within a specified molecular weight range.

Chemical compound composition
ChemicalCompoundComposition links together Chemical Structure and ChemicalCompound and defines the relative occurrence of the ChemicalStructure within the ChemicalCompound.

Chemical compound
The ChemicalCompound entity is an abstract model of a chemical compound. A ChemicalCompound is a "descriptive entity". It is not a concrete object that physically exists. It consists of ChemicalCompoundCompositions. A ChemicalCompound is therefore a mixture if it consists of multiple ChemicalCompoundCompositions (see Figure 2).
The class contains some basic fields like compound-Name and cas. ChemicalCompounds can also be associated with Containables. Developers using MDF must create concrete implementations of this abstract class. An application can have multiple implementations of ChemicalCompound and each implementation is stored and searched separately (Table per Concrete class Inheritance). Note that due to better usability it was decided to make CAS-Number column nullable and it is not unique. This behavior can be changed by manually creating the according constraints in the database.

Containable
A Containable is a set of a ChemicalCompounds that have a common source. For example they were produced in the same way. In a Chemical Registration System this would be a Batch and in an Inventory System a Lot. The important part is that ChemicalCompound and Containable are generic classes and must form a pair:

Chemical compound container
A ChemicalCompoundContainer holds exactly one Containable of any type. An application should only have one implementation of ChemicalCompoundContainer. A ChemicalCompoundContainer instance represents a concrete physically available object containing a ChemicalCompound linked by a Containable. Chemical-CompoundContainer has a barcode field which is unique and not nullable. The barcode hence uniquely identifies a physically available sample of a Chemical Compound.

Role and user
Role and User are only relevant if MDF is used with Spring Security. ChemicalCompound and Containable hold a reference to their read-role. This is used to filter ChemicalCompoundContainers in the database based on the current users' privileges. Example: An application has two ChemicalCompound-Implementations, DefaultCompound and SecretCompound. Current user has the privilege to read Default Compounds but not for reading (viewing) SecretCompounds. So if this user searches for ChemicalCompound Containers, only ChemicalCompoundContainers that contain a DefaultCompound must be returned by the search. To achieve that, the query's WHERE-clause is extended and the filter based on the Role is added automatically. The main advantage of doing this filtering in the database compared to filtering the results within the applications is that you get pageable results which would not be easily possible (if at all) with application-side filtering.

Data access layer Spring data JPA
The data access layer was created with Spring Data JPA. Spring Data JPA is a module for fast creation of data access layers. It provides methods for create, read, update and delete (CRUD) operations. The developer does not need to write any code for this.
When using Spring Data you create a new interface that extends from generic interfaces provided by Spring Data JPA and represents the repository for an entity. There are different kinds of repository interfaces but the repositories in MDF all extend JpaRepository. JpaRepository provides CRUD-methods and basic retrieval methods for your entity. All these methods are implemented automatically by Spring Data. The interface itself can remain completely empty.
Repositories in MDF also extend QueryDslPredicate Executor. This adds findOne(predicate) and findAll(predicate) methods. Predicates are basically type-safe WHERE-Clauses provided by QueryDSL.
Repositories can also contain custom find-methods. The convention is that the methods starts with "findBy" and is then followed by a path. An example would be the method T findByCas(String cas) in ChemicalCompound Repository. You only need to write the method signature into your interface. The implementation is created automatically by Spring Data.
Another option is to annotate a repository interface method with @Query. The value of the annotation is either a JPQL Query or native SQL.

Chemical structure search
The main challenge was how to make Bingo Cartridge methods usable as Spring Data Repository methods. This involved two steps: First the Bingo Cartridge methods were exposed to Hibernate by extending PostgreSQL82-Dialect and then were used by creating custom Repository implementations.
Spring Data JPA allows you to create method implementations and to use them in your repository interface. In MDF a generic interface for chemical structure search was created.
Spring Data JPA will merge the interface ChemicalCom-poundRepository and its implementation ChemicalCom-poundRepositoryImpl together under the interfaces type.
See Additional file 1 for the full source code of Chemical CompoundRepositoryImpl and AbstractStructureSearch RepositoryImpl which it extends.
As can be deduced from above source code, the chemical structure search methods support paging. Paging of search results has one main benefit: only the requested data needs to be loaded from the database. This can have a huge impact on performance especially if the structures are stored as molfiles. Paging requires that the total number of records is known. The need to count the total hits of a query for each page will greatly reduce performance. MDF solves this issue by caching the count of a query. Hence the count is only determined when the query is executed the first time. Subsequent requests for the next page use the cached value. In the event that relevant data is edited or added, the cache is cleared.
As a last step the methods from ChemicalStructureSearchRepository must be implemented. This can be done in a custom repository implementation. The convention is to name the implementation the same as the interface and add "Impl" at the end of the name.
A repository offering chemical structure search extends above interface.
MDF contains a Spring data repository for each entity (ChemicalCompound, Containable, and ChemicalCom-poundContainer) with methods for chemical structure searching. A developer must extend the provided repositories for each of his entity implementations and may add his own custom search methods to them.

Service layer
The service layer is responsible for transaction management and security. It also offers methods for importing and exporting SD-Files. Transactions and security are implemented declaratively by using according annotations from Spring framework. Security is entirely optional. If you do not need it, MDF remains fully operational and every user can see everything.

Service for ChemicalStructure entity
This service manages entities of type ChemicalStructure. This service is provided by the framework and should be used as-is. Any access of ChemicalStructures should be through this service. ChemicalStructureService contains the logic that makes ChemicalStructures immutable. For convenience ChemicalStructureService has a method that allows changing an existing ChemicalStructure. This will then affect all ChemicalCompoundCompositions that contain the changed ChemicalStructure.

Services for ChemicalCompound and Containable
The services for ChemicalCompound and Containable are very similar. They consist of an interface which contains the security annotations and an implementation containing the @Transactional annotations for declarative transactions. They also offer methods with optional loading of lazy collections.
When an entity is saved, the service automatically sets the correct ChemicalStructures by either selecting an existing one from the database or creating a new one.
Before the entity is passed to Hibernate, the method preSave(entity) is executed. This method is empty in the provided abstract services like ChemicalCompound ServiceImpl but can be overridden in subclasses. For example one could set all non-nullable properties to a default or a sequence value if it is null.
For every ChemicalCompound and Containable implementation the application uses, a service interface and an implementation must be created. The services must implement the getRepository(), checkUniqueness() and getExistingCompound() methods.

Service for ChemicalCompoundContainer
The main difference to services for ChemicalCompound and Containable is that an application should only have 1 implementation of ChemicalCompoundContainer and hence only one such service. The service can be used as provided or in some cases must be extended. When the ChemicalCompoundContainer implementation adds additional unique constraints, checkUniqueness() and getExistingContainer() must be overridden.

Functionality
Below is a listing of the core functionality of MDF: Chemical structure search: Full, Sub, SMARTS, Similarity, Formula Chemical structure searches can be combined with property searches Chemical structure searches are paged and cached Support for multi-component compounds (mixtures) 3 chemical structure searchable entities: ChemicalCompound, Containable and ChemicalCompoundContainer Import and Export of SD-Files for above 3 entities Transactional database access Optional security (authorization) With the design and functionality of MDF it is possible to build many different types of system such as registration systems, inventory systems or just a simple compound database. While you could also create your own ELN, there also exists the free Indigo ELN. This ELN was create by GGA Software Services and is used at Pfizer [7].
In contrast to MolDB5R [3] and to MyMolDB [4], MDF is not a fully functional standalone web application with chemical structure search. As the name implies it's a framework to simplify creating such an application. MDF may also be used to create local or client-server desktop applications. MDF is targeted at software developers and not intended for use by scientists themselves. However MDF features are very robust. Chemical structure searching is done in the database and not the application code. Hence you can search by chemical structure and other properties at the same time, the results can be sorted by multiple properties and can be paged (SQL OFFSET and LIMIT clauses). Note that if you do the chemical structure search in application code, any query will require at least two trips to the database, namely the structure search and subsequently the filtering by other properties, sorting and/ or limiting. Both need to happen in the same transaction. It was not determined if MolDB5R [3] and MyMolDB [4] actually do this in the same transaction.
In MDF the chemical compounds can be associated with a containable, which in registration systems would be a batch or in an inventory system a lot. A specific physically available sample in a bar-coded bottle can then be associated with a container. These containers can also be searched by chemical structure. This is the foundation for creating an inventory system. Developers may add as many additional properties as they want to each of the entities and all of them are searchable together with the chemical structure.
All data access in MDF is transactional to prevent data inconsistencies. MDF can be configured to use a database connection pool. When querying a RDBMS creating a connection often takes more time than the query itself and hence if you already have open connections response times can be reduced.
For similarity searching MDF exposed the algorithms provided by the Bingo Cartridge which are Tanimoto, Tversky and Euclidean metric for substructures.
MDF is ready to be used with Spring security. Security is optional. MDF offers method level security (authorization). It does not offer any authentication features.

Mixture handling
MDF supports multi-component chemical compounds.
Searching by substructure will return all compounds that have at least one component (chemical structure) matching the query structure. This is important because reaction products that may be entered into a chemical registration system are almost always mixtures unless extensive purification is done.
If an entry in an imported SD-File consists of multiple disconnected structures it is assumed that this entry is a mixture and each structure is stored as a separate chemical structure.

Structure normalization
By default MDF stores the chemical structures as they are submitted. MDF does not do any standardization/ normalization of chemical structures. It is up to the developer using MDF to ensure that chemical structures are correctly normalized prior to saving them to the database. It is currently suggested that developers implement such a feature by overriding the preSave() method of ChemicalCompoundServiceImpl. This method is called before any chemical compound is created or updated. Within this method the chemical compound and all the chemical structures it consists of can be freely manipulated as desired. Note that every compound being saved will be processed by this method.

Salts, solvates and solutions
MDF current version 1.0.1 has no special handling for salts, solvates or solutions. MDF will store separate components in a chemical structure file as a separate chemical If the chemical structure is a single ion it will be stored and searchable like any other chemical structure. If this behaviour is unsuitable in a specific case, developers may implement salt and solvate handling functionality in the preSave() method.
Some commercial systems also appear to have no way of handling solutions. It is recommended to create the compound as if it were pure and add the solution information as separate fields on the compound level.

Example web application
A simple web application using MDF was created. The web application makes use of Spring MVC. The application does not make use of the security integration and it does not use the entities Containable and Chemical CompoundContainer. It only uses ChemicalCompound entity. The application is a compound database for multicomponent compounds. It has a page for importing the chemical structures in an SD-file into the compound database. The database can be searched by substructure and properties. It uses JSME [8] for drawing the chemical structures ( Figure 3). The search results page displays the search hits in a tabular and paged fashion. When a substructure search is done, the substructure will be highlighted in the search results (Figure 4). The hits of a search can be exported as an SD-file. The search results contain a link to a single compound view. The properties of the compound may be edited and compositions can be added, edited and deleted (Figures 5, 6). When editing a compound or a composition the application deals with concurrent modifications transparently and conflict resolution dialog is shown on which the user can select which values to use for each property and then save that new version.

Performance
MDF has one main performance issue when handling mixtures. If an application uses mixtures, i.e. compounds with multiple components, a chemical structure query will Figure 4 Result page of a substructure search. The results are displayed in a paged table generated by the JQuery plugin datatables [9]. The chemical structure images have the matching substructure highlighted in red. Clicking in the chemical structure image will display the SMILES for it.  return one row for every component in a compound matching the query. This is undesirable because the end users want to see each compound that matches the query only once. The solution to the problem is to use a distinct query and this is where the performance issue happens. If you perform a distinct query, the whole database needs to be searched regardless of limit clause which greatly increases execution time. Note that sorting has the same effect. So sorting can have a huge performance penalty too and when paging you should always sort to get a predictable outcome. To make this even worse Bingo Cartridge for PostgreSQL does not yet have a proper implementation for cost estimation and the cost for using the chemical structure index is hard coded and underestimated. This misleads the PostgreSQL query planer to always use a full index scan on structure search index even when the query has an additional where-clause that greatly limits the amount of results. In these cases it would, for example, be faster to use the index for the CAS number and use the Bingo matchsub function for filtering. The matchsub function does substructure matching without index. This is of course slower than with an index but if it only has to be done for a small number of structures it is a lot faster than the full index scan. To fix the cost estimation issue, MDF does some internal calculations to explicitly decide whether the structure index or the matchsub function is used. This can improve performance by an order of magnitude. Note that the supplier of the Bingo Cartridge is aware of this issue and the timeline for the fix was end of 2013. The main issue of distinct queries and sorting however is inherent to how relational databases work and cannot be solved except with a better substructure search index or better hardware. MDF also offers a setting to disable distinct queries application-wide for single-component compound databases.
To benchmark MDF the earlier outlined web application was used. The database contains 525573 unique compounds. The compounds are from Zinc [10] subset 13 at reference pH 7 the SD-files 13_p0.0.sdf, 13_p0.1. sdf, 13_p0.10.sdf and 13_p0.11.sdf. The structures are stored in the database as SMILES. Importing each of the SD-files, which contain approximately 131'000 chemical structures, took 12 min with the chemical structure search index disabled. Rebuilding the index after all SDfiles were imported took 22 min on a laptop with 4 GB RAM, Core i5-3220M CPU and a 512GB Samsung 830 SSD. That totals to 1 h 10 min to setup a fully indexed database with half a million compounds. As an additional reference the same import was done on a desktop PC with 12 GB of RAM, i7-875K @ 3.4 GHz and the database running on a Western Digital Green Drive (5400 RPM). Here the import took 8 min and the conclusion is that the import is CPU limited rather than being limited by storage-speed. The index generation took approximately 22 min on the laptop and 20 min on the desktop. The conclusion here is that it is limited more by CPU but drive speed also matters. The import and index generation performance when storing the structures as molfiles was not benchmarked.
The substructure search performance was benchmarked with different configuration settings. Substructure search is done by Bingo PostgreSQL cartridge [5] and this benchmark therefore reflects its performance plus any overhead caused by MDF. With the exception of c1ccccc1 the author drew chemical structures with no specific significance and tested the search speed. The search speed was determined by logbacks [11] implementation of org.slf4j.profiler.Profiler.
The first benchmark is a reference. This benchmark used the option to disable all distinct queries and no sorting was   Table 2 ordered ascending by number of hits. The benchmark was repeated but this time with distinct queries enabled. First page load time is doubled because the count query is run and then the actual query is run which takes about the same amount of time as the count query due to the distinct clause. The second page then always takes half the time to load compared to the first page for the same reason ( Table 3). The number of hits is identical to the ones in Table 2 because all compounds in the database consist of only one component.
The results show that Bingo does no optimizations for a common substructure query like a benzene ring and hence searching for c1ccccc1 in a database in which almost all molecules have this feature is very slow. To improve search speed in such a scenario filtering by additional properties is recommended. Hence the benchmark was repeated with an additional filter of compound name starting with "ZINC34". Table 4 shows the benefit of the MDF optimization as a workaround of the cost estimation issue in Bingo Post-greSQL cartridge. Without this optimization the benchmark would have the same performance as shown in Table 3.
MDF also uses Bingo Cartridges similarity searching functionality. Its performance was tested by searching for compounds with a similarity score of 0.9 using the Tanimoto similarity measure also known as Jaccard Index [12]. The results are shown in Table 5.

Outlook
For generating the chemical structure depictions the Indigo toolkit is used. The toolkit can be configured to generate the structures in many ways including the coloring of heteroatoms, bond length and width and many more. Currently this is hard coded and cannot be adjusted by the user. A next step would be to expose these configurations options so they can be set through a Java properties file. Also the handling of salts and solvates must be implemented to make MDF usable in areas were such compounds are important.
To make use of MDF you must be able to program in Java and you will need basic knowledge about Spring framework and how to configure it. This limits the target audience. When using MDF you need to write some boiler-plate code and hence the next step would be to create additional tools to facilitate the usage of MDF like the automatic generation of entity classes and their repositories and services. These tools would need to be configurable so that a user can define the desired properties for each of the entities and the desired search methods. An option would be a maven plugin. Maven plugins can generate code like the metamodel creation done by the QueryDSL-maven plugin. Another option would be annotations that generate code on compile like Project Lombok does [13].
The final step would be to create a web application platform that lets administrators create new web applications with chemical structure search capability by simply entering the desired properties for the entities on a web form and clicking on a button. It is obvious that this would require a significant development effort.