Chemotion ELN: an Open Source electronic lab notebook for chemists in academia

The development of an electronic lab notebook (ELN) for researchers working in the chemical sciences is presented. The web-based application is available as Open Source software and offers modern solutions for chemical researchers. The Chemotion ELN is equipped with the basic functionalities necessary for the acquisition and processing of chemical data, in particular for working with molecular structures and calculations based on molecular properties. The ELN supports planning, description, storage, and management for the routine work of organic chemists. It also provides tools for communicating and sharing the recorded research data among colleagues. Meeting the requirements of a state-of-the-art research infrastructure, the ELN allows the search for molecules and reactions not only within the user's data but also in conventional external sources such as SciFinder and PubChem. The presented development takes account of the growing dependency of scientific activity on the availability of digital information by providing Open Source instruments to record and reuse research data. The current version of the ELN has been in use for over half a year in our chemistry research group. It serves as a common infrastructure for chemistry research and enables chemistry researchers to build their own databases of digital information as a prerequisite for the detailed, systematic investigation and evaluation of chemical reactions and mechanisms. Electronic supplementary material: The online version of this article (doi:10.1186/s13321-017-0240-0) contains supplementary material, which is available to authorized users.


Technical aspects and programming details
The Chemotion ELN was programmed in Ruby, JavaScript, HTML, and CSS. The backend server is built on the Ruby on Rails framework with a PostgreSQL relational database, while the front-end user interface is mainly constructed with the ReactJS framework to serve a single-page application. Ruby on Rails uses Ruby, a scripting language, which enables fast development with a clear MVC (model-view-controller) structure. ReactJS, in turn, separates DOM (document object model) manipulations from the data flow and thereby disentangles the structures needed for sophisticated user interactions. As a result, people who want to extend the features of the Chemotion ELN or start a new related project face a less steep learning curve when working through the code. Ruby package management makes it easy to include external packages from public code repositories, and the ELN was programmed to be customizable through this package management. Plugins specific to the ELN can also be written as Rails engines to extend the ELN database, the server-side functions, and the user interface. Adding web pages, or even modifying the main application page produced with ReactJS modules, is possible. The SciFinder and NMRdb extensions are two examples that have already been mentioned.

Docker File
Setting up the development environment can be a tedious task for the software engineer. The ELN project is built with Ruby on Rails as the web application framework, connects to a PostgreSQL database, and uses npm as the client-side package manager. On the back-end we use OpenBabel as chemistry helper, RMagick for the Ketcher editor, and many more packages. On the front-end, browserify is used to convert Node.js packages for use in the browser. Getting everything ready is a time-consuming task for people wanting to deploy or develop our ELN project. To ease open-source development, we therefore provide a Docker configuration for the ELN project.
Docker [1] is an open-source project that simplifies building, shipping, and running web applications. Docker wraps the project code with all the information and dependencies needed by the application into one unit called a "container". There are two ways for users or developers to set up the development environment:
(1) Get the Dockerfile [2] from our repository [3]. The Dockerfile is a script that tells Docker to install every software package required by the ELN, from the PostgreSQL database and the back-end Ruby on Rails with additional Ruby gems (using Bundler [4]) to the front-end npm [5] packages.
(2) Pull the Docker image directly from Docker Hub [6]. This image is the compressed Docker image built from the Dockerfile.
After creating or downloading the Docker image, the ELN project is ready to use by starting a Docker container from the Docker image.

Legend:
The visible organization column (collection navigation) contains a gear icon button to access the collection management area, where collections are generated, edited, and deleted (see also info 7, right panel). The standard collection All (standard collections are predefined collections for all users) contains all items and functions as a backup service. The chemotion.net collection (also a predefined standard collection for all users) is installed for later use in combination with a repository for chemical reactions and analyses.
Collections that contain sub-collections can be unfolded (see centered tab). Symbols indicate the collections that have been shared with others. A separate organizer has been created for collections that others have shared with the user. The tab in the middle of the figure shows the extended view of the organizer for collections that contain nested collections (sub-collections). A deeper nesting of collections is also possible (not shown). The management UI for the collections is divided into one area for the user's own collections and one for collections that the user has shared with others. Both management areas can be modified via drag & drop, allowing fast rearrangement of individual collections and modification of their assignment. Collections, sub-collections, and so on can be created, edited, extended, and deleted at any time.
Additional information on the synchronization status of a collection with others is summarized below the collection entry. The assignment as a synchronized collection can easily be deleted or modified via the icons.

Legend:
Samples and reactions are summarized in lists that belong to particular collections. One can easily switch from the sample view to the reaction view. One can select single samples, whole pages, or the whole list of samples for actions (e.g. move/assign/share/copy). The list can be sorted according to the date of modification or the molecular structure. In sample lists, the items are sorted according to their molecular structure. Each sample is assigned to a molecule, and one molecule can group several samples. Samples are always listed below the assigned molecule and are represented by the molecular structure, the sum formula, and the name of the compound (if available via PubChem).
Molecules are automatically checked for their presence in the PubChem database, and the result is summarized in the list. A grey PubChem icon stands for "unknown in PubChem"; a blue icon means the structure is registered in PubChem. In the latter case, the PubChem symbol serves as a link to the landing page of the molecule on the PubChem website. Additional information on the listed samples is summarized in information indicators. These small tags indicate the presence of the sample in other collections of the ELN, the sharing of the sample with others, and the availability of analytical data that have been added to the sample and confirmed.

Legend:
The detailed view encompasses additional data that are available either through manual input or automatic requests to external sources. The main data include the sum formula, the chemical name (IUPAC name), the molecular mass, and the exact mass of the compound. The structure of the molecule is displayed prominently and remains visible independent of the selected tab. A single click on the molecule opens the structure editor, allowing the molecule to be modified.
The details view combines the information of a molecule and the sample. The combined information is summarized in the three tabs properties, analyses, and results. Additional data are available via the tabs SciFinder and NMR, which are implemented as plugins. The window of the details tab is surrounded by a functional frame. The frame color switches from deep blue to light blue if unsaved changes have been made on the client side. Once all information is saved on the server side, the frame changes to deep blue again. In addition, the frame contains information on the name of the sample, its function in reactions that have been planned with the ELN (link to the reaction), and the share status of the sample. The user can define a name for the sample, and the sample can be defined as "top secret". The latter status prevents the information of the sample from being shared and distributed.
While the molecule name of a structure is retrieved from PubChem if possible, the name of the sample can be changed by the user. A secondary name and a location can be entered as well.
Additional data such as the amount of the sample (in mg, ml, and mmol), its density, the boiling point, and the melting point can be entered. The details window also summarizes a description (textual), the purity (number) of the compound, impurities (textual), and a solvent (for a dissolved sample, selected via dropdown). The purity field affects the amount of the sample in mmol for a given mass.
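The relation between mass, purity, and amount of substance can be illustrated with a minimal sketch. It assumes that purity is stored as a fraction between 0 and 1 and that the weighed mass is simply scaled by it; the function and parameter names are hypothetical and not taken from the ELN source code.

```python
def amount_mmol(mass_mg: float, molecular_weight: float, purity: float = 1.0) -> float:
    """Purity-corrected amount of substance in mmol.

    mass_mg          -- weighed mass in mg
    molecular_weight -- molecular weight in g/mol
    purity           -- assumed to be a fraction between 0 and 1
    """
    if molecular_weight <= 0:
        raise ValueError("molecular weight must be positive")
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be a fraction between 0 and 1")
    # mg divided by g/mol gives mmol; only the pure fraction of the mass counts
    return mass_mg * purity / molecular_weight


def volume_ml(mass_mg: float, density_g_per_ml: float) -> float:
    """Volume in ml of a sample of given mass (mg) and density (g/ml)."""
    if density_g_per_ml <= 0:
        raise ValueError("density must be positive")
    return mass_mg / 1000.0 / density_g_per_ml
```

With these assumptions, a sample of 180.16 mg of a compound with a molecular weight of 180.16 g/mol corresponds to 1 mmol at a purity of 1.0 and to 0.5 mmol at a purity of 0.5.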
A collapsible panel also contains information about the calculated elemental analysis. The user may use the form to enter the experimentally found elemental analysis data. Selected identifiers such as the InChI and SMILES codes are given to allow a fast search for the molecules in other databases; the identifiers can be copied to the clipboard with a single click. Additionally, the CAS registry number is given for those compounds that are available via PubChem (see Figure 3B). Every sample is automatically assigned to a molecule, and thus to an InChI and a SMILES identifier generated with OpenBabel. The InChI key is used to request the CAS registry number via the PubChem API service. The answer set is listed in a dropdown menu, allowing the user to select the correct CAS RN. The answer set of the PubChem service includes CAS RNs for the requested InChI key and for known isotopic derivatives of the requested InChI key. As the PubChem response in some cases also contains CAS RNs that belong to different molecules, forthcoming developments will provide an additional query alternative for double-checking the returned CAS RNs.
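One simple plausibility check on a returned answer set is the CAS check digit. The sketch below assumes that the answer set arrives as a plain list of strings (a hypothetical format, not the documented PubChem response) and keeps only entries that have the CAS RN pattern and a correct check digit. It cannot detect a well-formed CAS RN that belongs to a different molecule, so it does not replace the double check mentioned above.

```python
import re

# CAS RN layout: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string is a well-formed CAS RN with a correct check digit."""
    match = CAS_PATTERN.match(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def extract_cas_numbers(answer_set: list[str]) -> list[str]:
    """Keep valid CAS RNs from a list of strings, without duplicates, in input order."""
    found: list[str] = []
    for entry in answer_set:
        if is_valid_cas(entry) and entry not in found:
            found.append(entry)
    return found
```

The filtered list could then be offered in a dropdown menu for the user to choose from.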

Legend:
If CAS registry numbers for a molecular structure are retrievable via PubChem, the available CAS numbers are stored with the molecule. The user can assign one CAS number to a particular sample by choosing from a dropdown menu. All identifiers can be copied to the clipboard.

Legend:
The Analysis tab allows the addition of one or more analyses. The user may select whether the data are relevant for inclusion in reports. The user can define a name for each analysis (textual). The type of the analysis can be selected from predefined values in a dropdown menu to keep a standard classification.
The status of each analysis can be selected (confirmed, unconfirmed). The total count is indicated for each sample in the list view; only confirmed analyses are counted.
The free text field named content can be used for the textual summary of the analytical result.
To allow a fast and standardized text input for the content of the analysis, several icons have been created allowing the introduction of basic text parts that are necessary for the description of the standard analytical methods. In addition, text formatting functions are available. For each analysis, several datasets can be attached. Datasets are uploaded and stored on the local server and can be retrieved via download.

Legend:
Datasets belonging to the corresponding analysis can be named individually, and details on the instrument as well as a general description of the processing method that has been used can be given. The attachments can be uploaded from any available storage or can be dropped into the datasets area.

Legend:
The results tab stores information that has been gained from external sources e.g. a collaborator that provides additional data. The textual field can be filled automatically from Excel sheets with a suitable formatting.

SciFinder Credential and requests
Figure SI-4. Left: Changing the user settings with the SciFinder credentials to obtain a user token with time-limitation. Right: SciFinder Tab with results of a database request with 4 hits identified for the exact structure search.

Legend:
To use the SciFinder plugin, the user has to enter the SciFinder credentials. This procedure has to be repeated once the token has expired. The search functionality allows the same structure searches as the SciFinder online application: the user can choose between exact, substructure, and similarity search. All results of the search request are listed, and the information is provided via a direct link to the search result on the SciFinder web page. In parallel to the search, recent search results for the current session are listed in a search history.

Legend:
The search results that have been gained via the SciFinder search are visualized in the list of the samples. Upon hovering over the SciFinder symbol, the date of the database request and the given results are listed (not shown).
The list format also includes information about the availability of PubChem data related to the sample. In contrast to the SciFinder search, the PubChem lookup requires no user request; the information in the ELN is updated automatically. For the PubChem request, only the exact search function is used.

Legend:
The reaction scheme image can be shifted or zoomed in through mouse action.
Reagents are always given above the reaction arrow. All items of the reaction scheme, such as starting materials, reagents, and product(s), are linked to the sample entries (shown exemplarily in the manuscript). All samples in the reaction table are given with the systematic ID of the ELN and, if possible, additional external names and identifiers. New samples can be added to the reaction (in any role) either by drag and drop from the samples/molecules list or by drawing the structure with the editor. All values are calculated according to the input of the user. Units are switchable, and the user may select between the planned amount and the real amount that was used after weighing the materials. The reference compound is by definition always set to 1 equiv. and can easily be switched. The role of a sample within the table can be changed as well (e.g. from reagent to starting material). Products, and calculations with material that occurs as a product, are handled differently from reagents and starting materials. The numbering of the products depends on the reaction name: the different products or fractions are labeled with the reaction ID and a letter (e.g. from A to D). The yield is calculated automatically from the outcome of the reaction, while direct entry of the yield without knowing the obtained mass is disabled to avoid misuse of the ELN. Solvents can be added by drag and drop from the samples list, by generating a structure, or by choosing one from a standard solvent dropdown menu. The amount of solvent is used to define the concentration of the reagents in the reaction mixture. The description of the reaction can be written individually, or pre-defined text blocks can be used for fast standardized entry (see Figure SI-6B). The temperature of a reaction can be entered either as a number or in a detailed way by creating an online temperature chart (not shown).
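The calculations described for the reaction table can be sketched as follows. This is an illustrative sketch, not the ELN implementation: the class and function names are hypothetical, and the yield calculation assumes that the theoretical amount of product equals the amount of the reference compound multiplied by a stoichiometric coefficient (1 by default).

```python
from dataclasses import dataclass


@dataclass
class Material:
    name: str
    amount_mmol: float  # real amount after weighing, in mmol


def equivalents(material: Material, reference: Material) -> float:
    """Equivalents relative to the reference compound (the reference itself gives 1.0)."""
    return material.amount_mmol / reference.amount_mmol


def yield_percent(product: Material, reference: Material,
                  coefficient: float = 1.0) -> float:
    """Yield from the obtained amount of product; 'coefficient' is an assumed
    stoichiometric factor between reference compound and product."""
    theoretical_mmol = reference.amount_mmol * coefficient
    return 100.0 * product.amount_mmol / theoretical_mmol


def concentration_mol_per_l(material: Material, solvent_volume_ml: float) -> float:
    """Concentration in the reaction mixture; mmol per ml equals mol per l."""
    return material.amount_mmol / solvent_volume_ml
```

For example, 3.0 mmol of a reagent against 2.0 mmol of the reference compound corresponds to 1.5 equiv., and 1.5 mmol of isolated product corresponds to a yield of 75% under the 1:1 assumption.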

Legend:
The name of the reaction can be entered as free text by the user; the status can be set to planned, successful, or not successful; and the temperature can be entered either as one value or as a period (represented as a diagram). At the moment there are 15 predefined text blocks that can be inserted for fast standardized reporting on common procedures. The blocks can be combined to describe the whole synthetic process (if suitable for the reaction). The description input field combines free text input and the predefined text blocks. The text can be edited and modified according to the user's preferences. The ELN in its current version allows the input of information on references that are important or should be cited along with a reaction. At this stage, the ELN has a limited function to register a title along with a URL. This will soon be superseded by a more advanced reference system linking Zotero citation management accounts.

Legend:
A new reference is added to a reaction directly via a tab that can be opened in the reaction area of the user interface. The user can add a title to structure the entries.
The URL of the reference can be entered and stored for a fast retrieval of the original text. The input field supports the addition of several references to one reaction.

Configuration of export functions and results (examples)
Figure SI-7. Export scheme allowing the selection of single items to be exported to the XLSX or SD file format.

Legend:
The user-defined export of data is managed via a list of data items that can be included in the export file. The following information can be selected: image (image of the sample), name (user-defined name of the sample), description (description of the molecule), cano_smiles (canonical SMILES code), sum_formula (sum formula of the molecule), inchistring (InChI string), target amount (amount that was planned), created_at (date-time of creation), updated_at (date-time of last update), molfile (molfile of the compound), purity (purity), solvent (for diluted samples: solvent), impurities (textual, qualitative information on impurities), location (location of the sample), is_top_secret (Boolean describing sensitive information), ancestry (hierarchical sequence of sample ids), external_label (additional label provided by others or to provide for others), short_label (automatically generated label for samples), real_amount (amount that has been effectively used), imported readout (data obtained from external sources), identifier (InChIKey, molecule identifier), density (density), melting_point (melting point), boiling point (boiling point), and molecular weight (molecular weight).
The export can generate either an XLSX file or an SD file.
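How a user-defined field selection maps onto an SD file record can be sketched as follows. The sketch assumes that a sample is available as a dictionary keyed by the field names listed above (a hypothetical representation) and follows the general SD file layout: the molfile block, one tagged data item per selected field, and the record terminator `$$$$`.

```python
def sd_record(sample: dict, selected_fields: list[str]) -> str:
    """Build one SD file record from a sample dictionary.

    Only fields named in 'selected_fields' are written; fields that are
    missing or empty for this sample are skipped.
    """
    lines = [sample["molfile"].rstrip("\n")]  # structure block comes first
    for field in selected_fields:
        value = sample.get(field)
        if field == "molfile" or value is None:
            continue
        lines.append(f">  <{field}>")  # data header line
        lines.append(str(value))       # data value
        lines.append("")               # blank line closes the data item
    lines.append("$$$$")               # record terminator
    return "\n".join(lines) + "\n"


def sd_file(samples: list[dict], selected_fields: list[str]) -> str:
    """Concatenate the records of all samples into one SD file body."""
    return "".join(sd_record(sample, selected_fields) for sample in samples)
```

Because each record is self-contained, the same selection list can be applied to any number of samples.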

Sharing of information (single user and group level)
The sharing of information can be managed via the selection of a single user or a user group. The user groups have to be defined by each ELN user individually.

Legend:
The creation of a new user group (with a short label for the group) can be managed via the user settings in the ELN. The user can define as many groups as necessary and can choose the members of the groups. The user groups can be deleted and edited, and the colleagues that are part of a group can be listed. New users are added via selection from a list of all ELN users. The sharing roles allow a fast assignment of the rights (read, write, share, delete, import elements, take_ownership) to a colleague or a group. The read level is the lowest level of rights; the take_ownership level is the highest one. The fast assignment also includes a predefined detail level that is available for samples or reactions. The permission level can be assigned either through the sharing role or can be set individually. Sample and reaction detail levels can likewise be chosen individually or given automatically via the sharing role. There are two levels for reactions (level 1: observation, description, calculation; level 2: everything) and five levels for samples (level 1: molecular mass of the compound and external label; level 2: molecule, structure; level 3: analysis result and description; level 4: analysis datasets; level 5: everything). In the last step, colleagues are selected from the list of users of the ELN.
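The ordering of the rights from read (lowest) to take_ownership (highest) can be modeled as an ordered enumeration. The sketch below assumes that a higher level includes all lower ones; the numeric values and the helper function are hypothetical and only illustrate the ordering.

```python
from enum import IntEnum


class Permission(IntEnum):
    """Sharing rights in ascending order; the numeric values are illustrative."""
    READ = 0
    WRITE = 1
    SHARE = 2
    DELETE = 3
    IMPORT_ELEMENTS = 4
    TAKE_OWNERSHIP = 5


def is_allowed(granted: Permission, required: Permission) -> bool:
    """Assumes cumulative rights: a granted level covers every lower level."""
    return granted >= required
```

Under this assumption, a colleague with the share right can read and write but cannot delete, and a colleague with take_ownership can perform every action.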

Search functions
Figure SI-9A. Search functions using the refined search function of the ELN.

Legend:
The contents of the ELN can be searched by a text search. The user can switch from a simple search mode (searching for one text fragment in all items) to a refined search mode. The refined search mode supports EXACT and SUBSTRING search and currently covers four different ELN areas (the example in Figure SI-9A is the sample short label). The user may enter either a single search item or a set of items to retrieve a list of compounds as the result of the search.
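The difference between the EXACT and SUBSTRING modes for a set of search items can be sketched as follows. The sketch assumes that samples are available as dictionaries and that matching is case-sensitive; the function name and the data layout are hypothetical.

```python
def refined_search(samples: list[dict], field: str,
                   queries: list[str], mode: str = "EXACT") -> list[dict]:
    """Search one field of all samples for each query item.

    EXACT matches the whole field value, SUBSTRING matches any part of it.
    Hits are collected query by query, so they follow the order of the entered items.
    """
    if mode not in ("EXACT", "SUBSTRING"):
        raise ValueError("mode must be 'EXACT' or 'SUBSTRING'")
    results: list[dict] = []
    for query in queries:
        for sample in samples:
            value = sample.get(field, "")
            hit = (value == query) if mode == "EXACT" else (query in value)
            if hit and sample not in results:  # avoid listing a sample twice
                results.append(sample)
    return results
```

Searching the short labels "AB-12", "AB-123", and "CD-7" for "AB-12" returns one sample in EXACT mode and two samples in SUBSTRING mode.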
The results are listed according to the order of the original entry.

Figure SI-9B. Search functions with structure and substructure search (adaptable through similarity search).

Legend:
The contents of the ELN can be searched by a text-based search and a structure search. The structure search can be adapted to the user's needs by choosing either the similarity search (and defining the degree of similarity) or the substructure search.
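The idea of a user-defined degree of similarity can be illustrated with the Tanimoto coefficient, a measure commonly used for comparing structural fingerprints. The text does not state which measure the ELN uses, so this is an assumption; fingerprints are represented here simply as sets of hypothetical feature indices.

```python
def tanimoto(fingerprint_a: set[int], fingerprint_b: set[int]) -> float:
    """Tanimoto coefficient: shared features divided by all features present."""
    union = fingerprint_a | fingerprint_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fingerprint_a & fingerprint_b) / len(union)


def similarity_search(query: set[int], library: dict[str, set[int]],
                      threshold: float) -> list[tuple[str, float]]:
    """Return (name, score) pairs with score >= threshold, best match first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Raising the threshold narrows the result list toward near-identical structures, while lowering it returns more distant analogues.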

Coding and Tracking
Details on the QR codes and barcodes used: Sample, Reaction, and Analysis objects are each assigned a UUID v4. The code is translated into a 40-digit sequence that can be displayed as a QR code (version 1 with error correction level L, or version 2 with error correction level Q) or, truncated to 10 digits, as a barcode of type Code 128C.
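One possible way to obtain such digit sequences is sketched below. It assumes that the 128-bit UUID is written as a decimal number and left-padded with zeros to 40 digits, and that the barcode uses the leading 10 digits; neither detail is specified above, and the function names are hypothetical. An even number of digits suits Code 128C, which encodes digits in pairs.

```python
import uuid


def uuid_to_digits(item_uuid: uuid.UUID, length: int = 40) -> str:
    """Decimal representation of the 128-bit UUID, zero-padded to 'length' digits.

    A 128-bit integer has at most 39 decimal digits, so 40 digits always suffice.
    """
    return str(item_uuid.int).zfill(length)


def barcode_digits(code: str, length: int = 10) -> str:
    """Truncated code for the barcode; taking the leading digits is an assumption."""
    return code[:length]


# Example: generate the codes for a new item
new_code = uuid_to_digits(uuid.uuid4())
short_code = barcode_digits(new_code)
```

The truncated code is no longer guaranteed to be unique, so it would have to be resolved against the full code stored with the item.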

Legend:
Upon creation of a new item, a QR code and a barcode are created for every sample and every new reaction. In addition, QR codes and barcodes are generated for each analysis that is planned in the ELN. Barcodes and QR codes can be printed easily. As of now, two sizes are offered to meet the requirements of bottle labeling and of sample labeling for GC vials or PCR tubes. The labels are generated as PDF (see description of 3 and 4) and can be printed, e.g., on a label printer (used in the demo version in the KIT labs). The printed label contains one QR code and one barcode assigned to the sample/reaction/analysis. The additionally printed sample ID facilitates the assignment of the label to the item that should be labeled.