- Open Access
The Open Spectral Database: an open platform for sharing and searching spectral data
Journal of Cheminformatics volume 8, Article number: 55 (2016)
A number of websites make spectral data available for download (typically as JCAMP-DX text files), and one (ChemSpider) also allows users to contribute spectral files. As a result, searching for and retrieving such spectral data can be time consuming, and the data is difficult to reuse if it is compressed in the JCAMP-DX file. What is needed is a single resource that allows submission of JCAMP-DX files, export of the raw data in multiple formats, and searching based on multiple chemical identifiers, and that is open in terms of license and access. To address these issues, a new online resource called the Open Spectral Database (OSDB) http://osdb.info/ has been developed and is now available. Built using open source tools, using open code (hosted on GitHub), providing open data, and open to community input about design and functionality, the OSDB is available for anyone to submit spectral data, making it searchable and available to the scientific community. This paper details the concept and coding, internal architecture, export formats, Representational State Transfer (REST) Application Programming Interface and options for submission of data.
The OSDB website went live in November 2015. Concurrently, the GitHub repository was made available at https://github.com/stuchalk/OSDB/, and is open for collaborators to join the project, submit issues, and contribute code.
The combination of a scripting environment (PHPStorm), a PHP Framework (CakePHP), a relational database (MySQL) and a code repository (GitHub) provides all the capabilities to easily develop REST based websites for ingestion, curation and exposure of open chemical data to the community at all levels. It is hoped this software stack (or equivalent ones in other scripting languages) will be leveraged to make more chemical data available for both humans and computers.
Tools to make research data freely available are vitally important to the open science movement. Such tools must play well with both humans and computers because of the importance of data import/export into other systems for analysis, verification, and data mining. One important data type in this area is instrumental spectra, used for identification and analysis in a variety of different application areas. Many websites (e.g. NIST Webbook, ChemSpider, University of the West Indies Chemistry) contain spectral files available in the current de facto data standard, the Joint Committee on Atomic and Molecular Physical Data - Data Exchange format (JCAMP-DX) [4–7], and this format can be exported from the majority of instrument software available today. However, the usefulness of spectral data in JCAMP-DX format is somewhat limited because the specification is over 30 years old and, if a file is saved using compression, the data is difficult to transfer to other software. Providing a mechanism to allow conversion of legacy data in JCAMP-DX format is an important activity in and of itself, as the community needs spectral data for comparison/standardization in many different applications.
The website has been developed using open-source software (as far as possible) and open standards, and is openly being made available using the GitHub code repository. The website is built in the Representational State Transfer (REST) style and has a documented Application Programming Interface (API) for computer-based discovery and export.
The foundation of the OSDB website is the common Apache, MySQL, and PHP software stack, which can be installed on any computer system as LAMP (for Linux), WAMP (for Windows) or MAMP (for OS X on Mac). Coding was done using the PHPStorm Integrated Development Environment (IDE) (free for faculty and students), and scripts are written in PHP using the CakePHP object-oriented framework. Because this is standard open-source software, developers creating data websites can deploy on their own physical server, publish using one of a number of online hosting sites, or use a virtual machine.
As the goal of the project was to develop a site that offered a standardized REST-style API, the PHP framework CakePHP was used to develop the code. CakePHP standardizes the development of PHP scripts by use of the model–view–controller (MVC) model, implemented using Object Oriented Programming (OOP). The MVC paradigm separates code into logical sections: the model, which accesses a database table; the view, which presents data as a web page; and the controller, the logic that coordinates the processing of a webpage request into a Hypertext Markup Language (HTML) document. As an example, if the user goes to the systems index page, the following code is executed (Fig. 1).
The function ‘index’ in the SystemsController.php file executes a set of PHP commands to present a list of current chemical systems in the database, and is one of a number of methods of the SystemsController class. It is executed by default when there is no action after “/systems” in the URL. In the first line the call to $this->System->find accesses the System ‘model’ (which accesses the ‘systems’ database table) and executes the find ‘method’ to retrieve all the systems and place the data in the $data variable. The $this->set command assigns the data returned to a variable called ‘data’ that will be available to the PHP code in the view file. For “/systems”, once the code in the controller has finished, CakePHP knows to then return the ‘index.ctp’ file (CakePHP template file) to the browser, as shown in Fig. 2.
Note that the function in Fig. 1 has a variable ($format) in the function arguments, and when no value is passed the default of an empty string is set. When $format is tested as being equal to an empty string, the $this->set command completes and the HTML file ‘index.ctp’ is rendered. However, if the URL “/systems/index/XML” is accessed, the same call is made except that the data is not sent to the view and is instead converted to an array, reformatted as XML ($this->Export->xml()) and passed to the browser. Note that the action ‘index’ must be included in the URL so that CakePHP does not try to run the action ‘XML’ (i.e. “/systems/XML” will cause an error).
The view file takes the list of all the systems in $data (the view variable), iterates through each one ($data is a PHP array, see Fig. 3a) and prints an HTML link for each as an item in an unordered list (“<ul>”), as shown in Fig. 3b.
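Since Fig. 1 is not reproduced here, the controller pattern described above can be illustrated with a minimal sketch. It is written in Python rather than the PHP/CakePHP used by the OSDB, and every name in it (SystemModel, find_all, to_xml, dispatch, the XML layout) is a hypothetical stand-in, not the project's actual code. It shows the two behaviours discussed: an optional format argument that switches between rendering a view and returning XML, and a default action of ‘index’ when the URL names none.

```python
from xml.etree.ElementTree import Element, SubElement, tostring


class SystemModel:
    """Stand-in for the System model; a real model would query the 'systems' table."""

    def __init__(self, rows):
        self._rows = rows

    def find_all(self):
        return list(self._rows)


def to_xml(rows):
    """Serialize a list of dicts as a small XML document (layout is an assumption)."""
    root = Element("systems")
    for row in rows:
        node = SubElement(root, "system", id=str(row["id"]))
        node.text = row["name"]
    return tostring(root, encoding="unicode")


class SystemsController:
    def __init__(self, model):
        self.model = model
        self.view_vars = {}  # variables handed to the view template

    def index(self, fmt=""):
        data = self.model.find_all()
        if fmt == "":
            # No format requested: expose the data to the view and render it.
            self.view_vars["data"] = data
            return ("render", "index")
        if fmt.upper() == "XML":
            # Format requested: bypass the view and return serialized data.
            return ("body", to_xml(data))
        raise ValueError(f"unsupported format: {fmt}")


def dispatch(controller, path):
    """Map '/systems[/action[/arg]]' onto a controller method, defaulting to 'index'."""
    parts = [p for p in path.strip("/").split("/") if p][1:]
    action = parts[0] if parts else "index"
    return getattr(controller, action)(*parts[1:])
```

With this dispatcher, “/systems” and “/systems/index/XML” both reach the index method, while “/systems/XML” fails because no method named ‘XML’ exists, mirroring the error case noted above.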
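The view step can be sketched in the same spirit. The snippet below is a Python illustration, not the ‘index.ctp’ template itself; the row keys (id, name) and the “/systems/view/<id>” link target are assumptions made for the example.

```python
from html import escape


def render_index(data):
    """Render a list of systems as an HTML unordered list of links."""
    items = [
        f'<li><a href="/systems/view/{row["id"]}">{escape(row["name"])}</a></li>'
        for row in data
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Escaping the name before it is written into the page keeps characters such as “&” or “<” in a system name from breaking the markup.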
Spectral file format
The Joint Committee on Atomic and Molecular Physical Data (JCAMP) published the specification for the Data eXchange format (DX) for spectral data for UV/Vis and IR, MS, IMS, NMR, ESR, and CD data. JCAMP-DX files are ASCII text files populated with LABELLED-DATA-RECORDs or LDRs. These are defined to allow reporting of spectral metadata and raw/processed instrument data. The instrument data is reported as XY pairs, nominally in tabular format where one line contains a starting X value and a number of Y values at equally spaced X positions; the X value of each can be calculated using the LDR DELTAX. In addition, because the format was developed when disk space was at a premium, the data can be reported in a number of compressed formats, referred to as ASCII Squeezed Difference Form (ASDF), that use letters and symbols to encode data in more compact formats (Table 1).
Table 2 shows data in normal fixed format and equivalent storage in four compression formats.
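As Tables 1 and 2 are not reproduced here, the following Python sketch illustrates how a single ASDF line might be expanded. The character tables follow the published JCAMP-DX specification (squeezed form, difference form, and duplicate suppression); the function names are hypothetical, and the sketch deliberately ignores multi-line details such as the repeated Y check value that ends each line in difference form. It is not the OSDB plugin.

```python
import re

# Squeezed (SQZ) digits: '@' = 0, 'A'..'I' = +1..+9, 'a'..'i' = -1..-9
SQZ = {c: str(i) for i, c in enumerate("@ABCDEFGHI")}
SQZ.update({c: "-" + str(i) for i, c in enumerate("abcdefghi", start=1)})
# Difference (DIF) digits: '%' = 0, 'J'..'R' = +1..+9, 'j'..'r' = -1..-9
DIF = {c: str(i) for i, c in enumerate("%JKLMNOPQR")}
DIF.update({c: "-" + str(i) for i, c in enumerate("jklmnopqr", start=1)})
# Duplicate (DUP) counts: 'S'..'Z' = 1..8, 's' = 9 (the count includes the original)
DUP = {c: str(i) for i, c in enumerate("STUVWXYZs", start=1)}

TOKEN = re.compile(r"[@A-Ia-i%J-Rj-rS-Zs]\d*\.?\d*|[+-]?\d+\.?\d*")


def decode_asdf_line(line):
    """Return (first_x, [y values]) for one line of ASDF-encoded data."""
    tokens = TOKEN.findall(line)
    first_x = float(tokens[0])
    ys = []
    step = 0.0  # difference to re-apply when a DUP token follows
    for tok in tokens[1:]:
        head, rest = tok[0], tok[1:]
        if head in SQZ:            # absolute value, squeezed
            ys.append(float(SQZ[head] + rest))
            step = 0.0
        elif head in DIF:          # difference from the previous value
            step = float(DIF[head] + rest)
            ys.append(ys[-1] + step)
        elif head in DUP:          # repeat the previous value or difference
            for _ in range(int(DUP[head] + rest) - 1):
                ys.append(ys[-1] + step)
        else:                      # plain number
            ys.append(float(tok))
            step = 0.0
    return first_x, ys


def x_values(first_x, delta_x, count):
    """X positions for the decoded Y values, spaced by the DELTAX value."""
    return [first_x + i * delta_x for i in range(count)]
```

For example, the line “100 A0J2T” decodes to a first X of 100 and Y values 10, 22 and 34: “A0” is the squeezed value 10, “J2” adds a difference of 12, and “T” repeats that difference once more.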
When a JCAMP file is uploaded to the OSDB website, a record is added to the database that contains the file metadata. A unique id is generated and the file is saved with the id as its filename (available at id.jdx). The file is then read into a variable in PHP and processed line by line using a JCAMP plugin written in our laboratory. The plugin processes the file using the following seven steps:
Clean: remove non-ASCII characters and extra spaces at the start/end of lines
Uncomment: remove (and save) comments (indicated by $$)
Get LDRs: detect LDRs in the file
Validate: check the LDRs to identify if the file is valid JCAMP-DX
Standardize: standardize data in certain LDR fields
Decompress: expand data in any of the ASDF formats and calculate the respective X values
The metadata, data, comments and any processing errors are stored in an array in PHP and then converted to XML and saved. Figures 4 and 5 show a comparison of the original JCAMP file (e.g. “../spectra/view/000000115/JCAMP”) and JCAMP saved in XML (e.g. “../spectra/view/000000115/XML”).
Spectral data in the JCAMP file is stored both in its <raw> state and in the <pro>(cessed) expanded format. Any discrepancies between the data in the original JCAMP file and the processed data are annotated in the errors element, with details of the issues.
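The first stages of the processing described above (clean, uncomment, get LDRs, and a basic validity check) can be sketched as follows. This is a Python illustration under stated assumptions, not the OSDB JCAMP plugin: the function names, the returned dictionary layout, and the choice of TITLE, JCAMP-DX and END as the minimum required labels are all assumptions made for the example. Label normalization (ignoring case, spaces, hyphens, slashes and underscores) follows the JCAMP-DX convention for comparing labels.

```python
import re

REQUIRED = ("TITLE", "JCAMPDX", "END")  # assumed minimal validity check


def normalize_label(label):
    """Compare labels ignoring case, spaces, '-', '/' and '_'."""
    return re.sub(r"[\s\-/_]", "", label).upper()


def parse_jcamp(text):
    ldrs, comments, errors = {}, [], []
    current = None
    for raw in text.splitlines():
        # Clean: drop non-ASCII characters and surrounding whitespace.
        line = raw.encode("ascii", "ignore").decode("ascii").strip()
        # Uncomment: everything after '$$' is a comment; keep it separately.
        if "$$" in line:
            line, comment = line.split("$$", 1)
            comments.append(comment.strip())
            line = line.strip()
        if not line:
            continue
        # Get LDRs: '##LABEL=value' starts a record; other lines continue it.
        match = re.match(r"##([^=]+)=(.*)", line)
        if match:
            current = normalize_label(match.group(1))
            ldrs[current] = match.group(2).strip()
        elif current is not None:
            ldrs[current] = (ldrs[current] + "\n" + line).strip()
    # Validate: record missing labels instead of raising, so they can be saved.
    for label in REQUIRED:
        if label not in ldrs:
            errors.append(f"missing LDR: {label}")
    return {"ldrs": ldrs, "comments": comments, "errors": errors}
```

Collecting problems in an errors list rather than stopping at the first one matches the approach described above, where processing errors are stored alongside the metadata and data.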
Graphical user interface
The website allows users to access the data in the system via endpoints (MVC controller) for:
The default pages accessed via “/<controllername>” URLs all provide an index of available resources of that type. For instance, Fig. 7 shows the spectra index, with spectra organized by compound, with a JSmol view of the structure, a link to the compound page, and external links to more data about the compound.
In addition to the index view for spectra, users can search for a specific compound using the search box at the top of the page. The search is performed over the ‘identifier’ database table, which is populated from the PubChem PUG REST interface and contains names, SMILES, PubChem CIDs, InChI strings and InChIKeys.
Clicking on the name of a compound on the compound index page brings up a summary page for the compound with the JSmol molecular view, metadata, and links to the spectra for that compound. Figure 8 shows the page for aspirin. Note the external links to view the chemical on PubChem and Wikidata, a free linked database that underpins a number of websites including Wikipedia.
Clicking on a spectral link (under Systems and Spectra) brings up the view of the spectrum (see Fig. 9 for the MS spectrum of Aspirin). Metadata about the spectrum is available on the left side by clicking each of the four buttons. The spectrum can also be downloaded in JCAMP, XML, and SciData (JSON-LD) formats by clicking the respective icon.
To contribute a spectrum to the OSDB, users first sign up for an account under “My OSDB” (top right) and then click on “Add Spectra” on the top menu bar. The form for upload of JCAMP files (Fig. 10a) can be used to upload a file from a local drive or a web address. When entering the compound name, the form searches for and displays (Fig. 10b) the existing compounds in the system, and clicking on one of the names found selects that compound. Users uploading local files can add as many as they like by clicking the “Add another file” button (Fig. 10c).
For computer access to all of the above functionality (except file upload), a REST API is available and documented on the OSDB website (Fig. 11), built using the widely popular Swagger API framework. As an example, for spectral data the API allows access to the files via a number of identifiers: OSDB ID, Splash, and compound name and technique code (comp|tech). It is also possible to access just the plot of the spectrum, which is useful for embedding the data into another website.
The basic REST website provides a mechanism to add data to the repository, access it in a standardized way and download the data in multiple formats. However, the key to making the data truly accessible is integration with other platforms and expanded search capabilities. These features have been added to the OSDB website through the following additions.
PubChem lookup for chemical metadata
When a new compound is entered on the spectra upload page and a new spectrum uploaded, the system checks for the compound in the current system and, if it is not found, searches PubChem using the Power User Gateway REST API. PubChem allows extensive searching of the data and metadata the system holds via the API, which has a myriad of options and the generalized URL:
“https://pubchem.ncbi.nlm.nih.gov/rest/pug/<input specification>/<operation specification>/[<output specification>][?<operation_options>]”
As an example of using this API to gather data about compounds, users submit the compound name along with the spectral data to the OSDB. Figure 12 shows a PHP function written to allow the system to retrieve the PubChem CID for the compound entered, which is subsequently used to retrieve the identifier data mentioned earlier.
The function cid has two arguments, $name and $debug (used to check that the code is working correctly). First, access to the CakePHP HttpSocket is established, and the URL is constructed from the base PubChem API address, the compound name ($name), and ‘/synonym/JSON’. The URL is requested (equivalent to a web browser) and the resulting JSON data converted to a PHP array $syns. The code then checks for errors in the response and retrieves the CID for the compound. The value of the CID is then returned to the calling function. Other functions in the Chemical class return all the synonyms for a compound, and property data for a compound.
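Because Fig. 12 is not reproduced here, the lookup can be sketched in Python. This is an illustration, not the author's PHP function: the names synonyms_url and cid_from_synonyms are hypothetical, and the HTTP request itself is left out so that the example stays self-contained. The URL layout and the response keys (InformationList, Information, CID, Fault) follow the PUG REST documentation, where the operation is spelled ‘synonyms’.

```python
from urllib.parse import quote

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def synonyms_url(name):
    """Build the PUG REST URL that returns synonyms (and the CID) for a compound name."""
    return f"{PUG_BASE}/compound/name/{quote(name, safe='')}/synonyms/JSON"


def cid_from_synonyms(payload):
    """Extract the first CID from a decoded PUG REST JSON response, or None."""
    if "Fault" in payload:  # PubChem reports errors under a 'Fault' key
        return None
    info = payload.get("InformationList", {}).get("Information", [])
    return info[0].get("CID") if info else None
```

In use, the URL would be fetched with any HTTP client, the body decoded from JSON, and the result passed to cid_from_synonyms. Note that PubChem typically answers an unknown name with an HTTP error status as well as a Fault body, so a real client also needs to handle that case.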
Retrieve Wikidata ID
Similar to the PubChem example, a function was written to search the Wikidata website, this time using a SPARQL query encoded in a URL (Fig. 13). Three separate searches are coded to retrieve the Wikidata ID via InChIKey, SMILES, or PubChem CID. If the script calling this function tries all three approaches and does not get an ID, it assumes that the compound is not in the Wikidata database.
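The three-way fallback described above can be sketched as follows. This Python example is an illustration, not the code in Fig. 13: the function names are hypothetical, and the caller supplies the fetch function that performs the HTTP request and decodes the JSON. The endpoint is the public Wikidata SPARQL service, and the property numbers used are the Wikidata properties for InChIKey (P235), canonical SMILES (P233) and PubChem CID (P662).

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
PROPERTIES = {"inchikey": "P235", "smiles": "P233", "cid": "P662"}


def wikidata_query_url(kind, value):
    """Encode a SPARQL query for one identifier type into a request URL."""
    prop = PROPERTIES[kind]
    # Escape backslashes and quotes so the value is a valid SPARQL string literal.
    literal = value.replace("\\", "\\\\").replace('"', '\\"')
    query = f'SELECT ?item WHERE {{ ?item wdt:{prop} "{literal}" }} LIMIT 1'
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def wikidata_id(identifiers, fetch):
    """Try InChIKey, then SMILES, then CID; return the first Wikidata ID found."""
    for kind in ("inchikey", "smiles", "cid"):
        value = identifiers.get(kind)
        if not value:
            continue
        bindings = fetch(wikidata_query_url(kind, str(value)))["results"]["bindings"]
        if bindings:
            # The item is returned as an entity URI; the ID is its last segment.
            return bindings[0]["item"]["value"].rsplit("/", 1)[-1]
    return None  # none of the searches matched
```

Passing the fetch function in as an argument keeps the lookup logic testable without network access and leaves the choice of HTTP client to the caller.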
Generate the Splash for a spectrum
A recent addition to the identifier scene is the ‘Spectral Hash’ or Splash. This identifier evolved out of work started at the 2015 Metabolomics Hackathon, where participants became enthusiastic about unique spectral identifiers similar to the InChIKey. In order to generate a Splash for a spectrum, the spectral data is encoded in a JSON object and then sent to the Splash website. The code in Fig. 14 does just that.
The OSDB website, as outlined above, provides access to spectral data and its metadata in a standardized way. However, it is important to point out that what can be done with the data is up to the user. This applies to the OSDB website as well as after spectra have been downloaded. For instance, the website does not currently allow for searching the raw data/metadata across all spectra (all of it is in the database but can only be found searching for a complete spectrum).
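The encoding step can be illustrated with a short Python sketch. Fig. 14 is not reproduced here, so this is not the author's code, and the field names used below (ions, mass, intensity, type) are assumptions made for illustration; the Splash service documentation should be consulted for the schema it actually expects. The request that sends the payload to the service is left out.

```python
import json


def splash_payload(peaks, spectrum_type="MS"):
    """Encode (mass, intensity) pairs as a JSON string (field names are assumed)."""
    ions = [{"mass": float(m), "intensity": float(i)} for m, i in peaks]
    return json.dumps({"ions": ions, "type": spectrum_type})
```

The resulting string would then be sent as the body of an HTTP POST request, and the response would contain the Splash to store alongside the spectrum.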
In order to make this site truly useful the code and data of the project should be made openly available. In this way the user is not limited to the functionality that the original developers envisioned but can develop their own functions/features, enhance the integration of the site, and output the data in new formats for new web or mobile applications. In addition, the openness of the project means it can be used in education as a tool to develop the next generation of cheminformaticians—potentially building their own website from the source code as a course project.
For all these reasons (and many more) the project is available as a free download on GitHub. GitHub is a hosting service for the well-respected Git source code repository system. Git allows multiple developers to write code for one project and centrally coordinate version control, patching, extension and attribution. GitHub does this through a website and adds features like issue tracking, collaborative (discussion-based) code review, and team management. Anyone can download the code, work on an enhancement or issue, submit updates, fix issues, and discuss project goals, timelines, and features. The basic site has been built, and users can let the developers (that’s all of us) know what needs to be added, changed or removed, and implement it themselves. Readers are encouraged to check out the ‘Projects’ page for ideas on additional features/enhancements that they could work on.
This paper describes a new project to support open spectral research data on the web. Anyone can contribute to the content, to the code, to the concept, or to the management/vision. This paper also outlines the components needed to put together such a project and it can be used as a template to build other websites with different functionality and/or different types of chemical data.
The current version of the OSDB is just a starting point. There are many additional features one can envision for the site and it is a hope that the reader has ideas of their own and adds them. Open source code has become a mainstay in the computing world. With the tools, concepts and frameworks outlined in this paper, open source research data will hopefully become a mainstay of the scientific community.
Abbreviations
API: Application Programming Interface
ASDF: ASCII Squeezed Difference Form
CSS: Cascading Style Sheet
GUI: graphical user interface
HTML: Hypertext Markup Language
LAMP: Linux, Apache, MySQL, and PHP
JCAMP-DX: Joint Committee on Atomic and Molecular Physical Data - Data Exchange
LDR: Labelled Data Record
MAMP: Mac, Apache, MySQL, and PHP
NMR: nuclear magnetic resonance
OOP: object oriented programming
OSDB: Open Spectral Database
OWL: web ontology language
REST: representational state transfer
SciData: scientific data model
SMILES: simplified molecular-input line-entry system
SPARQL: SPARQL protocol and RDF query language
SQL: structured query language
URI: uniform resource identifier
WAMP: Windows, Apache, MySQL, and PHP
XML: extensible markup language
NIST Materials Measurement Laboratory (2016) NIST chemistry WebBook. National Institute for Standards and Technology, Gaithersburg. http://webbook.nist.gov/. Accessed 19 July 2016
Williams T, Tkachenko V (2016) ChemSpider. Royal Society of Chemistry, Cambridge. http://www.chemspider.com/. Accessed 19 July 2016
Lancashire R (2016) JCAMP-DX sample files. The University of the West Indies, Mona. http://wwwchem.uwimona.edu.jm/spectra/index.html. Accessed 19 July 2016
IUPAC (2016) IUPAC subcommittee on electronic data standards. http://jcamp-dx.org/. Accessed 1 Mar 2016
Davies AN, Lampen P (1993) JCAMP-DX for NMR. Appl Spectrosc. doi:10.1366/0003702934067874
Grasselli JG (1991) JCAMP-DX, a standard format for exchange of infrared-spectra in computer readable form. Pure Appl Chem. doi:10.1351/pac199163121781
Lampen P, Hillig H, Davies AN, Linscheid M (1994) JCAMP-DX for mass-spectrometry. Appl Spectrosc. doi:10.1366/0003702944027840
Chalk S (2016) The Open Spectral Database. University of North Florida, Jacksonville. http://osdb.info/. Accessed 1 Mar 2016
Sporny M, Longley D, Kellogg G, Lanthaler M, Lindström N (2016) JSON-LD 1.0—a JSON-based Serialization for Linked Data. The World Wide Web Consortium. http://www.w3.org/TR/json-ld/. Accessed 1 Mar 2016
Chalk S (2016) SciData—a scientific data model. University of North Florida, Jacksonville. http://stuchalk.github.io/scidata/. Accessed 1 Mar 2016
Fielding RT, Taylor RN (2002) Principled design of the modern Web architecture. ACM Trans Internet Technol 2(2):115–150. doi:10.1145/514183.514185
Mann A (2014) What’s an API? A beginner’s guide to the application programming interface. http://www.slideshare.net/CAinc/whats-an-api-a-beginners-guide-to-the-application-programming-interface. Accessed 23 June 2016
ASF (2016) The Apache HTTP server project. The Apache Software Foundation (ASF), Forest Hill. http://httpd.apache.org/. Accessed 1 Mar 2016
Oracle (2016) MySQL open-source database. Oracle Corporation. http://www.mysql.com/. Accessed 1 Mar 2016
The PHP Group (2016) PHP: hypertext preprocessor. The PHP Group. http://php.net/. Accessed 1 Mar 2016
JetBrains (2016) PHPStorm. PHP IDE. https://www.jetbrains.com/phpstorm/. Accessed 1 Mar 2016
CSF (2016) CakePHP: the rapid PHP development framework. Cake Software Foundation (CSF). http://cakephp.org/. Accessed 1 Mar 2016
Fowler M (2006) GUI architectures: model–view–controller. ModelViewController. http://martinfowler.com/eaaDev/uiArchs.html. Accessed 23 June 2016
Oracle (2015) Lesson: object-oriented programming concepts. https://docs.oracle.com/javase/tutorial/java/concepts/. Accessed 23 June 2016
Chalk S (2016) The Open Spectral Database—system index. University of North Florida, Jacksonville. http://osdb.info/systems. Accessed 1 Mar 2016
Baumbach JI, Davies AN, Lampen P, Schmidt H (2001) JCAMP-DX. A standard format for the exchange of ion mobility spectrometry data (IUPAC recommendations 2001). Pure Appl Chem. doi:10.1351/pac200173111765
Cammack R, Fann Y, Lancashire RJ, Maher JP, McIntyre PS, Morse R (2006) JCAMP-DX for electron magnetic resonance (EMR). Pure Appl Chem. doi:10.1351/pac200678030613
Woollett B, Klose D, Cammack R, Janes RW, Wallace BA (2012) JCAMP-DX for circular dichroism spectra and metadata (IUPAC Recommendations 2012). Pure Appl Chem. doi:10.1351/PAC-REC-12-02-03
Chalk S (2016) SciData: a data model and ontology for semantic representation of scientific data. J Cheminform
BCT (2016) Bootstrap. Bootstrap Core Team. http://getbootstrap.com/. Accessed 1 Mar 2016
Çelik T, Lilley C, Baron LD, Pemberton S, Pettit B (2016) Cascading Style Sheets working group. The World Wide Web Consortium. https://www.w3.org/TR/css3-color/. Accessed 1 Mar 2016
Hickson I, Berjon R, Faulkner S, Leithead T, Doyle Navara E, O’Connor E, Pfeiffer S (2016) HTML5: a vocabulary and associated APIs for HTML and XHTML. The World Wide Web Consortium. http://www.w3.org/TR/html5/. Accessed 1 Mar 2016
Chalk S (2016) The Open Spectral Database—spectra index. University of North Florida, Jacksonville. http://osdb.info/spectra. Accessed 1 Mar 2016
Chalk S (2016) The Open Spectral Database—compound index. University of North Florida, Jacksonville. http://osdb.info/compounds. Accessed 1 Mar 2016
Chalk S (2016) The Open Spectral Database—analytical technique index. University of North Florida, Jacksonville. http://osdb.info/techniques. Accessed 1 Mar 2016
Chalk S (2016) The Open Spectral Database—collection index. University of North Florida, Jacksonville. http://osdb.info/collections. Accessed 1 Mar 2016
NLM (2016) PubChem Power User Gateway (PUG) REST interface documentation. https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html. Accessed 1 Mar 2016
NLM (2016) PubChem. National Institutes of Health, Bethesda. https://pubchem.ncbi.nlm.nih.gov/. Accessed 19 July 2016
WikiMedia (2016) Wikidata. https://www.wikidata.org/. Accessed 1 Mar 2016
Chalk S (2016) The Open Spectral Database—API. University of North Florida, Jacksonville. http://osdb.info/api. Accessed 1 Mar 2016
SmartBear (2016) Swagger API framework. Open API Initiative (OAI). http://swagger.io/. Accessed 19 July 2016
UC Davis (2016) Splash—the spectral hash identifier. http://splash.fiehnlab.ucdavis.edu/. Accessed 1 Mar 2016
Chalk S (2016) Open Spectral Database Github repository. GitHub Inc, San Francisco. https://github.com/stuchalk/OSDB/. Accessed 1 Mar 2016
Harris S, Seaborne A (2016) SPARQL query language for RDF. The World Wide Web Consortium. https://www.w3.org/TR/sparql11-query/. Accessed 19 July 2016
Metabolomics Society (2016) Metabolomics Hackathon. Metabolomics Society. http://metabolomics2015.org/index.php/program/hackathon. Accessed 1 Mar 2016
SFC (2016) Git distributed version control system. Software Freedom Conservancy. https://git-scm.com/. Accessed 1 Mar 2016
Chalk S (2016) The Open Spectral Database—projects. University of North Florida, Jacksonville. http://osdb.info/pages/projects. Accessed 1 Mar 2016
Thanks to Tony Williams for encouraging me to start this project. Thanks to J. C. Bradley for pioneering open science and paving the way for projects like this to be conceptualized.
The author declares that he has no competing interests.
Availability of data and materials
All data associated with this project is available at https://github.com/stuchalk/OSDB.