DNAmod: the DNA modification database

Abstract

Covalent DNA modifications, such as 5-methylcytosine (5mC), are increasingly the focus of numerous research programs. In eukaryotes, both 5mC and 5-hydroxymethylcytosine (5hmC) are now recognized as stable epigenetic marks, with diverse functions. Bacteria, archaea, and viruses contain various other modified DNA nucleobases. Numerous databases describe RNA and histone modifications, but no database specifically catalogues DNA modifications, despite their broad importance in epigenetic regulation. To address this need, we have developed DNAmod: the DNA modification database. DNAmod is an open-source database (https://dnamod.hoffmanlab.org) that catalogues DNA modifications and provides a single source to learn about their properties. DNAmod provides a web interface to easily browse and search through these modifications. The database annotates the chemical properties and structures of all curated modified DNA bases, and of a much larger list of candidate chemical entities. DNAmod includes manual annotations of available sequencing methods and descriptions of the modifications’ occurrence in nature, and provides existing and suggested nomenclature. DNAmod enables researchers to rapidly review previous work, select mapping techniques, and track recent developments concerning modified bases of interest.

Introduction

A rapidly growing body of research is continuing to reveal numerous gene-regulatory effects of covalent DNA modifications, such as 5‑methylcytosine (5mC). We now recognize 5mC as a stable epigenetic mark and as having diverse functions beyond transcriptional repression [12]. An increasing number of studies demonstrate the importance of other cytosine modifications, such as 5‑hydroxymethylcytosine (5hmC), 5‑formylcytosine (5fC), and 5‑carboxylcytosine (5caC) [3, 9, 26, 43, 46]. More recently, three analogous modifications of thymine were found to occur in mammals [38, 53] and can now largely be sequenced [19]. N6‑methyladenine, previously thought to mainly occur as an RNA modification in eukaryotes, has now been found in the DNA of multiple eukaryotes [24]. Bacteria, archaea, and especially bacteriophages have long been known to harbor a diverse array of modified bases [18, 51]. Their genomes can also have hypermodified bases—modified DNA bases that substitute for the unmodified base in many positions genome-wide [17, 51].

Multiple databases profile RNA modifications [4, 8, 54] and human histone modifications [56], but no database catalogues DNA modifications systematically. Some databases include particular classes of DNA modifications [44]. These include restriction endonucleases and DNA methyltransferases in REBASE [41]; methylation databases, like MethDB [1]; databases including DNA metabolic pathways, such as KEGG [27]; and those focused on DNA damage and repair, like REPAIRtoire [31].

Since DNA modifications are a key aspect of epigenetic regulation, there is a pressing need to organize them in a single location. We have accordingly created DNAmod: the DNA modification database (https://dnamod.hoffmanlab.org). DNAmod is the first database to comprehensively catalogue DNA modifications and provides a single resource to launch an investigation of their properties.

Database construction and visualization

DNAmod consists of two components: a relational database back-end and a web interface front-end. We used the Chemical Entities of Biological Interest (ChEBI) database [13, 22] to seed the DNAmod database. We imported a nucleobase-related subset of ChEBI, consisting of chemical entities and related annotations. We performed queries against the entities to construct a set of candidate DNA modifications for DNAmod, retaining most of these as a separate unverified set. Then, we filtered candidate entities into a manually curated set of verified DNA modifications, augmenting them with modification-specific annotations.

The web interface front-end allows users to either search or browse through the catalogue of DNA modifications, integrating ChEBI’s information with our own.

Identifying candidate DNA modifications from ChEBI

DNAmod leverages ChEBI [22] to define a set of modified DNA candidates for inclusion and to add preliminary information for each candidate. ChEBI is a database of small biologically relevant molecules, which affect living organisms. We queried ChEBI via ChEBI Web Services [22]. We used Biopython [10] and the Python Simple Object Access Protocol (SOAP) client, suds [35], to query ChEBI and construct the DNAmod database.

ChEBI provides an ontology which encodes the relationships between its compounds. We used this ontology to precisely define the notion of parents and children, which we used to hierarchically retrieve and display modifications. We used two kinds of relationships for this purpose, both of which have associated symbols, defined by ChEBI [13]: \(\mathcal {F}\) has functional parent and \(\triangle \) is a. We used these relationships to find candidate DNA modifications, by identifying entities related to the core nucleobases, which we represent by their symbols: {A,C,G,T,U}. We included uracil, since many of its descendants in the ontology are modifications of thymine (CHEBI:17821, which is equivalent to 5-methyluracil), and are not annotated as descendants of thymine itself. For each of these bases, we imported all entities that are annotated in the ontology as a child of one of these bases, via the \(\mathcal {F}\) has functional parent relationship. ChEBI ranks entities based on their degree of curation. We only imported entities with the highest rating—three stars—indicating manual curation by ChEBI. Whenever possible, we only included entities as nitrogenous bases (nucleobases). If ChEBI did not have the nucleobase, we then selected the nucleoside form and finally, if necessary, the nucleotide. These imported bases formed the candidate set of modifications (the unverified set), from which we created a curated set of DNA modifications (the verified set).
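The candidate selection described above can be sketched as follows. This is a minimal illustration over toy records, not the DNAmod pipeline itself: the `Entity` class, the `candidate_set` function, and the star ratings and parent links in the toy data are all assumptions made for the example, and no ChEBI query is performed.

```python
from dataclasses import dataclass

# Symbols for the unmodified nucleobases, including uracil.
CORE_BASES = {"A", "C", "G", "T", "U"}


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for an imported ChEBI entity."""
    name: str
    stars: int                     # curation rating; 3 means manually curated
    functional_parents: frozenset  # direct "has functional parent" targets


def candidate_set(entities):
    """Names of three-star entities with a core base as functional parent."""
    return {
        e.name
        for e in entities
        if e.stars == 3 and e.functional_parents & CORE_BASES
    }


# Toy records: ratings and links are illustrative, not real ChEBI data.
toy = [
    Entity("methyladenine", 3, frozenset({"A"})),
    Entity("5-methylcytosine", 3, frozenset({"C"})),
    Entity("5-hydroxymethyluracil", 3, frozenset({"U"})),
    Entity("hypothetical-two-star-entry", 2, frozenset({"G"})),
    # Linked to its parent by "is a" only, so not a first-level candidate.
    Entity("3-methyladenine", 3, frozenset()),
]

candidates = candidate_set(toy)
print(sorted(candidates))
```

The entry linked only through an is a relationship is deliberately left out here; it would be picked up later, when descendants of verified bases are imported.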

The ChEBI ontology does not generally encode \(\mathcal {F}\) has functional parent relationships for nucleobases beyond the children of the unmodified nucleobases. It instead encodes modified nucleobases with an \(\triangle \) is a relationship to their parent base. This is because descendant entities of specific modifications are generally subtypes of the class of modifications from which they originate. For example, 3-methyladenine \(\triangle \) is a methyladenine. Methyladenine, however, \(\mathcal {F}\) has functional parent adenine, since it is conceived of as possessing adenine as a characteristic group and as being derived via functional modification [13]. We therefore need to use both of these relationships, within the ChEBI ontology, to accurately capture the full nucleobase hierarchy.

ChEBI also provides selected citations, associated with some of its entities. We retrieved the citations from ChEBI as PubMed IDs [32]. We used the Biopython [10] package Bio.Entrez to query the PubMed citation database, using NCBI’s Entrez Programming Utilities [32]. We retrieved the details of each citation and used them to construct a formatted citation. We currently support only publications indexed in PubMed.

Manual curation and annotation

We manually created and defined a whitelist, which contains our curated (or verified) set of candidates that we deem DNA modifications. For each of the bases enumerated in our whitelist, we also imported all descendants with an eventual \(\mathcal {F}\) has functional parent or \(\triangle \) is a relationship with any of the members of the verified set. We expanded the verified set to include any bases recursively imported in this manner, since they were children of verified DNA nucleobases. We also manually created and defined a distinct blacklist, which contains compounds that we deem to not be DNA modifications, also excluding any of their descendant compounds. Therefore, our above verification rule has the exception that it excludes any bases with an ancestor in our blacklist.

We can formalize the above description of bases imported from the ChEBI ontology [13] and subsequent filtering as follows. Let \(a\mathbin{\mathcal {F}}\,b\) specify that a has the \(\mathcal {F}\) has functional parent relationship with b. The definition of \(\mathcal {F}\) is transitive: for all n entities, \(l_{i}\), for \(i = 0\) to \(n - 1\), between a and b,

$$\begin{aligned} a\mathbin{\mathcal {F}}\,b \iff \bigl ( a\mathbin{\mathcal {F}}\,l_{n - 1} \bigr ) \wedge \bigl ( l_{i}\mathbin{ \mathcal {F}}\,l_{i - 1} \mathord {\forall } i \in \left( 0, n\right) \bigr ) \wedge \bigl ( l_{0}\mathbin{\mathcal {F}}\,b \bigr ). \end{aligned}$$

The analogous definitions hold for \(\triangle \).

We call each \(l_{i}\) a child of \(l_{i - 1}\) and call each \(l_{i - 1}\) a parent of \(l_{i}\). We refer to a as a descendant of b and refer to b as an ancestor of a. Let \(\mathcal {C}\) represent the first level of children of the unmodified nucleobases, such that \(\mathcal {C} = \left\{ x \mid x\mathbin{{\mathcal {F}}}\,y, y \in \{\tt{A, C, G, T, U}\} \right\} \). Let \(\mathcal {V} \subset \mathcal {C}\) represent the manually-annotated, verified proper subset of \(\mathcal {C}\).

We manually curated a blacklist of excluded entities, \(\mathcal {B}\), satisfying: \(\mathcal {B} \subseteq \left\{ b\mid\left( b\mathbin{{\mathcal {F}}}\,p \vee b \mathbin{\triangle} p \right) , p \in \mathcal {V} \right\} \). We imported the set of verified DNA modifications, \(\mathcal {M}\), defined in set-builder notation with predicates, as:

$$\begin{aligned} \mathcal {M}=\, {} \mathcal {V}\, \cup\, &\left\{ z \mid \left( \exists v\, {\in } \mathcal {V} \right) \left( \forall\, b\, {\in }\, \mathcal {B} \right) \right. \\&\left. \left[ \left( z\mathbin{{\mathcal {F}}}\,v \vee z \mathbin{\triangle} v \right) \wedge \lnot \left( z\mathbin{{\mathcal {F}}}\,b \vee z \mathbin{\triangle} b \right) \right] \right\} . \end{aligned}$$
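One way to read this definition operationally is sketched below. It is only an illustration under stated assumptions: the function names `ancestors` and `verified_modifications` are hypothetical, both relationship types are merged into a single child-to-parents map (so chains mixing the two relations count as descent), and blacklisted entities themselves are dropped along with their descendants, following the prose description of the blacklist.

```python
def ancestors(entity, parents):
    """All entities reachable from `entity` by following parent links upward."""
    seen = set()
    stack = list(parents.get(entity, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def verified_modifications(parents, verified, blacklist):
    """V, plus descendants of V that are neither blacklisted nor
    descended from a blacklisted entity."""
    result = set(verified)
    for entity in parents:
        if entity in blacklist:
            continue
        anc = ancestors(entity, parents)
        if anc & verified and not (anc & blacklist):
            result.add(entity)
    return result


# Toy ontology: child -> direct parents (either relationship type).
# The "hypothetical-*" entities are invented for illustration.
parents = {
    "methyladenine": {"A"},
    "3-methyladenine": {"methyladenine"},
    "hypothetical-adduct": {"methyladenine"},
    "hypothetical-adduct-derivative": {"hypothetical-adduct"},
}
verified = {"methyladenine"}
blacklist = {"hypothetical-adduct"}

modifications = verified_modifications(parents, verified, blacklist)
print(sorted(modifications))
```

In this toy case, 3-methyladenine joins the verified set through its verified parent, while the blacklisted entity and its derivative are both excluded.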

Finally, we added a small number of bases manually, that do not have any of the DNA bases or uracil as a parent in their ontology, but are nonetheless notable modified bases, such as 2′-deoxyinosine.

We additionally provided two kinds of manual annotations: sequencing techniques and occurrence in nature, for each modified DNA base. We surveyed the literature of sequencing methods for covalent DNA modifications [6, 29, 37, 39, 45], and annotated the available methods for each base, providing curated citations. These annotations include the method’s name, our categorizations of the basis for the method (such as chemical conversion), its resolution, and any further qualifier (Table 1A). Qualifiers include limitations (such as applicability to only some genomic regions), enrichment methods, and advantages (such as optimization for single-cell sequencing). We considered any method which involves affinity-based recognition of targets to be of “low” resolution [5]. These methods can also suffer from low specificity or antibody cross-reactivity [6]. Conversely, we annotated any methods based principally upon the detection of a chemically converted modification as “high” resolution. This generally reflects the resulting resolution of the method’s output data and often corresponds to the necessity to bin genomic regions during downstream analyses of the detected analyte.

For each modified base, we investigated if it had been previously reported to occur in vivo. This included any endogenous occurrences, as well as those stimulated exogenously, such as from exposure to an environmental toxin. We annotated any modification observed in vivo as “natural”. We additionally provided non-exhaustive examples of some organisms in which the modifications have been reported. We based these annotations on our ability to find evidence of in vivo occurrence, as opposed to publications describing only the synthesis or physicochemical properties of a nucleobase. For each of these annotations, we also briefly annotated a primary biological function, if known (Table 1B). For any modification not observed in vivo, we annotated it as “synthetic” and listed a reference pertaining to its synthesis or in which the synthetic base was used.

We entered these annotations in two annotation source files (Table 1), which we later imported into our database. This decoupled them from the rest of our pipeline and allows outside experts to submit additions without requiring knowledge of our pipeline or programming workflow.

Table 1 Possible annotations within DNAmod’s curated (A) sequencing method data and (B) natural occurrence information

DNAmod integrates manually-curated nomenclature, including the name and abbreviation deemed most consistent and in common use [9, 11, 28]. We additionally provide recommendations for one-letter symbols of selected modified bases, and in some instances for their base-pairing complements, as previously described [49]. The DNAmod web interface displays recommended notation in an organized table (Fig. 1).

Fig. 1 Manually-curated recommended notation, mapping techniques, and natural occurrence data for 5-formylcytosine (5fC). See Table 1 for an explanation of the mapping and natural occurrence table headers

We store all data, either imported from ChEBI or from our manual annotations, within a SQLite [25] database, used via the Python sqlite3 package [16].
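To make the storage step concrete, the following sketch uses the Python sqlite3 package with an in-memory database. The schema is purely an assumption for illustration: the table names, column names, placeholder identifier, and the method name "example-seq" are hypothetical and do not reflect DNAmod's actual layout.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Hypothetical two-table layout: imported entities plus manual annotations.
conn.executescript("""
CREATE TABLE modification (
    chebi_id TEXT PRIMARY KEY,
    name     TEXT NOT NULL,
    verified INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE sequencing_method (
    chebi_id   TEXT NOT NULL REFERENCES modification(chebi_id),
    method     TEXT NOT NULL,
    basis      TEXT,
    resolution TEXT CHECK (resolution IN ('low', 'high'))
);
""")

# Placeholder identifier and method name; not real ChEBI or DNAmod values.
conn.execute(
    "INSERT INTO modification VALUES (?, ?, ?)",
    ("CHEBI:PLACEHOLDER", "5-hydroxymethylcytosine", 1),
)
conn.execute(
    "INSERT INTO sequencing_method VALUES (?, ?, ?, ?)",
    ("CHEBI:PLACEHOLDER", "example-seq", "chemical conversion", "high"),
)

# Join imported and manually annotated data for one modification page.
rows = conn.execute("""
    SELECT m.name, s.method, s.resolution
    FROM modification AS m
    JOIN sequencing_method AS s ON s.chebi_id = m.chebi_id
    WHERE m.verified = 1
""").fetchall()
print(rows)
conn.close()
```

Keeping the manual annotations in their own table mirrors the separation between ChEBI-derived data and curated annotations described in the text.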

Website generation

We created a static website to display and provide navigation for the information contained within the database. We generated it by formatting the database content using the templating engine Jinja2 [42]. Two templates were sufficient to generate all HTML files. We used a single template for all modification pages and another for the homepage. We also record the date of the most recent update to the database. The main footer contains this date, along with the current ChEBI and DNAmod versions. All web pages use the Bootstrap [36] framework, which provides a standardized, portable, and mobile-compatible viewing format. We visualized the chemical structure of each compound from its Simplified Molecular-Input Line-Entry System (SMILES) [52] data, if available from ChEBI, as a vector graphic. We did this using the cheminformatics toolkit Open Babel [34], via its Python wrapper Pybel [33].

Searching and navigation

DNAmod makes modifications accessible via three main navigation options, each provided on a tab of the DNAmod homepage. First, users may search for modifications by several fields. Second, users may find curated DNA modifications via a pie menu [7]. Third, users may find candidate entities as a list, categorized by their parent unmodified nucleobases.

Client-side search functionality provides a means of rapidly finding bases with differing nomenclature (Fig. 2a), while maintaining a static web page. This functionality relies on the elasticlunr.js JavaScript module [47]. Searches match to multiple fields: common or International Union of Pure and Applied Chemistry (IUPAC) names, all synonyms, any assigned abbreviation, and recommended notation symbol, when available. DNAmod displays curated DNA modifications in green, and others in magenta. The search results provide the field matched by the query, such as “abbreviation”, along with the common name of the associated hit.
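The multi-field matching behaviour can be illustrated with a short sketch. This is not elasticlunr.js, which builds a tokenized index and runs in the browser; it is a plain case-insensitive substring search written in Python, with a hypothetical `search` function and hypothetical record field names, intended only to show how a hit can report which field matched.

```python
# Fields are checked in this order; the names are assumptions.
SEARCH_FIELDS = ("common_name", "iupac_name", "synonyms", "abbreviation", "symbol")


def search(records, query):
    """Return (matched field, common name) pairs for substring hits."""
    q = query.lower()
    hits = []
    for rec in records:
        for field in SEARCH_FIELDS:
            values = rec.get(field) or []
            if isinstance(values, str):
                values = [values]
            if any(q in value.lower() for value in values):
                hits.append((field, rec["common_name"]))
                break  # report only the first matching field per record
    return hits


# Toy records using names and abbreviations that appear in the text.
records = [
    {
        "common_name": "6-methyladenine",
        "synonyms": ["N6-methyladenine"],
        "abbreviation": "6mA",
    },
    {
        "common_name": "5-formylcytosine",
        "abbreviation": "5fC",
    },
]

print(search(records, "6ma"))
print(search(records, "formyl"))
```

Searching for "6ma" matches only the abbreviation field of 6-methyladenine, which corresponds to the kind of result shown in Fig. 2a.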

Alternatively, users may browse the modifications in DNAmod through a pie menu [7] interface (Fig. 2b). This interface hierarchically arranges the bases according to their structure within the ChEBI ontology. The innermost ring consists of the four unmodified DNA bases, with an additional “other” category. This category encapsulates modified bases found in DNA, but which are not modifications of one of the four DNA bases. Consecutive outer rings represent children of the previous base or category. We demarcated natural versus synthetic bases by colouring natural bases in teal and synthetic bases in grey.

Fig. 2 Finding 6-methyladenine by (a) searching for its abbreviation “6mA” or (b) via the pie menu

DNAmod structure and content

Individual modification pages visually represent the data contained within the backing database. We standardize and display all modifications in an identical format. DNAmod may omit some information, however, depending upon the extent of ChEBI’s annotations and whether the page describes a verified DNA modification or merely a candidate entry.

Modification pages begin with a header displaying the DNA modification’s ChEBI name. The top-right corner of the page lists the unmodified ancestor of the modification. For example, 5-hydroxymethyluracil is a modification of thymine (Fig. 3), whereas 6-dimethyladenine is a modification of adenine.

Each modification begins with a short textual description of its chemistry, followed by a table containing its chemical properties. We import these from ChEBI, which provides their chemical formula, net charge, and average mass.

We annotate entities with all names available from ChEBI, including: their IUPAC name, SMILES [52] string, International Chemical Identifier (InChI) and hashed InChIKey [23] strings, and common synonyms. We also provide a recommended abbreviation and in some instances a suggested single-letter symbol for bioinformatic purposes, from our proposed expanded alphabet [49] (Fig. 3).

We provide literature annotations for many DNA modifications, focusing upon those observed in vivo. We provide a list of methods that have been used to map the genomic locations of a modification (“Manual curation and annotation”). We additionally provide information on a modification’s occurrence, either naturally or only synthetically, where applicable, including some organisms in which it has been observed in vivo (“Manual curation and annotation”). Finally, each page ends with the ChEBI database reference and a ChEBI-derived list of related literature citations (Fig. 3). Our website has semantic web support, making use of the Resource Description Framework in Attributes (RDFa) [40] technique, augmented by Chemical Information Ontology (CHEMINF) [20] and PubChemRDF [15] Semanticscience Integrated Ontology (SIO) [14] annotations—providing machine-readable descriptions of key website features.

Fig. 3 The full modification page for 5-hydroxymethyluracil (5hmU)

Discussion

DNAmod enables researchers to rapidly obtain information on covalently modified DNA nucleobases and assists those interested in profiling a modification. It additionally provides a reference toward standardization of modified base nomenclature and offers the potential to track recent developments within the field. We have kept DNAmod up to date for 3 years and expect to continue to maintain it, particularly as new discoveries about DNA modifications are made. We also hope that DNAmod will serve to highlight underappreciated modifications that may have substantial biological importance.

The nomenclature used to describe a particular DNA modification is often inconsistent, with some early efforts toward standardization of particular classes [11, 28]. The ChEBI name, for instance, often corresponds to the common chemical name of the compound, which is occasionally distinct from its common name within the biological literature, in the context of a DNA modification. We address this and attempt to encourage standardization by endeavouring to ensure that other names are annotated, while providing specific nomenclature recommendations. In particular, the suggested name of verified DNA modifications, as displayed on the homepage and within the recommended notation section, is always manually-curated and sometimes differs from the name assigned by ChEBI.

Our database, like many others, relies upon the ChEBI ontology. Like any large and complex endeavour, curating ChEBI is a substantial undertaking, requiring protracted deployment of expertise and effort. While ChEBI has a dedicated team of expert curators, who assiduously and continually improve ChEBI, their resources are naturally limited. Accordingly, while ChEBI has an issue tracker where we and others can suggest changes, revisions to ChEBI are highly dependent on user reports and the team’s available bandwidth. ChEBI contains a non-negligible fraction of errors and omissions, across most entity categories [30, 55]. These works highlight the substantial effort and difficulty involved in maintaining high-quality annotations. Such errors naturally propagate to its downstream databases, including our own. While we have made efforts to further curate data and report relevant issues back upstream, we do inherit some errors and limitations. As in any project of this nature, we surely have our own errors and omissions. We lack a dedicated curator; accordingly, we curate this data on a best-effort basis. DNAmod has its own issue tracker, and we would appreciate it if users could report any of our own errors or omissions, so that we can address them or facilitate reporting them upstream.

The inclusion of assays available to sequence different DNA modifications provides a means of assessing and selecting a sequencing method. It additionally attempts to track sequencing methods over time, as resolution improves, and especially to highlight recent developments, like direct detection of various modifications via nanopore sequencing [50]. We annotate a sequencing method only under the base or set of bases that the method directly elucidates and independently maps. This includes bases that are obtained in addition to another nucleobase, since methods often yield confounded mixtures. For example, conventional bisulfite sequencing alone cannot distinguish 5mC from 5hmC. Alternatively, some methods, such as various nanopore-based methods, can independently resolve different modifications. Therefore, while many use oxidative bisulfite sequencing (oxBS-seq) in combination with conventional bisulfite sequencing to elucidate 5hmC via subtraction, we only annotated it as a sequencing method for 5mC, which it directly elucidates [6]. Conversely, we only annotate TET-assisted bisulfite sequencing (TAB-seq) under 5hmC, which it directly elucidates [6], although many use it to also detect 5mC.
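The subtraction mentioned above can be shown with a small worked example. Conventional bisulfite sequencing reports the combined 5mC and 5hmC level at a site, whereas oxBS-seq reports 5mC alone, so their difference estimates 5hmC. The function name `estimate_5hmc` and the floor at zero (a simple guard against negative differences from measurement noise) are assumptions of this sketch, not part of any published analysis pipeline.

```python
def estimate_5hmc(bs_level, oxbs_level):
    """Estimate the 5hmC fraction at a site as BS minus oxBS, floored at zero.

    bs_level:   fraction called modified by conventional bisulfite
                sequencing (5mC + 5hmC).
    oxbs_level: fraction called modified by oxBS-seq (5mC only).
    """
    for level in (bs_level, oxbs_level):
        if not 0.0 <= level <= 1.0:
            raise ValueError("modification levels must lie in [0, 1]")
    return max(0.0, bs_level - oxbs_level)


# A site called 75% modified by BS-seq and 50% modified by oxBS-seq.
print(estimate_5hmc(0.75, 0.5))
```

Because 5hmC is inferred here rather than read out directly, this illustrates why oxBS-seq is annotated only under 5mC.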

We demarcated bases found to occur in vivo, providing examples of organisms in which a modification has been found, along with associated citations. This merely substantiates its in vivo presence, however. We did not attempt to comprehensively list the organisms which contain any particular modification. Finally, we expect our brief annotations of the biological roles of various DNA modifications to change as further research is conducted.

Future work

We plan to keep DNAmod updated continuously, manually reviewing newly added ChEBI compounds, requesting appropriate additions to ChEBI, and curating any improvements. We also endeavour to annotate recently developed sequencing methods as we come across them.

Integrating additional external databases will further increase DNAmod’s utility. In particular, we envision potential integration with domain-specific DNA modification databases, such as those cataloguing compounds formed from the operation of particular biological pathways. For instance, modifications involved in DNA damage and repair could be linked to REPAIRtoire [31] data. We could also improve functional characterization using Gene Ontology (GO) [2] or Kyoto Encyclopedia of Genes and Genomes (KEGG) [27], but this would require extensive manual curation.

We used ChEBI Web Services [22] to obtain information from their database. ChEBI has, however, recently released a Python application programming interface (API), permitting us to directly access their data [48]. Switching from our current web-based queries to use of their API would likely result in a more robust system and expedite the database-building process.

Abbreviations

5caC: 5-carboxylcytosine

5fC: 5-formylcytosine

5hmC: 5-hydroxymethylcytosine

5hmU: 5-hydroxymethyluracil

5mC: 5-methylcytosine

6mA: 6-methyladenine

API: application programming interface

ChEBI: Chemical Entities of Biological Interest

CHEMINF: Chemical Information Ontology

DNMT: DNA methyltransferase

GO: Gene Ontology

InChI: International Chemical Identifier

IUPAC: International Union of Pure and Applied Chemistry

KEGG: Kyoto Encyclopedia of Genes and Genomes

oxBS-seq: oxidative bisulfite sequencing

RDFa: Resource Description Framework in Attributes

SIO: Semanticscience Integrated Ontology

SMILES: Simplified Molecular-Input Line-Entry System

TAB-seq: TET-assisted bisulfite sequencing

TET: ten-eleven translocation enzyme

References

  1. Amoreira C, Hindermann W, Grunau C (2003) An improved version of the DNA methylation database (MethDB). Nucleic Acids Res 31:75–77. https://doi.org/10.1093/nar/gkg093

  2. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. Nat Genet 25:25–29. https://doi.org/10.1038/75556

  3. Bachman M, Uribe-Lewis S, Yang X, Burgess HE, Iurlaro M, Reik W, Murrell A, Balasubramanian S (2015) 5-formylcytosine can be a stable DNA modification in mammals. Nat Chem Biol 11:555–557. https://doi.org/10.1038/nchembio.1848

  4. Boccaletto P, Machnicka MA, Purta E, Piatkowski P, Baginski B, Wirecki TK, de Crécy-Lagard V, Ross R, Limbach PA, Kotter A, Helm M, Bujnicki JM (2018) MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res 46:D303–D307. https://doi.org/10.1093/nar/gkx1030

  5. Booth MJ, Marsico G, Bachman M, Beraldi D, Balasubramanian S (2014) Quantitative sequencing of 5-formylcytosine in DNA at single-base resolution. Nat Chem 6:435–440. https://doi.org/10.1038/nchem.1893

  6. Booth MJ, Raiber EA, Balasubramanian S (2015) Chemical methods for decoding cytosine modifications in DNA. Chem Rev 115:2240–2254. https://doi.org/10.1021/cr5002904

  7. Callahan J, Hopkins D, Weiser M, Shneiderman B (1988) An empirical comparison of pie vs. linear menus. In: O’Hare JJ (ed) Proceedings of the SIGCHI Conference on human factors in computing systems, pp 95–100. https://doi.org/10.1145/57167.57182

  8. Cantara WA, Crain PF, Rozenski J, McCloskey JA, Harris KA, Zhang X, Vendeix FAP, Fabris D, Agris PF (2011) The RNA modification database, RNAMDB: 2011 update. Nucleic Acids Res 39:D195–D201. https://doi.org/10.1093/nar/gkq1028

  9. Chen K, Zhao BS, He C (2016) Nucleic acid modifications in regulation of gene expression. Cell Chem Biol 23:74–85. https://doi.org/10.1016/j.chembiol.2015.11.007

  10. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25:1422–1423. https://doi.org/10.1093/bioinformatics/btp163

  11. Cooke MS, Loft S, Olinski R, Evans MD, Bialkowski K, Wagner JR, Dedon PC, Møller P, Greenberg MM, Cadet J (2010) Recommendations for standardized description of and nomenclature concerning oxidatively damaged nucleobases in DNA. Chem Res Toxicol 23:705–707. https://doi.org/10.1021/tx1000706

  12. Dantas Machado AC, Zhou T, Rao S, Goel P, Rastogi C, Lazarovici A, Bussemaker HJ, Rohs R (2014) Evolving insights on how cytosine methylation affects protein-DNA binding. Brief Funct Genom 14:61–73. https://doi.org/10.1093/bfgp/elu040

  13. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcántara R, Darsow M, Guedj M, Ashburner M (2008) ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res 36:D344–D350. https://doi.org/10.1093/nar/gkm791

  14. Dumontier M, Baker CJ, Baran J, Callahan A, Chepelev L, Cruz-Toledo J, Del Rio NR, Duck G, Furlong LI, Keath N, Klassen D, McCusker JP, Queralt-Rosinach N, Samwald M, Villanueva-Rosales N, Wilkinson MD, Hoehndorf R (2014) The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery. J Biomed Semant 5:14. https://doi.org/10.1186/2041-1480-5-14

  15. Fu G, Batchelor C, Dumontier M, Hastings J, Willighagen E, Bolton E (2015) PubChemRDF: towards the semantic annotation of PubChem compound and substance databases. J Cheminf 7:34. https://doi.org/10.1186/s13321-015-0084-4

  16. Gerhard H (2016) sqlite3. https://docs.python.org/2/library/sqlite3.html

  17. Gommers-Ampt JH, Borst P (1995) Hypermodified bases in DNA. FASEB J 9:1034–1042. https://doi.org/10.1096/fasebj.9.11.7649402

  18. Grosjean H (2009) Nucleic acids are not boring long polymers of only four types of nucleotides: a guided tour. In: Grosjean H (ed) DNA and RNA modification enzymes: structure, mechanism, function and evolution. Landes Bioscience, Austin, TX, pp 1–18

  19. Hardisty RE, Kawasaki F, Sahakyan AB, Balasubramanian S (2015) Selective chemical labeling of natural T modifications in DNA. J Am Chem Soc 137:9270–9272. https://doi.org/10.1021/jacs.5b03730

  20. Hastings J, Chepelev L, Willighagen E, Adams N, Steinbeck C, Dumontier M (2011) The Chemical Information Ontology: provenance and disambiguation for chemical data on the biological semantic web. PLOS One 6(10):e25513. https://doi.org/10.1371/journal.pone.0025513

  21. Hastings J, de Matos P, Dekker A, Ennis M, Harsha B, Kale N, Muthukrishnan V, Owen G, Turner S, Williams M, Steinbeck C (2013) The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 41:D456–D463. https://doi.org/10.1093/nar/gks1146

  22. Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C (2016) ChEBI in 2016: improved services and an expanding collection of metabolites. Nucleic Acids Res 44:D1214–D1219. https://doi.org/10.1093/nar/gkv1031

  23. Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D (2015) InChI, the IUPAC international chemical identifier. J Cheminf 7:23. https://doi.org/10.1186/s13321-015-0068-4

  24. Heyn H, Esteller M (2015) An adenine code for DNA: a second life for N6-methyladenine. Cell 161:710–713. https://doi.org/10.1016/j.cell.2015.04.021

  25. Hipp DR, Kennedy D, Mistachkin J (2000–2018) SQLite. https://www.sqlite.org

  26. Iurlaro M, McInroy GR, Burgess HE, Dean W, Raiber EA, Bachman M, Beraldi D, Balasubramanian S, Reik W (2016) In vivo genome-wide profiling reveals a tissue-specific role for 5-formylcytosine. Genome Biol 17:141. https://doi.org/10.1186/s13059-016-1001-5

  27. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30. https://doi.org/10.1093/nar/28.1.27

  28. Khromov-Borisov NN (1997) Naming the mutagenic nucleic acid base analogs: the Galatea syndrome. Mutat Res 379:95–103. https://doi.org/10.1016/S0027-5107(97)00112-7

  29. Korlach J, Turner SW (2012) Going beyond five bases in DNA sequencing. Curr Opin Struct Biol 22:251–261. https://doi.org/10.1016/j.sbi.2012.04.002

  30. Liu H, Chen L, Zheng L, Perl Y, Geller J (2018) A quality assurance methodology for ChEBI ontology focusing on uncommonly modeled concepts. In: Jaiswal P, Cooper L, Haendel MA, Mungall CJ (eds) Proceedings of the 9th international conference on biological ontology (ICBO), Corvallis, OR, USA, vol 2285. http://ceur-ws.org/Vol-2285/ICBO_2018_paper_7.pdf

  31. Milanowska K, Krwawicz J, Papaj G, Kosiński J, Poleszak K, Lesiak J, Osińska E, Rother K, Bujnicki JM (2011) REPAIRtoire–a database of DNA repair pathways. Nucleic Acids Res 39:D788–D792. https://doi.org/10.1093/nar/gkq1087

  32. NCBI Resource Coordinators (2018) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 46:D8–D13. https://doi.org/10.1093/nar/gkx1095

  33. O’Boyle NM, Morley C, Hutchison GR (2008) Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit. Chem Cent J 2:5. https://doi.org/10.1186/1752-153X-2-5

  34. O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminf 3:33. https://doi.org/10.1186/1758-2946-3-33

  35. Ortel J, Noehr J, van Gheem N (2011) suds. https://pypi.org/project/suds

  36. Otto M, Thornton J, Rebert C, Thilo J, XhmikosR, Fenkart H, Lauke PH, et al (2011–2018) Bootstrap. http://getbootstrap.com

  37. Pachter L (2013) *Seq. https://liorpachter.wordpress.com/seq/

  38. Pfaffeneder T, Spada F, Wagner M, Brandmayr C, Laube SK, Eisen D, Truss M, Steinbacher J, Hackner B, Kotljarova O, Schuermann D, Michalakis S, Kosmatchev O, Schiesser S, Steigenberger B, Raddaoui N, Kashiwazaki G, Müller U, Spruijt CG, Vermeulen M, Leonhardt H, Schär P, Müller M, Carell T (2014) Tet oxidizes thymine to 5-hydroxymethyluracil in mouse embryonic stem cell DNA. Nat Chem Biol 10:574–581. https://doi.org/10.1038/nchembio.1532

  39. Plongthongkum N, Diep DH, Zhang K (2014) Advances in the profiling of DNA modifications: cytosine methylation and beyond. Nat Rev Genet 15:647–661. https://doi.org/10.1038/nrg3772

  40. RDFa Working Group (2015) RDFa 1.1 primer—third edition. W3C Working Group Note, https://www.w3.org/TR/2015/NOTE-rdfa-primer-20150317/

  41. Roberts RJ, Vincze T, Posfai J, Macelis D (2015) REBASE–a database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Res 43:D298–D299. https://doi.org/10.1093/nar/gku1046

  42. Ronacher A (2008) Jinja2 (the Python template engine). http://jinja.pocoo.org

  43. Rothbart SB, Strahl BD (2014) Interpreting the language of histone and DNA modifications. Biochim Biophys Acta, Gene Regul Mech 1839:627–643. https://doi.org/10.1016/j.bbagrm.2014.03.001

  44. Rother K, Papaj G, Bujnicki JM (2009) Databases of DNA modifications. In: Grosjean H (ed) DNA and RNA Modification enzymes: structure, mechanism, function and evolution. Landes Bioscience, Austin, TX, pp 622–623

  45. Song CX, Yi C, He C (2012) Mapping recently identified nucleotide variants in the genome and transcriptome. Nat Biotechnol 30:1107–1116. https://doi.org/10.1038/nbt.2398

  46. Song CX, Szulwach KE, Dai Q, Fu Y, Mao SQ, Lin L, Street C, Li Y, Poidevin M, Wu H, Gao J, Liu P, Li L, Xu GL, Jin P, He C (2013) Genome-wide profiling of 5-formylcytosine reveals its roles in epigenetic priming. Cell 153:678–691. https://doi.org/10.1016/j.cell.2013.04.001

  47. Song W (2012–2018) Elasticlunr.js. http://elasticlunr.com

  48. Swainston N, Hastings J, Dekker A, Muthukrishnan V, May J, Steinbeck C, Mendes P (2016) libChEBI: an API for accessing the ChEBI database. J Cheminf 8:11. https://doi.org/10.1186/s13321-016-0123-9

  49. Viner C, Johnson J, Walker N, Shi H, Sjöberg M, Adams DJ, Ferguson-Smith AC, Bailey TL, Hoffman MM (2016) Modeling methyl-sensitive transcription factor motifs with an expanded epigenetic alphabet. bioRxiv 043794, https://doi.org/10.1101/043794

  50. Wallace EVB, Stoddart D, Heron AJ, Mikhailova E, Maglia G, Donohoe TJ, Bayley H (2010) Identification of epigenetic DNA modifications with a protein nanopore. Chem Commun 46:8195–8197. https://doi.org/10.1039/c0cc02864a

  51. Weigele P, Raleigh EA (2016) Biosynthesis and function of modified bases in bacteria and their viruses. Chem Rev 116:12655–12687. https://doi.org/10.1021/acs.chemrev.6b00114

  52. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Model 28(1):31–36. https://doi.org/10.1021/ci00057a005

  53. Wu H, Zhang Y (2014) Reversing DNA methylation: mechanisms, genomics, and biological functions. Cell 156:45–68. https://doi.org/10.1016/j.cell.2013.12.019

  54. Xuan JJ, Sun WJ, Lin PH, Zhou KR, Liu S, Zheng LL, Qu LH, Yang JH (2018) RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res 46:D327–D334. https://doi.org/10.1093/nar/gkx934

  55. Yumak H, Chen L, Halper M, Zheng L, Perl Y, Elhanan G (2016) A quality-assurance study of ChEBI. In: Jaiswal P, Hoehndorf R, Arighi CN, Meier A (eds) Proceedings of the joint international conference on biological ontology and biocreative, Corvallis, Oregon, USA, vol 1747. http://ceur-ws.org/Vol-1747/IT701_ICBO2016.pdf

  56. Zhang Y, Lv J, Liu H, Zhu J, Su J, Wu Q, Qi Y, Wang F, Li X (2010) HHMD: the human histone modification database. Nucleic Acids Res 38:D149–D154. https://doi.org/10.1093/nar/gkp968

Authors’ contributions

Conceptualization, MMH; Methodology, AJS, CV, and MMH; Software, AJS and CV; Resources, MMH; Data Curation, AJS and CV; Writing—Original Draft, AJS and CV; Writing—Review & Editing, AJS, CV, and MMH; Visualization, AJS, CV, and MMH; Funding Acquisition, MMH; Supervision, CV and MMH. All authors read and approved the final manuscript.

Acknowledgements

We thank Daniel D. De Carvalho and Christopher E. Mason for helpful feedback on early versions of DNAmod. We thank the creators of ChEBI [13], and all those who have worked to improve it [21, 22, 48]. In particular, we thank Gareth Owen, Steve Turner, and Marcus Ennis for actively responding to curation requests and Venkatesh Muthukrishnan for managing ChEBI issues. We thank Egon L. Willighagen for useful suggestions in a PubPeer review of an early version of this work. We thank Carl Virtanen, Qun Jin, and Zhibin Lu for technical assistance.

Competing interests

The authors declare that they have no competing interests.

Availability of data and materials

The DNAmod website, including a description and contact information, and the backing SQLite database are freely available at: https://dnamod.hoffmanlab.org. Python source code, web assets, and an issue tracker for this project are available at: https://bitbucket.org/hoffmanlab/dnamod. Persistent availability is ensured by Zenodo, in which we have deposited current and previous versions of our code (https://doi.org/10.5281/zenodo.640631) and SQLite database (https://doi.org/10.5281/zenodo.640561). All source code and web assets are licensed under the GNU General Public License, version 2 (GPLv2). DNAmod’s data is licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0).

Funding

This work was supported by the University of Toronto Undergraduate Research Opportunities Program (to AJS), the Natural Sciences and Engineering Research Council of Canada (RGPIN-2015-03948 to MMH and Alexander Graham Bell Canada Graduate Scholarships to CV), the Canadian Institutes of Health Research (201512MSH-360970 to MMH), the Canadian Cancer Society (703827 to MMH), the Ontario Ministry of Training, Colleges and Universities (Ontario Graduate Scholarships to CV), the Ontario Institute for Cancer Research through funding provided by the Government of Ontario (CSC-FR-UHN to John E. Dick), the Ontario Ministry of Research, Innovation and Science (ER-15-11-223 to MMH), the University of Toronto McLaughlin Centre (MC-2015-16 to MMH), and the Princess Margaret Cancer Foundation.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Corresponding author

Correspondence to Michael M. Hoffman.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Sood, A.J., Viner, C. & Hoffman, M.M. DNAmod: the DNA modification database. J Cheminform 11, 30 (2019). https://doi.org/10.1186/s13321-019-0349-4

  • DOI: https://doi.org/10.1186/s13321-019-0349-4

Keywords