DNAmod: the DNA modification database

Sood, Ankur Jai; Viner, Coby; Hoffman, Michael M.

doi:10.1186/s13321-019-0349-4

Database
Open access
Published: 23 April 2019

Journal of Cheminformatics volume 11, Article number: 30 (2019)

Abstract

Covalent DNA modifications, such as 5-methylcytosine (5mC), are increasingly the focus of numerous research programs. In eukaryotes, both 5mC and 5-hydroxymethylcytosine (5hmC) are now recognized as stable epigenetic marks, with diverse functions. Bacteria, archaea, and viruses contain various other modified DNA nucleobases. Numerous databases describe RNA and histone modifications, but no database specifically catalogues DNA modifications, despite their broad importance in epigenetic regulation. To address this need, we have developed DNAmod: the DNA modification database. DNAmod is an open-source database (https://dnamod.hoffmanlab.org) that catalogues DNA modifications and provides a single source to learn about their properties. DNAmod provides a web interface to easily browse and search through these modifications. The database annotates the chemical properties and structures of all curated modified DNA bases, and a much larger list of candidate chemical entities. DNAmod includes manual annotations of available sequencing methods and descriptions of each modification's occurrence in nature, and provides existing and suggested nomenclature. DNAmod enables researchers to rapidly review previous work, select mapping techniques, and track recent developments concerning modified bases of interest.

Introduction

A rapidly growing body of research is continuing to reveal numerous gene-regulatory effects of covalent DNA modifications, such as 5‑methylcytosine (5mC). We now recognize 5mC as a stable epigenetic mark and as having diverse functions beyond transcriptional repression [12]. An increasing number of studies demonstrate the importance of other cytosine modifications, such as 5‑hydroxymethylcytosine (5hmC), 5‑formylcytosine (5fC), and 5‑carboxylcytosine (5caC) [3, 9, 26, 43, 46]. More recently, three analogous modifications of thymine were found to occur in mammals [38, 53] and can now largely be sequenced [19]. N⁶‑methyladenine, previously thought to mainly occur as an RNA modification in eukaryotes, has now been found in the DNA of multiple eukaryotes [24]. Bacteria, archaea, and especially bacteriophages have long been known to harbor a diverse array of modified bases [18, 51]. Their genomes can also have hypermodified bases—modified DNA bases that substitute for the unmodified base in many positions genome-wide [17, 51].

Multiple databases profile RNA modifications [4, 8, 54] and human histone modifications [56], but no database catalogues DNA modifications systematically. Some databases include particular classes of DNA modifications [44]. These include restriction endonucleases and DNA methyltransferases in REBASE [41]; methylation databases, like MethDB [1]; databases including DNA metabolic pathways, such as KEGG [27]; and those focused on DNA damage and repair, like REPAIRtoire [31].

Since DNA modifications are a key aspect of epigenetic regulation, there is a pressing need to organize them in a single location. We have accordingly created DNAmod: the DNA modification database (https://dnamod.hoffmanlab.org). DNAmod is the first database to comprehensively catalogue DNA modifications and provides a single resource to launch an investigation of their properties.

Database construction and visualization

DNAmod consists of two components: a relational database back-end and a web interface front-end. We used the Chemical Entities of Biological Interest (ChEBI) database [13, 22] to seed the DNAmod database. We imported a nucleobase-related subset of ChEBI, consisting of chemical entities and related annotations. We performed queries against the entities to construct a set of candidate DNA modifications for DNAmod, retaining most of these as a separate unverified set. Then, we filtered candidate entities into a manually curated set of verified DNA modifications, augmenting them with modification-specific annotations.

The web interface front-end allows users to either search or browse through the catalogue of DNA modifications, integrating ChEBI’s information with our own.

Identifying candidate DNA modifications from ChEBI

DNAmod leverages ChEBI [22] to define a set of modified DNA candidates for inclusion and to add preliminary information for each candidate. ChEBI is a database of small biologically relevant molecules, which affect living organisms. We queried ChEBI via ChEBI Web Services [22]. We used Biopython [10] and the Python Simple Object Access Protocol (SOAP) client, suds [35], to query ChEBI and construct the DNAmod database.

ChEBI provides an ontology which encodes the relationships between its compounds. We used this ontology to precisely define the notion of parents and children, which we used to hierarchically retrieve and display modifications. We used two kinds of relationships for this purpose, both of which have associated symbols, defined by ChEBI [13]: $\mathcal {F}$ has functional parent and $\triangle $ is a. We used these relationships to find candidate DNA modifications, by identifying entities related to the core nucleobases, which we represent by their symbols: {A,C,G,T,U}. We included uracil, since many of its descendants in the ontology are modifications of thymine (CHEBI:17821, which is equivalent to 5-methyluracil), and are not annotated as descendants of thymine itself. For each of these bases, we imported all entities that are annotated in the ontology as a child of one of these bases, via the $\mathcal {F}$ has functional parent relationship. ChEBI ranks entities based on their degree of curation. We only imported entities with the highest rating—three stars—indicating manual curation by ChEBI. Whenever possible, we only included entities as nitrogenous bases (nucleobases). If ChEBI did not have the nucleobase, we then selected the nucleoside form and finally, if necessary, the nucleotide. These imported bases formed the candidate set of modifications (the unverified set), from which we created a curated set of DNA modifications (the verified set).
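The import rules above (three-star entries only, with the nucleobase preferred over the nucleoside and the nucleoside over the nucleotide) can be sketched as follows. This is an illustrative sketch rather than DNAmod's actual code: the record layout, the `stars` and `form` keys, the function name `select_candidate`, and the placeholder identifiers are all assumptions made for the example. It also assumes that a form without a three-star entry counts as unavailable.

```python
# Preference order described in the text: nucleobase first, then
# nucleoside, then nucleotide.
FORM_PREFERENCE = ("nucleobase", "nucleoside", "nucleotide")


def select_candidate(records):
    """Pick one record for a modification from its available forms.

    `records` is a list of dicts with hypothetical keys: 'id'
    (a placeholder identifier), 'form', and 'stars' (ChEBI rating).
    Returns the preferred three-star record, or None if there is none.
    """
    curated = [r for r in records if r["stars"] == 3]
    for form in FORM_PREFERENCE:
        for record in curated:
            if record["form"] == form:
                return record
    return None


# Placeholder data: the nucleobase entry lacks three stars, so the
# nucleoside entry is chosen ahead of the nucleotide entry.
example = [
    {"id": "placeholder-1", "form": "nucleobase", "stars": 2},
    {"id": "placeholder-2", "form": "nucleotide", "stars": 3},
    {"id": "placeholder-3", "form": "nucleoside", "stars": 3},
]
chosen = select_candidate(example)
```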

The ChEBI ontology does not generally encode $\mathcal {F}$ has functional parent relationships for nucleobases beyond the children of the unmodified nucleobases. It instead encodes modified nucleobases with an $\triangle $ is a relationship to their parent base. This is because descendant entities of specific modifications are generally subtypes of the class of modifications from which they originate. For example, 3-methyladenine $\triangle $ is a methyladenine. Methyladenine, however, $\mathcal {F}$ has functional parent adenine, since it is conceived of as possessing adenine as a characteristic group and as being derived via functional modification [13]. We therefore need to use both of these relationships, within the ChEBI ontology, to accurately capture the full nucleobase hierarchy.
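A toy graph illustrates why both relationships are needed. The following is a minimal sketch, not DNAmod's actual code: the `PARENTS` dictionary is hand-written sample data that mirrors the 3-methyladenine example above rather than a ChEBI query result, and the function name `ancestors` and the relationship labels are hypothetical.

```python
# Toy ontology fragment: child -> list of (relationship, parent) edges.
# Hand-written illustration; real data would come from ChEBI.
PARENTS = {
    "3-methyladenine": [("is_a", "methyladenine")],
    "methyladenine": [("has_functional_parent", "adenine")],
}


def ancestors(entity, allowed):
    """Return all ancestors reachable from `entity` using only edges
    whose relationship type is in `allowed`."""
    found = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for relationship, parent in PARENTS.get(current, []):
            if relationship in allowed and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# Following only "has functional parent" edges never reaches adenine.
only_f = ancestors("3-methyladenine", {"has_functional_parent"})
# Following both edge types recovers the full chain up to adenine.
both = ancestors("3-methyladenine", {"has_functional_parent", "is_a"})
```

With only the has functional parent edges, 3-methyladenine has no ancestors in this fragment; with both edge types, the traversal reaches methyladenine and then adenine.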

ChEBI also provides selected citations, associated with some of its entities. We retrieved the citations from ChEBI as PubMed IDs [32]. We used the Biopython [10] package Bio.Entrez to query the PubMed citation database, using NCBI’s Entrez Programming Utilities [32]. We retrieved the details of each citation, and used them to construct a formatted citation. We currently support only publications indexed in PubMed.

Manual curation and annotation

We manually created and defined a whitelist, which contains our curated (or verified) set of candidates that we deem DNA modifications. For each of the bases enumerated in our whitelist, we also imported all descendants with an eventual $\mathcal {F}$ has functional parent or $\triangle $ is a relationship with any of the members of the verified set. We expanded the verified set to include any bases recursively imported in this manner, since they were children of verified DNA nucleobases. We also manually created and defined a distinct blacklist, which contains compounds that we deem to not be DNA modifications, also excluding any of their descendant compounds. Therefore, our above verification rule has the exception that it excludes any bases with an ancestor in our blacklist.

We can formalize the above description of bases imported from the ChEBI ontology [13] and subsequent filtering as follows. Let $a\mathbin{\mathcal {F}}\,b$ specify that a has the $\mathcal {F}$ has functional parent relationship with b. The definition of $\mathcal {F}$ is transitive: for all n entities, $l_{i}$, for $i = 0$ to $n - 1$, between a and b,

$$\begin{aligned} a\mathbin{\mathcal {F}}\,b \iff \bigl ( a\mathbin{\mathcal {F}}\,l_{n - 1} \bigr ) \wedge \bigl ( l_{i}\mathbin{ \mathcal {F}}\,l_{i - 1} \mathord {\forall } i \in \left( 0, n\right) \bigr ) \wedge \bigl ( l_{0}\mathbin{\mathcal {F}}\,b \bigr ). \end{aligned}$$

The analogous definitions hold for $\triangle $.

We call each $l_{i}$ a child of $l_{i - 1}$ and call each $l_{i - 1}$ a parent of $l_{i}$. We refer to a as a descendant of b and refer to b as an ancestor of a. Let $\mathcal {C}$ represent the first level of children of the unmodified nucleobases, such that $\mathcal {C} = \left\{ x \mid x\mathbin{{\mathcal {F}}}\,y, y \in \{\tt{A, C, G, T, U}\} \right\} $. Let $\mathcal {V} \subset \mathcal {C}$ represent the manually-annotated, verified proper subset of $\mathcal {C}$.

We manually curated a blacklist of excluded entities, $\mathcal {B}$, satisfying: $\mathcal {B} \subseteq \left\{ b\mid\left( b\mathbin{{\mathcal {F}}}\,p \vee b \mathbin{\triangle} p \right) , p \in \mathcal {V} \right\} $. We imported the set of verified DNA modifications, $\mathcal {M}$, defined in set-builder notation with predicates, as:

$$\begin{aligned} \mathcal {M}=\, {} \mathcal {V}\, \cup\, &\left\{ z \mid \left( \exists v\, {\in } \mathcal {V} \right) \left( \forall\, b\, {\in }\, \mathcal {B} \right) \right. \\&\left. \left[ \left( z\mathbin{{\mathcal {F}}}\,v \vee z \mathbin{\triangle} v \right) \wedge \lnot \left( z\mathbin{{\mathcal {F}}}\,b \vee z \mathbin{\triangle} b \right) \right] \right\} . \end{aligned}$$
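The construction of $\mathcal {M}$ can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not DNAmod's actual code: the `children` mapping merges $\mathcal {F}$ and $\triangle $ edges into a single parent-to-children dictionary, the entity names `entity-x` and `entity-y` are placeholders, and the function names are hypothetical. Following the prose description, the sketch also excludes the blacklisted entities themselves, not only their descendants.

```python
def descendants(root, children):
    """Return every entity reachable from `root` via child edges."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def verified_modifications(verified, blacklist, children):
    """Compute M: V plus descendants of V with no blacklisted ancestor."""
    excluded = set(blacklist)
    for entity in blacklist:
        excluded |= descendants(entity, children)
    result = set(verified)
    for entity in verified:
        result |= descendants(entity, children) - excluded
    return result


# Placeholder hierarchy: entity-x is blacklisted, so it and its child
# entity-y are dropped, while 3-methyladenine is kept.
children = {
    "methyladenine": {"3-methyladenine", "entity-x"},
    "entity-x": {"entity-y"},
}
M = verified_modifications({"methyladenine"}, {"entity-x"}, children)
```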

Finally, we added a small number of bases manually, that do not have any of the DNA bases or uracil as a parent in their ontology, but are nonetheless notable modified bases, such as 2′-deoxyinosine.

We additionally provided two kinds of manual annotations: sequencing techniques and occurrence in nature, for each modified DNA base. We surveyed the literature of sequencing methods for covalent DNA modifications [6, 29, 37, 39, 45], and annotated the available methods for each base, providing curated citations. These annotations include the method’s name, our categorizations of the basis for the method (such as chemical conversion), its resolution, and any further qualifier (Table 1A). Qualifiers include limitations (such as applicability to only some genomic regions), enrichment methods, and advantages (such as optimization for single-cell sequencing). We considered any method which involves affinity-based recognition of targets to be of “low” resolution [5]. These methods can also suffer from low specificity or antibody cross-reactivity [6]. Conversely, we annotated any methods based principally upon the detection of a chemically converted modification as “high” resolution. This generally reflects the resulting resolution of the method’s output data and often corresponds to the necessity to bin genomic regions during downstream analyses of the detected analyte.

For each modified base, we investigated whether it had been previously reported to occur in vivo. This included any endogenous occurrences, as well as those stimulated exogenously, such as from exposure to an environmental toxin. We annotated any modification observed in vivo as “natural”. We additionally provided non-exhaustive examples of some organisms in which the modifications have been reported. We based these annotations on our ability to find evidence of in vivo occurrence, as opposed to publications describing only the synthesis or physicochemical properties of a nucleobase. For each of these annotations, we also briefly annotated a primary biological function, if known (Table 1B). For any modification not observed in vivo, we annotated it as “synthetic” and listed a reference pertaining to its synthesis or in which the synthetic base was used.

We entered these annotations in two annotation source files (Table 1), which we later imported into our database. This decouples them from the rest of our pipeline and allows outside experts to submit additions without requiring knowledge of our pipeline or programming workflow.
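Keeping annotations in standalone files means they can be loaded with a few lines of generic parsing code. The sketch below assumes a tab-separated layout; the column names, the sample rows, and the function name `load_annotations` are hypothetical and are not copied from DNAmod's actual source files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated annotation file. The column names and rows
# are illustrative assumptions, not DNAmod's real file contents.
SAMPLE = (
    "abbreviation\tmethod\tbasis\tresolution\n"
    "5mC\toxBS-seq\tchemical conversion\thigh\n"
    "5hmC\texample-method\taffinity\tlow\n"
)


def load_annotations(handle):
    """Group annotation rows by the abbreviation of the modified base."""
    by_base = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        by_base[row["abbreviation"]].append(row)
    return dict(by_base)


annotations = load_annotations(io.StringIO(SAMPLE))
```

A contributor would only need to append a row to such a file; the import step stays unchanged.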

Table 1 Possible annotations within DNAmod’s curated (A) sequencing method data and (B) natural occurrence information

DNAmod integrates manually-curated nomenclature, including the name and abbreviation deemed most consistent and in common use [9, 11, 28]. We additionally provide recommendations for one-letter symbols of selected modified bases, and in some instances for their base-pairing complements, as previously described [49]. The DNAmod web interface displays recommended notation in an organized table (Fig. 1).

We store all data, either imported from ChEBI or from our manual annotations, within a SQLite [25] database, used via the Python sqlite3 package [16].
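As a rough illustration of this storage layer, the sketch below builds a small in-memory SQLite database with the sqlite3 package. The two-table schema, the column names, and the placeholder identifier are assumptions made for the example; DNAmod's real schema may differ.

```python
import sqlite3

# Hypothetical two-table layout; DNAmod's real schema may differ.
SCHEMA = """
CREATE TABLE modification (
    chebi_id TEXT PRIMARY KEY,
    name     TEXT NOT NULL,
    verified INTEGER NOT NULL  -- 1 = curated, 0 = candidate only
);
CREATE TABLE sequencing_method (
    chebi_id   TEXT NOT NULL REFERENCES modification(chebi_id),
    method     TEXT NOT NULL,
    resolution TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# "CHEBI:placeholder" stands in for a real ChEBI identifier.
conn.execute(
    "INSERT INTO modification VALUES (?, ?, ?)",
    ("CHEBI:placeholder", "5-methylcytosine", 1),
)
conn.execute(
    "INSERT INTO sequencing_method VALUES (?, ?, ?)",
    ("CHEBI:placeholder", "oxBS-seq", "high"),
)
conn.commit()

# List sequencing methods for verified modifications only.
rows = conn.execute(
    "SELECT m.name, s.method, s.resolution "
    "FROM modification AS m JOIN sequencing_method AS s USING (chebi_id) "
    "WHERE m.verified = 1"
).fetchall()
```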

Website generation

We created a static website to display and provide navigation for the information contained within the database. We generated it by formatting the database content using the templating engine Jinja2 [42]. Two templates were sufficient to generate all HTML files. We used a single template for all modification pages and another for the homepage. We also record the date of the most recent update to the database. The main footer contains this date, along with the current ChEBI and DNAmod versions. All web pages use the Bootstrap [36] framework, which provides a standardized, portable, and mobile-compatible viewing format. We visualized the chemical structure of each compound from its Simplified Molecular-Input Line-Entry System (SMILES) [52] data, if available from ChEBI, as a vector graphic. We did this using the cheminformatics toolkit Open Babel [34], via its Python wrapper Pybel [33].
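The idea of generating every modification page from one template can be sketched as follows. To keep the example free of third-party packages, it uses `string.Template` from the Python standard library as a stand-in for Jinja2; the template text, the row keys, and the function name `render_modification_pages` are hypothetical.

```python
from html import escape
from string import Template

# One template serves every modification page. string.Template is a
# standard-library stand-in for Jinja2 in this sketch.
MODIFICATION_PAGE = Template(
    "<h1>$name</h1>\n"
    "<p>Modification of: $parent</p>\n"
    "<footer>Last updated: $updated</footer>\n"
)


def render_modification_pages(rows, updated):
    """Return a {filename: html} mapping, one page per database row."""
    pages = {}
    for row in rows:
        pages[row["name"] + ".html"] = MODIFICATION_PAGE.substitute(
            name=escape(row["name"]),
            parent=escape(row["parent"]),
            updated=escape(updated),
        )
    return pages


pages = render_modification_pages(
    [{"name": "5-hydroxymethyluracil", "parent": "thymine"}],
    updated="YYYY-MM-DD",  # placeholder for the database update date
)
```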

Searching and navigation

DNAmod makes modifications accessible via three main navigation options, each provided on a tab of the DNAmod homepage. First, users may search for modifications by several fields. Second, users may find curated DNA modifications via a pie menu [7]. Third, users may find candidate entities as a list, categorized by their parent unmodified nucleobases.

Client-side search functionality provides a means of rapidly finding bases with differing nomenclature (Fig. 2a), while maintaining a static web page. This functionality relies on the elasticlunr.js JavaScript module [47]. Searches match to multiple fields: common or International Union of Pure and Applied Chemistry (IUPAC) names, all synonyms, any assigned abbreviation, and recommended notation symbol, when available. DNAmod displays curated DNA modifications in green, and others in magenta. The search results provide the field matched by the query, such as “abbreviation”, along with the common name of the associated hit.
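The multi-field matching described above can be approximated in a few lines. The sketch below is written in Python for consistency with the rest of the pipeline and uses plain case-insensitive substring matching; it is not the elasticlunr.js implementation, which tokenizes and scores results. The field names, the sample records, and the function name `search` are assumptions.

```python
# Fields consulted for each query, in the order they are checked.
SEARCH_FIELDS = ("common_name", "iupac_name", "synonyms", "abbreviation", "symbol")


def search(query, records):
    """Return (matched_field, common_name) for each record matching `query`."""
    needle = query.lower()
    hits = []
    for record in records:
        for field in SEARCH_FIELDS:
            values = record.get(field, [])
            if isinstance(values, str):
                values = [values]
            if any(needle in value.lower() for value in values):
                hits.append((field, record["common_name"]))
                break  # report only the first matching field per record
    return hits


RECORDS = [
    {"common_name": "5-methylcytosine", "abbreviation": "5mC"},
    {"common_name": "5-hydroxymethylcytosine", "abbreviation": "5hmC"},
]

by_abbreviation = search("5hmc", RECORDS)
by_name = search("methylcytosine", RECORDS)
```

As in the DNAmod interface, each hit reports which field matched alongside the common name of the entry.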

Alternatively, users may browse the modifications in DNAmod through a pie menu [7] interface (Fig. 2b). This interface hierarchically arranges the bases according to their structure within the ChEBI ontology. The innermost ring consists of the four unmodified DNA bases, with an additional “other” category. This category encapsulates modified bases found in DNA, but which are not modifications of one of the four DNA bases. Consecutive outer rings represent children of the previous base or category. We demarcated natural versus synthetic bases by colouring natural bases in teal and synthetic bases in grey.

DNAmod structure and content

Individual modification pages visually represent the data contained within the backing database. We standardize and display all modifications in an identical format. DNAmod may omit some information, however, depending upon the extent of ChEBI’s annotations and whether the page describes a verified DNA modification or merely a candidate entry.

Modification pages begin with a header displaying the DNA modification’s ChEBI name. The top-right corner of the page lists the unmodified ancestor of the modification. For example, 5-hydroxymethyluracil is a modification of thymine (Fig. 3), whereas 6-dimethyladenine is a modification of adenine.

Each modification begins with a short textual description of its chemistry, followed by a table containing its chemical properties. We import these from ChEBI, which provides their chemical formula, net charge, and average mass.

We annotate entities with all names available from ChEBI, including: their IUPAC name, SMILES [52] string, International Chemical Identifier (InChI) and hashed InChIKey [23] strings, and common synonyms. We also provide a recommended abbreviation and in some instances a suggested single-letter symbol for bioinformatic purposes, from our proposed expanded alphabet [49] (Fig. 3).

We provide literature annotations for many DNA modifications, focusing upon those observed in vivo. We provide a list of methods that have been used to map the genomic locations of a modification (“Manual curation and annotation”). We additionally provide information on a modification’s occurrence, either naturally or only synthetically, where applicable, including some organisms in which it has been observed in vivo (“Manual curation and annotation”). Finally, each page ends with the ChEBI database reference and a ChEBI-derived list of related literature citations (Fig. 3). Our website has semantic web support, making use of the Resource Description Framework in Attributes (RDFa) [40] technique, augmented by Chemical Information Ontology (CHEMINF) [20] and PubChemRDF [15] Semanticscience Integrated Ontology (SIO) [14] annotations—providing machine-readable descriptions of key website features.

Discussion

DNAmod enables researchers to rapidly obtain information on covalently modified DNA nucleobases and assists those interested in profiling a modification. It additionally provides a reference toward standardization of modified base nomenclature and offers the potential to track recent developments within the field. We have kept DNAmod up to date for 3 years and expect to continue to maintain it, particularly as new discoveries about DNA modifications are made. We also hope that DNAmod will serve to highlight underappreciated modifications that may have substantial biological importance.

The nomenclature used to describe a particular DNA modification is often inconsistent, with some early efforts toward standardization of particular classes [11, 28]. The ChEBI name, for instance, often corresponds to the common chemical name of the compound, which is occasionally distinct from its common name within the biological literature, in the context of a DNA modification. We address this and attempt to encourage standardization by endeavouring to ensure that other names are annotated, while providing specific nomenclature recommendations. In particular, the suggested name of verified DNA modifications, as displayed on the homepage and within the recommended notation section, is always manually-curated and sometimes differs from the name assigned by ChEBI.

Our database, like many others, relies upon the ChEBI ontology. Like any large and complex endeavour, curating ChEBI is a substantial undertaking, requiring protracted deployment of expertise and effort. While ChEBI has a dedicated team of expert curators, who assiduously and continually improve ChEBI, their resources are naturally limited. Accordingly, while ChEBI has an issue tracker where we and others can suggest changes, revisions to ChEBI are highly dependent on user reports and the team’s available bandwidth. ChEBI contains a non-negligible fraction of errors and omissions, across most entity categories [30, 55]. These works highlight the substantial effort and difficulty involved in maintaining high-quality annotations. Such errors naturally propagate to its downstream databases, including our own. While we have made efforts to further curate data and report relevant issues back upstream, we do inherit some errors and limitations. As in any project of this nature, we surely have our own errors and omissions. We lack a dedicated curator; accordingly, we curate this data on a best-effort basis. DNAmod has its own issue tracker, and we would appreciate if users could report any of our own errors or omissions, so that we can address them or facilitate reporting them upstream.

The inclusion of assays available to sequence different DNA modifications provides a means of assessing and selecting a sequencing method. It additionally attempts to track sequencing methods over time, as resolution improves, and especially to highlight recent developments, like direct detection of various modifications via nanopore sequencing [50]. We annotate a sequencing method only under the base or set of bases that it directly elucidates and independently maps. This includes bases that a method detects together with another nucleobase, since many methods yield confounded mixtures. For example, 5mC and 5hmC cannot be distinguished with only conventional bisulfite sequencing. Other methods, such as various nanopore-based methods, can independently resolve different modifications. Therefore, while many use oxidative bisulfite sequencing (oxBS-seq) in combination with conventional bisulfite sequencing to elucidate 5hmC via subtraction, we only annotated it as a sequencing method for 5mC, which it directly elucidates [6]. Conversely, we only annotate TET-assisted bisulfite sequencing (TAB-seq) under 5hmC, which it directly elucidates [6], although many use it to also detect 5mC.
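The subtraction mentioned above follows from what each assay reports at a cytosine: conventional bisulfite sequencing reads 5mC and 5hmC together, whereas oxBS-seq reads 5mC alone. A minimal sketch of the per-site arithmetic is given below. The function name `estimate_5hmc` and the use of per-site modification fractions as inputs are assumptions, and clamping negative differences to zero is a simplification rather than a recommended statistical treatment.

```python
def estimate_5hmc(bs_level, oxbs_level):
    """Estimate the 5hmC fraction at one site by subtraction.

    bs_level:   fraction of reads called modified by conventional
                bisulfite sequencing (5mC + 5hmC).
    oxbs_level: fraction of reads called modified by oxBS-seq (5mC only).
    """
    difference = bs_level - oxbs_level
    # Sampling noise can make the difference negative; clamp to zero.
    return max(difference, 0.0)


site_5hmc = estimate_5hmc(0.75, 0.5)
noisy_site = estimate_5hmc(0.25, 0.5)
```

Because 5hmC is only inferred from the difference of two measurements here, it is a derived quantity, which is consistent with annotating oxBS-seq under 5mC alone.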

We demarcated bases found to occur in vivo, providing examples of organisms in which a modification has been found, along with associated citations. This merely substantiates its in vivo presence, however. We did not attempt to comprehensively list the organisms which contain any particular modification. Finally, we expect our brief annotations of the biological roles of various DNA modifications to change as further research is conducted.

Future work

We plan to keep DNAmod updated continuously, manually reviewing newly added ChEBI compounds, requesting appropriate additions to ChEBI, and curating any improvements. We also endeavour to annotate recently developed sequencing methods as we come across them.

Integrating additional external databases will further increase DNAmod’s utility. In particular, we envision potential integration with domain-specific DNA modification databases, such as those cataloguing compounds formed from the operation of particular biological pathways. For instance, modifications involved in DNA damage and repair could be linked to REPAIRtoire [31] data. We could also improve functional characterization using Gene Ontology (GO) [2] or Kyoto Encyclopedia of Genes and Genomes (KEGG) [27], but this would require extensive manual curation.

We used ChEBI Web Services [22] to obtain information from their database. ChEBI has, however, recently released a Python application programming interface (API), permitting us to directly access their data [48]. Switching from our current web-based queries to use of their API would likely result in a more robust system and expedite the database-building process.

Abbreviations

5caC:: 5-carboxylcytosine
5fC:: 5-formylcytosine
5hmC:: 5-hydroxymethylcytosine
5hmU:: 5-hydroxymethyluracil
5mC:: 5-methylcytosine
6mA:: 6-methyladenine
API:: application programming interface
ChEBI:: Chemical Entities of Biological Interest
CHEMINF:: Chemical Information Ontology
DNMT:: DNA methyltransferase
GO:: Gene Ontology
InChI:: International Chemical Identifier
IUPAC:: International Union of Pure and Applied Chemistry
KEGG:: Kyoto Encyclopedia of Genes and Genomes
oxBS-seq:: oxidative bisulfite sequencing
RDFa:: Resource Description Framework in Attributes
SIO:: Semanticscience Integrated Ontology
SMILES:: Simplified Molecular-Input Line-Entry System
TAB-seq:: TET-assisted bisulfite sequencing
TET:: ten-eleven translocation enzyme

References

Amoreira C, Hindermann W, Grunau C (2003) An improved version of the DNA methylation database (MethDB). Nucleic Acids Res 31:75–77. https://doi.org/10.1093/nar/gkg093
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. Nat Genet 25:25–29. https://doi.org/10.1038/75556
Bachman M, Uribe-Lewis S, Yang X, Burgess HE, Iurlaro M, Reik W, Murrell A, Balasubramanian S (2015) 5-formylcytosine can be a stable DNA modification in mammals. Nat Chem Biol 11:555–557. https://doi.org/10.1038/nchembio.1848
Boccaletto P, Machnicka MA, Purta E, Piatkowski P, Baginski B, Wirecki TK, de Crécy-Lagard V, Ross R, Limbach PA, Kotter A, Helm M, Bujnicki JM (2018) MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res 46:D303–D307. https://doi.org/10.1093/nar/gkx1030
Booth MJ, Marsico G, Bachman M, Beraldi D, Balasubramanian S (2014) Quantitative sequencing of 5-formylcytosine in DNA at single-base resolution. Nat Chem 6:435–440. https://doi.org/10.1038/nchem.1893
Booth MJ, Raiber EA, Balasubramanian S (2015) Chemical methods for decoding cytosine modifications in DNA. Chem Rev 115:2240–2254. https://doi.org/10.1021/cr5002904
Callahan J, Hopkins D, Weiser M, Shneiderman B (1988) An empirical comparison of pie vs. linear menus. In: O’Hare JJ (ed) Proceedings of the SIGCHI Conference on human factors in computing systems, pp 95–100. https://doi.org/10.1145/57167.57182
Cantara WA, Crain PF, Rozenski J, McCloskey JA, Harris KA, Zhang X, Vendeix FAP, Fabris D, Agris PF (2011) The RNA modification database, RNAMDB: 2011 update. Nucleic Acids Res 39:D195–D201. https://doi.org/10.1093/nar/gkq1028
Chen K, Zhao BS, He C (2016) Nucleic acid modifications in regulation of gene expression. Cell Chem Biol 23:74–85. https://doi.org/10.1016/j.chembiol.2015.11.007
Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25:1422–1423. https://doi.org/10.1093/bioinformatics/btp163
Cooke MS, Loft S, Olinski R, Evans MD, Bialkowski K, Wagner JR, Dedon PC, Møller P, Greenberg MM, Cadet J (2010) Recommendations for standardized description of and nomenclature concerning oxidatively damaged nucleobases in DNA. Chem Res Toxicol 23:705–707. https://doi.org/10.1021/tx1000706
Dantas Machado AC, Zhou T, Rao S, Goel P, Rastogi C, Lazarovici A, Bussemaker HJ, Rohs R (2014) Evolving insights on how cytosine methylation affects protein-DNA binding. Brief Funct Genom 14:61–73. https://doi.org/10.1093/bfgp/elu040
Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcántara R, Darsow M, Guedj M, Ashburner M (2008) ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res 36:D344–D350. https://doi.org/10.1093/nar/gkm791
Dumontier M, Baker CJ, Baran J, Callahan A, Chepelev L, Cruz-Toledo J, Del Rio NR, Duck G, Furlong LI, Keath N, Klassen D, McCusker JP, Queralt-Rosinach N, Samwald M, Villanueva-Rosales N, Wilkinson MD, Hoehndorf R (2014) The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery. J Biomed Semant 5:14. https://doi.org/10.1186/2041-1480-5-14
Fu G, Batchelor C, Dumontier M, Hastings J, Willighagen E, Bolton E (2015) PubChemRDF: towards the semantic annotation of PubChem compound and substance databases. J Cheminf 7:34. https://doi.org/10.1186/s13321-015-0084-4
Gerhard H (2016) sqlite3. https://docs.python.org/2/library/sqlite3.html
Gommers-Ampt JH, Borst P (1995) Hypermodified bases in DNA. FASEB J 9:1034–1042. https://doi.org/10.1096/fasebj.9.11.7649402
Grosjean H (2009) Nucleic acids are not boring long polymers of only four types of nucleotides: a guided tour. In: Grosjean H (ed) DNA and RNA modification enzymes: structure, mechanism, function and evolution. Landes Bioscience, Austin, TX, pp 1–18
Hardisty RE, Kawasaki F, Sahakyan AB, Balasubramanian S (2015) Selective chemical labeling of natural T modifications in DNA. J Am Chem Soc 137:9270–9272. https://doi.org/10.1021/jacs.5b03730
Hastings J, Chepelev L, Willighagen E, Adams N, Steinbeck C, Dumontier M (2011) The Chemical Information Ontology: provenance and disambiguation for chemical data on the biological semantic web. PLOS One 6(10):e25513. https://doi.org/10.1371/journal.pone.0025513
Hastings J, de Matos P, Dekker A, Ennis M, Harsha B, Kale N, Muthukrishnan V, Owen G, Turner S, Williams M, Steinbeck C (2013) The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 41:D456–D463. https://doi.org/10.1093/nar/gks1146
Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C (2016) ChEBI in 2016: improved services and an expanding collection of metabolites. Nucleic Acids Res 44:D1214–D1219. https://doi.org/10.1093/nar/gkv1031
Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D (2015) InChI, the IUPAC international chemical identifier. J Cheminf 7:23. https://doi.org/10.1186/s13321-015-0068-4
Heyn H, Esteller M (2015) An adenine code for DNA: a second life for N6-methyladenine. Cell 161:710–713. https://doi.org/10.1016/j.cell.2015.04.021
Hipp DR, Kennedy D, Mistachkin J (2000–2018) SQLite. https://www.sqlite.org
Iurlaro M, McInroy GR, Burgess HE, Dean W, Raiber EA, Bachman M, Beraldi D, Balasubramanian S, Reik W (2016) In vivo genome-wide profiling reveals a tissue-specific role for 5-formylcytosine. Genome Biol 17:141. https://doi.org/10.1186/s13059-016-1001-5
Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30. https://doi.org/10.1093/nar/28.1.27
Khromov-Borisov NN (1997) Naming the mutagenic nucleic acid base analogs: the Galatea syndrome. Mutat Res 379:95–103. https://doi.org/10.1016/S0027-5107(97)00112-7
Korlach J, Turner SW (2012) Going beyond five bases in DNA sequencing. Curr Opin Struct Biol 22:251–261. https://doi.org/10.1016/j.sbi.2012.04.002
Liu H, Chen L, Zheng L, Perl Y, Geller J (2018) A quality assurance methodology for ChEBI ontology focusing on uncommonly modeled concepts. In: Jaiswal P, Cooper L, Haendel MA, Mungall CJ (eds) Proceedings of the 9th international conference on biological ontology (ICBO), Corvallis, OR, USA, vol 2285. http://ceur-ws.org/Vol-2285/ICBO_2018_paper_7.pdf
Milanowska K, Krwawicz J, Papaj G, Kosiński J, Poleszak K, Lesiak J, Osińska E, Rother K, Bujnicki JM (2011) REPAIRtoire–a database of DNA repair pathways. Nucleic Acids Res 39:D788–D792. https://doi.org/10.1093/nar/gkq1087
NCBI Resource Coordinators (2018) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 46:D8–D13. https://doi.org/10.1093/nar/gkx1095
O’Boyle NM, Morley C, Hutchison GR (2008) Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit. Chem Cent J 2:5. https://doi.org/10.1186/1752-153X-2-5
O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminf 3:33. https://doi.org/10.1186/1758-2946-3-33
Ortel J, Noehr J, van Gheem N (2011) suds. https://pypi.org/project/suds
Otto M, Thornton J, Rebert C, Thilo J, XhmikosR, Fenkart H, Lauke PH, et al (2011–2018) Bootstrap. http://getbootstrap.com
Pachter L (2013) *Seq. https://liorpachter.wordpress.com/seq/
Pfaffeneder T, Spada F, Wagner M, Brandmayr C, Laube SK, Eisen D, Truss M, Steinbacher J, Hackner B, Kotljarova O, Schuermann D, Michalakis S, Kosmatchev O, Schiesser S, Steigenberger B, Raddaoui N, Kashiwazaki G, Müller U, Spruijt CG, Vermeulen M, Leonhardt H, Schär P, Müller M, Carell T (2014) Tet oxidizes thymine to 5-hydroxymethyluracil in mouse embryonic stem cell DNA. Nat Chem Biol 10:574–581. https://doi.org/10.1038/nchembio.1532
Plongthongkum N, Diep DH, Zhang K (2014) Advances in the profiling of DNA modifications: cytosine methylation and beyond. Nat Rev Genet 15:647–661. https://doi.org/10.1038/nrg3772
RDFa Working Group (2015) RDFa 1.1 primer—third edition. W3C Working Group Note, https://www.w3.org/TR/2015/NOTE-rdfa-primer-20150317/
Roberts RJ, Vincze T, Posfai J, Macelis D (2015) REBASE–a database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Res 43:D298–D299. https://doi.org/10.1093/nar/gku1046
Ronacher A (2008) Jinja2 (the Python template engine). http://jinja.pocoo.org
Rothbart SB, Strahl BD (2014) Interpreting the language of histone and DNA modifications. Biochim Biophys Acta, Gene Regul Mech 1839:627–643. https://doi.org/10.1016/j.bbagrm.2014.03.001
Rother K, Papaj G, Bujnicki JM (2009) Databases of DNA modifications. In: Grosjean H (ed) DNA and RNA Modification enzymes: structure, mechanism, function and evolution. Landes Bioscience, Austin, TX, pp 622–623
Song CX, Yi C, He C (2012) Mapping recently identified nucleotide variants in the genome and transcriptome. Nat Biotechnol 30:1107–1116. https://doi.org/10.1038/nbt.2398
Song CX, Szulwach KE, Dai Q, Fu Y, Mao SQ, Lin L, Street C, Li Y, Poidevin M, Wu H, Gao J, Liu P, Li L, Xu GL, Jin P, He C (2013) Genome-wide profiling of 5-formylcytosine reveals its roles in epigenetic priming. Cell 153:678–691. https://doi.org/10.1016/j.cell.2013.04.001
Song W (2012–2018) Elasticlunr.js. http://elasticlunr.com
Swainston N, Hastings J, Dekker A, Muthukrishnan V, May J, Steinbeck C, Mendes P (2016) libChEBI: an API for accessing the ChEBI database. J Cheminf 8:11. https://doi.org/10.1186/s13321-016-0123-9
Viner C, Johnson J, Walker N, Shi H, Sjöberg M, Adams DJ, Ferguson-Smith AC, Bailey TL, Hoffman MM (2016) Modeling methyl-sensitive transcription factor motifs with an expanded epigenetic alphabet. bioRxiv 043794, https://doi.org/10.1101/043794
Wallace EVB, Stoddart D, Heron AJ, Mikhailova E, Maglia G, Donohoe TJ, Bayley H (2010) Identification of epigenetic DNA modifications with a protein nanopore. Chem Commun 46:8195–8197. https://doi.org/10.1039/c0cc02864a
Weigele P, Raleigh EA (2016) Biosynthesis and function of modified bases in bacteria and their viruses. Chem Rev 116:12655–12687. https://doi.org/10.1021/acs.chemrev.6b00114
Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Model 28(1):31–36. https://doi.org/10.1021/ci00057a005
Wu H, Zhang Y (2014) Reversing DNA methylation: mechanisms, genomics, and biological functions. Cell 156:45–68. https://doi.org/10.1016/j.cell.2013.12.019
Xuan JJ, Sun WJ, Lin PH, Zhou KR, Liu S, Zheng LL, Qu LH, Yang JH (2018) RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res 46:D327–D334. https://doi.org/10.1093/nar/gkx934
Yumak H, Chen L, Halper M, Zheng L, Perl Y, Elhanan G (2016) A quality-assurance study of ChEBI. In: Jaiswal P, Hoehndorf R, Arighi CN, Meier A (eds) Proceedings of the joint international conference on biological ontology and biocreative, Corvallis, Oregon, USA, vol 1747. http://ceur-ws.org/Vol-1747/IT701_ICBO2016.pdf
Zhang Y, Lv J, Liu H, Zhu J, Su J, Wu Q, Qi Y, Wang F, Li X (2010) HHMD: the human histone modification database. Nucleic Acids Res 38:D149–D154. https://doi.org/10.1093/nar/gkp968

Authors’ contributions

Conceptualization, MMH; Methodology, AJS, CV, and MMH; Software, AJS and CV; Resources, MMH; Data Curation, AJS and CV; Writing—Original Draft, AJS and CV; Writing—Review & Editing, AJS, CV, and MMH; Visualization, AJS, CV, and MMH; Funding Acquisition, MMH; Supervision, CV and MMH. All authors read and approved the final manuscript.

Acknowledgements

We thank Daniel D. De Carvalho and Christopher E. Mason for helpful feedback on early versions of DNAmod. We thank the creators of ChEBI [13], and all those who have worked to improve it [21, 22, 48]. In particular, we thank Gareth Owen, Steve Turner, and Marcus Ennis for actively responding to curation requests and Venkatesh Muthukrishnan for managing ChEBI issues. We thank Egon L. Willighagen for useful suggestions in a PubPeer review of an early version of this work. We thank Carl Virtanen, Qun Jin, and Zhibin Lu for technical assistance.

Competing interests

The authors declare that they have no competing interests.

Availability of data and materials

The DNAmod website, including a description and contact information, and the backing SQLite database are freely available at https://dnamod.hoffmanlab.org. Python source code, web assets, and an issue tracker for this project are available at https://bitbucket.org/hoffmanlab/dnamod. To ensure persistent availability, we have deposited current and previous versions of our code (https://doi.org/10.5281/zenodo.640631) and SQLite database (https://doi.org/10.5281/zenodo.640561) in Zenodo. All source code and web assets are licensed under the GNU General Public License, version 2 (GPLv2). DNAmod’s data is licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0).
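
Because the database is distributed as a single SQLite file, a downloaded copy can be inspected with standard tooling. The following is a minimal sketch, using only Python's built-in sqlite3 module, of how one might list the tables and columns in such a file. It makes no assumption about DNAmod's actual schema: the helper names list_tables and table_columns are hypothetical and are not part of the DNAmod code base.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def _connect_read_only(db_path):
    """Open an existing SQLite file read-only, so a mistyped path raises
    an error instead of silently creating an empty database."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def list_tables(db_path):
    """Return the names of all tables in the SQLite file, sorted by name."""
    with closing(_connect_read_only(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def table_columns(db_path, table):
    """Return the column names of one table, in declaration order."""
    # PRAGMA statements cannot take bound parameters, so the table name is
    # checked against the file's own table list before being interpolated.
    if table not in list_tables(db_path):
        raise ValueError(f"no such table: {table}")
    with closing(_connect_read_only(db_path)) as conn:
        rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in rows]
```

Calling list_tables on the path of a downloaded copy (the local file name is whatever the user chose when saving it) returns the table names, and table_columns can then be applied to each one to explore the schema before writing queries.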

Funding

This work was supported by the University of Toronto Undergraduate Research Opportunities Program (to AJS), the Natural Sciences and Engineering Research Council of Canada (RGPIN-2015-03948 to MMH and Alexander Graham Bell Canada Graduate Scholarships to CV), the Canadian Institutes of Health Research (201512MSH-360970 to MMH), the Canadian Cancer Society (703827 to MMH), the Ontario Ministry of Training, Colleges and Universities (Ontario Graduate Scholarships to CV), the Ontario Institute for Cancer Research through funding provided by the Government of Ontario (CSC-FR-UHN to John E. Dick), the Ontario Ministry of Research, Innovation and Science (ER-15-11-223 to MMH), the University of Toronto McLaughlin Centre (MC-2015-16 to MMH), and the Princess Margaret Cancer Foundation.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Ankur Jai Sood and Coby Viner contributed equally to this work.

Authors and Affiliations

Department of Medical Biophysics, University of Toronto, Princess Margaret Cancer Research Tower 15-701, 101 College Street, Toronto, ON, M5G 1L7, Canada
Ankur Jai Sood & Michael M. Hoffman
Princess Margaret Cancer Centre, Princess Margaret Cancer Research Tower 11-311, 101 College Street, Toronto, ON, M5G 1L7, Canada
Ankur Jai Sood, Coby Viner & Michael M. Hoffman
Department of Computer Science, University of Toronto, Sandford Fleming Building 3302, 10 King’s College Road, Toronto, ON, M5S 3G4, Canada
Coby Viner & Michael M. Hoffman
Vector Institute, MaRS Centre, West Tower, Suite 710, 661 University Avenue, Toronto, ON, M5G 1M1, Canada
Michael M. Hoffman

Authors

Ankur Jai Sood
Coby Viner
Michael M. Hoffman

Corresponding author

Correspondence to Michael M. Hoffman.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Sood, A.J., Viner, C. & Hoffman, M.M. DNAmod: the DNA modification database. J Cheminform 11, 30 (2019). https://doi.org/10.1186/s13321-019-0349-4

Received: 18 September 2018
Accepted: 25 March 2019
Published: 23 April 2019
DOI: https://doi.org/10.1186/s13321-019-0349-4
