TCMSID is composed of five fields: TCM category, TCM herb, ingredient, target and drug (Fig. 1). Detailed information for each field was integrated from other relevant databases, text mining of published articles and prediction tools such as ADMETlab [14, 15]. Because these fields are interlinked, users can start a query from a keyword in any field and follow the corresponding links to retrieve related information as needed. To support TCM simplification and mechanism analysis, the representative key ingredients of an herb, i.e. those that exert its pharmacological action, can be screened out. The identification method is based on detailed information about the ingredients, mainly significance degree, ADME/T properties, physicochemical properties, structural reliability and structural characteristics. A multilevel functional network can then be built from the resulting key ingredients, their reliable targets and information on drugs that are structurally similar to them. This network bridges the gap between TCM and modern medicine. Below, we describe the detailed information and the acquisition process used to construct TCMSID, as summarized in Fig. 1.
Data processing and implementation
Herbal ingredients
To ensure broad coverage of the database, 499 frequently used and approved TCM herbs were collected from the Pharmacopoeia of the People’s Republic of China (2015 version). A TCM herb typically contains hundreds of compounds and can even be regarded as a small compound library; however, not all of these ingredients are pharmacologically active. To extract the major active ingredients of the TCM herbs, more than 1500 Chinese-language articles on these herbs were retrieved from China National Knowledge Infrastructure (CNKI) (http://www.cnki.net/), since TCM is widely used and researched in China and the related results are mainly published in Chinese. Ingredients with high content and activity were extracted by manual mining of this literature. The herbal ingredients from those publications, together with those from other web-based databases including TCMSP and SymMap [16], form the data foundation of TCMSID. Each ingredient is assigned a significance degree of 0, 1 or 2; the smaller the number, the higher the significance. The value is assigned using two criteria: bioactivity data, taken from the referenced literature, and the minimum volume of a compound per unit required to exert a pharmacological effect, taken from the Pharmacopoeia of the People’s Republic of China (2015 version). A compound that satisfies both the bioactivity and the minimum volume per unit criteria is assigned a significance degree of 0; a compound that satisfies only one of the two criteria is assigned 1; and a compound that satisfies neither criterion is assigned 2.
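The assignment rule for the significance degree can be written as a small function. This is only an illustrative sketch: the function name and the two boolean inputs are assumptions, and deciding whether a compound meets each criterion still relies on the manually curated literature and Pharmacopoeia data.

```python
def significance_degree(meets_bioactivity: bool, meets_min_volume: bool) -> int:
    """Return the significance degree (0, 1 or 2); smaller means more significant.

    Hypothetical helper: the two flags stand for the outcome of the manual
    checks against the bioactivity and minimum-volume-per-unit criteria.
    """
    criteria_met = int(meets_bioactivity) + int(meets_min_volume)
    # Both criteria met -> 0, exactly one -> 1, neither -> 2.
    return 2 - criteria_met


if __name__ == "__main__":
    for bio in (True, False):
        for vol in (True, False):
            print(bio, vol, significance_degree(bio, vol))
```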
The data of bioactivity and minimum volume per unit for the 499 TCM herbs was manually collected from the Pharmacopoeia and literature, following which the significance degree values of all the ingredients were assigned according to the aforementioned criteria. Details about those ingredients, such as name, structure etc., were comprehensively retrieved from PubChem (PUG-REST interface) automatically [17, 18], where the structure files in multiple formats (sdf, mol, SMILES etc.) were eventually converted into canonical SMILES using OpenBabel (version 2.4.1). The duplicates were removed according to InChIKey.
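Duplicate removal by InChIKey can be sketched as follows. The record layout (dictionaries with `name` and `inchikey` fields) and the keep-first policy are assumptions made for illustration; the InChIKeys are assumed to have been computed beforehand, for example while standardizing the structures.

```python
from typing import Dict, Iterable, List


def deduplicate_by_inchikey(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record seen for each InChIKey (illustrative policy)."""
    unique: Dict[str, Dict[str, str]] = {}
    for record in records:
        # Normalize whitespace and case so trivially different keys still match.
        key = record["inchikey"].strip().upper()
        unique.setdefault(key, record)
    return list(unique.values())


if __name__ == "__main__":
    # Illustrative records only; the same compound appears under two names.
    records = [
        {"name": "aspirin", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
        {"name": "acetylsalicylic acid", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
        {"name": "caffeine", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},
    ]
    for record in deduplicate_by_inchikey(records):
        print(record["name"])
```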
ADME/T-related properties
To improve the quality of the database, we conducted an in-depth analysis of each ingredient. First, a set of pivotal drug-likeness properties was computed with our previously developed tools ADMETlab (http://admet.scbdd.com) and ADMETlab 2.0 (https://admetmesh.scbdd.com/). These included ADME/T parameters such as Caco-2 permeability (Caco-2), bioavailability (F-30), plasma protein binding (PPB), blood-brain barrier (BBB) penetration, half-life (T1/2), clearance (CL), hERG inhibition (hERG), human hepatotoxicity (HHT) and drug-likeness (DL), as well as basic physicochemical parameters such as molecular weight (MW), LogP and LogS. Unlike most property calculation tools, ADMETlab and its updated version are ADME/T evaluation platforms that integrate as many ADME/T properties and basic physicochemical endpoints as possible, providing an overall understanding of query compounds and facilitating the drug discovery process.
Before compounds are investigated further in vitro, ADME/T-related and basic physicochemical properties are commonly used for fast preliminary filtering. ADME/T-related properties determine whether a molecule will reach its site of action in the body and how long it will stay in the bloodstream, while basic physicochemical properties are closely related to drug-likeness. Property evaluation is now routinely carried out at the early stage of drug discovery to reduce the attrition rate [19, 20], and the evaluation of pharmacokinetic and physicochemical properties is an important prerequisite for filtering key ingredients. Accordingly, only major active ingredients that exhibit favorable pharmacokinetic and physicochemical properties are expected to exert potential biological effects.
Ingredient structural classification
To further improve the quality of the database, the structures of all ingredients were analyzed in more detail, since the immense structural diversity of herbal ingredients is the source of their wide variety of biological activities and the fundamental basis for their use in drug design. The ClassyFire web server [21], an automated chemical classification tool, was used to refine the structural classification of all ingredients layer by layer. For instance, matrine, an alkaloid and a key active ingredient of the herb Sophora flavescens, was grouped under alkaloids and derivatives, lupin alkaloids, and matrine alkaloids.
Ingredient structural reliability evaluation
From the perspective of structural quality, the structures of ingredients may not be sufficiently trustworthy because of the diverse data sources, which can seriously hinder TCM research. To evaluate the structural reliability of each ingredient for accurate analysis, reliability annotations indicating structural quality were obtained with a semi-automated quality-checking workflow [22]; ingredients that failed to meet the criteria were retained and marked with their structural reliability. The workflow takes the chemical name and CAS number of a Chinese medicine ingredient as input, retrieves data from several ingredient databases such as PubChem, and evaluates the quality of the ingredient data by comparing the consistency of the results returned by the two search methods. The structural reliability ranges from 1 to 5: 1 to 3 indicate relatively high structural reliability, with 1 the highest; 4 stands for unknown reliability; and 5 indicates poor reliability. For chemicals with unknown reliability, we performed additional manual inspection and information correction, and then rescored the corrected chemicals following the workflow (Fig. 2a).
Ingredient target information
To acquire reliable targets for mechanism exploration, target prediction was performed by combining several target prediction tools: SEA [23], SwissTargetPrediction [24], HitpickV2 [25], PPB [26], PPB2 [27] and ChEMBL [28]. We introduced an occurrence frequency parameter, which describes how often a target is predicted across the different tools. The higher the occurrence frequency of a target, the higher its rank. For each prediction tool, only the top 15 predicted targets were retained according to the occurrence frequency parameter.
Comprehensive information for 3270 target proteins was collected from ChEMBL [29]. Detailed annotation of all targets, including identification names, functional descriptions and cross-reference IDs, was obtained by ID conversion through UniProt [30]. In addition, the target proteins in this database were classified into different homologous families, such as enzymes and ion channels, using the ChEMBL target annotation (Fig. 2b). Furthermore, from a clinical, chemical and biological standpoint, the development level of these targets was divided into Tclin (clinic), Tchem (chemistry), Tbio (biology) and Tdark (dark genome) using the target development level (TDL) classification scheme developed by Oprea et al. (Fig. 2c) [31].
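One way to read the occurrence frequency parameter is as the number of tools whose retained top-ranked list contains a given target. The sketch below follows that reading; the function name, the input layout (a ranked target list per tool) and the alphabetical tie-breaking are assumptions, not the exact TCMSID implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple

TOP_N = 15  # number of predictions retained per tool, as stated in the text


def occurrence_frequency(
    predictions: Dict[str, List[str]], top_n: int = TOP_N
) -> List[Tuple[str, int]]:
    """Count in how many tools each target appears among the top_n predictions.

    `predictions` maps a tool name to its targets, best-ranked first.
    Returns (target, frequency) pairs, most frequently predicted first.
    """
    counts: Counter = Counter()
    for ranked_targets in predictions.values():
        # Drop repeats within one tool (keeping order), then truncate.
        kept = list(dict.fromkeys(ranked_targets))[:top_n]
        counts.update(kept)
    # Sort by descending frequency; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Placeholder target identifiers, for illustration only.
    example = {
        "SEA": ["P1", "P2", "P2"],
        "SwissTargetPrediction": ["P2", "P3"],
        "HitPick": ["P2", "P1"],
    }
    print(occurrence_frequency(example))
```

Targets predicted by several tools rise to the top of the list, which is the behavior the occurrence frequency parameter is meant to capture.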
Drug-related information
To further clarify TCM functions from the perspective of modern medicine, we linked TCM ingredients to drugs through chemical similarity. The drugs in TCMSID were collected from the DrugBank database [32], which included a total of 10,450 known drugs (of which 3883 are FDA-approved), together with drug-related information such as drug names, structures and drug targets.
Both FCFP6 and ECFP4 fingerprints were adopted to represent all ingredients and drugs, since circular fingerprints, especially FCFP6 and ECFP4, have been reported to perform better in TCM ingredient similarity searches [33, 34]. The Tanimoto coefficient (Tc), a common measure of 2D similarity, was applied to quantify the structural similarity between each pair of compounds. Tc = 0.85 and Tc = 0.5 were taken as the thresholds for high and medium similarity, respectively, between query molecules and drugs. Finally, the structural similarity of each pair was determined by comparing the results from the two fingerprints; where the two results conflicted, the lower similarity level was adopted as the final outcome. Fingerprints and chemical similarity were calculated using the CDK Fingerprints and Similarity Search nodes of KNIME (version 3.7.2), respectively [35].
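The similarity classification can be illustrated with a short sketch. It assumes that fingerprints are available as sets of "on" bit positions, that the thresholds are inclusive, and that pairs below Tc = 0.5 are labeled "low"; these choices and all function names are illustrative and are not taken from the KNIME workflow.

```python
from typing import Iterable

LEVELS = ("low", "medium", "high")  # ordered from least to most similar


def tanimoto(bits_a: Iterable[int], bits_b: Iterable[int]) -> float:
    """Tanimoto coefficient Tc = |A & B| / |A | B| for two sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similarity_level(tc: float, high: float = 0.85, medium: float = 0.5) -> str:
    """Map a Tc value to a similarity level (inclusive thresholds assumed)."""
    if tc >= high:
        return "high"
    if tc >= medium:
        return "medium"
    return "low"


def consensus_level(tc_fcfp6: float, tc_ecfp4: float) -> str:
    """Combine the two fingerprint results, keeping the lower level on conflict."""
    levels = (similarity_level(tc_fcfp6), similarity_level(tc_ecfp4))
    return min(levels, key=LEVELS.index)


if __name__ == "__main__":
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
    print(consensus_level(0.90, 0.60))
```

Taking the lower of the two levels is a conservative choice: a pair is reported as highly similar only when both fingerprints agree.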
Functionalities - mechanism exploration of TCM herbs
For mechanism exploration of TCM herbs, TCMSID provides TCM simplification to clarify mechanisms, which comprises two key steps: key ingredient filtering and target identification (Fig. 3). The key ingredients, as the fundamental material basis of TCM, are the few ingredients that can, to a certain extent, replace a TCM in exerting effective pharmacological activity. Key ingredients should have high activity and content. They should also exhibit favorable pharmacokinetic and physicochemical properties in order to exert potential biological effects. Moreover, given the significant role of molecular structure in pharmacological activity, the structural characteristics and reliability of herbal ingredients should be considered as well. TCMSID provides integrated information for each herbal ingredient, including significance degree, ADME/T and physicochemical properties, structural reliability and structural characteristics. Key ingredients can be filtered in a customized way by setting threshold ranges for these properties, guided by the parameter details and filtering criteria provided by TCMSID.
Reliable target proteins are at the core of mechanism research aimed at promoting the modernization of TCM herbs. In recent years, in silico target prediction methods have been regarded as an effective alternative to experimental target identification because they are convenient and less time-consuming. However, a single target prediction method is likely to produce inaccurate or biased results. It is therefore more beneficial to combine several target prediction methods so that different theoretical foundations are taken into account.
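Custom filtering of key ingredients by threshold ranges could look like the sketch below. The record fields, default thresholds and example values are all hypothetical and chosen only for illustration; in practice, users would set the ranges according to the filtering criteria documented in TCMSID.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Ingredient:
    """Hypothetical ingredient record with a few of the available properties."""
    name: str
    significance_degree: int      # 0 to 2, smaller is more significant
    structural_reliability: int   # 1 to 5, 1 is most reliable
    molecular_weight: float
    logp: float


def in_range(value: float, bounds: Tuple[float, float]) -> bool:
    low, high = bounds
    return low <= value <= high


def filter_key_ingredients(
    ingredients: Iterable[Ingredient],
    max_significance: int = 1,                       # illustrative default
    max_reliability: int = 3,                        # illustrative default
    mw_range: Tuple[float, float] = (0.0, 500.0),    # illustrative default
    logp_range: Tuple[float, float] = (-2.0, 5.0),   # illustrative default
) -> List[Ingredient]:
    """Keep ingredients that satisfy every user-defined threshold."""
    return [
        ing for ing in ingredients
        if ing.significance_degree <= max_significance
        and ing.structural_reliability <= max_reliability
        and in_range(ing.molecular_weight, mw_range)
        and in_range(ing.logp, logp_range)
    ]


if __name__ == "__main__":
    # Made-up records for demonstration, not real TCMSID entries.
    candidates = [
        Ingredient("ingredient_A", 0, 1, 248.4, 1.2),
        Ingredient("ingredient_B", 2, 1, 302.2, 2.0),
        Ingredient("ingredient_C", 1, 5, 410.0, 3.1),
        Ingredient("ingredient_D", 1, 2, 780.9, 0.4),
    ]
    print([ing.name for ing in filter_key_ingredients(candidates)])
```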
To explore the mechanisms of TCM herbs, the reliable targets of key ingredients can be obtained and aggregated by multi-tool target prediction. Using the occurrence frequency parameter and the detailed target information provided by TCMSID, the potential targets through which TCM herbs exert their pharmacological effects can also be screened out. In addition, TCMSID provides ingredient-related drug information, such as the therapeutic effects and known targets of the linked drugs, to bridge the gap between TCM herbs and modern drugs through chemical similarity calculation. Finally, the mechanism of action of herbal ingredients can be inferred from the multilevel herb-ingredient-target-drug network constructed in the network visualization interface.