As the drug-repurposing strategy applied here is mainly conceived for educational purposes, we introduce a step-by-step tutorial for the guided development of a KNIME workflow. In addition, the workflow developed herein is fully versatile and can thus be adapted to other diseases of interest. Basic knowledge of the configuration and execution of standard nodes (e.g., the ‘Row Filter’ node, the ‘GroupBy’ node, the ‘Joiner’ node, the ‘Pivoting’ node), the import of external data sets into a KNIME workflow (e.g., the ‘SDF Reader’ node, the ‘File Reader’ node), the handling of different structural formats, and working with specific data types in KNIME is expected as a prerequisite.
When integrating data from diverse sources, it is beneficial to query databases programmatically, i.e., without laborious manual data download and integration. UniProtKB and the other databases used in this example enable targeted access to the stored data through an Application Programming Interface (API).
In the KNIME workflow discussed herein, a triad of KNIME nodes is executed consecutively (1) to specify the API request (via the ‘String Manipulation’ node), (2) to retrieve data from the web services (via the ‘GET Request’ node), and (3) to perform XPath or JSONPath queries to extract useful properties for a given protein (via the ‘XPath’ or ‘JSON Path’ node, respectively). The corresponding part of the KNIME workflow is depicted in Fig. 2.
1. Step: Mapping target identifiers of the Open Targets Platform to UniProt
The workflow discussed herein allows two different types of input: (1) automated retrieval of targets associated with a certain disease via the Open Targets Platform and (2) import of an external data set with a list of protein targets.
In option (1), the disease identifiers from the Open Targets Platform for GLUT-1 deficiency syndrome (Orphanet_71277) and COVID-19 (MONDO_0100096) are specified as input in the ‘Table Creator’ node. Next, an API request to fetch disease records is created using the ‘String Manipulation’ node. The join() function in the ‘String Manipulation’ node is used, and the corresponding Open Targets Platform disease ID is forwarded to the string as a variable (the $disease_id$ column). Additional parameters used in this API request are the maximum number of associated drug targets (‘size’, here set to 10,000) and the association score, which allows prioritizing the drug targets on the basis of their available evidence for a disease (‘scorevalue_min’, here set to 0.99): join("https://platform-api.opentargets.io/v3/platform/public/association/filter?disease=",$disease_id$,"&size=10000&scorevalue_min=0.99").
As output of the ‘String Manipulation’ node, a column with the respective API requests is appended to the table, for example: https://platform-api.opentargets.io/v3/platform/public/association/filter?disease=EFO_0001360&size=10000&scorevalue_min=0.99.
By executing the API request (via the ‘GET Request’ node), a JSON file is downloaded from the Open Targets Platform and appended to the output table as a separate column. Additionally, columns reporting the content type (here ‘application/json’) and the HTTP status code are appended (Fig. 3). There are five classes of HTTP status codes: (1) informational responses (100–199), (2) successful responses (200–299), (3) redirects (300–399), (4) client errors (400–499), and (5) server errors (500–599). The reported status of the request can be used to filter out failed data entries. It is recommended to increase the timeout in the ‘GET Request’ configuration, as the default setting (2 s) is usually insufficient to receive all requested data.
Subsequently, the ‘JSON Path’ node is used to extract the information of interest by querying different JSON objects. The ‘JSON Path’ node allows creating JSONPath queries in both dot notation and bracket notation (depending on how the properties of an object are specified in the syntax). Here, the bracket notation is applied to extract target identifiers, target names, and gene symbols by using the following JSONPath queries:
$['data'][*]['target']['id']
$['data'][*]['target']['gene_info']['name']
$['data'][*]['target']['gene_info']['symbol']
Output values are appended to separate cells as a collection data type. The ‘Ungroup’ node is subsequently used to transform collections of values into individual rows.
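For readers who prefer to test such calls outside of KNIME, the following minimal Python sketch emulates the node triad for this step. It assumes the Open Targets endpoint quoted above is still reachable in this form (the platform API has evolved over time), uses the GLUT-1 deficiency syndrome identifier from above, and replaces the ‘JSON Path’ node with plain dictionary access:

import requests

disease_id = "Orphanet_71277"  # GLUT-1 deficiency syndrome (see above)

# (1) 'String Manipulation' equivalent: assemble the API request
url = ("https://platform-api.opentargets.io/v3/platform/public/association/filter"
       f"?disease={disease_id}&size=10000&scorevalue_min=0.99")

# (2) 'GET Request' equivalent: a generous timeout replaces the 2 s default
response = requests.get(url, timeout=120)

rows = []
if response.ok:  # keep successful responses only (HTTP 2xx)
    payload = response.json()
    # (3) 'JSON Path' + 'Ungroup' equivalent: one row per associated target
    for rec in payload.get("data", []):
        rows.append({
            "target_id": rec["target"]["id"],
            "target_name": rec["target"]["gene_info"]["name"],
            "gene_symbol": rec["target"]["gene_info"]["symbol"],
        })

print(rows[:3])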
Next, cross-references for all human target entries in the Open Targets Platform can be fetched via the UniProt web services. Here, a corresponding API request was executed to retrieve the mappings for targets (UniProt target IDs are mapped to Open Targets Platform target IDs): https://www.uniprot.org/uniprot/?query=organism:9606+AND+database:OpenTargets&format=xls&columns=id,database(OpenTargets),reviewed.
To avoid overloading the workflow, we recommend downloading the mapping file (XLS format), importing it into the workflow via the ‘File Reader’ node, and joining the two data sets via the ‘Joiner’ node.
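As a sketch of this mapping step outside of KNIME, the snippet below reads a previously downloaded tab-separated mapping file and joins it to the target table; the file name and the column header ‘Cross-reference (OpenTargets)’ are assumptions that depend on the actual UniProt export:

import pandas as pd

# target table as produced in the previous sketch (placeholder row shown here)
targets = pd.DataFrame([{"target_id": "ENSG00000000000", "gene_symbol": "GENE1"}])

# mapping file downloaded from the UniProt query above (file name assumed)
mapping = pd.read_csv("uniprot_opentargets_mapping.tsv", sep="\t")

# 'Joiner' equivalent: match Open Targets target IDs to UniProt accessions;
# the mapping column name is an assumption and may differ in the actual export
merged = targets.merge(mapping, how="inner",
                       left_on="target_id",
                       right_on="Cross-reference (OpenTargets)")
print(merged.head())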
Option (2) is to use a user-specified list of UniProt IDs in a data table format. In this contribution, this step is exemplified by the use case of proteins listed as being of potential interest for treating COVID-19 (53 entries, available at https://covid-19.uniprot.org/uniprotkb?query=*). The CSV/TSV file is read in by a ‘File Reader’ node.
2. Step: Retrieving protein–ligand structural data from the Protein Data Bank
UniProt IDs for targets of interest were used to retrieve available protein–ligand complexes stored in the Protein Data Bank (PDB) [29].
Based on the same strategy as in step 1, a column with the respective API requests is appended to the output table. An example of such an API request looks like this: https://www.uniprot.org/uniprot/F8W8F0.xml.
When executing the workflow with the COVID-19 pre-release data provided by UniProtKB, the API request has to be adapted in the following manner: https://www.ebi.ac.uk/uniprot/api/covid-19/uniprotkb/accession/O15393.xml.
By executing the API requests (via the ‘GET Request’ node), the XML file is downloaded from UniProt and appended to the output table as an XML cell. Similar to the ‘JSON Path’ node described in the previous step, the ‘XPath’ node (XPath 1.0) is used to extract the information of interest by querying different XML elements and their associated XML attributes. An XPath query can be defined within the ‘XPath’ node from scratch; alternatively, double-clicking on a specific section in the XML-Cell Preview table generates the XPath query automatically. The XPath query below is used to retrieve all available PDB IDs for a given UniProt ID:
/dns:uniprot/dns:entry/dns:dbReference[@type='PDB']/@id
The ‘dns’ prefix corresponds to the namespace used in the XPath query. Here, ‘http://uniprot.org/uniprot’ is used as the namespace. Namespaces are defined automatically and are listed in the node configuration.
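The same query can be reproduced outside of KNIME, for instance with Python and lxml, as sketched below for the example accession from above; the UniProt URL is assumed to resolve as quoted in the text:

import requests
from lxml import etree

# download the UniProt XML record for the example accession from the text
xml_bytes = requests.get("https://www.uniprot.org/uniprot/F8W8F0.xml",
                         timeout=60).content
root = etree.fromstring(xml_bytes)

# same XPath query as above, with the 'dns' prefix bound to the UniProt namespace
ns = {"dns": "http://uniprot.org/uniprot"}
pdb_ids = root.xpath("/dns:uniprot/dns:entry/dns:dbReference[@type='PDB']/@id",
                     namespaces=ns)
print(pdb_ids)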
The example XPath query shows that PDB IDs are integrated within the <dbReference> XML element. However, UniProt entries consist of multiple <dbReference> elements that point to different data sources, such as PubMed, GO, InterPro, Pfam, or PDB:
<dbReference type="PubMed" id="12730500">
<dbReference type="GO" id="GO:0039579">
<dbReference type="InterPro" id="IPR036333">
<dbReference type="Pfam" id="PF06478">
<dbReference type="PDB" id="6NUR">
A key task is to query data exclusively from XML elements whose ‘type’ attribute equals ‘PDB’. The ‘@’ character is used to address XML attributes in the XPath query. Therefore, dbReference[@type='PDB'] is included in the XPath query, and all PDB IDs are obtained by querying the @id attribute.
Due to a possible synchronization delay between UniProt releases and other cross-referenced databases, an alternative approach was additionally used to fetch PDB data. Specifically, the PDBe graph API was used for this purpose. The PDB entities are returned in JSON format by default. Below, an example is provided for a request to fetch protein structures for the ACE2 receptor (UniProt ID: Q9BYF1) via the PDBe graph API: https://www.ebi.ac.uk/pdbe/graph-api/mappings/best_structures/Q9BYF1.
Similar to the ‘XPath’ node for processing XML documents, KNIME also provides the ‘JSON Path’ node for processing JSON data. As noted in step 1, it supports both dot and bracket notation; in the KNIME workflow discussed herein, the bracket notation is applied to extract the PDB IDs:
$..[*].['pdb_id']
Since the data are listed as a collection column type, the ‘JSON Path’ node is followed by the ‘Ungroup’ node to split the multiple PDB IDs per protein target into separate rows. After concatenating the data retrieved from the PDBe graph API (‘Concatenate’ node), duplicates for a given target were removed by grouping the data by UniProt ID and PDB ID (‘GroupBy’ node). The ‘PDB ID’ column is used to create the Uniform Resource Locator (URL) path to extract different properties, using the same strategy as shown in Fig. 2. An example of such a URL is given below: https://files.rcsb.org/view/2VYI.pdb.
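A minimal Python sketch of this alternative route is given below, using the ACE2 example from the text; it assumes the PDBe graph-API response is keyed by the UniProt accession and that each mapping entry carries a ‘pdb_id’ field, as implied by the JSONPath above:

import requests

accession = "Q9BYF1"  # ACE2 receptor, as in the example above
url = f"https://www.ebi.ac.uk/pdbe/graph-api/mappings/best_structures/{accession}"
payload = requests.get(url, timeout=60).json()

# 'JSON Path' + 'Ungroup' equivalent: one PDB ID per row, duplicates removed
pdb_ids = sorted({entry["pdb_id"] for entry in payload.get(accession, [])})
print(pdb_ids)

# the PDB IDs can then be turned into download URLs, e.g.:
urls = [f"https://files.rcsb.org/view/{pdb_id.upper()}.pdb" for pdb_id in pdb_ids]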
The ‘PDB Loader’ and the ‘PDB Property Extractor’ nodes are available from the KNIME repository (created by Vernalis, Cambridge, UK) to facilitate the analysis of PDB data in KNIME (Fig. 4). These nodes were employed to explore properties of the PDB files, such as the experimental method used (X-ray diffraction, solution NMR, cryo-EM, theoretical models), the number of stored models, the resolution of the structures, the space group, the R-factor, and so on.
Next, the available PDB structures were examined for the availability of co-resolved ligands. Ligand information (in JSON format) can be retrieved through the RCSB PDB RESTful web services by creating the following request: https://data.rcsb.org/rest/v1/core/entry/2VYI.
The following JSONPath query is used to retrieve a collection of bound ligands, if available:
$['rcsb_entry_info']['nonpolymer_bound_components']
Available ligands are listed using their component identifiers (e.g., BME, NAG, XU3). An API request is subsequently created and executed to fetch ligand information (in JSON format): https://data.rcsb.org/rest/v1/core/chemcomp/NAG.
The following JSONPath query is used to retrieve the SMILES string of a specific ligand:
$['rcsb_chem_comp_descriptor']['smiles']
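The two RCSB calls can be chained, as in the Python sketch below; the field names follow the JSONPath expressions quoted above, and the PDB entry is the 2VYI example from the text:

import requests

# list the non-polymer components bound in the example PDB entry
entry = requests.get("https://data.rcsb.org/rest/v1/core/entry/2VYI",
                     timeout=60).json()
ligands = entry.get("rcsb_entry_info", {}).get("nonpolymer_bound_components", [])

# resolve every component code (e.g., 'NAG') to its SMILES string
for code in ligands:
    comp = requests.get(f"https://data.rcsb.org/rest/v1/core/chemcomp/{code}",
                        timeout=60).json()
    print(code, comp.get("rcsb_chem_comp_descriptor", {}).get("smiles"))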
Subsequently, PDB entries without a co-resolved ligand are filtered out (by applying the ‘Row Filter’ node). The ‘GroupBy’ node is used to keep unique ligand structures per protein target (grouping by UniProt ID and SMILES string). This procedure might also retrieve salts, solvents, and/or co-crystallizing compounds, as they are identified as ‘ligands’ in the PDB. Although salts and unconnected fragments are stripped during the structure standardization procedure (as described in Sect. 3), it is generally advisable to cross-check the output table to eliminate retained co-crystallizing agents (e.g., isonicotinamide).
3. Step: Fetching ligand bioactivity data from open bioactivity data sources via programmatic data access
Orthogonal to fetching ligand data for drug targets of interest from their protein structures, ligands and their experimental bioactivity measurements can also be collected from open pharmacological databases. In this example, data are retrieved from ChEMBL (version 26) [4, 30], PubChem [5], and IUPHAR (also known as the Guide to Pharmacology, version 2020.2) [27] by using the respective web services via the ‘GET Request’ and ‘XPath’ nodes in KNIME. Automated data access is achieved by using predefined identifiers for targets, ligands, biochemical assays, and so on, to retrieve the associated records (such as ligand structures, available bioactivities, or molecule names).
The KNIME workflow for fetching ChEMBL data maps UniProt IDs of protein targets to ChEMBL target IDs and subsequently retrieves ligand bioactivities together with the respective structural information (here: canonical SMILES), document ChEMBL IDs, and PubMed IDs of the primary publication. A major challenge is the limited number of bioactivities (at most 1000) that can be fetched per single call. The KNIME workflow therefore has to be adapted to fetch all available data without manual intervention. The metanode handling this task (termed ‘Get bioactivities per target’) works as follows:
1. A single XML file per target is downloaded, and the number of bioactivities stored in the <total_count> XML element is extracted.
2. The number of iterations needed to fetch all available bioactivities per target is calculated by dividing the number of bioactivities by 1000 and rounding the result up (ceil() function in the ‘Math Formula’ node).
3. A recursive loop is used to process the protein targets one by one.
4. A nested loop is used within the recursive loop, in which the API call is modified so that the ‘offset’ parameter changes dynamically in each iteration. The ‘offset’ parameter determines the number of bioactivities to skip before downloading the next portion of bioactivities for a given target. After the loop ends, all required information is extracted from the collected XML files by the ‘XPath’ node.
This procedure is illustrated by an example: assume there are 2410 bioactivities available for protein X. Thus, three iterations are needed to fetch all data available for protein X if the page size (‘limit’) is set to 1000. Within each iteration, a column containing the API call with the corresponding offset parameter is appended to the table, i.e.,
https://www.ebi.ac.uk/chembl/api/data/activity?target_chembl_id=CHEMBL5118&limit=1000&offset=0 (iteration#1).
https://www.ebi.ac.uk/chembl/api/data/activity?target_chembl_id=CHEMBL5118&limit=1000&offset=1000 (iteration#2).
https://www.ebi.ac.uk/chembl/api/data/activity?target_chembl_id=CHEMBL5118&limit=1000&offset=2000 (iteration#3).
At the end of the loop, 2410 bioactivities have been collected for protein X and these are processed as indicated in the description above.
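A minimal Python sketch of this offset-based pagination is shown below. It assumes the ChEMBL web services accept a ‘format=json’ parameter in addition to ‘limit’ and ‘offset’ (the workflow itself parses the XML output), and it uses the target ChEMBL ID from the example URLs:

import math
import requests

base = "https://www.ebi.ac.uk/chembl/api/data/activity"
target = "CHEMBL5118"
page_size = 1000  # maximum number of bioactivities returned per call

def fetch_page(offset):
    params = {"target_chembl_id": target, "limit": page_size,
              "offset": offset, "format": "json"}
    return requests.get(base, params=params, timeout=120).json()

first = fetch_page(0)
total = first["page_meta"]["total_count"]   # e.g., 2410 for 'protein X'
iterations = math.ceil(total / page_size)   # e.g., 3 iterations

activities = list(first["activities"])
for i in range(1, iterations):
    activities.extend(fetch_page(i * page_size)["activities"])

print(f"{len(activities)} of {total} bioactivities collected")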
Steps 3 and 4 of the metanode ‘Get bioactivities per target’, as described above, are visually depicted in Fig. 5.
In the case of PubChem, UniProt IDs are mapped to PubChem assay IDs (AIDs) in the first step. Next, the AIDs are mapped to the available compounds via their PubChem compound IDs (CIDs), including bioactivity measurements and associated PubMed IDs. Compound structures and names are retrieved in the next step. In some cases, compound names in PubChem are given in the form of molecule ChEMBL IDs; if this is the case, ChEMBL is additionally queried to download a compound name, if available.
In order to query IUPHAR data, the UniProt ID is mapped to the IUPHAR target ID. The API calls have a specific syntax for accessing substrates, e.g., http://www.guidetopharmacology.org/services/targets/2421/substrates, and for accessing inhibitors, e.g., http://www.guidetopharmacology.org/services/targets/2421/interactions, where “2421” is the identifier of a specific target. The compound ID, PubMed ID, affinity, affinity type (corresponding to a certain end-point), and action (corresponding to a certain activity annotation) were retrieved by using the ‘JSON Path’ node. The ligand structure is retrieved by an additional API call based on the respective ligand ID.
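Since the exact field names of the Guide to Pharmacology JSON are not reproduced in the text, the short Python sketch below simply retrieves the interaction records for the example target and inspects the available keys; the assumption that the endpoint returns a JSON list of interaction objects is ours:

import requests

url = "https://www.guidetopharmacology.org/services/targets/2421/interactions"
payload = requests.get(url, timeout=60).json()

# assumed to be a list of interaction records; inspect the available fields
records = payload if isinstance(payload, list) else [payload]
if records:
    print(sorted(records[0].keys()))
    print(len(records), "interaction records retrieved")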
Bioactivity values are converted to their negative logarithmic representation, and binary labels (‘1’ for active and ‘0’ for inactive) are assigned on the basis of an activity cut-off. In this example, all compounds possessing a negative logarithmic value greater than 9 (i.e., < 1 nM) were labeled as ‘1’, while the rest were labeled as ‘0’.
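As a minimal sketch of this labelling rule, assuming the measured value is reported in nanomolar units:

import math

def activity_label(value_nm: float, cutoff: float = 9.0) -> int:
    """Return 1 (active) if the negative logarithmic activity exceeds the cutoff."""
    p_activity = -math.log10(value_nm * 1e-9)   # nM -> M, then negative log
    return 1 if p_activity > cutoff else 0

print(activity_label(0.5))   # 0.5 nM -> ~9.3 -> active (1)
print(activity_label(50.0))  # 50 nM  -> ~7.3 -> inactive (0)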
After merging the output tables from ChEMBL, PubChem, and IUPHAR, the data are grouped to keep unique ligands per target, using the median of the binary activity labels (‘GroupBy’ node). In addition, only active ligands per target (label ‘1’) are kept, and the final table is concatenated with the ligand structures from the PDB entries.
A prerequisite for merging ligand data from diverse sources is the standardization of the molecular structures. A curation strategy similar to the one published by Gadaleta et al. [31] was applied (a condensed RDKit sketch is given after the list):
1. Characters encoding stereoisomerism in the SMILES format (@, \, /) are removed by using the ‘String Replacer’ node, since this information is not needed for the subsequent operations.
2. Salts are stripped by using the ‘RDKit Salt Stripper’ node. (This node works with a pre-defined set of different salts/salt mixtures by default. If required, additional salt definitions can be forwarded to the node.)
3. Salt components are listed in the output table using the ‘Connectivity’ node (CDK plugin) followed by the ‘Split Collection Column’ node.
4. The ‘RDKit Structure Normalizer’ node neutralizes charges, checks for atomic clashes, etc. Additional criteria for the compound quality check can be adjusted in the ‘Advanced’ section of the node configuration.
5. The ‘Element Filter’ node keeps only compounds containing the following elements: H, C, N, O, F, Br, I, Cl, P, S.
6. InChI, InChIKey, and canonical SMILES representations are finally created from the standardized compounds.
Steps 2–4 are visually depicted in Fig. 6.
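The sketch below approximates steps 1, 2, 4, 5, and 6 with plain RDKit calls outside of KNIME; it is a simplification of the nodes described above rather than a one-to-one reproduction (for example, the default RDKit salt definitions may differ from those of the ‘RDKit Salt Stripper’ node):

from rdkit import Chem
from rdkit.Chem import SaltRemover
from rdkit.Chem.MolStandardize import rdMolStandardize

ALLOWED_ELEMENTS = {"H", "C", "N", "O", "F", "Br", "I", "Cl", "P", "S"}

def standardize(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    Chem.RemoveStereochemistry(mol)                    # step 1: drop stereo information
    mol = SaltRemover.SaltRemover().StripMol(mol)      # step 2: strip known salts
    mol = rdMolStandardize.Uncharger().uncharge(mol)   # step 4: neutralize charges
    if any(atom.GetSymbol() not in ALLOWED_ELEMENTS for atom in mol.GetAtoms()):
        return None                                    # step 5: element filter
    # step 6: export the standardized representations
    return Chem.MolToSmiles(mol), Chem.MolToInchi(mol), Chem.MolToInchiKey(mol)

print(standardize("C[NH+](C)C.[Cl-]"))  # trimethylamine hydrochloride as a toy example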
4. Step: Substructure searches to identify potentially interesting compounds for drug repurposing
Finally, the merged data sets are used to generate structural queries in SMARTS format in order to perform substructure searches in DrugBank (version 5.1.6, approx. 10,000 compounds, structures in SDF format are publicly available at https://www.drugbank.ca/releases/latest#structures) and in the COVID-19 antiviral candidate compound data set provided by the Chemical Abstracts Service (approx. 50,000 compounds, available upon request at https://www.cas.org/covid-19-antiviral-compounds-dataset).
Bemis-Murcko scaffolds are extracted (‘RDKit Find Murcko Scaffolds’ node) in order to get a quick overview of the structural diversity of the curated data set. Scaffolds with too generic structures (e.g., a single aromatic ring) can be filtered out (by using the ‘RDKit Descriptors Calculator’ node in conjunction with the ‘Row Filter’ node), and the remaining ones can be explored with respect to their structural similarity in the context of a certain target. This step is done by (1) calculating molecular distances (the ‘MoSS MCSS Molecule Similarity’ node), (2) hierarchical clustering (the ‘Hierarchical Clustering [DistMatrix]’ node), and (3) assigning a threshold (here: distance threshold = 0.5) for cluster assignment (the ‘Hierarchical Cluster Assigner’ node). The ‘MoSS MCSS Molecule Similarity’ node calculates similarities between Murcko scaffolds by taking the size of their maximum common substructure (MCS) as a similarity metric. Molecular similarities are then evaluated on the basis of a distance matrix. The respective part of the workflow is depicted in Fig. 7.
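A small RDKit sketch of the scaffold extraction and the ‘too generic’ filter is given below; the ring-count criterion is one possible choice and stands in for the descriptor-based filtering described above, and the input SMILES are placeholders:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles_list = ["CCOc1ccc2nc(S(N)(=O)=O)sc2c1",  # placeholder ligand structures
               "c1ccccc1CCN"]

scaffolds = []
for smi in smiles_list:
    scaffold = MurckoScaffold.GetScaffoldForMol(Chem.MolFromSmiles(smi))
    if rdMolDescriptors.CalcNumRings(scaffold) > 1:   # drop single-ring scaffolds
        scaffolds.append(Chem.MolToSmiles(scaffold))

print(scaffolds)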
Next, the workflow loops over the distinct clusters of Bemis-Murcko scaffolds for a given target in order to create a maximum common substructure (the ‘RDKit MCS’ node) from all scaffolds belonging to the respective cluster. Recursive loops are extensions of regular loops and can be used in conjunction with a ‘Row Splitter’ node to separate the current row from the rest of the table. After termination of the current iteration, the rest of the table is forwarded to the loop start, and the next row is used for the subsequent iteration (see Fig. 8). The generated substructures for a certain target are appended to the output table in SMARTS format.
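The MCS step itself can be sketched with RDKit as follows, assuming a hypothetical cluster of two scaffolds; the resulting SMARTS string corresponds to the substructure query appended to the output table:

from rdkit import Chem
from rdkit.Chem import rdFMCS

# hypothetical scaffold cluster (indole- and benzimidazole-like scaffolds)
cluster = ["c1ccc2[nH]ccc2c1", "c1ccc2[nH]cnc2c1"]
mols = [Chem.MolFromSmiles(s) for s in cluster]

mcs = rdFMCS.FindMCS(mols)
print(mcs.smartsString)   # SMARTS query for the subsequent substructure search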
For the substructure searches in DrugBank and the CAS data set, loops are used as well (Fig. 9). The ‘Table Row To Variable Loop Start’ node forwards each substructure query as a flow variable to the ‘RDKit Substructure Filter’ node, which then examines whether the particular substructure is contained in the compounds from DrugBank or CAS. The extracted compounds are forwarded to the ‘RDKit Molecule Highlighting’ node, which visualizes the matched substructure within the respective compounds.
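A condensed RDKit sketch of this screen is shown below; both the SMARTS query and the small in-memory ‘library’ are placeholders for the actual substructure queries and the DrugBank/CAS structures:

from rdkit import Chem

# placeholder SMARTS query from the MCS step and a toy compound library
query = Chem.MolFromSmarts("c1ccc2[nH]ccc2c1")
library = {"compound_A": "CC(=O)Nc1ccc2[nH]ccc2c1",   # contains the query scaffold
           "compound_B": "CCO"}                        # does not

hits = {name: smi for name, smi in library.items()
        if Chem.MolFromSmiles(smi).HasSubstructMatch(query)}
print(hits)   # matched atoms could then be highlighted, e.g., with rdkit.Chem.Draw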
Software
The KNIME workflows were built in KNIME version 4.1.2 and are freely available from GitHub (https://github.com/AlzbetaTuerkova/Drug-Repurposing-in-KNIME). The published workflow can be used either as a single pipeline or as multiple stand-alone workflows (1) to gather data from the PDB, (2) to retrieve ligand bioactivities from ChEMBL, PubChem, and IUPHAR, and (3) to perform substructure searches, provided the respective input data are supplied.