
Table 5 Highest-importance features in the two Random Forest classifiers with the highest ROC AUC scores, i.e. those generated with the random test-training set split, using (a) molecular descriptors only and (b) both molecular and protein target descriptors

From: Leveraging heterogeneous data from GHS toxicity annotations, molecular and protein target descriptors and Tox21 assay readouts to predict and rationalise acute toxicity

Classifier (a): Random test set, molecular descriptors

Highest importance molecular descriptors:
  a_nN: Number of nitrogen atoms
  Q_RPC-: Relative negative partial charge
  a_ICM: Atom information content (mean)
  h_pavgQ: Average total charge sum across protonation states at pH 7
  GCUT_PEOE_0: First GCUT descriptor calculated from the eigenvalues of a modified graph distance adjacency matrix where the diagonal takes the values of the partial charges

Highest importance protein target descriptors:
  n/a

Classifier (b): Random test set, molecular and protein target descriptors

Highest importance molecular descriptors:
  Q_RPC-: Relative negative partial charge
  a_nN: Number of nitrogen atoms
  GCUT_SLOGP_0: First GCUT descriptor calculated using atomic contributions to logP instead of partial charge
  bpol: Sum of the absolute value of the difference between atomic polarizabilities of all bonded atoms in the molecule
  chi1v_C: Carbon valence connectivity index (order 1)

Highest importance protein target descriptors:
  P18031: Tyrosine-protein phosphatase non-receptor type 1
  P51449: Nuclear receptor ROR-gamma
  P00352: Retinal dehydrogenase 1
  P23219: Prostaglandin G/H synthase 1
  P11473: Vitamin D3 receptor
  1. The table illustrates the difference in interpretability between the two classes of descriptors: molecular descriptors may be either too broad to interpret or nontrivial to understand, whereas protein target descriptors provide a specific biological hypothesis that can subsequently be tested to validate a mechanism of action
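
As an illustration of how a table of this kind could be assembled, the following minimal sketch ranks features by importance and splits them into molecular and protein target descriptors. It is not the authors' pipeline. The function name `top_features_by_class` is hypothetical, and the importance scores are placeholders chosen only to make the example run; they are not values from the paper. The sketch assumes that protein target descriptors are named by UniProt accession, as in the table, and identifies them with the accession pattern documented by UniProt. With scikit-learn, the name-to-score mapping could be built from a fitted `RandomForestClassifier` via its `feature_importances_` attribute, although the source does not state which implementation was used.

```python
import re

# UniProt accession pattern (as documented by UniProt); anything that does
# not match is treated as a molecular descriptor in this sketch.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def top_features_by_class(importances, k=5):
    """Return the k highest-importance features per descriptor class.

    importances: dict mapping feature name -> importance score.
    """
    groups = {"molecular": [], "protein_target": []}
    for name, score in importances.items():
        key = "protein_target" if UNIPROT_RE.fullmatch(name) else "molecular"
        groups[key].append((name, score))
    # Sort each class by descending importance and keep the first k entries.
    return {
        key: sorted(items, key=lambda pair: pair[1], reverse=True)[:k]
        for key, items in groups.items()
    }


# Placeholder scores for illustration only (NOT taken from the paper).
example = {
    "a_nN": 0.030,
    "Q_RPC-": 0.040,
    "bpol": 0.010,
    "P18031": 0.020,
    "P51449": 0.015,
    "P11473": 0.005,
}

ranked = top_features_by_class(example, k=2)
for descriptor_class, features in ranked.items():
    print(descriptor_class, [name for name, _ in features])
```

Keeping the two descriptor classes in separate rankings mirrors the two columns of the table, so that protein targets are not crowded out by molecular descriptors with larger importance scores.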