Table 5 Highest-importance features in the two Random Forest classifiers with the highest ROC AUC scores, i.e. those generated with the random test-training set split from (a) molecular descriptors only and (b) both molecular and protein target descriptors

From: Leveraging heterogeneous data from GHS toxicity annotations, molecular and protein target descriptors and Tox21 assay readouts to predict and rationalise acute toxicity

| Classifier | Highest importance molecular descriptors | Highest importance protein target descriptors |
| --- | --- | --- |
| (a) Random test set, molecular descriptors | a_nN: Number of nitrogen atoms; Q_RPC-: Relative negative partial charge; a_ICM: Atom information content (mean); h_pavgQ: Average total charge sum across protonation states at pH 7; GCUT_PEOE_0: First GCUT descriptor calculated from the eigenvalues of a modified graph distance adjacency matrix where the diagonal takes the values of the partial charges | n/a |
| (b) Random test set, molecular and protein target descriptors | Q_RPC-: Relative negative partial charge; a_nN: Number of nitrogen atoms; GCUT_SLOGP_0: First GCUT descriptor calculated using atomic contributions to logP instead of partial charge; bpol: Sum of the absolute value of the difference between atomic polarizabilities of all bonded atoms in the molecule; chi1v_C: Carbon valence connectivity index (order 1) | P18031: Tyrosine-protein phosphatase non-receptor type 1; P51449: Nuclear receptor ROR-gamma; P00352: Retinal dehydrogenase 1; P23219: Prostaglandin G/H synthase 1; P11473: Vitamin D3 receptor |

  1. The table illustrates the difference in interpretability between the two classes of descriptors: molecular descriptors may be either too broad to interpret or nontrivial to understand, while protein target descriptors provide a specific biological hypothesis that can subsequently be tested to validate a mechanism of action.
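
As a rough illustration of how a table like this might be assembled, the sketch below ranks features by importance and reports the top entries separately for each descriptor class. It assumes that importance scores are already available as a name-to-score mapping (for example, from the `feature_importances_` attribute of a trained scikit-learn `RandomForestClassifier`; the source does not state which software was used), and that protein target descriptors are named by UniProt accession, as in the table. The function name `top_features_by_class` and the example scores are hypothetical placeholders, not values from the study.

```python
import re

# UniProt accession pattern, as documented in the UniProt help pages.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def top_features_by_class(importances, k=5):
    """Return the k highest-importance features per descriptor class.

    `importances` maps feature name -> importance score. Names that match
    the UniProt accession pattern are treated as protein target descriptors;
    all other names are treated as molecular descriptors (an assumption of
    this sketch, based on the naming seen in the table).
    """
    # Sort all features from highest to lowest importance.
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    molecular = [name for name, _ in ranked if not UNIPROT_RE.fullmatch(name)]
    protein = [name for name, _ in ranked if UNIPROT_RE.fullmatch(name)]
    return {"molecular": molecular[:k], "protein_target": protein[:k]}


# Arbitrary placeholder scores for illustration only (NOT from the paper).
example_importances = {
    "Q_RPC-": 0.05,
    "a_nN": 0.04,
    "P18031": 0.03,
    "bpol": 0.02,
    "P51449": 0.01,
}

result = top_features_by_class(example_importances, k=2)
print(result)
```

Splitting the ranking by descriptor class mirrors the two columns of the table: the protein target column then lists named proteins that can be followed up experimentally, while the molecular column lists physicochemical descriptors.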