A note on utilising binary features as ligand descriptors

It is common in cheminformatics to represent the properties of a ligand as a string of 1’s and 0’s, with the intention of elucidating, inter alia, the relationship between the chemical structure of a ligand and its bioactivity. In this commentary we note that, where relevant but non-redundant features are binary, they inevitably lead to a classifier capable of capturing only a linear relationship between structural features and activity. If, instead, we were to use relevant but non-redundant real-valued features, the resulting predictive model would be capable of describing a non-linear structure-activity relationship. Hence, we suggest that real-valued features, where available, are to be preferred in this scenario.


Background
One of the major goals of cheminformatics is to predict the relationship between a ligand's chemical structure and its bioactivity [1]. If this relationship is captured correctly, then (among other goals) designing the right drug for each disease would become an easier task [1,2]. Unfortunately, the structure-activity relationship can often be intricate and arcane, and in particular non-linear.
To devise an adequate model describing this relationship, the cheminformaticist typically follows a standard approach, starting with a large number of ligand attributes (features) considered important for representing the underlying characteristics of the ligand and relevant to its bioactivity. Then, through feature selection techniques, one selects the ligand attributes deemed to have minimal statistical interdependence among themselves (given the ligand bioactivity) while also showing a strong association with the ligand bioactivity [3,4,5]. With this step, one strives for a set of relevant but non-redundant ligand features [4,5]: "relevant" in the sense that there is a strong association between the selected features and the bioactivity, and "non-redundant" in the sense that these features are conditionally independent given the bioactivity. (Irrelevant features are essentially noise, and relevant but redundant features are a nuisance [6]; we are not concerned with such features here.)
Typically the ligand's chemical structure is represented by an $L$-dimensional vector $x = (x_1, x_2, \ldots, x_L)$. The elements $x_l$ ideally contain appropriate information about the ligand's features, relevant for predicting its bioactivity. This bioactivity against a particular target or protein may be represented either numerically or as a class label; such classes (or class labels) are denoted henceforth by $k$, where $k = 1, 2, \ldots, K$, with $K$ being the total number of classes of interest.
Identifying the relevant features $x$ without errors is generally impossible. Usually both $x$ and $k$ are treated as random variables, such that for a given $x$ we have a distribution $p(k \mid x)$, the so-called class posterior probability, over the different possible classes [1,7]. In practice, $p(k \mid x)$ is induced from given prototype samples (a training dataset), and a new ligand represented by $x$ is then assigned to the class that minimises the probability of misclassification [8,9].
In Bayesian probabilistic settings, it is usually computationally easier to estimate $p(k \mid x)$ in terms of the class probability $p(k)$, the evidence $p(x)$ and the class-conditional probability density function $p(x \mid k)$:

$$p(k \mid x) = \frac{p(x \mid k)\, p(k)}{p(x)} \tag{1}$$

In cheminformatics, the main task of estimating $p(k \mid x)$ often reduces to inducing $p(x \mid k)$ from the training dataset.

Open Access. *Correspondence: mussax021@gmail.com. Centre for Molecular Informatics, Department of Chemistry, Cambridge University, Lensfield Road, Cambridge CB2 1EW, UK. Full list of author information is available at the end of the article.
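As an illustration of how Bayes' theorem combines the class probability, the class-conditional density and the evidence, the following is a minimal sketch. The function name `class_posterior` and all numerical values are hypothetical and chosen only for illustration; they do not come from any dataset.

```python
import numpy as np

def class_posterior(prior, likelihood):
    """Return p(k|x) for every class k via Bayes' theorem.

    prior[k]      : p(k), the class probability
    likelihood[k] : p(x|k), the class-conditional density evaluated at x
    """
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    joint = prior * likelihood   # p(x|k) p(k) for each class
    evidence = joint.sum()       # p(x) = sum over k of p(x|k) p(k)
    return joint / evidence

# Hypothetical values for K = 2 classes (index 0: inactive, index 1: active).
posterior = class_posterior(prior=[0.8, 0.2], likelihood=[0.01, 0.09])

# Assigning the ligand to the class with the largest posterior minimises
# the probability of misclassification.
predicted_class = int(np.argmax(posterior))
```

Although the prior favours the first class in this toy case, the larger class-conditional density of the second class outweighs it, so the ligand is assigned to the second class.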

Commentary
It is common practice nowadays to assume that the $L$ relevant chemical structure features of the ligand can be encoded as a binary "vector" of 1's and 0's denoting presence (1) and absence (0) of these features, i.e., $x_l \in \{0, 1\}$ [10]. In practice, state-of-the-art feature selection techniques based on information theory [3,5] are used to quantify the level of association between the features and the bioactivity. These techniques are also capable of quantifying the class-conditional interdependency among the features. However, in the light of the insightful work of Li on the peculiar but useful characteristics of the conditional dependence between two binary random variables [11], one might be able to go one step further: identify, among the $L$ relevant features, the $L'$ features whose relationship with the bioactivity is statistically significant but whose class-conditional interdependency is statistically insignificant. That is, one retains the features that are statistically non-redundant and discards those that are statistically redundant.
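To make the two quantities concrete, the sketch below computes plug-in (empirical frequency) estimates of the mutual information between a binary feature and the class label, and of the class-conditional mutual information between two features. This is a generic illustration under stated assumptions, not the specific techniques of refs [3,5] or the analysis of Li [11]; it performs no significance testing, and the function names and toy data are hypothetical.

```python
import numpy as np

def mutual_information(a, b):
    """Empirical mutual information I(A;B) in nats for discrete arrays."""
    a, b = np.asarray(a), np.asarray(b)
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            if p_ab > 0:  # cells with zero probability contribute nothing
                mi += p_ab * np.log(p_ab / (np.mean(a == va) * np.mean(b == vb)))
    return mi

def conditional_mutual_information(a, b, k):
    """Empirical I(A;B|K): the class-weighted average of I(A;B) within each class."""
    a, b, k = np.asarray(a), np.asarray(b), np.asarray(k)
    return sum(np.mean(k == c) * mutual_information(a[k == c], b[k == c])
               for c in np.unique(k))

# Hypothetical toy data: 8 ligands, 2 activity classes, 4 binary features.
activity = np.array([0, 0, 0, 0, 1, 1, 1, 1])
feat_r   = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # associated with activity
feat_a   = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # unrelated to activity
feat_b   = feat_a.copy()                       # exact duplicate of feat_a
feat_c   = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # independent of feat_a within each class

relevance_r  = mutual_information(feat_r, activity)                      # positive
relevance_a  = mutual_information(feat_a, activity)                      # zero
redundancy_ab = conditional_mutual_information(feat_a, feat_b, activity)  # log(2)
redundancy_ac = conditional_mutual_information(feat_a, feat_c, activity)  # zero
```

In this toy case `feat_r` would count as relevant, `feat_a` as irrelevant, and the pair (`feat_a`, `feat_b`) as redundant, whereas (`feat_a`, `feat_c`) shows no class-conditional dependence.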
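In our probabilistic setting, the $L'$ relevant descriptors $x' = (x'_1, x'_2, \ldots, x'_{L'})$ being non-redundant entails that $p(x' \mid k)$ can be expressed as a product of $L'$ class-conditional univariate probability density functions $p(x'_l \mid k)$, i.e., $p(x' \mid k) = \prod_{l=1}^{L'} p(x'_l \mid k)$. This means that $p(k \mid x')$, which is what we are interested in estimating, can be given as [8,12,13]

$$p(k \mid x') = \frac{p(k)}{p(x')} \prod_{l=1}^{L'} p(x'_l \mid k) \tag{2}$$

Because each $x'_l$ is binary, each $p(x'_l \mid k)$ is a Bernoulli distribution, $p(x'_l \mid k) = p_{lk}^{x'_l} (1 - p_{lk})^{1 - x'_l}$, where $p_{lk}$ denotes $p(x'_l = 1 \mid k)$. In terms of these Bernoulli distributions, Eq. 2 modifies to

$$p(k \mid x') = \frac{p(k)}{p(x')} \prod_{l=1}^{L'} p_{lk}^{x'_l} (1 - p_{lk})^{1 - x'_l} \tag{3}$$

which can be further rewritten in an equivalent but more convenient form (see Chapter 4 of ref [8]):

$$g_k(x') = \sum_{l=1}^{L'} w_{lk}\, x'_l + w_{0k} \tag{4}$$

where $g_k(x') = \log p(k \mid x')$, $w_{lk} = \log \frac{p_{lk}}{1 - p_{lk}}$ and $w_{0k} = \sum_{l=1}^{L'} \log (1 - p_{lk}) + \log \frac{p(k)}{p(x')}$. Clearly, the discriminant function $g_k(x')$ is linear in $x'$ [8,12,13], irrespective of the nature of the association between the chemical structure of the ligand and its bioactivity. This is the consequence of the ligand's relevant but non-redundant features being represented by a binary "vector".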
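The linearity of the discriminant for binary, class-conditionally independent features can be checked numerically. The sketch below compares the logarithm of the Bernoulli product form with a weighted-sum form over every possible binary vector. The Bernoulli parameters `p_k` (standing for the probability that each feature equals 1 in class k), the prior and the number of features are hypothetical values. The class-independent term involving the evidence is omitted from both forms, since it is the same for every class and does not affect which class scores highest.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_features = 6                                  # hypothetical number of binary features
p_k = rng.uniform(0.05, 0.95, size=n_features)  # hypothetical p(feature = 1 | class k)
prior_k = 0.3                                   # hypothetical class probability p(k)

def log_product_form(x, p, prior):
    """log[ p(k) * prod_l p_l^x_l * (1 - p_l)^(1 - x_l) ], computed directly."""
    return np.log(prior) + np.sum(np.log(np.where(x == 1, p, 1.0 - p)))

def linear_form(x, p, prior):
    """The same quantity written as a weighted sum w . x + w0."""
    w = np.log(p / (1.0 - p))                       # one weight per feature
    w0 = np.sum(np.log(1.0 - p)) + np.log(prior)    # bias term
    return w @ x + w0

# Enumerate all 2^n_features binary vectors and record the largest discrepancy.
all_vectors = np.array(list(itertools.product([0, 1], repeat=n_features)))
max_gap = max(abs(log_product_form(x, p_k, prior_k) - linear_form(x, p_k, prior_k))
              for x in all_vectors)
```

The two forms agree to within floating-point error on every binary vector, whatever values are chosen for `p_k`, which is the sense in which the binary representation forces a linear discriminant.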
However, the situation can be different if non-redundant real-valued features are utilised to represent the chemical structure of the ligand. In this scenario the $L'$ class-conditional univariate distributions $p(x'_l \mid k)$ are not necessarily Bernoulli. Here $p(x'_l \mid k)$ can be expanded in Hermite polynomial basis functions $\varphi^{k}_{n}(x'_l)$ in the variable $x'_l$:

$$p(x'_l \mid k) = \sum_{n} \alpha^{k}_{nl}\, \varphi^{k}_{n}(x'_l) \tag{5}$$

where $\alpha^{k}_{nl}$ are the appropriate coefficient values. Note that the $k$ in $\alpha^{k}_{nl}$ and $\varphi^{k}_{n}$ is just an index (not a power). Inserting Eq. 5 into Eq. 2 and then taking the logarithm of the resultant equation yields the following discriminant function

$$h_k(x') = \sum_{l=1}^{L'} \log \left[ \sum_{n} \alpha^{k}_{nl}\, \varphi^{k}_{n}(x'_l) \right] + b_k \tag{6}$$

where $b_k = \log \frac{p(k)}{p(x')}$. Clearly $h_k(x')$ is not necessarily linear in $x'$, even though the $L'$ features utilised are class-conditionally independent [13]. Thus, for real-valued features, the resulting classifier is capable of representing a non-linear structure-activity relationship.
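The non-linearity for real-valued features can also be illustrated numerically. The sketch below uses Gaussian class-conditional densities as a simple stand-in for the Hermite expansion; this is a swapped-in special case chosen for brevity, not the expansion described in the text, and all parameter values are hypothetical. As before, the class-independent evidence term is omitted. The check relies on the fact that an affine function evaluated at the midpoint of two points equals the average of its values at those points.

```python
import numpy as np

def log_gaussian(x, mean, std):
    """Log of a univariate Gaussian density, evaluated element-wise."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def h_k(x, mean, std, prior):
    """Sum over features of log p(x_l | k), plus log p(k).

    The features are treated as class-conditionally independent, so the
    joint log-density is a plain sum of univariate terms.
    """
    return np.sum(log_gaussian(x, mean, std)) + np.log(prior)

# Hypothetical class-k parameters for three real-valued features.
mean_k = np.array([0.0, 1.5, -2.0])
std_k = np.array([1.0, 0.5, 2.0])
prior_k = 0.3

a = np.array([0.0, 0.0, 0.0])
b = np.array([2.0, 1.0, -4.0])

# For an affine function this gap would be exactly zero.
midpoint_gap = (h_k((a + b) / 2, mean_k, std_k, prior_k)
                - 0.5 * (h_k(a, mean_k, std_k, prior_k) + h_k(b, mean_k, std_k, prior_k)))
```

The gap is clearly non-zero here, so this discriminant is not linear in the features even though the features are class-conditionally independent.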

Conclusions
In this commentary it has been noted that, when relevant but non-redundant ligand features are represented by a string of binary numbers, one inevitably ends up with a linear model for describing the dependency (if any) between the chemical structure of a ligand and its bioactivity of interest, albeit in a classification setting. Such a linear model may be severely biased and limited in its predictive ability. It was also pointed out that, where relevant but non-redundant real-valued features are used, the resulting model can be unbiased, as it can adequately capture both linear and non-linear structure-activity relationships.