2D-QSAR for 450 types of amino acid induction peptides with a novel substructure pair descriptor having wider scope

Background: Quantitative structure-activity relationship (QSAR) analysis of peptides is helpful for designing various types of drugs, such as kinase inhibitors or antigens. Capturing the various properties of peptides is essential for two-dimensional QSAR analysis, and the descriptor is the element that captures these properties. The atom pair holographic (APH) code is designed for the description of peptides; it represents a peptide as a combination of thirty-six types of key atoms and the intermediate binding between two key atoms.
Results: The substructure pair descriptor (SPAD) represents a peptide as a combination of forty-nine types of key substructures and the sequence of amino acid residues between two substructures. The key substructures are larger and the sequences are longer than in traditional descriptors. Similarity searches on a C5a inhibitor data set and a kinase inhibitor data set showed that the ranking of inhibitors becomes three times higher in each data set when peptides are represented with SPAD. Comparing the scope of each descriptor shows that SPAD captures different properties from APH.
Conclusion: QSAR/QSPR for peptides is helpful for designing various types of drugs, such as kinase inhibitors and antigens. SPAD is a novel and powerful descriptor for various types of peptides. The accuracy of QSAR/QSPR becomes higher when peptides are described with SPAD.


Background
Research on the classification of small molecules using computers was popular in the 1990s [1][2][3][4][5], with similarity analysis of compounds being a major objective. At the time, there were two main methods for similarity analysis: the fingerprint description approach [4,6] and the inductive logic programming approach [7][8][9]. In the fingerprint description approach, a molecule is described as a sequence of bits, each of which corresponds to the existence of a chemical substructure. The atom-pair descriptor [4] and substructure-type fingerprints are popular descriptors.
Research on the classification of peptides became popular around the year 2000 [10][11][12]. The hidden Markov model (HMM) approach [12] and the physical data description approach [11] were the major approaches. The main subject of these papers is the twenty natural amino acids, such as isoleucine and valine. For example, research on immunity concerns peptides whose components are drawn from the 20 natural amino acids. In traditional research on the classification of peptides, an amino acid residue was described as a letter of the alphabet or as a set of physical or chemical values [11].
However, practical virtual screening requires describing other amino acid inductions, such as cyclohexyl alanine or F5 phenylalanine. The traditional description of peptides is not sufficiently powerful because it cannot express the characteristics that amino acid residues have in common. For example, tyrosine and phenylalanine share an aromatic ring substructure. In the alphabetic description, tyrosine and phenylalanine are described as 'Y' and 'F' respectively, and a machine learning algorithm cannot recognize that the symbols 'Y' and 'F' stand for residues with a common substructure. Two-dimensional QSAR has been studied for various types of peptides. In the atom-pair holographic code (APH) [13], each peptide is described with a method similar to the atom-pair descriptor [3]. Our novel descriptor, the substructure-pair descriptor (SPAD), captures different characteristics of peptides from APH and has greater descriptive power than APH. The combination of APH and SPAD may lead to better QSAR for peptides with many types of amino acid inductions [14].
The Tanimoto coefficient [15] is a popular indicator for measuring the similarity between two compounds [16]. In the binary case, the Tanimoto coefficient T(X, Y) between bit vectors X and Y is defined by the following expression.
$$T(X, Y) = \frac{\sum_i X_i \wedge Y_i}{\sum_i X_i \vee Y_i}$$
The Tanimoto coefficient becomes larger as the two vectors have more similar bit patterns. When the structures of two compounds are similar, the Tanimoto coefficient is also high.
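As an illustration, the binary Tanimoto coefficient can be sketched in Python as the number of bits set in both vectors divided by the number of bits set in at least one of them. This is the usual binary form and is assumed here; the function name and the convention for two all-zero vectors are choices made for this sketch, not part of the original method.

```python
def tanimoto(x, y):
    """Binary Tanimoto coefficient between two equal-length bit vectors."""
    if len(x) != len(y):
        raise ValueError("bit vectors must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)    # bits set in both vectors
    either = sum(1 for a, b in zip(x, y) if a or b)   # bits set in at least one
    # Assumed convention: two all-zero vectors are given similarity 0.
    return both / either if either else 0.0


x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]
print(tanimoto(x, y))  # 2 shared bits out of 4 set bits -> 0.5
```

Identical non-empty bit patterns give 1.0, and patterns with no set bit in common give 0.0.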
In machine learning, excessive features degrade the performance of learning algorithms because of over-fitting [17]. In an excessively large feature space, predictive models lose robustness, so feature selection is necessary for building more accurate predictive models. Kohavi proposed selecting features by their relevance instead of by maximizing the accuracy of an algorithm [18]. Discussions about the relevance of features are common for various types of algorithms [19]. Relevance is defined as the difference between the probability density function P(Y = y) and the conditional probability density function P(Y = y | X_i = x_i).
In information theory [20], entropy is an indicator for measuring the amount of information. We denote the probability of x_i as P(x_i). Entropy E is defined by the following function.
$$E = -\sum_i P(x_i) \log P(x_i)$$
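A minimal sketch of the entropy of a single feature, assuming the standard Shannon entropy over the empirical distribution of the feature's values. The base-2 logarithm (entropy in bits) and the function name are assumptions made for illustration; the text does not fix the base.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy of the empirical distribution of `values`.

    Assumption: base-2 logarithm, so the result is in bits.
    """
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


balanced = [0, 1, 0, 1]   # bit is set for half of the residues
specific = [1, 0, 0, 0]   # bit is set for only one residue in four
print(entropy(balanced))  # 1.0
print(entropy(specific) < entropy(balanced))  # True
```

A bit that is set for about half of the inputs reaches the maximum of one bit of entropy, while a bit that is almost always 0 (or almost always 1) carries little information.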

Definition of several terms
In this paper, we define several terms as follows.
• Substructure: A part of the structure of a peptide.
• Descriptor: A function that maps the structure of an amino acid residue or a peptide to a bit according to a substructure.
• Feature: A bit as the result of a descriptor.
A target protein binds some amino acid residues of a peptide through chemical or physical interactions. Hydrogen bonds and the hydrophobic effect are representative examples. In our QSAR approach, we describe the two-dimensional structure of a peptide as a sequence of bits and statistically analyze the relationship between the structure of the peptide and its activity. When we analyze this relationship with a data mining algorithm, QSAR rules are extracted automatically from a dataset annotated with the peptides' activity. From a chemical viewpoint, describing the various types of amino acid inductions properly is important for improving QSAR analysis.
From a statistical viewpoint, the best features are those that maximize the accuracy of the algorithm used for QSAR analysis. Kohavi proposed selecting features by their relevance instead of by maximizing the accuracy of an algorithm. Discussions about the relevance of features are common for various types of algorithms [19]. Relevance is defined as the difference between the probability density function P(Y = y) and the conditional probability density function P(Y = y | X_i = x_i).
We define each symbol as shown in Figure 1. The SPAD is defined with these symbols.

Definition of the base substructure set for amino acid inductions
The aim of defining the base substructure set (Figure 2) is to describe the important interactions between a target protein and a peptide, such as hydrogen bonds and the hydrophobic effect. However, statistically redundant or overly specific descriptors may degrade the accuracy of an algorithm for QSAR analysis. We defined the base substructure set under the following three conditions.
• Describe potential factors for interactions, such as a hydrogen bond acceptor.
• Features of amino acid residues should be only weakly relevant to each other mathematically. This condition avoids strongly relevant features: features with strong relevance to each other are abandoned.
• A feature should have high entropy (in the information-theoretic sense) after the structures of the 450 types of amino acids are mapped to sequences of bits. This condition avoids overly specific descriptors: descriptors with low entropy are abandoned.
The first condition is essential for QSAR analysis because key substructures such as a hydrogen bond acceptor may cause the activity of a peptide against a target protein. Without a description of them, most algorithms for QSAR analysis become powerless. The second and third conditions are necessary for efficient analysis from a statistical viewpoint. The second condition prevents redundancy among features. Even if the structures of two amino acid inductions are chemically different, two features may be relevant to each other, in which case the two features are statistically redundant. The third condition is necessary for generating robust QSAR rules, because features with low entropy lose generality.
The set of substructures Z includes the forty-nine substructures shown in Figure 2. These substructures are roughly divided into three categories: "the number of atoms", "Substructures" and "Properties". "The number of atoms" indicates how many atoms there are in an amino acid residue. "Substructures" indicates whether an amino acid residue has a specific substructure or not. "Properties" indicates whether an amino acid residue has a certain character from some viewpoint. For example, the first item of "Properties" describes a structure in which a methylene group and a hydrogen bond acceptor are connected via any atom.
An element z ∈ Z denotes each substructure shown in Figure 2. We define z* as any substructure other than z. In other words, one element z* is defined for each z. The substructure z* is the complement of the substructure z because z ∩ z* = ∅ and z ∪ z* = All. We then define the set Z* as the set of all elements z*. Finally, we define the base substructure set X as X = Z ∪ Z*.
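At the level of bits, the construction of X = Z ∪ Z* can be sketched as follows, assuming each amino acid residue is already encoded as a list of 0/1 values, one per substructure z in Z. The function name and the toy three-bit encoding are hypothetical and chosen only for illustration.

```python
def base_features(residue_bits):
    """Extend each residue's Z bits with the complementary Z* bits.

    Assumption: `residue_bits` is a list of 0/1 lists, one list per residue,
    with one bit per substructure z in Z. The bit for z* is 1 exactly when
    the bit for z is 0.
    """
    return [bits + [1 - b for b in bits] for bits in residue_bits]


# Toy example with three hypothetical substructures per residue.
residues = [[1, 0, 1], [0, 0, 1]]
print(base_features(residues))
# [[1, 0, 1, 0, 1, 0], [0, 0, 1, 1, 1, 0]]
```

For every residue, exactly one of the bits for z and z* is set, which mirrors the statement that z and z* are complementary.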

Definition of a set of intermediate bindings between any two base substructures
The activity of a peptide is determined not only by the structure of each amino acid residue but also by the relationship among amino acid residues. Here, we define an intermediate binding between two amino acid inductions as the distance between any two base substructures. The structures of peptides are more flexible than those of small compounds because peptides have many rotatable bonds. Descriptors for peptides should be able to describe this flexibility in order to obtain high accuracy.
We defined the intermediate bindings as shown in Figure 3. To increase the flexibility of the descriptors, we added sets of bindings within a certain length to the definition. In Figure 3, '*' denotes an amino acid residue and '~' denotes a peptide bond. '{}' denotes an 'or' condition. For example, '{~, ~*~, ~*~*~}' represents an intermediate peptide consisting of 0 to 2 amino acid residues. We represent the set of intermediate bindings as the set Y.
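The intermediate-binding notation can be read mechanically: each alternative inside '{}' allows as many intermediate residues as it contains '*' symbols. The sketch below assumes that reading and a comma-separated pattern string; the function name is hypothetical and the actual contents of Figure 3 are not reproduced here.

```python
def binding_lengths(pattern):
    """Return the set of intermediate-residue counts allowed by a pattern.

    Assumption: '~' is a peptide bond, '*' is one intermediate residue, and
    '{a,b,...}' lists alternatives, so each alternative allows as many
    intermediate residues as it has '*' symbols.
    """
    body = pattern.strip()
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    return {alternative.count("*") for alternative in body.split(",")}


print(sorted(binding_lengths("{~, ~*~, ~*~*~}")))  # [0, 1, 2]
print(sorted(binding_lengths("~*~")))              # [1]
```

Representing each y in Y as a set of allowed gap lengths makes the later membership check a simple set lookup.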

Definition of substructure-pair descriptor
SPAD is then defined by the following function, where N is the number of elements of X and M is the number of elements of Y.
$$F(x_i, y_k, x_j, p_a) = \begin{cases} 1 & \text{if peptide } p_a \text{ contains } x_i \text{ and } x_j \text{ connected by } y_k \\ 0 & \text{otherwise} \end{cases} \qquad (1 \le i, j \le N,\ 1 \le k \le M)$$
Here, the number of amino acid residues between the two substructures x_i and x_j must be equal to the length of y_k. When x_i, x_j and y_k are given, a peptide p_a is converted to a bit by the function F(x_i, y_k, x_j, p_a). We denote the index triple (i, j, k) as b. We then obtain the matrix (M_ab) = (F(x_i, y_k, x_j, p_a)) as the input of the QSAR analysis algorithm. The vector (M_a1, M_a2, ...) corresponds to the features of the peptide p_a.
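The following Python sketch shows one way the function F and the feature matrix could be computed. It assumes that F returns 1 when some residue carrying x_i is followed, after an allowed number of intermediate residues, by a residue carrying x_j, and 0 otherwise. The data layout (a peptide as a list of sets of substructure labels, a binding as a set of allowed gap lengths), the left-to-right ordering, and all names are assumptions made for illustration, not the authors' implementation.

```python
from itertools import product


def spad_bit(x_i, gaps, x_j, peptide):
    """Sketch of F(x_i, y_k, x_j, p_a).

    peptide: list of sets, one set of substructure labels per residue.
    gaps:    set of allowed numbers of residues between the two residues.
    """
    n = len(peptide)
    for s in range(n):
        if x_i not in peptide[s]:
            continue
        for gap in gaps:
            t = s + gap + 1  # position of the second residue
            if t < n and x_j in peptide[t]:
                return 1
    return 0


def spad_matrix(peptides, substructures, bindings):
    """Rows are peptides p_a; columns are the index triples b = (i, j, k)."""
    columns = list(product(substructures, bindings, substructures))
    return [[spad_bit(x_i, y_k, x_j, p) for x_i, y_k, x_j in columns]
            for p in peptides]


# Hypothetical three-residue peptide with made-up substructure labels.
peptide = [{"aromatic"}, {"hbond_acceptor"}, {"aromatic", "hbond_donor"}]
print(spad_bit("aromatic", {1}, "aromatic", peptide))  # 1: one residue in between
print(spad_bit("aromatic", {0}, "aromatic", peptide))  # 0: no adjacent aromatic pair
```

Complement substructures z* could be handled in the same layout by adding a separate label to every residue that lacks z.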

Definition of Datasets
We use two types of datasets to evaluate the proposed descriptor: C5a inhibitors [21] and kinase inhibitors [22]. Positive data are defined as peptides with high inhibitory potential, and negative data are defined as the other peptides and peptides with random sequences. The content of the datasets is as follows.
When two amino acid residues have carbon chains that differ from each other in length, APH distinguishes the two residues, but SPAD does not. The structures of such amino acid residues are very similar, so it is natural to consider that their properties are also approximately similar. In this case, a descriptor that ignores the difference is better. The second difference between SPAD and APH is whether information about properties is included in the descriptors. It may be unnecessary to distinguish amino acid residues from the viewpoint of some properties.
Figure 4 Descriptors with high correlation to peptides' activity in SPAD. Their range is from 3 to 6 amino acids.
Osoda and Miyano Journal of Cheminformatics 2011, 3:50 http://www.jcheminf.com/content/3/1/50

Comparison of descriptors correlated highly with peptides' activity
Comparing the descriptors shows that the range of the substructures of SPAD (Figure 4) is wider than that of APH (Figure 5). The range of APH is from 3 to 7 atoms. On the other hand, the range of SPAD is from 3 to 6 amino acid residues, which usually comprises 6-12 atoms. SPAD therefore captures a wider range of characteristics than APH, and its range is more appropriate for capturing the properties of peptides.

Capturing Area of APH and SPAD in active peptides
In the case of SPAD (solid curve in Figure 6), x ∈ Z or x ∈ Z*, where x denotes a substructure. We show the substructures x ∈ Z with high correlation to peptides' activity. In the case of APH (dotted curve in Figure 6), we likewise show the substructures with high correlation to peptides' activity. There are few overlapping regions between SPAD and APH, so the two descriptors capture different regions in a complementary way. APH tends to capture a component of a peptide, whereas SPAD tends to capture the relationship between the side chains of two amino acid residues.

Definition of dataset for similarity search with Tanimoto coefficient
Peptides are classified into three categories:
• non-active: negative peptides.
• active reference: positive peptides which are the basis of similarity search with Tanimoto coefficient.
• active: positive peptides except for active reference.
All peptides were sorted in descending order of Tanimoto coefficient.

Comparison of the performance of SPAD with APH
When the structures of two peptides are similar and a descriptor captures the whole structure or the properties of the peptides, the features of the two peptides have similar sequences of bits. As a result, the Tanimoto coefficient between these peptides becomes large. The structures of active peptides for a target protein are usually similar to each other because they bind to the same pocket of the target protein. When we describe peptides with a descriptor that captures their whole structures or properties, the Tanimoto coefficient between any two active peptides is larger.
Figure 5 Descriptors with high correlation to peptides' activity in APH. Their range is from 3 to 6 atoms, which is shorter than that of SPAD.
Conversely, the Tanimoto coefficient between an active peptide and a non-active peptide is smaller because their features differ from each other. However, if we describe peptides with a poor descriptor, we cannot always measure the similarity of peptides with the Tanimoto coefficient, because a poor descriptor destroys the similarity of structures when mapping them to features. Therefore, the Tanimoto coefficient is an indicator of the descriptor's performance. All peptides are sorted in descending order of Tanimoto coefficient, and we then count the number of active peptides along this ordering. Figure 7 shows the enrichment factor with the Tanimoto coefficient. The horizontal axis and the vertical axis are defined as follows.
• The horizontal axis: (the number of peptides with Tanimoto coefficient ≥ α) / (the number of all peptides)
• The vertical axis: (the number of active peptides with Tanimoto coefficient ≥ α) / (the number of active peptides)
The graph increases more rapidly as active peptides have larger Tanimoto coefficients than non-active peptides.
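The two axes can be computed from a ranked list as in the following sketch. It assumes that each peptide already has a Tanimoto coefficient to the active reference and an active/non-active label, and it emits one point per ranked peptide, which matches the threshold-based definition when there are no tied scores. The function name and the example values are hypothetical.

```python
def enrichment_curve(scores, is_active):
    """Return (x, y) points of the enrichment curve, one per ranked peptide.

    x: fraction of all peptides ranked so far (descending score).
    y: fraction of all active peptides found among them.
    Assumption: ties in score are not grouped, so with tied scores the
    per-rank points only approximate the threshold-based definition.
    """
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_all = len(ranked)
    n_active = sum(1 for _, active in ranked if active)
    if n_active == 0:
        raise ValueError("at least one active peptide is required")
    points = []
    found = 0
    for rank, (_, active) in enumerate(ranked, start=1):
        if active:
            found += 1
        points.append((rank / n_all, found / n_active))
    return points


# Made-up similarity scores and labels, for illustration only.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, False, True, False]
print(enrichment_curve(scores, labels))
# [(0.25, 0.5), (0.5, 0.5), (0.75, 1.0), (1.0, 1.0)]
```

A descriptor that ranks active peptides near the top yields a curve that rises steeply at small x, which is how the two descriptors are compared.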
In both cases, C5a inhibitors (left panel of Figure 7) and kinase inhibitors (right panel of Figure 7), the curve for SPAD is higher than the curve for APH. The enrichment factor with SPAD is higher than with APH at any percentage of active peptides. Therefore, SPAD translates similar structures into similar features more precisely than APH. This means that the performance of SPAD is higher than that of APH when analyzing peptides' activity.

Conclusions
Two-dimensional QSAR of peptides composed of 450 types of amino acid inductions requires descriptors that capture various properties. The atom pair holographic code and the substructure pair descriptor proposed here are such descriptors. APH captures the internal characteristics of an amino acid induction, whereas SPAD captures the relationship between two amino acid inductions. SPAD captures much more information for QSAR of peptides than APH and distinguishes active peptides from non-active peptides more accurately.