Utilizing maximal frequent itemsets and social network analysis for HIV data analysis

Koçak, Yunuscan; Özyer, Tansel; Alhajj, Reda

doi:10.1186/s13321-016-0184-9

Research Article
Open access
Published: 09 December 2016

Journal of Cheminformatics volume 8, Article number: 71 (2016)

Abstract

Acquired immune deficiency syndrome is a deadly disease caused by the human immunodeficiency virus (HIV). This virus attacks the patient's immune system and affects its ability to fight diseases. Developing effective medicine requires understanding the life cycle and replication ability of the virus. The HIV-1 protease enzyme cleaves an octamer peptide into peptides which the virus uses to create proteins. In this paper, a novel feature extraction method is proposed for understanding important patterns in octamer cleavability. This feature extraction method is based on data mining techniques which find important relations inside a dataset by comprehensively analyzing the given data. As demonstrated in this paper, using the extracted information in the classification process yields important results which may be taken into consideration when developing a new medicine. We have used the 746, 1625, Impens and Schilling datasets. Besides, we have performed social network analysis as a complementary alternative method.

Background

Acquired immune deficiency syndrome (AIDS) is a deadly disease caused by the human immunodeficiency virus (HIV). This virus destroys immune cells and slowly leaves the patient's body defenseless against other diseases. According to the Global Health Observatory (GHO) report, 78 million people have been infected with HIV and it has caused the death of 39 million people. As of 2013, nearly 35 million people are living with HIV/AIDS and the mortality is 1.5 million [1]. There are various attempts to keep the virus under control. Unfortunately, an effective cure has not been found yet despite the efforts to fully understand how the disease advances and its causes [2]. Inhibitors have been developed to keep it under control.

The HIV-1 protease enzyme is used by the virus to cleave an amino acid octamer into peptides which are used to create essential proteins. These proteins are used by the virus to reproduce itself. The spread of the virus in the body is currently blocked with protease inhibitors. Herein, the main issue is to understand the link between HIV-1 protease and the amino acid octamer for cleavage. Drugs become more of an issue during the therapy. Inhibitors mimic a peptide, but they are chemically modified so that the scissile bond cannot be cleaved [15].

Available medicines work as HIV-1 protease inhibitors [3], i.e., the aim is to slow down reproduction of the virus. To design better inhibitors, it will be beneficial to find out which amino acid sequences can be cleaved by HIV-1 protease [4]. This remains difficult due to the uncertainty in the patterns for cleavage sites of enzymes.

Amino acid residues are denoted by ${P_4}, {P_3}, {P_2}, {P_1},{P'_1},{P'_2},{P'_3},{P'_4}$ and their counterparts in protease are denoted by ${S_4},{S_3},{S_2},{S_1},{S'_1},{S'_2},{S'_3},{S'_4}$.

There are 20 possible amino acids which align to make an octamer. This leads to $20^8$ potential sequences. Data can be encoded in different ways. Although there are two alternative encoding schemes, namely OETMAP [5] and GP [6] encoding, a recent study by Rögnvaldsson et al. [13] noted that advanced feature encoding and selection schemes do not perform better than standard orthogonal encoding without feature selection. A different approach was demonstrated by Liao et al. [29], who used the Fresno semi-empirical scoring function to predict MHC molecule-peptide binding. Standard orthogonal encoding represents an octamer with 160 binary positions (i.e., $20 \times 8$). In each of the eight 20-bit segments, exactly one bit has value one to indicate the amino acid at that position of the octamer. Hence, in total eight bits are set to one and 152 bits have value zero.
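The orthogonal encoding described above can be illustrated with a minimal sketch. The function name `orthogonal_encode` and the alphabetical ordering of the one-letter amino acid codes are assumptions made for this illustration; the text does not fix the order of the 20 bits within a segment.

```python
# One-letter codes of the 20 standard amino acids.
# The ordering inside a 20-bit segment is an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def orthogonal_encode(octamer):
    """Return a list of 160 bits: one 20-bit segment per residue position."""
    if len(octamer) != 8:
        raise ValueError("expected an octamer of 8 residues")
    bits = []
    for residue in octamer:
        segment = [0] * len(AMINO_ACIDS)
        # str.index raises ValueError for letters outside the alphabet.
        segment[AMINO_ACIDS.index(residue)] = 1
        bits.extend(segment)
    return bits


encoded = orthogonal_encode("AAAAAPAK")
print(len(encoded), sum(encoded))  # 160 8
```

Each octamer therefore maps to a vector with exactly eight ones, whichever octamer is encoded.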

From a computational point of view, the problem of cleavage prediction reduces to binary classification. Recently, a consistency based feature selection mechanism associated with linear SVM has been proposed for the 746 dataset. Although there are several datasets, as completely described in [13], some patterns for cleavage have been elicited particularly in the 746-dataset. In addition to SVM methods, neural networks [7] and Markov models have been proposed in the literature. Another direction is introducing extra features by applying machine learning techniques. These techniques have been detailed in [13].

In this paper, we incorporate maximal frequent itemset mining to extract new features for cleavage prediction. These features have been added with different options to fully understand the performance compared to results that use the stand-alone standard encoding scheme. We alternatively utilize the mining results for selected features which were previously named for the 746-dataset. Thus, we facilitate the use of social network analysis in feature selection. A social network graph is constructed based on the results of the mining process. This forms a graph based on the relationships among items (maximal frequent itemsets). The power of social network analysis has been increasingly realized, and the technique has gained huge interest in the research community and has become very popular in multi-disciplinary domains. Social network analysis focuses on relationships among social entities. The proposed methodology has been tested and the results reported in this paper demonstrate its applicability and effectiveness.

The rest of this paper is organized as follows. “The necessary background” section covers the background necessary to understand the approach described in this paper. In particular, we provide a brief overview of network analysis and the fundamental definitions of frequent pattern mining and maximal frequent itemset mining. The proposed methodology is presented in “The methodology” section. Experiments and the analysis are discussed in “Experiment results and discussion” section; further, patterns specific to the 746-dataset are analyzed using social network analysis. “Conclusions and future work” section concludes the paper.

The necessary background

The methodology described in this paper integrates techniques from social network analysis and data mining which are briefly covered in this section. We use frequent pattern mining to construct a network between various molecules.

Social network analysis

A social network reflects connections between a set of items drawn from the investigated domain, called actors. Connections are determined based on the type of relationship to be studied, and this may lead to either a directed or an undirected network. A network may be analyzed based on its actors and connections to reveal discoveries which may be valuable for effective and informative decision making.

Network analysis metrics include a variety of measures which investigate various aspects of a given network. These include: (1) Degree centrality, which is computed differently for directed and undirected networks. For the former, each node has an in-degree and an out-degree which are, respectively, the number of links directed to and out of the node. For the latter, each node has a single degree which is the number of links connected to the node. (2) Betweenness centrality, which is the number of shortest paths passing through a given node. (3) Density, which is the ratio of the number of links existing in a graph to the number of links in a complete graph, i.e., the maximum density is one. (4) Eigenvector centrality, which determines how popular a given node is.
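Two of the measures listed above, degree and density, can be sketched for an undirected network as follows. The small four-node graph and the function names are hypothetical and serve only as an illustration.

```python
def degree_centrality(nodes, edges):
    """Degree of each node in an undirected graph given as a list of node pairs."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def density(nodes, edges):
    """Ratio of existing links to the number of links in a complete undirected graph."""
    n = len(nodes)
    return len(edges) / (n * (n - 1) / 2)


# Hypothetical toy network with four actors.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

degrees = degree_centrality(nodes, edges)
print(degrees)                # {'a': 3, 'b': 2, 'c': 2, 'd': 1}
print(density(nodes, edges))  # 4 existing links out of 6 possible
```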

Frequent patterns

Given a set of items, say I, it is possible to have various, not necessarily disjoint, subsets of I such that items in each subset are associated based on their coexistence in a given number of transactions, where each transaction is a non-empty subset of I. Studying all associations across all subsets could reveal valuable information that describes some implicit relationships between various items. Items associated in a reasonable number of the given subsets form a frequent itemset. For instance, genes in a body may be differently expressed in a number of samples, forming different sets of expressed genes, one set per sample. These sets of expressed genes do overlap, and analyzing them would lead to subsets of genes co-expressed together in a large number of samples. It is possible to determine a number of association rules from each frequent itemset by splitting the itemset into two non-empty disjoint subsets such that one subset forms the antecedent of the rule and the other subset forms the consequent of the rule. For instance, consider a set of samples where only the genes expressed in each sample are specified: $S_1: g_1, g_2,g_3,g_5,g_6$, $S_2: g_1,g_3,g_4, g_5, g_7$, $S_3: g_2,g_3, g_6, g_8$, and $S_4: g_1,g_3,g_4,g_8,g_9$. From these four samples, it is possible to find some frequent itemsets of co-expressed genes by assuming a minimum threshold value of 2, i.e., a set of genes is frequent if its genes coexist in at least 2 samples. Example frequent itemsets are {$g_1,g_3,g_4$}, {$g_2,g_3$}, etc.
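The gene example above can be checked with a brute-force enumeration. This is only a sketch for small inputs; the function name `frequent_itemsets` is hypothetical, and practical miners avoid enumerating every candidate.

```python
from itertools import combinations

# The four samples from the example; each set holds the genes expressed in a sample.
samples = {
    "S1": {"g1", "g2", "g3", "g5", "g6"},
    "S2": {"g1", "g3", "g4", "g5", "g7"},
    "S3": {"g2", "g3", "g6", "g8"},
    "S4": {"g1", "g3", "g4", "g8", "g9"},
}


def frequent_itemsets(transactions, min_count):
    """Map each itemset contained in at least min_count transactions to its count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                result[frozenset(candidate)] = count
                found_any = True
        if not found_any:
            break  # no frequent itemset of this size, hence none of a larger size
    return result


frequent = frequent_itemsets(list(samples.values()), min_count=2)
print(frequent[frozenset({"g1", "g3", "g4"})])  # 2
print(frequent[frozenset({"g2", "g3"})])        # 2
```

Both itemsets named in the example occur in exactly two samples, so they meet the threshold of 2.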

Association rule mining has been well-studied in the literature [10]. Frequent itemsets are prominent for capturing the intrinsic structure of a dataset. Formally speaking, let $T=\{t_1,t_2,\ldots ,t_n\}$ be a dataset of n transactions, where each transaction $t_i$ contains items, e.g., $t_i = \{I_{i1},I_{i2},\ldots ,I_{ik}\}$, and each item $I_{ij} \in I$, the set of all possible items. An itemset IS, which contains items from I, is said to be frequent if and only if it is a subset of a number of transactions in T greater than or equal to a pre-determined minimum support threshold value (minsup). Finally, given a set of items F, an association rule is formally defined as $X\rightarrow Y$ such that $X\cup Y=F$, $X\ne \emptyset $, $Y\ne \emptyset $ and $X\cap Y=\emptyset $. An itemset F is characterized by its support, which is defined as the fraction of transactions of which F is a subset. Further, an association rule $X\rightarrow Y$ has a confidence value, which is defined as the ratio of the support of $X\cup Y$ to the support of X. Minimum support (minsup) and minimum confidence (minconf) threshold values are used in the mining process for generating the association rules that can be derived from F. Formally, the support of itemset F is:

$$ support(F)=\frac{{{\# }\,\,{\text{of}}\,{\text{transactions}}\,{\text{having}}\,F}}{{|T|}} $$

where |T| is the total number of transactions. Itemset F is said to be frequent if and only if:

${\text{Frequent(F)}} = F \subseteq I\,\wedge \,{\text{support}}\,(F) \ge \,minsup$. Further, an association rule $X \rightarrow Y$ with $X \cup Y = F$ is said to be of specific importance when its confidence score is greater than or equal to the minimum confidence value (minconf). The confidence formula is:

$$ {\text{confidence}}(X \rightarrow Y)= \frac{{\text{support(F)}}}{{\text{support(X)}}} $$
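The support and confidence formulas can be evaluated on the four gene samples from the earlier example. This is a minimal sketch; the function names are chosen here for illustration only.

```python
# Gene samples S1..S4 from the frequent itemset example.
transactions = [
    {"g1", "g2", "g3", "g5", "g6"},
    {"g1", "g3", "g4", "g5", "g7"},
    {"g2", "g3", "g6", "g8"},
    {"g1", "g3", "g4", "g8", "g9"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(X union Y) / support(X) for the rule X -> Y."""
    if antecedent & consequent:
        raise ValueError("antecedent and consequent must be disjoint")
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


print(support({"g1", "g3"}, transactions))       # 0.75
print(confidence({"g1"}, {"g3"}, transactions))  # 1.0
print(confidence({"g3"}, {"g1"}, transactions))  # 0.75
```

The two rules share the same itemset F = {g1, g3} but differ in confidence, because the denominator is the support of the antecedent only.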

Frequent itemsets can be condensed into different forms such as closed frequent itemsets [12] and maximal frequent itemsets [11]. A frequent itemset is closed if none of its proper supersets has the same support. Formally,

$$ ClosedItemset(F) = {\text{Frequent(F)}} \,\wedge \, \forall Z ((Z \supset F) \rightarrow (support(F) \ne support(Z))) $$

An itemset F is maximal, if it is frequent and none of its supersets is frequent. This can be formalized as:

$$ MaximalFrequentItemset(F) = {\text{Frequent(F)}} \,\wedge \, \forall Z ((Z \supset F) \rightarrow \lnot \,{\text{Frequent(Z)}}) $$

Closed frequent and maximal frequent itemsets are two concise classes of itemsets which could be used to produce some valuable knowledge in a more controlled and efficient way as described in this paper.
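The difference between closed and maximal frequent itemsets can be seen on the gene samples from the earlier example with a minimum count of 2. The sketch below enumerates all frequent itemsets by brute force, which is feasible only for tiny inputs; the function name is hypothetical.

```python
from itertools import combinations

# Gene samples S1..S4 from the frequent itemset example.
transactions = [
    {"g1", "g2", "g3", "g5", "g6"},
    {"g1", "g3", "g4", "g5", "g7"},
    {"g2", "g3", "g6", "g8"},
    {"g1", "g3", "g4", "g8", "g9"},
]


def frequent_counts(transactions, min_count):
    """Brute-force map from each frequent itemset to its transaction count."""
    items = sorted(set().union(*transactions))
    counts = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                counts[frozenset(candidate)] = count
    return counts


counts = frequent_counts(transactions, min_count=2)

# Closed: no proper superset has the same support. A superset with the same
# support is itself frequent, so searching among frequent itemsets suffices.
closed = {s for s, c in counts.items() if all(counts[z] != c for z in counts if s < z)}

# Maximal: no proper superset is frequent.
maximal = {s for s in counts if not any(s < z for z in counts)}

print(len(counts), len(closed), len(maximal))  # 19 6 4
```

Every maximal itemset is also closed, but not the other way round: {g3} and {g1, g3} are closed because their supports (4 and 3 samples) exceed those of all their supersets, yet neither is maximal.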

The methodology

Our methodology is organized in four phases. The first phase transforms the original input data by using orthogonal encoding. The second phase finds frequent itemsets in the new data representation. The third phase selects the required itemsets from the obtained frequent itemsets. Finally, the selected itemsets are considered as features for classifying instances. In addition, as a complementary analysis, important itemsets are found by applying social network analysis metrics on a network among the existing itemsets.

Data modification

The methodology starts by transforming the original data into orthogonal encoding. In order to find frequent itemsets based on sequences of amino acid octamers, each amino acid is also tagged with its position in the octamer. For example, the first instance of the Schilling dataset [17], namely

$$ AAAAAPAK $$

has been transformed into

$$ P_4A, P_3A, P_2A, P_1A, P'_1A, P'_2P, P'_3A, P'_4K. $$

In orthogonal encoding, there are 8 features for each instance and each feature can have 20 different values, one for each possible amino acid.
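The position tagging shown above can be sketched as follows. The string form of the items (for example `P1'A` for $P'_1A$) and the function name `to_items` are assumptions of this illustration.

```python
# Residue positions from P4 to P4'; the prime marks the positions after the scissile bond.
POSITIONS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]


def to_items(octamer):
    """Tag each amino acid of an octamer with its residue position."""
    if len(octamer) != len(POSITIONS):
        raise ValueError("expected an octamer of 8 residues")
    return [position + residue for position, residue in zip(POSITIONS, octamer)]


items = to_items("AAAAAPAK")
print(items)
# ['P4A', 'P3A', 'P2A', 'P1A', "P1'A", "P2'P", "P3'A", "P4'K"]
```

After this step, the same amino acid at two different positions yields two distinct items, so each octamer becomes a transaction of eight items.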

Finding frequent itemsets

After transforming the dataset into the new representation, frequent itemsets based on the sequential amino acid octamer can be found. The FP-Growth algorithm has been used to extract frequent itemsets which are above a certain support threshold [18, 19]. In this study, maximal frequent itemsets have been extracted. Maximal frequent itemsets give us a summarization of the given dataset. It is a lossy compression in the sense that all subsets of maximal itemsets are also frequent, but the support value of each subset is not known.

In our experiments, the methodology works as follows: frequent itemsets are extracted in three ways by considering: (1) data having the cleavage class value, (2) data having the non-cleavage class value, and (3) the whole dataset regardless of class value. These three sets of frequent itemsets have been used in our experiments in different combinations.

The reason for separating the datasets is to find different patterns for each underlying class value. There may be some patterns that are frequent and specific to cleaving data. On the other hand, some other frequent itemsets may be specific to non-cleaving data. The separation leads to identifying all patterns which may have low support in the entire dataset but high support for a specific class value (cleavage or non-cleavage).
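The effect of mining each class separately can be illustrated with a small sketch. The six labelled octamers are hypothetical and not taken from any of the datasets, and a brute-force search replaces FP-Growth here purely to keep the example self-contained.

```python
from itertools import combinations
from math import ceil

POSITIONS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]


def to_items(octamer):
    """Position-tagged items of an octamer, e.g. 'P1F' or "P1'P"."""
    return frozenset(p + aa for p, aa in zip(POSITIONS, octamer))


def maximal_frequent_itemsets(transactions, minsup):
    """Brute-force maximal frequent itemsets (illustration only, not FP-Growth)."""
    min_count = ceil(minsup * len(transactions))
    item_counts = {}
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
    items = sorted(i for i, c in item_counts.items() if c >= min_count)
    frequent = []
    for size in range(1, len(items) + 1):
        level = [frozenset(c) for c in combinations(items, size)
                 if sum(1 for t in transactions if set(c) <= t) >= min_count]
        if not level:
            break
        frequent.extend(level)
    return {s for s in frequent if not any(s < z for z in frequent)}


# Hypothetical labelled octamers: 1 = cleaved, 0 = not cleaved.
data = [("GSTFPACD", 1), ("HIKFPLMN", 1), ("QRVFPWYE", 1),
        ("ACDGKSTV", 0), ("EFHGKPQR", 0), ("STAWYEHI", 0)]

cleaved = [to_items(s) for s, label in data if label == 1]
non_cleaved = [to_items(s) for s, label in data if label == 0]

patterns = {
    "cleaved": maximal_frequent_itemsets(cleaved, minsup=0.5),
    "non_cleaved": maximal_frequent_itemsets(non_cleaved, minsup=0.5),
    "all": maximal_frequent_itemsets(cleaved + non_cleaved, minsup=0.5),
}
for name, found in patterns.items():
    print(name, [sorted(s) for s in found])
```

In this toy data, the pattern with G at $P_1$ and K at $P'_1$ appears in only two of the six octamers, so it falls below the threshold on the whole dataset but is found when the non-cleaved class is mined on its own.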

Alternatives for feature selection

We have accumulated a number of features in the form of attribute patterns, namely the maximal frequent itemsets obtained with a very low minimum support threshold value. In our experiments, we have set this value to 0.05, which is low enough to capture a sufficient number of itemsets.

Dataset features can potentially be expanded further, i.e., by resorting to a rich set of features. Then, the most informative features should be selected. During the process, frequent itemsets are used as features, and the value at the intersection of an instance and a feature is the number of positions at which both have the same amino acid at the same residue. This function is named similarity.

For example, assume A is a frequent itemset which contains the items $(P'_3D, P'_1Y, P'_4S, P_1Y, P_4S)$. Assume B is an instance which consists of the amino acid octamer $(P_4A, P_3A, P_2A, P_1A, P'_1Y, P'_2P, P'_3D, P'_4K)$. The similarity between A and B is 2 because only the items $P'_3D$ and $P'_1Y$ are present in both. The similarity formula can be expressed as:

$$ similarity(A,B) = |A \cap B| = \text{number of identical amino acids at identical residue positions in } A \text{ and } B $$

The new dataset is constructed by applying the similarity function to every instance-feature combination. For a dataset with M instances and N frequent itemsets, the expanded dataset has size $M \times N$.
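The similarity function and the dataset expansion can be sketched as follows, using the itemset A and instance B from the example. The string form of the items and the function names `similarity` and `expand` are assumptions of this illustration.

```python
def similarity(itemset, instance):
    """Number of position-tagged items shared by a frequent itemset and an instance."""
    return len(set(itemset) & set(instance))


def expand(instances, itemsets):
    """Build the M x N dataset: one row per instance, one column per frequent itemset."""
    return [[similarity(itemset, instance) for itemset in itemsets] for instance in instances]


# Itemset A and instance B from the example in the text.
A = {"P3'D", "P1'Y", "P4'S", "P1Y", "P4S"}
B = {"P4A", "P3A", "P2A", "P1A", "P1'Y", "P2'P", "P3'D", "P4'K"}
print(similarity(A, B))  # 2

# A second, hypothetical itemset to show the shape of the expanded dataset.
other = {"P4A", "P3A"}
matrix = expand([B], [A, other])
print(matrix)  # [[2, 2]]
```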

In the first approach, we used the well-known principal component analysis (PCA) technique. Briefly, it maps correlated features into linearly uncorrelated features. In other words, it can be used for dimensionality reduction. The second approach applies filtering by using a position based method. Here, frequent itemsets which have items at positions $P_1$ and $P'_1$ are selected. It has been reported that the $P_1$ and $P'_1$ positions in the octamer are important as they are informative to locate where cleavage happens. In this approach, only frequent itemsets containing items relevant to the mentioned positions have been considered [14]. The third approach utilizes social network analysis (SNA) methods for filtering. It is a novel feature selection method which creates a social network between possible features. Then, for each feature in the network, its centrality score is calculated using different centrality measures. In each approach, the selected features form the new dataset.
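The position based filter of the second approach can be sketched as below. The text leaves open whether an itemset must cover both $P_1$ and $P'_1$ or only one of them; this sketch assumes both are required, and the item format and function names are hypothetical.

```python
def position_of(item):
    """Strip the trailing one-letter amino acid code, e.g. "P1'Y" -> "P1'"."""
    return item[:-1]


def covers_cleavage_site(itemset, required=("P1", "P1'")):
    """True if the itemset has an item at every required position (an assumption)."""
    positions = {position_of(item) for item in itemset}
    return all(position in positions for position in required)


# Hypothetical frequent itemsets.
itemsets = [
    {"P1F", "P1'P"},
    {"P1Y", "P4S"},
    {"P2A", "P3'D"},
    {"P4S", "P1L", "P1'A"},
]
selected = [s for s in itemsets if covers_cleavage_site(s)]
print(len(selected))  # 2
```

Relaxing the filter to accept either position amounts to replacing `all` with `any` in `covers_cleavage_site`.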

Fitting into machine learning algorithm

We have represented the data in orthogonal encoding as suggested in [13]. Alternatively, a group of feature selection methods is proposed. After the feature selection process, the new dataset can be fed into a classifier to decide on the occurrence of cleavage. We have employed a support vector machine (SVM) with linear kernel [13] and feature selection algorithms such as principal component analysis (PCA), recursive feature elimination (RFE) and the univariate ANOVA F-value. The feature selection algorithms reduce the data to 100 features. We have also employed CMAR (JCBA) [25]^{Footnote 1} and CPAR [26].^{Footnote 2} ROC-AUC results are not reported for CPAR.

Methodology of social network analysis

Social Network Analysis (SNA) is used to understand characteristics of a given network represented as a graph. Vertices represent actors in the network and edges represent interactions between actors.

By looking at the network structure, it is possible to identify vertices which are more important compared to others. In general, vertices in the center of the network are more representative. As mentioned in “The necessary background” section, a variety of centrality measures are defined to reflect different perspectives by calculating different centrality scores of a vertex. One of these centrality metrics is normalized betweenness. Given a graph G, the normalized betweenness centrality of a vertex v in G is calculated as the number of shortest paths passing through vertex v divided by the total number of shortest paths in graph G. Another relevant centrality measure is PageRank [28], which is calculated as follows. After the initialization phase, each vertex votes for other vertices regarding their importance, and the votes of important vertices have a higher impact on PageRank.
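The voting idea behind PageRank can be sketched with a plain power iteration. The damping factor of 0.85, the fixed number of iterations and the star-shaped toy graph are assumptions of this sketch; the text does not state which settings were used.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power iteration on a graph given as {vertex: [vertices it links to]}."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # initialization phase: uniform scores
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A vertex without links spreads its vote over all vertices.
            targets = adjacency[v] or nodes
            share = damping * rank[v] / len(targets)
            for w in targets:
                new_rank[w] += share
        rank = new_rank
    return rank


# Hypothetical undirected star graph: every edge is listed in both directions.
graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
rank = pagerank(graph)
print(max(rank, key=rank.get))  # a
```

The central vertex receives the votes of all other vertices and therefore obtains the highest score, which matches the intuition that central vertices are more representative.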

SNA measures have been used to find out which feature sets are more important for our problem. First, a social network of frequent itemsets is constructed. A matrix M was defined where each row represents an instance and each column represents a feature. The intersection between a row and a column is filled based on the similarity function defined in “Alternatives for feature selection” section.

$$ M_{ij} = similarity(A_j, B_i) $$

where $B_i$ is the i-th instance and $A_j$ is the j-th frequent itemset.

Given a two dimensional matrix which reflects a relationship between two sets of items (which are actors), folding is the process of multiplying a two dimensional matrix by its transpose to obtain a new matrix where rows and columns represent the same set of actors. Folding is applied on M to find the similarity between frequent itemsets. Frequent itemsets form the rows and columns of the matrix F produced by the folding process.

$$ F=M^T \cdot M $$

After folding, a graph is constructed using adjacency matrix F, where each column is a vertex and if the entry at the intersection between a row and a column is greater than zero then an edge is constructed between the corresponding vertices. For this graph, PageRank and betweenness centrality measures are computed and the top 50 frequent itemsets are chosen.
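Folding and the construction of the graph can be sketched on a small matrix. The matrix values are hypothetical, and ignoring the diagonal of F (self-loops) is an assumption of this sketch, since the text does not say how diagonal entries are treated.

```python
def fold(M):
    """Return F = M^T . M for a matrix M given as a list of rows (instances x itemsets)."""
    n_cols = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(n_cols)]
            for i in range(n_cols)]


def edges_from_folded(F):
    """Undirected edges between distinct itemsets whose folded entry is positive."""
    return [(i, j) for i in range(len(F)) for j in range(i + 1, len(F)) if F[i][j] > 0]


# Hypothetical similarity matrix: 3 instances (rows) by 3 frequent itemsets (columns).
M = [
    [2, 1, 0],
    [0, 1, 0],
    [1, 0, 3],
]
F = fold(M)
print(F)                     # [[5, 2, 3], [2, 2, 0], [3, 0, 9]]
print(edges_from_folded(F))  # [(0, 1), (0, 2)]
```

Itemsets 1 and 2 never have a positive similarity with the same instance, so their folded entry is zero and no edge connects them.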

Experiment results and discussion

Four datasets have been utilized in the testing, namely 746Data [15], 1625Data [20], impensData [22,23,24] and schillingData [21]. Three of these datasets have been rectified (746Data, 1625Data, and schillingData) [13]. The four datasets are available at the UCI Machine learning repository.^{Footnote 3} Details about these four datasets may be found in [13].

We have performed tenfold stratified cross validation for the classification in order to avoid the overfitting problem. During the tenfold cross validation, for each test case, frequent itemsets have been found using all training folds, some frequent itemsets have been selected, and a new dataset has been created by applying the similarity function over training and test instances (rows) and frequent itemsets (columns). The classifier model has been built using the training folds and testing has been conducted using the remaining fold.
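The fold structure described above can be sketched in plain Python. The round-robin assignment without shuffling and the function names are simplifications assumed for this illustration; a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead.

```python
def stratified_folds(labels, k=10):
    """Assign instance indices to k folds so that class ratios stay similar."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # round-robin within each class
    return folds


def train_test_splits(labels, k=10):
    """Yield (train, test) index lists, holding out one fold at a time."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# Hypothetical labels: 20 cleaved (1) and 30 non-cleaved (0) instances.
labels = [1] * 20 + [0] * 30
splits = list(train_test_splits(labels, k=10))
for train, test in splits:
    # Frequent itemsets would be mined on `train` only; the similarity function
    # is then applied to both `train` and `test` before fitting the classifier.
    pass
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 45 5
```

Mining the itemsets inside each split, rather than once on the full dataset, keeps information from the held-out fold out of the features.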

Our system has been implemented in Python using the scikit-learn packages.^{Footnote 4} The SVC classifier has been used with a linear kernel, a penalty value of 1.0 and a tolerance value of 1e−4 for the stopping criterion. Additionally, PyFIM^{Footnote 5} has been used for extracting frequent itemsets. The cross validation results of the original datasets transformed into orthogonal encoding have been taken as the baseline for comparison purposes. The suggested methods used in the rest of the article are listed in Table 1.

Table 1 Abbreviation and explanation

Additional file

Additional file 1. Non-parametric statistic methods for comparative analysis between methods.
