GPURFSCREEN: a GPU based virtual screening tool using random forest classifier

Background In-silico methods are an integral part of modern drug discovery paradigm. Virtual screening, an in-silico method, is used to refine data models and reduce the chemical space on which wet lab experiments need to be performed. Virtual screening of a ligand data model requires large scale computations, making it a highly time consuming task. This process can be speeded up by implementing parallelized algorithms on a Graphical Processing Unit (GPU). Results Random Forest is a robust classification algorithm that can be employed in the virtual screening. A ligand based virtual screening tool (GPURFSCREEN) that uses random forests on GPU systems has been proposed and evaluated in this paper. This tool produces optimized results at a lower execution time for large bioassay data sets. The quality of results produced by our tool on GPU is same as that on a regular serial environment. Conclusion Considering the magnitude of data to be screened, the parallelized virtual screening has a significantly lower running time at high throughput. The proposed parallel tool outperforms its serial counterpart by successfully screening billions of molecules in training and prediction phases.


Background
Conventional drug design relied on in-vitro methods. In these methods, wet lab experiments are employed to discover the activity of ligands with a target protein molecule. Depending on this approach to determine the activities of a large set of molecules would consume a lot of time and money. In-silico drug design breaks this bottleneck by using modern computational techniques. Insilico approach increases the speed and reduces the cost of drug design. Depending upon the structural knowledge of the ligand and proteins, there are four different approaches used in in-silico drug discovery. They are structure based drug design, ligand based drug design, de-novo design and library design. The most widely used approaches are structure based drug design and ligand based drug design [1].
Ligand based drug design works by building a conceptual model of the target protein. Ligand based virtual screening uses this model to evaluate and separate active molecules for a target protein. This process involves the application of classification algorithms on the conceptual model. The model acts as the training data [2,3] for the classification algorithms used in virtual screening. These algorithms are expected to learn the model parameters of the input training data. After the training phase, these algorithms are applied on molecules whose activities with the target protein are unknown. Based on the properties of the training data, the algorithms will be able to classify these new molecules as active or inactive. This eliminates wet lab experiments involving inactive molecules from being carried out [4]. Hence, this approach drastically reduces the number of molecules with which the activity of a target is to be studied.
Implementing virtual screening using sequential computing techniques often fail to produce results within a reasonable time frame. The complexity of operations exhibited by virtual screening encourages the usage of parallel computing techniques to tackle them. But, the installation and maintenance of an infrastructure for parallel computing with a large number of processors, such as a Multiple Instruction Multiple Data (MIMD) system, is neither cost effective nor energy efficient. Computing on Graphical Processing Unit (GPU) offers a large number of computational cores that are capable of performing floating point computations in parallel. The computation performance obtained by running this algorithm on a GPU is comparable to that of a CPU cluster [5]. A GPU infrastructure can be installed and maintained at a comparatively low cost. Hence, the advantage of using GPU computing is twofold; it is both cost effective and energy efficient. Therefore, implementing a parallelized version of virtual screening using GPU computing is highly desirable.
The aim of virtual screening is to filter inactive molecules from the active ones and not to misclassify any active molecules. The success of ligand based drug discovery depends on the effectiveness of the classifier used in virtual screening. Ideal virtual screening requires a classifier that could produce no false negatives and a lower number of false positives. This results a reduction in the spectrum of wet lab experiments being conducted. This reduces the cost. The popular classification algorithms that are widely used for virtual screening are Support Vector Machines (SVM) [6], Random Forest Classifiers (RFC) [7,8] and Navïe Bayes Classifiers [9]. This is because they have been found useful in many domains of bioinformatics and medicinal chemistry [10][11][12].

Random forest classifier
Random forest classifiers work by growing a predetermined number of decision trees simultaneously [13]. The internal nodes of a decision tree may contain one or more values. Each such internal node represents possible outcomes of the problem given to it. Splitting of internal nodes is based on maximum information gain using GINI index. The growth of the tree is restricted either by pruning or by setting a threshold value on the splitting criteria. The classification process is as follows. A test instance is run on all the decision trees grown in the forest. Each tree's classification of the test instance is recorded as a vote. The votes from all trees are aggregated and the test instance is assigned to the class that receives the maximum vote.

Parallel decision trees
There are many reported works in this area. Some of them are outlined in this section.
Nuno Amado et al. [14] presents an overview on the various ideas regarding implementing parallel decision trees using data parallelism, task parallelism and hybrid parallelism in their work. They also describe a new parallel implementation of the C4.5 decision tree construction algorithm using breadth first strategy. A method [15] for implementing the evaluation and training of decision trees and forests implemented completely on a GPU (non-CUDA version) was reported in 2009. A ubiquitous parallel computing approach for constructing a decision tree on GPU is also available [16]. This work exploits the divide and conquer parallelism in ID3 at two levels. One at the outer level of building the tree node by node in a top-down fashion. And the other at the inner level of sorting data records within a single node level. Grahn et al. presents a parallel version of the Random Forest machine learning algorithm namely CudaRF [17] which was implemented using the Compute Unified Device Architecture (CUDA). In this implementation, one CUDA thread is used to build one tree in the forest. The forest is constructed on the GPU using one thread for each decision tree in the random forest.
The study in [18] compares the effectiveness of Field Programmable Gate Arrays (FGPAs), multi-core CPUs and GPUs for accelerating classification using models generated by Compact Random Forest (CRF) machine learning classifiers. It was noted that FPGAs provided the best performance and performance per watt. Multicore CPUs with OpenMP based implementation ensured scalability. GPUs offered the best performance per dollar and more than twice the performance of CPUs. The issue with GPUs is that the performance deteriorated with larger classifiers.
All previous works [15][16][17] to use GPUs for Random Forest classification have relied on coarse-grained task parallelism and have yielded unsatisfactory results. Their attempts seem to underutilize the available parallelism of graphics hardware and have undergone only cursory evaluations. Liao et al. introduced CudaTree [19], a GPU Random Forest implementation which adaptively switches between data and task parallelism.

Implementation
The motivation for the parallelization of the process of building a random forest stems from the fact that decision trees in a random forest are built independent of one another. This results in the faster construction of decision trees. Thus, deployment of parallelized random forest speeds up the ligand classification during the virtual screening phase. Figure 1 illustrates the execution flow and communication between the host and the device for the execution of the proposed tool.
In GPU based computing, the process of growing each tree is assigned to a set of cores. The samples and features of a given tree are determined from the host system. The decision tree is built level by level on a grid, by assigning a block to build a level in a tree. The device memory of the GPU is used to manage the tree data in a grid. Each grid performs its evaluation independently.
Building a random forest requires growing a certain number of decision trees. Many decision tree algorithms are based on recursion. But, the kernels executed on the GPU do not support recursion [20]. So, the use of recursion is not feasible in the GPU based Random Forest algorithm. This necessitates the design of an iterative tree building algorithm. The data and output of each such decision tree are not interdependent until execution is completed. Hence, each such decision tree can be handled as an independent thread in the GPU. There are other parallel decision tree learning algorithms [19,21,22] that work by clubbing data and task parallelism. This is similar to the approach taken in the proposed algorithm.
There are two approaches to parallelize the building phase of a decision tree, namely data parallel depth first tree construction and fine grained task parallel breadth first construction [19]. The depth first tree construction uses the GPU to compute the optimal split point for a single node of the decision tree. Each CUDA thread block is responsible for a subset of samples from a single feature, followed by a parallel evaluation of the optimal feature split thresholds. In breadth first tree construction, instead of constructing a single tree node a whole level of tree nodes is created simultaneously.
In the serial implementation of Random Forest, the depth first method is adopted to split the nodes based on selected features. From the GPU perspective, the depth first tree construction method will become more expensive as the tree grows deeper. This is because of a large number of kernels created to perform feature split on a relatively small number of samples. Similarly, the breadth first construction is less efficient at the top of a tree, as the kernels would need to perform task parallel threads on less number of nodes.
In order to completely utilize the parallelism provided by the GPU, a hybrid approach is adopted in the proposed algorithm. The decision tree on GPU is constructed, starting from the root node in a depth first manner. After a certain point, the tree construction is switched to breadth first approach. This crossover point determines the efficiency of the GPU implementation and the effectiveness of the results produced by the Random Forest. This crossover point can be determined by setting a threshold on the number of nodes that is present in the decision tree. Once, the number of nodes in a decision tree crosses this threshold, the tree construction strategy is switched from depth first to breadth first. A very low value of this threshold will result in poor performance of the algorithm whereas a very large value will make the tree building slower. The only possible way to create a decision tree in a depth first manner in the GPU environment is by using a stack. The breadth first method of tree construction uses a queue. It is computationally faster to use the breadth first method for large data on a GPU. So it is more productive to use this hybrid method for constructing decision trees on a GPU.

Determining the breadth first crossover threshold
The instance at which point the size of the sub-tree is large enough to merit a switch from depth first to breadth first tree construction is critical. The scikit-learn [23] serial algorithm uses the depth first method for the tree construction. In the proposed parallel implementation, calculating and fixing the crossover threshold determines the speed of the algorithm. Since the number of features used in the present implementation is fixed for the parallelization of a given tree, the techniques derived by Liao et al. [19] are used here also. The optimal value of the crossover threshold is derived by varying the number of samples, features, and classes used in the classification. The optimal crossover value obtained through regression technique is: where n is the number of samples and f is the number of features considered at each node split. Therefore, the optimal crossover point is dependent on the number of features and the number of training samples used in the training process.

Proposed parallel algorithm
Random forest building on a GPU begins with a parallelized bootstrapping of the input data items on CUDA as shown in the Algorithm 1. It is then followed by the parallel creation of decision trees in depth first manner. GINI index is used for splitting the data. The childArray is used as a stack to create nodes as described in the Algorithm 1. After the number of nodes in the decision tree crosses the threshold, the tree creation switches to breadth first mode. Here, the childArray works as a queue and creates a set of nodes level by level as shown in Algorithm 1. The algorithm terminates by eliminating those nodes that fail to attain a threshold GINI index value.
Random forest prediction run on a GPU is shown in Algorithm 2. To classify a new instance, it is classified by all the trees in the forest in parallel. Each tree evaluates the features for the new instance. The label after processing the input is recorded as a vote. The votes from all trees are combined and the class for which maximum votes are counted (majority voting) is declared as the class of the new instance.

Results and discussion
The computational performance and quality of results obtained from the GPURFSCREEN for ligand based drug discovery are presented in this section.

Datasets used
Datasets with different numbers of molecules were used to gain insight into the quality of the result produced by     [4]. These descriptors fall into three categories. The first eight descriptors are used mainly to characterize the druglikeness of a compound. Another set of twenty-four continuous descriptors considered are based on a variation of BCUT descriptors to define a low dimensional chemistry space. The last 147 bit-string structural descriptors, known as Pharmacophore Fingerprints, are based on bioisosteric principles. The dataset used for training are shown in Table 1. The dataset for testing is taken from GDB17 [29], a chemical universal database for unknown compounds that has been enumerated by Lars Ruddigkeit et al.
The technical specification of GPU hardware used in the experimentation can be found in Table 2.

Interpretation
A Random Forest consists of a certain number of decision trees. Their nodes split data according to randomly selected features. This offers a better method to split features apart from the split in data, thus enabling the classifier to pick a combination of subtle changes in certain features of the data. Tables 3 and 4 show the comparison of serial and GPU implementation of random forest on the basis of precision, recall, accuracy, ROC area and F-score on different    [30,31]. Table 5 shows the depth-breadth crossover analysis for the AID2314 training set, for different values of crossover point. With the number of descriptors being constant, the optimal depth-breadth crossover largely depends on the size of the training data set. Loading small training data sets would build depth first driven trees while loading large data sets would build breadth first driven trees. Therefore, depth-breadth crossover was studied over the range from 1000 to 50,000. AID2314 is a balanced training set with 37,055 active out of the 296,456 compounds present in it. It may be noted that performance matrices do not change through the depth-breadth slide in the tree construction. The optimal crossover point of AID2314 is close to 25,000. The running time of AID2314 is the best in the table. The end user can change the main parameters, such as bfs_threshold and no_of_trees_in_the_forest in random-forest.py file in the code base for optimal performance. Table 6 shows the performance comparison between the training phase of the serial and that of GPU versions of RF based virtual screening on different ligand data models generated from the corresponding bioassay data in NCBI PubChem. Though there is a visible performance gain while using GPU, a major boost in performance is seen in the classification phase. The comparison of running times of the classification phase is shown in Table 7 (also see Fig. 2). For small input size, the performance gain was offset by the cost of copying the data to the GPU memory. The GPU version of random forest guarantees an increase in execution speed by 2-20 times. The speed up of the implementation increases with the number of molecules. The growth rate of execution time corresponding to increasing input size is lower for the proposed parallel implementation than the serial version. The proposed tool can easily take up billions of molecules for classification. Due to the difficulty in feature extraction of large input data, the table size is limited to ten million.
The number of molecules that can be simultaneously classified in the serial environment is constrained by the amount of memory available in the machine. It is evident that this new parallel tool GPURFSCREEN outperforms the serial versions in terms of the number of molecules considered for training and classification. This parallel implementation has successfully trained more than three hundred thousand molecules on a single batch. This can be extended to up to one million in a single batch. The larger the video RAM available on the GPU, greater the number of molecules that can be trained. It should be noted that this implementation can take up billions of molecules for screening, by using the technique of partitioning the data into batches.
As evident from the results, a huge performance gain is achieved in both training and prediction phases of the learning algorithm. This contributes to a significant improvement in virtual screening of ligand based data models. The performance of the random forest classifier can be further improved by increasing the number of decision trees in the ensemble. The performance of the classifier algorithm improves with an increase in the number of decision trees steadily up to a point, after which the performance starts to decline. This optimal point of the number of decision trees can be obtained by observing the error rate. The global minima of error rate with the change in the number of decision trees may be taken as the optimal number of decision trees. However, this number is specific to each ligand data set and cannot be specified beforehand.

Conclusion
A tool named GPURFSCREEN was developed for virtual screening process using Random Forest technique that works on a CUDA based GPU environment. Considering the large volume of data involved in ligand based drug design, this parallelized version of virtual screening is favorable for two significant reasons: reduced running time and high throughput. The computational performance offered by the GPU outperforms a multi-core system. Also, the cost of installation, power consumption and maintenance of a GPU based system are lower compared to other multi-core systems. Thus, the GPU based virtual screening for ligand based data sets is a viable alternative for quickly screening large quantities of ligand data at a comparatively lower cost.
A computational boost of 2-20 folds for Random Forest training and prediction is achieved on mediocre GPUs with a moderate number of GPU cores and video RAM. GPUs with a large number of computational cores and larger video RAM can run large bioassay data sets with significantly lower execution time. As a future extension, the virtual screening of ligand data sets can be further implemented and tested with other variants of random forest classifiers that implement balanced decision trees. The GPU implementation can also be extended to work with balanced decision trees for classification.