Chemical toxicity prediction based on semi-supervised learning and graph convolutional neural network

As safety is one of the most important properties of drugs, chemical toxicity prediction has received increasing attention in drug discovery research. Traditionally, researchers rely on in vitro and in vivo experiments to test the toxicity of chemical compounds. However, not only are these experiments time-consuming and costly, but experiments that involve animal testing are increasingly subject to ethical concerns. While traditional machine learning (ML) methods have been used in the field with some success, the limited availability of annotated toxicity data is the major hurdle for further improving model performance. Inspired by the success of semi-supervised learning (SSL) algorithms, we propose a Graph Convolutional Neural Network (GCN) to predict chemical toxicity and train the network with the Mean Teacher (MT) SSL algorithm. Using the Tox21 data, our optimal SSL-GCN models for predicting the twelve toxicological endpoints achieve an average ROC-AUC score of 0.757 on the test set, a 6% improvement over GCN models trained by supervised learning and over conventional ML methods. Our SSL-GCN models also exhibit superior performance when compared to models constructed using the built-in DeepChem ML methods. This study demonstrates that SSL can increase the predictive power of models by learning from unannotated data, with the optimal unannotated-to-annotated data ratio ranging between 1:1 and 4:1. Given the success of SSL in chemical toxicity prediction, the same technique is expected to benefit other chemical property prediction tasks by exploiting existing large chemical databases. Our optimal SSL-GCN model is hosted on an online server accessible through: https://app.cbbio.online/ssl-gcn/home. Supplementary information accompanies this paper at 10.1186/s13321-021-00570-8.

Toxicity is one of the five ADMET properties (absorption, distribution, metabolism, excretion, and toxicity) that must be strictly ascertained before a new drug candidate is approved for clinical trials [8]. On the premise that "the structure of a chemical substance implicitly determines its physical and chemical properties and reactivity, and these properties interact with biological systems to determine its biological/toxicological properties" [9,10], efforts have been made to develop computational methods, often machine learning (ML) based, that attempt to relate the toxicological properties of compounds to their chemical structures. For a comprehensive review of ML-based toxicity prediction methods, the reader is referred to refs. [11-13].
Graph Convolutional Neural Networks (GCN) are commonly used for tasks such as social network analysis and knowledge graph mining. Since biomolecular structures can also be represented as graphs, a variety of GCN-based biomolecular property prediction models have been developed in recent years. For example, the Weave model, proposed by Kearnes et al. in 2016 [14], is a deep learning system based on molecular graph convolutions. This model uses only simple descriptions of atoms, bonds, and atom pairs as input data. A learnable module, the Weave module, extracts and combines atom features and pairwise distance relationships with learnable parameters. These modules can be stacked to an arbitrary depth, allowing the architecture to be fine-tuned for different learning tasks. In 2017, Li et al. proposed the GraphConv-SuperNode model [15]. By adding a dummy fully connected node (the super node) to each graph, this model captures and extracts graph-level representations from chemical structures, allowing it to focus on graph-level classification and regression tasks. In 2020, Wang et al. proposed a graph attention convolutional neural network (GACNN), a GCN over undirected graphs with an attention mechanism, to classify chemicals poisonous to honey bees [16]. They demonstrated that their GACNN model outperformed all previous models, and they also summarised important structural features that might lead to poisoning.
All of these previous studies have highlighted the advantages of using GCN-based models to predict biomolecular properties. First, the suitability of different traditional molecular descriptors for different tasks significantly affects model performance [16,17]. Graph-based molecular representations circumvent this problem by preserving the structural and physicochemical information of the molecules. Second, the majority of models using graph-based techniques perform better on biomolecular property prediction tasks than conventional ML models using traditional molecular descriptors [14-16,18]. Third, since GCN-based models can directly manipulate graph-based molecular representations, they retain molecular structural information during prediction. This characteristic makes the interpretability of GCN-based models superior to that of other traditional ML models.
Based on their training strategies, ML algorithms can be broadly classified into 4 types, namely supervised learning (SL), semi-supervised learning (SSL), unsupervised learning, and reinforcement learning [19]. All the prediction models mentioned above are based on SL algorithms, which learn only from annotated datasets. However, despite enormous efforts in data curation and data sharing, the amount of labeled data falls far short of the number of known compounds. Strategies that make use of unannotated data, such as SSL, are expected to enhance the generalizability of prediction models.
Therefore, inspired by the success of GCN and the need to improve chemical toxicity prediction with limited data, we designed a learning system that hybridizes a graph convolutional neural network (GCN) with SSL to predict the toxicity of chemical compounds. Here, we used chemical data from the Tox21 dataset as annotated data and collected compounds from other datasets as unannotated data. First, the molecular features encoded in the GCN were defined; then experiments were performed to investigate the influence of SSL on the predictivity of the models. Moreover, the performances of the SSL models with varying unannotated data ratios were compared, which showed that SSL has a positive influence on the prediction performance of GCN models. This paper is organized as follows. The theoretical foundation of GCN and the Mean Teacher SSL algorithm are presented in the Material and Method section. The dataset, model, and validation technique are then described. The Results section contains a comparative study of the performances of the traditional ML, SL-GCN, and SSL-GCN models. The impact of various unannotated data ratios is also investigated. Finally, SSL-GCN is compared to existing DeepChem methods for toxicity prediction.

Graph convolutional neural network (GCN)
Traditional convolutional neural networks (CNN) can extract features from Euclidean or grid-structured data, such as images and text. However, for non-Euclidean data such as social networks, knowledge graphs, or chemical structures, the irregular data topology prevents CNN from operating on them directly [20,21]. A solution for machine learning on non-Euclidean data is the Graph Convolutional Neural Network (GCN) [22]. GCN has been widely used to solve computer science problems such as social network analysis [23], natural language processing [24,25], and recommendation systems [26,27], as well as chemistry problems such as molecular property prediction [14,15,18,20]. For the latter, each molecule is described as an undirected graph in which atoms are represented as nodes and covalent chemical bonds as edges. The basic idea of graph convolution is to apply a learnable function to each node and its neighbors, gradually merging information from distant atoms through the connecting edges and ultimately extracting the atom-type and connectivity patterns in the molecule. In this work, we used the off-the-shelf GCN method proposed by Kipf and Welling in 2017 [28]. The layer-wise propagation function of this approach, denoted f(H^(l), A), is defined in matrix form as

H^(l+1) = f(H^(l), A) = σ(D̃^(−1/2) Ã D̃^(−1/2) H^(l) W^(l)), with Ã = A + I,

where Ã is the adjacency matrix A of an undirected graph G with added self-connections I, and D̃ is the degree matrix of Ã. H^(l) ∈ R^(N×D) is the node signal (feature) matrix generated by the l-th layer, where N and D denote the number of nodes in the graph and the dimension of each node's signal vector, respectively. W^(l) is the layer-specific learnable weight matrix of the l-th layer, and σ denotes a non-linear activation function [28].
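As a sanity check, the matrix-form propagation rule can be sketched in a few lines of NumPy on a toy three-node graph; the graph, features, and weight matrix below are illustrative assumptions, not data from this study.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One propagation step: H' = ReLU(D~^-1/2 A~ D~^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])            # A~ = A + I (self-connections)
    d = A_tilde.sum(axis=1)                     # node degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D~^-1/2
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalisation
    return np.maximum(0.0, A_hat @ H @ W)       # sigma chosen here as ReLU

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])    # toy path graph 0-1-2
H = np.eye(3)                   # one-hot node signals (N = D = 3)
W = np.ones((3, 2))             # toy weight matrix (3 -> 2 features)
H_next = gcn_layer(H, A, W)
print(H_next.shape)             # (3, 2)
```

Each row of `H_next` mixes a node's own signal with its neighbours' signals, scaled by the symmetric degree normalisation.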
To facilitate implementation, the previous equation can be written node-wise as

h_i^(l+1) = ReLU( b^(l) + Σ_{j ∈ N(i) ∪ {i}} W^(l) h_j^(l) ),

where N(i) is the set of neighbors of node i, W^(l) is the layer-specific learnable weight matrix of the l-th layer, h_j^(l) is the signal (feature) vector of each neighboring node j of i, and b^(l) is the bias of the l-th layer. Thus, the signal of each node in the next layer is determined by the weighted sum of its own signal and the signals of its adjacent nodes in the current layer. All signals are non-linearly transformed using the Rectified Linear Unit (ReLU) function, ReLU(x) = max(0, x).
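The node-wise update can likewise be sketched directly from the formula above; the toy path graph and the identity weight matrix are illustrative assumptions used only to keep the example self-contained.

```python
import numpy as np

def gcn_node_update(h, neighbors, W, b):
    """h_i' = ReLU(b + sum of W h_j over j in N(i) and i itself)."""
    h_next = []
    for i in range(len(h)):
        agg = h[i].copy()             # start with the node's own signal
        for j in neighbors[i]:        # add each neighbour's signal
            agg = agg + h[j]
        h_next.append(np.maximum(0.0, W @ agg + b))
    return np.stack(h_next)

h = [np.eye(3)[i] for i in range(3)]          # one-hot signals, path graph 0-1-2
neighbors = {0: [1], 1: [0, 2], 2: [1]}       # adjacency as neighbour lists
out = gcn_node_update(h, neighbors, np.eye(3), np.zeros(3))
```

With identity weights and zero bias, row i of the output simply marks node i and its neighbours, making the neighbourhood aggregation easy to inspect.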

Semi-supervised learning (SSL)
The basic idea of machine learning (ML) is to reproduce the human learning process with computer algorithms. Most ML algorithms can be classified into four types [19,29]: supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning. The most commonly used is supervised learning, which derives knowledge from training data with fully annotated labels [30]. However, acquiring accurately annotated data is difficult for certain tasks, such as chemical compound property prediction. On one hand, tens of thousands of known chemical compounds exist in nature, and even more artificial compounds are produced every year. On the other hand, each annotation requires a labor-intensive and expensive procedure, from compound synthesis to measurement. Consequently, a significant number of molecules are not properly labelled, while some labels may be subject to experimental errors. To learn from incompletely annotated data, semi-supervised learning is more suitable [31].
In SSL, it is assumed that the label function is smooth in high-density areas, so data points located in the same area should share the same label. Based on this smoothness assumption, even unlabelled data can be exploited in the learning process. The main idea is to build classification models that are robust to local perturbations in the input data: when the input is perturbed with a small amount of noise, the predictions for the perturbed and original data should be similar [32]. Since this consistency in predictions does not depend on the data labels, unlabelled data can be exploited during training to enhance the prediction consistency of the model.
Earlier SSL models that used this consistency regularization, such as the Γ-model [33], assigned two roles (teacher and student) to the same model. In the student role, the model learns from labeled data. In the teacher role, the model generates targets for unlabeled data, which it then uses as a student for consistency learning. However, at the beginning of training, the generated targets for unlabeled data are most likely incorrect. Because the consistency cost on unlabeled data then outweighs the classification cost on labeled data, the model cannot learn any new information from the training process [34]. One way to solve this problem is to carefully select or update the teacher model instead of sharing the same model with the student. Following this idea, the Π-model and the Temporal Ensembling model were proposed in 2017 [35].
In each training epoch of the Π-model, the same unlabeled data are predicted twice under different roles (student and teacher). Since data perturbation and dropout are applied in each prediction, the two passes give slightly different predictions for the same data. The goal of the Π-model during training is to make the two predictions for the same unlabeled data as consistent as possible. Experiments show that this method eventually enables the teacher to produce accurate targets for unlabeled data [35]. However, the computational cost of this model is high. The Temporal Ensembling model improves on the Π-model by making predictions only once per training epoch for unlabeled data, halving the number of predictions and nearly doubling the speed. To calculate the consistency cost in the Temporal Ensembling model, the target of the unlabeled data is generated from the exponential moving average (EMA) of the predictions for that data in previous training epochs. However, since each target is updated only once per epoch, the update is too slow, which still limits the training speed of the Temporal Ensembling model [34].
In this study, we implemented the SSL algorithm proposed by Tarvainen and Valpola, called Mean Teacher (MT) [34]. To circumvent the limitations of the Temporal Ensembling model, the MT algorithm updates the internal weights of the model through the EMA strategy at each training step, rather than updating the targets of the unlabeled data at each training epoch, producing a more accurate model. During training, the algorithm requires two models with the same architecture, namely the student model and the teacher model. In each training step, the student model updates its internal weights based on the classification loss on the labeled data and the consistency loss between the two models on the unlabeled data. After the student model is updated, the teacher model is also updated using the EMA strategy defined in Equation 4 [31,34]. Previous studies have demonstrated that this kind of self-ensembling framework can improve classification models [34,35]. In the pseudo code of this algorithm, g(·) denotes the data perturbation function; m_s(·) and m_t(·) represent the student and teacher models, respectively; θ_s^(i) and θ_t^(i) represent their internal weights at training step i; and z and z′ are the generated classification probabilities. Loss_cls and Loss_con represent the classification loss and the consistency loss. w_i denotes the consistency loss coefficient at training step i. This coefficient varies with the training steps and is defined as the function w(t) = e^(−5(1−t)^2), where t ∈ [0, 1] is the scaled training step [34]. Update(·) is the process of updating the internal weights of the model through backpropagation.
EMA(·) is the process of updating the weights of m_t by applying the exponential moving average (EMA) of the weights of m_s, where α_i is the smoothing coefficient. The following equation defines this process mathematically:

θ_t^(i) = α_i θ_t^(i−1) + (1 − α_i) θ_s^(i).    (4)

In our implementation, we applied Gaussian noise as the data perturbation g(x), using the same distribution for both m_s(·) and m_t(·). The cross-entropy loss and the mean squared error (MSE) are used to compute the classification loss and the consistency loss, respectively. The GCN network is optimized with the Adam optimizer [36], the optimizer chosen in the original implementation of MT [34]. Although both the trained teacher model and the student model can be used for prediction, previous studies have demonstrated that the teacher model is more accurate than the student model [31,34]. Therefore, the teacher model is used as the final classification model.
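A single Mean Teacher training step, as described above, can be sketched in PyTorch roughly as follows. The model names, noise scale, and smoothing coefficient are placeholder assumptions; this is a minimal sketch of the technique, not the study's actual code.

```python
import math
import torch
import torch.nn.functional as F

def consistency_weight(t):
    """Ramp-up coefficient w(t) = exp(-5 (1 - t)^2), for scaled step t in [0, 1]."""
    return math.exp(-5.0 * (1.0 - t) ** 2)

def mt_step(student, teacher, x_lab, y_lab, x_unlab, w_con, optimizer, alpha=0.99):
    """One MT step: classification + consistency losses, then EMA teacher update."""
    noise = lambda x: x + 0.1 * torch.randn_like(x)   # Gaussian perturbation g(.)
    # classification loss on labelled data (student only)
    loss_cls = F.cross_entropy(student(noise(x_lab)), y_lab)
    # consistency loss (MSE) between student and teacher on unlabelled data
    z_student = F.softmax(student(noise(x_unlab)), dim=1)
    with torch.no_grad():
        z_teacher = F.softmax(teacher(noise(x_unlab)), dim=1)
    loss = loss_cls + w_con * F.mse_loss(z_student, z_teacher)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Equation 4: theta_t <- alpha * theta_t + (1 - alpha) * theta_s
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
    return loss.item()
```

A training loop would call `mt_step` once per batch with `w_con = consistency_weight(t)` for the current scaled step t, and use the teacher model for final predictions.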

Datasets
For semi-supervised learning, both labeled (compounds with toxicity information) and unlabeled (compounds without toxicity information) data are required. In this study, the Tox21 dataset from MoleculeNet [37] is used as the labeled data. The Tox21 challenge was a community-wide compound toxicity prediction competition held in 2014.
Since then, the Tox21 dataset has been widely used as the benchmark dataset for evaluating toxicity prediction models. It consists of 12 endpoints, including 7 nuclear receptor signals (NR-AR, NR-AhR, NR-AR-LBD, NR-ER, NR-ER-LBD, NR-Aromatase, NR-PPAR-gamma) and 5 stress response indicators (SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53). In this dataset, each compound is expressed in the Simplified Molecular Input Line Entry Specification (SMILES) format, and binary labels indicate whether the compound is toxic toward a specific toxicological endpoint. In total, the Tox21 dataset includes 7831 compounds and 12 different endpoints. It should be noted that not all compounds have all endpoint labels; a missing endpoint label means that the toxicological effect toward that endpoint is unknown. For unlabeled data, other chemical compound datasets were obtained from the MoleculeNet website, including the ClinTox, SIDER, ToxCast, and HIV datasets [37]. All label information in these datasets was removed. In addition, duplicate molecules between these datasets and the Tox21 dataset were removed. In total, 50527 compounds were used as unlabeled data. Table 1 shows the details of the datasets used in this study. For the labeled dataset, we follow the conventional 0.8:0.1:0.1 splitting ratio to divide the dataset into training, validation, and test sets. The training set is used for the training process, the validation set for hyperparameter tuning, and the test set to measure generalization performance. The most commonly used splitting method is random splitting. However, it is not always suitable for molecular data because random splitting cannot guarantee that the training and test sets contain diverse and representative data samples [37,38]. To overcome this data bias, we adopted a scaffold splitting method.
It splits the dataset according to the two-dimensional structural framework (scaffold) of each molecule [39,40] and then assigns structurally different molecules to different subsets [37]. In this way, both the training set and the test set contain a good proportion of data samples scattered across the molecular space of the dataset, and we can expect the performance of the model measured on this test set to be closer to its actual performance on new data.
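A minimal sketch of scaffold splitting with RDKit's Bemis-Murcko scaffolds might look as follows; the greedy group-assignment loop is an illustrative simplification of the splitters provided by MoleculeNet/DeepChem, not the exact procedure used in the paper.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Group molecules by Bemis-Murcko scaffold, then fill splits group by group."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(idx)
    train, valid, test = [], [], []
    n = len(smiles_list)
    # assign whole scaffold groups, largest first, so no scaffold spans two splits
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```

Because whole scaffold groups go into one split, molecules in the test set are structurally distinct from those seen in training, which is the point of the method.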
As mentioned above, an undirected graph can be described by two matrices, namely the signal (feature) matrix H and the adjacency matrix A. In this study, we used the molecule-graph conversion tool from the Deep Graph Library (DGL) [41] to convert molecules from SMILES to graphs. For each molecule, the connectivity of atoms is stored in the adjacency matrix, and the physicochemical properties of each atom (node features) are encoded into a feature matrix in binary or numerical form. Since the DGL conversion tool provides eight default atom features, as listed in Table 2, each node's feature vector has dimension 1 × 74. Therefore, for a molecule with N atoms, the conversion generates one adjacency matrix of dimension N × N and one feature matrix of dimension N × 74. This graph conversion process is depicted in Fig. 1. After this step, the graph-based molecular data can be learned by the graph convolutional neural network.

Model architecture and hyperparameters selection
The architecture of our GCN model consists of two parts, an encoder and a classifier. The encoder extracts and updates node representations through several graph convolutional layers (Graph Conv). In addition, a dropout layer follows each Graph Conv layer to add noise to the molecular representations [31, 34]. The last layer of the encoder merges all node features into a tensor using max-pooling and weighted-sum operations. This tensor is the learned representation of the input molecule. The classifier computes the final prediction; we used the classifier provided in DGL [41], a two-layer multilayer perceptron (MLP) with a dropout layer and a batch normalization layer.

Fig. 1 The SSL-GCN model for compound toxicity prediction. Molecular compounds are converted into graphs of nodes and connections. The GCN model architecture is composed of two stacked blocks of graph convolutional, dropout, and batch normalization layers. All signals are summarized by the max pooling layer and fed into the multilayer perceptron network to generate the final output. The teacher and student GCN models are updated using the MT algorithm.
To select the best hyperparameters for these models, the Bayesian optimization algorithm [42] is used to search the hyperparameter space, with a maximum of 32 trials. In each trial, the algorithm selects a set of candidate hyperparameters and initializes the model. Model training and validation are then carried out iteratively until the early stopping condition (30 epochs without improvement) is met. After all trials are completed, the set of candidate hyperparameters with the best validation metric (ROC-AUC) is selected as the default for the subsequent experiments.
Since the toxicity dataset is highly imbalanced, with an average toxic/non-toxic data ratio of about 1:17, the area under the Receiver Operating Characteristic curve (ROC-AUC) is used as the main metric in the hyperparameter selection process (in practice, to decide on early stopping) and in the final model evaluation. The hyperparameters with the best validation performance are selected to construct the optimal toxicity prediction models. Finally, the generalization performance of these models is estimated on the test set.

Implementation detail
In this study, all implementations and experiments are carried out in an environment with the following libraries/software: Python 3.7.9, Anaconda 4.7.10, Scikit-learn 0.23.2, and RDKit v2018.09.3.0. We used PyTorch 1.7.0 with CUDA 10.0 as the basic machine learning framework. The GCN model is implemented using DGL 0.5.6 and its supplementary package DGL-LifeSci 0.2.6 [41] (available on GitHub: DGL [43], DGL-LifeSci [44]). The Bayesian optimization process for hyperparameter selection is implemented using Hyperopt 0.2.5 [42] (available on GitHub [45]). We also used DeepChem 2.5.0 [46] to generate the benchmark scores of other state-of-the-art models on the Tox21 dataset (available on GitHub [47]). The original source code for the Mean Teacher (MT) algorithm [34] can be accessed via its GitHub repository [48].

Results
All experiments were repeated five times to observe the variability of the results and obtain an accurate measure of model performance through the average ROC-AUC score. The complete record of all experiments can be found in the Additional file 1.

Performance of conventional machine learning (ML) methods
To establish the baseline performance, several commonly used ML algorithms, namely K-Nearest Neighbor (KNN), Neural Network (NN), Random Forest (RF), Support Vector Machine (SVM), and eXtreme Gradient Boosting (XGBoost), were tested. The compounds were encoded using Extended Connectivity Fingerprints (ECFP4), a circular topological fingerprint designed for molecular characterization, similarity searching, and structure-activity modeling [49]. The encoding was generated using the RDKit library. In total, 60 different ML models (12 prediction tasks × 5 types of ML algorithms) were trained and optimized using the training and validation sets. Subsequently, the optimal models were evaluated on the test set. The test performance of these conventional models on the 12 toxicity prediction tasks is presented in Table 3. Each experiment was repeated 5 times; the average ROC-AUC score and the standard deviation (std) are reported. Across all prediction tasks, the ROC-AUC scores range between 0.5127 and 0.8287. In certain cases (KNN, SVM, and XGBoost), the same optimal models were obtained in all replicate experiments, so the ROC-AUC scores are identical (std = 0). Overall, RF, XGBoost, and SVM generated the best models for 5, 4, and 3 of the prediction tasks, respectively. The average ROC-AUC score of the best performing conventional ML models across all tasks is 0.71.
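The ECFP4-plus-classifier baseline can be sketched as follows; ECFP4 corresponds to a Morgan fingerprint of radius 2 in RDKit, and the molecules, labels, and 2048-bit length here are toy illustrations rather than the study's actual settings.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def ecfp4(smiles, n_bits=2048):
    """ECFP4 = Morgan fingerprint with radius 2, folded to n_bits bits."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))

# toy molecules and toxic/non-toxic labels, for illustration only
X = np.array([ecfp4(s) for s in ["CCO", "c1ccccc1", "CCN", "c1ccccc1O"]])
y = np.array([0, 1, 0, 1])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

The same fingerprint matrix would feed any of the five baseline algorithms; only the estimator changes.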

Performance of supervised learning GCN (SL-GCN)
Having established the baseline performance of the traditional ML models in toxicity prediction, we went on to test the GCN models for the 12 prediction tasks. As with the other ML models above, the GCN models were trained using supervised learning and optimized by the Bayesian optimization algorithm, hence the name SL-GCN. In Fig. 2, the ROC curves of the SL-GCN models on the test set are plotted against those of the other ML models, and the 5-repeat average ROC-AUC scores are tabulated in Table 4. The results show that, while the SL-GCN models perform similarly to the best conventional ML models in the majority of the twelve toxicity prediction tasks, they improve on four of the tasks (NR-ER, SR-ARE, SR-HSE, and SR-MMP) and perform worse on three (NR-AR-LBD, NR-PPAR-gamma, and SR-p53).

Performance of semi-supervised learning GCN (SSL-GCN)
The MT technique employed in this study necessitates two models with the same architecture, one for m_t and one for m_s. Therefore, we used the hyperparameters obtained from the SL-GCN models as the initial parameters to train SSL-GCN. As shown in the previous study [34], the amount of unlabeled data in the training process can affect the final model performance. To investigate this impact on the performance of the SSL-GCN models, we ran numerous trials with varying amounts of unlabeled data. We define the unlabeled-to-labeled data ratio as R_u ∈ {0.5, 1.0, 2.0, 3.0, 4.0}. Thus, when R_u = 0.5, we randomly select a portion of unlabeled data from the entire unlabeled dataset to participate in the semi-supervised learning process, with the size of this portion being half that of the labeled data. Due to the significant increase in training time, larger R_u values (> 4.0) were not considered. Table 4 shows the test results of the optimized SSL-GCN models for the 12 toxicity prediction tasks, and Fig. 3 compares their ROC curves. As shown in Table 4, SSL improves the predictive power of the GCN models when a sufficient amount of unlabeled data is included in the training. SSL-GCN with R_u = 0.5 improves the ROC-AUC score in 10 of the 12 prediction tasks, while the ROC-AUC scores of only two tasks are somewhat reduced. When the SSL-GCN models are trained with additional unlabeled data (R_u = 1.0 to 4.0), they always outperform their SL-GCN counterparts in terms of ROC-AUC score. Nonetheless, the best R_u for each prediction task differs: SSL-GCN produces 4 optimal models when R_u = 2.0, 3 when R_u = 4.0, 2 when R_u = 0.5, and 1 when R_u = 1.0. As a result, the best R_u varies with the prediction task at hand. The rates of performance improvement in terms of ROC-AUC for the different tasks range from 1% to 13%. Finally, Fig. 4 compares the best CM, SL-GCN, and SSL-GCN models.
As can be clearly seen, SSL-GCN can produce models with greater predictive potential than CM and SL-GCN in all toxicity prediction tasks.
In summary, the comparative study of the SSL-GCN models with varying R_u values suggests that, when training with unlabeled data, the ratio of unlabeled to labeled data should be treated as a hyperparameter in order to obtain the optimal model.
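Drawing an unlabeled subset for a given R_u can be sketched as simple random sampling without replacement; the pool of indices and the fixed seed below are illustrative assumptions.

```python
import random

def sample_unlabeled(unlabeled_pool, n_labeled, r_u, seed=0):
    """Draw r_u * n_labeled items (without replacement) from the unlabeled pool."""
    n_pick = min(int(r_u * n_labeled), len(unlabeled_pool))
    return random.Random(seed).sample(unlabeled_pool, n_pick)

pool = list(range(50527))                                # unlabeled compound indices
subset = sample_unlabeled(pool, n_labeled=7831, r_u=0.5)
print(len(subset))                                       # 3915
```

Treating `r_u` as a tunable hyperparameter then amounts to repeating training with different values and comparing validation ROC-AUC.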

Case study: how does the similarity between unlabeled and labeled data affect the semi-supervised learning process?
In the previous section, we showed that semi-supervised learning can improve the performance of our GCN models compared to models trained with a purely supervised algorithm. However, we only studied the effect of the unlabeled data ratio R_u on the SSL algorithm. Here, we further investigate how the similarity between unlabeled and labeled data affects the performance of the SSL-GCN model. To define the similarity between unlabeled data and the labeled dataset, we used the k-nearest neighbors (KNN) method proposed by Tropsha et al. [50,51]. This method has been widely used to measure the similarity between known and unknown chemical compounds using a similarity cutoff C_s, defined by Equation 5:

C_s = ⟨d⟩ + Z·σ,    (5)
where ⟨d⟩ denotes the average of the similarity scores over all instances in the labeled dataset, and σ denotes the standard deviation of these similarity scores. Z is a user-defined parameter that controls the similarity cutoff C_s and thereby sets the level of similarity. Next, we used the average similarity score SS_i between each unlabeled instance i and its k nearest neighbors in the labeled dataset to evaluate how similar each unlabeled instance is to the labeled dataset. In this study, k = 5, and we used RDKit to calculate the commonly used Tanimoto (Jaccard) distance as the similarity score. To properly define the level of similarity, we first computed the distribution of SS_i over 12 similarity domains defined by different cutoff values C_s, with Z ranging from −2 to 3.5 in steps of 0.5. The details of the distribution can be found in the Additional file 1: Figure S4.
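The similarity score SS_i and the cutoff C_s can be sketched with RDKit as follows; the fingerprint choice (Morgan/ECFP-style, 2048 bits) is an illustrative assumption, while the Tanimoto distance and k = 5 follow the text above.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, 2048)

def knn_similarity_score(query_smiles, labeled_smiles, k=5):
    """SS_i: mean Tanimoto distance from a query to its k nearest labeled neighbours."""
    sims = DataStructs.BulkTanimotoSimilarity(fp(query_smiles),
                                              [fp(s) for s in labeled_smiles])
    dists = sorted(1.0 - s for s in sims)     # Tanimoto (Jaccard) distance
    return float(np.mean(dists[:k]))          # average over the k nearest

def similarity_cutoff(scores, Z):
    """Equation 5: C_s = <d> + Z * sigma over the labeled-set similarity scores."""
    return float(np.mean(scores) + Z * np.std(scores))
```

Comparing each SS_i against cutoffs such as C_s(Z = 0) and C_s(Z = 1) then assigns every unlabeled compound to a similarity domain.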
To shorten the experiment time and to ensure that there is enough unlabeled data at each similarity level to support the semi-supervised learning process, we reorganized the above 12 similarity domains into 3 domains based on the distribution, namely close, normal, and far. For an unlabeled instance i with similarity score SS_i, SS_i ≤ C_s(Z = 0) means that i belongs to the close domain; C_s(Z = 0) < SS_i ≤ C_s(Z = 1) means that it belongs to the normal domain; and C_s(Z = 1) < SS_i means that i belongs to the far domain. Based on these three similarity domains, we divided the entire unlabeled dataset into three subsets with the corresponding similarity levels. Table 5 presents the details of these unlabeled subsets.
Here, we used these newly generated subsets to train several SSL-GCN models for comparison. We adopted the same experimental procedure (repeated 5 times) and optimal hyperparameter settings as in the previous section to facilitate performance comparison. The average ROC-AUC scores of these SSL-GCN models on the 12 test sets can be found in Table 6. The bold number denotes the best result among all models (all, close, normal, far) in the corresponding task, the underlined number represents only the best result among models using different similarity levels of unlabeled subsets (close, normal, far).
As shown in Table 6, the optimal model for 7 tasks is still the model trained on the entire unlabeled dataset, SSL-GCN(all). For the remaining 5 tasks, the optimal models for 3 tasks (NR-ER, SR-ARE, SR-HSE) were trained with the close subset, and only for 2 tasks (NR-AR-LBD, NR-Aromatase) was the optimal model trained with the far subset. However, the performance improvement of the SSL-GCN model on these 5 tasks is slight, ranging from 0.0011 to 0.0080, suggesting that the use of the close and far subsets in the SSL process had a limited impact on these models. On the other hand, the use of these similarity-based subsets leads to performance degradation in 7 tasks, with the largest degradation occurring in the NR-AR task, where the average AUC value decreased by 0.0616.

Table 4 The average test performance of SSL-GCN models with various unlabeled data ratios (R_u in brackets) on the 12 prediction tasks over 5 repeated experiments. For comparison, the results of the SL-GCN models are also shown. The bold number denotes the best result among all SSL-GCN models with various unlabeled data ratios in the corresponding task.

From the perspective of similarity between labeled and unlabeled data, models trained with the close subset tend to perform better than models trained with the normal and far subsets. After excluding the performance of the SSL-GCN(all) models, 5 SSL-GCN(close) models, 3 SSL-GCN(normal) models, and 3 SSL-GCN(far) models achieved optimal performance on the corresponding tasks. In addition, the SSL-GCN(close) model outperformed the SSL-GCN(all) model on 3 tasks (NR-ER, SR-ARE, SR-HSE), while this number is 0 for the SSL-GCN(normal) model and 2 for the SSL-GCN(far) model. Therefore, the performance of SSL-GCN(normal) is the worst among these three types of models; the overall scores of SSL-GCN(close), SSL-GCN(normal), and SSL-GCN(far) on the 12 tasks are 0.7417, 0.7388, and 0.7450, respectively, which also reflects this fact.
Several factors may explain this result. First, unlabeled data in the close subset are similar to the labeled data, so the semi-supervised model can make more accurate predictions on them in the early training phase, which in turn allows the model to generate and update the loss more accurately early on. This enriches the information learned by the model and helps the SSL process produce a better model. Second, unlabeled data that are dissimilar to the labeled data (the far subset) provide additional information to the SSL-GCN model during semi-supervised learning, which may improve the generalization ability of the model and thus its performance on unseen data. In summary, we believe that training the SSL-GCN model on the entire unlabeled dataset together with the labeled data remains the best way to obtain the optimal model, since the whole unlabeled dataset mixes unlabeled data of varying similarity to the labeled data and thus provides both benefits.

Performance comparison of SSL-GCN to the built-in DeepChem methods
The DeepChem package [46] provides some built-in ML methods that can be readily used to generate predictive models for different computational chemistry challenges.
Making use of the DeepChem-integrated MoleculeNet datasets [37], we performed experiments to evaluate the performance of the DeepChem models on the Tox21 dataset. The dataset was split by the scaffold splitting method, and all models were initialized with the hyperparameters provided by the DeepChem package. Following the previous experimental procedure, we conducted the training, validation, and test processes, and repeated them five times for each model. Here, we benchmark our method by comparing the test-set performance of the SL-GCN and SSL-GCN models to these DeepChem models in terms of the average ROC-AUC score. As shown in Table 7, among the 8 DeepChem models, the best one is kernelsvm, with an overall score of 0.7, whereas both of our models, SL-GCN and SSL-GCN, beat the best DeepChem model, with overall scores of 0.7156 (2% improvement) and 0.7571 (8% improvement), respectively. It should be noted that although the graphconv model uses a graph convolution technique similar to ours, its different model architecture and molecular features render it less effective.
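For reference, the evaluation metric used throughout these comparisons can be sketched as follows: ROC-AUC computed as a rank statistic (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half), and the overall score as the unweighted mean over tasks. This is a minimal quadratic-time illustration; in practice a library routine such as scikit-learn's `roc_auc_score` would be used.

```python
def roc_auc(labels, scores):
    """ROC-AUC as a rank statistic: the fraction of (positive, negative) pairs
    in which the positive receives the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def overall_score(per_task_aucs):
    """Overall score = unweighted mean ROC-AUC over the prediction tasks
    (12 tasks in the Tox21 setting)."""
    return sum(per_task_aucs) / len(per_task_aucs)
```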

Discussion and conclusions
In this work, we attempt to improve compound toxicity prediction using graph convolutional neural network (GCN) and semi-supervised learning (SSL). We choose Mean Teacher [34] as the SSL algorithm to improve the prediction performance of GCN on 12 toxicity prediction tasks from the Tox21 dataset. Meanwhile, we hope to answer two questions about predictive modeling in this research. First, is GCN superior to other more commonly used ML methods? Second, is unlabeled data advantageous for model training?
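As a rough sketch of the Mean Teacher idea: the teacher's weights are an exponential moving average (EMA) of the student's, and a consistency term between student and teacher outputs lets unlabeled examples contribute to the loss. This is a deliberately simplified illustration, with plain Python lists in place of network weights and squared-error surrogates in place of the actual classification losses; it is not our training code.

```python
def ema_update(teacher_w, student_w, alpha=0.99):
    """Mean Teacher weight update: the teacher is an exponential moving
    average of the student (alpha is the EMA decay)."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher_w, student_w)]

def mean_teacher_loss(student_out, teacher_out, labels):
    """Total loss = supervised term on labeled examples (labels[i] is not None)
    + consistency term (MSE) between student and teacher on ALL examples,
    labeled or not -- this is how unlabeled data influence training."""
    sup, cons, n_lab = 0.0, 0.0, 0
    for s, t, y in zip(student_out, teacher_out, labels):
        if y is not None:            # labeled example: penalize error to label
            sup += (s - y) ** 2
            n_lab += 1
        cons += (s - t) ** 2         # consistency is computed on every example
    sup = sup / max(n_lab, 1)
    cons = cons / len(student_out)
    return sup + cons
```

Only the student is trained by gradient descent; after each step the teacher is refreshed via `ema_update`, which smooths the student's trajectory and provides more stable targets.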
To this end, we have designed and implemented a GCN model for chemical compounds based on simple physicochemical properties of atoms. Unlike other commonly used chemical fingerprints, which represent an entire compound as a one-dimensional feature vector for learning, GCN encodes it into a network of features that mirrors the bond connectivity of the molecule. Given that the structural diversity of a dataset is one of the factors affecting the prediction performance and generalizability of a model, we used the scaffold splitting approach to divide the dataset into training, validation, and test sets for each prediction task. The Bayesian optimization technique was used to speed up hyperparameter tuning. With the GCN model in place, we trained and optimized the supervised learning SL-GCN models and the semi-supervised learning SSL-GCN models on 12 toxicity prediction tasks.

To answer the first question, is GCN superior to other commonly used ML methods, we trained and optimized toxicity prediction models using 5 conventional ML methods in the supervised learning setting. Our comparative study revealed that, of the 12 prediction tasks, 5 are better predicted by SL-GCN, 2 are predicted similarly, and 5 are predicted worse by SL-GCN; moreover, the "better" models are not improved by a large margin. Our experimental results therefore suggest that in the same supervised learning setting, GCN is not superior to conventional ML methods. This answer is somewhat disappointing, as a GCN model is much more complex and expensive to train than the conventional models.

Table 7 Comparison of our GCN models (SL-GCN and SSL-GCN) and the models constructed using the DeepChem built-in ML methods. The overall score is the average ROC-AUC score in predicting the 12 prediction tasks in the test set. The experiments were repeated 5 times. The bold number denotes the best overall score among all models.
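The scaffold splitting approach mentioned above can be sketched as follows: molecules are grouped by scaffold so that compounds sharing a scaffold never straddle two sets. This is a minimal illustration assuming the scaffold strings have already been computed (in practice, Bemis-Murcko scaffolds obtained via RDKit); the split fractions are illustrative.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Group molecule indices by scaffold key, then fill train/valid/test with
    whole groups (largest first) so molecules with the same scaffold always
    land in the same set. `scaffolds` maps molecule index -> scaffold string."""
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```

Because whole scaffold groups are assigned greedily, the test set tends to contain scaffolds unseen during training, which makes the evaluation a better probe of generalizability than a random split.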
We believe that the bottleneck to improvement is the limited availability of data. Instead of adding more annotated data, which is not always possible or easy, we turned our attention to unlabeled data. Here, we applied the SSL algorithm Mean Teacher (MT) to enhance the performance of the GCN model. Encouragingly, SSL-GCN models consistently outperform their SL-GCN counterparts, with ROC-AUC scores improving by 1 to 13%. Nonetheless, the amount of unlabeled data required to boost performance has to be determined on a case-by-case basis. We found that for predicting the various toxicological endpoints, the appropriate unlabeled-to-labeled data ratios range from 1:1 to 4:1. Larger ratios may yield further improvement, but were not investigated in this study due to limited computational resources. Finally, we compared our models with the models from the DeepChem library. The SL-GCN models are 2 to 12% better than the DeepChem models in terms of ROC-AUC, while the SSL-GCN models are 8 to 18% better. Based on these results, our answer to the second question, "Is unlabeled data advantageous for model training?", is yes, though the amount of unlabeled data required to optimize the model is specific to each study.
In many bioinformatics tasks, the size of an annotated dataset is often limited, which complicates the implementation and limits the performance of many ML algorithms. The results of this study suggest that SSL could be applied to other property prediction tasks such as absorption/distribution/metabolism/excretion (ADME), solubility, binding activity, etc., to improve the predictive ability of models by using unannotated data.
This study does, however, have some limitations that we must point out.
First, the toxicity of a compound is determined by several factors, such as chirality and the nature of its functional groups. Capturing this information requires a more delicate encoding approach to avoid information loss during graph conversion. Although various well-designed molecular fingerprints and descriptors exist for conventional ML algorithms, none is specifically tailored to GCN. We therefore used the molecule-graph conversion tool from the Deep Graph Library (DGL) to convert molecules from SMILES to graphs. However, the graphs produced by this tool include only a few basic molecular physicochemical properties. Due to limited computational power, the running time of the graph convolution layers with the current feature matrix was already high, and adding more features would further increase the cost of model development. In our future study, it will be particularly important to increase the diversity of molecular information contained in the feature matrix while limiting the size of the matrix.
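As a toy illustration of what such a molecule-to-graph conversion produces, the sketch below builds a per-atom feature matrix and a symmetric adjacency list from a hand-written atom/bond description. This is not DGL's API; the input format and the three chosen features are hypothetical, chosen only to mirror the kind of basic physicochemical properties the real tool encodes.

```python
def molecule_to_graph(atoms, bonds):
    """Toy molecule-to-graph conversion. `atoms` is a list of per-atom feature
    dicts, `bonds` a list of (i, j) atom-index pairs. Returns a node feature
    matrix and an adjacency list, mimicking in miniature what a SMILES-to-graph
    utility produces. The three feature keys here are illustrative only."""
    feature_keys = ("atomic_num", "degree", "formal_charge")
    features = [[float(a.get(k, 0)) for k in feature_keys] for a in atoms]
    adj = [[] for _ in atoms]
    for i, j in bonds:
        adj[i].append(j)   # edges are stored in both directions because
        adj[j].append(i)   # chemical bonds are undirected
    return features, adj
```

Each graph convolution layer then aggregates each atom's feature vector with those of its neighbors in `adj`, which is why a wider feature matrix directly increases the per-layer cost noted above.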
Second, the interpretability of our graph convolution model has not been explored. Most researchers consider neural-network-based ML methods a black box: the only component that can be inspected during training or prediction is the input data, and the predictions produced by these models are unexplainable. For biomedical ML applications in particular, this limitation is amplified. Without knowing which part of a compound led to the prediction result, researchers cannot modify the original compounds or select structurally better compounds for further study. Therefore, in the next step of our study, we will focus on the interpretability of the graph convolutional neural network.
Third, the activity cliff problem has not been solved in this study. Activity cliffs refer to chemical compounds that have highly similar structures but different or opposite chemical properties. Although the semi-supervised learning algorithm can use unlabeled data to improve the performance of our GCN model, nothing comes for free: the basic assumption of the SSL algorithm we implemented is the smoothness assumption, i.e., it assumes that the label function is smooth in high-density areas, so data points located in the same area of the feature space should share the same label. This fundamental assumption makes our model unreliable in predicting molecules distributed at the edges of high-density areas (near the decision boundary), where most molecules with activity cliffs are located. Moreover, there is currently no good way for QSAR models to solve the activity cliff problem, since the primary assumption of the QSAR model is that similar molecular structures should lead to similar properties [57,58]. We have noted that some studies [58][59][60][61][62] attempt to address this problem, and we will follow them in our future work.
Finally, our study has exploited the SSL algorithm that is based on the self-ensembling framework. There are other recently proposed SSL algorithms, such as Mixup [63], Interpolation Consistency Training [64],