MAIP: A Prediction Platform for Predicting Blood-Stage Malaria Inhibitors

Malaria is a disease affecting hundreds of millions of people across the world, mainly in developing countries and especially in Sub-Saharan Africa. It is the cause of hundreds of thousands of deaths each year and there is an ever-present need to identify and develop effective new therapies to tackle the disease and overcome increasing drug resistance. Here, we extend a previous study in which a number of partners collaborated to develop a consensus in silico model that can be used to identify novel molecules that may have antimalarial properties. The performance of machine learning methods generally improves with the number of data points available for training. One practical challenge in building large training sets is that the data is often proprietary and cannot be straightforwardly integrated. Here, this was addressed by sharing QSAR models, each built on a private data set. We describe the development of an open-source software platform for creating such models, a comprehensive evaluation of methods to create a single consensus model and a web platform called MAIP available at https://www.ebi.ac.uk/chembl/maip/. MAIP is freely available for the wider community to make large-scale predictions of potential malaria inhibiting compounds. This project also highlights some of the practical challenges in reproducing published computational methods and the opportunities that open source software can offer to the community.


Introduction
Despite substantial scienti c progress, new, affordable and safe malaria medicines are urgently required to overcome increasing resistance against artemisinin-based combination therapies (ACTs), treat vulnerable populations, interrupt the parasite life cycle by blocking transmission to the vectors, prevent infection and target malaria species that transiently remain dormant in the liver. Malaria remains one of the most serious infectious diseases; it threatens nearly half of the world's population and led to over 400,000 deaths in 2019, predominantly among children in resource-limited areas in Africa, Asia and Central and South America [1].
Clinicians, researchers and, most importantly, patients remain acutely aware of the limitations of the current antimalarial medicines. These include drug resistance, lack of a single dose cure and suboptimal therapies for children and pregnant women. Drug resistance remains a constant threat and patients with resistance to the front-line ACTs are now being routinely identi ed in Southeast Asia [2]. Secondly, all existing malaria treatments require patient adherence to the course of treatment. A single exposure medicine would allow health workers to directly control and observe drug administration ensuring completion of the treatment course and help to avoid parasite exposure to sub-therapeutic doses [3].
Finally, to protect the entire population, new medicines suitable for use in mothers and babies are urgently needed. Such medicines need age and weight appropriate dosing, child friendly formulation and comprehensive reproductive toxicology evaluation to ensure they are safe to use during pregnancy [4].
The past two decades have seen not only the development of highly e cient high-throughput screening platforms but also signi cant collaborative efforts between the commercial, not-for-pro t and academic communities to join forces in tackling the malaria challenge, often by screening in-house compound libraries and making some of the results available in the literature [5][6][7]. Whilst precise assay details vary, many of these screens involve cell-based assays that can test for compounds that inhibit the asexual blood cell stage of Plasmodium spp. infection, which is the cause of symptomatic malaria. Red blood cells are the site of exponential parasite replication to potentially > 10 12 parasites per patient, in the case of Plasmodium falciparum, and all disease symptoms are caused by the response to the repeated lysis and invasion of erythrocytes by the asexual blood stage parasites. Drug discovery projects starting from the hits arising from these screening experiments have led to the identi cation of candidates that have been progressed to clinical development. In addition, chemogenomic methods have enabled the discovery of new drug targets and mechanism of action from the same pool of compounds [8,9].
Machine learning is a computational tool widely-used in drug discovery [10,11], with its origins in Quantitative Structure-Activity Relationship (QSAR) modelling. Virtual screening is one of the key practical applications of such computational models, enabling large numbers of compounds (including as-yetunsynthesised molecules) to be assessed for their therapeutic potential prior to experiment [12]. Such machine learning models have the greatest general applicability when trained on large and chemically diverse chemical collections. A practical challenge is that the datasets available for training such models are limited by data availability. Whilst large-scale, open access databases such as ChEMBL [13] have made a signi cant impact, nevertheless much bioactivity data is proprietary, making integration a signi cant challenge due to con dentiality concerns.
Previously, a consortium of Pharma and not-for-pro t organisations described a consensus machine learning approach to predict blood stage malaria inhibition [14]. The approach taken involved, rst, each partner training a model following the same protocol based on their own in-house data. The models were then shared within the consortium without revealing the con dential information contained in the underlying chemical entities. Proprietary datasets were further obscured by only sharing descriptor weights for the subset of descriptors important in the individual models. A consensus approach, combining several models, showed improved predictivity performance compared to the individual models. The method was subsequently evaluated by selecting and then screening a set of compounds, resulting in a 3-fold enrichment. However, for a number of practical reasons, including the dependence on commercial software, the project was terminated at this stage, and critically, a prediction platform freely available for wider use was not delivered.
Here, we report a successor project, using open source tools and additional datasets, that has resulted in a freely available and accessible combined model for the community to use in developing new malaria treatments. One of our initial goals was to replicate as closely as possible the original study, whilst using different, open source software. In addition to the creation of the open access tool based on a larger dataset, our current study therefore also explores the important, and rarely considered, question of method reimplementation and reproducibility in machine learning for drug discovery [15,16]. Even with the greater availability of public datasets, there can be dozens if not hundreds of settings that may be required to reimplement the exact same model. Our study may therefore also guide future efforts with regard to what is feasible in this context.
Herein, we report a panel of models trained on blood stage malaria inhibition data from several independent partners using open source tools. We explore different ways to combine these models in order to achieve enhanced prediction performance. We compare our results with those obtained previously using a variety of measures. We describe our choice of consensus approach that has been implemented in an online prediction platform now freely available for community use.

Training datasets
Eleven datasets from ve different partners were used in this study to train models. The Evotec, Johns Hopkins, MRCT, MMV -St. Jude, AZ, GSK, and St. Jude Vendor Library datasets were essentially the same as described previously [14]. The Medicines for Malaria Venture (MMV) partner provided three additional datasets to be used for training models (MMV5, MMV6, MMV7; see Table 1) and the Novartis dataset was signi cantly different from the original work; these four new or modi ed datasets are therefore described in detail, as follows: Novartis. The Novartis data set contains 3,355,412 measurements covering 2,726,063 unique canonical smiles assayed in a high-throughput screen using erythrocytes infected with the 3D7 Plasmodium falciparum (PF) strain. The activity cut-off was INH ≥ 50% at either 1.25 or 12.5 µM compound concentration. Of these, 107,505 always measured as active, 2,593,470 always measured as inactive, and 25,088 were measured ambiguously as both active and inactive. Each measurement was treated as a separate observation in the modelling. The data are proprietary to Novartis and were used to build Novartis models.
MMV5. MMV purchased 446,465 compounds that were screened using 3D7 PF strain with Sybr Green dye at 3 µM concentration with an endpoint at 72 h. For the purposes of building a model, compounds were classed as active if they demonstrated an EC50 value < 10uM or if they showed a mean percentage inhibition at 3uM > 50%. On that basis, 4,980 compounds were classi ed as active and 441,485 as inactive. The data are proprietary to MMV and were used to construct MMV5 models.
MMV6. MMV purchased 249,444 compounds that were tested using Dd2 PF strain at a concentration of 12.5 mM with a SybrGreen readout. Approximately 8,000 screening actives were selected on the criteria that % inhibition > = 80%. Then the compounds were clustered and the list reduced to 2,123 compounds that were tested in dose-response on Dd2-SybrGreen with 10 mM top concentration. For the purposes of building a model, compounds tested in the dose-response assay were classed as active if they achieved an EC50 < 10 mM (533 compounds), inactive if not. Of the remainder, compounds were classed as active if their mean percentage inhibition value at 12.5 mM was > = 80% (inactive if not). This gave a total of 6,328 active and 243,116 inactive compounds for model building. The data are proprietary to MMV and were used to construct MMV6 models.
MMV7. MMV purchased 12,732 compounds that were tested at 12.5 µM against Dd2 PF strain using a Sybr Green endpoint at 72 h. 1,073 screening actives were selected based on the criteria that % inhibition > = 70% in each of 2 replicates. From these, 590 compounds gave a dose-response value < 10 µM. For the purposes of model construction, compounds were considered active if they had an IC50 < 10 µM or if they showed a mean percentage inhibition value of > 80%. This resulted in 848 active and 11,884 inactive compounds. The data are proprietary to MMV and were used to construct MMV7 models.

Validation datasets
Three datasets were used for model validation purposes, henceforth referred to as the MMV test set, the PubChem dataset and the St. Jude Screening Set (see Table 1). The latter was already used for model validation in the earlier work but with a single difference due to a compound class change [14].
MMV test set. The MMV test set was the outcome of the previous study [14], where the generated consensus model obtained by combining all the partners' models was used to predict the activity of a collection of 12 million compound extracted from the eMolecules database of commercially available compounds. Compounds with a model consensus score in the − 20 to 0 interval were excluded from further consideration. Substructure search ltering was performed to remove well known anti-malaria chemotypes. Only lead-like compounds were selected by using molecular properties, removing potential Pan-assay interference compounds (PAINS) and manual curation. After a clustering step based on the compound molecular ngerprints, 5,869 compounds were tested using PF Dd2 train at 12.5 µM in a blood stage malaria assay by monitoring DNA dye at an MMV Network Test Center. Those with an activity > 50% were considered as 'active' and leading to a total of 1,198 active and 4,671 inactive compounds. A resulting 3-fold enrichment for the compounds predicted as active relative to the ones predicted as active by the model was observed.
PubChem. This dataset was obtained combining the results from two assays in PubChem [17].
AID1159554 is a primary screen measuring PF Dd2 strain growth inhibition using SYBR green I in human erythrocytes. 94,441 unique substances were screened at 10 µM and their activity or inactivity status was assigned by PubChem. AID1159566 is a con rmatory screen where 468 substances identi ed from the primary screen were screened a second time at different concentrations for an EC50 to be determined by regression analysis. Activity and inactivity were also determined directly by PubChem. We combined the results of these assays and considered as active the compounds that were agged as being 'active' in both assays. From this selection, compounds identi ed as active at a single concentration dose but not validated in the dose-response assay were discarded. Finally, for this validation set 384 compounds were considered as active and the remaining 91,412 were considered inactive. The previous study [14] used proprietary methods available via Pipeline Pilot for molecule standardisation, calculation of molecular descriptors and building computational models [18]. For this work we focussed entirely on open source tools; this would also make it possible for other potential partners to contribute models in the future without limitation.
Molecule preparation. Each molecule was standardised using an open source python-rdkit standardiser [19]. Molecules that failed the standardisation were removed. Descriptors were calculated using RDKit [20]. Six different combinations of descriptors were explored for each dataset (see Table 2). The non-binary descriptors (logP, MW, RTB, HBA, HBD and HAC) were discretised into 10 bins. Unhashed ECFP and FCFP ngerprints were used, so the numbers of bit descriptors varied according to size and chemical diversity of each individual dataset. The resulting data were stored in sparse matrix objects for more e cient processing.
Model building. A Naïve Bayes approach was used for model building, to ensure consistency with the previous study [14]. This is in general a robust approach for creating models using binary molecular descriptors. We implemented the algorithm described in [22,23] using the BaseDiscreteNB base class in scikit-learn, to ensure that the model was compatible with all the utilities in the scikit-learn library [24].
In this model, the posterior probability for each feature is de ned by (1): Where is the number of active molecules in the set, the total number of molecules in the set, the total number of molecules that contain feature and number of active molecules that contain feature .
Variable selection was used to remove uninformative descriptors, de ned as those with ) < 0.05.
Finally, the score for a given compound as calculated by the model is de ned by (2): ROC AUC score. The receiver operator characteristic (ROC) curve is calculated by plotting the fraction of false positives on the x-axis and the fraction of true positives on the y-axis. The area under the curve (AUC) provides a measure of how well a model is able to classify binary data. A value 0.5 corresponds to selecting compounds at random while a perfect model will return a ROC AUC value of 1. This metric is often used to measure the performance of classi cation models as it is insensitive to class imbalance.
Enrichment factor. The enrichment factor (EF) is the hit rate (the proportion of active compounds) within a de ned sorted fraction divided by the total hit rate [25]. It is de ned by (3): Where is the user-de ned fraction, the number of actives in the fraction, the total number of compounds in the fraction, the total number of actives and the total number of compounds. The EF is frequently used as a pragmatic measure of performance, re ecting the common use of in silico models to identify a subset of compounds for experimental evaluation. In this work, we calculated this metric at 1% and 10%, respectively.
Fingerprint coverage. The ngerprint coverage compares the set of bit descriptors retained by the Naïve Bayes model with the set of bits derived from the compound ngerprint. Hence, it is calculated as the proportion of bits present in the molecule that are shared with the model [14].

Software development, model building and sharing
A Docker le repository was used to provide a secure and transparent way of distributing the software, containing a json con guration le with the different sets of descriptors for training and the MMV test set for calculating performance metrics. A Python script calculates the molecular descriptors, removes uninformative variables, trains the models and generates performance reports. Two different performance reports are generated; one from a 5-fold cross validation with random splits and the second by training the model on the whole training set and validating it against the MMV test set.
Each partner ran the docker image after having formatted their datasets to t the platform's input format and returned to EMBL-EBI all trained models and model performance reports through appropriate secure channels. Models were created in a human readable format, so each partner was also able to verify that no structural or other con dential information was included.
Data visualisation using t-distributed Stochastic Neighbor Embedding t-distributed Stochastic Neighbor Embedding (t-SNE) is an algorithm performing a nonlinear dimensionality reduction and designed for data visualisation [26]. The chemical space for the three validation sets was derived from the t-SNE calculation using the same ngerprint descriptors as for model 2 ( Table 2). The resulting sparse matrix corresponding to the chemical features present in the P N validation set compounds was used as input for scikit-learn's implementation of the t-SNE algorithm using a perplexity value of 500 [24].

Validation of model-building protocols
Our rst goal was to assess the ability of our new software methods and code to reproduce the previous study [14]. We determined that the major difference would be due to implementations of the descriptor calculations as the distributions of calculated physico-chemical properties are reasonably well (but not perfectly) correlated (Additional le 1). To explore the impact of differences in ngerprint implementations on model-building and performance we used the MMV -St. Jude dataset with only FCFP6 ngerprints and RDKit Morgan ngerprints with radius of 3 and features from Pipeline Pilot and RDKit, respectively. After removing the uninformative bits, we built two Naïve Bayes models and predicted the MMV test set. A pairwise comparison using the Pearson correlation coe cient (R) for the two sets of scores gave a value of 0.88, indicating a good but not perfect correlation (Additional le 2A). In a second comparison, we used ECFP6 ngerprints and RDKit Morgan ngerprints with radius of 3 but without features. This gave an R coe cient of 0.98, indicating almost perfect identity between the two model implementations (Additional le 2B).
These results are not unexpected and are due to differences in the way that which ECFP and FCFP ngerprints are calculated. Both are circular ngerprints, describing the environment in the vicinity of each atom in the molecule. ECFP captures the atom-based substructural information while FCFP represents function-based substructural information [27]. Differences in feature de nitions between Pipeline Pilot and RDKit explain the lower correspondence between models built using these two approaches, whilst the ECFP implementations are almost identical.
Nevertheless, the results did con rm to our satisfaction that it would be possible to deliver model performance that was very close to those of the previous study.

Model comparison
Internal validation. The performance for each dataset across the 6 different descriptor combinations is shown in Fig. 1. Somewhat surprisingly, the 5-fold cross validations show that the choice of descriptors appears to have rather limited impact on the performance. Indeed, we observed little variability between the six descriptor sets. Moreover, combining physico-chemical descriptors and molecular ngerprints results in barely any effect (comparing model 1, 3 and 4, and model 2, 5, 6, respectively). External validation. External model performance was evaluated using the MMV test set (Fig. 2). We observed relatively little difference between all the models. When comparing different combinations of descriptors, we observed little variation, though FCFP alone (model 2) is generally better than ECFP alone (model 1).
As the various models were so similar in performance, we pragmatically decided to use the same set of descriptors for each dataset and opted for model2 (FCFP ngerprints only). Using a single set of descriptors makes the prediction of new compounds more straightforward and faster to compute. Figure 2: Results of the external validation for the 6 models trained by each partner dataset. This validation was performed predicting the MMV test set.

Consensus approaches
Several consensus options were investigated to identify the optimal approach for inclusion in the public model. We describe here in detail two of the approaches explored, MaxScore and metamodel.
MaxScore. For any given combination of models, the MaxScore prediction for a test compound is the maximum score predicted by any one of the component models.  Fig. 3 the impact of increasing the number of models in the MaxScore consensus, using these three metrics for the three validation sets. Metamodel. In this approach, we merged all the different partner models into a single model based on a combined set of ngerprints bits, where the weight for any one bit is given by summing the weights for that bit across the different individual models. The resulting metamodel is a Naïve Bayes model that generates scores from 893,855 binary variables. It has the key advantage of providing a consensus score from a single model, thus providing signi cant run-time advantages when compared to the all-model consensus which would require each new compound to be rst scored by 11 different models. The metamodel approach is compared with the all-model MaxScore consensus in Table 3 and is also annotated on Fig. 3.
In Table 3 we also present summary results for two other consensus approaches that were investigated. The mean score (MeanScore) is calculated as the arithmetic average of the prediction scores returned by the individual models. The MinRank algorithm scores a compound as the minimum rank across all models. Overall, we could not identify a clear winner when comparing the results across these three validation sets. Chemical space analysis In machine learning a standard way to assess the chemical space over which the model can generate reliable predictions is by de ning its domain of applicability [28][29][30]. It is usually calculated from the variables used to describe the model training set. Methods such as conformal prediction are furthermore able to quantify the prediction certainty [31][32][33]. Our consensus approaches were developed with the underlying training sets remaining obscured, leaving us with limited options to perform further analysis and in particular it was not possible to directly assess the model's domain of applicability. Nevertheless, to con rm the predictivity of our methods, we comprehensively assessed their performance using three validation sets as described above. Additionally, we further investigated the variation in performance across these different sets by considering ngerprint coverage. For a given molecule the ngerprint coverage is calculated as the proportion of bits shared with the model and so is a measure of the similarity of the compound with the model. Taking the MaxScore consensus as an example, for each compound in the three validation sets, ngerprint coverage values were calculated as the mean across the 11 models and we compared the resulting metric with the corresponding consensus score, distinguishing active from inactive compounds. See Fig. 4, where active compounds are dark and inactive compounds are light. The MMV and the St. Jude datasets show similar distributions, with high compound score for these sets appearing to require high ngerprint coverage. This might be considered expected behaviour since the score is the sum of feature log(probabilities) (see Methods). However, it is not the case that high ngerprint coverage necessarily leads to a high score. Indeed, the metric does not consider the weight of these shared bits, and it is possible that they contribute very little to nal score, if not negatively. Having only a few bits shared with the model would be expected to lead to lower scores as there are just too few contributing bits. The results for the PubChem dataset differ from the two others, showing a much weaker relationship between score and the mean ngerprint coverage. Interestingly, the lowest ngerprint coverage values in this dataset have higher absolute values than for the other two sets, whilst also having a narrower range of scores. To further explore this, Fig. 5 shows the results of a t-distributed Stochastic Neighbor Embedding (t-SNE) visualisation of all validation set compounds. Described above, this nonlinear dimensionality reduction technique is particularly useful for preserving local data structures [34,35]. We observe in Fig. 5 that the MMV test set and St. Jude Screening set compounds are more similar between each other than they are compared to the PubChem dataset compounds, respectively. One possible explanation for these results is that the PubChem set is inherently less diverse, being based around a small number of central scaffolds, than the other two sets. However, such analysis is beyond the scope of this study. We should also be cautious in over-interpreting these results. In particular, the ngerprint coverage metric relies only on the bits shared with the model, ignoring bits in the model but not in the molecule. Hence, while it gives an indication on how much of the compound bits are in the model, it completely eludes the information present in the model but not in the compound. This is analogous to the observations of Martin et al. [36] and Hempel's "raven paradox".

Web application
The above results demonstrate that the metamodel shows comparable results to the other consensus approaches in our validation experiments. Furthermore, there are signi cant computational advantages in having to deal with just a single model instead of eleven from a system performance perspective. We have therefore implemented the metamodel in a web application called MAIP (MAlaria Inhibitor Prediction), which is available for the community to make malaria activity predictions.
MAIP is accessible through https://www.ebi.ac.uk/chembl/maip/ (Fig. 6A). Users upload a le containing their molecules represented by SMILES strings and associated identi ers. Once submitted, a job is queued for execution on the EMBL-EBI compute infrastructure (Fig. 6B). Each submitted molecule is MAIP is intended to be used as a prioritisation tool, offering users a way to identify compounds of interest for further analysis and selection. MAIP does not directly return a prediction ag, e.g. 'predicted antimalarial compound'; users should scrutinise their results before making any decisions on next steps. To assist this, documentation is provided with the results together with relevant statistics for our three validation sets (Fig. 7). For each dataset, the higher the model score, the greater the observed enrichment. Further, the thresholds needed to pick 1%, 10% and 50% of the predictions correlate with the dataset size.
These data can be used to guide the users in their analysis.
Users are also strongly advised to use additional in silico lters to assess the suitability of any virtual hits from MAIP prior to any experimental testing. High scoring compounds may have physicochemical properties and/or substructure features that are unsuitable as starting points for a malaria (or indeed any) drug discovery programme. In addition, some of the training sets used in MAIP contain examples of known anti-malarial compounds (e.g. aminoquinolines). Thus, molecules that score highly in MAIP may have already been worked on extensively in anti-malarial programmes. Public bioactivity resources such as ChEMBL can be used to determine whether the anti-malarial activity is already known for speci c structural classes [13]. As part of MMV's commitment to Open Innovation, screening slots in the MMV Plasmodium falciparum asexual blood stage assay are being made available to test 3rd party compounds identi ed using MAIP. Please contact MMV (maip@mmv.org) for more details.

Discussion and Conclusions
The key output from this work is a practical tool that we hope will be of value in the global search for new malaria treatments. We have built upon and extended a previous study, delivering a computational model derived from > 6.5 million malaria bioactivity values. We believe that this is the largest collection of malaria data assembled to date and the largest public machine learning model for a drug discovery target. We have created a comprehensive open source software capability for molecule standardisation and model building, based on the widely used RDKit library, that will enable other groups to build their own models and potentially to extend our current model by the inclusion of additional malaria bioassay data. This software was used to deliver 11 separate in silico models that individually show heterogeneous performance across the three different validation sets employed. We anticipate (though without direct access to the underlying structures we cannot be completely certain) that this is due to the different chemical space covered by each individual dataset. Combining the models in a consensus approach gives improved and more robust performance. Of particular note is that the metamodel approach to combining Naïve Bayes models was shown to deliver good performance across multiple validation sets and also has the key advantage of computational e ciency. This model has been implemented in an easy-to-use web application. Although the compounds used in the training sets remain accessible only to their owners, the wider community can nevertheless take advantage of their malaria bioactivity properties without breaching con dentialities.
MAIP was developed with the objective to be relatively easy to maintain promoting sustainability of the tool.
Combined with the open source software used to develop the model, this will enable both existing and new partners to further contribute to the consensus model. In addition, the modular nature of the MAIP system means that the same approach could be easily applied to other areas, for example with alternative machine learning algorithms and/or to other areas of drug discovery. The scienti c community is now invited to use our platform in the hope that this may lead to the initiation of new antimalarial drug discovery projects. Results of the internal validation for the 6 models trained by each partner dataset. The error bars represent the standard deviation returned by the 5-fold cross validation.

Figure 2
Results of the external validation for the 6 models trained by each partner dataset. This validation was performed predicting the MMV test set.

Figure 3
Results of the external validation for the 6 models trained by each partner dataset. This validation was performed predicting the MMV test set.
Comparison between ngerprint coverage and prediction score for (A) MMV, (B) PubChem and (C) St. Jude Screening sets, where active compounds are dark and inactive compounds are light.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.