RANdom SAmple Consensus (RANSAC) algorithm for material-informatics: application to photovoltaic solar cells

An important aspect of chemoinformatics and material-informatics is the usage of machine learning algorithms to build Quantitative Structure Activity Relationship (QSAR) models. The RANdom SAmple Consensus (RANSAC) algorithm is a predictive modeling tool widely used in the image processing field for cleaning datasets from noise. RANSAC could be used as a “one stop shop” algorithm for developing and validating QSAR models, performing outlier removal, descriptors selection, model development and predictions for test set samples using applicability domain. For “future” predictions (i.e., for samples not included in the original test set) RANSAC provides a statistical estimate for the probability of obtaining reliable predictions, i.e., predictions within a pre-defined number of standard deviations from the true values. In this work we describe the first application of RNASAC in material informatics, focusing on the analysis of solar cells. We demonstrate that for three datasets representing different metal oxide (MO) based solar cell libraries RANSAC-derived models select descriptors previously shown to correlate with key photovoltaic properties and lead to good predictive statistics for these properties. These models were subsequently used to predict the properties of virtual solar cells libraries highlighting interesting dependencies of PV properties on MO compositions.


Background
Material informatics is a rapidly developing field engaged with the application of informatics principles to materials science in order to assist in the discovery and development of new materials [1][2][3][4][5]. Developments in material informatics take advantage of the vast empirical and computational information on structures and properties of materials available in multiple databases such as Mat-Web (http://www.matweb.com/) which includes properties for over 115,000 materials and MatDat (https://www. matdat.com/) which includes over 1000 datasets of materials, to name but a few. [6][7][8][9][10] Turning this large volume of information into knowledge could be performed in multiple ways using multiple data mining procedures. As an example, AFLOW [6] (http://aflowlib.org/) is a database of density functional theory (DFT) calculations performed on more than 1.5 million materials with known crystal structures. Isayev et al. [5]. used this database to introduce the term "material cartography" for representing a library of materials as a network. The resulting network was subsequently mined using various machine learning methods in search for materials with interesting properties.
A pre-requisite to any data mining procedure is a data curation stage [11]. Data curation is important for two main reasons: (1) Publically available data sets may contain multiple errors; (2) even a small number of errors may compromise the quality of QSAR models [11]. For example, Olah et al. [12,13] have shown an error rate as high as 8% in the WOMBAT database and Young et al. [14] have recorded error rates between 0.1 and 3.4% in a variety of databases. More recently, Isayev et al. [5] have demonstrated several errors in the AFLOW database including duplicate compounds and incorrect extraction of literature data. In general, data curation involves steps like the removal of duplicates, compounds with wrong Lewis structures, compounds for which descriptors could not be calculated, and in case of experimentally measured data the removal of compounds which suffer from errors caused by the measurement process.
Due to the sheer size of material databases, data curation cannot be performed manually but rather requires a computational workflow. Indeed several such workflows have been reported in the literature [11,15,16]. However, even a stringent curation workflow cannot clean a database from noise that often accompanies experimental data. The presence of noise might mask the information that the data hold, thereby compromising data interpretation, model generation and decisions making.
In general, noise could be classified as either internal or external. Internal noise is inherent to the measurement process of the data, affects all data points, and is assumed to be distributed normally. In contrast, external noise results from sources exterior to the system due to an error in the measurement itself or from extreme behavior that does not match the overall behavior of the majority of samples. While all samples experience internal noise, some may also experience (greater) external noise and could therefore be regarded as outlier samples. Thus, an outlier is an observation on the dataset, which appears to be inconsistent with the rest of the data [17].
Important aspects of data mining in material informatics are database searching, similarity searches, and the usage of machine learning algorithms for pattern recognition and derivation of predictive models [18,19]. Multiple terms have been used to describe such models including Quantitative Structure Activity/Property Relationship (QSAR/QSPR) models [20,21], Quantitative Materials Structure-Property Relationships (QMSPR) models [5], and Quantitative Nanostructure Activity Relationship (QNAR) models [22]. All models attempt to correlate specific activities (or properties) for a set of materials with (calculated or measured) molecular descriptors by means of a mathematical model. Such models should both provide scientific insight into the problem in hand as well as allow for the prediction of the results of future experiments. An important characteristic of QSAR models is therefore their predictive power. However the presence of outliers (i.e., noise) may bias the dataset to the point of compromising the ability of machine learning algorithms to build predictive models. Consequently, a common practice of QSAR modeling is the prior removal of outlying samples prior to model generation [23]. Accordingly, several methods for the removal of outliers were reported in the literature [24][25][26][27][28][29].
Two more aspects of machine learning algorithms which critically affect performances are the selection of specific descriptors that best correlate with the activity under study from the initial pool of descriptors and the definition (and application) of the model's applicability domain, namely, the region in material space in which the model is expected to give accurate predictions. Multiple descriptors selection (i.e. feature selection) methods have been developed including filter methods, wrapper methods and embedded methods [30]. Similarly several algorithms for the definition of applicability domains have been reported [31].
Most QSAR studies treat the removal of outliers, the selection of descriptors and the definition of applicability domain as separate stages within a QSAR workflow, often using different tools for each task [11,20,32,33]. Thus, there is an interest in presenting a "one stop shop" algorithm for the performance of all tasks. The advantages of such an algorithm are the potential prevention of errors resulting from interfaces between different components as well as easier accessibility, in particular by non-experts. In contrast a "one stop shop" algorithm is by its nature non-modular, offering minimal flexibility in the modeling process.
With this in mind we present in this work the adaptation, implementation, and the first application of the RANdom SAmple Consensus (RANSAC) method [34] to the field of material-informatics by deriving predictive models for key photovoltaic properties of solar cells. RANSAC is a modeling tool widely used in the Image Processing field [34][35][36] primarily for image noise filtration. The algorithm produces and validates a linear QSAR model based on the Minimum Least Square (LMS) method by (1) filtering noisy samples (i.e., outliers), (2) selecting the best features (i.e., descriptors), (3) deriving a QSAR model from training set samples and (4) predicting the activity of test set samples while invoking the concept of applicability domain, all in a single process without the need of complementary processes. For prediction of samples not in the original test set (i.e., samples for which no activity data are available), RANSAC provides a statistical estimate for the probability of obtaining reliable predictions, i.e., predictions within a pre-defined number of standard deviations from the true values. These characteristics make RANSAC an appealing addition to the arsenal of tools available for the derivation of predictive QSAR models.
As a first application, we chose to test the performances of RANSAC in the important field of solar cells which emerge as one of the main resources for clean energy. Briefly, a typical solar cell (photovoltaic device) operates by: (1) Generation of charge carriers (electrons and holes) following the absorption of photons; (2) Separation of the photo-generated charge carriers via charge selective contact(s); (3) Collection of the photo-generated charge carriers at an external circuit resulting in electricity.
In particular we focus our attention on solar cells entirely composed of metal oxides (MOs). Such cells possess many favorable properties including natural abundance of the constituting materials, ease of fabrication and long time stability. However, such cells do not demonstrate sufficient efficiency in converting sunlight to electricity thereby requiring the development of new cells potentially composed of new MOs or MO combinations [37]. Such developments could be facilitated by the development of QSAR models to predict key solar cells properties such as current, voltage, and quantum efficiency. Yet despite their importance only few QSAR studies were reported on solar cells [38][39][40] and even fewer on MObased solar cells [41].
MO-based solar cells are often produced using combinatorial techniques resulting in solar cell libraries [37,42]. Following fabrication, the libraries are subjected to medium throughput measurements to characterize their composition/structure as well as their photovoltaic (PV) properties. Due to the technical challenges involved in both fabrication and characterization, the resulting libraries often contain noisy data [42] making them ideal candidates for the RANSAC algorithm. The main objective of the present study is therefore to establish the usefulness of the RANSAC algorithm in cleaning and analyzing datasets of solar cells libraries and predicting their PV properties. For this purpose, we used three recently published datasets experimentally-derived from two different solar cells libraries. The first library is a TiO 2 |Cu 2 O library reported by Pavan et al. [43]. The library consists of two datasets, one with Ag back contacts and the other with Ag|Cu back contacts. The second library is a TiO 2 |Co 3 O 4 |MoO 3 library reported by Majhi et al. [15]. The two libraries comprised of TiO 2 |Cu 2 O based solar cells were previously modeled using k nearest neighbors (kNN) and genetic algorithm allowing for a facile comparison between the performances of the different algorithms. The third library (TiO 2 |Co 3 O 4 |MoO 3 ) was previously analyzed using visualization methods [36]. We demonstrate that the RANSAC algorithm filters the sample space from noisy data (i.e., outliers), automatically selects descriptors previous shown to correlate with key PV properties and generates models with good predictive statistics for these properties.

RANSAC overview
RANdom SAmple Consensus (RANSAC) [34] is a method for deriving a model based on linear regression, performed on input data that may include noisy samples (both internal and external noise). The basic assumption of the algorithm is that the measured activity (Y measured (x)) depends on a set of noise-free variables (e.g., descriptors; x) and on noise added to them; Eq. (1).
where Y noise−free (x) is the expected activity in a noise-free environment and N is a random internal noise. RANSAC assumes that the internal noise obeys the homoscedastic assumption, namely, that it has a constant distribution across all activity values. Using this assumption, boundaries could be set to form a "strip" that classifies the samples as either affected by internal noise only (model-compatible samples residing within the "strip") or such that are affected both by internal and by external noise (model-incompatible samples residing outside the "strip"). Importantly, these boundaries should be a priori provided to the algorithm, based on the system's characteristics and are expressed as the distance, in number of standard deviations (n), from the model (see below and Fig. 1).
Mathematically, the following definition applies [Eq. (2)]: where Y calculated (x) is the calculated activity (see below), σ is the standard deviation of the sample and n is the width

Model construction
For RANSAC to build a model, it must first draw a subsample from all the samples used for model training (i.e., training set) and use it to construct a regression line. For a single observation the model takes the form of Eq. (3): Where y is the dependent variable, x is the vector of the independent variables (i.e., descriptors), i denotes sample i, p is the power of the best fit curve, d is the dimensionality of the model (i.e., number of descriptors) and W is a vector holding the weights calculated using the linear

Model scoring
The basic assumption underlying the RANSAC algorithm is that the set of samples (expressed as data points) could be approximated by a model of a certain dimensionality (d), where each dimension is represented by a descriptor raised up to a maximum power (p) allowed for the model. If this assumption holds true, then one would expect to have most dataset points residing within a "strip" of a given width around the best fit curve calculated for a subsample (i.e., model compatible samples). The "strip" could be used for several purposes: (1) Scoring models by counting the number of dataset points residing within their boundaries (the larger the number, the better the model). Models are scored based on the entire training set and not only on the drawn sub-sample used for their construction. (2) Identifying outliers by observing training set samples residing outside the "strip's" boundaries. (3) Defining the "strip" as the model's applicability domain for test set predictions. RANSAC scores a model based on the number of model-compatible samples from within the training set.

Iterative phase and select highest scoring model
RANSAC is an iterative algorithm that requires many repetitions of the model construction and scoring phases (i.e., iterations) in order to obtain the best model. Furthermore, the number of the required iterations depends on the size of the dataset with larger datasets requiring more iterations. At each iteration, the algorithm counts Fig. 2 Description of the RANSAC algorithm as used for model construction the number of model-compatible samples and outputs the weights vector (W ) that corresponds to the highest ranked model (i.e., model with the highest score). For this model the LMS error is calculated both before and after the removal of outliers (i.e., model-incompatible samples). It is important to note that the size of the "strip" (which ultimately determines the number of model compatible samples) may vary between libraries and should be specifically chosen for each library.

Predictions
The best model emerging from the iterative phase is used for predictions. For test set samples, their known activities allow to classify them as either within or outside the model's applicability domain (i.e., either within or outside the "strip"). The percentage of within-"strip" samples provides an estimate for the percentage of "correct" (i.e., within the predefined number of standard deviations (n) from the true value) predictions for "future" samples, that is samples with unknown activities. RANSAC does not feature an inherent applicability domain for individual samples although a descriptors based applicability domain approach could of course be used [31].
For all RANSAC's applications described in this work the following parameters were used: The number of iterations was set to 10 5 to derive a polynomial equation of the 5th power. The size of the "strip" (i.e. the models' boundaries) was set to be ±1 standard deviation around Ȳ measured derived from the training set. The algorithm was coded in MATLAB version R2014a.

Metal-oxide solar cells library
The basic assembly of MO solar cell library includes (see Fig. 3): (1) a transparent conducting oxide (TCO) coated on a glass, typically in the form of fluorine doped tin oxide (FTO); (2) a window layer, which is a wide bandgap n-type semiconductor (typically TiO 2 ); (3) a light absorbing layer (absorber); (4) Metal back contact; (5) Metal frame (front contact) soldered directly onto the FTO.
TiO 2 |Cu 2 O library (Fig. 3a) An experimental library of solar cells was obtained from Pavan et al. [43]. This library was generated on precut glass coated with fluorine doped tin oxide (FTO) substrates onto which a TiO 2 window layer with a linear gradient was deposited, followed by an absorber layer of Cu 2 O. Inserting two different grids of 13 × 13 = 169 back-contacts, namely, silver only (Ag) and silver and copper (Ag|Cu) deposited one after the other, lead to two sub-libraries (datasets) each consisting of 169 cells. In this work we omitted the non-photovoltaic cells leaving a total of 162 and 166 cells for the Ag and Ag|Cu back contact data base respectively. 3 library (Fig. 3b) This library was constructed in a manner roughly similar to the TiO 2 |Cu 2 O libraries, with the same window layer (TiO 2 ) but different target metal oxide for the absorber layer (Co 3 O 4 ) and also included a third recombination

Library characterization
Each solar cell was characterized by its material descriptors (independent variables) and experimentally measured photovoltaic activities (dependent variables). Material descriptors included the thickness of the window layer (d In this work we focused on three experimentally measured PV activities (dependent variables, end points): (1) the short circuit photocurrent density (J SC ) which is the current density through the solar cell when the voltage across the cell is zero. (2) The open circuit voltage (Voc) which is the maximum voltage available from a solar cell.

This voltage occurs at an open circuit. (3) The internal quantum efficiency (IQE) which reflects the charge separation and collection efficiencies of a device and is calculated by Eq. (6)
where J max is the maximum theoretical calculated photocurrent. The distributions of the three PV activities are represented by boxplots in Fig. 4 and their ranges are given in Table 3.

Model fitting and statistical parameters
The datasets were divided into training and validation (test) sets using a recently published representativeness algorithm [44]. Subsets selected by this algorithm were previously employed as external validation sets in QSAR modeling [24,25,41,44]. Each dataset was divided into a training set composed of 80% of the original dataset (130, 134 and 120 cells for the TiO 2 |Cu 2 O with Ag back contacts, TiO 2 |Cu 2 O with Ag|Cu back contacts and TiO 2 |Co 3 O 4 |MoO 3 datasets, respectively) and a test set containing the remaining cells (32 samples for the TiO 2 |Cu 2 O with Ag and Ag|Cu back contact and 30 samples for the TiO 2 |Co 3 O 4 |MoO 3 dataset). The TiO 2 |Cu 2 O libraries with Ag and Ag|Cu Back Contacts were previously modeled by Yosipof et al. [41]. For the purpose of comparison, the training and test sets described above, were made identical to those described by Yosipof et al. [41].
To evaluate the RANSAC model performances on the training set we used Q 2 train as expressed by Eq. (7). The RANSAC algorithm excludes samples from the training set-based error calculation if residing outside the model's boundaries (e.g., "strip"). Thus the model's error is derived without these samples. This is analogous to outlier removal. Below we therefore report two Q 2 train estimates, the first based on all samples and the second based on samples surviving RANSAC's inherent outlier removal.
The performances of the RANSAC algorithm on the test set (Q 2 ext ) were calculated in a similar manner [Eq. (8)]. Similarly to outlier removal, the "strip" calculated by the RANSAC algorithm was used to evaluate the applicability domain (AD) of the resulting model. Accordingly, two estimates of Q 2 ext were calculated one pertaining to the entire test set and one, for that portion of the test set which resided within the model's applicability domain.  where Y measured is the experimental result, Y predicted is the predicted value and Ȳ measured is the mean of the experimental results over training set samples. In addition, we used the R 2 (squared correlation coefficient) between the predicted (Y predicted ) and the experimental (Y measured ) data for both training and test set.
Finally, to assess model significance and to rule-out chance correlation, Y-randomization procedure was applied to all models.

Performances of RANSAC-derived models
The RANSC algorithm was applied to the three datasets described above. For each dataset, three models were derived to describe their photovoltaic (PV) properties (J SC , V OC and IQE). Table 4 presents the number of training set and test set samples found to reside within the model's "strip" (i.e., model-compatible samples). Model-incompatible samples in the training and test sets are referred to as outliers and outside of the model's AD, respectively. As can clearly be seen, the vast majority (≥85%) of the samples are included within the "strip" for both the training and test sets. This suggests that (1) predictive models could likely be derived for this dataset and (2) the model described by the "strip" forming curve approximates most of the training set and test set samples to within one standard deviation (the pre-defined "strip" width; see Methods section) from their experimental values. One could therefore propose that the majority of future samples will be similarly predicted. However, in two cases the number of model compatible cells was below the 85% threshold (the V OC models for the TiO 2 |Co 3 O 4 |MoO 3 library with 84 and 80% of model compatible cells for the training and test sets, respectively), indicating higher variance for this property in this dataset in comparison with the other properties/datasets. In accord with this observation, the performances of the V OC model from the TiO 2 |Co 3 O 4 |MoO 3 library were exceptionally poor (Table 5). This model was therefore excluded from the analysis reported below.
Overall, the RANSAC algorithm led to models with good statistical parameters (Table 5)     The performances of the RANSAC models on the test set samples followed a trend similar to that observed for the training set. Thus, for all test sets, Q 2 ext was found to be between 0.69-0.82, 0.62-0.80, and 0.69-0.79 for J SC , V OC (excluding the TiO 2 |Co 3 O 4 |MoO 3 library) and IQE, respectively. Similar results were obtained for R 2 values between the predicted and the actual activities ( Table 5). As expected for datasets devoid of significant activity cliffs, when considering only samples within the models' applicability domains, these numbers improved to 0.82-0.87 and 0.79-0.83 for J SC , and IQE, respectively. For V OC of the TiO 2 |Cu 2 O (Ag) library, no test set samples were filtered by the applicability domain leading to no change in model performances (Q 2 ext = 0.80). However for this property a significant increase in the TiO 2 |Cu 2 O (Ag|Cu) library upon the removal of only two samples was observed (Q 2 ext = 0.62 and 0.73 without and with the model's AD, respectively). Figures 5 and 6 present predicted versus experimentally measured values for all three PV properties considered in this work across the three datasets following outlier removal for training set samples and considering only samples within the models applicability domains for the test set.
Finally, Y-randomization procedure was applied to all models and no statistically significant models were derived. Two of the above described datasets [TiO 2 |Cu 2 O (Ag) and TiO 2 |Cu 2 O (Ag|Cu)] were previously modeled by Yosipof et al. [41] using kNN and a Genetic Programming (GP) approach, thereby allowing for a direct comparison between the performances of the resulting models (the results of kNN and GP models from Yosipof et al. [41] are presented in Table 7). GP produced models with Q 2 ext values between 0.74-0.76, 0.50-0.78 and 0.72 for J SC , V OC and IQE respectively.    Thus, kNN provides models with higher prediction statistics than RANSAC in particular when the AD is not considered. However, the performances of RANSAC approach those of kNN upon the introduction of the AD. Moreover, the test set coverage provided by RANSAC is generally higher than that provided by kNN (Table 6). Finally, in contrast with kNN, RANSAC provides a model in the form of a QSAR equation which enhances model interpretability. Table 8 presents the model equations produced by RANSAC for the different PV properties of the three datasets.

RANSAC as a feature selection tool
For both TiO 2 |Cu 2 O datasets it is evident that while four descriptors were evaluated by RANSAC, only two were picked by the algorithm as predictors of photovoltaic activities. Importantly, these two descriptors give rise to six terms in the resulting QSAR equations due to their power form. Thus, RANSAC "expands" the small number of final descriptors by using them in multiple forms.
A potential drawback of the resulting models is therefore reduced interpretability of terms including "high power" descriptors. The TiO 2 |Co 3 O 4 |MoO 3 dataset was characterized by five descriptors and only three were selected by RANSAC leading to models with six terms ( Table 8).
The features selected by the RANSAC algorithm could be compared with those selected by the kNN and GP models reported by Yosipof et al. [41]. As can be deduced from Table 9, all methods selected the same descriptors for the TiO 2 |Cu 2 O (Ag) library while kNN replaced d TiO 2 by the ratio descriptor for the TiO 2 |Cu 2 O (Ag|Cu) library. While GP sometimes selected a smaller number of "base descriptors", it compensated for this smaller number by incorporating these descriptors in more complex mathematical equations. In contrast, the RANSAC algorithm is limited to simple polynomial equation (to the 5th power in this study).

RANSAC derived virtual cells
RANSAC derived models could be used to predict PV properties of virtual solar cell libraries. These predictions could serve two purposes: (1) identify trends related to the dependence of PV properties on descriptors values, nm. The virtual cell should cover these ranges and expand upon them to allow RANSAC-based extrapolations. With this in mind, thickness values for the different layers were selected to be between 200 and 700 nm and between 40 and 400 nm for the Cu 2 O and TiO 2 layers, respectively, where each range was divided into 100 bins (a total of 10,000 cells per virtual library). These specific ranges were selected following several iterations designed to find the model's limits, beyond which the results would not be physically meaningful (i.e., have negative PV values). Next, the PV properties (J SC , V OC , IQE) of each cell were predicted using the RANSAC models presented in Table 8. The results of these predictions are presented in Fig. 7 and demonstrate a few trends: (1) all PV activities primarily depend on the thickness of the Cu 2 O layer rather than on the thickness of the TiO 2 layer. This trend was noted by Pavan et al. [43]. but only for J SC . (2) J SC presents a marked increase for Cu 2 O thicknesses above 500 nm (where J SC equals 200 µA cm 2 ) as seen in Fig. 7a, d. Similar trends (yet with less sharp transitions) are also seen for IQE and V OC (Fig. 7b, e and c, f, respectively). Interestingly, Cu 2 O thicknesses above 500 nm where  Fig. 7a, d) which is followed by V OC (compare Fig. 7b, e). In contrast, the dependence of IQE on the thickness of the Cu 2 O layer is the least affected by the back contact (compare Fig. 7c, f ) [15] showed that IQE is mainly affected by the thickness of both the Co 3 O 4 and MoO 3 layers. This conclusion was further supported by a computational analysis [36]. Figure 8 shows that RANSAC's prediction is in line with this proposition (i.e., to achieve relatively high IQE values, the thickness of the Co 3 O 4 layer must be low, smaller than 150 nm and this property is also influenced by the thickness of the MoO 3 layer). In addition, RANSAC's models point to an inherent problem in producing solar cells with both high J SC and IQE values for this MOs combination since the former seems to yield maximum value at Co 3 O 4 layer thickness at the 500 nm region while the latter, yields its global maxima at the 30 nm region. Finally, Fig. 8