A machine learning platform for the discovery of materials

For photovoltaic materials, properties such as band gap E_g are critical indicators of the material's suitability to perform a desired function. Calculating E_g is often performed using Density Functional Theory (DFT) methods, although more accurate calculations are performed using methods such as the GW approximation. DFT software often used to compute electronic properties includes applications such as VASP, CRYSTAL, CASTEP or Quantum Espresso. Depending on the unit cell size and symmetry of the material, these calculations can be computationally expensive. In this study, we present a new machine learning platform for the accurate prediction of properties such as E_g of a wide range of materials.


Introduction
Opportunities to harness the continued pace of computer processing capabilities as well as new and refined data processing techniques exist for those wishing to investigate and predict material properties computationally.
Using Machine Learning (ML), Deep Learning (DL), and High Throughput (HT) computing techniques can provide an efficient, robust data processing platform for the prediction and discovery of new materials.
ML techniques involve processing large datasets in order to generate highly accurate modelling algorithms that can be used to find relationships within the data and predict outcomes.
HT computing techniques involve aggregating the results of computations that have already been executed from many disparate data sources. Quantum chemical calculations and atomic scale calculations are often time consuming and CPU expensive, requiring hundreds of hours of super-computer processing time. Using pre-calculated results from these operations will greatly reduce processing time, allowing for a greater throughput on much more modest hardware.
The combination of ML with HT will allow for rapid and exhaustive exploration of materials properties within a computational environment, at a scale and speed that simply cannot be matched in a laboratory.
In this paper we present a bespoke software platform (codename: Hadoken) for the discovery of materials, as well as 5 models derived from ML techniques that can be used to accurately predict material properties (such as the band gap of a compound), and a newly developed website that provides the basis for a materials prediction platform.
The stoichiometry is subsequently used to compose new features, each comprising the element and the count of instances of that element. The one-hot encoding process involves decomposing categorical values into a binary representation.
Formulae such as H2O and copper indium selenide are encoded by mapping each element symbol and its count to the corresponding features. Isomers have the same stoichiometric S value yet have differing physical structures (C3H4, for example); these isomers will produce identical encodings.
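As an illustrative sketch (not the platform's actual code), the decomposition of a stoichiometric formula into per-element count features can be expressed as follows; the element vocabulary here is a hypothetical subset of the periodic table:

```python
import re

# Hypothetical element vocabulary; the real platform would derive this
# from the full set of elements present in its datasets.
ELEMENTS = ["H", "C", "O", "Cu", "In", "Se"]

def parse_formula(formula):
    """Parse a formula such as 'H2O' into {element: count}."""
    counts = {}
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(count) if count else 1)
    return counts

def encode_stoichiometry(formula):
    """Encode a formula as per-element count features over a fixed vocabulary."""
    counts = parse_formula(formula)
    return [counts.get(el, 0) for el in ELEMENTS]
```

Because only symbols and counts are encoded, isomers such as propyne and allene (both C3H4) yield the identical vector, as noted above.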
In this paper, the definition S (1) refers to this equation. The gap type GT feature represents values that indicate the category (one-hot encoded) of gap type present in the compound. Table 1 details the possible gap type values with corresponding definitions.
* Given a band gap, this keyword describes whether the system is a metal, a semi-metal, or an insulator with a direct or indirect band gap [5].
The geometry G feature is decomposed using cell parameters (the unit cell's lengths and angles) into 6 features:

G → {a (Å), b (Å), c (Å), α (°), β (°), γ (°)} (3)

The space group SG, which defines one of the 230 possible symmetry groups of the crystal lattice, is a categorical scalar that requires transformation into appropriate binary features (one-hot encoding). As an example, the space group represents one of 230 possible categories with the use of a single integer: this scalar is transformed into 230 binary features, so that space group 37 is encoded as a vector whose 37th entry is 1 and all other entries are 0. In the final stages of data preparation, constant features (features that contain the same value for each record) were dropped from the dataset, as well as any rows that contained null feature values. The dataset is then ready for use.
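The space group transformation described above can be sketched as follows (a minimal illustration; the index placement is an assumption consistent with a 1-based space group numbering):

```python
def one_hot_space_group(sg):
    """Encode a space group number (1-230) as 230 binary features."""
    if not 1 <= sg <= 230:
        raise ValueError("space group must be in 1..230")
    vec = [0] * 230
    vec[sg - 1] = 1  # single active bit for the categorical value
    return vec
```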

Aggregated feature set
Data obtained from all databases (AFLOW [6], Materials Project [7]) is normalised and aggregated into a single, functional form. This process aggregates the maximum number of homogeneous features from the consumed data sources. Table 2 details the feature set obtained from the AFLOW and Materials Project databases, along with names and example values. Table 3 details the attributes collected, along with names, example values, and original data source.

Additional feature set derivation
Additional features useful for ML can be derived from existing features and also user input. Deriving these features frees the user from the necessity of performing these calculations and expedites workflow. In some instances, derivation of these additional features has been undertaken purely for experimental purposes, with the expectation that further refinement in the future will yield more practically useful results. In this paper, the notation S (1) refers to S provided by the Definition 1.

Number of atoms
The total number of atoms N contained within the system can be derived from S (1) such that:

N = Σ_i c_i

where S (1) describes the stoichiometric composition of the material and c_i the instance count of species i. This feature returns the sum over each atomic species contained in the unit cell, multiplied by its instance count.

Atomic weight
The total atomic weight T_Ar of the system with reference to S (1) is given by:

T_Ar = Σ_i c_i Ar_i

where S (1) describes the stoichiometric composition of the material, c_i the instance count of species i, and Ar_i [8] the atomic weight of each element. This feature returns the sum of the atomic weight of each species considered individually in the unit cell, multiplied by its instance count.
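The derived features above (and several that follow) share one shape: a per-element reference value weighted by the instance count and summed over species. A minimal sketch, with the atomic weights hard-coded here as assumed reference values:

```python
# Approximate standard atomic weights for the elements used below
# (stand-ins for the reference data the platform sources from [8]).
AR = {"H": 1.008, "O": 15.999}

def total_atoms(stoichiometry):
    """N: sum of instance counts over all species in the unit cell."""
    return sum(stoichiometry.values())

def total_atomic_weight(stoichiometry):
    """T_Ar: sum of atomic weight x instance count over all species."""
    return sum(count * AR[el] for el, count in stoichiometry.items())
```

The same loop, with AR swapped for chemical potentials, electron affinities, ionisation energies, densities or valence electron counts, yields the other totals defined below.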

Chemical potential
The total chemical potential T_µ of the system with reference to S (1) is given by:

T_µ = Σ_i c_i µ_i

where S (1) describes the stoichiometric composition of the material, c_i the instance count of species i, and µ_i the chemical potential of each element. This feature returns the sum of the chemical potential of each species considered individually in the unit cell, multiplied by its instance count. This feature contains values generated by the software given a stoichiometry value. The chemical potential values are provided from the corresponding VASP POTCAR files.

S, P, D, F electrons
The total count of electrons T_e in each type of sub-shell within the compound is given by:

T_e = Σ_i e_i

where e_i describes the number of electrons present in the corresponding sub-shell. Electron configuration is determined using values from the literature [9].

S, P, D, F orbitals
The total count of each type of orbital T_σ within the compound is given by:

T_σ = Σ_i σ_i

where σ_i describes the corresponding number of orbitals present in the element. Orbital configuration is determined using values from the literature [9].

Symmetry
The symmetry elements HS [10] associated with the space group of the crystal lattice have been stored in our database. This information is one-hot encoded in a similar fashion to SG (4).

Electron affinity
The total electron affinity T_EA with reference to S (1) is given by:

T_EA = Σ_i c_i EA_i

where S (1) describes the stoichiometric composition of the material, c_i the instance count of species i, and EA_i the electron affinity [11] of each element. This feature returns the sum of the electron affinity of each species considered individually in the unit cell, multiplied by its instance count.

Electronegativity
The total electronegativity χ is given by the Mulliken electronegativity definition [12,13]:

χ = (E_i + E_ea) / 2

where E_i describes the first ionisation energy [14] and E_ea describes the electron affinity [11].
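As a small worked example of the Mulliken definition (the hydrogen values used here are approximate literature values, included as assumptions):

```python
def mulliken_electronegativity(ionisation_energy_ev, electron_affinity_ev):
    """Mulliken electronegativity: the mean of the first ionisation
    energy and the electron affinity (both in eV)."""
    return (ionisation_energy_ev + electron_affinity_ev) / 2.0

# Hydrogen, with IE ~ 13.598 eV and EA ~ 0.754 eV (approximate values).
chi_h = mulliken_electronegativity(13.598, 0.754)
```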

Ionisation energy
The total ionisation energy T_IE with reference to S (1) is given by:

T_IE = Σ_i c_i IE_i

where S (1) describes the stoichiometric composition of the material, c_i the instance count of species i, and IE_i the ionisation energy [14] of each element.

Mass density
The total mass density T_ρ with reference to S (1) is given by:

T_ρ = Σ_i c_i ρ_i

where S (1) describes the stoichiometric composition of the material, c_i the instance count of species i, and ρ_i the density [15,9] of each element. This feature returns the sum of each mass density value multiplied by the instance count of the corresponding element.

Valence electrons
The total number of valence electrons T_Ve with reference to S (1) is given by:

T_Ve = Σ_i c_i Ve_i

where S (1) describes the stoichiometric composition of the material, c_i the instance count of species i, and Ve_i the number of valence electrons present for each element [15]. Currently, the number of valence electrons is determined primarily from the specification of the chemical elements in the VASP POTCAR file associated with the structure.

Effective mass
For a free electron, the effective mass [16,17] follows from the dispersion E = ħ²k²/(2m_e). For an electron in a crystal, the effective mass approximation replaces the free electron mass with m′_e = x m_e, so the dispersion may be rewritten as E = ħ²k²/(2 x m_e) (18). Using the second derivative of (18) to calculate x, and fitting a curve of the form y = ax² + bx + c to the conduction band minima of an E-k diagram (so that d²E/dk² = 2a), yields x = ħ²/(2 a m_e) (14). Thus our final equation for calculating effective mass (adjusting for atomic units, in which ħ = m_e = 1) is given by m* = 1/(2a) (23).
The VASP software package can produce EIGENVAL files which contain the Kohn-Sham eigenvalues for all k-points. We have developed software to parse these files and produce the appropriate band structure diagrams, to which a parabola may be fitted. The EIGENVAL header carries three values: the number of electrons, the number of k-points, and the number of bands respectively. Lines that contain 4 double values describe the 3-dimensional position in k-space, as well as a weighting factor (not used by our software). These values are parsed into a vector and stored in memory. Immediately following each coordinate line are lines containing the energies associated with that coordinate. Coordinate vectors represent direct coordinate values (x, y, z) and require further processing to be useful for m* calculation.
The reciprocal lattice is a 3 × 3 matrix whose rows are the reciprocal primitive vectors, defined as

b1 = 2π (a2 × a3) / (a1 · (a2 × a3)), b2 = 2π (a3 × a1) / (a1 · (a2 × a3)), b3 = 2π (a1 × a2) / (a1 · (a2 × a3))

where a1, a2 and a3 are the primitive vectors of the direct lattice. The reciprocal lattice (values sourced from the OUTCAR file, another VASP output file) is used to transform direct coordinates into Cartesian k-coordinates by multiplying each coordinate vector by the reciprocal lattice. Finally, the distance between two 3-dimensional k-coordinate vectors v_i and v_j, |v_i − v_j| (24), is used as the k-value along the x-axis in the following E-k diagram, with E (converted from units of eV → Ha) comprising the y-axis values. This process ostensibly provides
energy values that correspond to the associated position in the Brillouin zone. Figure 1 shows the band structure (E-k) diagram of Si 2 generated by the Hadoken software. Conduction bands are shaded green, with the lowest unoccupied molecular orbital (LUMO) shown as bold green. Valence bands are shaded blue, with the highest occupied molecular orbital (HOMO) shown as bold blue. An orange parabola has been fitted to the LUMO minimum in the Γ-X segment, and it is this curve that is used to calculate effective mass. Also shown are red parabolae fitted to the HOMO maxima.
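The reciprocal-lattice transformation and k-distance described above can be sketched with NumPy (a simplified illustration; the Hadoken parser itself is not shown):

```python
import numpy as np

def reciprocal_lattice(direct):
    """Rows of `direct` are the primitive vectors a1, a2, a3; returns the
    3x3 matrix whose rows are the reciprocal primitive vectors b1, b2, b3."""
    a1, a2, a3 = direct
    volume = np.dot(a1, np.cross(a2, a3))  # a1 . (a2 x a3)
    return 2.0 * np.pi * np.array(
        [np.cross(a2, a3), np.cross(a3, a1), np.cross(a1, a2)]
    ) / volume

def to_cartesian_k(direct_coords, recip):
    """Transform a direct k-coordinate vector via the reciprocal lattice."""
    return np.asarray(direct_coords) @ recip

def k_distance(v_i, v_j):
    """Distance between two k-coordinate vectors, used as the E-k x-value."""
    return float(np.linalg.norm(np.asarray(v_i) - np.asarray(v_j)))
```

For a simple cubic cell of side a, the reciprocal lattice reduces to (2π/a) times the identity, which is a convenient sanity check.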
Fitting a parabola in the quadratic form y = ax² + bx + c yields the coefficient a, which can then be used by (23) to obtain the final m* value.
Should more than one fit per k-space segment be possible, the resultant values are averaged to yield the final effective mass value. Currently, only m* values calculated in the Γ-X segment via this method are persisted.
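The parabola fit and the m* = 1/(2a) conversion can be illustrated on synthetic band data (assuming atomic units throughout; this is a sketch, not the Hadoken implementation):

```python
import numpy as np

# Synthetic conduction-band minimum in atomic units (E in Ha, k in 1/bohr),
# generated here from a known dispersion E = k^2 / (2 m*).
m_star_true = 0.5
k = np.linspace(-0.1, 0.1, 21)
E = k ** 2 / (2.0 * m_star_true)

# Fit y = a*x^2 + b*x + c to the band minimum, as described above.
a, b, c = np.polyfit(k, E, 2)

# In atomic units (hbar = m_e = 1), E = k^2/(2 m*) implies a = 1/(2 m*),
# hence m* = 1/(2a).
m_star = 1.0 / (2.0 * a)
```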
The following Table 4 displays the entire feature set, including sourced and derived values, corresponding example values and units.

Deep learning model training process
All models were trained using the same process:
1 Features in the entire 0.477 GB dataset were normalised.
2 Data was split into two streams, training and validation, at a ratio of 0.7/0.3.
3 An artificial neural network based on a sequential DL model from the Keras framework on a TensorFlow back end with dense layers (100, 50) was used and trained over 300 iterations.
4 Verification that over-fitting was not occurring. Over-fitting is characterised by an increase in loss, which would be reflected in the training history. Even after 300 iterations, the loss recorded continues to converge, indicating that the algorithm is still learning (i.e., not over-fitting). Had the training process resulted in an increase in loss, we could be sure over-fitting was occurring.
5 The neural network was optimised by training it with all the data over 1000 iterations. Table 5 demonstrates that the optimisation process may yield extra accuracy when training a model for production deployment.
6 Information about the neural network was serialised for production use (layers, weights, biases, activation functions, etc.).
7 Optimised models are persisted for future use via the https://www.hadokenmaterials.io/ website and associated API.
The models described in this document have been made available for use at https://www.hadokenmaterials.io/ with the API documentation available at https://www.hadokenmaterials.io/Home/Api. These models are also made available via GitHub at https://github.com/carlyman77/MaterialsDiscoveryML.
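Steps 1 and 2 of the process above can be sketched as follows (a simplified illustration; the Keras network of step 3 is indicated only in comments, as a hypothetical layer definition):

```python
import numpy as np

rng = np.random.default_rng(0)

def min_max_normalise(x):
    """Step 1: scale each feature column into [0, 1]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
    return (x - lo) / span

def train_validation_split(x, y, train_fraction=0.7):
    """Step 2: shuffle rows and split into training/validation streams."""
    order = rng.permutation(len(x))
    cut = int(len(x) * train_fraction)
    return x[order[:cut]], y[order[:cut]], x[order[cut:]], y[order[cut:]]

# Step 3 would then build and fit the Keras network, e.g. (sketch only):
#   model = keras.Sequential([layers.Dense(100, activation="relu"),
#                             layers.Dense(50, activation="relu"),
#                             layers.Dense(1)])
#   model.fit(x_train, y_train, epochs=300, validation_data=(x_val, y_val))
```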

Determination of model accuracy
We include three different accuracy metrics used to evaluate the predictive models, and a single metric for the classification model. All metrics should be considered when evaluating the accuracy of a model, as each method has advantages in certain applications. For example, if the average errors are evenly distributed then both Mean Absolute Error and Root Mean Squared Error outputs should converge. However, Root Mean Squared Error will penalise large outlier errors as the errors are squared before an average is taken.

Mean absolute error (MAE)
This value is derived from the mean_absolute_error [19] function, which produces a risk metric corresponding to the expected value of the absolute error loss or l1-norm loss. Given ŷ_i to be the predicted value of the i-th sample, and y_i to be the corresponding true value, the MAE estimated over n samples is defined such that:

MAE(y, ŷ) = (1/n) Σ_{i=1..n} |y_i − ŷ_i|

Root mean squared error (RMSE)
This value is derived by taking the square root of the Mean Squared Error (MSE, quadratic or L2 loss) value generated by the mean_squared_error [20] function.
Given ŷ_i to be the predicted value of the i-th sample, and y_i to be the corresponding true value, the MSE estimated over n samples is defined such that:

MSE(y, ŷ) = (1/n) Σ_{i=1..n} (y_i − ŷ_i)²

Therefore:

RMSE(y, ŷ) = √MSE(y, ŷ)

R 2
This value is derived from the r2_score [21] function, which is a representation of the proportion of explained variance. A perfect score is 1.0, which indicates that the independent variables fully explain the variation in the dependent variable.
Given ŷ_i to be the predicted value of the i-th sample, and y_i to be the corresponding true value for a total of n samples, the estimated R² is defined such that:

R²(y, ŷ) = 1 − Σ_{i=1..n} (y_i − ŷ_i)² / Σ_{i=1..n} (y_i − ȳ)², where ȳ = (1/n) Σ_{i=1..n} y_i
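The three metrics can be implemented directly from their definitions; scikit-learn's mean_absolute_error, mean_squared_error and r2_score compute the same quantities. A plain-Python sketch:

```python
from math import sqrt

def mae(y, y_hat):
    """Mean absolute error: mean of |y_i - yhat_i|."""
    return sum(abs(yi - pi) for yi, pi in zip(y, y_hat)) / len(y)

def mse(y, y_hat):
    """Mean squared error: mean of (y_i - yhat_i)^2."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error: square root of the MSE."""
    return sqrt(mse(y, y_hat))

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot
```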

Overview
Models are produced by the ML training process, and contain the refined weights, biases and activation functions required to operate independently of the original dataset. Models are software assets that can be used to perform complex algorithmic tasks such as prediction or classification.

Band gap
Band gap E g is an energy range between the uppermost valence band (valence band maximum) and the lowest conduction band (conduction band minimum) of a crystal. Electrons in the valence bands can transition into the conduction bands upon excitation. The size of the band gap is a critical feature that determines many of the material's possible applications. Photovoltaic (PV) materials are semiconductors, and so it follows that E g is a key metric when considering a material's suitability for PV applications.

Deep learning to predict band gap (Single feature)
This model attempts to predict E g from stoichiometry only, using a single feature, stoichiometry S (1), such that E g (S) describes the predicted result computed by the model M from S (1). Figure 2 displays the predicted E g values generated by the model alongside the original E g values. A clear linear trend is evident. Figure 3 displays errors in 0.1 eV buckets. The majority of predicted results appear in the first negative bucket, indicating that for most predictions, the resultant value is no more than 0.1 eV different from the original value. Figure 4 displays the loss values generated during the model training process. Table 6 details the overall predictive accuracy metrics for the model.

Deep learning to predict band gap (minimal features)
This model attempts to predict E g from the fewest features considered logical that are also easily sourced (32). As with the other models, the majority of predicted results appear in the first negative bucket, indicating that for most predictions, the resultant value is no more than 0.1 eV different from the original value. Figure 7 displays the loss values generated during the model training process. Table 7 details the overall predictive accuracy metrics for the model.

Deep learning to predict band gap (maximal features)
This model attempts to predict E g from the maximum number of features available from the collated dataset, such that E g (F ALL ) describes the predicted result computed by M from F ALL , where F ALL describes all features in the dataset. Figure 9 displays errors in 0.1 eV buckets. As with the previous model, the majority of predicted results appear in the first negative bucket, indicating that for most predictions, the resultant value is no more than 0.1 eV different from the original value. Figure 10 displays the loss values generated during the model training process. Table 8 details the overall predictive accuracy metrics for the model. Table 9 summarises the predictive accuracy metrics for each model. All 3 models are extremely accurate, and of note is the diminishing returns realised by the addition of many extra features: the model using a single feature is almost as accurate as the model that uses 20 features.

Fermi energy
Fermi energy is also an attribute useful for the design and discovery of materials; however, some online data sources do not store this value. We provide a model for the prediction of this property.

Deep learning to predict fermi energy
This model attempts to predict E F from the fewest features. This model uses 2 main features: stoichiometry (one-hot encoded) S (1), and geometry G (3). Figure 11 displays the predicted E F values generated by the model alongside the original E F values. A clear linear trend is evident, with most data points clustered on or around this trend. Figure 12 displays errors in 0.1 eV buckets. This model is accurate to within 0.5 eV for the majority of predicted values. Figure 13 displays the loss values generated during the model training process.

Gap type
Gap type is an important attribute used to classify the type of band gap present in a material. Typically the gap type relates directly to the usefulness of a material for a specific application. For example, metals have no band gap and as such make excellent conductors, whilst semiconductors may have a direct or indirect band gap (an indirect band gap is characterised by phonon-assisted transitions). Insulators typically have a very large band gap.

Deep learning to classify gap type
This model uses 2 main features, stoichiometry (one-hot encoded) S (1) and space group SG (4), that are encoded (or decomposed) into values of varying size. Figure 14 displays the accuracy of gap type predictions per gap type. This model is most useful at predicting whether a gap type is a direct insulator or a metal. Figure 15 displays the loss values generated during the model training process. Table 11 details the overall predictive accuracy metrics for the model. Table 12 details the metrics for each class of the model.

Production deployment of machine learning models
In addition to the development of the preceding models, we have developed a lightweight and efficient method for deploying models to a production environment. Multiple files are produced by Keras when persisting a model, namely the weights and structure of the network. The weights are stored in the HDF5 [22] format and the model structure in a JSON format, neither of which is suitable, for a number of reasons: JSON offers no schema support, mature query language, comments, or meta-data. JSON is also a terse format designed to be used when the contract is pre-agreed upon, and therefore does not make a good candidate to support rich, searchable data models. The HDF5 format is not human readable and is not easily parsed. Unifying these two files in a more appropriate format is a welcome improvement. In addition, no information is saved with the model about how it is intended to be used. For example, inputs are not labelled, and no normalising parameters are included, which renders the model neither portable nor usable for production consumption. To address this, we have developed a simple, portable XML format that is searchable and can be validated against a schema. Only a single file is required to instantiate a usable model in a production environment that is guaranteed to produce reliable results from minimal code.

Artificial neural network function
Provisioning of ML models from the XML definition is provided via the Hadoken.ML.NeuralNetwork type located in the Hadoken.ML assembly. This custom-built Artificial Neural Network (ANN) functions as a series of completely connected layers using the following method:
1 Inputs are multiplied by weights and forwarded to the nodes in each layer.
2 Each node introduces a bias and another weight and sends the value to the next layer via the activation function.
Figure 16 displays the map of a typical neural network.
Inputs are fully connected with the first hidden layer, which is in turn fully connected to each following layer. This process is completed for each hidden layer, with results forwarded to the output layer. Figure 17 displays the map of a neural network node. Inputs are multiplied by a weight and then added to a bias value. The sum of these operations is forwarded to an activation function which determines the final output value.   Table 13 details the activation functions provided by the software.
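The forward pass described above can be sketched in plain Python (a minimal illustration of the fully connected scheme; the Hadoken.ML implementation itself is a .NET assembly):

```python
def relu(x):
    """A common activation function; Table 13 lists those the software provides."""
    return max(0.0, x)

def dense_forward(inputs, weights, biases, activation):
    """One fully connected layer: each node sums its weighted inputs,
    adds its bias, and passes the result through the activation function."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def network_forward(inputs, layers):
    """Propagate inputs through a list of (weights, biases, activation)
    layers, forwarding each layer's outputs to the next."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense_forward(values, weights, biases, activation)
    return values
```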

Software platform
Architecture
A bespoke software platform (codename: Hadoken) was created for the express purpose of aggregating materials data from disparate representational state transfer (REST) APIs such as Materials Project [7] and AFLOW [6]. Data from these sources is collected via an aggregator and stored in a relational database. Additional supporting files that may be of use (such as associated VASP [23] files) are also downloaded and stored for later use. Useful attributes such as Fermi energy E F that are not present in REST API data are sourced from the VASP files and added to the database. Curated data is then used for the purposes of training ML models for predictive tasks.

Technology stack
The technology stack mirrors current popular industry standards for rapid application development (RAD), and is based on Microsoft's .NET Core Framework [24] and Microsoft SQL Server 2017 [25]. ML technologies include Python 3.5 [26] and TensorFlow [27], as well as Azure ML Studio [28].

Data collection
Data are initially sourced from two streams, on-line and off-line. On-line data sources are actively maintained network resources which release edits in real (or near to real) time. These often take the form of REST web API offerings including Materials Project and AFLOW. Some of these web services include information gathered from other sources, such as the Inorganic Crystal Structure Database (ICSD) [29]. These RESTful web services provide an industry standard method for querying and retrieving data. Data is provided in JSON format, which is then parsed into a common object model and stored locally. On-line data sources are much easier to work with than off-line, as they provide instant access to data stores that are proactively curated.
Off-line data sources may not be actively maintained, or may only release edits periodically (such as with a new publication), and typically include information contained within texts, files, or databases which may have been produced by a lab during the research process. It is most likely that each off-line source differs in its storage format or layout, especially in the case of textual publications, and thus must have a bespoke parser written for it. This process is very time consuming, and so these sources are currently avoided.

Data curation and post processing
Data is currently collected on an ad-hoc basis; however, when a new model is to be trained, a snapshot of the database is taken so that data collection may continue uninterrupted. These snapshots are completely disconnected from the original data source; thus, any updates to the database are not reflected in any dataset used by the model training process.
Post processing is the first step in data curation, and involves processing values and schema structure to assist with preparing data for curation. As an example, the AFLOW schema is mostly flat, however the Materials Project schema is nested. The Hadoken software prepares nested data by moving it to a simple normalised schema ready for the curation and matching process.
Data curation is a process that involves the careful selection and combination of data sources. Data sources may have differing, non-identical schemas applied to them, which will affect the storage and representation of underlying data. During the process, attributes from disparate data sources are matched where possible. Decisions must also be made about the treatment of nullable attributes. For example, it may be possible to replace null values with a default initialisation value, such as 0 for a null integer or an empty string for a null string. These decisions are realised in code, and applied in the software and underlying database schema.
All data collected must be curated, and this process involves dividing the data into two streams: high-quality and low-quality. Data must attain a completeness factor of 100% in order to be useful, so efforts are made to achieve this.
The completeness factor F_C is the ratio of features that contain non-null values F_NN to the total number of features F_Tot in the dataset:

F_C = F_NN / F_Tot

where F_NN defines the number of features with non-null values and F_Tot defines the total number of features present.
Data is considered high-quality if its F C > 0.9 , with the additional constraint that any missing attributes can be retro-fitted by reading them from associated files or calculating them directly.
Data is considered low-quality if its F C ≤ 0.9 . Records that contain missing attributes cannot be used by model training as they may mislead the model. Low-quality data is stored, but shelved for use later, as it may be possible to reconstruct missing attributes via ML, or, the data may be updated when matched with a future high-quality data source.
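The completeness factor and the resulting quality gate can be sketched as follows (the field names in the example record are hypothetical):

```python
def completeness_factor(record):
    """F_C = F_NN / F_Tot: ratio of non-null feature values to total features."""
    non_null = sum(1 for value in record.values() if value is not None)
    return non_null / len(record)

def is_high_quality(record, threshold=0.9):
    """High-quality data has F_C > 0.9; missing attributes must then be
    retro-fittable from associated files or direct calculation."""
    return completeness_factor(record) > threshold
```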

API access
We present a lightweight REST API for accessing the machine learning models built from this curated data. The API is built on current industry standards supporting both JSON and XML data exchange formats. The full API definition is located on the Hadoken Materials website here: https://hadokenmaterials.io/Home/Api. Registration is required to use the API (https://hadokenmaterials.io/Account/SignUp) and is fast (and free); however, registration is not required to use the web UI interfaces provided for each model.
Whilst use of the website is free, any use of the website or API for research purposes, commercial or otherwise, is governed by terms defined in the citing document available on the website. More information is available at https://hadokenmaterials.io/Home/Citing. Upon completion of registration, an API key in the form of a 128-bit GUID is allocated and API access is granted to the entire platform. This API key must be presented with each request.
By way of example, a typical API request for a band gap prediction for the compound Ca 2 Cu 2 Ge 4 O 12 follows:
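Since the exact request contract is documented on the website, the sketch below only illustrates the general shape of such a call; the endpoint path is taken from the API reference below, while the parameter names (apiKey, compound) are assumptions, not the documented contract:

```python
from urllib.parse import urlencode

# Endpoint path as listed in the API reference; the query parameter names
# below are hypothetical placeholders -- consult
# https://www.hadokenmaterials.io/Home/Api for the actual request shape.
ENDPOINT = ("https://www.hadokenmaterials.io/MachineLearning/"
            "BandGapSpaceGroupHighSymmetryDerived")

def build_request_url(api_key, compound):
    """Compose a GET-style request URL carrying the mandatory API key."""
    return ENDPOINT + "?" + urlencode({"apiKey": api_key, "compound": compound})

url = build_request_url("00000000-0000-0000-0000-000000000000", "Ca2Cu2Ge4O12")
```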

API reference
Currently, the API supports a single version: 1. In the future, different versions will become available; to use those versions replace the current version number.

Band gap-space group, derived
Computes E g from stoichiometry and values derived from stoichiometry. Note that for this model, only the stoichiometry is required for operation. https://www.hadokenmaterials.io/MachineLearning/BandGapSpaceGroupHighSymmetryDerived

Conclusions
In this paper we show that it is possible to develop a number of highly accurate ML models to inexpensively predict the properties of materials using information previously generated from computationally expensive simulations. The ML models demonstrate that a stoichiometry definition alone is a high value feature, containing (in most cases) all the information required to accurately compute the band gap associated with that material. Initial experimentation has demonstrated that the addition of other features (such as density or total atomic weight) yields little, if any, additional accuracy. This suggests that DFT computation may not be required to perform this type of calculation.
The prospect of fast, efficient, DFT-free computation of materials properties using only consumer hardware is tantalising, and implies that further investigation into the properties implied by stoichiometry as related to E g is warranted. This development could in turn greatly reduce the amount of time spent on simulations and managing simulation software, as well as budgets spent on supercomputing.
This project also lays the foundation for expansion to the prediction of other material properties in the future using a similar process. The development of an industry standard platform for the production deployment of said models should facilitate the exhaustive profiling of compounds, enabling the wider research community to develop novel materials.