Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models

Table 1 Comparison between different current models that predict water solubility

Developer	Data Preparation Method	Total Size	ML Method	R² Test Value⁷	MAE⁸	RMSE⁹	SEP¹⁰	Refs
Huuskonen	Descriptor-Based	1297	MLR¹	0.88	–	0.71	–	[10]
Huuskonen	Descriptor-Based	1297	ANN³	0.92	–	0.60	–	[10]
Yan	Descriptor-Based	1293	MLR	0.82	0.68	0.79	–	[11]
Yan	Descriptor-Based	1293	ANN	0.96	0.49	0.59	–	[11]
Delaney	Descriptor-Based	2874	MLR	0.71	0.68	0.87	–	[12]
Hou	Group Contribution	1294	MLR	0.9	0.52	0.63	–	[2]
Ali	Descriptor-Based	1290	MLR	0.73	0.72	0.94	–	[13]
Sorkun	Descriptor-Based	1290	Ensemble of ANN, RF², and XGB⁴	0.93	0.397	0.53	–	[14]
Le	Descriptor-Based	4376	MLR	0.89	–	–	0.75	[15]
Le	Descriptor-Based	4376	MLREM⁵	0.88	–	–	0.76	[15]
Le	Descriptor-Based	4376	BRANNLP⁶	0.90	–	–	0.66	[15]

Total size is the number of compounds (data points) in the dataset used to train each model.
¹MLR: multiple linear regression; ²RF: random forest; ³ANN: artificial neural network; ⁴XGB: XGBoost (extreme gradient boosting, a gradient-boosted trees method); ⁵MLREM: multiple linear regression with expectation maximization; ⁶BRANNLP: Bayesian regularized artificial neural network with a Laplacian prior; ⁷R²: coefficient of determination; ⁸MAE: mean absolute error; ⁹RMSE: root-mean-square error; ¹⁰SEP: standard error of prediction
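
As a rough illustration of how the error metrics in Table 1 relate to one another, the following minimal sketch computes R², MAE, and RMSE from observed and predicted values using their standard textbook definitions. The function name `regression_metrics` and the sample values are hypothetical and are not taken from any of the cited studies. Some studies report R² as the squared Pearson correlation rather than the coefficient of determination used here, so this may not match how every value in the table was calculated. SEP is omitted because its definition varies between papers.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE and RMSE for paired observed/predicted values.

    Assumes the textbook definitions; individual studies in Table 1
    may have used slightly different conventions.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # Residual sum of squares, shared by RMSE and R^2
    ss_res = sum(r * r for r in residuals)

    # MAE: mean of the absolute residuals
    mae = sum(abs(r) for r in residuals) / n

    # RMSE: square root of the mean squared residual
    rmse = math.sqrt(ss_res / n)

    # R^2: 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "rmse": rmse}


# Hypothetical log-solubility values, for illustration only
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE on the same predictions, which is consistent with the rows of Table 1 that report both.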

ISSN: 1758-2946