From: ChemSAR: an online pipelining platform for molecular SAR modeling
Module name | Input | Output | Description |
---|---|---|---|
Feature calculation | *.smi; *.sdf | *.csv | The molfile version in the SDF must be V2000; both 2D and 3D structures are valid; the first row of the *.csv file contains the descriptor names; the first column contains the SMILES of the molecules |
Model selection | *.csv | Data table in the page | The first column of the X_train file may contain molecular identifiers such as names or IDs; the first row of the X_train file must contain the descriptor names; the first column of the y_train file must match that of the X_train file; the second column must contain the experimental values of the samples (class labels, however they are written, must be converted to 0 or 1) |
Model building | *.csv | Data table in the page; *.png | Same as “Model selection” |
Prediction | *.csv | Data table in the page; *.csv | The requirements for the X_test file are the same as for the X_train file |
Validation of molecules | *.sdf | *.csv | The molfile version in the SDF must be V2000 |
Standardization of molecules | *.sdf | *.sdf | Same as “Validation of molecules” |
Custom preprocessing | *.sdf | *.sdf | Same as “Validation of molecules” |
Imputation of missing values | *.csv | *.csv | The first row of the input file must be a header such as descriptor names; every column, including the first, must contain feature values such as descriptor values |
Removing low variance features | *.csv | *.csv | Same as “Imputation of missing values” |
Removing high correlation features | *.csv | *.csv | Same as “Imputation of missing values” |
Univariate feature selection | *.csv | *.csv | The first row of the input file must be a header such as descriptor names; every column from the first to the penultimate must contain feature values such as descriptor values; the last column must contain the experimental values of the samples (class labels, however they are written, must be converted to 0 or 1) |
Tree-based feature selection | *.csv | Data table in the page; *.csv | Same as “Univariate feature selection” |
RFE feature selection | *.csv | Data table in the page; *.csv; *.png | Same as “Univariate feature selection” |
Statistical analysis | *.csv | Data table in the page; *.png | The input file must have four columns, in this order: molecular identifier, predicted label, predicted probability, experimental value; the label name can be defined by the user |
Random training set split | *.csv | *.csv | Same as “Model selection” |
Diverse training set split | *.sdf | *.sdf | The molfile version in the SDF must be V2000; both 2D and 3D structures are valid |
Feature importance | *.csv | *.csv | Same as “Univariate feature selection” |
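
The “Model selection” requirements above can be checked locally before uploading files. The following is a minimal sketch, not part of ChemSAR itself: the function names `read_rows` and `check_training_tables` are hypothetical, and it assumes a classification task in which the y_train file also has a header row.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of non-empty rows (lists of strings)."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def check_training_tables(x_text, y_text):
    """Return a list of problems; an empty list means the pair looks valid."""
    problems = []
    x_rows, y_rows = read_rows(x_text), read_rows(y_text)
    if len(x_rows) < 2:
        return ["X_train needs a header row and at least one sample"]

    # Assumption: y_train carries a header row, like X_train.
    x_ids = [row[0] for row in x_rows[1:]]
    y_ids = [row[0] for row in y_rows[1:]]
    if x_ids != y_ids:
        problems.append("first column of y_train does not match X_train")

    # Every column after the identifier must hold a numeric descriptor value.
    for row in x_rows[1:]:
        for value in row[1:]:
            try:
                float(value)
            except ValueError:
                problems.append(
                    f"non-numeric descriptor value {value!r} for {row[0]}"
                )

    # Assumption: classification, so labels must already be 0 or 1.
    for row in y_rows[1:]:
        if len(row) < 2 or row[1] not in {"0", "1"}:
            problems.append(f"label for {row[0]} is not 0 or 1")
    return problems
```

Calling `check_training_tables(open("X_train.csv").read(), open("y_train.csv").read())` (file names are placeholders) returns an empty list when both files satisfy these checks. Empty cells are reported as non-numeric, which is a hint to run the missing-value imputation step first.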