Table 1 The valid file formats and requirements for each module

From: ChemSAR: an online pipelining platform for molecular SAR modeling

| Module name | Input | Output | Description |
| --- | --- | --- | --- |
| Feature calculation | `*.smi`; `*.sdf` | `*.csv` | The molfile version in the SDF must be V2000; both 2D and 3D information are valid; the first row of the `*.csv` file contains the descriptor names; the first column contains the SMILES of the molecules |
| Model selection | `*.csv` | Data table in the page | The first column of the X_train file can contain molecular identifiers such as molecule names or IDs; the first row of the X_train file must contain the descriptor names; the first column of the y_train file must be the same as that of the X_train file; the second column must contain the experimental values of the samples (class labels written in any other style must be converted to 0 or 1) |
| Model building | `*.csv` | Data table in the page; `*.png` | Same as “Model selection” |
| Prediction | `*.csv` | Data table in the page; `*.csv` | The requirements for the X_test file are the same as for the X_train file |
| Validation of molecules | `*.sdf` | `*.csv` | The molfile version in the SDF must be V2000 |
| Standardization of molecules | `*.sdf` | `*.sdf` | Same as “Validation of molecules” |
| Custom preprocessing | `*.sdf` | `*.sdf` | Same as “Validation of molecules” |
| Imputation of missing values | `*.csv` | `*.csv` | The first row of the input file must be a header, such as descriptor names; every column, including the first, must contain feature values, such as descriptor values |
| Removing low variance features | `*.csv` | `*.csv` | Same as “Imputation of missing values” |
| Removing high correlation features | `*.csv` | `*.csv` | Same as “Imputation of missing values” |
| Univariate feature selection | `*.csv` | `*.csv` | The first row of the input file must be a header, such as descriptor names; every column from the first to the penultimate must contain feature values, such as descriptor values; the last column must contain the experimental values of the samples (class labels written in any other style must be converted to 0 or 1) |
| Tree-based feature selection | `*.csv` | Data table in the page; `*.csv` | Same as “Univariate feature selection” |
| RFE feature selection | `*.csv` | Data table in the page; `*.csv`; `*.png` | Same as “Univariate feature selection” |
| Statistical analysis | `*.csv` | Data table in the page; `*.png` | The four columns of the input file must be, in order: molecular identifier, predicted label, predicted probability, experimental value; the label name can be defined by the user |
| Random training set split | `*.csv` | `*.csv` | Same as “Model selection” |
| Diverse training set split | `*.sdf` | `*.sdf` | The molfile version in the SDF must be V2000, the de facto standard; both 2D and 3D information are valid |
| Feature importance | `*.csv` | `*.csv` | Same as “Univariate feature selection” |
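
The layout rules in the table can be checked locally before uploading files. The following is a minimal sketch, not part of ChemSAR: the function names `check_training_pair` and `non_v2000_records` are hypothetical, the y_train file is assumed to have a header row like the X_train file, and the task is assumed to be classification with labels coded as 0 or 1. The SDF check relies on the version tag at the end of the counts line, which is the fourth line of each molfile record.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of non-empty rows."""
    return [row for row in csv.reader(io.StringIO(csv_text)) if row]


def check_training_pair(x_text, y_text):
    """Return a list of layout problems; an empty list means none were found."""
    x_rows, y_rows = read_rows(x_text), read_rows(y_text)
    if len(x_rows) < 2:
        return ["X_train needs a header row and at least one sample"]
    problems = []
    # The first column (identifiers) must be identical in both files.
    # Assumption: y_train also has a header row, so samples start at row 2.
    x_ids = [row[0] for row in x_rows[1:]]
    y_ids = [row[0] for row in y_rows[1:]]
    if x_ids != y_ids:
        problems.append("first column of y_train differs from X_train")
    # The second column of y_train must hold class labels coded as 0 or 1.
    for row in y_rows[1:]:
        if len(row) < 2 or row[1].strip() not in {"0", "1"}:
            problems.append(f"label for {row[0]!r} is not 0 or 1")
    return problems


def is_v2000(record_lines):
    """True if the counts line (4th line of the record) ends with V2000."""
    return len(record_lines) >= 4 and record_lines[3].rstrip().endswith("V2000")


def non_v2000_records(sdf_text):
    """Return 1-based indices of SDF records that are not tagged V2000."""
    bad, record, index = [], [], 1
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":  # record delimiter in SDF
            if not is_v2000(record):
                bad.append(index)
            record, index = [], index + 1
        else:
            record.append(line)
    # Also check a trailing record that lacks the closing delimiter.
    if any(line.strip() for line in record) and not is_v2000(record):
        bad.append(index)
    return bad
```

Both functions take file contents as text, so a file would first be read with `open(path).read()`. An empty result from either function means no problem was detected under the assumptions above; it does not guarantee that the server accepts the file.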