
Table 1 The valid file formats and requirements for each module

From: ChemSAR: an online pipelining platform for molecular SAR modeling

Module name




Feature calculation

*.smi; *.sdf


The molfile version in the SDF must be V2000; both 2D and 3D information are valid; the first row of the *.csv file contains the descriptor names; the first column contains the SMILES of the molecules

Model selection


Data table in the page

The first column of the X_train file can hold molecular identifiers such as molecular names or IDs; the first row of the X_train file must contain the descriptor names; the first column of the y_train file must be the same as that of the X_train file; the second column must contain the experimental values of the samples (class labels in any other notation must be converted to 0 or 1)

Model building


Data table in the page; *.png

The same as “Model selection”



Data table in the page; *.csv

The requirements for the X_test file are the same as for the X_train file

Validation of molecules



The molfile version in the SDF must be V2000

Standardization of molecules



The same as “Validation of molecules”

Custom preprocessing



The same as “Validation of molecules”

Imputation of missing values



The first row of the input file must be a header, such as descriptor names; every column, including the first, must contain feature values, such as descriptor values

Removing low variance features



The same as “Imputation of missing values”

Removing high correlation features



The same as “Imputation of missing values”

Univariate feature selection



The first row of the input file must be a header, such as descriptor names; every column from the first to the penultimate must contain feature values, such as descriptor values; the last column must contain the experimental values of the samples (class labels in any other notation must be converted to 0 or 1)

Tree-based feature selection


Data table in the page; *.csv

The same as “Univariate feature selection”

RFE feature selection


Data table in the page; *.csv; *.png

The same as “Univariate feature selection”

Statistical analysis


Data table in the page; *.png

The four columns of the input file must be, in order: molecular identifier, predicted label, predicted probability, experimental value; the label name can be defined by the user

Random training set split



The same as “Model selection”

Diverse training set split



The molfile version in the SDF must be V2000 (the de facto standard); both 2D and 3D information are valid

Feature importance



The same as “Univariate feature selection”
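
The X_train and y_train requirements listed for “Model selection” can be checked locally before upload. The following is a minimal sketch and not part of ChemSAR: it assumes comma-separated files with exactly one header row each and a binary classification task, and the helper names read_rows, encode_labels and check_train_files, as well as the sample molecules and descriptor values, are hypothetical.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of non-empty rows."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def encode_labels(labels, positive):
    """Map class labels to 1 (equal to `positive`) or 0 (anything else)."""
    return [1 if label == positive else 0 for label in labels]


def check_train_files(x_text, y_text):
    """Return a list of layout problems in an X_train / y_train pair.

    Assumption: both files are comma-separated and have one header row.
    """
    x_rows, y_rows = read_rows(x_text), read_rows(y_text)
    if len(x_rows) < 2 or len(y_rows) < 2:
        return ["each file needs a header row and at least one sample"]

    problems = []

    # The first column of y_train must match the first column of X_train.
    x_ids = [row[0] for row in x_rows[1:]]
    y_ids = [row[0] for row in y_rows[1:]]
    if x_ids != y_ids:
        problems.append("identifier columns differ")

    # The second column of y_train holds the experimental values, which
    # must already be written as 0 or 1.
    for row in y_rows[1:]:
        if len(row) < 2 or row[1].strip() not in {"0", "1"}:
            problems.append(f"non-binary value for {row[0]}")
    return problems


# Made-up example data: two molecules with two descriptors each.
x_train = "name,MW,LogP\nmol1,180.2,1.2\nmol2,151.2,0.5\n"
names = ["mol1", "mol2"]
values = encode_labels(["active", "inactive"], positive="active")
y_train = "name,label\n" + "".join(f"{n},{v}\n" for n, v in zip(names, values))

print(check_train_files(x_train, y_train))
```

An empty list means no layout problem was found. The check returns messages instead of raising, so every problem in a file can be reported in one pass.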