Table 2 Properties of the selected protein descriptor sets and representations used in our benchmarks

From: How to approach machine learning-based prediction of drug/compound–target interactions

| Name | Approach | Description | Dimension |
|---|---|---|---|
| apaac | Model-driven (physico-chemistry) | Amino acid composition regarding the sequence order correlated factors computed from hydrophobicity and hydrophilicity indices of a.a. [a] | 80 |
| ctdd | Model-driven (physico-chemistry) | Chain length-based distribution of a.a. for selected physicochemical properties | 195 |
| ctriad | Model-driven (physico-chemistry) | Triad frequency of residues classified on dipoles and volumes of a.a. side chains | 343 |
| dde | Model-driven (sequence comp. [b]) | Dipeptide composition deviation | 400 |
| geary | Model-driven (physico-chemistry) | Autocorrelation regarding the distribution of physicochemical properties of a.a. | 240 |
| k-sep_pssm | Model-driven (sequence homology) | Column transformation-based position specific scoring matrix (pssm) profiles | 400 |
| pfam | Model-driven (functional properties) | Protein domain profiles | 38–294 [c] |
| qso | Model-driven (physico-chemistry) | Sequence order effect based on physicochemical distances between coupled residues | 100 |
| spmap | Model-driven (sequence comp.) | Subsequence-based feature map | 544 |
| taap | Model-driven (physico-chemistry) | Summation of corresponding residue values for selected physicochemical properties | 10 |
| random 200 | | Randomly generated continuous numbers between 0 and 1 with uniform distribution | 200 |
| protvec | Data-driven (learned embedding) | Sequence embedding utilizing skip-gram modelling approach | 100 |
| seqvec | Data-driven (learned embedding) | Sequence embedding based on bi-directional language model architecture “ELMo” | 1024 |
| transformer | Data-driven (learned embedding) | Transformer-architecture based embedding method that utilizes attention mechanism | 768 |
| unirep | Data-driven (learned embedding) | Sequence embedding based on mLSTM architecture as a variation of recurrent neural networks | 1900 and 5700 |

  [a] Amino acids
  [b] Composition
  [c] Size varies depending on the dataset, since pfam vectors only include the domains present in the given protein dataset
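
To make two of the dimensions in the table concrete, the following is a minimal sketch, not the implementation used in the benchmarks. It assumes the commonly cited seven-class grouping of amino acids by side-chain dipole and volume for ctriad, which yields 7 × 7 × 7 = 343 triad features, and it builds a 200-dimensional uniform random vector in the spirit of the random 200 baseline. The function names `ctriad_vector` and `random_vector`, the class grouping, and the normalisation by total triad count are illustrative assumptions; published toolkits may normalise differently.

```python
import random
from itertools import product

# Assumed 7-class grouping of residues by side-chain dipole and volume.
# This grouping is not given in the table; it is an illustrative assumption.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}

# All ordered triads of class labels: 7 ** 3 = 343 features.
TRIADS = list(product(range(len(CLASSES)), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def ctriad_vector(seq):
    """Return triad frequencies (length 343) for a protein sequence.

    Hypothetical helper: frequencies are raw counts divided by the number
    of valid triads. Triads containing non-standard residues are skipped.
    """
    classes = [AA_TO_CLASS.get(aa) for aa in seq.upper()]
    counts = [0] * len(TRIADS)
    total = 0
    for i in range(len(classes) - 2):
        triad = tuple(classes[i:i + 3])
        if None in triad:
            continue
        counts[TRIAD_INDEX[triad]] += 1
        total += 1
    if total == 0:
        return [0.0] * len(TRIADS)
    return [c / total for c in counts]


def random_vector(dim=200, seed=None):
    """Return `dim` uniform random numbers in [0, 1) as a baseline vector."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(dim)]


if __name__ == "__main__":
    features = ctriad_vector("MKTAYIAKQRQISFVKSHFSRQ")
    baseline = random_vector(200, seed=0)
    print(len(features), len(baseline))
```

A random vector of this kind carries no information about the protein, so it can serve as a reference point when judging whether a descriptor set contributes real signal.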