Table 1 POSEIDON features for the ML summary table

From: POSEIDON: Peptidic Objects SEquence-based Interaction with cellular DOmaiNs: a new database and predictor

| Total | Sample object | Amount | Description |
|---|---|---|---|
| 2,908 | Peptides | 31 | Whole-peptide features |
| | | 2,000 | Peptide-position one-hot encoding. Given a maximum length of 100 amino acids (the longest peptide registered in the dataset), each position was one-hot encoded over the 20 amino acids (100 × 20 = 2,000 features) |
| | | 70 | After inspecting the peptide sequences with anomalous amino acid substitutions, we annotated the position of each substitution (maximum of 24, as this was the size of the longest peptide with anomalous substitutions). This allows 56 possible substitutions along with those registered in the dataset |
| | Cell Lines | 735 | According to the GDSC, cell line gene mutation data includes 42 available cell lines |
| | | 1 | A categorical variable indicating whether the cell line present in POSEIDON is exactly that of the GDSC or a similar cell line from the same tissue |
| | Experimental | 1 | Concentration (μM) of the peptide sequence |
| | | 5 | Categorical temperature (°C). Although a numerical variable could be used, only five biologically relevant temperatures are available. For example, 37 °C is the regular human body temperature and 25 °C is a common room temperature. For these reasons, and because temperature information is missing in some cases, the temperature was categorically encoded |
| | | 2 | Incubation time and duration (min) |
| | | 63 | Annotated cargo was manually curated at several stages of dataset processing. Initially, only cargoes annotated in the original research papers were considered. Additionally, while processing the dataset, position-independent additions were considered as cargoes |
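
The peptide-position one-hot encoding described in the table can be illustrated with a minimal sketch. It assumes the totals are whole numbers (2,908 features overall, 2,000 for the positional encoding), that the 20 amino acids are the standard one-letter residues in an arbitrary alphabetical order, and that shorter peptides are padded with all-zero positions. The function name `one_hot_peptide`, the keys of `FEATURE_COUNTS`, and the example sequence are hypothetical and are not taken from the POSEIDON code base.

```python
# Assumption: the 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Longest peptide registered in the dataset, per the table.
MAX_LENGTH = 100


def one_hot_peptide(sequence, max_length=MAX_LENGTH):
    """Return a flat 0/1 list of length max_length * 20 for one peptide."""
    if len(sequence) > max_length:
        raise ValueError("sequence is longer than max_length")
    column = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vector = [0] * (max_length * len(AMINO_ACIDS))
    for position, residue in enumerate(sequence.upper()):
        # Assumption: non-standard residues are left as all-zero positions.
        if residue in column:
            vector[position * len(AMINO_ACIDS) + column[residue]] = 1
    return vector


# Feature amounts as listed in the table (key names are illustrative only).
FEATURE_COUNTS = {
    "whole_peptide": 31,
    "position_one_hot": 2000,
    "substitution_positions": 70,
    "cell_line_mutations": 735,
    "cell_line_match": 1,
    "concentration": 1,
    "temperature": 5,
    "incubation": 2,
    "cargo": 63,
}

encoded = one_hot_peptide("GRKKRRQRRRPQ")  # arbitrary 12-residue example
print(len(encoded), sum(encoded))          # 2000 12
print(sum(FEATURE_COUNTS.values()))        # 2908
```

The per-group amounts add up to the reported total of 2,908, and the 2,000 positional features follow directly from 100 positions times 20 amino acids.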