Table 9 Orthographic features extracted by NERsuite by default.

From: Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics

| Feature | Example |
| --- | --- |
| Initial letter is in uppercase | Boc-L-leucine |
| Contains only digits | 206553 |
| Contains digits | 5-HTP |
| Contains only alphanumeric characters | HClO4 |
| Contains only uppercase letters and digits | AFB1 |
| Contains only uppercase letters | NO |
| Does not contain any lowercase letters | SKF81297 |
| Contains non-initial uppercase letters | PbS |
| Contains two consecutive uppercase letters | PAHs |
| Has a Greek letter name as a substring | alpha-ketoacid |
| Contains a comma | 3,14-dibromo |
| Contains a full stop | In(0.2)Ga(0.8)As |
| Contains a hyphen | HP-β-CD |
| Contains a forward slash | (E/Z)-Goniothalamin |
| Contains an opening square bracket | [(14)C]pazopanib |
| Contains a closing square bracket | pyrido[3,2-d]pyrimidines |
| Contains an opening parenthesis | I3 (-) |
| Contains a closing parenthesis | Fe(C10 H15)2 |
| Contains a semi-colon | R = Me, Et; X = O, S; |
| Contains a percentage symbol | 85% |
| Contains an apostrophe | 5-methyl-2'-deoxycytidine |
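
The features in Table 9 are simple surface tests on a string, so they can be illustrated with a few regular expressions. The following is a minimal sketch and not NERsuite's actual implementation: the function name `orthographic_features`, the feature keys, the restriction to ASCII letters and digits, and the list of Greek letter names are all assumptions made for illustration, and each test is a literal reading of the feature description in the table.

```python
import re

# Assumed list of Greek letter names; NERsuite's own list may differ.
GREEK_NAMES = (
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho",
    "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
)


def orthographic_features(text: str) -> dict:
    """Return one boolean per Table 9 feature (hypothetical key names)."""
    lowered = text.lower()
    return {
        # Case and digit patterns (ASCII only, by assumption)
        "init_upper": bool(re.match(r"[A-Z]", text)),
        "all_digits": bool(re.fullmatch(r"[0-9]+", text)),
        "has_digit": bool(re.search(r"[0-9]", text)),
        "all_alnum": bool(re.fullmatch(r"[A-Za-z0-9]+", text)),
        "all_upper_digit": bool(re.fullmatch(r"[A-Z0-9]+", text)),
        "all_upper": bool(re.fullmatch(r"[A-Z]+", text)),
        "no_lower": re.search(r"[a-z]", text) is None,
        "inner_upper": bool(re.search(r"[A-Z]", text[1:])),
        "two_upper": bool(re.search(r"[A-Z]{2}", text)),
        # Greek letter name anywhere in the string, case-insensitive
        "greek": any(name in lowered for name in GREEK_NAMES),
        # Punctuation membership tests
        "comma": "," in text,
        "full_stop": "." in text,
        "hyphen": "-" in text,
        "slash": "/" in text,
        "open_bracket": "[" in text,
        "close_bracket": "]" in text,
        "open_paren": "(" in text,
        "close_paren": ")" in text,
        "semicolon": ";" in text,
        "percent": "%" in text,
        "apostrophe": "'" in text,
    }
```

Under these assumptions, `orthographic_features("5-HTP")` reports `has_digit` and `hyphen` as true and `all_alnum` as false, because the hyphen is not an alphanumeric character. Note that short Greek names such as "pi" or "mu" match many ordinary words when tested as substrings, which is a property of the substring test itself.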