Table 9 Orthographic features extracted by NERsuite by default.

From: Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics

Feature                                       Example
Initial letter is in uppercase                Boc-L-leucine
Contains only digits                          206553
Contains digits                               5-HTP
Contains only alphanumeric characters         HClO4
Contains only uppercase letters and digits    AFB1
Contains only uppercase letters               NO
Does not contain any lowercase letters        SKF81297
Contains non-initial uppercase letters        PbS
Contains two consecutive uppercase letters    PAHs
Has a Greek letter name as a substring        alpha-ketoacid
Contains a comma                              3,14-dibromo
Contains a full stop                          In(0.2)Ga(0.8)As
Contains a hyphen                             HP-β-CD
Contains a forward slash                      (E/Z)-Goniothalamin
Contains an opening square bracket            [(14)C]pazopanib
Contains a closing square bracket             pyrido[3,2-d]pyrimidines
Contains an opening parenthesis               I3 (-)
Contains a closing parenthesis                Fe(C10 H15)2
Contains a semi-colon                         R = Me, Et; X = O, S;
Contains a percentage symbol                  85%
Contains an apostrophe                        5-methyl-2'-deoxycytidine
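
The features in Table 9 can be approximated with simple string tests. The following Python sketch is illustrative only and is not NERsuite's implementation: the function name `orthographic_features`, the feature keys, the list of Greek letter names, and the restriction to ASCII letters and digits are all assumptions made for this example. In particular, "contains only uppercase letters and digits" is read here as "every character is an uppercase letter or a digit", which may differ from NERsuite's exact definition.

```python
import re

# Assumed list of Greek letter names; NERsuite's actual list may differ.
GREEK_NAMES = (
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho",
    "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
)


def orthographic_features(token: str) -> dict:
    """Return a dict of boolean orthographic features for one token.

    Hypothetical helper mirroring the rows of Table 9 (ASCII-only tests).
    """
    lower = token.lower()
    return {
        "init_upper": re.match(r"[A-Z]", token) is not None,
        "all_digits": re.fullmatch(r"[0-9]+", token) is not None,
        "has_digit": re.search(r"[0-9]", token) is not None,
        "all_alnum": re.fullmatch(r"[A-Za-z0-9]+", token) is not None,
        "all_upper_digit": re.fullmatch(r"[A-Z0-9]+", token) is not None,
        "all_upper": re.fullmatch(r"[A-Z]+", token) is not None,
        "no_lower": re.search(r"[a-z]", token) is None,
        # Any uppercase letter after the first character.
        "non_init_upper": re.search(r"[A-Z]", token[1:]) is not None,
        "two_upper": re.search(r"[A-Z]{2}", token) is not None,
        # Plain substring test, so short names such as "eta" match often.
        "greek_name": any(name in lower for name in GREEK_NAMES),
        "comma": "," in token,
        "full_stop": "." in token,
        "hyphen": "-" in token,
        "slash": "/" in token,
        "open_bracket": "[" in token,
        "close_bracket": "]" in token,
        "open_paren": "(" in token,
        "close_paren": ")" in token,
        "semicolon": ";" in token,
        "percent": "%" in token,
        "apostrophe": "'" in token,
    }


if __name__ == "__main__":
    # List the features that fire for one of the table's example tokens.
    feats = orthographic_features("alpha-ketoacid")
    print(sorted(name for name, on in feats.items() if on))
```

Because the features are independent boolean tests, a single token usually triggers several of them at once; a downstream sequence labeller would receive the whole set for each token.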