Table 1 Orthographical features.

From: Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization

Feature name   Regular expression pattern
FG             ^\d(,\d)*\-?\w+
INITCAPS       ^[A-Z].+
CAPWORD        ^[A-Z][a-z]+$
ALLCAPS        ^[A-Z]+$
CAPSMIX        ^[A-z]*([A-Z][a-z]|[a-z][A-Z])[A-z]*$
ALPHANUMMIX    ^[A-z0-9]*([0-9][A-z]|[A-z][0-9])[A-z0-9]*$
ALPHANUM       ^[A-z]+[0-9]+$
UPPERCHAR      ^[A-Z]$
LOWERCHAR      ^[a-z]$
SHORTNUM       ^[0-9]?$
INTEGER        ^-?[0-9]+$
REAL           ^-?[0-9]\.[0-9]+$
ROMAN          ^[IVX]+$
HASDASH        -
INITDASH       ^-
ENDDASH        -$
PUNCTUATION    ^[,.;:?!]$
QUOTE          ^[\"`']$
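
As an illustration of how the patterns in Table 1 could be turned into token-level features, here is a minimal Python sketch. The patterns are transcribed from the table as printed. The names `ORTHOGRAPHIC_PATTERNS` and `orthographic_features`, and the choice of `re.search`, are assumptions made for this example and do not come from the article. Note that in ASCII the range `[A-z]` also covers the characters `[`, `\`, `]`, `^`, `_` and the backtick, so the patterns that use it accept slightly more than letters.

```python
import re

# Patterns transcribed from Table 1 (feature name -> regular expression).
# The dictionary name is hypothetical; it is not taken from the article.
ORTHOGRAPHIC_PATTERNS = {
    "FG": r"^\d(,\d)*\-?\w+",
    "INITCAPS": r"^[A-Z].+",
    "CAPWORD": r"^[A-Z][a-z]+$",
    "ALLCAPS": r"^[A-Z]+$",
    "CAPSMIX": r"^[A-z]*([A-Z][a-z]|[a-z][A-Z])[A-z]*$",
    "ALPHANUMMIX": r"^[A-z0-9]*([0-9][A-z]|[A-z][0-9])[A-z0-9]*$",
    "ALPHANUM": r"^[A-z]+[0-9]+$",
    "UPPERCHAR": r"^[A-Z]$",
    "LOWERCHAR": r"^[a-z]$",
    "SHORTNUM": r"^[0-9]?$",
    "INTEGER": r"^-?[0-9]+$",
    "REAL": r"^-?[0-9]\.[0-9]+$",
    "ROMAN": r"^[IVX]+$",
    "HASDASH": r"-",
    "INITDASH": r"^-",
    "ENDDASH": r"-$",
    "PUNCTUATION": r"^[,.;:?!]$",
    "QUOTE": r"^[\"`']$",
}

# Compile once so repeated calls do not re-parse the patterns.
COMPILED = {name: re.compile(p) for name, p in ORTHOGRAPHIC_PATTERNS.items()}


def orthographic_features(token):
    """Return the names of all Table 1 patterns found in `token`.

    Assumption: re.search is used because HASDASH, INITDASH and ENDDASH
    are not fully anchored; the other patterns carry their own anchors.
    """
    return [name for name, rx in COMPILED.items() if rx.search(token)]


if __name__ == "__main__":
    for tok in ["2,4-dinitrophenol", "DNA", "IV", "-5", "3.14"]:
        print(tok, orthographic_features(tok))
```

A token can trigger several features at once. For example, `IV` satisfies INITCAPS, ALLCAPS and ROMAN, so a downstream tagger would typically treat each feature as an independent binary indicator.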