Table 2 Orthographic features used in our system.

From: A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature

Feature        Regular Expression
ALLCAPS        ^[A-Z]+$
INITCAP        ^[A-Z].*
HASCAP         ^.*[A-Z].*$
SINGLECAP      ^[A-Z]$
PUNCTATION     ^[,;:\'\"]$
INITDIGIT      ^[0-9].*
SINGLEDIGIT    ^[0-9]$
ALPHANUM       .*[A-Za-z].*[0-9].*|.*[0-9].*[A-Za-z].*
MANY_NUM       ^[0-9]{1,2}(,[0-9]{1,2})+$
REAL_NUM       ^-?[0-9]+[\.][0-9]+$
INDASH         ^([\w+][\-]+)+\w+$
HASDIGIT       .*[0-9].*
IS_DASH        ^[-]+$
ROMAN          ^[IVXDLCM]+$
END_PUNC       ^[.?!]$
CAPSMIX        .*[A-Z].*[a-z].*|.*[a-z].*[A-Z].*
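
As an illustration of how the table above could be turned into token-level features, here is a minimal Python sketch. The patterns are copied from the table as printed; the names `ORTHOGRAPHIC_PATTERNS` and `orthographic_features` are hypothetical, and the use of Python's `re.search` is an assumption, since the source does not say which regex engine or matching mode the system used.

```python
import re

# Patterns copied verbatim from Table 2 (feature names kept as printed).
# Assumption: each pattern is tested against a single token, not a sentence.
ORTHOGRAPHIC_PATTERNS = {
    "ALLCAPS": r"^[A-Z]+$",
    "INITCAP": r"^[A-Z].*",
    "HASCAP": r"^.*[A-Z].*$",
    "SINGLECAP": r"^[A-Z]$",
    "PUNCTATION": r"^[,;:\'\"]$",
    "INITDIGIT": r"^[0-9].*",
    "SINGLEDIGIT": r"^[0-9]$",
    "ALPHANUM": r".*[A-Za-z].*[0-9].*|.*[0-9].*[A-Za-z].*",
    "MANY_NUM": r"^[0-9]{1,2}(,[0-9]{1,2})+$",
    "REAL_NUM": r"^-?[0-9]+[\.][0-9]+$",
    "INDASH": r"^([\w+][\-]+)+\w+$",
    "HASDIGIT": r".*[0-9].*",
    "IS_DASH": r"^[-]+$",
    "ROMAN": r"^[IVXDLCM]+$",
    "END_PUNC": r"^[.?!]$",
    "CAPSMIX": r".*[A-Z].*[a-z].*|.*[a-z].*[A-Z].*",
}

# Compile once so repeated calls do not re-parse the patterns.
_COMPILED = {name: re.compile(pattern) for name, pattern in ORTHOGRAPHIC_PATTERNS.items()}


def orthographic_features(token):
    """Return the names of all table features whose pattern matches the token."""
    # Hypothetical helper: re.search is used because several patterns carry
    # no anchors; for single-line tokens this agrees with a full-string match.
    return [name for name, regex in _COMPILED.items() if regex.search(token)]


if __name__ == "__main__":
    for tok in ["DNA", "H2O", "1,2", "N-methyl"]:
        print(tok, orthographic_features(tok))
```

A token can fire several features at once, so the function returns a list rather than a single label. Note that in the INDASH pattern as printed, `[\w+]` is a character class that matches exactly one character, so under this sketch the feature fires for tokens such as `N-methyl` but not for `alpha-beta`.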