Table 2 Orthographic features used in our system.

From: A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature

Feature        Regular Expression
ALLCAPS        ^[A-Z]+$
INITCAP        ^[A-Z].*
HASCAP         ^.*[A-Z].*$
SINGLECAP      ^[A-Z]$
PUNCTATION     ^[,;:\'\"]$
INITDIGIT      ^[0-9].*
SINGLEDIGIT    ^[0-9]$
ALPHANUM       .*[A-Za-z].*[0-9].*|.*[0-9].*[A-Za-z].*
MANY_NUM       ^[0-9]{1,2}(,[0-9]{1,2})+$
REAL_NUM       ^-?[0-9]+[\.][0-9]+$
INDASH         ^([\w+][\-]+)+\w+$
HASDIGIT       .*[0-9].*
IS_DASH        ^[-]+$
ROMAN          ^[IVXDLCM]+$
END_PUNC       ^[.?!]$
CAPSMIX        .*[A-Z].*[a-z].*|.*[a-z].*[A-Z].*
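
As an illustration of how the patterns in Table 2 might be turned into token-level features, the following is a minimal Python sketch, not the authors' implementation. The names `ORTHOGRAPHIC_PATTERNS` and `orthographic_features` are hypothetical. Two assumptions are made: each pattern is matched against the whole token, and the space before `|` in the ALPHANUM and CAPSMIX rows is treated as a layout artifact and dropped.

```python
import re

# Patterns copied from Table 2. Assumption: the space before "|" in the
# ALPHANUM and CAPSMIX rows is a layout artifact, so it is omitted here.
ORTHOGRAPHIC_PATTERNS = {
    "ALLCAPS": r"^[A-Z]+$",
    "INITCAP": r"^[A-Z].*",
    "HASCAP": r"^.*[A-Z].*$",
    "SINGLECAP": r"^[A-Z]$",
    "PUNCTATION": r"^[,;:\'\"]$",
    "INITDIGIT": r"^[0-9].*",
    "SINGLEDIGIT": r"^[0-9]$",
    "ALPHANUM": r".*[A-Za-z].*[0-9].*|.*[0-9].*[A-Za-z].*",
    "MANY_NUM": r"^[0-9]{1,2}(,[0-9]{1,2})+$",
    "REAL_NUM": r"^-?[0-9]+[\.][0-9]+$",
    "INDASH": r"^([\w+][\-]+)+\w+$",
    "HASDIGIT": r".*[0-9].*",
    "IS_DASH": r"^[-]+$",
    "ROMAN": r"^[IVXDLCM]+$",
    "END_PUNC": r"^[.?!]$",
    "CAPSMIX": r".*[A-Z].*[a-z].*|.*[a-z].*[A-Z].*",
}

# Compile once so repeated calls do not re-parse the patterns.
_COMPILED = {name: re.compile(p) for name, p in ORTHOGRAPHIC_PATTERNS.items()}


def orthographic_features(token):
    """Return the names of all Table 2 features whose pattern matches the token.

    Assumption: a feature fires only if the pattern matches the entire token.
    """
    return [name for name, rx in _COMPILED.items() if rx.fullmatch(token)]


if __name__ == "__main__":
    for tok in ["DNA", "2-methylpropane", "3.14", "IV", "-"]:
        print(tok, orthographic_features(tok))
```

A single token can trigger several features at once. In this sketch, for example, `2-methylpropane` fires INITDIGIT, ALPHANUM, INDASH and HASDIGIT, so each feature would typically be encoded as a separate binary indicator.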