Table 3 Examples of features and regular expressions used during the training of the chemical entities identification systems.

From: CheNER: a tool for the identification of chemical entities and their classes in biomedical literature

Name of feature

Description

Length

Classifies tokens by length: Short if the length is less than 5, Medium if it is between 5 and 15, and Large otherwise.

Word class

Automatic generation of features in terms of frequency of upper and lower case characters, digits and other types of characters.

Autom. Prefixes/Suffixes

Automatic generation of prefixes and suffixes of length 2, 3 and 4.

List

A feature is generated automatically for every token that matches an element of a list. We used lists of basic name segments (~3300) and stop words (~550).

Dictionaries

Dictionary matching for the trivial, family and abbreviation name classes (~6400, ~1300 and ~1400 elements, respectively).

Regular expressions

Regular expressions that identify specific features, such as "contains dashes?", "is all caps?", or "contains numbers?".

Regular expressions that identify specific types of characters that are more common in chemical entities than in other words, such as Greek letters, Roman numerals, etc.

Regular expressions that match specific morphological features of chemical formulas, identifiers, and systematic features in chemical names.

Regular expressions used in the post-processing step that filter out common names that are incorrectly tagged by the systems in a systematic way.
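
As an illustration of how a few of the token-level features in this table could be computed, the following is a minimal Python sketch. It is not the CheNER implementation. The function names, feature keys, and regular expressions are hypothetical, and several details are assumptions: the Medium range is treated as inclusive of 5 and 15, the word class is approximated by simple character-type counts, affixes are skipped for tokens shorter than the affix length, and the Greek-letter check uses the Unicode Greek block.

```python
import re


def length_class(token):
    """Length feature: Short (<5), Medium (5 to 15, assumed inclusive), Large (>15)."""
    n = len(token)
    if n < 5:
        return "Short"
    if n <= 15:
        return "Medium"
    return "Large"


def char_class_counts(token):
    """Word-class approximation (assumption): counts of each character type."""
    counts = {"upper": 0, "lower": 0, "digit": 0, "other": 0}
    for ch in token:
        if ch.isupper():
            counts["upper"] += 1
        elif ch.islower():
            counts["lower"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        else:
            counts["other"] += 1
    return counts


def affixes(token, lengths=(2, 3, 4)):
    """Prefixes and suffixes of length 2, 3 and 4; skipped if the token is too short."""
    feats = {}
    for n in lengths:
        if len(token) >= n:
            feats[f"prefix{n}"] = token[:n]
            feats[f"suffix{n}"] = token[-n:]
    return feats


# Hypothetical patterns for the yes/no questions named in the table.
HAS_DASH = re.compile(r"-")
ALL_CAPS = re.compile(r"^[A-Z]+$")
HAS_DIGIT = re.compile(r"[0-9]")
HAS_GREEK = re.compile(r"[\u0370-\u03FF]")  # Unicode Greek and Coptic block


def regex_flags(token):
    """Boolean features derived from the regular expressions above."""
    return {
        "has_dash": bool(HAS_DASH.search(token)),
        "all_caps": bool(ALL_CAPS.search(token)),
        "has_digit": bool(HAS_DIGIT.search(token)),
        "has_greek": bool(HAS_GREEK.search(token)),
    }


def token_features(token):
    """Combine all sketched features into one dictionary for a single token."""
    feats = {"length": length_class(token)}
    feats.update(char_class_counts(token))
    feats.update(affixes(token))
    feats.update(regex_flags(token))
    return feats
```

For example, `token_features("2-propanol")` assigns the length class Medium and sets both the dash and digit flags. List and dictionary features could be added in the same style as set-membership tests, given the corresponding word lists.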