Table 7 Character and word n-gram features extracted by NERsuite by default.

From: Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics

| Feature | Brief description | Sample features (bigrams) |
| --- | --- | --- |
| Character n-grams | the set of all possible combinations of a token's consecutive characters, taken n at a time (n = 2, 3, 4) | {GS}, {SK}, {K2}, {21}, {14}, {4a} |
| Token n-grams | unigrams and bigrams of surface forms; unigrams and bigrams of normalised surface forms, where numbers are replaced with '0's, the consecutive instances of which are compressed | {It, attenuated}, {attenuated, GSK214a}; {Aa, aaaaaaaaaa}, {aaaaaaaaaa, AAA000a} |
| Lemma n-grams | unigrams and bigrams of lemmatised surface forms | {It, attenuate}, {attenuate, GSK214a} |
| POS tag n-grams | unigrams and bigrams of part-of-speech (POS) tags | {PRP, VBD}, {VBD, NN} |
| Lemma & POS tag n-grams | unigrams and bigrams of lemmatised forms combined with POS tags | {It:PRP, attenuate:VBD}, {attenuate:VBD, GSK214a:NN} |
| Chunk information | chunk tag of current token; surface form of the enclosing chunk's | {B-NP}; {gestation} |
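
The character and token n-gram definitions in the table can be illustrated with a minimal sketch. This is not NERsuite's own code: the function names `char_ngrams` and `token_bigrams` are hypothetical, and the example reuses only the sample token and sentence fragment shown in the table.

```python
def char_ngrams(token, sizes=(2, 3, 4)):
    """Map each n in sizes to the runs of n consecutive characters in token."""
    # Hypothetical helper for illustration; not part of NERsuite.
    return {n: [token[i:i + n] for i in range(len(token) - n + 1)] for n in sizes}


def token_bigrams(tokens):
    """Return the pairs of adjacent tokens, in sentence order."""
    # Hypothetical helper for illustration; not part of NERsuite.
    return list(zip(tokens, tokens[1:]))


grams = char_ngrams("GSK214a")
print(grams[2])  # ['GS', 'SK', 'K2', '21', '14', '4a']
print(token_bigrams(["It", "attenuated", "GSK214a"]))
# [('It', 'attenuated'), ('attenuated', 'GSK214a')]
```

The same `token_bigrams` helper would apply unchanged to lemmas, POS tags, or lemma:POS pairs, since each of those rows differs only in what sequence is fed in.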