Table 7 Character and word n-gram features extracted by NERsuite by default.

From: Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics

| Feature | Brief description | Sample features (bigrams) |
| --- | --- | --- |
| Character n-grams | the set of all possible combinations of a token's consecutive characters, taken n at a time (n = 2, 3, 4) | {GS}, {SK}, {K2}, {21}, {14}, {4a} |
| Token n-grams | unigrams and bigrams of surface forms; unigrams and bigrams of normalised surface forms where numbers are replaced with '0's, the consecutive instances of which are compressed | {It, attenuated}, {attenuated, GSK214a}; {Aa, aaaaaaaaaa}, {aaaaaaaaaa, AAA000a} |
| Lemma n-grams | unigrams and bigrams of lemmatised surface forms | {It, attenuate}, {attenuate, GSK214a} |
| POS tag n-grams | unigrams and bigrams of part-of-speech (POS) tags | {PRP, VBD}, {VBD, NN} |
| Lemma & POS tag n-grams | unigrams and bigrams of lemmatised forms combined with POS tags | {It:PRP, attenuate:VBD}, {attenuate:VBD, GSK214a:NN} |
| Chunk information | chunk tag of current token; surface form of the enclosing chunk's | {B-NP}; {gestation} |
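
The character and token n-gram rows above can be illustrated with a minimal Python sketch. This is not NERsuite's implementation; the function names `char_ngrams` and `sequence_ngrams` are hypothetical and were chosen for illustration only. It reproduces the sample bigrams listed in the table for the token "GSK214a" and the phrase "It attenuated GSK214a".

```python
def char_ngrams(token, sizes=(2, 3, 4)):
    """Return every run of n consecutive characters in `token`, for each n in `sizes`.

    Hypothetical helper for illustration; not part of NERsuite.
    """
    return [token[i:i + n] for n in sizes for i in range(len(token) - n + 1)]


def sequence_ngrams(items, n):
    """Return every run of n consecutive items (tokens, lemmas or POS tags) as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Character bigrams of the sample token from the table.
print(char_ngrams("GSK214a", sizes=(2,)))
# ['GS', 'SK', 'K2', '21', '14', '4a']

# Token bigrams of the sample phrase from the table.
print(sequence_ngrams(["It", "attenuated", "GSK214a"], 2))
# [('It', 'attenuated'), ('attenuated', 'GSK214a')]
```

The same `sequence_ngrams` helper would apply unchanged to lemma or POS tag sequences, since each of those rows takes unigrams and bigrams over a parallel sequence of labels.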