Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics

Table 7 Character and word n-gram features extracted by NERsuite by default.

Feature	Brief description	Sample features (bigrams)
Character n-grams	the set of all contiguous character sequences of length n within a token (n = 2, 3, 4)	{GS}, {SK}, {K2}, {21}, {14}, {4a}
Token n-grams	unigrams and bigrams of surface forms; unigrams and bigrams of normalised surface forms, in which numbers are replaced with '0's and consecutive instances are compressed	{It, attenuated}, {attenuated, GSK214a}; {Aa, aaaaaaaaaa}, {aaaaaaaaaa, AAA000a}
Lemma n-grams	unigrams and bigrams of lemmatised surface forms	{It, attenuate}, {attenuate, GSK214a}
POS tag n-grams	unigrams and bigrams of part-of-speech (POS) tags	{PRP, VBD}, {VBD, NN}
Lemma & POS tag n-grams	unigrams and bigrams of lemmatised forms combined with POS tags	{It:PRP, attenuate:VBD}, {attenuate:VBD, GSK214a:NN}
Chunk information	chunk tag of current token; surface form of the enclosing chunk's	{B-NP}; {gestation}
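
The character and token n-gram rows of Table 7 can be illustrated with a minimal Python sketch. It is not NERsuite's implementation: the function names (char_ngrams, word_shape, bigrams) are hypothetical, and the normalisation scheme (uppercase letters to 'A', lowercase letters to 'a', digits to '0') is an assumption inferred from the sample features in the table. Following those samples, the sketch does not compress consecutive '0's.

```python
import re


def char_ngrams(token, sizes=(2, 3, 4)):
    """Map each n to the list of contiguous character n-grams of the token."""
    return {n: [token[i:i + n] for i in range(len(token) - n + 1)] for n in sizes}


def word_shape(token):
    """Assumed normalisation: uppercase -> 'A', lowercase -> 'a', digit -> '0'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "0", shape)


def bigrams(items):
    """Pair each item with its right-hand neighbour."""
    return list(zip(items, items[1:]))


# Example sentence fragment taken from the sample features in Table 7.
tokens = ["It", "attenuated", "GSK214a"]

char_bigrams = char_ngrams("GSK214a")[2]
surface_bigrams = bigrams(tokens)
shape_bigrams = bigrams([word_shape(t) for t in tokens])
```

Under these assumptions, char_bigrams, surface_bigrams and shape_bigrams correspond to the sample features listed in the character n-gram and token n-gram rows.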

ISSN: 1758-2946