Table 5 Accept and reject rules succession for unigrams (1-grams)

From: Terminology spectrum analysis of natural-language chemical documents: term-like phrases retrieval routine



GeneralChemTermRule (accept rule)

True if a 1-gram is a general chemistry scientific term

StrictFilteringTagRule (reject rule)

True if a 1-gram consists of a token with the strict filtering tag «rubbish:true»

ShortTokensRule (reject rule)

True if a 1-gram consists of a token shorter than three characters

This rule excludes noise found in documents, such as axis labels

UnitsRule (reject rule)

True if a 1-gram contains a string that is a measurement unit from the dictionary (Table 1)

ChemUnigramRule (accept rule)

True if a 1-gram is tagged with any OSCAR tag and with one of the POS tags FW or NNP, or is tagged COMP. Selected unigrams are assumed to have a chemical sense and are marked accordingly.

Term-like: barium, phenanthrene, pentanol, xanes

GeneralEnglishDictRule (reject rule)

True if a 1-gram is in the General English Dictionary (Table 1)

Filtered: topography, paint, plateau, pool, searching, file, addenda, improvement, theme …

Term-like: hydrocalcite, acetylacetone, cracking, ageing

UnigramPOSRule (reject rule)

True if a 1-gram is not a noun or a gerund.

A term-like 1-gram must be tagged with one of the following POS tags: VBG, NN, NNPS, NNS

Filtered: schematized, suddenly, skeletal, behind

Term-like: ethylene, hydrocalcite, leaching, 12n-decylhexadecanamide, sulfamethoxazole, anchoring

UnigramAddRules (reject rules)

A set of regular expressions that filter unigrams denoting various ions, signs, captions, etc.

Filtered: M(O2), GA15.6, PW91, V2.1, G(D), TI(V), PD(I), PT0, P(X), BA2+, CE(3+), cm3, CH3, AA, Cu2+, Mo6+, Et-CP, GC–MS, Zn-Al
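
The succession of rules in this table can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions. Rules are tried in table order and the first rule that returns True decides the outcome (accept gives term-like, reject gives filtered). A unigram that triggers no rule is kept as term-like. ChemUnigramRule is read as "(OSCAR tag and POS in FW/NNP) or COMP tag". UnitsRule is simplified to an exact match. The `Token` and `Rule` classes, the tiny word sets, and the single `ION_LIKE` pattern are hypothetical stand-ins for the real dictionaries and regular expressions.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class Token:
    text: str
    pos: str                                     # POS tag, e.g. "NN", "VBG"
    tags: Set[str] = field(default_factory=set)  # extra tags, e.g. {"OSCAR"}


@dataclass
class Rule:
    name: str
    kind: str                                    # "accept" or "reject"
    test: Callable[[Token], bool]


# Toy stand-ins for the dictionaries of Table 1 (assumptions, not real data).
GENERAL_CHEM_TERMS = {"cracking", "ageing"}
UNITS = {"kg", "mol"}
GENERAL_ENGLISH = {"paint", "pool", "theme", "cracking"}
# One hypothetical pattern for ion-like strings such as "Cu2+" or "Mo6+".
ION_LIKE = re.compile(r"^[A-Za-z]{1,2}\d*[+-]$")
NOUN_OR_GERUND = {"VBG", "NN", "NNPS", "NNS"}

RULES: List[Rule] = [
    Rule("GeneralChemTermRule", "accept",
         lambda t: t.text.lower() in GENERAL_CHEM_TERMS),
    Rule("StrictFilteringTagRule", "reject",
         lambda t: "rubbish:true" in t.tags),
    Rule("ShortTokensRule", "reject",
         lambda t: len(t.text) < 3),
    Rule("UnitsRule", "reject",                  # simplified to an exact match
         lambda t: t.text.lower() in UNITS),
    Rule("ChemUnigramRule", "accept",            # one reading of the rule text
         lambda t: ("OSCAR" in t.tags and t.pos in {"FW", "NNP"})
         or "COMP" in t.tags),
    Rule("GeneralEnglishDictRule", "reject",
         lambda t: t.text.lower() in GENERAL_ENGLISH),
    Rule("UnigramPOSRule", "reject",
         lambda t: t.pos not in NOUN_OR_GERUND),
    Rule("UnigramAddRules", "reject",
         lambda t: ION_LIKE.match(t.text) is not None),
]


def classify(token: Token, rules: List[Rule] = RULES) -> Tuple[str, Optional[str]]:
    """Return (outcome, name of the deciding rule); the first firing rule wins."""
    for rule in rules:
        if rule.test(token):
            outcome = "term-like" if rule.kind == "accept" else "filtered"
            return outcome, rule.name
    # Assumption: a unigram that survives every rule is kept as term-like.
    return "term-like", None


if __name__ == "__main__":
    for tok in (Token("barium", "NNP", {"OSCAR"}), Token("paint", "NN"),
                Token("Cu2+", "NN"), Token("leaching", "VBG")):
        print(tok.text, classify(tok))
```

Because the order matters, an accept rule placed early shields a unigram from later reject rules: in this toy setup "cracking" is accepted by GeneralChemTermRule even though it also appears in the stand-in general English set.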