BLEU scoring, introduced in 2002, is a metric for comparing a predicted sentence with a reference sentence in machine translation. Each predicted word is compared with the reference, and each single word is called a unigram or 1-gram. In longer sentences, word pairs, or bigrams, can also be compared. Here, we calculated BLEU-1 for the unigram comparison, BLEU-2 for the bigram comparison, BLEU-3 for the 3-gram comparison and BLEU-4 for the 4-gram comparison.
To compare the predicted IUPAC name with the original IUPAC name, a sentence-to-sentence comparison is required, so for all BLEU calculations we used the sentence BLEU scoring function built into the Python Natural Language Toolkit (NLTK) [24, 28]. The original IUPAC name serves as the reference string and the predicted IUPAC name as the candidate string.
The weight distributions for the different BLEU scores are:
- BLEU-1: weights = (1.0, 0, 0, 0)
- BLEU-2: weights = (0.5, 0.5, 0, 0)
- BLEU-3: weights = (0.3, 0.3, 0.3, 0)
- BLEU-4: weights = (0.25, 0.25, 0.25, 0.25)
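These weights can be passed straight to NLTK's sentence_bleu. A minimal sketch follows; the exact tokenizer applied to the IUPAC names is not given in this section, so the regex split into letter runs, digits and punctuation marks below is an assumption (it appears consistent with the example scores listed further down):

```python
import re

from nltk.translate.bleu_score import sentence_bleu

def tokenize(name):
    # Stand-in tokenizer (an assumption): split an IUPAC name into
    # letter runs, digit runs and single punctuation marks.
    return re.findall(r"[a-zA-Z]+|\d+|[^\w\s]", name)

# The "wrong word" example from below: di -> tri
reference = tokenize("1,3,7-tri methyl purine-2,6-di one")
candidate = tokenize("1,3,7-tri methyl purine-2,6-tri one")

weight_sets = {
    "BLEU-1": (1.0, 0, 0, 0),
    "BLEU-2": (0.5, 0.5, 0, 0),
    "BLEU-3": (0.3, 0.3, 0.3, 0),
    "BLEU-4": (0.25, 0.25, 0.25, 0.25),
}

for label, weights in weight_sets.items():
    # sentence_bleu expects a list of tokenized reference sentences
    # and one tokenized candidate sentence.
    score = sentence_bleu([reference], candidate, weights=weights)
    print(f"{label}: {score:.2f}")
```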
The BLEU score is reduced by any of the following:
- each wrong word (unigram) match
- each wrong n-gram match
- a candidate string that is longer or shorter than the reference string
- predicted words appearing in the wrong order.

In each of these cases a penalty is applied, so the overall score decreases. A few examples are given below.
Exact match
Reference: 1,3,7-trimethylpurine-2,6-dione
Candidate: 1,3,7-trimethylpurine-2,6-dione
BLEU score: 1.00
BLEU-1: 1.00
BLEU-2: 1.00
BLEU-3: 1.00
BLEU-4: 1.00
Wrong word
Reference: 1,3,7-tri methyl purine-2,6-di one
Candidate: 1,3,7-tri methyl purine-2,6-tri one
BLEU score: 0.87
BLEU-1: 0.94
BLEU-2: 0.90
BLEU-3: 0.90
BLEU-4: 0.88
Wrong word pair
Reference: 1,3,7-tri methyl purine-2,6-di one
Candidate: 1,3,7-tri methyl purine-2,6,tri one
BLEU score: 0.81
BLEU-1: 0.88
BLEU-2: 0.84
BLEU-3: 0.84
BLEU-4: 0.81
Shorter prediction
Reference: 1,3,7-tri methyl purine-2,6-di one
Candidate: 1,3,7-tri methyl purine-2
BLEU score: 0.63
BLEU-1: 0.63
BLEU-2: 0.63
BLEU-3: 0.63
BLEU-4: 0.63
Longer prediction
Reference: 1,3,7-tri methyl purine-2,6-di one
Candidate: 1,3,7-tri methyl purine-2,6-di one, 6-di one, 6-di one
BLEU score: 0.52
BLEU-1: 0.63
BLEU-2: 0.59
BLEU-3: 0.59
BLEU-4: 0.52
Wrong order of predictions
Reference: 1,3,7-tri methyl purine-2,6-di one
Candidate: 1,3,7-tri methyl purine-6,2-di one
BLEU score: 0.71
BLEU-1: 1.00
BLEU-2: 0.86
BLEU-3: 0.80
BLEU-4: 0.71
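The plain "BLEU score" rows in these examples come from the default sentence_bleu call. A sketch that recomputes them, reusing the stand-in tokenizer from the earlier snippet (the last digit may differ slightly from the values listed above because of rounding):

```python
import re

from nltk.translate.bleu_score import sentence_bleu

def tokenize(name):
    # Same stand-in tokenizer as in the earlier sketch (an assumption).
    return re.findall(r"[a-zA-Z]+|\d+|[^\w\s]", name)

reference = tokenize("1,3,7-tri methyl purine-2,6-di one")

cases = {
    "Exact match": "1,3,7-tri methyl purine-2,6-di one",
    "Wrong word": "1,3,7-tri methyl purine-2,6-tri one",
    "Shorter prediction": "1,3,7-tri methyl purine-2",
    "Wrong order": "1,3,7-tri methyl purine-6,2-di one",
}

for label, candidate in cases.items():
    # The default weights are (0.25, 0.25, 0.25, 0.25); a candidate
    # shorter than the reference additionally incurs the brevity
    # penalty exp(1 - len(reference)/len(candidate)).
    score = sentence_bleu([reference], tokenize(candidate))
    print(f"{label}: {score:.2f}")
```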
For the BLEU score calculation we used the default settings of sentence BLEU, which correspond to a 4-gram comparison with evenly distributed weights. In very few cases, as reported in the Results section, we encountered predictions with a BLEU score of 1.0 even though the strings were not identical. This can be rectified by comparing higher-order n-grams with correspondingly distributed weights. Because these cases were very low in number in our results, we retained the default settings. For example:
Reference: 4-[(4-amino-2,3,6-tri methyl phenyl) methyl]-2,3,5-tri methyl aniline
Candidate: 4-[(4-amino-2,3,5-tri methyl phenyl)methyl]-2,3,6-tri methyl aniline
With sentence BLEU, 4-gram (weights = (0.25, 0.25, 0.25, 0.25))
BLEU score: 1.00
With sentence BLEU, 5-gram (weights = (0.2, 0.2, 0.2, 0.2, 0.2))
BLEU score: 0.98
With sentence BLEU, 8-gram (weights = (0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125))
BLEU score: 0.88
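A sketch of this higher-order comparison, again using the stand-in tokenizer and uniform weights over the n n-gram orders:

```python
import re

from nltk.translate.bleu_score import sentence_bleu

def tokenize(name):
    # Stand-in tokenizer, as in the earlier sketches (an assumption).
    return re.findall(r"[a-zA-Z]+|\d+|[^\w\s]", name)

# Near-identical names that differ only in two swapped locants
reference = tokenize("4-[(4-amino-2,3,6-tri methyl phenyl) methyl]-2,3,5-tri methyl aniline")
candidate = tokenize("4-[(4-amino-2,3,5-tri methyl phenyl)methyl]-2,3,6-tri methyl aniline")

for n in (4, 5, 8):
    # Uniform weights over n n-gram orders, e.g. (0.2,) * 5 for 5-gram
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu([reference], candidate, weights=weights)
    print(f"{n}-gram BLEU: {score:.2f}")
```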