
Table 3 Degenerate repetition observed in various tasks from state-of-the-art large-scale language models

From: Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

| Task [model] | Context | [Repetition pattern] output |
| --- | --- | --- |
| Sentence completion [GPT-2], ref. [10] | Prefix: Lyrically the song has excerpts of different languages including French, Spanish, | [Single-word repetition] German, Italian, Portuguese, Spanish, Portuguese, Portuguese, Portuguese, Portuguese, Portuguese, Portuguese, Portuguese, Portuguese, Portuguese, . |
| Summarization [Transformer], ref. [52] | Article | [Phrase-level repetition] ...cristiano ronaldo’s best footballer whereas barcelona talisman lionel messi’s best footballer whereas barcelona’s best footballer whereas barcelona talisman lionel lee whereas’s best footballer’s. (Truncated) |
| Generation from an initial tag line [GPT-2], ref. [9] | Tag: So what’s new in my life? 09/11/18 - Just got back from vacation | [Sentence-level repetition] Just got back from vacation. 09/10/18 - Just got back from vacation. Just got back from vacation. 09/09/18 - Just got back from vacation. Just got back from vacation. 09/08/18 - Just got back from vacation. Just got back from vacation. |
| Product review generation [GPT-2], ref. [53] | Initial context | [Structural repetition] Great movie, although took a while to see at first it held my interest and kept me interested, plus i thought it was extremely good. also it was very good. |
| Protein sequence generation [ProtGPT-2], ref. [54] | No context | [Subsequential repetition] MSNDTPTHDPTPPAPAPAPEPAPAPAPEPAPAPAPEPAPAPAPEPAPAPAPEPAPAPAPEPAPAPAPEPAPAPAPEPAPAPAPEPAPAPAPEPAPAPAPE. |
| Molecule captioning [Transformer], ref. [55] | SMILES: CC[N+](CC)=C1C=CC2=NC3=C(OC2=C1)C=C(N)C(C)=C3 | [Single-word repetition] the molecule is a deuterated compound that is is is is is an isotopologue of chloroform in which the four hydrogen atoms have been replaced by deuterium. (Truncated) |

The examples contain single-word repetitions, phrase-level repetitions, sentence-level repetitions, structural repetitions (where tokens may vary within a repeating phrase), and subsequential repetitions. The first repeated unit in each example is emphasized in bold.
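The exact, back-to-back repetitions in the table (single-word and subsequential) can be flagged with a simple scan. The following is only a minimal sketch and not a method from the cited works: the function name `longest_consecutive_repeat` and the `max_unit` parameter are hypothetical, and whitespace tokenization for text and per-character tokenization for sequences are assumptions. It does not handle structural repetition, where tokens vary within the repeating phrase.

```python
def longest_consecutive_repeat(tokens, max_unit=10):
    """Return (unit, count) for the unit repeated back-to-back most often.

    `tokens` may be a list of words or a string of characters.
    `max_unit` (hypothetical parameter) caps the unit length that is tried.
    Ties are resolved in favour of the shorter, earlier unit.
    """
    best_unit, best_count = (), 1
    n = len(tokens)
    for size in range(1, max_unit + 1):
        # A unit of this size needs room for at least one more copy after it.
        for start in range(n - 2 * size + 1):
            unit = tuple(tokens[start:start + size])
            count = 1
            pos = start + size
            # Count how many exact copies of the unit follow immediately.
            while tuple(tokens[pos:pos + size]) == unit:
                count += 1
                pos += size
            if count > best_count:
                best_unit, best_count = unit, count
    return best_unit, best_count


# Single-word repetition, using a fragment of the molecule-captioning output.
words = "that is is is is is an".split()
print(longest_consecutive_repeat(words))            # (('is',), 5)

# Subsequential repetition on a short toy character sequence.
print(longest_consecutive_repeat(list("ABABABC")))  # (('A', 'B'), 3)
```

A count well above 1 for a short unit is a rough signal of degenerate repetition; any threshold would have to be chosen per task.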