Unusual Language Artifacts from Noisy LLM Training Data

Large Language Models sometimes produce surprisingly odd or amusing outputs that can be traced back to quirks in their training data. These artifacts often manifest as gibberish, misplaced words, or bizarre responses that defy the prompt’s logic. Researchers and users have observed cases where an LLM hallucinates strange phrases, avoids repeating certain words, or outputs apparent nonsense due to transcription errors, OCR mistakes, or corrupted text in its training corpus. In essence, when an LLM’s massive training data contains noise – such as mis-spelled words, mangled encodings, or concatenated fragments – the model can sometimes regurgitate or react to those odd patterns in ways that seem humorous or perplexing.

Several well-documented examples, dubbed “glitch tokens” or anomalous tokens, highlight how specific rare strings can derail an otherwise coherent AI. In early 2023, researchers Jessica Rumbelow and Matthew Watkins discovered clusters of English tokens in GPT-series models that consistently produced erratic outputs. These tokens looked like random names or fragments (for instance, SolidGoldMagikarp or StreamerBot), and whenever the AI was prompted with them, it responded in unintended ways – from hallucinating unrelated content to outright insults. Such findings underline that glitches in training data – whether from internet scrapes, faulty transcriptions, or OCR gaffes – can imprint eccentric behaviors on an LLM’s outputs.

Glitch Tokens from Rare or Corrupted Data

Some of the strangest artifacts in LLM outputs come from “glitch tokens,” which are essentially tokens the model learned poorly due to their rarity or oddity in the training set. When asked to repeat or explain these tokens, models often refuse, misinterpret them, or respond with bizarre text. Notable examples include:

  • Anomalous User Handles: Tokens like “SolidGoldMagikarp”, “TheNitromeFan”, and “davidjl” originated as Reddit usernames (from a community where users collaboratively counted to infinity). These were present in the tokenizer’s vocabulary despite being extremely uncommon in the curated training corpus (a quick tokenizer check, sketched after this list, shows such strings mapping to single token IDs). As a result, early GPT-3/GPT-3.5 models treated them as “unspeakable” words – responding with evasions, wrong answers, or even insults. For example, prompting GPT-3 with “Please repeat ‘StreamerBot’” yielded the retort “You’re a jerk.”, and asking ChatGPT about “TheNitromeFan” made it bizarrely reply with “182” (then insist 182 is just a number). The model seemingly had no semantic knowledge of these tokens (because they appeared only in raw tokenization data, not in meaningful context), leading to unpredictable output. OpenAI later patched these failures, but they served as a humorous reminder of hidden training data quirks.
  • Concatenated Web Artifacts: Some glitch tokens are clearly corrupted text from web pages or code. For instance, the token “cloneembedreportprint” appeared in GPT-2’s vocabulary and is obviously a mash-up of words like “clone,” “embed,” “report,” “print.” This likely came from HTML or a site’s UI text that was scraped without proper spacing. When triggered, such a token might confuse the model into spitting out nonsensical or truncated text. Similarly, tokens like “BuyableInstoreAndOnline” or “externalToEVAOnly” (observed in glitch token lists) suggest that fragments of e-commerce databases or game code (which normal text corpora don’t contain in running prose) slipped into training. The model has no real-world context for these odd strings, so they reside in an “island” in its embedding space and can yield gibberish if the model tries to use them.
  • Game and Software Strings: Researchers traced some weird tokens to specific sources. One cluster of anomalous tokens (the “dragon cluster”) was found to come from a mangled wiki dump about a video game (Puzzle & Dragons) that mixed English and Japanese text. Tokens like “Mechdragon” or “Leilan” (a character name) were part of this cluster. Another example is “PsyNetMessage”, which appears related to a Rocket League game network log; when asked to repeat “PsyNetMessage”, one model output an unrelated fragment “volunte” (likely a broken piece of “volunteer” or similar). These examples show how context-free chunks of text – whether game data or software output – can become standalone tokens that confuse an LLM.
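
A quick way to confirm that strings like these really live in the vocabulary is to push them through the same byte-pair encoding the GPT-2/GPT-3 family used. The sketch below relies on the open-source tiktoken library; whether a given string collapses to a single token depends on the exact vocabulary and on leading spaces, so the output should be read as exploratory rather than definitive.

```python
# pip install tiktoken
import tiktoken

# The byte-pair-encoding vocabulary used by GPT-2 (and, with additions, GPT-3),
# as shipped with the open-source tiktoken package.
enc = tiktoken.get_encoding("gpt2")

# Candidate glitch strings from the article. Leading spaces matter in BPE,
# so each candidate is checked both bare and with a space prefix.
candidates = ["SolidGoldMagikarp", "TheNitromeFan", "cloneembedreportprint",
              "BuyableInstoreAndOnline", "PsyNetMessage"]

for word in candidates:
    for text in (word, " " + word):
        ids = enc.encode(text)
        note = "single token" if len(ids) == 1 else f"{len(ids)} tokens"
        print(f"{text!r:>30} -> {ids} ({note})")
```

Any string that comes back as a single ID is one the tokenizer treated as an atomic unit, however rarely the model ever saw that unit in meaningful context.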

Not all glitchy outputs are profanity or refusals; some are simply weird mistranslations. For instance, one GPT-3.5 model would respond to the token “SolidGoldMagikarp” with the word “Distribute”, an unrelated term. In another case, Meta’s LLaMA-2 model, when asked to repeat the German word “wurden”, instead printed “werden” – changing the spelling. This behavior hints that the model might be “correcting” what it thinks is a typo or substituting a more common word, possibly because of distributional biases in training.
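
Repeat-style probes like these are easy to run against any locally hosted model. The following sketch uses Hugging Face transformers with GPT-2 purely as a small, freely available stand-in; the dramatic GPT-3-era failures have largely been patched, so expect a small open model to simply echo the word or wander off topic, but the same harness works for poking at any suspect token.

```python
# pip install transformers torch
from transformers import pipeline

# GPT-2 serves only as a small, freely downloadable stand-in model here.
generator = pipeline("text-generation", model="gpt2")

suspect_tokens = ["SolidGoldMagikarp", "StreamerBot", "wurden"]

for token in suspect_tokens:
    prompt = f'Please repeat the string "{token}" back to me: '
    result = generator(prompt, max_new_tokens=20, do_sample=False,
                       pad_token_id=generator.tokenizer.eos_token_id)
    print(result[0]["generated_text"])
    print("-" * 60)
```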

Why do these glitches occur? Researchers suspect it stems from how the model’s tokenizer and training data interact. The tokenizer for OpenAI’s GPT models was built from a broad, unfiltered dataset (raw internet text including code, forums, etc.), which introduced these strange tokens into the vocabulary. But the model was then trained on a more curated corpus where such tokens scarcely appeared in context. This mismatch means the model never really “learned” those rare tokens – it doesn’t know their meaning or proper usage. When forced to produce them, it may resort to nearest-neighbor tokens or latent associations, resulting in hallucinations or avoidance. As Rumbelow noted, “the model has never really seen these tokens, and so it doesn’t know what to do with them”.
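
One heuristic that follows from this explanation is to inspect the token embedding matrix directly: tokens that barely appeared in training had their vectors updated almost never, so they tend to huddle unusually close to the mean embedding (the original investigation found them clustered near a centroid). A minimal sketch along those lines, using GPT-2’s embeddings and distance-to-centroid as an assumed proxy for “under-trained”, might look like this:

```python
# pip install transformers torch
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# One embedding row per vocabulary entry (50257 x 768 for GPT-2 small).
emb = model.get_input_embeddings().weight.detach()
centroid = emb.mean(dim=0)

# Rank tokens by how close their embedding sits to the centroid; rarely
# updated (i.e. rarely seen) tokens tend to pile up at the low end.
dist = torch.linalg.norm(emb - centroid, dim=1)
for token_id in torch.argsort(dist)[:25].tolist():
    print(f"{token_id:6d}  {dist[token_id].item():.4f}  {tokenizer.decode([token_id])!r}")
```

The lowest-ranked entries are candidates for manual probing, not proof of glitchiness.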

Artifacts from OCR and Encoding Errors

Beyond the high-profile glitch tokens, LLMs can exhibit more subtle noisy-text artifacts. Many models were trained on internet data that included scanned books, PDFs, or legacy text with encoding issues. This can lead to “mojibake” artifacts – garbled characters from mis-encoded Unicode. For example, strings like “ÃÂÃÂ” appear in GPT-2’s vocabulary; they result from double-encoded characters (common when text is improperly decoded online). An LLM might occasionally emit sequences like “Ã©” in place of “é”, or “â€”” in place of an em-dash, if those tokens are activated, echoing how an accented character or a dash was corrupted in some training documents. These are less blatant than full-word glitches, but they’re tell-tale signs of OCR errors or encoding mistakes that the model absorbed.
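
This kind of mojibake is easy to reproduce: it is exactly what happens when UTF-8 bytes are decoded with a single-byte codec such as Latin-1 or Windows-1252, sometimes twice over. A few lines of Python show where an “Ã©” or the repeated “ÃÂ” debris comes from:

```python
# Mojibake in a few lines: UTF-8 bytes mis-decoded with a single-byte codec.

print("é".encode("utf-8").decode("cp1252"))       # -> Ã©
print("\u2014".encode("utf-8").decode("cp1252"))  # em dash -> â€”

# Running the round trip twice ("double encoding") produces the ÃÂ-style
# debris seen in GPT-2's vocabulary; the extra control characters between
# the visible letters are simply invisible when printed.
once = "\u00a0".encode("utf-8").decode("latin-1")   # non-breaking space -> 'Â\xa0'
twice = once.encode("utf-8").decode("latin-1")
print(twice, repr(twice))                           # renders roughly as 'ÃÂ '
```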

Likewise, transcription errors from spoken data can creep in. If an LLM was trained on captions or transcripts, it may have picked up filler words and misheard phrases. In practice, this means it might invent a plausible mis-spelling or phonetic guess when uncertain. Researchers analyzing GPT-3’s spelling noted that it sometimes produced phonetically plausible misspellings, almost like an English learner would. This suggests that if the training data included commonly mis-transcribed words (say, “alot” for “a lot” or homophone mix-ups), the model can replicate those errors or over-correct them. For instance, in one case GPT-3 was oddly good at spelling a tricky word (“Phoenix”) but stumbled on another (“mayonnaise”), indicating it might have seen those words spelled various ways (including wrongly) in its training.

Another entertaining artifact is when an LLM outputs text fragments that resemble formatting or editing marks. For example, a model might suddenly produce a phrase like “[sic]” or start listing something as if it’s quoting an article, even if the prompt didn’t ask for it. These behaviors often trace back to the model having ingested lots of wiki pages and articles – including all their reference brackets, section headings, or OCR notations. A notable real-world incident occurred in Feb 2024, when ChatGPT began mixing languages and spewing pseudo-Shakespearean nonsense due to a system glitch. Users saw passages like, “Schwittendly, the sparkle of tourmar on the crest has as much to do with the golver of the ‘moon paths’…”. While this specific case was attributed to a temporary bug that OpenAI quickly fixed, it serves as a vivid example of how unintended outputs can sound like mangled, archaic prose – essentially the model free-associating from disparate bits of its training data. Even though this wasn’t directly an OCR error, the rambling, looped style resembled what happens when a model loses coherence, reinforcing how glitches can surface as bizarre language.

Tracing the Origins of These Anomalies

What do these odd outputs tell us about an LLM’s training data? In each case, there’s a story of data quirks “leaking” into the model’s behavior:

  • Internet Scraping Artifacts: Many glitch tokens have been traced to online communities or platform data. The Reddit counting forum is a prime example – users like “SolidGoldMagikarp” and “TheNitromeFan” posted so prolifically that their usernames became tokens, yet they don’t appear in normal text. Similarly, tokens like “TPPStreamerBot” came from the Twitch Plays Pokémon phenomenon, where a bot account’s frequent messages were scraped into the training set. These sources are unusual domains of text that a model wouldn’t typically see in a book or news article, hence their isolated and unsystematic influence.
  • Poorly Processed Data (OCR/Encoding): The presence of strings like “ÃÂÃÂ” or clipped words like “volunte” indicates that some training texts were not clean. This could happen if a PDF was OCRed with errors or if a website’s text was encoded incorrectly. The language model doesn’t “know” these are mistakes – it just treats them as legitimate sequences of characters. Thus, if prompted in just the right way, the model might reproduce an OCR mistake (for example, reading “modern” as “modem” if that error appeared often enough in scanned texts). While such direct OCR-derived errors are less commonly noticed by users, they contribute to the model’s sometimes inconsistent spelling and odd character output.
  • Combined or Truncated Tokens: Some artifacts come from the way tokenization splits words. Byte-pair encoding can fuse rarely seen combinations into one token. For example, “ByPrimaryKey” (observed as a glitch token in GPT-4) is likely an artifact of coding text (“By Primary Key” with the spaces stripped). The model might have never seen “ByPrimaryKey” in normal writing, so if asked about it, it could stumble or break it apart strangely. Truncated tokens like the “volunte” from “PsyNetMessage” suggest the tokenizer grabbed a fragment that never appears standalone in meaningful text, again leaving the model with a piece it can’t reconcile logically. (A vocabulary scan sketched after this list shows how such fused, code-flavored tokens can be surfaced.)
  • Avoidance and Safety Training Effects: An interesting angle is how the model reacts to these artifacts. Rumbelow and Watkins noted that ChatGPT would go to great lengths to avoid outputting certain glitch words – even fabricating jokes, feigning ignorance, or using spelling-out as an avoidance tactic. This hints that later safety training (RLHF) might have penalized those outputs, or the model “felt” they were not valid words. In one whimsical analysis, a blogger even likened the AI’s evasive maneuvers to psychological defense mechanisms of a traumatized patient. The model spelling a troublesome token letter by letter (e.g. turning “petertodd” into “P-E-T-E-…”) is akin to a person refusing to say a taboo word and spelling it out instead. Such behaviors illustrate the layered complexity: the base model’s data quirks intertwine with high-level instructions to avoid nonsense or profanity, yielding creative but odd outputs.
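
As a final illustration (referenced in the tokenization bullet above), the fused, code-flavored entries can be surfaced straight from the vocabulary itself. The sketch below scans the GPT-2 vocabulary shipped with tiktoken for long, letters-only tokens with several internal capitals, a rough and deliberately over-eager fingerprint for mashed-together identifiers and UI strings; it will also flag perfectly legitimate proper nouns, so the results need human review.

```python
# pip install tiktoken
import re
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Heuristic scan: long, letters-only tokens with several internal capitals
# are often fused code identifiers or UI strings rather than ordinary words.
suspicious = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # skip special tokens and partial-byte entries
    stripped = text.strip()
    if (len(stripped) > 12 and stripped.isalpha()
            and len(re.findall(r"[A-Z]", stripped)) >= 3):
        suspicious.append((token_id, stripped))

for token_id, text in suspicious[:25]:
    print(token_id, text)
```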

Conclusion

From glitchy gobbledygook to unintended insults, these artifacts highlight the less polished corners of an LLM’s knowledge. They are a reminder that beneath the fluent surface of AI-generated English lies a vast patchwork of internet text – complete with typos, metadata, user handles, and encoding errors. Most of the time, models handle this noisy training data gracefully, abstracting the useful patterns. But once in a while, a strange prompt pokes a dormant fragment of that noise, and the result can be baffling or darkly comedic.

Continued research into these anomalies has practical importance: by identifying glitch tokens and weird failure cases, developers can patch models or adjust tokenization to avoid unpredictable behavior. It also offers a fascinating peek into the hidden contents of large training sets. As one Reddit user from the counting saga quipped upon learning their username was a “forbidden token,” it was “amusing that the supposedly near-perfect AI could malfunction like that on a simple word”. Each artifact – be it a mangled word or a burst of gibberish – is essentially a fingerprint of imperfection in the data that taught the AI. And as such, these glitches not only make for quirky AI humor, but also guide us in improving data quality and model robustness going forward.