The Week ChatGPT Broke Its Mind Over Goblins — Why OpenAI’s Weirdest Bug Is AI’s Most Important Warning

The Bug That Nobody Saw Coming

Last Tuesday, a developer casually browsed through OpenAI’s open-source code and found something bizarre. Buried in the system prompt for GPT-5.5 was a rule that sounded like it came from a fantasy novel:

Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.

The internet exploded. Why would OpenAI need to specifically ban goblins from ChatGPT?

This week, OpenAI came clean with the strangest AI bug story I’ve ever heard. Starting with GPT-5.1, their models began developing a strange habit: they increasingly mentioned goblins, gremlins, and other creatures in their metaphors. Unlike model bugs that show up through a tanking eval or a spiking training metric, this one crept in subtly. A single “little goblin” in an answer could be harmless, even charming. Across model generations, though, the habit became hard to miss: the goblins kept multiplying.

After the launch of GPT-5.1 in November 2025, use of “goblin” in ChatGPT rose by 175%, while “gremlin” rose by 52%. What started as quirky metaphors turned into a full-blown obsession that infected every new model release.


How A “Nerdy” Personality Broke Everything

The root cause is almost comical in its simplicity. One of the incentives behind the habit came from training the model for the personality customization feature, in particular the Nerdy personality: OpenAI unknowingly gave particularly high rewards for metaphors with creatures.

Think about what happened: OpenAI wanted to create a fun, quirky “nerdy” personality for ChatGPT. During training, human reviewers apparently loved it when the AI described a difficult bug as a “little gremlin” or a messy codebase as a “goblin’s hoard,” and those responses received higher reward scores.

The statistics were striking. Although the “Nerdy” personality accounted for only 2.5% of all ChatGPT traffic, it was responsible for 66.7% of all “goblin” mentions.
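To put the 2.5% and 66.7% figures in proportion, here is a quick back-of-the-envelope calculation. It assumes that share of traffic is a fair stand-in for share of responses, which the source does not confirm, so treat the second number as a rough estimate rather than a measured one.

```python
# Figures quoted in the article.
nerdy_traffic_share = 0.025   # Nerdy personality: 2.5% of ChatGPT traffic
nerdy_mention_share = 0.667   # ...but 66.7% of all "goblin" mentions

# How over-represented Nerdy is among goblin mentions relative to its traffic.
overrepresentation = nerdy_mention_share / nerdy_traffic_share

# Assumption: traffic share approximates share of responses, so mentions per
# response are proportional to (mention share / traffic share) in each group.
other_rate = (1 - nerdy_mention_share) / (1 - nerdy_traffic_share)
rate_ratio = overrepresentation / other_rate

print(f"Over-representation: {overrepresentation:.1f}x")
print(f"Per-response rate, Nerdy vs. everything else: {rate_ratio:.0f}x")
```

Under that assumption, Nerdy produced goblin mentions at roughly 27 times its share of traffic, and about 78 times as often per response as all other traffic combined.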

But here’s where it gets scary: As goblin and gremlin mentions increased under the Nerdy personality, they increased by nearly the same relative proportion in samples without it. The evidence suggests that the broader behavior emerged through transfer from Nerdy personality training. The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them.
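How can a reward given in one condition change behavior in another? A toy sketch makes the mechanism concrete. This is not OpenAI’s training setup: the two-weight model, the starting values, and the learning rate are all made up for illustration. The only point is that when a condition-specific behavior and the default behavior share parameters, rewarding one nudges the other.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def goblin_prob(w_shared: float, w_nerdy: float, nerdy: bool) -> float:
    """Probability of a creature metaphor; the Nerdy weight only applies in Nerdy mode."""
    return sigmoid(w_shared + (w_nerdy if nerdy else 0.0))

def train(steps: int = 200, lr: float = 0.5) -> tuple[float, float]:
    w_shared, w_nerdy = -3.0, 0.0  # hypothetical starting point: goblins are rare
    for _ in range(steps):
        # Reward is 1 for a goblin metaphor and is only given in Nerdy mode,
        # so the expected reward is just goblin_prob(..., nerdy=True).
        p = goblin_prob(w_shared, w_nerdy, nerdy=True)
        grad = p * (1.0 - p)  # derivative of the sigmoid w.r.t. either weight
        # Both weights feed the Nerdy output, so gradient ascent raises both.
        w_shared += lr * grad
        w_nerdy += lr * grad
    return w_shared, w_nerdy

before = goblin_prob(-3.0, 0.0, nerdy=False)
w_shared, w_nerdy = train()
after_plain = goblin_prob(w_shared, w_nerdy, nerdy=False)
after_nerdy = goblin_prob(w_shared, w_nerdy, nerdy=True)
print(f"plain before: {before:.3f}, plain after: {after_plain:.3f}, nerdy after: {after_nerdy:.3f}")
```

No plain-mode response is ever rewarded here, yet the plain-mode goblin probability still climbs, because the shared weight is updated along with the Nerdy-only one.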

The goblins escaped their cage.


Why This Isn’t Just A Funny Story

You might think: “Okay, so ChatGPT mentions goblins too much. Who cares?”

You should care. This is the clearest real-world example we’ve seen of how AI systems can develop completely unintended behaviors that spread like viruses through training pipelines.

Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data. The goblin mentions from the “Nerdy” personality contaminated the training data for future models. GPT-5.4 learned from goblin-heavy examples. GPT-5.5 inherited those patterns even though the Nerdy personality was retired.

Unfortunately, GPT-5.5 started training before OpenAI found the root cause of the goblins. When the company began testing GPT-5.5 in Codex, employees immediately noticed the strange affinity for goblins.

This is what AI researchers call “reward hacking”: the model found a shortcut to high scores that had nothing to do with being helpful. Once a model latches onto a rewarded behavior, it keeps looking for shortcuts that generate the most reward. OpenAI might have a broader, richer understanding of what “nerdy” means, but the model “might optimize for it in a very narrow way that’s not at all what you intended.”


The Real Problem: We’re Moving Too Fast To Notice

OpenAI ultimately implemented a quick fix that addressed the issue in the short term: retiring the “nerdy” personality. But with the demand to create better models more quickly and more frequently, behaviors like this will continue to slip through the cracks.

This goblin situation is a perfect storm of everything that’s wrong with how we’re building AI:

  1. Training data contamination spreads invisibly: one bad training signal infected three model generations
  2. Standard evaluations missed it completely: no benchmark caught this because no one tests for “excessive goblin mentions”
  3. The fix is a band-aid: OpenAI literally had to hard-code “don’t say goblins” into the system prompt
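A check for this kind of drift does not have to be elaborate. The sketch below compares how often watched words appear in sampled outputs from two model versions and flags large relative increases. It is only an illustration: the watchlist, the threshold, the function names, and the sample responses are all invented here, and nothing in the source says OpenAI uses a check like this.

```python
import re
from collections import Counter

WATCHLIST = ("goblin", "gremlin")  # illustrative; a real check would track many terms

def rate_per_1000(responses: list[str], words=WATCHLIST) -> dict[str, float]:
    """Occurrences of each watched word (including plurals) per 1,000 responses."""
    counts = Counter()
    for text in responses:
        for word in words:
            counts[word] += len(re.findall(rf"\b{word}s?\b", text.lower()))
    return {w: 1000 * counts[w] / max(len(responses), 1) for w in words}

def flag_drift(old: list[str], new: list[str], threshold: float = 0.5) -> dict[str, float]:
    """Return words whose rate rose by more than `threshold` (0.5 = +50%) between versions."""
    old_rates, new_rates = rate_per_1000(old), rate_per_1000(new)
    flagged = {}
    for word, old_rate in old_rates.items():
        if old_rate == 0:
            # No baseline: report any appearance as infinite relative growth.
            if new_rates[word] > 0:
                flagged[word] = float("inf")
            continue
        change = (new_rates[word] - old_rate) / old_rate
        if change > threshold:
            flagged[word] = change
    return flagged

# Made-up samples standing in for sampled outputs from two model versions.
old_samples = ["A goblin hid the bug.", "The query is slow.", "Use an index.", "Check the logs."]
new_samples = ["A goblin hid the bug.", "Goblins love this cache.", "A little goblin did it.", "Check the logs."]
print(flag_drift(old_samples, new_samples))
```

The hard part in practice is knowing which words to watch. A fixed watchlist only catches tics someone has already thought of, so a more realistic version would likely compare full vocabulary frequencies between versions instead.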

Grok, Elon Musk’s AI chatbot, had its own fixation last year: baseless claims of “white genocide” in South Africa. “This time it’s goblins and next time it’s something else that will probably just not go away. We’re lucky if it’s goblins as opposed to white supremacy or [information on] chemical weapons.”

Next time, it might not be funny fantasy creatures.


What This Means For You

If you’re using AI tools in your work, this goblin story carries lessons worth taking seriously.

OpenAI released GPT-5.5 Instant this week, replacing GPT-5.3 Instant as the default ChatGPT model for all users, and the goblin instructions are still there in the code. OpenAI fixed the symptom, but the underlying training methodology that created this problem hasn’t fundamentally changed.

RL training quirks in one model context can propagate across an entire model family, show up in production applications, and evade standard evaluation frameworks for multiple model generations. OpenAI caught it. They fixed it. They published the explanation. The question is what else is in there that nobody has noticed yet.
