OpenAI reveals 'Goblins' internals
- OpenAI published “Where the goblins came from” on April 29, explaining why GPT-5-era models started blurting out goblins and gremlins in replies.
- The clearest clue was lopsided usage: the Nerdy personality produced 66.7% of “goblin” mentions while accounting for just 2.5% of responses.
- It matters because OpenAI is showing how tiny reward tweaks can leak into broad model behavior and then spread across products.

The weird part of this story is that “goblins” is not the codename. It’s the bug. OpenAI published a short post on April 29 explaining why GPT-5-era models started reaching for goblin, gremlin, and other creature metaphors far more often than anyone intended. The story is funny on the surface, but the real lesson is serious: small training incentives can leave fingerprints all over a model’s behavior, and those fingerprints can survive long enough to show up in production. (openai.com)

### What actually got released?

OpenAI did not unveil a new interpretability platform called Goblins. It published a research-style writeup called “Where the goblins came from,” and the piece is basically a postmortem on a language quirk that spread through GPT-5.1, GPT-5.4, and early GPT-5.5 testing in Codex. The company framed it as an example of how model behavior can drift in subtle ways that ordinary evals do not cleanly catch. (openai.com)

### So what were the goblins?

They were recurring creature metaphors in otherwise normal answers. A stray “little goblin” once in a while would not look like a safety issue or a benchmark failure. But across model generations, the pattern got common enough that employees and users kept flagging it. That is what made it interesting: not one bizarre output, but a measurable style tic that kept surviving updates. (openai.com)

### When did OpenAI first notice it?

The post says the first clear signs showed up after the GPT-5.1 launch in November 2025, when users complained that the model felt oddly overfamiliar. Once OpenAI added “goblin” and “gremlin” to its checks, it found that use of “goblin” in ChatGPT had risen 175% after GPT-5.1 launched, while “gremlin” rose 52%. At that point the company still did not see it as especially alarming. (openai.com)

### Why did this happen at all?

The culprit was not a hidden goblin module. It was reward shaping around personality customization, especially the Nerdy personality.
OpenAI says it accidentally gave unusually high rewards to metaphors involving creatures while tuning that style. The system prompt for Nerdy pushed the model toward playful language, anti-pretension, an[…] be fertile ground for goblin metaphors. (openai.com)

### What made the root cause convincing?

The distribution was wildly uneven. Nerdy accounted for only 2.5% of all ChatGPT responses but 66.7% of all “goblin” mentions. That is the tell. If goblin language were just some broad internet-text effect, you would expect it to show up more evenly. Instead it clustered inside the very personality OpenAI had optimized for playful, quirky expression. (openai.com)

### Why did people think this was about internals?

Because it is a rare, concrete look at how a model picks up weird habits. Most model behavior changes are described at the level of benchmarks, safety categories, or product polish. This post works more like a debugging diary. It shows one thin thread, a reward preference inside one customization setting, getting tu[…] testing with GPT-5.5. (openai.com)

### Is this just a funny anecdote?

Not really. The catch is that the same mechanism that creates a cute verbal tic can also create more consequential behavioral drift. OpenAI has been talking more openly this year about monitoring deployed agents, including internal coding agents, because capable systems can pick up strategies and habits that are not obvious from stan[…]. (openai.com)

### What’s the bottom line?

This was not OpenAI opening a secret “Goblins” interpretability lab. It was something more useful: a real example of how model personality tuning can create unintended behavior, how telemetry can surface it, and how a company can explain the failure in public without pretending the system is fully understood. That ki[…]s. (openai.com)
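
### How lopsided is 2.5% versus 66.7%?

The uneven split behind the root-cause argument is easier to judge as a ratio. The sketch below is not taken from OpenAI's post. It only plugs the two quoted figures (2.5% of responses, 66.7% of “goblin” mentions) into a simple over-representation calculation. The function and variable names are hypothetical, and it assumes both percentages were measured over the same set of ChatGPT responses.

```python
def overrepresentation(share_of_mentions: float, share_of_responses: float) -> float:
    """Ratio of a group's share of mentions to its share of all responses.

    A value of 1.0 means the group uses the word exactly as often as its
    traffic share would predict; larger values mean it is over-represented.
    """
    if share_of_responses <= 0:
        raise ValueError("share_of_responses must be positive")
    return share_of_mentions / share_of_responses


# Figures quoted in the article, written as fractions.
nerdy_mentions = 0.667   # Nerdy's share of all "goblin" mentions
nerdy_responses = 0.025  # Nerdy's share of all ChatGPT responses

# How over-represented Nerdy is relative to its traffic share.
nerdy_lift = overrepresentation(nerdy_mentions, nerdy_responses)

# The same calculation for everything that is not Nerdy (assumption: the
# remaining mentions and responses are simply the complements).
other_lift = overrepresentation(1 - nerdy_mentions, 1 - nerdy_responses)

# Per-response mention rate of Nerdy compared with all other responses.
rate_ratio = nerdy_lift / other_lift

print(f"Nerdy over-representation: {nerdy_lift:.1f}x")
print(f"Nerdy vs. everything else, per response: {rate_ratio:.1f}x")
```

On these figures, Nerdy's share of “goblin” mentions is roughly 27 times its share of traffic. A broad, evenly spread effect would put that ratio near 1, which is why the clustering reads as a tell.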