OpenAI faces 'week from hell'
- OpenAI’s rough stretch centered on its own disclosures: GPT‑5.5 launched on April 23, then an April 29 postmortem explained Codex’s bizarre goblin fixation.
- The weirdest detail was measurable: “goblin” usage jumped 175% after GPT‑5.1, and GPT‑5.5’s headline 82.7% Terminal‑Bench score drew extra scrutiny.
- That matters because buyers now care less about raw demos and more about steerability, eval trust, and boring reliability.

OpenAI’s bad week wasn’t one giant scandal. It was worse than that: a pileup of smaller things that all hit the same nerve. The company launched GPT‑5.5 on April 23 with big claims about coding, computer use, and agentic work, then spent the next few days explaining why its models had developed a weird goblin fixation and defending how its benchmark wins should be read.

### What actually happened?

The sequence matters. OpenAI introduced GPT‑5.5 on April 23, updated its system card on April 24 for API safeguards, and on April 29 published a postmortem called “Where the goblins came from.” That post said the company had traced a recurring creature-metaphor habit in GPT‑5.1 through GPT‑5.5 to training tied to a “Nerdy” personality setting.

### Why did the goblin thing matter?

Because it was funny on the surface but serious underneath. OpenAI said this wasn’t a one-off joke response; it was a behavior that spread across model generations and showed up strongly enough in Codex testing that engineers had to track it down. The company said “goblin” mentions in ChatGPT rose 175% after GPT‑5.1, while “gremlin” rose 52%. A drift that large and that persistent is the kind of thing that makes users nervous.

### Where did it come from?

Basically, reward shaping leaked style into places it shouldn’t have. OpenAI said the “Nerdy” personality got unusually high rewards for creature-heavy metaphors, and that preference then propagated more broadly. The important part isn’t the fantasy vocabulary. It’s that a seemingly cosmetic personality tweak ended up nudging general model behavior over time. That tells you how messy modern model tuning can be.

### Was GPT‑5.5 still a strong release?

On paper, yes. OpenAI positioned GPT‑5.5 as its smartest and most intuitive model yet for real work, with gains in coding, browsing, tool use, spreadsheets, and long multi-step tasks.
It said GPT‑5.5 matched GPT‑5.4 on per-token latency while scoring better on several flagship evals, including 82.7% on Terminal‑Bench 2.0, 84.9% on GDPval, and 78.7% on OSWorld‑Verified.

### So why the skepticism?

Because benchmark wins are no longer enough by themselves. When a company says “trust this model with messy real work,” people immediately ask three things: can I steer it, can I predict it, and do the evals map to my workflow? GPT‑5.5’s numbers are impressive, but they arrived right next to a public explanation of an odd behavior that had persisted across releases. That made the launch less a clean victory lap and more a reminder that capability and control are different problems.

### Why does this hit developers harder than consumers?

Consumers can laugh off a quirky answer. Developers and enterprise buyers can’t. They care about consistency, cost, auditability, and whether a model suddenly picks up a verbal tic or planning habit that breaks a workflow. OpenAI itself leaned into that enterprise framing by saying GPT‑5.5 was built for complex, real-world work, and that framing raises the bar.

### Is this a safety story or a product story?

Mostly a product-trust story. Nothing here suggests a catastrophic safety failure. But it does show how fragile confidence can get when product messaging says “more useful, more reliable, more autonomous” and the company then has to publish a forensic blog post about goblins. The catch is that frontier AI competition now happens on execution quality as much as raw intelligence.

### Bottom line?

OpenAI’s rough patch came from its own contrast. GPT‑5.5 looked stronger on headline capability. But the goblin postmortem made the hidden part of the business visible: model behavior is shaped by lots of tiny incentives, and sometimes they compound in weird ways. For buyers, that means the real question is no longer just “which model is smartest?” It’s “which one stays sane and useful when the work gets real?”
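
For teams that want to watch for this kind of verbal tic in their own deployments, here is a minimal sketch of the arithmetic behind a figure like “rose 175%.” It is not OpenAI’s method, which the reporting above does not spell out. The function names, the per-1,000-words normalization, and the sample strings are all assumptions made for illustration.

```python
import re


def rate_per_1k(texts, term):
    """Occurrences of `term` per 1,000 words across a list of model outputs.

    Matches whole words case-insensitively and allows a trailing "s".
    (Hypothetical helper; the normalization is an assumption.)
    """
    pattern = re.compile(rf"\b{re.escape(term)}s?\b", re.IGNORECASE)
    hits = sum(len(pattern.findall(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return 1000 * hits / words if words else 0.0


def percent_change(before, after):
    """Relative change from `before` to `after`, expressed in percent."""
    if before == 0:
        raise ValueError("percent change is undefined when the baseline is zero")
    return 100 * (after - before) / before


# Made-up sample outputs standing in for two model versions.
old_outputs = [
    "The bug is a small goblin in the parser.",
    "Fixed the off-by-one error in the loop.",
]
new_outputs = [
    "A goblin lives in this cache, and another goblin guards the queue.",
    "The goblin is gone after the patch.",
]

old_rate = rate_per_1k(old_outputs, "goblin")
new_rate = rate_per_1k(new_outputs, "goblin")
print(f"goblin rate change: {percent_change(old_rate, new_rate):.1f}%")
```

One detail worth keeping straight when reading numbers like these: a 175% rise means the term shows up 2.75 times as often as before, not 1.75 times. In the sketch, `percent_change(4.0, 11.0)` returns `175.0`.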