GPT-5.4 Fails Creative Writing Tests

OpenAI's GPT-5.4 is being slammed as a "failure" in creative tasks, performing worse than versions 5.1 and 5.2 with zero SM-Bench leads and increased hallucinations. Writers are launching #keep4o campaigns while critics argue prompts aren't "art." The backlash highlights growing resistance to AI replacing human creativity in writing.

OpenAI's official announcement on March 5, 2026, framed GPT-5.4 not as a creative partner but as a tool for professional work. The company claimed it was its "most factual model yet," reducing false claims by 33% and overall errors by 18% compared to GPT-5.2, focusing on tasks like coding and data analysis. The SM-Bench mentioned in the backlash is a benchmark that reportedly evaluates how often a model prioritizes safety over common-sense answers. A low score, as reported for GPT-5.4, suggests an overly restrictive nature, which can stifle the nuanced and unpredictable outputs required for creative writing. While flunking creative tests, GPT-5.4 excelled at professional ones, scoring 87.5% on a benchmark mimicking tasks for junior investment banking analysts—a significant jump from GPT-5.2's 68.4%. On a benchmark for navigating computer desktop environments, it even surpassed average human performance, scoring 75% versus the human baseline of 72.4%. The #keep4o campaign has historical precedent. It first erupted in August 2025 when OpenAI tried to retire the GPT-4o model. Users at the time described the loss as losing a "confidant," and a Change.org petition to save the model garnered nearly 21,000 signatures, forcing OpenAI to reinstate it. The debate over whether AI prompts constitute "art" is a documented conflict in the field. Some experts, including scientists at Google's DeepMind, have called prompt engineering a "fad" and a "poor user interface" that holds back true natural language interaction. Conversely, proponents argue that prompt creation is a genuine artistic skill, comparing it to photography in its early days when critics dismissed it as a mechanical process. They argue that the difference between a basic query and a masterfully crafted prompt is as vast as the one between a simple snapshot and a work by a master photographer. This all unfolds as OpenAI has compressed its release cycles from years to mere months. GPT-5.4 was released just days after GPT-5.3, an accelerated pace that suggests a strategic prioritization of enterprise-friendly features like accuracy and safety over more subjective capabilities like creative expression.

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.