GPT-5.4 Fails Creative Writing Tests
OpenAI's GPT-5.4 is being slammed as a "failure" in creative tasks, performing worse than versions 5.1 and 5.2 with zero SM-Bench leads and increased hallucinations. Writers are launching #keep4o campaigns while critics argue prompts aren't "art." The backlash highlights growing resistance to AI replacing human creativity in writing.
OpenAI's official announcement on March 5, 2026, framed GPT-5.4 not as a creative partner but as a tool for professional work. The company claimed it was its "most factual model yet," reducing false claims by 33% and overall errors by 18% compared to GPT-5.2, focusing on tasks like coding and data analysis.
The SM-Bench mentioned in the backlash is a benchmark that reportedly evaluates how often a model prioritizes safety over common-sense answers. A low score, as reported for GPT-5.4, suggests an overly restrictive nature, which can stifle the nuanced and unpredictable outputs required for creative writing.
While flunking creative tests, GPT-5.4 excelled at professional ones, scoring 87.5% on a benchmark mimicking tasks for junior investment banking analysts, a significant jump from GPT-5.2's 68.4%. On a benchmark for navigating computer desktop environments, it even surpassed average human performance, scoring 75% versus the human baseline of 72.4%.
The #keep4o campaign has historical precedent. It first erupted in August 2025 when OpenAI tried to retire the GPT-4o model. Users at the time described the loss as losing a "confidant," and a Change.org petition to save the model garnered nearly 21,000 signatures, forcing OpenAI to reinstate it.
The debate over whether AI prompts constitute "art" is a documented conflict in the field. Some experts, including scientists at Google's DeepMind, have called prompt engineering a "fad" and a "poor user interface" that holds back true natural language interaction. Conversely, proponents argue that prompt creation is a genuine artistic skill, comparing it to photography in its early days when critics dismissed it as a mechanical process. They argue that the difference between a basic query and a masterfully crafted prompt is as vast as the one between a simple snapshot and a work by a master photographer.
This all unfolds as OpenAI has compressed its release cycles from years to mere months. GPT-5.4 was released just days after GPT-5.3, an accelerated pace that suggests a strategic prioritization of enterprise-friendly features like accuracy and safety over more subjective capabilities like creative expression.