Cheaper AI Models Win in Trading Strategy Test

In a live simulation testing ten popular LLMs running autonomous trading strategies, cheaper models consistently outperformed more expensive peers. High-priced models, including Claude Opus 4.6, reportedly failed to beat the S&P 500 benchmark. The results suggest model selection for quantitative trading should prioritize empirical performance and cost-efficiency over brand recognition.

The simulation was structured as an "agent swarm" in which ten large language models (LLMs) were given the same goal: design the most profitable and risk-efficient trading strategy. Each model acted as an independent quantitative researcher, creating a research plan, identifying market trends, backtesting strategies, and refining its logic. The models spanned a range of cost and capability, from the high-end Claude Opus 4.6 to mid-tier options like GPT-5.2 and Gemini Pro 3.1, and a "cheap" tier including Kimi K2.5, GPT-5-mini, and Gemini Flash 3.0.

Across three separate experiments, the more expensive models consistently failed to outperform the S&P 500 benchmark, with Claude Opus 4.6 never placing higher than fourth. A key reason for the underperformance of premium models was their tendency to generate overly complex, overfit strategies. These advanced models often relied on speculative patterns and struggled with real-world constraints like transaction costs and slippage, whereas the cheaper models favored simpler, more robust signals that proved more effective. In one run, a strategy from Opus 4.6 lost 73% over two years while the market gained 45%.

This result aligns with a growing discussion in quantitative finance about the trade-offs between model complexity and real-world performance. More advanced AI can process vast alternative datasets, from social media to satellite imagery, but its "black box" nature makes strategies harder to interpret and raises the risk of overfitting.

The cost difference was significant: Claude Opus 4.6 costs approximately 10 times more to run than the winning "cheap" models. For instance, Kimi K2.5 was priced at $0.45 per million input tokens and Gemini Flash 3.0 at $0.50 per million, a major cost-efficiency advantage for freelance developers and startups. The price of LLM inference has been falling rapidly, but at unequal rates across different tasks, making model selection a critical business decision.
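
To make the pricing gap concrete, the sketch below turns the quoted per-million-token prices into a per-run cost. The two input prices and the "approximately 10 times" multiplier come from the article; the token count per run (`ASSUMED_INPUT_TOKENS`) is a hypothetical workload chosen only for illustration, and output-token pricing is ignored because the article does not quote it.

```python
# Input-token prices quoted in the article, in USD per million tokens.
INPUT_PRICE_PER_MILLION = {
    "Kimi K2.5": 0.45,
    "Gemini Flash 3.0": 0.50,
}

# The article says the premium model costs "approximately 10 times more";
# no exact multiplier is given, so 10.0 is an approximation.
PREMIUM_COST_MULTIPLIER = 10.0

# Hypothetical workload (an assumption, not a figure from the article):
# input tokens consumed by one research run.
ASSUMED_INPUT_TOKENS = 20_000_000


def input_cost(price_per_million: float, tokens: int) -> float:
    """Return the USD cost of `tokens` input tokens at the given price."""
    return price_per_million * tokens / 1_000_000


def cost_table(tokens: int = ASSUMED_INPUT_TOKENS) -> dict[str, tuple[float, float]]:
    """Map each cheap model to (its cost, the cost at roughly 10x that price)."""
    table = {}
    for model, price in INPUT_PRICE_PER_MILLION.items():
        cheap = input_cost(price, tokens)
        table[model] = (cheap, cheap * PREMIUM_COST_MULTIPLIER)
    return table


if __name__ == "__main__":
    for model, (cheap, premium) in cost_table().items():
        print(f"{model}: ${cheap:.2f} per run vs about ${premium:.2f} at 10x")
```

Under this assumed workload the absolute numbers are small, but the gap scales linearly with token volume, so it matters most for agents that iterate many times per strategy.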

Other experiments have shown similarly unpredictable results. In a separate real-money crypto trading competition, a model from DeepSeek achieved a 10.11% profit while OpenAI's GPT-5 lost 39.73%. This reinforces the idea that in the dynamic, adversarial environment of financial markets, brand recognition and theoretical capability do not guarantee superior performance.

The ten-model simulation was conducted on the NexusTrade platform, an AI-powered system designed for creating and deploying trading strategies using natural language prompts. The platform's founder, Austin Starks, made the conversations and iterations of each AI agent public, allowing for a transparent review of how each model arrived at its conclusions.
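
The percentage figures quoted in this article are easier to compare as ending balances. The sketch below applies them to a hypothetical $10,000 starting stake (an assumption for illustration, not a figure from either experiment). Only the two-year comparison is annualized, since the article does not state how long the crypto competition ran.

```python
def ending_value(start: float, total_return_pct: float) -> float:
    """Ending balance after a total return given in percent."""
    return start * (1 + total_return_pct / 100)


def annualized_return_pct(total_return_pct: float, years: float) -> float:
    """Compound annual growth rate, in percent, for a total return over `years`."""
    growth = 1 + total_return_pct / 100
    return (growth ** (1 / years) - 1) * 100


# Hypothetical starting stake; not taken from the article.
START = 10_000.0

# Total returns quoted in the article, in percent.
TWO_YEAR_RUN = {"Opus 4.6 strategy": -73.0, "Market": 45.0}
CRYPTO_COMPETITION = {"DeepSeek": 10.11, "GPT-5": -39.73}

if __name__ == "__main__":
    for name, pct in TWO_YEAR_RUN.items():
        print(
            f"{name}: ${ending_value(START, pct):,.0f} "
            f"({annualized_return_pct(pct, 2):+.1f}% per year)"
        )
    # Duration of the crypto competition is not given, so no annualized figure.
    for name, pct in CRYPTO_COMPETITION.items():
        print(f"{name}: ${ending_value(START, pct):,.0f}")
```

On that stake, the two-year run leaves $2,700 for the Opus 4.6 strategy against $14,500 for the market, and the crypto competition leaves $11,011 for DeepSeek against $6,027 for GPT-5.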
