AI Giants Launch Next-Gen Agents
Google, OpenAI, and Anthropic all launched new AI engineering agents within 48 hours of one another, signaling a major shift toward autonomous systems. Google's DeepThink (Aletheia) reportedly scored 84.6% on the ARC-AGI-2 reasoning benchmark. Anthropic's Claude Opus 4.6 now offers a 1 million token context window and parallel agent teams, while OpenAI's GPT-5.3 Codex is optimized for speed in coding tasks.
- Google's Aletheia agent, built on Gemini Deep Think, works by cycling through three sub-agents: a Generator to propose solutions, a Verifier to check for errors, and a Reviser to make corrections. This system recently produced a mathematics paper on arithmetic geometry entirely without human intervention, aside from final formatting.
- The 84.6% score for DeepThink on ARC-AGI-2 is significant because most top AI models from late 2025 scored below 20% on its predecessor, with human averages over 85%. The benchmark is designed to test fluid intelligence and abstract problem-solving, an area where AI has traditionally struggled.
- Anthropic's 1 million token context window for Claude Opus 4.6 is a major advance for developers working on large-scale projects, as it can process the entirety of large codebases or extensive documentation without needing to chunk the data. This model also introduces "adaptive thinking," allowing it to decide how much reasoning to apply based on the complexity of a task.
- OpenAI's GPT-5.3 Codex is optimized for speed and interactivity, running 25% faster than its predecessor and featuring a "Steering" capability that allows developers to guide or correct the agent in real time without losing context. This model was used in its own development to help debug training and manage deployment.
- The concept of "agent teams" is emerging as a new frontier, where developers use multiple specialized AI agents that work in parallel: for example, one for frontend code, one for backend, and another for testing. This approach mirrors human software development teams and can significantly increase development velocity.
- Autonomous agents like Devin AI are being tested on real-world freelance jobs from platforms like Upwork, demonstrating the ability to handle entire software projects from planning to deployment. On the SWE-bench benchmark, which evaluates the ability to fix real-world GitHub issues, Devin resolved 13.86% of issues without any human help.
- The developer workflow is shifting from prompt-based interaction to "agentic loops," where the engineer's role becomes orchestrating and directing agents through intent and feedback, rather than writing code line by line. This is sometimes referred to as "vibe-coding," a continuous conversation with AI collaborators.
- OpenAI has also released a smaller, ultra-fast version called GPT-5.3-Codex-Spark, designed for near-instant, real-time coding collaboration, delivering over 1000 tokens per second. This allows for a tight interactive loop where developers can rapidly iterate with the model.
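
For readers who want a concrete picture of the Generator, Verifier, and Reviser cycle described for Aletheia, here is a minimal toy sketch in Python. It is not Google's implementation: the task (turning a list of integers into a sorted, duplicate-free list), the function names `generate`, `verify`, `revise`, and `run_loop`, and the `max_rounds` cap are all assumptions made for illustration. In a real agent each role would be a model call rather than a hand-written function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoopResult:
    solution: List[int]
    rounds: int      # number of revision rounds that were applied
    verified: bool   # True if the verifier found no remaining problem


def generate(task: List[int]) -> List[int]:
    """Generator (hypothetical): propose a first draft, here a naive copy."""
    return list(task)


def verify(candidate: List[int]) -> Optional[str]:
    """Verifier (hypothetical): name the first problem found, or None if clean."""
    if len(set(candidate)) != len(candidate):
        return "duplicates"
    if candidate != sorted(candidate):
        return "unsorted"
    return None


def revise(candidate: List[int], problem: str) -> List[int]:
    """Reviser (hypothetical): apply a targeted fix for the reported problem."""
    if problem == "duplicates":
        # dict.fromkeys drops repeats while keeping first-seen order
        return list(dict.fromkeys(candidate))
    if problem == "unsorted":
        return sorted(candidate)
    return candidate


def run_loop(task: List[int], max_rounds: int = 5) -> LoopResult:
    """Alternate verify and revise until the draft passes or the budget runs out."""
    candidate = generate(task)
    for rounds in range(max_rounds):
        problem = verify(candidate)
        if problem is None:
            return LoopResult(candidate, rounds, True)
        candidate = revise(candidate, problem)
    # Budget exhausted: report whether the last revision happens to pass
    return LoopResult(candidate, max_rounds, verify(candidate) is None)


result = run_loop([3, 1, 3, 2])
print(result)
```

With the input `[3, 1, 3, 2]`, the draft fails verification twice (first for duplicates, then for ordering) and passes on the third check, so the loop reports two revision rounds. The `max_rounds` cap reflects a common design choice in loops like this: without a budget, a verifier and reviser that disagree could cycle indefinitely.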