Open-Weight Models Challenge Closed AI Systems on Agentic Tasks

Open-weight models are increasingly competitive with leading closed-source alternatives from OpenAI and Anthropic. GLM-5, a model developed by Zhipu AI and Tsinghua University, now tops open benchmarks for both code and text generation. The performance of models like GLM-5 on agentic tasks indicates a democratization of access to high-performing AI.

- GLM-5 is a 744-billion-parameter Mixture of Experts (MoE) model, with 44 billion parameters active during any given inference operation. It was trained entirely on Huawei Ascend chips, without using NVIDIA hardware, and supports a context window of up to 200,000 tokens.
- In agentic task benchmarks, GLM-5's performance is competitive with leading closed-source models. For instance, on SWE-bench Verified, a software engineering benchmark, GLM-5 achieves a score of 77.8%, placing it ahead of some versions of GPT but behind Claude Opus 4.5. On Vending Bench 2, which measures long-term operational capabilities, GLM-5 ranked first among open-source models with an account balance of $4,432.
- Reinforcement Learning from Human Feedback (RLHF) is a critical workflow for training models on agentic tasks. It reduces the need for extensive manual data labeling by using human preference rankings on model outputs to train a reward model. However, the process can be resource-intensive, often requiring tens of thousands of human preference labels to fine-tune a large language model.
- Constitutional AI, an approach developed by Anthropic, offers a more scalable alternative to traditional RLHF by using a predefined set of principles (a "constitution") to guide the model's behavior. This allows the AI to critique and revise its own outputs, reducing the reliance on constant human feedback and mitigating potential human bias.
- Evaluating agentic AI requires moving beyond traditional metrics to assess the entire system's behavior, including task completion rates, tool-use accuracy, and reasoning coherence across multi-step workflows. Benchmarks like the Berkeley Function-Calling Leaderboard (BFCL) and ToolBench are emerging as standards for evaluating a model's ability to interact with external tools and APIs.
- While synthetic data can be generated much faster and at a lower cost than human-labeled data, it often lacks the nuance and contextual understanding required for complex tasks. A hybrid approach, using synthetic data for scale and human annotation for critical edge cases and quality assurance, is often the most effective solution for training high-performing models.
- The fundraising climate for AI infrastructure startups has become more challenging in 2026 compared to the "gold rush" of 2023-2024, with investors now focusing more on burn rates and profitability. Despite a tougher environment, global AI funding reached over $202 billion in 2025, with a significant portion directed towards AI infrastructure and foundation models.
- The rise of AI is expected to displace millions of jobs, particularly in administrative and data entry roles, while also creating new opportunities. The World Economic Forum predicts that by 2027, AI could displace 83 million jobs while creating 69 million new ones. The impact on the labor market will also affect job quality, including wages and working conditions, due to the increased use of algorithmic management tools.
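
To make the RLHF point above more concrete, here is a minimal Python sketch of how human preference rankings can be turned into a training signal for a reward model. The briefing does not say which objective is used, so the pairwise Bradley-Terry-style loss below is an assumption (it is a common choice in published RLHF work), and the function names and example scores are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Assumed Bradley-Terry-style loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    output higher than the rejected one, and large when it does not.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(scored_pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen_score, rejected_score) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in scored_pairs) / len(scored_pairs)


if __name__ == "__main__":
    # Hypothetical reward-model scores for three human-labeled comparisons.
    example_pairs = [(1.8, 0.2), (0.5, 0.9), (2.1, 2.0)]
    print(f"mean preference loss: {mean_preference_loss(example_pairs):.4f}")
```

Each human label contributes one such pair, which gives a sense of why tens of thousands of comparisons may be needed before the reward model's scores track human preferences reliably.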
