Amazon Bedrock rolls out AgentCore tools

- Amazon’s Bedrock AgentCore is turning into an agent-testing stack, not just a deployment layer, with Evaluations now generally available after its March 31 launch.
- The concrete hook is 13 built-in evaluators, plus online scoring of live production traces and on-demand tests for CI/CD-style regression checks.
- That matters because agent teams need repeatable measurement, not demo wins, as Bedrock pushes deeper into production AI operations.


Amazon Bedrock AgentCore is starting to look less like “hosting for agents” and more like a full operating system for keeping them under control. That’s the real story here. AWS has been filling in the missing layer between a flashy agent demo and something a company can safely run in production, and the latest push is around evaluation, monitoring, and controlled iteration. AgentCore Evaluations went generally available on March 31, 2026, and AWS has kept adding tooling around it through docs, starter kits, and CLI workflows.

### What is AgentCore actually for?

AgentCore is AWS’s platform for building, deploying, and operating AI agents with managed infrastructure, memory, tool access, observability, and security controls. The pitch is simple: use whatever model or framework you want, but run the agent inside AWS guardrails instead of stitching everything together yourself.

### Why are evals the important part?

Because agents fail in messier ways than normal software. A web app usually breaks in a visible, reproducible way. An agent can choose the wrong tool, call the right tool with the wrong parameters, or produce a polished final answer from bad intermediate steps. And because LLMs are non-deterministic, one successful run proves almost nothing. AWS is leaning hard into that problem statement in its own materials.

### So what did AWS add?

The headline feature is AgentCore Evaluations. It gives developers two main modes: online evaluation, which samples and scores live production traces, and on-demand evaluation, which works more like a test harness for development and CI/CD pipelines. AWS says the service includes 13 built-in evaluators covering things like response quality, safety, task completion, and tool usage.

### What gets measured?

Not just the final answer. AgentCore can score end-to-end goal attainment, correctness, tool-use accuracy, and custom business-specific metrics.
That matters because the hard part of agents is often hidden in the middle: routing, tool selection, and whether the model did the right thing for the right reason. AWS also supports ground-truth checks like reference answers, behavioral assertions, and expected tool-execution sequences.

### Can teams customize this?

Yes, and that’s where the platform gets more serious. AWS lets teams use built-in evaluators, create LLM-based custom evaluators with their own prompts and models, or write code-based evaluators in Python or JavaScript through Lambda-hosted functions. In other words, a company can grade an agent on generic qualities like helpfulness, but also on internal rules that actually matter to its business.

### Is this only for Bedrock-native agents?

No. AWS is pretty explicit that AgentCore Evaluations can score agents running inside AgentCore Runtime and agents hosted outside AgentCore too. It plugs into frameworks like Strands and LangGraph through OpenTelemetry and OpenInference instrumentation, then converts traces into a unified format for scoring. Basically, AWS wants the evaluation layer to become sticky even if the rest of the stack stays mixed.

### Why does this matter now?

Because the market is moving from “can you build an agent?” to “can you prove the agent got better?” That sounds subtle, but it’s the whole game in production. AWS is betting that enterprises will trust agent platforms that offer repeatable tests, live monitoring, policy controls, and audit trails, not just model access and orchestration. AgentCore’s recent updates fit that exact pattern.

### What’s the bottom line?

The news is not that AWS invented agent evals. It’s that Bedrock is packaging them into managed infrastructure and making them part of the default production workflow. That shifts AgentCore from “a place to run agents” toward “a place to improve agents without guessing.”
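
To make the "expected tool-execution sequence" idea a bit more concrete, here is a minimal Python sketch of what a code-based check of that kind could look like. It is not the AgentCore Evaluations API: the `ToolCall` shape, the function names, and the tool names (`lookup_order`, `issue_refund`, `search_faq`) are all hypothetical, chosen only for illustration. The second function reflects the point about non-determinism by scoring several runs of the same task rather than one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation pulled from an agent trace (hypothetical shape)."""
    name: str
    arguments: dict


def tool_sequence_score(expected: list[str], calls: list[ToolCall]) -> float:
    """Fraction of the expected sequence matched, in order, by the trace.

    Extra calls between expected steps are tolerated; a missing or
    out-of-order step stops the match and lowers the score.
    """
    if not expected:
        return 1.0
    matched = 0
    for call in calls:
        if matched < len(expected) and call.name == expected[matched]:
            matched += 1
    return matched / len(expected)


def pass_rate(expected: list[str], runs: list[list[ToolCall]],
              threshold: float = 1.0) -> float:
    """Share of repeated runs whose sequence score meets the threshold."""
    if not runs:
        raise ValueError("need at least one run")
    passed = sum(
        1 for calls in runs
        if tool_sequence_score(expected, calls) >= threshold
    )
    return passed / len(runs)


if __name__ == "__main__":
    # Assumed example task: a refund should look up the order first.
    expected = ["lookup_order", "issue_refund"]
    runs = [
        [ToolCall("lookup_order", {"order_id": "A1"}),
         ToolCall("issue_refund", {"order_id": "A1"})],
        [ToolCall("issue_refund", {"order_id": "A1"})],  # skipped the lookup
        [ToolCall("lookup_order", {"order_id": "A1"}),
         ToolCall("search_faq", {"q": "refund"}),
         ToolCall("issue_refund", {"order_id": "A1"})],
    ]
    print(f"pass rate over {len(runs)} runs: {pass_rate(expected, runs):.2f}")
```

In a CI/CD-style regression check, a team might fail the build when the pass rate for a fixed set of tasks drops below an agreed floor. Where that floor sits, and how many repeated runs are enough, is a judgment call this sketch does not settle.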