Databricks: eval frameworks = 6x success
Databricks says teams that use systematic evaluation frameworks see 6x higher production success for AI agents. A 5‑step Travel Code process (define metrics, test edge cases, simulate load, run A/B tests, monitor) was highlighted as a repeatable pattern. The claim ties measurable evaluation discipline directly to production outcomes. (x.com)
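Part of the five-step pattern above can be sketched as a small harness. This is a minimal illustration, not Databricks' or Travel Code's implementation: the agents `agent_a` and `agent_b`, the `EvalCase` type, the `exact_match` metric, and the 0.8 threshold are all hypothetical, and only the metric, edge-test, and A/B steps are covered.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Define metrics: 1.0 when the normalized answer matches, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


# Edge tests: hypothetical cases, including odd casing and empty input.
CASES: List[EvalCase] = [
    EvalCase("capital of France", "Paris"),
    EvalCase("CAPITAL OF FRANCE", "Paris"),
    EvalCase("", "I need a question."),
]


def agent_a(prompt: str) -> str:
    """Hypothetical baseline: case-sensitive lookup, no empty-input handling."""
    return {"capital of France": "Paris"}.get(prompt, "unknown")


def agent_b(prompt: str) -> str:
    """Hypothetical candidate: normalizes input and handles empty prompts."""
    if not prompt.strip():
        return "I need a question."
    return {"capital of france": "Paris"}.get(prompt.strip().lower(), "unknown")


def pass_rate(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Average metric score of an agent over the evaluation cases."""
    scores = [exact_match(agent(c.prompt), c.expected) for c in cases]
    return sum(scores) / len(scores)


def ab_gate(baseline, candidate, cases, threshold: float = 0.8) -> bool:
    """A/B: promote only if the candidate meets the bar and does not regress."""
    base, cand = pass_rate(baseline, cases), pass_rate(candidate, cases)
    return cand >= threshold and cand >= base


if __name__ == "__main__":
    print("baseline:", pass_rate(agent_a, CASES))
    print("candidate:", pass_rate(agent_b, CASES))
    print("promote candidate:", ab_gate(agent_a, agent_b, CASES))
```

Keeping the metric, the cases, and the gate as separate functions means the same cases can be rerun on every change, which is the "repeatable" part of the pattern.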
Databricks announced an "Enhanced Agent Evaluation" extension to MLflow on March 12, 2025 that added a Guidelines AI Judge, a Review App for expert feedback, and an evaluation‑dataset SDK to collect labeled traces. (databricks.com)

Databricks' MLflow evaluation docs were last updated on March 3, 2026 and describe MLflow Tracing, reuse of the same LLM judges and scorers in development and production, and built‑in conversation simulation and production monitoring (Beta). (docs.databricks.com)

Databricks introduced Mosaic Agent Bricks at its Data + AI Summit; coverage notes the product automates agent optimization using TAO (Test‑time Adaptive Optimization) and domain‑specific synthetic data generation. (venturebeat.com)

The Agent Bricks product page states the platform provides multi‑AI access across OpenAI, Anthropic and Google models and links agent reasoning to enterprise schemas and Unity Catalog for consistent data semantics. (databricks.com)

Databricks publishes a load‑testing notebook for custom model serving endpoints that shows a Locust example and operational steps for simulating production traffic, with documentation updated September 3, 2025. (docs.databricks.com)

Databricks maintains public solution repositories and an ai‑dev‑kit with MLflow evaluation examples and user‑journeys that demonstrate aligning strategy, building evaluation datasets, and automating judge/scorer pipelines; the databricks‑solutions repo contains recent example commits. (github.com)
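The idea of reusing the same judges and scorers in development and production can be illustrated with plain Python. This sketch does not use the MLflow API: the `Trace` shape, the `cites_source` rule, and the function names are assumptions made for illustration, and a real setup would typically use an LLM judge rather than a string check.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical trace shape: {"request": ..., "response": ...}
Trace = Dict[str, str]


def cites_source(trace: Trace) -> bool:
    """Hypothetical guideline check: the response must contain a bracketed citation."""
    response = trace["response"]
    return "[" in response and "]" in response


def evaluate_offline(dataset: List[Trace], scorer: Callable[[Trace], bool]) -> float:
    """Development: score every trace in a curated evaluation dataset."""
    return sum(scorer(t) for t in dataset) / len(dataset)


def monitor_production(
    traces: List[Trace],
    scorer: Callable[[Trace], bool],
    sample_rate: float,
    seed: int = 0,
) -> Optional[float]:
    """Production: apply the same scorer to a random sample of live traces."""
    rng = random.Random(seed)
    sampled = [t for t in traces if rng.random() < sample_rate]
    if not sampled:
        return None  # nothing was sampled in this window
    return sum(scorer(t) for t in sampled) / len(sampled)


if __name__ == "__main__":
    traces = [
        {"request": "What is the refund policy?", "response": "30 days [policy.pdf]"},
        {"request": "What is the refund policy?", "response": "30 days"},
    ]
    print("offline pass rate:", evaluate_offline(traces, cites_source))
    print("monitored pass rate:", monitor_production(traces, cites_source, 1.0))
```

Because both paths call the same `scorer`, a pass rate measured before release is directly comparable to the one observed on sampled production traffic.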