Harness Engineering goes mainstream
A viral video frames 'Harness Engineering' as 2026’s hottest AI engineering concept—arguing that stronger foundation models don't stop agents from failing; the real challenge is engineering reliable harnesses, testbeds, and safety nets. That reframes hiring: labs now want people who can translate model capacity into robust, observable agent behavior. (youtube.com)
OpenAI published an engineering post on February 11, 2026 describing a five‑month internal experiment in which Codex agents, driven by an initial team of three engineers, generated roughly one million lines of code across about 1,500 merged pull requests. (openai.com) OpenAI’s current job listing for “Software Engineer, Applied Evals” explicitly tasks hires with “designing agent harnesses” and building eval pipelines, and it states a preference for candidates with 4+ years of software engineering experience and familiarity with deep learning or training systems. (openai.com) Google DeepMind has posted multiple roles, listed as “Agent Quality and Evaluation” and “Research Engineer, Agentic Safety,” whose responsibilities include building evaluation frameworks, orchestration prototypes, and leaderboards to measure agent reliability across use cases. (jobs.anitab.org)
Academic work is following suit: an arXiv submission titled ReliabilityBench (submitted January 3, 2026) proposes multi‑dimensional benchmarks to measure agent reliability under repeated execution, perturbations, and simulated tool failures. (arxiv.org) Industry and trade guides circulated in March 2026 call “harness engineering” the defining discipline for production agents, and their hiring checklists and training roadmaps list the specific skills now in demand: LLMOps/observability, tool orchestration, automated evals, and SLOs for agents. (nxcode.io)
Research‑scientist openings at DeepMind and similar labs still explicitly prefer a PhD and a publication record at top venues (NeurIPS/ICML/ICLR), while OpenAI maintains a six‑month Research Residency to transition researchers from adjacent fields into full research roles. Together these signal that industry hiring is splitting between PhD‑style research tracks and applied harness/ops tracks. (boards.greenhouse.io)
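To make the reliability measurement mentioned above more concrete (repeated execution under simulated tool failures), here is a minimal Python sketch of what such a harness check might look like. It is an illustration only, not the ReliabilityBench method or any lab's actual harness: `ToolFailure`, `flaky`, `toy_agent`, and `success_rate` are hypothetical names, and the "agent" is a stand-in that simply retries a tool call.

```python
import random
from typing import Callable, Optional


class ToolFailure(Exception):
    """Raised by the wrapper to simulate a tool outage (hypothetical)."""


def flaky(tool: Callable[[str], str], failure_rate: float,
          rng: random.Random) -> Callable[[str], str]:
    """Wrap a tool so each call fails with probability `failure_rate`."""
    def wrapped(arg: str) -> str:
        if rng.random() < failure_rate:
            raise ToolFailure("simulated tool failure")
        return tool(arg)
    return wrapped


def toy_agent(task: str, tool: Callable[[str], str],
              max_retries: int = 2) -> Optional[str]:
    """Stand-in agent: call the tool, retrying on simulated failures."""
    for _ in range(max_retries + 1):
        try:
            return tool(task)
        except ToolFailure:
            continue
    return None  # gave up after exhausting retries


def success_rate(agent: Callable[..., Optional[str]], task: str, expected: str,
                 tool: Callable[[str], str], failure_rate: float,
                 runs: int, seed: int = 0) -> float:
    """Run the same task `runs` times and return the fraction that pass."""
    rng = random.Random(seed)  # seeded so the measurement is repeatable
    wrapped = flaky(tool, failure_rate, rng)
    passed = sum(agent(task, wrapped) == expected for _ in range(runs))
    return passed / runs


if __name__ == "__main__":
    for rate in (0.0, 0.3, 0.6):
        score = success_rate(toy_agent, "ping", "PING", str.upper, rate, runs=200)
        print(f"failure_rate={rate:.1f} success_rate={score:.2f}")
```

The idea this sketch tries to capture is that a single passing run says little: reporting a success rate over many seeded runs, at several injected failure rates, shows how quickly an agent degrades when its tools misbehave.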