Top AI papers roundup (May 18–24)
- DAIR.AI on May 24 highlighted a weekly paper list centered on agent systems, naming AIRA_2, MetaCogAgent, MeMo and Code as Agent Harness.
- AIRA_2 reported 81.5% mean percentile rank at 24 hours on MLE-bench-30, while MetaCogAgent said it reached 82.4% accuracy on 700 tasks.
- The papers are available on arXiv and GitHub, with AIRA-dojo code from Meta's Facebook Research repository.


DAIR.AI on May 24 pointed readers to a compact set of recent AI papers focused on agents, memory and production system design. The list named AIRA_2, MetaCogAgent, MeMo: Memory as a Model, and Code as Agent Harness, all posted in March or May on arXiv. The papers do not describe a single product release. They document a cluster of work around how agents search, delegate, remember and execute in software systems.

### Which papers were in the roundup, and when did they appear?

AIRA_2 was first submitted to arXiv on March 27 and revised on April 13 under the title “AIRA_2: Overcoming Bottlenecks in AI Research Agents.” The paper was authored by Karen Hambardzumyan and 24 co-authors and framed three bottlenecks for research agents: synchronous single-GPU execution, evaluation-driven overfitting and the limits of fixed single-turn operators. (arxiv.org)

MetaCogAgent was submitted on May 17, MeMo on May 14 and Code as Agent Harness on May 18, according to their arXiv entries. The three papers cover self-aware task delegation in multi-agent systems, model-based memory for updating knowledge without retraining the base model, and a survey view of code as the operating substrate for agent systems. (arxiv.org)

### Why did AIRA_2 stand out in this set?

AIRA_2 reported a mean percentile rank of 81.5% at 24 hours and 83.1% at 72 hours on MLE-bench-30, above the strongest baseline at 72.7%, according to the paper’s abstract page. The authors said the system used an asynchronous multi-GPU worker pool, a “Hidden Consistent Evaluation” protocol and ReAct agents that debug interactively. (arxiv.org)

Facebook Research’s GitHub repository for AIRA-dojo describes the codebase as a framework for developing and evaluating AI research agents. The repository says it provides abstractions for tasks and agents, implements MLE-bench, and includes the agents introduced in the related paper. It also says the framework enabled 1,000 agents to run in parallel for up to 120 hours.
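
The difference between synchronous execution and an asynchronous worker pool can be pictured with a small sketch. The Python below uses `asyncio` to simulate a pool in which each worker picks up the next candidate as soon as it is free, instead of waiting for a whole batch to finish. The function names (`evaluate_candidate`, `run_pool`, `best_candidate`), the made-up scores and the worker count are all assumptions for illustration; this is not code from AIRA_2 or AIRA-dojo, and it does not model GPUs.

```python
import asyncio


async def evaluate_candidate(candidate_id: int) -> tuple[int, float]:
    """Hypothetical stand-in for training and scoring one candidate solution."""
    await asyncio.sleep(0.01 * (candidate_id % 3))  # simulate uneven job lengths
    return candidate_id, 1.0 / (1 + candidate_id)   # made-up score, not a real metric


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Pull candidates until the queue is empty, without waiting on other workers."""
    while True:
        try:
            candidate_id = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        results.append(await evaluate_candidate(candidate_id))


async def run_pool(num_candidates: int, num_workers: int) -> list:
    """Evaluate all candidates with a fixed number of concurrent workers."""
    queue: asyncio.Queue = asyncio.Queue()
    for candidate_id in range(num_candidates):
        queue.put_nowait(candidate_id)
    results: list = []
    await asyncio.gather(*(worker(queue, results) for _ in range(num_workers)))
    return results


def best_candidate(num_candidates: int = 8, num_workers: int = 3) -> tuple[int, float]:
    """Run the pool and return the (candidate_id, score) pair with the top score."""
    results = asyncio.run(run_pool(num_candidates, num_workers))
    return max(results, key=lambda item: item[1])
```

In this toy setup a slow candidate delays only the worker that holds it, which is the property the asynchronous design is meant to provide.
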
(arxiv.org)

### What problem is MetaCogAgent trying to solve?

MetaCogAgent said current multi-agent frameworks often assign work by predefined role without checking whether an agent can judge its own competence boundary. The paper introduced a “Metacognitive Self-Assessment Unit” that estimates task-capability fit before execution and routes low-confidence tasks to other agents through cross-agent evaluation. (github.com)

The paper reported 82.4% task accuracy on a 700-task benchmark spanning five cognitive dimensions. The authors said that was 8.7 percentage points above the best routing baseline, while using 5% fewer API calls than AutoGen and 34% fewer than ensemble voting.

### Why was MeMo included alongside agent papers?

MeMo, short for “Memory as a Model,” addressed a different production problem: how to add new knowledge without changing the base language model’s weights. (arxiv.org)

The paper said it keeps an executive model frozen and trains a separate memory model that stores new knowledge and can be queried at inference time. The authors said that design captures cross-document relationships, is robust to retrieval noise, avoids catastrophic forgetting and works with both open- and closed-source models because it does not require access to weights or logits. (arxiv.org)

The paper reported results across BrowseComp-Plus, NarrativeQA and MuSiQue. (arxiv.org)

### Where does “production agent architecture” fit in?

Code as Agent Harness was submitted as a survey, not a new benchmark paper. Its authors argued that code is increasingly the infrastructure layer for agent reasoning, action, environment modeling and execution-based verification. The survey organized the field into three layers: interface, mechanisms and scaling.
(arxiv.org) It listed planning, memory, tool use, control, optimization and multi-agent coordination over shared code artifacts as core parts of that architecture, and pointed to applications in coding assistants, GUI and OS automation, scientific discovery, DevOps and enterprise workflows. (arxiv.org)

The arXiv pages for MetaCogAgent, MeMo and Code as Agent Harness were live as of May 24, and Facebook Research’s AIRA-dojo repository is publicly available on GitHub. Engineers following the roundup can read the papers directly on arXiv and inspect the AIRA-dojo implementation in the repository. (arxiv.org 1) (arxiv.org 2)
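
For readers who want a concrete picture of the self-assessment routing that MetaCogAgent describes, the following Python is a minimal sketch under stated assumptions. The `Agent` class, the `route_task` function, the 0.7 threshold and the keyword-based confidence score are all hypothetical; the paper's Metacognitive Self-Assessment Unit and its cross-agent evaluation are not reproduced here. Cross-agent evaluation is approximated by picking whichever agent reports the highest confidence.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    name: str
    # Hypothetical self-assessment: maps a task description to a confidence in [0, 1].
    self_assess: Callable[[str], float]


def route_task(task: str, primary: Agent, peers: list[Agent], threshold: float = 0.7) -> Agent:
    """Return the agent that should run the task.

    The primary agent keeps the task if its own confidence clears the threshold
    (an assumed value). Otherwise the task goes to whichever agent, primary
    included, reports the highest confidence.
    """
    if primary.self_assess(task) >= threshold:
        return primary
    candidates = [primary, *peers]
    return max(candidates, key=lambda agent: agent.self_assess(task))


def keyword_confidence(keywords: set[str]) -> Callable[[str], float]:
    """Toy self-assessment: the fraction of task words found in the agent's keyword set."""
    def assess(task: str) -> float:
        words = task.lower().split()
        return sum(word in keywords for word in words) / len(words) if words else 0.0
    return assess


coder = Agent("coder", keyword_confidence({"fix", "python", "bug", "the"}))
analyst = Agent("analyst", keyword_confidence({"summarize", "quarterly", "sales", "the"}))
```

With these toy agents, `route_task("fix the python bug", coder, [analyst])` keeps the task with `coder`, while `route_task("summarize the quarterly sales", coder, [analyst])` hands it to `analyst` because the coder's confidence falls below the threshold.
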