EURECOM finds constraint decay in agents
- Francesco Dente, Dario Satriani, and Paolo Papotti posted a new arXiv paper on May 7 showing coding agents break under backend production constraints. (arxiv.org)
- Across 80 greenfield tasks and 20 feature tasks in eight frameworks, capable setups lost about 30 assertion-pass-rate points as rules accumulated. (arxiv.org)
- It matters because common coding benchmarks reward working outputs but miss whether agents can obey architecture, ORM, and database rules. (arxiv.org)
Coding agents look a lot better in demos than they do in real backend work. That is basically the point of a new paper from Francesco Dente at EURECOM, Dario Satriani at the University of Basilicata, and Paolo Papotti at EURECOM, posted to arXiv on May 7, 2026. (arxiv.org) The paper gives that gap a name, “constraint decay”: agent performance falls as you pile on the kinds of structural rules production systems actually use.

### What are they measuring?

They are not asking whether an agent can spit out code that seems plausible. They fix one API contract, then make agents build multi-file backend systems around it under different levels of constraint, from loose baseline tasks to setups that require architectural patterns, databases, and ORM layers. (arxiv.org) The benchmark covers 80 greenfield generation tasks and 20 feature-implementation tasks across eight web frameworks, and scores the results with both behavioral tests and static verification. (arxiv.org)

### What is “constraint decay”?

It is the pattern where the same agent that does fine under loose instructions starts failing once the job includes production-style structure. The paper says capable configurations lose about 30 percentage points in assertion pass rate from baseline to fully specified tasks, while weaker ones can get close to zero. So this is not “agents can’t code.” It is narrower and more useful: agents can often code the feature, but they struggle to code the feature inside the required shape. (arxiv.org)

### Why does backend work make this worse?

Because backend systems have hidden obligations. An endpoint is not just an endpoint. It has to fit routing conventions, call the right service layer, map objects cleanly into a schema, and avoid ORM mistakes that only show up at runtime. A toy benchmark can ignore that stuff and still mark the answer correct. Production code cannot. That gap is what this paper is trying to isolate.

### Why do frameworks matter so much?
The paper says agents do better in minimal, explicit frameworks like Flask and worse in convention-heavy environments like FastAPI and Django. (arxiv.org) That makes intuitive sense. In explicit frameworks, more of the structure is right there in the code. In convention-heavy frameworks, a lot of correctness lives in invisible expectations: naming, layout, lifecycle hooks, ORM behavior. Miss one and the whole app can wobble.

### Where do the failures cluster?

Mostly in the data layer. The paper calls out incorrect query composition and ORM runtime violations as leading root causes. (arxiv.org) That is a useful detail because it points away from the usual “the model can’t reason about business logic” story. It turns out a lot of the brittleness is lower down: schema assumptions, joins, mappings, and framework-database glue.

### Why is that a bigger deal than it sounds?

Because many benchmark wins may be overstating readiness for real software work. If an evaluation rewards functionally correct but structurally arbitrary solutions, an agent can look strong while still being unreliable in the exact places teams care about: maintainability, architecture compliance, and safe integration with a real data model. (arxiv.org) This paper is not saying agents are useless. It is saying the exam has been too easy.

### So what should people take from this?

The takeaway is not “use no constraints” or “avoid Django.” It is that evaluation has to look more like the job. (arxiv.org) If you want to know whether an agent can help with backend engineering, you have to test whether it can satisfy functional requirements and structural ones at the same time. Right now, that combined test is still a weak spot.

### Bottom line?

Constraint decay is a clean name for a thing a lot of engineers already suspected: coding agents are much better at getting something working than getting it working the right way. (arxiv.org) This paper matters because it turns that hunch into a measurable backend benchmark.
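
For readers who want a concrete picture of what scoring behavior and structure together could look like, here is a minimal Python sketch. It is not the paper's harness. The `RunResult` record, the `routes/` layout, the "route files must not import the ORM" rule, and every number in it are invented for illustration. It only shows the two ideas the article describes: an assertion pass rate whose drop is measured in percentage points, and a static check that runs alongside the behavioral score.

```python
import ast
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one agent run on one task (hypothetical record shape)."""
    assertions_passed: int
    assertions_total: int
    source_by_file: dict[str, str]  # file path -> generated source code


def assertion_pass_rate(run: RunResult) -> float:
    """Behavioral score: share of test assertions that passed, in percent."""
    if run.assertions_total == 0:
        return 0.0
    return 100.0 * run.assertions_passed / run.assertions_total


def constraint_decay(baseline: RunResult, constrained: RunResult) -> float:
    """Drop in assertion pass rate, in percentage points."""
    return assertion_pass_rate(baseline) - assertion_pass_rate(constrained)


def imports_module(source: str, module: str) -> bool:
    """Static check: does this file import `module` or one of its submodules?"""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        if any(n == module or n.startswith(module + ".") for n in names):
            return True
    return False


def structural_violations(run: RunResult, orm_module: str = "sqlalchemy") -> list[str]:
    """Illustrative rule (an assumption): route files must not import the ORM."""
    return [
        path
        for path, source in run.source_by_file.items()
        if path.startswith("routes/") and imports_module(source, orm_module)
    ]


# Made-up runs: one loose baseline, one with an architecture rule attached.
baseline = RunResult(40, 50, {"app.py": "from flask import Flask\n"})
constrained = RunResult(28, 50, {
    "routes/users.py": "from sqlalchemy import select\n",
    "services/users.py": "from sqlalchemy.orm import Session\n",
})

print(constraint_decay(baseline, constrained))  # 24.0
print(structural_violations(constrained))       # ['routes/users.py']
```

The point of pairing the two functions is that a run can pass most of its assertions and still break the required shape. Reporting only the first number is exactly the blind spot the article describes.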