Rubber‑Duck AI reviews land in CLI
GitHub added a ‘Rubber Duck’ review agent to the Copilot CLI that runs a second model to critique the primary agent’s plan and catch issues like infinite loops before execution. (infoworld.com) (technobezz.com)
Programmers have used literal rubber ducks for years because saying code out loud often exposes the bug before the computer does. GitHub just turned that ritual into software by adding an experimental “Rubber Duck” reviewer to GitHub Copilot CLI, the terminal tool that can plan and run coding tasks from text prompts. (github.blog)

A coding agent works in steps: it makes a plan, edits files, writes tests, and then executes commands. If the first plan is wrong, every later step can build on the same mistake, like a contractor following the wrong blueprint. (github.blog)

Most AI agents already do self-checks, but that is like asking the same witness to review their own testimony. GitHub’s new system instead uses a second model from a different model family, so the reviewer is less likely to share the first model’s blind spots. (github.blog)

GitHub says Rubber Duck can step in after the plan is drafted, after a complicated implementation, and after tests are written but before those tests run. The tool can also be called when the agent appears stuck in a loop, which is one of the easiest ways for an automated coding session to waste time and tokens. (github.blog) (neowin.net)

GitHub’s example from OpenLibrary shows why the timing matters. The primary agent proposed a scheduler that would exit immediately on startup and included a task that would loop forever, and Rubber Duck flagged both problems before execution. (github.blog) (helpnetsecurity.com)

A second example came from Apache Solr search code. Rubber Duck caught a loop that kept overwriting the same dictionary key, which meant three of four search facet categories were silently dropped with no error message. (github.blog) (helpnetsecurity.com)

GitHub says the feature improved results on SWE-Bench Pro, a benchmark built from real software engineering tasks.
In the company’s tests, pairing Claude Sonnet with Rubber Duck closed 74.7% of the performance gap between Sonnet alone and the stronger Claude Opus model on harder multi-file and long-running jobs. (github.blog) (4sysops.com)

That result hints at the real pitch: use a cheaper first model, then spend a smaller amount on a second opinion only at risky moments. Instead of running the most expensive model for every command, GitHub is trying to use disagreement between two models as a quality control layer. (github.blog) (infoworld.com)

The feature is currently experimental inside GitHub Copilot CLI, and it is not a guarantee that every generated command is safe. GitHub’s own write-up frames Rubber Duck as a reviewer that raises concerns and questions assumptions, which means the human developer still decides whether to trust, change, or reject the plan. (github.blog)
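
To make the 74.7% figure concrete: a "gap closed" percentage is commonly computed as the paired setup's gain over the baseline, divided by the distance between the baseline and the stronger model. The sketch below assumes that definition, since the reporting does not spell out GitHub's exact formula. The scores are hypothetical placeholders chosen to keep the arithmetic easy, not GitHub's benchmark results.

```python
def gap_closed(baseline: float, paired: float, stronger: float) -> float:
    """Fraction of the baseline-to-stronger gap recovered by the paired setup.

    Assumed definition: (paired - baseline) / (stronger - baseline).
    """
    gap = stronger - baseline
    if gap <= 0:
        raise ValueError("stronger score must exceed the baseline score")
    return (paired - baseline) / gap


# Hypothetical scores, for illustration only (not reported numbers).
example = gap_closed(baseline=40.0, paired=47.0, stronger=50.0)
print(f"{example:.1%}")  # 70.0%
```

Under this reading, a value of 74.7% would mean the cheaper model plus reviewer recovered roughly three quarters of the distance to the stronger model.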
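
The dictionary bug from the Solr example belongs to a class of errors that raise no exception, which is why a second reviewer is useful for it. The following is a hypothetical reconstruction of that kind of bug, not the code GitHub reviewed. The facet names, the data, and both function names are invented for illustration.

```python
# Hypothetical facet data; every name here is illustrative only.
RAW_FACETS = [
    ("author", ["Austen", "Orwell"]),
    ("subject", ["Fiction"]),
    ("language", ["eng"]),
    ("year", ["1949"]),
]


def collect_facets_buggy(raw):
    facets = {}
    for name, values in raw:
        # Bug: the key is constant, so each iteration overwrites the last.
        # Nothing fails; earlier categories simply disappear.
        facets["facet"] = values
    return facets


def collect_facets_fixed(raw):
    facets = {}
    for name, values in raw:
        # Fix: key each entry by its own category name.
        facets[name] = values
    return facets


print(len(collect_facets_buggy(RAW_FACETS)))  # 1
print(len(collect_facets_fixed(RAW_FACETS)))  # 4
```

The buggy version runs cleanly and returns a valid dictionary, so a test that only checks for errors would pass while three of the four categories are lost.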