Human feedback remains the bottleneck
A viral post contrasted conversational user signals (noisy, hallucination‑prone feedback) with agentic user signals (verifiable tests), arguing that labs need higher‑quality, structured human feedback for dense rewards. (x.com)
Artificial intelligence labs can check whether a coding agent passed a unit test, but they still struggle to score whether a chatbot answer actually helped a person. (openai.com)
That gap sits at the center of reinforcement learning from human feedback, or RLHF, the method OpenAI described in 2017 for training systems on human preferences instead of hand-written goals. OpenAI said the approach lets a model learn from people choosing which behavior is better, because writing exact reward functions is often harder than judging examples. (openai.com)
Google DeepMind made the same point in its 2022 Sparrow work on dialogue systems. The company said conversation is hard to score because “it’s difficult to pinpoint what makes a dialogue successful,” so it asked raters to compare answers and trained a reward model from those judgments. (deepmind.google)
Researchers have tried to make those human judgments less fuzzy by breaking them into smaller checks. In the Sparrow paper, DeepMind said it split “good dialogue” into natural-language rules and asked raters about each rule separately, which produced more targeted judgments and more efficient reward models. (storage.googleapis.com)
A different path is to use tasks with answers that can be verified automatically, like code that either passes tests or fails them. A recent survey of reinforcement learning with verifiable rewards described those signals as “ground-truth rewards” from unit tests, formal proofs, or fact-checkers, which are easier to audit than open-ended preference scores. (github.com)
That distinction has become more important as labs push models from chat into agents that take actions. When an agent edits files, runs tools, or writes code, developers can often attach a concrete check to each step; when a chatbot gives advice or explains a topic, the score is usually noisier and more subjective. (openai.com) (github.com)
The quality of the reward signal matters because models learn shortcuts when the score is wrong. Anthropic wrote in a December 2025 study that “reward hacking” means a model finds a loophole that earns a high reward without completing the intended task, and the company linked that behavior to broader misalignment risks. (anthropic.com)
Researchers are also trying to get denser rewards, meaning feedback on more than just the final answer. A 2024 paper, “Dense Reward for Free in Reinforcement Learning from Human Feedback,” argued that standard RLHF usually scores only the full completion, while token-level credit can provide more fine-grained training signals. (arxiv.org)
The bottleneck is not that users produce no feedback; it is that most of the feedback arrives in forms that are hard to trust, compare, and reuse at scale. OpenAI’s original preference-learning post and DeepMind’s Sparrow work both describe systems that still depend on humans to supply the judgments that reward models learn from. (openai.com) (storage.googleapis.com)
So the race is shifting from collecting more reactions to collecting better ones: structured ratings for chat, and verifiable tests for agents. The more a lab can turn “was this useful?” into a checkable signal, the less it has to rely on noisy human guesswork. (storage.googleapis.com) (github.com)
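For readers who want to see the contrast concretely, here is a minimal Python sketch of the three kinds of signal discussed above: a verifiable reward from test cases, a pairwise preference loss, and a per-rule rating. It is illustrative only. The function names, the rule names, and the choice of a Bradley-Terry style loss are assumptions made for this example; the cited sources do not specify this code, and real lab pipelines are far more involved.

```python
import math
from typing import Callable, Dict, Sequence, Tuple


def verifiable_reward(candidate: Callable[[int], int],
                      cases: Sequence[Tuple[int, int]]) -> float:
    """Fraction of (input, expected) test cases the candidate passes."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = 0
    for arg, expected in cases:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            # A crashing candidate simply fails that check.
            pass
    return passed / len(cases)


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(chosen - rejected)).

    Assumption: a Bradley-Terry style objective, a common choice for
    training a reward model on human comparisons. Written in a
    numerically stable form.
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def rule_based_rating(rule_votes: Dict[str, bool]) -> float:
    """Share of rules a rater marked as followed (hypothetical rule names)."""
    if not rule_votes:
        raise ValueError("need at least one rule")
    return sum(rule_votes.values()) / len(rule_votes)


if __name__ == "__main__":
    def double(x: int) -> int:
        return 2 * x

    # Agent-style signal: each case is an objective pass/fail check.
    print(verifiable_reward(double, [(1, 2), (2, 4), (3, 7)]))

    # Chat-style signal: the loss shrinks as the reward model scores the
    # human-preferred answer higher than the rejected one.
    print(preference_loss(score_chosen=1.5, score_rejected=0.2))

    # Structured rating: one yes/no judgment per rule, not one overall score.
    print(rule_based_rating({"stays_on_topic": True, "cites_evidence": False}))
```

The test-based score can be recomputed and audited by anyone with the same test cases, while the preference loss is only as reliable as the human comparisons behind the two scores. Splitting a rating into per-rule judgments sits between the two: still human, but narrower and easier to check.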