Computer‑vision pain points in production
Social commentary flagged longstanding computer‑vision issues: tight data constraints, paradigm limits for long‑horizon robotics, and heavy operational costs when CV models must drive GUI navigation or legacy automation. The posts emphasize that production CV often requires continuous human oversight and pragmatic MLOps to handle brittleness in real environments.
Computer vision looks simple in demos, but production systems still break on new lighting, camera angles, screen layouts, and edge cases that were not in training data. (aws.amazon.com) At its core, computer vision is software that turns pixels into labels, boxes, or actions. Google Cloud’s Vision documentation tells customers to test models in “real-world scenarios,” a nod to the gap between benchmark performance and live deployments. (docs.cloud.google.com)
That gap gets wider when the model must act over many steps instead of classifying one image. A January 2025 robotics paper said long-horizon manipulation still faces “complex representation and policy learning requirements,” plus sparse rewards and hard visual scenes. (arxiv.org) NVIDIA made the same point in a November 4, 2025 post on task-and-motion planning, saying traditional robot planners often fail in new environments unless perception is folded back into the plan during execution. In plain terms, the robot has to keep re-checking the world because the world does not stay still. (developer.nvidia.com)
The data problem starts earlier. AWS says data drift is any meaningful change between production data and the data used to train a model, and that drift can reduce quality, accuracy, and fairness. (aws.amazon.com) That is why production teams build operations around the model, not just the model itself. Google Cloud’s Vertex AI monitoring tools log prediction requests and check incoming features for skew and drift, while Azure’s MLOps guidance lays out separate production patterns for computer vision systems. (docs.cloud.google.com) (learn.microsoft.com)
Human review is usually part of that stack. Google Cloud defines human-in-the-loop as people participating in training, evaluation, or operation, and AWS says human monitoring helps validate model outputs and catch degradation over time. (cloud.google.com) (docs.aws.amazon.com)
The operational burden gets heavier when vision is used to drive old software through screenshots and clicks. Microsoft’s UI Automation framework exists because many Windows applications expose structured interface elements that software can query directly, which is usually more stable than guessing from pixels alone. (learn.microsoft.com 1) (learn.microsoft.com 2) Even then, legacy systems do not always expose clean hooks. Microsoft says Windows automation spans both newer UI Automation and older Microsoft Active Accessibility, which means teams often inherit mixed interfaces, partial support, and brittle workflows. (learn.microsoft.com 1) (learn.microsoft.com 2)
Researchers are still trying to widen the range of tasks these systems can handle. Google DeepMind introduced Gemini Robotics in March 2025 as a vision-language-action model for robots, but the broader robotics literature still describes long-horizon manipulation as a hard problem with limited generalization from small task-specific datasets. (deepmind.google) (arxiv.org)
So the production lesson is not that computer vision failed. It is that working systems usually depend on labeled data, drift checks, fallback paths, and people who keep correcting the model after the demo ends. (cloud.google.com) (docs.aws.amazon.com)
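To make the drift-check and fallback ideas concrete, here is a minimal Python sketch. It is an illustration, not any vendor's method. It assumes each image has been reduced to one hypothetical feature, mean brightness in the range 0 to 1, and it compares training and production histograms with the Population Stability Index (PSI), a common drift score. The function names, the bin edges, and the 0.2 and 0.8 thresholds are assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Return the fraction of values that fall in each bin defined by edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = len(counts) - 1  # values at or above the last inner edge go in the last bin
        for i in range(len(counts)):
            if v < edges[i + 1]:
                idx = i
                break
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(train: Sequence[float], prod: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between training and production samples.

    eps keeps the logarithm finite when a bin is empty.
    """
    p = histogram(train, edges)
    q = histogram(prod, edges)
    return sum((qi - pi) * math.log((qi + eps) / (pi + eps))
               for pi, qi in zip(p, q))


def route(confidence: float, drift_score: float,
          min_confidence: float = 0.8, max_drift: float = 0.2) -> str:
    """Send a prediction to a person when drift is high or confidence is low.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if drift_score > max_drift or confidence < min_confidence:
        return "human_review"
    return "auto"


# Hypothetical brightness samples: training images are evenly lit,
# production images come from a much darker scene.
EDGES = [0.0, 0.25, 0.5, 0.75, 1.0]
train_brightness = [0.1, 0.3, 0.6, 0.9] * 25
dark_brightness = [0.05, 0.1, 0.15, 0.2] * 25

drift = psi(train_brightness, dark_brightness, EDGES)
decision = route(confidence=0.95, drift_score=drift)
```

In this sketch a confident prediction still goes to human review once the brightness distribution shifts, which mirrors the point above: the model's own confidence is not enough when production data stops resembling training data. A real system would likely track several features, or learned embeddings, and not a single brightness value.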