Perceptron debuts Mk1 video reasoning model

- Perceptron launched Mk1 on May 12, a closed-source vision-language model built for video understanding and “embodied reasoning” in robotics and production workflows.
- The headline detail is price: $0.15 per million input tokens and $1.50 per million output tokens, with 32K context plus structured outputs like boxes and timestamps.
- It matters because video AI is shifting from chatty summaries toward machine-readable perception that downstream agents, robots, and analytics systems can actually use.

Video models are getting more useful, and more specific. The big change is not just that they can describe a clip in plain English. It’s that some of them are starting to return outputs software can act on directly. That is the lane Perceptron is pushing into with Mk1, released on May 12. It’s a closed-source vision-language model for image and video tasks, but the real pitch is “physical AI”: systems that need to understand scenes, track events over time, and hand structured results to other tools.

### What did Perceptron actually launch?

Mk1 is Perceptron’s new flagship model for image and video reasoning. In the company’s docs, it sits above the smaller Isaac models and is described as the standard-speed, reasoning-enabled option for image and video work. The API model ID is `perceptron-mk1`, and the listed update date is May 12, 2026.

### Why call it a video reasoning model?

Because this is not pitched as a text bot that happens to accept video. (docs.perceptron.inc) Perceptron frames Mk1 around understanding what happens across time: answering questions about clips, detecting events, reading text in scenes, and clipping moments out of longer footage. That temporal piece matters. A lot of real video work is not “what’s in this frame?” but “when did the thing happen, and where?” (docs.perceptron.inc)

### What makes the output different?

The useful bit is structure. Perceptron’s docs show support for constrained outputs via Pydantic models, JSON Schema, and regex, so the model can return data in a format software expects instead of freeform prose. The platform also supports grounded outputs like points, boxes, and polygons, plus video clipping with start and end timestamps. Basically, it is trying to turn perception into something parseable.

### Why does that matter for robotics?

Robots and industrial systems do not need a poetic recap of a scene. They need coordinates, object IDs, timestamps, and yes-or-no decisions they can route into control software or audits. Perceptron has been building around that idea already: its SDK and site are aimed at robotics, manufacturing, logistics, security, and other physical-world workflows. Mk1 looks like the higher-end model for that same stack. (docs.perceptron.inc)

### What’s the pricing angle?

Perceptron lists Mk1 at $0.15 per million input tokens and $1.50 per million output tokens, with a 32K context window. That is the eye-catching part of the launch because the company is clearly positioning Mk1 as frontier-ish video reasoning without frontier-lab pricing. Third-party coverage around the release says Perceptron is benchmarking it against Google, OpenAI, and Anthropic while claiming much lower cost. (github.com)

### Is this just another benchmark story?

Not really, though benchmarks are part of the marketing. The more interesting shift is product shape. Multimodal models started by answering questions about images. Now the competition is moving toward models that can localize, track, clip, and emit structured data reliably. That makes them more like perception engines than chat interfaces. Perceptron is leaning hard into that distinction. (docs.perceptron.inc)

### What’s the catch?

The catch is that this is still a closed model from a smaller company, and most of the strongest performance claims around rivals are coming from launch materials and secondary writeups, not a broad independent bake-off. So the important thing is less “did it beat every frontier model?” and more “is there now a credible specialist model for video-heavy physical AI tasks?” That answer looks like yes.

### Bottom line?

Mk1 matters because it points at where multimodal AI is going next. (docs.perceptron.inc) The winning models may not be the ones that talk best about video. They may be the ones that can watch, reason, and return clean machine-readable outputs fast enough and cheaply enough to plug into real systems. (docs.perceptron.inc)
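
To make "machine-readable outputs" concrete, here is a minimal sketch of what the consuming side of such a pipeline might look like. It does not call Perceptron's API and does not reproduce its schema: the field names (`label`, `start_s`, `end_s`, `box`) and the sample payload are assumptions made up for illustration, and the validation uses only Python's standard library rather than Pydantic or JSON Schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Hypothetical bounding box in normalized image coordinates.
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass(frozen=True)
class VideoEvent:
    # Hypothetical event record: what happened, when, and where.
    label: str
    start_s: float
    end_s: float
    box: Box


def parse_events(raw: str) -> list[VideoEvent]:
    """Parse a JSON array of events, raising ValueError on malformed data."""
    events = []
    for item in json.loads(raw):
        try:
            box = Box(**{k: float(v) for k, v in item["box"].items()})
            event = VideoEvent(
                label=str(item["label"]),
                start_s=float(item["start_s"]),
                end_s=float(item["end_s"]),
                box=box,
            )
        except (KeyError, TypeError, AttributeError) as exc:
            raise ValueError(f"malformed event: {item!r}") from exc
        # Basic sanity checks a downstream controller would rely on.
        if event.end_s < event.start_s:
            raise ValueError(f"event ends before it starts: {item!r}")
        if box.x_max < box.x_min or box.y_max < box.y_min:
            raise ValueError(f"inverted box: {item!r}")
        events.append(event)
    return events


# Made-up payload standing in for a model response (assumption, not real output).
sample = json.dumps([
    {
        "label": "forklift_enters_zone",
        "start_s": 12.4,
        "end_s": 15.0,
        "box": {"x_min": 0.1, "y_min": 0.2, "x_max": 0.5, "y_max": 0.9},
    }
])

events = parse_events(sample)
for e in events:
    print(e.label, e.start_s, e.end_s, e.box)
```

The point of the design is that a freeform summary would have to be re-parsed by hand, while a typed record like this can be rejected or routed automatically, which is the property the article attributes to structured video outputs.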
