Thinking Machines ships interaction models

- Thinking Machines Lab, Mira Murati’s startup, unveiled “interaction models” on May 11, its first public model effort built for live, multimodal collaboration.
- The first system, TML-Interaction-Small, handles audio, video, and text in 200-millisecond slices and reaches roughly 0.40-second turn-taking latency in demos.
- It matters because the lab is attacking voice AI’s clunkiest weakness, timing, not just raw model intelligence.

Voice AI has had a weird problem for a while. The models got smarter, but the conversations still felt stiff, like sending voice notes back and forth instead of actually talking. Thinking Machines Lab is trying to fix that gap with what it calls “interaction models,” a new model class built to listen, speak, watch, and react at the same time. That’s the company’s first real public product signal since Mira Murati launched the startup, and it’s a pretty direct argument that today’s voice assistants are bottlenecked by their interface, not just their brains.

### What did Thinking Machines actually ship?

On May 11, Thinking Machines published a research preview rather than a consumer app or open release. The headline model is TML-Interaction-Small, and the point is not just that it can do audio, video, and text. The point is that it processes those streams continuously, in parallel, instead of waiting for neat human turns to end before it starts thinking. (thinkingmachines.ai)

### Why is turn-taking such a big deal?

Most “real-time” AI still cheats a little. You speak, another system decides when you’re done, the model gets a cleaned-up chunk, and then it answers. While it answers, parts of perception can effectively stall. That setup works, but it breaks a bunch of normal human behaviors: interruption, overlap, quick backchannels, reacting to what someone is showing on screen, or changing course mid-sentence. Thinking Machines is basically saying the awkwardness people feel in voice AI comes from that architecture. (thinkingmachines.ai)

### What’s the new trick?

The lab says its models use a multi-stream, micro-turn design. Instead of one long serialized thread, the system handles tiny 200-millisecond chunks across audio, video, and text. That lets it update continuously and respond more like a phone call than a walkie-talkie. TechCrunch’s shorthand was “full duplex,” meaning listening and generating at once, which is the clearest way to think about it. (thinkingmachines.ai)

### Why 200 milliseconds?

Because conversation is mostly timing. Humans don’t just exchange finished paragraphs. We overlap, hesitate, signal attention, and jump in fast when something goes wrong. A 200-millisecond interaction clock is close enough to those micro-behaviors that the model can start feeling collaborative instead of merely responsive. The company also says TML-Interaction-Small hits about 0.40 seconds of turn-taking latency, which is the kind of number that makes this more than a branding exercise. (thinkingmachines.ai)

### Is this mainly about voice quality?

Not really, and that’s the interesting part. Thinking Machines is framing voice as a coordination problem. Timing, overlap, memory, and shared attention matter as much as transcription accuracy or speech naturalness. In plain English, the lab is treating conversation as something the model should participate in, not something a bunch of helper systems should package up for it. (thinkingmachines.ai)

### Does it beat OpenAI and Google?

In the company’s own reported benchmarks, yes, especially on the combined problem of responsiveness plus intelligence. The Decoder says Thinking Machines positions the model ahead of OpenAI’s GPT-Realtime-2 and Google’s Gemini Live on interaction quality and latency. But the catch is obvious: this is still a research preview, and outside users cannot really pressure-test it yet. (thinkingmachines.ai)

### When can people try it?

Not today. The company says a limited research preview is coming in the next few months, with a broader release planned later in 2026. So right now this is still more thesis than product rollout. But it is a very clear thesis: that the next leap in AI assistants may come from fixing the rhythm of interaction, not just scaling the next benchmark monster. (the-decoder.com)

The bottom line is simple. Thinking Machines didn’t just ship another multimodal model. It shipped an argument that conversational AI feels unnatural because the system architecture is unnatural, and that fixing the timing layer could matter as much as making the model smarter. (thinkingmachines.ai) (techcrunch.com)
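
For readers who want the 200-millisecond micro-turn idea in concrete terms, here is a minimal Python sketch. It is not Thinking Machines’ implementation, which has not been published in this form. The names `Chunk`, `micro_turns`, and the stream labels are hypothetical and exist only for illustration. All it shows is several streams advancing on one shared clock in fixed slices, so no stream waits for another to finish a turn.

```python
from dataclasses import dataclass

MICRO_TURN_MS = 200    # slice length reported in the article
TURN_LATENCY_MS = 400  # ~0.40 s turn-taking latency reported in the article


@dataclass(frozen=True)
class Chunk:
    """One slice of one stream (hypothetical structure, for illustration)."""
    stream: str    # e.g. "audio", "video", or "text"
    start_ms: int  # slice start on the shared clock
    end_ms: int    # slice end on the shared clock (exclusive)


def micro_turns(streams, duration_ms, step_ms=MICRO_TURN_MS):
    """Yield one tuple of chunks per time slice, one chunk per stream.

    Every stream advances on the same clock, so a new slice is emitted
    every `step_ms` regardless of whether anyone has "finished" speaking.
    """
    for start in range(0, duration_ms, step_ms):
        end = min(start + step_ms, duration_ms)  # last slice may be shorter
        yield tuple(Chunk(name, start, end) for name in streams)


if __name__ == "__main__":
    for turn in micro_turns(["audio", "video", "text"], duration_ms=600):
        print([(c.stream, c.start_ms, c.end_ms) for c in turn])
    # Simple arithmetic on the two reported figures, not a claim about internals.
    print(TURN_LATENCY_MS // MICRO_TURN_MS, "slices fit in the reported latency")
```

One thing the arithmetic makes visible: the reported 0.40-second latency is about the length of two 200-millisecond slices. A turn-based pipeline, by contrast, would hold everything until an end-of-speech detector fires.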

Shared from Scout - Be the smartest in the room.