Engineer outlines real-time meeting AI stack
- On May 24, 2026, engineer Gokul JS detailed a production stack for real-time meeting agents, centering on streaming STT, LLM, TTS, VAD and interruption handling.
- The thread’s most concrete point was that the orchestration layer, not any single model, manages turn-taking, barge-in, latency budgets and duplex media flow.
- Gokul JS’s related blog and project pages describe the same pipeline with LiveKit, Whisper STT, GPT-4o and Rime TTS.

Gokul JS used an X thread on May 24 to lay out the moving parts behind a real-time conversational agent for meetings and voice interfaces. The stack he described centered on streaming speech-to-text, language-model inference, text-to-speech, voice activity detection, interruption handling and separate voice and data channels. His focus was not on a single model vendor. It was on the orchestration layer that decides when the system listens, speaks, stops and resumes.

That framing tracks with Gokul JS’s public site, which says he builds real-time AI systems and voice agents. A related blog post on his site describes “the two-layer architecture, VAD, STT, LLM, TTS pipeline, and where latency comes from at each stage,” while a project page says he built a voice conversation pipeline using LiveKit, Whisper STT, GPT-4o and Rime TTS.

### Why does the orchestration layer matter more than the model list?

The clearest point in Gokul JS’s thread was that natural conversation depends on coordination logic between components, not just model quality. In a live meeting system, speech recognition, language generation and speech synthesis can all stream, but something still has to manage turn detection, partial transcripts, response timing and interruption rules. That is the orchestration layer he highlighted. (gokuljs.com)

LiveKit, whose open-source agents framework is one example of this design pattern, describes real-time agents as “programmable participants” that mix and match STT, LLM, TTS and realtime APIs. Its documentation also points to interruption handling and end-of-turn behavior as first-class engineering problems rather than optional features.

### What does a production voice stack actually have to coordinate?

A real-time meeting agent has to process audio as a stream rather than as uploaded files.
(gokuljs.com) Gokul JS’s blog describes a pipeline in which VAD detects speech boundaries, STT emits partial transcripts while the user is still speaking, the LLM begins reasoning on partial input, and TTS streams audio back with as little delay as possible.

That architecture is now common across voice-agent tooling. (github.com) LiveKit’s framework documentation says developers can swap STT, LLM and TTS providers, while keeping scheduling, transport and session logic in place. A separate technical explainer on real-time voice systems describes the same core tradeoff: chained pipelines are modular, but streaming systems reduce delay by overlapping recognition, reasoning and synthesis. (gokuljs.com)

### Why are interruptions and barge-in so hard in meetings?

Interruptions are difficult because the system has to listen while it is speaking. Gokul JS’s thread called out interruptions and full-duplex channels, which are required if a user is supposed to cut off the assistant mid-response without waiting for playback to finish. In practice that means handling inbound audio, outbound audio and control events at the same time. (github.com)

Research papers and developer frameworks now treat that as a core systems problem. An arXiv paper on full-duplex speech dialogue describes a setup in which perception and motor modules run in tandem so the model can speak and listen simultaneously, while LiveKit’s recent code history includes fixes specifically tied to interruption behavior.

### Where does latency usually accumulate?

Latency builds across every handoff. (gokuljs.com) Gokul JS’s blog says the key engineering task is to account for delay at each stage of the VAD-STT-LLM-TTS path rather than treating response time as one monolithic number. Another engineering explainer on voice AI argues that sub-500 millisecond behavior is the threshold where systems begin to feel conversational, and says developers need streaming transport such as WebRTC to avoid file-based waits.
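
One way to picture per-stage accounting is a small budget check that sums the delay of each handoff and compares it with the 500 millisecond threshold cited above. The sketch below is illustrative only: the stage names and millisecond values are hypothetical placeholders, not measurements from the thread, the blog or any vendor, and it assumes the stages run strictly one after another, which a streaming pipeline is designed to avoid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """One handoff in the pipeline and its assumed delay in milliseconds."""
    name: str
    ms: float


def budget_report(stages, budget_ms=500.0):
    """Sum per-stage delays and compare the total with a latency budget."""
    total = sum(stage.ms for stage in stages)
    slowest = max(stages, key=lambda stage: stage.ms)
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,  # negative means over budget
        "slowest_stage": slowest.name,
    }


# Hypothetical numbers chosen only to make the arithmetic visible.
stages = [
    StageLatency("vad_end_of_turn", 150),
    StageLatency("stt_final_transcript", 100),
    StageLatency("llm_first_token", 200),
    StageLatency("tts_first_audio", 120),
]

report = budget_report(stages)
print(report)
```

With these made-up figures the sequential total lands above the budget, which is the situation that pushes designs toward overlapping recognition, reasoning and synthesis instead of waiting for each stage to finish.
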
(arxiv.org) That is why the thread reads less like a product pitch than a checklist. The practical work is in partial transcripts, back-pressure, duplex transport, cancellation logic and timing budgets. Gokul JS’s own project page points to one concrete implementation path through LiveKit, Whisper STT, GPT-4o and Rime TTS, and his blog indicates he is documenting more of that stack in public. (gokuljs.com 1) (gokuljs.com 2)
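
The cancellation logic on that checklist can be sketched with nothing more than Python's standard `asyncio` library. The example below is a minimal illustration under stated assumptions, not Gokul JS's implementation and not LiveKit's API: `speak`, `run_turn` and `user_speech_started` are hypothetical names, the 50 millisecond chunk time and 120 millisecond interruption are invented for the demo, and a real system would receive the speech-start signal from a VAD on the inbound audio channel.

```python
import asyncio


async def speak(chunks, played):
    """Stand-in for streaming TTS playback: emits one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0.05)  # pretend each chunk takes 50 ms to play
        played.append(chunk)


async def run_turn(chunks, user_speech_started):
    """Play a response, but stop playback if the user starts talking."""
    played = []
    playback = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(user_speech_started.wait())

    # Listen and speak at the same time; react to whichever finishes first.
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and not playback.done()

    # Cancel whatever is still running so no audio or waiter is left behind.
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return played, interrupted


async def demo():
    user_speech_started = asyncio.Event()
    # Simulate a VAD reporting user speech 120 ms into the agent's reply.
    asyncio.get_running_loop().call_later(0.12, user_speech_started.set)
    reply = ["Sure,", "the", "meeting", "starts", "at", "ten."]
    return await run_turn(reply, user_speech_started)


played, interrupted = asyncio.run(demo())
print(played, interrupted)
```

The design choice worth noting is that playback and listening are separate tasks raced against each other, so the orchestrator, not the TTS or the LLM, decides when the agent stops talking.
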