Anam.ai + Vision Agents
- Anam.ai integrated GetStream's open-source Vision Agents multimodal framework to build reactive AI avatars for calls.
- The avatars read body language and eye contact to coach tutors, sales reps, and new hires in real time.
- This shows growing interest in multimodal, nonverbal signals for meeting UX, with implications for privacy and latency on devices (x.com).
A video agent is a chatbot that can see and speak during a live call, and Anam.ai is now wiring that kind of agent into animated faces. GetStream added native Anam avatar support to its open-source Vision Agents framework on April 8, 2026. (getstream.io)
Vision Agents is Stream’s Python framework for real-time voice and video agents, first released on October 10, 2025. Stream says it is built for low-latency calls, with agents joining in about 500 milliseconds and audio-video latency under 30 milliseconds on Stream’s edge network. (getstream.io)
Anam supplies the face layer. Its docs describe a four-step pipeline: speech-to-text hears the user, a large language model decides the reply, text-to-speech generates audio, and face generation turns that audio into live video. (anam.ai)
The new integration lets developers stream an agent’s audio into an Anam avatar and receive synchronized video frames back during the call. Stream packaged that support in `vision-agents-plugins-anam` and said it handles audio resampling, video settings, and interruptions when a user starts speaking. (getstream.io)
Anam published a companion recipe on April 17, 2026 showing the setup in code. The example connects to `getstream.Edge`, adds `AnamAvatarPublisher`, uses Gemini for the language model, Deepgram for speech-to-text and text-to-speech, and swaps the avatar’s background between a kitchen and a studio based on the conversation. (anam.ai)
That matters because the same framework already ships with real-time video processors that can inspect what a camera sees. Stream’s examples include a golf coach that combines Gemini Live with Ultralytics YOLO pose detection, a computer-vision model that tracks body position from video frames. (getstream.io)
Put together, that stack points to meeting agents that do more than transcribe words.
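As a rough illustration of the four-step pipeline that Anam's docs describe (speech-to-text, language model, text-to-speech, face generation), one conversational turn can be sketched as four functions chained in order. This is a minimal sketch, not Anam's or Stream's code: `AvatarTurn`, `run_turn`, and the `fake_*` stand-ins are hypothetical names invented for this example, and the toy stages only pass strings and bytes around so the flow can run without any network calls.

```python
from dataclasses import dataclass
from typing import Callable, List

# Every name in this block is a hypothetical placeholder for illustration.
# None of it is the Anam or Vision Agents API.


@dataclass
class AvatarTurn:
    transcript: str    # what speech-to-text heard
    reply: str         # what the language model decided to say
    audio: bytes       # synthesized speech for the reply
    frames: List[str]  # stand-in for video frames driven by that audio


def run_turn(
    user_audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
    face: Callable[[bytes], List[str]],
) -> AvatarTurn:
    """Run one conversational turn through the four stages in order."""
    transcript = stt(user_audio)  # step 1: hear the user
    reply = llm(transcript)       # step 2: decide the reply
    audio = tts(reply)            # step 3: generate speech audio
    frames = face(audio)          # step 4: turn audio into video frames
    return AvatarTurn(transcript, reply, audio, frames)


# Toy stand-ins so the sketch runs offline.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_face(audio: bytes) -> List[str]:
    # One placeholder "frame" per 4 bytes of audio, rounded up.
    count = (len(audio) + 3) // 4
    return [f"frame-{i}" for i in range(count)]


turn = run_turn(b"hello", fake_stt, fake_llm, fake_tts, fake_face)
print(turn.reply, len(turn.frames))
```

In a real deployment each stage would be a streaming service call rather than a blocking function, and the loop would also need to cancel in-flight audio and video when the user interrupts, which is the part Stream says its plugin handles.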
Anam says its avatars are already aimed at customer support, sales and lead qualification, language tutoring, skill training, and medical front-desk assistance, which are the same kinds of settings where gaze, posture, and turn-taking can be turned into live coaching prompts. (anam.ai)
The tradeoff is where all that media gets processed. Stream’s latest release added a LocalEdge option that runs microphone, speaker, and optional camera input directly on the machine, with only language-model, speech-to-text, and text-to-speech calls leaving the device. (getstream.io)
Anam has also been pushing enterprise controls around retention. Its privacy docs say zero data retention can be enabled for a persona or an ephemeral session, which means transcripts, user audio, prompts, responses, and session recordings are processed in volatile memory rather than stored. (anam.ai)
So the immediate story is not just a new avatar plugin. It is that the open-source plumbing for live agents now covers voice, video, pose analysis, and a human-like face in the same call loop, which makes coaching-style assistants easier to build and puts latency and privacy decisions much closer to the product surface. (github.com)