Tencent releases embodied VLM on Hugging Face
Tencent published Hunyuan Embodied AI, a 2-billion-parameter vision-language model built on a Mixture-of-Transformers (MoT) architecture, and claims state-of-the-art results on multiple embodied benchmarks and computer vision tasks. The model is now available on Hugging Face, a sign that major labs are shipping multimodal models that target embodied use cases rather than only text or images. For robotics teams, releases like this mean stronger off-the-shelf VLMs to experiment with for perception and language-conditioned behavior. (x.com/HuggingPapers/status/2041962225812787387)
A robot does not fail because it cannot chat. It fails because it cannot tell whether the mug is behind the box, whether the hand is moving toward the drawer, or whether “put it on the left shelf” means a physical place in 3D space. (arxiv.org)
That is why researchers talk about embodied artificial intelligence. “Embodied” means the model is built for an agent with a body, a camera, and a job in the real world, not just a chatbot answering from a text window. (arxiv.org) The core model here is a vision-language model: one system that reads images and words together, like a person looking at a kitchen counter while hearing “pick up the red cup near the sink.” (arxiv.org)
Tencent’s new release is aimed at that exact problem. On April 9, 2026, the company published HY-Embodied-0.5 on Hugging Face and GitHub, with open weights for the smaller model and official inference code. (huggingface.co, github.com) The small version is built for machines that cannot afford giant data-center hardware. Tencent says the open model uses 2.2 billion active parameters at inference, while the larger closed variant has 32 billion active parameters for harder reasoning. (huggingface.co, github.com)
The unusual part is the architecture. Tencent calls it Mixture of Transformers, which works like sending different parts of a job to different specialists instead of making one giant network do every step the same way. (arxiv.org, github.com) Tencent says that design lets the model stay as fast as a dense 2-billion-parameter system while improving fine visual detail. In the paper, the team says the model uses modality-specific computing and latent tokens to sharpen perception for embodied tasks. (huggingface.co, arxiv.org)
The training recipe has two stages. Tencent first trains the stronger 32-billion-parameter model, then uses on-policy distillation to pass its step-by-step planning behavior down into the smaller 2-billion-parameter model. (arxiv.org, github.com)
Tencent says the compact model beats similarly sized state-of-the-art systems on 16 benchmarks, and the full paper reports evaluations across 22 benchmarks covering visual perception, spatial reasoning, and embodied understanding. Those are the tests researchers use to check whether a model can track objects, infer positions, and reason about actions instead of only captioning an image. (arxiv.org, huggingface.co)
The company also says the model was trained on more than 100 million embodied and spatial data points and a corpus of more than 200 billion tokens. That scale matters because robots need examples of motion, viewpoint changes, and object interactions that ordinary internet image models do not see enough of. (github.com, huggingface.co)
The last step is turning perception into action. Tencent says HY-Embodied is meant to plug into a vision-language-action system, where the model first interprets the scene and the instruction, then a control policy turns that understanding into motor commands for a physical robot. (arxiv.org, github.com)
The release does not mean home robots are solved. It does mean one more major lab is shipping open multimodal models aimed at shelves, drawers, hands, and camera views instead of only web pages and image captions, and that gives robotics teams a stronger off-the-shelf starting point than they had a year ago. (huggingface.co, github.com)
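For readers who want a concrete picture of the "different specialists" idea behind Mixture of Transformers described earlier, here is a toy Python sketch. It is not Tencent's implementation and does not use the released inference code. The Token class, the "vision" and "text" tags, and the scaling functions are hypothetical stand-ins chosen only to show tokens being routed to modality-specific computation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Token:
    modality: str        # hypothetical tag: "vision" or "text"
    values: List[float]  # stand-in for a token's hidden vector


def make_scaler(factor: float) -> Callable[[List[float]], List[float]]:
    """Return a toy 'specialist' that just scales a vector.

    A real model would use learned layers here; this is only a placeholder.
    """
    def apply(values: List[float]) -> List[float]:
        return [factor * v for v in values]
    return apply


# Hypothetical specialists: one block of computation per modality.
SPECIALISTS: Dict[str, Callable[[List[float]], List[float]]] = {
    "vision": make_scaler(2.0),
    "text": make_scaler(0.5),
}


def route(tokens: List[Token]) -> List[Token]:
    """Send each token to the specialist that matches its modality tag."""
    routed = []
    for token in tokens:
        specialist = SPECIALISTS[token.modality]
        routed.append(Token(token.modality, specialist(token.values)))
    return routed


if __name__ == "__main__":
    sequence = [Token("vision", [1.0, 2.0]), Token("text", [4.0])]
    for token in route(sequence):
        print(token.modality, token.values)
```

Each token is processed by only one specialist, so the work per token stays the same as in a single network even though the total number of parameters grows. That is the general intuition behind staying "fast like a dense model" while adding capacity for visual detail.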