Meta’s TRIBE v2 predicts brain responses

Meta released TRIBE v2, a trimodal foundation model trained to predict fMRI responses to video, audio, and text, advancing zero-shot prediction of brain activity across modalities. That’s a practical step toward aligning multimodal AI outputs with biological signals, with implications for accessibility and brain–computer research. (marktechpost.com)

According to Meta’s release materials, TRIBE v2 was trained on more than 500 hours of fMRI recordings collected from more than 700 volunteers. (biopharmatrend.com) Meta reports that the model scales brain predictions from coarse maps of about 1,000 parcels to roughly 70,000 voxel-level targets, which it cites as a roughly 70× improvement in spatial resolution over earlier encoding approaches. (aihola.com)

The model combines pretrained feature extractors (LLaMA 3.2 for text, V-JEPA2 for video, and Wav2Vec-BERT for audio) and feeds their layerwise embeddings into a multimodal transformer that maps activations onto the cortical surface. (github.com) Architectural details disclosed in the release show modality features synchronized at 2 Hz, compressed and projected into a shared 1,024-dimensional embedding, and processed by an 8-layer transformer; the published configuration reports about 980 million trainable parameters. (openreview.net)

Meta and independent coverage highlight TRIBE v2’s zero-shot generalization to held-out subjects, unseen languages, and new task settings. The system builds on TRIBE v1’s top placement in the Algonauts 2025 brain-encoding challenge. (biopharmatrend.com) Source code and pretrained checkpoints for TRIBE v2 are publicly available through a GitHub repository and a Hugging Face model card, and the repository includes a demo notebook with a quickstart for predicting fMRI responses from a video file. (github.com)
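
To make the 2 Hz synchronization step concrete, here is a minimal Python sketch of one way to put feature streams with different native rates onto a shared 2 Hz timeline before they reach a transformer. It is an illustration only, not Meta’s implementation: the function names `resample_to_grid` and `synchronize`, the bin-averaging strategy, and the toy feature values are all assumptions, and the release coverage summarized here does not specify how the resampling is actually done.

```python
from statistics import fmean

TARGET_HZ = 2.0  # common rate reported for TRIBE v2's modality features


def resample_to_grid(frames, source_hz, duration_s, target_hz=TARGET_HZ):
    """Average feature frames into fixed-width time bins (assumed strategy).

    frames: list of equal-length feature vectors sampled at source_hz.
    Returns one vector per 1/target_hz-second bin. A bin with no frames
    repeats the previous bin, or is all zeros if it is the first bin.
    """
    n_bins = int(duration_s * target_hz)
    dim = len(frames[0])
    bins = [[] for _ in range(n_bins)]
    for i, frame in enumerate(frames):
        b = int(i * target_hz / source_hz)  # bin index for frame timestamp
        if b < n_bins:
            bins[b].append(frame)
    out = []
    for members in bins:
        if members:
            # Column-wise mean over all frames that fell into this bin.
            out.append([fmean(col) for col in zip(*members)])
        else:
            out.append(list(out[-1]) if out else [0.0] * dim)
    return out


def synchronize(streams, duration_s):
    """Resample each modality, then concatenate features per time step.

    streams: dict mapping a modality name to (frames, source_hz).
    """
    aligned = [resample_to_grid(f, hz, duration_s) for f, hz in streams.values()]
    return [sum(step, []) for step in zip(*aligned)]


# Hypothetical toy streams covering one second of stimulus.
toy_streams = {
    "video": ([[0.0], [2.0], [4.0], [6.0]], 4.0),          # 4 Hz, 1-dim
    "audio": ([[float(i), 1.0] for i in range(8)], 8.0),   # 8 Hz, 2-dim
    "text": ([[9.0]], 1.0),                                # 1 Hz, 1-dim
}
toy_aligned = synchronize(toy_streams, duration_s=1.0)
```

In this toy run, `toy_aligned` holds two time steps (one per 0.5 s), each a 4-value vector made of the video, audio, and text features for that step. In the reported architecture, vectors like these would then be compressed and projected into the shared 1,024-dimensional embedding before the 8-layer transformer; that projection is not shown here.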
