Meta’s TRIBE v2 Released
Meta unveiled TRIBE v2, a foundation model trained on fMRI data from roughly 700 volunteers that predicts neural responses to images, sound, and text, in effect a digital “neural twin” for media stimuli. The model opens new research avenues for neuroscience-aligned AI and raises open questions about data use and ethical safeguards. (mathrubhumi.com) (m.economictimes.com)
TRIBE v2 stitches together LLaMA 3.2 for text, V‑JEPA2 for video and Wav2Vec‑BERT for audio inside a unified transformer that maps multimodal embeddings onto brain space. (github.com) Meta’s demo and technical materials say the system now predicts whole‑brain activity at roughly 70,000 voxels, a jump from the parcel‑level outputs used in prior work. (aidemos.atmeta.com) Meta documents zero‑shot generalization to new subjects and reports 2–3× better accuracy than standard linear encoding baselines on held‑out auditory and visual datasets. (aidemos.atmeta.com)
The code and pretrained checkpoints are public on GitHub and Hugging Face (model repo facebook/tribev2); the repository README shows inference outputs on the fsaverage5 cortical mesh (~20k vertices) alongside Colab demo notebooks. (github.com) The published files place the code and model weights under a Creative Commons Attribution‑NonCommercial 4.0 license, and the README notes gated access to some pretrained components (LLaMA 3.2 requires a Hugging Face token). (github.com)
Reporting on training scale varies across outlets: Meta‑hosted materials and secondary coverage cite cohorts of 700+ participants and anywhere from hundreds to roughly 1,000 hours of fMRI recordings aggregated for training. (aidemos.atmeta.com) TRIBE v2 builds on the original TRIBE architecture, which placed first in the Algonauts 2025 brain‑encoding competition, and Meta’s demo highlights that the model’s predictions often correlate better with group‑average signals than with noisy single fMRI scans. (arxiv.org)
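For readers unfamiliar with the "linear encoding baselines" that TRIBE v2 is compared against, the sketch below shows what such a baseline typically looks like: a ridge regression from stimulus features to per-voxel responses, scored by Pearson correlation on held-out data. It is not Meta's code and does not use the TRIBE v2 API. All data are synthetic, and every name (`fit_ridge`, `voxelwise_pearson`, the array sizes, the noise level) is a hypothetical choice made for illustration. The last lines also show, under this toy model's assumption of independent noise per scan, why a prediction can score higher against a group average than against a single scan.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted time courses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    return (yt * yp).sum(axis=0) / np.where(denom == 0, 1.0, denom)


# Synthetic stand-ins (assumptions): stimulus features X, true weights W_true.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels, n_subjects = 400, 100, 32, 50, 8
W_true = rng.normal(size=(n_features, n_voxels))
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ W_true + rng.normal(scale=5.0, size=(n_train, n_voxels))

# Held-out scans: several simulated subjects share one response plus own noise.
scans = X_test @ W_true + rng.normal(scale=5.0, size=(n_subjects, n_test, n_voxels))

# Fit the baseline on training data and predict the held-out responses.
W_hat = fit_ridge(X_train, Y_train, alpha=10.0)
pred = X_test @ W_hat

# Score against one noisy scan and against the average over subjects.
single_scan_score = voxelwise_pearson(scans[0], pred).mean()
group_average_score = voxelwise_pearson(scans.mean(axis=0), pred).mean()
print(f"vs. single scan:   {single_scan_score:.3f}")
print(f"vs. group average: {group_average_score:.3f}")
```

In this toy setup, averaging across subjects cancels part of the independent noise, so the same prediction correlates more strongly with the group average. Whether that mechanism accounts for the behaviour Meta reports is not something this sketch can establish.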