Meta FAIR’s Brain & AI team took first place in this year’s Algonauts brain-encoding challenge with TRIBE, a model that predicts moment-by-moment fMRI responses while people watch movies. It doesn’t rely on a single stream. It fuses video frames, the soundtrack, and the dialogue transcript, then forecasts how activity rises and falls across the cortex.
The result topped a field of more than 260 teams and held up on films the model never saw during training.
What TRIBE is, in plain terms
Imagine you’re in a scanner watching a scene: a cut to a close-up, a line of dialogue under a swelling score. Every ~1.5 seconds the fMRI machine records a blood-oxygen signal across the brain. TRIBE ingests the same scene—visual frames, audio, and time-aligned text—and predicts those signals for 1,000 standardized cortical regions. It was trained on an unusually dense dataset: roughly 80 hours of recordings per person from the Courtois NeuroMod project, including seasons of “Friends” and several feature films. Dense per-subject data is the point; it lets a single model learn structure that generalizes across people.
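To make the shapes concrete, here is a back-of-the-envelope sketch; the episode length and exact TR are assumed for illustration, while the 1,000-parcel target matches the setup described above:

```python
# Rough framing of the prediction target, with assumed numbers.
TR_SECONDS = 1.49      # roughly the ~1.5 s sampling interval described above
EPISODE_MINUTES = 22   # a typical sitcom episode length, assumed for illustration
N_PARCELS = 1000       # standardized cortical regions predicted at each time point

n_volumes = int(EPISODE_MINUTES * 60 / TR_SECONDS)  # ~885 brain volumes per episode
target_shape = (n_volumes, N_PARCELS)               # one (time, parcels) matrix per subject
print(target_shape)                                 # (885, 1000)
```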
TRIBE stands on three pretrained backbones: Llama 3.2 for text, Wav2Vec2-BERT for audio, and V-JEPA 2 for video. The team extracts intermediate representations from each, resamples them to a shared 2 Hz timeline, compresses across layers, projects to a common width, concatenates, and feeds the sequence to an 8-layer transformer with positional and subject embeddings. A subject-conditioned linear head maps to the 1,000 parcels. Training uses AdamW, cosine schedule, and “modality dropout” so the model stays useful if one stream is missing. An ensemble blends many runs parcel-by-parcel based on validation scores. It’s pragmatic: reuse strong feature extractors; learn the temporal fusion and subject conditioning.
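A minimal PyTorch sketch of that recipe, to make the moving parts concrete. The feature dimensions, dropout rate, and fusion details here are assumptions for illustration, not the released TRIBE code (linked at the end):

```python
# A sketch of a TRIBE-style trimodal encoder, not the official implementation:
# layer sizes, feature dimensions, and the fusion projection are assumptions.
import torch
import torch.nn as nn


class TriModalEncoder(nn.Module):
    def __init__(self, dims=None, d_model=768, n_layers=8,
                 n_subjects=4, n_parcels=1000, max_len=2048, p_modality_drop=0.2):
        super().__init__()
        dims = dims or {"text": 3072, "audio": 1024, "video": 1408}
        # One projection per modality onto a shared width; features arrive pre-extracted
        # from the backbones and already resampled to the shared 2 Hz timeline.
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        self.fuse = nn.Linear(len(dims) * d_model, d_model)  # merge the concatenated streams
        self.pos_emb = nn.Embedding(max_len, d_model)        # position along the 2 Hz grid
        self.subj_emb = nn.Embedding(n_subjects, d_model)    # which brain is being predicted
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Subject-conditioned readout: one linear head per subject, 1,000 parcels each.
        self.heads = nn.ModuleList([nn.Linear(d_model, n_parcels) for _ in range(n_subjects)])
        self.p_drop = p_modality_drop

    def forward(self, feats, subject_id):
        # feats: {"text"|"audio"|"video": (batch, time, d_modality)}; subject_id: (batch,) ints.
        streams = []
        for name, proj in self.proj.items():
            x = proj(feats[name])
            # Modality dropout: occasionally silence an entire stream during training
            # so the model stays usable when one input is missing.
            if self.training and torch.rand(()) < self.p_drop:
                x = torch.zeros_like(x)
            streams.append(x)
        h = self.fuse(torch.cat(streams, dim=-1))
        positions = torch.arange(h.size(1), device=h.device)
        h = h + self.pos_emb(positions) + self.subj_emb(subject_id)[:, None, :]
        h = self.trunk(h)
        head = self.heads[int(subject_id[0])]  # assumes one subject per batch
        return head(h)                         # (batch, time, n_parcels) predicted BOLD
```

Training would then pair AdamW with a cosine learning-rate schedule and a plain regression loss on the parcel time courses; the parcel-wise ensembling described above happens after training, by weighting each run’s prediction per parcel according to its validation score.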
The leaderboard metric is correlation: predicted vs. measured BOLD time courses, averaged over parcels. TRIBE ranked first and showed robust performance on out-of-distribution movies—animation, nature documentaries, even silent black-and-white clips—an honest test of cross-modal understanding rather than pattern memorization.
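The score itself takes a few lines to compute; a sketch of parcel-wise Pearson correlation in NumPy, assuming a prediction and a recording of shape (time, parcels):

```python
# Parcel-averaged Pearson correlation, a leaderboard-style score.
import numpy as np


def encoding_score(pred, bold, eps=1e-8):
    """pred, bold: arrays of shape (time, n_parcels); returns per-parcel r and the mean."""
    zp = (pred - pred.mean(axis=0)) / (pred.std(axis=0) + eps)
    zb = (bold - bold.mean(axis=0)) / (bold.std(axis=0) + eps)
    per_parcel = (zp * zb).mean(axis=0)  # one correlation per parcel's time course
    return per_parcel, float(per_parcel.mean())
```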
How close is “brain-like”, and why does trimodality matter?
fMRI is noisy; even repeated viewings by the same person don’t match perfectly. The authors estimate a noise ceiling from repeats and report TRIBE captures about half of the explainable variance on average, approaching the ceiling in language and auditory cortices. That’s not “we modeled the brain,” but it’s a large chunk of predictable signal from naturalistic stimuli using one model.
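One way to picture the “half of the explainable variance” claim: treat the correlation between two viewings of the same movie as a per-parcel ceiling, then express the model’s correlation as a fraction of it. This is an illustrative estimator, not necessarily the exact one used in the paper.

```python
# Illustrative noise-ceiling normalization; the paper's estimator may differ in detail.
import numpy as np


def ceiling_normalized(model_r, bold_run1, bold_run2, eps=1e-8):
    """model_r: per-parcel correlations; bold_run1/2: (time, parcels) repeated viewings."""
    z1 = (bold_run1 - bold_run1.mean(0)) / (bold_run1.std(0) + eps)
    z2 = (bold_run2 - bold_run2.mean(0)) / (bold_run2.std(0) + eps)
    ceiling = (z1 * z2).mean(0)                 # how well the brain predicts itself, per parcel
    ratio = model_r / np.maximum(ceiling, eps)  # fraction of the explainable signal captured
    return np.clip(ratio, 0.0, 1.0)
```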
Unimodal encoders do well where you expect—video for early visual cortex, audio for auditory cortex, text for language areas. But the gains from TRIBE show up most in associative regions where the brain integrates signals over time. That’s the territory where understanding, not just sensing, happens. The ablations back this up: any single stream lags; combining two helps; all three do better still.
What made it work comes down to three ingredients: naturalistic data at depth (tens of hours per subject), strong pretrained foundations (especially a predictive video backbone in V-JEPA 2), and a time-aware fusion model that learns across subjects instead of training one network per person. The open-sourcing helps too. Paper and code are public; the dataset is broadly accessible. Reproducibility isn’t an afterthought.
The limits are just as clear. Prediction happens at the parcel level, not the voxel level, which trades spatial precision for reliability and tractable compute. fMRI’s temporal resolution is slow; millisecond dynamics are out of scope. The subject pool is small, though deeply scanned; scaling to more brains will test how far subject embeddings stretch. And the current target is perception and comprehension: behavior, memory, and decision signals will need different experiments and possibly other measurements. These limits are documented and point to clear follow-ups rather than hidden caveats.
Beyond a leaderboard
For neuroscience, this is a move away from siloed, task-specific models toward integrated encoders that reflect real viewing conditions. For AI, alignment with brain responses provides an external yardstick for whether model representations capture human-relevant structure across sight, sound, and language over time. That’s useful for diagnosing failure modes, for safety work, and for training objectives that encourage temporal understanding instead of static pattern matching. The video backbone choice is not incidental here; V-JEPA 2 is trained to predict the near future, which naturally correlates with anticipatory signals seen in cortex during natural viewing.
The first frontier is scale: more subjects and more hours, since the scaling curves haven’t flattened. A second is finer spatial targets, moving from parcels toward voxel-level maps where the signal-to-noise allows it. A third is broader scope: adding tasks that tap memory and decisions, or pairing fMRI with faster signals. On the modeling side, the recipe is accessible: better trimodal backbones, smarter temporal fusion, and disciplined training with modality dropout and subject conditioning. The infrastructure and data to reproduce and extend the result are already available.
Download the code: https://github.com/facebookresearch/algonauts-2025
Read the paper: https://arxiv.org/abs/2507.22229
Learn about the challenge: https://algonautsproject.com/index.html
Download the data: https://cneuromod.ca
