Meta FAIR’s Brain & AI team took first place in this year’s Algonauts brain-encoding challenge with TRIBE, a model that predicts moment-by-moment fMRI responses while people watch movies. It doesn’t rely on a single stream. It fuses video frames, the soundtrack, and the dialogue transcript, then forecasts how activity rises and falls across the cortex.
The result topped a field of more than 260 teams and held up on films the model never saw during training.
What TRIBE is, in plain terms
Imagine you’re in a scanner watching a scene: a cut to a close-up, a line of dialogue under a swelling score. Every ~1.5 seconds the fMRI machine records a blood-oxygen signal across the brain. TRIBE ingests the same scene—visual frames, audio, and time-aligned text—and predicts those signals for 1,000 standardized cortical regions. It was trained on an unusually dense dataset: roughly 80 hours of recordings per person from the Courtois NeuroMod project, including seasons of “Friends” and several feature films. Dense per-subject data is the point; it lets a single model learn structure that generalizes across people.
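To make the prediction target concrete, here is a tiny Python sketch of the shapes involved. The ~1.5-second sampling interval and the 1,000-parcel target come from the description above; the exact 1.49 s repetition time, the run lengths, and the helper name are illustrative assumptions.

```python
import numpy as np

TR_SECONDS = 1.49   # assumed fMRI repetition time (~1.5 s per sample)
N_PARCELS = 1000    # standardized cortical regions predicted at each timepoint

def target_shape(run_minutes: float) -> tuple[int, int]:
    """Shape of the fMRI target for one movie run: (timepoints, parcels)."""
    n_trs = int(run_minutes * 60 / TR_SECONDS)
    return (n_trs, N_PARCELS)

# A ~22-minute sitcom episode yields roughly 885 rows of 1,000 parcel values.
print(target_shape(22))           # -> (885, 1000)

# ~80 hours of recordings per subject is on the order of 190k timepoints.
print(target_shape(80 * 60)[0])   # -> 193288
```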
TRIBE stands on three pretrained backbones: Llama 3.2 for text, Wav2Vec2-BERT for audio, and V-JEPA 2 for video. The team extracts intermediate representations from each, resamples them to a shared 2 Hz timeline, compresses across layers, projects them to a common width, concatenates the streams, and feeds the sequence to an 8-layer transformer with positional and subject embeddings. A subject-conditioned linear head maps to the 1,000 parcels. Training uses AdamW, a cosine learning-rate schedule, and “modality dropout,” which randomly silences input streams so the model stays useful if one is missing. An ensemble then blends many training runs parcel by parcel based on validation scores. The approach is pragmatic: reuse strong feature extractors, and learn only the temporal fusion and subject conditioning.
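That recipe maps fairly directly onto code. Below is a minimal PyTorch sketch of a TRIBE-style fusion model, not the released implementation: the backbone hidden sizes, the dropout rate, the per-subject linear heads, and the handling of the 2 Hz-to-TR resampling are all assumptions; the layer compression, shared projection, concatenation, 8-layer transformer, subject embeddings, and modality dropout follow the description above.

```python
import torch
import torch.nn as nn

class TribeStyleEncoder(nn.Module):
    """Sketch of a TRIBE-style trimodal encoder; dimensions are assumptions."""

    def __init__(self, n_layers=4, width=768, n_subjects=4,
                 n_parcels=1000, max_len=1024):
        super().__init__()
        # Assumed hidden sizes for the three frozen backbones.
        dims = {"text": 3072, "audio": 1024, "video": 1408}
        # Learned softmax weights collapse each backbone's stacked
        # intermediate layers into one vector per 2 Hz timestep.
        self.layer_weights = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(n_layers)) for m in dims})
        # Project each modality to a shared width before concatenation.
        self.proj = nn.ModuleDict(
            {m: nn.Linear(d, width) for m, d in dims.items()})
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, 3 * width))
        self.subject_emb = nn.Embedding(n_subjects, 3 * width)
        layer = nn.TransformerEncoderLayer(d_model=3 * width, nhead=8,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=8)
        # Subject-conditioned readout: one linear head per subject.
        self.heads = nn.ModuleList(
            [nn.Linear(3 * width, n_parcels) for _ in range(n_subjects)])
        self.p_modality_dropout = 0.2  # assumed rate

    def forward(self, feats, subject):
        # feats[m]: (batch, time, n_layers, dim_m); subject: (batch,) indices
        streams = []
        for m, x in feats.items():
            w = torch.softmax(self.layer_weights[m], dim=0)
            x = self.proj[m](torch.einsum("btld,l->btd", x, w))
            if self.training and torch.rand(()).item() < self.p_modality_dropout:
                x = torch.zeros_like(x)  # modality dropout: silence this stream
            streams.append(x)
        h = torch.cat(streams, dim=-1)                     # (batch, time, 3*width)
        h = h + self.pos_emb[:, :h.size(1)] + self.subject_emb(subject)[:, None]
        h = self.fusion(h)                                 # 8-layer transformer
        # Outputs stay on the 2 Hz grid; resampling to the ~1.5 s TR is omitted.
        out = [self.heads[s](h[i]) for i, s in enumerate(subject.tolist())]
        return torch.stack(out)                            # (batch, time, parcels)

# Example: 2 subjects, 60 timesteps (30 s of stimulus at 2 Hz).
model = TribeStyleEncoder()
feats = {"text": torch.randn(2, 60, 4, 3072),
         "audio": torch.randn(2, 60, 4, 1024),
         "video": torch.randn(2, 60, 4, 1408)}
print(model(feats, torch.tensor([0, 1])).shape)  # torch.Size([2, 60, 1000])
```

In a setup like this, the backbone features would be computed offline once; only the small fusion transformer and the readout heads need gradients, which is what makes the reuse-and-fuse strategy cheap.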
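The ensembling step can be sketched in the same spirit. The article only says runs are blended parcel by parcel based on validation scores; the specific weighting below (clipped, normalized validation correlations) is an assumed scheme for illustration.

```python
import numpy as np

def blend_parcelwise(preds, val_scores):
    """Blend predictions from many runs with one weight per (run, parcel).

    preds:      (n_runs, n_timepoints, n_parcels) predicted fMRI signals
    val_scores: (n_runs, n_parcels) validation scores, e.g. Pearson r
    returns:    (n_timepoints, n_parcels) blended prediction
    """
    w = np.clip(val_scores, 0.0, None)             # ignore runs that hurt a parcel
    w = w / (w.sum(axis=0, keepdims=True) + 1e-8)  # normalize weights per parcel
    return np.einsum("rtp,rp->tp", preds, w)

rng = np.random.default_rng(0)
preds = rng.standard_normal((10, 885, 1000))       # 10 runs, one episode's TRs
scores = rng.uniform(0.0, 0.5, size=(10, 1000))
print(blend_parcelwise(preds, scores).shape)       # (885, 1000)
```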