As multi-modal models tackle increasingly complex long-form video understanding, their memory, meaning their capacity to retain and recall information, becomes a significant bottleneck. Current benchmarks have advanced perception and reasoning but have largely overlooked systematic evaluation of this capability. The researchers address this gap with M$^3$Eval, the first comprehensive evaluation framework and benchmark designed specifically to probe different memory dimensions in multi-modal models. Grounded in cognitive psychology, it introduces carefully constructed tasks that isolate key aspects of memory. More details are available in the paper on arXiv.
Memory's Weakest Links: Disentanglement and Interference
Extensive experiments with M$^3$Eval reveal consistent weaknesses across representative multi-modal models. A key finding is that models struggle to maintain disentangled representations when processing parallel video streams. The models also exhibit interference patterns that diverge significantly from those of human memory, suggesting a fundamental difference in how stored information is overwritten or corrupted. These results underscore the need for better memory evaluation of multi-modal models.
Spatial vs. Temporal Grounding and Symbolic Gaps
The evaluation also sheds light on how multi-modal models anchor their memory. The research indicates that models ground memory sources more reliably in the spatial domain than in the temporal domain, a bias that may limit their ability to recall sequential events accurately. The models also show limited symbolic memory capacity, a capability that is crucial for abstract reasoning and for understanding narratives over extended periods.