Video has become a dominant signal on the internet. It powers everything from Netflix's $82 billion Warner Bros Discovery acquisition to the sensor streams feeding warehouse robots and city surveillance grids. Yet beneath this sprawl, AI systems are hitting a wall: they can tag clips and rank highlights, but they struggle to remember what they've seen over weeks or months.
Now a new wave of infrastructure is emerging to fix that. It treats "visual memory" as a distinct layer in the stack, and it could redefine how machines perceive, recall, and act on the physical world.
The scale of the memory problem
The numbers alone make the case. Streaming platforms generate petabytes of viewer data daily, while enterprise video from security cameras, factory floors, and retail sensors keeps adding to a video share that already accounts for roughly 80% of all internet traffic. Netflix's AI engine, a massive cog in its content recommendation and personalization machine, already operates as a form of behavioural memory: it tracks what viewers watch, skip, and binge to optimize slates and thumbnails. Warner's own AI teams are now folded into that empire, bringing similar tooling for production analytics and audience testing.
But this is still episodic intelligence.
These systems process short bursts of footage for immediate insights, such as flagging violent clips or predicting churn from watch patterns. The harder problems demand persistent memory: tracking how a store layout influences dwell time over six months, or spotting process drift on a manufacturing line from continuous visual observation. Current architectures collapse under the compute cost of feeding months of video into a single context window, which forces brittle workarounds like manual sampling or clip-level indexing.
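A rough back-of-the-envelope calculation shows why. The constants below (one sampled frame per second, 256 tokens per frame, a one-million-token context budget) are illustrative assumptions rather than measurements of any particular model, but the size of the gap is the point:

```python
# Back-of-the-envelope: why months of raw video don't fit in a context window.
# Every constant here is an illustrative assumption, not a benchmark.

SECONDS_PER_MONTH = 30 * 24 * 3600   # ~2.6M seconds of continuous footage
SAMPLED_FPS = 1                      # already a heavy downsample from ~30 fps
TOKENS_PER_FRAME = 256               # rough patch-token count per frame
CONTEXT_LIMIT = 1_000_000            # a generous long-context budget

tokens_per_month = SECONDS_PER_MONTH * SAMPLED_FPS * TOKENS_PER_FRAME
print(f"tokens for one month at 1 fps: {tokens_per_month:,}")
print(f"overflow vs a 1M-token context: {tokens_per_month / CONTEXT_LIMIT:,.0f}x")
```

Even with aggressive subsampling, a single month overflows a generous context window by hundreds of times, which is exactly what pushes systems toward sampling tricks, clip-level indexes, or a dedicated memory layer.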
Research catches up: Memory as architecture
Academic work is converging on explicit memory mechanisms as the fix. VideoMem, a recent arXiv paper, proposes adaptive retention for ultra-long videos, dynamically compressing frames based on "surprise" while preserving key episodes for downstream reasoning. Parallel efforts in long-video generation use hybrid memories, pairing global state for storyline coherence with local cues for frame-to-frame continuity, so minutes of footage can be synthesized without drifting or hallucinating.
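To make the retention idea concrete, here is a minimal sketch of surprise-based frame selection: keep a frame only when its embedding diverges enough from the last retained one. This illustrates the general mechanism, not VideoMem's published algorithm, and the threshold, embedding size, and toy data are placeholder choices:

```python
import numpy as np

def surprise_based_retention(frame_embeddings: np.ndarray, threshold: float = 0.3):
    """Keep a frame only when it is 'surprising' relative to the last retained
    frame (cosine distance above a threshold). A sketch of adaptive retention
    in general, not VideoMem's published method."""
    kept = [0]                                   # always keep the first frame
    last = frame_embeddings[0]
    for i in range(1, len(frame_embeddings)):
        emb = frame_embeddings[i]
        cos_sim = np.dot(last, emb) / (
            np.linalg.norm(last) * np.linalg.norm(emb) + 1e-8
        )
        if 1.0 - cos_sim > threshold:            # surprise = cosine distance
            kept.append(i)
            last = emb
    return kept

# Toy usage: a slowly drifting embedding stream stands in for mostly static
# footage; a thousand frames consolidate down to a handful of retained ones.
rng = np.random.default_rng(0)
frames = np.cumsum(0.05 * rng.standard_normal((1000, 512)), axis=0)
frames += rng.standard_normal(512)
print(len(surprise_based_retention(frames)), "of 1000 frames retained")
```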
Google Research's Titans + MIRAS framework takes this further, blending long-term memory modules with retrieval for test-time adaptation and explicitly framing sequence models as associative stores rather than stateless predictors. The pattern is becoming clear: larger models and longer contexts are necessary, but the true unlock for long-term video intelligence lies in durable, queryable memory architectures. These aren't fringe ideas; they are the scaffolding for agents that operate over real-world timescales, from AR assistants remembering room layouts to industrial systems learning from seasonal patterns.
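The "associative store" framing is easiest to see in code. The toy class below writes per-episode keys and values and retrieves the closest ones at query time; it sketches the shape of the idea under assumed dimensions and schema, and is not the Titans or MIRAS implementation:

```python
import numpy as np

class AssociativeMemory:
    """Toy key-value store illustrating the 'associative memory' framing:
    write (key, value) pairs as episodes arrive, read back the closest ones
    at test time. A sketch of the idea, not Titans/MIRAS itself."""

    def __init__(self, dim: int):
        self.keys = np.empty((0, dim))   # one row per stored episode
        self.values = []                 # parallel list of episode payloads

    def write(self, key: np.ndarray, value: dict) -> None:
        self.keys = np.vstack([self.keys, key[None, :]])
        self.values.append(value)

    def read(self, query: np.ndarray, k: int = 3) -> list:
        # Cosine similarity between the query and every stored key.
        sims = self.keys @ query / (
            np.linalg.norm(self.keys, axis=1) * np.linalg.norm(query) + 1e-8
        )
        return [self.values[i] for i in np.argsort(-sims)[:k]]

# Usage: store one summary per hour of footage, then recall the closest two.
mem = AssociativeMemory(dim=128)
rng = np.random.default_rng(0)
for hour in range(24):
    mem.write(rng.standard_normal(128), {"hour": hour, "summary": f"episode {hour}"})
print(mem.read(rng.standard_normal(128), k=2))
```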
Startups productizing the memory layer
Venture capital is following suit, with a clutch of companies betting on memory-first video infrastructure. Twelve Labs, with over $50M in funding raised, powers semantic search and chapterization for media archives and audits, treating video as queryable embeddings rather than lived history. VideoRAG pipelines, such as the agentic tools coming out of YC's infra batches, segment long footage for iterative LLM recall, but they often lack native persistence across sessions.
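The pattern behind those pipelines looks roughly like the sketch below: segment and index the footage, then let an LLM iteratively pull more segments until it can answer. The `embed`, `llm_step`, and `index` interfaces are placeholders for whatever embedding model, LLM call, and vector store a given pipeline uses; this is the generic loop, not any vendor's API:

```python
# Generic agentic VideoRAG loop (a sketch, not a specific product's API).
# `embed` turns text into a vector, `index` searches pre-segmented footage,
# and `llm_step` returns either a final answer or a refined follow-up query.

def video_rag_answer(question, embed, llm_step, index, max_rounds=4):
    context = []                         # metadata of retrieved video segments
    query = question
    for _ in range(max_rounds):
        # Pull the segments most relevant to the current query.
        context.extend(index.search(embed(query), k=5))
        # The LLM either answers or asks for more footage.
        step = llm_step(question=question, context=context)
        if step["done"]:
            return step["answer"]
        query = step["next_query"]
    # Out of rounds: force a best-effort answer from what was retrieved.
    return llm_step(question=question, context=context, force_answer=True)["answer"]
```

Note what is missing: nothing in this loop survives the session. The retrieved context is rebuilt from scratch for every question, which is precisely the gap the persistent-memory players are targeting.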
Memories.ai, another player in the market, is targeting explicit visual memory directly. The company, fresh off its seed round, aims to build the substrate that lets agents recall long-term events, such as "what happened here last month," without full recomputation. Its focus is on marketing, security, smart hardware, and robotics, where consolidating episodes into a graph structure beats simple one-shot search.
| Approach | Key Players | Core Mechanism | Horizon / Strength |
|---|---|---|---|
| Clip Understanding | Legacy CV tools | Frame classification | Seconds; alerts |
| Semantic Search | Twelve Labs | Embeddings + retrieval | Hours; archives/highlights |
| Agentic Retrieval | VideoRAG startups | Segmentation + LLM iteration | Days; reasoning over chunks |
| Persistent Memory | Memories.ai LVMM | Compression + graph consolidation | Months; robotics/agents |
| Big-Lab Hybrids | Titans/MIRAS (Google) | Adaptive modules + test-time mem | Indefinite; scalable adaptation |
Visual memory vs. LLMs and generators
It is important to distinguish this development from existing LLM agent memory systems like Mem0 or Letta. Those tools excel at text-based persistence, tracking user preferences and chat history effectively, but they fall short when pointed at video. They generally treat video as flat embeddings, which is sufficient for summarizing a clip yet leaves them functionally blind to object trajectories, environmental changes, and months-long cumulative patterns.
At the consumer end, models like OpenAI’s Sora 2 and Google’s Veo 3.1 make this distinction intuitive. They are astonishing at producing short sequences with believable physics, synchronized audio, and persistent characters across a handful of shots, acting in effect as world simulators for clips. But even Sora 2’s multi‑shot stories are closer to vivid dreams than durable memories: once the sequence is rendered, the system does not “remember” what it created in any structured, queryable way. That limitation is why a separate, persistent memory layer is increasingly seen not as an optional add-on, but as the quiet, essential backbone behind these powerful generative front ends.
Video memory demands distinct primitives: denoised latents to strip redundancy, spatiotemporal graphs for "where/when/what moved," and episodic consolidation for higher-level recall. That is the premium edge: it moves from pixels to a persistent world model, not just from tokens to session state.
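As a concrete illustration of the "where/when/what moved" idea, the sketch below consolidates repeated sightings of an object in a zone into episodes and answers "what happened here since a given date" without replaying footage. The schema, the object and zone names, and the five-minute consolidation gap are assumptions for illustration, not any vendor's format:

```python
from collections import defaultdict
from datetime import datetime

class SpatioTemporalMemory:
    """Minimal sketch of a 'where/when/what moved' store: repeated sightings of
    an object in a zone consolidate into (first_seen, last_seen) episodes
    instead of raw frames. Schema and thresholds are illustrative only."""

    def __init__(self):
        # (object_id, zone) -> list of (first_seen, last_seen) episodes
        self.episodes = defaultdict(list)

    def observe(self, object_id: str, zone: str, ts: datetime,
                gap_seconds: int = 300) -> None:
        eps = self.episodes[(object_id, zone)]
        if eps and (ts - eps[-1][1]).total_seconds() <= gap_seconds:
            eps[-1] = (eps[-1][0], ts)      # close in time: extend the episode
        else:
            eps.append((ts, ts))            # otherwise start a new episode

    def recall(self, zone: str, since: datetime) -> dict:
        """Answer 'what happened here since <date>' without replaying footage."""
        return {
            obj: [ep for ep in eps if ep[1] >= since]
            for (obj, z), eps in self.episodes.items()
            if z == zone and any(ep[1] >= since for ep in eps)
        }

# Usage: two sightings minutes apart merge into one episode, a later sighting
# opens a second one, and both stay queryable months afterwards.
mem = SpatioTemporalMemory()
mem.observe("forklift_3", "loading_bay", datetime(2025, 1, 6, 9, 0))
mem.observe("forklift_3", "loading_bay", datetime(2025, 1, 6, 9, 2))
mem.observe("forklift_3", "loading_bay", datetime(2025, 1, 6, 15, 30))
print(mem.recall("loading_bay", since=datetime(2025, 1, 1)))
```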
Big labs and the modular stack
Behind this, big labs are redesigning the AI stack around modularity. Yann LeCun's JEPA proposals separate perception from a hippocampus-like episodic store that feeds a world model for planning: memory isn't baked into the weights, it is a tool. Andrej Karpathy echoes the point, noting that context windows mimic working memory while long-term histories demand external persistence, especially for multimodal data.
IBM and others are operationalizing this with "AI that remembers everything," blending structured recall with privacy controls for enterprise-scale deployment. Cloud providers sense the pattern too and are experimenting with memory-augmented video services: think AWS or Azure offerings for long-horizon feeds that agents can query natively.
Netflix and the entertainment flywheel
Returning to entertainment and media, the memory imperative is already acute. Speculatively, the Netflix-Warner deal wasn't just about viewer growth, catalogues, and IP; it was also about fusing two AI engines optimized for visual prediction, combining recommendations built on watch history with production tweaks drawn from A/B tests on trailers and edits. Netflix's internal video understanding tech is adept at short-cycle insights, covering scene detection, engagement forecasting, and multimodal personalization, yet it is primed to evolve into persistent memory, enabling queries like "how did slow pans perform for horror fans last quarter?" across petabytes of assets.
As Sora‑class models and Veo‑style tools seep into production pipelines, studios will need something underneath them: a layer that actually remembers what was generated, how it performed, and how it connects to everything else in the catalogue. This is where the emerging memory stack comes in. As studios push dynamic cuts and viewer-specific visuals, they could tap memory specialists or build the layer in-house. Either way, the logic scales: in a world where AI watches more video than humans do, forgetting less becomes the edge.
Fault lines ahead
Risks loom large. Privacy regimes will hammer persistent visual logs, demanding redaction and ephemerality that pure search vendors can sidestep. Standardization matters too: memory APIs may converge the way vector databases did, or fragment by modality and domain. And incumbents, from hyperscalers to security giants, could subsume the layer, squeezing out the pure-plays.
"Visual memory isn't optional for embodied AI," Shen adds. "It's the bridge from clips to continuous experience." The companies that own it will define the next stack.
Memories.ai recently partnered with Qualcomm phones to embed their latest video memory model, LVMM 2.0, providing for rapid retrieval of photos and documents by basic descriptive searches.
In an industry chasing world models and agents, that observation lands squarely on the infrastructure reshaping how machines see the world.



