LinkedIn's Feed, serving over a billion members, experienced intermittent availability drops due to a critical infrastructure component freezing for up to 15 seconds. The Feed Retrieval platform, powered by the Rust-based FishDB engine, saw entire shards breach their Service Level Objectives (SLOs) without clear logs or reproducible triggers.
The issue, which hit different shards sporadically and only for brief periods, was eventually traced to a single HashMap resize. The resize, which occurred at approximately 58.7 million keys, triggered a cascade of kernel-level lock contention that froze the system's entire asynchronous runtime. The fix was a single line of code, but finding the root cause took a complex investigation. The incident shows how memory allocation behavior can become an availability risk at scale.
FishDB and the Feed's Foundation
FishDB, the storage and retrieval layer for LinkedIn's Feed, is built in Rust with jemalloc as its memory allocator and Tokio as its async runtime. It maintains several in-memory index structures for low-latency retrieval.
The document reference index, a HashMap mapping primary keys to document references, was central to the problem. At the time of the incident, this map held roughly 56–59 million entries per shard, consuming about 1.75 GB of memory.
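The 58.7 million figure is not arbitrary. As a rough sketch, assume the index is the Rust standard library's HashMap, whose underlying table uses power-of-two bucket counts and grows once it is about 7/8 full. The helper name `usable_capacity` is hypothetical and only illustrates the arithmetic. Under those assumptions, a table with 2^26 buckets runs out of room at roughly the key count reported for the incident:

```rust
/// Entries a table with `buckets` slots can hold before it must grow,
/// assuming a 7/8 maximum load factor (illustrative helper, not a FishDB API).
/// Assumption: `buckets` is a power of two and at least 8.
fn usable_capacity(buckets: u64) -> u64 {
    buckets / 8 * 7
}

fn main() {
    // 2^26 buckets is the first power of two whose 7/8 threshold
    // falls inside the reported 56-59 million entry range.
    let buckets: u64 = 1 << 26;
    let threshold = usable_capacity(buckets);
    println!("buckets = {buckets}, must grow after {threshold} entries");
}
```

If these assumptions hold, every shard holding a similar number of keys would cross the same threshold at nearly the same time, which fits a problem driven by data volume rather than by traffic.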
The Mystery: Elusive Availability Drops
FishDB experienced recurring 1-minute breaches of its availability SLO, caused by brief, self-resolving outages that left almost no trace.
The outages were ephemeral, lasting only 10–15 seconds, which made them hard to catch with conventional monitoring. During these freezes, the application produced zero logs and health checks went unanswered, creating the illusion of a complete process pause.
The sporadic timing and the absence of discernible triggers, such as deployments or traffic spikes, complicated the investigation.
The First Clue: Memory Spikes
Correlation analysis revealed a critical pattern: every availability drop coincided with a significant spike in Resident Set Size (RSS) memory. RSS would momentarily jump about 4 GB above baseline, then settle to a persistent ~2 GB increase.
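One back-of-the-envelope reading of these numbers, assuming the spike comes from a hash table that doubles in size when it grows, is sketched below. During such a resize the old and new tables are both live until the entries have been moved, and then the old table is freed. The type and function names are hypothetical, and the 1.75 GB input is the index size quoted earlier:

```rust
/// Extra memory, in GB relative to the pre-resize baseline, for a table
/// that grows by doubling (illustrative model, not measured data).
struct ResizeProfile {
    transient_peak_gb: f64,
    settled_increase_gb: f64,
}

fn doubling_resize_profile(old_table_gb: f64) -> ResizeProfile {
    let new_table_gb = old_table_gb * 2.0;
    ResizeProfile {
        // The old table is already part of the baseline, so the peak
        // above baseline is the full size of the new table.
        transient_peak_gb: new_table_gb,
        // Once the old table is freed, only the difference remains.
        settled_increase_gb: new_table_gb - old_table_gb,
    }
}

fn main() {
    let profile = doubling_resize_profile(1.75);
    println!(
        "peak above baseline: {} GB, settled increase: {} GB",
        profile.transient_peak_gb, profile.settled_increase_gb
    );
}
```

Under this model the peak is 3.5 GB and the lasting increase is 1.75 GB, in the same ballpark as the observed jump of about 4 GB followed by a persistent increase of about 2 GB. Allocator overhead and other allocations are ignored here.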
Because the spike hit all hosts in an affected shard at the same time, individual hot queries and traffic issues could be ruled out. The pattern pointed instead to a systemic, data-driven problem.
Eliminating Possibilities
Before resorting to advanced profiling, common culprits were systematically ruled out.