
Cache Strategies in Distributed Systems


The Bug That Took Us Two Days to Find: Cache Stampede in Production

At my previous company, users were getting silently logged out.

Support tickets were piling up.

The frontend team believed it was a backend issue. The backend team suspected something was wrong on the frontend.

For two days, the teams kept investigating different parts of the system.

Eventually, we discovered the real issue.

All our Redis TTLs were expiring at the same time.

Because of time constraints, we needed an immediate solution. We decided to add jitter to the TTL values, so that cache entries would expire at different times instead of simultaneously.

Later, while studying system design in more depth, I realized this issue has a well-known name:

Cache Stampede (also known as the Thundering Herd Problem).

It turns out this is a very common issue in distributed systems, and many large-scale companies have faced it at some point.


Why Basic TTL Is Not Enough

A basic caching strategy might look something like this:

cache -> user_profile:123
TTL -> 60 seconds

After 60 seconds, the cache entry expires. The next request fetches fresh data from the database and rebuilds the cache.
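The flow above can be sketched as a minimal read-through cache with a fixed TTL. This is illustrative only: `fetchUserProfile` stands in for a hypothetical database call, and the in-memory `Map` stands in for Redis.

```typescript
// Minimal read-through cache with a fixed TTL (illustrative sketch).
const store = new Map<string, { value: string; expiresAt: number }>();
const TTL_MS = 60_000; // 60 seconds

async function getUserProfile(
  id: string,
  fetchUserProfile: (id: string) => Promise<string>, // hypothetical DB call
): Promise<string> {
  const key = `user_profile:${id}`;
  const entry = store.get(key);
  if (entry && entry.expiresAt > Date.now()) {
    return entry.value; // cache hit
  }
  // Cache miss (or expired): query the database and rebuild the entry
  const fresh = await fetchUserProfile(id);
  store.set(key, { value: fresh, expiresAt: Date.now() + TTL_MS });
  return fresh;
}
```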

For small applications with low traffic, this approach works perfectly fine.

However, in large-scale systems, this simple strategy can create serious problems.


The Biggest Problem: Cache Stampede (Thundering Herd)

Imagine your application receives 10,000+ requests for a popular endpoint.

If the cache expires at the same moment:

  • The cache entry disappears
  • All requests miss the cache
  • Every request directly queries the database

Cache expires → 10,000 cache misses → 10,000 database queries

The database suddenly receives a massive spike in traffic.

This can lead to:

  • Database overload
  • Increased latency
  • Request timeouts
  • Cascading system failures

This phenomenon is known as the Cache Stampede or Thundering Herd problem.

The root cause is synchronized cache expiration.


Why Synchronized Expiration Is Dangerous

With a fixed TTL, entries cached at the same moment all expire at exactly the same time.

If many requests depend on the same cache entry, they will all attempt to rebuild the cache simultaneously.

This creates traffic spikes and unnecessary pressure on the database.

To avoid this problem, engineers use different strategies. Each approach comes with its own trade-offs.


Strategies to Prevent Cache Stampede

1. TTL Jitter (The Approach We Used in Production)

Instead of assigning the same TTL to every cache entry, we introduce randomness.

ttlSeconds = 60 + Math.floor(Math.random() * 60) // 60–119 seconds

Now cache entries expire at slightly different times, which spreads the load over time.

Instead of thousands of requests hitting the database simultaneously, the traffic is distributed more evenly.

This is one of the simplest and most effective solutions.
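As a sketch, the jitter formula above can be wrapped in a small helper. The base TTL and jitter window here are illustrative values, and the Redis call in the comment assumes a typical client API rather than any specific library.

```typescript
// TTL jitter: a fixed base plus a random offset, so entries written at the
// same moment expire at different times. Values below are illustrative.
const BASE_TTL_SECONDS = 60;
const MAX_JITTER_SECONDS = 60;

function ttlWithJitter(
  base: number = BASE_TTL_SECONDS,
  maxJitter: number = MAX_JITTER_SECONDS,
): number {
  return base + Math.floor(Math.random() * maxJitter);
}

// Usage with a hypothetical Redis client:
// await redis.set("user_profile:123", payload, "EX", ttlWithJitter());
```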


2. Mutex Locking

In this approach, when the cache expires and multiple requests arrive:

  • One request acquires a lock
  • Other requests wait

Request A -> acquires lock
Request B -> blocked
Request C -> blocked

Request A then:

  1. Fetches data from the database
  2. Rebuilds the cache
  3. Releases the lock

After that, the blocked requests read the data directly from the cache.

This ensures only one database query happens per cache miss.
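A minimal sketch of this idea for a single Node.js process is below. `loadFromDb` is a hypothetical database call, and the lock is just an in-memory set; in a multi-node deployment you would use a distributed lock instead (for example, Redis `SET lock:key <id> NX PX <ttl>`).

```typescript
// Per-key mutex sketch: one request rebuilds the cache, the rest wait
// and then read the rebuilt value from the cache.
const profileCache = new Map<string, string>();
const locked = new Set<string>();

async function getWithMutex(
  key: string,
  loadFromDb: () => Promise<string>, // hypothetical DB call
): Promise<string> {
  while (true) {
    const cached = profileCache.get(key);
    if (cached !== undefined) return cached; // cache hit

    if (!locked.has(key)) {
      locked.add(key); // this request acquires the lock
      try {
        const value = await loadFromDb(); // only one database query
        profileCache.set(key, value);
        return value;
      } finally {
        locked.delete(key); // release the lock
      }
    }

    // Lock held by another request: wait briefly, then re-check the cache.
    await new Promise((resolve) => setTimeout(resolve, 10));
  }
}
```

The polling wait keeps the sketch short; a production version would typically subscribe to the lock release or reuse the holder's promise instead of sleeping.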


3. Cache Coalescing (Request Coalescing)

Cache coalescing takes a slightly different approach.

Instead of blocking requests with locks, identical requests are grouped together.

One request fetches data from the database, and the response is shared with all waiting requests.

100 requests → 1 database query → response shared with all

This reduces duplicate backend calls and improves overall efficiency.
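In a single Node.js process, coalescing can be sketched by sharing one in-flight promise per key: concurrent misses for the same key join the promise instead of issuing their own query. `fetchFresh` is a hypothetical loader.

```typescript
// Request coalescing sketch: identical in-flight requests share one promise,
// so N concurrent cache misses for the same key produce one backend call.
const inFlight = new Map<string, Promise<string>>();

function coalesce(
  key: string,
  fetchFresh: () => Promise<string>, // hypothetical loader
): Promise<string> {
  const pending = inFlight.get(key);
  if (pending) return pending; // join the request already in flight

  const p = fetchFresh().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

The difference from the mutex approach is that waiters never touch the cache or a lock; they simply receive the same response the first request produced.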


4. Probability-Based Early Re-computation

Another technique is refreshing the cache before it expires using a probability function.

Instead of waiting for the TTL to reach zero, some requests will refresh the cache earlier.

// the closer the entry is to expiry, the higher the refresh probability
if (random() < refreshProbability(ttlRemaining)) {
  refreshCache();
}

This spreads cache refresh operations across time and avoids sudden spikes.

Trade-off: The cache might be recomputed earlier than necessary, which increases compute usage.
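One well-known way to pick that probability is to refresh when the entry is within a randomized distance of its expiry, scaled by how expensive the rebuild is. The sketch below follows that idea; `recomputeCostMs` and `beta` are tuning assumptions, not parameters from any specific library.

```typescript
// Probabilistic early recomputation sketch: the closer an entry gets to
// expiry, the more likely a request is to refresh it before the TTL hits zero.
function shouldRefreshEarly(
  nowMs: number,
  expiryMs: number,
  recomputeCostMs: number, // roughly how long a rebuild takes (assumption)
  beta: number = 1.0, // > 1 refreshes more eagerly (assumption)
): boolean {
  // -ln(random) is an exponentially distributed factor >= 0, so each request
  // independently "jumps ahead" by a random multiple of the rebuild cost.
  const earlyByMs = recomputeCostMs * beta * -Math.log(Math.random());
  return nowMs + earlyByMs >= expiryMs;
}
```

Because each request draws its own random factor, refreshes are scattered over the window before expiry instead of piling up at the moment the TTL runs out.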


Key Takeaway

Caching seems simple:

set(key, value, TTL);

But in high-traffic systems, expiration strategies become extremely important.

A fixed TTL that works fine at small scale can silently destroy your system at scale. Start with TTL jitter, understand the other strategies, and choose based on your traffic patterns and tolerance for complexity.
