What is the thundering herd problem?

The thundering herd problem is the failure mode where many concurrent callers arrive after a cached value has expired or been evicted, and each independently triggers a recomputation against the origin. The origin is hit with N times its expected traffic, while every caller waits on a result that should have been served from cache. It is also called cache stampede — the same failure mode, named in the web-infrastructure literature rather than the operating-systems literature. Standard mitigations include single-flight, jittered TTLs, probabilistic early expiration, request coalescing, gateway rate limiting, and client backoff.

What's the difference between the thundering herd problem and cache stampede?

They're two names for the same failure mode. "Thundering herd" comes from the operating-systems literature (many processes woken by one event, only one able to make progress); "cache stampede" comes from web infrastructure (many requests racing to recompute one expired cache key). In both, a popular value disappears from a cache and a crowd of concurrent callers independently rebuilds it, overwhelming the origin. The mechanics and the mitigations (single-flight, jittered TTLs, request coalescing) are identical; only the vocabulary differs by field.

How do you fix or prevent the thundering herd problem?

For a load-bound system — a content site, a dashboard, an API in front of an expensive computation — the standard mitigations work well, and most production systems combine several: - Single-flight (request coalescing): let one caller recompute the expired value while the rest wait on its result, so the origin sees one rebuild instead of N. - Jittered TTLs: offset each key's expiry by a small random amount so keys don't all expire at the same instant. - Probabilistic early recomputation (XFetch): have readers independently refresh a value slightly before it expires, with rising probability, so it's rebuilt before the crowd arrives. - Stale-while-revalidate: serve the last value immediately and recompute in the background. - Load shedding / rate limiting: cap how many rebuilds the origin will accept so it survives the surge. For a content system these fully solve the problem, because the only damage is load and a brief delay. For a real-time decision system they're necessary but not sufficient — they keep the origin alive but still serve a stale or delayed value, which is the rest of this article's subject.

Is the thundering herd problem a load problem or a correctness problem?

In a content system it is a load problem: the origin is overwhelmed for a moment, then the cache refills and everyone who waited gets the same correct answer. In a real-time decision system the stampede coincides with a correctness problem, though they're worth keeping distinct. The correctness failure — concurrent requests committing irreversible decisions against stale or in-flight state — is caused by reading from a copy that lags its source, and it can happen with a perfectly warm cache that never stampedes at all. The stampede doesn't cause the wrong decisions; it concentrates them into the worst moment (a hot key under burst). Both are symptoms of the same root: a separate cached copy on its own clock. The standard mitigations address the load symptom and leave the staleness untouched.

Does single-flight recomputation fix the thundering herd problem for decision systems?

It fixes the load aspect, not the correctness aspect. Single-flight lets one caller recompute while the rest wait on its result, so origin load drops from N to 1. But the waiters still wait, and the recompute still takes time — so requests either exceed the decision's validity window and time out into a wrong decision, or fall back to a stale value and commit against state that's already behind. Single-flight is necessary for surviving the load; it is not sufficient for correctness under burst.

Why don't jittered TTLs or backoff help during a fraud burst?

Jittered TTLs and backoff desynchronize a herd that is spread across many keys, by spreading expiry or retries across time. A fraud burst forms around one hot key — the victim card — so there is nothing to desynchronize: every request needs that single key, and jitter can't spread a single key across time. Backoff adds delay the validity window can't absorb. Both are load-shaping tools for the many-keys case; the burst is the one-key case.

Why doesn't write-through or continuous-push caching fix it?

It fixes the stampede but not the underlying problem. Pushing every update as it happens removes the expiry, and cache-on-write means the first request warms a cold key while the rest of a burst hit it warm — so the prediction surface narrows to the first request rather than the whole herd. What it can't fix is staleness: a warm, never-expiring copy is still a copy on its own clock, lagging the source by the push's propagation delay. During a burst the counter can be warm, hit on every read, and still read 4 while the source has passed 5 — so every request approves. You've stopped the stampede and kept the wrong decision.

What is the alternative to caching derived state for real-time decisions?

Maintain the value in the same store that serves it, rather than computing or fetching it elsewhere and caching a copy. An incremental materialized view is one mechanism: the store keeps the value current as the underlying data changes, and reads hit it directly with no separate cache in front. There's no second copy to expire, to lag, to invalidate, or to keep warm — so the whole family of cache problems, stampede included, loses its surface. Two honest caveats: a rolling-window aggregate still does bounded work at read time (an indexed range scan) or via a time-bucketed rollup, since incremental maintenance is write-triggered, not clock-triggered; and a hot key under a write burst becomes write contention on that row, handled with sharded or batched counters. Critically, this only stays current if the serving store is the one the writes commit to — or close enough that its lag stays inside the validity window. Maintain the value on a replica fed by change data capture and you reintroduce a propagation delay; the win holds only to the extent that delay stays under the window.

Back to Blog

Real-Time Architecture

The Thundering Herd Problem in Real-Time Decision Systems

The thundering herd problem looks like a load problem, and single-flight or jittered TTLs treat it as one. In a real-time decision system it's a correctness problem — and the real fix is removing the cache, not tuning it.

Xiaowei Jiang

CEO & Chief Architect

May 22, 2026

12 min read

TL;DR: The thundering herd problem — many callers hitting an expired cache key at once and each triggering a duplicate recompute — only exists because an external cache was forced into the read path. That cache is there because going to the source on every read is too costly — too slow, too repetitive, or too unpredictable on the tail — but the reason doesn't matter to what follows. The standard mitigations (single-flight, jittered TTLs, backoff, load shedding) all optimize the recompute step, and for content systems that's enough. But the external cache creates a family of problems, not just the stampede: it drifts out of sync under concurrency, it has to be invalidated correctly, and it can't hold every key. In a real-time decision system these surface as wrong, irreversible decisions inside the validity window, not as latency. The fix isn't a better cache — it's removing the reason for an external one: maintain the value in the same store that serves it, advanced as data changes (for example, an incremental materialized view), so reads hit current state directly. This holds only when that store is close enough to the source that its lag stays inside the validity window — a precondition this post is explicit about.

Almost everything written about the thundering herd problem treats it as a load problem. A hot cache key expires, dozens of concurrent callers race to recompute it, the origin collapses under the traffic — so you reach for single-flight, jittered TTLs, request coalescing, backoff. These are good tools, and for a content site or a dashboard they are the right tools: a few hundred milliseconds of pain resolves once the cache refills, everyone who waited gets the same correct answer, and nobody is harmed.

This post takes a step back and asks a different question: why is the cache there at all? The thundering herd problem only exists because something forced an external cache into the read path — a separate store, on its own clock, holding a copy of state the source of truth couldn't serve fast enough on demand. And once you see the cache as a forced compromise rather than a given, the thundering herd stops looking like the problem. It's one symptom of a larger one. The same external copy that stampedes on expiry also drifts out of sync, also has to be invalidated, also can't hold every key. The stampede is just the symptom with a name.

That reframing matters most in a real-time decision system — fraud checks, credit decisions, agent actions — because there the cache's problems don't show up as slow responses. They show up as wrong ones: an irreversible approval or decline made against a copy of state that no longer matches reality. So this post traces the thundering herd back to the cache it depends on, walks the family of problems that same cache creates, and shows the architecture that removes the reason to have one.

---

What the Thundering Herd Problem Is in Distributed Systems

The thundering herd problem is the failure mode where many concurrent callers for the same cached value arrive after that value has expired or been evicted, and each independently triggers a recomputation against the origin. The origin — a slower database, an aggregation pipeline, or an upstream API — gets hit with N times its expected traffic, while every caller waits on a result that should have been served from cache. The term comes from operating systems (multiple processes woken by one event, only one able to do useful work); in modern distributed systems it most often describes a stampede against Redis or any read-through layer fronting an expensive computation. It is also called cache stampede — the same failure mode, named in the web-infrastructure literature rather than the OS literature.

The canonical example is a content one. A high-traffic page caches a rendered fragment with a five-minute TTL. At the instant of expiry, a few hundred requests are mid-flight. All of them miss. All of them call the rendering service, which was sized for one request every five minutes for this key — and it melts. The same shape recurs anywhere many callers read one hot key through a cache: a device fleet waking together and hammering a control-plane API, CDN edges expiring the same key during a live event and fetching from origin at once.

In every one of these cases the damage is load, and it is self-correcting: the origin is overwhelmed, latency spikes, and once the cache refills the system is fine — everyone who waited gets the same correct answer. Notice what the stampede depends on, though. It requires a cache: a separate store, holding a copy of a value, that can expire and force a reconstruction. No external cache, no stampede. So before asking how to tame the herd, it's worth asking why the cache is there — because the answer turns out to explain a whole family of problems, not just this one.

Why the Cache Is There: a Forced Compromise

The external cache is not an accident, and it is not laziness. It is forced by a real constraint: going to the source of truth on every read is too costly. Why it's too costly varies — and, as we'll see, doesn't actually matter to what follows:

The value is expensive to compute. A velocity count, a rolling aggregate, a session feature — anything derived from many underlying events — is too slow to derive on every read, so it's precomputed ahead of time (often by an asynchronous pipeline — Kafka → Flink → Redis) and the result is stored.
The value is expensive to fetch. Even a raw, already-computed row can be too slow to read from the system of record under load, so a copy is kept close to the reader to skip the round-trip.
The same value is read many times. Each read might be affordable alone, but recomputing or refetching it for every one of a million reads is waste, so the result is memoized.
The source's tail is unpredictable. The source may be fast on average and still blow a hard deadline on its slow tail; a cache cuts the variance, not just the mean.

Those are different problems, and there are more (shielding the source from load, smoothing a spiky upstream). But notice they all resolve to the same move and the same structural outcome: a second copy of state, in a separate store, on its own clock, kept in rough sync with a source it sits downstream of. The specific reason the copy exists is irrelevant to everything that follows — what matters is that there is now a copy, and a copy can disagree with the source. The cache buys read performance, and it pays for it in synchronization debt.

The thundering herd is the first installment of that debt. The cached copy can expire or be evicted; when it does, the system has to reconstruct it from the source, and the source — sized for the legitimate write rate, not for inline reconstruction at read concurrency — buckles when many callers demand the same hot key at once. But expiry is only one way the copy and the source disagree, and the stampede is only the disagreement that happens to have a name. The next sections walk the rest of the family.

A Correctness Problem, Not a Load Problem: the Family of Cache Failures

A second copy of state, downstream of its source, can disagree with that source in more than one way. Each disagreement is a distinct failure; they share a root cause but are not the same problem, and conflating them is how teams end up fixing one and getting blindsided by another.

Staleness under concurrency. The copy lags the source by however long propagation takes. Under concurrent reads of fast-changing state, multiple callers read the same lagging value and each acts on it. This is not the thundering herd — it happens with a perfectly warm cache that never misses or expires. Picture the velocity counter sitting warm at 4 while a burst drives the true count past 5: every read is a clean cache hit, no stampede at all, and every request still approves against a value that's already wrong. This is the failure context under concurrency treats in depth.

Invalidation. The copy has to be told when it's wrong. Knowing precisely which keys a given write invalidates is a famously hard problem; get it wrong and the cache silently serves stale values nobody flagged.

The prediction problem. A cache can't hold everything, so it holds a subset — which means deciding in advance which keys are worth keeping. The keys that matter most in an attack are, by construction, the ones you didn't predict.

The thundering herd. The copy expires, and reconstructing it stampedes the source. The famous one — but, as the list shows, only one of four.

All four are the synchronization debt of keeping a second copy in a separate store. In a content system the debt is cheap: a stale or stampeding read costs latency, resolves, and everyone converges on the right answer. In a decision system it isn't, because a decision is bound by a validity window — the time it has to read context, evaluate, and commit before the decision is useless or wrong — and it is irreversible once committed. Reading a stale, in-flight, or absent value inside that window doesn't slow the decision down; it makes it wrong, and there is no second pass to fix it. The stampede is the most visible way this happens — it concentrates the staleness into the worst possible moment, a hot key under burst — but every member of the family commits the same kind of irreversible error. The rest of this post follows the famous one, the stampede, because chasing its fixes is what reveals that the cache itself is the thing to remove.

Fix 1: Make the Recompute Faster (Single-Flight, Jitter, Backoff)

Here is the scenario the rest of the post runs on. A card issuer runs a fraud check on every authorization, with one velocity rule: decline if the card has been used more than five times in the last 60 seconds. Decisions must commit within a 250ms validity window. A fraud ring fires roughly 8,000 authorization attempts at one previously-quiet victim card in 400ms, rotating across merchants to dodge per-merchant rules. The velocity counter lives in Redis behind a Kafka → Flink pipeline, and its key expires 80ms into the burst.

Faced with that, an engineer works through three fixes in roughly this order. Each is correct against the problem as the previous one left it. None closes the window — and walking the sequence shows why the cache itself, not any property of how it’s filled, is the thing that has to go.

The first move is to tame the recompute. The standard family of mitigations all do this, and all of them are mature and effective at what they target:

Single-flight / request coalescing lets one caller recompute while the other 7,999 wait on its result — origin load drops from N to 1. But the waiters still wait. If the Flink recompute takes 180ms, requests that arrived early in a 250ms validity window time out and decline for timeout, not for fraud — a wrong decision that looks like a right one. (Go’s singleflight package is the reference implementation.)
Jittered TTLs / probabilistic early expiration spread expiry across time so fewer keys expire at once. The refined version, XFetch (Vattani et al.'s probabilistic early recomputation), has each reader independently decide to refresh slightly before expiry with rising probability, so one reader rebuilds the value before the crowd arrives. Both help when the herd is spread across many keys. They do almost nothing here, because the herd forms around one hot key — the victim card — and no amount of jitter desynchronizes a single key from itself.
Stale-while-revalidate returns the last cached value immediately and recomputes in the background. It eliminates the wait — by committing the stale value. The counter reads 4, the threshold is 5, and all 8,000 requests approve. The recompute lands 180ms later showing the true count in the thousands, after the burst has cleared and the funds are gone.
Load shedding / gateway rate limiting drops a fraction of requests to keep the origin alive. The origin survives; the dropped decisions are simply never made, which a fraud ring reads as a green light.

Every one of these optimizes the recompute — makes it smaller, smoother, or hidden. For content, that’s the whole job: a brief inconsistency resolves into one correct answer and nobody is harmed. In the burst, each option produces a different wrong decision, because the recompute is still on the read path and still takes time the validity window doesn’t have.

Fix 2: Pre-Compute the Value So Reads Never Recompute

If the recompute on the read path is the problem, the natural next move is to take it off the read path: don’t compute on miss, compute ahead of time. This is exactly what the Kafka → Flink → Redis pipeline already does — Flink maintains the counter continuously and pushes it into Redis, so a normal read is a clean cache hit with no recompute behind it.

This genuinely removes the miss-triggered recompute. But it doesn’t remove the cache, and the cache still carries a TTL — a pushed value with no expiry drifts silently when an update is dropped or a Flink job restarts, so operators set a TTL as a correctness backstop. The stampede surface doesn’t disappear; it moves, from “miss because the value was never there” to “miss because the value expired.” When the TTL lapses for the hot key mid-burst, the system falls back to the pipeline — which is sized for the legitimate write rate, not for inline computation at read concurrency. You are back in Fix 1, with the same validity-window failure, triggered by expiry instead of absence.

Fix 3: Push Every Update So the Value Never Expires

So drop the TTL and keep the cache fresh by pushing every update as it happens — write-through, change-driven invalidation, a continuously-updated key. Now there’s no expiry to miss on. This is the closest of the three to right, and it genuinely retires the stampede for any key that stays warm. But it trades the expiry surface for two others.

It still has to decide what to keep warm. A cache holds a subset; keeping every key resident is the case we'll come to in a moment, and at that point you no longer have a cache in front of a source — you have the maintained store itself. Short of that, a cold victim card costs at least the first request a miss-and-reconstruct. Cache-on-write blunts this — the first authorization warms the key and the rest of the burst hits it warm — so the prediction surface is narrower than under Fix 2, not gone: it's the first request, plus any that arrive before the first write propagates.

It doesn't remove staleness — it just stops the misses. This is the one that matters for decisions. A warm, never-expiring copy is still a copy on its own clock: between the moment a write commits at the source and the moment the push lands in the cache, every read is a clean hit on a value that's already behind. That is the staleness family member from the previous section, and write-through doesn't touch it. During the burst, the counter can be warm, hit on every read, and still read 4 while the source has already passed 5. You've fixed the stampede and kept the wrong decision.

Why the Standard Thundering Herd Fixes Fall Short for Decisions

Fix 1 optimizes the recompute and leaves it on the read path. Fix 2 takes the recompute off the read path but keeps an expiry surface. Fix 3 removes the expiry surface but keeps a narrowed prediction surface and — the decisive point — keeps the staleness. The surface keeps moving and never closes because all three keep the external cache: a separate store, on its own clock, holding a copy that lags the source. The stampede was never the root problem. The copy was. Every fix that keeps the copy keeps at least one member of the family.

The table traces it. Every row removes origin load; not one closes the validity window, because none of them removes the copy. The last row does — and that is the rest of the post. If the cache is the problem, the fix isn't a better cache. It's not having an external one.

Approach	Keeps a separate copy?	Stops the stampede?	Closes the validity window?	What still bites
Single-flight / coalescing	Yes	No	No	Waiters still wait; recompute exceeds the window
Jittered TTL / early expiry	Yes	No	No	Can’t desynchronize a single hot key
Stale-while-revalidate	Yes	Yes	No	Commits the stale value as a decision
Load shedding / rate limit	Yes	No	No	Dropped requests make no decision at all
Pre-compute into cache (Fix 2)	Yes	No	No	Stampede moves to TTL expiry under load
Continuous push / write-through (Fix 3)	Yes	Mostly	No	Copy still lags the source; reads stay stale
Maintain state in the store that serves it	No	Yes	Yes, if its lag stays inside the window	Hot-row write contention under burst

Remove the Reason for an External Cache

Every fix so far kept the separate copy and moved the surface. The whole family — stampede, staleness, invalidation, prediction — comes from one thing: a value living in a second store, downstream of the source, that has to be kept in sync. So the fix isn't a better copy. It's not having a separate one. Maintain the value in the same store that serves it, kept current as the underlying data changes, and a read hits live state directly — no second store to expire, to lag, to invalidate, or to keep warm.

Concretely, for the velocity counter: instead of a pipeline that computes it and a cache that stores the result, define it as a view over the transactions and let the store maintain it incrementally — an incremental materialized view. Each authorization that commits updates the count as part of, or immediately off, that same write; a read is then a point lookup against an indexed value, with no separate cache in front of it. (An IMV is one mechanism, not the point; the requirement is simply that the derived value is kept current in the store that serves it, however that's done.)

What that buys, specifically against the family:

No stampede surface and no prediction surface. There is no separate key to expire or to keep warm, so there is nothing to miss-and-reconstruct — for any entity, hot or cold, predicted or not. Whether 1 or 80,000 reads arrive for the same card, each is the same bounded lookup.

Reads aren't stale, because there's no second clock. This is the member of the family the cache fixes couldn't touch. When the value is maintained in the store that serves it, a read reflects current state directly — not a copy that lags it by a propagation delay. And because every read goes to the same place, concurrent reads all see the same value rather than a scatter of copies caught at different moments.

Two honest qualifications, because this is easy to oversell. First, the sliding 60-second window isn't free at read time. Incremental maintenance is driven by writes, not the clock — an insert advances the count, but nothing fires when a transaction simply ages past 60 seconds, because no row changed. So a true rolling window is evaluated at read time as a bounded range scan against an index on the hot entity:

sql

-- The whole velocity check: one indexed range scan against live state.
-- No cache, no recompute step, nothing to expire or stampede.
SELECT count(*) >= 5 AS decline
FROM authorizations
WHERE card_id = $1
  AND created_at > now() - interval '60 seconds';

With an index on `(card_id, created_at)` this touches only the rows for one card in the last minute — bounded work, not a reconstruction. (For very high write rates a time-bucketed rollup keeps it cheaper still.) The point is what's absent: there is no separately stored value that can be missing, expired, or behind. The claim isn't "zero work at read time"; it's a bounded indexed lookup against live state.

The honest residual: write contention, not a stampede. The work doesn't vanish, it changes shape. An 8,000-writes-in-400ms burst against one counter is now hot-row contention — serialized updates to a single row, version churn that concurrent readers traverse under MVCC. That's a real cost, and a genuinely write-heavy hot key needs the usual treatments (sharded counters, batched application of the increments). But it is bounded by the write rate and it fails safe: a contended read returns committed state, late at worst, never a stale or absent value that silently approves. The herd's defining move — many readers reconstructing one missing value — has no surface here.

One precondition makes or breaks all of this, and it's the same propagation delay this post criticized in the Kafka → Flink → Redis pipeline: the value is only current if the store that maintains it is the store the writes commit to, or close enough that its lag stays inside the validity window. Maintain the view inside the system of record (or in a store the authorizations transact against directly) and there is no second clock. Maintain it on a replica fed by change data capture and you have reintroduced exactly the propagation gap — smaller, perhaps, but real — and the "closes the validity window" claim holds only to the extent that CDC lag stays under the window. Any honest version of this architecture has to say which case it's in.

None of this is specific to one product. It's buildable wherever derived state can be maintained incrementally and read transactionally — pg_ivm on Postgres, a streaming view engine, an HTAP database, or a system designed around it. What they share is the property that matters: the value is maintained in the store that serves it, so a read sees current state instead of a copy of past state.

The Same Burst, End to End, Under Both Architectures

Same rule, same attack as above — the velocity check, the 250ms window, the ~8,000-request burst on one quiet card — run end to end under each architecture, so the contrast is in one place.

External cache (Kafka → Flink → Redis, 30s TTL). The counter’s key expires 80ms in. Single-flight makes the non-lead requests wait on a ~180ms recompute, so the early ones exceed the 250ms window and decline for timeout rather than fraud; the late ones read a counter reflecting when the recompute started, not where the burst now is. Switch to stale-while-revalidate and every request reads the pre-expiry value of 4, sees it under the threshold of 5, and approves — all of them — while the true count in the thousands lands after the funds are gone. And note the deeper point: even without the expiry, a warm copy fed by the Flink pipeline lags the source by the pipeline’s propagation delay, so the same approvals happen on a clean cache hit. The origin survived; the decisions did not.

Value maintained in the serving store. The count is maintained as a view in the store the fraud check reads, current with committed authorizations rather than copied into a downstream cache. There is no key to expire and no separate copy to lag, so reads aren’t stale and there is nothing to stampede. As the burst’s authorizations commit, the count rises with them; once it crosses 5, subsequent checks read a value at or above the threshold under the same snapshot and the rule fires. The residual cost is real but bounded — 8,000 writes to one hot row is contention on that row, handled with the usual hot-counter techniques — and it fails safe: a contended read is late, never silently stale. The ring doesn’t clear the burst. This holds to the extent the serving store’s lag behind the authorizations stays inside the 250ms window; if the authorizations commit elsewhere and replicate in via CDC, that replication delay is the budget to watch.

The difference isn’t a faster recompute. It’s that there is no separate copy on a separate clock — so no stampede on expiry and no staleness between writes. For the same failure traced through a full production fraud stack — feature store, rules engine, cross-channel seams — see real-time fraud detection architecture.

---