Tacnode
Real-Time Architecture

Context Under Concurrency: Why Your Cache Collapses Under Load

Context under concurrency is the production failure mode where cached derived state goes stale faster than the system can refresh it, and parallel decisions commit against divergent snapshots. This post covers why high-velocity state plus concurrent decisions break the caching pattern, how the preparation gap and the retrieval gap compound under load, and what a serving layer has to do differently to keep decisions coherent when every millisecond of staleness has a business consequence.

Xiaowei Jiang
CEO & Chief Architect
12 min read
Diagram showing how concurrent decisions read divergent cached context and commit conflicting outcomes before the underlying state catches up

TL;DR: Context under concurrency is the failure mode where cached derived state goes stale faster than the system can refresh it, and concurrent decisions commit against divergent snapshots. Two gaps compound under load: the preparation gap (pre-computed state lags the events that should have updated it) and the retrieval gap (different services read different caches, each at a different propagation stage). Benchmarks don’t reproduce this because they don’t reproduce concurrency. The fix isn’t a bigger cache — it’s a single serving layer that keeps context internally coherent across parallel decisions inside the validity window of the decision itself.

Every team that hits context under concurrency starts in the same place. The cache is fast on the benchmark. The p99 read latency on Redis is under a millisecond. The feature store returns a vector in three. The materialized view refreshes on a schedule that looked reasonable in the design doc. And then real traffic shows up, concurrent decisions start stacking, and the numbers that mattered in design stop describing what production actually does.

This is not a tuning problem. You can make the cache bigger, add more replicas, partition the keyspace, pre-warm the hot path. Those moves help at the margin. What they do not do is change the fact that the cache has a propagation delay, the decision has a validity window, and under concurrent load the delay is wider than the window. Concurrent decisions read the same stale snapshot, commit against it in parallel, and only discover the conflict when it lands in the ledger hours later.

What Context Under Concurrency Actually Means

Context under concurrency describes the production condition where an automated decision must read derived state — aggregates, counters, balances, session data — that is being updated by parallel events at the same moment the decision is committing. Multiple concurrent decisions read the same pre-update snapshot, each commits against context that was correct a moment ago, and the composite result is wrong by the time the commits land.

The problem is not that one decision reads slightly old data. Single-decision staleness is an old problem with reasonable answers — TTLs, cache invalidation, pull-through reads. The problem is that many decisions read the same old data at the same time, each commits independently, and the side effects collide. When state velocity is high — many events per second modifying the same shared counter, balance, or aggregate — the window in which the cache is “current enough” shrinks below the window in which concurrent decisions have to commit. Below that crossover, the system stops being able to make consistent decisions, regardless of how the cache is configured.

This is structurally different from what a benchmark measures. A benchmark runs one request, measures latency, repeats. Every run starts from a clean cache state, reads a fresh value, returns. There is no other concurrent writer racing against the read. There is no second decision that depends on the side effect of the first. The numbers look fine because the test doesn’t create the condition the failure requires.

Latency targets dominate the conversation about real-time systems because they’re easy to measure and optimize. But latency tells you how long one request takes. It does not tell you what happens when two requests that depend on the same state arrive at the same time. Consider a card authorization path: the decision is “does this transaction stay within the cardholder’s daily exposure limit?” In a benchmark, the read is fast and the commit path is clean. In production, two transactions for the same card arrive within the same hundred-millisecond window from two different merchants. Both authorization requests read the cache. Both see the same exposure total. Both decide the transaction is within the limit. Both approve. The cardholder now has two transactions approved that together exceed the limit. No individual step failed — the cache returned fresh data within its TTL, the comparison was correct, latency was inside budget — yet the system approved something it should have blocked. Two concurrent decisions read the same pre-update snapshot and neither knew about the other.
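The race above can be sketched in a few lines. This is a minimal toy (all names hypothetical, the limit and amounts illustrative): two threads are forced into the same read window, both see the same stale exposure snapshot, and both commit.

```python
import threading

LIMIT = 1000

class StaleCache:
    """Toy cache: the derived exposure total has not yet been refreshed."""
    def __init__(self, exposure):
        self.exposure = exposure

cache = StaleCache(exposure=900)   # $900 already spent against a $1000 limit
ledger = []                        # system of record, written after approval
barrier = threading.Barrier(2)     # force both requests into the same window

def authorize(amount):
    barrier.wait()                 # both requests read at the same moment
    snapshot = cache.exposure      # each sees 900 -- neither sees the other
    if snapshot + amount <= LIMIT:
        ledger.append(amount)      # commit: side effects land independently

t1 = threading.Thread(target=authorize, args=(80,))
t2 = threading.Thread(target=authorize, args=(90,))
t1.start(); t2.start(); t1.join(); t2.join()

total = cache.exposure + sum(ledger)  # 900 + 80 + 90 = 1070 > 1000
```

Every individual step here behaves correctly, exactly as in the authorization example: the check `snapshot + amount <= LIMIT` passes for both threads, yet the composite result exceeds the limit.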

This failure mode requires concurrent events against shared state, and it shows up exactly when traffic spikes — when a fraud burst arrives, when a flash sale fires, when a live event drives session volume. The benchmarks pass. The production system falls over.

How Caches Collapse Under Concurrent Load

The reason caches fail under concurrency isn’t that they’re slow. It’s that they’re downstream of a pipeline with a propagation delay, and under concurrent load the pipeline can’t keep up with the rate of state change.

A typical stack for derived context looks like this: an event hits the system of record (a transaction commits, a position updates, a user action lands); a change stream — CDC, Kafka, a queue — carries the event out of the SoR; a processor consumes the stream, recomputes derived state (a velocity counter, an exposure aggregate, a freshness score); the processor writes the updated state to a serving cache — Redis, an in-memory feature store, a materialized view; decision services read from the cache.

Each step adds propagation delay. Under normal load that delay is small — hundreds of milliseconds, maybe a few seconds for heavier computations. Under concurrent load, two things happen simultaneously. First, the rate of incoming events increases, which widens the stream’s lag. Second, the processor that maintains the derived state now has more work and falls further behind. By the time the recomputed state lands in the cache, more events have arrived, and the cache is already wrong again.
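A back-of-the-envelope simulation (rates purely illustrative) shows why the pipeline falls further behind once events arrive faster than the processor can fold them into derived state:

```python
def cache_lag(arrival_rate, processing_rate, seconds):
    """Events arriving per second vs. events the pipeline can fold into
    derived state per second. Returns the backlog of unprocessed events
    at the end of each simulated second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrival_rate
        backlog = max(0, backlog - processing_rate)
        history.append(backlog)
    return history

quiet = cache_lag(arrival_rate=50, processing_rate=100, seconds=5)
burst = cache_lag(arrival_rate=300, processing_rate=100, seconds=5)
# quiet -> backlog stays at zero; burst -> backlog grows by 200 events
# every second, so the cache is further behind at the end of each cycle
```

Under the quiet rate the backlog never accumulates; under the burst rate the gap between the cache and reality grows monotonically, which is exactly the "already wrong again" cycle described above.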

This is compounded by the fact that most production stacks aren’t one cache — they’re several. A fraud service runs its own feature cache at one propagation stage. An authorization service reads an exposure aggregate maintained by a different pipeline. An agent tool calls a vector store backed by a third embedding pipeline. Each is at a different point in its own catch-up cycle. Under quiet traffic, the drift between them is small enough that nobody notices. Under concurrent load, they diverge — and decisions start reading different versions of reality at the same moment. This is what we mean by the context gap: the derived state the decision needs is not yet available at the moment the decision commits, because the pipeline that was supposed to prepare it is still processing events from the last batch.

Context failures under concurrency come in two shapes, and production stacks often have both at once. Naming them separately matters because they require different fixes.

The preparation gap is the delay between an event occurring and the derived state that reflects it being available to read. A transaction commits at T=0. The velocity counter that should reflect it becomes queryable at T=800ms. In that 800ms window, any authorization decision that reads the counter sees a value that does not include the transaction that just happened. Pre-computation was supposed to solve this — instead of recomputing an aggregate on every read, compute it once when the underlying event lands, store it, serve it cheaply. This works well for aggregates that change slowly — daily rollups, weekly cohorts, monthly summaries. It breaks down when the aggregate is changing at the same rate as the decisions that read it. The preparation gap gets worse under concurrency because the pipeline that maintains the derived state has a finite processing rate. When events arrive faster than the pipeline can update the materialized state, the gap widens. On a quiet Tuesday afternoon it’s small enough to hide inside the decision’s latency budget. During a fraud burst, a checkout surge, or an agent workflow with parallel tool calls, it exceeds the budget and the decision reads state that is minutes behind, not milliseconds. See data freshness vs latency for the full breakdown of why these are different dimensions of the same problem.
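The T=0 / T=800ms timeline above can be made concrete. A small sketch (event times and delay are the illustrative numbers from the paragraph): an event committed at time `t` only becomes visible in the derived counter at `t + prep_delay_ms`, so any read inside that window undercounts.

```python
def counter_as_of(read_ms, event_times_ms, prep_delay_ms=800):
    """Value of a derived counter at read time: an event committed at t
    is only reflected once the pipeline has materialized it, at
    t + prep_delay_ms."""
    return sum(1 for t in event_times_ms if t + prep_delay_ms <= read_ms)

events = [0, 100, 250]   # three transactions commit in the first 250ms

visible_at_400 = counter_as_of(400, events)   # none materialized yet
visible_at_850 = counter_as_of(850, events)   # only the T=0 event so far
truth_at_400 = sum(1 for t in events if t <= 400)  # what the SoR knows: all 3
```

The gap between `truth_at_400` and `visible_at_400` is the preparation gap for that read; under concurrency, `prep_delay_ms` itself grows as the pipeline falls behind.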

The retrieval gap is what happens when context is split across multiple systems that each propagate independently. The fraud model reads feature vectors from one store. The velocity service reads counters from another. The session service holds state in a third. Each is consistent with itself, but none of them are consistent with each other at any given moment. Under concurrent load, a single decision often needs context from all of them — the fraud score, the velocity check, the session state, the exposure aggregate. Each component comes from a different pipeline at a different propagation stage. The composite picture the decision assembles from those reads is not a snapshot that ever existed in the real world. It’s a chimera: parts from different moments, glued together by the fact that they happened to be read in the same request. This is the retrieval gap under concurrency. The individual systems may each be performing correctly. The decision that depends on their composition is still wrong because it was built from inconsistent parts.
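The chimera is easy to see once each store carries an "as of" timestamp. In this sketch (store names and timings hypothetical), each store's derived state last reflected reality at a different moment, and the composite the decision assembles spans all of them:

```python
# Three independent stores, each at a different propagation stage.
# as_of_ms: the moment each store's derived state last reflected reality.
stores = {
    "fraud_score":   {"value": 0.12, "as_of_ms": 1_000},   # ~2s behind
    "velocity":      {"value": 4,    "as_of_ms": 2_600},   # ~400ms behind
    "session_state": {"value": "ok", "as_of_ms": 2_950},   # ~50ms behind
}

def assemble_context(stores):
    """Composite the decision reads: values from different moments,
    glued together by the accident of a shared request."""
    context = {name: s["value"] for name, s in stores.items()}
    as_ofs = [s["as_of_ms"] for s in stores.values()]
    skew_ms = max(as_ofs) - min(as_ofs)   # temporal spread of the composite
    return context, skew_ms

context, skew_ms = assemble_context(stores)
# skew_ms == 1950: the composite spans nearly two seconds of divergent
# history -- a snapshot that never existed in the real world
```

Each store is internally consistent; the `skew_ms` is a property of the composition, which is why no amount of per-store tuning removes it.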

Concrete Failure Modes in Production

The abstract framing matters less than the specific ways this fails in production. A few canonical examples:

Double approval against a shared limit. Two authorizations for the same card read the same exposure counter inside its propagation delay, each stays under the limit individually, and together they exceed it — the scenario walked through above.

A velocity counter behind the traffic that defines it. During a fraud burst, the counter's update rate falls below the concurrent decision rate, so every decision in the window scores against a count that predates the very burst it is supposed to detect.

An agent acting on a composite that never existed. An agent workflow issues parallel tool calls against several stores; the reads land at different propagation stages, and the action commits against a picture that has drifted outside the window the action had to respect.

Why “Tune the Cache” Doesn’t Fix It

The intuitive response to cache collapse under load is to tune the cache. Shorter TTLs. More replicas. Pre-warming. Smarter invalidation. These moves help in specific cases, but none of them change the structural problem.

Shorter TTLs make the preparation gap visible more often, not smaller. A TTL forces the cache to fetch from the source or recompute on expiry. If the preparation pipeline takes 400ms to recompute the aggregate, a 100ms TTL doesn’t produce fresher data — it just produces more cache misses, each of which still waits on the same upstream pipeline.
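A quick simulation (timings illustrative, matching the 400ms/100ms numbers above) makes the point: shrinking the TTL multiplies misses, and every miss still pays the same upstream recompute.

```python
def simulate_ttl(ttl_ms, recompute_ms, read_interval_ms, duration_ms):
    """Cache-aside with a TTL: expiry forces a recompute that still takes
    recompute_ms upstream. Returns the number of misses over the run --
    shorter TTLs raise the miss count, but no reader ever gets data
    fresher than the recompute path allows."""
    expires_at = -1
    misses = 0
    for t in range(0, duration_ms, read_interval_ms):
        if t >= expires_at:          # expired: miss, wait on the pipeline
            misses += 1
            expires_at = t + recompute_ms + ttl_ms
    return misses

short_ttl = simulate_ttl(ttl_ms=100, recompute_ms=400,
                         read_interval_ms=50, duration_ms=5_000)
long_ttl = simulate_ttl(ttl_ms=1_000, recompute_ms=400,
                        read_interval_ms=50, duration_ms=5_000)
# short_ttl > long_ttl: more misses over the same window, each still
# paying the same 400ms upstream wait
```

The 100ms TTL produces several times the misses of the 1s TTL over the same window, with `recompute_ms` — the preparation pipeline's delay — unchanged in both runs.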

More replicas scale reads, not writes. If the problem is that the derived state can’t be updated fast enough to keep up with events, adding read replicas doesn’t help. It makes stale state available in more places. The underlying pipeline still has the same update rate.

Pre-warming is a one-time win. You can load the cache with the current state at startup. Within seconds of traffic arriving, concurrent events update the underlying state and the pre-warmed data is stale. Pre-warming solves cold-start latency, not under-load freshness.

Smarter invalidation requires knowing what to invalidate. Pattern-based invalidation (invalidate this key when that table changes) requires a data model that maps every event to every cached derivation. That mapping is rarely maintained precisely in production, and when it isn’t, the cache quietly serves stale entries that nobody invalidated.

Finally, multi-cache architectures are immune to per-cache tuning. You can tune any single cache as tightly as you like. If the decision depends on context from three caches maintained by three pipelines, tuning one of them doesn’t make the cross-cache composite more coherent. The retrieval gap is an architectural property of having multiple backing stores, not a configuration property of any one of them.

The Validity Window Frame

The concept that makes this tractable is the validity window — the duration within which the decision must commit, and within which the context it reads has to remain correct for the decision to be correct.

Validity windows vary by use case. A card authorization has a window around 100–400ms. A fraud score for a live transaction is in the same range. An agent tool call inside a live session might have a window of 200ms–2s. A batch risk re-scoring run might have a window measured in seconds or minutes.

A decision fails when the preparation gap or the retrieval gap exceeds the validity window. You can describe every concrete failure mode above in exactly those terms: the exposure counter’s propagation delay exceeded the authorization’s validity window; the velocity counter’s update rate fell below the concurrent decision rate inside the window; the composite of five agent tool reads drifted outside the window the action had to respect.

Framing the problem as “context freshness inside the validity window” rather than “cache latency” makes two things clear. First, it’s a per-use-case spec — different decisions have different windows, so the same cache behavior might be fine for one and broken for another. Second, it’s the thing you can actually commit to as an SLA: not “we read the cache in X milliseconds” but “state committed to the system of record is reflected in the serving layer within Y milliseconds, under concurrency rate Z.”
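The per-use-case nature of the spec is worth writing down. A minimal sketch (window values taken from the ranges above; the function name and table are hypothetical) of the check a validity-window SLA implies:

```python
# Hypothetical per-use-case validity windows (ms), from the ranges above.
VALIDITY_WINDOW_MS = {
    "card_auth": 300,
    "fraud_score": 300,
    "agent_tool_call": 1_000,
    "batch_rescoring": 60_000,
}

def decision_is_coherent(use_case, preparation_gap_ms, retrieval_gap_ms):
    """A decision fails when either gap exceeds its validity window."""
    window = VALIDITY_WINDOW_MS[use_case]
    return max(preparation_gap_ms, retrieval_gap_ms) <= window

# The same 900ms propagation delay passes one spec and fails another:
ok_for_agent = decision_is_coherent("agent_tool_call", 900, 200)  # fits
ok_for_auth = decision_is_coherent("card_auth", 900, 200)         # exceeds
```

The same serving-layer behavior — a 900ms gap — is acceptable for one decision and broken for another, which is exactly why "cache latency" alone is the wrong axis.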

This is what data freshness actually means. Freshness is not a property of the data. It’s a property of the relationship between the data and the decisions that depend on it. It’s also why a dedicated live-context layer — one that holds derived state continuously current under load — is a different architectural category from a cache.

What Actually Works: A Single Context Layer, Coherent Under Load

If tuning caches doesn’t fix context under concurrency, what does? The answer looks less like “a better cache” and more like “one serving layer that replaces several caches, maintained continuously, and read under a single coherent snapshot.”

The specific properties this layer needs:

One read path, not several. If a decision needs a velocity counter, an exposure aggregate, a fraud score, and a session state, it reads all four from the same store under one logical snapshot. There is no retrieval gap because there are not multiple backing stores to drift from each other. Every read of composite context is internally coherent because it comes from the same set of ingested events.

Incremental maintenance, not scheduled refresh. Derived state updates as events arrive, not on a cadence. When a transaction commits upstream, the aggregates and counters that depend on it update incrementally, in sub-second time, without waiting for a refresh window. This collapses the preparation gap to the propagation latency of the change, which is bounded and much smaller than a scheduled refresh interval.

On-demand computation against consistent data. For derivations that can’t be pre-maintained — an arbitrary window aggregation, a vector similarity, an LLM-derived signal — the query runs against state that is itself coherent at query time. The on-demand computation doesn’t add a new divergent snapshot; it runs against the same snapshot the rest of the decision is reading.

Read behavior that matches the decision’s concurrency model. Concurrent decisions that read the same state read it under the same logical snapshot. The system surfaces conflicts that matter (two authorizations for the same card) rather than hiding them (two decisions each reading a stale counter and approving independently).
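One way to surface the conflict rather than hide it is an optimistic version check at commit — a compare-and-set against the version the decision read. This sketch (a simplification of snapshot semantics; names hypothetical) replays the double-approval scenario against a single versioned store:

```python
class VersionedStore:
    """One store, one logical snapshot per read, and an optimistic
    conflict check at commit: a compare-and-set on the version the
    decision read."""
    def __init__(self, exposure):
        self.exposure = exposure
        self.version = 0

    def read_snapshot(self):
        return {"exposure": self.exposure, "version": self.version}

    def commit(self, snapshot, amount, limit):
        if snapshot["version"] != self.version:
            return "conflict"        # someone committed since we read
        if snapshot["exposure"] + amount > limit:
            return "declined"
        self.exposure += amount
        self.version += 1
        return "approved"

store = VersionedStore(exposure=900)
s1 = store.read_snapshot()
s2 = store.read_snapshot()           # two concurrent decisions, same snapshot
r1 = store.commit(s1, 80, limit=1000)  # first commit lands
r2 = store.commit(s2, 90, limit=1000)  # surfaced as a conflict, not
                                       # silently approved on stale state
```

The second decision is not wrongly approved on a stale counter; it gets an explicit conflict it can retry against fresh state — the "surfaces conflicts that matter" behavior described above.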

This is what we mean by a Context Lake: a single serving layer that replaces the stack of caches, feature stores, materialized views, and vector indexes that production decision systems have accumulated, and that keeps derived context continuously current and internally coherent under load.

The difference in production is not marginal. A system that can tell you “the exposure counter reflects every transaction committed up to the last 50ms, and you are reading it under the same snapshot as the velocity counter and the fraud score” is making a qualitatively different statement from “each of our caches returns in under 10ms.” The first is a statement about decisions. The second is a statement about reads.


Written by Xiaowei Jiang

Former Meta and Microsoft. Built distributed query engines at petabyte scale. Author of the Composition Impossibility Theorem (arXiv:2601.17019).

Ready to see Tacnode Context Lake in action?

Book a demo and discover how Tacnode can power your AI-native applications.
