Tacnode
Real-Time Architecture

Context Under Concurrency: Why Your Cache Collapses Under Load

Context under concurrency is the production failure mode where cached derived state goes stale faster than the system can refresh it, and parallel decisions commit against divergent snapshots. This post covers why high-velocity state plus concurrent decisions break the caching pattern, how the preparation gap and the retrieval gap compound under load, and what a serving layer has to do differently to keep decisions coherent when every millisecond of staleness has a business consequence.

Xiaowei Jiang
CEO & Chief Architect
12 min read
Diagram showing how concurrent decisions read divergent cached context and commit conflicting outcomes before the underlying state catches up

TL;DR: Context under concurrency is the failure mode where cached derived state goes stale faster than the system can refresh it, and concurrent decisions commit against divergent snapshots. Two gaps compound under load: the preparation gap (pre-computed state lags the events that should have updated it) and the retrieval gap (different services read different caches, each at a different propagation stage). Benchmarks don’t reproduce this because they don’t reproduce concurrency. The fix isn’t a bigger cache — it’s a single serving layer that keeps context internally coherent across parallel decisions inside the validity window of the decision itself.

Every team that hits context under concurrency starts in the same place. The cache is fast on the benchmark. The p99 read latency on Redis is under a millisecond. The feature store returns a vector in three. The materialized view refreshes on a schedule that looked reasonable in the design doc. And then real traffic shows up, concurrent decisions start stacking, and the numbers that mattered in design stop describing what production actually does.

This is not a tuning problem. You can make the cache bigger, add more replicas, partition the keyspace, pre-warm the hot path. Those moves help at the margin. What they do not do is change the fact that the cache has a propagation delay, the decision has a validity window, and under concurrent load the delay is wider than the window. Concurrent decisions read the same stale snapshot, commit against it in parallel, and only discover the conflict when it lands in the ledger hours later.

What Context Under Concurrency Actually Means

Context under concurrency describes the production condition where an automated decision must read derived state — aggregates, counters, balances, session data — that is being updated by parallel events at the same moment the decision is committing. Multiple concurrent decisions read the same pre-update snapshot, each commits against context that was correct a moment ago, and the composite result is wrong by the time the commits land.

The problem is not that one decision reads slightly old data. Single-decision staleness is an old problem with reasonable answers — TTLs, cache invalidation, pull-through reads. The problem is that many decisions read the same old data at the same time, each commits independently, and the side effects collide. When state velocity is high — many events per second modifying the same shared counter, balance, or aggregate — the window in which the cache is “current enough” shrinks below the window in which concurrent decisions have to commit. Below that crossover, the system stops being able to make consistent decisions, regardless of how the cache is configured.

This is structurally different from what a benchmark measures. A benchmark runs one request, measures latency, repeats. Every run starts from a clean cache state, reads a fresh value, returns. There is no other concurrent writer racing against the read. There is no second decision that depends on the side effect of the first. The numbers look fine because the test doesn’t create the condition the failure requires.

Latency targets dominate the conversation about real-time systems because they’re easy to measure and optimize. But latency tells you how long one request takes. It does not tell you what happens when two requests that depend on the same state arrive at the same time. Consider a card authorization path: the decision is “does this transaction stay within the cardholder’s daily exposure limit?” In a benchmark, the read is fast and the commit path is clean. In production, two transactions for the same card arrive within the same hundred-millisecond window from two different merchants. Both authorization requests read the cache. Both see the same exposure total. Both decide the transaction is within the limit. Both approve. The cardholder now has two transactions approved that together exceed the limit. No individual step failed — the cache returned fresh data within its TTL, the comparison was correct, latency was inside budget — yet the system approved something it should have blocked. Two concurrent decisions read the same pre-update snapshot and neither knew about the other.
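The race above can be sketched in a few lines. This is a minimal toy (all names hypothetical, the limit and amounts illustrative): two threads are forced into the same read window, both see the same stale exposure snapshot, and both commit.

```python
import threading

LIMIT = 1000

class StaleCache:
    """Toy cache: the derived exposure total has not yet been refreshed."""
    def __init__(self, exposure):
        self.exposure = exposure

cache = StaleCache(exposure=900)   # $900 already spent against a $1000 limit
ledger = []                        # system of record, written after approval
barrier = threading.Barrier(2)     # force both requests into the same window

def authorize(amount):
    barrier.wait()                 # both requests read at the same moment
    snapshot = cache.exposure      # each sees 900 -- neither sees the other
    if snapshot + amount <= LIMIT:
        ledger.append(amount)      # commit: side effects land independently

t1 = threading.Thread(target=authorize, args=(80,))
t2 = threading.Thread(target=authorize, args=(90,))
t1.start(); t2.start(); t1.join(); t2.join()

total = cache.exposure + sum(ledger)  # 900 + 80 + 90 = 1070 > 1000
```

Every individual step here behaves correctly, exactly as in the authorization example: the check `snapshot + amount <= LIMIT` passes for both threads, yet the composite result exceeds the limit.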

This failure mode requires concurrent events against shared state, and it shows up exactly when traffic spikes — when a fraud burst arrives, when a flash sale fires, when a live event drives session volume. The benchmarks pass. The production system falls over.

How Caches Collapse Under Concurrent Load

The reason caches fail under concurrency isn’t that they’re slow. It’s that they’re downstream of a pipeline with a propagation delay, and under concurrent load the pipeline can’t keep up with the rate of state change.

A typical stack for derived context looks like this: an event hits the system of record (a transaction commits, a position updates, a user action lands); a change stream — CDC, Kafka, a queue — carries the event out of the SoR; a processor consumes the stream, recomputes derived state (a velocity counter, an exposure aggregate, a freshness score); the processor writes the updated state to a serving cache — Redis, an in-memory feature store, a materialized view; decision services read from the cache.

Each step adds propagation delay. Under normal load that delay is small — hundreds of milliseconds, maybe a few seconds for heavier computations. Under concurrent load, two things happen simultaneously. First, the rate of incoming events increases, which widens the stream’s lag. Second, the processor that maintains the derived state now has more work and falls further behind. By the time the recomputed state lands in the cache, more events have arrived, and the cache is already wrong again.
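A back-of-the-envelope simulation (rates purely illustrative) shows why the pipeline falls further behind once events arrive faster than the processor can fold them into derived state:

```python
def cache_lag(arrival_rate, processing_rate, seconds):
    """Events arriving per second vs. events the pipeline can fold into
    derived state per second. Returns the backlog of unprocessed events
    at the end of each simulated second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrival_rate
        backlog = max(0, backlog - processing_rate)
        history.append(backlog)
    return history

quiet = cache_lag(arrival_rate=50, processing_rate=100, seconds=5)
burst = cache_lag(arrival_rate=300, processing_rate=100, seconds=5)
# quiet -> backlog stays at zero; burst -> backlog grows by 200 events
# every second, so the cache is further behind at the end of each cycle
```

Under the quiet rate the backlog never accumulates; under the burst rate the gap between the cache and reality grows monotonically, which is exactly the "already wrong again" cycle described above.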

This is compounded by the fact that most production stacks aren’t one cache — they’re several. A fraud service runs its own feature cache at one propagation stage. An authorization service reads an exposure aggregate maintained by a different pipeline. An agent tool calls a vector store backed by a third embedding pipeline. Each is at a different point in its own catch-up cycle. Under quiet traffic, the drift between them is small enough that nobody notices. Under concurrent load, they diverge — and decisions start reading different versions of reality at the same moment. This is what we mean by the context gap: the derived state the decision needs is not yet available at the moment the decision commits, because the pipeline that was supposed to prepare it is still processing events from the last batch.

Context failures under concurrency come in two shapes, and production stacks often have both at once. Naming them separately matters because they require different fixes.

The preparation gap is the delay between an event occurring and the derived state that reflects it being available to read. A transaction commits at T=0. The velocity counter that should reflect it becomes queryable at T=800ms. In that 800ms window, any authorization decision that reads the counter sees a value that does not include the transaction that just happened. Pre-computation was supposed to solve this — instead of recomputing an aggregate on every read, compute it once when the underlying event lands, store it, serve it cheaply. This works well for aggregates that change slowly — daily rollups, weekly cohorts, monthly summaries. It breaks down when the aggregate is changing at the same rate as the decisions that read it. The preparation gap gets worse under concurrency because the pipeline that maintains the derived state has a finite processing rate. When events arrive faster than the pipeline can update the materialized state, the gap widens. On a quiet Tuesday afternoon it’s small enough to hide inside the decision’s latency budget. During a fraud burst, a checkout surge, or an agent workflow with parallel tool calls, it exceeds the budget and the decision reads state that is minutes behind, not milliseconds. See data freshness vs latency for the full breakdown of why these are different dimensions of the same problem.
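The T=0 / T=800ms timeline above can be made concrete. A small sketch (event times and delay are the illustrative numbers from the paragraph): an event committed at time `t` only becomes visible in the derived counter at `t + prep_delay_ms`, so any read inside that window undercounts.

```python
def counter_as_of(read_ms, event_times_ms, prep_delay_ms=800):
    """Value of a derived counter at read time: an event committed at t
    is only reflected once the pipeline has materialized it, at
    t + prep_delay_ms."""
    return sum(1 for t in event_times_ms if t + prep_delay_ms <= read_ms)

events = [0, 100, 250]   # three transactions commit in the first 250ms

visible_at_400 = counter_as_of(400, events)   # none materialized yet
visible_at_850 = counter_as_of(850, events)   # only the T=0 event so far
truth_at_400 = sum(1 for t in events if t <= 400)  # what the SoR knows: all 3
```

The gap between `truth_at_400` and `visible_at_400` is the preparation gap for that read; under concurrency, `prep_delay_ms` itself grows as the pipeline falls behind.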

The retrieval gap is what happens when context is split across multiple systems that each propagate independently. The fraud model reads feature vectors from one store. The velocity service reads counters from another. The session service holds state in a third. Each is consistent with itself, but none of them are consistent with each other at any given moment. Under concurrent load, a single decision often needs context from all of them — the fraud score, the velocity check, the session state, the exposure aggregate. Each component comes from a different pipeline at a different propagation stage. The composite picture the decision assembles from those reads is not a snapshot that ever existed in the real world. It’s a chimera: parts from different moments, glued together by the fact that they happened to be read in the same request. This is the retrieval gap under concurrency. The individual systems may each be performing correctly. The decision that depends on their composition is still wrong because it was built from inconsistent parts.
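The chimera is easy to see once each store carries an "as of" timestamp. In this sketch (store names and timings hypothetical), each store's derived state last reflected reality at a different moment, and the composite the decision assembles spans all of them:

```python
# Three independent stores, each at a different propagation stage.
# as_of_ms: the moment each store's derived state last reflected reality.
stores = {
    "fraud_score":   {"value": 0.12, "as_of_ms": 1_000},   # ~2s behind
    "velocity":      {"value": 4,    "as_of_ms": 2_600},   # ~400ms behind
    "session_state": {"value": "ok", "as_of_ms": 2_950},   # ~50ms behind
}

def assemble_context(stores):
    """Composite the decision reads: values from different moments,
    glued together by the accident of a shared request."""
    context = {name: s["value"] for name, s in stores.items()}
    as_ofs = [s["as_of_ms"] for s in stores.values()]
    skew_ms = max(as_ofs) - min(as_ofs)   # temporal spread of the composite
    return context, skew_ms

context, skew_ms = assemble_context(stores)
# skew_ms == 1950: the composite spans nearly two seconds of divergent
# history -- a snapshot that never existed in the real world
```

Each store is internally consistent; the `skew_ms` is a property of the composition, which is why no amount of per-store tuning removes it.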

Concrete Failure Modes in Production

The abstract framing matters less than the specific ways this fails in production. A few canonical examples:

Double approval against a shared limit. Two authorizations for the same card read the same exposure counter inside its propagation delay, each stays under the limit individually, and together they exceed it — the scenario walked through above.

A velocity counter behind the traffic that defines it. During a fraud burst, the counter's update rate falls below the concurrent decision rate, so every decision in the window scores against a count that predates the very burst it is supposed to detect.

An agent acting on a composite that never existed. An agent workflow issues parallel tool calls against several stores; the reads land at different propagation stages, and the action commits against a picture that has drifted outside the window the action had to respect.

Why “Tune the Cache” Doesn’t Fix It

The intuitive response to cache collapse under load is to tune the cache. Shorter TTLs. More replicas. Pre-warming. Smarter invalidation. These moves help in specific cases, but none of them change the structural problem.

Shorter TTLs make the preparation gap visible more often, not smaller. A TTL forces the cache to fetch from the source or recompute on expiry. If the preparation pipeline takes 400ms to recompute the aggregate, a 100ms TTL doesn’t produce fresher data — it just produces more cache misses, each of which still waits on the same upstream pipeline.
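A quick simulation (timings illustrative, matching the 400ms/100ms numbers above) makes the point: shrinking the TTL multiplies misses, and every miss still pays the same upstream recompute.

```python
def simulate_ttl(ttl_ms, recompute_ms, read_interval_ms, duration_ms):
    """Cache-aside with a TTL: expiry forces a recompute that still takes
    recompute_ms upstream. Returns the number of misses over the run --
    shorter TTLs raise the miss count, but no reader ever gets data
    fresher than the recompute path allows."""
    expires_at = -1
    misses = 0
    for t in range(0, duration_ms, read_interval_ms):
        if t >= expires_at:          # expired: miss, wait on the pipeline
            misses += 1
            expires_at = t + recompute_ms + ttl_ms
    return misses

short_ttl = simulate_ttl(ttl_ms=100, recompute_ms=400,
                         read_interval_ms=50, duration_ms=5_000)
long_ttl = simulate_ttl(ttl_ms=1_000, recompute_ms=400,
                        read_interval_ms=50, duration_ms=5_000)
# short_ttl > long_ttl: more misses over the same window, each still
# paying the same 400ms upstream wait
```

The 100ms TTL produces several times the misses of the 1s TTL over the same window, with `recompute_ms` — the preparation pipeline's delay — unchanged in both runs.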

More replicas scale reads, not writes. If the problem is that the derived state can’t be updated fast enough to keep up with events, adding read replicas doesn’t help. It makes stale state available in more places. The underlying pipeline still has the same update rate.

Pre-warming is a one-time win. You can load the cache with the current state at startup. Within seconds of traffic arriving, concurrent events update the underlying state and the pre-warmed data is stale. Pre-warming solves cold-start latency, not under-load freshness.

Smarter invalidation requires knowing what to invalidate. Pattern-based invalidation (invalidate this key when that table changes) requires a data model that maps every event to every cached derivation. That mapping is rarely maintained precisely in production, and when it isn’t, the cache quietly serves stale entries that nobody invalidated.

Finally, multi-cache architectures are immune to per-cache tuning. You can tune any single cache as tightly as you like. If the decision depends on context from three caches maintained by three pipelines, tuning one of them doesn’t make the cross-cache composite more coherent. The retrieval gap is an architectural property of having multiple backing stores, not a configuration property of any one of them.

The Validity Window Frame

The concept that makes this tractable is the validity window — the duration within which the decision must commit, and within which the context it reads has to remain correct for the decision to be correct.

Validity windows vary by use case. A card authorization has a window around 100–400ms. A fraud score for a live transaction is in the same range. An agent tool call inside a live session might have a window of 200ms–2s. A batch risk re-scoring run might have a window measured in seconds or minutes.

A decision fails when the preparation gap or the retrieval gap exceeds the validity window. You can describe every concrete failure mode above in exactly those terms: the exposure counter’s propagation delay exceeded the authorization’s validity window; the velocity counter’s update rate fell below the concurrent decision rate inside the window; the composite of five agent tool reads drifted outside the window the action had to respect.

Framing the problem as “context freshness inside the validity window” rather than “cache latency” makes two things clear. First, it’s a per-use-case spec — different decisions have different windows, so the same cache behavior might be fine for one and broken for another. Second, it’s the thing you can actually commit to as an SLA: not “we read the cache in X milliseconds” but “state committed to the system of record is reflected in the serving layer within Y milliseconds, under concurrency rate Z.”
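The per-use-case nature of the spec is worth writing down. A minimal sketch (window values taken from the ranges above; the function name and table are hypothetical) of the check a validity-window SLA implies:

```python
# Hypothetical per-use-case validity windows (ms), from the ranges above.
VALIDITY_WINDOW_MS = {
    "card_auth": 300,
    "fraud_score": 300,
    "agent_tool_call": 1_000,
    "batch_rescoring": 60_000,
}

def decision_is_coherent(use_case, preparation_gap_ms, retrieval_gap_ms):
    """A decision fails when either gap exceeds its validity window."""
    window = VALIDITY_WINDOW_MS[use_case]
    return max(preparation_gap_ms, retrieval_gap_ms) <= window

# The same 900ms propagation delay passes one spec and fails another:
ok_for_agent = decision_is_coherent("agent_tool_call", 900, 200)  # fits
ok_for_auth = decision_is_coherent("card_auth", 900, 200)         # exceeds
```

The same serving-layer behavior — a 900ms gap — is acceptable for one decision and broken for another, which is exactly why "cache latency" alone is the wrong axis.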

This is what data freshness actually means. Freshness is not a property of the data. It’s a property of the relationship between the data and the decisions that depend on it. It’s also why a dedicated live-context layer — one that holds derived state continuously current under load — is a different architectural category from a cache.

What Actually Works: A Single Context Layer, Coherent Under Load

If tuning caches doesn’t fix context under concurrency, what does? The answer looks less like “a better cache” and more like “one serving layer that replaces several caches, maintained continuously, and read under a single coherent snapshot.”

The specific properties this layer needs:

One read path, not several. If a decision needs a velocity counter, an exposure aggregate, a fraud score, and a session state, it reads all four from the same store under one logical snapshot. There is no retrieval gap because there are not multiple backing stores to drift from each other. Every read of composite context is internally coherent because it comes from the same set of ingested events.

Incremental maintenance, not scheduled refresh. Derived state updates as events arrive, not on a cadence. When a transaction commits upstream, the aggregates and counters that depend on it update incrementally, in sub-second time, without waiting for a refresh window. This collapses the preparation gap to the propagation latency of the change, which is bounded and much smaller than a scheduled refresh interval.

On-demand computation against consistent data. For derivations that can’t be pre-maintained — an arbitrary window aggregation, a vector similarity, an LLM-derived signal — the query runs against state that is itself coherent at query time. The on-demand computation doesn’t add a new divergent snapshot; it runs against the same snapshot the rest of the decision is reading.

Read behavior that matches the decision’s concurrency model. Concurrent decisions that read the same state read it under the same logical snapshot. The system surfaces conflicts that matter (two authorizations for the same card) rather than hiding them (two decisions each reading a stale counter and approving independently).
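One way to surface the conflict rather than hide it is an optimistic version check at commit — a compare-and-set against the version the decision read. This sketch (a simplification of snapshot semantics; names hypothetical) replays the double-approval scenario against a single versioned store:

```python
class VersionedStore:
    """One store, one logical snapshot per read, and an optimistic
    conflict check at commit: a compare-and-set on the version the
    decision read."""
    def __init__(self, exposure):
        self.exposure = exposure
        self.version = 0

    def read_snapshot(self):
        return {"exposure": self.exposure, "version": self.version}

    def commit(self, snapshot, amount, limit):
        if snapshot["version"] != self.version:
            return "conflict"        # someone committed since we read
        if snapshot["exposure"] + amount > limit:
            return "declined"
        self.exposure += amount
        self.version += 1
        return "approved"

store = VersionedStore(exposure=900)
s1 = store.read_snapshot()
s2 = store.read_snapshot()           # two concurrent decisions, same snapshot
r1 = store.commit(s1, 80, limit=1000)  # first commit lands
r2 = store.commit(s2, 90, limit=1000)  # surfaced as a conflict, not
                                       # silently approved on stale state
```

The second decision is not wrongly approved on a stale counter; it gets an explicit conflict it can retry against fresh state — the "surfaces conflicts that matter" behavior described above.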

This is what we mean by a Context Lake: a single serving layer that replaces the stack of caches, feature stores, materialized views, and vector indexes that production decision systems have accumulated, and that keeps derived context continuously current and internally coherent under load.

The difference in production is not marginal. A system that can tell you “the exposure counter reflects every transaction committed up to the last 50ms, and you are reading it under the same snapshot as the velocity counter and the fraud score” is making a qualitatively different statement from “each of our caches returns in under 10ms.” The first is a statement about decisions. The second is a statement about reads.


Written by Xiaowei Jiang

Former Meta and Microsoft. Built distributed query engines at petabyte scale. Author of the Composition Impossibility Theorem (arXiv:2601.17019).

Ready to see Tacnode Context Lake in action?

Book a demo and discover how Tacnode can power your AI-native applications.
