
Stateful Stream Processing for Decisions: Where Flink Stops Being Enough

Flink gives you stateful stream processing. It does not give you a decision-coherent serving layer. The gap is what teams discover when they put Redis or Postgres in front of Flink to serve decisions — and hit the same split-state problem Flink was supposed to have solved.

Xiaowei Jiang
CEO & Chief Architect
15 min read
[Figure: Flink as a stream-computation tier contrasted with a decision-coherent serving layer, showing the split-state propagation gap between Flink operator state and a downstream Redis cache that appears at the seam under concurrent load]

TL;DR: Flink’s state — RocksDB-backed operator state, checkpointed for fault tolerance, with exactly-once semantics inside the job — is excellent for streaming analytics: high throughput, large windows, replay safety. Decision-time workloads need something different: point-in-time reads across entities under concurrency, native K/V serving at decision-rate, mutation support, and cross-entity consistency between structured, derived, and semantic context. Flink provides none of these natively. The popular answer — put Redis, Postgres, or a feature store in front of Flink for serving — reintroduces split state between Flink’s operator state and the external store, and the retrieval gap appears at the interface. The fix is not to tune the serving store; it is to replace it with an incrementally-maintained, snapshot-coherent layer — what Tacnode calls a Context Lake — that absorbs what Flink’s state was already doing and what the external serving store was supposed to provide.

This post walks through what Flink’s state is actually good at, what decision-time workloads require that Flink doesn’t provide, and why the popular Flink + Redis / Flink + Postgres pattern reintroduces the context gap that stateful stream processing was supposed to close. It then describes what a decision-coherent architecture does differently — not replacing Flink, but moving the serving layer somewhere Flink’s state model was never designed to live.

What Stateful Stream Processing Actually Is

Stateful stream processing is the pattern where the stream processor maintains durable, queryable state as events arrive — windowed aggregates, session accumulators, running counts, enrichment lookups, watermarked timers — so that downstream logic can react to the stream with more context than the current event alone. Flink, Kafka Streams, Spark Structured Streaming, and Materialize all implement some version of this. The state lives alongside the processor, is recoverable on failure, and advances as the stream advances.

Before stateful stream processing, real-time pipelines were largely stateless. Events flowed through stateless transformations, and aggregations lived in a downstream OLAP system (Druid, Pinot, ClickHouse) that ingested the stream and answered queries against its own materialized state. That split was acceptable for monitoring and dashboards. It broke for any workload that needed to react to the stream based on the state the stream had produced so far.

Flink changed the shape of what was buildable. Keyed state scoped per entity, operator state scoped per operator, value state and list state and map state, broadcast state for configuration joins, windowing semantics that respect event time and handle late arrivals, savepoints and checkpoints for operational control — all of it opened real-time use cases that previously required a separate online database doing duplicate work. The DataStream API, Table API, and Flink SQL gave teams three entry points into the same state model.
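
To make that concrete, here is roughly what one of those windowed aggregates looks like through the Flink SQL entry point: a five-minute tumbling count and sum per card. The schema is hypothetical (a transactions table with card_id, amount, and an event-time column txn_time that carries a watermark); the state behind the query is the keyed, windowed operator state described above.

```sql
-- Hypothetical source: transactions(card_id, amount, txn_time),
-- with txn_time declared as the event-time attribute (watermarked).
SELECT
  card_id,
  window_start,
  window_end,
  COUNT(*)    AS txn_count_5m,
  SUM(amount) AS txn_amount_5m
FROM TABLE(
  TUMBLE(TABLE transactions, DESCRIPTOR(txn_time), INTERVAL '5' MINUTE))
GROUP BY card_id, window_start, window_end;
```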

The question this post is about is not whether stateful stream processing is good. It is whether stateful stream processing inside Flink is the serving layer a decision-time workload can read from directly — and the answer, structurally, is no.

Stateful vs Stateless Stream Processing: Key Concepts

Stream processing engines split into two modes: stateful and stateless. Stateless stream processing transforms each individual event without reference to past events — filter, map, project, enrich from a static lookup. Each event passes through independently. Stateless processing is easy to parallelize, easy to scale, and contributes nothing to decisions that depend on what has happened across multiple events. For continuous monitoring of streaming data, stateless enrichment is enough. For fraud detection, anomaly detection, or real time decision making, it is not.

Stateful processing maintains state across events. A stateful operator holds a running count, a windowed aggregate, a session accumulator, or an input-stream join state, and advances that state as new data arrives. Stateful operators are the foundation of windowed aggregation, event correlation, and the kind of derived context that decisions need. The cost is operational: stateful streaming systems have to handle state storage, fault tolerance, checkpointing, and recovery — complexity that stateless processing avoids.
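
A minimal sketch of the difference, using the same hypothetical transactions schema: the first query is stateless, evaluating each event on its own, while the second forces the engine to keep one accumulator per card_id and update it on every new event.

```sql
-- Stateless: filter and project each event independently; no state
-- survives between events.
SELECT card_id, amount
FROM transactions
WHERE amount > 1000;

-- Stateful: an unbounded running count and sum per card; the engine
-- maintains one accumulator per card_id and emits updates as events arrive.
SELECT card_id,
       COUNT(*)    AS txn_count,
       SUM(amount) AS txn_amount
FROM transactions
GROUP BY card_id;
```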

Within stateful streaming, there is also a distinction between short-window state (recent events in a sliding window or tumbling window) and long-lived session state that persists across many events, potentially until the session ends. Modern streaming systems handle both. The longer the state horizon, the more the state store itself — RocksDB in Flink, RocksDB in Kafka Streams — becomes the operational center of the system. Unbounded data streams produce unbounded state if the processing node keeps every data point; window semantics — a defined start and a defined end — are how streaming systems bound that state at all.
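
As a sketch of how window bounds bound the state, here is a sliding (hopping) window over the same hypothetical schema: ten-minute windows that advance every minute. Once the watermark passes a window's end, the engine can expire that window's accumulator, which an unbounded running aggregate never can.

```sql
-- Sliding window: 10-minute windows that advance every minute. State per
-- card is bounded by the window size and expires as the watermark advances.
SELECT card_id,
       window_start,
       window_end,
       SUM(amount) AS txn_amount_10m
FROM TABLE(
  HOP(TABLE transactions, DESCRIPTOR(txn_time),
      INTERVAL '1' MINUTE, INTERVAL '10' MINUTE))
GROUP BY card_id, window_start, window_end;
```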

Incremental Materialized Views: The Alternative

The alternative to Flink + serving layer is not “don’t use Flink”; it is “move the serving layer to somewhere that does not have a seam with Flink’s state.” An incrementally-maintained materialized view (IMV) inside a single serving system closes the seam by keeping the derived aggregate and the serving path inside the same transactional boundary.

An IMV is a view definition — typically declared in SQL — that the system maintains incrementally as the underlying data changes. When a base event arrives, the view is updated atomically within the same storage engine that serves reads. There is no “emit to a downstream store” step because the aggregate is already in the serving layer. There is no “two systems to keep consistent” problem because there is only one system.
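
As a sketch (the exact DDL varies by system, and the table names transactions and accounts here are hypothetical), the shape is a view definition plus an ordinary read against the same engine:

```sql
-- Illustrative DDL: an incrementally-maintained aggregate per card,
-- declared in the same system that serves decision-time reads.
CREATE MATERIALIZED VIEW card_velocity AS
SELECT card_id,
       COUNT(*)      AS txn_count,
       SUM(amount)   AS txn_amount,
       MAX(txn_time) AS last_txn_time
FROM transactions
GROUP BY card_id;

-- A decision-time read: the view and the base tables are visible in the
-- same snapshot, so there is no cross-store skew to reason about.
SELECT v.txn_count, v.txn_amount, a.credit_limit, a.status
FROM card_velocity v
JOIN accounts a ON a.card_id = v.card_id
WHERE v.card_id = 'card_123';
```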

The capability matches a subset of Flink’s stateful stream processing — windowed aggregations, running counts, session state, rolling sums — and that subset is exactly what most decision-time workloads need. For those workloads, an IMV layer inside the serving system absorbs what Flink was doing and what the downstream serving store was doing, in one layer, with one coherent snapshot. The decision reads the aggregate directly from the same system that serves every other read it needs.

This does not replace Flink’s full capability surface. Flink still wins on long event-time windows with complex CEP patterns, on jobs where the processing is the product (stream-to-lake ETL, real-time data transformation pipelines that feed downstream analytics), and on workloads where the processor’s throughput or exactly-once semantics to a specific sink are the dominant requirement. What it replaces is the specific case where Flink is being asked to do serving as well as computation, via a downstream K/V store, and is hitting the retrieval gap as a result.

On-demand computation extends the IMV model. For aggregations that can’t be usefully pre-maintained — arbitrary windows, cross-entity joins, vector similarity over a live index — the query runs at decision time against state that is itself coherent. This closes the case Flink couldn’t close cleanly: the arbitrary-window aggregation that Flink could compute but not serve consistently to external readers.
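
A hedged sketch of what that looks like at read time, again with hypothetical names and Postgres-style interval syntax: an arbitrary lookback that no pre-maintained view anticipated, computed against the same snapshot that serves the point lookups.

```sql
-- Computed on demand at decision time: an arbitrary 37-minute lookback
-- that was never pre-materialized, evaluated against coherent state.
SELECT COUNT(*)    AS txn_count_37m,
       SUM(amount) AS txn_amount_37m
FROM transactions
WHERE card_id  = 'card_123'
  AND txn_time > now() - INTERVAL '37 minutes';
```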

Vertical Patterns

The Flink-plus-serving pattern shows up across verticals with different surface symptoms but the same underlying seam.

Fraud detection: Flink maintains velocity counters and feature aggregates; emits to a key-value serving store (Redis) or a feature store; the authorization service reads from the downstream store; under fraud-burst load, the decision reads a pre-burst snapshot and fraud slips through. This is the canonical fraud detection application of stateful stream processing — and the canonical place where the Flink-plus-Redis seam lets fraudulent activity slip past the rule it was supposed to trigger. Anomaly detection workloads running on the same Flink pipeline inherit the same gap. Real-time fraud detection architecture walks through this in detail.

Credit decisioning: Flink maintains exposure aggregates and velocity counters; emits to Redis and/or a feature store; the credit decision engine reads a composite that may reflect three different propagation stages. Real-time credit decisioning architecture covers the three-pipeline divergence pattern.

Real-time ML inference: Flink computes feature aggregates; the feature store promotes them to online serving; the model reads features; the decision commits. Every promotion step is a propagation stage, and both training-serving skew and stream-serving skew surface here. Real-time ML: feature freshness at decision time walks through the freshness budget decomposition that makes this the ML Platform team’s binding constraint.

Personalization enforcement (where the session itself is the validity window): Flink maintains session state and behavioral counters; emits to Redis for decision lookup; the enforcement decision reads a state that lags the current session. The gap is smaller than fraud or credit because session windows are shorter, but the structural seam is the same.

Across all of these, the common signature is: Flink does its computation correctly, the serving store serves what it has correctly, and the decision that composes reads across them is still wrong because what it reads is not what Flink’s state actually contains at decision time.

What to Measure

If the architecture is “Flink + serving store,” the telemetry that diagnoses the coherence gap is not what most teams are measuring.

Stream-to-serve propagation delay. The time between Flink processing an event and the effect of that event being readable by a decision in the serving store. Under quiet traffic, this is milliseconds. Under burst traffic, it can widen to seconds. This is the metric that should live inside the decision’s validity window.
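
One way to get this number, as a sketch: if the Flink job stamps each emitted record with its processing time and the serving table records when the write became readable (both column names below are assumptions, not a standard), the delay distribution is a single aggregate query.

```sql
-- Assumes the emitting job writes processed_at and the serving table
-- records readable_at when the row becomes visible; both are hypothetical.
SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY readable_at - processed_at)
         AS p99_stream_to_serve_delay
FROM serving_features
WHERE readable_at > now() - INTERVAL '1 hour';
```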

Cross-store read skew. When a decision reads from Redis AND Postgres AND a feature store, the maximum time skew between when each read reflected its source of truth. In a serving layer with an IMV model inside one system, this is zero by construction. In a composed stack, it is the right telemetry to surface — and it is usually the first place architecture-level failures become visible.

Unexpected reconciliation delta. The rate at which post-hoc reconciliation finds a decision that, against the state actually effective at decision time, should have committed differently. This is the business-outcome version of the coherence gap. In Flink + serving-store architectures, this number is often elevated enough to hide in overall loss metrics but significant enough that individual incidents drive policy tightening that over-corrects.

Where Stateful Stream Processing Goes From Here

Flink is not going away. Stream transformation, stream-to-lake ETL, complex event processing, and high-throughput event pipelines are workloads Flink remains well-suited for, and the foundation it provides — stateful stream processing as a first-class primitive — is one of the most important shifts in real-time infrastructure over the last decade.

What is changing is where the serving layer lives. In the canonical architecture, Flink plus an external serving store was the default. In the decision-coherent architecture, the derived aggregates that used to live in Flink’s operator state and get emitted to Redis now live as incrementally-maintained views inside a serving layer that also handles structured point lookups and semantic retrieval. Flink still computes; it computes what is on the path to the lake, or to the downstream stream, or to a sink that needs exactly-once delivery. Decision-time derived state lives where decisions read from.

The question is not “is Flink good” — it is. The question is “where should the state decisions read from live.” For decision-time workloads at production scale, the answer is: not in Flink, and not in a cache in front of Flink, but in a serving layer that absorbs both.

For the cross-domain framing, see the decision coherence pillar. For the specific instances of this pattern in production, see real-time fraud detection architecture and real-time credit decisioning architecture. For the streaming + OLAP variant of the same gap, see the decision-time system model.

Flink · Stream Processing · Stateful Processing · Real-Time Decisions · Context Lake

Written by Xiaowei Jiang

Former Meta and Microsoft. Built distributed query engines at petabyte scale. Author of the Composition Impossibility Theorem (arXiv:2601.17019).

Ready to see Tacnode Context Lake in action?

Book a demo and discover how Tacnode can power your AI-native applications.
