Data Engineering

Streaming Database: What It Is, How It Works, and When You Need One

A streaming database replaces the batch query model with continuous computation — materialized views maintained incrementally as data arrives. Here's how they work, when they help, and where they run out of runway.

Alex Kimball
Marketing
16 min read
Diagram showing stream processor engine continuously maintaining materialized views from an event stream

A streaming database sounds like two concepts that don't belong together. Databases store data. Streams move it. What does it mean for a database to stream?

The short answer: a streaming database is a system that applies the SQL query model to continuously arriving data — not by scanning tables when you query them, but by maintaining query results incrementally as new events arrive. You define views in SQL. The database keeps those views current. Queries return current results instantly, because the work was already done.

That's the marketing version. The reality is more specific, and the tradeoffs matter more than the pitch. This guide covers what streaming databases actually are, how they work under the hood, what they're genuinely good at, and where they run into limits that aren't fixable by configuration.

What Is a Streaming Database?

A streaming database is a database system where the primary compute model is continuous, incremental query evaluation — rather than batch query evaluation at read time.

In a traditional OLAP database, a query like "sum of transactions by user over the last 24 hours" triggers a full scan of the relevant data at query time. The database reads rows, applies the aggregation, and returns a result. Fast or slow depends on indexes, partitioning, and data volume.

In a streaming database, that query is defined once as a materialized view. The database maintains the result of that query incrementally as new transactions arrive. When you query it, you get the pre-computed, continuously updated result — not a fresh scan. Latency is near-zero, because the aggregation already ran.
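The difference between the two models can be sketched in a few lines. This is an illustrative toy, not any vendor's engine: the same "sum of transactions by user" query, evaluated once by scanning at read time and once by maintaining state as events arrive.

```python
# Toy contrast between batch evaluation and incremental maintenance.
transactions = [("alice", 30), ("bob", 10), ("alice", 5)]

# Batch model: scan all rows at query time.
def query_batch(rows):
    totals = {}
    for user, amount in rows:
        totals[user] = totals.get(user, 0) + amount
    return totals

# Streaming model: maintain the result incrementally, one event at a time.
class SumByUserView:
    def __init__(self):
        self.totals = {}            # the materialized view's state

    def apply(self, event):         # called once per arriving event
        user, amount = event
        self.totals[user] = self.totals.get(user, 0) + amount

view = SumByUserView()
for e in transactions:
    view.apply(e)                   # the work happens at write time

# Reading the view is a lookup, not a scan.
assert view.totals == query_batch(transactions)   # {"alice": 35, "bob": 10}
```

Both paths produce the same answer; what differs is when the work happens. The batch query pays at read time, the view pays a small cost per event and makes reads nearly free.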

The engine powering this is not a storage engine in the traditional sense (B-tree, LSM tree, columnar blocks). It's a stream processor — a dataflow computation engine that processes events incrementally and updates derived state as those events arrive. Materialize is built on Timely Dataflow. RisingWave uses its own streaming execution engine. Hazelcast and ksqlDB are built on top of existing streaming infrastructure. The SQL interface looks familiar; the computation model underneath is fundamentally different.

Materialize describes it precisely: a streaming database uses "the same declarative SQL and the same abstractions (tables, columns, rows, views, indexes) as a traditional database, but has a completely different engine (a stream processor) and computing model (dataflows) inside."

How Streaming Databases Work

Understanding the mechanics matters, because the tradeoffs fall out of them directly.

Ingestion from an event stream

Streaming databases don't replace your data sources — they consume from them. The primary ingestion path is an event stream, typically Apache Kafka. Events flow from your systems of record into Kafka topics. The streaming database reads those topics continuously, applying transformations and maintaining derived state as events arrive.

Many streaming databases also support Change Data Capture (CDC) — reading the write-ahead log (WAL) from source databases like PostgreSQL or MySQL and ingesting row-level changes as a stream. This is how you get near-real-time reflection of transactional data without modifying the application.
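Conceptually, CDC ingestion means replaying row-level changes against derived state. The sketch below uses Debezium-style change events (`op` codes with `before`/`after` row images) as an assumed input shape; the field names are illustrative, not a guaranteed wire format.

```python
# Toy CDC applier: replay row-level changes into a derived keyed copy.
# Event shape loosely follows Debezium's op/before/after convention.
changes = [
    {"op": "c", "after": {"id": 1, "balance": 100}},    # insert (create)
    {"op": "u", "after": {"id": 1, "balance": 80}},     # update
    {"op": "c", "after": {"id": 2, "balance": 50}},
    {"op": "d", "before": {"id": 2, "balance": 50}},    # delete
]

replica = {}  # derived state keyed by primary key
for ch in changes:
    if ch["op"] in ("c", "u"):
        row = ch["after"]
        replica[row["id"]] = row          # upsert the new row image
    elif ch["op"] == "d":
        replica.pop(ch["before"]["id"], None)

assert replica == {1: {"id": 1, "balance": 80}}
```

A real streaming database does this continuously from the source's WAL, feeding each change into view maintenance rather than into a plain dictionary.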

Incremental materialized views

The core primitive is the incremental materialized view. You write a SQL query — a join, aggregation, window function, or filter — and the streaming database maintains the result of that query as an always-current view. When a new event arrives, the engine doesn't re-run the full query. It applies the delta: what changed, and how does that change the result?

This incremental computation is what makes streaming databases different from materialized views in traditional databases (like PostgreSQL or ClickHouse), where `REFRESH MATERIALIZED VIEW` reruns the full query on a schedule. Streaming databases update views continuously, event by event, without full rescans.

Stream processing primitives

SQL in a streaming database has extensions that don't exist in batch SQL: time-based windowing, watermarks for handling late-arriving data, and join semantics that account for unbounded streams. A tumbling window aggregates events over fixed, non-overlapping time intervals. A hopping window slides across time with overlap. A session window groups events by activity gaps. These primitives map directly to real-time analytics patterns.
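A tumbling window is the simplest of these to picture: each event falls into exactly one fixed-size bucket. A minimal sketch, counting events per 60-second interval keyed by the window's start time:

```python
# Toy tumbling-window count: each event belongs to exactly one
# fixed, non-overlapping 60-second bucket.
events = [(5, "a"), (42, "b"), (61, "c"), (118, "d"), (130, "e")]  # (ts, payload)

WINDOW = 60
counts = {}
for ts, _ in events:
    window_start = (ts // WINDOW) * WINDOW    # buckets: [0,60), [60,120), ...
    counts[window_start] = counts.get(window_start, 0) + 1

assert counts == {0: 2, 60: 2, 120: 1}
```

A hopping window would assign each event to several overlapping buckets instead of one; a session window has no fixed boundaries at all, as shown later.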

Exactly-once semantics and fault tolerance are handled by the underlying streaming engine. If a node fails, the system replays from Kafka and restores derived state. Watermarks let the engine decide when a time window is complete enough to commit results, handling out-of-order and late-arriving data.
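The watermark mechanic can be sketched in miniature. This toy (not any engine's actual implementation) finalizes a window only once the watermark passes its end, so an out-of-order event that arrives before then is still counted:

```python
# Toy watermark-driven window finalization.
WINDOW, LATENESS = 60, 10
buckets, emitted = {}, {}
max_ts = [0]   # highest event time seen so far; drives the watermark

def process(ts):
    start = (ts // WINDOW) * WINDOW
    if start not in emitted:                  # ignore events for finalized windows
        buckets[start] = buckets.get(start, 0) + 1
    max_ts[0] = max(max_ts[0], ts)
    watermark = max_ts[0] - LATENESS          # "no events older than this expected"
    for s in [s for s in buckets if s + WINDOW <= watermark]:
        emitted[s] = buckets.pop(s)           # finalize: window is complete enough

for ts in [5, 61, 50, 130]:   # 50 arrives out of order, before [0,60) finalizes
    process(ts)

assert emitted == {0: 2, 60: 1}   # [0,60) includes the late event at ts=50
assert buckets == {120: 1}        # [120,180) is still open
```

The `LATENESS` allowance is the tradeoff knob: a larger value tolerates more disorder but delays results; a smaller one commits windows sooner and risks dropping stragglers.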

Query serving

Queries against a streaming database look like standard SQL. The streaming database exposes a PostgreSQL-compatible wire protocol in most cases, so your existing tooling — ORMs, BI tools, query runners — connects without modification. The difference from your perspective as a user is latency: queries return in milliseconds, because the results are pre-maintained.

Streaming Database vs. Traditional Database

The category distinction that matters is not streaming vs. relational. It's the computation model: batch query evaluation at read time vs. continuous incremental maintenance.

OLTP databases (PostgreSQL, MySQL) optimize for transactional writes — ACID guarantees, low-latency reads on indexed keys, row-level operations. They evaluate queries at read time, which works fine for operational record access but struggles with complex analytical aggregations over large datasets.

OLAP databases (ClickHouse, Redshift, BigQuery) optimize for analytical reads — columnar storage, vectorized execution, efficient full-table scans. They evaluate queries at read time too, just much faster for aggregation workloads. But "faster" still means seconds to minutes for complex queries at scale, and results reflect data as of the last batch load.

Streaming databases occupy a different position: they're not optimized for writes or for ad-hoc analytical scans. They're optimized for a specific class of pre-defined queries that must always return current results at low latency. The tradeoff is that streaming databases need to know your queries in advance — you define views, and the system maintains them. Ad-hoc exploration over raw data is not their strength.

The practical distinction: if you know what derived state you need to serve — fraud velocity counts, per-user session aggregates, live leaderboards, real-time portfolio positions — a streaming database maintains those results continuously. If you need to explore raw data with arbitrary queries, a streaming database is the wrong tool.

Core Capabilities

SQL and PostgreSQL compatibility

Most streaming databases expose standard SQL and a PostgreSQL-compatible wire protocol. This matters practically: you can use psql, existing connection pooling, ORMs, and most BI tools without modification. The SQL dialect includes streaming-specific extensions (window functions, watermarks) but standard SELECT, JOIN, GROUP BY, and aggregate functions work as expected.

Incremental materialized views

The primary value primitive. Define a SQL view; the streaming database maintains it incrementally. Views can be composed — a view built on another view — enabling layered transformation pipelines entirely in SQL, without separate Apache Flink jobs or Kafka Streams applications.
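View composition works because each view emits deltas that a downstream view can consume. A hypothetical sketch of that wiring — the class names and threshold are invented for illustration:

```python
# Toy layered views: a downstream view maintained from the deltas
# of an upstream view, mirroring view-on-view composition.
class TotalsView:                      # upstream: running sum per user
    def __init__(self):
        self.totals = {}
        self.downstream = []

    def apply(self, user, amount):
        old = self.totals.get(user, 0)
        new = old + amount
        self.totals[user] = new
        for v in self.downstream:      # propagate the delta, not the raw event
            v.on_delta(user, old, new)

class BigSpendersView:                 # downstream: users whose total exceeds 50
    def __init__(self):
        self.members = set()

    def on_delta(self, user, old, new):
        if new > 50:
            self.members.add(user)
        else:
            self.members.discard(user)

totals, big = TotalsView(), BigSpendersView()
totals.downstream.append(big)
for user, amount in [("alice", 30), ("bob", 60), ("alice", 25)]:
    totals.apply(user, amount)

assert big.members == {"alice", "bob"}   # alice crosses 50 on her second event
```

In SQL terms, `BigSpendersView` is a view defined over `TotalsView`; the engine keeps both current from the same event stream.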

CDC ingestion

Change Data Capture from relational sources via WAL reading (PostgreSQL logical replication, MySQL binlog) or connectors. This bridges transactional source systems to the streaming database without application-layer instrumentation. Debezium is a common CDC connector used with Kafka-backed streaming databases.

Window functions

Tumbling windows (fixed non-overlapping intervals), hopping windows (overlapping intervals), and session windows (bounded by inactivity gaps) are first-class SQL constructs. These are essential for time-series aggregation, rate limiting, and sliding-window metrics.
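Session windows differ from the other two in that their boundaries come from the data itself. A minimal sketch, grouping one user's event timestamps into sessions separated by more than 30 seconds of inactivity:

```python
# Toy session windowing: close a session when the gap between
# consecutive events exceeds the inactivity threshold.
GAP = 30

def sessionize(timestamps):
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > GAP:
            sessions.append(current)   # gap exceeded: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

assert sessionize([0, 10, 25, 70, 80, 200]) == [[0, 10, 25], [70, 80], [200]]
```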

Fault tolerance and exactly-once semantics

Streaming databases checkpoint internal state periodically and replay from Kafka or source systems on failure. Most provide exactly-once semantics — each event is processed exactly once even across failures, with no duplicate or missing updates to materialized views.
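The recovery idea is that view state and the input offset are checkpointed together, so replay resumes exactly where the state left off and no event is applied twice. A deliberately contrived toy (the crash point and checkpoint point are hard-coded for illustration):

```python
# Toy checkpoint-and-replay recovery over a Kafka-like log.
log = [("alice", 30), ("bob", 10), ("alice", 5), ("bob", 20)]

def run(log, checkpoint):
    state, offset = dict(checkpoint["state"]), checkpoint["offset"]
    for i in range(offset, len(log)):
        user, amount = log[i]
        state[user] = state.get(user, 0) + amount
        if i == 1:                    # checkpoint state + offset atomically...
            saved = {"state": dict(state), "offset": i + 1}
        if i == 2 and offset == 0:    # ...then simulate a crash on the first run
            return saved, None
    return None, state

saved, _ = run(log, {"state": {}, "offset": 0})   # first run crashes mid-stream
_, state = run(log, saved)                        # recovery replays from offset 2

assert state == {"alice": 35, "bob": 30}          # each event applied exactly once
```

The key property is that the uncheckpointed update (the `("alice", 5)` event applied just before the crash) is simply recomputed on replay rather than double-counted, because recovery starts from the checkpointed state, not the crashed one.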

Low query latency

Because views are pre-maintained, query latency is effectively the read path of an indexed key-value store — single-digit milliseconds in most cases. The computation cost has already been paid as events arrived.

Use Cases

Fraud detection and risk velocity

Fraud detection systems rely on velocity features — transaction counts, amounts, and patterns computed over sliding time windows per user, account, or device. These features need to be current at decision time. A streaming database maintains velocity aggregations continuously, so the feature value at query time reflects the most recent events rather than the last batch refresh.
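A velocity feature of this kind is just a sliding-window counter maintained per key. An illustrative sketch (the 60-second window and account key are arbitrary):

```python
# Toy per-account velocity counter over a 60-second sliding window.
from collections import defaultdict, deque

WINDOW = 60
recent = defaultdict(deque)   # account -> timestamps still inside the window

def record(account, ts):
    q = recent[account]
    q.append(ts)
    while q and q[0] <= ts - WINDOW:   # evict events older than the window
        q.popleft()

def velocity(account):
    return len(recent[account])        # O(1) read at decision time

for ts in [0, 10, 20, 65]:
    record("acct-1", ts)

assert velocity("acct-1") == 3         # ts=0 aged out when ts=65 arrived
```

In a streaming database this shape is expressed as a windowed SQL aggregation; the engine maintains the equivalent state so the fraud check reads a current count instead of scanning transaction history.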

Financial real-time reporting

Portfolio positions, account balances, and exposure calculations that must reflect the most recent transactions benefit from streaming databases' ability to maintain continuously updated aggregations. Where a batch pipeline updates these views on a schedule, a streaming database updates them event by event.

IoT and sensor analytics

Sensor data arrives continuously at high volume. Aggregations over recent windows — average temperature over the last 5 minutes, anomaly counts per device per hour — are a natural fit for streaming SQL. The streaming database consumes from Kafka topics fed by IoT infrastructure and maintains the derived analytics without full table scans.

Real-time analytics dashboards

Operational dashboards showing live counts, rates, and aggregates benefit from streaming databases' low query latency. Instead of a BI tool running an expensive OLAP query on each refresh, it queries pre-maintained views that reflect current data.

Live personalization and recommendation serving

User activity signals — clicks, views, purchases — can drive continuously updated preference models or feature vectors stored in materialized views. Recommendation engines and eligibility gates query these views at serving time to get current signals without triggering on-demand aggregation.

Top Streaming Databases

RisingWave is an open-source streaming database with a PostgreSQL-compatible interface built on a custom streaming execution engine. It supports SQL materialized views, CDC ingestion, and cloud-native deployment. Architecturally designed to separate storage and compute.

Materialize is a source-available streaming database built on Timely Dataflow and Differential Dataflow, the distributed dataflow frameworks that came out of the Naiad research line. Strong SQL compatibility, composable views, and tight Kafka integration. The commercial product targets production workloads requiring strong consistency guarantees.

ksqlDB is Confluent's streaming database layer for Apache Kafka, built on Kafka Streams. It runs as a Kafka application, consumes from and produces to Kafka topics, and exposes a SQL-like interface for defining streaming queries and materialized tables. Tightly coupled to the Kafka ecosystem — a strong choice if Kafka is already central infrastructure.

Hazelcast combines an in-memory data grid with stream processing capabilities. Supports SQL over continuously updated distributed maps and streaming pipelines. Well-suited for low-latency use cases where data fits in memory and query latency is critical.

Memgraph is a graph database with streaming capabilities — ingests from Kafka and other event sources, supports continuous graph analytics and pattern detection. Positioned for use cases where the data model is inherently graph-structured (social networks, fraud graph traversal, network topology).

The right choice among these depends on your query patterns, data volume, latency requirements, and existing infrastructure. ksqlDB is natural if you're already Kafka-native. RisingWave and Materialize are stronger for teams that want a more database-like abstraction with stronger SQL semantics.

Where Streaming Databases Fall Short

Streaming databases solve a real problem well. But there's a set of requirements they can't satisfy, and understanding those limits prevents a category mismatch.

They solve the preparation gap, not the retrieval gap

A streaming database closes what you might call the preparation gap: derived context — aggregations, velocity counts, joined views — is no longer a step behind because it's maintained incrementally rather than computed on a batch schedule. That's genuinely valuable.

What it doesn't close is the retrieval gap: if a decision needs a point lookup (get the current account record by key), an aggregation (sum of transactions over the last hour), and a similarity query (find accounts with similar behavior patterns), those retrieval patterns may still be split across systems. Redis for low-latency point lookups. The streaming database for aggregations. A vector store for similarity. Each system is at a different propagation stage, pulling from different snapshots at different moments. The decision sees context that never existed as a consistent whole.

Ingestion lag compounds

Streaming databases sit downstream of Kafka. Events flow: source system → WAL → Kafka → streaming database → materialized view. Each hop adds latency. Under high throughput, Kafka consumer lag compounds with view maintenance lag. The streaming database is incrementally faster than a batch pipeline, but it's still a pipeline — and pipeline lag is structural, not a configuration knob.

High state velocity and concurrency expose limits

The incremental view maintenance model works well at moderate throughput. Under high state velocity — where shared state (account balances, exposure limits, velocity counters) changes with every transaction — and high concurrency — where many decisions evaluate that shared state simultaneously — the gap between source events and maintained view results can open up in ways that matter.

Consider a fraud velocity counter per account. Under burst conditions, dozens of transactions arrive simultaneously against the same account. The streaming database processes them incrementally, but view maintenance is not instantaneous. A fraud check running in that window may evaluate a counter that doesn't yet reflect the full burst. The streaming database closed the preparation gap from seconds to milliseconds — but under sufficient concurrency, milliseconds is still a gap.

They're read-side optimization only

Streaming databases maintain derived read state. They're not ACID-transactional systems for the source data. They don't own writes. If two services need to read the same derived state and then take conflicting actions based on it — both passing a check that only one should pass — the streaming database doesn't help. That's a write-coordination problem, and it requires transactional write semantics at the source, not faster read propagation. This limitation applies equally to context lakes like Tacnode: Tacnode ingests source events via CDC and does not replace your source-of-record database. Write coordination is a different architectural concern.

Ad-hoc exploration is not their strength

Streaming databases are view-first systems. Query performance is excellent for pre-defined materialized views. Ad-hoc queries against raw data — the kind of exploratory analysis typical in data engineering workflows — don't benefit from the streaming model and may be slower than a purpose-built OLAP system.

When the Problem Is Bigger Than a Streaming Database

If you're evaluating streaming databases because you need execution-time reliability for automated decisions — fraud checks, authorization limits, real-time risk, agent actions — it's worth asking whether the problem is a preparation problem, a retrieval problem, or both.

Streaming databases solve the preparation problem cleanly. If your issue is that aggregated features are stale because they're computed on a batch schedule, a streaming database with incremental materialized views is the right category.

If the problem also includes retrieval fragmentation — the decision needs multiple query patterns served from one consistent snapshot, and the composed stack (streaming DB + Redis + vector store) puts those patterns on different propagation timelines — a streaming database doesn't close that gap. You've replaced one part of the pipeline with a better system, but you've still got a pipeline.

The category that addresses both gaps is a context lake: a single system where all context a decision depends on — raw records, derived aggregations, vector embeddings, joined views — is stored, maintained, and retrieved under one consistent snapshot. No cross-system retrieval. No mismatched propagation stages. The decision evaluates context as it exists at one moment.

Tacnode is built as a PostgreSQL-compatible context lake. It ingests via CDC, maintains incremental materialized views inside its transactional boundary, handles vector similarity natively, and serves all retrieval patterns from one consistent snapshot — so a decision that needs a point lookup, an aggregation, and a similarity result gets all three from the same moment in state, not three separate systems at three different propagation stages.

That distinction matters when the cost of acting on inconsistent context is real: an authorization that exceeds a limit because the balance didn't reflect the last transaction, a fraud check that passes because the velocity counter didn't reflect the burst, an agent action that conflicts because two agents read the same stale state.

If you're past the point where a streaming database is enough, that's the architecture worth evaluating.

Tags: Streaming Database, Stream Processing, Materialized Views, Real-Time Analytics, Change Data Capture

Written by Alex Kimball

Building the infrastructure layer for AI-native applications. We write about Decision Coherence, Tacnode Context Lake, and the future of data systems.

