Real-Time ML: Architecture, Feature Freshness, and Where ML Models Make Bad Decisions
Real-time ML — the architecture that runs ML models against live requests for instant decisions — is bottlenecked by feature freshness, not model latency. The model serves in 8 milliseconds; the features it scored are 40 seconds old. For real-time machine learning systems committing against fresh state, the freshness budget is the binding constraint, and most stacks never measure it.
TL;DR: Real-time machine learning architectures optimize what’s measurable: model latency, serving p99, GPU utilization. Feature pipelines — online stores, stream aggregations, embedding refreshes — often operate at propagation timescales 100–1000× the inference latency. For real-time machine learning systems that produce online predictions inside tight decision windows, the feature freshness budget, not the model latency budget, is the binding constraint. Most ML systems never measure it. The fix is a serving layer where real-time features are maintained inside the same transactional boundary as the rest of the decision’s context — what Tacnode calls a Context Lake — so that the feature vector the ML model scores is coherent with the velocity counter, the exposure aggregate, and the rules-engine flags read alongside it.
The ML model serves in 8 milliseconds. The dashboard says p99 is green. A customer gets approved for a line they shouldn’t — because the features the model scored were from 40 seconds ago, and in those 40 seconds the same customer’s exposure doubled at another merchant. Model latency was never the problem. Feature freshness inside the decision’s validity window was.
This post walks through what real-time machine learning actually looks like in production — the ML system topology, the role of ML models against live data, where feature freshness lives in the architecture, and why the feature-pipeline piece of the stack is where the freshness budget gets spent, not the model server. It describes what a decision-coherent architecture has to do differently: not replace the model server, not replace ML platforms entirely, but move the decision-time features for real-time machine learning to a serving layer where their freshness is bounded by the ingest path, not by a scheduled refresh cadence.
What Real-Time Machine Learning Actually Is
Real-time machine learning (real-time ML) is the architectural category where a trained ML model scores a live request — a transaction, an action, a session event — and returns a real-time prediction that feeds an automated decision before the request completes. Real-time ML systems serve ML models (Triton, TorchServe, Ray Serve, BentoML, SageMaker endpoints, Vertex AI) against a feature vector assembled at request time, inside a latency budget typically between 10 and 200 milliseconds.
The canonical production shape is: request arrives at an ML system; the system retrieves a feature vector for the entity being scored (customer, session, device) from an online store; the machine learning model runs inference on the vector; the score returns; the decision service composes the score with rules, thresholds, and other signals; the action commits. Every step has a measurable latency budget that benchmarks well in isolation.
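In code, that shape is a handful of calls. A minimal sketch, assuming hypothetical online_store, model_server, and rules_engine clients; any feature store SDK, Triton-style serving client, and decision service fit the same outline:

```python
import time

def score_request(request, online_store, model_server, rules_engine):
    """Canonical real-time ML serving path (illustrative clients, not a real SDK)."""
    t0 = time.monotonic()

    # 1. Retrieve the feature vector for the entity being scored.
    features = online_store.get_features(entity_id=request.entity_id)

    # 2. Run inference on the assembled vector.
    score = model_server.predict(features)

    # 3. Compose the score with rules, thresholds, and other signals.
    decision = rules_engine.decide(request, score)

    # Model latency is trivial to measure here; nothing in this path says
    # how old `features` is, which is the gap the rest of this post traces.
    latency_ms = (time.monotonic() - t0) * 1000
    return decision, latency_ms
```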
What the architecture does not measure, by default, is the age of the features the vector contained. The feature vector came from an online store. That online store got it from a feature pipeline. The pipeline got it from events that happened some time before the inference request arrived. Each step in that upstream feature pipeline has its own propagation delay. The ML system is fast; the features are less fresh than the dashboard suggests.
For ML models that consume slowly-changing signals (customer tenure, aggregate behavior over weeks, geographic profile), feature freshness is not a binding constraint. For ML models that gate decisions against signals that change at transaction speed — velocity counters, session state, cross-channel risk flags, recent exposure — it is the constraint, and it is the one most ML platforms do not instrument.
Batch Machine Learning vs Real-Time Machine Learning
Machine learning systems split into two operational modes: batch machine learning and real-time machine learning. The distinction is about when the ML model scores incoming data and how fresh that data must be.
Batch machine learning runs on a schedule. A scoring job reads a window of historical data, runs the ML model over it, writes predictions out. Examples: daily customer-segment scoring, weekly churn predictions, hourly demand forecasts. The data the model scores is by definition not fresh data — it reflects the world as of the last batch run. Batch features are computed once per cycle and reused across the cycle. The infrastructure is straightforward: data pipelines, batch processing, a model server invoked on a cron, output stored for downstream consumption. Most enterprise ML still runs this way and most of it should — the workloads do not need real-time predictions.
Real-time machine learning runs on every request. The ML model scores new data points as they arrive, against real-time features that should reflect what just happened. Real-time features come from streaming infrastructure: incoming data flows through Kafka, gets transformed by feature pipelines, lands in an online store, and gets retrieved by the ML system at inference time. Online predictions come back in milliseconds. The move from batch features to real-time features is what makes the architecture interesting — and what exposes the freshness gap most batch-trained ML engineering teams have not had to think about.
Online learning extends real-time machine learning further: not just scoring new data points but updating ML models against them as they arrive, so the model itself drifts forward with the data. Continuous learning is the discipline of keeping ML models current as new patterns emerge — important for fraud detection, recommendation systems, and dynamic environments where the relationships the model learned can decay. Continuous learning does not solve the feature freshness problem; the same model retrained more often still scores against features from a feature pipeline with the same propagation delay. The two are independent dimensions of “current.”
Real-time machine learning is more resource-intensive than batch machine learning. It requires robust infrastructure for streaming, serving, and monitoring, and a tighter operational discipline around model performance and data quality. The reason to take that cost on is that the workload — real-time decision making in fraud detection, autonomous vehicles, recommendation systems — has business consequences inside the decision window that batch scoring cannot deliver against.
Where Feature Freshness Actually Lives
Feature freshness is a property of the pipeline that produced the feature, not of the serving store the model reads from. Understanding where it lives means tracing the path backward from the inference call to the source event.
A typical modern ML feature pipeline has four stages. First, raw data is ingested from a source — a transactional database via CDC, an event bus via Kafka, a SaaS API via webhook, or live data streams from real-time sensors. Second, a stream or batch processing layer — Flink, Spark, dbt, a scheduled job — transforms and aggregates the events into feature values, processing data through feature engineering logic that data scientists wrote. Third, an offline store (Tecton, Feast, SageMaker, Vertex AI, Databricks) holds the feature values as materialized tables for ML model training and batch predictions. Fourth, an online store — often a Redis or DynamoDB-backed subset of the offline store — serves the features at low latency to ML systems generating online predictions.
Each stage has a cadence. CDC streams are typically sub-second. Stream processors like Flink update operator state in near-real-time but emit to downstream stores on a configured interval. Offline stores are updated by scheduled jobs — daily, hourly, or micro-batched every few minutes. Online stores are promoted from offline on a separate cadence, often seconds to tens of seconds. Every promotion step is a propagation stage, and the online store the ML model reads from reflects the state of the source event system at some point in the past, not the present.
For a feature defined as “count of transactions in the last 60 seconds,” this matters intensely. The feature’s value at T=0 depends on events that occurred up to T-60s. The pipeline that maintains this feature needs to have absorbed every one of those events before T, or the feature value at T is wrong — not stale in the “slightly outdated” sense but structurally wrong for the window it claims to cover. In practice, most production online stores serve a value that reflects events up to some point earlier than T, and the freshness SLA is the maximum of those propagation delays across all features.
The Freshness Budget Decomposition
To understand why feature freshness breaks decision-time workloads, decompose the budget for a specific feature.
Take a cross-channel velocity feature: “count of distinct merchants this card has transacted at in the last 5 minutes.” The ML feature pipeline for this feature typically looks like: card-transaction events land in Kafka from the payment network (~100ms propagation); a Flink job reads the Kafka stream, maintains a per-card distinct-merchant counter with sliding windows, and emits updated counter values every configured checkpoint interval (~500ms–2s); the emitted values flow into the offline feature store via a feature pipeline sink (~1s batch write); the online store is synced from the offline store on a scheduled promotion (~5–30s, depending on feature-store product); the inference service reads the online store at request time (~2–10ms).
The inference service reads the feature in milliseconds. The feature it reads reflects events that happened between 5 and 45 seconds ago, depending on where in the promotion cycle the read lands. The inference service’s SLA says “p99 < 50ms” and is correct. The feature’s freshness SLA is “state of the world 5–45s ago” and is almost never measured or exposed. A decision that needs to reflect transactions that committed in the last 60 seconds gets a feature that reflects transactions from the last 2–5 minutes.
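Written out as arithmetic, the decomposition is just a sum of stage delays. The figures below are the illustrative ranges from the paragraph above, not measurements from any particular stack:

```python
# Illustrative stage delays (seconds) for the cross-channel velocity feature.
stages = {
    "kafka_ingest":       0.1,          # payment-network event lands in Kafka
    "flink_checkpoint":   (0.5, 2.0),   # counter emitted on the checkpoint interval
    "offline_sink_write": 1.0,          # batch write into the offline store
    "online_promotion":   (5.0, 30.0),  # scheduled offline-to-online sync
    "online_read":        0.01,         # inference-time read from the online store
}

def bounds(v):
    """Normalize a scalar or (best, worst) tuple into a (best, worst) pair."""
    return v if isinstance(v, tuple) else (v, v)

best = sum(bounds(v)[0] for v in stages.values())
worst = sum(bounds(v)[1] for v in stages.values())
print(f"feature freshness: {best:.1f}s (best case) to {worst:.1f}s (worst case)")
# Roughly 6.6s to 33.1s of pipeline delay behind a ~10ms read; add where in the
# promotion cycle the read lands and you get the 5-45s window described above.
```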
Different feature types have different freshness profiles. Aggregated behavioral features (rolling counts, running sums) follow the pipeline above. Embedding features — vector representations of customer behavior or session state — are often recomputed on an even longer cadence (hourly, daily) because embedding models are expensive. Static features (credit tier, account tenure) are essentially always-fresh because they change on human timescales. Risk-score features from other upstream models can have their own training-serving skew stacked on top of the feature-store pipeline, which compounds.
The effective freshness of a decision is set by its worst feature: the feature vector is only as fresh as the slowest pipeline feeding it. One fast feature does not help if the decision also reads a slow one.
Why the Online Store Misses the Decision Window
Feature stores were not designed for real-time decision-time workloads. They were designed for a different problem: making it cheap and consistent for ML models in production to read the same features they were trained on, while avoiding training-serving skew. That problem is real and solving it is valuable. But solving it does not solve the freshness problem for decisions inside a tight validity window.
The architectural bet is that features are maintained offline (where the compute is cheap and the consistency with training data is easy to guarantee) and promoted to an online store on a cadence that balances freshness against cost. The online store is a cache with scheduled refresh. For training-serving parity, this is correct: the online feature is a controlled copy of the offline feature, with known delay, so the ML model in production sees features that match what it was trained on.
For a decision that commits inside a validity window of hundreds of milliseconds, this architecture is not a good fit. The online store’s refresh cadence is measured in seconds or longer, and any feature it serves has an age equal to the time since the last refresh that included the source event. Under concurrent load — when many transactions for the same customer are being scored within a window shorter than the refresh cadence — the feature vector the ML model sees does not reflect the prior transactions in the same window. The ML model is doing its job correctly against stale inputs.
The deeper problem is that the online store is not the decision’s only view of the customer’s recent behavior. The rules engine reads its own flags from its own cache; the velocity counter lives in yet another store; the exposure aggregate lives in a fourth. A single authorization decision reads from all four, each at its own propagation stage. This is the retrieval gap, and why real-time decisions fail walks through the three failure modes (incomplete, stale, inconsistent) that emerge from it.
How to evaluate a feature store covers the full tradeoff: a feature store is the right answer for offline training plus moderate-freshness online serving, and it is not the right answer for online serving where every decision must reflect events that happened seconds before the inference call.
Model Latency vs Feature Freshness
The single most useful distinction an ML platform team can internalize is that model latency and feature freshness are different metrics that measure different things.
Model latency is the time from the inference service receiving a request to the time it returns a score. It is bounded by the feature-vector retrieval from the online store, the model’s forward pass, and any post-processing. Teams have spent the last decade optimizing this number: GPU inference, quantization, model distillation, batched serving, co-located caches. A good inference system gets model latency well under 50 milliseconds, often under 10.
Feature freshness is the time between a source event occurring and the corresponding feature value being readable by the inference service. It is bounded by the entire upstream pipeline: ingest, stream or batch processing, feature computation, offline store materialization, online store promotion. A typical production feature pipeline has feature freshness in the range of 5 to 120 seconds per feature, depending on the feature and the pipeline’s cadence.
The ratio matters. In production ML inference stacks, feature freshness is typically 100 to 1000 times larger than model latency. A 10ms-latency model reading features that are 10 seconds old is composing a fast read over a slow-moving view of the world. For decisions that depend on fast-moving state — fraud authorization, credit exposure, live session personalization — the slow view is what determines correctness, and the fast model inference does not compensate for it.
The instrumentation most platforms have reflects this asymmetry. Serving p99, GPU utilization, and model latency are instrumented by default in any mature ML platform. Feature freshness — measured as the distribution of source-event-to-feature-read delay, under peak concurrent load, per feature — is rarely instrumented. When a feature is stale, the model is still returning a number. Nothing fires. The decision commits.
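Closing that gap starts with carrying a source-event timestamp alongside each feature value and recording its age at read time. A minimal sketch, assuming the online store can return per-feature event timestamps (a hypothetical get_with_timestamp call; many stores cannot expose this, which is part of the problem) and a generic histogram-style metrics client:

```python
import time

def read_features_with_freshness(online_store, entity_id, feature_names, metrics):
    """Read a feature vector and record per-feature freshness at inference time.

    Assumes the online store returns (value, source_event_ts) pairs and that
    `metrics` is any histogram-style client; both are assumptions, not a
    feature-store API.
    """
    now = time.time()
    vector = {}
    for name in feature_names:
        value, event_ts = online_store.get_with_timestamp(entity_id, name)
        vector[name] = value
        if event_ts is not None:
            # Freshness = age of the source event behind this value at read time.
            metrics.histogram(f"feature_freshness_seconds.{name}", now - event_ts)
    return vector
```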
Training/Serving Skew vs Stream/Serving Skew
Feature-store practitioners are familiar with training-serving skew: the case where the feature value the model was trained on does not match the feature value the model sees at inference. Point-in-time correctness is the discipline of computing training features using the same logic and cutoffs the online serving path uses, so that training data reflects what production would have seen. Feature stores work hard on this — it is one of their primary value propositions.
There is a second skew that matters equally for decision-time workloads and is less well-named: stream/serving skew, the case where Flink’s in-memory state and the online store the model reads from reflect different points in time. Flink has advanced ahead of the online store because the promotion step hasn’t run yet. The model reads a feature from the online store that is behind Flink’s state by the promotion interval. This is not a training-serving issue; the training pipeline is not involved. It is a live pipeline-to-pipeline drift inside production.
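Measuring this skew directly requires hooks on both sides of the promotion step. The sketch below assumes two such hooks exist; neither is available out of the box in most stacks, which is exactly why the skew goes unmeasured:

```python
def stream_serving_skew_seconds(stream_state, online_store, feature_name):
    """Estimate stream/serving skew for one feature.

    `stream_state.latest_event_time(...)` and
    `online_store.last_promoted_event_time(...)` are hypothetical hooks:
    the former exposes the newest source-event time the stream job has
    absorbed into its state, the latter the newest source-event time
    reflected in the serving copy the model actually reads.
    """
    stream_ts = stream_state.latest_event_time(feature_name)
    serving_ts = online_store.last_promoted_event_time(feature_name)
    # Positive skew: the stream job has advanced ahead of what serving can see.
    return max(0.0, stream_ts - serving_ts)
```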
Both skews have the same shape — the feature at read time is not the feature that corresponds to the live state of the world — but training-serving skew gets architectural attention (because it breaks offline model performance claims) while stream/serving skew often gets operational handwaving (because individual decisions rarely surface it visibly). Stateful stream processing for decisions covers the upstream side of this gap in detail.
A decision-coherent architecture closes both. Training-serving parity is preserved by defining features against the same snapshot semantics at training and serving time. Stream-serving parity is preserved by maintaining the feature inside the same serving layer the rest of the decision reads from — no promotion step, no cache tier to fall out of sync with.
A Decision-Coherent ML Inference Architecture
If the online store is where freshness goes to die, what replaces it? Not a faster feature store. A serving layer where feature computation lives alongside the rest of the decision’s context — velocity counters, exposure aggregates, rules flags, embedding indexes — and all of it is served under one snapshot to the inference call.
Four properties define the alternative.
Features live inside the serving layer, not behind a promotion step. A windowed velocity count, a rolling aggregate, a behavioral feature — maintained as an incrementally-updated view inside the same system that holds the raw state. When a transaction commits, the feature reflecting it updates in sub-second time, without a pipeline emitting to an external store.
One read path for every signal the model and the rules engine need. When the inference call reads a feature vector, it reads it from the same layer the rules engine reads flags from and the velocity engine reads counters from. One snapshot. No drift between the model’s view of the customer and the rules engine’s view.
On-demand computation for features that don’t pre-compute usefully. Arbitrary-window aggregations, cross-entity joins, vector similarity over a live index — resolved at inference time against state that is itself coherent. The “what’s the velocity over the last 90 seconds for this card across these 12 merchants” query runs against the same snapshot the velocity counter and the exposure aggregate came from.
Feature definitions are SQL, not Python jobs. The feature definition lives as a view declaration in the serving layer. Changes are DDL operations — promoted the way schema changes are promoted, reviewed as part of feature release, not as a separate Flink job deployment. Feature engineering becomes a platform capability, not a pipeline-operations burden. Data scientists and ML engineers work in the same syntax regardless of feature type.
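As an illustration of the declarative shape, the five-minute merchant-velocity feature from earlier could be a view rather than a stream job. Table, view, and column names below are hypothetical, and the sketch assumes a Postgres-compatible serving layer reached through a DB-API connection; an incrementally maintained variant would use that layer's own DDL:

```python
# Illustrative only: a velocity feature declared as a view in the serving layer
# instead of as a separate stream-processing job.
FEATURE_DDL = """
CREATE VIEW feature_card_merchant_velocity_5m AS
SELECT
    card_id,
    COUNT(DISTINCT merchant_id) AS distinct_merchants_5m
FROM card_transactions
WHERE transaction_ts >= now() - INTERVAL '5 minutes'
GROUP BY card_id;
"""

def deploy_feature(conn):
    # Promoted like any other schema change: reviewed DDL, not a Flink job deploy.
    with conn.cursor() as cur:
        cur.execute(FEATURE_DDL)
    conn.commit()
```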
The inference service still runs the model. The model server — Triton, TorchServe, Ray Serve, BentoML, SageMaker endpoints — is unchanged. What changes is where the feature vector comes from: one coherent serving layer, one read, one snapshot, no training-serving or stream-serving skew to instrument for.
Vertical Patterns
Real-time machine learning inference architectures vary by vertical but share the freshness-gap shape.
Fraud detection: the model scores a transaction against features that include recent velocity, cross-device signals, behavioral embeddings, and historical patterns. The velocity features are the freshness-critical ones; the embedding is typically recomputed on a longer cadence. A decision-coherent stack keeps velocity inside the serving layer with sub-second propagation; the embedding can remain on a slower cadence because its own freshness requirements are softer. Real-time fraud detection architecture covers the full pattern.
Credit decisioning: the risk model scores an application or transaction against features that include payment history, exposure, recent behavior, and cash-flow signals. The exposure and recent-behavior features are the freshness-critical ones. Decision-coherent credit architectures maintain exposure inside the serving layer and resolve the risk model’s feature vector against the same snapshot. Real-time credit decisioning architecture walks through the three-pipeline divergence pattern.
Live personalization (enforcement-grade): the ranker scores a next-action recommendation against session state, recent user behavior, and current inventory. When the session itself is the validity window, feature freshness must be bounded by session-event latency, not feature-store refresh. This is where general recommendation systems stop being the right framework; the wedge only applies when the personalization decision has a concrete behavioral consequence inside the session.
Agent tool calls: an AI agent invokes a tool that scores a situation — credit check, fraud assessment, eligibility evaluation — against features. Each tool call is its own inference request, and multiple tool calls within the same agent plan must read from a coherent view of the customer. Without a decision-coherent serving layer, each tool call reads from its own stale snapshot, and the agent composes a plan over inconsistent context. At production scale with concurrent agent invocations, this is a real and growing failure mode.
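One way to give an agent plan a coherent view is to pin all of its tool calls to a single snapshot. A sketch assuming a Postgres-compatible serving layer and a DB-API connection with autocommit off; the tool-call interface is illustrative, not any particular agent framework:

```python
def run_agent_plan(conn, tool_calls):
    """Resolve every tool call in one agent plan against a single snapshot.

    Assumes a Postgres-compatible serving layer: running the whole plan
    inside one REPEATABLE READ transaction gives each tool call the same
    view of the customer. `tool_calls` is a list of callables taking a cursor.
    """
    results = []
    with conn.cursor() as cur:
        # First statement in the transaction: pin the snapshot for the plan.
        cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
        for call in tool_calls:
            # Credit check, fraud assessment, eligibility: all read the same
            # snapshot instead of each reading its own stale copy.
            results.append(call(cur))
    conn.commit()
    return results
```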
Across all four, the underlying shape is the same: the ML model is fast, the feature vector is slow, and the freshness budget is the constraint the architecture needs to make explicit.
Real-Time Machine Learning in Production: Use Cases and ML Systems
Real-time ML powers systems that have to act in dynamic environments where a stale prediction misses the decision window. The use cases span verticals.
Fraud detection is the canonical real-time ML use case at scale. Credit card fraud teams run ML models on every transaction, scoring each one for fraud against features that include velocity counters, device signals, and behavioral history. Real-time predictions feed an automated decision that commits before the transaction settles. Misses — fraud that should have been blocked — translate directly to losses, which is why fraud detection has driven much of the real-time ML infrastructure investment of the past few years.
Recommendation systems evolved from batch ranking to real-time ML when the freshness of customer behavior signals started mattering more than the sophistication of the ranking model. Real-time ML in e-commerce powers next-product recommendations that reflect what the customer just viewed, search results that reflect inventory availability, and personalized offers that reflect session intent. The ML system reads recent data signals — current session, recent purchases, in-flight cart — and generates predictions in milliseconds.
Autonomous vehicles run ML models continuously over sensor data streams (lidar, camera, radar, GPS), generating instantaneous predictions about object trajectories, lane positions, and risk events. Live data from real-time sensors feeds ML models that must commit decisions inside tens of milliseconds. This is the most freshness-sensitive real-time ML domain in production today.
High frequency trading uses ML models on streaming market data to generate real-time predictions about price movements, order-book dynamics, and execution timing. The latency requirements are sub-millisecond and the freshness window is narrower than any other real-time ML use case.
Smart cities apply real-time ML to traffic flow prediction, emergency response routing, and infrastructure monitoring against IoT sensor streams. Continuous learning lets the ML models adapt as patterns shift across days, weather conditions, and events.
Real-time decision making in healthcare flags potential health risks from continuous patient monitoring data, surfacing instant decisions to clinicians when ML models detect anomalies in real-time data. Latency, freshness, and explainability all matter; the ML system has to integrate with clinical workflows in dynamic environments.
What unifies these is not the model architecture (decision trees, neural networks, deep learning, ensembles all appear) but the operational shape: ML models running against fresh data with low latency, generating real-time predictions that gate immediate responses. The infrastructure that makes these workloads work — robust streaming platforms, real-time features, an ML system that can serve and monitor predictions at scale — is what “real-time ML” means in practice.
What to Measure
If the architecture optimizes what it measures, start measuring feature freshness alongside model latency.
Feature freshness at inference time. Per feature, the distribution of time between when the source event occurred and when the inference service read the value. Measured under peak concurrent load, not averaged over quiet windows. Surfaced per feature so that individual slow features are visible. A good decision-coherent stack keeps the worst per-feature freshness inside the decision’s validity window with headroom.
Cross-feature coherence. For inference calls that read multiple features, the time skew between the features. In a stack where every feature comes from the same snapshot, this is zero. In a pipeline-based stack, it is the difference between the oldest and newest feature in the vector — which quantifies the retrieval gap inside the inference call itself.
Stream-serving skew. For features maintained by stream processing emitting to an online store, the delay between the stream processor’s state and the feature store’s state. Most ML platforms do not measure this by default. It is the single most useful diagnostic for freshness-driven wrong decisions.
Unexpected-prediction delta. For a subset of inference calls, compute the prediction the model would have returned if the features had been fully fresh (reflecting all events up to the inference time, not the features as served). The delta between this and the served prediction is the freshness cost. In most production stacks, this number is surprising the first time a team measures it.
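The first two measurements fall out of the same per-feature timestamps used in the instrumentation sketch earlier. A minimal example, assuming each feature read carries a source-event timestamp:

```python
def freshness_and_coherence(feature_reads, read_ts):
    """Compute per-feature freshness and cross-feature coherence for one inference call.

    `feature_reads` maps feature name -> (value, source_event_ts); both the
    structure and the timestamps are assumptions about what the serving path
    can expose, not a property of any particular feature store.
    """
    freshness = {
        name: read_ts - event_ts
        for name, (_value, event_ts) in feature_reads.items()
    }
    # Cross-feature coherence: skew between the newest and oldest feature in
    # the vector. Zero when every feature comes from the same snapshot.
    coherence_skew = max(freshness.values()) - min(freshness.values())
    return freshness, coherence_skew
```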
Where Real-Time ML Inference Goes From Here
The trajectory is toward decisions that commit on tighter windows against features that must reflect more recent state. Fraud detection is getting faster. Credit is moving from applicant-level to transaction-level. Personalization is crossing from batch ranking into live enforcement. Agents are composing tool calls that each expect coherent context. Every one of these trends tightens the freshness constraint on features and widens the cost of the pipeline-shaped propagation gap.
The architectures that succeed at this scale do not treat feature freshness as a downstream concern of the feature store. They treat it as a first-class decision-time property and serve features from inside the same coherent layer that serves the velocity counter, the exposure aggregate, and the rules-engine flag — under one snapshot, with sub-second propagation from source events, and with freshness measured per feature as an SLA that the serving layer commits to. The model server is unchanged. The feature store’s role shifts to what it was always good at — offline training, batch scoring, moderate-freshness online serving — and the tight-validity-window decisions move to a serving layer the feature store was never designed to be.