Tacnode

Production Grade

Every workload gets its own lane

In shared infrastructure, workloads compete. A batch reprocessing job consumes CPU and I/O. The real-time fraud check slows from 4ms to 400ms. The transaction gets delayed — or times out. This is the noisy neighbor problem.

Workload isolation isn't a configuration knob. It's a structural guarantee — dedicated execution lanes that make it physically impossible for one workload class to steal capacity from another.

Without Isolation

Batch Job (10% CPU): consuming the shared resource pool
Real-Time Query (4ms): latency rising as batch consumes resources

With Isolation

Batch Job (Batch Nodegroup): contained within its own Nodegroup
Real-Time Query (4ms): guaranteed capacity, latency unchanged

The batch job does the same work either way. Isolation determines whether it punishes the query beside it.

The Noisy Neighbor Problem Is Architectural

Multi-tenant systems run multiple workload classes concurrently: batch ingestion, real-time queries, ML inference, reporting jobs. Each has fundamentally different resource demands and latency tolerances.

Batch jobs are bursty and unpredictable. ML retraining pipelines are resource-hungry by design. Real-time serving has strict latency SLAs measured in single-digit milliseconds. These workloads cannot coexist without guardrails — not because they are individually unreasonable, but because a shared resource pool has no concept of priority.

Rate limiting and query timeouts treat the symptom. They don't solve it. The problem is that a shared pool allows any workload to consume capacity that another workload was counting on. The only structural solution is to eliminate the shared pool for competing workload classes.
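The failure mode is easy to reproduce. A minimal Python sketch (illustrative only, not Tacnode's internals): one shared executor for every workload class, where a burst of batch work queues ahead of a latency-sensitive check simply because the pool has no concept of priority.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# A shared pool: every workload class submits to the same executor.
pool = ThreadPoolExecutor(max_workers=2)

def batch_chunk():
    time.sleep(0.5)          # simulated heavy batch work

def realtime_check():
    return "ok"              # should return in single-digit milliseconds

# Batch fills the pool first; FIFO admission has no notion of priority.
for _ in range(8):
    pool.submit(batch_chunk)

start = time.monotonic()
pool.submit(realtime_check).result()   # queued behind every batch chunk
elapsed_ms = (time.monotonic() - start) * 1000
print(f"real-time check waited {elapsed_ms:.0f} ms")
```

With two workers and eight half-second batch chunks ahead of it, the "fast" check waits roughly two seconds. Nothing misbehaved; the pool simply cannot distinguish the workloads.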

Where Isolation Breaks Down

Isolation failures don't look like infrastructure problems at first. They look like latency spikes, service degradations, and intermittent timeouts that correlate with batch job schedules.

Fraud Detection Under Load

Failure mode: latency degradation

What happens: A nightly batch reprocessing job kicks off. CPU utilization spikes. The fraud scoring service — sharing the same cluster — begins queuing requests. The 4ms p50 becomes 400ms. Transactions time out.

Cost: Fraud checks stall during the highest-risk window. Chargebacks rise. Customers abandon.

ML Retraining vs. Serving

Failure mode: resource starvation

What happens: A model retraining pipeline runs in the same compute tier as the inference endpoint. GPU memory contention causes OOM restarts on the serving path. Inference errors begin returning to application clients.

Cost: Serving interruptions during model update cycles. Customers see fallback or errors.

Ingestion Spike vs. Query SLA

Failure mode: I/O saturation

What happens: A backfill job ingesting historical data saturates disk I/O. Read queries from real-time dashboards begin experiencing timeouts. The ingestion job isn't doing anything wrong — it just has no ceiling.

Cost: Operational dashboards go dark. Incident response is blind during peak load.

Reporting vs. Transactional Serving

Failure mode: thread starvation

What happens: A business intelligence query does a full-table scan across 200M rows. It holds a shared query executor thread pool. OLTP queries — short, latency-sensitive — queue behind it and miss SLAs.

Cost: Payment processing delays. Cart abandonment. Revenue impact.

One Pool vs. Independent Nodegroups

The difference between shared and isolated execution isn't about how much total capacity you provision — it's about whether workloads can reach into each other's portion of it.

A single shared pool is always fully contested. Every workload's performance depends on every other workload's behavior. Separate Nodegroups make contention impossible across workload classes — batch saturation is invisible to real-time serving by design.

Shared Resource Pool

Batch / Real-Time / ML Inference

All workloads compete for the same CPU, memory, and I/O. Contention is constant.

Independent Nodegroups

Batch Nodegroup
Real-Time Nodegroup
ML Inference Nodegroup

Each workload runs in its own Nodegroup. Batch saturation is invisible to real-time serving.
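The contrast above can be sketched in a few lines of illustrative Python (again, a conceptual analogy, not Tacnode's implementation): give each workload class its own executor, and saturating the batch pool leaves the real-time pool untouched.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Independent "nodegroups": each workload class gets its own executor,
# so saturation in one pool cannot queue work in another.
batch_pool = ThreadPoolExecutor(max_workers=2)
realtime_pool = ThreadPoolExecutor(max_workers=2)

def batch_chunk():
    time.sleep(0.5)          # simulated heavy batch work

def realtime_check():
    return "ok"

# Fully saturate the batch pool.
for _ in range(8):
    batch_pool.submit(batch_chunk)

# The real-time pool is idle; admission is immediate.
start = time.monotonic()
realtime_pool.submit(realtime_check).result()
elapsed_ms = (time.monotonic() - start) * 1000
print(f"real-time check waited {elapsed_ms:.1f} ms")
```

Same batch load as before, but the real-time check now completes in well under a millisecond of queueing, because there is no shared queue for it to wait in.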

What Real Isolation Requires

Isolation is often confused with throttling. They are different: throttling limits how much a workload can consume. Isolation guarantees what no other workload can take from it.

Separate Nodegroups per Workload Class

With isolation: Batch, real-time, and ML workloads each run in their own Nodegroup, with dedicated CPU, memory, and network resources enforced at the infrastructure layer, not the application layer.
Without: All workloads share a resource pool with soft quotas applied at query time; enforcement is advisory and fails under load.

Priority Queuing with Backpressure

With isolation: Real-time queries are admitted immediately against their Nodegroup's reserved capacity. Batch Nodegroups apply backpressure when saturated; the system slows ingestion before it impacts serving.
Without: A global queue processes all workloads in order of arrival. High-priority queries wait behind low-priority jobs with no preemption mechanism.

Reserved Capacity for Latency-Sensitive Paths

With isolation: The real-time Nodegroup has dedicated compute headroom that is never preempted. SLAs are enforced structurally, not by hoping batch jobs finish on time.
Without: Capacity is shared opportunistically; real-time serving gets more resources only when batch jobs are idle, which cannot be guaranteed.

Independent Scaling per Nodegroup

With isolation: Each Nodegroup scales its unit count independently. An ingestion spike scales the batch Nodegroup without touching real-time capacity, and vice versa.
Without: The entire cluster scales together. Rightsizing is impossible because each workload class has different elasticity requirements.

Shared Resources vs. Isolated Resources

The gap between shared and isolated isn't academic. It maps directly onto whether your real-time latency SLAs hold up when a batch pipeline is in flight.

Batch impact on real-time. Shared: direct; batch consumes shared CPU and I/O. Isolated: none; batch is bounded to its own compute pool.
Latency predictability. Shared: highly variable; depends on what else is running. Isolated: consistent; real-time paths have reserved capacity.
SLA guarantees. Shared: difficult; tail latency is tied to batch scheduling. Isolated: achievable; guaranteed headroom per workload class.
Resource contention. Shared: structural; built into the shared-pool model. Isolated: eliminated; contention cannot cross pool boundaries.
Capacity planning. Shared: requires modeling worst-case interference. Isolated: per workload class, independent and predictable.

How Tacnode Delivers Workload Isolation

The core concept is the Nodegroup — a computing module with its own CPU, memory, and network resources. Each Nodegroup executes SQL independently and scales its own capacity (measured in units) without affecting any other Nodegroup.

State is shared through a common storage layer and Catalog. A database binds to one primary Nodegroup for direct, low-latency access — but any other Nodegroup can read it remotely without sharing compute. Isolation is between execution environments, not between copies of data.

The result: a batch scan that saturates its Nodegroup has no path to the real-time serving Nodegroup. A surge in ingestion does not starve query execution. Every workload advances independently while observing the same consistent state.

Dedicated Nodegroup per workload class

Batch ingestion, real-time query serving, and ML inference each run in separate Nodegroups with their own CPU, memory, and failure domain. Resource exhaustion in one Nodegroup cannot propagate to another.

Guaranteed capacity for latency-sensitive paths

The real-time Nodegroup has dedicated compute headroom that is never preempted by lower-priority workloads. The 4ms fraud check stays at 4ms regardless of what the batch Nodegroup is doing.

Batch operations run with backpressure, not timeouts

When a batch Nodegroup is under pressure, ingestion slows gracefully via backpressure. The system throttles the producer, not the consumer — serving Nodegroups are unaffected.
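Throttling the producer rather than the consumer is the classic bounded-buffer pattern. A minimal Python sketch (a conceptual analogy, not Tacnode's ingestion path): when the buffer between producer and worker is full, `put()` blocks, so the producer is paced to the worker's speed instead of the system dropping or timing out downstream work.

```python
import queue
import threading
import time

# Bounded buffer between an ingestion producer and a batch worker.
# When the buffer is full, put() blocks: backpressure on the producer.
BUFFER = queue.Queue(maxsize=4)

def batch_worker():
    while True:
        item = BUFFER.get()
        if item is None:      # sentinel: shut down
            break
        time.sleep(0.05)      # simulated slow write

threading.Thread(target=batch_worker, daemon=True).start()

start = time.monotonic()
for i in range(20):
    BUFFER.put(i)             # blocks once 4 items are in flight
BUFFER.put(None)
elapsed = time.monotonic() - start
print(f"producer took {elapsed:.2f}s: paced to the worker, nothing dropped")
```

The producer takes roughly a second to enqueue 20 items because the worker drains them at 50 ms each. An unbounded queue would accept everything instantly and push the failure downstream; the bounded one makes the producer absorb the slowdown.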

Independent scaling per Nodegroup

Each Nodegroup scales its unit count independently. An ingestion spike scales the batch Nodegroup without touching real-time capacity. Capacity planning is per workload class — no cross-interference to model.

See how Tacnode keeps every workload in its own lane

Dedicated execution pools. Reserved real-time capacity. Batch backpressure that protects serving paths. Workload isolation built into the architecture — not bolted on after the fact.