Approximate Computing

In modern big data analytics, exact computation often consumes significant time and resources. Tacnode supports Approximate Computing, which intentionally trades a small amount of accuracy for substantial performance gains. It is especially suitable for:

  1. Massive-scale data analysis (datasets with over ten million rows)
  2. Interactive queries with strict latency requirements
  3. Statistical scenarios tolerating bounded errors

Advantages:

  • Significantly faster queries: 5–100x faster than exact computation
  • Lower resource consumption: reduced CPU, memory, and I/O usage
  • Better scalability: performance degrades gradually, rather than sharply, as data volume grows

Supported Approximate Functions

approx_count_distinct

Description

Estimates the number of distinct values (cardinality) in a column using the HyperLogLog algorithm.

Syntax

approx_count_distinct(expr [, precision])

Parameters

  • expr: column or expression for cardinality estimation
  • precision (optional): precision parameter, range 4–18, default 12; higher values increase accuracy but use more memory

Accuracy & Error

  • Default precision (12): standard error ≈0.81%
  • Typical error range: ±2%
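
For orientation, the textbook HyperLogLog analysis gives a relative standard error of roughly 1.04/√m for a sketch with m registers. The worked figures below assume that the precision argument is the number of index bits p, so that m = 2^p. This mapping is an assumption and is not confirmed by this page.

```latex
\[
\mathrm{SE} \approx \frac{1.04}{\sqrt{m}}, \qquad m = 2^{p}
\]
\[
p = 12:\ \frac{1.04}{\sqrt{4096}} = \frac{1.04}{64} \approx 1.63\%,
\qquad
p = 14:\ \frac{1.04}{\sqrt{16384}} = \frac{1.04}{128} \approx 0.81\%
\]
```

Under this formula, each additional unit of precision doubles the register count and shrinks the error by a factor of √2. The 0.81% figure listed above coincides with p = 14 in the textbook formula, so the engine's precision argument may map to register count differently, or the engine may use an improved estimator.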

Examples

-- Estimate number of unique website visitors
SELECT approx_count_distinct(user_id) AS unique_visitors
FROM website_logs
WHERE date = CURRENT_DATE;
 
-- Use higher precision for product count
SELECT approx_count_distinct(product_id, 14) AS approx_unique_products
FROM orders;
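
One way to gauge the estimate on your own data is to run the exact and approximate aggregates side by side. The following is a minimal sketch, assuming the illustrative website_logs table used above and assuming the engine supports the standard count(DISTINCT ...) aggregate.

```sql
-- Compare exact and approximate distinct counts and report the relative error.
-- website_logs and user_id are the illustrative names from the example above.
SELECT exact_cnt,
       approx_cnt,
       abs(approx_cnt - exact_cnt) * 100.0 / NULLIF(exact_cnt, 0) AS relative_error_pct
FROM (
    SELECT count(DISTINCT user_id)        AS exact_cnt,
           approx_count_distinct(user_id) AS approx_cnt
    FROM website_logs
) AS t;
```

The exact aggregate is the expensive half of this query, so on a very large table a check like this is better run against a sample or a single day of data.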

Use Cases

  • UV (unique visitor) stats on large datasets
  • High-cardinality dimension analysis
  • Real-time dashboard metrics
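
A dashboard-style query might look like the sketch below. It assumes that approx_count_distinct can be combined with GROUP BY like any other aggregate, and it reuses the illustrative website_logs table and date column from the examples above.

```sql
-- Approximate unique visitors per day for a dashboard panel.
SELECT date,
       approx_count_distinct(user_id) AS unique_visitors
FROM website_logs
GROUP BY date
ORDER BY date;
```

Daily estimates cannot be summed to get a multi-day unique count, because the same user may appear on several days. A multi-day figure needs its own approx_count_distinct over the whole range.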

approx_percentile

Description

Estimates percentiles over numeric columns using the T-Digest algorithm.

Syntax

approx_percentile(expr, percentage [, precision])

Parameters

  • expr: numeric column or expression
  • percentage: percentile to estimate, within [0,1]
  • precision (optional): T-Digest compression parameter, default 100; higher values increase accuracy but use more memory

Accuracy & Error

  • Error is lowest near the extreme percentiles (close to 0 or 1), where T-Digest keeps its finest resolution
  • Error near the median and other mid-range percentiles is typically < 1%

Examples

-- Estimate median age (50th percentile)
SELECT approx_percentile(age, 0.5) AS median_age
FROM users;
 
-- Calculate 95th percentile for response time
SELECT approx_percentile(response_time_ms, 0.95, 200) AS p95_response_time
FROM api_metrics;
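
To see how close the estimate is on a given dataset, the approximate result can be compared with an exact percentile. This sketch assumes the engine supports the standard ordered-set aggregate percentile_cont, as PostgreSQL does. That support is an assumption and is not stated on this page.

```sql
-- Exact versus approximate 95th percentile on the same column.
-- percentile_cont interpolates between neighbouring values, so small
-- differences are expected even when the estimate is very good.
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS exact_p95,
       approx_percentile(response_time_ms, 0.95)                      AS approx_p95
FROM api_metrics;
```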

Use Cases

  • Latency analysis (p50/p90/p99)
  • Resource monitoring
  • Data distribution analysis
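
For latency analysis, several percentiles are usually wanted at once. The sketch below uses the illustrative api_metrics table from the example above and calls the function once per percentile, since the documented signature takes a single percentage.

```sql
-- p50 / p90 / p99 response times in a single scan of the table.
SELECT approx_percentile(response_time_ms, 0.50) AS p50,
       approx_percentile(response_time_ms, 0.90) AS p90,
       approx_percentile(response_time_ms, 0.99) AS p99
FROM api_metrics;
```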

Considerations

  1. Not suitable for scenarios that require exact results (e.g., financial transactions)
  2. Results may vary within ±2%; repeated runs of the same query may return slightly different values
  3. Cannot be used to enforce uniqueness constraints or to perform exact deduplication
  4. Extreme data distributions (e.g., 99% identical values) may reduce accuracy
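
Where items 1 and 3 apply, exact aggregation remains the appropriate tool. The sketch below shows exact duplicate detection in standard SQL. The payments table and transaction_id column are hypothetical names chosen for illustration.

```sql
-- Exact duplicate detection: an approximate count cannot prove uniqueness,
-- so group on the key and look for any value that occurs more than once.
SELECT transaction_id,
       count(*) AS occurrences
FROM payments
GROUP BY transaction_id
HAVING count(*) > 1;
```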
