Approximate Computing

When dealing with massive datasets, exact calculations can be slow and resource-intensive. Approximate computing deliberately trades small amounts of accuracy for dramatic performance improvements, making it ideal for analytics workloads where speed matters more than perfect precision.

When to Use Approximate Computing

Perfect Use Cases

  • Large-scale analytics - Datasets with 10M+ rows
  • Real-time dashboards - Interactive queries with strict latency requirements
  • Statistical analysis - Scenarios where small errors are acceptable
  • Trend analysis - Understanding patterns rather than exact counts
  • Resource monitoring - Performance metrics and percentile calculations

Performance Benefits

  • 5-100x faster queries compared to exact calculations
  • Reduced resource usage - Lower CPU, memory, and I/O consumption
  • Better scalability - Performance degrades gracefully as data grows
  • Real-time responsiveness - Enable interactive analytics on large datasets

Supported Functions

approx_count_distinct

Estimate the number of unique values in a column using the HyperLogLog algorithm.

Syntax:

approx_count_distinct(expression [, precision])

Parameters:

Parameter   Type     Range     Default   Description
expression  Any      -         Required  Column or expression to analyze
precision   Integer  4-18      12        Higher = more accurate, more memory

Accuracy: HyperLogLog's standard error is roughly 1.04/√(2^precision), so the default precision (12) gives ≈1.6% standard error; estimates typically land within ±2% of the exact count. A precision comparison you can run on your own data appears after the examples below.

Examples:

-- Basic usage: Count unique visitors today
SELECT approx_count_distinct(user_id) AS unique_visitors
FROM website_logs
WHERE date = CURRENT_DATE;
 
-- High precision for important metrics
SELECT approx_count_distinct(product_id, 16) AS unique_products_sold
FROM orders
WHERE created_at >= '2024-01-01';
 
-- Compare with exact count (for verification)
SELECT 
    COUNT(DISTINCT user_id) AS exact_count,
    approx_count_distinct(user_id) AS approx_count,
    approx_count_distinct(user_id) / COUNT(DISTINCT user_id)::float AS ratio
FROM website_logs;
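 
Because accuracy is governed by the precision argument, it can be worth comparing a few precision levels against the exact count on a sample of your own data before choosing one. A minimal sketch, reusing the website_logs table from the examples above (the exact numbers you see will vary by engine and data):
 
-- Compare estimates at several precision levels against the exact count
SELECT
    COUNT(DISTINCT user_id)            AS exact_count,
    approx_count_distinct(user_id, 8)  AS approx_p8,   -- least memory, largest error
    approx_count_distinct(user_id, 12) AS approx_p12,  -- default precision
    approx_count_distinct(user_id, 16) AS approx_p16   -- most memory, smallest error
FROM website_logs
WHERE date = CURRENT_DATE;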

Best Practices:

  • Use precision 12-14 for most cases
  • Precision 16+ only for critical accuracy requirements
  • Perfect for: unique-visitor (UV) counting, cardinality estimation, dashboard metrics

approx_percentile

Calculate percentiles efficiently using the T-Digest algorithm.

Syntax:

approx_percentile(expression, percentile [, compression])

Parameters:

Parameter    Type     Range     Default   Description
expression   Numeric  -         Required  Column with numeric values
percentile   Float    0.0-1.0   Required  Percentile to calculate (0.5 = median)
compression  Integer  10-10000  100       Higher = more accurate, more memory

Accuracy: T-Digest is most accurate near the tails (e.g., p5, p95, p99); middle percentiles such as the median (p50) are less tightly bounded but typically show under 1% relative error. A compression comparison you can run on your own data appears after the examples below.

Examples:

-- Response time analysis
SELECT 
    approx_percentile(response_time_ms, 0.50) AS p50_response_time,
    approx_percentile(response_time_ms, 0.90) AS p90_response_time,
    approx_percentile(response_time_ms, 0.95) AS p95_response_time,
    approx_percentile(response_time_ms, 0.99) AS p99_response_time
FROM api_requests
WHERE date >= '2024-01-01';
 
-- High-precision analysis for SLA monitoring
SELECT 
    service_name,
    approx_percentile(response_time_ms, 0.95, 500) AS p95_response_time
FROM service_metrics
WHERE timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY service_name;
 
-- Compare multiple time periods
SELECT 
    'Last Hour' AS period,
    approx_percentile(price, 0.5) AS median_price
FROM orders 
WHERE created_at >= NOW() - INTERVAL '1 hour'
UNION ALL
SELECT 
    'Last Day',
    approx_percentile(price, 0.5)
FROM orders 
WHERE created_at >= NOW() - INTERVAL '1 day';
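 
As with precision for approx_count_distinct, the compression argument is worth tuning empirically: higher values follow the distribution more closely at the cost of memory. A minimal sketch comparing the same percentile at several compression levels, reusing the api_requests table from the first example (the values returned will depend on your data):
 
-- Compare a tail percentile at several compression levels
SELECT
    approx_percentile(response_time_ms, 0.99)       AS p99_default,  -- compression 100
    approx_percentile(response_time_ms, 0.99, 500)  AS p99_c500,
    approx_percentile(response_time_ms, 0.99, 1000) AS p99_c1000
FROM api_requests
WHERE date >= '2024-01-01';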

Best Practices:

  • Use compression 100-200 for most cases
  • Higher compression (500+) for critical SLA monitoring
  • Perfect for: Latency analysis, price distributions, performance monitoring

Real-World Examples

Real-Time Dashboard:

-- Analytics dashboard that updates every minute
SELECT 
    DATE_TRUNC('hour', timestamp) AS hour,
    approx_count_distinct(user_id) AS unique_users,
    approx_percentile(session_duration, 0.5) AS median_session,
    approx_percentile(page_load_time, 0.95) AS p95_load_time
FROM user_sessions
WHERE timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour;

Performance Monitoring:

-- Service performance analysis
WITH service_stats AS (
    SELECT 
        service_name,
        approx_count_distinct(request_id) AS total_requests,
        approx_percentile(duration_ms, 0.50) AS p50_latency,
        approx_percentile(duration_ms, 0.95) AS p95_latency,
        COUNT(*) FILTER (WHERE status_code >= 400) AS error_count
    FROM service_logs
    WHERE timestamp >= NOW() - INTERVAL '1 hour'
    GROUP BY service_name
)
SELECT 
    service_name,
    total_requests,
    p50_latency || 'ms' AS median_latency,
    p95_latency || 'ms' AS p95_latency,
    ROUND(error_count::float / total_requests * 100, 2) || '%' AS error_rate
FROM service_stats
ORDER BY p95_latency DESC;

User Behavior Analysis:

-- Weekly user engagement report
SELECT 
    DATE_TRUNC('week', event_date) AS week,
    approx_count_distinct(user_id) AS weekly_active_users,
    approx_count_distinct(user_id) FILTER (
        WHERE event_type = 'login'
    ) AS users_who_logged_in,
    approx_percentile(session_count, 0.5) AS median_sessions_per_user
FROM user_events
WHERE event_date >= NOW() - INTERVAL '8 weeks'
GROUP BY week
ORDER BY week;

Important Considerations

⚠️ When NOT to Use

  • Financial calculations - Exact precision required for money
  • Compliance reporting - Regulatory requirements for exact counts
  • Uniqueness constraints - Primary key validation, deduplication
  • Small datasets - For fewer than ~100K rows, exact calculations are already fast and approximation saves little

🎯 Accuracy Expectations

  • Repeated queries: May yield slightly different results (±2%)
  • Extreme distributions: Less accurate with highly skewed data
  • Edge cases: Very small or very large percentiles less reliable

💡 Best Practices

  1. Test accuracy on your data before production use
  2. Document usage so team understands approximate nature
  3. Monitor results - compare occasionally with exact calculations (see the spot-check sketch after this list)
  4. Choose precision based on accuracy vs. performance needs
  5. Use for trends rather than exact business decisions
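 
One way to run that spot check is to compute the relative error of the estimate against an exact count on a bounded slice of data. A minimal sketch, reusing the website_logs table from the examples above (what counts as an acceptable error is your choice, not a property of the function):
 
-- Spot-check: relative error of the approximate count on one day of data
SELECT
    COUNT(DISTINCT user_id)        AS exact_count,
    approx_count_distinct(user_id) AS approx_count,
    ABS(approx_count_distinct(user_id) - COUNT(DISTINCT user_id))
        / COUNT(DISTINCT user_id)::float * 100 AS relative_error_pct
FROM website_logs
WHERE date = CURRENT_DATE;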
