Incremental Materialized View Practices
Overview
Incremental materialized views are a special type of materialized view in TacNode that improve query performance by refreshing only changed data rather than performing full recalculation. Compared to full materialized views, incremental materialized views can significantly enhance query performance in scenarios where base table data volumes are large but changes are relatively small.
Applicability Assessment
Before using incremental materialized views, you need to assess query applicability. This can be verified through the following method:
Let the materialized view refreshed at time T1 be: MV@T1 = Query@T1
Let the materialized view refreshed at time T2 be: MV@T2 = Query@T2
If the content difference between MV@T1 and MV@T2 is small, incremental refresh can bring significant performance improvements. Conversely, if the contents are completely different, incremental refresh is not suitable.
Non-Applicable Example
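For illustration, a hypothetical definition of this kind (the `orders` table and its columns are assumptions):

```sql
-- Hypothetical example: orders(order_id, amount, created_at) is an assumed table.
-- NOW() makes the result depend on the refresh moment, so each refresh differs.
SELECT order_id, amount
FROM orders
WHERE created_at >= NOW() - INTERVAL '5 minutes';
```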
This query produces different results on every refresh, so it cannot effectively benefit from incremental computation.
Incremental Materialized View Quick Reference
| Category | Best Practices | Avoid |
|---|---|---|
| WHERE Conditions | Use fixed time ranges; add indexes | Using `NOW()`, `CURRENT_TIMESTAMP`, and other non-deterministic functions |
| GROUP BY | Materialize all group results; filter afterward | Using `HAVING` to filter before materialization |
| Count Distinct | Use the `FILTER` clause for unified calculation | Multiple `CASE WHEN` expressions for scattered calculation |
| JOIN Operations | Use `INNER JOIN`; ensure equality conditions; create indexes | `OUTER JOIN`; type conversion; missing indexes |
| Over Window | Place at the outermost level of the query; persist the complete result | Nested usage; filtering the window result |
| Complex Queries | Split into layered materialized views | Single complex query statements |
| Function Usage | Deterministic functions (`SUM`, `COUNT`, etc.) | Non-deterministic functions (`RANDOM`, `CURRENT_USER`, etc.) |
Design and Implementation Best Practices
1. WHERE Clause Optimization
Problem Scenario:
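An illustrative problem case, assuming an `events` table with a `ts` timestamp column (matching the field referenced in the solutions below):

```sql
-- Assumed table: events(ts, user_id, ...).
-- The dynamic NOW()-based window is non-deterministic, so it cannot be
-- maintained incrementally and forces broad rescans of events on each refresh.
CREATE MATERIALIZED VIEW recent_events AS
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE ts >= NOW() - INTERVAL '1 day'
GROUP BY user_id;
```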
Solutions:
- Columnar storage tables: Utilize adaptive scan concurrency to optimize filtering performance
- Row storage tables: Create secondary indexes on the `ts` field
- Consider using fixed time ranges instead of dynamic time calculations (see the sketch below)
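A minimal sketch of these solutions under the same assumed `events` table:

```sql
-- Fixed boundaries keep the definition deterministic; an index on ts speeds
-- up refresh-time filtering for row-store tables.
CREATE INDEX IF NOT EXISTS idx_events_ts ON events (ts);

CREATE MATERIALIZED VIEW events_2024h1 AS
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE ts >= DATE '2024-01-01' AND ts < DATE '2024-07-01'
GROUP BY user_id;
```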
2. GROUP BY Clause Optimization
Basic Principle: `GROUP BY` statements need to persist all calculation results.
Not Recommended:
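For example (the `sales` table is an assumption), filtering with `HAVING` inside the definition persists only part of the group results:

```sql
-- Assumed table: sales(region, amount). HAVING discards groups inside the
-- view definition, so the materialized result is incomplete for reuse.
CREATE MATERIALIZED VIEW big_regions AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region
HAVING SUM(amount) > 100000;
```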
Recommended Approach:
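The recommended shape, under the same assumed table, materializes every group and filters at query time:

```sql
-- Materialize all group results in the view.
CREATE MATERIALIZED VIEW region_totals AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;

-- Apply the threshold when querying the view instead.
SELECT region, total_amount
FROM region_totals
WHERE total_amount > 100000;
```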
3. Multi-Dimensional Count Distinct Optimization
Inefficient Implementation:
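An illustrative anti-pattern, assuming a `page_views(user_id, platform, view_date)` table:

```sql
-- Each COUNT(DISTINCT CASE ...) is computed as a separate, scattered distinct.
SELECT view_date,
       COUNT(DISTINCT CASE WHEN platform = 'ios'     THEN user_id END) AS ios_users,
       COUNT(DISTINCT CASE WHEN platform = 'android' THEN user_id END) AS android_users,
       COUNT(DISTINCT CASE WHEN platform = 'web'     THEN user_id END) AS web_users
FROM page_views
GROUP BY view_date;
```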
Efficient Implementation:
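The same metrics expressed with `FILTER`, so the distinct counts are calculated in a unified way:

```sql
-- FILTER keeps the distinct counts in one uniform aggregation pass.
SELECT view_date,
       COUNT(DISTINCT user_id) FILTER (WHERE platform = 'ios')     AS ios_users,
       COUNT(DISTINCT user_id) FILTER (WHERE platform = 'android') AS android_users,
       COUNT(DISTINCT user_id) FILTER (WHERE platform = 'web')     AS web_users
FROM page_views
GROUP BY view_date;
```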
4. JOIN Operation Optimization
Optimization Strategies:
- Prioritize using `INNER JOIN`
- Ensure each `JOIN` contains at least one equality condition
- Keep data types consistent on both sides of equality conditions
- Create secondary indexes on large-table `JOIN` keys
Example:
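A minimal sketch following these strategies, assuming `orders(order_id, user_id, amount)` and `users(user_id, region)` tables with matching key types:

```sql
-- Index the large table's join key to speed up refresh-time lookups.
CREATE INDEX IF NOT EXISTS idx_orders_user_id ON orders (user_id);

-- INNER JOIN with an equality condition on keys of the same type.
CREATE MATERIALIZED VIEW order_regions AS
SELECT o.order_id, o.amount, u.region
FROM orders o
INNER JOIN users u ON o.user_id = u.user_id;
```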
5. Over Window Handling
Supported Pattern:
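For example (same assumed `orders` table), the window function sits at the outermost level of the definition and its complete result is persisted:

```sql
-- Window function at the outermost level; all rows are kept.
CREATE MATERIALIZED VIEW order_ranks AS
SELECT order_id,
       user_id,
       amount,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS order_seq
FROM orders;
```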
Unsupported Pattern:
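By contrast, nesting the window function and filtering its result (a common top-1-per-group shape) persists only part of the window output:

```sql
-- Filtering on the window result means the window output is not fully
-- persisted, which is the pattern to avoid for incremental materialization.
CREATE MATERIALIZED VIEW latest_orders AS
SELECT order_id, user_id, amount
FROM (
    SELECT order_id, user_id, amount,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
    FROM orders
) t
WHERE rn = 1;
```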
6. Architecture Design Patterns
Architecture Design Pattern Quick Reference
| Pattern | Use Cases | Advantages | Disadvantages |
|---|---|---|---|
| CTE Materialization | Intermediate result reuse; complex logical queries | Clear logic; intermediate results materialized automatically | Intermediate results cannot be queried directly |
| Layered Materialization | Multiple applications reusing base data | Reusable; supports indexes; flexible queries | More materialized views to maintain |
| Subquery | Simple logic; no materialization needed | Concise syntax | Poor performance for complex queries |
Pattern One: CTE Materialization
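A sketch of this pattern under the same assumed tables: intermediate results are expressed as CTEs inside a single materialized view definition, so they are materialized along with the view but cannot be queried on their own:

```sql
CREATE MATERIALIZED VIEW daily_region_sales AS
WITH order_details AS (
    -- Intermediate joined detail, reusable within this definition only.
    SELECT o.order_id, o.amount, o.created_at::date AS order_date, u.region
    FROM orders o
    INNER JOIN users u ON o.user_id = u.user_id
)
SELECT order_date, region, SUM(amount) AS total_amount, COUNT(*) AS order_count
FROM order_details
GROUP BY order_date, region;
```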
Pattern Two: Layered Materialized Views
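A sketch of the layered pattern under the same assumptions: a base materialized view holds reusable joined detail (and can carry its own indexes), and an upper materialized view aggregates on top of it:

```sql
-- Base layer: reusable joined detail.
CREATE MATERIALIZED VIEW mv_order_details AS
SELECT o.order_id, o.user_id, o.amount, o.created_at::date AS order_date, u.region
FROM orders o
INNER JOIN users u ON o.user_id = u.user_id;

-- Upper layer: aggregation built on the base materialized view.
CREATE MATERIALIZED VIEW mv_daily_region_sales AS
SELECT order_date, region, SUM(amount) AS total_amount
FROM mv_order_details
GROUP BY order_date, region;
```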
Materialized View Practice Strategies
1. Evaluating Whether to Use Materialized Views
Materialized views generally provide benefits when the following characteristics are met:
- High computational overhead query patterns: Common high-overhead computation patterns include multi-table joins, aggregations, window functions, etc. Complex functions (like regex matching) and UDFs that need to access external resources can also be considered high computational overhead.
- Frequently occurring query patterns: If a query pattern is expensive but runs only occasionally, the maintenance cost of a materialized view usually outweighs the benefit. If queries are frequent enough that the accumulated savings exceed that cost, a materialized view is worth considering.
If a query pattern satisfies both characteristics above, you can consider creating materialized views for it.
2. Reasonable Materialized View Definition
We recommend that users comprehensively consider query computational overhead and query frequency to extract appropriate subqueries as materialized view definitions. This allows the subquery to cover as many queries as possible. Common scenarios include:
- Multi-table joins plus various ad hoc aggregation calculations: In general data warehouse analysis, fact tables and dimension tables often need to be joined in advance, and analysis is then performed on the joined table. If the multi-table join rules are basically unchanged but the subsequent analysis is very flexible, we can define the multi-table join as a materialized view and perform ad hoc analysis directly on it.
- Complex aggregations followed by point queries: Some scenarios have fixed multi-table join and aggregation rules but ultimately require extensive point queries on the aggregated results. For example, in e-commerce, aggregated result tables for merchant and buyer dimensions are generally prepared in advance so that merchants and buyers can access their own results. We recommend creating materialized views for the aggregated result tables rather than creating a materialized view for each possible point query.
3. Choosing Appropriate Storage
Since materialized views ultimately exist as physical tables, table-level configuration methods are also applicable here. Common ones include:
- Define names for each column of the materialized view: If column names are not specified when creating a materialized view, the engine infers and generates them automatically. However, with subsequent tuning of the materialized view in mind, we recommend specifying each column name manually.
- Choose a reasonable Access Method based on how the materialized view will ultimately be accessed: For example, if the materialized view stores pre-aggregated results and subsequent queries are mainly point queries, consider row storage (`row`). If subsequent operations require extensive full-table scans of the materialized view for further calculations, consider columnar storage (`columnar`). If a mix of access patterns is needed, consider hybrid storage (`hybrid`).
- Create appropriate indexes: Common examples include secondary indexes for accelerating point queries and split indexes for accelerating columnar filtering, all of which can be created on materialized views. You also need to balance the additional write overhead that more indexes bring. A minimal example follows this list.
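A minimal sketch of these table-level configurations using standard PostgreSQL-compatible syntax (the `mv_user_totals` name and `orders` table are assumptions; the row/columnar/hybrid access-method clause is product-specific and omitted here):

```sql
-- Explicit column names make later tuning and index definitions clearer.
CREATE MATERIALIZED VIEW mv_user_totals (user_id, order_count, total_amount) AS
SELECT user_id, COUNT(*), SUM(amount)
FROM orders
GROUP BY user_id;

-- Secondary index to accelerate point queries against the materialized view.
CREATE INDEX idx_mv_user_totals_user_id ON mv_user_totals (user_id);
```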
4. Choosing Appropriate Refresh Strategies
Choosing appropriate refresh strategies requires comprehensive consideration of the following factors:
- Base table modification frequency
- Timeliness requirements for materialized views
- Complexity of queries to be materialized
First, let's look at relatively rigid conditions. If:
- Base tables are basically unchanged or modified very infrequently
- Low timeliness requirements (hourly or daily level is sufficient)
- Query complexity is too high to support incremental computation
If any of the above conditions is met, we can use full materialized views.
If original data is frequently modified and we have certain timeliness requirements for results, we can consider using incremental or real-time materialized views. In this case, we generally choose from the timeliness perspective:
- Real-time materialized views: Generally used for business scenarios with particularly high timeliness requirements (sub-second level).
- Incremental materialized views: For scenarios requiring second-level and minute-level delays.
Automated Operations Recommendations
- Utilize scheduling systems (like Airflow, crontab) to refresh materialized views regularly, executing during business off-peak periods.
- Monitor refresh duration, failure rates, delays and other metrics with timely alerts.
- Use system tables (like `pg_matviews`, `pg_depend`) to automatically analyze dependency relationships, avoiding refresh failures due to base table changes (see the sketch below).
- Regularly clean up unused materialized views to save storage and maintenance costs.
Practice Cases
Case 1: User Growth Analysis
Suppose we need to evaluate the effectiveness of marketing campaigns or advertising, where new user count is a good metric. To determine whether a user is new, we can check whether they had any activity on our website in the past 30 days. We need queries like this:
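A sketch of such a query, assuming a hypothetical `user_activities(user_id, activity_time)` table:

```sql
-- Count today's users who had no activity in the preceding 30 days.
SELECT COUNT(DISTINCT a.user_id) AS new_users
FROM user_activities a
WHERE a.activity_time >= CURRENT_DATE
  AND NOT EXISTS (
    SELECT 1
    FROM user_activities b
    WHERE b.user_id = a.user_id
      AND b.activity_time >= CURRENT_DATE - INTERVAL '30 days'
      AND b.activity_time <  CURRENT_DATE
  );
```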
We notice that "counting all users who appeared in the past 30 days" is computationally intensive, and the results are basically unchanged (assuming no late-arriving data in `user_activities`). If new-user analysis queries occur frequently, we can store the "all users who appeared in the past 30 days" portion in a materialized view. Since this query only needs to run once daily, we can choose a full materialized view and refresh the results completely each morning.
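A minimal sketch of that materialization under the same assumed table; the dynamic date is acceptable here because the view is fully refreshed each morning rather than maintained incrementally:

```sql
-- Materialize the expensive "all users seen in the past 30 days" portion.
CREATE MATERIALIZED VIEW recent_active_users AS
SELECT DISTINCT user_id
FROM user_activities
WHERE activity_time >= CURRENT_DATE - INTERVAL '30 days';

-- Refreshed completely once per day, e.g. from a scheduler.
REFRESH MATERIALIZED VIEW recent_active_users;
```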
Similar approaches can be extended to user retention analysis, new user growth analysis by different channels, etc.
Case 2: Data Warehouse Layering
In typical data warehouse architecture, we generally adopt layering strategies, such as:
- ODS (Operational Data Store): Also called the source layer, storing the most original unprocessed data, serving as the source for subsequent data processing.
- DWD (Data Warehouse Detail): Mainly performs cleaning/filtering/normalization processing on ODS layer data, often involving wide table processing, etc.
- DWS / ADS (Data Warehouse Service): Based on DWD layer data, performs various degrees of aggregation operations.
The main difference between offline and real-time data warehouses lies in the timeliness of each layer, that is, how long it takes for changes in the original data to be reflected in that layer's content. From a modeling perspective, except for the ODS layer, data in other layers can be seen as materialized views in some sense, refreshed periodically according to data dependency relationships.
Combining TacNode's powerful storage and query capabilities, we can build an integrated offline and real-time data warehouse architecture:
- ODS:
  - Data is written in real time from external sources or synchronized into TacNode by real-time tools
  - Some very flexible business scenarios (like data analysts validating ideas) can query ODS data directly
- DWD:
  - Based on different timeliness requirements, build materialized views on ODS layer data. Through system tables like `pg_depend` and `pg_class`, dependency relationships between materialized views and tables can be analyzed automatically
    - Hourly/daily timeliness: Build full materialized views, refreshed completely on a schedule according to dependency relationships
    - Minute-level timeliness: Build incremental materialized views, supporting common filtering, multi-table joins, etc., refreshed incrementally on a schedule according to dependency relationships
- DWS/ADS:
  - Based on different business scenario needs, build materialized views on DWD and ODS layer data
    - Relatively fixed query patterns: Build lightly aggregated materialized views for direct secondary aggregation analysis or point queries
    - Highly flexible queries: May not use materialized views at all, querying ODS and DWD layer data directly
Through this architecture, we don't need separate complete data architectures for offline and real-time scenarios. We only need to create reasonable tables and materialized views in TacNode based on timeliness and query flexibility requirements, simultaneously satisfying offline and real-time needs while greatly reducing costs and improving usability.
Case 3: Advertising Billing
In advertising scenarios, when the delivery engine decides to display an advertisement, it generally first checks whether the advertiser's balance is sufficient. In the pay-per-click model, each user click triggers billing for that advertiser. Since user clicks happen constantly, we need real-time statistics on each advertiser's spending:
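For example, assuming the `user_clicks` table mentioned below has `advertiser_id` and `cost` columns, the statistic looks like:

```sql
-- Per-advertiser total spend, accumulated from individual clicks.
SELECT advertiser_id, SUM(cost) AS total_spend
FROM user_clicks
GROUP BY advertiser_id;
```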
If an advertiser's total spending exceeds their balance, their ads are stopped. If these spending statistics are delayed, over-delivery can occur and cause financial losses.
Currently, a common approach is to use stream computing engines for these real-time statistics, but short delays are unavoidable, and unexpected system restarts can cause uncontrollable delays. Instead, we can build real-time materialized views on `user_clicks`: when user click behavior is written to the `user_clicks` table, the corresponding materialized views are updated at the same time, achieving truly zero delay.
Case 4: Financial Risk Control
In the financial industry, real-time risk control is an important link in ensuring business security. Through incremental materialized views, real-time monitoring and risk identification of user transaction behavior can be achieved.
Business Scenario: Banks or payment platforms need to monitor user transaction behavior in real-time, identifying abnormal transaction patterns to prevent fraudulent behavior. For example, monitoring metrics like user transaction frequency and transaction amount distribution by hour and by day. When these metrics exceed preset thresholds, risk control strategies are triggered.
Technical Implementation:
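A minimal sketch under assumed names (`transactions(user_id, amount, created_at)`); the incremental/real-time refresh clause is product-specific, so plain materialized view definitions are shown:

```sql
-- Base layer: per-user, per-hour transaction statistics.
CREATE MATERIALIZED VIEW mv_user_txn_hourly AS
SELECT user_id,
       date_trunc('hour', created_at) AS stat_hour,
       COUNT(*)    AS txn_count,
       SUM(amount) AS txn_amount,
       MAX(amount) AS max_single_amount
FROM transactions
GROUP BY user_id, date_trunc('hour', created_at);

-- Advanced layer: daily roll-up built on the hourly layer; risk checks compare
-- these figures against preset thresholds instead of scanning raw transactions.
CREATE MATERIALIZED VIEW mv_user_txn_daily AS
SELECT user_id,
       stat_hour::date AS stat_day,
       SUM(txn_count)         AS txn_count,
       SUM(txn_amount)        AS txn_amount,
       MAX(max_single_amount) AS max_single_amount
FROM mv_user_txn_hourly
GROUP BY user_id, stat_hour::date;
```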
Advantages:
- Layered aggregation: Through base and advanced materialized view layered design, improves query efficiency
- Multi-time granularity: Supports statistical analysis by different time granularities like hourly and daily
- Real-time nature: Incremental materialized views can reflect users' latest transaction behavior in real-time
- Performance: Avoids full table scans and real-time calculations for each risk control check
Application Scenarios:
- Abnormal high-frequency transaction detection (like large numbers of small transactions in short time)
- Large transaction monitoring (single or cumulative large transactions)
- Transaction pattern anomaly identification (significant deviation from historical patterns)
- Blacklist merchant transaction monitoring
Case 5: IoT Device Monitoring
In Internet of Things (IoT) scenarios, it's typically necessary to process sensor data from millions of devices. Through incremental materialized views, data aggregation and real-time monitoring can be performed efficiently.
Business Scenario: In industrial IoT, real-time monitoring of various device operating states is needed, including temperature, pressure, vibration and other sensor data. Through real-time aggregation analysis of this data, device anomalies can be discovered promptly to prevent equipment failures.
Technical Implementation:
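A minimal sketch under assumed names (`sensor_readings(device_id, device_type, metric, value, reading_time)` with an assumed anomaly threshold); again the refresh-mode clause is omitted as product-specific:

```sql
-- Base layer: hourly aggregates per device and metric.
CREATE MATERIALIZED VIEW mv_device_metrics_hourly AS
SELECT device_id,
       device_type,
       metric,
       date_trunc('hour', reading_time) AS stat_hour,
       AVG(value) AS avg_value,
       MAX(value) AS max_value,
       COUNT(*) FILTER (WHERE value > 100) AS over_threshold_count,  -- threshold is an assumption
       COUNT(*) AS reading_count
FROM sensor_readings
GROUP BY device_id, device_type, metric, date_trunc('hour', reading_time);

-- Advanced layer: a simple per-device-type health metric (anomaly rate).
CREATE MATERIALIZED VIEW mv_device_health_hourly AS
SELECT device_type,
       stat_hour,
       SUM(over_threshold_count)::numeric / NULLIF(SUM(reading_count), 0) AS anomaly_rate
FROM mv_device_metrics_hourly
GROUP BY device_type, stat_hour;
```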
Advantages:
- Layered processing: Base materialized view processes raw data, advanced materialized view calculates health metrics
- Multi-dimensional analysis: Supports statistics by device type, time granularity and other dimensions
- Efficient processing: Can handle massive sensor data from millions of devices
- Anomaly detection: Quickly identifies abnormal devices, supporting preventive maintenance
Application Scenarios:
- Device health status monitoring
- Anomaly rate statistics and trend analysis
- Preventive maintenance decision support
- Device performance baseline comparison
Case 6: Log Analysis
In large-scale distributed systems, log analysis is an important means of ensuring system stability and security. Through incremental materialized views, real-time aggregation and analysis of log data can be achieved.
Business Scenario: Internet companies need to monitor system logs in real-time, counting various errors, warnings, access volumes and other metrics to promptly discover system problems and drive automated operations and security responses.
Technical Implementation:
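A minimal sketch under assumed names (`app_logs(service, level, log_time)`); the refresh-mode clause is omitted as product-specific:

```sql
-- Base layer: hourly counts of errors, warnings and total events per service.
CREATE MATERIALIZED VIEW mv_log_stats_hourly AS
SELECT service,
       date_trunc('hour', log_time) AS stat_hour,
       COUNT(*)                                AS total_events,
       COUNT(*) FILTER (WHERE level = 'ERROR') AS error_count,
       COUNT(*) FILTER (WHERE level = 'WARN')  AS warn_count
FROM app_logs
GROUP BY service, date_trunc('hour', log_time);

-- Advanced layer: daily roll-up with a simple error-rate health metric.
CREATE MATERIALIZED VIEW mv_log_stats_daily AS
SELECT service,
       stat_hour::date AS stat_day,
       SUM(total_events) AS total_events,
       SUM(error_count)  AS error_count,
       SUM(error_count)::numeric / NULLIF(SUM(total_events), 0) AS error_rate
FROM mv_log_stats_hourly
GROUP BY service, stat_hour::date;
```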
Advantages:
- Layered aggregation: Base materialized view processes raw logs, advanced materialized view calculates health metrics
- Multi-time granularity: Supports statistical analysis by hourly, daily and other time granularities
- Real-time monitoring: Reflects system latest status and problems in real-time
- Performance optimization: Avoids scanning large amounts of log data for each analysis
Application Scenarios:
- System health monitoring
- Real-time error and warning statistics
- User access volume analysis
- Security event detection (like abnormal logins, malicious requests, etc.)
- Performance bottleneck identification
Common Issues and Troubleshooting
- Refresh failures: Check base table/view permissions, whether dependent objects exist, whether SQL syntax is compatible with incremental materialization.
- Data inconsistency: Pay attention to base table delays, late-arriving data, concurrent writes and other scenarios. Use full refresh when necessary.
- Performance bottlenecks: Analyze materialized view definition SQL complexity, index coverage, whether storage method (row/column/hybrid) is reasonable.
- Dependency changes: After base table structure changes, materialized view definitions need synchronous adjustment to avoid refresh anomalies.
Summary
Materialized views are an effective means of improving query performance, but need to follow specific design principles:
- Ensure query result set change rates are low
- Avoid using non-deterministic functions
- Ensure completeness of `GROUP BY` and `Over Window` results
- Optimize `JOIN` conditions and index strategies
- Adopt appropriate architecture layering solutions
Following these best practices will help you build efficient and maintainable materialized view systems that significantly improve your data warehouse query performance.