Incremental Materialized View Practices
Overview
Incremental materialized views are a special type of materialized view in TacNode that improve query performance by refreshing only changed data rather than performing full recalculation. Compared to full materialized views, incremental materialized views can significantly enhance query performance in scenarios where base table data volumes are large but changes are relatively small.
Applicability Assessment
Before using incremental materialized views, you need to assess query applicability. This can be verified through the following method:
Let the materialized view refreshed at time T1 be: MV@T1 = Query@T1
Let the materialized view refreshed at time T2 be: MV@T2 = Query@T2
If the content difference between MV@T1 and MV@T2 is small, incremental refresh can bring significant performance improvements. Conversely, if the contents are completely different, incremental refresh is not suitable.
Non-Applicable Example
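For illustration, a hypothetical definition of this kind (the `orders` table and its columns are assumptions):

```sql
-- Hypothetical example: orders(order_id, amount, created_at) is an assumed table.
-- NOW() makes the result depend on the refresh moment, so each refresh differs.
SELECT order_id, amount
FROM orders
WHERE created_at >= NOW() - INTERVAL '5 minutes';
```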
This query produces different results on every refresh, so it cannot effectively benefit from incremental computation.
Incremental Materialized View Quick Reference
| Category | Best Practices | Avoid |
|---|---|---|
| WHERE Conditions | Use fixed time ranges; add indexes | Using `NOW()`, `CURRENT_TIMESTAMP`, and other non-deterministic functions |
| GROUP BY | Materialize all group results; filter afterward | Using `HAVING` to filter before materialization |
| Count Distinct | Use the `FILTER` clause for unified calculation | Multiple `CASE WHEN` expressions for scattered calculation |
| JOIN Operations | Use `INNER JOIN`; ensure equality conditions; create indexes | `OUTER JOIN`; type conversion; missing indexes |
| Over Window | Place at the outermost level of the query; persist the complete result | Nested usage; filtering the window result |
| Complex Queries | Split into layered materialized views | Single complex query statements |
| Function Usage | Deterministic functions (`SUM`, `COUNT`, etc.) | Non-deterministic functions (`RANDOM`, `CURRENT_USER`, etc.) |
Design and Implementation Best Practices
1. WHERE Clause Optimization
Problem Scenario:
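An illustrative problem case, assuming an `events` table with a `ts` timestamp column (matching the field referenced in the solutions below):

```sql
-- Assumed table: events(ts, user_id, ...).
-- The dynamic NOW()-based window is non-deterministic, so it cannot be
-- maintained incrementally and forces broad rescans of events on each refresh.
CREATE MATERIALIZED VIEW recent_events AS
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE ts >= NOW() - INTERVAL '1 day'
GROUP BY user_id;
```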
Solutions:
- Columnar storage tables: Utilize adaptive scan concurrency to optimize filtering performance
- Row storage tables: Create secondary indexes on the `ts` field
- Consider using fixed time ranges instead of dynamic time calculations (see the sketch below)
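A minimal sketch of these solutions under the same assumed `events` table:

```sql
-- Fixed boundaries keep the definition deterministic; an index on ts speeds
-- up refresh-time filtering for row-store tables.
CREATE INDEX IF NOT EXISTS idx_events_ts ON events (ts);

CREATE MATERIALIZED VIEW events_2024h1 AS
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE ts >= DATE '2024-01-01' AND ts < DATE '2024-07-01'
GROUP BY user_id;
```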
2. GROUP BY Clause Optimization
Basic Principle: `GROUP BY` statements need to persist all calculation results.
Not Recommended:
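For example (the `sales` table is an assumption), filtering with `HAVING` inside the definition persists only part of the group results:

```sql
-- Assumed table: sales(region, amount). HAVING discards groups inside the
-- view definition, so the materialized result is incomplete for reuse.
CREATE MATERIALIZED VIEW big_regions AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region
HAVING SUM(amount) > 100000;
```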
Recommended Approach:
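The recommended shape, under the same assumed table, materializes every group and filters at query time:

```sql
-- Materialize all group results in the view.
CREATE MATERIALIZED VIEW region_totals AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;

-- Apply the threshold when querying the view instead.
SELECT region, total_amount
FROM region_totals
WHERE total_amount > 100000;
```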
3. Multi-Dimensional Count Distinct Optimization
Inefficient Implementation:
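An illustrative anti-pattern, assuming a `page_views(user_id, platform, view_date)` table:

```sql
-- Each COUNT(DISTINCT CASE ...) is computed as a separate, scattered distinct.
SELECT view_date,
       COUNT(DISTINCT CASE WHEN platform = 'ios'     THEN user_id END) AS ios_users,
       COUNT(DISTINCT CASE WHEN platform = 'android' THEN user_id END) AS android_users,
       COUNT(DISTINCT CASE WHEN platform = 'web'     THEN user_id END) AS web_users
FROM page_views
GROUP BY view_date;
```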
Efficient Implementation:
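The same metrics expressed with `FILTER`, so the distinct counts are calculated in a unified way:

```sql
-- FILTER keeps the distinct counts in one uniform aggregation pass.
SELECT view_date,
       COUNT(DISTINCT user_id) FILTER (WHERE platform = 'ios')     AS ios_users,
       COUNT(DISTINCT user_id) FILTER (WHERE platform = 'android') AS android_users,
       COUNT(DISTINCT user_id) FILTER (WHERE platform = 'web')     AS web_users
FROM page_views
GROUP BY view_date;
```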
4. JOIN Operation Optimization
Optimization Strategies:
- Prioritize using `INNER JOIN`
- Ensure each `JOIN` contains at least one equality condition
- Keep data types consistent on both sides of equality conditions
- Create secondary indexes on large-table `JOIN` keys
Example:
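A minimal sketch following these strategies, assuming `orders(order_id, user_id, amount)` and `users(user_id, region)` tables with matching key types:

```sql
-- Index the large table's join key to speed up refresh-time lookups.
CREATE INDEX IF NOT EXISTS idx_orders_user_id ON orders (user_id);

-- INNER JOIN with an equality condition on keys of the same type.
CREATE MATERIALIZED VIEW order_regions AS
SELECT o.order_id, o.amount, u.region
FROM orders o
INNER JOIN users u ON o.user_id = u.user_id;
```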
5. Over Window Handling
Supported Pattern:
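For example (same assumed `orders` table), the window function sits at the outermost level of the definition and its complete result is persisted:

```sql
-- Window function at the outermost level; all rows are kept.
CREATE MATERIALIZED VIEW order_ranks AS
SELECT order_id,
       user_id,
       amount,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS order_seq
FROM orders;
```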
Unsupported Pattern:
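By contrast, nesting the window function and filtering its result (a common top-1-per-group shape) persists only part of the window output:

```sql
-- Filtering on the window result means the window output is not fully
-- persisted, which is the pattern to avoid for incremental materialization.
CREATE MATERIALIZED VIEW latest_orders AS
SELECT order_id, user_id, amount
FROM (
    SELECT order_id, user_id, amount,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
    FROM orders
) t
WHERE rn = 1;
```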
6. Architecture Design Patterns
Architecture Design Pattern Quick Reference
| Pattern | Use Cases | Advantages | Disadvantages |
|---|---|---|---|
| CTE Materialization | Intermediate result reuse; complex logical queries | Clear logic; intermediate results materialized automatically | Intermediate results cannot be queried directly |
| Layered Materialization | Multiple applications reusing base data | Reusable; supports indexes; flexible queries | More materialized views to maintain |
| Subquery | Simple logic; no materialization needed | Concise syntax | Poor performance for complex queries |
Pattern One: CTE Materialization
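A sketch of this pattern under the same assumed tables: intermediate results are expressed as CTEs inside a single materialized view definition, so they are materialized along with the view but cannot be queried on their own:

```sql
CREATE MATERIALIZED VIEW daily_region_sales AS
WITH order_details AS (
    -- Intermediate joined detail, reusable within this definition only.
    SELECT o.order_id, o.amount, o.created_at::date AS order_date, u.region
    FROM orders o
    INNER JOIN users u ON o.user_id = u.user_id
)
SELECT order_date, region, SUM(amount) AS total_amount, COUNT(*) AS order_count
FROM order_details
GROUP BY order_date, region;
```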
Pattern Two: Layered Materialized Views
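A sketch of the layered pattern under the same assumptions: a base materialized view holds reusable joined detail (and can carry its own indexes), and an upper materialized view aggregates on top of it:

```sql
-- Base layer: reusable joined detail.
CREATE MATERIALIZED VIEW mv_order_details AS
SELECT o.order_id, o.user_id, o.amount, o.created_at::date AS order_date, u.region
FROM orders o
INNER JOIN users u ON o.user_id = u.user_id;

-- Upper layer: aggregation built on the base materialized view.
CREATE MATERIALIZED VIEW mv_daily_region_sales AS
SELECT order_date, region, SUM(amount) AS total_amount
FROM mv_order_details
GROUP BY order_date, region;
```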
Materialized View Practice Strategies
1. Evaluating Whether to Use Materialized Views
Materialized views generally provide benefits when the following characteristics are met:
- High computational overhead query patterns: Common high-overhead computation patterns include multi-table joins, aggregations, window functions, etc. Complex functions (like regex matching) and UDFs that need to access external resources can also be considered high computational overhead.
- Frequently occurring query patterns: If a query pattern is expensive but runs only occasionally, the maintenance cost of a materialized view usually outweighs the benefit. If queries are frequent enough that the accumulated savings exceed that cost, a materialized view is worth considering.
If a query pattern satisfies both characteristics above, you can consider creating materialized views for it.
2. Reasonable Materialized View Definition
We recommend that users comprehensively consider query computational overhead and query frequency to extract appropriate subqueries as materialized view definitions. This allows the subquery to cover as many queries as possible. Common scenarios include:
- Multi-table joins plus various ad hoc aggregation calculations: In general data warehouse analysis, fact tables and dimension tables often need to be joined in advance, and analysis is then performed on the joined table. If the multi-table join rules are basically unchanged but the subsequent analysis is very flexible, we can define the multi-table join as a materialized view and perform ad hoc analysis directly on it.
- Complex aggregations followed by point queries: Some scenarios have fixed multi-table join and aggregation rules but ultimately require extensive point queries on the aggregated results. For example, in e-commerce, aggregated result tables for merchant and buyer dimensions are generally prepared in advance so that merchants and buyers can access their own results. We recommend creating materialized views for the aggregated result tables rather than creating a materialized view for each possible point query.
3. Choosing Appropriate Storage
Since materialized views ultimately exist as physical tables, table-level configuration methods are also applicable here. Common ones include:
- Define names for each column of the materialized view: If column names are not specified when creating a materialized view, the engine infers and generates them automatically. However, with subsequent tuning of the materialized view in mind, we recommend specifying each column name manually.
- Choose a reasonable Access Method based on how the materialized view will ultimately be accessed: For example, if the materialized view stores pre-aggregated results and subsequent queries are mainly point queries, consider row storage (`row`). If subsequent operations require extensive full-table scans of the materialized view for further calculations, consider columnar storage (`columnar`). If a mix of access patterns is needed, consider hybrid storage (`hybrid`).
- Create appropriate indexes: Common examples include secondary indexes for accelerating point queries and split indexes for accelerating columnar filtering, all of which can be created on materialized views. You also need to balance the additional write overhead that more indexes bring. A minimal example follows this list.
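A minimal sketch of these table-level configurations using standard PostgreSQL-compatible syntax (the `mv_user_totals` name and `orders` table are assumptions; the row/columnar/hybrid access-method clause is product-specific and omitted here):

```sql
-- Explicit column names make later tuning and index definitions clearer.
CREATE MATERIALIZED VIEW mv_user_totals (user_id, order_count, total_amount) AS
SELECT user_id, COUNT(*), SUM(amount)
FROM orders
GROUP BY user_id;

-- Secondary index to accelerate point queries against the materialized view.
CREATE INDEX idx_mv_user_totals_user_id ON mv_user_totals (user_id);
```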
4. Choosing Appropriate Refresh Strategies
Choosing appropriate refresh strategies requires comprehensive consideration of the following factors:
- Base table modification frequency
- Timeliness requirements for materialized views
- Complexity of queries to be materialized
First, let's look at relatively rigid conditions. If:
- Base tables are basically unchanged or modified very infrequently
- Low timeliness requirements (hourly or daily level is sufficient)
- Query complexity is too high to support incremental computation
If any of the above conditions is met, we can use full materialized views.
If original data is frequently modified and we have certain timeliness requirements for results, we can consider using incremental or real-time materialized views. In this case, we generally choose from the timeliness perspective:
- Real-time materialized views: Generally used for business scenarios with particularly high timeliness requirements (sub-second level).
- Incremental materialized views: For scenarios requiring second-level and minute-level delays.
Automated Operations Recommendations
- Utilize scheduling systems (like Airflow, crontab) to refresh materialized views regularly, executing during business off-peak periods.
- Monitor refresh duration, failure rates, delays and other metrics with timely alerts.
- Use system tables (like `pg_matviews`, `pg_depend`) to automatically analyze dependency relationships, avoiding refresh failures due to base table changes (see the sketch below).
- Regularly clean up unused materialized views to save storage and maintenance costs.
Practice Cases
Case 1: User Growth Analysis
Suppose we need to evaluate the effectiveness of marketing campaigns or advertising, where new user count is a good metric. To determine whether a user is new, we can check whether they had any activity on our website in the past 30 days. We need queries like this:
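A sketch of such a query, assuming a hypothetical `user_activities(user_id, activity_time)` table:

```sql
-- Count today's users who had no activity in the preceding 30 days.
SELECT COUNT(DISTINCT a.user_id) AS new_users
FROM user_activities a
WHERE a.activity_time >= CURRENT_DATE
  AND NOT EXISTS (
    SELECT 1
    FROM user_activities b
    WHERE b.user_id = a.user_id
      AND b.activity_time >= CURRENT_DATE - INTERVAL '30 days'
      AND b.activity_time <  CURRENT_DATE
  );
```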
We notice that "counting all users who appeared in the past 30 days" is computationally intensive, and the results are basically unchanged (assuming no late-arriving data in `user_activities`). If new-user analysis queries occur frequently, we can store the "all users who appeared in the past 30 days" portion in a materialized view. Since this query only needs to run once daily, we can choose a full materialized view and refresh the results completely each morning.
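A minimal sketch of that materialization under the same assumed table; the dynamic date is acceptable here because the view is fully refreshed each morning rather than maintained incrementally:

```sql
-- Materialize the expensive "all users seen in the past 30 days" portion.
CREATE MATERIALIZED VIEW recent_active_users AS
SELECT DISTINCT user_id
FROM user_activities
WHERE activity_time >= CURRENT_DATE - INTERVAL '30 days';

-- Refreshed completely once per day, e.g. from a scheduler.
REFRESH MATERIALIZED VIEW recent_active_users;
```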
Similar approaches can be extended to user retention analysis, new user growth analysis by different channels, etc.
Case 2: Data Warehouse Layering
In typical data warehouse architecture, we generally adopt layering strategies, such as:
- ODS (Operational Data Store): Also called the source layer, storing the most original unprocessed data, serving as the source for subsequent data processing.
- DWD (Data Warehouse Detail): Mainly performs cleaning/filtering/normalization processing on ODS layer data, often involving wide table processing, etc.
- DWS / ADS (Data Warehouse Service): Based on DWD layer data, performs various degrees of aggregation operations.
The main difference between offline and real-time data warehouses lies in the timeliness of each layer, that is, how long it takes for changes in the original data to be reflected in that layer's content. From a modeling perspective, except for the ODS layer, data in other layers can be seen as materialized views in some sense, refreshed periodically according to data dependency relationships.
Combining TacNode's powerful storage and query capabilities, we can build an integrated offline and real-time data warehouse architecture:
- ODS:
  - Data is written in real time from external sources or synchronized into TacNode by real-time tools
  - Some very flexible business scenarios (like data analysts validating ideas) can query ODS data directly
- DWD:
  - Based on different timeliness requirements, build materialized views on ODS layer data. Through system tables like `pg_depend` and `pg_class`, dependency relationships between materialized views and tables can be analyzed automatically
    - Hourly/daily timeliness: Build full materialized views, refreshed completely on a schedule according to dependency relationships
    - Minute-level timeliness: Build incremental materialized views, supporting common filtering, multi-table joins, etc., refreshed incrementally on a schedule according to dependency relationships
- DWS/ADS:
  - Based on different business scenario needs, build materialized views on DWD and ODS layer data
    - Relatively fixed query patterns: Build lightly aggregated materialized views for direct secondary aggregation analysis or point queries
    - Highly flexible queries: May not use materialized views at all, querying ODS and DWD layer data directly
Through this architecture, we don't need separate complete data architectures for offline and real-time scenarios. We only need to create reasonable tables and materialized views in TacNode based on timeliness and query flexibility requirements, simultaneously satisfying offline and real-time needs while greatly reducing costs and improving usability.
Case 3: Advertising Billing
In advertising scenarios, when the delivery engine decides to display an advertisement, it generally first checks whether the advertiser's balance is sufficient. In the pay-per-click model, each user click triggers billing for that advertiser. Since user clicks happen constantly, we need real-time statistics on each advertiser's spending:
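For example, assuming the `user_clicks` table mentioned below has `advertiser_id` and `cost` columns, the statistic looks like:

```sql
-- Per-advertiser total spend, accumulated from individual clicks.
SELECT advertiser_id, SUM(cost) AS total_spend
FROM user_clicks
GROUP BY advertiser_id;
```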
If an advertiser's total spending exceeds their balance, their ads are stopped. If these spending statistics are delayed, over-delivery can occur and cause financial losses.
Currently, a common approach is to use stream computing engines for these real-time statistics, but short delays are unavoidable, and unexpected system restarts can cause uncontrollable delays. Instead, we can build real-time materialized views on `user_clicks`: when user click behavior is written to the `user_clicks` table, the corresponding materialized views are updated at the same time, achieving truly zero delay.
Case 4: Financial Risk Control
In the financial industry, real-time risk control is an important link in ensuring business security. Through incremental materialized views, real-time monitoring and risk identification of user transaction behavior can be achieved.
Business Scenario: Banks or payment platforms need to monitor user transaction behavior in real-time, identifying abnormal transaction patterns to prevent fraudulent behavior. For example, monitoring metrics like user transaction frequency and transaction amount distribution by hour and by day. When these metrics exceed preset thresholds, risk control strategies are triggered.
Technical Implementation:
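A minimal sketch under assumed names (`transactions(user_id, amount, created_at)`); the incremental/real-time refresh clause is product-specific, so plain materialized view definitions are shown:

```sql
-- Base layer: per-user, per-hour transaction statistics.
CREATE MATERIALIZED VIEW mv_user_txn_hourly AS
SELECT user_id,
       date_trunc('hour', created_at) AS stat_hour,
       COUNT(*)    AS txn_count,
       SUM(amount) AS txn_amount,
       MAX(amount) AS max_single_amount
FROM transactions
GROUP BY user_id, date_trunc('hour', created_at);

-- Advanced layer: daily roll-up built on the hourly layer; risk checks compare
-- these figures against preset thresholds instead of scanning raw transactions.
CREATE MATERIALIZED VIEW mv_user_txn_daily AS
SELECT user_id,
       stat_hour::date AS stat_day,
       SUM(txn_count)         AS txn_count,
       SUM(txn_amount)        AS txn_amount,
       MAX(max_single_amount) AS max_single_amount
FROM mv_user_txn_hourly
GROUP BY user_id, stat_hour::date;
```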
Advantages:
- Layered aggregation: Through base and advanced materialized view layered design, improves query efficiency
- Multi-time granularity: Supports statistical analysis by different time granularities like hourly and daily
- Real-time nature: Incremental materialized views can reflect users' latest transaction behavior in real-time
- Performance: Avoids full table scans and real-time calculations for each risk control check
Application Scenarios:
- Abnormal high-frequency transaction detection (like large numbers of small transactions in short time)
- Large transaction monitoring (single or cumulative large transactions)
- Transaction pattern anomaly identification (significant deviation from historical patterns)
- Blacklist merchant transaction monitoring
Case 5: IoT Device Monitoring
In Internet of Things (IoT) scenarios, it's typically necessary to process sensor data from millions of devices. Through incremental materialized views, data aggregation and real-time monitoring can be performed efficiently.
Business Scenario: In industrial IoT, real-time monitoring of various device operating states is needed, including temperature, pressure, vibration and other sensor data. Through real-time aggregation analysis of this data, device anomalies can be discovered promptly to prevent equipment failures.
Technical Implementation:
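A minimal sketch under assumed names (`sensor_readings(device_id, device_type, metric, value, reading_time)` with an assumed anomaly threshold); again the refresh-mode clause is omitted as product-specific:

```sql
-- Base layer: hourly aggregates per device and metric.
CREATE MATERIALIZED VIEW mv_device_metrics_hourly AS
SELECT device_id,
       device_type,
       metric,
       date_trunc('hour', reading_time) AS stat_hour,
       AVG(value) AS avg_value,
       MAX(value) AS max_value,
       COUNT(*) FILTER (WHERE value > 100) AS over_threshold_count,  -- threshold is an assumption
       COUNT(*) AS reading_count
FROM sensor_readings
GROUP BY device_id, device_type, metric, date_trunc('hour', reading_time);

-- Advanced layer: a simple per-device-type health metric (anomaly rate).
CREATE MATERIALIZED VIEW mv_device_health_hourly AS
SELECT device_type,
       stat_hour,
       SUM(over_threshold_count)::numeric / NULLIF(SUM(reading_count), 0) AS anomaly_rate
FROM mv_device_metrics_hourly
GROUP BY device_type, stat_hour;
```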
Advantages:
- Layered processing: Base materialized view processes raw data, advanced materialized view calculates health metrics
- Multi-dimensional analysis: Supports statistics by device type, time granularity and other dimensions
- Efficient processing: Can handle massive sensor data from millions of devices
- Anomaly detection: Quickly identifies abnormal devices, supporting preventive maintenance
Application Scenarios:
- Device health status monitoring
- Anomaly rate statistics and trend analysis
- Preventive maintenance decision support
- Device performance baseline comparison
Case 6: Log Analysis
In large-scale distributed systems, log analysis is an important means of ensuring system stability and security. Through incremental materialized views, real-time aggregation and analysis of log data can be achieved.
Business Scenario: Internet companies need to monitor system logs in real-time, counting various errors, warnings, access volumes and other metrics to promptly discover system problems and drive automated operations and security responses.
Technical Implementation:
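A minimal sketch under assumed names (`app_logs(service, level, log_time)`); the refresh-mode clause is omitted as product-specific:

```sql
-- Base layer: hourly counts of errors, warnings and total events per service.
CREATE MATERIALIZED VIEW mv_log_stats_hourly AS
SELECT service,
       date_trunc('hour', log_time) AS stat_hour,
       COUNT(*)                                AS total_events,
       COUNT(*) FILTER (WHERE level = 'ERROR') AS error_count,
       COUNT(*) FILTER (WHERE level = 'WARN')  AS warn_count
FROM app_logs
GROUP BY service, date_trunc('hour', log_time);

-- Advanced layer: daily roll-up with a simple error-rate health metric.
CREATE MATERIALIZED VIEW mv_log_stats_daily AS
SELECT service,
       stat_hour::date AS stat_day,
       SUM(total_events) AS total_events,
       SUM(error_count)  AS error_count,
       SUM(error_count)::numeric / NULLIF(SUM(total_events), 0) AS error_rate
FROM mv_log_stats_hourly
GROUP BY service, stat_hour::date;
```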
Advantages:
- Layered aggregation: Base materialized view processes raw logs, advanced materialized view calculates health metrics
- Multi-time granularity: Supports statistical analysis by hourly, daily and other time granularities
- Real-time monitoring: Reflects system latest status and problems in real-time
- Performance optimization: Avoids scanning large amounts of log data for each analysis
Application Scenarios:
- System health monitoring
- Real-time error and warning statistics
- User access volume analysis
- Security event detection (like abnormal logins, malicious requests, etc.)
- Performance bottleneck identification
Common Issues and Troubleshooting
- Refresh failures: Check base table/view permissions, whether dependent objects exist, whether SQL syntax is compatible with incremental materialization.
- Data inconsistency: Pay attention to base table delays, late-arriving data, concurrent writes and other scenarios. Use full refresh when necessary.
- Performance bottlenecks: Analyze materialized view definition SQL complexity, index coverage, whether storage method (row/column/hybrid) is reasonable.
- Dependency changes: After base table structure changes, materialized view definitions need synchronous adjustment to avoid refresh anomalies.
Summary
Materialized views are an effective means of improving query performance, but need to follow specific design principles:
- Ensure query result set change rates are low
- Avoid using non-deterministic functions
- Ensure completeness of `GROUP BY` and `Over Window` results
- Optimize `JOIN` conditions and index strategies
- Adopt appropriate architecture layering solutions
Following these best practices will help you build efficient and maintainable materialized view systems that significantly improve your data warehouse query performance.