tacnode

Materialized View (Beta)

Introduction to Materialized Views

A materialized view in TacNode is a unique view that stores query results directly in the database. Unlike standard views, which compute results dynamically, materialized views use physical storage to improve query performance, particularly for complex queries or frequently accessed aggregate results. These views can also be refreshed regularly to stay in sync with the underlying tables.

Key Uses of Materialized Views

  • Query Acceleration

    Materialized views significantly improve query response times for complex queries, such as multi-table joins or aggregate calculations, especially when the underlying data is stable. By precomputing and storing query results, materialized views eliminate the need for recalculation with each query execution. For example, in data warehouse report generation, where data is summarized and analyzed from various tables or sources, materialized views can store report contents, enhancing data analysis efficiency.

  • Cross-Data Source Queries

    When a query involves multiple data sources, especially external ones, reading data from these sources for each query can lead to excessive data transmission and processing overhead. Materialized views pre-aggregate this dispersed data, reducing cross-data source access during queries and lowering data transmission costs.

  • System Load Reduction

    Frequent execution of complex queries can heavily burden the database server in high-concurrency environments. Materialized views, with pre-calculated query results, can significantly reduce response times and alleviate system load for systems requiring quick responses, such as online trading platforms and real-time decision-support systems.

  • Hierarchical Data Processing

    A typical data warehouse integrates different systems or platforms, requiring multiple ETL jobs for hierarchical data processing. Storing intermediate calculation results in materialized views can streamline this process. Techniques like incremental refresh for materialized views can further enhance ETL operation efficiency.

Characteristics of Materialized Views

  • Precompute and store query results to enhance query performance.
  • Can be queried like a standard table, simplifying complex queries.
  • Automatically sync updates to maintain data consistency.
  • Reduce computing resource usage and avoid complex operations for every query.
  • Serve as a data cache to lessen the load on the source table.

How to Use Materialized Views

Materialized views can be used in two primary ways:

  • Querying as an Ordinary Table: A materialized view functions like a physical table and can be queried like other tables. Users can create various indexes on these views to enhance query speed. However, modifications such as INSERT, UPDATE, or DELETE are not allowed, and updating the data in a materialized view requires special REFRESH commands.

  • Automatic Optimization by the Optimizer: When a user executes a complex query, the optimizer may identify parts that can be replaced with access to a materialized view, rewriting the query accordingly. Materialized views typically store results of resource-intensive queries, allowing users to bypass lengthy calculations, which can significantly enhance performance. It's important to note that for this optimization to occur, the optimizer must consider the materialized view's data as "fresh":

    1. When a materialized view is created and refreshed, it is deemed fresh by the optimizer.
    2. If the underlying table of a materialized view undergoes modifications, the optimizer will mark the related materialized view as expired until the subsequent REFRESH command.

In conclusion, utilizing materialized views requires users to make a trade-off:

While allowing the optimizer to find and rewrite queries automatically is convenient, expired materialized views may lead to unstable performance. Conversely, querying the materialized view directly ensures more stable efficiency, although the data may not be current.

Categories of Materialized Views

As noted earlier, when the data underlying a materialized view changes, we must update it to reflect these modifications. This process is known as refreshing the materialized view (REFRESH).

Materialized views can be categorized based on their refresh strategies into three types:

  1. Full Materialized View: In this approach, the materialized view is refreshed by re-executing the query statement in its definition, entirely replacing the existing content with the latest results from the query.

  2. Incremental Materialized View: This refresh method analyzes the materialized view's defined query statement, assesses changes since the last update, and merges those changes into the existing materialized view.

  3. Real-time Materialized View: This type updates automatically without user intervention when the underlying tables are modified. It synchronously refreshes the materialized view by analyzing the query statement to merge only the changes resulting from these DML modifications. Thus, the real-time materialized view can be regarded as a specific form of the incremental materialized view.

Incremental and real-time materialized views require analysis of the query statement and a transformation into a format suitable for incremental calculation. However, due to the intricacies of incremental calculations, not all SQL functions are fully compatible, imposing certain limitations. We will also compare these materialized views across the following dimensions:

Full Materialized ViewIncremental Materialized ViewReal-Time Materialized View
Supported Query TypesAll- Project
- ProjectSet
- Where
- Distinct
- Aggregate
 - SUM/COUNT/MAX/MIN/AVG
- Inner Join
- Left/Right/Full Outer Join
- Outer Join must have equivalent conditions
Similar to Incremental Materialized View, except for the join specifics
Unsupported Query TypesNone- All VOLATILE functions
- Over Window
- Order By
- Agg with Filter and Distinct Agg
- Subqueries such as Exists / In
- Union (All)
Same as Incremental Materialized View, plus joins
Is It ExpiredThe data in a materialized view is outdated if the underlying table has changed since the last refreshThe data in a materialized view is outdated if the underlying table has changed since the last refreshData in the materialized view never expires
Refresh CostAssociated with the total data volume in the base tableRelates to the volume of data changes in the base table since the last refresh and the incremental calculation complexity of the query.Based on the volume of modifications in the base table and the incremental calculation complexity of the query.
Typical Refresh IntervalDaily, hourlyEvery few seconds to minutesNo refreshing needed, operates in real-time

In real-world situations, we must decide which materialized views to use based on queries' complexity, frequency, timeliness needs, and a thorough assessment of the associated refresh costs.

Fully Materialized View Management

Create Fully Materialized View

Syntax

CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] view_name
	[ (column_name [, ...] ) ]
	AS query
	[ WITH [ NO ] DATA ];

Parameter description:

  • IF NOT EXISTS: No error will be reported if the view exists.
  • view_name: Name of the materialized view.
  • column_name: Optional; specify the column's name in the view.
  • query: SELECT statement defining the view's content.
  • WITH [ NO ] DATA: Indicates whether to populate data during creation.

Example

-- Base Table
CREATE TABLE order_details
(
    order_id     INT,
    product_id   INT,
    product_name VARCHAR(255),
    quantity     INT,
    price        DECIMAL(10, 2),
    PRIMARY KEY (order_id, product_id)
);
 
-- Create a materialized view
CREATE MATERIALIZED VIEW sales_mv AS
SELECT product_name, SUM(quantity) AS total_quantity
FROM order_details
GROUP BY product_name;
 
-- Create a normal view
CREATE VIEW sales_v AS
SELECT product_name, SUM(quantity) AS total_quantity
FROM order_details
GROUP BY product_name;
 
-- Create a materialized view based on a normal View
CREATE MATERIALIZED VIEW sales_mv2 AS
SELECT *
FROM sales_v;
 
-- Querying Materialized View Data
CREATE MATERIALIZED VIEW sales_mv3 (upper_name, quantity) AS
SELECT UPPER(product_name), SUM(quantity) AS total_quantity
FROM order_details
GROUP BY product_name;
 
-- querying materialized view D
data
SELECT * FROM sales_mv;
 
-- write test data
INSERT INTO order_details (order_id, product_id, product_name, quantity, price)
VALUES
    (1, 1, 'Product A', 10, 9.99),
    (1, 2, 'Product B', 5, 19.99),
    (2, 1, 'Product A', 20, 9.99),
    (2, 3, 'Product C', 15, 14.99),
    (3, 2, 'Product B', 8, 19.99),
    (3, 3, 'Product C', 12, 14.99),
    (4, 1, 'Product A', 30, 9.99),
    (4, 2, 'Product B', 10, 19.99);
 
-- refresh Materialized View
REFRESH MATERIALIZED VIEW sales_mv;
 
-- Query the Materialized View data again
SELECT * FROM sales_mv;

When creating a materialized view, using WITH DATA (the default) immediately executes a query to populate the data. If the data volume is significant, this process may take considerable time and block other operations. Alternatively, you might opt for WITH NO DATA, which allows you to refresh and fill out the data separately later.

Refresh Materialized View

Syntax:

REFRESH MATERIALIZED VIEW [ CONCURRENTLY ] view_name [ WITH [ NO ] DATA ];

Parameter description:

  • CONCURRENTLY: This parameter is optional. It allows simultaneous viewing refresh to mitigate the obstruction of other query operations; however, certain restrictions apply:
    • This operation cannot be executed within a transaction block.
    • The WITH NO DATA option is unsupported.
  • view_name: Denotes the name of the materialized view designated for refreshment.
  • WITH [ NO ] DATA:
    • WITH DATA (Default): This option refreshes the data immediately using the query statement outlined by the view.
    • WITH NO DATA: This option updates only the view's definition without refreshing the data. It's commonly used when creating a new view, which is then complemented by a separate data refresh.

Example

--Normal refresh, this statement will immediately execute the query defined in the view and update the data in mv_sales. The table will not be locked during the refresh.
REFRESH MATERIALIZED VIEW sales_mv;
 
--Concurrent refresh, this statement will perform refresh operations in parallel in the background, and will not block queries on the view. However, the view may still return old data during the refresh.
REFRESH MATERIALIZED VIEW CONCURRENTLY sales_mv;

Note:

  1. Updating materialized views uses resources and requires balancing the frequency of refreshes with the need for real-time data.

  2. It’s advisable to schedule refreshes during off-peak times or to implement an automated scheduling system for updates.

Delete materialized view

Syntax

DROP MATERIALIZED VIEW [ IF EXISTS ] view_name [, ...] [ CASCADE | RESTRICT ];

Parameter Details

  • IF EXISTS: No error will be raised if the specified view doesn’t exist.
  • view_name: This represents the name of the materialized view intended for deletion. You can specify multiple values separated by commas.
  • CASCADE: This option removes any dependent objects associated with the view, like indexes.
  • RESTRICT: If there are dependent objects, the deletion will not proceed—this is the default behavior.
-- Normal delete, this statement will delete the materialized view mv_sales. If other objects depend on this view, the delete operation will be aborted, and an error will be reported.
DROP MATERIALIZED VIEW sales_mv;
 
-- Cascading delete, this statement not only deletes the view sales_mv, but also deletes all objects that depend on it, such as indexes.
DROP MATERIALIZED VIEW sales_mv CASCADE;
 
-- Using IF EXISTS, this statement will delete the view sales_mv. If the view does not exist, no error will be reported.
DROP MATERIALIZED VIEW IF EXISTS sales_mv;

Notes

  1. Deleting a materialized view is permanent and cannot be undone, so please take care when proceeding.

  2. Before deletion, confirm that the view is no longer necessary and that no dependent objects exist. Check the view's dependencies to prevent accidental deletion.

  3. When using CASCADE, exercise caution, as all dependent objects will be deleted simultaneously, potentially causing significant consequences.

  4. You must have the necessary permissions to delete a view. Insufficient permissions will result in an error.

  5. Deleting a view does not remove data from the source table; it only deletes the precomputed and stored results.

  6. Regularly review and delete materialized views that are no longer used to conserve storage space.

  7. Maintain a reasonable number of materialized views, as an excessive amount can raise maintenance expenses and complicate the system.

Incremental Materialized View Management

Planned for V1.1 release.

Real-Time Materialized View Management

Planned for V1.1 release.


Best Practices

  1. Assess the use of materialized views

Materialized views should typically possess the following traits to be revenue-generating:

  • The querying pattern involves significant computational effort. High-overhead computational activities include multi-table joins, aggregation, and window functions. Moreover, complex functions such as regular expressions and user-defined functions (UDFs) that access external resources can also incur considerable computational costs.

  • The query pattern occurs frequently. A materialized view may not be necessary if the costly query pattern is rarely executed due to its inherent costs. However, if queries occur often enough that savings surpass the costs associated with materialized views, then it’s worth considering.

Creating a materialized view could be beneficial if a query pattern meets these criteria.

  1. Clearly define materialized views

We encourage users to consider the computational overhead and the frequency of queries to extract relevant subqueries when defining materialized views. This strategy ensures that the subquery addresses as many queries as possible. Common scenarios include:

  • Multi-table associations and various ad hoc aggregation calculations. In typical data warehouse analysis, pre-associating fact tables with dimension tables and conducting analysis based on these relationships is often essential. When the rules governing these multi-table associations remain unchanged, subsequent analyses become highly flexible. Thus, we suggest defining the multi-table association as a materialized view, allowing for direct ad hoc analysis from that view.

  • Varied queries following complex aggregation. When the rules of multi-table association and aggregation are fixed, numerous checks may still be necessary on the aggregation output. For instance, in e-commerce, aggregated result tables from merchant and buyer dimensions are usually pre-processed and made available for them to access their results. We recommend creating a materialized view of the aggregated result table instead of a separate materialized view for each possible final output.

  1. Select the appropriate storage

Specific table-level settings apply since the materialized view is treated as a physical table. Key considerations include:

  • Assign names to each column in the materialized view. If you don't provide specific names during creation, the engine automatically generates them. However, for future tuning methods on the materialized views, it's advisable to manually define each column's name.

  • Select an access method that aligns with how you use the materialized view. For instance, row storage might be ideal if it holds pre-aggregated data and queries are mostly point-based. Conversely, columnar storage would be preferable if many full table scans were planned for further computations. A mixed storage mode could be beneficial when multiple access methods are employed.

Establish relevant indexes. Common indexes, like secondary indexes, can enhance enumeration speed, while split indexes can improve filtering efficiency in column storage. However, it's important to remember that adding more indexes can increase write overhead.

  1. Select an appropriate refresh strategy

Choosing the right refresh strategy involves careful evaluation of various factors, including 1) how often the base table changes, 2) the required timeliness of the materialized view, and 3) the complexity of the query being materialized.

Firstly, consider the more stringent conditions when:

  • The base table is stable or changes rarely.
  • There are low timeliness requirements (e.g., updates can occur over hours or days).
  • The complexity of the query makes incremental calculations impractical.

In such cases, a fully materialized view is suitable.

Conversely, if the original data changes frequently and you have strict timeliness requirements for the results, consider using incremental or real-time materialized views. In these situations, the choice is generally influenced by timeliness:

  • Real-time materialized view: Ideal for specific business situations demanding high timeliness (updates within seconds).

  • Incremental materialized view: Fits scenarios tolerating second or minute-level delays.

  1. Monitor the performance of materialized views
  • Evaluating Materialized View Effects

After implementing materialized views, assessing their impact on query response time, system load, and other factors is essential. Given that materialized views incur costs, such as refresh expenses and storage overhead, we suggest focusing on these two areas for a thorough evaluation.

  • Monitor the refresh status of the materialized view.

The materialized view's refresh history provides insight into its performance. For instance, if the refresh time for a fully materialized view fluctuates, we should investigate any unexpected degradation. Additionally, we should verify if there have been any failures in refreshing the incremental materialized view and note any changes in its refresh duration.

  • Monitor the latency of materialized views

To determine the latency of each materialized view, we can track the time of its last successful refresh. If a materialized view’s latency continues to rise, it may not be refreshing effectively. This situation could lead to prolonged outdated data being presented to users, necessitating prompt identification and resolution.

Use Cases

Here, we present three practical examples to illustrate the function of various materialized views in distinct scenarios.

User Growth Analysis

The influx of new users serves as a valuable metric to assess the effectiveness of our marketing and advertising efforts. To identify whether a user is new, check if they have engaged with our website within the past 30 days. We require a query like this:

SELECT COUNT(DISTINCT user_id)
FROM user_activities
WHERE pt = current_date
AND user_id NOT IN (
-- Count all users who have interacted in the past 30 days
SELECT DISTINCT user_id FROM user_activities
WHERE pt >= current_date - INTERVAL '30 day' AND pt < current_date)

We have observed that the calculation of "counting all users present in the last 30 days" results in a substantial number, and the outcomes remain unchanged (assuming that the user_activities table may contain late data). If our queries for new user analysis are frequent, we might consider storing the results of this "counting users from the last 30 days" query in a materialized view. Since this query only needs to run once daily, we can utilize a full materialized view and refresh the data every morning.

This concept can also be applied to analyze user retention and the growth of new users across different channels.

Data Warehouse Stratification

In a typical data warehouse framework, we usually implement a layered approach, including:

  • ODS (Detailed Data Layer): Also known as the source layer, this is where the original unprocessed data resides and serves as the foundation for subsequent data processing.
  • DWD (Data Base Layer): This layer mainly cleans, filters, and standardizes the data from the ODS layer and often involves processing large tables.
  • DWS/ADS (Data Service Layer): Various summarization operations utilize data from the DWD layer.

The critical distinction between an offline data warehouse and a real-time data warehouse is the responsiveness of each layer, specifically how quickly the most original data changes are reflected. From a modeling standpoint, apart from the ODS layer, data in the other layers can be viewed as materialized views, regularly refreshed based on data dependencies.

With TacNode’s robust storage and querying abilities, we can establish an integrated data warehouse architecture that serves both offline and real-time needs:

  • ODS

    • Data is written in real-time from external sources or synchronized into TacNode via tools in real-time.
    • Certain flexible business scenarios, like data analysts validating ideas, can directly query ODS data.
  • DWD

    • Constructed materialized views from ODS layer data according to varying timeliness requirements. The system tables, such as pg_depend and pg_class, can automatically determine the dependency relationships among materialized views and tables.
    • Hourly/Day-level Timeliness: Create a complete materialized view and refresh it regularly according to dependencies.
    • Minute-level Timeliness: Develop incremental materialized views that support standard filtering and multi-table associations, refreshed incrementally based on dependencies.
  • DWS/AWS: Depending on the needs of various business scenarios, materialized views are formed from the DWD and ODS layers.

    • Fixed Query Patterns: Build lightly aggregated materialized views for secondary aggregation analysis or direct point queries.
    • Flexible Querying: Users can directly query data from ODS and DWD layers without relying on materialized views.

This architecture prevents us from creating a fully comprehensive data architecture for offline and real-time needs. Instead, we only need to establish appropriate tables and materialized views in TacNode based on timeliness and query flexibility requirements. This satisfies offline and real-time demands, ultimately lowering costs and enhancing usability.

Advertising Billing

When the delivery engine decides to run an ad in an advertising setting, it generally checks if the advertiser's balance is adequate. Under the pay-per-click model, each user click results in a charge to the advertiser. Since user clicks happen continuously, it's essential to track each advertiser's spending in real-time:

SELECT SUM(cost) AS total_cost FROM user_clicks GROUP BY advertiser

If an advertiser's total spending surpasses their balance, their ads will be suspended. Conversely, delayed cost reports can result in overspending and financial losses.

Some conventional methods currently employ stream computing engines for real-time statistics; however, this approach cannot eliminate brief delays. Additionally, unexpected system restarts may cause uncontrollable delays. We can now explore the option of establishing a real-time materialized view based on user clicks. Whenever a user's click activity is logged in the user_clicks table, the corresponding materialized view will be updated in real time, ensuring zero delays.