Loading Historical Data in Chronological Order
Tacnode uses an LSM Tree (Log-Structured Merge Tree) as its underlying storage structure, a design well suited to large-scale data, high-throughput, and high-concurrency scenarios. As a high-performance distributed database, Tacnode must handle large-scale writes and updates, particularly in data warehousing and real-time analytics contexts. The LSM Tree meets these high-throughput write demands and avoids the performance bottlenecks that frequent random writes cause in traditional B-trees.
During the migration of business operations to Tacnode, existing data often needs to be imported in bulk. This can be done with synchronization tools such as DataX, Teleport, or DTS, or with batch import methods such as COPY. Regardless of the import method, however, there is a risk that the chronological order of the original writes is disrupted, degrading the performance of time-range queries.
The following sections describe the cause of this issue and the best practices Tacnode provides to address it.
Problem Description
The core characteristics of an LSM Tree are sequential writes and background merging. When data is batch-written to Tacnode, it first enters the in-memory MemTable and is then periodically flushed and merged into SSTable files on disk. Because data is batched and merged, the write order does not necessarily preserve the chronological order of the data, so the physical storage layout may not be time-sequential, which hurts the efficiency of time-range queries.
In many applications, especially those involving time filtering or time-range queries (such as log data, monitoring data, or IoT data), the timestamp is often the most important query condition. When data is organized in chronological order, time-based queries can be optimized in several ways:
- Columnar Storage Optimization: Columnar storage scans only the columns relevant to a query and ignores the rest. If data is stored in chronological order, scanning a time range becomes more efficient because the data is already physically ordered, which reduces jump scans and lowers I/O cost.
- Time Filtering Optimization: In ordered storage structures, such as columnar tables holding time-sequential data, techniques like indexing or range scanning can efficiently locate data within a time range without scanning the entire dataset. (A concrete example query follows this list.)
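To make this concrete, here is a typical time-range query; the table and column names (events, event_time) are hypothetical. When rows are physically stored in event_time order, the engine can prune whole blocks whose minimum and maximum timestamps fall outside the requested range instead of scanning the full table.

```sql
-- Hypothetical time-series table; rows physically ordered by event_time.
-- Block-level min/max statistics on event_time let the engine skip most
-- of the table for this scan instead of reading it end to end.
SELECT device_id, metric, value
FROM events
WHERE event_time >= TIMESTAMP '2024-01-01 00:00:00'
  AND event_time <  TIMESTAMP '2024-01-02 00:00:00';
```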
Since data is naturally written in chronological order from the application layer to the database or data warehouse, it is important during business migration to preserve this order when importing existing data into Tacnode. This prevents time-filtering SQL queries from degrading due to a disrupted chronological order. One simple safeguard is to sort the data by its time field at export time, as sketched below.
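As an illustration, if the existing data lives in a PostgreSQL-compatible source, it can be exported pre-sorted by its time column so the file itself preserves chronological order. The table name, column name, and file path (events, event_time, /tmp/events_sorted.csv) are hypothetical.

```sql
-- On the source system: export rows ordered by the time column so the
-- output file already reflects chronological order for the later import.
COPY (SELECT * FROM events ORDER BY event_time)
TO '/tmp/events_sorted.csv' WITH (FORMAT csv, HEADER true);
```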
Best Practices
In Tacnode, the primary way to ensure data is imported in chronological order is to use a GUC parameter to specify an ordering field, so that during COPY or INSERT operations, data is written in the order of that field (such as a timestamp column).
- Specify the Time Field: Before importing data, set a session-level GUC parameter to specify the time field you want to use (e.g., `event_time` or `created_at`). The order of this field determines the data write order.
- Perform Data Import: Once the GUC parameter is set, proceed with the import. Tacnode will write the data in chronological order based on the specified time field. You can use the `COPY` or `INSERT` command to import data; a sketch of both steps follows this list.
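Below is a minimal sketch of the two steps, assuming a placeholder GUC name: this section does not state the actual parameter, so `tacnode.import_order_column` stands in for it and should be replaced with the name given in the Tacnode documentation. The table, column, and file names (`events`, `event_time`, `staging_events`, `/tmp/events_sorted.csv`) are likewise hypothetical.

```sql
-- Step 1: session-level GUC naming the ordering field.
-- NOTE: 'tacnode.import_order_column' is a placeholder, not the confirmed
-- parameter name; consult the Tacnode docs for the exact GUC.
SET tacnode.import_order_column = 'event_time';

-- Step 2: bulk import; rows are written in event_time order.
COPY events FROM '/tmp/events_sorted.csv' WITH (FORMAT csv, HEADER true);

-- An INSERT under the same session setting behaves the same way:
INSERT INTO events (device_id, metric, value, event_time)
SELECT device_id, metric, value, event_time
FROM staging_events;
```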