
Data Import and Export

Tacnode provides efficient data import and export for tables. For importing large datasets, use INSERT or COPY FROM. INSERT suits smaller datasets or incremental additions to an existing table. COPY FROM is ideal for large initial imports: it reads data directly from a file and writes it into the storage engine's underlying file format, avoiding the per-statement SQL transaction overhead of INSERT and significantly improving import throughput. Conversely, the COPY TO command batch-exports a table or query result to a file for later import or processing.

For example, consider the following table for students:

CREATE TABLE students(
id int PRIMARY KEY,
name varchar(255) NOT NULL,
phone varchar(16)
);

We have a prepared file (stu.txt) containing data conforming to the above table schema. Its content is:

123        Jacob        \N
456        "Alex Xia"   13888811888
789        Henry        \N

In this file, fields in each row are separated by a single tab character (\t), and a NULL value is represented by \N.

Data Import

COPY FROM

Once connected to Tacnode's default PostgreSQL database using psql, run these commands to import data and view the results:

postgres=> \COPY students FROM stu.txt;
COPY 3
 
postgres=> SELECT * FROM students;
 id  |    name    |    phone
-----+------------+-------------
 123 | Jacob      |
 456 | "Alex Xia" | 13888811888
 789 | Henry      |
(3 rows)

The full syntax of the COPY FROM batch import command is:

COPY table_name [ ( column_name [, ...] ) ]
FROM [ 'filename' | STDIN ]
[ [ WITH ] ( option [, ...] ) ]

Supported options include:

  • DELIMITER 'delimiter_character': Specifies the column separator for each data row in the file; defaults to the tab character \t. Tacnode currently supports only text-format input, so the delimiter must be escaped wherever it appears inside field content. CSV and binary input will be supported in future versions.
  • NULL 'null_string': The string that represents a NULL value; defaults to the two-character sequence \N.
  • ON CONFLICT DO NOTHING: If a row in the file has a primary key that already exists in the table, that row is skipped.

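For example, to load a hypothetical comma-delimited file (stu2.txt) while skipping rows whose primary keys already exist:

postgres=> \COPY students FROM stu2.txt WITH (DELIMITER ',', ON CONFLICT DO NOTHING);
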
For additional syntax for COPY, refer to COPY.

Data can also be imported through the JDBC driver using PostgreSQL's CopyManager.

import java.io.FileInputStream;
import java.io.IOException;
import java.sql.Connection;
import java.sql.SQLException;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public static long copyFromFile(Connection connection, String filePath, String tableName)
        throws SQLException, IOException {
    FileInputStream fileInputStream = null;
    try {
        CopyManager copyManager = new CopyManager((BaseConnection) connection);
        fileInputStream = new FileInputStream(filePath);
        // Stream the file into the table through the COPY protocol
        String copyIn = "COPY " + tableName + " FROM STDIN DELIMITER AS ','";
        return copyManager.copyIn(copyIn, fileInputStream);
    } finally {
        if (fileInputStream != null) {
            try {
                fileInputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
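
A minimal usage sketch (the JDBC URL, credentials, and file name are placeholders; note that the helper above expects comma-delimited input):

// Inside a method that declares throws SQLException, IOException;
// also requires java.sql.DriverManager.
try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://<host>:5432/postgres", "user", "password")) {
    long rows = copyFromFile(conn, "data.txt", "students");
    System.out.println("Imported " + rows + " rows");
}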

INSERT

The INSERT statement adds new records to a table. You can insert individual rows or multiple rows simultaneously. The syntax is as follows:

[ WITH [ RECURSIVE ] with_query [, ...] ]
INSERT INTO table_name [ AS alias ] [ ( column_name [, ...] ) ]
    [ OVERRIDING { SYSTEM | USER } VALUE ]
    { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
    [ ON CONFLICT [ conflict_target ] conflict_action ]
    [ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
 
where conflict_target can be one of:
 
    ( { index_column_name | ( index_expression ) } [ COLLATE collation ] [ opclass ] [, ...] ) [ WHERE index_predicate ]
    ON CONSTRAINT constraint_name
 
and conflict_action can be one of:
 
    DO NOTHING
    DO UPDATE SET { column_name = { expression | DEFAULT } |
                    ( column_name [, ...] ) = [ ROW ] ( { expression | DEFAULT } [, ...] ) |
                    ( column_name [, ...] ) = ( sub-SELECT )
                  } [, ...]
              [ WHERE condition ]

For additional syntax for INSERT, refer to INSERT.

The test table used in the examples below is defined as follows. Unless otherwise stated, all examples use this table definition.

CREATE TABLE t1(
    id int primary key,
    name text,
    age smallint,
    gender char(1)
);

  1. Specify the column names explicitly, matching columns to values one-to-one. You may list only a subset of columns; unlisted columns take their default values.

Example:

tacnode=> INSERT INTO t1 (id, name, age, gender) VALUES (1,'testname','22','f');
INSERT 0 1
 
tacnode=> INSERT INTO t1 (id, name) VALUES (2,'testname');
INSERT 0 1

  2. If you don't specify column names, you must provide values for all columns in VALUES.

Example:

tacnode=> INSERT INTO t1 VALUES (3,'testname3','33','f');
INSERT 0 1

  3. Batch insertion adds several records in a single statement.

Example:

tacnode=> INSERT INTO t1 (id, name, age, gender) VALUES (4,'testname4','22','f'), (5,'testname4','22','m');
INSERT 0 2

  4. Use query results as the input of an INSERT INTO statement.

Example:

tacnode=> INSERT INTO t1 (id, name, age, gender) SELECT id, name, age, gender FROM t2 WHERE id>10;
INSERT 0 1

  5. When a primary key conflict occurs during insertion, use the INSERT INTO ... ON CONFLICT syntax to choose the action taken. conflict_action has two options:
  • DO NOTHING: no action is taken on conflict; the conflicting row is discarded.
  • DO UPDATE: the specified columns are updated on conflict (note: while technically permitted, updating primary key columns is not recommended).

It is advisable to target ON CONFLICT at the primary key or a unique index; otherwise a full table scan may be triggered, which hurts performance (see the sketch below).
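
For instance, ON CONFLICT can target a unique index instead of the primary key. A hypothetical sketch (the index name is illustrative, and it assumes the name column holds no duplicate values yet):

CREATE UNIQUE INDEX t1_name_uidx ON t1 (name);
INSERT INTO t1 (id, name) VALUES (100, 'newname') ON CONFLICT (name) DO NOTHING;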

Example:

tacnode=> \d t1
                                 Table "public.t1"
  Column  |     Type     | Collation |  Nullable  |                 Default
--------+--------------+----------+----------+--------------------------------------
 id     | integer      |          | not null | nextval('t1_id_seq'::text::regclass)
 name   | text         |          |          |
 age    | smallint     |          |          |
 gender | character(1) |          |          |
Indexes:
    "t1_pkey" PRIMARY KEY, btree (id ASC)
 
tacnode=> SELECT * FROM t1 WHERE id=1;
 id |   name   | age | gender
----+----------+-----+--------
  1 | testname |  22 | f
(1 row)
 
-- Use INSERT INTO ON CONFLICT DO NOTHING to insert data with primary key conflicts
tacnode=> INSERT INTO t1 (id, name, age, gender) VALUES (1,'testname1', 11, 'm') ON CONFLICT(id) DO NOTHING;
INSERT 0 0
 
-- Data not updated
tacnode=> SELECT * FROM t1 WHERE id=1;
 id |   name   | age | gender
----+----------+-----+--------
  1 | testname |  22 | f
(1 row)
 
-- Use INSERT INTO ON CONFLICT DO UPDATE to insert data with primary key conflicts
tacnode=> INSERT INTO t1 (id, name, age, gender) VALUES (1,'testname1', 11, 'm') ON CONFLICT(id) DO UPDATE SET name=EXCLUDED.name,age=EXCLUDED.age,gender=EXCLUDED.gender;
INSERT 0 1
 
-- Data is updated by specified column
tacnode=> SELECT * FROM t1 WHERE id=1;
 id |   name    | age | gender
----+-----------+-----+--------
  1 | testname1 |  11 | m
(1 row)

Performance Optimization

For large volumes of data, batching submits multiple statements to Tacnode at once, which improves import performance. The multi-row INSERT below and the JDBC batch that follows are equivalent:

INSERT INTO test (col1, col2) VALUES ('A', 10), ('B', 20), ('C', 30)
 
// dataSource is assumed to be an already-configured javax.sql.DataSource
Connection conn = dataSource.getConnection();
try {
    conn.setAutoCommit(false); // disable autocommit to start an explicit transaction
 
    PreparedStatement pstat = conn.prepareStatement("INSERT INTO test (col1, col2) VALUES (?, ?)");
 
    // Add records in batch mode
    pstat.setString(1, "A");
    pstat.setInt(2, 10);
    pstat.addBatch();
 
    pstat.setString(1, "B");
    pstat.setInt(2, 20);
    pstat.addBatch();
 
    pstat.setString(1, "C");
    pstat.setInt(2, 30);
    pstat.addBatch();
 
    // execute the batched statements
    pstat.executeBatch();
    conn.commit(); // commit the transaction
} catch (Exception e) {
    conn.rollback(); // rollback the transaction
    e.printStackTrace();
} finally {
    conn.setAutoCommit(true); // restore the autocommit
    conn.close(); // close the connection
}
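
If Tacnode accepts standard PostgreSQL JDBC driver parameters (an assumption worth verifying against your deployment), the driver's reWriteBatchedInserts option can further speed up batches by rewriting them into multi-row INSERT statements on the client side:

jdbc:postgresql://<host>:5432/postgres?reWriteBatchedInserts=true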

For incremental updates to a large table, for example updating 3 million rows in a base table of 3 billion rows, COPY cannot be combined with ON CONFLICT DO UPDATE. Instead, COPY FROM into a temporary table and then run INSERT INTO ... SELECT FROM the temporary table with ON CONFLICT, which gives the best import performance. The steps are:

-- Verify the table format
test=> \d test
 
            Table "public.test"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 col1   | integer |           | not null |
 col2   | text    |           |          |
 col3   | text    |           |          |
Indexes:
    "test_pkey" PRIMARY KEY, btree (col1 ASC)
 
-- Create the temporary table
test=> CREATE TABLE test_tmp (LIKE test INCLUDING ALL);
 
-- Execute copy import to the temporary table
test=> \copy "test_tmp" FROM csv_file;
 
-- Batch import via insert into select from
test=> INSERT INTO test (col1, col2, col3) SELECT col1, col2, col3 FROM test_tmp ON CONFLICT (col1) DO UPDATE SET col2=excluded.col2, col3=excluded.col3;
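 
-- Once the merge completes, the staging table can be dropped
test=> DROP TABLE test_tmp;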

Data Export

COPY TO

The syntax for the COPY TO command is detailed below:

COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }
TO { 'filename' | STDOUT }
[ [ WITH ] ( option [, ...] ) ]

Options available with WITH include:

  • DELIMITER 'delimiter_character'
  • NULL 'null_string'

These have the same meaning as in COPY FROM.

Examples:

  • To export the entire students table, run this command:
postgres=> \COPY students TO stu.out;
COPY 3
  • To export only the students whose id is at most 200, run this command:
postgres=> \COPY (SELECT * FROM students WHERE id<=200) TO stu.out;
COPY 1
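
The WITH options can be combined with either form; for example, to export comma-separated values with a custom NULL marker (the output file name is illustrative):

postgres=> \COPY students TO stu.csv WITH (DELIMITER ',', NULL 'NULL');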

You can also export data through the JDBC driver using PostgreSQL's CopyManager.

import java.io.FileOutputStream;
import java.io.IOException;
import java.sql.Connection;
import java.sql.SQLException;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public static void copyToFile(Connection connection, String filePath, String tableOrQuery)
        throws SQLException, IOException {
    FileOutputStream fileOutputStream = null;
    try {
        CopyManager copyManager = new CopyManager((BaseConnection) connection);
        fileOutputStream = new FileOutputStream(filePath);
        // Stream the table or query result out through the COPY protocol
        String copyOut = "COPY " + tableOrQuery + " TO STDOUT DELIMITER AS ','";
        final long rows = copyManager.copyOut(copyOut, fileOutputStream);
        System.out.println(rows + " rows exported");
    } finally {
        if (fileOutputStream != null) {
            try {
                fileOutputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
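
Both helpers stream rows over the COPY protocol rather than issuing per-row SQL statements, mirroring the performance advantage of COPY over INSERT described above.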

Streaming Data Export

Tacnode enables data export through the streaming replication protocol; for more details, refer to Change Data Capture.
