tacnode

dbt Modeling

In today's data-driven business environment, companies need efficient and flexible systems for data analysis and decision-making. Data warehouses play a vital role in managing, storing, and analyzing large datasets within an organization. However, traditional data warehouse modeling techniques often fall short in meeting the demands for rapid iteration and flexibility. Enter dbt (Data Build Tool), a modern data transformation tool specifically designed for data modeling, testing, and deployment within data warehouses.

By combining the data modeling capabilities of dbt with the storage advantages of Tacnode, you can create a robust, flexible, and easily maintainable data warehouse ecosystem. This article explores how to use dbt and Tacnode for data warehouse modeling, from building data pipelines to executing data transformations.

Preparation

  1. Install Python 3 by visiting Python to download or use your system package manager for installation.
  2. Install Git by going to Git for downloading or use your system package manager.
  3. Download VS Code Editor.
  4. Install dbt and set up the project.

Install dbt

Open your terminal and run pip3 install dbt-core dbt-postgresql

dbt Modeling

Concepts

  1. schema.yml: Defines metadata for the table, including description, tags, and test sets.
  2. Seed: Used to load CSV files and can import external dimension tables or small test datasets.
  3. Model: Conducted by writing an SQL template.
    • source function establishes a relationship between the model and the source for lineage management, e.g., select * from {{ source('public', 'et') }};.
    • ref function associates relationships between different models for lineage management, e.g., select * from {{ ref('base_model') }}.
  4. Source: Defines the source table; table structure definition and data maintenance are not completed in dbt.

schema.yml example

For detailed configuration, refer to dbt schema

version: 2
 
sources:
  - name: ods
    database: example
    schema: public
    tags: ['ods', 'raw']
    tables:
      - name: ods_table0
        tags: ['table_tag']
        columns:
          - name: id
            type: integer
            description: 'PK of ods_table0'
            tests:
              - not_null
              - unique
 
models:
  - name: my_first_dbt_model
    description: 'A starter dbt model'
    columns:
      - name: id
        description: 'The primary key for this table'
        tests:
          - unique
          - not_null
 
seeds:
  - name: country_codes
    description: 'list of ISO-3166 country codes, https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'
    columns:
      - name: short_name
        type: text
        description: 'country name'
        tests:
          - not_null
          - unique
      - name: alpha2code
        type: text
        description: two-letter country codes
        tests:
          - not_null
          - unique
      - name: alpha3code
        type: text
        description: three-letter country codes
        tests:
          - not_null
          - unique
      - name: numeric_code
        type: integer
        description: 'three-digit country codes which are identical to those developed and maintained by the United Nations Statistics Division, with the advantage of script (writing system) independence, and hence useful for people or systems using non-Latin scripts.'
        tests:
          - not_null
          - unique
      - name: iso3166_2
        type: text
        description: Codes for the representation of names of countries and their subdivisions
        tests:
          - not_null
          - unique

dbt lineage graph management example

  1. Lineage diagrams are automatically generated by modules linked through the source and ref functions.

Modeling Practice

Initialize the dbt project

Open a terminal and execute

dbt init example
# Configure the project according to the prompts
 
cd example
 
# Initialize version management
git init
git add ./
git commit -m 'init repo'

Execute Built-in Modeling and Testing

dbt run     # Execute modeling statements
dbt test    # Execute model test case

Ran dbt test1 and found a test error report showing a failure in the not_null_my_first_dbt_model_id test (located in models/example/schema.yml).

Upon reviewing the models/example/schema.yml file, I noticed that the model my_first_dbt_model implements the not_null test. In models/example/my_first_dbt_model.sql, line 16 generates a NULL value.

To resolve this issue, modify the statement -- where id is not null on line 27 of models/example/my_first_dbt_model.sql to where id is not null.

After making this adjustment, run dbt run; dbt test again and confirm that the issue has been fixed.

Upload CSV data via seed

Download the file country_codes.csv, place it in the seeds folder, and then run dbt seed.

Access Tacnode using psql to examine country_code.

\d+ country_codes
 
select * from country_codes limit 10;

Creating Custom Tests

For instance, consider country_codes. Begin by creating a file named seeds/schema.yml and paste the content below.

version: 2
 
seeds:
  - name: country_codes
    columns:
      - name: short_name
        tests:
          - unique
          - not_null
          - count_between:
              range_left: 100
              range_right: 300

Generate a tailored SQL test template by creating a file named tests/generic/count_between.sql and pasting the content below.

{% test count_between(model, column_name, range_left, range_right) %}
 
WITH AGG_RESULT AS (
    SELECT COUNT({{column_name}}) "num" FROM {{ model }}
),
ERROR_RESULT AS (
    SELECT * FROM AGG_RESULT
        WHERE "num" NOT BETWEEN {{range_left}} AND {{range_right}}
)
SELECT * FROM ERROR_RESULT
 
{% endtest %}

Executing the test dbt test --select country_codes

Define the source table and begin initial modeling

  1. Download the external table ods_orders.sql.gz, unzip it, and execute the import command: psql ...connection_string... -f ods_orders.sql
  2. Edit models/erp/schema.yml to add a description under sources.
# ...
 
version: 2
 
sources:
  - name: ods
    database: example
    schema: public
    tags: ['ods']
    tables:
      - name: ods_orders
        description: 'Original Order Table'
        columns:
          - name: orderid
            description: 'Order ID'
          - name: orderdate
            description: 'Order Date'
          - name: shipdate
            description: 'Ship Date'
          - name: shipmode
            description: 'Ship Mode'
          - name: customername
            description: 'Customer Name'
          - name: country
            description: 'Country'
          - name: sales
            description: 'Sales Amount'
          - name: quantity
            description: 'Quantity'
          - name: discount
            description: 'Discount'
          - name: profit
            description: 'Profit'
          # - name: ... # other columns
# ...
  1. In cases of dirty data, the condition for filtering valid data is where sales are not null and sales > 0.
  2. The cleaned model is named dw_orders. Create a file named models/erp/dw_orders.sql and paste the following content:
SELECT * FROM {{source('ods', 'ods_orders')}}
WHERE sales IS NOT NULL AND sales > 0
  1. Run dbt run --select erp to execute a new model and observe its results.

Model Documentation

  1. Create Documentation dbt docs generate The produced documents are located in the target directory. To enable document access, static sites can be deployed using httpd like ngnix. CI allows for automatic version updates.
  2. View locally using the web dbt docs serve

Appendix

  1. Obtain the project file example.tgz and review the modification records for each step using git.
    • To run the project, update the profiles.yml file by adding the following content:
example:
outputs:
  dev:
    type: postgres
    threads: 1
    host: # Modify to tacnode host information
    port: 5432
    user: # Username
    pass: # Password
    dbname: example # Model the database. If the example does not exist, execute CREATE DATABASE example first;
    schema: public
target: dev
  • profiles.yml file path:
    • macOS, Linux: $HOME/.dbt/profiles.yml
    • Windows: c:/users/$HOME/.dbt/profiles.yml