dbt Data Modeling

In today's data-driven business landscape, organizations require efficient and flexible systems for data analysis and decision-making. Modern data warehouses must support rapid iteration while maintaining data quality and reliability. dbt (Data Build Tool) is a powerful data transformation framework designed specifically for modern data modeling, testing, and deployment.

By combining dbt's transformation capabilities with Tacnode's distributed storage and compute advantages, you can create a robust, flexible, and maintainable data warehouse ecosystem. This guide walks you through building data pipelines from initial setup to production deployment.

Prerequisites

Before getting started, ensure you have the following tools installed:

Required Software

  1. Python 3 - Download from Python.org or use your system package manager
  2. Git - Download from Git-scm.com or use your system package manager
  3. VS Code - Download from code.visualstudio.com or use your system package manager

Installing dbt

Install dbt with PostgreSQL support using pip:

pip3 install dbt-core dbt-postgres

dbt Core Concepts

Understanding these fundamental concepts is essential for effective dbt modeling:

Configuration Files

  • schema.yml - Defines table metadata including descriptions, tags, and test configurations
  • profiles.yml - Contains database connection settings and environment configurations

Data Sources

  • Seeds - Load CSV files for dimension tables or small reference datasets
  • Sources - Define external tables (structure and data managed outside dbt)
  • Models - SQL-based transformations that create new tables or views

Key Functions

  • source() - Establishes lineage between models and source tables
    SELECT * FROM {{ source('public', 'orders') }};
  • ref() - Creates dependencies between models for proper execution order
    SELECT * FROM {{ ref('staging_orders') }};
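As a sketch of how the two functions work together, consider two hypothetical model files (the names staging_orders and orders_by_day are illustrative, not part of the project built later in this guide):

```sql
-- models/staging_orders.sql (hypothetical example)
-- source() reads the raw table and records a lineage edge to it.
SELECT
    orderid,
    orderdate
FROM {{ source('public', 'orders') }}
```

```sql
-- models/orders_by_day.sql (hypothetical example)
-- ref() makes dbt build staging_orders before this model runs.
SELECT
    orderdate,
    COUNT(*) AS order_count
FROM {{ ref('staging_orders') }}
GROUP BY orderdate
```

Because ref() and source() are resolved at compile time, dbt can order model execution correctly and draw the lineage graph without any extra configuration.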

Configuration Examples

Database Schema Definition

Create comprehensive metadata definitions in your schema.yml files:

version: 2
 
sources:
  - name: ods
    database: example
    schema: public
    tags: ['ods', 'raw']
    tables:
      - name: ods_orders
        description: 'Original order data from the operational system'
        tags: ['orders']
        columns:
          - name: orderid
            description: 'Unique order identifier'
            tests:
              - not_null
              - unique
          - name: orderdate
            description: 'Date when the order was placed'
            tests:
              - not_null
 
models:
  - name: dim_customers
    description: 'Customer dimension table with cleaned and standardized data'
    columns:
      - name: customer_id
        description: 'Primary key for customer dimension'
        tests:
          - unique
          - not_null
      - name: customer_name
        description: 'Standardized customer name'
        tests:
          - not_null
 
seeds:
  - name: country_codes
    description: 'ISO-3166 country codes reference table'
    columns:
      - name: short_name
        type: text
        description: 'Country name'
        tests:
          - not_null
          - unique
      - name: alpha2code
        type: text
        description: 'Two-letter country code'
        tests:
          - not_null
          - unique
      - name: alpha3code
        type: text
        description: 'Three-letter country code'
        tests:
          - not_null
          - unique

Data Lineage Management

dbt automatically generates lineage diagrams based on source() and ref() function usage, providing clear visibility into data dependencies and transformation flows.

Hands-On Tutorial

Project Initialization

Create and initialize a new dbt project:

# Create new dbt project
dbt init example
 
# Navigate to project directory
cd example
 
# Initialize git repository
git init
git add .
git commit -m 'Initial dbt project setup'

Running Built-in Examples

Execute the default models and tests to verify your setup:

# Run all models
dbt run
 
# Execute all tests
dbt test

Troubleshooting Test Failures:

If you encounter test failures like not_null_my_first_dbt_model_id, check the model definition. The issue is typically in models/example/my_first_dbt_model.sql where line 16 generates a NULL value.

Solution: Uncomment line 27 by changing -- where id is not null to where id is not null.

After fixing, run the commands again:

dbt run && dbt test
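For reference, after the fix the relevant part of the scaffold model reads roughly as follows (the exact contents may vary slightly between dbt versions):

```sql
-- models/example/my_first_dbt_model.sql (dbt scaffold, after uncommenting the filter)
with source_data as (
    select 1 as id
    union all
    select null as id   -- this row previously caused the not_null test to fail
)

select *
from source_data
where id is not null    -- filter out the NULL row so the test passes
```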

Loading Reference Data with Seeds

Seeds enable you to load CSV files as reference tables:

  1. Download the sample file: country_codes.csv
  2. Place it in the seeds/ directory
  3. Load the data:
dbt seed
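The seed file is a plain CSV whose header row must match the column names declared in schema.yml. Its first rows look something like this (illustrative excerpt; the actual file contains the full ISO-3166 list):

```
short_name,alpha2code,alpha3code
Afghanistan,AF,AFG
Albania,AL,ALB
Algeria,DZ,DZA
```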

Verify the loaded data using psql:

\d+ country_codes
SELECT * FROM country_codes LIMIT 10;

Creating Custom Tests

Implement custom data quality tests for business-specific validation rules:

  1. Define test configuration in seeds/schema.yml:
version: 2
 
seeds:
  - name: country_codes
    columns:
      - name: short_name
        tests:
          - unique
          - not_null
          - count_between:
              range_left: 100
              range_right: 300
  2. Create custom test macro in tests/generic/count_between.sql:
{% test count_between(model, column_name, range_left, range_right) %}
 
WITH agg_result AS (
    SELECT COUNT({{ column_name }}) AS record_count 
    FROM {{ model }}
),
validation_result AS (
    SELECT * 
    FROM agg_result
    WHERE record_count NOT BETWEEN {{ range_left }} AND {{ range_right }}
)
 
SELECT * FROM validation_result
 
{% endtest %}
  3. Execute the custom test:
dbt test --select country_codes

Building Production Models

Create production-ready models with proper source definitions:

  1. Load sample data:

    • Download: ods_orders.sql.gz
    • Extract and load: psql <connection_string> -f ods_orders.sql
  2. Define source schema in models/erp/schema.yml:

version: 2
 
sources:
  - name: ods
    database: example
    schema: public
    tags: ['ods', 'source']
    tables:
      - name: ods_orders
        description: 'Raw order data from operational systems'
        columns:
          - name: orderid
            description: 'Unique order identifier'
          - name: orderdate
            description: 'Order placement date'
          - name: shipdate
            description: 'Shipment date'
          - name: shipmode
            description: 'Shipping method'
          - name: customername
            description: 'Customer full name'
          - name: country
            description: 'Customer country'
          - name: sales
            description: 'Total sales amount'
          - name: quantity
            description: 'Item quantity'
          - name: discount
            description: 'Applied discount rate'
          - name: profit
            description: 'Calculated profit margin'
  3. Create cleaned model in models/erp/dw_orders.sql:
-- Clean and filter order data
{{ config(materialized='table') }}
 
SELECT 
    orderid,
    orderdate,
    shipdate,
    shipmode,
    customername,
    country,
    sales,
    quantity,
    discount,
    profit,
    profit / sales AS profit_margin
FROM {{ source('ods', 'ods_orders') }}
WHERE sales IS NOT NULL 
  AND sales > 0
  AND orderdate IS NOT NULL
  4. Execute the model:
dbt run --select erp
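A downstream mart model can then build on the cleaned table via ref(). The model below is a hypothetical example (the name mart_country_sales is not part of the sample project) showing how a reporting-layer aggregation might look:

```sql
-- models/erp/mart_country_sales.sql (hypothetical example)
{{ config(materialized='table') }}

-- Aggregate cleaned orders per country; ref() ensures dw_orders builds first
SELECT
    country,
    SUM(sales)  AS total_sales,
    SUM(profit) AS total_profit,
    COUNT(*)    AS order_count
FROM {{ ref('dw_orders') }}
GROUP BY country
```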

Documentation Generation

Generate comprehensive documentation for your data models:

  1. Build documentation:
dbt docs generate
  2. Serve documentation locally:
dbt docs serve

The documentation includes interactive lineage graphs and detailed model descriptions.

Production Configuration

Database Connection Setup

Configure your production connection in profiles.yml:

File Location:

  • macOS/Linux: $HOME/.dbt/profiles.yml
  • Windows: C:\Users\<username>\.dbt\profiles.yml

Configuration:

example:
  outputs:
    dev:
      type: postgres
      threads: 4
      host: <your-tacnode-host>
      port: 5432
      user: <your-username>
      pass: <your-password>
      dbname: example
      schema: public
      keepalives_idle: 0
      connect_timeout: 10
      retries: 1
    
    prod:
      type: postgres
      threads: 8
      host: <your-production-host>
      port: 5432
      user: <your-prod-username>
      pass: <your-prod-password>
      dbname: example
      schema: public
      
  target: dev

Best Practices

Model Organization

  • Staging Models: Clean and standardize raw data
  • Intermediate Models: Business logic and joins
  • Mart Models: Final tables for analytics and reporting

Performance Optimization

  • Use {{ config(materialized='table') }} for frequently accessed models
  • Implement incremental models for large datasets
  • Add appropriate indexes in post-hooks
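The incremental and post-hook suggestions above can be sketched as follows. This is a minimal illustration assuming orderdate is a reliable high-water mark; the model and index names are hypothetical:

```sql
-- Hypothetical incremental variant of the orders model
{{ config(
    materialized='incremental',
    unique_key='orderid',
    post_hook='CREATE INDEX IF NOT EXISTS idx_dw_orders_orderdate ON {{ this }} (orderdate)'
) }}

SELECT *
FROM {{ source('ods', 'ods_orders') }}
{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what is already loaded
  WHERE orderdate > (SELECT MAX(orderdate) FROM {{ this }})
{% endif %}
```

On the first run dbt builds the full table; on subsequent runs only the filtered rows are processed and merged on unique_key, and the post-hook keeps the orderdate index in place after each build.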

Data Quality

  • Implement comprehensive testing at each layer
  • Use custom tests for business-specific validation
  • Monitor test results in CI/CD pipelines

Documentation

  • Maintain detailed descriptions for all models and columns
  • Include business context and assumptions
  • Keep documentation current with code changes

Sample Project

Download the complete example project: example.tgz

The project includes:

  • Complete model definitions
  • Custom tests and macros
  • Production-ready configuration
  • Git history showing incremental development steps

Review the git commit history to understand the step-by-step development process and best practices implementation.