File-based Foreign Table

File-based foreign tables enable direct access to data stored in cloud object storage systems like AWS S3 and other compatible object stores. This guide covers implementing and optimizing file-based external data integration.

Overview and Benefits

File-based foreign tables provide seamless integration with cloud object storage, enabling:

  • Direct Data Access: Query files directly without importing
  • Cost Efficiency: Access data on-demand without storage duplication
  • Scalability: Handle petabyte-scale datasets
  • Flexibility: Support multiple file formats and storage providers
  • Real-time Analytics: Query latest data without ETL delays

Supported File Formats

FormatUse CaseCompressionSchema Evolution
ParquetAnalytics, OLAPBuilt-in (Snappy, GZIP)Schema evolution support
ORCBig data processingBuilt-in (ZLIB, Snappy)Advanced schema support
CSVData exchange, importsExternal (GZIP)No schema support
JSONSemi-structured dataExternal (GZIP)Schema-on-read

Installation and Setup

Install Required Extension

Install the file foreign data wrapper extension:

-- Install extension (run once per database)
CREATE EXTENSION IF NOT EXISTS file_fdw;
 
-- Verify installation
SELECT extname, extversion 
FROM pg_extension 
WHERE extname = 'file_fdw';

Extension Requirements

The file_fdw extension must be installed with superuser privileges and is available by default in Tacnode clusters.

Verify Available Foreign Data Wrappers

-- List all available FDWs
SELECT fdwname, fdwhandler, fdwoptions 
FROM pg_foreign_data_wrapper;
 
-- Check file_fdw capabilities
\des+ file_fdw

AWS S3 Integration

Create S3 Foreign Server

-- Create foreign server for AWS S3
CREATE SERVER s3_production FOREIGN DATA WRAPPER file_fdw
OPTIONS (
    REGION 'us-west-2',
    -- Optional: Custom endpoint for S3-compatible storage
    ENDPOINT 'https://s3.us-west-2.amazonaws.com'
);
 
-- Create server for different regions
CREATE SERVER s3_europe FOREIGN DATA WRAPPER file_fdw
OPTIONS (
    REGION 'eu-west-1'
);

Configure User Mapping for S3

-- Create user mapping with IAM credentials
CREATE USER MAPPING FOR current_user SERVER s3_production
OPTIONS (
    ACCESS_ID 'AKIA...your-access-key-id',
    ACCESS_KEY 'your-secret-access-key'
);
 
-- Create mapping for specific database user
CREATE USER MAPPING FOR data_analyst SERVER s3_production
OPTIONS (
    ACCESS_ID 'AKIA...analyst-access-key',
    ACCESS_KEY 'analyst-secret-key'
);

Create S3 Foreign Tables

Parquet Files

-- Create foreign table for Parquet data
CREATE FOREIGN TABLE sales_parquet (
    sale_id BIGINT,
    customer_id INT,
    product_id INT,
    sale_date DATE,
    quantity INT,
    unit_price NUMERIC(10,2),
    total_amount NUMERIC(12,2),
    region VARCHAR(50),
    sales_rep VARCHAR(100)
) SERVER s3_production
OPTIONS (
    DIR 's3://company-data-lake/sales/year=2024/',
    FORMAT 'parquet'
);
 
-- Partitioned Parquet data with dynamic discovery
CREATE FOREIGN TABLE sales_partitioned (
    sale_id BIGINT,
    customer_id INT,
    product_id INT,
    sale_date DATE,
    quantity INT,
    unit_price NUMERIC(10,2),
    total_amount NUMERIC(12,2),
    -- Partition columns
    year INT,
    month INT,
    region VARCHAR(50)
) SERVER s3_production
OPTIONS (
    DIR 's3://company-data-lake/sales/',
    FORMAT 'parquet',
);

ORC Files

-- Create foreign table for ORC format
CREATE FOREIGN TABLE user_activity_orc (
    user_id BIGINT,
    session_id VARCHAR(64),
    activity_timestamp TIMESTAMP,
    page_url VARCHAR(500),
    action_type VARCHAR(50),
    duration_seconds INT,
    device_type VARCHAR(20),
    ip_address INET
) SERVER s3_production
OPTIONS (
    DIR 's3://analytics-bucket/user-activity/orc/',
    FORMAT 'orc'
);

CSV Files

-- Create foreign table for CSV files
CREATE FOREIGN TABLE customer_imports (
    customer_id INT,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100),
    phone VARCHAR(20),
    registration_date DATE,
    country_code VARCHAR(2)
) SERVER s3_production
OPTIONS (
    DIR 's3://import-bucket/customers/',
    FORMAT 'csv',
    HEADER 'true',
    DELIMITER ',',
    QUOTE '"',
    ESCAPE '\',
    NULL_STRING 'NULL'
);

Query S3 Data

-- Basic queries
SELECT region, COUNT(*) as sale_count, SUM(total_amount) as total_revenue
FROM sales_parquet
WHERE sale_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;
 
-- Join S3 data with local tables
SELECT 
    s.sale_id,
    s.total_amount,
    c.customer_name,
    c.customer_tier
FROM sales_parquet s
JOIN customers c ON s.customer_id = c.customer_id
WHERE s.sale_date BETWEEN '2024-01-01' AND '2024-01-31';
 
-- Aggregate across partitioned data
SELECT 
    year,
    month,
    region,
    COUNT(*) as transactions,
    SUM(total_amount) as revenue
FROM sales_partitioned
WHERE year = 2024 AND month >= 1
GROUP BY year, month, region
ORDER BY year, month, region;

File Formats

Parquet Optimization

-- Optimized Parquet foreign table
CREATE FOREIGN TABLE optimized_sales (
    sale_id BIGINT,
    customer_id INT,
    sale_date DATE,
    amount NUMERIC(12,2),
    -- Partition columns should be at the end
    year INT,
    quarter INT
) SERVER s3_production
OPTIONS (
    DIR 's3://sales-data/parquet/',
    FORMAT 'parquet'
);

ORC Configuration

-- ORC with advanced configuration
CREATE FOREIGN TABLE orc_analytics (
    transaction_id BIGINT,
    amount NUMERIC(15,2),
    timestamp TIMESTAMP,
    metadata JSONB
) SERVER s3_production
OPTIONS (
    DIR 's3://analytics/orc-data/',
    FORMAT 'orc'
);

CSV with Advanced Options

-- CSV with complex formatting
CREATE FOREIGN TABLE complex_csv (
    id INT,
    name VARCHAR(200),
    description TEXT,
    price NUMERIC(10,2),
    tags TEXT[],
    metadata JSONB
) SERVER s3_production
OPTIONS (
    DIR 's3://imports/csv-data/',
    FORMAT 'csv',
    HEADER 'true',
    DELIMITER ',',
    QUOTE '"',
    ESCAPE '\',
    NULL_STRING 'NULL',
    ENCODING 'UTF8',
    -- Handle large files
    MAX_FILE_SIZE '1073741824',  -- 1GB
    -- Parallel processing
    PARALLEL_WORKERS '4'
);

Batch Foreign Table Creation (IMPORT FOREIGN SCHEMA)

The file_fdw extension supports batch creation of foreign tables using the IMPORT FOREIGN SCHEMA command, which simplifies management of external file-based data sources.

Syntax Overview

IMPORT FOREIGN SCHEMA remote_schema
    [ { LIMIT TO | EXCEPT } ( table_name [, ...] ) ]
    FROM SERVER server_name
    INTO local_schema
    [ OPTIONS ( option 'value' [, ... ] ) ]

Parameter Specifications

ParameterDescriptionValues
remote_schemaExtended semantic support for file formatsparquet, orc
LIMIT TOImport only specified tablesList of table names
EXCEPTExclude specified tables from importList of table names
server_namePreviously created foreign serverServer identifier
local_schemaTarget schema for foreign table metadataSchema name
OPTIONSAdditional configuration parametersKey-value pairs

Batch Creation Examples

Batch Import Parquet Files from S3

-- Import all Parquet files from S3 directory
IMPORT FOREIGN SCHEMA parquet
    FROM SERVER s3_production
    INTO public
    OPTIONS (
        DIR 's3://analytics-data/parquet-tables/'
    );

Selective Import with Table Filtering

-- Import only specific ORC tables from object storage
IMPORT FOREIGN SCHEMA orc
    LIMIT TO (sales_data, customer_metrics, product_catalog)
    FROM SERVER s3_production
    INTO analytics_schema
    OPTIONS (
        DIR 's3://warehouse/orc-files/'
    );

Import with Exclusions

-- Import all tables except specified ones
IMPORT FOREIGN SCHEMA parquet
    EXCEPT (temp_data, backup_tables)
    FROM SERVER s3_production
    INTO data_lake
    OPTIONS (
        DIR 's3://production-data/structured/'
    );

Advanced Configuration Options

-- Batch import with advanced settings
IMPORT FOREIGN SCHEMA parquet
    FROM SERVER oss_analytics
    INTO warehouse_schema
    OPTIONS (
        DIR 'oss://analytics-bucket/quarterly-data/',
        -- Performance tuning
        PARALLEL_WORKERS '8',
        MAX_FILE_SIZE '2147483648',  -- 2GB
        -- Schema validation
        VALIDATE_SCHEMA 'true',
        -- Compression handling
        COMPRESSION_TYPE 'snappy'
    );

Operational Considerations

File Discovery Process:

  • Automatically scans specified directory for compatible files
  • Generates table schemas based on file metadata
  • Creates foreign table definitions with optimized settings

Schema Inference:

  • Parquet: Leverages embedded schema information
  • ORC: Uses file footer schema metadata
  • Automatic data type mapping to TacNode equivalents

Performance Optimization:

-- Monitor imported foreign tables
SELECT 
    schemaname,
    tablename,
    -- Check foreign table options
    array_to_string(array_agg(option_name || '=' || option_value), ', ') as table_options
FROM information_schema.foreign_table_options fto
JOIN information_schema.foreign_tables ft 
    ON fto.foreign_table_name = ft.foreign_table_name
WHERE foreign_server_name = 's3_production'
GROUP BY schemaname, tablename;

Limitations and Best Practices

Current Limitations:

  • Supports only Parquet and ORC file formats
  • No partitioned table support
  • CSV and JSON formats require manual table creation

Best Practices:

-- Pre-validate directory structure
SELECT 'Checking S3 directory...' as status;
 
-- Batch import with error handling
DO $$
BEGIN
    -- Attempt batch import
    EXECUTE 'IMPORT FOREIGN SCHEMA parquet 
             FROM SERVER s3_production 
             INTO staging_schema 
             OPTIONS (DIR ''s3://data-lake/quarterly/'')';
    
    RAISE NOTICE 'Batch import completed successfully';
EXCEPTION 
    WHEN OTHERS THEN
        RAISE WARNING 'Import failed: %', SQLERRM;
        -- Fallback to manual table creation
END $$;

Directory Organization:

s3://your-bucket/
├── parquet-data/
│   ├── sales_2024_q1.parquet
│   ├── sales_2024_q2.parquet
│   └── customers.parquet
└── orc-data/
    ├── transactions.orc
    └── products.orc

This comprehensive approach to file-based foreign tables enables efficient access to cloud object storage while maintaining optimal performance for your specific use cases.