File-based Foreign Table
Learn how to efficiently manage foreign tables in S3, and Object Store with our comprehensive guide, enhancing your data integration and access strategies.
File-based foreign tables enable direct access to data stored in cloud object storage systems like AWS S3 and other compatible object stores. This guide covers implementing and optimizing file-based external data integration.
Overview and Benefits
File-based foreign tables provide seamless integration with cloud object storage, enabling:
- Direct Data Access: Query files directly without importing
- Cost Efficiency: Access data on-demand without storage duplication
- Scalability: Handle petabyte-scale datasets
- Flexibility: Support multiple file formats and storage providers
- Real-time Analytics: Query latest data without ETL delays
Supported File Formats
| Format | Use Case | Compression | Schema Evolution |
|---|---|---|---|
| Parquet | Analytics, OLAP | Built-in (Snappy, GZIP) | Schema evolution support |
| ORC | Big data processing | Built-in (ZLIB, Snappy) | Advanced schema support |
| JSON | Semi-structured data | External (GZIP) | Schema-on-read |
Installation and Setup
The following instructions are written to be run from the psql command line
Install Required Extension
Install the file foreign data wrapper extension:
-- Install extension (run once per database)
CREATE EXTENSION IF NOT EXISTS file_fdw;
-- Verify installation
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'file_fdw';
The file_fdw extension must be installed with superuser privileges and is available by default in Tacnode clusters.
Verify Available Foreign Data Wrappers
-- List all available FDWs
SELECT fdwname, fdwhandler, fdwoptions
FROM pg_foreign_data_wrapper;
-- Check file_fdw capabilities
\des+ file_fdw
AWS S3 Integration
Create S3 Foreign Server
-- Create foreign server for AWS S3 the AWS_REGION should be the region that the S3 bucket resides in
CREATE SERVER s3_production FOREIGN DATA WRAPPER file_fdw
OPTIONS (
AWS_REGION 'us-west-2'
);
Configure User Mapping for S3 (This step is not needed for publically available buckets)
-- Create user mapping with IAM credentials
CREATE USER MAPPING FOR current_user SERVER s3_production
OPTIONS (
AWS_ACCESS_ID 'AKIA...your-access-key-id',
AWS_ACCESS_KEY 'your-secret-access-key'
);
-- Create mapping for specific database user
CREATE USER MAPPING FOR data_analyst SERVER s3_production
OPTIONS (
AWS_ACCESS_ID 'AKIA...analyst-access-key',
AWS_ACCESS_KEY 'analyst-secret-key'
);
Create S3 Foreign Tables
Parquet Files
-- Create foreign table for Parquet data
CREATE FOREIGN TABLE sales_parquet (
sale_id BIGINT,
customer_id INT,
product_id INT,
sale_date DATE,
quantity INT,
unit_price NUMERIC(10,2),
total_amount NUMERIC(12,2),
region VARCHAR(50),
sales_rep VARCHAR(100)
) SERVER s3_production
OPTIONS (
DIR 's3://company-data-lake/sales/year=2024/',
FORMAT 'parquet'
);
-- Partitioned Parquet data with dynamic discovery
CREATE FOREIGN TABLE sales_partitioned (
sale_id BIGINT,
customer_id INT,
product_id INT,
sale_date DATE,
quantity INT,
unit_price NUMERIC(10,2),
total_amount NUMERIC(12,2),
-- Partition columns
year INT,
month INT,
region VARCHAR(50)
) SERVER s3_production
OPTIONS (
DIR 's3://company-data-lake/sales/',
FORMAT 'parquet'
);
ORC Files
-- Create foreign table for ORC format
CREATE FOREIGN TABLE user_activity_orc (
user_id BIGINT,
session_id VARCHAR(64),
activity_timestamp TIMESTAMP,
page_url VARCHAR(500),
action_type VARCHAR(50),
duration_seconds INT,
device_type VARCHAR(20),
ip_address INET
) SERVER s3_production
OPTIONS (
DIR 's3://analytics-bucket/user-activity/orc/',
FORMAT 'orc'
);
Query S3 Data
-- Basic queries
SELECT region, COUNT(*) as sale_count, SUM(total_amount) as total_revenue
FROM sales_parquet
WHERE sale_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;
-- Join S3 data with local tables
SELECT
s.sale_id,
s.total_amount,
c.customer_name,
c.customer_tier
FROM sales_parquet s
JOIN customers c ON s.customer_id = c.customer_id
WHERE s.sale_date BETWEEN '2024-01-01' AND '2024-01-31';
-- Aggregate across partitioned data
SELECT
year,
month,
region,
COUNT(*) as transactions,
SUM(total_amount) as revenue
FROM sales_partitioned
WHERE year = 2024 AND month >= 1
GROUP BY year, month, region
ORDER BY year, month, region;
File Formats
Parquet Optimization
-- Optimized Parquet foreign table
CREATE FOREIGN TABLE optimized_sales (
sale_id BIGINT,
customer_id INT,
sale_date DATE,
amount NUMERIC(12,2),
-- Partition columns should be at the end
year INT,
quarter INT
) SERVER s3_production
OPTIONS (
DIR 's3://sales-data/parquet/',
FORMAT 'parquet'
);
ORC Configuration
-- ORC with advanced configuration
CREATE FOREIGN TABLE orc_analytics (
transaction_id BIGINT,
amount NUMERIC(15,2),
timestamp TIMESTAMP,
metadata JSONB
) SERVER s3_production
OPTIONS (
DIR 's3://analytics/orc-data/',
FORMAT 'orc'
);
Batch Foreign Table Creation (IMPORT FOREIGN SCHEMA)
The file_fdw extension supports batch creation of foreign tables using the IMPORT FOREIGN SCHEMA command, which simplifies management of external file-based data sources.
Syntax Overview
IMPORT FOREIGN SCHEMA remote_schema
[ { LIMIT TO | EXCEPT } ( table_name [, ...] ) ]
FROM SERVER server_name
INTO local_schema
[ OPTIONS ( option 'value' [, ... ] ) ]
Parameter Specifications
| Parameter | Description | Values |
|---|---|---|
remote_schema | Extended semantic support for file formats | public |
LIMIT TO | Import only specified tables | List of table names |
EXCEPT | Exclude specified tables from import | List of table names |
server_name | Previously created foreign server | Server identifier |
local_schema | Target schema for foreign table metadata | Schema name |
OPTIONS | Additional configuration parameters | Key-value pairs |
Batch Creation Examples
Batch Import Parquet Files from S3
-- Import all Parquet files from S3 directory
IMPORT FOREIGN SCHEMA public
FROM SERVER s3_production
INTO public
OPTIONS (
FORMAT 'parquet',
DIR 's3://analytics-data/parquet-tables/'
);
Selective Import with Table Filtering
-- Import only specific ORC tables from object storage
IMPORT FOREIGN SCHEMA public
LIMIT TO (sales_data, customer_metrics, product_catalog)
FROM SERVER s3_production
INTO analytics_schema
OPTIONS (
FORMAT 'orc',
DIR 's3://warehouse/orc-files/'
);
Import with Exclusions
-- Import all tables except specified ones
IMPORT FOREIGN SCHEMA public
EXCEPT (temp_data, backup_tables)
FROM SERVER s3_production
INTO data_lake
OPTIONS (
FORMAT 'parquet',
DIR 's3://production-data/structured/'
);
Operational Considerations
File Discovery Process:
- Automatically scans specified directory for compatible files
- Generates table schemas based on file metadata
- Creates foreign table definitions with optimized settings
Schema Inference:
- Parquet: Leverages embedded schema information
- ORC: Uses file footer schema metadata
- Automatic data type mapping to TacNode equivalents
Limitations and Best Practices
Current Limitations:
- Supports only Parquet and ORC file formats
- No partitioned table support
- JSON formats require manual table creation
Directory Organization:
s3://your-bucket/
├── parquet-data/
│ ├── sales_2024_q1.parquet
│ ├── sales_2024_q2.parquet
│ └── customers.parquet
└── orc-data/
├── transactions.orc
└── products.orc
This comprehensive approach to file-based foreign tables enables efficient access to cloud object storage while maintaining optimal performance for your specific use cases.