Medallion Architecture: Bronze, Silver and Gold Layers in Modern Lakehouses
TL;DR: Medallion architecture organizes your lakehouse into three layers: bronze (raw, immutable data), silver (cleaned, conformed enterprise data), and gold (curated, business-ready data products). It's a logical pattern — not a product — that prevents data lakes from becoming data swamps. The pattern excels for analytics and BI but has a structural limitation for real-time automated decisions: the multi-hop data flow adds propagation delay that can't match tight decision validity windows.
Data lakes promised unlimited flexibility. What many organizations got instead was a data swamp—millions of files scattered across folders with no clear lineage, duplicated transformations, and dashboards nobody trusted.
Medallion architecture emerged as the antidote. This layered design pattern organizes your lakehouse into bronze, silver, and gold layers, each with distinct responsibilities, bringing order to chaos through progressive data refinement at each stage. In this guide, you’ll learn exactly how each layer works, when to apply the pattern, and how to implement it effectively.
What is medallion architecture?
Medallion architecture is a layered data design pattern that organizes data into three progressive stages: Bronze, Silver, and Gold. Databricks popularized this medallion architecture terminology around 2019–2020 when promoting the data lakehouse paradigm, though the underlying concept—progressively refining data through distinct stages—has roots in traditional data warehouses and data warehousing practices.
The critical point to understand is that medallion architecture is a logical pattern, not a specific product or technology. You can implement medallion architecture on Databricks, Microsoft Fabric, Snowflake, Azure Synapse, Google BigQuery, or any platform that supports structured data storage and processing. The medallion architecture pattern remains consistent regardless of your underlying infrastructure.
At its core, the medallion architecture pattern works like this: raw data lands in the Bronze layer exactly as it arrives from external source systems. The Silver layer then transforms this raw data into cleaned, standardized datasets aligned with business entities. Finally, the Gold layer shapes the data into optimized, business-ready data products that power dashboards, reports, business intelligence tools, and machine learning models.
This progression represents the fundamental idea behind medallion architecture—data moves from raw to analytics-ready through incremental data refinement. Each data layer adds structure, improves data quality, and builds business context. By the time data reaches the Gold layer, it carries the full weight of validation, enrichment, and business rules that make it trustworthy for decision-making.
You’ll also hear medallion architecture called “multi-hop” architecture, particularly in streaming and batch data engineering scenarios. This terminology emphasizes how data “hops” through data layers, with each hop applying specific data transformations before passing data downstream. Whether you’re processing data in real-time streams or scheduled batch jobs, the multi-hop concept in medallion architecture applies equally.
The medallion architecture pattern’s flexibility makes it powerful. Some organizations stick strictly to three layers. Others introduce sub-layers within Silver or add a “Platinum” layer for specialized machine learning features. The bronze, silver, and gold foundation remains consistent, but implementation details adapt to organizational needs.
Why medallion architecture matters in data lakes and lakehouses
Between 2015 and 2025, cloud object storage fundamentally changed how organizations store data. AWS S3, Azure Data Lake Storage, and Google Cloud Storage made it economically viable to retain massive data volumes without the strict schema requirements of traditional data warehouses. This flexibility spawned the data lake concept—a centralized repository holding structured and unstructured data in native formats.
The problem? Flexibility without structure creates chaos. Organizations discovered that dumping data into a data lake without proper organization led to what practitioners grimly call the “data swamp.” Files multiplied across inconsistent folder structures. Multiple teams built duplicate data pipelines performing similar transformations. Nobody could trace how a dashboard metric connected back to source systems. Data governance became nearly impossible.
The data lakehouse emerged as the solution—combining data lake flexibility with data warehouse features like ACID transactions, schema enforcement, and query optimization. Platforms like Databricks Lakehouse and Microsoft Fabric embody this data lakehouse approach. But a data lakehouse still needs logical organization, which is where medallion architecture fits.
Medallion architecture provides the organizing framework that prevents data lakehouses from degrading into swamps. The data lakehouse paradigm gives you the storage engine; medallion architecture gives you the logical structure.
Traditional enterprise data warehouses used similar layered approaches: staging areas for raw data ingestion, operational data stores (ODS) for cleaned data, core data warehouse tables, and finally data marts for consumption. Medallion architecture parallels this structure but adapts it for data lakehouse realities—supporting schema-on-read, handling large data volumes efficiently, and accommodating streaming alongside batch processing.
Key problems medallion architecture solves:

- Uncontrolled folder structures and schema drift across data files
- Duplicated transformation logic spread across multiple data pipelines
- Inconsistent metric definitions causing conflicting reports
- Poor data lineage making debugging nearly impossible
- Difficulty handling late-arriving or out-of-order incoming data
- Unclear ownership of datasets and data transformations
- Compliance challenges from lack of historical data preservation
The benefits extend beyond technical data organization. Data engineers know exactly where to build data ingestion pipelines. Data scientists can access cleaned datasets without repeating data cleansing work. Business users trust gold layer data because it carries explicit data quality guarantees. Everyone collaborates more effectively because medallion architecture provides a shared mental model for where data lives and what state it’s in.
Bronze layer: raw and immutable data
The Bronze layer serves as your landing zone—the first stop for all data entering your medallion architecture. Raw data arrives here exactly as it exists in external source systems, whether that’s relational databases, SaaS applications like Salesforce, Kafka topics carrying real-time events, REST APIs, CSV exports, or IoT device streams. The bronze layer principle is simple: capture everything, transform nothing.
The bronze layer is typically append-only, meaning you add new records without modifying existing data. You preserve complete historical data, enabling replay of downstream data pipelines if business logic changes or bugs require reprocessing. The bronze layer supports schema-on-read approaches, allowing you to store data in efficient formats like Delta Lake on Parquet or Apache Iceberg without enforcing rigid schemas during data ingestion.
Common metadata columns enrich bronze layer tables for data lineage and auditing purposes. A well-designed bronze layer table typically includes:

- ingestion_timestamp: When the record landed in the data lakehouse
- source_system: Origin of the data (e.g., “salesforce_crm”, “web_events_kafka”)
- batch_id or file_name: Identifies the specific load for traceability
- checksum or hash: Validates data integrity against source
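These metadata columns can be sketched in plain, platform-agnostic Python. The function and field names below are illustrative choices, not any standard API:

```python
import hashlib
import json
from datetime import datetime, timezone

def to_bronze_record(raw_payload: dict, source_system: str, batch_id: str) -> dict:
    """Wrap an incoming payload with the audit metadata a bronze table carries.

    The payload itself is stored untouched: bronze never parses or cleans it.
    """
    serialized = json.dumps(raw_payload, sort_keys=True)
    return {
        "raw_payload": serialized,  # source data, verbatim
        "ingestion_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_system": source_system,  # e.g. "salesforce_crm"
        "batch_id": batch_id,  # load identifier for traceability
        # integrity check against the source extract
        "checksum": hashlib.sha256(serialized.encode()).hexdigest(),
    }

record = to_bronze_record(
    {"event": "page_view", "user": 42}, "web_events_kafka", "2024-06-01T00"
)
```

Note that the raw payload is serialized and stored as-is; downstream silver pipelines decide how to parse it.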
Consider a concrete example: clickstream events flowing from a Kafka topic into a Databricks bronze layer table. Events arrive continuously as users navigate your website. Each event contains a raw JSON payload with nested fields—user actions, page URLs, session identifiers, timestamps. The bronze layer table stores these events partitioned by event_date and region, with columns for raw_payload (the complete JSON), ingestion_time, and source metadata. No parsing, no data validation, no transformation—just raw data ingestion and capture.
The bronze layer’s key responsibilities in medallion architecture include:

- Capturing full history without data loss
- Supporting change data capture for tracking data modification in source systems
- Enabling pipeline replay without re-reading operational source systems
- Preserving data fidelity for audit and regulatory compliance requirements
- Handling diverse data formats including structured and unstructured data
The trade-off is clear: minimal data transformations mean downstream consumers face nested structures, missing fields, and schema variations. But this trade-off is intentional. The bronze layer prioritizes completeness and recoverability over convenience.
Bronze layer do’s and don’ts:
| Do | Don't |
|---|---|
| Always retain raw copies of source data | Clean, filter, or aggregate data |
| Include metadata columns for data lineage | Drop fields to save space |
| Partition by ingestion date or logical splits | Overwrite without backup mechanisms |
| Use data formats supporting time-travel (Delta, Iceberg) | Expose bronze layer tables directly to business users |
| Enforce naming and folder conventions | Allow uncontrolled schema changes |
| Document retention policies | Skip checksums or data integrity validation |
Silver layer: cleansed and conformed enterprise data
The silver layer transforms raw bronze layer data into cleaned, standardized, and integrated datasets. This is where unprocessed data becomes structured data aligned with your organization’s core business entities—Customer, Order, Product, Asset, and other fundamental concepts that drive your operations. The silver layer is the workhorse of medallion architecture, where most data refinement occurs.
Typical silver layer data transformation activities include type casting (converting string dates to proper datetime formats, parsing decimals), deduplication to eliminate duplicate records from external source systems, null handling to manage missing values consistently, and standardizing reference values like country codes, currency symbols, and product categories. The silver layer also handles late-arriving data through watermarking or delayed joins, ensuring data integrity even when events arrive out of order.
Data validation and data cleaning become systematic at the silver layer stage. You apply data quality rules that reject or flag records failing business constraints. A customer email must be valid. An order amount must be positive. A transaction date cannot be in the future. These data quality checks at the silver layer prevent bad data from propagating to gold layer consumption and incrementally improve data quality across the entire medallion architecture.
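A minimal sketch of such systematic checks, assuming a hypothetical rule set (valid email, positive amount, no future dates) applied to plain dictionaries; a real pipeline would express these rules in its framework of choice:

```python
import re
from datetime import date

# Hypothetical rules for a silver `orders` table: each returns True when a record passes.
RULES = {
    "valid_email": lambda r: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email", "")) is not None,
    "positive_amount": lambda r: r.get("amount", 0) > 0,
    "date_not_future": lambda r: r.get("order_date", date.max) <= date.today(),
}

def validate(records):
    """Split records into accepted rows and rejects tagged with the rules they failed."""
    accepted, rejected = [], []
    for r in records:
        failed = [name for name, rule in RULES.items() if not rule(r)]
        (rejected if failed else accepted).append({**r, "failed_rules": failed})
    return accepted, rejected

good, bad = validate([
    {"email": "a@example.com", "amount": 19.99, "order_date": date(2024, 3, 1)},
    {"email": "not-an-email", "amount": -5, "order_date": date(2024, 3, 1)},
])
```

Tagging rejects with the failed rule names, rather than silently dropping them, keeps the quarantine auditable.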
Silver layer data modeling approaches vary based on organizational needs. Some teams prefer Third Normal Form (3NF) data models for their flexibility and reduced redundancy. Others adopt Data Vault 2.0 data models for complex enterprise scenarios requiring extensive data lineage and auditability. Many organizations choose lightly normalized “enterprise views” stored as Delta or Iceberg tables—practical data models that compromise between normalization principles and query performance. The right silver layer data models depend on your data sources, data volumes, and downstream consumption patterns.
Consider a concrete example: an e-commerce company with raw orders data in the bronze layer spanning 2022–2024. The silver layer produces several conformed tables:

- dim_customer: Cleaned customer records with deduplicated emails, standardized address formatting, and resolved identity merges
- dim_product: Product catalog with normalized taxonomy, categories, and attributes
- fact_order: Transactional data with correct types, enriched data including currency conversions using reference lookup tables, and flagged anomalies
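The deduplication behind a table like dim_customer can be sketched as follows, assuming email is the match key and updated_at selects the surviving record (real identity resolution is usually more involved):

```python
def deduplicate_customers(rows):
    """Collapse duplicate customer rows to the most recent record per normalized email."""
    latest = {}
    for row in rows:
        key = row["email"].strip().lower()  # standardize before matching
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = {**row, "email": key}
    return sorted(latest.values(), key=lambda r: r["email"])

dim_customer = deduplicate_customers([
    {"email": "Ada@Example.com ", "name": "Ada", "updated_at": "2024-01-05"},
    {"email": "ada@example.com", "name": "Ada Lovelace", "updated_at": "2024-03-20"},
    {"email": "grace@example.com", "name": "Grace", "updated_at": "2023-11-02"},
])
```

Normalizing the key before comparison is what makes "Ada@Example.com " and "ada@example.com" collapse into one record.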
These silver layer tables become the foundation for advanced analytics, data science work, and downstream gold layer data marts. A data scientist building a churn prediction model pulls features from the silver layer rather than parsing raw bronze layer JSON. An analyst exploring customer behavior queries silver layer tables with confidence that data cleansing has already occurred.
Performance practices matter at the silver layer. Silver layer tables are commonly partitioned by business keys or dates to optimize query performance. Platforms supporting clustering (like Z-ordering in Delta Lake or clustering keys in Snowflake) improve read performance for common silver layer data access patterns. Some organizations build indexes on frequently filtered columns where the platform supports them.
Silver layer responsibilities in medallion architecture:

- Data cleansing: deduplicate, handle nulls, correct errors
- Standardization: convert types, normalize reference values
- Integration: join master data, resolve business entities from multiple data sources
- Conformance: align to enterprise data models
- Data quality enforcement: apply data quality rules systematically
- Late-data handling: process out-of-order arrivals correctly
- Schema enforcement: manage schema evolution with explicit policies
Gold layer: curated, business-ready data products
The gold layer is where data becomes actionable. Here, you model and optimize data for specific business use cases—executive dashboards, financial reporting, marketing attribution, customer segmentation, churn prediction, machine learning model serving, and the advanced analytics and data analysis that drive strategic decisions. The gold layer represents the culmination of medallion architecture’s progressive data refinement.
Gold layer data transformation often produces denormalized, read-optimized data models designed for consumption speed rather than storage efficiency. Kimball-style star schemas remain popular, with fact tables surrounded by dimension tables that eliminate joins at query time. Wide tables that pre-join frequently combined data reduce complexity for business users and business intelligence tools. Domain-specific data marts serve particular functions—Sales, Finance, Supply Chain—each optimized for its consumers’ query patterns.
Concrete examples illustrate gold layer outputs:

- monthly_revenue_gold: 2024 revenue aggregated by country, channel, and product category, with metrics like net revenue, discounts, and returns calculated according to official business definitions
- customer_360: A comprehensive customer profile merging behavior data (web sessions, clicks), transaction history, support ticket interactions, and demographic attributes into a single queryable entity
- marketing_attribution: Campaign performance data with attribution models applied, connecting marketing spend to conversion outcomes
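A toy, plain-Python version of a rollup like monthly_revenue_gold; the net-revenue formula here (gross minus discounts and returns) is an assumed stand-in for whatever the official business definition specifies:

```python
from collections import defaultdict

def monthly_revenue(fact_orders):
    """Aggregate a silver fact table into a gold monthly-revenue rollup."""
    rollup = defaultdict(lambda: {"net_revenue": 0.0, "orders": 0})
    for o in fact_orders:
        # group by (month, country, channel)
        key = (o["order_date"][:7], o["country"], o["channel"])
        rollup[key]["net_revenue"] += o["gross"] - o["discount"] - o["returns"]
        rollup[key]["orders"] += 1
    return dict(rollup)

gold = monthly_revenue([
    {"order_date": "2024-05-03", "country": "DE", "channel": "web",
     "gross": 100.0, "discount": 10.0, "returns": 0.0},
    {"order_date": "2024-05-21", "country": "DE", "channel": "web",
     "gross": 50.0, "discount": 0.0, "returns": 5.0},
])
```

In a real lakehouse this aggregation would run as a scheduled job writing a gold table, but the shape of the transformation is the same.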
The gold layer enforces final business rules and metric definitions. What exactly constitutes an “active customer”? Is it any purchase in the last 30 days, or does it require subscription status? What’s the official calculation for Monthly Recurring Revenue (MRR)? These definitions get codified in the gold layer, ensuring everyone across the organization references the same enterprise data products with identical interpretations.
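Codifying a metric can be as simple as one shared function. In this sketch the 30-day window is an assumed definition; the point is that the definition lives in exactly one place:

```python
from datetime import date, timedelta

# Assumed business definition: any purchase in the last 30 days counts as active.
ACTIVE_WINDOW_DAYS = 30

def is_active_customer(last_purchase: date, as_of: date) -> bool:
    """The single codified definition of "active customer" shared by all gold consumers."""
    return (as_of - last_purchase) <= timedelta(days=ACTIVE_WINDOW_DAYS)
```

Whether the definition lives in a Python module, a dbt macro, or a semantic layer, the principle is the same: one definition, many consumers.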
Data quality reaches its highest bar at the gold layer. Validated data in gold layer tables carries explicit guarantees. Monitoring and alerting systems watch for data quality anomalies before they reach executive dashboards or machine learning pipelines. Because trust is paramount, write operations to gold layer tables typically require elevated permissions—only designated teams can modify these critical datasets.
Gold layer performance optimization techniques include:

- Pre-aggregated tables reducing compute at query time
- Materialized views for complex calculations
- Serving indexes optimized for business intelligence tool access patterns
- Partitioning and clustering aligned with common filter conditions
- Extracts formatted for specific business intelligence tools like Power BI, Tableau, or Looker
- Read-optimized data models designed for low-latency access
Some organizations extend beyond the gold layer with additional layers or data marts. A “Platinum” layer might serve real-time machine learning features or personalization engines. Feature stores for machine learning workflows sometimes exist as specialized gold layer derivatives. However, the canonical bronze, silver, and gold layers cover most organizational needs for advanced analytics and business intelligence without additional complexity.
Gold layer data serves these typical use cases:

- Executive dashboards (daily financial summaries, KPI tracking)
- Marketing attribution and campaign performance analysis
- Customer 360 profiles for sales and support teams
- Operational reporting (inventory, supply chain, workforce)
- Financial close and regulatory compliance reporting
- Machine learning feature sets for model training and data science
- Self-service analytics for data-literate business users
- Domain-specific data marts for business intelligence consumption
Where gold layer data falls short: real-time automated decisions
Gold layer data excels at powering dashboards, business intelligence, machine learning training, and advanced analytics. But one category of workload exposes a structural limitation in medallion architecture: automated decisions that must act on derived context within milliseconds under concurrent load.
A fraud model scoring a transaction needs velocity counts reflecting the last few seconds of activity. An authorization service enforcing a credit limit needs aggregated exposure current to the last transaction. These decisions depend on exactly the kind of derived, aggregated data that gold layer data transformation produces—but they need it current to the moment of decision, not current to the last pipeline run.
The medallion architecture multi-hop data flow—bronze layer to silver layer to gold layer—prepares context asynchronously through sequential data transformations. The decision consumes context synchronously. Under high concurrency, where many events change the same underlying state simultaneously, gold layer data reflects a past snapshot while decisions execute against current reality. Streaming medallion architecture reduces this gap but cannot eliminate it structurally, because each data layer adds propagation delay.
For these workloads, the alternative is architectures that incrementally maintain derived context as events arrive—no multi-hop preparation, no separate data pipelines per data layer. Raw data and derived state (aggregations, joins, velocity counts) are served from a single consistent snapshot at query time. This doesn’t replace medallion architecture for organizing data for analytics and data science. It addresses the specific gap where gold layer data freshness cannot match the decision’s validity window.
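To make "incrementally maintained context" concrete, here is a sliding-window velocity counter in plain Python: state updates on every event and is current at the moment a decision reads it. Production systems would persist and shard this state rather than hold it in process memory.

```python
from collections import deque

class VelocityCounter:
    """Per-key sliding-window event count, maintained incrementally on every event.

    This is the shape of context that multi-hop refinement cannot serve fresh
    enough: the count must reflect the last few seconds at decision time.
    """
    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events: dict[str, deque] = {}

    def record(self, key: str, ts: float) -> None:
        self.events.setdefault(key, deque()).append(ts)

    def count(self, key: str, now: float) -> int:
        q = self.events.get(key, deque())
        while q and now - q[0] > self.window:  # evict events outside the window
            q.popleft()
        return len(q)

vc = VelocityCounter(window_seconds=10.0)
for t in (0.0, 3.0, 5.0, 12.0):
    vc.record("card_123", t)
```

A fraud decision at t=12.0 would read the count for "card_123" and see three events in the last ten seconds, with no pipeline hop in between.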
Implementing medallion architecture in practice
Implementing medallion architecture follows a logical sequence: define your domains, establish conventions, then build data pipelines layer by layer. A typical medallion architecture implementation workflow moves through distinct phases with clear deliverables at each stage.
Step 1: Define domains and data products

Start by identifying core business entities your organization cares about—Customer, Order, Product, Transaction, Asset. Map these to the data sources and source systems that feed them. Establish naming standards that will scale: schema prefixes (bronze_, silver_, gold_), table naming conventions, and folder structures in your distributed file system. Define ownership—which team manages which domain in the medallion architecture.
Step 2: Set up bronze layer data ingestion pipelines

For each data source, build data pipelines landing raw data into the bronze layer. Streaming data sources like Kafka topics might flow continuously into bronze layer tables. Batch data sources like daily CRM exports arrive on schedule. All bronze layer ingestion pipelines include metadata columns: ingestion timestamp, source identifier, batch ID. Choose data formats supporting time-travel and ACID transactions—Delta Lake or Apache Iceberg work well for bronze layer ingestion.
Step 3: Build silver layer data transformation pipelines

From the bronze layer, construct silver layer data pipelines that clean, standardize, and integrate. Type conversions, deduplication, null handling, reference data joins—all silver layer data transformation happens here. Schema enforcement kicks in—silver layer tables have explicit schemas with evolution policies. Data quality checks run as part of silver layer pipeline execution, flagging or rejecting invalid records before they propagate to the gold layer.
Step 4: Build gold layer serving pipelines

Silver layer data feeds the gold layer. Aggregations, denormalization, business rule application, and metric calculations transform data into consumable gold layer data products and data marts. Optimize gold layer data models for query patterns your business intelligence tools and analysts actually use. Document metric definitions explicitly in your data catalog.
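The three pipeline-building steps above can be sketched as one hop chain. Here clean and aggregate are hypothetical stand-ins for your silver and gold transformations; a real implementation would write each hop to durable Delta or Iceberg tables via an orchestrator:

```python
def run_medallion_pipeline(raw_events, clean, aggregate):
    """Sketch of the hop sequence: land raw, refine to silver, serve gold."""
    bronze = list(raw_events)            # Step 2: capture everything, transform nothing
    silver = [clean(r) for r in bronze]  # Step 3: cleanse, standardize, conform
    gold = aggregate(silver)             # Step 4: aggregate into business-ready products
    return bronze, silver, gold

bronze, silver, gold = run_medallion_pipeline(
    [{"amount": "10.5"}, {"amount": "2.0"}],
    clean=lambda r: {"amount": float(r["amount"])},      # type casting in silver
    aggregate=lambda rows: {"total": sum(r["amount"] for r in rows)},  # gold rollup
)
```

Note how each hop consumes only the previous layer's output: this is exactly the sequential propagation that makes replay easy and freshness bounded.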
Step 5: Implement data governance and monitoring

Data governance underpins the entire medallion architecture. Implement a data catalog (Unity Catalog in Databricks, equivalent tools elsewhere) to track metadata, data lineage, and data access policies. Build business glossaries so “Customer” means the same thing everywhere. Establish role-based data access separated by data layer—perhaps broad read access to the bronze layer for data engineers, curated data access to the silver layer for data scientists, and wide data access to gold layer data for business users with write operations restricted to designated owners.
A concrete mini-scenario: Consider a company integrating CRM data from Salesforce and web events from an Azure Event Hub—two very different data sources requiring different data ingestion approaches. Salesforce contacts sync daily into bronze.salesforce_contacts with full export snapshots. Web events stream continuously into bronze.web_events partitioned by event hour.

In the silver layer, a nightly data pipeline deduplicates Salesforce contacts, standardizes phone numbers and addresses, and produces silver.dim_customer. Web events parse into silver.fact_web_session with validated session boundaries, cleaned URLs, and user identifiers resolved. Silver layer data transformation handles the data cleaning and data quality enforcement for both data sources.

The gold layer combines these: gold.customer_360 joins customer dimensions with aggregated web behavior—session counts, recency, frequency metrics—plus transaction summaries from another silver layer data source. A marketing dashboard consumes this gold layer data to display customer segments, enabling targeted campaigns based on trusted, unified customer profiles. Data scientists access both silver and gold layers for machine learning feature engineering and data analysis.
Orchestration typically uses platform-native tools (Databricks Jobs, Fabric Data Pipelines) or external orchestrators like Apache Airflow or Azure Data Factory to manage data flow across the medallion architecture. Structured streaming enables near-real-time data flow from the bronze layer through silver and gold layers when business requirements demand low latency. Batch processing data remains common for gold layer aggregations running daily, weekly, or monthly.
Design patterns, trade-offs, and best practices
Medallion architecture provides a framework, not a rigid specification. Teams adapt the medallion architecture pattern to their specific contexts while maintaining its core principles. Understanding common medallion architecture design patterns, honest trade-offs, and proven best practices helps you implement effectively.
Medallion architecture design patterns in practice

A single bronze layer table often feeds multiple silver layer data models. Raw web events might produce both silver.fact_web_session for advanced analytics and silver.fact_user_action for product telemetry—different grain, different purposes, same data source. Similarly, silver layer tables branch into multiple domain-specific gold layer data marts. A conformed silver.fact_order might feed both gold.sales_summary for the revenue team and gold.fulfillment_metrics for operations.

Some organizations introduce sub-layers within the silver layer—“Raw Silver” holding minimally transformed data, “Clean Silver” with full data validation, “Conformed Silver” aligned to enterprise data models. Others add domain boundaries, creating medallion architectures within specific business units that federate into an enterprise-wide medallion architecture pattern. The flexibility accommodates organizational realities.
Trade-offs in medallion architecture to acknowledge
Storage overhead: Maintaining three data layers means storing multiple copies of logically similar data. The bronze layer retains raw records, the silver layer stores cleaned versions, the gold layer holds aggregated data. Storage costs accumulate, though they’re typically modest compared to compute and engineering time in a data lakehouse.
Pipeline complexity: More data layers mean more data pipelines to build, monitor, and maintain. Each bronze layer-to-silver layer and silver layer-to-gold layer data transformation requires code, testing, and ongoing operational attention.
Data freshness latency: Processing data sequentially through the medallion layers introduces latency. A change in source data must flow through the bronze layer, be transformed in the silver layer, and be aggregated into the gold layer before reaching dashboards. Near-real-time requirements add streaming complexity to data pipelines and data flow management.
Data governance burden: Strict data contracts, schema versioning, data access controls, and monitoring across three data layers demand disciplined data governance practices. Without strong data governance, medallion architectures degrade into confusion.
Best practices for effective medallion architecture implementation
| Practice | Description |
|---|---|
| Consistent naming conventions | Use data layer prefixes (bronze_, silver_, gold_) or dedicated schemas/databases per data layer |
| Schema versioning | Track schema changes explicitly; use platform features for evolution |
| Catalog tagging | Tag tables by data layer and domain in your data catalog for discoverability |
| Clear ownership | Assign domain owners responsible for specific bronze layer to silver layer to gold layer data flows |
| Layer-specific SLAs | Define freshness, completeness, and data quality targets per data layer |
| Partitioning strategy | Partition tables appropriately for data access patterns and data lifecycle management |
| Data lineage tracking | Ensure any gold layer metric traces back through data transformations to bronze layer source data |
| Data quality monitoring | Implement automated data quality checks at silver and gold layers with alerting for anomalies |
Medallion architecture delivers the most value in organizations with large, heterogeneous, fast-growing data estates. When you’re managing data from dozens of data sources, serving multiple teams with different analytical needs, processing both batch and streaming workloads, and requiring strong regulatory compliance and auditability—the medallion architecture pattern pays dividends.
For smaller operations—a handful of data sources, limited schema complexity, few consumers—a simpler approach may suffice. A two-layer pattern (Raw and Business Products) or even direct source-to-consumption data pipelines might reduce unnecessary overhead. Match medallion architecture complexity to actual organizational needs.
Key takeaways
- Medallion architecture organizes lakehouse data into a bronze layer (raw data), a silver layer (cleaned data), and a gold layer (business-ready data)
- The pattern is logical, not product-specific—implement it on Databricks, Fabric, Snowflake, or any data lakehouse platform
- The bronze layer captures everything from external source systems without transformation, enabling replay and full data lineage
- Silver layer transformation produces cleaned, standardized datasets aligned with business entity data models
- The gold layer delivers optimized, aggregated data products and data marts with enforced business rules and metric definitions
- Strong data governance, clear ownership, and consistent naming conventions are essential for success
- The pattern suits large, complex data estates with many data sources; simpler setups may not need all three layers
- For automated decisions requiring derived context under tight validity windows and concurrent load, the multi-hop data flow creates a structural preparation gap—these workloads need architectures that maintain context incrementally rather than refining it in hops
The data swamp problem hasn’t disappeared, but medallion architecture provides a proven framework for logically organizing data and avoiding it. By organizing data through progressive refinement across bronze, silver, and gold layers, you create a data estate where data engineers, data scientists, and business users can all find what they need with confidence in its quality and lineage.
Start by examining your current data organization. Identify one critical data domain—perhaps customer data or transaction history—and sketch out what bronze layer, silver layer, and gold layer tables would look like in a medallion architecture. Build from there, establishing patterns and conventions that scale as your data estate grows.
Written by Alex Kimball
Building the infrastructure layer for AI-native applications. We write about Decision Coherence, Tacnode Context Lake, and the future of data systems.