Dimensional Modeling & Lakehouse Pipeline Design

Principles of Dimensional Modeling

Dimensional modeling is a data design technique used for data warehouses and data marts, optimized for analytical querying and reporting. It structures data into "facts" (measures) and "dimensions" (context).

Key Concepts:

Fact Tables:
Contain quantitative measurements (metrics) and foreign keys to dimension tables. Facts are typically numeric and additive.
- Additive Facts: Can be summed across all dimensions (e.g., sales amount, quantity).
- Semi-Additive Facts: Can be summed across some dimensions but not all (e.g., an account balance can be summed across customers, but not across time).
- Non-Additive Facts: Cannot be meaningfully summed across any dimension (e.g., unit price, ratios). Aggregate them with average, min, or max, or recompute ratios from their additive components.
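
The difference between additive-style and semi-additive aggregation can be sketched in plain Python. The snapshot rows and function names below are hypothetical; the sketch only shows that balances can be summed across accounts on a single date, while across time a different rule (here, the latest value) is needed.

```python
from collections import defaultdict

# Hypothetical daily balance snapshots: (date, account_id, balance)
snapshots = [
    ("2024-01-01", "A", 100.0),
    ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0),
    ("2024-01-02", "B", 40.0),
]

def total_balance_by_date(rows):
    """Sum across accounts within each date (a valid aggregation)."""
    totals = defaultdict(float)
    for date, _account, balance in rows:
        totals[date] += balance
    return dict(totals)

def closing_balance_by_account(rows):
    """Across time, keep the latest balance per account instead of summing."""
    latest = {}
    for _date, account, balance in sorted(rows):
        latest[account] = balance  # later dates overwrite earlier ones
    return latest

daily_totals = total_balance_by_date(snapshots)
closing = closing_balance_by_account(snapshots)
```

Summing the four balances into a single figure (310.0) would not correspond to any amount of money that ever existed, which is why the time dimension needs its own aggregation rule.
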
Dimension Tables:
Contain descriptive attributes that provide context to the facts. They answer "who, what, where, when, why, how."
- Slowly Changing Dimensions (SCDs): Handle changes in dimension attributes over time.
  - SCD Type 0 (Retain Original): Attribute never changes.
  - SCD Type 1 (Overwrite): Old value is overwritten by new value (no history).
  - SCD Type 2 (Add New Row): A new row is added for each change, preserving full history. Most common.
  - SCD Type 3 (Add New Column): A separate column keeps the previous value alongside the current one, so only limited history (usually one prior value) is retained.
  - SCD Type 4 (History Table): Current attributes in dimension, history in separate table.
  - SCD Type 6 (Hybrid): Combines Type 1, 2, and 3 (e.g., current value, historical value, and effective dates).
- Conformed Dimensions: Dimensions that are shared across multiple fact tables or data marts, ensuring consistent reporting and integration across business areas.
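
As a rough illustration of SCD Type 2, the sketch below closes the current row and appends a new version when an attribute changes. The column names (sk, customer_id, city, valid_from, valid_to, is_current) and the 9999-12-31 open-ended date are assumptions chosen for the example, not a required layout.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed marker for "no end date yet"

def apply_scd2(rows, natural_key, new_attrs, change_date):
    """Close the current row for natural_key and append a new version.

    rows is a list of dicts with the hypothetical columns sk, customer_id,
    city, valid_from, valid_to, is_current. A new list is returned.
    """
    result = [dict(r) for r in rows]
    current = next(
        (r for r in result if r["customer_id"] == natural_key and r["is_current"]),
        None,
    )
    if current is not None:
        if all(current[k] == v for k, v in new_attrs.items()):
            return result  # nothing changed, keep history as it is
        current["valid_to"] = change_date
        current["is_current"] = False
    next_sk = max((r["sk"] for r in result), default=0) + 1
    result.append({
        "sk": next_sk,
        "customer_id": natural_key,
        **new_attrs,
        "valid_from": change_date,
        "valid_to": OPEN_END,
        "is_current": True,
    })
    return result

dim_customer = [{
    "sk": 1, "customer_id": "C1", "city": "Oslo",
    "valid_from": date(2023, 1, 1), "valid_to": OPEN_END, "is_current": True,
}]
dim_customer = apply_scd2(dim_customer, "C1", {"city": "Bergen"}, date(2024, 6, 1))
```

After the call, the dimension holds two rows for C1: the closed Oslo version and the current Bergen version. On Delta Lake the same logic is usually expressed as a MERGE statement rather than row-by-row Python.
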
Granularity: The level of detail stored in a fact table. It's crucial to define the lowest level of detail for which data is captured (e.g., sales per item per day per store).
Surrogate Keys: Simple, system-generated integer keys used in dimension tables instead of natural keys. They provide performance benefits, handle SCDs, and isolate the data warehouse from source system key changes.
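
A minimal sketch of surrogate key assignment, assuming keys are handed out per (source system, natural key) pair. The class and the "crm"/"erp" system names are hypothetical; in a real warehouse the mapping would live in the dimension table or a key map table rather than in memory.

```python
class SurrogateKeyMap:
    """Assigns a stable integer key to each (source_system, natural_key) pair."""

    def __init__(self):
        self._keys = {}
        self._next = 1

    def key_for(self, source_system, natural_key):
        pair = (source_system, natural_key)
        if pair not in self._keys:
            self._keys[pair] = self._next
            self._next += 1
        return self._keys[pair]

keys = SurrogateKeyMap()
first = keys.key_for("crm", "CUST-001")
second = keys.key_for("erp", "CUST-001")  # same natural key, different source
repeat = keys.key_for("crm", "CUST-001")  # already seen, so the key is reused
```

Because fact rows reference only the integer key, two source systems that happen to reuse the same natural key do not collide in the warehouse.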

Schema Types:

Star Schema:

The simplest and most common dimensional model. A central fact table surrounded by denormalized dimension tables. Optimized for query performance and ease of understanding.


                      [Time Dimension]
                             |
                             |
[Product Dimension] --- [Fact Table] --- [Customer Dimension]
                             |
                             |
                     [Store Dimension]
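
To make the star layout concrete, here is a small sketch using Python's built-in sqlite3 module. Table and column names (dim_product, dim_store, fact_sales, and so on) are illustrative assumptions; the point is that each dimension is reached with a single join from the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_sk INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
    CREATE TABLE fact_sales (
        product_sk   INTEGER REFERENCES dim_product(product_sk),
        store_sk     INTEGER REFERENCES dim_store(store_sk),
        quantity     INTEGER,
        sales_amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Espresso', 'Coffee'), (2, 'Green Tea', 'Tea');
    INSERT INTO dim_store   VALUES (1, 'Downtown', 'North'), (2, 'Harbor', 'South');
    INSERT INTO fact_sales  VALUES (1, 1, 2, 6.0), (1, 2, 1, 3.0), (2, 1, 3, 7.5);
""")

# One join per dimension, then aggregate the additive measures
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.quantity), SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_sk = f.product_sk
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

The category attribute sits directly on the denormalized product dimension, so no further joins are needed to group by it.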

Snowflake Schema:

An extension of the star schema where dimensions are normalized into multiple related tables. Reduces data redundancy but can increase query complexity due to more joins.


                                             [Time Dimension]
                                                    |
                                                    |
[Product Category] --- [Product Dimension] --- [Fact Table] --- [Customer Dimension]
                                                    |
                                                    |
                                            [Store Dimension] --- [Store Location]
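
The extra join cost of a snowflake schema can be seen in a sketch like the following, again using sqlite3 with hypothetical table names. The product category is normalized into its own table, so grouping by category requires two joins from the fact table instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product_category (category_sk INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_sk   INTEGER PRIMARY KEY,
        product_name TEXT,
        category_sk  INTEGER REFERENCES dim_product_category(category_sk)
    );
    CREATE TABLE fact_sales (product_sk INTEGER, sales_amount REAL);
    INSERT INTO dim_product_category VALUES (1, 'Coffee'), (2, 'Tea');
    INSERT INTO dim_product VALUES (1, 'Espresso', 1), (2, 'Latte', 1), (3, 'Green Tea', 2);
    INSERT INTO fact_sales  VALUES (1, 6.0), (2, 4.5), (3, 7.5);
""")

# Fact -> product -> category: the category name is one hop further away
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_sk = f.product_sk
    JOIN dim_product_category AS c ON c.category_sk = p.category_sk
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
```

Each category name is stored once, which is the redundancy saving the snowflake layout trades for the additional join.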

Lakehouse Pipeline Design Best Practices

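Robust, efficient data pipelines in a Lakehouse environment (such as Databricks) build on platform capabilities like Delta Lake, Structured Streaming, and Unity Catalog. The practices below cover architecture, ingestion, data quality, performance, governance, observability, and automation.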

1. Medallion Architecture (Bronze, Silver, Gold):

Bronze Layer (Raw Data Lake):

Ingest data as-is from source systems. Store it in Delta Lake for ACID guarantees; schema inference is provided by Auto Loader at read time, not by Delta Lake itself. This layer is ideally append-only.


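# Example: Ingesting raw JSON into Bronze with Auto Loader
# cloudFiles.schemaLocation is required when Auto Loader infers the schema;
# reusing the checkpoint path for it is a common choice
df_raw = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/bronze") \
    .load("/mnt/raw_data/json")

# Append the raw records to a Delta table; the checkpoint tracks progress
df_raw.writeStream.format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/bronze") \
    .toTable("bronze_layer.raw_events")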

Silver Layer (Cleaned & Conformed):

Apply data cleaning, filtering, basic transformations, and schema enforcement. This is where you might implement SCDs for dimensions.


# Example: Cleaning and transforming to Silver
from pyspark.sql.functions import current_timestamp

# Stream from the Bronze table, stamp each row with its processing time,
# and drop rows that have no event type (basic cleaning)
df_silver = spark.readStream.table("bronze_layer.raw_events") \
    .withColumn("processed_timestamp", current_timestamp()) \
    .filter("event_type IS NOT NULL")

df_silver.writeStream.format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/silver") \
    .toTable("silver_layer.cleaned_events")

Gold Layer (Curated & Aggregated):

Create highly refined, aggregated, and business-ready data optimized for specific use cases (BI, ML). This is where dimensional models (fact and dimension tables) reside.


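# Example: Creating a Gold layer fact table
from pyspark.sql import functions as F

# Assumes 'price' is an extended line amount; for a unit price, sum quantity * price instead
df_sales_fact = spark.readStream.table("silver_layer.cleaned_events") \
    .groupBy("product_id", "customer_id", "date") \
    .agg(F.sum("quantity").alias("total_quantity"), F.sum("price").alias("total_sales"))

# A streaming aggregation without a watermark cannot be written in append mode;
# complete mode rewrites the full aggregate on every micro-batch
df_sales_fact.writeStream.format("delta") \
    .outputMode("complete") \
    .option("checkpointLocation", "/mnt/checkpoints/gold") \
    .toTable("gold_layer.sales_fact")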

2. Data Ingestion Strategies:

Batch Processing: For large historical data loads or less time-sensitive data. Use Spark batch jobs.
Streaming Ingestion (Structured Streaming): For real-time or near real-time data. Spark Structured Streaming checkpoints combined with Delta Lake's transactional writes provide end-to-end exactly-once processing at low latency.
Auto Loader: For efficient and scalable ingestion of new files arriving in cloud storage. Automatically detects and processes new files.
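
The idea behind incremental file ingestion can be sketched in plain Python: remember which files have been processed and return only the new ones on each run. This is a conceptual illustration, not Auto Loader's implementation (Auto Loader keeps this state in the stream's checkpoint); the function name, JSON checkpoint file, and file names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

def discover_new_files(source_dir, checkpoint_file):
    """Return files not yet recorded in the checkpoint, then record them."""
    seen = set()
    if checkpoint_file.exists():
        seen = set(json.loads(checkpoint_file.read_text()))
    new_files = sorted(
        p.name for p in source_dir.glob("*.json") if p.name not in seen
    )
    checkpoint_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    landing = root / "landing"
    landing.mkdir()
    checkpoint = root / "checkpoint.json"  # kept outside the landing folder

    (landing / "events_001.json").write_text('{"event_type": "click"}')
    first_run = discover_new_files(landing, checkpoint)

    (landing / "events_002.json").write_text('{"event_type": "view"}')
    second_run = discover_new_files(landing, checkpoint)
    third_run = discover_new_files(landing, checkpoint)  # nothing new arrived
```

Each file is returned exactly once across runs, which is the property that lets a batch or streaming job restart without reprocessing earlier input.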

3. Data Quality and Validation:

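Schema Enforcement & Evolution: Delta Lake enforces the table schema on write and rejects mismatched data by default. For controlled schema evolution, use the mergeSchema option to add new columns, or overwriteSchema together with overwrite mode to replace the schema.
Expectations (Delta Live Tables - DLT): Use DLT's "expectations" feature to define data quality rules and choose what happens to invalid records: keep them and record a warning, drop them, or fail the update. Quarantining is a pattern built on top of these rules by routing failing rows to a separate table.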
Data Validation Frameworks: Integrate with tools like Great Expectations or Deequ for robust data validation checks.
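
The three expectation outcomes (warn, drop, fail) can be illustrated with a small plain-Python sketch. This mirrors the semantics only and is not the DLT API; in DLT itself the corresponding decorators are dlt.expect, dlt.expect_or_drop, and dlt.expect_or_fail. The function name and sample rows are hypothetical.

```python
def apply_expectation(rows, predicate, action="warn"):
    """Apply one data quality rule with a warn, drop, or fail action.

    Returns (kept_rows, violation_count).
    """
    if action not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown action: {action}")
    violations = [r for r in rows if not predicate(r)]
    if violations and action == "fail":
        raise ValueError(f"{len(violations)} row(s) violated the expectation")
    if action == "drop":
        kept = [r for r in rows if predicate(r)]
    else:
        kept = list(rows)  # warn keeps invalid rows and only counts them
    return kept, len(violations)

def has_event_type(row):
    return row["event_type"] is not None

events = [
    {"event_type": "click", "quantity": 2},
    {"event_type": None, "quantity": 1},
]

warned, warn_count = apply_expectation(events, has_event_type, "warn")
dropped, drop_count = apply_expectation(events, has_event_type, "drop")
```

A quarantine step would collect the violating rows into a separate output instead of discarding them, so they can be inspected and replayed later.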

4. Performance Optimization:

Delta Lake Optimizations:
- OPTIMIZE & ZORDER BY: Periodically run OPTIMIZE to compact small files and ZORDER BY on frequently queried columns to improve query performance.
- Liquid Clustering: Use as a flexible alternative to partitioning and Z-Ordering for dynamic data distribution.
- Photon & Predictive I/O: Photon is Databricks' vectorized query engine. Predictive I/O is a separate set of read and update optimizations that is available on Photon-enabled compute.
Cluster Sizing & Autoscaling: Configure clusters with appropriate instance types and enable autoscaling to match compute resources to workload demands.
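Adaptive Query Execution (AQE): Available since Spark 3.0 and enabled by default since Spark 3.2. Confirm that spark.sql.adaptive.enabled has not been turned off so that query plans can be re-optimized at runtime.
Broadcast Joins: When one side of a join is small enough to fit in executor memory, broadcast it to every executor so the large side does not need to be shuffled. Spark does this automatically below spark.sql.autoBroadcastJoinThreshold, or you can request it with a broadcast hint.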
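
Why a broadcast join avoids a shuffle can be shown with a conceptual sketch in plain Python: the small table becomes an in-memory lookup, and the large table is scanned once in place. The function and sample rows are hypothetical, and the sketch assumes the join key is unique on the small side. In PySpark the equivalent request is made by wrapping the small DataFrame in pyspark.sql.functions.broadcast.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join that builds a lookup table from the small side only."""
    lookup = {row[key]: row for row in small_rows}  # the "broadcast" copy
    joined = []
    for row in large_rows:  # the large side is scanned once, never regrouped
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined

sales = [
    {"product_id": 1, "quantity": 2},
    {"product_id": 2, "quantity": 5},
    {"product_id": 9, "quantity": 1},  # no matching product
]
products = [
    {"product_id": 1, "product_name": "Espresso"},
    {"product_id": 2, "product_name": "Green Tea"},
]

joined = broadcast_hash_join(sales, products, "product_id")
```

In a cluster, every executor holds its own copy of the lookup, so rows of the large table never have to move between nodes to find their match.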

5. Data Governance & Security:

Unity Catalog: Centralize metadata, access control, auditing, and lineage for all data assets across your Lakehouse. Essential for robust governance.
Least Privilege: Implement strict access controls, granting only necessary permissions to users and service accounts.
Secrets Management: Use Databricks Secrets or cloud-native secret managers for credentials.

6. Observability & Monitoring:

Logging & Alerting: Implement comprehensive logging within pipelines and set up alerts for failures, performance degradation, or data quality issues.
Monitoring Tools: Utilize Spark UI, Databricks UI, and integrate with external monitoring systems (e.g., Prometheus, Grafana, cloud-native monitoring).
Data Lineage: Use Unity Catalog's lineage capabilities to track data flow and transformations.
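
A minimal sketch of step-level logging, assuming each pipeline step is a plain Python callable. The run_step helper, the logger name, and the key=value message layout are illustrative choices; the structured messages are meant to be easy for an alerting system to match on.

```python
import logging
import time

logger = logging.getLogger("pipeline")

def run_step(name, func, *args, **kwargs):
    """Run one pipeline step, logging its duration or its failure."""
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    except Exception:
        # logger.exception records the traceback along with the message
        logger.exception("step=%s status=failed", name)
        raise
    elapsed = time.perf_counter() - start
    logger.info("step=%s status=ok seconds=%.3f", name, elapsed)
    return result

def drop_null_events(rows):
    return [r for r in rows if r.get("event_type") is not None]

cleaned = run_step(
    "clean_events",
    drop_null_events,
    [{"event_type": "click"}, {"event_type": None}],
)
```

The exception is re-raised after logging so that the orchestrator still sees the failure and can mark the run as failed.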

7. CI/CD and Automation:

Databricks Repos (now called Git folders): Integrate your notebooks and code with Git for version control and collaborative development.
Automated Testing: Incorporate unit, integration, and data quality tests into your CI/CD pipeline.
Job Orchestration: Use Databricks Workflows, Apache Airflow, or other orchestrators to schedule and manage pipeline execution.
Infrastructure as Code (IaC): Manage Databricks workspaces, clusters, and jobs using tools like Terraform.
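
Automated testing is easiest when transformation logic lives in plain functions that can run without a cluster. The sketch below uses Python's built-in unittest module; clean_events and its rules are hypothetical stand-ins for a real Silver-layer transformation.

```python
import unittest

def clean_events(rows):
    """Drop rows without an event_type and normalize the value to lower case."""
    return [
        {**row, "event_type": row["event_type"].strip().lower()}
        for row in rows
        if row.get("event_type")
    ]

class CleanEventsTest(unittest.TestCase):
    def test_drops_rows_without_event_type(self):
        rows = [{"event_type": None}, {"event_type": "click"}]
        self.assertEqual(clean_events(rows), [{"event_type": "click"}])

    def test_normalizes_case_and_whitespace(self):
        self.assertEqual(
            clean_events([{"event_type": " Click "}]),
            [{"event_type": "click"}],
        )

# Run the tests programmatically, as a CI job might
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanEventsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A CI pipeline would typically fail the build when result.wasSuccessful() is False, before any job is deployed to the workspace.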