Case Study: Building a Modern Data Warehouse for a Retail Giant

How a large retail company migrated from a legacy on-premises data warehouse to a scalable, cloud-native solution on AWS to unify their sales, inventory, and supply chain data.

The Challenge

A global retail giant with thousands of stores and a massive online presence was hampered by its legacy on-premises data warehouse. The key problems were:

  • Query Performance: Nightly ETL jobs took over 8 hours, and complex analytical queries could run for hours, delaying critical business decisions on inventory management and marketing campaigns.
  • Data Silos: Sales data, supply chain information, and customer clickstream data were stored in separate, unintegrated systems, making a 360-degree view of the customer journey impossible.
  • Scalability and Cost: The on-premises system was expensive to scale, requiring large upfront hardware investments. It couldn't handle the exponential growth of semi-structured data from their e-commerce platform.
  • Limited Analytics: The rigid structure of the old warehouse made it difficult for data scientists to access raw data for building machine learning models for demand forecasting.

The Architecture

```mermaid
graph TD
    subgraph "Data Sources"
        A[On-Premises DW]
        B[SaaS Platforms]
    end
    subgraph "Ingestion & Orchestration"
        A --> C{AWS DMS}
        B --> D{AWS Glue}
        C --> E[Amazon S3 Raw]
        D --> E
        F[AWS Step Functions] --> D
    end
    subgraph "Processing & Warehousing"
        E --> G(AWS Glue ETL)
        G --> H[Amazon S3 Processed]
        H --> I(Amazon Redshift)
    end
    subgraph "Analytics"
        I --> J[Amazon QuickSight]
    end
```

The new architecture implements a modern "Lake House" pattern on AWS, combining the scalability of a data lake with the performance and features of a data warehouse:

  1. Data Ingestion & Migration: AWS Database Migration Service (DMS), together with the Schema Conversion Tool (SCT), handled the one-time bulk migration of historical data from the on-premises Teradata warehouse to Amazon S3. Ongoing incremental data is ingested through separate pipelines into the S3 "landing zone."
  2. Data Lake Foundation: Amazon S3 serves as the central data lake, providing a cost-effective and durable storage layer for raw, semi-structured (JSON clickstreams), and processed (Parquet) data.
  3. ETL & Data Cataloging: AWS Glue is used for all ETL operations. Glue Crawlers scan the S3 data lake to automatically infer schemas and populate the Glue Data Catalog. Glue ETL jobs, written in PySpark, transform the raw data into a cleaned, partitioned, and columnar Parquet format in the "processed" layer of the data lake.
  4. Orchestration: AWS Step Functions orchestrates the entire data pipeline. It triggers Glue jobs based on a schedule or events, manages dependencies between jobs, and handles error retries, providing a fully auditable workflow.
  5. Data Warehousing & Analytics: Amazon Redshift acts as the high-performance query engine. It uses Redshift Spectrum to directly query the processed Parquet files in S3 without needing to load all the data. This provides a "single source of truth" and decouples storage from compute. Only the most critical "hot" data and aggregated tables are stored locally in Redshift for maximum performance.
  6. Business Intelligence: Amazon QuickSight connects to Redshift to provide interactive dashboards and reports for thousands of business users, from inventory managers to marketing analysts.
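
The partitioning step in the Glue ETL jobs (step 3 above) can be illustrated with a small, self-contained sketch. The real jobs run as PySpark on Glue; this stdlib-only Python version only shows the Hive-style `year=/month=/day=` key layout that Glue Crawlers recognize as partition columns. The function names, field names, and S3 prefixes here are hypothetical:

```python
from collections import defaultdict
from datetime import datetime

def partition_key(event: dict, table: str = "clickstream") -> str:
    """Derive a Hive-style partition prefix (year/month/day) for an event.

    Glue Crawlers pick up this key layout and register year, month, and
    day as partition columns in the Glue Data Catalog, so Redshift
    Spectrum can prune partitions instead of scanning everything.
    """
    ts = datetime.fromisoformat(event["event_time"])
    return (f"processed/{table}/"
            f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/")

def group_by_partition(events: list[dict]) -> dict[str, list[dict]]:
    """Bucket raw events by target partition prefix; each bucket would
    then be written out as one or more Parquet files under that prefix."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        buckets[partition_key(event)].append(event)
    return dict(buckets)

# Example: two events on the same day land in the same partition.
events = [
    {"event_time": "2024-05-01T10:15:00", "page": "/cart"},
    {"event_time": "2024-05-01T23:59:00", "page": "/checkout"},
    {"event_time": "2024-05-02T08:00:00", "page": "/home"},
]
buckets = group_by_partition(events)
print(sorted(buckets))
# → ['processed/clickstream/year=2024/month=05/day=01/',
#    'processed/clickstream/year=2024/month=05/day=02/']
```

The zero-padded months and days keep partitions lexicographically sorted, which makes range filters on the partition columns cheap.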

Key Technical Details

  • Decoupled Storage and Compute: The Lake House architecture is the cornerstone of this solution. By using S3 as the persistent storage layer and Redshift as the query engine, the company can scale compute resources up or down on demand (e.g., adding more Redshift nodes for month-end reporting) without altering the storage layer, leading to significant cost savings.
  • Columnar Data Format (Parquet): All processed data in the S3 data lake is stored in Apache Parquet format. This columnar layout, combined with Snappy compression, drastically reduces the amount of data scanned by Redshift Spectrum, improving query performance and lowering per-query costs.
  • Dimensional Modeling in the Lake House: AWS Glue jobs are used to model the data into a classic star schema (fact and dimension tables) directly within the S3 data lake. This makes it easy for BI tools to consume the data and for analysts to write intuitive queries.
  • Concurrency Scaling in Redshift: To handle peak query loads from business users, Redshift's Concurrency Scaling feature is enabled. It automatically adds and removes cluster capacity to provide consistent performance, even with thousands of concurrent queries.
  • Robust Orchestration with Step Functions: Instead of relying on complex cron jobs, AWS Step Functions provides a visual workflow that is easy to monitor, debug, and extend. Its built-in error handling and state management make the entire data platform more resilient.
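
To make the orchestration point concrete, here is a minimal sketch of what a Step Functions state machine for this pipeline could look like, expressed in Amazon States Language (ASL) as a Python dict. The Glue job names are hypothetical, and a real deployment would pass this JSON to the Step Functions API (e.g., via CloudFormation or `create-state-machine`):

```python
import json

# Illustrative ASL sketch: run the Glue transform job synchronously,
# retry on failure with exponential backoff, then build the aggregates.
definition = {
    "Comment": "Nightly lake-house ETL (illustrative sketch)",
    "StartAt": "TransformRawToParquet",
    "States": {
        "TransformRawToParquet": {
            "Type": "Task",
            # The .sync integration makes Step Functions wait for the
            # Glue job to finish before advancing to the next state.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "clickstream-raw-to-parquet"},
            "Retry": [{
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 60,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Next": "BuildStarSchemaAggregates",
        },
        "BuildStarSchemaAggregates": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "build-star-schema-aggregates"},
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```

Because the definition is declarative JSON, the Step Functions console renders it as the visual workflow mentioned above, and the `Retry` block replaces the ad-hoc error handling that cron-based pipelines typically accumulate.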