Case Study: Massive Data Processing in Adobe Experience Platform with Delta Lake

How Adobe built a cost-effective and scalable data pipeline using Delta Lake and Apache Spark to power its Experience Platform.

The Challenge: A Multi-Tenant, Petabyte-Scale Platform

The Adobe Experience Platform unifies data from across Adobe's products to power real-time customer profiles and AI/ML services. The platform needed to:

  • Process petabytes of data daily from thousands of tenants.
  • Ensure data quality and reliability for both streaming and batch data.
  • Provide a scalable and cost-effective solution for data transformation and enrichment.
  • Handle the "small file problem" caused by ingesting data from many different sources.

The Architecture: A Lakehouse Built on Delta Lake

Adobe's solution is a classic Lakehouse architecture, with Delta Lake and Apache Spark on Databricks at its core.

```mermaid
graph TD
  subgraph "Data Ingestion"
    A[Streaming Sources] --> B(Kafka)
    C[Batch Sources] --> D[ADLS Gen2]
    B & D --> E{Databricks Ingestion Jobs}
  end
  subgraph "Lakehouse Processing"
    E --> F[Bronze Delta Tables]
    F --> G(ETL with Spark & Delta)
    G --> H[Silver Delta Tables]
    H --> I(Data Enrichment & Aggregation)
    I --> J[Gold Delta Tables]
  end
  subgraph "Serving Layer"
    J --> K(Adobe Experience Platform Services)
    J --> L(BI & Reporting)
  end
```

  1. Ingestion: Streaming data arrives via Kafka and batch data lands in ADLS Gen2, where Databricks ingestion jobs pick both up.
  2. Bronze Layer: Raw data is written unmodified to Bronze Delta tables, providing an immutable, versioned copy of the source data.
  3. Silver Layer: ETL jobs running on Databricks read from the Bronze tables, perform cleaning, filtering, and schema enforcement, and write the results to Silver Delta tables.
  4. Gold Layer: The Silver data is further enriched, aggregated, and joined to create business-level Gold tables that are optimized for specific use cases, such as powering the real-time customer profile.
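
The Bronze-to-Gold flow above can be sketched in Spark SQL on Delta tables. Table names and columns here are hypothetical, chosen to illustrate the pattern rather than Adobe's actual schemas:

```sql
-- Bronze: raw events landed as-is, preserving the original payload.
CREATE TABLE IF NOT EXISTS bronze_events (
  tenant_id STRING,
  ingest_ts TIMESTAMP,
  payload   STRING
) USING DELTA;

-- Silver: cleaned, filtered records with schema enforcement.
CREATE TABLE IF NOT EXISTS silver_events
USING DELTA AS
SELECT tenant_id,
       ingest_ts,
       get_json_object(payload, '$.event_type') AS event_type,
       get_json_object(payload, '$.user_id')   AS user_id
FROM bronze_events
WHERE payload IS NOT NULL;

-- Gold: a business-level aggregate for downstream services.
CREATE TABLE IF NOT EXISTS gold_daily_activity
USING DELTA AS
SELECT tenant_id,
       date_trunc('DAY', ingest_ts) AS activity_date,
       count(*)                     AS event_count
FROM silver_events
GROUP BY tenant_id, date_trunc('DAY', ingest_ts);
```

In practice these jobs run incrementally (for example via Structured Streaming or scheduled batch reads), but the layering is the same.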

Key Technical Details & Learnings

  • ACID Transactions with Delta Lake: Delta Lake's ACID properties were critical for ensuring data reliability and consistency, especially in a multi-tenant environment with concurrent reads and writes.
  • Solving the Small File Problem: Delta Lake's `OPTIMIZE` command, combined with `ZORDER BY` clustering, was used to compact small files into larger ones and co-locate related data, which significantly improved query performance.
  • Time Travel for Debugging and Auditing: The ability to query previous versions of a Delta table (Time Travel) was invaluable for debugging data issues and for auditing data changes over time.
  • Scalability with Databricks: The elastic nature of Databricks clusters allowed Adobe to scale their processing resources up and down to meet the demands of their petabyte-scale workloads, optimizing for both performance and cost.
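
One way ACID semantics show up in practice is `MERGE INTO`, which commits an upsert atomically; concurrent readers continue to see the last committed snapshot until the merge completes. The table and join keys below are illustrative, not Adobe's actual schema:

```sql
-- Atomically upsert staged records into the Silver table.
MERGE INTO silver_events AS target
USING staged_updates AS source
ON target.tenant_id = source.tenant_id
   AND target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```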
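
Compaction with Z-ordering looks like the following; the clustering columns here (tenant and event type) are assumed examples of common filter predicates:

```sql
-- Compact small files and co-locate rows that are queried together.
OPTIMIZE silver_events
ZORDER BY (tenant_id, event_type);
```

Choosing high-cardinality columns that appear in query filters is what lets Z-ordering skip files at read time.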
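
Time travel is exposed directly in SQL; the version number and timestamp below are placeholders for illustration:

```sql
-- Query the table as of a specific commit version...
SELECT count(*) FROM silver_events VERSION AS OF 42;

-- ...or as of a point in time, e.g. to compare against current state.
SELECT count(*) FROM silver_events TIMESTAMP AS OF '2024-01-01 00:00:00';
```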