Case Study: Massive Data Processing in Adobe Experience Platform with Delta Lake
How Adobe built a cost-effective and scalable data pipeline using Delta Lake and Apache Spark to power its Experience Platform.
The Challenge: A Multi-Tenant, Petabyte-Scale Platform
The Adobe Experience Platform unifies data from across Adobe's products to power real-time customer profiles and AI/ML services. The platform needed to:
- Process petabytes of data daily from thousands of tenants.
- Ensure data quality and reliability for both streaming and batch data.
- Provide a scalable and cost-effective solution for data transformation and enrichment.
- Handle the "small file problem" caused by ingesting data from many different sources.
The Architecture: A Lakehouse Built on Delta Lake
Adobe's solution is a classic Lakehouse architecture, with Delta Lake and Apache Spark on Databricks at its core.
```mermaid
graph TD
    subgraph "Data Ingestion"
        A[Streaming Sources] --> B(Kafka);
        C[Batch Sources] --> D[ADLS Gen2];
        B & D --> E{Databricks Ingestion Jobs};
    end
    subgraph "Lakehouse Processing"
        E --> F[Bronze Delta Tables];
        F --> G(ETL with Spark & Delta);
        G --> H[Silver Delta Tables];
        H --> I(Data Enrichment & Aggregation);
        I --> J[Gold Delta Tables];
    end
    subgraph "Serving Layer"
        J --> K(Adobe Experience Platform Services);
        J --> L(BI & Reporting);
    end
```
- Ingestion: Streaming and batch data are ingested into a landing zone (Kafka and ADLS Gen2, respectively).
- Bronze Layer: Raw data is written as-is to Bronze Delta tables, providing an immutable, versioned copy of the source data.
- Silver Layer: ETL jobs running on Databricks read from the Bronze tables, perform cleaning, filtering, and schema enforcement, and write the results to Silver Delta tables.
- Gold Layer: The Silver data is further enriched, aggregated, and joined to create business-level Gold tables that are optimized for specific use cases, such as powering the real-time customer profile.
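The Bronze-to-Silver hop can be sketched in Delta Lake SQL. This is a minimal, illustrative example, not Adobe's actual pipeline: the table names (`bronze_events`, `silver_events`) and columns are hypothetical, but the pattern (a Delta table with an enforced schema, plus `MERGE INTO` for idempotent upserts) is the standard one.

```sql
-- Illustrative Bronze -> Silver hop; table and column names are hypothetical.
-- Delta enforces the declared schema: writes that don't match are rejected.
CREATE TABLE IF NOT EXISTS silver_events (
  tenant_id  STRING,
  event_id   STRING,
  event_time TIMESTAMP,
  payload    STRING
) USING DELTA
PARTITIONED BY (tenant_id);

-- Clean and filter raw Bronze records, then upsert into Silver.
-- MERGE makes reprocessing a Bronze batch idempotent.
MERGE INTO silver_events AS s
USING (
  SELECT tenant_id,
         event_id,
         CAST(event_time AS TIMESTAMP) AS event_time,
         payload
  FROM bronze_events
  WHERE event_id IS NOT NULL        -- drop malformed records
) AS b
ON s.tenant_id = b.tenant_id AND s.event_id = b.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Because the `MERGE` runs as a single ACID transaction, concurrent readers of `silver_events` never observe a half-applied batch.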
Key Technical Details & Learnings
- ACID Transactions with Delta Lake: Delta Lake's ACID properties were critical for ensuring data reliability and consistency, especially in a multi-tenant environment with concurrent reads and writes.
- Solving the Small File Problem: Delta Lake's `OPTIMIZE` command, particularly the `ZORDER` option, was used to compact small files into larger ones and co-locate related information, which significantly improved query performance.
- Time Travel for Debugging and Auditing: The ability to query previous versions of a Delta table (Time Travel) was invaluable for debugging data issues and for auditing data changes over time.
- Scalability with Databricks: The elastic nature of Databricks clusters allowed Adobe to scale their processing resources up and down to meet the demands of their petabyte-scale workloads, optimizing for both performance and cost.
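The file-compaction step described above is a single command. The sketch below assumes the hypothetical `silver_events` table; the columns chosen for Z-ordering should be the ones most frequently used in query filters.

```sql
-- Compact small files into larger ones and co-locate rows that share
-- tenant_id / event_time ranges, so queries filtering on those columns
-- skip most files. Table and column names are illustrative.
OPTIMIZE silver_events
ZORDER BY (tenant_id, event_time);
```

`OPTIMIZE` rewrites data files but commits the change as a new table version, so running it does not disrupt concurrent readers.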
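Time travel is exposed directly in SQL. A short sketch, again against an assumed `silver_events` table (the version number and timestamp are placeholders):

```sql
-- Inspect the commit history to find the version of interest.
DESCRIBE HISTORY silver_events;

-- Query the table as it existed at an earlier version or point in time.
SELECT COUNT(*) FROM silver_events VERSION AS OF 42;
SELECT * FROM silver_events TIMESTAMP AS OF '2023-06-01 00:00:00';
```

Comparing two versions this way is a common pattern for debugging a bad load: diff the suspect version against the last known-good one without restoring anything.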