High-Throughput Data Pipeline

One-line summary: End-to-end design of a high-throughput data pipeline on GCP using Pub/Sub, Dataflow, and BigQuery with SLOs, backpressure handling, and failure recovery.

Prerequisites: Pub/Sub: Delivery Guarantees, BigQuery Architecture, Overload & Backpressure, Reliability & SRE.


System Overview

Requirements

Functional Requirements:

- Ingest events from multiple sources (mobile apps, web, IoT devices)
- Process events in real time (transform, enrich, validate)
- Store processed events in a data warehouse (BigQuery)
- Support 1M+ events per second at peak load
- Support multiple event types (user events, transactions, telemetry)

Non-Functional Requirements:

- Latency: P95 < 5 seconds (event ingestion to queryable)
- Throughput: handle 1M+ events/second at peak
- Availability: 99.9% (SLO)
- Durability: 99.999999999% (11 nines)
- Delivery: at-least-once

Constraints:

- Must use GCP services
- Cost-effective at scale
- Support schema evolution


Architecture

High-Level Architecture

```mermaid
graph TB
    Sources[Event Sources<br/>Mobile, Web, IoT] --> PubSub[Pub/Sub<br/>Topics]
    PubSub --> Dataflow[Dataflow<br/>Streaming Pipeline]
    Dataflow --> Transform[Transform & Enrich]
    Dataflow --> Validate[Validate]
    Dataflow --> Dedupe[Deduplicate]
    Transform --> BigQuery[BigQuery<br/>Data Warehouse]
    Validate --> BigQuery
    Dedupe --> BigQuery
    Dataflow --> DLQ[Dead Letter Queue<br/>Pub/Sub]
    BigQuery --> Analytics[Analytics & Reporting]
    style PubSub fill:#99ccff
    style Dataflow fill:#ffcc99
    style BigQuery fill:#99ff99
    style DLQ fill:#ff9999
```

Component Details

1. Event Sources

2. Pub/Sub Topics

3. Dataflow Pipeline

4. BigQuery

5. Dead Letter Queue


Data Flow

Event Ingestion Flow

```mermaid
sequenceDiagram
    participant Source
    participant PubSub
    participant Dataflow
    participant BigQuery
    participant DLQ
    Source->>PubSub: Publish event
    PubSub->>PubSub: Store message
    PubSub->>Dataflow: Deliver message
    Dataflow->>Dataflow: Transform & enrich
    Dataflow->>Dataflow: Validate schema
    alt Valid
        Dataflow->>Dataflow: Deduplicate
        Dataflow->>BigQuery: Stream insert
        BigQuery-->>Dataflow: Success
        Dataflow->>PubSub: Acknowledge
    else Invalid
        Dataflow->>DLQ: Publish failed event
        Dataflow->>PubSub: Acknowledge (to remove from subscription)
    end
```
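A minimal sketch of the publish step using the standard Pub/Sub client, assuming JSON-encoded events; the project and topic names are hypothetical:

```python
import json

from google.cloud import pubsub_v1

# Hypothetical project and topic names; substitute your own.
PROJECT_ID = "my-project"
TOPIC_ID = "user-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

def publish_event(event: dict) -> None:
    """Publish one event as JSON; attributes carry routing metadata."""
    data = json.dumps(event).encode("utf-8")
    future = publisher.publish(
        topic_path,
        data=data,
        event_type=event.get("type", "unknown"),  # message attribute
    )
    # Block until the server acknowledges the publish (raises on failure).
    future.result(timeout=30)

publish_event({"type": "user_event", "user_id": "u123", "action": "click"})
```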

Processing Stages

Stage 1: Read from Pub/Sub

Stage 2: Transform & Enrich

Stage 3: Validate

Stage 4: Deduplicate

Stage 5: Write to BigQuery
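A minimal Apache Beam (Python) sketch of the five stages above, under stated assumptions: resource names are hypothetical, events are JSON with `event_id`/`event_type`/`event_timestamp` fields, validation is a required-field check (enrichment omitted for brevity), and deduplication uses Beam's built-in stateful `DeduplicatePerKey` keyed on `event_id` rather than an external store such as Firestore:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.deduplicate import DeduplicatePerKey
from apache_beam.utils.timestamp import Duration

# Hypothetical resource names; substitute your own.
SUBSCRIPTION = "projects/my-project/subscriptions/user-events-sub"
OUTPUT_TABLE = "my-project:analytics.events"
DLQ_TOPIC = "projects/my-project/topics/events-dlq"

REQUIRED_FIELDS = {"event_id", "event_type", "event_timestamp"}

def parse_and_validate(raw: bytes):
    """Stages 2-3: parse JSON and tag each event as valid or invalid."""
    try:
        event = json.loads(raw.decode("utf-8"))
        if REQUIRED_FIELDS <= event.keys():
            yield beam.pvalue.TaggedOutput("valid", event)
            return
    except (ValueError, AttributeError):
        pass
    yield beam.pvalue.TaggedOutput("invalid", raw)

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    parsed = (
        p
        # Stage 1: read raw bytes from the Pub/Sub subscription.
        | "Read" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
        | "ParseValidate"
        >> beam.FlatMap(parse_and_validate).with_outputs("valid", "invalid")
    )
    (
        parsed.valid
        # Stage 4: drop duplicate event_ids seen within a 10-minute window.
        | "KeyById" >> beam.Map(lambda e: (e["event_id"], e))
        | "Dedupe" >> DeduplicatePerKey(processing_time_duration=Duration(seconds=600))
        | "DropKey" >> beam.Values()
        # Stage 5: stream rows into BigQuery.
        | "Write" >> beam.io.WriteToBigQuery(
            OUTPUT_TABLE,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
    # Invalid payloads go to the dead letter topic for later inspection.
    parsed.invalid | "ToDLQ" >> beam.io.WriteToPubSub(DLQ_TOPIC)
```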


SLIs, SLOs & Error Budgets

SLIs (Service Level Indicators)

1. Ingestion Latency SLI

2. Throughput SLI

3. Error Rate SLI

4. Availability SLI

SLOs (Service Level Objectives)

| SLI | SLO | Error Budget |
| --- | --- | --- |
| Ingestion latency | P95 < 5 seconds | > 5 seconds for > 0.1% of events |
| Throughput | Handle 1M events/second | < 1M events/second for > 0.1% of time |
| Error rate | < 0.5% | > 0.5% for > 0.1% of events |
| Availability | 99.9% | < 99.9% for > 0.1% of time |

Error Budget Policy

Policy:

- Above 50% remaining: normal operations; features can ship
- 25-50% remaining: warning; reduce risky changes
- Below 25% remaining: critical; stop feature work, focus on reliability
- 0% remaining: emergency; reliability work only
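To make the policy concrete, a minimal sketch of the budget arithmetic for the 99.9% availability SLO over an assumed 30-day window (numbers illustrative):

```python
SLO_TARGET = 0.999             # 99.9% availability SLO
WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

def error_budget_remaining(bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, 0.0 = exhausted)."""
    budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES  # 43.2 minutes for 99.9%/30d
    return max(0.0, 1 - bad_minutes / budget_minutes)

# 20 bad minutes against a 43.2-minute budget leaves ~54% of the budget.
print(f"{error_budget_remaining(20):.0%} remaining")
```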


Capacity Planning

Current Capacity

Pub/Sub:

- Topics: 3 (user-events, transactions, telemetry)
- Throughput: 1M messages/second per topic
- Subscriptions: 1 per topic

Dataflow:

- Workers: auto-scaling (10-100 workers)
- Throughput: 100K events/second per worker
- Total capacity: 10M events/second at full scale

BigQuery:

- Slots: on-demand (2,000 slots)
- Throughput: millions of rows/second
- Storage: petabytes

Scaling Strategy

Auto-scaling:

- Pub/Sub: automatic (scales with load)
- Dataflow: auto-scaling based on backlog
  - Scale up when backlog > threshold
  - Scale down when backlog < threshold
- BigQuery: automatic (scales with load)

Manual scaling:

- Dataflow: adjust min/max workers if needed (see the sketch below)
- BigQuery: reserve slots for predictable workloads
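A minimal sketch of setting those worker bounds when launching the Beam job on Dataflow; the project and region values are hypothetical, and `--num_workers` sets the initial (not strictly minimum) worker count:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Standard Dataflow pipeline options; project/region values are hypothetical.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--streaming",
    "--autoscaling_algorithm=THROUGHPUT_BASED",
    "--num_workers=10",        # initial worker count
    "--max_num_workers=100",   # autoscaling ceiling
])
```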

Capacity Forecasting

Growth Projection:

- Current: 800K events/second average, 1M peak
- Growth: 30% per quarter (compounded)
- 6 months: ~1.35M events/second average, ~1.7M peak
- 12 months: ~2.3M events/second average, ~2.9M peak

Capacity Needs:

- 6 months: handle ~2M events/second peak (with headroom)
- 12 months: handle ~3M events/second peak
- Plan: optimize the pipeline; add more Pub/Sub topics if needed
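The projections compound 30% per quarter; a quick sketch of the arithmetic:

```python
def forecast(rate: float, quarterly_growth: float, quarters: int) -> float:
    """Compound growth: rate * (1 + g)^quarters."""
    return rate * (1 + quarterly_growth) ** quarters

for label, rate in [("average", 800_000), ("peak", 1_000_000)]:
    for quarters in (2, 4):  # 6 and 12 months
        value = forecast(rate, 0.30, quarters)
        print(f"{label} at {quarters * 3} months: {value:,.0f} events/second")
# average at 6 months: 1,352,000   average at 12 months: 2,284,880
# peak at 6 months:    1,690,000   peak at 12 months:    2,856,100
```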


Failure Modes & Blast Radius

Pipeline Failures

Scenario 1: Pub/Sub Outage

Scenario 2: Dataflow Failure

Scenario 3: BigQuery Unavailable

Backpressure Scenarios

Scenario 1: High Event Rate

Scenario 2: Slow Downstream (BigQuery)

Overload Scenarios

10× Normal Load (10M events/second)

100× Normal Load (100M events/second)


Observability

Metrics

Pipeline Metrics

Component Metrics

Business Metrics

Dashboards

Pipeline Dashboard:

- Ingestion rate, processing rate, latency
- Error rate, backlog
- SLO compliance, error budget

Component Dashboards:

- Pub/Sub: publish rate, delivery rate, backlog
- Dataflow: worker count, processing rate, errors
- BigQuery: insert rate, query performance

DLQ Dashboard:

- Failed event count, failure reasons
- Reprocessing status

Logs

Application Logs:

- Event processing logs (structured JSON)
- Error logs with stack traces
- Performance logs (slow processing)

Infrastructure Logs:

- Pub/Sub logs
- Dataflow logs
- BigQuery logs

Alerts

Critical Alerts:

- Pipeline unavailable
- High error rate (> 1%)
- High backlog (above threshold)
- SLO violation

Warning Alerts:

- High latency
- Rising error rate
- Growing backlog
- Component failures
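These alerts can be driven from Cloud Monitoring; a hedged sketch of the backlog check, reading Pub/Sub's `num_undelivered_messages` metric (project, subscription, and threshold are illustrative):

```python
import time

from google.cloud import monitoring_v3

# Hypothetical project and subscription; substitute your own.
PROJECT = "projects/my-project"
SUBSCRIPTION_ID = "user-events-sub"
BACKLOG_THRESHOLD = 1_000_000  # illustrative alerting threshold

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 300}}
)

# Pub/Sub exports backlog as subscription/num_undelivered_messages.
results = client.list_time_series(
    request={
        "name": PROJECT,
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id = "{SUBSCRIPTION_ID}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    latest = series.points[0].value.int64_value  # points are newest-first
    if latest > BACKLOG_THRESHOLD:
        print(f"ALERT: backlog {latest} exceeds {BACKLOG_THRESHOLD}")
```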


Deployment & Rollout Strategy

Deployment Process

```mermaid
graph LR
    Code[Code] --> CI[CI/CD Pipeline]
    CI --> Build[Build Pipeline]
    Build --> Test[Test Pipeline]
    Test --> Deploy[Deploy Pipeline]
    Deploy --> Canary[Canary Deployment]
    Canary --> Monitor[Monitor Metrics]
    Monitor -->|Success| Prod[Production]
    Monitor -->|Failure| Rollback[Rollback]
    style Canary fill:#ffcc99
    style Prod fill:#99ff99
    style Rollback fill:#ff9999
```

Rollout Process

Phase 1: Canary (5% traffic)

  1. Deploy: deploy the new pipeline version to 5% of traffic
  2. Monitor: watch metrics for 30 minutes
  3. Check error rate, latency, and SLO compliance
  4. Compare against the baseline version
  5. Decision:
     - Success: proceed to Phase 2
     - Failure: roll back immediately

Phase 2: Gradual Rollout (25%, 50%, 100%)

  1. 25% traffic: Monitor for 15 minutes
  2. 50% traffic: Monitor for 15 minutes
  3. 100% traffic: Monitor for 1 hour
  4. Complete: Mark rollout complete
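A hedged sketch of the success/failure decision in Phase 1 step 5, assuming canary and baseline metrics are already aggregated (all thresholds illustrative):

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float    # fraction of failed events
    p95_latency_s: float

def canary_passes(canary: Metrics, baseline: Metrics) -> bool:
    """Gate: absolute SLO limits plus no significant regression vs. baseline."""
    if canary.error_rate > 0.01:      # hard limit: 1% errors
        return False
    if canary.p95_latency_s > 5.0:    # hard limit: the 5 s latency SLO
        return False
    # Regression check: allow at most 20% worse than baseline (illustrative).
    if canary.error_rate > baseline.error_rate * 1.2:
        return False
    if canary.p95_latency_s > baseline.p95_latency_s * 1.2:
        return False
    return True

# Example: canary slightly worse than baseline but within all limits.
print(canary_passes(Metrics(0.002, 3.1), Metrics(0.002, 2.9)))  # True
```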

Rollback Strategy

Automatic Rollback:

- Trigger: error rate > 1%, latency above threshold, or SLO violation
- Action: automatically roll back to the previous version
- Time: < 5 minutes

Manual Rollback:

- Trigger: manual decision
- Action: roll back via the CI/CD pipeline
- Time: < 10 minutes


Security

Authentication & Authorization

Authentication:

- Pub/Sub: service accounts for publishers
- Dataflow: service accounts for the pipeline
- BigQuery: service accounts for writes

Authorization:

- IAM policies on all components
- Principle of least privilege

Data Protection

Encryption:

- At rest: all data encrypted (Pub/Sub, BigQuery)
- In transit: TLS for all connections
- Keys: managed by Cloud KMS
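As one concrete example, a hedged sketch of creating a Pub/Sub topic protected by a customer-managed KMS key (all resource names hypothetical; the Pub/Sub service account needs `roles/cloudkms.cryptoKeyEncrypterDecrypter` on the key):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical names; the KMS key must already exist.
topic_path = publisher.topic_path("my-project", "user-events")
kms_key = (
    "projects/my-project/locations/us-central1/"
    "keyRings/pipeline-ring/cryptoKeys/pipeline-key"
)
# Messages published to this topic are encrypted with the given CMEK.
topic = publisher.create_topic(
    request={"name": topic_path, "kms_key_name": kms_key}
)
print(f"Created CMEK-protected topic: {topic.name}")
```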

Data Privacy:

- PII handled per policy
- Data retained per retention policy
- Secure deletion procedures


Cost Optimization

Cost Breakdown

Monthly Costs (estimated for 1M events/second):

- Pub/Sub: $5,000 (message storage, operations)
- Dataflow: $10,000 (compute, streaming)
- BigQuery: $15,000 (storage, queries)
- Total: ~$30,000/month

Optimization Strategies

  1. Pub/Sub: Optimize message size, use compression
  2. Dataflow: Right-size workers, optimize pipeline
  3. BigQuery: Optimize schema, use partitioning/clustering (see the sketch after this list)
  4. Deduplication: Optimize deduplication logic (reduce Firestore calls)
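A minimal sketch of item 3: creating the events table partitioned by day on event time and clustered by event type, so queries scan only the partitions and blocks they touch (project, dataset, and schema are illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

schema = [
    bigquery.SchemaField("event_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("event_type", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("event_timestamp", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("payload", "JSON"),
]

table = bigquery.Table("my-project.analytics.events", schema=schema)
# Partition by day on event time so queries scan only the days they touch...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_timestamp",
)
# ...and cluster by event_type to prune blocks within each partition.
table.clustering_fields = ["event_type"]

client.create_table(table)
```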

Incident Response Playbook

Common Incidents

Incident 1: High Backlog

Symptoms:

- Backlog growing, latency increasing
- Events not being processed

Response:

1. Acknowledge: acknowledge the incident
2. Assess: check Pub/Sub and Dataflow backlogs
3. Mitigate:
   - Scale up Dataflow workers
   - Check BigQuery performance
   - Reduce event rate if possible
4. Investigate: root cause analysis
5. Resolve: fix the root cause
6. Postmortem: write a postmortem

Incident 2: High Error Rate

Symptoms:

- Error rate > 1%
- Many events in the DLQ

Response:

1. Acknowledge: acknowledge the incident
2. Assess: check error logs and the DLQ
3. Mitigate:
   - Fix validation logic if needed
   - Check for schema changes
   - Reprocess the DLQ if possible (see the sketch after this list)
4. Investigate: root cause analysis
5. Resolve: fix the root cause
6. Postmortem: write a postmortem
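A minimal sketch of draining the DLQ back into the main topic once the fix is deployed; resource names are hypothetical, and each message is acknowledged only after its republish succeeds so nothing is lost mid-drain:

```python
from google.cloud import pubsub_v1

# Hypothetical resource names; substitute your own.
DLQ_SUBSCRIPTION = "projects/my-project/subscriptions/events-dlq-sub"
MAIN_TOPIC = "projects/my-project/topics/user-events"

subscriber = pubsub_v1.SubscriberClient()
publisher = pubsub_v1.PublisherClient()

while True:
    response = subscriber.pull(
        request={"subscription": DLQ_SUBSCRIPTION, "max_messages": 100}
    )
    if not response.received_messages:
        break  # DLQ drained
    ack_ids = []
    for msg in response.received_messages:
        # Republish to the main topic; ack only if the publish succeeds.
        future = publisher.publish(MAIN_TOPIC, data=msg.message.data)
        future.result(timeout=30)
        ack_ids.append(msg.ack_id)
    subscriber.acknowledge(
        request={"subscription": DLQ_SUBSCRIPTION, "ack_ids": ack_ids}
    )
```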


Further Reading

Comprehensive Guide: Further Reading: Data Pipeline

Quick Links:

- Pub/Sub Documentation
- Dataflow Documentation
- BigQuery Documentation
- Overload & Backpressure
- Back to Case Studies


Exercises

  1. Design improvements: How would you improve this design? What tradeoffs?

  2. Handle schema evolution: How do you handle schema changes without breaking the pipeline?

  3. Optimize costs: How would you reduce costs by 30%? What tradeoffs?

Answer Key: View Answers