Global Content Delivery System

One-line summary: End-to-end design of a global CDN system on GCP using Cloud CDN, load balancing, and cache invalidation with SLOs, edge caching strategies, and consistency guarantees.

Prerequisites: VPC, Load Balancing & DNS, Cloud Storage Deep Dive, Reliability & SRE.


System Overview

Requirements

Functional Requirements:
- Serve static content (images, videos, CSS, JS) globally
- Support dynamic content (API responses, personalized content)
- Handle 100M+ requests per day
- Support multiple content types (images, videos, documents)

Non-Functional Requirements:
- Latency: P95 < 100ms (edge to user)
- Throughput: Handle 10K+ requests/second peak
- Availability: 99.9% (SLO)
- Cache hit rate: > 80% (SLO)
- Consistency: Eventual consistency (acceptable for content)

Constraints:
- Must use GCP services
- Cost-effective at scale
- Support cache invalidation
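
As a rough sanity check on the requirements, the daily volume can be converted to an average request rate and compared with the stated peak. This is only a sketch: it assumes the 100M requests/day and 10K requests/second figures describe the same traffic, and the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rps(requests_per_day: float) -> float:
    """Average requests per second implied by a daily request count."""
    return requests_per_day / SECONDS_PER_DAY


daily_requests = 100_000_000  # functional requirement: 100M+ requests per day
peak_rps = 10_000             # non-functional requirement: 10K+ requests/second

avg_rps = average_rps(daily_requests)
peak_to_average = peak_rps / avg_rps
print(f"average: {avg_rps:.0f} req/s, peak-to-average ratio: {peak_to_average:.1f}x")
```

Under these assumptions the peak target sits at several times the daily average, which leaves room for diurnal traffic swings.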


Architecture

High-Level Architecture

graph TB
    Users["Users<br/>Global"] --> CDN["Cloud CDN<br/>Edge Caches"]
    CDN --> LB["Global Load Balancer<br/>HTTP(S)"]
    LB --> Origin1["Origin Server 1<br/>us-central1"]
    LB --> Origin2["Origin Server 2<br/>europe-west1"]
    LB --> Origin3["Origin Server 3<br/>asia-east1"]
    Origin1 --> Storage1["Cloud Storage<br/>us-central1"]
    Origin2 --> Storage2["Cloud Storage<br/>europe-west1"]
    Origin3 --> Storage3["Cloud Storage<br/>asia-east1"]
    Origin1 --> API["API Server<br/>Dynamic Content"]
    Origin2 --> API
    Origin3 --> API
    API --> DB[("Database<br/>Spanner")]
    style CDN fill:#99ccff
    style LB fill:#ffcc99
    style Storage1 fill:#99ff99

Component Details

1. Cloud CDN

2. Global Load Balancer

3. Origin Servers

4. Cloud Storage

5. API Server


Data Flow

Static Content Flow

sequenceDiagram
    participant User
    participant CDN
    participant LB
    participant Origin
    participant Storage
    User->>CDN: GET /images/photo.jpg
    alt Cache hit
        CDN-->>User: Return cached content
    else Cache miss
        CDN->>LB: Request content
        LB->>Origin: Route to nearest origin
        Origin->>Storage: Get object
        Storage-->>Origin: Return object
        Origin->>CDN: Return content + cache headers
        CDN->>CDN: Cache content
        CDN-->>User: Return content
    end
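
The hit/miss branch in this flow can be sketched as a toy in-memory cache with a per-object TTL. `EdgeCache` and the `fetch_from_origin` callback are hypothetical names for illustration, not Cloud CDN APIs; a real edge cache also handles eviction, revalidation, and request coalescing, all of which this omits.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CacheEntry:
    body: bytes
    expires_at: float


class EdgeCache:
    """Toy stand-in for a single edge cache (illustrative only)."""

    def __init__(
        self,
        fetch_from_origin: Callable[[str], Tuple[bytes, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_from_origin  # returns (body, ttl_seconds)
        self._clock = clock
        self._entries: Dict[str, CacheEntry] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> bytes:
        entry = self._entries.get(path)
        if entry is not None and entry.expires_at > self._clock():
            self.hits += 1  # cache hit: serve without contacting the origin
            return entry.body
        self.misses += 1  # cache miss or expired entry: go to the origin
        body, ttl = self._fetch(path)
        if ttl > 0:  # a TTL of 0 means "do not cache"
            self._entries[path] = CacheEntry(body, self._clock() + ttl)
        return body

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


origin_calls = []


def origin(path: str) -> Tuple[bytes, float]:
    origin_calls.append(path)
    return b"image-bytes", 3600.0  # assumed one-hour TTL


cache = EdgeCache(origin)
cache.get("/images/photo.jpg")  # miss: fetched from origin
cache.get("/images/photo.jpg")  # hit: served from cache
print(cache.hit_rate(), len(origin_calls))
```

The second request never reaches the origin, which is the effect the cache hit rate SLO is meant to protect.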

Dynamic Content Flow

sequenceDiagram
    participant User
    participant CDN
    participant LB
    participant Origin
    participant API
    participant DB
    User->>CDN: GET /api/user/123
    alt Cache hit
        CDN-->>User: Return cached response
    else Cache miss
        CDN->>LB: Request content
        LB->>Origin: Route to nearest origin
        Origin->>API: Generate dynamic content
        API->>DB: Query database
        DB-->>API: Return data
        API-->>Origin: Return response + cache headers
        Origin->>CDN: Return response + cache headers
        CDN->>CDN: Cache response (if cacheable)
        CDN-->>User: Return response
    end
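
Whether a dynamic response is cacheable is typically signalled by the origin through the `Cache-Control` response header. The sketch below shows one way an origin might choose that header; the path prefix, the TTL values, and the `personalized` flag are assumptions for illustration, not values from this design. Note that a per-user response such as `/api/user/123` would normally be marked non-cacheable for shared caches.

```python
def cache_control_for(path: str, personalized: bool = False) -> str:
    """Pick a Cache-Control value for a response (illustrative policy)."""
    if personalized:
        # Per-user data must not be stored by shared caches such as a CDN.
        return "private, no-store"
    if path.startswith("/api/"):
        # Shared API responses: short TTL to bound staleness (assumed 60 s).
        return "public, max-age=60"
    # Static assets: long TTL (assumed 1 day).
    return "public, max-age=86400"


print(cache_control_for("/images/photo.jpg"))
print(cache_control_for("/api/catalog"))
print(cache_control_for("/api/user/123", personalized=True))
```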

Cache Invalidation Flow

sequenceDiagram
    participant Admin
    participant Origin
    participant CDN
    participant Cache
    Admin->>Origin: Invalidate cache
    Origin->>CDN: Purge cache (URL pattern)
    CDN->>Cache: Remove cached content
    Cache-->>CDN: Confirmed
    CDN-->>Origin: Cache purged
    Origin-->>Admin: Cache invalidated
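
The "purge by URL pattern" step can be illustrated with a small matcher over a dictionary standing in for the cache. This sketch assumes a pattern is either an exact path or a prefix ending in `/*`; check the Cloud CDN invalidation documentation for the pattern rules the service actually accepts. The `matches` and `purge` helpers are hypothetical.

```python
from typing import Dict, List


def matches(pattern: str, path: str) -> bool:
    """True if path is covered by an exact pattern or a trailing-wildcard prefix."""
    if pattern.endswith("/*"):
        return path.startswith(pattern[:-1])  # keep the trailing slash
    return path == pattern


def purge(cache: Dict[str, bytes], pattern: str) -> List[str]:
    """Remove matching entries and return the purged paths in sorted order."""
    removed = sorted(p for p in cache if matches(pattern, p))
    for p in removed:
        del cache[p]
    return removed


edge = {"/images/a.jpg": b"a", "/images/b.jpg": b"b", "/css/site.css": b"c"}
purged = purge(edge, "/images/*")
print(purged, list(edge))
```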

SLIs, SLOs & Error Budgets

SLIs (Service Level Indicators)

1. Latency SLI

2. Cache Hit Rate SLI

3. Availability SLI

4. Error Rate SLI

SLOs (Service Level Objectives)

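| SLI            | SLO         | Error Budget                |
|----------------|-------------|-----------------------------|
| Latency        | P95 < 100ms | > 100ms for > 0.1% requests |
| Cache Hit Rate | > 80%       | < 80% for > 0.1% requests   |
| Availability   | 99.9%       | < 99.9% for > 0.1% time     |
| Error Rate     | < 0.1%      | > 0.1% for > 0.1% requests  |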
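
To make the availability and error rate budgets concrete, the allowance can be computed for a fixed window. The sketch assumes a 30-day window and the 100M requests/day volume from the requirements; neither the window length nor the function names come from the original design.

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability allowed by an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def failed_request_budget(slo: float, total_requests: int) -> int:
    """Number of failed requests allowed by a success-rate SLO."""
    return int(round((1.0 - slo) * total_requests))


monthly_requests = 100_000_000 * 30  # assumed 30-day month
print(f"{downtime_budget_minutes(0.999):.1f} minutes of downtime")
print(f"{failed_request_budget(0.999, monthly_requests):,} failed requests")
```

With a 99.9% target this works out to roughly 43 minutes of downtime, or about 3 million failed requests, per 30 days.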

Error Budget Policy

Policy:
- > 50% remaining: Normal operations, can ship features
- 25-50% remaining: Warning, reduce risky changes
- < 25% remaining: Critical, stop feature work, focus on reliability
- 0% remaining: Emergency, only reliability work
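
The policy tiers map naturally onto a small lookup function. In this sketch the boundary handling (exactly 25% and exactly 50% both fall in the warning tier) is an assumption, since the policy does not say which side the boundaries belong to.

```python
def budget_policy(remaining_fraction: float) -> str:
    """Map the remaining error budget (0.0 to 1.0) to a policy tier."""
    if remaining_fraction <= 0.0:
        return "emergency"  # only reliability work
    if remaining_fraction < 0.25:
        return "critical"   # stop feature work, focus on reliability
    if remaining_fraction <= 0.50:
        return "warning"    # reduce risky changes
    return "normal"         # normal operations, can ship features


for remaining in (0.8, 0.4, 0.1, 0.0):
    print(f"{remaining:.0%} remaining -> {budget_policy(remaining)}")
```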


Capacity Planning

Current Capacity

CDN:
- Edge locations: 100+ locations worldwide
- Throughput: Unlimited (scales automatically)
- Cache capacity: Petabytes

Load Balancer:
- Throughput: Millions of requests/second
- Regions: 3 regions (us, eu, asia)

Origin Servers:
- Per region: 10 pods (can scale to 50)
- QPS capacity: ~10K QPS per region (can scale to 50K)
- Total: ~30K QPS (can scale to 150K)

Storage:
- Per region: Petabytes
- Throughput: Millions of requests/second
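
Origin capacity only has to absorb the traffic that misses the cache, so the origin load can be estimated from the edge request rate and the cache hit rate. The sketch below assumes the hit rate applies uniformly across requests, which is a simplification; the function name is illustrative.

```python
def origin_rps(edge_rps: float, cache_hit_rate: float) -> float:
    """Requests per second that reach the origin after cache hits are removed."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return edge_rps * (1.0 - cache_hit_rate)


ORIGIN_CAPACITY_RPS = 30_000  # ~10K QPS per region across 3 regions

for hit_rate in (0.8, 0.5, 0.0):
    load = origin_rps(10_000, hit_rate)
    utilization = load / ORIGIN_CAPACITY_RPS
    print(f"hit rate {hit_rate:.0%}: origin load {load:,.0f} req/s ({utilization:.0%} of capacity)")
```

Under these figures even a complete cache miss storm at the 10K requests/second peak stays within the current ~30K QPS origin capacity, while a 10× traffic spike would depend heavily on the hit rate holding.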

Scaling Strategy

Auto-scaling:
- CDN: Automatic (handles load)
- Load Balancer: Automatic (handles load)
- Origin Servers: Auto-scaling based on load
- Storage: Automatic (handles load)

Manual scaling:
- Origin Servers: Adjust min/max replicas if needed

Capacity Forecasting

Growth Projection:
- Current: 50M requests/day average, 100M peak
- Growth: 25% per quarter (compounded)
- 6 months: ~78M requests/day average, ~156M peak
- 12 months: ~122M requests/day average, ~244M peak

Capacity Needs:
- 6 months: Need to handle ~200M requests/day peak (projected ~156M plus headroom)
- 12 months: Need to handle ~300M requests/day peak (projected ~244M plus headroom)
- Plan: Optimize caching, scale origin servers
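
The projection follows from compounding the 25% quarterly growth rate. A minimal sketch, assuming the peak stays at twice the average as in the current figures:

```python
def project(current: float, quarterly_growth: float, quarters: int) -> float:
    """Compound a value forward by a fixed quarterly growth rate."""
    return current * (1.0 + quarterly_growth) ** quarters


current_avg_m = 50.0      # million requests/day today
peak_multiplier = 2.0     # assumed: 100M peak / 50M average

for months, quarters in ((6, 2), (12, 4)):
    avg = project(current_avg_m, 0.25, quarters)
    print(f"{months} months: ~{avg:.0f}M average, ~{avg * peak_multiplier:.0f}M peak")
```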


Failure Modes & Blast Radius

CDN Failures

Scenario 1: CDN Edge Failure

Scenario 2: Origin Server Failure

Scenario 3: Storage Failure

Cache Failures

Scenario 1: Cache Miss Storm

Scenario 2: Stale Cache

Overload Scenarios

10× Normal Load (1B requests/day)

100× Normal Load (10B requests/day)


Observability

Metrics

CDN Metrics

Origin Metrics

Storage Metrics

Dashboards

CDN Dashboard:
- Request rate, cache hit rate, latency
- Error rate, bandwidth
- SLO compliance, error budget

Origin Dashboard:
- Request rate, latency, error rate
- Cache miss rate, origin load

Storage Dashboard:
- Request rate, latency, error rate
- Storage size, storage costs

Logs

CDN Logs:
- Access logs (if enabled)
- Error logs
- Cache hit/miss logs

Origin Logs:
- Access logs
- Error logs
- Cache invalidation logs

Alerts

Critical Alerts:
- CDN unavailable
- High error rate (> 1%)
- Cache hit rate < 70%
- SLO violation

Warning Alerts:
- High latency
- Cache hit rate decreasing
- Origin load increasing
- Storage errors
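
The critical thresholds that carry explicit numbers can be expressed as a simple evaluation function. This is a sketch of the logic only; in practice these conditions would live in the monitoring system's alerting rules. The function and parameter names are hypothetical, and the "SLO violation" alert is left out because it depends on how each SLO is measured.

```python
from typing import List


def critical_alerts(
    error_rate: float, cache_hit_rate: float, cdn_available: bool = True
) -> List[str]:
    """Return the critical alerts that fire for the given measurements."""
    alerts: List[str] = []
    if not cdn_available:
        alerts.append("CDN unavailable")
    if error_rate > 0.01:
        alerts.append("High error rate (> 1%)")
    if cache_hit_rate < 0.70:
        alerts.append("Cache hit rate < 70%")
    return alerts


print(critical_alerts(error_rate=0.002, cache_hit_rate=0.85))
print(critical_alerts(error_rate=0.03, cache_hit_rate=0.65))
```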


Deployment & Rollout Strategy

Deployment Process

Static Content:
  1. Upload: Upload to Cloud Storage
  2. Invalidate: Invalidate CDN cache (if needed)
  3. Verify: Verify content served correctly

Dynamic Content:
  1. Deploy: Deploy API changes
  2. Canary: Deploy to canary (5% traffic)
  3. Monitor: Monitor for issues
  4. Rollout: Gradual rollout (25%, 50%, 100%)
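
The canary and gradual rollout steps amount to advancing through traffic stages and stopping as soon as a health check fails. A minimal sketch, assuming a caller-supplied `is_healthy` check (hypothetical) that inspects monitoring at each stage:

```python
from typing import Callable, Sequence

STAGES: Sequence[int] = (5, 25, 50, 100)  # canary, then gradual rollout


def rollout(is_healthy: Callable[[int], bool], stages: Sequence[int] = STAGES) -> int:
    """Advance through traffic stages; return the last percentage that stayed healthy.

    A return value of 0 means the canary itself failed and the change
    should be rolled back.
    """
    reached = 0
    for percent in stages:
        if not is_healthy(percent):
            return reached  # stop here; the caller decides how to roll back
        reached = percent
    return reached


print(rollout(lambda percent: True))          # healthy at every stage
print(rollout(lambda percent: percent < 50))  # problem appears at 50% traffic
```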

Cache Invalidation Strategy

Invalidation Methods:
- URL invalidation: Invalidate specific URLs
- Path invalidation: Invalidate URL patterns
- Full invalidation: Invalidate all cache (rare)

Best Practices:
- Versioned URLs: Use versioned URLs (e.g., /v1/image.jpg)
- Selective invalidation: Invalidate only changed content
- Scheduled invalidation: Invalidate during low traffic
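
One common variant of versioned URLs embeds a hash of the file contents in the file name, so a changed file gets a new URL and no invalidation is needed. This is a different scheme from the `/v1/` path prefix in the example above, shown here as an alternative sketch; the `versioned_path` helper and the 8-character digest length are assumptions.

```python
import hashlib
import posixpath


def versioned_path(path: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    directory, filename = posixpath.split(path)
    stem, ext = posixpath.splitext(filename)
    return posixpath.join(directory, f"{stem}.{digest}{ext}")


old_url = versioned_path("/images/photo.jpg", b"original bytes")
new_url = versioned_path("/images/photo.jpg", b"updated bytes")
print(old_url)
print(new_url)
```

Because the two URLs differ, caches holding the old object are simply never asked for it again, and long TTLs become safe for these assets.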


Security

Authentication & Authorization

CDN:
- Signed URLs: Use signed URLs for private content
- Access control: IAM policies for Cloud Storage

Origin:
- IAM: IAM policies for origin servers
- Service accounts: Use service accounts
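
The idea behind signed URLs is that the URL carries an expiry time and a signature computed with a secret key, so the edge can reject expired or tampered links. The sketch below is a generic HMAC scheme for illustration only; it is not Cloud CDN's actual signing format, and the parameter names, host name, and key are made up.

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def _signature(unsigned_url: str, key: bytes) -> str:
    mac = hmac.new(key, unsigned_url.encode("utf-8"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(mac).decode("ascii").rstrip("=")


def sign_url(url: str, key: bytes, expires_at: int) -> str:
    """Append an expiry timestamp and an HMAC signature to the URL."""
    separator = "&" if "?" in url else "?"
    unsigned = f"{url}{separator}{urlencode({'expires': expires_at})}"
    return f"{unsigned}&signature={_signature(unsigned, key)}"


def verify_url(signed_url: str, key: bytes, now: int) -> bool:
    """Check the signature first, then the expiry time."""
    unsigned, marker, signature = signed_url.rpartition("&signature=")
    if not marker:
        return False
    if not hmac.compare_digest(_signature(unsigned, key), signature):
        return False
    expires = parse_qs(urlparse(unsigned).query).get("expires", ["0"])[-1]
    return now < int(expires)


secret = b"demo-key-not-for-production"
signed = sign_url("https://cdn.example.com/private/report.pdf", secret, expires_at=2_000)
print(verify_url(signed, secret, now=1_000))  # still valid
print(verify_url(signed, secret, now=3_000))  # expired
```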

Data Protection

Encryption:
- At rest: All data encrypted (Cloud Storage)
- In transit: TLS for all connections
- CDN: TLS termination at CDN

DDoS Protection:
- Cloud Armor: WAF and DDoS protection
- Rate limiting: Rate limiting per IP/client
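
Per-client rate limiting is commonly explained with a token bucket: each client accrues tokens at a steady rate up to a burst limit, and each request spends one. The sketch below illustrates that concept only; it does not describe how Cloud Armor implements rate limiting, and the class name, rate, and burst values are assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-client token bucket (illustrative, single-process, not thread-safe)."""

    def __init__(
        self,
        rate_per_sec: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_sec
        self._burst = float(burst)
        self._clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # ip -> (tokens, last_seen)

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(client_ip, (self._burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        if tokens >= 1.0:
            self._buckets[client_ip] = (tokens - 1.0, now)
            return True
        self._buckets[client_ip] = (tokens, now)
        return False


fake_time = [0.0]
limiter = TokenBucketLimiter(rate_per_sec=1.0, burst=2, clock=lambda: fake_time[0])
results = [limiter.allow("203.0.113.7") for _ in range(3)]
print(results)  # the burst of 2 is allowed, the third request is rejected
```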


Cost Optimization

Cost Breakdown

Monthly Costs (estimated for 100M requests/day):
- CDN: $3,000 (egress, cache)
- Load Balancer: $500 (traffic)
- Origin Servers: $2,000 (compute)
- Storage: $1,000 (storage, egress)
- Total: ~$6,500/month
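
The breakdown can be checked and turned into a unit cost with a few lines of arithmetic. The sketch assumes a 30-day month and takes the estimated figures at face value; they are estimates from this document, not published pricing.

```python
MONTHLY_COSTS_USD = {
    "CDN": 3_000,
    "Load Balancer": 500,
    "Origin Servers": 2_000,
    "Storage": 1_000,
}

total_usd = sum(MONTHLY_COSTS_USD.values())
requests_per_month = 100_000_000 * 30  # assumed 30-day month
cost_per_million = total_usd / (requests_per_month / 1_000_000)

print(f"total: ${total_usd:,}/month")
print(f"cost per million requests: ${cost_per_million:.2f}")
for component, cost in MONTHLY_COSTS_USD.items():
    print(f"{component}: {cost / total_usd:.0%} of total")
```

The CDN line is the largest share of the total, which is why improving the cache hit rate and egress efficiency tends to be the first optimization target.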

Optimization Strategies

  1. CDN: Optimize cache hit rate, reduce origin load
  2. Storage: Use appropriate storage classes, lifecycle policies
  3. Origin: Optimize origin performance, reduce compute
  4. Caching: Optimize cache TTL, reduce cache misses

Incident Response Playbook

Common Incidents

Incident 1: Low Cache Hit Rate

Symptoms:
- Cache hit rate < 70%
- High origin load
- Increased latency

Response:
  1. Acknowledge: Acknowledge incident
  2. Assess: Check cache hit rate, origin load
  3. Mitigate:
     - Optimize cache TTL
     - Pre-warm cache
     - Scale origin servers
  4. Investigate: Root cause analysis
  5. Resolve: Fix root cause
  6. Postmortem: Write postmortem

Incident 2: Stale Content

Symptoms:
- Users report stale content
- Cache not invalidated

Response:
  1. Acknowledge: Acknowledge incident
  2. Assess: Check cache invalidation logs
  3. Mitigate:
     - Invalidate cache immediately
     - Fix cache invalidation process
  4. Investigate: Root cause analysis
  5. Resolve: Fix root cause
  6. Postmortem: Write postmortem


Further Reading

Comprehensive Guide: Further Reading: CDN System

Quick Links:
- Cloud CDN Documentation
- Load Balancing Documentation
- Cloud Storage Documentation
- VPC, LB & DNS


Exercises

  1. Design improvements: How would you improve this design? What tradeoffs?

  2. Handle cache invalidation: How do you handle cache invalidation for frequently updated content?

  3. Optimize costs: How would you reduce costs by 30%? What tradeoffs?

Answer Key: View Answers