Learning Index: Staff-Level GCP System Design & SRE


Learning Path (Curriculum)

Follow these phases in order. Each phase builds on the previous one.

Phase 0: Foundations & Quantitative Reasoning

Goal: Build mental models for capacity, latency, and system behavior.

Topics:
- Queueing Theory & Tail Latency
- Capacity Math Cheat Sheet
- Observability Basics

Milestone: You can reason about P50/P95/P99 latency, calculate capacity needs, and identify which metrics matter.

Suggested pacing: 1-2 weeks
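
As a rough illustration of the Phase 0 milestone, the sketch below computes nearest-rank percentiles from a latency sample and sizes a fleet with Little's law (in-flight requests = arrival rate × mean latency). The sample values, the function names `percentile` and `required_instances`, and the 60% target utilization are all assumptions chosen for the example, not figures from the course material.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def required_instances(peak_qps, mean_latency_s, concurrency_per_instance,
                       target_utilization=0.6):
    """Size a fleet from Little's law, leaving headroom via target_utilization."""
    in_flight = peak_qps * mean_latency_s  # average concurrent requests
    usable_slots = concurrency_per_instance * target_utilization
    return math.ceil(in_flight / usable_slots)


# Hypothetical latency sample in milliseconds; one slow outlier dominates the tail.
latencies_ms = [12, 14, 15, 15, 16, 18, 21, 25, 40, 250]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Hypothetical service: 2000 QPS at peak, 50 ms mean latency, 8 concurrent requests per instance.
instances = required_instances(peak_qps=2000, mean_latency_s=0.05,
                               concurrency_per_instance=8)

print(f"P50={p50}ms P95={p95}ms P99={p99}ms instances={instances}")
```

With only ten samples, P95 and P99 both land on the single outlier, which is one reason tail percentiles need large sample counts before they are trustworthy.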


Phase 1: Distributed Systems Building Blocks

Goal: Understand the primitives that make distributed systems work.

Topics:
- Time, Ordering, and Causality
- Consensus & Leases
- Replication Strategies
- Sharding & Partitioning
- Overload & Backpressure
- Idempotency & Retry Semantics
- Queues & Streams

Milestone: You understand when to use consensus vs leases, how to handle overload gracefully, and the tradeoffs in replication strategies.

Suggested pacing: 3-4 weeks


Phase 2: GCP Core Building Blocks

Goal: Develop a deep understanding of GCP primitives: how they work, fail, and scale.

Topics:
- VPC, Load Balancing & DNS
- GKE Control Plane & Data Plane
- IAM Evaluation Model
- Cloud Storage Deep Dive
- Spanner: Consistency & Performance
- Bigtable: Design & Tradeoffs
- BigQuery Architecture
- Pub/Sub: Delivery Guarantees
- Cloud KMS & Secret Management
- AlloyDB: PostgreSQL-Compatible Database

Milestone: You can design systems using GCP primitives, understand their failure modes, and make informed tradeoffs.

Suggested pacing: 4-6 weeks


Phase 3: Reliability Engineering & SRE

Goal: Build and operate reliable systems at scale.

Topics:
- SLIs, SLOs & Error Budgets
- Production Readiness Reviews (PRR)
- Incident Response & Postmortems
- Capacity Planning & Forecasting
- Load Shedding & Circuit Breakers
- Canary Deployments & Rollouts
- Testing for Failure

Milestone: You can define SLIs/SLOs, run PRRs, respond to incidents, and plan capacity.

Suggested pacing: 3-4 weeks
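
To make the SLO part of the Phase 3 milestone concrete, here is a minimal sketch of error-budget arithmetic for a request-based availability SLO. The 99.9% target, the request counts, the 30-day window, and the function names are assumptions picked for illustration only.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed failures, remaining failures, fraction of budget consumed)."""
    allowed = (1 - slo_target) * total_requests
    remaining = allowed - failed_requests
    consumed = failed_requests / allowed
    return allowed, remaining, consumed


def allowed_downtime_minutes(slo_target, window_days=30):
    """Full-outage minutes a time-based SLO tolerates over the window."""
    return (1 - slo_target) * window_days * 24 * 60


# Hypothetical month: 10 million requests, 2,500 of them failed, 99.9% SLO.
allowed, remaining, consumed = error_budget(0.999, 10_000_000, 2_500)
downtime = allowed_downtime_minutes(0.999)

print(f"allowed failures: {allowed:.0f}")
print(f"remaining budget: {remaining:.0f} ({consumed:.0%} consumed)")
print(f"equivalent full-outage budget: {downtime:.1f} minutes per 30 days")
```

The same numbers can drive a release policy: for instance, a team might pause risky rollouts once the consumed fraction crosses a threshold it has agreed on in advance.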


Phase 4: Case Studies & End-to-End Design

Goal: Apply everything to real-world scenarios.

Topics:
- Multi-Region API on GCP
- High-Throughput Data Pipeline
- Global Content Delivery System

Milestone: You can design, implement, and operate production systems with clear SLOs and operational playbooks.

Suggested pacing: 2-3 weeks per case study


Reference Map

Browse by topic area. Each entry links to its deep dive and lists its prerequisites.

Distributed Systems

| Topic | Deep Dive | Prerequisites |
| --- | --- | --- |
| Time & Ordering | Time, Ordering, Causality | Phase 0 |
| Consensus | Consensus & Leases | Time & Ordering |
| Replication | Replication Strategies | Consensus |
| Sharding | Sharding & Partitioning | Replication |
| Overload | Overload & Backpressure | Queueing Theory |
| Idempotency | Idempotency & Retries | Time & Ordering |
| Queues | Queues & Streams | Overload |

GCP Infrastructure

| Topic | Deep Dive | Prerequisites |
| --- | --- | --- |
| Networking | VPC, LB & DNS | Phase 0 |
| Kubernetes | GKE Internals | Networking |
| Identity | IAM Evaluation | Phase 0 |
| Storage | Cloud Storage | Networking |
| Spanner | Spanner Deep Dive | Consensus, Replication |
| Bigtable | Bigtable Design | Sharding |
| BigQuery | BigQuery Architecture | Storage, Sharding |
| Pub/Sub | Pub/Sub Guarantees | Queues, Idempotency |
| Secrets | KMS & Secrets | IAM |
| AlloyDB | AlloyDB Architecture | Spanner, Replication |

Reliability & SRE

| Topic | Deep Dive | Prerequisites |
| --- | --- | --- |
| SLIs/SLOs | SLIs, SLOs & Error Budgets | Phase 0 |
| PRR | PRR Checklist | SLIs/SLOs |
| Incidents | Incident Response | SLIs/SLOs |
| Capacity | Capacity Planning | Capacity Math |
| Load Shedding | Load Shedding | Overload |
| Rollouts | Canary & Rollouts | Testing for Failure |
| Testing | Testing for Failure | Phase 1 |

Low-Level Design Patterns

| Topic | Deep Dive | Prerequisites |
| --- | --- | --- |
| Rate Limiting | Rate Limiter Implementations | Queueing Theory |
| Circuit Breakers | Circuit Breaker Pattern | Load Shedding |
| Concurrency | Concurrency Primitives | Phase 1 |
| Idempotency | Idempotency Patterns | Idempotency & Retries |

Cross-Cutting Concerns

These topics appear across multiple chapters:


Learning Dependencies

flowchart TD
    Phase0[Phase 0: Foundations] --> Phase1[Phase 1: Distributed Systems]
    Phase1 --> Phase2[Phase 2: GCP Building Blocks]
    Phase2 --> Phase3[Phase 3: Reliability & SRE]
    Phase3 --> Phase4[Phase 4: Case Studies]
    Phase1 --> LLD[LLD Patterns]
    LLD --> Phase4
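
The dependency graph above can also be treated as data. As a small sketch, the snippet below encodes the same edges and uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) to derive one valid study order. The dictionary name `prerequisites` and the short node labels are assumptions for the example. Several orders are valid because Phase 2 and LLD Patterns both depend only on Phase 1.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of phases that must be completed first,
# mirroring the edges of the flowchart.
prerequisites = {
    "Phase 0": set(),
    "Phase 1": {"Phase 0"},
    "Phase 2": {"Phase 1"},
    "Phase 3": {"Phase 2"},
    "LLD Patterns": {"Phase 1"},
    "Phase 4": {"Phase 3", "LLD Patterns"},
}

# static_order() yields nodes so that every prerequisite precedes its dependents.
order = list(TopologicalSorter(prerequisites).static_order())
print(" -> ".join(order))
```

The same approach extends to the topic-level prerequisites in the Reference Map if a finer-grained study order is wanted.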

Next Steps

  1. Start with Phase 0 if you're new to quantitative reasoning
  2. Jump to a topic in the Reference Map if you have specific questions
  3. Track progress in PROGRESS.md
  4. Use templates in 00-meta/ for your own deep dives