Learning Index: Staff-Level GCP System Design & SRE

Learning Path - Follow this for structured learning
Reference Map - Jump to specific topics
Progress Tracker - Track your completion

Learning Path (Curriculum)

Follow these phases in order. Each phase builds on the previous one.

Phase 0: Foundations & Quantitative Reasoning

Goal: Build mental models for capacity, latency, and system behavior.

Topics:
- Queueing Theory & Tail Latency
- Capacity Math Cheat Sheet
- Observability Basics

Milestone: You can reason about P50/P95/P99 latency, calculate capacity needs, and understand what metrics matter.
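As a rough illustration of the reasoning this milestone describes, the sketch below computes latency percentiles and a first-pass instance count. It assumes the nearest-rank percentile definition (one of several in common use) and sizes capacity with Little's law (in-flight requests = arrival rate × mean latency). The function names, the per-instance concurrency, and the 0.7 utilization target are illustrative assumptions, not values from this curriculum.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def required_instances(rps, mean_latency_s, concurrency_per_instance,
                       target_utilization=0.7):
    """Estimate instance count from Little's law.

    target_utilization is an assumed headroom factor, not a recommendation.
    """
    in_flight = rps * mean_latency_s  # average concurrent requests
    usable_slots = concurrency_per_instance * target_utilization
    return math.ceil(in_flight / usable_slots)


if __name__ == "__main__":
    latencies_ms = list(range(1, 101))  # synthetic data: 1 ms .. 100 ms
    for p in (50, 95, 99):
        print(f"P{p}: {percentile(latencies_ms, p)} ms")
    print("instances:", required_instances(2000, 0.05, 10))
```

Mean latency drives the capacity estimate, while the percentiles describe what the slowest requests experience; the two answer different questions and are usually tracked side by side.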

Suggested pacing: 1-2 weeks

Phase 1: Distributed Systems Building Blocks

Goal: Understand the primitives that make distributed systems work.

Topics:
- Time, Ordering, and Causality
- Consensus & Leases
- Replication Strategies
- Sharding & Partitioning
- Overload & Backpressure
- Idempotency & Retry Semantics
- Queues & Streams

Milestone: You understand when to use consensus vs leases, how to handle overload gracefully, and the tradeoffs in replication strategies.

Suggested pacing: 3-4 weeks

Phase 2: GCP Core Building Blocks

Goal: Deep understanding of GCP primitives: how they work, fail, and scale.

Topics:
- VPC, Load Balancing & DNS
- GKE Control Plane & Data Plane
- IAM Evaluation Model
- Cloud Storage Deep Dive
- Spanner: Consistency & Performance
- Bigtable: Design & Tradeoffs
- BigQuery Architecture
- Pub/Sub: Delivery Guarantees
- Cloud KMS & Secret Management
- AlloyDB: PostgreSQL-Compatible Database

Milestone: You can design systems using GCP primitives, understand their failure modes, and make informed tradeoffs.

Suggested pacing: 4-6 weeks

Phase 3: Reliability Engineering & SRE

Goal: Build and operate reliable systems at scale.

Topics:
- SLIs, SLOs & Error Budgets
- Production Readiness Reviews (PRR)
- Incident Response & Postmortems
- Capacity Planning & Forecasting
- Load Shedding & Circuit Breakers
- Canary Deployments & Rollouts
- Testing for Failure

Milestone: You can define SLIs/SLOs, run PRRs, respond to incidents, and plan capacity.
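To make the SLO part of this milestone concrete, here is a minimal sketch of error-budget arithmetic. It assumes an availability-style SLO expressed as a fraction (for example 0.999) and a rolling window measured in days. The function names and the 30-day default are hypothetical choices for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f} minutes per 30 days")
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of budget left")
```

A 99.9% SLO over 30 days leaves roughly 43 minutes of budget, which is why each additional nine sharply reduces how much planned and unplanned disruption a service can absorb.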

Suggested pacing: 3-4 weeks

Phase 4: Case Studies & End-to-End Design

Goal: Apply everything to real-world scenarios.

Topics:
- Multi-Region API on GCP
- High-Throughput Data Pipeline
- Global Content Delivery System

Milestone: You can design, implement, and operate production systems with clear SLOs and operational playbooks.

Suggested pacing: 2-3 weeks per case study

Reference Map

Browse by topic area. Each entry links to its deep dive and lists its prerequisites.

Distributed Systems

Topic	Deep Dive	Prerequisites
Time & Ordering	Time, Ordering, Causality	Phase 0
Consensus	Consensus & Leases	Time & Ordering
Replication	Replication Strategies	Consensus
Sharding	Sharding & Partitioning	Replication
Overload	Overload & Backpressure	Queueing Theory
Idempotency	Idempotency & Retries	Time & Ordering
Queues	Queues & Streams	Overload

GCP Infrastructure

Topic	Deep Dive	Prerequisites
Networking	VPC, LB & DNS	Phase 0
Kubernetes	GKE Internals	Networking
Identity	IAM Evaluation	Phase 0
Storage	Cloud Storage	Networking
Spanner	Spanner Deep Dive	Consensus, Replication
Bigtable	Bigtable Design	Sharding
BigQuery	BigQuery Architecture	Storage, Sharding
Pub/Sub	Pub/Sub Guarantees	Queues, Idempotency
Secrets	KMS & Secrets	IAM
AlloyDB	AlloyDB Architecture	Spanner, Replication

Reliability & SRE

Topic	Deep Dive	Prerequisites
SLIs/SLOs	SLIs, SLOs & Error Budgets	Phase 0
PRR	PRR Checklist	SLIs/SLOs
Incidents	Incident Response	SLIs/SLOs
Capacity	Capacity Planning	Capacity Math
Load Shedding	Load Shedding	Overload
Rollouts	Canary & Rollouts	Testing for Failure
Testing	Testing for Failure	Phase 1

Low-Level Design Patterns

Topic	Deep Dive	Prerequisites
Rate Limiting	Rate Limiter Implementations	Queueing Theory
Circuit Breakers	Circuit Breaker Pattern	Load Shedding
Concurrency	Concurrency Primitives	Phase 1
Idempotency	Idempotency Patterns	Idempotency & Retries

Cross-Cutting Concerns

These topics appear across multiple chapters:

Failure Domains: How to isolate failures and limit blast radius
Overload Behavior: What happens at 10×, 100× normal load
Observability Contract: What metrics/logs/traces are needed
Change Safety: Rollout strategies, canaries, reversibility
Security Boundaries: Identity, authorization, data exfiltration controls

Learning Dependencies

flowchart TD
    Phase0[Phase 0: Foundations] --> Phase1[Phase 1: Distributed Systems]
    Phase1 --> Phase2[Phase 2: GCP Building Blocks]
    Phase2 --> Phase3[Phase 3: Reliability & SRE]
    Phase3 --> Phase4[Phase 4: Case Studies]
    Phase1 --> LLD[LLD Patterns]
    LLD --> Phase4
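The dependency edges in the flowchart above can also be checked mechanically. The sketch below encodes them as a prerequisite map and derives one valid study order with `graphlib.TopologicalSorter` from the Python standard library (3.9+). The `PREREQS` map, the node labels, and the helper names are illustrative assumptions; only the edges themselves come from the flowchart.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of nodes that must be completed first,
# mirroring the edges in the flowchart.
PREREQS = {
    "Phase 0": set(),
    "Phase 1": {"Phase 0"},
    "Phase 2": {"Phase 1"},
    "Phase 3": {"Phase 2"},
    "LLD Patterns": {"Phase 1"},
    "Phase 4": {"Phase 3", "LLD Patterns"},
}


def study_order(prereqs):
    """Return one valid order in which every node follows its prerequisites."""
    return list(TopologicalSorter(prereqs).static_order())


def ready_to_start(prereqs, completed):
    """List unfinished nodes whose prerequisites are all in `completed`."""
    return sorted(
        node
        for node, deps in prereqs.items()
        if node not in completed and deps <= completed
    )


if __name__ == "__main__":
    print(study_order(PREREQS))
    print(ready_to_start(PREREQS, {"Phase 0", "Phase 1"}))
```

After Phase 1, both Phase 2 and the LLD patterns are unblocked, so the two tracks can be studied in parallel before they converge on Phase 4.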

Next Steps

Start with Phase 0 if you're new to quantitative reasoning
Jump to a topic in the Reference Map if you have specific questions
Track progress in PROGRESS.md
Use templates in 00-meta/ for your own deep dives