Learning Index: Staff-Level GCP System Design & SRE
Navigation
- Learning Path: Follow this for structured learning
- Reference Map: Jump to specific topics
- Progress Tracker: Track your completion
Learning Path (Curriculum)
Follow these phases in order. Each phase builds on the previous one.
Phase 0: Foundations & Quantitative Reasoning
Goal: Build mental models for capacity, latency, and system behavior.
Topics:
- Queueing Theory & Tail Latency
- Capacity Math Cheat Sheet
- Observability Basics
Milestone: You can reason about P50/P95/P99 latency, calculate capacity needs, and identify which metrics matter.
Suggested pacing: 1-2 weeks
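As a rough illustration of the kind of reasoning this phase targets, the sketch below computes nearest-rank percentiles from latency samples and estimates instance count using Little's Law (in-flight requests = arrival rate × mean latency). The function names, the sample data, and the 0.7 target utilization are illustrative assumptions, not values taken from the chapters.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def required_instances(rps, mean_latency_s, concurrency_per_instance,
                       target_utilization=0.7):
    """Estimate instance count from Little's Law: L = lambda * W.

    target_utilization is an assumed headroom factor, not a recommendation.
    """
    in_flight = rps * mean_latency_s  # average concurrent requests
    usable_slots = concurrency_per_instance * target_utilization
    return math.ceil(in_flight / usable_slots)


# Hypothetical latency samples in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Hypothetical workload: 2000 requests/s, 50 ms mean latency, 8 slots per instance.
instances = required_instances(rps=2000, mean_latency_s=0.05,
                               concurrency_per_instance=8)

print(f"P50={p50} ms, P99={p99} ms, instances={instances}")
```

With so few samples, the P99 is simply the slowest request, which is one reason tail percentiles need large sample counts to be meaningful.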
Phase 1: Distributed Systems Building Blocks
Goal: Understand the primitives that make distributed systems work.
Topics:
- Time, Ordering, and Causality
- Consensus & Leases
- Replication Strategies
- Sharding & Partitioning
- Overload & Backpressure
- Idempotency & Retry Semantics
- Queues & Streams
Milestone: You understand when to use consensus vs leases, how to handle overload gracefully, and the tradeoffs in replication strategies.
Suggested pacing: 3-4 weeks
Phase 2: GCP Core Building Blocks
Goal: Deep understanding of GCP primitives: how they work, fail, and scale.
Topics:
- VPC, Load Balancing & DNS
- GKE Control Plane & Data Plane
- IAM Evaluation Model
- Cloud Storage Deep Dive
- Spanner: Consistency & Performance
- Bigtable: Design & Tradeoffs
- BigQuery Architecture
- Pub/Sub: Delivery Guarantees
- Cloud KMS & Secret Management
- AlloyDB: PostgreSQL-Compatible Database
Milestone: You can design systems using GCP primitives, understand their failure modes, and make informed tradeoffs.
Suggested pacing: 4-6 weeks
Phase 3: Reliability Engineering & SRE
Goal: Build and operate reliable systems at scale.
Topics:
- SLIs, SLOs & Error Budgets
- Production Readiness Reviews (PRR)
- Incident Response & Postmortems
- Capacity Planning & Forecasting
- Load Shedding & Circuit Breakers
- Canary Deployments & Rollouts
- Testing for Failure
Milestone: You can define SLIs/SLOs, run PRRs, respond to incidents, and plan capacity.
Suggested pacing: 3-4 weeks
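To make the SLO milestone concrete, here is a minimal sketch of error-budget arithmetic: how much unavailability an availability SLO allows over a window, and how fast an observed error ratio consumes that budget. The function names and the 30-day window are assumptions chosen for illustration; the chapters may define these differently.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed 'bad' minutes in the window for an availability SLO such as 0.999."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes


def burn_rate(observed_error_ratio, slo):
    """How many times faster than 'exactly on budget' the budget is being spent.

    A burn rate of 1 exhausts the budget exactly at the end of the window.
    """
    return observed_error_ratio / (1 - slo)


budget = error_budget_minutes(0.999)          # about 43.2 minutes per 30 days
rate = burn_rate(observed_error_ratio=0.01, slo=0.999)  # about 10x

print(f"budget={budget:.1f} min, burn rate={rate:.1f}x")
```

At a sustained burn rate of about 10, a 30-day budget would be gone in roughly 3 days, which is the kind of signal burn-rate alerting is built on.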
Phase 4: Case Studies & End-to-End Design
Goal: Apply everything to real-world scenarios.
Topics:
- Multi-Region API on GCP
- High-Throughput Data Pipeline
- Global Content Delivery System
Milestone: You can design, implement, and operate production systems with clear SLOs and operational playbooks.
Suggested pacing: 2-3 weeks per case study
Reference Map
Browse by topic area. Each entry links to deep dives and notes prerequisites.
Distributed Systems
| Topic | Deep Dive | Prerequisites |
|---|---|---|
| Time & Ordering | Time, Ordering, Causality | Phase 0 |
| Consensus | Consensus & Leases | Time & Ordering |
| Replication | Replication Strategies | Consensus |
| Sharding | Sharding & Partitioning | Replication |
| Overload | Overload & Backpressure | Queueing Theory |
| Idempotency | Idempotency & Retries | Time & Ordering |
| Queues | Queues & Streams | Overload |
GCP Infrastructure
| Topic | Deep Dive | Prerequisites |
|---|---|---|
| Networking | VPC, LB & DNS | Phase 0 |
| Kubernetes | GKE Internals | Networking |
| Identity | IAM Evaluation | Phase 0 |
| Storage | Cloud Storage | Networking |
| Spanner | Spanner Deep Dive | Consensus, Replication |
| Bigtable | Bigtable Design | Sharding |
| BigQuery | BigQuery Architecture | Storage, Sharding |
| Pub/Sub | Pub/Sub Guarantees | Queues, Idempotency |
| Secrets | KMS & Secrets | IAM |
| AlloyDB | AlloyDB Architecture | Spanner, Replication |
Reliability & SRE
| Topic | Deep Dive | Prerequisites |
|---|---|---|
| SLIs/SLOs | SLIs, SLOs & Error Budgets | Phase 0 |
| PRR | PRR Checklist | SLIs/SLOs |
| Incidents | Incident Response | SLIs/SLOs |
| Capacity | Capacity Planning | Capacity Math |
| Load Shedding | Load Shedding | Overload |
| Rollouts | Canary & Rollouts | Testing for Failure |
| Testing | Testing for Failure | Phase 1 |
Low-Level Design Patterns
| Topic | Deep Dive | Prerequisites |
|---|---|---|
| Rate Limiting | Rate Limiter Implementations | Queueing Theory |
| Circuit Breakers | Circuit Breaker Pattern | Load Shedding |
| Concurrency | Concurrency Primitives | Phase 1 |
| Idempotency | Idempotency Patterns | Idempotency & Retries |
Cross-Cutting Concerns
These topics appear across multiple chapters:
- Failure Domains: How to isolate failures and limit blast radius
- Overload Behavior: What happens at 10× and 100× normal load
- Observability Contract: What metrics/logs/traces are needed
- Change Safety: Rollout strategies, canaries, reversibility
- Security Boundaries: Identity, authorization, data exfiltration controls
Learning Dependencies
Next Steps
- Start with Phase 0 if you're new to quantitative reasoning
- Jump to a topic in the Reference Map if you have specific questions
- Track progress in PROGRESS.md
- Use templates in 00-meta/ for your own deep dives