Further Reading: Capacity Math

Systems Performance (Brendan Gregg)

Book: Systems Performance: Enterprise and the Cloud

Why it matters: Comprehensive guide to capacity planning and performance analysis, with practical formulas and methodologies.

Key Concepts

Capacity Planning: - How to measure current capacity - How to forecast future needs - How to identify bottlenecks

Resource Analysis: - CPU capacity calculations - Memory capacity calculations - Disk I/O capacity calculations - Network capacity calculations

Relevance: Provides the mathematical foundation and practical tools for capacity planning.

Recommended Chapters

Chapter 2: Methodologies: Capacity planning methodologies
Chapter 6: CPUs: CPU capacity and utilization
Chapter 7: Memory: Memory capacity planning
Chapter 8: File Systems: Disk I/O capacity

The Datacenter as a Computer (Barroso & Hölzle, 2018)

Book: The Datacenter as a Computer: Designing Warehouse-Scale Machines

Why it matters: Explains capacity planning at Google scale, including how to think about resource utilization and efficiency.

Key Excerpts

On capacity planning:

"Capacity planning requires understanding both current utilization and future growth. We need to plan for peak loads, not just average loads, and account for growth over time."

Key insight: Capacity planning must account for: 1. Current peak load (not average) 2. Growth projections 3. Safety margins 4. Failure scenarios (fewer resources available)

On resource efficiency:

"Efficiency comes from right-sizing resources, not over-provisioning. We need to understand actual usage patterns, not theoretical maximums."

Relevance: Emphasizes the importance of measuring actual usage and right-sizing, rather than over-provisioning.

Capacity Planning Best Practices

1. Measure Baseline

Critical step: Before planning capacity, measure current usage.

What to measure: - Peak usage: Maximum resource usage, not average - Usage patterns: How usage varies over time (daily, weekly, seasonal) - Growth trends: How usage is changing over time

Tools: - Monitoring systems (Prometheus, Cloud Monitoring) - Resource utilization dashboards - Historical data analysis

2. Forecast Growth

Methods: - Linear growth: Simple percentage increase - Exponential growth: Compound growth (more realistic) - Seasonal patterns: Account for seasonal variations - Event-driven: Account for known events (product launches, marketing campaigns)

Example: - Current: 1,000 QPS - Growth: 15% per month - 12 months: 1,000 × (1.15)^12 = 5,350 QPS

3. Add Safety Margins

Why: - Traffic spikes - Growth uncertainty - Failure scenarios - Maintenance windows

Typical margins: - CPU: 20-30% headroom (70-80% utilization) - Memory: 20% headroom - Disk: 20-30% headroom - Network: 30-50% headroom (for bursts)

4. Plan for Failures

Failure scenarios: - Single instance failure: Need redundancy - Region failure: Need multi-region capacity - Database failure: Need read replicas

Capacity during failures: - Plan for N-1 capacity (one instance down) - Or N-2 capacity (two instances down) - Depends on availability requirements

Resource-Specific Capacity Planning

CPU Capacity

Key formula: CPU Cores = (QPS × CPU Time Per Request) / Target Utilization

Considerations: - Single-threaded vs multi-threaded: Multi-threaded can use more cores - CPU-bound vs I/O-bound: I/O-bound may need more cores for concurrency - Context switching: Too many threads can hurt performance - NUMA: Non-uniform memory access affects performance

Best practices: - Measure actual CPU time per request - Account for context switching overhead - Use appropriate target utilization (70-80%)

Memory Capacity

Key formula: Memory = (Concurrent Requests × Memory Per Request) + Base Memory + Cache

Considerations: - Peak vs average: Use peak memory, not average - Garbage collection: GC overhead in managed languages - Memory leaks: Monitor for gradual memory growth - Swap: Avoid swap for performance-critical systems

Best practices: - Measure peak memory per request - Account for GC overhead - Monitor for memory leaks - Size cache appropriately

Disk Capacity

Key formulas: - Storage: Disk = Data Volume + Logs + Temporary Files + Safety Margin - I/O: IOPS Needed = QPS × I/O Operations Per Request

Considerations: - SSD vs HDD: SSD has much higher IOPS - Random vs sequential: Random I/O is slower - Read vs write: Writes are often slower

Best practices: - Use SSD for performance-critical workloads - Optimize for sequential I/O when possible - Separate read and write workloads

Network Capacity

Key formula: Bandwidth = QPS × (Request Size + Response Size)

Considerations: - Bidirectional: Both ingress and egress matter - Peak vs average: Plan for peak traffic - Compression: Can reduce bandwidth needs - CDN: Reduces egress bandwidth

Best practices: - Plan for peak traffic, not average - Use compression when possible - Use CDN for static content - Monitor both ingress and egress

Additional Resources

Papers

"The Tail at Scale" (Dean & Barroso, 2013) - How tail latency affects capacity planning - Link

"The Datacenter as a Computer" (Barroso & Hölzle, 2018) - Capacity planning at scale - Link

Books

"Systems Performance" by Brendan Gregg - Comprehensive capacity planning guide - Practical formulas and tools

"Designing Data-Intensive Applications" by Martin Kleppmann - Chapter on scalability - Capacity planning for distributed systems

Online Resources

Google SRE Book: Site Reliability Engineering - Chapter on capacity planning - Real-world examples

AWS Well-Architected Framework: Capacity Planning - Capacity planning best practices - Applicable to GCP as well

Key Takeaways

Measure baseline: Understand current usage before planning
Forecast growth: Account for future growth (exponential, not linear)
Add margins: Safety margins for spikes and failures
Plan for failures: Capacity during failure scenarios
Right-size: Don't over-provision, but don't under-provision either

Queueing Theory & Tail Latency - How queueing affects capacity
Observability Basics - How to measure capacity
Capacity Planning - SRE perspective on capacity