Production Readiness Review (PRR) Checklist

One-line summary: What staff engineers and SREs actually look for in production readiness reviews.

Prerequisites: SLIs/SLOs, understanding of production systems.


Mental Model

PRR Purpose

```mermaid
flowchart TD
    Design[System Design] --> PRR[PRR Review]
    PRR --> Approved{Approved?}
    Approved -->|Yes| Production[Production]
    Approved -->|No| Fix[Fix Issues]
    Fix --> PRR
    style PRR fill:#ffcc99
    style Approved fill:#99ff99
```

Key insight: PRR ensures systems are ready for production. It's not about perfection—it's about identifying and mitigating risks.

PRR Scope

What PRR covers:
- Reliability (SLIs/SLOs, error budgets)
- Observability (monitoring, logging, tracing)
- Security (authentication, authorization, encryption)
- Operations (runbooks, on-call, incidents)
- Capacity (scaling, load testing)

What PRR doesn't cover:
- Feature completeness (that's product review)
- Code quality (that's code review)
- Performance optimization (that's performance review)


Staff-Level PRR Checklist

1. SLIs, SLOs & Error Budgets

SLIs Defined

Common failures:
- Using internal metrics (CPU, memory) as SLIs
- SLIs that don't reflect user experience
- SLIs that can't be measured reliably
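A user-facing SLI is usually a ratio of good events to total events, not a machine metric. A minimal sketch (field names and the 300 ms / 5xx thresholds are illustrative):

```python
def availability_sli(requests):
    """Fraction of requests that were 'good': served without a server error
    and within the latency threshold. Thresholds here are illustrative."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests
               if r["status"] < 500 and r["latency_ms"] <= 300)
    return good / len(requests)

requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 450},  # too slow: not a good event
    {"status": 503, "latency_ms": 90},   # server error: not a good event
    {"status": 200, "latency_ms": 80},
]
print(availability_sli(requests))  # 0.5
```

Note that a slow success counts against the SLI just like an error: both are bad from the user's point of view.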

SLOs Set

Common failures:
- SLOs set without baseline measurement
- SLOs too aggressive (constant violations)
- SLOs too lax (poor user experience)

Error Budgets

Common failures:
- No error budget policy
- Error budget not tracked
- No alerts for budget exhaustion
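The arithmetic behind an error budget is simple enough to sketch; the policy around it is the hard part. Assuming a 30-day rolling window:

```python
MINUTES_PER_DAY = 24 * 60

def error_budget_minutes(slo, window_days=30):
    """Total allowed 'bad' minutes in the window for a given SLO target."""
    return (1 - slo) * window_days * MINUTES_PER_DAY

def budget_remaining(slo, observed_availability, window_days=30):
    """Minutes of budget left after subtracting what has been spent."""
    spent = (1 - observed_availability) * window_days * MINUTES_PER_DAY
    return error_budget_minutes(slo, window_days) - spent

print(round(error_budget_minutes(0.999), 1))      # 43.2 minutes / 30 days
print(round(budget_remaining(0.999, 0.9995), 1))  # 21.6 minutes left
```

A 99.9% SLO buys roughly 43 minutes of downtime a month; tracking how much of that is already spent is what makes the budget actionable.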

2. Observability

Metrics

Common failures:
- Missing critical metrics (latency, errors)
- Metrics not exported correctly
- No business metrics
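Latency in particular should be tracked as percentiles, not averages; a slow tail can hide behind a healthy mean. A nearest-rank percentile sketch (sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty sample list; p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [12, 15, 20, 22, 25, 30, 45, 60, 120, 300]
print(percentile(latencies_ms, 50))  # 25  (median looks healthy)
print(percentile(latencies_ms, 99))  # 300 (the tail tells another story)
```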

Logging

Common failures:
- Too much logging (noise)
- Too little logging (can't debug)
- Unstructured logs (hard to query)
- No request IDs (can't correlate)
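Structured logs with a request ID address the last two failures directly: every line becomes queryable, and all lines for one request can be joined on the ID. A minimal stdlib-only sketch (the field set is illustrative):

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so logs are machine-queryable."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })

logger = logging.getLogger("svc")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Attach the same request ID to every log line emitted for one request.
logger.info("payment accepted", extra={"request_id": str(uuid.uuid4())})
```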

Tracing

Common failures:
- No tracing
- Tracing too expensive (100% sampling)
- Can't correlate across services
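Head-based sampling keeps tracing affordable, but the decision must be consistent across services or traces arrive with missing spans. One way to get both, sketched with a hash of the trace ID (real tracing libraries implement equivalents of this):

```python
import hashlib

def should_sample(trace_id, rate=0.1):
    """Deterministic head-based sampling: hashing the trace ID gives every
    service the same decision, so sampled traces stay complete across hops."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

# Roughly `rate` of traces are kept, and the decision is stable per trace.
sampled = sum(should_sample(f"trace-{i}", rate=0.1) for i in range(10_000))
print(sampled)  # close to 1,000
```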

Dashboards

Common failures:
- No dashboards
- Dashboards too complex (hard to understand)
- Dashboards don't load (too many metrics)

Alerting

Common failures:
- Too many alerts (alert fatigue)
- Alerts not actionable (don't know what to do)
- No runbooks (don't know how to respond)
- Alerts too sensitive (false positives)
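Alerting on error-budget burn rate, rather than raw error counts, is one way to fight both fatigue and false positives: it pages only when the budget is genuinely at risk. A sketch (the 14.4 threshold is the fast-burn example from "The Site Reliability Workbook" for a 99.9% SLO):

```python
def burn_rate(error_ratio, slo):
    """How fast the error budget is being consumed; 1.0 = exactly on budget."""
    return error_ratio / (1 - slo)

slo = 0.999
# A sustained 1.44% error rate burns a 30-day budget in about 2 days: page.
print(burn_rate(0.0144, slo))         # ~14.4
print(burn_rate(0.0005, slo) > 14.4)  # False: within budget, no page
```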

3. Incident Response

On-Call

Common failures:
- No on-call rotation
- On-call engineers not trained
- No escalation paths

Runbooks

Common failures:
- No runbooks
- Runbooks outdated (don't match reality)
- Runbooks not tested (don't work)

Incident Management

Common failures:
- No incident process
- No communication channels
- No postmortems

4. Capacity & Scaling

Capacity Planning

Common failures:
- Don't know current capacity
- No growth forecast
- No capacity alerts
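Even a crude growth forecast answers the question reviewers actually ask: "how long until you run out?" A sketch assuming steady compounding growth (an illustrative model, not a real forecast):

```python
import math

def months_until_exhausted(current_qps, capacity_qps, monthly_growth):
    """Months until projected load exceeds known capacity, assuming
    compounding month-over-month growth."""
    if current_qps >= capacity_qps:
        return 0.0
    return math.log(capacity_qps / current_qps) / math.log(1 + monthly_growth)

# 4,000 QPS today, a load-tested ceiling of 10,000 QPS, 10% growth per month:
print(round(months_until_exhausted(4000, 10000, 0.10), 1))  # 9.6
```

Note the prerequisite: the formula is useless without a measured `capacity_qps`, which is what load testing (below) provides.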

Auto-Scaling

Common failures:
- Auto-scaling not configured
- Scaling policies not tuned (too aggressive/conservative)
- Scaling limits not set (unbounded scaling)
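The core of a target-tracking scaling policy fits in a few lines; the clamping at the end is what prevents the "unbounded scaling" failure above. A sketch (replica counts and utilization targets are illustrative):

```python
import math

def desired_replicas(current, current_util, target_util,
                     min_replicas=2, max_replicas=20):
    """Target-tracking sketch: size the fleet so per-replica utilization
    approaches the target, clamped to hard limits so scaling is bounded."""
    raw = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(5, 0.90, 0.60))   # 8  (scale out)
print(desired_replicas(5, 0.10, 0.60))   # 2  (floor: never below min)
print(desired_replicas(50, 0.90, 0.60))  # 20 (ceiling: never unbounded)
```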

Load Testing

Common failures:
- No load testing
- Load testing doesn't match production
- System doesn't degrade gracefully

5. Security

Authentication

Common failures:
- No authentication
- Weak authentication
- Authentication not tested

Authorization

Common failures:
- No authorization checks
- Overly permissive permissions
- No audit logs

Data Protection

Common failures:
- No encryption
- Keys stored in code
- No data retention policies
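The "keys stored in code" failure has a cheap fix: load secrets from the environment (in production, from a secret manager) and fail fast when they are missing. A sketch (the variable name is illustrative):

```python
import os

def get_api_key():
    """Read the secret from the environment instead of hard-coding it,
    and fail fast when it is missing."""
    key = os.environ.get("PAYMENTS_API_KEY")
    if key is None:
        raise RuntimeError("PAYMENTS_API_KEY is not set")
    return key

os.environ["PAYMENTS_API_KEY"] = "example-not-a-real-key"
print(len(get_api_key()) > 0)  # True
```

Failing fast at startup beats discovering a missing key on the first real request.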

6. Change Management

Deployment

Common failures:
- Manual deployments
- No rollback procedure
- No deployment windows

Feature Flags

Common failures:
- No feature flags
- No kill switches
- Flags not tested
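A flag system needs two properties reviewers check for: unknown or killed flags fail safe (off), and rollout bucketing is deterministic per user. A sketch (a real system would back this with a config service; the dict and flag names are illustrative):

```python
FLAGS = {
    "new_checkout": {"enabled": True, "rollout_pct": 25},
    "kill_switch_payments": {"enabled": False},  # kill switch: hard off
}

def flag_enabled(name, user_id, flags=FLAGS):
    flag = flags.get(name)
    if not flag or not flag["enabled"]:
        return False  # unknown or killed flags fail safe (off)
    # Deterministic bucketing: the same user always lands in the same bucket.
    return user_id % 100 < flag.get("rollout_pct", 100)

print(flag_enabled("new_checkout", 7))          # True  (bucket 7 < 25)
print(flag_enabled("new_checkout", 80))         # False (bucket 80 >= 25)
print(flag_enabled("kill_switch_payments", 7))  # False (killed)
```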

Testing

Common failures:
- No tests
- Tests don't catch production issues
- No load testing


What Staff Engineers Actually Look For

Red Flags (Must Fix)

  1. No SLIs/SLOs: Can't measure reliability
  2. No monitoring: Can't detect problems
  3. No on-call: Can't respond to incidents
  4. No rollback: Can't recover from bad deployments
  5. No security: Vulnerable to attacks

Yellow Flags (Should Fix)

  1. Weak SLIs: SLIs don't reflect user experience
  2. Alert fatigue: Too many alerts
  3. No runbooks: Don't know how to respond
  4. No load testing: Don't know capacity
  5. Weak security: Security gaps

Green Flags (Good Practices)

  1. Clear SLIs/SLOs: Well-defined and measured
  2. Comprehensive monitoring: Metrics, logs, traces
  3. Tested runbooks: Know how to respond
  4. Load tested: Know capacity limits
  5. Secure: Authentication, authorization, encryption

PRR Decision Matrix

```mermaid
graph TD
    PRR[PRR Review] --> RedFlags{Red Flags?}
    RedFlags -->|Yes| Reject[Reject: Must Fix]
    RedFlags -->|No| YellowFlags{Yellow Flags?}
    YellowFlags -->|Many| Conditional[Conditional: Fix Soon]
    YellowFlags -->|Few| Approve[Approve: Ready]
    style Reject fill:#ff9999
    style Conditional fill:#ffcc99
    style Approve fill:#99ff99
```

Further Reading

Comprehensive Guide: Further Reading: PRR Checklist

Quick Links:
- "Site Reliability Engineering" (SRE Book) - Production readiness
- "The Site Reliability Workbook" - Practical PRR implementation
- PRR Template
- SLIs/SLOs
- Back to Reliability & SRE


Exercises

  1. Run a PRR: Review a system design using this checklist. What issues do you find?

  2. Fix PRR issues: A system fails PRR. What are the top 3 issues to fix?

  3. Design for PRR: Design a system that passes PRR. What do you include?

Answer Key: View Answers