Production Readiness Review (PRR) Template
A PRR ensures a system is ready for production. Use this checklist before launching.
System Information
System Name: [Name]
Team: [Team name]
PRR Date: [Date]
PRR Lead: [Name]
Reviewers: [Names]
Target Launch Date: [Date]
Overview
Purpose: [What does this system do?]
Users: [Who uses this system?]
Scale: [Expected load, users, data volume]
Criticality: [P0/P1/P2 - how critical is this system?]
Reliability & SLOs
SLIs (Service Level Indicators)
- [ ] Latency SLI: [Definition, e.g., "P95 latency of API requests"]
- Measurement: [How is it measured?]
- Current value: [Current P95 latency]
-
Target: [Target P95 latency]
-
[ ] Availability SLI: [Definition, e.g., "Fraction of successful requests"]
- Measurement: [How is it measured?]
- Current value: [Current availability]
-
Target: [Target availability]
-
[ ] Error Rate SLI: [Definition]
- Measurement: [How is it measured?]
- Current value: [Current error rate]
-
Target: [Target error rate]
-
[ ] Throughput SLI: [Definition, if applicable]
- Measurement: [How is it measured?]
- Current value: [Current throughput]
- Target: [Target throughput]
SLOs (Service Level Objectives)
- [ ] SLOs are defined for all SLIs
- [ ] SLOs are achievable (based on historical data or testing)
- [ ] SLOs are documented
- [ ] SLOs are communicated to stakeholders
Error Budgets
- [ ] Error budgets are calculated
- [ ] Error budget policy is defined (what happens when budget is exhausted?)
- [ ] Error budget tracking is automated
- [ ] Error budget alerts are configured
Availability Targets
- [ ] Availability target is appropriate for criticality
- [ ] Availability is measured correctly
- [ ] Planned downtime is accounted for
- [ ] Unplanned downtime is tracked
Monitoring & Observability
Metrics
- [ ] Latency metrics: P50, P95, P99 are tracked
- [ ] Throughput metrics: QPS, requests/sec are tracked
- [ ] Error metrics: Error rates, error types are tracked
- [ ] Resource metrics: CPU, memory, disk, network are tracked
- [ ] Business metrics: User-facing metrics are tracked
- [ ] Metrics are exported correctly
- [ ] Metrics retention is configured
Logging
- [ ] Critical events are logged
- [ ] Log levels are appropriate (INFO, WARN, ERROR)
- [ ] Logs are structured (JSON or structured format)
- [ ] Logs include request IDs for correlation
- [ ] Log retention is configured appropriately
- [ ] Logs are searchable and queryable
Tracing
- [ ] Critical paths are traced
- [ ] Spans are well-defined
- [ ] Trace sampling is configured
- [ ] Cross-service correlation works
- [ ] Traces are queryable
Dashboards
- [ ] Service dashboard: Shows key metrics, health status
- [ ] SLO dashboard: Shows SLO compliance, error budgets
- [ ] Capacity dashboard: Shows resource usage, scaling
- [ ] Business dashboard: Shows user-facing metrics
- [ ] Dashboards are accessible to on-call
- [ ] Dashboards load quickly
Alerting
- [ ] Critical alerts: P0 incidents (page on-call)
- [ ] Warning alerts: P1 incidents (notify, don't page)
- [ ] Info alerts: P2 incidents (log only)
- [ ] Alerts are actionable (clear what to do)
- [ ] Alert fatigue is avoided (not too many alerts)
- [ ] Alert thresholds are tuned (not too sensitive/insensitive)
- [ ] Alert runbooks exist
- [ ] Alert escalation paths are defined
Incident Response
On-Call
- [ ] On-call rotation is established
- [ ] On-call engineers are trained
- [ ] On-call procedures are documented
- [ ] On-call tools are available
- [ ] On-call escalation paths are clear
Runbooks
- [ ] Common incidents: Runbooks exist for common issues
- [ ] Critical failures: Runbooks exist for critical failures
- [ ] Recovery procedures: Runbooks include recovery steps
- [ ] Rollback procedures: Runbooks include rollback steps
- [ ] Runbooks are tested
- [ ] Runbooks are accessible during incidents
Incident Management
- [ ] Incident response process is defined
- [ ] Incident communication channels are established
- [ ] Postmortem process is defined
- [ ] Incident tracking system is used
Postmortems
- [ ] Postmortem template is used
- [ ] Postmortems are written for all P0/P1 incidents
- [ ] Postmortems include root cause analysis
- [ ] Postmortems include action items
- [ ] Action items are tracked to completion
Capacity & Scaling
Capacity Planning
- [ ] Current capacity: Current load and capacity are known
- [ ] Target capacity: Capacity for launch is provisioned
- [ ] Growth forecast: Capacity for growth is planned
- [ ] Scaling limits: Maximum scale is understood
- [ ] Capacity alerts: Alerts for capacity thresholds
Auto-Scaling
- [ ] Auto-scaling is configured (if applicable)
- [ ] Scaling policies are tuned
- [ ] Scaling limits are set (min/max instances)
- [ ] Scaling metrics are appropriate
- [ ] Scaling behavior is tested
Load Testing
- [ ] Load tests have been run
- [ ] System handles expected load
- [ ] System handles 2× expected load
- [ ] System degrades gracefully under overload
- [ ] Bottlenecks are identified and addressed
Resource Limits
- [ ] CPU limits are set appropriately
- [ ] Memory limits are set appropriately
- [ ] Disk limits are set appropriately
- [ ] Network limits are understood
- [ ] Limits prevent resource exhaustion
Security
Authentication
- [ ] Authentication is required for all access
- [ ] Authentication mechanism is appropriate
- [ ] Authentication is tested
- [ ] Authentication failures are logged
Authorization
- [ ] Authorization is checked for all operations
- [ ] Principle of least privilege is followed
- [ ] Permissions are documented
- [ ] Access is audited
- [ ] Authorization failures are logged
Data Protection
- [ ] Encryption at rest: Data is encrypted at rest
- [ ] Encryption in transit: Data is encrypted in transit (TLS)
- [ ] Key management: Keys are managed securely
- [ ] Data classification: Data is classified appropriately
- [ ] Data retention: Data retention policies are defined
- [ ] Data deletion: Data deletion procedures exist
Security Monitoring
- [ ] Security events are logged
- [ ] Security alerts are configured
- [ ] Security incidents are tracked
- [ ] Security reviews are conducted
Compliance
- [ ] Compliance requirements are met (if applicable)
- [ ] Compliance documentation exists
- [ ] Compliance audits are planned
Change Management
Deployment
- [ ] Deployment process: Deployment process is documented
- [ ] Deployment automation: Deployments are automated
- [ ] Deployment rollback: Rollback procedure is tested
- [ ] Deployment windows: Deployment windows are defined
- [ ] Deployment approvals: Approval process is defined
Feature Flags
- [ ] Feature flags are used for risky changes
- [ ] Kill switches are implemented
- [ ] Flag management is documented
- [ ] Flags are tested
Canary Deployments
- [ ] Canary deployment is used (if applicable)
- [ ] Canary metrics are monitored
- [ ] Canary rollback is tested
- [ ] Canary promotion criteria are defined
Testing
- [ ] Unit tests: Unit test coverage is adequate
- [ ] Integration tests: Integration tests exist
- [ ] Load tests: Load tests are run regularly
- [ ] Chaos tests: Chaos tests are considered
- [ ] Smoke tests: Smoke tests run after deployment
Documentation
Architecture
- [ ] Architecture diagram exists
- [ ] Architecture is documented
- [ ] Data flow is documented
- [ ] Component responsibilities are documented
Runbooks
- [ ] Runbooks exist for common operations
- [ ] Runbooks are tested
- [ ] Runbooks are accessible
API Documentation
- [ ] API is documented (if applicable)
- [ ] API examples exist
- [ ] API versioning is documented
- [ ] API deprecation policy exists
Onboarding
- [ ] Onboarding documentation exists
- [ ] New engineers can get started quickly
- [ ] Development environment setup is documented
Dependencies
External Dependencies
- [ ] External dependencies are identified
- [ ] Dependency SLAs are understood
- [ ] Dependency failures are handled gracefully
- [ ] Dependency monitoring is in place
- [ ] Dependency runbooks exist
Internal Dependencies
- [ ] Internal dependencies are identified
- [ ] Dependency contracts are documented
- [ ] Dependency failures are handled gracefully
- [ ] Dependency monitoring is in place
Dependency Risks
- [ ] Single points of failure are identified
- [ ] Dependency risks are mitigated
- [ ] Fallback mechanisms exist
Cost
Cost Model
- [ ] Cost per request/user/unit is understood
- [ ] Cost drivers are identified
- [ ] Cost monitoring is in place
- [ ] Cost alerts are configured
Cost Optimization
- [ ] Cost optimization opportunities are identified
- [ ] Cost optimization is planned
- [ ] Cost budgets are set
Launch Readiness
Pre-Launch Checklist
- [ ] All PRR items are complete
- [ ] On-call rotation is ready
- [ ] Runbooks are ready
- [ ] Monitoring is configured
- [ ] Alerts are configured
- [ ] Dashboards are ready
- [ ] Documentation is complete
- [ ] Team is trained
Launch Plan
- [ ] Launch plan is documented
- [ ] Launch steps are defined
- [ ] Launch rollback plan exists
- [ ] Launch communication plan exists
- [ ] Launch date is set
Post-Launch
- [ ] Post-launch monitoring plan exists
- [ ] Post-launch review is scheduled
- [ ] Post-launch improvements are planned
PRR Decision
- [ ] Approved - System is ready for production
- [ ] Approved with Conditions - System is approved pending fixes
- [ ] Not Approved - System needs significant work
PRR Lead Signature: [Name]
Date: [Date]
Next Steps: [What happens next?]
Notes
[Additional notes, concerns, or recommendations]