Incident Response & Postmortems

One-line summary: How to respond to incidents effectively, write postmortems, and learn from failures.

Prerequisites: SLIs, SLOs & Error Budgets, understanding of monitoring and alerting.


Mental Model

Incident Response Lifecycle

graph LR
  Detect[Detect] --> Assess[Assess]
  Assess --> Mitigate[Mitigate]
  Mitigate --> Resolve[Resolve]
  Resolve --> Learn[Learn]
  Learn --> Improve[Improve]
  style Detect fill:#ff9999
  style Mitigate fill:#ffcc99
  style Resolve fill:#99ff99

Key insight: Effective incident response requires preparation, clear processes, and learning from incidents.

Incident Severity

  - P0 (Critical): Complete outage, data loss, security breach.
  - P1 (High): Partial outage, degraded performance.
  - P2 (Medium): Minor issues, non-critical errors.
  - P3 (Low): Cosmetic issues, minor bugs.
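The severity levels above can be encoded so that triage is consistent from one responder to the next. The following is a minimal sketch, not a prescribed implementation: the `Impact` fields and the `classify` function are hypothetical names introduced for illustration, and the mapping simply mirrors the four definitions above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    P0 = 0  # complete outage, data loss, security breach
    P1 = 1  # partial outage, degraded performance
    P2 = 2  # minor issues, non-critical errors
    P3 = 3  # cosmetic issues, minor bugs


@dataclass
class Impact:
    # All fields are hypothetical triage inputs, chosen for illustration.
    full_outage: bool = False
    data_loss: bool = False
    security_breach: bool = False
    partial_outage: bool = False
    degraded_performance: bool = False
    non_critical_errors: bool = False


def classify(impact: Impact) -> Severity:
    """Map observed impact to the most severe level that applies."""
    if impact.full_outage or impact.data_loss or impact.security_breach:
        return Severity.P0
    if impact.partial_outage or impact.degraded_performance:
        return Severity.P1
    if impact.non_critical_errors:
        return Severity.P2
    return Severity.P3
```

Checking the most severe conditions first means an incident that matches several levels is always assigned the highest one.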


Internals & Architecture

Incident Response Process

1. Detection

Detection methods:
  - Monitoring alerts: Automated alerts fired by the monitoring system
  - User reports: Issues reported by users
  - Health checks: Failing health checks
  - Error rates: Spikes in error rates

Detection time: Minimize mean time to detect (MTTD), the average delay between an incident starting and it being detected.
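As a rough illustration, MTTD can be computed from past incident records as the average gap between when an incident started and when it was detected. The sketch below assumes each record is a `(started_at, detected_at)` pair; that record shape and the function name `mean_time_to_detect` are assumptions made for this example.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (detected_at - started_at) over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (detected - started for started, detected in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


start = datetime(2024, 1, 1, 12, 0)
history = [
    (start, start + timedelta(minutes=4)),
    (start, start + timedelta(minutes=10)),
]
print(mean_time_to_detect(history))  # 0:07:00
```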

2. Assessment

Assessment steps:
  1. Acknowledge: Acknowledge the incident and create a ticket
  2. Triage: Assess severity and impact
  3. Escalate: Escalate if needed
  4. Communicate: Communicate to stakeholders

Assessment questions:
  - What is affected?
  - How many users are affected?
  - What's the severity?
  - What's the root cause?

3. Mitigation

Mitigation strategies:
  - Rollback: Roll back recent changes
  - Scale: Scale up resources
  - Circuit breakers: Enable circuit breakers
  - Load shedding: Shed non-critical load

Mitigation goal: Restore service quickly. Fixing the root cause comes later, during resolution.
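To make one of the strategies above concrete, here is a minimal circuit breaker sketch. It is one simplified way to build the pattern, not a reference implementation: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the cool-down can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: allow one trial call (half-open).
            # A single failure during the trial re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

While the circuit is open, calls fail immediately instead of piling onto a struggling dependency, which gives it room to recover.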

4. Resolution

Resolution steps:
  1. Fix root cause: Fix the underlying issue
  2. Verify fix: Verify that service is restored
  3. Monitor: Monitor for recurrence
  4. Communicate: Communicate the resolution
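The "verify fix" and "monitor" steps can be made less subjective by requiring the error rate to stay under a threshold for several consecutive measurement windows before the incident is closed. The sketch below assumes error rates are sampled once per window; the function name `fix_verified` and its parameters are hypothetical.

```python
def fix_verified(error_rates: list[float], threshold: float, windows: int) -> bool:
    """True if the last `windows` samples are all at or below `threshold`."""
    if windows <= 0:
        raise ValueError("windows must be positive")
    if len(error_rates) < windows:
        return False  # not enough data yet to declare the fix verified
    return all(rate <= threshold for rate in error_rates[-windows:])
```

For example, `fix_verified([0.2, 0.01, 0.005, 0.0], threshold=0.01, windows=3)` returns `True`, while a single sample above the threshold within the last three windows keeps the incident open.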

5. Postmortem

Postmortem process:
  1. Timeline: Document the incident timeline
  2. Root cause: Identify the root cause
  3. Impact: Assess the impact
  4. Actions: Define action items
  5. Share: Share learnings
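The postmortem steps above map naturally onto a small record type. The following sketch is an assumption about how such a document could be structured, not a standard template: the `Postmortem` class, its field names, and the plain-text layout produced by `render` are all illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Postmortem:
    title: str
    root_cause: str
    impact: str
    timeline: list[tuple[datetime, str]] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the postmortem as plain text, with the timeline in chronological order."""
        lines = [self.title, "", "Timeline"]
        for when, event in sorted(self.timeline):
            lines.append(f"  {when:%H:%M} {event}")
        lines += ["", f"Root cause: {self.root_cause}", f"Impact: {self.impact}", "", "Action items"]
        lines += [f"  - {item}" for item in self.action_items]
        return "\n".join(lines)
```

Sorting the timeline at render time lets responders append events in whatever order they recall them during the review.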


Failure Modes & Blast Radius

Incident Response Failures

Scenario 1: Slow Detection

Scenario 2: Poor Communication


Observability Contract

Metrics

Alerts


Change Safety

Incident Response Process Changes


Security Boundaries


Tradeoffs

Speed vs Accuracy

Fast response:
  - Pros: Faster resolution, less impact
  - Cons: May miss root cause

Thorough response:
  - Pros: Better understanding, fewer recurrences
  - Cons: Slower resolution, more impact


Operational Considerations

Best Practices

  1. Prepare: Runbooks, playbooks, training
  2. Practice: Regular incident drills
  3. Learn: Postmortems, action items
  4. Improve: Continuous improvement

What Staff Engineers Ask in Reviews


Further Reading

Comprehensive Guide: Further Reading: Incident Response

Quick Links:
  - "Site Reliability Engineering" (Google SRE Book), chapter on incident response
  - PRR Checklist
  - SLIs, SLOs & Error Budgets


Exercises

  1. Design incident response: Design an incident response process. What are the steps?

  2. Write postmortem: Write a postmortem for a hypothetical incident. What's included?

  3. Handle incident: Your service is down. How do you respond? What's the process?
