Incidents are inevitable. As applications grow and evolve, the chances of an incident increase. Even so, organizations can gain value from those incidents by learning from them through postmortems.
A postmortem is “a written record of an incident” that describes the following items: the incident’s impact, the actions taken to resolve it, its causes, and any follow-up actions taken to prevent it from recurring. Evaluating the incident and writing a postmortem maximizes the value of the incident, because at that point it has become an unexpected investment in the reliability of the affected system.
Here are seven questions that help frame the importance of postmortems and the best practices for running them.
1. When are postmortems needed, and who completes them?
Postmortems are typically needed when major incidents occur, such as a severity 1 or 2 level incident. Although they are only considered necessary for significant incidents, postmortems can be completed for an incident of any level where one seems useful.
The postmortem is usually completed by an assigned member of the team that delivers the service that caused the incident. That person “owns” the postmortem from the first draft until it is published. When an incident involves infrastructure or is platform-level, a program manager may be called in to run the postmortem.
2. Why should postmortems hold zero blame?
When a problem occurs, human nature tends to look for someone to blame. Though this is the “easy” thing to do, it is not the best. Blame can undermine the effectiveness of the postmortem. For instance, the team member completing the postmortem may leave out important details of how the incident occurred if including them puts that person’s job or reputation among peers at risk. The focus should be on why the system allowed an error to be made rather than on how the individual involved reacted.
3. How are postmortems processed?
The postmortem process includes creating a postmortem issue, running a meeting, capturing actions and, finally, gaining approval and communicating the outcome. Several methods help in developing effective postmortems, but the right one depends on the culture of the organization.
- Single-point accountability: The assignee of the postmortem ticket is responsible for the postmortem results.
- Face-to-face meetings: Meeting in person speeds up analysis, creates a shared understanding and aligns the team.
- Service Level Objective: A target is set for completing postmortems, supported by frequent reminders and reports.
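As a rough illustration of the Service Level Objective idea, the sketch below checks how many days a postmortem has left before it misses its target and whether a reminder is due. The 10-day target, the 3-day reminder window and the function names are all assumptions made for this example; choose values that suit your organization.

```python
from datetime import date, timedelta

# Assumption: a 10-day completion target, used here only for illustration.
POSTMORTEM_SLO_DAYS = 10


def days_remaining(resolved_on: date, today: date,
                   slo_days: int = POSTMORTEM_SLO_DAYS) -> int:
    """Days left before the postmortem misses its SLO (negative if overdue)."""
    due = resolved_on + timedelta(days=slo_days)
    return (due - today).days


def needs_reminder(resolved_on: date, today: date, published: bool,
                   slo_days: int = POSTMORTEM_SLO_DAYS,
                   remind_within: int = 3) -> bool:
    """True when an unpublished postmortem is overdue or close to its due date."""
    if published:
        return False
    return days_remaining(resolved_on, today, slo_days) <= remind_within
```

A scheduled job could call `needs_reminder` for every open postmortem and notify the assignee, which keeps the single owner accountable without anyone having to chase them manually.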
A template of Atlassian's Incident Postmortem approach can be found here.
4. What are postmortem issue fields?
Whichever process is used, the postmortem issue itself contains several fields. These are filled in to collect all the important details that need to be discussed in the postmortem meeting. Below are a few examples.
- Incident Summary: A few sentences explaining the incident, its severity and impact
- Leadup: A description of circumstances or occurrences leading up to the incident
- Fault: A description, supported by data, of how the system worked incorrectly
- Recovery: A detailed explanation of how and when restoration of service occurred
- Five whys: A chain of successive “why” questions, starting from the incident’s impact, that works toward its root cause
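To make the fields above concrete, here is a minimal sketch of a postmortem issue as a small Python record. The class name, attribute names and the `missing_fields` helper are hypothetical and are not taken from any particular tool; the sample values in the usage note are invented.

```python
from dataclasses import dataclass, field


@dataclass
class PostmortemIssue:
    """Hypothetical record holding the example fields listed above."""
    incident_summary: str = ""
    leadup: str = ""
    fault: str = ""
    recovery: str = ""
    five_whys: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty."""
        text_fields = ("incident_summary", "leadup", "fault", "recovery")
        missing = [name for name in text_fields
                   if not getattr(self, name).strip()]
        if not self.five_whys:
            missing.append("five_whys")
        return missing
```

For example, a draft created with only `incident_summary`, `fault` and `recovery` filled in would report `leadup` and `five_whys` as missing, which gives the owner a simple checklist to finish before the postmortem meeting.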
5. What is the difference between proximate and root causes?
Postmortems are important for discovering root causes and finding ways to mitigate them so that future incidents can be prevented. Proximate causes are the reasons that led directly to the incident, whereas root causes are the underlying reasons behind the chain of events that led to it.
6. What do postmortem meetings consist of?
Like most meetings, postmortem meetings provide quick and effective communication, and they help the team develop a deeper understanding and learn more from the incident. The main points to hit during a postmortem meeting are to: remind the team that postmortems are blameless, walk through the timeline of events during the incident, present the theory of the incident’s causes, encourage open thinking, and ask open-ended questions about how the situation or its handling could be improved.
7. What are postmortem actions, and how are they tracked?
Postmortem action items fall into different categories, inspired by presentations and articles by Sue Lueder and Betsy Beyer of Google. Some actions deal with the incident itself: investigating it, mitigating it and repairing the damage it caused. Others look ahead: detecting, mitigating and preventing future incidents. Together, these actions are important in developing a strong postmortem.
Every action that stems from a postmortem is tracked with a Jira issue. Each action is then linked as either a “primary action” or an “improvement action,” depending on whether it is a significant mitigation or a more indirect improvement coming out of the postmortem.
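The tracking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the enum names, the `Action` record and the `open_actions_by_link` helper are hypothetical, the issue keys in the usage note are invented, and nothing here calls the Jira API.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """Action categories named in the text above."""
    INVESTIGATE_INCIDENT = "investigate this incident"
    MITIGATE_INCIDENT = "mitigate this incident"
    REPAIR_DAMAGE = "repair damage from this incident"
    DETECT_FUTURE = "detect future incidents"
    MITIGATE_FUTURE = "mitigate future incidents"
    PREVENT_FUTURE = "prevent future incidents"


class LinkType(Enum):
    PRIMARY = "primary action"
    IMPROVEMENT = "improvement action"


@dataclass
class Action:
    issue_key: str        # hypothetical tracker key, e.g. "OPS-101"
    category: Category
    link_type: LinkType
    done: bool = False


def open_actions_by_link(actions: list[Action]) -> dict[LinkType, list[str]]:
    """Group the keys of unfinished actions by how they are linked."""
    grouped: dict[LinkType, list[str]] = defaultdict(list)
    for action in actions:
        if not action.done:
            grouped[action.link_type].append(action.issue_key)
    return dict(grouped)
```

Given a list of `Action` records, `open_actions_by_link` returns the outstanding issue keys under each link type, so a team can see at a glance which primary actions are still open before the postmortem is approved.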
Missed our last post on Incident Response? Read it here, or contact us if you are looking for ways E7 Solutions can help your organization with incident management. Also, be sure to check back on our site for more incident management content to come!