Promoting Learnings in Incidents

Incidents describe the negative consequences of an action: an action that fails to produce the expected outcome. For instance, you deploy code to production to add a new feature or improve performance, and it takes down the whole service. That is an unexpected outcome. Incident learning is the part where we uncover the underlying problems that led to the incident. These learnings are often more valuable than the fix itself. They reveal how systems behave under stress and how teams adapt.

The more severe the incident, the more there is to learn. Nevertheless, minor incidents happen far more often than severe ones. Therefore, it’s essential to draw learnings from these minor incidents and share them with a wider audience. Teams like those at GitLab, Netflix, and AWS have shown that frequent, open analysis of even low-severity incidents leads to long-term reliability gains.

Incidents are not just engineering problems. They reflect organizational processes, decision-making, tooling, and communication under pressure. The learnings, then, are not just technical. They belong to everyone: engineers, support teams, product managers, and leaders alike.

By writing and sharing postmortems, asking better questions, and removing blame from the process, organizations can build psychological safety and uncover issues before they grow. Promoting incident learning isn’t just an internal hygiene practice. It’s a strategy for resilience, trust, and continuous improvement.

Multiple Causes

Incidents rarely have a single cause. There are usually several contributing factors, such as non-functioning alerts, incomplete or incorrect runbooks, unclear direction, and more. In most cases, these factors don’t act alone. It’s the interplay between them that creates a pathway to failure.

More importantly, the purpose of incident debriefs is to uncover these causes and learn from them. This approach aligns with how high-performing organizations handle complexity: they treat incidents as emergent outcomes of systems under pressure, not as isolated breakdowns.

Rather than oversimplifying with a single root cause, many teams now prefer using the term “contributing factors” and focus on mapping the conditions that made the incident possible. It helps them avoid tunnel vision and surface broader lessons.

For example, Google’s SRE teams often examine both immediate triggers and underlying systemic contributors such as unclear ownership, confusing documentation, or overworked on-call rotations. These insights tend to be more actionable than technical forensics alone.

Investigation

The investigation isn’t about fixing the problem. It’s about uncovering where things went wrong, what we found difficult, surprising, or challenging, and what we know now that we didn’t know or realize before. It’s a process to learn about potential risks. It’s also not about mitigating these problems. It’s simply the detection of the problems that led to the incident and of the challenges encountered during recovery.

This distinction is critical. Fixing brings the system back to baseline, but investigation helps raise the baseline.

The most effective investigations invite multiple perspectives, combining firsthand accounts, system metrics, chat transcripts, and decision logs to build a timeline that reflects reality, not assumptions.

Good organizations emphasize the importance of curiosity in this phase. You want to keep asking not just “what broke?” but “how did we end up here, and what does this tell us about our design, process, or priorities?”

When done well, the investigation becomes a mirror for the team. It shows not only what happened, but also how you think, communicate, and coordinate under stress. That reflection is where deep learning lives.

Language

It’s crucial to use good language. Incidents don’t occur because one person made a mistake; they emerge from organizational problems, gaps in support, and missing risk mitigation. Instead of asking “why” questions, we can use better language, because “why” questions put people on the defensive. Compare “Why didn’t you use version X of library Y?” with the same question reframed: “During the incident, you used version P of library Y. Walk us through how you made that decision.”

Reframing questions in this way preserves psychological safety and invites thoughtful reflection instead of defensiveness. Psychological safety is essential to uncover these learnings. We should promote it as often as possible. Teams are far more likely to share accurate timelines, overlooked details, and contextual decision-making when they know they won’t be blamed.

At high-trust organizations, blameless postmortems aren’t just a cultural aspiration. They’re a documented, enforced policy. This matters because without clear boundaries, individuals in positions of power including directors, VPs, or senior leads can unintentionally (or deliberately) use their authority to assign blame, shift accountability, or single people out during incident reviews. Unfortunately, I have seen senior leaders do this a few times. A written policy creates a shared standard that applies equally to everyone, regardless of role or seniority. It signals that learning is the goal, not scapegoating. When enforced consistently, this protects psychological safety and ensures postmortems remain constructive spaces where facts matter more than hierarchy.

Even subtle phrasing can shape the quality of the discussion. Words like ‘decision,’ ‘observation,’ and ‘constraint’ encourage analysis. Words like ‘mistake’ or ‘failure’ can shut it down.

What went wrong?

We should uncover what went wrong not only from a technical perspective but also from a non-technical one. For instance, if someone was woken up in the middle of the night, that is a contributing factor to the incident. We should focus on the parts that went wrong, not on conceptual scenarios: “If we had done X, then Y would have happened.” That scenario didn’t occur, and it doesn’t help build the learning, so we should spend no time on it.

The point is to reconstruct what actually happened: what people saw, understood, and decided in the moment, without rewriting history based on what we now know in hindsight. I want to emphasize rewriting, because I’ve seen people do it when they feel uncomfortable with what happened. It simply means there isn’t enough trust in the organization.

Avoid counterfactuals and hypotheticals. “We should have…” or “If only we had…” aren’t useful because they describe alternate timelines. Instead, focus on what made sense at the time and what signals were (or weren’t) present.

I remember a time when I was on call. I got paged three times. I handled the first two well but slipped completely on the third, almost bringing the entire application down for a region. Why does this matter? Because I was dealing with the same scenario for the third time and I wasn’t in the best condition. Fatigue, repetition, and late-night stress are all valid contributing factors. If we ignore them, we only see part of the picture.

Some interesting questions around what went wrong are as follows:

What actually happened?
Which components were involved?
What decisions did we have to make?
How did we make those decisions?
Did we try something that had worked for us before?

These questions help uncover not only technical failure modes but also organizational blind spots: unclear ownership, ambiguous runbooks, alert fatigue, misaligned assumptions. By grounding the analysis in real actions and context, we make it easier to surface the systemic factors that allowed the incident to unfold.

What went right?

We naturally spend a good deal of time on what went wrong. Nevertheless, we should also celebrate what went right. Perhaps our systems partially recovered, or some of the measures we took earlier worked. We should promote these learnings too. They are positive learnings worth sharing with the incident reader.

Highlighting what went well reinforces good engineering instincts, validates prior investments, and reminds the team that resilience often comes from preparation they’ve already done.

Maybe an alert fired early, a fallback system kicked in, or someone made a fast and informed decision under pressure. These are signals that parts of the system or team response are working. Documenting these wins helps build confidence and also creates patterns others can replicate. If a manual rollback saved the day, maybe it should be automated. If a clear runbook made execution smooth, it should be celebrated and reused.

Incident debriefs that only focus on failure can quietly erode morale. Recognizing what went right creates psychological balance and makes the team more willing to participate fully.

Classifying Incidents

With many minor incidents happening, it gets hard to know what to focus on for engineering health. Thus, it’s essential to come up with a classification framework to identify the areas the team(s) should focus on. Once the classification is ready, we can start labeling incidents with these classifications so they are easy to find in both incident documents and tickets.

A well-structured classification system converts raw incident data into trends, revealing weak spots across services, teams, or processes over time.

Classification can have different angles. Here’s a classification matrix for illustration. I classified incidents on two dimensions: severity and labels. An incident has exactly one severity, but it can have one or more labels. The matrix below lists incidents across three severity levels.

Incident Classification Matrix

Incident | Severity | System / Component | Primary Cause | Contributing Factors | Tags
#1421 | Sev-1 | Auth Service | Configuration Change | Lack of rollback, unclear deploy ownership | auth, config, no-rollback, ownership
#1429 | Sev-2 | Payments API | External Dependency Failure | No timeout set, poor failover logic | payments, 3rd-party, timeout, failover
#1442 | Sev-3 | Internal Admin Panel | UI Bug | No QA coverage, happened after urgent hotfix | frontend, QA-missing, hotfix
#1450 | Sev-2 | Notification Pipeline | Scaling Limit Exceeded | Monitoring gap, lacked load test before major release | infra, scaling, monitoring-gap, release-risk
#1458 | Sev-1 | Database Cluster (EU) | Hardware Failure | Delayed alerting, under-provisioned failover nodes | infra, db, alerting, redundancy
#1462 | Sev-3 | DevOps Tooling | Misconfigured IAM Policy | Lack of peer review, inconsistent Terraform state | permissions, terraform, review-missing
#1470 | Sev-2 | Checkout Flow | Deployment Regression | Rushed release, tests bypassed due to manual override | checkout, deploy, testing-bypassed

This matrix can guide hiring, training, monitoring, and even roadmap prioritization. For example, if 40% of your incidents are tied to configuration changes, it might justify investing in better validation tooling or safer rollout strategies.
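To make the trend-spotting concrete, here is a minimal sketch in Python, using illustrative data that mirrors the tag column of the matrix above. Counting tag frequency across labeled incidents surfaces recurring themes: any tag shared by more than one incident points at a systemic weak spot.

```python
from collections import Counter

# Tags per incident, mirroring the matrix above (illustrative data only).
incident_tags = {
    1421: ["auth", "config", "no-rollback", "ownership"],
    1429: ["payments", "3rd-party", "timeout", "failover"],
    1442: ["frontend", "QA-missing", "hotfix"],
    1450: ["infra", "scaling", "monitoring-gap", "release-risk"],
    1458: ["infra", "db", "alerting", "redundancy"],
    1462: ["permissions", "terraform", "review-missing"],
    1470: ["checkout", "deploy", "testing-bypassed"],
}

# Count how often each tag appears across all incidents.
tag_counts = Counter(tag for tags in incident_tags.values() for tag in tags)

# Tags shared by more than one incident point at systemic themes.
recurring = {tag: n for tag, n in tag_counts.items() if n > 1}
print(recurring)  # infra appears twice: an infrastructure weak spot
```

The same counting approach works for primary causes or severities. Once incidents are labeled consistently, summaries like "40% of incidents are tied to configuration changes" are one query away.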

Employing Metadata

Likewise, by tagging incidents with relevant metadata (service name, time of day, regression, third-party dependency, etc.), teams can later run queries to answer high-leverage questions like “Which systems had the most Sev2 incidents this quarter?” or “Are our scaling issues clustered around specific peak hours?”
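As a hedged sketch of such a query, the snippet below answers "which systems had the most Sev-2 incidents this quarter?" over tagged records. The field names (service, severity, opened_at) and the data are assumptions for illustration, not any real incident tool’s schema.

```python
from collections import Counter
from datetime import date

# Hypothetical tagged incident records (fields are illustrative assumptions).
incidents = [
    {"service": "Payments API", "severity": "Sev-2", "opened_at": date(2024, 4, 3)},
    {"service": "Checkout Flow", "severity": "Sev-2", "opened_at": date(2024, 5, 20)},
    {"service": "Payments API", "severity": "Sev-2", "opened_at": date(2024, 6, 11)},
    {"service": "Auth Service", "severity": "Sev-1", "opened_at": date(2024, 6, 12)},
]

def sev2_by_service(records, start, end):
    """Count Sev-2 incidents per service within the date range [start, end]."""
    return Counter(
        r["service"]
        for r in records
        if r["severity"] == "Sev-2" and start <= r["opened_at"] <= end
    )

# Q2 query: Payments API tops the list with two Sev-2 incidents.
q2 = sev2_by_service(incidents, date(2024, 4, 1), date(2024, 6, 30))
```

The same filter-then-count shape answers the peak-hours question too: swap the severity predicate for a time-of-day check on the timestamp.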

Good organizations have incident review systems that emphasize structured tagging for this reason. They allow for searchability, pattern detection, and operational visibility at scale. Classification also helps you communicate better with leadership. Instead of saying “we had 12 incidents,” you can say “we had 5 Sev2 incidents tied to deployment regressions, here’s what we’re doing about it.” That’s the kind of framing that earns trust and unlocks support.

Turning Learnings into Competitive Advantage

A key piece is to generate organizational insights from incidents. We want to reduce technical errors, but we can’t achieve that without a good strategy. We want to eliminate the paths that lead to the next incident, not just the incident that already happened. Incident analysis, therefore, isn’t about the incident itself; it’s about stopping the next incident.

The goal is to build a more adaptive, better-informed organization. An organization that doesn’t just fix issues but evolves through them. Netflix’s Chaos Monkey approach came out of incidents like these. They changed the company strategy to live with hiccups rather than trying to eliminate them: things can go wrong, so we should deliberately make them go wrong and practice. Chaos Monkey became a well-known technique in the industry.

In some cases, people rush to deploy many services and changes to production, and many minor incidents can go unnoticed. It is wise to look at the patterns in when incidents happen. Perhaps they happen during holiday periods because fewer people are around, and fewer people work on activities such as design reviews and code reviews.

These patterns matter. Incidents cluster for reasons: calendar pressure, team fatigue, or deferred reviews. Surfacing these trends lets you plan proactively. Hence, an organizational insight might be to put extra effort around these times to keep things from going wrong. For instance, Amazon didn’t allow pushing changes to production before Black Friday. It might not be optimal, but it was learned from previous incidents. It simply isn’t worth losing customer trust.

Many orgs now build policies based on past learnings. It’s not just technical mitigations but organizational boundaries: deploy freezes, increased review rigor, on-call coverage audits. These guardrails come from incident retrospectives.

Incidents can happen for various reasons that aren’t obvious. Capturing those reasons should be part of incident learning. Incidents might occur because a team is under-resourced and has less time to design or review its output. Other teams have degraded morale due to various factors. These dynamics aren’t obvious if we only think about the technical aspects. The learning here might simply be acknowledging that the team is under-resourced or has poor morale.

A drop in reliability might be a systems issue, or it might be burnout, unclear priorities, or leadership debt. Good postmortems notice both. This insight might help management fill missing roles and perhaps bring in team-jelling activities. As these examples show, we should turn past learnings into a competitive advantage in areas such as hiring, releasing, and adopting new techniques. We need to look at incidents from different perspectives to get these insights.

The companies that treat incidents as strategic learning moments, not just interruptions, are the ones that scale without breaking. Insights should then turn into strategies and procedures that enable rapid restoration of your systems and data in the event of a catastrophic failure or disruption. Incident learning should guide these plans, which should cover data backup methods, infrastructure redundancy, and clear recovery procedures.

Highlighting Key Learnings

In past years, some incidents have been shared publicly, such as the GitLab database incident. GitLab published a postmortem document walking through the incident. The debrief covers a timeline, root cause analysis, and improvements to recovery procedures. Nevertheless, it doesn’t call out the learnings explicitly.

This is a common and costly gap. Valuable insights get buried in long timelines and technical detail instead of being surfaced clearly for others to learn from.

I suggest calling out learnings loudly and explicitly.

Make learnings unavoidable. Place them at the top of the postmortem. List them clearly. Write them for people who weren’t there. One effective format is a “Key Takeaways” section with 3–5 bullets at the top, each phrased clearly enough that another team, or even another company, could apply it.

GitLab’s own team later noted that more explicit learnings help their organization spot repeat issues faster and improve onboarding for engineers who want to study prior failures.

Learning is Different than Fixing

One of the missed opportunities is overlooking the learning piece. When we focus only on fixing things, we might lose the opportunity to find deeper causes. Fixes restore service. Learnings improve the system. The two are related, but they are not the same.

We fix the issue with the materials and methods we have at hand. Nevertheless, that fix may turn out to be suboptimal or temporary. Therefore, we should focus on the key findings and learnings instead. Rushing to a fix isn’t the intent of incident learning meetings.

Good organizations explicitly separate the “fix” from the “understanding.” The fix is necessary, but the understanding is what prevents future, often larger, failures. A narrow focus on resolution can lead to shallow conclusions e.g. “a config flag was wrong” when the real learning might be “our deployment process lacks validation safeguards.”

It’s also important to pause and reflect after the adrenaline of the incident response fades. Many organizations now run follow-up retrospectives or second-phase reviews, sometimes days after the incident, to explore what was learned once the team is calm and clear-headed.

Ask: What did we learn about our tooling? About our culture? About how decisions were made under pressure? That level of reflection matures engineering immensely. A fix that prevents one incident is useful. A learning that prevents ten is transformational.

Promoting Learnings

Incident meetings are about learnings. Therefore, we should promote learning pieces. Learning should not be an afterthought. It should really be the headline.

In incident debrief documents, the key learnings are often neglected or come last. The learning part has to be the focus. Lead with it. If someone skims the postmortem, they should walk away with at least one clear takeaway they can apply.

I would love it to be literally the first thing a reader sees. If I’m outside of the organization and I donʼt have enough context, give me some learning that I/my team can reflect on. If Iʼm more interested then I can go ahead and read the rest of the document.

Many high-functioning teams now include a dedicated “Key Learnings” or “Lessons for Others” section, written in plain language and framed for broader audiences, not just the engineers involved.

What’s more, some postmortem documents don’t even have key learnings; nobody documented them, or nobody focused on them. This is a lost opportunity. Even if a fix is already in motion, learnings should be shared: in documentation, team meetings, onboarding guides, and tooling decisions. We should actively promote writing a key-learnings section in every incident debrief document.

Some organizations go further. They host monthly learning reviews, cross-team postmortem circles, or internal newsletters highlighting the most insightful takeaways from recent incidents. These practices ensure learning doesn’t just exist in a document; it spreads across the organization.

Conclusion

I know this has been a long read, but there’s so much more to say about incidents. Incident postmortems and debriefs are opportunities for organizations to learn and to promote incident learning. In doing so, we need to think about learning at each phase, such as the investigation.

Learning shouldn’t be passive. It should drive change. From how we monitor systems to how we support teams, incidents are a feedback loop that makes everything sharper, safer, and more aligned.

As an organization, taking a step back and seeing these incidents from different perspectives can give a competitive advantage for the business. Technical fixes solve symptoms. Holistic learnings solve root conditions such as organizational gaps, cultural drift, tool design, or communication flow.

Perhaps a monthly newsletter of learnings from incidents can help the organization learn and reflect on them. Other organizations hold cross-team debriefs, incident learning circles, or maintain a searchable “incident library” to spread wisdom across teams and time zones.

Furthermore, a general look across incidents can detect causes that aren’t solely technical. Consequently, a focus on the learnings makes it possible to consolidate minor-incident learnings into actionable insights.

If we treat incidents only as failures, we waste them. If we treat them as opportunities to grow, we unlock their full value.

Good Reads

Book: The Field Guide to Understanding ‘Human Error’

Paper: Learning from error: The influence of error incident characteristics

Article: Learning from Incidents

Blog: Incident Analysis: Your Organization’s Secret Weapon

Book: Operations Anti-Patterns, DevOps Solutions

Blog: Learning from incidents: from ‘what went wrong?’ to ‘what went right?’

Video: Incident Analysis: How *Learning* is Different Than *Fixing*
