Introduction

The Dickerson Hierarchy of Reliability offers a map for navigating reliability challenges: what needs to be addressed and in what order. As with other hierarchies of this sort, the level you're on should be solid before you move up the pyramid.

Pyramid diagram of the Dickerson Hierarchy of Reliability with seven tiers; the Post-incident Review tier is highlighted as this module's focus.

From the base up, the seven tiers are:

  1. Monitoring: You can't improve what you can't see.
  2. Incident Response: Reliable, repeatable processes to react when alerts fire.
  3. Post-incident Review: Learning from the incidents that occur (the focus of this module).
  4. Testing and Release: Catching regressions before they reach production.
  5. Capacity Planning: Ensuring the system has the resources it needs to meet demand.
  6. Development: Writing reliable software.
  7. Product: Building the right thing for users.

This module addresses the tier roughly in the middle of the pyramid. Having addressed your monitoring and your incident response (perhaps with the help of other Learn modules in this learning path), you now have the opportunity to focus on principles and practices that can help you level up your operations practice.

The hierarchy is adapted from Mikey Dickerson's Hierarchy of Reliability Needs.

In this module, we're focusing on post-incident reviews that can help you learn from failure, resulting in improved reliability.

When you've completed this module, you will:

  • Discover the importance of learning from incidents.
  • Understand the aspects of complex systems that make learning from failure important.
  • Learn when and how to conduct a post-incident review.
  • Understand the purpose and goals of a post-incident review.
  • Learn the components that go into a good post-incident review.
  • Explore the Azure tools that can assist with getting started with post-incident reviews.
  • Become aware of common traps to avoid.
  • Identify helpful practices to conduct a better review.

An introductory story

To set the scene for this module, here's a true story (or half of it, actually; we get to the second part later in this module):

During World War II, the B-17 "Flying Fortress" aircraft was involved in a series of accidents. We don't know all of the details of these accidents, and we don't know exactly how many there were. It was wartime, and many of the details were secret and remain secret. What we do know is that there were a significant number of similar incidents involving many individual aircraft. Historical retellings tend to focus on damaged aircraft rather than serious injuries, but the wartime record is incomplete.

In each case, what would happen is this: A B-17 would come in to land, would land successfully, and then, either on the runway or while taxiing back to the hangar, something strange would happen. Something serious would happen. The B-17 would be on the ground, and all of a sudden the landing gear would retract and the plane would collapse onto the runway.

In each case, the investigators would look for evidence of mechanical or electrical failure, and in each case, they couldn't find any. So, what they concluded was that this was a case of pilot error, that the pilots had mistakenly retracted the landing gear.

Here are two additional pieces of information: the investigators were correct that no mechanical or electrical failures had occurred, and the accidents kept happening.

This information might leave you dissatisfied with the initial conclusion reached about these accidents, and wondering whether this is the whole story. In this module, we're going to propose that something is missing in this conclusion and in the investigations that led to it.