I read the news about the computer failure that forced a reboot of Delta’s ticketing systems, causing thousands of delayed or cancelled flights and leaving thousands of passengers stranded in airports. But it wasn’t until my own flight was delayed eight hours, and I saw hundreds of people sleeping on the floor of the Minneapolis-St. Paul airport, that it became real.
That experience brought one question into stark clarity: How many complex infrastructure projects in the world have significant weaknesses that we either
1) don’t know about, or
2) have not prioritized correctly?
The relationship between the cause and effect of a simple system is fairly easy for our human minds to grasp. Even slightly more complex connections involving related stages, or multiple “downstream” connections, can usually be drawn on a piece of paper and understood. But it is within complex, multi-leveled, interdependent systems where our ability as humans to map and track these relationships breaks down pretty quickly.
This is where the concept of Traceability comes in. Traceability is the connection of a Requirement to the Test that validates and verifies it. This is especially important when building or designing a complex, safety-critical system that needs to be error-free. Rigorous traceability ensures that all the correct tests are run to meet the requirements of the system. With dozens or hundreds of requirements and tests, keeping track of those connections can be daunting, which is why there is a whole class of software programs to help people do it.
The other benefit of traceability is that you execute a more efficient process. That is, you don’t run tests that are not tied back to requirements (i.e. unnecessary work). It’s easy to run the same battery of tests from the previous project, but if they’re not the right tests you waste money. It is ironic that some view establishing traceability and testing plans as burdensome, when in fact it can be the opposite.
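To make both benefits concrete, here is a minimal Python sketch of a traceability check. The requirement and test identifiers are hypothetical, made up purely for illustration, and real traceability tools track far more than this. It flags requirements that no test verifies (a safety gap) and tests that trace back to no requirement (possibly unnecessary work).

```python
# Hypothetical requirements, keyed by ID. Illustration only.
requirements = {
    "REQ-1": "System switches to backup power on primary failure",
    "REQ-2": "Ticketing service restarts without manual steps",
    "REQ-3": "Operators are alerted when a power module faults",
}

# Hypothetical tests, each listing the requirement IDs it verifies.
test_links = {
    "TEST-A": ["REQ-1"],
    "TEST-B": ["REQ-1", "REQ-2"],
    "TEST-C": [],           # carried over from a previous project
    "TEST-D": ["REQ-9"],    # points at a requirement that no longer exists
}


def uncovered_requirements(requirements, test_links):
    """Return requirement IDs that no test traces back to."""
    covered = {req for reqs in test_links.values() for req in reqs}
    return sorted(set(requirements) - covered)


def orphan_tests(requirements, test_links):
    """Return test IDs that verify no current requirement."""
    return sorted(
        test
        for test, reqs in test_links.items()
        if not any(req in requirements for req in reqs)
    )


if __name__ == "__main__":
    print("Requirements with no test:", uncovered_requirements(requirements, test_links))
    print("Tests with no requirement:", orphan_tests(requirements, test_links))
```

In this made-up data set, REQ-3 has no test behind it, while TEST-C and TEST-D are candidates for removal or for re-linking to a current requirement.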
In the case of Delta’s mishap, I don’t know if it was a flaw in the original design (i.e. some missing test), or an oversight in some evolution of the original design. It could have also been an incorrect prioritization of a particular investment in the system maintenance. This is another area where Traceability can be helpful.
By using statistical qualifiers, a system architect can let the numbers determine the economic and safety priorities of a backlog of work associated with risk items. Probability, impact and cost can all be brought to bear, and the people who need to be part of the conversation can be quickly identified and connected to the work.
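One simple way to picture this kind of prioritization is sketched below in Python. The scoring rule (expected loss divided by mitigation cost) is an assumption chosen for illustration, not a standard the article prescribes, and every risk item, figure and owner name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RiskItem:
    name: str
    probability: float      # assumed chance of the failure occurring in a year, 0 to 1
    impact: float           # assumed cost in dollars if the failure occurs
    mitigation_cost: float  # assumed cost in dollars to address the risk
    owners: tuple           # people who need to be part of the conversation

    @property
    def expected_loss(self):
        """Probability-weighted cost of leaving the risk unaddressed."""
        return self.probability * self.impact

    @property
    def priority(self):
        """Expected loss avoided per dollar spent (illustrative scoring rule)."""
        return self.expected_loss / self.mitigation_cost


def prioritize(items):
    """Order the backlog from highest to lowest priority score."""
    return sorted(items, key=lambda item: item.priority, reverse=True)


# Hypothetical backlog; none of these figures come from a real system.
backlog = [
    RiskItem("Legacy booking screen", 0.30, 200_000, 100_000, ("ui-team",)),
    RiskItem("Backup power switchover", 0.02, 100_000_000, 500_000, ("facilities", "ops")),
    RiskItem("Database failover", 0.05, 20_000_000, 1_000_000, ("dba", "ops")),
]

if __name__ == "__main__":
    for item in prioritize(backlog):
        print(f"{item.name}: score {item.priority:.1f}, owners {', '.join(item.owners)}")
```

Note how the low-probability, high-impact item rises to the top of the list even though it is the least likely to occur. That is the sort of result that is easy to miss when priorities are set by intuition alone.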
Finally, all complex infrastructure systems require ongoing maintenance, monitoring, and testing to keep running smoothly. There are tools and methods to do these tasks, but the original “requirements” for these processes can and should be part of the original design intent.
For example, if I want to build an airplane, I am going to design monitoring instrumentation into the wings that reports damage and wear, enabling more efficient and accurate repair and maintenance. This operational execution plan can be included in the traceability model of the system, connecting high-level product and business goals to system requirements and on down through testing and defect handling.
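The chain from business goals down to defects can be pictured as a set of links that can be walked in either direction. The Python sketch below is one assumed way to model it. The identifiers and the `GOAL-`/`REQ-`/`TEST-`/`DEFECT-` naming scheme are hypothetical. Given a defect or a failing test, it walks upstream to find which high-level goals are at stake.

```python
# Hypothetical upstream links: each item maps to the items it traces back to.
TRACES_TO = {
    "DEFECT-7": ["TEST-12"],
    "TEST-12": ["REQ-3"],
    "TEST-13": ["REQ-4"],
    "REQ-3": ["GOAL-1"],
    "REQ-4": ["GOAL-1", "GOAL-2"],
}


def trace_upstream(item, links):
    """Return every item reachable by following links upward from `item`."""
    found = []
    visited = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in links.get(current, []):
            if parent not in visited:  # guard against revisiting shared parents
                visited.add(parent)
                found.append(parent)
                stack.append(parent)
    return found


def affected_goals(item, links):
    """Return the high-level goals that `item` ultimately traces back to."""
    return sorted(x for x in trace_upstream(item, links) if x.startswith("GOAL-"))


if __name__ == "__main__":
    print("DEFECT-7 traces to:", trace_upstream("DEFECT-7", TRACES_TO))
    print("Goals at stake for TEST-13:", affected_goals("TEST-13", TRACES_TO))
```

With links like these in place, a report from a wing sensor or a failed test is no longer an isolated event. It can be connected to the requirement it threatens and the goal behind that requirement.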
We know that a malfunction in a critical power control module caused the Delta failure. It turns out it will cost Delta upwards of $1B and a lot of brand tarnish. What we don’t know is how it could have been avoided. Was it an original design flaw, a flaw that emerged later and was unknown to the original system architects, or a deficiency in a test plan? And could a more robust accountability system (traceability) have helped if it had been in place?
What other complex systems could benefit, both in risk reduction and economic gain, if they were properly designed and connected with rigorous end-to-end traceability?