Root-cause analysis

There are essentially two ways of repairing faults and solving problems in science and engineering.

==Reactive management==
Reactive management consists of reacting quickly after the problem occurs, by treating the symptoms. This type of management is implemented by reactive systems, self-adaptive systems, self-organized systems, and complex adaptive systems. The goal here is to react quickly and alleviate the effects of the problem as soon as possible.

==Proactive management==
Proactive management, conversely, consists of preventing problems from occurring. Many techniques can be used for this purpose, ranging from good practices in design to analyzing in detail problems that have already occurred and taking actions to make sure they never recur. Speed is not as important here as the accuracy and precision of the diagnosis. The focus is on addressing the real cause of the problem rather than its effects.

Root-cause analysis is often used in proactive management to identify the root cause of a problem, that is, the factor that was the leading cause. It is customary to refer to the "root cause" in singular form, but one or several factors may constitute the root cause(s) of the problem under study.

A factor is considered the "root cause" of a problem if removing it prevents the problem from recurring. Conversely, a "causal factor" is a contributing action that affects the outcome of an incident or event but is not the root cause. Although removing a causal factor can improve the outcome, it does not prevent the problem from recurring with certainty.

A useful way to look at the proactive/reactive picture is to consider the Bowtie Risk Assessment model. In the center of the model is the event or accident. To the left are the anticipated hazards and the lines of defense put in place to prevent those hazards from causing events. The lines of defense are the regulatory requirements, applicable procedures, physical barriers, and cyber barriers that are in place to manage operations and prevent events.
One effective use of root-cause analysis is to evaluate the effectiveness of those defenses proactively by comparing actual performance against applicable requirements, identifying performance gaps, and then closing the gaps to strengthen those defenses. If an event occurs, then we are on the right side of the model, the reactive side, where the emphasis is on identifying the root causes and mitigating the damage.

==Example==
Imagine an investigation into a machine that stopped because it was overloaded and the fuse blew. Investigation shows that the machine was overloaded because it had a bearing that was not being sufficiently lubricated. The investigation proceeds further and finds that the automatic lubrication mechanism had a pump that was not pumping sufficiently, hence the lack of lubrication. Investigation of the pump shows that it has a worn shaft. Investigation of why the shaft was worn discovers that there is no adequate mechanism to prevent metal scrap from getting into the pump. This enabled scrap to get into the pump and damage it.

The apparent root cause of the problem is that metal scrap can contaminate the lubrication system. Fixing this problem ought to prevent the whole sequence of events from recurring. The real root cause could be a design issue if there is no filter to prevent the metal scrap from getting into the system. Or, if the system has a filter that was blocked due to a lack of routine inspection, then the real root cause is a maintenance issue.

Compare this with an investigation that does not find the root cause: replacing the fuse, the bearing, or the lubrication pump will probably allow the machine to go back into operation for a while. However, there is a risk that the problem will simply recur until the root cause is dealt with.

The above does not include cost/benefit analysis: does the cost of replacing one or more machines exceed the cost of downtime until the fuse is replaced?
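
The chain of "why?" questions in the machine example can be sketched in a few lines of Python. This is only an illustrative model, not a standard root-cause analysis tool: the names `CAUSED_BY`, `trace_root_cause`, and `recurs_after_fix` are hypothetical, and the sketch assumes that each fault has exactly one recorded direct cause and that the chain contains no cycles.

```python
# Hypothetical model of the machine example: each fault maps to its direct cause.
CAUSED_BY = {
    "machine stopped": "fuse blew",
    "fuse blew": "machine overloaded",
    "machine overloaded": "bearing under-lubricated",
    "bearing under-lubricated": "pump not pumping sufficiently",
    "pump not pumping sufficiently": "pump shaft worn",
    "pump shaft worn": "metal scrap entered pump",
}


def trace_root_cause(symptom, caused_by):
    """Follow the 'why?' links from a symptom until no deeper cause is recorded.

    Assumes the mapping is acyclic; a cycle would make this loop forever.
    """
    chain = [symptom]
    while chain[-1] in caused_by:
        chain.append(caused_by[chain[-1]])
    return chain


def recurs_after_fix(fixed, caused_by):
    """Return True if a fixed fault is expected to come back.

    In this simplified model, a fault recurs whenever something upstream
    still causes it, that is, whenever it has a recorded cause of its own.
    """
    return fixed in caused_by


chain = trace_root_cause("machine stopped", CAUSED_BY)
root_cause = chain[-1]  # the last link is the apparent root cause

for fault in ("fuse blew", "pump shaft worn", root_cause):
    outcome = "may recur" if recurs_after_fix(fault, CAUSED_BY) else "should not recur"
    print(f"Fixing '{fault}': problem {outcome}")
```

A plain dictionary is enough here because the example is a single linear chain. An investigation that finds several contributing factors for one fault would need a mapping from each fault to a list of causes instead.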
A situation in which the remedy costs more than the downtime it prevents is sometimes described as the cure being worse than the disease. Once the personnel who operate the machinery are taken into account, the costs to consider go beyond finances. Ultimately, the goal is to prevent downtime and, more importantly, catastrophic injuries. Prevention begins with being proactive.

==General principles==