Common mode failure has a more specific meaning in engineering: it refers to failure events that are not statistically independent. Failures in multiple parts of a system may be caused by a single fault, particularly random failures due to environmental conditions or aging. An example is when all of the pumps for a fire sprinkler system are located in one room. If the room becomes too hot for the pumps to operate, they will all fail at essentially the same time, from one cause (the heat in the room). Another example is an electronic system in which a fault in a power supply injects noise onto a supply line, causing failures in multiple subsystems.

This is particularly important in safety-critical systems using multiple
redundant channels. If the probability of failure in one subsystem is p, then an N-channel system would be expected to have a probability of failure (that is, of all N channels failing together) of p^N. However, in practice, the probability of failure is much higher because the channel failures are not statistically independent; for example,
ionizing radiation or
electromagnetic interference (EMI) may affect all the channels.

The principle of redundancy states that, when the failures of individual components are statistically independent, the probability of their joint occurrence is the product of their individual probabilities. Thus, for instance, if the probability of failure of a component of a system is one in one thousand per year, the probability of the joint failure of two such components is one in one million per year, provided that the two events are statistically independent. This principle favors the strategy of component redundancy. One place this strategy is implemented is in
RAID 1, where two hard disks store a computer's data redundantly. But even so, a system can have many common modes of failure. For example, consider the common modes of failure of a RAID 1 array in which two disks are purchased from an online store and installed in a computer:

• The disks are likely to be from the same manufacturer and of the same model, so they share the same design flaws.
• The disks are likely to have similar serial numbers, so they may share any manufacturing flaws affecting the same production batch.
• The disks are likely to have been shipped at the same time, so they are likely to have suffered the same transportation damage.
• As installed, both disks are attached to the same power supply, making them vulnerable to the same power supply issues.
• As installed, both disks are in the same case, making them vulnerable to the same overheating events.
• Both disks are attached to the same card or motherboard and driven by the same software, which may have the same bugs.
• Because of the very nature of RAID 1, both disks are subjected to the same workload and very similar access patterns, stressing them in the same way.

Also, if the failures of two components are maximally statistically dependent, the probability of the joint failure of both is identical to the probability of failure of either one individually. In such a case, the advantages of redundancy are negated. Strategies for the avoidance of common mode failures include keeping redundant components physically isolated.

A prime example of redundancy with isolation is a
nuclear power plant. The new
ABWR has three divisions of
Emergency Core Cooling Systems, each with its own generators and pumps and each isolated from the others. The new
European Pressurized Reactor has two
containment buildings, one inside the other. However, even here it is possible for a common mode failure to occur (for example, in the
Fukushima Daiichi Nuclear Power Plant, mains power was severed by the
Tōhoku earthquake, then the thirteen backup diesel generators were all simultaneously disabled by the subsequent tsunami that flooded the basements of the turbine halls).

==See also==