Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work. The more complex the system, the more carefully all possible interactions have to be considered and prepared for. Given the importance of high-value systems in transport, public utilities, finance, public safety and the military, the range of topics that such research touches on is very wide: it runs from obvious subjects such as software modeling and reliability or hardware design to more arcane ones such as stochastic models, graph theory, formal or exclusionary logic, parallel processing, remote data transmission, and more.
=== Replication ===
Spare components address the first fundamental characteristic of fault tolerance in three ways:
• Replication: Providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum;
• Redundancy: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (failover);
• Diversity: Providing multiple different implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.

All implementations of RAID (redundant array of independent disks), except RAID 0, are examples of a fault-tolerant storage device that uses data redundancy.

A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replicas of each element should be in the same state. The same inputs are provided to each replica, and the same outputs are expected. The outputs of the replicas are compared using a voting circuit.

A machine with two replicas of each element is termed dual modular redundant (DMR). With only two replicas, the voting circuit can detect a mismatch but cannot tell which replica is wrong, so recovery relies on other methods. A machine with three replicas of each element is termed triple modular redundant (TMR). The voting circuit can determine which replica is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result and discard the erroneous version. After this, the internal state of the erroneous replica is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replicas.
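The voting behaviour described above can be sketched in software. This is only an illustration: the function name `vote` and its return shape are assumptions made for this example, not part of any real voting-circuit design, and the outputs are assumed to be hashable values.

```python
from collections import Counter


def vote(outputs):
    """Majority-vote over replica outputs (hypothetical helper).

    Returns (majority_value, indices_of_disagreeing_replicas).
    Raises ValueError when no value is held by a strict majority,
    which is the DMR case: a mismatch is detected but cannot be corrected.
    """
    counts = Counter(outputs)
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(outputs):
        raise ValueError("mismatch detected, no majority to correct it")
    # Replicas that disagree with the majority are treated as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

With three outputs where one differs, the call returns the agreed value together with the index of the outvoted replica, which a caller could then exclude in order to continue in a DMR-like mode. With two differing outputs it raises, mirroring the fact that DMR alone can only detect an error.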
Lockstep fault-tolerant machines are most easily made fully synchronous, with each gate of each replica making the same state transition on the same edge of the clock, and the clocks to the replicas being exactly in phase. However, it is possible to build lockstep systems without this requirement.

Bringing the replicas into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.

One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of TMR, but has been used commercially.
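The pair-and-spare selection logic can be illustrated with a short sketch. The names `run_pair` and `pair_and_spare` are hypothetical, replicas are modelled as plain functions, and the pairs are evaluated one after another here, whereas in the hardware arrangement described above both pairs operate at the same time.

```python
def run_pair(replica_a, replica_b, x):
    """Run two replicas on the same input.

    Returns (output_of_first_replica, error_flag); the flag is True
    when the two replicas disagree.
    """
    a = replica_a(x)
    b = replica_b(x)
    return a, a != b


def pair_and_spare(pairs, x):
    """Return the output of the first pair that does not signal an error."""
    for replica_a, replica_b in pairs:
        output, error = run_pair(replica_a, replica_b, x)
        if not error:
            return output
    raise RuntimeError("every pair reports an error")
```

As in DMR, a pair only detects a mismatch; it never decides which of its two replicas is wrong. Correctness comes from switching to the other pair.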
=== Failure-oblivious computing ===
Failure-oblivious computing is a technique that enables computer programs to continue executing despite errors. The technique can be applied in different contexts. It can handle invalid memory reads by returning a manufactured value to the program, which then uses the manufactured value in place of the memory value it tried to access. This contrasts sharply with typical memory checkers, which inform the program of the error or abort the program. The approach has performance costs: because the technique rewrites code to insert dynamic checks for address validity, execution time increases by 80% to 500%.
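As a rough analogue of the invalid-read handling described above, the following sketch wraps a buffer so that out-of-range reads yield a manufactured value instead of raising an error. It is a hand-written illustration, not the technique itself, which inserts such checks automatically when code is rewritten. The class name `ObliviousBuffer` and the choice of a constant manufactured value are assumptions made for this example.

```python
class ObliviousBuffer:
    """Buffer whose invalid reads return a manufactured value (hypothetical)."""

    def __init__(self, data, manufactured=0):
        self._data = list(data)
        self._manufactured = manufactured  # assumed policy: a fixed constant
        self.invalid_reads = 0             # count of reads that were repaired

    def read(self, index):
        # Dynamic validity check performed on every access; checks of this
        # kind are the source of the overhead mentioned in the text.
        if 0 <= index < len(self._data):
            return self._data[index]
        self.invalid_reads += 1
        return self._manufactured
```

A memory checker in the same position would raise or abort at the invalid read; here the caller simply continues with the manufactured value, and the counter records how often that happened.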
=== Recovery shepherding ===
Recovery shepherding is a lightweight technique that enables software programs to recover from otherwise fatal errors such as null pointer dereference and divide by zero. Compared with failure-oblivious computing, recovery shepherding works on the compiled program binary directly and does not require recompiling the program. It uses the just-in-time binary instrumentation framework Pin. It attaches to the application process when an error occurs, repairs the execution, tracks the repair effects as the execution continues, contains the repair effects within the application process, and detaches from the process after all repair effects are flushed from the process state. It does not interfere with the normal execution of the program and therefore incurs negligible overhead. For 17 of 18 systematically collected real-world null-dereference and divide-by-zero errors, a prototype implementation enables the application to continue executing and to provide acceptable output and service to its users on the error-triggering inputs.
=== Circuit breaker ===
The circuit breaker design pattern is a technique to avoid catastrophic failures in distributed systems.

== Redundancy ==