Example hardware features for improving RAS include the following, listed by subsystem:
• Processor:
  • Processor instruction error detection (e.g., residue checking of results) with instruction retry, e.g., alternative processor recovery in IBM mainframes, or "Instruction replay technology" in Itanium systems.
  • Processors running in lock-step to perform master-checker or voting schemes.
  • Machine Check Architecture and ACPI Platform Error Interface to report errors to the OS.
• Memory:
  • Parity or ECC (including single device correction) protection of memory components (cache and main memory); bad cache line disabling; memory scrubbing; memory sparing, memory mirroring; bad page offlining; redundant bit steering; redundant array of independent memory (RAIM).
• I/O:
  • Cyclic redundancy check checksums for data transmission/retry and data storage, e.g., PCI Express (PCIe) Advanced Error Reporting (AER); redundant I/O paths.
• Storage:
  • RAID configurations for hard disk drive and solid-state drive storage.
  • Journaling file systems for file repair after crashes.
  • Checksums on both data and metadata, and background scrubbing.
  • Self-Monitoring, Analysis, and Reporting Technology for hard disk drive and solid-state drive.
• Power/cooling:
  • Duplicating components to avoid single points of failure, e.g., power-supplies.
  • Over-designing the system for the specified operating ranges of clock frequency, temperature, voltage, vibration.
  • Temperature sensors to throttle operating frequency when temperature goes out of specification.
  • Surge protector, uninterruptible power supply, auxiliary power.
• System:
  • Hot swapping of components: CPUs, RAMs, hard disk drives and solid-state drives.
  • Predictive failure analysis to predict which intermittent correctable errors will lead eventually to hard non-correctable errors.
  • Partitioning/domaining of computer components to allow one large system to act as several smaller systems.
  • Virtual machines to decrease the severity of operating system software faults.
  • Redundant I/O domains or I/O partitions for providing virtual I/O to guest virtual machines.
  • Computer clustering capability with failover capability, for complete redundancy of hardware and software.
  • Dynamic software updating to avoid the need to reboot the system for a kernel software update, for example Ksplice under Linux.
  • Independent management processor for serviceability: remote monitoring, alerting and control.
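
To make the memory features in the list more concrete, the following is a minimal software sketch of single-bit error correction and one scrubbing pass. It assumes a Hamming(7,4) code chosen purely for illustration; real ECC memory is implemented in hardware and typically uses wider codes, and the function names (`hamming74_encode`, `hamming74_correct`, `scrub`) are hypothetical rather than taken from any particular system.

```python
# Illustrative sketch only: a Hamming(7,4) code in software.
# Real ECC memory works in hardware on wider words; all names here are hypothetical.

def hamming74_encode(nibble):
    """Encode 4 data bits (int 0..15) into a 7-bit codeword (list of bits).

    Bit positions are 1-indexed; positions 1, 2 and 4 hold parity bits.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0 is unused so list indices match positions 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_correct(bits):
    """Return (nibble, error_position); error_position is 0 if no error was seen."""
    code = [0] + list(bits)
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the position of the flipped bit after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1  # repair the flipped bit
    d = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome


def scrub(memory):
    """One scrubbing pass: rewrite words with a correctable error in place.

    Returns the list of addresses (indices) that were repaired.
    """
    repaired = []
    for addr, word in enumerate(memory):
        nibble, error_position = hamming74_correct(word)
        if error_position:
            memory[addr] = hamming74_encode(nibble)
            repaired.append(addr)
    return repaired


# Simulate a small memory, flip one bit, then scrub it.
memory = [hamming74_encode(v) for v in (0x3, 0xA, 0xF)]
memory[1][4] ^= 1                     # single-bit upset at position 5 of word 1
print(scrub(memory))                  # [1]
print(hamming74_correct(memory[1]))   # (10, 0)
```

This plain Hamming(7,4) code corrects only single-bit errors; a double-bit error would be miscorrected, which is one reason practical designs add further check bits. The scrubbing pass illustrates why periodic rewriting helps: a correctable error is repaired before a second error can accumulate in the same word.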
Fault-tolerant designs extended the idea by making RAS the defining feature of their computers for applications like stock market exchanges or air traffic control, where system crashes would be catastrophic.
Fault-tolerant computers (e.g., see Tandem Computers and Stratus Technologies), which tend to have duplicate components running in lock-step for reliability, have become less popular due to their high cost. High availability systems, using distributed computing techniques like computer clusters, are often used as cheaper alternatives.

== See also ==