== Concept ==
Error correction codes protect against undetected data corruption and are used in computers where such corruption is unacceptable, for example in scientific and financial computing applications, or in database and file servers. ECC can also reduce the number of crashes in multi-user server applications and maximum-availability systems.

Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to
alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them. As a result, systems operating at high altitudes require special provisions for reliability.

As an example, the spacecraft
Cassini–Huygens, launched in 1997, contained two identical flight recorders, each with 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Due to built-in EDAC functionality, the spacecraft's engineering telemetry reported the number of (correctable) single-bit-per-word errors and (uncorrectable) double-bit-per-word errors. During the first 2.5 years of flight, the spacecraft reported a nearly constant single-bit error rate of about 280 errors per day. However, on November 6, 1997, during the first month in space, the number of errors increased by more than a factor of four on that single day. This was attributed to a solar particle event that had been detected by the satellite GOES 9.

Some tests conclude that the isolation of
DRAM memory cells can be circumvented by unintended side effects of specially crafted accesses to adjacent cells. Because of the high cell density in modern memory, repeatedly accessing data stored in DRAM causes memory cells to leak their charges and interact electrically, altering the content of nearby memory rows that were not addressed in the original memory access. This effect is known as row hammer, and it has also been used in some privilege escalation computer security
exploits.

An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the character "8" (decimal value 56 in the ASCII encoding) is stored in the byte that contains the stuck bit at its lowest bit position; then, a change is made to the spreadsheet and it is saved. As a result, the "8" (0011 1000 binary) has silently become a "9" (0011 1001).
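The stuck-bit scenario above can be reproduced in a few lines. This is only an illustrative sketch: the helper name `apply_stuck_at_one` is hypothetical, and it models the fault as a simple bitwise OR on a single byte rather than simulating real DRAM behaviour.

```python
def apply_stuck_at_one(byte: int, bit_position: int = 0) -> int:
    """Model a memory cell whose bit at `bit_position` always reads back as 1.

    Hypothetical helper for illustration only; a real fault would occur in
    hardware, not in software.
    """
    return byte | (1 << bit_position)


stored = ord("8")                        # decimal 56, binary 0011 1000
read_back = apply_stuck_at_one(stored)   # lowest bit is forced to 1

print(f"{stored:08b} -> {read_back:08b}")   # 00111000 -> 00111001
print(chr(stored), "->", chr(read_back))    # 8 -> 9
```

Without any error checking, nothing in the program signals that the value changed, which is why the corruption in the spreadsheet goes unnoticed.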
== Solutions ==
Several approaches have been developed to deal with unwanted bit-flips, including immunity-aware programming, RAM parity memory, and ECC memory.

This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to store an error-correcting code (ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits), but not their correction, so the system has to either carry on (just flagging the problem) or halt. Error-correcting codes go further and allow errors to be corrected; how many depends on the exact type of memory used.
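The difference between parity and an error-correcting code can be sketched in software. The example below is a minimal illustration, not a description of how any particular memory controller works: it uses a single even-parity bit over one byte and a textbook Hamming(7,4) code over four data bits, chosen here only because it is small. Real ECC memory typically protects much wider words. All function names are hypothetical.

```python
from typing import List, Tuple


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if `word` contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Check a word against the parity bit recorded when it was written."""
    return parity_bit(word) == stored_parity


def hamming74_encode(nibble: int) -> List[int]:
    """Encode 4 data bits as a 7-bit codeword (list index 0 is position 1)."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the lowest data bit
    p1 = d[0] ^ d[1] ^ d[3]   # check bit covering positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # check bit covering positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # check bit covering positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming74_decode(code: List[int]) -> Tuple[int, int]:
    """Return (nibble, error_position); error_position is 0 if none was seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)   # 1-based position of a single flip
    if syndrome:
        c[syndrome - 1] ^= 1                # correct the flipped bit
    nibble = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return nibble, syndrome


word = 0b00111000                      # ASCII "8"
p = parity_bit(word)
print(parity_ok(word ^ 0b01, p))       # False: a single flipped bit is detected
print(parity_ok(word ^ 0b11, p))       # True: two flipped bits go unnoticed

codeword = hamming74_encode(0b1000)
codeword[4] ^= 1                       # flip the bit at position 5
print(hamming74_decode(codeword))      # (8, 5): data recovered, error located
```

Parity can only report that something is wrong, whereas the syndrome computed by the Hamming decoder identifies which bit flipped, so it can be corrected. This small code cannot reliably handle two flipped bits in one codeword; detecting double-bit errors requires an additional overall parity bit, which is not shown here.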
DRAM may provide increased protection against soft errors by relying on error-correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for highly fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation. Some systems also "scrub" the memory, by periodically reading all addresses and writing back corrected versions if necessary to remove accumulated soft errors.

== Schemes ==