Optimal erasure codes have the property that any
k out of the
n code word symbols are sufficient to recover the original message (i.e., they have optimal reception efficiency). Optimal erasure codes are
maximum distance separable codes (MDS codes).
Parity check Parity check is the special case where
n =
k + 1. From a set of
k values \{v_i\}_{1\leq i \leq k}, a checksum is computed and appended to the
k source values: :v_{k+1}= - \sum_{i=1}^k v_i. The set of
k + 1 values \{v_i\}_{1\leq i \leq k+1} is now consistent with regard to the checksum. If one of these values, v_e, is erased, it can be easily recovered by summing the remaining variables: :v_{e}= - \sum_{i=1 ,i \neq e }^{k+1}v_i.
RAID 5 is a widely used application of the parity check erasure code using XOR.
Polynomial oversampling ==== Example: Err-mail (
k = 2) ==== In the simple case where
k = 2, redundancy symbols may be created by sampling different points along the line between the two original symbols. This is pictured with a simple example, called err-mail:
Alice wants to send her telephone number (555629) to
Bob using err-mail. Err-mail works just like e-mail, except • About half of all the mail gets lost. • Messages longer than 5 characters are illegal. • It is very expensive (similar to air-mail). Instead of asking Bob to acknowledge the messages she sends, Alice devises the following scheme. • She breaks her telephone number up into two parts
a = 555,
b = 629, and sends 2 messages – "A=555" and "B=629" – to Bob. • She constructs a linear function, f(i) = a + (b-a)(i-1), in this case f(i) = 555 + 74(i-1), such that f(1) = 555 and f(2) = 629. • She computes the values
f(3),
f(4), and
f(5), and then transmits three redundant messages: "C=703", "D=777" and "E=851". Bob knows that the form of
f(
k) is f(i) = a + (b-a)(i-1), where
a and
b are the two parts of the telephone number. Now suppose Bob receives "D=777" and "E=851". Bob can reconstruct Alice's phone number by computing the values of
a and
b from the values (
f(4) and
f(5)) he has received. Bob can perform this procedure using any two err-mails, so the erasure code in this example has a rate of 40%. Note that Alice cannot encode her telephone number in just one err-mail, because it contains six characters, and that the maximum length of one err-mail message is five characters. If she sent her phone number in pieces, asking Bob to acknowledge receipt of each piece, at least four messages would have to be sent anyway (two from Alice, and two acknowledgments from Bob). So the erasure code in this example, which requires five messages, is quite economical. This example is a little bit contrived. For truly generic erasure codes that work over any data set, we would need something other than the
f(
i) given.
General case The linear construction above can be generalized to
polynomial interpolation. Additionally, points are now computed over a
finite field. First we choose a finite field
F with order of at least
n, but usually a power of 2. The sender numbers the data symbols from 0 to
k − 1 and sends them. He then constructs a
(Lagrange) polynomial p(
x) of order
k such that
p(
i) is equal to data symbol
i. He then sends
p(
k), ...,
p(
n − 1). The receiver can now also use polynomial interpolation to recover the lost packets, provided he receives
k symbols successfully. If the order of
F is less than 2
b, where b is the number of bits in a symbol, then multiple polynomials can be used. The sender can construct symbols
k to
n − 1 'on the fly', i.e., distribute the workload evenly between transmission of the symbols. If the receiver wants to do his calculations 'on the fly', he can construct a new polynomial
q, such that
q(
i) =
p(
i) if symbol
i <
k was received successfully and
q(
i) = 0 when symbol
i <
k was not received. Now let
r(
i) =
p(
i) −
q(
i). Firstly we know that
r(
i) = 0 if symbol
i <
k has been received successfully. Secondly, if symbol
i ≥
k has been received successfully, then
r(
i) =
p(
i) −
q(
i) can be calculated. So we have enough data points to construct
r and evaluate it to find the lost packets. So both the sender and the receiver require
O(
n (
n −
k)) operations and only
O(
n −
k) space for operating 'on the fly'.
Real world implementation This process is implemented by Reed–Solomon codes, with code words constructed over a
finite field using a
Vandermonde matrix. Most practical erasure codes are
systematic codes—each one of the original
k symbols can be found copied, unencoded, as one of the
n message symbols. (Erasure codes that support
secret sharing never use a systematic code). ==Near-optimal erasure codes==