Lempel–Ziv–Welch

Lempel–Ziv–Welch (LZW) is a universal lossless compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improvement to the LZ78 algorithm published by Lempel and Ziv in 1978. Its claimed advantages are simplicity of implementation and the potential for high throughput in a hardware implementation.

Algorithm
The scenario described by Welch's 1984 paper

Decoding
The decoding process can be described as:
1. Initialize the dictionary to contain all strings of length one.
2. Read the next encoded symbol.
3. If the symbol is not encoded in the dictionary, go to step 7.
4. Emit the corresponding string W to output.
5. Concatenate the previous string emitted to output with the first symbol of W; add this to the dictionary.
6. Go to step 9.
7. Concatenate the previous string emitted to output with its first symbol; call this string V.
8. Add V to the dictionary and emit V to output.
9. Repeat step 2 until end of input.

The decoding algorithm works by reading a value from the encoded input and outputting the corresponding string from the dictionary. The decoder does not need the full dictionary, only the initial dictionary of single-character strings, which is usually hard-coded in the program rather than sent with the encoded data. The full dictionary is rebuilt during decoding as follows: after decoding a value and outputting a string, the decoder concatenates that string with the first character of the next decoded string and adds the result to the dictionary. If the next value cannot be decoded, it must be the value added to the dictionary in this very iteration, so its first character is the same as the first character of the current string, and the decoder uses that character instead. The decoder then proceeds to the next input (which was already read in the previous iteration) and processes it as before, and so on until it has exhausted the input stream.

Variable-width codes
If variable-width codes are being used, the encoder and decoder must be careful to change the width at the same points in the encoded data so they don't disagree on boundaries between individual codes in the stream.
In the standard version, the encoder increases the width from p to p + 1 when a sequence ω + s is encountered that is not in the table (so that a code must be added for it) but the next available code in the table is 2^p (the first code requiring p + 1 bits). The encoder emits the code for ω at width p (since that code does not require p + 1 bits), and then increases the code width so that the next code emitted is p + 1 bits wide.

The decoder is always one code behind the encoder in building the table, so when it sees the code for ω, it generates an entry for code 2^p − 1. Since this is the point where the encoder increases the code width, the decoder must increase the width here as well, at the point where it generates the largest code that fits in p bits.

Unfortunately, some early implementations of the encoding algorithm increase the code width and then emit ω at the new width instead of the old width, so that to the decoder it looks like the width changes one code too early. This is called "early change"; it caused so much confusion that Adobe now allows both versions in PDF files, but includes an explicit flag in the header of each LZW-compressed stream to indicate whether early change is being used. Of the graphics file formats that support LZW compression, TIFF uses early change, while GIF and most others don't.

When the table is cleared in response to a clear code, both encoder and decoder change the code width after the clear code back to the initial code width, starting with the code immediately following the clear code.

Packing order
Since the codes emitted typically do not fall on byte boundaries, the encoder and decoder must agree on how codes are packed into bytes. The two common methods are LSB-first ("least significant bit first") and MSB-first ("most significant bit first").
In LSB-first packing, the first code is aligned so that the least significant bit of the code falls in the least significant bit of the first stream byte, and if the code has more than 8 bits, the high-order bits left over are aligned with the least significant bits of the next byte; further codes are packed with LSB going into the least significant bit not yet used in the current stream byte, proceeding into further bytes as necessary.

MSB-first packing aligns the first code so that its most significant bit falls in the MSB of the first stream byte, with overflow aligned with the MSB of the next byte; further codes are written with MSB going into the most significant bit not yet used in the current stream byte.

GIF files use LSB-first packing order. TIFF files and PDF files use MSB-first packing order.

Further coding
Many applications extend the algorithm by applying further encoding to the sequence of output symbols. Some package the coded stream as printable characters using some form of binary-to-text encoding. This increases the encoded length and decreases the compression rate. Conversely, increased compression can often be achieved with an adaptive entropy encoder. Such a coder estimates the probability distribution for the value of the next symbol, based on the observed frequencies of values so far. A standard entropy encoding such as Huffman coding or arithmetic coding then uses shorter codes for values with higher probabilities.
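The two packing orders described under "Packing order" can be sketched in Python as follows. This is only an illustration: the function names and the (code, width) input format are assumptions made here, and padding the final byte with zero bits is one possible convention, not something the text prescribes.

```python
def pack_lsb_first(codes_with_widths):
    """Pack (code, width) pairs so each code starts at the lowest unused bit."""
    acc = 0        # bits waiting to be written, lowest bit is written first
    nbits = 0      # number of valid bits currently held in acc
    out = bytearray()
    for code, width in codes_with_widths:
        acc |= code << nbits
        nbits += width
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # assumption: unused high bits are left as zero
    return bytes(out)


def pack_msb_first(codes_with_widths):
    """Pack (code, width) pairs so each code starts at the highest unused bit."""
    acc = 0
    nbits = 0
    out = bytearray()
    for code, width in codes_with_widths:
        acc = (acc << width) | code
        nbits += width
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # assumption: zero padding
    return bytes(out)


# Two 5-bit codes, 20 (10100) and 15 (01111), packed both ways.
sample = [(20, 5), (15, 5)]
print(pack_lsb_first(sample).hex())  # f401
print(pack_msb_first(sample).hex())  # a3c0
```

The same ten bits end up in different byte layouts, which is why a decoder has to know which order the format uses before it can split the stream back into codes.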
Example
The following example illustrates the LZW algorithm in action, showing the status of the output and the dictionary at every stage, both in encoding and decoding the data. This example has been constructed to give reasonable compression on a very short message. In real text data, repetition is generally less pronounced, so longer input streams are typically necessary before the compression builds up efficiency.

The plaintext to be encoded (from an alphabet using only the capital letters) is:

TOBEORNOTTOBEORTOBEORNOT#

There are 26 symbols in the plaintext alphabet (the capital letters A through Z). # is used to represent a stop code: a code outside the plaintext alphabet that triggers special handling. We arbitrarily assign these the values 1 through 26 for the letters, and 0 for the stop code '#'. (Most flavors of LZW would put the stop code after the data alphabet, but nothing in the basic algorithm requires that. The encoder and decoder only have to agree what value it has.)

A computer renders these as strings of bits. Five-bit codes are needed to give sufficient combinations to encompass this set of 27 values. The dictionary is initialized with these 27 values. As the dictionary grows, the codes must grow in width to accommodate the additional entries. A 5-bit code gives 2^5 = 32 possible combinations of bits, so when the 33rd dictionary word is created, the algorithm must switch at that point from 5-bit strings to 6-bit strings (for all code values, including those previously output with only five bits). Note that since the all-zero code 00000 is used, and is labeled "0", the 33rd dictionary entry is labeled 32. (Previously generated output is not affected by the code-width change, but once a 6-bit value is generated in the dictionary, it could conceivably be the next code emitted, so the width for subsequent output shifts to 6 bits to accommodate that.)
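The width arithmetic above can be checked with a few lines of Python. The helper name code_width is an assumption made for this sketch; it simply returns the smallest number of bits able to label entries 0 through table_size − 1.

```python
def code_width(table_size):
    """Smallest bit width that can label entries 0 .. table_size - 1."""
    return max(1, (table_size - 1).bit_length())


print(code_width(27))  # 5: the initial 27 values fit in five bits
print(code_width(32))  # 5: entries 0..31 still fit in five bits
print(code_width(33))  # 6: the 33rd entry, labeled 32, needs six bits
```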
The initial dictionary, then, consists of the following entries: # = 0 (00000), A = 1 (00001), B = 2 (00010), and so on through Z = 26 (11010).

Encoding
Buffer input characters in a sequence ω until ω + next character is not in the dictionary. Emit the code for ω, and add ω + next character to the dictionary. Start buffering again with the next character. (The string to be encoded is "TOBEORNOTTOBEORTOBEORNOT#".)

Unencoded length = 25 symbols × 5 bits/symbol = 125 bits
Encoded length = (6 codes × 5 bits/code) + (11 codes × 6 bits/code) = 96 bits

Using LZW has saved 29 bits out of 125, reducing the message by more than 23%. If the message were longer, then the dictionary words would begin to represent longer and longer sections of text, sending repeated words very compactly.

Decoding
To decode an LZW-compressed archive, one needs to know in advance the initial dictionary used, but additional entries can be reconstructed as they are always simply concatenations of previous entries.

At each stage, the decoder receives a code X; it looks X up in the table and outputs the sequence χ it codes, and it conjectures χ + ? as the entry the encoder just added, because the encoder emitted X for χ precisely because χ + ? was not in the table, and the encoder goes ahead and adds it. But what is the missing letter? It is the first letter in the sequence coded by the next code Z that the decoder receives. So the decoder looks up Z, decodes it into the sequence ω, takes its first letter z, and appends z to the end of χ to form the next dictionary entry.

This works as long as the codes received are in the decoder's dictionary, so that they can be decoded into sequences. What happens if the decoder receives a code Z that is not yet in its dictionary? Since the decoder is always just one code behind the encoder, Z can be in the encoder's dictionary only if the encoder just generated it, when emitting the previous code X for χ.
Thus Z codes some ω that is χ + ?, and the decoder can determine the unknown character as follows:
• The decoder sees X and then Z, where X codes the sequence χ and Z codes some unknown sequence ω.
• The decoder knows that the encoder just added Z as a code for χ + some unknown character c, so ω = χ + c.
• Since c is the first character in the input stream after χ, and since ω is the string appearing immediately after χ, c must be the first character of the sequence ω.
• Since χ is an initial substring of ω, c must also be the first character of χ.
• So even though the Z code is not in the table, the decoder is able to infer the unknown sequence and adds χ + (the first character of χ) to the table as the value of Z.

This situation occurs whenever the encoder encounters input of the form cScSc, where c is a single character, S is a string and cS is already in the dictionary, but cSc is not. The encoder emits the code for cS, putting a new code for cSc into the dictionary. Next it sees cSc in the input (starting at the second c of cScSc) and emits the new code it just inserted. The argument above shows that whenever the decoder receives a code not in its dictionary, the situation must look like this.

Although input of form cScSc might seem unlikely, this pattern is fairly common when the input stream is characterized by significant repetition. In particular, long strings of a single character (which are common in the kinds of images LZW is often used to encode) repeatedly generate patterns of this sort.
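The encoding and decoding walkthrough of this example can be sketched in Python as below. This is a minimal illustration rather than a reference implementation: the names ALPHABET, encode, decode and encoded_bits are assumptions made here, codes are kept as plain integers instead of a packed bit stream, and encoded_bits applies the standard (non-early-change) width rule to count output bits.

```python
import string

# Assumed symbol numbering from the example: '#' = 0, 'A' = 1, ..., 'Z' = 26.
ALPHABET = "#" + string.ascii_uppercase


def encode(text):
    """Return the list of LZW codes for text."""
    table = {ch: i for i, ch in enumerate(ALPHABET)}
    codes = []
    w = ""
    for ch in text:
        if w + ch in table:
            w += ch                        # keep buffering while the string is known
        else:
            codes.append(table[w])         # emit the code for the known prefix
            table[w + ch] = len(table)     # add prefix + next character
            w = ch
    if w:
        codes.append(table[w])             # flush whatever is still buffered
    return codes


def decode(codes):
    """Rebuild the text, reconstructing the dictionary along the way."""
    table = {i: ch for i, ch in enumerate(ALPHABET)}
    if not codes:
        return ""
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = prev + prev[0]         # the cScSc case: code not yet in the table
        else:
            raise ValueError(f"invalid code {code}")
        table[len(table)] = prev + entry[0]
        out.append(entry)
        prev = entry
    return "".join(out)


def encoded_bits(codes, initial_size=len(ALPHABET)):
    """Count output bits when the width grows once the next free code is 2**width."""
    width = max(1, (initial_size - 1).bit_length())
    next_code = initial_size
    total = 0
    for _ in codes:
        total += width                     # the current code is emitted at the old width
        if next_code == 1 << width:
            width += 1                     # later codes are one bit wider
        next_code += 1
    return total


message = "TOBEORNOTTOBEORTOBEORNOT#"
codes = encode(message)
print(len(codes))                  # 17
print(encoded_bits(codes))         # 96
print(decode(codes) == message)    # True
print(encode("ABABABA"))           # [1, 2, 27, 29]
print(decode([1, 2, 27, 29]))      # ABABABA
```

The input "ABABABA" has the cScSc shape with c = "A" and S = "B": the decoder receives code 29 before it has created that entry and recovers it as "AB" plus the first character of "AB".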
Patents
Various patents have been issued in the United States and other countries for LZW and similar algorithms. LZ78 was covered by a patent by Lempel, Ziv, Cohn, and Eastman, assigned to Sperry Corporation, later Unisys Corporation, filed on August 10, 1981. Two US patents were issued for the LZW algorithm: one by Victor S. Miller and Mark N. Wegman, assigned to IBM and originally filed on June 1, 1983, and one by Welch, assigned to Sperry Corporation, later Unisys Corporation, filed on June 20, 1983.

In addition to the above patents, Welch's 1983 patent also includes citations to several other patents that influenced it, including two 1980 Japanese patents (JP9343880A and JP17790880A) from NEC's Jun Kanatsu, a 1974 patent from John S. Hoerning, a 1977 patent from Klaus E. Holtz, and a 1981 German patent (DE19813118676) from Karl Eckhart Heinz.

In 1993–94, and again in 1999, Unisys Corporation received widespread condemnation when it attempted to enforce licensing fees for LZW in GIF images. The 1993–1994 Unisys-CompuServe controversy (CompuServe being the creator of the GIF format) prompted a Usenet comp.graphics discussion, "Thoughts on a GIF-replacement file format", which in turn fostered an email exchange that eventually culminated in the creation of the patent-unencumbered Portable Network Graphics (PNG) file format in 1995.

Unisys's US patent on the LZW algorithm expired on June 20, 2003, 20 years after it had been filed. Patents that had been filed in the United Kingdom, France, Germany, Italy, Japan and Canada all expired in 2004, likewise 20 years after they had been filed.
Variants
LZMW
LZMW (1985), by Victor Miller and Mark Wegman, searches input for the longest string already in the dictionary (the "current" match) and then adds the concatenation of the previous match with the current match to the dictionary. Dictionary entries thus grow more rapidly, but this scheme is much more complicated to implement. Miller and Wegman suggest deleting low-frequency entries from the dictionary when the dictionary fills up.

LZAP
LZAP (1988), by James Storer, is a modification of LZMW. Instead of adding just the concatenation of the previous match with the current match to the dictionary, it adds the concatenations of the previous match with each initial substring of the current match ("AP" stands for "all prefixes"). For example, if the previous match is "wiki" and the current match is "pedia", then the LZAP encoder adds 5 new sequences to the dictionary: "wikip", "wikipe", "wikiped", "wikipedi", and "wikipedia", where the LZMW encoder adds only the one sequence "wikipedia". This eliminates some of the complexity of LZMW, at the price of adding more dictionary entries.

LZWL
LZWL is a syllable-based variant of LZW.
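The difference between the LZMW and LZAP dictionary updates can be illustrated with a short Python sketch. It covers only the step that decides which strings are added after a match, not a full encoder, and the function names are assumptions made for this illustration.

```python
def lzmw_additions(previous_match, current_match):
    """LZMW adds only the concatenation of the two matches."""
    return [previous_match + current_match]


def lzap_additions(previous_match, current_match):
    """LZAP adds the previous match joined with every non-empty prefix of the current match."""
    return [previous_match + current_match[:i]
            for i in range(1, len(current_match) + 1)]


print(lzmw_additions("wiki", "pedia"))
# ['wikipedia']
print(lzap_additions("wiki", "pedia"))
# ['wikip', 'wikipe', 'wikiped', 'wikipedi', 'wikipedia']
```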