
LZMA

LZMA is a lossless data compression algorithm that has been in development since 1998 by Igor Pavlov, the developer of 7-Zip. It has been used in the 7z format of the 7-Zip archiver since 2001. The algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977. It features a high compression ratio and a variable compression-dictionary size, while still maintaining a decompression speed similar to that of other commonly used compression algorithms.

Overview

LZMA uses a dictionary compression algorithm (a variant of LZ77 with huge dictionary sizes and special support for repeatedly used match distances), whose output is then encoded with a range encoder, using a complex model to make a probability prediction of each bit. The dictionary compressor finds matches using sophisticated dictionary data structures, and produces a stream of literal symbols and phrase references, which is encoded one bit at a time by the range encoder: many encodings are possible, and a dynamic programming algorithm is used to select an optimal one under certain approximations.

Prior to LZMA, most encoder models were purely byte-based (i.e. they coded each bit using only a cascade of contexts to represent the dependencies on previous bits from the same byte). The main innovation of LZMA is that instead of a generic byte-based model, LZMA's model uses contexts specific to the bitfields in each representation of a literal or phrase: this is nearly as simple as a generic byte-based model, but gives much better compression because it avoids mixing unrelated bits together in the same context.

Furthermore, compared to classic dictionary compression (such as the one used in zip and gzip formats), the dictionary sizes can be and usually are much larger, taking advantage of the large amount of memory available on modern systems.
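The per-bit probability prediction described above can be illustrated with a small sketch of an adaptive bit model. The constants here (11-bit probabilities and an adaptation shift of 5) are the ones used by the LZMA SDK and XZ Embedded decoders; they are not stated in the text above, and the function names are hypothetical. This shows only how one context adapts, not the range coder itself.

```python
# Sketch of one adaptive bit context, assuming the LZMA SDK constants.
PROB_BITS = 11                      # probabilities are 11-bit fixed point
PROB_INIT = 1 << (PROB_BITS - 1)    # 1024, i.e. P(bit == 0) starts at 0.5
MOVE_BITS = 5                       # adaptation rate (assumed, from the SDK)


def update(prob: int, bit: int) -> int:
    """Return the new estimate of P(bit == 0) after seeing `bit`."""
    if bit == 0:
        # Move the estimate towards 1.0 (2048 in fixed point).
        return prob + (((1 << PROB_BITS) - prob) >> MOVE_BITS)
    # Move the estimate towards 0.0.
    return prob - (prob >> MOVE_BITS)


def train(bits) -> int:
    """Feed a sequence of bits through a fresh context."""
    prob = PROB_INIT
    for b in bits:
        prob = update(prob, b)
    return prob


if __name__ == "__main__":
    print(train([0] * 100))   # drifts towards 2048: zeros become cheap to code
    print(train([1] * 100))   # drifts towards 0: ones become cheap to code
```

Keeping one such context per bitfield position, rather than sharing contexts across unrelated fields, is what lets each estimate track a single, stable statistic.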

Compressed format overview

In LZMA compression, the compressed stream is a stream of bits, encoded using an adaptive binary range coder. The stream is divided into packets, each packet describing either a single byte, or an LZ77 sequence with its length and distance implicitly or explicitly encoded. Each part of each packet is modelled with independent contexts, so the probability predictions for each bit are correlated with the values of that bit (and related bits from the same field) in previous packets of the same type. Both the lzip and the LZMA SDK documentation describe this stream format.

There are 7 types of packets: a literal packet carrying a single byte, MATCH, SHORTREP, and LONGREP[0] through LONGREP[3]. In this notation, LONGREP[*] refers to LONGREP[0–3] packets, *REP refers to both LONGREP and SHORTREP, and *MATCH refers to both MATCH and *REP.

LONGREP[n] packets remove the distance used from the list of the most recent distances and reinsert it at the front, to avoid useless repeated entries, while MATCH just adds the distance to the front even if it is already present in the list. SHORTREP and LONGREP[0] do not alter the list.

As in LZ77, the length is not limited by the distance, because copying from the dictionary is defined as if the copy was performed byte by byte, keeping the distance constant.

Distances are logically 32-bit and distance 0 points to the most recently added byte in the dictionary. The distance encoding starts with a 6-bit "distance slot", which determines how many further bits are needed. Distances are decoded as a binary concatenation of, from most to least significant, two bits depending on the distance slot, some bits encoded with fixed 0.5 probability, and some context encoded bits. Distance slots 0−3 directly encode distances 0−3.
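The packet semantics above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not decoder code: the function names are hypothetical, the list of recent distances is assumed to hold four entries (matching LONGREP[0–3]), and the split between fixed-probability and context-coded distance bits (slot 14 as the threshold, with 4 context-coded low bits) follows the LZMA SDK and XZ Embedded decoders rather than anything stated in the text.

```python
def copy_match(out: bytearray, dist: int, length: int) -> None:
    """Copy `length` bytes from `dist + 1` bytes back, one byte at a time.

    Distance 0 is the most recently added byte, so the copy may overlap
    the bytes it is producing (length can exceed the distance).
    """
    for _ in range(length):
        out.append(out[-(dist + 1)])


def update_reps(reps: list, packet: str, dist: int = 0, n: int = 0) -> list:
    """Return the list of recent distances after one packet."""
    if packet == "MATCH":
        # Pushed to the front even if already present; the oldest entry drops out.
        return [dist] + reps[:3]
    if packet == "LONGREP" and n > 0:
        # Move the reused distance to the front without duplicating it.
        return [reps[n]] + reps[:n] + reps[n + 1:]
    # SHORTREP and LONGREP[0] leave the list unchanged.
    return list(reps)


def slot_layout(slot: int):
    """Return (base distance, fixed-probability bits, context-coded bits)."""
    if slot < 4:
        return slot, 0, 0                # slots 0-3 are the distances 0-3
    nbits = (slot >> 1) - 1              # bits following the two leading bits
    base = (2 | (slot & 1)) << nbits     # the two bits chosen by the slot
    if slot < 14:                        # assumed threshold, from the SDK
        return base, 0, nbits
    return base, nbits - 4, 4            # assumed: 4 context-coded low bits


if __name__ == "__main__":
    buf = bytearray(b"ab")
    copy_match(buf, 1, 5)                # length 5 exceeds the 2-byte distance
    print(bytes(buf))
    print(update_reps([1, 2, 3, 4], "LONGREP", n=2))
    print(slot_layout(14))
```

With these assumptions, slot 63 has base `3 << 30` and 30 further bits, so the largest distance is `2**32 - 1`, consistent with distances being logically 32-bit.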

Decompression algorithm details

No complete natural language specification of the compressed format seems to exist, other than the one attempted in the following text. The description below is based on the compact XZ Embedded decoder by Lasse Collin included in the Linux kernel source.

7-Zip reference implementation

The LZMA implementation extracted from 7-Zip is available as the LZMA SDK. It was originally dual-licensed under both the GNU LGPL and the Common Public License, with an additional special exception for linked binaries, but was placed by Igor Pavlov in the public domain on December 2, 2008, with the release of version 4.62. The reference open source LZMA compression library was originally written in C++ but has been ported to ANSI C, C#, and Java; ports to Go and Ada also exist.

The 7-Zip implementation uses several variants of hash chains, binary trees and Patricia trees as the basis for its dictionary search algorithm.

In addition to LZMA, the SDK and 7-Zip also implement multiple preprocessing filters intended to improve compression, ranging from simple delta encoding (for images) to BCJ for executable code. They also provide some other compression algorithms used in 7z.

Decompression-only code for LZMA generally compiles to around 5 KB, and the amount of RAM required during decompression is principally determined by the size of the sliding window used during compression. Small code size and relatively low memory overhead, particularly with smaller dictionary lengths, and free source code make the LZMA decompression algorithm well-suited to embedded applications.
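The delta encoding filter mentioned above can be illustrated with a minimal byte-wise sketch. This is not the SDK's code; the function names and the `dist` parameter are hypothetical, and it only shows the general idea: each byte is replaced by its difference from the byte `dist` positions earlier, so smoothly varying data turns into long runs of small, repeated values that the dictionary compressor handles well.

```python
def delta_encode(data: bytes, dist: int = 1) -> bytes:
    """Replace each byte with its difference (mod 256) from the byte `dist` back."""
    out = bytearray(data)
    for i in range(dist, len(data)):
        out[i] = (data[i] - data[i - dist]) & 0xFF
    return bytes(out)


def delta_decode(data: bytes, dist: int = 1) -> bytes:
    """Invert delta_encode by accumulating the differences."""
    out = bytearray(data)
    for i in range(dist, len(out)):
        # out[i - dist] has already been restored at this point.
        out[i] = (out[i] + out[i - dist]) & 0xFF
    return bytes(out)


if __name__ == "__main__":
    ramp = bytes(range(10, 20))          # a smooth gradient, as in image rows
    encoded = delta_encode(ramp)
    print(list(encoded))                 # first byte kept, then constant steps
    print(delta_decode(encoded) == ramp)
```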

Other implementations

In addition to the 7-Zip reference implementation, the following support the LZMA format.

• xz: a streaming implementation that contains a gzip-like command line tool supporting LZMA2 in its xz file format. It has been adopted by many software packages in the Unix-like world thanks to its high performance (compared to bzip2) and small size (compared to gzip). Software distributors such as Fedora now use xz for compressing their releases.
• lzip: another LZMA implementation mostly for Unix-like systems that is an alternative to xz. It features a simpler file format with easier error recovery.
• ZIPX: an extension to the ZIP compression format that was created by WinZip starting with version 12.1. It can also use various other compression methods such as BZip and PPMd.
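As a hedged usage illustration, the xz container and the older standalone .lzma container can both be produced and read with Python's standard `lzma` module. That module wraps liblzma from the xz project; it is not one of the implementations listed above and is used here only because it is widely available. The payload and preset are arbitrary choices for the example.

```python
import lzma

# Arbitrary, highly repetitive payload chosen for illustration.
data = b"LZMA example payload. " * 200

# .xz container (LZMA2 inside) and the legacy standalone .lzma container.
xz_blob = lzma.compress(data, format=lzma.FORMAT_XZ, preset=6)
alone_blob = lzma.compress(data, format=lzma.FORMAT_ALONE, preset=6)

# The default format for decompress() auto-detects either container.
restored_xz = lzma.decompress(xz_blob)
restored_alone = lzma.decompress(alone_blob)

if __name__ == "__main__":
    print(len(data), len(xz_blob), len(alone_blob))
    print(restored_xz == data and restored_alone == data)
```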

Source: Wikipedia ↗
