Kolmogorov complexity

==Intuition==

Consider the following two strings of 32 lowercase letters and digits:

 abababababababababababababababab
 4c1j5b2p0cv4w1x8rx2y39umgw5q85s7

The first string has a short English-language description, namely "write ab 16 times", which consists of 17 characters. The second has no obvious simple description (using the same character set) other than writing down the string itself, i.e., "write 4c1j5b2p0cv4w1x8rx2y39umgw5q85s7", which has 38 characters. Hence the operation of writing the first string can be said to have "less complexity" than writing the second.

More formally, the complexity of a string is the length of the shortest possible description of the string in some fixed universal description language (the sensitivity of complexity to the choice of description language is discussed below). It can be shown that the Kolmogorov complexity of any string cannot be more than a few bytes larger than the length of the string itself. Strings like the abab example above, whose Kolmogorov complexity is small relative to the string's size, are not considered to be complex.

The Kolmogorov complexity can be defined for any mathematical object, but for simplicity the scope of this article is restricted to strings. We must first specify a description language for strings. Such a description language can be based on any computer programming language, such as Lisp, Pascal, or Java. If P is a program which outputs a string x, then P is a description of x. The length of the description is just the length of P as a character string, multiplied by the number of bits in a character (e.g., 7 for ASCII).

We could, alternatively, choose an encoding for Turing machines, where an encoding is a function which associates to each Turing machine M a bitstring \langle M \rangle. If M is a Turing machine which, on input w, outputs string x, then the concatenated string \langle M \rangle w is a description of x. For theoretical analysis, this approach is better suited to constructing detailed formal proofs and is generally preferred in the research literature.
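As a rough illustration of the description-length convention above, the following sketch treats Python expressions as descriptions and charges 7 bits per ASCII character. The choice of Python as the description language and the helper name `description_bits` are assumptions made only for this example. The numbers it produces are upper bounds relative to this toy language, not the Kolmogorov complexity itself.

```python
# Toy illustration: Python expressions used as "descriptions" of strings.
# Assumption: the description language is Python and each ASCII
# character costs 7 bits, following the convention described above.

BITS_PER_CHAR = 7


def description_bits(program: str) -> int:
    """Length of a description in bits (7 bits per ASCII character)."""
    assert program.isascii()
    return BITS_PER_CHAR * len(program)


s1 = "ab" * 16
s2 = "4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"

# Candidate descriptions: Python expressions that evaluate to the strings.
p1 = '"ab"*16'
p2 = '"4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"'

# Check that each program really is a description of its string.
assert eval(p1) == s1
assert eval(p2) == s2

print(description_bits(p1))  # 7 characters at 7 bits each
print(description_bits(p2))  # 34 characters at 7 bits each
```

The literal description of the second string is only two characters (the quotation marks) longer than the string itself, which matches the remark that a string's complexity cannot exceed its length by more than a small constant.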
In this article, an informal approach is discussed.

Any string s has at least one description. For example, the second string above is output by the pseudo-code:

 function GenerateString2()
     return "4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"

whereas the first string is output by the (much shorter) pseudo-code:

 function GenerateString1()
     return "ab" × 16

If a description d(s) of a string s is of minimal length (i.e., using the fewest bits), it is called a minimal description of s, and the length of d(s) (i.e., the number of bits in the minimal description) is the Kolmogorov complexity of s, written K(s). Symbolically,

 K(s) = |d(s)|.

The length of the shortest description will depend on the choice of description language, but the effect of changing languages is bounded (a result called the invariance theorem, see below).

==Plain Kolmogorov complexity C==

There are two definitions of Kolmogorov complexity: plain and prefix-free. The plain complexity is the length of the shortest program that outputs x, denoted C(x), while the prefix-free complexity is the length of the shortest such program encoded in a prefix-free code, denoted K(x). The plain complexity is more intuitive, but the prefix-free complexity is easier to study.

By default, all equations hold only up to an additive constant. For example, f(x) = g(x) really means that f(x) = g(x) + O(1), that is, \exists c, \forall x, |f(x) - g(x)| \leq c.

Let U: 2^* \to 2^* be a computable function mapping finite binary strings to binary strings. It is a universal function if, and only if, for any computable f: 2^* \to 2^*, we can encode the function in a "program" s_f, such that \forall x \in 2^*, U(s_f x) = f(x). We can think of U as a program interpreter, which takes in an initial segment describing the program, followed by data that the program should process.
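The interpreter picture can be sketched with a toy stand-in for U. The sketch below is not universal: it knows only three functions, each selected by a fixed 2-bit program code, whereas a genuine universal function must be able to encode every computable f. The name `toy_U`, the codes, and the three functions are all hypothetical choices for illustration. It shows only the shape of the identity U(s_f x) = f(x), where the input is a program followed by its data.

```python
# Toy stand-in for U (NOT universal): only three hard-coded functions.
# The codes "00", "01", "10" and the functions themselves are
# hypothetical choices made for this illustration.


def identity(x: str) -> str:
    return x


def reverse(x: str) -> str:
    return x[::-1]


def double(x: str) -> str:
    return x + x


PROGRAMS = {"00": identity, "01": reverse, "10": double}


def toy_U(bits: str) -> str:
    """Split the input into a 2-bit program code s_f and data x, then run f(x)."""
    s_f, x = bits[:2], bits[2:]
    if s_f not in PROGRAMS:
        raise ValueError("undefined: no program with this code")
    return PROGRAMS[s_f](x)


x = "1101"
for s_f, f in PROGRAMS.items():
    # The defining identity, restricted to the three known functions.
    assert toy_U(s_f + x) == f(x)

print(toy_U("01" + x))  # prints 1011
```

Because every program code here has the same length, the interpreter can tell where the program ends and the data begins without any separator.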
One problem with plain complexity is that C(xy) \not< C(x) + C(y), because, intuitively speaking, there is no general way to tell where to divide an output string just by looking at the concatenated string. We can divide it by specifying the length of x or y, but that would take O(\min(\ln x, \ln y)) extra symbols. Indeed, for any c > 0 there exist x, y such that C(xy) \geq C(x) + C(y) + c.

Typically, inequalities with plain complexity have a term like O(\min(\ln x, \ln y)) on one side, whereas the same inequalities with prefix-free complexity have only O(1).

The main problem with plain complexity is that something extra is sneaked into a program. A program not only represents something with its code, but also represents its own length. In particular, a program x may represent a binary number up to \log_2 |x|, simply by its own length. Stated another way, it is as if we are using a termination symbol to denote where a word ends, and so we are using not 2 symbols, but 3. To fix this defect, we introduce the prefix-free Kolmogorov complexity.

==Prefix-free Kolmogorov complexity K==

A prefix-free universal Turing machine is a universal partial computable function U: 2^* \rightarrow 2^* whose domain is a prefix-free set of binary strings. Equivalently, no valid program for U is a prefix of any other; that is, the domain satisfies the prefix property. For instance, if every valid program for a universal Turing machine U ended with a termination string that could not appear elsewhere in the program, U would be prefix-free.

The prefix-free Kolmogorov complexity of a string x is defined by K(x) := \min\{|c| : U(c) = x\}, the length of the shortest self-delimiting program that causes U to output x. Different choices of prefix-free universal machines change K(x) by at most an additive constant.

==Invariance theorem==