==Philosophical==
The theory is grounded in philosophical foundations and was introduced by Ray Solomonoff around 1960. It is a mathematically formalized combination of Occam's razor and the principle of multiple explanations: all computable theories that perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Marcus Hutter's universal artificial intelligence builds upon this to calculate the expected value of an action.
==Principle==
Solomonoff's induction has been argued to be the computational formalization of pure Bayesianism. To understand this claim, recall that Bayesianism derives the posterior probability \mathbb P[T|D] of a theory T given data D by applying Bayes' rule, which yields
:\mathbb P[T|D] = \frac{\mathbb P[D|T] \mathbb P[T]}{\mathbb P[D|T] \mathbb P[T] + \sum_{A \neq T} \mathbb P[D|A] \mathbb P[A]},
where the theories A are alternatives to theory T. For this equation to make sense, the quantities \mathbb P[D|T] and \mathbb P[D|A] must be well-defined for all theories T and A. In other words, any theory must define a probability distribution over observable data D. Solomonoff's induction essentially boils down to demanding that all such probability distributions be computable.

Interestingly, the set of computable probability distributions can be identified with a subset of the set of all programs, which is countable. Similarly, the sets of observable data considered by Solomonoff were finite. Without loss of generality, we can thus consider that any observable data is a finite bit string. As a result, Solomonoff's induction can be defined by invoking only discrete probability distributions.

Solomonoff's induction then allows one to make probabilistic predictions of future data F by simply obeying the laws of probability. Namely, we have
:\mathbb P[F|D] = \mathbb E_T [\mathbb P[F|T,D]] = \sum_T \mathbb P[F|T,D] \mathbb P[T|D].
This quantity can be interpreted as the average of the predictions \mathbb P[F|T,D] of all theories T given past data D, weighted by their posterior credences \mathbb P[T|D].
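To make these two formulas concrete, here is a minimal sketch in Python (an illustration added here, not part of the theory's formal statement): it replaces the countable set of all computable distributions with a tiny hand-picked family of three Bernoulli "theories" over bit strings, computes their posteriors by Bayes' rule, and forms the posterior-weighted prediction of the next bit.

<syntaxhighlight lang="python">
from fractions import Fraction

# Toy stand-in for Solomonoff induction: a small finite family of
# "theories", each a computable distribution over bit strings.
# Theory T_q says the bits are i.i.d. with P[bit = 1] = q.
# (Illustrative only: true Solomonoff induction sums over ALL
# computable distributions, weighted by program length.)
theories = {
    "T_1/4": Fraction(1, 4),
    "T_1/2": Fraction(1, 2),
    "T_3/4": Fraction(3, 4),
}
prior = {name: Fraction(1, 3) for name in theories}  # uniform toy prior

def likelihood(q, data):
    """P[D|T]: probability that theory T_q assigns to the observed bits."""
    p = Fraction(1)
    for bit in data:
        p *= q if bit == 1 else 1 - q
    return p

data = [1, 1, 0, 1]  # observed bit string D

# Bayes' rule: P[T|D] = P[D|T] P[T] / sum_A P[D|A] P[A]
joint = {name: likelihood(q, data) * prior[name] for name, q in theories.items()}
evidence = sum(joint.values())
posterior = {name: j / evidence for name, j in joint.items()}

# Predictive mixture: P[next bit = 1 | D] = sum_T P[1|T] P[T|D]
p_next_one = sum(theories[name] * posterior[name] for name in theories)

for name, post in posterior.items():
    print(f"{name}: posterior = {post}")
print(f"P[next bit = 1 | D] = {p_next_one} ≈ {float(p_next_one):.4f}")
</syntaxhighlight>

Running this shows T_3/4 dominating the posterior after three ones in four bits (27/46), and a predictive probability of 29/46 ≈ 0.63 for the next bit being 1.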
==Mathematical==
The proof of the "razor" is based on the known mathematical properties of a probability distribution over a countable set. These properties are relevant because the infinite set of all programs is a denumerable set. The sum S of the probabilities of all programs must be exactly equal to one (as per the definition of probability), so the probabilities must roughly decrease as we enumerate the infinite set of all programs; otherwise S would be strictly greater than one (in fact, the sum would diverge). To be more precise, for every \epsilon > 0, there is some length l such that the total probability of all programs longer than l is at most \epsilon. This does not, however, preclude very long programs from having very high probability.
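The following toy computation makes this tail argument tangible, under an illustrative choice of prior invented for this sketch (not Solomonoff's universal prior): give each program of length L the weight 2^{-(2L+1)}, so that the 2^L programs of each length carry total mass 2^{-(L+1)}, the grand total is exactly one, and the tail beyond length l is 2^{-(l+1)}.

<syntaxhighlight lang="python">
import math

# Toy prior over the countable set of all finite binary programs:
# each program of length L gets weight 2**-(2*L + 1).  There are 2**L
# programs of length L, so the total mass at length L is 2**-(L + 1),
# and the grand total is sum over L >= 0 of 2**-(L + 1) = 1 exactly.
# (An illustrative choice, not Solomonoff's actual universal prior.)

def mass_at_length(L):
    return 2**L * 2.0**-(2 * L + 1)   # = 2**-(L + 1)

def tail_mass(l):
    """Total prior probability of all programs strictly longer than l."""
    return 2.0**-(l + 1)              # geometric tail, summed in closed form

total = sum(mass_at_length(L) for L in range(60))
print(f"mass of programs up to length 59: {total}")  # ~1.0

# For every epsilon > 0 there is a length l whose tail is below epsilon:
epsilon = 1e-6
l = math.ceil(-math.log2(epsilon) - 1)
print(f"epsilon = {epsilon}: tail beyond length {l} is {tail_mass(l):.2e}")
</syntaxhighlight>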
Fundamental ingredients of the theory are the concepts of algorithmic probability and Kolmogorov complexity. The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.
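Here is a minimal sketch of this definition under a loudly artificial assumption: the "machine" below is not universal at all; it only runs self-delimiting programs 1^{|w|}0w that output the pattern w repeated forever, each carrying the prefix-free weight 2^{-(2|w|+1)}. Truncating the enumeration at a maximum pattern length gives a computable approximation of the prior M(p), and the ratio M(pb)/M(p) predicts the next bit b.

<syntaxhighlight lang="python">
from itertools import product

# Toy approximation of the universal prior M(p): the "machine" runs a
# self-delimiting program <w> = 1^|w| 0 w and outputs the pattern w
# repeated forever.  Each program gets weight 2**-len(<w>) = 2**-(2|w|+1),
# a prefix-free code, so the total weight is at most 1 (Kraft inequality).
# This toy machine is NOT universal; it only illustrates the definition.

MAX_PATTERN = 12  # enumerate patterns w up to this length

def M(p):
    """Summed weight of all toy programs whose output starts with p."""
    total = 0.0
    for L in range(1, MAX_PATTERN + 1):
        for w in product("01", repeat=L):
            pattern = "".join(w)
            # output of program <w> is pattern repeated forever
            out = (pattern * (len(p) // L + 1))[: len(p)]
            if out == p:
                total += 2.0 ** -(2 * L + 1)
    return total

p = "010101"
m_p = M(p)
# predictive probability of the next bit, conditional on the prefix p
print(f"M({p}) = {m_p:.3e}")
print(f"P[next = 0 | {p}] ≈ {M(p + '0') / m_p:.4f}")
print(f"P[next = 1 | {p}] ≈ {M(p + '1') / m_p:.4f}")
</syntaxhighlight>

On the prefix 010101, the two-bit pattern 01 carries almost all of the weight, so the predicted next bit is 0 with probability close to one: the razor at work, with the shortest consistent program dominating the prediction.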
==Mathematical guarantees==