Hilberg's hypothesis was proposed by the German telecommunication engineer
Wolfgang Hilberg in 1990, based on data originally published by
Claude Shannon in 1951 on the predictability of English text. Hilberg observed that the amount of new information per character appears to decrease with context length in a manner consistent with a
power law. His analysis implied that the
Shannon entropy H(n) of text blocks of length n grows approximately as
:H(n) - hn \propto n^{\beta},
where parameter h is the entropy rate of the process and parameter \beta\in(0,1) is called the Hilberg exponent. The term proportional to n^{\beta} represents a memory effect, suggesting that human language carries a large amount of information in a repetitive way. Hilberg originally assumed that h = 0 and \beta = 1/2, basing his hypothesis on visually meagre evidence. In alternative formulations, the Shannon entropy is replaced by another measure of the information contained in a text block, such as the cross entropy of a statistical language model or the length of a universal code.
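This growth can be probed numerically by replacing the block entropy with the length of a compressed text, as in the universal-code formulation. The sketch below is a rough illustration only: it assumes a hypothetical plain-text file corpus.txt, uses the general-purpose zlib compressor as a crude stand-in for a universal code (much weaker than the PPM codes mentioned below), and fits \beta by least squares on a log-log scale under Hilberg's original assumption h = 0.

<syntaxhighlight lang="python">
import math
import zlib

def compressed_length_bits(text):
    """Length in bits of the zlib-compressed text, used here as a crude
    stand-in for the block entropy H(n) of a text block."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def fit_hilberg_exponent(text, block_lengths):
    """Fit beta in C(n) proportional to n^beta by ordinary least squares
    on log C(n) versus log n, assuming the entropy rate h is negligible."""
    xs = [math.log(n) for n in block_lengths]
    ys = [math.log(compressed_length_bits(text[:n])) for n in block_lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:  # hypothetical text file
        corpus = f.read()
    lengths = [2 ** k for k in range(6, 21) if 2 ** k <= len(corpus)]
    print("estimated Hilberg exponent:", fit_hilberg_exponent(corpus, lengths))
</syntaxhighlight>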
In the context of deep learning, the term neural scaling law is used for analogous power-law relations describing how the performance of a large language model, measured by cross entropy, improves with data size, model parameters, or computation.
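Such a scaling law is often written in the saturating power-law form L(N) = L_\infty + a N^{-\alpha}, where N is the amount of training data and L_\infty is an irreducible loss. The sketch below fits the exponent on synthetic values; the numbers and parameter names are hypothetical and serve only to show the functional form.

<syntaxhighlight lang="python">
import numpy as np

# Synthetic cross-entropy losses at increasing dataset sizes (in tokens),
# generated from L(N) = L_inf + a * N**(-alpha) with chosen parameters.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9])
L_inf, a_true, alpha_true = 1.7, 400.0, 0.30
loss = L_inf + a_true * N ** (-alpha_true)
loss += np.random.default_rng(0).normal(0.0, 0.01, size=N.size)

# Once the irreducible loss L_inf is subtracted, the remainder is a pure
# power law, so log(loss - L_inf) is linear in log(N) and the exponent
# alpha can be recovered by ordinary least squares.
slope, intercept = np.polyfit(np.log(N), np.log(loss - L_inf), 1)
print(f"fitted alpha = {-slope:.2f}, a = {np.exp(intercept):.1f}")
</syntaxhighlight>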
Yet another expression involves the mutual information between two adjacent text blocks of length n,
:I(n) = 2H(n) - H(2n).
Using this concept, Hilberg's law is equivalent to
:I(n) \propto n^{\beta}.
This version does not depend on the precise value of the entropy rate and is used in theoretical studies. The value of the Hilberg exponent \beta depends crucially on the applied information measure, or the compression algorithm in the case of the universal code.
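The mutual information between adjacent blocks can likewise be approximated by replacing each entropy term with a compressed code length. The following sketch makes the same assumptions as the one above: a hypothetical file corpus.txt and the zlib compressor standing in for a proper universal code, so the resulting numbers only illustrate the shape of the computation.

<syntaxhighlight lang="python">
import zlib

def code_length_bits(text):
    """Bits used by zlib to encode the text, a rough stand-in for the
    block entropy; a serious estimate would use a stronger universal
    code such as prediction by partial matching (PPM)."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def block_mutual_information(text, n, offset=0):
    """Approximate I(n) = 2 H(n) - H(2n) for two adjacent blocks of
    length n, with H replaced by the compressed code length."""
    left = text[offset:offset + n]
    right = text[offset + n:offset + 2 * n]
    return (code_length_bits(left) + code_length_bits(right)
            - code_length_bits(left + right))

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:  # hypothetical text file
        corpus = f.read()
    for n in (1_000, 10_000, 100_000):
        print(n, block_mutual_information(corpus, n))
</syntaxhighlight>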
Simultaneously, it exhibits a certain degree of universality across particular languages and writing systems, being \beta\approx 0.8 for the prediction by partial matching code run on English, French, Russian, Korean, Chinese, and Japanese news corpora.

== Examples of processes ==