Hilberg's hypothesis was proposed by the German telecommunication engineer
Wolfgang Hilberg in 1990, based on data originally published by
Claude Shannon in 1951 on the predictability of English text. Hilberg observed that the amount of new information per character appears to decrease with context length in a manner consistent with a
power law. His analysis implied that the
Shannon entropy H(n) of text blocks of length n grows approximately as
:H(n) - hn \propto n^{\beta},
where parameter h is the entropy rate of the process and parameter \beta\in(0,1) is called the Hilberg exponent. The term proportional to n^{\beta} represents a memory effect, suggesting that human language carries a large amount of information in a repetitive way. Hilberg originally assumed that h = 0 and \beta = 1/2, basing his hypothesis on visually meagre evidence. In alternative formulations, the Shannon entropy is replaced by another measure of the information contained in a text block, such as the cross entropy of a statistical language model or the length of a universal code.
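This growth can be probed numerically by replacing the block entropy with the length of a compressed text, as in the universal-code formulation. The sketch below is a rough illustration only: it assumes a hypothetical plain-text file corpus.txt, uses the general-purpose zlib compressor as a crude stand-in for a universal code (much weaker than the PPM codes mentioned below), and fits \beta by least squares on a log-log scale under Hilberg's original assumption h = 0.

<syntaxhighlight lang="python">
import math
import zlib

def compressed_length_bits(text):
    """Length in bits of the zlib-compressed text, used here as a crude
    stand-in for the block entropy H(n) of a text block."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def fit_hilberg_exponent(text, block_lengths):
    """Fit beta in C(n) proportional to n^beta by ordinary least squares
    on log C(n) versus log n, assuming the entropy rate h is negligible."""
    xs = [math.log(n) for n in block_lengths]
    ys = [math.log(compressed_length_bits(text[:n])) for n in block_lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:  # hypothetical text file
        corpus = f.read()
    lengths = [2 ** k for k in range(6, 21) if 2 ** k <= len(corpus)]
    print("estimated Hilberg exponent:", fit_hilberg_exponent(corpus, lengths))
</syntaxhighlight>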
In the context of deep learning, the term neural scaling law is used for analogous power-law relations describing how the performance of a large language model, measured by cross entropy, improves with data size, model parameters, or computation.
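Such a scaling law is often written in the saturating power-law form L(N) = L_\infty + a N^{-\alpha}, where N is the amount of training data and L_\infty is an irreducible loss. The sketch below fits the exponent on synthetic values; the numbers and parameter names are hypothetical and serve only to show the functional form.

<syntaxhighlight lang="python">
import numpy as np

# Synthetic cross-entropy losses at increasing dataset sizes (in tokens),
# generated from L(N) = L_inf + a * N**(-alpha) with chosen parameters.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9])
L_inf, a_true, alpha_true = 1.7, 400.0, 0.30
loss = L_inf + a_true * N ** (-alpha_true)
loss += np.random.default_rng(0).normal(0.0, 0.01, size=N.size)

# Once the irreducible loss L_inf is subtracted, the remainder is a pure
# power law, so log(loss - L_inf) is linear in log(N) and the exponent
# alpha can be recovered by ordinary least squares.
slope, intercept = np.polyfit(np.log(N), np.log(loss - L_inf), 1)
print(f"fitted alpha = {-slope:.2f}, a = {np.exp(intercept):.1f}")
</syntaxhighlight>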
Yet another expression involves the mutual information between two adjacent text blocks of length n,
:I(n) = 2H(n) - H(2n).
Using this concept, Hilberg's law is equivalent to
:I(n) \propto n^{\beta}.
This version does not depend on the precise value of the entropy rate and is used in theoretical studies. The value of the Hilberg exponent \beta depends crucially on the applied information measure, or the compression algorithm in the case of the universal code.
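The mutual information between adjacent blocks can likewise be approximated by replacing each entropy term with a compressed code length. The following sketch makes the same assumptions as the one above: a hypothetical file corpus.txt and the zlib compressor standing in for a proper universal code, so the resulting numbers only illustrate the shape of the computation.

<syntaxhighlight lang="python">
import zlib

def code_length_bits(text):
    """Bits used by zlib to encode the text, a rough stand-in for the
    block entropy; a serious estimate would use a stronger universal
    code such as prediction by partial matching (PPM)."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def block_mutual_information(text, n, offset=0):
    """Approximate I(n) = 2 H(n) - H(2n) for two adjacent blocks of
    length n, with H replaced by the compressed code length."""
    left = text[offset:offset + n]
    right = text[offset + n:offset + 2 * n]
    return (code_length_bits(left) + code_length_bits(right)
            - code_length_bits(left + right))

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:  # hypothetical text file
        corpus = f.read()
    for n in (1_000, 10_000, 100_000):
        print(n, block_mutual_information(corpus, n))
</syntaxhighlight>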
Simultaneously, it exhibits a certain degree of universality across particular languages and writing systems, being \beta\approx 0.8 for the prediction by partial matching code run on English, French, Russian, Korean, Chinese, and Japanese news corpora.

== Examples of processes ==