
Prior probability

A prior probability distribution of an uncertain quantity is its assumed probability distribution before evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable.

Informative priors
An informative prior expresses specific, definite information about a variable. An example is a prior distribution for the temperature at noon tomorrow. A reasonable approach is to make the prior a normal distribution with expected value equal to today's noontime temperature and variance equal to the day-to-day variance of atmospheric temperature, or a distribution of the temperature for that day of the year.

This example has a property in common with many priors: the posterior from one problem (today's temperature) becomes the prior for another problem (tomorrow's temperature). Pre-existing evidence which has already been taken into account is part of the prior, and as more evidence accumulates, the posterior is determined largely by the evidence rather than by any original assumption, provided that the original assumption admitted the possibility of what the evidence suggests. The terms "prior" and "posterior" are generally relative to a specific datum or observation.

Strong prior
A strong prior is a preceding assumption, theory, concept or idea upon which, after taking account of new information, a current assumption, theory, concept or idea is founded. A strong prior is a type of informative prior in which the information contained in the prior distribution dominates the information contained in the data being analyzed. The Bayesian analysis combines the information contained in the prior with that extracted from the data to produce the posterior distribution, which, in the case of a strong prior, would be little changed from the prior distribution.
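As an illustrative sketch of both points, the example below performs the standard normal-normal conjugate update (known observation variance), feeding each day's posterior back in as the next day's prior; all numbers are invented for the illustration.

```python
import numpy as np

def normal_update(prior_mean, prior_var, obs, obs_var):
    """Conjugate update of a normal prior with a normal likelihood
    (known observation variance). Returns posterior mean and variance."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

# Yesterday's posterior becomes today's prior (temperatures in Celsius).
mean, var = 20.0, 5.0**2          # assumed initial prior: N(20, 5^2)
for obs in [22.0, 21.5, 23.0]:    # hypothetical noon readings on successive days
    mean, var = normal_update(mean, var, obs, obs_var=2.0**2)
    print(f"posterior: N({mean:.2f}, {var:.3f})")

# A "strong" prior (tiny variance) is barely moved by one observation:
print(normal_update(20.0, 0.1**2, obs=30.0, obs_var=2.0**2))  # mean ~20.02
```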
Weakly informative priors
A weakly informative prior expresses partial information about a variable, steering the analysis toward solutions that align with existing knowledge while preventing extreme estimates, without overly constraining the results. An example is, when setting the prior distribution for the temperature at noon tomorrow in St. Louis, to use a normal distribution with mean 50 degrees Fahrenheit and standard deviation 40 degrees, which very loosely constrains the temperature to the range (10 degrees, 90 degrees) with a small chance of being below -30 degrees or above 130 degrees. The purpose of a weakly informative prior is regularization, that is, keeping inferences in a reasonable range.
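A quick numerical check of these tail claims, sketched with SciPy's normal distribution (the numbers are taken from the example above):

```python
from scipy.stats import norm

prior = norm(loc=50, scale=40)   # the weakly informative prior (degrees F)

print(prior.cdf(90) - prior.cdf(10))   # ~0.683: about two-thirds of the mass in (10, 90)
print(prior.cdf(-30))                  # ~0.023: small chance of being below -30 F
print(prior.sf(130))                   # ~0.023: small chance of being above 130 F
```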
Uninformative priors
An uninformative, flat, or diffuse prior expresses vague or general information about a variable. A contentious example is due to Jaynes, who published an argument based on the invariance of the prior under a change of parameters suggesting that the prior representing complete uncertainty about a probability should be the Haldane prior p^{-1}(1 - p)^{-1}. The example Jaynes gives is of finding a chemical in a lab and asking whether it will dissolve in water in repeated experiments. The Haldane prior gives by far the most weight to p = 0 and p = 1, indicating that the sample will either dissolve every time or never dissolve, with equal probability. However, if one has observed samples of the chemical dissolve in one experiment and not dissolve in another, then this prior is updated to the uniform distribution on the interval [0, 1]. This is obtained by applying Bayes' theorem to the data set consisting of one observation of dissolving and one of not dissolving, using the above prior. The Haldane prior is an improper prior distribution (meaning that it has infinite mass).

Harold Jeffreys devised a systematic way of designing uninformative priors, e.g., the Jeffreys prior p^{-1/2}(1 - p)^{-1/2} for the Bernoulli random variable. Priors proportional to the Haar measure can be constructed if the parameter space X carries a natural group structure which leaves invariant our Bayesian state of knowledge. Such methods are used in Solomonoff's theory of inductive inference. Methods for constructing objective priors have recently been introduced in bioinformatics, and especially in inference for cancer systems biology, where sample size is limited and a vast amount of prior knowledge is available. These methods rely on an information-theoretic criterion, such as the KL divergence or the log-likelihood function, for binary supervised learning problems and mixture model problems.

Philosophical problems associated with uninformative priors concern the choice of an appropriate metric, or measurement scale. Suppose we want a prior for the running speed of a runner who is unknown to us. We could specify, say, a normal distribution as the prior for his speed, but alternatively we could specify a normal prior for the time he takes to complete 100 metres, a quantity proportional to the reciprocal of his speed. These are very different priors, but it is not clear which is to be preferred. Jaynes' method of transformation groups can answer this question in some situations. Similarly, if asked to estimate an unknown proportion between 0 and 1, we might say that all proportions are equally likely and use a uniform prior. Alternatively, we might say that all orders of magnitude for the proportion are equally likely and use the logarithmic prior, the uniform prior on the logarithm of the proportion. The Jeffreys prior attempts to solve this problem by computing a prior which expresses the same belief no matter which metric is used. The Jeffreys prior for an unknown proportion p is p^{-1/2}(1 - p)^{-1/2}, which differs from Jaynes' recommendation. Priors based on notions of algorithmic probability are used in inductive inference as a basis for induction in very general settings.

Practical problems associated with uninformative priors include the requirement that the posterior distribution be proper. The usual uninformative priors on continuous, unbounded variables are improper. This need not be a problem if the posterior distribution is proper.
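A minimal sketch of the Haldane-prior update described above, treating the Haldane prior as the improper limit Beta(0, 0) of the conjugate Beta family (the helper function is illustrative, not a standard API):

```python
from scipy.stats import beta

# Beta-Bernoulli conjugate update: Beta(a, b) plus s successes and f failures
# yields Beta(a + s, b + f). The Haldane prior is the (improper) limit Beta(0, 0).
def update(a, b, successes, failures):
    return a + successes, b + failures

# One dissolving and one non-dissolving observation turn Beta(0, 0)
# into Beta(1, 1), i.e. the uniform distribution on [0, 1].
a, b = update(0, 0, successes=1, failures=1)
posterior = beta(a, b)
print(posterior.pdf(0.2), posterior.pdf(0.8))  # both 1.0: a flat density
```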
Another issue of importance is that if an uninformative prior is to be used routinely, i.e., with many different data sets, it should have good frequentist properties. Normally a Bayesian would not be concerned with such issues, but it can be important in this situation. For example, one would want any decision rule based on the posterior distribution to be admissible under the adopted loss function. Admissibility is often difficult to check, although some results are known (e.g., Berger and Strawderman 1996). The issue is particularly acute with hierarchical Bayes models; the usual priors (e.g., Jeffreys' prior) may give badly inadmissible decision rules if employed at the higher levels of the hierarchy.
Improper priors
Let events A_1, A_2, \ldots, A_n be mutually exclusive and exhaustive. If Bayes' theorem is written as
P(A_i\mid B) = \frac{P(B \mid A_i) P(A_i)}{\sum_j P(B\mid A_j)P(A_j)}\, ,
then it is clear that the same result would be obtained if all the prior probabilities P(A_i) were multiplied by a given constant; the same would be true for a continuous random variable. If the summation in the denominator converges, the posterior probabilities will still sum (or integrate) to 1 even if the prior values do not, and so the priors may only need to be specified in the correct proportion. Taking this idea further, in many cases the sum or integral of the prior values may not even need to be finite to get sensible answers for the posterior probabilities. When this is the case, the prior is called an improper prior. However, the posterior distribution need not be a proper distribution if the prior is improper. This is clear from the case where event B is independent of all of the A_j.

Statisticians sometimes use improper priors as uninformative priors. For example, if they need a prior distribution for the mean and variance of a random variable, they may assume p(m, v) \propto 1/v (for v > 0), which would suggest that any value for the mean is "equally likely" and that a value for the positive variance becomes "less likely" in inverse proportion to its value. Many authors (Lindley, 1973; De Groot, 1937; Kass and Wasserman, 1996) warn against the danger of over-interpreting those priors since they are not probability densities. The only relevance they have is found in the corresponding posterior, as long as it is well-defined for all observations. (The Haldane prior is a typical counterexample.)

By contrast, likelihood functions do not need to be integrated, and a likelihood function that is uniformly 1 corresponds to the absence of data (all models are equally likely, given no data): Bayes' rule multiplies a prior by the likelihood, and an empty product is just the constant likelihood 1. However, without starting with a prior probability distribution, one does not end up with a posterior probability distribution, and thus cannot integrate or compute expected values or loss.

Examples
Examples of improper priors include:
• The uniform distribution on an infinite interval (i.e., a half-line or the entire real line).
• Beta(0,0), the beta distribution with α = 0, β = 0 (the uniform distribution on the log-odds scale).
• The logarithmic prior on the positive reals (the uniform distribution on the log scale).
These functions, interpreted as uniform distributions, can also be interpreted as the likelihood function in the absence of data, but are not proper priors.
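A short sketch of the scale-invariance just described: multiplying all prior weights by a constant leaves the posterior unchanged (the numbers are hypothetical):

```python
import numpy as np

def posterior(prior, likelihood):
    """Posterior over mutually exclusive, exhaustive events A_i,
    given unnormalized prior weights and likelihoods P(B | A_i)."""
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()

prior = np.array([0.2, 0.3, 0.5])       # hypothetical P(A_i)
likelihood = np.array([0.9, 0.4, 0.1])  # hypothetical P(B | A_i)

p1 = posterior(prior, likelihood)
p2 = posterior(1000 * prior, likelihood)  # prior rescaled by a constant
print(np.allclose(p1, p2))  # True: only the proportions of the prior matter
```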
Prior probability in statistical mechanics
While in Bayesian statistics the prior probability is used to represent initial beliefs about an uncertain parameter, in statistical mechanics the a priori probability is used to describe the initial state of a system. The classical version is defined as the ratio of the number of elementary events (e.g., the number of times a die is thrown) to the total number of events, these being considered purely deductively, i.e., without any experimenting. In the case of the die, if we look at it on the table without throwing it, each elementary event is reasoned deductively to have the same probability; thus the probability of each outcome of an imaginary throw of the (perfect) die, obtained simply by counting the number of faces, is 1/6. Each face of the die appears with equal probability, probability being a measure defined for each elementary event. The result is different if we throw the die twenty times and ask how many times (out of 20) the number 6 appears on the upper face. In this case time comes into play, and we have a different type of probability depending on time or the number of times the die is thrown. On the other hand, the a priori probability is independent of time: you can look at the die on the table as long as you like without touching it, and you deduce the probability for the number 6 to appear on the upper face to be 1/6.

In statistical mechanics, e.g., that of a gas contained in a finite volume V, both the spatial coordinates q_i and the momentum coordinates p_i of the individual gas elements (atoms or molecules) are finite in the phase space spanned by these coordinates. In analogy to the case of the die, the a priori probability is here (in the case of a continuum) proportional to the phase space volume element \Delta q\,\Delta p divided by h, which is the number of standing waves (i.e., states) therein, where \Delta q is the range of the variable q and \Delta p is the range of the variable p (here for simplicity considered in one dimension). In 1 dimension (length L) this number, or statistical weight, or a priori weighting, is L\,\Delta p/h. In the customary 3 dimensions (volume V) the corresponding number can be calculated to be V\,4\pi p^2\Delta p/h^3.

To understand this quantity as giving a number of states in quantum (i.e., wave) mechanics, recall that in quantum mechanics every particle is associated with a matter wave which is the solution of a Schrödinger equation. In the case of free particles (of energy \epsilon = {\bf p}^2/2m) like those of a gas in a box of volume V = L^3, such a matter wave is explicitly
\psi \propto \sin \frac{l\pi x}{L} \sin \frac{m\pi y}{L} \sin \frac{n\pi z}{L},
where l, m, n are integers. The number of different (l,m,n) values, and hence the number of states in the region between p and p+dp (with p^2 = {\bf p}^2), is then found to be the above expression V\,4\pi p^2 dp/h^3 by considering the area covered by these points. Moreover, in view of the uncertainty relation, which in 1 spatial dimension is \Delta q\,\Delta p \geq h, these states are indistinguishable (i.e., these states do not carry labels).
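A small numerical check of this state count: the sketch below counts the standing-wave modes (l, m, n) directly and compares with V\,4\pi p^2 dp/h^3, using units with h = L = 1 (an arbitrary choice for the illustration; the component momenta p_i = h l_i/(2L) follow from the standing waves above).

```python
import numpy as np

# Illustrative check of the state count V*4*pi*p^2*dp/h^3 for a particle
# in a box, in units with h = L = 1 (an arbitrary choice for the sketch).
h, L = 1.0, 1.0
V = L**3

def count_states(p, dp, nmax=110):
    """Count standing-wave modes (l, m, n >= 1) whose momentum magnitude
    lies in [p, p + dp); each component is p_i = h*l_i/(2L)."""
    l = np.arange(1, nmax + 1)
    lx, ly, lz = np.meshgrid(l, l, l, indexing="ij", sparse=True)
    pmag = (h / (2 * L)) * np.sqrt(lx**2 + ly**2 + lz**2)
    return np.count_nonzero((pmag >= p) & (pmag < p + dp))

p, dp = 50.0, 1.0
print(count_states(p, dp))                 # direct lattice count
print(V * 4 * np.pi * p**2 * dp / h**3)    # formula: ~31416, close to the count
```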
An important consequence is a result known as Liouville's theorem, i.e., the time independence of this phase space volume element and thus of the a priori probability. A time dependence of this quantity would imply known information about the dynamics of the system, and hence would not be an a priori probability. Thus the region
\Omega := \frac{\Delta q\,\Delta p}{\int \Delta q\,\Delta p},\;\;\; \int \Delta q\,\Delta p = \mathrm{const.},
when differentiated with respect to time t, yields zero (with the help of Hamilton's equations): the volume at time t is the same as at time zero. One describes this also as conservation of information.

In the full quantum theory one has an analogous conservation law. In this case, the phase space region is replaced by a subspace of the space of states expressed in terms of a projection operator P, and instead of the probability in phase space, one has the probability density
\Sigma := \frac{P}{\text{Tr}(P)},\;\;\; N = \text{Tr}(P) = \mathrm{const.},
where N is the dimensionality of the subspace. The conservation law in this case is expressed by the unitarity of the S-matrix. In either case, the considerations assume a closed isolated system, i.e., a system with (1) a fixed energy E and (2) a fixed number of particles N, in (3) a state of equilibrium. If one considers a huge number of replicas of this system, one obtains what is called a microcanonical ensemble. It is for this system that one postulates in quantum statistics the "fundamental postulate of equal a priori probabilities of an isolated system": the isolated system in equilibrium occupies each of its accessible states with the same probability. This fundamental postulate therefore allows us to equate the a priori probability to the degeneracy of a system, i.e., to the number of different states with the same energy.

Example
The following example illustrates the a priori probability (or a priori weighting) in (a) classical and (b) quantal contexts; a numerical check of the correspondence is sketched after the example.

(a) Consider the rotational energy E of a diatomic molecule with moment of inertia I in spherical polar coordinates \theta, \phi (this means q above is here \theta, \phi), i.e.
E = \frac{1}{2I}\left(p^2_{\theta} + \frac{p^2_{\phi}}{\sin^2\theta}\right).
The (p_{\theta}, p_{\phi})-curve for constant E and \theta is an ellipse of area
\oint dp_{\theta}\,dp_{\phi} = \pi \sqrt{2IE}\,\sqrt{2IE}\,\sin\theta = 2\pi IE\sin\theta.
By integrating over \theta and \phi, the total volume of phase space covered for constant energy E is
\int^{\phi=2\pi}_{0}\int^{\theta=\pi}_0 2\pi I E\sin\theta\,d\theta\,d\phi = 8\pi^2 IE = \oint dp_{\theta}\,dp_{\phi}\,d\theta\,d\phi,
and hence the classical a priori weighting in the energy range dE, i.e., the phase space volume at E + dE minus that at E, is
\Omega \propto 8\pi^2 I\,dE.

(b) Assuming that the number of quantum states in a range \Delta q\,\Delta p for each direction of motion is given, per element, by a factor \Delta q\,\Delta p/h, the number of states in the energy range dE is, as seen under (a), 8\pi^2 I\,dE/h^2 for the rotating diatomic molecule. From wave mechanics it is known that the energy levels of a rotating diatomic molecule are given by
E_n = \frac{n(n+1)h^2}{8\pi^2 I},
each such level being (2n+1)-fold degenerate. By evaluating dn/dE_n = 1/(dE_n/dn) one obtains
\frac{dn}{dE_n} = \frac{8\pi^2 I}{(2n+1)h^2}, \;\;\; (2n+1)\,dn = \frac{8\pi^2 I}{h^2}\,dE_n.
Thus by comparison with \Omega above, the approximate number of states in the range dE is given by the degeneracy, i.e. \Sigma \propto (2n+1)\,dn. Thus the a priori weighting in the classical context (a) corresponds to the a priori weighting here in the quantal context (b).
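As a sketch of this correspondence, one can sum the degeneracies (2n+1) over the levels E_n falling in an energy window and compare with 8\pi^2 I\,dE/h^2; the units h = I = 1 are an arbitrary choice for the illustration.

```python
import numpy as np

# Compare the quantal state count with the classical a priori weighting
# for the rotating diatomic molecule; h = I = 1 is an illustrative choice.
h, I = 1.0, 1.0

def E(n):
    """Rotational energy levels E_n = n(n+1) h^2 / (8 pi^2 I)."""
    return n * (n + 1) * h**2 / (8 * np.pi**2 * I)

n = np.arange(1, 1000)
E_lo, E_hi = E(100), E(120)                   # an energy window dE
in_window = (E(n) >= E_lo) & (E(n) < E_hi)
quantal = np.sum(2 * n[in_window] + 1)        # sum of degeneracies (2n+1)
classical = 8 * np.pi**2 * I * (E_hi - E_lo) / h**2
print(quantal, classical)                     # 4400 vs 4420.0: close for large n
```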
Continuing the example, in the case of the one-dimensional simple harmonic oscillator of natural frequency \nu one finds correspondingly: (a) \Omega \propto dE/\nu, and (b) \Sigma \propto dn (no degeneracy). Thus in quantum mechanics the a priori probability is effectively a measure of the degeneracy, i.e. the number of states having the same energy. In the case of the hydrogen atom or Coulomb potential (where the evaluation of the phase space volume for constant energy is more complicated) one knows that the quantum mechanical degeneracy is n^2 with E \propto 1/n^2. Thus in this case \Sigma \propto n^2\,dn.

A priori probability and distribution functions
In statistical mechanics (see any standard text) one derives the so-called distribution functions f for the various statistics. In the case of Fermi–Dirac statistics and Bose–Einstein statistics these functions are respectively
f^{FD}_i = \frac{1}{e^{(\epsilon_i - \epsilon_0)/kT}+1}, \quad f^{BE}_i = \frac{1}{e^{(\epsilon_i-\epsilon_0)/kT}-1}.
These functions are derived for (1) a system in dynamic equilibrium (i.e., under steady, uniform conditions) with (2) a total (and huge) number of particles N = \Sigma_i n_i (this condition determines the constant \epsilon_0), and (3) a total energy E = \Sigma_i n_i\epsilon_i, i.e., with each of the n_i particles having energy \epsilon_i. An important aspect of the derivation is that it takes into account the indistinguishability of particles and states in quantum statistics, i.e., particles and states there do not carry labels. In the case of fermions, like electrons, which obey the Pauli principle (only one particle per state, or none, allowed), one therefore has
0 \leq f^{FD}_i \leq 1, \quad \text{whereas} \quad 0 \leq f^{BE}_i \leq \infty.
Thus f^{FD}_i is a measure of the fraction of states actually occupied by electrons at energy \epsilon_i and temperature T. On the other hand, the a priori probability g_i is a measure of the number of wave-mechanical states available. Hence n_i = f_i g_i.

Since n_i is constant under uniform conditions (as many particles as flow out of a volume element also flow in steadily, so that the situation in the element appears static), i.e., independent of time t, and g_i is also independent of time t as shown earlier, we obtain
\frac{df_i}{dt} = 0, \quad f_i = f_i(t, {\bf v}_i, {\bf r}_i).
Expressing this equation in terms of its partial derivatives, one obtains the Boltzmann transport equation. How do the coordinates {\bf r} etc. appear here suddenly? Above, no mention was made of electric or other fields. With no such fields present we have the Fermi–Dirac distribution as above; with such fields present we have this additional dependence of f.
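A minimal sketch evaluating the two distribution functions above; the values of \epsilon_0 and kT are arbitrary illustrative choices.

```python
import numpy as np

# Evaluate the Fermi-Dirac and Bose-Einstein distribution functions from
# the text; energies are in units of kT and eps0 is an assumed constant
# (chemical potential), so the numbers are purely illustrative.
def fermi_dirac(eps, eps0, kT):
    return 1.0 / (np.exp((eps - eps0) / kT) + 1.0)

def bose_einstein(eps, eps0, kT):
    return 1.0 / (np.exp((eps - eps0) / kT) - 1.0)

eps = np.linspace(0.5, 5.0, 10)
print(fermi_dirac(eps, eps0=1.0, kT=1.0))    # always between 0 and 1
print(bose_einstein(eps, eps0=0.0, kT=1.0))  # unbounded as eps approaches eps0
```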