Descriptive statistics
A
descriptive statistic (in the
count noun sense) is a
summary statistic that quantitatively describes or summarizes features of a collection of
information, while
descriptive statistics in the
mass noun sense is the process of using and analyzing those statistics. Descriptive statistics is distinguished from
inferential statistics (or inductive statistics), in that descriptive statistics aims to summarize a
sample, rather than use the data to learn about the
population that the sample of data is thought to represent.
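As a minimal sketch in Python (with an invented sample), descriptive statistics summarize only the data at hand:

```python
import numpy as np

# A small, invented sample of observations.
sample = np.array([2.1, 3.4, 2.8, 5.0, 3.9, 4.2, 2.5, 3.7])

# Descriptive statistics: these describe this sample and nothing more.
print("mean:  ", sample.mean())
print("median:", np.median(sample))
print("std:   ", sample.std(ddof=1))           # sample standard deviation
print("range: ", sample.max() - sample.min())
```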
Inferential statistics
Statistical inference is the process of using
data analysis to deduce properties of an underlying
probability distribution. Inferential statistical analysis infers properties of a
population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is
sampled from a larger population. Inferential statistics can be contrasted with
descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population.
Terminology and theory of inferential statistics
Statistics, estimators and pivotal quantities
Consider
independent identically distributed (IID) random variables with a given
probability distribution: standard
statistical inference and
estimation theory defines a
random sample as the
random vector given by the
column vector of these IID variables. The
population being examined is described by a probability distribution that may have unknown parameters. A statistic is a random variable that is a function of the random sample, but not a function of unknown parameters. The probability distribution of the statistic, though, may have unknown parameters. Consider now a function of the unknown parameter: an
estimator is a statistic used to estimate such function. Commonly used estimators include
sample mean, unbiased
sample variance and
sample covariance. A random variable that is a function of the random sample and of the unknown parameter, but whose probability distribution
does not depend on the unknown parameter is called a
pivotal quantity or pivot. Widely used pivots include the
z-score, the
chi square statistic and Student's
t-value. Between two estimators of a given parameter, the one with lower
mean squared error is said to be more
efficient. Furthermore, an estimator is said to be
unbiased if its
expected value is equal to the
true value of the unknown parameter being estimated, and asymptotically unbiased if its expected value converges in the limit to the true value of such parameter. Other desirable properties for estimators include:
UMVUE estimators that have the lowest variance for all possible values of the parameter to be estimated (this is usually an easier property to verify than efficiency) and
consistent estimators which converge in probability to the true value of such parameter. This still leaves the question of how to obtain estimators in a given situation and carry out the computation; several methods have been proposed: the
method of moments, the
maximum likelihood method, the
least squares method and the more recent method of
estimating equations.
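A brief sketch of these ideas in Python (data simulated for illustration): the sample mean and the unbiased sample variance are estimators of the corresponding population parameters, and the maximum likelihood method provides another route to such estimates:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# IID draws from a normal population whose parameters the analyst
# would not know in practice.
data = rng.normal(loc=10.0, scale=2.0, size=500)

# Estimators: functions of the random sample alone.
mean_hat = data.mean()        # sample mean
var_hat = data.var(ddof=1)    # unbiased sample variance

# Maximum likelihood estimates of the same parameters.
mu_mle, sigma_mle = stats.norm.fit(data)

print(mean_hat, var_hat, mu_mle, sigma_mle**2)
```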
Null hypothesis and alternative hypothesis
Interpretation of statistical information can often involve the development of a null hypothesis, which is usually (but not necessarily) that no relationship exists among variables or that no change occurred over time. The
alternative hypothesis is the name of the hypothesis that contradicts the null hypothesis. The best illustration for a novice is the predicament encountered in a
criminal trial. The null hypothesis, H0, asserts that the defendant is innocent, whereas the alternative hypothesis, H1, asserts that the defendant is
guilty. The
indictment comes because of suspicion of guilt. The H0 (the
status quo) stands in opposition to H1 and is maintained unless H1 is supported by evidence "beyond a reasonable doubt". However, "failure to reject H0" in this case does not imply innocence, but merely that the evidence was insufficient to convict. So the jury does not necessarily accept H0 but fails to reject H0. While one cannot "prove" a null hypothesis, one can test how close it is to being true with a
power test, which tests for
type II errors. The null hypothesis cannot be proven true because it is already assumed to be true when the test is being conducted.
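As a hedged illustration, a one-sample t-test in Python makes the H0/H1 logic concrete (the data and the hypothesized mean are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.3, scale=1.0, size=40)  # hypothetical measurements

# H0: the population mean equals 5.0; H1: it does not.
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject H0 at the {alpha} level")
else:
    # Failing to reject H0 is not the same as proving H0.
    print(f"p = {p_value:.3f}: fail to reject H0")
```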
Error
Working from a null hypothesis, two broad categories of error are recognized:
• Type I errors where the null hypothesis is falsely rejected, giving a "false positive".
• Type II errors where the null hypothesis fails to be rejected and an actual difference between populations is missed, giving a "false negative".
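Both error rates can be estimated by simulation; the sketch below (with an arbitrary effect size, sample size, and significance level) approximates the Type I rate under a true null and the Type II rate under a specific alternative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n, trials = 0.05, 30, 5000

# Type I: H0 is true (mean really is 0), so any rejection is a false positive.
false_pos = sum(
    stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue < alpha
    for _ in range(trials)
)

# Type II: H1 is true (mean is 0.5), so any non-rejection is a false negative.
false_neg = sum(
    stats.ttest_1samp(rng.normal(0.5, 1.0, n), 0.0).pvalue >= alpha
    for _ in range(trials)
)

print("estimated Type I rate: ", false_pos / trials)  # close to alpha
print("estimated Type II rate:", false_neg / trials)  # power = 1 - this
```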
Standard deviation refers to the extent to which individual observations in a sample differ from a central value, such as the sample or population mean, while standard error refers to an estimate of the difference between the sample mean and the population mean. A
statistical error is the amount by which an observation differs from its
expected value. A
residual is the amount by which an observation differs from the value the estimator of the expected value assumes on a given sample (also called the prediction).
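The distinction shows up numerically in a short sketch (the true mean is known here only because the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean = 10.0                          # unknown in real applications
sample = rng.normal(true_mean, 2.0, size=5)

errors = sample - true_mean          # statistical errors need the true value
residuals = sample - sample.mean()   # residuals use the estimate instead

print("errors:   ", errors)
print("residuals:", residuals)  # residuals around the sample mean sum to ~0
```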
Mean squared error is used for obtaining
efficient estimators, a widely used class of estimators.
Root mean square error is simply the square root of mean squared error. Many statistical methods seek to minimize the
residual sum of squares, and these are called "
methods of least squares" in contrast to
least absolute deviations. The latter gives equal weight to small and large errors, while the former gives more weight to large errors. Residual sum of squares is also
differentiable, which provides a handy property for doing
regression. Least squares applied to
linear regression is called
ordinary least squares method and least squares applied to
nonlinear regression is called
non-linear least squares. In a linear regression model, the non-deterministic part of the model is called the error term, disturbance, or more simply noise. Both linear regression and non-linear regression are addressed in
polynomial least squares, which also describes the variance in a prediction of the dependent variable (y axis) as a function of the independent variable (x axis) and the deviations (errors, noise, disturbances) from the estimated (fitted) curve. Measurement processes that generate statistical data are also subject to error. Many of these errors are classified as
random (noise) or
systematic (
bias), but other types of errors (e.g., blunder, such as when an analyst reports incorrect units) can also be important. The presence of
missing data or
censoring may result in
biased estimates and specific techniques have been developed to address these problems.
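A compact ordinary least squares sketch in Python (synthetic data) ties several of these ideas together: the fitted line minimizes the residual sum of squares, and the RMSE is the square root of the mean squared residual:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, x.size)  # line plus an error term

# Ordinary least squares: coefficients minimizing the residual sum of squares.
slope, intercept = np.polyfit(x, y, deg=1)

fitted = intercept + slope * x
residuals = y - fitted
rss = np.sum(residuals**2)             # residual sum of squares
rmse = np.sqrt(np.mean(residuals**2))  # root mean square error

print(f"fit: y = {intercept:.2f} + {slope:.2f}x, RSS = {rss:.1f}, RMSE = {rmse:.2f}")
```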
Interval estimation
[Figure: the red line is the true value of the mean in this example; the blue lines are random confidence intervals for 100 realizations.]
Most studies only sample part of a population, so results do not fully represent the whole population. Any estimates obtained from the sample only approximate the population value.
Confidence intervals allow statisticians to express how closely the sample estimate matches the true value in the whole population. Often they are expressed as 95% confidence intervals. Formally, a 95% confidence interval for a value is a range where, if the sampling and analysis were repeated under the same conditions (yielding a different dataset), the interval would include the true (population) value in 95% of all possible cases. This does
not imply that the probability that the true value is in the confidence interval is 95%. From the
frequentist perspective, such a claim does not even make sense, as the true value is not a
random variable. Either the true value is or is not within the given interval. However, it is true that, before any data are sampled and given a plan for how to construct the confidence interval, the probability is 95% that the yet-to-be-calculated interval will cover the true value: at this point, the limits of the interval are yet-to-be-observed
random variables. One approach that does yield an interval that can be interpreted as having a given probability of containing the true value is to use a
credible interval from
Bayesian statistics: this approach depends on a different way of
interpreting what is meant by "probability", that is as a
Bayesian probability. In principle, confidence intervals can be symmetrical or asymmetrical. An interval can be asymmetrical because it works as a lower or upper bound for a parameter (left-sided or right-sided interval), but it can also be asymmetrical if a two-sided interval is constructed in a way that violates symmetry around the estimate. Sometimes the bounds of a confidence interval are reached asymptotically, and these are used to approximate the true bounds.
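The repeated-sampling interpretation can be checked by simulation, in the spirit of the figure above; this sketch (all parameters invented, with the population standard deviation treated as known for simplicity) counts how many of 100 intervals cover the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
true_mean, sigma, n = 50.0, 8.0, 25
z = stats.norm.ppf(0.975)             # two-sided 95% normal quantile

covered = 0
for _ in range(100):                  # 100 realizations, as in the figure
    sample = rng.normal(true_mean, sigma, n)
    half_width = z * sigma / np.sqrt(n)
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    covered += (lo <= true_mean <= hi)

print(f"{covered}/100 intervals covered the true mean")  # typically about 95
```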
Significance
Statistics rarely give a simple Yes/No type answer to the question under analysis. Interpretation often comes down to the level of statistical significance applied to the numbers, and often refers to the probability of a value accurately rejecting the null hypothesis (sometimes referred to as the p-value).
[Figure: the critical region is the set of values to the right of the observed data point (observed value of the test statistic), and the p-value is represented by the green area.]
The standard approach is to test a null hypothesis against an alternative hypothesis. Some problems are usually associated with this framework:
• Rejecting the null hypothesis does not automatically prove the alternative hypothesis.
• Like everything in inferential statistics, it relies on sample size, and therefore under fat tails p-values may be seriously miscomputed.
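For a one-sided test like the one the figure describes, the p-value is simply a tail area of the test statistic's null distribution; a minimal sketch with an invented observed z value:

```python
from scipy import stats

z_observed = 2.1                      # hypothetical observed test statistic
p_value = stats.norm.sf(z_observed)   # right-tail area under the null N(0, 1)

alpha = 0.05
critical_value = stats.norm.ppf(1 - alpha)  # boundary of the critical region

print(f"p = {p_value:.4f}, critical region: z > {critical_value:.3f}")
print("reject H0" if z_observed > critical_value else "fail to reject H0")
```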
Examples
Some well-known statistical tests and procedures are: analysis of variance (ANOVA), the chi-squared test, Student's t-test, regression analysis, the Mann–Whitney U test, the Pearson product-moment and Spearman's rank correlation coefficients, factor analysis, and time series analysis.
Bayesian statistics
An alternative paradigm to the popular
frequentist paradigm is to use
Bayes' theorem to update the
prior probability of the hypotheses in consideration based on the
relative likelihood of the evidence gathered to obtain a
posterior probability. Bayesian methods have been aided by the increase in available computing power to compute the
posterior probability using numerical approximation techniques like
Markov chain Monte Carlo. For statistical modelling purposes, Bayesian models tend to be hierarchical; for example, one could model each YouTube channel as having video views distributed as a normal distribution with channel-dependent mean and variance, \mathcal{N}(\mu_i, \sigma_i^2), while modeling the channel means as themselves coming from a normal distribution representing the distribution of average video view counts per channel, and the variances as coming from another distribution. The concept of using the
likelihood ratio can also be prominently seen in
medical diagnostic testing.
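A minimal sketch of Bayes' theorem in the diagnostic-testing setting (all numbers invented for illustration): a prior prevalence is updated, via the likelihood ratio of a positive result, into a posterior probability:

```python
# Hypothetical diagnostic test; every number below is made up.
prior = 0.01          # prior probability of disease (prevalence)
sensitivity = 0.95    # P(test positive | disease)
specificity = 0.90    # P(test negative | no disease)

# Bayes' theorem: posterior = P(+|D) * P(D) / P(+).
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
posterior = sensitivity * prior / p_positive

likelihood_ratio = sensitivity / (1 - specificity)  # LR of a positive result

print(f"posterior P(disease | positive test) = {posterior:.3f}")
print(f"likelihood ratio of a positive test  = {likelihood_ratio:.1f}")
```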
Exploratory data analysis
Exploratory data analysis (
EDA) is an approach to
analyzing data sets to summarize their main characteristics, often with visual methods. A
statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
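In practice EDA often begins with quick numeric and visual summaries; a minimal pandas/matplotlib sketch over invented data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "x": rng.normal(0.0, 1.0, 200),    # made-up dataset
    "y": rng.exponential(2.0, 200),
})

print(df.describe())   # main characteristics at a glance

df.hist(bins=20)       # visual summary before any formal modeling
plt.tight_layout()
plt.show()
```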
Mathematical statistics
Mathematical statistics is the application of mathematics to statistics. Mathematical techniques used for this include
mathematical analysis,
linear algebra,
stochastic analysis,
differential equations, and
measure-theoretic probability theory. All statistical analyses make use of at least some mathematics, and mathematical statistics can therefore be regarded as a fundamental component of general statistics.