===Estimation of a proportion===
A relatively simple situation is estimation of a
proportion. It is a fundamental aspect of statistical analysis, particularly when gauging the prevalence of a specific characteristic within a population. For example, we may wish to estimate the proportion of residents in a community who are at least 65 years old. The
estimator of a
proportion is \hat p = X/n, where
X is the number of 'positive' instances (e.g., the number of people out of the
n sampled people who are at least 65 years old). When the observations are
independent, this estimator has a (scaled)
binomial distribution (and is also the
sample mean of data from a
Bernoulli distribution). The maximum
variance of this distribution is 0.25, which occurs when the true
parameter is
p = 0.5. In practical applications, where the true parameter
p is unknown, the maximum variance is often employed for sample size assessments. If a reasonable estimate for p is known, the quantity p(1-p) may be used in place of 0.25. As the sample size
n grows sufficiently large, the distribution of \hat{p} will be closely approximated by a
normal distribution. Using this approximation together with the
Wald method for the binomial distribution yields a confidence interval of the form
:\left (\widehat p - Z\sqrt{\frac{0.25}{n}}, \quad \widehat p + Z\sqrt{\frac{0.25}{n}} \right )
where Z is the standard Z-score for the desired confidence level (e.g., 1.96 for a 95% confidence interval).

To determine an appropriate sample size
n for estimating a proportion, let W be the desired width of the confidence interval and solve
:Z\sqrt{\frac{0.25}{n}} = W/2
for
n. With 0.5 as the most conservative estimate of the proportion, this yields the sample size
:n=\frac{Z^2}{W^2}.
(Note: W/2 is the margin of error.) Otherwise, the formula would be Z\sqrt{\frac{p(1-p)}{n}} = W/2, which yields n = \frac{4Z^2p(1-p)}{W^2}. In the right-hand figure one can observe how sample sizes for binomial proportions change given different confidence levels and margins of error.

For example, to estimate the proportion of the U.S. population supporting a presidential candidate with a 95% confidence interval of width 2 percentage points (0.02), a sample size of (1.96)²/(0.02)² = 9604 is required. It is reasonable to use the 0.5 estimate for p in this case because presidential races are often close to 50/50, and it is also prudent to use a conservative estimate. The
margin of error in this case is 1 percentage point (half of 0.02).

In practice, the formula
:\left (\widehat p - 1.96\sqrt{\frac{0.25}{n}}, \quad \widehat p + 1.96\sqrt{\frac{0.25}{n}} \right )
is commonly used to form a 95% confidence interval for the true proportion. Rounding 1.96 up to 2, the equation 2\sqrt{\frac{0.25}{n}} = W/2 can be solved for
n, providing the minimum sample size needed to achieve the desired interval width
W. The foregoing is commonly simplified to n = 4/W² = 1/B², where
B is the error bound on the estimate, i.e., the estimate is usually given as within ± B. For B = 10% one requires n = 100, for B = 5% one needs n = 400, for B = 3% the requirement is n ≈ 1111 (often quoted as roughly 1000), while for B = 1% a sample size of n = 10000 is required. These numbers are quoted often in news reports of
opinion polls and other
sample surveys. However, a reported sample size may differ from the exact calculated value, because the result is rounded up. Since
n is the minimum number of
sample points needed to achieve the desired precision, the number of respondents must equal or exceed it.
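
The sample-size formulas for a proportion described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `sample_size_for_proportion` and its parameters are hypothetical, the Z-score is passed in by hand (1.96 by default, as in the text), and the normal approximation is assumed to hold.

```python
import math


def sample_size_for_proportion(width, z=1.96, p=0.5):
    """Smallest n for which a Wald interval for a proportion has total width `width`.

    Implements n = 4 * Z^2 * p * (1 - p) / W^2. With the conservative
    choice p = 0.5 this reduces to n = Z^2 / W^2.
    """
    if not 0 < width < 1:
        raise ValueError("width must be between 0 and 1")
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    n_exact = 4 * z**2 * p * (1 - p) / width**2
    # Subtract a tiny tolerance so floating-point noise in an exact
    # integer result does not push the ceiling up by one.
    return math.ceil(n_exact - 1e-9)


# The polling example from the text: 95% interval, width 0.02, p = 0.5.
print(sample_size_for_proportion(0.02))  # 9604

# Rule of thumb n = 1/B^2: Z rounded to 2, p = 0.5, width W = 2B.
for bound in (0.10, 0.05, 0.03, 0.01):
    print(bound, sample_size_for_proportion(2 * bound, z=2.0))
```

Passing a prior estimate of p other than 0.5 gives a smaller required sample size, since p(1-p) is largest at p = 0.5.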
===Estimation of a mean===
Suppose we want to estimate the average time it takes people in a city to commute to work. Instead of surveying the entire population, we can take a random sample of 100 individuals, record their commute times, and calculate the mean commute time for that sample. For example, person 1 takes 25 minutes, person 2 takes 30 minutes, ..., person 100 takes 20 minutes. Adding up all the commute times and dividing by the number of people in the sample (100 in this case) gives an estimate of the mean commute time for the entire population. This method is practical when it is not feasible to measure everyone in the population, and it provides a reasonable approximation based on a representative sample.

More formally, when estimating the population mean using an independent and identically distributed (iid) sample of size
n, where each data value has variance
σ², the
standard error of the sample mean is
:\frac{\sigma}{\sqrt{n}}.
This expression describes quantitatively how the estimate becomes more precise as the sample size increases. Using the
central limit theorem to justify approximating the sample mean with a normal distribution yields a confidence interval of the form
:\left(\bar x - \frac{Z\sigma}{\sqrt{n}}, \quad \bar x + \frac{Z\sigma}{\sqrt{n}} \right ),
where Z is a standard
Z-score for the desired level of confidence (1.96 for a 95% confidence interval).

To determine the sample size
n required for a confidence interval of width W, with W/2 as the margin of error on each side of the sample mean, the equation
:\frac{Z\sigma}{\sqrt{n}} = W/2
can be solved for
n, which yields the sample size formula
:n = \frac{4Z^2\sigma^2}{W^2}
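
The formula above can be sketched in Python as follows. This is an illustrative sketch only: the function name `sample_size_for_mean` is hypothetical, the population standard deviation is assumed to be known, and the inputs in the usage lines (a standard deviation of 10 and an interval width of 5) are made-up values chosen purely for demonstration.

```python
import math


def sample_size_for_mean(sigma, width, z=1.96):
    """Smallest n for which the interval for a mean has total width `width`.

    Implements n = 4 * Z^2 * sigma^2 / W^2, assuming a known standard
    deviation `sigma` and a normal approximation for the sample mean.
    """
    if sigma <= 0 or width <= 0:
        raise ValueError("sigma and width must be positive")
    n_exact = 4 * z**2 * sigma**2 / width**2
    # Round up: the sample size must be an integer that meets or
    # exceeds the calculated minimum. The tiny tolerance guards
    # against floating-point noise in an exact integer result.
    return math.ceil(n_exact - 1e-9)


# Hypothetical inputs: sigma = 10, desired 95% interval width = 5.
# 4 * 1.96^2 * 10^2 / 5^2 = 61.4656, which rounds up to 62.
print(sample_size_for_mean(sigma=10, width=5))  # 62
```

Because n grows with the square of σ/W, halving the desired width roughly quadruples the required sample size.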
For instance, to estimate the effect of a drug on blood pressure with a 95% confidence interval that is six units wide, when the known standard deviation of blood pressure in the population is 15, the required sample size is \frac{4\times1.96^2\times15^2}{6^2} = 96.04, which is rounded up to 97, because a sample size must be an integer that meets or exceeds the calculated
minimum value. Understanding these calculations is essential for researchers designing studies to accurately estimate population means within a desired level of confidence.

==Required sample sizes for hypothesis tests ==