== Drawing conclusions from data ==
The conventional statistical hypothesis testing procedure, using frequentist probability, is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data, and finally carry out a statistical significance test to see how likely such results would be if chance alone were at work (also called testing against the null hypothesis). A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every
data set contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same
statistical population, it is impossible to assess the likelihood that chance alone would produce such patterns. For example,
flipping a coin five times with a result of 2 heads and 3 tails might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form a hypothesis about the tails probability in advance, and then to toss the coin a number of times to see whether the hypothesis is rejected. If three tails and two heads are then observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. The statistical significance under the incorrect procedure is completely spurious; significance tests do not protect against data dredging. A minimal simulation of this circularity is sketched below.
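The following Python sketch illustrates the point; the binom_pvalue helper, the seed, and the sample sizes are illustrative choices, not part of the original text. "Testing" the dredged 3/5 hypothesis on the same five flips that suggested it can never reject it, while fresh data from a fair coin reject it decisively.

```python
import random
from math import comb

def binom_pvalue(k: int, n: int, p: float) -> float:
    """Exact two-sided binomial p-value: total probability of all
    outcomes no more likely than the observed count k under the null."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in probs if q <= probs[k] + 1e-12)

# Dredged hypothesis: five flips gave 3 tails, so "P(tails) = 3/5" is
# formed from, and then "tested" on, the very same five flips.
print(binom_pvalue(3, 5, 3/5))        # 1.0: the data trivially "confirm" it

# Honest procedure: fix the hypothesis first, then collect new data.
random.seed(0)
tails = sum(random.random() < 0.5 for _ in range(1000))   # the coin is fair
print(binom_pvalue(tails, 1000, 3/5)) # tiny: fresh data reject P(tails) = 3/5
```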
== Optional stopping ==
Optional stopping is the practice of collecting data until some stopping criterion is reached. While it is a valid procedure, it is easily misused. The problem is that the p-value of an optionally stopped statistical test is larger than it appears. Intuitively, this is because the p-value is supposed to be the sum of the probabilities of all events at least as rare as the one observed. With optional stopping, there are even rarer events that are difficult to account for, namely not triggering the stopping rule and collecting still more data before stopping. Neglecting these events yields a p-value that is too low. In fact, if the null hypothesis is true, then any significance level can be reached if one is allowed to keep collecting data and stop when the desired p-value (calculated as if one had always planned to collect exactly this much data) is obtained; a concrete simulation for a fair coin is sketched below. More succinctly, the proper calculation of the p-value requires accounting for counterfactuals, that is, what the experimenter could have done in reaction to data that might have been. Accounting for what might have been is hard, even for honest researchers. The problem of early stopping is not limited to researcher misconduct. There is often pressure to stop early if the cost of collecting data is high. Some animal ethics boards even mandate early stopping if the study obtains a significant result midway.
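A minimal sketch of that simulation, assuming a normal-approximation test of coin fairness (the z_pvalue helper, peeking schedule, seed, and batch sizes are illustrative choices): a fair coin is tested after every batch of flips, and data collection stops as soon as the naive p-value dips below 0.05. Under the null, the fraction of runs that ever "reach significance" far exceeds the nominal 5%.

```python
import random
from math import sqrt, erfc

def z_pvalue(heads: int, n: int) -> float:
    """Two-sided normal-approximation p-value for H0: the coin is fair."""
    z = (heads - n / 2) / sqrt(n / 4)
    return erfc(abs(z) / sqrt(2))

random.seed(1)
trials = 2000
false_positives = 0
for _ in range(trials):
    heads = n = 0
    # Peek after every batch of 10 flips, up to 1,000 flips in total,
    # stopping as soon as the test looks "significant".
    for _ in range(100):
        heads += sum(random.random() < 0.5 for _ in range(10))
        n += 10
        if z_pvalue(heads, n) < 0.05:   # the coin IS fair: a false alarm
            false_positives += 1
            break
print(false_positives / trials)         # far above the nominal 0.05
```

With more patience (more peeks), the false positive rate approaches 1: under the null, an unconstrained experimenter can always wait for a run of luck that pushes the naive p-value below any fixed threshold.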
== Post-hoc data replacement ==
If data is removed after some analysis has already been done on it, for example on the pretext of "removing outliers", the false positive rate increases. Outliers should be removed from a data set only after proper identification and agreement that special cause variation is responsible for the unusual data. Replacing "outliers" with substitute data increases the false positive rate further. A simulation of post-hoc outlier removal is sketched below.
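The sketch below, with both groups drawn from the same distribution (the trimming rule, group sizes, seed, and helpers are illustrative choices), gives the test a second chance whenever the first attempt fails: points beyond two standard deviations are dropped and the test is rerun. The inflation here is modest but systematic; more aggressive or more selective trimming inflates the rate further.

```python
import random
from math import sqrt, erfc
from statistics import mean, stdev

def two_sample_pvalue(a, b):
    """Two-sided p-value for equal means, normal approximation
    (adequate here, with 100 observations per group)."""
    se = sqrt(stdev(a)**2 / len(a) + stdev(b)**2 / len(b))
    z = (mean(a) - mean(b)) / se
    return erfc(abs(z) / sqrt(2))

def drop_outliers(xs):
    """Post-hoc rule: drop points more than 2 SDs from the sample mean."""
    m, s = mean(xs), stdev(xs)
    return [x for x in xs if abs(x - m) <= 2 * s]

random.seed(2)
trials = 2000
hits = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(100)]   # both groups come from
    b = [random.gauss(0, 1) for _ in range(100)]   # the SAME distribution
    if two_sample_pvalue(a, b) < 0.05:
        hits += 1    # ordinary false positive
    elif two_sample_pvalue(drop_outliers(a), drop_outliers(b)) < 0.05:
        hits += 1    # "rescued" by post-hoc trimming
print(hits / trials)  # exceeds the nominal 0.05
```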
== Post-hoc grouping ==
If a dataset contains multiple features, then one or more of the features can be used as a grouping variable, potentially creating a statistically significant result. For example, if a dataset of patients records their age and sex, a researcher can group them by age and check whether the illness recovery rate is correlated with age. If it is not, the researcher might check whether it correlates with sex. If not, then perhaps it correlates with age after controlling for sex, and so on. The number of possible groupings grows exponentially with the number of features, so if enough groupings are tried, some grouping is likely to appear "significant" by chance alone; a simulation is sketched below. Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias.
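A minimal sketch of the grouping problem, assuming a two-proportion z-test (the helper, group counts, and seed are illustrative choices): the outcome is pure chance and every grouping is unrelated to it, yet trying ten groupings per dataset yields at least one "significant" grouping in roughly 40% of datasets.

```python
import random
from math import sqrt, erfc

def two_proportion_pvalue(k1, n1, k2, n2):
    """Two-sided p-value for equal success rates in two subgroups
    (normal approximation to the two-proportion z-test)."""
    p = (k1 + k2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (k1 / n1 - k2 / n2) / se
    return erfc(abs(z) / sqrt(2))

random.seed(3)
trials = 1000
datasets_with_a_hit = 0
for _ in range(trials):
    n = 200
    recovered = [random.random() < 0.5 for _ in range(n)]  # outcome: pure chance
    for _ in range(10):   # try ten groupings, none related to the outcome
        group = [random.random() < 0.5 for _ in range(n)]
        k1, n1 = sum(r for r, g in zip(recovered, group) if g), sum(group)
        k2 = sum(recovered) - k1
        if two_proportion_pvalue(k1, n1, k2, n - n1) < 0.05:
            datasets_with_a_hit += 1
            break
print(datasets_with_a_hit / trials)  # about 1 - 0.95**10 ≈ 0.40, not 0.05
```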
== Examples ==