There are many proposed causes for the replication crisis.
=== Historical and sociological causes ===
The replication crisis may be triggered by the "generation of new data and scientific publications at an unprecedented rate" that leads to "desperation to publish or perish" and failure to adhere to good scientific practice. Predictions of an impending crisis in the quality-control mechanism of science can be traced back several decades. Derek de Solla Price, considered the father of scientometrics (the quantitative study of science), predicted in his 1963 book Little Science, Big Science that science could reach "senility" as a result of its own exponential growth. Some present-day literature seems to vindicate this "overflow" prophecy, lamenting the decay in both attention and quality.

Historian Philip Mirowski argues that the decline of scientific quality can be connected to its commodification, especially spurred by major corporations' profit-driven decisions to outsource their research to universities and contract research organizations. Social systems theory, as expounded in the work of German sociologist Niklas Luhmann, inspires a similar diagnosis. This theory holds that each system, such as the economy, science, religion, and the media, communicates using its own code: true and false for science, profit and loss for the economy, news and no-news for the media, and so on. According to some sociologists, science's mediatization and commodification, consequences of the structural coupling among systems, have led to a confusion of these original system codes.
=== Problems with the publication system in science ===
==== Publication bias ====
Publication bias, the tendency to publish only positive, significant results, creates the "file drawer effect", in which negative results remain unpublished. This produces a misleading literature and biased meta-analyses, and it discourages reporting on, or even attempting, replication studies. Among 1,576 researchers Nature surveyed in 2016, only a minority had ever attempted to publish a replication, and several respondents who had published failed replications noted that editors and reviewers demanded they play down comparisons with the original studies. Publication bias is augmented by the pressure to publish and the author's own confirmation bias, and is an inherent hazard in the field, requiring a certain degree of skepticism on the part of readers.
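The distorting effect of the file drawer on a literature can be illustrated with a short simulation. The sketch below uses purely illustrative numbers (and assumes Python with NumPy and SciPy available): many small studies of a weak true effect are generated, only the statistically significant ones are "published", and the average of the published estimates overstates the true effect.

```python
# Illustrative simulation (assumed numbers, not from any cited study): many small
# studies of a weak true effect are run, but only the statistically significant
# ones are "published", so the average published effect overstates the truth.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
TRUE_EFFECT, N_PER_STUDY, N_STUDIES = 0.15, 30, 5000

published, all_estimates = [], []
for _ in range(N_STUDIES):
    sample = rng.normal(TRUE_EFFECT, 1.0, N_PER_STUDY)
    estimate = sample.mean()
    all_estimates.append(estimate)
    if stats.ttest_1samp(sample, popmean=0.0).pvalue < 0.05:
        published.append(estimate)          # the file drawer keeps the rest

print(f"true effect:                 {TRUE_EFFECT}")
print(f"mean of all estimates:       {np.mean(all_estimates):.2f}")
print(f"mean of published estimates: {np.mean(published):.2f}")  # noticeably inflated
```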
==== Mathematical errors ====
Even high-impact journals have a significant fraction of mathematical errors in their use of statistics. For example, 11% of statistical results published in Nature and BMJ in 2001 were "incongruent", meaning that the reported p-value differed from what it should be if correctly recalculated from the reported test statistic. These errors likely arose from typesetting, rounding, and transcription mistakes. Among 157 neuroscience papers published in five top-ranking journals that attempted to show that two experimental effects are different, 78 erroneously tested instead whether one effect is significant while the other is not, and 79 correctly tested whether the difference between the two effects is significantly different from 0.
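Incongruence of this kind can be detected by recomputing the p-value implied by the reported test statistic and degrees of freedom and comparing it with the p-value printed in the paper. The following is a minimal sketch of that check (assuming SciPy; the reported numbers are hypothetical).

```python
# A minimal congruence check (hypothetical reported numbers): recompute the
# two-sided p-value from a reported t statistic and degrees of freedom and
# compare it with the p-value printed in the paper.
from scipy import stats

def recomputed_p(t_value, df):
    """Two-sided p-value implied by a t statistic with df degrees of freedom."""
    return 2 * stats.t.sf(abs(t_value), df)

reported_t, reported_df, reported_p = 2.20, 28, 0.05   # hypothetical values from a paper
p = recomputed_p(reported_t, reported_df)
print(f"recomputed p = {p:.3f}")                       # ~0.036
print("incongruent" if abs(p - reported_p) > 0.005 else "congruent")
```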
"Publish or perish" culture Academic "publish or perish" culture exacerbates publication bias. Intense pressure to publish in recognized journals, driven by hypercompetitive environments and bibliometric career evaluations, incentivizes researchers to prioritize publishable results over validity. According to Fanelli, this pushes scientists to employ a number of strategies aimed at making results "publishable". In the context of publication bias, this can mean adopting behaviors aimed at making results positive or statistically significant, often at the expense of their validity. Philosopher Brian D. Earp and psychologist Jim A. C. Everett argue that, although replication is in the best interests of academics and researchers as a group, features of academic psychological culture discourage replication by individual researchers. They argue that performing replications can be time-consuming, and take away resources from projects that reflect the researcher's original thinking. They are harder to publish, largely because they are unoriginal, and even when they can be published they are unlikely to be viewed as major contributions to the field. Replications "bring less recognition and reward, including grant money, to their authors". In his 1971 book
Scientific Knowledge and Its Social Problems, philosopher and historian of science
Jerome R. Ravetz predicted that science—in its progression from "little" science composed of isolated communities of researchers to "big" science or "techno-science"—would suffer major problems in its internal system of quality control. He recognized that the incentive structure for modern scientists could become dysfunctional, creating
perverse incentives to publish any findings, however dubious. According to Ravetz, quality in science is maintained only when there is a community of scholars, linked by a set of shared norms and standards, who are willing and able to hold each other accountable.
==== Standards of reporting ====
Certain publishing practices also make it difficult to conduct replications and to monitor the severity of the reproducibility crisis, because articles often come with insufficient descriptions for other scholars to reproduce the study. The Reproducibility Project: Cancer Biology showed that among 193 experiments from 53 top cancer papers published between 2010 and 2012, only 50 experiments from 23 papers had authors who provided enough information for researchers to redo the studies, sometimes with modifications. None of the 193 papers examined had its experimental protocols fully described, and replicating 70% of the experiments required asking the original authors for key reagents.
==== Procedural bias ====
By the Duhem-Quine thesis, scientific results are interpreted through both a substantive theory and a theory of instruments; for example, astronomical observations depend both on the theory of astronomical objects and the theory of telescopes. A large body of non-replicable research can accumulate if interpretation is biased in the following way: faced with a null result, a scientist prefers to treat the data as saying the instrument was insufficient, while faced with a non-null result, the scientist accepts the instrument as good and treats the data as saying something about the substantive theory.
==== Cultural evolution ====
Smaldino and McElreath proposed a simple model for the cultural evolution of scientific practice. Each lab randomly decides to produce novel research or replication research, with fixed characteristic levels of false positive rate, true positive rate, replication rate, and productivity (its "traits"). A lab may exert more "effort", making its ROC curve more convex but decreasing its productivity. A lab accumulates a score over its lifetime that increases with publications and decreases when another lab fails to replicate its results. At regular intervals, a random lab "dies" and another "reproduces", creating a child lab with traits similar to its parent's; labs with higher scores are more likely to reproduce. Under certain parameter settings, the population of labs converges to maximum productivity even at the price of very high false positive rates.
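A toy re-implementation can illustrate this selection dynamic. The sketch below is not the authors' model: the payoff rule, mutation step, and all parameter values are simplified assumptions chosen only to show how selection on publication counts can push false positive rates upward.

```python
# A toy re-implementation of the selection dynamic (not the authors' exact model):
# labs that tolerate higher false positive rates run more studies and obtain more
# publishable positives, so they are copied more often and the population drifts
# toward sloppier research. All parameter values are illustrative assumptions.
import random

random.seed(0)
N_LABS, GENERATIONS, BASE_RATE, POWER = 30, 1000, 0.1, 0.8

def studies_per_round(alpha):
    # Effort trade-off: lowering the false positive rate costs productivity.
    return max(1, int(10 * alpha / 0.05))

def payoff(alpha):
    # Score for one round = number of positive (hence publishable) results.
    score = 0
    for _ in range(studies_per_round(alpha)):
        is_true = random.random() < BASE_RATE
        if random.random() < (POWER if is_true else alpha):
            score += 1
    return score

labs = [0.05] * N_LABS                    # every lab starts at the nominal 5% level
for _ in range(GENERATIONS):
    scores = [payoff(a) for a in labs]
    dead, parent = scores.index(min(scores)), scores.index(max(scores))
    child = min(1.0, max(0.01, labs[parent] + random.gauss(0, 0.01)))
    labs[dead] = child                    # the low scorer is replaced by a mutated copy

print(f"mean false positive rate after selection: {sum(labs) / N_LABS:.2f}")
# Drifts well above the 0.05 starting point under these assumptions.
```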
=== Questionable research practices ===
Questionable research practices are behaviors that exploit researcher degrees of freedom (researcher DF), that is, choices in study design, data analysis, or reporting, to inflate false positive rates and undermine reproducibility.
==== Genesis ====
Researcher degrees of freedom arise at many stages: hypothesis formulation, design of experiments, data collection and analysis, and reporting of research. They exist because research design and data analysis entail numerous decisions that are not sufficiently constrained by a field's best practices and statistical methodologies. As a result, researcher DF can lead to situations where some failed replication attempts use a different, yet equally plausible, research design or statistical analysis; such failures do not necessarily undermine the original findings.
Multiverse analysis, a method that draws inferences from all plausible data-processing pipelines, offers one solution to the problem of analytical flexibility. Sensitivity analysis similarly explores a range of modelling specifications to give a comprehensive view of how different analytical choices influence outcomes. Collaborative approaches can also compensate for questionable research practices: in multianalyst approaches, different analysts independently analyze the same data to address the same question. This collaborative validation fosters intellectual honesty, exposes questionable research practices, and leads to more reliable and robust scientific conclusions.
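As an illustration of the multiverse idea, the sketch below (with made-up data and an assumed set of defensible preprocessing choices) runs the same group comparison under every combination of analytic choices and reports the full set of p-values rather than a single preferred one.

```python
# A minimal multiverse-analysis sketch on simulated data: the same hypothesis test
# is run across every plausible combination of analytic choices, and the full
# distribution of p-values is reported instead of a single cherry-picked result.
from itertools import product
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(0.2, 1.0, 120)   # hypothetical raw outcome data
group_b = rng.normal(0.0, 1.0, 120)

def preprocess(x, drop_outliers, log_transform):
    if drop_outliers:                 # one defensible choice: trim |z| > 3
        x = x[np.abs((x - x.mean()) / x.std()) < 3]
    if log_transform:                 # another: log-transform after shifting positive
        x = np.log(x - x.min() + 1)
    return x

results = []
for drop, log_t, welch in product([False, True], repeat=3):
    a, b = preprocess(group_a, drop, log_t), preprocess(group_b, drop, log_t)
    p = stats.ttest_ind(a, b, equal_var=not welch).pvalue
    results.append(((drop, log_t, welch), p))

for spec, p in results:               # report every specification, not just the best
    print(spec, f"p = {p:.3f}")
```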
==== In medicine ====
Irreproducible medical studies commonly share several characteristics: investigators are not blinded to the experimental versus the control arm; experiments are not repeated; positive and negative controls are lacking; not all data are reported; statistical tests are used inappropriately; and reagents are used that have not been appropriately validated.
==== In AI research ====
In machine learning research, a range of questionable practices have emerged under intense pressure to achieve state-of-the-art benchmark results. Common questionable evaluation practices include "benchmark overfitting" by repeatedly tuning hyperparameters on held-out test sets, selectively reporting the best of multiple random seeds or experimental runs, and "metric hacking" through unreported post hoc decisions, such as the choice of tokenization or evaluation scripts, that inflate scores. In October 2024, Communications of the ACM published a peer-reviewed critique of a 2021 Nature paper by researchers from Google. The critique described "a smorgasbord of questionable practices in ML, including irreproducible research practices, multiple variants of cherry-picking, misreporting, and likely data contamination (leakage)", with a reference to Leech et al. Specific execution times for the proposed approach and for prior techniques on individual test cases were not disclosed in the paper, but later studies found that Google's method ran orders of magnitude more slowly than the Cadence Design Systems tools available at the same time. Cross-checking such studies, the critique highlighted questionable practices in the Nature paper including (i) selective reporting of results on only a subset of benchmarks, (ii) comparisons against weaker baseline methods, (iii) use of inconsistent evaluation metrics, and (iv) undisclosed use of commercial software data that was only admitted years after the original work. Additionally, the research may have suffered from data leakage: the separation of training and testing data could not be verified from the published data, which is considered a significant flaw in experimental design.
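A small simulation illustrates why one of the practices described above, reporting only the best of several random seeds, inflates benchmark scores. The numbers below are purely illustrative assumptions.

```python
# Illustrative simulation (assumed numbers): if a model's true benchmark score
# is 70.0 with run-to-run noise, reporting the best of several seeds instead of
# the mean systematically overstates performance.
import random

random.seed(0)
TRUE_SCORE, NOISE_SD, N_SEEDS, N_PAPERS = 70.0, 1.0, 10, 1000

best_of_seeds, mean_of_seeds = [], []
for _ in range(N_PAPERS):
    runs = [random.gauss(TRUE_SCORE, NOISE_SD) for _ in range(N_SEEDS)]
    best_of_seeds.append(max(runs))              # the "state-of-the-art" number reported
    mean_of_seeds.append(sum(runs) / N_SEEDS)    # the honest summary

print(f"mean of per-paper best seeds: {sum(best_of_seeds) / N_PAPERS:.2f}")   # ~71.5
print(f"mean of per-paper seed means: {sum(mean_of_seeds) / N_PAPERS:.2f}")   # ~70.0
```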
==== Prevalence ====
According to Indiana University professor Ernest O'Boyle and psychologist Martin Götz, around 50% of researchers surveyed across various studies admitted to HARKing (hypothesizing after the results are known). In a survey of 2,000 psychologists by behavioral scientist Leslie K. John and colleagues, around 94% admitted having employed at least one questionable research practice. More specifically, 63% admitted to failing to report all of a study's dependent measures, 28% to failing to report all of a study's conditions, and 46% to selectively reporting studies that produced the desired pattern of results. In addition, 56% admitted to having collected more data after inspecting already collected data, and 16% to having stopped data collection because the desired result was already visible. The methodology used to estimate the prevalence of questionable research practices has been contested, and more recent studies have suggested lower rates on average.
==== Fraud ====
Questionable research practices are considered a separate category from more explicit violations of scientific integrity, such as data falsification. Prominent examples (see also List of scientific misconduct incidents) include the scientific fraud committed by social psychologist Diederik Stapel. In March 2024, Harvard Business School's investigative committee, after reviewing a nearly 1,300-page report unsealed during Professor Francesca Gino's $25 million lawsuit against Harvard and the Data Colada bloggers, found that Gino "committed research misconduct intentionally, knowingly, or recklessly" by falsifying data in four published studies. The report documented that Gino had altered participant responses, including changing 104 moral-impurity ratings (flipping low values to high in one experimental condition and vice versa in another) and manipulating four networking-intentions items, for a total of 168 modified observations, to make the data conform to her hypotheses. She also engaged in selective reporting by publishing only the "Posted" dataset while omitting the original Qualtrics archives, and she misrepresented the provenance of her data by attributing discrepancies to alleged third-party tampering rather than to deliberate changes by her research team. In June 2023 Gino was placed on unpaid administrative leave, and in May 2025 Harvard revoked her tenure, the first such action in roughly 80 years, citing egregious violations of academic integrity.
=== Statistical issues ===
==== Low statistical power ====
Low statistical power hinders replication for three reasons: (1) low-power replications have a reduced ability to detect true effects, (2) low-power original studies produce biased effect size estimates, leading to undersized replications, and (3) low-power original studies yield results that are unlikely to reflect true effects. The statistical power of studies of event-related potentials, for example, has been estimated at .72‒.98 for large effect sizes, .35‒.73 for medium effects, and .10‒.18 for small effects. Meta-scientist John Ioannidis and colleagues estimated the average power of empirical economic research, finding a median power of 18% in a literature of 6,700 studies. In light of these results, it is plausible that a major reason for widespread replication failure in several scientific fields is very low average statistical power. The same statistical test at the same significance level has lower power when the effect size under the alternative hypothesis is small. Complex heritable traits are typically correlated with a large number of genes, each of small effect size, so high power requires a large sample size. In particular, many results from the candidate gene literature suffered from small effect sizes and small sample sizes and did not replicate. Larger datasets from genome-wide association studies (GWAS) come close to solving this problem. As a numeric example, most genes associated with schizophrenia risk have a low effect size (genotypic relative risk, GRR). A statistical study with 1,000 cases and 1,000 controls has only 0.03% power for a gene with GRR = 1.15, which is already large for schizophrenia, whereas the largest GWAS conducted to date have essentially 100% power for it.
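The kind of power calculation behind such figures can be sketched as follows. The allele frequency, odds ratio, and genome-wide significance threshold below are assumptions chosen for illustration, and the two-proportion z-test is only a rough approximation of the tests actually used in GWAS.

```python
# An illustrative power calculation (assumed numbers: risk allele frequency 0.30 in
# controls, allelic odds ratio 1.15, genome-wide significance alpha = 5e-8), using a
# two-proportion z-test approximation with 2N alleles per group. It is a sketch, not
# the exact method used in the studies cited above.
from math import sqrt
from scipy.stats import norm

ALPHA = 5e-8                      # genome-wide significance threshold
P_CTRL, ODDS_RATIO = 0.30, 1.15
P_CASE = ODDS_RATIO * P_CTRL / (1 + P_CTRL * (ODDS_RATIO - 1))   # ~0.330

def power(n_cases):
    n = 2 * n_cases               # allele counts, assuming equally many controls
    se = sqrt(P_CASE * (1 - P_CASE) / n + P_CTRL * (1 - P_CTRL) / n)
    z_crit = norm.isf(ALPHA / 2)  # two-sided critical value
    return norm.sf(z_crit - (P_CASE - P_CTRL) / se)

print(f"power with 1,000 cases and controls:   {power(1_000):.5f}")    # a few hundredths of a percent
print(f"power with 100,000 cases and controls: {power(100_000):.3f}")  # essentially 1
```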
==== Positive effect size bias ====
Even when a study replicates, the replication typically finds a smaller effect size, because underpowered original studies produce a large positive bias in estimated effect sizes. In 2009, two analyses noted that fMRI studies had a suspicious number of positive results with large effect sizes, more than would be expected given the studies' low power (one example had only 13 subjects). They pointed out that over half of the studies tested for correlations between a phenomenon and individual fMRI voxels, and reported only the voxels exceeding chosen thresholds.

==== Optional stopping ====
Optional stopping is a practice in which one collects data until some stopping criterion is reached. Though a valid procedure, it is easily misused: the p-value of an optionally stopped statistical test is larger than it seems. Intuitively, this is because the p-value is supposed to sum the probabilities of all events at least as rare as the one observed. With optional stopping, there are even rarer events that are difficult to account for, namely not triggering the stopping rule and collecting even more data before stopping. Neglecting these events yields a p-value that is too low. In fact, if the null hypothesis is true, any significance level can be reached if one is allowed to keep collecting data and stop when the desired p-value (calculated as if one had always planned to collect exactly this much data) is obtained; for a concrete example of testing a fair coin, see p-value#optional stopping. More succinctly, the proper calculation of a p-value requires accounting for counterfactuals, that is, what the experimenter could have done in reaction to data that might have been observed, and accounting for what might have been is hard even for honest researchers. The problem of early stopping is not limited to researcher misconduct: there is often pressure to stop early when the cost of collecting data is high, and some animal ethics boards even mandate early stopping if the study obtains a significant result midway.
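A short simulation shows how strongly optional stopping inflates the false positive rate when the data are "peeked at" after every batch. All numbers are illustrative assumptions; the sketch assumes NumPy and SciPy.

```python
# A minimal simulation of the optional-stopping problem: data are generated under a
# true null hypothesis, a one-sample t-test is run after every new batch of
# observations, and collection stops as soon as p < .05. The nominal 5% error rate
# is badly inflated. Numbers here are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N_SIMULATIONS, BATCH, MAX_N = 2000, 10, 200

false_positives = 0
for _ in range(N_SIMULATIONS):
    data = []
    while len(data) < MAX_N:
        data.extend(rng.normal(0.0, 1.0, BATCH))       # null is true: the mean really is 0
        if stats.ttest_1samp(data, popmean=0.0).pvalue < 0.05:
            false_positives += 1                       # peeked, and stopped on "significance"
            break

print(f"false positive rate with optional stopping: {false_positives / N_SIMULATIONS:.2f}")
# Well above the nominal .05, and it keeps growing as MAX_N increases.
```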
==== Heterogeneity ====
Replications of the same effect often find effect sizes that differ from the original; this variation can be due to differences in experimental methods, populations, cohorts, and statistical methods between replication studies. Such heterogeneity poses a challenge to studies attempting to replicate previously found effect sizes: when heterogeneity is high, subsequent replications have a high probability of finding an effect size radically different from that of the original study. Importantly, significant levels of heterogeneity are also found in direct or exact replications of a study. Stanley and colleagues discuss this while reporting a study by quantitative behavioral scientist Richard Klein and colleagues, in which the authors attempted to replicate 15 psychological effects across 36 different sites in Europe and the U.S. Klein and colleagues found significant amounts of heterogeneity in 8 out of 16 effects (I² = 23% to 91%), and although the replication sites intentionally differed on a variety of characteristics, such differences could account for very little of the heterogeneity. According to Stanley and colleagues, this suggests that heterogeneity could be a genuine characteristic of the phenomena under investigation; for instance, phenomena might be influenced by so-called "hidden moderators", relevant factors that were previously not understood to be important in producing a certain effect. In their analysis of 200 meta-analyses of psychological effects, Stanley and colleagues found a median heterogeneity of I² = 74%. According to the authors, this level of heterogeneity can be considered "huge": it is three times larger than the random sampling variance of the effect sizes measured in their study. Considered alongside sampling error, such heterogeneity yields a standard deviation from one study to the next that is even larger than the median effect size of the 200 meta-analyses they investigated. The authors conclude that if replication is defined as a subsequent study finding an effect size sufficiently similar to the original, replication success is unlikely even when replications have very large sample sizes, and this holds even for direct or exact replications, since heterogeneity remains relatively high in those cases.
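The consequence of such heterogeneity for replication can be sketched with a simple random-effects simulation. The mean effect and between-study standard deviation below are illustrative assumptions, not Stanley and colleagues' estimates; sampling error is ignored entirely, yet "exact" replications still frequently disagree with the original.

```python
# A sketch (assumed numbers) of why high heterogeneity limits replication even with
# huge samples: each study's true effect is drawn around a common mean with
# between-study SD tau, so two exact replications can disagree markedly even when
# within-study sampling error is negligible.
import numpy as np

rng = np.random.default_rng(2)
MEAN_EFFECT, TAU, N_PAIRS = 0.20, 0.25, 100_000   # between-study SD larger than the mean effect

original = rng.normal(MEAN_EFFECT, TAU, N_PAIRS)      # true effect in the original study
replication = rng.normal(MEAN_EFFECT, TAU, N_PAIRS)   # true effect in an "exact" replication
close = np.abs(replication - original) < 0.10         # "success" = a similar effect size

print(f"share of replications within ±0.10 of the original: {close.mean():.2f}")
```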
==== Others ====
Within economics, the replication crisis may also be exacerbated because econometric results are fragile: using different but plausible estimation procedures or data preprocessing techniques can lead to conflicting results.
=== Context sensitivity ===
New York University professor Jay Van Bavel and colleagues argue that a further reason findings are difficult to replicate is the sensitivity of certain psychological effects to context. On this view, failures to replicate might be explained by contextual differences between the original experiment and the replication, often called "hidden moderators". Van Bavel and colleagues tested the influence of context sensitivity by reanalyzing the data of the widely cited Reproducibility Project carried out by the Open Science Collaboration.

A related concern involves null hypothesis significance testing itself. It has been argued that in complex systems, such as those studied by social psychology, "the null hypothesis is always false", or "everything is correlated". If so, then a failure to reject the null hypothesis does not show that the null hypothesis is true, but merely that the test produced a false negative, typically due to low power. Low power is especially prevalent in subject areas where effect sizes are small and data are expensive to acquire, such as social psychology. Furthermore, when the null hypothesis is rejected, this might not be evidence for the substantive alternative hypothesis. In soft sciences, many hypotheses can predict a correlation between two variables, so evidence against the null hypothesis "there is no correlation" is not evidence for any one of the many alternative hypotheses that equally well predict "there is a correlation". Fisher developed null hypothesis significance testing for agronomy, where rejecting the null hypothesis is usually good proof of the alternative hypothesis, since there are few candidates: rejecting "fertilizer does not help" is evidence for "fertilizer helps". In psychology, however, there are many alternative hypotheses for every null hypothesis. In particular, when statistical studies on extrasensory perception reject the null hypothesis at an extremely low p-value (as in the case of Daryl Bem), this does not imply the alternative hypothesis "ESP exists"; far more likely is that a small non-ESP signal in the experimental setup was measured precisely.
Paul Meehl noted that statistical hypothesis testing is used differently in "soft" psychology (personality, social, and so on) than in physics. In physics, a theory makes a quantitative prediction and is tested by checking whether the prediction falls within the statistically measured interval. In soft psychology, a theory makes a directional prediction and is tested by checking whether the null hypothesis is rejected in the right direction. Consequently, improved experimental technique makes theories more likely to be falsified in physics but less likely to be falsified in soft psychology, since the null hypothesis there is always false: any two variables are correlated by a "crud factor" of about 0.30. The net effect is an accumulation of theories that remain unfalsified, but with no empirical evidence for preferring one over the others.

If a high proportion of tested hypotheses are false a priori, as philosopher of science Alexander Bird argues is the case in some fields, then low rates of replicability could be consistent with quality science. Relatedly, the expectation that most findings should replicate would be misguided and, according to Bird, a form of base rate fallacy. Bird's argument works as follows. Assume an idealized significance test in which the probability of incorrectly rejecting the null hypothesis is 5% (the Type I error rate) and the probability of correctly rejecting the null hypothesis is 80% (the power). In a context where a high proportion of tested hypotheses are false, the number of false positives can be high compared with the number of true positives. For example, if only 10% of tested hypotheses are actually true, the expected share of positive results that are false positives is 0.05 × 0.90 / (0.05 × 0.90 + 0.80 × 0.10) = 36%. The claim that the falsity of most tested hypotheses can explain low rates of replicability is even more relevant when considering that the average power of statistical tests in certain fields may be much lower than 80%: for example, the proportion of false positives rises to between 55.2% and 57.6% when calculated with the estimated average power of 34.1% to 36.4% for psychology studies provided by Stanley and colleagues in their analysis of 200 meta-analyses in the field. A high proportion of false positives would then result in many research findings being non-replicable. Bird notes that the claim that a majority of tested hypotheses are false a priori in certain scientific fields is plausible given factors such as the complexity of the phenomena under investigation, the fact that theories are seldom undisputed, the "inferential distance" between theories and hypotheses, and the ease with which hypotheses can be generated. The fields Bird takes as examples are clinical medicine, genetic and molecular epidemiology, and social psychology. The situation is radically different in fields where theories have an outstanding empirical basis and hypotheses can easily be derived from theories (e.g., experimental physics).

== Consequences ==