There are many proposed causes for the replication crisis.
=== Historical and sociological causes ===
The replication crisis may be triggered by the "generation of new data and scientific publications at an unprecedented rate" that leads to "desperation to publish or perish" and failure to adhere to good scientific practice. Predictions of an impending crisis in the quality-control mechanism of science can be traced back several decades. Derek de Solla Price, considered the father of scientometrics (the quantitative study of science), predicted in his 1963 book Little Science, Big Science that science could reach "senility" as a result of its own exponential growth. Some present-day literature seems to vindicate this "overflow" prophecy, lamenting the decay in both attention and quality.

Historian Philip Mirowski argues that the decline of scientific quality can be connected to its commodification, especially spurred by major corporations' profit-driven decisions to outsource their research to universities and contract research organizations. Social systems theory, as expounded in the work of German sociologist Niklas Luhmann, inspires a similar diagnosis. This theory holds that each system, such as the economy, science, religion, and the media, communicates using its own code: true and false for science, profit and loss for the economy, news and no-news for the media, and so on. According to some sociologists, science's mediatization and commodification, consequences of the structural coupling among systems, have led to a confusion of these original system codes.
=== Problems with the publication system in science ===
==== Publication bias ====
Publication bias, the tendency to publish only positive, significant results, creates the "file drawer effect", in which negative results remain unpublished. This produces a misleading literature and biased meta-analyses, and it discourages reporting on, or even attempting, replication studies. Among 1,576 researchers Nature surveyed in 2016, only a minority had ever attempted to publish a replication, and several respondents who had published failed replications noted that editors and reviewers demanded they play down comparisons with the original studies. Publication bias is augmented by the pressure to publish and the author's own confirmation bias, and is an inherent hazard in the field, requiring a certain degree of skepticism on the part of readers.
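The distorting effect of the file drawer on a literature can be illustrated with a short simulation. The sketch below uses purely illustrative numbers (and assumes Python with NumPy and SciPy available): many small studies of a weak true effect are generated, only the statistically significant ones are "published", and the average of the published estimates overstates the true effect.

```python
# Illustrative simulation (assumed numbers, not from any cited study): many small
# studies of a weak true effect are run, but only the statistically significant
# ones are "published", so the average published effect overstates the truth.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
TRUE_EFFECT, N_PER_STUDY, N_STUDIES = 0.15, 30, 5000

published, all_estimates = [], []
for _ in range(N_STUDIES):
    sample = rng.normal(TRUE_EFFECT, 1.0, N_PER_STUDY)
    estimate = sample.mean()
    all_estimates.append(estimate)
    if stats.ttest_1samp(sample, popmean=0.0).pvalue < 0.05:
        published.append(estimate)          # the file drawer keeps the rest

print(f"true effect:                 {TRUE_EFFECT}")
print(f"mean of all estimates:       {np.mean(all_estimates):.2f}")
print(f"mean of published estimates: {np.mean(published):.2f}")  # noticeably inflated
```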
==== Mathematical errors ====
Even high-impact journals have a significant fraction of mathematical errors in their use of statistics. For example, 11% of statistical results published in Nature and BMJ in 2001 were "incongruent", meaning that the reported p-value differed from what it should be if correctly recalculated from the reported test statistic. These errors likely arose from typesetting, rounding, and transcription mistakes. Among 157 neuroscience papers published in five top-ranking journals that attempted to show that two experimental effects are different, 78 erroneously tested instead whether one effect is significant while the other is not, and 79 correctly tested whether the difference between the two effects is significantly different from 0.
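Incongruence of this kind can be detected by recomputing the p-value implied by the reported test statistic and degrees of freedom and comparing it with the p-value printed in the paper. The following is a minimal sketch of that check (assuming SciPy; the reported numbers are hypothetical).

```python
# A minimal congruence check (hypothetical reported numbers): recompute the
# two-sided p-value from a reported t statistic and degrees of freedom and
# compare it with the p-value printed in the paper.
from scipy import stats

def recomputed_p(t_value, df):
    """Two-sided p-value implied by a t statistic with df degrees of freedom."""
    return 2 * stats.t.sf(abs(t_value), df)

reported_t, reported_df, reported_p = 2.20, 28, 0.05   # hypothetical values from a paper
p = recomputed_p(reported_t, reported_df)
print(f"recomputed p = {p:.3f}")                       # ~0.036
print("incongruent" if abs(p - reported_p) > 0.005 else "congruent")
```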
"Publish or perish" culture Academic "publish or perish" culture exacerbates publication bias. Intense pressure to publish in recognized journals, driven by hypercompetitive environments and bibliometric career evaluations, incentivizes researchers to prioritize publishable results over validity. According to Fanelli, this pushes scientists to employ a number of strategies aimed at making results "publishable". In the context of publication bias, this can mean adopting behaviors aimed at making results positive or statistically significant, often at the expense of their validity. Philosopher Brian D. Earp and psychologist Jim A. C. Everett argue that, although replication is in the best interests of academics and researchers as a group, features of academic psychological culture discourage replication by individual researchers. They argue that performing replications can be time-consuming, and take away resources from projects that reflect the researcher's original thinking. They are harder to publish, largely because they are unoriginal, and even when they can be published they are unlikely to be viewed as major contributions to the field. Replications "bring less recognition and reward, including grant money, to their authors". In his 1971 book
Scientific Knowledge and Its Social Problems, philosopher and historian of science
Jerome R. Ravetz predicted that science—in its progression from "little" science composed of isolated communities of researchers to "big" science or "techno-science"—would suffer major problems in its internal system of quality control. He recognized that the incentive structure for modern scientists could become dysfunctional, creating
perverse incentives to publish any findings, however dubious. According to Ravetz, quality in science is maintained only when there is a community of scholars, linked by a set of shared norms and standards, who are willing and able to hold each other accountable.
==== Standards of reporting ====
Certain publishing practices also make it difficult to conduct replications and to monitor the severity of the reproducibility crisis, because articles often come with insufficient descriptions for other scholars to reproduce the study. The Reproducibility Project: Cancer Biology showed that among 193 experiments from 53 top cancer papers published between 2010 and 2012, only 50 experiments from 23 papers had authors who provided enough information for researchers to redo the studies, sometimes with modifications. None of the 193 papers examined had its experimental protocols fully described, and replicating 70% of the experiments required asking the original authors for key reagents.
==== Procedural bias ====
By the Duhem-Quine thesis, scientific results are interpreted through both a substantive theory and a theory of instruments; for example, astronomical observations depend both on the theory of astronomical objects and the theory of telescopes. A large body of non-replicable research can accumulate if interpretation is biased in the following way: faced with a null result, a scientist prefers to treat the data as saying the instrument was insufficient, while faced with a non-null result, the scientist accepts the instrument as good and treats the data as saying something about the substantive theory.
==== Cultural evolution ====
Smaldino and McElreath proposed a simple model for the cultural evolution of scientific practice. Each lab randomly decides to produce novel research or replication research, with fixed characteristic levels of false positive rate, true positive rate, replication rate, and productivity (its "traits"). A lab may exert more "effort", making its ROC curve more convex but decreasing its productivity. A lab accumulates a score over its lifetime that increases with publications and decreases when another lab fails to replicate its results. At regular intervals, a random lab "dies" and another "reproduces", creating a child lab with traits similar to its parent's; labs with higher scores are more likely to reproduce. Under certain parameter settings, the population of labs converges to maximum productivity even at the price of very high false positive rates.
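A toy re-implementation can illustrate this selection dynamic. The sketch below is not the authors' model: the payoff rule, mutation step, and all parameter values are simplified assumptions chosen only to show how selection on publication counts can push false positive rates upward.

```python
# A toy re-implementation of the selection dynamic (not the authors' exact model):
# labs that tolerate higher false positive rates run more studies and obtain more
# publishable positives, so they are copied more often and the population drifts
# toward sloppier research. All parameter values are illustrative assumptions.
import random

random.seed(0)
N_LABS, GENERATIONS, BASE_RATE, POWER = 30, 1000, 0.1, 0.8

def studies_per_round(alpha):
    # Effort trade-off: lowering the false positive rate costs productivity.
    return max(1, int(10 * alpha / 0.05))

def payoff(alpha):
    # Score for one round = number of positive (hence publishable) results.
    score = 0
    for _ in range(studies_per_round(alpha)):
        is_true = random.random() < BASE_RATE
        if random.random() < (POWER if is_true else alpha):
            score += 1
    return score

labs = [0.05] * N_LABS                    # every lab starts at the nominal 5% level
for _ in range(GENERATIONS):
    scores = [payoff(a) for a in labs]
    dead, parent = scores.index(min(scores)), scores.index(max(scores))
    child = min(1.0, max(0.01, labs[parent] + random.gauss(0, 0.01)))
    labs[dead] = child                    # the low scorer is replaced by a mutated copy

print(f"mean false positive rate after selection: {sum(labs) / N_LABS:.2f}")
# Drifts well above the 0.05 starting point under these assumptions.
```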
=== Questionable research practices ===
Questionable research practices are behaviors that exploit researcher degrees of freedom (researcher DF), that is, choices in study design, data analysis, or reporting, to inflate false positive rates and undermine reproducibility.
==== Genesis ====
Researcher degrees of freedom arise at many stages: hypothesis formulation, design of experiments, data collection and analysis, and reporting of research. They exist because research design and data analysis entail numerous decisions that are not sufficiently constrained by a field's best practices and statistical methodologies. As a result, researcher DF can lead to situations where some failed replication attempts use a different, yet equally plausible, research design or statistical analysis; such failures do not necessarily undermine the original findings.
Multiverse analysis, a method that draws inferences from all plausible data-processing pipelines, offers one solution to the problem of analytical flexibility. Sensitivity analysis similarly explores a range of modelling specifications to give a comprehensive view of how different analytical choices influence outcomes. Collaborative approaches can also compensate for questionable research practices: in multianalyst approaches, different analysts independently analyze the same data to address the same question. This collaborative validation fosters intellectual honesty, exposes questionable research practices, and leads to more reliable and robust scientific conclusions.
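As an illustration of the multiverse idea, the sketch below (with made-up data and an assumed set of defensible preprocessing choices) runs the same group comparison under every combination of analytic choices and reports the full set of p-values rather than a single preferred one.

```python
# A minimal multiverse-analysis sketch on simulated data: the same hypothesis test
# is run across every plausible combination of analytic choices, and the full
# distribution of p-values is reported instead of a single cherry-picked result.
from itertools import product
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(0.2, 1.0, 120)   # hypothetical raw outcome data
group_b = rng.normal(0.0, 1.0, 120)

def preprocess(x, drop_outliers, log_transform):
    if drop_outliers:                 # one defensible choice: trim |z| > 3
        x = x[np.abs((x - x.mean()) / x.std()) < 3]
    if log_transform:                 # another: log-transform after shifting positive
        x = np.log(x - x.min() + 1)
    return x

results = []
for drop, log_t, welch in product([False, True], repeat=3):
    a, b = preprocess(group_a, drop, log_t), preprocess(group_b, drop, log_t)
    p = stats.ttest_ind(a, b, equal_var=not welch).pvalue
    results.append(((drop, log_t, welch), p))

for spec, p in results:               # report every specification, not just the best
    print(spec, f"p = {p:.3f}")
```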
==== In medicine ====
Irreproducible medical studies commonly share several characteristics: investigators are not blinded to the experimental versus the control arm; experiments are not repeated; positive and negative controls are lacking; not all data are reported; statistical tests are used inappropriately; and reagents are used that have not been appropriately validated.
==== In AI research ====
In machine learning research, a range of questionable practices have emerged under intense pressure to achieve state-of-the-art benchmark results. Common questionable evaluation practices include "benchmark overfitting" by repeatedly tuning hyperparameters on held-out test sets, selectively reporting the best of multiple random seeds or experimental runs, and "metric hacking" through unreported post hoc decisions, such as the choice of tokenization or evaluation scripts, that inflate scores. In October 2024, Communications of the ACM published a peer-reviewed critique of a 2021 Nature paper by researchers from Google. The critique described "a smorgasbord of questionable practices in ML, including irreproducible research practices, multiple variants of cherry-picking, misreporting, and likely data contamination (leakage)", with a reference to Leech et al. Specific execution times for the proposed approach and for prior techniques on individual test cases were not disclosed in the paper, but later studies found that Google's method ran orders of magnitude more slowly than the Cadence Design Systems tools available at the same time. Cross-checking such studies, the critique highlighted questionable practices in the Nature paper including (i) selective reporting of results on only a subset of benchmarks, (ii) comparisons against weaker baseline methods, (iii) use of inconsistent evaluation metrics, and (iv) undisclosed use of commercial software data that was only admitted years after the original work. Additionally, the research may have suffered from data leakage: the separation of training and testing data could not be verified from the published data, which is considered a significant flaw in experimental design.
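A small simulation illustrates why one of the practices described above, reporting only the best of several random seeds, inflates benchmark scores. The numbers below are purely illustrative assumptions.

```python
# Illustrative simulation (assumed numbers): if a model's true benchmark score
# is 70.0 with run-to-run noise, reporting the best of several seeds instead of
# the mean systematically overstates performance.
import random

random.seed(0)
TRUE_SCORE, NOISE_SD, N_SEEDS, N_PAPERS = 70.0, 1.0, 10, 1000

best_of_seeds, mean_of_seeds = [], []
for _ in range(N_PAPERS):
    runs = [random.gauss(TRUE_SCORE, NOISE_SD) for _ in range(N_SEEDS)]
    best_of_seeds.append(max(runs))              # the "state-of-the-art" number reported
    mean_of_seeds.append(sum(runs) / N_SEEDS)    # the honest summary

print(f"mean of per-paper best seeds: {sum(best_of_seeds) / N_PAPERS:.2f}")   # ~71.5
print(f"mean of per-paper seed means: {sum(mean_of_seeds) / N_PAPERS:.2f}")   # ~70.0
```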
==== Prevalence ====
According to Indiana University professor Ernest O'Boyle and psychologist Martin Götz, around 50% of researchers surveyed across various studies admitted to HARKing (hypothesizing after the results are known). In a survey of 2,000 psychologists by behavioral scientist Leslie K. John and colleagues, around 94% admitted having employed at least one questionable research practice. More specifically, 63% admitted to failing to report all of a study's dependent measures, 28% to failing to report all of a study's conditions, and 46% to selectively reporting studies that produced the desired pattern of results. In addition, 56% admitted to having collected more data after inspecting already collected data, and 16% to having stopped data collection because the desired result was already visible. The methodology used to estimate the prevalence of questionable research practices has been contested, and more recent studies have suggested lower rates on average.
==== Fraud ====
Questionable research practices are considered a separate category from more explicit violations of scientific integrity, such as data falsification. Prominent examples (see also List of scientific misconduct incidents) include the scientific fraud committed by social psychologist Diederik Stapel. In March 2024, Harvard Business School's investigative committee, after reviewing a nearly 1,300-page report unsealed during Professor Francesca Gino's $25 million lawsuit against Harvard and the Data Colada bloggers, found that Gino "committed research misconduct intentionally, knowingly, or recklessly" by falsifying data in four published studies. The report documented that Gino had altered participant responses, including changing 104 moral-impurity ratings (flipping low values to high in one experimental condition and vice versa in another) and manipulating four networking-intentions items, for a total of 168 modified observations, to make the data conform to her hypotheses. She also engaged in selective reporting by publishing only the "Posted" dataset while omitting the original Qualtrics archives, and she misrepresented the provenance of her data by attributing discrepancies to alleged third-party tampering rather than to deliberate changes by her research team. In June 2023 Gino was placed on unpaid administrative leave, and in May 2025 Harvard revoked her tenure, the first such action in roughly 80 years, citing egregious violations of academic integrity.
=== Statistical issues ===
==== Low statistical power ====
Low statistical power hinders replication for three reasons: (1) low-power replications have a reduced ability to detect true effects, (2) low-power original studies produce biased effect size estimates, leading to undersized replications, and (3) low-power original studies yield results that are unlikely to reflect true effects. The statistical power of studies of event-related potentials, for example, has been estimated at .72‒.98 for large effect sizes, .35‒.73 for medium effects, and .10‒.18 for small effects. Meta-scientist John Ioannidis and colleagues estimated the average power of empirical economic research, finding a median power of 18% in a literature of 6,700 studies. In light of these results, it is plausible that a major reason for widespread replication failure in several scientific fields is very low average statistical power. The same statistical test at the same significance level has lower power when the effect size under the alternative hypothesis is small. Complex heritable traits are typically correlated with a large number of genes, each of small effect size, so high power requires a large sample size. In particular, many results from the candidate gene literature suffered from small effect sizes and small sample sizes and did not replicate. Larger datasets from genome-wide association studies (GWAS) come close to solving this problem. As a numeric example, most genes associated with schizophrenia risk have a low effect size (genotypic relative risk, GRR). A statistical study with 1,000 cases and 1,000 controls has only 0.03% power for a gene with GRR = 1.15, which is already large for schizophrenia, whereas the largest GWAS conducted to date have essentially 100% power for it.
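The kind of power calculation behind such figures can be sketched as follows. The allele frequency, odds ratio, and genome-wide significance threshold below are assumptions chosen for illustration, and the two-proportion z-test is only a rough approximation of the tests actually used in GWAS.

```python
# An illustrative power calculation (assumed numbers: risk allele frequency 0.30 in
# controls, allelic odds ratio 1.15, genome-wide significance alpha = 5e-8), using a
# two-proportion z-test approximation with 2N alleles per group. It is a sketch, not
# the exact method used in the studies cited above.
from math import sqrt
from scipy.stats import norm

ALPHA = 5e-8                      # genome-wide significance threshold
P_CTRL, ODDS_RATIO = 0.30, 1.15
P_CASE = ODDS_RATIO * P_CTRL / (1 + P_CTRL * (ODDS_RATIO - 1))   # ~0.330

def power(n_cases):
    n = 2 * n_cases               # allele counts, assuming equally many controls
    se = sqrt(P_CASE * (1 - P_CASE) / n + P_CTRL * (1 - P_CTRL) / n)
    z_crit = norm.isf(ALPHA / 2)  # two-sided critical value
    return norm.sf(z_crit - (P_CASE - P_CTRL) / se)

print(f"power with 1,000 cases and controls:   {power(1_000):.5f}")    # a few hundredths of a percent
print(f"power with 100,000 cases and controls: {power(100_000):.3f}")  # essentially 1
```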
==== Positive effect size bias ====
Even when a study replicates, the replication typically finds a smaller effect size, because underpowered original studies produce a large positive bias in estimated effect sizes. In 2009, two analyses noted that fMRI studies had a suspicious number of positive results with large effect sizes, more than would be expected given the studies' low power (one example had only 13 subjects). They pointed out that over half of the studies tested for correlations between a phenomenon and individual fMRI voxels, and reported only the voxels exceeding chosen thresholds.

==== Optional stopping ====
Optional stopping is a practice in which one collects data until some stopping criterion is reached. Though a valid procedure, it is easily misused: the p-value of an optionally stopped statistical test is larger than it seems. Intuitively, this is because the p-value is supposed to sum the probabilities of all events at least as rare as the one observed. With optional stopping, there are even rarer events that are difficult to account for, namely not triggering the stopping rule and collecting even more data before stopping. Neglecting these events yields a p-value that is too low. In fact, if the null hypothesis is true, any significance level can be reached if one is allowed to keep collecting data and stop when the desired p-value (calculated as if one had always planned to collect exactly this much data) is obtained; for a concrete example of testing a fair coin, see p-value#optional stopping. More succinctly, the proper calculation of a p-value requires accounting for counterfactuals, that is, what the experimenter could have done in reaction to data that might have been observed, and accounting for what might have been is hard even for honest researchers. The problem of early stopping is not limited to researcher misconduct: there is often pressure to stop early when the cost of collecting data is high, and some animal ethics boards even mandate early stopping if the study obtains a significant result midway.
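A short simulation shows how strongly optional stopping inflates the false positive rate when the data are "peeked at" after every batch. All numbers are illustrative assumptions; the sketch assumes NumPy and SciPy.

```python
# A minimal simulation of the optional-stopping problem: data are generated under a
# true null hypothesis, a one-sample t-test is run after every new batch of
# observations, and collection stops as soon as p < .05. The nominal 5% error rate
# is badly inflated. Numbers here are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N_SIMULATIONS, BATCH, MAX_N = 2000, 10, 200

false_positives = 0
for _ in range(N_SIMULATIONS):
    data = []
    while len(data) < MAX_N:
        data.extend(rng.normal(0.0, 1.0, BATCH))       # null is true: the mean really is 0
        if stats.ttest_1samp(data, popmean=0.0).pvalue < 0.05:
            false_positives += 1                       # peeked, and stopped on "significance"
            break

print(f"false positive rate with optional stopping: {false_positives / N_SIMULATIONS:.2f}")
# Well above the nominal .05, and it keeps growing as MAX_N increases.
```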
==== Heterogeneity ====
Replications of the same effect often find effect sizes that differ from the original; this variation can be due to differences in experimental methods, populations, cohorts, and statistical methods between replication studies. Such heterogeneity poses a challenge to studies attempting to replicate previously found effect sizes: when heterogeneity is high, subsequent replications have a high probability of finding an effect size radically different from that of the original study. Importantly, significant levels of heterogeneity are also found in direct or exact replications of a study. Stanley and colleagues discuss this while reporting a study by quantitative behavioral scientist Richard Klein and colleagues, in which the authors attempted to replicate 15 psychological effects across 36 different sites in Europe and the U.S. Klein and colleagues found significant amounts of heterogeneity in 8 out of 16 effects (I² = 23% to 91%), and although the replication sites intentionally differed on a variety of characteristics, such differences could account for very little of the heterogeneity. According to Stanley and colleagues, this suggests that heterogeneity could be a genuine characteristic of the phenomena under investigation; for instance, phenomena might be influenced by so-called "hidden moderators", relevant factors that were previously not understood to be important in producing a certain effect. In their analysis of 200 meta-analyses of psychological effects, Stanley and colleagues found a median heterogeneity of I² = 74%. According to the authors, this level of heterogeneity can be considered "huge": it is three times larger than the random sampling variance of the effect sizes measured in their study. Considered alongside sampling error, such heterogeneity yields a standard deviation from one study to the next that is even larger than the median effect size of the 200 meta-analyses they investigated. The authors conclude that if replication is defined as a subsequent study finding an effect size sufficiently similar to the original, replication success is unlikely even when replications have very large sample sizes, and this holds even for direct or exact replications, since heterogeneity remains relatively high in those cases.
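The consequence of such heterogeneity for replication can be sketched with a simple random-effects simulation. The mean effect and between-study standard deviation below are illustrative assumptions, not Stanley and colleagues' estimates; sampling error is ignored entirely, yet "exact" replications still frequently disagree with the original.

```python
# A sketch (assumed numbers) of why high heterogeneity limits replication even with
# huge samples: each study's true effect is drawn around a common mean with
# between-study SD tau, so two exact replications can disagree markedly even when
# within-study sampling error is negligible.
import numpy as np

rng = np.random.default_rng(2)
MEAN_EFFECT, TAU, N_PAIRS = 0.20, 0.25, 100_000   # between-study SD larger than the mean effect

original = rng.normal(MEAN_EFFECT, TAU, N_PAIRS)      # true effect in the original study
replication = rng.normal(MEAN_EFFECT, TAU, N_PAIRS)   # true effect in an "exact" replication
close = np.abs(replication - original) < 0.10         # "success" = a similar effect size

print(f"share of replications within ±0.10 of the original: {close.mean():.2f}")
```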
==== Others ====
Within economics, the replication crisis may also be exacerbated because econometric results are fragile: using different but plausible estimation procedures or data preprocessing techniques can lead to conflicting results.
=== Context sensitivity ===
New York University professor Jay Van Bavel and colleagues argue that a further reason findings are difficult to replicate is the sensitivity of certain psychological effects to context. On this view, failures to replicate might be explained by contextual differences between the original experiment and the replication, often called "hidden moderators". Van Bavel and colleagues tested the influence of context sensitivity by reanalyzing the data of the widely cited Reproducibility Project carried out by the Open Science Collaboration.

A related concern involves null hypothesis significance testing itself. It has been argued that in complex systems, such as those studied by social psychology, "the null hypothesis is always false", or "everything is correlated". If so, then a failure to reject the null hypothesis does not show that the null hypothesis is true, but merely that the test produced a false negative, typically due to low power. Low power is especially prevalent in subject areas where effect sizes are small and data are expensive to acquire, such as social psychology. Furthermore, when the null hypothesis is rejected, this might not be evidence for the substantive alternative hypothesis. In soft sciences, many hypotheses can predict a correlation between two variables, so evidence against the null hypothesis "there is no correlation" is not evidence for any one of the many alternative hypotheses that equally well predict "there is a correlation". Fisher developed null hypothesis significance testing for agronomy, where rejecting the null hypothesis is usually good proof of the alternative hypothesis, since there are few candidates: rejecting "fertilizer does not help" is evidence for "fertilizer helps". In psychology, however, there are many alternative hypotheses for every null hypothesis. In particular, when statistical studies on extrasensory perception reject the null hypothesis at an extremely low p-value (as in the case of Daryl Bem), this does not imply the alternative hypothesis "ESP exists"; far more likely is that a small non-ESP signal in the experimental setup was measured precisely.
Paul Meehl noted that statistical hypothesis testing is used differently in "soft" psychology (personality, social, and so on) than in physics. In physics, a theory makes a quantitative prediction and is tested by checking whether the prediction falls within the statistically measured interval. In soft psychology, a theory makes a directional prediction and is tested by checking whether the null hypothesis is rejected in the right direction. Consequently, improved experimental technique makes theories more likely to be falsified in physics but less likely to be falsified in soft psychology, since the null hypothesis there is always false: any two variables are correlated by a "crud factor" of about 0.30. The net effect is an accumulation of theories that remain unfalsified, but with no empirical evidence for preferring one over the others.

If a high proportion of tested hypotheses are false a priori, as philosopher of science Alexander Bird argues is the case in some fields, then low rates of replicability could be consistent with quality science. Relatedly, the expectation that most findings should replicate would be misguided and, according to Bird, a form of base rate fallacy. Bird's argument works as follows. Assume an idealized significance test in which the probability of incorrectly rejecting the null hypothesis is 5% (the Type I error rate) and the probability of correctly rejecting the null hypothesis is 80% (the power). In a context where a high proportion of tested hypotheses are false, the number of false positives can be high compared with the number of true positives. For example, if only 10% of tested hypotheses are actually true, the expected share of positive results that are false positives is 0.05 × 0.90 / (0.05 × 0.90 + 0.80 × 0.10) = 36%. The claim that the falsity of most tested hypotheses can explain low rates of replicability is even more relevant when considering that the average power of statistical tests in certain fields may be much lower than 80%: for example, the proportion of false positives rises to between 55.2% and 57.6% when calculated with the estimated average power of 34.1% to 36.4% for psychology studies provided by Stanley and colleagues in their analysis of 200 meta-analyses in the field. A high proportion of false positives would then result in many research findings being non-replicable. Bird notes that the claim that a majority of tested hypotheses are false a priori in certain scientific fields is plausible given factors such as the complexity of the phenomena under investigation, the fact that theories are seldom undisputed, the "inferential distance" between theories and hypotheses, and the ease with which hypotheses can be generated. The fields Bird takes as examples are clinical medicine, genetic and molecular epidemiology, and social psychology. The situation is radically different in fields where theories have an outstanding empirical basis and hypotheses can easily be derived from theories (e.g., experimental physics).

== Consequences ==