New statistical metrics for multisite replication projects

Published on Jun 1, 2020 in Journal of the Royal Statistical Society: Series A (Statistics in Society) · DOI: 10.1111/RSSA.12572
Maya B. Mathur (Stanford University), Tyler J. VanderWeele (Harvard University)
Increasingly, researchers are attempting to replicate published original studies using large, multisite replication projects, at least 134 of which have been completed or are ongoing. These designs are promising for assessing whether the original study is statistically consistent with the replications and for reassessing the strength of evidence for the scientific effect of interest. However, existing analyses generally focus on single replications; when applied to multisite designs, they provide an incomplete view of aggregate evidence and can lead to misleading conclusions about replication success. We propose new statistical metrics representing, first, the probability that the original study's point estimate would be at least as extreme as it actually was if the original study were in fact statistically consistent with the replications and, second, the estimated proportion of population effects agreeing in direction with the original study. Generalized versions of the second metric enable consideration of only meaningfully strong population effects that agree in direction, or alternatively that disagree in direction, with the original study. These metrics apply when there are at least 10 replications (unless the heterogeneity estimate τ̂ = 0, in which case they apply regardless of the number of replications). The first metric assumes normal population effects but appears robust to violations in simulations; the second is distribution-free. We provide R packages (Replicate and MetaUtility) implementing the metrics.
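Under the normal specification described in the abstract, the first metric is a two-sided tail probability that accounts for heterogeneity and for the uncertainty of both the original estimate and the pooled replication estimate. Below is a minimal parametric sketch in Python; the function names and numbers are illustrative, not the Replicate/MetaUtility API, and the published proportion metric uses a calibrated, distribution-free estimator rather than this plain normal plug-in:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_orig(yo, se_orig, mu_hat, se_mu, tau_hat):
    """Two-sided probability that the original point estimate yo would be
    at least as extreme as observed, if it were drawn from the same
    distribution as the replications (normal population effects assumed)."""
    denom = sqrt(tau_hat**2 + se_orig**2 + se_mu**2)
    return 2 * (1 - phi(abs(yo - mu_hat) / denom))

def prop_above(q, mu_hat, tau_hat):
    """Parametric estimate of the proportion of population effects above a
    threshold q (e.g., q = 0 for direction agreement with a positive
    original estimate). Requires tau_hat > 0."""
    return 1 - phi((q - mu_hat) / tau_hat)

# Hypothetical numbers: original estimate 0.5 (SE 0.15); pooled replication
# estimate 0.1 (SE 0.05); heterogeneity estimate tau-hat = 0.2.
p = p_orig(0.5, 0.15, 0.1, 0.05, 0.2)   # small p suggests inconsistency
prop = prop_above(0.0, 0.1, 0.2)        # share of true effects above zero
```

A small `p_orig` indicates the original estimate is surprisingly extreme relative to the replication distribution, while `prop_above` summarizes how many population effects point the same way as the original.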
#1 Maya B. Mathur (Stanford University)
#2 Tyler J. VanderWeele (Harvard University)
We recently suggested new statistical metrics for routine reporting in random-effects meta-analyses to convey evidence strength for scientifically meaningful effects under effect heterogeneity. First, given a chosen threshold of meaningful effect size, we suggested reporting the estimated proportion of true effect sizes above this threshold. Second, we suggested reporting the proportion of effect sizes below a second, possibly symmetric, threshold in the opposite direction from the estimated mea...
10 Citations
#1 Isaiah Andrews (Harvard University)
#2 Maximilian Kasy (University of California, Berkeley)
Some empirical results are more likely to be published than others. Such selective publication leads to biased estimates and distorted inference. This paper proposes two approaches for identifying the conditional probability of publication as a function of a study's results, the first based on systematic replication studies and the second based on meta-studies. For known conditional publication probabilities, we propose median-unbiased estimators and associated confidence sets that correct for s...
69 Citations
#1 Maya B. Mathur (Stanford University)
#2 Tyler J. VanderWeele (Harvard University)
We provide two simple metrics that could be reported routinely in random-effects meta-analyses to convey evidence strength for scientifically meaningful effects under effect heterogeneity (i.e., a nonzero estimated variance of the true effect distribution). First, given a chosen threshold of meaningful effect size, meta-analyses could report the estimated proportion of true effect sizes above this threshold. Second, meta-analyses could estimate the proportion of effect sizes below a second, poss...
21 Citations
#1 Chia-Chun Wang (NTU: National Taiwan University)
#2 Wen-Chung Lee (NTU: National Taiwan University)
A systematic review and meta-analysis is an important step in evidence synthesis. The current paradigm for meta-analyses requires a presentation of the means under a random-effects model; however, a mean with a confidence interval provides an incomplete summary of the underlying heterogeneity in meta-analysis. Prediction intervals show the range of true effects in future studies and have been advocated to be regularly presented. Most commonly, prediction intervals are estimated assuming that t...
20 Citations
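The prediction-interval idea summarized above has a simple normal-theory form: widen the pooled estimate's confidence interval by the estimated heterogeneity. A minimal sketch with hypothetical numbers follows; for brevity it uses a normal critical value, whereas standard practice uses a t quantile with k − 2 degrees of freedom for k studies:

```python
from math import sqrt

def prediction_interval(mu_hat, se_mu, tau_hat, crit=1.96):
    """Approximate prediction interval for the true effect in a new study:
    pooled estimate +/- crit * sqrt(tau^2 + SE(mu_hat)^2)."""
    half = crit * sqrt(tau_hat**2 + se_mu**2)
    return (mu_hat - half, mu_hat + half)

# Hypothetical values: pooled estimate 0.1 (SE 0.05), tau-hat = 0.2
lo, hi = prediction_interval(0.1, 0.05, 0.2)
```

Note how the interval is driven mostly by τ̂ when heterogeneity is large, which is exactly why a mean with a confidence interval alone understates the spread of true effects.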
#1 David A. Kenny (UConn: University of Connecticut)
#2 Charles M. Judd (CU: University of Colorado Boulder)
Repeated investigations of the same phenomenon typically yield effect sizes that vary more than one would expect from sampling error alone. Such variation is found even in exact replication studies, suggesting that it is due not only to identifiable moderators but also to subtler random variation across studies. Such heterogeneity of effect sizes is typically ignored, with unfortunate consequences. We consider its implications for power analyses, the precision of estimated effects, and the...
47 Citations
#1 Richard A. Klein (UGA: University of Grenoble)
#2 Michelangelo Vianello (UNIPD: University of Padua)
Last. Brian A. Nosek (Center for Open Science)
view all 190 authors...
We conducted preregistered replications of 28 classic and contemporary published findings, with protocols that were peer reviewed in advance, to examine variation in effect magnitudes across samples and settings. Each protocol was administered to approximately half of 125 samples that comprised 15,305 participants from 36 countries and territories. Using the conventional criterion of statistical significance (p < .05), we found that 15 (54%) of the replications provided evidence of a statistical...
265 Citations
#1 Daniel Lakens (TU/e: Eindhoven University of Technology)
#2 Anne M. Scheel (TU/e: Eindhoven University of Technology)
Last. Peder M. Isager (TU/e: Eindhoven University of Technology)
view all 3 authors...
Psychologists must be able to test both for the presence of an effect and for the absence of an effect. In addition to testing against zero, researchers can use the two one-sided tests (TOST) procedure to test for equivalence and reject the presence of a smallest effect size of interest (SESOI). The TOST procedure can be used to determine if an observed effect is surprisingly small, given that a true effect at least as extreme as the SESOI exists. We explain a range of approaches to determine th...
343 Citations
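The TOST logic described above can be sketched for a simple z-statistic: equivalence is declared when the estimate is significantly above the lower equivalence bound and significantly below the upper one. This is an illustrative toy with hypothetical bounds and numbers, not the authors' implementation:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def tost_z(est, se, low, high, alpha=0.05):
    """Two one-sided z-tests against equivalence bounds [low, high].
    Returns (p_lower, p_upper, equivalent)."""
    p_lower = 1 - phi((est - low) / se)   # H0: true effect <= low
    p_upper = phi((est - high) / se)      # H0: true effect >= high
    return p_lower, p_upper, max(p_lower, p_upper) < alpha

# Hypothetical: estimate 0.02 (SE 0.05), SESOI bounds of +/- 0.2
p_lo, p_hi, equivalent = tost_z(0.02, 0.05, -0.2, 0.2)
```

Both one-sided p-values must be below alpha for equivalence, so the procedure is conservative by construction: a large estimate near either bound fails one of the two tests.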
#1 Kenneth Rice (UW: University of Washington)
#2 Julian P. T. Higgins (UoB: University of Bristol)
Last. Thomas Lumley (University of Auckland)
view all 3 authors...
Meta-analysis is a common tool for synthesizing results of multiple studies. Among methods for performing meta-analysis, the approach known as 'fixed effects' or 'inverse variance weighting' is popular and widely used. A common interpretation of this method is that it assumes that the underlying effects in contributing studies are identical, and for this reason it is sometimes dismissed by practitioners. However, other interpretations of fixed effects analyses do not make this assump...
86 Citations
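The inverse-variance-weighted ("fixed effects") pooled estimate discussed above is itself straightforward: each study is weighted by the reciprocal of its squared standard error. A minimal sketch with hypothetical study estimates:

```python
def inverse_variance_pool(estimates, ses):
    """Fixed-effects (inverse-variance-weighted) pooled estimate and its SE."""
    weights = [1 / se**2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    return pooled, pooled_se

# Three hypothetical studies: the two precise studies (SE 0.1) dominate
est, se = inverse_variance_pool([0.3, 0.1, 0.2], [0.1, 0.1, 0.2])  # est = 0.2
```

The weighting makes precise studies dominate the pooled estimate, which is the behavior the interpretations debated in the paper above all share regardless of whether the underlying effects are assumed identical.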
#2 Peter P. J. L. Verkoeijen (Avans University of Applied Sciences)
Last. Conny Wollbrant
view all 56 authors...
In an anonymous 4-person economic game, participants contributed more money to a common project (i.e., cooperated) when required to decide quickly than when forced to delay their decision (Rand, Greene & Nowak, 2012), a pattern consistent with the social heuristics hypothesis proposed by Rand and colleagues. The results of studies using time pressure have been mixed, with some replication attempts observing similar patterns (e.g., Rand et al., 2014) and others observing null effects (e.g., Tingh...
79 Citations
Ebersole et al.'s (2016) attempt to replicate Monin and Miller (2001) raises important questions about choosing beforehand which statistical test is the target of a replication. While our original theory a priori only predicted a main effect of the credentials manipulation, we had observed in the study reproduced here an unexpected interaction with participant gender. The current paper fails to replicate this originally unpredicted interaction, which it initially codes as a failure (Tab...
3 Citations
Cited By 4
#1 Charles R. Ebersole
#4 Diane-Jo Bart-Plange (UVA: University of Virginia)
Last. Brian A. Nosek (UVA: University of Virginia)
view all 172 authors...
Replication studies in psychological science sometimes fail to reproduce prior findings. If these studies use methods that are unfaithful to the original study or ineffective in eliciting the pheno...
11 Citations
#1 Maya B. Mathur (Stanford University)
#2 Diane-Jo Bart-Plange (UVA: University of Virginia)
Last. Alan Jern (RHIT: Rose-Hulman Institute of Technology)
view all 41 authors...
Risen and Gilovich (2008) found that subjects believed that "tempting fate" would be punished with ironic bad outcomes (a main effect), and that this effect was magnified when subjects were under c...
1 Citation
#1 Samuel Pawel (UZH: University of Zurich)
#2 Leonhard Held
There is an urgent need to develop new methodology for the design and analysis of replication studies. Recently, a reverse-Bayes method called the sceptical p-value has been proposed for this purpose; the inversion of Bayes' theorem allows us to mathematically formalise the notion of scepticism, which in turn can be used to assess the agreement between the findings of an original study and its replication. However, despite its Bayesian nature, the method relies on tail probabilities as primary...
1 Citation
#1 Jane L. Hutton (Warw.: University of Warwick)
#2 Peter J. Diggle (Lancaster University)
Last. Leonhard Held
view all 30 authors...
1 Citation