Many Labs 2: Investigating Variation in Replicability Across Samples and Settings

Published on Dec 24, 2018 · DOI: 10.1177/2515245918810225
Richard A. Klein · Estimated H-index: 8
Michelangelo Vianello (UNIPD: University of Padua) · Estimated H-index: 16
+ 187 Authors
Brian A. Nosek (Center for Open Science) · Estimated H-index: 93
We conducted preregistered replications of 28 classic and contemporary published findings, with protocols that were peer reviewed in advance, to examine variation in effect magnitudes across samples and settings. Each protocol was administered to approximately half of 125 samples that comprised 15,305 participants from 36 countries and territories. Using the conventional criterion of statistical significance (p < .05), we found that 15 (54%) of the replications provided evidence of a statistically significant effect in the same direction as the original finding. With a strict significance criterion (p < .0001), 14 (50%) of the replications still provided such evidence, a reflection of the extremely high-powered design. Seven (25%) of the replications yielded effect sizes larger than the original ones, and 21 (75%) yielded effect sizes smaller than the original ones. The median comparable Cohen's ds were 0.60 for the original findings and 0.15 for the replications. The effect sizes were small (< 0.20) in 16 of the replications (57%), and 9 effects (32%) were in the direction opposite that of the original effect. Across settings, the Q statistic indicated significant heterogeneity in 11 (39%) of the replication effects, and most of those were among the findings with the largest overall effect sizes; only 1 effect that was near zero in the aggregate showed significant heterogeneity according to this measure. Only 1 effect had a tau value greater than .20, indicating moderate heterogeneity. Eight others had tau values near or slightly above .10, indicating slight heterogeneity. Moderation tests indicated that very little heterogeneity was attributable to the order in which the tasks were performed or to whether the tasks were administered in the lab or online.
Exploratory comparisons revealed little heterogeneity between Western, educated, industrialized, rich, and democratic (WEIRD) cultures and less WEIRD cultures (i.e., cultures with relatively high and low WEIRDness scores, respectively). Cumulatively, variability in the observed effect sizes was attributable more to the effect being studied than to the sample or setting in which it was studied.
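The heterogeneity measures reported in the abstract, Cochran's Q and the between-sample standard deviation tau, can be sketched for a single effect as follows. This is a minimal illustration using the DerSimonian-Laird estimator and entirely hypothetical per-site effect sizes, not Many Labs 2 data:

```python
import math

# Per-site effect sizes (Cohen's d) and standard errors.
# Hypothetical numbers for illustration; not Many Labs 2 data.
d = [0.12, 0.18, 0.09, 0.25, 0.05, 0.15]
se = [0.08, 0.07, 0.09, 0.08, 0.10, 0.07]

w = [1 / s**2 for s in se]  # inverse-variance (fixed-effect) weights
d_fe = sum(wi * di for wi, di in zip(w, d)) / sum(w)  # pooled estimate

# Cochran's Q: weighted squared deviations from the pooled estimate
Q = sum(wi * (di - d_fe) ** 2 for wi, di in zip(w, d))
k = len(d)

# DerSimonian-Laird estimator of the between-site variance tau^2
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (Q - (k - 1)) / c)
tau = math.sqrt(tau2)

print(f"Q = {Q:.2f} (df = {k - 1}), tau = {tau:.3f}")
```

Under the null of homogeneity, Q is approximately chi-squared with k - 1 degrees of freedom; tau is on the same scale as the effect size itself, which is why the paper can read values near .10 as slight and above .20 as moderate heterogeneity.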
#1 Joseph Henrich (UBC: University of British Columbia) · H-Index: 87
#2 Steven J. Heine (UBC: University of British Columbia) · H-Index: 60
Last: Ara Norenzayan (UBC: University of British Columbia) · H-Index: 53
Behavioral scientists routinely publish broad claims about human psychology and behavior in the world's top journals based on samples drawn entirely from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies. Researchers - often implicitly - assume that either there is little variation across human populations, or that these "standard subjects" are as representative of the species as any other population. Are these assumptions justified? Here, our review of the comparative da...
3,710 Citations
#1 Eskil Forsell · H-Index: 6
#2 Domenico Viganola (HHS: Stockholm School of Economics) · H-Index: 3
Last: Anna Dreber (HHS: Stockholm School of Economics) · H-Index: 39
Abstract Understanding and improving reproducibility is crucial for scientific progress. Prediction markets and related methods of eliciting peer beliefs are promising tools to predict replication outcomes. We invited researchers in the field of psychology to judge the replicability of 24 studies replicated in the large scale Many Labs 2 project. We elicited peer beliefs in prediction markets and surveys about two replication success metrics: the probability that the replication yields a statist...
29 Citations
#1 Colin F. Camerer (California Institute of Technology) · H-Index: 129
#2 Anna Dreber (HHS: Stockholm School of Economics) · H-Index: 39
Last: Hang Wu (HIT: Harbin Institute of Technology)
Being able to replicate scientific findings is crucial for scientific progress. We replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science between 2010 and 2015. The replications follow analysis plans reviewed by the original authors and pre-registered prior to the replications. The replications are high powered, with sample sizes on average about five times higher than in the original studies. We find a significant effect in the same dire...
415 Citations
#1 Brian A. Nosek (UVA: University of Virginia) · H-Index: 93
#2 Charles R. Ebersole (UVA: University of Virginia) · H-Index: 12
Last: David Thomas Mellor (Center for Open Science) · H-Index: 11
Progress in science relies in part on generating hypotheses with existing observations and testing hypotheses with new observations. This distinction between postdiction and prediction is appreciated conceptually but is not respected in practice. Mistaking generation of postdictions with testing of predictions reduces the credibility of research findings. However, ordinary biases in human reasoning, such as hindsight bias, make it hard to avoid this mistake. An effective solution is to define th...
521 Citations
#1 Daniel J. Benjamin (SC: University of Southern California) · H-Index: 44
#2 James O. Berger (Duke University) · H-Index: 75
Last: Valen E. Johnson (SC: University of Southern California) · H-Index: 44
We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.
1,066 Citations
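As a minimal sketch of what the proposed threshold change would mean in practice: with the hypothetical p-values below (not drawn from any cited study), tightening the criterion from .05 to .005 reclassifies everything in the .005-.05 range as no longer a claimed discovery.

```python
# Hypothetical p-values for illustration; not drawn from any cited study.
p_values = [0.001, 0.004, 0.012, 0.03, 0.049, 0.2]

sig_05 = [p for p in p_values if p < 0.05]    # conventional threshold
sig_005 = [p for p in p_values if p < 0.005]  # proposed stricter threshold

print(f"significant at p < .05:  {len(sig_05)} of {len(p_values)}")
print(f"significant at p < .005: {len(sig_005)} of {len(p_values)}")
```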
#1 Benjamin E. Hilbig (MPG: Max Planck Society) · H-Index: 37
Although Web-based research is now commonplace, it continues to spur skepticism from reviewers and editors, especially whenever reaction times are of primary interest. Such persistent preconceptions are based on arguments referring to increased variation, the limits of certain software and technologies, and a noteworthy lack of comparisons (between Web and lab) in fully randomized experiments. To provide a critical test, participants were randomly assigned to complete a lexical decision task eit...
48 Citations
#1 Charles R. Ebersole (UVA: University of Virginia) · H-Index: 12
#2 Olivia E. Atherton (UC Davis: University of California, Davis) · H-Index: 10
Last: Brian A. Nosek (Center for Open Science) · H-Index: 93
Abstract The university participant pool is a key resource for behavioral research, and data quality is believed to vary over the course of the academic semester. This crowdsourced project examined time-of-semester variation in 10 known effects, 10 individual differences, and 3 data quality indicators over the course of the academic semester in 20 participant pools (N = 2,696) and with an online sample (N = 737). Weak time-of-semester effects were observed on data quality indicators, participan...
158 Citations
#1 Norbert Schwarz (SC: University of Southern California) · H-Index: 139
#2 Gerald L. Clore (UVA: University of Virginia) · H-Index: 70
The article discusses research being done on the use of effect-size estimates in testing psychological theories. It references the study "Small Telescopes: Detectability and the Evaluation of Replication Results," by U. Simonsohn published in the 2015 issue. The variables considered include the correlation between mood and weather, intensity of the mood, and the correlation between life satisfaction and marital satisfaction.
12 Citations
#1 Martin S. Hagger (Curtin University) · H-Index: 91
#2 Nikos L. D. Chatzisarantis (Curtin University) · H-Index: 71
Good self-control has been linked to adaptive outcomes such as better health, cohesive personal relationships, success in the workplace and at school, and less susceptibility to crime and addictions. In contrast, self-control failure is linked to maladaptive outcomes. Understanding the mechanisms by which self-control predicts behavior may assist in promoting better regulation and outcomes. A popular approach to understanding self-control is the strength or resource depletion model. Self-control...
460 Citations
Cited By: 240
Climate change is a complex phenomenon that the public learns about both abstractly, through media and education, and concretely, through personal experiences. While public beliefs about global warming may be controversial in some circles, an emerging body of research on the 'local warming' effect suggests that people's judgments of climate change or global warming are influenced by recent local temperatures. A meta-analysis including 31 observations across 82,952 participants derived from 17 paper...
#1 Jieying Chen (UM: University of Manitoba) · H-Index: 1
#2 Lok Ching Kwan (HKU: University of Hong Kong)
Last: Gilad Feldman (HKU: University of Hong Kong) · H-Index: 9
Abstract Hindsight bias refers to the tendency to perceive an event outcome as more probable after being informed of that outcome. We conducted very close replications of two classic experiments on hindsight bias and a conceptual replication testing hindsight bias regarding the perceived replicability of hindsight bias. In Study 1 (N = 890), we replicated Experiment 2 of Fischhoff (1975) and found support for hindsight bias in retrospective judgments (d_mean = 0.60). In Study 2 (N = 60...
1 Citation
#1 Qingzhou Sun (Zhejiang University of Technology)
#2 Evan Polman (UW: University of Wisconsin-Madison) · H-Index: 16
Last: Huanren Zhang · H-Index: 1
#1 Farid Anvari (TU/e: Eindhoven University of Technology) · H-Index: 5
#2 Daniel Lakens (TU/e: Eindhoven University of Technology) · H-Index: 31
Abstract Effect sizes are an important outcome of quantitative research, but few guidelines exist that explain how researchers can determine which effect sizes are meaningful. Psychologists often want to study effects that are large enough to make a difference to people's subjective experience. Thus, subjective experience is one way to gauge the meaningfulness of an effect. We propose and illustrate one method for how to quantify the smallest subjectively experienced difference—the smallest chan...
8 Citations
#1 Alexander Bowring · H-Index: 7
#2 Thomas E. Nichols (Warw.: University of Warwick) · H-Index: 88
Last: Camille Maumet · H-Index: 14
While the development of analytical tools and techniques has broadened our horizons for comprehending the complexities of the human brain, a growing body of research in the neuroimaging literature has highlighted the pitfalls of such methodological plurality. In a recent study, we found that the choice of software package used to run the analysis pipeline can have a considerable impact on the final group-level results of a task-fMRI investigation (Bowring et al., 2019, BMN). Here we revisit our ...
#1 Jason Chin (USYD: University of Sydney) · H-Index: 6
#2 Justin T. Pickett (University at Albany, SUNY) · H-Index: 25
Last: Alex O. Holcombe (USYD: University of Sydney) · H-Index: 25
Questionable research practices (QRPs) lead to incorrect research results and contribute to irreproducibility in science. Researchers and institutions have proposed open science practices (OSPs) to improve the detectability of QRPs and the credibility of science. We examine the prevalence of QRPs and OSPs in criminology, and researchers’ opinions of those practices. We administered an anonymous survey to authors of articles published in criminology journals. Respondents self-reported their own u...
#1 Sarah J. Gervais (NU: University of Nebraska–Lincoln) · H-Index: 20
#2 Amanda E. Baildon (NU: University of Nebraska–Lincoln)
Last: Tierney K. Lorenz (NU: University of Nebraska–Lincoln) · H-Index: 9
In this commentary, we argue that feminist science and open science can benefit from each other’s wisdom and critiques in service of creating systems that produce the highest quality science with t...
#1 Daniel J. Hicks (UCM: University of California, Merced)
Concerns about a crisis of mass irreplicability across scientific fields ("the replication crisis") have stimulated a movement for open science, encouraging or even requiring researchers to publish their raw data and analysis code. Recently, a rule at the US Environmental Protection Agency (US EPA) would have imposed a strong open data requirement. The rule prompted significant public discussion about whether open science practices are appropriate for fields of environmental public health. The a...
#1 Marija Petrovic (University of Belgrade) · H-Index: 9
#2 Iris Žeželj (University of Belgrade) · H-Index: 9