In Part I, I argued that unsuccessful replications should more constructively be seen as scientific beginnings rather than ends. As promised, in Part II I will more concretely demonstrate this by organizing all of the available replication information for Schnall et al.’s (2008) studies using an approach being developed at CurateScience.org.
CurateScience.org aims to accelerate the growth of cumulative knowledge by organizing information about replication results and allowing constructive comments by the community of scientists regarding the careful interpretation of those results. Links to available data, syntax files, and experimental materials will also be organized. The web platform aims to be a one-stop shop to locate, add, and modify such information, and also to facilitate constructive discussion and new scholarship of published research findings (the kinds of heated debates currently happening regarding Schnall et al.’s studies are what make science so exciting, minus the ad hominem attacks!).
Below is a screenshot of the organized replication results for the Schnall et al. (2008) studies, including links to available data files, a forest plot of the effect size confidence intervals, and an aggregated list of relevant blog posts and tweets.
As can be seen, there are four additional direct replications beyond Johnson et al.’s (2014) special issue direct replications. As mentioned in Part I, two “successful” direct replications have been reported for Schnall et al.’s Study 1. However, as can readily be seen, these two studies were under-powered (roughly 60% power) to detect the original d = -.60 effect size, and both effect size CIs include zero. Consequently, it would be inappropriate to characterize these studies as “successful” (the p < .05 values reported on PsychFileDrawer.org were from one-tailed tests). That being said, these studies should not be ignored, given that they contribute additional evidence that should count toward one’s overall evaluation of the claim that cleanliness priming influences moral judgments.
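For readers who want to check the power figure for themselves, here is a minimal sketch using a normal approximation for a two-sample comparison. The d = 0.60 value is the original Study 1 effect size; the per-cell sample sizes are hypothetical placeholders, not the actual Ns from the PsychFileDrawer entries.

```python
# Minimal power sketch (normal approximation) for a two-sample comparison.
# d = 0.60 is the original Schnall et al. Study 1 effect size; the per-cell
# sample sizes are hypothetical -- plug in the actual replication Ns.
from scipy.stats import norm

def approx_power(d, n_per_cell, alpha=0.05, tails=2):
    """Approximate power of an independent-groups test of effect size d."""
    z_crit = norm.ppf(1 - alpha / tails)
    noncentrality = abs(d) * (n_per_cell / 2) ** 0.5
    return norm.cdf(noncentrality - z_crit)

for n in (25, 30, 45, 60):
    print(f"n = {n:2d} per cell -> power ~ {approx_power(0.60, n):.2f}")
# Roughly 27 per cell yields ~60% power for d = 0.60 (two-tailed, alpha = .05);
# about 45 per cell is needed to reach the conventional 80%.
```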
Unsuccessful replications should also be viewed as beginnings given that virtually all replicators make their data publicly available for verification and re-analysis (one of Curate Science’s foci). Hence, any interested researcher can download the data and re-analyze it from a different theoretical perspective, potentially gaining new insights into the discrepant results. Data availability also plays an important role in interpreting replication results, especially when the results have not been peer-reviewed. That is, one should put more weight on replication results whose conclusions can be verified via re-analysis than on replication results without available data.
Organizing the replication results in this way makes it clear that virtually all of the replication efforts have targeted Schnall et al.’s Study 1. Only one direct replication is so far available for Schnall et al.’s Study 2. Though this replication used a much larger sample and was pre-registered (hence more weight should be given to its results), that does not mean the final verdict has been rendered. Our confidence in Study 2’s original results should decrease to some extent (assuming the replication results can be reproduced from the raw data), but more evidence would be needed to decrease it further.
And even in the event of subsequent negative results from high-powered direct replications (for either of Schnall et al.’s studies), it would still be possible that cleanliness priming influences moral judgments in ways detectable only with more accurate instruments or more advanced designs (e.g., highly repeated within-person designs). CurateScience.org aims to facilitate constructive discussions and theoretical debates of these kinds to accelerate the growth of cumulative knowledge in psychology/neuroscience (and beyond). Unsuccessful replications are beginnings, not ends.
The Schnall-Effect:
Combine the two studies by Schnall, combine all of the other studies, and compare Schnall’s mean effect size to the mean of the other researchers. You can see a significant difference (non-overlapping confidence intervals).
So, the question is why does Schnall report stronger effects than other researchers?
It is also clear that effect size is negatively related to sample size. So another question is why smaller samples produce stronger effects.
A plausible explanation is that smaller studies use QRPs to produce significant results, which would lead to the conclusion that Schnall’s stronger effect sizes are the result of a more liberal use of QRPs.
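One way to make the comparison described in this comment concrete is a simple inverse-variance (fixed-effect) pooling of each set of studies and a check of whether the two pooled confidence intervals overlap. The sketch below is illustrative only: the d values and standard errors are made-up placeholders, not the estimates shown in the forest plot.

```python
# Sketch of inverse-variance (fixed-effect) pooling for two sets of studies.
# The (d, SE) pairs are made-up placeholders -- substitute the estimates
# from the forest plot of the actual studies.
import math

def pool(effects_and_ses):
    """Return (pooled d, 95% CI) under a fixed-effect model."""
    weights = [1 / se**2 for _, se in effects_and_ses]
    pooled = sum(w * d for (d, _), w in zip(effects_and_ses, weights)) / sum(weights)
    se_pooled = 1 / math.sqrt(sum(weights))
    return pooled, (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)

schnall_studies = [(-0.60, 0.30), (-0.85, 0.33)]                      # placeholders
other_studies   = [(-0.05, 0.15), (0.01, 0.12), (-0.10, 0.20), (0.08, 0.18)]

for label, studies in (("Schnall", schnall_studies), ("Others", other_studies)):
    d, (lo, hi) = pool(studies)
    print(f"{label}: pooled d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# Non-overlapping CIs for the two pooled estimates would support the
# "Schnall effect" contrast described above (a conservative comparison).
```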
Thanks for the comment, Uli. The (unintentional or intentional) use of QRPs is definitely one possible explanation for the discrepancy between the replicators’ results and Schnall et al.’s original findings. Note, however, that there are other possibilities (e.g., differences in cross-cultural moral values, political conservatism), though I would consider these less likely. Nonetheless, we should avoid strong conclusions on this point until further evidence emerges.
With a smaller effect, Schnall’s results wouldn’t have been significant, and therefore they wouldn’t have been published. This does NOT mean that her results were fabricated. It’s simply that we hardly ever see non-significant results published. Therefore, before direct replications are conducted, we do not know whether a published result from a small-scale study is just one of those 1-in-20-attempts outliers, while the other 19 attempts stay unpublished.
Published (!) results of underpowered studies on behavioural effects of small interventions – for which population effect sizes in the small to medium range can be expected at best – will necessarily present exaggerated effect sizes compared to the true population effect, which could even be zero. Sadly, Ioannidis’ “Why Most Published Research Findings Are False” is not just provocative, but the most accurate description of what is going on in psychology, cognitive science, medicine, etc.
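A small simulation illustrates this exaggeration (sometimes called the “winner’s curse”). The true effect, cell size, and “publish only if significant” filter below are illustrative assumptions, not estimates for any particular literature.

```python
# Simulation of effect-size inflation under a significance filter: with a
# small true effect and small samples, the studies that clear p < .05
# (and hence get published) systematically overestimate the effect.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
true_d, n_per_cell, n_experiments = 0.20, 20, 20_000

published_ds = []
for _ in range(n_experiments):
    treatment = rng.normal(true_d, 1.0, n_per_cell)
    control = rng.normal(0.0, 1.0, n_per_cell)
    t, p = ttest_ind(treatment, control)
    if p < 0.05 and t > 0:  # only significant, directionally "right" results get published
        pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
        published_ds.append((treatment.mean() - control.mean()) / pooled_sd)

print(f"True d: {true_d}")
print(f"Share of experiments 'published': {len(published_ds) / n_experiments:.2f}")
print(f"Mean published d: {np.mean(published_ds):.2f}")  # far larger than 0.20
```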
Yes. Non-successful replications should indeed be beginnings.
My recollection is that we used to talk about experimental control. Perhaps this was in the days of behaviourism. The idea was that the purpose of an experiment was to gain control over the behaviour of interest. A failure to replicate indicates that we don’t have control over the behaviour of interest, and is a sign that we should be doing more work in order to gain control.
A nice example of such work is found in a paper from Wolfgang Prinz’s group (Diefenbach et al. Front Psychol 4, 2013). This is a study of the priming of actions by sentences. For example, the sentence “Jakob hands you the book” should prime actions towards the body, while “You hand the book to Jakob” should prime responses away from the body. Their first experiment failed to replicate the results of a previous study by others. But they did not stop at this point. They carried out detailed analyses and ran several further experiments. By these means they regained ‘experimental control’ over the behaviour. They specified the circumstances under which positive and negative priming effects could be obtained. These effects depended on the precise timing of the processes underlying sentence comprehension and action preparation. Positive priming is observed if action planning starts at the same time as sentence presentation. Negative priming occurs if action planning is delayed for 500 msec after sentence presentation.
In this case, the failure to replicate led to important new knowledge. Unfortunately, in the current climate, failures to replicate are all too often taken as an excuse to berate the original researchers rather than as an opportunity for new developments. For example, much fuss was made about failures to replicate a much-cited study reporting that people walked more slowly after being primed with words relating to old age (Bargh et al., 1996, J Pers Soc Psychol, 71, 230-244). In one of those studies (Doyen et al., 2012, PLoS ONE, Jan 18), the authors showed that priming effects could be obtained when the person running the experiment expected certain results. Rather than berating Bargh and colleagues for not properly controlling for expectations, we should be exploring the mechanisms through which the expectations of the experimenter can alter the behaviour of the people participating in the experiment. Understanding such mechanisms is critical, not only for advancing the field of social cognition, but also for gaining experimental control in psychology. (This comment comes from our socialminds blog.)
Thanks for the interesting comment, Chris. I agree that unsuccessful high-quality replications can indicate a lack of “experimental control” over the phenomenon of interest. I’d say that achieving such experimental control can be particularly challenging for “higher-level” social psychological phenomena that involve thoughts, feelings, and behaviors that are arguably much more multiply determined than phenomena in many areas of cognition.
But great example regarding the Diefenbach et al. article examining the embodied cognition of language comprehension.
I am late to this discussion, but I wanted to simply note that sometimes failed replications ARE (and perhaps should be) an end, not a beginning. This is not about Schnall’s work or any other research in particular, but about the fact that if something never made sense to begin with, but was hot, clever, fun, or confoundingly impossible, it probably didn’t represent reality. In this sense, a failure to replicate is likely to be the end of the line when a “finding” was nonsense (i.e., a Type 1 error) from the beginning. Sometimes a failure to replicate can be the beginning of a meaningful discussion about moderators, research design, etc., but sometimes it is a very short conversation… “Oh. Right.”