Confusion regarding scientific theory as contributor to replicability crisis?

[DISCLAIMER: The ideas and statements made in this blog post are in no way intended to insult or disrespect my fellow psychologists.]

In this post, I will discuss psychology’s replicability crisis from a new angle. I want to consider the possibility that confusion regarding what scientific theory is and how theory is developed may have contributed to the replicability crisis in psychology.

Scientific theories are internally consistent sets of principles that are put forward to explain various empirical phenomena. Theories compete in the scientific marketplace by being evaluated according to the following five criteria (Popper, 1959; Quine & Ullian, 1978):

1. parsimony: simpler theories involving the fewest entities are preferred to more complicated theories
2. explanatory power:  theories that can explain many empirical phenomena are preferred to theories that can only explain a few phenomena
3. predictive power: a useful theory makes new empirical predictions above and beyond extant theories
4. falsifiability: a theory must yield falsifiable predictions
5. accuracy: degree to which a theory’s empirical predictions match experimental results

It is important to point out explicitly, however, that underlying all of these considerations is the fact that before a theory can be put forward, there must already exist demonstrably repeatable empirical phenomena to be explained! Demonstrably repeatable is understood to mean that an empirical phenomenon “can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed” (Popper, 1959, p. 23). Put simply, scientific theories aim to explain repeatable empirical phenomena; without repeatable empirical phenomena, there is nothing to explain and hence no theories can be developed.

The idea, then, is that confusion regarding these points may have contributed to the current replicability crisis. To support my point, I will briefly review some examples from the beleaguered “social priming” literature. [DISCLAIMER: I contend my argument likely also holds in other areas of experimental psychology; I’ve chosen this literature out of convenience, and my intention is not to pick on these specific researchers.]

For example, in a piece entitled “The Alleged Crisis and the Illusion of Exact Replication”, Stroebe and Strack (2014) state that:

“Although reproducibility of scientific findings is one of science’s defining features, the ultimate issue is the extent to which a theory has undergone strict tests and has been supported by empirical findings” (p. 60).

Stroebe and Strack seem to be saying that the most important issue (i.e., the “ultimate issue”) in evaluating scientific theory is whether the theory has been supported by empirical findings (accuracy, criterion #5 above), while downplaying the reproducibility of findings as merely “one of science’s defining features”. This kind of position, however, doesn’t fit with the considerations above, whereby reproducible empirical phenomena are required before a scientific theory can even be put forward, let alone be evaluated vis-à-vis other theories.

In another example, Cesario (2014) — in the context of discussing what features of the original methodology need to be duplicated for a replication attempt to be informative — states:

“We know this only because we have relevant theories that tell us that these features should matter.” (p. 42) “Theories inform us as to which variables are important and which are unimportant (i.e., which variables can be modified from one research study to the next without consequence).” (p. 45)

Cesario seems to be saying that we can use a scientific theory to tell us which methodological features of an original study need to be duplicated to reliably observe an empirical phenomenon. Such a position would seem to be putting the cart before the horse, however, given that without demonstrably repeatable empirical phenomena to explain, no theory can be developed in the first place.1

A final example comes from an article by Dijksterhuis (2014, “Welcome back theory!”), who summarizes Cesario’s (2014) paper by saying:

“Cesario draws the conclusion that although behavioral priming researchers could show more methodological rigor, the relative infancy of the theory is the main reason the field faces a problem.” (p. 74)

Dijksterhuis seems to be saying that the field of behavioral priming currently has problems with non-replications because of insufficiently developed theory. This position is again difficult to reconcile with the standard conceptualization of scientific theory. With all due respect, such a position would be akin to saying that ESP researchers have yet to document replicable ESP findings because theories of ESP are insufficiently developed!

But how could this happen?

I contend that such confusion regarding scientific theory has emerged due (at least in part) to the relatively weak methods used in modal research (LeBel & Peters, 2011). This includes the improper use of null hypothesis significance testing (i.e., treating p<.05 as indicating a “reliable” finding) and an over-emphasis on conceptual rather than direct replications. Conceptual replications involve immediately following up an observed effect with a study using a different methodology, hence rendering any negative results completely ambiguous (i.e., was the different result due to the different methodology or to the falsity of the tested hypothesis?). This practice effectively shields any positive empirical findings from falsification (see here for a great blog post precisely on this point; see also Greenwald et al., 1986). Granted, once the reproducibility of a particular effect has been independently confirmed (using the original methodology), it is of course important to subsequently test whether the effect generalizes to other methods (i.e., other operationalizations of the IV and DV). However, we simply cannot skip the first step. This broadly fits with Rozin’s (2001) position that psychologists need to place much more emphasis on first reliably describing empirical phenomena before setting out to actually test hypotheses about those phenomena.
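To make the “p<.05 does not mean reliable” point concrete, here is a minimal simulation sketch (my own illustration, not taken from any of the cited papers). The true effect size (d = .4) and cell size (n = 20) are hypothetical values chosen to mimic modal research; the point is that the probability that an exact repetition also reaches p<.05 is simply the study’s power, which is low in underpowered designs.

```python
# Illustration (hypothetical d_true and n): how often does a "significant"
# two-sample result replicate in an identical study?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d_true, n, n_sim = 0.4, 20, 20000   # true effect, participants per cell, simulated study pairs

def one_study_pvalue():
    treatment = rng.normal(d_true, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    return stats.ttest_ind(treatment, control).pvalue

original_p = np.array([one_study_pvalue() for _ in range(n_sim)])
replication_p = np.array([one_study_pvalue() for _ in range(n_sim)])

significant = original_p < .05
print(f"Power of a single study:                   {significant.mean():.2f}")
print(f"P(replication p < .05 | original p < .05): {(replication_p[significant] < .05).mean():.2f}")
```

Under these assumptions both numbers come out around .23: obtaining p<.05 once does nothing to raise the odds that an exact replication will also “work.”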


1. That being said, Cesario should be lauded for his public stance that behavioral priming researchers need to directly replicate their own findings (using the same methodology) before publishing their findings.

 

References

Cesario, J. (2014). Priming, replication, and the hardest science. Perspectives on Psychological Science, 9, 40–48.

Dijksterhuis, A. (2014). Welcome back theory! Perspectives on Psychological Science, 9(1), 72–75.

Greenwald, A. G., Pratkanis, A. R., Leippe, M. R., & Baumgardner, M. H. (1986). Under what conditions does theory obstruct research progress? Psychological Review, 93(2), 216.

LeBel, E. P., & Peters, K. R. (2011). Fearing the future of empirical psychology: Bem’s (2011) evidence of psi as a case study of deficiencies in modal research practice. Review of General Psychology, 15, 371–379.

Popper, K. R. (1959). The logic of scientific discovery. New York, NY: Basic Books.

Quine, W. V. O., & Ullian, J. S. (1978). The web of belief (2nd ed.). New York, NY: Random House.

Rozin, P. (2001). Social psychology and science: Some lessons from Solomon Asch. Personality and Social Psychology Review, 5(1), 2–14.

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9, 59–71.

New replication policy at flagship social psychology journal will not be effective

The Journal of Personality and Social Psychology (JPSP) — considered social psychology’s flagship journal — recently announced their new replication policy, which officially states:

Although not a central part of its mission, the Journal of Personality and Social Psychology values replications and encourages submissions that attempt to replicate important findings previously published in social and personality psychology. Major criteria for publication of replication papers include:

    • the theoretical importance of the finding being replicated
    • the statistical power of the replication study or studies
    • the extent to which the methodology, procedure, and materials match those of the original study
    • the number and power of previous replications of the same finding
    • Novelty of theoretical or empirical contribution is not a major criterion, although evidence of moderators of a finding would be a positive factor.

Preference will be given to submissions by researchers other than the authors of the original finding, that present direct rather than conceptual replications, and that include attempts to replicate more than one study of a multi-study original publication. However, papers that do not meet these criteria will be considered as well.

Given my “pre-cognitive abilities”1, we actually submitted a replication paper to JPSP about 2 weeks *prior* to their announcement, reporting the results of two unsuccessful high-powered replication attempts of Correll’s (2008, Exp 2) 1/f noise racial bias effect. Exactly one day after the new replication policy was announced we received this rejection letter:

Your paper stands high on several of [our replication policy] criteria. You worked with the author of the original paper to duplicate materials and procedures as closely as possible, and pre-registered your data collection and analysis plans. Your studies are adequately powered. However, I have concluded that because the impact of the original Correll article has been minimal, an article aimed at replicating his findings does not have the magnitude of conceptual impact that we are looking for in the new replication section. Thus, I will decline to publish this manuscript in JPSP. To assess the impact of the Correll (2008) paper, since it is 6 years old, I turned to citation data. It has been cited 22 times (according to Web of Science) but the vast majority are journals such as Human Movement Science, Ecological Psychology, or Physics Reports, far outside our field. I have not looked at all of the citing articles, of course, but the typical citation of Correll’s work appears to be as an in-passing example of the application of dynamical systems logic. There are only two citations within social psychology. One is Correll’s 2011 JESP follow-up (which itself has been cited only twice, again by journals far outside our field). The second is an Annual Review of Psychology article on gender development (in which again Correll’s 2008 paper is cited in passing as an example of dynamical approaches). I have to conclude that Correll’s paper has had zero substantive impact in social psychology, attracting attention almost exclusively from researchers (mostly outside our field) who cite it as an example application of a specific conceptual and analytic approach. Such citations have little or nothing to do with the substance of the finding that you failed to replicate – the impact of task instructions on the PSD slope. In sum, my decision on your replication manuscript is not based on any deficiencies in your work, but on the virtually complete lack of impact of the original finding within our field.

I responded to the decision letter with the following email:

Thanks for your quick response regarding our replication manuscript (PSP-A-2014-0114). Of course it is not the outcome we had hoped for; however, we respect your decision. That being said, I would like to point out what seems to be a major discrepancy between the official policy for publication of replication papers (theoretical importance of the finding, quality of replication methods, & pre-existing replications of the finding) *and* the primary basis for rejecting our replication paper, which was that the original article had insufficient actual impact in terms of citation count. These two things are distinct, and if you will be rejecting papers on the latter criterion, then your official policy should be revised to reflect this fact.

Furthermore, if you do revise your official policy in this way — whereby a major criterion for publishing replication papers is “actual impact” of original article in terms of citation count — this would mean that you could avoid publishing replication papers — no matter how high-quality — for about 85% of published articles in JPSP given the skewed distribution of article citation count whereby the vast majority of articles have minimal actual impact (Seglen, 1992). This kind of strategy would of course be a highly ineffective editorial policy if the goal is to increase the credibility and cumulative nature of empirical findings in JPSP.

To which the editor responded by saying that Correll’s (2008, Exp 2) finding was deemed “important” for methodological reasons and re-iterated that Correll’s research has had “little to no impact within our field.” More importantly, he did not address my two main concerns that their “new replication policy is (1) not well specified and (2) will not be effective in increasing the credibility of empirical findings in JPSP.”2

I responded by saying that they need — at the very least — to revise their official policy to state that they will *only* publish high-quality replication papers of theoretically important findings that have had an *actual* impact in terms of citation count. This of course means that they can avoid publishing replication papers of all recently published JPSP papers *and* the vast majority of JPSP papers that are rarely or never cited, which is simply absurd. Another curious aspect (alluded to by Lorne Campbell) is this: Can an empirical finding actually have an impact on a field if it hasn’t been independently corroborated?

 

1. Just kidding, I unfortunately do not actually have pre-cognitive abilities though it would be great if I did.
2. This is in contrast to replication policies at more reputable journals — such as Psychological Science, Journal of Experimental Social Psychology, Psychonomic Bulletin & Review, and Journal of Research in Personality — that publish high-quality replication papers of *any* findings originally published in their journal. For examples, see here and here.

Unsuccessful replications are beginnings not ends – Part II

In Part I, I argued that unsuccessful replications should more constructively be seen as scientific beginnings rather than ends. As promised, in Part II I will more concretely demonstrate this by organizing all of the available replication information for Schnall et al.’s (2008) studies using an approach being developed at CurateScience.org.

CurateScience.org aims to accelerate the growth of cumulative knowledge by organizing information about replication results and allowing constructive comments by the community of scientists regarding the careful interpretation of those results. Links to available data, syntax files, and experimental materials will also be organized. The web platform aims to be a one-stop shop to locate, add, and modify such information, and also to facilitate constructive discussions and new scholarship on published research findings. (The kinds of heated debates currently happening regarding Schnall et al.’s studies are what make science so exciting — well, minus the ad hominem attacks!)

Below is a screenshot of the organized replication results for the Schnall et al. (2008) studies, including links to available data files, forest plot graph of the effect size confidence intervals, and aggregated list of relevant blog posts and tweets.

[Screenshot: Curate Science’s organized replication results for Schnall et al. (2008)]

As can be seen, there are actually four direct replications in addition to Johnson et al.’s (2014) special-issue direct replications. As mentioned in Part I, two “successful” direct replications have been reported for Schnall et al.’s Study 1. However, as can readily be seen, these two studies were under-powered (~60% power) to detect the original d=-.60 effect size, and both effect size CIs include zero. Consequently, it would be inappropriate to characterize these studies as “successful” (the p-values < .05 reported on PsychFileDrawer.org were from one-tailed tests). That being said, these studies should not be ignored, given that they contribute additional evidence that should count toward one’s overall evaluation of the evidence for the claim that cleanliness priming influences moral judgments.
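For readers who want to run these two checks themselves — power against the original effect size, and whether a replication’s effect size CI excludes zero — here is a rough Python sketch. The cell size (n = 30) and observed replication effect (d = .35) below are hypothetical stand-ins, not the actual numbers from the PsychFileDrawer entries, and the CI uses a standard large-sample approximation to the standard error of Cohen’s d.

```python
# Two quick checks on a "successful" replication (hypothetical inputs).
import numpy as np
from statsmodels.stats.power import TTestIndPower

n_per_cell = 30      # hypothetical; substitute the replication's actual cell size
d_original = 0.60    # |d| reported by Schnall et al. (2008), Study 1
d_observed = 0.35    # hypothetical replication estimate, for illustration only

# 1) Achieved power to detect the original effect size (two-sided, alpha = .05)
power = TTestIndPower().power(effect_size=d_original, nobs1=n_per_cell,
                              alpha=0.05, ratio=1.0, alternative='two-sided')
print(f"Power to detect |d| = {d_original}: {power:.2f}")

# 2) Approximate 95% CI around the observed d (large-sample SE of Cohen's d)
se_d = np.sqrt(2 / n_per_cell + d_observed**2 / (4 * n_per_cell))
print(f"Observed d = {d_observed}, 95% CI = "
      f"[{d_observed - 1.96 * se_d:.2f}, {d_observed + 1.96 * se_d:.2f}]")
```

With these placeholder numbers the power comes out near .6 and the CI spans zero, which is exactly the pattern that makes a nominally “successful” replication uninformative on its own.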

Unsuccessful replications should also be viewed as beginnings given that virtually all replicators make their data publicly available for verification and re-analysis (one of Curate Science’s focuses). Hence, any interested researcher can download the data and re-analyze it from a different theoretical perspective, potentially gaining new insights into the discrepant results. Data availability also plays an important role in interpreting replication results, especially when the results have not been peer-reviewed. That is, one should put more weight on replication results whose conclusions can be verified via re-analysis than on replication results that do not have available data.

Organizing replication results in this way makes it clear that virtually all of the replication efforts have targeted Schnall et al.’s Study 1. Only one direct replication is so far available for Schnall et al.’s Study 2. Though this replication study used a much larger sample and was pre-registered (hence more weight should be given to its results), this does not mean the final verdict has been rendered. Our confidence in Study 2’s original results should decrease to some extent (assuming the replication results can be reproduced from the raw data); however, more evidence would be needed to decrease our confidence further.

And even in the event of subsequent negative results from high-powered direct replications (for either of Schnall et al.’s studies), it would still be possible that cleanliness priming can influence moral judgments when studied with more accurate instruments or more advanced designs (e.g., highly-repeated within-person designs). CurateScience.org aims to facilitate constructive discussions and theoretical debates of these kinds to accelerate the growth of cumulative knowledge in psychology/neuroscience (and beyond). Unsuccessful replications are beginnings, not ends.

Unsuccessful replications are beginnings not ends – Part I

Recently, there has been a lot of controversy brewing around the so-called “replication movement” in psychology. This controversy reached new heights this past week in response to Johnson, Cheung, & Donnellan’s (2014) “failed” replications of Schnall, Benton, & Harvey’s (2008) finding that cleanliness priming influences moral judgment. Exchanges have spiraled out of control, with unprofessional and overly personal comments being made. For example, an original author accused replicators of engaging in “replication bullying,” and a “status quo supporter” called (young) replicators “assholes” and “shameless little bullies.”

In this post, I want to try to bring the conversation back to substantive scientific issues regarding the crucial importance of direct replications, and I will argue that direct replications should be viewed as constructive rather than destructive. But first, a quick clarification regarding the peripheral issue of the term “replication bullying.”

The National Center Against Bullying defines bullying as: “Bullying is when someone with more power repeatedly and intentionally causes hurt or harm to another person who feel helpless to respond.” 

According to this definition, it is very clear that publishing failed replications of original research findings does not come close to meeting the criteria for bullying. Replicators have no intention to harm the original researcher(s), but rather have the intention to add new evidence regarding the robustness of a published finding. This is a normal part of science and is actually the most important feature of the scientific method, which ensures an empirical literature is self-correcting and cumulative. Of course the original authors may claim that their reputation might be harmed by the publication of fair and high-quality replication studies that do not corroborate their original findings. However, this is an unavoidable reality of engaging in scientific endeavors. Science involves highly complex and technically challenging activities. When a new empirical finding is added to the pool of existing ideas, there will always be a risk that competent independent researchers may not be able to corroborate the original findings.

That being said, science entails the careful calibration of beliefs about how our world works: beliefs are calibrated to the totality of the evidence available for a given claim. At one pole is (1) high confidence in a belief when strong supporting evidence is repeatedly found; at the other is (2) strong doubt in a belief when only weak evidence is repeatedly found. Between these two poles lies a graded continuum where one may have low to moderate confidence in a belief until more high-quality evidence is produced.

For example, in the Schnall et al. situation, Johnson et al. have reported two unsuccessful direct replications, one for each of the two studies originally reported by Schnall et al. However, two *successful* direct replications of Schnall et al.’s Study 1 have also been reported by completely independent researchers. These “successful” direct replications, however, were both severely under-powered to detect the original effect size. Notwithstanding this limitation, these studies should still be considered in carefully calibrating one’s belief regarding the claim that cleanliness priming can reduce the severity of moral judgments. Furthermore, future research would be needed to understand these discrepant results. Finally, even in the absence of the successful direct replications, Johnson et al.’s two high-quality direct replications do not indicate that the idea that cleanliness priming reduces the severity of moral judgments is definitively wrong. The idea might indeed have some truth to it under a different set of operationalizations and/or in different contexts. The challenge is to identify those operationalizations and contexts whereby the phenomenon yields replicable results. Unsuccessful replications are beginnings, not ends.
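One minimal way to make “calibrating to the totality of the evidence” concrete is an inverse-variance-weighted (fixed-effect) summary across the original study and its direct replications. The effect sizes and standard errors below are hypothetical placeholders, not the actual Schnall et al. or Johnson et al. numbers; the sketch only illustrates the weighting logic.

```python
# Fixed-effect (inverse-variance) summary across an original study and
# its direct replications. All numbers are hypothetical placeholders.
import numpy as np

d = np.array([-0.60, -0.05, 0.02, -0.40])   # original + three replications (hypothetical)
se = np.array([0.27, 0.15, 0.14, 0.30])     # corresponding standard errors (hypothetical)

w = 1 / se**2                               # inverse-variance weights: precise studies count more
d_pooled = np.sum(w * d) / np.sum(w)
se_pooled = np.sqrt(1 / np.sum(w))
print(f"Pooled d = {d_pooled:.2f}, 95% CI = "
      f"[{d_pooled - 1.96 * se_pooled:.2f}, {d_pooled + 1.96 * se_pooled:.2f}]")
```

A pooled estimate of this kind (ideally with heterogeneity checks once more studies accumulate) is one defensible way to hold the “low to moderate confidence” position described above rather than declaring an effect simply real or dead.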

In the second part of this post, I will more concretely demonstrate how unsuccessful replications are beginnings by organizing all of the replication information for Schnall et al.’s (2008) studies using an approach being developed at CurateScience.org.

A simpler and more intuitive publication bias index?

At this past SPSP, Uri Simonsohn gave a talk on new ways of thinking about statistical power. From this new perspective, you first determine how large a sample size you can afford for a particular project. Then, you can determine the minimum effect size that can be reliably detected (i.e., with 95% power) at that sample size (e.g., d_min = .73 can be reliably detected with n=50/cell). I believe this approach is a much more productive way of thinking about power for several reasons, one being that it substantially enhances the interpretation of null results. For instance, you can conclude (assuming the integrity of methods and measurement instruments) that the effect you’re studying is unlikely to be as large as the minimum effect size reliably detectable for your sample size (or else you would have detected it). That being said, it is still possible the effect exists but is much smaller in magnitude, which would require a much larger sample size to reliably detect.
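This is not Simonsohn’s code, just my sketch of the underlying calculation: solve a two-sample power analysis for the effect size, rather than for n, using statsmodels.

```python
# What effect size can n = 50 per cell reliably (95% power) detect?
from statsmodels.stats.power import TTestIndPower

n_per_cell = 50
d_min = TTestIndPower().solve_power(effect_size=None, nobs1=n_per_cell,
                                    alpha=0.05, power=0.95,
                                    ratio=1.0, alternative='two-sided')
print(f"With n = {n_per_cell} per cell, d_min = {d_min:.2f}")   # ~0.73
```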

In this post, I use the core ideas from this new approach to come up with a simpler and more intuitive way of gauging publication bias for extant empirical studies.

The idea is simple. If a study reports an observed effect size smaller than the minimum effect size reliably detectable for the sample size used, then the study likely suffers from publication bias and should be interpreted with caution. The further away the observed effect size is from the minimally detectable effect size, the larger the bias. Let’s look at some concrete examples.

Zhong & Liljenquist’s (2006) Study 1 on the “Macbeth effect” found a d=.53 using n=30/cell. At this sample size, however, only effect sizes as large as d=.95 (or greater) are reliably detectable with 95% power. On the other hand, Tversky & Kahneman’s (1981) Framing effect study found a d=1.13 using n=153/cell. At that sample size, effect sizes as small as d=.41 are reliably detectable. See Table below for other examples:
[Table: observed effect sizes (d) versus minimum reliably detectable effect sizes (d_min) for several published studies]

The new bias index can be calculated as the gap between the minimum reliably detectable effect size and the observed effect size: bias index = d_min − d_observed (larger positive values indicate greater likely bias).

(And note we’d want to calculate a 95% C.I. around the bias estimate, given that bias estimates should be more precise for larger Ns all else being equal.)
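Here is a small sketch applying the index to the two examples above, operationalizing it as the d_min − d_observed gap described in the text; d_min is computed with statsmodels for a two-cell, two-sided design at 95% power.

```python
# Bias index = d_min - d_observed for two example studies.
from statsmodels.stats.power import TTestIndPower

def d_min(n_per_cell, power=0.95, alpha=0.05):
    """Smallest effect reliably detectable by a two-sample, two-sided t-test."""
    return TTestIndPower().solve_power(effect_size=None, nobs1=n_per_cell,
                                       alpha=alpha, power=power,
                                       ratio=1.0, alternative='two-sided')

studies = {
    "Zhong & Liljenquist (2006, Study 1)": (0.53, 30),    # observed d, n per cell
    "Tversky & Kahneman (1981)":           (1.13, 153),
}
for name, (d_obs, n) in studies.items():
    dm = d_min(n)
    print(f"{name}: d_obs = {d_obs:.2f}, d_min = {dm:.2f}, bias index = {dm - d_obs:+.2f}")
```

The first study yields a large positive index (the observed effect is well below what the design could reliably detect), whereas the second yields a clearly negative one.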

To shed more light on the value of this simpler publication bias index, in the near future I will calculate it for studies for which replicability information exists and test empirically whether larger index values predict a lower likelihood of successful replication.

“Replicating down” can lead to “replicating up”

A few days ago, Rolf Zwaan wrote an interesting post about “replicating down” vs. “replicating up”, which he conceptualized as decreasing vs. increasing our confidence in an effect reported in an original paper. I love this distinction and definitely agree that we need to see a lot more “replicating up” efforts and that editors of prominent journals should publish the results of such efforts.

In this blog post, I’m going to tell a story that embodies a different kind of “replicating up” and contend that “replicating down” can lead to “replicating up” in highly constructive ways for our science.

Here is the story.

We executed two high-powered, pre-registered direct replication attempts of an effect, bending over backwards (à la Feynman) in collaboration with the original authors to duplicate as closely as possible all methodological details of the original study. However, we couldn’t get the effect. So we submitted our results to the journal that originally published the finding (trying for the Pottery Barn Rule), but the paper was rejected for not making a sufficiently substantial theoretical contribution. The editor argued that for publication we needed to provide the conditions under which the effect *does* occur.1 In a weird twist of events, one of the reviewers — who was one of the original authors on the paper — reported in their review that they had since “discovered” a moderator variable for the effect in question. The action editor suggested we “combine forces” and consider re-submitting to the journal. Indeed, a few days later I received an email from “Reviewer #1” offering that we combine forces and submit a combined paper with our null replication results and their moderator evidence. I graciously declined the offer, instead asking for the methodological details so that I could attempt to independently replicate their new “moderator effect”. Suddenly, the researcher’s tone changed: they communicated that they hadn’t “yet pinned down the effect” but would email me the details as soon as they had them. That email never came.

Fast forward six months. Out of the blue, an independent team emailed me indicating that they had also failed to replicate the original results in question, in an even higher-powered design. In yet another weird twist of events, their replication results spoke directly to the moderator question at hand, seriously calling into question the so-called “moderator effect” explanation of our failed replication results. I emailed the original author to ask whether there were any developments regarding their new “moderator effect”, given that I was now aware of evidence calling that explanation into question.

They replied saying that since we last communicated, they had realized that the operationalization of their target manipulation was overly noisy and that they had since “substantially improved” it to make it more precise.

That was music to my ears.

And this is what I mean by saying that “replicating down” can lead to “replicating up”! Our “replicating down” eventually led to a “replicating up” situation by getting the original researchers to improve their methodology in studying their phenomenon of interest.

Take-home message: We definitely need more “replicating up” situations, but “replicating down” can lead to “replicating up”, and this is very healthy for our science!

One last thing: my story fits very well with the Feynman-inspired name of my blog, “Prove Yourself Wrong”, whose premise is that by proving yourself — or others — wrong, scientific progress is achieved!

“We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.” (Richard Feynman)

Footnotes

1. This is ludicrous; it would be as if the Journal of Electroanalytical Chemistry — where Pons & Fleischmann (1989) published their now-discredited cold fusion findings — had demanded that independent replicators provide the “conditions” under which cold fusion *can* be observed!

Sufficiently open science

It’s difficult to disagree with the idea that psychologists (and other social scientists) need to be more open about their methods and data. Recently, a growing number of researchers have started being a lot more open about their methods and data (see the growing number of public projects on the Open Science Framework for evidence of this). I am very happy with these developments (along with many others) because of course being more transparent in how we do our research goes a long way in increasing the reliability of our empirical findings. I will not focus on these benefits here (but see this blog post for a great review of current open science initiatives and the various benefits of open science practices).

Rather, in this post I will argue that though the open science movement is a wonderful development, a “sufficiently open science” (SOS) movement would perhaps be even better. I want to consider the position that at this point in time — in this very special era we are currently living in — adopting SOS practices may actually be more beneficial overall for our field in the long run.

What do I mean by “sufficiently open science” practices? From my perspective, SOS practices means that:

  1. you provide minimum methods disclosure,
  2. you share minimum materials, and
  3. you share the minimum dataset underlying your claims.

That’s it! There is absolutely no need to be more open than this, though of course being more open is completely fine if you so desire.

By “minimum methods disclosure”, I mean disclosing the four categories of methodological information covered by Simmons et al.’s 21-word disclosure statement (and now required at Psychological Science and other journals). [UPDATE: Minimum methods disclosure actually involves a modified wording of the 21-word disclosure whereby an author needs to disclose only the measures that were analyzed with respect to the target research question, rather than all measures assessed. Thanks to Lorne Campbell for pointing this out.] Consistent with PLoS ONE’s new data sharing policy, by “minimum dataset” I mean sharing “the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to reproduce the reported study results in their entirety.” Finally, “minimum materials” means sharing the essential materials and procedures needed for a competent independent researcher to execute a diagnostic direct replication. (And if you’re willing to share the minimum dataset and materials with one interested researcher, then why not take 5 minutes and post them publicly on Figshare.com or the OSF, so that if someone else emails you about them, your job is already done!)

I contend that it’s more profitable — at this point in time — to strive for SOS practices because they are more feasible and pragmatic for time-starved researchers, all of whom are caught in a chaotic and rapidly-changing research landscape. SOS practices are also a great stepping stone toward broader openness, which can then happen in a more natural and gradual manner. And because SOS practices are more feasible, a tipping point where the majority of researchers are being more open may be reached much sooner.

Take-home message: Practice “sufficiently open science” and spread the word to any colleagues you know who may still be on the fence about the open science movement!

Oh and here’s a visual depiction of what I’m trying to say, comparing features of the status quo, sufficiently open science (SOS), and open science (OS) approaches.

[Table: features of the status quo, sufficiently open science (SOS), and open science (OS) approaches compared]

*According to current methods disclosure rates at PsychDisclosure.org (see also LeBel et al., 2013).

**According to Wicherts, Borsboom, Kats, & Molenaar (2006).