We Need Federally Funded Daisy Chains

pigee

One of the most provocative requests in the reproducibility crisis was Daniel Kahneman’s call for psychological scientists to collaborate on a “daisy chain” of research replication. He admonished proponents of priming research to step up and work together to replicate the classic priming studies that had, up to that point, been called into question.

What happened? Nothing. Total crickets. There were no grand collaborations among the strongest and most capable labs to reproduce each other’s work. Why not? Using 20:20 hindsight it is clear that the incentive structure in psychological science militated against the daisy chain idea.

The scientific system in 2012 (and the one still in place) rewarded people who were the first to discover a new, counterintuitive feature of human nature, preferably using an experimental method. Since we did not practice direct replications, the veracity of our findings weren’t really the point. The point was to be the…


LEBEL: Introducing “CurateScience.Org”

The Replication Network

It is my pleasure to introduce Curate Science (http://CurateScience.org) to The Replication Network. Curate Science is a web application that aims to facilitate and incentivize the curation and verification of empirical results in the social sciences (with an initial focus on psychology). Science is the most successful approach to generating cumulative knowledge about how our world works. This success stems from a key activity, independent verification, which maximizes the likelihood of detecting errors and hence maximizes the reliability and validity of empirical results. The current academic incentive structure, however, does not reward verification, so verification rarely occurs and, when it does, is highly difficult and inefficient. Curate Science aims to help change this by facilitating the verification of empirical results (pre- and post-publication) in terms of (1) the replicability of findings in independent samples and (2) the reproducibility of results from the underlying raw data.
The platform facilitates replicability by enabling users…


High-powered direct replications of social psychology findings (for in-press paper; now out of date)

***IMPORTANT NOTE***: This list was compiled on October 13, 2015 solely for an in-press paper at JPSP, to be referenced as additional replications of social psych findings **beyond** large-scale replication efforts such as RP:P, the Social Psych special issue, ML1, and ML3, and was not meant to be disseminated widely. Hence, this list is completely out of date. For a more systematic effort to track replications in psychology, see Curate Science.

The table below lists the successful (n=3) and unsuccessful (n=111) high-powered direct replications of social psychology findings known to us on October 13, 2015. For simplicity, only replications with statistical power >= 80% to detect an effect size as large as (or larger than) the original finding are included (citation counts according to Google Scholar, retrieved October 2015). This list was tabulated as additional evidence for the broader position that the current incentive structure in social psychology is not conducive to generating cumulative knowledge, in light of several meta-scientific investigations revealing low replicability rates of social psychology findings (e.g., Reproducibility Project: 76% replication failure rate for social psychology studies; Social Psych special issue: 70% failure rate; Many Labs 3: 88% failure rate).
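As a rough illustration of the >= 80% power inclusion criterion, here is a minimal sketch (not part of the original compilation) of the per-group sample size a replication would need, assuming for illustration an original effect of d = 0.5 and a two-sample t-test:

```python
# Minimal sketch of the ">= 80% power to detect the original effect size" criterion.
# The original effect size (d = 0.5) is a hypothetical value chosen for illustration.
from statsmodels.stats.power import tt_ind_solve_power

original_d = 0.5  # assumed original effect size (Cohen's d)
n_per_group = tt_ind_solve_power(effect_size=original_d, alpha=0.05,
                                 power=0.80, alternative='two-sided')
print(round(n_per_group))  # ~64 participants per group
```

Replications with at least this many participants per group (for an original d of 0.5) would meet the inclusion threshold; smaller original effects require correspondingly larger samples.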

 

| Effect | Cited by | Unsuccessful and successful replications* |
| --- | --- | --- |
| Elderly priming | 3703 | Pashler et al. (2009); Cesario et al. (2007, Study 2); Doyen et al. (2012, Studies 1 and 2) |
| Achievement priming | 1672 | Harris et al. (2013, Studies 1 and 2) |
| Deliberation-without-attention effect | 973 | Acker (2008); Calvillo & Penaloza (2009); Lassiter et al. (2009); Newell et al. (2009); Rey et al. (2009); Thorsteinson & Withrow (2009); Nieuwenstein & van Rijn (2012, Studies 1 and 2) |
| Behavioral-consequences-of-automatic-evaluation | 943 | Rotteveel et al. (2015, Studies 1 and 2) |
| Glucose-boosts-self-control effect | 853 | Cesario & Corker (2010); Astrologo et al. (2014) |
| Disgust priming | 713 | Johnson et al. (2015); Zhong et al. (2010, Study 2) |
| Physical warmth promotes interpersonal warmth | 693 | Lynott et al. (2014, Studies 1–3) |
| Money priming | 686 | Tate (2009); Grenier et al. (2012); Rohrer et al. (2015, Studies 1–3) |
| Intelligence priming | 676 | Eder et al. (2001); Shanks et al. (2013, Studies 4, 5, 6, and 8) |
| Fertility facial-preferences effect | 602 | Harris (2011) |
| Macbeth effect | 578 | Earp et al. (2014, Studies 1–3); Gamez et al. (2011, Studies 2 and 3); Fayard et al. (2009, Study 1) |
| Pre-cognition | 393 | Wagenmakers et al. (2011); Galak et al. (2012, Studies 1–4, 6, and 7); Ritchie et al. (2012, Studies 1–3); Galak et al. (2012, Study 5) |
| Status-legitimacy effect | 344 | Brandt (2013, Studies 1–3) |
| Red-cognitive-impairment effect | 321 | Steele et al. (2015, Studies 1–4) |
| Power posing | 304 | Ranehill et al. (2015); Koch & Broughal (2011) |
| Cleanliness priming | 278 | Johnson et al. (2014a, Studies 1 and 2); Lee et al. (2013); Johnson et al. (2014b) |
| Reduced pro-sociality of high SES effect | 258 | Korndorfer et al. (2015, Studies 1–8); Morling et al. (2014) |
| Social distance priming | 247 | Pashler et al. (2012, Studies 1 and 2); Johnson & Cesario (2012, Studies 1 and 2); Sykes et al. (2012) |
| Color on approach/avoidance | 227 | Steele (2013); Steele (2014) |
| Social warmth embodiment effect | 117 | Donnellan et al. (2015, Studies 1–9); Ferrell et al. (2014); McDonald et al. (2015) |
| Red-boosts-attractiveness effect | 98 | Banas et al. (2013); Blech (2014); Hesslinger et al. (2015) |
| Fertility on voting | 60 | Harris & Mickes (2014) |
| Process model of AMP | 55 | Tobin & LeBel (Studies 1 and 2) |
| Honesty priming | 47 | Pashler et al. (2013, Studies 1–3) |
| Modulation of 1/f noise on WIT | 45 | Madurski & LeBel (2015, Studies 1 and 2) |
| Embodiment of secrets | 31 | LeBel & Wilbur (2014, Studies 1 and 2); Perfecto, Moon, & Nelson (2012) |
| Time is money effect | 25 | Connors et al. (in press, Studies 1 and 2) |
| Heat priming | 23 | McCarthy (2014, Studies 1 and 2) |
| Treating-prejudice-with-imagery effect | 22 | McDonald et al. (2014, Studies 1 and 2) |
| Religion priming | 11 | McCullough & Hone (2015) |
| Attachment-warmth embodiment effect | 10 | LeBel & Campbell (2013, Studies 1 and 2) |
Recommendations for peer review in current (strained?) climate

In this post, I will discuss challenges that arise when peer reviewing submitted articles in the current tense climate. This climate stems from the growing recognition that we need to report our methods and results more openly and completely, avoiding questionable research practices and hence questionable conclusions. The post is inspired by a recent piece in which two authors felt unfairly accused of “nefarious practices”, and by some of my own recent experiences peer reviewing articles.

The goal of peer review — for empirical articles at least — is to carefully evaluate research to make sure that the conclusions drawn from the evidence are valid (i.e., correct). This involves evaluating many different aspects of the reported research, including whether correct statistical analyses were carried out, whether appropriate experimental designs were used, and whether any confounds were unintentionally introduced, to name a few.

Another concern, which has recently received much more attention, is assessing the extent to which flexibility in design and/or analyses may have contributed to the reported results (Simmons et al., 2011; Gelman & Loken, 2013). That is, if a set of data is analyzed in many different ways and such analytic multiplicity isn’t appropriately accounted for, incorrect conclusions can be drawn from the evidence because of an inflated false positive error rate (e.g., incorrectly concluding that an IV had a causal effect on a DV when in fact the data are entirely consistent with what one would expect from sampling error if the null were true).
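To make the inflation concrete, here is a minimal simulation sketch (my own illustration, not taken from Simmons et al. or Gelman & Loken) in which the null hypothesis is true everywhere, yet the analyst gets to choose among a few exclusion thresholds and two DVs and reports whichever analysis comes out significant:

```python
# Simulated false positive rate when the "best" of several defensible analyses is
# reported. No true effect exists, yet the reported rate far exceeds the nominal .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group = 5000, 30
false_positives = 0

for _ in range(n_sims):
    group = np.repeat([0, 1], n_per_group)           # two conditions, no true difference
    dv1 = rng.normal(size=2 * n_per_group)           # two candidate dependent variables
    dv2 = rng.normal(size=2 * n_per_group)
    rt = rng.normal(500, 100, size=2 * n_per_group)  # reaction times used for exclusions

    p_values = []
    for cutoff in (np.inf, 700, 650, 600):           # flexible exclusion thresholds
        keep = rt < cutoff
        for dv in (dv1, dv2):                        # flexible choice of DV
            a = dv[keep & (group == 0)]
            b = dv[keep & (group == 1)]
            p_values.append(stats.ttest_ind(a, b).pvalue)
    if min(p_values) < 0.05:                         # report the "best" analysis
        false_positives += 1

print(false_positives / n_sims)  # well above the nominal .05
```

The exact rate depends on how many analysis options are available and how correlated they are, but the qualitative point is the one made above: unmodeled flexibility makes p < .05 easy to reach even when nothing is there.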

Hence, a crucial task when reviewing an empirical article is to rule out the possibility that flexibility in analyses (and/or design, e.g., the data-collection stopping rule) can account for the reported results, and hence to avoid invalid conclusions. From my perspective, however, it is really important that we as reviewers do this very carefully, so that the authors whose work is being reviewed do not feel accused of intentional p-hacking or researcher misconduct.

Here’s an example to demonstrate my point. During peer review of an article on goal-directed bias in memory judgments (at Consciousness & Cognition), O’Connor and Mill felt unfairly accused of “unconventional and nefarious practices” in analyzing their data (see here for details). We don’t have all of the details, but it appears that one of the reviewers was concerned about how the authors made exclusions with regard to (1) an overly low sensitivity index (d’) and (2) native-language requirements. This reviewer went on to say that “the authors must accept the consequences of data that might disagree with their hypotheses”. The reviewer was completely justified in being concerned that flexibility in the exclusion criteria that could have been used might have led to invalid conclusions regarding the target phenomenon (i.e., how goals can bias memory processes). In my opinion, however, the language used to express that concern was inappropriate because it insinuated that such flexibility may have been intentionally exploited.

Another example comes from a recent paper I reviewed (under review at Frontiers) that reported evidence that “response effort” may moderate the impact of cleanliness priming on moral judgments. On the surface, the evidence seemed very strong, but upon closer inspection I realized that there was quite a bit of flexibility with respect to (1) how “response effort” was operationalized across the four reported studies and (2) the criteria used to exclude participants who exhibited “insufficient effort responding”. Concerned that such flexibility may have contributed to an inflated false positive error rate (and hence invalid conclusions), I carefully delineated these concerns and concluded my review by stating:

“In sum, the main problem is that based on the methods and results presented in the current manuscript, we cannot rule out the possibility that unintentional confirmation bias inadvertently (1) biased the operationalization of “response effort” and (2) biased the chosen exclusion criteria, which in combination represents a potential alternative explanation for the current pattern of results.”

Notice that I intentionally framed my concern in terms of flexibility in analyses having unintentionally biased the results. This is crucial because most authors are probably not aware that flexibility in analyses/methods may have unduly influenced their reported results. Of course they will become defensive if you insinuate that they intentionally exploited such flexibility when in fact they did not. This would be akin to insinuating that researchers intentionally confounded their experimental manipulation! The point here is that flexibility in analyses/design — just like experimental confounds — needs to be ruled out, and this is necessary for valid inference regardless of whether these problems were introduced intentionally or unintentionally.

Recommendations

1. Always frame your concerns about flexibility in analyses/design (or any other concern) using language that focuses on the ideas rather than the authors.
2. Give the benefit of the doubt to authors and always assume that flexibility in analyses/design may have unintentionally influenced the reported results.
3. Use a standard reviewer statement that has been specifically designed to help with such matters. The statement (developed by Uri Simonsohn, Joe Simmons, Leif Nelson, Don Moore, and myself) can be used by any reviewer to request disclosure of additional methodological details, which can help assess the extent to which flexibility in analyses/design may have contributed to the reported results. Using this standard statement is another way to avoid having the authors feel as though you are insinuating they have intentionally done something questionable.

“I request that the authors add a statement to the paper confirming whether, for all experiments, they have reported all measures, conditions, data exclusions, and how they determined their sample sizes. The authors should, of course, add any additional text to ensure the statement is accurate. This is the standard reviewer disclosure request endorsed by the Center for Open Science [see http://osf.io/hadz3]. I include it in every review.”

Insufficiently open science — not theory — obstructs empirical progress!

I stumbled upon Greenwald et al.’s (1986) “Under what conditions does theory obstruct research progress?” article the other day and decided to re-read it. I found it very interesting to re-read in the context of current controversies about p-hacking and replication difficulties. Very prescient indeed.

In the article, Greenwald et al. argued that theory obstructs research progress when:
1. testing theory is the central goal of research, and
2. the researcher has more faith in the correctness of the theory than in the suitability of the procedures used to test the theory.

Though I agree with their main argument (and indeed we’ve made a very similar argument here), I don’t think it’s completely correct (or it is at least incomplete given what we now know about modal research practices).

I want to put forward the possibility that it is insufficiently open research practices — rather than theory-confirming practices — that obstruct empirical progress! Testing theory has always involved the (precarious) goal of producing experimental results that confirm theory-derived novel empirical predictions. Such endeavors almost always involve repeated tweaking and refinement of procedures and calibration of instruments. As long as researchers are sufficiently open about the methods used to execute their experimental tests, however, such theory-confirming practices *can* lead to empirical progress. This is because being open means other researchers can more objectively gauge all of the methodological tweaking that was required to get the theory-confirming result, and also because being open means using stronger methods and better thought-out experimental designs to begin with. Consequently, being more open means theory-derived empirical predictions are more open to disconfirmation (given that disconfirmation requires strong methods), which actually substantially *accelerates* research progress! Don’t take my word for it; here’s what Richard Feynman had to say on the subject:

“We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.” (Richard Feynman)

 

Two quotes from Greenwald et al.’s article that inspired this post!

“The theory-testing approach runs smoothly enough when theoretically predicted results are obtained. However, when predictions are not confirmed, the researcher faces a predicament that can be called the disconfirmation dilemma (Greenwald & Ronis, 1981). This dilemma is resolved by the researcher’s choosing between proceeding (a) as if the theory being tested is incorrect (e.g., by reporting the disconfirming results), or (b) as if the theory is still likely to be correct. The researcher who preserves faith in the theory’s correctness will persevere at testing the theory — perhaps by conducting additional data analyses, by collecting more data, or by revising procedures and then collecting more data.” (p. 219).

“A theory-confirming researcher perseveres by modifying procedures until prediction-supporting results are obtained. Particularly if several false starts have occurred, the resulting confirmation may well depend on conditions introduced while modifying procedures in response to initial disconfirmations. However, because no systematic empirical comparison of the evolved (confirming) procedures with earlier (disconfirming) ones has been attempted, the researcher is unlikely to detect the confirmation’s dependence on the evolved details of procedure. Although the conclusions from such research need to be qualified by reference to the tried-and-abandoned procedures, those conclusions are often stated only in the more general terms of the guiding theory. Such conclusions constitute avoidable overgeneralizations.” (p. 220)

Confusion regarding scientific theory as contributor to replicability crisis?

[DISCLAIMER: Ideas and statements made in this blog post are in no way intended to insult or disrespect my fellow psychologists.]

In this post, I will discuss psychology’s replicability crisis from a new angle. I want to consider the possibility that confusion regarding what scientific theory is and how theory is developed may have contributed to the replicability crisis in psychology.

Scientific theories are internally consistent sets of principles that are put forward to explain various empirical phenomena. Theories compete in the scientific marketplace by being evaluated according to the following five criteria (Popper, 1959; Quine & Ullian, 1978):

1. parsimony: simpler theories involving the fewest entities are preferred to more complicated theories
2. explanatory power: theories that can explain many empirical phenomena are preferred to theories that can only explain a few phenomena
3. predictive power: a useful theory makes new empirical predictions above and beyond extant theories
4. falsifiability: a theory must yield falsifiable predictions
5. accuracy: degree to which a theory’s empirical predictions match experimental results

It is important to point out explicitly, however, that underlying all of these considerations is the requirement that, before a theory can be put forward, there must exist demonstrably repeatable empirical phenomena to be explained. Demonstrably repeatable is understood to mean that an empirical phenomenon “can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed” (Popper, 1959, p. 23). Put simply, scientific theories aim to explain repeatable empirical phenomena; without repeatable empirical phenomena, there is nothing to explain and hence no theories can be developed.

The idea, then, is that confusion regarding these points may have contributed to the current replicability crisis. To support this point, I will briefly review some examples from the beleaguered “social priming” literature. [DISCLAIMER: I contend my argument likely also holds in other areas of experimental psychology; I chose this literature out of convenience, and my intention is not to pick on these specific researchers.]

For example, in a piece entitled “The Alleged Crisis and the Illusion of Exact Replication”, Stroebe and Strack (2014) state that:

“Although reproducibility of scientific findings is one of science’s defining features, the ultimate issue is the extent to which a theory has undergone strict tests and has been supported by empirical findings” (p. 60).

Stroebe and Strack seem to be saying that the most important issue (i.e., the “ultimate issue”) in evaluating a scientific theory is whether the theory has been supported by empirical findings (the accuracy criterion, #5 above), while at the same time downplaying reproducibility of findings as merely “one of science’s defining features”. This position, however, doesn’t seem to fit with the considerations above, whereby reproducible empirical phenomena are required before a scientific theory can even be put forward, let alone be evaluated vis-à-vis other theories.

In another example, Cesario (2014) — in the context of discussing what features of the original methodology need to be duplicated for a replication attempt to be informative — states:

“We know this only because we have relevant theories that tell us that these features should matter.” (p. 42) “Theories inform us as to which variables are important and which are unimportant (i.e., which variables can be modified from one research study to the next without consequence).” (p. 45)

Cesario seems to be saying that we can use a scientific theory to tell us which methodological features of an original study need to be duplicated in order to reliably observe an empirical phenomenon. Such a position would seem to put the cart before the horse, however, given that without demonstrably repeatable empirical phenomena to explain, no theory can be developed in the first place.1

A final example comes from an article by Dijksterhuis (2014, “Welcome back theory!”), who summarizes Cesario’s (2014) paper by saying:

“Cesario draws the conclusion that although behavioral priming researchers could show more methodological rigor, the relative infancy of the theory is the main reason the field faces a problem.” (p. 74)

Dijksterhuis seems to be saying that the field of behavioral priming currently has problems with non-replications because of insufficiently developed theory. This position is again difficult to reconcile with the standard conceptualization of scientific theory. With all due respect, such a position would be akin to saying that ESP researchers have yet to document replicable ESP findings because theories of ESP are insufficiently developed!

But how could this happen?

I contend that such confusion regarding scientific theory has emerged due, at least in part, to the relatively weak methods used in modal research (LeBel & Peters, 2011). This includes the improper use of null hypothesis significance testing (i.e., treating p < .05 as indicating a “reliable” finding) and an over-emphasis on conceptual rather than direct replications. Conceptual replications involve immediately following up an observed effect with a study using a different methodology, which renders any negative results completely ambiguous (i.e., was the different result due to the different methodology or to the falsity of the tested hypothesis?). This practice effectively shields positive empirical findings from falsification (see here for a great blog post precisely on this point; see also Greenwald et al., 1986). Granted, once the reproducibility of a particular effect has been independently confirmed (using the original methodology), it is of course important to subsequently test whether the effect generalizes to other methods (i.e., other operationalizations of the IV and DV). However, we simply cannot skip the first step. This broadly fits with Rozin’s (2001) position that psychologists need to place much more emphasis on first reliably describing empirical phenomena before setting out to test hypotheses about those phenomena.
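As a side note on the “p < .05 = reliable” confusion, here is a minimal sketch (my own illustration, with hypothetical numbers) showing that when a modest true effect is studied with small samples, a significant original result says little about whether an exact, same-sized replication will also be significant:

```python
# With a true effect of d = 0.4 and n = 20 per group (hypothetical values), a single
# study has low power, so a significant original is usually followed by a
# non-significant exact replication even though the effect is real.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, n_sims = 0.4, 20, 20000

def one_study():
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d, 1.0, n)
    return stats.ttest_ind(a, b).pvalue

originals = np.array([one_study() for _ in range(n_sims)])
replications = np.array([one_study() for _ in range(n_sims)])

sig_orig = originals < 0.05
print("single-study power:", sig_orig.mean())                  # roughly 0.23
print("P(replication significant | original significant):",
      (replications[sig_orig] < 0.05).mean())                  # also roughly 0.23
```

The probability that the replication succeeds is simply the study's power, not the near-certainty that the “p < .05 = reliable” reading suggests, which is exactly why independent direct replications, rather than a single significant result, are what establish a phenomenon as demonstrably repeatable.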


1. That being said, Cesario should be lauded for his public stance that behavioral priming researchers need to directly replicate their own findings (using the same methodology) before publishing their findings.

 

References

Cesario, J. (2014). Priming, replication, and the hardest science. Perspectives on Psychological Science, 9, 40–48.

Dijksterhuis, A. (2014). Welcome back theory! Perspectives on Psychological Science, 9(1), 72–75.

Greenwald, A. G., Pratkanis, A. R., Leippe, M. R., & Baumgardner, M. H. (1986). Under what conditions does theory obstruct research progress? Psychological Review, 93(2), 216.

LeBel, E. P., & Peters, K. R. (2011). Fearing the future of empirical psychology: Bem’s (2011) evidence of psi as a case study of deficiencies in modal research practice. Review of General Psychology, 15, 371–379.

Popper, K. R. (1959). The logic of scientific discovery. New York, NY: Basic Books.

Quine, W. V. O., & Ullian, J. S. (1978). The web of belief (2nd ed.). New York, NY: Random House.

Rozin, P. (2001). Social psychology and science: Some lessons from Solomon Asch. Personality and Social Psychology Review, 5(1), 2-14.

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9, 59–71.