SPECIAL REPORT

Biomedical research: Are All the Results Correct?

As concerns grow about the difficulty of reproducing research studies, the true extent of the problem remains unclear. But poor reproducibility is only one of many factors that together make biomedical research highly inefficient, analyses suggest.

Article published on 31 March 2014

Two studies published in the January 30 issue of Nature suggested that treating mature mouse cells with an acidic solution can turn them into pluripotent stem cells (Nature 505, 641 and 676, 2014). The simplicity of the approach surprised many, and the research made worldwide headlines.

But within a week, commenters on a forum called PubPeer started to report problems, such as manipulated figures. What’s more, multiple attempts to replicate the results have been unsuccessful, according to a web site that lists such attempts. The RIKEN Center for Developmental Biology in Kobe, Japan, where much of the work was done, is investigating the matter. 

The final word hasn’t been spoken on the stem cell papers, but if the results don’t hold up, the papers will join an increasing number of biomedical research studies that are not reproducible.

Concerns about the issue were already running high when the stem cell papers appeared. In the same issue of Nature, National Institutes of Health (NIH) director Francis Collins and principal deputy director Lawrence Tabak wrote that “recent evidence showing the irreproducibility of significant numbers of biomedical-research publications demands immediate and substantive action” (Nature 505, 612, 2014). And just one day after the article appeared, on January 31st, the President’s Council of Advisors on Science and Technology met in Washington DC to discuss the problem and possible solutions.

At the meeting, C. Glenn Begley, chief scientific officer at TetraLogic Pharmaceuticals, said he found mistakes in each of three articles from top-tier journals he had randomly picked from his desk. Typical mistakes, he said, included the lack of blinding or the use of the wrong statistical methods. “You’ll see that in every issue of the top-tier journals,” he says. “Those are the papers you can be sure will not be reproduced, without even doing an experiment.” 

That’s a strong statement, but Begley has evidence to back it up: In 2012, he coauthored a comment in Nature reporting that during his ten years, from 2002 to 2012, as vice president and global head of hematology and oncology research at Amgen, his team failed to reproduce 47 of 53 preclinical cancer studies that had appeared in top journals, including Nature and Science, and described, in Begley’s words, “something completely new” (Nature 483, 531, 2012).

In about a dozen cases, Begley’s team even went to the lab of the original authors to let them reproduce their own experiments. Most of them couldn’t. In many of these cases, Begley asked that the experiments be done in a blinded fashion—something the authors hadn’t done the first time around. This suggests that lack of blinding was a major problem in many of the studies. Other problems Begley found in studies that couldn’t be reproduced included failure to report data that didn’t support the point the authors wanted to make; failure to repeat experiments; the use of inappropriate statistical tests; and the use of reagents—such as antibodies—that hadn’t been validated. 

Begley says that at least in the field he looked at—preclinical biomedical research—the irreproducibility problem seems to be quite pervasive: For one, the 47 studies his team couldn’t reproduce came from 46 different labs. What’s more, Bayer researchers reported in 2011 that in only about a quarter of 67 drug target validation projects were their in-house findings completely consistent with the relevant published data, and that as a result, many of the projects were terminated (Nat. Rev. Drug Discov. 10, 712, 2011). AstraZeneca and Novartis have since also said publicly that this matches their experience, Begley says. “This is a systemic problem that’s common to very many laboratories,” he says.

Casey Ydenberg, a fourth-year postdoc who studies actin disassembly in yeast cells at Brandeis University, isn’t surprised by Begley’s observations. He says his field is awash with biochemical data about how proteins interact with actin, but much of it is contradictory. “It simply can’t all be true,” Ydenberg says. “Certainly the simplest explanation is that a lot of these things just haven’t been very carefully done.”

Not everyone agrees, however, that the situation is necessarily that serious. Jeffrey Leek, a biostatistician at the Johns Hopkins Bloomberg School of Public Health in Baltimore, says Begley and the Bayer researchers didn’t describe their methods, data and results in sufficient detail to make it possible to interpret the results. For example, neither study identified the articles they tried to replicate. (Begley says he couldn’t disclose the identity of the 53 studies he tried to reproduce because of confidentiality agreements). “We really don’t have enough information to know how they performed those studies,” Leek says.

What’s more, Leek notes, the Many Labs Replication Project, co-led by University of Virginia psychologist Brian Nosek, recently reported that 10 of 13 psychological effects could in fact be replicated, after a huge collaborative effort that involved 36 scientific groups in 12 different countries conducting experiments with over 6,000 volunteers. To Leek, this is a reassuring piece of evidence that things aren’t as bad as some people say they are. Nosek cautions that the study wasn’t designed to determine the overall rate of reproducibility in psychology, something he and others are currently trying to address in another study called “Reproducibility Project: Psychology.” What it does show, however, is that it is possible to consistently replicate observations across many different labs, Nosek says.

Leek also recently challenged the claim of a now-famous theoretical analysis that appeared in 2005 under the provocative title “Why Most Published Research Findings Are False” (PLoS Med. 2, e124, 2005). In it, John Ioannidis, who studies the reliability of research and is now at Stanford University, identified problems that can make studies unreliable. These include small study sizes; lack of blinding and randomization; the analysis of huge amounts of data in search of associations; and a bias toward publishing novel and exciting results without collaborating with other scientists. Ioannidis also showed that well-designed studies such as randomized clinical trials are much less likely to come to the wrong conclusions. Still, because most biomedical studies suffer from one or more of the problems he identified, Ioannidis concluded that most published biomedical findings are likely wrong. The analysis has since become one of the most viewed articles on the web site of PLoS Medicine, the journal that published it.
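
Ioannidis’s argument rests on a simple relationship between a study’s statistical power, its significance threshold, and the prior odds that the hypothesis being tested is true. The sketch below illustrates the kind of calculation behind that claim; the function name and the example numbers are illustrative assumptions, not figures taken from his paper.

```python
def positive_predictive_value(prior_odds, alpha=0.05, power=0.8):
    """Chance that a 'significant' result reflects a true effect, in the
    pre-study-odds framework of Ioannidis (2005).

    prior_odds -- odds that the tested relationship is actually true
    alpha      -- significance threshold (false-positive rate among true nulls)
    power      -- 1 - beta (chance of detecting a true effect)
    """
    true_positives = power * prior_odds   # true effects that reach significance
    false_positives = alpha * 1.0         # null hypotheses that reach significance
    return true_positives / (true_positives + false_positives)

# Illustrative numbers: a well-powered trial testing a plausible hypothesis
# versus a small, underpowered exploratory study testing a long shot.
print(positive_predictive_value(prior_odds=1.0, power=0.8))   # ~0.94
print(positive_predictive_value(prior_odds=0.05, power=0.2))  # ~0.17
```

In Ioannidis’s framework, the biases listed above push that second number down further still, which is how most findings from small, exploratory studies can end up being false.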

But late last year, Leek and a colleague came to a much more optimistic conclusion when they analyzed the more than 5,000 p values they could find in the abstracts of the 77,430 studies that appeared in four leading medical journals between 2000 and 2010 (the study has since appeared in print in Biostatistics 15, 1, 2014). Assuming that these p values were selected in an unbiased way and behaved as they would if they had all come from one experiment, Leek and his colleague calculated that the fraction of false positives among those studies is only about 14%. To Leek, the conclusion from all of this is clear: There is not enough definitive published evidence to say whether most results are really false.
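
The idea behind such an estimate is to model the reported significant p values as a mixture of two groups: true effects, whose p values pile up near zero, and false positives, whose p values are roughly uniform below the 0.05 cutoff. The weight of the second component is the estimated fraction of false positives. The toy sketch below, run on simulated data, shows the principle; it is a deliberately simplified stand-in rather than the published method, and it holds the shape of the alternative distribution fixed instead of estimating it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate "reported" significant p values: a mix of false positives
# (uniform below the 0.05 cutoff) and true effects (piled up near zero).
n, true_frac_false, cutoff, shape = 5000, 0.14, 0.05, 0.3
n_false = int(n * true_frac_false)
false_p = rng.uniform(0, cutoff, size=n_false)
true_p = cutoff * rng.uniform(size=n - n_false) ** (1 / shape)  # truncated Beta(shape, 1)
p = np.clip(np.concatenate([false_p, true_p]), 1e-12, cutoff)

# Component densities on (0, cutoff).
f_null = np.full_like(p, 1.0 / cutoff)               # uniform false positives
f_alt = shape * p ** (shape - 1) / cutoff ** shape   # truncated Beta(shape, 1) true effects

# EM for the mixture weight, i.e. the estimated fraction of false positives.
frac_false = 0.5
for _ in range(200):
    resp = frac_false * f_null / (frac_false * f_null + (1 - frac_false) * f_alt)
    frac_false = resp.mean()

print(f"estimated fraction of false positives: {frac_false:.2f}")  # should land near the simulated 14%
```

The published analysis estimates the shape of the alternative distribution from the data rather than fixing it, among other refinements this toy version ignores.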

In a published rebuttal, however, Ioannidis argues that Leek’s analysis has major flaws (Biostatistics 15, 28, 2014). For example, most of the studies Leek analyzed are well-designed studies such as clinical trials, which would be expected to be more reliable, and the p values mentioned in the abstracts likely represent the most attractive subset of the p values from those studies. What’s more, Ioannidis says, the fact that Leek used a computer program to automatically extract the p values from the text of the abstracts can be a problem: If an abstract says that “20 results were significant (p<0.05),” then Leek’s approach would miss 19 of the 20 p values. Leek’s study, Ioannidis says, is a classic example of the kind of flawed study he described in his 2005 paper: “You have wrong data, you have wrong analysis, and you have wrong interpretation,” he says.
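
The extraction problem Ioannidis points to is easy to illustrate: a scraper that pulls explicit p values out of abstract text has no way of expanding an aggregated statement into its individual results. The snippet below uses a made-up abstract sentence and a naive pattern; it is purely illustrative and is not the program Leek actually used.

```python
import re

# A hypothetical abstract sentence of the kind Ioannidis describes.
abstract = ("The treatment improved survival (p=0.01). "
            "Of the secondary endpoints, 20 results were significant (p<0.05).")

# Naive scraper: pull out every explicitly written p value.
p_values = re.findall(r"[pP]\s*[<=]\s*(0\.\d+)", abstract)
print(p_values)  # ['0.01', '0.05'] -- the 20 aggregated results are counted only once
```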

For now, the debate over how serious the problem really is seems far from over. But even if much of the current evidence is anecdotal, there are clearly enough cases to cause concern, Tabak says. “I am the first one to admit that the plural of anecdote is not data,” he says. “But there are many anecdotes. I have heard this from many, many scientists. I am also an anecdote [and] have experienced this first hand. I personally have been unable to replicate certain studies along the way in my career, and I will tell you how frustrating it is to frankly clean up somebody else’s mess.”

Beyond (ir)reproducibility

Poor reproducibility is just one of many factors that together make biomedical research highly inefficient, according to two analyses of waste in biomedical research that appeared in The Lancet in 2009 and 2013.

For one, many studies aren’t even necessary, because researchers often don’t sufficiently check previous research on a topic before they embark on a study, says Paul Glasziou, a researcher at Bond University in Australia who was a coauthor of both analyses. 

For example, he says, 7,000 stroke patients were unnecessarily enrolled in clinical trials that tested whether the calcium channel blocker nimodipine was an effective treatment for stroke. The trials found that it wasn’t. But they should never have been done in the first place, Glasziou says, because at the time they started, a systematic review of animal studies showing no effect was already available. “When you look into [this], you just think this system is crazy,” he says. “People are going ahead and doing studies, without having looked at the basis on which they do that study.”

Another problem is that about half of all research findings aren’t published, at least not for some time: For example, an analysis of a random sample of 677 completed clinical trials registered at the NIH-run clinicaltrials.gov web site between 2000 and 2007 found that about half of them still hadn’t been published several years after they were finished.

And even when studies are published, about half don’t report the methods and experimental outcomes in enough detail for others to reproduce and build upon them: A 2009 analysis found that only 59% of 271 animal studies described the number and characteristics of the animals used, and a 2013 analysis of 238 basic molecular biology studies reported that about half of the ingredients typically used in such studies—such as antibodies or cell lines—weren’t identified precisely enough for others to replicate the studies without contacting the original authors.

If the research process were a furniture factory, Glasziou says, it wouldn’t survive: “Half of the design of the furniture [is] so flawed that it [is] never going to work, [half of] the stuff that does work never gets out the door, and [half of] the stuff that gets out the door gets broken before it actually gets to the end user.”

As a result, the 2009 Lancet analysis concluded, 85 cents of every dollar invested in research worldwide go to waste. “It is absolutely astonishing,” Glasziou says. “We are just wasting a huge proportion of the roughly 200 billion dollars in investment worldwide in [bio]medical research.”

Taking action

So what can be done about these problems? To Glasziou, funders are in the best position to improve the situation, because they control the most powerful incentive: money. For example, the Health Technology Assessment program in the UK holds back 10% of its funding until the complete findings are reported in the form of peer-reviewed articles. The result: After some additional chasing, 98% of the researchers comply. “It’s only the funders that have that clout and interest to do that,” Glasziou says.

Collins and Tabak recently announced that the NIH is also taking, or considering, steps to address the issue. The reproducibility problems stem from a “complex array of factors,” they noted in their January commentary in Nature, including poor training of researchers in experimental design; an emphasis on provocative statements rather than technical details in published research; an overvaluation by funding organizations of research published in high-profile journals; and a “secret sauce” some scientists use to make their experiments work—the details of which they then withhold from their publications to “retain a competitive edge.”

In response, the NIH will include experimental design and statistics in the mandatory training of NIH postdocs later this year. If successful, the material will be made available to other institutions, Tabak says. And the NIH isn’t the only one trying to improve training: Leek and his colleagues at Johns Hopkins have developed free online statistics classes, one of which deals with reproducibility issues.

The NIH is also piloting a checklist to help reviewers evaluate the experiments proposed in grant applications for design elements such as blinding and randomization, and for whether the proposed research is based on sufficiently solid previous work. Later this year, the agency will decide whether to make these changes a regular part of the grant review process.

To give researchers an incentive to make raw data available, the NIH is also exploring ways to host such data online so that the number of downloads can be counted as a metric of scientific contribution that is independent of journal publication.

Some NIH Institutes and Centers, Tabak says, are even planning to replicate certain studies that could serve as the basis for human clinical trials. That’s because a failed trial based on flawed preclinical studies is likely to cost more than replicating the preclinical work would.

But Collins and Tabak also wrote in their January commentary that the NIH can’t tackle the problem alone, and called for help from other parties including journals and universities.

Nature and Science have taken some steps to address these issues, such as removing length limits on online methods sections; consulting statisticians in some cases when reviewing papers; and ensuring that authors report whether they followed standards such as blinding and randomization, where possible, in designing their experiments.

Still, journals often aren’t very interested in publishing replication attempts of earlier research or negative findings, Collins and Tabak wrote, a situation that’s slowly starting to change: Journals that publish negative data now include the Journal of Negative Results in BioMedicine, the Journal of Unsolved Questions (JUnQ), and the Journal of Cerebral Blood Flow & Metabolism. PLoS One also publishes replication attempts, including failed ones, and last November, Nature Biotechnology published the failed replication attempt of a study and announced that the journal “will remain open to publishing replication studies and rigorous efforts that fail to reproduce findings from other publications of high interest to our readers.”

According to a Nature spokesperson, Nature journals “do consider negative results if they are well constructed, solid studies. However, papers reporting negative results tend not to be developed as fully as papers reporting positive results, and as such often fail to meet other editorial criteria.” That said, Nature Publishing Group has launched a “minimal threshold journal” called Scientific Reports that “should easily accommodate” such papers, though “submissions of this kind of paper remain low,” the spokesperson adds.

JUnQ has also had a hard time getting enough submissions of research papers reporting negative results, says JUnQ editor-in-chief Andreas Neidlinger. That’s probably in part because, for a researcher, negative results “won’t get you funds” and might even give the impression “that you don’t know how to do your work correctly,” he says. We still have a long way to go until negative data are as highly valued as positive ones, Neidlinger says, adding that most big journals still aren’t very interested in publishing them because negative data aren’t perceived as interesting enough and don’t attract enough attention from readers.

Meanwhile, the company Science Exchange and the non-profit Center for Open Science (which was co-founded by Nosek) have secured $1.3 million in grant funding from the Laura and John Arnold Foundation to replicate, at a cost of less than $25,000 per study, 50 preclinical cancer biology studies they determined to be the highest-impact studies published between 2010 and 2012. The researchers doing the work come from a network of over 800 labs that Science Exchange has established as part of its core business: matching researchers who lack certain tools or expertise with labs that can provide them, for a fee.

The results of this “Reproducibility Project: Cancer Biology” will be announced by the end of the year, says Science Exchange co-founder and CEO Elizabeth Iorns. The open-access publisher PLoS has agreed to publish the results of all 50 replications (including the negative ones) as well as a meta-analysis. Because the project is funded by grant money, the labs whose papers are replicated don’t have to pay for the replications themselves.

Next, Iorns plans to replicate studies from other fields, such as neurodegeneration and cardiovascular disease research. Eventually, she hopes that funding organizations like the NIH will routinely allocate a certain percentage of their grant funding to replication of key experimental results of important studies. That, she hopes, will create an incentive for researchers to try to get their experiments right in the first place. 

But for now, Tabak says, the NIH has no plans to extend replication efforts beyond a few selected studies that serve as the basis for clinical trials. “We are extremely selective in how we are going to approach this, at least initially,” he says. Still, he adds, the NIH is watching the “Reproducibility Project: Cancer Biology” with “great interest.” 

Science Exchange is also trying to address another problem: that many of the antibodies researchers use don’t seem to be sufficiently validated and often don’t work the way they are advertised. Science Exchange is therefore validating 10,000 antibodies in partnership with antibodies-online.com, the world’s largest antibody marketplace. Because the manufacturers pay for the project, the identities of the antibodies that fail won’t be revealed. But validated antibodies will get a green seal of approval on the antibodies-online web site, and Iorns plans to at least publish the overall numbers.

Voices of caution

Not everyone is comfortable with the way a company like Science Exchange is trying to replicate studies. “I am 100% for replication. Replication is the bedrock of science,” says Mina Bissell, a cancer biologist at Lawrence Berkeley National Laboratory. But, she adds, it needs to be done properly, especially in biology, where experiments can be tricky. “People in my lab often need months — if not a year — to replicate some of the experiments we have done,” she noted in a commentary on the issue that recently appeared in Nature (Nature 503, 334, 2013). “I am concerned,” she says, “that a company like Science Exchange tries to replicate as if scientific papers that take 4-6 years are lab tests.”

What’s more, she notes, reports that someone’s research can’t be reproduced might damage their reputation. “I am not saying that every paper in these fancy journals is beyond reproach,” she says, but adds that someone should only be able to say they failed to replicate a study if they have tried everything possible to do so, including, if necessary, a visit to the lab of the original authors. “The work is not finished unless they talk to the authors and exchange reagents and cells,” she says. “If the results still can not be replicated, a visit must be arranged, and if together with the colleagues in the laboratory the work is irreproducible, then the paper should be withdrawn.” In fact, Bissell just published the results of a successful attempt to reconcile different results between her and another lab (Cell Reports 6, 779, 2014). 

But Begley, who is on the advisory board of the “Reproducibility Project: Cancer Biology,” says that if experiments are as tricky to perform as Bissell describes, then the results of those experiments are unlikely to be robust enough as a basis for the development of therapies in humans. “A result that cannot be translated to another lab using what most would regard as standard laboratory procedures is not a result,” he wrote in the comments section of Bissell’s article. “It is simply a ‘scientific allegation.’”

And Iorns notes that the labs replicating studies for the cancer reproducibility project are all chosen for their experience with the model systems used in the original studies, and that the original labs are given the opportunity to comment on the replication protocols before the experiments are conducted. What’s more, she says, the peer reviewers of the final papers will include one of the original authors, and Science Exchange will make sure that the replications don’t suffer from the problems Begley found in the studies he couldn’t reproduce at Amgen, such as lack of blinding and randomization.

Reducing the pressure

Another way to make research more reliable is to reduce the intense pressure, especially on early-career researchers, to publish exciting results in top journals. It’s not even “publish or perish” anymore, Glasziou says; it’s “publish a lot or perish.” Faced with leaving science as the only alternative to becoming a professor, some might be willing to bend the rules, if not break them, Ydenberg adds. Because too many PhD students are competing for too few academic positions, he says, funding agencies should support additional stable career paths in research, such as staff scientist positions.

The NIH is considering steps to reduce some of the pressure, such as extending grant support for selected projects beyond the current average of four years per project, or changing the “biographical sketch” on grant applications to better reflect an applicant’s actual contributions to science rather than the traditional list of unannotated publications.

For Ydenberg, such changes will likely come too late: He says he is leaving science, negotiating a career transition into software development. The main reason is that he wants to be able to choose where to live, but the uncertainties and pressures in research have certainly “led to a certain level of disenchantment,” he says. “I am not really sad to leave it behind, or not as sad as I think I would have been a few years ago.”