What happens when cancer doctors, psychologists, and drug developers can’t rely on each other’s research?
Nullius in verba is the motto of one of the earliest scientific associations, the Royal Society, founded in 1660. Broadly translated, the phrase means “Don’t take anybody’s word for it.”
You know how it’s supposed to work: A scientist should ideally be able to do the same experiment as any other scientist and get similar results. As researchers check and recheck each other’s findings, the sphere of knowledge expands. Replication is the path to scientific advancement.
Some 15 million researchers published more than 25 million scientific papers between 1996 and 2011. Among them were several casting doubt on the veracity and reliability of the rest — suggesting that even studies published in gold-standard journals by researchers from top-tier institutions are far more likely than anyone previously realized to be false, fudged, or flukey. The upshot is that many researchers have come to believe that science is badly battered, if not broken.
Everything We Know Is Wrong?
The Stanford statistician John Ioannidis sounded the alarm about our science crisis 10 years ago. “Most published research findings are false,” Ioannidis boldly declared in a seminal 2005 PLOS Medicine article. What’s worse, he found that in most fields of research, including biomedicine, genetics, and epidemiology, the research community has been terrible at weeding out the shoddy work, largely due to perfunctory peer review and a paucity of attempts at experimental replication. Ioannidis showed, for instance, that about one-third of the results of highly cited original clinical research studies were shown to be wrong or exaggerated by subsequent research. “For many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias,” he argued. Today, he says science is still wracked by the reproducibility problem: “In several fields, it is likely that most published research is still false.”
Initially, some researchers argued that Ioannidis’ claims were significantly overstated. “We don’t think the system is broken and needs to be overhauled,” declared New England Journal of Medicine editor Jeffrey Drazen in The Boston Globe in 2005. But Ioannidis’ analyses have sparked a vast and ongoing reassessment of how science is done. Once other scientists started looking into the question, they found the same alarming trend everywhere.
In 2012, researchers at the pharmaceutical company Amgen reported in Nature that they were able to replicate the findings of only six out of 53 (11 percent) landmark published preclinical cancer studies. Preclinical studies test a drug, a procedure, or another medical treatment in animals as precursors to human trials. In 2011, researchers at Bayer Healthcare reported that they could not replicate 43 of the 67 published preclinical studies that the company had been relying on to develop cancer and cardiovascular treatments and diagnostics. Ioannidis estimates that “in biomedical sciences, non-replication rates that have been described range from more than 90 percent for observational associations (e.g., nutrient X causes cancer Y), to 75–90 percent for preclinical research (trying to find new drug targets).”
The mounting evidence that most scientific findings are false provoked a rash of worried headlines, including “How Science Is Broken” at Vox in 2015; “Why medical clinical trials are so wrong so often” in The Washington Post in 2015; “The Truth Is Many Scientific Studies Can’t Be Trusted” at Business Insider in 2012; “Lies, Damned Lies, and Medical Science” in The Atlantic in 2010; and “Most Science Studies Appear to Be Tainted by Sloppy Analysis” in The Wall Street Journal in 2007.
In April 2015, the editor of the prestigious medical journal The Lancet reported that the participants in a recent conference believed “much of the scientific literature, perhaps half, may simply be untrue.” In August, the journal Science reported that only one-third of 100 psychological studies published in three leading psychology journals could be adequately replicated. In October, a panel of senior researchers convened by the British Academy of Medical Sciences (BAMS) issued a major report on research reproducibility indicating that the false discovery rate in some areas of biomedicine could be as high as 69 percent.
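How can a false discovery rate climb so high when the significance threshold is only 0.05? A rough sketch of the arithmetic, in Python, is below. The function name and the input values are illustrative assumptions, not figures taken from the BAMS report: suppose only 1 in 10 tested hypotheses is actually true and a typical study has a 20 percent chance of detecting a real effect.

```python
def false_discovery_rate(prior, power, alpha=0.05):
    """Share of 'statistically significant' findings that are false positives.

    prior: assumed fraction of tested hypotheses that are really true
    power: assumed chance that a study detects a true effect
    alpha: significance threshold (chance of a false alarm when no effect exists)
    """
    if not 0 < prior <= 1:
        raise ValueError("prior must be in (0, 1]")
    if not 0 < power <= 1:
        raise ValueError("power must be in (0, 1]")
    false_positives = alpha * (1 - prior)   # no real effect, yet p < alpha
    true_positives = power * prior          # real effect, and the study finds it
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Illustrative inputs only (assumptions, not data from any report).
    for power in (0.2, 0.5, 0.8):
        rate = false_discovery_rate(prior=0.1, power=power)
        print(f"power={power:.1f}  false discovery rate={rate:.2f}")
```

Under those assumed inputs, roughly two out of three "significant" results would be false alarms, and even well-powered studies would still produce a substantial share of them. The point of the sketch is only that low power and long-shot hypotheses compound each other.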
If the rate of false positives is as high as feared, the Oxford researchers Ian Chalmers and Paul Glasziou suggest that as much as 85 percent of the resources devoted to biomedical research are being wasted. Globally, that amounts to about $200 billion every year.
In a June 2015 article for PLOS Biology, Leonard Freedman of the Global Biological Standards Institute and his colleagues note that published estimates for the irreproducibility of preclinical research range from 51 percent to 89 percent. They estimate that at least half of all U.S. preclinical biomedical research funding (about $28 billion annually) is therefore squandered. As the colossal failure to replicate prominent cancer studies indicates, this wasted research results in treatments not developed and cures not found.
Venture capital firms now take it for granted, according to SciBX: Science-Business eXchange, that 50 percent of published academic studies cannot be replicated by industrial laboratories. Before investing in biomedical startups, they often hedge against “academic risk” by hiring contract research organizations to vet the science. This slows down the process of translating genuine discoveries into new products.
The invention of the scientific process during the past two centuries is arguably humanity’s greatest intellectual achievement. Science and the technological progress it fosters have dramatically lengthened life spans, lessened the burdens of disease, reduced ignorance, and eased the hardships of work and daily life. We are living in a time of technological marvels, with advances like CRISPR gene-editing being used to bring back extinct mammoths; lithium-air batteries that store 10 times more energy than conventional lithium-ion batteries; mitochondrial transfers that create healthy babies who have three genetic parents; Ebola vaccines that are nearly 100 percent effective; and cars that drive themselves.
After accounting for the contributions of labor and capital, economist Robert Solow calculated that nearly 90 percent of all improvements in living standards are due to technological progress. But we are handicapping ourselves with shoddy research practices and standards that waste tens of billions of dollars and send brilliant minds down scientific dead ends.
There is no one single cause for the increase in nonreproducible findings in so many fields. One key problem is that the types of research most likely to make it from lab benches into leading scientific journals are those containing flashy never-before-reported results. Such findings are often too good to check. “All of the incentives are for researchers to write a good story—to provide journal editors with positive results, clean results, and novel results,” notes the University of Virginia psychologist Brian Nosek. “This creates publication bias, and that is likely to be the central cause of the proliferation of false discoveries.”
In 2014, the Cardiff University neuroscientist Christopher Chambers and his colleagues starkly outlined what they think is wrong. They noted that the gold standards for science are rigor, reproducibility, and transparency. But the academic career model has undermined those standards by instead emphasizing the production of striking results. “Within a culture that pressures scientists to produce rather than discover,” they bleakly conclude, “the outcome is a biased and impoverished science in which most published results are either unconfirmed genuine discoveries or unchallenged fallacies.”
As the editors of The Lancet put it in 2014, “Science is not done by paragons of virtue, but by individuals who are as prone to self-interest as anyone else.” Such self-interest can take many forms, including seeking research grants and pursuing academic career advancement.
“If you torture the data long enough, it will confess to anything.” So goes an old joke often attributed to the late Nobel-winning economist Ronald Coase. This data dredging comes in two basic forms, known as p-hacking and HARKing.
The first involves the p-value, a statistic meant to gauge how easily chance alone could have produced an experiment’s results. P-values (the p stands for probability) range from 0 to 1; a p-value is the probability of seeing results at least as extreme as those observed if there were in fact no real effect. In much of the biomedical and social sciences, researchers parse their data to see if that probability is no more than one chance in 20, a p-value of 0.05. (This is also known as achieving statistical significance.) P-hacking is the practice of running multiple tests on a data set, looking for a statistic that clears the threshold for statistical significance, and reporting only that “best” result.
HARKing is a similar practice in which scientists continue “hypothesizing after the results are known.” If their initial hypothesis does not achieve statistical significance, the researchers retrospectively hunt through the data looking for some kind of positive result. The aim, again, is to report the “best” result they can extract from the data. In Nosek’s words, they “tell a good story by reporting findings as if the research had been planned that way.” This increases the chances of finding false positives and makes results seem stronger than they really are.
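To see why reporting only the "best" of many tests is so corrosive, consider a small simulation, sketched here in Python. Everything in it is an illustrative assumption: the function names, the sample sizes, and the simplification that each outcome is tested independently with a z-test of known variance. The simulated studies contain no real effect at all, so every "significant" result is a false positive.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def null_study_p_values(n_outcomes, n_per_group, rng):
    """P-values from one simulated study in which NO real effect exists.

    For each outcome, both groups are drawn from the same N(0, 1)
    distribution, so any difference between them is pure chance.
    """
    p_values = []
    for _ in range(n_outcomes):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        # z-test with known unit variance: the standard error is sqrt(2 / n)
        z = diff / math.sqrt(2 / n_per_group)
        p_values.append(two_sided_p(z))
    return p_values


def p_hacked_false_positive_rate(n_outcomes, n_studies=1000, n_per_group=20, seed=1):
    """Fraction of null studies whose smallest p-value falls below 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if min(null_study_p_values(n_outcomes, n_per_group, rng)) < 0.05:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 20):
        simulated = p_hacked_false_positive_rate(k)
        expected = 1 - 0.95 ** k  # chance that at least one of k independent tests "hits"
        print(f"{k:2d} outcomes tested: simulated {simulated:.2f}, theory {expected:.2f}")
```

With a single planned test, about 1 study in 20 produces a false alarm, as the 0.05 threshold promises. With 20 independent outcomes to choose from, theory puts the chance of at least one "significant" result at about 64 percent, even though nothing real is there to find.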
Another significant issue: Scientists often don’t bother to report when they fail to find anything significant in their experiments. Part of the issue is that the editors of scientific journals are not generally interested in publishing studies that find no effect. But experimental failure is also important knowledge, and not publishing such results skews the literature by making positive results appear more robust than they really are. (Not publishing the outcomes of failed clinical trials is also an ethical violation, since study volunteers deserve to know the results of their sacrifices.)
Many reported studies are simply statistically underpowered. Roughly, statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected. In general, the bigger the effect, the easier it is to detect. Larger sample sizes make it easier to detect more subtle effects. The problem is that many studies use sample sizes that are too small to accurately detect the subtle effects researchers are attempting to extract from the data they’ve collected. Why is this occurring? Largely because it’s expensive and time-consuming to assemble and test sufficient numbers of lab animals or human subjects.
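The relationship between effect size, sample size, and power can be made concrete with a short calculation, sketched here in Python. It assumes the simplest textbook case, a two-sided comparison of two equal-sized groups using a normal approximation; the function names and the effect sizes tried are illustrative, not drawn from any study mentioned above.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    effect_size is the true difference in means divided by the standard
    deviation (often called Cohen's d). Uses a normal approximation.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def n_needed(effect_size, target_power=0.8, alpha=0.05):
    """Smallest per-group sample size that reaches the target power."""
    if effect_size == 0:
        raise ValueError("no sample size can detect an effect of zero")
    n = 2
    while power_two_group(effect_size, n, alpha) < target_power:
        n += 1
    return n


if __name__ == "__main__":
    for d in (0.8, 0.5, 0.2):
        print(f"effect size {d}: power with 20 per group = "
              f"{power_two_group(d, 20):.2f}, "
              f"per-group n for 80% power = {n_needed(d)}")
```

Under this approximation, halving the effect size roughly quadruples the number of subjects needed to keep the same power, which is why subtle effects are so expensive to study properly.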
The BAMS report notes another problem: Many researchers treat study designs as flexible. They modify them as they go along in ways that aim to increase the likelihood of finding a result that journal editors will accept and publish. Again, researchers seek to craft the best story instead of acknowledging the likelihood of confounding details.
Since attempts at replication are rare, the accumulating false discoveries in the published scientific literature are not being corrected. One obvious suggestion would be to require replication of results before a study is published. In a 2012 article in Perspectives on Psychological Science, however, Nosek and his colleagues warn that “requiring the replication of everything could stifle risk-taking and innovation.”
Solving the Problem
Are there any solutions to the reproducibility problem? The BAMS report calls for greater transparency, urging researchers to openly share their underlying data and publish the details of their study protocols. It also presses for more collaboration and endorses a number of reporting guidelines, such as checklists to help researchers adhere to certain criteria when publishing studies. Additionally, researchers are being encouraged to pre-register their studies, and the scientific community is being prodded to shift its focus away from pre-publication peer review and toward post-publication review. Ioannidis notes that several of these practices have started to catch on.
One attempt to enhance data sharing is the Open Science Framework (OSF) project, which the BAMS report cites as a practical example of how to improve research openness and transparency. Developed and promoted by the Center for Open Science, a group Nosek co-founded, the OSF is organized around a free, open-source Web platform that supports research project management, collaboration, and the permanent archiving of scientific workflow and output. The OSF makes it easier to replicate studies because outside researchers can see and retrace the steps taken by the original scientific team.
Nosek advocates an even more radical break with current research reporting practices. “I favor a much more free market idea—publish anything and do peer review post-publication,” he says.
The model he has in mind is the e-print distribution platform arXiv. It is now common practice for physicists to post their pre-publication articles there to undergo scrutiny and assessment by other researchers. As a result, physicists today tend to think, as the Stanford University computational scientists Jonathan Buckheit and David Donoho put it back in 1995, that “a scientific publication is not scholarship itself, it is merely advertising of the scholarship.” Other disciplines are emulating the arXiv pre-print model, including the Social Science Research Network and the recently launched bioRxiv for life sciences research.
The current peer review process serves as both gatekeeper and evaluator. Post-publication review would separate these functions by letting the author decide when to publish. “Making publication trivial would foster a stronger recognition that study results are tentative and counter the prevalent and often wrong view that whatever is published is true,” Nosek explains. Another big benefit, as Nosek and his colleagues argued in 2012, is that “the priorities in the peer review process would shift from assessing whether the manuscript should be published to whether the ideas should be taken seriously and how they can be improved.” This change would also remove a major barrier to publishing replications, since novelty-seeking journal editors would no longer serve as naysaying gatekeepers. Ultimately, Nosek would like the OSF to evolve into something like a gigantic open-source version of arXiv for all scientific research.
Nosek, like Ioannidis, is a big fan of pre-registering research projects. Among other things, pre-registration makes clear to outside researchers when a project aims to be confirmatory rather than exploratory, hypothesis-generating research. This discourages researchers from succumbing to the seductions of p-hacking and HARKing. “Transparency can improve our practices even if no one actually looks, simply because we know that someone could look,” Nosek and his colleagues observed in 2012.
Just how big an effect pre-registration can have on reporting results was highlighted in an August 2015 PLOS ONE study. It analyzed the results from 55 large National Heart, Lung, and Blood Institute–supported randomized controlled trials between 1970 and 2012 that evaluated drugs or dietary supplements for the treatment or prevention of cardiovascular disease. In 2000, the agency started requiring that all trials be pre-registered with clinicaltrials.gov. The PLOS ONE paper found that 17 of 30 studies (57 percent) published prior to 2000 showed a significant benefit from the investigated treatments. After 2000, only two among the 25 trials published (8 percent) reported a therapeutic benefit.
Some journals are now agreeing to publish articles that have been pre-registered regardless of their findings. Researchers submit to peer review of their proposed project’s hypotheses, methods, and analyses before they begin collecting data. Reviewers at that point can make suggestions for revisions to the study and so forth. Once the process is completed, the journal commits to publishing the results as long as the study adheres to the preapproved protocols.
The OSF also awards “badges” to researchers who adopt its open science practices. Several journals now publish the OSF badges with the articles, letting other researchers know they have access to the study data and pre-registration information. To further encourage open science practices, the OSF is now offering 1,000 grants of $1,000 apiece for research that is pre-registered at its site and gets published.
Scientists are still generating plenty of false discoveries. But the good news is that science is beginning to self-correct its broken-down self-correction methods. Despite the current reproducibility crisis, Ioannidis says, “Science is, was, and will continue to be the best thing that has happened to human beings.”