In Scientific Method We Don’t Just Trust: Or Why Replication Has More Value Than Discovery (Gordon)

Welcome to the 2019 Robert S. Gordon Jr. Lecture in Epidemiology. My name is
Stephanie George. I’m an epidemiologist in the Office of Disease Prevention,
and I’m pleased to represent our director
Dr. David Murray today and introduce our speaker. Before I do that I wanted to
invite everyone to join us after the lecture for a reception in the NIH
library directly to my left. The Robert S. Gordon Jr. Lecture is awarded each year
to a scientist who has made major contributions in the area of research or
training in the field of epidemiology or the conduct of clinical trials. The award
was established in tribute to Robert S. Gordon Jr. for his dedication and
contributions to the field of epidemiology and his distinguished service of over
30 years to NIH, during which time he served in numerous senior
leadership positions. Now let me introduce our speaker, Dr. John Ioannidis. John
comes to us from Stanford University, where he is the C.F. Rehnborg
Chair in Disease Prevention, professor of medicine, of health research and policy,
and by courtesy of biomedical data science and of statistics. He’s also the
director of the Meta-Research Innovation Center and the director of the PhD
program in epidemiology and clinical research. Dr. Ioannidis received his MD
from the National University of Athens in 1990 and also received a doctorate in
science in biopathology from the same institution. He trained at Harvard and
Tufts in internal medicine and infectious diseases. His awards are
numerous, as are the disciplines his work spans. I encourage you to check out
his biography online, and I will now, with no further ado, introduce our
speaker. Dr. Ioannidis’ presentation today is titled “In Scientific Method We
Don’t Just Trust: Or Why Replication Has More Value Than Discovery.” Please join me
in welcoming our 2019 Robert S. Gordon Jr. Lecture award recipient, Dr. John Ioannidis.

Thank you for this tremendous honor and for asking me to give this
lecture today on this topic. It’s always a pleasure to be back at NIH and to
meet with colleagues, and to meet more colleagues, very bright people. This
is a stellar institution, very unique in the world.
So, in scientific method we don’t just trust. Trust is great to have, but we
need to have more than just trust. Science is a fantastic enterprise, very
successful: about a hundred and eighty million papers floating around. This is
just 20 million of them from the last 16 years. They create a universe, along with
2 million patents and 200,000 disciplines of science. I’m pointing with
an arrow here, and it’s a very tiny font, but let me magnify that: “Hi there, my best
paper is a speck of dust in a speck of dust in a speck of dust somewhere around
here.” No single paper can ever compete with its surroundings or with
science at large. Science is a communal effort. It’s that whole galaxy that
matters. It’s a living galaxy, an evolving galaxy, with plenty of data, with
plenty of hypotheses, with plenty of inferences. At the same time, that galaxy
also has plenty of empty space, plenty of dark matter. And what would that dark
matter be? It would be analyses that have been done but never published.
It would be data that are available but not really available: they were accessed
maybe once by someone, but then they were lost. It would also be science that was
not done. It would be replications that were not done because they were thought
to be “me too” research, and also lost opportunities: research that never flourished
because it was not funded, because there were not enough resources to do it. So
how can we look at the existing matter and try to understand how we can expand
that universe further? Much of the emphasis has been on making discoveries,
but to be honest I think that discovery is a boring nuisance. Almost every single
paper that you will read in the literature will claim that it has some
novel results. In every grant that I submit to NIH I promise to find
something novel, and I know that probably I’m lying, statistically speaking. This is
a text mining exercise probing the entire PubMed from 1990 to 2015:
96% of the biomedical literature claims statistically significant
and mostly novel results whenever p-values are used, either in the abstract
of the paper or in the full-text papers that are available in MEDLINE.
Translation proceeds at a glacial pace. Despite having all these millions of
statistically significant and seemingly novel results, we do make progress, but
it’s not millions of discoveries that move on to have that tremendous impact on
medical care, on patient outcomes, on health, on living better and longer lives.
Several years ago we looked at the crème de la crème, the most highly cited
clinical research ever published. These are shown here with red
milestones, and we asked: how long did it take to get there, to have one of the
most highly cited papers in the history of medicine? On average it took about 25
to 30 years. In the case of nitric oxide it took 200-plus years. There are a
few exceptions where things materialize faster. For example, I consider it one of the
most exciting experiences in my career when, early on as a scientist, I was at
NIH and was involved in ACTG 320, a randomized trial
that showed that with triple therapy we could dramatically decrease the risk of
death and disease progression for HIV-infected patients. That trial was
published less than four years from the time that protease inhibitors
had been developed, based on a concentrated and communal effort to try
to do something about HIV, and it did work. So how can we shift from stories
where it takes 200 years or 30 years to stories where it takes four years? And
also, how can we save not only the timing but also the credibility of these
efforts? Because in that slide I also have some black milestones, which mark
when larger, better-controlled studies were done that showed that the crème de la
crème paper was actually either completely wrong or exaggerated in terms
of what it was promising to have found. I think that we are incentivizing a fake
narrative. The narrative that is dominant is that we have an oversupply of major
true discoveries, and I think that this is currently untenable. I don’t need to
remind you how few new substances, new drugs get licensed, even though we’re
trying to push that agenda and we have the brightest minds in the world working
on science. The effort that we put in is tremendous, and the real progress that
you can measure, in any way you want, is far more limited. Now this narrative
coexists with some additional urges. One of them is: make haste,
rush, patients are dying, license new drugs immediately. I do feel that
pressure, and I think that yes, we need to make haste, but this doesn’t mean that we
should cut corners in terms of the evidence that we need to accrue. The second
urge is to be methodologically sloppy: just get an answer right away, we have all
these huge, routinely collected datasets, no need to do
randomized controlled trials, just run something through the mill of your
computer and that’s going to be accurate. Very often this is not accurate. And
the third urge is to avoid replication: don’t waste time with replication, this
is “me too” effort, we need to move forth to even more discoveries. Now the value of
discovery can be modeled. Here’s a very simple model. Let’s say that R is
the pre-study odds of the research finding being true, BF is the Bayes factor
that is conferred by the discovery data, and H is the ratio of the weight of negative
consequences from a false-positive discovery claim versus the positive
consequences from a true-positive discovery. Then the value of the
discovery process is proportional to true positives minus H times false
positives, or, if you work through that, it is proportional to R times the Bayes
factor minus H. Obviously you can add a constant: if you have two scientists,
2,000 scientists, 2 million scientists, you can multiply that by how much effort
is going into this. R and H are rather field-specific.
If I have wasted my life in a field where there is nothing to discover,
that’s very sad, but that field just has nothing really useful to discover. If I
want to make progress I just need to change fields. Sometimes maybe there are
things to be discovered, but they would require a completely disruptive approach,
abandoning what we do and just taking a completely new approach to trying to
attack the same question. H is also pretty much field-specific. So the focus
should be, must be, on increasing the Bayes factor of the work that we do, the
information content of the work that we do. The options for increasing the Bayes
factor are numerous, and they depend on what field we’re working in, but
typically running larger studies, so getting more evidence, would help, and
also ensuring greater protection from biases would help, which means that we
optimize our design, our statistics, our thinking ahead of time about how we
should go about answering a question. In the very same equation it’s very easy to
get negative values. If you want to avoid having negative values from discovery,
then one needs the Bayes factor to be more than the ratio of H over R, and often
this is very difficult. Why is that? Most original discoveries are claimed to come
from small studies where biases are very common, and therefore the Bayes factor
often is even less than five, which in the Bayesian world means very, very little,
and also most fields currently are working in areas where the pre-study
odds are pretty low. We’re attacking massive spaces of exploration where
there are signals to be discovered, but the denominator of all the potential
things that we can measure and all the things that we can analyze is probably
very large. If you have that combination, most discoveries are operating in a
space where they have negative scientific value. It’s far more likely
that they will confuse us, that they will generate false positives that would lead
people astray, with more resources wasted downstream building on something that is
a false-positive claim or an exaggerated claim, rather than that they will save
the world. How do we sort out where to go next?
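
As a minimal sketch of the simple model described above (an editorial illustration, not code shown in the lecture), the relative value of a discovery claim can be written as R times the Bayes factor minus H, and the break-even Bayes factor as H over R. The function names and the numeric inputs below are hypothetical and chosen only for illustration.

```python
def discovery_value(pre_study_odds: float, bayes_factor: float, harm_ratio: float) -> float:
    """Relative value of a discovery claim, up to a positive constant: R * BF - H."""
    return pre_study_odds * bayes_factor - harm_ratio


def break_even_bayes_factor(pre_study_odds: float, harm_ratio: float) -> float:
    """Bayes factor above which the relative value becomes positive: H / R."""
    return harm_ratio / pre_study_odds


if __name__ == "__main__":
    # Hypothetical inputs: pre-study odds of 1:10, a Bayes factor of 5,
    # and false positives weighted as heavily as true positives (H = 1).
    R, BF, H = 0.1, 5.0, 1.0
    print(discovery_value(R, BF, H))      # negative, since R * BF is smaller than H
    print(break_even_bayes_factor(R, H))  # Bayes factor needed just to reach zero value
```

With low pre-study odds, even a Bayes factor close to five leaves the value negative under these assumed inputs, which is the situation the lecture describes for small, bias-prone studies.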
We need replication. We need to take all of these tentative discoveries and try
to replicate them and see what still survives different efforts to reproduce the
results, either in exactly the same way or with different angles of triangulation.
Is that new? Not necessarily. This is pretty much how science started in the
Western world: you had to replicate and reproduce findings in front of the Royal
Academy, and people were watching to see whether the apparatus would work.
Currently we mostly trust that what was done behind closed doors or behind a
closed computer is something that we can put trust in. We do have empirical
studies on fields where replication is not just considered to be a “me too” type
of effort and replication practices are actually common, and these empirical
studies suggest that most of the initially claimed statistically
significant effects are either false positives or substantially exaggerated.
One such field is genetics. Genetic epidemiology went through an immense
transformation over the last ten years, moving from candidate-gene studies, where
people had to come up with single hypotheses, to testing in an agnostic
fashion, with much larger sample sizes and consortia, the whole genome more or
less in terms of association with phenotypes. By doing this one could go
back and assess how often the papers that were published in the candidate-gene era
were successfully replicated through genome-wide association approaches. The
replication rate on average was 1.2 percent. This may be an underestimate
because of power considerations. I’m willing to take that to five percent or
maybe ten percent at most, but more than 90% of these papers that were published
for many, many years, for over 20 years, in the very best of our journals were
probably not saying much and were probably just false positives.
Here’s another evaluation: animal studies. There are tens of thousands of animal
studies being done, and I think that they’re tremendously important because
animal research is an essential, indispensable gateway to human clinical research.
If we decide to just move directly from very early bench discovery to
humans, I think we will be testing lots of noise inadvertently on humans, and we
definitely don’t want to do that. However, you’re all aware that for many
fields where we have hundreds if not thousands of potential leads from animal
studies, like neurological diseases, we hardly have any successes when it comes
to humans. Take, for example, dementia or treatment of stroke: we
have a couple of treatments for stroke, but we have hundreds of treatments that
seem to work in animal models, at that level. One might say that the explanation
for that is that animals are very different from humans, and I think that
there are differences. I hope I’m a little bit different from a mouse, but
my genome is not that different, and I think that most of the problem is
not so much the dissimilarity of the experimental system as the way that
research is done, in ways that let bias creep in. This is a slide from a paper
where we looked across the entire literature that we could get on animal
studies of neurological diseases, and we found that very prominent signals of
excess significance bias, that is, pressure to deliver statistically
significant results, could be detected across that field. Once you started
sorting out different hints or patterns of bias, there were very, very few pieces
of evidence that remained pretty strong. Preclinical research: in the last
eight years we have seen rapid change in our understanding about replication and
reproducibility. This time most of the leads came from the industry, and the
industry had every right to feel uncertain about not being able to
reproduce papers published in top journals by academic teams. For
academic investigators maybe this is curiosity, maybe this is a paper in
Nature, maybe it is a way to get tenure. For the industry it was an issue of
spending half a billion dollars on something
that would lead nowhere. So we had several companies that launched
reproducibility checks on high-profile, highly cited papers coming from academic
teams, pretty much summarizing what they had already been seeing on a large scale:
most of the results could not be reproduced in their hands.
The reproducibility rates ranged from zero to 25%. In one such famous example,
where Amgen could only replicate six out of fifty-three landmark studies for
oncology drug target projects, Glenn Begley concluded that “the failure to win
the war on cancer has been blamed on many factors, but recently a new culprit
has emerged: too many basic scientific discoveries are wrong.” We have seen
similar reproducibility checks across very different scientific disciplines.
Psychological science has also gone through a major transformation. This is a
paper that was published three years ago in Science, a collaboration of 273
psychologists and their teams trying to reproduce a hundred of the crème de la
crème papers from their top journals. You can read these results in different
ways, but the summary, no matter how you look at it, is that about two-thirds of
the time the original result could not be reproduced, and the
effect sizes were much smaller compared to the original. What if you only read
Nature and Science? This is another reproducibility check, of 21 papers. On
average the reproducibility effort revealed an effect size of 50% of the
original, and in many cases there was nothing there: the effect was completely
in the vicinity of the null, with pretty tight confidence intervals. Does
it mean that the reproduction was correct and the original was wrong?
All of these cases are efforts where people systematically tried to
follow the exact recipe of the original study, and they even communicated with
the investigators; they even tried to make sure that they had all the details
that would be critical. So there could be many explanations. It could be that both
are correct but, for some reason that we don’t know, they disagree. It could
be that both are wrong, because again
there are some biases creeping in that we’re not familiar with. Or it could
be that one of them is not correct. But clearly, if you have a situation where,
under the very best efforts to reproduce, you cannot get something to work again,
one has to wonder: would it work if we were to use that widely, in real life, in
patients and communities in the real world? Reproducibility efforts can be
tricky, and they can also be emotional. They can lead to what I call the
reproducibility wars. The reproducibility effort on cancer biology, for example,
started publishing a number of papers where very meticulous efforts were made to
reproduce high-profile cell biology and cancer biology papers.
Everything had been prespecified, there were preregistered
reports, the results were very clear, but then most of the time, if the original was
not reproduced, the original investigators would fight back, and you
end up in a situation where you feel that reputations are at stake. There are
very fierce emotions about who is right and who is wrong,
careers are thought to be at stake, and interpretations can differ on what
is successful and what is unsuccessful. This means that people do care about
reproducibility, and they should care. It’s really a central piece of the
scientific method. However, what exactly do we mean by reproducibility? What is
research reproducibility? If you look across all 22 major disciplines of
science, there’s a rapidly increasing use of that terminology, if you do a text
mining exercise like what I’m showing you here. But people mean very different
things. Basically you can separate
reproducibility into three main clusters. One is reproducibility of methods, which
is the ability to understand or to repeat as exactly as possible the
experimental and computational procedures: availability of software,
of scripts, of data, and the ability to put them together and get the same result
from the very same data. The second is reproducibility of results, which means that
we’re doing yet another study with new participants, new samples, new observations,
and we hope to get a result that is consistent, compatible, ideally as close as
possible to the original. And the third is reproducibility of inferences, which
means that we have one study, a replication, or multiple studies, or a body of
evidence, and I ask people in the audience, what do you conclude out of this? And
we may or may not agree
about what these data and what these results mean. Reproducibility can be
affected by the recipe of research practices that we apply, and you can
think of two extremes of research practices, two stereotypes. Of course this
is a stereotype; it doesn’t mean that I have some particular researcher in mind
who’s doing things wrong. But if you think of small data and big data, this is
what it often looks like. With small data, which is still the majority
in most scientific fields, we have the prototype of the solo, siloed
investigator with a small team and very limited resources, working in competitive
environments where there’s limited funding. Unavoidably the studies that
will be done will be pretty small. One needs to be successful: you have these
three or four years of funding, and you need to say, I have something major
to say, something major that would lead me to my next grant. Actually you
need to get going probably not after three years but after three days to
write the next grant. And how do you do that? Much of the time the results will
be quote-unquote negative or unimpressive. You need to start exploring,
searching whatever space you have available.
There will be a lot of cherry-picking of the best-looking results, a lot of
post-hoc interpretations. P-values of just slightly less than 0.05 are
typically considered to be enough, and with p-hacking it’s almost always
possible to get there. There’s no registration, because that decreases the
degrees of freedom. No data sharing, because it offers ammunition to
competitors. And no replication, because if you try replication and it fails, you
have invested twice the effort and you’re back to square zero, and people
say you found nothing, so that’s it for you. Some of the ways to improve
the validity of small sample size research are very easy to implement. They
cost nothing. They would save us from a lot of trouble. However, they are not
implemented. For example, we have known about experimenter bias for over half a
century now. Rosenthal published his papers more than
half a century ago, and we know that in animal experiments or in other in vivo
studies, but also in vitro, it’s great to have the people reading the results be
blinded to the experimental conditions. However, that happens less than 10% of
the time. Randomization: it’s not about humans here, it’s about animal
work or other experiments. It costs nothing, it’s very easy to do, and it would
save us from a lot of trouble with imbalances that would be consciously,
subconsciously, or unconsciously created between the
compared groups. Again, this is implemented less than 30% of the time, even in
the most recent studies. With big data we’re challenging the status quo of
small-study research, but it doesn’t necessarily mean that our odds of success are
better. In that situation, which is becoming more common with
electronic health records, with large omics
databases, and other such efforts, we have extremely large sample sizes. We have
overpowered studies that are likely to give signals no matter what; even if no
signals are worth detecting, you will detect tons of them. This means
that again there needs to be a cherry-picking process, and again a lot
of that is done post hoc. Actually most of these databases are not even
sampled for research purposes; they just happen to accrue overnight. I’m
sleeping, and I wake up in the morning and there are so many more patients
that have been added to the electronic health records that one or more
researchers could tap into. Statistical inference tools are a bit different:
people recognized very quickly that if they just use a p-value of less than
0.05, everything will be statistically
significant, so there’s a little more sophistication much of the time in
that space. But very often you see idiosyncratic statistical inference
tools without much consensus. People may be working in the same or very similar
fields, but nevertheless they’re using very different approaches and very
different thresholds for claiming success. Registration remains an
exception in that space too, and data sharing does happen, probably more often
than in fields that use small datasets, but unfortunately much of the time
there’s no understanding of what exactly is being shared. Most of the time
these datasets are just dumped somewhere with some access, where someone could
more or less easily get to them, and then you ask, does anyone know
what these data are and what they show? And literally even the investigator
who has generated the data probably does not know, because they’re just a big
black box, and it is very, very hard to interpret what exactly has been
generated, let alone how valid it might be. Replication does happen in some
fields, and I showed you some examples. And even when it does not happen, we may
have a situation where people say, I’m doing a study that is somehow different.
Actually there’s an incentive to justify that I’m doing a different study. But
then you look at the two studies and say, well, you’re asking exactly the same
question, but to get it published someone has to say it is different.
Meta-analysis can take these similar studies and examine them in
their totality, studies on the same question with pretty much the same
comparisons, with pretty much the same interventions or the same risk factors
or the same hypotheses being addressed, and give us a sense of the heterogeneity
between these different studies that address fairly similar questions.
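
One common way to quantify heterogeneity across studies in a meta-analysis is Cochran’s Q together with the I² statistic. The sketch below is an editorial illustration under that assumption; the lecture does not say which heterogeneity measure was used, and the function name and example numbers are hypothetical.

```python
from typing import Sequence, Tuple


def heterogeneity(effects: Sequence[float], variances: Sequence[float]) -> Tuple[float, float, float]:
    """Return (pooled effect, Cochran's Q, I^2) for an inverse-variance fixed-effect pooling."""
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q sums the weighted squared deviations of each study from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is the share of Q in excess of what chance alone would produce, floored at 0.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


if __name__ == "__main__":
    # Hypothetical effect sizes and variances for three studies on the same question.
    pooled, q, i_squared = heterogeneity([0.1, 0.5, 0.9], [0.01, 0.01, 0.01])
    print(pooled, q, i_squared)
```

A high I² only says that the studies disagree more than chance would explain. It cannot by itself distinguish genuine differences from bias.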
Heterogeneity may be genuine: much of the heterogeneity may reflect genuine
differences across these studies. However, it can also point to differences
that stem from biases that are differentially expressed or that affect
different investigations on the same question differently. If that’s the case,
you should be able to pick up some hints of the presence of a bias, because it leaves
a pattern of a particular type of diversity, a particular type of
heterogeneity, across the results that are being combined in a meta-analysis. So
this is pretty much what we did here. We prespecified 17 patterns of bias, and we
thought about what the literature would look like if these biases were present. And
for all the meta-analyses that we could find across all scientific fields (when
there were many, we just selected a random sample thereof), we
asked: do we see these patterns? Most of these biases could be seen in
most fields. However, some biases seem to have more common hints of their
presence in some fields compared to others. For example, small-study effects,
which is a pattern where small studies give more prominent, perhaps exaggerated
results compared to larger studies, was a pattern that was seen very prominently
in the social sciences, was seen prominently in the biomedical sciences,
and was not really seen, or only very soft signals were seen, in the physical
sciences. Physical sciences are much more used to working with large
datasets, with communal science, with CERN or big telescopes, sharing all the
information across all the participating physicists or astrophysicists who are
working on a particular domain of science. The same applied to other
disciplines: sometimes one or another field had more of a signal of one bias
or another, but most biases were seen in most scientific fields. What are some
potential solutions? There are many solutions that are being proposed.
They’re integral to the process by which we do science, and you can think of the
reproducibility problems as a central concern in everyday experimentation, in
everyday study design, in everyday ways of running science. This means that lots
of people are thinking about them, and some very smart solutions have appeared,
but also many of them are very speculative.
They have no empirical support to tell us that they will make things better
rather than make things worse because of collateral damage that they may cause.
Here are 12 families of solutions: large-scale collaboration; adoption of
replication culture as a sine qua non; registration; sharing of data;
reproducibility practices and checks; containment of conflicts of sponsors
and authors and other people involved; more appropriate statistical methods;
standardization of definitions and analyses whenever this is possible; more
stringent thresholds for claiming discoveries or successes; improvement of
study design standards; improvement in peer review, reporting, and
dissemination of research; and finally better training of the entire scientific
workforce in methods, in the scientific method and how it should be applied, and in
statistical literacy and numeracy. I hope I have convinced you by now that some of
the most successful fields in terms of credibility are those that have adopted
large-scale collaboration and a replication culture. This is an
example of Manhattan plots. Genetic epidemiology was transformed from a
field where almost nothing could be replicated to a situation where we can
pretty much safely reproduce signals. I’m not dwelling on whether these signals are
useful. You don’t know whether that’s something that you can take to patients
and really change their lives, but at least they’re there. Sometimes the
signals may be true but maybe they’re not going to be useful. This is perfectly
fine: we know what we’re dealing with, we know what the true complexity is
of the research questions that we’re facing. Registration can be tricky in
many situations. The last thing that I would like to see is people who are doing very
interesting exploratory research feeling that they need to say, I have
registered my study. If you wake up at 3 o’clock after midnight with a fancy idea,
you cannot go back to sleep, and you need to run to the lab to try it out, and
you’re just messing right and left with different possibilities, probably you
didn’t write a protocol before you started doing that. But
something very interesting may come out of it, and maybe you are the next
Alexander Fleming with penicillin. How likely is that to be the case? Not
very likely, but there are 20 million scientists authoring scientific papers, so
a few among those 20 million will be Alexander Fleming, hopefully. What we need
to do in that case is just say that that was exploration. It was mad, wild, crazy,
exciting, fascinating exploration. That’s what came out of it. Now someone needs to
reproduce it in a prospective reproducibility effort. Level 1 would be
registration of a data set. I feel very uneasy when I’m working in a field where
I don’t know how many data sets are out there that could be probed and in how
many different ways they can be probed. Registering a data set is like
registering a nuclear arsenal. I’m telling you that I have
this big data set in my computer. It includes observations on two million
people, and I have two thousand variables on each one of them, which means that
tonight, if I feel depressed, I can press a button on some statistical software
and launch so many billions or trillions of chi-square p-values against you. So
it’s one way to convey the breadth of the possibilities of analysis
that can be done. Level 2 would be registration of a
protocol, if a protocol exists. There’s lots of research where there’s no
protocol, and sometimes it’s just blue-sky exploration, but most of the time a
protocol is feasible and should be there before something is done. Time spent on
coming up with a protocol, I think, is always well spent. Level 3 would be
registration of the analysis plan, if there is an analysis
plan. Sometimes you have to acknowledge that this is the analysis plan that I
could think of ahead of time, and these are some modifications that arose because of
some peculiarities that I had not anticipated. Level 4 would be
registration of both the analysis plan and the raw data, and Level 5 would be
open live streaming, where you iteratively communicate with the rest of the
scientific community about the experiments that you’re planning to do,
receive feedback, and revise; they’re done, they’re shared again, and this is an
iterative, open process with the whole community. This is pretty much how the
claim by NASA that they had identified bacteria using arsenic instead of
phosphorus for their DNA backbone, which was a paper in Science, was refuted. In the
absence of having some rules in the science game in many fields any result
can be obtained what you get is what I call extreme vibration of effects and at
the extreme you get the Janus phenomenon being a Greek Roman god who could see
into opposite directions. These are some data from the National
Household Survey, extremely meticulously collected information, and if you focus
on the right panel, there is hazard ratio on the horizontal axis and
minus log10 p-value on the vertical axis. This is the
association of alpha-tocopherol, or vitamin E, levels with the risk of death,
and there’s 1 million points on that plot. There’s 1 million different results
that one can get in the very same data set on the very same question, just by
analyzing the data slightly differently. For example, death can be affected by
zillions of other factors, so if you account for a factor or not, you have two
choices. If you have 19 such choices to make, this is 2 to the 19th power; this is
1 million different possibilities of analyzing the very same data on the same
data set on the same question. 70% of the time vitamin E decreases the risk of
death, 30% of the time vitamin E increases the risk of death. If I have a
strong belief on what vitamin E should do, I can get that result no matter what.
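To make the arithmetic concrete: 19 independent include-or-exclude choices give 2**19 = 524,288 specifications, so "1 million" is a round figure (20 such choices would give 1,048,576). The sketch below is not the analysis shown in the talk. It assumes synthetic data, an ordinary linear regression in place of a hazard model, and only 5 optional covariates so that it runs instantly. Every name in it (enumerate_specifications, ols_coefficients, exposure_estimates) is hypothetical.

```python
import itertools
import random


def enumerate_specifications(n_covariates):
    """Yield every subset of optional adjustment covariates (2**n subsets)."""
    for size in range(n_covariates + 1):
        yield from itertools.combinations(range(n_covariates), size)


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols_coefficients(columns, outcome):
    """Least-squares fit via the normal equations (X'X) b = X'y."""
    p = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(p)]
           for i in range(p)]
    xty = [sum(u * y for u, y in zip(columns[i], outcome)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def exposure_estimates(exposure, outcome, covariates):
    """Return the exposure coefficient under every adjustment specification."""
    intercept = [1.0] * len(outcome)
    estimates = []
    for subset in enumerate_specifications(len(covariates)):
        columns = [intercept, exposure] + [covariates[j] for j in subset]
        estimates.append(ols_coefficients(columns, outcome)[1])
    return estimates


# Synthetic data (hypothetical). The true exposure effect is zero by construction;
# exposure and outcome are linked only through the shared covariates.
rng = random.Random(0)
n, k = 200, 5
covariates = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]
weights_exposure = [rng.gauss(0, 1) for _ in range(k)]
weights_outcome = [rng.gauss(0, 1) for _ in range(k)]
exposure = [sum(w * covariates[j][i] for j, w in enumerate(weights_exposure))
            + rng.gauss(0, 1) for i in range(n)]
outcome = [sum(w * covariates[j][i] for j, w in enumerate(weights_outcome))
           + rng.gauss(0, 1) for i in range(n)]

estimates = exposure_estimates(exposure, outcome, covariates)
negative_share = sum(e < 0 for e in estimates) / len(estimates)
print(f"{len(estimates)} specifications, {negative_share:.0%} with a negative estimate")
print(f"19 yes/no choices -> {2 ** 19} specifications")
```

The point of the sketch is only the bookkeeping: one data set, one question, and a spread of answers that depends on which adjustments the analyst happens to pick. How wide that spread is in real data depends on the data and the model.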
This is really happening on a daily basis it is happening on some of the
most influential papers that you will see in the literature. This is a paper
that last year was one of the 20 highest impact papers across all science and
practically it concluded that with 3 cups of coffee per day your risk of
death decreases by 17%. If these wonderful colleagues could give
me their data set I can make it be that your risk of death will increase by 17%;
it’s an open pledge. Transparency: how can we find out that we can even
trust the data, that there’s nothing missing, without transparency, for example? We need to find ways to make sharing easier. This is a pivotal study, study 329.
It sounds like a submarine, U-329, but actually it’s a randomised trial, and
when it was done and published in 2001 by SmithKline Beecham
it showed that paroxetine and imipramine are very effective and very safe for
major depression in adolescents. 15 years later the very same data set was
reanalyzed by independent, non-industry affiliated investigators and they
concluded that paroxetine and imipramine are not effective and they are not safe
for major depression in adolescents. How often does it happen? A few years ago we
looked at all the reanalyses that had been done on the same clinical question
from the same clinical trial data set. These are reanalyses rather than
replications, but if we cannot reanalyze and have some confidence, why should we
even take the second step of doing a separate independent replication in a
new study? We found 37 such reanalyses, and 35 percent of the time the
main conclusion of the reanalysis was different compared to the
conclusion of the original paper. This treatment should be used;
no, this treatment should not be used. It should be used in this subgroup; no, it
should be used in that subgroup. Was it research parasites who had run these
reanalyses? Was it rogue analysts who were trying to make a career for
themselves or put, you know, the great original investigators to shame?
Almost always it was the same original investigators who published the
reanalysis, but they published it in a publication environment
where if they were to say, “I tried to reanalyze my data and I found exactly
the same thing and I reach exactly the same conclusion,” someone will tell
them, “Why did you waste your time? You know, this is duplicate publication.” So
our incentive system selects for confusion, selects for discordant results,
even for things that are done by the same investigators. Recently we revisited
that pattern looking at another extreme, where
reanalysis actually should have been the default. PLOS Medicine and BMJ have
policies in place that if you publish a randomized trial you need to make not
only the full protocol but also all the raw data available to anyone who would
ask for them. So along with Florian Naudet and other colleagues from
my METRICS team, we invited investigators who had published under this policy to
share the data with us, and we promised that we would reanalyze them at no cost,
for free. It’s a little bit like getting an invitation from the IRS, but I’m
really glad that almost 50% of these investigators actually did send us their
data, and they also were very helpful with that, and we reanalyzed the data. We
found a few errors but nothing major; the conclusions would still remain the same.
So you have two extremes: one is a highly selective environment where people are
just trying to impress and share very little, and the other is also a very selective
environment where everyone is willing to share, or at least 50% are
willing to share, and they’re very open to having their data reanalyzed, and then
everybody looks fine. One might argue that maybe these people who did
contribute their data took an extra look, and if they found any major errors they
made sure that they sent us a version that would be compatible. I don’t want to
become paranoid. I believe that if we create the culture of sharing, most of
the results that we will get, if people are well trained, hopefully will be
reanalyzable and reproducible, at that level at least. So how do we improve
sharing? Sharing is a challenge in itself, and when we tried to retrieve
the raw data behind the most highly cited
papers in psychology and psychiatry we met with quite a lot of resistance.
You know, these trials and these studies were not bound by the PLOS
Medicine and BMJ rules, so they might or
might not share their data with us. We realized that in some cases they had made
their data available already and in a few more they were willing to share that
information, but very often they could not or did not. What were the reasons for
that? The most common reason was that it was outside of the researchers’
control. I have come across many, many situations where researchers do not
control their own data. Sometimes researchers publish as first or last
authors, or both, and they have never seen the data that they put their names
on; it’s really scary. Legal and ethical concerns: you know, the consent
did not allow that, and that was done in the 1990s or so. Preparing our own
sharing system. The data no longer exist (the classical example in the literature,
“the data has been eaten by termites,” is a famous quote). Insufficient resources. Or
researchers are still using the data, and the question is for how long: for six months,
for two years, for 20 years? Are we making any progress in sharing? We are. A few
years ago, along with Muin Khoury and Sheri Schully from NIH, we looked at the
reproducible research practices and transparency practices across the
biomedical literature, and we found that hardly anything was being shared between
2000 and 2014. There are some niches like genomics that are doing this, but in the
big view, zooming out, that was very, very uncommon. Conversely, when we looked at 2015 to 2017 there was real progress, and in some cases that progress can be
explained because journals did change their policies. For example in psychology,
with that revolution of reproducibility, some journals switched to encouraging
routine sharing with a badge system, and you see a rapidly increasing
rate of sharing data sets in the published papers in these journals. But
in biomedicine I think it has been more of a diverse movement, where multiple
journals, multiple fields, multiple investigators, multiple institutions are
incentivizing or facilitating sharing, so the rate has gone from literally
close to 0% to something like 25 percent by 2017. We also see some other
concomitant changes in transparency indicators. For example, disclosure of
funding or disclosure of conflicts of interest has gone up over the years. Most
people still claim that they have something novel to say; even if you look
at the abstract it’s very prominent. But there are more people who say that I’m
trying to replicate something, or at least trying to do something novel but at
the same time replicate existing knowledge. Computational methods can
facilitate much of our reproducibility quests. A couple of years ago we
published these guidelines trying to enhance ways that journals, investigators,
and institutions could move forward in improving their reproducibility for
computational methods, including software, scripts, and links to data. Very often you
see a link and you click on that link and there’s nothing there, you get an
error signal. So there’s some very easy interventions that can improve the
availability of these functional links, but there’s also some more sophisticated
ones that can take us substantially further. Better statistics and methods. We
have seen a transformation of research over the last several decades, and data
science has taken a central role across multiple scientific fields. Is that new?
Well, science has always been about data, but I think that while in the past it
might have been easy to work with small datasets for descriptive purposes,
currently you really need a license to kill, a license to analyze, and most of the
time most of the papers that I see probably are done by investigators who
don’t have that license to analyze. It’s very uncommon to see transparent
statistical analysis plans. We repeatedly lament that we don’t invest much in
training our investigators in statistics and issues of design, and
good study designs are underutilized. I mentioned the very simple choices of
randomization and blinding of investigators that are so underused. How
do we do that? Do we just use some checklists? Do we add another level of
bureaucracy, for example, and we say, well, fill in that checklist? Is that enough? If
someone has done something in a way that is very substandard, if it’s really silly,
how likely is it that it will be acknowledged, rather than just someone having
checked that checklist: so yeah, I did that, I’m okay? It’s a
continuous tension. Are we using the right statistical methods? There are many
pleas to try to change the way that we make statistical inferences, and I’m not
going to spend much time on them because each one of them has a different
philosophy in mind. So one such plea is to become more stringent. Many fields
that have improved their track record did become more stringent: genetic
epidemiology moved from p-values of 0.05 to a genome-wide significance level of
10 to the minus 9, and, you know, things seem to be working much better in terms
of reproducibility. For many fields that are working with p-values of 0.05, a very
simple move would be to shift to 0.005 for statistical significance, and that
would immediately eliminate probably a very large segment of noise, but that
would also take away some genuine signal, so this needs to be balanced in each
field in terms of whether it’s a good idea or not. Most fields are still using
null hypothesis testing, but probably this is not a good choice for many
applications, including for example developing a prognostic score or
assessing a diagnostic test or evaluating a therapy or mining
electronic health records or mining big data. We need to find statistical methods
that are fit for purpose, and very often these may be Bayesian, these may be false
discovery rate based, sometimes they may well be frequentist. Who’s going to do
that? We don’t really invest in training our investigators and retraining them,
with investment in continuing education on
statistical methods and design issues. We just try to catch up with the next
tool or next technical tool that may be available, but not the core of the
scientific method. Conflicts of interest. I think that there is improvement in
transparency of reporting conflicts of interest, but very often I wonder: is
transparency enough? Can we leave the generation of original knowledge and
replication to conflicted stakeholders? Who are the stakeholders who should be
running sensitive studies like randomized trials, meta-analyses,
cost-effectiveness analyses, guidelines? NIH has been shifting away from most of
those, and there’s some empirical evidence that
if you look at large randomized trials supported by NIH, before registration a
good proportion of them were statistically significant in their
results; after registration this happens very, very uncommonly. Should public
entities like NIH resume and expand their role in supporting sensitive
research where conflicts really need to be avoided thoroughly if we want to have
full trust in them? Should we wait for perfection? No. Science will never be
perfect. It’s the best thing that has happened to human beings, to Homo sapiens
sapiens, but it’s a process in evolution. We need to use the best science that we
have. So very often I hear, okay, let’s get rid of all that junk, that horrible
research, and we know nothing. That’s not true. You know, we have lights, we have
this wonderful amphitheater, the projector is working, I can move my
slides; lots of things have happened, and I think that we need to defend science,
and there will be many anti-science voices that are trying to dismantle the
scientific effort. There are two ways that we can go about it: one is to say
that science is perfect, and probably that’s not going to leave us very much
room to play, because very quickly we will meet with the scientific method
itself, which says that science is falsifiable and very often we may be
wrong. Or that science is our best shot, and I think that this is what we
need to defend. Sometimes our best shot probably will have lesser credibility
compared to other times. We know with high certainty about climate change, we
know with high certainty that tobacco is going to kill people, we know far less on
whether broccoli is gonna make me live longer. So we need to be transparent
about what we know and what we do not know. We need more research on research
and I’m clearly biased here because this is where I am investing my effort so
probably it’s like I’m asking for more funding but I think that we need to
study how exactly to evaluate our research practices. We need to find ways
that you can refute everything that I told you today with more empirical data
and with better science. We need to find how we can best perform research,
communicate research, verify research, evaluate research, and reward research.
What scientific workforce and what world of science are we envisioning? We can model
that; we can model science in 2030 or 2040. Is it going to be accurate? Well, I
think that we may well be wrong, but this is one such model of science in the
future. We used 11 equations to try to
describe a simplified universe of science where you have three types of
scientists. You have the diligent scientists, who are the majority.
You have the careless cohort of scientists, who might be cutting a few
corners, or maybe they’re not very well trained and they may be following some
suboptimal research practices. How many are they?
There are many surveys about that. When you ask people, are you cutting corners,
the answer is usually no; when you ask people, do you know other scientists in
your environment who are cutting corners, the answer is almost always yes. But
let’s say these are the minority. And then you have the unethical
cohort, which is clear fraud, you know, creating data that don’t exist. You know, fraud
is very, very uncommon; we’re talking about less than 1%.
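A toy sketch of a three-cohort model of this kind may help fix ideas. It is not the 11-equation model mentioned in the talk. It assumes a simple replicator-style update in which each cohort’s share of the workforce grows in proportion to the reward it collects, and every number in it (starting shares, claims per unit of resources, reproducibility rates) is a hypothetical illustration, as are the names evolve_shares, claims_per_resource and reproducibility.

```python
def evolve_shares(shares, reward_rate, generations):
    """Replicator-style update: each cohort's share is rescaled by the reward
    it collects, then all shares are renormalized to sum to 1."""
    history = [dict(shares)]
    for _ in range(generations):
        weighted = {name: shares[name] * reward_rate[name] for name in shares}
        total = sum(weighted.values())
        shares = {name: value / total for name, value in weighted.items()}
        history.append(dict(shares))
    return history


# All numbers below are hypothetical illustrations, not estimates from the talk.
initial_shares = {"diligent": 0.800, "careless": 0.195, "unethical": 0.005}
claims_per_resource = {"diligent": 1.0, "careless": 1.3, "unethical": 1.6}
reproducibility = {"diligent": 0.9, "careless": 0.5, "unethical": 0.1}

# Scenario A: every published claim earns the same reward.
same_reward = evolve_shares(initial_shares, claims_per_resource, 30)[-1]

# Scenario B: the reward for a claim is weighted by how often it reproduces.
weighted_rate = {name: claims_per_resource[name] * reproducibility[name]
                 for name in initial_shares}
differential = evolve_shares(initial_shares, weighted_rate, 30)[-1]

for label, final in (("same reward", same_reward),
                     ("reproducibility-weighted", differential)):
    print(label, {name: round(share, 3) for name, share in final.items()})
```

Under these assumed rates, whichever cohort produces rewarded output fastest ends up dominating, so the outcome hinges entirely on what the reward counts.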
If you incentivize these cohorts with the same incentives, if you discover
something, if you make a claim, you publish a paper in Nature, you will be
fine, and you don’t have differential incentives based on their
reproducibility, what you will get is that the unethical and the careless
cohort will take over. The reason for that is that they can get there faster,
with fewer resources, by cutting corners. So we need to re-engineer the
reward system, and I will leave you with a couple of slides on how we do that. We
try to incentivize productivity, and productivity is wonderful, but we need to
think about the whole electrocardiogram: we need to think of P-Q-R-S-T, which is also
quality, reproducibility, sharing, and translational impact. We need to find
opportunities to change the way that we do science in our everyday environment. I
think it needs to be a grassroots movement. We cannot just wait for some
king or queen of science to impose with his or her authority what
exactly should be done. I think it’s scientists who realize what
makes their scientific work more credible, more reproducible, more
applicable. We also need to convince other stakeholders, and they may have
other priorities. Scientists want to publish a lot, they
also want funding for their work, but there’s also the industry, there’s
private investors, public funders, not-for-profit funders, editors,
publishers, societies, universities, research institutions, non scientific
staff, hospitals, insurance companies, governments, federal authorities, people.
Some of them want to see papers others want to see funding, some want to see
things that work, and others want to make profit. It’s all fine but it needs to be
integrated to try to get the best possible science. To conclude, I think
that the presumed dominance of original discovery over replication is an anomaly;
it may happen occasionally but it’s really the exception. Original discovery
claims typically have small or even negative value, and science becomes
worthy mostly because of replication. The reproducibility and the usefulness of many
disciplines of scientific investigation have substantial room for improvement.
This doesn’t mean that science is not good; it’s the best that can happen
to humans, but we can really make it better. There are many possible interventions
that may improve the efficiency of research practices and the
reproducibility and utility of the evidence, and transparency, openness, and
sharing are likely to help, but details on how to do that can be important and we need to
find out about how these details play out in different settings. A million
thanks to all of you for listening and special thanks to a number of
collaborators that have joined forces with me over the years to generate some
of the empirical evidence that I shared with you today. Thank you.

I’m one of the biggest fans you have,
and my lab is definitely, you know, reading your papers. I think it’s
absolutely true that the incentive system is completely wrong. Science, as you showed,
is really a system, right? And it works based on the incentives that are
put into it. I was recently tenured here, and in ten years of my tenure track nobody
ever gave me any credit for reproducibility. When we present
data at conferences I think we are almost enemies of the majority. When we publish,
I mean, my lab probably hasn’t published a study where we didn’t include
an independent validation cohort in the past five years,
we usually get from reviewers, “Oh, they should have a second independent
validation cohort,” right, when, you know, 90% of everything that is published or
presented doesn’t have even the first one, right? So unless we change the
incentives, unless we use for example these H factors that factor in what
kind of methodological, you know, advances we have used, or this checklist, right,
what we did, whether our work is reproducible, irrespective of whether
it’s positive or negative, the science will continue the way it is continuing.

I fully sympathize with what you described, and I think that this is a
feeling that many scientists working in very different fields convey to me all
the time. But I think that there is progress. I think that there are fields
where you have a dominant view that this is important to do, and I think that
these fields are likely to be more successful in the long term. It’s not
going to happen overnight. I think that we need to focus on training younger
scientists, hopefully retraining some of the older
ones, about what really matters and why we are doing this and why it is
important to do good science rather than just sloppy science. I think this is
like an everyday effort and everyday struggle. It’s not one course that one
will take, because I sometimes hear the question, is there one course that I can
take? No, this is science in continuity; it’s your everyday scientific
method in its application, and there’s no single course that can
replace your experience as a scientist.

Thank you for that talk, and I
agree it’s going to take a cultural change to move forward. Considering we’re at
NIH and there might be a lot of program staff in the audience, do you have any
specific suggestions on the funding side, as the major funder of biomedical
research?

So I think all of the suggestions that I made clearly apply
to NIH, and NIH has tremendous power to facilitate some of these principles, and
I think that NIH has moved in that direction on many fronts. Probably not
all the movements have been equally evidence-based, but I think that there are
people who are receptive to that message and they do want to change the way
that things are done. Clearly if you have NIH giving priority to these incentive
structures, giving priority to openness and sharing and focusing on good work as
opposed to just quick and quote-unquote successful, it can have a tremendous impact.

Yes, there was a high-profile case of
Theranos, where statisticians called them out and said
that this was not going to work, Theranos being the biotech company, and
this was done with incomplete knowledge and no statistical
data from the company. So the question is, what does the gut instinct of a
reviewer tell you with respect to how something is going to turn out? And it would be
interesting if you took the gut instinct of a reviewer saying, I
think this is going to turn out to be profound,
versus, this has turned out to be fake, regardless of the underlying statistical
value. Could you comment on that? Because I think that would be really
important in terms of biotech IPOs.

So transparency applies just as well in
the case of biotech and startups and unicorns. Actually I was the first to
write a critical paper on Theranos, in JAMA, about a year before John Carreyrou
started publishing his Wall Street Journal investigations. I sent that
to JAMA in 2014. It went through not only review but also legal review, because I
was challenging the highest valuation startup in the country, which was visited
by vice presidents, and all the influential people were on the
advisory board, and everybody was so excited, and I was saying they have no
evidence to support what they claim, I don’t see any paper that they have
published, I think their valuation is nine billion but it could be just nine
dollars. The paper was published eventually in JAMA. I got a lot of
pushback at that time. I heard from their general counsel asking me to recant, saying
that they were getting FDA approval. When they did get their first and single and
only FDA approval, all of that way before the Wall Street Journal, I was
told again to recant and write an editorial with their CEO that I was
wrong, and I remember the Washington Post writing a story like, the insanely
influential Stanford professor who’s asking too much of Theranos, and, you
know, what a shame, you don’t understand basic things, this is how innovation is
happening. People now think that Theranos was an exception, you know, it was
fraud and everybody’s fine otherwise. I don’t believe that. A couple of months
ago we published a paper where we looked at every single unicorn in the health
care space. The majority of them looked pretty much like Theranos in
terms of their transparency and availability of published, peer-reviewed
science. So I’m not saying it’s fraud, but unless we improve transparency and
science for these entities I think that we are running the risk of
having deja vu, Theranos times two, times three, times four, in the near future.

Thank you. Hi, thank you very much for that talk. I’m Ian Hutchins, a data
scientist here at the NIH. I spend a lot of time developing and using research
assessment metrics for portfolio analysis, and one of the things that I’ve
observed is that there seem to be a lot of cultural and policy barriers to
reproducibility efforts. For example, journals will often have in their
editorial policies that they won’t publish replication studies; I think even
PLOS One until very recently had that policy in there. And I haven’t looked
specifically, but one imagines that applications for funding that focus
specifically on replicating a previous study don’t fare particularly well in
the general peer-review system. So how do you think that that logjam can
best be broken up?

So I think we need more training and we need more people to
understand what is at stake here. I believe that many journals have changed
their stance over time and many of them are very sympathetic. Many fields that
have adopted replication massively, like genetics, would not allow you to publish
something unless it’s extensively replicated, so it’s become a sine qua non,
it’s the most integral part of the whole world. There’s many other journals that
are probably lagging behind, and we need training. In a recent study where
editors-in-chief of respectable journals were asked to perform an investigation,
about a third of them could not even tell that a study was a randomized trial.
So the methodological knowledge is often lacking, and we need to just keep pushing that
message and educating and training people to be more knowledgeable about
how things are working in the way that we do science. I think that the best
journals will be the ones that publish the best science eventually, and I think
that these will survive. I’m thinking in an evolutionary mode in that regard. I
don’t want to think that the best journals are going to disappear and
we’ll just get more and more noise.

Thank you.
So we’ve been told we need to exit the auditorium now for another group coming
in. However, we have people in line to ask questions, so please come join us in the NIH
library and ask your questions there. Thank you.
