Resampling Methods: Testing Robustness and Reliability (but Really Replicability)

– So what I wanna talk about here, really, is just some work. When Moe asked me to do this, I kind of felt like Steve a little bit, in that this isn't really my area; it isn't anything I've published on. Obviously, I have an interest
in it as a scientist. And so I went back and forth with Moe on one of the ideas that I had here, really, and this is
the talk that I’m gonna give on this particular one. And what I wanna talk about specifically is an idea that might have merit in terms of thinking about
if we ask journal authors to contribute this information. Would it help us get a
sense about how replicable their results are? And that’s basically
the bottom line on it. But let me start here. So I wanted to give a
little bit of context to why I got interested in this idea. Talk a little bit of background. I’m gonna introduce just some
core ideas with the bootstrap. I apologize that for many of
you this will be a review, but I also know we have
some PhD students in here for whom this might be a new idea, and if I don't lay the groundwork for that, they'll have no idea
what I’m talking about as we go through with that particular one. And then, I’m gonna give a
couple of examples from data. One from my own data, another
from a published empirical example. And then in the end what I hope is just to get some good
discussion going about this particular issue as we go through. So again, as with the other ones, feel free to interrupt
if you have a comment, if I’m not making sense,
please just let me know as we go through this. So in terms of the context,
one of the things that we do. Excuse me. You’ll be able to edit that out, right? (laughing)
So in terms of context, one of the things that we do at the
University of South Carolina is we work with large
corporations that give us data. It's kinda cool. Many of you do this in business schools. But I have Master's students that'll take these big data sets, and what we'll do is a whole long Capstone
analytics project. So just in terms of thinking
about the experience of these Master's students, they
are just learning regression. So we can’t give ’em things
like machine learning and stuff like that, but basically, we turn the Master's students into machine-learning algorithms, right? We have the Master's students take the large data set,
and they become XGBoost, they become random forest,
and we tell 'em, go test for a gazillion interactions, and look for main effects, and so we P-hack the hell
out of this data, all right? But before we P-hack the
hell out of the data, what we do is we take 40% of it, and that, we tell the firms
when we work with them, we’re gonna take 40% of your data, P-hack the hell out of it, see what we can find, see
if anything comes in there, and then we’re gonna
take 60% of that sample, and we’re gonna use that to potentially replicate the results, for all the reasons that we heard earlier
this morning from Joe, about just capitalizing on random chance as we go through to find
some result on here. So here’s a case, it
doesn’t really matter. This is a generalized
linear mixed-effects model, so the outcome in this
case was whether or not an individual did or did not buy a high-deductible health plan. So we’re working with a company in the healthcare industry, obviously. Huge data set, right? I mean 260,000 responses on here. And so this is the summary
output from the model. And I wanna just focus your attention on this particular interaction here. All you need to know is this is just a simple two-way interaction
with a z score of 2.32 on this particular one. But your intuition,
or certainly my intuition, would be that if you estimate a model
and you find an effect with 104,000 individuals, that thing is probably
gonna replicate, right? This would be what I
would generally think. So we take the 160,000 that we
have in the hold out sample, and we run it. And what we find instead is that it doesn’t replicate at all. It doesn’t even come close
to that particular one as we go through. Now, the reason why we do this is because these firms may actually,
much to our surprise, do something with these results. We’ve actually had a couple instances where we give ’em feedback on it, and they change some practice
within the organization. So as a scientist, this is
maybe a little different than the publication thing, but this, to me, has some
very real implications. I'm gonna go back, and they may change their strategy on what they're gonna do based on the information we give them. I don't want them to go bet the farm on an interaction that
I’m showing on this, if I don’t have evidence
that it can replicate on this particular one. So what I want, I want to
start with this as a context, because I wanted to put
this in here to think, to get us to think about, well, how should we think
about whether results do, or do not replicate? Are we holding one standard if
we’re working with industry, do we have another standard
if we’re publishing? Is there some way to
integrate these two results as we go through on here? So I want to, in this particular one, talk a little bit, just as background, about the non-parametric bootstrap, show how a modification to the common non-parametric bootstrap gets us to think a little bit differently
about this idea of do results replicate, and then to show that while
the non-parametric bootstrap is very intuitive, and I can walk you through the logic of it, that in fact, you don’t even need to do the bootstrap, and you can get exactly the same results. But I wanna start there,
so you can kind of follow the logic that we’ll go through on this particular one. And the discussion that I wanna have is, should we begin to think
about incorporating this type of information, which I’m gonna show
you, into publications? And I'll give several
examples as we go through this. So here’s where I’m
gonna go into some of the basic background of the
non-parametric bootstrap. So for many of you, this
will be a review on here. But you often use a
non-parametric bootstrap to go back to your original data and sample with replacement
from the original data to get confidence intervals
for a parameter estimate that has no natural confidence interval formula. This can be one case. So if you think about the median, the median doesn't have
a simple closed-form confidence interval. So if you wanna get
the confidence interval of the median, you can bootstrap your data over and over, and you can get
a confidence interval for it. We also use this a lot with mediation, because the indirect effect
has a non-normal distribution, so if we have something with
a non-normal distribution, we get a bit leery about
developing confidence intervals using 1.96 times the standard error, so the bootstrap works out
great on that particular case. Non-parametric bootstrap,
this is a core piece of every statistics package. I’m gonna be doing,
giving you examples in R, but you can run this in S-Plus, you can run this in Stata, SAS,
it's nothing specific to any program, it's just a very well-established method for
drawing inferences in data. Give you just the most basic idea again. So if this is a new idea to
you in terms of bootstrap, I can go back to a data set. So this is just using
my multilevel library that I maintain for R. I go back in, I get one
of my data sets in here, and I draw 15 observations. So I know this pointer doesn't work, but I point up at that row there, those are the numbers. Those are 15 numbers where individuals reported their work hours, how many hours they worked during the day. And if I sample with replacement, this is the key, so it's just the sample command on that data set,
the same 15 observations, but with replace equals true, what happens is I get
another 15 observations, but it’s not exactly the
same as the first one. In this case, I ended up with one less 12, and one extra eight on here,
because of the replacement. What that lets me do is go back in and calculate a mean,
a mean, a mean, a mean, it’s slightly different each time, so 10.3, or 10.5, 10.6, 10.7. So now, by sampling with replacement, I go back to the same data
over, and over, and over, and I can do that. And then you can have computers do what computers do very well,
which is to do some dumb, repetitive task 10,000 times, and you can get a distribution off of it. So I can go back in and I could say hey, if I go back to my 15 observations, I can get a distribution
of what the mean is in this particular case, and unbelievably, if you did it 10,000
times, I must have actually had one draw that got all
eights out of that sample, or something down here, which
is pretty amazing on here. But you can see this is the distribution. And if I wanted to apply
some statistics to it, I can say, well, the top, the top 2 1/2%, the bottom 2 1/2%, those
are my confidence intervals. So I can be 95% confident that everything will fall between those values. And if it’s outside that 2 1/2 either way, then it’s what we think about
as an uncommon observation, given the data. So that’s the basic of it. Here’s how the bootstrap
is typically used. So the bootstrap would often be used more for a regression, for instance. I’ll give you an example
in a more complicated case. I go back to our data. All the code, by the
way, is in my slides here. It’s either here, or in the notes, if you wanna follow this just to see, ’cause you’re curious about it. So I go back, set a random seed so that you can exactly
replicate those results. I get 500 observations
just drawn at random, so we get this. And the typical, I run a
linear model, so regression. I regress well-being on work
hours, leadership and cohesion. These are the parameter estimates. And this is almost always what the bootstrap is focused in on. The parameter estimates right here. So I can go back with this one. Like I said, the code is below this. I get some, using the bootstrap library, I get these nice graphs, here, that show me, here is the distribution for the bootstrap draws, for work hours, here it is for leadership,
here it is for cohesion. I look at the q-q plots. They seem to be pretty
nice, they’re lining up, there’s nothing too wacky about it. Not seeing observations over there. And I get these confidence
intervals right there. So again, this as I show here, is really not all that interesting. It’s not all that exciting
because you can just go back to the raw data,
or to the results up here, and again, just using basic stats, I can take the parameter estimate, negative 0.04, and I can
multiply the standard error times 1.96, and I get this value. I can minus it, I get this value, and what you’ll see is that
when I bootstrap the values, or I just use 1.96 times
the standard error, I get basically the same number. All right, so this is just
background showing you how I could use the bootstrap, showing you it’s not
particularly interesting to do it this way. It would be fun to program it. You can see the results. But I could’ve gotten to the same point just by taking 1.96
times the standard errors on the same thing. But here's where it gets interesting. So if I go back to the same problem, and to actually build on a point Joe made earlier this morning,
you sometimes read about showing these confidence intervals, 95% confidence intervals on here, as a way to solve the problem that we have in terms of, are results replicable? But really, exactly as was said
earlier this morning, I look at these, I see
there’s no zero in any, I turn it into a
dichotomous test in my head, and I say basically,
they’re all significant. So seeing the confidence intervals really does nothing for me in
terms of giving me the sense that these would or these would not be replicated in another sample, or even in my own sample on this. I've heard many people talk about this. I know Geoff Cumming,
I guess, really emphasizes this. I never understood that logic. I've read that paper, and
I'd say that it doesn't. It does give you a sense, right, that there's some variability in here, but the way that we interpret
it, it doesn’t do much in terms of whether
something’s replicable on here. So here is where things
get more interesting. And that is, let’s say,
and one never does this in bootstrap, which is why,
maybe, it’s interesting, but let’s say, instead of going here and focusing on the estimate itself, which is where we
typically do the bootstrap, let me just modify the
bootstrap to go back and say, hey, every time we run a bootstrap draw, instead of saying what was
the parameter estimate, just tell me whether it
was significant or not. Okay, so I’m gonna go
through 10,000 iterations in my data, now I don’t actually care what the mean and the variability
of the parameter estimate is, I’m just starting a counter. One if it was significant,
zero if it wasn’t significant on that particular one. So at the end what I can do is I can say, well, how many times did I get a one? So if I take my same data right here, my results, this is 500 individuals, everything’s significant in here, and if I now modify the
bootstrap to go back in, do the same iterations
of the data and count it, the question is, how often
should I see significance? Because this gets to the
idea of can I replicate my results in my own data? Not requiring somebody
else to take my data, I’m not waiting for the
next studies that show that my data’s all messed
up and I made some error, I’m just actually using my own
data to get a sense for it. So I’ll throw it out there,
but I’m just kinda curious. If you see these results right here, what’s your sense about these? I mean, give me a number. Do you think I’m gonna see
90% replicate these results? And everything’s significant. I’ve got 500 observations in here, let me just throw it out for discussion. I’m kinda curious. – A lot.
– A lot, right? (laughing) Well–
– I’m not gonna give you a number.
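While the audience guesses, the mechanics described so far can be pinned down. The following is a hedged Python sketch of the basic non-parametric bootstrap from the earlier work-hours example: resample the same observations with replacement, recompute the statistic each time, and take the middle 95% of the draws as the confidence interval. The 15 values and the seed are invented for illustration; the talk's actual example uses R and the speaker's multilevel package.

```python
# Percentile bootstrap for the mean of a small sample -- a hedged sketch
# of the R idea described in the talk; the 15 "work hours" values below
# are invented, not the speaker's actual data.
import random
import statistics

random.seed(42)

hours = [8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 14]  # hypothetical

boot_means = []
for _ in range(10_000):
    # sampling WITH replacement is the key step (R's sample(x, replace = TRUE))
    resample = random.choices(hours, k=len(hours))
    boot_means.append(statistics.mean(resample))

boot_means.sort()
lo = boot_means[int(0.025 * len(boot_means))]   # cut 2.5% off the bottom
hi = boot_means[int(0.975 * len(boot_means))]   # cut 2.5% off the top
print(f"mean = {statistics.mean(hours):.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```

The same loop bootstraps a regression: resample rows, refit the model, and collect each coefficient, and the percentile interval then lands very close to the estimate plus or minus 1.96 standard errors, which is the speaker's earlier point about this use being uninteresting.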
– All right go on. Anyone brave enough to
throw out one number? Well, let’s be more specific, okay? Let’s say, let’s take work hours, right? You can see up there, that’s
this first one, there. Is that, just a guess,
is that gonna replicate 95% of the time, 90% of the time? What would you think?
– I would look at the P-value and the effect size. I think if the P-value is less than 0.05 with a very large effect size, it's likely to replicate,
and if it’s a P-value, (speaking off microphone) – So keep that idea in mind, but I think you’ll find that it’s, that’s where intuitively,
this gets interesting, okay? ‘Cause I don’t think it’ll
come out that way, right? On that one. So that. – 500.
500. So it’s not massive, but
it’s pretty decent, right? I mean it’s not like 30 or something. But ironically, as I go through this, you’ll realize it wouldn’t matter if I had 50,000 or 50, the same effect
would actually come out as it comes out on this. So let’s just go with I
won’t pin anyone down on, we’ll just go with a lot, right? But this would be my intuition, too. And it truly is my intuition. I think, I look at this I say, I’ve got 500 individuals in here, I mean that’s a good, solid sample size. I’ve got three variables that
are statistically significant, easily meet the criteria, right? A t-value of negative 3.03, t-value of 9.4, a
t-value of 2.15 in there. These weren’t right at 1.96, they weren’t teetering
on the edge right here, so I think they come out.
– Is the idea, I know you’re setting us up. – I would never set you up. – [Student] Is the idea
here that the coefficients are correlated upon resampling? And you’re revealing that in the– – Well, I thought, so initially, so Fred, what he asked,
since this is being recorded, Fred asked, is the idea here that we have a lot of multicollinearity
among the variables? And so when I draw from the resampling, that may be what I’m doing is in fact, I’m basically getting
different combinations of the correlations among
the variables in here, which is entirely plausible. There may be some part of that. But I don’t think it’s
actually the key thing that happens on here. So let’s see what
happens with the results. What happens in fact, is,
if I set up that algorithm I see that work hours comes
in here about 86% of the time. Not too bad, actually. Although I have no idea
what a good result is. Well, I do know that
this is a good result. Leadership made it 100% of the time, okay? So basically, using my own data, I could replicate that leadership was related to well-being 100% of the time, in, I think I ran 10,000 of these. I might have run 1,000. It doesn’t matter on here. But look what happens
with cohesion over here. So cohesion, significant. T-value of 2.15 on there. And I look at this one. 56% of the time is how often
that becomes significant on that particular one. So the question that I want
to get you to think about with this particular one is, imagine that I had
written up these results. But imagine that I put
a statement in here, in my paper like this, while each variable was
statistically significant with a P-value of less than 0.05, readers should consider that even if the parameter estimates
are the true values, which, by the way, is a
big assumption, right? We’ve kinda talked about the fact we can’t even make that assumption, an identical study with an
equal sample size of 500 would expect to find
work hours significant 86% of the time, cohesion
significant only 56% of the time. Based on the findings however, we would expect any replication
with similar sample sizes to find leadership
significant 100% of the time. Sorry, I misspoke there. It should say cohesion 56%, leadership 100%, right? So if you were reading
this, so go ahead, Fred. – Oh I just, I can’t help myself. So yeah, no, I think that’s right. So your point is, the bootstrap is looking at the marginals of these
coefficients, right? And the index, now,
is looking at the joint profile of coefficients. So across bootstraps, you kinda have this blob of coefficients in P space, however many predictors you have. The bootstrap is just
looking at significance of each one on the–
– Of each one, that’s right. – This one is looking at the joint, the profile of the stars–
– That’s right, yeah. – Yeah, for significance.
– Yeah. – Yeah.
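The modification under discussion can be sketched as follows: instead of storing each bootstrap draw's estimate, the loop stores a one or a zero for significant versus not, and reports the percentage of ones. This is a hedged illustration, simplified to a single predictor (a correlation and its t-test), since the speaker notes a similar effect shows up with just one variable; the data and effect size are simulated, not his.

```python
# Modified bootstrap: count how often the effect comes out significant
# at p < .05, rather than recording the estimate itself. A hedged sketch
# with simulated data; the talk's actual example refits a three-predictor
# regression on the speaker's own data set.
import math
import random

random.seed(7)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.2 * xi + random.gauss(0, 1) for xi in x]  # assumed true effect

def is_significant(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sx = math.sqrt(sum((v - mx) ** 2 for v in xs))
    sy = math.sqrt(sum((v - my) ** 2 for v in ys))
    r = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (sx * sy)
    t = r * math.sqrt((len(xs) - 2) / (1 - r * r))   # t-test for a correlation
    return abs(t) > 1.96                             # ~ p < .05 at this n

hits = 0
draws = 2_000
for _ in range(draws):
    idx = [random.randrange(n) for _ in range(n)]          # resample rows
    if is_significant([x[i] for i in idx], [y[i] for i in idx]):
        hits += 1                                          # the counter

print(f"significant in {100 * hits / draws:.0f}% of bootstrap draws")
```

The counter is the entire change from the standard bootstrap: the usual version would append the estimate to a list, this one appends a one or a zero.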
– So there is some aspect of this to it, right? But I think it would happen,
what I could show you is that even if I had one variable in here, I would get a similar effect, right? So it doesn’t actually require that it, just to clarify this, so he’s, Fred is raising the point,
everything is complicated, right? Who was it that I gave all the arrows? Well, it was actually Dr. Bettis who was on the thing, right? All the arrows going everywhere. We have that in, but I could easily show a similar effect here, with
simply one variable in here. But we’ll have to do that after, ’cause I didn’t put it in my talk, here, but we can bring out R,
and I can show you on here. So let’s go to, or, maybe what we could do in our tables, this
would be another thing, would be to go back in here, and say, here’s my table, and I don’t know what
we’d call this thing. Maybe there is a name for this thing. This is what I’m kinda
curious about, okay? But you could put in here, with your own data, exact replication how often would you see the effect, and would present something like that. – Isn’t it just a more sophisticated way of getting the same
information you already see? I mean the one value for leadership has the strongest magnitude of estimate, the strongest t-value, and then it's just a fancier way of showing
the same thing you get from the t-value. – It is, which we’re gonna get to, okay? But it’s also a way of
translating the t-value, right? In other words, it’s basically, yes, it’s the t-value, but when I tell you how often would a t-value of three, point, let’s say three, replicate, right? This kind of gives you an idea. You’ll come back to this,
'cause I'll tie it together. But you're onto something with that one. – [Student] But wouldn't
this be the standard error that either the bootstrap or
jackknife in Stata calculates? So when we have sample
sizes, small sample sizes in particular in strategy,
I think that’s the norm to report either of those rather
than the 1.96 calculation. So isn’t this, you were
looking for a name, so I think we just call it
standard error in parenthesis, either bootstrap or jackknife, depending on–
– But I know the, it might be what you referred to earlier. So Maca earlier indicated that they will, with a relatively small
sample in strategy, they might jackknife it. So what would that be, if
you drop it each time? If that was substituted to give the percentage of time, so let’s say you have 100 observations, you
jackknifed it 100 times, and then you went back in and said, 95% of the time I got it. It would be equivalent, although I don’t think it would be exact,
’cause the bootstrap and the jackknife, Moe?
– Yeah, it’s not about standard error.
– It’s not about standard error, right. (student speaking off microphone) (laughing) – [Moe] No, no, no, it’s
not about standard error. Actually, Paul and I, we actually discussed this extensively. So I think Paul is gonna
reveal what it is later. – I’ll reveal, I’ll go through it. So let me keep going on here, all right? But this is an idea,
just to go through that. So let's take another example. And on this example, I'm not beating anyone up. I didn't put the authors' names on this. I don't want to, because they didn't do
anything wrong, right? This is just very typical, this is an article that was published. My main criterion for
selecting this article was that I was gonna have to enter this correlation matrix in by hand, okay? So I started just thumbing
through the journal, and I just, I didn’t want a whole lot of correlations on there. And then they had to be
presenting a main effect, right? Because you can’t use
this correlation table to create a data set with an interaction, you’ll just get garbage on there. So I needed to have a hypothesis
that was a main effect on this particular one. So in this one, again, as
we alluded to this morning, what I'm gonna focus on,
I'm just gonna focus on the standardized estimate
right here, 0.25 on there. There are 83 organizations
in this particular data set, and they had a main effect hypothesis about collective organizational engagement being related to firm performance. It was in the third step of their equation as they went through on
this particular one. So as long as somebody puts in the correlation matrix, and the
means and standard deviations, it’s quite trivial to go
back to that raw data, and replicate a data set that
has every attribute on it. I was kinda talking earlier about how it's sometimes nice to have
three decimal places, because to make this one work exactly, I had to go back and give the authors the benefit of the doubt on some of it. I had to move a couple
things over a little bit, to where their correlation table was true but rounded to a slightly
different number on there. But you can see that I can go back in and I can create another
data set with 83 observations that matches theirs perfectly, and gives me identical results. This is, interestingly enough, it's identical here, it's not identical in the sign for some of theirs, so I don't know if this was just a transcription error on their part, so I get a negative 0.05 up here, and this one's off a little bit, but this one really matches
up perfectly on that. So, okay. But anyhow, it’s fairly easy to do that. I can again take my 83
observations in the bootstrap, and I can just say, all right, let me go back through and count how often they got the result that they reported, that they showed in this particular table. (student speaking off microphone) I'm gonna hold onto one of these. – [Student] A technical
question: how do you choose the probability distribution of each one of the variables? The only thing that you
have are correlations, and then minimal, maximum, or
mean and standard deviation, but you could have a
logarithmic, or a normal, or it could be anything–
– Well– – [Student] There’s 10
million degrees of freedom. – There could be things
off, it’s not perfect, okay? But the mean and standard deviation would give you a pretty
good idea, hopefully about the distribution.
– Assuming normality. – You have to assume normality. – Normal and then you test–
– Correct, yeah. – But there’s no assurance.
– There’s no assurance. – [Student] You get the same
coefficients and P-values. – That’s right. But you’d be surprised
at how often this works. Okay, just to let you know, all right? Because most of the time, our data’s reasonably normally distributed. Okay, so if it’s way off, this might throw off something on there. But honestly, if it’s way off, the author should have told you that
it was way off, right? And they should have
said, this is so far off I shouldn’t be using regression, right, is the way that I would look at it on that particular one. So if I take these 83 observations that I’ve now made from their own data, and go back through, run the bootstrap, and again I find that they replicate it 53% of the time, on here. So they’re showing the finding right here, well, either one of those would be fine, but in fact, 53% of the time it comes up. It’s also interesting if you look at it you’ll see, zero effects are
never really zero, right? We don’t have, we get something
on these other effects. You can see in fact, 12% of the time that variable was significant 13, 14%. So those ones that were
not significant at all were still coming up
significant like this. This again, goes back
to the P-hacking idea. Like if you work away at this enough, you’ll draw some samples in there, where just about anything
will be significant. – [Student] You’re giving us a mystery. So I expect you to encourage a lot of– – Box.
– Okay. Ooh! – [Student] Really
trying to shut you down. – [Student] I know, I know. All right, well, thanks for indulging me. Now I’m thinking, well, there’s
the bootstrap distribution where you know the overall value, versus testing every individual
effect against the null. That’s why you’re having this lack of replicability in each instance, right? I know the overall
effect in the bootstrap, I don’t know the overall effect for every individual test against zero. Right, so if I have variability, that’s down on the low end.
– It could be. – I didn’t solve your mystery.
– Bear with me, bear with me. All right, and tell me whether, all right? – That’s my last shot.
– No, no. – All right.
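The rebuilding step behind this second example can be sketched as follows, assuming normality as just discussed. With published means, standard deviations, and a correlation, you can construct a sample whose observed statistics match the table exactly; this is the same idea as R's MASS::mvrnorm with empirical = TRUE. The two variables and their target values here are invented for illustration, not the paper's actual numbers.

```python
# Rebuild a data set from published summary statistics -- a hedged sketch
# for two variables with invented targets. The constructed sample's mean,
# SD, and correlation match the targets exactly, not just in expectation.
import math
import random
import statistics

random.seed(3)
n = 83                       # the example paper's sample size
target_r = 0.25              # hypothetical published correlation
mean_x, sd_x = 3.5, 0.6      # hypothetical published means and SDs
mean_y, sd_y = 4.1, 0.8

def standardize(v):
    m, s = statistics.mean(v), statistics.pstdev(v)
    return [(vi - m) / s for vi in v]

x = standardize([random.gauss(0, 1) for _ in range(n)])
e = [random.gauss(0, 1) for _ in range(n)]
# remove the part of e that happens to correlate with x, then re-standardize,
# so the constructed correlation is exact rather than exact only on average
b = sum(xi * ei for xi, ei in zip(x, e)) / sum(xi * xi for xi in x)
e = standardize([ei - b * xi for xi, ei in zip(x, e)])

y = [target_r * xi + math.sqrt(1 - target_r ** 2) * ei for xi, ei in zip(x, e)]

# rescale to the published means and SDs (correlation is unaffected)
x = [mean_x + sd_x * xi for xi in x]
y = [mean_y + sd_y * yi for yi in y]
```

The counting bootstrap can then be run on this reconstructed (x, y) exactly as if it were raw data.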
– All right. But you’re following
what I’m doing, right? I’m just going back to
either simulated data that matches your publication, or to raw data that you have, and I’m just saying, let’s
put an index on there to say how often would
this result be replicated as we went through on here? So the real question, so basically, let’s think about what the
article would look like then, so this is what was said: “As predicted, collective
organizational engagement significantly and positively
affected firm performance. In fact, it was the only
variable among our predictors, moderator and control variable that had a significant direct relationship.” So that was very reassuring, right? Like this is written as
we write in academics. This is the thing that came in there. And I’m really not beating up on authors, ’cause you know my criteria
for having selected this. But imagine what, if the
next paragraph it said, note however, that even
under the best case scenario, where the estimate of
our effect is accurate, an identical replication
will find a significant one-tailed result only
about 54% of the time. Right, would that change how
we think about these results when we’re reading ’em? As scientist, as we go through? It would, right? On the particular one. So the real question is,
and this is the question that I had to ask myself
on this particular one, am I doing anything new, right? And the answer is, I’m
not, not really. I've given a different
flavor to something we know, very well, within our field already. I’ve applied the bootstrap to a problem, but the solution is in
fact, nothing new at all. This is something that’s
been talked about, I’m just reframing the way we talk about this particular question in here. So the answer is, all I’m
showing you right here with the non-parametric bootstrap
applied to these data is, I’m showing you power. And so this goes back to
what Gilad was saying, can’t I get this same information by just looking at the t-value
of this particular one? So what is power? Power is “the statistical power of a significance test is the
long-term probability, given the effect size,
alpha, and the N, of
replicated it 10,000 times on that particular one
as we go through here. So this idea of using non-parametric
bootstrap is intuitive. You get the idea on here, but I can show you that it’s very easy, and this is what I’m gonna
kind of conclude with, to just take the t-values, or z-values, we're not gonna show z-values, but in many cases our sample
sizes are large enough that we can just substitute
t-value, z-value, it doesn’t really make
that much difference on this particular one. So again, because I find
it a little non-intuitive, I always have to think through power. Power is just one of those
things that I struggle with. I get it. I understand, I’ve done a
lot of randomized trials for the Army, I know that we
do careful power analysis. I know how to do ’em, but
I always have to think, what does it mean, exactly, when I’m just thinking
about, how do I explain it? So here’s one way that
you can think about power, hopefully that may make sense to you. Let’s say I just have two
random numbers, x and y. I’ve got 100 pairs of them on here. So I can go back to a t-table, table of the critical values, and I know if I have 100 numbers, x and y, that if I have a correlation
that’s greater than 0.197, then that will be statistically
significant, right? So that's just, we have that. It wouldn't often occur by chance to have a correlation that strong in there. So I know that value. A correlation of 0.197 is associated with a z or t of 1.96, right? And that means, basically
I’ve written my paper, I’ve got these two variables
that are correlated, I get my correlation of 0.197. I can write, I have a
significant effect in there. So then if Fred comes, and if Fred wants to replicate my findings. So Fred basically goes back in there and let's say my 0.197 correlation is right on the money,
it’s the perfect estimate. There’s no bias in it,
I didn’t do anything that would make it suspicious. I just, by some miracle, happened to get the perfect effect size for this phenomena on this particular one. So Fred goes back in, gets the
exact same sample size, 100, and he replicates it. Well, if he, if that’s the true value, and Fred replicates it with that, and he does it, let’s say Fred, he’s very ambitious by the way, he’s like an editor for three journals and
I don’t know what else. Okay so he would think
nothing about replicating my own study 1,000 times if he needed to to get to the truth. So he goes back and
replicates it 1,000 times. Just by chance, if the true value is 0.197, half the time he's
gonna get a lower value, and half the time he’s
gonna get a higher value. So that means my power is 0.50. My ability to detect the effect is anytime the correlation is higher than or equal to 0.197. So that's why you can basically go back to any z or t-value that's 1.96, and you can immediately say the chances of that being replicated are 50%. Okay, flip a coin, all right? So when you see a z or a t score in any publication out there,
you can immediately know that there is only a 50% chance, and that’s under the best case scenario, where you have no reason
to believe the estimate was in any way upwardly biased, or anything, as you go through that particular one. So this is just a simulation
to show that effect, here. To demonstrate it, I’m
not gonna go through this, other than to say if you’re curious: R is an open-source language, so if you’re an editor, or you’re a reviewer on an article, and you wanna get a sense of how reproducible an effect would be, you can literally just
go take a probability of a normal distribution, 1.96, if you’re doing basically an alpha of 0.05, minus 1.96, so that would be zero. So that would tell you exactly
what I described to you. You’d have a power of 50% in there. Half would be less than,
half would be greater. That was the scenario I gave there. You can substitute any of
these values right in here. So you could just use
this little formula in R. One minus pnorm, 1.96, minus
the t score or the z-score, and you would estimate it. I can go back with the first
example that I gave you, I did the whole bootstrap,
I let the whole thing run, I got these values: 86%, 100%, 56%. I used the simple formula here, I go back and I grab the
t-value off of here: 86 and 86, 100 and 100, 57 and 56, close enough in that particular one to do it. And it’s that simple. We literally can go back to
any publication in there, and we can just use this
power and this estimate, well, how often will
this thing be reproduced? We went back through here. Same thing, this is
off a little bit, here, but I can go back in. They don’t actually give the t-value, but I can look here and I can see, I’ve got an estimate here, with the beta, standard error, that lets me estimate that the t-value was 1.857. I added some extra decimal
places, obviously, in there. And so I can use the same thing, and off this, when I get a 58% using that, and the way I did it before was a 53, this is a case where the t and the z start to get a little divergent. So with a sample of 83, your t and your z aren’t gonna line up quite as nicely as when I had 500, where the t and the z matched
up with that particular one. So the big questions, and really, and this is what I wanted to talk about, the big questions are,
could we, do you think, as a field, we could use these ideas about the ability to replicate
and the power to inform us in meaningful ways about the robustness and reliability of results? When I have a big sample working with the firm, and I do a holdout sample,
it’s no different, basically. It’s the same data. In that particular case,
I did the sampling before. What did I do with the? – [Student] I have a
question about the comment on this being equivalent
with the holdout sample. When you do a holdout sample,
you’re reviewing the model based on just part of the data. Here, you could view that
model based on the entire data. And then, let’s say if I’m
building a decision tree that is over fitting,
I can get 100% accuracy on every bootstrap sample you draw, which won’t happen when you
have a holdout sample, right? – Yeah, so I think it’s
an important point. I think that you, I mean, to clarify, there may be, it’s not exactly the same in terms of having the holdout sample. But in a weird way, I’m not
sure how different it is, right? Because I start at the very beginning by doing a random number generator to create the holdout in the other, so it’s this one iteration of a bootstrap, in kind of an equivalent way. The holdout typically has more power, because you hold out 60% versus the 40, so that you actually increase it. – [Student] But that is actually related to the second question I’m asking, because my concern on
this being equivalent with a holdout sample is that when you run a machine-learning-like
algorithm over the data, it’s very easy to go into overfitting.
– Which won’t be solved by bootstraps and–
– It won’t, that’s right. – [Student] Now, that
brings me to the question of what kind of statistical
test can be supported by this bootstrapping-based approach? Because I have another example, here. If you are trying to do, let’s say, just linear regression, and
I happen to have, let’s say, the dependent variable were b and c, and I have two independent
variables, a and b, let’s say those almost
perfect predictor or c. Okay, a works 95% time, b works 95% time. Now, if I use bootstrap sample
to run linear regression, half of the time I get a,
half of the time I get b. So I will mistakenly
believe that neither a nor b is replicable, but
indeed, both of them are. – Right.
– How do you deal with that? – Well, it’s a good question, right? ‘Cause this gets back to
Fred’s question about, would this work if you have, you’re kinda looking at
it as pairs of variables in that particular one. But I would say, it’s not
a statistical question, I guess, in my mind. It was like, if you’re
communicating the results, how do you communicate that you have two variables that have 90% overlap on it? And so if you present one
result in a journal article, then you’re, by definition, not presenting the full picture, either, right? So it’s kinda, I don’t see
it as a stats question, so much as how should we
require that data be presented, and that people go through
on that particular one? It’s a great question. By the way, I’m not advocating
that people do the bootstrap. Okay? I’m using it really, because conceptually, it helps to see, how did we end up there? Like, how did I end up
at this particular one? But I wouldn’t advocate to people, everyone have to go out and
bootstrap their data repeatedly, ’cause actually you don’t need to. You can just do it off the thing. So, other questions? – [Student] Yeah, so thinking about this, no, I wanna kinda flip it
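His point that you don’t need to bootstrap at all, that you can “just do it off the thing,” is a few lines of code. Here’s a minimal sketch in Python mirroring the R one-liner from the talk, `1 - pnorm(1.96 - t)`; the function name `replication_prob` is mine, not from the talk:

```python
from statistics import NormalDist

def replication_prob(z, crit=1.96):
    """Best-case chance an exact replication (same n, true effect equal
    to the observed estimate) comes out significant: P(Z* >= crit),
    where Z* ~ Normal(z, 1). Same as R's 1 - pnorm(crit - z)."""
    return 1 - NormalDist().cdf(crit - z)

# A published z or t of exactly 1.96 replicates only half the time:
print(replication_prob(1.96))  # 0.5
```

Plugging in a larger statistic, say `replication_prob(2.8)`, gives roughly 0.80, which connects to the Cohen 80% bar discussed later in the talk.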
on its head a little bit and say that given that our r squared are not typically massive, right? We typically deal with maybe 0.03, 0.04, if we’re lucky. And as Joe mentioned,
earlier this morning, we’re frequently underpowered. So could the replicability
crisis really just be an artifact of we deal with
small effects and low power? And not that we’re not actually– – That is a huge part of this. I firmly believe that, and in fact, I think if you read, if you get into this, you will find that’s a
big issue right there. Low power kills a lot
of these things on here. If you have no other takeaway
from what I did here, that would be one there. – [Student] Just to add to that. So the reason why lower
power is indicative of the reproducibility
crisis, it’s not because the findings are false
just because they’re low in power; a low-powered finding can still be true. It’s that, in order to continue to get lots of low power studies
to be significant, you had to have P-hacked them, because otherwise, they can’t all be significant at the same time. So we have a tool that’s called P-curve, that tries to diagnose where
sets of significant findings are true or false. And basically what it relies on is, the logic is this. It says, if you, for
example in your paper, have three significant
results and they’re all 0.03, the implied power of each of those results is so low when you combine them, that basically we can say your paper’s, we’re not gonna bet that your
paper is true, all right? So I don’t wanna use the
language of not true or whatever. But it’s not true yet, right? You don’t have evidence yet. And so that’s where,
when you see low power, the findings aren’t true, because you had to have P-hacked in order to
get them in the first place. – Another piece of evidence
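The p-curve logic the student describes can be illustrated with some rough, best-case arithmetic. This is my own back-of-the-envelope sketch, not the actual p-curve procedure: convert each p = .03 to its z, treat that z as the true effect (the most charitable assumption possible), and ask how often all three results would come out significant at once.

```python
from statistics import NormalDist

norm = NormalDist()

p = 0.03                          # each of the three reported (two-tailed) p-values
z = norm.inv_cdf(1 - p / 2)       # observed z for p = .03, about 2.17
power = 1 - norm.cdf(1.96 - z)    # best-case replication power of each result
all_three = power ** 3            # chance all three are significant together

print(f"z = {z:.2f}, per-result power = {power:.2f}, all three = {all_three:.2f}")
```

Even under the most charitable assumption, three p = .03 results all landing significant together happens only about one time in five, which is why a paper full of barely-significant results reads as a red flag.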
that I don’t go into, I saw it at the SMS Conference. I don’t remember who showed this. But someone had just looked at the distribution of t or z-values
in the published literature. And what you see is, suddenly at 1.96, there is a huge spike, right? So if it was a normal distribution, you would not see a huge spike at 1.96, which almost tells you immediately, that individuals were
tweaking things a bit. I’m not, again, it comes back to this idea of let’s drop that control variable, we finally just made it at the 1.96 thing. But that should also give
you a pretty good idea, if you’re reading a study and it’s really at 1.96, 1.97, that there may have
been something going on, and b, it’s not gonna, the chances of it reproducing are really, I
mean, at best, flip of a coin. But probably worse than a flip of a coin in all practicality on there. Oh. – [Student] So yeah, I think this, the way of using the
power really gives a way to kind of quantify
the ideal replicability from your original study setting, right? So I actually think this
addresses the originality bias,
report, hey, if ideally, they can replicate it
with the same sample, drawn from the same
population, same setup, how likely are they gonna
find the significant effects? Just by quantifying it, by saying oh, there’s a 60% chance you’re gonna find it, a 40% chance you won’t find it. And when a paper gets published,
that encourages people to decide, oh, whether I
want to replicate this, or whether actually it’s worth pursuing. So I think that’s
another piece of evidence that should be reported. I mean, in terms of like
we’re accumulating evidence. So it gives people more information. – So it really is. I mean, it’s just a
modification of the t value. That’s all it is. But it’s taking something
that isn’t given in one form, a t-value, converting it
into something that we have in some sense, a more
intuitive feel for it. That’s all that I’m proposing on here. And it just shows that not only could you generate it by going through
the non-parametric bootstrap, ending up there, or you could just use very simple distributions
to go through on here. I don’t want to, I mean,
these are other questions I put up here. Would researchers in
organizational sciences revolt if we started to ask ’em this? How would we know what the cut is? If we’re on the editorial team, we’re making a decision,
somebody submits a paper, 50%, is that okay? Would we say, listen, I’m not gonna. I wanna know that this thing was powered. I mean, if we stick by what Cohen said, we shouldn’t be generally
publishing anything unless someone could get up to 80%. Our table should all go in there and say, listen, you’re underpowered
on this particular one. Come back to me when
you can show that this would reproduce 80% of the time on that particular one on there. I doubt we’ll ever do it. I think if we did it there
would be a revolution, and they would overthrow the
whole leadership of the thing, but it’s an interesting
thing to draw there. Other comments before
I move on to just the last slide, here? Let me see it. – [Student] I think this is really nice if you have normally distributed data. But as soon as you start
working with kind of real data that’s messy, and
that is like panel data, for example, that changes over time, it’s auto-correlated, and
has all these, like, strange behaviors. I don’t know how to do a,
it’s not straightforward to do a power calculation
on something like that. I don’t think it would be
nearly as straightforward. – I don’t know, okay? ‘Cause I haven’t really dug around on it. But all t and z-values
have two properties. The parameter estimate
and the standard error. Okay, and that
standard error adjusts, assuming you’ve correctly specified your model. It doesn’t matter whether you’re doing a generalized linear mixed effects model with a dichotomous outcome, it doesn’t matter whether you’re doing it using straight logistic regression,
it doesn’t matter whether you’re doing
straight up regression, it doesn’t matter whether
you’re running a fixed-effects econometric model, or whatever, right? It all has those parameters in it. And I think we would
find, although I don’t know for sure, I think you’d find that this basic idea of looking
at the distribution would hold just about whatever it was. No matter what you run. Complicated panel longitudinal data. If you give a t-value of 1.96, I’m gonna bet the farm
that there’s about a 50% chance that somebody’s
gonna replicate that. Okay?
– Yeah. I was thinking in terms
of the bootstrapping, which is really interesting
way of doing that. – But the bootstrapping is
really just my example, okay? It’s showing you, giving
you the intuitive feel for what’s happening
with the power analysis. But the easy way is just
to go look straight at it. In fact, really you can apply this to any published paper right now. Pick up a journal article, look at it, say, 60% chance I’d replicate that, 50, I mean by just plugging
that little formula into R, or the equivalent
in other languages. Ooh, I get to throw this. I’m doing it left-handed, by the way. Okay, ooh. That was almost bad. I’m a righty. (class chattering) – [Student] So Paul, just given, let’s say this was adopted. Doesn’t this put the onus on making sure when you’re making one- versus
two-tailed hypothesis testing all the more important? Because when you look at this, it’s really going to drastically influence whether you’re finding the effect size, hypothesized or not,
and that seems something that at least, in terms of publishing, often times you’re pushed for two-tailed, even when you’re making
directional hypothesis. And that seems like that would play, and this would only highlight the need to be more specific. Does that make sense?
– It does. But I think it applies. The example that I gave you where I created the data for the publication was a one-tailed test. So all I did was just adopt, I used 1.65 instead of 1.96 in the calculations. If I actually, I did
estimate, I just cut it out for sake of time, but
I estimated the power to have replicated that in a
two-tailed test, it was 34%. Okay, so I mean it really
took a dive once you did that. Right, so–
– But that would read very differently than 54% versus– – That’s right, yeah. So the ability to replicate
that as a two-tailed, if they did that, they’d say, only 34% of the time would
we have come up with that. Which reads even worse,
really, in one sense than 56%. – [Student] So I go back
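The one- versus two-tailed gap he describes is easy to see with the same normal approximation. A sketch, using a hypothetical observed t of 1.75 (the speaker’s figures came from his own data, so these numbers won’t match his 54% and 34% exactly); the cutoffs 1.65 and 1.96 are the ones from the talk:

```python
from statistics import NormalDist

norm = NormalDist()
t_obs = 1.75  # hypothetical observed t, just past the one-tailed cutoff

# Best-case replication chance = P(T* >= cutoff), T* ~ Normal(t_obs, 1),
# i.e. the R one-liner 1 - pnorm(cutoff - t).
one_tailed = 1 - norm.cdf(1.65 - t_obs)
two_tailed = 1 - norm.cdf(1.96 - t_obs)

print(f"one-tailed: {one_tailed:.0%}, two-tailed: {two_tailed:.0%}")
```

The same observed statistic loses roughly a dozen percentage points of replication probability just by moving from the one-tailed to the two-tailed cutoff, which is the “dive” he mentions.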
to the fact that this is bootstrapping, fine, but
it’s based on one sample. One–
– Yep. – And so it could be
that you’re basing those estimates on flawed data to begin with. So I don’t know that, I’m not convinced that I’m getting much. I mean, it’s neat, it’s interesting, but I don’t know that you’re
getting any more information than you get from the basic results. – Maybe not. I think it’s a huge assumption. But I think people would have to specify, like I did in the write-ups, that the assumption is that you’ve got the effect size–
– So let me– – And I think that one could really be called into question
for a number of reasons. – [Student] If you were to
look at meta analysis, okay? And maybe that’s something you could do to be more convincing. Just take a meta analysis where there’s like 20, 30 effects, and
go to the original study, or the earlier study that
found that particular effect. Try to do this, and then
go and see to what extent do those estimates actually
generalize to real data? I think that would be interesting to see. I’d be more convinced if
there’s some prophecy in there. I mean, there’s still the file drawer and all that kind of stuff
that could affect it, but right now, I don’t
know that you’re giving me any more information than what I could get from just the z-value to begin with. – It’s a good comment. I mean, I don’t know the answer. I just think that to some degree, well, I agree with you,
like what would be great is to say well, what was my effect size, and let’s say that study that I gave you with the Army data at the very beginning, and what was the meta analysis,
meta analytic effect size between cohesion and wellbeing, right? And they got there. I just know that that is a practical tool that’s never gonna happen. Like nobody is gonna
be, no editor is gonna sit there, or AE is gonna
sit there and say hey, I want you to actually go back and compare all your results against
the meta analytic ones, and then give me a feel for whether or not you were in the money. So it’s kind of a, it’s these
issues that trade off on here. So the last slide I have, this is Rob’s one, since
he told me, Ployhart, since he told me this was being, hold on I guess. Oh, there, well, now it finally went. So I just was gonna put up here, since this was being recorded,
what’s my view, okay, on this one, so that it doesn’t, that is that I just, I
wanna emphasize really, to Gilad’s point. Everything I gave up here does assume that that effect size was accurate. So if I go back in
there, it kind of assumes that there was no P-hacking. It assumes that whatever you
show on that particular one, you nailed it, you got that effect size. That’s a huge assumption, right? And to go back to his point, that may be too much of
an assumption to make for this to be practically useful in that particular case. I don’t know if it is or it isn’t, but it’s a key assumption in here, on that particular one. I will say that it’s
important to emphasize, again, as many other have,
that published studies, many of ’em just have low power. And low power is not just
simply the sample size. I mean, I showed you a low power effect when I had 105,000 observations when I started out there, right there. It was low power. How do I know it? Because my t-value was only 2.13. So even though I had 105,000 observations, I had a low powered study in that particular one on there, with it. Moe had made this comment already, but there may be some merit to if we were publishing things like this,
then it might help reduce the originality bias. So if I publish a paper and I say, listen, this is the first time
anyone has shown this, but keep in mind, even if I nailed this, there’s only a 50% chance that someone else would replicate it. Then somebody comes
along in the next study doesn’t find it, then
that paper isn’t rejected for failing to find
it, because it actually conforms to what I would have expected, and what I’ve set the groundwork for on that particular one. This last one, I’m not
necessarily recommending that editors ask authors to include this, but I do think is smart consumers of data, we read this stuff all the time, if there’s nothing else, please take away that you can just go
right to those z values and those t-values and
very easily get a sense as to how the results would
or would not replicate. Even using their exact same data, even with the assumption on there. ‘Cause once you start doing that, you realize that only 50%
chance of something replicating with a z or t-value of 1.96. It really makes you
think pretty critically about articles when you’re
starting to read them yourself as a consumer of the data. So with that, I think I will end. And I think I’m close to
being actually on time. What’s that? Or I can open up for more questions. – [Student] One comment, Paul. I think it’s one of the things
that comes out of here, too. The notion that, to the extent that we’re advocating for more replication, it also means that one-and-done
is not the way to go, ’cause with any one replication, especially in this notion, if there’s an effect in the original article and we show nothing, we don’t conclude the original article was
necessarily a fraud or a fake. This is this notion
that a broader sampling, or broader replication to build this cumulative body of this. – Which is also a great point. It makes you, because your first reaction when you read a paper it
says I didn’t replicate it, is to think there was something wrong with the original paper. But again, if you look
at it from this view of translating the z or the t-value into some kind of reproducibility index, then a paper comes on
doesn’t reproduce it, you really have a very different view of that second paper. Like, ah, well, there’s only
a 50% chance anyhow, on that. Now, I know that this is,
so again as smart consumers we should be aware of the
fact that a lot of times when it is reproduced, the
power is increased, right? Because it’s very typical not to try to replicate the exact sample size, but to acknowledge that it might be a somewhat inflated value. So they will often double, or add to the sample size considerably, which should, in fact, make it more likely to reproduce if the effect is still there.
– Or improve the measure or things like that. All those things have to
go into the discussion of helping us figure out
where are we after doing that. – Yeah, absolutely. I did think, and I don’t know, this is just maybe too far out there, ’cause it’s never really
done in the bootstrap, but I was thinking as Steve was talking, that it would be interesting to, instead of when you bootstrap,
you always bootstrap to get the original sample size. But what if you bootstrapped and you doubled your sample size? Now, all that would do
is it would increase, in this case, it would
reduce your standard error. But then the question would be, oh, well that’s not really ever done. Maybe that would be a
fairer way of thinking about this reproducibility problem. So you would, in essence,
reduce the standard error in it and that be another way to kinda do it. So instead of having 100, I
have 150 or something on there. But I couldn’t make the bootstrap. The built-in bootstrap program is so set on always sampling with
replacement to the same size. I couldn’t see how to just
modify it in two minutes. So I gave up. Therein lies the problem. Oh! – [Student] Interesting. So hopefully you can hear me, but just one thing I was thinking of is reporting it might make certain authors a little more moderate in the
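For what it’s worth, the larger-resample experiment he gave up on is only a few lines if you roll the bootstrap by hand rather than using a built-in routine. A sketch in Python, with simulated stand-in data (the sizes 100 and 150 echo his “instead of 100, I have 150” example): resampling with replacement to 1.5× the original n shrinks the bootstrap standard error of the mean by roughly √1.5, which is exactly the mechanism he describes.

```python
import random
import statistics

random.seed(1)
data = [random.gauss(0, 1) for _ in range(100)]  # stand-in sample, n = 100

def boot_se(sample, resample_size, reps=2000):
    """Bootstrap standard error of the mean, resampling with replacement
    to an arbitrary size (built-in routines usually fix it at n)."""
    means = [statistics.mean(random.choices(sample, k=resample_size))
             for _ in range(reps)]
    return statistics.stdev(means)

se_same   = boot_se(data, 100)  # classic bootstrap, same size as original
se_larger = boot_se(data, 150)  # the what-if: resample to 1.5x the n

print(se_same, se_larger)  # the 150-case SE comes out smaller
```

Whether an inflated-n resample is a fair way to think about reproducibility is the open question he raises; mechanically, though, nothing stops you from doing it.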
way they discuss the results. This wouldn’t be a bad thing, because I would say the way some authors discuss their results
is somewhat immoderate. (laughing) – It is presented many times. Like, we found the answer, right? And this kinda makes you think.
– Right. But I was just wondering
what you thought about what would happen for future research on that topic or that relationship as a function of these
probabilities that you’re showing? So would there be more
research if I say 50%, or if I report 100%, then people say, oh, we’re done with that one.
– Well, and again, this is my own take on
it, so I have no idea whether I’m right or wrong on it. But I would say that to the idea of the originality bias, I think it
could open up more research, because again, you would
say, yes that study found it, but they were very clear
that there was only about a 50% chance it would. So here’s a study that’s
actually going back to do it. On something like the 100%,
this is my own experience, I analyze data, looking at leadership and wellbeing for 22 years
while I was in the Army. And I can tell you that leadership
was always related, okay? So in this one study,
what I’m finding there, in 100%, I saw data set after data set, after data set, after data set. I don’t need to show in
any other data set ever, the soldier’s perceptions
of his or her leader are related to their wellbeing, right? In this one study, in essence,
would have told me that. But I can say cohesion, particularly in combination with leadership, because of the overlap,
was hit-or-miss, and work hours was hit-or-miss too, so it actually, even
this one study kind of confirmed my long-term
knowledge of the data. – [Student] Just that’s
what I’m worried about, is that 100% thing could be
just that, just an error. – It could be.
– In this case, I know leadership, and I know. But you may basically
incentivize a wrong behavior. – So if you P-hacked the hell out of it and got like a strong effect size, you get 100% showing it, nobody ever bothers to replicate it. Reactions to that? – [Student] You cannot
replicate it ’til you hack your way (speaking off microphone) (laughing)
It’s too hard. Biggest data set in the world. – All right.
