– So what I wanna talk about here, really, is just some work. When Moe asked me to do this, I kind of felt like Steve a little bit, in terms of this isn’t really my area, isn’t anything I’ve published on. Obviously, I have an interest

in it as a scientist. And so I went back and forth with Moe on one of the ideas that I had here, really, and this is

the talk that I’m gonna give on this particular one. And what I wanna talk about specifically is an idea that might have merit, in terms of thinking about: if we ask journal authors to contribute this information, would it help us get a sense about how replicable their results are? And that’s basically

the bottom line on it. But let me start here. So I wanted to give a

little bit of context to why I got interested in this idea. Talk a little bit of background. I’m gonna introduce just some

core ideas with the bootstrap. I apologize that for many of

you this will be a review, but I also know we have

some PhD students in here for whom this might be a new idea, and if I don’t lay the groundwork for that, they’ll have no idea

what I’m talking about as we go through with that particular one. And then, I’m gonna give a

couple of examples from data: one from my own data, another one from a published empirical example. And then in the end what I hope is just to get some good

discussion going about this particular issue as we go through. So again, as with the other ones, feel free to interrupt

if you have a comment, if I’m not making sense,

please just let me know as we go through this. So in terms of the context,

one of the things that we do. Excuse me. You’ll be able to edit that out, right? (laughing)

So in terms of context, one of the things that we do at the

University of South Carolina is we work with large

corporations that give us data. It’s kinda cool. Many of you do this in business schools. But I have Master’s students that’ll take these big data sets, and we’ll do a whole, long Capstone

analytics project. So just in terms of thinking

about the experience of these Master’s students, they

are just learning regression. So we can’t give ’em things

like machine learning and stuff like that, but basically, we turn the Master’s student into machine-learning algorithms, right? We have the Master’s students take the large data set,

and they become XGBoost, they become random forests, and we tell ’em, go test for a gazillion interactions, and look for main effects, and so we P-hack the hell

out of this data, all right? But before we P-hack the

hell out of the data, what we do is we take 40% of it, and that, we tell the firms

when we work with them, we’re gonna take 40% of your data, P-hack the hell out of it, see what we can find, see

if anything comes in there, and then we’re gonna

take 60% of that sample, and we’re gonna use that to potentially replicate the results, for all the reasons that we heard earlier

this morning from Joe, about just capitalizing on random chance as we go through to find

some result on here. So here’s a case, it

doesn’t really matter. This is a generalized

linear mixed effect model, so the outcome in this

case was whether or not an individual did or did not buy a high-deductible health plan. So we’re working with a company in the healthcare industry, obviously. Huge data set, right? I mean 260,000 responses on here. And so this is the summary

output from the model. And I wanna just focus your attention on this particular interaction here. All you need to know is this is just a simple two-way interaction

with a z score of 2.32 on this particular one. But your intuition, or certainly my intuition, would be that if you estimate and find an effect with 104,000 individuals, that thing is probably

gonna replicate, right? This would be what I

would generally think. So we take the 160,000 that we

have in the hold out sample, and we run it. And what we find instead is that it doesn’t replicate at all. It doesn’t even come close

to that particular one as we go through. Now, the reason why we do this is because these firms may actually,

much to our surprise, do something with these results. We’ve actually had a couple instances where we give ’em feedback on it, and they change some practice

within the organization. So as a scientist, this is

maybe a little different than the publication thing, but this, to me, has some

very real implications. I’m gonna go back, and they may change their strategy on what they’re gonna do based on the information we give them. I don’t want them to go bet the farm on an interaction that

I’m showing on this, if I don’t have evidence

that it can replicate on this particular one. So what I want, I want to

start with this as a context, because I wanted to put

this in here to think, to get us to think about, well, how should we think

about whether results do, or do not replicate? Are we holding one standard if

we’re working with industry, do we have another standard

if we’re publishing? Is there some way to

integrate these two results as we go through on here? So I want to, in this particular one, talk a little bit, just as background, about the non-parametric bootstrap, show how a modification to the common non-parametric bootstrap gets us to think a little bit differently

about this idea of do results replicate, and then to show that while

the non-parametric bootstrap is very intuitive, and I can walk you through the logic of it, that in fact, you don’t even need to do the bootstrap, and you can get exactly the same results. But I wanna start there,

so you can kind of follow the logic that we’ll go through on this particular one. And the discussion that I wanna have is, should we begin to think

about incorporating this type of information, which I’m gonna show

you, into publications? And I’ll give several examples as we go through this. So here’s where I’m

gonna go into some of the basic background of the

non-parametric bootstrap. So for many of you, this

will be a review on here. But you often use a

non-parametric bootstrap to go back to your original data and sample with replacement

from the original data to get confidence intervals

for a parameter estimate that has no natural confidence interval formula. The median can be one case: the median doesn’t have an analytic confidence interval. So if you wanna get

the confidence interval of the median, you can bootstrap your data over and over, and you can get

a confidence interval for it. We also use this a lot with mediation, because the indirect effect

has a non-normal distribution, so if we have something with

a non-normal distribution, we get a bit leery about

developing confidence intervals using 1.96 times the standard error, so the bootstrap works out

great on that particular case. Non-parametric bootstrap,

this is a core piece of every statistics package. I’m gonna be doing,

giving you examples in R, but you can run this in S-Plus, you can run this in Stata, SAS,

it’s nothing specific with any program, it’s just a very well-established method for

drawing inferences in data. Give you just the most basic idea again. So if this is a new idea to

you in terms of bootstrap, I can go back to a data set. So this is just using

the multilevel library that I maintain for R. I go back in, I get one

of my data sets in here, and I draw 15 observations. So I know this pointer doesn’t work, but I point up at that row there, those are the numbers. Those are 15 numbers where individuals reported their work hours, how many hours they worked during the day. And if I sample with replacement, this is the key, so it’s just the sample command on that data set, the same 15 observations, but with replace equals true, what happens is I get

another 15 observations, but it’s not exactly the

same as the first one. In this case, I ended up with one less 12, and one extra eight on here,

because of the replacement. What that lets me do is go back in and calculate a mean,

a mean, a mean, a mean, it’s slightly different each time, so 10.3, or 10.5, 10.6, 10.7. So now, by sampling with replacement, I go back to the same data

over, and over, and over, and I can do that. And then you can have computers do what computers do very well,

which is to do some dumb, repetitive task 10,000 times, and you can get a distribution off of it. So I can go back in and I could say hey, if I go back to my 15 observations, I can get a distribution

of what the mean is in this particular case, and unbelievably, if you did it 10,000

times, I must have actually had one draw that got all

eights out of that sample, or something down here, which

is pretty amazing on here. But you can see this is the distribution. And if I wanted to apply

some statistics to it, I can say, well, the top, the top 2 1/2%, the bottom 2 1/2%, those

are my confidence intervals. So I can be 95% confident that everything will fall between those values. And if it’s outside that 2 1/2 either way, then it’s what we think about

as an uncommon observation, given the data. So that’s the basic of it. Here’s how the bootstrap

is typically used. So the bootstrap would often be used more for a regression, for instance. I’ll give you an example

in a more complicated case. I go back to our data. If you, all the code, by the

way, is in my slides here. It’s either here, or in the notes, if you wanna follow this just to see, ’cause you’re curious about it. So I go back, set a random seed so that you can exactly

replicate those results. I get 500 observations

just drawn at random, so we get this. And the typical, I run a

linear model, so regression. I regress wellbeing and work

hours, leadership and cohesion. These are the parameter estimates. And this is almost always what the bootstrap is focused in on. The parameter estimates right here. So I can go back with this one. Like I said, the code is below this. I get some, using the bootstrap library, I get these nice graphs, here, that show me, here is the distribution for the bootstrap draws, for work hours, here it is for leadership,

here it is for cohesion. I look at the q-q plots. They seem to be pretty

nice, they’re lining up, there’s nothing too wacky about it. Not seeing observations over there. And I get these confidence

intervals right there. So again, this as I show here, is really not all that interesting. It’s not all that exciting

because you can just go back to the raw data,

or to the results up here, and again, just using basic stats, I can take the parameter estimate, negative 0.04, and I can

multiply the standard error by 1.96 and add it, I get this value; I can subtract it, I get this value, and what you’ll see is that

when I bootstrap the values, or I just use 1.96 times

the standard error, I get basically the same number. All right, so this is just

background showing you how I could use the bootstrap, showing you it’s not

particularly interesting to do it this way. It would be fun to program it. You can see the results. But I could’ve gotten to the same point just by taking 1.96

times the standard errors on the same thing. But here’s where it gets interesting. So if I go back to the same problem, and to actually build up a point Joe made earlier this morning,

sometimes you read, again, about showing these confidence intervals, 95% confidence intervals on here, as a way to solve the problem that we have in terms of, are results replicable? But really, exactly as was said earlier this morning, I look at these, I see

there’s no zero in any, I turn it into a

dichotomous test in my head, and I say basically,

they’re all significant. So seeing the confidence intervals really does nothing for me in

terms of giving me the sense that these would or these would not be replicated in another sample, or even in my own sample on this. I’ve heard many people talk about this. I know Geoff Cumming really emphasizes this. I never understood that logic. I’ve read that paper, and I’d say that it doesn’t. It does give you a sense, right, that there’s some variability in here, but the way that we interpret

it, it doesn’t do much in terms of whether

something’s replicable on here. So here is where things

get more interesting. And that is, let’s say,

and one never does this in bootstrap, which is why,

maybe, it’s interesting, but let’s say, instead of going here and focusing on the estimate itself, which is where we

typically do the bootstrap, let me just modify the

bootstrap to go back and say, hey, every time we run a bootstrap draw, instead of saying what was

the parameter estimate, just tell me whether it

was significant or not. Okay, so I’m gonna go

through 10,000 iterations in my data, now I don’t actually care what the mean and the variability

of the parameter estimate is, I’m just starting a counter. One if it was significant,

zero if it wasn’t significant on that particular one. So at the end what I can do is I can say, well, how many times did I get a one? So if I take my same data right here, my results, this is 500 individuals, everything’s significant in here, and if I now modify the

bootstrap to go back in, do the same iterations

of the data and count it, the question is, how often

should I see significance? Because this gets to the

idea of can I replicate my results in my own data? Not requiring somebody

else to take my data, I’m not waiting for the

next studies that show that my data’s all messed

up and I made some error, I’m just actually using my own

data to get a sense for it. So I’ll throw it out there,

but I’m just kinda curious. If you see these results right here, what’s your sense about these? I mean, give me a number. Do you think I’m gonna see

90% replicate these results? And everything’s significant. I’ve got 500 observations in here, let me just throw it out for discussion. I’m kinda curious. – A lot.

– A lot, right? (laughing) Well–

– I’m not gonna give you a number.

– All right go on. Anyone brave enough to

throw out one number? Well, let’s be more specific, okay? Let’s say, let’s take work hours, right? You can see up there, that’s

this first one, there. Is that, just a guess,

is that gonna replicate 95% of the time, 90% of the time? What would you think?

– I would look at the P-value and the effect size. I think if the P-value is less than 0.05 with a very large effect size, it’s likely to replicate,

and if it’s a P-value, (speaking off microphone) – So keep that idea in mind, but I think you’ll find that it’s, that’s where intuitively,

this gets interesting, okay? ‘Cause I don’t think it’ll

come out that way, right? On that one. So that. – 500.

500. So it’s not massive, but

it’s pretty decent, right? I mean it’s not like 30 or something. But ironically, as I go through this, you’ll realize it wouldn’t matter if I had 50,000 or 50, the same effect

would actually come out as it comes out on this. So let’s just go with, I won’t pin anyone down, we’ll just go with a lot, right? But this would be my intuition, too. And it truly is my intuition. I think, I look at this, I say, I’ve got 500 individuals in here, I mean that’s a good, solid sample size. I’ve got three variables that

are statistically significant, easily meet the criteria, right? A t-value of negative 3.03, t-value of 9.4, a

t-value of 2.15 in there. These weren’t right at 1.96, they weren’t teetering

on the edge right here, so I think they come out.
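The counting procedure just described (resample the rows with replacement, refit the model, record a 1 if the coefficient is significant and a 0 if it isn’t) is easy to sketch. The talk’s examples are in R; this is a minimal Python stand-in on simulated data, where the variable names and effect sizes are my assumptions rather than the actual well-being data, using |t| > 1.96 as the cutoff:

```python
import numpy as np

def t_values(X, y):
    # Ordinary least squares by hand; returns the t-statistic
    # for every coefficient (intercept included).
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(np.diag(XtX_inv) * sigma2)
    return beta / se

rng = np.random.default_rng(42)
n = 500
# Stand-in predictors and outcome (assumed effect sizes, not the real data):
# one strong effect, one moderate, one sitting near the cutoff.
work, lead, coh = rng.normal(size=(3, n))
wellbeing = -0.13 * work + 0.40 * lead + 0.10 * coh + rng.normal(size=n)
X = np.column_stack([np.ones(n), work, lead, coh])

# Modified bootstrap: don't keep the estimates, just count significance.
B = 2000
hits = np.zeros(3)
for _ in range(B):
    idx = rng.integers(0, n, n)          # sample rows with replacement
    t = t_values(X[idx], wellbeing[idx])
    hits += np.abs(t[1:]) > 1.96         # 1 if significant, 0 if not

rates = hits / B
for name, r in zip(["work hours", "leadership", "cohesion"], rates):
    print(f"{name}: significant in {r:.0%} of bootstrap draws")
```

A coefficient whose t-value is up around 9 clears the cutoff in essentially every draw, while one that only squeaked past 1.96 in the original sample comes and goes from draw to draw.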

– Is the idea, I know you’re setting us up. – I would never set you up. – [Student] Is the idea

here that the coefficients are correlated upon resampling? And you’re revealing that in the– – Well, I thought, so initially, so Fred, what he asked,

since this is being recorded, Fred asked, is the idea here that we have a lot of multicollinearity

among the variables? And so when I draw from the resampling, that may be what I’m doing is in fact, I’m basically getting

different combinations of the correlations among

the variables in here, which is entirely plausible. There may be some part of that. But I don’t think it’s

actually the key thing that happens on here. So let’s see what

happens with the results. What happens in fact, is,

if I set up that algorithm I see that work hours comes

in here about 86% of the time. Not too bad, actually. Although I have no idea

what a good result is. Well, I do know that

this is a good result. Leadership made it 100% of the time, okay? So basically, using my own data, I could replicate that leadership was related to well-being 100% of the time, in, I think I ran 10,000 of these. I might have run 1,000. It doesn’t matter on here. But look what happens

with cohesion over here. So cohesion, significant. T-value of 2.15 on there. And I look at this one. 56% of the time is how often

that becomes significant on that particular one. So the question that I want

to get you to think about with this particular one is, imagine that I had

written up these results. But imagine that I put

a statement in here, in my paper like this, while each variable was

statistically significant with a P-value of less than 0.05, readers should consider that even if the parameter estimates

are the true values, which, by the way, is a

big assumption, right? We’ve kinda talked about the fact we can’t even make that assumption, an identical study with an

equal sample size of 500 would expect to find

work hours significant 86% of the time, cohesion

significant only 56% of the time. Based on the findings however, we would expect any replication

with similar sample sizes to find leadership

significant 100% of the time. Sorry, I misspoke there. It should say cohesion 56%, leadership 100%, right? So if you were reading

this, so go ahead, Fred. – Oh I just, I can’t help myself. So yeah, no, I think that’s right. So your point is, the bootstrap is looking at the marginals of these

coefficients, right? And the individual, now,

sees they’re looking at the joint profile of coefficients. So across bootstraps, you kinda have this blob of coefficients in P space, however many predictors you have. The bootstrap is just

looking at significance of each one on the–

– Of each one, that’s right. – This one is looking at the joint, the profile of the stars–

– That’s right, yeah. – Yeah, for significance.

– Yeah. – Yeah.

– So there is some aspect of this to it, right? But I think it would happen,

what I could show you is that even if I had one variable in here, I would get a similar effect, right? So it doesn’t actually require that it, just to clarify this, so he’s, Fred is raising the point,

everything is complicated, right? Who was it that gave all the arrows? Well, it was actually Dr. Bettis who was on the slide, right? All the arrows going everywhere. We have that in, but I could easily show a similar effect here, with

simply one variable in here. But we’ll have to do that after, ’cause I didn’t put it in my talk, here, but we can bring out R,

and I can show you on here. So let’s go to, or, maybe what we could do in our tables, this

would be another thing, would be to go back in here, and say, here’s my table, and I don’t know what

we’d call this thing. Maybe there is a name for this thing. This is what I’m kinda

curious about, okay? But you could put in here, with your own data, exact replication how often would you see the effect, and would present something like that. – Isn’t it just a more sophisticated way of getting the same

information you already see? I mean the one value for leadership has the strongest magnitude of estimate, the strongest t-value, and then it’s just a fancier way of showing

the same thing you get from the t-value. – It is, which we’re gonna get to, okay? But it’s also a way of

translating the t-value, right? In other words, it’s basically, yes, it’s the t-value, but when I tell you how often would a t-value of three, point, let’s say three, replicate, right? This kind of gives you an idea. You’ll come back to this,

’cause I’ll tie it together. But you’re onto something with that one. – [Student] But wouldn’t

this be the standard error that either the bootstrap or jackknife in Stata calculates? So when we have sample

sizes, small sample sizes in particular in strategy,

I think that’s the norm to report either of those rather

than the 1.96 calculation. So isn’t this, you were

looking for a name, so I think we just call it

standard error in parenthesis, either bootstrap or jackknife, depending on–

– But I know the, it might be what you referred to earlier. So Maca earlier indicated that they will, with a relatively small

sample in strategy, they might jackknife it. So what would be, if

you drop it each time? If that was substituted to give the percentage of time, so let’s say you have 100 observations, you

jackknifed it 100 times, and then you went back in and said, 95% of the time I got it. It would be equivalent, although I don’t think it would be exact,

’cause the bootstrap and the jackknife, Moe?

– Yeah, it’s not about standard error.

– It’s not about standard error, right. (student speaking off microphone) (laughing) – [Moe] No, no, no, it’s

not about standard error. Actually, Paul and I, we actually discussed this extensively. So I think Paul is gonna

reveal what it is later. – I’ll reveal, I’ll go through it. So let me keep going on here, all right? But this is an idea,

just to go through that. So let’s take another example. And on this example, I’m not beating up. I didn’t put the author’s name on this. I don’t want to. Because they didn’t do

anything wrong, right? This is just very typical, this is an article that was published. My main criteria for

selecting this article was that I was gonna have to enter this correlation matrix in by hand, okay? So I started just thumbing

through the journal, and I just, I didn’t want a whole lot of correlations on there. And then they had to be

presenting a main effect, right? Because you can’t use

this correlation table to create a data set with an interaction, you’ll just get garbage on there. So I needed to have a hypothesis

that was a main effect on this particular one. So in this one, again, as

we alluded to this morning,

I’m just gonna focus on the standardized data

right here, 0.25 on there. There are 83 organizations

in this particular data set, and they had a main effect hypothesis about collective organizational engagement being related to firm performance. It was in the third step of their equation as they went through on

this particular one. So as long as somebody puts in the correlation matrix, then the

means and standard deviations, it’s quite trivial to go

back to that raw data, and replicate a data set that

has every attribute on it. I was kinda talking earlier about it’s sometimes nice to have

three decimal places because to make this one work exactly, I had to go back and give the authors the benefit of the doubt on some, I had to move a couple

things over a little bit to where their correlation table was true, but being rounded to a slightly

different number on there. But you can see that I can go back in and I can create another

data set with 83 observations that matches theirs perfectly, and gives me identical results. This is, interestingly enough, it’s identical here, it’s not identical in the sign for some of theirs, so I don’t know if this was just a transcription error on their part, so I get a negative 0.05 up here, and this one’s off a little bit, but this one really matches

up perfectly on that. So, okay. But anyhow, it’s fairly easy to do that. I can again take my 83

observations in the bootstrap, and I can just say, all right, let me go back through and count how often they got the result that they reported, that they showed in this particular table. (student speaking off microphone) I’m gonna hold onto one of these. – [Student] A technical

question, how do you choose the probability distribution of each one of the variables? The only thing that you

have are correlations, and then minimal, maximum, or

mean and standard deviation, but you could have a

logarithmic, or a normal, or it could be anything–

– Well– – [Student] There’s 10

million degrees of freedom. – There could be things

off, it’s not perfect, okay? But the mean and standard deviation would give you a pretty

good idea, hopefully about the distribution.

– Assuming normality. – You have to assume normality. – Normal and then you test–

– Correct, yeah. – But there’s no assurance.

– There’s no assurance. – [Student] You get the same

coefficients and P-values. – That’s right. But you’d be surprised

at how often this works. Okay, just to let you know, all right? Because most of the time, our data’s reasonably normally distributed. Okay, so if it’s way off, this might throw off something on there. But honestly, if it’s way off, the author should have told you that

it was way off, right? And they should have

said, this is so far off I shouldn’t be using regression, right, is the way that I would look at it on that particular one. So if I take these 83 observations that I’ve now made from their own data, and go back through, run the bootstrap, again I find that it replicates 53% of the time, on here. So they’re showing the finding right here, well, either one of those would be fine, but in fact, 53% of the time it comes up. It’s also interesting if you look at it you’ll see, zero effects are

never really zero, right? We don’t have, we get something

on these other effects. You can see in fact, 12% of the time that variable was significant, 13%, 14% for others. So those ones that were

not significant at all were still coming up

significant like this. This again, goes back

to the P-hacking idea. Like if you work away at this enough, you’ll draw some samples in there, where just about anything

will be significant. – [Student] You’re giving us a mystery. So I expect you to encourage a lot of– – Box.

– Okay. Ooh! – [Student] Really

trying to shut you down. – [Student] I know, I know. All right, well, thanks for indulging me. Now I’m thinking, well, there’s

the bootstrap distribution where you know the overall value, versus testing every individual

effect against the null. That’s why you’re having this lack of replicability in each instance, right? I know the overall

effect in the bootstrap, I don’t know the overall effect for every individual test against zero. Right, so if I have variability, that’s down on the low end.

– It could be. – I didn’t solve your mystery.

– Bear with me, bear with me. All right, and tell me whether, all right? – That’s my last shot.

– No, no. – All right.
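One mechanical piece here is worth pinning down: how you manufacture raw data that reproduces a published correlation matrix, means, and standard deviations exactly. Assuming multivariate normality, which is the assumption the audience just pushed on, this is essentially what R’s MASS::mvrnorm does with empirical = TRUE. A small Python sketch with made-up target values (the actual paper had more variables than this):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical targets for a 3-variable example (not the published values).
R_target = np.array([[1.00, 0.25, 0.10],
                     [0.25, 1.00, 0.30],
                     [0.10, 0.30, 1.00]])
means = np.array([3.5, 4.1, 2.9])
sds   = np.array([0.6, 0.8, 1.1])
n = 83

# Start from arbitrary normal draws, strip out their own sample moments...
Z = rng.normal(size=(n, 3))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)
L_sample = np.linalg.cholesky(np.corrcoef(Z, rowvar=False))
Z_white = Z @ np.linalg.inv(L_sample).T   # columns now exactly uncorrelated

# ...then impose the target correlations, SDs, and means.
X = Z_white @ np.linalg.cholesky(R_target).T * sds + means

print(np.round(np.corrcoef(X, rowvar=False), 3))  # matches R_target
```

Because the sample’s own correlation is removed before the target is imposed, the match is exact rather than merely in expectation, which is why a regression run on the manufactured data can reproduce published coefficients to the reported decimals.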

– All right. But you’re following

what I’m doing, right? I’m just going back to

either simulated data that matches your publication, or to raw data that you have, and I’m just saying, let’s

put an index on there to say how often would

this result be replicated as we went through on here? So the real question, so basically, let’s think about what the

article would look like then, so this is what was said. “As predicted, collective organizational engagement significantly and positively affected firm performance. In fact, it was the only variable among our predictors, moderator and control variable that had a significant direct relationship.” So that was very assuring, right? Like this is written as

we write in academics. This is the thing that came in there. And I’m really not beating up on authors, ’cause you know my criteria

for having selected this. But imagine what, if the

next paragraph it said, note however, that even

under the best case scenario, where the estimate of

our effect is accurate, an identical replication

will find a significant one-tailed result only

about 54% of the time. Right, would that change how

we think about these results when we’re reading ’em? As scientist, as we go through? It would, right? On the particular one. So the real question is,

and this is the question that I had to ask myself

on this particular one, am I doing anything new, right? And the answer is, I’m

not even, I’m not really. I’ve given a different

flavor to something we know, very well, within our field already. I’ve applied the bootstrap to a problem, but the solution is in

fact, nothing new at all. This is something that’s

been talked about, I’m just reframing the way we talk about this particular question in here. So the answer is, all I’m

showing you right here with the non-parametric bootstrap

applied to these data is, I’m showing you power. And so this goes back to

what Gilad was saying, can’t I get this same information by just looking at the t-value

of this particular one? So what is power? Power is “the statistical power of a “significant test is the

long-term probability “given the effect size,

alpha, and the N of “rejecting the null hypothesis.” So long-term probability in my bootstrap is the probability having

replicated it 10,000 times on that particular one

as we go through here. So this idea of using non-parametric

bootstrap is intuitive. You get the idea on here, but I can show you that it’s very easy, and this is what I’m gonna

kind of conclude with, to just take the t-values, or z-values, we’re not gonna show z-values, but in many cases our samples

sizes are large enough that we can just substitute

t-value, z-value, it doesn’t really make

that much difference on this particular one. So again, because I find

it a little non-intuitive, I always have to think through power. Power is just one of those

things that I struggle with. I get it. I understand, I’ve done a

lot of randomized trials for the Army, I know that we

do careful power analysis. I know how to do ’em, but

I always have to think, what does it mean, exactly, when I’m just thinking

about, how do I explain it? So here’s one way that

you can think about power, hopefully that may make sense to you. Let’s say I just have two

random numbers, x and y. I’ve got 100 pairs of them on here. So I can go back to a t-table, table of the critical values, and I know if I have 100 numbers, x and y, that if I have a correlation

that’s greater than 0.197, then that will be statistically

significant, right? So that’s just, we have that. It wouldn’t occur by chance to have a correlation that strong in there. So I know that value. A correlation of 0.197, that’s associated with a z or t of 1.96, right? And that means, basically

I’ve written my paper, I’ve got these two variables

that are correlated, I get my correlation of 0.197. I can write, I have a

significant effect in there. So then if Fred comes, and if Fred wants to replicate my findings. So Fred basically goes back in there, and let’s say my 0.197 correlation is right on the money,

it’s the perfect estimate. There’s no bias in it,

I didn’t do anything that would make it suspicious. I just, by some miracle, happened to get the perfect effect size for this phenomena on this particular one. So Fred goes back in, gets the

exact same sample size, 100, and he replicates it. Well, if he, if that’s the true value, and Fred replicates it with that, and he does it, let’s say Fred, he’s very ambitious by the way, he’s like an editor for three journals and

I don’t know what else. Okay so he would think

nothing about replicating my own study 1,000 times if he needed to to get to the truth. So he goes back and

replicates it 1,000 times. Just by chance, if the true value is 0.197, half the time he’s gonna get a lower value, and half the time he’s gonna get a higher value. So that means my power is 0.50. My ability to detect the effect is anytime the observed correlation is higher than or equal to 0.197. So that’s why you can basically go back to any z or t-value that’s 1.96, and you can immediately say the chances of that being replicated are 50%. Okay, flip a coin, all right? So when you see a z or a t score in any publication out there,

you can immediately know that there is only a 50% chance, and that’s under the best case scenario, where you have no reason

to believe the estimate was in any way upwardly biased or anything as you go through that particular one. So this just is a simulation

to show that effect, here. To demonstrate it, I’m

not gonna go through this, other than to say, if you’re curious, R is an open source language, so if you’re an editor, or a reviewer on an article, and you wanna get a sense of how reproducible this effect would be, you can literally just

take one minus the probability from a normal distribution of 1.96, the critical value if you’re doing a 0.05 test, minus the observed 1.96, so that difference would be zero. So that would tell you exactly

what I described to you. You’d have a power of 50% in there. Half would be less than,

half would be greater. That was the scenario I gave there. You can substitute any of

these values right in here. So you could just use

this little formula in R. One minus pnorm, 1.96, minus

the t score or the z-score, and you would estimate it. I can go back with the first

example that I gave you, I did the whole bootstrap,

I let the whole thing run, I got these values: 86%, 100%, 56%. I used the simple formula here, I go back and I grab the

t-value off of here: 86 versus 86, 100 versus 100, 57 versus 56, close enough in that particular one to do it. And it’s that simple. We literally can go back to

any publication in there, and we can just use this

power and this estimate, well, how often will

this thing be reproduced? We went back through here. Same thing, this is

off a little bit, here, but I can go back in. They don’t actually give the t-value, but I can look here and I can see, I’ve got an estimate here: with the beta and the standard error, dividing one by the other lets me estimate that the t-value was 1.857. I added some extra decimal

places, obviously, in there. And so I can use the same thing, and off this, I get a 58% using that, where the way I did it before gave a 53. This is a case where the t and the z start to get a little divergent. So with a sample of 83, your t and your z aren’t gonna line up quite as nicely as when I had 500, where the t and the z worked

out with that particular one. So the big questions, and really, and this is what I wanted to talk about, the big questions are,

could we, do you think, as a field, we could use these ideas about the ability to replicate

and the power to inform us in meaningful ways about the robustness and reliability of results? When I have a big sample working with the firm, and I do a holdout sample,

it’s no different, basically. It’s the same data. In that particular case,

I did the sampling before. What did I do with the? – [Student] I have a

question about the comment on this being equivalent

with the holdout sample. When you do a holdout sample,

you’re reviewing the model based on just part of the data. Here, you would review that

model based on the entire data. And then, let’s say if I’m

building a decision tree that is over fitting,

I can get 100% accuracy on every bootstrap sample you draw, which won’t happen when you

have a holdout sample, right? – Yeah, so I think it’s

an important point. I think that you, I mean, to clarify, there may be, it’s not exactly the same in terms of having the holdout sample. But in a weird way, I’m not

sure how different it is, right? Because I start at the very beginning by doing a random number generator to create the holdout in the other, so it’s this one iteration of a bootstrap, in kind of an equivalent way. The holdout typically has more power, because you hold out 60% versus the 40, so that you actually increase it. – [Student] But that is actually related to the second question I’m asking, because my concern on

this not being equivalent to a holdout sample is, when you do a machine-learning-type

algorithm over the data, it’s very easy to go into overfitting. – Right.

– Which won’t be solved by bootstraps and–

– It won’t, that’s right. – [Student] Now, that

brings me to the question of what kind of statistical

test can be supported by this bootstrapping-based approach? Because I have another example, here. If you are trying to do, let’s say, just linear regression, and

I happen to have, let’s say, the dependent variable c, and I have two independent

variables, a and b, let’s say those are almost

perfect predictors of c. Okay, a works 95% of the time, b works 95% of the time. Now, if I use bootstrap samples

to run linear regression, half of the time I get a,

half of the time I get b. So I will mistakenly

believe that neither a nor b is replicable, but

indeed, both of them are. – Right.

– How do you deal with that? – Well, it’s a good question, right? ‘Cause this gets back to

Fred’s question about, would this work if you have, you’re kinda looking at

it as pairs of variables in that particular one. But I would say, it’s not

a statistical question, I guess, in my mind. It was like, if you’re

communicating the results, how do you communicate that you have two variables that have 90% overlap on it? And so if you present one

result in a journal article, then you’re, by definition, not presenting the full picture, either, right? So it’s kinda, I don’t see

it as a stats question, so much as how should we

require that data be presented, and that people go through

on that particular one? It’s a great question. By the way, I’m not advocating

that people do the bootstrap. Okay? I’m using it really, because conceptually, it helps to see, how did we end up there? Like, how did I end up

at this particular one? But I wouldn’t advocate to people, everyone have to go out and

bootstrap their data repeatedly, ’cause actually you don’t need to. You can just do it off the thing. So, other questions? – [Student] Yeah, so thinking about this, no, I wanna kinda flip it

on its head a little bit and say, given that our r-squareds are not typically massive, right? We typically deal with maybe 0.03, 0.04, if we’re lucky. And as Joe mentioned,

earlier this morning, we’re frequently underpowered. So could the replicability

crisis really just be an artifact of we deal with

small effects and low power? And not that we’re not actually– – That is a huge part of this. I firmly believe that, and in fact, I think if you read, if you get into this, you will find that’s a

big issue right there. Low power kills a lot

of these things on here. If you have no other takeaway

from what I did here, that would be one there. – [Student] Just to add to that. So the reason why lower

power is indicative of the reproducibility

crisis, it’s not because low-powered findings can’t still be true. It’s that,

in order to continue to get lots of low power studies

to be significant, you had to have P-hacked them, because otherwise, they can’t all be significant at the same time. So we have a tool that’s called P-curve, that tries to diagnose where

sets of significant findings are true or false. And basically what it relies on is, the logic is this. It says, if you, for

example in your paper, have three significant

results and they’re all 0.03, the implied power of each of those results is so low when you combine them, that basically we can say your paper’s, we’re not gonna bet that your

paper is true, all right? So I don’t wanna use the

language of not true or whatever. But it’s not true yet, right? You don’t have evidence yet. And so that’s where,

when you see low power, the findings aren’t true, because you had to have P-hacked in order to

get them in the first place. – Another piece of evidence

that I don’t go into, I saw it at the SMS Conference. I don’t remember who showed this. But someone had just looked at the distribution of t or z-values

in the published literature. And what you see is, suddenly at 1.96, there is a huge spike, right? So if it was a normal distribution, you would not see a huge spike at 1.96, which almost tells you immediately, that individuals were

tweaking things a bit. I’m not, again, it comes back to this idea of let’s drop that control variable, we finally just made it at the 1.96 thing. But that should also give

you a pretty good idea, if the study you’re reading is really at 1.96, 1.97, that a, there may have

been something going on, and b, the chances of it reproducing are really, I

mean, at best, flip of a coin. But probably worse than a flip of a coin in all practicality on there. Oh. – [Student] So yeah, I think this, the way of using the

power really gives a way to kind of quantify

the ideal replicability from your original study setting, right? So I actually think this

addresses the origination bias, which is when anyone studies something then publishes the first study, they should really

report, hey, if ideally, they can replicate it

with the same sample, joined from the same

population, same set up, how likely are they gonna

find the significant effects? Just by quantifying it by saying oh, 60% of a chance you’re gonna find it, 40% of a chance you won’t find it. And a paper get published

that encouraged people to decide, oh, whether I

want to replicate this, or whether actually it’s worth pursuing. So I think that’s

another piece of evidence that should be reported. I mean, in terms of like

we’re accumulating evidence. So it gives people more information. – So it really is. I mean, it’s just a

modification of the t value. That’s all it is. But it’s taking something

that isn’t given in one form, a t-value, converting it

into something that we have in some sense, a more

intuitive field for it. That’s all that I’m proposing on here. And it just shows that not only could you generate it by going through

the non-parametric bootstrap, ending up there, or you could just use very simple distributions

to go through on here. I don’t want to, I mean,

these are other questions I put up here. Would researchers in

organizational sciences revolt if we started to ask ’em this? How would we know what the cut is? If we’re on the editorial team, we’re making a decision,

somebody submits a paper, 50%, is that okay? Would we say, listen, I’m not gonna. I wanna know that this thing was powered. I mean, if we stick by what Cohen said, we shouldn’t be generally

publishing anything unless someone could get up to 80%. Our table should all go in there and say, listen, you’re underpowered

on this particular one. Come back to me when

you can show that this would reproduce 80% of the time on that particular one on there. I doubt we’ll ever do it. I think if we did it there

would be a revolution, and they would overthrow the

whole leadership of the thing, but it’s an interesting

thing to draw there. Other comments before

I move on to just the last slide, here? Let me see it. – [Student] I think this is really nice if you have normally distributed data. But as soon as you start

working with kind of real data that’s messy, and

that is like panel data, for example, that changes over time, it’s auto-correlated, and

has all these, like, strange behaviors. I don’t know,

it’s not straightforward to do a power calculation

on something like that. I don’t think it would be

nearly as straightforward. – I don’t know, okay? ‘Cause I haven’t really dug around on it. But all t and z-values

have two properties. The parameter estimate

and the standard error. Okay, so and that

standard error adjusts for if you’ve correctly specified your model. It doesn’t matter whether you’re doing a generalized linear mixed effect model with a dichotomous outcome, it doesn’t matter whether you’re using a straight logistic,

it doesn’t matter whether you’re doing

straight up regression, it doesn’t matter whether

you’re doing fixed-effect econometrics, or whatever, right? It all has those parameters in it. And I think we would

find, although I don’t know for sure, I think you’d find that this basic idea of looking

at the distribution would hold just about whatever it was. No matter what you run. Complicated panel longitudinal data. If you give a t-value of 1.96, I’m gonna bet the farm

that there’s about a 50% chance that somebody’s

gonna replicate that. Okay?
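The back-of-the-envelope trick the speaker describes in R, one minus pnorm of the critical value minus the observed t, can be sketched with Python's standard library. The beta and standard error below are hypothetical numbers (chosen only so the ratio matches the 1.857 from the talk's example), and the one-tailed variant just swaps in the smaller critical value:

```python
from statistics import NormalDist

def replication_power(t_obs, crit=1.96):
    """Chance a same-sized replication clears the critical value, assuming the
    observed t is the true effect size (the talk's big assumption), and
    ignoring the negligible opposite tail of a two-tailed test."""
    return 1 - NormalDist().cdf(crit - t_obs)

# A t or z right at the two-tailed cutoff: a coin flip.
print(round(replication_power(1.96), 2))  # 0.5

# When a paper reports only a beta and a standard error, t = beta / se
# (hypothetical numbers here, picked to give roughly the talk's 1.857).
print(round(replication_power(0.13 / 0.07), 2))

# One-tailed critical value (about 1.645) versus two-tailed, same t.
print(round(replication_power(1.857, crit=1.645), 2))
print(round(replication_power(1.857, crit=1.96), 2))
```

The same one-liner exists in R as `1 - pnorm(1.96 - t)`; this is just the "plug the t in and read off a replication chance" move applied to any published table.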

– Yeah. I was thinking in terms

of the bootstrapping, which is a really interesting

way of doing that. – But the bootstrapping is

really just my example, okay? It’s showing you, giving

you the intuitive feel for what’s happening

with the power analysis. But the easy way is just

to go look straight at it. In fact, really you can apply this to any published paper right now. Pick up a journal article, look at it, say, 60% chance I’d replicate that, 50, I mean by just plugging

that little formula into R, and the equivalent

of other languages. Ooh, I get to throw this. I’m doing it left-handed, by the way. Okay, ooh. That was almost bad. I’m a righty. (class chattering) – [Student] So Paul, just given, let’s say this was adopted. Doesn’t this put the onus on making sure when you’re making one, versus

two-tailed hypothesis testing all the more important? Because when you look at this, it’s really going to drastically influence whether you’re finding the effect size, hypothesized or not,

and that seems something that at least, in terms of publishing, often times you’re pushed for two-tailed, even when you’re making

directional hypothesis. And that seems like that would play, and this would only highlight the need to be more specific. Does that make sense?

– It does. But I think it applies. The example that I gave you where I created the data for the publication was a one-tailed test. So all I did was just adopt, I used 1.65 instead of 1.96 in the calculations. I actually did

estimate, I just cut it out for sake of time, but

I estimated the power to have replicated that in a

two-tailed test, it was 34%. Okay, so I mean it really

took a dive once you did that. Right, so–

– But that would read very differently than 54% versus– – That’s right, yeah. So the ability to replicate

that as a two-tailed, if they did that, they’d say, only 34% of the time would

we have come up with that. Which reads even worse,

really, in one sense than 56%. – [Student] So I go back

to the fact that this is bootstrapping fine, but

it’s based on one sample. One–

– Yep. – And so it could be

that you’re basing those estimates on flawed data to begin with. So I don’t know that, I’m not convinced that I’m getting much. I mean, it’s neat, it’s interesting, but I don’t know that you’re

getting any more information than you get from the basic results. – Maybe not. I think it’s a huge assumption. But I think people would have to specify, like I did in the write-ups, that the assumption is that you’ve got the effect size–

– So let me– – And I think that one could really be called into question

for a number of reasons. – [Student] If you were to

look at meta analysis, okay? And maybe that’s something you could do to be more convincing. Just take a meta analysis where there’s like 20, 30 effects, and

go to the original study, or the earlier study that

found that particular effect. Try to do this, and then

go and see to what extent do those estimates actually

generalize to real data? I think that would be interesting to see. I’d be more convinced if

there’s some proof in there. I mean, there’s still the file drawer and all that kind of stuff

that could affect it, but right now, I don’t

know that you’re giving me any more information than what I could get from the just the z-value to begin with. – It’s a good comment. I mean, I don’t know the answer. I just think that to some degree, well, I agree with you,

like what would be great is to say well, what was my effect size, and let’s say that study that I gave you with the Army data at the very beginning, and what was the meta analysis,

meta-analytic effect size between cohesion and wellbeing, right? And they got there. I just know that, as a practical tool, that’s never gonna happen. Like nobody is gonna

be, no editor is gonna sit there, or AE is gonna

sit there and say hey, I want you to actually go back and compare all your results against

the meta analytic ones, and then give me a feel for whether or not you were in the money. So it’s kind of a, it’s these

issues the trade off on here. So the last slide I have, this is Rob’s one, since

he told me, Ployhart, since he told me this was being, hold on I guess. Oh, there, well, now it finally went.

what’s my view, okay, on this one, so that it doesn’t, that is that I just, I

wanna emphasize really, to Gilad’s point. Everything I gave up here does assume that that effect size was accurate. So if I go back in

there, it kind of assumes that there was no P-hacking. It assumes that whatever you

show on that particular one, you nailed it, you got that effect size. That’s a huge assumption, right? And to go back to his point, that may be too much of

an assumption to make for this to be practically useful in that particular case. I don’t know if it is or it isn’t, but it’s a key assumption in here, on that particular one. I will say that it’s

important to emphasize, again, as many others have,

that published studies, many of ’em just have low power. And low power is not just

simply the sample size. I mean, I showed you a low power effect when I had 105,000 observations when I started out there, right there. It was low power. How do I know it? Because my t-value was only 2.13. So even though I had 105,000 observations, I had a low powered study in that particular one on there, with it. Moe had made this comment already, but there may be some merit to if we were publishing things like this,

then it might help reduce the originality bias. So if I publish a paper and I say, listen, this is the first time

anyone has shown this, but keep in mind, even if I nailed this, there’s only a 50% chance that someone else would replicate it. Then somebody comes

along in the next study doesn’t find it, then

that paper isn’t rejected for failing to find

it, because it actually conforms to what I would have expected, and what I’ve set the groundwork for on that particular one. This last one, I’m not

necessarily recommending that editors ask authors to include this, but I do think, as smart consumers of data, we read this stuff all the time. If there’s nothing else, please take away that you can just go

right to those z values and those t-values and

very easily get a sense as to how the results would

or would not replicate. Even using their exact same data, even with the assumption on there. ’Cause once you start doing that, you realize there’s only a 50%

chance of something replicating with a z or t-value of 1.96. It really makes you

think pretty critically about articles when you’re

starting to read them yourself as a consumer of the data. So with that, I think I will end. And I think I’m close to

being actually on time. What’s that? Or I can open up for more questions. – [Student] One comment, Paul. I think it’s one of the things

that comes out of here, too. The notion that, to the extent that we’re advocating for more replication, it also means that one-and-done

is not the way to go, ’cause any one replication, especially in this notion that if there’s an effect in the original article and we show nothing, we don’t conclude the original article was

necessarily a fraud or a fake. This is this notion

that a broader sampling, or broader replication to build this cumulative body of this. – Which is also a great point. It makes you, because your first reaction when you read a paper it

says I didn’t replicate it, is to think there was something wrong with the original paper. But again, if you look

at it from this view of translating the z or the t-value into some kind of reproducibility index, then when a paper comes along that

doesn’t reproduce it, you really have a very different view of that second paper. Like, ah, well, there’s only

a 50% chance anyhow, on that. Now, I know that this is,

so again as smart consumers we should be aware of the

fact that a lot of times when it is reproduced, the

power is increased, right? Because it’s very typical not to try to replicate the exact sample size, but to acknowledge that it might be a somewhat inflated value. So they will often double, or add to the sample size considerably, which should, in fact, make it more likely to reproduce if the effect is still there.
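That power boost from a bigger replication sample can be put in numbers with the same normal approximation used earlier. This is an illustrative sketch, not something from the talk: if the original effect is real, the expected t scales roughly with the square root of the sample size, so a replication on twice the data shifts the expected t by sqrt(2).

```python
from math import sqrt
from statistics import NormalDist

def replication_power(t_obs, n_ratio=1.0, crit=1.96):
    """Power of a replication whose sample is n_ratio times the original's,
    assuming the observed t reflects the true effect; under that assumption
    the expected t scales with the square root of the sample size."""
    return 1 - NormalDist().cdf(crit - t_obs * sqrt(n_ratio))

print(round(replication_power(1.96), 2))               # same n: the 50% coin flip
print(round(replication_power(1.96, n_ratio=2.0), 2))  # doubled n: roughly 0.79
```

So a result that is a coin flip to replicate at the original n climbs to roughly four-in-five odds when the replicator doubles the sample, which is why enlarged replications failing to find the effect are more informative.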

– Or improve the measure or things like that. All those things have to

go into the discussion of helping us figure out

where are we after doing that. – Yeah, absolutely. I did think, and I don’t know, this is just maybe too far out there, ’cause it’s never really

done in the bootstrap, but I was thinking as Steve was talking, that it would be interesting to, instead of when you bootstrap,

you always bootstrap to get the original sample size. But what if you bootstrapped and you doubled your sample size? Now, all that would do

is it would increase, in this case, it would

reduce your standard error. But then the question would be, oh, well that’s not really ever done. Maybe that would be a

fairer way of thinking about this reproducibility problem. So you would, in essence,

reduce the standard error in it and that be another way to kinda do it. So instead of having 100, I

have 150 or something on there. But I couldn’t make the bootstrap do it. The built-in bootstrap program is so set on always sampling with

replacement to the same size. I couldn’t see how to just

modify it in two minutes. So I gave up. Therein lies the problem. Oh! – [Student] Interesting. So hopefully you can hear me, but just one thing I was thinking of is reporting it might make certain authors a little more moderate in the

way they discuss the results. This wouldn’t be a bad thing, because I would say the way some authors discuss their results

is somewhat immoderate. (laughing) – It is presented many times. Like, we found the answer, right? And this kinda makes you think.

– Right. But I was just wondering

what you thought about what would happen for future research on that topic or that relationship as a function of these

probabilities that you’re showing? So would there be more

research if I say 50%, or if I report 100%, then people say, oh, we’re done with that one.

– Well, and again, this is my own take on

it, so I have no idea whether I’m right or wrong on it. But I would say that to the idea of the originality bias, I think it

could open up more research, because again, you would

say, yes that study found it, but they were very clear

that there was only about a 50% chance it would. So here’s a study that’s

actually going back to do it. On something like the 100%,

this is my own experience, I analyze data, looking at leadership and wellbeing for 22 years

while I was in the Army. And I can tell you that leadership

was always related, okay? So in this one study,

what I’m finding there, in 100%, I saw data set after data set, after data set, after data set. I don’t need to show in

any other data set ever, the soldier’s perceptions

of his or her leader are related to their wellbeing, right? This one study, in essence,

would have told me that. But I can say cohesion, particularly in combination with leadership, because of the overlap,

was hit-or-miss, and work hours was hit-or-miss too, so actually, even

this one study kind of confirmed my long-term

knowledge of the data. – [Student] Just that’s

what I’m worried about, is that 100% thing could be

just that, just an error. – It could be.

– In this case, I know leadership, and I know. But you may basically

incentivize a wrong behavior. – So if you P-hacked the hell out of it and got like a strong effect size, you get 100% showing it, nobody ever bothers to replicate it. Reactions to that? – [Student] You cannot

replicate it ’til you hack your way (speaking off microphone) (laughing)

It’s too hard. Biggest data set in the world. – All right.