Okay everyone, welcome to the

Microsoft Research Colloquium. Today we have Tamara Broderick,

a professor of computer science and statistics at MIT, and

also a longtime friend of mine. We actually have a history of

following each other around the country,

>>[LAUGH]>>We were undergraduates together at Princeton and then Tamara followed me to

Berkeley for grad school. And then I followed her back to

Cambridge to work full time. And I’m delighted that she’s

here today to tell us about her work on Variational Bayes. So let’s give her our attention.>>Thank you. You know actually on that note,

when I was in high school I went to this science and

engineering fair. And my friend Lizz Wodo told

me I had to look out for this super cool guy Lester Mackey

who was going to be there.>>[LAUGH]

>>And unfortunately I didn’t meet you

there but in undergrad I did. So today I’m gonna talk

about how we can quickly and accurately quantify

uncertainty and robustness with

variational Bayes. So before I get into the details

of that statement let me just start by talking about our

motivating data example in this work. We’ve been looking at

Microcredit, and so more to the point our

collaborator, Rachael Meager, who’s a fantastic economist has

been looking at Microcredit and we’ve been working with her. So if I say anything wrong from

the economic side that’s totally me, it’s not Rachael. And she’s got data on

microcredit, these small loans to individuals in impoverished areas, and the hope is to help bring them out of poverty. She’s got data from seven

different countries, Mexico, Mongolia, all the way

through Ethiopia. And there are a number

of businesses at each of these sites. So there’s one site in

each of the countries. There’s between 900 and

17,000 businesses. And so there are a lot of

questions that are interesting about this data that basically

revolve around how effective is Microcredit, is it actually helping raise

people out of poverty. But one specific question that

you might be interested in is how much does Microcredit

increase business profit? And a priori this could

be any real number, it could actually

decrease business profit. We don’t know, we wanna find

out if it’s actually helping. Okay, so you know as

part of this analysis, figuring out how much does microcredit increase business profit, there are a number of desiderata that we have here. So one is we wanna share

information across sites. I mean one thing that I could

do is just lump all my data together. I have this data from Mexico,

this data from Ethiopia, it’s all in one big dataset. And I just analyze it, but

that seems sorta problematic. We think that at least on

the face of it Mexico might be a little bit different

from Ethiopia. On the other hand I don’t want to just analyze all

the data from Mexico. And then separately and totally independently analyze

all the data from Ethiopia. And then separately and totally independently analyze

all the data in another country. Because, well one I think

that there’s probably some information about micro credit

that can be shared across these countries. And two, if I get a new site and a new country I wanna be able

to say something about it. And so I wanna be able to share

power in a way that doesn’t treat these as totally

separately or totally the same. And this is something that’s very true, not just in microcredit. And I think in general these desiderata won’t be just for microcredit, that we want to share this kind of information. We wanna quantify uncertainty,

so let’s say that at the end of the

day we find out that Microcredit increases business profit

by 2 US dollars per week. Well is it 2 US dollars plus or minus 1 US cent? In which case, well microcredit is definitely helping. Or is it 2 US dollars plus or

minus 10 US dollars or plus or minus 100 US dollars? Maybe it’s actually taking

people out of business and ruining their lives some

percentage of the time. So these would be hugely

different scenarios where we might have very different

approaches after the fact to making policy or making

decisions, or things like that. So it’s very important to be

able to quantify our uncertainty in this application. And also to quantify our

uncertainty coherently. So the amount that Microcredit

increases business profit isn’t our only

unknown in this case. We also don’t know a priority, what’s sort of the base profit

in all of these different countries before we

see Microcredit. We don’t know how these

different countries interact, what’s their correlation. And so there are a lot of

unknowns in this problem. And we wanna make sure that when

we have uncertainties over them that they somehow don’t

contradict each other. Okay, so these are two big

reasons that people turn to something like

Bayesian inference. It allows us not only to share

information across experiments. But more generally, if I have

some complex physical phenomena, it lets us have these sort

of modular graphical model frameworks to build up

these complex models. And another big reason to use

Bayesian inference is that, by using the language of probability, it has inherently a coherent

treatment of uncertainty. Okay, so what’s the problem, what do we still need to do? We wanna get fast results. So in this talk, certainly tens of thousands of data points, which we’ve seen here, and hopefully up to tens of millions of data points. So these are the sorts of scales that we wanna be at. And we want very quick, fast

results so that Rachael Meager and other economists can

just turn around and work on their

economics modeling and their work there, and not have to worry about how long it takes to get the machine learning done. We wanna quantify robustness. So something that’s really important in any statistical problem, machine learning problem, optimization problem, is you’re choosing a model. You’re choosing a way to

describe the universe. And as George Box so famously

said, all models are wrong, but some are useful. And so if I take my model and

I change of the little bit, am I gonna be making

the same decisions? And especially when those decisions could

affect livelihoods, they could be very important

sort of governmental decisions. We wanna feel that

they’re robust somehow. So, we wanna quantify robustness, and

we want guarantees on quality. We wanna make sure we’re

actually getting the right answers at the end. This is where hopefully

this talk will come in, how do we get these

fast results? How do we get this

quantification of robustness, as well as theoretical

guarantees on quality? And as I’ve already

mentioned a few times, none of these desiderata

are specific to microcredit. This is true of a variety of

different modeling scenarios and problems, but this was our motivating example

in this particular case. So, let’s review Bayesian

inference a little bit. So, recall that in

Bayesian inference, we start with some distribution

over our unknown’s theta. So, theta could be a vector. But as humans, we can only

see in two dimensions. So I have had to plot this

one-dimensional theta distribution, but I have some

distribution that recognizes or that represents sort

of my knowledge or lack of knowledge of

theta to begin with. So one element of theta, in this

case would be something like how much does microcredit increase

the profit of businesses? And you could imagine there

would be other ones like other parts to this vector like

how much does microcredit increase the profit of business

in Mexico, in Ethiopia? What’s the base profit

of business in Mexico, in Ethiopia and so on. So we get a lot of

parameters in the end, then we have our likelihood

that describes our data. Let’s call it x,

given our parameter theta. This is another

modelling choice. And then finally, Bayes theorem

tells us what do we know about these unknown parameters, given the data that we’ve seen,

now that we’ve seen the data x. And so

Bayes theorem is just a theorem, it’s basically written

out right here. And it’ll give us some posterior

and the thing that you typically expect to see is

that in the prior, things are pretty diffuse. We don’t know as much. And after we’ve seen data, the distribution starts to peak

around a particular value and there is less width

to the distribution. We’re much more certain that

this is the value now that we’ve seen the data.
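To make that picture concrete, here is a minimal sketch of the prior-to-posterior tightening on a toy conjugate normal-normal model with made-up numbers (an illustration only, not the microcredit model):

```python
import numpy as np

# Toy conjugate normal-normal model with known noise variance (made-up numbers).
# Prior: theta ~ N(mu0, s0^2); likelihood: x_i ~ N(theta, s^2) for i = 1..n.
mu0, s0 = 0.0, 10.0                      # diffuse prior
s = 2.0                                  # known observation noise
x = np.random.default_rng(0).normal(3.0, s, size=100)

# Conjugacy: the posterior is normal, precisions add, means are precision-weighted.
prec = 1 / s0**2 + len(x) / s**2
mu_post = (mu0 / s0**2 + x.sum() / s**2) / prec
print(f"prior:     mean {mu0:.2f}, sd {s0:.2f}")
print(f"posterior: mean {mu_post:.2f}, sd {prec ** -0.5:.3f}")  # peaked near 3
```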

Well, there are actually a number of challenges to using what at first might seem like a

very simple theory in practice. So you just look at this and

you say, I specify my prior. I specify my likelihood and

I get out my posterior. I’m done. Data science is solved. We’re all good. But in reality, I have to

have come up with a prior. I have to come up with

a likelihood for that matter. These are both choices. In general, modelling

is a difficult problem. And so expressing my prior

beliefs in the distribution could be something that I

find quite challenging for a number of reasons. One, it might be time consuming. Maybe actually we’re all

perfectly capable of expressing our beliefs about the effect of

microcredit before we’ve seen any data in a distribution,

but I only have an hour. I’ve got to leave after this and pick up something from

the grocery store. I’ve got to go do some research. And so I only have so much

time to express my prior and I’m just gonna have to

leave it at that, and maybe I haven’t fully

expressed that at that point. It also might be subjective. Maybe I have some idea

about microcredit and you have some idea

about microcredit, and we both think that we’re

reasonable people, and that if we both express those ideas

about microcredit that we’d like to know we arrive at the same

decisions at the end of the day. So we’d be pleased to see that

if we start with some range of reasonable priors for

whatever of these reasons that, in fact, when we apply

Bayes theorem, we get sort of the same distribution out

in the end for the posterior. The same results,

the same decisions. We would be less pleased if, in fact, you applied Bayes

theorem and you came up with a very different decision

and set of ideas than I did. This would seem to

be problematic. If we wanna go back and

either collect more data or rethink our model, or somehow

revisit what we are doing. And of course, in modern complex

models where actually there’s extremely high dimensional

distributions, there are these interesting connections

between all of the unknowns. This just becomes more and

more problematic. We can’t necessarily reason so easily about all of these

unknowns in our data. So for all of these reasons, we

like robustness quantification. We like to say, hey, if I start

with a set of assumptions and I change those

assumptions a little bit, how much is that going to

change what I’m reporting, the decisions that I’m making

at the end of the day? So now, let’s say, we’re totally

happy with our prior and our likelihood. Yes.>>[INAUDIBLE].>>Yes, so

this is a great question. So, am I talking about

relatively small changes here? You could be interested

in two types of things, global robustness, so over some extremely sort of

large set of distributions. Or you could be interested

in local robustness, which is sort of more

perturbative argument. I think there’s also an issue that global robustness is probably a super hard problem. Whereas local robustness

probably has some linearity to it that we can exploit. And so, basically we’re going

to come at this from a practical perspective and say, let’s use

some implied linearity and local robustness to

get a sense of that. But ultimately as

a far end goal, we’d like global robustness.>>I guess, quick question

on [INAUDIBLE] context is, sometimes when you are trying

to express a prior, you are trying to express it

over some space, but there is other equivalent spaces, where

you may have a stronger and more consistent prior-

>>Yeah.>>And

that tends to be [INAUDIBLE].>>So, I’ll come back

to this later, but we’re not restricting ourselves

to be in a particular parametric family, or

anything like that, yeah. Okay, so first of all we

have this challenge, but a second challenge is,

let’s say, for whatever reason we do feel

perfectly happy with our prior, perfectly happy with

our likelihood. We actually still have to

calculate this posterior. And that in general

is going to be hard. For any non-trivial model that’s

out there, we actually can’t just solve this equation and

get this posterior, get this distribution after

having seen our data. We have to approximate

it somehow. So, for instance, one of the ways that you

might approximate it, certainly one of

the most influential algorithms of the 20th century,

is Markov Chain Monte Carlo. This gives us

a bunch of samples, it’s an empirical approximation

to the posterior. But problematically,

it can be very slow. It might be right eventually,

but you might not have enough

time for eventually. And so, approximating the

posterior can be computationally expensive, and so increasingly

we have seen people turn to what is known as variational Bayes, which I’ll explain in a moment,

but basically you can think of it as

something that people are turned to, because it’s much faster. Now, a problem with variational

base though is it can under estimate uncertainties and that you can underestimate

them dramatically, and so if you’re using, if you’re

making decisions based on those uncertainties like we

talked about before, in this microfitted example,

that could be a big problem. You might be overly

confident in your answer when you shouldn’t be,

when that’s not given metadata. So, we really want

uncertainty quantification. And so, for both of these, for

robustness quantification and uncertainty quantification,

that’s where our methodology linear response variational

Bayes comes into play. Yeah?>>Can you explain

why the [INAUDIBLE]?>>Yeah, I’ll explain this

in a picture shortly. So, the question was, is there a specific reason why

the uncertainty tends to be underestimated. It isn’t actually always

underestimated, but we’ll see that

pictorially in a moment. Okay, and let me just mention

very briefly there is some work by Opper and Winther in 2003 that touches on a very similar idea for uncertainty quantification from a more theoretical perspective. So, that’s also a very

nice paper to check out. Okay, so for

the remainder of the talk, I’m gonna start by talking about

what variational Bayes is, as an alternative to MCMC. I’m gonna go over this

challenge of what’s going on? Why is it underestimating

uncertainties? How can we correct that? How can we get

accurate uncertainties from variational Bayes? And then, it’ll turn out that

basically everything we’ve been doing here is about

perturbation style argument. And if you think about it, robustness as we’ve described

it so far, certainly this local idea of robustness is

all about perturbations. Perturb my assumptions, how much

does that change my output. And so, we’ll see this

basically gives us robustness quantification. So, again, the big idea here is

derivatives of perturbations are somehow relatively easy

in variational Bayes, and then if there’s time,

I’ll talk about how there’s a remaining challenge with variational Bayes. So, here we’ve talked about uncertainties, but there could be bias, there could be issues in estimating tail probabilities, things like that. And I’ll talk about some recent

work that Jonathan’s been doing on coresets for reliably

estimating the posterior in all these different ways. Okay, so what’s going on

with variational Bayes. Well, let me first tell you

what variational Bayes is. Now, remember, we’re trying

to estimate this posterior. Here I’ve made a slightly more

complex posterior cartoon, but of course, I can’t draw the

really high dimensional crazy posteriors that we’re dealing

with in actual applications. That’s the kinda thing that we

really want to approximate. And so, in general, the goal here is that we can’t

calculate the actual posterior. Our knowledge of the unknowns

given the data, so we need to find some approximation; let’s call it q star of theta. Now, that’s not specific to variational Bayes; this is just sort of the general approximation goal. But here is a way that we

could go about doing this.>>Should I think about this

pragmatically, in other words, you want to produce a machine

which outputs approximate samples from q* or do you want

to approximate [INAUDIBLE]?>>So before I go to this green

blob, it could be either. So samples would, for instance,

form an empirical approximation, but you could think

of it as a q star. You could think of it as

a measure that puts equal mass on each of your samples. And so in that sense, q star is some approximation

of p of theta given x. You could think of this as

some continuous distribution, that’s familiar, and that

would be an approximation for p of theta given x. At this point,

we just want some approximation, we haven’t said what particular

form that would take. But for variational bayes,

we make more assumptions than that. So for variational Bayes, we’ll

go a step farther and we’ll say, hey, out in the world, there’s

a set of nice distributions, this green blob represents

the nice distributions. Well, what are nice

distributions? Well, what are we trying to

do at the end of the day? Let’s say we’re in this

microcredit problem, we wanna say something like, how much on average is

microcredit helping people? How uncertain are we

about that value? And so that’s something

like calculating a mean of a distribution or

a variance of a distribution. So I wanna be able to

calculate means and variances from my distribution. That seems like

a nice distribution. Maybe I wanna be able to

calculate normalizing constants, that’s also pretty useful. Maybe these are the sorts

of things I want in life, and so we can imagine that these

distributions constitute nice distributions. Maybe you only like to

work with Gaussians, so maybe that is your

definition of nice. But somehow these

are the nice distributions. And the sad fact of life and the reason that we’re even

talking is that unfortunately, the exact posterior is

not a nice distribution. Yeah?>>[INAUDIBLE] paper, which is normally in

these regions, but at the same time your body will

have something like consumption, those kinds of things, where

even though they might be normal they can come out negative cuz-

>>So this is different, so

what you’re describing is just actually the description of the

model, so you can put together, and it’s not even this. It’s really the prior and

the likelihood. You can do whatever you want for

the prior and likelihood. You can basically say,

here’s a model, here’s the description of how

the data could be generated. And that’s based on,

for instance, in the case of Rachel’s paper,

I’m saying hey, I have these different countries

and I want them to share power. And I want them to have this

particular distribution of the amount that people

are earning at each country. And so

those are those kind of choices. The tricky part is once I’ve

made all those choices, I wanna say what do I know

about these unknowns? Things like,

how much microcredit is helping now that I’ve actually

seen the real data? And that’s what p of

theta given x is, and it doesn’t have some nice,

simple form. It won’t be a Gaussian, it

won’t be a known distribution. And so for that reason, because

we can’t calculate this, we can’t just put it in a form, we

have to approximate it somehow. So this is like the layer after

you’ve made those choices. Okay, so we have to

approximate it somehow. We might want it to be

familiar distributions, we might want it

to be something. Somehow we want to approximate

it with a nice distribution and as I just said,

in any problem of interest, certainly in this problem

with the microeconomics data, our true, exact posterior is

just not a nice distribution. We can’t calculate means and variances easily. We can’t calculate its normalizing constant easily, so we have to do something. And so a natural thing is to

say, well, let’s look at all the nice distributions and

find the closest. Let’s use the closest of

the nice distributions for this approximation. Okay, well now we really have to

define what we mean by nice and what we mean by close to make

this an optimization problem. If we define closeness to

be the Kullback-Leibler divergence in a particular

direction between q and p. So this is q*, I’m sorry, this

is between q, and then q* will be the minimum of this, the

minimizer, and p, our posterior. Then we have what’s known

as variational Bayes.
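As a minimal sketch of that optimization (a toy setup with made-up pieces: a one-dimensional Gaussian family for q, a two-component mixture standing in for p, and the KL estimated by Monte Carlo; none of this is the microcredit model):

```python
import numpy as np
from scipy import stats, optimize

# Toy "posterior" p: a two-component Gaussian mixture we can evaluate exactly.
def log_p(theta):
    return np.log(0.5 * stats.norm.pdf(theta, -2, 1) + 0.5 * stats.norm.pdf(theta, 2, 1))

# Nice family: q = N(m, exp(log_s)^2). KL(q || p) = E_q[log q - log p],
# estimated with a fixed set of standard-normal draws so the objective is smooth.
z = np.random.default_rng(1).standard_normal(4000)

def kl(params):
    m, log_s = params
    s = np.exp(log_s)
    theta = m + s * z                           # samples from q
    return np.mean(stats.norm.logpdf(theta, m, s) - log_p(theta))

res = optimize.minimize(kl, x0=[1.0, 0.0])      # start off-center
print("q* mean, sd:", res.x[0], np.exp(res.x[1]))
# This direction of KL is "mode seeking": q* locks onto one mixture component.
```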

Why Kullback-Leibler divergence? Is that what you’re gonna ask? Cuz that would be great.>>[INAUDIBLE] was gonna

ask about [INAUDIBLE].>>Yeah, yeah, great.

Why Kullback-Leibler divergence, why this direction? It’s one particular direction, this is not symmetric. And I think if we’re honest with ourselves, it’s not because Kullback-Leibler has beautiful

theoretical properties, and things like that. It’s because it seems

to work in practice. So it enjoys a lot of

practical success. We can get reasonable point estimates and predictions in some problems; it seems to have been working. We’ve shown in some work that we

can make it work on streaming data, we can make it work on

distributed architecture, so we can speed it up to

very large data sets. We can run on millions of Wikipedia articles in something under an hour. Yes?>>Someone who believes the distribution is

brought from there>>Updating the samples from that will choose

that point as well?>>I’m not sure I

understand the question.>>So there’s various theorems

about how if someone has a>>Right, yeah, so that actually just happens to be related to

the Kullback-Leibler divergence, but it’s very

different from this. So that one’s all about sorta

yeah if you get more and more data what’s gonna

happen to your posterior. But here we’re just

choosing to minimize the Kullback-Leibler divergence

to our exact posterior just to get an approximation. Not because of some

principal reason, just because it’s convenient. So I have [INAUDIBLE].>>Yeah.>>About [INAUDIBLE] but is there any reason why

that to use [INAUDIBLE] measure under

[INAUDIBLE] things that.>>[INAUDIBLE]

>>Yeah. Great. Okay, so there’s a few things here. So one is could this be

problematic in high dimensions. Actually it could be

problematic in many dimensions. It could be problematic in two

dimensions as we’ll see on the next slide, part of the

problem is that we’re optimizing over a set of nice distributions

we haven’t defined yet. So we’ll define that shortly and

see how things go wrong. Another issue is why

not use Wasserstein? Why not use something else here? There’s total variation. All kinds of distances we have. And I would challenge you to

come up with a simple algorithm for those distances. And I think that’s why people

have used KL successfully in the past is that it leads

to simple algorithms, yeah?>>Now if you have data

where the distribution doesn’t have necessarily

full support, like there could be areas where you know

where the probability is zero.>>Yeah center doesn’t have.>>Not necessarily,

cuz I could choose my nice distributions to also have

no support over those areas, I could choose them to have zero

probability in those areas.>>Yeah you would need to

know what those areas are and avoid them. If you don’t, then you

definitely run into problems. You will quickly

run into problems. Okay, so what goes wrong in two dimensions, and what goes

wrong with uncertainty, are two questions that are on the table

and are about to be answered. Okay, let’s look a little bit

closer at a particular example. This is a super simple example. Let’s imagine that we have two

variables, like maybe this is the amount that micro credit

increases business profit and this is the amount that we would

expect business profits to be, in general,

if micro credit didn’t exist. So these are two unknowns that

we’re trying to figure out from our problem and

maybe our exact posterior, even though we don’t know it, is a bivariate normal. This isn’t realistic; if our exact posterior were a bivariate normal, we would have found it and we would state it exactly, and we wouldn’t have to approximate it in any way. But this will give us a sense of what goes on with variational Bayes, okay. So with variational Bayes, remember we’re trying to

minimize the KL divergence between q and p. So let me write

out the particular thing that we’re

trying to minimize, in the particular direction that

we’re trying to minimize it. Now what’s going on here? Well as was sort of just

suggested in the previous question let’s look at not

areas where there’s no mass but areas where there’s very

little mass, like right here. There’s very little

exact posterior mass, p of theta given x. And so if, in an area where there’s very little mass, I assign q to have a bunch of mass, this is gonna be like dividing by something near zero; this is gonna be huge, and we’re not gonna be minimizing KL with that choice of q. So we’re going to need to choose a different q. And in general, this is telling us, on a very high cartoony kind of level, our approximation q is kind of going to be fitting inside of our exact posterior. Okay, to see what

actually happens, we have to say what

a nice distribution is. We said what close was. We haven’t said what nice is. Well, when you think about it,

well, what makes things nice? What makes it easy to calculate

means, to calculate variances, to calculate normalizing

constants for distributions? Well, if they were low

dimensional distributions, that would be a case where

I could probably do that. And so the idea of mean

field variational Bayes is to assume that our

approximating distribution factorizes into low

dimensional distributions. Just to be clear on

the notation here, you can think of

each of these as. I’m sort of overloading

this Q notation, these don’t all have

to be the same. All I’m saying is that in every sort of variable direction, I just have this totally separate distribution, and so that’s the approximation that I’m doing for mean-field variational Bayes. So, what if I had these axis-aligned distributions, and I’m trying to fit them

inside this posterior? Well here is my mean-field

variational Bayes approximation, q*. Now first of all, you can immediately see what’s

going wrong with the variance. The width is the variance. It’s how uncertain we are about

say, the microcredit effect, how much microcredit is helping,

we should be pretty uncertain. We should not be very

confident in our results. And we’re way overconfident. We’re saying, yeah,

we definitely know that this is exactly how much

microcredit is helping. Let’s go make some really

important economic decisions based on this, go team. But it’s not at all

supported by the data. This is just from our

approximation method. Not only that, not only are we underestimating the variance in this case, in fact we can do this arbitrarily badly by making this a thinner and thinner bivariate normal. And we’re getting no covariance estimates, so if we care about, you know, how these variables interact, we’re not getting any information about that.
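To put numbers on this cartoon: for a bivariate normal posterior, the KL(q||p)-optimal mean-field Gaussian is known to have variance 1/Lambda_ii, where Lambda is the posterior precision matrix, rather than the true marginal variance. A minimal check (the numbers here are made up):

```python
import numpy as np

# Bivariate normal "posterior" with unit marginal variances and correlation rho.
for rho in [0.5, 0.9, 0.99]:
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    Lambda = np.linalg.inv(Sigma)        # posterior precision matrix
    mf_var = 1.0 / np.diag(Lambda)       # optimal mean-field variances: 1 / Lambda_ii
    print(f"rho={rho}: true marginal var = 1.0, mean-field var = {mf_var[0]:.4f}")
# The mean-field variance is 1 - rho^2, so as rho -> 1 it goes to 0: arbitrarily bad.
# And the mean-field covariance across the two coordinates is exactly 0 by construction.
```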

Okay, so next, I’m gonna talk about how we can deal with this. Let me just mention,

this is an old problem. People have known for a very long time that this has a tendency to underestimate the variance and not give the covariance. But it’s also a very

modern problem. So to this day, people are saying, I want to do this large-scale Bayesian inference problem. I’m really interested in running on a ton of data. And I tried MCMC and

it ran way too slowly, and so I have to run something. And so I’m gonna try

variational Bayes and so, yeah?>>[INAUDIBLE]

>>Yeah, so if you did the opposite direction of

the KL divergence then you’d get something that, in this case,

overestimates the covariances. Unfortunately, that’s not always

going to happen in the other direction, the other direction

can go both ways pretty easily, it can overestimate and underestimate. It’s just for this particular shape that it’s definitely going to overestimate, and you can easily come up

KL divergence turns out to have a lot of issues,

practically speaking, and in particular you can see that

because we’re integrating over q, which is nice because q is

like the nice distribution, so integration with it is pretty easy, relatively easy, whereas integrating over p

is pretty problematic, yeah?>>Is the [INAUDIBLE]

propagation [INAUDIBLE]?>>I don’t know the answer to

that off the top of my head. I mean, I suppose you could use

Variational Bayes as part of a belief propagation. [INAUDIBLE] Inner

tubes are different.>>Yeah but yeah, it’s like the

flavor of combining them, yeah.>>[INAUDIBLE]

>>Yeah, so in general certainly

something that you can do is you can avoid the mean field

approximation but then when you get to higher dimensions

you have this issue of, well, you could approximate with

something else but it won’t be nice and in general I think the

way that people are addressing this right now is in two ways, so

one is saying, let’s have more complex distributions,

of nice distributions, but this still falls into a lot of

the problem that we don’t know if the biases is good, we don’t

know if the variance is good. Another thing to do is to take

and say okay, well let’s throw away some of the theoretical

guarantees to make it faster. And Lester’s work has been

addressing at least you know, can we diagnose whether

that’s going wrong or not. But it is problematic that

we certainly don’t have any guarantees about

what’s going on there. So we like something

that’s fast and that you know we

know is working. And so at least one perspective

we could take is instead of taking Markov Chain Monte Carlo

and making it faster by throwing

away theoretical guarantees, let’s take variational Bayes and

make it more correct. Yeah?>>[INAUDIBLE]>>Talking about the second moment [INAUDIBLE] first moment?>>Yeah, there really aren’t,

and so that’s a big problem. So in fact,

it can be biased as well, and so I’d say that we can

correct the second moment. In particular based

on using knowledge of the first moment when

it’s reasonable. But what I’ll be talking about

going forward relies on that to a certain extent. And so while we see that in

practice that the point estimates are pretty good,

this won’t solve that issue. And so that’s where that coresets stuff that I’ll talk about comes into play, at getting

actually guarantees on the full posterior

approximation. Yeah?>>At this point, what is

the algorithm for maximizing or minimizing over-

>>So I’m being agnostic to

the actual algorithm for maximizing->>That exactly. So, you’re not

approximately solving that?>>No, actually you’re

approximately solving it. You’re not solving it

exactly, in general.>>[INAUDIBLE] over Q,

even though Q is nice, so still only approximates?>>Yeah, cuz this is not

gonna be a convex problem.>>Okay.>>Yeah, so that’s a whole

another layer here. There’s so many layers,

it’s like an onion.>>But that stuff you don’t

believe is problematic, even though it’s non perfect,

the problem you think you’re getting->>No,

I think that’s problematic too.>>[LAUGH]

>>All of these things are problematic, but

we do what we can. Yeah. I mean, I think all these things

are tough we want to get these approximations. These good approximations

we can trust to posterior, and people are really stuck

between a rock and a hard place. Like a lot of people are saying, hey, I wanna use MCMC but I can’t on my data set, and so I’ll go to variational Bayes. Both of them very much have

an issue about getting stuck in local maxima. So, it’s not just variational Bayes, but if you think about the way MCMC works, you’re exploring the space, and it’s very easy

to get stuck in something that’s locally high probability. So, that’s fundamentally

a hard problem. Okay, so let’s at least

try to deal with this uncertainty issue. So, if we had access to the cumulant-generating function, we’d be set, right? So, what we’re calling the cumulant-generating function is the log of the moment-generating function. Why is it the cumulant-generating function? Because it generates the cumulants. The first cumulant is the mean,

which we get by taking the first derivative of

the cumulant-generating function setting t to 0. The second cumulant is

exactly the covariance. So, we take the second

derivative of the cumulant generating function set t to

0 and we get the covariance.
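In symbols, with expectations taken under the exact posterior:

$$C(t) = \log \mathbb{E}\!\left[e^{t^\top \theta} \mid x\right], \qquad \nabla C(t)\big|_{t=0} = \mathbb{E}[\theta \mid x], \qquad \nabla^2 C(t)\big|_{t=0} = \mathrm{Cov}(\theta \mid x) = \Sigma.$$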

Of course, our whole problem is that we don’t have access to this exact posterior covariance. We need to approximate

it somehow. Something we could do is we

could run mean field variational Bayes, get its cumulant-generating function, which exists, take the second derivative, set t to 0, and now we have V. But V, remember, is our bad, block-diagonal, bad estimate of the covariance. It probably underestimates the covariance in general, and has all these problems, and so we’d like a better estimate for

sigma. Okay, so here’s where linear

response comes into play. So, the idea is, let’s take

the log of our exact posterior, and apply a linear perturbation. So, we’re just gonna perturb our exact posterior a little bit, and this is gonna define a new distribution, let’s call it p sub t of theta, so long as we normalize, which we haven’t done yet. And here’s the crazy thing: if we normalize, the normalizing constant is exactly the cumulant-generating function. So, somehow this linear perturbation contains within it the information of the cumulant-generating function, so maybe it’s not such a crazy thing to do.
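In symbols, writing C(t) for the cumulant-generating function from before:

$$p_t(\theta) = \frac{p(\theta \mid x)\, e^{t^\top \theta}}{\int p(\theta' \mid x)\, e^{t^\top \theta'}\, d\theta'}, \qquad \log \int p(\theta \mid x)\, e^{t^\top \theta}\, d\theta = C(t).$$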

Okay, and so, now, I’m going to ask you to imagine something you would never do in a million years,

because it would be impossible. Let’s imagine that for

every one of these p sub t, and there’s a continuum of them, that

I fit a mean field variational Bayes approximation q sub t to it. Now, that’s an uncountable number of mean field variational Bayes approximations, so you probably don’t want to do it; you don’t want to do more than one. But let’s just pretend for the moment that I can do this. So, I fit this uncountable number of mean field variational Bayes approximations q sub t, and I’m going to say to myself, okay, well, I want to approximate sigma. And that’s the second derivative

of the cumulant-generating function. And I can’t just take this cumulant-generating function and put in q star instead, cuz that would give me V. So, what if instead I take this inner term, this inner derivative, and think about that? Well, that inner derivative is exactly the expectation under p sub t of theta. And if we are willing to accept,

already been in the audience. But if we’re willing to accept

that this gives reasonable point estimates,

that is successful, because it gives reasonable

point estimates. Then maybe we think the point

estimates are under p sub t are pretty close to the point

estimates under Q sub t, and maybe we’re willing to

make this approximation. So, this is the one

approximation we’re gonna make. Yeah?>>So, does that mean that theta is a two-dimensional vector in this example?>>So, in this cartoon,

it’s a two-dimensional vector. In general,

it’s a D dimensional vector.>>Can you say again

what this “linear” is?>>Yeah, so the linear part is, we’re making a linear perturbation in the log posterior space. So, we’re taking the log posterior and we’re applying this

the information that we want, the covariance. And so,

that is why we will do that. Also, it will turn out

that we get a really cute formula in the end.>>[LAUGH]

>>So, that’s the definition of linear response, is [INAUDIBLE]

>>Yeah, is this linear perturbation.>>[INAUDIBLE]

>>So you have to normalize?>>Yeah, yeah, so

you’d have to normalize>>So it is a probability distribution->>It’s a probability distribution as soon

as you normalize.>>[INAUDIBLE]

>>Yeah.>>[INAUDIBLE]

>>Yes, yeah.>>[INAUDIBLE]

probability distribution?>>Exactly.>>Over theta, and

t is also a D-dimensional vector?>>Yeah, t will be the same

dimension as theta. And it’s sort of describing

the size of the perturbation. Okay, and so we’re gonna call

this the linear response variational Bayes approximation

of the covariance. Okay, this is a totally unusable

formula at this point because it involves this uncountable number

of approximations which we will never do in practice, so

let’s get rid of that. And I’ll basically just say,

okay, let’s think again about what

are the nice distributions. Well, nice distributions

are where I can calculate means, covariances, normalizing

constants. Sure, low dimensional

distributions are pretty nice, but exponential families

are really nice. That’s where I can just super easily get all

the things that I want. It’s a one stop shop. So let’s suppose qt is in

the exponential family, and realistically this is what

people do in practice. So if we do that, then we have some mean

parametrization, m sub t. And I’ll just tell you that

after some algebra, we can write that sigma hat, which I had

on the previous slide and I also have right up here,

we can get rid of the t. It’s the second derivative of

the KL divergence with respect to the mean parametrization. And then we can write it

in this super simple form. So V is the bad block-diagonal covariance that we don’t like,

but that we sort of get out of mean field variational

Bayes just by running it. H depends only on the prior and

the likelihood, and I is the identity matrix. And so you can think of this as

taking our bad block-diagonal covariance and correcting it,

applying some correction, yeah.>>Why is this not symmetric?>>So it will be symmetric in

the end, and you can tell that because this is the second

derivative of the KL. It’s not as easy to see

from this formula, but it’s very easy to see

from this formula. So it’s a valid covariance,

it’s both symmetric and positive definite so long as we’re at

a local minimum of the KL. Yeah, these would be concerns.>>What is H?>>So H is the expectation

under Q of the log posterior. Which sounds crazy,

cuz we don’t have the posterior. But then we take the second

derivative with respect to M, so the normalizing

constant goes away. And so it’s something we

can calculate with auto differentiation. Okay, so we made one assumption. Everything hinges on this

assumption, which may or may not be right. But when it is right, for

instance, in this cartoon, or in any multivariate

Gaussian exact posterior, it would be exactly right. I mean, this is the only thing

that’s fully described by its mean and its covariance,

then this would be exact. More generally,

it’s gonna be an approximation.
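As a minimal numerical sketch of that exactness claim (a multivariate normal toy “posterior”; the mean-field fit is coordinate ascent, and the linear response covariance is recovered here by finite-differencing the fitted means under the perturbation t, rather than by the matrix formula):

```python
import numpy as np

# Toy "posterior": N(0, Sigma) with precision Lambda. Mean-field VB gets the mean
# right but reports variances 1/Lambda_ii; linear response recovers all of Sigma.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)
Lambda = np.linalg.inv(Sigma)

def mfvb_mean(t, iters=500):
    """Coordinate-ascent mean-field VB means for p_t prop. to exp(-0.5 th'Lambda th + t'th)."""
    m = np.zeros(3)
    for _ in range(iters):
        for i in range(3):
            # q_i is Gaussian with precision Lambda_ii; standard mean-field update.
            m[i] = (t[i] - Lambda[i] @ m + Lambda[i, i] * m[i]) / Lambda[i, i]
    return m

eps = 1e-5  # linear response: column k of Sigma_hat is d(VB mean)/dt_k
Sigma_hat = np.column_stack([(mfvb_mean(eps * e) - mfvb_mean(-eps * e)) / (2 * eps)
                             for e in np.eye(3)])
print("mean-field variances:", 1 / np.diag(Lambda))  # underestimate np.diag(Sigma)
print("LRVB estimate matches Sigma:", np.allclose(Sigma_hat, Sigma, atol=1e-4))
```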

Okay, so let’s go back to this microcredit experiment. Remember we were looking at this

microcredit experiment from Rachael Meager with these

K microcredit trials. There are a number of

businesses at each site, and now we’re gonna talk about

what exactly was the model. What was the choice of the prior

and likelihood that was made. And so the model here is

to model what is the profit of the nth business

at the kth site. So we’ll say that that

profit is y sub kn. We’re gonna say that it’s

normally distributed, With some base mean mu sub k. So this is the profit that would

happen on average if microcredit didn’t exist. Sort of the base profit for

this country. Then we’re gonna have

the effect of microcredit. So first, we’re gonna have an

indicator variable, whether or not microcredit was applied. This is a randomized controlled

trial, so we know whether or not somebody got microcredit. This is totally known,

this is not an unknown. An unknown that we do

have though is Tau k, what is the effect of

microcredit in this country? How much does it add to

the profit on average in this country? And then finally, there’s some noise that’s

specific to the country. Okay, we don’t know mu k and

Tau k, where k is the index for the country, so

there’s a prior on that. We can imagine that there’s

some global base profit and some global microcredit effect. And a priori, these might have

some interesting covariance so here’s a covariance for

those. And so, in essence, tau is the most interesting thing here. It’s the global effect of microcredit. Is it positive? Is it helping people? Is it negative? What’s going on? We want to know about tau.
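As a minimal generative sketch of just this likelihood, with made-up hyperparameter values rather than anything from the actual study:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 7                                            # one site per country
N = [900, 1200, 5000, 3000, 17000, 2500, 4000]   # businesses per site (illustrative)

# Site-level parameters: (mu_k, tau_k) ~ N((mu, tau), C); all values made up.
mu_tau_global = np.array([5.0, 2.0])             # global base profit, global effect
C = np.array([[4.0, 1.0], [1.0, 4.0]])           # prior covariance of site effects
site = rng.multivariate_normal(mu_tau_global, C, size=K)
sigma = rng.gamma(2.0, 2.0, size=K)              # per-site noise scales (illustrative)

data = []
for k in range(K):
    mu_k, tau_k = site[k]
    T = rng.integers(0, 2, size=N[k])            # randomized treatment: known, not inferred
    y = rng.normal(mu_k + T * tau_k, sigma[k])   # y_kn: profit of business n at site k
    data.append((y, T))
```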

Yeah. So, my understanding, and I would encourage you to read the papers, cuz I believe there’s one paper

per each of these countries, is that it was a randomized

controlled trial for individual businesses

on average. And they were giving

them microcredit. And then, they randomly assigned

each business to receive the microcredit, or not, and then they saw at

the end what happened. Yeah.>>These are published papers,

already, done by other experimenters

that are published [INAUDIBLE] they’re mostly MIT

professors, so [INAUDIBLE].>>Yeah,

if you check out Rachael’s paper, there’s a citation for

every single one of the seven. But yeah, it’s also

publicly available data, which I think is fantastic. Okay, so we’ll put a gamma prior

on sigma, and another prior on mu and tau, since we don’t know them. And I’ll just mention that if you were expecting an inverse Wishart here, a separation strategy with an LKJ prior has some nice properties that an inverse Wishart doesn’t. So, that’s why that

prior was used. Anyway, the point here is,

we have this complex model. We have a number of unknowns,

we don’t know the effect of microcredit, the base

profit in each country. We don’t know the overall world effect of microcredit

in base profit. We don’t know this

covariance matrix, we don’t know this noise. And we wanna know, how much do we know about these unknowns after having seen the data? So, that would be the posterior. Okay, so, first we’re gonna

check are we getting these means estimated well? And to do that, we’re gonna

compare to Markov chain Monte Carlo as ground truth. So, why not just run Markov

chain Monte Carlo all the time? Well, in this case, if we use

a heavily optimized package for Markov chain Monte Carlo, it takes on the order

of 45 minutes. Whereas all of the mean field variational Bayes optimization, not even taking advantage of any distributed computing or anything like that, all of the linear response uncertainty, and a bunch of sensitivity measures that I haven’t told you about yet but we’re gonna get to in the robustness part, all of that just takes about 58 seconds. And so, this can be

a real game changer for quickly iterating and

working on economic problems. Okay, so we do in fact see, so

each of these dots represents the mean of one parameter in the

posterior, and we want them to be this x=y line, and

we see that this is the case. So, these are being

estimated well. Now, we’re interested in

saying are we actually getting the uncertainties

estimated correctly. And as we might have expected

from our discussion of mean field variational Bayes,

again, for each of the parameters, we’re seeing its posterior uncertainty with one point, both for MFVB and for LRVB. We’re seeing that MFVB

is underestimating them, and then LRVB seems to be

estimating them correctly.>>[INAUDIBLE]

>>Because there is literally nothing else we can compare to.>>[LAUGHS]

>>Yeah, yeah.>>[INAUDIBLE]

>>Yes. And we found that by running it

much longer nothing was really changed, but, yeah, it’s

a tough problem in these cases, because you don’t

have any true ground truth. Well, if things

line up perfectly, it’s probably a good

indication that probably both are doing well as

opposed to one being a problem. [INAUDIBLE] variational Bayes and Markov chain Monte Carlo that gets stuck at that-

>>No, that’s totally true, and so a big problem in any type

of Bayesian inference problem is especially this

type of problem, but in much higher dimension

where you have one local optimum, and

then one that’s really far away. And of course, when you’re in

higher dimensions, it’s even much, much, much harder to

go from one to another. And that’s just a hard problem.>>And now on to the left, and mine is to the right

[INAUDIBLE].>>Yes, yes, yeah. And I think that, that is going

to be a fundamentally difficult issue especially, so one dimension I don’t think

this is actually that bad. But of course we’re dealing in much higher

dimensions than that. And there it’s hard to, in particular, because we don’t

know normalizing concepts, that’s the really big issue. Because we don’t know that mass

is missing if we’re in one of these areas, it doesn’t tell us that one

of the other areas exists.>>Does the behavior of the two

approaches tend to be similar? Would they both run

into the same problem?>>What do you mean?>>Yeah, it’s likely that the

minima that they get stuck in, or the extrema,

would be different. Another thing that’s mitigating

here and why we don’t think that this type of thing

would be happening, is that you do tend to get

what’s known as a Bayesian central limit theorem

as you get more data. You tend to get something

that is closer and closer to a Gaussian, and

in this particular case, because there are hundreds

of data points, at each of

the individual countries, this seems like an unlikely

sort of situation. But for more complex models,

for fewer data points, it is totally possible. [INAUDIBLE]>>So there’s certain types of things that they’re

doing that are different. I mean, so the optimization

problem with VB isn’t sort of that it’s necessarily just going through and, like. With MCMC, it’s like you’re

flipping this upside down and a lot of the modern MCMC

algorithms are saying, this is like a tarp and I’m

just gonna move on that tarp. Whereas with VB we’re doing an

optimization sort of, after taking an expectation over

the space and things like that. So, I mean,

it’s a little different. Yeah.>>[INAUDIBLE]>>Well, but sometimes it could be similar,

right? So let’s say I was fitting

a Gaussian to this, and I didn’t you know I might

fit it to this one. And that would be a similar

behavior to MCMC getting

where this could go wrong. Okay, so in this we actually

could be making quite a different decision. Because of this underestimation. So imagine for instance, you

know we’re gathering all these economic experts together and we’re gonna decide is micro

credit gonna be a thing, are we gonna invest the ton

of money in micro credit? And we found that hey,

the posterior mean was $3 U.S. Purchasing power parity. That’s positive. Micro credit might

be helping people. Maybe we should

go ahead with it. And we’ll do it, as long as the

mean is more than two standard deviations from zero. And so, we look at the posterior

standard deviation as reported by LRVB, 1.83 US dollars purchasing power parity. The mean is 1.68 standard

deviation from zero and we say, ooh, not enough we

need to go back, we need to get some more data. We need to think about maybe

are there certain covariates, certain aspects of businesses that make microcredit more effective. Maybe who owns it, did they have

a business before, etc etc. These are the types of questions

that Rachael’s actually asking. This is a simplification

of her work. But if we had

erroneously guessed that our standard deviation was lower by using MFVB. So in this case, if we had used MFVB, we would have underestimated the standard deviation, in fact, to the point that the mean would be more than two standard deviations from zero. And, unsupported by the data, we would have said, yeah, microcredit. This is more than two standard

deviations from zero. Let’s go for it. Let’s invest in this. So we want to avoid that. Okay, something that you might

be concerned about at this point is you might have said that that matrix inverse

looks like a problem. Cuz I know matrix inversion is, maybe not exactly n cubed, but between squared and cubed as an operation. And the thing that we’re

inverting has size number of parameters by number

of parameters. And that could be

particularly problematic for something like clustering. So in clustering I have one

parameter per data point because I have the assignment of

the data point to a cluster and then I have some global

parameters which are like the cluster location. And if I have to invert a matrix

that’s number of data points by number of data points,

this is not scalable. Luckily, that doesn’t come up. Okay, so we can decompose our parameter vector into, let’s call it, some global parameters again. These are maybe

cluster locations. And some local parameters. These are like assignments

of data points to clusters. We can decompose H, we can

decompose V in the same way. If we use the Schur complement

we can see that one of our matrix inversions is on the

order of the number of global parameters and we think at

least that shouldn’t be as bad. But one of them is on the order

of number of local parameters, something like the number

of data points. This is potentially

a big problem. But let’s look at our

sparsity patterns. Well, take V: the whole reason we even got into this was that it’s block diagonal. And so that’s not gonna be a problem; it’s just gonna be block diagonal in n, in the local parameters. And H, if you look at H, because in likelihoods you tend to see factors that connect local parameters to global parameters, but not local parameters to local parameters, you will see that the local parameter part is also block diagonal. And so

finally I minus VH, which is gonna be block diagonal in the local parameters, and so in fact this inversion will be linear. So that’s why that’s not totally crazy.
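A minimal sketch of why that structure helps (an illustration only, assuming a symmetric partition for simplicity): a Schur-complement solve only ever inverts the small global block and the per-data-point local blocks, so the local work is linear in the number of data points.

```python
import numpy as np

def solve_global_local(A, B, D_blocks, b_g, b_l):
    """Solve [[A, B], [B.T, blockdiag(D_blocks)]] x = [b_g; b_l] via the Schur complement.

    A: global x global, B: global x local, D_blocks: small per-data-point blocks.
    The only big inverse is applied blockwise, so it is linear in the number of blocks."""
    Dinv_bl, Dinv_Bt, off = [], [], 0
    for d in D_blocks:
        s = d.shape[0]
        dinv = np.linalg.inv(d)                        # tiny per-block inverse
        Dinv_bl.append(dinv @ b_l[off:off + s])
        Dinv_Bt.append(dinv @ B[:, off:off + s].T)
        off += s
    Dinv_bl, Dinv_Bt = np.concatenate(Dinv_bl), np.vstack(Dinv_Bt)
    S = A - B @ Dinv_Bt                                # global-sized Schur complement
    x_g = np.linalg.solve(S, b_g - B @ Dinv_bl)        # solve for the global parameters
    x_l = Dinv_bl - Dinv_Bt @ x_g                      # back-substitute for the locals
    return x_g, x_l
```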

Okay, so we’ve developed this whole sort of perturbation idea at this point, now let’s apply it to robustness, which if anything seems

even more natural. Yeah.>>[INAUDIBLE] question. I think you said [INAUDIBLE].

>>Yeah. [INAUDIBLE]

>>Right, so I think it’s just easier to describe without getting to the covariates, but there is nothing fundamentally different about having covariates in these models or things like this.>>[INAUDIBLE]

>>Yeah, yeah, it’s just a more complex model. This is the simplest possible

model that you could do. If you look at Rachael’s paper, she has sort of

a simple model and she goes into more complex ones. So you can do that as well. We also don’t want

to say anything. So part of it is we don’t

want to say too much about the economics problem, because we’re not

the economics expert, she is. And so I encourage you

to go to her paper for the full version of that. Yeah.>>Is that [INAUDIBLE]? Is that the precisely the sense

in which you’re letting these experiments sort of?>>Yeah actually, so

if you check out Rachael’s paper, you can use the posterior over

the heteroscedastic errors to get a sense of how much pooling

of information there is across countries, yeah. Okay, so let’s go to

robustness quantification. Okay, so robustness

quantification, you can think about it as we could have

made a different prior choice. In fact, we could have made

a different likelihood choice as well and

the same idea would go through. But just let’s look at

the prior for the moment. And here you could think of this

as one thing as like, okay, well maybe my prior was normal, and

I chose some hyperparameters for that prior, and then I

changed them a little bit. But it doesn’t have

to be like that. It could be some nonparametric

change to the prior. It could be, maybe my prior was

some crazy distribution, and then I wanted to add some

normal noise over here. So this Alpha doesn’t have

to be just your simple, parametric hyperparameter. It could be something

much more interesting. And so let’s just call our

posterior now when we’ve very much focused on the fact that

we’ve made this choice in our modeling with alpha. Let’s call it p sub alpha. Okay, so now we wanna get

a sense of sensitivity. We wanna get a sense of this

idea of I had taken my prior and I’ve perturbed it and

gotten another reasonable prior. Would I get the same results, or

would I get different results? Well if you think about it, when we’re actually using

a Bayesian posterior, we’re not really reporting this

whole distribution, typically. We’re reporting something

like a posterior functional. In particular,

what we’ve been doing so far this time is we’ve been

reporting a posterior mean. Are we gonna say that

microcredit has the same average effect or it have very different

average effects under these two priors, these two

choices of prior? And so a way to measure

sensitivity in this case would be to say hey, if I look at

this posterior mean like what is the mean effect of microcredit, how much does it change

if I change alpha, if I change my choice of prior,

if I perturb that. And so

we’ll call this the sensitivity. And now it turns out, hopefully

this looks somewhat familiar to all the work that we did before. If you replace

the alphas with ts, this is just exactly what we

were doing on that long linear response variational

Bayes slide. And the first thing we said was,

hey, maybe we think that point

estimates are relatively good. So we can replace them with

an approximation under variational Bayes. And then maybe we can

make this assumption for this linear response variational Bayes estimator in order to

of approximations, we can make this assumption that q* sub

alpha is in exponential family. In that case, we get another

super nice formula like before. So this inner formula was

exactly the formula we calculated before. So if we already did all these

uncertainty quantification correction, this work is done. You don’t need to

calculate anything else. b comes from the form of

the prior; if it’s a conjugate exponential family, this will be the identity, or generally, we can calculate this with automatic differentiation. A comes from the form of g. If g is the identity, so we’re calculating the posterior mean, this will be the identity.
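As a minimal sketch of the quantity being approximated, on a toy conjugate model where the derivative of the posterior mean with respect to a prior hyperparameter alpha has a closed form to check against (made-up numbers, not the microcredit model):

```python
import numpy as np

# Conjugate model: theta ~ N(alpha, s0^2), x_i ~ N(theta, s^2); alpha is the prior mean.
def post_mean(alpha, x, s0=10.0, s=2.0):
    prec = 1 / s0**2 + len(x) / s**2
    return (alpha / s0**2 + x.sum() / s**2) / prec

x = np.random.default_rng(4).normal(3.0, 2.0, size=50)
eps = 1e-6  # sensitivity: d(posterior mean)/d(alpha), by central finite differences
sens_fd = (post_mean(eps, x) - post_mean(-eps, x)) / (2 * eps)
sens_exact = (1 / 10.0**2) / (1 / 10.0**2 + len(x) / 2.0**2)  # closed form for this model
print(sens_fd, sens_exact)  # both tiny: this report is robust to the prior mean here
```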

Okay, so again, let’s look at this model. And we can say to ourselves, there are a lot of choices here. We have to choose this gamma a, b. We had to choose theta c and

d for the separation LKJ prior. We had to choose

this lambda inverse. We had to choose,

well we didn’t have to choose C. We had to choose mu naught and

tau naught. And even if you are like some

super microcredit expert like Rachel Meager, there’s

probably still some range of reasonable values that

you think might be okay. And you want to know

that over that range, your decisions wouldn’t change. So we can check if that’s true.>>[INAUDIBLE] something different.>>So there’s a few things. So basically, let’s look for

instance at this normal. So this normal, as we’re saying, here’s our prior. And maybe we believe super strongly that this is it, this is definitely what I know going in. But maybe actually, I talked to

my friend, who is also a super micro credit expert, and they

would have come up with this. It’s so close. Is it gonna give me

different results? Is it gonna give me

the same results? I’d like to know. In particular,

at the end of the day, we’re making some kind of

decision like is micro credit helping people? Is the posterior mean more than two standard

deviations from zero? And if it is, then maybe I’ll

go and invest in this, but if it’s not, I won’t. And so I wanna know that

if I change my prior, would I change that decision? So that’s the kind of

question that we’re asking. Now I’m gonna formulate this

in terms of if we change these hyper-parameters, would

I change that decision. It doesn’t have to be

this parametric decision. It actually could

be non parametric. Yes?>>[INAUDIBLE]>>Right, so this is exactly, so

if you take something like the hyper parameters and just

change them then you can think of this as the derivative with

respect to the hyper parameters. More generally if you wanted to

do something like say here is my prior and then I wanna change

it by maybe adding a normal and normalizing it. Then I could say that Alpha is

sort of the amount that I’m doing the weighting between the

original prior in the normal. And I could take a derivative

with respect to that Alpha.>>[INAUDIBLE]

>>Yeah, basically.>>[INAUDIBLE]

>>As long as you can formulate them in a way that

there is an Alpha that you can take the derivative

with respect to, then you can make that change. But just the one thing

I wanna emphasize, is that it doesn’t have

to be some parametric family form that you change.>>But then is that

approximation still a good one? I mean, if Alpha equals 0, wouldn’t we have to put-

>>So it’s not near Alpha equals 0, it’s Alpha of your original

choice, in this case. So it’s a little bit

different than the T case. So it’s near change in Alpha

equals 0, Delta Alpha.>>[INAUDIBLE]

>>Right so in this case it would be zero. Yeah in that case. As opposed to the case

where other cases it might not be yeah.>>[INAUDIBLE] how drastic

do you actually change by going from this one to

a Gaussian [INAUDIBLE]?>>So I would say that locally, you don’t expect things

to be too different. But yeah, as you turn this

into a normal, eventually, you’re gonna lose the linearity

of the approximation, it’s true. But I still think that if you

want some notion of robustness, this is an easy one to get. It’s an easy one to calculate. It’s one that you can

calculate quickly and sort of with minimal fuss. On your computer there’s a sort

of a nice formula, and so it’s a step in

the right direction. Ultimately we’d like to have global robustness, but I think that’s a hard problem. But maybe someone can solve it. Okay, so let’s just go ahead and

see what this looks like. Again, the assumption we make is

that the means are well tracked and we have that. Let’s consider

a particular perturbation. And what we’ll do is we’ll just

run mean field variational Bayes for one value of

the hyperparameter. Sorry, we’ll run MCMC for one

value of the hyperparameter and another one just to see if

mean field variational Bayes, if LRVB is tracking that. And we see that it really is. And again, with LRVB

we’re not getting it in just one direction, we’re getting it in sort of

all directions simultaneously. Okay, again, just to see

how it may affect us, we could calculate the sensitivity of the microcredit effect, with normalization to be on the scale of standard deviations, cuz that’s sort of the interesting scale. At the end we are making this

decision: is microcredit helping people? We can get sensitivities for

all these parameters, all hyper-parameters. It’s interesting to

see that they’re on very different scales. And now let’s go back to our

original thought experiment. So we said that the mean was less than 2 standard

deviations away from 0. So we weren’t going to make

micro credit our new policy. And then let’s say

that we perturbed one of these parameters by 0.04. We thought that was

totally reasonable. We talked to our economist

friend, and they said that that would’ve been a totally fine

hyper parameter to start with. And we found that now the mean

is more than two standard deviations from zero. Now in this direction, that

doesn’t really mean anything. We already weren’t going to fund

this, and so we still won’t. But you can imagine this

going in the other direction, that maybe we started our

analysis, we found the mean was more than two standard

deviations from zero. We high-fived everybody and

we said, okay, we’re gonna fund microcredit,

it’s super awesome and then we did this robustness

analysis and we found that for perfectly reasonable values

of the hyper parameter, the mean is not more than two

standard deviations from zero. And so we’d be more

conservative, we’d go back, we’d get more data, we’d think

about our covariates affecting this, things like that. Okay, so I will mention,

I phrase this in terms of mean field variational Bayes, but

there are a lot of other types of variational

Bayes besides this. And this isn’t specific to

mean field variational Bayes. It can be applied to other

types of variational Bayes. And since I am

running out of time, I’ll encourage you

to talk to Jonathan, who is interning here anyway,

about his coresets work. But basically a number of

people brought up some issues with what we’ve

been talking about. So what about bias? Variational Bayes

could be biased. What if I care about

a long tail or some other interesting things

that could be going on? I want something that’s fast but is also going to give me a good

posterior approximation. And so this coresets

work is getting at that by using a data

summarization approach. Let me just skip

ahead to the end. Okay, so this linear response idea lets us get these covariance estimates as well as this robustness quantification. And while this coresets work

I think is really great, in terms of getting us fast

approximate Bayesian inference that also gives us theoretical

guarantees on quality. I think there’s

still a place for these two ideas

to come together. I mean, this perturbation

idea for robustness, robustness quantification is

something that people want and should have in their analyses. In a lot of these analyses, you

want to know how robust these sometimes arbitrary, or at least

semi-arbitrary choices that you made, how robust your

outputs are to these choices. And this is true for

any type of modeling, but for Bayesian inference,

this seems to be a way to do it. And so combining this with

a way to actually get reliable inference with

theoretical guarantees I think is going to be very powerful. And so if you’re interested in

any of these ideas I’d love to chat more after this. I’ll just point out

our papers on this. So there’s a NIPS 2015 paper on

the linear response methods for covariance estimation; our

robustness paper is in preparation. Hopefully it will be out soon. The NIPS 2016 paper

is on the coresets work if you want to read

up on that as well. Thank you. [APPLAUSE]

>>Can you take one question now?>>Yeah.>>On slide nine you showed the

[INAUDIBLE] comparing the MCMC results to your results, and there was one point near the

bottom that on a relative scale was pretty different.>>Yeah.>>Do you know what happened?>>Yeah, so I mean, I suspect

that all of this relies on the bias being zero and so probably the bias was

not zero for that point. So if the bias is zero

then everything is exact, there’s no approximation. So that’s pretty much

gotta be the culprit.>>I guess it would

be the MCMC also.>>That’s also true. I actually pretty much,

so we use Stan. I think Stan is pretty

trustworthy on this problem. I lean towards it being LRVB; it’s possible it was the MCMC.>>[INAUDIBLE]>>Yeah, yeah.>>All right, let’s [INAUDIBLE].>>[APPLAUSE]