Fast Quantification of Uncertainty and Robustness with Variational Bayes

Okay everyone, welcome to the
Microsoft Research Colloquium. Today we have Tamara Broderick,
a professor of computer science and statistics at MIT, and
also a longtime friend of mine. We actually have a history of
following each other around the country,
>>[LAUGH]>>We were undergraduates together at Princeton and then Tamara followed me to
Berkeley for grad school. And then I followed her back to
Cambridge to work full time. And I’m delighted that she’s
here today to tell us about her work on Variational Bayes. So let’s give her our attention.>>Thank you. You know actually on that note,
when I was in high school I went to this science and
engineering fair. And my friend Lizz Wodo told
me I had to look out for this super cool guy Lester Mackey
who was going to be there.>>[LAUGH]
>>And unfortunately I didn’t meet you
there but in undergrad I did. So today I’m gonna talk
about how we can quickly and accurately quantify
uncertainty and robustness with
variational bayes. So before I get into the details
of that statement let me just start by talking about our
motivating data example in this work. We’ve been looking at
Microcredit, and so more to the point our
collaborator, Rachael Meager, who’s a fantastic economist has
been looking at Microcredit and we’ve been working with her. So if I say anything wrong from
the economic side that’s totally me, it’s not Rachael. And she’s got data on
Microcredit, these small loans to individuals in
impoverished areas and the hope is to help bring
them out of poverty. She’s got data from seven
different countries, Mexico, Mongolia, all the way
through Ethiopia. And there are a number
of businesses at each of these sites. So there’s one site in
each of the countries. There’s between 900 and
17,000 businesses. And so there are a lot of
questions that are interesting about this data that basically
revolve around how effective is Microcredit, is it actually helping raise
people out of poverty. But one specific question that
you might be interested in is how much does Microcredit
increase business profit? And a priori this could
be any real number, it could actually
decrease business profit. We don’t know, we wanna find
out if it’s actually helping. Okay, so you know as
part of this analysis, figuring out how much does microcredit increase business profit, there are a number of desiderata that we have here. So one is we wanna share
information across sites. I mean one thing that I could
do is just lump all my data together. I have this data from Mexico,
this data from Ethiopia, it’s all in one big dataset. And I just analyze it, but
that seems sorta problematic. We think that at least on
the face of it Mexico might be a little bit different
from Ethiopia. On the other hand I don’t want to just analyze all
the data from Mexico. And then separately and totally independently analyze
all the data from Ethiopia. And then separately and totally independently analyze
all the data in another country. Because, well one I think
that there’s probably some information about micro credit
that can be shared across these countries. And two, if I get a new site in a new country, I wanna be able
to say something about it. And so I wanna be able to share
power in a way that doesn’t treat these as totally
separately or totally the same. And this is something
that’s very true, not just in Microcredit. And I think in general these
desiderata won’t be just for Microcredit that we want to
share this kind of information. We wanna quantify uncertainty,
so let’s say that at the end of the
day we find out that Microcredit increases business profit
by 2 US dollars per week. Well is it 2 US dollar plus or
minus 1 US cent? In which case, well Microcredit
is definitely helping. Or is it 2 US dollars plus or
minus 10 US dollars or plus or minus 100 US dollars? Maybe it’s actually taking
people out of business and ruining their lives some
percentage of the time. So these would be hugely
different scenarios where we might have very different
approaches after the fact to making policy or making
decisions, or things like that. So it’s very important to be
able to quantify our uncertainty in this application. And also to quantify our
uncertainty coherently. So the amount that Microcredit
increases business profit isn’t our only
unknown in this case. We also don’t know a priority, what’s sort of the base profit
in all of these different countries before we
see Microcredit. We don’t know how these
different countries interact, what’s their correlation. And so there are a lot of
unknowns in this problem. And we wanna make sure that when
we have uncertainties over them that they somehow don’t
contradict each other. Okay, so these are two big
reasons that people turn to something like
Bayesian inference. It allows us not only to share
information across experiments. But more generally, if I have
some complex physical phenomena, it lets us have these sort
of modular graphical model frameworks to build up
these complex models. And another big reason to use
Bayesian inference is that, by using the language of probability, it has an inherently coherent
treatment of uncertainty. Okay, so what’s the problem, what do we still need to do when we wanna get fast results? So in this talk, certainly tens of thousands of data points, which we’ve seen here, and hopefully up to tens of millions of data points. So these are the sorts of scales that we wanna be at. And we want very quick, fast
results so that Rachael Meager and other economists can
just turn around and work on their
economics modeling and their work there, and not have to worry about how long it takes to get
the machine learning done. We wanna quantify robustness. So something that’s really important in any statistical problem, machine learning problem, optimization problem is
you’re choosing a model. You’re choosing a way to
describe the universe. And as George Box so famously
said, all models are wrong, but some are useful. And so if I take my model and
I change it a little bit, am I gonna be making
the same decisions? And especially when those decisions could
affect livelihoods, they could be very important
sort of governmental decisions. We wanna feel that
they’re robust somehow. So, we wanna quantify robustness, and
we want guarantees on quality. We wanna make sure we’re
actually getting the right answers at the end. This is where hopefully
this talk will come in, how do we get these
fast results? How do we get this
quantification of robustness, as well as theoretical
guarantees on quality? And as I’ve already
mentioned a few times, none of these desiderata
are specific to microcredit.
different modeling scenarios and problems, but this was our motivating example
in this particular case. So, let’s review Bayesian
inference a little bit. So, recall that in Bayesian inference, we start with some distribution over our unknowns, theta. So, theta could be a vector. But as humans, we can only
see in two dimensions. So I have had to plot this
one-dimensional theta distribution, but I have some
distribution that recognizes or that represents sort
of my knowledge or lack of knowledge of
theta to begin with. So one element of theta, in this
case would be something like how much does microcredit increase
the profit of businesses? And you could imagine there
would be other ones like other parts to this vector like
how much does microcredit increase the profit of business
in Mexico, in Ethiopia? What’s the base profit
of business in Mexico, in Ethiopia and so on. So we get a lot of
parameters in the end, then we have our likelihood
that describes our data. Let’s call it x,
given our parameter theta. This is another
modelling choice. And then finally, Bayes theorem
tells us what do we know about these unknown parameters, given the data that we’ve seen,
now that we’ve seen the data x. And so
Bayes theorem is just a theorem, it’s basically written
out right here. And it’ll give us some posterior
and the thing that you typically expect to see is
that in the prior, things are pretty diffuse. We don’t know as much. And after we’ve seen data, the distribution starts to peak
around a particular value and there is less width
to the distribution. We’re much more certain that
this is the value now that we’ve seen the data. Well, there are actually
a number of challenges to using what at first might seem like a
very simple theory in practice. So you just look at this and
you say, I specify my prior. I specify my likelihood and
I get out my posterior. I’m done. Data science is solved. We’re all good. But in reality, I have to
have come up with a prior. I have to come up with
a likelihood for that matter. These are both choices. In general, modelling
is a difficult problem. And so expressing my prior
beliefs in the distribution could be something that I
find quite challenging for a number of reasons. One, it might be time consuming. Maybe actually we’re all
perfectly capable of expressing our beliefs about the effect of
microcredit before we’ve seen any data in a distribution,
but I only have an hour. I’ve got to leave after this and pick up something from
the grocery store. I’ve got to go do some research. And so I only have so much
time to express my prior and I’m just gonna have to
leave it at that, and maybe I haven’t fully
expressed that at that point. It also might be subjective. Maybe I have some idea
about microcredit and you have some idea
about microcredit, and we both think that we’re
reasonable people, and that if we both express those ideas
about microcredit that we’d like to know we arrive at the same
decisions at the end of the day. So we’d be pleased to see that
if we start with some range of reasonable priors for
whatever of these reasons that, in fact, when we apply
Bayes theorem, we get sort of the same distribution out
in the end for the posterior. The same results,
the same decisions. We would be less pleased if, in fact, you applied Bayes
theorem and you came up with a very different decision
and set of ideas than I did. This would seem to
be problematic. We’d wanna go back and either collect more data or rethink our model, or somehow revisit what we are doing. And of course, in modern complex
models where actually there are extremely high-dimensional distributions, there are these interesting connections
between all of the unknowns. This just becomes more and
more problematic. We can’t necessarily reason so easily about all of these
unknowns in our data. So for all of these reasons, we
like robustness quantification. We like to say, hey, if I start
with a set of assumptions and I change those
assumptions a little bit, how much is that going to
change what I’m reporting, the decisions that I’m making
at the end of the day? So now, let’s say, we’re totally
happy with our prior and our likelihood. Yes.>>[INAUDIBLE].>>Yes, so
this is a great question. So, am I talking about
relatively small changes here? You could be interested
in two types of things, global robustness, so over some extremely sort of
large set of distributions. Or you could be interested
in local robustness, which is sort of more
perturbative argument. I think, there’s also an issue
of global robustness, it’s probably a super
hard problem. Whereas local robustness
probably has some linearity to it that we can exploit. And so, basically we’re going
to come at this from a practical perspective and say, let’s use
some implied linearity and local robustness to
get a sense of that. But ultimately as
a far end goal, we’d like global robustness.>>I guess, quick question
on [INAUDIBLE] context is, sometimes when you are trying
to express a prior, you are trying to express it
over some space, but there is other equivalent spaces, where
you may have a stronger and more consistent prior-
that tends to be [INAUDIBLE].>>So, I’ll come back
to this later, but we’re not restricting ourselves
to be in a particular parametric family, or
anything like that, yeah. Okay, so first of all we
have this challenge, but a second challenge is,
let’s say, for whatever reason we do feel
perfectly happy with our prior, perfectly happy with
our likelihood. We actually still have to
calculate this posterior. And that in general
is going to be hard. For any non-trivial model that’s
out there, we actually can’t just solve this equation and
get this posterior, get this distribution after
having seen our data. We have to approximate
it somehow. So, for instance, one of the ways that you
might approximate it, certainly one of
the most influential algorithms of the 20th century,
is Markov Chain Monte Carlo. This gives us
a bunch of samples, it’s an empirical approximation
to the posterior. But problematically,
it can be very slow. It might be right eventually,
but you might not have enough
time for eventually. And so, approximating the
posterior can be computationally expensive, and so increasingly
we have seen people turn to what is known as variational Bayes, which I’ll explain in a moment, but basically you can think of it as something that people turn to because it’s much faster. Now, a problem with variational Bayes, though, is that it can underestimate uncertainties, and you can underestimate them dramatically. And so if you’re
making decisions based on those uncertainties like we
talked about before, in this microcredit example, that could be a big problem. You might be overly confident in your answer when you shouldn’t be, when that’s not supported by the data. So, we really want
uncertainty quantification. And so, for both of these, for
robustness quantification and uncertainty quantification,
that’s where our methodology linear response variational
Bayes comes into play. Yeah?>>Can you explain
why the [INAUDIBLE]?>>Yeah, I’ll explain this
in a picture shortly. So, the question was, is there a specific reason why
the uncertainty tends to be underestimated. It isn’t actually always
underestimated, but we’ll see that
pictorially in a moment. Okay, and let me just mention
very briefly there is some work by Opper and Winther in 2003 that touches on a very similar idea for uncertainty quantification from a more theoretical perspective. So, that’s also a very
nice paper to check out. Okay, so for
the remainder of the talk, I’m gonna start by talking about
what variational Bayes is, as an alternative to MCMC. I’m gonna go over this
challenge of what’s going on? Why is it underestimating
uncertainties? How can we correct that? How can we get
accurate uncertainties from variational Bayes? And then, it’ll turn out that
basically everything we’ve been doing here is about a perturbation-style argument. And if you think about it, robustness as we’ve described
it so far, certainly this local idea of robustness is
all about perturbations. Perturb my assumptions, how much
does that change my output. And so, we’ll see this
basically gives us robustness quantification. So, again, the big idea here is
derivatives of perturbations are somehow relatively easy
in variational Bayes, and then if there’s time,
I’ll talk about how there’s a remaining challenge with variational Bayes. So, here we’ve talked about uncertainties, but there could be bias, there could be issues in estimating tail probabilities, things like that. And I’ll talk about some recent work that Jonathan’s been doing on coresets for reliably estimating the posterior in all these different ways. Okay, so what’s going on
with variational Bayes. Well, let me first tell you
what variational Bayes is. Now, remember, we’re trying
to estimate this posterior. Here I’ve made a slightly more
complex posterior cartoon, but of course, I can’t draw the
really high dimensional crazy posteriors that we’re dealing
with in actual applications. That’s the kinda thing that we
really want to approximate. And so, in general, the goal here is that we can’t
calculate the actual posterior. Our knowledge of the unknowns
given the data, so we need to find some approximation let’s
call it q* of theta. Now, that’s not specific to variational Bayes; this is just sort of the general posterior approximation goal. But here is a way that we
could go about doing this.>>Should I think about this
pragmatically, in other words, you want to produce a machine
which outputs approximate samples from q* or do you want
to approximate [INAUDIBLE]?>>So before I go to this green
blob, it could be either. So samples would, for instance,
form an empirical approximation, but you could think
of it as a q star. You could think of it as
a measure that puts equal mass on each of your samples. And so in that sense, q star is some approximation
of p of theta given x. You could think of this as
some continuous distribution, that’s familiar, and that
would be an approximation for p of theta given x. At this point,
we just want some approximation, we haven’t said what particular
form that would take. But for variational bayes,
we make more assumptions than that. So for variational Bayes, we’ll
go a step farther and we’ll say, hey, out in the world, there’s
a set of nice distributions, this green blob represents
the nice distributions. Well, what are nice
distributions? Well, what are we trying to
do at the end of the day? Let’s say we’re in this
microcredit problem, we wanna say something like, how much on average is
microcredit helping people? How uncertain are we
about that value? And so that’s something
like calculating a mean of a distribution or
a variance of a distribution. So I wanna be able to
calculate means and variances from my distribution. That seems like
a nice distribution. Maybe I wanna be able to
calculate normalizing constants, that’s also pretty useful. Maybe these are the sorts
of things I want in life, and so we can imagine that these
distributions constitute nice distributions. Maybe you only like to
work with Gaussians, so maybe that is your
definition of nice. But somehow these
are the nice distributions. And the sad fact of life and the reason that we’re even
talking is that unfortunately, the exact posterior is
not a nice distribution. Yeah?>>[INAUDIBLE] paper, which is normally in
these regions, but at the same time your model will have something like consumption, those kinds of things, where
even though they might be normal they can come out negative cuz-
>>So this is different, so
what you’re describing is just actually the description of the
model, so you can put together, and it’s not even this. It’s really the prior and
the likelihood. You can do whatever you want for
the prior and likelihood. You can basically say,
here’s a model, here’s the description of how
the data could be generated. And that’s based on,
for instance, in the case of Rachael’s paper,
I’m saying hey, I have these different countries
and I want them to share power. And I want them to have this
particular distribution of the amount that people
are earning in each country. And so those are those kinds of choices. The tricky part is once I’ve
made all those choices, I wanna say what do I know
about these unknowns? Things like,
how much microcredit is helping now that I’ve actually
seen the real data? And that’s what p of
theta given x is, and it doesn’t have some nice,
simple form. It won’t be a Gaussian, it
won’t be a known distribution. And so for that reason, because
we can’t calculate this, we can’t just put it in a form, we
have to approximate it somehow. So this is like the layer after
you’ve made those choices. Okay, so we have to
approximate it somehow. We might want it to be
familiar distributions, we might want it
to be something. Somehow we want to approximate
it with a nice distribution and as I just said,
in any problem of interest, certainly in this problem
with the microeconomics data, our true, exact posterior is
just not a nice distribution. We can’t calculate means and
variances easily. We can’t calculate its normalizing constants easily, so we have to do something. And so a natural thing is to
say, well, let’s look at all the nice distributions and
find the closest. Let’s use the closest of
the nice distributions for this approximation. Okay, well now we really have to
define what we mean by nice and what we mean by close to make
this an optimization problem. If we define closeness to
be the Kullback-Leibler divergence in a particular
direction between q and p. So this is q*, I’m sorry, this
is between q, and then q* will be the minimum of this, the
minimizer, and p, our posterior. Then we have what’s known
as variational bayes. Why Kullback-Leibler divergence? Is that what you’re gonna ask,
cuz that would be great.>>[INAUDIBLE] was gonna
ask about [INAUDIBLE].>>Yeah, yeah, great.
Why Kullback-Leibler divergence, why this direction? It’s one particular direction,
this is not symmetric. And I think if we’re honest with
ourselves, it’s not because Kullback-Leibler has beautiful
theoretical properties, and things like that. It’s because it seems
to work in practice. So it enjoys a lot of
practical success. We can get reasonable point
estimates and predictions and some problems that seems
to have been working. We’ve shown in some work that we
can make it work on streaming data, we can make it work on
distributed architecture, so we can speed it up to
very large data sets. We can run on millions of
Wikipedia articles in something under the order of an hour. Yes?>>Someone who believes the distribution is
brought from there>>Updating the samples from that will choose
that point as well?>>I’m not sure I
understand the question.>>So there’s various theorems
about how if someone has a>>Right, yeah, so that actually just happens to be related to
the Kullback-Leibler divergence, but it’s very
different from this. So that one’s all about sorta
yeah if you get more and more data what’s gonna
happen to your posterior. But here we’re just
choosing to minimize the Kullback-Leibler divergence
to our exact posterior just to get an approximation. Not because of some
principal reason, just because it’s convenient. So I have [INAUDIBLE].>>Yeah.>>About [INAUDIBLE] but is there any reason why
that to use [INAUDIBLE] measure under
[INAUDIBLE] things that.>>[INAUDIBLE]
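As a concrete aside (a sketch of my own, not from the talk): when both q and p are Gaussian, the KL divergence being minimized has a closed form, which is part of why this objective leads to simple algorithms.

```python
# Sketch (not from the talk): closed-form KL( N(mu0,S0) || N(mu1,S1) ),
# the direction that variational Bayes minimizes, for multivariate normals.
import numpy as np

def kl_gaussian(mu0, S0, mu1, S1):
    """KL divergence between N(mu0, S0) and N(mu1, S1)."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu = np.zeros(2)
p_cov = np.array([[1.0, 0.9], [0.9, 1.0]])   # a correlated "exact posterior"
q_cov = np.diag([0.19, 0.19])                # a factorized candidate q
obj = kl_gaussian(mu, q_cov, mu, p_cov)      # the KL(q || p) objective value
```

Evaluating `obj` over a family of diagonal `q_cov` choices is exactly the kind of optimization mean-field variational Bayes performs over its nice distributions.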
>>Yeah. Great. Okay, so there’s a few things here. So one is could this be
problematic in high dimensions. Actually it could be
problematic in many dimensions. It could be problematic in two
dimensions as we’ll see on the next slide, part of the
problem is that we’re optimizing over a set of nice distributions
we haven’t defined yet. So we’ll define that shortly and
see how things go wrong. Another issue is why
not use Wasserstein? Why not use something else here? There’s also total variation. There are all kinds of distances. And I would challenge you to
come up with a simple algorithm for those distances. And I think that’s why people
have used KL successfully in the past is that it leads
to simple algorithms, yeah?>>Now if you have data
where the distribution doesn’t have necessarily
full support, like there could be areas where you know
where the probability is zero.>>Yeah, so then q doesn’t have...>>Not necessarily,
cuz I could choose my nice distributions to also have
no support over those areas, I could choose them to have zero
probability in those areas.>>Yeah you would need to
know what those areas are and avoid them. If you don’t, then you
definitely run into problems. You will quickly
run into problems. Okay, so what goes wrong in two dimension, and what goes
wrong with uncertainty, are two questions that are on the table
and are about to be answered. Okay, let’s look a little bit
closer at a particular example. This is a super simple example. Let’s imagine that we have two
variables, like maybe this is the amount that micro credit
increases business profit and this is the amount that we would
expect business profits to be, in general,
if micro credit didn’t exist. So these are two unknowns that
we’re trying to figure out from our problem and
maybe our exact posterior, even though we don’t know it, is a bivariate normal. Now in real life, if our exact posterior were a bivariate normal, we would have found it, we would state it exactly, and we wouldn’t have to approximate it in any way. But this will give us a sense of what goes on with variational Bayes, okay. So with variational Bayes, remember we’re trying to
minimize the KL divergence between q and p. So let me write
out the particular thing that we’re
trying to minimize, in the particular direction that
we’re trying to minimize it. Now what’s going on here? Well as was sort of just
suggested in the previous question let’s look at not
areas where there’s no mass but areas where there’s very
little mass, like right here. There’s very little
exact posterior mass, p of theta given x. And so if, in an area where there’s very little mass, I assign q to have a bunch of mass, this is gonna be like dividing by something near zero; this is gonna be huge, and we’re not gonna be minimizing
KL with that choice of q. So we’re going to need to choose a different q. And in general, this is telling
us on a very high cartoony kind of level, our approximation
q is kind of going to be fitting inside of
our exact posterior. Okay to see what
actually happens, we have to say what
a nice distribution is. We said what close was. We haven’t said what nice is. Well, when you think about,
well, what makes things nice? What makes it easy to calculate
means, to calculate variances, to calculate normalizing
constants for distributions? Well, if they were low
dimensional distributions, that would be a case where
I could probably do that. And so the idea of mean
field variational bayes is to assume that our
approximating distribution factorizes into low
dimensional distributions. Just to be clear on the notation here, I’m sort of overloading this q notation; these don’t all have to be the same. All I’m saying is that in every sort of variable direction, I just have this totally separate distribution, and so that’s the
approximation that I’m doing for mean-field variational Bayes. So, what if I have axis-aligned distributions, and I’m trying to fit them
inside this posterior? Well here is my mean-field
variational Bayes approximation, q*. Now first of all, you can immediately see what’s
going wrong with the variance. The width is the variance. It’s how uncertain we are about
say, the microcredit effect, how much microcredit is helping,
we should be pretty uncertain. We should not be very
confident in our results. And we’re way overconfident. We’re saying, yeah,
we definitely know that this is exactly how much
microcredit is helping. Let’s go make some really
important economic decisions based on this, go team. But it’s not at all
supported by the data. This is just from our
approximation method. Not only that, not only are we underestimating
the variance in this case. In fact we can do this
arbitrarily badly, by making this a thinner and thinner bivariate normal. And we’re getting no covariance estimates, so if we care about, you know, how these variables interact, we’re not getting any
information about that. Okay, so next, I’m gonna talk
about how we can deal with this. Let me just mention,
this is an old problem. People have known about this for a long time with
this method: that it has this tendency to
underestimate the variance and not give the covariance. But it’s also a very
modern problem. So to this day, people
are saying, I want to do this large scale Bayesian inference problem. I’m really interested in
running on a ton of data. And I tried MCMC and
it ran way too slowly, and so I have to run something. And so I’m gonna try
variational Bayes and so, yeah?>>[INAUDIBLE]
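To put numbers on that overconfidence (my own illustration, not from the talk): for a bivariate normal posterior with unit marginal variances and correlation rho, the mean-field optimum under KL(q‖p) gives each coordinate variance 1/Lambda_ii = 1 - rho^2 rather than the true marginal variance of 1, and it reports zero covariance by construction.

```python
# Sketch: mean-field variances for a bivariate normal target, showing the
# underestimation becoming arbitrarily bad as the correlation rho approaches 1.
import numpy as np

for rho in [0.5, 0.9, 0.99]:
    p_cov = np.array([[1.0, rho], [rho, 1.0]])   # true posterior covariance
    Lambda = np.linalg.inv(p_cov)                # posterior precision matrix
    mf_var = 1.0 / np.diag(Lambda)               # mean-field optimal variances
    print(rho, mf_var[0])                        # equals 1 - rho**2, not 1
```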
>>Yeah, so if you did the opposite direction of
the KL divergence then you’d get something that, in this case,
overestimates the covariance. Unfortunately, that’s not always going to happen in the other direction; the other direction can go both ways pretty easily, it can overestimate and underestimate. It’s just that for this particular shape, it’s definitely going to overestimate, so you can easily come up
with examples where it goes the other way. Also, the other direction of the
KL divergence turns out to have a lot of issues,
practically speaking, and in particular you can see that
because we’re integrating over q, which is nice because q is
like the nice distribution, so integration with it is relatively easy, whereas integrating over p
is pretty problematic, yeah?>>Is the [INAUDIBLE]
propagation [INAUDIBLE]?>>I don’t know the answer to
that off the top of my head. I mean, I suppose you could use
Variational Bayes as part of a belief propagation. [INAUDIBLE] Inner
tubes are different.>>Yeah but yeah, it’s like the
flavor of combining them, yeah.>>[INAUDIBLE]
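To make the mean-field machinery under discussion concrete, here is a sketch (my own construction, not code from the talk) of the coordinate updates for the Gaussian example: each factor q_i is normal with variance 1/Lambda_ii, and cycling the mean updates below is coordinate ascent on the variational objective.

```python
# Sketch: mean-field coordinate updates for a bivariate Gaussian target
# N(mu, Sigma). The variational means converge to the true mean, while the
# variances 1/Lambda_ii understate the true marginals whenever Sigma has
# off-diagonal correlation.
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
Lambda = np.linalg.inv(Sigma)                # precision matrix of the target

m = np.zeros(2)                              # variational means, arbitrary init
for _ in range(200):                         # cycle the coordinate updates
    for i in range(2):
        j = 1 - i
        m[i] = mu[i] - Lambda[i, j] * (m[j] - mu[j]) / Lambda[i, i]

mf_var = 1.0 / np.diag(Lambda)               # mean-field variances
```

The update for m_i comes from averaging the log target over the other factor; for Gaussians it is available in closed form, which is the kind of tractability the nice distributions buy.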
>>Yeah, so in general certainly
something that you can do is you can avoid the mean field
approximation but then when you get to higher dimensions
you have this issue of, well, you could approximate with
something else but it won’t be nice and in general I think the
way that people are addressing this right now is two routes. So
one is saying, let’s have more complex sets
of nice distributions, but this still falls into a lot of
the problem that we don’t know if the bias is good, we don’t
know if the variance is good. Another thing to do is to take
and say okay, well let’s throw away some of the theoretical
guarantees to make it faster. And Luster’s work has been
addressing at least you know, can we diagnose whether
that’s going wrong or not. But it is problematic that
we certainly don’t have any guarantees about
what’s going on there. So we like something
that’s fast and that you know we
know is working. And so at least one perspective
we could take is instead of taking Markov Chain Monte Carlo
and making it faster by throwing
away theoretical guarantees, let’s take variational Bayes and
make it more correct. Yeah?>>[INAUDIBLE]>>Talking about the second moment [INAUDIBLE] first moment?>>Yeah, there really aren’t,
and so that’s a big problem. So in fact,
it can be biased as well, and so I’d say that we can
correct the second moment. In particular based
on using knowledge of the first moment when
it’s reasonable. But what I’ll be talking about
going forward relies on that to a certain extent. And so while we see that in
practice, the point estimates are pretty good,
this won’t solve that issue. And so that’s where that coreset stuff that I’ll talk about comes into play at getting
actually guarantees on the full posterior
approximation. Yeah?>>At this point, what is
the algorithm for maximizing or minimizing over-
>>So I’m being agnostic to
the actual algorithm for maximizing->>That exactly. So, you’re not
approximately solving that?>>No, actually you’re
approximately solving it. You’re not solving it
exactly and generally.>>[INAUDIBLE] over Q,
even though Q is nice, so still only approximates?>>Yeah, cuz this is not
gonna be a convex problem.>>Okay.>>Yeah, so that’s a whole
another layer here. There’s so many layers,
it’s like an onion.>>But that stuff you don’t
believe is problematic, even though it’s non perfect,
the problem you think you’re getting->>No,
I think that’s problematic too.>>[LAUGH]
>>All of these things are problematic, but
we do what we can. Yeah. I mean, I think all these things
are tough; we want to get these good approximations we can trust to the posterior, and people are really stuck between a rock and a hard place. Like a lot of people are saying, hey, I wanna use MCMC but I can’t on my data set, and so I’ll go to variational Bayes. Both of them very much have
an issue about getting stuck in local maxima. So, it’s not just
variational Bayes, but if you think about the way MCMC works, you’re exploring the space, and it’s very easy
to get stuck in something that’s locally high probability. So, that’s fundamentally
a hard problem. Okay, so let’s at least
try to deal with this on the certainty issue. So, if we had access to
a cumulant-generating function, we’d be set, right? So, we’re calling it
the cumulant-generating function is the log of
the moment-generating function. Why is it the
cumulant-generating function?, because it generates
the cumulants. The first cumulant is the mean,
which we get by taking the first derivative of
the cumulant-generating function setting t to 0. The second cumulant is
exactly the covariance. So, we take the second
derivative of the cumulant generating function set t to
0 and we get the covariance. Of course, our whole problem
is that we don’t have access to this exact
posterior covariant. We need to approximate
it somehow. Something we could do is we
could run mean field variational Bayes, get its cumulant-generating function, which exists, take the second derivative, set t to 0, and now we have V. But remember, V is our bad, block-diagonal estimate of the covariance. It underestimates the covariance in general, and has all these problems, and so we’d like a better estimate for
sigma. Okay, so here’s where linear
response comes into play. So, the idea is, let’s take
the log of our exact posterior, and apply a linear perturbation. So, we’re just gonna perturb our exact posterior a little bit, and this is gonna define a new distribution, let’s call it p sub t of theta, so long as we normalize, which we haven’t done yet. And here’s the crazy thing: if we normalize, the normalizing constant is exactly the cumulant-generating function. So, somehow this linear perturbation contains within it the information of the cumulant-generating function, so maybe it’s not such
a crazy thing to do. Okay, and so, now, I’m going to ask you to imagine
something you would never do in a million years,
because it would be impossible. Let’s imagine that for
every one of these p sub t, and there’s a continuum of them, I fit a mean field variational Bayes approximation q sub t to it. Now, that’s an uncountable number of mean field variational Bayes approximations, so you probably don’t want to do it; you don’t want to do more than one. But let’s just pretend for the moment that I can do this. So, I fit the uncountable collection of mean field variational Bayes approximations q sub t, and I’m going to say to myself, okay, well, I want to
approximate sigma. And that’s the second derivative
of the cumulant generating function. And I can’t just take this
cumulant generating function and put in Q star instead,
cuz that would give me V. So, what if instead I take this inner term, this inner derivative, and think about that? Well, that inner derivative is exactly the expectation under p sub t of theta. Now, you should be a little bit critical of this, as the audience already has been. But if we’re willing to accept that variational Bayes gives reasonable point estimates, then maybe we think the point estimates under p sub t are pretty close to the point estimates under q sub t, and maybe we’re willing to make this approximation. So, this is the one
approximation we’re gonna make. Yeah?>>So, does that mean that theta is a two-dimensional vector in this example?>>So, in this cartoon,
it’s a two-dimensional vector. In general,
it’s a D dimensional vector.>>Can you say again
what this linear part is?>>Yeah, so the linear part is, we’re making a linear perturbation in log posterior space. So, we’re taking the log posterior and we’re applying this
linear perturbation. Why would we do that? Well, it seems like it contains
the information that we want, the covariance. And so,
that is why we will do that. Also, it will turn out
that we get a really cute formula in the end.>>[LAUGH]
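As an editorial aside, the claim can be checked numerically. The sketch below uses a bivariate Gaussian as a stand-in posterior (illustrative numbers), for which the cumulant-generating function, the log normalizing constant of the tilted density exp(t·theta)p(theta), has a closed form; finite differences then confirm that its derivatives at t = 0 generate the mean and covariance.

```python
import numpy as np

# Stand-in "posterior": bivariate Gaussian, with closed-form
# cumulant-generating function C(t) = log E[exp(t . theta)] = t.mu + 0.5 t.Sigma.t,
# which is also the log normalizer of the linearly tilted density.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 1.2], [1.2, 1.0]])

def cgf(t):
    return t @ mu + 0.5 * t @ Sigma @ t

eps = 1e-5
E = np.eye(2)
# First derivative at t = 0 generates the first cumulant, the mean ...
grad = np.array([(cgf(eps * E[i]) - cgf(-eps * E[i])) / (2 * eps) for i in range(2)])
# ... and the second derivative generates the second cumulant, the covariance.
hess = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        hess[i, j] = (cgf(eps * (E[i] + E[j])) - cgf(eps * (E[i] - E[j]))
                      - cgf(eps * (E[j] - E[i])) + cgf(-eps * (E[i] + E[j]))) / (4 * eps**2)
```

The gradient comes out as mu and the Hessian as Sigma, which is the information the linear perturbation is carrying around.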
>>So, that’s the definition of linear response, is [INAUDIBLE]
>>Yeah, is this linear perturbation.>>[INAUDIBLE]
>>So you have to normalize?>>Yeah, yeah, so
you’d have to normalize>>So it is a probability distribution->>It’s a probability distribution as soon
as you normalize.>>[INAUDIBLE]
>>Yes, yeah.>>[INAUDIBLE]
probability distribution?>>Exactly.>>Over theta, and
t is also a D-dimensional vector?>>Yeah, t will be the same
dimension as theta. And it’s sort of describing
the size of the perturbation. Okay, and so we’re gonna call
this the linear response variational Bayes approximation
of the covariance. Okay, this is a totally unusable
formula at this point because it involves this uncountable number
of approximations which we will never do in practice, so
let’s get rid of that. And I’ll basically just say,
okay, let’s think again about what
are the nice distributions. Well, nice distributions
are where I can calculate means, covariances, normalizing
constants. Sure, low dimensional
distributions are pretty nice, but exponential families
are really nice. That’s where I can just super easily get all
the things that I want. It’s a one stop shop. So let’s suppose qt is in
the exponential family, and realistically this is what
people do in practice. So if we do that, then we have some mean
parametrization, m sub t. And I’ll just tell you that
after some algebra, we can write that sigma hat, which I had
on the previous slide and I also have right up here,
we can get rid of the t. It’s the second derivative of
the KL divergence with respect to the mean parametrization. And then we can write it
in this super simple form. So V is the bad block-diagonal covariance that we don’t like,
but that we sort of get out of mean field variational
Bayes just by running it. H depends only on the prior and
the likelihood, and I is the identity matrix. And so you can think of this as
taking our bad block-diagonal covariance and correcting it,
applying some correction, yeah.>>Why is this not symmetric?>>So it will be symmetric in
the end, and you can tell that because this is the second
derivative of the KL. It’s not as easy to see
from this formula, but it’s very easy to see
from this formula. So it’s a valid covariance,
it’s both symmetric and positive definite so long as we’re at
a local minimum of the KL. Yeah, these would be concerns.>>What is H?>>So H is the expectation under q of the log posterior. Which sounds crazy,
cuz we don’t have the posterior. But then we take the second
derivative with respect to M, so the normalizing
constant goes away. And so it’s something we
can calculate with automatic differentiation. Okay, so we made one assumption. Everything hinges on this assumption, which may or may not be right. But when it is right, for instance in this cartoon, or in any multivariate Gaussian exact posterior, it would be exactly right. I mean, since a Gaussian is fully described by its mean and its covariance, in that case this would be exact. More generally,
it’s gonna be an approximation. Okay, so let’s go back to
this microcredit experiment. Remember we were looking at this
microcredit experiment from Rachael Meager with these
K microcredit trials. There are a number of
businesses at each site, and now we’re gonna talk about
what exactly was the model. What was the choice of the prior
and likelihood that was made. And so the model here is
to model what is the profit of the nth business
at the kth site. So we’ll say that that
profit is y sub kn. We’re gonna say that it’s
normally distributed, with some base mean mu sub k. So this is the profit that would
happen on average if microcredit didn’t exist. Sort of the base profit for
this country. Then we’re gonna have
the effect of microcredit. So first, we’re gonna have an
indicator variable, whether or not microcredit was applied. This is a randomized controlled
trial, so we know whether or not somebody got microcredit. This is totally known,
this is not an unknown. An unknown that we do
have though is tau sub k, what is the effect of
microcredit in this country? How much does it add to
the profit on average in this country? And then finally, there’s some noise that’s
specific to the country. Okay, we don’t know mu sub k and tau sub k, where k is the index for the country, so there’s a prior on that. We can imagine that there’s some global base profit and some global microcredit effect. And a priori, these might have some interesting covariance, so here’s a covariance for those. And so, in essence, tau is the most interesting thing here. It’s the global effect
of microcredit. Is it positive? Is it helping people? Is it negative? What’s going on? We want to know about tau. Yeah. So, my understanding, and I would encourage you
to read the papers, cuz I believe there’s one paper for each of these countries, is that it was a randomized
controlled trial for individual businesses
on average. And they were giving
them microcredit. And then, they randomly assigned
each business to receive the microcredit, or not, and then they saw at
the end what happened. Yeah.>>These are published papers,
already, done by other experimenters
that are published [INAUDIBLE] they’re mostly MIT
professors, so [INAUDIBLE].>>Yeah,
if you check out Rachael’s paper, there’s a citation for
every single one of the seven. But yeah, it’s also
publicly available data, which I think is fantastic. Okay, so we’ll put a gamma prior
on sigma another prior on mu and tau, we don’t know them. And I’ll just mention that if
you were expecting an inverse Wishart here, a separation in
LKJ has some nice properties that an inverse Wishart doesn’t. So, that’s why that
prior was used. Anyway, the point here is,
we have this complex model. We have a number of unknowns,
we don’t know the effect of microcredit, the base
profit in each country. We don’t know the overall world effect of microcredit
in base profit. We don’t know this
covariance matrix, we don’t know this noise. And we wanna know, how much do
we know about these unknowns after having seen the data? So, that would be the posterior. Okay, so, first we’re gonna
check are we getting these means estimated well? And to do that, we’re gonna
compare to Markov chain Monte Carlo as ground truth. So, why not just run Markov
chain Monte Carlo all the time? Well, in this case, if we use
a heavily optimized package from Markov chain Monte Carlo, it takes on the order
of 45 minutes. Whereas all of the mean field variational Bayes optimization, not even taking advantage of any distributed computing or anything like that, plus all of the linear response uncertainties, plus a bunch of sensitivity measures that I haven’t told you about yet but that we’re gonna get to in the robustness part, all of that just takes about 58 seconds. And so, this can be
a real game changer for quickly iterating and
working on economic problems. Okay, so we do in fact see, so
each of these dots represents the mean of one parameter in the
posterior, and we want them to be on this x=y line, and
we see that this is the case. So, these are being
estimated well. Now, we’re interested in
saying are we actually getting the uncertainties
estimated correctly. And as we might have expected
from our discussion of mean field variational Bayes,
again, for each of the parameters, we’re seeing its posterior
uncertainty with one point both for MFVB and for LRVB. We’re seeing that MFVB
is underestimating them, and then LRVB seems to be
estimating them correctly.>>[INAUDIBLE]
>>Because there is literally nothing else we can compare to.>>[LAUGHS]
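As an editorial aside, the correction can be watched on a toy problem. The sketch below uses a correlated bivariate Gaussian as a stand-in posterior (illustrative numbers, not the microcredit model) and computes the LRVB covariance as the inverse Hessian of the KL divergence in the mean parametrization, as described earlier; the Gaussian case is the one where the method is exact.

```python
import numpy as np

# Stand-in posterior: correlated bivariate Gaussian (illustrative numbers).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 1.2], [1.2, 1.0]])
Lam = np.linalg.inv(Sigma)

# MFVB optimum for a Gaussian target: factor i is N(mu_i, 1/Lam_ii), so the
# MFVB covariance estimate V is diagonal and underestimates the marginals.
V = np.diag(1.0 / np.diag(Lam))

# KL(q || p) as a function of the variational means m, with factor variances
# held at their optimum; terms constant in m are dropped.
def kl(m):
    d = m - mu
    return 0.5 * d @ Lam @ d

# LRVB: Sigma_hat is the inverse Hessian of the KL in the mean parametrization.
eps = 1e-4
E = np.eye(2)
hess = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        hess[i, j] = (kl(mu + eps * (E[i] + E[j])) - kl(mu + eps * (E[i] - E[j]))
                      - kl(mu + eps * (E[j] - E[i])) + kl(mu - eps * (E[i] + E[j]))) / (4 * eps**2)
Sigma_hat = np.linalg.inv(hess)  # matches Sigma: exact in the Gaussian case
```

Here the diagonal of V is (0.56, 0.28) against true marginal variances (2.0, 1.0), the familiar underestimation, while Sigma_hat recovers Sigma.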
>>Yeah, yeah.>>[INAUDIBLE]
>>Yes. And we found that by running it
much longer nothing was really changed, but, yeah, it’s
a tough problem in these cases, because you don’t have any true ground truth. Well, if things
line up perfectly, it’s probably a good
indication that probably both are doing well as
opposed to one being a problem. [INAUDIBLE] variational Bayes and Markov chain Monte Carlo that gets stuck at that-
>>No, that’s totally true, and so a big problem in any type
of Bayesian inference problem is especially this
type of problem, but in much higher dimension
where you have one local optimum and then another that’s really far away. And of course, when you’re in
higher dimensions, it’s even much, much, much harder to
go from one to another. And that’s just a hard problem.>>And now on to the left, and mine is to the right
[INAUDIBLE].>>Yes, yes, yeah. And I think that, that is going
to be a fundamentally difficult issue especially, so one dimension I don’t think
this is actually that bad. But of course we’re dealing in much higher
dimensions than that. And there it’s hard to, in particular, because we don’t
know normalizing constants, that’s the really big issue. Because we don’t know that mass
is missing if we’re in one of these areas, it doesn’t tell us that one
of the other areas exists.>>Does the behavior of the two
approaches tend to be similar? Would they both run
into the same problem?>>What do you mean?>>Yeah, it’s likely that the
minima that they get stuck in, or the extrema,
would be different. Another thing that’s mitigating
here and why we don’t think that this type of thing
would be happening, is that you do tend to get
what’s known as a Bayesian central limit theorem
as you get more data. You tend to get something
that is closer and closer to a Gaussian, and
in this particular case, because there are hundreds
of data points, at each of
the individual countries, this seems like an unlikely
sort of situation. But for more complex models,
for fewer data points, it is totally possible. [INAUDIBLE]>>So there’s certain types of things that they’re
doing that are different. I mean, so the optimization
problem with VB isn’t sort of that it’s necessarily just going through and, like... With MCMC, it’s like you’re flipping this upside down, and a lot of the modern MCMC algorithms are saying, this is like a tarp and I’m just gonna move on that tarp. Whereas with VB we’re doing an optimization, sort of, after taking an expectation over
the space and things like that. So, I mean,
it’s a little different. Yeah.>>[INAUDIBLE]>>Well, but sometimes it could be similar,
right? So let’s say I was fitting
a Gaussian to this, and, you know, I might fit it to this one. And that would be a similar behavior to MCMC getting stuck in this one. Yeah, exactly. I think there are certain cases
where this could go wrong. Okay, so in this we actually
could be making quite a different decision because of this underestimation. So imagine, for instance, we’re gathering all these economic experts together and we’re gonna decide: is microcredit gonna be a thing, are we gonna invest a ton of money in microcredit? And we found that, hey, the posterior mean was $3 U.S. purchasing power parity. That’s positive. Microcredit might be helping people. Maybe we should
go ahead with it. And we’ll do it, as long as the
mean is more than two standard deviations from zero. And so, we look at the posterior
standard deviation as reported by LRVB, $1.83 U.S. dollars purchasing power parity. The mean is 1.68 standard deviations from zero and we say, ooh, not enough, we
need to go back, we need to get some more data. We need to think about maybe
are there certain covariates, certain aspects of businesses that make microcredit more
a business before, etc etc. These are the types of questions
that Rachael’s actually asking. This is a simplification
of her work. But if we had
erroneously guessed that our standard deviation was lower by using MFVB? So in this case, if
we had used MFVB we would have underestimated
the standard deviation, in fact, to the point
that the mean would be more than two standard
deviations from zero. And unsupported by the data
we would have said, yeah, micro credit. This is more than two standard
deviations from zero. Let’s go for it. Let’s invest in this. So we want to avoid that. Okay, something that you might
be concerned about at this point is you might have said that that matrix inverse
looks like a problem. Cuz I know matrix inversion is
maybe not exactly n squared but, or sorry, not exactly n cubed
but it’s between n squared and n cubed as an operation. And the thing that we’re
inverting has size number of parameters by number
of parameters. And that could be
particularly problematic for something like clustering. So in clustering I have one
parameter per data point because I have the assignment of
the data point to a cluster and then I have some global
parameters which are like the cluster location. And if I have to invert a matrix
that’s number of data points by number of data points,
this is not scalable. Luckily, that doesn’t come up. Okay, so we can decompose our
parameter vector into let’s call it some global parameters again. These are maybe
cluster locations. And some local parameters. These are like assignments
of data points to clusters. We can decompose H, we can
decompose V in the same way. If we use the Schur complement
we can see that one of our matrix inversions is on the
order of the number of global parameters and we think at
least that shouldn’t be as bad. But one of them is on the order
of number of local parameters, something like the number
of data points. This is potentially
a big problem. But let’s look at our
sparsity patterns. Well the whole problem would be,
the whole reason we even got into this was
that it’s block diagonal. And so
that’s not gonna be a problem. This is just gonna be
block diagonal in n, in the local parameters. H, if you look at H, because in likelihoods
you tend to see factors that connect local parameters
to global parameters, but not local parameters
to local parameters. You will see that the local
parameter part is also block diagonal and so
finally I minus VH which is gonna be block diagonal and
on the local parameters and so in fact this inversion
will be linear. So that’s why that’s
not totally crazy. Okay, so we’ve developed this
whole sort of perturbation idea at this point, now let’s
apply it to robustness, which if anything seems
even more natural. Yeah.>>[INAUDIBLE] question. I think you said [INAUDIBLE].
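As an editorial aside, the linear-time claim can be sketched numerically: with a small global block, a (block-)diagonal local block, and only global-to-local coupling, the global part of the inverse comes out of a Schur complement with only linear work on the local block. The arrow-shaped matrix below is a stand-in for I minus VH; sizes and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
G, N = 3, 1000  # a few global parameters, many local ones

# Arrow-shaped matrix: global-global block A, global-local coupling B, and a
# local-local block D that is (block-)diagonal -- here 1x1 blocks.
A = 5.0 * np.eye(G)
B = 0.01 * rng.standard_normal((G, N))
d = 1.0 + rng.random(N)  # diagonal of D

# Dense route: invert the full (G+N) x (G+N) matrix, between O(n^2) and O(n^3).
M = np.block([[A, B], [B.T, np.diag(d)]])
dense_inv_global = np.linalg.inv(M)[:G, :G]

# Schur route: one small G x G inverse plus O(N) work on the diagonal block.
S = A - (B / d) @ B.T                # Schur complement A - B D^{-1} B^T
schur_inv_global = np.linalg.inv(S)  # equals the global block of M^{-1}
```

Both routes give the same global block, but the second never touches an N x N dense inverse, which is the point of the sparsity argument above.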
>>Right so, I think it’s just easier to describe without
getting to the covariates, but there is nothing
fundamentally different about having covariates in these
models or things like this.>>[INAUDIBLE]
>>Yeah, yeah, it’s just a more complex model. This is the simplest possible
model that you could do. If you look at Rachael’s paper, she has sort of
a simple model and she goes into more complex ones. So you can do that as well. We also don’t want
to say anything. So part of it is we don’t
want to say too much about the economics problem, because we’re not
the economics expert, she is. And so I encourage you
to go to her paper for the full version of that. Yeah.>>Is that [INAUDIBLE]? Is that the precisely the sense
in which your letting this experiments sort of?>>Yeah actually, so
if you check out Rachael’s paper, you can use the posterior over
the heteroscedastic errors to get a sense of how much pooling
of information there is across countries, yeah. Okay, so let’s go to
robustness quantification. Okay, so robustness
quantification, you can think about it as we could have
made a different prior choice. In fact, we could have made
a different likelihood choice as well and
the same idea would go through. But just let’s look at
the prior for the moment. And here you could think of this
as one thing as like, okay, well maybe my prior was normal, and
I chose some hyperparameters for that prior, and then I
changed them a little bit. But it doesn’t have
to be like that. It could be some nonparametric
change to the prior. It could be, maybe my prior was
some crazy distribution, and then I wanted to add some
normal noise over here. So this Alpha doesn’t have
to be just your simple, parametric hyperparameter. It could be something
much more interesting. And so let’s just call our
posterior, now that we’ve very much focused on the fact that we’ve made this choice alpha in our modeling. Let’s call it p sub alpha. Okay, so now we wanna get
a sense of sensitivity. We wanna get a sense of this
idea of I had taken my prior and I’ve perturbed it and
gotten another reasonable prior. Would I get the same results, or
would I get different results? Well if you think about it, when we’re actually using
a Bayesian posterior, we’re not really reporting this
whole distribution, typically. We’re reporting something
like a posterior functional. In particular,
what we’ve been doing so far this time is we’ve been
reporting a posterior mean. Are we gonna say that
microcredit has the same average effect or it have very different
average effects under these two priors, these two
choices of prior? And so a way to measure
sensitivity in this case would be the say hey, if I look at
this posterior mean like what is the mean effect of microcredit, how much does it change
if I change alpha, if I change my choice of prior,
if I perturb that. And so
we’ll call this the sensitivity. And now it turns out, hopefully
this looks somewhat familiar to all the work that we did before. If you replace
the alphas with ts, this is just exactly what we
were doing on that long linear response variational
Bayes slide. And the first thing we said was,
hey, maybe we think that point
estimates are relatively good. So we can replace them with
an approximation under the variation Bayes. And then maybe we can
make this assumption for this linear response for Bayes estimator in order to
actually calculate it instead of having this uncountable number
of approximations, we can make this assumption that q* sub
alpha is in exponential family. In that case, we get another
super nice formula like before. So this inner formula was
exactly the formula we calculated before. So if we already did all these
uncertainty quantification correction, this work is done. You don’t need to
calculate anything else. b comes from the form of
the prior of it’s conjugate exponential family. This will be the identity, or generally, we can calculate this
with automatic differentiation. A comes from the form of g. If g is the identity of
recalculating the posterior mean, this will be
the identity Okay so again, let’s look at this model. And we can say to ourselves,
there is a lot of choices here. We have to choose this gamma ab. We had to choose theta c and
d for the separation LKJ prior. We had to choose
this lambda inverse. We had to choose,
well we didn’t have to choose C. We had to choose mu naught and
tau naught. And even if you are like some
super microcredit expert like Rachael Meager, there’s
probably still some range of reasonable values that
you think might be okay. And you want to know
that over that range, your decisions wouldn’t change. So we can check if that’s true.>>[INAUDIBLE] something different.>>So there’s a few things. So basically, let’s look for
instance at this normal. So this normal as we’re saying,
here’s our prior. And maybe we believe super strongly that this is definitely what we know going in. But maybe actually, I talked to my friend, who is also a super microcredit expert, and they
different results? Is it gonna give me
the same results? I’d like to know. In particular,
at the end of the day, we’re making some kind of
decision like is micro credit helping people? Is the posterior mean more than two standard
deviations from sigma? And if it is, then maybe I’ll
go an invest in this, but if it’s not, I won’t. And so I wanna know that
if I change my prior, would I change that decision? So that’s the kind of
question that we’re asking. Now I’m gonna formulate this
in terms of if we change these hyper-parameters, would
I change that decision. It doesn’t have to be
this parametric decision. It actually could
be non parametric. Yes?>>[INAUDIBLE]>>Right, so this is exactly, so
if you take something like the hyper parameters and just
change them then you can think of this as the derivative with
respect to the hyper parameters. More generally if you wanted to
do something like say here is my prior and then I wanna change
it by maybe adding a normal and normalizing it. Then I could say that Alpha is
sort of the amount that I’m doing the weighting between the
original prior in the normal. And I could take a derivative
with respect to that Alpha.>>[INAUDIBLE]
>>Yeah, basically.>>[INAUDIBLE]
>>As long as you can formulate them in a way that
there is an Alpha that you can take the derivative
with respect to, then you can make that change. But just the one thing
I wanna emphasize, is that it doesn’t have
to be some parametric family form that you change.>>But then is that
approximation still a good one? I mean, if Alpha equals 0, wouldn’t we have to put-
>>So it’s not near alpha equals 0, it’s alpha at your original choice, in this case. So it’s a little bit different than the t case. So it’s near a change in alpha of 0, delta alpha.>>[INAUDIBLE]
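As an editorial aside, here is a toy closed-form version of this local sensitivity idea, using a conjugate normal-normal model (illustrative numbers, not the microcredit model or the variational machinery): the derivative of the posterior mean with respect to the prior mean alpha predicts the effect of a small prior perturbation without refitting.

```python
import numpy as np

# Conjugate model: theta ~ N(alpha, tau2), x_i | theta ~ N(theta, sig2),
# so the posterior mean has a closed form we can differentiate by hand.
tau2, sig2 = 4.0, 1.0
x = np.array([2.1, 1.8, 2.5, 2.0])
n, xbar = len(x), x.mean()

def posterior_mean(alpha):
    prec_prior, prec_lik = 1.0 / tau2, n / sig2
    return (prec_prior * alpha + prec_lik * xbar) / (prec_prior + prec_lik)

alpha0 = 0.0
# Local sensitivity: one derivative, taken at the original prior choice ...
sens = (1.0 / tau2) / (1.0 / tau2 + n / sig2)
# ... predicts the result of a small perturbation delta-alpha without a refit.
dalpha = 0.05
predicted = posterior_mean(alpha0) + sens * dalpha
refit = posterior_mean(alpha0 + dalpha)
```

In this conjugate model the posterior mean is linear in alpha, so the linear prediction agrees with the refit exactly; in the talk’s setting the same derivative is only a local approximation.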
>>Right, so in this case it would be zero, yeah. As opposed to other cases where it might not be, yeah.>>[INAUDIBLE] how drastic
do you actually change by going from this one to
a Gaussian [INAUDIBLE]?>>So I would say that locally, you don’t expect things
to be too different. But yeah, as you turn this
into a normal, eventually, you’re gonna lose the linearity
of the approximation, it’s true. But I still think that if you
want some notion of robustness, this is an easy one to get. It’s an easy one to calculate. It’s one that you can
calculate quickly and sort of with minimal fuss. On your computer there’s a sort
of a nice formula, and so it’s a step in
the right direction. Ultimately we’d like to
have a global version, but I think that’s a hard problem. But maybe someone can solve it. Okay, so let’s just go ahead and
see what this looks like. Again, the assumption we make is
that the means are well tracked and we have that. Let’s consider
a particular perturbation. And what we’ll do is we’ll just
run mean field variational Bayes for one value of
the hyperparameter. Sorry, we’ll run MCMC for one
value of the hyperparameter and another one just to see if
mean field variational Bayes, if LRVB is tracking that. And we see that it really is. And again, with LRVB
we’re not getting it in just one direction, we’re getting it in sort of
all directions simultaneously. Okay, again, just to see
how this may affect us, we could calculate the sensitivity of the microcredit effect, with normalization to be on the scale of standard deviations, cuz that’s sort of the interesting scale. At the end we are making this decision: is microcredit helping people? We can get sensitivities for
all these parameters, all hyper-parameters. It’s interesting to
see that they’re on very different scales. And now let’s go back to our
original thought experiment. So we said that the mean was less than 2 standard
deviations away from 0. So we weren’t going to make
microcredit our new policy. And then let’s say
that we perturbed one of these parameters by 0.04. We thought that was
totally reasonable. We talked to our economist
friend, and they said that that would’ve been a totally fine
hyper parameter to start with. And we found that now the mean
is more than two standard deviations from zero. Now in this direction, that
doesn’t really mean anything. We already weren’t going to fund
this, and so we still won’t. But you can imagine this
going in the other direction, that maybe we started our
analysis, we found the mean was more than two standard
deviations from zero. We high-fived everybody and
we said, okay, we’re gonna fund microcredit,
it’s super awesome and then we did this robustness
analysis and we found that for perfectly reasonable values
of the hyper parameter, the mean is not more than two
standard deviations from zero. And so we’d be more
conservative, we’d go back, we’d get more data, we’d think
about our covariates affecting this, things like that. Okay, so I will mention,
I phrase this in terms of mean field variational Bayes, but
there are a lot of other types of variational
Bayes besides this. And this isn’t specific to
mean field variational Bayes. It can be applied to other
types of variational Bayes. And since I am
running out of time, I’ll encourage you
to talk to Jonathan, who is interning here anyway,
about his coresets work. But basically a number of
people brought up some issues with what we’ve
been talking about. So what about bias? Variational Bayes
could be biased. What if I care about
a long tail or some other interesting things
that could be going on? I want something that’s fast but is also going to give me a good
posterior approximation. And so this coresets work is getting at that by using a data
summarization approach. Let me just skip
ahead to the end. Okay, so this linear response
idea, lets us get these covariance estimates as well as
this robustness quantification. And while this coresets work
I think is really great, in terms of getting us fast
approximate Bayesian inference that also gives us theoretical
guarantees on quality. I think there’s
still a place for these two ideas
to come together. I mean, this perturbation
idea for robustness quantification is
something that people want and should have in their analyses. In a lot of these analyses, you
want to know how robust these sometimes arbitrary, or at least
semi-arbitrary choices that you made, how robust your
outputs are to these choices. And this is true for
any type of modeling, but for Bayesian inference,
this seems to be a way to do it. And so combining this with
a way to actually get reliable inference with
theoretical guarantees I think is going to be very powerful. And so if you’re interested in
any of these ideas I’d love to chat more after this. I’ll just point out
our papers on this. So there’s a NIPS 2015 paper on
the linear response methods for covariance estimation; our robustness paper is in preparation. Hopefully it will be out soon. The NIPS 2016 paper is on the coresets work, if you want to read
up on that as well. Thank you. [APPLAUSE]
>>Can you take one question now?>>Yeah.>>On slide nine you showed the
[INAUDIBLE] comparing the MCMC results to your results, and there was one point near the
bottom that on a relative scale was pretty different.>>Yeah.>>Do you know what happened?>>Yeah, so I mean, I suspect
that all of this relies on the bias being zero and so probably the bias was
not zero for that point. So if the bias is zero
then everything is exact, there’s no approximation. So that’s pretty much
gotta be the culprit.>>I guess it would
be the MCMC also.>>That’s also true. I actually pretty much,
so we use Stan. I think Stan is pretty
trustworthy on this problem. I lean towards it being LRVB; it’s possible it was the MCMC.>>[INAUDIBLE]>>Yeah, yeah.>>All right, let’s [INAUDIBLE].>>[APPLAUSE]
