12 August 2019, Deep Learning Opening Workshop: Robust Information Bottleneck, Po-Ling Loh

– So, I guess I’ll also start my talk with a bit of a caveat. So this is very recent work. It says, “With thanks to
the Simons Institute.” So over the summer, I was
visiting the Simons Institute at Berkeley where they had
a program on deep learning. And really, the reason
I was there was because one of my very close
friends, Matus Telgarsky, was one of the organizers. And I told him when he invited me that I didn’t really do any deep learning and he’s like, “Okay,
well you can get started.” So, this was sort of mentally
like a summer internship, and what you see now, well, it’s August, this is like the report on what I did. So this is kind of still in progress, I’m not really sure when
we’ll be done with the paper. So I apologize if some
of the ideas or equations are permuted, because that’s
kind of how it still is in my head. This is joint work with my
colleague and collaborator Varun Jog who is in ECE at Madison. Okay, so as we know, neural
networks are very accurate. So you feed in lots of training data and they classify very, very
accurately on test data. But there’s a phenomenon that
was observed several years ago and has drawn a lot of attention, particularly in the theoretical community, and it’s a curious
property of neural networks that even though they classify very well, let’s say to 95 plus percent accuracy, what happens is that if
you take a neural network that is trained to classify
images very accurately, and you choose some sort of
adversarial tweak to the image, so in this case, this is a panda that was classified accurately, and then you choose some
very small perturbations, so the magnitude of each
pixel is perturbed by at most epsilon, and you choose
that perturbation correctly, then you can make the
classification incorrect. So, there’s another animal called a gibbon and you feed this new
image, which basically looks just like the panda image,
into your neural network and it will fool the neural network. The neural network will decide that this is something different. And this is a problem because in practice, people like to use neural
networks to classify images and they want to do better than humans, and even though they might
do a little bit better than humans on so-called clean images, the problem is that if you
choose these adversarial attacks, then you can sort of hack the network and make it predict incorrectly. And this is bad if you have
an application in security or medical imaging or something like that and you want them to be more robust. So I’ve been quite
fascinated with this property for a while when I first heard about it. And there are various
methods that people have for trying to deal with this. So, in my head, to kind of
sort out what this phenomenon is actually referring to, I like to think of a
classification boundary. So, imagine that you
have images of one type, let’s say the pandas are over here and the gibbons are over there. And what your neural network is doing is it’s creating a decision boundary that’s classifying very well. So in this case, it’s
100 percent accurate on all of the training images. And in fact, if you populate
it with some test images, then they probably will also do very well. At least, that’s the way that
we see the experiments going in actual applications. But the issue, so what’s going on here is that even though things are
classifying correctly, somehow the neural network is, in a way we don’t quite understand, creating the classification boundaries so it’s really, really close
to the images that we feed in. So I call this sort of a
conservative predictor. So it’s creating the correct
classification boundary, but there are lots of correct
boundaries it could be drawing that separate the pluses from the zeros. And it’s separating things in such a way that you get really, really close, so if you choose some
sort of perturbation, let’s say to this red circle, then even though it’s
classified correctly, you can just perturb it by a tiny bit and go over the line, and so this is what
neural networks are doing. So there are a couple
methods that people use to try to combat this. One of them is known as data augmentation, and the first reference I found of this was from analysis of text, actually. There may be other references before that. So the idea is that you
still have the original red and blue points, but you kind of tell your neural network, “I don’t want to just classify
these images correctly, “but I want to classify them “and some ball around them correctly.” And so what I’m drawing here is I’m drawing some sort of enlargement, so you can imagine that a dotted circle around each of these points, and to say that everything
that is red plus dotted has to be on one side and everything that is blue plus dotted has to be on the other side. And so maybe then, the neural network will classify like this, and you’ll get this green line, okay? So the problem with
this, of course, is that theoretically, it sort of makes sense, but then, now imagine that you’re in a
high dimensional image space and you want to, oh, so data augmentation
means that you introduce some other examples that are close by. So it’s like you take
not only this red image, but you create lots of
(mumbles) perturbations. Okay, but the problem
is that you need sort of infinitely many, or a large
number of perturbations because you’re in a very
high dimensional image space. Your dimensionality is the
number of pixels in your image, which could be very, very large. So, it’s not really that practical to say, “Let me just protect every point by lots and lots and lots of points.”
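To make the scaling problem concrete, here is a minimal sketch of data augmentation with random L-infinity perturbations. All the sizes and the radius here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Data augmentation sketch: protect each training point by adding
# randomly perturbed copies drawn from an L-infinity ball of radius eps.
def augment(X, y, eps=0.05, copies=10):
    noise = rng.uniform(-eps, eps, size=(copies,) + X.shape)
    X_aug = np.concatenate([X] + [X + z for z in noise])
    y_aug = np.tile(y, copies + 1)
    return X_aug, y_aug

X = rng.random((100, 784))          # 100 fake 28x28 "images", flattened
y = rng.integers(0, 2, size=100)
X_aug, y_aug = augment(X, y)

# The catch: 10 random copies barely begin to cover an L-infinity ball
# in 784 dimensions; covering it would need a number of points growing
# exponentially with the dimension, which is why pure augmentation
# doesn't scale to image-sized inputs.
```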
There is another method that is quite popular, and that’s called adversarial training, and there are different
ways to think about what adversarial training is doing. But one way to think about it is to say, “Okay, so we don’t really
need to protect this point “by all of the points in the solid ball. “Maybe we just need to add
on some other perturbations “that are kind of going
the right direction.” So we make sure that that
point is classified correctly, and then some appropriate
points are classified correctly that push the black boundary
to the green boundary to get what we want. A very popular method
for constructing these adversarial perturbations is called the Fast Gradient Sign Method, or FGSM. I don’t want to go into
too much detail about this, but the idea is that you take your point, let’s say it’s the red point, and then you add the correct
direction of the perturbation. And it turns out that what
has been sort of suggested in papers and kind of works somewhat well is where you look at the
gradient of your loss function. Why does this make sense? Because, well, the loss
function is kind of indicating how close
you are to the boundary. So the loss is going to be
pretty small if you’re far away, and then it’s gonna be
bigger if you’re close by. And so you want to kind
of go in the direction of the boundary, so the gradient of the loss will tell you what’s that direction in which things are going to change very quickly. It turns out that this
sort of protects against what are called L-infinity perturbations whereas there are maybe L2 perturbations or different sort of distances. So the idea is that it’s
sort of like impossible to decide based on your training data exactly what the worst
case perturbation is, but this Fast Gradient Sign Method is one way of creating an image that seems to be going
towards the decision boundary, which seems to improve the
robustness of your classifier. It’s called Fast Gradient Sign Method because it’s very fast to compute, and it’s a sign method because
you’re basically looking at the vector of the gradient and just going, epsilon
perturbation, either plus or minus. The other thing I wanna point out, which is why I wrote this down, is that it turns out that
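As a sketch of the idea, here is FGSM for a simple logistic model standing in for the trained network. The weights and inputs are hypothetical, but the epsilon-times-sign-of-the-gradient step is the same one you would apply to a deep network, with the gradient coming from backpropagation:

```python
import numpy as np

# A linear logistic model standing in for a trained network
# (weights and inputs here are made up for illustration).
def loss(w, x, y):
    # logistic loss: log(1 + exp(-y * <w, x>))
    return float(np.log1p(np.exp(-y * w.dot(x))))

def grad_x(w, x, y):
    # d/dx log(1 + exp(-y <w, x>)) = -y * sigmoid(-y <w, x>) * w
    return -y * w / (1.0 + np.exp(y * w.dot(x)))

def fgsm(w, x, y, eps):
    # Move every coordinate by +/- eps in the direction that increases
    # the loss: x' = x + eps * sign(grad_x loss).
    return x + eps * np.sign(grad_x(w, x, y))

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, -0.1, 0.4])      # correctly classified: y * <w, x> > 0
y = 1.0
x_adv = fgsm(w, x, y, eps=0.1)
```

The perturbation moves each coordinate by exactly epsilon, yet the loss of the perturbed point is strictly larger, i.e., the point has moved toward the decision boundary.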
The other thing I wanna point out, which is why I wrote this down, is that it turns out that this is very easy to compute from a trained neural network. So when you think about optimizing things, you usually think about
optimizing your loss with respect to the
parameters, which are theta. So it wouldn’t be too
unrealistic to imagine that if I were talking
about taking a gradient of a loss, with respect to theta, then that should be easy to
do with a neural network, because a neural network is
doing gradient descent anyway. But one thing I want you
to keep in mind for later is that actually, neural
networks can also take derivatives with respect to X, your input. So this is kind of weird
when you think about it as optimization. But when you think about what
a neural network is doing, well, on the first layer, it’s taking your weights, let’s say theta, and it’s computing theta transpose X, where X is your input. And it does something to that. So from the point of view of what the neural network
is computing and storing, it can also tell you, in addition to the
gradient with respect to theta, what the gradient is with respect to X of the loss function, or, it turns out, the
parameters and so on. And this is one thought
that I want you to hold onto for later. Okay, so I guess this
is what I’m saying here: it turns out that we can
actually take these gradients efficiently when we have
a trained neural network. There’s another method
that I wanna point out because I spent a lot of time
thinking about this earlier in the summer. This is another method for
trying to make neural networks more robust. It’s called randomized smoothing. So basically, the idea is
that if you look at just the first image on the left, well, if you smoothed out
somehow the decision boundary, made it like a straight line, then maybe that would be more robust. And so randomized smoothing
is kind of a way of appropriately pre-processing your data or convolving it with
some kind of Gaussian, which will, in the end, smooth
out your decision boundary. Okay. But then the issue, when you start thinking
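Here is a minimal sketch of randomized smoothing, assuming a toy base classifier with a deliberately wiggly boundary (the classifier and the noise level are invented): the smoothed prediction is a majority vote over Gaussian-noised copies of the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy base classifier with a deliberately wiggly decision boundary.
def base_classifier(x):
    return 1 if x[0] + 0.3 * np.sin(8.0 * x[1]) > 0 else 0

# Randomized smoothing: classify many Gaussian-noised copies of the
# input and take the majority vote; the induced decision boundary is
# a smoothed-out version of the wiggly one.
def smoothed_classifier(x, sigma=0.5, n=2000):
    noise = rng.normal(0.0, sigma, size=(n, len(x)))
    votes = [base_classifier(x + z) for z in noise]
    return int(np.mean(votes) > 0.5)
```

Points well inside either region get a stable vote, and small shifts of the input only change the vote fractions slightly, which is the sense in which the smoothed classifier is more robust.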
Okay. But then the issue, when you start thinking about these methods, is that there’s a little
bit of an inherent trade-off between robustness and accuracy. Okay, so what do we mean by robustness? Well, robustness means
that if I have a classifier and I perturbed the input by a little bit, the classifier shouldn’t change too much. So if you think about looking at the panda and the gibbon, right? I mean, I have these two input images which look really, really close, that’s in the, let’s
say, X-i is the panda, X-prime is the gibbon,
they look really close. And I want that the classification
is going to be the same. So I want that to be
true of my classifier. On the other hand, I also
want my method to be accurate. And so what you see if you
start looking at these images is well, if I kinda
straightened out this line, then maybe that would be
more robust in some weird way that I need to quantify, but at the same time, you know, the images that actually are sort of
close to the curvy boundary are going to be misclassified. And this is what happens when you try to apply some of these methods. I think I put an image of this, so I took the same images as before and I said, “Well, what if we try to do some data augmentation
and we enlarge things by, this is supposed to be, a big radius.” So we kind of wanna protect
all of these points, and then really, the
green line doesn’t have many choices anymore, it has to follow whatever
curvature is going on here. But who am I to say that
this is the right classifier? This is just a method that I use to try to protect these points. And so we see that as we
try to become more robust, as we try to increase these balls and protect against
perturbations in these balls, then the boundary might become
further and further away from what actually is
supposed to be the case based on the data-generating distribution. So, then I started thinking, well, how do we quantify this? So I spent a lot of time
trying to figure out if there were fundamental trade-offs between robustness and accuracy, and I didn’t get very far. Okay, so there’s something else that I started thinking about instead. And the idea is instead
of claiming that there is a fundamental trade-off
between robustness and accuracy of any classifier, what if we tried to
design some classifiers that are inherently more robust? And the way we do this is we look at the features that we’re using to
base our predictions on. So, there’s this image
that I want to look at for a little while. So, what we have here is we
have a mixture of Gaussians, and these are a mixture
of very flat Gaussians. So the covariance in
the horizontal direction is much bigger than the covariance in the vertical direction. Now, if I told you that I
wanted a good classifier, and I didn’t care about robustness, then what would you do? Well, you would look
at something like this, let’s say, maximum likelihood classification via logistic regression. And so another way to
think about it is that logistic regression is learning
a parameter vector beta, in this case, it’s two-dimensional, and it’s trying to project
things onto the beta direction. So really, the beta direction
is perpendicular to this. So it’s projecting things
onto that direction, and then it’s classifying. Okay. But my claim is that if we don’t just care
about prediction accuracy and we care about robustness,
whatever that means, what we should be doing is
doing something like this. So this image, or some version thereof, is in a paper on neural networks and took me a while to
understand what was going on because this just looks
like a terrible classifier. But now, in terms of robustness, so we go back to this first image and my claim is that it’s
not a very robust classifier. Because what we cared about is how much I should perturb points in order to put them over the boundary. And so we see that even though
this is a great classifier, most of the points are really
close to the boundary, right? Pretty much all of them, because I chose these two blobs so that they were basically kind of like parallel point clouds to this boundary. On the other hand, if I
look at this classifier, then I see that, okay,
there are some points that are now misclassified. On the other hand, since
these are Gaussian clouds, the bulk of the mass is sort of in the middle of the red cloud. And the bulk of the mass here is in the middle of the blue cloud. And now, how far do I
need to move those points to put them over the boundary? And so you’ll see that it’s
something sort of non-trivial, whereas here, it’s sort of trivial, you just move points by a tiny bit and they go over the boundary. Here, what we see is that
you move some points, and even these are really far away and they’re kind of further
from the boundary, right? So the idea is that maybe, even though the most accurate vector I should be using for my prediction is the beta vector that
corresponds to this classifier, if I now want things to be
both robust and accurate, then I should be basing my
prediction on some other projection or some other feature. So you can think of X-i transpose beta as a feature. If I were to extract a
one-dimensional feature on which I wanted to do a prediction, then which beta should I choose, so that I do logistic regression using X-i transpose beta?
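Here is a small simulation of that trade-off; the mixture parameters are invented. One coordinate has the higher signal-to-noise ratio and gives the more accurate linear classifier, while the other coordinate leaves a much larger margin and survives a worst-case L-infinity perturbation far better:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# Flat Gaussian mixture: label y = +/-1, X ~ N(y * mu, diag(sigma**2)).
# Index 1 has a small mean gap but tiny noise (SNR 10): the most
# accurate single feature.  Index 0 has a bigger mean gap (margin ~2),
# so points sit much further from its decision boundary.
mu = np.array([2.0, 1.0])
sigma = np.array([1.0, 0.1])
y = rng.choice([-1, 1], size=n)
X = y[:, None] * mu + rng.normal(size=(n, 2)) * sigma

def accuracy(beta, X, y, eps=0.0):
    # Accuracy of sign(X @ beta) under a worst-case L-infinity attack
    # of size eps, which removes eps * ||beta||_1 of margin.
    margin = y * (X @ beta) - eps * np.abs(beta).sum()
    return float(np.mean(margin > 0))

beta_acc = np.array([0.0, 1.0])     # high-SNR feature: most accurate
beta_rob = np.array([1.0, 0.0])     # large-margin feature: more robust

clean_acc = accuracy(beta_acc, X, y)
clean_rob = accuracy(beta_rob, X, y)
adv_acc = accuracy(beta_acc, X, y, eps=1.2)
adv_rob = accuracy(beta_rob, X, y, eps=1.2)
```

On clean data the high-SNR direction wins; under a perturbation larger than its margin, its accuracy collapses while the large-margin direction degrades only mildly.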
This is also an example that’s worth mentioning because I think this is what kind of made a splash in the deep
learning community quite recently. This is another example which is, it’s kind of like this Gaussian point clouds example, but it’s also showing that
there are certain features that might be more robust than others. So the idea is that you
again have a mixture, so you have Y-i, which is your label and it’s either plus or minus. So imagine that that’s like
my two Gaussian clouds. But they’re not exactly Gaussians. So what’s going on is that there’s a feature which is
sort of the robust feature. So this first feature is basically telling you, I mean, it’s gonna be
plus one or minus one, depending on what the class is, but then there are some
uncertainties so there’s a flip. And then the other features are sort of weak features. So they are going in the right direction, in the sense that the means are correct. So if the class is plus one, then these clouds are going to be centered at the all-ones vector
with a little bit of noise. So the idea is that if you were
to try to classify correctly with the best accuracy, you should be using all of these features. So, these features tell you a little bit and this feature also
tells you a little bit. On the other hand, if you choose this eta
parameter to be very small, then that’s kind of like saying that the distance to the
boundary is really small. Okay, so we see that,
if you wanna be robust, you actually shouldn’t be
using any of these features, you should only be using the first one, which is the robust feature. Now, you have to think
about it a little bit more, it really depends on the trade-off between how big q is and how big eta is. The point is that if you
make eta sufficiently small, then, really, all these features
are useful for prediction but they’re just not very robust. And so we see that one of the features comes out as being very robust and the others are not so much. So the question again is how do we do some sort
of feature extraction that tells us that we should
be picking this feature and not those? And then the trade-off between accuracy and robustness is that if I didn’t care about robustness, I would use all the features, but if I do care about robustness, maybe I should rank my features in a certain way.
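A small simulation in the spirit of that example, with made-up constants q and eta: one robust feature that usually equals the label, plus many weak features with mean eta times the label. Averaging the weak features is accurate on clean data, but an L-infinity shift of size larger than eta reverses all of them at once, while the robust feature is untouched:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, q, eta = 5000, 10, 0.05, 0.1   # invented constants

# Labels +/- 1.
y = rng.choice([-1, 1], size=n)

# Robust feature: equals the label, flipped with probability q.
flip = rng.random(n) < q
x_robust = np.where(flip, -y, y).astype(float)

# Weak features: Gaussian with mean eta * y -- individually barely
# informative, but useful in aggregate.
x_weak = eta * y[:, None] + rng.normal(size=(n, d))

acc_robust = float(np.mean(np.sign(x_robust) == y))            # about 1 - q
acc_weak = float(np.mean(np.sign(x_weak.mean(axis=1)) == y))

# An L-infinity shift of size eps > eta reverses every weak feature's
# mean, wrecking the aggregated classifier, while the robust feature
# is unaffected by any perturbation smaller than 1.
eps = 2 * eta
x_weak_attacked = x_weak - eps * y[:, None]
acc_weak_attacked = float(np.mean(np.sign(x_weak_attacked.mean(axis=1)) == y))
```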
Then there’s another paper I want to mention as well because I also found
this quite fascinating. So this is a NeurIPS paper from last year, and the idea is that in
order to try to build a more robust feature classifier, you should build a graph on your data, and then look at your graph Laplacian. So again, without going
into too much detail, so what I’ve done here is I’ve taken these points
from the earlier picture, right here, and I just
dropped the decision boundary and I said, “Well, let’s
construct some sort of “a nearest neighbor graph.” So you get a graph. And the idea with robustness is that you don’t want your
function to change too much for points that are close. And so, because this graph is telling us where points are close, then we want to extract a function that takes on a different value for every one of these points but doesn’t change that much, across edges in the graph. And, you know, these are
ideas that have come out in spectral graph theory. You can talk about regularizing
according to a graph. And so there are methods
that have been proposed for doing robust feature extraction with respect to the graph
that’s placed over your data.
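As a sketch of the graph-based idea, the following builds a symmetric k-nearest-neighbor graph on two toy point clouds and evaluates the Laplacian quadratic form, which penalizes features that change across edges (the sizes, spread, and k are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two toy point clouds in the plane.
pts = np.vstack([rng.normal([-2.0, 0.0], 0.3, size=(20, 2)),
                 rng.normal([+2.0, 0.0], 0.3, size=(20, 2))])
n = len(pts)

# Symmetric k-nearest-neighbor adjacency matrix.
k = 3
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
W = np.zeros((n, n))
for i in range(n):
    for j in np.argsort(dist[i])[1:k + 1]:   # skip self at position 0
        W[i, j] = W[j, i] = 1.0

# Graph Laplacian L = D - W.  For a feature vector f on the nodes,
# f @ L @ f = (1/2) * sum_ij W_ij * (f_i - f_j)**2, which is small
# when f barely changes across edges -- the smoothness notion behind
# graph-based robust feature extraction.
L = np.diag(W.sum(axis=1)) - W

f_smooth = np.where(pts[:, 0] < 0, -1.0, 1.0)    # constant on each cloud
f_noisy = rng.choice([-1.0, 1.0], size=n)        # arbitrary +/-1 values

penalty_smooth = float(f_smooth @ L @ f_smooth)
penalty_noisy = float(f_noisy @ L @ f_noisy)
```

The feature that is constant on each cloud pays essentially no smoothness penalty, while an arbitrary labeling pays a large one, so minimizing this quadratic form pushes you toward features that respect the geometry of the data.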
All right. So the approach that we took was inspired by these works but also took in a different idea. And so, the idea was to use an information-theoretic approach. Again, the goal is to figure out how to extract robust features. The question is like, “What should you use as your objective, and does it actually work?” So, there’s a paper that’s quite famous now because it’s somewhat controversial. And this paper was by Tishby et al. The first version of it appeared in 1999. And it was known as the
information bottleneck. So the idea here had nothing
to do with robustness. The idea was that they wanted a method for extracting features, and these features should
be good at predicting whatever the output is, so they should be good at predicting why, and they should also be
very good at compression, so you don’t want to be too redundant. Like if you just take all your inputs X, then those could be really
good for predicting Y, but maybe some of them are redundant, so you really just wanna
extract out some features T that are not so redundant for X that still are good at predicting Y. This was formulated in this information-theoretic setting. So I refers to the mutual
information between two variables. So remember, X is the
input, Y is the output and T are the extracted features. And we want to impose that the features are conditionally
independent of the outputs given the input, ’cause the
features are just extracted from the input. And what this formulation means is, well, the one thing you need to
know about mutual information is that it’s not negative and furthermore, when it’s zero, that means that two
variables are independent. Okay, so if you want things
to be less independent, then this value should be bigger. So what we wanna do is we
want to make I of T;X small, and we want to make I of T;Y big, right? And so basically, it’s saying that we want T and Y to be very related, but then we want T and X
to be not very related, because we’re minimizing the mutual information between T and X, and small mutual information between T and X (zero, in the extreme) means that there is no relationship.
between T and X, but we want to predict well, so we need good relationship
between T and Y. So this was known as the mutual information bottleneck.
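To make the objective concrete, here is the bottleneck quantity computed exactly for a toy discrete joint distribution (the table and the trade-off weight are invented), with T a hard two-valued feature of X:

```python
import numpy as np

def mutual_information(p):
    # I = sum over (a, b) of p(a,b) * log( p(a,b) / (p(a) * p(b)) ), in nats.
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (pa @ pb)[mask])))

# An invented joint distribution of an input X (4 values) and a label Y
# (2 values).
p_xy = np.array([[0.20, 0.05],
                 [0.15, 0.10],
                 [0.10, 0.15],
                 [0.05, 0.20]])

# A hard feature T merging X's values in pairs: T = 0 for X in {0, 1}
# and T = 1 for X in {2, 3}.  Build the joint tables p(T, X) and p(T, Y).
assign = np.array([0, 0, 1, 1])
px = p_xy.sum(axis=1)
p_tx = np.zeros((2, 4))
p_tx[assign, np.arange(4)] = px
p_ty = np.zeros((2, 2))
for xv in range(4):
    p_ty[assign[xv]] += p_xy[xv]

i_tx = mutual_information(p_tx)      # compression term I(T;X)
i_ty = mutual_information(p_ty)      # prediction term I(T;Y)
gamma = 0.5                          # invented trade-off weight
ib_objective = i_tx - gamma * i_ty   # smaller is better
```

Since T is a deterministic, lossy function of X, I(T;X) collapses to the entropy of T, and I(T;Y) can never exceed I(X;Y), which is the data-processing inequality at work.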
And this is where things become a little bit controversial. So, there were various studies, and especially with the
new wave on deep learning, claiming that in fact deep
learning, neural networks learn according to this mutual
information bottleneck. So in this case, they
were looking at features that were in a deep network. So you can imagine training
your favorite neural network, it extracts features at each layer, right? So it would just focus
on a particular layer and ask for this particular T. What happens if I compute what
this objective function is? So the hope is that as I learn, things become more compressed
and they predict well. So you would hope that somehow
this objective function is going to be getting smaller and smaller because I’m doing some learning, so I’m trying to predict my features well and I’m trying to predict my outputs well and I’m also trying to do compression. And the authors went a
little bit further than that to say that there are
actually two different phases in neural network learning. And one is known as a memorization phase and one is known as a compression phase. So first, you sort of
learn to extract features you recognize, for instance,
the handwritten digits. So you recognize that there
are certain properties of zero, one, two, up to nine. After that, you try to compress things. And we’re not spending too
much time on this slide, basically the idea is that I of T;Y should be constantly increasing because you’re getting better and better at predicting things. On the other hand, I of
T;X is first increasing and then decreasing. And this picture is
taken from this paper by Schwartz-Ziv and Tishby, and so you see that I of T;Y, which is on the vertical axis, it’s always increasing, whereas on the horizontal axis, I of T;X, these orange dots are sort of going,
then they’re getting smaller. So what you see here is
you see different layers in a neural network, and over 50 samples of
random initialization. Now, this paper kind of, well, there was a follow-up paper at ICLR, maybe in ’17 or ’18, where they claimed that the
simulations were sort of very carefully chosen, and this is not what
happens in neural networks. And so if you Google the ICLR open review, you’ll see that there’s this
war between the authors. So the follow-up paper got in,
but these authors said that the reason why the
simulations are different is because they don’t know
how to calculate their mutual information
accurately from simulation. So, we’re not gonna get into that because that’s not the point. The point is that we use this sort of as a starting point to try to
extract robust features. Because, again, there’s nothing
to do with robustness here, right, it’s just we want features to be an accurate way to predict the labels and then also be good for compression. Okay, um, but before I move on, so another nice thing
about this formulation, well, in general, it’s
kind of nasty, right? You wanna optimize over all
possible functions of your X’s. But there was a paper that showed that at least when X and
Y are jointly Gaussian, you can actually characterize what the extracted features are. And the hope is that if
things are jointly Gaussian and you can write down the exact formula for what the features should be, hopefully it’s something realistic. Hopefully it’s something
meaningful that you can actually, like in the Gaussian mixture case, plot the result and say, “Okay,
it’s doing the right thing.” So it was shown in this
paper that in fact, somewhat surprisingly perhaps, if you look at any function of your X’s, it turns out that the best thing to do is a Gaussian mapping.
any function of your X’s through any non-linear
transformation or whatever, the best thing you can do is actually to look at Gaussians. All right, and then it was
shown that the right projection has to do with eigenvalues
and eigenvectors of the joint covariance matrix between the X’s and the Y’s. Mmkay. So what did we do? Well, we started thinking about how you can try to encourage robustness when you’re creating a
mapping from X’s to T’s. And what we ended up deciding on was a Fisher information
term as a regularizer. So this Fisher information term might look a little bit weird from what you’re used
to seeing in statistics, but it does have this
idea of a score function, the gradient of the log of the density. And normally what you have is you have the density as
parametrized by parameter. In this case, we’re going
to parametrize things by X, because what is the
Fisher information doing? It’s telling us something
about the curvature of a distribution, of a density with respect
to changes in a parameter. And now what we want is we want that if X changes a little bit, we don’t want to change the distribution of T very much, right? Because we’re trying to
encourage stability of the T’s. So we hope that, if you look
at the conditional distribution of T given X, small changes in X don’t change the conditional
density too much. And then the other stuff is
just sort of an expectation, because in a Fisher information, you have an expectation. So we’re taking an
expectation with respect to X and with respect to T. Um, so as I said here, in some sense, a smaller version of phi means that these gradients of the log of the conditional density
aren’t gonna be very large, and therefore, we should
have some stability with respect to changes in X. Also, I should say that
this is sort of on average. So earlier, I was motivating all of this by saying that we have these
adversarial perturbations. And ideally, we want this
method to actually be robust to adversarial perturbations, but for starters, if we’re trying to extract
some robust features, we’re just gonna look
at average case analysis because it’s easier to
compute things in that sense. Okay, so this leads to the
robust information bottleneck, as advertised in the title. So what you see here is you see again these I of T;X and I of T;Y terms. And then there’s going
to be a parameter gamma that trades off the two, the compression and the prediction. And then there’s going to be another term which is the regularizer, and that is this Fisher information.
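Written out in symbols (this is my reconstruction from the talk; the exact placement and naming of the trade-off parameters may differ in the paper), the robust information bottleneck objective is roughly:

```latex
\min_{p(t \mid x)} \;\; I(T;X) \;-\; \gamma \, I(T;Y) \;+\; \beta \, \Phi,
\qquad
\Phi \;=\; \mathbb{E}_{X,T}\!\left[ \bigl\| \nabla_{x} \log p(T \mid X) \bigr\|^{2} \right],
```

where the first two terms are the usual bottleneck trade-off and the Fisher-information term penalizes features whose conditional law is sensitive to small changes in X.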
And you might ask, a priori, well, why did you use Fisher information? It came out of nowhere. And really, the clean answer is that it turns out that in the Gaussian case, you can write down what the solution is to all of this. And that’s why, well,
clean with mutual information and they also become clean if you add a Fisher
information regularizer. Um, there was another
formulation that we also studied that we could also kind of analyze nicely, and this is where we sort of pulled away from this information bottleneck, because, really, we weren’t starting with information bottleneck, we were sort of starting with how do you make things more robust? What sort of regularizer should you add? And it seemed like this Fisher information seems pretty reasonable, right? I mean, you change your X’s by a bit, it doesn’t change the
distribution of T too much. Um, so you can also think about other ways of trying to say that
T is predicting Y well. ‘Cause really, the
regularizer is saying that the changing X doesn’t change T too much, and then we want something that says that T is actually predictive for Y. So you could imagine looking at a minimum mean squared error. So a minimum mean squared
error is really just saying what is the best
predictor that I could use as a function of T, and that’s going to be
expectation of Y given T, and then the mean squared error of that. All right, and it turns out
that you can kind of analyze these in similar ways, which
I’ll get to in a moment. Um, so, one comment is that while the MMSE method is
potentially preferable, if you’re really looking at distances between inputs and outputs, so mutual information is
kind of an abstract thing. It tells you how independent things are. But if you really want
to measure prediction in terms of squared accuracy,
then maybe MMSE is better. But that’s kind of just a speculation. Another thing that’s important to note, which is relevant for the computations which I won’t go into too much detail on, is that it turns out that
these objective functions are invariant to scalings of T. And this is something that we liked because you would hope
that if you say that certain features are more robust, then if you would scale
them up by a hundred, you know, they’re still equally robust as if you didn’t scale them up by a hundred. This becomes useful in our theory because what we’ll show actually is that in these settings as well,
in the Gaussian case, T should just be a linear
transformation of the Gaussian. And then, what we can
do is we can rescale T’s so that the error term
has identity covariance and just optimize over A. So this becomes a linear algebra problem. You can rescale your T and make your residuals have identity covariance. Okay. A couple other things
about this regularizer. So, I sort of pulled it out of nowhere. As I said, it was
motivated by the question of how sensitive is T,
or the distribution of T, to perturbations in X. It turns out that it also
has some nice properties, which follow from some
information-theoretic analysis. So you can think about the regularizer as a first-order approximation of changes in the mutual information. So, it turns out, and this follows from a result in information geometry called de Bruijn’s identity, that if you look at the mutual
information between X and T, and you perturb X by a little Gaussian, so you add independent Gaussian noise, then you get the Fisher
information as a first-order term. So this says, well, all right,
if we wanna make this small, then we wanna make sure that the information content of X relative to T doesn’t change too much if I perturb X. Which is also consistent
with what we were saying about robustness. Another nice property
is that it turns out, and maybe don’t stare at this too hard, is that if phi is small, then in fact a certain
KL distance is small. So, the KL distance is between the conditional distribution of T given X and T given a perturbation of X. And so, we can show that, in fact, KL perturbations are also
small if phi is small. So these are just sort of justifications for why phi is sort of
a nice thing to look at. Um, and it turns out that, well, by Pinsker’s inequality, if
KL is small then TV is small, if TV is small then
Wasserstein is also small, so a small version of phi is perhaps something that
really should be desired. Okay, um, a little bit more
about this regularizer. Again, trying to convince you
that it’s a nice thing to do. So you could imagine different settings. One is where you’re
extracting your features in some linear way. So, here, this is just
a linear transformation of X plus Gaussian noise. X could be Gaussian, X
could be a Gaussian mixture, it could be whatever you want. And it turns out that you
can actually calculate what phi is, because phi is this Fisher information that depends on the conditional
distribution of T given X, and you end up with basically
the sum of the squares of the entries of A. And, this also agrees with intuition, because what happens is that we see that a small value of phi, so more robust features, corresponds to a small
value of the norm of A, and that means low signal to noise ratio. And normally, you think that
low signal to noise ratio is a bad thing, but on the other hand, if you
want things to be more robust, all right, so you want your
T to not change too much, then what’s the most
robust thing you can do? Just generate independent features, right? I mean, that’s like
terrible for prediction, but it’s really robust. And so we see that making the
signal to noise ratio small is in fact what is encouraging
robustness in features. Of course, we would not
be choosing a tiny SNR because we have the other term too, and that’s either the MMSE or the information bottleneck term. So that’s trying to
give us good prediction in addition to robustness. Um, we can play a similar game where you let your T be binary features, and if you look at binary features, then maybe these are parametrized by some logistic parameter, w, and you can try to calculate exactly what this phi term looks like. And it turns out that the
phi term also depends, in some sense, on a signal to noise ratio, and then there are other
terms that come from how confident your predictors are. So, this is like P times one minus P, and we know that P times
one minus P is minimized when you’re at zero or one. So it’s telling us that if
we want to be very robust, then we want the predictions
to be very confident. You’re either very confidently positive or very confidently negative. Okay. All right, so, that’s to convince you that phi is sort of a reasonable
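The binary case can be checked numerically too (again a sketch of mine, with a hypothetical weight vector w): for T ~ Bernoulli(sigmoid(w . x)), the score with respect to x works out to (T - p) w, so the Fisher information term is p(1 - p) ||w||^2, which is smallest exactly when the predictions are confident.

```python
import numpy as np

rng = np.random.default_rng(1)

def fisher_info_binary(w, x, n_draws=500_000):
    """Monte Carlo estimate of E ||grad_x log P(T | x)||^2 for
    T ~ Bernoulli(sigmoid(w . x)); the score is (T - p) w, so the
    exact value is p (1 - p) ||w||^2."""
    p = 1.0 / (1.0 + np.exp(-w @ x))
    t = rng.random(n_draws) < p
    scores = (t - p)[:, None] * w          # rows: (t_i - p) w
    mc = np.mean(np.sum(scores**2, axis=1))
    return mc, p * (1 - p) * np.sum(w**2)

w = np.array([2.0, -1.0])
mc, exact = fisher_info_binary(w, np.array([0.1, 0.2]))     # p = 1/2: large phi
mc2, exact2 = fisher_info_binary(w, np.array([3.0, -3.0]))  # p near 1: tiny phi
print(exact, exact2)  # the confident predictor gives the smaller value
```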
thing to look at. Um, then the next thing
that we sort of looked at was, well, okay so, well, we could say, “Let’s just use this in practice, “but before that, maybe
let’s try to look at “the theory behind it.” In a very special case, when
we can actually understand what the joint distribution of X and Y is, just like the information bottleneck, can we analyze what sort
of features are extracted? And so that’s what we did here. We looked at the case when X and Y are jointly Gaussian random variables, and it turns out that you can also show, though it takes quite a bit of work, that the optimal choice of T is, again, a jointly Gaussian random variable. And maybe the very short summary is that it has to do with the fact that Gaussian random variables maximize entropy subject to a
bound on the covariance. So you kinda want to look
at your objective function and then move it around in such a way that you end up with a statement that says Gaussian is the best thing to do. So then the game becomes, as I said, look at T, which is some linear projection
of X plus Gaussian noise, assume that your Gaussian
noise is standard normal, and then optimize over A. So then you do a lot of linear algebra, and it turns out that the
linear algebra actually kind of works out. So, this is the MMSE formulation, this was the MMSE of T
as a predictor for Y. So, in Gaussians, you
can easily write down what the conditionals are,
expectation of Y given T, and you can write down
what your MMSE should be. So you get something like this. And then you move it around a bunch, and you end up with an answer
that your projection matrix, so these are the directions in which you want to project X, presumably
gives you robust features. There are sort of two
things that are going on. One is that, again, it has
to do with eigenvectors of an appropriate matrix. Okay, so what this means is that if you have jointly Gaussian X’s and Y’s, then hopefully, the direction
that you’re projecting in that’s predictive has to do with the
covariance between X and Y ’cause that tells you the
shape of the distribution. And so what falls out of
this is that you again have some eigenvectors of the
proper covariance matrix, and then you also have a little
bit of scaling and truncation, and maybe I’m not gonna
go into detail on this, but let’s just say that a similar type of result comes out if you analyze the
information bottleneck itself, without the phi term. And then what happens
when you add the phi term to make things more robust is that you have a slightly
different set of eigenvectors and you have a slightly
different set of scalings. And in particular, the scaling is a little
bit of a truncation, so beta was this regularization parameter, that said how much do I
care about robustness. And if beta gets gigantic, then what happens is that
all of these d’s become zero. So basically what that’s saying is that I had these eigenvectors and I’m gonna scale them by something and I’m gonna scale them all to zero. So basically, I’m just gonna make my T independent noise all over again. ‘Cause I said if I make A zero, then I’m just generating
independent noise. But of course, the more
interesting case is when beta is set to be
something in between, so I actually am doing
something with prediction, I’m getting some notion of
how predictive X is for Y in certain directions, and then I’m truncating
certain vectors as well. Okay. Similarly, you can analyze the mutual information formulation, I am not going to get into that, but in some sense, it
has a similar flavor. So this had some d times U, and this also has some d times U. It’s a different d and it’s a different U, but the U comes from certain
eigendirections as well and the d is another scaling. And you can show that, in fact, larger beta also squashes things to zero. Okay. But then there’s something else, so. Okay, so now we believe that the Gaussian case is doing
something reasonable, so therefore, maybe we
should just go ahead and use this method. But then how do we actually use it, right? So, we’re in a setting where
we have X’s and we have Y’s. So, I’ll show you some preliminary results for MNIST data, so like
really concrete, right? We actually have images,
and we actually have labels, and how do we go about extracting T? And so nobody would think that X and Y are actually jointly Gaussian. So, we go back to this sort of formulation and we say, “Well, how
do we optimize this?” So there’s just absolutely
no way you can do this for an arbitrary distribution
of T as a function of X. Fortunately, there’s some
work in this direction that was done in the case of
the information bottleneck. So remember, sort of, from
a philosophical point, what we’ve done is we’ve
added a regularizer. So the information bottleneck itself just had the first two terms. And so this paper, which
was an ICLR paper from 2017, said, “Okay, so, great,
people have shown theory “for the Gaussian case. “But how do I actually
use this in practice?” And so they came up with
some variational method for, hopefully, extracting
some reasonable features, where you just have the first two terms. The idea is, well, in
variational inference, so there are usually a couple
things that are going on. One is that you sort of
restrict your search space to something that’s more tractable, and another thing you do is if you’re minimizing something, you minimize some nicer
function, which is an upper bound for
the function you’re optimizing. So there are two things. One is that we’re gonna look at a certain class of functions and hope that we have meaningful features extracted from that. And the other thing we wanna do is we wanna come up with
an appropriate upper bound of this kind of nasty objective that can be optimized appropriately. Now, I wish I could tell you
that we had a convex function that upper-bounds this,
and everything is great. Well, not quite. But the thing that we do have is we have some function
that upper-bounds this, which can be optimized efficiently using stochastic gradient descent. So stochastic gradient descent is a very big deal in neural networks because that’s the only
way that you can really attempt to optimize something
where it’s so complicated, there is such a high dimensionality
of your parameter space and tons and tons of data. So whatever you’re trying
to do with neural networks, you have to be able to have some sort of stochastic
gradient descent like minimization going on. And if I tell you to just try
stochastic gradient descent on this objective, you’ll say like, “What is that?” I mean, it’s not very
easily written down as empirical risk minimization, you can’t just say, “Draw
some of your samples “and take your gradient steps.” Okay, um. And there’s a related paper
that I’ll kind of touch on that is related to both this
information bottleneck paper and also what we did. Okay, so here is where it
might get kind of confusing because it’s sort of stepping
through some language and some terms that might
be a bit unfamiliar. So the idea is that we’re
going to train a neural network that extracts certain robust features. So I said that we want to
optimize just over a certain class of conditional distributions. Because we want to optimize over all possible transformations of X and that’s impossible, so let’s just look at Gaussians. So we’re not saying that X
and Y are jointly Gaussian, we’re just saying that whatever
features we’re extracting are going to be a Gaussian
function of the X’s. And that can be parametrized using theta, where theta is going to be something with the weights of the neural network. So it’s not as simple as
saying theta transposed X, it’s saying, “You have a neural network, “and there are weights
corresponding to all the layers, “and somehow, that’s getting
“scrambled into everything “in your extracted features T.” So what you really want is you want to train some
encoder neural network, which is giving you these features. All right. But, so how do we actually do this? ‘Cause that sounds great, but we need to somehow
get back to this function. And I was saying that we need to come up with some surrogate function
that can be optimized using stochastic gradient descent. All right, so let’s focus on the phi term. So, really, we need to
come up with a bound for all three of the terms. If we focus on the phi term, then because of the nice parametric form that we’re assuming for this feature extractor, this tells us what T given X looks like. And now we want to look at
the Fisher information term, and that has to do with the squared gradient of the conditional density, inside an expectation of
all these other things. But at least if we start to write down what the gradient of
the density looks like, then we can write down something, and, I mean, don’t look at that too hard, but the important thing is that what shows up is these parameters in my encoder network and the gradient in respect to X. And so I told you very early
on with this FGSM method that it’s easy to take
gradients with respect to X of a loss function, of the parameters, or whatever you wanna do, and this comes when you’re
training your neural network and things come through back propagation. Okay. So, then the point though is
that we have two terms, right? We have a phi term and
then we have another term. And I’ll punt on the mutual
information framework, but let’s imagine for now
that it’s MMSE, all right. So if the MMSE were there, then I would need to come up
with some upper bound of MMSE and use that in my whole machinery of doing stochastic gradients
and optimizing things. So what’s a good upper bound on MMSE? Well, MMSE is minimum
mean squared error, right? So that means that an upper bound for that is any function I can possibly think of that I’m applying to T, and just the mean squared error of that. Okay, so the idea in this
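That bounding step is easy to see numerically (a toy example of mine, not from the talk): for jointly Gaussian T and Y, the mean squared error of any decoder applied to T sits above the MMSE.

```python
import numpy as np

rng = np.random.default_rng(2)

# Jointly Gaussian toy: Y = 2T + noise, so E[Y | T] = 2T and MMSE = 1.
t = rng.normal(size=500_000)
y = 2 * t + rng.normal(size=t.size)

mmse = np.mean((y - 2 * t) ** 2)                   # MSE of the optimal decoder
decoder_mse = np.mean((y - (1.5 * t + 0.3)) ** 2)  # some other decoder of T

# The MSE of an arbitrary decoder upper-bounds the MMSE, which is why
# training a decoder network in parallel gives a valid surrogate objective.
print(mmse, decoder_mse)
```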
deep information bottleneck is that I’m gonna train another network, and that other network
is a decoder network. So basically the idea is
that I have two networks and they’re being trained in parallel. One is extracting these features which are hopefully
getting better and better. And the other is a decoder network that is simple, let’s say, a
logistic regression decoder, that’s also being trained in parallel, it’s also getting more and more accurate as I’m running this whole process. And importantly, all
of these terms, well, this was hinted at in the calculation I gave you earlier: the gradients of all of these terms come down to gradients of parameters, which are things that I can
recover from the neural network. So I think I had a picture of this. It shows up bigger on another slide, but maybe I won’t have
time to get to that. The basic idea is that I
have my encoder network, so I had my X, and from my X, I’m
extracting these features, mu and sigma, those are the means and the covariances. And theta is all the
parameters that come into this encoding network. And then I have the decoder, and the decoder is saying
that whatever T I got, I’m going to use that to predict the Y. And the idea is that the loss function, well, the loss function had the MMSE stuff and it had the phi stuff, so the loss function and its gradients will depend on the parameters
of the decoder network, the parameters of the encoder network, and also the gradients of mu and sigma with respect to X. Okay. So, I’m not gonna have
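The actual experiments train neural encoder and decoder networks in TensorFlow; as a hedged toy stand-in (the linear parametrization and all names here are mine), here is the same joint training loop with a linear encoder T = Wx + N(0, I), where the phi term is exactly the squared Frobenius norm of W and the MMSE term can be averaged over the feature noise in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 2000

# Toy data: y is a noisy linear function of x.
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def loss_and_grads(W, v, beta):
    """Objective E||y - v . T||^2 + beta * phi for T = W x + N(0, I_k).
    Averaging over the feature noise analytically gives the residual term
    (y - v^T W x)^2 + ||v||^2, and phi = ||W||_F^2 (the Fisher information
    of the Gaussian channel, as in the linear example earlier)."""
    r = y - X @ W.T @ v                              # residuals, shape (n,)
    loss = np.mean(r**2) + v @ v + beta * np.sum(W**2)
    gW = -2 * np.outer(v, (r @ X) / n) + 2 * beta * W
    gv = -2 * (W @ (X.T @ r)) / n + 2 * v
    return loss, gW, gv

def train(beta, steps=1500, lr=0.01):
    """Jointly train the 'encoder' W and 'decoder' v by gradient descent."""
    W, v = 0.1 * rng.normal(size=(k, d)), 0.1 * rng.normal(size=k)
    for _ in range(steps):
        loss, gW, gv = loss_and_grads(W, v, beta)
        W -= lr * gW
        v -= lr * gv
    return W, v, loss

W_small, _, loss_small = train(beta=0.01)
W_big, _, loss_big = train(beta=5.0)
# Cranking up beta shrinks the encoder norm: more robustness, lower SNR.
print(np.sum(W_small**2), np.sum(W_big**2))
```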
time to go through this, so let me just go to here. So then, we had some experiments. And so, I should say this is the part that’s really unfinished. So neither of us had any
clue how to run TensorFlow. Varun taught himself
sometime in the last month and ran some experiments
a couple days ago. So still very much a work in progress. So we thought, “Okay, let’s
look at some synthetic data “and then let’s look at MNIST.” I mean, I’m told that in
the deep learning community, MNIST is not good enough, you need to do ImageNet. But, I guess as theoreticians, we tried MNIST because it
was, at least, real data. Okay, so we go back to our Gaussian case. So our Gaussian case said that
we have a Gaussian mixture. And we choose our distributions so that one direction has a much
higher variance than the other so we get these fat pancakes. And then we assume that we’re
extracting linear features. So what we want to see is we want to see that
if our beta parameter, which is telling us how much
we care about robustness, is not there, then we end up with, well, the
maximum likelihood solution. Okay, so I mean, that should be true regardless of anything that I said about information bottleneck. What we also wanna see is that if beta gets bigger, so I care about robustness, then my boundary will
tilt from being this one to being something kind
of standing upright more, like I showed you in
one of my earlier slides. So the plots are not
really very pretty, but basically, this is in fact what happens. So we see that the slope of w, so it went from being sort
of the classifier that had the slope that corresponded
to this big stretch, so the simulations were 16 to one, and so we see some sort
of a 16 to one slope. But then as we crank up beta, then it turns out that the
slope will get smaller. So what that corresponds to is
that I’ll go from this thing to that classifier. And the other thing that happens is that as beta increases, the
magnitude of w decreases, and this is something that I mentioned in an earlier slide as well, which is that the robustness has to do with signal to noise ratio. So, if you want lots of robustness, you should make w small because then that will make
things really independent. Okay, then we looked at MNIST. So MNIST is these handwritten digits you wanna classify into
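To see that tilt-and-shrink effect in miniature (my own cartoon, not the paper's closed-form solution): take a class mean mu and a within-class covariance stretched 16 to 1 as in the simulations, and compare the unregularized discriminant direction Sigma^{-1} mu with a ridge-style stand-in (Sigma + lambda I)^{-1} mu.

```python
import numpy as np

# Within-class covariance stretched 16:1, as in the talk's simulations.
Sigma = np.diag([16.0, 1.0])
mu = np.array([1.0, 1.0])

def w_reg(lam):
    """Regularized discriminant direction (a stand-in for the paper's
    beta-dependent solution): (Sigma + lam * I)^{-1} mu."""
    return np.linalg.solve(Sigma + lam * np.eye(2), mu)

w0, w_rob = w_reg(0.0), w_reg(50.0)
# Larger lam shrinks the magnitude of w ...
print(np.linalg.norm(w0), np.linalg.norm(w_rob))
# ... and tilts the boundary upright: the component ratio moves toward 1.
print(w0[0] / w0[1], w_rob[0] / w_rob[1])
```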
zero, one up to nine. And we did something sort of similar, so we said, “Let’s try to extract
a two-dimensional feature, “and let’s see what happens.” So really, here, we had
to train an encoder network and train a decoder network, and what we found, and these
are preliminary results, is we found that as you increase beta, in fact, there is a bit of a trade-off between accuracy and robustness. And so I say this with an asterisk. So, we showed that there was a trade-off in accuracy and robustness in the sense that accuracy
is in fact going down as beta is getting bigger. So this is saying, “We
just care about robustness, “we don’t care about accuracy.” So you can show that for
your trained classifier, the accuracy is getting worse. On the other hand, you can show that your Fisher information
term is getting smaller as beta is getting bigger. So bigger regularization parameter means that the regularizer
should be squashed to zero and if the regularizer
is squashed to zero, then things are very robust. All right, so this is as far
as I’ll go with the simulation. What I really wish I could show you was some pictures saying, “Look, now that I did
these adversarial attacks “on my neural network, and
“it stays robust. “And this is the way that things change “as I attack my network.” But we didn’t quite get there yet. I also don’t have a visualization yet of what these features look like. So ideally, one can draw what
are called saliency maps, so that tells you, if
you look at your image, it’ll highlight certain portions
that are involved in your feature, in your classification. It would be really nice to say, “These are the original
features without robustness “and these are the
features that it looks like “when there is robustness.” Like maybe certain curves of the numerals that show up in your image data. Okay, so, just to wrap up, what we studied was this
new robust information bottleneck method for
extracting some robust features. We developed the rigorous theory, actually this is the only
rigorous theory in our work, but this is for the Gaussian setting, and it was a lot of
heavy-handed linear algebra, but we showed that, in
fact, you can write down, characterize what the solution
is and it makes some sense. We also provided a variational
optimization method, which means that you can actually run this on large-scale data, whether or not it gives
you something meaningful, well, the jury is still
out on how to interpret whether these features
are actually robust. But I do think that the
preliminary results are promising. In the future, well, as I said, I would really like to understand if these neural networks are
giving you more robust features. So what I’ve shown you is that the phi term is getting smaller, and I infer from that that
things are more robust, but we would like to see
what the neural network is actually doing. And also, I think, well, this is just sort of
scratching the surface of this idea of robust feature extraction, so what does it mean? How do we come up with
some alternative methods for extracting the most robust features, or just ranking features in
terms of how robust they are? So, that’s it, thanks for your attention. (audience claps)
