– So, I guess I’ll also start my talk with a bit of a caveat. So this is very recent work. It says, “With thanks to

the Simons Institute.” So over the summer, I was

visiting the Simons Institute at Berkeley where they had

a program on deep learning. And really, the reason

I was there was because one of my very close

friends, Matus Telgarsky, was one of the organizers. And I told him when he invited me that I didn’t really do any deep learning and he’s like, “Okay,

well you can get started.” So, this was sort of mentally

like a summer internship, and what you see now, well, it’s August, this is like the report on what I did. So this is kind of still in progress, I’m not really sure when

we’ll be done with the paper. So I apologize if some

of the ideas or equations are permuted, because that’s

kind of how it still is in my head. This is joint work with my

colleague and collaborator Varun Jog who is in ECE at Madison. Okay, so as we know, neural

networks are very accurate. So you feed in lots of training data and they classify very, very

accurately on test data. But there’s a phenomenon that

was observed several years ago and has drawn a lot of attention, particularly in the theoretical community, and it’s a curious

property of neural networks that even though they classify very well, let’s say to 95 plus percent accuracy, what happens is that if

you take a neural network that is trained to classify

images very accurately, and you choose some sort of

adversarial tweak to the image, so in this case, this is a panda that was classified accurately, and then you choose some

very small perturbations, so the magnitude of each

pixel is perturbed by at most epsilon, and you choose

that perturbation correctly, then you can make the

classification incorrect. So, there’s another animal called a gibbon and you feed this new

image, which basically looks just like the panda image,

into your neural network and it will fool the neural networks. The neural network will decide that this is something different. And this is a problem because in practice, people like to use neural

networks to classify images and they want to do better than humans, and even though they might

do a little bit better than humans on so-called clean images, the problem is that if you

choose these adversarial attacks, then you can sort of hack the network and make it predict incorrectly. And this is bad if you have

an application in security or medical imaging or something like that and you want them to be more robust. So I’ve been quite

fascinated with this property for a while when I first heard about it. And there are various

methods that people have for trying to deal with this. So, in my head, to kind of

sort out what this phenomenon is actually referring to, I like to think of a

classification boundary. So, imagine that you

have images of one type, let’s say the pandas are over here and the gibbons are over there. And what your neural network is doing is it’s creating a decision boundary that’s classifying very well. So in this case, it’s

100 percent accurate on all of the training images. And in fact, if you populate

it with some test images, then they probably will also do very well. At least, that’s the way that

we see the experiments going in actual applications. But the issue, so what’s going on here is that even though things are

classifying correctly, somehow the neural network is, in a way we don’t quite understand, creating the classification boundaries so it’s really, really close

to the images that we feed in. So I call this sort of a

conservative predictor. So it’s creating the correct

classification boundary, but there are lots of correct

boundaries it could be drawing that separate the pluses from the zeros. And it’s separating things in such a way that you get really, really close, so if you choose some

sort of perturbation, let’s say to this red circle, then even though it’s

classified correctly, you can just perturb it by a tiny bit and go over the line, and so this is what

neural networks are doing. So there are a couple

methods that people use to try to combat this. One of them is known as data augmentation, and the first reference I found of this was from analysis of text, actually. There may be other references before that. So the idea is that you

still have the original red and blue points, but you kind of tell your neural network, “I don’t want to just classify

these images correctly, but I want to classify them and some ball around them correctly.” And so what I’m drawing here is some sort of enlargement, so you can imagine a dotted circle around each of these points, and to say that everything

that is red plus dotted has to be on one side and everything that is blue plus dotted has to be on the other side. And so maybe then, the neural network will classify like this, and you’ll get this green line, okay? So the problem with

this, of course, is that theoretically, it sort of makes sense, but then, now imagine that you’re in a

high dimensional image space and you want to, oh, so data augmentation

means that you introduce some other examples that are close by. So it’s like you take

not only this red image, but you create lots of

small perturbations of it. Okay, but the problem

is that you need sort of infinitely many, or a large

number of perturbations because you’re in a very

high dimensional image space. Your dimensionality is the

number of pixels in your image, which could be very, very large. So, it’s not really that practical to say, “Let me just protect every point by lots and lots and lots of points.” There is another method
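As a tiny sketch of what this augmentation step can look like in code; the function name and radii here are mine, not from the talk:

```python
import numpy as np

def augment_with_ball(X, y, eps=0.1, copies=5, seed=0):
    """Make `copies` perturbed versions of each point, each inside an
    L-infinity ball of radius eps, and keep the original label."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    noise = rng.uniform(-eps, eps, size=(copies, n, d))
    X_aug = (X[None, :, :] + noise).reshape(copies * n, d)
    y_aug = np.tile(y, copies)
    return np.vstack([X, X_aug]), np.concatenate([y, y_aug])

X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1])
X_big, y_big = augment_with_ball(X, y, eps=0.05, copies=3)
# 2 originals plus 6 perturbed copies, all within 0.05 of an original
```

This also makes the impracticality concrete: actually covering an L-infinity ball in d dimensions needs on the order of 2^d such points, and d is the number of pixels.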

that is quite popular, and that’s called adversarial training, and there are different

ways to think about what adversarial training is doing. But one way to think about it is to say, “Okay, so we don’t really

need to protect this point by all of the points in the solid ball. Maybe we just need to add

on some other perturbations that are kind of going

the right direction.” So we make sure that that

point is classified correctly, and then some appropriate

points are classified correctly that push the black boundary

to the green boundary to get what we want. A very popular method

for constructing these adversarial perturbations is called the Fast Gradient Sign Method, or FGSM. I don’t want to go into

too much detail about this, but the idea is that you take your point, let’s say it’s the red point, and then you add the correct

direction of the perturbation. And it turns out that what

has been sort of suggested in papers and kind of works somewhat well is where you look at the

gradient of your loss function. Why does this make sense? Because, well, the loss

function is kind of indicating how close

you are to the boundary. So the loss is going to be

pretty small if you’re far away, and then it’s gonna be

bigger if you’re close by. And so you want to kind

of go in the direction of the boundary, so the gradient of the loss will tell you what’s that direction in which things are going to change very quickly. It turns out that this

sort of protects against what are called L-infinity perturbations whereas there are maybe L2 perturbations or different sort of distances. So the idea is that it’s

sort of like impossible to decide based on your training data exactly what the worst

case perturbation is, but this Fast Gradient Sign Method is one way of creating an image that seems to be going

towards the decision boundary, which seems to improve the

robustness of your classifier. It’s called Fast Gradient Sign Method because it’s very fast to compute, and it’s a sign method because

you’re basically looking at the vector of the gradient and just going, epsilon

perturbation, either plus or minus. The other thing I wanna point out, which is why I wrote this down, is that it turns out that
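To make the FGSM recipe concrete, here is a minimal sketch on a plain logistic model, where the input gradient can be written by hand (for a real network you would get the same gradient from backpropagation); the weights and inputs are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, eps):
    """One FGSM step for the logistic loss log(1 + exp(-y * w.x)):
    move every coordinate of x by +/- eps in the direction that
    increases the loss, i.e. toward the decision boundary."""
    grad_x = -y * sigmoid(-y * float(w @ x)) * w   # d loss / d x
    return x + eps * np.sign(grad_x)

w = np.array([1.0, -2.0, 0.5])     # "trained" weights (made up)
x = np.array([0.4, -0.3, 0.2])     # w.x = 1.1 > 0, so classified positive
y = 1.0
x_adv = fgsm(x, y, w, eps=0.5)
# each coordinate moves by at most eps, yet the sign of w.x flips
```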

this is very easy to compute from a trained neural network. So when you think about optimizing things, you usually think about

optimizing your loss with respect to the

parameters, which are theta. So it wouldn’t be too

unrealistic to imagine that if I were talking

about taking a gradient of a loss, with respect to theta, then that should be easy to

do with a neural network, because a neural network is

doing gradient descent anyway. But one thing I want you

to keep in mind for later is that actually, neural

networks can also take derivatives with respect to X, your input. So this is kind of weird

when you think about it as optimization. But when you think about what

a neural network is doing, well, on the first layer, it’s sort of taking your weights, let’s say theta, and

computing theta transpose X, where X is your input. And it does something to that. So from the point of view of what the neural network

is computing and storing, it can also tell you, in addition to the

gradient with respect to theta, what the gradient is with respect to X of the loss function, or, it turns out, the

parameters and so on. And this is one thought

that I want you to hold onto for later. Okay, so I guess this

is what I’m saying here then it turns out that we can

actually take these gradients efficiently when we have

a trained neural network. There’s another method

that I wanna point out because I spent a lot of time

thinking about this earlier in the summer. This is another method for

trying to make neural networks more robust. It’s called randomized smoothing. So basically, the idea is

that if you look at just the first image on the left, well, if you smoothed out

somehow the decision boundary, made it like a straight line, then maybe that would be more robust. And so randomized smoothing

is kind of a way of appropriately pre-processing your data or convolving it with

some kind of Gaussian, which will, in the end, smooth

out your decision boundary. Okay. But then the issue, when you start thinking
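A minimal sketch of the randomized-smoothing idea; the base classifier and all constants are invented for illustration:

```python
import numpy as np

def smoothed_predict(base_clf, x, sigma=0.5, n=2000, seed=0):
    """Randomized smoothing: classify n Gaussian-perturbed copies of x
    with the base classifier and return the majority vote."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n, x.shape[0]))
    votes = base_clf(x[None, :] + noise)   # array of 0/1 labels
    return int(np.mean(votes) > 0.5)

def base_clf(X):
    # a deliberately wiggly decision boundary: x2 = 0.3 * sin(8 * x1)
    return (X[:, 1] > 0.3 * np.sin(8 * X[:, 0])).astype(int)

x = np.array([0.2, 0.05])
pred = smoothed_predict(base_clf, x)
# the base prediction can flip under tiny nudges of x;
# the vote over noisy copies changes much more slowly
```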

about these methods, is that there’s a little

bit of an inherent trade-off between robustness and accuracy. Okay, so what do we mean by robustness? Well, robustness means

that if I have a classifier and I perturb the input by a little bit, the classifier shouldn’t change too much. So if you think about looking at the panda and the gibbon, right? I mean, I have these two input images which look really, really close, so let’s

say X-i is the panda and X-prime is the gibbon,

they look really close. And I want that the classification

is going to be the same. So I want that to be

true of my classifier. On the other hand, I also

want my method to be accurate. And so what you see if you

start looking at these images is well, if I kinda

straightened out this line, then maybe that would be

more robust in some weird way that I need to quantify, but at the same time, you know, the images that actually are sort of

close to the curvy boundary are going to be misclassified. And this is what happens when you try to apply some of these methods. I think I put an image of this, so I took the same images as before and I said, “Well, what if we try to do some data augmentation

and we enlarge things by, this is supposed to be, a big radius.” So we kind of wanna protect

all of these points, and then really, the

green line doesn’t have many choices anymore, it has to follow whatever

curvature is going on here. But who am I to say that

this is the right classifier? This is just a method that I use to try to protect these points. And so we see that as we

try to become more robust, as we try to increase these balls and protect against

perturbations in these balls, then the boundary might become

further and further away from what actually is

supposed to be the case based in the data generating distribution. So, then I started thinking, well, how do we quantify this? So I spent a lot of time

trying to figure out if there were fundamental trade-offs between robustness and accuracy, and I didn’t get very far. Okay, so there’s something else that I started thinking about instead. And the idea is instead

of claiming that there is a fundamental trade-off

between robustness and accuracy of any classifier, what if we tried to

design some classifiers that are inherently more robust? And the way we do this is we look at the features that we’re using to

base our predictions on. So, there’s this image

that I want to look at for a little while. So, what we have here is we

have a mixture of Gaussians, and these are a mixture

of very flat Gaussians. So the covariance in

the horizontal direction is much bigger than the covariance in the vertical direction. Now, if I told you that I

wanted a good classifier, and I didn’t care about robustness, then what would you do? Well, you would look

at something like this, let’s say, maximum

likelihood classification, like logistic regression. And so another way to

think about it is that logistic regression is learning

a parameter vector beta, in this case, it’s two-dimensional, and it’s trying to project

things onto the beta direction. So really, the beta direction

is perpendicular to this. So it’s projecting things

onto that direction, and then it’s classifying. Okay. But my claim is that if we don’t just care

about prediction accuracy and we care about robustness,

whatever that means, what we should be doing is

doing something like this. So this image, or some version thereof, is in a paper on neural networks and took me a while to

understand what was going on because this just looks

like a terrible classifier. But now, in terms of robustness, so we go back to this first image and my claim is that it’s

not a very robust classifier. Because what we cared about is how much I should perturb points in order to put them over the boundary. And so we see that even though

this is a great classifier, most of the points are really

close to the boundary, right? Pretty much all of them, because I chose these two blobs so that they were basically kind of like parallel point clouds to this boundary. On the other hand, if I

look at this classifier, then I see that, okay,

there are some points that are now misclassified. On the other hand, since

these are Gaussian clouds, the bulk of the mass is sort of in the middle of the red cloud. And the bulk of the mass here is in the middle of the blue cloud. And now, how far do I

need to move those points to put them over the boundary? And so you’ll see that it’s

something sort of non-trivial, whereas here, it’s sort of trivial, you just move points by a tiny bit and they go over the boundary. Here, what we see is that

you move some points, and even these are really far away and they’re kind of further

from the boundary, right? So the idea is that maybe, even though the most accurate vector I should be using for my prediction is the beta vector that

corresponds to this classifier, if I now want things to be

both robust and accurate, then I should be basing my

prediction on some other projection or some other feature. So you can think of X-i

transpose beta as a feature. If I were to extract a

one-dimensional feature on which I wanted to do a prediction, then which beta should I choose, so that I do logistic regression

using X-i transpose beta. This is also an example
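The accuracy-versus-margin trade-off in this flat-Gaussian picture can actually be computed in closed form; the specific means and covariances below are my choices, just to make the point numeric:

```python
import numpy as np
from math import erf, sqrt

def accuracy_and_margin(beta, mu, Sigma):
    """Classes are N(+mu, Sigma) and N(-mu, Sigma); classify by sign(beta . x).
    Returns (Gaussian accuracy, distance from the class mean to the boundary)."""
    m = float(beta @ mu)
    s = sqrt(float(beta @ Sigma @ beta))
    accuracy = 0.5 * (1.0 + erf((m / s) / sqrt(2.0)))
    margin = m / float(np.linalg.norm(beta))
    return accuracy, margin

mu = np.array([3.0, 0.5])
Sigma = np.diag([9.0, 0.04])            # very flat, elongated clouds

beta_lda = np.linalg.inv(Sigma) @ mu    # the most accurate direction
beta_mu = mu                            # the direction of the means

acc_lda, margin_lda = accuracy_and_margin(beta_lda, mu, Sigma)
acc_mu, margin_mu = accuracy_and_margin(beta_mu, mu, Sigma)
# the first direction wins on accuracy, the second on distance to the boundary
```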

that’s worth mentioning because I think this is what kind of made a splash in the deep

learning community quite recently. This is another example which is, it’s kind of like this Gaussian point clouds example, but it’s also showing that

there are certain features that might be more robust than others. So the idea is that you

again have a mixture, so you have Y-i, which is your label and it’s either plus or minus. So imagine that that’s like

my two Gaussian clouds. But they’re not exactly Gaussians. So what’s going on is that there’s a feature which is

sort of the robust feature. So this first feature is basically telling you, I mean, it’s gonna be

plus one or minus one, depending what the class is, but then there are some

uncertainties so there’s a flip. And then the other features are sort of weak features. So they are going in the right direction, in the sense that the means are correct. So if the class is plus one, then these clouds are going to be centered at eta times the all-ones vector

with a little bit of noise. So the idea is that if you were

to try to classify correctly with the best accuracy, you should be using all of these features. So, these features tell you a little bit and this feature also

tells you a little bit. On the other hand, if you choose this eta

parameter to be very small, then that’s kind of like saying that the distance to the

boundary is really small. Okay, so we see that,

if you wanna be robust, you actually shouldn’t be

using any of these features, you should only be using the first one, which is the robust feature. Now, you have to think

about it a little bit more, it really depends on the trade-off between how big q is and how big eta is. The point is that if you

make eta sufficiently small, then, really, all these features

are useful for prediction but they’re just not very robust. And so we see that one of the features comes out as being very robust and the others are not so much. So the question again is how do we do some sort

of feature extraction that tells us that we should

be picking this feature and not those? And then the trade-off between

accuracy and robustness is that if I didn’t care about robustness, I would use all the features, but if I do care about robustness, maybe I should rank my

features in a certain way. Then there’s another paper
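A quick simulation of this kind of robust-feature-versus-weak-features model; the constants are my choices, roughly in the spirit of the construction described:

```python
import numpy as np

def sample(n, d=101, q=0.1, eta=0.2, seed=0):
    """Label y is +/-1. Feature 0 agrees with y except with flip probability q
    (the 'robust' feature); features 1..d-1 are 'weak': Gaussian with mean
    eta * y, so each is only slightly correlated with the label."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n)
    x0 = y * np.where(rng.random(n) < q, -1.0, 1.0)
    weak = eta * y[:, None] + rng.normal(size=(n, d - 1))
    return np.column_stack([x0, weak]), y

X, y = sample(20_000)
acc_robust = np.mean(np.sign(X[:, 0]) == y)              # about 1 - q
acc_weak = np.mean(np.sign(X[:, 1:].sum(axis=1)) == y)   # average the weak ones
# averaging many weak features is more accurate, but shifting each weak
# feature by a couple of eta wipes that signal out, while feature 0 survives
```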

I want to mention as well because I also found

this quite fascinating. So this is a NeurIPS paper from last year, and the idea is that in

order to try to build a more robust feature classifier, you should build a graph on your data, and then look at your graph Laplacian. So again, without going

into too much detail, so what I’ve done here is I’ve taken these points

from the earlier picture, right here, and I just

dropped the decision boundary and I said, “Well, let’s

construct some sort of a nearest neighbor graph.” So you get a graph. And the idea with robustness is that you don’t want your

function to change too much for points that are close. And so, because this graph is telling us where points are close, then we want to extract a function that takes on a different value for every one of these points but doesn’t change that much, across edges in the graph. And, you know, these are

ideas that have come out in spectral graph theory. You can talk about regularizing

according to a graph. And so there are methods

that have been proposed for doing robust feature extraction with respect to the graph

that’s placed over your data. All right. So the approach that we took was inspired by these works but also took in a different idea. And so, the idea was to use an information-theoretic approach. Again, the goal is to figure out how to extract robust features. The question is like, “What should you use as your objective and does it actually work?” So, there’s a paper
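A small sketch of the graph idea: build a k-nearest-neighbor graph and use the Laplacian quadratic form to measure how much a function changes across edges; all names and constants here are mine:

```python
import numpy as np

def knn_graph(X, k=3):
    """Symmetric k-nearest-neighbour 0/1 adjacency matrix."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    A = np.zeros_like(D)
    for i in range(len(X)):
        A[i, np.argsort(D[i])[:k]] = 1.0
    return np.maximum(A, A.T)

def smoothness(A, f):
    """Laplacian quadratic form f' L f = (1/2) sum_ij A_ij (f_i - f_j)^2;
    small values mean f barely changes across edges of the graph."""
    L = np.diag(A.sum(axis=1)) - A
    return float(f @ L @ f)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),    # one tight cluster
               rng.normal(3.0, 0.1, (20, 2))])   # another, far away
A = knn_graph(X, k=3)

f_smooth = np.concatenate([np.zeros(20), np.ones(20)])  # constant per cluster
f_rough = rng.choice([0.0, 1.0], size=40)               # arbitrary labelling
# f_smooth pays (near-)zero penalty; f_rough is heavily penalized
```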

that’s quite famous now because it’s somewhat controversial. And this paper was by Tishby et al. The first version of it appeared in 1999. And it was known as the

information bottleneck. So the idea here had nothing

to do with robustness. The idea was that they wanted a method for extracting features, and these features should

be good at predicting whatever the output is, so they should be good at predicting Y, and they should also be

very good at compression, so you don’t want to be too redundant. Like if you just take all your inputs X, then those could be really

good for predicting Y, but maybe some of them are redundant, so you really just wanna

extract out some features T that are not so redundant for X that still are good at predicting Y. This was formulated in this information-theoretic setting. So I refers to the mutual

information between two variables. So remember, X is the

input, Y is the output and T are the extracted features. And we want to impose that the features are conditionally

independent of the outputs given the input, ’cause the

features are just extracted from the input. And what this formulation means is, well, the one thing you need to

know about mutual information is that it’s not negative and furthermore, when it’s zero, that means that two

variables are independent. Okay, so if you want things
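For discrete variables, mutual information can be computed directly from the joint table, which makes the “zero means independent” property easy to check numerically:

```python
import numpy as np

def mutual_information(joint):
    """I(U;V) in nats, computed from a joint probability table p(u, v)."""
    joint = joint / joint.sum()
    pu = joint.sum(axis=1, keepdims=True)       # marginal of U
    pv = joint.sum(axis=0, keepdims=True)       # marginal of V
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pu * pv)[mask])))

indep = np.outer([0.5, 0.5], [0.3, 0.7])     # joint = product of marginals
copied = np.array([[0.5, 0.0], [0.0, 0.5]])  # V is an exact copy of U
# mutual_information(indep) is 0; mutual_information(copied) is log(2)
```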

to be less independent, then this value should be bigger. So what we wanna do is we

want to make I of T;X small, and we want to make I of T;Y big, right? And so basically, it’s saying that we want T and Y to be very related, but then we want T and X

to be not very related, because we’re minimizing

the mutual information between T and X, and small mutual

information between T and X means a weak relationship, with zero mutual information meaning no relationship at all. So we want compression, we want small relationship

between T and X, but we want to predict well, so we need good relationship

between T and Y. So this was known as the

information bottleneck. And this is where things become

a little bit controversial. So, there were various studies and especially with the

new wave on deep learning, claiming that in fact deep

learning, neural networks learn according to this mutual

information bottleneck. So in this case, they

were looking at features that were in a deep network. So you can imagine training

your favorite neural network, it extracts features at each layer, right? So you would just focus

on a particular layer and ask for this particular T. What happens if I compute what

this objective function is? So the hope is that as I learn, things become more compressed

and they predict well. So you would hope that somehow

this objective function is going to be getting smaller and smaller because I’m doing some learning, so I’m trying to extract my features well and I’m trying to predict my outputs well and I’m also trying to do compression. And the authors went a

little bit further than that to say that there are

actually two different phases in neural network learning. And one is known as a memorization phase and one is known as a compression phase. So first, you sort of

learn to extract features you recognize, for instance,

the handwritten digits. So you recognize that there

are certain properties of zero, one, two, up to nine. After that, you try to compress things. And we’re not spending too

much time on this slide, basically the idea is that I of T;Y should be constantly increasing because you’re getting better and better at predicting things. On the other hand, I of

T;X is first increasing and then decreasing. And this picture is

taken from this paper by Schwartz-Ziv and Tishby, and so you see that I of T;Y,

which is on the vertical axis, is always increasing, whereas on the horizontal axis, I of T;X, these orange dots are sort of going, they’re getting bigger and

then they’re getting smaller. So what you see here is

you see different layers in a neural network, and over 50 samples of

random initialization. Now, this paper kind of, well, there was a follow-up paper at ICLR, maybe in ’17 or ’18, where they claimed that the

simulations were sort of very carefully chosen, and this is not what

happens in neural networks. And so if you Google the ICLR open review, you’ll see that there’s this

war between the authors. So the follow-up paper got in,

but these authors said that the reason why the

simulations are different is because they don’t know

how to calculate their mutual information

accurately from simulation. So, we’re not gonna get into that because that’s not the point. The point is that we use this sort of as a starting point to try to

extract robust features. Because, again, this has nothing

to do with robustness, right, it’s just that we want features that are an accurate way to predict the labels and are also good for compression. Okay, um, but before I move on, so another nice thing

about this formulation, well, in general, it’s

kind of nasty, right? You wanna optimize over all

possible functions of your X’s. But there was a paper that showed that at least when X and

Y are jointly Gaussian, you can actually characterize what the extracted features are. And the hope is that if

things are jointly Gaussian and you can write down the exact formula for what the features should be, hopefully it’s something realistic. Hopefully it’s something

meaningful that you can actually, like in the Gaussian mixture case, plot the result and say, “Okay,

it’s doing the right thing.” So it was shown in this

paper, somewhat surprisingly perhaps, that if you look at any function of your X’s, it turns out that the best thing to do is a Gaussian mapping. So if you’re allowed to extract

any function of your X’s through any non-linear

transformation or whatever, the best thing you can do is actually to look at Gaussians. All right, and then it was

shown that the right projection has to do with eigenvalues

and eigenvectors of the joint covariance matrix between the X’s and the Y’s. Mmkay. So what did we do? Well, we started thinking about how you can try to encourage robustness when you’re creating a
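As I recall the Gaussian information bottleneck result (Chechik et al.), the optimal linear projections come from eigenvectors of the matrix Sigma_{x|y} Sigma_x^{-1}, whose eigenvalues lie between 0 and 1; a small numeric sketch, with all the specific numbers mine:

```python
import numpy as np

# Jointly Gaussian X (2-d) and scalar Y = w . X + unit-variance noise,
# so every covariance below is available in closed form.
Sigma_x = np.diag([4.0, 1.0])
w = np.array([1.0, 0.0])              # Y only sees the first coordinate
Sigma_y = float(w @ Sigma_x @ w) + 1.0
Sigma_xy = Sigma_x @ w                # Cov(X, Y)

# Conditional covariance of X given Y, and the matrix whose eigenvectors
# give the optimal linear projections in the Gaussian case
Sigma_x_given_y = Sigma_x - np.outer(Sigma_xy, Sigma_xy) / Sigma_y
M = Sigma_x_given_y @ np.linalg.inv(Sigma_x)
eigvals, eigvecs = np.linalg.eig(M)
# eigenvalues lie in [0, 1]; the small one marks the direction
# that Y explains well, which is the first worth keeping
```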

mapping from X’s to T’s. And what we ended up deciding on was a Fisher information

term as a regularizer. So this Fisher information term might look a little bit weird from what you’re used

to seeing in statistics, but it does have this

idea of a score function, the gradient of the log of the density. And normally what you have is you have the density as

parametrized by parameter. In this case, we’re going

to parametrize things by X, because what is the

Fisher information doing? It’s telling us something

about the curvature of a distribution, of a density with respect

to changes in a parameter. And now what we want is we want that if X changes a little bit, we don’t want to change the distribution of T very much, right? Because we’re trying to

encourage stability of the T’s. So we hope that, if you look

at the conditional distribution of T given X, small changes in X don’t change the conditional

density too much. And then the other stuff is

just sort of an expectation, because in a Fisher information, you have an expectation. So we’re taking an

expectation with respect to X and with respect to T. Um, so as I said here, in some sense, a smaller version of phi means that these gradients of the log of the conditional density

aren’t gonna be very large, and therefore, we should

have some stability with respect to changes in X. Also, I should say that

this is sort of on average. So earlier, I was motivating all of this by saying that we have these

adversarial perturbations. And ideally, we want this

method to actually be robust to adversarial perturbations, but for starters, if we’re trying to extract

some robust features, we’re just gonna look

at average case analysis because it’s easier to

compute things in that sense. Okay, so this leads to the

robust information bottleneck, as advertised in the title. So what you see here is you see again these I of T;X and I of T;Y terms. And then there’s going

to be a parameter gamma that trades off the two, the

compression and the prediction. And then there’s going to be another term which is the regularizer, and that is this Fisher information. And you might ask, a priori, well, why did you use Fisher information, it came out of nowhere. And really, the clean answer is that it turns out that in the Gaussian case, you can write down what the

solution is to all of this. And that’s why, well, things become kind of

clean with mutual information and they also become clean if you add a Fisher

information regularizer. Um, there was another
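Written out, the objective being described looks something like the following; the symbols and the placement of the trade-off weights are my reconstruction from the description, not necessarily the paper’s exact notation:

```latex
\min_{p(t \mid x)} \; I(T;X) \;-\; \gamma\, I(T;Y) \;+\; \beta\, \Phi,
\qquad
\Phi \;=\; \mathbb{E}_{X,T}\!\left[\, \bigl\| \nabla_x \log p_{T \mid X}(T \mid X) \bigr\|_2^2 \,\right].
```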

formulation that we also studied that we could also kind of analyze nicely, and this is where we sort of pulled away from this information bottleneck, because, really, we weren’t starting with information bottleneck, we were sort of starting with how do you make things more robust? What sort of regularizer should you add? And it seemed like this Fisher information seems pretty reasonable, right? I mean, you change your X’s by a bit, it doesn’t change the

distribution of T too much. Um, so you can also think about other ways of trying to say that

T is predicting Y well. ‘Cause really, the

regularizer is saying that the changing X doesn’t change T too much, and then we want something that says that T is actually predictive for Y. So you could imagine looking at a minimum mean squared error. So a minimum mean squared

error is really just saying what is the best

predictor that I could use as a function of T, and that’s going to be

expectation of Y given T, and then the mean squared error of that. All right, and it turns out
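A quick numerical illustration of the MMSE idea, on a made-up jointly Gaussian pair where the conditional mean is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, noise_var = 200_000, 0.5
Y = rng.normal(size=n)                                   # signal
T = Y + rng.normal(scale=np.sqrt(noise_var), size=n)     # feature = Y + noise

# For this Gaussian pair the best predictor E[Y | T] is T / (1 + noise_var),
# and the theoretical MMSE is noise_var / (1 + noise_var)
mse_cond_mean = np.mean((Y - T / (1.0 + noise_var)) ** 2)
mse_raw = np.mean((Y - T) ** 2)                          # naive: predict with T itself
theory = noise_var / (1.0 + noise_var)
```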

that you can kind of analyze these in similar ways, which

I’ll get to in a moment. Um, so, one comment is that while the MMSE method is

potentially preferable, if you’re really looking at distances between inputs and outputs, so mutual information is

kind of an abstract thing. It tells you how independent things are. But if you really want

to measure prediction in terms of squared accuracy,

then maybe MMSE is better. But that’s kind of just a speculation. Another thing that’s important to note, which is relevant for the computations which I won’t go into too much detail on, is that it turns out that

these objective functions are invariant to scalings of T. And this is something that we liked because you would hope

that if you say that certain features are more robust, then if you would scale

them up by 100, you know, they’re still equally robust as if you didn’t scale them up by 100. This becomes useful in our theory because what we’ll show actually is that in these settings as well,

in the Gaussian case, T should just be a linear

transformation of the Gaussian. And then, what we can

do is we can rescale T’s so that the error term

has identity covariance and just optimize over A. So this becomes a linear algebra problem. You can rescale your T and

make your residuals just identity covariance. Okay. A couple other things

about this regularizer. So, I sort of pulled it out of nowhere. As I said, it was

motivated by the question of how sensitive is T,

or the distribution of T, to perturbations in X. It turns out that it also

has some nice properties, which follow from some

information-theoretic analysis. So you can think about the regularizer as a first-order approximation of changes in the mutual information. So, it turns out, and this follows from a result in information geometry called de Bruijn’s identity, that if you look at the mutual

information between X and T and you perturb X by a little Gaussian, so you add an independent Gaussian noise, then you get the Fisher

information as a first-order term. So this says, well, all right,

if we wanna make this small, then we wanna make sure

that information content of X relative to T isn’t

too much if I perturb X. Which is also consistent

with what we were saying about robustness. Another nice property

is that it turns out, and maybe don’t stare at this too hard, is that if phi is small, then in fact a certain

KL distance is small. So, the KL distance is between the conditional distribution of T given X and T given a perturbation of X. And so, we can show that, in fact, the KL divergences are also

small if phi is small. So these are just sort of justifications for why phi is sort of

a nice thing to look at. Um, and it turns out that, well, by Pinsker’s inequality, if

KL is small then TV is small, if TV is small then

Wasserstein is also small, so a small version of phi is perhaps something that
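The chain being invoked here is the standard one; note the Wasserstein step needs a space of bounded diameter:

```latex
\mathrm{TV}(P, Q) \;\le\; \sqrt{\tfrac{1}{2}\, \mathrm{KL}(P \,\|\, Q)}
\quad \text{(Pinsker)},
\qquad
W_1(P, Q) \;\le\; \mathrm{diam}(\mathcal{X}) \cdot \mathrm{TV}(P, Q)
\quad \text{(bounded } \mathcal{X}\text{)}.
```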

really should be desired. Okay, um, a little bit more

about this regularizer. Again, trying to convince you that it's a nice thing to do. So you could imagine different settings. One is where you're extracting your features in some linear way. So, here, T is just a linear transformation of X plus Gaussian noise. X could be Gaussian, X could be a Gaussian mixture, it could be whatever you want. And it turns out that you can actually calculate what phi is, because phi is this Fisher information that depends on the conditional distribution of T given X, and you end up with basically the sum of the squares of the entries of A. And this also agrees with intuition, because what happens is that we see that a small value of phi, so more robust features, corresponds to a small value of the norm of A, and that means a low signal-to-noise ratio. And normally, you think that a low signal-to-noise ratio is a bad thing, but on the other hand, if you want things to be more robust, so you want your T to not change too much, then what's the most robust thing you can do? Just generate independent features, right? I mean, that's terrible for prediction, but it's really robust. And so we see that making the signal-to-noise ratio small is in fact what is encouraging robustness in features.
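Here is a quick numerical sanity check of that claim (a sketch with made-up numbers, not the talk's code): for T = AX + N(0, I), the conditional score is A^T (t - Ax), and averaging its squared norm recovers the sum of squared entries of A, i.e. the squared Frobenius norm, independent of x.

```python
import random

random.seed(0)

# Hypothetical linear feature map T = A x + N(0, I); A is a 2x2 made-up matrix.
A = [[1.0, 0.5],
     [0.3, 2.0]]

frob_sq = sum(a_ij ** 2 for row in A for a_ij in row)  # sum of squared entries of A

# Monte Carlo estimate of phi = E || grad_x log p(T|x) ||^2.
# With T = A x + z, z ~ N(0, I), the score is A^T z, so the estimate
# does not depend on x at all -- only on A.
n = 200_000
total = 0.0
for _ in range(n):
    z = [random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)]
    score = [A[0][0] * z[0] + A[1][0] * z[1],   # (A^T z)_1
             A[0][1] * z[0] + A[1][1] * z[1]]   # (A^T z)_2
    total += score[0] ** 2 + score[1] ** 2
phi_hat = total / n

print(frob_sq, phi_hat)  # the two should agree to Monte Carlo accuracy
```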

Of course, we would not be choosing a tiny SNR, because we have the other term too, and that's either the MMSE or the information bottleneck term. So that's trying to give us good prediction in addition to robustness. Um, we can play a similar game where you let your T be binary features, and if you look at binary features, then maybe these are parametrized by some logistic parameter w, and you can try to calculate exactly what this phi term looks like. And it turns out that the phi term also depends, in some sense, on a signal-to-noise ratio, and then there are other terms that come from how confident your predictors are. So, this is like P times one minus P, and we know that P times one minus P is minimized when you're at zero or one. So it's telling us that if we want to be very robust, then we want the predictions to be very confident. You're either very confidently positive or very confidently negative. Okay.
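As a sketch of that calculation (my toy version, with made-up numbers, not the talk's exact model): if T given X = x is Bernoulli with p = sigmoid(w.x), the score with respect to x is (t - p) w, so the Fisher information term at a point works out to p(1 - p) times the squared norm of w, exactly the "confidence times SNR" shape described above.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Hypothetical logistic feature: T | X = x is Bernoulli(sigmoid(w . x)).
w = [1.5, -0.7]
x = [0.4, 1.2]

p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
w_norm_sq = sum(wi ** 2 for wi in w)

# Direct computation of E || grad_x log p(T|x) ||^2 over T in {0, 1}:
# grad_x log p(t|x) = (t - p) * w, so its squared norm is (t - p)^2 * ||w||^2.
phi = sum(prob * ((t - p) ** 2) * w_norm_sq
          for t, prob in [(1, p), (0, 1.0 - p)])

# Closed form: Bernoulli variance p(1 - p) times the squared norm of w.
closed_form = p * (1.0 - p) * w_norm_sq

print(phi, closed_form)  # these agree
```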

All right, so that's to convince you that phi is a reasonable thing to look at. Um, then the next thing that we looked at was, well, we could say, "Let's just use this in practice, but before that, maybe let's try to look at the theory behind it." In a very special case, when we can actually understand what the joint distribution of X and Y is, just like the information bottleneck, can we analyze what sort of features are extracted? And so that's what we did here. We looked at the case when X and Y are jointly Gaussian random variables, and it turns out that you can also show, though it takes quite a bit of work, that the optimal choice of T is, again, a jointly Gaussian random variable. And maybe the very short summary is that it has to do with the fact that Gaussian random variables maximize entropy subject to a bound on the covariance. So you kinda want to look at your objective function and then move it around in such a way that you end up with a statement that says Gaussian is the best thing to do.

So then the game becomes, as I said: look at T, which is some linear projection of X plus Gaussian noise, assume that your Gaussian noise is standard normal, and then optimize over A. So then you do a lot of linear algebra, and it turns out that the linear algebra actually kind of works out. So, this is the MMSE formulation; this was the MMSE of T as a predictor for Y. For Gaussians, you can easily write down what the conditionals are, the expectation of Y given T, and you can write down what your MMSE should be. So you get something like this. And then you move it around a bunch, and you end up with an answer for your projection matrix, so these are the directions in which you want to project X, that presumably give you robust features. There are sort of two

things that are going on. One is that, again, it has to do with eigenvectors of an appropriate matrix. Okay, so what this means is that if you have jointly Gaussian X's and Y's, then the direction that you're projecting in that's predictive has to do with the covariance between X and Y, 'cause that tells you the shape of the distribution. And so what falls out of this is that you again have some eigenvectors of the proper covariance matrix, and then you also have a little bit of scaling and truncation. Maybe I'm not gonna go into detail on this, but let's just say that a similar type of result comes out if you analyze the information bottleneck itself, without the phi term. And then what happens when you add the phi term to make things more robust is that you have a slightly different set of eigenvectors and you have a slightly different set of scalings. And in particular, the scaling is a little bit of a truncation, so beta was this regularization parameter that said how much do I care about robustness. And if beta gets gigantic, then what happens is that all of these d's become zero. So basically what that's saying is that I had these eigenvectors, and I'm gonna scale them by something, and I'm gonna scale them all to zero. So basically, I'm just gonna make my T independent noise all over again. 'Cause I said, if I make A zero, then I'm just generating independent noise. But of course, the more interesting case is when beta is set to be something in between, so I actually am doing something with prediction, I'm getting some notion of how predictive X is for Y in certain directions, and then I'm truncating certain vectors as well. Okay.

Similarly, you can analyze the mutual information formulation. I am not going to get into that, but in some sense, it has a similar flavor. So this had some d times U, and this also has some d times U. It's a different d and it's a different U, but the U comes from certain eigendirections as well, and the d is another scaling. And you can show that, in fact, larger beta also squashes things to zero. Okay. So now we believe that the Gaussian case is doing

something reasonable, so therefore, maybe we should just go ahead and use this method. But then how do we actually use it, right? So, we're in a setting where we have X's and we have Y's. I'll show you some preliminary results for MNIST data, so really concrete, right? We actually have images, and we actually have labels, and how do we go about extracting T? Nobody would think that X and Y are actually jointly Gaussian. So, we go back to this formulation and we say, "Well, how do we optimize this?" There's just absolutely no way you can do this for an arbitrary distribution of T as a function of X. Fortunately, there's some work in this direction that was done in the case of the information bottleneck.

So remember, from a philosophical point of view, what we've done is we've added a regularizer. The information bottleneck itself just had the first two terms. And so this paper, which was an ICLR paper from 2017, said, "Okay, so, great, people have shown theory for the Gaussian case. But how do I actually use this in practice?" And so they came up with a variational method for, hopefully, extracting some reasonable features, where you just have the first two terms. The idea is, well, in variational inference, there are usually a couple things that are going on. One is that you restrict your search space to something that's more tractable, and another thing you do is, if you're minimizing something, you optimize some nicer function which is an upper bound for the function you're actually trying to minimize. So there are two things. One is that we're gonna look at a certain class of functions and hope that we have meaningful features extracted from that. And the other thing we wanna do is come up with an appropriate upper bound on this kind of nasty objective that can be optimized appropriately.

Now, I wish I could tell you that we had a convex function that upper-bounds this, and everything is great. Well, not quite. But the thing that we do have is some function that upper-bounds this, which can be optimized efficiently using stochastic gradient descent. So stochastic gradient descent is a very big deal in neural networks, because that's the only way you can really attempt to optimize something so complicated, where there is such a high dimensionality of your parameter space and tons and tons of data. So whatever you're trying to do with neural networks, you have to have some sort of stochastic-gradient-descent-like minimization going on. And if I tell you to just try stochastic gradient descent on this objective, you'll say, "What is that?" I mean, it's not very easily written down as empirical risk minimization; you can't just say, "Draw some of your samples and take your gradients." Okay, um. And there's a related paper that I'll kind of touch on that is related to both this information bottleneck paper and also what we did. Okay, so here is where it

might get kind of confusing, because it's stepping through some language and some terms that might be a bit unfamiliar. So the idea is that we're going to train a neural network that extracts certain robust features. I said that we want to optimize just over a certain class of conditional distributions, because optimizing over all possible transformations of X is impossible, so let's just look at Gaussians. So we're not saying that X and Y are jointly Gaussian; we're just saying that whatever features we're extracting are going to be a Gaussian function of the X's. And that can be parametrized using theta, where theta is going to be the weights of the neural network. So it's not as simple as saying theta transposed X; it's saying, you have a neural network, and there are weights corresponding to all the layers, and somehow that's getting scrambled into everything in your extracted features T. So what you really want is to train some encoder neural network which is giving you these features. All right.

But how do we actually do this? 'Cause that sounds great, but we need to somehow get back to this function. And I was saying that we need to come up with some surrogate function that can be optimized using stochastic gradient descent. All right, so let's focus on the phi term. Really, we need to come up with a bound for all three of the terms. If we focus on the phi term, then because of the nice parametric form that we're assuming for this feature extractor, this tells us what T given X looks like. And now we want to look at the Fisher information term, and that has to do with the gradient of the conditional density, squared, with an expectation over all these other things. But at least if we start to write down what the gradient of the density looks like, then we can write down something, and, I mean, don't look at that too hard, but the important thing is that what shows up is these parameters in my encoder network and the gradient with respect to X. And I told you very early on, with this FGSM method, that it's easy to take gradients with respect to X of a loss function, or of the parameters, or whatever you wanna do; this comes up when you're training your neural network, and things come through backpropagation. Okay.
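To make that concrete with a toy sketch (my own minimal example, not the talk's architecture): take a one-dimensional "encoder" whose mean is mu(x) = tanh(w x) with fixed noise variance s^2. Then grad_x log p(t|x) = (t - mu(x)) mu'(x) / s^2, so the phi term at a point reduces to mu'(x)^2 / s^2, and mu'(x) is exactly the kind of input gradient that backpropagation hands you. Here it is checked against a finite difference.

```python
import math

# Toy Gaussian "encoder": T | X = x is N(mu(x), s^2) with mu(x) = tanh(w * x).
# (w and s are made-up numbers for illustration.)
w, s = 1.3, 0.5

def mu(x):
    return math.tanh(w * x)

def mu_prime(x):
    # Analytic input gradient -- the quantity backprop would compute.
    return w * (1.0 - math.tanh(w * x) ** 2)

def phi_at(x):
    # phi at a point x: E[(d/dx log p(T|x))^2] = mu'(x)^2 / s^2,
    # since grad_x log p(t|x) = (t - mu(x)) * mu'(x) / s^2 and E[(T - mu)^2] = s^2.
    return mu_prime(x) ** 2 / s ** 2

x0 = 0.8
eps = 1e-6
finite_diff = (mu(x0 + eps) - mu(x0 - eps)) / (2 * eps)  # numerical check

print(phi_at(x0), finite_diff ** 2 / s ** 2)  # should agree closely
```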

So, then the point is that we have two terms, right? We have a phi term and then we have another term. And I'll punt on the mutual information framework, but let's imagine for now that it's MMSE, all right. So if the MMSE were there, then I would need to come up with some upper bound on the MMSE and use that in my whole machinery of doing stochastic gradients and optimizing things. So what's a good upper bound on MMSE? Well, MMSE is minimum mean squared error, right? So that means that an upper bound for it is the mean squared error of any function I can possibly think of applying to T.
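Here is a tiny numeric illustration of that bound (a sketch with made-up numbers): predict Y from T when Y = T + noise, so the optimal decoder is E[Y|T] = T with MMSE equal to the noise variance; any other decoder, say t -> c*t, can only do worse, which is what licenses training a separate decoder and minimizing its MSE.

```python
# Y = T + noise, with T ~ N(0, 1) and noise ~ N(0, 1) independent.
# Then E[Y | T] = T and MMSE = Var(noise) = 1.
# For a linear decoder f(t) = c * t, the exact mean squared error is
#   E[(Y - c T)^2] = (1 - c)^2 * Var(T) + Var(noise).
var_T, var_noise = 1.0, 1.0
mmse = var_noise

def mse_of_decoder(c):
    return (1.0 - c) ** 2 * var_T + var_noise

# Every decoder in the family upper-bounds the MMSE, with equality at c = 1.
for c in [0.0, 0.5, 0.9, 1.0, 1.3]:
    print(c, mse_of_decoder(c), mse_of_decoder(c) >= mmse)
```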

Okay, so the idea in this deep information bottleneck is that I'm gonna train another network, and that other network is a decoder network. So basically the idea is that I have two networks, and they're being trained in parallel. One is extracting these features, which are hopefully getting better and better. And the other is a decoder network that is simple, let's say a logistic regression decoder, that's also being trained in parallel; it's also getting more and more accurate as I'm running this whole process. And importantly, all of these terms, well, this was hinted at in the calculation I gave you earlier: the gradients of all of these terms come down to gradients of parameters, which are things that I can recover from the neural network.

So I think I had a picture of this. It shows up bigger on another slide, but maybe I won't have time to get to that. The basic idea is that I have my encoder network, so I had my X, and from my X, I'm extracting these features, mu and sigma; those are the means and the covariances. And theta is all the parameters that come into this encoding network. And then I have the decoder, and the decoder is saying that whatever T I got, I'm going to use that to predict the Y. And the idea is that the loss function, well, the loss function had the MMSE stuff and it had the phi stuff, so the loss function and its gradients will depend on the parameters of the decoder network, the parameters of the encoder network, and also the gradients of mu and sigma with respect to X. Okay. So, I'm not gonna have

time to go through this, so let me just go to here. So then, we had some experiments. And I should say this is the part that's really unfinished. So neither of us had any clue how to run TensorFlow. Varun taught himself sometime in the last month and ran some experiments a couple days ago. So this is still very much a work in progress. We thought, "Okay, let's look at some synthetic data and then let's look at MNIST." I mean, I'm told that in the deep learning community, MNIST is not good enough, you need to do ImageNet. But, I guess as theoreticians, we tried MNIST because it was, at least, real data.

Okay, so we go back to our Gaussian case. So our Gaussian case said that we have a Gaussian mixture. And we choose our distributions so that one direction has a much higher variance than the other, so we get these fat pancakes. And then we assume that we're extracting linear features. So what we want to see is that if our beta parameter, which is telling us how much we care about robustness, is not there, then we end up with, well, the maximum likelihood solution. Okay, so I mean, that should be true regardless of anything that I said about the information bottleneck. What we also wanna see is that if beta gets bigger, so I care about robustness, then my boundary will tilt from being this one to being something standing more upright, like I showed you in one of my earlier slides. So the plots are not really very pretty, but basically, this is in fact what happens. So we see the slope of w: it went from being the classifier whose slope corresponded to this big stretch, so the simulations were 16 to one, and so we see some sort of a 16-to-one slope. But then as we crank up beta, it turns out that the slope gets smaller. So what that corresponds to is that I'll go from this thing to that classifier. And the other thing that happens is that as beta increases, the magnitude of w decreases, and this is something that I mentioned in an earlier slide as well, which is that the robustness has to do with the signal-to-noise ratio. So, if you want lots of robustness, you should make w small, because then that will make things really independent.
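The shrinking-magnitude effect can be sketched in one dimension (a toy stand-in for the experiment, with made-up numbers): for a linear feature whose phi term is proportional to w^2, trading squared error against beta * w^2 gives the ridge-style solution w(beta) = Cov(x, y) / (Var(x) + beta), which shrinks toward zero as beta grows.

```python
# Toy 1-d version of the robustness trade-off: choose w to minimize
#   E[(y - w x)^2] + beta * w^2,
# whose closed-form minimizer is w(beta) = Cov(x, y) / (Var(x) + beta).
# (The numbers below are made up; the point is the monotone shrinkage.)
var_x = 16.0      # the "stretched" direction of the pancake
cov_xy = 4.0

def w_opt(beta):
    return cov_xy / (var_x + beta)

ws = [w_opt(beta) for beta in [0.0, 1.0, 10.0, 100.0, 1000.0]]
print(ws)  # magnitudes decrease toward zero as beta grows
```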

Okay, then we looked at MNIST. So MNIST is these handwritten digits you wanna classify into zero, one, up to nine. And we did something sort of similar, so we said, "Let's try to extract a two-dimensional feature, and let's see what happens." So really, here, we had to train an encoder network and train a decoder network, and what we found, and these are preliminary results, is that as you increase beta, in fact, there is a bit of a trade-off between accuracy and robustness. And I say this with an asterisk. So, we showed that there was a trade-off between accuracy and robustness in the sense that accuracy is in fact going down as beta is getting bigger. So this is saying, "We just care about robustness, we don't care about accuracy." So you can show that for your trained classifier, the accuracy is getting worse. On the other hand, you can show that your Fisher information term is getting smaller as beta is getting bigger. So a bigger regularization parameter means that the regularizer should be squashed to zero, and if the regularizer is squashed to zero, then things are very robust.

All right, so this is as far as I'll go with the simulations. What I really wish I could show you was some pictures saying, "Look, I ran these adversarial attacks on my neural network and it leads to robustness, and this is the way that things change as I attack my network." But we didn't quite get there yet. I also don't have a visualization yet of what these features look like. So ideally, one can draw what are called saliency maps, which tell you, if you look at your image, it'll highlight certain portions that are involved in your feature, in your classification. It would be really nice to say, "These are the original features without robustness, and these are what the features look like when there is robustness." Like maybe certain curves of the numerals that show up in your image data. Okay, so, just to wrap up, what we studied was this

new robust information bottleneck method for extracting robust features. We developed rigorous theory, actually this is the only rigorous theory in our work, but this is for the Gaussian setting, and it was a lot of heavy-handed linear algebra, but we showed that you can in fact write down and characterize what the solution is, and it makes some sense. We also provided a variational optimization method, which means that you can actually run this on large-scale data. Whether or not it gives you something meaningful, well, the jury is still out on how to interpret whether these features are actually robust, but I do think that the preliminary results are promising.

In the future, well, as I said, I would really like to understand if these neural networks are giving you features that are actually more robust. What I've shown you is that the phi term is getting smaller, and I infer from that that things are more robust, but we would like to see what the neural network is actually doing. And also, I think this is just scratching the surface of this idea of robust feature extraction. So what does it mean? How do we come up with some alternative methods for extracting the most robust features, or just ranking features in terms of how robust they are? So, that's it, thanks for your attention. (audience claps)