UW Allen School Colloquium: Jacob Steinhardt (Stanford University)


– Good afternoon, thanks for coming. We have quite a treat this afternoon. We have Jacob Steinhardt
from Stanford University. He was advised by Percy Liang and got his undergrad from MIT. His research is very
timely, as machine learning has gotten very, very
popular and applied widely. It’s also been abused widely. Jacob attempts to develop methods that actually defend against attacks against learned models
by people who actually maliciously give bad training
data and things like this. He also believes machine learning should not just be an
optimization procedure but actually give results
that are intuitive and predictable, not just
the output of a black box. A lot of his work is at the intersection of optimization, machine learning, statistical learning theory, and
inference, and computation. He is the recipient of a Hertz fellowship and an NSF fellowship. We are really excited to have him, so please join me in welcoming him. (audience applauding) – Thanks a lot, Kevin. Before I start, I just want to
acknowledge my collaborators, Moses, Ilias, Gautam,
Daniel, Pravesh, Pang Wei, Jerry, Percy, Aditi, Alistair, and Greg. They’ve all been a huge help
throughout all of this process and a lot of this work wouldn’t have been possible without them. So today I’m going to be
talking to you about provably secure machine learning,
but before I dive into that, I want to tell you two stories. The first story is about Twitter. On April 23rd, 2013, Syrian hackers compromised the Associated
Press’ Twitter feed and they put out the following tweet. It reads, “Breaking: Two Explosions in the White House and Barack Obama is injured.” And afterward, this is what
happened to the stock market. There was about a $136
billion drop in three minutes, although it was fortunately reversed. But part of what caused
this was analytics software and other statistical algorithms
being fooled by this tweet. This isn’t the only time that
Twitter has been attacked. Perhaps more famously,
during recent elections such as the US and French elections, bots have taken to Twitter
to try to influence public opinion and one
of the things they do is they put out fake tweets
to try to fool Twitter’s statistical rating
algorithms to show topics that people wouldn’t have seen otherwise. And an important point
in both of these cases is that it’s not the
traditional components of software that are being attacked here, it’s actually the underlying
statistical algorithms. This is the first story,
and the second story is about computer vision over
the past six or seven years, largely driven by
advances in deep learning, open-domain object recognition
has improved substantially to the extent that there’s
even papers stating that they’ve exceeded
human level performance. However, it turns out that the accuracy actually drops to zero in
the presence of an attacker and in fact, has been
zero this entire time. (audience laughing)
Yes. To understand what’s happening here, we can look at a phenomenon
discovered in 2014 by Szegedy et al. called
adversarial images. What this phenomenon is, is they can take essentially any image, in this
case, an image of a panda, add some small imperceptible noise, so small that these images render the same on this projector screen,
and cause state-of-the-art computer vision systems
such as neural networks to misclassify them,
essentially arbitrarily, in this case as a gibbon
with high confidence. This goes beyond pandas and gibbons. It’s been shown that
you can take stop signs, put stickers on them
that look like graffiti and cause them to be
classified as yield signs. You can create
imperceptible audio commands that are, or sorry,
imperceptible audio noise that’s interpreted as a command by your smartphone, and hackers can modify non-functional parts of malware code to evade detection by malware detectors. What’s going on here? It’s not that the people
designing these systems don’t know what they’re doing, it’s due to a fairly fundamental issue with the way that most machine
learning is done today, which is that most
machine learning assumes, either implicitly or explicitly, that the training distribution,
the distribution of data that was collected to build the system, is similar in some way
to the test distribution, the data that the system is deployed on, and an attacker is free to play something that looks very different
and therefore violates the assumption that the
system was built on, which creates an
exploitable vulnerability. Beyond these evasion
attacks at test time, another source of
vulnerability for ML systems, is in the data collection process itself via a phenomenon called data poisoning. Here, the idea is that
often, if data is collected from users, for instance
on a web based system, simply by creating fake user accounts, an attacker can inject
fake data into the system. This is perhaps most familiar
in the case of fake reviews, although it happens
more generally any time one is building
a recommender system, even more generally, when everyone is trying to
collect data in a crowd source manner
and perhaps more subtly, in the case of malware and spam detection because many of
these classifiers are built based on historical data. And since this data includes past malware and spam, this actually creates a
backdoor for attackers and there’s a line of
work showing that this backdoor can
actually be exploited to degrade performance
of the system. Let’s suppose that we
wanted to try to deal with these issues. One thing that is perhaps
tempting is to just design as many attacks as we
can think of in the lab and if a system works
well against every attack we can think of, we might be
tempted to label it secure. However, an attacker
has a strong incentive to find whatever attack
you didn’t think of and this leads to a sort
of arms race which is actually played
out in this literature on adversarial images. So since their discovery
in 2014, there’s been a number of defenses proposed to try to mitigate this issue, but eventually all of these
defenses have been broken. In fact, this has gone
on for quite a while. There’s been at least 100
papers published on this in the last three or four years. But at the end of the
day, it’s really the case that the attacker just keeps winning and usually within a few months
of a defense being proposed, a new attack comes out that beats it. In fact, at the most
recent ICLR conference, which is one of the main
deep learning conferences, the majority of submitted
defenses were actually broken before the peer review process
ended and this was only counting papers that
were eventually accepted. This goes beyond
adversarial images. This has also played out in
the literature on malware and more generally it
happens behind closed doors at many tech companies. This brings us to an
important takeaway, which is that if we only
rely on empirical evaluation against a known set of attacks, then we’re often going to end
up in a security arms race where the defender is usually
going to be playing catch up. The perspective we’re
going to take in this talk is that to avoid this arms race, we should do the following two things. We should first define
a formal threat model, laying out a comprehensive
family of attacks based on the capabilities
that the attacker has, and then design systems where we can prove formal security guarantees
under that threat model. Now, from the perspective of
traditional computer security, this is actually a fairly
common thing to try to do, but it hasn’t been pursued
as much in machine learning, partially because there’s
a number of difficulties. Often the systems are interacting directly with the physical world
so you end up with a very broad attack surface, and also the formal
specification is often unclear. It’s not clear how you would
formally specify the event that a self-driving car
crashes, for instance. Nevertheless, in this talk we will see that it’s possible to
achieve meaningful proofs of security under two
different threat models. To preview what these models will be, we could consider the
following schematic of a machine learning
pipeline, where for instance, if we’re building a
spam detector, we would first collect
training data. So these would be emails labeled by users as spam or non-spam. Then we would train a model to separate the legitimate emails
from the spam emails. And then finally, we
would deploy this model on new incoming test data,
in this case these two green emails here to classify
them as spam or legitimate. As we’ve already seen, there’s at least two sources of vulnerability here. The first is at the training stage where an attacker can
inject fake training data via data poisoning. And the second is at the
deployment or test stage, where an attacker can
manipulate examples arbitrarily. For instance, perhaps
changing irrelevant words in this spam email to cause it to be misclassified as legitimate. The first two parts of this
talk are going to focus on data poisoning and the final
part is going to focus on test time on these adversarial examples. These are going to be in the
traditions of robust statistics and robust optimization, respectively. I’m going to dive right in
and we’re going to start with the first part,
which is data poisoning, and in particular, trying to understand the space of optimal attacks. Again, recall that data
poisoning is where an attacker adds fake data to manipulate
the learned model. Here, we only have clean data and this is the model that we learn, but if an attacker can add fake data, they might cause us to
learn a very different model and in particular, could even cause legitimate email to be misclassified. We can formalize this via
the following threat model where we say that there’s
n points of clean data, and an adversary is allowed to add epsilon n additional poisoned points, and they can add these however they want, and in fact can even add them with knowledge of the clean data. Intuition, here, is that,
or rather the motivation, is that this parameter
epsilon controls the strength of the attacker, so it’s
implicitly making an assumption of an upper bound on, for instance, the number of fake accounts
that an adversary could create before they run out of
resources or get detected. Let’s try to understand
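In symbols, a rough sketch of the threat model just described (my notation, not from the slides): the learner observes

$$\mathcal{D}_{\text{observed}} \;=\; \mathcal{D}_{\text{clean}} \,\cup\, \mathcal{D}_{\text{poison}}, \qquad |\mathcal{D}_{\text{clean}}| = n, \quad |\mathcal{D}_{\text{poison}}| = \epsilon n,$$

where the poisoned points may be placed arbitrarily, even with knowledge of the clean data.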
Let’s try to understand what happens in this model. If we aren’t careful and perhaps do the normal machine learning thing we would do, which is to choose a model that minimizes the overall error on all of our data, so in this case the union of the clean and the poisoned data, well, then the problem is that it actually turns
out that in some cases, even with a single poisoned data point, an attacker could move
the model arbitrarily far just by playing some extreme
outlier that completely dominates all other
considerations in the model. But of course, this is a
very easy thing to detect, so to make things less degenerate,
we’ll assume that we have some form of outlier detector
that’s going to try to detect and remove suspicious looking points and to specify this outlier detector, we’ll specify it in
terms of a feasible set. So this is going to be the set of points that don’t get removed
by the outlier detector. The reason we call it the
feasible set is it’s the places that an attacker can
play without their attack just being useless because
it’s just going to get removed. Yes? – But ordinary regularization
would avoid problems like this without any particular consideration of adversarial manipulation. – I think even if you regularized, there’s situations where a single point can move you arbitrarily far. It depends on your loss function. (student mumbling) – So I agree you could
tune the regularizer and I can maybe talk about
that a bit more later in the empirical section. It turns out that actually sometimes, if you regularize too much, you become even more vulnerable to attack
rather than less vulnerable. Cool. Let’s see, where were we? So, yes, I agree that you could
try to tune the regularizer, but in general, things could
take you arbitrarily far. We’re going to use this outlier detector to try to prevent this. A question, though, is once
we have this outlier detector, is it actually good enough? Even if we’re removing outliers, you could run into problems
where the adversary still has a fair amount of power if your outlier detector
isn’t very specific. We can actually state this a
bit more formally by saying, given a description of the clean data and a description of the feasible set, what is the worst case error
that an attacker can induce, get for a given algorithm? And we can state this even more formally via the following optimization problem, where we’re going to simulate an attacker. The attacker is trying to maximize the overall error on the clean data, where their maximizing
over all poison points that lie within the
feasible set and this error is for the model of
parameters that are minimizing the overall error on the
clean plus the poisoned data. Note, just for clarity, now
I’ve only shown the feasible set for the positive class
and there should also be another feasible set
for the negative class. But to avoid cluttering the
slides, I’m just showing one. Yes? – Classify this, because
usually detectors just try to maximize errors. Detectors try to maximize
a particular type of error. Is this somehow we could learn? – It’s not equivalent,
so you could imagine targeted attacks, as well. Where, instead of, instead of having this
overall error function, you have an error function
on a sub-population. In this, it’s going to
turn out that actually even if you’re just trying
to maximize overall error, the attacker can be quite successful. We’re starting here
because even here it’s actually quite
hard to defend. But yes, targeted attacks
would be even harder. Other questions before
I move on? Cool. We have this optimization problem and this optimization
problem actually has a name. It’s called a bilevel optimization problem, to denote the fact that we have this inner, second-level optimization for a theta hat that’s plugged into the outer maximization.
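Written out, a rough sketch of the bilevel problem looks like this (my notation, not from the slides; $\mathcal{F}$ is the feasible set and $L(\theta; \mathcal{D})$ is the average loss of $\theta$ on a dataset $\mathcal{D}$):

$$\max_{\mathcal{D}_{\text{poison}} \subseteq \mathcal{F},\; |\mathcal{D}_{\text{poison}}| = \epsilon n} \; L\big(\hat{\theta};\, \mathcal{D}_{\text{clean}}\big) \qquad \text{where} \qquad \hat{\theta} \;=\; \arg\min_{\theta}\, L\big(\theta;\, \mathcal{D}_{\text{clean}} \cup \mathcal{D}_{\text{poison}}\big).$$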
If we wanted to compute the optimal attack, which would help us understand how vulnerable or not we are for this outlier detector, we could just try to solve this. The problem is that bilevel optimization in general is quite challenging. And in fact, even if
you’re handed the solution and you just want to verify
that the solution is optimal, this is already NP-hard, even for linear and quadratic functions. In practice, it’s also hard. Another place you could
look is to the literature on robust statistics, where
in fact there is a tool called influence functions,
which give you some notion of how vulnerable a model is to outliers. However, formally these require you to have a negligible fraction of outliers and in practice, also
once you start having even a small constant
fraction of outliers, they no longer necessarily
tell you the right thing. Finally, you might be tempted
to simply design attacks and test empirically, but
as we’ve already seen, this is heuristic and could be
quite far from the worst case and leads to this arms race
that we’ve seen before. This brings us to our contribution, which is based on work with
Pang Wei Koh and Percy Liang where we showed how
to upper bound this bilevel problem by
a convex min-max problem, which can be solved very efficiently. And in particular, since
it’s an upper bound, it gives us certificates of
robustness on a given data set. So this gives you a proof that no attacker can induce more than
a certain error. But also it turns out that we can sort of invert this upper bound
to attain strong attacks on many data sets, as well,
and we’ll see this later. Let’s try to understand how
this min-max problem arises. Recall, our goal is to
bound the error of the worst case attack
which corresponds to this bilevel
optimization problem. We’re trying to maximize
the error on the clean data. And the basic intuition
is that if the attacker is going to do a lot of damage, then they better be
able to find some point that has large error under
the true model parameters. Because if they can’t find
any points with large error, then at least intuitively
with only a small fraction of bad data, they’re not
going to be able to move the model by very much. We can actually formalize
this via the following analytical device of a
super attacker, where what we’re going to
do is we’re going to take our attacker and we’re going
to make their life easier. So instead of only giving
them credit for error on the clean data, we’ll
give them credit for error on either of the clean
or the poisoned data. And now the key point is
that the thing that the attacker is
maximizing, and the thing that the learner is minimizing,
are the same function. So this reduces to a
min-max problem which turns out to be much easier to solve than the original
bilevel problem. The intuition behind this
min-max problem is you have this learner that’s trying
to fit all of the points, both the clean and the
poisoned points, and there’s an adversary that’s
trying to add poison points such that it’s hard to fit them at the same time as the
clean data. This gives you the following algorithm, which corresponds to doing gradient descent on this min-max error.
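Before walking through it pictorially, here is a rough sketch of that loop in code. This is my own schematic, not the authors’ implementation: the per-example loss, its gradient, and the routine that searches the feasible set for the highest-loss point are all placeholders you would have to supply.

```python
import numpy as np

def certify_and_attack(X_clean, y_clean, eps, loss, grad_loss,
                       worst_feasible_point, steps=500, lr=0.1):
    """Gradient descent on the min-max (super-attacker) objective.

    loss(theta, x, y) / grad_loss(theta, x, y): per-example loss and gradient.
    worst_feasible_point(theta): returns the feasible (x, y) with the highest
        current loss, i.e. a point the outlier detector would *not* remove.
    """
    n, d = X_clean.shape
    theta = np.zeros(d)          # initial model
    poison_points = []           # the "red points" accumulated along the way

    for _ in range(steps):
        # Attacker step: find the feasible point with the highest current loss.
        x_bad, y_bad = worst_feasible_point(theta)
        poison_points.append((x_bad, y_bad))

        # Learner step: descend on clean loss + eps * loss on the worst point.
        g = np.mean([grad_loss(theta, x, y)
                     for x, y in zip(X_clean, y_clean)], axis=0)
        g = g + eps * grad_loss(theta, x_bad, y_bad)
        theta = theta - lr * g

    # Certificate: average clean loss plus eps times the loss on the final
    # worst-case point upper-bounds the training loss that any eps-poisoning
    # attack within the feasible set can induce against this learner.
    x_bad, y_bad = worst_feasible_point(theta)
    clean_loss = np.mean([loss(theta, x, y)
                          for x, y in zip(X_clean, y_clean)])
    certificate = clean_loss + eps * loss(theta, x_bad, y_bad)
    return theta, certificate, poison_points
```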
You’re going to start with the clean model on the clean data; this is simulating the attacker. And note that here I’m showing
you both feasible sets now, just so I can show you
the algorithm. We’re going to first
find the feasible point that has the highest error. In that case, this is this point and then we’re going to incrementally update the model parameters
towards that point. And there we’re just going to keep repeating until convergence. So we’re going to add the next, the new worst point and
we’re going to keep going. And at the end of the day, one of two things is
going to happen. Either you’re going to end up with a model that successfully fits all of this data, both the blue data and
the red data, in which case this is going
to give us a certificate, so a proof that there’s no
attack that can work well, or you’re going to not
have a model, this model isn’t going to
fit all of the data well. And in this case,
actually these red points, that we generated along the way, are going to form a
strong attack in a way that we can actually formalize. – Yeah, Sean? – I think I missed something, so why was the previous
formulation not zero sum? Why isn’t that min-max? – [Jacob] Which, the bilevel problem, or? – Yeah. – So this is zero sum here. – [Sean] So that’s zero sum,
but that’s when you put in– – But here it’s no
longer zero sum because we’re not taking a learner that’s somehow playing the minimax equilibrium, we’re taking a learner
that’s just minimizing the normal empirical error. – I missed a point, why is
this optimization problem that you’re trying to
solve here, a given and/or, substitutive by the next
one which is easier? – Oh, so it’s not equivalent
in the sense of being equal. It’s that this is an upper
bound because, oops, sorry, so this is
the sum of the errors on the clean data points. And this is the sum of the errors on the clean plus the
poisoned data points. And so at least if you have a non-negative error function, which you typically do, then that’s just going
to be strictly larger. You’ll always get an
upper bound. – Why is the learner not
trying to win the game? Right, the learner is
our learner. The learner knows we’re
playing a game, so it should just try
to fit the data. It should try to beat
the adversary. – Yeah, that’s right, so you could imagine taking other strategies
other than this one. I mean, in some sense,
this is already trying to do something
to defeat the adversary because you have this outlier detector, you have this outlier detector that is trying to detect outliers. Now you’re saying maybe we should modify the learning algorithm in some way. – Limited what the learn
can do artificially. – What would be things
that you’d want to do? You’d want to like tune the
regularization parameter or something like that? – The parameters of the model. Whatever my best function
is against the adversary, I want to tune my
parameters to optimize that. – I’m not 100% sure
what you’re suggesting, but in the next section, we will, I will present an algorithm that tries to deal with the adversary. It’s possible I’ll answer
your question then. And if not, feel free to ask again. Any other questions before I move on? So where were we? Right. We have this algorithm where
we’re either going to get the certificate or we’re
going to get this attack via these red points. And here’s where we can say formally, if we let theta star be the model that we obtain at the
end of this algorithm, and we let X star be the
corresponding red points that we generated, then we get
the following two guarantees. The first is an upper bound,
which says that the error against any attack, including
the worst case attack, is at most this quantity E star which is the average error of theta star on the clean data plus epsilon times the error against
the final attack iterate. So no attack can create
training loss larger than this. And then conversely, we have
the following lower bound, which says that this particular attack, X star actually realizes this error, at least with respect
to the super attacker that we did find. The error under the super attacker is at least E star minus some quantity that goes to zero as n goes to infinity.
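Roughly, in symbols (my paraphrase of the two guarantees as stated, not the precise theorem):

$$\text{worst-case training loss} \;\le\; E^{*} \;=\; L(\theta^{*};\, \mathcal{D}_{\text{clean}}) + \epsilon\,\ell(\theta^{*}; x^{*}), \qquad \text{loss achieved by } X^{*} \;\ge\; E^{*} - o(1).$$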
So far I’ve only talked about training loss, but we should really care about test error. You can actually bound the test error via what’s called a uniform convergence argument. I won’t go into the details of that in the talk, but I’d be happy to talk about it offline or in the questions. So this is the theory, but let’s look at what this actually looks like in practice. Here’s results on the Enron spam dataset, so this is a corpus of
emails that was released into the public domain
during the Enron lawsuit. It’s about 70% legitimate
emails, 30% spam emails. It’s about 4,000 examples total. And what we did was we trained
a support vector machine, what we considered a learner that was a support vector machine, on
this dataset and for features, we used what’s called the
bag-of-words representation. This representation has one coordinate for every word in the vocabulary, and the value of that coordinate
is the number of times that word appears in the email. And then for the feasible set, what we’re going to do
is we’re going to remove, the outlier detector’s
going to remove points that are too far away from
the mean of their class and also remove points that
are invalid in the sense that they aren’t non-negative
integers in every coordinate because they weren’t
non-negative integers, they couldn’t correspond to word counts. The attacker has to play
points that look like this. Here’s the results, so first of all, here’s what we get for our certificate. This green line is our
provable upper bound on the training loss on this dataset. And this blue line is
the actual training loss that’s induced by the attack,
the corresponding attack. And this red line is what the test loss or the test error looks like. You can see that they all track
each other relatively well, but you should perhaps be unhappy because if you look at the x-axis, this only goes to about 3% poisoned data, and this y-axis is on the order of one, so the error is actually
increasing quite rapidly. And in fact this is
what we generally found, so except on very simple,
low-dimensional datasets where you could actually get robustness, in general, most datasets
are quite attackable. And you can make this even more stark, if instead of looking at the error of, under the SVM’s error metric,
you look at the actual zero-one error, so the fraction
of misclassified examples and it goes from about 3% error in the absence of an
attacker to around 14% error with only a half a percentage
point of poisoned data. And note that on this dataset, this only corresponds to
adding about 20 emails and we’re increasing the
overall error by about 10%. It’s not just that this particular
defense is a bad defense, the same attack evades many
other detection algorithms based on, say k-nearest neighbors, or PCA. And to get at Pedro’s
question from before, it’s also the case that this is robust to different hyper
parameters of the algorithm, so it still works even if you tune your regularization parameter up or down, and it also still works,
for instance, if I swap out my learning algorithm from exact
learning to SGD or Adagrad. There’s reasonably good
transferability there, as well. It actually turns out
that this is, this is a fairly fundamental
aspect of the fact that this dataset is just
very high dimensional and we’ll get to this in the next section. Just to summarize this
section, we first of all saw that once we define
a formal threat model, often security problems can
turn into optimization problems. In this case, we ended
up with this bilevel optimization problem. And then we saw how to get a certificate via analyzing the super attacker and how to get strong
attacks via exploiting the duality of this
min-max problem. However, the attacker ends
up winning in high dimensions and so this motivates the
next section, where we’re going to ask can
we design learning algorithms that are actually robust
in high dimensions? We’re still talking about data poisoning, but now we’re going to look at
this high dimensional setting and I’m also going to abstract, yes, Sean? – In these convex settings, can you just figure out the value of the game and figure out what’s achievable? Because now you’re giving an upper bound, you look at a particular
algorithm and corrupt it. You actually have written out
what the value of the game is. Like in this SVM case. Can you actually just try to? – Well, so it’s a bit, it’s a bit tricky because the game is defined
in terms of the clean data set. And of course the optimal strategy, if you know the clean dataset, is to just return the
minimizer on the clean dataset. So you have to somehow
define a distribution over clean datasets that
the learner doesn’t know. So I think even formalizing
that is a bit tricky. – One of my students has a paper where in a similar setting, he does this. I’ll give you a pointer later. – Okay, great, yeah. Cool. Let’s abstract things
away a bit. Now instead of thinking about just linear classification tasks,
we’re going to think about general learning tasks,
where we say we observe n points in some d-dimensional space. One minus epsilon fraction of these points come from some distribution
p star that we care about, and the remaining epsilon
n points are controlled by an attacker that can
place points arbitrarily and with knowledge of our algorithm. Our goral is to estimate some parameter of this distribution p start. In the previous section,
it would have been some linear separator, but more generally, we could care about the mean. We could care about the best fit line in the case of regression. We could care about a ranking in the case of a recommender systems. But for this section of the
talk, I’m going to mainly focus on the problem with mean estimation, although our results hold more generally. Here the problem is given,
end points in d-dimensions, one minus epsilon of them come from the distribution we care about. There are many in our outliers and we want to estimate the
mean of the distribution. In one dimension, this is
fairly well understood, where the empirical mean could
be far from the true mean due to outliers, and so you
might use more robust estimates such as the median or the trimmed mean. In higher dimensions, it’s tempting to generalize this strategy. For instance, in the presence of outliers, as we did in the previous section, remove points that are far from the mean and take the empirical
mean of the remainder. However, it turns out that
this and actually a large class of natural strategies actually
fail in high dimensions, in the sense that they
incur an error that grows with the square root of the dimension, which makes essentially
most results vacuous. I’ll explain this more on this slide. To understand what’s going
on, let’s imagine the case where actually the data is
a very simple distribution. The clean data is just a
Gaussian with some unknown mean, mu, and variance one in each coordinate. Now the basic problem is
that in high dimensions, all of the sampled points
are going to be far away from the true mean. This is just because of
how L two distance works in high dimensions. You have d coordinates,
you have variance one in each coordinate, and so
these are going to add up to give you an error of
the square root of d. And so an adversary, or sorry,
a distance of square root d. But now this means that an adversary, only by playing points that
are no further from the mean than any of the good
points can actually shift the empirical mean by quite a bit. And it’s even worse than that. So it’s not just that the
distance isn’t any further away. All of these points have the
exact same probability density under the true data
generating distribution as the good points, and
so there’s no strategy that simply looks at points one by one, decides whether they look legitimate or not, and throws them out, that can avoid this dimension-dependent error.
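Here is a small simulation of that picture (my own illustration, not from the talk; the specific numbers are arbitrary): an epsilon fraction of poisoned points, each no farther from the true mean than a typical clean point, still shifts the empirical mean far more than the sampling error you would see with no attack at all.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 1_000, 5_000, 0.1
mu = np.zeros(d)                                  # true mean
clean = rng.normal(mu, 1.0, size=(n, d))          # variance 1 per coordinate

v = np.zeros(d); v[0] = 1.0                       # direction the attacker pushes in
bad = mu + np.sqrt(d) * v                         # distance sqrt(d) from mu,
                                                  # same as a typical clean point
data = np.vstack([clean, np.tile(bad, (int(eps * n), 1))])

print(np.linalg.norm(clean - mu, axis=1).mean())  # ~31.6: typical clean distance
print(np.linalg.norm(bad - mu))                   # ~31.6: the bad points look "normal"
print(np.linalg.norm(clean.mean(axis=0) - mu))    # ~0.45: error with no attack
print(np.linalg.norm(data.mean(axis=0) - mu))     # ~2.9: shift of roughly eps * sqrt(d)
```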
This brings us to another important takeaway, which is that in high dimensions, identifying points that
look not like real data and removing them is
actually the wrong goal. An adversary only by planting points that look perfectly
normal can nevertheless shift the learn model
substantially and so instead, need to look at more subtle things such as some sort of population level notion of influence on
the learned parameters. Perhaps because of this,
even though there’s been a long standing interest
in robust estimation, it’s only quite recently
that actually progress in this high dimensional
setting has been made, so in fact, the first efficient algorithms even for this problem of mean estimation that avoided dimension
dependent error bounds were only in the last couple of years. And so, in this section of the talk, I’m going to discuss our contribution to this literature which is based on work with accommodations of Moses Charikar, Pravesh Kothari, and Greg
Valiant, where we showed how to develop estimators
that work for a large class of estimation and learning problems, and also could tolerate a
larger fraction of outliers than was previously thought possible, and this required actually introducing a new model of learning, which
I’ll get to in a little bit. Let me briefly sketch our approach. It’s based on a reduction from
this mean estimation problem to the problem of matrix de-noising. Now, this might seem odd
because mean estimation is a very simple problem
and matrix de-noising is a very complex problem,
but it turns out to be fairly fruitful in this
adversarial setting. The idea is that we’re
going to take our data, we’re going to stack it into a matrix. And we’re going to think of
this matrix as a noisy version of the following rank one matrix, which is just the true mean
repeated over and over again. To recover this rank-one matrix, we can solve the following minimization problem, which trades off a squared reconstruction error to the observed data against some convex relaxation of the rank that tries to encourage the solution to have low rank.
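Schematically (my notation; the talk doesn’t pin down which relaxation is used, and the nuclear norm below is just one standard convex surrogate for rank):

$$\min_{M}\;\; \|A - M\|_F^2 \;+\; \lambda \cdot \Omega(M), \qquad \text{e.g. } \Omega(M) = \|M\|_{*} = \sum_i \sigma_i(M),$$

where $A$ is the data stacked into a matrix, one row per point.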
The problem is, we’re unlikely to end up with this. What we’ll actually probably
end up with is this, where we have a bunch of copies
of the mean of the bad data. And in fact, it’s going
to be even worse than this because we’re going to
end up with all sorts of linear combinations of the
good mean and the bad mean, and possibly there might be even several directions of variation and so it won’t be clear
how to read things off. But the key observation
is that if this happens, there must be a very
small fraction of points, the bad points that are responsible for these additional
directions of variation, and they must jointly exert high influence on the learned parameters x-bar. And by moving to the dual of this problem, it’s actually going to be possible to decompose that influence
into an additive form and assign blame back
to individual points. And so, in that case, we’ll
be able to remove them and so whenever the adversary
exerts high influence on the learned parameters,
we’ll be able to detect that and remove those points. And if the adversary doesn’t
exert high influence, then we’re just in this
case, and then we’re happy, so we don’t even need to do anything. That’s the flavor and I won’t go further into the technical details, but let me just tell
you what result we get for this problem of mean estimation. Our first result is the following. Let’s assume that the
fraction of outliers epsilon is at most 1/8, and that the true distribution has variance bounded by sigma squared. Then it’s possible to recover an estimate of the mean whose error is at most sigma times the square root of epsilon, the fraction of outliers.
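In symbols (up to constants):

$$\|\hat{\mu} - \mu\|_2 \;\le\; O\!\big(\sigma \sqrt{\epsilon}\,\big) \qquad \text{whenever } \epsilon \le 1/8.$$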
The key point here is that the error doesn’t depend on d, so we’ve gotten rid of the square-root-of-d factor from before, which
is essentially the difference between vacuous bounds
and meaningful bounds. But it turns out that you
can do even better than this. So we can get rid of this
assumption on epsilon altogether. And in fact, with no
upper bound on epsilon, we can nevertheless get an
error that goes as sigma over the square root of one minus epsilon. So note that this is
bounded even as the fraction of outliers epsilon approaches one. In particular, this means that we can actually handle even a
majority of outliers, but in this model, the
semi-verified model, which I’ll explain on the next slide. Yes? – Sorry, quick question,
is this independent of the distribution or
for multivariate Gaussian? – This doesn’t assume
Gaussian, it just assumes that the distribution
has bounded covariance. – [Pedro] That’s it. – Yes. Other questions? Yes? – You need n to be much larger
than the one minus epsilon, right, so this doesn’t cover the example that you gave before where
you had n like around 4,000 and d was also like 5,000. – Yeah, so formally you need d to, you need the number of good points to at least be on the
order of the dimension. Even if it’s a constant factor
less than the dimension, you can still get meaningful bounds. Cool. So let me explain what’s going on here, how we can handle the
majority of bad data. It should perhaps seem
like this is impossible, because for instance, say
if 2/3 of the data is bad, then you could end up with
two clusters of bad data that look identical to the good data, and you have no way of
telling which is which. But the key point is that actually, if this is what we see,
we actually do know a lot because at least if we have
this bound on the variance, and these clusters are
further apart then that bound, then we know that the
good data must mostly lie within a single one of these clusters. And so the mean of the data must be close to one of these three green points. We can actually narrow things down to just three possibilities,
and if we observe even a small number of
points, say from this cluster, and know that those points
are all legitimate data, then we can actually be quite confident that the mean is within the
cluster and get a good estimate. So this led us to introduce a model that we call the
semi-verified learning model where we assume that
there’s some small number of verified points that are known to come from the true distribution. Aside from security guarantees, this actually had an
interesting implication to another problem in learning based on unsupervised clustering. So this is the problem
of learning mixtures of distributions, where we can reduce this to the semi-verified model
by thinking of the good data, or sorry, thinking of
one mixture component as the good data and the remaining mixture component as the arbitrary data. And so once we can handle
a majority of outliers, we can actually solve clustering problems. This is actually interesting
because in general, these clustering problems
are NP-hard, but in practice, are often solved with ease
and so there’s a line of work in the learning theory
community trying to understand when we can actually cluster efficiently. There’s one result going
back to Van Paula and Wang showing that with the separation
of about the fourth root of k, so if each of
these mixtures has mean that are far apart by
at least the fourth root of the total number of mixture components, then if everything is Gaussian, we can successfully learn efficiently. But Gassianity is of
course a strong assumption, so there’s another line of work showing that if you just have bounded variance, then you can learn with a separation that goes as about the square root of k. And if you just apply our
results as a black box, you actually recover
this clustering result, but you get robustness for free. This is the first thing that made us think that something good was
happening in this setting, that we were able to reproduce state-of-the-art results with robustness. But you can actually do even better, and we pushed the techniques
further to get clustering to work under an assumption
called the Poincare assumption. This is stronger than second moments, but substantially weaker than Gaussianity. We can deal with a separation
that is essentially independent of the total
number of clusters. So this is improving on
this 15 year old results of Van Paula and Wang, both in
terms of weaker assumptions, better separation, and we get robustness. These techniques apply to other problems, as well, as I’ve mentioned. We’ve applied them, for instance, to problems in crowd sourcing,
and community detection, and item frequency estimation,
and also to general convex, and even non-convex optimization problems. Just briefly, we in fact
revisted this attack on Enron and we showed that using our techniques, we can actually do substantially better, reducing the error from
about 25% to about 8%. To summarize this section of the talk, we’ve first of all seen
that outlier removal fails in high dimensions in a
fairly fundamental way, but we can nevertheless
get robust algorithms via this notion of
population level influence. We introduced a new model that can handle a majority Of outliers and used this model to actually improve on
long standing bounds in other problems in learning. Now I want to take us to the
final section of the talk. We’ve mostly focused on
training data attacks so far, but now we’re going to go to test time. In particular to revisit this
problem of adversarial images. So remember that these images are images that look innocuous, but
that are misclassified by state-of-the-art
machine learning models such as neural networks. I think even if you don’t
care about security, you should be worried
about this phenomenon because they really show that we’re not learning the right thing. These are fairly persistent,
so despite hundreds of papers trying to avoid them, they still exist. You can even create
them in the real world. This we’ve already seen, this stop sign that can be transformed into a yield sign. This is a 3D printed turtle
that is classified as a rifle if you take a picture of
it on your smartphone. And this is a sticker that
you can place on anything to turn it into a toaster. There’s also a Kaggle competition on this, recently, hosted by Google Brain, where about 100 teams competed
in attack and defense. We competed in both and did fairly well. But what we found was that attack is really much, much easier than defense. In fact, all of these attacks
were completely black box, so they didn’t know
anything about the model that they were attacking. You couldn’t even query the model. It was really a black box. And nevertheless, if
you take the top three attacks together, at
least one of them beats essentially every single
defense in the competition in the sense of driving the accuracy from around 90% to say below 30%. This underscores this arms
race that we saw before and in this section of the talk, I’m going to discuss our
work towards trying to start to end this arms race
by building a network that is provably secure
on the MNIST dataset. To talk about provable security, I need to give you a formal threat model, so the model we’re going to use is what’s called the L
infinity attacker model. Here, an attacker is given
some clean input image and is allowed to modify every
single pixel in the image but only by some small amount, epsilon. So each of the pixels
is allowed to be changed by some small amount. This can be seen pictorially here, where this point here
is robust in the sense that every perturbation
around it has the same label, whereas this point is not
robust because there’s this region here where the label gets flipped. This picture is exactly what’s underlying this adversarial image with
the panda and the gibbon. This L infinity model is,
there’s actually a lot of problems with it in
the sense that attackers could do many things other than these L infinity perturbations. But it turns out that it’s actually quite difficult to defend,
even in this model. In fact, this panda and gibbon example, and this entire arms race
that I showed before, and the Kaggle competition, were all under this L infinity model. We thought that this was a good test bed for trying to understand how
formal verification methods could work for these neural networks. In particular, our goal is going to be to train a classifier
that is verifiably robust under this L infinity threat model. You should perhaps think that this problem would be quite hard because
even verifying facts about traditional software
is quite difficult and their things are actually designed with some purpose in mind, whereas for machine learning systems,
it’s just this mess of nodes, and edges, and weights that
are learned organically from data and often
have no explicit design. However, we’re going to be able to do the following two steps, which
is actually going to cause learning and verification to
work together hand in hand. The basic idea is we’re
first going to formulate a verification strategy,
which is actually going to be essentially vacuous on
arbitrary neural networks. If you just take a network
with random weights, the bounds are going to
be completely meaningless. However, there’s going
to be a rich subfamily of neural networks where
the bounds are actually good and where the verification strategy gives you a relatively tight analysis. We’ll just be able to bias learning towards these verifiable networks. So even though there’s other networks where we don’t do well, we can
just avoid them and find ones where the verification
strategy actually works. This leads to another important take away, which is that learning,
rather than actually fighting against verification and
making everything harder, actually has the potential
to make verification easier. Let’s try to understand
this, so recall that our goal is to train a classifier
that’s going to be robust in some neighborhood of the data. This is actually a fairly
classical thing to want to do and falls within the purview of robust optimization, where what you would typically do is take whatever error function you care about and define a robust error function, which is just the maximum of the error function over all permissible perturbations, in this case all the perturbations in the L infinity attacker model.
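In symbols (my notation):

$$\ell_{\text{rob}}(\theta; x, y) \;=\; \max_{\|\tilde{x} - x\|_\infty \le \epsilon} \ell(\theta; \tilde{x}, y).$$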
The problem, however, is that this maximization is going to be quite complex here, because this is maximizing over the inputs to a neural network. It’s highly non-convex, and
it’s very high dimensional, so we’re not going to be able to compute this maximum exactly. What might we do to try to deal with this? The first thing to realize is
that this decision boundary from the neural net isn’t
just some discrete boundary. It’s actually the zero level set of some continuous scoring function that assigns a score to each class. So we can look at the
gradient of that function. We might want to conclude
that if the distance to the boundary is very large compared to the magnitude of the gradient, then we would actually be
robust and not attackable. The problem, of course,
is that there could be some other direction that is perhaps even essentially
orthogonal to the gradient, in which we’re much closer to the boundary due to curvature in this scoring surface. And in fact, this is
what happens in practice. If you just try to train a network to have small gradient relative
to the decision boundary, then you’ll end up with
networks where the gradient is essentially zero at
every single training point. In fact, the property generalizes, so it’s also zero at
every single test point. But there’s points nearby
every training point and nearby every test point
that have very high gradient, and so you’re actually not robust and you can’t generate attacks. This is called the gradient
masking phenomenon, which has been observed
by a number of authors. We need to do something more difficult, we need to somehow get a uniform bound on the gradient over this
entire L infinity ball. To do this, we actually need to open the hood on the network and look inside. I’ll describe the strategy that works for two-layer ReLU networks, which relies on the following observation: the gradient only depends on which units are active.
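For a two-layer network $f(x) = v^{\top}\sigma(Wx)$ with ReLU activation $\sigma$, this is just the chain rule (a sketch in my notation):

$$\nabla_x f(x) \;=\; W^{\top}\,\mathrm{diag}\!\big(\mathbb{1}[Wx > 0]\big)\, v,$$

so on any region where the activation pattern $\mathbb{1}[Wx > 0]$ is fixed, the function is linear and its gradient is determined entirely by the set of active hidden units.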
Don’t worry if this doesn’t make complete sense; the basic idea is that there are these three layers, and the gradient depends only on the sum of all paths from the input layer to the output layer through some subset of active nodes. This turns into a combinatorial optimization problem, which actually ends up being an instance of the weighted maximum-cut problem. While this is NP-hard, you can bound it with a semidefinite programming relaxation. If you push this bound on the gradient back through everything,
you get an upper bound on this robust error function. If you minimize this, you get a network that is both robust and verifiably robust. For those of you who
are into optimization, if you just did this,
you’d get an algorithm that’s very slow. But you can, there’s a version of this based on dual gradient
descent that ends up being still slower than normal
training, but reasonably doable. Let’s look at the actual, yes? – For this to work, you need
the relatively small fraction of the hidden units to be
active for any given sample. – No. We’re actually taking worst case overall possible sets of hidden units. – Then how are you exploiting the fact that only some units are active? – We’re not exploiting the fact that only some units are active, we’re saying that the gradient,
basically the point is if I condition on which units are active, then my function’s actually
just linear for a value network. Essentially, the way we get an upper bound is we’re de-coupling
over these two choices. We’re saying, conservatively, we’ll assume that the adversary could
put us in any of these possible linear regions,
and we want the gradient in all of these regions to look small. – And how does this,
okay, how does this vary with the number of such linear regions? – You’re saying, how does the
efficacy of the bound vary with the number of linear regions? – Exactly. – I don’t know if I can
say anything in theory, but at least empirically,
it seems like you could, well, first of all, there’s
lots of zero crossings. You get about 50% active and 50% inactive, so about half of the units are active. It didn’t seem like increasing
the number of hidden units substantially degraded
the quality of the bound. That’s just an empirical statement, on MNIST, yes, that’s right. Let me show you our results. So as Pedro said, this
is on the MNIST dataset. This is a dataset where you’re
trying to classify things into 10 different handwritten digits. Oh, you look like you have
a question, or, oh, okay. We trained a two layer neural
network with 500 hidden units. We considered two methods of training. The first is normal training,
so in machine learning, this is actually what would be
called adversarial training. We take some known attack strategy and we optimize our network to do well against that specific attack. Then we also consider our
method where we’re minimizing an upper bound on the worst case attack. Here’s what happens, so first
of all, this blue dashed line is the accuracy of the
normal training method against the attack that
it was optimized for. In some sense, this is
an oracle that knows what attack it’s going
to have to deal with. You can see that it in
fact does quite well, but we can’t prove anything about it. So it’s possible that
this method is robust, but the best we can show is that in fact, it could be extremely non-robust. If we instead minimize our upper bound, which recall, doesn’t know anything about this specific attack, it actually does almost as well as the oracle method. And, we can achieve reasonably
good provable robustness, as well, against the worst case attack. What’s happened, is we’ve
gone from this picture where we have this, where we
have potential non-robustness, to this picture where we’re
actually provably robust. We’ve seen that minimizing
even this loose bound can lead to robust models
and you can achieve both empirical and provable robustness. I guess to get to your
question about MNIST, I think to scale this up,
the issue is not necessarily with the looseness of
this particular bound, but extending things to deeper networks. We do have approaches to doing that– – I wasn’t thinking so
much about scaling up as the fact that MNIST is a
very special class of problems where things are in a very
low dimensional manifold because they just digits
that are mostly white. – We’ve tried this on some other datasets. I think it’s, it works on
things other than just MNIST, although for instance, for
CFR, for housing numbers, the bounds are quite loose
and this is something that we’re working on right now. Two minutes, okay, so I should
get to future work, then. We’ve now, so now we’ve sort of seen how to get both robustness at
training time and test time. Let me just summarize and tell you about some other directions I’m interested in. First of all, some take aways. We saw in general that
empirical success is not enough and leads to this arms race,
but that formal threat models can actually turn security
into optimization. For data poisoning, we
saw that removing outliers is the wrong goal, and we also saw that robustness can be useful as another learning
problem, such as clustering. For adversarial images,
we also saw that learning can actually make verification easier. Beyond security, my
sort of general interest is in trying to treat
machine learning as software because it’s increasingly being used as part of learning software systems and give it their liability properties you would want good software to have. Security’s of course one,
but there’s other properties you’d want, as well, for instance, you would like things
to not fail silently. You would like to be
able to detect failures and respond appropriately,
which has led us to a line of work on having good estimates of uncertainty under limited computation. You’d also like algorithms
that will behave predictably and generalize well to new situations. This has led us to work trying
to guarantee a performance for a well defined family of
counterfactual distributions. In both cases, an important
theme is that you can’t just rely on the data or on
measured prediction accuracy to succeed, you actually
need some underlying theoretical scaffolding to ensure success. In addition to these lines of work, an area that I’m interested
in in the future, is what you might call humans as software. This is trying to take seriously the fact that machine learning algorithms run on data and this data
is collected by humans, and it might be imperfect
in a number of ways. There’s easier versions
of this, for instance, where data might just
be labeled unreliably, which fall fairly squarely into the robust learning framework
we’ve seen in this talk. But there’s other more difficult problems, for instance, in robotics,
one would often train robots based on demonstrations. These demonstrations could have
all sorts of idiosyncrasies that you would want to remove from the actual deployed system. And then maybe the hardest version of this is what I call proxy signals. Many web companies such
as Facebook or Quora, maybe in theory, at least optimistically, they would like to optimize
for human happiness, but they don’t get to observe
the happiness of their users. They only get to observe
what things they click on or how long they dwell on a webpage. So you get systems that are very good at optimizing clicks and dwell time, but maybe not at making
people happy and fulfilled. I think this is, in general,
a very important problem because machine learning
is this very powerful optimization tool that’s
being increasingly applied to many areas of society. It’s important to make sure
that we can actually optimize the things we care about
rather than just the things that are easy to measure. I’ll leave you with that, and I’d like to once again
thank my collaborators. (audience applauding) – [Kevin] We have time
for a question or two. – In many domains, what
attackers do is the combination of data poisoning and
adversarial examples. They poison the data to make their example they’re gonna generate be on
the wrong side of the boundary. They don’t actually need to
change the boundary a lot because they know and they
can also design their example. Any thoughts on what to do
when both of these things are present at the same time? – I think that’s definitely a problem that you’d actually need to worry about if you want these to be secure end-to-end. The main reason we focused
on each individually is just because even if you
can only do one of these, right now the attacker
is generally winning. I think, I don’t see any strong obstacles to sort of combining these,
I think actually the thing that’s harder is dealing
with targeted attacks rather than these
population level attacks. I think there are a lot
of challenges there. – This is more like,
slightly disappointing thing about some of these experiments. Like you would, suppose I
didn’t care about security, but I just wanted to do
better machine learning. In a sense, the middle part
of your talk was saying, look, these algorithms
can do better estimation. And I feel like in these image examples, the fact that I’m getting
corrupted with this noise and, you take a human and give
this L infinity perturbation to images, you’re gonna
get the same digits. It feels like forcing the
system to be more stable. It should actually give us
better learning algorithms because we’re kind of
telling it more examples of what isn’t the right answer. And often when we write these things, it gives us a trade
off, and that’s sort of, like in these object
recognition, for human, it seems like it would
only help the system if you told it these are not stop signs. Typically, when you do this, it just destroys the performance. – Yeah. I mean, there has been some work, not by us, but by other people showing that at least in
some cases, you can actually use this as a regularizer
which improves things. But the issue is, you
need much more model capacity than you
would otherwise to actually robustly
fit these things. One place people have looked at this is, for instance, in
semi-supervised learning, where you can kind of
try to impose smoothness on your unsupervised
data to try to do things. I guess one other thing
I’d say is, aside from the regularization, if there are actually real
outliers in your data set, then, which I think often can happen
in a lot of applications, especially say genomic
data which often has very systematic
corruptions in it. Then you might hope to
gain something. We actually did get
slightly higher accuracy when we ran our method
than just naive training on one of our datasets, but it turned out it’s ’cause there was like
one outlier that was really messing you up. It would be an easy to
detect outlier, so I don’t think our method should get very much
credit for that. (inspiring music)
(audience applauding)

2 Replies to “UW Allen School Colloquium: Jacob Steinhardt (Stanford University)”

  1. In blockchain technology, data mutability is decentralized. The idea is to not have a central authority controlling the data. To identify data attacks, some verified data is used as reference; because without reference, the attacks can shift the trusted data set and get the wrong information in the database. How do these things work together? Are many data users verified in blockchain technology, or is it like a more efficient population sampling that can analyze more samples?

    What happens when attackers feed in a lot of pseudo-legitimate data? In a sense that if the data seem authentic, the radius of the confident data shrinks, and so now false data can be fed in at a distance closer than previously possible; then, if the user base grows over time, the radius increases again, now consuming the false data as legitimate. This could happen during New Year sales, where just before the sale attackers feed good data and decrease the range and add false data outside the range. But as the sales go off, the range increases and the database takes the false data as legitimate, sort of like a bounce phenomenon. Is that a sensible question to ask?
