2019 ADSI Summer Workshop: Algorithmic Foundations of Learning and Control, Emo Todorov

Up next is Emo Todorov. He
is an algorithm maestro. He’s been working on
reinforcement learning and control for many years,
presently [INAUDIBLE] for University of Washington. Recently, his work
on the [INAUDIBLE] has really made
huge contributions to modern deep learning
and reinforcement learning, and I very much look
forward to some cool demos, which I’m sure we’ll see. EMO TODOROV: I’m excited to teach you guys. I decided to be a businessman. And it’s a good life– no teaching, no grant
writing, no meetings– I cannot explain to
you how good a life. OK. So I’m going to talk
about model-based control, specifically– AUDIENCE: This is more
important than that will be. EMO TODOROV: –for physical systems. And let me start with
some general comments. So when you do
machine learning, you have the world that
generates data for you, and then you learn
from the data, and the data is
everything you know. You don’t know anything beyond it. When we do control,
something else happens. We have, of course, data. With system ID,
we build a model. And then, when we do control
synthesis, or optimization, or whatever, we use the
model most of the time. Even those of us who run around
claiming to be model-free, you look at what
they actually do. They have a model
and promise you that, one day when, you know,
the moon is right, we’re going to switch to the real
world, but that never happens. Now, interesting thing– what
happens when you have a model is you can treat it as if
it’s the world: send actions, generate data from it. But here the data is essential
because that’s all you know. Here, the data is just made up. What do I mean by that? Well, the data is implicitly
contained in your model. In fact, the model contains
all the data in the world. Why are you sampling from
something when you already know the answer? And it’s OK, because, you
know, you can practice for a distant future when RL
actually becomes model-free, as in you don’t
actually have a model, and one day switch those. But you’re missing out, because
if you cut out the middleman here and connect your
optimization box directly to the model, you can get a
lot of stuff out of the model that’s not in the form of data. If you’re coming from
machine learning, you might ask, how can
something not be data? Well, there are lots
of things that are not really in the form of data– like, equations are
not really data. You can ask the model to
give you the math that’s in it, for example– can’t sample it. In practice, though,
when doing robotics, there are lots of
things you can compute. For example, end-effector
Jacobians– these are very
useful things that give you the mapping
from [INAUDIBLE] and tell you how to go
to targets without doing any learning,
just in microseconds. A lot of robots are
controlled that way. You can get data with
just the dynamics. You can do inverse dynamics,
which I’ll talk about. You can look at, what are the
subspaces that are actuated? You can get those analytically
and not have to learn them. Distance fields– if
you have geometries, those are very useful
for planning and putting potential fields in
collision avoidance and such. Stability criteria
like center of pressure or force closure for
locomotion– lots of things that can just be computed, and
it would be silly not to do so. Now, you could say,
OK, well, if you build a controller with
respect to a model and then you run it on the real
world, is it going to work? That’s a perfectly
legitimate question. Well, it does
very– quite often. So let me just go, briefly,
through a few videos to remind you that it does. There are no proofs here. Why are there no proofs here? I mean, there are heroic people
who will do robust control and try to provide
certificates, whatever. All these people
have to put in assumptions that are not valid
in the real world, and therefore, those
certificates are not for real. I mean, it’s nice
to have some theory, but the reality is that, when
you’re controlling the real physical system, there is not a
single assumption that you can actually verify and say, yeah,
that thing is really true, because it isn’t. So we have to look empirically
to see, well, does it work? This is the classic work of
Pieter Abbeel flying helicopters with model-predictive control. He has a model of the helicopter
flying upside down, doing all kinds of crazy things. No perfect model, but
it works quite well. There’s now a lot of drones
that have some kind of model, do MPC through it, and
they fly happily around. Here is the work from Evangelos
Theodorou’s group at Georgia Tech, where they put a model
of this little car on a GPU that’s in the car, and they’re
doing model-predictive control in real time, and it’s doing
some crazy things like slipping and whatnot– works perfectly fine–
fully model-based. Here’s something we did. This is interesting. Here, you are learning
a model on the fly, and then you’re optimizing
the trajectory with respect to that model. Then you’re playing the
trajectory on the system, collecting some
more sensor data, relearning the model. And the model is just a
time-varying locally linear model, which is why
you can learn it in like 10 or 15 iterations. So this is pretty amazing. So this 24 degree
of freedom hand learns to turn this
object in a total of 75 repetitions of that
behavior, and that’s it. Because, you know,
there’s enough data to learn the local model
and control around it. This is a example
where the nominal model isn’t good enough. So we had to create an ensemble
of models that are slightly different– only 10 of them. And then, we
planned a trajectory that’s feasible with
respect to all 10 of them. And the idea is
that, if something works for 10 different
models and simulations, it probably works for
anything in between. And hopefully a real system
is somewhere in between, and this little guy does that. This is particularly
nice, so this is planned how to get
up from the ground with all kinds of collisions
and predictions and such. And actually, that’s it. And here’s, of course, the
OpenAI thing that I’m sure you’ve all seen, which is
another example of this domain randomization idea. But in this case,
they’re actually learning policies and
not just trajectories. And so, this was actually quite
remarkable when it came out. So you give it the
desired cube orientation, and it manipulates the cube
to put it in that orientation. And so, there are
multiple approaches here. Model predictive
control is just optimizing the trajectory
through the model in real time and shifting the horizon. Offline trajectory
optimization– I’ll talk more about today. You can use this
to optimize longer, more complicated trajectories. Policy gradient– you know
what it is, and you [INAUDIBLE] in this case. They could be
nominal models; they could be adaptive
local models; they could be randomized physics
models; all kinds of things you could do. Of course, you could
also do model-free RL, but there are some issues. The existing results,
especially from Sergey Levine and other people around
Berkeley and Google, are really impressive. But they’re impressive because
of the computer vision. Now, if you took away
the computer vision, you had full state
estimation, nobody would care about those results. They would correspond to
solved problems in robotics from 30 years ago. So the fact that
these things actually work with vision in the
real world without VICON markers, without
anything– that’s amazing. But that’s because deep
learning for vision is amazing; it’s not because of the
control success, really. They work well in
quasi-static tasks where sampling can be
safe and automated, so you can actually
get it off the ground. You could extend that
regime a little bit by having specialized objects,
like, for example, this is what my former student,
Vikash, has been doing. He is going to be
giving a talk tomorrow. But, if your hand that is trying
to learn how to manipulate an object is dropping
the object all the time, you could just mount
the object on a motor and make sure it
doesn’t go anywhere. And you can spin it, and so that
greatly simplifies learning. But, you know, clearly that
doesn’t scale very well. Now, this is important. There are clearly situations
where building a good model is just a nightmare. And in some sense, control
is easier than modeling. But that does not
automatically mean that RL is a good idea, because
those may be situations where sampling is just– you
shouldn’t even think about it, like, for example,
controlling a bulldozer. How many people would
volunteer to sample a bulldozer while sitting in it? AUDIENCE: What’s the
hard modeling part there? EMO TODOROV: Well,
the interactions of the dirt with itself,
how it compresses when you roll over it. I mean, you can do
something approximately, like 1,000 spheres
that are glued, and you’ll get the
notion of pushing. But the notion of,
like, what happens when you roll over it,
and how it compresses down, I– look. Anything could be modeled
given enough effort. That’s not the question. The question is,
should you put– if your goal is to control
and not to model it– should you put that effort
first into an accurate model and then into model-based control, or
should you just go do something else? There are clearly situations where
we go do something else. And an actually
very simple situation is this– if we just
want rigid grasping, all you have to do is
close your fingers. I mean, what is the
object going to do? It has to end up somewhere
in your hand, right? You don’t have to understand
every single thing when that happens. So there clearly are very
simple controllers that are just– with a small number
of knobs you can tune, and if you tune those
right, it’s going to work. But that’s not through
reinforcement learning. In fact, that’s not
learning at all. That’s just human
creativity and engineering. And I think in the
situations that are too hard to model,
most of the time it’s really human
creativity that solves it, and not any form of learning. Of course, there are
situations where– yes? Go ahead, yeah. AUDIENCE: I would like
to emphasize that a bit. You would say that
iterative improvement of a repetitive task
is not learning? EMO TODOROV: Sorry, what? AUDIENCE: Iterative
improvement of parameters in a repetitive task
is not learning? EMO TODOROV: Sorry. I mean– AUDIENCE: I don’t know. EMO TODOROV: Fine. I– AUDIENCE: I’m happy to
say that’s not learning. EMO TODOROV: I don’t care
if it’s learning or not. It’s a different type
of learning, right? It’s a very heavily structured
control architecture with a small number of
gains you’re tuning, sure. Yeah. I mean, we can call it
learning if you want, but it’s completely not– AUDIENCE: I know, I know. EMO TODOROV: –obviously
what you [INAUDIBLE]. Now, there
are some situations that you could model,
like, for example, the Boston Dynamics humanoid. But still, people prefer
to do manual design and do it very carefully. And they’re doing
better robotics than anyone else on the planet
by laborious manual design, not even by model-based
optimization; just by a bunch of really smart
people tweaking knobs. But before that,
tweaking the knobs, deciding what knobs to
create for themselves so they can tweak those. That’s the tricky
part; that’s what we don’t know how to automate. Once the knobs are there, anyone can tweak them. But what knobs do we
put there, and what’s the machine with the
knobs on top of it? That’s the hard part. So you need the
models to do that, and you better be
able to simulate them. So throughout this talk, I’m
using this simulator called MuJoCo, which I’ve been
developing for, now, 11 years, and
that’s what enabled me to leave academia behind and
just sell software licenses. So you can see all
kinds of systems– the various robots, the
tendon-driven things. You can play Jenga– it’s actually really
hard to play Jenga, because getting all the
contact interactions right is a nightmare. I have got all kinds of robots. You can have humanoids sliding
on the MatWorks height field. It’s very important
to have it there. You have like 1,000 spheres. You have cloth these days. You can have a
humanoid on a hammock. You can tie ropes
with the mouse. If you try to do that through
reinforcement learning, apparently people are trying
and failing for the time being, but I’m sure somebody
will release something at some point. Instead though, the simulator
itself is very robust. Just the task is hard. You can do spongy things. You can even have like,
skeletons with, like, 90 muscles attached to them. So we can still
have lots of things. Just to introduce
a bit of notation. q is position, v is velocity,
tau are applied forces. There’s some bias. [INAUDIBLE] matrix. J is a Jacobian of contacts
and maybe other constraints. Lambda are constraint
forces or contact forces. So forward dynamics–
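As a hedged sketch, the notation just introduced is consistent with the standard rigid-body equation of motion with contacts. The symbols M (an inertia matrix, assumed here because the matrix name is inaudible in the recording) and c (the bias forces) are assumptions for illustration, not a transcription of the slide:

```latex
M(q)\,\dot{v} + c(q, v) = \tau + J(q)^{\top} \lambda
```

In this form, forward dynamics maps (q, v, \tau) to \dot{v}, and inverse dynamics maps (q, v, \dot{v}) to \tau.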
you would think that physics would
be a solved problem, but it isn’t because of
constraints, and especially contacts. And what I mean by
it’s not a solved problem is that there’s no
formula that tells you what all the equations
of motion are. To compute the
equations of motion, you actually need to solve
an optimization problem numerically at every
time step, in order to know how the universe
advances from one time step to the next. And so, you might ask, what will
happen if the universe fails to solve that problem? Does it, like, end? Thankfully, it
solves it every time. So we don’t know how it does it. But when we do it, we have to
solve an optimization problem. In MuJoCo, that
optimization problem happens to be convex
after a lot of work that I put into reformulating
the physics of contact, so it can become convex, and
yet reasonably realistic. In most alternative simulators,
like PhysX or Bullet, et cetera, it’s actually
solving a non-convex problem, just like an LCP with a
non-positive-definite matrix. But anyway, so forward
dynamics is down to a numerical optimization. It is very fast. You can see some numbers here. So, for example, for humanoid
flailing around on a 10 core processor, we can collect
300,000 samples per second on a desktop processor. Importantly, this thing
also has inverse dynamics. So what inverse
dynamics means is, given the position, velocity,
and acceleration of the system, I can compute the force, which
generated that acceleration at that position and velocity. So if you want to
think about, what’s that in an MDP, given the– so in an MDP, forward dynamics is given
the state and the action, the MDP tells you the
probability over next states, and you get samples from that. The inverse dynamics would
be, given the current state and the next state
that you observed, what would be the posterior
over actions? Obviously, for that, you
need some kind of prior, and it gets messy very quickly. And it’s probably
not very useful. But in continuous
control, that’s actually an extremely
useful quantity. AUDIENCE: [INAUDIBLE] all
your problems is this– is this properly
constrained, meaning that it’s not underdetermined? The inverse? EMO TODOROV: The inverse? Not– never. So I mean, this is
actually very simple. So the forward is a solution
to an optimization problem, which just happens to be convex. So we know the global
minimum is given by setting the gradient to zero. This simply says the
gradient of that is zero, and it’s rearranged. AUDIENCE: And so I’m
just saying we’re guaranteed that that
thing is strongly convex? EMO TODOROV: Oh, yeah. Yeah. Yeah, yeah. Because this is basically–
this is strongly convex. This just says f equals ma,
weighted by the inertia matrix, which is always symmetric
positive definite. And this is adding something
that is [INAUDIBLE]. Yeah? AUDIENCE: Do you mean like the
dealers dynamics is physical, no matter how it should
be done, but not– EMO TODOROV: Exactly. So this is a soft contact model
and soft constraint model. For any crazy thing
you may choose to do, there exists a finite
possibly a very large force, which would have made the system
do that in that configuration. When we do optimal control,
we’re going to say, we don’t like very
large forces, and of all this will be pruned
out automatically. But we handle it. Now, inverse dynamics,
as I mentioned, it’s something very
useful in control. So again, just
forward means mapping from position, velocity,
and force to acceleration, and then you do a
numerical integration. Inverse means mapping
from position, velocity, and acceleration to tau. If you have that, you can
do really simple things, like computed torque control. Namely, you plan some
desired trajectory, which is a sequence of
positions over time. You compute the
accelerations numerically. You use inverse
dynamics to figure out the force which would have
given that trajectory, and you play that
force in open loop, and you can put some
closed loop around it. That probably accounts for
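The computed torque recipe described here can be sketched in a few lines. This is a minimal illustration on a hypothetical single-joint pendulum, not the MuJoCo implementation: the model constants, the cosine-shaped plan, and the PD gains `kp` and `kd` are all assumptions chosen for the example.

```python
import math

# Hypothetical pendulum model: mass, length, damping, gravity.
M, L, B, G = 1.0, 0.5, 0.1, 9.81
INERTIA = M * L * L

def inverse_dynamics(q, v, a):
    """Torque that produces acceleration a at position q and velocity v."""
    return INERTIA * a + B * v + M * G * L * math.sin(q)

def forward_dynamics(q, v, tau):
    """Acceleration produced by torque tau at position q and velocity v."""
    return (tau - B * v - M * G * L * math.sin(q)) / INERTIA

def plan(n, h, target):
    """Desired trajectory: a smooth sequence of positions from 0 to target."""
    total = (n - 1) * h
    return [0.5 * target * (1.0 - math.cos(math.pi * k * h / total))
            for k in range(n)]

def computed_torque(q_des, h):
    """Finite-difference velocity and acceleration, then inverse dynamics."""
    ref = []
    for k in range(1, len(q_des) - 1):
        v = (q_des[k + 1] - q_des[k - 1]) / (2.0 * h)
        a = (q_des[k + 1] - 2.0 * q_des[k] + q_des[k - 1]) / (h * h)
        ref.append((q_des[k], v, inverse_dynamics(q_des[k], v, a)))
    return ref

def track(q_des, h, kp=100.0, kd=20.0):
    """Play the feedforward torque with a PD loop closed around it."""
    ref = computed_torque(q_des, h)
    q, v = ref[0][0], ref[0][1]
    for q_ref, v_ref, tau_ff in ref:
        tau = tau_ff + kp * (q_ref - q) + kd * (v_ref - v)
        v += h * forward_dynamics(q, v, tau)  # semi-implicit Euler step
        q += h * v
    return q  # position reached at the end of the plan
```

Calling `track(plan(1001, 0.001, 1.0), 0.001)` plays a one-second reach to 1 radian; the feedforward torque does most of the work and the PD term only corrects integration error.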
more applications of control than everything else combined. And these things can
work amazingly well, as you’re probably aware. Here’s a little industrial
robot doing fast and accurate, and crazy stuff. AUDIENCE: [INAUDIBLE]? EMO TODOROV: It just works. AUDIENCE: Is this Theta? EMO TODOROV: No. This is called–
what is this called? The Adept something or other. It’s one of the industrial–
the good industrial robots. Of course, this
is very limiting. It basically has to be
a fully actuated system, but if it is fully actuated,
we’re done, kind of. Now, here’s a general
algorithm based on the notion of inverse
dynamics and the fact that we can use it. So we’re going to
represent the trajectory, so we’re going to do
trajectory optimization here– no control law for
now, just trajectory– as a sequence of poses from
time zero to T minus 1. So this big Q is the–
our decision variable– it’s the thing we’re
trying to optimize. Define velocities and accelerations
with finite differences in time, run your physics
and geometry computations to compute the applied forces
and all kinds of other things you may care about. Define some trajectory
cost, which depends on all of the above,
but it’s additive over time. And of course, that’s
key to having efficiency. If there are constraints,
under-actuation constraints are particularly important
in inverse dynamics. Because, as we said, you can
postulate any trajectory, and there is always some force
that would have generated it. In some cases, it’s
obviously a bad idea, because you’re trying
to penetrate the wall. The wall is pushing back,
you’re going to penalize that. But the more subtle
and more important thing is that, if you have
a passive system, like some of the degrees of
freedom of this object are not actuated, right? So I should not be
putting force directly on the position and
orientation of this object. I’m only allowed to do that
through contact interactions. So when my inverse
dynamics output is tau, that tau better be
zero in the dimensions where I don’t have a motor or a
muscle attached to the system. And that’s really important. If you don’t do that, you
just do wishful thinking. You fly to space
and you’re done. So these nonlinear
equality constraints are really important. You can also impose
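A minimal sketch of the under-actuation check described above, assuming a hypothetical passive point mass moving vertically under gravity (no motor, no contact). Inverse dynamics on a postulated trajectory returns the force that would be needed; for an unactuated degree of freedom that force has to be zero. The function names are made up for this illustration.

```python
G = 9.81  # gravitational acceleration, m/s^2

def required_force(heights, h, mass=1.0):
    """Inverse dynamics of a point mass: finite-difference acceleration,
    then the vertical force needed to follow the postulated heights."""
    forces = []
    for k in range(1, len(heights) - 1):
        acc = (heights[k + 1] - 2.0 * heights[k] + heights[k - 1]) / (h * h)
        forces.append(mass * (acc + G))
    return forces

def underactuation_violation(heights, h):
    """Largest force demanded on a degree of freedom that has no actuator."""
    return max(abs(f) for f in required_force(heights, h))

h = 0.01
hover = [1.0] * 50                                          # wishful thinking
free_fall = [1.0 - 0.5 * G * (k * h) ** 2 for k in range(50)]  # physical
```

The hovering trajectory demands a constant force of mass times gravity from an actuator that does not exist, so the equality constraint rejects it; the free-fall trajectory needs no force and passes.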
inequality constraints by saying that the control
shouldn’t get too large. I tend to not care too much
about that because you can just do that with soft– like,
quadratic costs. And as long as you keep it
far from the large controls, you don’t really care. If you want to do
something really like bang-bang control,
where you’re really going to saturate your motors, fine. But people usually want to
operate things in a regime where they’re not
about to fall apart. And so this is just a
straightforward optimization problem. The way I like to solve it is
to form an augmented Lagrangian where you end up estimating
the [INAUDIBLE] online and use the Gauss-Newton method. One thing that’s
very important here, kind of relates to the– for the
cost being additive over time is that the Hessians end
up being band-diagonal. In this case, I actually
allow cyclic trajectories, so if you want to compute a
movement that repeats itself. I can say that the first
state and the last state are the same. I don’t know what they are,
but they should be the same. So it’s actually an
interesting example where you cannot apply dynamic
programming because there is no time axis, and time has
the topology of a circle now. But it’s perfectly fine
optimization problem, and it’s very meaningful
from a control perspective. So if you do that and you
factorize your Hessian, it ends up being an
arrowhead matrix, with less characterisation, but
it has these big, wide regions. And now, if you add
more time steps, obviously the complexity of
this thing grows linearly. So if you look at
[INAUDIBLE] equations, basically it’s an alternative
way to exploit those parts. It’s just the
recursive factorization of a band-diagonal Hessian. Whether you do it like that
or you form the whole thing and just hit it
with [INAUDIBLE], it actually makes no difference. So let’s see an example of that. So– oh my God. The light is so bad. So here’s this little hopper
that you can barely see. Cyclic trajectory, 200 time
steps, 10 milliseconds, so two second movement. I’m constraining it. I have two hard constraints–
two poses of this thing up and down, so it somehow
has to transition between the two. Just a control cost
and nothing else. Here’s what it does. So the initial trajectory–
how it’s initialized is the interpolation
between the two poses. And after it’s optimized,
it does that and goes through all the
contacts, the optimizer considers all the forces
explicitly, and figures it out. How long did it take? This took 245 iterations in
a total of 700 milliseconds. So these are 1,000
decision variables, meaning 200 time steps,
five positions each, so we optimize 1,000 real
values in 0.7 seconds if we do it right. This is something
really interesting. So this is somewhere in the
middle of the optimization. This is a time step of
all the trajectories. The different colors here
are the different joints, so this thing has five joints,
and this is the gradient. So I am plotting the gradient
itself as a trajectory, which– right, so it is. And you can see it has this
very spiky nature over time. Why is it so spiky? Because whenever
there is a contact, if you change the
configuration a little bit, the force changes
a lot, and it’s extremely sensitive to that. And then if you’re
somewhere else, you change things
and not much happens. So you get this very, very
badly behaved gradient. If you try to do first order
gradient descent on this, you’re going to be
stuck instantly, right? Because following the
gradient is a bad idea. You end up with trajectories
that are clearly going to be suboptimal. But here’s the good news– we use Gauss-Newton, so we’re not
just following the gradient. We’re multiplying the
gradient by the inverse of the Gauss-Newton Hessian. And when you do
that, that nasty, spiky thing becomes that
perfectly reasonable small thing, and now this is
the trajectory deformation, or search direction, actually,
that we can go explore. So basically, how
does that happen? Well, the Hessian has the
right scale and the right– captures the right correlations
among the variables, so it cancels this
nastiness here and produces something nice. And that’s the magic,
so without that, like first-order gradient descent,
again, forget it, basically. Now, of course, we can also
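The rescaling effect of the Gauss-Newton Hessian can be seen on a toy least-squares problem. This is a sketch under assumptions: the two residuals below are hypothetical, with the stiffness `K` standing in for a contact-like term that makes the gradient spiky. It is not the hopper problem from the talk.

```python
K = 1000.0  # hypothetical stiffness of the badly scaled residual

# Residual Jacobian for r(x) = [K * (x0 + x1 - 3), x0 - x1 + 1].
JAC = [[K, K], [1.0, -1.0]]

def residuals(x):
    return [K * (x[0] + x[1] - 3.0), x[0] - x[1] + 1.0]

def gradient(x):
    """Gradient of 0.5 * ||r(x)||^2, which is J^T r."""
    r = residuals(x)
    return [JAC[0][j] * r[0] + JAC[1][j] * r[1] for j in range(2)]

def gauss_newton_step(x):
    """Search direction d solving (J^T J) d = -J^T r for the 2x2 case."""
    g = gradient(x)
    hess = [[sum(JAC[k][i] * JAC[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
    det = hess[0][0] * hess[1][1] - hess[0][1] * hess[1][0]
    d0 = (-g[0] * hess[1][1] + g[1] * hess[0][1]) / det
    d1 = (-g[1] * hess[0][0] + g[0] * hess[1][0]) / det
    return [d0, d1]

x = [0.0, 0.0]
g = gradient(x)           # components on the order of millions
d = gauss_newton_step(x)  # a well-scaled step
x_new = [x[0] + d[0], x[1] + d[1]]
```

Because these residuals are linear, a single Gauss-Newton step lands on the minimizer (1, 2), while the raw gradient has components on the order of K squared and would need a tiny step size to be usable.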
use the forward dynamics and do model-predictive control. Over here, what I
did is this took, like, 200 iterations
to converge because we started from scratch. Remember, we started
with this thing that just goes up and down dramatically. In MPC, the way you
do it is you already have a trajectory
that you optimized. This is your initial state. You start executing it. The world deviates from that. You optimize a new trajectory. But when you optimize
a second trajectory, you always start with the previous one. So normally you have time for
only one Gauss-Newton iteration when you do MPC. In fact, if you
have time for more, an argument could be
made that you should make the horizon longer
or reduce the time step instead of running more. That’s kind of an
empirical question, but the point is that warm start is
the reason why MPC ever works. Of course, you
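The warm-start loop can be sketched as follows. Everything here is an assumption made for illustration: a scalar system x' = x + u, a quadratic cost, and one plain gradient step standing in for the single Gauss-Newton iteration mentioned in the talk.

```python
R = 0.1        # control cost weight (hypothetical)
HORIZON = 5    # number of planned controls
ALPHA = 0.05   # gradient step size

def rollout(x, controls):
    """Predicted states under the model x' = x + u."""
    states = []
    for u in controls:
        x = x + u
        states.append(x)
    return states

def improve(x, controls):
    """One gradient step on sum(x_t^2 + R * u_t^2) over the horizon."""
    states = rollout(x, controls)
    updated = []
    for k, u in enumerate(controls):
        grad = 2.0 * R * u + 2.0 * sum(states[k:])
        updated.append(u - ALPHA * grad)
    return updated

def mpc(x, ticks):
    controls = [0.0] * HORIZON
    for _ in range(ticks):
        controls = improve(x, controls)   # single iteration, warm-started
        x = x + controls[0]               # execute the first control
        controls = controls[1:] + [0.0]   # shift the horizon forward
    return x
```

Each tick spends only one optimizer iteration, yet the state is still regulated toward zero because the shifted previous plan is already close to the new optimum.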
have to do it fast, so you need efficient
physics simulation. You need good
optimization algorithms. This is the [INAUDIBLE] I published a while ago. It’s basically the
Gauss-Newton version of differential
dynamic programming from Jacobson many years ago. It has some clever
robustness tricks. And cost functions have
to be designed really, really carefully. In this global inverse
dynamic [INAUDIBLE], the cost function can
be something very basic. And it figures it out
because it has many iterations. When you’re doing [INAUDIBLE]
you have only one iteration. You’d better design
your cost carefully. So the game you’re
playing here is you as a human problem
designer are heuristically guessing what the value
function might look like. And the thing that
you’re calling a cost is really a bit more like
a value in your head. As far as the
[INAUDIBLE] is concerned, it’s a cost, but if you just
what’s called a sparse cost problem [INAUDIBLE]. But then again, tweaking the
weights on a cost function is not that hard. You pick the five or 10 terms
that you believe are relevant to your problem and just tweak
them and watch [INAUDIBLE]—- the thing is happening
in real time, right? You don’t have to wait for
a month for a neural network to converge. It actually happens
in milliseconds. So these things– AUDIENCE: [INAUDIBLE]
ILQR here versus the– could you have a warm start? Is this– the other
methods just aren’t amenable to warm starting? EMO TODOROV: Oh no, this is
amenable to warm starting. Nobody is using these
things for MPC right now. I’m actually about
to start doing that. My feeling is that these are
more powerful and more robust, but a bit slower. The shooting methods are– when they work– you know,
if you read old books on numerical recipes
for data, you know, shoot first, relax later. So this is a
relaxation [INAUDIBLE] and a shooting method. If that works, you
should use that. AUDIENCE: The shooting
could be more stable. Just– if there is any kind
of noise in the system, or it’s multi-modal, it
seems like you actually want to consider– I mean, maybe ILQG with
multiple linearizations on different sections– EMO TODOROV: Like
an ensemble thing? AUDIENCE: Yeah, basically. EMO TODOROV: You could do that. But you could also do the
[INAUDIBLE] ensemble mode. There’s only five people–
actually people here at this– [INAUDIBLE] Zoran
Popovich has done– and company did that a while
ago with having ensemble of [INAUDIBLE] system,
put CMA on top of it, and that actually
was really good. AUDIENCE: [INAUDIBLE] EMO TODOROV: That one
video is also kind of– AUDIENCE: Do you
think maybe ensemble– EMO TODOROV: Ensemble definitely
gives a robustification at the expense of n
times more computation. AUDIENCE: But the shooting
is very cheap for– EMO TODOROV: No, no,
no, shooting and– AUDIENCE: [INAUDIBLE]
variance issue. EMO TODOROV: The
computational speed, like what you have
to do per iteration is the same for all of these. Like I said, the Hessian
is band diagonal. And the derivative [INAUDIBLE]–
everything is the same computation. The question is, how many
iterations does each take? And what happens
during the line search? AUDIENCE: [INAUDIBLE] EMO TODOROV: Sorry, what? AUDIENCE: Because
it’s in the simulator, it’s in the same computational. EMO TODOROV: Yeah. AUDIENCE: Like you– EMO TODOROV: Well, it’s
always going to be– [INAUDIBLE] always going
to be on simulator, right? It’s– AUDIENCE: Some of these things,
you don’t get the derivatives out of it. So then it’s– EMO TODOROV: If you don’t have
derivatives– yeah, so, if you don’t have derivatives, then
you have to ask yourself, where do you get
the derivatives. Yeah, then shooting methods
have an interesting advantage that you could actually
shoot a bunch of trajectories without derivatives,
do linear regression, you have approximated
derivatives, and do the [INAUDIBLE]
equations with those. That can be a good trick. In fact, some people in Europe
working in computer graphics have done that. And they’ve gotten results
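The regression trick mentioned here can be sketched with a hypothetical one-dimensional model. The `step` function, the sample count, and the perturbation scale are assumptions for illustration only: perturb the state and control around a nominal point, roll the model forward without derivatives, and fit a local linear model by least squares.

```python
import math
import random

H_STEP = 0.01  # integration step of the hypothetical model

def step(x, u):
    """Black-box dynamics: only samples are available, no derivatives."""
    return x + H_STEP * (-math.sin(x) + u)

def fit_linear_model(x0, u0, n_samples=50, sigma=1e-3, seed=0):
    """Least-squares fit of dy ~ a * dx + b * du around (x0, u0)."""
    rng = random.Random(seed)
    y0 = step(x0, u0)
    sxx = sxu = suu = sxy = suy = 0.0
    for _ in range(n_samples):
        dx = rng.gauss(0.0, sigma)
        du = rng.gauss(0.0, sigma)
        dy = step(x0 + dx, u0 + du) - y0
        sxx += dx * dx
        sxu += dx * du
        suu += du * du
        sxy += dx * dy
        suy += du * dy
    # Solve the 2x2 normal equations for the two slopes.
    det = sxx * suu - sxu * sxu
    a = (sxy * suu - suy * sxu) / det
    b = (suy * sxx - sxy * sxu) / det
    return a, b
```

The fitted slopes approximate the state and control derivatives of the model at the nominal point, and they can then be used wherever analytic derivatives would have been needed.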
that are comparable to ours, and in some cases even better. Let me show some
of our old results. So this humanoid
is asked to get up. The cost basically says, put
your torso over the feet. Don’t move too much, and then
some angular momentum tricks. So there are some tricks here. But at the end, it works in
real time on the desktop. So we pull it, and it figures
out how to save itself and how to get up again. These are all
invented on the fly. This was done seven,
eight years ago now. You can do guys on unicycles. You can do– this is an
interesting example, actually. So here, we’re changing
the cost in real time. And you can see that– well, what’s going on? OK, previously, he was driving
around and trying not to fall. Now we tell him that he
really shouldn’t move. So he started using his
arms more to balance. So you can really change
the behavior that comes out by changing cost [INAUDIBLE]. This [INAUDIBLE] the
idea is to get an object and put it somewhere in a game. It’s MPC. So these methods can
work– not always, and they’re not that robust. But when they work they’re– It’s pretty amazing
how they work, because there’s no learning. It just makes it up on
the fly and [INAUDIBLE]. Now, of course, there is no
reason not to do learning. And recently, what
we did is combine MPC with value function learning. So let me just talk
about this first. So we are going to minimize
over a sequence of controls up to some horizon in the future. Normally, we’ll just do that. That’s how you do MPC. And what happens
after the horizon? Well, nothing, because you don’t
know what happens afterwards. But what if you had an
approximation to the value function? Then you should add that to
your MPC cost and optimize that. Because MPC just plans, OK,
here’s my trajectory from here to here. And is it a good
thing to be here? Well, I don’t know. So I need some function
to tell me that. This is exactly
equivalent to the Monte Carlo tree search, for example,
that the DeepMind people did in AlphaGo
where they learned some kind of value function. It’s not good enough
to play by itself, but when you put tree search on
top of it, it cleans it up. In control, there’s
no need for tree search, simply because if you
know the initial state– you could allow the thing to
search over a tree of trajectories. But what happens is all of them
collapse to the same thing. Because there is usually one
optimal trajectory starting from the state that you’re in. And the sensitivities
are kind of smooth. Like a game scenario,
where the opponent takes one different
move, and that sends the game on a
completely different path, that doesn’t usually happen
in continuous systems. AUDIENCE: You can imagine
some obstacles you have to go around, [INAUDIBLE]. EMO TODOROV: You could– exactly. So you have to make up an
example where that happens. Like if the obstacle is
exactly in front of you and [INAUDIBLE] is
exactly equally good, or if you are standing,
you want to start walking, you could start with the
left foot or the right foot. Yeah, you could make
up examples for that. They are rare, and
it doesn’t matter how you resolve the symmetry. AUDIENCE: Some
would say that those are the interesting
examples where you have those kind of obstacles. EMO TODOROV: Oh, no,
obstacles are interesting. But obstacles are usually
not exactly on your path to the target. So it’s obvious which
side you should go. AUDIENCE: Well, but
something more complicated than just obstacle [INAUDIBLE]. EMO TODOROV: So I can tell
you that people did that. And they showed demos. And in their demos, the
so-called tree actually collapsed into a trajectory. So they wasted their time. It may be slightly more
robust in some situations. And it goes back to
the shooting thing that you’re talking about. If you’re going to
throw a bunch of things, you might as well make
them deviate a little bit and then interpolate. And so then how do we
learn this value function? Well, if we have a bunch of
quasi optimal trajectories, then the optimal value function
has to be consistent with them through [INAUDIBLE]. And so we can use that
to do offline learning. [INAUDIBLE] I mean, I basically
told you what matters. Here’s some examples. This guy was trained to
go and push this cube and has to make it stop. And the ground’s slippery. So he’s doing these
heroic maneuvers to make the cube stop. This is the [INAUDIBLE]. I think that we can now do
roughly 100 times faster than they can in simulation
because of this trick. The sampling is– the complexity
is a bit hard to count, because there are
computations in both cases. But it’s a lot faster than
what they were able to do. So this is a good method. Now, we’ve been doing some
exploration in value functions. But I won’t talk about that. Now looking more generally,
the pattern here is this. You have joint optimization
between some global parameters and a bunch of trajectories. And you can think of it as
a joint optimization problem where you minimize
over both the trajectories and these global parameters. Now, you can use ADMM
or iterative methods, and fix one, and
optimize the other, but it starts out
as a joint problem. And there are lots
of instances of that that are all interesting. The last one I showed you is called
POLO, plan online, learn offline. So here the parameters are the
value function parameters, and this cost, L, is just
the optimal control cost plus a consistency cost saying
that the value function has to be consistent
through the [INAUDIBLE]. So you have to check this. Another example is
guided policy search, which is quite popular. Sergey Levine and
company developed that. So in that case, you
learn policy parameters. And again, you have a
consistency condition, but in that case, the consistency is
between the trajectories you’re optimizing
and the policy. You could do system
ID, where now the meaning of the parameters
is some model parameters, not policy parameters. And what’s the
meaning of the cost? Well, you have some prior
over the model parameters most likely. And now you have
an estimation cost. And you want your model to
agree with your sensor data for those model parameters. You can also do optimal
mechanism design where, again, you
parameterize the model. And you ask yourself, how
the– what robot should I build so it can
walk, for instance. So you can do that. You can leave a bunch of real
value parameters undefined and go solve that
problem jointly where now you both generate
locomotion behaviors and figure out what
the parameters should be at the same time. So it’s a very general
and very useful thing. It again has this
arrowhead structure, because now all the
trajectories line up here. We’re not going to
allow limit cycles here, because it’s more of
a controlled setting. These parameters
affect the cost at all times. So they form dense
rows of this Hessian. But again, if you add more
and more trajectories, the thing grows linearly. And factorization is efficient. How am I doing with time? When do you want me to end? AUDIENCE: It’s 10:35. EMO TODOROV: 10:35, OK. Now, here’s the
thing that you could do, adopting this sort of
determine– so you probably noticed that I treat the
world as being deterministic. Because guess what, that
screen has been here throughout my talk. It didn’t shake. I don’t know if you’ve noticed. So the state
transitions of the MDP are actually delta functions. You could, of course,
stochastic control is a very exciting thing. There are lots of situations
where you can apply it. The physical world is not
one of those situations. There is quantum noise,
but it’s negligible. That’s why Newtonian
mechanics actually is a pretty good
description of the world. So this whole notion that all
the world is stochastic– no, it’s not. It’s deterministic. It only starts shaking when
you shake it, because– either because you don’t
know what you’re doing, or because you’re
shaking it intentionally. Both of these are bad ideas. I’m not sure which
one is worse even. Of course, I’m not saying
you shouldn’t explore. If you really want to
explore, fine, go explore. But there’s no reason to do so. AUDIENCE: [INAUDIBLE] EMO TODOROV: What? The wind is not something noise. The wind is not noise. You don’t understand the wind. That doesn’t make it
noise, first of all. Second of all, it’s
not a [INAUDIBLE]. No, no, no, hold on, hold
on, hold on, hold on. So that– this old quote
from Mike Jordan when I was a student– probability is a model
of our ignorance. AUDIENCE: Sure. EMO TODOROV: The
wind is not noise. We just don’t know what it is. So we call it noise. So we capture some statistics
that happen to match. But they completely
miss the point. It’s not noise. Particularly if
you look at wind, it doesn’t jitter randomly. AUDIENCE: It’s interesting. This is one of those
sort of things, right? EMO TODOROV: Of course it’s– AUDIENCE: [INAUDIBLE]
degrees of freedom. You’re marginalizing over
the unknown degrees of freedom. EMO TODOROV: Complete. AUDIENCE: [INAUDIBLE] EMO TODOROV: Yes, yes, yes, this
is all very well understood. This is– what I’m saying
is that statistics misses the point of the physical world,
because the physical world is not random. AUDIENCE: I think– AUDIENCE: He’s
already with Sean. AUDIENCE: Yeah, I’d argue
with Sean on that view. AUDIENCE: I have [INAUDIBLE]. If you’re writing the
simulator, you just [INAUDIBLE] different seeds and you
ensemble them that day and simulate all of the
noise in your trajectory. So I think you can run multiple
seeds for what the wind is. AUDIENCE: No, but,
now, [INAUDIBLE] think about humans, doing
autonomous control of a beaker, and of his child. Do you know if he’s going
to [INAUDIBLE] or not? EMO TODOROV: I don’t. That does not make
the child random. AUDIENCE: Oh, it does. Come on. EMO TODOROV: No, no, no, no, no. AUDIENCE: The child is random. AUDIENCE: Emo, I think we’re
going to get stuck on this. EMO TODOROV: OK, this is a
deep philosophical discussion, which you can continue, but– AUDIENCE: All you’re
saying is that there is a difference between
uncertainty and Gaussian noise. Everybody just assumes that all
uncertainty is Gaussian noise. AUDIENCE: [INAUDIBLE] EMO TODOROV: So that, I– AUDIENCE: You’ve got
to be crazy, right? You’ve got to be
crazy, and yet– EMO TODOROV: I’m saying that. But that’s not important. What I’m saying is
that when you’re controlling a physical
system, even if you want to replace
uncertainty with noise and think of the same
thing, both the uncertainty and the noise are very small. And they’re certainly
not jittering the system like crazy, which
is what an MDP would do, which is why– sorry, what? AUDIENCE: I’m thinking about
when you have touching parts, and then they may
touch or may not touch, if you try to model them exactly
in a real world situation where you don’t have
exact state noise, you’re going to now need to– EMO TODOROV: So
state uncertainty is a separate issue. It should not be confused
with dynamics noise. So you can have noise
in your sensors. And that’s actually
a real thermal noise. And it could be quite large. Because of that noise, you
don’t know what the state is. But you do know that
whatever the state is, there is going to be a
perfectly nice and smooth deterministic trajectory
starting from that state. We just don’t know
what the state is. You know it’s not
going– look, there’s a big difference between a
trajectory starting at a known place and going straight
versus a trajectory doing Brownian motion. I hope you can agree
with that, right? MDPs model the
world as Brownian motion, which is a joke. Yes? AUDIENCE: Maybe I think what– that, also, I agree with. But the other thing I think
you’re trying to say here is that if the uncertainty
is large, or significant, you’re going to be
in trouble, right? We hoped you control
in those cases. EMO TODOROV: Of course,
yeah, you buy more sensors. You don’t– before you
start doing control. [INTERPOSING VOICES] OK so fine, OK,
cars– self-driving cars don’t work, right? You don’t have
self-driving cars. AUDIENCE: The planning– AUDIENCE: We don’t
have self-driving cars. Airplanes– try to put
yourself in a state where you don’t fly
through thunderstorms. Because if you do, you’ll crash. AUDIENCE: No, people do this. [LAUGHTER] EMO TODOROV: OK, guys, look,
this is a discussion point. Let’s move on. So here’s– AUDIENCE: I have a
quick comment on that. We do do control, but there’s
lots of stochasticity, but sometimes we just want one
if there’s a bandit problem. So if there’s lots
of stochasticity, we can’t project consumers’
behavior, et cetera. We could just treat it as
a one time [INAUDIBLE]. EMO TODOROV: No,
I’m fine with that. I’m– AUDIENCE: Definitely saying we
just do control those cases. We do do control, we
just do it differently. EMO TODOROV: That’s
what I’m emphasizing. I’m talking about
physical systems. So let’s not go into
stock markets, social things. If you must– AUDIENCE: If I kick a ball– EMO TODOROV: I’m sorry, what? AUDIENCE: If I kick a
ball, can you really predict where it will land? EMO TODOROV: After– 10 milliseconds after
you kick it, yes. AUDIENCE: No, no, no,
before [INAUDIBLE]. EMO TODOROV: But that’s not
about the same thing as an MDP. It’s not an MDP. The ball isn’t like doing that. It’s moving on a parabola. [INTERPOSING VOICES] AUDIENCE: 50 years ago. AUDIENCE: I accidentally
[INAUDIBLE]. There is no real noise. But maybe it’s useful if
we model noise sometimes, even though– EMO TODOROV: Sometimes, yes. AUDIENCE: Because
systems are not– EMO TODOROV:
Sometimes it’s useful. It’s just not most of
the time, and not when you’re controlling machines. AUDIENCE: I think it’s because
we don’t have a good model. That’s why– AUDIENCE: Yeah, it’s incorrect. EMO TODOROV: Look, let’s– AUDIENCE: Why don’t you go. Yeah– EMO TODOROV: All
right, so here’s how you can do
policy gradients very easily if you do the following. Actually, that covers
your ball kicking example. We’re going to absorb all noise
in the set of initial states. And the set of initial
states is going to become part of our
problem formulation. So you give me
the initial states that you believe
are interesting, which would be like
10 milliseconds after I hit the ball
in five different ways. And from that
point on, I’m going to assume everything
is deterministic. Then there’s a
bunch of trajectories. There’s dynamics. There’s a policy. There’s some parameters. And now the cost
is just a function of trajectory, which
depends on the parameters. We can roll it out. This, by the way,
resembles the notion of a mini batch in
supervised learning. So what do you do? So there is stochastic
gradient descent. What is stochastic
about gradient descent? Well, the neural network is
deterministic, thankfully. That’s why the
whole thing works. The only thing stochastic
is that you pick a random subset of queries. So you can think
of that as picking a subset of initial states
that you want to care about. After you pick your
random subset of images– AUDIENCE: [INAUDIBLE] EMO TODOROV: In computer. So after you pick some random
images in computer vision, then after that, you roll the
network deterministically. You do backprop
deterministically. You do gradient descent
deterministically and then call it stochastic afterwards. We can be doing that
with policy gradients. And we’re going
to save ourselves a lot of pain of
sampling, worrying about the bias, various
trade-offs, and so forth. And it’s pretty trivial. You can just do the math. Basically, you need the
recursive thing over time that propagates the sensitivity
of the state with respect to parameters. And then you just do a
chain rule, basically. There’s nothing to it– we
did this about 15 years ago on simple systems. It worked like a charm. We didn’t even call it
learning, because it looked like a solved
problem, basically. We put it all in papers
about neuroscience, modeling how the brain works, at the time. Now, here’s the last
thing I want to end with. A point about
optimization– there is a distinction between
optimization and discovery. And that happens a lot in what
I do due to contact dynamics. So imagine a landscape
that looks like that. It has some kind
of nice behavior. If I start close to
that, gradient descent, [INAUDIBLE] are going to work. But there can be
these large landscapes where there’s no local
information whatsoever to tell you where
the solution is. And so if you start
somewhere here and want to go to here, that’s what
we call discovery, like some big change that a
human may envision and make, and then after the change,
good things happen. But there’s no reason to
expect an optimizer [INAUDIBLE] to figure it out automatically. You could sample, but then
sampling grows exponentially with the radius of how far
the good solutions are. Gradients don’t work, et cetera. Why does this happen
with contact dynamics? Here is why. Suppose I’m here. My goal is to move
that laser pointer. I start moving my hands around. There is no contact
between me and that. The cost does not change. In fact, if I move
my hands, the cost goes up, because I
am expending energy, but the task isn’t
getting accomplished. So no sensible optimizer
or RL should ever discover that, oh, by the way,
I should take a few steps, and touch that thing, and then
start manipulating with it– unless the human designer puts
some information under the hood and doesn’t tell the
reviewers of their paper. And that’s the only way
that can actually happen. Here’s an example of that. This is an experiment– this is
something that we did a few– a couple of years ago. So I’ll just show
you some results. So these are a bunch
of neural networks trained with policy gradients
that I’ll explain in a moment. They were complicating, so a
hand moving objects to targets. So this just shows that it works
for different initial and final conditions. Here, we tell it the desired
position of this obstacle. It figures out how
to put it there. It works again for
various targets and such. So this is a pre-trained
neural network. We’re just running it to
show that it does the job. Here, the goal is
to hammer that nail. And it happily does that. Here, the goal is
to open a door. And that’s that. Now, if you just do RL training
with so-called sparse rewards– most of these problems
actually have this feature. Your hand has to do something
before something else happens. And you basically
have to wait forever. Ironically, the only
thing that actually works is that pen example. From a control perspective,
that may be the hardest task. From an RL perspective,
it’s the only one feasible. Because you’re in a
situation where if you start wiggling your fingers,
something will happen. And so this is complicated,
but you’re in the regime. The [INAUDIBLE] doesn’t
get off the ground. If you start shaping it, you can
kind of do it with a lot of labor. What we did is something
called demonstration augmented policy gradient,
which is basically a very simple idea. You do policy gradient
on the composite cost that’s the usual
cost to [INAUDIBLE] optimize plus an imitation cost. What does that mean? It means that we’re
going to evaluate our policy on a bunch
of demonstration states in some database. And we’re going
to ask the policy to stay close to whatever
was demonstrated there. And that makes a
dramatic improvement. So one way to solve
this discovery problem is to just download the
knowledge from a human brain into a network via
demonstrations. AUDIENCE: [INAUDIBLE] EMO TODOROV: So let
me just show you. It’s a bit of comic relief. This is a little
robot that I built. I kind of copied it from my–
from Vikash and made it better. I have four fingers
instead of three. That’s why it’s better. So here, I am heroically trying
to teleoperate this thing. And it’s really hard. Actually, to my knowledge, that
may be the only example where anyone has successfully
done teleoperation of dexterous manipulation. We actually make
and break contacts. Usually people do
grasping and moving. But this thing is
ridiculously hard. You need a very specialized
system and pretty clever software aids to– so for example, the mapping
here is not one to one. I had to invent
quite a few things. It’s a pain. And if you’re trying to
teleoperate something like a humanoid or
whatever, it’s hard. Discovery is usually done
by humans in the real world. We would like to think that
the computer just hands over a policy to the robot. That’s not true. There is actually
a human up there. What the human
does is informally observes what
happens, and goes in and makes important structural
changes which are really the reason why anything works. Let’s just look
for some examples. So in Boston Dynamics, this is
a human subject, the authority. They have this approach
of these multiple regions of attraction–
basically a bunch of state-dependent
simple controls that they tune by hand. In our case, we had these
value functions for MPC that really need to be tuned. If you look at the OpenAI
case, this is interesting. This is a new toolbox of knobs
that the humans can tweak. So you do this
domain randomization. Like you– you randomize
various aspects of the model
[INAUDIBLE] you retrain. But what do you randomize? There are infinitely many
things you can randomize. There are very few you can
actually do in practice. And which one you– ones you pick matter a lot. And so what they– they
randomized this, that, and the other. And that’s where the human
intuition comes in, what to randomize next. And there is no way to
invent that automatically, and that’s critical. And there is–
we’ve been trying– we’ve been working on something
called the contact invariant optimization, which
is a way to do automated discovery in the
specific domain of contact. So when that nasty landscape
is caused by contacts, we kind of have an automated
way to handle it. In general, you cannot handle
it, because, in general, that would mean doing
global search somehow. And that’s not tractable. But if you know something about
what’s causing that nastiness, you may be able to handle it. So here, the idea is this. So in these situations where
contacts are the only thing that gives you control authority,
you need an automated way to discover which contacts
are good for you when. And one way to do it
is to allow yourself some kind of virtual
contact forces that act from a distance
that can accomplish a task for you if possible. But then you’re going
to penalize them. And you’re going to
penalize them in such a way that if I am here, and the
thing wants to move forward, and I can be like a Jedi
and say move, and it moves, that’s the virtual
contact force. Now, the solver is
going to say, oh, you want to move that thing? Great, but you have
to touch it first. So now the solver
will give me the power to do that– put
out virtual force. But it will also automatically
add a cost on that distance. And the optimizer now will
know that it has to go and snap into that
configuration. There is a formal
way to do this. And fortunately,
I’m out of time. So I’m just going to play some
videos for you, and we’re done. So this is the work of
Igor Mordatch from my lab. Let me just go to the
more impressive things. So this whole thing is
optimized as one movie, like all these contact
points and such, you can tell it to go
to various positions. Let’s just fast forward through. You can– thing
can do handstands, can walk up obstacles. So in each one of those
cases, it’s the dynamic– it’s the inverse dynamics
trajectory optimization. So the whole trajectory
is optimized as one. If you don’t do
this virtual contact force match that
I told you about, any solver would get stuck. Because when this nasty green
shape thing where there’s no reason to make progress,
actually– but with– if you allow contacts to kind
of give hints to the optimizer automatically, then
good things can happen. I’ll try to briefly
explain how– maybe I should end– OK, no, I’m told not to
explain anything briefly. OK, thank you for listening. [APPLAUSE] AUDIENCE: We’ll take one final
question before the break. AUDIENCE: What was
your brief explanation? EMO TODOROV: Sorry? AUDIENCE: What was
your brief explanation? EMO TODOROV: The brief
explanation is this. I’m going to have a cost
function preprocessor. I’m going to discover
terms that are quadratic in the applied
forces, including the non-physical forces. And I’m going to allow myself
to invent virtual forces that explain away some of that cost. And then I’m going to put
a regularizer on it, which is something like a
complementarity condition, which says if you invent
the virtual force, you’d better close the distance
between those two contacts. And now I’m going to
replace that quadratic with a cost, which is defined
as the minimizer of something. So in order to
evaluate that cost, I actually have to solve an
optimization problem in an inner loop. Conveniently, it happens to
be a convex QP, so it’s fast. And then I can evaluate it. I can differentiate it
analytically at that solution. And so then my master
optimizer is just seeing some weird cost, which
happens to evaluate itself by calling another optimizer. But that’s OK, because
it can do that. AUDIENCE: Can we–
can we use this contact invariant optimization
for optimizing any trajectory where there is impact
with the ground? EMO TODOROV: You
could, but the thing with impact with the
ground is gravity is going to make it
happen without you trying. So I’m not sure you would
gain anything there. Where you gain is if
you have to take a sequence
of actions that’s going to establish some
contact, which then allows you to do something else, then
this thing is very helpful. Something like falling
and hitting the ground, gravity does that for you. So I’m not sure– I mean, you can
use it, but I’m not sure it’s going
to buy you anything. AUDIENCE: OK. EMO TODOROV: All
right, thank you. [APPLAUSE]
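
Editor's sketch of the deterministic policy gradient described in the talk: fix a set of initial states, roll the dynamics out deterministically, propagate the sensitivity of the state with respect to the policy parameters forward in time, and apply the chain rule to the cost. Everything concrete here is an assumption made for illustration and is not from the talk: the linear dynamics `A`, `B`, the linear policy `u = K x`, the quadratic cost `Q`, `R`, and the function name `cost_and_grad`.

```python
import numpy as np


def cost_and_grad(theta, x0s, A, B, Q, R, T):
    """Mean rollout cost over fixed initial states, and its exact gradient.

    Assumed toy setup (not from the talk): x_{t+1} = A x_t + B u_t,
    u_t = K x_t with K = theta.reshape(m, n), and running cost
    0.5 x'Qx + 0.5 u'Ru with symmetric Q and R. theta is a float array.
    """
    n, m = B.shape
    K = theta.reshape(m, n)
    total = 0.0
    grad = np.zeros_like(theta)
    for x0 in x0s:
        x = np.asarray(x0, dtype=float)
        # S = dx_t/dtheta; the initial state does not depend on theta
        S = np.zeros((n, theta.size))
        for _ in range(T):
            u = K @ x
            # du_t/dtheta: indirect part through x_t plus direct part through K
            dU = K @ S + np.kron(np.eye(m), x[None, :])
            total += 0.5 * x @ Q @ x + 0.5 * u @ R @ u
            # chain rule applied to the running cost
            grad += (Q @ x) @ S + (R @ u) @ dU
            # recursion that propagates the state sensitivity over time
            S = A @ S + B @ dU
            # deterministic dynamics step
            x = A @ x + B @ u
    return total / len(x0s), grad / len(x0s)


if __name__ == "__main__":
    # Hypothetical double integrator with time step 0.1
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    Q, R = np.eye(2), 0.1 * np.eye(1)
    # The "mini-batch" of initial states that absorbs all the uncertainty
    x0s = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.5])]
    theta = np.array([-0.5, -0.3])
    cost, grad = cost_and_grad(theta, x0s, A, B, Q, R, T=20)
    print("cost:", cost)
    print("gradient:", grad)
```

No sampling or variance reduction is involved: given the set of initial states, the gradient is exact, so it can be checked against finite differences and passed to any standard gradient-based optimizer.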
