Up next is Emo Todorov. He is an algorithm maestro. He’s been working on reinforcement learning and control for many years, presently [INAUDIBLE] at the University of Washington. Recently, his work on the [INAUDIBLE] has made huge contributions to modern deep learning and reinforcement learning, and I very much look forward to some cool demos, which I’m sure we’ll see.

EMO TODOROV: I’m excited to teach you guys. I decided to be a businessman. And it’s a good life– no teaching, no grant writing, no meetings– I cannot explain to you how good a life it is. OK. So I’m going to talk

about model-based control, specifically–

AUDIENCE: This is more important than that will be.

EMO TODOROV: –for physical systems. And let me start with

some general comments. So when you do

machine learning, you have the world that

generates data for you, and then you learn

from the data, and the data is everything you know. You don’t know beyond it. When we do control,

something else happens. We have, of course, data. With system identification, we build a model. And then, when we do control

synthesis, or optimization, or whatever, we use the

model most of the time. Even those of us who run around

claiming to be model-free, you look at what

they actually do. They have a model

and promise you that, one day when, you know,

the moon is right, we’re going to switch to the real

world, but that never happens. Now, interesting thing– what

happens when you have a model is you can treat it as if

it’s the world: send actions, generate data from it. But here the data is essential

because that’s all you know. Here, the data is just made up. What do I mean by that? Well, the data is implicitly

contained in your model. In fact, the model contains

all the data in the world. Why are you sampling from

something when you already know the answer? And it’s OK, because, you

know, you can practice for a distant future when RL

actually becomes model-free, as in you don’t actually have a model, and one day switch over. But you’re missing out, because

if you cut out the middleman here and connect your

optimization box directly to the model, you can get a

lot of stuff out of the model that’s not in the form of data. If you’re coming from

machine learning, you might ask, how can

something not be data? Well, there are lots

of things that are not really in the form of data– like, equations are

not really data. You can ask the model to

give you the math that’s in it, for example– you can’t sample it. In practice, though,

when doing robotics, there are lots of things you can compute. For example, end-effector Jacobians– these are very useful things that give you the mapping from [INAUDIBLE] and tell you how to go to targets without doing any learning, in just microseconds. A lot of robots are

controlled that way. You can get data with
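That end-effector Jacobian idea can be sketched in a few lines. Everything here– the two-link planar arm, the link lengths, the damped-least-squares update– is a made-up illustration, not any particular robot’s controller:

```python
import numpy as np

# Hypothetical two-link planar arm; link lengths are made-up parameters.
L1, L2 = 1.0, 0.8

def fk(q):
    """End-effector position for joint angles q = [q1, q2]."""
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def jacobian(q):
    """Analytic end-effector Jacobian d(fk)/dq, straight from the model."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def reach(q, target, iters=100, damping=1e-3):
    """Damped least squares: go to the target with no learning at all."""
    for _ in range(iters):
        err = target - fk(q)
        J = jacobian(q)
        # dq = J^T (J J^T + lambda I)^-1 err  (damped pseudoinverse)
        q = q + J.T @ np.linalg.solve(J @ J.T + damping * np.eye(2), err)
    return q

target = np.array([1.2, 0.6])
q = reach(np.array([0.3, 0.5]), target)
print(np.linalg.norm(fk(q) - target))  # essentially zero
```

Each update is a tiny linear solve, which is why this kind of controller runs in microseconds.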

just the dynamics. You can do inverse dynamics,

which I’ll talk about. You can look at which subspaces are actuated. You can get those analytically

and not have to learn them. Distance fields– if

you have geometries, those are very useful

for planning and putting potential fields in

collision avoidance and such. Stability criteria

like center of pressure or force closure for

locomotion– lots of things that can just be computed, and

it would be silly not to do so. Now, you could say,

OK, well, if you build a controller with respect to a model and then run it on the real world, is it going to work? That’s a perfectly legitimate question. Well, it does– quite often. So let me just go, briefly,

through a few videos to remind you that it does. There are no proofs here. Why are there no proofs here? I mean, there are heroic people

who will do robot control and try to provide certificates, whatever. All these people have to put in assumptions that are not valid in the real world, and therefore, those certificates are not for real. I mean, it’s nice to have some theory, but the reality is that, when you’re controlling a real physical system, there is not a single assumption that you can actually verify and say, yeah, that thing is really true– so it isn’t. So we have to look empirically

to see, well, does it work? This is the classic work of

Pieter Abbeel flying helicopters with model-predictive control. He has a model of the helicopter– flying upside down, doing all kinds of crazy things. Not a perfect model, but it works quite well. There’s now a lot of drones

that have some kind of model, do MPC through it, and

they fly happily around. Here is the work from Evangelos

Theodorou’s group at Georgia Tech, where they put a model

of this little car on a GPU that’s in the car, and they’re

doing model-predictive control in real time, and it’s doing

some crazy things like slipping and whatnot– works perfectly fine–

fully model-based. Here’s something we did. This is interesting. Here, you are learning

a model on the fly, and then you’re optimizing

the trajectory with respect to that model. Then you’re playing the

trajectory on the system, collecting some

more data, and re-learning the model. And the model is just a time-varying, locally linear model, which is why

you can learn it in like 10 or 15 iterations. So this is pretty amazing. So this 24 degree

of freedom hand learns to turn this

object in a total of 75 repetitions of that

behavior, and that’s it. Because, you know,

there’s enough data to learn the local model

and control around it. This is an example

where the nominal model isn’t good enough. So we had to create an ensemble

of models that are slightly different– only 10 of them. And then, we

planned a trajectory that’s feasible with

respect to all 10 of them. And the idea is

that, if something works for 10 different

models and simulations, it probably works for

anything in between. And hopefully a real system

is somewhere in between, and this little guy does that. This is particularly

nice, so this is planned how to get

up from the ground with all kinds of collisions

and predictions and such. And actually, that’s it. And here’s, of course, the

OpenAI thing that I’m sure you’ve all seen, which is

another example of this domain randomization idea. But in this case,

they’re actually learning policies and

not just trajectories. And so, this was actually quite

remarkable when it came out. So you give it the

desired cube orientation, and it manipulates the cube

to put it in that orientation. And so, there are

multiple approaches here. Model predictive

control is just optimizing the trajectory

through the model in real time and shifting the horizon. Offline trajectory

optimization– I’ll talk more about today. You can use this

to optimize longer, more complicated trajectories. Policy gradient– you know

what it is, and you [INAUDIBLE] in this case. They could be

nominal models; they could be adaptive

local models; they could be randomized physics

models; all kinds of things you could do. Of course, you could

also do model-free RL, but there are some issues. The existing results,

especially from Sergey Levine and other people around

Berkeley and Google, they’re really impressive. But they’re impressive because

of the computer vision. Now, if you took away

the computer vision, and you had full state

estimation, nobody would care about those results. They would correspond to

solved problems in robotics from 30 years ago. So the fact that

these things actually work with vision in the

real world without VICON markers, without

anything– that’s amazing. But that’s because deep

learning for vision is amazing; it’s not because of the

control success, really. They work well in

quasi-static tasks where sampling can be

safe and automated, so you can actually

get it off the ground. You could extend that machinery a little bit by having specialized objects,

like, for example, this is what my former student,

Vikash, has been doing. He is going to be

giving a talk tomorrow. But, if your hand that is trying

to learn how to manipulate an object is dropping

the object all the time, you could just mount

the object on a motor and make sure it

doesn’t go anywhere. And you can spin it, and so that

greatly simplifies learning. But, you know, clearly that

doesn’t scale very well. Now, this is important. There are clearly situations

where building a good model is just a nightmare. And in some sense, control

is easier than modeling. But that does not

automatically mean that RL is a good idea, because

those may be situations where sampling is just– you

shouldn’t even think about it, like, for example,

controlling a bulldozer. How many people would

volunteer to sample a bulldozer while sitting in it? AUDIENCE: What’s the

hard modeling part there? EMO TODOROV: Well,

the interactions of the dirt with itself,

how it compresses when you’re all over it. I mean, you can do

something approximately, like 1,000 spheres

that are glued, and you’ll get the

notion of pushing. But the notion of,

like, what happens when you’re all over it,

and how it compresses down– look. Anything could be modeled

given enough effort. That’s not the question. The question is,

should you put– if your goal is to control

and not to model it– should you put that effort

first into an accurate model and into model-based control, or

should you just go do something else? The clearest situation, where the solution is to go do something else– and it’s actually a very simple situation– is this: if we just

want a rigid grasp, all you have to do is

close your fingers. I mean, what is the

object going to do? It has to end up somewhere

in your hand, right? You don’t have to understand

every single thing when that happens. So there clearly are very

simple controls that are just– with a small number

of knobs you can tune, and if you tune those

right, it’s going to work. But that’s not really reinforcement learning. In fact, that’s not

learning at all. That’s just human

creativity and engineering. And I think in the

situations that are too hard to model,

most of the time it’s really human

creativity that solves it, and not any form of learning. Of course, there are

situations where– yes? Go ahead, yeah. AUDIENCE: I would like

to emphasize that a bit. You would say that

iterative improvement of a repetitive task

is not learning? EMO TODOROV: Sorry, what? AUDIENCE: Iterative

improvement of parameters in a repetitive task

is not learning? EMO TODOROV: Sorry. I mean– AUDIENCE: I don’t know. EMO TODOROV: Fine. I– AUDIENCE: I’m happy to

say that’s not learning. EMO TODOROV: I don’t care

if it’s learning or not. It’s a different type

of learning, right? It’s a very heavily structured

controlled architecture with a small number of

gains you’re tuning, sure. Yeah. I mean, we can call it

learning if you want, but it’s completely not– AUDIENCE: I know, I know. EMO TODOROV: –obviously

what you [INAUDIBLE]. Now, there are some situations that you could model– like, for example, Boston Dynamics’ humanoid. But still, people prefer

to do manual design and do it very carefully. And they’re doing

better robotics than anyone else on the planet

by laborious manual design, not even by model-based

optimization; just by a bunch of really smart

people tweaking knobs. But before that,

tweaking the knobs, deciding what knobs to

create for themselves so they can tweak those. That’s the tricky

part; that’s what we don’t know how to automate. Once the knobs are there, anyone can tweak them. But what knobs do we

put there, and what’s the machine with the

knobs on top of it? That’s the hard part. So you need the

models to do that, and you better be

able to simulate them. So throughout this talk, I’m

using this simulator called MuJoCo, which I’ve been developing for 11 years now, and

that’s what enabled me to leave academia behind and

just sell software licenses. So you can see all

kinds of systems– the various robots, the

tendon-driven things. You can play Jenga– it’s actually really

hard to play Jenga, because getting all the

contact interactions right is a nightmare. I have got all kinds of robots. You can have humanoids sliding

on the MathWorks height field. It’s very important

to have it there. You have like 1,000 spheres. You have clothes these days. You can have a

humanoid on a hammock. You can tie ropes

with the mouse. If you try to do this through reinforcement learning, apparently people are trying

and failing for the time being, but I’m sure somebody

will release something at some point. Instead though, the simulator

itself is very robust. Just the task is hard. You can do spongy things. You can even have like,

skeletons with, like, 90 muscles attached to them. So we can still

have lots of things. Just to introduce

a bit of notation. q is position, v is velocity, tau are applied forces. There’s the mass [INAUDIBLE] matrix. J is the Jacobian of contacts and maybe other constraints. Lambda are constraint forces, or contact forces. So forward dynamics–

you would think that physics would be a solved problem, but it isn’t, because of constraints, and especially contacts. And what I mean by it’s not a solved problem is that there’s no formula that tells you what the equations of motion are. To compute the

equations of motion, you actually need to solve

an optimization problem numerically at every

time step, in order to know how the universe

advances from one time step to the next. And so, you might ask, what will

happen if the universe fails to solve that problem? Does it, like, end? Thankfully, it

solves it every time. So we don’t know how it does it. But when we do it, we have to

solve an optimization problem. In MuJoCo, that

optimization problem happens to be convex

after a lot of work that I put into reformulating

the physics of contact, so it can become convex, and

yet reasonably realistic. In most alternative simulators,

like PhysX or Bullet, et cetera, it’s actually solving a non-convex problem, like an LCP with a

non-positive definite matrix. But anyway, so forward

dynamics is down to a numerical optimization. It is very fast. You can see some numbers here. So, for example, for humanoid

flailing around on a 10 core processor, we can collect

300,000 samples per second on a desktop processor. Importantly, this thing

also has inverse dynamics. So what inverse

dynamics means is, given the position, velocity,

and acceleration of the system, I can compute the force, which

generated that acceleration and that positional velocity. So if you want to

think about what that is in an MDP– forward dynamics is: given the state and the action, the transition probability tells you the distribution of the next state, and you get samples from that. The inverse dynamics would

be, given the current state and the next state

that you observed, what would be the posterior over actions? Obviously, for that, you

need some kind of prior, and it gets messy very quickly. And it’s probably

not very useful. But in continuous

control, that’s actually an extremely

useful quantity. AUDIENCE: [INAUDIBLE] all

your problems is this– this properly

constrained, meaning that it’s not predetermined? The inverse? EMO TODOROV: The inverse? Not– never. So I mean, this is

actually very simple. So the forward is the solution to an optimization problem, which just happens to be convex. So we know the global

minimum is given by setting the gradient to zero. This simply says the
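For the smooth, contact-free case this is easy to write down explicitly (made-up matrices below, not MuJoCo’s actual contact formulation): forward dynamics minimizes a strongly convex quadratic in the acceleration, and the same stationarity condition, rearranged, is inverse dynamics:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)   # inertia matrix: symmetric positive definite
c = rng.standard_normal(n)    # bias forces (made up)
tau = rng.standard_normal(n)  # applied force

# Forward dynamics: a* = argmin_a  0.5 * a^T M a - a^T (tau - c).
# Setting the gradient to zero gives M a* = tau - c.
a_star = np.linalg.solve(M, tau - c)

# Inverse dynamics: rearrange the same stationarity condition.
tau_rec = M @ a_star + c
print(np.allclose(tau_rec, tau))  # True
```

Because M is positive definite, the quadratic is strongly convex and both directions are unique.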

gradient of that is zero, and it’s rearranged. AUDIENCE: And so I’m

just saying we’re guaranteed that that

thing is strongly convex? EMO TODOROV: Oh, yeah. Yeah. Yeah, yeah. Because this is basically–

this is strongly convex. This just says f equals ma,

penalized with the inertia matrix, which is always symmetric positive definite. And this is adding something

that is [INAUDIBLE].. Yeah? AUDIENCE: Do you mean like the

inverse dynamics is physical, no matter how it should

be done, but not– EMO TODOROV: Exactly. So this is a soft contact model

and soft constraint model. For any crazy thing

you may choose to do, there exists a finite, possibly very large, force, which would have made the system

do that in that configuration. When we do optimal control,

we’re going to say, we don’t like very

large forces, and all of this will be pruned

out automatically. But we handle it. Now, inverse dynamics,

as I mentioned, it’s something very

useful in control. So again, just

forward means mapping from position, velocity,

and force to acceleration, and then you do a

numerical integration. Inverse means mapping

from position, velocity, and acceleration to tau. If you have that, you can

do really simple things, like computed torque control. Namely, you plan some

desired trajectory, which is a sequence of

positions over time. You compute the accelerations numerically. You use inverse

dynamics to figure out the force which would have generated that trajectory, and you play that

force in open loop, and you can put some

closed loop around it. That probably accounts for

more applications of control than everything else combined. And these things can
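The recipe just described fits in a short sketch– a one-joint pendulum with made-up parameters, where the same equations play both the controller’s model and the plant:

```python
import numpy as np

# Pendulum parameters (made up for illustration).
m, l, g, b = 1.0, 1.0, 9.81, 0.1

def inverse_dynamics(q, v, a):
    """Torque that produces acceleration a at state (q, v)."""
    return m * l * l * a + b * v + m * g * l * np.sin(q)

def forward_dynamics(q, v, tau):
    """Acceleration produced by torque tau at state (q, v)."""
    return (tau - b * v - m * g * l * np.sin(q)) / (m * l * l)

# Plan a desired trajectory: a smooth swing from 0 to pi/2.
dt, T = 0.001, 2000
t = np.arange(T) * dt
qd = (np.pi / 2) * (1 - np.cos(np.pi * t / t[-1])) / 2
vd = np.gradient(qd, dt)       # numerical velocities
ad = np.gradient(vd, dt)       # numerical accelerations

# Computed torque: feedforward from inverse dynamics + PD feedback.
q, v = 0.0, 0.0
kp, kd = 100.0, 20.0
err = []
for k in range(T):
    a_ref = ad[k] + kp * (qd[k] - q) + kd * (vd[k] - v)
    tau = inverse_dynamics(q, v, a_ref)
    a = forward_dynamics(q, v, tau)  # the "plant" (same model here)
    v += dt * a
    q += dt * v
    err.append(abs(q - qd[k]))

print(max(err))  # tiny tracking error
```

With an exact model the feedforward does almost all the work; the PD terms only mop up integration error.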

work amazingly well, as you’re probably aware. Here’s a little industrial

robot doing fast and accurate, and crazy stuff. AUDIENCE: [INAUDIBLE]? EMO TODOROV: It just works. AUDIENCE: Is this Theta? EMO TODOROV: No. This is called–

what is this called? The Adept something or other. It’s one of the industrial–

the good industrial robots. Of course, this

is very limiting. It basically has to be

a fully actuated system, but if it is fully actuated,

we’re done, kind of. Now, here’s a general

algorithm based on the notion of inverse

dynamics and the fact that we can use it. So we’re going to

represent the trajectory, so we’re going to do

trajectory optimization here– no controls at all for

now, just trajectory– as a sequence of poses from

time zero to T minus 1. So this big Q is the–

our decision variable– it’s the thing we’re

trying to optimize. Define velocity accelerations,

we find our differences in time around your physics

and geometry computations to compute the applied forces

and all kinds of other things you may care about. Define some trajectory

cause, which depends on all of the above,

but it’s additive over time. And of course, that’s

key to having efficiency. If there are constraints,

under-actuation constraints are particularly important

in inverse dynamics. Because, as we said, you can

postulate that in trajectory, and there is always some force

that will have generated it. In some cases, it’s

obviously a bad idea, because you’re trying

to penetrate the wall. The wall is pushing back,

you’re going to penalize that. But the more subtle

and more important thing is that, if you have

a passive system like some of the degrees of

freedom of this object are not actuated, right? So I should not be

putting force directly on the position

orientation of this object. I’m only allowed to do that

through contact interactions. So when my inverse

dynamics output is tau, that tau better be

zero in the dimensions where I don’t have a motor or a

muscle attached to the system. And that’s really important. If you don’t do that, you

just do wishful thinking. You fly to space

and you’re done. So these nonlinear

equality constraints are really important. You can also impose

inequality constraints by saying that the control

shouldn’t get too large. I tend to not care too much

about that because you can just do that with soft penalties– like quadratic costs. And as long as you keep it

far from the large controls, you don’t really care. If you want to do

something really like bang-bang control, where you’re really going to saturate your motors, fine. But people usually want to operate machines in a regime where they’re not

about to fall apart. And so this is just a

straightforward optimization problem. The way I like to solve it is

to form an augmented Lagrangian, where you end up estimating the [INAUDIBLE], and use the Gauss-Newton method. One thing that’s

very important here, which relates to the cost being additive over time, is that the Hessians end

up being band-diagonal. In this case, I actually

allow cyclic trajectories, so if you want to compute a

movement that repeats itself. I can say that the first

state and the last state are the same. I don’t know what they are,

but they should be the same. So it’s actually an

interesting example where you can not apply dynamic

programming because there is no time axis, and time has

the topology of a circle now. But it’s perfectly fine

optimization problem, and it’s very meaningful

from a control perspective. So if you do that and you

factorize your Hessian, it ends up being an arrowhead matrix– it’s still a sparse factorization, but it has these big, wide regions. And now, if you add

more timesteps, obviously the complexity of

this thing grows linearly. So if you look at

[INAUDIBLE] equations, basically it’s an alternative

way to exploit that sparsity. It’s just the

recursive factorization of a band-diagonal Hessian. Whether you do it like that
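The band structure is easy to see on the smallest possible instance– a 1-D point whose “control” is its second difference, with everything else stripped away (made-up sizes, trivial dynamics). The cost is additive over time, so the Hessian is banded, and because this toy is quadratic, one Gauss-Newton step solves it exactly:

```python
import numpy as np

# Decision variable: the whole pose sequence Q = (q_0 .. q_{T-1}).
T = 50
q0, qT = 0.0, 1.0

# Cost: sum_t a_t^2 with a_t = q_{t-1} - 2 q_t + q_{t+1} (unit timestep),
# endpoints fixed by elimination. Build the second-difference operator.
D = np.zeros((T - 2, T))
for t in range(1, T - 1):
    D[t - 1, t - 1: t + 2] = [1.0, -2.0, 1.0]
Df = D[:, 1:-1]                 # columns of the free variables q_1..q_{T-2}
H = 2.0 * Df.T @ Df             # Hessian wrt the free variables

# The Hessian is banded: nothing more than 2 off the diagonal is nonzero.
bandwidth = max(abs(i - j) for i in range(T - 2)
                for j in range(T - 2) if abs(H[i, j]) > 1e-12)

# Gauss-Newton on a quadratic converges in one step: solve H dq = -g.
q = np.zeros(T); q[0], q[-1] = q0, qT   # crude initialization
g = 2.0 * Df.T @ (D @ q)                # gradient wrt the free variables
q[1:-1] -= np.linalg.solve(H, g)        # lands on the straight line

print(bandwidth)  # 2: the cost couples only neighboring timesteps
```

With the cyclic boundary condition described above, the same Hessian picks up corner blocks and factorizes as an arrowhead matrix instead; the linear growth with the number of timesteps is unchanged.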

or you form the whole thing and just hit it

with [INAUDIBLE], it actually makes no difference. So let’s see an example of that. So– oh my God. The light is so bad. So here’s this little hopper

that you can barely see. Cyclic trajectory, 200 time steps of 10 milliseconds, so a two-second movement. I’m constraining it. I have two hard constraints– two poses of this thing, up and down, so it somehow has to transition between the two. Just a control cost

and nothing else. Here’s what it does. So the initial trajectory–

how it’s initialized is the interpolation

between the two poses. And after it’s optimized,

it does that and goes through all the

contacts, the optimizer considers all the forces

explicitly, and figures it out. How long did it take? This took 245 iterations in

a total of 700 milliseconds. So these are 1,000

decision variables, meaning 200 time steps,

five positions each, so we optimize 1,000 real

values in 0.7 seconds if we do it right. This is something

really interesting. So this is somewhere in the

middle of the optimization. This is a snapshot of the trajectory. The different colors here

are the different joints, so this thing has five joints,

and this is the gradient. So I am plotting the gradient

itself as a trajectory, which– right, so it is. And you can see it has this

very spiky nature over time. Why is it so spiky? Because whenever

there is a contact, if you change the

configuration a little bit, the force changes

a lot, and it’s extremely sensitive to that. And then if you’re

somewhere else, you change things

and not much happens. So you get this very, very badly behaved gradient. If you try to do first order

gradient descent on this, you’re going to be

stuck instantly, right? Because following the

gradient is a bad idea. You end up with trajectories

that are clearly going to be suboptimal. But here’s the good news– you use Gaussian, so we’re not

just following the gradient. We’re multiplying the

gradient by the inverse of a Gaussian, the Hessian. And when you do

that, that nasty, spiky thing becomes that

perfectly reasonable small thing, and now this is

the trajectory deformation, or search direction, actually, that we can go explore. So basically, how

does that happen? Well, the Hessian has the

right scale and the right– captures the right correlations

among the variables, so it cancels this

nastiness here and produces something nice. And that’s the magic,

so without that– first-order gradients, like I said– forget it, basically. Now, of course, we can also
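A toy picture of why that works– an ill-conditioned quadratic (random, made-up numbers) standing in for the contact-dominated trajectory cost. The raw gradient step barely moves; the Hessian-preconditioned step lands on the optimum:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
scales = np.logspace(0, 6, n)           # condition number ~1e6
H = U @ np.diag(scales) @ U.T           # symmetric positive definite
b = rng.standard_normal(n)
x_opt = np.linalg.solve(H, b)           # the minimizer of 0.5 x'Hx - b'x

x = np.zeros(n)
g = H @ x - b                           # gradient at the starting point

# Raw gradient step with the largest stable step size 1/L:
x_grad = x - (1.0 / scales[-1]) * g

# Gauss-Newton / Newton step: gradient preconditioned by the Hessian.
x_newton = x - np.linalg.solve(H, g)

print(np.linalg.norm(x_grad - x_opt) / np.linalg.norm(x_opt))  # near 1: barely moved
print(np.linalg.norm(x_newton - x_opt))                        # essentially zero
```

The Hessian supplies the right scale and the right correlations, which is exactly the “magic” described above.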

use the forward dynamics and do model-predictive control. Over here, what I

did is this took, like, 200 iterations

to converge because we started from scratch. Remember, we started

with this thing that just goes up and down dramatically. In MPC, the way you

do it is you already have a trajectory

that you optimized. This is your initial state. You start executing it. The word deviates from that. You optimizing a new trajectory. But when you optimize

a second trajectory, you always start with the one. So normally you have time for

only one Gaussian iteration when you do NPC. In fact, if you

have time for more, an argument could be

made that you should– it’ll make the horizon longer

or reduce the time step instead of running more. That’s kind of an

empirical question, but point is that warm start is

the reason why NPC ever works. Of course, you

have to do it fast, so you need efficient

physics simulation. You need good

optimization algorithms. This is the [INAUDIBLE] I published a while ago. It’s basically the Gauss-Newton version of differential

robustness tricks. And cost functions have

to be designed really, really carefully. In this global inverse

dynamic [INAUDIBLE], the cost function can

be something very basic. And it figures it out, because it has many iterations. When you’re doing [INAUDIBLE]

you have only one iteration. You’d better design

your cost carefully. So the game you’re

playing here is you as a human problem

designer are heuristically guessing what the value

function might look like. And the thing that

you’re calling a cost is really a bit more like

a value in your head. As far as the

[INAUDIBLE] is concerned, it’s a cost, but if you just

what’s called a sparse cost problem [INAUDIBLE]. But then again, tweaking the

weights on a cost function is not that hard. You pick the five or 10 terms

that you believe are relevant to your problem and just tweak

them and watch [INAUDIBLE]– the thing is happening

in real time, right? You don’t have to wait for

a month for a neural network to converge. It actually happens

in milliseconds. So these things– AUDIENCE: [INAUDIBLE]

ILQR here versus the– could you have a warm start? Is this– the other

methods just aren’t amenable to warm starting? EMO TODOROV: Oh no, this is

amenable to warm starting. Nobody is using these

things for MPC right now. I’m actually about

to start doing that. My feeling is that these are

more powerful and more robust, but a bit slower. The shooting methods are– when they work– you know,

if you read old books on numerical recipes

for ODEs, you know, shoot first, relax later. So this is a

relaxation [INAUDIBLE] and a shooting method. If that works, you

should use that. AUDIENCE: The shooting

could be more stable. Just– if there is any kind

of noise in the system, or it’s multi-modal, it

seems like you actually want to consider– I mean, maybe ILQG with

multiple linearizations on different sections– EMO TODOROV: Like

an assemble thing? AUDIENCE: Yeah, basically. EMO TODOROV: You could do that. But you could also do the

[INAUDIBLE] ensemble mode. There’s only five people–

actually people here at this– [INAUDIBLE] Zoran

Popović has done– and company did that a while

ago with having ensemble of [INAUDIBLE] system,

put CMA on top of it, and that actually

was really good. AUDIENCE: [INAUDIBLE] EMO TODOROV: That one

video is also kind of– AUDIENCE: Do you

think maybe ensemble– EMO TODOROV: Ensemble definitely

gives a robustification at the expense of n

times more computation. AUDIENCE: But the shooting

is very cheap for– EMO TODOROV: No, no,

no, shooting and– AUDIENCE: [INAUDIBLE]

variance issue. EMO TODOROV: The

computational speed, like what you have

to do per iteration is the same for all of these. Like I said, the Hessian

is band diagonal. And the derivative [INAUDIBLE]–

everything is the same computation. The question is, how many

iterations does each take? And what happens

during the line search? AUDIENCE: [INAUDIBLE] EMO TODOROV: Sorry, what? AUDIENCE: Because

it’s in the simulator, it’s the same computation. EMO TODOROV: Yeah. AUDIENCE: Like you– EMO TODOROV: Well, it’s

always going to be– [INAUDIBLE] always going

to be on simulator, right? It’s– AUDIENCE: Some of these things,

you don’t get the derivatives out of it. So then it’s– EMO TODOROV: If you don’t have

derivatives– yeah, so, if you don’t have derivatives, then

you have to ask yourself, where do you get

the derivatives? Yeah, then shooting methods

have an interesting advantage that you could actually

shoot a bunch of trajectories without derivatives,

do linear regression to get approximate derivatives, and do the [INAUDIBLE]

equations with those. That can be a good trick. In fact, some people in Europe
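That trick is easy to sketch (toy smooth dynamics, made-up numbers): perturb the inputs of a rollout step, regress the output differences, and you have approximate Jacobians with no differentiation anywhere:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x, u):
    """Toy dynamics: next state as a smooth function of state and action."""
    return np.array([x[0] + 0.1 * x[1],
                     x[1] + 0.1 * (u[0] - np.sin(x[0]))])

x0 = np.array([0.4, -0.2])
u0 = np.array([0.3])

# Sample small perturbations and regress next-state differences on them.
N, eps = 200, 1e-3
dxu = eps * rng.standard_normal((N, 3))            # [dx (2 dims), du (1 dim)]
dy = np.array([f(x0 + d[:2], u0 + d[2:]) - f(x0, u0) for d in dxu])
G, *_ = np.linalg.lstsq(dxu, dy, rcond=None)       # G.T ~ [df/dx, df/du]
A_hat, B_hat = G.T[:, :2], G.T[:, 2:]

# Analytic Jacobians, only for comparison here.
A_true = np.array([[1.0, 0.1], [-0.1 * np.cos(x0[0]), 1.0]])
B_true = np.array([[0.0], [0.1]])

print(np.max(np.abs(A_hat - A_true)))  # small (second order in eps)
print(np.max(np.abs(B_hat - B_true)))  # small
```

The estimated linearization is exactly what the Riccati-style backward pass needs at each timestep.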

working in computer graphics have done that. And they’ve gotten results

that are comparable to ours, and in some cases even better. Let me show some

of our old results. So this humanoid

is asked to get up. The cost basically says, put

your torso over the feet. Don’t move too much, and then

some angular momentum tricks. So there are some tricks here. But at the end, it works in

real time on the desktop. So we pull it, and it figures

out how to save itself and how to get up again. These are all

invented on the fly. This was done seven,

eight years ago now. You can do guys on unicycles. You can do– this is an

interesting example, actually. So here, we’re changing

the cost in real time. And you can see that– well, what’s going on? OK, previously, he was driving

around and trying not to fall. Now we tell him that he

really shouldn’t move. So he started using his

arms more to balance. So you can really change

the behavior that comes out by changing cost [INAUDIBLE]. This [INAUDIBLE] the

idea is to get an object and put it somewhere in a game. It’s MPC. So these methods can

work– not always, and they’re not that robust. But when they work they’re– It’s pretty amazing

how they work, because there’s no learning. It just makes it up on

the fly and [INAUDIBLE]. Now, of course, there is no

reason not to do learning. And recently, what

we’ve been doing is combining MPC with value function learning. So let me just talk

about this first. So we’re going to minimize over a sequence of controls up to some horizon in the future. Normally, we’ll just do that. That’s how you do MPC. And what happens

beyond the horizon? Well, nothing, because you don’t

know what happens afterwards. But what if you had an approximation to the value function? Then you should add that to

your MPC cost and optimize that. Because MPC just plans, OK,

here’s my trajectory from here to here. And is it a good

thing to be here? Well, I don’t know. So I need some function

to tell me that. This is exactly equivalent to the Monte Carlo tree search, for example, that the DeepMind people did in AlphaGo, where they learned some kind of value function. It’s not good enough
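In the linear-quadratic case you can check this idea directly (a double integrator with made-up weights): a short-horizon MPC whose terminal cost is the true value function x'Px reproduces the infinite-horizon optimal controller, while the same horizon with no terminal value does not:

```python
import numpy as np

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])   # double integrator
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = np.array([[0.1]])

def riccati_step(P):
    """One backward step of the discrete-time Riccati recursion."""
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return Q + A.T @ P @ (A - B @ K), K

# Infinite-horizon value function x'Px by iterating to convergence.
P = Q.copy()
for _ in range(1000):
    P, K_inf = riccati_step(P)

def mpc_gain(horizon, terminal):
    """First-step feedback gain of a finite-horizon MPC."""
    Pt = terminal.copy()
    for _ in range(horizon):
        Pt, K = riccati_step(Pt)
    return K

K_with = mpc_gain(5, P)       # short horizon + value function at the end
K_without = mpc_gain(5, Q)    # short horizon, no terminal value

print(np.allclose(K_with, K_inf))     # True
print(np.allclose(K_without, K_inf))  # False
```

The value function is a fixed point of the recursion, so adding it as terminal cost makes the truncated horizon exact; without it, the short horizon is myopic.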

to play by itself, but when you put tree search on top of it, it cleans it up. In control, there’s

no need for tree search, simply because if you know the initial state– you could allow the thing to search a tree of trajectories. But what happens is all of them

collapse to the same thing. Because there is usually one

optimal trajectory starting from the state that you’re in. And the sensitivities

are kind of smooth. Like a game scenario,

where the opponent takes one different

move, and that sends the game on a

completely different path, that doesn’t usually happen

in continuous systems. AUDIENCE: You can imagine

some obstacles you have to go around, [INAUDIBLE]. EMO TODOROV: You could– exactly. So you have to make up an

example where that happens. Like if the obstacle is

exactly in front of you and [INAUDIBLE] is

exactly equal to good, or if you are standing,

you want to start walking, you could start with the

left foot and the right foot. Yeah, you could make

up examples for that. They are rare, and

it doesn’t matter how you resolve the symmetry. AUDIENCE: Some

would say that those are the interesting

examples where you have those kind of obstacles. EMO TODOROV: Oh, no,

obstacles are interesting. But obstacles are usually

not exactly on your path to the target. So it’s obvious which

side you should go. AUDIENCE: Well, but

something more complicated than just obstacle [INAUDIBLE]. EMO TODOROV: So I can tell

you that people did that. And they showed demos. And in their demos, the

so-called tree actually collapsed into a trajectory. So they wasted their time. It may be slightly more

robust in some situations. And it goes back to

the shooting thing that you’re talking about. If you’re going to

throw a bunch of things, you might as well make

them deviate a little bit and then interpolate. And so then how do we

learn this value function? Well, if we have a bunch of

quasi optimal trajectories, then the optimal value function

has to be consistent with them through [INAUDIBLE]. And so we can use that

to do offline learning. [INAUDIBLE] I mean, I basically
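A minimal sketch of that offline step– with a made-up scalar system and hand-rolled least squares, not the actual implementation– is to regress a value function onto the observed cost-to-go along the stored trajectories:

```python
# Offline value-function fitting: given (quasi-optimal) trajectories and
# their per-step costs, regress V(x) onto the observed cost-to-go.
# The features and the toy data below are made up for illustration.

def features(x):
    return [1.0, x, x * x]          # polynomial features of a scalar state

def cost_to_go(costs):
    """Backward pass: target[t] = costs[t] + target[t+1]."""
    targets, acc = [], 0.0
    for c in reversed(costs):
        acc += c
        targets.append(acc)
    return targets[::-1]

def fit_value_function(trajectories):
    """Least squares fit of w with V(x) = w . features(x), via normal equations."""
    n = len(features(0.0))
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for states, costs in trajectories:
        for x, y in zip(states, cost_to_go(costs)):
            f = features(x)
            for i in range(n):
                b[i] += f[i] * y
                for j in range(n):
                    A[i][j] += f[i] * f[j]
    # Gaussian elimination with partial pivoting on the small normal system.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            m = A[r][col] / A[col][col]
            for j in range(col, n):
                A[r][j] -= m * A[col][j]
            b[r] -= m * b[col]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

# Toy trajectories: x -> x/2 each step, per-step cost x^2, so the true
# cost-to-go is x^2 * (1 + 1/4 + 1/16 + ...) = (4/3) x^2.
trajs = []
for x0 in (1.0, -2.0, 3.0):
    states, costs, x = [], [], x0
    for _ in range(20):
        states.append(x)
        costs.append(x * x)
        x *= 0.5
    trajs.append((states, costs))
w = fit_value_function(trajs)
print(round(w[2], 3))   # close to 4/3, the true quadratic coefficient
```

The targets are just the summed costs along each stored trajectory, which is the consistency condition the fitted value function is being asked to respect.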

told you what matters. Here’s some examples. This guy was trained to

go and push this cube and has to make it stop. And the ground’s slippery. So he’s doing these

heroic maneuvers to make the cube stop. This is the [INAUDIBLE]. I think that we can now do

roughly 100 times faster than they can in simulation

because of this trick. The sampling is– the complexity

is a bit hard to count, because there are

computations in both cases. But it’s a lot faster than

what they were able to do. So this is a good method. Now, we’ve been doing some

exploration in value functions. But I won’t talk about that. Now looking more generally,

the pattern here is this. You have joint optimization

between some global parameters and a bunch of trajectories. And you can think of it as

a joint optimization problem where you minimize

both the trajectories and these global parameters. Now, you can use ADMM

or iterative methods, and fix one, and

optimize the other, but it starts out

as a joint problem. And there are lots

of instances of that that are all interesting. The last one I showed you is called

POLO– [INAUDIBLE] learn offline. So here the parameters are the

value function parameters, and this cost, L, is just

the optimal control cost plus a consistency cost saying

that the value function has to be consistent

through the [INAUDIBLE]. So you have to check this. Another example is

guided policy search, which is quite popular. Sergey Levine and

company developed that. So in that case, you

learn policy parameters. And again, you have a

consistency condition, but in that case, consistencies

between trajectories you’re optimizing

and the policy. You could do system

ID, where now the meaning of the parameters

is some model parameters, not policy parameters. And what’s the

meaning of the cost? Well, you have some prior

over the model parameters most likely. And now you have

an estimation cost. And you want your model to

agree with your sensor data for those model parameters. You can also do optimal

mechanism design where, again, you

parameterize the model. And you ask yourself, how

the– what robot should I build so it can

walk, for instance. So you can do that. You can leave a bunch of

real-valued parameters undefined and go solve that

problem jointly where now you both generate

locomotion behaviors and figure out what

the parameters should be at the same time. So it’s a very general

and very useful thing. It again has this

arrowhead structure, because now all the

trajectories line up here. We’re not going to

allow limit cycles here, because it’s more of

a controlled setting. These parameters are such that they

affect the cost at all times. So they form dense

rows of this Hessian. But again, if you add more

and more trajectories, the thing grows linearly. And factorization is efficient. How am I doing with time? When do you want me to end? AUDIENCE: It’s 10:35. EMO TODOROV: 10:35, OK. Now, here’s the

thing that you could do, adopting this sort of

determinism– so you probably noticed that I treat the

world as being deterministic. Because guess what, that

screen has been here throughout my talk. It didn’t shake. I don’t know if you’ve noticed. So the state

transitions of the MDP are actually delta functions. Now, of course,

stochastic control is a very exciting thing. There are lots of situations

where you can apply it. The physical world is not

one of those situations. There is quantum noise,

but it’s negligible. That’s why Newtonian

mechanics actually is a pretty good

description of the world. So this whole notion that all

the world is stochastic– no, it’s not. It’s deterministic. It only starts shaking when

you shake it, because– either because you don’t

know what you’re doing, or because you’re

shaking it intentionally. Both of these are bad ideas. I’m not sure which

one is worse even. Of course, I’m not saying

you shouldn’t explore. If you really want to

explore, fine, go explore. But there’s no reason to do so. AUDIENCE: [INAUDIBLE] EMO TODOROV: What? The wind is not just noise. The wind is not noise. You don’t understand the wind. That doesn’t make it

noise, first of all. Second of all, it’s

not a [INAUDIBLE]. No, no, no, hold on, hold

on, hold on, hold on. So that– this old quote

from Mike Jordan when I was a student– probability is a model

of our ignorance. AUDIENCE: Sure. EMO TODOROV: The

wind is not noise. We just don’t know what it is. So we call it noise. So we capture some statistics

that happen to match. But they completely

miss the point. It’s not noise. Particularly if

you look at wind, it doesn’t jitter randomly. AUDIENCE: It’s interesting. This is one of those

sort of things, right? EMO TODOROV: Of course it’s– AUDIENCE: [INAUDIBLE]

degrees of freedom. You’re marginalizing over

the unknown degrees of freedom. EMO TODOROV: Completely. AUDIENCE: [INAUDIBLE] EMO TODOROV: Yes, yes, yes, this

is all very well understood. This is– what I’m saying

is that statistics misses the point of the physical world,

because the physical world is not random. AUDIENCE: I think– AUDIENCE: He’s

already with Sean. AUDIENCE: Yeah, I’d argue

with Sean on that view. AUDIENCE: I have [INAUDIBLE]. If you’re writing the

simulator, you just [INAUDIBLE] different seeds and you

ensemble them that way and simulate all of the

noise in your trajectory. So I think you can run multiple

seeds for what the wind is. AUDIENCE: No, but,

now, [INAUDIBLE] think about humans doing

autonomous control of a vehicle, and there’s a child. Do you know if he’s going

to [INAUDIBLE] or not? EMO TODOROV: I don’t. That does not make

the child random. AUDIENCE: Oh, it does. Come on. EMO TODOROV: No, no, no, no, no. AUDIENCE: The child is random. AUDIENCE: Emo, I think we’re

going to get stuck on this. EMO TODOROV: OK, this is a

deep philosophical discussion, which you can continue, but– AUDIENCE: All you’re

saying is that there is a difference between

uncertainty and Gaussian noise. Everybody just assumes that all

uncertainty is Gaussian noise. AUDIENCE: [INAUDIBLE] EMO TODOROV: So that, I– AUDIENCE: You’ve got

to be crazy, right? You’ve got to be

crazy, and yet– EMO TODOROV: I’m saying that. But that’s not important. What I’m saying is

that when you’re controlling a physical

system, even if you want to replace

uncertainty with noise and think of the same

thing, both the uncertainty and the noise are very small. And they’re certainly

not jittering the system like crazy, which

is what an MDP would do, which is why– sorry, what? AUDIENCE: I’m thinking about

when you have touching parts, and then they may

touch or may not touch, if you try to model them exactly

in a real world situation where you don’t have

exact state knowledge, you’re going to now need to– EMO TODOROV: So

state uncertainty is a separate issue. It should not be confused

with dynamics noise. So you can have noise

in your sensors. And that’s actually

a real thermal noise. And it could be quite large. Because of that noise, you

don’t know what the state is. But you do know that

whatever the state is, there is going to be a

perfectly nice and smooth deterministic trajectory

starting from that state. We just don’t know

what the state is. You know it’s not

going– look, there’s a big difference between a

trajectory starting at a known place and going straight

versus a trajectory doing Brownian motion. I hope you can agree

with that, right? MDPs model the

world as Brownian motion, which is a joke. Yes? AUDIENCE: Maybe I think what– that, also, I agree with. But the other thing I think

you’re trying to say here is that if the uncertainty

is large, or significant, you’re going to be

in trouble, right? How would you control

in those cases? EMO TODOROV: Of course,

yeah, you buy more sensors. You do that before you

start doing control. [INTERPOSING VOICES] OK so fine, OK,

so this is just– [INTERPOSING VOICES] [LAUGHTER] AUDIENCE: [INAUDIBLE] AUDIENCE: Self-driving

cars– self-driving cars don’t work, right? You don’t have

self-driving cars. AUDIENCE: The planning– AUDIENCE: We don’t

have self-driving cars. Airplanes– try to put

yourself in a state where you don’t fly

through thunderstorms. Because if you do, you’ll crash. AUDIENCE: No, people do this. [LAUGHTER] EMO TODOROV: OK, guys, look,

this is a discussion point. Let’s move on. So here’s– AUDIENCE: I have a

quick comment on that. We do do control, but there’s

lots of stochasticity, but sometimes we just want one

if there’s a bandit problem. So if there’s lots

of stochasticity, we can’t project consumers’

behavior, et cetera. We could just treat it as

a one-time [INAUDIBLE]. EMO TODOROV: No,

I’m fine with that. I’m– AUDIENCE: I’m definitely saying we

still do control in those cases. We do do control, we

just do it differently. EMO TODOROV: That’s

what I’m emphasizing. I’m talking about

physical systems. Unless, of course, we go into

stock markets, social things. If you must– AUDIENCE: If I kick a ball– EMO TODOROV: I’m sorry, what? AUDIENCE: If I kick a

ball, you can’t really predict where it will land. EMO TODOROV: After– 10 milliseconds after

you kick it, yes. AUDIENCE: No, no, no,

before [INAUDIBLE]. EMO TODOROV: But that’s not

about the same thing as an MDP. It’s not an MDP. The ball isn’t like doing that. It’s moving on a parabola. [INTERPOSING VOICES] AUDIENCE: 50 years ago. AUDIENCE: I accidentally

[INAUDIBLE]. There is no real noise. But maybe it’s useful if

we model noise sometimes, even though– EMO TODOROV: Sometimes, yes. AUDIENCE: Because

systems are not– EMO TODOROV:

Sometimes it’s useful. It’s just not most of

the time, and not when you’re controlling machines. AUDIENCE: I think it’s because

we don’t have a good model. That’s why– AUDIENCE: Yeah, it’s incorrect. EMO TODOROV: Look, let’s– AUDIENCE: Why don’t you go. Yeah– EMO TODOROV: All

right, so here’s how you can do

policy gradients very easily if you do the following. Actually, that covers

your ball kicking example. We’re going to absorb all noise

in the set of initial states. And the set of initial

states is going to become part of our

problem formulation. So you give me

the initial states that you believe

are interesting, which would be like

10 milliseconds after I hit the ball

in five different ways. And from that

point on, I’m going to assume everything

is deterministic. So now there’s a

bunch of trajectories. There’s dynamics. There’s a policy. There’s some parameters. And now the cost

is just a function of trajectory, which

depends on the parameters. We can roll it out. This, by the way,

resembles the notion of a mini batch in

supervised learning. So what do you do? So there is stochastic

gradient descent. What is stochastic

about gradient descent? Well, the neural network is

deterministic, thankfully. That’s why the

whole thing works. The only thing stochastic

is that you pick a random subset of queries. So you can think

of that as picking a subset of initial states

that you want to care about. After you pick your

random subset of images– AUDIENCE: [INAUDIBLE] EMO TODOROV: In computer. So after you pick some random

images in computer vision, then after that, you roll the

network deterministically. You do backprop

deterministically. You do gradient descent

deterministically and then call it stochastic afterwards. We can be doing that

with policy gradients. And we’re going

to save ourselves a lot of pain of

sampling, worrying about the bias, various

trade-offs, and so forth. And it’s pretty trivial. You can just do the math. Basically, you need the

recursive thing over time that propagates the sensitivity

of the state with respect to parameters. And then you just do a

chain rule, basically. There’s nothing to it– we’ve
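That recursion is short enough to write down. The sketch below uses an assumed scalar linear system with a linear policy– not the speaker’s setup– and propagates the state sensitivity s = dx/dθ alongside the deterministic rollout, averaging gradients over a fixed minibatch of initial states:

```python
# Deterministic policy gradient via sensitivity propagation (a chain rule
# through the rollout). Toy scalar linear system and linear policy,
# made up for illustration.

A = 0.8                              # dynamics x' = A*x + u
INIT_STATES = [1.0, -0.5, 2.0]       # the agreed-upon set of initial states

def rollout_grad(theta, x0, T=20):
    """Return (cost, dcost/dtheta) for the policy u = theta * x."""
    x, s = x0, 0.0                   # s = dx/dtheta, propagated recursively
    cost, grad = 0.0, 0.0
    for _ in range(T):
        u = theta * x
        du = x + theta * s           # du/dtheta by the product rule
        cost += x * x + u * u
        grad += 2.0 * x * s + 2.0 * u * du
        x = A * x + u                # deterministic state update
        s = A * s + du               # sensitivity update: linearized dynamics
    return cost, grad

def train(theta=0.0, lr=0.01, iters=300):
    for _ in range(iters):
        g = sum(rollout_grad(theta, x0)[1] for x0 in INIT_STATES) / len(INIT_STATES)
        theta -= lr * g
    return theta

theta = train()
print(theta < 0)                     # True: the learned gain is stabilizing feedback
```

The rollout, the sensitivity propagation, and the descent are all deterministic; the only “stochastic” choice, as in supervised learning, would be which initial states go into the batch.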

done this about 15 years ago on simple systems. It worked like a charm. We didn’t even call it

learning, because it looked like a solved

problem, basically. We put it all in papers

about neuroscience, modeling how the brain works at the time. Now, here’s the last

thing I want to end with. A point about

optimization– there is a distinction between

optimization and discovery. And that happens a lot in what

I do due to contact dynamics. So imagine a landscape

that looks like that. It has some kind

of nice behavior. If I start close enough to

that, gradient descent [INAUDIBLE] is going to work. But there can be

these large landscapes where there’s no local

information whatsoever to tell you where

the solution is. And so if you start

somewhere here and get to here, that’s what

we call discovery, like some big change that a

human may envision and make it, and then after the change,

good things happen. But there’s no reason to

expect an optimizer [INAUDIBLE] to figure it out automatically. You could sample, but then

sampling grows exponentially with the radius of how far

the good solutions are. Gradients don’t work, et cetera. That’s what happens

with contact dynamics. Here is why. Suppose I’m here. My goal is to move

that laser pointer. I start moving my hands around. There is no contact

between me and that. The cost does not change. In fact, if I move

my hands, the cost goes up, because I

am expending energy, but the task isn’t

getting accomplished. So no sensible optimizer

should ever discover that, oh, by the way,

I should take a few steps, and touch that thing, and then

start manipulating with it– unless the human designer puts

some information under the hood and doesn’t tell the

reviewers of their paper. And that’s the only way

that can actually happen. Here’s an example of that. This is an experiment– this is

something that we did a few– a couple of years ago. So I’ll just show

you some results. So these are a bunch

of neural networks trained with policy gradients

that I’ll explain in a moment. They control a complicated

hand moving objects to targets. So this just shows that it works

for different initial and final conditions. Here, we tell it the desired

position of this object. It figures out how

to put it there. It works again for

various targets and such. So this is a pre-trained

neural network. We’re just running it to

show that it does the job. Here, the goal is

to hammer that nail. And it happily does that. Here, the goal is

to open a door. And that’s that. Now, if you just do RL training

with so-called sparse rewards– most of these problems

actually have this feature. Your hand has to do something

before something else happens. And you basically

have to wait forever. Ironically, the only

thing that actually works is that pen example. From a control perspective,

that may be the hardest task. From an RL perspective,

it’s the only one feasible. Because you’re in a

situation where if you start wiggling your fingers,

something will happen. And so this is complicated,

but you’re in the regime. The [INAUDIBLE] doesn’t

get off the ground. If you start shaping it, you can

kind of do it with a lot of labor. What we did is something

called demonstration augmented policy gradient,

which is basically a very simple idea. You do policy gradient

on the composite cost that’s the usual

cost to [INAUDIBLE] optimize plus an imitation cost. What does that mean? It means that we’re

going to evaluate our policy on a bunch

of demonstration states in some database. And we’re going

to ask the policy to stay close to whatever

was demonstrated. And that makes a

dramatic improvement. So one way to solve
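In sketch form, the composite objective is just a weighted sum. The policy, dynamics, and “demonstrations” below are toy scalar stand-ins (the demos here follow u = -0.9x), so this shows the shape of the objective rather than the actual DAPG setup:

```python
# Demonstration-augmented objective: the usual task cost plus an imitation
# cost that evaluates the current policy on stored demonstration states.

DEMOS = [(-1.0, 0.9), (0.5, -0.45), (2.0, -1.8)]   # (state, demonstrated action)

def policy(theta, x):
    return theta * x

def task_cost(theta, x0=1.0, T=20, a=1.0):
    x, c = x0, 0.0
    for _ in range(T):
        u = policy(theta, x)
        c += x * x + u * u
        x = a * x + u
    return c

def imitation_cost(theta):
    return sum((policy(theta, s) - u) ** 2 for s, u in DEMOS)

def composite(theta, lam=5.0):
    return task_cost(theta) + lam * imitation_cost(theta)

# Coarse search over the feedback gain, standing in for policy-gradient training.
best = min((i / 100.0 for i in range(-150, 1)), key=composite)
print("best gain:", best)
```

The imitation term pulls the solution toward the demonstrated behavior even in regions where the task cost alone would give the optimizer nothing to work with, which is the point being made above.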

this discovery problem is to just download the

knowledge from a human brain into a network via

demonstrations. AUDIENCE: [INAUDIBLE] EMO TODOROV: So let

me just show you. It’s a bit of comic relief. This is a little

robot that I built. I kind of copied it from my–

from Vikash and made it better. It has four fingers

instead of three. That’s why it’s better. So here, I am heroically trying

to teleoperate this thing. And it’s really hard. Actually, for biodata, that

may be the only example where anyone has successfully

done teleoperation of dexterous manipulation. We actually make

and break contacts. Usually people do

grasping and moving. But this thing is

ridiculously hard. You need a very specialized

system and pretty clever software aids to– so for example, the mapping

here is not one to one. I had to invent

quite a few things. It’s a pain. And if you’re trying to

teleoperate something like a humanoid or

whatever, it’s hard. Discovery is usually done

by humans in the real world. We would like to think that

the computer just hands over a policy to the robot. That’s not true. There is actually

a human up there. What the human

does is informally observe what

happens, and goes in and makes important structural

changes which are really the reason why anything works. Let’s just look

for some examples. So in Boston Dynamics, there is

a human, the authority. They have this approach

of these multiple regions of attraction–

basically a bunch of state-dependent

simple controllers that they tune by hand. In our case, we had these

value functions for MPC that really need to be tuned. If you look at the OpenAI

case, this is interesting. This is a new toolbox of knobs

that the humans can tweak. So you do this

domain randomization. Like, you randomize

various aspects of the model

[INAUDIBLE] you retrain. But what do you randomize? There are infinitely many

things you can randomize. There are very few you can

actually do in practice. And which one you– ones you pick matter a lot. And so what they– they

randomized this, that, and the other. And that’s where the human

intuition comes in, deciding what to randomize next. And there is no way to

invent that automatically, and that’s critical. And there is–

we’ve been trying– we’ve been working on something

called the contact invariant optimization, which

is a way to do automated discovery in the

specific domain of contact. So when that nasty landscape

is caused by contacts, we kind of have an automated

way to handle it. In general, you cannot handle

it, because, in general, that would mean doing

global search somehow. And that’s not tractable. But if you know something about

what’s causing that nastiness, you may be able to handle it. So here, the idea is this. So in these situations where

contacts are the only thing that gives you control authority,

you need an automated way to discover which contacts

are good for you when. And one way to do it

is to allow yourself some kind of virtual

contact forces that act from a distance

that can accomplish a task for you if possible. But then you’re going

to penalize them. And you’re going to

penalize them in such a way that if I am here, and the

thing wants to move forward, and I can be like a Jedi

and say move, and it moves, that’s the virtual

contact force. Now, the solver is

going to say, oh, you want to move that thing? Great, but you have

to touch it first. So now the solver

will give me the power to do that– put

out a virtual force. But it will also automatically

add a cost on that distance. And the optimizer now will

know that it has to go and snap into that

configuration. There is a formal

way to do this. And fortunately,

I’m out of time. So I’m just going to play some

videos for you, and we’re done. So this is the work of

Igor Mordatch from my lab. Let me just go to the

more impressive things. So this whole thing is

optimized as one movement, like all these contact

points and such, you can tell it to go

to various positions. Let’s just fast forward through. The thing

can do handstands, can walk up obstacles. So in each one of those

cases, it’s the dynamic– it’s the inverse dynamics

trajectory optimization. So the whole trajectory

is optimized as one. If you don’t do

this virtual contact force trick that

I told you about, any solver would get stuck. Because with this nasty green

shape thing, there’s no reason to make progress,

actually– but with– if you allow contacts to kind

of give hints to the optimizer automatically, then

good things can happen. I’ll try to briefly

explain how– maybe I should end– OK, no, I’m told not to

explain anything briefly. OK, thank you for listening. [APPLAUSE] AUDIENCE: We’ll take one final

question before the break. AUDIENCE: What was

your brief explanation? EMO TODOROV: Sorry? AUDIENCE: What was

your brief explanation? EMO TODOROV: The brief

explanation is this. I’m going to have a cost

function preprocessor. I’m going to discover

terms that are quadratic in the applied

forces, including the non-physical forces. And I’m going to allow myself

to invent virtual forces that explain away some of that cost. And then I’m going to put

a regularizer on it, which has something like a

complementarity condition, which says if you invent

the virtual force, you’d better close the distance

between those two contacts. And now I’m going to

replace that quadratic with a cost, which is defined

as the minimizer of something. So in order to

evaluate that cost, I actually have to solve an

optimization problem in an inner loop. Conveniently, it happens to

be a convex QP, so it’s fast. And then I can evaluate it. I can differentiate it

analytically at that solution. And so then my master

optimizer is just seeing some weird cost, which

happens to evaluate itself by calling another optimizer. But that’s OK, because
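A scalar caricature of that bilevel cost– all quantities made up, with the inner “convex QP” collapsed to a one-variable problem that has a closed-form minimizer:

```python
# The outer cost is defined as the minimum of an inner convex problem
# over a virtual contact force, penalized by the contact distance.

def inner_cost(d, f_des=1.0, k=10.0):
    """min over f of (f - f_des)^2 + k*d*f^2, in closed form.
    A virtual force is cheap only when the distance d is closed."""
    f_star = f_des / (1.0 + k * d)          # stationarity condition
    return (f_star - f_des) ** 2 + k * d * f_star ** 2

def outer_cost(hand_pos, obj_pos=2.0, effort_w=0.01):
    d = abs(hand_pos - obj_pos)             # hand-object distance
    return inner_cost(d) + effort_w * hand_pos ** 2

# The inner term gives the outer optimizer a gradient toward contact:
# without it, the cost would say nothing until contact actually occurs.
best = min((i / 100.0 for i in range(0, 301)), key=outer_cost)
print(best)   # 2.0: the hand moves all the way to the object
```

The regularizer in the talk plays the role of the k*d penalty here: inventing a virtual force is allowed, but only if the contact distance is driven to zero.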

it can do that. AUDIENCE: Can we–

can we use contact invariant optimization

for optimizing any trajectory where there is impact

with contact? EMO TODOROV: You

could, but the thing with impact with the

ground is gravity is going to make it

happen without you trying. So I’m not sure you would

gain anything there. Where you gain is if there is

some contact where you have to take a sequence

of actions that’s going to establish some

contact, which then allows you to do something else, then

this thing is very helpful. Something like falling

and hitting the ground, gravity does that for you. So I’m not sure– I mean, you can

use it, but I’m not sure it’s going

to buy you anything. AUDIENCE: OK. EMO TODOROV: All

right, thank you. [APPLAUSE]