2019 ADSI Summer Workshop: Algorithmic Foundations of Learning and Control, Vikash Kumar


PRESENTER: The next speaker is Vikash Kumar from Google Brain. He got his PhD from the University of Washington, from this department, where he worked with Emo Todorov. He is now a research scientist at Google Brain and has also been at other places, like OpenAI. We are very excited about his talk. He's going to include very cool demos with two robots as well, so we're looking forward to it. VIKASH KUMAR: All right. All right, I'm very
excited to be here. It's nostalgic to come back to the place where you've learned so much and then have something to share. So it's exciting as well. I've learned two things over the last two days. If your advisor is speaking at the same workshop, it's a nightmare for you, because he tells half the story and then shits on the rest. So I'm going to try to pick some pieces from what he said, and some pieces where I think we have been able to move past some of the challenges. I think Byron's talk
is very much aligned with those challenges,
and we will see what the parallels
are, what the lessons are. Some of the differences
from previous speakers that I want to highlight
right away is– I’m not trying to
prove anything. These are merely data
points of how far we have been able to push. Absence of something here
doesn’t mean it doesn’t work. It just means that
I haven’t tried it. You all should try it. OK. So these are data
points of how far we can push these complicated systems. Some of the differences in the applications will be that these are really high dimensional systems where you cannot form any kind of simulator, much like what Byron was mentioning. We are talking about dimensionality roughly between 60 and 100 dimensional spaces, with a lot of deformation, where no rigid body assumption is really valid. And the second thing I
want to also point out is the state distribution
shifts quite a bit. So unlike Byron, where you're going around a track, here you can't really visit the same place again. It's almost that kind of setting. If you guys haven't noticed, these are all pictures of hands. So this kind of gives you the diversity of configurations a manipulator of this dimensionality can actually take. So this motivates the problem very nicely: if things are that nasty and that complicated, and that's your state space, let's figure out what's possible. All right. And I'm going to be
talking about things that we take for granted; I'm calling them privileges. Assumptions is another word that we use commonly. And I'm going to be highlighting a spectrum between the model-based and the model-free side of things. So we'll see how things are, where they are in the applications and the areas that I'm interested in. So a fair bit of
a reality check. Emo talked about
this in his talk. We roughly did this in 2012. And things work in real time. You probably saw movies where you can toss things on the ground. It beautifully gets up. It does a cartwheel, and we all clap and it's great. But if you do a reality check, this is what things look like today. And this is coming from Boston Dynamics, which is known for getting things to work. It's trying to place things. It topples. There are wheels. The cart falls down. Everything that can go bad will go bad. Basically, that's the sad reality. So what is the difference
between the two? And one of the key differences
between the two systems is on the left, when
we got these results, we have perfect access to
the dynamics of the system. On the right, you have no access
to the dynamics of the system. Let’s just relax
it a little bit. You know kinematics. You know somewhat
of the dynamics. But the reality is so far away
from what you can estimate, what you can simulate,
that things like these are inevitable. It will go wrong. OK, so my arguments
here will be trying to see how we take some of
these performances from here and try to move there. AUDIENCE: I just
want to contrast this to [INAUDIBLE] star, because
the [INAUDIBLE] to me was that if you are
computing online at a very high frequency,
even a very crude model can go a long way. So I was wondering, like
OK, these guys are really high-tech, but if you increase
the frequency more and maybe just throw in more computing,
it’s going to do it. VIKASH KUMAR: I’m with you. Let’s make that
argument together. AUDIENCE: Yeah. VIKASH KUMAR: All right. I'll follow along with that argument. Yeah. So I'm just going to make the point that this is the reality: even hardware experts, who know how to get things working, are still struggling with the best they can do. All right. And there is a reason, and we are going to ground this in actual scientific challenges and see how we can make progress along those dimensions. This was the picture
that Emo showed to motivate a lot of things. And a question I
want to ask here is, how can he even build
a model of this thing and then do all the non-linear trajectory optimization whose religion he preached here yesterday? Right? And the answer probably is not very clear. It goes somewhere along the direction of what Byron also mentioned: yes, a model is a very impressive and concise representation of information, but getting access to
it is incredibly hard. It’s a powerful piece,
but how do we build that? How do we get there? I think that’s where the key is. All right. So I’m going to build
up a couple of results and I’m going to put
them on a spectrum. On the right is where we have no access to the model; I'm marking that red. And on the left is where we have full access to the model. And we will see what works, what doesn't work, how far things are along the spectrum, as data points, not as proof that these are the ceilings. All right, keep that in mind, because these are very personal, related to the work that I've done. AUDIENCE: So when you
say full access to model, you mean like– VIKASH KUMAR: I have
a full simulator. AUDIENCE: And that
simulator is accurate? VIKASH KUMAR: So let’s just
say the system, I can query and the system will execute
as exactly the same. AUDIENCE: OK. VIKASH KUMAR: Yeah. AUDIENCE: Yeah. VIKASH KUMAR: So
the first one is you have full access
to the simulator. You can query the simulator
at any rate you want. I'm going to label something over here as "1x", as one example. That means, at execution time, how many samples do you query versus how many samples do you actually execute? So this can be executed in real time. I'll go a little bit
more into details of this in a little bit. And I think very well motivated
by Byron, online trajectory optimization. Optimal control has been
incredibly successful in these domains when you
have full access to the model. But roughly the way these settings work is there is a robot model, which is a fully global model of the dynamics. You can query it. It will tell you exactly what happens in the next state. There is an assumption of a state estimator, where you know where you are and what the state is. And then you present the goal using some intuitive cost functions. You hit it with an optimizer. Optimize really, really fast. You get behavior, and you iterate
over and over and over again. And you deploy these
things in an online model predictive control setting. I think we heard about that a lot. Roughly speaking, the way this works is you're at some point. You want to make a plan. You want to execute a little bit of it. And it's like, oh, the world is not exactly the way I thought it would be. Let's take that step and let's plan again. So you just kind of iteratively update. And there was a little bit of discussion that Byron put forward, like why does MPC work? I think roughly I agree. There is always a plan. You keep re-optimizing the plan. If the world behaves differently, so be it. You just iterate and go forward.
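To make the loop concrete, here is a minimal receding-horizon sketch. This is not the planner used in this work, which exploits gradients of the dynamics in an iLQG style; the sketch below uses simple random shooting for brevity, and `env`, `model`, and `cost` are assumed interfaces rather than anything from the talk.

```python
import numpy as np

def plan_actions(model, cost, state, horizon=20, n_candidates=64, act_dim=30):
    # Random-shooting stand-in for a real trajectory optimizer: sample action
    # sequences, roll each one through the model, keep the cheapest.
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, act_dim))
        s, total = state, 0.0
        for a in seq:
            s = model(s, a)            # assumed one-step dynamics: s' = f(s, a)
            total += cost(s, a)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq

def mpc_loop(env, model, cost, n_steps=1000):
    # Receding horizon: plan, execute only the first action, then replan
    # from whatever state the world actually ended up in.
    state = env.reset()                # assumed interface: reset/step return states
    for _ in range(n_steps):
        seq = plan_actions(model, cost, state)
        state = env.step(seq[0])
    return state
```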
Using this technique, we've been able to show very dynamic movements of this order. Roughly, the dimensionality of the system we are talking about is about 60. We have full access to the ground-truth state, full access to the dynamics
function over here. The objective here is
there is a hand, which is about 30 degrees of freedom. There is a six degree
of freedom object that is randomly falling into the scene. Your objective is to
stabilize the object. That’s what you
know, that the object has to be up there
in the air somewhere. And all the details
of the movements are automatically synthesized. You do not tell anything about
the details of the movement, so that’s the beautiful part. When you have access
to the dynamics, we have planners that
can do amazing things. And you will see, it fails. The object starts to slip and it kind of re-optimizes, which is like one
of the lessons, that fast re-optimization
can give you a lot of things. And another beautiful
part of this is it’s a reusable machine,
which is also something that we want to do, that we
want the machinery of framework and algorithm that can
solve multiple things. It can take the same machinery. We ask it to do typing,
just change the goal state to something that says typing. And you can, on the fly,
keep updating the goal. Since it’s constantly
replanning, it adapts. It kind of does
like nice things. You can apply perturbations,
replan, et cetera. Things are glorious, everything is happy. Robotics is solved. We can go. All right. All right, so I want to give my take on why I think MPC actually works and does make sense in the real world. A, when we are executing, we are
never executing to the optimum. So even if we know how to get to the optimum in a provable way, it probably doesn't make sense to spend that compute to go to the optimum. You're only going to execute for a little bit, so your best bet is to get a good plan in this small, narrow horizon. So focus your compute over there, get the best update you possibly can, and leave everything else to hope. But can we do better than hope? Yes, you can do better
than hope, which is what we call a value function. You put the value function at the end, and that kind of gives you the hope. You extrapolate based on the hope, but the thing is, we overfit to the current situation and just hope that that kind of carries forward. And then replanning does the magic for you.
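In symbols, this is just the usual finite-horizon objective with an approximate terminal value tacked onto the end (a generic statement of the idea, not the exact formulation used in this work):

\[
\min_{a_{t},\dots,a_{t+H-1}} \; \sum_{k=0}^{H-1} c(s_{t+k}, a_{t+k}) \;+\; \hat{V}(s_{t+H}),
\qquad s_{t+k+1} = f(s_{t+k}, a_{t+k}),
\]

where the horizon \(H\) is short, \(c\) is the cost, \(f\) is the dynamics, and \(\hat{V}\) stands in for everything beyond the horizon. The challenge, of course, is the quality of that terminal term: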
if the value function is not good enough, you start
getting stuck in local minima, but that’s the problem today. That is, I have no answer,
but that’s the reality, that you do get stuck
in local minima, and if you have a
better value function, you have a hope to
get out of the minima. So over here will be, just said,
that once you have full access to the dynamics,
we can use gradient based planner that
exploit the dynamics, give you really, really
fast recompute power so that you can execute
these kind of behaviors in real time, which I’m calling
one next number of samples. If you have access
to the dynamics, I also showed you
can generalize one setting, different settings,
throw different cost functions. That’s fine. I will probably
say that these kind of assumptions and settings are
only possible in simulation. That’s why I’m putting the
label simulation over there, that it works in simulation. If we want to run it in
the real system, the access to the dynamics function,
the true dynamics function, it’s not available. So the benefits that we get from
grading based on MPC planner cannot directly
transfer in real world. Questions? All right. So before filling
the spectrum, let’s take to the other side,
completely other side, where we have no access
to the dynamics function. What we are going
to do is we are going to do just
completely model free RL, and we are going to
build a policy that exactly solves the task
that we are trying to solve. So what will happen is we
will get what we want to do, but nothing more,
nothing less than that. And the best thing
we can hope over there is we generalize a
little bit on the thing that we want to do. You can’t do anything
more than that. And the beauty over there is
in these constrained settings, things do work on the hardware,
but the sample complexity is incredibly worse. You will see, while these
things work in one example, they take maybe
about 10 to bar 4. They’re slower 10 to bar
4, because you’re just sampling the system over
and over and over and over and that kind of
exhausts the system. I’m going to show you
some results, which have been presented
in the work over here. But I’m going to motivate
it, though, from a little bit different standpoint. If some of you were
here yesterday, I’m going to show you
this amazing craft, where you have a flat basin, where
you have no great information. You can’t really
move, and then there is a really nice
shape over there. If you push over there,
you can optimize. But the problem is that pushing
in this good space is hard. And if the space of
hand and manipulation is that high
dimensional, then it becomes incredibly challenging. So for most of
these things that we are going to talk
about right now, we solve this problem by adding
a little bit of demonstration that kind of focuses
you into the right area, and then we let the
optimization do the magic. OK. So those are the two
ingredients over there. If you ask if we just have
access to the demonstration to focus in the
right space, why not just do behavior cloning,
et cetera, et cetera, there is a small
restriction or budget I’m going to enforce
myself, that I’m going to use a really low
number of demonstrations. It’s 25 demonstrations. Somewhat manageable,
your budget is right, you can kind of get those
demonstrations easily, even if it takes a lot of work. You can just have five
people like hold the robot and do 25
demonstrations, in a way. But the problem of just
doing behavior cloning is that if it has to
work, it will take a lot of demonstration to work. If you execute,
the state mismatch is going to be too
much and you will run into compounding errors. This is never going to execute
in the real world properly. And if you ask the question,
why this is happening, is because behavior
cloning is just trying to solve the wrong
problem to begin with. It’s trying to produce the
optimal action, given the demo distribution, instead of the
induced state distribution. When you execute,
you don’t remain in the demo distribution. You go out of the distribution,
compounding errors, everything. You drop the object,
everything fails, you’re crazy. But what it does, it puts
you in that right space, where you can actually
focus your exploration and then fine-tune
from there properly. So what we say, what you do
is you do your policy update. The top is the policy
update and the bottom is you also add
a function, which asks you to stay close
to these demonstrations. And there is this
weighting function that you just anneal away over time, and that kind of
gets you started. So in the beginning,
the top policy update will not give you
any information because you’re so far from
doing anything meaningful. Only thing you can hope for
is staying close to your demo distribution because
it’s taking you to the right place
in the subspace. And as you start annealing
away the W term over there, then it says that, OK,
I’m in the right space. I can throw away
the demonstration. I don’t really care
what was proposed to me. But I can use the power
of like sampling, updates, policy updates to go beyond
what was demonstrated to me. And that will be critical, because you don't have a lot of demonstrations; 25 demonstrations mean almost nothing in these high dimensional spaces.
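Written out, the combined update is roughly of the following form. This is a hedged reconstruction in the spirit of the demo-augmented policy gradient line of work, not a verbatim statement of the objective used here:

\[
g_{\mathrm{aug}} \;=\; \sum_{(s,a)\in\rho_{\pi}} \nabla_{\theta}\log\pi_{\theta}(a \mid s)\, A^{\pi}(s,a)
\;+\; w(k)\sum_{(s,a)\in\rho_{D}} \nabla_{\theta}\log\pi_{\theta}(a \mid s),
\]

where \(\rho_{\pi}\) is the on-policy data, \(\rho_{D}\) is the small set of demonstrations, and the weight \(w(k)\), for example \(\lambda_{0}\lambda_{1}^{k}\), is annealed toward zero over iterations \(k\), so the demonstration term dominates early and the policy gradient takes over later.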
In these kinds of settings, we have been able to show fairly complex behaviors of this kind. And you don't only have
a manipulator, which is high degree of
freedom manipulator, but the manipulator is also
interacting with and understanding these various objects and
their target configurations. And dealing with free
object is somewhat subtle and it doesn’t really work. And it’s hard to understand why,
but there are some intuitions I can share, is
the base of the arm is working on a very large
space and the fingers are working on a
really small space. So there is this space
mismatch between the two, and the coordination
needs to happen is the arm really
needs to coordinate and this precise
action needs to go exactly the right time for all
these things to come together. Because when these
things are being trained, you will see the arm going the
right place with the finger not doing the right thing. When the finger trusts
to do the right thing, the arm is up in
the right place. These kind of local
minima are very dominant in these problems. So yeah, you can do
picking up of balls. We can use different kind
of tools, opening the door. The dynamics here is primarily
dependent on the door so you can’t really
overpower the door, so these are interesting things. Another interesting aspect
that falls out of this is roboticists has been trying
to get to human-like behaviors for a really, really long time. And they’re trying to enforce
it in a lot of different ways, velocity constraint, the
jerk-free, et cetera, et cetera. But these are all
approximations, or understanding of what we
think human-like means, right? But if you have these
25 demonstrations I’m talking about
that implicitly encode what being human-like means. So this kind of
falls out for free almost, that it gives you
the right inductive bias to be like humans. So if you see on
the right side, you will see these
artifacts are very real, if you do a lot of shaping
and get things to work. But yeah, it gets the job
done, but that artifact that we are looking for,
what it means to be gentle, nimble, being like
human, almost comes out for free, which people have been
trying for a really long time. The addition of
demonstration do tend to make the policy a
little bit more robust. I’m not going to
go into details, but we’ve checked some
robustness characteristics on the right. And the higher it
is, the better robust it is to both the axes, which
are different perturbations to the system. So now, I’m going
to start filling up the gap in the middle. But before we start,
any questions? All right, either everything
is clear or nothing. Correct. OK, so now, the
question is that I am very much proposing
that access to the model is critical and nice. And I’m trying to get more
and more access to the model, and I’m going to start
building this model, because access to the
model, automatically, or from a simulator is just not
the reality of the world today. OK. So let’s see how we can
build a model to the system. So in this case,
what we’re going to do is we’re not going
to focus on developing a full model, but we’re
going to focus on developing a model which is very
rudimentary in the sense that it’s trying
to exactly learn the model around the
things it’s trying to do. And so what we are
going to do is check out that robot box, which was
the full physics model. I’m going to start substituting
that box with a partial model, which is a time
varying linear model, parameterized directly
on the sensor data. So we are not talking about
joint angled positions or nice properties of
velocities, et cetera. Whatever is your
sensor, we’re just going to go for the sensor data. And that’s where we are
going to build the model. And we are going to adapt
as we go on the model. So very briefly, let’s
look at the right. What it’s trying to do is
if you have a policy, which is on the time
varying model, we are going to query the policy to
give us the actions that it wants to take at the moment. We are going to execute those
actions on the real system. As we have the actions, we
are going to update the model, and we are going to keep
doing it over and over. AUDIENCE: So is
this basically just like modeling an
application control? [INAUDIBLE] model. I mean, I know that
this model [INAUDIBLE] VIKASH KUMAR: Yes. Yes, yes, yes. AUDIENCE: OK. VIKASH KUMAR: So we
are trying to make a simple model
that we can quickly adapt with less model data. And the beautiful
thing over there is going to be
that we are picking the linear model in the way
that we can run gradient based optimization on it to actually
get the good effects of MPC on top over there. Essentially, everything
I said, except one part, it’s kind of important to
bound the model update by a KL. And this is because
you don’t get to query the real system quite a bit. So if you make
aggressive updates, there is a very good
chance that you’re going to fall really, really far off. And even if it is
one bad update, it’s going to take you
somewhere really bad, which you can’t recover. So if you bound
this with the KL– and the noisier your
system is, the tighter you should keep the bound. And if the system is
better and better behaved, you can relax and make
it bigger and bigger. So that's kind of the human intuition over there that governs how tight the bound should be.
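As a rough sketch of the model-fitting step, fitting a time-varying linear model directly on raw sensor vectors could look like the following, with one regression per timestep. This is illustrative only: the data layout is assumed, and the actual method also regularizes toward the previous model and bounds successive updates with the KL term just discussed.

```python
import numpy as np

def fit_tv_linear_model(rollouts, reg=1e-3):
    # rollouts: list of dicts with "obs" of shape (T+1, obs_dim) and
    # "act" of shape (T, act_dim) -- raw sensor readings and executed
    # actions (an assumed format, not the one from the talk).
    T = rollouts[0]["act"].shape[0]
    models = []
    for t in range(T):
        X = np.array([np.concatenate([r["obs"][t], r["act"][t], [1.0]])
                      for r in rollouts])                  # inputs  [s_t, a_t, 1]
        Y = np.array([r["obs"][t + 1] for r in rollouts])  # targets s_{t+1}
        # Ridge regression: W maps [s_t, a_t, 1] -> s_{t+1}, i.e. [A_t | B_t | c_t].
        W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
        models.append(W.T)
    return models
```

Because each timestep's model is linear, a gradient-based optimizer in the iLQG family can be run directly on top of it, which is how the MPC-style benefits come back.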
Let's start looking at a demo of what we can do over here. So again, with a human hand-shaped robot, what we are trying to do is rotate a free-floating object in space along
different axes over there, x, y, z, and you
can see over there. And most of these behaviors
are learned with 60 trials. So we are not talking
about thousands, or even the millions, thousands. We are talking about
just 60 trials. And agree that these
things are very simple. We are trying to do simple
things, like rotation. But if you see the
kind of behaviors that need to execute in order
to get those things right, it’s quite complex. The pinky has to push. The middle finger
it has to pivot. The thumb is to get in and
out of the whole system. The whole system
has to stabilize, because initially,
what you’ll see, that it will just
drop the whole thing. It doesn’t even understand
how to stabilize the system. It’ll start dropping the system. And it is general,
again, in the same sense that you’re building a model. Given the model, you
can do a lot of things. And since you can build a
model easily, benefits of MPC, you can repurpose
all of those things into different behaviors. Over here, each iteration is
roughly like five rollouts. So about 12 is about 60
trials that we have done. So what we would be
able to show right now is an extremely sample
efficient process. It can work with
contact dynamics, which presents a lot of
constrained in the actual dynamics. So these are not
smooth dynamics. And it does work
in a real system. But there is an
interesting thing that is going on over here,
that it did converge over here, but what did happen here? So what we ended up
doing in this case is you start the system
at the same place initially, because
it’s not good. So you want to help the system
by starting the system, more or less, in the
same place, and it’s updating its own estimate
so it iteratively gets better and better. But the moment you think
that it has gotten better, now you start
destabilizing the system. You don’t reset, or you
reset in a very noisy way, and those perturbations you
see down there are exactly– it’s trying to adapt the learned
behavior, the learned policy to the real world realities,
that the system is never going to stay in the same place. You have to adapt based on that. And it does learn a
really robust policy, but around the task that
we are trying to do. It doesn’t generalize
to a different task. Sorry, yes. AUDIENCE: So how did you
learn the local models? Like do we learn them
for every timestamp, or? VIKASH KUMAR: Correct. So basically, we start
with the uniform policy, the random perturbations. There is a task. So given the task,
given the model, you optimize like
this with iLQG, iLQR. It can do all those things. Pick the actions. Execute those actions. You have a new model, and
you update based on the top. Yeah. AUDIENCE: [INAUDIBLE] VIKASH KUMAR: Yes. AUDIENCE: [INAUDIBLE] VIKASH KUMAR: Correct. Yes. AUDIENCE: OK. VIKASH KUMAR: So we did
12 updates over here. Each update collected about
five trajectories, in this case. AUDIENCE: And once this
started policy, where you– VIKASH KUMAR: Just
this random policy. So like yeah, sorry,
uniform policy. Yeah. AUDIENCE: But you drop– you just keep dropping it? VIKASH KUMAR: It just
keeps dropping, yes. Initially, you
will stay up there. And see, default
policy, it just flops, because the weight of the
system, weight of the thing, is too big. So it starts dropping. It tries to move
and it just falls. AUDIENCE: And you
have starting to– VIKASH KUMAR: There is no
simulator involved in it. So directly. On the real system, from
step 0, the first data was collected by
a random policy, putting the object on the hand. And it’s now trying to do– AUDIENCE: Do you have part of
the model, like the actuator and types of things? VIKASH KUMAR: No, nothing. Nothing. The input to the system
is completely raw, potentiometer values
from the joint angles. And action space is a little
bit more convoluted over here because we are going
through pneumatics. So there are pneumatic channel. We don’t even have a
model for pneumatics. We are directly
controlling pneumatics at the voltage at the valves. AUDIENCE: And what’s the
size of the [INAUDIBLE]?? VIKASH KUMAR: 100 to 112. So we have all the joint
angles, joint velocities, the pneumatic pressure, because
that’s a third order system so you’ve got to add that
pressure into the system. And the first order, we
said that, and information about the object. AUDIENCE: [INAUDIBLE] VIKASH KUMAR: Yes. Yes. AUDIENCE: There
is no [INAUDIBLE].. VIKASH KUMAR: No. AUDIENCE: Figures. VIKASH KUMAR: I
can discuss that. There is a very strong
belief that it will help. I do believe it will
help, but I have not been able to show that
it actually helps. And there are hypotheses. I’m happy to share that after. AUDIENCE: Sorry, so what’s
the variables for the object? So the angles– VIKASH KUMAR: So the
object is just a free– it knows the position
of the controls. AUDIENCE: OK. OK. OK. VIKASH KUMAR: Yeah. So it’s not sensing the
object is a [INAUDIBLE].. AUDIENCE: And then you just
have some video to calculate– VIKASH KUMAR: Correct. So this isn’t a
motion capture set up. In this case, positions that
are orientations of objects. AUDIENCE: In what sense is
the regression [INAUDIBLE]?? VIKASH KUMAR: In the sense
that it’s optimizing exactly for the tasks. So the way the
policy and the model updates, that it
actually updates when you solve the task, and it
just forgets everything else. AUDIENCE: Oh. But it’s a full LQR model. VIKASH KUMAR: Correct. But local in the
sense of trajectory that it is planning
to get the task done. AUDIENCE: OK. Not per state. VIKASH KUMAR: No, no, no, no. Yeah, like the
whole state vector. Yeah. OK, so in this case,
what we’re able to– In this case, what we did,
we built a local model. We learned that model. And we used the good
properties of the model over here, which
are the gradients, and we did it
better than what we did in the model-free setting. It's still not real
time, because we have to go through the 60
durations to get more data. There is no generalization,
in this case, but things are robust for
the task that you are doing. We didn’t get any
generalization because we had no idea what happened
outside the task distribution. It does work on the hardware. OK, so we are going
to move a little bit on this side of the
system now, that we are going to start learning not
local, but partial model stuff. OK, and let’s see what
the details over here are. So the details are somewhat
weird because this is under review. And since I am no
longer in academia, Google lawyers tell me that
I can’t share the details. [LAUGHS] So in assumption
that no one is going to tell Googlers
anything about this, I’m going to give
you some details, OK? Hopefully, no Google
people are here. So I’m not going to
give full details here, but enough to actually
make my point. But if there is something
not clear, please ask me offline and I’ll be
able to share some more. So what we are trying
to do here is we are not trying to build a global model. We are trying to
build partial models. What are these partial models? We don’t know, but
we are going to build an ensemble of partial models. And it’s important
why it’s an ensemble. I’m going to tell you
a little bit about it. Each of these models are
presented by a neural network. We have tried LSTMs. Have not been able– it works. Have not been able
to show that they are a little bit better, or even
a little bit better than the before ones. But I do believe that we
haven’t pushed it hard enough. So I think we need
to do more there. So what we are going to do
here is the same settings as the previous one. We can only interact with the
system on a reasonable budget. We can’t track the
system way too much. What we are going
to do is, based on this collection
of the data set that we get from the
attraction, we’re just going to start
feeding an ensemble. These ensembles are different
ensembles with different seeds, so they will behave a
little bit differently. And it’s important
here to build ensemble because what happens is as you
update the model of the system, they do not capture
the reality very well, because you don’t have
that much data to capture the reality in these
high dimensional spaces. So we are using an ensemble
as a proxy of how good the real world– how good
understanding of the real world we have. If all the ensembles
tends to give you the right update,
the same update, that means that we have captured
the system well over there and we are going to
trust that update. But if all the
ensembles tend to differ in what the predictions
are, then we are going to say that
we don’t understand the system at that
point, and we are going to, in next
iteration, prioritize getting those samples better. So that’s our metric of where
to sample, how to sample. In previous one,
the way we sampled was just directly
shot for the policy. Where policy was trying to
take you, we went there. But in this case, we have
a weighted way of sampling, where we are asking the policy,
what is the task update? Push me in that direction. I want to build a
model over there. But we also want to say
that if I don’t understand the system in some place, I
do want to sample over there to build an understanding of the
system in that place as well. And that's how the whole thing goes.
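A small sketch of how that proxy could be computed and used is below. The interfaces are assumed (each model is a callable one-step predictor), and the exact weighting used in the actual work is not something that can be stated here.

```python
import numpy as np

def disagreement(models, state, action):
    # Spread of the ensemble's one-step predictions: a proxy for how poorly
    # the real system is understood at (state, action).
    preds = np.stack([m(state, action) for m in models])   # (n_models, obs_dim)
    return preds.std(axis=0).mean()

def sampling_score(models, cost, state, action, beta=1.0):
    # Weighted objective: make task progress under the mean prediction, plus
    # a bonus for visiting regions where the ensemble members disagree.
    mean_next = np.stack([m(state, action) for m in models]).mean(axis=0)
    return -cost(mean_next, action) + beta * disagreement(models, state, action)
```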
And in this case, we are able to show that we can learn models that generalize. It's not really task-specific,
but it generalizes roughly in the same family of tasks. Let’s play some videos and I can
share some insights over here. So here, we are trying to
turn this three-finger valve– oh my goodness, the picture– is to the target, and that
was marked by the green spot. In the next one, we are
trying to rotate the cube in different configurations. The target is
marked on the right. So for the rotation task, if
you try to rotate and learn a model for like 90 degrees,
180 degrees, you learn the model and it generalizes to other
parts of the state space, because it’s trying to learn
a partial model of what needs to rotate. It’s not trying to learn a
trajectory, in this case. So we are getting access
to the dynamics model that you can retarget by
changing a cost function later, as long as your state
visitations roughly remains the same. If the state visitation dips
too much, you can’t do much. All right. So in the case of the
cube that I showed, that we did a lot of
side turns, this turn, that turn, and it generalizes
to the other turns that we have never
showed data about. OK. And the beauty of this
is that now, there’s a lot of abolitions,
which I’m not going to go into
details over here, but the important part is
the green lines on all three of them, that it is ridiculously
sample efficient for higher order, or more
difficult tasks, a lot of the sample on policy,
as well as off policy, methods that are
available today. It doesn’t work for everything
that on policy, off policy can solve, but at least
for the settings that we have been trying, it is
ridiculously sample efficient. And it does give you some
generalization in roughly the same task family. So yeah, I’m sorry there’s not
a lot of detail I can share, but if there are some
high level questions, I can try to take
that at the moment. Yes? AUDIENCE: This idea of learning,
generally, how to rotate, as opposed to
learning a trajectory, does it go both ways
or does it go one way? For instance, if I’m learning
how to rotate a cube, I have to learn how to rotate
on a lot of dimensions. So then maybe if the next
task was to just spin a pen, I can do that very well. But would it not go in
the other direction? If I just learned
how to rotate a pen, I might have trouble
learning a cube. Does it make sense to
start with a harder task, in terms of rotation, and
then it generalizes forward? VIKASH KUMAR: It
does make sense. So here, we are not
really giving the detail of what the object is. So if you are learning this
model in a full point cloud space, then it will
actually understand. But as far, what
we have tried is we have put a tracker on top,
and then we have learned just in the state space. And we haven’t given
geometric information of the object, which
gets implicitly embedded in the model at the moment. But if you learn
it more generally, I do believe it will
actually start transferring. All right. And let me bring in
a little bit more towards learning a
more global model, and this is somewhat that
came in discussions on Monday that I eluded to, so I ended
up adding this thing in there. That is a very interesting
insight over here, that we are going to
learn skills over here, but I will explain how
it falls in the model spectrum over here. So we are all going
to learn skills. There are a lot
of different ways people have taught about skill. And we are picking a different
intuition about the skill. So let me start from there,
that the way people have looked into skill is you collect a
lot of data and you do PCA, and those PCA dimensions
are kind of somewhat one way of looking into skills. Dimensionality
reduction is another way of looking into skills. The way we are
looking into skills is not just in state
dimension, but we are looking into skills into
state cross time dimension, OK? And the way we are calling
something as a skill is if you can predict
yourself, you are a skill. So predictability in
the dynamics is a skill. That’s what you call it. It’s pretty bizarre,
but a lot of good things come out of this. All right. So the intuition is that
if there is a skill, the state distribution, the
state action distribution, is somewhat constrained in this. You can learn this more easily,
but the problem becomes now that there could be so
many random things that you can learn over there, how
do you make sure that you learn the right skills? That’s basically the challenge. If you can somehow magically
find the right skills, then you can actually these
things will be useful. And there are
prior work on this, like Dion and stuff,
if someone is familiar. The differences there are
once you learn those skills, you don’t know what to
do with those skills. We are going to learn here,
the dynamics functions, you can repurpose these skills. Sorry, yes? AUDIENCE: Isn’t one
of the core issues here that my sensors are
vision and I shut my eyes, then I’m predicting
my feature state. And I’m sitting still
and not doing anything, and that’s not a
very good skill. VIKASH KUMAR: You are
going to learn that skills. AUDIENCE: All right. VIKASH KUMAR: Yeah, you’re
going to learn that skills. Intuition about
positives of this skill is think of someone juggling. Even if you close
their eyes, they have skills that they can use
to roughly juggle the right way. Not exactly. AUDIENCE: I guess the question
is there’s so many, as you say, so many different skills to
learn, how do you encourage the system to learn at once? VIKASH KUMAR: Yeah, we do end up
learning a lot of that skills. AUDIENCE: OK. VIKASH KUMAR: It is coming,
falling out automatically. That’s why we don’t care. But you’re exactly
right, that we end up learning skills
which are really, really bad, has nothing
to do with being useful. OK. And one of the ones– AUDIENCE: [INAUDIBLE] the
top doesn’t care about that. Just sort them out? VIKASH KUMAR: Yes. So one of the skill that you
will actually go down into is just staying still,
because staying still is the most predictable
thing you can do, OK? So without going too much
into the math over here, I’m going to give
a general idea. We are parameterizing
skills by Z. We are parameterizing
skills by Z. The dynamics function
of the skill dynamics, what we are calling
skill dynamics, is given a state
and the skill, we can predict the
state transition. That’s what we are
calling skill dynamics. And the beautiful part of
using this mutual information objective that has a
lot of details in there is that the skill
and the control, the controller of the skill,
both falls out together. That means given a
state and the skill, you know what actions to take. And roughly, things
go down this way, that what we are trying to do
is if given a state and a skill, you want to maximize
the predictability of the next state. At the same time, what
you are going to do is you want to say that
the next state, predictions given a different skill,
should be different. So this is the diversity term
and this is the predictability term. So you are incentivizing different skills to be different, and within a skill, you are incentivizing the predictions to be better.
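Since the exact objective can't be reproduced here, its flavor can be illustrated with the published skill-discovery form (as in DADS), where the intrinsic reward for a skill \(z\) is roughly

\[
r_{z}(s, a, s') \;=\; \log \frac{q_{\theta}(s' \mid s, z)}{\frac{1}{L}\sum_{i=1}^{L} q_{\theta}(s' \mid s, z_{i})},
\qquad z_{i} \sim p(z),
\]

with \(q_{\theta}\) the skill dynamics: the numerator rewards transitions that are predictable under the current skill, and the denominator pushes different skills toward different transitions.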
Yes? AUDIENCE: So mathematically, is it a pause? VIKASH KUMAR: No. The Z is basically a label. That's a skill. In a discrete setting, Z1, Z2, Z3 are different skills. In a continuous setting, Z equals 1.1 and Z equals 1.2 are different skills. So that's my latent
representation. AUDIENCE: It’s a
random variable coupled with a state transition? VIKASH KUMAR: Correct. And we have shown that it works
both on discrete, when Z are discrete, and continuous, OK? So here, if you ask, what
do they end up learning? And then you will
see that they end up learning pretty bizarre things. It doesn’t make sense
for a lot of things. A lot of things, they will just
like sleep and not do anything. All right. So they do learn,
like you were saying, they do learn a lot
of weird things. And here are some pictorial
representation of a lot of them overlaid on each other. But if you look into
what they are learning, we started looking into
some of the feature space. So here’s the
orientation of that end. Like there are two dimensions
even in Z2, in this case. You will see that there is some
structure they did actually learn. In these, a couple of
good feature space. So they tend to do good. But yeah, a lot of them are like
it doesn’t make sense to us. But that’s fine, because these
are totally unsupervised. We haven’t done anything. It’s just trying to go
on offline data that was collected from somewhere. We don’t care. It outperforms a lot
of model-based planning algorithms. And there are a lot of
model-based optimizations. Here are two variants,
discrete and continuous. And then we just showed that for
a lot of goal-reaching tasks, like these, where
we are using goals. And then it’s basically
searching over the skill space and choosing the
skill that makes good progress towards the goal. And then like this, the
MPC style thing again. All right, so I’m going to wrap
up pretty quickly over here, and then we’ll get some
demos going over there. So what has happened right now? So we hit a
spectrum, full access to the dynamics, no
access to the dynamics. We were able to exploit a lot
of gradient-based information when you had access
to the dynamics. We got good access
to the dynamics. A lot of great
input-based information gave us good sample
complexity, but I haven’t talked about the
sample complexity in these two settings not being that great. But we did have the skills. We did have access
to the dynamics. So this is something
which is bizarre, but it kind of does make sense. Since you are
learning the dynamics model in a more generic way,
the first order information of these dynamics
are not preserved. So the moment you start using
gradient-based planners, they actually tend to
take you in random ways. And shooting methods
tend to work better than any of the gradient-based
planters at the moment. But this is fixable, I think,
that when we learn the dynamics model, we have to
kind of somehow figure out how to
preserve the first order, the second order of
information, which I think is a good next step,
but I don’t think it is that easy at the moment. But we are looking into it. All right, cool. So where things stand together
if the model-based and model free really depends
on where you are. I heard a lot of great
insights over the last two days into that if you really take
things to the limit infinitely, there are great things
that we can prove and show in some of the system. But in true reality,
what I am going to argue, that it does matter where
in the spectrum you are. If you are here in the
spectrum, the choices are very biased towards
one side of settings. If you are on the other
side of the spectrum, the choices are very
different on the other side. But in the middle, I think there
is a good class of algorithms over there that can be
designed to take insights from both places. And this is where a
lot of algorithms, like POLO, for example, that
was taught yesterday falls. You can try to combine the
benefits of partial model, as well as model free
settings, and then you get more accelerated
methods in that intersection that I mentioned. Yes? AUDIENCE: So what are the things
that motivate using its eyes and moving the [INAUDIBLE]? Because the model is
not accurate enough? Because of parameters? VIKASH KUMAR: A
couple of reasons. One of them is the model
is not accurate enough. But if you think of what the
policy optimization is doing, it doesn’t care
about anything else but to optimize for the policy. It can overfit to
the performance, but if you’re going for
a model-based method, you’re trying to get
generalization, fit the model properly, and based on the best
understanding of the model, you try to get
better information. So there is a trade-off
that the model-based method has to play which the
model free don’t have to. They can just overfit
to performance. Yeah, that’s a great question. OK, so this, I plugged
into one direction of these things, which I call
privileges or [INAUDIBLE],, which was the dynamics. Oh my goodness. It’s not visible, but this
is a very interesting craft that you can ignore. But there are lots of other
things that start to matter. And when we start to put
everything into perspective and start plotting
your algorithm, how they lie in these curves,
then the area under the curve starts to matter. And for real system, that is the
metric that people care about. That is the way the real
systems move forward. So I was having
this conversation with Miriam and Alec
yesterday, that as someone who wants to push on what
is the best or the highest level of things we can solve, we
often come into the constraint that one of these
axes are super great. But because of the other
axes being super bad, the area becomes large. So it’s important to
think of the trade-off and in a budgeted and
constrained settings, if we can get
better algorithms, I think that will go a long way. That’s my key message over here. Oh yeah, by the way, I want to
show how resets, for example, are taken as granted. And it takes up
either four months for you to design or reset
mechanisms or a grad student to spend four months to
do the reset for you. And this was me, as well, when
I was doing the cube rotation thing. So yeah, a lot of these things
are the real bottlenecks, now that the algorithms are
trying to scale and actually go out in the real world. I think a lot of the
assumptions are really where things are weak, and
we need to hit them hard. OK, so the demo time, I’m
going to give a short spiel on what the demo is,
why it is important. Like I said, there is
these huge spectrum and axes on which we are trying
to minimize the area of all the algorithms, right? But it becomes
important to actually be able to probe those
settings equally well. And this is not where I’m
trying to solve DARPA Robotics Challenge. This is not what I’m trying
to do Atlas backflip. This is basically
think of pure science and trying to minimize
the area of this curve. So in those settings, I
think I take intuition from the domain of physics,
where what they did was design something that
was called an optical bench. Most of the Hubble telescope,
electron microscope, et cetera, those intuitions were
first tested over here and they led to
bigger developments when people scaled and found
the right fundamentals. This is what I think is
somewhat missing from robotics, right, that everyone is
trying to do a lot of things where things are different. And there are a
couple of platforms on which I’m trying
to build my intuition, and I’m going to share
these things with you guys. And starting today, it’s
going to be public and online, so feel free to
share with everyone. These are robotics
benchmarks for learning. These are benchmarks. There are some important,
interesting things over there as well. So let me try to just
show you really quickly. There are a few
hardware platforms, which are very modular. The interesting things
about these platforms are that these are modules
that you can reconfigure in many ways you want. Here are the three settings
I’ve shown you guys. But if you have
different applications, just go for a
different application, just the same thing
is going to work for you, in terms of
software, hardware, et cetera. It’s really easy
to reconfigure all of those things that kind of
gives you the optical bench illusion that take lens out, put
the lens in kind of a setting. There are two platforms. One of them is called the Clock. Another one is called the Kitty. The good thing
about these things are that these are modular. You can compose any number
of degrees of freedom. These are super cheap to
build, so advisors, please spend some money. Give your students one of these. $3,500, $3,200, great things. In my team, every one in my
team has one on their desk. And the kind of experiments
they run is bizarre. Like you will have never
thought that anyone who will ever run experiments
are running experiments right now. One of them is Sean’s
student, Irvin. If you ever ask Sean Irvin
will touch a hardware– Sen, will Irvin ever
touch a hardware? AUDIENCE: [INAUDIBLE] now. VIKASH KUMAR: Because these
things are easy to use, and that’s how it easy to use. The best part is that the
price point scales linearly with the degrees of freedom. So you can try to build
a more expensive system. And the reason I
went down this route is I work with a lot of high
dimensional space systems. And these high
dimensional systems tend to break frequently. The shadow hand I
fix every two days. It is ridiculously
expensive, and I can’t really send it back to
London every day, so I have to build
these systems. To the day, these systems
have over 15,000 hours recorded on them
without breakage. So that’s great. I’m going to quickly gloss
over some results on that. Please refer to other videos
and papers associated with them. They scale incredibly well. You can pack them in a
backpack, take it with you. Here are some flexible
object manipulation you can do with them. Here are some disturbance
interesections. You can play evil. They know how to take
care of themselves. You can take one
of the systems out. You know what they can do. Things are pretty
scalable, paralyzable. We are learning multi-task
over here, different objects. They are setting
different settings, and then they’re
learning together, learn how to turn, et cetera. Here is the walking
robot that can walk over anything, literally. You give it on the way. It gets caught in the wires. It can go over all of those
things, push things on the way. We have been walking it
outside, indoors, you name it. Here is a paper that
we recently published. We let a couple of
them out in the wild. They learned how to
interact with objects. If you give big
objects, they learn how to coordinate
with themselves, and do good things
and better things by trying to figure out
how to make both of them move together. There are some benchmarks
for you to get started. Good software. Everything for you, latest. Get ready and go in an hour. OK, so without getting too
much boggled down here, just a last point that
like, yes, these systems have resulted in a good
collection of results so far from a diverse group. So it’s kind of making the point
that they’re widely useful, which you can take a look into. Cool, and then
let’s do the demos. All right, so in the first
one, what is going to happen is the top one is
a target, which you can start picking yourself. If someone wants to
come pick a target they can definitely come
and pick the target. And the policy is
trained in a way to start matching the
target the best it can. The bottom axis is
really, really slippery. So if you just do something
simple, it just follows. All right. So roughly, you get the idea
of what is going on over here. All right. So I’m going to be a
little bit quick over here and go to the next one. I can play these more later,
if you want to come up after. Here, you just randomly let the
top thing pick a random number generator on the top. It’s trying to pick a
velocity and a target. And the bottom one is
trying to actually match the best it could actually do. And we can hit it
with any algorithm. You can study
different algorithms. It is robust,
self-sufficient, independent, that you can run most of
these things remotely. We do it on everything remotely. All right, so that’s a
manipulation platform. We just took the same thing,
put four of them down there, and that’s a locomotion
platform now. So I’m going to make a fool
of myself here, but be with me because there is no
tracking in that room. So it is completely
running blind, but at least I can probably
show you is the policies that we have been able to learn on
these simple systems are robust enough that it can at
least stay upright, handle all of these things by itself. So yeah, it doesn’t
have any tracking so it doesn’t know
where the target is. I’ve roughly given a target,
which is in the front. So that’s a reset. That’s a reset. It didn’t fall down. It’s a reset. It’s asking you to go again. So you see the kind of policies
and behaviors that it exhibits. No, no, don’t pull the wire. AUDIENCE: Put a leash. VIKASH KUMAR: Yeah, I
should push the leash. I tried to put a reset so
it doesn’t go far enough. But look. You saw, probably, like
here’s a different friction, completely different
friction with those and it’s still
understands how to handle those things properly. And these are the
system that I showed how to coordinate
videos, where we have like multiple of these systems. We have made a
small house for them now, where they kind of live
together, explore the world, figure out where the kitchen is. The best thing is call it the
Kitty, so it’s a cat robot. But it doesn’t poop, so you
don’t have to do anything to clean up the poop. This is great. And with that, I think
I’ll just stop over here. I think I’m a little overtime. But feel free to come
by later and then I’ll be happy to answer
more questions. PRESENTER: Thank you very much. [APPLAUSE] VIKASH KUMAR: Yes? Yes? AUDIENCE: All right. Actually, I wanted to ask
you this last week as well. I’m really curious about what
your thoughts are about that. So one of the things
that you mentioned that’s important to kind
of look at is reset. And it seems like these
benchmarking platforms are designed in such a way
that if I make mistakes or if I perturb the
system, a small amount is basically stable. It doesn’t necessarily
trigger a reset, or like resetting
seems very, very easy. And so I’ll try to figure out
how to say my broader point. It’s not totally
formalized in my head yet. But the thing that I’m a
little bit concerned about is a lot of the problems which
are the hardest in robotics are ones where a small perturbation
to the decision that you make results in, potentially,
a catastrophic failure. VIKASH KUMAR: Correct. AUDIENCE: So for
example, if, instead of having a four-legged robot,
you had like a biped walker, it’s a lot harder to
reset that system, and a small mistake causes it to
fall over and require a reset. VIKASH KUMAR: Yes. AUDIENCE: Another example
is like the rally car. If you slip and you
take the wrong action, reset might require
rebuilding the car. VIKASH KUMAR: Correct, yes. AUDIENCE: So what I’m
wondering is do you think that there is a
certain type of problems that these sorts
of benchmarks, it makes sense for machine
learning in that it’s easy to collect lots of data? And are those problems
really the ones which are the most
interesting part of robotics? VIKASH KUMAR: So I’ll
tell you my take. I don’t advocate a
right answer here. My take is that more you have
ability to ask those questions, the better off you are. OK, I didn’t show those
results today at the moment, but we are running these robots
in a non-episodic settings, 24/7. They fail. When you start executing them,
like they are floppy systems. They don’t know how to stand up. You saw trained policy. It will crash down. So it starts by doing
catastrophic failures, and then builds up from there. So there is details
into this, but feel free to browse around more. But there are videos
over here, where we do learn a lot
of these getting up policies along the way as well. So it does start
from random places and tries to get up first. So it is trying to
handle those questions. Can I actually design
a reset policy myself? Yes. I can cheat the system,
but I don’t care about doing Atlas backflip. I care about minimizing the
area on that giant spectrum that I show. And if you are careful
and you’re basically being optimistic, you can
carve these problems out. That’s what we are trying to do. We will be overfitting to
this, but at this point, overfitting to the real
world is a good thing to do, is my argument. Yes, we will scale. We will build a Hubble telescope
from there, but it’s not today. We’re going to ask some
more questions today, and then, hopefully,
build from there. PRESENTER: OK,
thank you very much. Let’s go to break until
11:05, so 15 minutes. [APPLAUSE]
