Robust, Visual-Inertial State Estimation: from Frame-based to Event-based Cameras


Thank you, everybody, for coming here this morning. It gives me great pleasure to have Davide here, who is a professor at the University of Zurich as well as a professor at ETH Zurich. We would be here for a long time if I started listing all of the awards and accomplishments Davide has, so I will not take up all of his talk time. Just a couple of things: Davide has done amazing work in visual-inertial SLAM and in control for drones, and he has been doing some pioneering work with event cameras. I personally have been using SVO and SVO 2.0 with great success, so thanks for saving me a lot of time. But without further ado, I'll let Davide speak.
>>Thank you for the nice introduction, and for inviting me here. It is really a pleasure to be here, and it is also my first time here. The title of this talk is Robust Visual-Inertial State Estimation: from Frame-based to Event-based Cameras. And this is a picture
of my group. We are a group of 13 researchers, with mechanical engineering, electrical engineering, and computer science backgrounds, and the lab is fully funded by third-party funds. We recently got a DARPA project about fast navigation with UAVs, as well as some grants from Intel, and a few awards. What do we do? As was said, what we try to do is basically apply advanced computer vision algorithms to the autonomous navigation of drones. My PhD was about visual SLAM for cars, but then I found it more interesting to work on drones, because they are actually more challenging than cars: they move in six degrees of freedom, and they have dynamic constraints that make them very difficult to control. There are also a lot of challenges, especially for perception. That makes drones a very intriguing platform for me to work with. So we have four research
targets in the group, which are the following. The first is visual state estimation, which will be the main topic of this talk. Then we try to apply this to the autonomous navigation of drones; here you can see a drone that is able to perform aggressive maneuvers autonomously, using only onboard perception and computing. We also work on deep-learning-based navigation: at the moment we are basically doing imitation learning in order to accomplish certain tasks. In this case, it was about following a trail in a forest autonomously, using a forward-looking monocular camera. And then we have another area of research called event-based vision, which is about using a new class of cameras, called event cameras, that are almost a million times faster than standard cameras. Half of this talk will be devoted to this new research field, because I consider it revolutionary: I think it will potentially have a lot of impact on computer vision and robotics. I have a strong motivation to build
machines that can be used in search-and-rescue scenarios. What we try to do in my lab is to work on the autonomous navigation of drones, because what we want is that one day these machines can be used to search for survivors after an earthquake or a tsunami, for example. In environments like this one, you cannot fly based on GPS, because there is no GPS indoors, so you have to rely fully on onboard sensing and computing. But you also have to be fast, because in the end time is critical for the survivors. If you want me to give you a vision of what I would like to have ten years from now, this is it: an autonomous machine that can fly so agilely that it almost looks like a bird, or even better than a bird. The problem is that in order
to have a machine that is autonomous, agile, and lightweight at the same time, you have to solve a lot of challenges, not only in perception but also in control. And by the way, the drones that you see in this video commercial by Lexus were mostly computer generated, while a few of them were controlled by a motion capture system. The company that actually made this video was KMel Robotics, which was bought by Qualcomm one year later and is now behind Qualcomm's Snapdragon Flight board, by the way. So in order to achieve my dream,
of course, one of the biggest challenges
that you have to address is: how do you localize without GPS? There are two ways to localize without GPS. Either you rebuild a GPS indoors, which is what a motion capture system does, and then you are able to perform agile maneuvers like the ones that Vijay Kumar's lab achieved around 2011; I was a postdoc at the time. This was the work of Daniel Mellinger and Nathan Michael about passing through narrow gaps and about fast trajectory generation. The alternative way is to use an onboard camera and onboard computing, and then rely on visual SLAM and visual-inertial algorithms for localization and mapping. The difference between these two kinds of approaches is that the first one basically makes the robot blind, because it relies on external eyes and external computing, while the second one really makes the robot able to see the world. But now, in order to achieve
my vision, there are certain challenges: I think robot vision is still quite far from what we would like it to do. Personally, I think perception algorithms are nowadays mature, but they are not robust. Think, for example, of a motion capture system: in that case you have a localization accuracy that is constant throughout the whole workspace, typically around one cubic millimeter wherever you are. By contrast, when you use onboard sensing, the accuracy typically decreases with the distance from the scene, and it also depends on the texture. So you have to take these two things into account: distance and texture. Another problem is that algorithms and sensors still have big latencies, between 50 and 200 milliseconds, which puts a hard bound on the maximum agility of the platform. You cannot move that fast because you have a high latency, so we need faster sensors and faster algorithms, and I will show you later that event cameras allow you to basically tear down some of these barriers. Another big challenge is that
control and perception have mostly been considered separately by the robotics and computer vision communities. If you think about SLAM, SLAM has kept the robotics community busy for over 20 years, trying to create visually appealing maps for navigation, but very little work has been done on controlling the robot, for example, to improve the quality of SLAM while performing a certain task. Most of the work has been done for the sake of SLAM itself. Instead, when you fly a robot, or move a robot in general, and the robot is vision controlled, you should ask how to move the robot: for example, can you move it in a way that maximizes the texture in its field of view? Because if there is no texture, you cannot localize. These things actually make a lot of sense for robots. Another problem is that,
as you know very well, computer vision does not work in low texture. There are also problems when you have scenes characterized by a high dynamic range, and also when there is motion blur. In these cases, of course, you can use complementary sensors like an IMU, which copes well with high-speed motion and is not affected by high dynamic range. But IMUs drift, so it would be an advantage to have exteroceptive sensors that can themselves cope with HDR and motion blur. For robotics, HDR is a quite difficult problem. Actually, HDR was also a cause of the death of the driver in the Tesla accident two years ago: a white truck suddenly appeared in front of the radar and the Mobileye camera of the car, and, because it was against a bright sky, it was basically invisible to the camera. Another problem that I just mentioned is motion blur, and the only way to tackle motion blur at the moment is to try to model the blur explicitly; see, for example, prior work on modeling motion blur in dense tracking. So we start now by talking about
robust visual-inertial state estimation, and some of these ideas, which are really more like opinions, can also be found in these three papers. The most recent one is the one with [INAUDIBLE], Neira, Leonard, and Reid on the past, present, and future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age, which was recently published in the Transactions on Robotics. What you will see now is more or less standard visual odometry, monocular visual odometry. This is keyframe-based visual odometry, and it was pioneered by George, here in this room. Basically, this is a very
high-level recap of the idea. If you use a single camera, how do you keep track of its motion and simultaneously build a map of the environment? One idea is to first triangulate points, using for example the five-point algorithm, and then localize every further frame with respect to the existing point cloud. This works well as long as you remain in a small neighborhood of the existing point cloud; when you want to extend the map, what you do is create a new keyframe, triangulate new points, and then continue as before. This is called keyframe-based motion estimation, because you basically marginalize out frames and only retain the frames that are important, which are called the keyframes, saving a lot of computation time. Keyframes are now used in essentially all modern visual odometry algorithms, SVO among them.
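To make this bootstrapping idea concrete, here is a minimal sketch in Python using OpenCV. The OpenCV functions are real, but the overall pipeline, thresholds, and variable names are my own simplification for illustration, not the actual SVO implementation.

```python
import cv2
import numpy as np

def bootstrap_two_view(kp0, kp1, K):
    """Relative pose and initial point cloud from two views.
    kp0, kp1: Nx2 float arrays of matched keypoints, K: 3x3 camera matrix."""
    E, inliers = cv2.findEssentialMat(kp0, kp1, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, kp0, kp1, K, mask=inliers)   # five-point + cheirality check
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])            # first keyframe at the origin
    P1 = K @ np.hstack([R, t])                                   # second keyframe (scale is arbitrary)
    pts4d = cv2.triangulatePoints(P0, P1, kp0.T, kp1.T)          # triangulate the map points
    return R, t, (pts4d[:3] / pts4d[3]).T                        # Nx3 point cloud

def localize_further_frame(pts3d, kp, K):
    """Localize a new frame against the existing point cloud (PnP with RANSAC).
    A new keyframe would be spawned once the frame leaves the neighborhood of the cloud."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, kp, K, None)
    return ok, rvec, tvec
```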
But then the main question is: how do we do frame-to-frame, or frame-to-keyframe, motion estimation? There are two different approaches to that, which you know very well. One is called feature-based methods, and the other one is called direct, or photometric, methods. Feature-based approaches have dominated the literature on structure from motion for about 30 years: you extract and match features between the previous and the current frame, and then compute the relative motion by minimizing the reprojection error. The problem with feature-based approaches is that, although they cope very well with large frame-to-frame motion, they can be quite slow, due to the cost of feature extraction, matching, and outlier rejection. Direct methods, instead, use all of the intensity information in the image, so you can achieve higher precision and robustness. And, very interestingly, if you increase the frame rate of the camera, the optimization converges faster, so you can end up with an algorithm that is actually much faster than feature-based methods. However, direct, photometric methods work best only if you are very close to the basin of convergence: they do not cope well with large frame-to-frame motion, so the frame-to-frame motion has to be very, very limited.
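To make the distinction concrete, these are the two objective functions in their textbook form (my notation, not necessarily the exact formulation used in any specific system): feature-based methods minimize the reprojection error of matched 3D points, while direct methods minimize the photometric error of warped pixel intensities.

```latex
% Feature-based: reprojection error of matched landmarks p_i observed at pixels u_i
T_{k,k-1} = \arg\min_{T} \sum_{i} \left\| \mathbf{u}_i - \pi\!\left(T\,\mathbf{p}_i\right) \right\|^2

% Direct / photometric: intensity error of pixels warped with the estimated motion and depth d_i
T_{k,k-1} = \arg\min_{T} \sum_{i} \left\| I_k\!\left(\pi\!\left(T\,\pi^{-1}(\mathbf{u}_i, d_i)\right)\right) - I_{k-1}(\mathbf{u}_i) \right\|^2
```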
What we proposed in 2014 was SVO, which stands for Semi-direct Visual Odometry. The idea is to take advantage of both the feature-based and the direct, photometric models: we use direct methods for frame-to-frame motion estimation, and feature-based methods for frame-to-keyframe refinement and bundle adjustment. In this case, we also do not rely on RANSAC, in order to make it very fast. One of the things that we wanted to maximize was the speed of the system: we wanted an algorithm that could run faster than real time and with very low latency. In order to make the algorithm very low latency, what we do is track sparse points, which can be either corners selected from the image, as other pipelines also do, or, for example, points sampled from the edges in the image.
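As an illustration of what direct, frame-to-frame motion estimation means in its simplest form, here is a toy Gauss-Newton photometric alignment over a pure 2D translation. This is a gross simplification of the SE(3) sparse image alignment actually used in SVO, and all names and choices here are mine.

```python
import numpy as np

def photometric_align(I_ref, I_cur, pixels, n_iters=20):
    """Estimate a 2D translation minimizing the intensity error at a sparse set of pixels."""
    gy, gx = np.gradient(I_cur.astype(np.float32))       # image gradients of the current frame
    t = np.zeros(2)                                      # translation (dx, dy)
    for _ in range(n_iters):
        H = np.zeros((2, 2))
        b = np.zeros(2)
        for (x, y) in pixels:
            xc, yc = int(round(x + t[0])), int(round(y + t[1]))
            if not (0 <= xc < I_cur.shape[1] and 0 <= yc < I_cur.shape[0]):
                continue
            r = float(I_cur[yc, xc]) - float(I_ref[y, x])   # photometric residual
            J = np.array([gx[yc, xc], gy[yc, xc]])          # Jacobian of the residual w.r.t. t
            H += np.outer(J, J)
            b += J * r
        if abs(np.linalg.det(H)) < 1e-9:
            break
        t -= np.linalg.solve(H, b)                          # Gauss-Newton update
    return t
```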
We ended up with SVO being super fast: it only takes 2.5 milliseconds per frame, and these are timings computed on an i7 laptop. As you can see, SVO is much faster than the alternatives; for example, compared to [INAUDIBLE], which takes about 30 milliseconds per frame, and that is after disabling loop closing, by the way; otherwise it would take even longer, almost 100 milliseconds. Also, notice that SVO keeps the CPU load very low: it consumes about half a core, while [INAUDIBLE] takes almost two cores and [INAUDIBLE] takes more than two cores. All these timings were computed over thousands of frames on the datasets, so they were run many,
many, many times, by the way. And they can be found
in the TRO paper. We also recently extended SVO to version 2.0, with support for different camera models, like fisheye and catadioptric cameras, and also multi-camera systems. What you see here is SVO running on four cameras in real time on a CPU. In this case, we also track many more points: here we are tracking one pixel out of every eight, in order to provide very
high resolution depth maps. And one thing that actually
we have in SVO, which makes the system especially robust for
flying robots, especially in dynamic scenes, is
the way we do depth estimation. So this is totally
probabilistic, and this is not something new, because probabilistic depth estimation had been used before, for example based on particle filters. But what we did differently is that we do not use a single parametric distribution, but rather a bimodal distribution for the depth. We model the depth of each feature with a mixture: we assume that a Gaussian distribution models the feature-matching process well, but outliers are better modeled with a uniform distribution. We took this model from earlier work on probabilistic, video-based multi-view stereo, where the authors found that this mixture models both the feature-matching uncertainty and the outliers in the system well. So we use this combination of a Gaussian and a uniform distribution. For each feature, we therefore have two parameters that we want to estimate in real time: the probability that the point is an inlier, and the depth itself. And we solve this actually by
using [INAUDIBLE] matching, so we have a [INAUDIBLE]
solution for [INAUDIBLE] these
two parameters. So what SVO is doing in
background is basically this: for each pixel, we estimate the depth and the depth uncertainty. In this case, we are already displaying the depth uncertainty, not the depth. So what is this useful for? It helps, for example, to navigate autonomously
over dynamic scenes. So here we have a drone
that is flying autonomously with the downward
looking camera. So it’s looking down and
I asked my master student to volunteer and
move below the quadrotor and to actually wave his hands,
also move his legs in order to try to disturb the
quadrotor as much as possible. But the quadrotor actually
remained quite undisturbed, so it didn’t really care about the
movements, because the depth filters were able to distinguish inliers from outliers, and therefore all the moving points were not considered for mapping and motion estimation. The quadrotor here was following a figure eight, and it was not disturbed at all by the movements of the person. And you can see that the green points of the map belong to the static part of the scene.
>>How sensitive [INAUDIBLE]
>>Rho is the inlier ratio. At the beginning, we assume that the inlier ratio is 100%, but the point is that, as you start moving, the estimate evolves away from that. If you remind me at the end of the talk, I can move to a different slide and show you how the uncertainty evolves over time, okay?
>>But we start with one, okay? And then we let the Bayesian filter evolve over multiple frames.
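Since the question was about the inlier ratio and how the uncertainty evolves, here is a compact sketch of the per-feature Bayesian update behind this kind of Gaussian-plus-uniform depth filter, in the spirit of the Gaussian-times-Beta approximation of Vogiatzis and Hernández (2011) that SVO's depth filter builds on. The initial values, variable names, and fixed depth range are illustrative choices of mine.

```python
import numpy as np

class DepthFilter:
    """One filter per feature: Gaussian on the depth times Beta on the inlier ratio."""

    def __init__(self, z_init, z_range, sigma2_init):
        self.mu, self.sigma2 = z_init, sigma2_init   # Gaussian on the depth
        self.a, self.b = 10.0, 10.0                  # Beta prior on the inlier ratio rho
        self.z_range = z_range                       # support of the uniform outlier model

    def update(self, z, tau2):
        """Fuse one depth measurement z with measurement variance tau2."""
        gauss = lambda x, m, v: np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)
        s2 = 1.0 / (1.0 / self.sigma2 + 1.0 / tau2)              # product of the two Gaussians
        m = s2 * (self.mu / self.sigma2 + z / tau2)
        c1 = self.a / (self.a + self.b) * gauss(z, self.mu, self.sigma2 + tau2)  # inlier weight
        c2 = self.b / (self.a + self.b) / self.z_range                           # outlier weight
        c1, c2 = c1 / (c1 + c2), c2 / (c1 + c2)
        f = c1 * (self.a + 1) / (self.a + self.b + 1) + c2 * self.a / (self.a + self.b + 1)
        e = (c1 * (self.a + 1) * (self.a + 2) / ((self.a + self.b + 1) * (self.a + self.b + 2))
             + c2 * self.a * (self.a + 1) / ((self.a + self.b + 1) * (self.a + self.b + 2)))
        mu_new = c1 * m + c2 * self.mu                           # moment matching of the depth
        self.sigma2 = c1 * (s2 + m ** 2) + c2 * (self.sigma2 + self.mu ** 2) - mu_new ** 2
        self.mu = mu_new
        self.a = (e - f) / (f - e / f)                           # moment matching of the Beta
        self.b = self.a * (1.0 - f) / f

    def inlier_ratio(self):
        return self.a / (self.a + self.b)                        # current estimate of rho
```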
In the SVO paper, we also
compared SVO against LSD-SLAM, DSO, and
ORB-SLAM on different data sets. These our to my knowledge, the
best data sets because they’re providing the ground proof and
all the ground proof, basically and
very well calibrated parameters. So there is the EUROC-MAV data
set which conceived for drones, then there is the Imperial
College London synthetic data set and then there is the one
from Daniel Cremers lab. So here you can see a few ones. Here I show basically the
performance results of the four algorithms, SVO, ORB-SLAM, DSO, LSD-SLAM on ten data sets
of the EUROC data set. And here what you can see
is that basically they all have a very high accuracy. For example, here you can see
over a 60 meter trajectory, we have about 10 centimeters
error with SVO here, 60 centimeters with ORB-SLAM,
5 centimeters with DSO and 18 centimeters with LSD-SLAM. By the way we disabled loop
closure for LSD-SLAM and ORB-SLAM in order to
enable a fair comparison. Here, we were not able to
actually initialize DSO, because actually the EUROC sequences
have increasing complexity, so faster HDR. However, you can see
that also LSD-SLAM was not able to be initialized. While DSO actually
performed very, very well. And also it gets higher
accuracy than for example, SVO, which is second best on here. And the reason is because
DSO actually although it’s fast like SVO and more or less
use the same amount of features. What we didn’t have in SVO which
DSO has is this photometric calibration of the camera, which
is exploited in the way they’re actually, they minimize
the photometric error. So they basically compensate for the exposure time within the
photometric reprojection error. And this is something very
important to do in order to achieve some slight robustness
to high dynamic range. Another thing that we try to
investigate in this TRO paper is also comparison between
the sparse, semi-dense and dense approaches. So dense can be for example
algorithms like the DTAM or REMODE from a former Post-Doc. LSD-SLAM as a representative
of the Semi-dense approaches. And SVO or DSO as a representative
of Sparse approaches. First of all,
notice that we have a huge difference in the number
of pixels that these algorithms are actually tracking. So if you rely on dense
approaches your track, on average, 300,000 pixels, because that’s the effective
number of pixels in a VGA image. LSD-SLAM tracks on
average 10,000 pixels. It depends, of course, on the
number of edges in the scene. SVO and DSO track on average
around 2,000 pixels. So that actually explains also
the computational burden of these algorithms. But what we tried to
investigate in this paper was what is best in
terms of accuracy. So what we did here is that we
used the blender to generate huge scenario, which consisted
in several kilometers of trajectory in an urban canyon. And what we did is that
we wanted to benchmark the convergence rate of dense,
semi-dense, sparse algorithms with respect to the distance
between the little frames, with respect you are trying
to estimate other motions. So the distance between
the current frame and the previous frame. As you would expect, as you
increase the frame distance of the motion baseline, of course
the conversion rate should actually go down because these
are photometric based moments, they are not visual based. If you get too far of course
it’s difficult to actually let them converge. So what you can notice first
is dense and semi dense methods perform similarly in
terms of conversion rate. This actually is not
surprising because I would say that the weak gradients
are not informative for the optimization. So only high gradient,
large gradients actually really contribute to the photomatic
minimization. What we found, also interesting,
is that this day the sparse methods perform
worst than semi-dense and dense methods when the
motion-based line is above 30%. However, notice, that when the
motion-based line is below 30%, which is already
a lot of overlap, if you consider
a high speed camera. Then in this case, you can see
that the conversion rate is actually quite similar, at
least in this region over here. So the sparse method we can
say perform equally well for overlaps up to 30%, okay? So I think this is
also quite interesting, because it means basically when
you have a high speed camera, camera actually, there is not
much gain, you could get in principle in terms of accuracy
using sparse methods. But of course, there are other
advantages using dense methods, which is for example, their increased robustness
with motion blur and focus. So one of the papers what I
write about this was the one of [INAUDIBLE] paper,
that basically they actually claimed that if
you use dense methods, you can actually [INAUDIBLE]. [INAUDIBLE] paper where they
showed with dense methods, they coped better
with the motion. There are, of course, a bunch when you want to
achieve a higher robustness. Let’s now change
completely topic. And now, let’s talk about
visual-inertial fusion. So first of all, what is
the problem we are trying to solve when we have a single
camera and an IMU? So this is the problem, the
equation that we tried to solve. So X is the current position and S is the unknown skill
factor and IMU usually gives you accelerations and
angular velocities. Here I already show
the acceleration and [INAUDIBLE] solve equations for
the angular velocity. But there is a bias that is
usually due to pressure and temperature that is
not known in advance. You can Initialize through
a calibration procedure the biases, but
actually, the biases, they vary slowly over time,
but they can be modeled. Actually, there is a paper by that says that the bias
evolves slowly and it varies, but it can be
modeled as a Gaussian process. But anyway, it’s an unknown that
should be also calibrated and refined, estimated
over the trajectory and g is usually known, of course
it depends on the altitude. So typically we have three unknowns here, and one of them is the initial velocity, because when we integrate the acceleration, we of course do not know the initial velocity.
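Schematically, the relation being described can be written as follows; this is the standard monocular visual-inertial relation in my notation, not necessarily the exact equation on the slide.

```latex
s\,\mathbf{p}_{\mathrm{vis}}(t) \;=\; \mathbf{p}_0 \;+\; \mathbf{v}_0\,t \;+\; \tfrac{1}{2}\,\mathbf{g}\,t^2 \;+\; \int_0^{t}\!\!\int_0^{\tau} R(\sigma)\,\big(\tilde{\mathbf{a}}(\sigma) - \mathbf{b}_a\big)\,d\sigma\,d\tau
```

Here s is the unknown scale, v0 the unknown initial velocity, b_a the accelerometer bias, g gravity, R the orientation integrated from the gyroscope, and the tilde denotes the measured acceleration.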
However, there is a paper by Agostino Martinelli from INRIA in Grenoble, which gives the first, and at the moment the only, closed-form solution to the visual-inertial fusion problem. He found that both
the absolute scale v0, and also the biases by the way can
be determined in closed form, with just a single observation of at least three consecutive views, provided you have non-zero acceleration, so no constant speed. He found that this problem can be solved in a single shot, and of course the more excitation you have, the better it will work. So of course, you can also use
A Volkswagen 1 point algorithm, which he also proposes. This proposed solution, the one
that’s not really being used by anyone, would say, we have
actually used it in a lab. But I would say, it’s very
useful because, in principle, you could use a cross-form
solution to initialize your absolute scale and
initial velocity, and then track those entities using,
for example, a common filter. So at the moment, typically,
these systems work by initializing the absolute
scale, either assuming that their phone for example is at
rest when you start the visual. Or by using a loosely coupled
approach where, basically, first, you double integrate the
acceleration, then you feed this to the theater which is
actually not very good. So this will actually
work better. So we actually have
a 16 where we show that this increases the accuracy for
our tracking models. But then,
once you have initialized, any way, both the absolute
scale and the v0, how do you track those entities? So here there are two
different approaches. There are filter based
approaches that typically only update the last state,
the last pause. The problem is that, as you know, common filters have
the problem that the covariance increases quite rapidly with
the number of features. So what you want to do in
filter-based approaches is to keep the number of features low,
so typically, they only track
around 20 features. So if you think about
Google Tango, they also actually track a very few
number of features, around 50. And there is an algorithm
that is now widely used, which is the MSCKF, the Multi-State Constraint Kalman Filter. But then there are two other
approaches that from approach, is what they do is that
they update not just frame, but they update the last frames. So they work with the sliding
window as the initial approach. Or they estimate so
they update the of the camera, including all the frames and
the EMU. The problem with these two
approaches of sliding window and full trajectory estimation is
that they are more accurate, of course, because they
use more information. But of course, they sacrifice
a little bit the efficiency, because they actually need to
optimize a larger number of parameters. And these are actually the that
are being used in this kind of approaches, how do they work? So how do optimization
based approaches work? Optimization-based approaches
work by trying to solve a pose-graph SLAM problem. Here, in this case, the gray
triangles represent the frames. For example,
I don’t know, 50 years. And then the red dots are the
measurements that, for example, come up under the earth,
so 1 kilohertz. And then here in this
case we have for example just a single
3D landmark of course. So what this optimization
approach is trying to do is to compute the minimum energy
configuration of this graph such that basically two errors
are minimize simultaneously. On one side you want to minimize
the prediction error between the prediction by the annual,
and the position estimated by the visual [INAUDIBLE],
which is actually [INAUDIBLE]. And then on the other side, you
want to also minimize the bundle adjustment term, so
[INAUDIBLE] reprojection error. Of course, these two terms are weighted by some coefficients that need to be tuned.
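In other words, the graph optimization minimizes a cost of roughly this form (schematic notation, not the exact cost function of any particular implementation):

```latex
\min_{\mathbf{x}} \;\; \sum_{k} \big\| r_{\mathrm{IMU}}(\mathbf{x}_k, \mathbf{x}_{k+1}) \big\|^2_{\Sigma_{\mathrm{IMU}}} \;+\; \sum_{k}\,\sum_{i} \big\| \mathbf{u}_{k,i} - \pi\!\big(T_k\,\mathbf{p}_i\big) \big\|^2_{\Sigma_{\mathrm{vis}}}
```

The first term is the inertial residual between consecutive keyframes and the second is the visual reprojection error; the covariances play the role of the weighting coefficients just mentioned.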
The problem with these optimization-based approaches for solving the visual-inertial fusion problem is that these graphs are typically very dense in terms
of number of parameters and number of frames. So what we proposed in
this paper, we defined or assessed is to first specify
the graph by skipping frames. And by preintegrating the IMU
between all the key frames, so we basically represent most
of the IMU measurements into a single post constraint. In order to make the system
actually more manageable from a computational
point of view. And then we use iSum
to solve this graph. And iSAM stands for Incremental
Smoothing and Mapping, which is a nice algorithm
similar to Google G2O. But what it does nicely is that
it only updates the pauses that are affected by
a new measurement. So this makes the algorithm
extremely efficient for post graphic
optimization problems. In fact, it can run in
real time in a processor should have 50 hertz. So we managed to run it
on computer at 50 hertz.>>So
in terms of this is what we did, there we can talk
more about this. So, we also use the which
is the only provides a synchronized camera frames
plus a motion capture. Here we compared against Google
Tango which chooses a base approach. Specifically the and then
which is a sliding Windows so it only updates the frames. So here we show authorization
error, that’s just router, so as we expect
the decreases of course. But as you can see our
estimation prefers this is because actually we are
using more information, okay? Now I want to show you some
videos of how we used this visual on our drones, so
this is one of our drones. And the main sensor is basically
a global-shutter camera which has a little, A resolution is slightly higher than
a digital camera. It also has
a [INAUDIBLE] range and output frames up to
90 frames per second. It also a PX4 autopilot,
which we use for reading the IMU as well as
controlling the models. And then the main computation
happens on an Odroid XU4, which is basically a quad-core
computer with 2 gigahertz. And in there,
we have Linux Ubuntu, ROS, and we run all the SVO and
state estimation, as well as control and
path planning. So all running onboard
using this system. So this is a system that
we have been using for the past three years or so;
we have done a set of works. So here you can see, for example, the performance
of the system under disturbances like pulling and
pushing a quad rover. Here you can see the system
following a figure eight, here we reach, we achieve
about two centimeter accuracy. And the higher you fly,
the more accuracy, the answer telling me the phase,
of course. Here, you can see how the system
reacts for example to big larger disturbances, like when you put
those helicopters in the air. And this is actually
only using again, single camera based on SVO and
not GPS. And this is to show that
basically there is the need to use a low latency
estimation algorithm. Because if the latency was too
large, then it would actually wouldn’t be able to initialize
the map and the force stabilize. So you need something that
is below ten milliseconds. Now we’re using,
since a few months ago, we switched to a different
system, which is now based on the Qualcomm Snap Dragon Flight
board. Which has the nice property
that it integrates a forward looking camera,
downward looking camera. Plus a Qualcomm quad core
processor on the same board, and it weighs less than 50 grams,
and this is great. So the whole controller, including the Qualcomm
board weighs 200 grams. And this even safer
than the previous one. In addition, it has a forward
looking 4K camera that can be used for other tasks. We also used a carbon fiber
frame to make it lightweight. And here you can see the system
all relying on onboard visual inertial estimation. So it’s even more agile than the
previous one because actually the agility is inversely
proportional to the size. Okay, the agility is
usually measured in terms of angular accelerations. Now, so far I was talking about
[INAUDIBLE] fusion assuming that the camera parameters
are left unchanged. But one of the parameters that
you could, in principle tune, could be for
example the exposure time. And that’s what we try to do
in this [INAUDIBLE] paper this year. So typically, one of the biggest
problem in vision [INAUDIBLE] estimation is how do you cope
with the scenes characterized by a high dynamic range. This is for an example
of high dynamic range. You see it, basically
the I dynamic range here dramatically changes between and
among these different scenes. And the problem is that when
you have an overexposed or underexposed scene [INAUDIBLE], you will lose track of
the features unless you fastly compensate for
the illumination change. But then if you even compensate,
sometimes you change the appearance of the [INAUDIBLE]
and then you lose track anyway. So the problem is
because the built-in camera exposure is usually
optimized for image quality, so for photography and
not for visual odometry. So the idea here was to actively
adjust exposures time in order to maximize the robustness
of visual odometry. So the control approaches
the following one. So here is the visually mesh, so this vision is [INAUDIBLE],
which is not the change at all. What we change is
basically the exposure time of the image before feeding
the visual odometry. So we have the camera,
the camera returns the image. This is further to our active
camera control strategy that basically modifies the image, so
enhance the image exposure time, so illumination. And then it’s compensated
[INAUDIBLE] transformation and then it’s further to
the front end of the visual odometry What is our
control strategy? The control strategy
is the following one. Basically, we assume that we
have already photometrically calibrated the camera, so
we know the gradients for each exposure time. And therefore, the next exposure
time is gonna be what to the previous exposure time plus
a certain manually-tuned gamma times this ratio over here. Where this ratio is
basically a function of the partial derivative of
a percentile of the gradients. So in principle what you wanna
do, is maximize the number of features in your
view by [INAUDIBLE]. And the number of features
depends on the number of gradients. So there was a paper by
[INAUDIBLE] South Korea in 2015, that they tried to maximize
the sum of squares gradients in the image in order to
adjust that exposure time. But actually, we found that this
is not very robust in different scenes, and it’s actually better
to maximize a percentile, for example, the median
of the distributions. By doing this, you actually are
much higher robustness of system [INAUDIBLE], so
here you can see an example. So we apply this system to all
of the iii, so like [INAUDIBLE] and we show that all of them
actually get [INAUDIBLE]. For example, in this scene,
we are moving the camera in front of road sign that
is occluding the sun. And as you can see, this is the
raw implementation of Orb-Slam where the image suddenly gets
overexposed or underexposed. And so it loses track of
fissures while in the upgraded [INAUDIBLE] that uses
our exposure control and compensation. Actually, the features
are not lost. Okay, because this is taken into
account by changing actively the exposure time. Here’s another example. In this example, actually
we are showing Orb-Slam. Here we have another
HDF sequence, where here we have a very
bright light source. And as you can see in
the raw implementation, the features are lost while
after compensating for the exposure time, the features
can continually tracked, okay? So this actually very simple
approach provided to provide a little bit of robustness and
robustness over HDR sequences. But of course, it will not
scale to a iii, which is for those [INAUDIBLE] little bit, it’s actually better
to use in cameras. Yes?>>The controlled response,
is it for every frame?>>Yes.>>Okay.>>And
there is a delay of one frame, so we actually control
the camera directly. We are using an [INAUDIBLE]
sensor here and a matrix vision [INAUDIBLE],
so there is a bigger one.>>[INAUDIBLE] Information
carried from the past?>>You could, but we don’t,
but you could, indeed. Now actually, we have a paper
on archive where basically we add some deep learning. And we show that you even get
a higher robustness, yeah, yes?>>What’s your thoughts on these
cameras, which have coming for a while now, but
they’re becoming cheaper, which use multi-slope
to really increase and invert the slope points
programatically. Change on the fly. Like [INAUDIBLE] just
a [INAUDIBLE] experience have we found that they’re worth it, or
>>We haven’t used them, but why not? Absolutely.>>[INAUDIBLE]
people [INAUDIBLE].>>I see.
>>For reason I’ll tell you later. But of course, if that might
make sense [INAUDIBLE]. So now, yes?>>[INAUDIBLE] Can apply for
both [INAUDIBLE] and [INAUDIBLE].>>Yes. Correct.>>And this doesn’t affect the
DSL part where [INAUDIBLE] does [INAUDIBLE]?>>Yep, so it does in
control with exposure time. So they apply in affine
transformation of the intensity of the image, which depends
on [INAUDIBLE], but that is constant,
it depends on the camera. It’s not changed,
we changed that. And then we use this same affine
compensation [INAUDIBLE], which is actually standard,
okay. Okay, now I’m going
to switch topic, and I’m going to talk about event
cameras, and especially for low-latency state estimation. Again, drone motivation, since
I love drones and robots, and my goal is to one day
fly as a [INAUDIBLE], even faster than birds. So I found this beautiful
video from National Geographic that shows that a sparrowhawk
that in four seconds. So this is real time. It [INAUDIBLE] among trees, it
passes through another gap, and poof, it makes a bird disappear. I like this video,
because it shows that basically, you know, what you put in
Facebook do with a [INAUDIBLE]. So with a camera that has
the latency [INAUDIBLE] of the human eye. In fact, the problem is that if you want to go
faster you need faster sensor. Why? Because the agility of a robot
is limited by the latency and temporal discretization
of it’s sensing pipeline. I’ll try explaining in
different in different terms. Basically, if quadruple,
for example The motors of a quadrotor have a custom time
of about 30 milliseconds. Usually the custom time is
computed as the time it takes the speed to go from 0 up to 80%
or 70% of the desired speed, so 30 milliseconds. But if you want to change
actually the speed incrementally, what the latency
is below a millisecond. And it’s also interestingly if
you scale down the size of your robot, also the latency
will actually reduce. So even to go from 0 to
n meters per second, you will actually need less
than 10 milliseconds, okay? And with a small door like this,
or like the RoboBee project
by Robert Wood Harvard? Well, the latency of their flaps
which is below a millisecond. So there is very strong
interest in developing visual navigation systems
with very low latencies. Now, the problem is that we
are limited by the current technological drawbacks
of standard cameras. So in a standard camera you
have frames across right? Usually every 30 or
10 milliseconds. So every time you grab a frame,
then you need to wait the certain processing time
to process the last frame. And then you need to wait
another delta T in order for the next frame to arrive. So typically the average
robot-vision algorithms have latencies between 50
to 200 milliseconds. As I told you,
SUS is 2.5 milliseconds but this is average
robot-vision algorithms. Even cameras actually would
allow to do low-latency sensory motor control with
latencies that are much below a millisecond including
the temporal discretization. Why is that possible? Because the output on the end
camera, as I will tell you in a little bit, is not a frame but
rather events that correspond to pixel level brightness
changes in the scene. So this camera has smart pixels. Each pixel whenever it
detects a change of, a relative change of intensity,
it triggers an event. In an event you will see
a little bit it’s basically a binary signal, where basically this is the sign
of the derivative of the change. This camera is called
the dynamic vision sensor. And this binary inspired, why? Because it’s inspired by
the way the human eye works. So in the human eye, we have
per eye roughly 127 million photoreceptors, which consist of about 120 million rods and 7 million cones, which you can see here. But we only have about 2 million axons. The axons are the wires
that connect the pixels, our pixels to the brain. So how is it possible that we
have many more pixels than the affecting
the number of wires? This is possible because
our eyes spike aa fire of information asynchronously. So every time that a few
photons, heat, a photoreceptor, that photoreceptor actually will
pass through a receptor field and then eventually will
actually fire information asynchronously and independently
of all the other photoreceptors. So inspired by
this principle but not mimicking this
principle by the way. Professor from my department,
in 2008, developed the first commercial
dynamic vision sensor. This is the key paper. Dynamic vision sensor, or DVS,
also called event camera, okay? The camera can be seen here. So it looks from the aesthetical point of view like
a normal camera. So there’s no way to distinguish
an event camera from a standard camera. Actually that can be now found
also in the size of a smartphone camera, okay like a five
by five millimeters. That can only be bought up for
now by iniLabs, but Samsung just announced that they are going
to mass produce the sensor. So Mark face, for example,
both took from Samsung, the Samsung DVS, okay? If DVS has four key properties,
which I find very, very interesting, not just for
robotics, but also, especially for VR, AR,
which are the low-latency, so one microsecond latency
updated events. Also, very high dynamic range,
so 140 dBs instead of 60 dBs of
standard the good cameras. That means that you don’t need
a black filter if you point an event camera
towards the sunlight. This picture was shot in 2015
during the European solar eclipse, and here you can see
the moon shape and the fingers of a hand without using a black
filter, both at the same time. Okay, so this is to show
basically potential of this camera in terms of
high dynamic range. It has a very high update rate,
one megahertz, and it consumes very low power, which is almost
100 of a standard camera. So, 20 milliwatts instead of
1.5 watt of a standard camera. That means that you
could also enable light because of the applications. So you only need to change the
batteries every once in a while. And actually the power
characteristic is being exploited by DARPA and IBM in the TrueNorth processor
this was a 70 milliwatt neuromorphic computer used in
the SyNAPSE DARPA project. So we will say 1 million neural
network that was using input from a DVS and then short
applications to recognize or suggest with very low power. Manada is an even better and neuromorphic computer that
consumes one milliwatt. So it’s asynchronous in
the way that they process the information, so they can
only process event sensors, like DVSs, also audio DVSs,
so there are also. So there is a lot of research
also in terms of the remote computing that could be
applied to event cameras. What are the disadvantages
of event cameras? Well, because the output is
asynchronous and there is no intensive information just
binary information, because these events correspond to
pixel level brightness changes. The disadvantages are basically
you cannot apply any of the standard computer vision
algorithms that have been developed for the past
50 years of research for standard cameras. Before I go on, let me illustrate better
how the camera works. So because the camera
responds to changes, relative changes of
illumination, basically you can see this camera as a motion
activated edge detector. So it mentioned that there is
something moving in the scene. In this case, we have a rotating
disk with a black dot. A standard camera with output
frames across time intervals. Instead, a DVS would
output a spiral of event which corresponds to what is
changing in the scene over time. And so here, we have X and
Y, and time, so it’s a three dimensional, so it’s special
temporal signal, okay. And as expected if nothing
moves, no events are generated, so you have a drastic
reduction bandwidth. But another cool feature is
that there is no motion blur, because this camera works
in differential mode, so it does not accumulate photos,
okay? Only the sign of the derivative,
okay? So no matter how fast
you throw the camera or you use it,
there is no motion blur. Another way to visualize the
output of a DVS is to accumulate the events that record in
the last Delta T milliseconds. So we call this
a BF12 event frames. So it’s an alternative way to
visualize this information where blue stands for positive
changes of illumination and red stands for negative. And notice that points in
the farther of the scene fire less events than points in the
foreground, and that is quite expected, right, because it’s
similar to the optical flow. And there is background noise,
which at the moment, we model as a Gaussian process
on the threshold that actually trigger the events. But we don’t talk a lot about
this noise because actually there are many new
ideologies of the sensor, and a lot of work
needs to be done. Let’s come over and
let’s dig down even more. The pixels of a DVS are even as
slightly better than those on a standard camera. Actually, in the original DVS,
there were 40 microns. So you know the standard pixels
about six microns, all right. And now Samsung managed to scale
them down to nine microns, which is much better. The reason why they are slightly
larger still than a standard camera mega pixels is because
of the electronics behind each pixel. So you have the photo value,
the photo receptor. Then your compacitor, so this is
basically the behind each pixel. So after the photoreceptor you
have a capacitor to basically remove the DC component. Then you have a reset
which basically resets the time at which you
want to compute the event. Then you have an amplifier,
and then you have two sets of comparators that check whether
there are any brightness changes below or
above a certain threshold. Actually, I call it
a relative brightness change. But in practice is
a change of the log of the intensity, okay? So let's assume this is the log of the intensity over time. Whenever the change of log intensity since the previous event is above a certain user-defined threshold C, which you can tune but at the moment does not go below about 13%, then you get an ON event; if it is less than minus that threshold, then you get an OFF event, and so on and so forth. The time between consecutive events, of course, depends on when the log intensity crosses these thresholds; this is what is called level-crossing sampling.
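A minimal way to emulate this behaviour in simulation (a common approach in the event-camera literature, not the internal pixel circuit itself; the threshold value and names are illustrative):

```python
import numpy as np

def events_from_frames(frames, timestamps, C=0.15):
    """Generate (x, y, t, polarity) events by level-crossing sampling of log intensity."""
    logI = np.log(frames[0].astype(np.float32) + 1e-3)   # reference log intensity per pixel
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        new_logI = np.log(frame.astype(np.float32) + 1e-3)
        diff = new_logI - logI
        while True:                                       # a pixel may cross the threshold several times
            pos = diff >= C
            neg = diff <= -C
            if not (pos.any() or neg.any()):
                break
            for polarity, mask in ((+1, pos), (-1, neg)):
                ys, xs = np.nonzero(mask)
                events += [(x, y, t, polarity) for x, y in zip(xs, ys)]
                logI[mask] += polarity * C                # reset the reference at the crossed level
                diff[mask] -= polarity * C
    return events
```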
What you can see from this plot is that if you accumulate the number of ON and OFF events over time, you can reconstruct the relative intensity of the signal. So people have done this:
actually the first work was by the people of
Delbruck in 2011 and then more recently in 2015 by
the group of Kim and Davidson. So reconstructing
intensity of the image, the relative intensity
of the image, okay? And this is actually
very useful, because you could
generate HTR images. So now there are many other
modern companies that are looking at DVSes because of
the potential for HD animation. So three years ago, one of the first things we did
was to collect the data set. Because I’m interested
in drone applications, I wanted to see first of
all how much information is triggered by event camera
during aggressive maneuvers. So we flipped a quad over,
we mounted a DVS and the standard camera, and
then we flipped quad over and we used a motion tracking system
in this case to control it. So it was not closed
loop at the moment. Here you can see the output of
a DDS and here the output of a standard camera which obviously
is not effected by motion block. And here you can see the
changing accumulation time for rendering this here phrase. The first thing you will notice
is that the frames are always very sharp, okay. So one of the first questions
that we tried to ask was how many events are important for
example, motion estimation, how many? Because the problem is that
if you have a two larger accumulation times, actually
then you have motion law, right? And as you decrease
the accumulation time, or the rendering time, then,
of course, less and less events will appear. So one could think could
actually accumulate a right number of events, and then apply standard line
based or based algorithms here. But this is not what we wanted
to do, because if you accumulate events, well, why don’t you
then use a standard camera? Cuz we’re increasing
the latency, instead what we’re interested in
is event based vision that means processing solution
of the single event. Ideally, in practices
single event does not supply information, but free
event actually is the minimum amount of information that can
be used for several roles. So what we’re interested in
is event by event processing. So try to update the system as
a single event is generated. The problem is, you first
have to model the sensor. So with my colleague we try to come up with a probabilistic
measurement model of a DBS. Which turns out to be this. What we are saying is that
the probability that an event is generated is proportional to
the scalar product between the gradient of the patch,
that you don’t know, times the apparent motion of the image
playing, which you don’t know. I tried to convince you
that this model makes sense indubitably. So first of all, the optics on
the main camera is the same one as a dental standard cameras,
of course. So you have a standard
pinhole camera model. Let’s assume that the DBS is
just looking at a single patch, represented by a black and
white transition. So you would agree with me, that
if you moved the DBS parallel to the edge,
no events will be generated. While if you move it in
any other direction, there would be generation,
with maximum generation with directions that are
perpendicular to the gradient. So that’s where this comes from. Now we have actually
modified this one slightly. But [INAUDIBLE], I use these formulas you see
just to convey the message. The problem is that now
you need this to end. And you can estimate
this to end, this for example using by
Bayesian filter. So one of the first works that
we did then took two years to review for family so it came out
very recently is basically using Bayesian filter to
estimate gradient or the motion alternatively. Okay, but now we’re
getting away from that. Anyway, I’m going to show
you some of the results and then later if you’re
interested to dig down. So here is one of our first
event based visual odometer pipelines. That is not a close up and
its actually more odomotry, and you can see this is
the input of events. And the output is a semi-dense
map of the environment plus trajectory
estimated real time. Because the input is basically
already semi-dense is by definition, because it’s a
computer [INAUDIBLE] the output can only be semi-dense. And notice that the map
is pretty reasonable. It looks very reasonable. Of course that can be too
depending on the parameters. Here we show you a comparison
with the state of the art algorithm running
on a standard camera. So here, we are SVO running
on a standard camera. And our event-based visual
running on the DVS, event by event. The details can be found accept. So here, what I would like to point out
is that when the camera starts moving fast, SVO loses tracking
because of the motion blur. We don’t compensate for
motion blur and it will be very difficult to
compensate for motion blur, because you will need
to estimate the depth. That’s what in this
paper while instead, the event-based visual
continues to track the camera because the events
are actually still meaningful. Okay, there is a lot
of information spiked. Another thing I
wanted to show you, it’s the robustness of
an event camera in HDR scenes. So this is my iPhone camera,
notice that they don’t sign features disappear
when the sun shines and this is what the event
camera looks like. You can see that
when the sun shines, you can still observe
that that’s sign. And here we show the output, so the estimation of the tracking
and mapping estimation. In addition we also show that
you can compute in real time the intensity of the image does
not need to compute intensity. Because as I told you we
are now moving away from this alternative estimation of image
gradient and apparent motion. We are using an approach which
we call contrast maximization, which in the next few slides. Okay, so this is basically two
cameras, are having very high potential for HDR and
high speed motions, okay? But now,
what about computing time? So, in event camera, this a question people
typically ask me. And what happens if every
picture packs information? Well in that case, there is a maximum number of
events that a camera can handle. The card and the DBS can only handle up to
50 million events a second. But the new Samsung DBS, can handle up to 300
million events a second. So using a VJ camera is like if
every pixel was buying a bit of time. So that could be done. The point is that, then you have
to process this information. So if you process
it event by event, you will actually need a GPU. Because, you have
more information than a standard camera. And then, if you use a GPU then
the low power considerations, actually, they’re
a little bit unusable. So what we try to do, is try to rethink also what
information is useful. And so, we try to see
if we could extract salient key points
from an event stream. And it’s what you
see in this video. This is a BMVC paper I did this
year, where basically, what we do is that we extract fast like
corners from an event stream. So this is the row event stream,
and then now, you will in a little bit, you will see the
corners that have been tracked. So are you all familiar with
the fast corner detector? Just to recap activity in
the fast corner detector. You say that it peaks,
or is a corner, if in a circle of 16 pixels,
there are at least nine adjacent pixels that are all darker, or
brighter than the center pixel. So here, we take a similar
thing take into account, and we say that since it is
no driving information, we take the timing into account. So we say there are a 3
to 6 contiguous pixels, that are new work. Then all of the pixels
on the same ring, for example 4 to 6 on
the smaller circle. Then, this point corner. So we managed to
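A simplified sketch of this kind of timestamp-based corner test on the surface of active events (the circle offsets are the standard FAST radius-3 ring; the arc lengths and everything else here are illustrative, not the exact published criterion):

```python
import numpy as np

CIRCLE3 = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
           (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

class EventCornerDetector:
    def __init__(self, width, height):
        self.sae = np.zeros((height, width))       # surface of active events: last timestamp per pixel

    def is_corner(self, x, y, t, min_arc=3, max_arc=6):
        """Update the timestamp map with the new event and test the corner criterion."""
        self.sae[y, x] = t
        h, w = self.sae.shape
        if not (3 <= x < w - 3 and 3 <= y < h - 3):
            return False
        ring = np.array([self.sae[y + dy, x + dx] for dx, dy in CIRCLE3])
        n = len(ring)
        for arc_len in range(min_arc, max_arc + 1):            # contiguous arc of "recent" pixels
            for start in range(n):
                arc = [(start + k) % n for k in range(arc_len)]
                rest = [i for i in range(n) if i not in arc]
                if ring[arc].min() > ring[rest].max():         # every arc pixel newer than the rest
                    return True
        return False
```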
So we managed to construct an event-based corner detector with the same computational complexity as the FAST corner detector. But because event cameras deliver information much faster, you end up with a corner detector that is effectively 100 times faster than the frame-based FAST detector. Now, how do we do the tracking?
we have changed it completely the way we are now
doing tracking. We call this contrast
maximization and we are now running
a although some results can be found in this
payroll by the way. So if this is a special
representation of the last, I don’t know, delta t milliseconds
of the event stream, what we try to do is to compute
the motion that over a small time interval, will maximize
the contrast of the scene. So if you could visualize
this as special signal, try to unwarp it in
a temporally new way, the best you can see very
high contrast images. It is something that you cannot
do with a standard camera, unless you know the depth and
you know the convolution filter. So in a standard
camera basically, when you motion blur the image,
you lose the time information. In an event camera,
the timing information is there. And this activity is very,
very cool. Now, you can solve this
contrast maximization either locally, over individual features, or globally, over the whole image; sometimes the global version also works, because in the end the events often share a common motion. We try to do both, and it can be done very, very fast.
this year, and also a. Here what we’re doing
is now plus IMU. So visual initial odometry
with features extracted from the event stream. And to show the potential of
this, what we did is that we attached a DBS to a leash and
then we spin it in the air. Here is the camera attached to
a string and now we spin it. In the [INAUDIBLE] we also
compare to the raw integration of the [INAUDIBLE] and we show that to
the [INAUDIBLE] is of course, much lower when you integrate
the event camera, obviously. Also, we compare with
the standaard camera. So this [INAUDIBLE] sensor,
without exposure, you can see that when you spin so fast,
the images are motion-blurred or under and overexposed, so
you can’t extract features. This is what virtual frames
will look like on an image and these are the features. We extract these fast
corner-life features and that here, you also see the performance
on an HD app sequence. This is the only real time,
by the way. So we use green to represent
the persistent features, so like an SBO. So pixies that converge, and the
blue features are the ones that are already initialized,
but cannot be tracked. And now, what we did is we have
And now what we did is we applied this to a quadcopter and we fly with it using only onboard sensing. So this is the first autonomous flight in closed loop with an event camera. Here we are using our proposed event-based VIO. Okay, and to show the potential of this, what we do now is that we are going to switch off the light, and the quadrotor continues to fly without being bothered. We tried that with recent DJI drones and with Qualcomm drones that have their own VIO: they crash immediately. We also tried it with SVO, and of course it fails, first of all because you have an abrupt change of illumination when you switch the light off and on, and also because of the low light. All standard cameras in low light suffer from strong motion blur; you're actually gonna see this here, this is what is gonna happen now. These are the event frames, the actual event frames, these are the events only, and these are the standard frames. The standard frames basically do not track features, while the event frames continuously track features, both in normal light and in low light. Of course, there has to be a little bit of light, otherwise the event camera doesn't see anything, okay? But because of the high dynamic range, it can actually deal with moonlight; according to the data sheet, full moonlight. This is another experiment where the room was partially well illuminated and partially in low illumination. And here you're gonna see that when the quadcopter flies into the darker areas, let's see, the frames get motion-blurred due to the low light, and therefore you can't track features anymore. And by the way, in this pipeline, in this paper, what we did is that we combined frames, events, and IMU, tightly coupled all together; the paper is already on arXiv, if you're interested.
>>With these cameras, do you have the ability to set the contrast threshold yourself?
>>Yes, but not explicitly. That means there is not a parameter called C that allows you to set the threshold explicitly, but you have 12 parameters, and with a combination of them you set the threshold.
>>When an event is triggered, does that basically do a reset on that little circuit?
>>Um-hm. Yes.
>>So basically, you have the reset time; it's one of the 12 parameters that you can set. But for the positive and negative thresholds there is no explicit parameter, so it's the combination of several other parameters that gives you the effect you're asking about.
>>So if I move the camera very slowly, will it eventually trigger an event?
>>Yes, it will obviously trigger events, but depending on the reset time and the other parameters, you can actually tune the sensitivity in a way that you see more or fewer events.
>>Okay.
>>Okay, and if you see fewer events, you will also see less of the background noise, by the way. By increasing the sensitivity, of course, you decrease the SNR, so you have more noise. But we usually also have filters for that: we check whether an event is isolated in time and space, and then we can actually remove it from the event stream.
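This kind of spatio-temporal filtering is commonly implemented with a per-pixel timestamp map. A minimal sketch follows; the 3x3 neighbourhood, the 5 ms window, and the array-based event format are assumptions, not necessarily the exact filter used by the speaker's group.

    import numpy as np

    def filter_isolated_events(events, shape, dt=5e-3):
        """Keep an event only if some pixel in its 3x3 neighbourhood
        fired within the last dt seconds (background-activity filter)."""
        last_ts = np.full(shape, -np.inf)       # most recent event time per pixel
        keep = np.zeros(len(events), dtype=bool)
        for i, (x, y, t, p) in enumerate(events):
            x, y = int(x), int(y)
            x0, x1 = max(x - 1, 0), min(x + 2, shape[1])
            y0, y1 = max(y - 1, 0), min(y + 2, shape[0])
            keep[i] = (t - last_ts[y0:y1, x0:x1].max()) < dt
            last_ts[y, x] = t                   # update only after the test
        return events[keep]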
So, in conclusion, the message of this talk is the following. I think this vision is still far away, but ten years from now, perception and control should be considered jointly; think, for example, of the active exposure-time control and compensation I showed. So there is a stronger trend towards using active control, active perception approaches. What about perception? For visual-inertial estimation, I think the theory is very well established. In fact, there are now far fewer theoretical papers trying to increase the accuracy, and more papers trying to tackle the robustness of these systems under different conditions: for example, high dynamic range scenes, high-speed motion, low texture, and especially dynamic environments. Here I didn't talk about dynamic environments; we also try to model outliers with heavy-tailed distributions, and I think this area has big potential for machine learning. Actually, we are also now using deep learning on the cameras to see whether we can provide context or additional information to improve robustness and uncertainty estimates. In general, I think event cameras are revolutionary, because they provide robustness to high-speed motion and high dynamic range scenes. They also allow low-latency control, which is actually ongoing work. And they are intellectually challenging, because frame-based cameras have been used for over 50 years, so there is a need for a change.
In this slide, I tried to summarize the state of the art over the past 50 years of visual odometry research; this is my honest opinion, and you may disagree. Basically, I think that from 1980, when Hans Moravec published a key work on stereo visual odometry, later used for the Mars exploration rovers, until 2000, the research was dominated by feature-based approaches, which increased the accuracy a lot: we went from roughly 5% down to 1% of the travelled distance in terms of drift. What happened from 2000 onwards? GPUs came, and so direct methods came as well. They actually did not increase the accuracy too much, maybe slightly, fractions of a percent, but they moved us along this other axis in terms of efficiency, because direct methods are more efficient than feature-based methods, and if you use them in combination you get even higher efficiency. They also increased robustness, especially against motion blur and defocus. Now, when you use an IMU in addition to a camera, you can get up to ten times better accuracy: an IMU basically brings the error down to around 0.1% of the travelled distance, so about one meter of drift per kilometer of trajectory. And then, if you use an event camera, you can actually improve along all three axes: accuracy, efficiency, and robustness. And probably, using other sensors, like smart cameras, for example this camera from the University of Manchester, would also open up new scenarios. So I think there is a lot of interesting research to do, and a lot going on, which is difficult to track.
Since these cameras can at the moment only be bought from one company, we published a dataset, the Event Camera Dataset, together with a Blender-based simulator, where we include IMU, frames, events, and ground truth from a motion-capture system; we have HDR, high-speed, and low-speed sequences. And very recently we also published a curated list of papers on event-based vision. There you find the papers organized by category, and you can contribute to the list. So with this, I thank you for your attention. Sorry it took me longer than I expected.
>>[APPLAUSE]
>>Okay, you kind of sold me on improving performance and robustness using an event camera, but maybe I missed it: I'm curious what you think you can do with an event-based camera that improves the accuracy of tracking. Do you feel like you could do that?
>>Yeah, I can comment on that. An event camera outputs almost 10,000 times more information than a standard camera, if you say that a standard camera works at around 100 Hz, because you have a higher temporal resolution, up to one megahertz. And a higher temporal resolution means more information, by information theory. What we did here, in this paper from PAMI: at the time, two years ago, we were using one of the first-generation DVS sensors, with 128 by 128 pixels. Here we plot the error, in XYZ and roll-pitch-yaw, and we compare against SVO, which is running on much higher-resolution images, and we get comparable error, although we are using roughly 20 times lower image resolution.
Now, the question is: what is the best way to fuse the information from an event camera with that of a standard camera? There are different ways. In the latest work, what we were doing is extracting features from both cameras and then feeding them to the filter. But there is another option, which we didn't yet make work in real time, which is to tightly couple the standard camera with the event camera at the measurement level: basically, modify the probabilistic measurement model of the events so that it considers the gradient that you have from the standard camera. Let me go back to that slide, you see it here: I told you the event camera by itself doesn't give you the image gradient, but the standard camera does give you the gradient, just sparsely in time.
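The measurement model being described is, I believe, the standard linearized event-generation model: over a small time interval, the log-brightness change at a pixel can be predicted from the frame's spatial gradient and the optic flow, and an event fires when that change reaches the contrast threshold C. As a LaTeX sketch (L is log intensity, u the flow, Delta t the time since the last event at that pixel):

    \Delta L(\mathbf{x}) = L(\mathbf{x}, t) - L(\mathbf{x}, t - \Delta t)
        \approx -\nabla L(\mathbf{x}) \cdot \mathbf{u}(\mathbf{x})\,\Delta t = \pm C

This is how the gradient measured by the standard camera can enter the event measurement model.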
Now you can basically feed that forward. You can use an [INAUDIBLE], or even the event features, and then basically, in a sliding-window estimation way, try to fuse this information. You can find the idea in this paper, [INAUDIBLE], but at the moment that was all in 2D and not yet in 3D.
>>A question about your active control scheme: because you're trying to increase the gradient of the image that you get, does that also help you with motion blur, or only in situations of high dynamic range?
>>No, we don't explicitly compensate for motion blur there, indeed.
>>Right.
>>Okay, so no.
>>Would that be a side effect of the system?
>>Would that be a side effect of our system? Yes, because of course, in dark scenes you will actually try to increase the exposure time. Yes, that would be a side effect. Yes?
>>How do you calibrate these cameras?
>>[LAUGH] Good question, so I haven't told you the whole story. There is a newer event camera which is called the DAVIS, the Dynamic and Active-pixel Vision Sensor, that outputs both events and frames, and the events and frames come from the same pixel array. So you can calibrate the frame path as a normal camera. Until these became commercially available, we were basically moving the camera in front of a checkerboard, or of a flickering checkerboard pattern on a screen; that's how we were doing it. Or blinking LEDs; Samsung uses blinking LEDs.
>>You mentioned that the computer vision algorithms need to change for this.
>>Yes.
>>What about the control aspects, the control algorithms that they are using?
>>Using [INAUDIBLE].
>>Which [INAUDIBLE]?
>>I mean any kind of [INAUDIBLE], any kind of adaptive control, right?
>>Yeah, yeah, absolutely, absolutely. You have to reformulate all these control algorithms in an event-based fashion if you want to do it properly. There is a group at the University of Paris, for example, where they basically implemented an event-based Fourier transform.
>>Okay.
>>Yes, and there are now plenty of contributions there, absolutely.
>>Is it on the webpage?
>>Yes, you find that there too, yeah. By the way, there is this slide: this is the new Samsung DVS, generation three already, compared to the original DVS and the DAVIS sensor. The original DVS, which came out in 2008, had 128 by 128 pixels, a larger surface per pixel, 40 microns, and only one million events per second at most. The DAVIS, which outputs frames, IMU, and events, has almost quarter-VGA resolution, smaller pixels that are still fairly large, up to 50 million events per second, higher sensitivity, and lower noise. The Samsung generation three, of which we have two units, has VGA resolution and much smaller pixels, but lower sensitivity in low light, so the original DAVIS has better sensitivity, down to 0.1 lux. On the other hand, it can do up to 300 million events per second, so all pixels can spike, and it has less than 13% contrast sensitivity and lower noise, so a higher SNR. But it only has events and an IMU, no frames for now.
>>Why do you still use an IMU in this case?
>>Well, because if you want to go for [INAUDIBLE] applications, I mean, in general you may still have problems, right? So why not use an IMU when you have it there?
>>Is the IMU still running faster at this point?
>>No, it runs slower. A low-cost IMU will go up to one kilohertz.
>>I guess the question is how many events does one need to make a-
>>[INAUDIBLE] We found-
>>Good observation.
>>Yeah, in our application that would basically correspond to a 10,000 fps camera.
>>I see.
>>So ten times faster than an IMU. But does the application really need that much speed? That would be the question. For some applications it would make sense: for low-latency sensorimotor control, for VR/AR. I would say that for low power, you may want to trigger the standard camera at a lower frame rate and use the event camera to update your pose in the blind time between the frames.
>>So the IMU is mainly gonna be observing gravity?
>>Sure, sure, yeah.
>>That's all it's really going to be doing?
>>Yeah, exactly, a motion prior, indeed.
>>For getting some sort of baseline orientation.
>>And when there is no illumination, of course, the IMU still provides information. Of course.
>>I was wondering if anybody has done object perception or recognition-
>>Yeah, you'll find that in the list. The people in Paris have done that. There is also a company called Chronocam that raised over 30 million from Bosch and Intel. We organized the first workshop on event cameras this year; Chronocam was there, Samsung too. And Chronocam showed some results on detection of pedestrians and cars for automotive applications, using only events, so with very low latency. The success rate, though, was in the order of 70%, which is some 20 percent less than what you get with standard cameras at the moment. But I think they were also not using any sophisticated algorithms, very simple things so far, yeah.
>>A question from the audience asks why you don't see the flicker of fluorescent lights.
>>In?
>>Fluorescent lights.
>>That's a very good question. So, one of the parameters that you can tune is the cutoff frequency of the pixels, minimum and maximum. You can tune it so that it is above the frequency of the fluorescent light, which is around 50 or 60 hertz; if you set it above that, then you don't see it, okay? And already the default minimum is about 50 hertz, which means that the event camera doesn't see it: the flickering is basically too slow for the event camera to see. But you can tune it, and with some fluorescent lights it depends on the duty cycle: if the light goes down very quickly, then you have very abrupt changes, and in that case you would actually be able to see it. So it really depends on the type of fluorescent light, but in general you can tune it in a way that you don't see it.
>>So I guess an LED would also sort of flicker?
>>Some LEDs actually do give problems; in that case you will actually see them blinking all the time, yes.
>>Can you tune it so you only see the fluorescent light?
>>Say again?
>>Can you tune it so you only see fluorescent light?
>>Only see fluorescent light?
>>Or light.
>>But then you see everything. The problem is that then you see everything, all the pixels will spike, and that means it's not usable anymore. That's the problem.
>>Yeah.
>>When you switch the light off or on in that experiment, all the pixels spike within a few microseconds. Actually, you can see the illumination going down. We basically observe it as a glitch in the normal event stream, and we just filter it out, okay? So we skip all those events.
>>Does the camera typically generate events in pairs, for example?
>>Yes, of course.
>>Typically you see the negative ones as well.
>>Yes, very true. Roughly, because basically the photoreceptor is continuous in time: it's analog, but the readout is digital, and it operates at up to one megahertz. Of course, you could trigger it faster, but that's the way it is done at the moment. That means that, basically, if there is a larger change, you would actually see more events, typically two or three. But you don't see that very often, I would say 1 or 2% of the time. Okay?
>>I actually have one last question.
>>Yes.
>>What do you think about doing fast obstacle avoidance with these cameras?
>>Yes.
>>I see.
>>So here, what you're gonna see is in collaboration with a company. What they are doing is collision-awareness systems for drones. In this video, you're gonna first see the DJI Phantom, which is a commercial product that has a depth camera for obstacle avoidance. And then, in the next footage, you're gonna see a video with our drone equipped with the event camera. So let's see what a commercial drone does with an incoming obstacle.
>>[LAUGH]
>>And now ours, with the DVS. Here, the algorithm is actually almost stupidly simple: it was a very simple optical-flow-based method, and the latency was very small, a few milliseconds.
>>And because you also have a very high dynamic range, you're much better.
>>Yes.
>>So is it the high dynamic range, or the sensing speed?
>>No, no, no, it is because of the low latency of the sensor; here, the scene was not high dynamic range, yeah.
>>What's the velocity it was going at?
>>3 meters per second, so around 10 kilometers per hour. We are also involved in a project called Fast Lightweight Autonomy, and there they are really, well, mostly interested in DVS technology for applications where standard cameras are affected by motion blur.
>>Is the field of view fixed?
>>No. You can just replace the lens with another lens.
>>And that affects, well, it would affect the angular resolution.
>>Right.
>>But nothing with respect to the native sensing device.
>>No.
>>Cool. Thank you very much.
>>Thank you, thank you.
>>[APPLAUSE]
