Building Neural Network Models That Can Reason


>>It’s my real pleasure this
morning to introduce Chris Manning. Chris is the Siebel Professor of Machine Learning at Stanford
University where he has a joint appointment in
computer science and linguistics reflecting his broad
and multidisciplinary background. He’s also the Director of the
Stanford AI Lab and is one of the folks on the advisory board of a new human-centered AI
Initiative at Stanford. So you can see the breadth of his influence at Stanford
and well beyond that. He’s very very well known for his work on statistical
natural language processing, ranging from more
linguistic perspectives on parsing and grammar induction, to more machine learning perspectives
including his recent work on GloVe and many other
techniques that leverage large-scale data to solve
very interesting problems. Chris is amazingly well known
in many different communities. He has over 100,000 citations. You can figure out how
many that is per waking hours since he published
his first paper, it’s probably more than one. But another part of
his influence I think, goes above and beyond his research, that spans language,
artificial intelligence, machine learning,
information retrieval. He’s a fellow of all three of the major societies
in computer science, linguistics and AI, but he’s also contributed to
the field in many other ways. He is a co-author of
two really important books, one on the foundations
of statistical NLP and another on modern
information retrieval. He’s a fellow of all three of the major societies in computer science, linguistics and AI, but he’s also contributed to the field in many other ways. He is a co-author of two really important books, one on the foundations of statistical NLP and another on modern information retrieval. But he’s also, with his students, developed many tools that are shared. How many have used the Stanford Parser or the Stanford NLP tools? He’s shared code of all kinds, he’s shared representations, things that arose out of his work on GloVe. So Chris, I think, is a real scholar
in many senses of that word. What he’s going to do today is
talk to us about how we can go. Right now if you look at
language and representations, many of them focus on understanding
the relationships among words, similarity structures in words, but he’s going to talk much
more broadly about how we can bring notions of
memory, attention, compositionality, and
reasoning to develop systems that have
much more intelligence than many of our current systems. So Chris, without
further ado. Thanks.>>Okay. Good morning and thank you Sue for that very kind introduction. I hope I can live up
to it and thanks to everyone for coming to the talk. Yes. So today what I want to do is talk about some
of our recent work on trying to build
neural networks that can reason and this is joint work
with my student, Drew Hudson. So our current neural network
machine-learning systems truly excel on a variety of tasks such as speech recognition
and computer vision. Interestingly, they’ve even been
pushed with surprising success to other tasks which you’d think involve higher level
reasoning and understanding. Notable success has
been the development of neural machine translation which
has worked exceedingly well. But in some sense, even neural machine
translation has approached things as a stimulus response task
that you are provided, here’s the input sentence, produce the output as a kind of
a statistical association task. So these tasks don’t require
deliberate thinking or reasoning, the kind of processes that Daniel Kahneman has referred
to as thinking slow. So what can we do about reasoning? Reasoning is central to figuring
out a good approach when humans are faced with a new problem in developing longer-term plans
for higher-level objectives. Can we use neural networks
to do the more deliberate, conscious,
multi-step thought? So that might be things like middle-school-style reading comprehension problems, which aren’t really like most of the current reading comprehension question-answering datasets, but involve much more of a broad understanding of the characters and what they’re doing in a novel. Commonsense reasoning
and problem-solving, working out plans,
playing a strategic game. For tasks like this, we really need better notions of having knowledge,
reasoning, and inference. So what then is reasoning? How can reasoning, which has
traditionally been thought of in connection with logics and hand-built knowledge
representations, be learned by the distributed
representation computation units of deep learning? So one inspiring viewpoint on this was provided by Léon Bottou in 2011, who suggested that the heart of
reasoning is having the means for algebraically manipulating
previously acquired knowledge, in order to answer new questions. So Bottou suggested that
we could seek to enhance deep learning systems with
reasoning capabilities, right from the ground up. That reasoning is not necessarily achieved by
making logical inferences. That he saw a continuity between algebraically rich inferences and
connecting together trainable, learning components
and in particular emphasized that central to reasoning
is composition rules, to guide the combinations of
modules to address new tasks. So one of the things I want to think about today is how can we start to build composition rules for multi-step reasoning
in neural networks? So underlying this is some kind of a conception as to where things
might head in deep learning. So in some sense, the dominant viewpoint in
the deep learning community is to seek a learning device that
is as empty as possible, some kind of tabula rasa. So what’s happened in a lot of recent work is that the emptiness of the machine learning
device is compensated for by providing extra
information in the inputs. So a very successful technique in recent years has been
data augmentation. So effectively, you are providing lots of extra inputs to a vision system: you rotate the images a little, shave off a few pixels, change the coloration, and things like this, but you’re saying that the model you want to build should be invariant to all of those things.
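As a rough sketch of what such input-level augmentation looks like in practice (a hypothetical torchvision-style pipeline, not any particular system discussed here):

```python
from torchvision import transforms

# Hypothetical augmentation pipeline of the sort described: the model is asked
# to be invariant to small rotations, crops, and color changes.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),       # rotate the images a little
    transforms.RandomCrop(size=224, padding=4),  # shave off a few pixels
    transforms.ColorJitter(brightness=0.2,       # change the coloration
                           contrast=0.2,
                           saturation=0.2),
    transforms.ToTensor(),
])
# Each training image passes through `augment`, giving the learner many
# perturbed copies that all keep the same label.
```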
However, I believe that we shouldn’t be afraid of good inductive biases, of
thinking about how we could design our
neural network systems, so that they are built
so that they have the ability to learn
quickly and well. That probably having
some architectural constraints is vital to how human beings are able to learn so
quickly and well. Indeed, I think the biggest
breakthroughs in deep learning, have come from building
the right inductive biases, structural priors into models. Now, of course you can fail if you build models with
too rigid structure, but we succeed by finding appropriate but flexible
structural priors. So these successes
include convolutions, which have been central
to work in vision, the notion of attention
which has been very central to a lot of the recent advances in natural language processing
and other areas, also the gating that you find in LSTMs or highway networks is again building in a good structural
prior into your models. In my early deep learning work, I was a strong proponent of the idea of tree-structured
compositional models. I believe that that gave
the right inductive bias for many human language tasks and
also some tasks in other domains. So the model was that you
would take pairs of vectors, which would initially be
vector representations of words, and then you’d build from that a vector representation of
a phrase like very good here, and then you could continue applying this same pairwise composition
operation recursively, and you could build up representations
of phrases and sentences.
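As a minimal sketch of that pairwise composition idea (the weights here are hypothetical stand-ins, not the actual trained model):

```python
import numpy as np

d = 50                                      # embedding dimension (assumed)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, 2 * d))  # hypothetical composition weights
b = np.zeros(d)

def compose(left, right):
    """One recursive composition step: two child vectors -> one parent vector."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

very, good = rng.normal(size=d), rng.normal(size=d)  # stand-in word vectors
phrase = compose(very, good)                # vector for the phrase "very good"
# Applying compose() recursively up a parse tree yields phrase and sentence vectors.
```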
We were able to apply this model with some success to various tasks, including sentiment analysis. So here, the model could
start reading the sentence, there are slow and repetitive
parts and start to build up a composed structure
where it says well, that’s a negative impression. But it has just enough space
to keep it interesting, and it builds up a
representation where that part of the
sentence is positive. The positivity wins
out so that overall, this is a positive thing
to say about a movie. These models aren’t only
applicable to language, you can actually apply
the same ideas in vision. So the visual scenes
also commonly have a compositional structure and
so in these experiments that we did on the Stanford Background dataset, it is building
up a compositional representation of a church
from the pieces of the building and is using that
to understand this visual scene. Interestingly, what
Bottou suggested in 2011 as a model for reasoning was essentially exactly
the same kind of tree-structured recursive neural network that we started employing
for natural language. He proposed that the path to universal composition was that you build an association module A, in this picture from
his paper that maps two representations taken from memory into a new representation
of the same sort and that that new representation can be
scored by another module shown in the picture, and simultaneously the new
representation can be put back into the short-term memory
where it can be used recursively to build up a proof tree. However, for reasons
partly computational efficiency and partly limitations
and flexibility, it turns out that tree structured
composition hasn’t really won out in neural network
research in the last five years, but there are alternatives. So one that I already mentioned
was this idea of attention. If you look at it, you can think of
attention as almost trees by another name because
we’re effectively putting a soft tree structure over the previous nodes to
generate a representation of the next node and we can apply that recursively as we work building
up along the sequence, but we now have soft weights rather
than a rigid tree structure. A second alternative for
building up something like the two’s function that takes
multiple arguments is to think that maybe our neural networks could
do what logicians call currying. So rather than having
multi-argument functions, like F that’s taking
three arguments XYZ, we could instead build
intermediate compositions. So we can take one argument at a time like X and build an
intermediate representation or function, which can then take the next argument Y and proceed to build up an overall representation.
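As a toy illustration of currying (nothing model-specific here, just the logical idea):

```python
def f(x, y, z):
    # Some multi-argument function F(x, y, z).
    return (x + y) * z

# The curried form consumes one argument at a time, each step returning an
# intermediate function, much as a sequence model could build up an
# intermediate representation step by step.
def curried_f(x):
    def take_y(y):
        def take_z(z):
            return (x + y) * z
        return take_z
    return take_y

assert f(1, 2, 3) == curried_f(1)(2)(3)  # same result, built incrementally
```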
It seems reasonable to assume that neural sequence models could do this kind of computation. So in this current line of work, the hope is that we
can start exploring neural network architectures
so that rather than having big fairly unstructured
neural networks that act as association engines or
correlation engines that look just for any kind
of patterning in the input, we could use a model that is more
structured with the prior that encourages compositional and
transparent multi-step reasoning. However, we’d like to do this in models that are still
practically usable. So for the models that we’ve built, we’ve focused on models that’s
still end-to-end differentiable. There’s a ton of work now on building reinforcement learning models
in which this is not true, but I still see the simplicity of end-to-end
differentiable models as so much easier to
train and work with. Also, it provides a space
where it’s easy to build models that are still scalable
to reasonably large problems. So in the work today, I’m going to concentrate
on showing results from the area of
visual question answering. So here we are shown a picture and we’re asked
questions about it and the suggestion is
that asking questions is a good way to
assess understanding. This was a viewpoint that was
put forward very early on. So long ago, there used to
be the Yale AI school. Who here is old enough to have
heard of the Yale AI school? Yeah. That’s right. Two or three people. So one of the members of
the Yale AI school was Wendy Lehnert who worked on
question-answering and she writes, “When a person understands the story, they can demonstrate
their understanding by answering questions about the story. Since questions can be devised to query any aspect of
text comprehension, the ability to answer questions is the strongest possible demonstration
of understanding.” The same is true for visual scenes. Why I’ve been interested in visual question answering
is because it’s just seemed like
a better proving ground for compositional reasoning, although there has been a ton
of work which I’ve also been involved in doing textual question
answering systems. In general, the textual
question-answering work has still been dominated by lexical semantic
matching and there just hasn’t been very much opportunity to do multi-step iterative reasoning
in that domain where it seemed like visual
question-answering gave a nice place ground to be looking at multi-step
compositional reasoning. Okay. So that’s my intro on trying to go from machine
learning to machine reasoning. So for the rest of the talk today, first of all I want
to tell you about our initial work on MAC networks
on the CLEVR task, then tell you a little bit about a new dataset, GQA, that we’ve
developed more recently for visual question answering
and then tell you about a more fresh-off-the-press neural state machine model for doing
visual question answering. So let’s just start off
with the CLEVR data set. So the CLEVR dataset came out of considerations of
visual question answering. So the belief of some of the people working in visual question
answering that the most used visual question answering
tasks, which came from Devi Parikh and Dhruv Batra’s group at Georgia Tech, had led to some kind of research
on language and vision, but it hadn’t really been a very good testing
ground for actually doing scene understanding
and compositional reasoning. So Justin Johnson and colleagues at Facebook AI Research decided
that they should try and come up with a diagnostic dataset
that especially focused on compositional language and
visual reasoning about scenes. So they went back to
this old classic of AI blocks world and so they
synthesized block world scenes in Blender and then they asked long compositional questions
about the scenes. So in this one, there are some purple cubes, but this one is behind
a metal object. So there’s a metal object, so that means it has
to be this purple cube, that is left of a large ball. Well, there’s a large ball
and it’s left of it. That seems hopeful. What material is the cube? If you haven’t seen
this dataset before, you’d probably say no idea, but it turns out in this dataset, all things are made
out of two materials, they’re either metal or rubber. So if they’re not shiny, the right answer is rubber. So a number of people have worked on systems to approach this data set. But as well as the scene
and the question, answer, another attribute of
this data set which was the reflection as to
how it was constructed, that on the right-hand side, there’s informal representation
of a functional program. So the way that this data
set was constructed was that they were building
functional programs that they could run on the visual scenes in an abstracted
form and get out of them, the answer and then that functional program was
converted into natural language.
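To make that concrete, here is a rough, hypothetical rendering of the kind of functional program that sits behind the example question; the real CLEVR operator vocabulary differs, so this is purely illustrative:

```python
# A hypothetical rendering of the functional program behind the example question
# ("...the purple cube behind a metal object that is left of a large ball;
#   what material is it?"); all operator names here are invented.
program = [
    {"op": "filter_shape",    "arg": "sphere"},   # find the ball
    {"op": "filter_size",     "arg": "large"},    # ... the large one
    {"op": "relate",          "arg": "left"},     # the object to its left
    {"op": "filter_material", "arg": "metal"},    # ... that is metal
    {"op": "relate",          "arg": "behind"},   # things behind that object
    {"op": "filter_shape",    "arg": "cube"},     # ... that are cubes
    {"op": "filter_color",    "arg": "purple"},   # ... and purple
    {"op": "unique",          "arg": None},
    {"op": "query_material",  "arg": None},       # answer: "rubber"
]
```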
So one of the questions on using this data set, which will come up again later, is: are you just building a system over just the textual questions and answers and images, or are you making use of these functional programs
as added supervision? So one example of a
piece of work that has made use of the functional
program has been a line of work by Jacob Andreas
and then by Justin Johnson himself that has explored
neural module networks. So these are partially
differentiable models that try to approach
the problem composition way, but they rely on a strong supervision
of the functional program to translate queries into a
tree-structured functional programs. So the first part of the model is LSTM encoder-decoder model that reads the textual question and
produces the functional program. Then the second part of that is another neural network which
interprets this functional program. So that neural network
is built out of custom building blocks for
these different semantic operations. So there’s a counting
neural network and comparing neural network and
filtering neural network and they’re plugged together in
a compositional way and they interpret this functional program and then aim to give
you the answer format. But what I want to
do first today is to introduce our version of approaching this problem which
is the MAC Network, which stand for Memory
Attention Composition as a neural model for problem
solving and reasoning tasks. So we want to have
this idea of building a network with structure that encourages it to do
multi-step reasoning. So it should decompose
a problem into a sequence of explicit reasoning steps and each of them corresponds to
one MAC cell in our Network. But on the other hand, it didn’t seem right to as the neural module network approach that I just showed
on the previous slide, which seemed much too bespoke, since you’re custom-designing individual
network units for comparison, filtering, counting,
and things like that. That didn’t seem sufficiently
generic as a model of intelligence. So what we wanted is one universal MAC cell that could
be used to be for everything, and it’s versatile enough
that it can learn to do different things depending on the context in which it’s applied. What we’re building is
the recurrent neural network, but the recurrent neural
network is able to have attention backwards throughout
a sequence of reasoning steps. So through attention, it can a soft way represent
a complex reasoning graph, but in the model, that still has
end-to-end differentiability. So each MAC cell was one step
of a reasoning system. So the design we had is of
a more articulated design than most standard recurrent
neural networks since the model retains
two recurrent states. So it has a control
state which is used to describe the reasoning
operation of the network, and the control state is an attention-based average
of a given query, which in our case is
just the textual question. Then there is a memory
state which is based on information which in general we’d say is being collected
from a knowledge base, but in the particular
application here, the knowledge base is simply the image that the system
is looking at. So the memory is going
to be represented as an attention-based average over our image or over our knowledge base. So there are a couple of things
that are worth noting here. One is that in our model, we’re not going to make use of the strong supervision of
the functional programs at all, we’re simply working
from the question, the picture and attempting
to get the answer. The other one is this design choice
that we’re representing, thinking in terms of attention, so attention-based averages
over a given query and attention-based averages
over the image. The second model I’ll
show later does things a bit differently and
has more abstraction, we’ll get to that later, but it also represents everything
as attention-based averages. I can’t prove anything here, this is just a bet, but it seemed to us that the use of attention-based models of this sort has proved to be very successful. An immediate good property
of these models is that attention gives
the easy interpretability. You can say what words is
the model looking at and what part of the picture is the
model looking at, but beyond that, it seems that by grounding models in the space of attention-based
averages, that appears to be a useful way to somehow constrain
and direct the models and get them to learn more
effectively than if we just have unconstrained
hidden states. In slightly more detail, so here is our MAC cell. So on the top part of the MAC cell, we have the control unit which
computes a control state. So it takes in the question text
and the previous control state and it will generate a new control state by focusing
on some aspect of the query. Then down in the bottom, we have the memory. So this part takes in preceding memory hidden state and our so-called
knowledge base here, the image and based on the control information and
the previous memory state, it reads some information out
of the knowledge base or image, and then that generates a new candidate memory
which is combined with preceding memories to generate a new MI memory unit which is then written into the next memory state, merging old and new information. So the bet here is, at the time we first
started this work, there had been some quite
prominent work from Deep Mind on Neural Turing machines and then differentiable neural computers. It seemed that although in theory
those models were very powerful, it seemed like in practice they were very difficult to
control and indeed I think to this day they
haven’t been demonstrated on any larger scale problems. Part of the difficulty comes from the Turing machines style,
arbitrariness, if you can read and write
anywhere and that makes it very difficult to learn
and control these models. Whereas this model has a very simple
organizations since you’re sequentially laying down memories, but each next memory can be done based on attention over
previous memories. Let’s see. Time goes by fast, so I probably should do this quickly
so I get onto the later ones. Very quickly, more
details in the paper. So the control part takes the previous control
state and the query, it computes a representation, it then uses attention onto the words of the, actually
I should explain that. The words of the question are run through a bidirectional LSTM,
it’s computed up here to put attention
over the words of the query in the standard attention
distribution way to produce a weighted average over the words of the query that normally
focusing on certain words, and that gives the next
timesteps control signal. For the read unit that’s taking in the previous memory state
and the knowledge base, the picture and then it is wanting to get something
out of the image. The early part of that allows
the previous memory to interact with the image and get stuff out
of it in an associative way, and then it feeds in the
control state and does a second round of using that to
put attention over the image. Again, we have a weighted based on attention retrieval from the image and that creates
the new memories state. So then outside the new
candidate memory state. So for the new candidate memory state that then goes into the right unit, and so the new candidate
memory state then, itself, is combined with past memories states
using the control state. So the control is used to feed
which past memory states to pay attention to in
a key value attention mechanism. So that is then used to calculate a weighted distribution
over previous memories, the new memory and that then
gives you the new memory state. The hope is that we can
simulate in that way, doing a soft but arbitrary DAG of reasoning of successively
writing new memory states. So that’s one MAC cell. Then we build
a MAC Network by building recurrence sequence model that
runs through bunch of those cells, and that gives us a model
that’s efficient, easy to deploy, and still
fully differentiable. But it has the capacity to represent arbitrarily complex reasoning via Directed Acyclic Graphs in
a soft way through attention.
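As a very rough sketch of one MAC reasoning step (this is not the published implementation; all dimensions and weight matrices below are made-up stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # hidden size (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Standard attention: a weighted average of values, weighted by query-key scores."""
    return softmax(keys @ query) @ values

class MACCell:
    """Very rough sketch of one reasoning step, not the actual MAC architecture."""
    def __init__(self):
        # Hypothetical stand-ins for the cell's learned projections.
        self.Wq = rng.normal(scale=0.1, size=(d, 2 * d))   # control unit
        self.Wr = rng.normal(scale=0.1, size=(d, 2 * d))   # read unit
        self.Wm = rng.normal(scale=0.1, size=(d, 2 * d))   # write unit

    def step(self, question_words, image_regions, prev_control, memories):
        # Control unit: attend over the question words to pick this step's operation.
        q = np.tanh(self.Wq @ np.concatenate([prev_control, question_words.mean(0)]))
        control = attend(q, question_words, question_words)

        # Read unit: use the previous memory and the control to attend over the image.
        r = np.tanh(self.Wr @ np.concatenate([control, memories[-1]]))
        retrieved = attend(r, image_regions, image_regions)

        # Write unit: merge the retrieved information with previous memories,
        # using attention over past memory states keyed by the control state.
        past = np.stack(memories)
        candidate = np.tanh(self.Wm @ np.concatenate([retrieved, memories[-1]]))
        new_memory = candidate + attend(control, past, past)
        return control, new_memory

# Toy run: 5 contextual question-word vectors, 10 image-region vectors, 3 steps.
words, regions = rng.normal(size=(5, d)), rng.normal(size=(10, d))
control, memories = np.zeros(d), [np.zeros(d)]
cell = MACCell()
for _ in range(3):
    control, m = cell.step(words, regions, control, memories)
    memories.append(m)
```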
So let me present the results of this, initially in terms of the CLEVR dataset. So it had 700,000 training
examples, 150,000 test examples. The space of answers is very small, there’s metal, rubber, cube, sphere, few colors,
very small numbers: zero, one, two, three, four, five. So the baseline is quite high. So if you answer
the most frequent answer by question type you’re already
at almost 42 percent. But what I wanted to show is that this kind of architecture
that had been used for visual question answering
sort of CNN stacks plus LSTMs didn’t really work on this data because they didn’t
do the necessary reasoning. So kind of a state of the art
VQA system at this time, a couple of years ago
only got 52 percent. Compared to that new
or module networks, this is Jacob Andreas’ work, did a ton better got 83 percent
which was not too far off the 92.6 percent that was
reported as human accuracy. But actually, these problems
were synthesized, so they are synthesized
images and synthesized answers based on
the functional programs. So there’s absolutely
no reason why a system shouldn’t be able to get
a 100 percent on this data, and there’s a sort of a actually a fairly
artificial reason why human performance was
depressed from a 100 percent, not just human laziness. So actually, what’s happened
in more recent work was all of the action moves to
the 95 to 100 percent space, and so initially
Deep Mind came up with a relation network model
that got to 95.4 percent, then Justin Johnson did
a new generation of neural module networks that
got to almost 97 percent. Montreal with the film network
got to 97.7 percent and so relation networks and
FiLM are essentially both large CNN stacks interleaved
with specialized layers. So relation nets had a relation
net layer where every pair of pixels has relationships assessed between it to understand
binary relations. FiLM inserts these conditional
linear normalization layers that tilt the activations
based on the question. It’s harder to get an intuitive
sense of what that model does. But anyway, our MAC Network
works super well. It gets 98.9 percent, which is essentially half
the remaining error. But you might be starting to
wonder how meaningful this is when things are getting so
close to a 100 percent. So something that I think is
interesting is the following. That if we look at
learning curves based on how much training data you
give to the models and if you instead of giving
the model 700,000 examples you only give them
70,000 training examples, then what you do is find
that other models such as the film model or the neural module network model
of Justin Johnson, they can’t actually learned very
much from 70,000 examples at all. Even though 70,000 is a lot, this isn’t a very big space. You’d think you should have
to learn something from your 70 or 100 examples. They don’t get that far above baseline performance in
most common classes, whereas our MacNet model
is already getting 86 percent accuracy
over 70,000 examples. So I think that’s actually an
interesting proof that the design of the model has the right priors to be solving these multi-step
inference problems. The subsequent work. Justin Johnson also built
a CLEVR-Humans dataset, where he collected 18,000 real
questions from humans through crowd sourcing where
they’re told to write questions hard for
a smart robot to answer. So these questions have
a more diverse vocabulary, different reasonings, skills which might not be
in the original dataset, and there’s a small training
set for fine tuning. So if you run with
the original data zero-shot on this CLEVR-Humans dataset, you get these results, where none of
the systems work great, but MacNet does do
a little bit better. If you fine tune on the training
set you get these results. Where again, the MacNet
is able to get more value from the fine tuning set than any of the other models again, I think reflecting its ability
to generalize well from small amounts of data and starts
to get quite good performance. Here’s a couple of examples that show how we get
the benefits of having interpretable reasoning from having the attention over
both the sentence and our image. So what color is the matte
thing to the right of the sphere in front
of the teeny blue block? So this is a very short example where there are
only three reasoning steps. So in the first reasoning step it focuses on teeny blue block
in the language, and looks at the teeny blue block. In this second reasoning step
it’s the sphere in front and it’s then focusing in
the image on the sphere in front. Then the third reasoning step
is color is the matte thing and it’s going to the matte thing and asking its color and
correctly answers purple. You get better results for
longer sequence of reasoning. So here’s it using
six reasoning steps on this example. How many objects are either
small objects behind this tiny metal cylinder or metallic cubes and front of
the large green metal object? Often what you see here
is in the first couple of timesteps in more sort of looks at the macro structure of
what it’s being asked before it starts to focus visually. So initially it’s
looking at the how many, the question type, and
well either small, I’m not sure what it’s doing there. But then the second step
is realizing it’s a disjunction between two things and then in the third time step at starts focusing on particular things. So large, green, metal. It’s looking at the
large, green, metal, metallic cubes and it’s
looking at the metallic cubes. Teeny metal cylinder
and it’s looking at that small objects behind things. At any rate, it’s not quite
clear how it does the counting but it does in this example
correctly have the answer of four. So I guess it’s managed to somehow get that out of
the what’s going on. So that was an initial neural
compositional reasoning engine. So I hope I’ve shown that it seems like we’ve gotten
some good value from having a constrained sequence model with good priors separated out control, and memory and exploiting attention. But let me move on and
get to the newest stuff. So one problem is that although there were
these early reasoning datasets including CLEVR, they seem limited. There are artificial images
and or language. There’s a very small space of
possible objects and attributes. Although in theory you’re
doing compositional reasoning, the feeling is that
because the amount of data is so large and the space is so small that the suspicion is that these models actually
just memorize molecules, so they just learn what a red, metal, sphere is and therefore they have a lot less compositionality in
them than you’d hope they have. On the other hand, the main visual question answering benchmarks have also
seem somewhat problematic. So they’re a strong what are often
referred to as language biases, but I think it’s mainly actually
more real-world biases and model’s guess based on priors. So snow covers the ground, grass is green, and things like that. There’s also visceral biases to overly salient objects and it’s hard to tell when
systems are going wrong, what exactly is causing it and really the questions
are too simple. The questions are often simply, what color is the grass? So little reasoning or
compositionality is required. So we’ve worked to produce
this new dataset for compositional question-answering
over real-world image called, GQA, and in some sense this is
CLEVR done on a larger scale. So we start with
real images which come from the same MS COCO sources
that everybody uses in the vision world and then generate questions involving compositional
questions a bit like CLEVR, and so we generate
10 million compositional questions overall and then generate a balanced smaller set of questions with closely controlled
answer distributions. The way we do that
is that we make use of and provide a scene graph
with each image that represents its semantics and then the questions
like in Clever come with a functional program grounded in the scene graph that
shows that semantics. Perhaps surprisingly the questions we build are generated using a traditional rule-based
question engine. But this means that it’s just the precise semantics that a rule-based system can
give you that we’re just sort of turning the scene
graph representation in a very controlled manner into
natural language questions. So it’s still a controlled language, but we try to give much
more in the way of linguistic diversity
and a large vocabulary. Since we totally can understand
the scenes and we can have metrics that can assess the consistency of models answers
and various other metrics. So in slightly more detail
are starting off point is the Visual Genome dataset that my Stanford colleagues and Ranjay
Krishna and Michael Bernstein, Fei-Fei Li and others developed, where you’re staying off
with the MS COCO pictures. And then they’re putting over this
a scene graph representations shown at rise where there
are identified objects, which are the pink things
like helmet, watch, and man. These objects can have
attributes silver helmet, black watch, and then they can be connected together
by relationship. So the man is wearing a helmet or the cow is kneeling on the grass. We produce the improved
visual genome. So rather than just bounding boxes, we made use of the last few years and Computer Vision Technology
and now have massive images. But more importantly,
the original scene graphs are completely unrestrained in their natural language
labeling of things. So we move to a clear ontology of
concepts by resolving synonyms, discarding some very
unclear or rare things. So our ontology has 1,700
objects, 600 attributes, and 330 relations which are grouped into 60 categories and subcategories. We also augment the graphs with
some additional information, we put in positional
relation information, some comparative information
of same or different color, and some other global
information that seems useful but tends not to be
in the scene graphs; things about the weather
and things like that. So then, once we have
the scene graphs, we generate using our rule-based engine
Natural Language questions. This is a conventional rule-based Natural
Language generation system with probabilistic rules
that we can have common and uncommon patterns with the standard context-free
style grammar that can build up descriptions. There’s then quite a lot of
work behind the scenes which is then trying to make
these questions be good questions. So we want to make sure
they’re answerable, they’re uniquely answerable, and that they seem reasonably natural. So that they’re are actually then a lot of large, somewhat induced, somewhat hand-built
lexica that underlie what words and
natural words to use to describe various kinds of
relationships that might be more abstractly expressed
in the scene graph.
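As a toy sketch of the rule-based generation idea (the real GQA engine uses a far richer grammar, lexica, and answer balancing; every name below is invented for illustration):

```python
import random

# Toy illustration of rule-based question generation from a scene graph.
scene_graph = {
    "objects": {
        "o1": {"name": "man",    "attributes": ["tall"]},
        "o2": {"name": "helmet", "attributes": ["silver"]},
    },
    "relations": [("o1", "wearing", "o2")],
}

templates = [
    # Surface patterns that could be chosen with different probabilities.
    ("Is the {subj} {rel} a {obj}?",          "yes"),
    ("What is the {subj} {rel}?",             "{obj}"),
    ("Is there a {attr} {obj} in the image?", "yes"),
]

def generate(graph):
    subj_id, rel, obj_id = random.choice(graph["relations"])
    subj, obj = graph["objects"][subj_id], graph["objects"][obj_id]
    pattern, answer = random.choice(templates)
    slots = {"subj": subj["name"], "rel": rel,
             "obj": obj["name"], "attr": obj["attributes"][0]}
    return pattern.format(**slots), answer.format(**slots)

print(generate(scene_graph))   # e.g. ("What is the man wearing?", "helmet")
```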
This gives you an idea of the differences between VQA and GQA. So visual question answering
has crowdsourced questions and crowdsourced answers in
unrestricted Natural Language. So the VQA questions are
real human questions whereas GQA, the questions are more artificial, because we are generating
them from our grammar. On the other hand, the GQA questions, we believe, are a better test bed for exploring scene
understanding and reasoning. Because the VQA questions
are sort of random. Does this man need a haircut? What is different about
the man’s shirt that shows this is for a special occasion? Rely on a lot of world knowledge that you can’t really answer things from the question. Our questions are
more straightforward, scene-understanding
questions which are perhaps more artificial sounding. Is their neck tie in
the picture that’s not red? But at least with thinking that
this is probably a better testbed for exploring
visual understanding and compositional reasoning
than VQA has provided. It isn’t that one is better; this is better for a certain purpose
kind of an answer. Here are the baseline accuracies that we got with this new dataset. So global prior is if
you’re just saying, if you’re asking a color question, you give the most
common color answer, you get whatever about 17 percent. This is a vision only model. So there’s been a tradition
VQA of seeing how far you can get with a vision only model
or a language only model. The local prior is prior
for a particular object. So this is, if you’re asking, what color is the apple, you do rather better
than the global prior. The LSTM is the language only model, and this is the result
that’s always been shown in current VQA systems
that you can do not terribly by ignoring
the picture altogether and just answering based on
the language that you’re given. LSTM+CNN is sort of
the standard baseline VQA system. Bottom-up is the recent
winning VQA systems of Peter Anderson and
colleagues that use bottom-up and top-down attention. If you look really closely, you’ll notice that our MAC model is better than any of
these other models, but somewhat disappointingly, the MAC model doesn’t actually
work anywhere near as well on this dataset, which sort of proves the point that more abstract reasoning is needed, and it’s way below
humans that are coming in at about 98 percent accuracy. So in the last bit of the talk, I then want to tell you a bit
about a new model that we’ve been exploring that tries to do
somewhat better than this. So the thing that we’ve been
exploring this new model is, so for Visual Question
Answering, well, question and answers are
clearly about language. But maybe we could actually make more progress on visual question
answering rather than having these systems
which have vision on one side and language
on the other side. If we instead represented
what we’re doing for visual question answering is doing everything in terms of
an internal language of thought. So what we’d like to do is have a common conceptual representation which we can use to represent both language expressions
and visual scenes in, and then we’ll be able to reason using that and we’ll
be able to do better. So I am no expert on
human visual perception, but nevertheless my
superficial understanding of human visual perception is the idea that brains have a photo in their brain
that they’re using, isn’t supported by most of what’s in human visual perception, instead, that eyes are making
momentary fixations, and what they’re getting
out of each fixation is some very high level
scene gist, right? So you do a fixation on
the man and you see there’s a man looks like
a cyclist with glasses, helmet, gloves, watch, and
you’re aware of the fact there’s a grassland scene with
a cow somewhere off to the right. But you don’t actually get a lot of detail out of that and
then you’ll of succumb, make another fixation and
maybe you look at the cow, and then you’ll notice it has horns and a bell on the front
or something like that. But it’s really that they’re
fairly abstracted scene gist is what’s actually in the brain as
you do this visual processing. So in the same way, our hope is that we
can use concepts to organize our visual
sensory experience, and we can build from
those an abstraction, a world model to represent
what we’re seeing in our environment and
our world models will essentially be the scene graphs
that I just showed you. We’ll be able to use
those to draw inferences. So this fits in with
the general deep learning story. So the hope of deep learning
models is that we could use the depth to learn
higher level abstractions. The reason why that
should be useful is that for surface signals
like visual signals, there’s all sorts of
complications and variations. But if we can build a
higher level abstractions, we should be able to disentangle our representations and that will allow us to improve generalization. So in particular, the way we
want to do that is by building a model that does content-based
attention over concepts. So attention is
our central operation, again, that we’re using here. So we’re going to have a large set of concepts and we’re going to put
attention over a few concepts. Crucially what we’re going to
explore here is previously, and in most other vision systems, you’re putting
attention directly over the image, over pixel space. But here it’s arguing, maybe if instead we can put
our visual attention over concept space we’ll
be able to do rather better where concept space
is our language of thought. So this is related to what Yoshua Bengio has proposed as
his so-called Consciousness Prior. I’m not sure I like that name, but I think the idea
is basically right. So the general research program
is to say that we should try and learn deep representations that disentangle abstract
explanatory factors, and then his suggestion is that conscious state has a very low
dimensional vector which is an attention mechanism over
the disentangle deep representation. The way we’re doing
things is somewhat different from what he proposed. But in some sense,
it’s the same ideas of disentangled concept space
putting attention over it. So this is a model of
the neural state machine. So we have a vocabulary of embedded concepts which are
our atomic semantic units, and that is that cleaned
up visual genome ontology that I told you about earlier. Then, but this time, we’re going to translate
both questions and images into these concepts so that they
both speak the same language. So everything that’s going
to be represented as attention over concept vocabulary. So we hope that this
will give us a model of concept learning and use somewhat
similar to what humans might do. So a Neural State Machine is a differentiable graph-based model which simulates a state machine. So in some sense, we’re hoping
to see it combining some of the strengths of both neural and
traditional symbolic approaches. So this is what we do.
We’ve got two stages, construction and inference. So in the construction stage, we’re going to take an image and
turn it into a scene graph. So the way we’re doing this is following other work that’s
done image-to-scene-graph generation, and again including work by
Ranjay Krishna and others. At Stanford, we’re using similar methods to generate
scene graphs from images. We do object recognition and
then have further components that infer relations by attributes of objects and
relations between objects. Simultaneously, we run
an encoder-decoder style model over the natural language question
and turn into a sequence of instructions which are
attention over concepts, now abstracted concepts space. Then we’re going to do
inference which is going to be a state machine style computation over our graph to compute an answer. So formally, we have
something that looks like a finite state machine
with states and edges. Then there’s a sequence
of instructions that are going to be like an input, but we’re going to do things
in probabilistic manner. So we start with the probability
distribution over initial states. Then we have a neural
state-transition function which computes probabilities of next states based on the current state distribution
and what the instruction is. So this is it all pictorially.
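As a minimal toy sketch of that probabilistic traversal (not the actual neural state machine implementation; it ignores attributes and answer prediction, and all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy graph: 4 objects (states) with concept embeddings, and a relation
# embedding on each directed edge.
d, n_states = 8, 4
edge_emb = rng.normal(size=(n_states, n_states, d))   # per-edge relation features

def transition(p_state, instruction):
    """One soft state-machine step: redistribute probability mass along edges
    whose relation embedding matches the current instruction."""
    edge_scores = edge_emb @ instruction               # (n_states, n_states)
    edge_probs = np.apply_along_axis(softmax, 1, edge_scores)
    return p_state @ edge_probs                        # new distribution over states

p = np.full(n_states, 1.0 / n_states)                  # start: uniform over objects
instructions = rng.normal(size=(3, d))                 # e.g. "coffee maker", "right", "bowl"
for ins in instructions:
    p = transition(p, ins)                             # soft traversal of the scene graph
# p now concentrates on the objects the question is asking about.
```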
So from an image, we construct a scene graph, which happens to also look like a state machine, where the states correspond to objects and the
transitions correspond to relations and the states
have these properties, and all of this is representing the soft way as attention over
concepts in our concept ontology. So effectively, these attentions
over concepts give us a disentangled
representation in terms of concepts and their properties
and relations between them. So it’s all factorized using
our concept vocabulary. The question is then also translated into a series
of instructions. So each instruction is again an attention distribution
over concepts. But if we just go with
the argmax of it, what is the red fruit
inside of the bowl to the right of the coffee maker
gets interpreted to instruction sequence of
coffee maker right bowl inside red? So then we can run these instructions on our state machine and say well, start with the coffee maker, you want to look to the right, there should be a bowl there. You want to look inside that, and there’s something red, and the network will then be
able to say “Yeah that’s an apple” and will then be able to
answer the question correctly. Here’s one more example of that. So what is the tall object
the left of the bed made of? So the model will form instruction
sequence bed left tall made. So it’ll first of all
be looking at the bed, looking to the left of the bed, confirming there’s
something tall there, then asking what is made of
and will turn the answer red. But I’m just doing this as
one-hot language in describing it. Really at each stage, this is done as
soft attention distributions and at all times there’s a soft activation level over
the entire of the scene graph. Now, so here’s our new model. So this is over GQA again. So these were all the models I
already talked about and MAC. So we are actually getting
a nice lift there from the neural state
machine model which is now working quite a bit better. Find point, you’ll notice max doing a bit better
than before as well. Yeah it turns out if you tune
these models for longer, you will work out ways to
even do it a bit better. But nevertheless, their neural state machine is
doing well better than that. I hope that that’s showing that having reduced things both to the languages concepts
gives you some power. Here a couple more demonstrations. So there’s another interesting
dataset that was done by Aishwarya Agrawal and colleagues from Georgia Tech, where
they wanted to give a better test of visual question answering
than standard VQA datasets. So they produced
this VQA-CP dataset where deliberately the distribution was changed between the training
set and the test set. So for example, in the training set for
what sport is that questions the two most common answers
with tennis and baseball that they changed the
distributions on the test set. The most common sports
showing the skiing, then baseball soccer,
skateboarding, etc. So you have a different distribution. So if you’ve actually an
understanding concepts, you should be able to still
get the answer right but if you’re just guessing
based on world priors, you’ll get the answers badly wrong. So what they showed was most
current models, you can see here. Here are the neural module networks that we talked about
before and again, and various other models
at that point. It seemed like well no actually they weren’t understanding concepts. So many of these models their
performance essentially halved when you went
to the VQA-CP dataset. So stacked attention networks
perform terribly on this dataset. So Agrawal et al. produced their own model that worked better. Then the interesting thing was it was
actually also a model that mapped both images and language to concepts. So they had that idea as well. Then here are all the 2019 models on archive which are doing rather
better than their model. But again, with the neural
state machine model, I think we’re getting
additional power by having the state machine that we
can simulate reasoning on. I have one more example of that, but time is going by. So maybe I should skip that.>>You have five minutes.>>I don’t have many slides left.>>Go forward.>>So I could do this
example then. Okay, I will. So we could do the same kind of
thing with GQA that since we do have scene graphs and functional
programs behind all of GQA, we can also do the same trick that they did
of changing distributions. So we can have distinctions in structure
between training and testing. So we can have the
training only using some linguistic constructs
from a notion of covering or what
something is made of, and at testing doing
different kinds of questions, or we can have different content. So at training, we can not have any questions that refer to
types of foods or animals. At testing time, we do have
questions of that sort. The fine point here is, I mean, it’s not that thing we don’t have completely novel objects but
it’s not that we’re asking at test time what’s this animal
and as a red panda and I never saw a red panda
at training time. So for objects in isolation,
detector which is seen all of the objects that map onto our concept vocabulary
in the training data, but that’s only
learning in isolation. So this is saying “Can it compose together
that knowledge to be able to answer new kinds of questions that it hadn’t been
asked about at training time?” So here are our results. Again, the MAC Network, we’ll argue, is a bit along the right lines, but the neural state
machine is much more effectively being
able to generalize to these new types of
questions where some of the more standard VQA models, yeah I thought they don’t
work at all but perform fairly poorly at generalizing to
new types of things like this. So in conclusion, yes. So VQA as a problem has always been considered
to be about language, but I think it might be
interesting to explore whether you can think in terms
of a conceptual space or a language of thought where you
use a common representation for different modalities for reasoning about and that might actually
give additional power, and I’ve outlined
a model that did that. I mean, in general, I think there’s
this exciting question of, can we take our neural
network models that have been so successful for
sensory perception tasks, and work out whether we can also
use them for thinking slow tasks involving understanding and
multi-step compositional reasoning? So the overall goal is, can we build neural
networks that think? I think there’s actually a reasonable hope that we
might be able to do that. I’ve suggested that
at least one promising direction has been to do this by trying to build models which use attention over abstracted
disentangled concepts, and then do multi-step
reasoning by having an iterative attention process over different time steps. Thank you.>>Thanks, Chris, for a fabulous
and thought-provoking talk. Why don’t we spend the next 10 or
15 minutes answering questions, and then give you a little bit of a break before
you head off to your next.>>Sure.>>You can pick.>>Okay.>>I had like in all these
compositional reasoning data set, one thing that’s clearly
missing is hierarchy. So initially, the picture of
all these people like tall woman, short boy, and man. First that come out to me was, “Wait, it’s a family,” but family was
never mentioned in the scene graph. So that hierarchy of
composition beneath four people in the next step,
actually constituted family. So do you think we should also be pursuing this certain ideologies?>>Yes. So I think that’s
a completely valid observation.>>Can you repeat the question?>>Sorry. The question was these models are doing
some multi-step reasoning. Yes. That they’re not representing hierarchy or perhaps not even really representing
compositionality, and it seems like they should. I mean, that actually goes back to this early picture I
showed right there. But I think there’s no doubt
at all that there’s lots of need to represent composition
hierarchy because a lot of the time, yes, we’re thinking
in terms of these hierarchical models where we’re seeing bigger composite wholes, and I agree we haven’t
really been addressing that, but I think we should. We did in this old work in 2011, and I’ll get back to one of
these days. Right there.>>Yeah. So it’s a question
about the testing methodology. So at the end, you should have the distribution on the training
of the testing was different, you get worse result. Is there any effort to
systematically test every possible object in the scene, or it seems like you could be, since you’re generating these queries and the scenes automatically, you could just
systematically generate all the possible inputs, right?>>So yeah. I mean, that’s not something I
talked about, but I mean, that’s actually one of
the things that we aim. So square was down here, not for explained very
well on that last point. I mean, that was something
that we saw as one of the big advantages of having
the controlled questions of GQA that we can generate lots of questions which are effectively answering the same or related things. So we can answer, what color is the go-kart? Is the go-kart red? Is the go-kart green? We could assess automatically whether a system is giving
consistent answers, and it turns out that
a lot of the time, GQA systems don’t give
consistent answers, and we could do that for
relational ones as well. We could say, what is to
the left of the go-kart? Maybe there’s a person
then we can say, what are some of them ride of the person and ask
backwards and so on. So better assess actual understanding
rather than random guessing. So we actually have
some metrics that we defined that look at consistency of answering and have gotten
some results on that.>>Thanks.>>Yeah.>>So I have a question about
the neural state machine, and currently I think
you have 117 object, and that the object recognition on certain natural
recovery objectives. I think performance
should be very bad. So I think the accuracy of
object detection is even lower than the accuracy
of the VQ times the final VQA answer accuracy. So how can you really get such a high accuracy to
answer the questions, but based on a very poor performance
on object detection?>>So I think maybe there
are two answers to that. I mean, I think
one answer to that is, it turns out that is
a distribution of data question. It turns out that a lot
of the stuff that’s in the scene graphs that are ultimately come from Visual Genome
though we’ve normalized them, is fairly simple stuff, right? That there’s a lot of
things that’s talking about people and fruits
and cars and airplanes. So even though we do some things
to rebalance the data, that is core concepts,
basic categories, if you will in
psychological terms that are used a lot and a much fewer
in number, so that helps. I think the other part of the answer is we don’t have to get
the answer exactly right because each point we’re putting an attention distribution
across the space of concepts. So to the extent that
the model roughly thinks that, “I was probably one of these four
things and I don’t know which.” The model can place
non-trivial attention over all of those four things and
reason forward from there, even though, if it was just
guessing at random between them, then it only be getting
25 percent accuracy.>>We will have
some component odds because some run on object detection, and then the reasoning model
for this run scene, so do something in random. In fact, final answer is correct. Just like routing, routing
and then get it right, so we do not learn anything, we do not learn that use
for with an index.>>So there’s can be a chance
of getting things wrong, but there’s also a chance that the multi-step
reasoning could actually be more human-like because if you say a go-kart to
the right of a person, well even if you’re go-kart
detector isn’t that good, if you know it’s to
the right of a person, well, you can say, “Oh, that thing over
there must be a go-kart,” right? So the actual relational reasoning can actually help you do better. The one person behind there.>>So it seems that there are two main approaches to
mark such reasoning. One is explicit modular instructions versus how many universal
continuous actions, and it seems that you have tried both of them involved in
two of your works. One is the MAC cell which
is a universal cell, as opposed to your recent work, which is more like
a discrete instructions. So from your experience, which one of these
approaches you would think would be more
fruitful to pursue?>>Yeah. So I guess I’m
right at that moment, I’ll say the neural
state machine because look it seems to be working there. But I think, there’s clearly
a wide open space of approaches, then it’s not really clear
that I or anyone know the best approach, but you’re right. So in some sense, the neural state machine
approach is more specific with these different things compared to the MAC architecture. But it’s still not as specific and custom as something like
the neural modular networks, right? That it’s still being
done as there aren’t special units to do anything like spatial reasoning or counting
or anything like that, right? That it’s still doing a more general
attention-based computation. So in some sense, I still see it as similar to
the MAC family. I saw him, yeah.>>I have two questions. One is, the results that you present, you’re not based on the
ground two [inaudible] , so do you guys get the results
for ground two [inaudible] , no object detected,
no relation detected, just the ground two [inaudible] , and use your architecture
on the top of that. So the question is, you’re using a vision system and there’s an answer that you should be able to get where you just say, here’s the gold scene graph
for this scene, you learn it on that, and then really you’re back in that state then you
should be able to get a 100 percent of the question
and the gold scene graph. No, I don’t have those answers, and you know what, that would be
a interesting thing to do. I mean, I think clearly it’ll
be much higher number, right? Because here there’s two parts of
this question: one of which is what is the difficulty of the vision
problem of recognizing things, and the other one is how successfully you actually reason across
these scene graphs?>>I’m going to rule out questions
from the NLP team at MSR AI and the Deep Learning team at MSR AI, because you’re going to meet with Chris over the next few hours. So any non-MSR AI [inaudible]. Lots of them raising hands.>>I have a more high-level question. So if we combine vision and NLP, it’s VQA or TQA, and my background is ASR, so my question is really how can ASR be combined with them, or do you think one day we will be working on something like combining all the three [inaudible]: ASR, CV, and NLP?>>So, yeah, I haven’t done that. So I mean, I have nothing against
speech recognition, of course. In some sense, speech recognition hasn’t seemed the right place to try to do this high-level reasoning. I mean, on the one hand, there’s clearly a connection, right? When humans are doing
speech recognition, they have a much
higher level understanding of where they are and
what’s being talked about, and they use that to help with
their speech recognition, and you should be
able to do the same. So yes, we should be
able to combine all of these modalities together
and do much more. I mean, in practice that’s tended not to be what
happens in speech systems, and the speech is recognized to the word level before
anything else happens, and to the extent that that’s true. If you’re more interested on
higher level of reasoning, incorporating in speech hasn’t seemed the easy way to
make the problem more interesting where
doing multi-modal work with vision has seen
much more approachable. Anyone on this side
have a question? Yes.>>So you talked about how the CLEVR dataset was basically a very large amount of data, but a very small space, and you basically made a bigger version of that in the form of GQA. So what’s the intuition on how complex you want to make these datasets so that these high-capacity models don’t actually memorize them versus actually learn to reason on them?>>Yeah, I mean, I can say my thoughts, but they’re the usual half-informed thoughts. Yes, so I mean, I guess in research somehow you want to find the right sweet spot, where things aren’t too difficult but it presents the right kind of challenges and questions for what you want to pursue. I guess my feeling coming into this was that if you compared textual question answering versus
visual question answering, that even though there’d been progress every year on visual question answering, there’s a way in which the problem seemed too hard and too diverse, whereas for textual question answering, there’s been a really good ramp; in fact, in some sense it might seem like it’s almost too easy, with all of these results on textual question answering claiming better-than-human results on SQuAD
and other datasets. For visual question answering it seemed like
the space of questions, and knowledge, and so on was too variegated and hard for
the current systems. But on the other hand, the blocks-world systems seemed like they were too small and too easy, and so we were looking for something that was in the middle, which presented reasoning challenges over real vision problems but was still more constrained than just having Turkers come up with some question to ask. At that point, it’s effectively a bet that this might be an interesting level of difficulty for people to explore, for working on scene understanding, and it can certainly be criticized because our questions are still artificial when it comes down to it. We hope that they’re from a broad enough space that it’s actually an interesting, good challenge for developing visual scene
understanding, reasoning systems for a few years.>>One last question.>>Okay. [inaudible]. Yeah.>>So in your opinion, what do you think is the role of logical inferences in
these recent exercises? At least it seems to me that
it might make sense for these models to know that if
something is black, it cannot be red. If A is to the left of B and
C is to the left of whatever, [inaudible] and so on, do you think there’s a role for
logical inferences in this?>>So certainly, there
is a role for having more understanding of domains, and being able to make use
of that understanding. So I mean, there’s
perhaps slightly more of that than I fully made clear, right? So by construction, when I said there was a space of concepts that we put into a taxonomy, we actually are exploiting that taxonomy, and this is how we have a somewhat hand-built, disentangled representation. So we have taxonomic concepts like colors, and then for the property of color, you’ll have an attention distribution over the different colors. So in a soft way, it is representing that the choices of color are complementary to each other. So we are making some use of that kind of information, but there’s certainly a lot of other reasoning information, like “to the left of” and “to the right of” being opposites, that we aren’t capturing. So yeah, I think there is absolutely a role for doing more inference of a logical character in future models.>>Okay. With that, let’s thank
Chris again for his [inaudible]
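
A small sketch of the soft, property-specific attention over a disentangled concept vocabulary described in that final answer, assuming a toy taxonomy; the property names, concepts, and logits here are illustrative and not taken from the actual model.

```python
import numpy as np

# Toy disentangled taxonomy: each property has its own small concept vocabulary.
# (Illustrative only; the real vocabulary in the work discussed is much larger.)
taxonomy = {
    "color": ["red", "blue", "green", "black"],
    "material": ["metal", "plastic", "wood"],
}

def soft_attention(logits):
    """Softmax over one property's concepts: a soft, differentiable choice."""
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

# Scores for an object's color, e.g. from comparing its representation
# against each color concept embedding (numbers made up for illustration).
color_logits = np.array([0.2, 2.1, 0.3, -0.5])
color_attention = soft_attention(color_logits)
print({c: round(float(p), 3) for c, p in zip(taxonomy["color"], color_attention)})

# Because the distribution over colors sums to 1, strong evidence for "blue"
# automatically drives down "black": a soft version of "if something is blue,
# it cannot be black". Relational facts, such as "left of" and "right of" being
# opposites, are not captured by this mechanism and would need extra machinery.
```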

17 Replies to “Building Neural Network Models That Can Reason”

  1. At last somebody tries to figure out how to build an NN that implements "reasoning"

    Seeing this videolog I should say they are still clueless but in a better way than before 😕

    https://youtu.be/-2JRiv3Mycs?t=180
    All of it is of course a description of several hybrid methods (joining visual processing with "so called natural" language processing).

    In fact there is not even the slightest hint in this video that they realize they are not working on reasoning.

    At this moment in time the general consensus among human researchers seems to be that reasoning is the ability to answer single-sentence questions with a single answer.

    And they are proud the system (e.g. CLEVR) already learns something with only 700000 examples (what the F*CK 700K examples ?!?!?!? )

    I rather think that anything capable of demonstrating reasoning can draw multiple conclusions from 1 example ( ONE, single exposure learning )

  2. The only part where this is getting to a level of grasping is at https://youtu.be/-2JRiv3Mycs?t=3755

    timestamp 1:02:40 -> "doing multi-step reasoning by having an iterative attention process over different time steps.".

    But that should have been the first sentence of the talk, with all the rest based on it, instead of having it as a final side thought like "oh yeah, this could be it".

    If you know that, why waste so much effort on dumb statistical calculus (aka dynamic programming, 1950-2020)?

  3. You might like these time frames (pause playback at the illustration there):

    – 9:26 (pure NLP)

    – 10:37 (saliency detection -> conditional learning / learn by exception)

    – 13:04

    – 28:47 (MAC logic)

    – 41:30 (hybrid VP/NLP composition tree allowing feeding visual info into NLP processing STM)

    – 48:10

    – 53:49

    – 54:34

    – 59:37

    – 1:02:40 (a moment of insight) "use attention over abstracted disentangled concepts and then do multi-step reasoning by having an iterative attention process over different time steps."

    – 1:04:38

  4. We can track mycelial networks, neurobiopathy, and the physics of it should model an easy replicable blueprint for you to encapsulate in code. It should bring mathematical light and 3-Dimensional modelling to morphic resonance.

  5. To evolve reasoning AI, I guess you must be able to measure how well it is reasoning. Maybe then it's easier to reason about moving things in space, building a stack of boxes, or building a bridge, even just a plate a car can move over to get to the other side of a river, because it's measurable.

  6. I wonder why it's called a neural network when it has very little to do with nerve cells or their organisation. Actually, not much has happened since the perceptron was invented (in the 1940s or so), except that we have got better computers. The structure of the neocortex is obviously very different, and there is a multitude of different neural cell types.

  7. So basically, if I understand this CLEVR work, one builds a classifier and runs a scene through it, which builds a database in which every object, object characteristic, and relationship to other objects is classified. Then the question is translated via an end-to-end differentiable NN into some query language? So the classifier would be considered a structural prior, as would anything like convolutional layers. Finally, another translation net would take the query-language answer and turn it into whatever the user's language is.

  8. Is there any advantage to using a tree structure rather than a table or tables? Is it simply that there is hierarchy in a tree structure? Conversely, is there any advantage to a table? My initial thought is that either would work, but there would be more overhead with a table. There would be better middleware tools, however.

  9. For MAC nets, does the entire database exist as part of the net? I assume that in order for a query to be run, this would be the case, and RU would actually be a weight on one "connection" from a database entry to the RU cell within the MAC node. Is this the case? Or is there a function that's called to access an "external" database? I'm getting the impression that the MAC NET is a "one size fits all" NN, but it is called iteratively with different control values thus storing results. Rinse and repeat. And would the key/value pairs making up control unit options act as "function loaders?"

  10. A machine will never reason for a very simple reason: to reason is a spiritual operation of a spiritual being.
    The materialism has now spread to such a point that the people that present themselves as the brightest minds of our societies now ignore elementary truths that small children knew not long ago.

  11. This guy is so autistic, he makes Rain Man look normal. He speaks monotonically like a machine, and never takes a break in between words and sentences.

  12. This is frustrating. I'm always amazed to watch deep learning experts compare neural nets to the brain while ignoring the gigantic pink elephant in the room. Do you people realize that, unlike the brain, a deep neural net is blind to things it has not seen before? Do you realize that a deep neural net is essentially an old-fashioned, rule-based expert system (IF pattern A THEN label X) and just as brittle? If it encounters a situation for which there is no rule, it fails catastrophically. This is a fatal and fundamental flaw of deep learning. It is the reason that self-driving car companies drive their cars millions of miles in the vain hope of capturing all possible situations. How silly is that? Why then do you keep insisting that deep learning has a role to play in general intelligence? Please stop the hype and the pseudoscience before you get crushed by the pink elephant in the room.
