Speaker: 서민준 (Minjoon Seo), Ph.D. student, University of Washington
Date: June 2017
“Reasoning” can be defined as an ability to derive a new formula (statement) through logical operations on several known formulas, which is a crucial requirement for human-level artificial intelligence.
In this talk, I will first argue that natural language, and in particular question answering, is an effective means of teaching machines to reason and of evaluating their performance. Then I will discuss my past approaches to these objectives with regard to three criteria: learning end-to-end, handling syntactic variations, and performing complex inference.
I will especially highlight my observations on the strengths and limitations of fully differentiable architectures, such as deep neural networks, for modeling the discrete nature of logic. I will conclude with the future direction of my research towards designing the ultimate reasoning machine.
Talk video:
http://tv.naver.com/v/2083653
https://youtu.be/r0veZ_WV0sA
4. Examples
• Most neural machine translation systems (Bahdanau et al., 2014)
• Need a very high hidden state size (~1000)
• No need to query the database (context) → very fast
• Most dependency and constituency parsers (Chen et al., 2014; Klein et al., 2003)
• Sentiment classification (Socher et al., 2013)
• Classifying whether a sentence is positive or negative
• Most neural image classification systems
23. Geometry QA
In the diagram at the right, circle O has a radius of 5, and CE = 2. Diameter AC is perpendicular to chord BD. What is the length of BD?
a) 2   b) 4   c) 6   d) 8   e) 10
[Figure: circle with center O and radius 5; diameter AC is perpendicular to chord BD at point E, with CE = 2.]
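A short worked solution (not on the original slide), using O, B, C, D, E as labeled in the figure and the fact that a diameter perpendicular to a chord bisects the chord:

```latex
% Worked solution sketch; assumes E is the intersection of AC and BD, as drawn.
\begin{align*}
  OE &= OC - CE = 5 - 2 = 3, \\
  BE &= \sqrt{OB^{2} - OE^{2}} = \sqrt{5^{2} - 3^{2}} = 4, \\
  BD &= 2 \cdot BE = 8 \quad \text{(choice d).}
\end{align*}
```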
53. Attention Layer
[Figure: model architecture diagram. From bottom to top: Character Embed Layer (Char-CNN) and Word Embed Layer (GLOVE) over context words x1 … xT and query words q1 … qJ; Phrase Embed Layer (LSTM) producing h1 … hT and u1 … uJ; Attention Flow Layer with Query2Context and Context2Query attention (Softmax over similarities for Context2Query, Max + Softmax for Query2Context), producing g1 … gT; Modeling Layer (LSTM) producing m1 … mT; Output Layer predicting Start (Dense + Softmax) and End (LSTM + Softmax).]
54. Attention Layer
• Inputs: phrase-aware context and query words
• Outputs: query-aware representations of the context words
• Context-to-query attention: for each (phrase-aware) context word, choose the most relevant word among the (phrase-aware) query words
• Query-to-context attention: choose the context word that is most relevant to any of the query words (a code sketch of both directions follows this slide)
[Figure: architecture diagram repeated from slide 53.]
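To make the two attention directions concrete, here is a minimal NumPy sketch (not code from the talk). A plain dot product stands in for the model's trainable similarity function, and the names H, U, S, G are mine:

```python
import numpy as np

def bidaf_attention(H, U):
    """Sketch of Context2Query (C2Q) and Query2Context (Q2C) attention.

    H: (T, d) phrase-aware context vectors (h_1 ... h_T)
    U: (J, d) phrase-aware query vectors   (u_1 ... u_J)
    Returns query-aware context representations G of shape (T, 4d).
    """
    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    # Similarity of every context word to every query word.
    # (A dot product stands in for the trainable similarity.)
    S = H @ U.T                                   # (T, J)

    # C2Q: for each context word, a softmax-weighted sum of query words.
    a = softmax(S, axis=1)                        # (T, J)
    U_tilde = a @ U                               # (T, d)

    # Q2C: weight context words by their maximum similarity to any
    # query word, sum them, and tile the result over all T positions.
    b = softmax(S.max(axis=1), axis=0)            # (T,)
    h_tilde = b @ H                               # (d,)
    H_tilde = np.tile(h_tilde, (H.shape[0], 1))   # (T, d)

    # Merge into query-aware context representations.
    G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=1)
    return G                                      # (T, 4d)

# Example shapes: T=3 context words, J=2 query words, d=4.
G = bidaf_attention(np.random.randn(3, 4), np.random.randn(2, 4))
print(G.shape)   # (3, 16)
```

In the full model, G feeds the Modeling Layer, whose outputs the Output Layer uses to predict the start and end of the answer span.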
86. bAbI QA Dataset
• 20 different tasks
• 1k story-question pairs for each task (10k also available)
• Synthetically generated
• Many questions require looking at multiple sentences
• For end-to-end systems supervised by answers only
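For reference, an illustrative story-question pair in the style of bAbI task 1 (single supporting fact): numbered statements, then a question with its answer and the ID of the supporting sentence. This particular example is illustrative, not taken from the slides.

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary?    bathroom    1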
90. Dialog Datasets
• bAbI Dialog Dataset
• Synthetic
• 5 different tasks
• 1k dialogs for each task
• DSTC2* Dataset
• Real dataset
• Evaluation metric is different from the original DSTC2: response generation instead of “state-tracking”
• Each dialog is 800+ utterances
• 2407 possible responses
98. So… What should we do?
• Disclaimer: completely subjective!
• Logic (reasoning) is discrete
• Modeling logic with a differentiable model is hard
• Relaxation: either hard to optimize, or converges to a bad optimum (poor generalization)
• Estimation: low-bias or low-variance gradient estimators have been proposed (Williams, 1992; Jang et al., 2017), but the improvements are not substantial (see the sketch below)
• Big data: how much do we need? Exponentially many examples?
• Perhaps a new paradigm is needed…
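As a concrete illustration of the estimation point above, here is a minimal NumPy sketch of a Gumbel-Softmax sample (Jang et al., 2017), a relaxed, differentiable stand-in for sampling a discrete category. This is my own sketch, not code from the talk; the temperature tau controls the bias/variance trade-off.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Relaxed 'soft' one-hot sample from a categorical distribution
    (Jang et al., 2017): perturb the logits with Gumbel noise, then apply
    a temperature-controlled softmax. As tau -> 0 the output approaches a
    discrete one-hot vector (low bias, high variance); larger tau gives a
    smoother, lower-variance but more biased sample."""
    rng = rng or np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + gumbel) / tau
    y = y - y.max()                 # numerical stability
    expy = np.exp(y)
    return expy / expy.sum()

# Example: relaxed sample over four discrete choices.
logits = np.array([2.0, 0.5, 0.1, -1.0])
print(gumbel_softmax_sample(logits, tau=0.1))   # nearly one-hot
print(gumbel_softmax_sample(logits, tau=5.0))   # much smoother
```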