The proposed system addresses the generation of factoid (wh-) questions from sentences in a corpus of factual, descriptive, and unbiased text. We discuss our heuristic algorithm for sentence simplification (pre-processing); the knowledge base extracted in this step is stored in a structured format that assists further processing. We then discuss the semantic relations within a sentence, which let us construct questions following recognizable patterns among sentence entities, followed by an evaluation of the generated questions.
4. CONTD.
Google Translator, currently translating English to Bengali.
Cleverbot, a chat bot with a good sense of humor. Taken from http://www.cleverbot.com/
5. OUR OBJECTIVE
• Generate an efficient question generation system.
• Generate factoid questions from a text document or corpus.
• Generate questions from every sentence that carries some information; otherwise the sentence is discarded.
• For some sentences more than one type of factoid question is possible, so attempt to generate all such possible types.
• Take the user's opinion or feedback, and improve the results for further use.
6. FACTOID QUESTIONS?
Factoid questions: questions that demand accurate information about an entity or an event, such as person names, locations, or organizations, as opposed to definition questions, opinion questions, or complex questions such as why or how questions.
7. BASIC TERMINOLOGY
1. TOKENIZING: Breaking the string into words and punctuation marks.
e.g. - I went home last night. → ['I', 'went', 'home', 'last', 'night', '.']
2. TAGGING: Assigning parts-of-speech tags to words.
e.g. - cat → noun → NN, eat → verb → VB
3. LEMMATIZING: Finding word lemmata (e.g. - was → be).
4. CHUNKING: Grouping words that convey a single thought, then tagging those groups. These tags can be Verb Phrase, Prepositional Phrase, Noun Phrase, etc.
e.g. → Bangladesh defeated India in 2007 World Cup
8. CONTD.
5. CHUNKS: 'Bangladesh', 'defeated', 'India', 'in', '2007 World Cup'
6. RELATION FINDING: Finding the relations between the chunks, i.e. the sentence subject, object, and predicates, as:
RELATIONS:
Bangladesh defeated India in 2007 World Cup
NP-SBJ-1 VP-1 NP-OBJ-1 PP-TMP-1 NP-TMP-1
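The terminology above can be illustrated with a small plain-Python sketch (a minimal illustration only; the `tokenize` and `find_relations` helpers are hypothetical stand-ins for the actual NLTK-based pipeline):

```python
import re

def tokenize(sentence):
    # break the string into words and punctuation marks
    return re.findall(r"\w+|[^\w\s]", sentence)

def find_relations(chunked):
    # map chunk-relation tags to sentence roles (subject, predicate, object)
    roles = {}
    for chunk, tag in chunked:
        if tag.startswith("NP-SBJ"):
            roles["subject"] = chunk
        elif tag.startswith("VP"):
            roles["predicate"] = chunk
        elif tag.startswith("NP-OBJ"):
            roles["object"] = chunk
    return roles

tokens = tokenize("I went home last night.")
chunked = [("Bangladesh", "NP-SBJ-1"), ("defeated", "VP-1"),
           ("India", "NP-OBJ-1"), ("in", "PP-TMP-1"),
           ("2007 World Cup", "NP-TMP-1")]
roles = find_relations(chunked)
```

The chunk-relation pairs match the example sentence on this slide.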
10. WORKING PROCEDURE
1. Took large training sets of wh-questions.
2. Broke each sentence into chunks and parsed it.
3. Found relations.
The sentence: "Who became the 16th president of the United States of America in 1861?"
CHUNKING:
['Who', 'NP-SBJ-1']
['became', 'VP-1']
['the 16th president', 'NP-PRD-1']
['of', 'PP']
['United States', 'NP']
['of', 'PP']
['America', 'NP']
['in', 'PP']
['1861', 'NP']
11. CONTD.
Storing the tags in a list:
['NP-SBJ-1', 'VP-1', 'NP-PRD-1', 'PP', 'NP', 'PP-1', 'NP-1']
4. Determined the wh-type, observed from the head word of the question: "who".
Storing the tags with the corresponding wh-type in a list:
['Who', ['VP-1', 'NP-PRD-1', 'PP', 'NP', 'PP-1', 'NP-1']]
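A minimal sketch of this step, assuming the chunked question arrives as a list of (chunk, relation-tag) pairs (the `wh_entry` helper is a hypothetical name):

```python
def wh_entry(chunked):
    # the head word of the question determines the wh-type;
    # the remaining relation tags form the pattern stored in the rule base
    head = chunked[0][0]
    tags = [tag for _, tag in chunked[1:]]
    return [head, tags]

chunked = [("Who", "NP-SBJ-1"), ("became", "VP-1"),
           ("the 16th president", "NP-PRD-1"), ("of", "PP"),
           ("United States", "NP"), ("of", "PP-1"), ("America", "NP-1")]
entry = wh_entry(chunked)
# ['Who', ['VP-1', 'NP-PRD-1', 'PP', 'NP', 'PP-1', 'NP-1']]
```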
12. RULE BASE GENERATION
The parent tree:
This tree is fed to the system before training is done. When the system reads a question, it determines the wh-type, traverses to that specific node, and starts populating the tree.
13. POPULATING THE RULE-TREE
Travel to the specific wh-node and store the relations by populating the subsequent nodes of the tree with the chunk relations.
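One plausible way to sketch the rule-tree population is with nested dictionaries keyed by relation tags (an assumption for illustration; the actual data structure may differ):

```python
def populate(rule_tree, wh_type, relation_tags):
    # walk from the wh-node down, creating a child node for each relation tag
    node = rule_tree.setdefault(wh_type, {})
    for tag in relation_tags:
        node = node.setdefault(tag, {})
    # the tail node keeps a raw occurrence count for this pattern
    node["count"] = node.get("count", 0) + 1

rule_tree = {}
populate(rule_tree, "Who", ["VP-1", "NP-PRD-1", "PP", "NP"])
populate(rule_tree, "Who", ["VP-1", "NP-PRD-1", "PP", "NP"])
# the shared tail node now holds count = 2
```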
14. NORMALIZED COUNT
This value lets the parser know when to print a question, and when not to, while backtracking to other child nodes. It is defined as:
Normalized count = (occurrences of a tail node of a question in the training set) / (total number of questions with that particular wh-tag)
The count is attached to the tail node only.
Example:
'Who doesn't want to rule the world?'
Nodes: NP-SBJ-1 → VP-1,VBZ-VB-TO-VB → NP-OBJ-1, count = 14
Here, this question structure appears 14 times in the training set. The tail node holds the count as an integer, but when the recursive descent parser parses the question base it normalizes the value. This also lets the system offer the user the more probable question among many candidates.
15. RULE-TREE WITH NORMALIZED COUNT
While populating, the number of visits to each tail node (the node that holds the last chunk relation) is saved in that node.
A snapshot of the rule base with count values:
16. ANSWER PREPROCESSING AND QUESTION TYPE DECISION SYSTEM
• While populating the tree with manually generated questions, the NER tag of the answer to each question is stored with the corresponding wh-tag.
• Only some word(s) of the answer are stored.
Example:
"Who is the Father of the Nation? Ans: Mahatma Gandhi."
'Mahatma Gandhi' on NER tagging:
Mahatma [PERSON]
Gandhi [PERSON]
17. ANSWER BASE
When the same tag is found in the answers again and again, its count value is increased accordingly.
The answer base (snapshot): Vocabulary = 4
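The counting described above can be sketched with `collections.Counter` (a minimal illustration; the second stored answer is invented for the example):

```python
from collections import Counter

# answer base: NER-tag counts per wh-type (structure assumed for illustration)
answer_base = {}

def store_answer(wh_type, ner_tagged_answer):
    # increase the count of each NER tag seen in the answer
    counts = answer_base.setdefault(wh_type, Counter())
    for _word, tag in ner_tagged_answer:
        counts[tag] += 1

store_answer("Who", [("Mahatma", "PERSON"), ("Gandhi", "PERSON")])
store_answer("Who", [("Nelson", "PERSON"), ("Mandela", "PERSON")])  # invented
# answer_base["Who"]["PERSON"] is now 4
```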
19. PRIORITIZING THE QUESTIONS
It is possible that more than one question lies on a single path from root to leaf. The system prioritizes the questions by their count-depth product:
Priority = normalized count * depth of tail
Example:
Question | Priority
Who is Mahatma Gandhi? | (14/747) * 3 = 0.056
Who is the father of nation? | (21/747) * 5 = 0.14
So the second question is the one more likely to be asked when the given sentence is parsed, although this depends on the training set's questions.
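The count-depth product can be computed directly (a sketch; the numbers are taken from the example above):

```python
def priority(tail_count, total_wh_questions, depth):
    # count-depth product: normalized count times the depth of the tail node
    return (tail_count / total_wh_questions) * depth

p1 = priority(14, 747, 3)  # "Who is Mahatma Gandhi?"
p2 = priority(21, 747, 5)  # "Who is the father of nation?"
# p2 > p1, so the second question is preferred
```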
20. SELECTION OF QUESTION
The probability of each probable question type is calculated using the following function:
F(sentence) = Max(Probability(Words | Wh-tag))
The tag with the maximum probability is taken into consideration, and that type of question is generated.
21. EXAMPLE: TRIGGERING WH-TYPE QUESTIONS
'When' base:        Tags        Counts
2011                DATE        DATE = 1
9:30PM              TIME        TIME = 1
In 2012             IN          IN = 2
10th OCT            DATE
In summer           IN

'Who' base:         Tags        Counts
Grace Badell        PERSON      PERSON = 3
John Whiks          PERSON
General Mccllen     PERSON

'Where' base:       Tags        Counts
Asia                LOCATION    LOCATION = 3
Plymouth            LOCATION    IN = 2
In the sea          IN
Pacific Ocean       LOCATION
In her eyes         IN

Sentence: "Sourav was captain of India in the 2003 world cup."
Chunks: 'Sourav', 'India', 'in the 2003'
Tags: PERSON, LOCATION, IN
22. CONTD.
Probability(Sourav, India, in the 2003 | when)
= Prob(When) * Prob(PERSON | when) * Prob(LOCATION | when) * Prob(IN | when)
= (4/13) * (1/(4+3)) * (1/(4+3)) * (2/(4+3))
= 0.30 * 0.14 * 0.14 * 0.28
= 0.0016
Probability(Sourav, India, in the 2003 | where)
= Prob(Where) * Prob(PERSON | where) * Prob(LOCATION | where) * Prob(IN | where)
= (6/13) * (1/(5+2)) * (3/(5+2)) * (2/(5+2))
= 0.46 * 0.1 * 0.3 * 0.2
= 0.0027
Probability(Sourav, India, in the 2003 | who)
= Prob(Who) * Prob(PERSON | who) * Prob(LOCATION | who) * Prob(IN | who)
= (3/13) * (3/(3+3)) * (1/(3+3)) * (1/(3+3))
= 0.23 * 0.5 * 0.16 * 0.16
= 0.0029
The 'Who' probability is the highest, so the system will generate the 'Who' type question.
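The decision above can be sketched as a naive-Bayes-style score (an assumed reading of the slides: add-one smoothing stands in for the slides' handling of unseen tags, and the priors here are computed from the displayed rows, so the intermediate numbers differ slightly from the slides' figures, while the 'Who' decision is the same):

```python
from math import prod

# tag counts per wh-type, taken from the example answer bases above
answer_base = {
    "Who":   {"PERSON": 3},
    "Where": {"LOCATION": 3, "IN": 2},
    "When":  {"DATE": 1, "TIME": 1, "IN": 2},
}

def score(chunk_tags, wh):
    counts = answer_base[wh]
    total = sum(counts.values())
    grand_total = sum(sum(c.values()) for c in answer_base.values())
    prior = total / grand_total          # Prob(wh)
    vocab = len(counts)
    # add-one smoothing keeps unseen tags from zeroing the product
    return prior * prod((counts.get(t, 0) + 1) / (total + vocab)
                        for t in chunk_tags)

tags = ["PERSON", "LOCATION", "IN"]      # 'Sourav', 'India', 'in the 2003'
best = max(answer_base, key=lambda wh: score(tags, wh))
# best == "Who"
```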
23. CONTD.
After training is done, the system generates questions from sentences by traversing the question base with the values of the nodes.
Example: "Mahatma Gandhi is the Father of Nation."
Suppose it tries to generate a 'Who' question from this; the steps would be:
Sentence parsing:
Mahatma Gandhi is the Father of Nation.
Chunks: NP-SBJ-1 VP-1 NP-PRD-1 PP NP
Tags: NNP-NNP VBZ DT-NN IN NN
The chunks and the corresponding relations are put into a table where the keys are the relations and the values are the chunk phrases.
24. CONTD.
Question generation:
These relation and tag pairs are searched for by a recursive descent parser in the question base. If a path is found with these nodes, the corresponding chunks are appended one after another and the question is generated:
"Who is the father of nation?"
The chunk table:
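A sketch of this final substitution step (the chunk table contents and the `generate` helper are illustrative assumptions):

```python
# chunk table from the parsed sentence, keyed by relation tag
chunk_table = {
    "NP-SBJ-1": "Mahatma Gandhi",
    "VP-1": "is",
    "NP-PRD-1": "the father of nation",
}

def generate(wh_word, path, chunk_table):
    # append the chunks along the matched rule-tree path after the wh-word
    words = [wh_word] + [chunk_table[tag] for tag in path]
    return " ".join(words).capitalize() + "?"

question = generate("who", ["VP-1", "NP-PRD-1"], chunk_table)
# "Who is the father of nation?"
```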
25. THE FEEDBACK SYSTEM
• Takes the user's feedback on the generated questions.
• Updates the count values.
• Updates the question base accordingly.
• Reduces the generation of false positives.
• Enhances the probability of generating quality questions.
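A possible sketch of the count update (the exact update rule is not specified on the slides; this assumes a simple increment on positive feedback and decrement on negative feedback):

```python
def apply_feedback(tail_node, accepted):
    # positive feedback raises the tail-node count, negative feedback
    # lowers it, never dropping below zero
    tail_node["count"] = max(0, tail_node["count"] + (1 if accepted else -1))

node = {"count": 14}
apply_feedback(node, accepted=True)   # count -> 15
apply_feedback(node, accepted=False)  # count -> 14
```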
27. SCOPE IN FUTURE
Question Generation is an important function of advanced learning technologies
such as:
• Intelligent tutoring systems
• Inquiry-based environments
• Game-based learning environments
• Psycholinguistics
• Discourse and Dialogue
• Natural Language Generation
• Natural Language Understanding
• Academic purposes to create Practice and Assessment materials
28. REFERENCES
[1] Liu, Ming, Rafael A. Calvo, and Vasile Rus. "G-Asks: An intelligent automatic question
generation system for academic writing support." Dialogue & Discourse 3.2 (2012): 101-124.
[2] Chen, Wei, and Jack Mostow. "Using Automatic Question Generation to Evaluate Questions
Generated by Children." The 2011 AAAI Fall Symposium on Question Generation. 2011.
[3] Radev, Dragomir, et al. "Probabilistic question answering on the web." Journal of the American
Society for Information Science and Technology 56.6 (2005): 571-583.
[4] Roussinov, Dmitri, and Jose Robles. "Web question answering through automatically learned
patterns." Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries. ACM, 2004.
[5] Agarwal, Manish, and Prashanth Mannem. "Automatic gap-fill question generation from text
books." Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational
Applications. Association for Computational Linguistics, 2011.
[6] Skalban, Yvonne, et al. "Automatic Question Generation in Multimedia-Based Learning."
COLING (Posters). 2012.
[7] Becker, Lee, Rodney D. Nielsen, and W. Ward. "What a pilot study says about running a
question generation challenge." Proceedings of the Second Workshop on Question Generation,
Brighton, England, July. 2009.
29. [8] Xu, Yushi, Anna Goldie, and Stephanie Seneff. "Automatic question generation and answer
judging: a q&a game for language learning." SLaTE. 2009.
[9] Rus, Vasile, and C. Graesser Arthur. "The question generation shared task and evaluation
challenge." The University of Memphis. National Science Foundation. 2009.
[10] Lin, Chin-Yew. "Automatic question generation from queries." Workshop on the Question
Generation Shared Task. 2008.
[11] Ali, Husam, Yllias Chali, and Sadid A. Hasan. "Automation of question generation from
sentences." Proceedings of QG2010: The Third Workshop on Question Generation. 2010.
[12] Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 2009.