2. Who am I?
• Prof in CS department working on issues
of big data, data science, natural
language processing
• mtdiab@gwu.edu
• Check out my research @
– www.seas.gwu.edu/~mtdiab
• NLP lab @gw
– Care4lang1.seas.gwu.edu
3. “Every 2 days we produce as much
information as we did from the beginning of
time till 2003”
“Big Data refers to our ability to make use
of the ever-increasing volumes of data.”
“…everything we do is increasingly leaving a
digital trace (or data), which we (and others)
can use and analyze.”
Bernard Marr
4. The Dream
• It’d be great if machines could
• Process our email (usefully)
• Translate languages accurately
• Help us manage, summarize, and
aggregate information
• Use speech as a UI (when
needed)
• Talk to us / listen to us
• But they can’t:
• Language is complex, ambiguous,
flexible, and subtle
• Good solutions need linguistics
and machine learning
knowledge
Slide courtesy of Heng Ji
5. Heterogeneous BigData
Lockheed Martin
to furlough
3,000 workers
amid
#USGovernmentShutdown
The Patient Protection and Affordable
Care Act (PPACA),[1] commonly
called the Affordable Care Act (ACA)
or Obamacare, is a United States
federal statute signed into law by
President Barack Obama on March
23, 2010.
The U.S. Congress, still in partisan
deadlock over Republican efforts to
halt President Barack Obama's
healthcare reforms, was on the
verge of shutting down most of the
U.S. government starting on
Tuesday morning.
NSF and NIST are
temporarily closed
because the Government
entered a period of
partial shutdown.
President Obama's 70-minute White
House meeting late Wednesday
afternoon with congressional leaders
including House Speaker John
Boehner, did nothing to help end the
impasse.
6. Mystery
• What’s now impossible for computers (and any other
species) to do is effortless for humans
✕ ✕ ✓
8. What is NLP?
• Fundamental goal: deep understanding of broad language use
• not just string processing or keyword matching!
9. What is NLP/CL?
• NLP: Natural Language Processing
– Is the field of making computers process natural language
• Does process entail understand?
• CL: Computational Linguistics
– Is the field of using computers to understand (natural)
language
• Natural Language?
– Refers to the language spoken by people, e.g. English,
Japanese, Swahili, as opposed to artificial languages, like C++,
Java, etc.
10. What is NLP?
• Computers taking natural language input (data), processing it,
and producing useful information, which could be natural language
output or structured data
• Software that can recognize, analyze and generate text and
speech
• Typically NLP refers to processing unstructured data – text in
free form (unstructured text)
• Contrast with structured data, which refers to information in “tables”
– Typically allows numerical range and exact match (for text)
queries, e.g., Salary < 60000 AND Manager = “Smith, John” should
return Turner, Ian
Employee       Manager          Salary
Smith, John    David, Richard   $80,000
Turner, Ian    Smith, John      $59,000
Huang, Chang   Smith, John      $69,000
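To make the contrast concrete, here is a minimal Python sketch of the structured query from this slide. The list-of-dicts representation and the field names are assumptions for illustration; a real system would use a database.

```python
# The employee table from the slide, represented as a list of records.
# (Field names are illustrative, not from any particular database.)
employees = [
    {"employee": "Smith, John", "manager": "David, Richard", "salary": 80000},
    {"employee": "Turner, Ian", "manager": "Smith, John", "salary": 59000},
    {"employee": "Huang, Chang", "manager": "Smith, John", "salary": 69000},
]

# Salary < 60000 AND Manager = "Smith, John":
# a numerical range test combined with an exact text match.
matches = [
    row["employee"]
    for row in employees
    if row["salary"] < 60000 and row["manager"] == "Smith, John"
]

print(matches)
```

Nothing comparable works on free text: the same facts written as a sentence would first have to be extracted, which is where NLP comes in.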
11. Unstructured (text) vs. structured
(database) data in 1996
[Bar chart: data volume and market cap for unstructured vs. structured data; y-axis 0 to 160]
12. Unstructured (text) vs. structured
(database) data
[Bar chart: data volume and market cap for unstructured vs. structured data; y-axis 0 to 160]
13. Goals of NLP/CL
• Model Human Language Processing
• Analyze Human Language
• Facilitate Human Language Communication
via Automated Tools
15. Computers Lack Knowledge!
• Computers “see” text in English/Arabic/French
the same way you saw the previous slide!
• People have no trouble understanding language
– Common sense knowledge
– Reasoning capacity
– Experience
• However, Computers have
– No common sense knowledge
– No reasoning capacity
Unless we teach them!
16. Why Should You Care?
• An enormous amount of knowledge is now
available in machine readable form as
natural language text
• Conversational agents are becoming an
important form of human-computer
communication
• Much of human-human communication is
now mediated by computers
• Very cool stuff! And with lots of commercial
interest.
Adapted from Speech and Language Processing - Jurafsky and Martin
17. Why NLP?
• Applications for
processing large
amounts of texts
(BIG DATA)
require NLP
expertise
• Classify text into categories
• Index and search large texts
• Automatic machine translation
• Speech understanding
– Understand phone conversations
• Information extraction
– Extract useful information from
resumes
• Automatic summarization
– Condense 1 book into 1 page
• Question answering
• Knowledge acquisition
• Text generation / dialogs
19. Why is NLP intriguing?
• NLP has an AI aspect to it
– We’re often dealing with ill-defined problems
– We don’t often come up with exact solutions/
algorithms
– We can’t let either of those facts get in the
way of making progress
20. NLP in CS taxonomy
Computers
Databases | Artificial Intelligence | Algorithms | Networking
Robotics | Natural Language Processing | Search
Information Retrieval | Machine Translation | Language Analysis
Semantics | Parsing
21. The Challenge
• Language is complex with infinite
possible constructions
• Good news is that there are patterns as
the symbol set is finite, but the
patterns are latent
• Abundance of raw data
22. Why is NLP hard? Some Headlines…
• Police Begin Campaign To Run Down Jaywalkers
• Iraqi Head Seeks Arms
• Enraged Cow Injures Farmer With Ax
• Teacher Strikes Idle Kids
• Squad Helps Dog Bite Victim
• Red Tape Holds Up New Bridges
• Hospitals Are Sued by 7 Foot Doctors
• Court to Try Shooting Defendant
• Local High School Dropouts Cut in Half
23. How can a machine understand
these differences?
• Get the cat with the gloves.
24. Ambiguous Spoken Example
I made her duck
• I cooked waterfowl for her
• I cooked the waterfowl that belongs to
her
• I created the ceramic duck she owns
• I caused her to quickly lower her head
• And more….
25. Example … continued!
I made her duck
maid
Eye
Speech
recognition
cook
create
Word Sense
Disambiguation
Syntactic parsing
Verb
noun
Part of
Speech
Tagging
26. Linguistics
• It is the scientific study of human
language
• How the mind comes up with language
27. Levels of Language Description
• 6 basic levels (more or less explicitly present in most theories):
– and beyond (pragmatics/logic/...)
– meaning (semantics)
– (surface) syntax
– morphology
– phonology
– phonetics/orthography
• Each level has an input and output representation
– output from one level is the input to the next (upper)
level
– sometimes levels might be skipped (merged) or split
28. The Steps in NLP
Discourse
Pragmatics
Semantics
Syntax
Morphology
**we can go up, down and up and
down and combine steps too!!
**every step is equally complex
29. The View: Ambiguity
• All 6 levels of linguistic knowledge require
resolving ambiguity
• Ambiguity results from the existence of
multiple possibilities for each level
30. Ambiguity
• Computational linguists are obsessed with ambiguity
• Ambiguity is a fundamental problem of computational
linguistics
• Resolving ambiguity is a crucial goal
38. Making progress on this problem…
• The task is difficult! What tools do we
need?
– Knowledge about language
– Knowledge about the world
– A way to combine knowledge sources
• How we generally do this:
– probabilistic models built from language data
• P(“maison” → “house”) high
• P(“L’avocat général” → “the general avocado”) low
– Luckily, rough text features can often do half
the job.
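As a rough illustration of “probabilistic models built from language data”, the sketch below estimates translation probabilities by relative frequency. The word-aligned pairs are invented toy data, and the function name translation_probs is hypothetical; real systems learn from large parallel corpora with more elaborate models.

```python
from collections import Counter, defaultdict

# Invented toy data: (source word, aligned target word) pairs.
aligned_pairs = [
    ("maison", "house"), ("maison", "house"), ("maison", "home"),
    ("avocat", "lawyer"), ("avocat", "lawyer"), ("avocat", "lawyer"),
    ("avocat", "avocado"),
]

def translation_probs(pairs):
    """Estimate P(target | source) as count(source, target) / count(source)."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    return {
        source: {t: n / sum(targets.values()) for t, n in targets.items()}
        for source, targets in counts.items()
    }

probs = translation_probs(aligned_pairs)
print(probs["maison"]["house"])    # the frequent translation gets a high estimate
print(probs["avocat"]["avocado"])  # the rare translation gets a low estimate
```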
39. CL Toolkit
• Knowledge of Linguistics, i.e. NLPers call them features!!
• State Machines
– Finite state automata, transducers
• Formal Rule Systems
– Regular Grammars, Context Free Grammars
• Logic
– First order logic, predicate calculus
• Probability Theory
– Associating probabilities with the previous machinery
• Machine Learning Tools
– Learning automatically from representations plays a very important role in cases where
we don’t have good explanations of why things happen the way they do
• Performance Metrics
– Well defined evaluation metrics for different tasks
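To give a feel for the first tool in this list, here is a minimal sketch of a deterministic finite state automaton in Python. The language it accepts (“b”, then two or more “a”s, then “!”) is an assumed textbook-style example, not taken from these slides, and the state numbering is arbitrary.

```python
# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # loop: any number of additional a's
    (3, "!"): 4,
}
START_STATE = 0
ACCEPT_STATES = {4}

def accepts(string):
    """Return True if the automaton ends in an accepting state."""
    state = START_STATE
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPT_STATES

for candidate in ["baa!", "baaaa!", "ba!", "baa"]:
    print(candidate, accepts(candidate))
```

A transducer extends the same idea by emitting an output symbol on each transition.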
41. Models and Algorithms
• By models we mean the formalisms that
are used to capture the various kinds of
linguistic knowledge we need.
• Algorithms are then used to manipulate
the knowledge representations needed
to tackle the task at hand.
42. Models
• Finite state machines
• Linguistic Rules
• Markov models
• Alignment
• Vector space model of word and
document meaning
• Logical formalisms
• Network models
43. Algorithms
• Rule-based
– Symbolic Parsers and morphological
analyzers
– Finite state automata
• Probabilistic/statistical
– Learned from observation of (labeled) data
– Predicting new data based on old
– Machine learning
44. Algorithms
• Many of the algorithms that we’ll study will turn out to
be transducers; algorithms that take one kind of
structure as input and output another
• Unfortunately, ambiguity makes this process difficult
• This leads us to employ algorithms that are designed to
handle ambiguity of various kinds
• State-space search paradigm: To manage the problem
of making choices during processing when we lack
the information needed to make the right choice
45. Machine Learning
Machine learning based classifiers that are
trained to make decisions based on (implicitly
or explicitly modeled) features from context
Simple Classifiers:
Naïve Bayes
Logistic Regression (MaxEnt)
Decision Trees
Neural Networks
Sequence Models:
Hidden Markov Models
Maximum Entropy Markov Models
Conditional Random Fields
Recurrent Neural Networks (RNNs, LSTMs)
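As one concrete instance of a simple classifier, the sketch below implements a multinomial Naïve Bayes text classifier with add-one smoothing. The class name, the whitespace tokenization, and the four training snippets are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for document, label in zip(documents, labels):
            tokens = document.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, document):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log P(class) + sum of log P(word | class), add-one smoothed
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in document.lower().split():
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy training data.
train_docs = [
    "loved it great movie",
    "great acting loved the plot",
    "boring and too robotic",
    "robotic acting boring plot",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("loved the great plot"))
print(model.predict("too boring"))
```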
46. Approaching the challenge
• Divide & Conquer
– Break the problem into smaller problems
• Throw state of the art techniques at
the smaller problems
• Keep your fingers crossed!!
47. NLP Categories
• Applications
• Word counters (wc in UNIX)
• Spell Checkers, grammar checkers
• Predictive Text on mobile handsets
• Machine Translation (MT)
• Information Retrieval (IR)
• Automatic Speech Recognition (ASR)
• Optical Character Recognition (OCR)
• Automatic Summarization, Speech Synthesis, etc.
• Enabling Technologies
– Tokenization
– Part-of-Speech Tagging
– Syntactic Parsing
– Lemmatization
– Word Sense Disambiguation, etc.
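The simplest items on both lists, word counting and tokenization, can be sketched in a few lines of Python. The regular expression below is an assumed, English-only rule of thumb; production tokenizers handle many more cases.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into word tokens (keeping contractions) and punctuation."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)

def word_counts(text):
    """Count lowercased word tokens, ignoring punctuation tokens."""
    return Counter(tok.lower() for tok in tokenize(text) if tok[0].isalnum())

sentence = "I made her duck. She didn't mind!"

whitespace_count = len(sentence.split())  # what `wc -w` would report
tokens = tokenize(sentence)

print(whitespace_count)
print(tokens)
print(word_counts(sentence))
```

Whitespace splitting leaves punctuation attached to words, which is why a tokenizer is usually the first enabling technology in a pipeline.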
48. Turing Test
• Alan Turing was a pioneering British
computer scientist, mathematician,
logician, and cryptanalyst. He is widely
considered the father of computer
science.
• The movie The Imitation Game is about him.
• The Turing test is a test of a machine's ability to exhibit
intelligent behavior equivalent to, or indistinguishable from, that
of a human. Turing proposed that a human evaluator would judge
natural language conversations between a human and a machine
that is designed to generate human-like responses.
Courtesy of Nizar Habash
49. Current Real-World Applications
• Search: very large corpora, e.g. Google
• Information Extraction: relevant information to a task
• Sentiment analysis: restaurant or movie reviews
• Summarizing very large amounts of text or speech: e.g.
your email, the news, voicemail
• Translating between one language and another: e.g.
Google Translate, Babelfish
• Dialogue systems: e.g. chatbots, Amtrak’s ‘Julie’
• Question answering: e.g. IBM’s Watson Jeopardy!,
DARPA who/what/where…, Ask Jeeves
• Even more: speech processing, common sense
knowledge, text categorization, web monitoring, etc.
52. Machine Translation
• Basic types of Machine Translation
– Text to Text Machine Translations
– Speech to Speech Machine Translations
• To date, the majority of approaches have
targeted resource-rich language pairs (with lots of
automated resources) – No Swahili-German
systems
• Current approaches are statistical,
learning from existing translations (parallel
data collections)
• Reasonable performance due to significant
funding
56. Blog Analytics
• Data-mining of blogs, discussion forums,
message boards, user groups, and other
forms of user generated media
– Product marketing information
– Political opinion tracking
– Social network analysis
– Buzz analysis (what’s hot, what topics are
people talking about right now).
57. Livejournal.com:
I, me, my on or after Sep 11, 2001
[Graph: x-axis from baseline (B) through s12 to s24, o2-o8, o16-o22, and o30-n5; y-axis 5.8 to 7.2]
Graph from Pennebaker slides
Cohn, Mehl, Pennebaker. 2004. Linguistic markers of psychological change surrounding September
11, 2001. Psychological Science 15, 10: 687-693.
58. September 11 LiveJournal.com study:
We, us, our
[Graph: x-axis from baseline (B) through s12 to s24, o2-o8, o16-o22, and o30-n5; y-axis .5 to 1.1]
Cohn, Mehl, Pennebaker. 2004. Linguistic markers of psychological change surrounding
September 11, 2001. Psychological Science 15, 10: 687-693.
Graph from Pennebaker slides
59. Sentiment Analysis
• Movie Review Mining
– User1: The Matrix rocked, I simply loved it….
– User2: Really, that Keanu Reaves gets on my nerves,
he is too robotic
– User1: it was way deep, it obviously went over your
head!
– User2: I think it GOT INTO ur head :)
• What do you think User1 and User2’s
sentiments are toward the movie?
– User1
– User2
• What do you think the sentiment of User2
toward User1 is?
61. What about positive restaurant reviews?
Sex, Drugs, and Dessert
• sexy food
• seductively seared foie
gras
— addicted to pepper shooters
— garlic noodles… my drug of
choice
— the fries are like crack
62. Let’s Zoom in
• Beyond Sentiment
– Perspective Identification
• Why would you need that?
– Polling
– Detection of bias
• How would you go about doing it?
– Sentiment analysis
– Opinion mining
– Cognitive state (belief)
63. Dialogue Systems
• User: Can you book me a flight from New York
City to Washington?
• System: Is your destination Washington
state?
• User: No I mean Washington DC
• System: When would you like to fly?
• Example: Amtrak’s Julie system
64. Tell me what's bothering you.
http://www.masswerk.at/elizabot/eliza.html
65. Eliza
• Eliza is a mock Rogerian psychotherapist
– Rogerian psychotherapy, or person-centered therapy, is a
form of talk-psychotherapy developed by psychologist Carl
Rogers in the 1940s & 1950s.
• The original program was described by
Joseph Weizenbaum in 1966.
• Primitive natural language processing using simple
scripts and regular expressions
• Chatterbots or chatbots
• Eliza effect
– the tendency to unconsciously assume computer behaviors
are analogous to human behaviors.
• Turing test
66. • Men are all alike.
• IN WHAT WAY?
• They're always bugging us about something or other.
• CAN YOU THINK OF A SPECIFIC EXAMPLE?
• Well, my boyfriend made me come here.
• YOUR BOYFRIEND MADE YOU COME HERE
• He says I'm depressed much of the time.
• I AM SORRY TO HEAR YOU ARE DEPRESSED
• It's true. I am unhappy
• DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
• I need some help, that much seems certain.
• WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
• Perhaps I could learn to get along with my mother.
• TELL ME MORE ABOUT YOUR FAMILY
• My mother takes care of me.
• WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
• My father.
• YOUR FATHER
• You are like my father in some ways.
Original Chatting Program
Eliza
67. Eliza-style regular expressions
Step 1: replace first person with second person references
s/\bI(’m| am)\b/YOU ARE/g
s/\bmy\b/YOUR/g
s/\bmine\b/YOURS/g
Step 2: use additional regular expressions to generate replies
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 3: use scores to rank possible transformations
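The three steps above can be sketched in Python with the standard re module. This is a minimal illustration, not Weizenbaum's original program: the function name eliza_reply and the fallback reply are hypothetical, and rule order stands in for the score-based ranking of Step 3.

```python
import re

# Step 1: swap first-person references for second-person ones.
PERSON_SWAPS = [
    (re.compile(r"\bI('m| am)\b", re.IGNORECASE), "YOU ARE"),
    (re.compile(r"\bmy\b", re.IGNORECASE), "YOUR"),
    (re.compile(r"\bmine\b", re.IGNORECASE), "YOURS"),
]

# Step 2: reply templates. Step 3 is approximated by list order:
# the first rule that matches wins.
REPLY_RULES = [
    (re.compile(r".*\bYOU ARE (depressed|sad)\b.*", re.IGNORECASE),
     r"I AM SORRY TO HEAR YOU ARE \1"),
    (re.compile(r".*\ball\b.*", re.IGNORECASE), "IN WHAT WAY"),
    (re.compile(r".*\balways\b.*", re.IGNORECASE),
     "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_reply(utterance, default="PLEASE GO ON"):
    """Apply the person swaps, then return the first matching reply."""
    text = utterance
    for pattern, replacement in PERSON_SWAPS:
        text = pattern.sub(replacement, text)
    for pattern, template in REPLY_RULES:
        match = pattern.fullmatch(text)
        if match:
            return match.expand(template).upper()
    return default  # hypothetical fallback when no rule matches

for line in ["Men are all alike.",
             "They're always bugging us about something or other.",
             "He says I'm depressed much of the time."]:
    print(eliza_reply(line))
```

The replies reproduce the corresponding turns of the sample dialogue, even though the program has no representation of what the user means.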
68. • Let’s chat with Mitsuku!
• http://www.mitsuku.com
• Loebner prize winner 2013,
runner up 2015
– Modern form of the Turing test
for Artificial Intelligence
Mitsuku
Slide courtesy of Nizar Habash
70. Question Answering: IBM’s
Watson
• Won Jeopardy on February 16, 2011!
WILLIAM WILKINSON’S
“AN ACCOUNT OF THE PRINCIPALITIES OF
WALLACHIA AND MOLDAVIA”
INSPIRED THIS AUTHOR’S
MOST FAMOUS NOVEL
Bram Stoker
76. Information Retrieval
• Very successful enterprise: Google, Bing,
Yahoo, Altavista
• General model: given a huge collection of texts
(document collection), given a query
– Task: find specific documents that are relevant to
the given query
– How: Create an index, like the index in a book, to
look up the information. Predominant approaches
include vector space models
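The two ideas on this slide, an index and a vector space model, can be combined in a small Python sketch. The three documents are invented, raw term counts stand in for the weighting schemes real engines use, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def search(query, docs, index):
    """Rank documents sharing a term with the query by cosine similarity."""
    query_vec = Counter(query.lower().split())
    candidates = set().union(*(index.get(term, set()) for term in query_vec))
    scored = [(cosine(query_vec, Counter(docs[d].lower().split())), d)
              for d in candidates]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Invented toy document collection.
docs = {
    "d1": "machine translation of natural language",
    "d2": "speech recognition for dialogue systems",
    "d3": "statistical machine translation from parallel data",
}
index = build_index(docs)
print(search("machine translation", docs, index))
```

The index narrows the search to documents that share a query term, so similarity is only computed for those candidates rather than the whole collection.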
77. Information Extraction
Subject: curriculum meeting
Date: January 15, 2012
To: Dan Jurafsky
Hi Dan,
we’ve now scheduled the curriculum meeting.
It will be in Gates 159 tomorrow from 10:00-11:30.
-Chris
Create new Calendar entry
Event: Curriculum mtg
Date: Jan-16-2012
Start: 10:00am
End: 11:30am
Where: Gates 159
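A rule-based version of this extraction can be sketched with regular expressions. The patterns below are assumptions tuned to this one email (a "Room 123"-style location, an "HH:MM-HH:MM" time range, and the word "tomorrow"); the function name extract_event is hypothetical, and real extractors are typically learned from labeled data.

```python
import re
from datetime import date, timedelta

def extract_event(subject, sent_on, body):
    """Pull calendar fields out of an email using hand-written patterns."""
    event = {"event": subject}

    # Resolve the relative date expression against the message date.
    if re.search(r"\btomorrow\b", body, re.IGNORECASE):
        event["date"] = sent_on + timedelta(days=1)

    # A time range such as 10:00-11:30.
    times = re.search(r"\b(\d{1,2}:\d{2})\s*-\s*(\d{1,2}:\d{2})\b", body)
    if times:
        event["start"], event["end"] = times.groups()

    # A location written as "in <Capitalized word> <number>".
    place = re.search(r"\bin ([A-Z][a-z]+ \d+)\b", body)
    if place:
        event["where"] = place.group(1)
    return event

body = ("we've now scheduled the curriculum meeting.\n"
        "It will be in Gates 159 tomorrow from 10:00-11:30.")
event = extract_event("curriculum meeting", date(2012, 1, 15), body)
print(event)
```

Note that the date never appears in the message body: it has to be inferred from "tomorrow" plus the message date, a small example of why extraction needs more than keyword matching.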
78. Information Extraction
• nice and compact to carry!
• since the camera is small and light, I won't
need to carry around those heavy, bulky
professional cameras either!
• the camera feels flimsy, is plastic and very
light in weight you have to be very delicate
in the handling of this camera
Size and weight
Attributes:
zoom
affordability
size and weight
flash
ease of use
✓
✗
✓
81. Reminder of who I am :)
• Prof in CS department working on issues
of big data, data science, natural
language processing
• mtdiab@gwu.edu
• Check out my research @
– www.seas.gwu.edu/~mtdiab
• NLP lab @gw
– Care4lang1.seas.gwu.edu