Natural Language Processing
+ Python
by Ann C. Tan-Pohlmann

February 22, 2014
Outline
• NLP Basics
• NLTK
– Text Processing

• Gensim (really, really short)
– Text Classification

Natural Language Processing
• computer science, artificial intelligence, and linguistics
• human–computer interaction
• natural language understanding
• natural language generation
- Wikipedia

Star Trek's Universal Translator
Spoken Dialog Systems

NLP Basics
• Morphology
– study of word formation
– how word forms vary in a sentence

• Syntax
– branch of grammar
– how words are arranged in a sentence to show
connections of meaning

• Semantics
– study of meaning of words, phrases and sentences
NLTK: Getting Started
• Natural Language Toolkit
– for symbolic and statistical NLP
– teaching tool, study tool and as a platform for prototyping

• Python 2.7 is a prerequisite
>>> import nltk

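Some NLTK methods
• text.similar(str)
• text.concordance(str)
• len(text)
• len(set(text))
• lexical_diversity
– len(text) / len(set(text))

• Frequency Distribution
– fd = FreqDist(text)
– fd[str]
– fd.N()
– fd.max()

• text.collocations()
– sequences of words that often occur together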
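The lexical diversity ratio listed above can be sketched in plain Python. This is only an illustration: the sample sentence is made up, and `text` is an ordinary list of tokens standing in for an `nltk.Text` object.

```python
# Made-up sample; a plain token list stands in for an nltk.Text object.
text = "the quick brown fox jumps over the lazy dog the end".split()

def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / float(len(set(tokens)))

print(len(text))                # number of tokens
print(len(set(text)))           # number of distinct tokens
print(lexical_diversity(text))  # tokens per distinct token
```

A value close to 1 means almost every token is unique; larger values mean more repetition.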

MORPHOLOGY > Syntax > Semantics

Frequency Distribution

• fd = FreqDist(text) – counts every sample in text
• fd[str] – returns the number of occurrences of sample str
• fd.N() – total number of samples
• fd.max() – sample with the greatest count
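To make the four operations concrete without requiring NLTK, here is a rough sketch that uses `collections.Counter` as a stand-in for `FreqDist`. The sample sentence and the variable names are assumptions for illustration only.

```python
from collections import Counter

# Made-up sample text; Counter is used here as a stand-in for nltk.FreqDist.
tokens = "the cat and the dog and the bird".split()
fd = Counter(tokens)

count_and = fd["and"]                    # like fd[str]: occurrences of one sample
total = sum(fd.values())                 # like fd.N(): total number of samples
most_frequent = fd.most_common(1)[0][0]  # like fd.max(): sample with the greatest count

print(count_and, total, most_frequent)
```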

Corpus
• large collection of raw or categorized text on
one or more domains
• Examples: Gutenberg, Brown, Reuters, Web &
Chat Text
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> adventure_text = brown.words(categories='adventure')

Corpora in Other Languages
>>> import nltk
>>> from nltk.corpus import udhr
>>> languages = udhr.fileids()
>>> languages.index('Filipino_Tagalog-Latin1')
>>> tagalog = udhr.raw('Filipino_Tagalog-Latin1')
>>> tagalog_words = udhr.words('Filipino_Tagalog-Latin1')
>>> tagalog_tokens = nltk.word_tokenize(tagalog)
>>> tagalog_text = nltk.Text(tagalog_tokens)
>>> fd = nltk.FreqDist(tagalog_text)
>>> for sample in fd:
...     print(sample)
...

Using Corpus from Palito
Corpus – large collection of raw or categorized text
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_dir = '/Users/ann/Downloads'
>>> tagalog = PlaintextCorpusReader(corpus_dir, 'Tagalog_Literary_Text.txt')
>>> raw = tagalog.raw()
>>> sentences = tagalog.sents()
>>> words = tagalog.words()
>>> tokens = nltk.word_tokenize(raw)
>>> tagalog_text = nltk.Text(tokens)
Spoken Dialog Systems

MORPHOLOGY > Syntax > Semantics

Tokenization
– breaking up a string into words and punctuation marks

>>> tokens = nltk.word_tokenize(raw)
>>> tagalog_tokens = nltk.Text(tokens)
>>> tagalog_tokens = set(sample.lower() for sample in tagalog_tokens)
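For readers without NLTK installed, the idea can be approximated with a single regular expression. This is a deliberately naive sketch and not how `nltk.word_tokenize` works internally. The function name `simple_tokenize` and the sample sentence are made up.

```python
import re

def simple_tokenize(raw):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", raw)

tokens = simple_tokenize("Hello, world! NLP is fun.")
# Lowercasing before building the set merges 'NLP' and 'nlp' into one vocabulary entry.
vocabulary = set(t.lower() for t in tokens)

print(tokens)
print(sorted(vocabulary))
```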


Stemming
– normalizes a word to its base form; the result may not be the 'root' word
>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word
...
>>> stem('reading')
'read'
>>> stem('moment')
'mo'


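Lemmatization
– uses a vocabulary list and morphological analysis (uses the POS of a word)
>>> vocab = set(brown.words())  # build the vocabulary once; set lookups are fast
>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         # strip the suffix only if what remains is a known word
...         if word.endswith(suffix) and word[:-len(suffix)] in vocab:
...             return word[:-len(suffix)]
...     return word
...
>>> stem('reading')
'read'
>>> stem('moment')
'moment'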


NLTK Stemmers & Lemmatizer
• Porter Stemmer and Lancaster Stemmer
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(w) for w in brown.words()[:100]]

• WordNet Lemmatizer
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(w) for w in brown.words()[:100]]

• Comparison
>>> [wnl.lemmatize(w) for w in ['investigation', 'women']]
>>> [porter.stem(w) for w in ['investigation', 'women']]
>>> [lancaster.stem(w) for w in ['investigation', 'women']]


Using Regular Expression

Operator    Behavior
.           Wildcard, matches any character
^abc        Matches some pattern abc at the start of a string
abc$        Matches some pattern abc at the end of a string
[abc]       Matches one of a set of characters
[A-Z0-9]    Matches one of a range of characters
ed|ing|s    Matches one of the specified strings (disjunction)
*           Zero or more of the previous item, e.g. a*, [a-z]* (also known as Kleene closure)
+           One or more of the previous item, e.g. a+, [a-z]+
?           Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n}         Exactly n repeats, where n is a non-negative integer
{n,}        At least n repeats
{,n}        No more than n repeats
{m,n}       At least m and no more than n repeats
a(b|c)+     Parentheses that indicate the scope of the operators
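A few of these operators can be tried with Python's built-in `re` module. The word list below is made up purely to exercise the patterns.

```python
import re

# Made-up words chosen so that each pattern matches something different.
words = ["reading", "abcde", "xyzabc", "cat", "A1", "coool", "abcbc"]

starts_with_abc = [w for w in words if re.search(r"^abc", w)]        # ^abc
ends_with_abc = [w for w in words if re.search(r"abc$", w)]          # abc$
ends_with_suffix = [w for w in words if re.search(r"(ed|ing|s)$", w)]  # disjunction
two_to_three_o = [w for w in words if re.search(r"o{2,3}", w)]       # {m,n}
a_then_b_or_c = [w for w in words if re.search(r"^a(b|c)+$", w)]     # scope with ( )

print(starts_with_abc, ends_with_abc, ends_with_suffix)
print(two_to_three_o, a_then_b_or_c)
```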


Using Regular Expression
>>> import re
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'reading')
[('read', 'ing')]
>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
...
>>> stem('reading')
'read'
>>> stem('moment')
'mo'


Spoken Dialog Systems

Morphology > SYNTAX > Semantics

Lexical Resources
• collection of words with associated information (annotation)
• Ex: stopwords – high-frequency words with little lexical content
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
>>> stopwords.words('german')
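A typical use of a stop list is to filter tokens before counting them. The sketch below uses a tiny hand-picked stop list, which is an assumption for illustration. The list returned by `stopwords.words('english')` is much longer.

```python
# Hypothetical, hand-picked stop list used only for this illustration.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "The study of meaning is a branch of linguistics".split()
content_words = remove_stopwords(tokens)
print(content_words)
```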


Part-of-Speech (POS) Tagging
• the process of labeling each word with a
particular part of speech, based on its
definition and context


NLTK's POS Tag Sets* – 1/2

Tag   Meaning        Examples
ADJ   adjective      new, good, high, special, big, local
ADV   adverb         really, already, still, early, now
CNJ   conjunction    and, or, but, if, while, although
DET   determiner     the, a, some, most, every, no
EX    existential    there, there's
FW    foreign word   dolce, ersatz, esprit, quo, maitre
MOD   modal verb     will, can, would, may, must, should
N     noun           year, home, costs, time, education
NP    proper noun    Alison, Africa, April, Washington

*simplified


NLTK's POS Tag Sets* – 2/2

Tag   Meaning              Examples
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how

*simplified


NLTK POS Tagger (Brown)
>>> nltk.pos_tag(brown.words()[:30])
[('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'),
('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'),
('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'JJ'), ('recent', 'JJ'),
('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBN'), ('``', '``'), ('no',
'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'WDT'), ('any', 'DT'),
('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'), ('The',
'DT'), ('jury', 'NN'), ('further', 'RB'), ('said', 'VBD'), ('in', 'IN')]
>>> brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]


NLTK POS Tagger (German)
>>> german = nltk.corpus.europarl_raw.german
>>> nltk.pos_tag(german.words()[:30])
[(u'Wiederaufnahme', 'NNP'), (u'der', 'NN'), (u'Sitzungsperiode', 'NNP'),
(u'Ich', 'NNP'), (u'erkl\xe4re', 'NNP'), (u'die', 'VB'), (u'am', 'NN'),
(u'Freitag', 'NNP'), (u',', ','), (u'dem', 'NN'), (u'17.', 'CD'),
(u'Dezember', 'NNP'), (u'unterbrochene', 'NN'), (u'Sitzungsperiode', 'NNP'),
(u'des', 'VBZ'), (u'Europ\xe4ischen', 'JJ'), (u'Parlaments', 'NNS'),
(u'f\xfcr', 'JJ'), (u'wiederaufgenommen', 'NNS'), (u',', ','),
(u'w\xfcnsche', 'NNP'), (u'Ihnen', 'NNP'), (u'nochmals', 'NNS'),
(u'alles', 'VBZ'), (u'Gute', 'NNP'), (u'zum', 'NN'), (u'Jahreswechsel', 'NNP'),
(u'und', 'NN'), (u'hoffe', 'NN'), (u',', ',')]

\xe4 = ä, \xfc = ü

!!! DOES NOT WORK FOR GERMAN


NLTK POS Dictionary
>>> pos = nltk.defaultdict(lambda: 'N')
>>> pos['eat']
'N'
>>> pos.items()
[('eat', 'N')]
>>> for (word, tag) in brown.tagged_words(simplify_tags=True):
...     if word in pos:
...         if isinstance(pos[word], str):
...             new_list = [pos[word]]
...             pos[word] = new_list
...         if tag not in pos[word]:
...             pos[word].append(tag)
...     else:
...         pos[word] = [tag]
...
>>> pos['eat']
['N', 'V']
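The same word-to-tags mapping can be built more directly with `collections.defaultdict(list)`, which avoids the string-versus-list check. The sketch below runs on a small hand-tagged sample (an assumption for illustration) rather than on the Brown corpus.

```python
from collections import defaultdict

# Hypothetical hand-tagged sample using simplified tags such as N and V.
tagged_words = [("They", "PRO"), ("run", "V"), ("a", "DET"), ("long", "ADJ"),
                ("run", "N"), ("to", "TO"), ("eat", "V")]

pos = defaultdict(list)
for word, tag in tagged_words:
    # Record each tag once per word, in the order it is first seen.
    if tag not in pos[word]:
        pos[word].append(tag)

print(pos["run"])
print(pos["eat"])
```

Unlike the slide's version, unseen words map to an empty list here instead of defaulting to 'N'.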

What else can you do with NLTK?
• Other Taggers
– Unigram Tagging
• nltk.UnigramTagger()
• train tagger using tagged sentence data

– N-gram Tagging

• Text classification using machine learning techniques
– decision trees
– naïve Bayes classification (supervised)
– Markov Models
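The idea behind unigram tagging is to give each word the tag it received most often in the training data. The following is a plain-Python sketch of that idea, not the `nltk.UnigramTagger` implementation. The training sentences and the helper names `train_unigram` and `tag` are made up.

```python
from collections import Counter, defaultdict

# Hypothetical tagged training sentences.
train_sents = [
    [("the", "DET"), ("dog", "N"), ("runs", "V")],
    [("the", "DET"), ("run", "N"), ("ended", "VD")],
    [("dogs", "N"), ("run", "V")],
    [("they", "PRO"), ("run", "V")],
]

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words, default=None):
    """Look up each word; words never seen in training get the default tag."""
    return [(w, model.get(w, default)) for w in words]

model = train_unigram(train_sents)
tagged = tag(model, ["the", "dogs", "run", "fast"])
print(tagged)
```

Unseen words such as "fast" stay untagged, which is why unigram taggers are commonly combined with a fallback tagger.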

Gensim
• Tool that extracts the semantic structure of
documents by examining statistical word co-occurrence
patterns within a corpus of training documents.
• Algorithms:
1. Latent Semantic Analysis (LSA)
2. Latent Dirichlet Allocation (LDA) or Random
Projections
Morphology > Syntax > SEMANTICS

Gensim
• Features
– memory independent (the training corpus does not have to fit in RAM)
– wrappers/converters for several data formats

• Vector
– representation of the document as an array of features or
question-answer pairs
1. (word occurrence, count)
2. (paragraph, count)
3. (font, count)

• Model
– transformation from one vector to another
– learned from a training corpus without supervision

Wiki document classification

Other NLP tools for Python
• TextBlob
– part-of-speech tagging, noun phrase extraction,
sentiment analysis, classification, translation

• Pattern
– part-of-speech taggers, n-gram search, sentiment
analysis, WordNet, machine learning
Star Trek technology that became a reality
Installation Guides
• NLTK

• Gensim

• Palito
Using iPython
>>> documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
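As a rough illustration of how such documents become vectors, the sketch below turns three of them into sparse (word id, count) pairs. It is plain Python and does not use Gensim's own API. The names `word_ids` and `to_bow` are made up, and the whitespace tokenization with first-seen word ids is a simplifying assumption.

```python
from collections import Counter

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "Graph minors A survey"]

# Naive tokenization: lowercase and split on whitespace.
texts = [doc.lower().split() for doc in documents]

# Assign each distinct word an integer id in first-seen order.
word_ids = {}
for text in texts:
    for word in text:
        word_ids.setdefault(word, len(word_ids))

def to_bow(tokens):
    """Return a sparse vector: sorted (word id, count) pairs for one document."""
    counts = Counter(word_ids[w] for w in tokens if w in word_ids)
    return sorted(counts.items())

vectors = [to_bow(t) for t in texts]
print(vectors[2])
```

Each pair corresponds to the "(word occurrence, count)" feature from the Vector slide; words absent from a document are simply left out of its vector.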

References
• Natural Language Processing with Python, by
Steven Bird, Ewan Klein, and Edward Loper

Thank You!
• For questions and comments:
- ann at auberonsolutions dot com


Natural Language Processing and Python

  • 1. Natural Language Processing + Python by Ann C. Tan-Pohlmann February 22, 2014
  • 2. Outline • NLP Basics • NLTK – Text Processing • Gensim (really, really short ) – Text Classification 2
  • 3. Natural Language Processing • computer science, artificial intelligence, and linguistics • human–computer interaction • natural language understanding • natural language generation - Wikipedia 3
  • 4. Star Trek's Universal Translator
  • 6. NLP Basics • Morphology – study of word formation – how word forms vary in a sentence • Syntax – branch of grammar – how words are arranged in a sentence to show connections of meaning • Semantics – study of meaning of words, phrases and sentences 6
  • 7. NLTK: Getting Started • Natural Language Toolkit – for symbolic and statistical NLP – teaching tool, study tool and as a platform for prototyping • Python 2.7 is a prerequisite >>> import nltk
  • 8. Some NLTK methods • • • • • Frequency Distribution text.similar(str) concordance(str) len(text) len(set(text)) lexical_diversity • • • • • – len(text)/ len(set(text)) fd = FreqDist(text) fd[str] fd.N() fd.max() • text.collocations() - sequence of words that occur together often MORPHOLOGY > Syntax > Semantics 8
  • 9. Frequency Distribution • • • • • fd = FreqDist(text) – increment count fd[str] – returns the number of occurrence for sample str fd.N() – total number of samples fd.max() – sample with the greatest count 9
  • 10. Corpus • large collection of raw or categorized text on one or more domain • Examples: Gutenberg, Brown, Reuters, Web & Chat Txt >>> from nltk.corpus import brown >>> brown.categories() ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', ' humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction'] >>> adventure_text = brown.words(categories='adventure') 10
  • 11. Corpora in Other Languages >>> from nltk.corpus import udhr >>> languages = nltk.corpus.udhr.fileids() >>> languages.index('Filipino_Tagalog-Latin1') >>> tagalog = nltk.corpus.udhr.raw('Filipino_Tagalog-Latin1') >>> tagalog_words = nltk.corpus.udhr.words('Filipino_Tagalog-Latin1') >>> tagalog_tokens = nltk.word_tokenize(tagalog) >>> tagalog_text = nltk.Text(tagalog_tokens) >>> fd = FreqDist(tagalog_text) >>> for sample in fd: ... print sample 11
  • 12. Using Corpus from Palito Corpus – large collection of raw or categorized text >>> import nltk >>> from nltk.corpus import PlaintextCorpusReader >>> corpus_dir = '/Users/ann/Downloads' >>> tagalog = PlaintextCorpusReader(corpus_dir, 'Tagalog_Literary_Text.txt') >>> raw = tagalog.raw() >>> sentences = tagalog.sents() >>> words = tagalog.words() >>> tokens = nltk.word_tokenize(raw) >>> tagalog_text = nltk.Text(tokens) 12
  • 13. Spoken Dialog Systems MORPHOLOGY > Syntax > Semantics 13
  • 14. Tokenization Tokenization – breaking up of string into words and punctuations >>> tokens = nltk.word_tokenize(raw) >>> tagalog_tokens = nltk.Text(tokens) >>> tagalog_tokens = set(sample.lower() for sample in tagalog_tokens) MORPHOLOGY > Syntax > Semantics 14
  • 15. Stemming Stemming – normalize words into its base form, result may not be the 'root' word >>> def stem(word): ... for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']: ... if word.endswith(suffix): ... return word[:-len(suffix)] ... return word ... >>> stem('reading') 'read' >>> stem('moment') 'mo' MORPHOLOGY > Syntax > Semantics 15
  • 16. Lemmatization Lemmatization – uses vocabulary list and morphological analysis (uses POS of a word) >>> def stem(word): ... for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']: ... if word.endswith(suffix) and word[:-len(suffix)] in brown.words(): ... return word[:-len(suffix)] ... return word ... >>> stem('reading') 'read' >>> stem('moment') 'moment' MORPHOLOGY > Syntax > Semantics 16
  • 17. NLTK Stemmers & Lemmatizer • Porter Stemmer and Lancaster Stemmer >>> porter = nltk.PorterStemmer() >>> lancaster = nltk.LancasterStemmer() >>> [porter.stem(w) for w in brown.words()[:100]] • Word Net Lemmatizer >>> wnl = nltk.WordNetLemmatizer() >>> [wnl.lemmatize(w) for w in brown.words()[:100]] • Comparison >>> [wnl.lemmatize(w) for w in ['investigation', 'women']] >>> [porter.stem(w) for w in ['investigation', 'women']] >>> [lancaster.stem(w) for w in ['investigation', 'women']] MORPHOLOGY > Syntax > Semantics 17
  • 18. Using Regular Expression Operator . ^abc abc$ [abc] [A-Z0-9] ed|ing|s * + ? {n} {n,} {,n} {m,n} a(b|c)+ Behavior Wildcard, matches any character Matches some pattern abc at the start of a string Matches some pattern abc at the end of a string Matches one of a set of characters Matches one of a range of characters Matches one of the specified strings (disjunction) Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure) One or more of previous item, e.g. a+, [a-z]+ Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]? Exactly n repeats where n is a non-negative integer At least n repeats No more than n repeats At least m and no more than n repeats Parentheses that indicate the scope of the operators MORPHOLOGY > Syntax > Semantics 18
  • 19. Using Regular Expression >>> import re >>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'reading') [('read', 'ing')] >>> def stem(word): ... regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$' ... stem, suffix = re.findall(regexp, word)[0] ... return stem ... >>> stem('reading') 'read' >>> stem('moment') 'mo' MORPHOLOGY > Syntax > Semantics
  • 20. Spoken Dialog Systems Morphology > SYNTAX > Semantics 20
  • 21. Lexical Resources • collection of words with association information (annotation) • Ex: stopwords – high-frequency words with little lexical content >>> from nltk.corpus import stopwords >>> stopwords.words('english') >>> stopwords.words('german') MORPHOLOGY > Syntax > Semantics 21
  • 22. Part-of-Speech (POS) Tagging • the process of labeling and classifying words to a particular part of speech based on its definition and context Morphology > SYNTAX > Semantics 22
  • 23. NLTKs POS Tag Sets* – 1/2 Tag ADJ ADV CNJ DET EX FW MOD N NP Meaning adjective adverb conjunction determiner existential foreign word modal verb noun proper noun Examples new, good, high, special, big, local really, already, still, early, now and, or, but, if, while, although the, a, some, most, every, no there, there's dolce, ersatz, esprit, quo, maitre will, can, would, may, must, should year, home, costs, time, education Alison, Africa, April, Washington *simplified Morphology > SYNTAX > Semantics 23
  • 24. NLTKs POS Tag Sets* – 2/2 Tag NUM PRO P TO UH V VD VG VN WH Meaning number pronoun preposition the word to interjection verb past tense present participle past participle wh determiner Examples twenty-four, fourth, 1991, 14:24 he, their, her, its, my, I, us on, of, at, with, by, into, under to ah, bang, ha, whee, hmpf, oops is, has, get, do, make, see, run said, took, told, made, asked making, going, playing, working given, taken, begun, sung who, which, when, what, where, how *simplified Morphology > SYNTAX > Semantics 24
  • 25. NLTK POS Tagger (Brown) >>> nltk.pos_tag(brown.words()[:30]) [('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'), ('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'JJ'), ('recent', 'JJ'), ('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBN'), ('``', '``'), ('no', 'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'WDT'), ('any', 'DT'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'), ('The', 'DT'), ('jury', 'NN'), ('further', 'RB'), ('said', 'VBD'), ('in', 'IN')] >>> brown.tagged_words(simplify_tags=True) [('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...] Morphology > SYNTAX > Semantics 25
  • 26. NLTK POS Tagger (German) >>> german = nltk.corpus.europarl_raw.german >>> nltk.pos_tag(german.words()[:30]) [(u'Wiederaufnahme', 'NNP'), (u'der', 'NN'), (u'Sitzungsperiode', 'NNP'), (u'Ich', 'NNP'), (u'erkl\xe4re', 'NNP'), (u'die', 'VB'), (u'am', 'NN'), (u'Freitag', 'NNP'), (u',', ','), (u'dem', 'NN'), (u'17.', 'CD'), (u'Dezember', 'NNP'), (u'unterbrochene', 'NN'), (u'Sitzungsperiode', 'NNP'), (u'des', 'VBZ'), (u'Europ\xe4ischen', 'JJ'), (u'Parlaments', 'NNS'), (u'f\xfcr', 'JJ'), (u'wiederaufgenommen', 'NNS'), (u',', ','), (u'w\xfcnsche', 'NNP'), (u'Ihnen', 'NNP'), (u'nochmals', 'NNS'), (u'alles', 'VBZ'), (u'Gute', 'NNP'), (u'zum', 'NN'), (u'Jahreswechsel', 'NNP'), (u'und', 'NN'), (u'hoffe', 'NN'), (u',', ',')] \xe4 = ä, \xfc = ü !!! DOES NOT WORK FOR GERMAN Morphology > SYNTAX > Semantics
  • 27. NLTK POS Dictionary >>> pos = nltk.defaultdict(lambda:'N') >>> pos['eat'] 'N' >>> pos.items() [('eat', 'N')] >>> for (word, tag) in brown.tagged_words(simplify_tags=True): ... if word in pos: ... if isinstance(pos[word], str): ... new_list = [pos[word]] ... pos[word] = new_list ... if tag not in pos[word]: ... pos[word].append(tag) ... else: ... pos[word] = [tag] ... >>> pos['eat'] ['N', 'V'] Morphology > SYNTAX > Semantics 27
  • 28. What else can you do with NLTK? • Other Taggers – Unigram Tagging • nltk.UnigramTagger() • train tagger using tagged sentence data – N-gram Tagging • Text classification using machine learning techniques – decision trees – naïve Bayes classification (supervised) – Markov Models Morphology > SYNTAX > SEMANTICS 28
  • 29. Gensim • Tool that extracts semantic structure of documents, by examining word statistical cooccurrence patterns within a corpus of training documents. • Algorithms: 1. Latent Semantic Analysis (LSA) 2. Latent Dirichlet Allocation (LDA) or Random Projections Morphology > Syntax > SEMANTICS 29
  • 30. Gensim • Features – memory independent – wrappers/converters for several data formats • Vector – representation of the document as an array of features or question-answer pair 1. 2. 3. (word occurrence, count) (paragraph, count) (font, count) • Model – transformation from one vector to another – learned from a training corpus without supervision Morphology > Syntax > SEMANTICS 30
  • 32. Other NLP tools for Python • TextBlob – part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation – • Pattern – part-of-speech taggers, n-gram search, sentiment analysis, WordNet, machine learning – 32
  • 33. Star Trek technology that became a reality
  • 34. Installation Guides • NLTK • Gensim • Palito
  • 35. Using iPython • >>> documents = ["Human machine interface for lab abc computer applications", >>> "A survey of user opinion of computer system response time", >>> "The EPS user interface management system", >>> "System and human system engineering testing of EPS", >>> "Relation of user perceived response time to error measurement", >>> "The generation of random binary unordered trees", >>> "The intersection graph of paths in trees", >>> "Graph minors IV Widths of trees and well quasi ordering", >>> "Graph minors A survey"] 35
  • 36. References • Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper
  • 37. Thank You! • For questions and comments: - ann at auberonsolutions dot com 37