2. About Me
•My name is Md. Fasihul Kabir
•Working as a Software Engineer @ Escenic Asia Ltd. (April, 2013 –
Present)
•BSc in CSE from AUST (April, 2013).
•MSc in CSE from UIU.
•Research interests are NLP, IR, ML and Compiler Design.
3. Agenda
• What is NLTK?
• What is NLP?
• Installing NLTK
• NLTK Modules & Functionality
• NLP with NLTK
• Accessing Text Corpora & Lexical Resources
• Tokenization
• Normalizing Text
• POS Tagging
• NER
• Language Model
4. Natural Language Toolkit (NLTK)
• A collection of Python programs, modules, data sets and tutorials to support
research and development in Natural Language Processing (NLP)
• Written by Steven Bird, Edward Loper and Ewan Klein
• NLTK is
• Free and Open source
• Easy to use
• Modular
• Well documented
• Simple and extensible
• http://www.nltk.org/
5. What is Natural Language Processing
•Computer-aided text analysis of human language
•The goal is to enable machines to understand human language and
extract meaning from text
•It is a field of study at the intersection of computer science, artificial
intelligence and computational linguistics, and it relies heavily on machine learning
6. Applications of NLP
•Automatic summarization
•Machine translation
•Natural language generation
•Natural language understanding
•Optical character recognition
•Question answering
•Speech Recognition
•Text-to-Speech
7. Installing NLTK
•Install PyYAML, Numpy, Matplotlib
•NLTK Source Installation
• Download NLTK source ( http://nltk.googlecode.com/)
• Unzip it & Go to the new unzipped folder
• Just do it!
➢ python setup.py install
•To install data
• Start python interpreter
>>> import nltk
>>> nltk.download()
9. Accessing Text Corpora & Lexical Resources
•NLTK provides over 50 corpora and lexical resources.
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> len(brown.sents())
57340
>>> len(brown.words())
1161192
•http://www.nltk.org/book/ch02.html
10. Tokenization
• Tokenization is the process of breaking a stream of text up into words, phrases,
symbols, or other meaningful elements called tokens.
>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks.'''
• Word Punctuation Tokenization
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
• Sentence Tokenization
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
• Word Tokenization
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of',
'them', '.'], ['Thanks', '.']]
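The splitting behaviour shown on this slide can be approximated with plain regular expressions. The following is a minimal sketch, not NLTK's implementation: the function names are hypothetical, the word pattern is an assumed approximation of what wordpunct_tokenize does, and the sentence splitter is deliberately naive (it would wrongly split after an abbreviation such as "Dr." followed by a capitalised name).

```python
import re

# Assumed patterns for illustration; NLTK's own tokenizers are more elaborate.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")            # runs of word characters, or runs of punctuation
SENT_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")   # whitespace after . ! ? and before a capital letter


def simple_wordpunct_tokenize(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return WORDPUNCT.findall(text)


def naive_sent_split(text):
    """Split text at the naive sentence boundaries defined by SENT_END."""
    return [part for part in SENT_END.split(text.strip()) if part]


s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
print(simple_wordpunct_tokenize(s))
print(naive_sent_split(s))
```

Note that the word pattern splits "3.88" into three tokens, as wordpunct_tokenize does above, whereas word_tokenize keeps the number together.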
11. Normalizing Text
• Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form,
generally a written word form.
• Porter Stemming Algorithm
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
• Lancaster Stemming Algorithm
>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
• Snowball Stemming Algorithm (supports 15 languages)
>>> from nltk.stem import SnowballStemmer
>>> stemmer = SnowballStemmer('english')
>>> stemmer.stem('cooking')
'cook'
12. Normalizing Text (Cont.)
•Lemmatization first determines the part of speech of a word and then
applies the normalization rules for that part of speech, so the result
is a valid dictionary word (lemma).
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cooking')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
'cook'
13. Normalizing Text (Cont.)
•Comparison between stemming and lemmatizing.
>>> stemmer.stem('believes')
'believ'
>>> lemmatizer.lemmatize('believes')
'belief'
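The difference between the two approaches can be illustrated without NLTK. This is a toy sketch under stated assumptions: the suffix list and the lemma dictionary are hypothetical and tiny, and neither function reproduces the Porter algorithm or WordNet. It only shows that a stemmer strips suffixes by rule (and may return a non-word), while a lemmatizer looks up a dictionary form that depends on the part of speech.

```python
# Hypothetical toy rules and dictionary, for illustration only.
SUFFIXES = ("ing", "es", "s")

LEMMAS = {
    ("cooking", "v"): "cook",
    ("believes", "v"): "believe",
    ("believes", "n"): "belief",
}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos) in the dictionary; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


print(toy_stem("believes"))            # rule-based: suffix removed
print(toy_lemmatize("believes"))       # dictionary-based, default pos is noun
print(toy_lemmatize("believes", "v"))  # same word, treated as a verb
```

As in the slides, the default part of speech is assumed to be noun, which is why 'cooking' is returned unchanged unless pos='v' is given.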
14. Part-of-speech Tagging
•Part-of-speech Tagging is the process of marking up a word in a text
(corpus) as corresponding to a particular part of speech
>>> from nltk.tokenize import word_tokenize
>>> from nltk.tag import pos_tag
>>> words = word_tokenize('And now for something completely different')
>>> pos_tag(words)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different',
'JJ')]
•https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
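To give a feel for how a tagger can be learned from data, here is a minimal sketch of a most-frequent-tag (unigram) baseline. It is not the tagger behind pos_tag; the training sample is hypothetical hand-tagged data, and the function names and the default tag 'NN' for unknown words are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged sample (hypothetical data, Penn Treebank-style tags).
TAGGED = [
    ("the", "DT"), ("cook", "NN"), ("can", "MD"), ("cook", "VB"),
    ("a", "DT"), ("cook", "NN"), ("and", "CC"), ("now", "RB"),
]


def train_unigram_tagger(tagged_words):
    """Map each lowercased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(words, model, default="NN"):
    """Tag each word with its most frequent tag, or the default if unseen."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train_unigram_tagger(TAGGED)
print(tag_words(["And", "now", "the", "cook"], model))
```

A baseline like this ignores context, so it always tags "cook" the same way; taggers that use the surrounding words can tell the noun from the verb.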
15. Named-entity Recognition
•Named-entity recognition is a subtask of information extraction that
seeks to locate and classify elements in text into pre-defined
categories such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values, percentages, etc.
>>> from nltk import pos_tag, ne_chunk
>>> from nltk.tokenize import wordpunct_tokenize
>>> sent = 'Jim bought 300 shares of Acme Corp. in 2006.'
>>> ne_chunk(pos_tag(wordpunct_tokenize(sent)))
Tree('S', [Tree('PERSON', [('Jim', 'NNP')]), ('bought', 'VBD'), ('300', 'CD'), ('shares', 'NNS'),
('of', 'IN'), Tree('ORGANIZATION', [('Acme', 'NNP'), ('Corp', 'NNP')]), ('.', '.'), ('in', 'IN'),
('2006', 'CD'), ('.', '.')])
16. Language model
•A statistical language model assigns a probability to a sequence of m
words, P(w1, w2, …, wm), by means of a probability distribution.
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> from nltk.model import NgramModel
>>> from nltk.probability import LidstoneProbDist
>>> ssw=[w.lower() for w in gutenberg.words('austen-sense.txt')]
>>> ssm=NgramModel(3, ssw, True, False, lambda f,b:LidstoneProbDist(f,0.01,f.B()+1))
>>> ssm.prob('of',('the','name'))
0.907524932004
>>> ssm.prob('if',('the','name'))
0.0124444830775
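The session above relies on nltk.model, which is not available in recent NLTK releases. The idea itself can be sketched in plain Python: count trigrams, then estimate P(word | two previous words) with Lidstone (additive) smoothing, (count + gamma) / (context count + gamma * bins). Everything here is an assumption for illustration: the function names are hypothetical, the training text is a made-up sentence rather than the Austen corpus, and the number of bins is taken to be the vocabulary size, which differs from the f.B()+1 used on the slide.

```python
from collections import Counter


def train_trigrams(tokens):
    """Count trigrams and their two-word contexts in a token list."""
    trigram_counts = Counter()
    context_counts = Counter()
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts[(w1, w2, w3)] += 1
        context_counts[(w1, w2)] += 1
    return trigram_counts, context_counts, set(tokens)


def lidstone_prob(word, context, trigram_counts, context_counts, vocab, gamma=0.01):
    """Lidstone-smoothed estimate of P(word | context), context being a 2-tuple."""
    numerator = trigram_counts[context + (word,)] + gamma
    denominator = context_counts[context] + gamma * len(vocab)
    return numerator / denominator


# Made-up training text (assumption), lowercased as on the slide.
tokens = "the name of the rose and the name of the game".split()
tri, ctx, vocab = train_trigrams(tokens)

p_seen = lidstone_prob("of", ("the", "name"), tri, ctx, vocab)
p_unseen = lidstone_prob("rose", ("the", "name"), tri, ctx, vocab)
print(p_seen, p_unseen)
```

Smoothing gives the unseen continuation a small non-zero probability, while the probabilities over the vocabulary for a given context still sum to one.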