Microsoft New England Research and Development Center, June 22, 2011
Natural Language Processing and Machine Learning Using Python
Shankar Ambady
Example files hosted on GitHub: https://github.com/shanbady/NLTK-Boston-Python-Meetup
What is “Natural Language Processing”?
Where is this stuff used?
The machine learning paradox
A look at a few key terms
Quick start – creating NLP apps in Python
What is Natural Language Processing?
 The goal is to enable machines to understand human language and extract meaning from text.
 It is a field of study that overlaps with machine learning and, more specifically, computational linguistics.
The “Natural Language Toolkit” (NLTK) is a Python module that provides a variety of functionality to aid us in processing text.
Paradoxes in Machine Learning
Context
Little sister: What’s your name?
Me: Uhh… Shankar..?
Sister: Can you spell it?
Me: Yes. S-H-A-N-K-A…
Sister: WRONG! It’s spelled “I-T”
Language translation is a complicated matter!
Go to: http://babel.mrfeinberg.com/
The problem with communication is the illusion that it has occurred

The problem with communication is the illusion that it has occurred
Das Problem mit Kommunikation ist die Illusion, dass es aufgetreten ist
The problem with communication is the illusion that it arose
Das Problem mit Kommunikation ist die Illusion, dass es entstand
The problem with communication is the illusion that it developed
Das Problem mit Kommunikation ist die Illusion, die sie entwickelte
The problem with communication is the illusion, which developed it

The problem with communication is the illusion that it has occurred
The problem with communication is the illusion, which developed it
EPIC FAIL
Police police police.
The above statement is a complete sentence that states: “police officers police other police officers”.
She yelled “police police police” means “someone called for police”.
Key Terms
The NLP Pipeline
Setting up NLTK
Source downloads are available for Mac and Linux, as well as installable packages for Windows.
At the time of this talk (June 2011), NLTK was only available for Python 2.5 – 2.6; current NLTK releases run on Python 3.
http://www.nltk.org/download
`easy_install nltk`
Prerequisites: NumPy, SciPy
First steps
NLTK comes with packages of corpora that are required for many modules.
Open a Python interpreter:
    import nltk
    nltk.download()
If you do not want to use the downloader with a GUI (requires the Tkinter module), run:
    python -m nltk.downloader <name of package or “all”>
You may individually select packages or download them in bulk.
Let’s dive into some code!
Part of Speech Tagging
    from nltk import pos_tag, word_tokenize
    sentence1 = 'this is a demo that will show you how to detects parts of speech with little effort using NLTK!'
    tokenized_sent = word_tokenize(sentence1)
    print pos_tag(tokenized_sent)
Output:
    [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('demo', 'NN'), ('that', 'WDT'), ('will', 'MD'), ('show', 'VB'), ('you', 'PRP'), ('how', 'WRB'), ('to', 'TO'), ('detects', 'NNS'), ('parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('with', 'IN'), ('little', 'JJ'), ('effort', 'NN'), ('using', 'VBG'), ('NLTK', 'NNP'), ('!', '.')]
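The tagger returns a plain list of (token, tag) pairs, so ordinary Python tools work on it. The following is a minimal sketch for Python 3 that groups tokens by tag. It uses the output shown above as literal data so that it runs without NLTK, and the helper name `words_by_tag` is hypothetical.

```python
from collections import defaultdict

# Tagger output copied from the slide above (literal data, no NLTK call).
tagged = [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('demo', 'NN'),
          ('that', 'WDT'), ('will', 'MD'), ('show', 'VB'), ('you', 'PRP'),
          ('how', 'WRB'), ('to', 'TO'), ('detects', 'NNS'), ('parts', 'NNS'),
          ('of', 'IN'), ('speech', 'NN'), ('with', 'IN'), ('little', 'JJ'),
          ('effort', 'NN'), ('using', 'VBG'), ('NLTK', 'NNP'), ('!', '.')]


def words_by_tag(pairs):
    """Group tokens under their part-of-speech tag (hypothetical helper)."""
    groups = defaultdict(list)
    for token, tag in pairs:
        groups[tag].append(token)
    return dict(groups)


groups = words_by_tag(tagged)
print(groups['NN'])   # tokens the tagger labelled as singular nouns
print(groups['NNS'])  # tokens the tagger labelled as plural nouns
```

Grouping like this makes it easy to pull out, say, only the nouns of a document before further processing.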
Penn Treebank Part-of-Speech Tags
Source: http://www.ai.mit.edu/courses/6.863/tagdef.html
NLTK Text
    nltk.clean_html(rawhtml)
    from nltk.corpus import brown
    from nltk import Text
    brown_words = brown.words(categories='humor')
    brownText = Text(brown_words)
    brownText.collocations()
    brownText.count("car")
    brownText.concordance("oil")
    brownText.dispersion_plot(['car', 'document', 'funny', 'oil'])
    brownText.similar('humor')
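`count` and `concordance` are easier to read once you see what they compute. Below is a minimal pure-Python sketch for Python 3 of the two ideas: counting a token and showing it in context. This is not NLTK's implementation. The function names and the `width` parameter are hypothetical, and the sample sentence is made up.

```python
def count_token(tokens, word):
    """Number of times `word` occurs in the token list (exact match)."""
    return sum(1 for token in tokens if token == word)


def keyword_in_context(tokens, word, width=2):
    """Return each occurrence of `word` with `width` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [token] + right))
    return lines


# Made-up sample text, split on whitespace instead of a real tokenizer.
tokens = "the price of oil rose while oil exports fell".split()

print(count_token(tokens, 'oil'))
for line in keyword_in_context(tokens, 'oil'):
    print(line)
```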
Find similar terms (word definitions) using WordNet
    import nltk
    from nltk.corpus import wordnet as wn
    synsets = wn.synsets('phone')
    print [str(syns.definition) for syns in synsets]
Output:
    'electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds'
    '(phonetics) an individual sound unit of speech without concern as to whether or not it is a phoneme of some language'
    'electro-acoustic transducer for converting electric signals into sounds; it is held over or inserted into the ear'
    'get or try to get into communication (with someone) by telephone'
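    from nltk.corpus import wordnet as wn
    synsets = wn.synsets('phone')
    print [str(syns.definition) for syns in synsets]
`syns.definition` can be replaced with other synset attributes to output hypernyms, meronyms, holonyms, etc. (In NLTK 3 and later, `definition` is a method, so write `syns.definition()`.)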
Fun things to Try
Feeling lonely? Eliza is there to talk to you all day! What human could ever do that for you??
    from nltk.chat import eliza
    eliza.eliza_chat()
……starts the chatbot
    Therapist
    ---------
    Talk to the program by typing in plain English, using normal upper-
    and lower-case letters and punctuation.  Enter "quit" when done.
    ========================================================================
    Hello.  How are you feeling today?
Englisch to German to Englisch to German……
    from nltk.book import *
    babelize_shell()
    Babel> the internet is a series of tubes
    Babel> german
    Babel> run
    0> the internet is a series of tubes
    1> das Internet ist eine Reihe Schläuche
    2> the Internet is a number of hoses
    3> das Internet ist einige Schläuche
    4> the Internet is some hoses
    Babel>
Let’s build something even cooler
Let’s write a spam filter!
A program that analyzes legitimate emails (“ham”) as well as “spam” and learns the features that are associated with each.
Once trained, we should be able to run this program on incoming mail and have it reliably label each one with the appropriate category.
What you will need
NLTK (of course) as well as the “stopwords” corpus
A good dataset of emails, both spam and ham
Patience and a cup of coffee (these programs tend to take a while to complete)
Finding Great Data: The Enron Emails
A dataset of 200,000+ emails made publicly available in 2003 after the Enron scandal.
Contains both spam and actual corporate ham mail.
For this reason it is one of the most popular datasets used for testing and developing anti-spam software.
The dataset we will use is located at the following URL:
http://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/preprocessed/
It contains a list of archived files that contain plaintext emails in two folders, Spam and Ham.
Extract one of the archives from the site into your working directory.
Create a Python script; let’s call it “spambot.py”.
Your working directory should contain the “spambot” script and the folders “spam” and “ham”.
“Spambot.py”
    from nltk import word_tokenize, WordNetLemmatizer, NaiveBayesClassifier, classify, MaxentClassifier
    from nltk.corpus import stopwords
    import random
    import os, glob, re
“Spambot.py” (continued)
    wordlemmatizer = WordNetLemmatizer()
    # load common English words into a list
    commonwords = stopwords.words('english')
    hamtexts = []
    spamtexts = []
    # start globbing the files into the appropriate lists
    for infile in glob.glob(os.path.join('ham/', '*.txt')):
        text_file = open(infile, "r")
        # append (not extend) so that each email stays one whole string
        hamtexts.append(text_file.read())
        text_file.close()
    for infile in glob.glob(os.path.join('spam/', '*.txt')):
        text_file = open(infile, "r")
        spamtexts.append(text_file.read())
        text_file.close()
“Spambot.py” (continued)
    # label each item with the appropriate label and store them as a list of tuples
    mixedemails = [(email, 'spam') for email in spamtexts]
    mixedemails += [(email, 'ham') for email in hamtexts]
    # let's give them a nice shuffle
    random.shuffle(mixedemails)
From this list of random but labeled emails, we will define a “feature extractor” which outputs a feature set that our program can use to statistically compare spam and ham.
“Spambot.py” (continued)
    def email_features(sent):
        features = {}
        # normalize words
        wordtokens = [wordlemmatizer.lemmatize(word.lower()) for word in word_tokenize(sent)]
        for word in wordtokens:
            # if the word is not a stop-word then let's consider it a “feature”
            if word not in commonwords:
                features[word] = True
        return features
    # let's run each email through the feature extractor and collect it in a “featureset” list
    featuresets = [(email_features(n), g) for (n, g) in mixedemails]
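To see what a feature extractor of this kind produces without installing NLTK or its corpora, here is a minimal stand-alone sketch for Python 3. It swaps in `str.split` for `word_tokenize`, skips lemmatization, and uses a tiny hand-written stop-word set. All three are simplifying assumptions, not what spambot.py does, and the name `simple_email_features` is hypothetical.

```python
# Tiny illustrative stop-word set (assumption; spambot.py uses NLTK's list).
STOPWORDS = {'a', 'the', 'is', 'to', 'you', 'and', 'of', 'for'}


def simple_email_features(text):
    """Return a {word: True} dict for every non-stop-word in the text."""
    features = {}
    for word in text.lower().split():
        word = word.strip('.,!?:;"\'')  # crude punctuation trimming
        if word and word not in STOPWORDS:
            features[word] = True
    return features


print(simple_email_features("Claim the FREE prize, free for you!"))
```

Each email becomes a dictionary whose keys are the words it contains. Repeated words collapse to a single entry, so the features record presence rather than frequency.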
To use non-binary features such as numeric values, you must convert them to binary features. This process is called “binning”.
If the feature is the number 12, the binned feature is: (“11<x<13”, True)
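Binning can be sketched in a few lines. The example below, for Python 3, maps a number to a single binary feature named after the interval it falls in. The helper `bin_feature`, the feature name `num_links`, and the bin edges are all hypothetical choices for illustration.

```python
def bin_feature(name, value, edges):
    """Turn a number into one binary feature naming the bin it falls in.

    `edges` is a sorted list of bin boundaries; bins are half-open
    intervals [low, high), with open-ended bins at both extremes.
    """
    lows = [float('-inf')] + list(edges)
    highs = list(edges) + [float('inf')]
    for low, high in zip(lows, highs):
        if low <= value < high:
            return {'%s in [%s, %s)' % (name, low, high): True}


# An email containing 12 links, binned with made-up boundaries.
print(bin_feature('num_links', 12, [1, 5, 10, 20]))
```

The resulting dictionary can be merged into the word features with `features.update(...)`, so the classifier sees the bin label as just another binary feature.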
“Spambot.py” (continued)
    print classifier.labels()
This will output the labels that our classifier will use to tag new data:
    ['ham', 'spam']
The purpose of creating a “training set” and a “test set” is to test the accuracy of our classifier on a separate sample from the same data source.
    print classify.accuracy(classifier, test_set)
    0.75623
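The code above assumes that `classifier` and `test_set` already exist. Below is a minimal sketch for Python 3 of the split-and-score idea, under stated assumptions: the 75/25 split, the helper names, and the toy data are hypothetical, and a trivial keyword rule stands in for the trained classifier so that the block runs without NLTK. With NLTK, the training step would presumably be `NaiveBayesClassifier.train(training_set)`.

```python
def split_featuresets(featuresets, train_fraction=0.75):
    """Split labelled feature sets into a training part and a test part."""
    cut = int(len(featuresets) * train_fraction)
    return featuresets[:cut], featuresets[cut:]


def accuracy(classify_fn, labelled):
    """Fraction of labelled items for which classify_fn gives the true label."""
    correct = sum(1 for feats, label in labelled if classify_fn(feats) == label)
    return correct / len(labelled)


def keyword_rule(feats):
    """Stand-in classifier (assumption): anything mentioning 'free' is spam."""
    return 'spam' if 'free' in feats else 'ham'


# Toy, already-shuffled data in the same shape as `featuresets` above.
featuresets = [
    ({'free': True, 'prize': True}, 'spam'),
    ({'meeting': True}, 'ham'),
    ({'offer': True}, 'spam'),
    ({'report': True}, 'ham'),
    ({'free': True, 'money': True}, 'spam'),
    ({'agenda': True}, 'ham'),
    ({'free': True, 'lunch': True}, 'ham'),
    ({'free': True, 'cash': True}, 'spam'),
]

training_set, test_set = split_featuresets(featuresets)
print(len(training_set), len(test_set))
print(accuracy(keyword_rule, test_set))
```

Scoring only on `test_set`, which the classifier never saw during training, is what makes the accuracy figure meaningful.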
classifier.show_most_informative_features(20) Spam Ham
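A rough way to think about “most informative” features is as the words whose likelihood differs most between the two labels. The following Python 3 sketch ranks words by that ratio on toy data. It illustrates the idea only: the function name, the add-one smoothing, and the data are assumptions, and NLTK's own probability estimates may differ.

```python
from collections import Counter


def likelihood_ratios(labelled_featuresets, label_a='spam', label_b='ham'):
    """Rank words by how much more likely they are under one label."""
    docs = Counter(label for _, label in labelled_featuresets)
    seen = {label_a: Counter(), label_b: Counter()}
    for feats, label in labelled_featuresets:
        seen[label].update(feats.keys())
    ratios = {}
    for word in set(seen[label_a]) | set(seen[label_b]):
        # Add-one smoothing keeps the ratio finite for words unseen in a label.
        p_a = (seen[label_a][word] + 1) / (docs[label_a] + 2)
        p_b = (seen[label_b][word] + 1) / (docs[label_b] + 2)
        ratios[word] = max(p_a / p_b, p_b / p_a)
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Made-up labelled feature sets: three spam, three ham.
toy = [
    ({'free': True, 'prize': True}, 'spam'),
    ({'free': True, 'money': True}, 'spam'),
    ({'free': True, 'offer': True}, 'spam'),
    ({'meeting': True, 'report': True}, 'ham'),
    ({'report': True, 'free': True}, 'ham'),
    ({'agenda': True}, 'ham'),
]

ranked = likelihood_ratios(toy)
for word, ratio in ranked[:3]:
    print(word, round(ratio, 2))
```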
“Spambot.py” (continued)
    while True:
        featset = email_features(raw_input("Enter text to classify: "))
        print classifier.classify(featset)
We can now directly input new email and have it classified as either spam or ham.
A few notes:
 The threshold value that determines the sample size of the feature set will need to be refined until the classifier reaches its maximum accuracy. This will need to be adjusted if training data is added, changed or removed.
 Try classifying your own emails using this trained classifier and you will notice a sharp decline in accuracy.
Two Labels
Fresh
Rotten
Flixster, Rotten Tomatoes, the Certified Fresh Logo are trademarks or registered trademarks of Flixster, Inc. in the United States and other countries
`easy_install tweetstream`
    import tweetstream
    words = ["green lantern"]
    with tweetstream.TrackStream("yourusername", "yourpassword", words) as stream:
        for tweet in stream:
            tweettext = tweet.get('text', '')
            tweetuser = tweet['user']['screen_name'].encode('utf-8')
            featset = review_features(tweettext)
            print tweetuser, ': ', tweettext
            print tweetuser, " thinks Green Lantern is ", classifier.classify(featset), "----------------------------"

Nltk - Boston Text Analytics

  • 1. Microsoft New England Research and Development Center, June 22, 2011. Natural Language Processing and Machine Learning Using Python. Shankar Ambady
  • 2. Example files hosted on GitHub: https://github.com/shanbady/NLTK-Boston-Python-Meetup
  • 3. What is “Natural Language Processing”?
      Where is this stuff used?
      The machine learning paradox
      A look at a few key terms
      Quick start: creating NLP apps in Python
  • 5. The goal is to enable machines to understand human language and extract meaning from text.
  • 6. It is a field of study which falls under the category of machine learning and more specifically computational linguistics.
  • 9. Context. Little sister: What’s your name? Me: Uhh….Shankar..? Sister: Can you spell it? Me: Yes. S-H-A-N-K-A…..
  • 10. Sister: WRONG! It’s spelled “I-T”
  • 11. Language translation is a complicated matter! Go to http://babel.mrfeinberg.com/ and try: “The problem with communication is the illusion that it has occurred”
  • 12. “The problem with communication is the illusion that it has occurred” → “Das Problem mit Kommunikation ist die Illusion, dass es aufgetreten ist” → “The problem with communication is the illusion that it arose” → “Das Problem mit Kommunikation ist die Illusion, dass es entstand” → “The problem with communication is the illusion that it developed” → “Das Problem mit Kommunikation ist die Illusion, die sie entwickelte” → “The problem with communication is the illusion, which developed it”
  • 13. “The problem with communication is the illusion that it has occurred” became “The problem with communication is the illusion, which developed it”. EPIC FAIL
  • 14. “Police police police.” The above statement is a complete sentence that states: “police officers police other police officers”. She yelled “police police police”: “someone called for police”.
  • 18. Setting up NLTK: Source downloads are available for Mac and Linux, as well as installable packages for Windows. Currently only available for Python 2.5 – 2.6. Download from http://www.nltk.org/download or run `easy_install nltk`. Prerequisites: NumPy, SciPy.
  • 19. First steps: NLTK comes with packages of corpora that are required for many modules. Open a Python interpreter and run `import nltk`, then `nltk.download()`. If you do not want to use the downloader with a GUI (which requires the Tkinter module), run: `python -m nltk.downloader <name of package or "all">`
  • 20. You may individually select packages or download them in bulk.
  • 21. Let’s dive into some code!
  • 22. Part of Speech Tagging
      from nltk import pos_tag, word_tokenize
      sentence1 = 'this is a demo that will show you how to detects parts of speech with little effort using NLTK!'
      tokenized_sent = word_tokenize(sentence1)
      print pos_tag(tokenized_sent)
      [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('demo', 'NN'), ('that', 'WDT'), ('will', 'MD'), ('show', 'VB'), ('you', 'PRP'), ('how', 'WRB'), ('to', 'TO'), ('detects', 'NNS'), ('parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('with', 'IN'), ('little', 'JJ'), ('effort', 'NN'), ('using', 'VBG'), ('NLTK', 'NNP'), ('!', '.')]
  • 23. Penn Treebank Part-of-Speech Tags. Source: http://www.ai.mit.edu/courses/6.863/tagdef.html
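The tagger's output is an ordinary list of (word, tag) tuples, so it can be post-processed with plain Python. As a minimal sketch (written for Python 3, no NLTK required), the block below copies the tagged output from the slide and pulls out tokens by tag family. The helper name `words_with_tag_prefix` is hypothetical and introduced only for illustration.

```python
# Tagged output copied from the slide; each item is a (word, tag) tuple.
tagged = [
    ('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('demo', 'NN'),
    ('that', 'WDT'), ('will', 'MD'), ('show', 'VB'), ('you', 'PRP'),
    ('how', 'WRB'), ('to', 'TO'), ('detects', 'NNS'), ('parts', 'NNS'),
    ('of', 'IN'), ('speech', 'NN'), ('with', 'IN'), ('little', 'JJ'),
    ('effort', 'NN'), ('using', 'VBG'), ('NLTK', 'NNP'), ('!', '.'),
]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with `prefix` (hypothetical helper)."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, 'NN')  # matches NN, NNS, NNP
verbs = words_with_tag_prefix(tagged, 'VB')  # matches VB, VBZ, VBG
print(nouns)
print(verbs)
```

Note that "detects" ends up in the noun list because the tagger labelled it NNS in the slide's output; the tags are statistical guesses and should be treated as such.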
  • 24. NLTK Text
      nltk.clean_html(rawhtml)
      from nltk.corpus import brown
      from nltk import Text
      brown_words = brown.words(categories='humor')
      brownText = Text(brown_words)
      brownText.collocations()
      brownText.count("car")
      brownText.concordance("oil")
      brownText.dispersion_plot(['car', 'document', 'funny', 'oil'])
      brownText.similar('humor')
  • 25. Find similar terms (word definitions) using WordNet
      import nltk
      from nltk.corpus import wordnet as wn
      synsets = wn.synsets('phone')
      print [str(syns.definition) for syns in synsets]
      'electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds'
      '(phonetics) an individual sound unit of speech without concern as to whether or not it is a phoneme of some language'
      'electro-acoustic transducer for converting electric signals into sounds; it is held over or inserted into the ear'
      'get or try to get into communication (with someone) by telephone'
  • 26. from nltk.corpus import wordnet as wn
      synsets = wn.synsets('phone')
      print [str(syns.definition) for syns in synsets]
      “syns.definition” can be modified to output hypernyms, meronyms, holonyms, etc.
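To make the `count` and `concordance` calls on the slide less opaque, here is a simplified pure-Python stand-in (Python 3) that works on any list of tokens. It is only a sketch of the idea: the function names mirror the slide, but the implementations, the `window` parameter, and the sample sentence are assumptions, and NLTK's own versions differ in details such as output formatting.

```python
def count(tokens, word):
    """Number of times `word` occurs in the token list."""
    return sum(1 for tok in tokens if tok == word)


def concordance(tokens, word, window=3):
    """Return each occurrence of `word` with up to `window` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(' '.join(left + [tok] + right))
    return lines


# Made-up sample text standing in for a real corpus.
tokens = "the price of oil rose while oil exports fell".split()
oil_count = count(tokens, "oil")
oil_lines = concordance(tokens, "oil", window=2)
print(oil_count)
for line in oil_lines:
    print(line)
```

Seeing every occurrence of a word in its immediate context is often the quickest way to judge whether a term is used in the sense you expect before building features on it.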
  • 28. Feeling lonely? Eliza is there to talk to you all day! What human could ever do that for you??
      from nltk.chat import eliza
      eliza.eliza_chat()  # starts the chatbot
      Therapist
      ---------
      Talk to the program by typing in plain English, using normal upper- and lower-case letters and punctuation. Enter "quit" when done.
      ========================================================================
      Hello. How are you feeling today?
  • 29. Englisch to German to Englisch to German……
      from nltk.book import *
      babelize_shell()
      Babel> the internet is a series of tubes
      Babel> german
      Babel> run
      0> the internet is a series of tubes
      1> das Internet ist eine Reihe Schläuche
      2> the Internet is a number of hoses
      3> das Internet ist einige Schläuche
      4> the Internet is some hoses
      Babel>
  • 30. Let’s build something even cooler
  • 31. Let’s write a spam filter! A program that analyzes legitimate emails (“ham”) as well as “spam” and learns the features that are associated with each. Once trained, we should be able to run this program on incoming mail and have it reliably label each one with the appropriate category.
  • 32. What you will need:
      NLTK (of course), as well as the “stopwords” corpus
      A good dataset of emails, both spam and ham
      Patience and a cup of coffee (these programs tend to take a while to complete)
  • 33. Finding Great Data: The Enron Emails. A dataset of 200,000+ emails made publicly available in 2003 after the Enron scandal. It contains both spam and actual corporate ham mail, which makes it one of the most popular datasets for testing and developing anti-spam software. The dataset we will use is located at the following URL: http://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/preprocessed/ It contains a list of archived files that contain plaintext emails in two folders, Spam and Ham.
  • 34. Extract one of the archives from the site into your working directory. Create a Python script; let’s call it “spambot.py”. Your working directory should contain the “spambot” script and the folders “spam” and “ham”.
      “Spambot.py”
      from nltk import word_tokenize, WordNetLemmatizer, NaiveBayesClassifier, classify, MaxentClassifier
      from nltk.corpus import stopwords
      import random
      import os, glob, re
  • 35. “Spambot.py” (continued)
      wordlemmatizer = WordNetLemmatizer()
      # load common English words into a list
      commonwords = stopwords.words('english')
      hamtexts = []
      spamtexts = []
      # start globbing the files into the appropriate lists
      for infile in glob.glob(os.path.join('ham/', '*.txt')):
          text_file = open(infile, "r")
          # append (not extend) so that each email stays one whole string
          hamtexts.append(text_file.read())
          text_file.close()
      for infile in glob.glob(os.path.join('spam/', '*.txt')):
          text_file = open(infile, "r")
          spamtexts.append(text_file.read())
          text_file.close()
  • 36. “Spambot.py” (continued)
      # label each item with the appropriate label and store them as a list of tuples
      mixedemails = [(email, 'spam') for email in spamtexts]
      mixedemails += [(email, 'ham') for email in hamtexts]
      # let's give them a nice shuffle
      random.shuffle(mixedemails)
      From this list of random but labeled emails, we will define a “feature extractor” which outputs a feature set that our program can use to statistically compare spam and ham.
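The labeling and shuffling step can be tried on its own without the Enron data. The sketch below (Python 3) uses four made-up one-line "emails" in place of the files read from disk, and a seeded `random.Random` so the shuffle is repeatable; the seed and the sample texts are assumptions added for illustration, since the slide shuffles unseeded.

```python
import random

# Tiny made-up stand-ins for the contents of the "spam" and "ham" folders.
spamtexts = ["win a free prize now", "cheap pills online"]
hamtexts = ["meeting moved to 3pm", "please review the attached report"]

# Pair every email with its label, as on the slide.
mixedemails = [(email, 'spam') for email in spamtexts]
mixedemails += [(email, 'ham') for email in hamtexts]

# A seeded generator makes the order repeatable between runs
# (an assumption for this sketch; the slide calls random.shuffle directly).
rng = random.Random(42)
rng.shuffle(mixedemails)

for text, label in mixedemails:
    print(label, '|', text)
```

Shuffling matters because the list is built as all spam followed by all ham; without it, a later split by position would put one class almost entirely in one half.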
  • 37. “Spambot.py” (continued)
      def email_features(sent):
          features = {}
          # normalize words
          wordtokens = [wordlemmatizer.lemmatize(word.lower()) for word in word_tokenize(sent)]
          for word in wordtokens:
              # if the word is not a stop-word, let's consider it a “feature”
              if word not in commonwords:
                  features[word] = True
          return features
      # let's run each email through the feature extractor and collect the results in a “featureset” list
      featuresets = [(email_features(n), g) for (n, g) in mixedemails]
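To see what a feature set looks like without installing NLTK or its corpora, here is a stripped-down stand-in for `email_features` (Python 3). It is a sketch under stated assumptions: a regular expression replaces `word_tokenize`, lemmatization is skipped entirely, and `STOP_WORDS` is a tiny hand-written list rather than NLTK's “stopwords” corpus.

```python
import re

# A deliberately tiny stop-word list (assumption); the "stopwords" corpus
# used on the slide is much larger.
STOP_WORDS = {"a", "an", "and", "the", "to", "is", "of", "your", "you"}


def email_features(text):
    """Map an email to {word: True} for every non-stop-word it contains."""
    # Lower-case the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tok: True for tok in tokens if tok not in STOP_WORDS}


features = email_features("Claim your FREE prize, claim it today!")
print(features)
```

Because the result is a dictionary, repeated words collapse into a single key: the feature records that a word is present, not how often it appears.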
  • 39. To use non-binary features, such as numeric values, you must convert them to binary features. This process is called “binning”.
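As a minimal sketch of binning (Python 3), the example below turns one numeric value, the number of words in an email, into three binary features. The choice of feature, the bin names, and the thresholds of 50 and 300 words are all assumptions picked for illustration, not values from the talk.

```python
def length_bin(num_words):
    """Map a word count to one of three bins (thresholds are assumptions)."""
    if num_words < 50:
        return "short"
    if num_words < 300:
        return "medium"
    return "long"


def length_features(text):
    """One binary feature per bin; exactly one of them is True."""
    current = length_bin(len(text.split()))
    return {"length=" + name: (name == current)
            for name in ("short", "medium", "long")}


binned = length_features("win a free prize now")
print(binned)
```

These entries could be merged into the word-presence dictionary so the classifier sees both kinds of feature in the same format.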
  • 41. “Spambot.py” (continued)
      print classifier.labels()
      This will output the labels that our classifier will use to tag new data: ['ham', 'spam']
      The purpose of creating a “training set” and a “test set” is to test the accuracy of our classifier on a separate sample from the same data source.
      print classify.accuracy(classifier, test_set)
      0.75623
  • 43. “Spambot.py” (continued)
      while True:
          featset = email_features(raw_input("Enter text to classify: "))
          print classifier.classify(featset)
      We can now directly input new email and have it classified as either spam or ham.
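The train/test idea can be illustrated end to end with a small from-scratch naive Bayes classifier in pure Python 3. This is a stand-in written for this sketch, not NLTK's implementation: the functions `train`, `classify`, and `accuracy`, the add-one smoothing, the toy feature sets, and the split point are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train(train_set):
    """Count labels and, per label, how many examples contain each feature."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in train_set:
        label_counts[label] += 1
        for name in features:
            feature_counts[label][name] += 1
            vocabulary.add(name)
    return label_counts, feature_counts, vocabulary


def classify(model, features):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in sorted(label_counts.items()):
        score = math.log(n / total)  # prior P(label)
        for name in features:
            if name in vocabulary:  # ignore features never seen in training
                score += math.log((feature_counts[label][name] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_set):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(1 for features, label in test_set
                  if classify(model, features) == label)
    return correct / len(test_set)


# Toy, hand-made feature sets (assumption); real ones come from the emails.
featuresets = [
    ({"free": True, "prize": True}, "spam"),
    ({"cheap": True, "pills": True}, "spam"),
    ({"free": True, "pills": True}, "spam"),
    ({"meeting": True, "report": True}, "ham"),
    ({"report": True, "attached": True}, "ham"),
    ({"meeting": True, "agenda": True}, "ham"),
    ({"free": True, "cheap": True}, "spam"),    # held out for testing
    ({"agenda": True, "report": True}, "ham"),  # held out for testing
]

split = 6  # first six examples train the model, the rest test it
train_set, test_set = featuresets[:split], featuresets[split:]
model = train(train_set)
labels = sorted(model[0])
score = accuracy(model, test_set)
print(labels)
print(score)
```

In the talk's script the same roles are played by the `NaiveBayesClassifier` imported at the top of spambot.py and by `classify.accuracy(classifier, test_set)`. The key point carries over unchanged: accuracy is measured only on examples the classifier never saw during training.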
  • 47. Two Labels: Fresh, Rotten. Flixster, Rotten Tomatoes, the Certified Fresh Logo are trademarks or registered trademarks of Flixster, Inc. in the United States and other countries
  • 48. `easy_install tweetstream`
      import tweetstream
      words = ["green lantern"]
      with tweetstream.TrackStream("yourusername", "yourpassword", words) as stream:
          for tweet in stream:
              tweettext = tweet.get('text', '')
              tweetuser = tweet['user']['screen_name'].encode('utf-8')
              featset = review_features(tweettext)
              print tweetuser, ': ', tweettext
              print tweetuser, " thinks Green Lantern is ", classifier.classify(featset), "----------------------------"
  • 50. Further Resources:
      I will be uploading this presentation to my site: http://www.shankarambady.com
      “Natural Language Processing with Python” by Steven Bird, Ewan Klein, and Edward Loper: http://www.nltk.org/book
      API reference: http://nltk.googlecode.com/svn/trunk/doc/api/index.html
      Great NLTK blog: http://streamhacker.com/
  • 51. Thank you for watching! Special thanks to: John Verostek and Microsoft!

Editor's Notes

  1. You may also run `print [str(syns.examples) for syns in synsets]` to get usage examples for each definition.