This document provides an overview of building a real-time sentiment analysis API using machine learning techniques. It discusses preprocessing text data using NLTK, training a naive Bayes classifier on labelled tweet data, reducing features using chi-square testing, and scaling the API using ZeroMQ for real-time sentiment predictions. The document also briefly mentions potential improvements like using Redis instead of serialized classifiers and exploring deep learning methods.
7. MACHINE LEARNING
WHAT IS MACHINE LEARNING?
A method of teaching computers to make and improve
predictions or behaviors based on some data.
It allows computers to evolve behaviors based on empirical data.
Data can be anything:
Stock market prices
Sensors and motors
Email metadata
16. NATURAL LANGUAGE PROCESSING
WHAT IS NATURAL LANGUAGE PROCESSING?
Interactions between computers and human languages
Extract information from text
Some NLTK features
Bigrams
Part-of-speech tagging
Tokenization
Stemming
WordNet lookup
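Two of these features, bigrams and stemming, work without any corpus downloads, so they can be shown in a short self-contained sketch (the example tokens are made up for illustration):

```python
from nltk import bigrams
from nltk.stem import PorterStemmer

tokens = ["the", "movie", "was", "really", "amazing"]

# Bigrams: consecutive token pairs, often used as extra features
pairs = list(bigrams(tokens))
print(pairs)  # [('the', 'movie'), ('movie', 'was'), ('was', 'really'), ('really', 'amazing')]

# Stemming: collapse inflected forms to a common root
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print(stems)
```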
17. NATURAL LANGUAGE PROCESSING
SOME NLTK FEATURES
Tokenization
Stopword Removal
>>> phrase = "I wish to buy specified products or service"
>>> phrase = nlp.tokenize(phrase)
>>> phrase
['I', 'wish', 'to', 'buy', 'specified', 'products', 'or', 'service']
>>> phrase = nlp.remove_stopwords(phrase)
>>> phrase
['I', 'wish', 'buy', 'specified', 'products', 'service']
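The `nlp` helpers on the slide are not standard NLTK names; a minimal stand-in can reproduce the same behavior with `str.split()` in place of NLTK's `word_tokenize` and a tiny hand-picked stopword set in place of the `nltk.corpus.stopwords` corpus (both swaps avoid corpus downloads):

```python
# Hand-picked stopword set standing in for nltk.corpus.stopwords
STOPWORDS = {"to", "or", "the", "a", "and"}

def tokenize(phrase):
    # Whitespace split standing in for nltk's word_tokenize
    return phrase.split()

def remove_stopwords(tokens):
    # Drop tokens whose lowercase form is a stopword
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = tokenize("I wish to buy specified products or service")
print(tokens)
cleaned = remove_stopwords(tokens)
print(cleaned)  # ['I', 'wish', 'buy', 'specified', 'products', 'service']
```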
19. CLASSIFYING TWITTER SENTIMENT IS HARD
Improper language use
Spelling mistakes
140 characters to express sentiment
Different types of English (US, UK, Pidgin)
Example tweet (Donnie McClurkin, @Donnieradio, 21 Apr 2014):
"Gr8 picutre..God bless u RT @WhatsNextInGosp:
Resurrection Sunday Service @PFCNY with
@Donnieradio pic.twitter.com/nOgz65cpY5"
23. FEATURE EXTRACTION
How are we going to find features from a phrase?
"Bag of Words" representation
my_phrase = "Today was such a rainy and horrible day"
In [12]: from nltk import word_tokenize
In [13]: word_tokenize(my_phrase)
Out[13]: ['Today', 'was', 'such', 'a', 'rainy', 'and', 'horrible', 'day']
25. FEATURE EXTRACTION
PASS THE REPRESENTATION DOWN THE PIPELINE
In [11]: feature_extractor.extract("Today was such a rainy and horrible day")
Out[11]: {'day': True, 'horribl': True, 'raini': True, 'today': True}
The result is a dictionary of variable length, with features as
keys and every value set to True
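The `feature_extractor` object itself is not shown on the slide; a minimal sketch that reproduces the output above combines lowercasing, stopword removal, and NLTK's Porter stemmer (here the stopword set is hand-picked and the tokenizer is a plain split, to keep the example free of corpus downloads):

```python
from nltk.stem import PorterStemmer

# Tiny hand-picked set standing in for nltk.corpus.stopwords
STOPWORDS = {"was", "such", "a", "and"}
stemmer = PorterStemmer()

def extract(phrase):
    """Bag-of-words features: lowercase, drop stopwords, stem, map to True."""
    tokens = phrase.lower().split()  # whitespace split standing in for word_tokenize
    return {stemmer.stem(t): True for t in tokens if t not in STOPWORDS}

print(extract("Today was such a rainy and horrible day"))
# {'today': True, 'raini': True, 'horribl': True, 'day': True}
```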
26. DIMENSIONALITY REDUCTION
Remove features that are common across all classes (noise)
Increase performance of the classifier
Decrease the size of the model: less memory usage and more
speed
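The chi-square test mentioned in the overview gives a concrete way to score this: a word spread evenly across classes scores 0 (noise, prune it), while a word skewed toward one class scores high (keep it). A sketch of the statistic over a 2x2 contingency table, with made-up document counts:

```python
def chi_square(o11, o12, o21, o22):
    """Chi-square statistic for a 2x2 contingency table:
    o11 = pos docs containing the word, o12 = pos docs without it,
    o21 = neg docs containing the word, o22 = neg docs without it."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    return num / den

# "awful" (hypothetical counts): mostly in neg docs -> high score, keep
print(chi_square(5, 45, 40, 10))
# "today" (hypothetical counts): evenly spread -> score 0, prune as noise
print(chi_square(25, 25, 25, 25))  # 0.0
```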
33. TRAINING
Now that we can extract features from text, we can train a
classifier. The simplest and most flexible learning algorithm for
text classification is Naive Bayes
P(label|features) = P(label) * P(features|label) / P(features)
Simple to compute = fast
Assumes feature independence = easy to update
Supports multiclass = scalable
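A small worked instance of the formula, with made-up counts (10 labelled tweets, 6 of them 'pos'; the word "great" occurs in 3 pos tweets and 1 neg tweet), using exact fractions to avoid float noise:

```python
from fractions import Fraction

p_pos = Fraction(6, 10)             # P(label): prior for 'pos'
p_great_given_pos = Fraction(3, 6)  # P(features|label): "great" in pos tweets
p_great = Fraction(4, 10)           # P(features): "great" in all tweets

# Bayes' rule: P(label|features) = P(label) * P(features|label) / P(features)
p_pos_given_great = p_pos * p_great_given_pos / p_great
print(p_pos_given_great)  # 3/4
```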
34. TRAINING
NLTK provides built-in components
1. Train the classifier
2. Serialize classifier for later use
3. Train once, use as much as you want
>>> from nltk.classify import NaiveBayesClassifier
>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
... wait a lot of time
>>> nb_classifier.labels()
['neg', 'pos']
>>> serializer.dump(nb_classifier, file_handle)
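The three steps above can be sketched end to end with a toy training set (the featuresets and the use of `pickle` as the unspecified `serializer` are assumptions; a real train set would hold thousands of labelled tweets in the same `{feature: True}` shape the extractor produces):

```python
import pickle

from nltk.classify import NaiveBayesClassifier

# 1. Train the classifier on toy labelled featuresets (hypothetical data)
train_feats = [
    ({"great": True, "love": True}, "pos"),
    ({"awesom": True}, "pos"),
    ({"horribl": True, "hate": True}, "neg"),
    ({"terribl": True}, "neg"),
]
nb_classifier = NaiveBayesClassifier.train(train_feats)
print(sorted(nb_classifier.labels()))  # ['neg', 'pos']

# 2. Serialize the trained model (pickle standing in for the slide's serializer)
blob = pickle.dumps(nb_classifier)

# 3. Restore and reuse without retraining
restored = pickle.loads(blob)
print(restored.classify({"great": True}))  # 'pos'
```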
41. POST-MORTEM
Real-time sentiment analysis APIs can be implemented, and
can be scalable
What if we use Redis instead of having serialized classifiers?
Deep learning is giving very good results in NLP, let's try it!