This document provides an overview of building a real-time sentiment analysis API using machine learning techniques. It discusses preprocessing text data using NLTK, training a naive Bayes classifier on labelled tweet data, reducing features using chi-square testing, and scaling the API using ZeroMQ for real-time sentiment predictions. The document also briefly mentions potential improvements like using Redis instead of serialized classifiers and exploring deep learning methods.
7. MACHINE LEARNING
WHAT IS MACHINE LEARNING?
A method of teaching computers to make and improve
predictions or behaviors based on some data.
It allows computers to evolve behaviors based on empirical data.
Data can be anything:
Stock market prices
Sensors and motors
Email metadata
16. NATURAL LANGUAGE PROCESSING
WHAT IS NATURAL LANGUAGE PROCESSING?
Interactions between computers and human languages
Extract information from text
Some NLTK features
Bigrams
Part-of-speech tagging
Tokenization
Stemming
WordNet lookup
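Two of these features, bigrams and stemming, work without any corpus downloads, so they can be shown in a short self-contained sketch (the example tokens are made up for illustration):

```python
from nltk import bigrams
from nltk.stem import PorterStemmer

tokens = ["the", "movie", "was", "really", "amazing"]

# Bigrams: consecutive token pairs, often used as extra features
pairs = list(bigrams(tokens))
print(pairs)  # [('the', 'movie'), ('movie', 'was'), ('was', 'really'), ('really', 'amazing')]

# Stemming: collapse inflected forms to a common root
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print(stems)
```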
17. NATURAL LANGUAGE PROCESSING
SOME NLTK FEATURES
Tokenization
Stopword Removal
>>> phrase = "I wish to buy specified products or service"
>>> phrase = nlp.tokenize(phrase)
>>> phrase
['I', 'wish', 'to', 'buy', 'specified', 'products', 'or', 'service']
>>> phrase = nlp.remove_stopwords(phrase)
>>> phrase
['I', 'wish', 'buy', 'specified', 'products', 'service']
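The `nlp` helpers on the slide are not standard NLTK names; a minimal stand-in can reproduce the same behavior with `str.split()` in place of NLTK's `word_tokenize` and a tiny hand-picked stopword set in place of the `nltk.corpus.stopwords` corpus (both swaps avoid corpus downloads):

```python
# Hand-picked stopword set standing in for nltk.corpus.stopwords
STOPWORDS = {"to", "or", "the", "a", "and"}

def tokenize(phrase):
    # Whitespace split standing in for nltk's word_tokenize
    return phrase.split()

def remove_stopwords(tokens):
    # Drop tokens whose lowercase form is a stopword
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = tokenize("I wish to buy specified products or service")
print(tokens)
cleaned = remove_stopwords(tokens)
print(cleaned)  # ['I', 'wish', 'buy', 'specified', 'products', 'service']
```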
19. CLASSIFYING TWITTER SENTIMENT IS HARD
Improper language use
Spelling mistakes
140 characters to express sentiment
Different types of English (US, UK, Pidgin)
Example tweet (Donnie McClurkin, @Donnieradio, 21 Apr 2014):
"Gr8 picutre..God bless u RT @WhatsNextInGosp:
Resurrection Sunday Service @PFCNY with
@Donnieradio pic.twitter.com/nOgz65cpY5"
23. FEATURE EXTRACTION
How are we going to find features from a phrase?
"Bag of Words" representation
my_phrase = "Today was such a rainy and horrible day"
In [12]: from nltk import word_tokenize
In [13]: word_tokenize(my_phrase)
Out[13]: ['Today', 'was', 'such', 'a', 'rainy', 'and', 'horrible', 'day']
25. FEATURE EXTRACTION
PASS THE REPRESENTATION DOWN THE PIPELINE
In [11]: feature_extractor.extract("Today was such a rainy and horrible day")
Out[11]: {'day': True, 'horribl': True, 'raini': True, 'today': True}
The result is a dictionary of variable length, with features as
keys and every value set to True
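The `feature_extractor` object itself is not shown on the slide; a minimal sketch that reproduces the output above combines lowercasing, stopword removal, and NLTK's Porter stemmer (here the stopword set is hand-picked and the tokenizer is a plain split, to keep the example free of corpus downloads):

```python
from nltk.stem import PorterStemmer

# Tiny hand-picked set standing in for nltk.corpus.stopwords
STOPWORDS = {"was", "such", "a", "and"}
stemmer = PorterStemmer()

def extract(phrase):
    """Bag-of-words features: lowercase, drop stopwords, stem, map to True."""
    tokens = phrase.lower().split()  # whitespace split standing in for word_tokenize
    return {stemmer.stem(t): True for t in tokens if t not in STOPWORDS}

print(extract("Today was such a rainy and horrible day"))
# {'today': True, 'raini': True, 'horribl': True, 'day': True}
```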
26. DIMENSIONALITY REDUCTION
Remove features that are common across all classes (noise)
Increase performance of the classifier
Decrease the size of the model: less memory usage and more
speed
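The chi-square test mentioned in the overview gives a concrete way to score this: a word spread evenly across classes scores 0 (noise, prune it), while a word skewed toward one class scores high (keep it). A sketch of the statistic over a 2x2 contingency table, with made-up document counts:

```python
def chi_square(o11, o12, o21, o22):
    """Chi-square statistic for a 2x2 contingency table:
    o11 = pos docs containing the word, o12 = pos docs without it,
    o21 = neg docs containing the word, o22 = neg docs without it."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    return num / den

# "awful" (hypothetical counts): mostly in neg docs -> high score, keep
print(chi_square(5, 45, 40, 10))
# "today" (hypothetical counts): evenly spread -> score 0, prune as noise
print(chi_square(25, 25, 25, 25))  # 0.0
```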
33. TRAINING
Now that we can extract features from text, we can train a
classifier. The simplest and most flexible learning algorithm for
text classification is Naive Bayes
P(label|features) = P(label) * P(features|label) / P(features)
Simple to compute = fast
Assumes feature independence = easy to update
Supports multiclass = scalable
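A small worked instance of the formula, with made-up counts (10 labelled tweets, 6 of them 'pos'; the word "great" occurs in 3 pos tweets and 1 neg tweet), using exact fractions to avoid float noise:

```python
from fractions import Fraction

p_pos = Fraction(6, 10)             # P(label): prior for 'pos'
p_great_given_pos = Fraction(3, 6)  # P(features|label): "great" in pos tweets
p_great = Fraction(4, 10)           # P(features): "great" in all tweets

# Bayes' rule: P(label|features) = P(label) * P(features|label) / P(features)
p_pos_given_great = p_pos * p_great_given_pos / p_great
print(p_pos_given_great)  # 3/4
```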
34. TRAINING
NLTK provides built-in components
1. Train the classifier
2. Serialize classifier for later use
3. Train once, use as much as you want
>>> from nltk.classify import NaiveBayesClassifier
>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
... wait a lot of time
>>> nb_classifier.labels()
['neg', 'pos']
>>> serializer.dump(nb_classifier, file_handle)
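The three steps above can be sketched end to end with a toy training set (the featuresets and the use of `pickle` as the unspecified `serializer` are assumptions; a real train set would hold thousands of labelled tweets in the same `{feature: True}` shape the extractor produces):

```python
import pickle

from nltk.classify import NaiveBayesClassifier

# 1. Train the classifier on toy labelled featuresets (hypothetical data)
train_feats = [
    ({"great": True, "love": True}, "pos"),
    ({"awesom": True}, "pos"),
    ({"horribl": True, "hate": True}, "neg"),
    ({"terribl": True}, "neg"),
]
nb_classifier = NaiveBayesClassifier.train(train_feats)
print(sorted(nb_classifier.labels()))  # ['neg', 'pos']

# 2. Serialize the trained model (pickle standing in for the slide's serializer)
blob = pickle.dumps(nb_classifier)

# 3. Restore and reuse without retraining
restored = pickle.loads(blob)
print(restored.classify({"great": True}))  # 'pos'
```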
41. POST-MORTEM
Real-time sentiment analysis APIs can be implemented, and
can be scalable
What if we use Redis instead of having serialized classifiers?
Deep learning is giving very good results in NLP, let's try it!