Online Tweet Sentiment Analysis
with Apache Spark
Davide Nardone
0120/131
PARTHENOPE
UNIVERSITY
1. Introduction
2. Bag-of-words
3. Spark Streaming
4. Apache Kafka
5. DataFrame and SQL operation
6. Machine Learning library (MLlib)
7. Apache Zeppelin
8. Implementation and results
Summary
• Sentiment Analysis (SA) refers to the use of Natural Language Processing (NLP) and text analysis to extract, identify, or otherwise characterize the sentiment content of a text unit.
Examples:
Positive: "The main dish was delicious"
Neutral: "It is a dish"
Negative: "The main dish was salty and horrible"
Introduction
• Existing SA approaches can be grouped into three main categories:
1. Knowledge-based techniques;
2. Statistical methods;
3. Hybrid approaches.
• Statistical methods take advantage of elements of Machine Learning (ML) such as Latent Semantic Analysis (LSA), Multinomial Naïve Bayes (MNB), and Support Vector Machines (SVM).
Introduction (cont.)
• The bag-of-words model is a simplifying representation used in NLP and Information Retrieval (IR).
• In this model, a text is represented as the bag of its words, ignoring grammar and even word order but keeping multiplicity.
• The bag-of-words model is commonly used in document classification methods, where the frequency of occurrence of each word (term frequency, TF) is used as a feature for training a classifier.
Bag-of-words
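The representation described above can be sketched in a few lines of plain Python. This is an illustration only: the `bag_of_words` helper and the lowercase, whitespace-based tokenization are assumptions made for the example, not part of the original project.

```python
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping; grammar and word order are discarded."""
    # Lowercasing and splitting on whitespace is a simplifying assumption.
    return Counter(text.lower().split())


bow_a = bag_of_words("the main dish was delicious")
bow_b = bag_of_words("delicious was the main dish")  # same words, different order
bow_c = bag_of_words("the dish the dish")            # multiplicity is kept
```

Here `bow_a` and `bow_b` compare equal because word order is ignored, while `bow_c` records each repeated word with a count of 2.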
1. Tokenization;
2. Stop-word removal;
3. Stemming;
4. Computation of TF-IDF (term frequency-inverse document frequency);
5. Use of a machine learning classifier to classify the tweets (e.g., Naïve Bayes, Support Vector Machine, etc.)
Bag-of-words (cont.)
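Steps 1 to 4 above can be sketched as follows. Everything here is a hedged illustration: the stop-word list is a tiny hypothetical one, the stemmer is a crude suffix stripper rather than a real stemming algorithm, and the IDF uses the smoothed form log((N + 1) / (df + 1)), which is one common variant and is assumed here.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"i", "the", "was", "and", "is", "a", "an", "it"}


def tokenize(text):
    # Step 1: lowercase and keep alphabetic runs.
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    # Step 2: drop very common words.
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    # Step 3: crude suffix stripping (an assumption, NOT a real stemmer).
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    return [stem(t) for t in remove_stop_words(tokenize(text))]


def tf_idf(docs):
    # Step 4: docs is a list of token lists; returns one {term: weight} per doc.
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in Counter(doc).items()
        }
        for doc in docs
    ]


tweets = [
    "I loved the main dish",
    "Loving the dessert",
    "The main dish was salty and horrible",
]
docs = [preprocess(t) for t in tweets]
vectors = tf_idf(docs)
```

In this toy corpus, "dessert" appears in one document and "lov" in two, so "dessert" receives the higher weight in the second tweet. Step 5 would feed vectors of this kind to a classifier.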
• Spark Streaming is an extension of the core Spark API.
• Data can be ingested from many sources, such as Kafka.
• Processed data can be pushed out to filesystems, databases, etc.
• Furthermore, it is possible to apply Spark's machine learning algorithms to data streams.
Spark Streaming
• Spark Streaming receives live input data streams and divides the data into batches.
• Spark Streaming provides a high-level abstraction called Discretized Stream (DStream), which represents a continuous stream of data.
• A DStream can be created from input data streams from sources such as Kafka, Flume, etc.
Spark Streaming (cont.)
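The idea of dividing a live stream into batches can be illustrated without Spark. The sketch below is plain Python, not the Spark Streaming API: the `discretize` function, the `(timestamp, value)` record shape, and the non-negative timestamps are assumptions made for the example.

```python
from collections import defaultdict


def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, value in records:
        # Records whose timestamps fall in the same interval share a batch.
        batches[int(timestamp // batch_interval)].append(value)
    if not batches:
        return []
    # Emit one list per interval, including intervals with no data.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
micro_batches = discretize(stream, batch_interval=1.0)
```

Each inner list plays the role of one batch of the discretized stream; an interval in which nothing arrived yields an empty batch.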
• Kafka is a distributed streaming platform that behaves like a partitioned, replicated commit log service.
• It provides the functionality of a messaging system.
• Kafka runs as a cluster on one or more servers.
• The Kafka cluster stores streams of records in categories called topics.
Apache Kafka
• Kafka has four core APIs, two of which are presented here:
1. The Producer API allows an application to publish a stream of records to one or more Kafka topics;
2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
• So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
Apache Kafka (cont.)
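The producer/consumer flow above can be modelled with a toy in-memory log. This is not the Kafka client API: `InMemoryBroker` and `Consumer` are hypothetical classes written for this sketch, and partitioning, replication, and networking are all left out.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a Kafka cluster: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def publish(self, topic, record):
        # Producer side: append the record and return its offset in the log.
        self._logs[topic].append(record)
        return len(self._logs[topic]) - 1

    def read(self, topic, offset):
        # Return every record stored at or after the given offset.
        return self._logs[topic][offset:]


class Consumer:
    """Subscribes to topics and remembers how far it has read in each."""

    def __init__(self, broker, topics):
        self._broker = broker
        self._offsets = {topic: 0 for topic in topics}

    def poll(self):
        records = []
        for topic, offset in list(self._offsets.items()):
            new_records = self._broker.read(topic, offset)
            records.extend((topic, record) for record in new_records)
            self._offsets[topic] = offset + len(new_records)
        return records


broker = InMemoryBroker()
consumer = Consumer(broker, ["tweets"])
broker.publish("tweets", "great day")
broker.publish("tweets", "awful service")
first = consumer.poll()   # both records published so far
second = consumer.poll()  # nothing new since the last poll
broker.publish("tweets", "ok")
third = consumer.poll()   # only the newly published record
```

Because the broker keeps the log and the consumer only tracks an offset, the same records could be read again by a second consumer starting from offset 0.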
• Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame as of Spark 1.3), which provides support for structured and semi-structured data.
• Spark SQL also provides JDBC connectivity and can access several databases through both Hadoop connectors and Spark connectors.
• To store data in a database or retrieve data from it, it is necessary to:
  ◦ Define an SQLContext, the entry point to Spark SQL functionality;
  ◦ Create a table schema by means of a StructType, which is then used to create a DataFrame;
  ◦ Write the data, structured according to that schema, to a database by using a JDBC driver.
Output operations for DStream
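The schema-then-write pattern above can be shown with the Python standard library alone. Note the substitution: this sketch uses `sqlite3` as a stand-in for a JDBC-connected database and a plain list of `(name, type)` pairs as a stand-in for a StructType, so it illustrates the flow rather than the Spark SQL API. The table name, column names, and the `write_rows` helper are hypothetical.

```python
import sqlite3

# Hypothetical schema: one (column name, SQL type) pair per field.
SCHEMA = [("tweet", "TEXT"), ("sentiment", "REAL")]


def write_rows(connection, table, schema, rows):
    """Create the table from the schema (if needed) and append the rows."""
    # The table and column names come from trusted code, not user input.
    columns = ", ".join(f"{name} {sql_type}" for name, sql_type in schema)
    placeholders = ", ".join("?" for _ in schema)
    connection.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
    connection.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    connection.commit()


conn = sqlite3.connect(":memory:")
write_rows(
    conn,
    "tweets",
    SCHEMA,
    [("the main dish was delicious", 1.0), ("salty and horrible", 0.0)],
)
stored = conn.execute(
    "SELECT tweet, sentiment FROM tweets ORDER BY sentiment"
).fetchall()
```

In a streaming job, a function like `write_rows` would be called once per processed batch, which is why the table is created only if it does not already exist.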
• MLlib is Spark's library of machine learning functions.
• MLlib contains a variety of learning algorithms and is accessible from all of Spark's programming languages.
• It consists of common learning algorithms and utilities, including classification, regression, clustering, etc.
Machine Learning with MLlib
• The mllib.feature package contains several classes for common feature transformations. These include algorithms to construct feature vectors from text and ways to normalize and scale features.
• Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
Feature extraction
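As an illustration of building a fixed-length feature vector from text and then normalizing it, the sketch below uses the hashing trick followed by L2 normalization. It is written in plain Python and does not call MLlib: the CRC32 hash, the `num_features` default, and both function names are assumptions chosen for the example.

```python
import math
import zlib


def hashed_term_frequencies(tokens, num_features=16):
    """Map each token to a bucket by hashing and count occurrences per bucket."""
    vector = [0.0] * num_features
    for token in tokens:
        # CRC32 is used only because it is deterministic across runs.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


def l2_normalize(vector):
    """Scale the vector to unit Euclidean length (all-zero vectors are unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return vector if norm == 0 else [v / norm for v in vector]


raw = hashed_term_frequencies(["main", "dish", "dish", "delicious"])
features = l2_normalize(raw)
```

Hashing keeps the vector length fixed regardless of vocabulary size, at the cost of possible collisions between different words that land in the same bucket.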
• Classification and regression are two common forms of supervised learning, where algorithms attempt to predict a variable from features of objects using labeled training data.
• Both classification and regression use the LabeledPoint class in MLlib.
• MLlib includes a variety of methods for classification and regression, including simple linear methods and decision trees and forests.
Classification
• Naïve Bayes is a multiclass classification algorithm that scores how well each point belongs in each class based on a linear function of the features.
• It is commonly used in text classification with TF-IDF features, among other applications such as Tweet Sentiment Analysis.
• In MLlib, it is possible to use Naïve Bayes through the mllib.classification.NaiveBayes class.
Naïve Bayes
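To make the "linear function of the features" concrete, here is a minimal multinomial Naïve Bayes written from scratch. It is a sketch, not MLlib's implementation: the function names, the additive smoothing value, and the toy training set are assumptions, and raw term counts are used instead of TF-IDF features for brevity.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, smoothing=1.0):
    """samples: list of (label, tokens). Returns (log_priors, log_likelihoods)."""
    vocabulary = sorted({t for _, tokens in samples for t in tokens})
    label_counts = Counter(label for label, _ in samples)
    term_counts = defaultdict(Counter)
    for label, tokens in samples:
        term_counts[label].update(tokens)
    log_priors = {c: math.log(n / len(samples)) for c, n in label_counts.items()}
    log_likelihoods = {}
    for c in label_counts:
        # Additive smoothing keeps unseen (class, term) pairs from scoring -inf.
        total = sum(term_counts[c].values()) + smoothing * len(vocabulary)
        log_likelihoods[c] = {
            t: math.log((term_counts[c][t] + smoothing) / total) for t in vocabulary
        }
    return log_priors, log_likelihoods


def predict(model, tokens):
    log_priors, log_likelihoods = model
    counts = Counter(tokens)
    # Score = log prior + sum(count * log likelihood): linear in the counts.
    scores = {
        c: log_priors[c]
        + sum(n * log_likelihoods[c][t] for t, n in counts.items()
              if t in log_likelihoods[c])
        for c in log_priors
    }
    return max(scores, key=scores.get)


training = [
    ("positive", ["delicious", "main", "dish"]),
    ("positive", ["great", "dish"]),
    ("negative", ["salty", "horrible", "dish"]),
    ("negative", ["horrible", "service"]),
]
model = train_naive_bayes(training)
label_a = predict(model, ["delicious", "dish"])
label_b = predict(model, ["horrible", "service"])
```

Words that never appeared in training are simply skipped at prediction time, which is one of several reasonable ways to handle them.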
• Clustering is the unsupervised learning task that involves grouping objects into clusters of high similarity.
• Unlike the supervised tasks, where the data is labeled, clustering can be used to make sense of unlabeled data.
• It is commonly used in data exploration and in anomaly detection.
Clustering
• In addition to the popular "offline" K-means algorithm, MLlib also provides an "online" version for clustering data streams.
• When data arrive in a stream, the algorithm dynamically:
1. Estimates the cluster membership of the incoming data;
2. Updates the centroids of the clusters.
Streaming K-means
• In MLlib, it is possible to use Streaming K-means through the mllib.clustering.StreamingKMeans class.
Streaming K-means (cont.)
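The two steps of the online algorithm (assign incoming points, then update centroids) can be sketched as a single batch update. This is an assumption-laden illustration, not the StreamingKMeans class: `update_clusters`, the per-cluster `weights`, and the `decay` parameter that discounts older data are names and choices made for this example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_clusters(centroids, weights, batch, decay=1.0):
    """One streaming k-means step over a batch of points.

    weights[j] is the (possibly decayed) number of points seen by cluster j.
    """
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    # Step 1: estimate membership by assigning each point to its nearest centroid.
    for point in batch:
        nearest = min(range(k), key=lambda j: squared_distance(point, centroids[j]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]
    # Step 2: move each centroid toward the mean of its newly assigned points.
    new_centroids, new_weights = [], []
    for j in range(k):
        old_weight = weights[j] * decay
        total = old_weight + counts[j]
        if total == 0:
            new_centroids.append(list(centroids[j]))
        else:
            new_centroids.append(
                [(centroids[j][d] * old_weight + sums[j][d]) / total
                 for d in range(dim)]
            )
        new_weights.append(total)
    return new_centroids, new_weights


centroids = [[0.0, 0.0], [10.0, 10.0]]
weights = [1.0, 1.0]
batch = [[1.0, 1.0], [9.0, 9.0], [11.0, 11.0]]
new_centroids, new_weights = update_clusters(centroids, weights, batch)
```

With `decay=1.0` every past point counts equally; a value below 1 would let recent batches dominate, which suits streams whose distribution drifts over time.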
• Given a dataset of points in a high-dimensional space, we are often interested in reducing the dimensionality of the points so that they can be analyzed with simpler tools.
• For example, we might want to plot the points in two dimensions, or just reduce the number of features to train models more efficiently.
• In MLlib, it is possible to use PCA through the mllib.feature.PCA class.
Principal Component Analysis (PCA)
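Dimensionality reduction with PCA can be illustrated by extracting only the first principal component and projecting the points onto it. The sketch below uses power iteration on the sample covariance matrix, which is one simple way to find the leading component and is not claimed to be how MLlib computes PCA; the function names and the toy data are assumptions.

```python
import math


def first_principal_component(points, iterations=200):
    """Return (means, unit vector) for the direction of greatest variance."""
    n, dim = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - means[d] for d in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    # Power iteration; an all-ones start fails only if it is orthogonal
    # to the leading component, a limitation accepted in this sketch.
    vector = [1.0] * dim
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in product))
        if norm == 0:
            break
        vector = [v / norm for v in product]
    return means, vector


def project(points, means, component):
    """Reduce each point to a single coordinate along the component."""
    return [
        sum((p[d] - means[d]) * component[d] for d in range(len(component)))
        for p in points
    ]


points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]  # all on the line y = 2x
means, component = first_principal_component(points)
reduced = project(points, means, component)
```

Because the toy points lie exactly on a line, a single coordinate per point preserves all of their variance; real data would lose some information in the projection.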
• Apache Zeppelin is a web-based notebook that enables interactive data visualization.
• The Apache Zeppelin interpreter concept allows any language or data-processing backend, such as JDBC, to be plugged into Zeppelin.
Apache Zeppelin
• Because the Spark Streaming API for Python lacks support for accessing a Twitter account, the tweet streams have been simulated using Apache Kafka.
• In particular, the entity responsible for this task is a Producer, which publishes a stream of data on a specific topic.
• The training and testing data streams have been retrieved from [1].
• On the other side, each received DStream is processed by a Consumer, using stateless Spark functions such as map, transform, etc.
Implementation and results
Naïve Bayes classification results
Clustering results
Future work
• Integrate the Twitter API to retrieve tweets from accounts.
• Use an alternative feature extraction method for the Streaming K-means task.
[1] http://help.sentiment140.com/for-students/
[2] Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.
[3] Bogomolny, A. Benford's Law and Zipf's Law. http://www.cut-the-knot.org/doyouknow/zipfLaw.shtml
References
For any questions, contact me at:
davide.nardone@live.it

Online Tweet Sentiment Analysis with Apache Spark

  • 1. Online Tweet Sentiment Analysis with Apache Spark Davide Nardone 0120/131 PARTHENOPE UNIVERSITY
  • 2. 1. Introduction 2. Bag-of-words 3. Spark Streaming 4. Apache Kafka 5. DataFrame and SQL operation 6. Machine Learning library (MLlib) 7. Apache Zeppelin 8. Implementation and results Summary
  • 3.  Sentiment Analysis (SA) refers to the use of Natural Language Processing (NLP) and Text Analysis to extract, identify or otherwise characterize the sentiment content of a text unit. Introductio n The main dish was delicious It is an dish The main dish was salty and horrible Positive NegativeNeutral
  • 4.  Existing SA approaches can be grouped into three main categories: 1. Knowledge-based techniques; 2. Statistical method; 3. Hybrid approaches.  Statistical method take advantages on elements of Machine Learning (ML) such as Latent Semantic Analysis (LSA), Multinomial Naïve Bayes (MNB), Support Vector Machines (SVM) etc. Introduction (cont.)
  • 5. Bag-of-words: The bag-of-words model is a simplifying representation used in NLP and Information Retrieval (IR). In this model, a text is represented as the bag of its words, ignoring grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in document classification methods, where the number of occurrences of each word (term frequency, TF) is used as a feature for training a classifier.
  • 6. Bag-of-words (cont.): 1. Tokenizing; 2. Stopping (stop-word removal); 3. Stemming; 4. Computation of TF-IDF (term frequency, inverse document frequency); 5. Classification of the tweets with a machine learning classifier (e.g., Naïve Bayes, Support Vector Machine, etc.).
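The first three preprocessing steps and the resulting bag of words can be sketched in plain Python. This is a minimal illustration, not the code used in the presentation: the stop-word list and the suffix-stripping `crude_stem` function are hypothetical stand-ins for a real stop-word list and a real stemmer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "was", "were", "and", "is", "a", "an", "it"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def crude_stem(token):
    """Toy stemmer: strip a few common suffixes (not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def bag_of_words(text):
    """Tokenize, drop stop words, stem, and count occurrences (term frequency)."""
    tokens = [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(tokens)

bag = bag_of_words("The main dish was salty and the dishes were horrible")
print(bag)
```

Word order is lost but multiplicity is kept: "dish" and "dishes" collapse to the same stem and are counted twice.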
  • 7. Spark Streaming: Spark Streaming is an extension of the core Spark API. Data can be ingested from many sources, such as Kafka. Processed data can be pushed out to filesystems, databases, etc. Furthermore, it's possible to apply Spark's machine learning algorithms to data streams.
  • 8. Spark Streaming (cont.): Spark Streaming receives live input data streams and divides the data into batches. Spark Streaming provides a high-level abstraction called Discretized Stream (DStream), which represents a continuous stream of data. A DStream can be created from input data streams from sources such as Kafka, Flume, etc.
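The idea of dividing a live stream into batches can be illustrated without Spark. The sketch below is an assumption-laden toy, not the Spark Streaming API: `micro_batches` is a hypothetical helper that groups timestamped records into fixed-length time windows, which is roughly what a DStream exposes as a sequence of batches.

```python
from itertools import groupby

def micro_batches(records, batch_interval):
    """Group (timestamp, value) records into consecutive fixed-length time windows.

    Records are assumed to arrive sorted by timestamp (in seconds).
    """
    def window(record):
        return int(record[0] // batch_interval)

    for index, group in groupby(records, key=window):
        yield index, [value for _, value in group]

stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = list(micro_batches(stream, batch_interval=1.0))
print(batches)
```

Unlike a real streaming engine, this toy skips windows in which no record arrived instead of emitting an empty batch.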
  • 9. Apache Kafka: Kafka is a distributed streaming platform that behaves like a partitioned, replicated commit log service. It provides the functionality of a messaging system. Kafka runs as a cluster on one or more servers. The Kafka cluster stores streams of records in categories called topics.
  • 10. Apache Kafka (cont.): Two of Kafka's four core APIs are: 1. The Producer API, which allows an application to publish a stream of records to one or more Kafka topics; 2. The Consumer API, which allows an application to subscribe to one or more topics and process the stream of records produced to them. So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
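The producer/consumer relationship can be mimicked with an in-memory toy. This is not the Kafka client API: `InMemoryBroker`, `publish` and `poll` are hypothetical names, and partitioning, replication and networking are all left out. The sketch only shows the commit-log idea, where each topic is an append-only log and each consumer tracks its own read offset.

```python
from collections import defaultdict

class InMemoryBroker:
    """Toy stand-in for a Kafka cluster: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> list of records
        self._offsets = defaultdict(int)  # (consumer, topic) -> next offset to read

    def publish(self, topic, record):
        """Producer side: append a record to the topic's log and return its offset."""
        self._logs[topic].append(record)
        return len(self._logs[topic]) - 1

    def poll(self, consumer, topic):
        """Consumer side: return the records this consumer has not read yet."""
        start = self._offsets[(consumer, topic)]
        records = self._logs[topic][start:]
        self._offsets[(consumer, topic)] = start + len(records)
        return records

broker = InMemoryBroker()
broker.publish("tweets", "I love this phone")
broker.publish("tweets", "Worst service ever")
first = broker.poll("classifier", "tweets")   # both records published so far
broker.publish("tweets", "It is okay")
second = broker.poll("classifier", "tweets")  # only the new record
other = broker.poll("dashboard", "tweets")    # a new consumer reads from the start
```

Because records stay in the log after being read, a second consumer can replay the whole topic independently of the first.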
  • 11. Output operations for DStream: Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Spark SQL also provides JDBC connectivity and can access several databases using both Hadoop connectors and Spark connectors. To store data in a database or retrieve data from it, it is necessary to: define an SQLContext (the entry point for using Spark SQL's functionality); create a table schema by means of a StructType, to which a specific method is applied to create a DataFrame. By using JDBC drivers, the DataFrame built on this schema is then written to a database.
  • 12. Machine Learning with MLlib: MLlib is Spark's library of machine learning functions. MLlib contains a variety of learning algorithms and is accessible from all of Spark's programming languages. It consists of common learning algorithms and features, including classification, regression, clustering, etc.
  • 13. Feature extraction: The mllib.feature package contains several classes for common feature transformations. These include algorithms to construct feature vectors from text and ways to normalize and scale features. Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
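A TF-IDF weighting can be sketched in a few lines of plain Python. This is an illustration rather than MLlib code: the `tf_idf` helper is hypothetical, and the smoothed inverse document frequency log((N + 1) / (df + 1)) is used here on the assumption that it matches the form given in the MLlib documentation.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    TF is the raw count of the term in the document; IDF is the smoothed
    form log((N + 1) / (df + 1)), where N is the number of documents and
    df the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(doc).items()}
        for doc in documents
    ]

docs = [
    ["dish", "delicious"],
    ["dish", "salty", "horrible"],
    ["dish", "dish", "fine"],
]
weights = tf_idf(docs)
```

A term that appears in every document, such as "dish" here, receives weight zero no matter how often it occurs, while a term confined to a single document keeps a positive weight.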
  • 14. Classification: Classification and regression are two common forms of supervised learning, where algorithms attempt to predict a variable from features of objects using labeled training data. Both classification and regression use the LabeledPoint class in MLlib. MLlib includes a variety of methods for classification and regression, including simple linear methods and decision trees and forests.
  • 15. Naïve Bayes: Naïve Bayes is a multiclass classification algorithm that scores how well each point belongs in each class based on a linear function of the features. It's commonly used in text classification with TF-IDF features (for example, in tweet sentiment analysis), among other applications. In MLlib, it's possible to use Naïve Bayes through the mllib.classification.NaiveBayes class.
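To make the "linear function of the features" concrete, here is a small multinomial Naïve Bayes written from scratch. It is a sketch under stated assumptions, not the MLlib implementation: the function names, the toy training set and the additive smoothing value of 1.0 are all chosen for illustration, and raw token counts are used instead of TF-IDF features.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples, smoothing=1.0):
    """Fit a multinomial Naive Bayes model on (label, tokens) pairs."""
    label_counts = Counter(label for label, _ in samples)
    term_counts = defaultdict(Counter)
    for label, tokens in samples:
        term_counts[label].update(tokens)
    vocabulary = {t for counts in term_counts.values() for t in counts}
    model = {}
    for label, n in label_counts.items():
        total = sum(term_counts[label].values()) + smoothing * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n / len(samples)),
            "log_likelihood": {
                t: math.log((term_counts[label][t] + smoothing) / total)
                for t in vocabulary
            },
        }
    return model

def predict(model, tokens):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def score(label):
        params = model[label]
        # Terms never seen during training are ignored.
        return params["log_prior"] + sum(
            params["log_likelihood"][t]
            for t in tokens
            if t in params["log_likelihood"]
        )
    return max(model, key=score)

# Hypothetical toy training data.
training = [
    ("positive", ["delicious", "dish"]),
    ("positive", ["great", "service"]),
    ("negative", ["salty", "horrible", "dish"]),
    ("negative", ["horrible", "service"]),
]
model = train_naive_bayes(training)
label = predict(model, ["horrible", "dish"])
print(label)
```

The score of each class is a sum of per-term log probabilities weighted by the term counts, which is why the decision is linear in the feature vector.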
  • 16. Clustering: Clustering is the unsupervised learning task that involves grouping objects into clusters of high similarity. Unlike the supervised tasks, where the data is labeled, clustering can be used to make sense of unlabeled data. It is commonly used in data exploration and in anomaly detection.
  • 17. Streaming K-means: In addition to the popular "offline" K-means algorithm, MLlib provides an "online" version for clustering data streams. When data arrive in a stream, the algorithm dynamically: 1. Estimates the cluster membership of the data; 2. Updates the centroids of the clusters.
  • 18. Streaming K-means (cont.): In MLlib, it's possible to use Streaming K-means through the mllib.clustering.StreamingKMeans class.
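One online K-means step (estimate membership for the incoming batch, then update the centroids) can be sketched in plain Python. This is not the StreamingKMeans implementation: the `nearest` and `update` helpers are hypothetical, and the way the `decay` factor discounts the weight of past data is an assumption made for illustration.

```python
def nearest(centroids, point):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((c - x) ** 2 for c, x in zip(centroids[k], point)),
    )

def update(centroids, weights, batch, decay=1.0):
    """Apply one mini-batch update and return the new centroids and weights.

    decay=1.0 weighs all past data equally; decay=0.0 forgets it entirely.
    """
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    # Step 1: estimate cluster membership for the points in the batch.
    for point in batch:
        k = nearest(centroids, point)
        counts[k] += 1
        sums[k] = [s + x for s, x in zip(sums[k], point)]
    # Step 2: move each centroid toward the mean of its newly assigned points.
    new_centroids, new_weights = [], []
    for k, centroid in enumerate(centroids):
        old = weights[k] * decay
        total = old + counts[k]
        if total == 0:
            new_centroids.append(list(centroid))
        else:
            new_centroids.append(
                [(c * old + s) / total for c, s in zip(centroid, sums[k])]
            )
        new_weights.append(total)
    return new_centroids, new_weights

centroids, weights = [[0.0, 0.0], [10.0, 10.0]], [1.0, 1.0]
batch = [[1.0, 1.0], [9.0, 11.0], [11.0, 9.0]]
centroids, weights = update(centroids, weights, batch)
print(centroids, weights)
```

Each centroid becomes a weighted average of its previous position and the points newly assigned to it, so the model can follow a stream without storing past batches.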
  • 19. Principal Component Analysis (PCA): Given a dataset of points in a high-dimensional space, we are often interested in reducing the dimensionality of the points so that they can be analyzed with simpler tools. For example, we might want to plot the points in two dimensions, or just reduce the number of features to train models more efficiently. In MLlib, it's possible to use PCA through the mllib.feature.PCA class.
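The core of PCA, finding the direction of greatest variance, can be illustrated with power iteration on the covariance matrix. This sketch is a swapped-in technique, not what MLlib does internally: `first_principal_component` is a hypothetical helper that recovers only the first component and assumes the data has non-zero variance.

```python
import math

def first_principal_component(points, iterations=100):
    """Return the unit direction of greatest variance via power iteration."""
    n, dims = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dims)]
    centered = [[p[j] - means[j] for j in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    vector = [1.0] * dims
    for _ in range(iterations):
        product = [
            sum(cov[i][j] * vector[j] for j in range(dims)) for i in range(dims)
        ]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    return vector

# Toy data lying exactly on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component = first_principal_component(points)
print(component)
```

Projecting each centered point onto this direction reduces the two-dimensional data to a single coordinate; for this toy data no information is lost, since all points lie on one line.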
  • 20. Apache Zeppelin: Apache Zeppelin is a web-based notebook that enables interactive data visualization. Apache Zeppelin's interpreter concept allows any language or data-processing backend, such as JDBC, to be plugged into Zeppelin.
  • 21. Implementation and results: Because the Spark Streaming Python API lacks support for accessing a Twitter account, the tweet streams have been simulated using Apache Kafka. In particular, the entity responsible for this task is a Producer, which publishes a stream of data on a specific topic. The training and testing data streams have been retrieved from [1]. On the other side, each received DStream is processed by a Consumer, using stateless Spark functions such as map, transform, etc.
  • 24. Future work: Integrate the Twitter API to retrieve tweets from accounts. Use an alternative feature extraction method for the Streaming K-means task.
  • 25. References: [1] http://help.sentiment140.com/for-students/ [2] Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015. [3] Bogomolny, A. Benford's Law and Zipf's Law. http://www.cut-the-knot.org/doyouknow/zipfLaw.shtml
  • 26. For any questions, contact me at: davide.nardone@live.it