SlideShare uma empresa Scribd logo
1 de 15
Sentiment Analysis using Hive
Secrets From the Pros
We will be starting at 11:03 PDT
Use the Chat Pane in GoToWebinar to Ask Questions!
Assess your level and learn new stuff
This webinar is intended for intermediate audiences
(familiar with Apache Hive and Hadoop, but not experts)
?
News Cycle for “Mortgage” 2008-
09
Mortgage- Crisis, Foreclosures, Fraud
-10
0
10
20
30
40
50
60
70
80
90
6/12/04 8/1/04 9/20/04 11/9/04 12/29/04 2/17/05 4/8/05 5/28/05
Crisis
Foreclosure
Fraud
Linear (Crisis)
Linear (Foreclosure)
Linear (Fraud)
# of records: 90M/partition
Partitions:
Month
Columns:
URL
Timestamp
Array of Memes
Links
Table: MemeTracker
36GB of JSON Data
AGENDA
This Webinar provides tips on doing basic sentiment analysis on
large data sets using Hive:
• Overview of Sentiment Analysis (SA)
• Hive UDFs useful for SA
• Demo, Guided Tutorial
• Developing advanced, custom SA Engines
Sentiment Analysis
Applications
Direct-- Call center
logs, Emails, Chat logs
Indirect-- Social
Media, Forums, Review websites
Gather Customer Feedback
Over time, geography
By customer, market
segments
Sentiment Analysis
Product / service decisions
Customer support
Marketing- messaging, offers
Customer retention, upsell
Use for Decision making
Sentiment Analysis
How to operationalize a Sentiment Analysis App
1.
Crawl, Scrape, API
calls, collect
2. Create
“Documents”
3. Pre-process
Data
4. Apply Language
Model, Extract
Sentiment
5. Integrate with
Mktg
Automn., CRM, C
CA, etc OLTP
6. Improve
Product, Better
CS, Targeted
Offers
Pre and Post Preprocessing
Hive Built-In Functions
Goal Input Data Output Data
Use this
Hive UDF
Tokenization (“Hello There! How are you?”)
(
(“Hello”, “There”)
, (“How”, “are”,
“you”)
)
sentences
Column (array) to rows [1, 2, 3]
1
2
3
explode
Navigating documents,
extracting fields
{"store":
{"fruit":[{"weight":8,"type":"apple"}
,{"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red
"}
},
"email":"amy@xyz.net",
"owner":"amy"
}
{"weight":8,"type":"
apple"}
get_json_object(
src_json.json,
'$.fruit[0]')
N-Gram
Language Models
Q: What is a language model?
A: A mathematical model that assigns probability to a sequence of m words
Q: What is “n-gram” model?
A: Probabilistic language model for predicting next word in a sequence of words
Q: What is an n-gram?
A: A contiguous sequence of n items from a given sequence of text
Eg: “Mary had a little lamb”
Bi-grams: “Mary had”, “had a”, “a little”, “little lamb”
N-Gram Language Model
Hive Built-In Functions
Goal Input Data Output Data
Use this
Hive UDF
Find important topics
using a stop word list,
trending topics
Collection of sentences
k most frequently occurring
n-grams
ngrams
Extract intelligence
around certain
keywords, pre-compute
search look aheads
Collection of sentences
k most frequently occuring
n-grams around a “context”
word. Eg: “Government
shutdown”
context_ngrams
Dataset used-- Meme Tracker
How MemeTracker.org creates the dataset
90 Million sources
900K news stories / day
Track 17M memes
# of records: 90M/partition
Partitions:
Month
Columns:
URL
Timestamp
Array of Memes
Links
Table: MemeTracker
6GB of Data / month
Analyze Sentiment on “Mortgage”
By Tracking How Memes spread, using Hive
What is a Meme?
“Government Shutdown”, “Affordable Care Act”, “Green Eggs and Ham”, etc
# of records: 90M/partition
Partitions:
Month
Columns:
URL
Timestamp
Array of Memes
Links
Table: MemeTracker
36GB of JSON Data
Demo
Hive’s Extensibility Framework
• There are many UDFs built into Hive
• For more advanced users Hive allows many
ways to extend the language
– SERDEs
– UDFs, UDAFs, and UDTFs
– Hive Streaming
How to access this Tutorial
• Create a free Qubole Account (www.qubole.com)
• Login  Click on “Analyze”  Look for “Tutorials”
tab at top of page
Summary
• Pre and post processing
– Use Hive
• Language Models
– Use pre-existing language models codified as Hive UDFs such as
ngrams and context_ngrams
– UDFs-- Build your own language model in java using Hive UDF
framework
– Hive Streaming-- Plug-in your existing language models or 3rd
party libraries
• Visualization
– Use a spreadsheet / BI reporting tool
THANK YOU
Managed Cluster Built-In Connectors Friendly User-Interface Dedicated Support
• 100% Managed Hadoop Cluster in the Cloud
• Auto-Scaling Cluster. Full Life-cycle Management
• +12 Connectors to Applications and Data Sources
• 14-Day Free Trial (free account available)
• 24/7 Customer Support
What’s Included?
 www.qubole.com/try 

Mais conteúdo relacionado

Mais procurados

Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Question Answering and Virtual Assistants with Deep Learning
Question Answering and Virtual Assistants with Deep LearningQuestion Answering and Virtual Assistants with Deep Learning
Question Answering and Virtual Assistants with Deep LearningLucidworks
 
Knowledge graph convolutional networks - London 2018
Knowledge graph convolutional networks - London 2018Knowledge graph convolutional networks - London 2018
Knowledge graph convolutional networks - London 2018Vaticle
 
How to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkHow to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkCaserta
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeDataWorks Summit
 
Crowd sourced intelligence built into search over hadoop
Crowd sourced intelligence built into search over hadoopCrowd sourced intelligence built into search over hadoop
Crowd sourced intelligence built into search over hadooplucenerevolution
 
Introducing GRAKN.AI
Introducing GRAKN.AIIntroducing GRAKN.AI
Introducing GRAKN.AIVaticle
 
Knowledge graph convolutional networks - Berlin 2019
Knowledge graph convolutional networks - Berlin 2019Knowledge graph convolutional networks - Berlin 2019
Knowledge graph convolutional networks - Berlin 2019Vaticle
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Cheat sheets for data scientists
Cheat sheets for data scientistsCheat sheets for data scientists
Cheat sheets for data scientistsAjay Ohri
 
Social Network Analysis with Spark
Social Network Analysis with SparkSocial Network Analysis with Spark
Social Network Analysis with SparkGhulam Imaduddin
 
Outlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopOutlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopPranab Ghosh
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction SystemBigDataCloud
 
Notey's talk 20160923
Notey's talk 20160923Notey's talk 20160923
Notey's talk 20160923Rosanna Man
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Simplilearn
 

Mais procurados (20)

resume_MH
resume_MHresume_MH
resume_MH
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Question Answering and Virtual Assistants with Deep Learning
Question Answering and Virtual Assistants with Deep LearningQuestion Answering and Virtual Assistants with Deep Learning
Question Answering and Virtual Assistants with Deep Learning
 
Knowledge graph convolutional networks - London 2018
Knowledge graph convolutional networks - London 2018Knowledge graph convolutional networks - London 2018
Knowledge graph convolutional networks - London 2018
 
How to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkHow to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on Spark
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
 
Crowd sourced intelligence built into search over hadoop
Crowd sourced intelligence built into search over hadoopCrowd sourced intelligence built into search over hadoop
Crowd sourced intelligence built into search over hadoop
 
Introducing GRAKN.AI
Introducing GRAKN.AIIntroducing GRAKN.AI
Introducing GRAKN.AI
 
Vespa, A Tour
Vespa, A TourVespa, A Tour
Vespa, A Tour
 
Knowledge graph convolutional networks - Berlin 2019
Knowledge graph convolutional networks - Berlin 2019Knowledge graph convolutional networks - Berlin 2019
Knowledge graph convolutional networks - Berlin 2019
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Cheat sheets for data scientists
Cheat sheets for data scientistsCheat sheets for data scientists
Cheat sheets for data scientists
 
Social Network Analysis with Spark
Social Network Analysis with SparkSocial Network Analysis with Spark
Social Network Analysis with Spark
 
Outlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopOutlier and fraud detection using Hadoop
Outlier and fraud detection using Hadoop
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction System
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Notey's talk 20160923
Notey's talk 20160923Notey's talk 20160923
Notey's talk 20160923
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
 

Destaque

Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Lu Wei
 
Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1Vimal Suthar
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and SparkJongwook Woo
 
Traffic data analysis using HADOOP
Traffic data analysis using HADOOPTraffic data analysis using HADOOP
Traffic data analysis using HADOOPKirthan S Holla
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock AnalysisVaibhav Jain
 
TRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOPTRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOPKirthan S Holla
 
Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013SATOSHI TAGOMORI
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Jonathan Seidman
 
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...Hortonworks
 
HW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopHW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopCloudera, Inc.
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraPiotr Kolaczkowski
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
Twitter sentiment analysis project report
Twitter sentiment analysis project reportTwitter sentiment analysis project report
Twitter sentiment analysis project reportBharat Khanna
 
Practical sentiment analysis
Practical sentiment analysisPractical sentiment analysis
Practical sentiment analysisDiana Maynard
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 

Destaque (20)

Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]
 
Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Hadoop data analysis
Hadoop data analysisHadoop data analysis
Hadoop data analysis
 
Traffic data analysis using HADOOP
Traffic data analysis using HADOOPTraffic data analysis using HADOOP
Traffic data analysis using HADOOP
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock Analysis
 
TRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOPTRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOP
 
Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...
Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Pla...
 
HW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopHW09 Social network analysis with Hadoop
HW09 Social network analysis with Hadoop
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Twitter sentiment analysis project report
Twitter sentiment analysis project reportTwitter sentiment analysis project report
Twitter sentiment analysis project report
 
Practical sentiment analysis
Practical sentiment analysisPractical sentiment analysis
Practical sentiment analysis
 
Video Analysis in Hadoop
Video Analysis in HadoopVideo Analysis in Hadoop
Video Analysis in Hadoop
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 

Semelhante a Basic Sentiment Analysis using Hive

Filtering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingFiltering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingCloud Elements
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Eli White
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku LepistoCOSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku LepistoAmazon Web Services
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
An AI Use Case: Market Event Impact Determination via Sentiment and Emotion A...
An AI Use Case: Market Event Impact Determination via Sentiment and Emotion A...An AI Use Case: Market Event Impact Determination via Sentiment and Emotion A...
An AI Use Case: Market Event Impact Determination via Sentiment and Emotion A...Databricks
 
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache SparkAI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache SparkValue Amplify Consulting
 
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! Sumeet Singh
 
Big Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudBig Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudAmazon Web Services
 
Text Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / DatabaseText Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / DatabaseNaveen Kumar
 
Big data on_aws in korea by abhishek sinha (lunch and learn)
Big data on_aws in korea by abhishek sinha (lunch and learn)Big data on_aws in korea by abhishek sinha (lunch and learn)
Big data on_aws in korea by abhishek sinha (lunch and learn)Amazon Web Services Korea
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Narrative Mind Lessons Learned
Narrative Mind Lessons LearnedNarrative Mind Lessons Learned
Narrative Mind Lessons LearnedH4Diadmin
 
Narrative Mind Lessons Learned H4D Stanford 2016
Narrative Mind Lessons Learned H4D Stanford 2016Narrative Mind Lessons Learned H4D Stanford 2016
Narrative Mind Lessons Learned H4D Stanford 2016Stanford University
 
Take Action: The New Reality of Data-Driven Business
Take Action: The New Reality of Data-Driven BusinessTake Action: The New Reality of Data-Driven Business
Take Action: The New Reality of Data-Driven BusinessInside Analysis
 

Semelhante a Basic Sentiment Analysis using Hive (20)

Filtering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingFiltering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media Streaming
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku LepistoCOSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
COSCUP - Open Source Engines Providing Big Data in the Cloud, Markku Lepisto
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
An AI Use Case: Market Event Impact Determination via Sentiment and Emotion A...
An AI Use Case: Market Event Impact Determination via Sentiment and Emotion A...An AI Use Case: Market Event Impact Determination via Sentiment and Emotion A...
An AI Use Case: Market Event Impact Determination via Sentiment and Emotion A...
 
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache SparkAI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
 
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
 
No sql databases
No sql databasesNo sql databases
No sql databases
 
Big data technologies with Case Study Finance and Healthcare
Big data technologies with Case Study Finance and HealthcareBig data technologies with Case Study Finance and Healthcare
Big data technologies with Case Study Finance and Healthcare
 
Big Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudBig Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS Cloud
 
Observability at Spotify
Observability at SpotifyObservability at Spotify
Observability at Spotify
 
Text Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / DatabaseText Analytics Online Knowledge Base / Database
Text Analytics Online Knowledge Base / Database
 
Big data on_aws in korea by abhishek sinha (lunch and learn)
Big data on_aws in korea by abhishek sinha (lunch and learn)Big data on_aws in korea by abhishek sinha (lunch and learn)
Big data on_aws in korea by abhishek sinha (lunch and learn)
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Narrative Mind Lessons Learned
Narrative Mind Lessons LearnedNarrative Mind Lessons Learned
Narrative Mind Lessons Learned
 
Narrative Mind Lessons Learned H4D Stanford 2016
Narrative Mind Lessons Learned H4D Stanford 2016Narrative Mind Lessons Learned H4D Stanford 2016
Narrative Mind Lessons Learned H4D Stanford 2016
 
Take Action: The New Reality of Data-Driven Business
Take Action: The New Reality of Data-Driven BusinessTake Action: The New Reality of Data-Driven Business
Take Action: The New Reality of Data-Driven Business
 
A Empresa na Era da Informação Extrema
A Empresa na Era da Informação ExtremaA Empresa na Era da Informação Extrema
A Empresa na Era da Informação Extrema
 

Mais de Qubole

Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Qubole
 
7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome ThemQubole
 
State of Big Data Adoption
State of Big Data AdoptionState of Big Data Adoption
State of Big Data AdoptionQubole
 
Big Data at Pinterest - Presented by Qubole
Big Data at Pinterest - Presented by QuboleBig Data at Pinterest - Presented by Qubole
Big Data at Pinterest - Presented by QuboleQubole
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance Qubole
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on YarnQubole
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConfQubole
 
Running Spark on Cloud
Running Spark on CloudRunning Spark on Cloud
Running Spark on CloudQubole
 
Qubole State of the Big Data Industry
Qubole State of the Big Data IndustryQubole State of the Big Data Industry
Qubole State of the Big Data IndustryQubole
 
Big Data Platform at Pinterest
Big Data Platform at PinterestBig Data Platform at Pinterest
Big Data Platform at PinterestQubole
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesQubole
 
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup Qubole
 
BIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleBIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleQubole
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveQubole
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 
Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataQubole
 
Expert Big Data Tips
Expert Big Data TipsExpert Big Data Tips
Expert Big Data TipsQubole
 
Big dataproposal
Big dataproposalBig dataproposal
Big dataproposalQubole
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloudQubole
 
Effective Hive Queries
Effective Hive QueriesEffective Hive Queries
Effective Hive QueriesQubole
 

Mais de Qubole (20)

Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
 
7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them
 
State of Big Data Adoption
State of Big Data AdoptionState of Big Data Adoption
State of Big Data Adoption
 
Big Data at Pinterest - Presented by Qubole
Big Data at Pinterest - Presented by QuboleBig Data at Pinterest - Presented by Qubole
Big Data at Pinterest - Presented by Qubole
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on Yarn
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 
Running Spark on Cloud
Running Spark on CloudRunning Spark on Cloud
Running Spark on Cloud
 
Qubole State of the Big Data Industry
Qubole State of the Big Data IndustryQubole State of the Big Data Industry
Qubole State of the Big Data Industry
 
Big Data Platform at Pinterest
Big Data Platform at PinterestBig Data Platform at Pinterest
Big Data Platform at Pinterest
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slides
 
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
 
BIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleBIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - Qubole
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big Data
 
Expert Big Data Tips
Expert Big Data TipsExpert Big Data Tips
Expert Big Data Tips
 
Big dataproposal
Big dataproposalBig dataproposal
Big dataproposal
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloud
 
Effective Hive Queries
Effective Hive QueriesEffective Hive Queries
Effective Hive Queries
 

Último

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Basic Sentiment Analysis using Hive

  • 1. Sentiment Analysis using Hive Secrets From the Pros We will be starting at 11:03 PDT Use the Chat Pane in GoToWebinar to Ask Questions! Assess your level and learn new stuff This webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts) ?
  • 2. News Cycle for “Mortgage” 2008- 09 Mortgage- Crisis, Foreclosures, Fraud -10 0 10 20 30 40 50 60 70 80 90 6/12/04 8/1/04 9/20/04 11/9/04 12/29/04 2/17/05 4/8/05 5/28/05 Crisis Foreclosure Fraud Linear (Crisis) Linear (Foreclosure) Linear (Fraud) # of records: 90M/partition Partitions: Month Columns: URL Timestamp Array of Memes Links Table: MemeTracker 36GB of JSON Data
  • 3. AGENDA This Webinar provides tips on doing basic sentiment analysis on large data sets using Hive: • Overview of Sentiment Analysis (SA) • Hive UDFs useful for SA • Demo, Guided Tutorial • Developing advanced, custom SA Engines
  • 4. Sentiment Analysis Applications Direct-- Call center logs, Emails, Chat logs Indirect-- Social Media, Forums, Review websites Gather Customer Feedback Over time, geography By customer, market segments Sentiment Analysis Product / service decisions Customer support Marketing- messaging, offers Customer retention, upsell Use for Decision making
  • 5. Sentiment Analysis How to operationalize a Sentiment Analysis App 1. Crawl, Scrape, API calls, collect 2. Create “Documents” 3. Pre-process Data 4. Apply Language Model, Extract Sentiment 5. Integrate with Mktg Automn., CRM, C CA, etc OLTP 6. Improve Product, Better CS, Targeted Offers
  • 6. Pre and Post Preprocessing Hive Built-In Functions Goal Input Data Output Data Use this Hive UDF Tokenization (“Hello There! How are you?”) ( (“Hello”, “There”) , (“How”, “are”, “you”) ) sentences Column (array) to rows [1, 2, 3] 1 2 3 explode Navigating documents, extracting fields {"store": {"fruit":[{"weight":8,"type":"apple"} ,{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red "} }, "email":"amy@xyz.net", "owner":"amy" } {"weight":8,"type":" apple"} get_json_object( src_json.json, '$.fruit[0]')
  • 7. N-Gram Language Models Q: What is a language model? A: A mathematical model that assigns probability to a sequence of m words Q: What is “n-gram” model? A: Probabilistic language model for predicting next word in a sequence of words Q: What is an n-gram? A: A contiguous sequence of n items from a given sequence of text Eg: “Mary had a little lamb” Bi-grams: “Mary had”, “had a”, “a little”, “little lamb”
  • 8. N-Gram Language Model Hive Built-In Functions Goal Input Data Output Data Use this Hive UDF Find important topics using a stop word list, trending topics Collection of sentences k most frequently occurring n-grams ngrams Extract intelligence around certain keywords, pre-compute search look aheads Collection of sentences k most frequently occuring n-grams around a “context” word. Eg: “Government shutdown” context_ngrams
  • 9. Dataset used-- Meme Tracker How MemeTracker.org creates the dataset 90 Million sources 900K news stories / day Track 17M memes # of records: 90M/partition Partitions: Month Columns: URL Timestamp Array of Memes Links Table: MemeTracker 6GB of Data / month
  • 10. Analyze Sentiment on “Mortgage” By Tracking How Memes spread, using Hive What is a Meme? “Government Shutdown”, “Affordable Care Act”, “Green Eggs and Ham”, etc # of records: 90M/partition Partitions: Month Columns: URL Timestamp Array of Memes Links Table: MemeTracker 36GB of JSON Data
  • 11. Demo
  • 12. Hive’s Extensibility Framework • There are many UDFs built into Hive • For more advanced users Hive allows many ways to extend the language – SERDEs – UDFs, UDAFs, and UDTFs – Hive Streaming
  • 13. How to access this Tutorial • Create a free Qubole Account (www.qubole.com) • Login  Click on “Analyze”  Look for “Tutorials” tab at top of page
  • 14. Summary • Pre and post processing – Use Hive • Language Models – Use pre-existing language models codified as Hive UDFs such as ngrams and context_ngrams – UDFs-- Build your own language model in java using Hive UDF framework – Hive Streaming-- Plug-in your existing language models or 3rd party libraries • Visualization – Use a spreadsheet / BI reporting tool
  • 15. THANK YOU Managed Cluster Built-In Connectors Friendly User-Interface Dedicated Support • 100% Managed Hadoop Cluster in the Cloud • Auto-Scaling Cluster. Full Life-cycle Management • +12 Connectors to Applications and Data Sources • 14-Day Free Trial (free account available) • 24/7 Customer Support What’s Included?  www.qubole.com/try 

Notas do Editor

  1. Great to model clicks and impressions and try and understand what a buyers intent is. Intent to purchase or churn.. Quality-- Banks, call centers,
  2. Great to model clicks and impressions and try and understand what a buyers intent is. Intent to purchase or churn.. Quality-- Banks, call centers,
  3. Information diffusionData is already gathered, documents created, memes extracted. Lot of work already done. Data ready for you.Can do this on your own on twitter feeds.
  4. Solutions– many..Framework– pre-processing --- applying model --- post processingChallenges: Scaling.