SlideShare a Scribd company logo
1 of 47
1
Analyzing Twitter Data with Hadoop
Open Analytics Summit, March 2013
Joey Echeverria | Principal Solutions Architect
joey@cloudera.com | @fwiffo
©2013 Cloudera, Inc.
About Joey
• Principal Solutions Architect
• 2 years @ Cloudera
• 5 years of Hadoop
• Local
2
Analyzing Twitter Data with Hadoop
BUILDING A BIG DATA SOLUTION
3 ©2013 Cloudera, Inc.
Big Data
• Big
• Larger volume than you’ve handled before
• No litmus test
• High value, under utilized
• Data
• Structured
• Unstructured
• Semi-structured
• Hadoop
• Distributed file system
• Distributed, batch computation
4 ©2013 Cloudera, Inc.
Data Management Systems
5 ©2013 Cloudera, Inc.
Data Source Data Storage
Data
Ingestion
Data
Processing
Relational Data Management Systems
6 ©2013 Cloudera, Inc.
Data Source RDBMSETL
Reporting
A Canonical Hadoop Architecture
7 ©2013 Cloudera, Inc.
Data Source HDFSFlume
Hive
(Impala)
Analyzing Twitter Data with Hadoop
AN EXAMPLE USE CASE
8 ©2013 Cloudera, Inc.
Analyzing Twitter
• Social media popular with marketing teams
• Twitter is an effective tool for promotion
• Who is influential?
• Tweets
• Followers
• Retweets
• Similar to e-mail forwarding
• Which twitter user gets the most retweets?
• Who is influential in our industry?
9 ©2013 Cloudera, Inc.
Analyzing Twitter Data with Hadoop
HOW DO WE ANSWER THESE
QUESTIONS?
10 ©2013 Cloudera, Inc.
Techniques
• SQL
• Filtering
• Aggregation
• Sorting
• Complex data
• Deeply nested
• Variable schema
11
Architecture
12 ©2013 Cloudera, Inc.
Twitter
HDFSFlume Hive
Custom
Flume
Source
Sink to
HDFS
JSON SerDe
Parses Data
Oozie
Add
Partitions
Hourly
Analyzing Twitter Data with Hadoop
TWITTER SOURCE
13 ©2013 Cloudera, Inc.
Flume
• Streaming data flow
• Sources
• Push or pull
• Sinks
• Event based
14 ©2013 Cloudera, Inc.
Pulling Data From Twitter
• Custom source, using twitter4j
• Sources process data as discrete events
Loading Data Into HDFS
• HDFS Sink comes stock with Flume
• Easily separate files by creation time
• hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
Flume Source
17 ©2013 Cloudera, Inc.
public class TwitterSource extends AbstractSource
implements EventDrivenSource, Configurable {
...
// The initialization method for the Source. The context contains all
// the Flume configuration info
@Override
public void configure(Context context) {
...
}
...
// Start processing events. Uses the Twitter Streaming API to sample
// Twitter, and process tweets.
@Override
public void start() {
...
}
...
// Stops Source's event processing and shuts down the Twitter stream.
@Override
public void stop() {
...
}
}
Twitter API
• Callback mechanism for catching new tweets
18 ©2013 Cloudera, Inc.
/** The actual Twitter stream. It's set up to collect raw JSON data */
private final TwitterStream twitterStream = new TwitterStreamFactory(
new ConfigurationBuilder().setJSONStoreEnabled(true).build())
.getInstance();
...
// The StatusListener is a twitter4j API that can be added to a stream,
// and will call a method every time a message is sent to the stream.
StatusListener listener = new StatusListener() {
// The onStatus method is executed every time a new tweet comes in.
public void onStatus(Status status) {
...
}
}
...
// Set up the stream's listener (defined above), and set any necessary
// security information.
twitterStream.addListener(listener);
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
AccessToken token = new AccessToken(accessToken, accessTokenSecret);
twitterStream.setOAuthAccessToken(token);
JSON Data
• JSON data is processed as an event and written to
HDFS
19 ©2013 Cloudera, Inc.
public void onStatus(Status status) {
// The EventBuilder is used to build an event using the headers and
// the raw JSON of a tweet
headers.put("timestamp", String.valueOf(
status.getCreatedAt().getTime()));
Event event = EventBuilder.withBody(
DataObjectFactory.getRawJSON(status).getBytes(), headers);
channel.processEvent(event);
}
Analyzing Twitter Data with Hadoop
FLUME DEMO
20 ©2013 Cloudera, Inc.
Analyzing Twitter Data with Hadoop
HIVE
21 ©2013 Cloudera, Inc.
What is Hive?
• Created at Facebook
• HiveQL
• SQL like interface
• Hive interpreter
converts HiveQL to
MapReduce code
• Returns results to the
client
22 ©2013 Cloudera, Inc.
Hive Details
• Schema on read
• Scalar types (int, float, double, boolean, string)
• Complex types (struct, map, array)
• Metastore contains table definitions
• Stored in a relational database
• Similar to catalog tables in other DBs
23
Complex Data
24 ©2013 Cloudera, Inc.
SELECT
t.retweet_screen_name,
sum(retweets) AS total_retweets,
count(*) AS tweet_count
FROM (SELECT
retweeted_status.user.screen_name AS retweet_screen_name,
retweeted_status.text,
max(retweeted_status.retweet_count) AS retweets
FROM tweets
GROUP BY
retweeted_status.user.screen_name,
retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
Analyzing Twitter Data with Hadoop
JSON INTERLUDE
25 ©2013 Cloudera, Inc.
What is JSON?
• Complex, semi-structured data
• Based on JavaScript’s data syntax
• Rich, nested data types:
• number
• string
• Array
• object
• true, false
• null
26 ©2013 Cloudera, Inc.
What is JSON?
27 ©2013 Cloudera, Inc.
{
"retweeted_status": {
"contributors": null,
"text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest
alternative routes when a road is clogged. #bigdata",
"retweeted": false,
"entities": {
"hashtags": [
{
"text": "Crowdsourcing",
"indices": [0, 14]
},
{
"text": "bigdata",
"indices": [129,137]
}
],
"user_mentions": []
}
}
}
Hive Serializers and Deserializers
• Instructs Hive on how to interpret data
• JSONSerDe
28 ©2013 Cloudera, Inc.
Analyzing Twitter Data with Hadoop
HIVE DEMO
29 ©2013 Cloudera, Inc.
Analyzing Twitter Data with Hadoop
IT’S A TRAP!
30 ©2013 Cloudera, Inc.
Photo from http://www.flickr.com/photos/vanf/6798065626/ Some rights reserved
Not a Database
31 ©2013 Cloudera, Inc.
RDBMS Hive
Language
Generally >= SQL-92
Subset of SQL-92 plus
Hive specific
extensions
Update Capabilities
INSERT, UPDATE,
DELETE
INSERT OVERWRITE
no UPDATE, DELETE
Transactions Yes No
Latency Sub-second Minutes
Indexes Yes Yes
Data size Terabytes Petabytes
Analyzing Twitter Data with Hadoop
IMPALA ASIDE
32 ©2013 Cloudera, Inc.
Cloudera Impala
33
Real-Time Query for Data Stored in Hadoop.
Supports Hive SQL
4-30X faster than Hive over MapReduce
Uses existing drivers, integrates with existing
metastore, works with leading BI tools
Flexible, cost-effective, no lock-in
Deploy & operate with
Cloudera Enterprise RTQ
Supports multiple storage engines &
file formats
©2013 Cloudera, Inc.
Benefits of Cloudera Impala
34
Real-Time Query for Data Stored in Hadoop
• Real-time queries run directly on source data
• No ETL delays
• No jumping between data silos
• No double storage with EDW/RDBMS
• Unlock analysis on more data
• No need to create and maintain complex ETL between systems
• No need to preplan schemas
• All data available for interactive queries
• No loss of fidelity from fixed data schemas
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos
©2013 Cloudera, Inc.
Cloudera Impala Details
35 ©2013 Cloudera, Inc.
HDFS DN
Query Exec Engine
Query Coordinator
Query Planner
HBase
ODBC
SQL App
HDFS DN
Query Exec Engine
Query Coordinator
Query Planner
HBaseHDFS DN
Query Exec Engine
Query Coordinator
Query Planner
HBase
Fully MPP
Distributed
Local Direct Reads
State Store
HDFS NN
Hive
Metastore YARN
Common Hive SQL and interface
Unified metadata and scheduler
Low-latency scheduler and cache
(low-impact failures)
Analyzing Twitter Data with Hadoop
OOZIE AUTOMATION
36 ©2013 Cloudera, Inc.
Oozie: Everything in its Right Place
Oozie for Partition Management
• Once an hour, add a partition
• Takes advantage of advanced Hive functionality
Analyzing Twitter Data with Hadoop
OOZIE DEMO
39 ©2013 Cloudera, Inc.
Analyzing Twitter Data with Hadoop
PUTTING IT ALL TOGETHER
40 ©2013 Cloudera, Inc.
Complete Architecture
41 ©2013 Cloudera, Inc.
Twitter
HDFSFlume Hive
Custom
Flume
Source
Sink to
HDFS
JSON SerDe
Parses Data
Oozie
Add
Partitions
Hourly
Analyzing Twitter Data with Hadoop
MORE DEMOS
42 ©2013 Cloudera, Inc.
What next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
• https://ccp.cloudera.com/display/SUPPORT/Cloudera+Ma
nager+Free+Edition+Demo+VM
• Clone the source repo
• https://github.com/cloudera/cdh-twitter-example
My personal preference
• Cloudera Manager
• https://ccp.cloudera.com/display/SUPPORT/Downloads
• Free up to 50 unlimited nodes!
Shout Out
• Jon Natkins
• @nattyice
• Blog posts
• http://blog.cloudera.com/blog/2013/09/analyzing-twitter-
data-with-hadoop/
• http://blog.cloudera.com/blog/2013/10/analyzing-twitter-
data-with-hadoop-part-2-gathering-data-with-flume/
• http://blog.cloudera.com/blog/2013/11/analyzing-twitter-
data-with-hadoop-part-3-querying-semi-structured-data-
with-hive/
Questions?
• Contact me!
• Joey Echeverria
• joey@cloudera.com
• @fwiffo
• We’re hiring!
47 ©2013 Cloudera, Inc.

More Related Content

More from Joey Echeverria

Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoopJoey Echeverria
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itchJoey Echeverria
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computingJoey Echeverria
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real worldJoey Echeverria
 

More from Joey Echeverria (6)

Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoop
 
Big data security
Big data securityBig data security
Big data security
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itch
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real world
 

Recently uploaded

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 

Recently uploaded (20)

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 

Analyzing Twitter Data with Hadoop

  • 1. 1 Analyzing Twitter Data with Hadoop Open Analytics Summit, March 2013 Joey Echeverria | Principal Solutions Architect joey@cloudera.com | @fwiffo ©2013 Cloudera, Inc.
  • 2. About Joey • Principal Solutions Architect • 2 years @ Cloudera • 5 years of Hadoop • Local 2
  • 3. Analyzing Twitter Data with Hadoop BUILDING A BIG DATA SOLUTION 3 ©2013 Cloudera, Inc.
  • 4. Big Data • Big • Larger volume than you’ve handled before • No litmus test • High value, under utilized • Data • Structured • Unstructured • Semi-structured • Hadoop • Distributed file system • Distributed, batch computation 4 ©2013 Cloudera, Inc.
  • 5. Data Management Systems 5 ©2013 Cloudera, Inc. Data Source Data Storage Data Ingestion Data Processing
  • 6. Relational Data Management Systems 6 ©2013 Cloudera, Inc. Data Source RDBMSETL Reporting
  • 7. A Canonical Hadoop Architecture 7 ©2013 Cloudera, Inc. Data Source HDFSFlume Hive (Impala)
  • 8. Analyzing Twitter Data with Hadoop AN EXAMPLE USE CASE 8 ©2013 Cloudera, Inc.
  • 9. Analyzing Twitter • Social media popular with marketing teams • Twitter is an effective tool for promotion • Who is influential? • Tweets • Followers • Retweets • Similar to e-mail forwarding • Which twitter user gets the most retweets? • Who is influential in our industry? 9 ©2013 Cloudera, Inc.
  • 10. Analyzing Twitter Data with Hadoop HOW DO WE ANSWER THESE QUESTIONS? 10 ©2013 Cloudera, Inc.
  • 11. Techniques • SQL • Filtering • Aggregation • Sorting • Complex data • Deeply nested • Variable schema 11
  • 12. Architecture 12 ©2013 Cloudera, Inc. Twitter HDFSFlume Hive Custom Flume Source Sink to HDFS JSON SerDe Parses Data Oozie Add Partitions Hourly
  • 13. Analyzing Twitter Data with Hadoop TWITTER SOURCE 13 ©2013 Cloudera, Inc.
  • 14. Flume • Streaming data flow • Sources • Push or pull • Sinks • Event based 14 ©2013 Cloudera, Inc.
  • 15. Pulling Data From Twitter • Custom source, using twitter4j • Sources process data as discrete events
  • 16. Loading Data Into HDFS • HDFS Sink comes stock with Flume • Easily separate files by creation time • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
  • 17. Flume Source 17 ©2013 Cloudera, Inc. public class TwitterSource extends AbstractSource implements EventDrivenSource, Configurable { ... // The initialization method for the Source. The context contains all // the Flume configuration info @Override public void configure(Context context) { ... } ... // Start processing events. Uses the Twitter Streaming API to sample // Twitter, and process tweets. @Override public void start() { ... } ... // Stops Source's event processing and shuts down the Twitter stream. @Override public void stop() { ... } }
  • 18. Twitter API • Callback mechanism for catching new tweets 18 ©2013 Cloudera, Inc. /** The actual Twitter stream. It's set up to collect raw JSON data */ private final TwitterStream twitterStream = new TwitterStreamFactory( new ConfigurationBuilder().setJSONStoreEnabled(true).build()) .getInstance(); ... // The StatusListener is a twitter4j API that can be added to a stream, // and will call a method every time a message is sent to the stream. StatusListener listener = new StatusListener() { // The onStatus method is executed every time a new tweet comes in. public void onStatus(Status status) { ... } } ... // Set up the stream's listener (defined above), and set any necessary // security information. twitterStream.addListener(listener); twitterStream.setOAuthConsumer(consumerKey, consumerSecret); AccessToken token = new AccessToken(accessToken, accessTokenSecret); twitterStream.setOAuthAccessToken(token);
  • 19. JSON Data • JSON data is processed as an event and written to HDFS 19 ©2013 Cloudera, Inc. public void onStatus(Status status) { // The EventBuilder is used to build an event using the headers and // the raw JSON of a tweet headers.put("timestamp", String.valueOf( status.getCreatedAt().getTime())); Event event = EventBuilder.withBody( DataObjectFactory.getRawJSON(status).getBytes(), headers); channel.processEvent(event); }
  • 20. Analyzing Twitter Data with Hadoop FLUME DEMO 20 ©2013 Cloudera, Inc.
  • 21. Analyzing Twitter Data with Hadoop HIVE 21 ©2013 Cloudera, Inc.
  • 22. What is Hive? • Created at Facebook • HiveQL • SQL like interface • Hive interpreter converts HiveQL to MapReduce code • Returns results to the client 22 ©2013 Cloudera, Inc.
  • 23. Hive Details • Schema on read • Scalar types (int, float, double, boolean, string) • Complex types (struct, map, array) • Metastore contains table definitions • Stored in a relational database • Similar to catalog tables in other DBs 23
  • 24. Complex Data 24 ©2013 Cloudera, Inc. SELECT t.retweet_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count FROM (SELECT retweeted_status.user.screen_name AS retweet_screen_name, retweeted_status.text, max(retweeted_status.retweet_count) AS retweets FROM tweets GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t GROUP BY t.retweet_screen_name ORDER BY total_retweets DESC LIMIT 10;
  • 25. Analyzing Twitter Data with Hadoop JSON INTERLUDE 25 ©2013 Cloudera, Inc.
  • 26. What is JSON? • Complex, semi-structured data • Based on JavaScript’s data syntax • Rich, nested data types: • number • string • Array • object • true, false • null 26 ©2013 Cloudera, Inc.
  • 27. What is JSON? 27 ©2013 Cloudera, Inc. { "retweeted_status": { "contributors": null, "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata", "retweeted": false, "entities": { "hashtags": [ { "text": "Crowdsourcing", "indices": [0, 14] }, { "text": "bigdata", "indices": [129,137] } ], "user_mentions": [] } } }
  • 28. Hive Serializers and Deserializers • Instructs Hive on how to interpret data • JSONSerDe 28 ©2013 Cloudera, Inc.
  • 29. Analyzing Twitter Data with Hadoop HIVE DEMO 29 ©2013 Cloudera, Inc.
  • 30. Analyzing Twitter Data with Hadoop IT’S A TRAP! 30 ©2013 Cloudera, Inc. Photo from http://www.flickr.com/photos/vanf/6798065626/ Some rights reserved
  • 31. Not a Database 31 ©2013 Cloudera, Inc. RDBMS Hive Language Generally >= SQL-92 Subset of SQL-92 plus Hive specific extensions Update Capabilities INSERT, UPDATE, DELETE INSERT OVERWRITE no UPDATE, DELETE Transactions Yes No Latency Sub-second Minutes Indexes Yes Yes Data size Terabytes Petabytes
  • 32. Analyzing Twitter Data with Hadoop IMPALA ASIDE 32 ©2013 Cloudera, Inc.
  • 33. Cloudera Impala 33 Real-Time Query for Data Stored in Hadoop. Supports Hive SQL 4-30X faster than Hive over MapReduce Uses existing drivers, integrates with existing metastore, works with leading BI tools Flexible, cost-effective, no lock-in Deploy & operate with Cloudera Enterprise RTQ Supports multiple storage engines & file formats ©2013 Cloudera, Inc.
  • 34. Benefits of Cloudera Impala 34 Real-Time Query for Data Stored in Hadoop • Real-time queries run directly on source data • No ETL delays • No jumping between data silos • No double storage with EDW/RDBMS • Unlock analysis on more data • No need to create and maintain complex ETL between systems • No need to preplan schemas • All data available for interactive queries • No loss of fidelity from fixed data schemas • Single metadata store from origination through analysis • No need to hunt through multiple data silos ©2013 Cloudera, Inc.
  • 35. Cloudera Impala Details 35 ©2013 Cloudera, Inc. HDFS DN Query Exec Engine Query Coordinator Query Planner HBase ODBC SQL App HDFS DN Query Exec Engine Query Coordinator Query Planner HBaseHDFS DN Query Exec Engine Query Coordinator Query Planner HBase Fully MPP Distributed Local Direct Reads State Store HDFS NN Hive Metastore YARN Common Hive SQL and interface Unified metadata and scheduler Low-latency scheduler and cache (low-impact failures)
  • 36. Analyzing Twitter Data with Hadoop OOZIE AUTOMATION 36 ©2013 Cloudera, Inc.
  • 37. Oozie: Everything in its Right Place
  • 38. Oozie for Partition Management • Once an hour, add a partition • Takes advantage of advanced Hive functionality
  • 39. Analyzing Twitter Data with Hadoop OOZIE DEMO 39 ©2013 Cloudera, Inc.
  • 40. Analyzing Twitter Data with Hadoop PUTTING IT ALL TOGETHER 40 ©2013 Cloudera, Inc.
  • 41. Complete Architecture 41 ©2013 Cloudera, Inc. Twitter HDFSFlume Hive Custom Flume Source Sink to HDFS JSON SerDe Parses Data Oozie Add Partitions Hourly
  • 42. Analyzing Twitter Data with Hadoop MORE DEMOS 42 ©2013 Cloudera, Inc.
  • 43. What next? • Download Hadoop! • CDH available at www.cloudera.com • Cloudera provides pre-loaded VMs • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Ma nager+Free+Edition+Demo+VM • Clone the source repo • https://github.com/cloudera/cdh-twitter-example
  • 44. My personal preference • Cloudera Manager • https://ccp.cloudera.com/display/SUPPORT/Downloads • Free up to 50 unlimited nodes!
  • 45. Shout Out • Jon Natkins • @nattyice • Blog posts • http://blog.cloudera.com/blog/2013/09/analyzing-twitter- data-with-hadoop/ • http://blog.cloudera.com/blog/2013/10/analyzing-twitter- data-with-hadoop-part-2-gathering-data-with-flume/ • http://blog.cloudera.com/blog/2013/11/analyzing-twitter- data-with-hadoop-part-3-querying-semi-structured-data- with-hive/
  • 46. Questions? • Contact me! • Joey Echeverria • joey@cloudera.com • @fwiffo • We’re hiring!