These are slides from a lecture given at the UC Berkeley School of Information for the Analyzing Big Data with Twitter class. A video of the talk can be found at http://blogs.ischool.berkeley.edu/i290-abdt-s12/2012/08/31/video-lecture-posted-intro-to-hadoop/
1. Intro To Hadoop
Bill Graham - @billgraham
Data Systems Engineer, Analytics Infrastructure
Info 290 - Analyzing Big Data With Twitter
UC Berkeley Information School
September 2012
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
2. Outline
• What is Big Data?
• Hadoop
• HDFS
• MapReduce
• Twitter Analytics and Hadoop
3. What is big data?
• A bunch of data?
• An industry?
• An expertise?
• A trend?
• A cliché?
4. Wikipedia big data
In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.
Source: http://en.wikipedia.org/wiki/Big_data
10. How big is big?
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user data + 15 TB/day
• 2009: eBay has 6.5 PB user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day
11. That’s a lot of data
Credit: http://www.flickr.com/photos/19779889@N00/1367404058/
21. No really, what do you do with it?
• User behavior analysis
• AB test analysis
• Ad targeting
• Trending topics
• User and topic modeling
• Recommendations
• And more...
28. Parallel processing is complicated
• How do we assign tasks to workers?
• What if we have more tasks than slots?
• What happens when tasks fail?
• How do you handle distributed synchronization?
Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/
29. Data storage is not trivial
• Data volumes are massive
• Reliably storing PBs of data is challenging
• Disk/hardware/network failures
• Probability of a failure event increases with the number of machines
For example: 1000 hosts, each with 10 disks; a disk lasts 3 years. How many failures per day? (rough estimate below)
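A back-of-the-envelope estimate, assuming failures are spread evenly over each disk's 3-year lifetime:
1,000 hosts × 10 disks = 10,000 disks
10,000 disks ÷ (3 years × 365 days/year) ≈ 9 disk failures per day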
30. Hadoop cluster
Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)
38. Hadoop origins
• Hadoop is an open-source implementation based on GFS and MapReduce from Google
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System
• Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
48. HDFS is...
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
55. HDFS - files and blocks
• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files and blocks
• The SecondaryNameNode (SNN) holds a backup of the NN data
• DataNodes (DN) store and serve blocks
56. Replication
• Multiple copies of a block are stored
• Replication strategy:
• Copy #1 on another node on same rack
• Copy #2 on another node on different rack
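Block placement and replication for a file can be inspected with fsck, for example (the path here is illustrative):
$ bin/hadoop fsck /user/alice/data.txt -files -blocks -racks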
57. HDFS - writes
[Diagram: write path for a single block. The client writes the block of a file to DataNodes on slave nodes spread across Rack #1 and Rack #2, with the NameNode (master) coordinating placement. A client writes multiple blocks in parallel.]
58. HDFS - reads
[Diagram: read path. The client fetches a file's blocks (block 1, block 2, ... block N) from DataNodes on slave nodes, with the NameNode (master) supplying block locations; the client reads multiple blocks in parallel and re-assembles them into a file.]
59. What about DataNode failures?
• DNs check in with the NN to report health
• Upon failure NN orders DNs to replicate under-replicated blocks
Credit: http://www.flickr.com/photos/18536761@N00/367661087/
64. MapReduce is...
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop
72. Typical large-data problem
• Iterate over a large number of records (Map)
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output (Reduce)
(Dean and Ghemawat, OSDI 2004)
78. MapReduce paradigm
• Implement two functions:
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else*
• Values with the same key go to the same reducer
79. MapReduce Flow
[Figure: input pairs (k1 v1) ... (k6 v6) pass through four parallel map tasks, which emit intermediate pairs a 1, b 2, c 3, c 6, a 5, c 2, b 7, c 8. Shuffle and sort aggregates values by key: a [1, 5], b [2, 7], c [2, 3, 6, 8]. Three parallel reduce tasks then emit final pairs (r1 s1), (r2 s2), (r3 s3).]
80. MapReduce - word count example
function map(String name, String document):
  for each word w in document:
    emit(w, 1)

function reduce(String word, Iterator partialCounts):
  totalCount = 0
  for each count in partialCounts:
    totalCount += count
  emit(word, totalCount)
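A concrete trace (illustrative input, not from the deck): for two documents "a b a" and "b c",
map emits: (a, 1) (b, 1) (a, 1) (b, 1) (c, 1)
shuffle/sort groups: a -> [1, 1], b -> [1, 1], c -> [1]
reduce emits: (a, 2) (b, 2) (c, 1)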
88. MapReduce paradigm - part 2
• There’s more!
• Partitioners decide which key goes to which reducer
• partition(k', numPartitions) -> partNumber
• Divides the key space into per-reducer chunks
• Default is hash-based
• Combiners can combine Mapper output before sending to the reducer
• Reduce(k2, list(v2)) -> list(v3)
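To make the partitioner idea concrete, here is a minimal sketch using the same old org.apache.hadoop.mapred API as the word count code later in this deck (the class name and routing rule are made up for illustration; jobs get the default HashPartitioner if nothing is set):

public static class FirstLetterPartitioner extends MapReduceBase
    implements Partitioner<Text, IntWritable> {

  // Send every word that starts with the same character to the same reducer.
  // The default HashPartitioner instead returns
  // (key.hashCode() & Integer.MAX_VALUE) % numPartitions.
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String word = key.toString();
    int firstChar = word.isEmpty() ? 0 : word.charAt(0);
    return firstChar % numPartitions;
  }
}

A custom partitioner is registered on the job via conf.setPartitionerClass(FirstLetterPartitioner.class).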
89. MapReduce flow
[Figure: the same flow with combine and partition steps added on the map side. The four map tasks emit a 1, b 2, c 3, c 6, a 5, c 2, b 7, c 8; the combiner on the first mapper merges c 3 and c 6 into c 9; partitioners then route keys to reducers, and shuffle and sort aggregates values by key: a [1, 5], b [2, 7], c [2, 9, 8]. Three reduce tasks emit (r1 s1), (r2 s2), (r3 s3).]
96. MapReduce additional details
• Reduce starts after all mappers complete
• Mapper output gets written to disk
• Intermediate data can be copied sooner
• Reducer gets keys in sorted order
• Keys not sorted across reducers
• Global sort requires 1 reducer or smart partitioning
103. MapReduce - jobs and tasks
• Job: a user-submitted map and reduce implementation to apply to a data set
• Task: a single mapper or reducer task
• Failed tasks get retried automatically
• Tasks run local to their data, ideally
• JobTracker (JT) manages job submission and task delegation
• TaskTrackers (TT) ask for work and execute tasks
111. What about failed tasks?
• Tasks will fail
• JT will retry failed tasks up to N attempts
• After N failed attempts for a task, the job fails
• Some tasks are slower than others
• Speculative execution is the JT starting up multiple copies of the same task
• First one to complete wins, the other is killed
Credit: http://www.flickr.com/photos/phobia/2308371224/
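Speculative execution can be toggled per job on the JobConf in the old API; a small sketch (defaults are on):

conf.setSpeculativeExecution(false);        // disable for both map and reduce tasks
conf.setMapSpeculativeExecution(true);      // or control the map side...
conf.setReduceSpeculativeExecution(false);  // ...and the reduce side separately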
116. MapReduce data locality
• Move computation to the data
• Moving data between nodes has a cost
• MapReduce tries to schedule tasks on nodes with the data
• When that's not possible, the TT has to fetch the data from a DN
142. MapReduce - Counters are...
• A distributed count of events during a job
• A way to indicate job metrics without logging
• Your friend
• Bad:
System.out.println("Couldn't parse value");
• Good:
reporter.incrCounter(BadParseEnum, 1L);
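A minimal sketch of the "good" pattern in context, using the old-API Reporter (the enum name and tab-separated input layout are made up for illustration):

public static class ParsingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Counter values are summed across all tasks and reported with the job's counters
  public static enum ParseCounters { BAD_PARSE }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String[] fields = value.toString().split("\t");
    try {
      output.collect(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
    } catch (RuntimeException e) {  // missing field or non-numeric count
      reporter.incrCounter(ParseCounters.BAD_PARSE, 1L);
    }
  }
}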
143. MapReduce - word count mapper
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // Tokenize the input line and emit (word, 1) for each token
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
144. MapReduce - word count reducer
public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key,
                     Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    // Sum the partial counts for this word and emit (word, totalCount)
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
145. MapReduce - word count main
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);  // the reducer doubles as a combiner
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}
146. MapReduce - running a job
• To run word count, add files to HDFS and do:
$ bin/hadoop jar wordcount.jar org.myorg.WordCount input_dir output_dir
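An illustrative end-to-end run (local file names and output file names are examples, not from the deck):

$ bin/hadoop fs -mkdir input_dir
$ bin/hadoop fs -put local_docs/*.txt input_dir
$ bin/hadoop jar wordcount.jar org.myorg.WordCount input_dir output_dir
$ bin/hadoop fs -cat output_dir/part-00000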
151. MapReduce is good for...
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
155. MapReduce is ok for...
• Iterative jobs (e.g., graph algorithms)
• Each iteration must read/write data to disk
• IO and latency cost of an iteration is high
162. MapReduce is not good for...
• Jobs that need shared state/coordination
• Tasks are shared-nothing
• Shared-state requires scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
171. Running Hadoop
• Multiple options
• On your local machine (standalone or pseudo-distributed)
• Local with a virtual machine
• On the cloud (e.g., Amazon EC2)
• In your own datacenter
179. Cloudera VM
• Virtual machine with Hadoop and related technologies pre-loaded
• Great tool for learning Hadoop
• Eases the pain of downloading/installing
• Pre-loaded with sample data and jobs
• Documented tutorials
• VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
• Tutorial: https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial
181. Multiple teams use Hadoop
• Analytics
• Revenue
• Personalization & Recommendations
• Growth
188. Twitter Analytics data flow
[Diagram, built up across slides 182-188. Components: Production Hosts emit log events and application data through Scribe Aggregators; a Staging Hadoop Cluster and Log Mover land the logs in HDFS on the Main Hadoop DW; Third Party Imports, a Distributed Crawler, and MySQL/Gizzard (social graph, Tweets, user profiles) are additional sources; Crane, Oink, and Rasvelg move and process the data, with HCatalog added for metadata; results reach HBase, Vertica, and MySQL, which back the Analytics Web Tools used by analysts, engineers, PMs, and sales.]
194. Example: active users
[Diagram, built up across slides 189-194, showing the job DAG: Scribe delivers web_events and sms_events from Production Hosts via the log mover (through the staging cluster) into the Main Hadoop DW; an Oink/Pig job cleanses, filters, transforms, geo-lookups, unions, and distincts them into user_sessions; Crane imports user_profiles from MySQL/Gizzard; Rasvelg joins, groups, and counts to produce aggregations such as active_by_geo, active_by_device, and active_by_client, which land in MySQL and Vertica behind the Analytics Dashboards.]