As Twitter's use of MapReduce rapidly expands, tracking usage on our clusters grows correspondingly more difficult. With an ever-increasing job load, and a reliance on higher-level abstractions such as Pig and Scalding, the utility of existing tools for viewing job history decreases rapidly, and extracting insights becomes a challenge. At Twitter, we created hRaven to fill this gap. hRaven archives the full history and metrics from all MapReduce jobs on our clusters, and strings together each job from a Pig or Scalding script execution into a combined flow. From this archive, we can easily derive aggregate resource utilization by user, pool, or application, while the historical trending of an individual application allows us to perform runtime optimization of resource scheduling. We will cover how hRaven provides a rich historical archive of MapReduce job execution, and how the data is structured into higher-level flows representing the job sequence for frameworks such as Pig, Scalding, and Hive. We will then explore how we mine hRaven data to account for Hadoop resource utilization, to optimize runtime scheduling, and to identify common anti-patterns in user jobs. Finally, we will look at the end-user experience, including Ambrose integration for flow visualization.
A Bird's-Eye View of Pig and Scalding Jobs with hRaven
1. A Bird's-Eye View of Pig and Scalding with hRaven
a tale by @gario and @joep
Hadoop Summit 2013
v1.2
2. About the authors
@Twitter #HadoopSummit2013
• Apache HBase PMC member and Committer
• Software Engineer @ Twitter
• Core Storage Team - Hadoop/HBase

• Software Engineer @ Twitter
• Engineering Manager, Hadoop/HBase team @ Twitter
3. Table of Contents
• Chapter 1: The Problem
• Chapter 2: Why hRaven?
• Chapter 3: How Does it Work?
  • 3a: Loading
  • 3b: Table structure / querying
• Chapter 4: Current Uses
• Appendix: Future Work
4. Chapter 1: The Problem
Illustration by Sirxlem (CC BY-NC-ND 3.0)
5. Chapter 1: Mismatched Abstractions
• Most users run Pig and Scalding scripts, not straight MapReduce
• The JobTracker UI shows individual jobs, not the DAGs of jobs generated by Pig and Scalding
7. Chapter 1: Questions
• How many Pig versus Scalding jobs do we run?
• What cluster capacity do jobs in my pool take?
• How many jobs do we run each day?
• What % of jobs have > 30k tasks?
• Why do I need to hand-tune these hundreds of jobs; can't the cluster learn?
8. Chapter 1: Questions
• How many Pig versus Scalding jobs do we run?
• What cluster capacity do jobs in my pool take?
• How many jobs do we run each day?
• What % of jobs have > 30k tasks?
• Why do I need to hand-tune these hundreds of jobs; can't the cluster learn?
#Nevermore
9. Chapter 2: Why hRaven?
Photo by DAVID ILIFF. License: CC-BY-SA 3.0
10. Chapter 2: Why hRaven?
• Stores stats, configuration, and timing for every MapReduce job on every cluster
• Structured around the full DAG of jobs from a Pig or Scalding application
• Easily queryable for historical trending
• Allows Pig reducer optimization based on historical run stats
• Keeps data online forever (12.6M jobs, 4.5B tasks + attempts)
11. Chapter 2: Key Concepts
• cluster - each cluster has a unique name mapping to the JobTracker
• user - MapReduce jobs are run as a given user
• application - a Pig or Scalding script (or a plain MapReduce job)
• flow - the combined DAG of jobs executed from a single run of an application
• version - changes impacting the DAG are recorded as a new version of the same application
16. Chapter 2: Key Features
• All jobs in a flow are ordered together
• Per-job metrics stored:
  • Total map and reduce tasks
  • HDFS bytes read / written
  • File bytes read / written
  • Total map and reduce slot milliseconds
• Easy to aggregate stats for an entire flow
• Easy to scan the time series of each application's flows
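Because the per-job counters listed above are plain sums, rolling them up to a flow is just addition over the flow's jobs. A minimal sketch of that aggregation (class and field names here are illustrative, not hRaven's actual API):

```java
import java.util.List;

public class FlowStats {
    // Per-job counters as on the slide; names are illustrative.
    static class JobMetrics {
        long mapTasks, reduceTasks, hdfsBytesRead, hdfsBytesWritten, slotMillis;
        JobMetrics(long m, long r, long br, long bw, long s) {
            mapTasks = m; reduceTasks = r;
            hdfsBytesRead = br; hdfsBytesWritten = bw; slotMillis = s;
        }
    }

    // Flow-level stats are the element-wise sum over the jobs in the flow.
    static JobMetrics aggregate(List<JobMetrics> jobs) {
        JobMetrics total = new JobMetrics(0, 0, 0, 0, 0);
        for (JobMetrics j : jobs) {
            total.mapTasks += j.mapTasks;
            total.reduceTasks += j.reduceTasks;
            total.hdfsBytesRead += j.hdfsBytesRead;
            total.hdfsBytesWritten += j.hdfsBytesWritten;
            total.slotMillis += j.slotMillis;
        }
        return total;
    }
}
```

Since all jobs in a flow share a row-key prefix, a single HBase scan over that prefix yields exactly the rows this sum needs.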
23. Chapter 3: job_history_raw
• Row key: cluster!jobID
• Columns:
  • jobconf - stores the serialized raw job_*_conf.xml file
  • jobhistory - stores the serialized raw job history log file
  • job_processed_success - indicates whether the job has been processed
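The raw-table key is the simplest in the schema: two components joined by the `!` separator. A sketch of building and splitting it (method names are illustrative, not hRaven's actual API):

```java
public class RawRowKey {
    // job_history_raw is keyed by cluster!jobID, as described above.
    static String rawRowKey(String cluster, String jobId) {
        return cluster + "!" + jobId;
    }

    // Split a raw row key back into its cluster and jobID components.
    // The cluster name contains no '!', so the first separator is the split point.
    static String[] parse(String rowKey) {
        int sep = rowKey.indexOf('!');
        return new String[] { rowKey.substring(0, sep), rowKey.substring(sep + 1) };
    }
}
```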
24. Chapter 3: job_history
• Row key: cluster!user!application!timestamp!jobID
• cluster - unique cluster name (e.g. "cluster1@dc1")
• user - the user running the application ("edgar")
• application - application ID derived from the job configuration:
  • uses the "batch.desc" property if set
  • otherwise parses a consistent ID from "mapred.job.name"
• timestamp - inverted (Long.MAX_VALUE - value) value of the submission time
• jobID - stored as the JobTracker start time (long) concatenated with the job sequence number:
  job_201306271100_0001 -> [1372352073732L][1L]
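The timestamp inversion and numeric jobID encoding above can be sketched as follows. Method names are illustrative; note that the JobTracker start time in epoch milliseconds (1372352073732L on the slide) is assumed to come from a separate lookup, since the yyyyMMddHHmm digits in the job ID string alone do not carry milliseconds:

```java
public class JobHistoryKey {
    // Inverting the timestamp (Long.MAX_VALUE - value) makes newer runs
    // sort first under HBase's lexicographic row-key ordering.
    static long invertTimestamp(long submitTime) {
        return Long.MAX_VALUE - submitTime;
    }

    // Encode "job_<jtStartTime>_<sequence>" as its two numeric components,
    // e.g. job_201306271100_0001 -> [1372352073732L][1L], where the epoch
    // millis come from a lookup of the JobTracker start time.
    static long[] encodeJobId(long jobTrackerStartMillis, String jobId) {
        String[] parts = jobId.split("_");        // ["job", "201306271100", "0001"]
        long sequence = Long.parseLong(parts[2]); // 1
        return new long[] { jobTrackerStartMillis, sequence };
    }
}
```

Storing the jobID numerically keeps jobs within a flow in submission order, since the sequence number sorts numerically rather than as a string.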
25. Chapter 3: job_history_task
• Row key: cluster!user!application!timestamp!jobID!taskID
• Same components as the job_history key (same ordering)
• taskID - (e.g. "m_00001") uniquely identifies an individual task/attempt in the job
• Two row types:
  • Task - "meta" row:
    cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001
  • Task Attempt - an individual execution on a TaskTracker:
    cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001_1
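The two row types differ only in the trailing key component: a task "meta" row ends in the task ID itself, while an attempt row appends the attempt number. A small sketch of that distinction (method names are illustrative, not hRaven's actual API):

```java
public class TaskRowKey {
    // "m_00001" (task meta row) splits into 2 parts on '_';
    // "m_00001_1" (task attempt row) splits into 3.
    static boolean isAttemptRow(String taskComponent) {
        return taskComponent.split("_").length == 3;
    }

    // The task row key extends the job_history row key with the task component,
    // so all task rows for a job scan as a contiguous range under the job's prefix.
    static String taskRowKey(String jobRowKey, String taskComponent) {
        return jobRowKey + "!" + taskComponent;
    }
}
```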