Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out with cron to manage workflows, but soon discover that doesn't scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe will talk about what features and qualities are important for a workflow system.
3. Background
• Devops/Infra for Hadoop
• ~4 years with Hadoop
• Have done two migrations from EMR to the colo.
• Formerly Data/Analytics Infrastructure @
  • worked with Apache Oozie and Luigi
• Before that, Hadoop @
  • worked with Azkaban 1.0
Disclosure: I've contributed to Luigi and Azkaban 1.0
13. Hadoop-Driven Features
• What happens if there's a failure?
  • possibly OK to skip a day.
  • Workflow tends to be self-contained, so you don't need to rerun downstream.
• Sanity check your data before pushing to production.
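The "sanity check before pushing" idea can be sketched in a few lines; the threshold and function names here are hypothetical, not from any of the tools discussed:

```python
def sane(row_count, min_expected=1000):
    """Refuse to push datasets that are suspiciously small."""
    return row_count >= min_expected

def push_if_sane(row_count, push):
    """Run the (caller-supplied) push step only if the dataset looks sane."""
    if not sane(row_count):
        raise ValueError("dataset failed sanity check: %d rows" % row_count)
    push()
```

The point is simply that the check runs inside the workflow, before production data is replaced, rather than as an after-the-fact alert.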
14. Workflow Engine Evolution
• Usually start with cron
  • at 01:00 import data
  • at 02:00 run really expensive query A
  • at 03:00 run query B, C, D
  • ...
• This goes on until you have ~10 jobs or so.
• It's hard to debug and rerun.
• Doesn't scale to many people.
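The cron stage above amounts to a crontab like this (script names are made up for illustration):

```
# hypothetical crontab: fixed times, no dependency tracking
0 1 * * * /opt/etl/import_data.sh
0 2 * * * /opt/etl/expensive_query_a.sh
0 3 * * * /opt/etl/query_b.sh && /opt/etl/query_c.sh && /opt/etl/query_d.sh
```

If the 01:00 import runs long or fails, the 02:00 query fires anyway against stale data. Cron knows about times, not dependencies, which is exactly why this stops scaling.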
15. Workflow Engine Evolution
• Two possibilities:
  1. "a workflow engine can't be too hard, let's write our own"
  2. spend weeks evaluating all the options out there. Try to shoehorn your workflow into each one.
16. Workflow Engine Considerations
How do I...
• Deploy and Upgrade
  • workflows and the workflow engine
• Test
• Detect Failure
• Debug/find logs
• Rebuild/backfill datasets
• Load data to/from an RDBMS
• Manage a set of similar tasks
19. Oozie - the good
• Great community support
• Integrated with HUE, Cloudera Manager, Apache Ambari
• HCatalog integration
• SLA alerts (new in Oozie 4)
• Ecosystem support: Pig, Hive, Sqoop, etc.
• Very detailed documentation
• Launcher jobs as map tasks
20. Oozie - the bad
• Launcher jobs as map tasks.
• Weak built-in UI - though HUE and oozie-web help (and the API is good)
• Confusing object model (bundles, coordinators, workflows) - high barrier to entry.
• Setup - extjs, Hadoop proxy user, RDBMS.
• XML!
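To give a flavor of the XML complaint, here is a minimal, hypothetical workflow.xml fragment (names and properties are illustrative); even a single Pig action carries this much ceremony:

```xml
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="pig-node"/>
  <action name="pig-node">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>query.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```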
24. Azkaban - the good
• Great UI
  • DAG visualization
  • Task history
  • Easy access to log files
• Plugin architecture
  • Pig, Hive, etc. Also, Voldemort "build and push" integration
• SLA Alerting
• HDFS Browser
• User Authentication/Authorization and auditing.
• Reportal: https://github.com/azkaban/azkaban-plugins/pull/6
26. Azkaban - the bad
• Representing data dependencies
  • i.e. run job X when dataset Y is available.
• Executors run on separate workers, can be under-utilized (YARN anyone?).
• Community - mostly just LinkedIn, and they rewrote it in isolation.
  • mailing list responsiveness is good.
27. Azkaban - good and bad
• Job definitions as Java properties
• Web uploads/deploy
  • Running jobs, scheduling jobs.
  • nearly impossible to integrate with configuration management
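As a sketch, an Azkaban job definition is just a small properties file uploaded to the server inside a workflow zip (the job and script names below are hypothetical):

```
# query_a.job
type=command
command=pig -f query_a.pig
dependencies=import_data
```

The format itself is simple; the friction comes from the web-upload deployment model, which is hard to drive from configuration management.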
30. Luigi - the good
• Task definitions are code.
• Tasks are idempotent.
• Workflow defines data (and task) dependencies.
• Growing community.
• Easy to hack on the codebase (<6k LoC).
• Postgres integration
  • Foursquare got this working with Redshift and Vertica.
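The core idea - tasks declare an output target and their dependencies, and anything whose output already exists is skipped - can be sketched in plain Python. This is a toy stand-in for the pattern, not the real luigi API:

```python
class Task:
    """Toy stand-in for a Luigi-style task (not the real luigi API)."""
    def requires(self):
        return []          # upstream tasks this one depends on
    def output(self):
        raise NotImplementedError  # name of the dataset this task produces
    def run(self):
        raise NotImplementedError

def build(task, done):
    """Run dependencies first, then the task, skipping completed outputs."""
    if task.output() in done:
        return             # idempotent: output exists, nothing to do
    for dep in task.requires():
        build(dep, done)
    task.run()
    done.add(task.output())

log = []

class ImportData(Task):
    def output(self):
        return "data/import.tsv"
    def run(self):
        log.append("import")

class Report(Task):
    def requires(self):
        return [ImportData()]
    def output(self):
        return "data/report.tsv"
    def run(self):
        log.append("report")

done = set()
build(Report(), done)  # runs import, then report
build(Report(), done)  # second call is a no-op: outputs already exist
```

Because completion is defined by the existence of output data rather than by scheduler state, reruns and backfills fall out naturally: delete the output and run again.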
31. Luigi - the bad
• Missing some key features, e.g. Pig support
  • but this is easy to add
• Deploy situation is confusing (but easy to automate)
• visualizer scaling
  • no persistent backing
• JVM overhead
32. Comparison matrix - part 1

| engine  | lang   | code complexity | frameworks                                 | logs                     | community                               | docs      |
|---------|--------|-----------------|--------------------------------------------|--------------------------|-----------------------------------------|-----------|
| oozie   | java   | high - 105k     | pig, hive, sqoop, mapreduce                | decentralized, map tasks | good - ASF, in many distros             | excellent |
| azkaban | java   | moderate - 26k  | pig, hive, mapreduce                       | UI-accessible            | few users, responsive on MLs            | good      |
| luigi   | python | simple - 5.9k   | hive, postgres, scalding, python streaming | decentralized on workers | few users, responsive on GitHub and MLs | good      |
33. Comparison matrix - part 2

| engine  | property configuration                       | reruns                     | customization (new job type) | testing          | user auth                |
|---------|----------------------------------------------|----------------------------|------------------------------|------------------|--------------------------|
| oozie   | command-line, properties file, xml defaults  | oozie job -rerun           | difficult                    | MiniOozie        | Kerberos, simple, custom |
| azkaban | bundled inside workflow zip, system defaults | partial reruns in UI       | plugin architecture          | ?                | xml-based, custom        |
| luigi   | command-line, python ini file                | remove output, idempotency | subclass luigi.Task          | python unittests | linux-based              |
35. Qualities I like in a workflow engine
• scripting language
  • you end up writing scripts to run your job anyway
  • custom logic, e.g. representing a dep on 7 days of data, or run only every week
• Less property propagation
• Idempotency
• WYSIWYG
  • It shouldn't be hard to take my existing job and move it to the workflow engine (it should just work).
• Easy to hack on
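For example, the "dep on 7 days of data" case is trivial to express in a scripting language; the helper name below is made up for illustration:

```python
from datetime import date, timedelta

def last_n_days(end, n=7):
    """The n daily partitions (newest first) a weekly job depends on."""
    return [end - timedelta(days=i) for i in range(n)]

# a weekly run over the 7 days ending 2014-01-07 depends on these dates
deps = last_n_days(date(2014, 1, 7))
```

Expressing the same fan-in in a declarative format typically means either templating the config or enumerating seven entries by hand.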
36. Less important
• High availability (cold failover with manual intervention is OK)
• Multiple cluster support
• Security
37. Best Practices
• Version datasets
• Backfilling datasets
• Monitor the absence of a job running
• Continuous deploy?
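Versioning and backfilling combine naturally: encode a version in each dataset path, and a backfill is just rebuilding every partition under a new version while readers stay on the old one. Paths and names here are hypothetical:

```python
from datetime import date, timedelta

def dataset_path(name, version, day):
    """Versioned partition path: bump `version` instead of overwriting."""
    return "/data/%s/v%d/%s" % (name, version, day.isoformat())

def backfill(name, version, start, end, rebuild):
    """Rebuild every daily partition in [start, end] under `version`."""
    day = start
    while day <= end:
        rebuild(dataset_path(name, version, day))
        day += timedelta(days=1)
```

Once the new version checks out, flipping consumers over is a config change, and the old version remains available for rollback.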
38. Resources
⢠Azkaban talk at Hadoop User Group:
http://www.youtube.com/watch?
v=rIUlh33uKMU
⢠PyData talk on Luigi: http://vimeo.com/
63435580
⢠Oozie talk at Hadoop user Group: http://
www.slideshare.net/mislam77/oozie-hug-
may12