2. Agenda
• Who I am
• Problem
• Solution
• Demo
• Q&A
Saturday, August 3, 13
3. Who I am
• Wisely Chen ( thegiive@gmail.com )
• Release manager of Yahoo![Taiwan] shopping and data team
• Loves to promote open source tech in Taiwan
• Hadoop Summit 2013 San Jose
• Ruby and Rails: Coscup 2006, Ubisunrise 2007, OSDC 2007
• Puppet: PHPConf 2012, RubyConf 2012
• Release Practice: Webconf 2013, Coscup 2012
4. Who I am
• Neal Lee (@neal_lee)
• Data Engineer at Yahoo![Taiwan] data team
• Aims to build an easy-to-use, self-service BI platform connected to Hadoop
5. EC Data Team
• Auctions / Mall / Shopping Center sites: transactional data
• Traffic / click / user behavior tracking: tracking data
• Data Highway
• Data Warehouse / Data Mart
• Data Infra
• BI Platform
• Report
• Recommendation API
• Machine Learning
• Serve
10. Continuous Integration
• A software engineering practice
• Maintain code repos
• Automate the build
• Make the build self-testing
• Everyone commits to the baseline every day
• Every commit should be built
• Test in a clone of the production environment
• Make it easy to get the latest deliverables
• Everyone can see the results of the latest build
• Automate deployment
11. We focus on
• A software engineering practice
• Maintain code repos
• Automate the build
• Make the build self-testing
• Everyone commits to the baseline every day
• Every commit should be built
• Test in a clone of the production environment
• Make it easy to get the latest deliverables
• Everyone can see the results of the latest build
• Automate deployment
12. CI on Hadoop Flow
Code → Unit Test → Performance Test → Deploy → Doc → Execution
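The flow on this slide can be read as a gated pipeline: each stage runs only if the previous one succeeded. Below is a minimal Python sketch of that idea. The stage names come from the slide, while `run_pipeline` and the lambda bodies are hypothetical stand-ins, not the team's actual CI tooling.

```python
from typing import Callable, List, Tuple

# Stage names follow the slide; the callables are hypothetical stand-ins.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, action in stages:
        if not action():
            break  # a failed stage blocks everything after it
        completed.append(name)
    return completed


if __name__ == "__main__":
    stages = [
        ("unit test", lambda: True),
        ("performance test", lambda: False),  # simulate a failing gate
        ("deploy", lambda: True),
        ("doc", lambda: True),
    ]
    print(run_pipeline(stages))
```

With the simulated failure in the performance test, only the unit test stage is reported as completed, so nothing is deployed.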
17. PigUnit
• A simple xUnit framework
• No cluster setup is required in local mode
• Unit testing, regression testing, and rapid prototyping on the fly
18. Using PigUnit
• After
  • Coding
  • Write PigUnit test case
  • Run local PigUnit test
  • Push to cluster
  • Run Pig on cluster
  • Get the right result!
• Before
  • Coding
  • Manual local test
  • Push to cluster
  • Run Pig on cluster
  • Get the right result!
19. Unit tests are live docs
• A unit test is a runnable, live document
• Passing the test cases shows that previous requirements are still met
20. Flexible
• Pig can use PigUnit
• MapReduce can use MRUnit
• Hive can use hive_test
22. Vaidya
• Rule-based performance diagnosis of M/R jobs
• Extensible framework
• You can add your own rules
• Write complex rules using existing rules
23. Performance Test
[Diagram: sampling data feeds the Pig job (1); Vaidya reads the Pig job history, Pig job conf, and Vaidya rules (2); the outcome is either a performance result (3) or a user notification (4); then the next CI stage (5)]
1. Execute the Pig job with sampled data on the beta server
2. Vaidya reads the job history, job conf, and rules to check for performance problems
3. If OK, create the performance result
4. If the job has a performance issue, notify the user
5. Go to the next CI stage
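Steps 2 to 5 amount to a branch on the diagnosis outcome. The sketch below shows one way that gate could look in Python. `RuleResult` and `performance_gate` are hypothetical names, and treating an impact score at or above the threshold as a failure is an assumption made for illustration, not something the slides confirm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RuleResult:
    """Hypothetical outcome of one diagnostic rule for a sampled job run."""
    title: str
    impact: float             # assumed to be in [0, 1]; higher is worse
    success_threshold: float  # assumed: impact below this counts as a pass


def performance_gate(results: List[RuleResult]) -> Tuple[bool, List[str]]:
    """Return (ok, failed_titles) for a sampled job run.

    ok is True only when every rule stays under its threshold, which is
    the condition for moving on to the next CI stage (steps 3 and 5).
    A non-empty failed_titles list is what would be sent to the user (step 4).
    """
    failed = [r.title for r in results if r.impact >= r.success_threshold]
    return (not failed, failed)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    results = [
        RuleResult("Balanced Reduce Partitioning", impact=0.35, success_threshold=0.20),
        RuleResult("Hypothetical other rule", impact=0.05, success_threshold=0.50),
    ]
    ok, failed = performance_gate(results)
    print("next CI stage" if ok else f"notify user: {failed}")
```

Returning the failed rule titles alongside the boolean keeps the notification step separate from the decision itself.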
24. Vaidya Rule
<DiagnosticTest>
  <Title><![CDATA[Balanced Reduce Partitioning]]></Title>
  <!-- Test Java class -->
  <ClassName><![CDATA[org.apache.hadoop.vaidya.postexdiagnosis.tests.BalancedReducePartitioning]]></ClassName>
  <!-- Checks whether the reduce job is balanced or not -->
  <Description><![CDATA[This rule tests as to how well the input to reduce tasks is balanced]]></Description>
  <!-- Rule importance -->
  <Importance><![CDATA[High]]></Importance>
  <!-- Diagnosis success threshold -->
  <SuccessThreshold><![CDATA[0.20]]></SuccessThreshold>
  <Prescription><![CDATA[advice]]></Prescription>
</DiagnosticTest>
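Because a rule is plain XML, a CI script can read its fields with a standard parser. The following Python sketch loads the rule from this slide (with the garbled tag names restored) using only the standard library. `load_rule` is a hypothetical helper and is not part of Vaidya.

```python
import xml.etree.ElementTree as ET

# The rule from the slide, with the garbled tag names restored.
RULE_XML = """
<DiagnosticTest>
  <Title><![CDATA[Balanced Reduce Partitioning]]></Title>
  <ClassName><![CDATA[org.apache.hadoop.vaidya.postexdiagnosis.tests.BalancedReducePartitioning]]></ClassName>
  <Description><![CDATA[This rule tests as to how well the input to reduce tasks is balanced]]></Description>
  <Importance><![CDATA[High]]></Importance>
  <SuccessThreshold><![CDATA[0.20]]></SuccessThreshold>
  <Prescription><![CDATA[advice]]></Prescription>
</DiagnosticTest>
"""


def load_rule(xml_text: str) -> dict:
    """Parse one <DiagnosticTest> element into a plain dict.

    Every child element becomes a key; SuccessThreshold is converted
    to a float so it can be compared against a numeric score.
    """
    root = ET.fromstring(xml_text.strip())
    rule = {child.tag: (child.text or "").strip() for child in root}
    rule["SuccessThreshold"] = float(rule["SuccessThreshold"])
    return rule


if __name__ == "__main__":
    rule = load_rule(RULE_XML)
    print(rule["Title"], rule["Importance"], rule["SuccessThreshold"])
```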
26. Deploy
• Deploy to production cluster
• Easy to roll back
• Create a git tag
• Automatic doc generation
• Each release should map to a ticket
• Automatic comment in Bugzilla
27. Auto comment in Bugzilla
[Screenshot: Bugzilla comment showing the repo URL, the release note, and the issue status change]
28. Auto create git tag
• Release Note: [Bug xxx] log....
• Git Tag
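A release note keyed by bug number can be derived from commit subjects that follow the "[Bug xxx]" convention shown on this slide. The Python sketch below illustrates the idea under stated assumptions: the bug IDs and messages are made-up sample data, and `build_release_note` and `tag_name` (including its tag naming scheme) are hypothetical, since the slides do not show the real scripts.

```python
import re
from typing import Dict, List

# Assumed convention: commit subjects start with "[Bug <id>]", as on the slide.
BUG_PREFIX = re.compile(r"^\[Bug (\d+)\]\s*(.*)$")


def build_release_note(commit_subjects: List[str]) -> Dict[str, List[str]]:
    """Group commit messages by the bug ID found in their "[Bug N]" prefix.

    Subjects without the prefix are skipped, since they cannot be mapped
    to a ticket.
    """
    note: Dict[str, List[str]] = {}
    for subject in commit_subjects:
        match = BUG_PREFIX.match(subject)
        if match:
            bug_id, message = match.groups()
            note.setdefault(bug_id, []).append(message)
    return note


def tag_name(version: str) -> str:
    """Hypothetical tag naming scheme; the slides do not show the real one."""
    return f"release-{version}"


if __name__ == "__main__":
    # Made-up commit subjects, for illustration only.
    subjects = [
        "[Bug 101] fix join key",
        "typo in comment",
        "[Bug 101] add test case",
        "[Bug 202] tune reducers",
    ]
    for bug_id, messages in build_release_note(subjects).items():
        print(f"Bug {bug_id}: " + "; ".join(messages))
    print(tag_name("1.2.0"))
```

Grouping by bug ID gives one release-note entry per ticket, which is also the unit a Bugzilla comment would be posted against.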
30. Demo
• Demo 1: Unit test fails
• Demo 2: Unit test succeeds
• Demo 3: Check performance test
• Demo 4: Auto-generate doc
• Demo 5: Notify user
34. Logic Debug
• Map/Reduce jobs often take a lot of time to execute
• Repeated Map/Reduce execution costs a lot of time during the logic debugging phase
• Need a way to find logic problems before executing the production job
Coding → Manual Test → Exec → Get Bug
35. Performance
• Map/Reduce performance is hard to estimate before execution
• Production Grid computing resources are shared by all Yahoos
• Bad performance will affect other Yahoos' Grid jobs
• Putting poorly performing code on the production Grid is unacceptable
• We manually investigate a job's performance before we actually execute it on the production Grid
Coding → Manual Test → Manual Investigation → Get Bug