11. Step 2: Fire Away

select t.county, count(1)
from (select transform(a.zip) using 'geo.py' as county
      from SMALL_TABLE a) t
group by t.county;

[Diagram: AdCo submits the query to Hadoop]
13. Step 2: Fire Away

hadoop jar -Dmapred.min.split.size=32000000 \
  myapp.jar -partitioner org.apache…

select t.county, count(1)
from (select transform(a.zip) using 'geo.py' as county
      from SMALL_TABLE a) t
group by t.county;

insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id = b.ad_id)
group by a.id, a.zip;
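The first query above pipes each zip code through an external script with Hive's TRANSFORM clause: Hive streams input rows to the script as tab-separated lines on stdin and reads output rows from stdout. A minimal sketch of what such a `geo.py` could look like (the talk does not show the real script, so the zip-to-county lookup table here is invented):

```python
import io
import sys

# Hypothetical zip -> county table; the real geo.py's mapping is not shown.
ZIP_TO_COUNTY = {"94103": "San Francisco", "10001": "New York"}

def transform(instream, outstream):
    """Hive TRANSFORM protocol: one tab-separated input row per stdin line,
    one output row per stdout line."""
    for line in instream:
        zip_code = line.rstrip("\n").split("\t")[0]
        outstream.write(ZIP_TO_COUNTY.get(zip_code, "UNKNOWN") + "\n")

# Under Hive, the script would simply run: transform(sys.stdin, sys.stdout)
```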
19. Hadoop as Service

1. Detect when a cluster is required
   – Not all Hive statements require a cluster (EXPLAIN/SHOW/..)
2. Atomically create the cluster
   – Long-running process; concurrency control using MySQL
3. Shut down when not in use
   – Do it on an hour boundary (whose?)
   – Not if user sessions are active!
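The slide only says "concurrency control using MySQL", not how. One plausible mechanism, sketched below under that assumption, is a unique-key insert that acts as a mutex: the first session to insert the row wins and creates the cluster, concurrent sessions get a constraint violation and wait instead. `sqlite3` stands in for MySQL so the sketch is self-contained; the table and column names are invented.

```python
import sqlite3

# In-memory stand-in for the MySQL database coordinating cluster creation.
db = sqlite3.connect(":memory:")
db.execute("create table cluster_lock (customer_id integer primary key, state text)")

def try_acquire_create_lock(conn, customer_id):
    """Return True if this caller won the right to create the cluster."""
    try:
        # The primary-key constraint makes this insert atomic across sessions.
        conn.execute("insert into cluster_lock values (?, 'CREATING')", (customer_id,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # Another session is already creating this customer's cluster.
        return False
```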
20. Hadoop as Service

• Archive Job History/Logs to S3
  – Transparent access to old jobs
• Auto-configure different node types
  – Use ALL ephemeral drives for HDFS/MR
  – Use the right number of slots per machine
• Scrub, Scrub, Scrub
  – Bad nodes, bad clusters, AWS timeouts
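The slide does not say how slot counts are chosen per node type, so the heuristic below is purely illustrative: bound total slots by both cores and memory, split them roughly 2:1 between map and reduce, and spread HDFS data directories over every ephemeral drive. The instance specs and mount-point layout are assumptions, not Qubole's actual values.

```python
# Hypothetical (cores, memory_gb, ephemeral_drives) per EC2 instance type.
INSTANCE_SPECS = {
    "m1.large":  (2, 7.5, 2),
    "m1.xlarge": (4, 15.0, 4),
    "c1.xlarge": (8, 7.0, 4),
}

def slots_for(instance_type, mem_per_slot_gb=1.5):
    """Pick map/reduce slot counts bounded by both cores and memory."""
    cores, mem_gb, drives = INSTANCE_SPECS[instance_type]
    total = min(2 * cores, int(mem_gb / mem_per_slot_gb))
    maps = max(1, (2 * total) // 3)  # roughly 2:1 map-to-reduce split
    return {
        "map_slots": maps,
        "reduce_slots": max(1, total - maps),
        # One HDFS data dir per ephemeral drive (assumed /mnt, /mnt1, ... layout).
        "hdfs_data_dirs": [f"/mnt{i or ''}/hdfs" for i in range(drives)],
    }
```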
22. Scaling Up

insert overwrite table dest
select … from ads join campaigns on …
group by …;

[Diagram: Master (Job Tracker) and Slaves (Map Tasks, Reduce Tasks) forming a StarCluster on AWS]
26. Scaling Up

insert overwrite table dest
select … from ads join campaigns on …
group by …;

[Diagram: same StarCluster on AWS, now annotated with job Progress and with Supply and Demand signals at the Master]
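The diagrams show only the labels Progress, Supply, and Demand, so the scale-up rule below is an assumed reading: demand is outstanding map/reduce tasks, supply is available task slots, and the Master adds nodes to cover the shortfall. The function and its parameters are hypothetical.

```python
def nodes_to_add(pending_tasks, running_tasks, slots_per_node, current_nodes,
                 max_nodes=100):
    """Decide how many slave nodes to add so supply covers demand."""
    demand = pending_tasks + running_tasks
    supply = current_nodes * slots_per_node
    if demand <= supply:
        return 0  # the cluster can absorb the work; no scale-up needed
    shortfall = demand - supply
    wanted = -(-shortfall // slots_per_node)  # ceiling division
    return min(wanted, max_nodes - current_nodes)  # respect the size cap
```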
30. Scaling Down

1. On an hour boundary, check if the node is required:
   – Can't remove nodes with map-outputs (today)
   – Don't go below the minimum cluster size
2. Remove the node from the Map-Reduce cluster
3. Request HDFS decommissioning – fast!
   – Delete affected cache files instead of re-replicating
   – One surviving replica and we are done
4. Delete the instance
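The checks in step 1 can be sketched as a single predicate. This is an assumed shape, not Qubole's code: the node fields and the `required` flag are hypothetical stand-ins for whatever bookkeeping the service keeps per slave.

```python
def should_remove(node, cluster, min_size):
    """Hourly scale-down check mirroring the slide's three guards."""
    if len(cluster["nodes"]) <= min_size:
        return False  # don't go below the minimum cluster size
    if node["has_map_outputs"]:
        return False  # can't (today) remove nodes holding map outputs
    if not node["near_hour_boundary"]:
        return False  # only decide near the billing-hour boundary
    return not node["required"]
```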
32. Spot Instances: Challenges

• Can lose Spot nodes at any time
  – Disastrous for HDFS
  – Hybrid mode: use a mix of On-Demand and Spot nodes
  – Hybrid mode: keep one replica on On-Demand nodes
• Spot instances may not be available
  – Time out and use On-Demand nodes as a fallback
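The timeout-and-fallback idea can be sketched as below. This is an assumed flow: `request_spot` and `request_on_demand` are hypothetical stand-ins for the real EC2 provisioning calls, and the clock/sleep parameters exist only so the sketch is testable.

```python
import time

def acquire_node(request_spot, request_on_demand, timeout_s=300, poll_s=10,
                 clock=time.monotonic, sleep=time.sleep):
    """Try Spot first; after timeout_s without fulfillment, fall back to On-Demand."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        node = request_spot()        # returns a node, or None if unfulfilled
        if node is not None:
            return node
        sleep(poll_s)                # wait before re-checking the Spot request
    return request_on_demand()       # fallback keeps the cluster growing
```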
33. Agenda
What is Qubole Data Service
Hadoop as a Service in Cloud
Hive as a Service in Cloud