Hadoop Summit 2010 Data Management On Grid
1. Data Management on Hadoop @ Yahoo!
Srikanth Sundarrajan
Principal Engineer
2. Why is Data Management important?
• Large datasets are incentives for users to come to the grid
• Volume of data movement
• Cluster access / partitioning (research & production purposes)
• Resource consumption
• SLAs on data availability
• Data retention
• Regulatory compliance
• Data conversion
3. Data volumes
• Steady growth in data volumes (data movement per DAY, into the grid)
[Chart: daily data movement into the grid, y-axis 0–40 TB, showing steady growth]
4. Data Acquisition Service
[Diagram: the Data Acquisition Service pulls from a source warehouse and loads into target clusters (Cluster 1–3, each with a JobTracker and HDFS)]
• Replication & Retention are additional services that handle cross-cluster data movement and data purge respectively
5. Pluggable interfaces
• Different warehouses may use different interfaces to expose data (e.g. HTTP, SCP, FTP or some proprietary mechanism)
• The acquisition service should be generic, with the ability to plug in new interfaces easily to support newer warehouses
6. Data load & conversion
• Heavy lifting is delegated to map-reduce jobs, keeping the acquisition service light
• Data load executed as a map-reduce job
• Data conversion as a map-reduce job (to enable faster data processing post acquisition)
– Field inclusion/removal
– Data filtering
– Data anonymization
– Data format conversion (raw delimited / Hadoop sequence file)
• Cluster-to-cluster copy is a map-reduce job
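The per-record work of a conversion job (field inclusion, filtering, anonymization) can be sketched like this. It is an illustrative stand-in for the real mapper: the schema (`ts`, `user`, `page`) and the choice of a truncated SHA-1 for anonymization are assumptions, not the deck's specifics.

```python
# Hypothetical sketch of what a conversion mapper does per record:
# keep selected fields, drop filtered rows, anonymize sensitive columns.
import hashlib

FIELDS = ["ts", "user", "page"]   # fields to include (assumed schema)
ANONYMIZE = {"user"}              # columns to one-way hash

def convert(record):
    # Data filtering: drop records with no page (assumed rule).
    if record.get("page") is None:
        return None
    # Field inclusion: project onto the configured field list.
    out = {k: record[k] for k in FIELDS}
    # Anonymization: replace sensitive values with a truncated digest.
    for k in ANONYMIZE:
        out[k] = hashlib.sha1(str(out[k]).encode()).hexdigest()[:12]
    return out
```

In the real pipeline this logic runs inside a map-reduce job over the acquired files, with the output written as raw delimited text or a Hadoop sequence file.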
7. Warehouse & Cluster isolation
• Source warehouses have diverse capacity and are often constrained
• Different clusters can run different versions of Hadoop, and cluster performance may not be uniform
• Need for isolation at the warehouse & cluster level, and resource usage limits at the warehouse level
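One simple way to enforce a per-warehouse resource limit is a bounded semaphore per source, capping concurrent pulls against that warehouse regardless of how many clusters want its data. The quotas and names below are hypothetical, offered only to make the isolation idea concrete.

```python
# Hypothetical sketch of per-warehouse isolation: a bounded semaphore
# per source warehouse caps concurrent pulls hitting that warehouse.
import threading

WAREHOUSE_LIMITS = {"ads_logs": 4, "search_logs": 2}   # assumed quotas
_sems = {w: threading.BoundedSemaphore(n) for w, n in WAREHOUSE_LIMITS.items()}

def with_warehouse_slot(warehouse, work):
    """Run `work` only after acquiring a slot for this warehouse."""
    sem = _sems[warehouse]
    with sem:   # blocks when the warehouse is already at its limit
        work()
```

A slow or constrained warehouse then throttles only its own feeds; loads from other warehouses proceed independently.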
8. Job throttling
[Diagram: discovery threads feed a queue per source; job execution threads drain the queues and launch asynchronous map-reduce jobs, post resource negotiation, on Cluster 1 … Cluster N]
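The throttling pipeline on this slide can be sketched as discovery feeding a queue per source, with a separate execution stage draining those queues. This single-threaded toy (names hypothetical) shows the data flow; the real service runs discovery and execution as thread pools and launches asynchronous map-reduce jobs after resource negotiation.

```python
# Sketch of the job-throttling pipeline: discovery enqueues feed
# instances into a queue per source; execution drains the queues, so
# one slow source cannot starve the others.
import queue

sources = {"src_a": queue.Queue(), "src_b": queue.Queue()}
done = []

def discover(source, feeds):
    # Discovery stage: find new feed instances and enqueue them.
    for f in feeds:
        sources[source].put(f)

def execute():
    # Execution stage: drain each per-source queue in turn; each item
    # stands in for launching an async map-reduce job on some cluster.
    for source, q in sources.items():
        while not q.empty():
            done.append((source, q.get()))

discover("src_a", ["feed1", "feed2"])
discover("src_b", ["feed3"])
execute()
```

Keeping one queue per source is what gives the isolation: a backlog on `src_a` never blocks items queued for `src_b`.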
9. Other considerations
• SLA, feed priority & frequency are considered when scheduling data loads
• Retention removes old data (as required for legal compliance and for capacity reasons)
• Interoperability across Hadoop versions
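The retention bullet can be made concrete with a small date-window check: given a feed's configured retention period, select the date partitions old enough to purge. This is an assumed formulation (partition-per-day layout, `retention_days` parameter), not the deck's actual service.

```python
# Hypothetical sketch of retention: drop date partitions older than
# the feed's configured retention window.
from datetime import date, timedelta

def partitions_to_purge(partitions, retention_days, today):
    """Return partition dates strictly older than the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in partitions if p < cutoff]
```

The retention service would run such a check per feed and issue HDFS deletes for the selected partitions.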