2. What is Hadoop?
• Distributed computing framework
– Offers storage and batch processing for petabytes of data
– Well suited to ad-hoc text-processing applications
• Components
– Hadoop Distributed File System
– Map/Reduce programming framework
• Apache Software Foundation project
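The Map/Reduce model named above can be sketched in plain Python, with the framework's shuffle step made explicit. This is a toy word count illustrating the programming model, not the Hadoop API itself:

```python
from collections import defaultdict

def mapper(line):
    # "Map" phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # "Reduce" phase: combine all values seen for one key.
    return word, sum(counts)

def run_job(lines):
    # The framework's shuffle: group mapper output by key
    # before handing each group to a reducer.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

result = run_job(["the quick brown fox", "the lazy dog"])
# result["the"] == 2, every other word == 1
```

In real Hadoop the shuffle is distributed across machines and the mapper/reducer run on the nodes holding the input blocks; the contract for user code is the same.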
4. Hadoop Adoption Trends - Yahoo!
• Runs the Yahoo! Distribution of Hadoop
• http://github.com/yahoo/hadoop
• 230 jobs/hour on average
• 4.38 TB/hour of input, 936 GB/hour of output
5. Hadoop on your FB, Twitter pages
• Facebook
– Reporting, analytics, machine learning
• Amazon
– Hosted Hadoop on top of EC2 and S3
– Product search index
• Twitter
– Analytics, social network graphs
• AOL, Microsoft (PowerSet), IBM, …
• http://wiki.apache.org/hadoop/PoweredBy
6. Support of a vibrant community
Hadoop contributions:
– Core: HDFS, Map/Reduce; Non-core: sub-projects
– Hadoop mailing list traffic
Cloudera Distribution of Hadoop – paid, supported service offering from Cloudera
7. Support from Academia, Research
• PSG Tech, Coimbatore
– Semantic search, information retrieval, scheduling, applications in molecular biology – deep dive on this later
• IIIT, Hyderabad
– Applications in Indian language content
processing, scheduling
• IISc, Bangalore
– Modeling a simulator for Hadoop
• Many more – M45, OpenCirrus, …
8. Hadoop – a RAD tool?
• Without Hadoop
– Build-out and maintenance of hardware
– Transfer, storage of data – deep dive on this later
– Handling failures, efficiency
• Enables rapid experimentation, iteration,
repeatability, low cost of failure
• Great ecosystem: Streaming, Pig, Hive, HBase, Oozie, Avro…
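Streaming, the first ecosystem item above, lets any executable act as a mapper or reducer by reading lines on stdin and writing tab-separated key/value pairs to stdout. A minimal sketch of a Streaming mapper (the jar path and input/output directories in the comment are placeholders):

```python
import sys

def map_stream(lines):
    # Hadoop Streaming mapper contract: consume raw input lines,
    # emit one tab-separated "key\tvalue" pair per output line.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

# In a real job this loop would read sys.stdin and print each pair;
# here we feed a sample line instead so the sketch is self-contained.
pairs = list(map_stream(["Hello Hadoop world"]))

# Submitted roughly as (paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#     -mapper mapper.py -reducer reducer.py -input in/ -output out/
```

The framework sorts mapper output by key before the reducer sees it, so a Streaming reducer can rely on all values for one key arriving contiguously.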
9. Technical focus areas at Yahoo!
• Security
– Kerberos based authentication
• Backwards Compatibility – 1.0
– APIs cannot be broken between major releases
– A new API in Map/Reduce that enables this
• Robustness
– Multiple bug fixes
– Map/Reduce framework refactoring for better
concurrency, simplifying control flow logic
10. Technical focus areas at Yahoo!
• Append / Sync / Flush
– Until Hadoop 0.20, files were write-once
– Append will open Hadoop up to more applications
• Efficiency in scheduling, data processing
– Task scheduling for better utilization, better
sharing policies
– Zero data copy – usage of direct I/O buffers
• Quality engineering
– Automated distributed system testing,
performance benchmarks (deep dive coming)
11. Agenda for Hadoop Summit
• Lightning Talk by Hari Vasudev (VP Platform
Tech Group, Yahoo!)
• Data Management on Grid by Srikanth
Sundarrajan (Yahoo!)
• Machine Learning using Hadoop – Real Case Study by Krishna Prasad Chitrapura (Yahoo!)
• Multiple Sequence Alignment using Hadoop
by Dr. Sudha Sadhasivam (PSG Tech,
Coimbatore)
12. Agenda for Hadoop Summit
• Benchmarking and Optimizing Hadoop deployments (benchmarking with HiBench) by Mukesh Gangadhar (Intel)
• Challenges and Uniqueness of QE and RE processes in Hadoop
by Jayant Mahajan (Yahoo!)
• Tuning Hadoop to deliver performance to your application by
Srigurunath Chakravarthi (Yahoo!)
• Panel Discussion – Moderator: Basant Verma (Yahoo!);
Panelists: T. S. Mohan (Infosys), Sudha Sadhasivam (PSG Tech),
Chidambaran Kollengode (Yahoo!) & Jothi Padmanabhan (Yahoo!)
• Yahoo booth throughout the day: win cool prizes ☺
13. Thank You! – Q&A
Hemanth Yamijala
(yhemanth@yahoo-inc.com
yhemanth@apache.org)
15. Challenges for Yahoo!
• No longer just a wildly successful cool project!
– People are demanding we deliver!
• Production usage, availability, SLAs
– Jobs that MUST finish in 15 minutes, or revenue is
lost, and the time limits are going down
• Usability, Operability
• Scale, Performance
– Ever increasing demands mean we need larger
clusters, faster throughput
16. Design considerations
• Cost Effectiveness
– Runs on commodity hardware, Linux
• Linear Scale
• Fault Tolerance
– Block replication, checksums
– Transparent monitoring and re-execution of tasks
• Efficiency
– Data locality
– Efficient resource usage
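The block-checksum idea behind the fault-tolerance bullet can be sketched simply: HDFS computes a checksum per fixed-size chunk of each block at write time and re-verifies on read, falling back to another replica on a mismatch. The code below is an illustrative model of that idea, not the real HDFS interface (512 bytes mirrors the classic io.bytes.per.checksum default):

```python
import zlib

CHUNK = 512  # bytes covered by each checksum (HDFS classic default)

def checksums(data):
    # One CRC32 per chunk, computed when the block is written.
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, sums):
    # On read, recompute and compare; a mismatch means this replica
    # is corrupt and the client should try another replica.
    return checksums(data) == sums

block = b"x" * 1300
sums = checksums(block)
ok = verify(block, sums)                    # clean replica passes
corrupt = verify(block[:-1] + b"y", sums)   # flipped byte is caught
```

Combined with block replication, this is what makes silent disk corruption survivable: detection is per-chunk and recovery is simply re-reading from a healthy replica.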