4. Brief History of Hadoop
● 2005 – Started by Doug Cutting as part of the Nutch project
● Inspired by the GFS and MapReduce papers
published by Google.
● Promoted heavily by Yahoo! since 2006
● Today, the de facto standard in 'Big Data'
computing
8. When To Use It
● Can you use Hadoop to do X?
● Is your problem 'embarrassingly' parallel?
● Workflow?
– Dependent/Independent Tasks
● Data/CPU intensive?
● Can you use Hadoop to do X in the cloud?
● Depends on where your data is
9. Why To Use It?
● Ad hoc analysis
● Structured and semi-structured data
– Log files
– Text
– CSV, XML, anything really
– RDBMS
– NoSQL!
10. Use Cases
● Analytics
● User behavior
● Reporting
● Filtering
● Machine Learning
● Just storing your data
11. Just From The Logs
● Suppose you run a website
● User breakdown by browsers
● Location
● Understanding user session
– How long do they use it?
– Who are the active users?
– What part of my app do they use the most?
– What part of my app is user X's favorite?
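The browser breakdown above maps naturally onto MapReduce: a mapper emits one key per log line and a reducer sums the counts per key. A minimal sketch in Python, assuming Apache combined-format log lines where the User-Agent is the last quoted field; the `browser_family` heuristic and function names are illustrative, not part of any Hadoop API:

```python
import sys

def browser_family(user_agent):
    # Illustrative heuristic: map a User-Agent string to a coarse family.
    # Chrome is checked before Safari because Chrome UAs also contain "Safari".
    ua = user_agent.lower()
    if "firefox" in ua:
        return "Firefox"
    if "chrome" in ua:
        return "Chrome"
    if "safari" in ua:
        return "Safari"
    return "Other"

def mapper(lines):
    # Emit (browser, 1) per log line; the framework groups these by key.
    for line in lines:
        parts = line.rstrip("\n").split('"')
        if len(parts) >= 6:  # combined log format: UA is the last quoted field
            yield browser_family(parts[-2]), 1

def reducer(pairs):
    # Sum the counts per browser.
    counts = {}
    for browser, n in pairs:
        counts[browser] = counts.get(browser, 0) + n
    return counts

if __name__ == "__main__":
    # Local dry run over log lines piped on stdin.
    print(reducer(mapper(sys.stdin)))
```

The same mapper/reducer pair can be run locally over a sample of the logs for testing, then scaled up unchanged on a cluster.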
12. Tools
● Native Hadoop APIs – Java
● Streaming – Perl, Python, Ruby, any language,
as long as it can read 'stdin' and write 'stdout'
● Pig
● Hive
● Pipes – C and C++
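The stdin/stdout contract behind Streaming is all there is to it: the mapper reads raw text lines and writes tab-separated key/value pairs, and the reducer reads those pairs back already sorted by key. A minimal word-count sketch in Python (the script layout and argument handling are illustrative, not a Hadoop convention):

```python
import sys

def map_words(lines):
    # Mapper: emit "word\t1" per word; Hadoop Streaming sorts these
    # pairs by key before they reach the reducer.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reduce_counts(lines):
    # Reducer: input arrives sorted by key, so equal keys are adjacent
    # and a running total per key is enough.
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

if __name__ == "__main__":
    # One script, two roles: pass "map" or "reduce" on the command line.
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = map_words if phase == "map" else reduce_counts
    for out in step(sys.stdin):
        print(out)
```

Locally the whole pipeline can be simulated with a shell sort between the phases, which is essentially what the framework does at scale.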