3. • What is Hadoop?
• Why do people use Hadoop?
• How does it work?
• When should you consider Hadoop?
4. What is Hadoop?
Apache Hadoop is an open source, Java-based
system for processing data on a network of
commodity servers using a map-reduce
paradigm.
5. How do people use Hadoop?
A few examples from the Apache site
– Amazon search
– Facebook log storage and reporting
– LinkedIn’s People You May Know
– Twitter data analysis
– Yahoo! ad targeting
A search on LinkedIn shows people at financial
services, biotech, oil and gas exploration, retail,
and other industries are using Hadoop.
6. Where did Hadoop come from?
• Hadoop was created by Doug Cutting. It’s
named after his son’s toy elephant.
• Hadoop was written to support Nutch, an
open source web search engine.
Hadoop was spun out in 2006.
• Yahoo! invested in Hadoop,
bringing it to “web scale” by
2008.
7. Hadoop is open source
• Hadoop is an open source project (Apache
license)
– You can download and install it freely
– You can also compile your own custom version of
Hadoop
• There are three subprojects
8. Hadoop is written for Java
• The good news: Hadoop runs on a JVM
– You can run Hadoop on your workstation (for testing),
on a private cluster, or in a cloud
– You can write Hadoop jobs in Java, or in Scala, JRuby,
Jython, Clojure, or any other JVM language
– You can use other Java libraries
• The bad news: Hadoop was originally written by
and for Java programmers.
– You can do basic work without knowing Java. But you
will quickly get stuck if you can’t write code.
10. Hadoop runs on commodity servers
• Doesn’t require very fast, very big, or very
reliable servers
• Works better on good quality servers
connected through a fast network
• Hadoop is fault tolerant—multiple copies of
data, protection against failed jobs
11. When should you consider Hadoop?
• Big problem
• Fits Map/Reduce model
• Don’t need to compute in real time
• Technical team
12. Picking the right tool for the job
[Chart: a scale from 1 to 1,000,000,000,000 (powers of ten) plotted against tools: Calculator, Spreadsheet, Numerical Software, Parallel Systems, ?]
13. Man / Reduce
• I need 7 volunteers:
– 4 mappers
– 3 reducers
• We’re going to show how map/reduce works
by sorting and counting some notes.
14. What is Map/Reduce
• You compute things in two phases
– The map step
• Reads the input data
• Transforms the data
• Tags each datum with a key and sends each datum to
the right reducer
– The reduce step
• Collects all the data for each key
• Does some work on the data by key
• Outputs the results
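The two phases above can be sketched in plain Java, without Hadoop. This is a minimal single-process illustration, not Hadoop code: the class name `MapReduceSketch`, the choice of the first word as the key, and the sample notes are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy, single-process illustration of the map and reduce phases (not Hadoop code). */
final class MapReduceSketch {

    /** Map step: read one record and tag it with a key (assumed here: its first word). */
    static Map.Entry<String, String> map(String record) {
        String key = record.trim().split("\\s+")[0].toLowerCase();
        return Map.entry(key, record);
    }

    /** Reduce step: do some work on all the records for one key (here: count them). */
    static long reduce(String key, List<String> records) {
        return records.size();
    }

    /** Runs map on every record, groups the results by key, then runs reduce per key. */
    static Map<String, Long> run(List<String> input) {
        // Grouping by key stands in for sending each datum to the right reducer.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String record : input) {
            Map.Entry<String, String> tagged = map(record);
            grouped.computeIfAbsent(tagged.getKey(), k -> new ArrayList<>())
                   .add(tagged.getValue());
        }
        Map<String, Long> output = new TreeMap<>();
        grouped.forEach((key, records) -> output.put(key, reduce(key, records)));
        return output;
    }

    public static void main(String[] args) {
        List<String> notes = List.of("red note A", "blue note B", "red note C", "green note D");
        System.out.println(run(notes)); // prints {blue=1, green=1, red=2}
    }
}
```

In a real cluster the grouping step happens across machines, which is why the key matters: it decides where each datum goes.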
15. Map/Reduce is over 100 years old
• Hollerith machines from the 1890 census
16. Good fits for Map/Reduce
• Aggregating unstructured data to enter into a
database (ETL)
• Creating email messages
• Processing log files and creating reports
17. Problems that don’t perfectly fit
• Logistic regression
• Matrix operations
• Social graph calculations
18. Batch computation
Hadoop is a shared system that allocates
resources to jobs from a queue. It’s not a real
time system.
19. Coding example
Suppose that we had some log files with events by
date (say, page views). Let’s count the number of
events by day!
Sample data:
1335300359000,Home Page, Joe
1335300359027,Login,
1335300359031,Home Page, Romy
1335300369123,Settings, Joe
…
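As a small sketch of what "events by day" means for one of these records, the snippet below parses a sample line and converts its timestamp to a calendar day. It assumes the first field is milliseconds since the Unix epoch and that days are counted in UTC; the class name `LogLineDay` is made up for illustration.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Illustration only: extract the day from a "timestampMillis,page,user" record. */
final class LogLineDay {

    /** Returns the UTC calendar day of the record's timestamp (assumed epoch millis). */
    static LocalDate dayOf(String line) {
        // Limit -1 keeps trailing empty fields, such as the missing user on the Login line.
        String[] fields = line.split(",", -1);
        long millis = Long.parseLong(fields[0].trim());
        return Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).toLocalDate();
    }

    public static void main(String[] args) {
        System.out.println(dayOf("1335300359000,Home Page, Joe")); // prints 2012-04-24
    }
}
```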
20. A Java Example
• Mappers will
– Read the input files
– Extract the timestamp
– Truncate to the day
– Set the output key to the day
• Reducers will
– Iterate through records by day, counting records
– Output the count for each day
21. A Java example (Mapper)
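import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormatter;
import org.joda.time.format.ISODateTimeFormat;

public class exampleMapper
    extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output,
      Reporter reporter) throws IOException {
    // Each input value is one log line: timestamp (ms), page, user
    String line = value.toString();
    String[] values = line.split(",");
    long timeStampLong = Long.parseLong(values[0]);
    DateTime timeStamp = new DateTime(timeStampLong);
    // ISO date format (yyyy-MM-dd) drops the time of day
    DateTimeFormatter dateFormat = ISODateTimeFormat.date();
    // Output key = day, output value = the original line
    output.collect(new Text(dateFormat.print(timeStamp)),
        new Text(line));
  }
}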
22. A Java example (Reducer)
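import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class exampleReducer
    extends MapReduceBase
    implements Reducer<Text, Text, Text, LongWritable> {
  public void reduce(Text key,
      Iterator<Text> values,
      OutputCollector<Text, LongWritable> output,
      Reporter reporter) throws IOException {
    long count = 0;
    while (values.hasNext()) {
      values.next();  // advance the iterator; without this the loop never ends
      count++;
    }
    // One output record per day: (day, number of events)
    output.collect(key, new LongWritable(count));
  }
}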
23. A Java example (job file)
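import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class exampleJob extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Count events by date");
    // Input: text files under args[0]
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    // The mapper emits (Text, Text); the reducer emits (Text, LongWritable)
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(exampleMapper.class);
    conf.setReducerClass(exampleReducer.class);
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new exampleJob(), args));
  }
}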
24. • Tools that make it easier to use Hadoop:
– Hive
– Pig
– Cascading
25. Cascading
• Tool for constructing Hadoop workflows in Java
• Example:
Scheme pvScheme = new TextLine(new Fields("timestamp", …));
Tap source = new Hfs(pvScheme, inpath);
Scheme countScheme = new TextLine(new Fields("date", "count"));
Tap sink = new Hfs(countScheme, outpath);
Pipe assembly = new Pipe("pagesByDate");
// DateFormatter declares the "date" result field; Each selects "timestamp" as its argument.
// In the pattern, "MM" is the month ("mm" would be minutes).
Function function = new DateFormatter(new Fields("date"), "yyyy/MM/dd");
assembly = new Each(assembly, new Fields("timestamp"), function);
assembly = new GroupBy(assembly, new Fields("date"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
Properties properties = new Properties();
FlowConnector.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new FlowConnector(properties);
Flow flow = flowConnector.connect("pagesByDate", source, sink, assembly);
flow.complete();
26. Pig
• Tool to write SQL-like queries against Hadoop
• Example:
define TODATE
    org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay();
%declare now `date "+%s000"`;
page_views = LOAD 'PAGEVIEWS' USING PigStorage()
    AS (timestamp:long, page:chararray, user:chararray);
last_week = FILTER page_views BY timestamp > $now - 86400000 * 7;
truncated = FOREACH last_week GENERATE *,
    TODATE(timestamp) as date;
grouped = GROUP truncated BY date;
counted = FOREACH grouped GENERATE group as date,
    COUNT_STAR(truncated) as N;
sorted = ORDER counted BY date;
STORE sorted INTO 'results' USING PigStorage();
27. Hive
• Tool from Facebook that lets you write SQL
queries against Hadoop
• Example code:
SELECT TO_DATE(FROM_UNIXTIME(CAST(timestamp / 1000 AS BIGINT))) AS day, COUNT(*)
FROM PAGEVIEWS
WHERE timestamp > (unix_timestamp() - 86400 * 7) * 1000
GROUP BY TO_DATE(FROM_UNIXTIME(CAST(timestamp / 1000 AS BIGINT)))
ORDER BY day
29. Some important related projects
• HBase
• NextGen Hadoop (0.23)
• ZooKeeper
• Mahout
• Giraph
30. What to do next
• Watch training videos at
http://www.cloudera.com/resource-types/video/
• Get Hadoop (including the code!) at
http://hadoop.apache.org
• Get commercial support from
http://www.cloudera.com/
or http://hortonworks.com/
• Run it in the cloud with Amazon Elastic Map Reduce:
http://aws.amazon.com/elasticmapreduce/
Editor’s Notes
Thanks for having me here today as part of Big Data Week. For a lot of people, Hadoop is big data. Today, I’m here to share my experience as a Hadoop user. I use Hadoop every day at LinkedIn because it helps me get my work done. Ask audience: Who uses Hadoop now? Who is thinking about it? Who sort of knows what Hadoop is for, but isn’t sure how it helps them?
Hadoop can help you if you have a gigantic amount of data. You can do things with Hadoop that are hard to do with any other off-the-shelf tool. But Hadoop can be a handful.
I’m hoping that you leave here today knowing what Hadoop is.
Open source. Java based. Network of servers. Commodity servers. Map/Reduce.
The biggest users are mostly web companies: Amazon builds their search indices on Hadoop. Facebook processes all their usage logs on Hadoop. (They also store photos with HBase.) I bet they do other things as well. Twitter uses Hadoop for data analysis. Yahoo! uses Hadoop for many things, including a lot of their advertising models. eBay and Netflix use Hadoop as well. And a lot more people are using Hadoop for some tasks.
The source code for Hadoop is freely available and easy to modify. But that doesn’t mean it’s cheap and easy to run. It takes a lot of operational expertise to set up and run a system with hundreds or thousands of computers. Every big Hadoop shop has a team of developers and operations people who keep the system running. We’ve modified the Hadoop scheduler, added extra code for debugging, and fixed quite a few bugs.
I have become very good at reading Java stack traces.
Hadoop was designed to run on commodity servers. It doesn’t need servers with super-fast processors, huge amounts of memory, solid state disks, or any other exotic features. But that doesn’t mean you should just run down to Fry’s and buy the cheapest computers you can find. Cheap computers fail more often. You need to find a good balance between cost and reliability. By the way, Hadoop runs really well on cloud services.
Even really good quality computers fail, and Hadoop was designed to deal with that problem. If the probability of a machine failing is 1/1000 for a given day, you’re going to see failures when you have thousands of computers. As a user, you don’t usually have to worry too much about how Hadoop runs your jobs. But sometimes, understanding what Hadoop is doing can help you understand what the system is up to.
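The 1/1000 figure can be turned into rough numbers. The sketch below assumes machines fail independently of one another, which is a simplification, and the class name `FailureOdds` is made up for illustration.

```java
/** Back-of-the-envelope failure arithmetic, assuming independent machine failures. */
final class FailureOdds {

    /** Expected number of machines that fail on a given day. */
    static double expectedFailuresPerDay(int machines, double dailyFailureProbability) {
        return machines * dailyFailureProbability;
    }

    /** Probability that at least one machine fails on a given day: 1 - (1 - p)^n. */
    static double probabilityOfAnyFailure(int machines, double dailyFailureProbability) {
        return 1.0 - Math.pow(1.0 - dailyFailureProbability, machines);
    }

    public static void main(String[] args) {
        double p = 1.0 / 1000.0;
        for (int machines : new int[] {10, 1000, 3000}) {
            System.out.printf("%d machines: expect %.2f failures/day, P(any failure) = %.3f%n",
                    machines,
                    expectedFailuresPerDay(machines, p),
                    probabilityOfAnyFailure(machines, p));
        }
    }
}
```

Under these assumptions, a 1,000-machine cluster expects about one failure per day and has roughly a 63% chance of at least one, so failures are routine rather than exceptional.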
Let’s talk about each of these things. Hadoop is great for doing all the data munging that you do at the start of a data project.
Mentally, this is my hierarchy of tools. As your data gets bigger, it takes more work to use each tool, so I try not to overshoot. [should add in databases, Python tools in the middle of R and Hadoop] But sometimes, you have to upgrade. For example, suppose that it takes 25 hours to analyze 24 hours of data on your desktop…
As we said before, for your problem to fit, your problem should meet 4 criteria… one of them is that it has to work with Map/Reduce. To help explain Map/Reduce, we’re going to use Map/Reduce here to do some work. [ask for volunteers]
The key is used to group data together and to route it to the right reducer.
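A minimal sketch of that routing, in plain Java: hash the key, then take the remainder by the number of reducers. Hadoop's default `HashPartitioner` uses this same formula on the key's `hashCode()`. The class name `KeyRouting` and the use of `String` keys are assumptions for illustration; Hadoop's `Text` keys compute their hash differently from `String`.

```java
/** Illustration of hash-based routing of keys to reducers (not Hadoop code). */
final class KeyRouting {

    /** Returns the reducer index in [0, numReducers) that receives this key. */
    static int reducerFor(String key, int numReducers) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 3;
        for (String day : new String[] {"2012-04-23", "2012-04-24", "2012-04-24"}) {
            System.out.println(day + " -> reducer " + reducerFor(day, numReducers));
        }
    }
}
```

Because the result depends only on the key, every record with the same key lands on the same reducer, which is what makes grouping by key work.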
At LinkedIn, we have hundreds of users on our Hadoop system running dozens of jobs. It’s pretty busy in the middle of the day. Unlike some other tools (like Oracle), Hadoop won’t start working on your problem until earlier jobs finish. It’s a very efficient way to use resources, but it could mean that you have to wait around for a long time.
So far, we’ve talked about who uses Hadoop, and how Hadoop works. I’d like to show an example of what you see as a Hadoop user: how do you write programs for Hadoop? In practice, you might have many input files from many different web servers. Or maybe one giant file. Either way, Hadoop can split up those files to divide the processing work across the cluster.
Most Java map/reduce jobs have three parts: a mapper, a reducer, and a job file. I’m going to walk through all three of them here.
Here is part of the Java Map/Reduce job for doing this calculation. At this point, it should be clear why we didn’t make this a hands-on session. I’m not going to explain everything that’s going on here, but I’ll point out a few pieces of how this works.
All the values for a key are handled by a specific reducer. In this case, that means that all the records for each date will be sent to a single reducer, so all we have to do is to count those records.
Lastly, you connect everything together with a job file and run it.
I’ve probably scared off a lot of people in this room by showing the Java Map/Reduce code. Luckily, there are some simpler ways to solve the problem.
One of the coolest things about Cascading is that you can use it from other JVM languages: Jython, JRuby, Clojure, and Scala.
Don’t need a lot of software, can run from your workstation
Hive is great, but it takes some work to set it up. It’s great for working with unstructured data… The big disadvantage of Hive is that every operation is a full table scan. With a database like Oracle, data is stored with indexes, so you can quickly look up single values. Hive is good for large calculations, bad for lookups. Another issue with Hive is that it’s not as mature as most databases. You can easily see a Java stack trace.