Big data week presentation

Don’t use Hadoop.
(Unless you have to.)

• What is Hadoop?

• Why do people use Hadoop?

• How does it work?

• When should you consider Hadoop?

What is Hadoop?
Apache Hadoop is an open source, Java-based
system for processing data on a network of
commodity servers using a map-reduce
paradigm.

How do people use Hadoop?
A few examples from the Apache site
– Amazon search
– Facebook log storage and reporting
– LinkedIn’s People You May Know
– Twitter data analysis
– Yahoo! ad targeting
A search on LinkedIn shows people at financial
services, biotech, oil and gas exploration, retail,
and other industries are using Hadoop.

Where did Hadoop come from?
• Hadoop was created by Doug Cutting. It’s
named after his son’s toy elephant.
• Hadoop was written to support Nutch, an
open source web search engine.
Hadoop was spun out in 2006.
• Yahoo! invested in Hadoop,
bringing it to “web scale” by
2008.

Hadoop is open source
• Hadoop is an open source project (Apache
license)
– You can download and install it freely
– You can also compile your own custom version of
Hadoop
• There are three subprojects

Hadoop is written for Java
• The good news: Hadoop runs on a JVM
– You can run Hadoop on your workstation (for testing),
on a private cluster, or in a cloud
– You can write Hadoop jobs in Java, or in Scala, JRuby,
Jython, Clojure, or any other JVM language
– You can use other Java libraries
• The bad news: Hadoop was originally written by
and for Java programmers.
– You can do basic work without knowing Java. But you
will quickly get stuck if you can’t write code.

Hadoop runs on a network of servers

Hadoop runs on commodity servers
• Doesn’t require very fast, very big, or very
reliable servers
• Works better on good quality servers
connected through a fast network
• Hadoop is fault tolerant: it keeps multiple copies of
data and protects against failed jobs

When should you consider Hadoop?
• Big problem
• Fits Map/Reduce model
• Don’t need to compute in real time
• Technical team

Picking the right tool for the job

[Chart: a log scale from 1 to 1,000,000,000,000 (vertical axis) against
tools (horizontal axis): Calculator, Spreadsheet, Numerical Software,
Parallel Systems, ?]

Man / Reduce
• I need 7 volunteers:
– 4 mappers
– 3 reducers
• We’re going to show how map/reduce works
by sorting and counting some notes.

What is Map/Reduce?
• You compute things in two phases
– The map step
• Reads the input data
• Transforms the data
• Tags each datum with a key and sends each datum to
the right reducer
– The reduce step
• Collects all the data for each key
• Does some work on the data for each key
• Outputs the results
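
The two phases above can be sketched in plain Java without Hadoop. This is a minimal single-machine illustration, not how Hadoop itself is implemented: the class name MiniMapReduce, the run method, and the sample notes are all hypothetical, and the grouping that Hadoop performs across the network is replaced by an in-memory TreeMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical single-machine sketch of the map and reduce phases.
class MiniMapReduce {

    public static <I, K extends Comparable<K>, R> Map<K, R> run(
            List<I> inputs,
            Function<I, K> keyOf,            // map step: tag each datum with a key
            Function<List<I>, R> reducer) {  // reduce step: work on all data for one key
        // Stand-in for sending each datum to the right reducer: group by key.
        Map<K, List<I>> groups = new TreeMap<>();
        for (I input : inputs) {
            groups.computeIfAbsent(keyOf.apply(input), k -> new ArrayList<>()).add(input);
        }
        // Reduce each group to a single result and collect the output.
        Map<K, R> results = new TreeMap<>();
        for (Map.Entry<K, List<I>> entry : groups.entrySet()) {
            results.put(entry.getKey(), reducer.apply(entry.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        // Sort and count some notes by color, like the volunteer exercise.
        List<String> notes = Arrays.asList("red", "blue", "red", "green", "blue", "red");
        Map<String, Integer> counts = run(notes, note -> note, List::size);
        System.out.println(counts); // prints {blue=2, green=1, red=3}
    }
}
```

On a real cluster the grouping step is the expensive part, because every datum has to travel to the reducer that owns its key.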

Map/Reduce is over 100 years old
• Hollerith machines from the 1890 census

Good fits for Map/Reduce
• Aggregating unstructured data to enter into a
database (ETL)
• Creating email messages
• Processing log files and creating reports

Problems that don’t perfectly fit
• Logistic regression
• Matrix operations
• Social graph calculations

Batch computation
Hadoop is a shared system that allocates
resources to jobs from a queue. It’s not a real
time system.

Coding example
Suppose that we had some log files with events by
date (say, page views). Let’s count the number of
events by day!

Sample data:

1335300359000,Home Page, Joe
1335300359027,Login,
1335300359031,Home Page, Romy
1335300369123,Settings, Joe
…

A Java Example
• Mappers will
– Read the input files
– Extract the timestamp
– Truncate to the day
– Set the output key to the day
• Reducers will
– Iterate through records by day, counting records
– Output the count for each day
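
Before moving to Hadoop, the same logic can be checked locally on the sample data. The sketch below is an illustration only: the class EventsByDayLocal and its methods are hypothetical names, and it assumes the timestamps are epoch milliseconds that should be bucketed by UTC date (the Hadoop mapper in this deck uses the JVM's default time zone instead).

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local version of the count-events-by-day job.
class EventsByDayLocal {

    // Mapper logic: extract the timestamp and truncate it to a day (UTC is an assumption).
    public static LocalDate dayOf(String line) {
        long millis = Long.parseLong(line.split(",")[0].trim());
        return Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).toLocalDate();
    }

    // Reducer logic: count the records that share a day.
    public static Map<LocalDate, Long> countByDay(List<String> lines) {
        Map<LocalDate, Long> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(dayOf(line), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "1335300359000,Home Page, Joe",
                "1335300359027,Login,",
                "1335300359031,Home Page, Romy",
                "1335300369123,Settings, Joe");
        System.out.println(countByDay(sample)); // prints {2012-04-24=4}
    }
}
```

If this version handles your data comfortably on one machine, that is a hint you may not need Hadoop for the job.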

A Java example (Mapper)
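import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormatter;
import org.joda.time.format.ISODateTimeFormat;

public class exampleMapper
    extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output,
      Reporter reporter) throws IOException {
    // Each input value is one log line: timestamp,page,user
    String line = value.toString();
    String[] values = line.split(",");
    // The first field is the event time in epoch milliseconds
    long timeStampLong = Long.parseLong(values[0]);
    DateTime timeStamp = new DateTime(timeStampLong);
    // Format as yyyy-MM-dd, which truncates the timestamp to its day
    DateTimeFormatter dateFormat = ISODateTimeFormat.date();
    // Key: the day. Value: the original line.
    output.collect(new Text(dateFormat.print(timeStamp)),
        new Text(line));
  }

}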

A Java example (Reducer)
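import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class exampleReducer
    extends MapReduceBase
    implements Reducer<Text, Text, Text, LongWritable> {

  public void reduce(Text key,
      Iterator<Text> values,
      OutputCollector<Text, LongWritable> output,
      Reporter reporter) throws IOException {
    long count = 0;
    while (values.hasNext()) {
      // Advance the iterator; without next() the loop never ends
      values.next();
      count++;
    }
    // Emit one record per day: (day, number of events)
    output.collect(key, new LongWritable(count));
  }

}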

A Java example (job file)
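import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class exampleJob extends Configured implements Tool {

  @Override
  public int run(String[] arg0) throws Exception {
    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Count events by date");

    // arg0[0] is the input path
    conf.setInputFormat(TextInputFormat.class);
    TextInputFormat.addInputPath(conf, new Path(arg0[0]));

    // The mapper emits (Text, Text), which differs from the final output
    // types, so the map output classes must be set explicitly
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);

    // arg0[1] is the output path; the reducer emits (Text, LongWritable)
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    TextOutputFormat.setOutputPath(conf, new Path(arg0[1]));

    conf.setMapperClass(exampleMapper.class);
    conf.setReducerClass(exampleReducer.class);

    JobClient.runJob(conf);

    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new exampleJob(), args));
  }
}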

• Tools that make it easier to use Hadoop:

– Hive
– Pig
– Cascading

Cascading
• Tool for constructing Hadoop workflows in Java
• Example:
Scheme pvScheme = new TextLine(new Fields("timestamp", …));
Tap source = new Hfs(pvScheme, inpath);
Scheme countScheme = new TextLine(new Fields("date", "count"));
Tap sink = new Hfs(countScheme, outpath);
Pipe assembly = new Pipe("pagesByDate");
// DateFormatter declares the output field ("date");
// Each selects the input field ("timestamp"). MM is month, mm is minutes.
Function function = new DateFormatter(new Fields("date"),
    "yyyy/MM/dd");
assembly = new Each(assembly, new Fields("timestamp"), function);
assembly = new GroupBy(assembly, new Fields("date"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
Properties properties = new Properties();
FlowConnector.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new FlowConnector(properties);
Flow flow = flowConnector.connect("pagesByDate", source, sink,
    assembly);
flow.complete();

Pig
• Tool to write SQL-like queries against Hadoop
• Example:
define TODATE
org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay();
%declare now `date "+%s000"`;
page_views = LOAD 'PAGEVIEWS' USING PigStorage()
AS (timestamp:long, page:chararray, user:chararray);
last_week = FILTER page_views BY timestamp > $now - 86400000 * 7;
truncated = FOREACH last_week GENERATE *,
TODATE(timestamp) as date;
grouped = GROUP truncated BY date;
counted = FOREACH grouped GENERATE group as date,
COUNT_STAR(truncated) as N;
sorted = ORDER counted BY date;
STORE sorted INTO 'results' USING PigStorage();

Hive
• Tool from Facebook that lets you write SQL
queries against Hadoop
• Example code:

SELECT TO_DATE(timestamp), COUNT(*)
FROM PAGEVIEWS
WHERE timestamp > (unix_timestamp() - 86400 * 7) * 1000
GROUP BY TO_DATE(timestamp)
ORDER BY TO_DATE(timestamp)

Some important related projects
• HBase
• NextGen Hadoop (0.23)
• ZooKeeper
• Mahout
• Giraph

What to do next
• Watch training videos at
http://www.cloudera.com/resource-types/video/

• Get Hadoop (including the code!) at
http://hadoop.apache.org

• Get commercial support from
http://www.cloudera.com/
or http://hortonworks.com/

• Run it in the cloud with Amazon Elastic MapReduce:
http://aws.amazon.com/elasticmapreduce/
