Java MapReduce
Programming on
Apache Hadoop
Aaron T. Myers, aka ATM
with thanks to Sandy Ryza
Introductions
● Software Engineer/Tech Lead for HDFS at
Cloudera
● Committer/PMC Member on the Apache
Hadoop project
● My work focuses primarily on HDFS and
Hadoop security
What is MapReduce?
● A distributed programming paradigm
What is a distributed programming paradigm?
Help!
What is a distributed programming paradigm?
Distributed Systems are Hard
● Monitoring
● RPC protocols, serialization
● Fault tolerance
● Deployment
● Scheduling/Resource Management
Writing Data Parallel Programs Should Not Be
MapReduce to the Rescue
● You specify map(...) and reduce(...)
functions
○ map = (list(k, v) -> list(k, v))
○ reduce = (k, list(v) -> k, v)
● The framework does the rest
○ Split up the data
○ Run several mappers over the splits
○ Shuffle the data around for the reducers
○ Run several reducers
○ Store the final results
Map

(Diagram: the input data is split into map inputs, each split is handed to a map() call, and the map outputs flow into the shuffle.)

Input data / map inputs (one split per map() call):
apple apple banana
a happy airplane
airplane on the runway
runway apple runway
rumple on the apple

Map outputs (one "word - 1" pair per token):
apple - 1
apple - 1
banana - 1
a - 1
happy - 1
airplane - 1
on - 1
the - 1
runway - 1
runway - 1
runway - 1
apple - 1
rumple - 1
on - 1
the - 1
apple - 1
Reduce

(Diagram: shuffled key/value groups are handed to reduce() calls, one per key, producing the final output.)

Shuffle output (values grouped by key):
a - 1, 1
airplane - 1
apple - 1, 1, 1, 1
banana - 1
on - 1, 1
runway - 1, 1, 1
rumple - 1
the - 1, 1

Reduce output:
a - 1
airplane - 1
apple - 4
banana - 1
on - 2
runway - 3
rumple - 1
the - 2
What is (Core) Hadoop?
● An open source platform for storing,
processing, and analyzing enormous
amounts of data
● Consists of…
○ A distributed file system (HDFS)
○ An implementation of the Map/Reduce paradigm
(Hadoop MapReduce)
● Written in Java!
What is Hadoop?
Traditional Operating System
Storage:
File System
Execution/Scheduling:
Processes
What is Hadoop?
Hadoop
(Distributed operating system)
Storage:
Hadoop Distributed
File System (HDFS)
Execution/Scheduling:
MapReduce
HDFS (briefly)
● Distributed file system that runs on all nodes
in the cluster
○ Co-located with Hadoop MapReduce daemons
● Looks like a pretty normal Unix file system
○ hadoop fs -ls /user/atm/
○ hadoop fs -cp /user/atm/data.txt /user/atm/data2.txt
○ hadoop fs -rm /user/atm/data.txt
○ …
● Don’t use the normal Java File API
○ Instead use org.apache.hadoop.fs.FileSystem API
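As a rough illustration, here is a minimal sketch of reading an HDFS file through that API (the class name HdfsCat is made up; the path reuses the example above):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (and therefore the HDFS cluster) from the Hadoop config on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    try (FSDataInputStream in = fs.open(new Path("/user/atm/data.txt"))) {
      BufferedReader reader = new BufferedReader(new InputStreamReader(in));
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}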
Writing MapReduce programs in Java
● The interface to MapReduce in Hadoop is a Java API
● WordCount!
Word Count Map Function
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count Reduce Function
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count Driver
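(The driver code itself did not survive the export. A minimal sketch against the same old mapred API as the mapper and reducer above might look like the following; the class name, job name, and argument handling are assumptions.)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("word count");

    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // Types of the keys and values emitted by the reducer (and, here, the mapper too)
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);  // submit the job and wait for it to finish
  }
}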
InputFormats
● TextInputFormat
○ Each line becomes <LongWritable, Text> = <byte
offset in file, whole line>
● KeyValueTextInputFormat
○ Splits lines on delimiter into Text key and Text value
● SequenceFileInputFormat
○ Reads key/value pairs from SequenceFile, a Hadoop
format
● DBInputFormat
○ Uses JDBC to connect to a database
● Many more, or write your own!
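In the old mapred API the InputFormat is chosen on the JobConf in the driver. A sketch (TextInputFormat is the default; switching formats also means changing the mapper's input key/value types to match):

// In the driver:
conf.setInputFormat(TextInputFormat.class);            // default: <byte offset, whole line>
// conf.setInputFormat(KeyValueTextInputFormat.class); // <Text key, Text value>, split on a delimiter
// conf.setInputFormat(SequenceFileInputFormat.class); // key/value pairs from a SequenceFile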
Serialization
● Writables
○ Native to Hadoop
○ Implement serialization for higher level structures
yourself
● Avro
○ Extensible
○ Cross-language
○ Handles serialization of higher level structures for
you
● And others…
○ Parquet, Thrift, etc.
Writables
public class MyNumberAndStringWritable implements Writable {

  private int number;
  private String str;

  public void write(DataOutput out) throws IOException {
    out.writeInt(number);
    out.writeUTF(str);
  }

  public void readFields(DataInput in) throws IOException {
    number = in.readInt();
    str = in.readUTF();
  }
}
Avro
protocol MyMapReduceObjects {
  record MyNumberAndString {
    string str;
    int number;
  }
}
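For comparison with the Writable above, here is a sketch of writing that record with Avro's generated code, outside of MapReduce. It assumes a MyNumberAndString class has been code-generated from the IDL (e.g. by the Avro Maven plugin), so the builder method names are assumptions tied to that generated class.

import java.io.File;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    // MyNumberAndString is the class Avro generates from the record definition above
    MyNumberAndString record = MyNumberAndString.newBuilder()
        .setStr("hello")
        .setNumber(42)
        .build();

    DatumWriter<MyNumberAndString> datumWriter =
        new SpecificDatumWriter<MyNumberAndString>(MyNumberAndString.class);
    try (DataFileWriter<MyNumberAndString> fileWriter =
        new DataFileWriter<MyNumberAndString>(datumWriter)) {
      fileWriter.create(record.getSchema(), new File("numbers.avro"));
      fileWriter.append(record);
    }
  }
}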
Testing MapReduce Programs
● First, write unit tests (duh) with MRUnit
● LocalJobRunner (config sketch after this list)
○ Runs job in single process
● Single-node cluster (Cloudera VM!)
○ Multiple processes on the same machine
● On the real cluster
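To point a job at the LocalJobRunner, the classic MR1-era settings look roughly like this; the exact property names depend on the Hadoop version, so treat them as an assumption:

// Run the whole job in a single local JVM against the local file system
JobConf conf = new JobConf(WordCount.class);
conf.set("mapred.job.tracker", "local");
conf.set("fs.default.name", "file:///");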
MRUnit
@Test
public void testMapper() throws IOException {
  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver =
      new MapDriver<LongWritable, Text, Text, IntWritable>(new WordCountMapper());
  String line = "apple banana banana carrot";
  mapDriver.withInput(new LongWritable(0), new Text(line));
  mapDriver.withOutput(new Text("apple"), new IntWritable(1));
  mapDriver.withOutput(new Text("banana"), new IntWritable(1));
  mapDriver.withOutput(new Text("banana"), new IntWritable(1));
  mapDriver.withOutput(new Text("carrot"), new IntWritable(1));
  mapDriver.runTest();
}
MRUnit
@Test
public void testReducer() {
  ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver =
      new ReduceDriver<Text, IntWritable, Text, IntWritable>(new WordCountReducer());
  reduceDriver.withInput(new Text("apple"),
      Arrays.asList(new IntWritable(1), new IntWritable(2)));
  reduceDriver.withOutput(new Text("apple"), new IntWritable(3));
  reduceDriver.runTest();
}
Counters
Map-Reduce Framework
Map input records=183
Map output records=183
Map output bytes=533563
Map output materialized bytes=534190
Input split bytes=144
Combine input records=0
Combine output records=0
Reduce input groups=183
Reduce shuffle bytes=0
Reduce input records=183
Reduce output records=183
Spilled Records=366
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=7
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
File System Counters
FILE: Number of bytes read=1844866
FILE: Number of bytes written=1927344
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
File Input Format Counters
Bytes Read=655137
File Output Format Counters
Bytes Written=537484
Counters
if (record.isUgly()) {
  context.getCounter("Ugly Record Counters",
      "Ugly Records").increment(1);
}
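Custom counters can also be read back in the driver once the job finishes. With the newer org.apache.hadoop.mapreduce.Job API that reads roughly like this sketch (assuming a Job object named job):

// After the job completes:
long uglyRecords = job.getCounters()
    .findCounter("Ugly Record Counters", "Ugly Records")
    .getValue();
System.out.println("Ugly records: " + uglyRecords);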
Counters
Map-Reduce Framework
Map input records=183
Map output records=183
Map output bytes=533563
Map output materialized bytes=534190
Input split bytes=144
Combine input records=0
Combine output records=0
Reduce input groups=183
Reduce shuffle bytes=0
Reduce input records=183
Reduce output records=183
Spilled Records=366
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=7
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
File System Counters
FILE: Number of bytes read=1844866
FILE: Number of bytes written=1927344
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
File Input Format Counters
Bytes Read=655137
File Output Format Counters
Bytes Written=537484
Ugly Record Counters
Ugly Records=1024
Distributed Cache
We need some data and libraries on all the
nodes.
Distributed Cache

(Diagram: files are pushed from HDFS into the Distributed Cache; each Map or Reduce Task reads a local copy on its own node.)
Distributed Cache
In our driver:
DistributedCache.addCacheFile(
    new URI("/some/path/to/ourfile.txt"), conf);

In our mapper or reducer:

@Override
public void setup(Context context) throws IOException,
    InterruptedException {
  Configuration conf = context.getConfiguration();
  localFiles = DistributedCache.getLocalCacheFiles(conf);
}
Java Technologies Built on MapReduce
Crunch
● Library on top of MapReduce that makes it
easy to write pipelines of jobs in Java
● Contains capabilities like joins and
aggregation functions to save programmers
from writing these for each job
Crunch
public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}
Mahout
● Machine Learning on Hadoop
○ Collaborative Filtering
○ User and Item based recommenders
○ K-Means, Fuzzy K-Means clustering
○ Dirichlet process clustering
○ Latent Dirichlet Allocation
○ Singular value decomposition
○ Parallel Frequent Pattern mining
○ Complementary Naive Bayes classifier
○ Random forest decision tree based classifier
Non-Java technologies that use
MapReduce
● Hive
○ SQL -> M/R translator, metadata manager
● Pig
○ Scripting DSL -> M/R translator
● Distcp
○ HDFS tool to bulk copy data from one HDFS cluster
to another
Thanks!
● Questions?