Introduction to Hadoop
Ron Sher
Agenda

• Big data - big issues
• Hadoop to the rescue
• Storage - HDFS
• Processing - MapReduce
• Hadoop ecosystem
Big Data - Big Issues
● Volume, Velocity, Variability
● Lots of data - logs, sensors, social, pictures, video, etc.
● May not fit a single machine
● Access to data is slow
● Hardware may fail
● Network errors happen
Hadoop to the rescue

• Distributed “operating system”
• Scalable - many servers of commodity hardware with lots of cores and disks
• Reliable - detect failures, redundant storage
• Fault-tolerant - auto-retry, self-healing
• Simple - use many servers as one really big computer
• Suitable for batch processing (throughput over
Storage - HDFS

• Hadoop Distributed File System
• Replicated (3 copies by default), fixed-size blocks (64 MB by default in Hadoop 1.x; 128 MB since Hadoop 2.x)
• Runs on large clusters of commodity machines
• Optimized for write-once, read-many access and for high throughput on large files
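To make the block and replication figures concrete, here is a minimal sketch in plain Java (no Hadoop dependency). It assumes the slide's defaults of 64 MB blocks and a replication factor of 3; the class name HdfsBlockMath and the 200 MB example file are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy arithmetic for HDFS block splitting and replication; not Hadoop code. */
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;
    static final long BLOCK_SIZE = 64 * MB; // default block size quoted on the slide
    static final int REPLICATION = 3;       // default replication quoted on the slide

    /** Sizes in bytes of the blocks that a file of the given length is cut into. */
    static List<Long> blockSizes(long fileBytes, long blockSize) {
        List<Long> sizes = new ArrayList<>();
        for (long offset = 0; offset < fileBytes; offset += blockSize) {
            // The last block holds only the remaining bytes.
            sizes.add(Math.min(blockSize, fileBytes - offset));
        }
        return sizes;
    }

    /** Raw bytes stored across the cluster once every block is replicated. */
    static long rawBytesStored(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long fileBytes = 200 * MB; // hypothetical 200 MB log file
        List<Long> blocks = blockSizes(fileBytes, BLOCK_SIZE);
        System.out.println("blocks: " + blocks.size());                                      // 4
        System.out.println("last block MB: " + blocks.get(blocks.size() - 1) / MB);          // 8
        System.out.println("block replicas: " + blocks.size() * REPLICATION);                // 12
        System.out.println("raw MB stored: " + rawBytesStored(fileBytes, REPLICATION) / MB); // 600
    }
}
```

With these assumptions, a 200 MB file becomes three full 64 MB blocks plus one 8 MB block, and the cluster stores 600 MB of raw data for it.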
HDFS Architecture
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/images/hdfsarchitecture.png
Useful HDFS commands
• hdfs dfs -get <file name> - copy a file from HDFS to the local file system
• hdfs dfs -put <file name> [destination] - copy a file from the local file system to the specified HDFS destination
• hdfs dfs -cat <file name> - print a file to stdout
• hdfs dfs -ls <dir name> - list the files under the specified directory
• hdfs dfs -mv <file name> <changed name> - rename (move) a file
• hdfs dfs -rm <file name> - remove a file
• hdfs dfs -rmr <directory name> - remove a directory recursively (deprecated in Hadoop 2.x in favor of hdfs dfs -rm -r)
• hdfs dfs -mkdir <dir name> - create a directory
Processing - MapReduce

• A distributed data processing model and execution environment that runs on large clusters of commodity machines
• Responsible for running a job in parallel on many servers
• Handles re-trying a task that fails, validating complete results
• Computation moved to the data
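The "run in parallel, retry on failure" idea can be sketched in plain Java. This is a toy illustration, not Hadoop's scheduler: threads stand in for servers, and the names RetryingTaskRunner, runWithRetry, runAll, and the maxAttempts parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of parallel tasks with retry; threads stand in for servers. */
final class RetryingTaskRunner {
    /** Runs the task, re-running it after a failure, up to maxAttempts times. */
    static <T> T runWithRetry(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed
    }

    /** Runs all tasks on a fixed pool and returns results in submission order. */
    static <T> List<T> runAll(List<Callable<T>> tasks, int workers, int maxAttempts)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> task : tasks) {
                Callable<T> withRetry = () -> runWithRetry(task, maxAttempts);
                futures.add(pool.submit(withRetry));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Hypothetical flaky task: fails on its first attempt, then succeeds.
        Callable<String> flaky = () -> {
            if (calls.incrementAndGet() == 1) {
                throw new IllegalStateException("simulated node failure");
            }
            return "split-0 done";
        };
        Callable<String> healthy = () -> "split-1 done";
        System.out.println(runAll(Arrays.asList(flaky, healthy), 2, 3));
        // prints [split-0 done, split-1 done]
    }
}
```

The caller sees only completed results; the failed first attempt of the flaky task is hidden behind the retry, which is the behavior the slide attributes to the framework.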
MapReduce Sample - Word Count
input
Ini Mini Miny
Mo Mo Miny
Ini Mo Mini

splitting
Ini Mini Miny
Mo Mo Miny
Ini Mo Mini

mapping
Ini, 1   Mini, 1   Miny, 1
Mo, 1   Mo, 1   Miny, 1
Ini, 1   Mo, 1   Mini, 1

shuffling
Ini, 1   Ini, 1
Mini, 1   Mini, 1
Miny, 1   Miny, 1
Mo, 1   Mo, 1   Mo, 1

reducing
Ini, [1,1]
Mini, [1,1]
Miny, [1,1]
Mo, [1,1,1]

final result
Ini, 2
Mini, 2
Miny, 2
Mo, 3
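The stages of the sample can be traced in a few lines of plain Java. This is a single-process sketch of the data flow only, with no Hadoop classes involved; the class WordCountStages and its method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Single-process walk through the map, shuffle, and reduce stages of word count. */
final class WordCountStages {
    /** Map: emit one (word, 1) pair per token in the split. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(split);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle: group the values by key, with keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Chains the three stages over all splits. */
    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.addAll(map(split));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("Ini Mini Miny", "Mo Mo Miny", "Ini Mo Mini");
        System.out.println(wordCount(splits)); // prints {Ini=2, Mini=2, Miny=2, Mo=3}
    }
}
```

In a real cluster each call to map would run on the node holding that split, and the shuffle would move the pairs across the network to the reducers; here everything runs in one JVM to keep the data flow visible.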
How a MapReduce Job Runs in Hadoop
http://answers.oreilly.com/uploads/monthly_10_2009/post-118-125676084924_thumb.png
Monitoring MR jobs (machine:50030)
Useful Commands

• mapred job -kill <job id> - kill a running job
• mapred job -status <job id> - show status of a job
Word Count Mapper
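    // Uses the older org.apache.hadoop.mapred API. Imports go at the top of the file;
    // the class itself is nested inside the job's outer class.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text(); // reused for every token to avoid allocations
        // key: byte offset of the line in the input; value: the line itself
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one); // emit (word, 1)
            }
        }
    }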
Word Count Reducer
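    // Imports go at the top of the file; the class is nested inside the job's outer class.
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // key: a word; values: every count emitted for that word by the mappers
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum)); // emit (word, total)
        }
    }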
Hadoop Ecosystem

• Hive - SQL like language over big data using MR
• HBase - distributed, column-oriented database
• ZooKeeper - coordination service
• Avro - cross language serialization
• Pig - language for exploring big data
• Impala - SQL like directly over HDFS
• Sqoop - tool for moving data from DBs to HDFS
• Mahout - machine learning and data mining library
Some resources

• Motivation for Hadoop and where it’s going (video and whitepaper)
• HDFS Architecture Guide
• How MapReduce Works With Hadoop
• HDFS shell commands
• VM
• MapReduce tutorial
