Distributed batch
processing with Hadoop
Ferran Galí i Reniu
@ferrangali

09/01/2014
Ferran Galí i Reniu
● UPC - FIB
● Trovit
Problem
● Too much data
○ 90% of all the data in the world has been generated
in the last two years
○ Large Hadron Collider: 25 petabytes per year
○ Walmart: 1M transactions per hour

● Hard disks
○ Cheap!
○ Still slow access time
○ Write even slower
Solutions
● Multiple Hard Disks
○ Work in parallel
○ We can reduce access time!

● How to deal with hardware failure?
● What if we need to combine data?
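A back-of-the-envelope sketch of why spreading data over several disks helps. The 1 TB data size and the 100 MB/s transfer rate are illustrative assumptions, not figures from the slides.

```python
def read_time_seconds(data_bytes, disk_bytes_per_second, num_disks=1):
    """Time to scan the data when it is spread evenly over disks read in parallel."""
    per_disk = data_bytes / num_disks
    return per_disk / disk_bytes_per_second

TB = 10**12
MB = 10**6

single = read_time_seconds(1 * TB, 100 * MB)          # one disk
parallel = read_time_seconds(1 * TB, 100 * MB, 100)   # 100 disks in parallel
print(f"1 disk: {single / 3600:.1f} h, 100 disks: {parallel:.0f} s")
```

The speed-up only holds if the work can be split evenly, which is where the two open questions on the slide (failures and combining data) come in.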
Hadoop
● Doug Cutting & Mike Cafarella
Hadoop

The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
October 2003
Hadoop

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
December 2004
Hadoop
● Doug Cutting & Mike Cafarella

● Yahoo!
Hadoop
● HDFS
○ Storage

● MapReduce
○ Processing

● Ecosystem
HDFS
● Distributed storage
○ Managed across a network of commodity machines

● Blocks
○ About 128 MB
○ Large data sets

● Tolerance to node failure
○ Data replication

● Streaming data access
○ Read many times
○ Write once (batch)
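A minimal sketch of how a file could be cut into blocks and each block copied to several machines. The replication factor of 3, the DataNode names, and the round-robin placement are assumptions for illustration; real HDFS placement also takes racks into account.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size mentioned on the slide
REPLICATION = 3                 # assumption: a commonly used replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Requires replication <= len(datanodes) so the replicas stay distinct.
    """
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

blocks = split_into_blocks(300 * 1024 * 1024)  # a hypothetical 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), placement)
```

Losing any single node in this sketch still leaves two copies of every block, which is the idea behind tolerance to node failure.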
HDFS
● DataNodes (Workers)
○ Store blocks

● NameNode (Master)
○ Maintains metadata
○ Knows where the blocks are located
○ Makes DataNodes fault tolerant
○ Single point of failure
○ Secondary NameNode
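The NameNode's role can be pictured with a toy in-memory model. The class and method names below are hypothetical and do not correspond to the Hadoop API; the sketch only shows a block-to-DataNode map and re-replication after a node is lost.

```python
class NameNode:
    """Toy metadata service: maps block ids to the DataNodes holding a replica."""

    def __init__(self, datanodes, replication=2):
        self.live = set(datanodes)
        self.replication = replication
        self.block_locations = {}  # block id -> set of DataNode names

    def add_block(self, block_id, nodes):
        self.block_locations[block_id] = set(nodes)

    def locate(self, block_id):
        """Answer a client's question: where can this block be read from?"""
        return sorted(self.block_locations[block_id] & self.live)

    def handle_failure(self, dead_node):
        """Drop a dead DataNode and copy its blocks onto other live nodes."""
        self.live.discard(dead_node)
        for nodes in self.block_locations.values():
            nodes.discard(dead_node)
            candidates = sorted(self.live - nodes)
            while len(nodes) < self.replication and candidates:
                nodes.add(candidates.pop(0))

nn = NameNode(["dn1", "dn2", "dn3"])
nn.add_block("blk_1", ["dn1", "dn2"])
nn.add_block("blk_2", ["dn2", "dn3"])
nn.handle_failure("dn2")
print(nn.locate("blk_1"), nn.locate("blk_2"))
```

All of this state lives in one object, which mirrors why the slide calls the NameNode a single point of failure.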
HDFS

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
HDFS
● Interfaces
○ Java
○ Command line interface

● Load
hadoop fs -put file.csv /user/hadoop/file.csv

● Extract
hadoop fs -get /user/hadoop/file.csv file.csv
MapReduce
● Distributed processing paradigm
○ Moving computation is cheaper than moving data

● Map
○ Map(k1,v1) -> list(k2,v2)

● Reduce
○ Reduce(k2,list(v2)) -> list(v3)
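The two signatures above can be sketched as a tiny in-memory driver. This is a single-process illustration with hypothetical names, not Hadoop's API; the parity example is made up to exercise the contract.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run one MapReduce job in memory.

    records: iterable of (k1, v1) pairs
    mapper:  (k1, v1) -> iterable of (k2, v2) pairs
    reducer: (k2, [v2, ...]) -> iterable of v3 values
    """
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # group values by intermediate key
    return {k2: list(reducer(k2, values)) for k2, values in sorted(groups.items())}

records = [(0, 3), (1, 4), (2, 5)]
result = run_mapreduce(
    records,
    mapper=lambda k, v: [("even" if v % 2 == 0 else "odd", v)],
    reducer=lambda k, values: [sum(values)],
)
print(result)
```

The user supplies only the mapper and the reducer; grouping by key happens in the framework between the two calls.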
Word Counter
map (Long key, String value)
    for each(String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));
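The pseudocode above translates almost line for line into runnable Python. The function names and the in-memory grouping loop are assumptions for illustration; in Hadoop the framework does the grouping.

```python
from collections import defaultdict

def map_words(key, value):
    """map(k1, v1): emit (word, 1) for every word in the line."""
    for word in value.split():
        yield word, 1

def reduce_counts(word, values):
    """reduce(k2, list(v2)): emit the word with the sum of its counts."""
    yield word, sum(values)

lines = {1: "Java is great", 2: "Hadoop is also great"}

# Stand-in for the framework: collect every emitted value under its key.
grouped = defaultdict(list)
for key, value in lines.items():
    for word, one in map_words(key, value):
        grouped[word].append(one)

counts = dict(
    pair for word, values in grouped.items() for pair in reduce_counts(word, values)
)
print(counts)
```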
Word Counter
Input:
Java is great
Hadoop is also great
Word Counter - Map
Input records:

Key     Value
1       Java is great
2       Hadoop is also great

map(1, “Java is great”) emits:

Key     Value
Java    1
is      1
great   1

map(2, “Hadoop is also great”) emits:

Key     Value
Hadoop  1
is      1
also    1
great   1
Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:

Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

After group:

Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]

After sort:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
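The group and sort step between map and reduce can be sketched as follows. The function name is hypothetical. The case-insensitive sort key mirrors the order shown on the slide; Hadoop's default ordering for text keys is byte-wise, which would place capitalised words first.

```python
from itertools import groupby

map_output = [("Java", 1), ("is", 1), ("great", 1),
              ("Hadoop", 1), ("is", 1), ("also", 1), ("great", 1)]

def group_and_sort(pairs):
    """Sort pairs by key, ignoring case, then collect the values of each key."""
    ordered = sorted(pairs, key=lambda kv: (kv[0].lower(), kv[0]))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]

reduce_input = group_and_sort(map_output)
for key, values in reduce_input:
    print(key, values)
```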
Word Count - Reduce
Input, grouped and sorted:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Reduce calls and their output:

Call                      Key     Value
reduce(“also”, [1])       also    1
reduce(“great”, [1, 1])   great   2
reduce(“Hadoop”, [1])     Hadoop  1
reduce(“is”, [1, 1])      is      2
reduce(“Java”, [1])       Java    1
Distributed?
● Map tasks
○ One map task runs for each block read

● Reduce tasks
○ Partitioning when grouping
Word Count - Partition
num partitions = 1

Map output:

Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

After group:

Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]

After sort:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
Word Count - Partition
num partitions = 2

Map output:

Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

First partition, after group:

Key     Value
Java    [1]
is      [1, 1]

First partition, after sort:

Key     Value
is      [1, 1]
Java    [1]

Second partition, after group:

Key     Value
great   [1, 1]
Hadoop  [1]
also    [1]

Second partition, after sort:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
Distributed?
● Map tasks
○ One map task runs for each block read

● Reduce tasks
○ Partitioning when grouping
○ Each partition executes a reduce task
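A sketch of hash partitioning, the usual way a key is assigned to a reduce task: mask off the hash's sign bit and take the remainder by the number of partitions. Emulating Java's `String.hashCode()` here is an assumption made for illustration; Hadoop's own key types may hash differently, and the function names are hypothetical.

```python
def java_string_hash(s):
    """Emulate Java's String.hashCode() as an unsigned 32-bit value (BMP characters only)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def partition(key, num_partitions):
    """Mask off the sign bit, then take the remainder by the partition count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_partitions

words = ["Java", "is", "great", "Hadoop", "also"]
by_partition = {}
for word in words:
    by_partition.setdefault(partition(word, 2), []).append(word)
print(by_partition)  # {0: ['Java', 'is'], 1: ['great', 'Hadoop', 'also']}
```

Every occurrence of a key hashes to the same partition, so all values for one word reach the same reduce task regardless of which map task emitted them.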
MapReduce
● Job Tracker
○ Dispatches Map & Reduce Tasks

● Task Tracker
○ Executes Map & Reduce Tasks
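The split between dispatching and executing can be sketched with a worker pool. Threads stand in for TaskTracker machines, the function names are hypothetical, and the sketch runs one reduce call per key rather than one reduce task per partition, so it is a simplification of what Hadoop does.

```python
from concurrent.futures import ThreadPoolExecutor

def map_task(block):
    """Map task for one input block: emit (word, 1) pairs."""
    return [(word, 1) for line in block for word in line.split()]

def reduce_task(item):
    """Reduce call for one key: sum its values."""
    word, values = item
    return word, sum(values)

def job_tracker(blocks, num_task_trackers=2):
    """Dispatch one map task per block, then the reduce calls, to the worker pool."""
    with ThreadPoolExecutor(max_workers=num_task_trackers) as task_trackers:
        map_outputs = list(task_trackers.map(map_task, blocks))
        grouped = {}
        for output in map_outputs:
            for word, one in output:
                grouped.setdefault(word, []).append(one)
        return dict(task_trackers.map(reduce_task, sorted(grouped.items())))

blocks = [["Java is great"], ["Hadoop is also great"]]
totals = job_tracker(blocks)
print(totals)
```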
MapReduce
Example 1:
● Map
● Reduce
● Group & Partition
$> hadoop jar jug-hadoop.jar example1 /user/hadoop/input.txt /user/hadoop/output 2
$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop
MapReduce
Example 2:
● Sorting
● n-Job workflow

$> hadoop jar jug-hadoop.jar example2 /user/hadoop/input.txt /user/hadoop/output 2
$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop
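An n-job workflow feeds the output of one job into the next. The slide does not show what example2 in the repository computes, so the sketch below assumes a word count followed by a second job that sorts by count; the function names are hypothetical.

```python
from collections import Counter

def job_count(lines):
    """Job 1: word count, returning (word, count) pairs."""
    return list(Counter(word for line in lines for word in line.split()).items())

def job_sort(pairs):
    """Job 2: swap to (count, word) so the count is the sort key, highest first."""
    swapped = [(count, word) for word, count in pairs]
    return sorted(swapped, key=lambda cw: (-cw[0], cw[1]))

lines = ["Java is great", "Hadoop is also great"]
ranked = job_sort(job_count(lines))  # the output of job 1 is the input of job 2
print(ranked)
```

Swapping key and value in the second job reflects a common MapReduce idiom: the framework sorts by key, so whatever should drive the order has to become the key.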
Big Data
Big Data
● Too much data
○ Not a problem any more

● It’s just a matter of which tools to use
● New opportunity for businesses
Big Data Platform
[Diagram: data platform with Consumption, Processing and Serving stages; labels: logs, indexes, DB, NoSQL]
Hadoop Ecosystem
Hive
● Data Warehouse
● SQL-Like analysis system
SELECT word, COUNT(*)
FROM (SELECT EXPLODE(SPLIT(line, ' ')) AS word FROM `table`) words
GROUP BY word
ORDER BY word ASC;

● Executes MapReduce underneath!
HBase
● Based on BigTable
● Column-oriented database
● Random realtime read/write access
● Easy to bulk load from Hadoop
Hadoop Ecosystem
● ZooKeeper:
○ Centralized coordination system

● Pig
○ Data-flow language to analyze large data sets

● Kafka:
○ Distributed messaging system

● Sqoop:
○ Transfers data between RDBMS and HDFS

● ...
Hadoop - Who’s using it?
Trovit
● What is it:
○ Vertical search engine.
○ Real estate, cars, jobs, products, vacations.

● Challenges:
○ Millions of documents to index
○ Traffic generates a huge amount of log files
Trovit
● Legacy:
○ Use MySQL as a support to document indexing
○ Didn’t scale!

● Batch processing:
○ Hadoop with a pipeline workflow
○ Problem solved!

● Real time processing:
○ Storm to improve freshness

● More challenges:
○ Content analysis
○ Traffic analysis
Questions?
Distributed batch processing with Hadoop
@ferrangali

Mais conteúdo relacionado

Mais procurados

Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseCloudera, Inc.
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopbigdatasyd
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceDenis Shestakov
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Cloudera, Inc.
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsChien Chung Shen
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Zekeriya Besiroglu
 
Distributed computing the Google way
Distributed computing the Google wayDistributed computing the Google way
Distributed computing the Google wayEduard Hildebrandt
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of HadoopNam Nham
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache HadoopOleksiy Krotov
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandasPurna Chander K
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Cloudera, Inc.
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zingzingopen
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 
PyRate for fun and research
PyRate for fun and researchPyRate for fun and research
PyRate for fun and researchBrianna McHorse
 

Mais procurados (19)

Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBase
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle Professionals
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
 
Distributed computing the Google way
Distributed computing the Google wayDistributed computing the Google way
Distributed computing the Google way
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandas
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
PyRate for fun and research
PyRate for fun and researchPyRate for fun and research
PyRate for fun and research
 

Destaque

Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in SearchAmund Tveit
 
La idiosincrasia española
La idiosincrasia españolaLa idiosincrasia española
La idiosincrasia españolaMiNiBuDa
 
YARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerYARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerVertiCloud Inc
 
Paises mas importantes del continente europeo
Paises mas importantes del continente europeoPaises mas importantes del continente europeo
Paises mas importantes del continente europeoJuan Valencia Lara
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014James Chittenden
 
Types of operating system
Types of operating systemTypes of operating system
Types of operating systemMohammad Alam
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 
The Google File System (GFS)
The Google File System (GFS)The Google File System (GFS)
The Google File System (GFS)Romain Jacotin
 

Destaque (9)

Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
La idiosincrasia española
La idiosincrasia españolaLa idiosincrasia española
La idiosincrasia española
 
YARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerYARN - Hadoop's Resource Manager
YARN - Hadoop's Resource Manager
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
Paises mas importantes del continente europeo
Paises mas importantes del continente europeoPaises mas importantes del continente europeo
Paises mas importantes del continente europeo
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014
 
Types of operating system
Types of operating systemTypes of operating system
Types of operating system
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
The Google File System (GFS)
The Google File System (GFS)The Google File System (GFS)
The Google File System (GFS)
 

Semelhante a Distributed batch processing with Hadoop

[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueShay Sofer
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoopdatasalt
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptxShree Shree
 

Semelhante a Distributed batch processing with Hadoop (20)

Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
 
2 hadoop
2  hadoop2  hadoop
2 hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
L3.fa14.ppt
L3.fa14.pptL3.fa14.ppt
L3.fa14.ppt
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
mapreduce ppt.ppt
mapreduce ppt.pptmapreduce ppt.ppt
mapreduce ppt.ppt
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx
 

Último

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 

Último (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

Distributed batch processing with Hadoop

  • 1. Distributed batch processing with Hadoop Ferran Galí i Reniu @ferrangali 09/01/2014
  • 2. Ferran Galí i Reniu ● UPC - FIB ● Trovit
  • 3. Problem ● Too much data ○ 90% of all the data in the world has been generated in the last two years ○ Large Hadron Collider: 25 petabytes per year ○ Walmart: 1M transactions per hour ● Hard disks ○ Cheap! ○ Still slow access time ○ Write even slower
  • 4. Solutions ● Multiple Hard Disks ○ Work in parallel ○ We can reduce access time! ● How to deal with hardware failure? ● What if we need to combine data?
  • 5.
  • 6. Hadoop ● Doug Cutting & Mike Cafarella
  • 7. Hadoop The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung October 2003
  • 8. Hadoop MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat December 2004
  • 9. Hadoop ● Doug Cutting & Mike Cafarella ● Yahoo!
  • 10. Hadoop ● HDFS ○ Storage ● MapReduce ○ Processing ● Ecosystem
  • 11.
  • 12. HDFS ● Distributed storage ○ Managed across a network of commodity machines ● Blocks ○ About 128 MB ○ Large data sets ● Tolerance to node failure ○ Data replication ● Streaming data access ○ Read many times ○ Write once (batch)
  • 13. HDFS ● DataNodes (Workers) ○ Store blocks ● NameNode (Master) ○ Maintains metadata ○ Knows where the blocks are located ○ Makes DataNodes fault tolerant ○ Single point of failure ○ Secondary NameNode
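The block and replication ideas from slides 12 and 13 can be illustrated with a toy model. This is only a sketch: the class and method names are hypothetical, the replication factor of 3 is HDFS's default rather than something the slides state, and the round-robin placement is a simplifying assumption (real HDFS placement is rack-aware).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of HDFS block splitting and replica placement; not the real NameNode logic. */
final class HdfsBlockSketch {
    /** Block size of about 128 MB, as on the slide. */
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    /** Number of blocks needed to store a file of the given size (ceiling division). */
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    /**
     * Assigns every block to `replication` distinct DataNodes, round-robin.
     * Round-robin is an assumption made for brevity; it only shows that each
     * block lives on several nodes, so losing one node loses no data.
     */
    static List<List<Integer>> placeReplicas(long blocks, int dataNodes, int replication) {
        if (replication > dataNodes) {
            throw new IllegalArgumentException("need at least as many DataNodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                nodes.add((int) ((b + r) % dataNodes));
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        long blocks = blockCount(1024L * 1024 * 1024); // a 1 GB file
        System.out.println(blocks + " blocks");
        System.out.println(placeReplicas(blocks, 4, 3)); // 4 DataNodes, 3 replicas per block
    }
}
```

The list returned by `placeReplicas` plays the role of the metadata the NameNode keeps: for each block, which DataNodes hold a copy.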
  • 15. HDFS ● Interfaces ○ Java ○ Command line interface ● Load hadoop fs -put file.csv /user/hadoop/file.csv ● Extract hadoop fs -get /user/hadoop/file.csv file.csv
  • 16.
  • 17. MapReduce ● Distributed processing paradigm ○ Moving computation is cheaper than moving data ● Map ○ Map(k1,v1) -> list(k2,v2) ● Reduce ○ Reduce(k2,list(v2)) -> list(v3)
  • 18. Word Counter map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values));
  • 19. Word Counter map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Java is great Hadoop is also great
  • 20. Word Counter - Map map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value 1 Java is great 2 Hadoop is also great
  • 21. Word Counter - Map map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value 1 Java is great 2 Hadoop is also great map(1, “Java is great”)
  • 22. Word Counter - Map map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value 1 Java is great 2 Hadoop is also great map(1, “Java is great”) Key Value Java 1
  • 23. Word Counter - Map map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value 1 Java is great 2 Hadoop is also great map(1, “Java is great”) Key Value Java 1 is 1
  • 24. Word Counter - Map map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value 1 Java is great 2 Hadoop is also great map(1, “Java is great”) Key Value Java 1 is 1 great 1
  • 25. Word Counter - Map map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value 1 Java is great 2 Hadoop is also great map(2, “Hadoop is also great”) Key Value Java 1 is 1 great 1
  • 26. Word Counter - Map map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value 1 Java is great 2 Hadoop is also great map(2, “Hadoop is also great”) Key Value Java 1 is 1 great 1 Hadoop 1 is 1 also 1 great 1
  • 27. Word Count - Group & Sort map(k1,v1) -> list(k2, v2) Key Value Java 1 is 1 great 1 Hadoop 1 is 1 also 1 great 1 reduce(k2, list(v2)) -> list(v3)
  • 28. Word Count - Group & Sort map(k1,v1) -> list(k2,v2) Key Value Java 1 is 1 great 1 Hadoop 1 is 1 also 1 great 1 group Key Value Java [1] is [1, 1] great [1, 1] Hadoop [1] also [1] reduce(k2,list(v2)) -> list(v3)
  • 29. Word Count - Group & Sort map(k1,v1) -> list(k2,v2) Key Value Java 1 is 1 great 1 Hadoop 1 is 1 also 1 great 1 group Key Value Java [1] is [1, 1] great [1, 1] Hadoop [1] also [1] sort Key Value also [1] great [1, 1] Hadoop [1] is [1, 1] Java [1] reduce(k2,list(v2)) -> list(v3)
  • 30. Word Count - Reduce map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value also [1] great [1, 1] Hadoop [1] is [1, 1] Java [1]
  • 31. Word Count - Reduce map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value also [1] great [1, 1] Hadoop [1] is [1, 1] Java [1] reduce(“also”, [1])
  • 32. Word Count - Reduce map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value also [1] great [1, 1] Hadoop [1] is [1, 1] Java [1] reduce(“also”, [1]) Key Value also 1
  • 33. Word Count - Reduce map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value also [1] great [1, 1] Hadoop [1] is [1, 1] Java [1] reduce(“great”, [1, 1]) Key Value also 1
  • 34. Word Count - Reduce map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value also [1] great [1, 1] Hadoop [1] is [1, 1] Java [1] reduce(“great”, [1, 1]) Key Value also 1 great 2
  • 35. Word Count - Reduce map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value also [1] great [1, 1] Hadoop [1] is [1, 1] Java [1] reduce(“Hadoop”, [1]) Key Value also 1 great 2 Hadoop 1
  • 36. Word Count - Reduce map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value also [1] great [1, 1] Hadoop [1] is [1, 1] Java [1] reduce(“is”, [1, 1]) Key Value also 1 great 2 Hadoop 1 is 2
  • 37. Word Count - Reduce map (Long key, String value) for each(String word in value) emit(word, 1); reduce (String word, List values) emit(word, sum(values)); Key Value also [1] great [1, 1] Hadoop [1] is [1, 1] Java [1] reduce(“Java”, [1]) Key Value also 1 great 2 Hadoop 1 is 2 Java 1
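The whole Word Count walkthrough (map, group, sort, reduce) can be sketched in plain Java running in a single process. This is not Hadoop API code: the class `WordCountSketch`, the `Pair` type and the `run` driver are hypothetical names, and the case-insensitive key order is an assumption chosen to match the order shown on the slides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the Word Count slides; a single process stands in for the cluster. */
final class WordCountSketch {
    /** One (key, value) pair emitted by map. */
    static final class Pair {
        final String key;
        final int value;
        Pair(String key, int value) { this.key = key; this.value = value; }
    }

    /** map(k1, v1) -> list(k2, v2): emits (word, 1) for every word in the line. */
    static List<Pair> map(long key, String value) {
        List<Pair> out = new ArrayList<>();
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) out.add(new Pair(word, 1));
        }
        return out;
    }

    /** reduce(k2, list(v2)) -> v3: sums the counts collected for one word. */
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    /** Runs map over every line, groups and sorts by key, then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        // Group & sort: case-insensitive first (as on the slides), exact order as tie-break
        // so that keys differing only in case stay distinct.
        Map<String, List<Integer>> grouped = new TreeMap<>(
                String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder()));
        long lineNumber = 1;
        for (String line : lines) {
            for (Pair p : map(lineNumber++, line)) {
                grouped.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value);
            }
        }
        Map<String, Integer> result = new LinkedHashMap<>(); // keeps the sorted key order
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("Java is great");
        lines.add("Hadoop is also great");
        System.out.println(run(lines)); // {also=1, great=2, Hadoop=1, is=2, Java=1}
    }
}
```

The printed map matches the final table of slide 37. In a real job the framework does the grouping and sorting between the two phases; only `map` and `reduce` are user code.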
  • 38. Distributed? ● Map tasks ○ One map task runs for each input block read ● Reduce tasks ○ Partitioning when grouping
  • 39. Word Count - Partition num partitions = 1 Key Value Java 1 is 1 great 1 Hadoop 1 is 1 also 1 great 1 group Key Value Java [1] is [1, 1] great [1, 1] Hadoop [1] also [1] sort Key Value also [1] great [1, 1] Hadoop [1] is [1, 1] Java [1]
  • 40. Word Count - Partition num partitions = 2 Key Value Java 1 is 1 great 1 Hadoop 1 is 1 also 1 great 1 Partition 1: group Key Value Java [1] is [1, 1] sort Key Value is [1, 1] Java [1] Partition 2: group Key Value great [1, 1] Hadoop [1] also [1] sort Key Value also [1] great [1, 1] Hadoop [1]
  • 41. Distributed? ● Map tasks ○ One map task runs for each input block read ● Reduce tasks ○ Partitioning when grouping ○ One reduce task runs for each partition
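How keys end up in partitions can be sketched with the formula used by Hadoop's default `HashPartitioner`: the non-negative hash code modulo the number of reduce tasks. The class and method names below are hypothetical, and `String.hashCode()` stands in for the hash of Hadoop's own key types, so the exact split will not necessarily match the one drawn on slide 40.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of hash partitioning: every occurrence of a key goes to the same reduce task. */
final class HashPartitionSketch {
    /** Non-negative hash modulo the number of partitions. */
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Distributes keys into numPartitions buckets, preserving arrival order per bucket. */
    static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            parts.get(partitionFor(key, numPartitions)).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (String k : "Java is great Hadoop is also great".split(" ")) {
            keys.add(k);
        }
        // Both occurrences of "is" and of "great" land in the same bucket.
        System.out.println(partition(keys, 2));
    }
}
```

Because the partition depends only on the key, each reduce task sees all values for its keys, which is what lets the reduce tasks run independently.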
  • 42. MapReduce ● Job Tracker ○ Dispatches Map & Reduce Tasks ● Task Tracker ○ Executes Map & Reduce Tasks
  • 43. MapReduce Example 1: ● Map ● Reduce ● Group & Partition $> hadoop jar jug-hadoop.jar example1 /user/hadoop/input.txt /user/hadoop/output 2 $> hadoop fs -text /user/hadoop/output/part-r-* http://github.com/ferrangali/jug-hadoop
  • 44. MapReduce Example 2: ● Sorting ● n-Job workflow $> hadoop jar jug-hadoop.jar example2 /user/hadoop/input.txt /user/hadoop/output 2 $> hadoop fs -text /user/hadoop/output/part-r-* http://github.com/ferrangali/jug-hadoop
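Example 2 lists sorting and an n-job workflow. The repository's code is not reproduced here; the sketch below only assumes a common shape for such a workflow, in which the output of a first job (word counts) feeds a second job that re-keys each record by its count so that the sort by key orders words by frequency. All names are hypothetical, and the in-memory `TreeMap` stands in for the framework's sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Two chained "jobs" in memory: job 1 counts words, job 2 sorts them by count. */
final class TwoJobWorkflowSketch {
    /** Job 1: word -> number of occurrences (keys in natural String order). */
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Job 2: the map step swaps (word, count) to (count, word); sorting by the new key
     * (descending here, an arbitrary choice) yields the most frequent words first.
     */
    static List<String> sortByCount(Map<String, Integer> counts) {
        TreeMap<Integer, List<String>> byCount = new TreeMap<>(Comparator.reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            for (String word : e.getValue()) {
                out.add(word + "\t" + e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("Java is great");
        lines.add("Hadoop is also great");
        // The output of job 1 is the input of job 2.
        for (String row : sortByCount(countWords(lines))) {
            System.out.println(row);
        }
    }
}
```

On a cluster, each job would write its output to HDFS and the next job would read it from there, which is why such pipelines are launched as a sequence of jobs.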
  • 46. Big Data ● Too much data ○ Not a problem any more ● It’s just a matter of which tools to use ● New opportunity for businesses
  • 49. Hive ● Data Warehouse ● SQL-Like analysis system SELECT word, COUNT(*) FROM (SELECT EXPLODE(SPLIT(line, ' ')) AS word FROM table) words GROUP BY word ORDER BY word ASC; ● Executes MapReduce underneath!
  • 50. HBase ● Based on BigTable ● Column-oriented database ● Random realtime read/write access ● Easy to bulk load from Hadoop
  • 51. Hadoop Ecosystem ● ZooKeeper: ○ Centralized coordination system ● Pig: ○ Data-flow language to analyze large data sets ● Kafka: ○ Distributed messaging system ● Sqoop: ○ Transfers data between RDBMS and HDFS ● ...
  • 52. Hadoop - Who’s using it?
  • 53. Trovit ● What is it: ○ Vertical search engine. ○ Real estate, cars, jobs, products, vacations. ● Challenges: ○ Millions of documents to index ○ Traffic generates a huge amount of log files
  • 54. Trovit ● Legacy: ○ Use MySQL as a support to document indexing ○ Didn’t scale! ● Batch processing: ○ Hadoop with a pipeline workflow ○ Problem solved! ● Real time processing: ○ Storm to improve freshness ● More challenges: ○ Content analysis ○ Traffic analysis
  • 55. Questions? Distributed batch processing with Hadoop @ferrangali