Distributed batch processing with Hadoop

Ferran Galí i Reniu
@ferrangali

09/01/2014

Ferran Galí i Reniu
● UPC - FIB
● Trovit

Problem
● Too much data
○ 90% of all the data in the world has been generated
in the last two years
○ Large Hadron Collider: 25 petabytes per year
○ Walmart: 1M transactions per hour

● Hard disks
○ Cheap!
○ Access times are still slow
○ Writes are even slower

Solutions
● Multiple Hard Disks
○ Work in parallel
○ We can reduce access time!

● How to deal with hardware failure?
● What if we need to combine data?

Hadoop
● Doug Cutting & Mike Cafarella
● Inspired by two Google papers:
○ The Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. October 2003
○ MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. December 2004
● Yahoo!

Hadoop
● HDFS
○ Storage

● MapReduce
○ Processing

● Ecosystem

HDFS
● Distributed storage
○ Managed across a network of commodity machines

● Blocks
○ About 128 MB
○ Large data sets

● Tolerance to node failure
○ Data replication

● Streaming data access
○ Read many times
○ Write once (batch)
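
As a rough illustration of the block and replication bullets above, the sketch below works out how many blocks a file occupies and how much raw disk it uses once replicated. The 128 MB block size comes from the slide; the replication factor of 3 is the usual HDFS default and is an assumption here, as are the function names and the 1000 MB file.

```python
import math

BLOCK_SIZE_MB = 128   # block size from the slide
REPLICATION = 3       # assumed: the usual HDFS default replication factor


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Disk space used across the cluster once every block is replicated."""
    return file_size_mb * replication


if __name__ == "__main__":
    size = 1000  # a hypothetical 1000 MB file
    print(block_count(size))     # 8 blocks: 7 full ones plus a partial one
    print(raw_storage_mb(size))  # 3000 MB of raw disk
```

Large blocks keep the number of entries the master has to track small, which is one reason HDFS suits large data sets better than many small files.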

HDFS
● DataNodes (Workers)
○ Store blocks

● NameNode (Master)
○ Maintains metadata
○ Knows where the blocks are located
○ Makes DataNodes fault tolerant
○ Single point of failure
○ Secondary NameNode
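
The division of labour above can be sketched with a toy block map. This is not the HDFS API: `ToyNameNode`, its methods, the node names and the replication factor of 2 are all hypothetical, and the real NameNode asks DataNodes to copy blocks rather than just updating its bookkeeping.

```python
class ToyNameNode:
    """In-memory stand-in for the NameNode's block map (not the real HDFS API)."""

    def __init__(self, replication=2):
        self.replication = replication
        self.live_nodes = set()
        self.block_locations = {}   # block id -> set of DataNode names

    def register(self, node):
        self.live_nodes.add(node)

    def add_block(self, block_id, nodes):
        self.block_locations[block_id] = set(nodes)

    def locate(self, block_id):
        """Answer the question 'where is this block?'."""
        return sorted(self.block_locations[block_id])

    def node_failed(self, node):
        """Forget a dead DataNode and restore the replica count of its blocks."""
        self.live_nodes.discard(node)
        for holders in self.block_locations.values():
            holders.discard(node)
            # Candidate targets: live nodes that do not hold the block yet.
            candidates = sorted(self.live_nodes - holders)
            while len(holders) < self.replication and candidates:
                holders.add(candidates.pop(0))


if __name__ == "__main__":
    nn = ToyNameNode()
    for name in ("dn1", "dn2", "dn3"):
        nn.register(name)
    nn.add_block("blk_1", ["dn1", "dn2"])
    nn.add_block("blk_2", ["dn2", "dn3"])
    nn.node_failed("dn2")
    print(nn.locate("blk_1"))  # ['dn1', 'dn3']
    print(nn.locate("blk_2"))  # ['dn1', 'dn3']
```

Because all of this state lives in one place, losing the master loses the map of the file system, which is why the slide flags it as a single point of failure.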

HDFS

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

HDFS
● Interfaces
○ Java
○ Command line interface

● Load
hadoop fs -put file.csv /user/hadoop/file.csv

● Extract
hadoop fs -get /user/hadoop/file.csv file.csv

MapReduce
● Distributed processing paradigm
○ Moving computation is cheaper than moving data

● Map
○ Map(k1,v1) -> list(k2,v2)

● Reduce
○ Reduce(k2,list(v2)) -> list(v3)

Word Counter
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));
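
The pseudocode above can be tried out without a cluster. The sketch below is a single-process imitation in Python, not Hadoop code: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the case-insensitive ordering of the output is an assumption chosen to match the order used on the slides.

```python
from collections import defaultdict


def map_fn(key, value):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line."""
    return [(word, 1) for word in value.split()]


def reduce_fn(word, values):
    """reduce(k2, list(v2)): emit the total count for one word."""
    return [(word, sum(values))]


def run_job(records, mapper, reducer):
    """Run map, group the emitted values by key, then reduce, in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups, key=str.lower):
        results.extend(reducer(out_key, groups[out_key]))
    return results


if __name__ == "__main__":
    lines = [(1, "Java is great"), (2, "Hadoop is also great")]
    for word, count in run_job(lines, map_fn, reduce_fn):
        print(word, count)
```

Only `map_fn` and `reduce_fn` are specific to word counting; the driver knows nothing about words, which mirrors how the framework treats user code.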

Word Counter - Input
Java is great
Hadoop is also great

Word Counter - Map
emit(word, 1);

Input records:
Key     Value
1       Java is great
2       Hadoop is also great

map(1, “Java is great”) emits:
Key     Value
Java    1
is      1
great   1

map(2, “Hadoop is also great”) emits:
Key     Value
Hadoop  1
is      1
also    1
great   1

Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)

Map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

After group:
Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]

After sort:
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
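
The group and sort step between map and reduce can be isolated in a few lines. This is a hedged sketch rather than Hadoop's shuffle implementation: `shuffle` is a hypothetical name, the order is case-insensitive to match the slides, and the pairs are sorted before grouping because `itertools.groupby` only merges adjacent items.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(pairs):
    """Sort map output by key, then collect the values of each key into a list."""
    # Sorting on (lowercased key, key) keeps identical keys next to each other.
    ordered = sorted(pairs, key=lambda kv: (kv[0].lower(), kv[0]))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


if __name__ == "__main__":
    map_output = [
        ("Java", 1), ("is", 1), ("great", 1),
        ("Hadoop", 1), ("is", 1), ("also", 1), ("great", 1),
    ]
    for key, values in shuffle(map_output):
        print(key, values)
```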

Word Count - Reduce
emit(word, sum(values));

Input:
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Calls and output:
Call                      Key     Value
reduce(“also”, [1])       also    1
reduce(“great”, [1, 1])   great   2
reduce(“Hadoop”, [1])     Hadoop  1
reduce(“is”, [1, 1])      is      2
reduce(“Java”, [1])       Java    1

Distributed?
● Map tasks
○ Each block read from HDFS is processed by its own map task

● Reduce tasks
○ Partitioning when grouping

Word Count - Partition
num partitions = 1

All keys go to a single partition, which is grouped and then sorted:
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Word Count - Partition
num partitions = 2

Each key goes to exactly one partition; each partition is grouped and sorted on its own.

First partition:
Key     Value
is      [1, 1]
Java    [1]

Second partition:
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]

Distributed?
● Map tasks
○ Each block read from HDFS is processed by its own map task

● Reduce tasks
○ Partitioning when grouping
○ Each partition executes a reduce task
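
One common way to assign keys to partitions is to hash the key and take the remainder modulo the number of partitions. The sketch below assumes that scheme: the formula mirrors the one used by Hadoop's default hash partitioner, while applying it to Java's `String.hashCode` is an assumption made for illustration (a real job hashes its Writable key type, which may yield different partition numbers). All function names are hypothetical.

```python
from collections import defaultdict


def java_string_hash(text):
    """Unsigned 32-bit equivalent of Java's String.hashCode() for ASCII text."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def partition_for(key, num_partitions):
    """Clear the sign bit, then take the remainder: same key, same partition."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_partitions


def partition_and_group(pairs, num_partitions):
    """Route each pair to a partition, then group and sort every partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_for(key, num_partitions)][key].append(value)
    return [
        sorted(groups.items(), key=lambda item: item[0].lower())
        for groups in partitions
    ]


if __name__ == "__main__":
    map_output = [
        ("Java", 1), ("is", 1), ("great", 1),
        ("Hadoop", 1), ("is", 1), ("also", 1), ("great", 1),
    ]
    for number, partition in enumerate(partition_and_group(map_output, 2)):
        print(number, partition)
```

Since every occurrence of a key lands in the same partition, each reduce task sees the complete list of values for its keys and can run independently of the others.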

MapReduce
● JobTracker
○ Dispatches Map & Reduce Tasks

● TaskTracker
○ Executes Map & Reduce Tasks

MapReduce
Example 1:
● Map
● Reduce
● Group & Partition
$> hadoop jar jug-hadoop.jar example1 /user/hadoop/input.txt /user/hadoop/output 2
$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop

MapReduce
Example 2:
● Sorting
● n-Job workflow

$> hadoop jar jug-hadoop.jar example2 /user/hadoop/input.txt /user/hadoop/output 2
$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop
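
The idea of an n-job workflow is that the output of one job becomes the input of the next. The contents of the repository's example are not shown here, so the sketch below is only an assumed illustration of that chaining: a first job counts words and a second job re-keys the result by count and sorts it. The function names and the "most frequent first" order are hypothetical.

```python
from collections import Counter


def word_count_job(lines):
    """Job 1: count how many times each word appears."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return list(counts.items())


def sort_by_count_job(word_counts):
    """Job 2: re-key each pair as (count, word) and sort on the count."""
    swapped = [(count, word) for word, count in word_counts]
    # Highest count first; ties broken alphabetically, ignoring case.
    return sorted(swapped, key=lambda kv: (-kv[0], kv[1].lower()))


def workflow(lines):
    """Chain the two jobs: the output of job 1 is the input of job 2."""
    return sort_by_count_job(word_count_job(lines))


if __name__ == "__main__":
    for count, word in workflow(["Java is great", "Hadoop is also great"]):
        print(count, word)
```

Swapping key and value in the second job matters because the framework sorts by key, so whatever should drive the order has to become the key.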

Big Data
● Too much data
○ Not a problem any more

● It’s just a matter of which tools to use
● New opportunity for businesses

Big Data Platform
[Diagram: a pipeline with three stages (Consumption, Processing, Serving); data stores shown: logs, indexes, DB, NoSQL]

Hive
● Data Warehouse
● SQL-Like analysis system
SELECT word, COUNT(*) AS occurrences
FROM `table`
LATERAL VIEW explode(split(line, ' ')) words AS word
GROUP BY word
ORDER BY word ASC;

● Executes MapReduce underneath!

HBase
● Based on BigTable
● Column-oriented database
● Random realtime read/write access
● Easy to bulk load from Hadoop

Hadoop Ecosystem
● ZooKeeper:
○ Centralized coordination system

● Pig
○ Data-flow language to analyze large data sets

● Kafka:
○ Distributed messaging system

● Sqoop:
○ Transfers data between RDBMS and HDFS

● ...

Trovit
● What is it:
○ Vertical search engine.
○ Real estate, cars, jobs, products, vacations.

● Challenges:
○ Millions of documents to index
○ Traffic generates a huge amount of log files

Trovit
● Legacy:
○ Use MySQL as a support to document indexing
○ Didn’t scale!

● Batch processing:
○ Hadoop with a pipeline workflow
○ Problem solved!

● Real time processing:
○ Storm to improve freshness

● More challenges:
○ Content analysis
○ Traffic analysis

Questions?
Distributed batch processing with Hadoop
@ferrangali
