Big Data Hadoop 2.0 MapReduce HDFS

Big Data: Hadoop 2.0
Map Reduce / HDFS 2.0

@diego_pacheco
Software Architect | Agile Coach

Hadoop Distributed File System

4000 nodes: 14PB storage

HDFS – Assumptions and Goals
• Hardware Failure: Hundreds or thousands of machines; expect them to fail.
• Streaming Data Access: Batch processing; high throughput, not low latency.
• Large Data Sets: Terabytes of data; runs on a cluster, scales, millions of files in a single instance.
• Simple Coherency Model: Write-once-read-many (create, write, close, no changes) simplifies coherency and enables high throughput; a good fit for Map/Reduce.

• Moving Computation instead of Moving Data: Far cheaper when the data is huge, and it minimizes network traffic. HDFS moves the computation close to the data.

• Software and Hardware Portability: Easily portable.

HDFS

• Very large distributed FS
• 10k nodes, 100M files, 10PB
• Works with commodity hardware
• File replication
• Detects and recovers from failures
• Optimized for batch processing
• Files are split into 128 MB blocks
• Blocks are replicated across N DataNodes
• Data Coherency
• Write Once, Read Many
• Existing files can only be appended to
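
To make the block and replication bullets concrete, here is a minimal Python sketch, not HDFS code. It splits a file size into 128 MB blocks and assigns each block to N DataNodes. The function names and the round-robin placement are assumptions made for illustration; real HDFS uses its own rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement for illustration only; it ignores racks,
    free space and node load.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)]
                for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    nodes = ["dn0", "dn1", "dn2", "dn3"]           # hypothetical node names
    for block, holders in place_replicas(len(blocks), nodes).items():
        print(block, blocks[block], holders)
```

With this placement, losing any single node still leaves two copies of every block, which is the point of the "File replication" bullet.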

Today: Parallelism per file

Single LARGE File
Single Thread

No Parallelism

Map/Reduce: Unit of data

Task 0: 0..64 MB
Task 1: 64..128 MB
Task 2: 128..192 MB
Task 3: 192..256 MB

Each task processes one unit of data
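
The split into units can be sketched in a few lines of Python. This is an illustration under assumptions, not Hadoop's API: `task_ranges` and `parallel_count` are hypothetical names, local threads stand in for cluster tasks, and counting newlines stands in for the real map work.

```python
from concurrent.futures import ThreadPoolExecutor


def task_ranges(total, unit):
    """Cut [0, total) into consecutive (start, end) ranges of at most `unit`."""
    return [(start, min(start + unit, total))
            for start in range(0, total, unit)]


def count_newlines(data, start, end):
    """The per-task work: count newline bytes inside one unit of data."""
    return data.count(b"\n", start, end)


def parallel_count(data, unit):
    """Run one task per unit, then combine the partial results."""
    ranges = task_ranges(len(data), unit)
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda r: count_newlines(data, *r), ranges)
        return sum(partials)


if __name__ == "__main__":
    # The four units from the slide, expressed in MB.
    print(task_ranges(256, 64))
    print(parallel_count(b"a\nb\nc\n" * 100, 64))
```

Each task only needs its own byte range, so tasks never wait on each other; the final `sum` plays the role of the reduce step.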

Map/Reduce: Local Read

Node 0: Task 0 reads 0..64 MB
Node 1: Task 1 reads 64..128 MB
Node 2: Task 2 reads 128..192 MB
Node 3: Task 3 reads 192..256 MB

• Local Read, no need for network copy
• Data is read from many disks in parallel
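
A rough sketch of how a scheduler could achieve local reads, assuming it knows which nodes hold each unit of data. The function name `assign_tasks` and the tie-breaking rule are hypothetical; this is not the actual Hadoop scheduler, only the idea of sending each task to a node that already stores its data.

```python
from collections import Counter


def assign_tasks(block_locations):
    """Map each block to one of the nodes that already holds it.

    block_locations: {block_id: [node names holding a copy]}.
    Among the holders, pick the node with the fewest tasks so far
    (ties broken by name), so every read stays local.
    """
    load = Counter()
    assignment = {}
    for block in sorted(block_locations):
        node = min(block_locations[block], key=lambda n: (load[n], n))
        assignment[block] = node
        load[node] += 1
    return assignment


if __name__ == "__main__":
    # The layout from the slide: one unit of data per node.
    locations = {0: ["node0"], 1: ["node1"], 2: ["node2"], 3: ["node3"]}
    print(assign_tasks(locations))
```

Because every task lands on a holder of its data, nothing is copied over the network, and the four disks are read at the same time.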

Map/Reduce: The Magic!

Single hard drive: reads 75 MB/second

12 hard drives per machine

12 * 75 MB/second * 4k nodes ≈ 3.4 TB/second (taking 1 TB = 1024 * 1024 MB)
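
The arithmetic can be checked with a short Python snippet. The function name is an assumption for illustration; the figures are the ones on the slide, and the conversion assumes 1 TB = 1024 * 1024 MB, which is what gives 3.4 rather than 3.6.

```python
def aggregate_read_throughput(mb_per_disk=75, disks_per_node=12, nodes=4000):
    """Return total read throughput as (MB/second, TB/second).

    Assumes every disk in the cluster is read at full speed at once,
    and converts with 1 TB = 1024 * 1024 MB.
    """
    total_mb = mb_per_disk * disks_per_node * nodes
    return total_mb, total_mb / (1024 * 1024)


if __name__ == "__main__":
    total_mb, total_tb = aggregate_read_throughput()
    print(f"{total_mb} MB/s is about {total_tb:.1f} TB/s")
```

This is an upper bound: it only holds when reads are local and spread across all disks in parallel.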


Thank You!

