This document provides an introduction to big data and Hadoop. It discusses the growth of data from 2006 to 2020. It then introduces key concepts of Hadoop including HDFS, MapReduce, and the Hadoop ecosystem. It describes how HDFS stores and processes large datasets in a distributed manner through block storage on datanodes and metadata management by the namenode. MapReduce provides a programming model for distributed processing of large datasets across clusters. The document also discusses challenges of hardware failures and solutions in Hadoop like HDFS high availability and federation.
3. Explicit attempt of self promotion
• 23 years old
• Suleyman Demirel University, BSc in CS 2012
• Chocomart.kz - Apr 2013 - present
• Product Manager at Twitter, Dec 2012 - May 2013 (Twitter API & Android app)
• SE at Twitter, Sep 2011 - Dec 2012 (Search, Relevance and Machine Learning dept.)
• Sold diploma thesis (CV algorithm) to Microsoft, used in Bing Maps
• Sold image processing algorithm (better pattern recognition of objects) to Microsoft Research
• Scalable Machine Learning algorithms are my passion
6. Numbers Everyone Should Know
• L1 cache reference 0.5 ns
• Branch mispredict 5 ns
• L2 cache reference 7 ns
• Mutex lock/unlock 100 ns
• Main memory reference 100 ns
• Compress 1K bytes with Zippy 10,000 ns
• Send 2K bytes over 1 Gbps network 20,000 ns
• Read 1 MB sequentially from memory 0.25 ms
7. Numbers Everyone Should Know, part 2
• Round trip within same datacenter 0.5 ms
• Disk seek 10 ms
• Read 1 MB sequentially from network 10 ms
• Read 1 MB sequentially from disk 30 ms
• Send packet CA->Netherlands->CA 150 ms
• Send package via Kazpost - everlasting
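The gap between these numbers is the whole argument for minimizing disk seeks and network hops. A minimal sketch (figures taken from the two slides above, expressed in nanoseconds) that compares each operation against an L1 cache hit:

```python
# Latency figures from the slides, normalized to nanoseconds, so the
# relative cost of each operation can be compared directly.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "branch mispredict": 5,
    "L2 cache reference": 7,
    "mutex lock/unlock": 100,
    "main memory reference": 100,
    "compress 1 KB with Zippy": 10_000,
    "send 2 KB over 1 Gbps network": 20_000,
    "read 1 MB sequentially from memory": 250_000,
    "round trip within datacenter": 500_000,
    "disk seek": 10_000_000,
    "read 1 MB sequentially from disk": 30_000_000,
    "packet CA->Netherlands->CA": 150_000_000,
}

def relative_cost(op: str) -> float:
    """How many L1 cache references fit in one such operation."""
    return LATENCY_NS[op] / LATENCY_NS["L1 cache reference"]

print(relative_cost("disk seek"))  # 20000000.0 - 20 million L1 hits per seek
```

One disk seek costs as much as twenty million L1 cache references, which is why the designs below go to such lengths to read data sequentially.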
9. Problem statement
• Tons of data
• F*cking tons of data
• We need to process it
• Process it fast
• Idea is to “parallelize” processing of data
10. The “Joys” of Real Hardware
• ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
• ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
• ~1 network rewiring (rolling ~5% of machines down over 2-day span)
11.
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
• ~5 racks go wonky (40-80 machines see 50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~12 router reloads (takes out DNS and external VIPs for a couple minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for DNS
• ~1000 individual machine failures
• ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky machines, etc.
12. Problem statement (2)
• A lot of data
• Fast processing
• Reliable
• “Cheap”
• Scale
• Shouldn’t require much manual work
• Should work with many programming languages and platforms
13.
• Google File System (GFS)
  • Distributed filesystem
  • Fault tolerant
• MapReduce
  • Distributed processing framework
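The MapReduce programming model can be sketched in a few lines. This is a toy, single-process illustration (word count over a couple of strings), not the distributed framework: map emits (key, value) pairs, the framework groups pairs by key (the shuffle), and reduce folds each group.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

# Toy single-process sketch of the MapReduce model. In a real cluster
# the three phases below run on many machines; the programming model
# the user sees is exactly this.

def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    # Map: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs: Iterable[Tuple[str, int]]) -> dict:
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups: dict) -> dict:
    # Reduce: fold each group of values into a single result.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big clusters"])))
print(counts)  # {'big': 2, 'data': 1, 'clusters': 1}
```

The appeal of the model is that map and reduce are pure functions over key-value pairs, so the framework is free to parallelize, retry, and relocate them across the cluster.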
14. Apache Hadoop
• “Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project”
28. The Design of HDFS
• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
29. HDFS is not a good fit for:
• Low-latency data access (use HBase instead)
• Lots of small files
• Multiple writers, arbitrary file modifications
30. HDFS Concepts
• Blocks
  • Size on a “normal” filesystem: 512 bytes
  • Size in HDFS: 64 MB
  • A file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage
31. Why is the block size so large?
• Disk seek time is 10 ms
• Transfer rate is 100 MB/s
• Goal is to make the seek time 1% of the transfer time
• So we need a block size of around 100 MB
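The back-of-the-envelope calculation above can be written out directly. Assuming the slide's figures (10 ms seek, 100 MB/s transfer), we want the transfer to take 100x as long as the seek:

```python
# Block size so that one seek costs ~1% of the transfer time.
seek_time_s = 0.010            # 10 ms disk seek (from the slide)
transfer_rate = 100 * 10**6    # 100 MB/s sustained transfer rate

# seek = 1% of transfer  =>  transfer time = 100 * seek time
transfer_time_s = seek_time_s * 100
block_size_bytes = transfer_rate * transfer_time_s

print(block_size_bytes / 10**6)  # 100.0 -> roughly 100 MB per block
```

Hence the HDFS defaults of 64 MB (and later 128 MB): large enough that sequential transfer, not seeking, dominates I/O time.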
32. Why blocks?
• A file can be larger than any single disk in the network
• Making the unit of abstraction a block rather than a file simplifies the storage subsystem
• Blocks have a fixed size, so it is easy to calculate how many can be stored on a disk
• Blocks carry no per-file metadata (permissions, creation time, owner, etc. are stored with the file, not with each block)
33. Namenodes and Datanodes
• Namenode = master
  • Manages the filesystem namespace (filesystem tree, metadata for dirs and files)
  • Namespace image and edit log are stored persistently on disk
  • Knows on which datanodes the blocks of a given file are stored (kept in RAM)
• Datanodes = workers (slaves)
  • Store and retrieve blocks when needed
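The split between the two kinds of namenode state can be sketched as a toy class (all names here are invented for illustration): the namespace (file -> blocks) is persisted via the image and edit log, while the block map (block -> datanodes) lives only in RAM and is rebuilt from datanode block reports.

```python
# Toy sketch (hypothetical names) of the namenode's two core maps.
class ToyNamenode:
    def __init__(self):
        self.namespace = {}   # file path -> [block ids]; persisted on disk
        self.block_map = {}   # block id -> [datanode ids]; RAM only,
                              # rebuilt from block reports after restart

    def add_file(self, path, block_ids):
        # Namespace change: in real HDFS this is logged to the edit log.
        self.namespace[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        # A datanode tells the namenode which blocks it holds.
        for b in block_ids:
            self.block_map.setdefault(b, []).append(datanode)

    def locate(self, path):
        """Which datanodes hold each block of the file."""
        return [(b, self.block_map.get(b, [])) for b in self.namespace[path]]

nn = ToyNamenode()
nn.add_file("/logs/day1", ["blk_1", "blk_2"])
nn.block_report("dn-a", ["blk_1"])
nn.block_report("dn-b", ["blk_1", "blk_2"])
print(nn.locate("/logs/day1"))
# [('blk_1', ['dn-a', 'dn-b']), ('blk_2', ['dn-b'])]
```

This is also why the recovery and HA slides below stress block reports: the block-to-datanode mapping is never written to disk, so a restarted namenode must hear from the datanodes before it can serve reads.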
35. Solutions
• Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems
• Secondary namenode: its main role is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large.
36. HDFS Federation (since 2.x)
• Allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace, e.g.:
  • /user
  • /share
• Each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace.
• Namespace volumes are independent of each other.
• Block pool storage is not partitioned: datanodes register with each namenode in the cluster and store blocks from multiple block pools.
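The client-side view of federation amounts to routing a path to the namenode that owns its portion of the namespace. A hypothetical mount-table sketch (names invented; real HDFS does this with viewfs configuration, not application code):

```python
# Hypothetical federation routing: each namenode owns a portion of the
# namespace (a namespace volume); a mount table picks the namenode
# responsible for a given path.
MOUNT_TABLE = {
    "/user":  "namenode1",
    "/share": "namenode2",
}

def namenode_for(path: str) -> str:
    # Route a path to the namenode whose namespace volume contains it.
    for prefix, nn in MOUNT_TABLE.items():
        if path == prefix or path.startswith(prefix + "/"):
            return nn
    raise KeyError(f"no namenode manages {path}")

print(namenode_for("/user/alice/data"))  # namenode1
print(namenode_for("/share/tools"))      # namenode2
```

Note that only the *namespace* is partitioned this way; the datanodes underneath serve blocks for every namenode in the cluster.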
37. HDFS High-Availability (since 2.x)
• The namenode is still a SPOF (Single Point of Failure)
  • If it fails, no MapReduce jobs can run, and files cannot be read, written, or listed
• Recovery algorithm (could take 30 mins):
  • load its namespace image into memory,
  • replay its edit log, and
  • receive enough block reports from the datanodes to leave safe mode.
38. HDFS HA
• Switching namenodes could take 1-2 minutes
• The namenodes must use highly available shared storage to share the edit log.
• Datanodes must send block reports to both namenodes, because the block mappings are stored in a namenode’s memory, not on disk.
• Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.
42. Moving large datasets to HDFS
• Apache Flume
  • Moves large quantities of streaming data into HDFS, e.g. collecting log data from a bank of web servers and aggregating it in HDFS for later analysis.
  • Supports tail, syslog, and Apache log4j sources
• Apache Sqoop
  • Designed for performing bulk imports of data into HDFS from structured data stores, such as relational databases.
  • An example of a Sqoop use case is an organization that runs a nightly Sqoop import to load the day’s data from a production database into a Hive data warehouse for analysis.
43. Parallel Copying with distcp
• % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
• Will create the foo directory inside bar on namenode2
• distcp runs map tasks only, no reducers; pass the -m option to set the number of map tasks
• % hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo
• % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
44. Balancer
• Only one balancer can run at a time
• Utilization is usage over total capacity
• Goal: the utilization of every datanode differs from the utilization of the cluster by no more than THRESHOLD_VALUE
• Calling the balancer:
  % start-balancer.sh [THRESHOLD_VALUE]  (optional, default is 10%)
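The balancer's stopping criterion can be sketched directly from that definition. A minimal sketch (datanode figures below are invented) that flags nodes whose utilization deviates from the cluster's by more than the threshold:

```python
# Sketch of the balancer's criterion: each datanode's utilization
# (used / capacity) should be within `threshold` of the cluster-wide
# utilization (total used / total capacity).
def over_threshold(datanodes, threshold=0.10):
    """Return names of datanodes whose utilization deviates from the
    cluster utilization by more than `threshold`."""
    total_used = sum(used for used, _ in datanodes.values())
    total_cap = sum(cap for _, cap in datanodes.values())
    cluster_util = total_used / total_cap
    return [name for name, (used, cap) in datanodes.items()
            if abs(used / cap - cluster_util) > threshold]

# Invented example: units are arbitrary (say, TB used / TB capacity).
nodes = {"dn-a": (95, 100), "dn-b": (60, 100),
         "dn-c": (60, 100), "dn-d": (60, 100)}
# Cluster utilization is 275/400 = 0.6875; dn-a sits at 0.95.
print(over_threshold(nodes))  # ['dn-a'] - the balancer would move its blocks
```

The real balancer then moves blocks off over-utilized nodes (and onto under-utilized ones) until every node passes this check.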
45. Hadoop Archives (HAR)
• HDFS stores small files inefficiently.
• Note: small files do not take up any more disk space than is required to store the raw contents of the file.
• A 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
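That distinction is worth making concrete: the cost of small files is not disk space but namenode memory, since every file and block is an object held in the namenode's RAM (commonly estimated at roughly 150 bytes per object; that figure is a rule of thumb, not from the slides). A small sketch:

```python
# HDFS does not pad the last (or only) block of a file to the block
# size, so disk usage equals content size; but each file still costs
# the namenode at least one block object in RAM.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def disk_usage(file_size: int) -> int:
    # Actual bytes on datanode disks: just the file's content.
    return file_size

def block_count(file_size: int) -> int:
    # Blocks the file occupies: full blocks plus a partial remainder.
    return -(-file_size // BLOCK_SIZE)  # ceiling division

one_mb = 1 * 1024 * 1024
print(disk_usage(one_mb))   # 1048576 -> 1 MB on disk, not 128 MB
print(block_count(one_mb))  # 1 -> but still one block object in RAM
```

So a million 1 MB files cost the same disk space as eight 128 GB files, yet roughly a million times more namenode block objects, which is the inefficiency HAR addresses.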
46.
• The archiver tool is a MapReduce job
• A HAR is a directory, not a single file
• % hadoop archive -archiveName files.har /my/files /my