This document provides an introduction to big data and Hadoop. It discusses the growth of data from 2006 to 2020. It then introduces key concepts of Hadoop including HDFS, MapReduce, and the Hadoop ecosystem. It describes how HDFS stores and processes large datasets in a distributed manner through block storage on datanodes and metadata management by the namenode. MapReduce provides a programming model for distributed processing of large datasets across clusters. The document also discusses challenges of hardware failures and solutions in Hadoop like HDFS high availability and federation.
3. Explicit attempt of self promotion
• 23 years old
• Suleyman Demirel University, BSc in CS 2012
• Chocomart.kz - Apr 2013 - present
• Product Manager at Twitter, Dec 2012 - May 2013 (Twitter API & Android app)
• SE at Twitter, Sep 2011 - Dec 2012 (Search, Relevance and Machine Learning dept.)
• Sold diploma thesis (CV algorithm) to Microsoft, used in Bing Maps
• Sold image processing algorithm (better pattern recognition of objects) to Microsoft Research
• Scalable Machine Learning algorithms are my passion
6. Numbers Everyone Should Know
• L1 cache reference 0.5 ns
• Branch mispredict 5 ns
• L2 cache reference 7 ns
• Mutex lock/unlock 100 ns
• Main memory reference 100 ns
• Compress 1K bytes with Zippy 10,000 ns
• Send 2K bytes over 1 Gbps network 20,000 ns
• Read 1 MB sequentially from memory 0.25 ms
7. Numbers Everyone Should Know, part 2
• Round trip within same datacenter 0.5 ms
• Disk seek 10 ms
• Read 1 MB sequentially from network 10 ms
• Read 1 MB sequentially from disk 30 ms
• Send packet CA->Netherlands->CA 150 ms
• Send package via Kazpost - everlasting
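The gap between these numbers is the whole argument for minimizing disk seeks and network hops. A minimal sketch (figures taken from the two slides above, expressed in nanoseconds) that compares each operation against an L1 cache hit:

```python
# Latency figures from the slides, normalized to nanoseconds, so the
# relative cost of each operation can be compared directly.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "branch mispredict": 5,
    "L2 cache reference": 7,
    "mutex lock/unlock": 100,
    "main memory reference": 100,
    "compress 1 KB with Zippy": 10_000,
    "send 2 KB over 1 Gbps network": 20_000,
    "read 1 MB sequentially from memory": 250_000,
    "round trip within datacenter": 500_000,
    "disk seek": 10_000_000,
    "read 1 MB sequentially from disk": 30_000_000,
    "packet CA->Netherlands->CA": 150_000_000,
}

def relative_cost(op: str) -> float:
    """How many L1 cache references fit in one such operation."""
    return LATENCY_NS[op] / LATENCY_NS["L1 cache reference"]

print(relative_cost("disk seek"))  # 20000000.0 - 20 million L1 hits per seek
```

One disk seek costs as much as twenty million L1 cache references, which is why the designs below go to such lengths to read data sequentially.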
9. Problem statement
• Tons of data
• F*cking tons of data
• We need to process it
• Process it fast
• Idea is to “parallelize” processing of data
10. The “Joys” of Real Hardware
• ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
• ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
• ~1 network rewiring (rolling ~5% of machines down over 2-day span)
11.
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
• ~5 racks go wonky (40-80 machines see 50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~12 router reloads (takes out DNS and external VIPs for a couple minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for DNS
• ~1000 individual machine failures
• ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky machines, etc.
12. Problem statement (2)
• A lot of data
• Fast processing
• Reliable
• “Cheap”
• Scale
• Shouldn’t require much manual work
• Should work with many programming languages and platforms
13.
• Google File System (GFS)
  • Distributed filesystem
  • Fault tolerant
• MapReduce
  • Distributed processing framework
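The MapReduce programming model can be sketched in a few lines. This is a toy, single-process illustration (word count over a couple of strings), not the distributed framework: map emits (key, value) pairs, the framework groups pairs by key (the shuffle), and reduce folds each group.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

# Toy single-process sketch of the MapReduce model. In a real cluster
# the three phases below run on many machines; the programming model
# the user sees is exactly this.

def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    # Map: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs: Iterable[Tuple[str, int]]) -> dict:
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups: dict) -> dict:
    # Reduce: fold each group of values into a single result.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big clusters"])))
print(counts)  # {'big': 2, 'data': 1, 'clusters': 1}
```

The appeal of the model is that map and reduce are pure functions over key-value pairs, so the framework is free to parallelize, retry, and relocate them across the cluster.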
14. Apache Hadoop
• “Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project”
28. The Design of HDFS
• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
29. HDFS is not a good fit for:
• Low-latency data access (use HBase instead)
• Lots of small files
• Multiple writers, arbitrary file modifications
30. HDFS Concepts
• Blocks
  • Size on a “normal” filesystem: 512 bytes
  • Size in HDFS: 64 MB
  • A file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage
31. Why is the block size so large?
• Disk seek time is 10 ms
• Transfer rate is 100 MB/s
• Goal is to make the seek time 1% of the transfer time
• So we need a block size of around 100 MB
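The back-of-the-envelope calculation above can be written out directly. Assuming the slide's figures (10 ms seek, 100 MB/s transfer), we want the transfer to take 100x as long as the seek:

```python
# Block size so that one seek costs ~1% of the transfer time.
seek_time_s = 0.010            # 10 ms disk seek (from the slide)
transfer_rate = 100 * 10**6    # 100 MB/s sustained transfer rate

# seek = 1% of transfer  =>  transfer time = 100 * seek time
transfer_time_s = seek_time_s * 100
block_size_bytes = transfer_rate * transfer_time_s

print(block_size_bytes / 10**6)  # 100.0 -> roughly 100 MB per block
```

Hence the HDFS defaults of 64 MB (and later 128 MB): large enough that sequential transfer, not seeking, dominates I/O time.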
32. Why blocks?
• A file can be larger than any single disk in the network
• Making the unit of abstraction a block rather than a file simplifies the storage subsystem
• Blocks have a fixed size, so it is easy to calculate how many can be stored on a disk
• Blocks carry no per-file metadata (permissions, creation time, owner, etc. are stored with the file, not with each block)
33. Namenodes and Datanodes
• Namenode = master
  • Manages the filesystem namespace (filesystem tree, metadata for dirs and files)
  • Namespace image and edit log are stored persistently on disk
  • Knows on which datanodes the blocks of a given file are stored (kept in RAM)
• Datanodes = workers (slaves)
  • Store and retrieve blocks when needed
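The split between the two kinds of namenode state can be sketched as a toy class (all names here are invented for illustration): the namespace (file -> blocks) is persisted via the image and edit log, while the block map (block -> datanodes) lives only in RAM and is rebuilt from datanode block reports.

```python
# Toy sketch (hypothetical names) of the namenode's two core maps.
class ToyNamenode:
    def __init__(self):
        self.namespace = {}   # file path -> [block ids]; persisted on disk
        self.block_map = {}   # block id -> [datanode ids]; RAM only,
                              # rebuilt from block reports after restart

    def add_file(self, path, block_ids):
        # Namespace change: in real HDFS this is logged to the edit log.
        self.namespace[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        # A datanode tells the namenode which blocks it holds.
        for b in block_ids:
            self.block_map.setdefault(b, []).append(datanode)

    def locate(self, path):
        """Which datanodes hold each block of the file."""
        return [(b, self.block_map.get(b, [])) for b in self.namespace[path]]

nn = ToyNamenode()
nn.add_file("/logs/day1", ["blk_1", "blk_2"])
nn.block_report("dn-a", ["blk_1"])
nn.block_report("dn-b", ["blk_1", "blk_2"])
print(nn.locate("/logs/day1"))
# [('blk_1', ['dn-a', 'dn-b']), ('blk_2', ['dn-b'])]
```

This is also why the recovery and HA slides below stress block reports: the block-to-datanode mapping is never written to disk, so a restarted namenode must hear from the datanodes before it can serve reads.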
35. Solutions
• Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems
• Secondary namenode: its main role is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large.
36. HDFS Federation (since 2.x)
• Allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace, e.g.:
  • /user
  • /share
• Each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace.
• Namespace volumes are independent of each other.
• Block pool storage is not partitioned: datanodes register with each namenode in the cluster and store blocks from multiple block pools.
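The client-side view of federation amounts to routing a path to the namenode that owns its portion of the namespace. A hypothetical mount-table sketch (names invented; real HDFS does this with viewfs configuration, not application code):

```python
# Hypothetical federation routing: each namenode owns a portion of the
# namespace (a namespace volume); a mount table picks the namenode
# responsible for a given path.
MOUNT_TABLE = {
    "/user":  "namenode1",
    "/share": "namenode2",
}

def namenode_for(path: str) -> str:
    # Route a path to the namenode whose namespace volume contains it.
    for prefix, nn in MOUNT_TABLE.items():
        if path == prefix or path.startswith(prefix + "/"):
            return nn
    raise KeyError(f"no namenode manages {path}")

print(namenode_for("/user/alice/data"))  # namenode1
print(namenode_for("/share/tools"))      # namenode2
```

Note that only the *namespace* is partitioned this way; the datanodes underneath serve blocks for every namenode in the cluster.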
37. HDFS High-Availability (since 2.x)
• The namenode is still a SPOF (Single Point of Failure)
  • If it fails, no MapReduce jobs can run, and files cannot be read, written, or listed
• Recovery algorithm (could take 30 mins):
  • load its namespace image into memory,
  • replay its edit log, and
  • receive enough block reports from the datanodes to leave safe mode.
38. HDFS HA
• Switching namenodes could take 1-2 minutes
• The namenodes must use highly available shared storage to share the edit log.
• Datanodes must send block reports to both namenodes, because the block mappings are stored in a namenode’s memory, not on disk.
• Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.
42. Moving large datasets to HDFS
• Apache Flume
  • Moves large quantities of streaming data into HDFS, e.g. collecting log data from a bank of web servers and aggregating it in HDFS for later analysis.
  • Supports tail, syslog, and Apache log4j sources
• Apache Sqoop
  • Designed for performing bulk imports of data into HDFS from structured data stores, such as relational databases.
  • An example of a Sqoop use case is an organization that runs a nightly Sqoop import to load the day’s data from a production database into a Hive data warehouse for analysis.
43. Parallel Copying with distcp
• % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
• Will create the foo directory inside bar on namenode2
• distcp runs map tasks only, no reducers; pass the -m option to set the number of map tasks
• % hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo
• % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
44. Balancer
• Only one balancer can run at a time
• Utilization is usage over total capacity
• Goal: the utilization of every datanode differs from the utilization of the cluster by no more than THRESHOLD_VALUE
• Calling the balancer:
  % start-balancer.sh [THRESHOLD_VALUE]  (optional, default is 10%)
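The balancer's stopping criterion can be sketched directly from that definition. A minimal sketch (datanode figures below are invented) that flags nodes whose utilization deviates from the cluster's by more than the threshold:

```python
# Sketch of the balancer's criterion: each datanode's utilization
# (used / capacity) should be within `threshold` of the cluster-wide
# utilization (total used / total capacity).
def over_threshold(datanodes, threshold=0.10):
    """Return names of datanodes whose utilization deviates from the
    cluster utilization by more than `threshold`."""
    total_used = sum(used for used, _ in datanodes.values())
    total_cap = sum(cap for _, cap in datanodes.values())
    cluster_util = total_used / total_cap
    return [name for name, (used, cap) in datanodes.items()
            if abs(used / cap - cluster_util) > threshold]

# Invented example: units are arbitrary (say, TB used / TB capacity).
nodes = {"dn-a": (95, 100), "dn-b": (60, 100),
         "dn-c": (60, 100), "dn-d": (60, 100)}
# Cluster utilization is 275/400 = 0.6875; dn-a sits at 0.95.
print(over_threshold(nodes))  # ['dn-a'] - the balancer would move its blocks
```

The real balancer then moves blocks off over-utilized nodes (and onto under-utilized ones) until every node passes this check.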
45. Hadoop Archives (HAR)
• HDFS stores small files inefficiently.
• Note: small files do not take up any more disk space than is required to store the raw contents of the file.
• A 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
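That distinction is worth making concrete: the cost of small files is not disk space but namenode memory, since every file and block is an object held in the namenode's RAM (commonly estimated at roughly 150 bytes per object; that figure is a rule of thumb, not from the slides). A small sketch:

```python
# HDFS does not pad the last (or only) block of a file to the block
# size, so disk usage equals content size; but each file still costs
# the namenode at least one block object in RAM.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def disk_usage(file_size: int) -> int:
    # Actual bytes on datanode disks: just the file's content.
    return file_size

def block_count(file_size: int) -> int:
    # Blocks the file occupies: full blocks plus a partial remainder.
    return -(-file_size // BLOCK_SIZE)  # ceiling division

one_mb = 1 * 1024 * 1024
print(disk_usage(one_mb))   # 1048576 -> 1 MB on disk, not 128 MB
print(block_count(one_mb))  # 1 -> but still one block object in RAM
```

So a million 1 MB files cost the same disk space as eight 128 GB files, yet roughly a million times more namenode block objects, which is the inefficiency HAR addresses.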
46.
• The archiver tool is a MapReduce job
• A HAR is a directory, not a single file
• % hadoop archive -archiveName files.har /my/files /my