4. Search, collect, store, analyze, understand.
Finally, answer queries!
Problem:
THIS IS NOT STRUCTURED DATA!
Too large to store on a single machine!
Too long to process serially!
Solution:
Use multiple machines.
5. Single machines tend to fail:
hard disk, power supply, ...
Requirements:
● Built-in backup.
● Built-in failover.
And more machines mean an increased probability of failure.
6. The typical developer:
has never dealt with large (petabyte-scale) amounts of data,
has no thorough understanding of parallel programming,
and has no time to make software production-ready.
Requirements
● Built-in backup.
● Built-in failover.
● Easy to use.
● Parallel on rails.
7. HADOOP IS THE SOLUTION:
● Built-in backup.
● Built-in failover.
● Easy to administrate.
● Single system.
● Easy to use.
● Parallel on rails.
● Active development.
9. Hadoop is a large-scale, open source software framework:
dedicated to scalable, distributed, data-intensive computing;
handles thousands of nodes and petabytes of data;
supports applications under a free license.
Hadoop's main subprojects:
1. HDFS: the Hadoop Distributed File System, with high-throughput access to application data
2. MapReduce: a software framework for distributed processing of large data sets on computer clusters
10. HDFS – a distributed file system
NameNode – namespace and block management
DataNodes – block replica container
MapReduce – a framework for distributed computations
JobTracker – job scheduling, resource management, lifecycle coordination
TaskTracker – task execution module
11. Hadoop: Architecture Principles
Linear scalability: more nodes can do more work within the same time
1. Linear on data size
2. Linear on compute resources
Move computation to data
1. Minimize expensive data transfers
2. Data are large, programs are small
Reliability and availability: failures are common
Simple computational model
1. Hides complexity in an efficient execution framework
2. Sequential data processing (avoid random reads)
12. A reliable, scalable, high-performance distributed computing system
The Hadoop Distributed File System (HDFS):
1. Reliable storage layer
2. With more sophisticated layers on top
MapReduce – distributed computation framework
Hadoop scales computation capacity, storage capacity, and I/O bandwidth by adding commodity servers.
Divide-and-conquer using lots of commodity hardware
13. HDFS – a distributed file system
NameNode – namespace and block management
DataNodes – block replica container
MapReduce – a framework for distributed computations
JobTracker – job scheduling, resource management, lifecycle coordination
TaskTracker – task execution module
14. The name space is a hierarchy of files and directories
Files are divided into blocks (typically 128 MB)
Namespace (metadata) is decoupled from data:
1. Lots of fast namespace operations, not slowed down by
2. Data streaming
Single NameNode keeps the entire name space in RAM
DataNodes store block replicas
Blocks are replicated on 3 DataNodes for redundancy and availability
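As a back-of-the-envelope sketch of the block/replica arithmetic above (the helper `block_layout` is invented for illustration; 128 MB and replication factor 3 are the defaults assumed here):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size (128 MB)
REPLICATION = 3                 # default replication factor

def block_layout(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (num_blocks, total_bytes_stored) for a file of the given size."""
    num_blocks = max(1, math.ceil(file_size_bytes / block_size))
    # Raw storage cost is the logical size times the replication factor.
    return num_blocks, file_size_bytes * replication

# A 1 GB file: 8 blocks of 128 MB, and 3 GB of raw storage across DataNodes.
blocks, raw = block_layout(1024 * 1024 * 1024)
```

This also shows why the NameNode favors large blocks: fewer blocks per file means fewer metadata entries in its RAM-resident namespace.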
15. The durability of the name space is maintained by a write-ahead journal and checkpoints
1. Journal transactions are persisted into the edits file before replying to the client
2. Checkpoints are periodically written to the fsimage file, handled by the Checkpointer, SecondaryNameNode, or BackupNode
3. Block locations are discovered from DataNodes during startup via block reports; they are not persisted on the NameNode
Multiple storage directories:
two on local drives, and one on a remote server, typically NFS
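The journal-before-ack ordering can be sketched with a toy model (the class `MiniNameNode` and its in-memory stand-ins for the edits and fsimage files are invented for illustration, not Hadoop code):

```python
class MiniNameNode:
    """Toy model of NameNode durability: journal first, then apply, then ack."""

    def __init__(self):
        self.namespace = {}  # in-RAM namespace (path -> metadata)
        self.edits = []      # stand-in for the on-disk edits journal
        self.fsimage = {}    # stand-in for the last checkpoint file

    def mkdir(self, path):
        # 1. Persist the transaction to the journal BEFORE replying to the client.
        self.edits.append(("mkdir", path))
        # 2. Apply it to the in-memory namespace.
        self.namespace[path] = {"type": "dir"}
        return "ok"  # only now is the client acknowledged

    def checkpoint(self):
        # Done periodically by the Checkpointer / SecondaryNameNode:
        # write a new fsimage and truncate the edits journal.
        self.fsimage = dict(self.namespace)
        self.edits.clear()

nn = MiniNameNode()
nn.mkdir("/user")
nn.checkpoint()
```

After a crash, replaying the surviving edits on top of the last fsimage reconstructs the namespace exactly.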
16. DataNodes register with the NameNode, and provide periodic block reports that list the block replicas on hand
A block report contains the block id, generation stamp, and length for each replica
DataNodes send heartbeats to the NameNode to confirm they are alive: every 3 seconds
If no heartbeat is received during a 10-minute interval, the node is presumed lost, and the replicas hosted by that node unavailable
The NameNode then schedules re-replication of the lost replicas
Heartbeat responses give instructions for managing replicas:
1. Replicate blocks to other nodes
2. Remove local block replicas
3. Re-register or shut down the node
4. Send an urgent block report
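A minimal sketch of the liveness rule above, assuming the 3-second heartbeat and 10-minute timeout (`find_dead_nodes` is a hypothetical helper, not a Hadoop API):

```python
HEARTBEAT_INTERVAL = 3        # DataNodes heartbeat every 3 seconds
DEAD_NODE_TIMEOUT = 10 * 60   # presumed lost after 10 minutes of silence

def find_dead_nodes(last_heartbeat, now, timeout=DEAD_NODE_TIMEOUT):
    """Return the DataNodes whose last heartbeat is older than the timeout.

    last_heartbeat: dict mapping node name -> timestamp of last heartbeat.
    """
    return sorted(dn for dn, t in last_heartbeat.items() if now - t > timeout)

# dn2 last reported 660 s ago -> presumed lost; its replicas get re-replicated.
dead = find_dead_nodes({"dn1": 1000, "dn2": 400}, now=1060)
```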
17. Supports conventional file system operations:
1. Create, read, write, delete files
2. Create, delete directories
3. Rename files and directories
Permissions
Modification and access times for files
Quotas: 1) namespace, 2) disk space
Per-file, set when created:
Replication factor – can be changed later
Block size – cannot be reset after creation
Block replica locations are exposed to external applications
18. Reading a block:
1. Open a socket stream to the chosen DataNode and read bytes from the stream
2. On failure, add the node to the dead DataNodes list and choose the next DataNode in the list
3. Retry 2 times
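The retry loop above might look roughly like this sketch (`read_block` and `fetch` are hypothetical stand-ins, not HDFS client APIs; the node names are made up):

```python
def read_block(replica_locations, fetch, max_retries=2):
    """Try DataNodes in order, skipping known-dead ones; retry the pass twice."""
    dead = set()
    for _ in range(max_retries + 1):      # initial attempt + 2 retries
        for dn in replica_locations:
            if dn in dead:
                continue
            try:
                return fetch(dn)          # open socket stream, read bytes
            except IOError:
                dead.add(dn)              # add to dead DataNodes, try next
    raise IOError("all replicas failed")

def fetch(dn):
    """Stub transport: dn-a is down, dn-b serves the block."""
    if dn != "dn-b":
        raise IOError("connection refused")
    return b"block-bytes"

data = read_block(["dn-a", "dn-b"], fetch)
```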
19. MapReduce schedules a task assigned to process block B to a DataNode serving a
replica of B
Local access to data
20. HDFS implements a single-writer, multiple-reader model.
An HDFS client maintains a lease on files it has opened for writing
Only one client can hold a lease on a single file
The client periodically renews the lease by sending heartbeats to the NameNode
Lease expiration:
Until the soft limit expires, the client has exclusive access to the file
After the soft limit (10 min): any client can reclaim the lease
After the hard limit (1 hour): the NameNode forcibly closes the file and revokes the lease
A writer's lease does not prevent other clients from reading the file
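The soft/hard limit logic can be sketched as follows (`lease_state` is an illustrative helper, not part of HDFS):

```python
SOFT_LIMIT = 10 * 60   # seconds: 10 minutes
HARD_LIMIT = 60 * 60   # seconds: 1 hour

def lease_state(elapsed):
    """Classify a write lease by seconds since the holder last renewed it."""
    if elapsed < SOFT_LIMIT:
        return "exclusive"      # only the holder may write
    if elapsed < HARD_LIMIT:
        return "reclaimable"    # any client may reclaim the lease
    return "revoked"            # NameNode force-closes the file

# 1 minute, 15 minutes, and 2 hours since the last renewal:
states = [lease_state(t) for t in (60, 15 * 60, 2 * 60 * 60)]
```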
21. The original implementation supported write-once semantics:
after the file is closed, the bytes written cannot be altered or removed
Now files can be modified by reopening them for append
Block modifications during appends use the copy-on-write technique:
the last block is copied into a temp location and modified;
when “full”, it is copied into its permanent location
HDFS provides consistent visibility of data for readers before the file is closed:
the hflush operation provides the visibility guarantee
On hflush, the current packet is immediately pushed to the pipeline
hflush waits until all DataNodes successfully receive the packet
hsync also guarantees the data is persisted to local disks on the DataNodes
23. Cluster topology:
hierarchical grouping of nodes according to network distance
The default block placement policy is a tradeoff between minimizing the write cost
and maximizing data reliability, availability, and aggregate read bandwidth:
1. First replica on the node local to the writer
2. Second and third replicas on two different nodes in a different rack
3. The rest are placed on random nodes, with restrictions:
no more than one replica of the same block is placed at one node, and
no more than two replicas are placed in the same rack (if there are enough racks)
HDFS provides a configurable block placement policy interface
(experimental)
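A simplified sketch of the default policy for the first three replicas, assuming a toy rack-to-nodes topology map (`place_replicas` is hypothetical; the real policy also weighs node load and free space):

```python
import random

def place_replicas(writer_node, topology, seed=0):
    """Pick 3 replica nodes per the default block placement policy.

    topology: dict mapping rack name -> list of node names;
    writer_node must appear somewhere in the topology.
    """
    rng = random.Random(seed)
    writer_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    # 1. First replica on the writer's own node (cheapest possible write).
    replicas = [writer_node]
    # 2+3. Two different nodes in one remote rack (survives a rack failure).
    remote_rack = rng.choice([r for r in topology if r != writer_rack])
    replicas += rng.sample(topology[remote_rack], 2)
    return replicas

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
replicas = place_replicas("n1", topology)
```

Note the tradeoff the slide describes: one local copy keeps writes cheap, while the remote-rack pair keeps the block available through a whole-rack outage.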
24. Namespace ID
a unique cluster id common to all cluster components (NN and DNs)
assigned to the file system at format time
prevents DNs from other clusters from joining this cluster
Storage ID
a DataNode persistently stores its unique (per cluster) Storage ID
makes the DN recognizable even if it is restarted with a different IP address or port
assigned to the DataNode when it registers with the NameNode for the first time
Software Version (build version)
different software versions of NN and DN are incompatible
starting from Hadoop-2: version compatibility for rolling upgrades
Data integrity via block checksums
25. NameNode startup:
1. Read image, replay journal, write new image and empty journal
2. Enter SafeMode
DataNode startup:
1. Handshake: check Namespace ID and Software Version
2. Registration: NameNode records the Storage ID and address of the DN
3. Send initial block report
SafeMode – a read-only mode for the NameNode:
1. Disallows modifications of the namespace
2. Disallows block replication or deletion
3. Prevents unnecessary block replications until the majority of blocks are reported as minimally replicated
4. SafeMode threshold, extension
5. Manual SafeMode during unforeseen circumstances
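The SafeMode exit condition can be sketched as a reported-block threshold check (the 0.999 default mirrors the `dfs.namenode.safemode.threshold-pct` configuration key; the helper itself is invented):

```python
def can_leave_safemode(reported, total, threshold=0.999):
    """NameNode stays read-only until enough blocks are minimally replicated.

    reported: blocks confirmed via DataNode block reports.
    total: blocks known from the namespace image.
    """
    return total == 0 or reported / total >= threshold

# 998 of 1000 blocks reported: still read-only; all 1000 reported: may leave.
still_safe = not can_leave_safemode(998, 1000)
```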
26. Ensure that each block always has the intended number of replicas
Conform with the block placement policy (BPP)
Replication is updated when a block report is received
Over-replicated blocks: choose a replica to remove
1. Balance storage utilization across nodes without reducing the block’s availability
2. Try not to reduce the number of racks that host replicas
3. Remove the replica from the DataNode with the longest heartbeat or the least available space
Under-replicated blocks: place into the replication priority queue
1. Fewer replicas means higher in the queue
2. Minimize the cost of replica creation, but concur with the BPP
Mis-replicated blocks: not conforming to the BPP
Missing block: nothing to do
Corrupt block: try to replicate good replicas, keep corrupt replicas as is
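The priority rule for under-replicated blocks can be sketched like this (illustrative only; the tuple layout is assumed for the example):

```python
def replication_queue(blocks):
    """Order under-replicated blocks: fewer live replicas = higher priority.

    blocks: list of (block_id, live_replicas, expected_replicas) tuples.
    """
    under = [(live, block_id) for block_id, live, expected in blocks
             if live < expected]
    # Sort ascending by live replica count: closest-to-lost blocks come first.
    return [block_id for live, block_id in sorted(under)]

# blk2 has only one live replica left, so it is re-replicated before blk1;
# blk3 is fully replicated and stays out of the queue.
order = replication_queue([("blk1", 2, 3), ("blk2", 1, 3), ("blk3", 3, 3)])
```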
27. Snapshots prevent data corruption/loss during software upgrades
and allow rollback if a software upgrade goes bad
NameNode image snapshot:
1. Start hadoop namenode -upgrade
2. Read the checkpoint image and journal based on the persisted Layout Version (new software can always read old layouts)
3. Rename the current storage directory to previous
4. Save the image into a new current
28. • One master, many workers
– Input data split into M map tasks (typically 64 MB in size)
– Reduce phase partitioned into R reduce tasks
– Tasks are assigned to workers dynamically
– Often: M=200,000; R=4,000; workers=2,000
• Master assigns each map task to a free worker
– Considers locality of data to worker when assigning task
– Worker reads task input (often from local disk!)
– Worker produces R local files containing intermediate k/v pairs
• Master assigns each reduce task to a free worker
– Worker reads intermediate k/v pairs from map workers
– Worker sorts & applies user’s Reduce op to produce the output
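Routing intermediate key/value pairs into the R local files is typically done with a deterministic partition function, hash(key) mod R; a toy stand-in:

```python
def partition(key, R):
    """Route an intermediate key to one of R reduce tasks (hash mod R)."""
    # Stable stand-in hash: Python's built-in hash() is salted per process,
    # so a real system would use a fixed hash such as MurmurHash instead.
    return sum(key.encode()) % R

R = 4
parts = {k: partition(k, R) for k in ("to", "be", "or", "not")}
```

Because the function is deterministic, every map worker sends all pairs for a given key to the same reduce task, which is what makes the per-key grouping in the reduce phase possible.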
29. On worker failure:
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks
• Re-execute in-progress reduce tasks
• Task completion is committed through the master
On master failure:
• State is checkpointed to GFS: a new master recovers and continues
Very robust: once lost 1600 of 1800 machines, but finished fine
30. HDFS – a distributed file system
NameNode – namespace and block management
DataNodes – block replica container
MapReduce – a framework for distributed computations
JobTracker – job scheduling, resource management, lifecycle coordination
TaskTracker – task execution module
31. Example: Word Frequencies in Web Pages
A typical exercise for a new engineer in his or her first week
• Input is files with one document per record
• Specify a map function that takes a key/value pair
key = document URL
value = document contents
• Output of map function is (potentially many) key/value pairs.
In our case, output (word, “1”) once per word in the document
Example: key = “document1”, value = “to be or not to be”
MAP output:
(“to”,“1”) (“be”,“1”) (“or”,“1”) (“not”,“1”) (“to”,“1”) (“be”,“1”)
SHUFFLE/SORT – the MapReduce library gathers together all pairs with the same key:
(“be”, [“1”,“1”]) (“not”, [“1”]) (“or”, [“1”]) (“to”, [“1”,“1”])
REDUCE – the reduce function combines the values for a key; in our case, it computes the sum:
(“be”,“2”) (“not”,“1”) (“or”,“1”) (“to”,“2”)
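The whole pipeline can be simulated in a few lines of Python (a sequential sketch of map, shuffle/sort, and reduce, not the distributed Hadoop API):

```python
from collections import defaultdict

def map_fn(url, contents):
    """Map: emit (word, 1) once per word in the document."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the counts for one word."""
    return word, sum(counts)

def mapreduce(documents, map_fn, reduce_fn):
    # Shuffle/sort: the library groups all intermediate pairs by key.
    groups = defaultdict(list)
    for url, contents in documents.items():
        for key, value in map_fn(url, contents):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = mapreduce({"document1": "to be or not to be"}, map_fn, reduce_fn)
# counts == {"be": 2, "not": 1, "or": 1, "to": 2}
```

In real Hadoop the map and reduce calls run on different machines and the grouping happens over the network, but the user-visible contract is exactly this.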
32. Recommendation systems
Financial analysis
Natural Language Processing (NLP)
Image/video processing
Log analysis
Data warehousing
Correlation engines
Market research/forecasting
33. Finance
Health & Life Sciences
Academic research
Government
Social networking
Telecommunications