In this session you will learn:
1. History of Hadoop
2. Hadoop Ecosystem
3. Hadoop Animal Planet
4. What is Hadoop?
5. Distinctions of Hadoop
6. Hadoop Components
7. The Hadoop Distributed Filesystem
8. Design of HDFS
9. When Not to use Hadoop?
10. HDFS Concepts
11. Anatomy of a File Read
12. Anatomy of a File Write
13. Replication & Rack awareness
14. MapReduce Components
15. Typical MapReduce Job
Hadoop Ecosystem
• Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable and reliable data storage, designed to span large
clusters of commodity servers. HDFS, MapReduce, and YARN form the core of
Apache Hadoop.
• MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
• Apache Hadoop YARN (short, in self-deprecating fashion, for Yet Another
Resource Negotiator) is a cluster management technology and one of the key
features of second-generation Hadoop.
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis. While initially developed
by Facebook, Apache Hive is now used and developed by other companies
such as Netflix.
• Apache Pig is a platform for analyzing large data sets that consists of a
high-level language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs.
• Apache HBase is an open-source, distributed, versioned, non-relational
database modeled after Google's Bigtable: A Distributed Storage System for
Structured Data by Chang et al.
• ZooKeeper is a centralized service for maintaining configuration information,
naming, providing distributed synchronization, and providing group services.
All of these kinds of services are used in some form or another by distributed
applications.
• Apache Sqoop is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational
databases. Sqoop imports data from external structured datastores into
HDFS or related systems like Hive and HBase.
• Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and
flexible architecture based on streaming data flows. It is robust and fault
tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms.
• Machine learning is a type of artificial intelligence (AI) that provides
computers with the ability to learn without being explicitly programmed.
Machine learning focuses on the development of computer programs that
can teach themselves to grow and change when exposed to new data.
• R is a programming language and software environment for statistical
computing and graphics. The R language is widely used among statisticians
and data miners for developing statistical software and data analysis.
• Cloudera Impala is Cloudera's open source massively parallel processing
(MPP) SQL query engine for data stored in a computer cluster
running Apache Hadoop.
• MongoDB is a document database that provides high performance, high
availability, and easy scalability. Documents (objects) map nicely to
programming-language data types, embedded documents and arrays reduce the
need for joins, and its dynamic schema makes polymorphism easier.
• Apache CouchDB, commonly referred to as CouchDB, is an open
source database that focuses on ease of use and on being "a database that
completely embraces the web". It is a document-oriented
NoSQL database that uses JSON to store data, JavaScript as its query
language (using MapReduce), and HTTP for its API.
• Cascading is a software abstraction layer for Apache Hadoop. Cascading is
used to create and execute complex data processing workflows on a Hadoop
cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the
underlying complexity of MapReduce jobs.
What is Hadoop?
• The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models.
• It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
• Rather than relying on hardware to deliver high availability, the
library itself is designed to detect and handle failures at the application layer,
thus delivering a highly available service on top of a cluster of computers, each
of which may be prone to failure.
Hadoop Components
MapReduc
e
HDFS
Cluster
Job
Tracker
Namenod
e
Task
Tracke
r
Task
Tracke
r
Task
Tracke
r
Data
Node
Data
Node
Data
Node
Hadoop Components:
• HDFS – Hadoop Distributed File System (storage):
• Data is split and distributed across nodes
• Each split is replicated (see the sketch after this list)
• NameNode is the master & DataNodes are the slaves
• MapReduce (processing):
• Splits a task across processors
• Execution is near the data, and the results are merged
• Self-healing
• JobTracker is the master & TaskTrackers are the slaves
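As an illustration of per-file replication, here is a minimal sketch using the Hadoop Java FileSystem API to set and inspect a file's replication factor. This is an assumption-laden example, not part of the original slides; the NameNode URI and the path /user/demo/data.txt are hypothetical placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; substitute your cluster's URI.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    Path file = new Path("/user/demo/data.txt");  // hypothetical file
    fs.setReplication(file, (short) 3);           // request 3 replicas per block
    short actual = fs.getFileStatus(file).getReplication();
    System.out.println("Replication factor: " + actual);
    fs.close();
  }
}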
The Hadoop Distributed Filesystem:
• When a dataset is larger than the storage capacity of a single physical
machine, it becomes necessary to partition it and store it across a number
of separate machines. Filesystems that manage storage across a network of
machines are called distributed filesystems. One of the biggest challenges
for distributed filesystems is handling node (machine) failure without
suffering data loss.
• Hadoop comes with a distributed filesystem called HDFS, which
stands for Hadoop Distributed Filesystem. HDFS overcomes this
challenge.
Design Of HDFS:
• HDFS is designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
• Very Large Files: files hundreds of gigabytes (GB) or even
terabytes (TB) in size
• Streaming Data Access: HDFS is built around the idea that the most
efficient data processing pattern is a write-once, read-many-times
pattern. A dataset is typically generated or copied from source, and then
various analyses are performed on that dataset over time.
• Commodity Hardware: commonly available, inexpensive hardware
(not enterprise-grade)
When Not to Use Hadoop?
• HDFS is not a good fit:
• When you need low-latency (fast) access to data: applications that
require fast data access will not work well with HDFS, which is
optimized for high throughput of data.
• When you have lots of small files: the NameNode holds the filesystem
metadata in memory, and every block, file, and directory occupies around
150 bytes. A large number of small files therefore burdens the
NameNode; for example, ten million single-block files amount to twenty
million objects, or roughly 3 GB of NameNode memory.
• When random writes to a file are needed: files in HDFS may be
written to by a single writer, and writes are always made at the end of
the file, in append-only fashion. There is no support for multiple writers
or for modifications at arbitrary offsets (positions).
HDFS Concepts
• HDFS Block: an HDFS block is the smallest unit of data that we can read or
write in HDFS. HDFS stores files in blocks that are typically at least 64 MB or
(more commonly now) 128 MB in size, much larger than the 4-32 KB seen in
most filesystems.
• In HDFS, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk
space, not 128 MB
Why is a block in HDFS so large?
• Disk seeks are slow compared to the transfer rate (currently about 10 ms
per seek versus about 100 MB/s of transfer). Making the block large ensures
that the time spent transferring a block dominates the time spent seeking,
so seeks add little overhead. For example, to keep seek time to about 1% of
transfer time, a block should take about 1 second to transfer, i.e., be
about 100 MB. A short sketch for inspecting a file's blocks follows below.
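To make blocks concrete, the following minimal sketch (an illustration, not from the slides) lists each block of a file and the DataNodes holding its replicas, via the Java FileSystem API. The file path and NameNode URI are hypothetical.

import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log")); // hypothetical file
    // One BlockLocation per HDFS block: its offset, length, and replica hosts.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          b.getOffset(), b.getLength(), Arrays.toString(b.getHosts()));
    }
    fs.close();
  }
}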
HDFS Concepts:
• NameNode
• It is the master node & responsible for the entire cluster
• Manages the filesystem namespace
• Typically runs on reliable, enterprise-grade hardware
• DataNode
• Slaves which run on commodity/cheap hardware
• Store and retrieve data when they are told to (by client or NameNode)
• Send heartbeat signals to the NameNode, along with reports of the blocks they store
• Secondary NameNode
• A checkpoint helper for the NameNode (not a hot standby)
• It periodically merges the fsimage & edit log files
Anatomy of a File Read
1. The client opens the file it wishes to read by calling open() on
the FileSystem object, which for HDFS is an instance of DistributedFileSystem
2. DistributedFileSystem calls the namenode, using RPC, to determine the
locations of the first few blocks in the file
3. For each block, the namenode returns the addresses of the datanodes that
have a copy of that block. Furthermore, the datanodes are sorted according
to their proximity to the client. The DistributedFileSystem returns
an FSDataInputStream (an input stream that supports file seeks) to the client
for it to read data from. FSDataInputStream in turn wraps a DFSInputStream,
which manages the datanode and namenode I/O. The client then
calls read() on the stream.
4. DFSInputStream, which has stored the datanode addresses for the first
few blocks in the file, then connects to the first (closest) datanode for the
first block in the file. Data is streamed from the datanode back to the
client, which calls read() repeatedly on the stream
5. When the end of the block is reached, DFSInputStream will close the
connection to the datanode, then find the best datanode for the next
block
6. Blocks are read in order, with the DFSInputStream opening new
connections to datanodes as the client reads through the stream. It will
also call the namenode to retrieve the datanode locations for the next
batch of blocks as needed. When the client has finished reading, it
calls close() on the FSDataInputStream
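In client code, the read sequence above looks like the following minimal sketch using the Java FileSystem API; the NameNode URI and the file path are placeholders, not from the slides.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileReadDemo {
  public static void main(String[] args) throws Exception {
    // Step 1: obtain the FileSystem instance (DistributedFileSystem for HDFS)
    // and call open() on it.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
    FSDataInputStream in = fs.open(new Path("/user/demo/input.txt")); // hypothetical path
    try {
      // Steps 2-5 (namenode RPC, choosing datanodes, streaming block after
      // block) happen inside the stream; the client simply calls read().
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      in.close(); // Step 6: close() on the FSDataInputStream.
      fs.close();
    }
  }
}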
Anatomy of a File Write
1. The client creates the file by calling create() on DistributedFileSystem
2. DistributedFileSystem makes an RPC call to the namenode to create a new
file in the filesystem's namespace, with no blocks associated with it. The
namenode checks the client's access permissions and whether the file already
exists (if it does, an IOException is thrown). The DistributedFileSystem
returns an FSDataOutputStream for the client to start writing data to.
FSDataOutputStream wraps a DFSOutputStream, which handles
communication with the datanodes and namenode.
3. As the client writes data, DFSOutputStream splits it into packets,
which it writes to an internal queue, called the data queue. The data queue is
consumed by the DataStreamer, which is responsible for asking the
namenode to allocate new blocks by picking a list of suitable datanodes to
store the replicas. The list of datanodes forms a pipeline
4. The DataStreamer streams the packets to the first datanode in the
pipeline, which stores the packet and forwards it to the second
datanode in the pipeline. Similarly, the second datanode stores the
packet and forwards it to the third (and last) datanode in the pipeline
5. DFSOutputStream also maintains an internal queue of packets that
are waiting to be acknowledged by datanodes, called the ack queue.
A packet is removed from the ack queue only when it has been
acknowledged by all the datanodes in the pipeline.
6. When the client has finished writing data, it calls close() on the
stream, which flushes the remaining packets, waits for acknowledgments,
and then contacts the namenode to signal that the file is complete
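The corresponding client-side code is again a short sketch against the Java FileSystem API; the URI and output path are hypothetical placeholders.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileWriteDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
    // Step 1: create() asks the namenode to create the file (no blocks yet).
    FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt")); // hypothetical path
    try {
      // Steps 3-5 (packets, data queue, datanode pipeline, ack queue) happen
      // inside the stream as the client writes.
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    } finally {
      out.close(); // Step 6: flush, wait for acks, tell the namenode the file is complete.
      fs.close();
    }
  }
}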
MapReduce Components:
• Job Tracker:
• Coordinates all the jobs run on the system by scheduling tasks
• Keeps a record of the overall progress of each job
• If a task fails, reschedules it on a different TaskTracker
• Task Tracker:
• Slave daemon which accepts tasks to be run on a block of data
• Sends progress reports as heartbeat signals to the JobTracker at regular
intervals
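To tie the pieces together, a typical MapReduce job looks roughly like the classic word-count example below, written against the newer org.apache.hadoop.mapreduce API and abbreviated; the input and output paths are supplied as arguments and are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: runs near the data and emits (word, 1) for each word in its split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: merges the per-split results into a total count per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}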