2. Contents
• What is Hadoop
• Hadoop Components
• Why Hadoop
• HDFS
• HDFS Advantages
• When Not to use Hadoop
• HDFS Components
• DFS and HDFS
• Hadoop & Big data – Relatives !
3. What is Hadoop
• Conventional Definition
• Framework for distributed processing of large datasets (usually unstructured data)
across clusters of commodity hardware.
• Well, I have always been bad with these bookish definitions and never understood the heavy
terms used in them. So, here are some explanations:
• Distributed Processing: Spreading a heavy task across several workers/resources
to reduce the time taken to complete it.
• Large Datasets (Unstructured Data): Data that does not have any defined
structure, format or size.
• Commodity Hardware: Easily available, inexpensive hardware, usually of modest
performance. These machines can fail at any time.
So, as of now we can say that Hadoop is a system that stores huge volumes of unstructured data in a way that lets the data
be read faster.
• Fun Fact: Hadoop follows many of the standards, the directory structure and other patterns
of LINUX/UNIX. Most details are easily available on the “Apache” web site.
4. Hadoop Components
Level 0 - Hadoop
• HDFS: Hadoop Distributed File System
• MapReduce: Simple Programming Model
HDFS: HDFS is a file system that handles the storage of data, the Hadoop way.
MapReduce: Though written as one word, Map and Reduce are 2 separate
steps. Map runs on each piece of the data where it is spread across the distributed
environment and emits intermediate key/value pairs; Reduce combines those pairs
into the final result, cutting down the volume of data sent, received or
processed.
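To make the two steps concrete, here is a minimal word-count sketch in plain Python. It does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the grouping step in the middle only imitates what the framework would do between Map and Reduce.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    intermediate = []
    for chunk in chunks:  # in a cluster, each chunk could be mapped on a different node
        intermediate.extend(map_phase(chunk))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data needs big storage", "hadoop stores big data"]))
```

Notice that Reduce only ever sees small (word, count) pairs, never the original text, which is where the saving in data volume comes from.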
5. Why Hadoop !
• So if Hadoop is just another storage system, then why so much hype?
• Yes, Hadoop is yet another distributed file processing system, but I see something
that makes it different, or in fact special: “Faster I/O Processing using
commodity hardware”.
• We all know that this generation has no issue with storage
size. We have TBs of hard drives available at home too.
The only problem that remains is accessing a huge volume of
unstructured data using the low-performance I/O devices we
have. This is where Hadoop comes to the rescue. How? The
other slides explain.
• Fun Fact: Hadoop is not a single piece of software that you download and install
on your system like a desktop application. It is a set of tools organized to serve a specific
purpose.
6. HDFS
Conventional Definition: HDFS is a file system designed for storing very large files with streaming
data access patterns, running on clusters of commodity hardware.
As the name says, it is a distributed file system following some specific protocols, standards and
techniques, which we will call the Hadoop way.
[Diagram: the Map Reduce Engine sits on top of the HDFS Cluster.
Master: Job Tracker (Map Reduce Engine) / Name Node (HDFS Cluster)
4 workers, each: Task Tracker (Map Reduce Engine) / Data Node (HDFS Cluster)]
7. HDFS Advantages
• Fault Tolerance
• Since “using commodity hardware” is an important highlight of Hadoop's
definition, we can be certain that nodes will fail.
But Hadoop handles these failing nodes very effectively and
ensures that we do not lose data when a node goes down. How? Read about
replication.
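Here is a toy sketch of the replication idea in plain Python. It assumes a simple round-robin placement and a replication factor of 3 (the usual HDFS default); real HDFS placement is smarter than this, and the names `place_blocks` and `readable_blocks` are hypothetical helpers, not Hadoop APIs.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (an assumption)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    blocks = ["b0", "b1", "b2", "b3"]
    placement = place_blocks(blocks, nodes)
    # Two of four nodes fail; every block should still be readable somewhere.
    print(sorted(readable_blocks(placement, ["dn1", "dn2"])))
```

With only one copy of each block (`replication=1`), losing a single node already makes some blocks unreadable, which is exactly the situation replication avoids.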
• Handles large Datasets
• No wonder companies like Facebook and Yahoo rely on it, and Google's
own GFS and MapReduce designs inspired it. It is a proven system for handling large datasets.
• Streaming access to File system data
• Files are read from start to finish at high throughput, much like a video streaming on “youtube”.
• High Performance
• Because data is processed in parallel, the processing time with Hadoop can in the ideal case be up to
“n” times faster, where n is the number of data nodes; real jobs gain less because of coordination overhead.
8. When Not to Use Hadoop/HDFS
• When there are many small files used in transactions
• When low-latency data access is needed
• When many people modify the data/files arbitrarily
(multiple writers).
9. HDFS Components
• Name Node (Job Tracker)
• Data Nodes (Task Trackers)
Name Node: This component of HDFS generally runs on a high-performance machine and,
in layman's terms, it is a kind of “Index” for the data spread across several data
nodes. We can also call it the metadata storage process.
Data Node: This is responsible for storing the actual data. It runs as a daemon on the local
machines.
Fun Fact: A daemon is a resident program that runs in the background on your machine as
a process. Daemon is terminology used in UNIX. In DOS we call it a TSR.
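The “Index” idea for the Name Node can be sketched as a small lookup table in Python. This is only an illustration under my own assumptions: the class `NameNodeIndex`, its methods, and the block and node names are all hypothetical, and a real Name Node tracks far more than this.

```python
class NameNodeIndex:
    """Toy metadata store: file path -> ordered list of (block_id, [data nodes])."""

    def __init__(self):
        self._files = {}

    def add_file(self, path, blocks):
        # blocks: list of (block_id, list_of_data_node_names)
        self._files[path] = list(blocks)

    def locate(self, path):
        """Answer a client's question: which data nodes hold each block of this file?"""
        if path not in self._files:
            raise FileNotFoundError(path)
        return self._files[path]


if __name__ == "__main__":
    index = NameNodeIndex()
    index.add_file("/logs/day1.txt", [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])])
    for block_id, holders in index.locate("/logs/day1.txt"):
        print(block_id, "is stored on", holders)
```

The index holds only names and locations; the actual bytes stay on the data nodes, which is why the two roles are separate components.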
10. DFS and HDFS
• So, what is the difference between a regular Distributed File
System and Hadoop?
• Hadoop processes the data on the local nodes and just transmits the
output to the client, while in a regular DFS the data is brought to the master node
from the various nodes for processing. So, quite obviously, Hadoop has to
transfer less data (just the output) over the network, while a
regular DFS has to transfer a huge volume of data over the network. This
makes Hadoop the winner for faster processing!
• This type of processing of data on the data nodes is called data
locality, which is one of the important super powers of
Hadoop.
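A rough way to see the difference is to count the bytes that would cross the network in each style. The sketch below is plain Python with made-up data; the function names are hypothetical, and the 8 bytes per result is simply an assumed size for one small count.

```python
def count_errors_locally(node_lines):
    """Runs on a data node: scan the local lines and return only a small count."""
    return sum(1 for line in node_lines if "ERROR" in line)


def bytes_sent_move_data(cluster):
    """Regular DFS style: every line is shipped to the master for processing."""
    return sum(len(line.encode()) for lines in cluster.values() for line in lines)


def bytes_sent_move_compute(cluster, result_size=8):
    """Hadoop style: each node ships only its result (assumed 8 bytes per count)."""
    return result_size * len(cluster)


if __name__ == "__main__":
    cluster = {
        "dn1": ["ERROR disk", "ok"],
        "dn2": ["ok", "ok", "ERROR net"],
    }
    total_errors = sum(count_errors_locally(lines) for lines in cluster.values())
    print("errors found:", total_errors)
    print("bytes sent when moving data:", bytes_sent_move_data(cluster))
    print("bytes sent when moving compute:", bytes_sent_move_compute(cluster))
```

With this tiny example the gap is small, but the first number grows with the size of the data while the second grows only with the number of nodes.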
11. Hadoop & Big Data – Relatives !
The relation is not very complex. It is just like a simple husband-wife relation,
where Hadoop comes in just to resolve issues with Big Data.
In other words, Big Data provides the challenges for Hadoop to resolve.