Hadoop is a rapidly growing ecosystem of components, based on Google's MapReduce and file system work, for implementing MapReduce [3] algorithms in a scalable fashion, distributed on commodity hardware. Hadoop enables users to store and process large volumes of data and to analyze it in ways not previously possible with SQL-based approaches or less scalable solutions. Remarkable improvements in conventional compute and storage resources help make Hadoop clusters feasible for most organizations. This paper begins with a discussion of the evolution of Big Data [1][7][9] and its future based on Gartner's Hype Cycle. We explain how the Hadoop Distributed File System (HDFS) works and illustrate its architecture. Hadoop's MapReduce paradigm for distributing a task across multiple nodes is discussed with sample data sets, followed by how MapReduce and HDFS work when put together. Finally, the paper ends with a discussion of sample Big Data Hadoop use cases, which show how enterprises can gain a competitive advantage by being early adopters of big data analytics.
The Hadoop Distributed File System (HDFS) is the core component of the Apache Hadoop project. In HDFS, computation is carried out on the nodes where the relevant data is stored. Hadoop also implements a parallel computational paradigm named MapReduce. In this paper, we measure the performance of read and write operations in HDFS for both small and large files, using a Hadoop cluster with five nodes. The results indicate that HDFS performs well for files larger than the default block size and poorly for files smaller than the default block size.
Hadoop Distributed File System (HDFS) presentation 27-5-2015
1. In The Name of Allah The Most Merciful The
Most Gracious
• Name: Abdul Nasir Afridi
• Roll Number: 01
• Batch #10
• Subject: Advanced Database and Data Mining
2. Research Article
1. Performance Evaluation of Read and Write Operations in Hadoop Distributed File System.
Published: 2014 Sixth International Symposium on Parallel Architectures, Algorithms and Programming
Conference Paper: IEEE Computer Society
Authors: Dr. T. Ragunathan et al.
5. Research Article
H-Store: A High-Performance, Distributed Main Memory Transaction Processing System
Published: August 23-28, 2008, Auckland, New Zealand
Conference Paper: ACM 978-1-60558-306-8/08/08
Copyright 2008 VLDB Endowment
7. What is Apache Hadoop?
• Hadoop Distributed File System:
• HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
• It is an open-source system developed by
Apache in Java.
• It is designed to handle very large data sets.
• It is designed to scale to very large clusters.
• It is designed to run on commodity hardware.
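Because the read/write study cited earlier found that performance hinges on file size relative to the block size, it helps to see how HDFS divides files into blocks. HDFS stores each file as a sequence of fixed-size blocks, 128 MB by default in Hadoop 2.x (64 MB in 1.x). The function below is an illustrative sketch of that calculation, not part of any Hadoop API:

```python
import math

# Default HDFS block size in Hadoop 2.x (configurable via dfs.blocksize).
BLOCK_SIZE = 128 * 1024 * 1024

def hdfs_block_count(file_size_bytes):
    """Number of HDFS blocks a file of the given size occupies."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 1 GB file spans 8 blocks; a 1 KB file still occupies a whole block
# entry, which is why many small files strain the NameNode's metadata
# and degrade HDFS performance.
print(hdfs_block_count(1024 * 1024 * 1024))  # 8
print(hdfs_block_count(1024))                # 1
```

Every block, however small its content, adds one entry to the NameNode's in-memory metadata, which is one reason HDFS favors files larger than the block size.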
12. Hadoop ecosystem
• Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system.
• It offers data replication.
• It offers automatic failover in the event of a crash.
• It automatically fragments storage over the cluster.
• It brings processing to the data.
• It supports large volumes of files, into the millions.
13. Hadoop ecosystem
• MapReduce:
• MapReduce is a software framework that serves as the compute layer of Hadoop.
• MapReduce jobs are divided into two parts. The map function divides a query into multiple parts and processes data at the node level.
• The reduce function aggregates the results of the map function to determine the answer to the query.
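The map/reduce division of labor can be sketched in plain Python with the classic word-count example. This simulates the map, shuffle/sort, and reduce phases on a single machine; it is an illustration of the paradigm, not Hadoop API code:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: aggregate all intermediate counts for one key.
    return (word, sum(counts))

lines = ["big data big insight", "big cluster"]

# Map step: in Hadoop this runs in parallel, one mapper per input split.
mapped = [pair for line in lines for pair in map_phase(line)]

# Shuffle/sort step: group the intermediate pairs by key so that each
# reducer sees all values for one key together.
mapped.sort(key=itemgetter(0))

# Reduce step: one call per distinct key.
result = dict(reduce_phase(key, [count for _, count in group])
              for key, group in groupby(mapped, key=itemgetter(0)))
print(result)  # {'big': 3, 'cluster': 1, 'data': 1, 'insight': 1}
```

In a real cluster, the map calls run on the nodes holding each input split and only the grouped intermediate pairs travel over the network to the reducers.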
14. Hadoop ecosystem
• Hive:
Hive is a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in SQL, which are then converted to map-reduce jobs. This allows SQL programmers with no map-reduce experience to use the warehouse, and it makes Hive easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
15. Hadoop ecosystem
• Pig:
Pig Latin is a Hadoop-based language developed at Yahoo Research.
It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
Pig is a high-level language for building map-reduce programs for Hadoop, thus simplifying the use of map-reduce. It is a data flow language that provides high-level commands.
17. Hadoop ecosystem
• HBase:
• HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop.
• It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes.
• eBay and Facebook use HBase heavily.
18. Hadoop ecosystem
• Flume:
• Flume is a framework for populating Hadoop with data.
• Agents are placed throughout one's IT infrastructure (inside web servers, application servers, and mobile devices, for example) to collect data and integrate it into Hadoop.
19. Hadoop ecosystem
• Oozie:
• Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages (such as map-reduce, Pig, and Hive) and then intelligently links them to one another.
• Oozie allows users to specify, for example, that a particular query is only to be initiated after the specified previous jobs on which it relies for data are completed.
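Oozie workflows are defined in XML, with each action declaring where control flows on success and on failure. As a rough illustration, a minimal workflow running a single Pig job might look like the fragment below; the workflow name, action name, and script name are hypothetical, and the `${jobTracker}`/`${nameNode}` parameters would be supplied by the job's properties file:

```xml
<!-- Hypothetical minimal workflow: run one Pig job, then succeed or fail. -->
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="clean-data"/>
    <action name="clean-data">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>clean.pig</script>
        </pig>
        <!-- Control flow: continue only after this job completes. -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The `ok`/`error` transitions are how Oozie expresses the "only after its prerequisites complete" dependency described above.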
20. Hadoop ecosystem
• Whirr:
• Whirr is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure.
• It supports all major virtualized infrastructure vendors on the market.
21. Hadoop ecosystem
• Avro:
• Avro is a data serialization system that allows for encoding the schema of Hadoop files.
• It is adept at parsing data and performing remote procedure calls.
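Avro schemas are declared in JSON. As an illustration, a record schema for a hypothetical log event could look like the following (the record and field names are invented for the example):

```json
{
  "type": "record",
  "name": "LogEvent",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "host", "type": "string"},
    {"name": "message", "type": "string"}
  ]
}
```

Because the schema travels with the data file, any Avro reader can parse records without out-of-band schema coordination.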
22. Hadoop ecosystem
• Mahout:
• Mahout is a data-mining library.
• It takes the most popular data-mining algorithms for performing clustering, regression, and statistical modeling and implements them using the map-reduce model.
24. Hadoop ecosystem
• Sqoop:
• Sqoop is a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop.
• It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to the target.
28. Big data
Big data is being generated by everything around us at all times.
Every digital process and social media exchange produces it.
Systems, sensors, and mobile devices transmit it.
Big data arrives from multiple sources at an alarming velocity, volume, and variety.
To extract meaningful value from big data, you need optimal processing power, analytics capabilities, and skills.
36. Scheduling
• By default:
▫ Hadoop uses FIFO to schedule jobs.
▫ There is no preemption once a job is running.
• Hadoop 2.x introduces fair scheduling: resources are assigned to applications such that all applications get, on average, an equal share of resources over time.
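The fair-scheduling idea, where each running application receives an equal share on average and any unused share flows to applications that still have demand, can be sketched as a simple allocation loop. This is a simplified single-round model for illustration, not the actual YARN Fair Scheduler implementation:

```python
def fair_share(total_slots, demands):
    """Split cluster slots evenly across applications; share an app does
    not need is redistributed to apps that still want more (a simplified
    max-min fair allocation)."""
    shares = {app: 0 for app in demands}
    remaining = total_slots
    active = {app for app, demand in demands.items() if demand > 0}
    while remaining > 0 and active:
        # Offer each still-hungry app an equal slice of what is left.
        per_app = max(1, remaining // len(active))
        for app in sorted(active):
            give = min(per_app, demands[app] - shares[app], remaining)
            shares[app] += give
            remaining -= give
            if shares[app] == demands[app]:
                active.discard(app)  # fully satisfied, stop offering slots
            if remaining == 0:
                break
    return shares

# Three jobs competing for 12 slots: the equal share is 4 each, but job C
# only needs 2, so its leftover capacity flows to A and B.
print(fair_share(12, {"A": 10, "B": 10, "C": 2}))
# {'A': 5, 'B': 5, 'C': 2}
```

Under FIFO, job A would have taken all 12 slots until it finished; under the fair model, all three jobs make progress concurrently.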