6. ISSUES WITH LARGE DATA
• Map Parallelism: Chunking input data
• Reduce Parallelism: Grouping related data
• Dealing with failures & load imbalance
8. • Doug Cutting and Mike Cafarella developed an
open-source project called HADOOP in 2005,
and Doug named it after his son's toy elephant.
9. • Hadoop has become one of the most talked-about
technologies.
• Why? One of the top reasons is its ability to handle
huge amounts of data – any kind of data – quickly.
With volumes and varieties of data growing each
day, especially from social media and automated
sensors, that’s a key consideration for most
organizations.
10. • Hadoop is an open-source software framework
for storing and processing big data in a
distributed fashion on large clusters of
commodity hardware.
• Essentially, it accomplishes two tasks:
- massive data storage
- fast processing
11. • Hadoop is an Apache open-source framework
written in Java that allows distributed
processing of large datasets across clusters of
computers using simple programming models.
Hadoop is designed to scale up from a single
server to thousands of machines, each offering
local computation and storage.
14. WHY IS HADOOP IMPORTANT?
• Low cost : The open-source framework is free and uses
commodity hardware to store large quantities of data.
• Computing power : Its distributed computing model
can quickly process very large volumes of data.
• Scalability : You can easily grow your system simply by
adding more nodes.
• Storage flexibility : You can store as much data as you
want and decide how to use it later.
• Inherent data protection and self-healing
capabilities : Data and application processing are
protected against hardware failure; if a node goes
down, work is automatically redirected to other nodes.
15. WHAT’S IN HADOOP?
• HDFS – the Java-based distributed file system that can
store all kinds of data without prior organization.
• MapReduce – a software programming model for
processing large sets of data in parallel.
• YARN – a resource management framework for
scheduling and handling resource requests from distributed
applications.
17. COMPONENTS THAT HAVE ACHIEVED TOP-
LEVEL APACHE PROJECT STATUS
• Pig – a platform for manipulating data stored in HDFS. It
consists of a compiler for MapReduce programs and a
high-level language called Pig Latin.
• Hive – a data warehousing and SQL-like query language
that presents data in the form of tables. Hive programming
is similar to database programming. (It was initially
developed by Facebook.)
• HBase – a non-relational, distributed database that runs
on top of Hadoop. HBase tables can serve as input and
output for MapReduce jobs.
• Zookeeper – an application that coordinates distributed
processes.
18. • Ambari – a web interface for managing, configuring
and testing Hadoop services and components.
• Flume – software that collects, aggregates and moves
large amounts of streaming data into HDFS.
• Sqoop – a connection and transfer mechanism that
moves data between Hadoop and relational databases.
• Oozie – a Hadoop job scheduler.
19. HADOOP ARCHITECTURE
• The Hadoop framework includes the following four modules:
• Hadoop Common : These are Java libraries and
utilities required by other Hadoop modules. These
libraries provide filesystem- and OS-level abstractions
and contain the necessary Java files and scripts
required to start Hadoop.
• Hadoop YARN : This is a framework for job
scheduling and cluster resource management.
• Hadoop Distributed File System (HDFS) : A
distributed file system that provides high-throughput
access to application data.
• Hadoop MapReduce : This is a YARN-based system
for parallel processing of large data sets.
24. WHAT IS MAPREDUCE?
• Hadoop runs applications using the MapReduce
algorithm, where the data is processed in parallel
on different CPU nodes.
• A MapReduce program executes in three stages:
the map stage, the shuffle stage, and the reduce
stage.
25. STAGES OF MAPREDUCE
• Map stage : The mapper's job is to process the input
data. The input, in the form of files or directories, is
stored in the Hadoop Distributed File System (HDFS)
and is passed to the mapper function line by line. The
mapper processes the data and creates several small
chunks of intermediate data.
• Reduce stage : This stage combines the shuffle step
and the reduce step. The reducer's job is to process the
intermediate data that comes from the mapper. After
processing, it produces a new set of output, which is
stored in HDFS.
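The three stages above can be sketched in a single process with plain Python. This is a toy illustration, not the Hadoop API: the function names (`mapper`, `shuffle`, `reducer`) are our own, and real MapReduce distributes this work across many cluster nodes.

```python
from collections import defaultdict

# Map stage: each input line is turned into (word, 1) pairs.
def mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

# Shuffle stage: group all intermediate values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce stage: combine the values for each key into one result.
def reducer(key, values):
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(intermediate)
result = dict(reducer(k, v) for k, v in grouped.items())
print(result["the"])  # -> 3
```

In a real cluster, the shuffle stage is what moves intermediate pairs over the network so that all values for the same key land on the same reducer node.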
30. • Data is organized into files and directories
• Files are divided into uniform-sized blocks
(default 128 MB) and distributed across cluster nodes
31. HDFS
• Blocks are replicated to handle hardware
failure
• Replication improves performance and fault
tolerance (rack-aware placement)
• HDFS keeps checksums of data for
corruption detection and recovery
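The block splitting and replication described above can be sketched as follows. This is a toy model with made-up helper names; real HDFS uses a rack-aware placement policy rather than the simple round-robin shown here.

```python
# Toy sketch of how HDFS splits a file into fixed-size blocks and
# assigns each block to several DataNodes for replication.
BLOCK_SIZE = 128 * 1024 * 1024  # default block size: 128 MB
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Round-robin placement over the nodes; real HDFS is rack-aware."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
print(len(blocks))  # -> 3 blocks: 128 MB, 128 MB, 44 MB
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

Note that only the last block of a file is smaller than the block size; every other block is exactly 128 MB.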
32. FEATURES OF HDFS
• It is suitable for distributed storage and
processing.
• Hadoop provides a command interface to
interact with HDFS.
• The built-in servers of the NameNode and
DataNodes help users easily check the status of
the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and
authentication.
34. • The NameNode is software that can run on commodity
hardware. The machine running the NameNode acts as the
master server and performs the following tasks:
- manages the file system namespace;
- regulates clients' access to files;
- executes file system operations such as renaming,
closing, and opening files and directories.
• DataNodes manage the data storage of the system:
- they perform read-write operations on the file system, as
per client requests;
- they perform block operations such as creation, deletion,
and replication.
• Block : User data is stored in the files of HDFS. Each file is
divided into one or more segments, which are stored on
individual DataNodes; these segments are called blocks.
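The division of labor between the NameNode and the DataNodes can be illustrated with a toy in-memory model. All names and data structures below are illustrative, not HDFS internals: the NameNode holds only metadata, while the DataNodes hold the actual block bytes.

```python
namespace = {  # NameNode: file system namespace (file -> its blocks)
    "/logs/app.log": ["blk_1", "blk_2"],
}
block_locations = {  # NameNode: block -> DataNodes holding a replica
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
datanode_storage = {  # DataNodes: the actual block bytes
    "dn1": {"blk_1": b"2024-01-01 start\n"},
    "dn2": {"blk_1": b"2024-01-01 start\n", "blk_2": b"2024-01-01 stop\n"},
    "dn3": {"blk_1": b"2024-01-01 start\n", "blk_2": b"2024-01-01 stop\n"},
    "dn4": {"blk_2": b"2024-01-01 stop\n"},
}

def read_file(path):
    """A client asks the NameNode which blocks make up the file and
    where they live, then reads each block from one of the DataNodes."""
    data = b""
    for blk in namespace[path]:
        node = block_locations[blk][0]  # pick the first replica
        data += datanode_storage[node][blk]
    return data

print(read_file("/logs/app.log"))
```

The key design point visible here is that file data never flows through the NameNode: clients fetch block locations from it, then transfer the bytes directly to and from DataNodes.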
36. GOALS OF HDFS
• Fault detection and recovery :
Since HDFS includes a large number of commodity
hardware components, failure of components is frequent.
Therefore, HDFS should have mechanisms for quick,
automatic fault detection and recovery.
• Huge datasets :
HDFS should scale to hundreds of nodes per cluster to
manage applications with huge datasets.
• Hardware at data :
A requested task can be done efficiently when the
computation takes place near the data. Where huge
datasets are involved, this reduces network traffic and
increases throughput.
37. ADVANTAGES OF HADOOP
• The Hadoop framework allows the user to quickly write
and test distributed systems.
• The Hadoop library itself detects and handles failures at
the application layer.
• Servers can be added to or removed from the cluster
dynamically, and Hadoop continues to operate without
interruption.
• Apart from being open source, Hadoop is compatible
with all platforms since it is Java-based.