The data management industry has matured over the last three decades, primarily on the basis of relational database management system (RDBMS) technology. As the volume, variety, and velocity of data collected and analyzed in enterprises have increased severalfold, organizations have started to struggle with the architectural limitations of traditional RDBMS technology. As a result, a new class of systems had to be designed and implemented, giving rise to the phenomenon of "Big Data". In this paper we trace the origins of Hadoop, a system in this new class designed to handle Big Data.
1. MANAGING BIG DATA WITH HADOOP
Presented by:
Nalini Mehta
Student (MLVTEC Bhilwara)
Email: nalinimehta52@gmail.com
2. Introduction
Big Data:
• Big data is a term used to describe voluminous amounts of unstructured and semi-structured data.
• It is data that would take too much time and cost too much money to load into a relational database for analysis.
• Big data doesn't refer to any specific quantity; the term is often used when speaking of petabytes and exabytes of data.
4. General framework of Big Data
The driving forces behind a Big Data implementation are infrastructure and analytics, which together constitute the software stack.
Hadoop is the Big Data management software used to distribute, catalogue, manage, and query data across multiple, horizontally scaled server nodes.
6. Overview of Hadoop
• Hadoop is a platform for processing large amounts of data in a distributed fashion.
• It provides a scheduling and resource-management framework to execute the map and reduce phases in a cluster environment.
• The Hadoop Distributed File System is Hadoop's data storage layer, designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel.
7. Hadoop Cluster
• DataNode: The DataNodes are the repositories for the data, and each consists of multiple smaller database infrastructures.
• Client: The client represents the user interface to the big data implementation and query engine. The client could be a server or a PC with a traditional user interface.
• NameNode: The NameNode is the equivalent of an address router; it records the location of every DataNode.
• Job Tracker: The job tracker is the software mechanism that distributes and aggregates search queries across multiple nodes for final analysis by the client.
8. Apache Hadoop
• Apache Hadoop is an open-source distributed software platform for storing and processing data.
• It is a framework for running applications on large clusters built of commodity hardware.
• A common way of avoiding data loss is replication: the system keeps redundant copies of the data so that, in the event of a failure, another copy is available. The Hadoop Distributed File System (HDFS) takes care of this.
• MapReduce is a simple programming model for processing and generating large data sets.
9. What is MapReduce?
MapReduce is a programming model. Programs written in this model are automatically parallelized and executed on a large cluster of commodity machines.
Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
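As a concrete illustration, here is a minimal word-count sketch of the model in plain Python. The function names map_fn and reduce_fn, and the single-process setting, are our own illustrative assumptions; in a real Hadoop job these functions would be written against the MapReduce API or run via Hadoop Streaming.

# Minimal sketch of the MapReduce programming model for word count.
# Plain Python; map_fn/reduce_fn are illustrative names, not a Hadoop API.

def map_fn(name, text):
    """Map: takes an input pair (document name, document text) and
    emits an intermediate (word, 1) pair for every word occurrence."""
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: takes a word and all the counts emitted for it, and
    merges them into a single total."""
    yield (word, sum(counts))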
10. The Programming Model of MapReduce
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.
11. The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key. It merges these values together to form a possibly smaller set of values.
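To make the library's role concrete, the following single-process sketch stands in for the grouping ("shuffle") step between Map and Reduce. It is plain Python for exposition only, not Hadoop's actual implementation; the sample documents are invented.

# Illustrative stand-in for the MapReduce library: run Map over every
# input pair, group intermediate values by key, then call Reduce.
from itertools import groupby

def map_fn(name, text):
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

documents = {"doc1": "big data big cluster", "doc2": "big cluster"}

# Map phase: the user's map function runs over every input pair.
intermediate = [pair for name, text in documents.items()
                for pair in map_fn(name, text)]

# Shuffle: the library groups all values associated with the same key.
intermediate.sort(key=lambda kv: kv[0])
grouped = ((key, [v for _, v in group])
           for key, group in groupby(intermediate, key=lambda kv: kv[0]))

# Reduce phase: merge each key's values into a possibly smaller set.
for word, counts in grouped:
    print(reduce_fn(word, counts))  # ('big', 3), ('cluster', 2), ('data', 1)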
12. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Apache Hadoop comes with a distributed file system called HDFS, which stands for Hadoop Distributed File System.
HDFS is designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.
HDFS is designed for scalability and fault tolerance, and provides APIs for MapReduce applications to read and write data in parallel.
The capacity and performance of HDFS can be scaled by adding DataNodes, while a single NameNode manages data placement and monitors server availability.
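As a hedged sketch of how a client consults the NameNode, the snippet below lists an HDFS directory over the WebHDFS REST API. It assumes WebHDFS is enabled and that the NameNode's HTTP port is 9870 (the Hadoop 3.x default; older releases use 50070); the host name and path are placeholders, not values from this paper.

# Sketch: list an HDFS directory through the NameNode's WebHDFS REST API.
# Host, port, and the /user/nalini path are illustrative placeholders.
import requests

NAMENODE = "http://namenode.example.com:9870"

def list_hdfs_dir(path):
    resp = requests.get(f"{NAMENODE}/webhdfs/v1{path}",
                        params={"op": "LISTSTATUS"})
    resp.raise_for_status()
    # The NameNode answers from its in-memory namespace metadata;
    # no DataNode is contacted for a listing.
    return [status["pathSuffix"]
            for status in resp.json()["FileStatuses"]["FileStatus"]]

print(list_hdfs_dir("/user/nalini"))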
13. Assumptions and Goals
1. Hardware Failure
• An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data.
• There are a huge number of components, and each component has a non-trivial probability of failure.
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
2. Streaming Data Access
• Applications that run on HDFS need streaming access to their data sets.
• HDFS is designed more for batch processing than for interactive use by users.
• The emphasis is on high throughput of data access rather than low latency of data access.
3. Large Data Sets
• A typical file in HDFS is gigabytes to terabytes in size.
• Thus, HDFS is tuned to support large files.
• It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
14. 4. Simple Coherency Model
• HDFS applications need a write-once-read-many access model for files.
• A file once created, written, and closed need not be changed.
• This assumption simplifies data coherency issues and enables high-throughput data access.
5. "Moving Computation is Cheaper than Moving Data"
• A computation requested by an application is much more efficient if it is executed near the data it operates on when the size of the data set is huge.
• This minimizes network congestion and increases the overall throughput of the system.
6. Portability across Heterogeneous Hardware and Software Platforms
• HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
16. NameNode and DataNodes
An HDFS cluster has two types of nodes operating in a master-slave pattern: a NameNode (the master) and a number of DataNodes (slaves).
The NameNode manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree.
Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
17. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories.
DataNodes store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of blocks that they are storing.
The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Without the NameNode, the file system cannot be used. In fact, if the machine running the NameNode were destroyed, all the files on the file system would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes.
18. File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create and remove files, move a file from one directory to another, rename a file, create directories, and store files inside these directories.
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode.
An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
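As an illustrative sketch of these namespace operations, the snippet below creates a directory and renames a file through WebHDFS. The requests go to the NameNode, which records the change in its namespace. Host, port, and paths are placeholder assumptions, and a secured cluster may additionally require a user.name parameter.

# Sketch: namespace operations (create directory, rename file) via WebHDFS.
# Host/port and paths are illustrative placeholders.
import requests

NAMENODE = "http://namenode.example.com:9870"

def mkdirs(path):
    # PUT ?op=MKDIRS creates the directory (and any missing parents).
    resp = requests.put(f"{NAMENODE}/webhdfs/v1{path}",
                        params={"op": "MKDIRS"})
    resp.raise_for_status()
    return resp.json()["boolean"]

def rename(src, dst):
    # PUT ?op=RENAME&destination=... moves/renames within the namespace.
    resp = requests.put(f"{NAMENODE}/webhdfs/v1{src}",
                        params={"op": "RENAME", "destination": dst})
    resp.raise_for_status()
    return resp.json()["boolean"]

mkdirs("/user/nalini/reports")
rename("/user/nalini/draft.txt", "/user/nalini/reports/final.txt")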
19. Data Replication
The blocks of a file are replicated for fault tolerance.
The block size and replication factor are configurable per file.
The NameNode makes all decisions regarding replication of blocks.
A block report contains a list of all blocks on a DataNode.
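To illustrate per-file replication control, the sketch below raises one file's replication factor through WebHDFS; the equivalent shell command is hadoop fs -setrep. Host, port, and the file path are placeholder assumptions.

# Sketch: set the replication factor of a single HDFS file via WebHDFS.
# Equivalent shell command: hadoop fs -setrep 3 /data/events.log
# Host/port and the file path are illustrative placeholders.
import requests

NAMENODE = "http://namenode.example.com:9870"

def set_replication(path, factor):
    resp = requests.put(f"{NAMENODE}/webhdfs/v1{path}",
                        params={"op": "SETREPLICATION", "replication": factor})
    resp.raise_for_status()
    # True means the NameNode accepted the new replication factor; it
    # then schedules DataNodes to create or delete replicas accordingly.
    return resp.json()["boolean"]

print(set_replication("/data/events.log", 3))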
20. Hadoop as a Service in the Cloud (HaaS)
Hadoop is economical for large-scale, data-driven companies like Yahoo or Facebook.
The ecosystem around Hadoop nowadays offers various tools, like Hive and Pig, that make Big Data processing accessible, letting users focus on what to do with the data while avoiding the complexity of programming.
Consequently, a minimal Hadoop as a Service offering provides a managed Hadoop cluster, ready to use without the need to configure or install any Hadoop-related services (Job Tracker, Task Tracker, NameNode, or DataNode) on any cluster node.
Depending on the level of service, abstraction, and tools provided, Hadoop as a Service (HaaS) can be placed in the cloud stack as a Platform- or Software-as-a-Service solution, between infrastructure services and cloud clients.
21. Limitations:
Hadoop places several requirements on the network:
Data locality
Distributed Hadoop nodes running jobs in parallel generate east-west network traffic that can be adversely affected by suboptimal network connectivity.
The network should provide high bandwidth, low latency, and any-to-any connectivity between the nodes for optimal Hadoop performance.
Scale out
Deployments might start with a small cluster and then scale out over time, as the customer realizes initial success and needs grow.
The underlying network architecture should also scale seamlessly with Hadoop clusters and should provide predictable performance.
22. Conclusion
The growth of communication and connectivity has led to the emergence of Big Data. Apache Hadoop is an open-source framework that has become a de facto standard for the big data platforms deployed today.
To sum up, we conclude that promising progress has been made in the area of Big Data, but much remains to be done. Almost all proposed approaches have been evaluated only at a limited scale, and further research is required for large-scale evaluations.
23. References:
• Introduction to Big Data: Infrastructure and Network Considerations (white paper).
• MapReduce: Simplified Data Processing on Large Clusters, http://research.google.com/archive/mapreduce.html
• Big Data Analytics (white paper), http://hadoop.intel.com
• The Hadoop Distributed File System: Architecture and Design, by Dhruba Borthakur.
• Big Data in the Enterprise, Cisco white paper.
• Cloudera capacity planning recommendations, http://www.cloudera.com/blog/2010/08/hadoop-hbase-capacity-planning/
• Apache Hadoop, Wikipedia, http://en.wikipedia.org/wiki/Apache_Hadoop
• Towards a Big Data Reference Architecture, www.win.tue.nl/~gfletche/Maier_MSc_thesis.pdf