2. INTRODUCTION
We have entered the era of Big Data. Big data is, broadly, a collection of data sets so large and complex that it is very difficult to process them using on-hand database management tools. The main challenges with big data include capture, curation, storage, sharing, search, analysis, and visualization. To handle such data sets we therefore need highly parallel software. First of all, data is acquired from diverse sources such as social media, traditional enterprise data, or sensor data. Flume can be used to acquire data from social media such as Twitter. This data can then be organized using distributed file systems such as the Hadoop Distributed File System (HDFS). These file systems are very efficient when the number of reads is high compared to writes.
3. DEFINING HADOOP
Hadoop is an open-source framework that allows the storage and processing of big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
4. HADOOP MODULES
▪ Hadoop Common: The common utilities that support the other
Hadoop modules.
▪ Hadoop Distributed File System (HDFS™): A distributed file system
that provides high-throughput access to application data.
▪ Hadoop YARN: A framework for job scheduling and cluster resource
management.
▪ Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
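To make the MapReduce module concrete, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative, and job configuration and submission are omitted for brevity.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map step: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}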
5. HADOOP DISTRIBUTED FILE SYSTEM (HDFS™)
HDFS is structured similarly to a regular Unix file system except that data storage is distributed across several machines. It is not intended as a replacement for a regular file system, but rather as a file-system-like layer for large distributed systems to use. It has built-in mechanisms to handle machine outages, and it is optimized for throughput rather than latency.
There are two and a half types of machine in an HDFS cluster:
▪ Data Node - The Data Nodes are responsible for serving read and write requests from the file system's clients. The Data Nodes also perform block creation, deletion, and replication upon instruction from the Name Node.
▪ Name Node - The Name Node executes file system namespace operations like opening, closing, and renaming files and directories (a small client-side sketch of these operations follows this list). It also determines the mapping of blocks to Data Nodes.
▪ Secondary Name Node - this is NOT a backup Name Node, but a separate service that keeps a copy of both the edit logs and the file system image, merging them periodically to keep their size reasonable. It is soon to be deprecated in favor of the Backup Node and the Checkpoint Node, but the functionality remains similar (if not the same).
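As a rough illustration of the Name Node's role, the sketch below uses the standard FileSystem Java API to perform two namespace operations. The paths are hypothetical, and the cluster address is assumed to come from the usual configuration files on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle; metadata calls go to the Name Node

        fs.mkdirs(new Path("/user/demo"));                           // namespace operation: create directory
        fs.rename(new Path("/user/demo"), new Path("/user/demo2"));  // namespace operation: rename
        fs.close();
    }
}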
6. HDFS FEATURES
HDFS also has a number of features that make it ideal for distributed systems:
▪ Failure tolerant - data can be duplicated across multiple Data Nodes to protect against machine failures. The industry standard is a replication factor of 3 (everything is stored on three machines); a sample configuration follows this list.
▪ Scalability - data transfers happen directly with the Data Nodes, so read/write capacity scales fairly well with the number of Data Nodes.
▪ Space - need more disk space? Just add more Data Nodes and re-balance.
▪ Industry standard - lots of other distributed applications build on top of HDFS (HBase, MapReduce).
▪ Pairs well with MapReduce - as we shall see.
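As an example of the replication factor mentioned above, a typical hdfs-site.xml snippet pins the cluster-wide default to three copies (dfs.replication is the standard property name; the value shown is just the common default):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>  <!-- each block is stored on three Data Nodes -->
  </property>
</configuration>

Per-file overrides are also possible programmatically via FileSystem.setReplication(Path, short).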
7. DATA ORGANIZATION
▪ Data Blocks : HDFS is designed to support very large files. Applications that are
compatible with HDFS are those that deal with large data sets. These applications write
their data only once but they read it one or more times and require these reads to be
satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.
A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64
MB chunks, and if possible, each chunk will reside on a different Data Node.
▪ Staging : A client request to create a file does not reach the Name Node immediately.
In fact, initially the HDFS client caches the file data into a temporary local file.
Application writes are transparently redirected to this temporary local file. When the
local file accumulates data worth over one HDFS block size, the client contacts the
Name Node. The Name Node inserts the file name into the file system hierarchy and
allocates a data block for it. The Name Node responds to the client request with the
identity of the Data Node and the destination data block. Then the client flushes the
block of data from the local temporary file to the specified Data Node. When a file is
closed, the remaining un-flushed data in the temporary local file is transferred to the
Data Node. The client then tells the Name Node that the file is closed. At this point, the
Name Node commits the file creation operation into a persistent store. If the Name
Node dies before the file is closed, the file is lost.
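The staging behaviour described above is transparent to applications: a client simply writes to a stream obtained from FileSystem.create(), and the local buffering, block allocation, and final flush happen under the hood. A minimal sketch, with a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/user/demo/log.txt"));
        out.writeBytes("application data, staged locally and flushed in blocks\n");
        out.close();  // remaining staged data is flushed; the Name Node commits the file
        fs.close();
    }
}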
▪ Replication Pipelining : When a client is writing data to an HDFS file, its data is first written
to a local file as explained in the previous section. Suppose the HDFS file has a replication
factor of three. When the local file accumulates a full block of user data, the client
retrieves a list of Data Nodes from the Name Node. This list contains the Data Nodes that
will host a replica of that block. The client then flushes the data block to the first Data
Node. The first Data Node starts receiving the data in small portions, writes each portion to
its local repository and transfers that portion to the second Data Node in the list. The
second Data Node, in turn, starts receiving each portion of the data block, writes that
portion to its repository and then flushes that portion to the third Data Node. Finally, the
third Data Node writes the data to its local repository. Thus, a Data Node can be receiving
data from the previous one in the pipeline and at the same time forwarding data to the
next one in the pipeline. Thus, the data is pipelined from one Data Node to the next.
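Replication pipelining is internal to HDFS, so there is no public API for it, but the toy simulation below (not Hadoop code; all names are invented) captures the idea: each node stores a small portion locally and immediately forwards it downstream, so all three replicas fill in parallel.

import java.util.ArrayList;
import java.util.List;

public class PipelineSketch {
    static class DataNodeSim {
        final String name;
        final DataNodeSim next;                        // next node in the pipeline, or null
        final List<byte[]> repository = new ArrayList<>();

        DataNodeSim(String name, DataNodeSim next) {
            this.name = name;
            this.next = next;
        }

        // Receive one small portion: write it locally, then forward it downstream.
        void receive(byte[] portion) {
            repository.add(portion);                   // write to local repository
            if (next != null) {
                next.receive(portion);                 // forward while more portions keep arriving
            }
        }
    }

    public static void main(String[] args) {
        // Build a three-node pipeline (replication factor 3).
        DataNodeSim third = new DataNodeSim("dn3", null);
        DataNodeSim second = new DataNodeSim("dn2", third);
        DataNodeSim first = new DataNodeSim("dn1", second);

        // The client flushes the block to the first Data Node in small portions.
        for (int i = 0; i < 4; i++) {
            first.receive(("portion-" + i).getBytes());
        }
        System.out.println("dn3 stored " + third.repository.size() + " portions");
    }
}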
8. ACCESSIBILITY
▪ Accessibility : HDFS can be accessed from applications in many different ways. Natively, HDFS provides a FileSystem Java API for applications to use, as sketched below. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.
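A minimal read through the FileSystem Java API, assuming the hypothetical file written in the staging example earlier:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/demo/log.txt"))))) {
            System.out.println(in.readLine());  // print the first line of the HDFS file
        }
        fs.close();
    }
}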
▪ FS Shell : HDFS allows user data to be organized in the form of files and directories. It provides a command line interface called the FS shell that lets a user interact with the data in HDFS; a few typical commands are shown below.
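For instance, the following standard FS shell commands create a directory, copy a local file into HDFS, list the directory, and print the file's contents (the paths are hypothetical):

hadoop fs -mkdir /user/demo
hadoop fs -put data.txt /user/demo/
hadoop fs -ls /user/demo
hadoop fs -cat /user/demo/data.txt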