2. Introduction
•Big data is a term used to describe the voluminous amount of unstructured and semi-structured data a company creates.
•It is data that would take too much time and cost too much money to load into a relational database for analysis.
•Big data doesn't refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
3. So What is The Problem?
•The transfer speed of a disk is around 100 MB/s, and a standard disk holds 1 terabyte.
•Time to read the entire disk = 1,000,000 MB ÷ 100 MB/s = 10,000 seconds, or roughly 3 hours!
•An increase in processing speed may not help, because:
•Network bandwidth is now more of a limiting factor.
•The physical limits of processor chips have been reached.
4. So What do we Do?
•The obvious solution is to use multiple processors to solve the same problem by fragmenting it into pieces.
•Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes (10,000 s ÷ 100 = 100 s); the sketch below illustrates the idea.
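As a toy illustration (not from the original slides), here is a minimal Java sketch of the same fragment-and-combine idea on a single machine: an array stands in for the 100 drives, each chunk is summed by its own thread, and the partial results are combined at the end. The array contents and chunk counts are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
  public static void main(String[] args) throws Exception {
    // Stand-in for the data spread across many drives.
    long[] data = new long[10_000_000];
    java.util.Arrays.fill(data, 1L);

    // Fragment the problem into one piece per available processor.
    int pieces = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(pieces);
    List<Future<Long>> partials = new ArrayList<>();

    int chunk = data.length / pieces;
    for (int i = 0; i < pieces; i++) {
      final int lo = i * chunk;
      final int hi = (i == pieces - 1) ? data.length : lo + chunk;
      // Each worker sums only its own fragment, in parallel.
      partials.add(pool.submit(() -> {
        long sum = 0;
        for (int j = lo; j < hi; j++) sum += data[j];
        return sum;
      }));
    }

    // Combine the partial results into the final answer.
    long total = 0;
    for (Future<Long> f : partials) total += f.get();
    pool.shutdown();
    System.out.println("total = " + total);
  }
}
```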
5. Distributed v/s Parallel Computing
Distributed Computing- Connects multiple computers over a network to achieve a particular goal. Each computer has its own memory. This allows scalability and resource sharing, and helps to perform computation tasks easily.
Parallel Computing- Divides a single task between multiple processors running tasks simultaneously on a single unit. It occurs in a single computer, which can have shared or distributed memory. It increases the performance of the computer.
6. Problems In Distributed Computing
Hardware Failure
As soon as we start using many pieces of hardware, the chance that
one will fail is fairly high.
Combining the data after analysis
Most analysis tasks need to be able to combine the data in some way;
data read from one disk may need to be combined with the data from
any of the other 99 disks.
7. To The Rescue
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware.
A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that, in the event of failure, another copy is available. The Hadoop Distributed Filesystem (HDFS) takes care of this problem (see the sketch below).
The second problem is solved by a simple programming model, MapReduce. Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
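As an illustrative sketch (not part of the original slides), the replication factor can also be requested per file through HDFS's Java client API. The file path here is hypothetical, and a reachable cluster configured via core-site.xml is assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS in core-site.xml points at the cluster.
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/events.log"); // hypothetical file

    // Ask HDFS to keep 3 redundant copies of each block of this file,
    // so the data survives the failure of any single datanode.
    boolean scheduled = fs.setReplication(file, (short) 3);
    System.out.println("replication change scheduled: " + scheduled);
  }
}
```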
8. What is Hadoop Technology?
The most well-known technology used for Big Data is Hadoop.
•It is an open source project by the Apache Software Foundation to handle large data processing.
•It was inspired by Google's MapReduce and Google File System (GFS) papers, published in 2003 and 2004.
•It was originally conceived by Doug Cutting in 2005 and first used by Yahoo! in 2006.
•The Apache Software Foundation released Hadoop 1.0 in 2011.
•Incidentally, it is named after Cutting's son's toy elephant.
•At its core, it is a distributed processing framework, written in Java, built around a distributed file system.
9. Hadoop's Approach to Distributed Computing
A theoretical 1,000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU machines.
Hadoop ties these smaller, more reasonably priced machines together into a single cost-effective compute cluster.
Hadoop provides a simplified programming model that lets the user quickly write and test distributed systems, and it efficiently and automatically distributes data and work across machines, in turn exploiting the underlying parallelism of the CPU cores.
10. Developers of Hadoop
[Photos: Doug Cutting and Michael J. Cafarella]
Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. This project was funded by Yahoo!.
13. Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
14. The Programming Model of MapReduce
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
15. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key, and merges these values together to form a possibly smaller set of values. The word-count sketch below shows both functions.
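To make the model concrete, here is the classic word-count example written against Hadoop's Java MapReduce API: map emits an intermediate (word, 1) pair for every word, the framework groups all values by word, and reduce sums them. This is a standard sketch, not code from the original slides; input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each input line, emit (word, 1) for every word in the line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one); // intermediate key/value pair
      }
    }
  }

  // Reduce: the framework hands us every value grouped under one word;
  // summing them gives that word's total count.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, such a job is typically launched with something like "hadoop jar wordcount.jar WordCount <input dir> <output dir>", where both directories live in HDFS.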
16. Hadoop Distributed File System (HDFS)
Filesystems that manage storage across a network of machines are called distributed filesystems.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
HDFS is a distributed, scalable, and portable filesystem, written in Java for the Hadoop framework, designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.
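As a small usage sketch (assuming a running cluster whose address is given by fs.defaultFS in core-site.xml; the path is hypothetical), a Java client can write to and read from HDFS through the org.apache.hadoop.fs.FileSystem API:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt"); // hypothetical path

    // Write: the client streams bytes; HDFS splits them into blocks
    // and replicates each block across datanodes behind the scenes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```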
17. Namenodes and Datanodes
An HDFS cluster has two types of node operating in a master-slave pattern: a namenode (the master) and a number of datanodes (slaves).
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.
Datanodes are the workhorses of the filesystem. They manage the storage attached to the nodes they run on.
HDFS exposes a filesystem namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks, and these blocks are stored on a set of datanodes; the sketch below queries this block layout.
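To see the namenode/datanode split from a client's perspective, this sketch (assumed setup: a configured cluster and a file path, supplied on the command line, that already exists in HDFS) asks the namenode for a file's block layout. The metadata (offsets, lengths, replica hosts) comes from the namenode, while the bytes themselves stay on the datanodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path(args[0]); // e.g. a large file already in HDFS

    FileStatus status = fs.getFileStatus(file);
    // Ask the namenode which blocks make up the file and where each
    // replica lives; no block data is transferred by this call.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation b : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
  }
}
```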
19. Conclusion
Major companies like Facebook, Amazon, and Yahoo! are adopting Hadoop, and in the future many more names may join the list.
This technology has a bright future, because the need to process data grows day by day, and security issues remain a major concern.
Hence, Hadoop is an appropriate approach for handling large data in a smart way, and its future is bright.