This document provides an introduction to Hadoop and big data. It defines big data as large amounts of data from a variety of structured, semi-structured, and unstructured sources that is difficult to store, analyze, and visualize due to its volume, velocity, and variety. Hadoop is introduced as an open source framework for distributed processing and storage of large datasets across clusters of commodity hardware. Key Hadoop components like HDFS, MapReduce, YARN and daemons like NameNode, DataNode, ResourceManager and NodeManager are described. Modes of operation for Hadoop including standalone, pseudo-distributed and fully distributed are also outlined.
3. What is Big Data?
• A large amount of data.
• It is a popular term used to describe the exponential growth of data.
• Big data is difficult to collect, store, maintain, analyze, and visualize.
4/11/2017Footer Text 3
4. Big Data characteristics
• Volume: a large amount of data.
• Velocity: the rate at which data is generated.
• Variety: different types of data
- Structured data, e.g. MySQL tables
- Semi-structured data, e.g. XML, JSON
- Unstructured data, e.g. text, audio, video
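The three kinds of variety can be illustrated with a minimal Python sketch. The table name, field names, and sample values are hypothetical, and `sqlite3` stands in for MySQL only so that the example runs without a database server.

```python
import json
import sqlite3

# Structured: every row fits a fixed schema declared up front.
# (sqlite3 is a stand-in for MySQL here; the table and values are made up.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'asha')")
structured_row = conn.execute("SELECT id, name FROM users").fetchone()
conn.close()

# Semi-structured: records describe themselves, but fields may differ
# from one record to the next.
raw_records = ['{"id": 1, "name": "asha"}', '{"id": 2, "tags": ["new"]}']
semi_structured = [json.loads(record) for record in raw_records]

# Unstructured: no schema at all; any structure has to be inferred.
unstructured = "Asha wrote a long review about the new phone."
word_count = len(unstructured.split())

print(structured_row)
print(semi_structured)
print(word_count)
```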
5. Big Data sources
• Social Media
• Banks
• Instruments
• Websites
• Stock Market
6. Use cases of Big Data
• Recommendation engines
• Analyzing Call Detail Records (CDRs)
• Fraud Detection
• Market Basket Analysis
• Sentiment Analysis
7. Hadoop Introduction
• An open source framework for distributed processing and storage of large datasets across a cluster of commodity hardware.
• Hadoop is a data management tool that uses scale-out storage: capacity grows by adding more nodes to the cluster.
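To give a feel for the distributed processing model, here is a minimal single-process sketch of the MapReduce pattern (map, shuffle, reduce) applied to word counting. It is plain Python, not the Hadoop API; the function names and the sample input splits are hypothetical, and in a real cluster each split would be processed on a different node.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

# Two hypothetical input splits.
splits = ["big data big cluster", "data node"]

mapped = [pair for line in splits for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```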
8. Defining Hadoop Cluster
• The size of the data is the most important factor when sizing a Hadoop cluster.
Example: 5 servers with 10 TB of storage capacity each
Total storage capacity: 5 × 10 TB = 50 TB
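The sizing arithmetic can be written out as a small Python sketch. The function names are hypothetical. The replication factor of 3 is an assumption that does not appear in the slides: it is the HDFS default (`dfs.replication`), and it means the usable capacity is smaller than the raw 50 TB.

```python
import math

def raw_capacity_tb(servers, tb_per_server):
    """Total disk space across all servers, ignoring replication."""
    return servers * tb_per_server

def usable_capacity_tb(raw_tb, replication_factor=3):
    """Space left for unique data when every block is stored
    `replication_factor` times (3 is the HDFS default, assumed here)."""
    return raw_tb / replication_factor

def servers_needed(data_tb, tb_per_server, replication_factor=3):
    """Smallest number of servers that can hold `data_tb` of unique data."""
    return math.ceil(data_tb * replication_factor / tb_per_server)

raw = raw_capacity_tb(servers=5, tb_per_server=10)
usable = usable_capacity_tb(raw)
print(f"raw: {raw} TB, usable with 3 replicas: {usable:.1f} TB")
print("servers needed for 50 TB of data:", servers_needed(50, 10))
```

This sketch leaves out space reserved for the operating system, temporary job output, and future growth, all of which a real sizing exercise would account for.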
14. Hadoop Cluster
• Assume that we have a Hadoop cluster with 4 nodes.
Master: NameNode, ResourceManager
Slave: DataNode, NodeManager
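The daemon layout for such a cluster can be sketched as a simple mapping from host to daemons. The split into one master and three slaves is an assumption (the slide only says 4 nodes), and the hostnames and the `plan_cluster` helper are hypothetical.

```python
MASTER_DAEMONS = ("NameNode", "ResourceManager")
SLAVE_DAEMONS = ("DataNode", "NodeManager")

def plan_cluster(hostnames):
    """Treat the first host as the master and the rest as slaves."""
    if len(hostnames) < 2:
        raise ValueError("need at least one master and one slave")
    master, *slaves = hostnames
    layout = {master: MASTER_DAEMONS}
    layout.update({host: SLAVE_DAEMONS for host in slaves})
    return layout

# Hypothetical hostnames for the 4-node cluster.
layout = plan_cluster(["node1", "node2", "node3", "node4"])
for host, daemons in layout.items():
    print(f"{host}: {', '.join(daemons)}")
```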
15. Secondary NameNode
• The Secondary NameNode is not a hot backup for the NameNode.
• It only takes periodic checkpoints (hourly by default) of the NameNode metadata.
• Its latest checkpoint can be used to restart a crashed Hadoop cluster.
• The Secondary NameNode is an important daemon in Hadoop 1. In Hadoop 2 it matters less, because a standby NameNode can take over this role when high availability is configured.
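A toy model can show why a periodic checkpoint is useful for recovery yet is not a hot backup. The sketch below assumes a much-simplified picture: the namespace is a dictionary (the "fsimage"), changes accumulate in an edit log, and a checkpoint merges the log into the image. `ToyNameNode` and `checkpoint` are hypothetical names, not Hadoop classes.

```python
class ToyNameNode:
    """Very simplified stand-in for NameNode metadata handling."""

    def __init__(self):
        self.fsimage = {}    # namespace as of the last checkpoint: path -> size
        self.edit_log = []   # changes recorded since the last checkpoint

    def create_file(self, path, size):
        self.edit_log.append(("create", path, size))

    def delete_file(self, path):
        self.edit_log.append(("delete", path))


def checkpoint(namenode):
    """Replay the edit log onto a copy of the fsimage and reset the log."""
    merged = dict(namenode.fsimage)
    for op, path, *args in namenode.edit_log:
        if op == "create":
            merged[path] = args[0]
        elif op == "delete":
            merged.pop(path, None)
    namenode.fsimage = merged
    namenode.edit_log = []
    return dict(merged)  # the copy kept by the checkpointing node


nn = ToyNameNode()
nn.create_file("/a", 10)
nn.create_file("/b", 20)
saved = checkpoint(nn)   # periodic checkpoint

nn.delete_file("/a")     # happens after the checkpoint

# If the NameNode crashed now and only `saved` survived, the restart
# would succeed but the deletion of /a would be missing.
print("recovered namespace:", saved)
print("changes not in the checkpoint:", nn.edit_log)
```

Anything recorded after the last checkpoint is absent from the saved copy, which is the sense in which a checkpoint differs from a hot backup.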
16. Modes of Operation
• Standalone
• Pseudo-distributed
• Fully distributed