This document is from Edureka's Big Data and Hadoop course. It provides an overview of Hadoop: what it is, and its key characteristics of being reliable, economical, scalable, and flexible. It describes the core Hadoop components, HDFS for storage and YARN for processing, and discusses when to use Hadoop, such as for large data sets that are diverse, growing, and not time-critical. Examples are provided for processing different data types such as images, XML files, and CSVs. Contact information is given for the Big Data and Hadoop course.
Hadoop: The Pile of Big Data
Slide 1
Hadoop: The Pile of Big Data
View Big Data and Hadoop Course at: http://www.edureka.co/big-data-and-hadoop
For more details please contact us:
US : 1800 275 9730 (toll free)
INDIA : +91 88808 62004
Email Us : sales@edureka.co
For Queries:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
Slide 2
Objectives
At the end of this module, you will be able to…
What is the Hadoop framework
What is Big Data
Hadoop core components
When to use Hadoop
Processing
• Unstructured data
• Semi-structured data
• Structured data
Slide 3
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is open-source data management software with scale-out storage and distributed processing.
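
To make the "simple programming model" concrete, below is a minimal sketch of the classic word-count job in Java, Hadoop's native language. It is an illustration rather than part of the slides: input and output paths arrive as command-line arguments, and each mapper runs on the nodes holding its data blocks, which is the "bring code to data" principle discussed on a later slide.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the per-word counts produced by all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}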
Slide 5
Hadoop Design Principles
Facilitate the storage and processing of large and/or rapidly growing data sets
» Structured and unstructured data
» Simple programming models
Scale-Out rather than Scale-Up
Bring code to data rather than data to code
High scalability and availability
Use commodity hardware
Fault-tolerance
Slide 6
Hadoop – It’s about Scale and Structure
RDBMS                              |              | HADOOP
-----------------------------------+--------------+-----------------------------------------
Structured                         | Data Types   | Multi and Unstructured
Limited, No Data Processing        | Processing   | Processing coupled with Data
Standards & Structured             | Governance   | Loosely Structured
Required On Write                  | Schema       | Required On Read
Reads are Fast                     | Speed        | Writes are Fast
Software License                   | Cost         | Support Only
Known Entity                       | Resources    | Growing, Complexities, Wide
OLTP, Complex ACID Transactions,   | Best Fit Use | Data Discovery, Processing Unstructured
Operational Data Store             |              | Data, Massive Storage/Processing
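
The Schema row above is the sharpest contrast. As a hedged sketch of schema-on-read, the mapper below assumes a hypothetical comma-separated web log (timestamp, url, bytes) that was copied into HDFS with no validation on write; the structure is imposed only at the moment the data is read, and rows that do not fit are tolerated rather than rejected.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Schema-on-read: the raw file sits in HDFS as-is; this mapper applies the
// (timestamp, url, bytes) layout only while reading each line.
public class SchemaOnReadMapper
    extends Mapper<Object, Text, Text, LongWritable> {

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length != 3) {
      return; // malformed line: skip at read time rather than fail the load
    }
    try {
      // Interpret field 1 as the URL and field 2 as a byte count, right now.
      context.write(new Text(fields[1]),
                    new LongWritable(Long.parseLong(fields[2].trim())));
    } catch (NumberFormatException e) {
      // Loosely structured data: tolerate rows that don't match the schema.
    }
  }
}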
Slide 7
What is Big Data?
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word-cloud graphic: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile, Big Data]
Slide 8
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
» VOLUME
» VELOCITY
» VARIETY – e.g. web logs, images, videos, audios, sensor data
» VERACITY
[Graphic residue: a small Min/Max/Mean/SD summary-statistics table from the slide]
Slide 9
Hadoop Ecosystem

Hadoop 1.0:
» HDFS (Hadoop Distributed File System)
» MapReduce Framework
» Pig Latin (Data Analysis), Hive (DW System), HBase, Mahout (Machine Learning)
» Apache Oozie (Workflow)
» Ingestion: Sqoop (Structured Data), Flume (Unstructured or Semi-structured Data)

Hadoop 2.0:
» HDFS (Hadoop Distributed File System)
» YARN (Cluster Resource Management)
» MapReduce Framework and Other YARN Frameworks (MPI, GRAPH)
» Pig Latin (Data Analysis), Hive (DW System), HBase, Mahout (Machine Learning)
» Apache Oozie (Workflow)
» Ingestion: Sqoop (Structured Data), Flume (Unstructured or Semi-structured Data)
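
As one hedged illustration of how an ecosystem component is used, the sketch below queries Hive (the "DW System" above) from plain Java over JDBC. The host name and the weblogs table are hypothetical, and the hive-jdbc driver is assumed to be on the classpath; Hive compiles the SQL into jobs that run on the cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2's standard JDBC URL form; "default" is the database name.
    String url = "jdbc:hive2://hiveserver:10000/default";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}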
Slide 12
Main Components of HDFS
NameNode:
» Master of the system
» Maintains and manages the blocks which are present on the DataNodes
DataNodes:
» Slaves which are deployed on each machine and provide the actual storage
» Responsible for serving read and write requests for the clients
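
A minimal sketch of how the two roles cooperate from a client's point of view, assuming an illustrative cluster URI and file path: metadata operations (creating the file, looking up block locations) go to the NameNode, while the bytes themselves stream to and from DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS normally comes from core-site.xml; hard-coded here for clarity.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt");

    // Write: the NameNode records the file and its block metadata;
    // the bytes are stored on DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the client asks the NameNode where the blocks live, then
    // reads directly from the DataNodes serving those blocks.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}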
Slide 13
When to use Hadoop
Slide 14
Data Size and Data Diversity
You have different types of data: structured, semi-structured, and unstructured.
The data set is huge in size, i.e. several Terabytes or Petabytes.
You are not in a hurry for answers, i.e. the workload is batch-oriented rather than time-critical.
Slide 15
Future Planning
To implement Hadoop on your data, you should first understand the complexity of the data and the rate at which it is going to grow.
Cluster planning is therefore needed: it may begin with building a small or medium cluster in your organization, sized for the data available at present (in GBs or a few TBs), and scaling the cluster up in the future as your data grows. For example, with HDFS's default 3x replication, ingesting 1 TB of new data per day consumes roughly 3 TB of raw disk per day, so growth estimates translate directly into hardware plans.
Slide 16
Multiple Frameworks for Big Data
Hadoop can be integrated with multiple analytics tools to get the best out of it, such as machine learning (e.g. Mahout), R, Python, Spark, MongoDB, etc.
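
As a hedged example of one such integration, the sketch below uses Spark's Java API to read a file straight out of HDFS (the URI and path are illustrative). Spark reuses Hadoop's input formats, so hdfs:// paths work directly; the job would be packaged and launched with spark-submit, which supplies the cluster master.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkOnHdfs {
  public static void main(String[] args) {
    // No master set here: spark-submit provides it (e.g. YARN).
    SparkConf conf = new SparkConf().setAppName("spark-on-hdfs");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Read lines directly from HDFS, then count whitespace-separated tokens.
      JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/demo/hello.txt");
      long words = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .count();
      System.out.println("word tokens: " + words);
    }
  }
}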
Slide 17
Lifetime Data Availability
When you want your data to stay live and available indefinitely, Hadoop's scalability (and HDFS replication) makes this achievable.