This document is from Edureka's Big Data and Hadoop course. It provides an overview of Hadoop: what it is, and its key characteristics of being reliable, economical, scalable, and flexible. It describes the core Hadoop components, HDFS for storage and YARN for processing, and discusses when to use Hadoop, such as for large data sets that are diverse, growing, and not time-critical. Examples are provided for processing different data types such as images, XML files, and CSVs. Contact information is given for the Big Data and Hadoop course.
Hadoop: The Pile of Big Data
Slide 1
Hadoop: The Pile of Big Data
View Big Data and Hadoop Course at: http://www.edureka.co/big-data-and-hadoop
For more details please contact us:
US : 1800 275 9730 (toll free)
INDIA : +91 88808 62004
Email Us : sales@edureka.co
For Queries:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
Slide 2
Objectives
At the end of this module, you will be able to…
What is the Hadoop framework
What is Big Data
Hadoop core components
When to use Hadoop
Processing
• Unstructured data
• Semi-structured data
• Structured data
Slide 3
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is open-source data management software with scale-out storage and distributed processing.
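
To make the "simple programming model" concrete, below is a minimal sketch of the classic word-count job in Java, Hadoop's native language. It is an illustration rather than part of the slides: input and output paths arrive as command-line arguments, and each mapper runs on the nodes holding its data blocks, which is the "bring code to data" principle discussed on a later slide.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the per-word counts produced by all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}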
Slide 5
Hadoop Design Principles
Facilitate the storage and processing of large and/or rapidly growing data sets
» Structured and unstructured data
» Simple programming models
Scale-Out rather than Scale-Up
Bring code to data rather than data to code
High scalability and availability
Use commodity hardware
Fault-tolerance
Slide 6
Hadoop – It’s about Scale and Structure
RDBMS                              |              | HADOOP
-----------------------------------+--------------+-----------------------------------------
Structured                         | Data Types   | Multi and Unstructured
Limited, No Data Processing        | Processing   | Processing coupled with Data
Standards & Structured             | Governance   | Loosely Structured
Required On Write                  | Schema       | Required On Read
Reads are Fast                     | Speed        | Writes are Fast
Software License                   | Cost         | Support Only
Known Entity                       | Resources    | Growing, Complexities, Wide
OLTP, Complex ACID Transactions,   | Best Fit Use | Data Discovery, Processing Unstructured
Operational Data Store             |              | Data, Massive Storage/Processing
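
The Schema row above is the sharpest contrast. As a hedged sketch of schema-on-read, the mapper below assumes a hypothetical comma-separated web log (timestamp, url, bytes) that was copied into HDFS with no validation on write; the structure is imposed only at the moment the data is read, and rows that do not fit are tolerated rather than rejected.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Schema-on-read: the raw file sits in HDFS as-is; this mapper applies the
// (timestamp, url, bytes) layout only while reading each line.
public class SchemaOnReadMapper
    extends Mapper<Object, Text, Text, LongWritable> {

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length != 3) {
      return; // malformed line: skip at read time rather than fail the load
    }
    try {
      // Interpret field 1 as the URL and field 2 as a byte count, right now.
      context.write(new Text(fields[1]),
                    new LongWritable(Long.parseLong(fields[2].trim())));
    } catch (NumberFormatException e) {
      // Loosely structured data: tolerate rows that don't match the schema.
    }
  }
}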
Slide 7
What is Big Data?
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word-cloud graphic: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile, Big Data]
Slide 8
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
» VOLUME
» VELOCITY
» VARIETY – e.g. web logs, images, videos, audios, sensor data
» VERACITY
[Graphic residue: a small Min/Max/Mean/SD summary-statistics table from the slide]
Slide 9
Hadoop Ecosystem

Hadoop 1.0:
» HDFS (Hadoop Distributed File System)
» MapReduce Framework
» Pig Latin (Data Analysis), Hive (DW System), HBase, Mahout (Machine Learning)
» Apache Oozie (Workflow)
» Ingestion: Sqoop (Structured Data), Flume (Unstructured or Semi-structured Data)

Hadoop 2.0:
» HDFS (Hadoop Distributed File System)
» YARN (Cluster Resource Management)
» MapReduce Framework and Other YARN Frameworks (MPI, GRAPH)
» Pig Latin (Data Analysis), Hive (DW System), HBase, Mahout (Machine Learning)
» Apache Oozie (Workflow)
» Ingestion: Sqoop (Structured Data), Flume (Unstructured or Semi-structured Data)
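
As one hedged illustration of how an ecosystem component is used, the sketch below queries Hive (the "DW System" above) from plain Java over JDBC. The host name and the weblogs table are hypothetical, and the hive-jdbc driver is assumed to be on the classpath; Hive compiles the SQL into jobs that run on the cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2's standard JDBC URL form; "default" is the database name.
    String url = "jdbc:hive2://hiveserver:10000/default";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}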
Slide 12
Main Components of HDFS
NameNode:
» Master of the system
» Maintains and manages the blocks which are present on the DataNodes
DataNodes:
» Slaves which are deployed on each machine and provide the actual storage
» Responsible for serving read and write requests for the clients
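
A minimal sketch of how the two roles cooperate from a client's point of view, assuming an illustrative cluster URI and file path: metadata operations (creating the file, looking up block locations) go to the NameNode, while the bytes themselves stream to and from DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS normally comes from core-site.xml; hard-coded here for clarity.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt");

    // Write: the NameNode records the file and its block metadata;
    // the bytes are stored on DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the client asks the NameNode where the blocks live, then
    // reads directly from the DataNodes serving those blocks.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}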
Slide 13
When to use Hadoop
Slide 14
Data Size and Data Diversity
You have different types of data: structured, semi-structured, and unstructured.
The data set is huge in size, i.e. several Terabytes or Petabytes.
You are not in a hurry for answers, i.e. the workload is batch-oriented rather than time-critical.
Slide 15
Future Planning
To implement Hadoop on your data, you should first understand the complexity of the data and the rate at which it is going to grow.
Cluster planning is therefore needed: it may begin with building a small or medium cluster in your organization, sized for the data available at present (in GBs or a few TBs), and scaling the cluster up in the future as your data grows. For example, with HDFS's default 3x replication, ingesting 1 TB of new data per day consumes roughly 3 TB of raw disk per day, so growth estimates translate directly into hardware plans.
Slide 16
Multiple Frameworks for Big Data
Hadoop can be integrated with multiple analytics tools to get the best out of it, such as machine learning (e.g. Mahout), R, Python, Spark, MongoDB, etc.
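
As a hedged example of one such integration, the sketch below uses Spark's Java API to read a file straight out of HDFS (the URI and path are illustrative). Spark reuses Hadoop's input formats, so hdfs:// paths work directly; the job would be packaged and launched with spark-submit, which supplies the cluster master.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkOnHdfs {
  public static void main(String[] args) {
    // No master set here: spark-submit provides it (e.g. YARN).
    SparkConf conf = new SparkConf().setAppName("spark-on-hdfs");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Read lines directly from HDFS, then count whitespace-separated tokens.
      JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/demo/hello.txt");
      long words = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .count();
      System.out.println("word tokens: " + words);
    }
  }
}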
Slide 17
Lifetime Data Availability
When you want your data to stay live and available indefinitely, Hadoop's scalability (and HDFS replication) makes this achievable.