The Hadoop Cluster Administration course at Edureka starts with the fundamental concepts of Apache Hadoop and Hadoop clusters. It covers how to deploy, manage, monitor, and secure a Hadoop cluster, and how to configure backup options and diagnose and recover from node failures. The course also covers HBase administration. There are many challenging, practical, and focused hands-on exercises for the learners, so software professionals new to Hadoop can quickly learn cluster administration through technical sessions and hands-on labs. By the end of this six-week Hadoop Cluster Administration training, you will be prepared to understand and solve real-world problems that you may come across while working on a Hadoop cluster.
2. How It Works…
LIVE classes
Class recordings
Module wise Quizzes and Practical Assignments
24x7 on-demand technical support
Deployment of different clusters
Online certification exam
Lifetime access to the Learning Management System
www.edureka.in/hadoop-admin
3. Course Topics
Week 1
– Understanding Big Data
– Hadoop Components
– Introduction to Hadoop 2.0
Week 2
– Hadoop 2.0
– Hadoop Configuration
– Hadoop Cluster Architecture
Week 3
– Different Hadoop Server Roles
– Data Processing Flow
– Cluster Network Configuration
Week 4
– Job Scheduling
– Fair Scheduler
– Monitoring a Hadoop Cluster
Week 5
– Securing Your Hadoop Cluster
– Kerberos and HDFS Federation
– Backup and Recovery
Week 6
– Oozie and Hive Administration
– HBase Architecture
– HBase Administration
4. Topics for Today
What is Big Data?
Limitations of the existing solutions
Solving the problem with Hadoop
Introduction to Hadoop
Hadoop Eco-System
Hadoop Core Components
MapReduce software framework
Hadoop Cluster Administrator: Roles and Responsibilities
Introduction to Hadoop 2.0
5. What Is Big Data?
Lots of data (terabytes or petabytes).
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
The NYSE generates about one terabyte of new trade data per day, used to perform stock-trading analytics to determine trends for optimal trades.
6. IBM’s Definition
IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Characteristics of Big Data:
– Volume: 12 terabytes of tweets created each day
– Velocity: scrutinizing 5 million trade events created each day to identify potential fraud
– Variety: sensor data, audio, video, click streams, log files and more
7. Data Volume Is Growing Exponentially
Estimated global data volume:
– 2011: 1.8 ZB
– 2015: 7.9 ZB
The world's information doubles every two years.
Over the next 10 years:
– The number of servers worldwide will grow by 10x
– The amount of information managed by enterprise data centers will grow by 50x
– The number of “files” enterprise data centers handle will grow by 75x
Source: http://www.emc.com/leadership/programs/digital-universe.htm, based on the 2011 IDC Digital Universe Study
8. What Big Companies Have To Say…
McKinsey:
– “Analyzing Big Data sets will become a key basis for competition.”
– “Leaders in every sector will have to grapple with the implications of Big Data.”
Gartner:
– “Big Data analytics is rapidly emerging as the preferred solution to disruptive business and technology trends.”
– “Enterprises should not delay implementation of Big Data analytics.”
Forrester Research:
– “Use Hadoop to gain a competitive advantage over more risk-averse enterprises.”
– “Prioritize Big Data projects that might benefit from Hadoop.”
9. Some Of the Hadoop Users
10. Hadoop Users – In Detail
http://wiki.apache.org/hadoop/PoweredBy
11. What Is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
13. Hadoop History
2002 – Doug Cutting & Mike Cafarella start working on Nutch
2003 – Google publishes the GFS paper
2004 – Google publishes the MapReduce paper
2005 – Doug Cutting adds DFS & MapReduce support to Nutch
2006 – Yahoo! hires Cutting; Hadoop spins out of Nutch
2007 – NY Times converts 4 TB of image archives over 100 EC2 instances
2008 – Fastest sort of a TB (3.5 minutes over 910 nodes); Cloudera founded; Facebook launches Hive: SQL support for Hadoop
2009 – Fastest sort of a TB (62 seconds over 1,460 nodes); a PB sorted in 16.25 hours over 3,658 nodes; Doug Cutting joins Cloudera; Hadoop Summit 2009, 750 attendees
14. Hadoop 1.x Eco-System
– Apache Oozie (workflow)
– Hive (DW system) and Pig Latin (data analysis)
– Mahout (machine learning)
– MapReduce framework
– HBase
– HDFS (Hadoop Distributed File System)
– Flume (import of unstructured or semi-structured data) and Sqoop (import or export of structured data)
15. Hadoop 1.x Core Components
Hadoop is a system for large-scale data processing. It has two main components:
HDFS – Hadoop Distributed File System (storage)
– Distributed across “nodes”
– Natively redundant
– NameNode tracks block locations
MapReduce (processing)
– Splits a task across processors “near” the data and assembles the results
– Self-healing, high-bandwidth clustered storage
– JobTracker manages the TaskTrackers
Additional administration tools:
– Filesystem utilities
– Job scheduling and monitoring
– Web UI
16. Hadoop 1.x Core Components (Contd.)
MapReduce engine: a single JobTracker coordinates multiple TaskTrackers, one per worker node.
HDFS cluster: the NameNode, running on an admin node, manages multiple DataNodes.
17. Name Node and Data Nodes
NameNode:
– the master of the system
– maintains and manages the blocks that are present on the DataNodes
DataNodes:
– slaves deployed on each machine, providing the actual storage
– responsible for serving read and write requests from the clients
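As a rough mental model (toy Python, not Hadoop code), the division of labour looks like this: the NameNode holds only metadata about which DataNodes store which block, while the DataNodes hold the actual bytes and serve reads. All class and method names here are illustrative:

```python
class DataNode:
    """Toy slave: stores actual block data and serves reads."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

class NameNode:
    """Toy master: stores only metadata, never the data itself."""
    def __init__(self):
        self.block_map = {}       # block_id -> list of DataNodes holding it

    def place_block(self, block_id, datanodes, data, replication=3):
        # Pick `replication` DataNodes and record where the block lives.
        chosen = datanodes[:replication]
        for dn in chosen:
            dn.write(block_id, data)
        self.block_map[block_id] = chosen

    def locate(self, block_id):
        return self.block_map[block_id]

dns = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode()
nn.place_block("blk_1", dns, b"hello", replication=3)
# A client asks the NameNode where the block is, then reads from a DataNode:
replicas = nn.locate("blk_1")
print([dn.name for dn in replicas], replicas[0].read("blk_1"))
```

The key point the sketch makes: clients never fetch data through the master; they ask it for locations and then talk to the slaves directly.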
18. Secondary Name Node
Secondary NameNode:
– Not a hot standby for the NameNode
– Connects to the NameNode every hour (by default) to pull a copy of the metadata
– Performs housekeeping and keeps a backup of the NameNode metadata
– The saved metadata can be used to rebuild a failed NameNode
Note: in Hadoop 1.x the NameNode is a single point of failure.
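The hourly checkpoint interval is configurable. As a sketch for Hadoop 1.x, the relevant properties are fs.checkpoint.period and fs.checkpoint.size (the values shown are the defaults; exact placement of the properties can vary by distribution):

```xml
<!-- core-site.xml (Hadoop 1.x) -->
<configuration>
  <!-- Seconds between Secondary NameNode checkpoints (3600 = 1 hour) -->
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>
  </property>
  <!-- Force a checkpoint early once the edit log reaches this size (bytes) -->
  <property>
    <name>fs.checkpoint.size</name>
    <value>67108864</value>
  </property>
</configuration>
```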
19. What Is MapReduce?
MapReduce is a programming model:
– It is neither platform- nor language-specific
– Record-oriented data processing (keys and values)
– Tasks are distributed across multiple nodes
– Where possible, each node processes data stored on that node
– Consists of two phases: Map and Reduce
20. What Is MapReduce? (Contd.)
The process can be considered similar to a Unix pipeline:
cat /my/log | grep '.html' | sort | uniq -c > /my/outfile
Here cat and grep correspond to the Map phase, sort to the shuffle/sort step, and uniq -c to the Reduce phase.
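The pipeline analogy can be sketched in plain Python: a toy word count that runs a map phase, a sort/shuffle step, and a reduce phase in-process. The function names are illustrative, not part of Hadoop's API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Emit a (key, value) record per word, like a Hadoop Mapper.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Group sorted records by key and sum the values, like a Reducer.
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

lines = ["the cat sat", "the cat ran"]
mapped = map_phase(lines)
shuffled = sorted(mapped)             # the sort/shuffle step between phases
result = dict(reduce_phase(shuffled))
print(result)                         # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

The sort between the two phases is essential: reduce only works because all records with the same key arrive together, which is exactly what the shuffle guarantees in a real cluster.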
21. Hadoop 1.x – In Summary
A client interacts with two layers:
– HDFS: a NameNode (with a Secondary NameNode for checkpointing) manages DataNodes, which store the data blocks.
– MapReduce: a JobTracker manages TaskTrackers, each of which runs Map and Reduce tasks on the nodes holding the data blocks.
23. Hadoop Cluster Administrator
Roles and Responsibilities
Deploying the cluster
Performance and availability of the cluster
Job scheduling and Management
Upgrades
Backup and Recovery
Monitoring the cluster
Troubleshooting
24. Hadoop 1.0 Vs. Hadoop 2.0
Property            | Hadoop 1.x               | Hadoop 2.x
NameNodes           | 1                        | Many
High Availability   | Not present              | Highly Available
Processing Control  | JobTracker, TaskTracker  | ResourceManager, NodeManager, App Master
25. MRv1 Vs. MRv2
Hadoop 1.0: MapReduce (data processing) runs under the JobTracker, directly on top of HDFS (data storage).
– Problems with resource utilization
– Slots are reserved only for Map and Reduce tasks
Hadoop 2.0: YARN (cluster resource management, with a Scheduler and an Applications Manager (AsM)) sits between HDFS (data storage) and the processing frameworks (MapReduce and others).
– Provides a cluster-level ResourceManager
– Application-level resource management through per-node NodeManagers
– Provides slots for jobs other than Map and Reduce
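As a hedged illustration of the YARN side, a minimal yarn-site.xml for a small cluster might look like the following; the host name is a placeholder, and while these property names are the standard YARN ones, required values vary by distribution:

```xml
<!-- yarn-site.xml (Hadoop 2.x) -->
<configuration>
  <!-- Where NodeManagers and clients find the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host.example.com</value>
  </property>
  <!-- Auxiliary service needed so MapReduce jobs can shuffle data -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```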
26. Hadoop 2.0 - Architecture
HDFS (high availability):
– An Active NameNode and a Standby NameNode; all namespace edits are logged to shared NFS storage, with a single writer enforced by fencing.
– The Standby NameNode reads the shared edit logs and applies them to its own namespace.
– DataNodes store the blocks; clients interact with the Active NameNode.
YARN:
– A ResourceManager coordinates NodeManagers running alongside the DataNodes.
– Each NodeManager hosts Containers; per-application App Masters run in containers and manage their application's tasks.
27. Assignments
Attempt the following Assignments using the documents present in the LMS:
Apache Hadoop 1.0 Installation on Ubuntu in Pseudo-Distributed Mode
Execute Linux Basic Commands
Execute HDFS Hands On
Cloudera CDH3 and CDH4 Quick VM installation on your local machine
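For the pseudo-distributed installation assignment, the minimal Hadoop 1.x configuration typically looks like the following sketch; the port and the localhost URI are the commonly used defaults, not mandatory values:

```xml
<!-- core-site.xml -->
<configuration>
  <!-- URI clients use to reach the NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <!-- Single node, so store only one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```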
Any production cluster larger than 20-30 nodes requires a full-time admin. This admin is responsible for:
– the performance and availability of the cluster, the data it contains, and the jobs that run there
– deployment, upgrades, troubleshooting, configuration, tuning, job management, installing tools, architecting processes, monitoring, backups, recovery, and more
Virtually no organization runs a production Hadoop cluster without a full-time admin. The fact that Cloudera offers a Hadoop Administrator certification and that O’Reilly sells a book called “Hadoop Operations” demonstrates the importance of the Hadoop Cluster Administrator role in industry.