Understanding Hadoop

by Ahmed Ossama
Agenda
● Introduction to Big Data
● Hadoop
● HDFS
● MapReduce and YARN
● Hadoop Ecosystem
● Planning and Installing Hadoop Clusters
● Writing Simple Streaming Jobs
● Demo
● Q&A
Commodore Amiga 500 (1990)
Memory: 512K

Atari ST (1985)
Memory: 512K

Macintosh (1984)
Memory: 128K

Wait a sec… Are we in the 80’s?!
● 30 Billion pieces of content were added to Facebook this past month by more than 600 million users
● 2.7 billion likes made daily on and off of the Facebook site
● More than 2.5 Billion videos were watched on YouTube… Yesterday!
● 1.2 million deliveries per second
● 35 billion searches were performed last month on Twitter

What are the volumes of data that we are seeing today?
What does the future look like?
● Worldwide IP traffic will quadruple by 2015.
● Nearly 3 billion people will be online, pushing the data created and shared to nearly 8 zettabytes.
○ Zettabyte = 1024 Exabytes = 1024^2 Petabytes = 1024^3 Terabytes = 1024^4 Gigabytes = 1024^5 Megabytes = 1024^6 Kilobytes
○ 8 ZB = 9,223,372,036,854,775,808 KB
● Two-thirds of surveyed businesses in North America said big data will become a concern for them within the next five years.
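The unit arithmetic above is easy to check: with binary units, every step from kilobytes up to zettabytes is a factor of 1024, and there are six such steps. A one-line Python sanity check:

```python
# One zettabyte expressed in kilobytes: six factors of 1024
# (KB -> MB -> GB -> TB -> PB -> EB -> ZB)
zb_in_kb = 1024 ** 6

# 8 ZB in KB, matching the figure on the slide
print(f"{8 * zb_in_kb:,}")  # 9,223,372,036,854,775,808
```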
Houston, we have a problem!!!

A new IDC study says the market for big data technology and services will grow from $3.2 billion in 2010 to $16.9 billion in 2015! That’s a compound annual growth rate of about 40%.
What is Big Data?
“When your data sets become so large that you have to start innovating to collect, store, organize, analyze and share”
From WWW to VVV
● Volume
○ data volumes are becoming unmanageable
● Variety
○ data complexity is growing
○ more types of data are captured than previously
● Velocity
○ some data is arriving so rapidly that it must either be processed instantly, or lost
○ this is a whole subfield called “stream processing”
Sources of Data
Computer Generated
● Application server logs (websites, games)
● Sensor data (weather, water, smart grids)
● Images/Videos (traffic surveillance, security cameras)

Human Generated
● Twitter/Facebook
● Blogs/Reviews/Emails
● Images/Videos
● Social Graphs: Facebook, Linkedin
Types of Data
● Relational Data (Tables/Transaction/Legacy Data)
● Text Data (Web)
● Semi-structured Data (XML)
● Graph Data
○ Social Network, Semantic Web (RDF), …
● Streaming Data
What to do with all this data?
● Aggregation and Statistics
○ Data warehouse and OLAP

● Indexing, Searching, and Querying
○ Keyword based search
○ Pattern matching (XML/RDF)

● Knowledge discovery
○ Machine Learning
○ Data Mining
○ Statistical Modeling
If RDBMS are not enough, what is?
Hadoop!
Hadoop - inspired by Google
● Apache Hadoop project
○ inspired by Google MapReduce implementation
and Google File System papers
● An open-source, flexible and available architecture for large-scale computation and data processing on a network of commodity hardware
● Open Source Software + Commodity Hardware
○ IT Cost Reduction
Hadoop Concepts
● Distribute the data as it is initially stored in the system
● Bring the processing to the data
● Users can focus on developing applications
Hadoop Versions
● Hadoop version 1 (HDFS + MapReduce)
○ hadoop-1.2.X
● Hadoop version 2 (HDFS + MR2 + YARN)
○ hadoop-2.2.X
○ hadoop-0.23.X
■ same as 2.2.X but missing NN HA
Enterprise Hadoop
● Cloudera
○ Oldest company providing enterprise Hadoop
○ CDH
○ Cloudera Manager
● Hortonworks
○ Forked from Yahoo! Hadoop team
○ Biggest contributor to Hadoop
○ HDP (Hortonworks Data Platform)
● MapR
Hadoop Components
● Two core components
○ Hadoop Distributed Filesystem
○ MapReduce Software Framework
● Components around Hadoop
○ Often referred to as the ‘Hadoop Ecosystem’
○ Pig, Hive, HBase, Flume, Oozie, Sqoop
Hadoop Components: HDFS
● HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
● Two roles:
○ Namenode (NN): Records metadata
○ Datanode (DN): Stores data
HDFS Features
● Highly fault tolerant
● Commodity Hardware = Node Failure
● Rack Awareness
● Large Datasets
HDFS Structure
HDFS has a master/slave architecture for the filesystem structure. It has two main layers:
● Namespace, which consists of directories, files and blocks. It supports the file system operations.
● Block storage service, which offers Block Management and Storage:
○ The Block Management service, provided by the NN, supports block-related operations, maintains block locations and manages block replicas.
○ The Storage service, provided by the DNs, allows read/write access to blocks on the local storage of the node.
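To make the block model concrete, here is a small Python sketch (plain Python, not Hadoop code; the 64 MB block size and replication factor of 3 used here are the classic defaults, both configurable in a real cluster) of how the NN carves a file into blocks and how many replicas the DNs end up storing:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # classic default HDFS block size (bytes)
REPLICATION = 3                # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    # The NN records one block per block_size chunk; the last block may be smaller
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append(min(block_size, file_size - offset))
        offset += block_size
    return blocks

# A hypothetical 200 MB file: three full 64 MB blocks plus one 8 MB tail block
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))                # 4 blocks
print(len(blocks) * REPLICATION)  # 12 block replicas spread across the DNs
```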
HDFS: How are files stored?
File System Read Operations
1. Client contacts the NameNode, indicating the file it wants to read
2. Client identity is validated and checked against the owner and permissions of the file
3. The NameNode responds with the list of DataNodes that host replicas of the blocks of the file
4. The client contacts the DataNodes, based on the topology that was provided by the NameNode, and requests the transfer of the desired block
File System Write Operations
1. Client asks the NameNode to choose DataNodes to host replicas of the first block of the file
2. The NameNode grants permissions to the client and responds with a list of DataNodes for the block
3. The client organizes a pipeline from node to node and sends the data
4. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block
5. The NameNode responds with a new list of DataNodes, which is likely to be different
6. The client organizes a new pipeline and sends the further blocks of the file
Hadoop Components: MapReduce
● Programming model for processing and generating large data sets
● Computation takes some input data, which gets mapped using code written by the user
● The mapped data then gets reduced using another piece of code written by the user
● It works like a pipeline:
$ cat file | grep something | sort | uniq -c
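The pipeline analogy can be sketched in plain Python as a local, single-process simulation (the function names here are illustrative, not Hadoop APIs): map each input line to key/value pairs, sort and group by key, then reduce each group.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Like a mapper writing to stdout: emit a (word, 1) pair per word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Like the sort | uniq -c stage: group sorted pairs by key, sum counts
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)  # e.g. "the" occurs 3 times, "fox" twice
```

Hadoop runs many such map and reduce processes in parallel across the cluster; the sort-and-group step in the middle is the framework's shuffle.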
MapReduce Features
● Automatic parallelization and distribution
● Automatic re-execution on failure
● Locality Optimizations
● Abstracts the “housekeeping” away from the developer
○ Developers concentrate on writing MapReduce functions

(Diagram: JobTracker, TaskTracker, MapReduce, History Server)
MapReduce Features
● TaskTracker is a node in the cluster that accepts tasks (Map, Shuffle and Reduce operations) from a JobTracker
● JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster
● History Server allows the user to get status on finished applications. Currently it only supports MapReduce and provides information on finished jobs
Hadoop Components: YARN (MR2)
The new architecture, introduced in hadoop-0.23.x and hadoop-2.x, divides the two major functions of the JobTracker (resource management and job life-cycle management) into separate components.
YARN Architecture
YARN Components
● ResourceManager (RM) is the ultimate authority that arbitrates resources among all the applications in the system.
● NodeManager (NM) is the per-machine framework agent that is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
● ApplicationsManager (ASM) is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
Hadoop Ecosystem
● Hadoop has become the kernel of the distributed operating system for Big Data
● No one uses the kernel alone
● A collection of projects at Apache
Hadoop Components: HBase
● Low-latency, distributed, non-relational database built on top of HDFS
● Inspired by Google’s Bigtable
● Data is stored in semi-columnar format partitioned by rows into regions
● Typically a single table accommodates hundreds of terabytes
Hadoop Components: Sqoop
● Exchanges data with relational databases
● Short for “SQL to Hadoop”
● Performs bidirectional transfer between Hadoop and almost any database with a JDBC driver
● Includes native connectors for MySQL and PostgreSQL
● Free connectors exist for Teradata, Netezza, SQL Server and Oracle
Hadoop Components: Flume
● Streaming data collection and aggregation system designed to transport massive volumes of data into systems such as Hadoop
● Simplifies reliable streaming data delivery from a variety of sources including RPC services, log4j appenders and syslog
● Data can be routed, load-balanced, replicated to multiple destinations and aggregated from thousands of hosts
Hadoop Components: Pig
● Created to simplify the authoring of MapReduce jobs, so there is no need to write Java code
● Users write data processing jobs in a high-level scripting language, from which Pig builds an execution plan and executes a series of MapReduce jobs
● Developers can extend its set of built-in operations by writing user-defined functions in Java
Hadoop Components: Hive
● Creates a relational database-style abstraction that allows the developer to write a dialect of SQL
● Hive’s dialect of SQL is called HiveQL and implements only a subset of the common standards
● Hive works by defining a table-like schema over an existing set of files in HDFS
Hadoop Components: Oozie
● Workflow engine and scheduler built specifically for large-scale job orchestration on Hadoop
● Workflows can be triggered by time or by events such as data arriving in a directory
● Major flexibility (start, stop, suspend and re-run jobs)
Hadoop Components: Hue
● Hadoop User Experience
● Apache Open Source project
● HUE is a web UI for Hadoop
● Platform for building custom applications with a nice UI library
Hadoop Components: Mahout
● Distributed and scalable machine learning algorithms on the Hadoop platform
● Makes building intelligent applications easier and faster
Hadoop Components: ZooKeeper
● Centralized service for:
○ Maintaining configuration information
○ Providing distributed synchronization
● Designed to store coordination data:
○ Status Information
○ Configuration
○ Location Information
● Implements reliable messaging and redundant services
Planning and Installing Hadoop Clusters
Picking a Distribution and Version
● Apache Hadoop Version
○ 1.2.X
○ 2.2.X
● Choosing a distribution
○ HDP
○ Cloudera
● What should I Use?
Hardware Selection
● Master Hardware Selection
○ NameNode considerations
○ Secondary NameNode considerations
● Worker Hardware Selection
○ CPU, RAM and Storage
● Cluster Sizing
○ Small clusters < 20 nodes
○ Midline configuration (2x6 core, 64 GB, 12x3 TB)
○ High end configuration (2x6 core, 96 GB, 24x1 TB)
OS Selection and Preparation
● Deployment layout
○ Hadoop home
○ DataNode data directories
○ NameNode directories
● Software
○ Java, cron, ntp, ssh, rsync, postfix/sendmail
● Hostnames, DNS and Identification
● Users, Groups, and Privileges
Network Design
● 2-tier tree network: a Core switch layer connects the top-of-rack (TOR) switches, each of which serves a set of hosts
● 3-tier tree network: a Core layer connects Aggregation switches, which in turn connect the TOR switches and their hosts
Simple Streaming Jobs
How Streaming Works
The mapper and the reducer read their input from stdin (line by line) and emit their output to stdout.
● Each mapper task will launch the executable as a separate process
● It converts its inputs into lines and feeds the lines to the stdin of the process
● The mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair
● Each reducer task will launch the executable as a separate process
● It converts its input key/value pairs into lines and feeds the lines to the stdin of the process
● The reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair
The Input
192.168.100.4 - 05/Nov/2013:00:15:46 - "Get /"
192.168.100.20 - 05/Nov/2013:00:17:46 - "Get /"
192.168.100.20 - 05/Nov/2013:00:18:00 - "Get /about"
192.168.100.41 - 05/Nov/2013:00:18:00 - "Get /feedback"
192.168.100.9 - 05/Nov/2013:00:19:23 - "Get /"
192.168.100.201 - 05/Nov/2013:00:20:00 - "Get /about"
192.168.100.201 - 05/Nov/2013:00:20:31 - "Get /"
192.168.100.4 - 05/Nov/2013:00:21:46 - "Get /"
What do we want to do?
We want to extract how many hits each page receives. Filtering the lines above should yield:
'/': 5
'/about': 2
'/feedback': 1
The Mapper
#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    chomp;
    my ($ip, $date, $action) = split('-', $_);
    $action =~ s/^ "Get (.*)"$/$1/;
    print "$action\t1\n";
}
The Reducer
#!/usr/bin/perl
use strict;
use warnings;

my %actions;

while (<>) {
    chomp;
    my ($action, $count) = split("\t", $_);
    if (exists $actions{$action}) {
        $actions{$action} = $actions{$action} + $count;
    } else {
        $actions{$action} = $count;
    }
}

foreach my $c (sort { $a cmp $b } keys %actions) {
    print "'$c': $actions{$c}\n";
}
The Output
Now redirecting ‘log’ to Mapper.pl and piping the output to Reducer.pl yields:
$ perl Mapper.pl < log | perl Reducer.pl
'/': 5
'/about': 2
'/feedback': 1
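Because Hadoop Streaming only cares about stdin and stdout, the same job could be written in any language. Here is a hypothetical Python equivalent of the two Perl scripts (function names and the log-parsing details are illustrative, assuming the " - " field separator used in the sample log):

```python
from collections import Counter

def mapper(lines):
    # Equivalent of Mapper.pl: emit "<page>\t1" for every request line
    for line in lines:
        ip, date, action = (f.strip() for f in line.rstrip("\n").split(" - "))
        page = action.replace('"Get ', '').rstrip('"')
        yield f"{page}\t1"

def reducer(lines):
    # Equivalent of Reducer.pl: sum the counts per page, print sorted
    counts = Counter()
    for line in lines:
        page, count = line.rstrip("\n").split("\t")
        counts[page] += int(count)
    for page in sorted(counts):
        yield f"'{page}': {counts[page]}"

log = [
    '192.168.100.4 - 05/Nov/2013:00:15:46 - "Get /"',
    '192.168.100.20 - 05/Nov/2013:00:18:00 - "Get /about"',
    '192.168.100.4 - 05/Nov/2013:00:21:46 - "Get /"',
]
for out in reducer(mapper(log)):
    print(out)  # '/': 2 then '/about': 1
```

In a real streaming job each function would be its own script reading sys.stdin, passed via -mapper and -reducer just like the Perl versions.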
Running over Hadoop
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar 
-input myInputDirs 
-output myOutputDir 
-mapper /home/ahmed/Mapper.pl 
-reducer /home/ahmed/Reducer.pl
Demo
Thank You
Q&A

Mais conteúdo relacionado

Mais procurados

Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation WorkflowsSCAPE Project
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiersRim Moussa
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Doug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemDoug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemCloudera, Inc.
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemMahabubur Rahaman
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceCsaba Toth
 
Giving MongoDB a Way to Play with the GIS Community
Giving MongoDB a Way to Play with the GIS CommunityGiving MongoDB a Way to Play with the GIS Community
Giving MongoDB a Way to Play with the GIS CommunityMongoDB
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 
Twitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud StorageTwitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud Storagelohitvijayarenu
 
Parallel Sequence Generator
Parallel Sequence GeneratorParallel Sequence Generator
Parallel Sequence GeneratorRim Moussa
 

Mais procurados (20)

002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiers
 
Hadoop Ecosystem Overview
Hadoop Ecosystem OverviewHadoop Ecosystem Overview
Hadoop Ecosystem Overview
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
parallel OLAP
parallel OLAPparallel OLAP
parallel OLAP
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Doug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemDoug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop Ecosystem
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Asd 2015
Asd 2015Asd 2015
Asd 2015
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Giving MongoDB a Way to Play with the GIS Community
Giving MongoDB a Way to Play with the GIS CommunityGiving MongoDB a Way to Play with the GIS Community
Giving MongoDB a Way to Play with the GIS Community
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Twitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud StorageTwitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud Storage
 
Parallel Sequence Generator
Parallel Sequence GeneratorParallel Sequence Generator
Parallel Sequence Generator
 

Semelhante a Understanding Hadoop

Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersKumari Surabhi
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapItai Yaffe
 
Juniper Innovation Contest
Juniper Innovation ContestJuniper Innovation Contest
Juniper Innovation ContestAMIT BORUDE
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 

Semelhante a Understanding Hadoop (20)

Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
hadoop
hadoophadoop
hadoop
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Juniper Innovation Contest
Juniper Innovation ContestJuniper Innovation Contest
Juniper Innovation Contest
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 

Último

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Último (20)

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Designing IA for AI - Information Architecture Conference 2024

Understanding Hadoop

  • 2. Agenda
  ● Introduction to Big Data
  ● Hadoop
  ● HDFS
  ● MapReduce and YARN
  ● Hadoop Ecosystem
  ● Planning and Installing Hadoop Clusters
  ● Writing Simple Streaming Jobs
  ● Demo
  ● Q&A
  • 3. Commodore Amiga 500 (1990): 512K memory. Atari ST (1985): 512K memory. Macintosh (1984): 128K memory. Wait a sec… Are we in the 80’s?!
  • 4. What are the volumes of data that we are seeing today?
  ● 30 billion pieces of content were added to Facebook this past month by more than 600 million users
  ● 2.7 billion likes are made daily on and off the Facebook site
  ● More than 2.5 billion videos were watched on YouTube… yesterday!
  ● 1.2 million deliveries per second
  ● 35 billion searches were performed last month on Twitter
  • 5. What does the future look like?
  ● Worldwide IP traffic will quadruple by 2015.
  ● Nearly 3 billion people will be online, pushing the data created and shared to nearly 8 zettabytes.
    ○ Zettabyte = 1024 Exabytes = 1024^2 Petabytes = 1024^3 Terabytes = 1024^4 Gigabytes = 1024^5 Megabytes = 1024^6 Kilobytes
    ○ 8 ZB = 9,223,372,036,854,775,808 KB
  ● Two-thirds of surveyed businesses in North America said big data will become a concern for them within the next five years.
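The unit chain above is easy to verify with a quick calculation (binary prefixes: each unit is 1024 of the next smaller one):

```python
# Walking the chain ZB -> EB -> PB -> TB -> GB -> MB -> KB
# multiplies by 1024 six times, so 1 ZB = 1024**6 KB.
KB_PER_ZB = 1024 ** 6

print(8 * KB_PER_ZB)  # 8 ZB in KB -> 9223372036854775808
```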
  • 6. Houston, we have a problem! A new IDC study says the market for big data technology and services will grow from $3.2 billion in 2010 to $16.9 billion in 2015. That’s a compound annual growth rate of about 40%.
  • 7. What is Big Data? “When your data sets become so large that you have to start innovating to collect, store, organize, analyze and share”
  • 8. From WWW to VVV
  ● Volume
    ○ data volumes are becoming unmanageable
  ● Variety
    ○ data complexity is growing
    ○ more types of data are captured than previously
  ● Velocity
    ○ some data is arriving so rapidly that it must either be processed instantly, or lost
    ○ this is a whole subfield called “stream processing”
  • 10. Sources of Data
  Computer Generated:
  ● Application server logs (websites, games)
  ● Sensor data (weather, water, smart grids)
  ● Images/Videos (traffic surveillance, security cameras)
  Human Generated:
  ● Twitter/Facebook
  ● Blogs/Reviews/Emails
  ● Images/Videos
  ● Social Graphs: Facebook, LinkedIn
  • 11. Types of Data
  ● Relational Data (Tables/Transactions/Legacy Data)
  ● Text Data (Web)
  ● Semi-structured Data (XML)
  ● Graph Data: Social Networks, Semantic Web (RDF), …
  ● Streaming Data
  • 12. What to do with these data?
  ● Aggregation and Statistics
    ○ Data warehouse and OLAP
  ● Indexing, Searching, and Querying
    ○ Keyword-based search
    ○ Pattern matching (XML/RDF)
  ● Knowledge discovery
    ○ Machine Learning
    ○ Data Mining
    ○ Statistical Modeling
  • 13. If RDBMS are not enough, what is?
  • 15. Hadoop - inspired by Google
  ● Apache Hadoop project
    ○ inspired by the Google MapReduce and Google File System papers
  ● Open-source, flexible and available architecture for large-scale computation and data processing on a network of commodity hardware
  ● Open Source Software + Commodity Hardware
    ○ IT Cost Reduction
  • 16. Hadoop Concepts
  ● Distribute the data as it is initially stored in the system
  ● Bring the processing to the data
  ● Users can focus on developing applications
  • 17. Hadoop Versions
  ● Hadoop version 1 (HDFS + MapReduce)
    ○ hadoop-1.2.X
  ● Hadoop version 2 (HDFS + MR2 + YARN)
    ○ hadoop-2.2.X
    ○ hadoop-0.23.X
      ■ same as 2.2.X but missing NN HA
  • 18. Enterprise Hadoop
  ● Cloudera
    ○ Oldest company providing enterprise Hadoop
    ○ CDH
    ○ Cloudera Manager
  ● Hortonworks
    ○ Forked from the Yahoo! Hadoop team
    ○ Biggest contributor to Hadoop
    ○ HDP (Hortonworks Data Platform)
  ● MapR
  • 19. Hadoop Components
  ● Two core components
    ○ Hadoop Distributed Filesystem
    ○ MapReduce Software Framework
  ● Components around Hadoop
    ○ Often referred to as the ‘Hadoop Ecosystem’
    ○ Pig, Hive, HBase, Flume, Oozie, Sqoop
  • 20. Hadoop Components: HDFS
  ● HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
  ● Two roles:
    ○ NameNode (NN): records metadata
    ○ DataNode (DN): stores data
  • 21. HDFS Features
  ● Highly fault tolerant
  ● Commodity Hardware = Node Failure
  ● Rack Awareness
  ● Large Datasets
  • 22. HDFS Structure
  HDFS has a master/slave architecture with two main layers:
  ● Namespace, which consists of directories, files and blocks. It supports the file system operations.
  ● Block storage service, which offers block management and storage:
    ○ The block management service, provided by the NN, supports block-related operations, maintains block locations and manages block replicas.
    ○ The storage service, provided by the DN, allows read/write access to blocks on the local storage of the node.
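The split between the two layers can be pictured as a toy model (a Python sketch with illustrative names, not the real NameNode API): the namespace maps each file to an ordered list of block IDs, and block management maps each block ID to the DataNodes holding its replicas.

```python
# Toy model of the HDFS namespace vs. block-storage split.
# All names here are hypothetical, for illustration only.

namespace = {                      # NameNode namespace: file -> block IDs
    "/logs/access.log": ["blk_1", "blk_2"],
}

block_locations = {                # NameNode block management: block -> replica hosts
    "blk_1": ["dn1", "dn3", "dn7"],   # assuming the default replication factor of 3
    "blk_2": ["dn2", "dn3", "dn9"],
}

def read_plan(path):
    """For each block of a file, list the DataNodes a client could read from."""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(read_plan("/logs/access.log"))
```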
  • 23. HDFS: How are files stored?
  • 24. File System Read Operations
  1. Client contacts the NameNode indicating the file it wants to read
  2. Client identity is validated and checked against the owner and permissions of the file
  3. The NameNode responds with the list of DataNodes that host replicas of the blocks of the file
  4. The client contacts the DataNodes based on the topology provided by the NameNode and requests the transfer of the desired block
  • 25. File System Write Operations
  1. Client asks the NameNode to choose DataNodes to host replicas of the first block of the file
  2. The NameNode grants permission to the client and responds with a list of DataNodes for the block
  3. The client organizes a pipeline from node to node and sends the data
  4. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block
  5. The NameNode responds with a new list of DataNodes, which is likely to be different
  6. The client organizes a new pipeline and sends the further blocks of the file
  • 26. Hadoop Components: MapReduce
  ● Programming model for processing and generating large data sets
  ● Computation takes some input data, which then gets mapped using some code written by the user
  ● The mapped data then gets reduced using another piece of code written by the user
  ● It works like a pipeline:
    $ cat file | grep something | sort | uniq -c
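The pipeline analogy can be written out in miniature: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group. A Python sketch of a word count (illustrative only, not Hadoop's actual API):

```python
from itertools import groupby

lines = ["a b", "b c", "b a"]

# Map: emit (word, 1) for every word -- the `cat | grep` stage
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: sort and group pairs by key -- the `sort` stage
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])

# Reduce: sum the counts in each group -- the `uniq -c` stage
counts = {word: sum(v for _, v in group) for word, group in shuffled}
print(counts)  # {'a': 2, 'b': 3, 'c': 1}
```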
  • 27. MapReduce Features
  ● Automatic parallelization and distribution
  ● Automatic re-execution on failure
  ● Locality optimizations
  ● Abstracts the “housekeeping” away from the developer
    ○ Developers concentrate on writing MapReduce functions
  • 28. MapReduce Features
  ● The TaskTracker is a node in the cluster that accepts tasks (Map, Shuffle and Reduce operations) from a JobTracker
  ● The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster
  ● The History Server allows the user to get the status of finished applications. Currently it only supports MapReduce and provides information on finished jobs
  • 30. Hadoop Components: YARN (MR2) The new architecture, introduced in hadoop-0.23.x and hadoop-2.x, divides the two major functions of the JobTracker, resource management and job life-cycle management, into separate components.
  • 32. YARN Components
  ● The ResourceManager (RM) is the ultimate authority that arbitrates resources among all the applications in the system.
  ● The NodeManager (NM) is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.
  ● The ApplicationsManager (ASM) is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and restarting the ApplicationMaster container on failure.
  • 33. Hadoop Ecosystem
  ● Hadoop has become the kernel of the distributed operating system for Big Data
  ● No one uses the kernel alone
  ● A collection of projects at Apache
  • 34. Hadoop Components: HBase
  ● Low-latency, distributed, non-relational database built on top of HDFS
  ● Inspired by Google’s Bigtable
  ● Data is stored in a semi-columnar format, partitioned by rows into regions
  ● Typically a single table accommodates hundreds of terabytes
  • 35. Hadoop Components: Sqoop
  ● Exchanges data with relational databases
  ● Short for “SQL to Hadoop”
  ● Performs bidirectional transfer between Hadoop and almost any database with a JDBC driver
  ● Includes native connectors for MySQL and PostgreSQL
  ● Free connectors exist for Teradata, Netezza, SQL Server and Oracle
  • 36. Hadoop Components: Flume
  ● Streaming data collection and aggregation system designed to transport massive volumes of data into systems such as Hadoop
  ● Simplifies reliable streaming data delivery from a variety of sources, including RPC services, log4j appenders and syslog
  ● Data can be routed, load-balanced, replicated to multiple destinations and aggregated from thousands of hosts
  • 37. Hadoop Components: Pig
  ● Created to simplify the authoring of MapReduce jobs, with no need to write Java code
  ● Users write data processing jobs in a high-level scripting language, from which Pig builds an execution plan and executes a series of MapReduce jobs
  ● Developers can extend its set of built-in operations by writing user-defined functions in Java
  • 38. Hadoop Components: Hive
  ● Creates a relational database-style abstraction that allows the developer to write a dialect of SQL
  ● Hive’s dialect of SQL is called HiveQL and implements only a subset of the common SQL standards
  ● Hive works by defining a table-like schema over an existing set of files in HDFS
  • 39. Hadoop Components: Oozie
  ● Workflow engine and scheduler built specifically for large-scale job orchestration on Hadoop
  ● Workflows can be triggered by time or by events such as data arriving in a directory
  ● Major flexibility (start, stop, suspend and re-run jobs)
  • 40. Hadoop Components: Hue
  ● Hadoop User Experience
  ● Apache open source project
  ● Hue is a web UI for Hadoop
  ● Platform for building custom applications with a nice UI library
  • 41. Hadoop Components: Mahout
  ● Distributed and scalable machine learning algorithms on the Hadoop platform
  ● Makes building intelligent applications easier and faster
  • 42. Hadoop Components: ZooKeeper
  ● Centralized service for maintaining:
    ○ configuration information
    ○ distributed synchronization
  ● Designed to store coordination data:
    ○ status information
    ○ configuration
    ○ location information
  ● Implements reliable messaging and redundant services
  • 44. Picking a Distribution and Version
  ● Apache Hadoop version
    ○ 1.2.X
    ○ 2.2.X
  ● Choosing a distribution
    ○ HDP
    ○ Cloudera
  ● What should I use?
  • 45. Hardware Selection
  ● Master hardware selection
    ○ NameNode considerations
    ○ Secondary NameNode considerations
  ● Worker hardware selection
    ○ CPU, RAM and storage
  ● Cluster sizing
    ○ Small clusters: < 20 nodes
    ○ Midline configuration (2x6 core, 64 GB RAM, 12x3 TB disks)
    ○ High-end configuration (2x6 core, 96 GB RAM, 24x1 TB disks)
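When sizing workers, raw disk does not equal usable HDFS capacity: the default replication factor of 3 and headroom reserved for temporary and non-HDFS data shrink it considerably. A rough Python sketch (the 25% reserve fraction is an assumed figure, and the 20-node count is just an example):

```python
def usable_hdfs_tb(nodes, disks_per_node, disk_tb,
                   replication=3, non_hdfs_reserve=0.25):
    """Rough usable HDFS capacity in TB.

    replication=3 is the HDFS default; the reserve fraction for
    temporary/non-HDFS data is an assumption, tune it per cluster.
    """
    raw = nodes * disks_per_node * disk_tb
    return raw * (1 - non_hdfs_reserve) / replication

# 20 workers in the midline configuration above (12 x 3 TB each):
# 720 TB raw shrinks to 180 TB usable.
print(usable_hdfs_tb(20, 12, 3))  # 180.0
```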
  • 46. OS Selection and Preparation
  ● Deployment layout
    ○ Hadoop home
    ○ DataNode data directories
    ○ NameNode directories
  ● Software
    ○ Java, cron, ntp, ssh, rsync, postfix/sendmail
  ● Hostnames, DNS and identification
  ● Users, groups, and privileges
  • 47. Network Design [diagrams: a 2-tier tree network, with a core switch feeding top-of-rack (TOR) switches that feed the hosts, and a 3-tier tree network, with a core layer feeding aggregation switches that feed the TOR switches and hosts]
  • 49. How Streaming Works
  The mapper and the reducer read input from stdin (line by line) and emit output to stdout.
  ● Each mapper task launches the executable as a separate process
  ● It converts its inputs into lines and feeds the lines to the stdin of the process
  ● The mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair
  ● Each reducer task likewise launches the executable as a separate process
  ● It converts its input key/value pairs into lines and feeds the lines to the stdin of the process
  ● The reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair
  • 50. The Input
  192.168.100.4 - 05/Nov/2013:00:15:46 - "Get /"
  192.168.100.20 - 05/Nov/2013:00:17:46 - "Get /"
  192.168.100.20 - 05/Nov/2013:00:18:00 - "Get /about"
  192.168.100.41 - 05/Nov/2013:00:18:00 - "Get /feedback"
  192.168.100.9 - 05/Nov/2013:00:19:23 - "Get /"
  192.168.100.201 - 05/Nov/2013:00:20:00 - "Get /about"
  192.168.100.201 - 05/Nov/2013:00:20:31 - "Get /"
  192.168.100.4 - 05/Nov/2013:00:21:46 - "Get /"
  • 51. What do we want to do? We want to extract how many hits each page receives. Filtering the lines above should yield:
  '/': 5
  '/about': 2
  '/feedback': 1
  • 52. The Mapper
  #!/usr/bin/perl
  use strict;
  use warnings;

  while (<>) {
      chomp;
      my ($ip, $date, $action) = split('-', $_);
      $action =~ s/^ "Get (.*)"$/$1/;
      print "$action\t1\n";
  }
  • 53. The Reducer
  #!/usr/bin/perl
  use strict;
  use warnings;

  my %actions;

  while (<>) {
      chomp;
      my ($action, $count) = split("\t", $_);
      if (exists $actions{$action}) {
          $actions{$action} = $actions{$action} + $count;
      } else {
          $actions{$action} = $count;
      }
  }

  foreach my $c (sort { $a cmp $b } keys %actions) {
      print "'$c': $actions{$c}\n";
  }
  • 54. The Output
  Now redirecting ‘log’ to Mapper.pl and piping the output to Reducer.pl yields:
  $ perl Mapper.pl < log | perl Reducer.pl
  '/': 5
  '/about': 2
  '/feedback': 1
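Because streaming relies only on stdin and stdout, the same job could be written in any language. A Python sketch of the same mapper and reducer logic (log format as in the slides; function names are illustrative):

```python
import re
from collections import defaultdict

def map_line(line):
    """Mapper logic: pull the requested page out of a log line, emit 'page<TAB>1'."""
    ip, date, action = [field.strip() for field in line.split("-")]
    page = re.match(r'"Get (.*)"', action).group(1)  # same idea as the Perl regex
    return f"{page}\t1"

def reduce_lines(lines):
    """Reducer logic: sum the counts per page from 'page<TAB>count' lines."""
    counts = defaultdict(int)
    for line in lines:
        page, count = line.split("\t")
        counts[page] += int(count)
    return dict(sorted(counts.items()))

log = [
    '192.168.100.4 - 05/Nov/2013:00:15:46 - "Get /"',
    '192.168.100.20 - 05/Nov/2013:00:18:00 - "Get /about"',
    '192.168.100.4 - 05/Nov/2013:00:21:46 - "Get /"',
]
print(reduce_lines(map_line(l) for l in log))  # {'/': 2, '/about': 1}
```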
  • 55. Running over Hadoop
  $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
      -input myInputDirs \
      -output myOutputDir \
      -mapper /home/ahmed/Mapper.pl \
      -reducer /home/ahmed/Reducer.pl
  • 56. Demo