Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 1 of 5
Course Outline
What is Hadoop?
• Open-source data storage and processing API
• Massively scalable, automatically parallelizable
  - Based on work from Google
    - GFS + MapReduce + BigTable
  - Current distributions based on open source and vendor work
    - Apache Hadoop
    - Cloudera – CDH4 w/ Impala
    - Hortonworks
    - MapR
    - AWS
    - Windows Azure HDInsight
Why Use Hadoop?
• Cheaper
  - Scales to Petabytes or more
• Faster
  - Parallel data processing
• Better
  - Suited for particular types of BigData problems
What types of business problems for Hadoop?
Source: Cloudera “Ten Common Hadoopable Problems”
Companies Using Hadoop
• Facebook
• Yahoo
• Amazon
• eBay
• American Airlines
• The New York Times
• Federal Reserve Board
• IBM
• Orbitz
Forecast growth of Hadoop Job Market
Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
Hadoop is a set of Apache Frameworks and more…
• Data storage (HDFS)
  - Runs on commodity hardware (usually Linux)
  - Horizontally scalable
• Processing (MapReduce)
  - Parallelized (scalable) processing
  - Fault Tolerant
• Other Tools / Frameworks
  - Data Access
    - HBase, Hive, Pig, Mahout
  - Tools
    - Hue, Sqoop
  - Monitoring
    - Greenplum, Cloudera
Hadoop Core - HDFS
MapReduce API
Data Access
Tools & Libraries
Monitoring & Alerting
What are the core parts of a Hadoop distribution?
Hadoop Cluster HDFS (Physical) Storage
MapReduce Job – Logical View
Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
Hadoop Ecosystem
Common Hadoop Distributions
• Open Source
  - Apache
• Commercial
  - Cloudera
  - Hortonworks
  - MapR
  - AWS MapReduce
  - Microsoft HDInsight (Beta)
A View of Hadoop (from Hortonworks)
Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
Setting up Hadoop Development
Demo – Setting up Cloudera Hadoop
Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 2 of 5
So, what’s the problem?
• “I can just use some ‘SQL-like’ language to query Hadoop, right?”
• “Yeah, SQL-on-Hadoop… that’s what I want.”
• “I don’t want to learn a new query language and…”
• “I want massive scale for my shiny, new BigData.”
Ways to MapReduce
Libraries Languages
Note: Java is most common, but other languages can be used
Demo – Using Hive QL on CDH4
What is Hive?
• a data warehouse system for Hadoop that
  - facilitates easy data summarization
  - supports ad-hoc queries (still batch though…)
  - was created by Facebook
• a mechanism to project structure onto this data and query the data using a SQL-like language – HiveQL
  - Interactive console –or–
  - Execute scripts
  - Kicks off one or more MapReduce jobs in the background
• an ability to use indexes, built-in functions, and user-defined functions
Is HQL == ANSI SQL? – NO!
-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)
SELECT a.*
FROM a
JOIN b ON (a.id <> b.id)
Note: Joins are quite different in MapReduce, more on that coming up…
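One plausible way to see why equality joins are the natural fit: a MapReduce join is commonly built by using the join key as the shuffle key, so matching rows from both tables arrive in the same reduce call. The sketch below is plain Python with no Hadoop involved; the table contents and the function names (map_rows, reduce_side_join) are made up for illustration.

```python
from collections import defaultdict
from itertools import product

def map_rows(tag, rows):
    """Map: emit (join_key, (tag, row)) so both tables share one key space."""
    for row in rows:
        yield row["id"], (tag, row)

def reduce_side_join(table_a, table_b):
    # Shuffle: group every tagged row by its join key.
    groups = defaultdict(list)
    tagged_rows = list(map_rows("a", table_a)) + list(map_rows("b", table_b))
    for key, tagged in tagged_rows:
        groups[key].append(tagged)

    # Reduce: within one key, pair every 'a' row with every 'b' row.
    joined = []
    for key in sorted(groups):
        a_rows = [row for tag, row in groups[key] if tag == "a"]
        b_rows = [row for tag, row in groups[key] if tag == "b"]
        joined.extend(product(a_rows, b_rows))
    return joined

# Hypothetical sample tables.
a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
b = [{"id": 2, "city": "p"}, {"id": 3, "city": "q"}]
result = reduce_side_join(a, b)
print(result)
```

A condition such as a.id <> b.id offers no single key to group on, so every row of a would have to meet every row of b.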
Preparing for MapReduce
Common Hadoop Shell Commands
hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>
Tips
-- ‘sudo’ means ‘run as administrator’ (super user)
-- some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths to Hadoop differ for the former, see the link included for more detail
Demo – Working with Files and HDFS
Thinking in MapReduce
• Hint: “It’s Functional”
Understanding MapReduce – P1/3
• Map>>
  - (K1, V1) ->
    - Info in
    - Input Split
  - list (K2, V2)
    - Key / Value out (intermediate values)
    - One list per local node
    - Can implement local Reducer (or Combiner)
Understanding MapReduce – P2/3
• Shuffle/Sort>>
Understanding MapReduce – P3/3
• Reduce
  - (K2, list(V2)) ->
    - Shuffle / Sort phase precedes Reduce phase
    - Combines Map output into a list
  - list (K3, V3)
    - Usually aggregates intermediate values
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
MapReduce Example - WordCount
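The word-count flow can be sketched in plain Python to make the three phases concrete. This is a single-process illustration of map, shuffle/sort, and reduce, not Hadoop code; the function names and the sample lines are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: one input line -> list of (word, 1) intermediate pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_sort(pairs):
    """Shuffle/Sort: group values by key and return keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Reduce: aggregate the list of counts for one word."""
    return key, sum(values)

def word_count(lines):
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_sort(intermediate)]

counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts)
```

On a real cluster the map calls run in parallel on separate input splits and the framework performs the grouping between nodes; only the shape of the data at each step carries over from this sketch.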
MapReduce Objects
Each daemon spawns a new JVM
Demo – Running MapReduce WordCount
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 3 of 5
Ways to run MapReduce Jobs
• Configure JobConf options
• From Development Environment (IDE)
• From a GUI utility
  - Cloudera – Hue
  - Microsoft Azure – HDInsight console
• From the command line
  - hadoop jar <filename.jar> input output
Setting up Hadoop On Windows Azure
• About HDInsight
Demo – MapReduce in the Cloud
• WordCount MapReduce using HDInsight
MapReduce (WordCount) with JavaScript
Note: JavaScript is part of the Azure Hadoop distribution
Common Data Sources for MapReduce Jobs
Where is your Data coming from?
• On premises
  - Local file system
  - Local HDFS instance
• Private Cloud
  - Cloud storage
• Public Cloud
  - Input Storage buckets
  - Script / Code buckets
  - Output buckets
Common Data Jobs for MapReduce
Demo – Other Types of MapReduce
Tip: Review the Java MapReduce code in these samples as well.
Methods to write MapReduce Jobs
• Typical – usually written in Java
  - MapReduce 2.0 API
  - MapReduce 1.0 API
• Streaming
  - Uses stdin and stdout
  - Can use any language to write Map and Reduce Functions
    - C#, Python, JavaScript, etc…
• Pipes
  - Often used with C++
• Abstraction libraries
  - Hive, Pig, etc… write in a higher level language, generate one or more MapReduce jobs
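As a rough illustration of the Streaming style, the sketch below writes the mapper and reducer as functions over lines of text. In a real Streaming job each would be a separate script that reads sys.stdin and prints to stdout; the run_locally helper is a hypothetical stand-in for the framework's sort step, added here only so the example runs on its own.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' per word; by default Streaming treats the text
    before the first tab as the key."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum counts per word. Input must already be sorted by key, which is
    how the reducer receives it in a Streaming job."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def run_locally(lines):
    """Hypothetical local stand-in for the framework: map, sort, reduce."""
    return list(reducer(sorted(mapper(lines))))

output = run_locally(["a b a", "b c"])
print("\n".join(output))
```

Because the contract is only lines in and lines out, the same two functions could be rewritten in any language that can read standard input.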
Demo – MapReduce via C# & PowerShell
Using AWS MapReduce
Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
What is Pig?
• ETL Library for HDFS developed at Yahoo
  - Pig Runtime
  - Pig Language
  - Generates MapReduce Jobs
• ETL steps
  - LOAD <file>
  - FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
  - DUMP {to screen for testing}, STORE <newFile>
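The ETL steps read as a data-flow pipeline: load, filter, group, count, then show or store. A rough Python analogue is sketched below. It is not Pig Latin, and the input records, field names, and helper functions are all invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical input standing in for LOAD <file>.
RAW = """user,status
ann,200
bob,404
ann,200
cat,500
bob,200
"""

def load(text):
    """LOAD: parse delimited text into records."""
    return list(csv.DictReader(io.StringIO(text)))

def filter_rows(rows, predicate):
    """FILTER: keep only records that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def group_and_count(rows, field):
    """GROUP BY + FOREACH ... GENERATE COUNT: count records per group."""
    return sorted(Counter(row[field] for row in rows).items())

rows = load(RAW)
ok_rows = filter_rows(rows, lambda r: r["status"] == "200")
hits_per_user = group_and_count(ok_rows, "user")
print(hits_per_user)  # DUMP: print to screen for testing
```

In Pig each of these steps is a statement in a script, and the runtime turns the whole flow into one or more MapReduce jobs.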
MapReduce Python Sample
Remember that white space matters in Python!
Demo – Using AWS MapReduce with Pig
Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
AWS Data Pipeline with HIVE
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 4 of 5
Better MapReduce - Optimizations
Optimization BEFORE running a MapReduce Job
More about Input File Compression
• From Cloudera…
• Their version of LZO is ‘splittable’
Type | File    | Size GB | Compress | Decompress
None | Log     | 8.0     | -        | -
Gzip | Log.gz  | 1.3     | 241      | 72
LZO  | Log.lzo | 2.0     | 55       | 35
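The trade-off in the table can be made explicit with a little arithmetic. The sketch below copies the figures from the table. The slide does not state the units of the Compress and Decompress columns, so they are treated only as relative costs.

```python
# Figures copied from the table; units of the compress/decompress columns
# are not given on the slide, so they are compared only relative to each other.
ROWS = {
    "Gzip": {"size_gb": 1.3, "compress": 241, "decompress": 72},
    "LZO": {"size_gb": 2.0, "compress": 55, "decompress": 35},
}
UNCOMPRESSED_GB = 8.0

def summarize(rows, original_gb):
    """Return the size reduction factor per codec alongside its costs."""
    return {
        codec: {
            "ratio": round(original_gb / r["size_gb"], 2),
            "compress": r["compress"],
            "decompress": r["decompress"],
        }
        for codec, r in rows.items()
    }

summary = summarize(ROWS, UNCOMPRESSED_GB)
gzip_vs_lzo_compress = round(ROWS["Gzip"]["compress"] / ROWS["LZO"]["compress"], 2)
print(summary)
print(gzip_vs_lzo_compress)
```

Gzip shrinks the log more, while LZO costs far less to compress and decompress, and a splittable format lets several mappers work on one file.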
Optimization WITHIN a MapReduce Job
Mapper Task Optimization
Data Types
• Writable
  - Text (String)
  - IntWritable
  - LongWritable
  - FloatWritable
  - BooleanWritable
• WritableComparable for keys
• Custom Types supported – write RawComparator
Reducer Task Optimization
MapReduce Job Optimization
Demo – Unit Testing MapReduce
• Using MRUnit + Asserts
• Optionally using ApprovalTests
Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
A note about MapReduce 2.0
• Splits the existing JobTracker’s roles
  - resource management
  - job lifecycle management
• MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as
  - better scalability through distributed job lifecycle management
  - support for multiple Hadoop MapReduce API versions in a single cluster
What is Mahout?
• Library with common machine learning algorithms
• Over 20 algorithms
  - Recommendation (likelihood – Pandora)
  - Classification (known data and new data – spam id)
  - Clustering (new groups of similar data – Google news)
• Can non-statisticians find value using this library?
Mahout Algorithms
Setting up Hadoop on Windows
• For local development
• Install binaries from Web Platform Installer
• Install .NET Azure SDK (for Azure BLOB storage)
• Install other tools
  - Neudesic Azure Storage Viewer
Demo – Mahout
• Using HDInsight
What about the output?
Clients (Visualizations) for HDFS
• Many clients use Hive
  - Often included in GUI console tools for Hadoop distributions as well
• Microsoft includes clients in Office (Excel 2013)
  - Direct Hive client
  - Connect using ODBC
  - PowerPivot – data mashups and presentation
  - Data Explorer – connect, transform, mashup and filter
  - Hadoop SDK on Codeplex
• Other popular clients
  - Qlikview
  - Tableau
  - Karmasphere
Demo – Executing Hive Queries
Demo – Using HDFS output in Excel 2013
To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
About Visualization
Demo – New Visualizations – D3
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 5 of 5
Limitations of MapReduce
Comparing: RDBMS vs. Hadoop
                    | Traditional RDBMS       | Hadoop / MapReduce
Data Size           | Gigabytes (Terabytes)   | Petabytes (Exabytes)
Access              | Interactive and Batch   | Batch – NOT Interactive
Updates             | Read / Write many times | Write once, Read many times
Structure           | Static Schema           | Dynamic Schema
Integrity           | High (ACID)             | Low
Scaling             | Nonlinear               | Linear
Query Response Time | Can be near immediate   | Has latency (due to batch processing)
Microsoft alternatives to MapReduce
• Use existing relational system
  - Scale via cloud or edition (i.e. Enterprise or PDW)
• Use in memory OLAP
  - SQL Server Analysis Services Tabular Models
• Use “productized” Dremel
  - Microsoft Polybase – status = beta?
Looking Forward - Dremel or Apache Drill
• Based on original research from Google
Apache Drill Architecture
In-market MapReduce Alternatives
Cloudera
• Impala
Google
• BigQuery
Demo – Google’s BigQuery
• Dremel for the rest of us
Hadoop MapReduce Call to Action
More MapReduce Developer Resources
• Based on the distribution – on premises
  - Apache
    - MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
  - Cloudera
    - Cloudera University - http://university.cloudera.com/
    - Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
  - Hortonworks
  - MapR
• Based on the distribution – cloud
  - AWS MapReduce
    - Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
  - Windows Azure HDInsight
    - Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
    - More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
The Changing Data Landscape

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 

Último (20)

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 

Hadoop MapReduce Fundamentals

  • 1. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 1 of 5
  • 3. What is Hadoop?
      ◦ Open-source data storage and processing API
      ◦ Massively scalable, automatically parallelizable
          ▪ Based on work from Google
          ▪ GFS + MapReduce + BigTable
      ◦ Current distributions based on open source and vendor work
          ▪ Apache Hadoop
          ▪ Cloudera – CDH4 w/ Impala
          ▪ Hortonworks
          ▪ MapR
          ▪ AWS
          ▪ Windows Azure HDInsight
  • 4. Why Use Hadoop?
      ◦ Cheaper
          ▪ Scales to petabytes or more
      ◦ Faster
          ▪ Parallel data processing
      ◦ Better
          ▪ Suited for particular types of Big Data problems
  • 5. What types of business problems for Hadoop? Source: Cloudera “Ten Common Hadoopable Problems”
  • 6. Companies Using Hadoop: Facebook, Yahoo, Amazon, eBay, American Airlines, The New York Times, Federal Reserve Board, IBM, Orbitz
  • 7. Forecast growth of Hadoop Job Market Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
  • 8. Hadoop is a set of Apache Frameworks and more…
      ◦ Data storage (HDFS)
          ▪ Runs on commodity hardware (usually Linux)
          ▪ Horizontally scalable
      ◦ Processing (MapReduce)
          ▪ Parallelized (scalable) processing
          ▪ Fault tolerant
      ◦ Other Tools / Frameworks
          ▪ Data Access: HBase, Hive, Pig, Mahout
          ▪ Tools: Hue, Sqoop
          ▪ Monitoring: Greenplum, Cloudera
      ◦ Diagram layers: Hadoop Core (HDFS), MapReduce API, Data Access, Tools & Libraries, Monitoring & Alerting
  • 9. What are the core parts of a Hadoop distribution?
  • 10. Hadoop Cluster HDFS (Physical) Storage
  • 11. MapReduce Job – Logical View Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  • 14. Common Hadoop Distributions
      ◦ Open Source
          ▪ Apache
      ◦ Commercial
          ▪ Cloudera
          ▪ Hortonworks
          ▪ MapR
          ▪ AWS MapReduce
          ▪ Microsoft HDInsight (Beta)
  • 15. A View of Hadoop (from Hortonworks) Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
  • 16. Setting up Hadoop Development
  • 17. Demo – Setting up Cloudera Hadoop Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
  • 18. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 2 of 5
  • 19. So, what’s the problem?
      ◦ “I can just use some ‘SQL-like’ language to query Hadoop, right?”
      ◦ “Yeah, SQL-on-Hadoop… that’s what I want.”
      ◦ “I don’t want to learn a new query language, and…”
      ◦ “I want massive scale for my shiny, new Big Data.”
  • 20. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  • 21. Demo – Using Hive QL on CDH4
  • 22. What is Hive?
      ◦ A data warehouse system for Hadoop that
          ▪ facilitates easy data summarization
          ▪ supports ad-hoc queries (still batch, though…)
          ▪ was created by Facebook
      ◦ A mechanism to project structure onto this data and query it using a SQL-like language, HiveQL
          ▪ Interactive console, or
          ▪ Execute scripts
          ▪ Kicks off one or more MapReduce jobs in the background
      ◦ An ability to use indexes, built-in and user-defined functions
  • 23. Is HQL == ANSI SQL? – NO!
        -- non-equality joins ARE allowed in ANSI SQL
        -- but are NOT allowed in Hive (HQL)
        SELECT a.* FROM a JOIN b ON (a.id <> b.id)
      Note: Joins are quite different in MapReduce; more on that coming up…
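One way to see why equality joins fit MapReduce naturally: the join key can serve as the shuffle key, so rows that match always meet in the same reduce call. A condition such as `a.id <> b.id` offers no single key to partition on. The following is a minimal in-memory sketch of a reduce-side equi-join in plain Python, not Hive or Hadoop code; the function names, the table tags `a` and `b`, and the `b_` column prefix are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import product


def map_side(table_name, rows, key_field):
    """Emit (join_key, (table_name, row)) pairs, as a mapper might."""
    for row in rows:
        yield row[key_field], (table_name, row)


def shuffle(pairs):
    """Group tagged rows by join key, standing in for shuffle/sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_join(groups):
    """For each key, pair every row tagged 'a' with every row tagged 'b'."""
    for _key, tagged in groups.items():
        left = [row for name, row in tagged if name == "a"]
        right = [row for name, row in tagged if name == "b"]
        for l_row, r_row in product(left, right):
            # Prefix columns from table b so they do not overwrite table a.
            yield {**l_row, **{f"b_{k}": v for k, v in r_row.items()}}


def equi_join(a_rows, b_rows, key_field="id"):
    """Inner equi-join of two lists of dicts on key_field."""
    pairs = list(map_side("a", a_rows, key_field))
    pairs += list(map_side("b", b_rows, key_field))
    return list(reduce_join(shuffle(pairs)))
```

Only keys present in both inputs produce output, which matches inner-join behaviour; a key seen in just one table yields an empty cross product.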
  • 25. Common Hadoop Shell Commands
        hadoop fs -cat file:///file2
        hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
        hadoop fs -copyFromLocal <fromDir> <toDir>
        hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
        sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
        hadoop fs -ls /user/hadoop/dir1
        hadoop fs -cat hdfs://nn1.example.com/file1
        hadoop fs -get /user/hadoop/file <localfile>
      Tips
        -- ‘sudo’ means ‘run as administrator’ (super user)
        -- some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths differ for the former, see the link included for more detail
  • 26. Demo – Working with Files and HDFS
  • 27. Thinking in MapReduce
      ◦ Hint: “It’s Functional”
  • 28. Understanding MapReduce – P1/3
      ◦ Map>>
          ▪ In: (K1, V1), read from an input split
          ▪ Out: list(K2, V2), key / value pairs (intermediate values)
          ▪ One list per local node
          ▪ Can implement a local Reducer (or Combiner)
  • 29. Understanding MapReduce – P2/3
      ◦ Map>> (as on slide 28)
      ◦ Shuffle/Sort>>
  • 30. Understanding MapReduce – P3/3
      ◦ Map>> (as on slide 28)
      ◦ Shuffle/Sort>>
      ◦ Reduce
          ▪ In: (K2, list(V2))
          ▪ The Shuffle / Sort phase precedes the Reduce phase and combines Map output into a list
          ▪ Out: list(K3, V3)
          ▪ Usually aggregates intermediate values
      ◦ (input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
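The map → combine → shuffle/sort → reduce flow on slides 28 to 30 can be sketched as a single-process word count. This is plain Python that only imitates the phases; it is not the Hadoop API, and names such as `map_fn`, `combine_fn` and `run_job` are assumptions made for illustration. Each inner list passed to `run_job` stands in for one input split.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """Map: (K1, V1) = (line offset, line text) -> list of (K2, V2) = (word, 1)."""
    for word in line.lower().split():
        yield word, 1


def combine_fn(pairs):
    """Combiner: a local reduce over one split's map output."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_sort(all_pairs):
    """Shuffle/sort: group every value by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_fn(key, values):
    """Reduce: (K2, list(V2)) -> (K3, V3) = (word, total count)."""
    return key, sum(values)


def run_job(splits):
    """Run all phases over a list of splits (each split is a list of lines)."""
    intermediate = []
    for split in splits:
        mapped = [kv for offset, line in enumerate(split)
                  for kv in map_fn(offset, line)]
        intermediate.extend(combine_fn(mapped))
    return [reduce_fn(key, values) for key, values in shuffle_sort(intermediate)]
```

The combiner is optional here, as it is in the slides: removing the `combine_fn` call changes how many intermediate pairs cross the shuffle, not the final counts.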
  • 32. MapReduce Objects Each daemon spawns a new JVM
  • 33. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  • 34. Demo – Running MapReduce WordCount
  • 35. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 3 of 5
  • 36. Ways to run MapReduce Jobs
      ◦ Configure JobConf options
      ◦ From Development Environment (IDE)
      ◦ From a GUI utility
          ▪ Cloudera – Hue
          ▪ Microsoft Azure – HDInsight console
      ◦ From the command line
          ▪ hadoop jar <filename.jar> input output
  • 37. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  • 38. Setting up Hadoop on Windows Azure
      ◦ About HDInsight
  • 39. Demo – MapReduce in the Cloud
      ◦ WordCount MapReduce using HDInsight
  • 40. MapReduce (WordCount) with JavaScript Note: JavaScript is part of the Azure Hadoop distribution
  • 41. Common Data Sources for MapReduce Jobs
  • 42. Where is your Data coming from?
      ◦ On premises
          ▪ Local file system
          ▪ Local HDFS instance
      ◦ Private Cloud
          ▪ Cloud storage
      ◦ Public Cloud
          ▪ Input Storage buckets
          ▪ Script / Code buckets
          ▪ Output buckets
  • 43. Common Data Jobs for MapReduce
  • 44. Demo – Other Types of MapReduce Tip: Review the Java MapReduce code in these samples as well.
  • 45. Methods to write MapReduce Jobs
      ◦ Typical – usually written in Java
          ▪ MapReduce 2.0 API
          ▪ MapReduce 1.0 API
      ◦ Streaming
          ▪ Uses stdin and stdout
          ▪ Can use any language to write Map and Reduce functions
          ▪ C#, Python, JavaScript, etc.
      ◦ Pipes
          ▪ Often used with C++
      ◦ Abstraction libraries
          ▪ Hive, Pig, etc.: write in a higher-level language, generate one or more MapReduce jobs
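As an illustration of the Streaming approach, here is a minimal word-count mapper and reducer that work on lines of text, assuming the common convention of tab-separated `key<TAB>value` records and a reducer that receives its input already sorted by key. The function names and the `main` helper are hypothetical; a real job would wire these to stdin and stdout in two small scripts and pass them to the streaming jar shipped with the distribution.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word on each input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; input lines must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(role, stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical entry point: role is 'map' or 'reduce'."""
    step = mapper if role == "map" else reducer
    for record in step(stdin):
        stdout.write(record + "\n")
```

Locally, the same pipeline can be approximated by piping the mapper output through a sort before the reducer, which is a convenient way to debug before submitting to a cluster.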
  • 46. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  • 47. Demo – MapReduce via C# & PowerShell
  • 48. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  • 49. Using AWS MapReduce Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  • 50. What is Pig?
      ◦ ETL library for HDFS developed at Yahoo
          ▪ Pig Runtime
          ▪ Pig Language
          ▪ Generates MapReduce jobs
      ◦ ETL steps
          ▪ LOAD <file>
          ▪ FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
          ▪ DUMP {to screen for testing}
          ▪ STORE <newFile>
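To give a feel for the shape of the ETL steps listed on this slide, the sketch below mirrors LOAD, FILTER, GROUP BY with COUNT, and STORE as plain Python functions over in-memory text. It is an analogy only, not Pig Latin and not how Pig executes; the two-column (user, url) record layout and all function names are assumptions chosen for the example.

```python
import csv
import io
from collections import Counter


def load(text):
    """LOAD: parse comma-separated text into tuples, skipping blank lines."""
    return [tuple(row) for row in csv.reader(io.StringIO(text)) if row]


def filter_rows(rows, predicate):
    """FILTER: keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def group_and_count(rows, key_index=0):
    """GROUP BY + FOREACH ... GENERATE COUNT: rows per key, sorted by key."""
    return sorted(Counter(row[key_index] for row in rows).items())


def store(rows):
    """STORE: serialize rows back to comma-separated text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

In Pig each of these steps is a relation in a script, and the runtime decides how to turn the chain into one or more MapReduce jobs; the Python version simply runs them eagerly in order.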
  • 51. MapReduce Python Sample Remember that white space matters in Python!
  • 52. Demo – Using AWS MapReduce with Pig Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  • 53. AWS Data Pipeline with HIVE
  • 54. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 4 of 5
  • 55. Better MapReduce - Optimizations
  • 56. Optimization BEFORE running a MapReduce Job
  • 57. More about Input File Compression
      ◦ From Cloudera…
      ◦ Their version of LZO is ‘splittable’

        Type   File      Size (GB)   Compress   Decompress
        None   Log       8.0         -          -
        Gzip   Log.gz    1.3         241        72
        LZO    Log.lzo   2.0         55         35
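The trade-off in the table is easier to read as ratios. The short sketch below recomputes them from the slide's figures, under the assumption that the Compress and Decompress columns are elapsed times in the same (unstated) unit, so only unitless ratios are reported. The `ROWS` layout and the `summarize` function are illustrative names, not part of any Hadoop tooling.

```python
# (type, file, size in GB, compress, decompress) as given on the slide.
ROWS = [
    ("None", "Log", 8.0, None, None),
    ("Gzip", "Log.gz", 1.3, 241, 72),
    ("LZO", "Log.lzo", 2.0, 55, 35),
]


def summarize(rows):
    """Return size reduction vs. the uncompressed file and LZO-vs-Gzip speedups."""
    by_type = {row[0]: row for row in rows}
    raw_size = by_type["None"][2]
    size_ratio = {
        name: round(raw_size / row[2], 2)
        for name, row in by_type.items() if name != "None"
    }
    gzip, lzo = by_type["Gzip"], by_type["LZO"]
    return {
        "size_ratio": size_ratio,
        # How many times faster LZO is than Gzip, assuming the columns are times.
        "lzo_compress_speedup": round(gzip[3] / lzo[3], 2),
        "lzo_decompress_speedup": round(gzip[4] / lzo[4], 2),
    }
```

On these figures Gzip produces the smaller file while LZO spends noticeably less effort compressing and decompressing, which is the trade-off the slide points at alongside splittability.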
  • 58. Optimization WITHIN a MapReduce Job
  • 61. Data Types
      ◦ Writable
          ▪ Text (String)
          ▪ IntWritable
          ▪ LongWritable
          ▪ FloatWritable
          ▪ BooleanWritable
      ◦ WritableComparable for keys
      ◦ Custom types supported – write a RawComparator
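The idea behind these types is that keys and values must serialize themselves to a byte stream, and keys must additionally be comparable so the framework can sort them. The sketch below is a Python analogy of a custom composite key, not the Hadoop `Writable` interface itself: the class name `IntPairKey`, its `write` / `read_fields` methods and the two-integer layout are assumptions, and big-endian 4-byte integers are used to echo how Java data streams encode an int.

```python
import io
import struct
from functools import total_ordering


@total_ordering
class IntPairKey:
    """Illustrative composite key: two 32-bit ints, serializable and sortable."""

    def __init__(self, first=0, second=0):
        self.first = first
        self.second = second

    def write(self, out):
        """Serialize both fields as big-endian signed 32-bit integers."""
        out.write(struct.pack(">ii", self.first, self.second))

    def read_fields(self, inp):
        """Populate this object from 8 bytes read off the stream."""
        self.first, self.second = struct.unpack(">ii", inp.read(8))

    def _as_tuple(self):
        return (self.first, self.second)

    def __eq__(self, other):
        return self._as_tuple() == other._as_tuple()

    def __lt__(self, other):
        # Sort by first field, then by second, as a key comparator might.
        return self._as_tuple() < other._as_tuple()

    def __repr__(self):
        return f"IntPairKey({self.first}, {self.second})"
```

A raw comparator, as mentioned on the slide, would go one step further and compare the serialized bytes directly so that keys need not be deserialized during the sort.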
  • 64. Demo – Unit Testing MapReduce
      ◦ Using MRUnit + Asserts
      ◦ Optionally using ApprovalTests
      ◦ Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
  • 65. A note about MapReduce 2.0
      ◦ Splits the existing JobTracker’s roles
          ▪ resource management
          ▪ job lifecycle management
      ◦ MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as
          ▪ better scalability through distributed job lifecycle management
          ▪ support for multiple Hadoop MapReduce API versions in a single cluster
  • 66. What is Mahout?
      ◦ Library with common machine learning algorithms
      ◦ Over 20 algorithms
          ▪ Recommendation (likelihood – Pandora)
          ▪ Classification (known data and new data – spam id)
          ▪ Clustering (new groups of similar data – Google News)
      ◦ Can non-statisticians find value using this library?
  • 68. Setting up Hadoop on Windows
      ◦ For local development
          ▪ Install binaries from the Web Platform Installer
          ▪ Install .NET Azure SDK (for Azure BLOB storage)
          ▪ Install other tools
          ▪ Neudesic Azure Storage Viewer
  • 69. Demo – Mahout
      ◦ Using HDInsight
  • 70. What about the output?
  • 71. Clients (Visualizations) for HDFS
      ◦ Many clients use Hive
          ▪ Often included in GUI console tools for Hadoop distributions as well
      ◦ Microsoft includes clients in Office (Excel 2013)
          ▪ Direct Hive client
          ▪ Connect using ODBC
          ▪ PowerPivot – data mashups and presentation
          ▪ Data Explorer – connect, transform, mashup and filter
          ▪ Hadoop SDK on Codeplex
      ◦ Other popular clients
          ▪ Qlikview
          ▪ Tableau
          ▪ Karmasphere
  • 72. Demo – Executing Hive Queries
  • 73. Demo – Using HDFS output in Excel 2013 To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
  • 75. Demo – New Visualizations – D3
  • 76. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 5 of 5
  • 78. Comparing: RDBMS vs. Hadoop

                              Traditional RDBMS          Hadoop / MapReduce
        Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
        Access                Interactive and Batch      Batch – NOT Interactive
        Updates               Read / Write many times    Write once, Read many times
        Structure             Static Schema              Dynamic Schema
        Integrity             High (ACID)                Low
        Scaling               Nonlinear                  Linear
        Query Response Time   Can be near immediate      Has latency (due to batch processing)

  • 79. Microsoft alternatives to MapReduce
      ◦ Use existing relational system
          ▪ Scale via cloud or edition (i.e. Enterprise or PDW)
      ◦ Use in-memory OLAP
          ▪ SQL Server Analysis Services Tabular Models
      ◦ Use “productized” Dremel
          ▪ Microsoft Polybase – status = beta?
  • 80. Looking Forward - Dremel or Apache Drill
      ◦ Based on original research from Google
  • 82. In-market MapReduce Alternatives
      ◦ Cloudera: Impala
      ◦ Google: BigQuery
  • 83. Demo – Google’s BigQuery
      ◦ Dremel for the rest of us
  • 85. More MapReduce Developer Resources
      ◦ Based on the distribution – on premises
          ▪ Apache
              MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
          ▪ Cloudera
              Cloudera University - http://university.cloudera.com/
              Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
          ▪ Hortonworks
          ▪ MapR
      ◦ Based on the distribution – cloud
          ▪ AWS MapReduce
              Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
          ▪ Windows Azure HDInsight
              Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
              More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
  • 86. The Changing Data Landscape

Editor's Notes

  1. http://en.wikipedia.org/wiki/MapReduce
  2. http://allthingsd.com/files/2012/04/big-numbers.jpg
  3. http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf Also -- http://gigaom.com/2012/06/05/10-ways-companies-are-using-hadoop-to-do-more-than-serve-ads/
  4. Image: http://siliconangle.com/files/2012/08/hadoop-300x300.jpg
  5. http://www.platfora.com/wp-content/themes/PlatforaV2.0/img/enter/deployment_pick_graphic.png
  6. http://indoos.files.wordpress.com/2010/08/hadoop_map1.png?w=819&h=612
  7. http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg http://datameer2.datameer.com/blog/wp-content/uploads/2013/01/hadoop_ecosystem_clean.png http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
  8. Image from: http://vichargrave.com/wp-content/uploads/2013/02/Hadoop-Development.png http://wiki.apache.org/hadoop/HowToSetupYourDevelopmentEnvironment https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
  9. https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
  10. http://queryio.com/hadoop-big-data-images/hadoop-sql.jpg
  11. http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  12. http://hive.apache.org/ https://cwiki.apache.org/confluence/display/Hive/GettingStarted
  13. https://cwiki.apache.org/confluence/display/Hive/LanguageManual http://en.wikipedia.org/wiki/Apache_Hive
  14. http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html http://nsinfra.blogspot.in/2012/06/difference-between-hadoop-dfs-and.html
  15. http://www.fincher.org/tips/General/SoftwareEngineering/FunctionalProgramming.shtml http://rbxbx.info/images/fault-tolerance.png
  16. The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  17. The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  18. The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  19. http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  20. http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  21. http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/
  22. Image from http://curiousellie.typepad.com/.a/6a0133ec911c1f970b0168ebe6a2e4970c-500wi
  23. http://hadoop.apache.org/docs/r1.1.2/streaming.html How to run and compile a Hadoop Java program -- https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program Sample command to compile a Java class and package it as a jar: javac -classpath ~/hadoop/hadoop-core-1.0.1.jar:commons-cli-1.2.jar -d classes <nameOfJavaFile>.java && jar -cvf <nameOfJarFile>.jar -C classes/ .
  24. http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  25. http://blogs.msdn.com/b/carlnol/archive/2013/02/05/submitting-hadoop-mapreduce-jobs-using-powershell.aspx
  26. http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  27. About: Pig - http://en.wikipedia.org/wiki/Pig_(programming_tool) PigLatin language reference - http://pig.apache.org/docs/r0.10.0/start.html#pl-statements
  28. http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
  29. http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ http://www.slideshare.net/cloudera/mr-perf
  30. http://4.bp.blogspot.com/-2S6IuPD71A8/TZiNw8AyWkI/AAAAAAAAB0k/tS5QTP9SzHA/s1600/Detailed%2BHadoop%2BMapreduce%2BData%2BFlow.png
  31. The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
  32. Tips from Cloudera -- http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ and http://www.slideshare.net/Hadoop_Summit/optimizing-mapreduce-job-performance
  33. http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/ http://hadoop.apache.org/docs/r0.23.6/api/index.html
  34. http://mahout.apache.org/
  35. Download local Hadoop via the Web Platform Installer. Also download the Azure .NET SDK for VS 2012. Link to download Windows Azure storage explorer: http://azurestorageexplorer.codeplex.com/ Link for downloading the .NET SDK for Hadoop: http://hadoopsdk.codeplex.com/wikipage?title=roadmap&referringTitle=Home
  36. Image from - http://bluewatersql.files.wordpress.com/2013/04/image12.png
  37. http://www.research-live.com/Journals/1/Files/2013/1/11/covermania.jpg
  38. https://github.com/mbostock/d3/wiki/Gallery
  39. Original reference: Tom White’s Hadoop: The Definitive Guide (I made some modifications based on my experience)
  40. http://research.google.com/pubs/pub36632.html
  41. https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
  42. http://cloudera.com/content/cloudera/en/campaign/introducing-impala.html GigaOm ‘The Future…of Hadoop is real-time’ -- http://gigaom.com/2013/03/07/5-reasons-why-the-future-of-hadoop-is-real-time-relatively-speaking/ http://devopsangle.com/2012/08/20/googles-dremel-here-comes-a-new-challenger-to-yarnhadoop/