Hadoop Design Patterns
Donald Miner
@donaldpminer
Donald.Miner@emc.com

© Copyright 2013 EMC Corporation. All rights reserved.

The book, MapReduce Design Patterns, became available in December 2012.
Written by Donald Miner and Adam Shook, Hadoop Architects at Greenplum.

What Are Design Patterns?
(in general)

• Reusable solution frameworks for problems
• Domain independent
• Not a cookbook, but not a guide
• Not a finished solution

Why Design Patterns?
(in general)

• Makes the intent of code easier to understand
• Provides a common language for solutions
• Makes code reusable
• Gives solutions known performance profiles and limitations

Why MapReduce Design Patterns?
• Recurring patterns in data-related problem solving
• Groups are building patterns independently
• Lots of new users every day
• MapReduce is a new way of thinking
• Foundation for higher-level tools (Pig, Hive, …)
• Community is reaching the right level of maturity

Pattern Template
Each pattern follows a standard template:
• Intent
• Motivation
• Applicability
• Structure
• Consequences
• Resemblances
• Performance analysis
• Examples

Pattern Categories
Summarization
Filtering
Data Organization
Joins
Metapatterns
Input and output

Filtering Patterns
Extract interesting subsets
Keep only a subset of the data
• Filtering
  – Removes records based on a condition
• Bloom filtering
  – Removes records based on a Bloom filter membership test
• Top ten
  – Returns the top K records, given a ranking criterion
• Distinct
  – Removes duplicates from a data set
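A minimal sketch of three of these filtering patterns, written as plain Python rather than Hadoop code. The names `filter_records`, `BloomFilter`, and `distinct` are hypothetical, and the Bloom filter sizing (1024 bits, 3 hash functions derived from SHA-256) is an arbitrary assumption chosen for illustration.

```python
import hashlib


def filter_records(records, predicate):
    """Filtering: a map-only pass that keeps records satisfying a condition."""
    return [r for r in records if predicate(r)]


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizing values are illustrative assumptions, not tuned settings.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def distinct(records):
    """Distinct: use each record as a key so grouping collapses duplicates."""
    seen = {}
    for r in records:
        seen.setdefault(r, None)  # the value is irrelevant, only the key matters
    return list(seen)


if __name__ == "__main__":
    print(filter_records([1, 2, 3, 4], lambda x: x % 2 == 0))
    hot_users = BloomFilter()
    hot_users.add("alice")
    print(hot_users.might_contain("alice"))
    print(distinct(["a", "b", "a"]))
```

In a real job the Bloom filter would typically be built ahead of time and loaded by every mapper, so the membership test stays a cheap in-memory check.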

Summarization Patterns
Top-down summaries
Give a top-level view of the data
• Numerical summarizations
  – Perform numerical calculations on groups of data
• Inverted index
  – Build a lookup table
• Counting with counters
  – Count the occurrences of particular things
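As a rough illustration of the first two summarization patterns, the sketch below groups values by key (standing in for the shuffle) and then summarizes each group. The function names and the input shapes are assumptions made for this example, not part of the deck.

```python
from collections import defaultdict


def numerical_summary(pairs):
    """Numerical summarization over (group_key, value) pairs.

    Returns {key: (min, max, count, mean)} for each group.
    """
    groups = defaultdict(list)  # stands in for the shuffle/group-by-key step
    for key, value in pairs:
        groups[key].append(value)
    return {
        key: (min(vals), max(vals), len(vals), sum(vals) / len(vals))
        for key, vals in groups.items()
    }


def inverted_index(docs):
    """Inverted index: map each term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


if __name__ == "__main__":
    print(numerical_summary([("a", 1), ("a", 3), ("b", 5)]))
    print(inverted_index({1: "Hadoop patterns", 2: "hadoop joins"}))
```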

Data Organization Patterns
Reorganize, restructure
Change the way the data is organized
• Structured to hierarchical
  – Denormalize data into documents
• Partitioning
  – Place data into partitions based on a hash key
• Binning
  – Place each record into zero or more bins
• Total order sorting
  – Sort the data set in ascending or descending order
• Shuffling
  – Completely randomize the order of the data
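The difference between partitioning and binning can be sketched as follows: partitioning sends each record to exactly one partition chosen by a hash of its key, while binning tests each record against every bin and may place it in none, one, or several. The function names are hypothetical, and `zlib.crc32` is used here only as a convenient stable hash.

```python
import zlib


def partition(records, key_fn, num_partitions):
    """Partitioning: each record goes to exactly one partition, chosen by key hash."""
    parts = [[] for _ in range(num_partitions)]
    for r in records:
        # crc32 gives the same value on every run, unlike Python's salted str hash.
        idx = zlib.crc32(str(key_fn(r)).encode("utf-8")) % num_partitions
        parts[idx].append(r)
    return parts


def bin_records(records, bin_predicates):
    """Binning: each record lands in zero or more bins, one per matching predicate."""
    bins = {name: [] for name in bin_predicates}
    for r in records:
        for name, matches in bin_predicates.items():
            if matches(r):
                bins[name].append(r)
    return bins


if __name__ == "__main__":
    print(partition(["a", "b", "c", "a"], lambda r: r, 3))
    print(bin_records(
        ["hadoop tip", "pig tip", "other"],
        {"hadoop": lambda r: "hadoop" in r, "tip": lambda r: "tip" in r},
    ))
```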

Join Patterns
Bringing data sets together
Take several data sets and bring them together into one
• Reduce-side join
  – General purpose join
• Replicated join
  – Replicates the smaller data set everywhere before the join
• Composite join
  – Joins if the data sets are sorted and partitioned in the same way
• Cartesian product
  – Match up every record to every other record
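To make the contrast between the first two join patterns concrete, here is a small single-process sketch of an inner join done both ways. It assumes each data set is a list of (key, value) pairs; `reduce_side_join` and `replicated_join` are illustrative names, not Hadoop APIs.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left, right):
    """Reduce-side inner join: tag records by source, group by key, then cross."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:    # map phase: records tagged as coming from the left
        groups[key][0].append(value)
    for key, value in right:   # map phase: records tagged as coming from the right
        groups[key][1].append(value)
    joined = []
    for key, (lvals, rvals) in groups.items():  # reduce phase: one call per key
        for lval, rval in product(lvals, rvals):
            joined.append((key, lval, rval))
    return joined


def replicated_join(large, small):
    """Replicated (map-side) inner join: every mapper holds the small set in memory."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # Stream the large data set and probe the in-memory table; no shuffle needed.
    return [(key, value, s) for key, value in large for s in lookup.get(key, [])]


if __name__ == "__main__":
    users = [(1, "alice"), (2, "bob")]
    comments = [(1, "hi"), (1, "bye"), (3, "orphan")]
    print(reduce_side_join(users, comments))
    print(replicated_join(comments, users))
```

The reduce-side version works for any input sizes but moves every record across the network, whereas the replicated version avoids the shuffle at the cost of fitting the smaller data set in memory.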

Input and Output Patterns
Custom input and output
Perform custom behavior for input or output
• Generating data
  – Generate data from nothing
• External source output
  – Send data to an external source
• External source input
  – Pull data from an external source
• Partition pruning
  – Remove chunks of data because we know some parts are not useful

Example Pattern: “Top Ten”
(filtering)

Intent

Retrieve a relatively small number of top K records, according
to a ranking scheme in your data set, no matter how large
the data.

Motivation

Finding outliers
Top ten lists are fun
Building dashboards
Sort-then-limit isn’t going to work here: totally ordering a large data set is expensive in MapReduce

Applicability

Rank-able records
Limited number of output records

Consequences

The top K records are returned.

Structure
class mapper:
    setup():
        initialize top ten sorted list
    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10:
            truncate list to a length of 10
    cleanup():
        for record in top ten sorted list:
            emit null, record

class reducer:
    setup():
        initialize top ten sorted list
    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
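The structure above can be sketched in runnable form as a single-process simulation. This is not Hadoop code: `TopTenMapper`, `top_ten_reduce`, and `run_job` are hypothetical names, each inner list passed to `run_job` stands in for one input split, and records are assumed to be directly comparable (for example, (reputation, user) tuples).

```python
import bisect

K = 10  # number of records to keep, as in the "top ten" example


class TopTenMapper:
    """One instance per input split; keeps only that split's local top K."""

    def __init__(self, k=K):
        # setup(): initialize the sorted list (kept in ascending order)
        self.k = k
        self.top = []

    def map(self, key, record):
        bisect.insort(self.top, record)  # insert while keeping the list sorted
        if len(self.top) > self.k:
            del self.top[0]              # truncate: drop the smallest record

    def cleanup(self):
        # Emit under a single null key so every record reaches the same reducer.
        return [(None, record) for record in self.top]


def top_ten_reduce(key, records, k=K):
    """Single reducer: sort the merged local winners and keep the global top K."""
    return sorted(records, reverse=True)[:k]


def run_job(splits, k=K):
    """Drive the simulated job: one mapper per split, then one reduce call."""
    reducer_input = []
    for split in splits:
        mapper = TopTenMapper(k)
        for offset, record in enumerate(split):
            mapper.map(offset, record)
        reducer_input.extend(record for _, record in mapper.cleanup())
    return top_ten_reduce(None, reducer_input, k)


if __name__ == "__main__":
    splits = [
        [(rep, f"u{rep}") for rep in range(0, 50)],
        [(rep, f"u{rep}") for rep in range(50, 100)],
    ]
    for reputation, user in run_job(splits):
        print(user, reputation)
```

Emitting from `cleanup()` rather than from `map()` is what keeps the output of each mapper down to K records, regardless of how large its split is.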

Resemblances
SQL:
SELECT * FROM table ORDER BY col4 DESC LIMIT 10;

Pig:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;

Performance analysis

Pretty quick: map-heavy, low network usage
Pay attention to how many records the single reducer is getting: each mapper emits up to K records, so the reducer receives up to
[number of input splits] x K
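As a quick worked example of that bound, the helper below multiplies the two quantities. The function name and the figures used (1,000 input splits, K = 10) are made-up assumptions for illustration only.

```python
def max_reducer_input(num_input_splits, k):
    """Upper bound on records reaching the single reducer in the top-K pattern."""
    # Each mapper (one per input split) emits at most k records.
    return num_input_splits * k


if __name__ == "__main__":
    # Hypothetical job: 1,000 input splits, top ten.
    print(max_reducer_input(1000, 10))
```

With those assumed numbers the reducer sorts at most 10,000 records, which is small; the bound only becomes a concern when K or the number of splits grows large.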

Example

Top ten StackOverflow users by reputation

Pivotal Sessions at EMC World

Session: The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications
Presenter: Josh Klahr
Dates/Times: Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005

Session: Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
Presenter: Noelle Sio
Dates/Times: Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F

Session: Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench
Presenters: Clinton Ooi, Bhavin Modi
Dates/Times: Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A

Session: Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights
Presenter: SK Krishnamurthy
Dates/Times: Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M

Session: Pivotal: Big & Fast data – merging real-time data and deep analytics
Presenter: Michael Crutcher
Dates/Times: Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M

Session: Pivotal: Virtualize Big Data to Make The Elephant Dance
Presenters: June Yang, Dan Baskette
Dates/Times: Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E

Session: Hadoop Design Patterns
Presenter: Don Miner
Dates/Times: Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005

© Copyright 2013 EMC Corporation. All rights reserved.

18
Hadoop Design Patterns

More Related Content

What's hot

【旧版】Oracle Database Cloud Service:サービス概要のご紹介 [2021年7月版]
【旧版】Oracle Database Cloud Service:サービス概要のご紹介 [2021年7月版]【旧版】Oracle Database Cloud Service:サービス概要のご紹介 [2021年7月版]
【旧版】Oracle Database Cloud Service:サービス概要のご紹介 [2021年7月版]オラクルエンジニア通信
 
Developing Java Web Applications
Developing Java Web ApplicationsDeveloping Java Web Applications
Developing Java Web Applicationshchen1
 
Introduction of Oracle
Introduction of Oracle Introduction of Oracle
Introduction of Oracle Salman Memon
 
BD I - Aula 08 B - Algebra Relacional - Exercicios Resolucao
BD I - Aula 08 B - Algebra Relacional - Exercicios ResolucaoBD I - Aula 08 B - Algebra Relacional - Exercicios Resolucao
BD I - Aula 08 B - Algebra Relacional - Exercicios ResolucaoRodrigo Kiyoshi Saito
 
A brief introduction to SQLite PPT
A brief introduction to SQLite PPTA brief introduction to SQLite PPT
A brief introduction to SQLite PPTJavaTpoint
 
Curso de OO com C# - Parte 01 - Orientação a objetos
Curso de OO com C# - Parte 01 - Orientação a objetosCurso de OO com C# - Parte 01 - Orientação a objetos
Curso de OO com C# - Parte 01 - Orientação a objetosLeonardo Melo Santos
 
Banco de Dados I - Aula 05 - Banco de Dados Relacional (Modelo Conceitual)
Banco de Dados I - Aula 05 - Banco de Dados Relacional (Modelo Conceitual)Banco de Dados I - Aula 05 - Banco de Dados Relacional (Modelo Conceitual)
Banco de Dados I - Aula 05 - Banco de Dados Relacional (Modelo Conceitual)Leinylson Fontinele
 
Introduction to SQL
Introduction to SQLIntroduction to SQL
Introduction to SQLRam Kedem
 
Restricting and Sorting Data - Oracle Data Base
Restricting and Sorting Data - Oracle Data BaseRestricting and Sorting Data - Oracle Data Base
Restricting and Sorting Data - Oracle Data BaseSalman Memon
 
DDL,DML,SQL Functions and Joins
DDL,DML,SQL Functions and JoinsDDL,DML,SQL Functions and Joins
DDL,DML,SQL Functions and JoinsAshwin Dinoriya
 
[Curso Java Basico - Orientacaoo a Objetos] Aula 24: Classes e atributos
[Curso Java Basico - Orientacaoo a Objetos] Aula 24: Classes e atributos[Curso Java Basico - Orientacaoo a Objetos] Aula 24: Classes e atributos
[Curso Java Basico - Orientacaoo a Objetos] Aula 24: Classes e atributosLoiane Groner
 
1. SQL Basics - Introduction
1. SQL Basics - Introduction1. SQL Basics - Introduction
1. SQL Basics - IntroductionVarun A M
 

What's hot (20)

【旧版】Oracle Database Cloud Service:サービス概要のご紹介 [2021年7月版]
【旧版】Oracle Database Cloud Service:サービス概要のご紹介 [2021年7月版]【旧版】Oracle Database Cloud Service:サービス概要のご紹介 [2021年7月版]
【旧版】Oracle Database Cloud Service:サービス概要のご紹介 [2021年7月版]
 
Developing Java Web Applications
Developing Java Web ApplicationsDeveloping Java Web Applications
Developing Java Web Applications
 
Sql operator
Sql operatorSql operator
Sql operator
 
Introduction of Oracle
Introduction of Oracle Introduction of Oracle
Introduction of Oracle
 
Sql DML
Sql DMLSql DML
Sql DML
 
Sql
SqlSql
Sql
 
Sql operators & functions 3
Sql operators & functions 3Sql operators & functions 3
Sql operators & functions 3
 
BD I - Aula 08 B - Algebra Relacional - Exercicios Resolucao
BD I - Aula 08 B - Algebra Relacional - Exercicios ResolucaoBD I - Aula 08 B - Algebra Relacional - Exercicios Resolucao
BD I - Aula 08 B - Algebra Relacional - Exercicios Resolucao
 
A brief introduction to SQLite PPT
A brief introduction to SQLite PPTA brief introduction to SQLite PPT
A brief introduction to SQLite PPT
 
Curso de OO com C# - Parte 01 - Orientação a objetos
Curso de OO com C# - Parte 01 - Orientação a objetosCurso de OO com C# - Parte 01 - Orientação a objetos
Curso de OO com C# - Parte 01 - Orientação a objetos
 
Banco de Dados I - Aula 05 - Banco de Dados Relacional (Modelo Conceitual)
Banco de Dados I - Aula 05 - Banco de Dados Relacional (Modelo Conceitual)Banco de Dados I - Aula 05 - Banco de Dados Relacional (Modelo Conceitual)
Banco de Dados I - Aula 05 - Banco de Dados Relacional (Modelo Conceitual)
 
Introduction to SQL
Introduction to SQLIntroduction to SQL
Introduction to SQL
 
Js: master prototypes
Js: master prototypesJs: master prototypes
Js: master prototypes
 
Restricting and Sorting Data - Oracle Data Base
Restricting and Sorting Data - Oracle Data BaseRestricting and Sorting Data - Oracle Data Base
Restricting and Sorting Data - Oracle Data Base
 
DDL,DML,SQL Functions and Joins
DDL,DML,SQL Functions and JoinsDDL,DML,SQL Functions and Joins
DDL,DML,SQL Functions and Joins
 
MySql:Introduction
MySql:IntroductionMySql:Introduction
MySql:Introduction
 
Oracle
OracleOracle
Oracle
 
[Curso Java Basico - Orientacaoo a Objetos] Aula 24: Classes e atributos
[Curso Java Basico - Orientacaoo a Objetos] Aula 24: Classes e atributos[Curso Java Basico - Orientacaoo a Objetos] Aula 24: Classes e atributos
[Curso Java Basico - Orientacaoo a Objetos] Aula 24: Classes e atributos
 
Banco de Dados
Banco de DadosBanco de Dados
Banco de Dados
 
1. SQL Basics - Introduction
1. SQL Basics - Introduction1. SQL Basics - Introduction
1. SQL Basics - Introduction
 

Viewers also liked

Federated Approach for Interoperating AEC/FM Ontologies
Federated Approach for Interoperating AEC/FM OntologiesFederated Approach for Interoperating AEC/FM Ontologies
Federated Approach for Interoperating AEC/FM OntologiesAna Roxin
 
You Are the Target
You Are the TargetYou Are the Target
You Are the TargetEMC
 
Classical approach
Classical approachClassical approach
Classical approachTravis Klein
 
Predestinação - A salvação é para poucos ou para todos?
Predestinação - A salvação é para poucos ou para todos?Predestinação - A salvação é para poucos ou para todos?
Predestinação - A salvação é para poucos ou para todos?João Carlos
 
What is agg demand
What is agg demandWhat is agg demand
What is agg demandTravis Klein
 
Federmanager Bologna: Presentazione sintetica dei servizi - 10 dicembre 2013
Federmanager Bologna: Presentazione sintetica dei servizi - 10 dicembre 2013Federmanager Bologna: Presentazione sintetica dei servizi - 10 dicembre 2013
Federmanager Bologna: Presentazione sintetica dei servizi - 10 dicembre 2013Marco Frullanti
 
Deconstruction of production splash
Deconstruction of production splashDeconstruction of production splash
Deconstruction of production splashharryronchetti
 
โรคขาดโปรตีน
โรคขาดโปรตีนโรคขาดโปรตีน
โรคขาดโปรตีนThanaporn Srithananun
 
EMC IT's Virtual Oracle Deployment Framework
EMC IT's Virtual Oracle Deployment FrameworkEMC IT's Virtual Oracle Deployment Framework
EMC IT's Virtual Oracle Deployment FrameworkEMC
 
El cas del... oriol, oriol i nil
El cas del... oriol, oriol i nilEl cas del... oriol, oriol i nil
El cas del... oriol, oriol i nilmgonellgomez
 
EMC Symmetrix Data at Rest Encryption - Detailed Review
EMC Symmetrix Data at Rest Encryption - Detailed Review EMC Symmetrix Data at Rest Encryption - Detailed Review
EMC Symmetrix Data at Rest Encryption - Detailed Review EMC
 
RSA-Pivotal Security Big Data Reference Architecture
RSA-Pivotal Security Big Data Reference ArchitectureRSA-Pivotal Security Big Data Reference Architecture
RSA-Pivotal Security Big Data Reference ArchitectureEMC
 
Beliefs men have_about_women
Beliefs men have_about_womenBeliefs men have_about_women
Beliefs men have_about_womenChandan Dubey
 

Viewers also liked (20)

Colours speaking
Colours speakingColours speaking
Colours speaking
 
Federated Approach for Interoperating AEC/FM Ontologies
Federated Approach for Interoperating AEC/FM OntologiesFederated Approach for Interoperating AEC/FM Ontologies
Federated Approach for Interoperating AEC/FM Ontologies
 
You Are the Target
You Are the TargetYou Are the Target
You Are the Target
 
Classical approach
Classical approachClassical approach
Classical approach
 
Predestinação - A salvação é para poucos ou para todos?
Predestinação - A salvação é para poucos ou para todos?Predestinação - A salvação é para poucos ou para todos?
Predestinação - A salvação é para poucos ou para todos?
 
What is agg demand
What is agg demandWhat is agg demand
What is agg demand
 
Federmanager Bologna: Presentazione sintetica dei servizi - 10 dicembre 2013
Federmanager Bologna: Presentazione sintetica dei servizi - 10 dicembre 2013Federmanager Bologna: Presentazione sintetica dei servizi - 10 dicembre 2013
Federmanager Bologna: Presentazione sintetica dei servizi - 10 dicembre 2013
 
Deconstruction of production splash
Deconstruction of production splashDeconstruction of production splash
Deconstruction of production splash
 
โรคขาดโปร..
โรคขาดโปร..โรคขาดโปร..
โรคขาดโปร..
 
โรคขาดโปรตีน
โรคขาดโปรตีนโรคขาดโปรตีน
โรคขาดโปรตีน
 
EMC IT's Virtual Oracle Deployment Framework
EMC IT's Virtual Oracle Deployment FrameworkEMC IT's Virtual Oracle Deployment Framework
EMC IT's Virtual Oracle Deployment Framework
 
Nomina tarea
Nomina tareaNomina tarea
Nomina tarea
 
El cas del... oriol, oriol i nil
El cas del... oriol, oriol i nilEl cas del... oriol, oriol i nil
El cas del... oriol, oriol i nil
 
Fiscal policy
Fiscal policyFiscal policy
Fiscal policy
 
EMC Symmetrix Data at Rest Encryption - Detailed Review
EMC Symmetrix Data at Rest Encryption - Detailed Review EMC Symmetrix Data at Rest Encryption - Detailed Review
EMC Symmetrix Data at Rest Encryption - Detailed Review
 
Ppf productivity
Ppf productivityPpf productivity
Ppf productivity
 
RSA-Pivotal Security Big Data Reference Architecture
RSA-Pivotal Security Big Data Reference ArchitectureRSA-Pivotal Security Big Data Reference Architecture
RSA-Pivotal Security Big Data Reference Architecture
 
らくがき
らくがきらくがき
らくがき
 
Beliefs men have_about_women
Beliefs men have_about_womenBeliefs men have_about_women
Beliefs men have_about_women
 
Magazine Analysis
Magazine AnalysisMagazine Analysis
Magazine Analysis
 

Similar to Hadoop Design Patterns

MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsRobert Grossman
 
Simplifying Cloud Architectures with Data Virtualization
Simplifying Cloud Architectures with Data VirtualizationSimplifying Cloud Architectures with Data Virtualization
Simplifying Cloud Architectures with Data VirtualizationDenodo
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes Minio
 
Top 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceTop 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceEdureka!
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.pptRutujaPatil247341
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 
Big Data Meetup #7
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7Paul Lo
 
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Amazon Web Services
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...Srivatsan Ramanujam
 
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...In-Memory Computing Summit
 
Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Srivatsan Ramanujam
 
The thinking persons guide to data warehouse design
The thinking persons guide to data warehouse designThe thinking persons guide to data warehouse design
The thinking persons guide to data warehouse designCalpont
 
Lambdas And Streams Hands On Lab, JavaOne 2014
Lambdas And Streams Hands On Lab, JavaOne 2014Lambdas And Streams Hands On Lab, JavaOne 2014
Lambdas And Streams Hands On Lab, JavaOne 2014Simon Ritter
 
Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Yahoo Developer Network
 
PGQL: A Language for Graphs
PGQL: A Language for GraphsPGQL: A Language for Graphs
PGQL: A Language for GraphsJean Ihm
 

Similar to Hadoop Design Patterns (20)

MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Dremel
DremelDremel
Dremel
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Simplifying Cloud Architectures with Data Virtualization
Simplifying Cloud Architectures with Data VirtualizationSimplifying Cloud Architectures with Data Virtualization
Simplifying Cloud Architectures with Data Virtualization
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes
 
Top 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceTop 3 design patterns in Map Reduce
Top 3 design patterns in Map Reduce
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Big Data Meetup #7
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7
 
Dwh faqs
Dwh faqsDwh faqs
Dwh faqs
 
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
 
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...
 
Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)
 
The thinking persons guide to data warehouse design
The thinking persons guide to data warehouse designThe thinking persons guide to data warehouse design
The thinking persons guide to data warehouse design
 
Greenplum feature
Greenplum featureGreenplum feature
Greenplum feature
 
Lambdas And Streams Hands On Lab, JavaOne 2014
Lambdas And Streams Hands On Lab, JavaOne 2014Lambdas And Streams Hands On Lab, JavaOne 2014
Lambdas And Streams Hands On Lab, JavaOne 2014
 
Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010
 
PGQL: A Language for Graphs
PGQL: A Language for GraphsPGQL: A Language for Graphs
PGQL: A Language for Graphs
 

More from EMC

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDEMC
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote EMC
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOEMC
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremioEMC
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lakeEMC
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereEMC
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History EMC
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewEMC
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeEMC
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic EMC
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityEMC
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeEMC
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsEMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookEMC
 

More from EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Hadoop Design Patterns

  • 1. Hadoop Design Patterns
      Donald Miner, @donaldpminer, Donald.Miner@emc.com
      © Copyright 2013 EMC Corporation. All rights reserved.
  • 2. Book was made available December 2012
      Written by Donald Miner and Adam Shook, Hadoop Architects at Greenplum.
  • 3. What Are Design Patterns? (in general)
      – Reusable solution frameworks to problems
      – Domain independent
      – Not a cookbook, but not a guide
      – Not a finished solution
  • 4. Why Design Patterns? (in general)
      – Makes the intent of code easier to understand
      – Provides a common language for solutions
      – Be able to reuse code
      – Known performance profiles and limitations of solutions
  • 5. Why MapReduce Design Patterns?
      – Recurring patterns in data-related problem solving
      – Groups are building patterns independently
      – Lots of new users every day
      – MapReduce is a new way of thinking
      – Foundation for higher-level tools (Pig, Hive, …)
      – Community is reaching the right level of maturity
  • 6. Pattern Template
      Each pattern follows a standard template:
      – Intent
      – Motivation
      – Applicability
      – Structure
      – Consequences
      – Resemblances
      – Performance analysis
      – Examples
  • 7. Pattern Categories
      – Summarization
      – Filtering
      – Data Organization
      – Joins
      – Metapatterns
      – Input and output
  • 8. Filtering Patterns
      Extract interesting subsets. Keep only a subset of the data.
      – Filtering: removes records of data based on a condition
      – Bloom filtering: removes records of data based on a Bloom filter membership test
      – Top ten: returns the top-k records, given a ranking criterion
      – Distinct: removes duplicates from a data set
  • 9. Summarization Patterns
      Top-down summaries. Give a top-level view of the data.
      – Numerical summarizations: perform numerical calculations on groups of data
      – Inverted index: build a lookup table
      – Counting with counters: count the occurrences of particular things
  • 10. Data organization patterns
      Reorganize, restructure. Change the way the data is organized.
      – Structured to hierarchical: denormalize data into documents
      – Partitioning: place data into partitions based on a hash key
      – Binning: place each record into zero or more bins
      – Total order sorting: sort the data set in ascending or descending order
      – Shuffling: completely randomize the order of the data
  • 11. Join patterns
      Bringing data sets together. Take several data sets and bring them together into one.
      – Reduce-side join: general purpose join
      – Replicated join: replicates the smaller data set everywhere before the join
      – Composite join: joins if the data sets are sorted and partitioned in the same way
      – Cartesian product: match up every record to every other record
  • 12. Input and output patterns
      Custom input and output. Perform custom behavior for input or output.
      – Generating data: generate data from nothing
      – External source output: send data to an external source
      – External source input: pull data from an external source
      – Partition pruning: remove chunks of data because we know some parts are not useful
  • 13. Example Pattern: “Top Ten” (filtering)
      Intent
      – Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
      Motivation
      – Finding outliers
      – Top ten lists are fun
      – Building dashboards
      – Sorting/Limit isn’t going to work here
  • 14. Example Pattern: “Top Ten”
      Applicability
      – Rank-able records
      – Limited number of output records
      Consequences
      – The top K records are returned.
  • 15. Example Pattern: “Top Ten”
      Structure

      class mapper:
        setup():
          initialize top ten sorted list
        map(key, record):
          insert record into top ten sorted list
          if length of array is greater-than 10:
            truncate list to a length of 10
        cleanup():
          for record in top ten sorted list:
            emit null, record

      class reducer:
        setup():
          initialize top ten sorted list
        reduce(key, records):
          sort records
          truncate records to top 10
          for record in records:
            emit record
  • 16. Example Pattern: “Top Ten”
      Resemblances
      – SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
      – Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;
  • 17. Example Pattern: “Top Ten”
      Performance analysis
      – Pretty quick: map-heavy, low network usage
      – Pay attention to how many records the reducer is getting: [number of input splits] x K
      Example
      – Top ten StackOverflow users by reputation
  • 18. Pivotal Sessions at EMC World
      Session | Presenter | Dates/Times
      The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications | Josh Klahr | Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005
      Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action | Noelle Sio | Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F
      Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench | Clinton Ooi, Bhavin Modi | Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A
      Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights | SK Krishnamurthy | Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M
      Pivotal: Big & Fast data – merging real-time data and deep analytics | Michael Crutcher | Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M
      Pivotal: Virtualize Big Data to Make The Elephant Dance | June Yang, Dan Baskette | Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E
      Hadoop Design Patterns | Don Miner | Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005
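
The “Top Ten” structure on slide 15 can be sketched outside Hadoop as a small single-process Python simulation. This is only an illustration under stated assumptions, not the deck's or the book's implementation: the record shape `(user, reputation)`, the function names `map_split`, `reduce_top_k`, and `top_k`, and the use of `heapq.nlargest` in place of a hand-maintained sorted list are all hypothetical choices made here. Each "mapper" keeps only its local top K, so the single "reducer" sees at most [number of input splits] x K records, as slide 17 points out.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical record shape for illustration: (user name, reputation).
Record = Tuple[str, int]

K = 10  # number of top records to keep


def map_split(records: Iterable[Record], k: int = K) -> List[Record]:
    """Mapper side: scan one input split and keep only its local top k.

    heapq.nlargest stands in for the "top ten sorted list" that the
    pseudocode truncates after every insert.
    """
    return heapq.nlargest(k, records, key=lambda rec: rec[1])


def reduce_top_k(candidates: Iterable[Record], k: int = K) -> List[Record]:
    """Reducer side: merge the mappers' local results into the global top k."""
    return heapq.nlargest(k, candidates, key=lambda rec: rec[1])


def top_k(splits: Iterable[Iterable[Record]], k: int = K) -> List[Record]:
    """Run every split through the mapper, then send all output to one reducer."""
    # At most (number of splits) * k records reach the reducer.
    candidates = [rec for split in splits for rec in map_split(split, k)]
    return reduce_top_k(candidates, k)


if __name__ == "__main__":
    # Two made-up input splits of 50 users each.
    splits = [
        [(f"user{i}", i) for i in range(0, 50)],
        [(f"user{i}", i) for i in range(50, 100)],
    ]
    for user, reputation in top_k(splits, k=3):
        print(user, reputation)
```

Because every mapper discards all but K records before anything crosses the network, the job stays map-heavy, which matches the performance notes on slide 17. In a real MapReduce job the mappers would emit under a single null key so that one reducer receives every candidate.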