Introduction to Map/Reduce Data Transformations. Tasso Argyros, CTO and Co-Founder, Aster Data Systems. [email_address]
A Brief History of MapReduce Confidential and proprietary. Copyright © 2008 Aster Data Systems
What is MapReduce?
Why is MapReduce Useful?
Diagram: Server A, Server B, Server C, and Server D, connected through a switch, hold the following documents: "The quick brown fox jumps over the lazy dog." "To be or not to be: that is the question." "The world only needs five computers." "Hello world." "In-Database MapReduce is the future." "MapReduce is a very powerful programming paradigm."
Goal: we want to count the number of times each word occurs.
1st Approach: No MapReduce
Diagram: the documents on Servers A through D ("The quick brown fox jumps over the lazy dog." "To be or not to be: that is the question." "The world only needs five computers." "Hello world." "In-Database MapReduce is the future." "MapReduce is a very powerful concept.") are lowercased and stripped of punctuation, giving two word streams: "the quick brown fox jumps over the lazy dog in database mapreduce is the future the world only needs five computers" and "hello world mapreduce is a very powerful concept to be or not to be that is the question".
Server 4, Final Result File: the 5; is 3; mapreduce 2; … …
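The single-server count from this first approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `count_words` and the normalization rule (lowercase, keep runs of letters) are hypothetical choices, since the slides do not say how the words were extracted.

```python
import re
from collections import Counter

# The six example documents from the slides.
DOCUMENTS = [
    "The quick brown fox jumps over the lazy dog.",
    "To be or not to be: that is the question.",
    "The world only needs five computers.",
    "Hello world.",
    "In-Database MapReduce is the future.",
    "MapReduce is a very powerful concept.",
]

def count_words(documents):
    """Count word occurrences on a single machine (no distribution)."""
    counts = Counter()
    for doc in documents:
        # Assumed normalization: lowercase, keep runs of letters only.
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

result = count_words(DOCUMENTS)
print(result["the"], result["is"], result["mapreduce"])  # prints: 5 3 2
```

Everything runs in one process here, which is exactly the limitation of this approach: one machine has to receive and scan all of the data.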
What Did We Do?
2nd Approach: No MapReduce, Fully Distributed
Diagram: the same documents on Servers A through D, connected through a switch, are normalized into the word streams "the quick brown fox jumps over the lazy dog in database mapreduce is the future the world only needs five computers" and "hello world mapreduce is a very powerful concept to be or not to be that is the question". The words are then redistributed across the servers so that all copies of a word sit on the same server: the the the the the database database future world world powerful lazy brown mapreduce mapreduce be be to jumps computers hello is is is question over a that
Final Result Files: Server 1: the 5, … …; Server 2: world 2, … …; Server 3: mapreduce 2, … …; Server 4: is 3, … …
2nd Approach: No MapReduce, Distributed
Does it work? Yes. Is it a pain? Yes!! Does it take lots of time? Yes! Would you do it? No!!!
Moreover…
Map(): Input: any file (e.g. documents). Output: stream of <key, value> pairs (e.g. <word, count> pairs). Then: Data Redistribution and Grouping. Reduce(): Input: all <key, value> pairs with the same key, grouped (e.g. all <word, count> pairs where word = "the"). Output: anything (e.g. sum of counts for a specific word).
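For word count, the two functions described on this slide might look like the following Python sketch. The names `map_fn` and `reduce_fn` and the tokenization rule are assumptions made for illustration; the slides only define the input and output of each function.

```python
import re

def map_fn(document):
    """Map(): one document in, a stream of <word, 1> pairs out."""
    # Assumed tokenization: lowercase, keep runs of letters only.
    for word in re.findall(r"[a-z]+", document.lower()):
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce(): all values for one key in, the summed count out."""
    return (key, sum(values))

pairs = list(map_fn("To be or not to be"))
# pairs == [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
print(reduce_fn("be", [v for k, v in pairs if k == "be"]))  # prints: ('be', 2)
```

Note that neither function knows anything about servers or networking; redistribution and grouping happen between the two calls and are the framework's job.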
Map() and Redistribution Phase (diagram). On each of Servers A through D, Map() turns the local documents into <word, 1> pairs; for example, "The quick brown fox jumps over the lazy dog" and "In-Database MapReduce is the future." become <the,1> <quick,1> <brown,1> <fox,1> <jumps,1> <over,1> <the,1> <lazy,1> <dog,1> <in,1> <database,1> <mapreduce,1> <is,1> <the,1> <future,1>. The pairs are then redistributed through the switch so that pairs with the same key arrive at the same server: <the,1> <the,1> <the,1> <the,1> <the,1> <database,1> <database,1> <future,1> <world,1> <world,1> <powerful,1> <lazy,1> <brown,1> <mapreduce,1> <mapreduce,1> <be,1> <be,1> <to,1> <jumps,1> <computers,1> <hello,1> <is,1> <is,1> <is,1> <question,1> <over,1> <a,1> <that,1>
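One common way to get this "same key, same server" property is hash partitioning. The sketch below assumes that technique; the slides do not say how keys are assigned to servers, and the names `server_for` and `redistribute` as well as the choice of CRC32 are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_SERVERS = 4  # Servers A through D in the slides

def server_for(key, num_servers=NUM_SERVERS):
    """Pick the destination server for a key (assumed rule: CRC32 modulo N)."""
    return zlib.crc32(key.encode("utf-8")) % num_servers

def redistribute(pairs, num_servers=NUM_SERVERS):
    """Send every <key, value> pair to the server that owns its key."""
    inboxes = defaultdict(list)
    for key, value in pairs:
        inboxes[server_for(key, num_servers)].append((key, value))
    return inboxes

mapped = [("the", 1), ("quick", 1), ("the", 1), ("database", 1), ("the", 1)]
inboxes = redistribute(mapped)
for server, received in sorted(inboxes.items()):
    print(server, received)
```

Because the destination depends only on the key, every <the, 1> pair ends up in the same inbox no matter which server produced it. A stable hash such as CRC32 is used instead of Python's built-in `hash()`, which is randomized per process for strings.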
Grouping and Reduce() Phase (on Server 1). The received pairs <the,1> <the,1> <the,1> <the,1> <the,1> <database,1> <database,1> <future,1> are grouped by key, and Reduce() is applied to each group. Server 1 Final Result File: the 5; database 2; future 1.
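The grouping and reduce step on a single server can be sketched as follows, using the pairs shown for Server 1. Sorting followed by `itertools.groupby` is an assumed grouping strategy chosen for brevity, and `group_and_reduce` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

# Pairs received by Server 1 in the slide.
received = [("the", 1)] * 5 + [("database", 1)] * 2 + [("future", 1)]

def group_and_reduce(pairs):
    """Group pairs by key, then sum the values of each group."""
    results = {}
    # groupby only merges adjacent items, so sort by key first.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        results[key] = sum(value for _, value in group)
    return results

final = group_and_reduce(received)
print(final)  # prints: {'database': 2, 'future': 1, 'the': 5}
```

The result matches the Server 1 final result file on the slide; the other servers would run the same code on their own inboxes.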
What Just Happened?
Word Count was Only an Example! "The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3,800 lines of C++ code to approximately 700 lines when expressed using MapReduce." (Google 2004 MapReduce paper)
Word Count was Only an Example! "We adapt Google's MapReduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN)." (Stanford 2006 AI Lab paper)
Result?
But…
Beyond SQL and MapReduce
SQL vs MapReduce: Two different worlds?
Implementing MR in the Database
The SQL/MR Process
SQL/MR Function: Syntax. Annotated parts of the query, in the slide's own numbering: (1) source table or sub-select; (2) <key> for data redistribution; (3) sort before the MR function; (4) Java/Python/… MR function; (5) select output (e.g. count). Also labeled: optional MR_Function arguments; optional conditions & filters.
Example 1: Tokenization
Example 2: Sessionization
Example 2: Sessionization. Session timeout = 60 seconds.
INPUT (clickstream):
timestamp  userid
10:00:00   Shawn1
00:58:24   PrezBush
10:00:24   Shawn1
02:30:33   PrezBush
10:01:23   Shawn1
10:02:40   Shawn1
OUTPUT:
timestamp  userid    sessionid
10:00:00   Shawn1    0
10:00:24   Shawn1    0
10:01:23   Shawn1    0
10:02:40   Shawn1    1
timestamp  userid    sessionid
00:58:24   PrezBush  0
02:30:33   PrezBush  1
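The sessionization logic behind this table can be sketched in plain Python. This is not the SQL/MR function from the talk, whose code is not shown; `sessionize` is a hypothetical name, and treating a gap strictly greater than the timeout as the start of a new session is an assumption (the slide data does not exercise the exact boundary).

```python
from collections import defaultdict
from datetime import datetime

SESSION_TIMEOUT = 60  # seconds, as on the slide

CLICKS = [
    ("10:00:00", "Shawn1"),
    ("00:58:24", "PrezBush"),
    ("10:00:24", "Shawn1"),
    ("02:30:33", "PrezBush"),
    ("10:01:23", "Shawn1"),
    ("10:02:40", "Shawn1"),
]

def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Assign per-user session ids; a gap longer than `timeout` starts a new one."""
    by_user = defaultdict(list)
    for ts, user in clicks:          # redistribute: group clicks by userid
        by_user[user].append(ts)
    rows = []
    for user, stamps in by_user.items():
        session_id, previous = 0, None
        for ts in sorted(stamps):    # sort each user's clicks by timestamp
            current = datetime.strptime(ts, "%H:%M:%S")
            gap = None if previous is None else (current - previous).total_seconds()
            if gap is not None and gap > timeout:
                session_id += 1
            rows.append((ts, user, session_id))
            previous = current
    return rows

sessions = sessionize(CLICKS)
for row in sessions:
    print(*row)
```

Grouping by `userid` plays the role of the redistribution key and the per-user sort plays the role of sorting before the MR function; the loop body is the part a user-written function would supply. Sorting the timestamps as strings is safe here only because they are zero-padded `HH:MM:SS` values within one day.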
MR Applications in the Database
Summary. Contact: [email_address] (questions, comments); asterdata.com/blog (lots of technical details); 1.888.Aster.Data (any other information).
