An Introduction to
MapReduce with MongoDB
        Russell Smith
/usr/bin/whoami

•   Russell Smith

•   Consultant for UKD1 Limited

•   I specialise in helping companies going through rapid growth:

•   Code, architecture, infrastructure, devops, sysops, capacity planning, etc

•   <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc...
What is MongoDB

•   A scalable, high-performance, open source, document-oriented
    database.

•   Stores JSON like documents

•   Indexable on any attribute (like MySQL)

•   Built in MapReduce
Requirements

•   A running MongoDB server
    http://www.mongodb.org/downloads


•   Basic knowledge of MongoDB

•   Basic Javascript
What is Map Reduce

•   Allows aggregating data in parallel

•   Some built in aggregation functions exist;
    distinct, count

•   If you need to do something more, either query or MapReduce
How does it work?
•   You write two functions

•   You write them in Javascript (currently)
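•   Map function:
    Called once per document - emits a key + a value (zero or more times)

•   Reduce function:
    Called per key emitted, with an array of values. It may be called more
    than once for the same key, so its output must be usable as its input

•   Optional finalize function for a final tidy-up of the reduced data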
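To make the flow above concrete, here is a minimal sketch in plain JavaScript. It is not MongoDB's implementation: `runMapReduce`, the local `emit` and the sample documents (with made-up `state` and `workers` fields) are all hypothetical, and exist only to show how emitted values are grouped by key before reduce and finalize see them.

```javascript
// Assigned by the harness below. In MongoDB, emit is provided by the server.
var emit;

// Hypothetical local harness: same flow as map/reduce, but in-process.
function runMapReduce(docs, map, reduce, finalize) {
    var groups = {}; // key -> array of emitted values

    emit = function (key, value) {
        if (!groups.hasOwnProperty(key)) { groups[key] = []; }
        groups[key].push(value);
    };

    // Map phase: call map once per document, with `this` bound to it.
    docs.forEach(function (doc) { map.call(doc); });

    // Reduce phase: one output object per key.
    var results = [];
    Object.keys(groups).forEach(function (key) {
        var values = groups[key];
        // MongoDB skips reduce for a key with a single value; mirror that.
        var reduced = values.length === 1 ? values[0] : reduce(key, values);
        results.push({ _id: key, value: finalize ? finalize(key, reduced) : reduced });
    });
    return results;
}

// Made-up sample documents.
var docs = [
    { state: 'TX', workers: 1 },
    { state: 'TX', workers: 3 },
    { state: 'NY', workers: 2 }
];

var m = function () { emit(this.state, this.workers); };
var r = function (k, v_arr) {
    var total = 0;
    for (var i = 0; i < v_arr.length; i++) { total += v_arr[i]; }
    return total;
};

var results = runMapReduce(docs, m, r);
console.log(JSON.stringify(results));
```

The harness reduces each key in one pass; a real server may reduce in batches and feed partial results back in, which is why reduce should return the same shape it receives.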
Some example data

•   I downloaded the H-1B data (US temporary work visa applications)
    http://www.flcdatacenter.com/CaseH1B.aspx


•   Imported the CSV data using the mongoimport command

•   Total imported documents ~335k
What do the documents look like?
    {
        "_id" : ObjectId("4db7c981e243a6e23725570f"),
        "LCA_CASE_NUMBER" : "I-200-09132-243675",
        "STATUS" : "CERTIFIED",
        "LCA_CASE_SUBMIT" : "7/14/2010 9:06:36",
        "VISA_CLASS" : "H-1B",
        "LCA_CASE_EMPLOYMENT_START_DATE" : "12/15/2010 0:00:00",
        "LCA_CASE_EMPLOYMENT_END_DATE" : "12/15/2013 0:00:00",
        "LCA_CASE_EMPLOYER_NAME" : "BRITISH SCHOOL OF AMERICA, LLC",
        "LCA_CASE_EMPLOYER_ADDRESS" : "4211 WATONGA BLVD.",
        "LCA_CASE_EMPLOYER_CITY" : "HOUSTON",
        "LCA_CASE_EMPLOYER_STATE" : "TX",
        "LCA_CASE_EMPLOYER_POSTAL_CODE" : 77092,
        "LCA_CASE_SOC_CODE" : "25-2022.00",
        "LCA_CASE_SOC_NAME" : "Middle School Teachers, Except Special and Vocatio",
        "LCA_CASE_JOB_TITLE" : "MIDDLE SCHOOL TEACHER/IB COORDINATOR",
        "LCA_CASE_WAGE_RATE_FROM" : 51577.63,
        "LCA_CASE_WAGE_RATE_UNIT" : "Year",
        "FULL_TIME_POS" : "Y",
        "TOTAL_WORKERS" : 1,
        "LCA_CASE_WORKLOC1_CITY" : "HOUSTON",
        "LCA_CASE_WORKLOC1_STATE" : "TX",
        "PW_1" : 47827,
        "PW_UNIT_1" : "Year",
        "PW_SOURCE_1" : "OES",
        "OTHER_WAGE_SOURCE_1" : "OFLC ONLINE DATA CENTER",
        "YR_SOURCE_PUB_1" : 2010,
        "LCA_CASE_NAICS_CODE" : 611110,
        "Decision_Date" : "7/20/2010 0:00:00r"
    }

Fields to note:

•   LCA_CASE_EMPLOYER_STATE

•   STATUS

•   LCA_CASE_SUBMIT / Decision_Date

•   LCA_CASE_WAGE_RATE_FROM
What can we do with the data?

•   Work out:

•   Applications per state

•   Applications by status per state

•   Average time from submission to decision, by status
Applications by State


•   Key will be LCA_CASE_EMPLOYER_STATE

•   Assume (wrongly) one person per document
Map


•   this is equal to the current document

•   emit a value of 1, as we are assuming a single H1B app per document

    m = function () {
        emit(this.LCA_CASE_EMPLOYER_STATE, 1);
    }
Reduce


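•   Return a value: the length of the array

•   This works as each value in the array is 1, but only while reduce runs
    once per key. If MongoDB re-reduces (passes earlier reduce results back
    in as values), the count comes out wrong; summing the values, as in the
    later Reduce example, is safe

    r = function (k, v_arr) {
        return v_arr.length;
    }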
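A small sketch of why the array length can mislead. It assumes the server reduces one key in two batches and then reduces the two partial results again; the batch split and the function names `reduceByLength` and `reduceBySum` are made up for illustration.

```javascript
// The slide's reduce: counts array entries.
var reduceByLength = function (k, v_arr) { return v_arr.length; };

// Alternative: sums the values, so partial results can be fed back in.
var reduceBySum = function (k, v_arr) {
    var total = 0;
    for (var i = 0; i < v_arr.length; i++) { total += v_arr[i]; }
    return total;
};

// Five documents for one key, each emitting 1, split into two batches.
var batchA = [1, 1, 1];
var batchB = [1, 1];

// Reduce each batch, then re-reduce the two partial results.
var lengthResult = reduceByLength('TX', [
    reduceByLength('TX', batchA),
    reduceByLength('TX', batchB)
]);
var sumResult = reduceBySum('TX', [
    reduceBySum('TX', batchA),
    reduceBySum('TX', batchB)
]);

console.log(lengthResult); // 2, although five documents were mapped
console.log(sumResult);    // 5
```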
Executing


•   This will execute the map/reduce

•   Output goes to a collection named workers_by_state

    db.text2010.mapReduce(m, r,
        {out: 'workers_by_state', keeptemp: true, verbose: true})
Result

    { "_id" : "NEW YORK", "value" : 512 }
    { "_id" : "IOWA", "value" : 15 }
    { "_id" : "KANSAS", "value" : 54 }
    ...
A more complex Map!

•   The last example assumed one worker per document...which is wrong.

•   We now emit the number of workers as a numeric value, keyed by state

    m = function () {
        emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
    }
Reduce
•   As the array now contains values other than 1, we have to iterate over it

•   This is standard JavaScript

    r = function (k, v_arr) {
        var total = 0;
        var len = v_arr.length;

        for (var i = 0; i < len; i++)
        {
            total = total + v_arr[i];
        }
        return total;
    }
Average wage by VISA class and application status

•   Assumptions:

•   People work ~40 hour weeks

•   Weekly wages are paid every week rather than only the weeks worked

•   'Select Pay Range' seems to be the default option...

    m = function () {
        var k = this.VISA_CLASS + ' ' + this.STATUS;

        switch (this.LCA_CASE_WAGE_RATE_UNIT)
        {
            case 'Year':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM);
                break;

            case 'Month':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12);
                break;

            case 'Bi-Weekly':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26);
                break;

            case 'Week':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52);
                break;

            case 'Hour':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52);
                break;

            default:
                emit(k, 0);
        }
    }
Reduce
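•   Work out the average for each key

•   Add each of the elements up

•   Average them

•   Watch out: dividing inside reduce is only correct if reduce runs once per
    key. If MongoDB re-reduces, this takes an average of averages; carrying a
    sum and a count through reduce and dividing in finalize avoids that

    r = function (k, v_arr) {
        var tot = 0;
        var len = v_arr.length;

        for (var i = 0; i < len; i++)
        {
            tot += v_arr[i];
        }

        return tot / len;
    }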
Finalize

•   A finalize function may be run after reduction.

•   Called a single time per output object (once per key)

•   The finalize function takes a key and a value, and returns a finalized
    value.
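As a sketch of where finalize earns its keep, the average can be computed by carrying a running sum and count through reduce and dividing only once, in finalize. The `{ sum, count }` value shape, the function name `f` and the wage figures are assumptions for illustration, not part of the original slides.

```javascript
// Map would emit objects of this shape instead of bare numbers, e.g.
//   emit(k, { sum: annualWage, count: 1 });

// Reduce returns the same shape it receives, so it is safe to re-reduce.
var r = function (k, v_arr) {
    var out = { sum: 0, count: 0 };
    for (var i = 0; i < v_arr.length; i++) {
        out.sum += v_arr[i].sum;
        out.count += v_arr[i].count;
    }
    return out;
};

// Finalize takes the key and the reduced value, and does the division.
var f = function (k, v) {
    return v.count > 0 ? v.sum / v.count : 0;
};

// Local check with made-up wages, reduced in two batches and then re-reduced.
var partA = r('H-1B CERTIFIED', [{ sum: 50000, count: 1 }, { sum: 70000, count: 1 }]);
var partB = r('H-1B CERTIFIED', [{ sum: 90000, count: 1 }]);
var average = f('H-1B CERTIFIED', r('H-1B CERTIFIED', [partA, partB]));

console.log(average); // 70000
```

Averaging the two batch averages directly (60000 and 90000) would give 75000 instead. In the shell, the finalize function is passed alongside the other options, for example as `finalize: f`.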
Options

•   Persist the output

•   Filtering input documents

•   Sorting input documents

•   Javascript scope - allows you to pass in extra variables (cannot be
    changed at runtime?)
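The options above could be combined roughly as follows. This is a sketch under assumptions: the output collection name, the choice of filter and sort fields, and the scope variables `hoursPerWeek` and `weeksPerYear` are made up, and the final call only runs inside a MongoDB shell where `db` exists and `mapReduce` is available.

```javascript
// Map reads hoursPerWeek and weeksPerYear from scope instead of hard-coding them.
var m = function () {
    if (this.LCA_CASE_WAGE_RATE_UNIT === 'Hour') {
        emit(this.VISA_CLASS, {
            sum: this.LCA_CASE_WAGE_RATE_FROM * hoursPerWeek * weeksPerYear,
            count: 1
        });
    }
};

// Reduce keeps the emitted shape: a running sum and a count.
var r = function (k, v_arr) {
    var out = { sum: 0, count: 0 };
    for (var i = 0; i < v_arr.length; i++) {
        out.sum += v_arr[i].sum;
        out.count += v_arr[i].count;
    }
    return out;
};

var options = {
    out: 'hourly_wage_totals',                     // persist the output (made-up name)
    query: { STATUS: 'CERTIFIED' },                // filter input documents
    sort: { VISA_CLASS: 1 },                       // sort input documents by the emit key
    scope: { hoursPerWeek: 40, weeksPerYear: 52 }  // extra variables visible to map
};

// Only runs inside a MongoDB shell, where `db` is defined.
if (typeof db !== 'undefined') {
    db.text2010.mapReduce(m, r, options);
}
```

Sorting the input by the emit key is mainly an optimisation and is likely to want an index on that field.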
Current limitations / Watch for

•   Single threaded per node (which sucks)
    https://jira.mongodb.org/browse/SERVER-463


•   Language is restricted to Javascript (which sucks)
    https://jira.mongodb.org/browse/SERVER-699


•   Does not use secondaries in replica sets

•   From 1.7.3 on, you can reduce into an existing collection
...


•   Doesn't allow creation of full documents (which can be a pain for
    perm MR collections if using libraries)
    https://jira.mongodb.org/browse/SERVER-2517


•   Slow: roughly 20-30x slower than Hadoop as of 1.8
    https://jira.mongodb.org/browse/SERVER-3055
Using MongoDB with Hadoop

•   https://github.com/mongodb/mongo-hadoop

•   Open source

•   Requires knowledge of Java

•   Working Input and Output adapters for MongoDB are provided

•   Alpha quality from what I can tell
The future
1.9 / 2.0

•   V8 is replacing SpiderMonkey

•   Recent Hadoop provider

•   Sharded output collections

•   Improved yielding (concurrency)
> 2.0

•   Multi-threaded

•   Alternative languages
    https://jira.mongodb.org/browse/SERVER-699


•   ~2.2 native aggregation framework

•   JS-only mode is faster for lighter jobs
    https://jira.mongodb.org/browse/SERVER-2976
Further reading
•   I’ve only touched on the details, but this should be enough to get you
    interested in / started with MongoDB Map Reduce. Some of the missing
    stuff:

•   Finalize functions - http://bit.ly/gEfKOr

•   Some more examples - http://bit.ly/ig1Yfj

How does Riak compare to Cassandra? [Cassandra London User Group July 2011]How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
 
London MongoDB User Group April 2011
London MongoDB User Group April 2011London MongoDB User Group April 2011
London MongoDB User Group April 2011
 
Geo & capped collections with MongoDB
Geo & capped collections  with MongoDBGeo & capped collections  with MongoDB
Geo & capped collections with MongoDB
 

Último

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 

Último (20)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

An Introduction to Map/Reduce with MongoDB

  • 1. An Introduction to MapReduce with MongoDB
      Russell Smith
  • 2. /usr/bin/whoami
      • Russell Smith
      • Consultant for UKD1 Limited
      • I specialise in helping companies going through rapid growth
      • Code, architecture, infrastructure, devops, sysops, capacity planning, etc.
      • <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc.
  • 3. What is MongoDB
      • A scalable, high-performance, open source, document-oriented database
      • Stores JSON-like documents
      • Indexable on any attribute (like MySQL)
      • Built-in MapReduce
  • 4. Requirements
      • A running MongoDB server: http://www.mongodb.org/downloads
      • Basic knowledge of MongoDB
      • Basic JavaScript
  • 5. What is MapReduce
      • Allows aggregating data in parallel
      • Some built-in aggregation functions exist: distinct, count
      • If you need to do something more, either query the data yourself or use MapReduce
  • 6. How does it work?
      • You write two functions
      • You write them in JavaScript (currently)
      • Map function: called once per document; emits a key and a value
      • Reduce function: called for each key emitted (possibly more than once per key), with an array of values
      • Optional finalize function, for tidying up (e.g. rounding) the reduced data
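The flow described on this slide can be sketched in plain JavaScript. This is only an illustrative model, not MongoDB's implementation: `runMapReduce`, the sample documents and their `colour`/`qty` fields are all made up for the example, and the model calls reduce exactly once per key, whereas a real server may call it several times.

```javascript
// Hypothetical in-memory stand-in for the map/reduce/finalize flow.
// It is NOT MongoDB's implementation: it runs in a single pass and
// calls reduce exactly once per key.
function runMapReduce(docs, map, reduce, finalize) {
  const groups = new Map();

  // Expose emit() as a global so map functions can call it the way
  // they do on the slides.
  globalThis.emit = function (key, value) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };

  // Map phase: one call per document, with `this` bound to the document.
  for (const doc of docs) map.call(doc);

  // Reduce phase: one call per distinct key, with all values for that key.
  const results = {};
  for (const [key, values] of groups) {
    let value = reduce(key, values);
    // Optional finalize phase: one call per key on the reduced value.
    if (finalize) value = finalize(key, value);
    results[key] = value;
  }
  return results;
}

// Made-up documents, just to exercise the flow.
const docs = [
  { colour: 'red', qty: 2 },
  { colour: 'blue', qty: 5 },
  { colour: 'red', qty: 3 },
];

const map = function () { emit(this.colour, this.qty); };
const reduce = function (key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
};
const finalize = function (key, value) { return { total: value }; };

const out = runMapReduce(docs, map, reduce, finalize);
console.log(out);
```

The three phases stay separate on purpose: map only emits, reduce only combines values for one key, and finalize only reshapes the reduced result.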
  • 7. Some example data
      • I downloaded the H1B data (US temporary work visa applications): http://www.flcdatacenter.com/CaseH1B.aspx
      • Imported the CSV data using the mongoimport command
      • Total imported documents: ~335k
  • 8. What do the documents look like?
      {
          "_id" : ObjectId("4db7c981e243a6e23725570f"),
          "LCA_CASE_NUMBER" : "I-200-09132-243675",
          "STATUS" : "CERTIFIED",
          "LCA_CASE_SUBMIT" : "7/14/2010 9:06:36",
          "VISA_CLASS" : "H-1B",
          "LCA_CASE_EMPLOYMENT_START_DATE" : "12/15/2010 0:00:00",
          "LCA_CASE_EMPLOYMENT_END_DATE" : "12/15/2013 0:00:00",
          "LCA_CASE_EMPLOYER_NAME" : "BRITISH SCHOOL OF AMERICA, LLC",
          "LCA_CASE_EMPLOYER_ADDRESS" : "4211 WATONGA BLVD.",
          "LCA_CASE_EMPLOYER_CITY" : "HOUSTON",
          "LCA_CASE_EMPLOYER_STATE" : "TX",
          "LCA_CASE_EMPLOYER_POSTAL_CODE" : 77092,
          "LCA_CASE_SOC_CODE" : "25-2022.00",
          "LCA_CASE_SOC_NAME" : "Middle School Teachers, Except Special and Vocatio",
          "LCA_CASE_JOB_TITLE" : "MIDDLE SCHOOL TEACHER/IB COORDINATOR",
          "LCA_CASE_WAGE_RATE_FROM" : 51577.63,
          "LCA_CASE_WAGE_RATE_UNIT" : "Year",
          "FULL_TIME_POS" : "Y",
          "TOTAL_WORKERS" : 1,
          "LCA_CASE_WORKLOC1_CITY" : "HOUSTON",
          "LCA_CASE_WORKLOC1_STATE" : "TX",
          "PW_1" : 47827,
          "PW_UNIT_1" : "Year",
          "PW_SOURCE_1" : "OES",
          "OTHER_WAGE_SOURCE_1" : "OFLC ONLINE DATA CENTER",
          "YR_SOURCE_PUB_1" : 2010,
          "LCA_CASE_NAICS_CODE" : 611110,
          "Decision_Date" : "7/20/2010 0:00:00r"
      }
      Fields of interest:
      • LCA_CASE_EMPLOYER_STATE
      • STATUS
      • LCA_CASE_SUBMIT / Decision_Date
      • LCA_CASE_WAGE_RATE_FROM
  • 9. What can we do with the data?
      Work out the:
      • Applications per state
      • Applications by status per state
      • Average time from submission to decision, by status
  • 10. Applications by State
      • Key will be LCA_CASE_EMPLOYER_STATE
      • Assume (wrongly) one person per document
  • 11. Map
      • this is equal to the current document
      • Emit a value of 1, as we are assuming a single H1B application per document
      m = function () {
          emit(this.LCA_CASE_EMPLOYER_STATE, 1);
      }
  • 12. Reduce
      • Return a value: the length of the array
      • This works as each value in the array is 1
      r = function (k, v_arr) {
          return v_arr.length;
      }
  • 13. Executing
      • This will execute the map/reduce
      • Output goes to a collection named workers_by_state
      db.text2010.mapReduce(m, r, {out: 'workers_by_state', keeptemp: true, verbose: true})
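To try the map and reduce functions from these slides without a server, they can be run against a handful of documents in plain JavaScript. The sketch below assumes a made-up helper, `mapReduceInMemory`, standing in for `db.text2010.mapReduce`, and three invented sample documents; the `{_id, value}` row shape mirrors what MongoDB writes to the output collection.

```javascript
// Map and reduce as on the slides.
const m = function () { emit(this.LCA_CASE_EMPLOYER_STATE, 1); };
const r = function (k, v_arr) { return v_arr.length; };

// Hypothetical stand-in for db.text2010.mapReduce(m, r, {out: ...}).
// It groups emitted values by key and calls reduce once per key.
function mapReduceInMemory(docs, map, reduce) {
  const groups = {};
  // emit() is made global so the slide's map function works unchanged.
  globalThis.emit = function (key, value) {
    (groups[key] = groups[key] || []).push(value);
  };
  docs.forEach(function (doc) { map.call(doc); });
  // Shape each row like a document in the output collection.
  return Object.keys(groups).map(function (key) {
    return { _id: key, value: reduce(key, groups[key]) };
  });
}

// Invented sample documents; only the field used by map is included.
const sample = [
  { LCA_CASE_EMPLOYER_STATE: 'TX' },
  { LCA_CASE_EMPLOYER_STATE: 'CA' },
  { LCA_CASE_EMPLOYER_STATE: 'TX' },
];

const workersByState = mapReduceInMemory(sample, m, r);
console.log(workersByState);
```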
  • 15. A more complex Map!
      • The last example assumed one worker per document... which is wrong.
      • We now emit a numeric value per state
      m = function () {
          emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
      }
  • 16. Reduce
      • As the array now contains values other than 1, we have to iterate over it
      • This is standard JavaScript
      r = function (k, v_arr) {
          var total = 0;
          var len = v_arr.length;
          for (var i = 0; i < len; i++) {
              total = total + v_arr[i];
          }
          return total;
      }
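Summing has a second advantage over returning the array length. MongoDB may call reduce more than once for the same key, feeding earlier reduce results back in as values, so a reduce function should give the same answer when applied to its own partial results. The check below is a small sketch of that property; `isReReduceSafe` and the sample values are hypothetical and exist only for this illustration.

```javascript
// The length-based reduce from the earlier slide.
const countByLength = function (k, v_arr) { return v_arr.length; };

// The summing reduce from this slide.
const sumValues = function (k, v_arr) {
  var total = 0;
  for (var i = 0; i < v_arr.length; i++) {
    total = total + v_arr[i];
  }
  return total;
};

// Hypothetical check: reduce the values in two batches, reduce the two
// partial results again, and compare with reducing everything at once.
function isReReduceSafe(reduce, key, values) {
  const half = Math.floor(values.length / 2);
  const partials = [
    reduce(key, values.slice(0, half)),
    reduce(key, values.slice(half)),
  ];
  return reduce(key, partials) === reduce(key, values);
}

// Made-up TOTAL_WORKERS values emitted for a single state.
const sumIsSafe = isReReduceSafe(sumValues, 'TX', [1, 3, 2, 1, 5]);
// Five emitted 1s, as in the one-worker-per-document example.
const lengthIsSafe = isReReduceSafe(countByLength, 'TX', [1, 1, 1, 1, 1]);

console.log('sum survives re-reduce:', sumIsSafe);
console.log('length survives re-reduce:', lengthIsSafe);
```

The length-based version fails the check because the second pass counts the partial results rather than the original documents.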
  • 17. VISA Class by Application Status by Average wage
      Assumptions:
      • People work ~40 hour weeks
      • Weekly wages are paid every week rather than only the weeks worked
      • 'Select Pay Range' seems to be the default option...
      m = function () {
          var k = this.VISA_CLASS + ' ' + this.STATUS;
          switch (this.LCA_CASE_WAGE_RATE_UNIT) {
              case 'Year':
                  emit(k, this.LCA_CASE_WAGE_RATE_FROM);
                  break;
              case 'Month':
                  emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12);
                  break;
              case 'Bi-Weekly':
                  emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26);
                  break;
              case 'Week':
                  emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52);
                  break;
              case 'Hour':
                  emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52);
                  break;
              default:
                  emit(k, 0);
          }
      }
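The unit conversion in this map function can also be written as a lookup table, which makes the assumptions easier to see and to test on their own. This is a sketch rather than the author's code: `PERIODS_PER_YEAR` and `annualWage` are names invented here, while the multipliers and the zero fallback are copied from the switch statement.

```javascript
// Multipliers taken from the slide's switch statement.
const PERIODS_PER_YEAR = {
  'Year': 1,
  'Month': 12,
  'Bi-Weekly': 26,
  'Week': 52,
  'Hour': 40 * 52, // assumes ~40 hour weeks, paid every week of the year
};

// Hypothetical helper: returns the annualised wage, or 0 for a unit that
// is not in the table (matching the default branch on the slide).
function annualWage(rate, unit) {
  if (!Object.prototype.hasOwnProperty.call(PERIODS_PER_YEAR, unit)) {
    return 0;
  }
  return rate * PERIODS_PER_YEAR[unit];
}

// Illustrative inputs; only the first comes from the sample document.
console.log(annualWage(51577.63, 'Year'));
console.log(annualWage(25, 'Hour'));
console.log(annualWage(1000, 'Bi-Weekly'));
console.log(annualWage(10, 'Select Pay Range'));
```

Inside a real map function, the body would then reduce to a single `emit(k, annualWage(...))` call, provided the helper is made available to the server (for example through the scope option).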
  • 18. Reduce
      • Work out the average for each key
      • Add each of the elements up
      • Average them
      r = function (k, v_arr) {
          var tot = 0;
          var len = v_arr.length;
          for (var i = 0; i < len; i++) {
              tot += v_arr[i];
          }
          return tot / len;
      }
  • 19. Finalize
      • A finalize function may be run after reduction.
      • Called a single time per object
      • The finalize function takes a key and a value, and returns a finalized value.
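Averages are a natural fit for finalize. Because MongoDB may feed reduce output back into reduce for the same key, dividing inside reduce (as `tot / len` does) can end up averaging averages. One common alternative, sketched below under that assumption, is to carry a running sum and count through reduce and divide only in finalize. The names `reduceAvg` and `finalizeAvg` and the wage figures are invented for the example.

```javascript
// Each value carries a running sum and a count, so partial results can
// be combined again without skewing the average.
const reduceAvg = function (k, v_arr) {
  var out = { sum: 0, count: 0 };
  for (var i = 0; i < v_arr.length; i++) {
    out.sum += v_arr[i].sum;
    out.count += v_arr[i].count;
  }
  return out;
};

// Finalize runs once per key and turns the pair into the average.
const finalizeAvg = function (k, v) {
  return v.count === 0 ? 0 : v.sum / v.count;
};

const key = 'H-1B CERTIFIED';

// Made-up annual wages, wrapped the way a map function would emit them.
const emitted = [40000, 50000, 90000].map(function (wage) {
  return { sum: wage, count: 1 };
});

// Reduce everything in one call...
const direct = finalizeAvg(key, reduceAvg(key, emitted));

// ...and in two batches whose partial results are reduced again.
const partials = [
  reduceAvg(key, emitted.slice(0, 1)),
  reduceAvg(key, emitted.slice(1)),
];
const batched = finalizeAvg(key, reduceAvg(key, partials));

console.log(direct, batched);
```

With this shape, the map function would emit `{sum: wage, count: 1}` instead of the bare wage, so that emitted values and reduce output look the same.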
  • 20. Options
      • Persist the output
      • Filtering input documents
      • Sorting input documents
      • JavaScript scope: allows you to pass in extra variables (cannot be changed at runtime?)
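As a rough illustration, the options listed on this slide map onto the `out`, `query`, `sort` and `scope` keys of the options document passed to mapReduce. The values below are hypothetical: the output collection name, the filter, the sort field and the `HOURS_PER_WEEK` variable were chosen only to fit the wage example.

```javascript
// Hypothetical options for the wage example; all values are illustrative.
const options = {
  // Persist the output in a named collection.
  out: 'avg_wage_by_class_status',
  // Filter the input documents before they reach map.
  query: { STATUS: 'CERTIFIED' },
  // Sort the input documents.
  sort: { LCA_CASE_EMPLOYER_STATE: 1 },
  // Extra variables made visible to the map, reduce and finalize functions.
  scope: { HOURS_PER_WEEK: 40 },
};

// In the mongo shell this object would be passed as the third argument:
//   db.text2010.mapReduce(m, r, options)
console.log(JSON.stringify(options, null, 2));
```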
  • 21. Current limitations / Watch for
      • Single threaded per node (which sucks): https://jira.mongodb.org/browse/SERVER-463
      • Language is restricted to JavaScript (which sucks): https://jira.mongodb.org/browse/SERVER-699
      • Does not use secondaries in replica sets
      • From 1.7.3 on, you can reduce into an existing collection
  • 22. ...
      • Doesn't allow creation of full documents (which can be a pain for permanent MapReduce collections if using libraries): https://jira.mongodb.org/browse/SERVER-2517
      • Slow; roughly 20-30x slower than Hadoop with 1.8: https://jira.mongodb.org/browse/SERVER-3055
  • 23. Using MongoDB with Hadoop
      • https://github.com/mongodb/mongo-hadoop
      • Open source
      • Requires knowledge of Java
      • Working input and output adapters for MongoDB are provided
      • Alpha quality from what I can tell
  • 25. 1.9 / 2.0
      • V8 is replacing SpiderMonkey
      • Recent Hadoop provider
      • Sharded output collections
      • Improved yielding (concurrency)
  • 26. > 2.0
      • Multi-threaded
      • Alternative languages: https://jira.mongodb.org/browse/SERVER-699
      • ~2.2 native aggregation framework
      • JS-only mode is faster for lighter jobs: https://jira.mongodb.org/browse/SERVER-2976
  • 27. Further reading
      I've only touched on the details, but this should be enough to get you interested in, and started with, MongoDB MapReduce. Some of the missing stuff:
      • Finalize functions: http://bit.ly/gEfKOr
      • Some more examples: http://bit.ly/ig1Yfj
