Principal Solutions Architect, MongoDB, Inc.
Asya Kamsky
Data Processing and
Aggregation Options
#BigDataCamp @MongoDB @asya999
Applications and data
Store
Process
Data Processing and Aggregation Options in MongoDB / Asya Kamsky
Big Data
Big Data in MongoDB
• An ideal operational database
• High performance for storage and retrieval at large scale
• Robust query interface for intelligent operations
MongoDB data processing
options
Big Data in MongoDB
• Pre-aggregate in MongoDB for real-time queries
• Process in MongoDB using Aggregation Framework
• Process in MongoDB using Map/Reduce
• Process outside MongoDB using Hadoop and other external tools
Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
• Plays nice with sharding
Pipeline
ps ax | grep mongod | head -n 1
Piping command line operations
Pipeline
$match | $group | $sort
Piping aggregation operations
Stream of documents → Result document
Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort/$skip/$limit
• $redact
• $geoNear
• $out
$match
• Filter documents
• Uses existing query syntax
• 2.4 added support for geospatial operations
• 2.6 added support for full text search indexes
{ $match : { state : "NY" } }
{
city: "SAN FRANCISCO",
loc: [-122.4614, 37.781],
state: "CA"
}
{
city: "NEW YORK",
loc: [ -73.989, 40.731],
state: "NY"
}
{
city: "PALO ALTO",
loc: [ -122.127, 37.418],
state: "CA"
}
{ $match : { loc : { $geoWithin :
{ $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } }
{
city: "SAN FRANCISCO",
loc: [-122.4614, 37.781],
state: "CA"
}
{
city: "NEW YORK",
loc: [ -73.989, 40.731],
state: "NY"
}
{
city: "PALO ALTO",
loc: [ -122.127, 37.418],
state: "CA"
}
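The `20/3959` passed to `$centerSphere` is a radius in radians: 20 miles divided by an Earth radius of roughly 3959 miles. The sketch below is plain JavaScript, not MongoDB code. It assumes `loc` is stored as `[longitude, latitude]`, and `haversineRadians` and `withinCenterSphere` are hypothetical helper names used only to illustrate which of the sample documents fall inside that circle.

```javascript
// Sample documents from the slide; loc is assumed to be [longitude, latitude].
const cities = [
  { city: "SAN FRANCISCO", loc: [-122.4614, 37.781], state: "CA" },
  { city: "NEW YORK", loc: [-73.989, 40.731], state: "NY" },
  { city: "PALO ALTO", loc: [-122.127, 37.418], state: "CA" },
];

const toRad = (deg) => (deg * Math.PI) / 180;

// Great-circle angle in radians between two [lng, lat] points (haversine formula).
function haversineRadians([lng1, lat1], [lng2, lat2]) {
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * Math.asin(Math.sqrt(a));
}

// Hypothetical stand-in for { $geoWithin: { $centerSphere: [center, radius] } }.
function withinCenterSphere(docs, center, radiusRadians) {
  return docs.filter((d) => haversineRadians(center, d.loc) <= radiusRadians);
}

const radius = 20 / 3959; // 20 miles expressed in radians
const nearby = withinCenterSphere(cities, [-122.4, 37.79], radius);
console.log(nearby.map((d) => d.city));
```

With these three documents only SAN FRANCISCO passes the filter; PALO ALTO lies roughly 30 miles from the chosen center.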
$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields
Selecting and Excluding Fields
$project: { _id: 0, loc: 1, state: 1 }
Input:
{
_id: "94105",
city: "SAN FRANCISCO",
loc: [-122.3892, 37.7864],
state: "CA"
}
Output:
{
loc: [-122.3892, 37.7864],
state: "CA"
}
Renaming and Computing Fields
$project: { zip: "$_id", cityState: { $concat: [ "$city", ", ", "$state" ] }, _id: 0 }
Input:
{
_id: "94105",
city: "SAN FRANCISCO",
loc: [-122.3892, 37.7864],
state: "CA"
}
Output:
{
zip: "94105",
cityState: "SAN FRANCISCO, CA"
}
Renaming and Computing Fields
$project : { dt : { y : { "$year" : "$orderdate" },
m : { "$month" : "$orderdate" },
d : { "$dayOfMonth" : "$orderdate" } },
totalprice : 1, status : 1, _id : 0 }
Input:
{
_id : 6694,
cname : "Cust#000060209",
status : "F",
totalprice : 123350.97,
orderdate : ISODate("2012-09-01T13:11:31Z"),
lineitems: [
{ ... },
{ ... },
{ ... }
]
}
Output:
{
dt : {
y : 2012,
m : 9,
d : 1
},
totalprice: 123350.97,
status: "F"
}
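To make the date projection concrete, here is a plain-JavaScript sketch of what the stage computes for the sample order. It is an illustration only, not how the server implements `$project`; `projectOrder` is a hypothetical helper, and it assumes the date parts are taken in UTC.

```javascript
// Sample order from the slide, reduced to the fields the projection touches.
const order = {
  _id: 6694,
  status: "F",
  totalprice: 123350.97,
  orderdate: new Date("2012-09-01T13:11:31Z"),
};

// Hypothetical helper mirroring the $project stage for this one document shape.
function projectOrder(doc) {
  return {
    dt: {
      y: doc.orderdate.getUTCFullYear(),  // like $year
      m: doc.orderdate.getUTCMonth() + 1, // like $month (JS months are 0-based)
      d: doc.orderdate.getUTCDate(),      // like $dayOfMonth
    },
    totalprice: doc.totalprice, // totalprice : 1
    status: doc.status,         // status : 1
    // _id is left out, matching _id : 0
  };
}

const projected = projectOrder(order);
console.log(JSON.stringify(projected));
```

The nested `dt` object shows the "create sub-document fields" point: the computed values live under one new key rather than as three top-level fields.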
$group
• Group documents by an ID
– Field reference, object, constant
• Other output fields are computed
– $max, $min, $avg, $sum
– $addToSet, $push
– $first, $last
• Processes all data in memory
– can utilize external disk-based sort in 2.6
Find the smallest cities within twenty miles of San Francisco
{ $match : { loc :
{ $geoWithin :
{ $centerSphere : [
[ -122.4, 37.79 ],
20/3959
]
} } } },
{ $group : {
_id : "$city",
pop : { $sum : "$pop" }
}
},
{ $sort : { "pop" : 1 } },
{ $limit : 3 }
Sample input documents:
{ _id: "94306",
city: "PALO ALTO",
loc: [ -122.127, 37.418],
pop: 24309 }
{ _id: "10280",
city: "NEW YORK",
loc: [ -74.016, 40.710],
pop: 5574 }
{ _id: "94124",
city: "SAN FRANCISCO",
loc: [-122.388, 37.73],
pop: 27239 }
Output:
{
_id: "STINSON BEACH",
pop: 630
}
{
_id: "WOODACRE",
pop: 1524
}
{
_id: "BOLINAS",
pop: 1555
}
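The grouping step is the part of this pipeline that changes document shape, so a small plain-JavaScript sketch may help. It assumes the `$match` stage has already run, uses made-up city names and populations, and `groupSum` is a hypothetical helper rather than anything MongoDB provides.

```javascript
// Hypothetical documents that already passed $match; two zip codes share a city.
const matched = [
  { _id: "00001", city: "ALPHA", pop: 400 },
  { _id: "00002", city: "BETA", pop: 900 },
  { _id: "00003", city: "ALPHA", pop: 300 },
  { _id: "00004", city: "GAMMA", pop: 1200 },
  { _id: "00005", city: "DELTA", pop: 2500 },
];

// Like { $group: { _id: "$city", pop: { $sum: "$pop" } } }:
// one output document per distinct key, with the summed field.
function groupSum(docs, keyField, sumField) {
  const totals = new Map();
  for (const doc of docs) {
    totals.set(doc[keyField], (totals.get(doc[keyField]) || 0) + doc[sumField]);
  }
  return [...totals].map(([key, total]) => ({ _id: key, [sumField]: total }));
}

const smallest = groupSum(matched, "city", "pop")
  .sort((a, b) => a.pop - b.pop) // like { $sort: { pop: 1 } }
  .slice(0, 3);                  // like { $limit: 3 }
console.log(smallest);
```

Grouping by city before sorting matters because one city can span several zip-code documents; sorting the raw documents would rank zip codes, not cities.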
$unwind
• Operate on an array field
• Yield new documents for each array element
– Array replaced by element value
– Missing/empty fields → no output
– Non-array fields → error
• Pipe to $group to aggregate array values
$unwind
{ $unwind: "$subjects" }
Input:
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: [
"Long Island",
"New York",
"1920s"
]
}
Output:
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "Long Island"
}
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "New York"
}
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "1920s"
}
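The rules listed for `$unwind` (one output document per array element, nothing for missing or empty fields, an error for non-array fields) can be sketched in plain JavaScript. The `unwind` function below is a hypothetical illustration of those rules as stated in this deck, not the server's code.

```javascript
// Hypothetical emulation of { $unwind: "$<field>" } following the rules above.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    if (value === undefined || value === null) continue; // missing field: no output
    if (!Array.isArray(value)) {
      throw new Error('$unwind: field "' + field + '" is not an array');
    }
    for (const element of value) {
      // The array is replaced by one element; an empty array yields nothing.
      out.push({ ...doc, [field]: element });
    }
  }
  return out;
}

const book = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

const unwound = unwind([book], "subjects");
console.log(unwound);
```

Piping `unwound` into a grouping step keyed on `subjects` is what lets array values be counted or summed per element.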
2.6 Improvements
• Returns a cursor (not a document)
– just like a regular find
• New stages
– $redact
– $out
• New operators:
– set expression operators.
– $let and $map operators to allow for the use of variables.
– $literal operator and $size operator
– $cond expression object
• Integrated $text search
• Performance improvements, "explain" and more
Advantages
• Runs on the server
– Uses indexes
– Uses shards
• Simple to build complex pipelines
• Easy to use from any driver
• Faster than other options
Limitations
• Pipeline operator memory limits
– 10% of total system RAM in 2.4 and earlier
– 100MB in 2.6 but can use disk for external sort
• Some data types not allowed
– Code, CodeWithScope, etc.
• Result size limited (in 2.4 and earlier)
– 2.6 returns a cursor or can direct output to a new collection
– No result size limit!
MapReduce
• Versatile, powerful
• Intended for complex data
analysis
• Overkill for simple aggregations
MapReduce
[Diagram: worker threads call the mapper on the data set; workers then call Reduce() to produce the output]
Our Example Data
{
_id: 375,
title: "The Great Gatsby",
ISBN: "9781857150193",
available: true,
pages: 218,
chapters: 9,
subjects: [
"Long Island",
"New York",
"1920s"
],
language: "English"
}
MapReduce
// map: emit one value per book, keyed by language
function map() {
var key = this.language;
emit( key, { totalPages : this.pages, numBooks : 1 } );
}
// reduce: sum pages and book counts for each language
function reduce(key, values) {
var result = { numBooks : 0, totalPages : 0};
values.forEach(function (value) {
result.numBooks += value.numBooks;
result.totalPages += value.totalPages;
});
return result;
}
// finalize: average pages per book for each language
function finalize( key, value ) {
if ( value.numBooks != 0 )
return value.totalPages / value.numBooks;
}
db.books.mapReduce(
map, reduce, {finalize: finalize, out: { inline : 1} } )
"results" : [
{
"_id" : "English",
"value" : 653
},
{
"_id" : "Russian",
"value" : 1440
}
]
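To see how the three functions cooperate without a running server, the following plain-JavaScript sketch emulates the map, reduce, and finalize flow in memory. `runMapReduce` is a hypothetical helper, and every book except "The Great Gatsby" is made up, so the averages here are not the ones shown in the slide output.

```javascript
// Hypothetical sample collection; only the first document comes from the deck.
const books = [
  { title: "The Great Gatsby", language: "English", pages: 218 },
  { title: "Hypothetical Book A", language: "English", pages: 382 },
  { title: "Hypothetical Book B", language: "Russian", pages: 1200 },
];

function map() {
  emit(this.language, { totalPages: this.pages, numBooks: 1 });
}

function reduce(key, values) {
  var result = { numBooks: 0, totalPages: 0 };
  values.forEach(function (value) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  });
  return result;
}

function finalize(key, value) {
  if (value.numBooks != 0) return value.totalPages / value.numBooks;
}

// Hypothetical in-memory driver: group emitted values by key, reduce, finalize.
function runMapReduce(docs, mapFn, reduceFn, finalizeFn) {
  const emitted = new Map();
  // mapFn calls a global emit(), as it would in the mongo shell.
  globalThis.emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  docs.forEach((doc) => mapFn.call(doc)); // "this" is the current document
  delete globalThis.emit;
  return [...emitted].map(([key, values]) => {
    // A key with a single emitted value skips reduce entirely.
    const reduced = values.length > 1 ? reduceFn(key, values) : values[0];
    return { _id: key, value: finalizeFn(key, reduced) };
  });
}

const results = runMapReduce(books, map, reduce, finalize);
console.log(results);
```

Note that the emitted value and the reduce result have the same shape (`totalPages`, `numBooks`), which is why the average is deferred to `finalize` rather than computed inside `reduce`.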
Advantages
• Map and reduce code can be arbitrarily complex
– JavaScript, helper functions
• Results can be saved into a new collection
– replace, merge or re-reduce
• Incremental MapReduce
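Re-reduce and incremental MapReduce depend on one property of the reduce function: its output has the same shape as its input values, so a stored result can be fed back in alongside newly emitted values. The sketch below checks that property in plain JavaScript using the deck's reduce function; the page counts and batch split are hypothetical.

```javascript
// The reduce function from the deck's example.
function reduce(key, values) {
  var result = { numBooks: 0, totalPages: 0 };
  values.forEach(function (value) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  });
  return result;
}

// Hypothetical emitted values for one key, arriving in two batches.
const firstBatch = [
  { totalPages: 218, numBooks: 1 },
  { totalPages: 382, numBooks: 1 },
];
const secondBatch = [{ totalPages: 500, numBooks: 1 }];

// Earlier run: reduce the first batch and keep the result.
const stored = reduce("English", firstBatch);

// Later run: re-reduce the stored result together with only the new values.
const incremental = reduce("English", [stored, ...secondBatch]);

// Reference: reducing everything at once gives the same totals.
const fromScratch = reduce("English", [...firstBatch, ...secondBatch]);
console.log(incremental, fromScratch);
```

A reduce function that returned, say, an average instead of running totals would break this equality, which is one reason the deck's example leaves the division to `finalize`.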
Limitations
• Implemented with JavaScript
– Single-threaded
• Slower than Aggregation Framework
– Batch, not real time
• Harder to understand, implement, debug...
Analyzing MongoDB Data in
External Systems
Hadoop
Framework that allows for the distributed processing
of large data sets across clusters of computers
Hadoop MongoDB Connector
• MongoDB or BSON files as input/output
• Source data can be filtered with queries
• Hadoop Streaming support
– For jobs written in Python, Ruby, Node.js
• Supports Hadoop tools such as Pig and Hive
Processing Big Data
• Data broken up into smaller pieces
• Process data across multiple nodes
[Diagram: data pieces distributed across multiple Hadoop nodes]
Input splits on Non-sharded Systems
[Diagram labels: Total Dataset; Hadoop nodes; Single Map Reduce]
Advantages
• Processing decoupled from data store
• Parallel processing
• Leverage existing infrastructure
• Java has rich set of data processing libraries
– And other languages if using Hadoop Streaming
Disadvantages
• Batch processing
• Requires synchronization between data store and processor
• Adds complexity to infrastructure
Storm
Storm MongoDB connector
• Spout for MongoDB oplog or capped collections
– Filtering capabilities
– Threaded and non-blocking
• Output to new or existing documents
– Insert/update bolt
Aggregating MongoDB’s
Data Processing Options
Internal Tools
• Storing pre-aggregated data
– An exercise in schema design
• Aggregation Framework
• MapReduce
External Tools
Questions?
Principal Solutions Architect, MongoDB Inc.
Asya Kamsky
Thank You
#BigDataCamp @MongoDB @asya999

How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

2014 bigdatacamp asya_kamsky

  • 1. Data Processing and Aggregation Options. Asya Kamsky, Principal Solutions Architect, MongoDB, Inc. #BigDataCamp @MongoDB @asya999
  • 2. Applications and data: Store, Process. Data Processing and Aggregation Options in MongoDB / Asya Kamsky
  • 3. Big Data
  • 4. Big Data in MongoDB
  • 5. Big Data in MongoDB • An ideal operational database • High performance for storage and retrieval at large scale • Robust query interface for intelligent operations
  • 6. MongoDB data processing options
  • 7. Big Data in MongoDB • Pre-aggregate in MongoDB for real-time queries • Process in MongoDB using Aggregation Framework • Process in MongoDB using Map/Reduce • Process outside MongoDB using Hadoop and other external tools
  • 8. Aggregation Framework
  • 9. Aggregation Framework • Declared in JSON, executes in C++
  • 10. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple
  • 11. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple • Plays nice with sharding
  • 12. Pipeline: ps ax | grep mongod | head -1 (piping command-line operations)
  • 13. Pipeline: $match | $group | $sort (piping aggregation operations). Stream of documents → Result document
  • 14. Pipeline Operators • $match • $project • $group • $unwind • $sort/$skip/$limit • $redact • $geoNear • $out
  • 15. $match • Filter documents • Uses existing query syntax • 2.4 added support for geospatial operations • 2.6 added support for full text search indexes
  • 16. { $match : { state : "NY" } } { city: "SAN FRANCISCO", loc: [-122.4614, 37.781], state: "CA" } { city: "NEW YORK", loc: [-73.989, 40.731], state: "NY" } { city: "PALO ALTO", loc: [-122.127, 37.418], state: "CA" }
  • 17. { $match : { loc : { $geoWithin : { $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } } { city: "SAN FRANCISCO", loc: [-122.4614, 37.781], state: "CA" } { city: "NEW YORK", loc: [-73.989, 40.731], state: "NY" } { city: "PALO ALTO", loc: [-122.127, 37.418], state: "CA" }
  • 18. $project • Reshape documents • Include, exclude or rename fields • Inject computed fields • Create sub-document fields
  • 19. Selecting and Excluding Fields. Input: { _id: "94105", city: "SAN FRANCISCO", loc: [-122.3892, 37.7864], state: "CA" } Stage: $project: { _id: 0, loc: 1, state: 1 } Output: { loc: [-122.3892, 37.7864], state: "CA" }
  • 20. Renaming and Computing Fields. Input: { _id: "94105", city: "SAN FRANCISCO", loc: [-122.3892, 37.7864], state: "CA" } Stage: $project: { zip: "$_id", cityState: { $concat: [ "$city", ", ", "$state" ] }, _id: 0 } Output: { zip: "94105", cityState: "SAN FRANCISCO, CA" }
  • 21. Renaming and Computing Fields (callouts: New Field, Operation). Input: { _id: "94105", city: "SAN FRANCISCO", loc: [-122.3892, 37.7864], state: "CA" } Stage: $project: { zip: "$_id", cityState: { $concat: [ "$city", ", ", "$state" ] }, _id: 0 } Output: { zip: "94105", cityState: "SAN FRANCISCO, CA" }
  • 22. Renaming and Computing Fields. Input: { _id : 6694, cname : "Cust#000060209", status : "F", totalprice : 123350.97, orderdate : ISODate("2012-09-01T13:11:31Z"), lineitems: [ { ... }, { ... }, { ... } ] } Stage: $project : { dt: { y : { "$year" : "$orderdate" }, m : { "$month" : "$orderdate" }, d : { "$dayOfMonth" : "$orderdate" } }, totalprice : 1, status : 1, _id : 0 } Output: { dt : { y : 2012, m : 9, d : 1 }, totalprice: 123350.97, status: "F" }
  • 23. $group • Group documents by an ID – Field reference, object, constant • Other output fields are computed – $max, $min, $avg, $sum – $addToSet, $push – $first, $last • Processes all data in memory – can utilize external disk-based sort in 2.6
  • 24. Find the smallest cities within twenty miles of San Francisco. { _id: "94306", city: "PALO ALTO", loc: [-122.127, 37.418], pop: 24309 } { _id: "10280", city: "NEW YORK", loc: [-74.016, 40.710], pop: 5574 } { _id: "94124", city: "SAN FRANCISCO", loc: [-122.388, 37.73], pop: 27239 }
  • 25. Find the smallest cities within twenty miles of San Francisco. Input: { _id: "94306", city: "PALO ALTO", loc: [-122.127, 37.418], pop: 24309 } { _id: "10280", city: "NEW YORK", loc: [-74.016, 40.710], pop: 5574 } { _id: "94124", city: "SAN FRANCISCO", loc: [-122.388, 37.73], pop: 27239 } Pipeline: { $match : { loc : { $geoWithin : { $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } }, { $group : { _id : "$city", pop : { $sum : "$pop" } } }, { $sort : { pop : 1 } }, { $limit : 3 } Output: { _id: "STINSON BEACH", pop: 630 } { _id: "WOODACRE", pop: 1524 } { _id: "BOLINAS", pop: 1555 }
  • 26. $unwind • Operate on an array field • Yield new documents for each array element – Array replaced by element value – Missing/empty fields → no output – Non-array fields → error • Pipe to $group to aggregate array values
  • 27. $unwind. Input: { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] } Stage: { $unwind: "$subjects" } Output: { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }
  • 28. 2.6 Improvements • Returns a cursor (not a document), just like a regular find • New stages – $redact – $out • New operators – set expression operators – $let and $map operators, which allow the use of variables – $literal and $size operators – $cond expression object • Integrated $text search • Performance improvements, "explain" and more
  • 29. Advantages • Runs on the server – Uses indexes – Uses shards • Simple to build complex pipelines • Easy to use from any driver • Faster than other options
  • 30. Limitations • Pipeline operator memory limits – 10% of total system RAM in 2.4 and earlier – 100MB in 2.6 but can use disk for external sort • Some data types not allowed – Code, CodeWithScope, etc. • Result size limited (in 2.4 and earlier) – 2.6 returns a cursor or writes output directly to a new collection: no result size limit!
  • 31. MapReduce
  • 32. MapReduce • Versatile, powerful
  • 33. MapReduce • Versatile, powerful • Intended for complex data analysis
  • 34. MapReduce • Versatile, powerful • Intended for complex data analysis • Overkill for simple aggregations
  • 35. MapReduce (diagram: Data Set → Worker thread calls mapper)
  • 36. MapReduce (diagram: Data Set → Worker thread calls mapper → Workers call Reduce() → Output)
  • 37. Our Example Data: { _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }
  • 38. MapReduce db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline : 1 } } ) function map() { var key = this.language; emit( key, { totalPages : this.pages, numBooks : 1 } ); }
  • 39. MapReduce db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline : 1 } } ) function reduce(key, values) { var result = { numBooks : 0, totalPages : 0 }; values.forEach(function (value) { result.numBooks += value.numBooks; result.totalPages += value.totalPages; }); return result; }
  • 40. MapReduce db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline : 1 } } ) function finalize( key, value ) { if ( value.numBooks != 0 ) return value.totalPages / value.numBooks; }
  • 41. MapReduce db.books.mapReduce( map, reduce, { finalize: finalize, out: { inline : 1 } } ) function finalize( key, value ) { if ( value.numBooks != 0 ) return value.totalPages / value.numBooks; }
  • 42. MapReduce db.books.mapReduce( map, reduce, {finalize: finalize, out: { inline : 1} } ) "results" : [ { "_id" : "English", "value" : 653 }, { "_id" : "Russian", "value" : 1440 } ]
  • 43. Advantages • Map and reduce code can be arbitrarily complex – JavaScript, helper functions • Results can be saved into a new collection – replace, merge or re-reduce • Incremental MapReduce
  • 44. Limitations • Implemented with JavaScript – Single-threaded • Slower than Aggregation Framework – Batch, not real time • Harder to understand, implement, debug...
  • 45. Analyzing MongoDB Data in External Systems
  • 46. Hadoop: a framework that allows for the distributed processing of large data sets across clusters of computers
  • 47. Hadoop MongoDB Connector • MongoDB or BSON files as input/output • Source data can be filtered with queries • Hadoop Streaming support – For jobs written in Python, Ruby, Node.js • Supports Hadoop tools such as Pig and Hive
  • 48. Processing Big Data • Data broken up into smaller pieces • Process data across multiple nodes (diagram: multiple Hadoop nodes)
  • 49. Input splits on Non-sharded Systems (diagram: Total Dataset, Single Map Reduce, multiple Hadoop nodes)
  • 50. Advantages • Processing decoupled from data store • Parallel processing • Leverage existing infrastructure • Java has rich set of data processing libraries – And other languages if using Hadoop Streaming Disadvantages • Batch processing • Requires synchronization between data store and processor • Adds complexity to infrastructure
  • 51. Storm
  • 52. Storm
  • 53. Storm MongoDB connector • Spout for MongoDB oplog or capped collections – Filtering capabilities – Threaded and non-blocking • Output to new or existing documents – Insert/update bolt
  • 55. Internal Tools • Storing pre-aggregated data – An exercise in schema design • Aggregation Framework • MapReduce
  • 56. External Tools
  • 58. Principal Solutions Architect, MongoDB Inc. Asya Kamsky Thank You #BigDataCamp @MongoDB @asya999
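
The $unwind rules on slides 26 and 27 can be sketched in plain JavaScript, outside MongoDB. This is only an illustration of the behaviour the slides describe: the helper name `unwind` is hypothetical and is not a MongoDB API, and treating `null` like a missing field is an assumption of this sketch.

```javascript
// Plain-JavaScript emulation of the $unwind rules listed on slide 26.
// `unwind` is a hypothetical helper for illustration, not a MongoDB API.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    // Missing field (null is treated as missing here, by assumption)
    // or empty array: the document produces no output.
    if (value === undefined || value === null) continue;
    if (Array.isArray(value) && value.length === 0) continue;
    // Non-array field: an error, as the slide states.
    if (!Array.isArray(value)) {
      throw new Error("unwind: value of '" + field + "' is not an array");
    }
    // One output document per element; the array is replaced by the element.
    for (const element of value) {
      out.push(Object.assign({}, doc, { [field]: element }));
    }
  }
  return out;
}

const books = [
  {
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: ["Long Island", "New York", "1920s"]
  }
];

const unwound = unwind(books, "subjects");
console.log(unwound);
```

Each of the three resulting documents keeps `title` and `ISBN` and carries a single `subjects` value, which is the shape a following $group stage would aggregate over.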
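
The map, reduce and finalize functions on slides 38 to 40 can be exercised without a server by a small in-memory stand-in. This is a sketch under stated assumptions: the sample books, the `emit` collector and the `runMapReduce` driver are hypothetical and only imitate the flow on slide 36; the three callbacks are taken from the slides.

```javascript
// Hypothetical sample data; only `language` and `pages` are used.
const books = [
  { title: "Book A", language: "English", pages: 200 },
  { title: "Book B", language: "English", pages: 400 },
  { title: "Book C", language: "Russian", pages: 1440 }
];

// The three callbacks from slides 38-40.
function map() {
  var key = this.language;
  emit(key, { totalPages: this.pages, numBooks: 1 });
}

function reduce(key, values) {
  var result = { numBooks: 0, totalPages: 0 };
  values.forEach(function (value) {
    result.numBooks += value.numBooks;
    result.totalPages += value.totalPages;
  });
  return result;
}

function finalize(key, value) {
  if (value.numBooks != 0) return value.totalPages / value.numBooks;
}

// In-memory stand-in for the server (assumption, not MongoDB code):
// `emit` collects the values emitted for each key.
const emitted = new Map();
function emit(key, value) {
  if (!emitted.has(key)) emitted.set(key, []);
  emitted.get(key).push(value);
}

function runMapReduce(docs) {
  emitted.clear();
  // Map phase: call map with each document bound to `this`.
  docs.forEach(function (doc) { map.call(doc); });
  const results = [];
  emitted.forEach(function (values, key) {
    // MongoDB does not call reduce for a key with a single value.
    const reduced = values.length > 1 ? reduce(key, values) : values[0];
    results.push({ _id: key, value: finalize(key, reduced) });
  });
  return results;
}

const results = runMapReduce(books);
console.log(results);
```

With this sample data the average page count is 300 for English and 1440 for Russian. Because reduce may be skipped or called repeatedly on partial results, its return value has the same shape as the emitted values, which is why the slides carry both `totalPages` and `numBooks` through to finalize.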

Editor's Notes

  1. "h" : { "$hour" : "$time" }, "m" : { "$minute" : "$time" }, "s" : { "$second" : "$time" },
  2. { $match : { loc : { $geoWithin : { $centerSphere : [ [ -122.4, 37.79 ], 20/3959 ] } } } } { city: "PALO ALTO", loc: [-122.127, 37.418], state: "CA" }
  3. 2.4 will improve somewhat
  4. Distributed, real-time computation system.
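
The `20/3959` in the $centerSphere examples is a radius in radians: 20 miles divided by an Earth radius of roughly 3959 miles. The arithmetic can be checked with a great-circle (haversine) calculation in plain JavaScript. This is a sketch only: `angularDistance` and `toRadians` are hypothetical helpers, and the code illustrates the geometry rather than MongoDB's own implementation.

```javascript
const EARTH_RADIUS_MILES = 3959; // same constant as in the slides

function toRadians(degrees) {
  return degrees * Math.PI / 180;
}

// Great-circle angle in radians between two [longitude, latitude] points,
// computed with the haversine formula.
function angularDistance(a, b) {
  const lat1 = toRadians(a[1]);
  const lat2 = toRadians(b[1]);
  const dLat = lat2 - lat1;
  const dLon = toRadians(b[0] - a[0]);
  const h = Math.sin(dLat / 2) ** 2 +
            Math.cos(lat1) * Math.cos(lat2) * Math.sin(dLon / 2) ** 2;
  return 2 * Math.asin(Math.sqrt(h));
}

const center = [-122.4, 37.79];         // centre point from the slides
const radius = 20 / EARTH_RADIUS_MILES; // 20 miles expressed in radians

// The three sample documents from slide 17.
const cities = [
  { city: "SAN FRANCISCO", loc: [-122.4614, 37.781] },
  { city: "NEW YORK", loc: [-73.989, 40.731] },
  { city: "PALO ALTO", loc: [-122.127, 37.418] }
];

const within = cities
  .filter(function (c) { return angularDistance(center, c.loc) <= radius; })
  .map(function (c) { return c.city; });

console.log(within);
```

Of the three sample documents only SAN FRANCISCO falls inside the circle; PALO ALTO is roughly 30 miles from the centre point, so the 20-mile filter excludes it.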