Analytics in MongoDB
Massimo Brignoli
Principal Solutions Architect
massimo@mongodb.com
@massimobrignoli
Agenda
‱ Analytics in MongoDB?
‱ Aggregation Framework
‱ Aggregation Pipeline Stages
‱ Aggregation Framework in
Action
‱ Joins in MongoDB 3.2
‱ Integrations
‱ Analytical Architectures
Relational
Expressive Query Language
& Secondary Indexes
Strong Consistency
Enterprise Management
& Integrations
The World Has Changed
Data – Volume, Velocity, Variety
Time – Iterative, Agile, Short Cycles
Risk – Always On, Secure, Global
Cost – Open-Source, Cloud, Commodity
Scalability
& Performance
Always On,
Global Deployments
Flexibility
Expressive Query Language
& Secondary Indexes
Strong Consistency
Enterprise Management
& Integrations
NoSQL
Nexus Architecture
Scalability
& Performance
Always On,
Global Deployments
Flexibility
Expressive Query Language
& Secondary Indexes
Strong Consistency
Enterprise Management
& Integrations
Some Common MongoDB Use Cases
Single View ‱ Internet of Things ‱ Mobile ‱ Real-Time Analytics
Catalog ‱ Personalization ‱ Content Management
MongoDB in Research
Analytics in MongoDB?
Create
Read
Update
Delete
Analytics
?
Group
Count
Derive Values
Filter
Average
Sort
Analytics on MongoDB Data
‱ Extract data from MongoDB and perform complex analytics with Hadoop
– Batch rather than real-time
– Extra nodes to manage
‱ Direct access to MongoDB from Spark
‱ MongoDB BI Connector
– Direct SQL access from BI tools
‱ MongoDB aggregation pipeline
– Real-time
– Live, operational data set
– Narrower feature set
[Diagram: Hadoop Connector (MapReduce & HDFS); SQL Connector]
For Example: US Census Data
‱ Census data from 1990, 2000, 2010
‱ Question:
– Which US Division has the fastest growing population density?
– We only want to include states with more than 1M people
– We only want to include divisions larger than 100K square miles
– Division = a group of US states
– Population density = # of people / area of the division
– Data is provided at the state level
US Regions and Divisions
How would we solve this in SQL?
‱ SELECT GROUP BY HAVING
Aggregation Framework
What is an Aggregation Pipeline?
‱ A Series of Document Transformations
– Executed in stages
– Original input is a collection
– Output as a cursor or a collection
‱ Rich Library of Functions
– Filter, compute, group, and summarize data
– Output of one stage sent to input of next
– Operations executed in sequential order
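A minimal sketch of such a pipeline in the mongo shell (collection and field names follow the census example used later in this deck):
db.cData.aggregate([
{$match : {"region" : "West"}}, // stage 1: filter the input collection
{$group : {"_id" : "$region", // stage 2: summarize what stage 1 passed along
"numStates" : {$sum : 1}}}
])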
Aggregation Pipeline
[Diagram: a stream of documents flows through the pipeline stage by stage — $match drops non-matching documents, $project reshapes each survivor (e.g. computing a new field from existing ones), $lookup embeds matching documents from a second collection as an array, and $group collapses the stream into a few summary documents.]
Aggregation Pipeline Stages
‱ $match – Filter documents
‱ $geoNear – Geospatial query
‱ $project – Reshape documents
‱ $lookup – New: left-outer equi joins
‱ $unwind – Expand documents
‱ $group – Summarize documents
‱ $sample – New: randomly selects a subset of documents
‱ $sort – Order documents
‱ $skip – Jump over a number of documents
‱ $limit – Limit number of documents
‱ $redact – Restrict documents
‱ $out – Sends results to a new collection
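For instance, the new $sample stage can form a one-stage pipeline on its own — a minimal sketch:
db.cData.aggregate([
{$sample : {"size" : 5}} // return 5 randomly-chosen state documents
])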
Aggregation Framework in Action
(let’s play with the census data)
MongoDB State Collection
‱ Document for each state
– Name
– Region
– Division
‱ Census data for 1990, 2000, 2010
– Population
– Housing units
– Occupied housing units
‱ Census data is an array with three subdocuments
Document Model
{ "_id" : ObjectId("54e23c7b28099359f5661525"),
"name" : "California",
"region" : "West",
"data" : [
{ "totalPop" : 33871648,
"totalHouse" : 12214549,
"occHouse" : 11502870,
"year" : 2000},
{ "totalPop" : 37253956,
"totalHouse" : 13680081,
"occHouse" : 12577498,
"year" : 2010},
{ "totalPop" : 29760021,
"totalHouse" : 11182882,
"occHouse" : 29008161,
"year" : 1990}
],


}
Total US Area
db.cData.aggregate([
{"$group" :
{"_id" : null,
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"}}}])
$group
‱ Group documents by value
– Field reference, object, constant
– Other output fields are computed
‱ $max, $min, $avg, $sum
‱ $addToSet, $push
‱ $first, $last
– Processes all data in memory by default
Area By Region
db.cData.aggregate([{
"$group" : {
"_id" : "$region",
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"},
"numStates" : {$sum : 1},
"states" : {$push : "$name"}}}])
Calculating Average State Area By Region
{state: "New York",
areaM: 218,
region: "North East"}
{state: "New Jersey",
areaM: 90,
region: "North East"}
{state: "California",
areaM: 300,
region: "West"}
{ $group: {
_id: "$region",
avgAreaM: {$avg: "$areaM"}
}}
{ _id: "North East",
avgAreaM: 154}
{ _id: "West",
avgAreaM: 300}
Calculating Total Area and State Count
{state: "New York",
areaM: 218,
region: "North East"}
{state: "New Jersey",
areaM: 90,
region: "North East"}
{state: "California",
areaM: 300,
region: "West"}
{ $group: {
_id: "$region",
totArea: {$sum: "$areaM"},
sCount: {$sum: 1}
}}
{ _id: "North East",
totArea: 308,
sCount: 2}
{ _id: "West",
totArea: 300,
sCount: 1}
Total US Population By Year
db.cData.aggregate([
{$unwind : "$data"},
{$group : {
"_id" : "$data.year",
"totalPop" : {$sum :"$data.totalPop"}}},
{$sort : {"totalPop" : 1}}
])
$unwind
‱ Operate on an array field
– Create documents from array elements
‱ Array replaced by element value
‱ Missing/empty fields → no output
‱ Non-array fields → error
– Pipe to $group to aggregate
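Note: MongoDB 3.2 adds an extended form of $unwind that relaxes these rules — a sketch on the census array:
db.cData.aggregate([
{$unwind : {
"path" : "$data",
"preserveNullAndEmptyArrays" : true // also emit documents whose array is missing or empty
}}
])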
$unwind
{ state: "New York",
census: [1990, 2000, 2010]}
{ state: "New Jersey",
census: [1990, 2000]}
{ state: "California",
census: [1980, 1990, 2000, 2010]}
{ state: "Delaware",
census: [1990, 2000]}
{ $unwind: "$census" }
{ state: "New York", census: 1990}
{ state: "New York", census: 2000}
{ state: "New York", census: 2010}
{ state: "New Jersey", census: 1990}
{ state: "New Jersey", census: 2000}
Southern State Population By Year
db.cData.aggregate([
{$match : {"region" : "South"}},
{$unwind : "$data"},
{$group : { "_id" : "$data.year",
"totalPop" : {"$sum" :"$data.totalPop"}}}
])
$match
‱ Filter documents
– Uses existing query syntax, same as .find()
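As a sketch, these two forms select exactly the same documents:
db.cData.find({"region" : "West"})
db.cData.aggregate([{$match : {"region" : "West"}}])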
$match
{state: "New York",
areaM: 218,
region: "North East"}
{state: "Oregon",
areaM: 245,
region: "West"}
{state: "California",
areaM: 300,
region: "West"}
{ $match:
{ "region" : "West" }
}
{state: "Oregon",
areaM: 245,
region: "West"}
{state: "California",
areaM: 300,
region: "West"}
Population Delta By State from 1990 to 2010
db.cData.aggregate([
{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : { "_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"}}},
{$project : { "_id" : 0,
"name" : "$_id",
"delta" : {"$subtract" : ["$pop2010", "$pop1990"]},
"pop1990" : 1,
"pop2010" : 1}
}])
$sort, $limit, $skip
‱ Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
‱ Limit and skip follow cursor behavior
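A minimal sketch chaining the three on the census collection (sort states by name, then page through them):
db.cData.aggregate([
{$sort : {"name" : 1}}, // order states alphabetically
{$skip : 10}, // jump over the first 10
{$limit : 5} // return the next 5
])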
$first, $last
‱ Collection operations like $push and $addToSet
‱ Must be used in $group
‱ $first and $last determined by document order
‱ Typically used with $sort to ensure ordering is known
$project
‱ Reshape Documents
– Include, exclude or rename fields
– Inject computed fields
– Create sub-document fields
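A small sketch of the sub-document case, reusing the census fields (the output field name geo is illustrative):
db.cData.aggregate([
{$project : {
"_id" : 0,
"state" : "$name",
"geo" : {"region" : "$region", "division" : "$division"} // new sub-document field
}}
])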
Including and Excluding Fields
{
"_id" : "Virginia",
"pop1990" : 6187358,
"pop2010" : 8001024
}
{
"_id" : "South Dakota",
"pop1990" : 696004,
"pop2010" : 814180
}
{ $project:
{ "_id" : 0,
"pop1990" : 1,
"pop2010" : 1}
}
{"pop1990" : 6187358,
"pop2010" : 8001024}
{"pop1990" : 696004,
"pop2010" : 814180}
Renaming and Computing Fields
{
"_id" : "Virginia",
"pop1990" : 6187358,
"pop2010" : 8001024
}
{
"_id" : "South Dakota",
"pop1990" : 696004,
"pop2010" : 814180
}
{ $project:
{ "_id" : 0,
"name" : "$_id",
"delta" :
{"$subtract" :
["$pop2010",
"$pop1990"]}}
}
{"name" : "Virginia",
"delta" : 1813666}
{"name" : "South Dakota",
"delta" : 118176}
Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010
db.cData.aggregate([
{$geoNear : { "near" : {"type" : "Point", "coordinates" : [90, 35]},
"distanceField" : "dist.calculated",
"maxDistance" : 500000,
"includeLocs" : "dist.location",
"spherical": true }},
{$unwind : "$data"},
{$group : { "_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"},
"states" : {"$addToSet" : "$name"}}},
{$sort : {"_id" : 1}}
])
$geoNear
‱ Order/Filter Documents by Location
– Requires a geospatial index
– Output includes physical distance
– Must be first aggregation stage
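Because of the index requirement, something like the following must be run before $geoNear — a sketch, assuming each state document stores its centroid in a center field as in the example below:
db.cData.createIndex({"center" : "2dsphere"})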
$geoNear
{"_id" : "Virginia",
"pop1990" : 6187358,
"pop2010" : 8001024,
"center" :
{"type" : "Point",
"coordinates" :
[-78.6, 37.5]}}
{"_id" : "Tennessee",
"pop1990" : 4877185,
"pop2010" : 6346105,
"center" :
{"type" : "Point",
"coordinates" :
[-86.6, 37.8]}}
{$geoNear : {
"near" : {"type" : "Point",
"coordinates" :
[-90, 35]},
"distanceField" : "dist.calculated",
"maxDistance" : 500000,
"spherical" : true }}
{"_id" : "Tennessee",
"pop1990" : 4877185,
"pop2010" : 6346105,
"center" :
{"type" : "Point",
"coordinates" :
[-86.6, 37.8]}}
What if I want to save the results to a collection?
db.cData.aggregate([
{$geoNear : { "near" : {"type" : "Point", "coordinates" : [-90, 35]},
"distanceField" : "dist.calculated",
"maxDistance" : 500000,
"includeLocs" : "dist.location",
"spherical" : true }},
{$unwind : "$data"},
{$group : { "_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"},
"states" : {"$addToSet" : "$name"}}},
{$sort : {"_id" : 1}},
{$out : "peopleNearMemphis"}
])
$out
db.cData.aggregate([<pipeline stages>,
{"$out" : "resultsCollection"}])
‱ Save aggregation results to a new collection
‱ Enables new uses of aggregation:
– Transform documents – ETL
Back To The Original Question
‱ Which US Division has the fastest growing population density?
– We only want to include states with more than 1M people
– We only want to include divisions larger than 100K square miles
Division with Fastest Growing Pop Density
db.cData.aggregate(
[{$match : {"data.totalPop" : {"$gt" : 1000000}}},
{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"},
"areaM" : {"$first" : "$areaM"},
"division" : {"$first" : "$division"}}},
{$group : { "_id" : "$division",
"totalPop1990" : {"$sum" : "$pop1990"},
"totalPop2010" : {"$sum" : "$pop2010"},
"totalAreaM" : {"$sum" : "$areaM"}}},
{$match : {"totalAreaM" : {"$gt" : 100000}}},
{$project : {"_id" : 0,
"division" : "$_id",
"density1990" : {"$divide" : ["$totalPop1990", "$totalAreaM"]},
"density2010" : {"$divide" : ["$totalPop2010", "$totalAreaM"]},
"denDelta" : {"$subtract" : [{"$divide" : ["$totalPop2010", "$totalAreaM"]}, {"$divide" : ["$totalPop1990","$totalAreaM"]}]},
"totalAreaM" : 1,
"totalPop1990" : 1,
"totalPop2010" : 1}},
{$sort : {"denDelta" : -1}}])
Aggregate Options
db.cData.aggregate([<pipeline stages>],
{'explain' : false,
'allowDiskUse' : true,
'cursor' : {'batchSize' : 5}})
‱ explain – similar to find().explain()
‱ allowDiskUse – enable use of disk to store intermediate results
‱ cursor – specify the size of the initial result batch
Aggregation and Sharding
Sharding
‱ Workload split between shards
– Shards execute the pipeline up to a point
– Primary shard merges cursors and continues processing*
– Use explain to analyze the pipeline split (see the sketch below)
– Early $match can exclude shards
– Potential CPU and memory implications for the primary shard host
*Prior to v2.6, second-stage pipeline processing was done by mongos
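A sketch of requesting that analysis with the explain option from the options slide above:
db.cData.aggregate([
{$match : {"region" : "South"}}, // eligible to run on each shard
{$group : {"_id" : "$division", "numStates" : {$sum : 1}}}
], {"explain" : true})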
MongoDB 3.2: Joins and other improvements
Existing Alternatives to Joins
orders:
{ "_id": 10000,
"items": [
{ "productName": "laptop",
"unitPrice": 1000,
"weight": 1.2,
"remainingStock": 23},
{ "productName": "mouse",
"unitPrice": 20,
"weight": 0.2,
"remainingStock": 276}],


}
‱ Option 1: Include all data for an order in the same document
– Fast reads
‱ One find delivers all the required data
– Captures full description at the time of the event
– Consumes extra space
‱ Details of each product stored in many order documents
– Complex to maintain
‱ A change to any product attribute must be propagated to all affected orders
The Winner?
‱ In general, Option 1 wins
– Performance and containment of everything in same place beats space
efficiency of normalization
– There are exceptions
‱ e.g. Comments in a blog post -> unbounded size
‱ However, analytics benefit from combining data from
multiple collections
– Keep listening...
Existing Alternatives to Joins
orders:
{
"_id": 10000,
"items": [
12345,
54321
],
...
}
products:
{
"_id": 12345,
"productName": "laptop",
"unitPrice": 1000,
"weight": 1.2,
"remainingStock": 23
}
{
"_id": 54321,
"productName": "mouse",
"unitPrice": 20,
"weight": 0.2,
"remainingStock": 276
}
‱ Option 2: Order document references product documents
– Slower reads
‱ Multiple trips to the database (see the sketch below)
– Space efficient
‱ Product details stored once
– Lose point-in-time snapshot of full record
– Extra application logic
‱ Must iterate over product IDs in the order document and find the product documents
‱ An RDBMS would automate this through a JOIN
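A minimal sketch of the application-side lookup that Option 2 forces, using the documents above:
var order = db.orders.findOne({"_id" : 10000}) // trip 1: fetch the order
db.products.find({"_id" : {$in : order.items}}) // trip 2: fetch all referenced products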
$lookup
‱ Left-outer join
– Includes all documents from the left collection
– For each document in the left collection, find the matching documents from the right collection and embed them
[Diagram: each left-collection document gains the matching right-collection documents]
$lookup
db.leftCollection.aggregate([{
$lookup:
{
from: "rightCollection",
localField: "leftVal",
foreignField: "rightVal",
as: "embeddedData"
}
}])
Worked Example – Data Set
db.postcodes.findOne()
{
"_id": ObjectId("5600521e50fa77da54dfc0d2"),
"postcode": "SL6 0AA",
"location": {
"type": "Point",
"coordinates": [
51.525605,
-0.700974
]}}
db.homeSales.findOne()
{
"_id":ObjectId("56005dd980c3678b19792b7f"),
"amount": 9000,
"date": ISODate("1996-09-19T00:00:00Z"),
"address": {
"nameOrNumber": 25,
"street": "NORFOLK PARK COTTAGES",
"town": "MAIDENHEAD",
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 7DR"
}
}
Reduce Data Set First
db.homeSales.aggregate([
{$match: {
amount: {$gte:3000000}}
}
])


{
"_id": ObjectId("56005dda80c3678b19799e52"),
"amount": 3000000,
"date": ISODate("2012-04-19T00:00:00Z"),
"address": {
"nameOrNumber": "TEMPLE FERRY PLACE",
"street": "MILL LANE",
"town": "MAIDENHEAD",
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 5ND"
}
},

Join (left-outer-equi) Results With Second Collection
db.homeSales.aggregate([
{$match: {
amount: {$gte:3000000}}
},
{$lookup: {
from: "postcodes",
localField: "address.postcode",
foreignField: "postcode",
as: "postcode_docs"}
}
])
...
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 5ND"
},
"postcode_docs": [
{
"_id": ObjectId("560053e280c3678b1978b293"),
"postcode": "SL6 5ND",
"location": {
"type": "Point",
"coordinates": [
51.549516,
-0.80702
]
}}]}, ...
Refactor Each Resulting Document
...},
{$project: {
_id: 0,
saleDate: "$date",
price: "$amount",
address: 1,
location:
{$arrayElemAt:
["$postcode_docs.location", 0]}}}
])
{ "address": {
"nameOrNumber": "TEMPLE FERRY PLACE",
"street": "MILL LANE",
"town": "MAIDENHEAD",
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 5ND"
},
"saleDate": ISODate("2012-04-19T00:00:00Z"),
"price": 3000000,
"location": {
"type": "Point",
"coordinates": [
51.549516,
-0.80702
]}},...
Sort on Sale Price & Write to Collection
...},
{$sort:
{price: -1}},
{$out: "hotSpots"}
])

{"address": {
"nameOrNumber": "2 - 3",
"street": "THE SWITCHBACK",
"town": "MAIDENHEAD",
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 7RJ"
},
"saleDate": ISODate("1999-03-15T00:00:00Z"),
"price": 5425000,
"location": {
"type": "Point",
"coordinates": [
51.536848,
-0.735835
]}},...
Aggregated Statistics
db.homeSales.aggregate([
{$group:
{ _id:
{$year: "$date"},
highestPrice:
{$max: "$amount"},
lowestPrice:
{$min: "$amount"},
averagePrice:
{$avg: "$amount"},
amountStdDev:
{$stdDevPop: "$amount"}
}}
])
...
{
"_id": 1995,
"higestPrice": 1000000,
"lowestPrice": 12000,
"averagePrice": 114059.35206869633,
"amountStdDev": 81540.50490801703
},
{
"_id": 1996,
"higestPrice": 975000,
"lowestPrice": 9000,
"averagePrice": 118862,
"amountStdDev": 79871.07569783277
}, ...
Clean Up Output
...,
{$project:
{
_id: 0,
year: "$_id",
highestPrice: 1,
lowestPrice: 1,
averagePrice:
{$trunc: "$averagePrice"},
priceStdDev:
{$trunc: "$amountStdDev"}
}
}
])
...
{
"higestPrice": 1000000,
"lowestPrice": 12000,
"averagePrice": 114059,
"year": 1995,
"priceStdDev": 81540
},
{
"higestPrice": 2200000,
"lowestPrice": 10500,
"averagePrice": 307372,
"year": 2004,
"priceStdDev": 199643
},...
Integrations
Hadoop Connector
[Diagram: input data flows from MongoDB — or from .bson backup files — into a Hadoop cluster and back]
Mongo-Hadoop Connector
‱ Turn MongoDB into a Hadoop-enabled filesystem: use as the input or output for Hadoop
‱ Works with MongoDB backup files (.bson)
Benefits and Features
‱ Takes advantage of full multi-core parallelism to process data
in Mongo
‱ Full integration with Hadoop and JVM ecosystems
‱ Can be used with Amazon Elastic MapReduce
‱ Can read and write backup files from local filesystem, HDFS, or S3
Benefits and Features
‱ Vanilla Java MapReduce
‱ If you don’t want to use Java, support for Hadoop Streaming
– Write MapReduce code in other languages (e.g., Python, Ruby, Node.js)
Benefits and Features
‱ Support for Pig
– High-level scripting language for data analysis and building map/reduce workflows
‱ Support for Hive
– SQL-like language for ad-hoc queries + analysis of data sets on Hadoop-compatible file systems
How It Works
‱ Adapter examines the MongoDB input collection and
calculates a set of splits from the data
‱ Each split gets assigned to a node in Hadoop cluster
‱ In parallel, Hadoop nodes pull data for splits from MongoDB
(or BSON) and process them locally
‱ Hadoop merges results and streams output back to MongoDB
or BSON
BI Connector
MongoDB Connector for BI
Visualize and explore multi-dimensional
documents using SQL-based BI tools. The
connector does the following:
‱ Provides the BI tool with the schema of the
MongoDB collection to be visualized
‱ Translates SQL statements issued by the BI tool
into equivalent MongoDB queries that are sent
to MongoDB for processing
‱ Converts the results into the tabular format
expected by the BI tool, which can then
visualize the data based on user requirements
Location & Flow of Data
[Diagram: analytics & visualization tools speak SQL to the BI Connector, which holds the mapping meta-data and presents each document collection as a table; the application data itself (e.g. {name: "Andrew", address: {street: 
}}) stays in MongoDB]
Defining Data Mapping
mongodrdl --host 192.168.1.94 --port 27017 -d myDbName \
-o myDrdlFile.drdl
mongobischema import myCollectionName myDrdlFile.drdl
[Diagram: mongodrdl samples the MongoDB schema and produces a DRDL mapping file; mongobischema imports it into PostgreSQL, which reaches the data through a MongoDB-specific Foreign Data Wrapper]
Optionally Manually Edit DRDL File
‱ Redact attributes
‱ Use more appropriate types (sampling can get it wrong)
‱ Rename tables (v1.1+)
‱ Rename columns (v1.1+)
‱ Build new views using MongoDB Aggregation Framework
– e.g., $lookup to join 2 tables
- table: homesales
  collection: homeSales
  pipeline: []
  columns:
  - name: _id
    mongotype: bson.ObjectId
    sqlname: _id
    sqltype: varchar
  - name: address.county
    mongotype: string
    sqlname: address_county
    sqltype: varchar
  - name: address.nameOrNumber
    mongotype: int
    sqlname: address_nameornumber
    sqltype: varchar
Summary
Analytics in MongoDB?
Create
Read
Update
Delete
Analytics
?
Group
Count
Derive Values
Filter
Average
Sort
YES!
Framework Use Cases
‱ Complex aggregation queries
‱ Ad-hoc reporting
‱ Real-time analytics
‱ Visualizing and reshaping data
Questions?