This presentation shows how to use the Aggregation Framework, MongoDB's powerful aggregation language. Using real data from the US Census, we will explore the most important operations.
11. Analytics on MongoDB Data
• Extract data from MongoDB and perform complex analytics with Hadoop
  – Batch rather than real-time
  – Extra nodes to manage
• Direct access to MongoDB from Spark
• MongoDB BI Connector
  – Direct SQL access from BI tools
• MongoDB aggregation pipeline
  – Real-time
  – Live, operational data set
  – Narrower feature set
[Diagram: Hadoop Connector (MapReduce & HDFS); SQL Connector]
12. For Example: US Census Data
• Census data from 1990, 2000, 2010
• Question:
  – Which US Division has the fastest growing population density?
  – We only want to include states with more than 1M people
  – We only want to include divisions larger than 100K square miles
  – Division = a group of US states
  – Population density = # of people / area of division
  – Data is provided at the state level
17. What is an Aggregation Pipeline?
• A Series of Document Transformations
  – Executed in stages
  – Original input is a collection
  – Output as a cursor or a collection
• Rich Library of Functions
  – Filter, compute, group, and summarize data
  – Output of one stage sent to input of next
  – Operations executed in sequential order
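As a minimal sketch of the stage-by-stage idea (assuming the cData collection and the region field described later in this deck), a two-stage pipeline that filters and then summarizes might look like:

```javascript
// Hypothetical example: count states per region. Stage order matters:
// $match runs first and reduces the documents that $group must process.
db.cData.aggregate([
  {$match : {"data.year" : 2010}},         // keep states with a 2010 census entry
  {$group : {"_id" : "$region",            // group by the region field
             "numStates" : {$sum : 1}}}    // count documents per group
])
```

Each stage receives the output of the previous one, so reordering stages changes both the result and the amount of work done.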
28. MongoDB State Collection
• Document For Each State
• Name
• Region
• Division
• Census Data For 1990, 2000, 2010
  – Population
  – Housing Units
  – Occupied Housing Units
• Census Data is an array with three subdocuments
31. $group
• Group documents by value
  – Field reference, object, constant
  – Other output fields are computed
    • $max, $min, $avg, $sum
    • $addToSet, $push
    • $first, $last
  – Processes all data in memory by default
35. Total US Population By Year
db.cData.aggregate([
  {$unwind : "$data"},
  {$group : {
    "_id" : "$data.year",
    "totalPop" : {$sum : "$data.totalPop"}}},
  {$sort : {"totalPop" : 1}}
])
36. $unwind
• Operate on an array field
  – Create documents from array elements
    • Array replaced by element value
    • Missing/empty fields → no output
    • Non-array fields → error
  – Pipe to $group to aggregate
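To illustrate (a sketch, using the cData schema described earlier, where data is an array of three census subdocuments):

```javascript
// A state document such as
//   {name: "Iowa", data: [{year: 1990, ...}, {year: 2000, ...}, {year: 2010, ...}]}
// is replaced by one document per array element:
db.cData.aggregate([
  {$unwind : "$data"}
])
// e.g. {name: "Iowa", data: {year: 1990, ...}}
//      {name: "Iowa", data: {year: 2000, ...}}
//      {name: "Iowa", data: {year: 2010, ...}}
```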
42. $sort, $limit, $skip
• Sort documents by one or more fields
  – Same order syntax as cursors
  – Waits for earlier pipeline operator to return
  – In-memory unless early and indexed
• Limit and skip follow cursor behavior
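A hedged sketch combining these stages against the cData collection (field names as in the schema slide):

```javascript
// The five most populous states in the 2010 census.
db.cData.aggregate([
  {$unwind : "$data"},
  {$match : {"data.year" : 2010}},
  {$sort : {"data.totalPop" : -1}},  // descending; same syntax as cursor sort()
  {$limit : 5}                       // cursor-style limit
])
```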
43. $first, $last
• Collection operations like $push and $addToSet
• Must be used in $group
• $first and $last determined by document order
• Typically used with $sort to ensure ordering is known
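For example (a sketch against cData; $sort runs first so that $first and $last see documents in a known order):

```javascript
// Earliest and latest recorded population per state.
db.cData.aggregate([
  {$unwind : "$data"},
  {$sort : {"name" : 1, "data.year" : 1}},              // known order per state
  {$group : {"_id" : "$name",
             "firstPop" : {$first : "$data.totalPop"},  // 1990 value
             "lastPop"  : {$last  : "$data.totalPop"}}} // 2010 value
])
```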
47. Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010
db.cData.aggregate([
  {$geoNear : {"near" : {"type" : "Point", "coordinates" : [-90, 35]},
    "distanceField" : "dist.calculated",
    "maxDistance" : 500000,
    "includeLocs" : "dist.location",
    "spherical" : true}},
  {$unwind : "$data"},
  {$group : {"_id" : "$data.year",
    "totalPop" : {"$sum" : "$data.totalPop"},
    "states" : {"$addToSet" : "$name"}}},
  {$sort : {"_id" : 1}}
])
49. $geoNear
• Order/Filter Documents by Location
  – Requires a geospatial index
  – Output includes physical distance
  – Must be first aggregation stage
53. Back To The Original Question
• Which US Division has the fastest growing population density?
  – We only want to include states with more than 1M people
  – We only want to include divisions larger than 100K square miles
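One possible shape for the answer pipeline, as a sketch only: it assumes each state document also carries an areaM field (square miles), which is not shown on the schema slide, and the field names follow the earlier examples.

```javascript
db.cData.aggregate([
  {$unwind : "$data"},
  {$match : {"data.totalPop" : {$gt : 1000000}}},    // states with > 1M people
  {$group : {"_id" : {"division" : "$division", "year" : "$data.year"},
             "totalPop"  : {$sum : "$data.totalPop"},
             "totalArea" : {$sum : "$areaM"}}},      // areaM is an assumed field
  {$match : {"totalArea" : {$gt : 100000}}},         // divisions > 100K sq mi
  {$project : {"density" : {$divide : ["$totalPop", "$totalArea"]}}}
])
// Comparing density across the three years for each division
// then yields the growth rate.
```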
56. Aggregate options
db.cData.aggregate([<pipeline stages>],
  {'explain' : false,
   'allowDiskUse' : true,
   'cursor' : {'batchSize' : 5}})
• explain – similar to find().explain()
• allowDiskUse – enable use of disk to store intermediate results
• cursor – specify the size of the initial result batch
58. Sharding
• Workload split between shards
  – Shards execute pipeline up to a point
  – Primary shard merges cursors and continues processing*
  – Use explain to analyze pipeline split
  – Early $match can exclude shards
  – Potential CPU and memory implications for primary shard host
*Prior to v2.6, second-stage pipeline processing was done by mongos
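A sketch of the early-$match point (assuming, for illustration, that cData is sharded on the region field):

```javascript
// An initial $match on the shard key lets the router target only the
// shards that can hold matching documents; the rest are excluded.
db.cData.aggregate([
  {$match : {"region" : "South"}},   // routed to a subset of shards
  {$unwind : "$data"},
  {$group : {"_id" : "$data.year",
             "totalPop" : {$sum : "$data.totalPop"}}}
])
```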
60. Existing Alternatives to Joins
{ "_id": 10000,
"items": [
{ "productName": "laptop",
"unitPrice": 1000,
"weight": 1.2,
"remainingStock": 23},
{ "productName": "mouse",
"unitPrice": 20,
"weight": 0.2,
"remainingStock": 276}],
âŠ
}
âą Option 1: Include all data for
an order in the same document
â Fast reads
âą One find delivers all the required data
â Captures full description at the time of the
event
â Consumes extra space
âą Details of each product stored in many
order documents
â Complex to maintain
âą A change to any product attribute must be
propagated to all affected orders
orders
61. The Winner?
âą In general, Option 1 wins
â Performance and containment of everything in same place beats space
efficiency of normalization
â There are exceptions
âą e.g. Comments in a blog post -> unbounded size
âą However, analytics benefit from combining data from
multiple collections
â Keep listening...
62. Existing Alternatives to Joins
{
  "_id": 10000,
  "items": [
    12345,
    54321
  ],
  ...
}
(orders collection)
• Option 2: Order document references product documents
  – Slower reads
    • Multiple trips to the database
  – Space efficient
    • Product details stored once
  – Lose point-in-time snapshot of full record
  – Extra application logic
    • Must iterate over product IDs in the order document and find the product documents
    • An RDBMS would automate this through a JOIN
{
  "_id": 12345,
  "productName": "laptop",
  "unitPrice": 1000,
  "weight": 1.2,
  "remainingStock": 23
}
{
  "_id": 54321,
  "productName": "mouse",
  "unitPrice": 20,
  "weight": 0.2,
  "remainingStock": 276
}
(products collection)
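The "extra application logic" for Option 2 can be sketched in plain JavaScript. This uses in-memory stand-ins for the two collections; in a real application each lookup would be a database query.

```javascript
// Hypothetical in-memory stand-ins for the products collection.
const products = new Map([
  [12345, { productName: "laptop", unitPrice: 1000 }],
  [54321, { productName: "mouse",  unitPrice: 20 }],
]);
const order = { _id: 10000, items: [12345, 54321] };

// Manual "join": iterate over the product IDs stored in the order
// and fetch each referenced product document.
function resolveOrder(order, products) {
  return {
    _id: order._id,
    items: order.items.map((id) => products.get(id)), // one lookup per item
  };
}

const resolved = resolveOrder(order, products);
console.log(resolved.items.map((p) => p.productName).join(","));
```

This is exactly the per-ID iteration the slide describes; a relational JOIN (or the $lookup stage below slide 63) moves that work into the database.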
63. $lookup
• Left-outer join
  – Includes all documents from the left collection
  – For each document in the left collection, find the matching documents from the right collection and embed them
[Diagram: Left Collection → Right Collection]
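Using the orders and products documents from the previous slides, a $lookup might be sketched as follows (requires MongoDB 3.2+; $unwind first so each scalar product ID can be matched against products._id):

```javascript
db.orders.aggregate([
  {$unwind : "$items"},                  // one document per product ID
  {$lookup : {
     "from" : "products",                // right collection
     "localField" : "items",             // product _id held in the order
     "foreignField" : "_id",
     "as" : "productDetails"             // matching products embedded here
  }}
])
```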
75. Benefits and Features
• Takes advantage of full multi-core parallelism to process data in MongoDB
• Full integration with Hadoop and JVM ecosystems
• Can be used with Amazon Elastic MapReduce
• Can read and write backup files from local filesystem, HDFS, or S3
76. Benefits and Features
• Vanilla Java MapReduce
• If you don't want to use Java, support for Hadoop Streaming
  – Write MapReduce code in other languages
77. Benefits and Features
• Support for Pig
  – High-level scripting language for data analysis and building map/reduce workflows
• Support for Hive
  – SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems
78. How It Works
• Adapter examines the MongoDB input collection and calculates a set of splits from the data
• Each split gets assigned to a node in the Hadoop cluster
• In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally
• Hadoop merges results and streams output back to MongoDB or BSON
80. MongoDB Connector for BI
Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following:
• Provides the BI tool with the schema of the MongoDB collection to be visualized
• Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing
• Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements
81. Location & Flow of Data
[Diagram: application data flows as documents (e.g. {name: "Andrew", address: {street: … }}) from MongoDB through the BI Connector, which uses mapping metadata to present it as tables for analytics & visualization]
82. Defining Data Mapping
mongodrdl --host 192.168.1.94 --port 27017 -d myDbName -o myDrdlFile.drdl
mongobischema import myCollectionName myDrdlFile.drdl
[Diagram: mongodrdl produces a DRDL file; mongobischema loads it into PostgreSQL via a MongoDB-specific Foreign Data Wrapper]
83. Optionally Manually Edit the DRDL File
• Redact attributes
• Use more appropriate types (sampling can get it wrong)
• Rename tables (v1.1+)
• Rename columns (v1.1+)
• Build new views using the MongoDB Aggregation Framework
  – e.g., $lookup to join 2 tables

- table: homesales
  collection: homeSales
  pipeline: []
  columns:
  - name: _id
    mongotype: bson.ObjectId
    sqlname: _id
    sqltype: varchar
  - name: address.county
    mongotype: string
    sqlname: address_county
    sqltype: varchar
  - name: address.nameOrNumber
    mongotype: int
    sqlname: address_nameornumber
    sqltype: varchar