3. Map Reduce
Map/Reduce is a big hammer
• Used to perform complex analytics tasks on
massive amounts of data
• Users are currently using it for aggregation…
• totaling, averaging, etc
4. Problem
• It should be easier to do simple aggregations
• Shouldn’t need to write JavaScript
• Avoid the overhead of JavaScript engine
5. New Aggregation Framework
• Declarative
• No JavaScript required
• C++ implementation
• Higher performance than JavaScript
• Expression evaluation
• Return computed values
• Framework: we can add new operations easily
6. Pipeline
• Series of operations
• Members of a collection are passed through a
pipeline to produce a result
7. The Aggregation Command
• Takes two arguments
• Aggregate -- name of collection
• Pipeline -- array of pipeline operators
db.runCommand(
{
aggregate : "article",
pipeline : [ {$op1, $op2, ...} ]
}
);
10. $match
• Uses a query predicate (like .find({…})) as a filter
{
title : "this is my title" ,
author : "bob" ,
posted : new Date(1079895594000) ,
pageViews : 5 ,
tags : [ "fun" , "good" , "fun" ] ,
}
{ $match :
{ $match : { author : "bob" } } { pgv : { $gt : 50, $lte : 90 } }
}
11. $sort
• Sorts input documents
• Requires sort key -- specified like index keys
{ $sort : { name : 1, age: -1 } }
12. $limit
• Limits the number of JSON documents
{ $limit : 5 }
$skip
• Skips a number of JSON documents
{ $skip : 5 }
13. $project
• Project can reshape a document
• add, remove, rename, move
• Similar to .find()’s field selection syntax
• But much more powerful
• Can generate computed values
14. $project (include and exclude fields)
{ $project : {
title : 1 , /* include this field, if it exists */
author : 1 , /* include this field, if it exists */
"comments.author" : 1
}
}
{ $project : {
title : 0 , /* exclude this field */
author : 0 , /* exclude this field */
}
}
15. $project (computed fields)
{ $project : {
title : 1, /* include this field if it exists */
doctoredPageViews : { $add: ["$pageViews", 10] }
}
}
16. Computed Fields
• Prefix expression language
• Add two fields
• $add:[“$field1”, “$field2”]
• Provide a value for a missing field
• $ifnull:[“$field1”, “$field2”]
• Nesting
• $add:[“$field1”, $ifnull:[“$field2”, “$field3”]]
• Date field extraction
• Get year, month, day, hour, etc, from Date
• Date arithmetic
17. $project (rename and pull fields up)
{ $project : {
title : 1 ,
page_views : "$pageViews" , /* rename this field */
upgrade : "$other.foo" /* move to top level */
}
}
18. $project (push fields down)
{ $project : {
title : 1 ,
stats : {
pv : "$pageViews", /* rename this from the top-level */
}
}
}
19. $unwind
• Produces document for each value in an array
where the array value is single array element
{
title : "this is my title" ,
author : "bob" ,
posted : new Date(1079895594000) ,
pageViews : 5 ,
tags : [ "fun" , "good" , "awesome" ] ,
comments : [
{ author :"joe" , text : "this is cool" } ,
{ author :"sam" , text : "this is bad" }
],
other : { foo : 5 }
}
30. Usage Tips
• Use $match in a pipeline as early as possible
• The query optimizer can then be used to choose
an index and avoid scanning the entire collection
31. Driver Support
• Initial version is a command
• For any language, build a JSON database object,
and execute the command
• { aggregate : <collection>, pipeline : [ ] }
• Beware of command result size limit
32. Sharding support
• Initial release will support sharding
• Mongos analyzes pipeline, and forwards
operations up to first $group or $sort to
shards; combines shard server results and
continues
well, why do we need a new aggregation framework\n
but...\n
\n
it works by creating a pipeline\n
the way you create this pipeline is through the aggregation command\n
\n
\n
\n
$match should be placed as early in the aggregation pipeline as possible. This minimizes the number of documents after it, thereby minimizing later processing. Placing a $match at the very beginning of a pipeline will enable it to take advantage of indexes in exactly the same way as a regular query (find()/findOne()).\n
\n
\n
\n
_id is included by default in inclusion mode\nuser can specify _id: 0 but no other fields can be excluded\n\n
respects ordering\n
\n
\n
Doctored page views\nThe BSON specification specifies that field order matters, and is to be preserved. A projection will honor that, and fields will be output in the same order as they are input, regardless of the order of any inclusion or exclusion specifications.\nWhen new computed fields are added via a projection, these always follow all fields from the original source, and will appear in the order they appear in the projection specification.\n
\n
\n
\n
$unwind is most useful when combined with $group or $filter.\nThe effects of an unwind can be undone with the $push $group aggregation function.\nIf the target field does not exist within an input document, the document is passed through unchanged.\nIf the target field within an input document is not an array, an error is generated.\nIf the target field within an input document is an empty array ("[]"), then the document is passed through unchanged.\n
\n
_id can be a dotted field path reference (prefixed with a dollar sign, '$'), a braced document expression containing multiple fields (an order-preserving concatenated key), or a single constant. Using a constant will create a single bucket, and can be used to count documents, or to add all the values for a field in all the documents in a collection.\nIf you need the output to have a different name, it can easily be renamed using a simple $project after the $group.\n