4. Big Data in MongoDB
Data Processing and Aggregation Options in MongoDB / Asya Kamsky
5. Big Data in MongoDB
• An ideal operational database
• High performance for storage and retrieval at large scale
• Robust query interface for intelligent operations
7. Big Data in MongoDB
Pre-aggregate in MongoDB for real-time queries
Process in MongoDB using Aggregation Framework
Process in MongoDB using Map/Reduce
Process outside MongoDB using Hadoop and other external tools
11. Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
• Plays nice with sharding
12. Pipeline
ps ax | grep mongod | head -1
Piping command line operations
13. Pipeline
$match | $group | $sort
Piping aggregation operations
Stream of documents → Result document
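As a rough sketch of the piping idea above, the three stages can be written as a JavaScript array of stage documents. The field names `city` and `pop`, the population threshold, and the collection name `zips` in the comment are assumptions chosen for illustration.

```javascript
// A pipeline is an ordered array of stage documents; each stage's output
// feeds the next, much like commands joined by a shell pipe.
const pipeline = [
  // Keep only documents with a population above an arbitrary threshold.
  { $match: { pop: { $gt: 5000 } } },
  // Collapse the remaining documents into one per city, summing population.
  { $group: { _id: "$city", totalPop: { $sum: "$pop" } } },
  // Order the grouped results, largest total first.
  { $sort: { totalPop: -1 } }
];

// In the mongo shell this would be run as (collection name is hypothetical):
//   db.zips.aggregate(pipeline)
console.log(pipeline.map(stage => Object.keys(stage)[0]).join(" | "));
// prints: $match | $group | $sort
```

Placing `$match` first keeps later stages working on fewer documents.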
14. Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort/$skip/$limit
• $redact
• $geoNear
• $out
15. $match
• Filter documents
• Uses existing query syntax
• 2.4 added support for geospatial operations
• 2.6 added support for full text search indexes
23. $group
• Group documents by an ID
– Field reference, object, constant
• Other output fields are computed
– $max, $min, $avg, $sum
– $addToSet, $push
– $first, $last
• Processes all data in memory
– can utilize external disk-based sort in 2.6
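To make the accumulator list concrete, here is a small in-memory imitation of what a `$group` stage computes, written in plain JavaScript rather than run on the server. The helper `groupBy`, the `state` field, and the sample documents are hypothetical; only `$sum`, `$max`, and `$addToSet` are imitated.

```javascript
// Hypothetical sample documents for illustration.
const docs = [
  { state: "CA", city: "PALO ALTO", pop: 24309 },
  { state: "CA", city: "SAN FRANCISCO", pop: 27239 },
  { state: "NY", city: "NEW YORK", pop: 5574 }
];

// Imitates a $group stage keyed on one field, with three computed outputs:
//   total   ~ { $sum: "$pop" }
//   largest ~ { $max: "$pop" }
//   cities  ~ { $addToSet: "$city" }
function groupBy(documents, keyField) {
  const groups = new Map();
  for (const doc of documents) {
    const id = doc[keyField];
    if (!groups.has(id)) {
      groups.set(id, { _id: id, total: 0, largest: -Infinity, cities: new Set() });
    }
    const g = groups.get(id);
    g.total += doc.pop;
    g.largest = Math.max(g.largest, doc.pop);
    g.cities.add(doc.city);
  }
  // Convert each Set to an array so the result looks like a document.
  return [...groups.values()].map(g => ({ ...g, cities: [...g.cities] }));
}

const result = groupBy(docs, "state");
console.log(JSON.stringify(result));
```

Every group stays in the `Map` until the input is exhausted, which mirrors why the stage holds its working data in memory.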
24. Find the smallest cities within twenty miles of San Francisco
{ _id: "94306",
city: "PALO ALTO",
loc: [ -122.127, 37.418],
pop: 24309 }
{ _id: "10280",
city: "NEW YORK",
loc: [ -74.016, 40.710],
pop: 5574 }
{ _id: "94124",
city: "SAN FRANCISCO",
loc: [-122.388, 37.73],
pop: 27239 }
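One way the question above might be answered is with `$geoNear` followed by `$sort` and `$limit`. This is a sketch under assumptions: the San Francisco coordinates, the output field name `distance`, the result count of 5, and the collection name `zips` are illustrative choices, and the `loc` field would need a geospatial index. With legacy coordinate pairs and `spherical: true`, `maxDistance` is expressed in radians, so miles are divided by the Earth's radius in miles.

```javascript
// Approximate Earth radius in miles, used to convert miles to radians.
const EARTH_RADIUS_MILES = 3959;

// Assumed coordinates for San Francisco as [longitude, latitude].
const sanFrancisco = [-122.4194, 37.7749];

const nearSfPipeline = [
  // $geoNear has to be the first stage; it adds the computed distance
  // to each output document under distanceField.
  {
    $geoNear: {
      near: sanFrancisco,
      distanceField: "distance",
      maxDistance: 20 / EARTH_RADIUS_MILES, // twenty miles, in radians
      distanceMultiplier: EARTH_RADIUS_MILES, // report distance in miles
      spherical: true
    }
  },
  // Smallest populations first.
  { $sort: { pop: 1 } },
  // Keep a handful of results (5 is arbitrary).
  { $limit: 5 }
];

// Hypothetical shell invocation: db.zips.aggregate(nearSfPipeline)
console.log(JSON.stringify(nearSfPipeline, null, 2));
```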
26. $unwind
• Operate on an array field
• Yield new documents for each array element
– Array replaced by element value
– Missing/empty fields → no output
– Non-array fields → error
• Pipe to $group to aggregate array values
27. $unwind
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: [
"Long Island",
"New York",
"1920s"
]
}
{ $unwind: "$subjects" }
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "Long Island"
}
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "New York"
}
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "1920s"
}
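The `$unwind` rules listed earlier can be imitated in a few lines of plain JavaScript. The function `unwind` below is a hypothetical stand-in for the server-side stage, intended only to show the semantics on the Gatsby document; the second input document is made up to exercise the missing-field case.

```javascript
// Plain-JavaScript imitation of { $unwind: "$<field>" } as described on the slides.
function unwind(documents, field) {
  const out = [];
  for (const doc of documents) {
    const value = doc[field];
    // Missing field or empty array: the document produces no output.
    if (value === undefined || (Array.isArray(value) && value.length === 0)) {
      continue;
    }
    // Non-array field: treated as an error, per the slide.
    if (!Array.isArray(value)) {
      throw new TypeError(`$unwind: field "${field}" is not an array`);
    }
    // One output document per element, with the array replaced by that element.
    for (const element of value) {
      out.push({ ...doc, [field]: element });
    }
  }
  return out;
}

const book = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"]
};

// The second document has no subjects field, so it yields nothing.
const unwound = unwind([book, { title: "No subjects" }], "subjects");
console.log(unwound.map(d => d.subjects));
```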
28. 2.6 Improvements
• Returns a cursor (not a document)
– just like a regular find
• New stages
– $redact
– $out
• New operators:
– set expression operators.
– $let and $map operators to allow for the use of variables.
– $literal operator and $size operator
– $cond expression object
• Integrated $text search
• Performance improvements, "explain" and more
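A short sketch of how some of the 2.6 expression operators might look inside a `$project` stage. The field names (`subjects`, `subjectCount`, `upperSubjects`, `breadth`) and the threshold of 3 are assumptions for illustration.

```javascript
// A $project stage using expression operators listed for 2.6.
const projectStage = {
  $project: {
    title: 1,
    // $size: number of elements in an array field.
    subjectCount: { $size: "$subjects" },
    // $map: apply an expression to each array element through a variable ($$s).
    upperSubjects: {
      $map: { input: "$subjects", as: "s", in: { $toUpper: "$$s" } }
    },
    // $cond in its object form: if / then / else.
    // $literal marks a value as a constant rather than an expression.
    breadth: {
      $cond: {
        if: { $gte: [{ $size: "$subjects" }, 3] },
        then: { $literal: "broad" },
        else: { $literal: "narrow" }
      }
    }
  }
};

console.log(Object.keys(projectStage.$project).join(", "));
// prints: title, subjectCount, upperSubjects, breadth
```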
29. Advantages
• Runs on the server
– Uses indexes
– Uses shards
• Simple to build complex pipelines
• Easy to use from any driver
• Faster than other options
30. Limitations
• Pipeline operator memory limits
– 10% of total system RAM in 2.4 and earlier
– 100MB in 2.6 but can use disk for external sort
• Some data types not allowed
– Code, CodeWithScope, etc.
• Result size limited (in 2.4 and earlier)
– 2.6 returns a cursor or writes output directly to a new collection
– No result size limit!
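The two 2.6 workarounds above (spilling to disk and writing to a collection) might be combined roughly as follows. The collection names `zips` and `city_totals` and the field names are hypothetical.

```javascript
// Pipeline whose final stage writes results to a collection instead of
// returning them to the client.
const reportPipeline = [
  { $group: { _id: "$city", totalPop: { $sum: "$pop" } } },
  { $sort: { totalPop: -1 } },
  // $out has to be the last stage; "city_totals" is a made-up collection name.
  { $out: "city_totals" }
];

// Option for aggregate() in 2.6: let stages use temporary files on disk
// once they exceed the per-stage memory limit.
const options = { allowDiskUse: true };

// Hypothetical shell invocation: db.zips.aggregate(reportPipeline, options)
console.log(Object.keys(reportPipeline[reportPipeline.length - 1])[0]);
// prints: $out
```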
34. MapReduce
• Versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregations
43. Advantages
• Map and reduce code can be arbitrarily complex
– JavaScript, helper functions
• Results can be saved into a new collection
– replace, merge or re-reduce
• Incremental MapReduce
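A minimal sketch of what a map and a reduce function might look like. On the server, `map` reads the current document from `this` and calls a built-in `emit`; here both are passed as arguments, and `runMapReduce` is a tiny hypothetical stand-in for the server so the example runs on its own. The sample documents are made up.

```javascript
// map: called once per document; emits (key, value) pairs.
function map(doc, emit) {
  emit(doc.city, { count: 1, pop: doc.pop });
}

// reduce: combines all values for one key. It returns the same shape it
// receives, so it can be applied again to its own output (re-reduce).
function reduce(key, values) {
  return values.reduce(
    (acc, v) => ({ count: acc.count + v.count, pop: acc.pop + v.pop }),
    { count: 0, pop: 0 }
  );
}

// Stand-in for the server: collect emitted values by key, then reduce each key.
function runMapReduce(documents) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  documents.forEach(doc => map(doc, emit));
  const results = {};
  for (const [key, values] of emitted) results[key] = reduce(key, values);
  return results;
}

const totals = runMapReduce([
  { city: "PALO ALTO", pop: 24309 },
  { city: "PALO ALTO", pop: 100 },
  { city: "NEW YORK", pop: 5574 }
]);
console.log(totals);
```

Keeping the reduce output in the same shape as the emitted values is what makes the merge and re-reduce output modes workable.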
44. Limitations
• Implemented with JavaScript
– Single-threaded
• Slower than Aggregation Framework
– Batch, not real time
• Harder to understand, implement, debug...
46. Hadoop
Framework that allows for the distributed processing
of large data sets across clusters of computers
47. Hadoop MongoDB Connector
• MongoDB or BSON files as input/output
• Source data can be filtered with queries
• Hadoop Streaming support
– For jobs written in Python, Ruby, Node.js
• Supports Hadoop tools such as Pig and Hive
48. Processing Big Data
• Data broken up into smaller pieces
• Process data across multiple nodes
[Diagram: multiple Hadoop nodes]
49. Input splits on Non-sharded Systems
[Diagram labels: Total Dataset; multiple Hadoop nodes; Single Map Reduce]
50. Advantages
• Processing decoupled from data store
• Parallel processing
• Leverage existing infrastructure
• Java has rich set of data processing libraries
– And other languages if using Hadoop Streaming
Disadvantages
• Batch processing
• Requires synchronization between data store and processor
• Adds complexity to infrastructure
53. Storm MongoDB connector
• Spout for MongoDB oplog or capped collections
– Filtering capabilities
– Threaded and non-blocking
• Output to new or existing documents
– Insert/update bolt
55. Internal Tools
• Storing pre-aggregated data
– An exercise in schema design
• Aggregation Framework
• MapReduce
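For the pre-aggregation option, one common schema-design approach is to keep counters in a document and bump them with an upsert on every event, so that a real-time query becomes a single document read. The sketch below builds the arguments for such an update; the function `buildHitUpdate`, the `_id` layout, the field names, and the collection name `stats` are all hypothetical.

```javascript
// Builds the arguments for an upsert that increments pre-aggregated
// counters for one page view.
function buildHitUpdate(page, when) {
  const day = when.toISOString().slice(0, 10); // UTC date, YYYY-MM-DD
  const hour = when.getUTCHours();
  return {
    // One document per page per day.
    filter: { _id: `${page}:${day}` },
    // $inc creates a counter on first use and increments it afterwards.
    update: { $inc: { total: 1, [`hourly.${hour}`]: 1 } },
    // upsert: insert the document if it does not exist yet.
    options: { upsert: true }
  };
}

const hit = buildHitUpdate("/index.html", new Date(Date.UTC(2014, 5, 24, 14, 30)));
console.log(JSON.stringify(hit));
// Hypothetical shell use: db.stats.update(hit.filter, hit.update, hit.options)
```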