MongoDB is the leading open-source document database. In this webinar we'll dive into the technical details of MongoDB, first by mapping it from relational concepts. Next we'll discuss an example data model and the associated query functionality, using commands pulled straight from the MongoDB shell. Finally, we'll delve into some of the deployment functionality MongoDB provides, including solutions for data redundancy, node failover and auto-sharding.
3. About 10gen
• Background
– Founded in 2007
– First release of MongoDB in 2009
– $74M+ in funding
• MongoDB
– Core server
– Native drivers
• Subscriptions, Consulting, Training
• Monitoring
5. Relational Databases
[Diagram: example relational schema with one table per entity]
• Category
– Name
– URL
• Article
– Name
– Slug
– Publish date
– Text
• Tag
– Name
– URL
• User
– Name
– Email address
• Comment
– Comment
– Date
– Author
6. RDBMS Strengths
• Data stored is very compact
• Rigid schemas have led to powerful query capabilities
• Data is optimized for joins and storage
• Robust ecosystem of tools, libraries, integrations
• 40+ years old!
7. Enter “Big Data”
• Gartner defines it with 3Vs
• Volume
– Vast amounts of data being collected
• Variety
– Evolving data
– Uncontrolled formats, no single schema
– Unknown at design time
• Velocity
– Inbound data speed
– Fast read/write operations
– Low latency
8. Mapping Big Data to RDBMS
• Difficult to store uncontrolled data formats
• Scaling via big iron or custom data marts/partitioning schemes
• Schema must be known at design time
• Impedance mismatch with agile development and deployment techniques
• Doesn’t map well to native language constructs
10. Goals
• Scale horizontally over commodity systems
• Incorporate what works for RDBMSs
– Rich data models, ad-hoc queries, full indexes
• Drop what doesn’t work well
– Multi-row transactions, complex joins
• Do not homogenize APIs
• Match agile development and deployment workflows
11. Key Features
• Data stored as documents (JSON)
– Flexible-schema
• Full CRUD support (Create, Read, Update, Delete)
– Atomic in-place updates
– Ad-hoc queries: Equality, RegEx, Ranges, Geospatial
• Secondary indexes
• Replication – redundancy, failover
• Sharding – partitioning for read/write scalability
18. Documents
> var new_article = {
    author: "roger",
    date: new Date(),
    title: "My Favorite 2012 Movies",
    body: "Here are my favorite movies from 2012…",
    tags: ["horror", "action", "independent"]
  }
> db.articles.save(new_article)
19. Querying
> db.articles.find()
{
    _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
    author: "roger",
    date: ISODate("2013-01-08T22:10:19.880Z"),
    title: "My Favorite 2012 Movies",
    body: "Here are my favorite movies from 2012…",
    tags: ["horror", "action", "independent"]
}
// _id is unique but can be anything you like
20. Indexes
// create an ascending index on "author"
// (ensureIndex is the older name; later MongoDB releases call this createIndex)
> db.articles.ensureIndex({author: 1})
> db.articles.find({author: "roger"})
{
    _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
    author: "roger",
    …
}
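To make the benefit of the index on "author" concrete, here is a minimal plain-JavaScript sketch (not shell code, and not MongoDB's implementation). It assumes a sorted array as a stand-in for the server's B-tree index; the sample documents and the helper names `findByAuthorScan`, `buildIndex` and `findByIndex` are hypothetical.

```javascript
// Sample documents shaped like the article on the slides (values are made up).
const articles = [
  { _id: 1, author: "roger", title: "My Favorite 2012 Movies" },
  { _id: 2, author: "alice", title: "Hypothetical second post" },
  { _id: 3, author: "roger", title: "Hypothetical third post" },
];

// Without an index: every document is examined (a full collection scan).
function findByAuthorScan(docs, author) {
  return docs.filter((doc) => doc.author === author);
}

// Toy ascending index: [key, position] entries sorted by key, then position.
function buildIndex(docs, field) {
  return docs
    .map((doc, pos) => [doc[field], pos])
    .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : a[1] - b[1]));
}

// With the index: binary-search to the first matching key, then walk forward.
function findByIndex(docs, index, key) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid][0] < key) lo = mid + 1;
    else hi = mid;
  }
  const out = [];
  for (let i = lo; i < index.length && index[i][0] === key; i++) {
    out.push(docs[index[i][1]]);
  }
  return out;
}

const authorIndex = buildIndex(articles, "author");
const viaScan = findByAuthorScan(articles, "roger");
const viaIndex = findByIndex(articles, authorIndex, "roger");
```

Both lookups return the same two documents; the indexed version only touches the entries for the requested key instead of the whole collection.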
31. Replication
[Diagram: Replica Set – Recovery. Node 1 (Secondary), Node 2 (Primary), Node 3 (Recovery), connected by Heartbeat and Replication links]
32. Replication
[Diagram: Replica Set – Recovered. Node 1 (Secondary), Node 2 (Primary), Node 3 (Secondary), connected by Heartbeat and Replication links]
33. Scaling Reads
[Diagram: Client Application and Driver; Write goes to the Primary, Read goes to the two Secondaries]
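The read-scaling pattern in this diagram can be sketched in plain JavaScript. This is a toy router under stated assumptions, not the driver's actual logic: the node names, the round-robin policy and the `createRouter` helper are all hypothetical.

```javascript
// Hypothetical replica set members; roles mirror the diagram labels.
const members = [
  { name: "node1", role: "secondary" },
  { name: "node2", role: "primary" },
  { name: "node3", role: "secondary" },
];

// Toy router: writes always go to the primary; reads rotate over secondaries.
function createRouter(nodes) {
  let next = 0;
  const primary = nodes.find((n) => n.role === "primary");
  const secondaries = nodes.filter((n) => n.role === "secondary");
  return {
    routeWrite() {
      return primary.name;
    },
    routeRead() {
      // With no secondaries available, fall back to the primary.
      if (secondaries.length === 0) return primary.name;
      const target = secondaries[next % secondaries.length];
      next += 1;
      return target.name;
    },
  };
}

const router = createRouter(members);
const writeTarget = router.routeWrite();
const readTargets = [router.routeRead(), router.routeRead(), router.routeRead()];
```

Because replication to secondaries is asynchronous, a read served this way may briefly lag behind the latest write, which is the trade-off for spreading read load.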
34. Sharding
[Diagram: three App Servers, each with a Mongos; three Config Servers; three Shards]
35. Data stored in shard
• Shard is a node of the cluster
• For production deployments a shard is a replica set
[Diagram: a Shard is either a single Mongod or a replica set (Primary, Secondary, Secondary)]
36. Config server stores metadata
• Config Server
– Stores cluster chunk ranges and locations
– Production deployments need 3 nodes
– Two phase commit (not a replica set)
[Diagram: one Config Server, or three Config Servers]
37. Mongos manages the data
• Mongos
– Acts as a router / balancer
– No local data (persists to config database)
– Can have 1 or many
[Diagram: App Servers connecting through one Mongos, or through many Mongos]
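The routing role of Mongos can be illustrated with a small plain-JavaScript sketch: given chunk ranges and locations of the kind the config servers store, pick the shard (or shards) a request must visit. The chunk boundaries, shard names and the helpers `routeByKey` and `routeByRange` are hypothetical, and the real router is considerably more involved.

```javascript
// Hypothetical chunk metadata: each chunk covers shard-key values in
// [min, max) and lives on exactly one shard.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shard0" },
  { min: 1000, max: 2000, shard: "shard1" },
  { min: 2000, max: Infinity, shard: "shard2" },
];

// A query on a single shard-key value is routed to exactly one shard.
function routeByKey(chunkTable, key) {
  const chunk = chunkTable.find((c) => key >= c.min && key < c.max);
  if (!chunk) throw new Error("no chunk covers key " + key);
  return chunk.shard;
}

// A range query [low, high] visits every shard owning an overlapping chunk.
function routeByRange(chunkTable, low, high) {
  const shards = chunkTable
    .filter((c) => low < c.max && high >= c.min)
    .map((c) => c.shard);
  return [...new Set(shards)];
}

const single = routeByKey(chunks, 1500);
const several = routeByRange(chunks, 900, 1100);
```

A lookup on one key lands on a single shard, while a range that straddles a chunk boundary fans out to every shard that owns part of it.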
42. Mapping SQL to Aggregation
• SQL statement
SELECT COUNT(*) FROM users
• MongoDB command
db.users.aggregate([
    { $group: { _id: null, count: { $sum: 1 } } }
])
• SQL statement
SELECT SUM(price) FROM orders
• MongoDB command
db.orders.aggregate([
    { $group: { _id: null, total: { $sum: "$price" } } }
])
• SQL statement
SELECT cust_id, SUM(price) FROM orders GROUP BY cust_id
• MongoDB command
db.orders.aggregate([
    { $group: { _id: "$cust_id", total: { $sum: "$price" } } }
])
• SQL statement
SELECT cust_id, SUM(price) FROM orders WHERE active=true GROUP BY cust_id
• MongoDB command
db.orders.aggregate([
    { $match: { active: true } },
    { $group: { _id: "$cust_id", total: { $sum: "$price" } } }
])
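To show what the last pipeline computes, here is a plain-JavaScript sketch of a match stage followed by a group-and-sum stage over an in-memory array. It only mimics the semantics of `$match` and `$group` for this one case; the sample orders and the helper names `match` and `groupSum` are hypothetical.

```javascript
// Made-up order documents; field names follow the slide (cust_id, price, active).
const orders = [
  { cust_id: "a", price: 10, active: true },
  { cust_id: "b", price: 5, active: true },
  { cust_id: "a", price: 7, active: false },
  { cust_id: "a", price: 3, active: true },
];

// Match stage: keep documents whose fields equal every value in the filter.
function match(docs, filter) {
  return docs.filter((doc) =>
    Object.keys(filter).every((field) => doc[field] === filter[field])
  );
}

// Group stage with a single sum accumulator, keyed by one field.
function groupSum(docs, keyField, sumField) {
  const totals = new Map();
  for (const doc of docs) {
    const key = doc[keyField];
    totals.set(key, (totals.get(key) || 0) + doc[sumField]);
  }
  return [...totals].map(([_id, total]) => ({ _id, total }));
}

// Equivalent in spirit to: WHERE active=true GROUP BY cust_id, SUM(price)
const result = groupSum(match(orders, { active: true }), "cust_id", "price");
```

The inactive order is filtered out before grouping, so customer "a" totals 13 and customer "b" totals 5, mirroring how the stages of a pipeline feed one another.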
43. Native Map/Reduce
• More complex aggregation tasks
• Map and Reduce functions written in JS
• Can be distributed across sharded cluster for increased parallelism
44. Map/Reduce Functions
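// map runs once per document; "this" is the current document
var map = function() {
    emit(this.author, {votes: this.votes});
};
// reduce combines all values emitted for one key
var reduce = function(key, values) {
    var sum = 0;
    values.forEach(function(doc) {
        sum += doc.votes;
    });
    return {votes: sum};
};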
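To see how these two functions cooperate, the following plain-JavaScript harness runs them over an in-memory array. It is a single-process sketch, not how the server executes map/reduce: the sample documents, the `runMapReduce` helper and the rebinding of `emit` are assumptions made for illustration.

```javascript
// Made-up documents; only the fields used by map() matter here.
const docs = [
  { author: "roger", votes: 3 },
  { author: "alice", votes: 1 },
  { author: "roger", votes: 4 },
];

// In the MongoDB shell, emit() is supplied by the server. Here it is a
// stand-in that the harness below rebinds before calling map().
let emit;

// map and reduce as on the slide.
const map = function () {
  emit(this.author, { votes: this.votes });
};
const reduce = function (key, values) {
  let sum = 0;
  values.forEach(function (doc) {
    sum += doc.votes;
  });
  return { votes: sum };
};

// Hypothetical harness: group emitted values by key, then reduce each group.
function runMapReduce(input, mapFn, reduceFn) {
  const groups = new Map();
  emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  input.forEach((doc) => mapFn.call(doc)); // `this` is the current document
  return [...groups].map(([key, values]) => ({
    _id: key,
    // A key with a single emitted value is passed through without reducing.
    value: values.length === 1 ? values[0] : reduceFn(key, values),
  }));
}

const totals = runMapReduce(docs, map, reduce);
```

Note that reduce returns the same shape that map emits (`{votes: ...}`), which is what allows partial results to be reduced again when the work is spread across shards.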
45. Hadoop and MongoDB
• MongoDB-Hadoop adapter
• 1.0 released, 1.1 in development
• Supports Hadoop
– Map/Reduce, Streaming, Pig
• MongoDB as input/output storage for Hadoop jobs
– No need to go through HDFS
• Leverage power of Hadoop ecosystem against operational data in MongoDB