1. The document describes using MongoDB to store and analyze a large weather dataset containing 2.5 billion data points and 4 terabytes of data.
2. Loading the data into a single server MongoDB deployment achieved a maximum loading throughput of 85,000 documents per second, while a sharded cluster configuration achieved 228,000 documents per second.
3. Query throughput was much higher in the clustered deployment, and analytic queries that require a full scan of the data, like finding all reported tornadoes or the maximum temperature, became practical: the maximum-temperature aggregation took 2 minutes on the cluster versus 4 h 45 min on a single server.
8. First Deployment
• A single server with a really big disk
Application: c3.8xlarge
mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD
9. Second Deployment
• A really big cluster where everything is in RAM
Application / mongos: c3.8xlarge
mongod: 100 x r3.2xlarge @ 61 GB RAM, 100 GB disk
13. Now... how much would you pay?
Single server: $60,000 / yr
Cluster: $700,000 / yr
14. Use Cases
• Bulk loading
– getting all data into the system
• Latency and throughput for queries
– point in space-time
– one station, one year
– the whole world, once upon a time
• Aggregation and Exploration
– warmest and coldest day ever, etc.
15. Bulk Loading: Principles
• On the application side:
– batch size
– number of client threads
– use unordered bulk writes
• On the server side:
– Journaling off ( temporarily! )
– Index later
– In cluster: pre-split, no balancing
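The application-side principles above can be sketched as follows. This is a minimal illustration, not the loader used in the talk: `batches` and `loadUnordered` are hypothetical helper names, and `collection` is assumed to be any object exposing `insertMany(docs, options)`, as the MongoDB Node.js driver and mongosh do. Passing `ordered: false` requests an unordered bulk write, so the server does not have to apply the inserts of a batch strictly one after another.

```javascript
// Split an array of documents into batches of at most batchSize.
function* batches(docs, batchSize) {
  for (let i = 0; i < docs.length; i += batchSize) {
    yield docs.slice(i, i + batchSize);
  }
}

// Insert all documents batch by batch using unordered bulk writes.
// `collection` is assumed to provide insertMany(docs, options).
async function loadUnordered(collection, docs, batchSize) {
  let inserted = 0;
  for (const batch of batches(docs, batchSize)) {
    await collection.insertMany(batch, { ordered: false });
    inserted += batch.length;
  }
  return inserted;
}
```

Running several such loops in parallel, each over its own slice of the input, would correspond to the "number of client threads" setting.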
17. Bulk Loading: Single Server
[Chart: throughput as a function of batch size and number of threads]
Best result: 8 threads, batch size 100 → 85,000 doc/s
18. Bulk Loading: Single Server
• Settings: 8 threads
batch size 100
• Total loading time: 10 h 20 min
• Documents per second: 70,000
• Index build time: 7 h 40 min (ts_1_st_1)
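The name `ts_1_st_1` is the default name MongoDB gives to a compound index on `{ ts: 1, st: 1 }`: each field name is joined to its sort direction with underscores. The hypothetical helper below only reproduces that naming rule for illustration; it does not talk to a server.

```javascript
// Reproduce MongoDB's default index naming: field_direction pairs
// joined by underscores, in key order.
function defaultIndexName(keys) {
  return Object.entries(keys)
    .map(([field, direction]) => `${field}_${direction}`)
    .join("_");
}

console.log(defaultIndexName({ ts: 1, st: 1 })); // ts_1_st_1
```

In the shell, the corresponding deferred index build would be `db.data.createIndex({ ts: 1, st: 1 })`, run after the bulk load has finished.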
21. Bulk Loading: Cluster
• Shard Key: Station ID, hashed
• Settings: 10 mongos @ 144 threads, batch size 200
• Total loading time: 3 h 10 min
• Documents per second: 228,000
• Index build time: 5 min (ts_1_st_1)
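As a quick cross-check of the two bulk-loading slides, the reported figures can be turned into speed-up ratios. The numbers below are copied from the slides; the variable names are only for illustration.

```javascript
// Convert hours and minutes to minutes.
const toMinutes = (h, m) => h * 60 + m;

// Figures as reported on the single-server and cluster slides.
const single = { loadMin: toMinutes(10, 20), docsPerSec: 70000, indexMin: toMinutes(7, 40) };
const cluster = { loadMin: toMinutes(3, 10), docsPerSec: 228000, indexMin: 5 };

// How many times faster the cluster is for each measure.
const ratios = {
  loadTime: single.loadMin / cluster.loadMin,
  throughput: cluster.docsPerSec / single.docsPerSec,
  indexBuild: single.indexMin / cluster.indexMin,
};

console.log(ratios);
```

Loading is roughly 3.3 times faster on the cluster by either measure, while the index build is 92 times faster.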
22. Queries: Point in Space-Time
db.data.find({"st" : "u747940",
"ts" : ISODate("1969-07-16T12:00:00Z")})
23. Queries: Point in Space-Time
[Chart: latency in ms (avg, 95th, 99th, max.), single server vs. cluster; y-axis 0 to 1.6 ms]
throughput: 40,000/s (single server), 610,000/s (cluster, 10 mongos)
24. Queries: One Station, One Year
db.data.find({"st" : "u103840",
"ts" : {"$gte": ISODate("1989-01-01"),
"$lt" : ISODate("1990-01-01")}})
25. Queries: One Station, One Year
[Chart: latency in ms (avg, 95th, 99th, max.), single server vs. cluster; y-axis 0 to 5000 ms]
throughput: 20/s (single server), 430/s (cluster, 10 mongos)
targeted query
26. Queries: The Whole World, Once Upon...
db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
27. Queries: The Whole World, Once Upon...
[Chart: latency in ms (avg, 95th, 99th, max.), single server vs. cluster; y-axis 0 to 10000 ms]
throughput: 8/s (single server), 310/s (cluster, 10 mongos)
scatter/gather query
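The labels "targeted query" and "scatter/gather query" follow from the shard key. With the collection sharded on the hashed station ID (`st`), mongos can route a query to a single shard only when the query pins `st` to an exact value; otherwise it has to ask every shard. The sketch below is a simplified illustration of that rule, not the actual routing logic of mongos: `routing` is a hypothetical helper, and plain `Date` objects stand in for the shell's `ISODate`.

```javascript
// Classify a query against a hashed shard key.
// Simplification: only a literal equality on the shard key field
// counts as targeted; operator documents such as {$gte: ...} do not.
function routing(query, shardKeyField) {
  const cond = query[shardKeyField];
  const isOperatorDoc =
    cond !== null &&
    typeof cond === "object" &&
    !(cond instanceof Date) &&
    Object.keys(cond).some((k) => k.startsWith("$"));
  const isEquality = cond !== undefined && !isOperatorDoc;
  return isEquality ? "targeted" : "scatter/gather";
}

// The three query shapes from the slides.
const pointInSpaceTime = { st: "u747940", ts: new Date("1969-07-16T12:00:00Z") };
const oneStationOneYear = {
  st: "u103840",
  ts: { $gte: new Date("1989-01-01"), $lt: new Date("1990-01-01") },
};
const wholeWorld = { ts: new Date("2000-01-01T00:00:00Z") };

console.log(routing(pointInSpaceTime, "st"));  // targeted
console.log(routing(oneStationOneYear, "st")); // targeted
console.log(routing(wholeWorld, "st"));        // scatter/gather
```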
28. Analytics and Exploration
• Analytics means ad-hoc queries for which we do not have an index
– Find all tornados
– Maximum reported temperature
• We cannot just index everything
– memory
– write performance
34. Analytics: Maximum Temperature
db.data.aggregate([
  { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } },
  { "$group" : { "_id" : null,
                 "maxTemp" : { "$max" : "$airTemperature.value" } } }
])
Result: 61.8 °C = 143 °F
Single Server: 4 h 45 min
35. Analytics: Maximum Temperature
Result: 61.8 °C = 143 °F
Cluster: 2 min
Single Server: 4 h 45 min
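To make the semantics of the maximum-temperature pipeline concrete, here is a small in-memory equivalent in plain JavaScript. The sample documents are made up for illustration and only mimic the `airTemperature.quality` and `airTemperature.value` fields used in the query; `maxTemperature` is a hypothetical helper. Since no index covers the filter, every document has to be inspected, which is why the operation amounts to a full scan.

```javascript
// In-memory equivalent of the $match + $group/$max pipeline.
function maxTemperature(docs) {
  const accepted = new Set(["1", "5"]); // quality codes kept by $match
  let matched = false;
  let maxTemp = null;
  for (const doc of docs) {
    const t = doc.airTemperature;
    if (!t || !accepted.has(t.quality)) continue; // $match stage
    if (!matched || t.value > maxTemp) maxTemp = t.value; // $max accumulator
    matched = true;
  }
  // Like aggregate(), return no result document when nothing matched.
  return matched ? [{ _id: null, maxTemp }] : [];
}

// Made-up sample data, not taken from the weather dataset.
const sample = [
  { st: "a", airTemperature: { value: 21.5, quality: "1" } },
  { st: "b", airTemperature: { value: 99.9, quality: "9" } }, // filtered out
  { st: "c", airTemperature: { value: 30.2, quality: "5" } },
];

console.log(maxTemperature(sample)); // [ { _id: null, maxTemp: 30.2 } ]
```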
36. Summary: Single Server
Pro
• Cost-effective
• Very good latency for single queries
Con
• Some operations are prohibitive:
– Indexing
– Table Scans
37. Summary: Cluster
Con
• High cost
Pro
• High throughput
• Very good latency for single queries
• Scatter-gather yields significant speed-up
• Analytics are possible