The Weather of the Century: MongoDB for Storing and Analyzing Historical Weather Data

#MongoDB
The Weather of the Century:
Design and High Performance
André Spiegel
Consulting Engineer, MongoDB

What was the weather
when you were born?

Data Format: Raw and in MongoDB
0303725053947282013060322517+40779-073969FM-15+0048KNYC
V0309999C00005030485MN0080475N5+02115+02005100975
ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999
GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859...
{
"st" : "u725053",
"ts" : ISODate("2013-06-03T22:51:00Z"),
"airTemperature" : {
"value" : 21.1,
"quality" : "5"
},
"atmosphericPressure" : {
"value" : 1009.7,
"quality" : "5"
}
}
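The mapping from the raw record to the document can be sketched as a fixed-offset parse. This is a minimal sketch, not the talk's loader: the function name `parseObservation` is hypothetical, the offsets assume the NOAA ISD fixed-width layout, and the call-letter field is assumed to be five characters wide (so "KNYC" lost a trailing space where the slide wrapped the record). Missing-value sentinels are not handled.

```javascript
// The two raw lines from the slide (mandatory section only).
const line1 = "0303725053947282013060322517+40779-073969FM-15+0048KNYC";
const line2 = "V0309999C00005030485MN0080475N5+02115+02005100975";
// Assumption: restore the padding space lost at the slide's line break.
const record = line1.padEnd(56, " ") + line2;

// Hypothetical parser for the mandatory section of one fixed-width record.
// Offsets are 0-based and assume the NOAA ISD layout; verify against the spec.
function parseObservation(rec) {
  const date = rec.slice(15, 23); // YYYYMMDD
  const time = rec.slice(23, 27); // HHMM (UTC)
  const scaled = (field) => Number(field) / 10; // values are stored in tenths
  return {
    st: "u" + rec.slice(4, 10), // "u" prefix mirrors the slide's station ID
    ts: new Date(Date.UTC(
      Number(date.slice(0, 4)), Number(date.slice(4, 6)) - 1, Number(date.slice(6, 8)),
      Number(time.slice(0, 2)), Number(time.slice(2, 4)))),
    airTemperature: { value: scaled(rec.slice(87, 92)), quality: rec[92] },
    atmosphericPressure: { value: scaled(rec.slice(99, 104)), quality: rec[104] },
  };
}

const doc = parseObservation(record);
console.log(JSON.stringify(doc, null, 2));
```

In the mongo shell the `ts` value would be shown as an `ISODate`; a plain `Date` is used here so the sketch runs without a database.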

"st" : "u725053" is the station identifier (»NYC Central Park«)

How Big Is It?
• 2.5 billion data points
• 4 Terabyte (1.6k per document)
• “moderately big”
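The two figures above are consistent with each other, as a quick check shows. This sketch assumes that "1.6k" means 1,600 bytes and that a terabyte is 10^12 bytes.

```javascript
const documents = 2.5e9;        // data points, from the slide
const bytesPerDocument = 1600;  // "1.6k" read as 1,600 bytes (assumption)
const totalTerabytes = (documents * bytesPerDocument) / 1e12;
console.log(totalTerabytes + " TB");
```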

First Deployment
• A single server with a really big disk
• Application: c3.8xlarge
• mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD

Second Deployment
• A really big cluster where everything is in RAM
• Application / mongos: c3.8xlarge
• mongod: 100 x r3.2xlarge @ 61 GB RAM, 100 GB disk
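Whether "everything is in RAM" is plausible can be checked with the slide's own numbers. This is a rough sketch that assumes the 4 TB data set is spread evenly across the shards and ignores index size and per-process overhead.

```javascript
const shards = 100;
const ramPerShardGB = 61;
const diskPerShardGB = 100;
const dataSetGB = 4000; // 4 TB, from "How Big Is It?"

const dataPerShardGB = dataSetGB / shards; // even distribution is an assumption
const fitsInRam = dataPerShardGB < ramPerShardGB;
const fitsOnDisk = dataPerShardGB < diskPerShardGB;
console.log({ dataPerShardGB, fitsInRam, fitsOnDisk });
```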

Now... how much would you pay?
• Single server: $60,000 / yr
• Cluster: $700,000 / yr

Use Cases
• Bulk loading
– getting all data into the system
• Latency and throughput for queries
– point in space-time
– one station, one year
– the whole world, once upon a time
• Aggregation and Exploration
– warmest and coldest day ever, etc.

Bulk Loading: Principles
• On the application side:
– batch size
– number of client threads
– use unordered bulk writes
• On the server side:
– Journaling off (temporarily!)
– Index later
– In cluster: pre-split, no balancing
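The application-side principles can be sketched as follows. This is an illustration under assumptions, not the loader used in the talk: `batches` and `loadAll` are hypothetical names, and `collection` is assumed to be a collection object from the MongoDB Node.js driver. Running several such loaders concurrently (the "number of client threads") is left out.

```javascript
// Split an array of documents into batches of at most `batchSize`.
function* batches(docs, batchSize) {
  for (let i = 0; i < docs.length; i += batchSize) {
    yield docs.slice(i, i + batchSize);
  }
}

// Hypothetical loader. `ordered: false` requests an unordered bulk write, so
// the server does not have to apply the batch strictly in sequence and one
// failing document does not stop the rest of the batch.
async function loadAll(collection, docs, batchSize = 100) {
  let inserted = 0;
  for (const batch of batches(docs, batchSize)) {
    const result = await collection.insertMany(batch, { ordered: false });
    inserted += result.insertedCount;
  }
  return inserted;
}

const sizes = [...batches([1, 2, 3, 4, 5], 2)].map((b) => b.length);
console.log(sizes);
```

Batch size and thread count are the two knobs the following slides vary; the best values depend on the deployment.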

Bulk Loading: Single Server
[Chart: throughput as a function of batch size and number of threads]
8 threads, batch size 100 → 85,000 doc/s

• Settings: 8 threads, batch size 100
• Total loading time: 10 h 20 min
• Documents per second: 70,000
• Index build time: 7 h 40 min (ts_1_st_1)
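The index name `ts_1_st_1` follows MongoDB's default naming, which joins each key and its direction with underscores, so it suggests a compound index on `ts` and then `st`, both ascending. A small sketch of that naming rule (the helper `defaultIndexName` is hypothetical):

```javascript
// Derive MongoDB's default index name from a key specification.
function defaultIndexName(keySpec) {
  return Object.entries(keySpec)
    .map(([field, direction]) => `${field}_${direction}`)
    .join("_");
}

const keySpec = { ts: 1, st: 1 };
console.log(defaultIndexName(keySpec));
// In the mongo shell, after the bulk load ("index later"):
//   db.data.createIndex({ ts: 1, st: 1 })
```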

Bulk Loading: Cluster
144 threads, batch size 200 → 220,000 doc/s

Bulk Loading: Cluster
• Shard Key: Station ID, hashed
• Settings: 10 mongos @ 144 threads, batch size 200
• Total loading time: 3 h 10 min
• Documents per second: 228,000
• Index build time: 5 min (ts_1_st_1)
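Sharding on a hashed station ID with pre-splitting and the balancer off might be set up roughly as below. The database name `weather` and the chunk count are hypothetical, the helper `shardOnHashedStation` is only for illustration, and the availability and behavior of `numInitialChunks` depend on the server version.

```javascript
// Build the admin command document for sharding on a hashed station ID.
function shardOnHashedStation(namespace, numInitialChunks) {
  return {
    shardCollection: namespace,
    key: { st: "hashed" },   // shard key from the slide: station ID, hashed
    numInitialChunks,        // pre-split into this many chunks (assumed value)
  };
}

const cmd = shardOnHashedStation("weather.data", 200);
console.log(JSON.stringify(cmd));
// In the mongo shell, connected to a mongos:
//   sh.enableSharding("weather")
//   sh.stopBalancer()        // "no balancing" during the load
//   db.adminCommand(cmd)     // creates the pre-split, hashed-sharded collection
```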

Queries: Point in Space-Time
db.data.find({"st" : "u747940",
"ts" : ISODate("1969-07-16T12:00:00Z")})

[Chart: query latency in ms (avg, 95th, 99th, max.), single server vs. cluster]
• Throughput, single server: 40,000/s
• Throughput, cluster: 610,000/s (10 mongos)

Queries: One Station, One Year
db.data.find({"st" : <station id>,
"ts" : {"$gte": ISODate("1989-01-01"),
"$lt" : ISODate("1990-01-01")}})

[Chart: query latency in ms (avg, 95th, 99th, max.), single server vs. cluster]
• Throughput, single server: 20/s
• Throughput, cluster: 430/s (10 mongos)
• Targeted query

Queries: The Whole World, Once Upon...
db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})

[Chart: query latency in ms (avg, 95th, 99th, max.), single server vs. cluster]
• Throughput, single server: 8/s
• Throughput, cluster: 310/s (10 mongos)
• Scatter/gather query
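The difference between a targeted and a scatter/gather query comes down to whether the query pins the shard key. The sketch below is a deliberately simplified model with a hypothetical `routing` function: it treats a query as targeted only when `st` is given as one exact string, which is not the full routing logic of mongos.

```javascript
// Simplified model: with a hashed shard key on "st", a query can be sent to
// a single shard only if it fixes "st" to one exact value.
function routing(query, shardKeyField = "st") {
  return typeof query[shardKeyField] === "string" ? "targeted" : "scatter/gather";
}

const oneStationOneYear = {
  st: "u747940", // station ID borrowed from the point query, for illustration
  ts: { $gte: new Date("1989-01-01"), $lt: new Date("1990-01-01") },
};
const wholeWorld = { ts: new Date("2000-01-01T00:00:00Z") };

console.log(routing(oneStationOneYear)); // targeted
console.log(routing(wholeWorld));        // scatter/gather
```

A scatter/gather query touches every shard, but each shard only searches its own share of the data in parallel.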

Analytics and Exploration
• Analytics means ad-hoc queries for which
we do not have an index
– Find all tornados
– Maximum reported temperature
• We cannot just index everything
– memory
– write performance

Analytics: Find all Tornados
db.data.find ({
"presentWeatherObservation.condition" : "99"
})

• Single server: 1 h 28 min
• Cluster: 47 s

Analytics: Maximum Temperature
db.data.aggregate ([
{ "$match" : { "airTemperature.quality" :
{ "$in" : [ "1", "5" ] } } },
{ "$group" : { "_id" : null,
"maxTemp" : { "$max" :
"$airTemperature.value" } } }
])

• Result: 61.8 °C = 143 °F
• Single server: 4 h 45 min
• Cluster: 2 min
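What the pipeline computes can be illustrated with a plain in-memory equivalent: keep only readings whose quality code is "1" or "5", then take the maximum value. The function `maxTemperature` and the sample documents are hypothetical; this is a sketch of the semantics, not of how the server executes the aggregation.

```javascript
// In-memory equivalent of the $match + $group/$max pipeline.
function maxTemperature(docs) {
  const accepted = new Set(["1", "5"]);
  let matched = false;
  let maxTemp = null;
  for (const doc of docs) {
    const t = doc.airTemperature;
    if (!t || !accepted.has(t.quality)) continue; // $match on quality
    matched = true;
    if (typeof t.value === "number" && (maxTemp === null || t.value > maxTemp)) {
      maxTemp = t.value;                          // $max accumulator
    }
  }
  // $group emits no document at all when nothing reaches it.
  return matched ? [{ _id: null, maxTemp }] : [];
}

// Made-up sample readings, for illustration only.
const sample = [
  { airTemperature: { value: 21.1, quality: "5" } },
  { airTemperature: { value: 30.2, quality: "1" } },
  { airTemperature: { value: 99.9, quality: "9" } }, // rejected by the filter
];
console.log(maxTemperature(sample));
```

Because neither field is indexed, the real pipeline has to read every document, which is why the running time depends so heavily on the deployment.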

Summary: Single Server
Pro
• Cost-effective
• Very good latency for single queries
Con
• Some operations are prohibitive:
– Indexing
– Table Scans

Summary: Cluster
Con
• High cost
Pro
• High throughput
• Very good latency for single queries
• Scatter-gather yields significant speed-up
• Analytics are possible
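The trade-off in the two summaries can be put into numbers using the timings and prices from the slides. This sketch assumes that the $60,000 figure belongs to the single server and the $700,000 figure to the cluster.

```javascript
const toSeconds = (h, m, s) => h * 3600 + m * 60 + s;

// Running times from the analytics slides (single server / cluster).
const tornadoSpeedup = toSeconds(1, 28, 0) / toSeconds(0, 0, 47);
const maxTempSpeedup = toSeconds(4, 45, 0) / toSeconds(0, 2, 0);

// Yearly cost, cluster / single server (assignment of prices is an assumption).
const costRatio = 700000 / 60000;

console.log({
  tornadoSpeedup: tornadoSpeedup.toFixed(1),
  maxTempSpeedup: maxTempSpeedup.toFixed(1),
  costRatio: costRatio.toFixed(1),
});
```

Under these assumptions the cluster costs roughly twelve times as much, while the two unindexed scans run more than a hundred times faster.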