Hellenic MongoDB user group - Introduction to sharding
1. Introduction to sharding
Christos Soulios
Software Architect, Persado
(christos.soulios@persado.com)
15 Jan 2013
2. Let's start with an example:
We have launched our latest and greatest web application
We use a MongoDB database, which is fast and cool
We have even set up replication for high availability
Our application turns out to be popular and we are already
planning our next project
Cool!
4. MongoDB problems when dataset grows
Dataset does not fit on local disks.
Solution: Let’s buy more disks
Database indexes do not fit in memory. They have to be paged
in and out. Database becomes sluggish
Solution: Let’s buy more memory
High-throughput write operations cause heavy contention on
the infamous MongoDB locks
Now what?
We need to scale horizontally. We need sharding
5. What is sharding?
Sharding is automatic data partitioning
Distributes data evenly across cluster nodes (called shards)
Allows for seamless querying. Almost no functionality is lost
compared with a single master
Keeps database consistent
6. How sharding works
Collection data is broken into chunks based on the range of a
selected collection field. This field is called the shard key
Chunks are evenly distributed across shards. Each data chunk is
controlled by a single shard
Special config servers are responsible for storing which shard
controls which chunks
Database clients communicate with the shards through the mongos
router process
To the client, the mongos router behaves just like a normal mongod
server. Sharding is transparent to the client
For each database operation, the mongos router queries the config
servers using the shard key and redirects the operation to the
correct shards
As more data is inserted, ranges are split into more chunks
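The range lookup and chunk split described on this slide can be sketched in a few lines of Python. This is a toy model, not MongoDB's implementation: the chunk boundaries, the shard names, and the `route` and `split_chunk` helpers are all hypothetical, chosen only to show how a range table maps a shard key value to the shard that controls its chunk.

```python
from bisect import bisect_right

# Hypothetical chunk table, as the config servers might record it:
# chunk i covers shard key values in [lower_bounds[i], lower_bounds[i + 1]).
lower_bounds = ["", "g", "n", "t"]  # "" stands in for "minus infinity"
chunk_owner = ["shard0", "shard1", "shard0", "shard2"]


def route(shard_key_value: str) -> str:
    """Return the shard that owns the chunk containing the given key value."""
    chunk_index = bisect_right(lower_bounds, shard_key_value) - 1
    return chunk_owner[chunk_index]


def split_chunk(split_point: str) -> None:
    """Split the chunk containing split_point in two.

    Both halves stay on the same shard; moving one of them elsewhere
    would be a separate migration step.
    """
    chunk_index = bisect_right(lower_bounds, split_point) - 1
    if lower_bounds[chunk_index] == split_point:
        return  # already a chunk boundary, nothing to split
    lower_bounds.insert(chunk_index + 1, split_point)
    chunk_owner.insert(chunk_index + 1, chunk_owner[chunk_index])
```

With this table, `route("asterix")` falls in the first chunk and returns `"shard0"`, while `route("getafix")` returns `"shard1"`. Calling `split_chunk("c")` adds a boundary but does not change which shard answers for any key.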
9. Database operations
All queries are routed through the mongos process
Insert operations are routed by shard key. The shard key is
required
Querying by shard key routes the query only to the shards that
own the matching chunks
Querying by a non-shard-key field scatters the query to all shards
and gathers the results
Updates and deletes behave like queries
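The difference between a targeted query and a scatter-gather query can be illustrated with a small in-memory model. Everything here is made up for illustration: the three shards, their contents, the assumption that `user` is the shard key, and the `owning_shard` and `find` helpers. It only mimics the routing decision, not how mongos actually executes queries.

```python
# Toy cluster: three shards, each holding the documents for a range of users.
SHARD_KEY = "user"  # assumed shard key for this sketch
shards = {
    "shard0": [{"user": "asterix", "client": "TweetDeck"}],
    "shard1": [{"user": "getafix", "client": "web"}],
    "shard2": [{"user": "obelix", "client": "TweetDeck"}],
}


def owning_shard(user: str) -> str:
    """Hypothetical range lookup: below "g" -> shard0, below "n" -> shard1, else shard2."""
    if user < "g":
        return "shard0"
    if user < "n":
        return "shard1"
    return "shard2"


def find(query: dict) -> tuple:
    """Return (shards contacted, matching documents) for an equality query."""
    if SHARD_KEY in query:
        targets = [owning_shard(query[SHARD_KEY])]  # targeted: one shard
    else:
        targets = sorted(shards)  # scatter-gather: every shard
    matches = [
        doc
        for name in targets
        for doc in shards[name]
        if all(doc.get(field) == value for field, value in query.items())
    ]
    return targets, matches
```

In this model `find({"user": "obelix"})` contacts only `shard2`, whereas `find({"client": "TweetDeck"})` has to contact all three shards to collect its two matches.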
10. Data balancing
System becomes unbalanced when one shard stores more
data chunks than others
Data is automatically balanced without intervention from the
client application or the administrator
11. Data balancing
The range of the loaded shard is split and chunks are migrated
to other shards
12. Data balancing
Config servers are updated using a two-phase commit process to
ensure database consistency
System ends up balanced
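The balancing idea from the last three slides can be sketched as a loop that keeps moving chunks from the fullest shard to the emptiest one. This is a simplified illustration, not the real balancer: the `balance` function, the `threshold` parameter, and the chunk names are hypothetical, and the two-phase commit on the config servers is not modeled.

```python
def balance(chunks_per_shard: dict, threshold: int = 2) -> list:
    """Migrate one chunk at a time from the fullest to the emptiest shard.

    Mutates chunks_per_shard and returns the list of
    (chunk, source shard, target shard) migrations that were performed.
    """
    if threshold < 2:
        # With a smaller threshold a chunk could bounce back and forth forever.
        raise ValueError("threshold must be at least 2")
    migrations = []
    while True:
        fullest = max(chunks_per_shard, key=lambda s: len(chunks_per_shard[s]))
        emptiest = min(chunks_per_shard, key=lambda s: len(chunks_per_shard[s]))
        gap = len(chunks_per_shard[fullest]) - len(chunks_per_shard[emptiest])
        if gap < threshold:
            return migrations  # system is considered balanced
        chunk = chunks_per_shard[fullest].pop()
        chunks_per_shard[emptiest].append(chunk)
        migrations.append((chunk, fullest, emptiest))


cluster = {
    "shard0": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "shard1": ["c7"],
    "shard2": ["c8"],
}
moves = balance(cluster)
```

Starting from a 6/1/1 distribution, the loop performs three migrations and stops at 3/3/2, with no chunk lost or duplicated.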
13. Choosing a shard key
Choosing a good shard key is critical
Once chosen, we are stuck with it
Shard key must be immutable
Should distribute data load evenly across shards
Should be of high cardinality. Enumerated values are not good
shard keys
Should not be monotonically increasing. ObjectIds, dates, or database
sequences are not good shard keys because they create hotspots
Should be used by the most critical queries to provide query isolation.
Avoid scatter-gather queries
Should provide good data affinity to avoid disk-to-memory transfers
(random values are not good shard keys)
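The hotspot created by a monotonically increasing shard key can be shown with a small simulation. The split points, shard names, and key sequences below are invented for the sketch; in a real cluster chunks would also be split and migrated over time, but new inserts would still land on whichever shard owns the highest range.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical range chunks over an integer shard key:
# (-inf, 100), [100, 200), [200, 300), [300, +inf)
split_points = [100, 200, 300]
owners = ["shard0", "shard1", "shard2", "shard3"]


def writes_per_shard(keys) -> Counter:
    """Count how many inserts each shard receives for the given key values."""
    return Counter(owners[bisect_right(split_points, key)] for key in keys)


# Monotonically increasing keys (like a sequence or a timestamp): every new
# value is larger than all existing ones, so every insert hits the last chunk.
sequence_keys = range(1000, 1100)

# Keys spread over the whole key space, for comparison.
spread_keys = [(i * 37) % 400 for i in range(100)]

hot = writes_per_shard(sequence_keys)
spread = writes_per_shard(spread_keys)
```

Here `hot` shows all 100 writes going to `shard3`, while `spread` has every shard taking part of the load. Note that spreading keys fixes only the write hotspot; as the slide says, fully random keys hurt data affinity.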
14. Choosing a shard key
Know your data. It is important
What is the expected dataset size?
What is the write throughput?
What does the data look like? Which fields are random or increasing?
Are there low cardinality fields?
Can we identify any access patterns for reads?
What data is indexed?
What is the active working set? Is there historical data that
is no longer used after some time?
15. Choosing a shard key
It is not trivial
Most of the time there is no single field that can be used as
a shard key
We have to invent one
16. Choosing a shard key
Applications usually access recently inserted data more often
What about a compound shard key?
What about a combination of a coarsely ascending field and a
commonly queried search key?
The coarsely ascending key should have a few hundred chunks
per value. This provides good data locality and even
distribution
The search key provides query isolation
Rule of thumb: {coarseLocality: 1, search: 1}
17. Example (Tweets collection)
{user: "asterix",
ts: ISODate("2013-01-14T22:53:33.123Z"),
month: "2013-01",
retweets: 45,
client: "TweetDeck",
text: "MongoDB sharding is super cool!"
}
We are typically looking for the latest tweets of a user.
Therefore, a combination of the month and user fields would make a
good shard key
The month field is coarsely ascending, so only the latest tweets
need to be loaded into memory
The user field is a commonly searched key
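To see why the month + user combination behaves well, the following sketch derives the compound key for a few tweets and sorts them the way a range-sharded collection would order them. The sample tweets and the `shard_key` helper are hypothetical, and the month is derived from `ts` here only to keep the sample data short; the slide's document stores it as its own field.

```python
from datetime import datetime, timezone


def shard_key(tweet: dict) -> tuple:
    """Compound key (month, user): coarse locality first, then the search field."""
    return (tweet["ts"].strftime("%Y-%m"), tweet["user"])


tweets = [
    {"user": "obelix", "ts": datetime(2013, 1, 3, tzinfo=timezone.utc)},
    {"user": "asterix", "ts": datetime(2012, 12, 30, tzinfo=timezone.utc)},
    {"user": "asterix", "ts": datetime(2013, 1, 14, 22, 53, 33, tzinfo=timezone.utc)},
    {"user": "asterix", "ts": datetime(2013, 1, 2, tzinfo=timezone.utc)},
]

# Range sharding keeps documents in shard key order, so sorting by the key
# shows which tweets end up next to each other.
ordered = sorted(tweets, key=shard_key)
keys_in_order = [shard_key(tweet) for tweet in ordered]
```

The December tweet sorts first, and both January tweets by asterix sit next to each other ahead of obelix. A query for a user's latest tweets therefore touches one narrow key range, and new inserts are spread across users within the current month instead of all landing at a single point.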
18. Conclusion
Sharding allows MongoDB databases to scale horizontally
Shard balancing is performed automatically by the system
Sharding is transparent to the client application
Choosing a good shard key is critical
Choosing a good shard key is not trivial
Be creative and experiment with your data before choosing
the shard key