In version 2.4, MongoDB introduces hash-based sharding, allowing the user to shard based on a randomized shard key to spread documents evenly across a cluster. Hash-based sharding is an alternative to range-based sharding, making it easier to manage your growing cluster. In this talk, we'll discuss provide an overview of this new feature and discuss the pros and cons of using a hash-based sharding vs. range-based approach.
5. What is a Shard Key
• Shard key is used to partition your collection
• Shard key must exist in every document
• Shard key is immutable
• Shard key values are immutable
• Shard key must be indexed
• Shard key is used to route requests to shards
11. Chunk splitting
{x: 0} {x: 6} {x: 7}{x: -5} {x: 10}{x: -9}
0 0
• Achunk is split once it exceeds the maximum size
• There is no split point if all documents have the same shard key
• Chunk split is a logical operation (no data is moved)
• If split creates too large of a discrepancy of chunk count across cluster
a balancing round starts
12. Data distribution
• MinKey to 0 lives on Shard1
• 0 to MaxKey lives on Shard2
• Mongos routes queries appropriately
21. Picking a shard key
• Cardinality
• Optimize routing
• Minimize (unnecessary) traffic
• Allow best scaling
22. What about Object Id?
ObjectId("51597ca8e28587b86528edfd”)
• Used for _id
• 12 byte value
• Generated by the driver if not specified
• Theoretically globally unique
23. What about Object Id?
ObjectId("51597ca8e28587b86528edfd”)
12 Bytes
Timestamp
MAC
PID
Counter
24. // enabling sharding on test database
mongos> sh.enableSharding("test")
{ "ok" : 1 }
// sharding the test collection
mongos> sh.shardCollection("test.test",{_id:1})
{ "collectionsharded" : "test.test", "ok" : 1 }
// create a loop inserting data
mongos> for (x=0; x<10000; x++) {
... db.test.insert({value:x})
... }
Sharding on ObjectId
31. Under the hood
• Create a hashed index used for sharding
• Uses the first 64-bits of md5 hash of field
• Uses existing hash index, or creates a new one
on a collection
• Hash both data and BSON type
• Represented as a NumberLong in the JS shell
32. // hash on 1 as an integer
> db.runCommand({_hashBSONElement:1})
{
"key" : 1,
"seed" : 0,
"out" : NumberLong("5902408780260971510"),
"ok" : 1
}
// hash on “1” as a string
> db.runCommand({_hashBSONElement:"1"})
{
"key" : "1",
"seed" : 0,
"out" : NumberLong("-2448670538483119681"),
"ok" : 1
}
Hash on simple or embedded BSON
values
33. Enabling hashed indexes
• Create index
– db.collection.ensureIndex( {field : ”hashed”} )
• Options
– Seed, specify a different seed to use
– hashVersion, at the moment only version 0 (md5).
34. Using hash shard keys
• Enable sharding on collection
– sh.shardCollection(“test.collection”, {field: “hashed”})
• Options
– numInitialChunks, specifies the number of initial chunks
per shard. Default is two chunks per shard (use
“sh._adminCommand” to specify options)
35. // enabling sharding on test database
mongos> sh.enableSharding("test")
{ "ok" : 1 }
// shard by hashed _id field
mongos> sh.shardCollection("test.hash",{_id:"hashed"})
{ "collectionsharded" : "test.hash", "ok" : 1 }
Sharding on hashed ObjectId
62. Other limitations
• Cannot use a compound key
• Key cannot have an array value
• Tag-aware sharding
– Only makes sense to assign the full hashed shard key
collection to particular shards
– By design, there’s no real way to know or control what
data is in what range
• Key with poor cardinality is going to give a hash
with poor cardinality
– Floating point numbers are squashed. E.g. 100.4 will be
hashed as 100
63. Summary
• Range-based Sharding
– Most efficient for applications that operate on ranges
– Requires careful shard key selection
• Hash-based Sharding
– Uniform writes,
– No routed range queries
• TagAware Sharding
– That’s another talk!
64. Resources
• Tutorial:Sharding (laptop friendly)
• Tutorial:Converting a Replica Set to a Replicated Sharded Cluster(laptop
friendly)
• Manual:Select a Shard Key
• Manual:Hash-based Sharding
• Manual:TagAware Sharding
• Manual:Strategiesfor Bulk Inserts in ShardedClusters
• Manual:Manage Sharded Cluster Balancer
BIO: I live and work in the Washington D.C. area and focus on supporting the MongoDB community and delivering solutions using MongoDB to the Federal government. I’ve spent the last 7 years or so working with NoSQL database. I’ve worked in a variety of industries including precise timing systems, military command and control and digital mapping, e-commerce, distributed computing platforms, search, system integration and Big Data applications.
Remind everyone what a sharded cluster is. We will take a close look at some how sharded clusters work and at the new hashed shard key feature of 2.4
Isolating queries (to a few shards)Scatter -- gather ( high latency but not bad )hash keys
Min value includedMax value not included
Balancer is running on mongosOnce the difference in chunks between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts
Moved chunk on shard2 should be gray
Source shard deletes moved dataMust wait for open cursors to either close or time outNoTimeout cursors may prevent the release of the lockMongos releases the balancer lock after old chunks are deleted
Moving data is expensive (i/o, network bandwidth)Moving many chunks takes a long time (can only move one chunk at a time)Balancing and migrations compete for resources with your application
What’s the solution to sharding on incremental values as a shard key?
mongod --setParameter=enableTestCommands=1
Seed and hashVersion are undocumented at this pointLet people know we are at least thinking about these things (especially the ability to change the hash algorithm)
When sharding a new collection using Hash-based shard keys, MongoDB will take care of the presplitting for you. Similarly sized ranges of the Hash-based key are distributed to each existing shard, which means that no initial balancing is needed (unless of course new shards are added).
Only happens on new collections
Query contains the shard key field
The mongos does not have to load the whole set into memory since each shard sorts locally. The mongos can just getMore from the shards as needed and incrementally return the results to the client.
Uses the hashed index
Assuming only a hashed index on “x”
Tag-aware note: it doesn’t usually make a lot of sense to tag anything other than the full hashed shard key collection to particular shards - by design, there’s no real way to know or control what data is in what range.since the chunk ranges are based on the value of the randomized hash of the shard key instead of the shard key itself, this is usually only useful for tagging the whole range to a specific set of shards