2. What will we cover? Many details of how indexing and the query optimizer work. A full understanding of these details is not required to use mongo, but this knowledge can be helpful when making optimizations. We’ll discuss functionality of Mongo 1.8 (for our purposes pretty similar to 1.6 and almost identical to 1.7 edge). Much of the material will be presented through examples. Diagrams are to aid understanding; some details will be left out.
3. What will we cover? Basic index bounds; compound key index bounds; $or queries; automatic index selection.
4. How will we cover it? We’re going to try to cover this material interactively. Please volunteer your thoughts on what mongo should do in given scenarios when I ask. Pertinent questions are welcome, but please hold off-topic or specialized questions until the end so we don’t lose momentum.
26. Full Document Matcher "nscanned" : 3, "nscannedObjects" : 3, "n" : 1, Documents for all matching keys are scanned, but only one document matched on non-index keys.
36. Exclusive Range Match "indexBounds" : { "x" : [ [ 4, 7 ] ] } Explain doesn’t indicate that the range is exclusive.
37. Exclusive Range Match "nscanned" : 2, "nscannedObjects" : 2, "n" : 2, But index keys matching the range bounds are not scanned because the bounds are exclusive.
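A minimal sketch of why exclusive bounds lower nscanned, assuming a made-up index modeled as a sorted array of keys (the function name `scanRange` and the key values are hypothetical, not the data behind the explain output above):

```javascript
// Hypothetical model: an index on x is a sorted array of key values.
// A real btree would seek to the lower bound instead of looping over every
// key; the loop here only counts the keys that fall inside the bounds.
function scanRange(sortedKeys, lower, upper, inclusive) {
  let nscanned = 0;
  for (const key of sortedKeys) {
    const aboveLower = inclusive ? key >= lower : key > lower;
    const belowUpper = inclusive ? key <= upper : key < upper;
    if (aboveLower && belowUpper) nscanned++;
  }
  return nscanned;
}

// Made-up keys for illustration only.
const keys = [3, 4, 5, 6, 7, 9];
const exclusiveCount = scanRange(keys, 4, 7, false); // models {x:{$gt:4,$lt:7}}
const inclusiveCount = scanRange(keys, 4, 7, true);  // models {x:{$gte:4,$lte:7}}
```

Both calls would print the same indexBounds of [ 4, 7 ], yet the exclusive version skips the keys equal to 4 and 7.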
43. Multikeys "nscanned" : 2, "nscannedObjects" : 2, "n" : 1, All keys in valid range are scanned, but the matcher rejects duplicate documents making n == 1.
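A rough sketch of multikey de-duplication, assuming a toy index that stores one (key, document id) entry per array element. The helper names and the sample documents are hypothetical, and the sketch tracks only nscanned and n:

```javascript
// Build a toy multikey index on x: one entry per array element.
function buildMultikeyIndex(docs) {
  const entries = [];
  for (const doc of docs) {
    const values = Array.isArray(doc.x) ? doc.x : [doc.x];
    for (const value of values) entries.push({ key: value, id: doc._id });
  }
  entries.sort((a, b) => a.key - b.key);
  return entries;
}

// Scan every key inside [lower, upper]; count each document only once.
function scanWithDedup(entries, lower, upper) {
  const seenIds = new Set();
  let nscanned = 0;
  for (const { key, id } of entries) {
    if (key < lower || key > upper) continue;
    nscanned++;      // every key in the range is scanned
    seenIds.add(id); // a document already seen is not counted again
  }
  return { nscanned, n: seenIds.size };
}

// Made-up documents for illustration only.
const index = buildMultikeyIndex([{ _id: 1, x: [3, 4] }, { _id: 2, x: 9 }]);
const result = scanWithDedup(index, 3, 4);
```

Here both keys of document 1 fall in the range, so two keys are scanned but only one document is counted.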
57. Set Match "nscanned" : 3, "nscannedObjects" : 2, "n" : 2, Why is nscanned 3? This is an algorithmic detail we’ll discuss more later, but when there are disjoint ranges for a key nscanned may be higher than the number of matching keys.
62. All Match "indexBounds" : { "x" : [ [ 3, 3 ] ] } The first entry in the $all match array is always used for index bounds. Note this may not be the least numerous indexed value in the $all array.
82. Sort "cursor" : "BtreeCursor x_1", "nscanned" : 5, "nscannedObjects" : 5, "n" : 4, "scanAndOrder" : true, Results are sorted on the fly to match requested order. The scanAndOrder field is only printed when its value is true.
83. Sort and scanAndOrder With “scanAndOrder” sort, all documents must be touched even if there is a limit spec. With scanAndOrder, sorting is performed in memory and the memory footprint is constrained by the limit spec if present.
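One way to picture the point about limit and memory is a bounded sort buffer. This is an illustrative sketch only, not the server's implementation; the name `scanAndOrder` for the function and the sample documents are assumptions:

```javascript
// Sort on the fly while keeping at most `limit` documents in memory.
// Every input document is still touched, which is the cost noted above.
function scanAndOrder(docs, sortField, limit) {
  const buffer = []; // kept in ascending order of sortField
  let touched = 0;
  for (const doc of docs) {
    touched++;
    // Find the insertion point that keeps the buffer sorted.
    let i = buffer.length;
    while (i > 0 && buffer[i - 1][sortField] > doc[sortField]) i--;
    buffer.splice(i, 0, doc);
    // With a limit, drop the largest element so the buffer never grows past it.
    if (limit > 0 && buffer.length > limit) buffer.pop();
  }
  return { results: buffer, touched };
}

// Made-up documents for illustration only.
const sorted = scanAndOrder([{ y: 5 }, { y: 1 }, { y: 3 }, { y: 2 }], 'y', 2);
```

The buffer never holds more than `limit` documents at the end of a step, while `touched` still equals the number of input documents.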
90. Negation - $ne, $nin, $not, etc. With current semantics, all multikey elements must match negation constraints. Multikey de-duplication works without loading the full document.
91. Covered Indexes db.c.find( {x:6}, {x:1,_id:0} ) Index {x:1} Id would be returned by default, but isn’t in the index so we need to exclude to return only indexed fields.
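The reasoning about excluding _id can be sketched as a small check. This is a simplified model under stated assumptions: `isIndexOnly` is a hypothetical helper, it treats the query and projection as flat field lists, and it ignores multikey indexes:

```javascript
// Returns true when every queried field and every returned field is in the index.
function isIndexOnly(query, projection, indexKeyPattern) {
  const indexed = new Set(Object.keys(indexKeyPattern));
  const included = Object.keys(projection).filter((f) => projection[f]);
  // No inclusion list means the whole document is returned.
  if (included.length === 0) return false;
  // _id comes back by default unless the projection excludes it.
  const idExcluded = projection._id === 0 || projection._id === false;
  const returned = new Set(included);
  if (!idExcluded) returned.add('_id');
  const needed = [...Object.keys(query), ...returned];
  return needed.every((f) => indexed.has(f));
}

const covered = isIndexOnly({ x: 6 }, { x: 1, _id: 0 }, { x: 1 });
const notCovered = isIndexOnly({ x: 6 }, { x: 1 }, { x: 1 }); // _id is returned implicitly
```

In this model, dropping `_id:0` from the projection is enough to lose the index-only result.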
97. Covered Indexes "isMultiKey" : true, "indexOnly" : false, Currently we set isMultiKey to true the first time we save a doc where the field is a multikey array. But when all multikey docs are removed we don’t reset isMultiKey. This can be improved.
164. Disjoint $or Criteria [Diagram: x index keys and the y values of their documents, with the x:5 document checked off.] We have already scanned the x index for x:5, so this document was returned already. We don’t return it again.
165. Unindexed $or Clause db.c.find( {$or:[{x:5},{y:'d'}]} ) Index {x:1} (no index on y)
166. Unindexed $or Clause > db.c.find( {$or:[{x:5},{y:'d'}]} ).explain() { "cursor" : "BasicCursor", "nscanned" : 9, "nscannedObjects" : 9, "n" : 3, "millis" : 0, "nYields" : 0, "nChunkSkips" : 0, "isMultiKey" : false, "indexOnly" : false, "indexBounds" : { } } Since y is not indexed, we must do a full collection scan to match y:’d’. Since a full scan is required, we don’t use the index on x to match x:5.
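The decision described above can be sketched as a simple check. This is a loose illustration: `orNeedsCollectionScan` is a hypothetical name, and the sketch assumes single-field indexes and flat clauses:

```javascript
// A $or query falls back to a full collection scan as soon as one clause
// has no field that an index can serve.
function orNeedsCollectionScan(clauses, indexedFields) {
  const indexed = new Set(indexedFields);
  return clauses.some(
    (clause) => !Object.keys(clause).some((field) => indexed.has(field))
  );
}

const unindexedClause = orNeedsCollectionScan([{ x: 5 }, { y: 'd' }], ['x']);
const allIndexed = orNeedsCollectionScan([{ x: 5 }, { x: 7 }], ['x']);
```

With only {x:1} available, the y:'d' clause forces the scan, so the x index brings no benefit for the x:5 clause either.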
169. Eliminated $or Clause > db.c.find( {$or:[{x:{$gt:2,$lt:6}},{x:5}]} ).explain() { "cursor" : "BtreeCursor x_1", "nscanned" : 3, "nscannedObjects" : 3, "n" : 3, "millis" : 0, "nYields" : 0, "nChunkSkips" : 0, "isMultiKey" : false, "indexOnly" : false, "indexBounds" : { "x" : [ [ 2, 6 ] ] } } The index range of the second clause is included in the index range of the first clause, so we use the first index range only.
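A minimal sketch of clause elimination by range containment, assuming one-dimensional ranges written as {lo, hi}. For brevity it treats every bound as inclusive, which a real implementation would not; the function names are hypothetical:

```javascript
// True when `inner` lies entirely within `outer` (bounds treated as inclusive).
function rangeContains(outer, inner) {
  return outer.lo <= inner.lo && inner.hi <= outer.hi;
}

// Keep a clause's range only if no earlier kept range already contains it.
function eliminateContainedRanges(ranges) {
  const kept = [];
  for (const range of ranges) {
    if (!kept.some((earlier) => rangeContains(earlier, range))) kept.push(range);
  }
  return kept;
}

// Models {$or:[{x:{$gt:2,$lt:6}},{x:5}]}: the second range [5, 5] is dropped.
const remaining = eliminateContainedRanges([{ lo: 2, hi: 6 }, { lo: 5, hi: 5 }]);
```

Only the first range survives, matching the single [ 2, 6 ] entry in the indexBounds above.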
170. Eliminated $or Clause with Differing Unindexed Criteria db.c.find( {$or:[{x:{$gt:2,$lt:6},y:'c'},{x:5,y:'d'}]} ) Index {x:1}
171. Eliminated $or Clause with Differing Unindexed Criteria [Diagram: the x index keys and y values shown once per clause: 2 < x < 6 and y = c; x = 5 and y = d.]
173. Eliminated $or Clause with Differing Unindexed Criteria [Diagram: a single scan of the x index over 2 < x < 6, with y matched against c and d.] The index range for the first clause contains the index range for the second clause, so all matching is done using the index range for the first clause.
174. Overlapping $or Clauses db.c.find( {$or:[{x:{$gt:2,$lt:6}},{x:{$gt:4,$lt:7}}]} ) Index {x:1,y:1}
184. 2D Overlapping $or Clauses { "cursor" : "BtreeCursor x_1_y_1", "nscanned" : 0, "nscannedObjects" : 0, "n" : 0, "millis" : 1, "nYields" : 0, "nChunkSkips" : 0, "isMultiKey" : false, "indexOnly" : false, "indexBounds" : { "x" : [ [ 6, 7 ] ], "y" : [ [ "b", "e" ] ] } } ], The index range scanned for the previous clause is removed.
185. 2D Overlapping $or Clauses [Diagram: Clause 1 and Clause 2 drawn as overlapping boxes in the x-y plane; x axis marked at 2, 6, 7 and y axis at b, e, f.] We only have to scan the remainder here.
186. Overlapping $or Clauses Rule of thumb for n dimensions: We subtract earlier clause boxes from the current box when the result is one or more boxes. [Diagram: arrangements of boxes 1 and 2 where the subtraction applies (✓).]
187. Overlapping $or Clauses Rule of thumb for n dimensions: We subtract earlier clause boxes from the current box when the result is one or more boxes. [Diagram: an arrangement of boxes 1 and 2 where the subtraction does not apply (✗).]
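One possible reading of this rule of thumb, sketched under assumptions: a box is an array of [lo, hi] intervals, one per index field, bounds are treated as inclusive, and subtraction happens only when the earlier box covers the current box in every dimension but one. The function `subtractBox` is hypothetical and may differ from what the server does:

```javascript
// Subtract `earlier` from `current`. Returns the list of boxes left to scan.
function subtractBox(current, earlier) {
  const uncovered = [];
  for (let d = 0; d < current.length; d++) {
    const [cLo, cHi] = current[d];
    const [eLo, eHi] = earlier[d];
    if (!(eLo <= cLo && cHi <= eHi)) uncovered.push(d);
  }
  if (uncovered.length === 0) return [];      // fully covered: nothing left to scan
  if (uncovered.length > 1) return [current]; // remainder is not a box: scan it all
  const d = uncovered[0];
  const [cLo, cHi] = current[d];
  const [eLo, eHi] = earlier[d];
  if (eHi < cLo || eLo > cHi) return [current]; // no overlap in this dimension
  const pieces = [];
  if (cLo < eLo) pieces.push([cLo, eLo]); // part below the earlier box
  if (eHi < cHi) pieces.push([eHi, cHi]); // part above the earlier box
  // Each remaining piece keeps the current box's bounds in all other dimensions.
  return pieces.map((piece) => current.map((iv, i) => (i === d ? piece : iv)));
}

// Made-up numeric boxes: x overlaps partially, y is fully covered.
const remainder = subtractBox([[4, 7], [1, 4]], [[2, 6], [1, 5]]);
// x and y both stick out of the earlier box: no subtraction is attempted.
const untouched = subtractBox([[4, 7], [1, 4]], [[2, 6], [2, 3]]);
```

In the first call only the strip with x in [6, 7] is left, which mirrors the reduced x bounds in the 2D explain output earlier.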
188. $or TODO: Use indexes on $or fields to satisfy a sort specification (SERVER-1205). Use full query optimizer to select $or clause indexes in getMore (SERVER-1215). Improve index range elimination (handling some cases where remainder is not a box).
190. Optimal Index find( {x:5} ): Index {x:1} or Index {x:1,y:1}. find( {x:5} ).sort( {y:1} ): Index {x:1,y:1}. find( {} ).sort( {x:1} ): Index {x:1}. find( {x:{$gt:1,$lt:7}} ).sort( {x:1} ): Index {x:1}.
191. Optimal Index Rule of Thumb: No scanAndOrder. All fields with index-useful constraints are indexed. If there is a range or sort, it is the last field of the index used to resolve the query. If multiple optimal indexes exist, one is chosen arbitrarily.
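The rule of thumb can be turned into a rough checker. This is one interpretation only: `isOptimalIndex` is a hypothetical helper, the query is described by hand as equality fields plus at most one range field and one sort field, and index direction is ignored:

```javascript
// indexFields: index key names in order, e.g. ['x', 'y'] for {x:1,y:1}.
// query: { equality: [...fields], range: field or undefined, sort: field or undefined }
function isOptimalIndex(indexFields, query) {
  const equality = new Set(query.equality || []);
  const range = query.range || null;
  // Sorting on a field fixed by equality needs no ordering from the index.
  const sort = query.sort && !equality.has(query.sort) ? query.sort : null;
  // Only one field can be "last", so a range and a sort on different fields fail.
  if (range && sort && range !== sort) return false;
  // Consume the leading index fields that carry equality constraints.
  let i = 0;
  const used = new Set();
  while (i < indexFields.length && equality.has(indexFields[i])) {
    used.add(indexFields[i]);
    i++;
  }
  // The range or sort field, if any, must come right after them.
  const tail = range || sort;
  if (tail) {
    if (indexFields[i] !== tail) return false;
    used.add(tail);
  }
  // Every equality-constrained field must have been covered by the index.
  return [...equality].every((f) => used.has(f));
}

const sortOnSecondField = isOptimalIndex(['x', 'y'], { equality: ['x'], sort: 'y' });
const needsScanAndOrder = isOptimalIndex(['x'], { equality: ['x'], sort: 'y' });
```

Under this reading, {x:1,y:1} is optimal for find( {x:5} ).sort( {y:1} ), while {x:1} alone is not because the sort would need scanAndOrder.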
192. Optimal Index These same criteria are useful when you are designing your indexes.
193. Multiple Candidate Indexes find( {x:4,y:'a'} ): Index {x:1} or {y:1}? find( {x:4} ).sort( {y:1} ): Index {x:1} or {y:1}? Note: {x:1,y:1} is optimal. find( {x:{$gt:2,$lt:7},y:{$gt:'a',$lt:'f'}} ): Index {x:1,y:1} or {y:1,x:1}?
194. Multiple Candidate Indexes The only index selection criterion is nscanned. find( {x:4,y:'a'} ): Index {x:1} or {y:1}? If fewer documents match {y:'a'} than {x:4}, then nscanned for {y:1} will be lower, so we pick {y:1}. find( {x:{$gt:2,$lt:7},y:{$gt:'b',$lt:'f'}} ): Index {x:1,y:1} or {y:1,x:1}? If there are fewer distinct values of 2 < x < 7 than distinct values of 'b' < y < 'f', then {x:1,y:1} is chosen (rule of thumb).
195. Multiple Candidate Indexes The only index selection criterion is nscanned. Pretty good, but doesn’t cover every case, e.g.: cost of scanAndOrder vs. ordered index; cost of loading full document vs. just index key; cost of scanning adjacent btree keys vs. non-adjacent keys/documents.
196. Competing Indexes At most one query plan per index. Plans run in interleaved fashion. Plans are kept in a priority queue ordered by nscanned. We always continue progress on the plan with the lowest nscanned.
197. Competing Indexes Run until one plan returns all results or enough results to satisfy the initial query request (based on soft limit spec / data size requirement for initial query). We only allow plans to compete in initial query. In getMore, we continue reading from the index cursor established by the initial query.
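The interleaving of competing plans can be simulated in a few lines. This is a toy model: each plan is a hypothetical list of booleans saying whether the next scanned key yields a match, a linear search stands in for the priority queue, and `racePlans` expects at least one plan:

```javascript
// plans: [{ name, steps: [true/false per scanned key] }]
// wanted: number of results that satisfies the initial query request
function racePlans(plans, wanted) {
  const state = plans.map((p) => ({ name: p.name, steps: p.steps, nscanned: 0, n: 0 }));
  for (;;) {
    // Always advance the plan with the lowest nscanned (ties go to the first listed).
    let current = state[0];
    for (const s of state) if (s.nscanned < current.nscanned) current = s;
    if (current.nscanned < current.steps.length) {
      if (current.steps[current.nscanned]) current.n++;
      current.nscanned++;
    }
    // A plan wins once it has returned all its results or enough results.
    if (current.nscanned === current.steps.length || current.n >= wanted) {
      return { name: current.name, nscanned: current.nscanned, n: current.n };
    }
  }
}

// Made-up plans: x_1 matches on every key, y_1 matches only occasionally.
const winner = racePlans(
  [
    { name: 'x_1', steps: [true, true, true] },
    { name: 'y_1', steps: [false, false, true, false, false, true, false, true] },
  ],
  3
);
```

Because the plans advance in lockstep on nscanned, the plan that finds matches with fewer scanned keys finishes first.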
198. “Learning” a Query Plan When an index is chosen for a query, the query’s “pattern” and nscanned are recorded. find( {x:3,y:'c'} ): {Pattern: {x:'equality', y:'equality'}, Index: {x:1}, nscanned: 50}. find( {x:{$gt:5},y:{$lt:'z'}} ): {Pattern: {x:'gt bound', y:'lt bound'}, Index: {y:1}, nscanned: 500}.
199. “Learning” a Query Plan When a new query matches the same pattern, the same query plan is used. find( {x:5,y:'z'} ): use index {x:1}. find( {x:{$gt:20},y:{$lt:'b'}} ): use index {y:1}.
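A small sketch of pattern-based plan reuse, assuming patterns are built per field from the constraint type only. The labels 'equality', 'gt bound' and 'lt bound' come from the slides; the 'gt and lt bound' and 'other' labels, the treatment of $gte/$lte as bounds, and all function names are assumptions:

```javascript
// Reduce a query to its pattern: constraint types only, never the values.
function queryPattern(query) {
  const pattern = {};
  for (const field of Object.keys(query).sort()) {
    const constraint = query[field];
    if (constraint === null || typeof constraint !== 'object') {
      pattern[field] = 'equality';
      continue;
    }
    const hasGt = '$gt' in constraint || '$gte' in constraint;
    const hasLt = '$lt' in constraint || '$lte' in constraint;
    pattern[field] =
      hasGt && hasLt ? 'gt and lt bound' : hasGt ? 'gt bound' : hasLt ? 'lt bound' : 'other';
  }
  return JSON.stringify(pattern);
}

const planCache = new Map();

// Remember which index won for this pattern, and at what nscanned.
function recordPlan(query, index, nscanned) {
  planCache.set(queryPattern(query), { index, nscanned });
}

// Look up the remembered plan for any query with the same pattern.
function lookupPlan(query) {
  return planCache.get(queryPattern(query));
}

recordPlan({ x: 3, y: 'c' }, { x: 1 }, 50);
const reused = lookupPlan({ x: 5, y: 'z' });              // same pattern, different values
const unseen = lookupPlan({ x: { $gt: 20 }, y: { $lt: 'b' } }); // pattern not recorded yet
```

Two queries share a plan whenever their patterns serialize to the same string, regardless of the values being compared.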
201. Bad Plan Insurance If nscanned for a new query using a recorded plan is much worse than the recorded nscanned for an earlier query with the same pattern, we start interleaving other plans with the current plan. Currently “much worse” means 10x.
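The 10x threshold can be written as a one-line check. The function name is hypothetical, and whether the real comparison is strict is an assumption made here:

```javascript
// Factor from the slide: "much worse" currently means 10x the recorded nscanned.
const MUCH_WORSE_FACTOR = 10;

// True when the recorded plan is doing badly enough that other plans
// should start being interleaved with it (strict comparison assumed).
function shouldInterleaveOtherPlans(recordedNscanned, currentNscanned) {
  return currentNscanned > MUCH_WORSE_FACTOR * recordedNscanned;
}

const stillFine = shouldInterleaveOtherPlans(50, 60);
const muchWorse = shouldInterleaveOtherPlans(50, 501);
```

With a recorded nscanned of 50, the current query would keep its plan until it has scanned more than 500 keys.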
202. Query Planner Ad hoc heuristics in some cases. They seem to work decently in practice.
203. Feedback Large and small scale optimizer features are generally prioritized based on user input. Please use jira to request new features and vote on existing feature requests.
204. Thanks! Feature requests: jira.mongodb.org; Support: groups.google.com/group/mongodb-user; Next up: Sharding Details with Eliot