Presentation on secondary indexes from the 9/11/12 HBase Contributor's Meetup. It covers the current state of the discussion and some possible future directions.
3. Problem
• HBase rows are multi-dimensional
– Only sorted on the row key
• How do you efficiently look up deeper into the row key?
4. Example
Row  Family  Qualifier  Timestamp  Value
1    Name    First      0          Babe
1    Name    Last       0          Ruth
How do we find all people with the last name ‘Ruth’?
Full table scan!
5. Indexing!
Row   Family  Qualifier  Timestamp  Value
Ruth  Name    Last       0          1
Store the property we need to search for as the primary key
• pointer back to the primary row
• fast lookup - O(lg(n))
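The difference between the full scan and the index lookup can be sketched in memory. This is a minimal illustration, not HBase client code: the tables are plain Python structures, the second person is made-up sample data, and putting the primary row into the index key (so two people named ‘Ruth’ do not collide) is an assumption that goes beyond the slide.

```python
import bisect

# Primary table: (row, family, qualifier) -> value. Row 1 mirrors the slide;
# row 2 is hypothetical sample data added for illustration.
data_table = {
    ("1", "Name", "First"): "Babe",
    ("1", "Name", "Last"): "Ruth",
    ("2", "Name", "First"): "Hank",
    ("2", "Name", "Last"): "Aaron",
}

def full_scan_by_last_name(last_name):
    """Without an index: examine every cell, O(n)."""
    return sorted(
        row
        for (row, family, qualifier), value in data_table.items()
        if family == "Name" and qualifier == "Last" and value == last_name
    )

# Index table: sorted (indexed value, primary row) keys, standing in for an
# HBase table whose row key starts with the indexed value.
index_table = sorted(
    (value, row)
    for (row, family, qualifier), value in data_table.items()
    if family == "Name" and qualifier == "Last"
)

def index_lookup_by_last_name(last_name):
    """With an index: binary search to the first match, O(lg n), then read matches."""
    i = bisect.bisect_left(index_table, (last_name, ""))
    rows = []
    while i < len(index_table) and index_table[i][0] == last_name:
        rows.append(index_table[i][1])  # pointer back to the primary row
        i += 1
    return rows
```

Both functions return the same rows; only the amount of data touched differs.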
6. Use Cases
• Point lookups
– Volume of data influences usefulness of index
• Let user decide if they need to use an index
• Scan lookup
– WHERE age > 16
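A scan lookup such as WHERE age > 16 only works if the index keys sort in the same order as the values. A rough sketch, assuming ages are non-negative integers encoded as fixed-width big-endian bytes; the rows and the helper names are hypothetical:

```python
import bisect
import struct

def encode_age(age):
    """Encode a non-negative int as 4 big-endian bytes so byte order matches numeric order."""
    return struct.pack(">I", age)

# Hypothetical people: primary row -> age.
ages = {"row-a": 12, "row-b": 16, "row-c": 17, "row-d": 42, "row-e": 9}

# Index keys are (encoded age, primary row), kept sorted like row keys in a table.
age_index = sorted((encode_age(age), row) for row, age in ages.items())

def rows_with_age_greater_than(threshold):
    """Range scan: seek to the first key past the threshold, then read to the end."""
    start = bisect.bisect_left(age_index, (encode_age(threshold + 1), ""))
    return [row for _, row in age_index[start:]]
```

Storing ages as decimal strings would not work here, because "9" sorts after "16" lexicographically.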
16. Built-in vs. external library vs. semi-supported (e.g. security)
17. Which should I use??
• HBase experts write a single ‘right’ impl
• Officially endorse a ‘correct’ version
• What changes do we need to make
• How close to the core is the project
– Written into the core everywhere
– hbase-index module
– External library
19. Key Observation
“Secondary indexing is inherently an easier
problem than full transactions… secondary
index updates are idempotent.”
- Lars Hofhansl
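Idempotence can be illustrated with a small sketch: replaying the same index write leaves the index unchanged, so a failed write can simply be retried. This assumes the retry carries the same explicit timestamp as the original write; the function name and in-memory dict are hypothetical.

```python
def apply_index_put(index, indexed_value, primary_row, timestamp):
    """Write one index cell keyed by (value, row, timestamp).

    A replay with the same arguments overwrites the same cell with the same
    content, so applying it once or many times gives the same index.
    """
    index[(indexed_value, primary_row, timestamp)] = primary_row
    return index

index = {}
apply_index_put(index, "Ruth", "1", 0)
after_first = dict(index)

# Retry of the same update, e.g. after a client-side timeout.
apply_index_put(index, "Ruth", "1", 0)
```

No coordination between the primary write and the index write is needed for the retry to be safe, which is why full transactions are more than the problem requires.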
20. Async vs. Synchronous vs. Transactional
• We don’t need full transactions
– Transactions are slow
– Transactions fail with increasing probability as
number of servers increases
• Optionally async or sync
– Async
• Inherently ‘dirty’ index
• How does index cleanup work?
– Inherently different for each type
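One possible answer to the cleanup question, for a simple value-to-row index, is to verify index hits against the primary row at read time and drop the stale ones. The sketch below is only one option under that assumption; the queue, the table structures, and the function names are all hypothetical.

```python
from collections import deque

data_table = {}      # primary row -> last name
index_table = set()  # (last name, primary row)
pending = deque()    # index updates not yet applied (the async queue)

def put(row, last_name):
    """Write the primary row immediately; queue the index update for later."""
    data_table[row] = last_name
    pending.append((last_name, row))

def flush_index():
    """Apply queued index updates. Entries for overwritten values are left behind."""
    while pending:
        index_table.add(pending.popleft())

def lookup(last_name):
    """Read via the index, verify each hit against the primary row, drop stale hits."""
    # Linear scan over the set keeps the sketch short; a real index is sorted.
    hits = sorted(row for value, row in index_table if value == last_name)
    live = []
    for row in hits:
        if data_table.get(row) == last_name:
            live.append(row)
        else:
            index_table.discard((last_name, row))  # lazy cleanup of a dirty entry
    return live
```

Between put and flush_index the index lags the data, and after an overwrite it holds a stale entry until a read removes it. That is the ‘dirty’ window the slide refers to.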
22. Where’s my data?
• Extra columns vs. index table
• HBase Region-pinning
– Has to be best-effort or will decrease availability
– Helps minimize RPC overhead
– Cross-table region-pinning
– Needs a coprocessor hook to be useful
• HDFS block allocation
– Keep index and data blocks on same HDFS node
24. How much data are we talking?
“Seems like there are 3 categories of sparseness:
1. sparse indexes (like ipAddress) where a per-table approach is more efficient for reads
2. dense indexes (like eventType) where there are likely values of every index key on each region
3. very dense indexes (like male/female) where you should just be doing a table scan anyway”
- Matt Corgan (9/10/12)
25. Impact on implementation
• Need a lot of knowledge of data to pick the
right kind of index
– User knows their data, let them do the hard work
of picking indexes
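To make the three sparseness categories concrete, here is a rough heuristic a user might run over a sample of their own data. The 0.25 cutoff and the function name are arbitrary assumptions for illustration, not recommendations from the talk.

```python
def classify_index(values_by_region, scan_threshold=0.25):
    """Classify an indexed column as 'sparse', 'dense', or 'very dense'.

    values_by_region: one list of column values per region.
    scan_threshold: hypothetical cutoff. If an average key matches at least
    this fraction of all rows, a table scan is assumed to be just as good.
    """
    distinct = {value for region in values_by_region for value in region}
    if not distinct:
        raise ValueError("no values to classify")

    # Average share of all rows matched by one key is 1 / (number of keys).
    if 1.0 / len(distinct) >= scan_threshold:
        return "very dense"

    # Dense: every key shows up in every region.
    in_every_region = all(
        all(value in region for region in values_by_region) for value in distinct
    )
    return "dense" if in_every_region else "sparse"
```

For example, a male/female column comes out as "very dense", while a column of unique IP addresses comes out as "sparse".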
27. Everyone’s got an impl already
• We need to make HBase flexible enough to
support (most) current indexing formats with
minimal overhead for switching
– Lucene style Codec/CodecProvider?
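A pluggable codec layer might look roughly like the sketch below, loosely in the spirit of the Codec/CodecProvider idea. None of these class or method names are HBase or Lucene APIs; they are hypothetical and only show how different index key formats could sit behind one interface.

```python
from abc import ABC, abstractmethod

class IndexCodec(ABC):
    """Hypothetical interface: maps a primary-table cell to an index row key."""

    @abstractmethod
    def index_key(self, row, value):
        ...

class ValueFirstCodec(IndexCodec):
    """Global index format: indexed value first, then the primary row."""

    def index_key(self, row, value):
        return f"{value}\x00{row}"

class RegionLocalCodec(IndexCodec):
    """Locality-aware format: prefix with the data region's start key."""

    def __init__(self, region_start_key):
        self.region_start_key = region_start_key

    def index_key(self, row, value):
        return f"{self.region_start_key}\x00{value}\x00{row}"

class CodecProvider:
    """Hypothetical registry so an index format can be chosen by name."""

    def __init__(self):
        self._codecs = {}

    def register(self, name, codec):
        self._codecs[name] = codec

    def lookup(self, name):
        return self._codecs[name]

provider = CodecProvider()
provider.register("value-first", ValueFirstCodec())
provider.register("region-local", RegionLocalCodec("region-A"))
```

Switching index formats then means changing the name passed to lookup rather than changing the code that writes the index.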
29. What should it look like?
• Minimal changes to the top-level interfaces
– Add a single new flag?
– Configuration based?
• Enough that the user gets to be smart about
what should be used
– We can’t get all cases right – just provide building
blocks
• Automatically use an index?
• Scanner/Filter style use?
30. Properties for the client
• Should the user even see the index lookups?
• ACID?
• Ordering of results?
– Support the current sorted order?
– Batch lookup?
• Implications on current features
– Replication
– splitting
31. Schema(less)
• Schema enforced?
– Rigid usage of index matching an expected schema?
– Schema table? Reserved schema columns? .META.?
• Schema-less
– Let the user apply whatever they think and use only
what actually works
• Best-effort
– Use client-hinted schema and try to apply all the
known indexes
32. My random thoughts….
• Client-side managed indexes are efficient
– Minimal RPC overhead
• Cleanup is async to client and rarely misses
– Solves the cross-region/server problem
• Region-pinning is a nice-to-have optimization
– Scales without concern for locality
– Flexible enough to support custom codecs
– Can be built to provide server-side optimizations
• Locality aware indexes to minimize RPCs