Jesse Yates
 Salesforce.com




                  Secondary Indexing

                   the discussion so far….




9/11/12                                      HBase Pow-wow
What is it?
Problem
• HBase rows are multi-dimensional
  – Only sorted on the row key


• How do you efficiently look up data by anything
  deeper in the row than the row key?
Example
 Row        Family       Qualifier   Timestamp   Value
 1          Name         First       0           Babe
 1          Name         Last        0           Ruth




How do we find all people with the last name ‘Ruth’?


                     Full table scan!
Indexing!
Row       Family    Qualifier   Timestamp   Value
Ruth      Name      Last        0           1




  Store the property we need to search
  for as the row key of the index entry
  • value is a pointer back to the primary row (here, row 1)
  • fast lookup - O(lg(n))
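
The index layout above can be sketched in a few lines. This is a minimal in-memory illustration, not HBase client code: the tables are plain Python structures, the extra rows are made-up toy data, and appending the primary row key to the index key (so several people can share a last name) is an assumption that goes beyond the single-row example on the slide.

```python
import bisect

# Primary table: row key -> {(family, qualifier): value}. Toy data, hypothetical.
primary = {
    "1": {("Name", "First"): "Babe", ("Name", "Last"): "Ruth"},
    "2": {("Name", "First"): "Hank", ("Name", "Last"): "Aaron"},
    "3": {("Name", "First"): "Dutch", ("Name", "Last"): "Ruth"},
}

SEP = "\x00"  # assumed separator; must never occur inside an indexed value


def build_index(table, column):
    """Return sorted index keys: indexed value, separator, primary row key."""
    keys = [row[column] + SEP + row_key
            for row_key, row in table.items() if column in row]
    keys.sort()  # HBase keeps rows sorted by key; a sorted list stands in here
    return keys


def lookup(index, value):
    """Binary-search to the first key for this value, then read its run."""
    prefix = value + SEP
    i = bisect.bisect_left(index, prefix)  # the O(lg(n)) seek
    rows = []
    while i < len(index) and index[i].startswith(prefix):
        rows.append(index[i][len(prefix):])  # pointer back to the primary row
        i += 1
    return rows


last_name_index = build_index(primary, ("Name", "Last"))
print(lookup(last_name_index, "Ruth"))  # ['1', '3']
```

The seek replaces the full table scan; the cost that remains is one extra read per match to fetch the primary row.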
Use Cases
• Point lookups
  – Volume of data influences usefulness of index
     • Let user decide if they need to use an index


• Scan lookup
  – WHERE age > 16
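
A scan lookup such as WHERE age > 16 only works if the index keys sort in the same order as the values. The sketch below assumes a fixed-width big-endian encoding for that purpose; the key layout, row names, and helper functions are hypothetical and the sorted list again stands in for an HBase table.

```python
import bisect
import struct


def encode_age(age):
    """Fixed-width big-endian bytes, so byte order matches numeric order."""
    return struct.pack(">I", age)


# Hypothetical people: primary row key -> age.
ages = {b"row-a": 9, b"row-b": 16, b"row-c": 17, b"row-d": 42, b"row-e": 100}

# Index keys are <encoded age><primary row key>, kept in sorted order.
index = sorted(encode_age(age) + row for row, age in ages.items())


def scan_age_greater_than(index, threshold):
    """Seek to the first key with age > threshold, then read to the end."""
    start = bisect.bisect_left(index, encode_age(threshold + 1))
    return [key[4:] for key in index[start:]]  # strip the 4-byte age prefix


print(scan_age_greater_than(index, 16))  # [b'row-c', b'row-d', b'row-e']
```

Storing ages as decimal strings would break this scan, because "100" sorts before "16" in byte order.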
Implementations
Omid

Full transactional support
    Centralized oracle
Lily

WAL implementation on top of HBase
        100-500 writes/sec
Percolator

       Full transactions
Distributed, optimistic locking
  ~10 sec latencies possible
Culvert

         Async
Dead project, incomplete
http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html
       Client-side coordinated index
       Use timestamps to coordinate
           Not yet implemented
Trend Micro Implementation

          Still just POC
                 ???
Solr/Lucene

Standard Lucene library bolted on HBase
           Not commonly used
 Lots of formats/codecs already written
Considerations for HBase

    What do we need to do?
Built-in vs. external library vs. semi-supported (e.g. security)
Which should I use??
•   HBase experts write a single ‘right’ impl
•   Officially endorse a ‘correct’ version
•   What changes do we need to make?
•   How close to the core is the project?
    – Written into the core everywhere
    – hbase-index module
    – External library
Async vs. Synchronous vs. Transactional
Key Observation
“Secondary indexing is inherently an easier
  problem than full transactions… secondary
  index updates are idempotent.”

        - Lars Hofhansl
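
One way to read the idempotency remark: an index entry is fully determined by the indexed value, the primary row, and the timestamp, so replaying the same write leaves the index in the same state. The sketch below models the index as a dictionary keyed by those three parts; that model and the function name are assumptions for illustration only.

```python
def apply_index_put(index, value, primary_row, timestamp):
    """Write one index cell; rewriting the same cell changes nothing."""
    index[(value, primary_row, timestamp)] = b""
    return index


index = {}
apply_index_put(index, "Ruth", "1", 0)
snapshot = dict(index)

# A client that times out cannot tell whether its write landed, so it retries.
apply_index_put(index, "Ruth", "1", 0)

print(index == snapshot)  # True
```

Because a retry is harmless, a failed index update can simply be replayed without the rollback machinery a general transaction needs. An operation like a counter increment would not have this property.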
Async vs. Synchronous vs. Transactional

• We don’t need full transactions
  – Transactions are slow
  – Transactions fail with increasing probability as
    number of servers increases
• Optionally async or sync
  – Async
     • Inherently ‘dirty’ index
• How does index cleanup work?
  – Inherently different for each type
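
For the async case, one possible cleanup strategy is to verify each index hit against the primary row at read time and drop entries that no longer match. This is a sketch of that one strategy under simplified assumptions (dictionaries instead of tables, hypothetical function and column names); it is not a description of how any of the listed implementations clean up.

```python
def read_via_index(index, primary, column, value):
    """Return matching primary row keys and drop index entries that are stale."""
    matches, stale = [], []
    for row_key in sorted(index.get(value, set())):
        if primary.get(row_key, {}).get(column) == value:
            matches.append(row_key)
        else:
            stale.append(row_key)  # primary row was changed or deleted
    for row_key in stale:
        index[value].discard(row_key)  # lazy cleanup on the read path
    return matches


primary = {"1": {"last": "Ruth"}, "2": {"last": "Gehrig"}}
# Row 2 used to be 'Ruth'; the async update has not yet removed the old entry.
index = {"Ruth": {"1", "2"}, "Gehrig": {"2"}}

print(read_via_index(index, primary, "last", "Ruth"))  # ['1']
print(index["Ruth"])                                   # {'1'}
```

The trade-off is that every index hit costs a read of the primary row, which is the price of tolerating a dirty index.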
Locality
Where’s my data?
• Extra columns vs. index table
• HBase Region-pinning
  –   Has to be best-effort or will decrease availability
  –   Helps minimize RPC overhead
  –   Cross-table region-pinning
  –   Needs a coprocessor hook to be useful


• HDFS block allocation
  – Keep index and data blocks on same HDFS node
Index Cardinality
How much data are we talking?
“Seems like there are 3 categories of sparseness:
1. sparse indexes (like ipAddress) where a per-table
   approach is more efficient for reads

2. dense indexes (like eventType) where there are likely
   values of every index key on each region

3. very dense indexes (like male/female) where you
   should just be doing a table scan anyway”

                      - Matt Corgan (9/10/12)
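
The three categories can be approximated from a sample of column values. The classifier below is a rough sketch: the `scan_fraction` cutoff and the comparison against the region count are illustrative thresholds chosen for this example, not numbers from the discussion, and the sample columns are synthetic.

```python
def classify_index(values, num_regions, scan_fraction=0.25):
    """Bucket a column by how many rows an average index key matches."""
    if not values:
        return "sparse"
    distinct = len(set(values))
    rows_per_key = len(values) / distinct
    if 1 / distinct >= scan_fraction:
        return "very dense"  # each key matches a large share of the table
    if rows_per_key >= num_regions:
        return "dense"       # each key plausibly has rows on every region
    return "sparse"


# Synthetic columns, 1000 rows each.
ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
events = [f"type{i % 10}" for i in range(1000)]
genders = ["male", "female"] * 500

for name, column in [("ipAddress", ips), ("eventType", events), ("gender", genders)]:
    print(name, classify_index(column, num_regions=20))
```

With these assumed thresholds the three sample columns land in the sparse, dense, and very dense buckets respectively.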
Impact on implementation
• Need a lot of knowledge of data to pick the
  right kind of index
  – User knows their data, let them do the hard work
    of picking indexes
Pluggability
Everyone’s got an impl already
• We need to make HBase flexible enough to
  support (most) current indexing formats with
  minimal overhead for switching
  – Lucene style Codec/CodecProvider?
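
A codec/provider split could look roughly like the sketch below. All class and method names here (`IndexCodec`, `CodecProvider`, `index_keys`) are hypothetical; they borrow only the general idea of a named, pluggable codec and are neither Lucene's nor HBase's API.

```python
from abc import ABC, abstractmethod


class IndexCodec(ABC):
    """Hypothetical interface: turn one primary cell into index row keys."""

    name = "abstract"

    @abstractmethod
    def index_keys(self, row_key, value):
        """Return the index row keys to write for this primary cell."""


class ExactMatchCodec(IndexCodec):
    name = "exact"

    def index_keys(self, row_key, value):
        return [value + b"\x00" + row_key]


class PrefixCodec(IndexCodec):
    name = "prefix3"

    def index_keys(self, row_key, value):
        return [value[:3] + b"\x00" + row_key]  # index only the first 3 bytes


class CodecProvider:
    """Registry that lets a table pick its index format by name."""

    def __init__(self):
        self._codecs = {}

    def register(self, codec):
        self._codecs[codec.name] = codec

    def lookup(self, name):
        return self._codecs[name]


provider = CodecProvider()
provider.register(ExactMatchCodec())
provider.register(PrefixCodec())

codec = provider.lookup("prefix3")
print(codec.index_keys(b"1", b"Ruth"))  # [b'Rut\x001']
```

Switching index formats then means changing one configured name rather than rewriting the write path, which is the low switching overhead the slide asks for.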
Client-interface
What should it look like?
• Minimal changes to the top-level interfaces
  – Add a single new flag?
  – Configuration based?
• Enough that the user gets to be smart about
  what should be used
  – We can’t get all cases right – just provide building
    blocks
• Automatically use an index?
• Scanner/Filter style use?
Properties for the client
• Should the user even see the index lookups?

• ACID?
• Ordering of results?
  – Support the current sorted order?
  – Batch lookup?

• Implications on current features
  – Replication
  – splitting
Schema(less)
• Schema enforced?
  – Rigid usage of index matching an expected schema?
  – Schema table? Reserved schema columns? .META.?
• Schema-less
  – Let the user apply whatever they think and use only
    what actually works
• Best-effort
  – Use client-hinted schema and try to apply all the
    known indexes
My random thoughts….
• Client-side managed indexes are efficient
  – Minimal RPC overhead
     • Cleanup is async to client and rarely misses
  – Solves the cross-region/server problem
     • Region-pinning is a nice-to-have optimization
  – Scales without concern for locality
  – Flexible enough to support custom codecs
  – Can be built to provide server-side optimizations
     • Locality aware indexes to minimize RPCs
Discussion!

Musings on Secondary Indexing in HBase
