Culvert
A secondary indexing framework for BigTable-style databases with HIVE integration

   Ed Kohlwey
   Cloud Computing Team
Session Agenda
•   Secondary Indexing
•   The Solution: Culvert
•   Culvert Design & Architecture
•   How It Works
•   API Examples
•   Where to Get It & Credits
Secondary Indexing
• General design pattern for inverted index
  – Maintain a map from each value to the locations of the
    records/documents that contain it
• Lots of different variations
  – Term partitioned index
  – Document partitioned index
• Solves problem of BigTable-style databases
  only having one primary key for records
Sample Inventory Application
  Foo Table
  RowID    contact:city   contact:phone    inventory:count   order:Apples
  Apples                                   5
  John     Springfield    (999)-888-7777                     3
  Pears                                    10

Sample Term-Partitioned Index Table
                      order:Apples Index
                      RowID
                      3 -> Dave
                      3 -> John
                      17 -> Paul
                      20 -> Sue
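
As a rough illustration of the term-partitioned index above, the sketch below keeps a sorted map from an indexed value to the row IDs that hold it. It is an in-memory stand-in only: the class name `TermIndexSketch` and its methods are hypothetical and are not part of Culvert's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a term-partitioned index table.
final class TermIndexSketch {
    // Sorted map from indexed value to the row IDs that contain it.
    private final TreeMap<Integer, List<String>> valueToRowIds = new TreeMap<>();

    // Record that the given row holds the given value in the indexed column.
    void put(String rowId, int value) {
        valueToRowIds.computeIfAbsent(value, v -> new ArrayList<>()).add(rowId);
    }

    // Row IDs whose value lies in [low, high], returned in value order.
    List<String> range(int low, int high) {
        List<String> rowIds = new ArrayList<>();
        for (List<String> ids : valueToRowIds.subMap(low, true, high, true).values()) {
            rowIds.addAll(ids);
        }
        return rowIds;
    }
}
```

Because the map is sorted by value, a range predicate becomes a single contiguous scan, which is the property a BigTable-style index table gets from its sorted row keys.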
Sample Inventory Application
        Foo Table
                    RowID                         contact: comments
                    John                          John likes apples.
                    Sue                            Sue likes pears.


  Sample Document-Partitioned Index Table
contact:comments Index

RowID      apples:John   john:John   likes:John   likes:Sue   pears:Sue   sue:Sue
0x178df    -             -           -
0x32da4                                           -           -           -
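
A minimal sketch of the document-partitioned layout above, assuming each document is assigned to one partition by hashing its row ID and that each partition stores sorted `term:rowId` column names. The class `DocPartitionedIndexSketch`, the hashing rule, and the tokenizer are illustrative assumptions, not Culvert code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a document-partitioned index.
final class DocPartitionedIndexSketch {
    // One sorted set of "term:rowId" column names per partition (index row).
    private final List<TreeSet<String>> partitions = new ArrayList<>();

    DocPartitionedIndexSketch(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new TreeSet<>());
        }
    }

    // All terms of one document land in the same partition.
    void index(String rowId, String text) {
        TreeSet<String> partition =
                partitions.get(Math.floorMod(rowId.hashCode(), partitions.size()));
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                partition.add(term + ":" + rowId);
            }
        }
    }

    // A term lookup has to consult every partition.
    List<String> lookup(String term) {
        String prefix = term.toLowerCase(Locale.ROOT) + ":";
        List<String> rowIds = new ArrayList<>();
        for (TreeSet<String> partition : partitions) {
            for (String column : partition.tailSet(prefix)) {
                if (!column.startsWith(prefix)) {
                    break; // past the contiguous run of columns for this term
                }
                rowIds.add(column.substring(prefix.length()));
            }
        }
        Collections.sort(rowIds);
        return rowIds;
    }
}
```

The trade-off this shows: writes touch a single partition per document, while reads fan out to all partitions, the reverse of the term-partitioned case.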
We found ourselves implementing
these ideas over and over for clients.

        Why not make a library?
Solution: Culvert
Requirements
• Support secondary indexing
• Support an analyst query environment
• Database Extensibility
   – There are actually a lot of BigTable implementations out
     there (HBase, Cassandra, proprietary)
• Internal Extensibility
   – There are lots of ways to index records
   – There are lots of ways to retrieve records
   – Separate retrieval operations from index
     implementation
What Culvert Does
• Indexing
• Interface for queries (Java and HIVE)
• Abstraction mechanism for multiple
  underlying databases
Culvert Design & Architecture
• Use sorted iterators to retrieve values
   – Lots of algorithms can be expressed as sorting (like
     people tend to do in Map/Reduce)
   – Optional “dumping” feature can provide parallelism
• Decorator design pattern is intuitive to interact
  with
• Allows streaming of results as they become
  available
• Uses Coprocessors to implement parallel
  operations
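
To make the "sorted iterators plus decorators" idea concrete, here is a small sketch of an iterator decorator that filters lazily and preserves the order of the iterator it wraps, so results can stream as they become available. `FilteringIterator` is a hypothetical name used for illustration and is not a Culvert class.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical decorator: wraps any iterator and yields only matching elements.
final class FilteringIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> keep;
    private T next;
    private boolean hasNext;

    FilteringIterator(Iterator<T> source, Predicate<T> keep) {
        this.source = source;
        this.keep = keep;
        advance();
    }

    // Pull from the source until a matching element is found or it is exhausted.
    private void advance() {
        hasNext = false;
        while (source.hasNext()) {
            T candidate = source.next();
            if (keep.test(candidate)) {
                next = candidate;
                hasNext = true;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return hasNext;
    }

    @Override
    public T next() {
        if (!hasNext) {
            throw new NoSuchElementException();
        }
        T result = next;
        advance();
        return result;
    }
}
```

Since the decorator is itself an `Iterator`, decorators can be stacked, and a sorted input stays sorted on the way out.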
Architecture Diagram
[Diagram: Java API and Hive at the top; Culvert Client-Side Operation (TableAdapter, Constraint, Client) in the middle; two Culvert Region-Side Operation boxes, each with LocalTableAdapter and RemoteOp, at the bottom]
Constraint Architecture
• Used to express query predicate operations
  – projection and selection (SELECT)
  – set operations (AND/OR)
  – joins
• Decoupled from Indices
  – Currently focused on term-partitioned indices
  – Future work includes expanding document-
    partitioned index functionality
Index Architecture
• Index is an abstract type
  – Defines how to store and use the index
• One index per column
  – Didn’t see a performance reason to index over
    multiple columns
  – Multiple indices complicate framework code
  – Map of “logical fields” was more easily maintained
    in the application
  – May evolve in the future
Index Architecture (cont.)
• One index table per index
  – Allows Index implementations to assume they
    don’t share the index table
  – Don’t need to worry about other Indices
    clobbering their table structure
  – Tables are assumed to be cheap
Table Adapters
• TableAdapter and LocalTableAdapter are
  abstraction mechanisms, roughly equivalent
  to HTable and HRegion
• RemoteOp is roughly equivalent to
  CoprocessorProtocol and is handled by
  TableAdapter and LocalTableAdapter
• Gives implementers fine-grained control over
  parallelism + table operations
Using Culvert With HIVE
• Why HIVE?
  – Already very popular
  – Take advantage of upstream advances
  – Good framework to “optimize later”
• Culvert implements a HIVE StorageHandler
  and PredicateHandler
• Facilitates analyst interaction with database
• Reduces the “SQL Gap”
HIVE Culvert Input Format
• Handles AND, >, < query predicates based on
  indices
• Each index can be broken up into fragments
  based on region start and end keys
  – We take the cross-product of each index’s regions
    to create input splits for AND
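
The cross-product step can be sketched as follows, assuming each index contributes a list of fragments (one per region) and each input split pairs one fragment from every index. `SplitSketch` and the fragment labels are hypothetical; the real input format works with region start and end keys.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one split per combination of index fragments.
final class SplitSketch {
    static <T> List<List<T>> crossProduct(List<List<T>> fragmentsPerIndex) {
        List<List<T>> splits = new ArrayList<>();
        splits.add(new ArrayList<>()); // start with a single empty combination
        for (List<T> fragments : fragmentsPerIndex) {
            List<List<T>> extended = new ArrayList<>();
            for (List<T> partial : splits) {
                for (T fragment : fragments) {
                    List<T> split = new ArrayList<>(partial);
                    split.add(fragment);
                    extended.add(split);
                }
            }
            splits = extended;
        }
        return splits;
    }
}
```

With two fragments for one index and three for another, this yields six splits, so the number of splits grows multiplicatively with the number of ANDed indices.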
How It Works

Overview of Indexing Operations
Indexing
• Indices are built via insertion operations on
  the client (i.e. Client.put(…))
• Whether a field is indexed is controlled by a
  configuration file
• In the future, will support indexing of arbitrary
  columns via Map/Reduce
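
A minimal sketch of index maintenance on insertion, assuming the set of indexed columns stands in for the configuration file and plain maps stand in for the primary and index tables. `IndexingPutSketch` is hypothetical and does not mirror how Culvert's `Client.put(…)` is implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical client-side put that also writes to a per-column index table.
final class IndexingPutSketch {
    // Stands in for the configuration file listing indexed columns.
    private final Set<String> indexedColumns;
    // Primary table: rowId -> column -> value.
    final Map<String, Map<String, String>> primary = new TreeMap<>();
    // One index table per indexed column: column -> value -> rowIds.
    final Map<String, TreeMap<String, TreeSet<String>>> indexTables = new HashMap<>();

    IndexingPutSketch(Set<String> indexedColumns) {
        this.indexedColumns = indexedColumns;
    }

    void put(String rowId, String column, String value) {
        primary.computeIfAbsent(rowId, r -> new TreeMap<>()).put(column, value);
        if (indexedColumns.contains(column)) {
            indexTables.computeIfAbsent(column, c -> new TreeMap<>())
                    .computeIfAbsent(value, v -> new TreeSet<>())
                    .add(rowId);
        }
    }
}
```

This sketch does not remove a stale index entry when a value is overwritten; a real implementation would have to handle that case.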
Retrieval
• Query API is exposed via HIVE and Java
  – HIVE API delegates to Java API
  – Java API is based on subclasses of Constraint
• Focused on providing parallel, real-time query
  execution
Walkthrough of Logical
Operations on Indices
Logical Operations on Indices
• Logical operations can be represented as a merge
  sort if we return the keys from the original table
  in sorted order
• Example: AND
orders:Apples Index             orders:Oranges Index
1 -> Dean                       4 -> Dean
3 -> Susan                      5 -> Susan
4 -> John                       5 -> Paul
8 -> Paul                       6 -> George
14 -> Renee                     12 -> Karen
33 -> Sheryl                    19 -> Tom
Apples <= 3 AND Oranges >= 5
• First query each index

orders:Apples Index          orders:Oranges Index
1 -> Dean                    4 -> Dean
3 -> Susan                   5 -> Susan
4 -> John                    5 -> Paul
8 -> Paul                    6 -> George
14 -> Renee                  12 -> Karen
33 -> Sheryl                 19 -> Tom

• Each query returns the matching index entries

1 -> Dean                    5 -> Susan
3 -> Susan                   5 -> Paul
                             6 -> George
                             12 -> Karen
                             19 -> Tom

• Then order results for each index
• Notice this happens on the region servers*

Dean                         George
Susan                        Karen
                             Paul
                             Susan
                             Tom

• Then merge the sorted results on the client
• Dean is lowest; Dean is not on the head of all the queues, so discard
• George is lowest; George is not on the head of all queues, so discard
• Continue…
• Susan is on the head of all the queues, return Susan ✔
• Tom is discarded, now we’re finished
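
The client-side merge walked through above can be sketched as an intersection over sorted queues. This is an illustrative version, assuming each queue yields row IDs in ascending order; `SortedAndSketch` is a hypothetical name, not Culvert's `And` constraint.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical merge-based AND over row IDs that arrive in sorted order.
final class SortedAndSketch {
    static List<String> and(List<Iterator<String>> sortedQueues) {
        List<String> matches = new ArrayList<>();
        int n = sortedQueues.size();
        if (n == 0) {
            return matches;
        }
        // Load the head of every queue; an empty queue means no matches at all.
        String[] heads = new String[n];
        for (int i = 0; i < n; i++) {
            if (!sortedQueues.get(i).hasNext()) {
                return matches;
            }
            heads[i] = sortedQueues.get(i).next();
        }
        while (true) {
            String lowest = heads[0];
            String highest = heads[0];
            for (String head : heads) {
                if (head.compareTo(lowest) < 0) lowest = head;
                if (head.compareTo(highest) > 0) highest = head;
            }
            // The lowest key is a match only if it heads every queue.
            if (lowest.equals(highest)) {
                matches.add(lowest);
            }
            // Discard the lowest key from every queue it heads.
            for (int i = 0; i < n; i++) {
                if (heads[i].equals(lowest)) {
                    if (!sortedQueues.get(i).hasNext()) {
                        return matches; // one queue is exhausted, so we are finished
                    }
                    heads[i] = sortedQueues.get(i).next();
                }
            }
        }
    }
}
```

Only one head per queue is held in memory at a time, which is what lets results stream back while the region servers are still producing sorted output.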
Joins
• Numerous methods possible
• A few examples
  – Use sub-queries to fetch related records
  – Use merge sorting to simultaneously fetch records
    satisfying both sides of the join, filter those that
    don’t match
• Presently, Culvert has only one join (sub-
  queries method)
Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
• User performs joins with a JoinConstraint (decorator design pattern)
• The constraint receives row IDs from a left sub-constraint
  – e.g. John
• The constraint looks up field values for the left side (if not already present in the results)
  – e.g. John has order:Apples = 5
• For each record in the left result set, the constraint creates a new right-side constraint to fetch indexed items matching the right side of the constraint
  – e.g. the order:Oranges index returns George and Jane for the value 5
• Finally, the joined records are returned
  – John 5 George
  – John 5 Jane
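
The sub-queries join method described above can be sketched as one index lookup per left-side record. The class `SubQueryJoinSketch`, its parameters, and the in-memory maps are assumptions made for illustration; Culvert's `JoinConstraint` operates on its own constraint and index types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sub-query join: one right-side index lookup per left record.
final class SubQueryJoinSketch {
    // Each result row is {leftRowId, joinValue, rightRowId}.
    static List<String[]> join(List<String> leftRowIds,
                               Map<String, Integer> leftValues,
                               TreeMap<Integer, List<String>> rightIndex) {
        List<String[]> joined = new ArrayList<>();
        for (String leftRowId : leftRowIds) {
            // Look up the field value for the left side of the join.
            Integer value = leftValues.get(leftRowId);
            if (value == null) {
                continue; // the left record has no value for the join column
            }
            // The "sub-query": fetch indexed rows whose right-side value matches.
            List<String> rightRowIds =
                    rightIndex.getOrDefault(value, Collections.<String>emptyList());
            for (String rightRowId : rightRowIds) {
                joined.add(new String[] {leftRowId, value.toString(), rightRowId});
            }
        }
        return joined;
    }
}
```

The cost is one index lookup per left record, which is why the merge-sort variant mentioned on the Joins slide may be preferable for large left result sets.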
Culvert Java API Examples
• Goal: to be intuitive and easy to interact with
• Provide a simple relational API without forcing
  a developer to use SQL
Culvert API Example: Insertion
Configuration culvertConf = CConfiguration.getDefault();
// index definitions are loaded implicitly from the
// configuration
Client client = new Client(culvertConf);
List<CKeyValue> valuesToPut = Lists.newArrayList();
valuesToPut.add(new CKeyValue(
      "foo".getBytes(),
      "bar".getBytes(),
      "baz".getBytes()));
Put put = new Put(valuesToPut);
client.put("tableName", put);
Culvert API Example: Retrieval
Configuration culvertConf = CConfiguration.getDefault();
// index definitions are loaded implicitly from the configuration
Client client = new Client(culvertConf);
Index c1Index = client.getIndexByName("index1");
Constraint c1Constraint = new IndexRangeConstraint(
      c1Index, new CRange(
            "abba".getBytes(),
            "cadabra".getBytes()));
Index[] c2Indices = client.getIndicesForColumn(
      "rabbit".getBytes(),
      "hat".getBytes());
Constraint c2Constraint = new IndexRangeConstraint(
      c2Indices[0],
      new CRange("bar".getBytes(), "foo".getBytes()));
Constraint and = new And(c1Constraint, c2Constraint);
Iterator<Result> results = client.query("tablename", and);
Future Work
• (Re)Building Indices via Map/Reduce
• More index types
  – Document-partitioned
  – Others?
• More retrieval operations
• Profiling + tuning
• Storing configuration details in a table or in
  Zookeeper
Where to Get It*

http://github.com/booz-allen-hamilton/culvert


          Where to Tweet It

                  #culvert
                                       *Available 6/29/2011
Culvert Team
•   Ed Kohlwey (@ekohlwey)
•   Jesse Yates (@jesse_yates)
•   Jeremy Walsh
•   Tomer Kishoni (@tokbot)
•   Jason Trost (@jason_trost)
Questions?

Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

  • 1. Culvert: A secondary indexing framework for BigTable-style databases with HIVE integration • Ed Kohlwey • Cloud Computing Team
  • 2. Session Agenda • Secondary Indexing • The Solution: Culvert • Culvert Design & Architecture • How It Works • API Examples • Where to Get It & Credits
  • 3. Secondary Indexing • General design pattern for inverted index – Maintain a map from value to location of records/documents that contain them • Lots of different variations – Term partitioned index – Document partitioned index • Solves problem of BigTable-style databases only having one primary key for records
  • 4. Sample Inventory Application
        Foo Table
        RowID    contact: city   contact: phone   inventory:count   order:Apples
        Apples                                    5
        John     Springfield     (999)-888-7777                     3
        Pears                                     10

        Sample Term-Partitioned Index Table (order:Apples Index)
        RowID
        3 -> Dave
        3 -> John
        17 -> Paul
        20 -> Sue
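The term-partitioned index on slide 4 can be sketched in memory as a sorted map from an indexed value to the row IDs that hold it. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not Culvert's API, and a `TreeMap` stands in for a sorted index table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not Culvert's API.
final class TermPartitionedIndexSketch {
    // Sorted map from indexed value to the sorted set of row IDs holding that value.
    private final TreeMap<Integer, TreeSet<String>> entries = new TreeMap<>();

    // Record that the row identified by rowId holds the given value in the indexed column.
    void put(String rowId, int value) {
        entries.computeIfAbsent(value, v -> new TreeSet<>()).add(rowId);
    }

    // Row IDs whose indexed value lies in [low, high], in index order.
    List<String> rowIdsInRange(int low, int high) {
        List<String> rowIds = new ArrayList<>();
        for (TreeSet<String> ids : entries.subMap(low, true, high, true).values()) {
            rowIds.addAll(ids);
        }
        return rowIds;
    }

    // The order:Apples index from the slide: 3 -> Dave, 3 -> John, 17 -> Paul, 20 -> Sue.
    static TermPartitionedIndexSketch sampleOrderApplesIndex() {
        TermPartitionedIndexSketch index = new TermPartitionedIndexSketch();
        index.put("John", 3);
        index.put("Dave", 3);
        index.put("Paul", 17);
        index.put("Sue", 20);
        return index;
    }

    public static void main(String[] args) {
        // Prints [Dave, John, Paul]
        System.out.println(sampleOrderApplesIndex().rowIdsInRange(3, 17));
    }
}
```

Because the map is sorted by value, a range predicate becomes a contiguous scan, which is the property a sorted index table provides.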
  • 5. Sample Inventory Application
        Foo Table
        RowID   contact: comments
        John    John likes apples.
        Sue     Sue likes pears.

        Sample Document-Partitioned Index Table (contact:comments Index)
        RowID     apples:john   john:John   likes:John   likes:Sue   pears:Sue   sue:Sue
        0x178df   -             -           -
        0x32da4                                          -           -           -
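A comparable sketch for the document-partitioned layout on slide 5: each index row is a partition, and its columns are `term:rowId` qualifiers. The names here are hypothetical, the tokenizer (lowercase, split on non-alphanumerics) is an assumption, and the caller chooses the partition ID because the slides do not say how partitions are assigned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not Culvert's API.
final class DocumentPartitionedIndexSketch {
    // Partition ID -> sorted "term:rowId" column qualifiers stored in that partition.
    private final TreeMap<String, TreeSet<String>> partitions = new TreeMap<>();

    // Tokenize the text and add one "term:rowId" column per token to the given partition.
    void index(String partitionId, String rowId, String text) {
        TreeSet<String> columns = partitions.computeIfAbsent(partitionId, p -> new TreeSet<>());
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                columns.add(token + ":" + rowId);
            }
        }
    }

    // A term lookup must visit every partition, since any partition may hold the term.
    List<String> rowIdsForTerm(String term) {
        String prefix = term.toLowerCase(Locale.ROOT) + ":";
        List<String> rowIds = new ArrayList<>();
        for (TreeSet<String> columns : partitions.values()) {
            // Columns sharing the prefix are contiguous in sorted order.
            for (String column : columns.tailSet(prefix)) {
                if (!column.startsWith(prefix)) {
                    break;
                }
                rowIds.add(column.substring(prefix.length()));
            }
        }
        return rowIds;
    }

    public static void main(String[] args) {
        DocumentPartitionedIndexSketch index = new DocumentPartitionedIndexSketch();
        index.index("0x178df", "John", "John likes apples.");
        index.index("0x32da4", "Sue", "Sue likes pears.");
        System.out.println(index.rowIdsForTerm("likes"));
    }
}
```

The trade-off against the term-partitioned form is visible in `rowIdsForTerm`: writes for one document touch a single partition, while reads fan out across all of them.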
  • 6. We found ourselves implementing these ideas over and over for clients. Why not make a library?
  • 8. Requirements • Support secondary indexing • Support an analyst query environment • Database Extensibility – There are actually a lot of BigTable implementations out there (HBase, Cassandra, proprietary) • Internal Extensibility – There are lots of ways to index records – There are lots of ways to retrieve records – Separate retrieval operations from index implementation
  • 9. What Culvert Does • Indexing • Interface for queries (Java and HIVE) • Abstraction mechanism for multiple underlying databases
  • 10. Culvert Design & Architecture • Use sorted iterators to retrieve values – Lots of algorithms can be expressed as sorting (like people tend to do in Map/Reduce) – Optional “dumping” feature can provide parallelism • Decorator design pattern is intuitive to interact with • Allows streaming of results as they become available • Uses Coprocessors to implement parallel operations
  • 11. Architecture Diagram (figure; labels shown: Java API, Hive, Culvert Client-Side Operation, Client, Constraint, TableAdapter, and two Culvert Region-Side Operations, each with a LocalTableAdapter and a RemoteOp)
  • 12. Constraint Architecture • Used to express query predicate operations – projection and selection (SELECT) – set operations (AND/OR) – joins • Decoupled from Indices – Currently focused on term-partitioned indices – Future work includes expanding document-partitioned index functionality
  • 13. Index Architecture • Index is an abstract type – Defines how to store and use the index • One index per column – Didn’t see a performance reason to index over multiple columns – Multiple indices complicate framework code – Map of “logical fields” was more easily maintained in the application – May evolve in the future
  • 14. Index Architecture (cont.) • One index table per index – Allows Index implementations to assume they don’t share the index table – Don’t need to worry about other Indices clobbering their table structure – Tables are assumed to be cheap
  • 15. Table Adapters • TableAdapter and LocalTableAdapter are abstraction mechanisms, roughly equivalent to HTable and HRegion • RemoteOp is roughly equivalent to CoprocessorProtocol and is handled by TableAdapter and LocalTableAdapter • Gives implementers fine-grained control over parallelism + table operations
  • 16. Using Culvert With HIVE • Why HIVE? – Already very popular – Take advantage of upstream advances – Good framework to “optimize later” • Culvert implements a HIVE StorageHandler and PredicateHandler • Facilitates analyst interaction with database • Reduces the “SQL Gap”
  • 17. HIVE Culvert Input Format • Handles AND, >, < query predicates based on indices • Each index can be broken up into fragments based on region start and end keys – We take the cross-product of each index's regions to create input splits for AND
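The cross-product step on slide 17 can be sketched as follows. This is a minimal illustration, not the input format's actual code: the class name, the generic method, and the string fragment labels are all hypothetical stand-ins for region start and end key ranges.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Culvert input format implementation.
final class IndexSplitSketch {
    // Cartesian product: each returned split holds exactly one fragment from every index.
    static <T> List<List<T>> crossProduct(List<List<T>> fragmentsPerIndex) {
        List<List<T>> splits = new ArrayList<>();
        splits.add(new ArrayList<>()); // start from a single empty combination
        for (List<T> fragments : fragmentsPerIndex) {
            List<List<T>> next = new ArrayList<>();
            for (List<T> partial : splits) {
                for (T fragment : fragments) {
                    List<T> extended = new ArrayList<>(partial);
                    extended.add(fragment);
                    next.add(extended);
                }
            }
            splits = next;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Two regions for one index and three for the other yield six splits.
        List<List<String>> splits = crossProduct(List.of(
                List.of("apples[a,m)", "apples[m,z)"),
                List.of("oranges[a,h)", "oranges[h,q)", "oranges[q,z)")));
        splits.forEach(System.out::println);
    }
}
```

One plausible reading of why a cross-product is needed for AND: a matching row ID may sit in any region of each index, so every pairing of regions has to be examined. The number of splits therefore grows as the product of the region counts.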
  • 18. How It Works Overview of Indexing Operations
  • 19. Indexing • Indices are built via insertion operations on the client (i.e. Client.put(…)) • Whether a field is indexed is controlled by a configuration file • In the future, will support indexing of arbitrary columns via Map/Reduce
  • 20. Retrieval • Query API is exposed via HIVE and Java – HIVE API delegates to Java API – Java API is based on subclasses of Constraint • Focused on providing parallel, real-time query execution
  • 22. Logical Operations on Indices • Logical operations can be represented as a merge sort if we return the keys from the original table in sorted order • Example: AND
        orders:Apples Index    orders:Oranges Index
        1 -> Dean              4 -> Dean
        3 -> Susan             5 -> Susan
        4 -> John              5 -> Paul
        8 -> Paul              6 -> George
        14 -> Renee            12 -> Karen
        33 -> Sheryl           19 -> Tom
  • 23. Apples < 3 AND Oranges > 5 • First query each index
        orders:Apples Index    orders:Oranges Index
        1 -> Dean              4 -> Dean
        3 -> Susan             5 -> Susan
        4 -> John              5 -> Paul
        8 -> Paul              6 -> George
        14 -> Renee            12 -> Karen
        33 -> Sheryl           19 -> Tom
  • 24. Apples < 3 AND Oranges > 5 • Then order results for each index • Happens on the region servers
        orders:Apples results: 1 -> Dean, 3 -> Susan
        orders:Oranges results: 5 -> Susan, 5 -> Paul, 6 -> George, 12 -> Karen, 19 -> Tom
  • 25. Apples < 3 AND Oranges > 5 • Then order results for each index • Happens on the region servers
        orders:Apples row IDs: Dean, Susan
        orders:Oranges row IDs: Susan, Paul, George, Karen, Tom
  • 26. Apples < 3 AND Oranges > 5 • Then order results for each index • Notice this happens on the region servers*
        orders:Apples row IDs (done): Dean, Susan
        orders:Oranges row IDs: Susan, Paul, George, Karen, Tom
  • 27. Apples < 3 AND Oranges > 5 • Then order results for each index • Notice this happens on the region servers*
        orders:Apples row IDs (done): Dean, Susan
        orders:Oranges row IDs (done): George, Karen, Paul, Susan, Tom
  • 28. Apples < 3 AND Oranges > 5 • Then merge the sorted results on the client
        Queues: [Dean, Susan] and [George, Karen, Paul, Susan, Tom]
  • 29. Apples < 3 AND Oranges > 5 • Dean is lowest, Dean is not on the head of all the queues, discard
        Queues: [Dean, Susan] and [George, Karen, Paul, Susan, Tom]
  • 30. Apples < 3 AND Oranges > 5 • George is lowest, George is not on the head of all queues, discard
        Queues: [Dean, Susan] and [George, Karen, Paul, Susan, Tom]
  • 31. Apples < 3 AND Oranges > 5 • Continue…
        Queues: [Dean, Susan] and [George, Karen, Paul, Susan, Tom]
  • 32. Apples < 3 AND Oranges > 5 • Susan is on the head of all the queues, return Susan
        Queues: [Dean, Susan ✔] and [George, Karen, Paul, Susan ✔, Tom]
  • 33. Apples < 3 AND Oranges > 5 • Tom is discarded, now we’re finished
        Queues: [Dean, Susan ✔] and [George, Karen, Paul, Susan ✔, Tom]
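The client-side merge walked through on slides 28 to 33 can be sketched as a queue-head intersection. This is an illustrative reconstruction from the slides, not Culvert's code: the class and method names are hypothetical, and each input list is assumed to be sorted and free of duplicate row IDs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; not Culvert's implementation.
final class SortedAndSketch {
    // Returns the row IDs present in every input list; each list must be sorted.
    static List<String> and(List<List<String>> sortedRowIdLists) {
        List<String> matches = new ArrayList<>();
        if (sortedRowIdLists.isEmpty()) {
            return matches;
        }
        List<Deque<String>> queues = new ArrayList<>();
        for (List<String> rowIds : sortedRowIdLists) {
            queues.add(new ArrayDeque<>(rowIds));
        }
        while (true) {
            // Find the lowest row ID among the queue heads.
            String lowest = null;
            for (Deque<String> queue : queues) {
                String head = queue.peekFirst();
                if (head == null) {
                    return matches; // an exhausted queue means no further matches are possible
                }
                if (lowest == null || head.compareTo(lowest) < 0) {
                    lowest = head;
                }
            }
            // Pop the lowest ID wherever it is a head; it matches only if it headed every queue.
            boolean onAllHeads = true;
            for (Deque<String> queue : queues) {
                if (queue.peekFirst().equals(lowest)) {
                    queue.pollFirst();
                } else {
                    onAllHeads = false;
                }
            }
            if (onAllHeads) {
                matches.add(lowest);
            }
        }
    }

    public static void main(String[] args) {
        // Prints [Susan]
        System.out.println(and(List.of(
                List.of("Dean", "Susan"),
                List.of("George", "Karen", "Paul", "Susan", "Tom"))));
    }
}
```

With the slide's queues, Dean, George, Karen, and Paul are discarded in turn and Susan is returned. The sketch stops as soon as one queue is empty rather than explicitly discarding Tom; for AND the result is the same.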
  • 34. Joins • Numerous methods possible • A few examples – Use sub-queries to fetch related records – Use merge sorting to simultaneously fetch records satisfying both sides of the join, filter those that don’t match • Presently, Culvert has only one join (sub-queries method)
  • 35. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) User performs joins with a JoinConstraint constraint (decorator design pattern)
  • 36. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • Constraint receives row IDs from a left sub-constraint.
        Left SubConstraint -> JoinConstraint: …, John, …
  • 37. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • Constraint looks up field values for the left side (if not already present in the results)
        Left SubConstraint -> JoinConstraint: …, John, …
        order:Apples: …, John 5, …
  • 38. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • For each record in the left result set, the constraint creates a new right-side constraint to fetch indexed items matching the right side of the constraint.
        Left SubConstraint -> JoinConstraint: …, John, …
        order:Apples: …, John 5, …
        order:Oranges: …, George 5, Jane 5, …
  • 39. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • Finally, the joined records are returned.
        Left SubConstraint -> JoinConstraint: …, John, …
        order:Apples: …, John 5, …
        order:Oranges: …, George 5, Jane 5, …
        Joined records: …, John 5 George, John 5 Jane, …
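The sub-query join on slides 35 to 39 can be sketched with in-memory maps standing in for the table and the right-side index. Everything here is a hypothetical illustration: the class, method, and parameter names are assumptions, and this is not the JoinConstraint implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Culvert's JoinConstraint.
final class SubQueryJoinSketch {
    // leftRowIds: row IDs produced by the left sub-constraint.
    // leftValues: row ID -> value of the left join column (order:Apples on the slides).
    // rightIndex: value -> row IDs holding that value in the right join column (order:Oranges).
    static List<List<String>> join(List<String> leftRowIds,
                                   Map<String, Integer> leftValues,
                                   Map<Integer, List<String>> rightIndex) {
        List<List<String>> joined = new ArrayList<>();
        for (String leftRowId : leftRowIds) {
            Integer value = leftValues.get(leftRowId);
            if (value == null) {
                continue; // the row has no value in the join column
            }
            // One sub-query per left record: fetch right-side rows with the same value.
            for (String rightRowId : rightIndex.getOrDefault(value, List.of())) {
                joined.add(List.of(leftRowId, String.valueOf(value), rightRowId));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Prints [[John, 5, George], [John, 5, Jane]]
        System.out.println(join(
                List.of("John"),
                Map.of("John", 5),
                Map.of(5, List.of("George", "Jane"))));
    }
}
```

The cost of this approach is one right-side lookup per left record, which is why the merge-sort alternative on slide 34 may suit large left result sets better.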
  • 40. Culvert Java API Examples • Goal: to be intuitive and easy to interact with • Provide a simple relational API without forcing a developer to use SQL
  • 41. Culvert API Example: Insertion
        Configuration culvertConf = CConfiguration.getDefault();
        // index definitions are loaded implicitly from the
        // configuration
        Client client = new Client(culvertConf);
        List<CKeyValue> valuesToPut = Lists.newArrayList();
        valuesToPut.add(new CKeyValue(
            "foo".getBytes(), "bar".getBytes(), "baz".getBytes()));
        Put put = new Put(valuesToPut);
        client.put("tableName", put);
  • 42. Culvert API Example: Retrieval
        Configuration culvertConf = CConfiguration.getDefault();
        // index definitions are loaded implicitly from the configuration
        Client client = new Client(culvertConf);
        Index c1Index = client.getIndexByName("index1");
        Constraint c1Constraint = new IndexRangeConstraint(
            c1Index, new CRange("abba".getBytes(), "cadabra".getBytes()));
        Index[] c2Indices = client.getIndicesForColumn(
            "rabbit".getBytes(), "hat".getBytes());
        Constraint c2Constraint = new IndexRangeConstraint(
            c2Indices[0], new CRange("bar".getBytes(), "foo".getBytes()));
        Constraint and = new And(c1Constraint, c2Constraint);
        Iterator<Result> results = client.query("tablename", and);
  • 43. Future Work • (Re)Building Indices via Map/Reduce • More index types – Document-partitioned – Others? • More retrieval operations • Profiling + tuning • Storing configuration details in a table or in Zookeeper
  • 44. Where to Get It* http://github.com/booz-allen-hamilton/culvert Where to Tweet It #culvert *Available 6/29/2011
  • 45. Culvert Team • Ed Kohlwey (@ekohlwey) • Jesse Yates (@jesse_yates) • Jeremy Walsh • Tomer Kishoni (@tokbot) • Jason Trost (@jason_trost)

Editor's Notes

  1. Just say the bullet points,