Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Culvert
A secondary indexing framework for BigTable-style
databases with HIVE integration

Ed Kohlwey
Cloud Computing Team

Session Agenda
• Secondary Indexing
• The Solution: Culvert
• Culvert Design & Architecture
• How It Works
• API Examples
• Where to Get It & Credits

Secondary Indexing
• General design pattern for inverted index
– Maintain a map from value to location of
records/documents that contain them
• Lots of different variations
– Term partitioned index
– Document partitioned index
• Solves problem of BigTable-style databases
only having one primary key for records

Sample Inventory Application
Foo Table
RowID    contact:city   contact:phone    inventory:count   order:Apples
Apples                                   5
John     Springfield    (999)-888-7777                     3
Pears                                    10

Sample Term-Partitioned Index Table
order:Apples Index
RowID
3 -> Dave
3 -> John
17 -> Paul
20 -> Sue
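
A term-partitioned index like the one above can be sketched in plain Java as a sorted map from an indexed value to the row IDs that hold it. This is only an illustration: the class and method names are hypothetical and are not part of Culvert, and an in-memory TreeMap stands in for the index table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration, not Culvert code: a TreeMap stands in for the
// sorted index table, keyed by the indexed value (order:Apples).
final class TermIndexSketch {
    private final TreeMap<Integer, TreeSet<String>> valueToRowIds = new TreeMap<>();

    // Record that the row with this ID holds this value in the indexed column.
    void put(int value, String rowId) {
        valueToRowIds.computeIfAbsent(value, v -> new TreeSet<>()).add(rowId);
    }

    // Row IDs whose value lies in [low, high], returned in index (value) order.
    // Assumes low <= high.
    List<String> range(int low, int high) {
        List<String> rowIds = new ArrayList<>();
        for (TreeSet<String> ids : valueToRowIds.subMap(low, true, high, true).values()) {
            rowIds.addAll(ids);
        }
        return rowIds;
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        index.put(3, "Dave");
        index.put(3, "John");
        index.put(17, "Paul");
        index.put(20, "Sue");
        // Rows with 3 <= order:Apples <= 17
        System.out.println(index.range(3, 17));
    }
}
```

Because the map is sorted by value, a range predicate becomes a contiguous scan, which is the property the index table relies on.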

Sample Inventory Application
Foo Table
RowID    contact:comments
John     John likes apples.
Sue      Sue likes pears.

Sample Document-Partitioned Index Table
contact:comments Index

RowID apples:john john:John likes:John likes:Sue pears:Sue sue:Sue
0x178df - - -
0x32da4 - - -
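
As a rough sketch of the document-partitioned layout, the Java below keeps all terms of one document in a single partition and stores entries as "term:rowId", so a term lookup has to visit every partition. The names, the hash-based partition choice, and the simple tokenizer are assumptions made for illustration; none of this is Culvert code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical illustration, not Culvert code. Each partition holds sorted
// "term:rowId" entries for the documents assigned to it.
final class DocPartitionedIndexSketch {
    private final List<TreeSet<String>> partitions = new ArrayList<>();

    DocPartitionedIndexSketch(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new TreeSet<>());
        }
    }

    // All terms of one document land in the partition chosen by its row ID
    // (assumed placement rule: hash of the row ID modulo the partition count).
    void index(String rowId, String text) {
        int slot = Math.floorMod(rowId.hashCode(), partitions.size());
        TreeSet<String> partition = partitions.get(slot);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                partition.add(term + ":" + rowId);
            }
        }
    }

    // A term lookup visits every partition and scans the "term:" prefix.
    TreeSet<String> lookup(String term) {
        String prefix = term.toLowerCase(Locale.ROOT) + ":";
        TreeSet<String> rowIds = new TreeSet<>();
        for (TreeSet<String> partition : partitions) {
            for (String entry : partition.tailSet(prefix)) {
                if (!entry.startsWith(prefix)) {
                    break; // entries sharing a prefix are contiguous in sorted order
                }
                rowIds.add(entry.substring(prefix.length()));
            }
        }
        return rowIds;
    }

    public static void main(String[] args) {
        DocPartitionedIndexSketch index = new DocPartitionedIndexSketch(2);
        index.index("John", "John likes apples.");
        index.index("Sue", "Sue likes pears.");
        System.out.println(index.lookup("likes"));
    }
}
```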

We found ourselves implementing
these ideas over and over for clients.

Why not make a library?

Requirements
• Support secondary indexing
• Support an analyst query environment
• Database Extensibility
– There are actually many BigTable implementations out
there (HBase, Cassandra, proprietary)
• Internal Extensibility
– There are many ways to index records
– There are many ways to retrieve records
– Separate retrieval operations from index
implementation

What Culvert Does
• Indexing
• Interface for queries (Java and HIVE)
• Abstraction mechanism for multiple
underlying databases

Culvert Design & Architecture
• Use sorted iterators to retrieve values
– Lots of algorithms can be expressed as sorting (like
people tend to do in Map/Reduce)
– Optional “dumping” feature can provide parallelism
• Decorator design pattern is intuitive to interact
with
• Allows streaming of results as they become
available
• Uses Coprocessors to implement parallel
operations
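
To illustrate how sorted iterators and the decorator pattern fit together, here is a minimal Java sketch of a union (OR) over two sorted streams of row IDs. It is an assumption-laden illustration rather than Culvert's implementation: the class name is hypothetical, and each input is assumed to be sorted, free of duplicates, and free of nulls.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration, not Culvert code: a decorator that wraps two
// iterators over sorted row IDs and streams their sorted, de-duplicated union.
final class SortedUnionSketch implements Iterator<String> {
    private final Iterator<String> left;
    private final Iterator<String> right;
    private String leftHead;
    private String rightHead;

    SortedUnionSketch(Iterator<String> left, Iterator<String> right) {
        this.left = left;
        this.right = right;
        this.leftHead = advance(left);
        this.rightHead = advance(right);
    }

    // Next element of the iterator, or null once it is exhausted.
    private static String advance(Iterator<String> it) {
        return it.hasNext() ? it.next() : null;
    }

    @Override
    public boolean hasNext() {
        return leftHead != null || rightHead != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // An exhausted side always loses the comparison.
        int cmp = leftHead == null ? 1
                : rightHead == null ? -1
                : leftHead.compareTo(rightHead);
        String result = cmp <= 0 ? leftHead : rightHead;
        if (cmp <= 0) {
            leftHead = advance(left);
        }
        if (cmp >= 0) {
            rightHead = advance(right);
        }
        return result;
    }

    public static void main(String[] args) {
        Iterator<String> union = new SortedUnionSketch(
                Arrays.asList("Dean", "Susan").iterator(),
                Arrays.asList("George", "Karen", "Susan").iterator());
        while (union.hasNext()) {
            System.out.println(union.next());
        }
    }
}
```

Since the sketch is itself an Iterator, one instance can wrap another, and each result is available to the caller as soon as it is known rather than after the whole merge completes.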

Architecture Diagram
[Diagram, top to bottom: the Java API and Hive sit above the Culvert
Client-Side Operation layer (TableAdapter, Constraint, Client), which sits
above two Culvert Region-Side Operation boxes, each containing a
LocalTableAdapter and a RemoteOp.]

Constraint Architecture
• Used to express query predicate operations
– projection and selection (SELECT)
– set operations (AND/OR)
– joins
• Decoupled from Indices
– Currently focused on term-partitioned indices
– Future work includes expanding document-partitioned
index functionality

Index Architecture
• Index is an abstract type
– Defines how to store and use the index
• One index per column
– Didn’t see a performance reason to index over
multiple columns
– Multiple indices complicate framework code
– Map of “logical fields” was more easily maintained
in the application
– May evolve in the future

Index Architecture (cont.)
• One index table per index
– Allows Index implementations to assume they
don’t share the index table
– Don’t need to worry about other Indices
clobbering their table structure
– Tables are assumed to be cheap

Table Adapters
• TableAdapter and LocalTableAdapter are
abstraction mechanisms, roughly equivalent
to HTable and HRegion
• RemoteOp is roughly equivalent to
CoprocessorProtocol, and is handled by
TableAdapter and LocalTableAdapter
• Gives implementers fine-grained control over
parallelism + table operations

Using Culvert With HIVE
• Why HIVE?
– Already very popular
– Take advantage of upstream advances
– Good framework to “optimize later”
• Culvert implements a HIVE StorageHandler
and PredicateHandler
• Facilitates analyst interaction with database
• Reduces the “SQL Gap”

HIVE Culvert Input Format
• Handles AND, >, < query predicates based on
indices
• Each index can be broken up into fragments
based on region start and end keys
– We take the cross-product of each index's regions
to create input splits for AND

How It Works

Overview of Indexing Operations

Indexing
• Indices are built via insertion operations on
the client (i.e. Client.put(…))
• Whether a field is indexed is controlled by a
configuration file
• In the future, will support indexing of arbitrary
columns via Map/Reduce
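
The insertion path can be pictured with the sketch below: a put writes to the primary table and, when the column is configured as indexed, also writes an entry to that column's index table. All names are hypothetical and in-memory maps stand in for tables; this is not the Culvert Client. One simplification to note: overwriting a value leaves the old index entry in place.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration, not Culvert code: index maintenance on put.
final class IndexingClientSketch {
    // Stands in for the configuration file that lists indexed columns.
    private final Set<String> indexedColumns;
    // Primary table: rowId -> column -> value.
    final Map<String, Map<String, String>> primary = new TreeMap<>();
    // One index table per indexed column: column -> value -> rowIds.
    final Map<String, TreeMap<String, TreeSet<String>>> indexTables = new HashMap<>();

    IndexingClientSketch(Set<String> indexedColumns) {
        this.indexedColumns = new HashSet<>(indexedColumns);
    }

    void put(String rowId, String column, String value) {
        // Always write the record itself.
        primary.computeIfAbsent(rowId, r -> new TreeMap<>()).put(column, value);
        // Write an index entry only for columns the configuration marks as indexed.
        // Simplification: a stale entry for a previous value is not removed.
        if (indexedColumns.contains(column)) {
            indexTables.computeIfAbsent(column, c -> new TreeMap<>())
                    .computeIfAbsent(value, v -> new TreeSet<>())
                    .add(rowId);
        }
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>();
        indexed.add("order:Apples");
        IndexingClientSketch client = new IndexingClientSketch(indexed);
        client.put("John", "order:Apples", "3");
        client.put("John", "contact:city", "Springfield");
        client.put("Dave", "order:Apples", "3");
        System.out.println(client.indexTables);
    }
}
```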

Retrieval
• Query API is exposed via HIVE and Java
– HIVE API delegates to Java API
– Java API is based on subclasses of Constraint
• Focused on providing parallel, real-time query
execution

Walkthrough of Logical
Operations on Indices

Logical Operations on Indices
• Logical operations can be represented as a merge
sort if we return the keys from the original table
in sorted order
• Example: AND
orders:Apples Index    orders:Oranges Index
1 -> Dean              4 -> Dean
3 -> Susan             5 -> Susan
4 -> John              5 -> Paul
8 -> Paul              6 -> George
14 -> Renee            12 -> Karen
33 -> Sheryl           19 -> Tom

Apples <= 3 AND Oranges >= 5
• First query each index

orders:Apples Index    orders:Oranges Index
1 -> Dean              5 -> Susan
3 -> Susan             5 -> Paul
                       6 -> George
                       12 -> Karen
                       19 -> Tom

• Then order the results for each index by row ID
• Notice this happens on the region servers*

Apples queue    Oranges queue
Dean            George
Susan           Karen
                Paul
                Susan
                Tom

• Then merge the sorted results on the client
• Dean is lowest, Dean is not on the head of all
the queues, discard
• George is lowest, George is not on the head of
all queues, discard
• Continue…
• Susan is on the head of all the queues, return
Susan
• Tom is discarded, now we’re finished

Apples queue    Oranges queue
Dean            George
✔ Susan         Karen
                Paul
                Susan ✔
                Tom
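
The client-side merge for AND can be sketched in Java as below. It follows the discard-the-lowest-head procedure from the walkthrough, but the class and method names are hypothetical and this is not Culvert's implementation. Each queue is assumed to be sorted by row ID and free of duplicates, and the sketch stops as soon as any queue is exhausted, which discards whatever remains in the others.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration, not Culvert code: AND as a merge of sorted queues.
final class AndMergeSketch {
    // Returns the row IDs present in every queue. The queues are consumed.
    static List<String> intersect(List<Deque<String>> queues) {
        List<String> matches = new ArrayList<>();
        if (queues.isEmpty()) {
            return matches;
        }
        while (true) {
            String lowest = null;
            boolean allEqual = true;
            for (Deque<String> queue : queues) {
                String head = queue.peekFirst();
                if (head == null) {
                    return matches; // one queue is exhausted: nothing else can match
                }
                if (lowest == null) {
                    lowest = head;
                } else if (!head.equals(lowest)) {
                    allEqual = false;
                    if (head.compareTo(lowest) < 0) {
                        lowest = head;
                    }
                }
            }
            if (allEqual) {
                matches.add(lowest); // on the head of all the queues: return it
            }
            // Drop the lowest row ID from every queue that currently leads with it.
            for (Deque<String> queue : queues) {
                if (lowest.equals(queue.peekFirst())) {
                    queue.pollFirst();
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Deque<String>> queues = new ArrayList<>();
        queues.add(new ArrayDeque<>(Arrays.asList("Dean", "Susan")));
        queues.add(new ArrayDeque<>(Arrays.asList("George", "Karen", "Paul", "Susan", "Tom")));
        System.out.println(intersect(queues)); // prints [Susan]
    }
}
```

Every pass removes at least one element, so the merge touches each queued row ID at most once and never needs to hold a full result set in memory.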

Joins
• Numerous methods possible
• A few examples
– Use sub-queries to fetch related records
– Use merge sorting to simultaneously fetch records
satisfying both sides of the join, filter those that
don’t match
• Presently, Culvert has only one join
(sub-queries method)

Example: Join Apple Order Size on
Orange Order Size (order:Apples =
order:Oranges)
• User performs joins with a JoinConstraint
(decorator design pattern)
• Constraint receives row IDs from a left
sub-constraint

Left SubConstraint
…
John
…

• Constraint looks up field values for the left
side (if not already present in the results)

Left SubConstraint    order:Apples
…                     …
John                  5
…                     …

• For each record in the left result set, the
constraint creates a new right-side constraint to
fetch indexed items matching the right side of
the constraint

order:Oranges
…         …
George    5
…         …
Jane      5
John      5
…         …

• Finally, the joined records are returned

JoinConstraint
…       …    …
John    5    George
John    5    Jane
…       …    …
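
The sub-query join steps can be sketched roughly as follows. The names are hypothetical and in-memory maps stand in for the left table and the right index; this is not the JoinConstraint implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration, not Culvert code: a join driven by sub-queries.
final class SubQueryJoinSketch {
    // leftRowIds: row IDs produced by the left sub-constraint.
    // leftValues: rowId -> order:Apples value, used for the field lookup.
    // rightIndex: order:Oranges value -> rowIds, queried once per left record.
    // Each result is (left row ID, join value, right row ID).
    static List<List<String>> join(List<String> leftRowIds,
                                   Map<String, Integer> leftValues,
                                   TreeMap<Integer, TreeSet<String>> rightIndex) {
        List<List<String>> joined = new ArrayList<>();
        for (String leftRowId : leftRowIds) {
            Integer value = leftValues.get(leftRowId);
            if (value == null) {
                continue; // the left row has no value to join on
            }
            // The "new right-side constraint": fetch index entries equal to value.
            TreeSet<String> rightRowIds = rightIndex.get(value);
            if (rightRowIds == null) {
                continue;
            }
            for (String rightRowId : rightRowIds) {
                joined.add(Arrays.asList(leftRowId, String.valueOf(value), rightRowId));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, Integer> apples = new HashMap<>();
        apples.put("John", 5);
        TreeMap<Integer, TreeSet<String>> oranges = new TreeMap<>();
        oranges.put(5, new TreeSet<>(Arrays.asList("George", "Jane", "John")));
        oranges.put(7, new TreeSet<>(Collections.singletonList("Paul")));
        for (List<String> row : join(Collections.singletonList("John"), apples, oranges)) {
            System.out.println(row);
        }
    }
}
```

The cost of this approach is one right-side lookup per left record, which is why the merge-sort variant listed above is an alternative when the left result set is large.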

Culvert Java API Examples
• Goal: to be intuitive and easy to interact with
• Provide a simple relational API without forcing
a developer to use SQL

Culvert API Example: Insertion
Configuration culvertConf = CConfiguration.getDefault();
// index definitions are loaded implicitly from the
// configuration
Client client = new Client(culvertConf);
List<CKeyValue> valuesToPut = Lists.newArrayList();
valuesToPut.add(new CKeyValue(
    "foo".getBytes(),
    "bar".getBytes(),
    "baz".getBytes()));
Put put = new Put(valuesToPut);
client.put("tableName", put);

Culvert API Example: Retrieval
Configuration culvertConf = CConfiguration.getDefault();
// index definitions are loaded implicitly from the configuration
Client client = new Client(culvertConf);
// first constraint: a range over an index looked up by name
Index c1Index = client.getIndexByName("index1");
Constraint c1Constraint = new IndexRangeConstraint(
    c1Index,
    new CRange(
        "abba".getBytes(),
        "cadabra".getBytes()));
// second constraint: a range over an index looked up by column
Index[] c2Indices = client.getIndicesForColumn(
    "rabbit".getBytes(),
    "hat".getBytes());
Constraint c2Constraint = new IndexRangeConstraint(
    c2Indices[0],
    new CRange("bar".getBytes(), "foo".getBytes()));
// combine both constraints and run the query
Constraint and = new And(c1Constraint, c2Constraint);
Iterator<Result> results = client.query("tablename", and);

Future Work
• (Re)Building Indices via Map/Reduce
• More index types
– Document-partitioned
– Others?
• More retrieval operations
• Profiling + tuning
• Storing configuration details in a table or in
Zookeeper

Where to Get It*

http://github.com/booz-allen-hamilton/culvert
*Available 6/29/2011

Where to Tweet It

#culvert

Culvert Team
• Ed Kohlwey (@ekohlwey)
• Jesse Yates (@jesse_yates)
• Jeremy Walsh
• Tomer Kishoni (@tokbot)
• Jason Trost (@jason_trost)
