Causata HBase Deployment Presentation
1. Mixing low latency with analytical workloads for Customer Experience Management
June 13, 2013
Neil Ferguson
2. www.causata.com
Causata Overview
• Real-time Offer Management
– Involves predicting something about a customer based on their profile
– For example, predicting if somebody is a high-value customer when deciding whether to offer them a discount
– Typically involves low-latency (< 50 ms) access to an individual profile
– Both on-premise and hosted
• Analytics
– Involves getting a large set of profiles matching certain criteria
– For example, finding all of the people who have spent more than $100 in the last month
– Involves streaming access to large amounts of data (typically millions of rows/sec per node)
– Often ad hoc
3. Some History
• Started building our platform 4½ years ago
• Started on MySQL
– Latency too high when reading large profiles
– Write throughput too low with large data sets
• Built our own custom data store
– Performed well (it was built for our specific needs)
– Non-standard; high maintenance costs
• Moved to HBase last year
– Industry standard; lowered maintenance costs
– Can perform well!
4. Our Data
• All data is stored as Events, each of which has the following:
– A type (for example, "Product Purchase")
– A timestamp
– An identifier (who the event belongs to)
– A set of attributes, each of which has a type and value(s), for example:
• "Product Price" -> 99.99
• "Product Category" -> "Shoes", "Footwear"
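The event model above can be sketched as a small data class. This is a hypothetical illustration of the structure the slide describes (type, timestamp, owner identifier, multi-valued attributes), not Causata's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One raw event: a type, a timestamp, an owner, and typed attributes."""
    event_type: str          # e.g. "Product Purchase"
    timestamp_ms: int        # event time, milliseconds since epoch
    profile_id: str          # who the event belongs to
    # Attribute name -> list of values (attributes can be multi-valued)
    attributes: dict = field(default_factory=dict)

# The example from the slide:
purchase = Event(
    event_type="Product Purchase",
    timestamp_ms=1371081600000,
    profile_id="user-42",
    attributes={
        "Product Price": [99.99],
        "Product Category": ["Shoes", "Footwear"],
    },
)
```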
5. Our Storage
• Only raw data is stored (not pre-aggregated)
• Event table (row-oriented):
– Stores data clustered by user profile
– Used for low-latency retrieval of individual profiles for offer management, and for bulk queries for analytics
• Index table ("column-oriented"):
– Stores data clustered by attribute type
– Used for bulk queries (scanning) for analytics
• Identity Graph:
– Stores a graph of cross-channel identifiers for a user profile
– Stored as an in-memory column family in the Events table
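The two clusterings above come down to row-key design. A minimal sketch of one plausible key layout, assuming fixed-width big-endian integer IDs so byte order matches numeric order (the deck does not show Causata's actual key format):

```python
import struct

def event_row_key(profile_id: int, timestamp_ms: int) -> bytes:
    # Event table: profile ID first, so all of one profile's events
    # are contiguous and time-ordered -- a single short scan serves
    # a low-latency profile lookup.
    return struct.pack(">QQ", profile_id, timestamp_ms)

def index_row_key(attr_type_id: int, timestamp_ms: int, profile_id: int) -> bytes:
    # Index table: attribute type first, so an analytical scan streams
    # every value of one attribute across all profiles.
    return struct.pack(">IQQ", attr_type_id, timestamp_ms, profile_id)
```

Big-endian packing means lexicographic comparison of the key bytes (which is how HBase sorts rows) agrees with numeric ordering of the fields.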
6. Maintaining Locality
• Data locality (with the HBase client) gives around a 60% throughput increase
– A single node can scan around 1.6 million rows/second with the Region Server on a separate machine
– The same node can scan around 2.5 million rows/second with the Region Server on the local machine
• Custom region splitter: ensures that, where possible, event tables and index tables are split at the same point
– Tables are divided into buckets, and split at bucket boundaries
• Custom load balancer: ensures that index table data is balanced to the same Region Server as event table data
• All upstream services are locality-aware
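The bucket idea can be sketched as follows: carve the key space into fixed bucket boundaries and only ever split tables there, so matching event-table and index-table regions share boundaries and can be balanced onto the same Region Server. The key width and bucket count here are illustrative assumptions, not Causata's actual values:

```python
def bucket_split_points(num_buckets: int, key_bytes: int = 8) -> list:
    """Evenly spaced split points over a fixed-width byte key space.

    Both tables are split only at these boundaries, so bucket N of the
    event table and bucket N of the index table always cover the same
    key range and can be co-located by the load balancer.
    """
    max_key = 2 ** (8 * key_bytes)
    return [
        (max_key * i // num_buckets).to_bytes(key_bytes, "big")
        for i in range(1, num_buckets)  # interior boundaries only
    ]
```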
7. Querying Causata
For each customer who has spent more than $100, get product
views in the last week from now:
SELECT S.product_views_in_last_week
FROM Scenarios S
WHERE S.timestamp = now()
AND S.total_spend > 100;
For each customer who has spent more than $100, get product
views in the last week from when they purchased something:
SELECT S.product_views_in_last_week
FROM Scenarios S, Product_Purchase P
WHERE S.timestamp = P.timestamp
AND S.profile_id = P.profile_id
AND S.total_spend > 100;
8. Query Engine
• Raw data is stored in HBase, but queries are typically performed against aggregated data
– Need to scan billions of rows, and aggregate on the fly
– Many parallel scans performed:
- Across machines (obviously)
- Across regions (and therefore disks)
- Across cores
• Queries can optionally skip non-compacted data (based on HFile timestamps)
– Allows result recency to be traded for performance
• Some other performance tuning:
- Short-circuit reads turned on (available from 0.94)
- Multiple columns combined into one
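The scan-and-aggregate-on-the-fly pattern can be sketched with a thread pool fanning out one streaming aggregation per region. The region data is a stand-in list here, where the real system would run an HBase scan; this is an illustration of the pattern, not Causata's engine:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_region(rows) -> float:
    # Stand-in for one region scan: aggregate while streaming,
    # never materialising the full row set in memory.
    total = 0.0
    for row in rows:
        total += row["spend"]
    return total

def parallel_total_spend(regions, workers: int = 4) -> float:
    # One scan task per region (and therefore per disk), partial
    # aggregates merged at the end.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_region, regions))
```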
9. Parallelism
[Benchmark chart omitted.] Setup: single Region Server, local client, all rows returned to the client, disk-bound workload (disk cache cleared before the test), ~1 billion rows scanned in total, ~15 bytes per row (on disk, compressed), 2 x 6-core Intel(R) X5650 @ 2.67 GHz, 4 x 10k RPM SAS disks, 48 GB RAM.
10. Request Prioritization
• All requests to HBase go through a single thread pool
• This allows requests to be prioritized according to their sensitivity to latency
• "Real-time" (latency-sensitive) requests are treated specially
• Real-time request latency is monitored continuously, and more resources are allocated if deadlines are not met
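The prioritized single pool can be sketched with a priority queue in front of the workers: real-time requests jump ahead of queued bulk analytics work, while equal-priority requests keep FIFO order. A minimal illustration of the queueing discipline, not Causata's implementation:

```python
import heapq
import threading

class PriorityRequestPool:
    """Single request queue; lower priority value runs first."""
    REALTIME, BULK = 0, 1

    def __init__(self):
        self._heap = []       # (priority, submit_seq, fn)
        self._seq = 0         # tie-breaker: FIFO within a priority
        self._cv = threading.Condition()

    def submit(self, priority: int, fn) -> None:
        with self._cv:
            heapq.heappush(self._heap, (priority, self._seq, fn))
            self._seq += 1
            self._cv.notify()

    def run_next(self):
        # A worker thread would loop on this; real-time requests
        # always come off the heap ahead of bulk requests.
        with self._cv:
            while not self._heap:
                self._cv.wait()
            _, _, fn = heapq.heappop(self._heap)
        return fn()
```

A deadline monitor could then watch real-time latencies and grow the worker count when deadlines slip, as the slide describes.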