2. Agenda
• Lightning talks / community announcements
• Main session
• Bier @ Feierabend - 422 Yale Ave North
• Hashtags #Seattle #Hadoop
3. GigaSpaces:
• DeWayne's talk will cover joining a real-time service/data fabric with NoSQL big data to create a complete, linearly scalable solution supporting analytics, complex event processing, and reporting in both real-time and batch domains.
4. Expedia (Cassandra):
• Todd's session: Expedia needs the ability
to search by price in a fast and efficient
manner. Prices are complex objects
containing base rate, taxes, fees, etc.,
which means a calculation is required to
determine the customer price. This
makes searching by price difficult. What
to do?
6. Building An Elastic Real Time NoSQL Platform
Creating a platform for unlimited elastic
computation power and storage
7. Motivation
• Complete elastic solution stack
• Applications that need massive “strategic” storage
(disk-based NoSQL) and a real time (“tactical”)
component
• Horizontally and vertically scalable
• Highly available
• Self healing
• Fault tolerant: suitable for a commodity-hardware strategy
• Simplified management and monitoring compared to conventional multi-product solutions
© Copyright 2011 GigaSpaces Ltd. All Rights Reserved
8. What Is Real-Time?
• In this context, it means “really fast”.
• Reads as low as 5 μs, and typically under 1 ms for a fully replicated write.
Source: http://blog.gigaspaces.com/2010/12/06/possible-impossibility-the-race-to-zero-latency/
9. Two Layer Approach
• Advantage: minimal “impedance mismatch” between layers.
– Both are NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in-memory cache for interactive requests.
• Grid layer serves as a real-time computation fabric for CEP, and a limited (to allocated memory) real-time map/reduce capability.
[Diagram: raw event streams feed real-time events into an In Memory Compute Cluster with a Reporting Engine; raw and derived events flow down to a NoSQL Cluster; both layers scale horizontally]
10. Two Layer Approach (continued)
• Grid layer doing CEP can act as a filter: many raw events are converted into fewer semantic/business events, reducing meaningless data volume
• Grid layer provides scalable messaging
• NoSQL layer provides unlimited cheap storage
on commodity hardware
• NoSQL layer provides virtually unlimited, scalable processing power
11. Basics Of In Memory DataGrid
Technology
• An In Memory Data Grid (IMDG) is a data store
• Grid just means “cluster”
• Data can be partitioned across cluster nodes
• Processing power near data storage
• Distributed hash table
• Application optimized data model denormalization
• Nodes are typically configured with one or more replicas (sound familiar yet?)
• Not just a “cache”: it is a system of record, but it can also be used as a cache, or both
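The distributed-hash-table idea above can be sketched in a few lines. Python is used purely for illustration (GigaSpaces itself is Java-based), and the `partition_for`/`put`/`get` names are invented, not a real grid API:

```python
# Minimal sketch of hash-based partitioning in an in-memory data grid.
# Each "node" is just a dict; a real grid adds replicas, failover,
# co-located code execution, and much more.

N_PARTITIONS = 4
nodes = [{} for _ in range(N_PARTITIONS)]

def partition_for(key):
    """Route a key to its owning partition (stable within a run)."""
    return hash(key) % N_PARTITIONS

def put(key, value):
    nodes[partition_for(key)][key] = value

def get(key):
    return nodes[partition_for(key)].get(key)

put("hotel:1234", {"name": "Hilton Maui"})
assert get("hotel:1234")["name"] == "Hilton Maui"
```

Because any logic for a key runs on the node `partition_for` selects, processing power sits next to the data it operates on, which is the "processing power near data storage" point above.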
12. Advanced Capabilities
• Business logic (code) co-resident with data shards
• Scalable messaging
• Dynamic code execution across cluster
• Multi-language support
• Object-oriented
• Document-oriented/schema free
• Multi-level indexing
• SQL Queries
• Full ACID transaction support
• Elastic scaling (automatic and manual)
• Write-behind persistence
13. Features: IMDG vs NoSQL
Data Grid:
• Low Latency
• Horizontally Scalable
• Code co-location
• Service remoting
• Parallel Execution
• Fault Tolerant
• Cloud enabled
• Transactional
• Highly Available
• Elastic
• Messaging
• Platform Independent
• Complex Event Processing
Disk-Based NoSQL:
• Eventual/Tunable Consistency
• Unlimited scale
• Hadoop tools
• Flexible Schema
14. Vive La Différence
• The IMDG complements a NoSQL store:
– Can serve as a short term request cache (side cache or
inline)
– Can serve as a cache for MR results
– Enables event driven architectures / CEP
– In memory map/reduce
– Very fast writes, regardless of NoSQL store
– Transactional layer: can essentially turn “eventual” consistency into fully transactional persistence without a performance hit
– Highly available and independently scalable
15. A Complete Scalable Application Platform
[Diagram: multiple raw event streams feed real-time events into an In Memory Compute Cluster with a Reporting Engine; raw and derived events flow down to a NoSQL Cluster; both layers scale horizontally]
16. Key Implementation Issues
• Grid must support reliable asynchronous persistence
– If not reliable: in-flight data is at risk. Ideally tunable to
accommodate differing risk tolerance.
– If not asynchronous: too slow
– If not persistent: obviously nothing gets sent to disk
• To do more than a distributed cache, grid must support
code and data partitioning
– Ideally, code is co-located in memory with the data partition
– Needed to support CEP, application, and service remoting
capabilities
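The reliable-asynchronous persistence requirement above is what write-behind caching provides. A minimal sketch, assuming a queue between the in-memory grid and the backing store; `WriteBehindGrid` and its methods are invented names for illustration, not a GigaSpaces API:

```python
# Sketch of write-behind (asynchronous) persistence: writes complete
# in memory immediately and are flushed to the backing store by a
# background thread, so callers never wait on disk.
import queue
import threading

class WriteBehindGrid:
    def __init__(self, backing_store, maxsize=10_000):
        self.memory = {}
        self.backing_store = backing_store
        # Bounded queue: pending writes that have not yet reached the store.
        self.pending = queue.Queue(maxsize=maxsize)
        threading.Thread(target=self._flusher, daemon=True).start()

    def put(self, key, value):
        self.memory[key] = value        # fast in-memory write
        self.pending.put((key, value))  # persisted asynchronously

    def _flusher(self):
        while True:
            key, value = self.pending.get()
            self.backing_store[key] = value  # e.g. a NoSQL write
            self.pending.task_done()

store = {}  # plain dict standing in for the NoSQL layer
grid = WriteBehindGrid(store)
grid.put("k1", "v1")
grid.pending.join()  # wait for the flush (for demonstration only)
assert store["k1"] == "v1"
```

The "reliable" part in a real grid comes from replicating the pending queue itself, so in-flight writes survive a node failure; that is elided here.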
17. Key Implementation Issues
• Grid ideally supports FIFO entry ordering
– Key to using grid as a queue
– Key to scaling messaging without an additional tier
– Combined with co-located business logic, operates at memory
speeds
• Write speed on the NoSQL layer
– Grid is, in effect, queuing entries to the NoSQL layer
– If the NoSQL layer cannot keep up, in memory grid backs up
– This behavior is an asset, unless an unanticipated, sustained
flood occurs.
– The faster the write speed the better
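The "grid backs up" behavior above can be illustrated with a bounded FIFO buffer between the grid and a slow NoSQL writer; the tiny bound and event names are purely for demonstration:

```python
# When the NoSQL writer falls behind, the buffer fills and producers
# feel backpressure instead of losing data; FIFO ordering is preserved,
# which is what lets the grid double as a queue.
import queue

buf = queue.Queue(maxsize=3)  # tiny bound, for demonstration only

# Producer (grid) enqueues entries in FIFO order.
for i in range(3):
    buf.put_nowait(f"event-{i}")

# A fourth write cannot be buffered until the consumer catches up.
try:
    buf.put_nowait("event-3")
    overflowed = False
except queue.Full:
    overflowed = True
assert overflowed  # the grid "backs up" rather than dropping data

# Consumer (NoSQL writer) drains in arrival order.
assert buf.get_nowait() == "event-0"  # FIFO preserved
```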
18. Use Case 1 – Event Cloud
• Complex event processing
• Collect events in real time: interactions, orders, bills, payments, activations, …
• Transform into decision factors: good customer, pays 3-6 days early, decreasing usage, missed payment, unusual bill, app usage
• Original events, possibly scrubbed or annotated, are passed through
• Business-logic-derived “synthetic events” are constructed from the raw event stream; possible rule engine integration (e.g. Drools)
• Derived events and analytics are passed on to the NoSQL layer
• Other events are forwarded to external listeners and systems
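The raw-to-synthetic transformation above can be sketched as a stream transform; the missed-payment rule below is an invented example of a decision factor, not actual business logic from the talk:

```python
# Sketch of the CEP "filter" idea: many raw events in, fewer derived
# business events out. Three missed payments for the same customer
# produce one synthetic "customer_at_risk" event (invented rule).
from collections import Counter

def derive_events(raw_events):
    missed = Counter()
    for ev in raw_events:
        if ev["type"] == "missed_payment":
            missed[ev["customer"]] += 1
            if missed[ev["customer"]] == 3:
                yield {"type": "customer_at_risk",
                       "customer": ev["customer"]}

raw = [{"type": "missed_payment", "customer": "c1"}] * 3 \
    + [{"type": "payment", "customer": "c2"}] * 100

derived = list(derive_events(raw))
assert derived == [{"type": "customer_at_risk", "customer": "c1"}]
```

103 raw events reduce to one business event, which is the filtering effect described on the "Two Layer Approach" slide. A rule engine such as Drools would replace the hand-written condition.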
19. Use Case 2 – Time Bounded
• Time Bounded – suited to operations with daily business cycle (e.g.
trading)
• Current day (or other time period that will fit in memory) held in
memory, along with related application state, caching etc…
• Operations still stream to the underlying NoSQL platform, or are held for an end-of-day flush if the back end can’t write fast enough.
• Supports application hosting, messaging, and complex event
processing.
• External clients are aware of “current day” store, vs archival.
• Large scale reports/analytics run in background on NoSQL archive.
20. Use Case 3 - LRU
• Grid holds a subset of NoSQL store, and
supports an LRU caching model.
• In line or side-cache.
• Appropriate only where, as with any cache, the usage pattern does not generate many cache misses.
• Still supports CEP, messaging, and
computation scaling (provided grid product
supports it).
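A minimal side-cache sketch of this use case, assuming an `OrderedDict`-based LRU in front of a plain dict standing in for the NoSQL store; `LruSideCache` is an invented name, and real grid products manage eviction, replication, and memory limits for you:

```python
# Side-cache pattern: the client checks the grid first and falls back
# to the NoSQL store on a miss, evicting the least recently used entry
# when the cache exceeds capacity.
from collections import OrderedDict

class LruSideCache:
    def __init__(self, backing_store, capacity=2):
        self.store = backing_store
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)    # hit: mark most recently used
            return self.cache[key]
        value = self.store[key]            # miss: read the NoSQL store
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

store = {"a": 1, "b": 2, "c": 3}
cache = LruSideCache(store)
cache.get("a"); cache.get("b"); cache.get("a"); cache.get("c")
assert "b" not in cache.cache  # "b" was least recently used, evicted
```

As the slide warns, this only pays off when the access pattern keeps misses rare; a random pattern churns the cache.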
21. Wishlist
• This platform concept is still at an early stage
• For Gigaspaces, integrations already exist for Cassandra
and MongoDB.
• Customers are currently implementing solutions
• Stuff I’d like to see:
– Unified management and scaling. Shared infrastructure.
– A grid/NoSQL-aware Hive façade that can run MR jobs on both; perhaps integration with other Hadoop tools
– Deeper integration, to further optimize write speed/capacity, and perhaps offload some in-memory aspects of the underlying NoSQL platform to minimize duplication and improve elasticity.
22. Conclusion
• Two shared nothing “NoSQL” architectures
complementing each other
• Fully elastic/scalable
• Ultra high performance/low latency combined
with unlimited scale.
• Full application stack
• Highly reliable and self-healing
• Scalable complex event handling
• Multi-language
• Simple. Two products.
24. DataStax is the company behind Apache Cassandra. Besides
contributing the majority of the code for the open source
project, DataStax also provides products and services for
Apache Cassandra
DataStax Community – the 100% free way to get started with Apache
Cassandra (free management software and packaging!)
DataStax Enterprise – Hadoop Analytics and Support!
Download + Docs at http://www.datastax.com/dev
26. Who Am I?
• B. Todd Burruss – Sr Architect, Expedia
• Worked with Cassandra for nearly 2 years
• Committer on Hector (Java Client)
• General testing on Cassandra and working with the community, but not a committer
27. Expedia’s Motivation
We need the ability to search by price in a fast
and efficient manner. Prices are complex
objects containing base rate, taxes, fees, etc.,
which means a calculation is required to
determine the customer price. This makes
searching by price difficult. What to do?
28. What to Do?
• Precalculate Total Price! Let’s look at Hotels
• Hotel prices vary based on Date, Length of
Stay (LOS), and Number of Adult Travelers (AT)
• Customers book in advance so must have
prices fairly far into the future (1 year)
• Approximately 140,000 hotels in our inventory
• Support 1-14 LOS and 1-4 AT
• Over 2 billion prices!
29. Example of Hotel Pricing
• A customer’s family of 4 wants to stay 7 nights
at the Hilton in Maui, checkin on 12/1/2011,
checkout on 12/8/2011
• Each night could be a different rate because of
day of week, conference in the area, holiday,
etc.
• So must sum the rate, taxes and fees for each
night to get the total room price
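The nightly sum described above can be sketched directly; the rates, taxes, and fees below are made-up figures, not real Hilton Maui pricing:

```python
# Total room price = sum of each night's rate + taxes + fees.
# Each night can carry a different rate (day of week, holiday, etc.),
# which is why the total must be computed per stay.
nights = [
    {"rate": 289.00, "taxes": 40.46, "fees": 25.00},  # night 1
    {"rate": 329.00, "taxes": 46.06, "fees": 25.00},  # night 2 (weekend)
    {"rate": 289.00, "taxes": 40.46, "fees": 25.00},  # night 3
]

total = sum(n["rate"] + n["taxes"] + n["fees"] for n in nights)
assert round(total, 2) == 1108.98
```

Doing this per-request for every hotel in a result page is what makes live search-by-price expensive, hence the precalculation strategy on the previous slide.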
30. Use Case : Median Price
• Ex: What is the median hotel price in Seattle
for each day between 11/1 and 11/30?
• 200 * 30 = 6,000 prices returned from
Cassandra – median calculated on client.
• Idea is customer searches city and date
range, then narrows search to smaller area
and dates
• Prices are volatile, so want close to real-time
updates
31. Enter Cassandra : Expectations
• Cassandra can handle large amounts of data
nicely : billions of price objects
• Cassandra is very fast (reads and writes), and can handle the volatile prices
• Cluster expands easily – our dataset is growing
• Easy to setup, administer and use
• Operational costs are good
• Support is available
32. Solution : Data Model
• 1 ColumnFamily : Prices
• Row key : date + LOS + AT
• Column name : Hotel ID – 140,000 columns
(integer comparator)
• Column value : precalculated hotel price for date
+ LOS + AT
• 365 * 14 * 4 = 20,440 row keys
• 20,440 * 140,000 = 2,861,600,000 price objects
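The row-key scheme and the dataset arithmetic above can be checked in a few lines; the `date_LOS_AT` key format is an assumed serialization, since the slide does not specify one:

```python
# One row per (checkin date, LOS, AT) combination, one column per
# hotel. 365 days * 14 LOS values * 4 AT values = 20,440 row keys;
# times 140,000 hotel columns = ~2.86 billion price objects.
from datetime import date, timedelta

DAYS, MAX_LOS, MAX_AT, HOTELS = 365, 14, 4, 140_000

def row_key(checkin, los, at):
    return f"{checkin.isoformat()}_{los}_{at}"  # assumed format

start = date(2011, 12, 1)
keys = [row_key(start + timedelta(days=d), los, at)
        for d in range(DAYS)
        for los in range(1, MAX_LOS + 1)
        for at in range(1, MAX_AT + 1)]

assert len(keys) == 20_440                  # row keys
assert len(keys) * HOTELS == 2_861_600_000  # price objects
```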
33. Solution : Retrieving Prices
• Generate keys for each checkin day, LOS, AT
combination wanted
• Query Cassandra using the generated keys,
using specific column names (hotel IDs)
• For the family example: one key, one column = 12/1/2011 + 7 + 4 = total price for the Hilton hotel
• For the median example: 30 keys, 200 columns per key. The client receives 30 result rows, then calculates the median for each row
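The median retrieval path can be sketched end to end; the price values below are random stand-ins for what Cassandra would actually return:

```python
# 30 row keys (one per checkin day in November), 200 hotel columns
# per key; the client computes one median per day from the results.
import random
import statistics

random.seed(42)
# Simulated Cassandra result set: {row_key: [price per hotel]}
results = {
    f"2011-11-{day:02d}_1_2": [round(random.uniform(80, 400), 2)
                               for _ in range(200)]
    for day in range(1, 31)
}

medians = {key: statistics.median(prices)
           for key, prices in results.items()}

assert len(medians) == 30                              # one per day
assert sum(len(p) for p in results.values()) == 6000   # prices fetched
```

This matches the 200 * 30 = 6,000 figure on the "Use Case : Median Price" slide, with the median computed client-side.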
34. Testing Scenario
• Found 19 boxes, each with 16 GB of older RAM and 1 old 4-core CPU
• 18th and 19th boxes are clients + Cassandra
servers (don’t do this in prod)
• Can never find enough hardware :)
• 2 Keyspaces on cluster
• Reduce dataset to 90 days, up to 7 day LOS, up to
4 AT, 70k hotels – removes disk I/O from test and
leaves some RAM for caching
• We believe our hot data will be in RAM
• Query 30 days and 200 hotels : 6000 price objects
35. Results : Page 1
Default Memory and Column Index Settings
• ~50ms : -Xmn400m, 64k index, 8gb, no row
cache, no key cache : pretty good
• ~800ms : -Xmn400m, 64k index, 8gb, no key
cache, 600 row cache (Serializing) – copying to
heap accounts for slowness
36. Results : Page 2
Change Index to 1k Column Pages:
• ~45ms : -Xmn400m, 1k index, 8gb, no row
cache, no key cache
• ~29ms : -Xmn400m, 1k index, 8gb, 600 row
cache (ConcurrentLinkedHash)
The 1k index saves a little, but the data is all in RAM; the bigger savings come when hitting disk
37. Results : Page 3
Tune Memory
• ~45ms : -Xmn200m, 1k index, 8gb, no row cache,
no key cache
• ~29ms : -Xmn200m, 1k index, 8gb, 600 row cache
(ConcurrentLinkedHash)
Increasing Old Gen will not help because all data fits
in RAM, based on reported JVM usage. Reducing
New Gen moves from less frequent long pauses to
more frequent short pauses. No help.
38. Take Away
• The test is a worst-case scenario: a completely random usage pattern, which is rarely (if ever) the case in production, and which causes cache churn if the cache is too small
• Wide rows are not always bad. Accessing columns sequentially or by range is very good (e.g. time series data)
• The serializing cache trades off between keeping serialized objects off-heap and the cost of copying to/from off-heap storage
39. References
• Query plan description by Aaron Morton: http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
• Disk sizing by B. Todd Burruss (me):
http://btoddb-cass-storage.blogspot.com/