'www.NoSQL.com' by Mark Addy
The talk, split into two sections and drawing on a real implementation, covers potential NoSQL architectures and the features they offer in the quest for the Holy Grail, AKA linear scalability. First we examine an existing traditional replicated cache running at full capacity and how replacing it with a distributed solution allowed the application to scale dramatically, with a performance improvement to boot. Secondly we look at one of the pitfalls of distributed caching and how, using an essential tool from the Data Grid functionality armoury, namely grid execution, we can provide massive scale-out and low latencies. Both parts describe a real-world implementation delivered for a global online travel agency using Infinispan. Level: All levels; the talk is aimed at developers, architects and middleware specialists interested in caching solutions. Focus: Use-cases including technical detail.
Two parts to this presentation: Part 1 – Replacement of replicated EhCache with distributed Infinispan. Part 2 – Using Infinispan's distributed execution framework to enable a new project to meet SLAs.
Accommodation and associated add-ons – excursions, day trips, food etc. Content supplied from their own 'catalogue', plus data supplied from third parties to supplement offerings. USP is response time – they are quicker than their competitors!
Existing relationship:
- JBoss workshop
- Troubleshooting JBoss Messaging
- Troubleshooting Drools
- Diagnosis and fixes for memory leaks
- Diagnosis and fixes for 100% CPU race conditions
Connectivity Engine Responsible for retrieving third party offerings for presentation in the local site
Above the line is local to the customer; below the line is over the WWW to third party content providers. Third parties may then make further trips over the WWW to the source hotel chains for data. NOTE that the third parties also attempt to cache the remote hotel chain data locally.
Third party systems may have periods of downtime. Network latency is associated with trips to third parties. Local site content relies on third party content to provide end user choice. Response time is critical – studies show that users will navigate away. More choice + low response times == more profit.
Cache for third party data:
- Saves on remote network trips
- Faster response time
- NOT Hibernate – hand-cranked
- Local database consulted to determine which third party providers have content for a search
Spring Interceptors:
- Separation of caching code from business logic
- Spring configuration enables/disables caching
- Ops team have control via external spring-config.xml
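A minimal sketch of such a cache-aside interceptor – this is illustrative, not the customer's actual code, and assumes an AOP Alliance MethodInterceptor with a naive key built from the method name and arguments:

import java.util.Arrays;
import java.util.concurrent.ConcurrentMap;

import org.aopalliance.intercept.MethodInterceptor;
import org.aopalliance.intercept.MethodInvocation;

// Hypothetical cache-aside interceptor: check the cache before invoking the
// real (expensive) third-party lookup, and cache the result on a miss.
public class CachingInterceptor implements MethodInterceptor {

    private final ConcurrentMap<Object, Object> cache;
    private boolean enabled = true; // toggled from the external spring-config.xml

    public CachingInterceptor(ConcurrentMap<Object, Object> cache) {
        this.cache = cache;
    }

    public void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    @Override
    public Object invoke(MethodInvocation invocation) throws Throwable {
        if (!enabled) {
            return invocation.proceed();
        }
        // Build a key from the intercepted method and its arguments
        Object key = invocation.getMethod().getName() + Arrays.deepToString(invocation.getArguments());
        Object cached = cache.get(key);
        if (cached != null) {
            return cached; // cache hit - no remote trip
        }
        Object result = invocation.proceed(); // cache miss - call the real service
        if (result != null) {
            cache.put(key, result);
        }
        return result;
    }
}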
Existing cache:
- Mixture of local and replicated
- 10 application nodes == 10x copies of data
- Stateless requests
- Unable to increase max entries – huge full GC pauses
- Would like to make all caches replicated for redundancy and to reduce latency
Ideally heap should be no more than 4G. Elements are evicted prior to natural expiry, which results in more trips to third parties. Eviction and expiry occur in the client thread, slowing down data retrieval. Cache contains many stale elements.
Unable to effectively cache all data from existing third parties More providers coming on-line More hotel offerings
This is what we said we would do!
Terracotta – simple XML configuration to extend existing EhCache; byte code enhancement; Terracotta server array – static hash, restart to scale. Coherence – Rolls-Royce grid – very expensive – full features including MapReduce. Infinispan – free, cutting edge, open source – since start of 2010.
First stage is to replicate the live system. Must be able to repeatedly simulate live for comparison. YOU CAN'T BENCHMARK AGAINST 3rd PARTIES.
Replicate LIVE infrastructure – this took some persuasion – tried small virtualised instances – no go.
Port mirroring – this is the most effective way to capture REAL traffic; simulating use-cases manually is not effective enough.
JMeter to replay live traffic – captured traffic converted to files, loaded by a custom JMeter app.
Mock Spring beans – can't match captured requests to captured responses.
Multiple JMeter instances – protect against overloading JMeter and this becoming a bottleneck. Index.php picks a response at random – mocked application Spring beans process it as if it were a matching (valid) response.
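Roughly how such a mock might look – an illustrative sketch, not the actual harness; the class and method names are hypothetical:

import java.util.List;
import java.util.Random;

// Hypothetical mock of the Spring bean that normally calls a third party.
// Instead of going over the WWW it returns a previously captured response
// at random; the application then processes it as if it were a matching
// (valid) reply. In the real harness this would implement the same
// interface as the production client bean.
public class MockThirdPartyClient {

    private final List<String> cannedResponses; // loaded from the captured-traffic files
    private final Random random = new Random();

    public MockThirdPartyClient(List<String> cannedResponses) {
        this.cannedResponses = cannedResponses;
    }

    public String fetchOffers(String request) {
        return cannedResponses.get(random.nextInt(cannedResponses.size()));
    }
}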
Get a good understanding of what's in the cache: EhCache Monitor – not great for large caches; EhCache JMX; heap dump & OQL.
Deep visibility into the properties of your cached data: size, time to live, time to idle, data properties etc.
Cache Manager statistics:
- Current cluster view
- JGroups transport statistics – bytes transferred / received etc.
Cache statistics:
- Hits / misses
- Current count
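The same hit/miss/entry-count numbers can also be pulled programmatically – a minimal sketch, roughly the 4.x/5.x API, assuming statistics are enabled and with an illustrative config file and cache name:

import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.stats.Stats;

// Dump the figures the JMX MBeans expose (requires statistics enabled for the cache).
public class CacheStatsDump {
    public static void main(String[] args) throws Exception {
        DefaultCacheManager manager = new DefaultCacheManager("infinispan.xml"); // hypothetical config file
        Cache<String, Object> cache = manager.getCache("thirdPartyData");        // hypothetical cache name

        Stats stats = cache.getAdvancedCache().getStats();
        System.out.println("hits    = " + stats.getHits());
        System.out.println("misses  = " + stats.getMisses());
        System.out.println("entries = " + stats.getCurrentNumberOfEntries());

        manager.stop();
    }
}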
Enable attachment of a remote JMX client – JConsole, JVisualVM.
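The standard JVM flags for this (unauthenticated, non-SSL – lab use only; the port is illustrative) are, for example:

  -Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.port=9999
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.ssl=false

Infinispan also needs its own statistics exposed – roughly, a <globalJmxStatistics enabled="true"/> element globally and <jmxStatistics enabled="true"/> per named cache in the 4.x/5.x XML schema.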
RHQ – free monitoring tool. Infinispan is distributed with an RHQ plugin; the plugin collects metrics and is customisable. RHQ collects platform statistics and allows historical comparisons. Graph shows Full GC time per minute (PS MarkSweep).
We started out on the latest FINAL release, 4.2.1. Went live on 5.0.0.CR8.
Explain distributed mode. Dynamic scaling with numOwners=2. Redundancy = 2 copies. Hash of the KEY determines which cache nodes hold the data.
Summarises the last slide. numOwners = level of redundancy. Dynamic scaling = automatic rebalancing of data for distribution and redundancy recovery. Overhead of replication is constant no matter how many nodes are added.
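A minimal sketch of the distributed-mode configuration, roughly along the lines of the 4.x/5.x XML schema (the cache name is illustrative):

  <namedCache name="thirdPartyData">
    <clustering mode="distribution">
      <hash numOwners="2"/>
    </clustering>
  </namedCache>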
Other client-server architectures:
- REST (HTTP) – NO smart routing; dynamic scaling would involve a HW / SW load balancer; assemble HTTP requests yourself
- Memcached – NO smart routing, NO dynamic scaling
- WebSocket – experimental JavaScript API websocket implementation
Hot Rod:
- 100% Java client
- Connection pooling based on commons-pool (NOTE this may be removed)
- Dynamic scaling – cluster topology returned to clients as part of the request
- Smart routing – clients are "aware" of the hashing algorithm used to distribute data between server nodes so can route requests appropriately
- Cache and application memory have distinct and separate memory / CPU requirements
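A minimal Hot Rod client sketch, roughly the 5.x-era API – the server list, property usage and cache name are illustrative:

import java.util.Properties;

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;

// The client receives cluster topology updates with responses and hashes keys
// itself, so requests go straight to the server node that owns the key.
public class HotRodClientExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative server list - host:port pairs separated by semicolons
        props.put("infinispan.client.hotrod.server_list", "cache1.example.com:11222;cache2.example.com:11222");

        RemoteCacheManager manager = new RemoteCacheManager(props);
        RemoteCache<String, Object> cache = manager.getCache("thirdPartyData"); // hypothetical cache name

        cache.put("hotel:1234:rates", "...");   // routed to the node that owns this key
        Object rates = cache.get("hotel:1234:rates");

        manager.stop();
    }
}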
Graph from Sun's JVM 1.5 performance tuning. Discuss why it's better to separate the cache and application JVMs.
Visually, this is how the Hot Rod architecture looks.
Option of defining machine, rack and site IDs as part of the Infinispan configuration. Doing this improves redundancy – machine failure OR rack failure OR site failure should not result in data loss. Didn't work in 4.2.1 – we upgraded.
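Server hinting is declared on the transport element, roughly as below in the 5.x schema – the IDs and cluster name are illustrative:

  <global>
    <transport clusterName="ce-cluster"
               machineId="machine-01"
               rackId="rack-A"
               siteId="site-london"/>
  </global>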
Data owners are determined by hash-wheel locations. The standard hash algorithm doesn't give great data distribution. Virtual nodes subdivide the hash-wheel positions == better distribution. NOW... the hash space spans MIN_INT -> MAX_INT and virtual nodes are turned ON by default (40???).
RHQ graph shows the improvement in distribution with more virtual nodes. We found 20 gave the best results.
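Virtual nodes are configured on the same hash element, roughly (5.x-era attribute name, value per the finding above):

  <hash numOwners="2" numVirtualNodes="20"/>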
Dynamic Scaling and Smart Routing hotRodTopologyCache Internal cache, fully replicated
Talk through SPLIT BRAIN New servers only added to ___hotRodTopologyCache on start-up Restart failed Nodes
Requires careful management Wasn't appropriate for this customer We have other customers that are happy to manage these “features”
Hot Rod wasn't appropriate for this customer. We started looking at what we could achieve with embedded mode.
More trouble. Dynamic scaling – disable rehash / re-balancing. It still generates a new hash on leaves / joins but with no data movement, so you get a small "loss" of data – it becomes unreachable when the new positions on the hash wheel are determined. NBST – cf. GemFire / Coherence. Manual / on-demand rehashing is ESSENTIAL. E.g. adding 10 new nodes to a 10 node cluster == 10 rehashes!
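Switching rehash off was, roughly, another attribute on the hash element – treat this as a sketch, since the exact attribute name varied across 5.x releases:

  <hash numOwners="2" numVirtualNodes="20" rehashEnabled="false"/>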
Embedded mode compromises JVM heap size.
JBoss Marshalling – Infinispan's standard method for serializing onto the wire; used in Hot Rod.
JBoss Serialization – very simple, high-performance serialization.
Coded both ways with a switch to turn on:
- JBM – better compression, slightly worse performance
- JBS – slightly worse compression, slightly better performance
Compression is not suitable for all use-cases – it requires additional CPU. Test before assuming it is the solution. Best for low-frequency, large-item access.
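For illustration only (plain JDK serialization plus GZIP, not the JBoss Marshalling / JBoss Serialization switch described above) – compressing a value before caching trades CPU for heap and wire size, which only pays off for large, infrequently accessed items:

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPOutputStream;

public class ValueCompressor {

    // Serialize a value and GZIP it into a byte[] ready to be cached.
    public static byte[] compress(Serializable value) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(bytes);
        ObjectOutputStream out = new ObjectOutputStream(gzip);
        out.writeObject(value);   // serialize
        out.close();              // flushes and finishes the GZIP stream
        return bytes.toByteArray();
    }
}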
org.infinispan.Cache implements java.util.Map
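So embedded usage is just map-style access – a minimal sketch, with an illustrative config file and cache name:

import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;

// Minimal embedded-mode sketch: the cache starts inside the application JVM
// from an XML configuration and is used like a java.util.Map.
public class EmbeddedCacheExample {
    public static void main(String[] args) throws Exception {
        DefaultCacheManager manager = new DefaultCacheManager("infinispan.xml"); // hypothetical config file
        Cache<String, Object> cache = manager.getCache("hotelPricing");          // hypothetical cache name

        cache.put("hotel:1234:rates", "...");
        Object rates = cache.get("hotel:1234:rates");

        manager.stop();
    }
}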
Failure detection based on simple heartbeat protocol. Every member periodically multicasts a heartbeat. Every member also maintains a table of all members (minus itself). When data or a heartbeat from P are received, we reset the timestamp for P to the current time. Periodically, we check for expired members, and suspect those. Example: <FD_ALL interval="3000" timeout="10000"/> In the example above, we send a heartbeat every 3 seconds and suspect members if we haven't received a heartbeat (or traffic) for more than 10 seconds. Note that since we check the timestamps every 'interval' milliseconds, we will suspect a member after roughly 4 * 3s == 12 seconds. If we set the timeout to 8500, then we would suspect a member after 3 * 3 secs == 9 seconds.
Not everything needs to be request-facing. Increase capacity further by using non-request-facing nodes. Helps keep the JVM size as small as possible.
Brand new application – Infinispan-based from the start!!!
Controller == Router & Aggregator – determines which engine will generate results and aggregates the results.
Same caching architecture as before:
- Not Hibernate – manual, Spring AspectJ interceptor based
- Cache expensive hotel pricing entries built from the local DB
- NO THIRD PARTY DATA
Belief was that the same architecture used for the CE could be used again. Then they called us...
Distributed mode blew the network bandwidth – all the code sat in org.jgroups when we took thread dumps. To meet SLAs (i.e. as fast as possible) we ran Infinispan locally with a 14G heap, but that gave unacceptable pauses.
First we analyzed the cached data: RATES entries account for very few of the records but most of the size.
Next we re-architected the caches: numerous very small elements == LOCAL; few, very large elements == DISTRIBUTED. Still problems though...
Explain why distributed mode doesn't work for large data items.
Rewrite the application caching logic – late in the day, very expensive in development cost, and you don't want to lose any data richness.
Oversized heap – CMS is a dangerous, last-resort GC algorithm, expensive on CPU, requires very careful monitoring and management, and is very expensive in terms of hardware. Some financial platforms do this!!!!!!!!!!!
Go back to the DB – response times suffer, database performance suffers.
Without distributed execution / MapReduce a data grid is just a big cache...
Meaning: if one's will does not prevail, one must submit to an alternative.
Origin: the full phrase 'If the mountain will not come to Muhammad, then Muhammad must go to the mountain' arises from the story of Muhammad, as retold by Francis Bacon in Essays, 1625: "Mahomet cald the Hill to come to him. And when the Hill stood still, he was neuer a whit abashed, but said; If the Hill will not come to Mahomet, Mahomet wil go to the hil." Present uses of the phrase usually use the word 'mountain' rather than 'hill', and this version appeared soon after Bacon's Essays, in a work by John Owen, 1643: "If the mountaine will not come to Mahomet, Mahomet will goe to the mountaine." The early citations use various spellings of the name of the founder of the Islamic religion – Muhammad, Mahomet, Mohammed, Muhammed etc.
Implements java.util.concurrent.ExecutorService
How much code do I need to write?? Not much!! Make use of the existing AspectJ cache interceptor. The task runs on the data-owning node – if the data isn't there then it retrieves and caches it as part of the task. Return just what you need.
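A rough sketch of what that looks like with the 5.x distributed execution API – the task, key and cache names are illustrative and error handling is omitted:

import java.io.Serializable;
import java.util.Set;

import org.infinispan.Cache;
import org.infinispan.distexec.DistributedCallable;

// Sketch: ship the work to the node that owns the (large) RATES entry so only
// the small computed result crosses the wire, not the whole cached value.
public class PriceTask implements DistributedCallable<String, Object, Double>, Serializable {

    private transient Cache<String, Object> cache;
    private transient Set<String> keys;

    @Override
    public void setEnvironment(Cache<String, Object> cache, Set<String> inputKeys) {
        this.cache = cache;   // the cache as seen on the executing (data-owning) node
        this.keys = inputKeys;
    }

    @Override
    public Double call() throws Exception {
        Object rates = cache.get(keys.iterator().next()); // local read on the owner (retrieve-and-cache on a miss)
        // ... calculate the price from the large RATES record ...
        return 0.0; // return just what you need
    }
}

// Submitting it, keyed so it runs where the data lives:
//   org.infinispan.distexec.DistributedExecutorService des =
//       new org.infinispan.distexec.DefaultExecutorService(cache);
//   java.util.concurrent.Future<Double> price = des.submit(new PriceTask(), "hotel:1234:rates");
//   Double result = price.get();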
Concurrent execution Collate results using future.get() Handle failures yourself