ScaleOut technical session at Cloud Expo 2013 in NY. Covers the use of in-memory data grids for real-time analysis of fast-changing data. Includes a financial services example.
2. 2 ScaleOut Software, Inc.
• What is an In-Memory Data Grid (IMDG)?
• Top Benefits of IMDGs
• The Need for Real-Time Analytics
• Example: A Platform for Managing Hedging Strategies
• Using an IMDG to Perform Real-Time Analysis
• Benchmark Results
• Integrating an IMDG into Hadoop
Agenda
3.
• Dr. Mikhail Sobolev, Lead Java Architect
• Ph.D. from Moscow Institute of Physics and Technology
• Research and consulting focus in parallel computing
• Responsible for development of scalable software services in Java
• David Brinker, COO
• 20 years software business and executive management experience
• Mentor Graphics, Cadence, Webridge
• Company: ScaleOut Software
• Develops and markets IMDG products
• Founded in September 2003
• Offices in Bellevue, WA and Beaverton, OR
• Eight years of market experience on Windows and Linux
About the Speakers
4.
• ScaleOut StateServer®
• Flagship product
• IMDG middleware for Windows
and Linux
• Industry-leading performance and ease of use
• ScaleOut GeoServer® adds
• WAN based data replication for DR
• Breakthrough technology for global
data access
• ScaleOut Analytics Server™ adds
• Real-time data analysis for operational data
• Comprehensive management tools
• ScaleOut hServer™ adds
• First step toward real-time analytics for Hadoop
• Accelerates data access and execution.
ScaleOut Software Products
[Diagram: ScaleOut StateServer In-Memory Data Grid, shown as four Grid Service instances]
5.
In-memory storage for fast updates and retrieval of live data
• Fits in the business logic layer:
• Stores collections of Java/.NET
objects shared by multiple clients.
• Uses create/read/update/delete
and query APIs to access data.
• Implemented across a cluster of
servers or VMs:
• Scales storage and throughput
by adding servers.
• Provides high availability
in case a server fails.
What is an In-Memory Data Grid?
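The create/read/update/delete and query access pattern described on this slide can be sketched in a few lines. This is a toy, single-process illustration and not ScaleOut's API: the class `ToyGrid`, its method names, and the hash-partitioning scheme are all assumptions made for the example, and high availability (replication) is deliberately left out.

```python
from typing import Any, Callable, Dict, List, Optional


class ToyGrid:
    """Toy stand-in for an IMDG: each dict plays the role of one grid server."""

    def __init__(self, server_count: int) -> None:
        self.servers: List[Dict[str, Any]] = [{} for _ in range(server_count)]

    def _owner(self, key: str) -> Dict[str, Any]:
        # Hash partitioning (an assumption): every key maps to exactly one server.
        return self.servers[hash(key) % len(self.servers)]

    def create(self, key: str, value: Any) -> None:
        owner = self._owner(key)
        if key in owner:
            raise KeyError(f"{key!r} already exists")
        owner[key] = value

    def read(self, key: str) -> Optional[Any]:
        return self._owner(key).get(key)

    def update(self, key: str, value: Any) -> None:
        owner = self._owner(key)
        if key not in owner:
            raise KeyError(f"{key!r} not found")
        owner[key] = value

    def delete(self, key: str) -> None:
        self._owner(key).pop(key, None)

    def query(self, predicate: Callable[[Any], bool]) -> List[Any]:
        # A query visits every server, since matches may live anywhere.
        return [v for server in self.servers for v in server.values() if predicate(v)]


# Hypothetical usage: store one strategy object, update it, then query by property.
grid = ToyGrid(server_count=4)
grid.create("strategy:energy", {"sector": "energy", "holdings": 12})
grid.update("strategy:energy", {"sector": "energy", "holdings": 13})
energy = grid.query(lambda obj: obj["sector"] == "energy")
```

Adding a server to `servers` would raise capacity, which is the scaling idea on the slide; a real grid would also rebalance existing keys and keep replicas.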
6.
Scaling Data Access Using an IMDG
Example: Cloud-Hosted App
• Application runs as multiple virtual
servers (VS).
• Application instances store and
retrieve line-of-business (LOB) data from a
cloud-based file system or database.
• Applications need fast, scalable
storage for live data.
• In-memory data grid runs as
multiple virtual servers to provide
“elastic” in-memory storage for
live data.
7.
• As a “vertical” storage tier:
• Runs as middleware software.
• Adds missing storage layer to boost
performance.
• Uses out-of-process memory.
• Avoids repeated trips to a backing store.
Where IMDGs Are Deployed
[Diagram: two servers, each with processor cache, L2 cache, and “in-process” application memory, plus backing storage]
• As a “horizontal” storage tier:
• Allows data sharing among servers.
• Scales performance & capacity.
• Adds high availability.
• Can be used independently of backing
storage.
[Diagram: In-Memory Data Grid (“out-of-process”), shown for both servers]
8.
• IMDG incorporates a client-side in-process
cache (“near cache”):
• Transparent to the application
• Holds recently accessed data
• Boosts performance:
• Eliminates repeated network data transfers &
deserialization
• Reduces access times to near “in-process”
latency
• Is automatically updated if the grid is
updated
• Supports various coherency models
(coherent, polled, event-driven)
The Secret to Fast Access Time
[Diagram: application memory (“in-process”), client-side cache (“in-process”), in-memory data grid (“out-of-process”)]
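To make the near-cache idea concrete, here is a minimal sketch of a client-side cache that stays coherent by comparing version numbers before serving a local copy. The version-check scheme and the names `VersionedGrid` and `NearCache` are assumptions for illustration; the slide does not say how the product implements its coherency models.

```python
from typing import Any, Dict, Tuple


class VersionedGrid:
    """Stand-in for the out-of-process grid; counts full object transfers."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, Any]] = {}  # key -> (version, value)
        self.full_reads = 0

    def put(self, key: str, value: Any) -> None:
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)

    def version_of(self, key: str) -> int:
        # Cheap call: returns only a small integer, not the object.
        return self._data[key][0]

    def get(self, key: str) -> Tuple[int, Any]:
        # Expensive call: models network transfer plus deserialization.
        self.full_reads += 1
        return self._data[key]


class NearCache:
    """In-process cache that serves a local copy while its version is current."""

    def __init__(self, grid: VersionedGrid) -> None:
        self.grid = grid
        self._local: Dict[str, Tuple[int, Any]] = {}

    def read(self, key: str) -> Any:
        cached = self._local.get(key)
        if cached is not None and cached[0] == self.grid.version_of(key):
            return cached[1]  # hit: no full transfer needed
        entry = self.grid.get(key)  # miss or stale: fetch and remember
        self._local[key] = entry
        return entry[1]


grid = VersionedGrid()
grid.put("quote:AAA", 10.0)
cache = NearCache(grid)
first = cache.read("quote:AAA")    # full read from the grid
second = cache.read("quote:AAA")   # served from the near cache
grid.put("quote:AAA", 11.0)        # grid update makes the local copy stale
third = cache.read("quote:AAA")    # detected as stale, fetched again
```

A polled or event-driven model would replace the per-read version check with a timer or a notification from the grid; the trade-off is fewer round trips against a window in which stale data can be served.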
9.
• IMDGs enable seamless data access across on-premise sites and
cloud-based deployments:
• Automatically access
remote data as needed.
• Efficiently manage
WAN bandwidth.
• Enable full data
coherency across sites.
• Support multiple usage
models:
• Replication for DR
• Remote access
• Synchronized read/write
Global Data Integration
10.
• IMDG bridges on-premise and cloud-based in-memory storage of
Web session state.
• IMDG automatically migrates session-state objects into the cloud
on demand.
• This enables seamless access to data across multiple sites.
Example: Web Farm Cloud-Bursting
11.
An in-memory data grid is middleware software that provides:
1. Fast access time for fast-changing, “live” data
2. Scalable throughput and storage capacity to match a
growing workload and keep response times low
3. High availability to prevent data loss if a grid server (or
network link) fails
4. Shared access to data
across the server farm
5. Global data access across
multiple sites and the cloud
6. And … fast data analysis
for quickly and easily mining
data using “map/reduce”
Top Benefits of IMDGs
[Chart: Access Latency vs. Throughput for Grid and DBMS; annotations: “Faster”, “Scales”]
12.
• Traditional “big data” analysis
platforms analyze offline data:
• Example: Hadoop
• Very large, static datasets
• Data is often copied from other
disk-based storage systems to a
distributed file system for analysis.
• IMDGs store and analyze online data:
• Fast-changing, operational data
• Data storage is memory-based.
• Data motion is minimized for fast,
continuous analysis.
IMDGs Analyze Live Data
13.
A few examples:
• Equity trading: to minimize risk during a trading day
• Ecommerce: to optimize real-time shopping activity
• Reservations systems: to identify issues, reroute, etc.
• Credit cards: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues
Online Systems Need Real-Time Analysis
14.
A platform for managing hedging strategies:
• A hedge fund manages a set of hedging strategies:
• Strategies can cover various market
sectors, such as high-tech, automotive,
energy, consumer, real estate, etc.
• Each strategy contains a list of holdings
and rules for managing the holdings
(such as target allocations).
• Updates to market data
continuously arrive during
the trading day.
• Challenge: The hedge fund must be able to quickly update and
analyze its hedging strategies and provide alerts to traders.
Example in Financial Services
15.
• Deliver a stream of alerts to traders
within a few seconds.
• Enable the trader to examine strategy details in real time:
The Result: Real-Time Alerts
16.
• The IMDG holds the set of strategy objects as an in-memory collection.
• Updates to market data
continuously flow through
the IMDG.
• The IMDG performs
repeated map/reduce
analysis on hedging
strategies every
second.
• Each analysis iteration both updates
and analyzes every strategy object.
• The IMDG collects alerts after each
analysis and delivers them to the
trader.
The Solution: Real-Time Analytics
Using an IMDG
17.
• Analyze every selected strategy object in parallel within the IMDG:
• Update the strategy’s positions with latest market prices.
• Evaluate the strategy’s rules to see if a trade is needed.
• Example: Alert if current allocation exceeds target threshold.
• Generate an alert if holdings need to be changed.
• Merge the results across all strategy objects to create a set of
alerts.
The Analysis Algorithm
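The per-strategy analysis and the merge step can be sketched as follows. The `Strategy` fields, the 5% tolerance, the alert text, and the ticker symbols are hypothetical; the slide only states that positions are updated, rules are evaluated, and alerts are merged.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Dict, List


@dataclass
class Strategy:
    name: str
    shares: Dict[str, float]          # holdings: symbol -> number of shares
    target: Dict[str, float]          # rule: symbol -> target allocation (0..1)
    tolerance: float = 0.05           # assumed threshold above target
    prices: Dict[str, float] = field(default_factory=dict)


def analyze(strategy: Strategy, market_prices: Dict[str, float]) -> List[str]:
    """Map step: update one strategy with new prices and return its alerts."""
    for symbol in strategy.shares:
        if symbol in market_prices:
            strategy.prices[symbol] = market_prices[symbol]
    values = {s: n * strategy.prices[s] for s, n in strategy.shares.items()}
    total = sum(values.values())
    if total == 0:
        return []
    alerts = []
    for symbol, value in values.items():
        allocation = value / total
        if allocation > strategy.target[symbol] + strategy.tolerance:
            alerts.append(
                f"{strategy.name}: {symbol} at {allocation:.0%}, "
                f"target {strategy.target[symbol]:.0%}"
            )
    return alerts


def merge(left: List[str], right: List[str]) -> List[str]:
    """Reduce step: combine two alert lists into one."""
    return left + right


strategies = [
    Strategy("tech", {"AAA": 100, "BBB": 100}, {"AAA": 0.5, "BBB": 0.5},
             prices={"AAA": 10.0, "BBB": 10.0}),
    Strategy("energy", {"CCC": 50, "DDD": 50}, {"CCC": 0.5, "DDD": 0.5},
             prices={"CCC": 20.0, "DDD": 20.0}),
]
tick = {"AAA": 30.0}  # one market-data update
alerts = reduce(merge, (analyze(s, tick) for s in strategies), [])
```

Here the price jump pushes AAA to 75% of the "tech" strategy against a 50% target, so one alert is produced; the "energy" strategy is unaffected and contributes none.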
18.
Shipping Analysis Code to the IMDG
• IMDG creates a Java or .NET execution environment for analysis:
• Spans all IMDG servers.
• Ensures tight integration with memory-based data storage.
• IMDG client ships jars/assemblies to IMDG servers for execution:
• Keeps development model simple.
• Optionally allows pre-staging for multiple runs to shorten startup time.
• Optionally allows automatic re-staging if code changes between runs.
• Client starts analysis:
• Sends invocation to
the IMDG.
• IMDG returns
analysis results.
19.
The parallel analysis executes in three steps:
• Step 1: The application first selects all relevant objects in the
collection with a parallel query run on all grid servers.
• Note: Query spec matches data’s object-oriented properties.
Running the Analysis
20.
• Step 2: The IMDG automatically schedules analysis operations
across all grid servers and cores.
• The analysis runs on all objects selected
by the parallel query.
• Each grid server analyzes its locally stored
objects to minimize data motion.
• Parallel execution ensures fast
completion time:
• IMDG automatically distributes
workload across servers/cores.
• Scaling the IMDG automatically
handles larger data sets.
Running the Analysis: Step 2
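A rough sketch of steps 1 and 2 together: each "server" filters and analyzes only the objects it stores, and all servers work in parallel. Threads stand in for grid servers here, and the function names and sample data are assumptions; a real grid schedules this work across machines and cores.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def run_on_server(local_objects: List[Any],
                  selector: Callable[[Any], bool],
                  analyze: Callable[[Any], Any]) -> List[Any]:
    # Runs where the data lives: select, then analyze, with no data movement.
    return [analyze(obj) for obj in local_objects if selector(obj)]


def parallel_analyze(partitions: List[List[Any]],
                     selector: Callable[[Any], bool],
                     analyze: Callable[[Any], Any]) -> List[List[Any]]:
    # One worker per partition; results keep the partition order.
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        futures = [pool.submit(run_on_server, part, selector, analyze)
                   for part in partitions]
        return [f.result() for f in futures]


# Hypothetical data: two grid servers holding three strategy records.
partitions = [
    [{"sector": "energy", "value": 1}, {"sector": "tech", "value": 2}],
    [{"sector": "energy", "value": 3}],
]
results = parallel_analyze(
    partitions,
    selector=lambda obj: obj["sector"] == "energy",
    analyze=lambda obj: obj["value"] * 10,
)
```

Adding a partition adds a worker, which mirrors the slide's point that scaling the grid also scales the analysis.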
21.
• File-based map/reduce must move data to memory for analysis:
• IMDG’s memory-based computation engine analyzes data in place:
IMDG Minimizes Data Motion
[Diagram: three M/R servers, each running an engine (E) in server memory on data (D) read from a file system / database; three grid servers in the in-memory data grid, each running an engine (E) alongside the data (D) it stores]
22.
• Step 3: The IMDG automatically merges all analysis results.
• The IMDG first merges all results within each grid server in parallel.
• It then merges results across all grid servers to create one combined
result.
• Efficient parallel merge
minimizes the delay in
combining all results.
• The IMDG delivers the
combined result to the
trader’s display as one
object.
Running the Analysis: Step 3
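The two-level merge in step 3 can be illustrated with a short sketch, assuming the merge function is associative so that grouping by server does not change the outcome. The function name `merge_all` is hypothetical, and the per-server merges run sequentially here, whereas the slide describes them running in parallel.

```python
from functools import reduce
from typing import Callable, List, TypeVar

R = TypeVar("R")


def merge_all(per_server_results: List[List[R]],
              merge: Callable[[R, R], R],
              empty: R) -> R:
    # Level 1: combine the results inside each server.
    local = [reduce(merge, results, empty) for results in per_server_results]
    # Level 2: combine one partial result per server into the final result.
    return reduce(merge, local, empty)


# Hypothetical alert lists produced on three servers (the third found nothing).
per_server = [[["a1"], ["a2"]], [[], ["b1"]], []]
combined = merge_all(per_server, lambda x, y: x + y, [])
```

Only one partial result per server crosses the network in the second level, which is why the combining delay stays small as the number of objects grows.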
23.
Running a similar analysis algorithm (stock back-testing) within an
IMDG:
• IMDG hosted in Amazon cloud using 75 servers.
• IMDG holds 1 TB of stock history data in memory.
• IMDG handles continuous stream of updates (1.1 GB/s) while
performing real-time analysis on live data.
• Entire data set analyzed in
4.1 seconds (250 GB/s).
• IMDG scales linearly by
adding servers as
workload grows.
Benchmark Results
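As a quick arithmetic check on these figures, the snippet below recomputes the aggregate analysis rate from the data set size and run time. It assumes 1 TB means 1024 GB; the per-server numbers are derived here for illustration and do not appear in the benchmark itself.

```python
DATA_SET_GB = 1024        # 1 TB of stock history, assuming binary units
ANALYSIS_SECONDS = 4.1    # time to analyze the entire data set
SERVERS = 75              # grid servers in the benchmark

aggregate_gb_per_s = DATA_SET_GB / ANALYSIS_SECONDS      # about 250 GB/s
per_server_gb_per_s = aggregate_gb_per_s / SERVERS       # about 3.3 GB/s (derived)
data_per_server_gb = DATA_SET_GB / SERVERS               # about 13.7 GB (derived)
```

With decimal units (1000 GB) the aggregate would come out near 244 GB/s, so the quoted 250 GB/s is consistent with a rounded binary-unit calculation.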
24.
• Hadoop is typically used for very large, static, offline datasets.
• Data is held on disk in a file system (HDFS) or DBMS
• Data is often copied from other disk-based storage systems to
HDFS for analysis.
Problem: Hadoop Cannot Efficiently
Perform Real-Time Analytics
25.
Comparison of IMDGs and Hadoop
Attribute | IMDG | Hadoop
Data set size | Gigabytes to terabytes | Terabytes to petabytes
Data repository | In-memory | File / database
Data view | Queried object collection | File-based key/value pairs
Development time | Low | High
Automatic scalability | Yes | Application dependent
Best use | Real-time analysis of live, memory-based data | Batch analysis of large, static datasets
I/O overhead | Low | High
Cluster mgt. | Simple | Complex
High availability | Memory-based | File-based
26.
• Survey result from Strata 2013: 93% of Hadoop users would
benefit from real-time data analytics.
• Strategy: Integrate IMDG into Hadoop.
• How:
• Stage data in IMDG for fast access.
• Thereby allow updates to data during
Hadoop execution.
• Automatically retrieve
data from HDFS as
necessary.
• Enable unchanged
Hadoop program
structure.
• Combine scalability
of Hadoop map/reduce
and IMDG.
Enabling Hadoop to Perform
Real-Time Analysis
27.
• IMDG adds Hadoop grid record
reader for accessing key/value
pairs held in the IMDG.
• Hadoop programs can optionally
output results to the IMDG with the grid
record writer.
• Applications can access and update
key/value pairs as live data during
analysis.
• Grid record reader and writer
optimize access to key/value pairs
to eliminate network overhead.
Accessing IMDG Data in Hadoop
28.
• IMDG adds wrapper for HDFS record reader to cache HDFS data
during program execution.
• Hadoop automatically retrieves data from IMDG on subsequent runs.
• Wrapper accesses IMDG to
store and retrieve data
with minimum network
overhead.
• Useful for running multiple “what-if”
analyses on one data set.
• Tests with the TeraSort
benchmark have
demonstrated 11X
lower access latency
than HDFS without the IMDG.
Using IMDG as an HDFS Cache
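The caching wrapper on this slide follows a read-through pattern, sketched below in plain Python. This is not the Hadoop `RecordReader` interface or ScaleOut's wrapper: the class `CachingRecordReader`, the `read_backing` callback, and the dict used as the grid are all stand-ins chosen for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # one key/value pair


class CachingRecordReader:
    """Read-through wrapper: the first run reads the backing store,
    later runs are served from the cache (a dict standing in for the grid)."""

    def __init__(self,
                 read_backing: Callable[[str], Iterable[Record]],
                 cache: Dict[str, List[Record]]) -> None:
        self.read_backing = read_backing
        self.cache = cache

    def records(self, split: str) -> List[Record]:
        if split not in self.cache:
            # Cache miss: read the split once and keep it for later runs.
            self.cache[split] = list(self.read_backing(split))
        return self.cache[split]


backing_reads: List[str] = []  # records every trip to the slow store


def slow_read(split: str) -> List[Record]:
    backing_reads.append(split)
    return [("k1", "v1"), ("k2", "v2")]


reader = CachingRecordReader(slow_read, cache={})
first_run = reader.records("part-0")    # goes to the backing store
second_run = reader.records("part-0")   # served from the cache
```

Repeated "what-if" runs over the same input pay the backing-store cost once, which is the situation the slide describes.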
29.
• IMDGs use in-memory storage to scale access to data for
applications which process live, fast-changing data.
• IMDGs can be deployed in the cloud and provide global data
integration across sites.
• Many applications need to
perform real-time analytics
on live data.
• IMDGs can meet this need,
delivering results in seconds
instead of minutes or hours.
• Hadoop was not designed for
real-time analytics, but…
• IMDGs can enable Hadoop to accelerate access to data.
Summary
30. In-Memory Data Grids for
Server Farms & Cloud Computing
www.scaleoutsoftware.com