2. Uncontrolled Performance = Lost Revenue
Amazon.com: 100 ms of latency => 1% drop in sales
Aberdeen (1,000-company sample): latency => 9% enterprise revenue loss
Stable: high-latency ads => 46% viewer loss
Google: 500 ms => 20% fewer clicks => ad revenue loss
Copyright 2018 CachePhysics, Inc.
3. Delivering Data QoS is Challenging
App data requirements are changing daily. Providing QoS has become hard.
4. Performance Management is Getting Harder
The problems are getting much worse with increasing hardware complexity.
6. New Layer Needed
Systems of Intelligence: a new layer between Applications and Infrastructure.
Self-learn / model:
• Application workloads
• Infrastructure components
Control / operate to maximize:
• Efficiency
• Performance
• SLA conformance
8. Autonomous Performance Management?
9. Keep Existing Applications, Keep Existing Infrastructure, Add Real-time Self-Learning Instrumentation
[Diagram: an Autonomous Performance Management service instruments the existing application infrastructure (apps & microservices; caches; network, database, filesystem, compute) via SaaS API calls and API callbacks. It runs a Monitor / Model / Predict / Optimize / Control loop to enforce SLAs and auto-size or reconfigure resources.]
10. Modeling Cache Performance in Real-Time
• Learn a performance model of the applications and cache
• Predict the performance of a workload as f(cache size, params)
[Figure: latency (ms) vs. cache size (GB), with points at 42, 84, 128, and 1700 GB; lower is better. Example reading: hit ratio 65% at a cache size of 128 GB.]
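As a minimal sketch of the idea (the saturating hit-ratio curve and the latency constants below are illustrative assumptions, not CachePhysics' fitted model), predicting average latency as f(cache size) might look like:

```python
import math

def hit_ratio(size_gb: float, scale_gb: float = 30.0) -> float:
    """Hypothetical fitted hit-ratio curve: hits saturate as the cache grows."""
    return 1.0 - math.exp(-size_gb / scale_gb)

def predicted_latency_ms(size_gb: float,
                         hit_ms: float = 0.2, miss_ms: float = 3.0) -> float:
    """Average latency = hit_ratio * hit cost + miss_ratio * miss cost."""
    h = hit_ratio(size_gb)
    return h * hit_ms + (1.0 - h) * miss_ms

# Evaluate the model at the cache sizes shown on the slide's x-axis.
for size in (42, 84, 128, 1700):
    print(f"{size:5d} GB -> {predicted_latency_ms(size):.2f} ms")
```

Once such a curve is fitted to the live access pattern, any candidate cache size can be evaluated without provisioning it.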
11. Understanding Cache Models
[Figure: latency (ms) vs. cache size (GB); lower is better; x2, x4, and x8 size increases are marked.]
Models help decide useful increments of change. In this example, there is no benefit despite an 8x increase in budget.
12. Understanding Cache Models
[Figure: latency (ms) vs. cache size (GB); lower is better.]
Often, most operating points are highly inefficient. This cache is operating at the lowest-ROI point: it gets performance equivalent to 1/8 the budget. Auto-scaling data infrastructure would pick only the efficient operating points.
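A sketch of how a model filters out inefficient operating points (the hit-ratio model and the 5% gain threshold are illustrative assumptions, not CachePhysics' implementation): keep a larger size only if it meaningfully improves predicted latency.

```python
import math

def predicted_latency_ms(size_gb, hit_ms=0.2, miss_ms=3.0, scale_gb=30.0):
    """Hypothetical fitted latency model: saturating hit-ratio curve."""
    h = 1.0 - math.exp(-size_gb / scale_gb)
    return h * hit_ms + (1.0 - h) * miss_ms

def efficient_points(sizes, min_gain=0.05):
    """Keep a size only if it beats the last kept size by at least min_gain."""
    kept = [sizes[0]]
    for s in sizes[1:]:
        prev = predicted_latency_ms(kept[-1])
        if (prev - predicted_latency_ms(s)) / prev >= min_gain:
            kept.append(s)
    return kept

print(efficient_points([42, 84, 128, 256, 512, 1024]))
```

With these illustrative constants, sizes past 256 GB are dropped: the curve has flattened, so the extra budget buys no latency, matching the slide's point about low-ROI operating points.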
13. Understanding Model-based Adaptation
Single workload: prediction of performance under different policies. An auto-scaling data infrastructure would always pick the optimal one.
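Model-based adaptation could be sketched like this (the policy names, scale parameters, and latency constants are hypothetical fitted values for illustration): predict latency under each eviction policy and pick the best.

```python
import math

# Hypothetical per-policy hit-ratio scale parameters fitted from a trace (GB).
POLICY_SCALE_GB = {"lru": 40.0, "lfu": 25.0, "fifo": 60.0}

def predicted_latency_ms(policy, size_gb, hit_ms=0.2, miss_ms=3.0):
    """Predict latency under a given eviction policy at a given cache size."""
    h = 1.0 - math.exp(-size_gb / POLICY_SCALE_GB[policy])
    return h * hit_ms + (1.0 - h) * miss_ms

def best_policy(size_gb):
    """Pick the policy whose model predicts the lowest latency at this size."""
    return min(POLICY_SCALE_GB, key=lambda p: predicted_latency_ms(p, size_gb))

print(best_policy(128))
```

The value is that the comparison happens on the model, not in production: no A/B experiment with live traffic is needed to choose the policy.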
14. Sample Models From Production Workloads
15. Achieving Latency Targets
The client's target 95th-percentile latency is 7 ms. Autoset the cache partition size to 16 GB to guarantee the latency SLO.
* Throughput targets can be implemented similarly.
[Figure: 95th-percentile latency (ms) vs. cache size (GB); the 7 ms latency target intersects the model curve at a cache allocation of >16 GB.]
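SLO-driven sizing can be sketched by inverting the latency model (constants here are illustrative, tuned so a 7 ms target lands near the slide's 16 GB allocation): bisect on cache size for the smallest partition whose predicted p95 latency meets the target.

```python
import math

def p95_latency_ms(size_gb, hit_ms=1.0, miss_ms=20.0, scale_gb=13.9):
    """Hypothetical fitted model of 95th-percentile latency vs. cache size."""
    h = 1.0 - math.exp(-size_gb / scale_gb)
    return h * hit_ms + (1.0 - h) * miss_ms

def min_size_for_target(target_ms, lo=1.0, hi=1024.0, tol=0.1):
    """Bisect on size; valid because the model decreases monotonically."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if p95_latency_ms(mid) <= target_ms:
            hi = mid          # target met: try a smaller (cheaper) partition
        else:
            lo = mid          # target missed: need more cache
    return hi

print(f"allocate >= {min_size_for_target(7.0):.1f} GB for a 7 ms p95 target")
```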
16. Achieving Multi-Tier Sizing
[Figure: a multi-tier hierarchy with a Tier 0 (DRAM) allocation, Tier 1 (3D XPoint), Tier 2 (Local Flash), and Tier 3 (Remote Flash); misses flow from tier to tier, and misses to the remote tier generate network traffic.]
* Network bandwidth can be modeled as a function of the cache misses from each tier.
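The tier names come from the slide, but the sizes, scale parameters, and request rate below are illustrative assumptions. The sketch shows why network bandwidth is a function of per-tier misses: requests that miss one tier fall through to the next, and whatever misses the last local tier is served remotely over the network.

```python
import math

TIERS = [  # (name, allocated size in GB, hypothetical fitted scale in GB)
    ("Tier 0 (DRAM)", 16.0, 8.0),
    ("Tier 1 (3D XPoint)", 64.0, 32.0),
    ("Tier 2 (Local Flash)", 512.0, 256.0),
]

def miss_flow(request_rate_gbps):
    """Return the data rate (GB/s) that misses every local tier."""
    rate = request_rate_gbps
    for _name, size_gb, scale_gb in TIERS:
        hit = 1.0 - math.exp(-size_gb / scale_gb)
        rate *= (1.0 - hit)  # only misses fall through to the next tier
    return rate  # residual traffic to Tier 3 (Remote Flash) = network load

print(f"network traffic: {miss_flow(10.0):.4f} GB/s")
```

Resizing any one tier changes the miss rate it passes downstream, so tier sizes and network bandwidth can be optimized jointly from the same models.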
18. Why Now?
New Apps: Cloud Native. Every code change can morph infrastructure requirements. App performance management today is static, manual, and error-prone. We auto-learn dynamically.
New Layers: NVMe / 3D XPoint. Hardware complexity is increasing even as data growth accelerates, but latency is not keeping up with FLOPS (worsening by 4%/yr). We tame complexity and maximize ROI.
[Figure: the memory/storage hierarchy, spanning I$, D$, and LLC caches, DRAM, MRAM, ReRAM, 3D XPoint, NVMe, flash, slow flash, enterprise and 7200 RPM HDDs, SMR HDDs, optical, and tape: increasing complexity, massive data growth, latency worse by 4%/yr.]
Infrastructure »» Declarative. Kubernetes' CRDs and API extensions are set to be the default platform for the next decade, but the pain of performance unpredictability remains unsolved. We layer on an SLA guarantee solution.
* CD = Continuous Deployment, the predominant trend where code is deployed automatically after testing.
19. Use Case #1: Self-service Infrastructure Sizing
Value for Developers:
• Observability dashboard
• Sizing recommendations
• Cost savings analysis
Value for Ops:
• Achieve desired SLAs for the lowest spend
• Observability for data teams
• Faster troubleshooting for infra teams
• Predictive planning
[Figure: monitoring dashboard of latency (0.0 to 8.0 ms) vs. resources ($$); lower is better.]
20. Use Case #2: Automated QoS SLA
Value for Developers:
• Dial in a latency or throughput target and a budget
• Auto-scaling data infrastructure allocates just enough capacity to hit the SLA
Value for Ops:
• Automated SLA achievement!
• Set and forget, peace of mind
• Revenue disruption avoidance
• Improved margins
• Zero-OpEx performance scaling
• Dramatically reduced service interruptions
[Figure: latency guarantees; the client's target IO latency of 7 ms is guaranteed by autosetting the cache partition to 16 GB (cache allocation >16 GB).]
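The "set and forget" behavior could be sketched as a simple feedback loop (the function shape and controller gains are hypothetical, not the product's API): grow the allocation on SLA violation, reclaim capacity when latency is comfortably under target.

```python
def control_step(alloc_gb, measured_ms, target_ms,
                 grow=1.25, shrink=0.9, headroom=0.8):
    """Return the next cache allocation given one latency measurement."""
    if measured_ms > target_ms:
        return alloc_gb * grow    # violating the SLA: add capacity
    if measured_ms < headroom * target_ms:
        return alloc_gb * shrink  # far under target: free capacity, save spend
    return alloc_gb               # within the band: hold steady

print(control_step(16.0, 9.0, 7.0))  # over target, so the allocation grows
```

The dead band between `headroom * target` and `target` is what prevents the allocation from oscillating; running this on model predictions rather than raw measurements is what makes the scaling proactive instead of reactive.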
21. Model your Redis cache!
1. Sign up for free trial: www.cachephysics.com/redisconf18
2. Test your access pattern to create a Redis cache model, receive a live dashboard
3. Cache Observability ==> Faster Time to Resolution ==> Performance SLAs. Simple!