Performance Evaluation Between Checkpoint Services in Multi-tier Stateful Applications
Demis Gomes
Advisor: Glauco Gonçalves
Co-Advisor: Patricia Endo

Introduction
• Platform-as-a-Service (PaaS)
[Figure: Developer, PaaS, Application, User, and PaaS Provider]

Introduction
• Multi-tier stateful applications

Introduction
• It is important to keep an application in a PaaS running for as long as possible
• Downtime causes significant financial losses

Introduction
• The average cost of a critical application failure is $500,000 to $1 million per hour.
Source: https://devops.com/2015/02/11/real-cost-downtime/ (last accessed 11 Oct. 2016)
Checkpoint Services!

Introduction
[Figure: Developers, Users, PaaS Providers, and the Checkpoint Service]

Background
• A checkpoint service is divided into three
mechanisms
– Checkpoint saving
– Failure detection
– Failover
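The three mechanisms can be pictured as three small responsibilities of one service. The sketch below is only an illustration: the class name, the in-memory storage, and the heartbeat-based detection with its timeout are assumptions, not details taken from the presented implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class CheckpointService:
    """Toy checkpoint service exposing the three mechanisms."""
    heartbeat_timeout_s: float = 3.0  # hypothetical timeout
    checkpoints: Dict[str, bytes] = field(default_factory=dict)
    last_heartbeat: Dict[str, float] = field(default_factory=dict)

    # 1. Checkpoint saving: keep the latest state of each application.
    def save_checkpoint(self, app_id: str, state: bytes) -> None:
        self.checkpoints[app_id] = state

    # 2. Failure detection (assumed heartbeat scheme): an application is
    #    considered failed when its last heartbeat is older than the timeout.
    def heartbeat(self, app_id: str, now: Optional[float] = None) -> None:
        self.last_heartbeat[app_id] = time.monotonic() if now is None else now

    def has_failed(self, app_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        last = self.last_heartbeat.get(app_id)
        return last is None or now - last > self.heartbeat_timeout_s

    # 3. Failover: hand the last saved state to a standby instance.
    def failover(self, app_id: str, restore: Callable[[bytes], None]) -> bool:
        state = self.checkpoints.get(app_id)
        if state is None:
            return False  # nothing to restore from
        restore(state)
        return True
```

A caller would save a checkpoint after each relevant state change, poll `has_failed`, and invoke `failover` with a callback that loads the state into the standby instance.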

Background
• Checkpoint Service
[Figure: Checkpoint Service with an Active and a Standby App instance, each with its App State; labeled steps: Checkpoint Saving, Failure Detection, Failover]

Background
• Service Availability Forum (SAF)
• Three different implementations:
– Non-collocated
– Collocated warm
– Collocated hot

Checkpoint Services
[Figure: application-level CS (state-aware application: App, Agent, Checkpoint Manager) and system-level CS (HA-agnostic application: App, Agent, Container, Checkpoint Manager)]

Motivation
• Prior work presents either application-level [1] or system-level [2] checkpoint services
• There is no consistent comparison between these services
• There is no implementation in accordance with the SAF standard

Motivation
• Carry out a performance evaluation of system-level and application-level checkpoint services that follow the SAF standard, and evaluate the impact of the different recovery modes on time and resource consumption

Answer three questions
• Does the system-level service perform similarly to the application-level one?
• What is the impact of changing from non-collocated to collocated?
• What are the bottlenecks of the system-level and application-level services?

Application
• State-aware application
• A multi-tier stateful chat
– Frontend: provides the interface and saves user data
– Backend: saves room messages
– Database: stores information about rooms and users
[Figure: GET /state exchanged between App and Agent, answered with 200 OK]

Application
• State provided via JSON (backend)
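A minimal sketch of how a state-aware backend could expose and reload its state as JSON. The payload layout (a `rooms` object mapping room names to message lists) and both function names are hypothetical; the actual schema used by the chat is not shown on the slide.

```python
import json
from typing import Dict, List


def export_state(rooms: Dict[str, List[str]]) -> str:
    """Serialize the backend state (messages per room) to JSON."""
    # Hypothetical layout: {"rooms": {"<room name>": ["<message>", ...]}}
    return json.dumps({"rooms": rooms}, sort_keys=True)


def import_state(payload: str) -> Dict[str, List[str]]:
    """Rebuild the backend state from a JSON payload produced above."""
    return json.loads(payload)["rooms"]
```

The active instance would return `export_state(...)` as the body of its state response, and a standby instance would call `import_state(...)` on the saved payload during failover.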

CS System-level
• We used well-known tools:
– LXC as the container runtime
– NFS as the shared file system
– rsync to transfer files between instances
– CRIU to checkpoint and restore containers
CS: Checkpoint Service! :D

CS System-level
• We did not implement collocated hot because CRIU does not allow restoring into a running instance

CS System-level
• Checkpoint in non-collocated
[Figure: Checkpoint Manager, Active Instance, and Standby Instance; each instance runs App and Agent in a Container]

CS System-level
• Checkpoint in collocated warm
[Figure: Checkpoint Manager, Active Instance, and Standby Instance; each instance runs App and Agent in a Container; rsync between instances]

CS System-level
• Failover in non-collocated
[Figure: Checkpoint Manager, Active Instance, and Standby Instance; App, Agent, and Container]

CS System-level
• Failover in collocated warm
[Figure: Checkpoint Manager, Active Instance, and Standby Instance; App, Agent, and Container; rsync]

CS App-level
Remember, CS: Checkpoint Service! :D
• The application-level CS was developed from scratch for this work
• REST resources
GET http://{manager_ip}:{manager_port}/config
RESPONSE 200 OK Content-type: application/json
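To make the REST resource above concrete, here is a framework-free sketch of a handler for `GET /config`. Only the path, the 200 status, and the JSON content type come from the slide; the `CONFIG` fields, the 404 branch, and the function names are hypothetical.

```python
import json
from typing import Dict, Tuple

# Hypothetical configuration; the real payload is not shown on the slide.
CONFIG = {"mode": "collocated_warm", "checkpoint_interval_s": 30}


def config_url(manager_ip: str, manager_port: int) -> str:
    """Build the URL an Agent would request on the Checkpoint Manager."""
    return f"http://{manager_ip}:{manager_port}/config"


def handle(method: str, path: str) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a request to the Manager."""
    headers = {"Content-Type": "application/json"}
    if method == "GET" and path == "/config":
        return 200, headers, json.dumps(CONFIG).encode("utf-8")
    # Assumed fallback for any other resource.
    return 404, headers, json.dumps({"error": "not found"}).encode("utf-8")
```

Keeping the handler as a pure function makes it easy to plug into any HTTP server and to test without opening a socket.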

CS App-level
• Checkpoint at Application-level
[Figure: Checkpoint Manager, Active Instance, and Standby Instance (App and Agent in each); state-aware application; modes: non-collocated, collocated warm, collocated hot]

CS App-level
• Failover in non-collocated
[Figure: Checkpoint Manager, Active Instance, and Standby Instance (App and Agent in each)]

CS App-level
• Failover in collocated warm
[Figure: Checkpoint Manager, Active Instance, and Standby Instance (App and Agent in each)]

CS App-level
• Failover in collocated hot
[Figure: Checkpoint Manager, Active Instance, and Standby Instance (App and Agent in each)]

Evaluation
• Two evaluations were conducted
– Evaluation I: failover time comparison
– Evaluation II: checkpoint time and resource consumption comparison

Evaluation
Physical machines: 16 GB RAM, 8 cores, Gigabit interface

Evaluation I
• Methodology
– Backend with state sizes of 1, 5, 10, 15, 20, and 25 MB
– The Experiment Manager starts the experiment and generates a failure alert
– The failover process is executed
– The failover time is collected
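The methodology above can be sketched as a small timing harness. This is an assumption-laden illustration: `run_failover` is a hypothetical callback that raises the failure alert and blocks until the standby instance is serving, and the repetition count is arbitrary.

```python
import time
from typing import Callable, Dict, Iterable, List

STATE_SIZES_MB = (1, 5, 10, 15, 20, 25)  # state sizes used in Evaluation I


def measure_failover(
    run_failover: Callable[[int], None],
    sizes_mb: Iterable[int] = STATE_SIZES_MB,
    repetitions: int = 3,  # hypothetical repetition count
) -> Dict[int, List[float]]:
    """Return the failover times (in seconds) observed for each state size."""
    results: Dict[int, List[float]] = {}
    for size in sizes_mb:
        samples: List[float] = []
        for _ in range(repetitions):
            start = time.perf_counter()  # the failure alert is raised here
            run_failover(size)           # returns once failover has finished
            samples.append(time.perf_counter() - start)
        results[size] = samples
    return results
```

`time.perf_counter` is used because it is monotonic and has high resolution, which matters when failover completes in a fraction of a second.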

Failover time – Non-collocated
[Figure: failover time, non-collocated]
Application-level has a greater failover time
The growth is linear

Failover time – Non-collocated
[Figure: estimated failover time, non-collocated]
We estimate the failover time as the state size increases up to 100 MB
Application-level would be 66% faster

Failover time – Collocated
[Figure: failover time, collocated modes]
Application-level collocated warm is strongly affected by the increase in state size
The values for application-level collocated hot and system-level collocated warm are very similar

Failover time – Collocated
[Figure: failover time with linear regression, collocated modes]
Linear regression shows:
– A steep increase for application-level collocated warm
– A slight increase for system-level collocated warm
– Constant values for collocated hot
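A linear fit of failover time against state size, followed by extrapolation to 100 MB, can be sketched with ordinary least squares. The data points below are synthetic and chosen only to make the example checkable; they are not the measured results.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope: float, intercept: float, x: float) -> float:
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Synthetic points (NOT the measured data): time = 0.1 * size + 0.5
sizes_mb = [1, 5, 10, 15, 20, 25]
times_s = [0.1 * s + 0.5 for s in sizes_mb]
slope, intercept = fit_line(sizes_mb, times_s)
estimate_100mb = predict(slope, intercept, 100)
```

A near-zero slope corresponds to the constant behavior reported for collocated hot, while a large slope corresponds to a mode whose failover time grows quickly with state size.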

Evaluation II
• Methodology
– As in the previous experiment, states are saved at the same state sizes
– The Experiment Manager triggers a checkpoint process
– The checkpoint time is collected
– Resource consumption is evaluated

Evaluation II
• Methodology
– Resource consumption metrics
Metric                    Measured in
Checkpoint time           s
CPU load                  %
Memory occupation         %
Network I/O throughput    Mbps
Disk I/O throughput       b/s

Evaluation II
[Figure: checkpoint times]

Evaluation II – Active Instance
At 25 MB                   CPU     Memory   Network (I/O)   Disk (W)
Sys-lvl collocated warm    6.8%    9.4%     0/59.8 Mbps     1300 b/s
App-lvl collocated warm    2.7%    9.1%     0/8.8 Mbps      9220 b/s
App-lvl collocated hot     2.53%   9.5%     0/8.64 Mbps     8340 b/s
Sys-lvl non-collocated     6%      9.1%     0/81 Mbps       1780 b/s
App-lvl non-collocated     2%      8.92%    0/11.6 Mbps     2410 b/s

Evaluation II – Standby Instance
At 25 MB                   CPU     Memory   Network (I/O)    Disk (W)
Sys-lvl collocated warm    1.8%    10.3%    5.1/0 Mbps       12500 b/s
App-lvl collocated warm    2.5%    11.9%    8.5/8.5 Mbps     7280 b/s
App-lvl collocated hot     4.1%    12.4%    8.35/8.35 Mbps   6900 b/s
Sys-lvl non-collocated     0.16%   9.8%     0/0 Mbps         800 b/s
App-lvl non-collocated     0.2%    11.4%    0/0 Mbps         2600 b/s

Discussion
• Availability analysis over one year
• Mean Time To Recovery (MTTR) taken as the failover time
• Mean Time To Failure (MTTF) taken from an Apache server (788.4 h/year) [3]
• Assuming a recovery time 50 times greater than the measured failover time
• High Availability (HA) = 99.999% (five nines)
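The availability figures follow from the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below applies it with the MTTF of 788.4 h and the factor of 50 stated above; the function names are only illustrative.

```python
MTTF_S = 788.4 * 3600  # 788.4 h expressed in seconds (about 2,838,240 s)
FACTOR = 50            # assumed ratio between recovery time and failover time


def availability(mttf_s: float, mttr_s: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mttf_s / (mttf_s + mttr_s)


def availability_percent(failover_time_s: float,
                         mttf_s: float = MTTF_S,
                         factor: float = FACTOR) -> float:
    """Availability in percent when MTTR = factor * failover time."""
    return 100.0 * availability(mttf_s, failover_time_s * factor)


# Example: a failover time of 0.38636 s gives an MTTR of 19.318 s.
example = round(availability_percent(0.38636), 4)  # 99.9993
```

With an MTTF this large, every additional 28 seconds or so of MTTR costs roughly 0.001 percentage points of availability, which is why the slower modes fall below five nines.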

Discussion
Availability analysis (25 MB)
Service                             MTTR (s)   MTTR with factor 50 (s)   MTTF (s)   Availability with factor 50 (%)
System-level collocated warm        0.38636    19.318                    2838240    99.9993
Application-level collocated warm   1.27823    63.9115                   2838240    99.997
Application-level collocated hot    0.25802    12.901                    2838240    99.9995
System-level non-collocated         3.5441     177.205                   2838240    99.9937
Application-level non-collocated    1.38795    69.3975                   2838240    99.997

Discussion
Availability analysis (prediction up to 100 MB)
Service                             MTTR (s)   MTTR with factor 50 (s)   MTTF (s)   Availability with factor 50 (%)
System-level collocated warm        0.5902     29.51                     2838240    99.9989
Application-level collocated warm   3.8621     193.1                     2838240    99.993
Application-level collocated hot    0.2677     13.385                    2838240    99.9995
System-level non-collocated         9.7999     498.995                   2838240    99.9824
Application-level non-collocated    4.321      216.05                    2838240    99.9923

CONCLUSIONS AND FUTURE WORK

Conclusions
Answering the questions
• Does the system-level service perform similarly to the application-level one?
Yes! In collocated warm

Conclusions
• What is the impact of changing from non-collocated to collocated?
– Failover time: large decrease
– Checkpoint time: large increase
– Resource consumption: similar, except for CPU and disk (greater in collocated)

Conclusions
• What are the bottlenecks of the system-level and application-level services?
– Application-level: disk, CPU on the standby instance (hot), and development time
– System-level: CPU, network, and NFS

Conclusions
• CS Application-level
– Private PaaS
– Applications with a large state and a high checkpoint rate (massive online applications)

Conclusions
• CS System-level
– PaaS with legacy applications
– Applications with a smaller state and longer checkpoint intervals

Conclusions
• PaaS Business Model
– Non-collocated: Free plans
– Collocated: Premium plans

Contributions
• Short paper accepted with the results of Experiment I, entitled
“Failover Time Evaluation Between Checkpoint Services in Multi-tier Stateful Applications”
IM-2017, Exp. Session (Qualis B1)

Future Work
As future work, we will study
• Scalability of the services
• Resource consumption on the Experiment Instance

Acknowledgments
• Thanks!
#CatãoEterno

THANKS!
Demis Gomes
demismg72@gmail.com
demis.gomes@ufrpe.br

References
• [1] KANSO, Ali; LEMIEUX, Yves. Achieving High Availability at the Application Level in the Cloud. In: 2013 IEEE Sixth International Conference on Cloud Computing. IEEE, 2013. p. 778-785.
• [2] LI, Wubin; KANSO, Ali; GHERBI, Abdelouahed. Leveraging Linux containers to achieve high availability for cloud services. In: Cloud Engineering (IC2E), 2015 IEEE International Conference on. IEEE, 2015. p. 76-83.
• [3] MELO, R. M. D. et al. Redundant VoD streaming service in a private cloud: availability modeling and sensitivity analysis. Mathematical Problems in Engineering, Hindawi Publishing Corporation, v. 2014, 2014.

Agenda
• Introduction
• Checkpoint Services
• Evaluation
– Experiment I
– Experiment II
• Conclusion and Future Work
• Acknowledgments

Introduction
• PaaS poses several challenges, one of which is the availability of its services
• Multi-tier stateful applications

Introduction
• Many PaaS offerings lack a mechanism to handle application failures
• Some offer a backup, but it is not transparent

Introduction
Tsuru only restarts the application; it does not save its last state

VM vs. Container
• VMs
• Containerization

Objectives
• General
– Carry out a consistent comparison between system-level and application-level checkpoint services
• Specific
– Develop both services following the SAF standard
– Compare the services on the following metrics:
• Failover time
• Checkpoint time
• Load generated on the application

Application
• The application generates a new base state when
– a threshold defined by the developer is reached (e.g., 20 new messages)
– a time limit is reached (e.g., 120 seconds without updates)
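The two triggers for a new base state can be sketched as a small policy object. The defaults mirror the examples on the slide (20 new messages, 120 seconds without updates); the class, its method names, and the choice to measure idle time from the last update are assumptions.

```python
from dataclasses import dataclass


@dataclass
class BaseStatePolicy:
    """Decides when the application should generate a new base state."""
    max_new_messages: int = 20   # developer-defined threshold (slide example)
    max_idle_s: float = 120.0    # time limit without updates (slide example)
    new_messages: int = 0
    last_update: float = 0.0

    def record_message(self, now: float) -> None:
        """Register one new message received at time `now` (seconds)."""
        self.new_messages += 1
        self.last_update = now

    def should_generate(self, now: float) -> bool:
        """True when either the message threshold or the time limit is hit."""
        if self.new_messages >= self.max_new_messages:
            return True
        return now - self.last_update >= self.max_idle_s

    def mark_generated(self, now: float) -> None:
        """Reset the counters after a base state has been generated."""
        self.new_messages = 0
        self.last_update = now
```

Passing the clock value in explicitly keeps the policy deterministic, so the thresholds can be tuned and tested without waiting in real time.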

CS System-level
• Checkpoint/Restore In Userspace (CRIU)
• Saves the memory context
• Freezes processes while reading their memory
• Restores processes on machines with the same file system

CS System-level
• Phoenix!

Checkpoint Services Implementation
• URLs implemented by the chat application

Checkpoint Services
• CS Application-level
[Figure: Checkpoint Manager, Active Instance, and Standby Instance (App and Agent in each); state-aware application; modes: non-collocated, collocated warm, collocated hot]

Checkpoint Services
• CS System-level
[Figure: Checkpoint Manager, Active Instance, and Standby Instance (App and Agent in a VM/Container in each); HA-agnostic application; modes: non-collocated, collocated warm, collocated hot]

CS System-level
• LXC must be configured to allow CRIU to checkpoint and restore containers

Evaluation II
• Methodology
– Checkpoint time is presented as a mean with a 95% Confidence Interval (CI)
– Resource consumption is presented as means with 95% CIs for the active and standby instances
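A mean with a 95% confidence interval can be computed as sketched below. This version uses the normal approximation (a z-quantile of about 1.96) because it needs only the standard library; whether the original analysis used a normal or a Student's t quantile is not stated, and for small samples a t quantile would give a wider interval.

```python
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def mean_ci(samples: Sequence[float],
            confidence: float = 0.95) -> Tuple[float, float]:
    """Return (mean, half_width) of a normal-approximation CI."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    # Two-sided quantile of the standard normal, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(samples) / n ** 0.5
    return mean(samples), half_width
```

The interval is then reported as mean ± half_width, for example for the checkpoint times collected at one state size.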

CS System-level
• In non-collocated mode, the checkpoint process:
– saves the container via CRIU and stores its memory context in a file system shared between Manager and Agent
• In collocated mode, the checkpoint process:
– saves the container via CRIU and sends the state via rsync to all standby instances
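The two checkpoint paths can be sketched as command plans built on `lxc-checkpoint` (the LXC front end to CRIU) and `rsync`. The container name, directory, and host names are hypothetical, and the exact options used in the presented implementation are not given; the functions only build the argument lists and do not execute anything.

```python
from typing import List, Sequence


def checkpoint_cmd(container: str, image_dir: str) -> List[str]:
    # -n selects the container, -D is the directory receiving the CRIU images.
    return ["lxc-checkpoint", "-n", container, "-D", image_dir]


def sync_cmd(image_dir: str, standby_host: str) -> List[str]:
    # -a preserves file attributes, -z compresses data in transit.
    return ["rsync", "-az", f"{image_dir}/", f"{standby_host}:{image_dir}/"]


def checkpoint_plan(container: str, image_dir: str,
                    standby_hosts: Sequence[str] = ()) -> List[List[str]]:
    """Commands for one checkpoint.

    Non-collocated: image_dir is assumed to sit on the shared file system,
    so no standby hosts are passed. Collocated: the images are additionally
    copied to every standby instance.
    """
    plan = [checkpoint_cmd(container, image_dir)]
    plan += [sync_cmd(image_dir, host) for host in standby_hosts]
    return plan
```

Each command in the plan could then be run in order with `subprocess.run(cmd, check=True)`, so that a failed dump stops the plan before any transfer starts.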

CS System-level
• Failover process (non-collocated)

CS System-level
• Failover process (collocated warm)

CS App-level
• Failover process (non-collocated)

CS App-level
• Failover process (collocated warm)

CS App-level
• Failover process (collocated hot)

Evaluation I
• T-test between application-level collocated hot and system-level collocated warm
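A two-sample comparison of this kind can be sketched with Welch's t-test, which does not assume equal variances. Which t-test variant was actually applied is not stated, so treat this as one plausible choice; the sketch returns the statistic and the degrees of freedom, and the p-value would then come from a t-distribution table or a statistics library.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each sample needs at least two observations")
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    if va + vb == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

A statistic close to zero relative to its degrees of freedom would support the observation that the two modes have very similar failover times.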

Evaluation II
[Figure: network received (collocated modes)]

Evaluation II
[Figure: network received (non-collocated)]

Evaluation II
[Figure: CPU load (collocated modes)]

Evaluation II
[Figure: CPU load (non-collocated)]

Evaluation II
[Figure: memory occupation (collocated modes)]

Evaluation II
[Figure: memory occupation (non-collocated)]

Evaluation II
[Figure: network sent (collocated modes)]

Evaluation II
[Figure: network sent (non-collocated)]

Evaluation II
[Figure: disk written (collocated modes)]

Evaluation II
[Figure: disk written (non-collocated)]

Acknowledgments
• Family
• Friends
• Creators
• UFRPE
• Advisors (the best)
• CNPq and FACEPE
