Presentation of the final graduation project for the Bachelor of Information Systems degree
Abstract:
Cloud applications are expected to offer high availability and minimal data loss. Any failure in the hardware or software layer must be detected and recovered from quickly to maintain customer trust and avoid financial losses. For multi-tier, stateful applications, failure recovery is a major challenge because the whole state of the failed application must be retrieved and restored in a new instance. This process is called failover; it can be performed by a checkpoint service at application level or at system level. Depending on where the checkpoint data is stored, a checkpoint service can be classified as non-collocated, collocated warm, or collocated hot. This work presents an evaluation of these two checkpoint services in a physical environment, considering a multi-tier, stateful application and measuring checkpoint time, failover time, and resource consumption. A state-aware application, which handles its own state, was developed to allow checkpointing at application level. The system-level checkpoint was built by integrating existing tools, whereas the application-level checkpoint was implemented specifically for this work. The SAF standard was implemented in both approaches.
Results show that the system-level checkpoint presents worse failover and checkpoint times than the application-level solution. Furthermore, the failover time of system-level collocated warm is similar to that of application-level collocated hot, although the former grows slightly as the state size increases, while the latter is not affected by the state size.
Regarding resource consumption, system-level consumes more CPU and network, while application-level generates more load on disk and, slightly, on memory. However, bottlenecks may occur in standby instances in application-level collocated hot mode, as well as in disk writes in system-level collocated warm mode. An application-level checkpoint service is better suited to a PaaS with high availability and requires a state-aware application that handles its own state. A system-level checkpoint service is required for legacy applications and for a PaaS with a large infrastructure.
5. Introduction
• It is important to keep an application in a
PaaS running for as long as possible
• Downtime causes significant financial losses
6. Introduction
• The average hourly cost of a critical application
failure is $500,000 to $1 million.
Source: https://devops.com/2015/02/11/real-cost-downtime/ .
Last accessed 11 Oct. 2016
Checkpoint Services!
13. Motivation
• Previous works presented either app-lvl [1] or sys-lvl [2]
• Lack of a consistent comparison between
these services
• No implementation in accordance with the
SAF standard
14. Motivation
• Carry out a performance evaluation of
system-level and application-level
checkpoint services that follow the SAF
standard, and evaluate the impact of
different recovery modes on time and
resource consumption
15. Answer three questions
• System-level ~= App-level?
• Impact of changing from non-collocated to
collocated?
• Bottlenecks of the system-level and
application-level?
17. Application
• State-aware application
• A multi-tier stateful chat
– Frontend: provides interface and saves user’s
data
– Backend: saves room messages
– Database: stores information related to rooms
and users
[Diagram: App and Agent exchanging GET /state and 200 OK]
19. CS System-level
• We used well-known tools:
– LXC as container
– NFS as file system
– rsync to transfer files between instances
– CRIU to checkpoint and restore containers
CS: Checkpoint Service! :D
20. CS System-level
• We did not implement collocated hot
because CRIU does not allow restoring
into a running instance
21. CS System-level
• Checkpoint in non-collocated
[Diagram: Checkpoint Manager, Active Instance (App, Agent, Container), Standby Instance (App, Agent, Container)]
25. CS App-level
• CS at application-level was developed
from scratch for this work
• REST resources
Remember, CS: Checkpoint Service! :D
GET http://{manager_ip}:{manager_port}/config
RESPONSE 200 OK Content-type: application/json
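A minimal sketch of how a state-aware application might answer the GET /state call that appears in the App/Agent exchange, assuming the state is returned as JSON. The handler class, the `build_state_response` helper, the port, and the shape of the state dictionary are hypothetical and are not taken from the implementation evaluated in this work.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory state of a chat backend (room -> messages).
APP_STATE = {"rooms": {"general": ["hello", "hi there"]}}


def build_state_response(path, state):
    """Return (status, content_type, body) for a GET request on `path`."""
    if path == "/state":
        body = json.dumps(state).encode("utf-8")
        return 200, "application/json", body
    return 404, "text/plain", b"not found"


class StateHandler(BaseHTTPRequestHandler):
    """Serves the application state so that an agent can checkpoint it."""

    def do_GET(self):
        status, content_type, body = build_state_response(self.path, APP_STATE)
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block and serve GET /state on the given port (assumed value)."""
    HTTPServer(("0.0.0.0", port), StateHandler).serve_forever()
```

Keeping the response logic in a plain function makes it easy to check without opening a socket; `serve()` is only needed when running the application for real.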
26. CS App-level
• Checkpoint at Application-level
[Diagram: Checkpoint Manager, Active Instance (state-aware application, Agent), Standby Instance (App, Agent); modes shown: non-collocated, collocated warm, collocated hot]
27. CS App-level
• Failover in non-collocated
[Diagram: Checkpoint Manager, Active Instance (App, Agent), Standby Instance (App, Agent)]
28. CS App-level
• Failover in collocated warm
[Diagram: Checkpoint Manager, Active Instance (App, Agent), Standby Instance (App, Agent)]
29. CS App-level
• Failover in collocated hot
[Diagram: Checkpoint Manager, Active Instance (App, Agent), Standby Instance (App, Agent)]
31. Evaluation
• Two evaluations were conducted
– Evaluation I: Failover time comparison
– Evaluation II: Checkpoint time and resource
consumption comparison
33. Evaluation I
• Methodology
– Backend with state sizes of 1, 5, 10, 15, 20,
and 25 MB
– Experiment Manager starts the experiment and
generates a failure alert
– Failover process is executed
– Failover time is collected
34. Failover time – Non collocated
Application-level has a greater failover time
The growth is linear
35. Failover time – Non collocated
We estimated the failover time with the state size
increasing up to 100 MB
App lvl would be 66% faster
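An extrapolation of this kind can be illustrated with an ordinary least-squares line fit. This is only a sketch: the sample points below are placeholders chosen to lie on a straight line, not measurements from this work, and `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Placeholder data: state size in MB -> failover time in s (NOT measured values).
sizes_mb = [1, 5, 10, 15, 20, 25]
times_s = [0.5 + 0.04 * s for s in sizes_mb]

intercept, slope = fit_line(sizes_mb, times_s)
estimate_100mb = predict(intercept, slope, 100)
```

With real measurements in `sizes_mb` and `times_s`, `estimate_100mb` would be the projected failover time at 100 MB, valid only as long as the growth stays linear beyond the measured range.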
36. Failover time – Collocated
Application-level collocated warm is greatly impacted
by the increase in state size
The values of app lvl collocated hot and sys lvl
collocated warm are very similar
37. Failover time – Collocated
Linear regression shows:
– High increase for app lvl collocated warm
– Slight increase for sys lvl collocated warm
– Constant values for collocated hot
38. Evaluation II
• Methodology
– As in the previous experiment, states are
saved with the same state sizes
– Experiment Manager triggers a checkpoint
process
– Checkpoint time is collected
– Resource consumption is evaluated
39. Evaluation II
• Methodology
– Resource consumption metrics
Metric | Measured in
Checkpoint time | s
CPU load | %
Memory occupation | %
Network I/O throughput | Mbps
Disk I/O throughput | b/s
41. Evaluation II – Active Instance
At 25 MB | CPU | Memory | Network (I/O) | Disk (W)
Sys-lvl collocated warm | 6.8% | 9.4% | 0/59.8 Mbps | 1300 b/s
App-lvl collocated warm | 2.7% | 9.1% | 0/8.8 Mbps | 9220 b/s
App-lvl collocated hot | 2.53% | 9.5% | 0/8.64 Mbps | 8340 b/s

At 25 MB | CPU | Memory | Network (I/O) | Disk (W)
Sys-lvl non-collocated | 6% | 9.1% | 0/81 Mbps | 1780 b/s
App-lvl non-collocated | 2% | 8.92% | 0/11.6 Mbps | 2410 b/s
42. Evaluation II – Standby Instance
At 25 MB | CPU | Memory | Network (I/O) | Disk (W)
Sys-lvl collocated warm | 1.8% | 10.3% | 5.1/0 Mbps | 12500 b/s
App-lvl collocated warm | 2.5% | 11.9% | 8.5/8.5 Mbps | 7280 b/s
App-lvl collocated hot | 4.1% | 12.4% | 8.35/8.35 Mbps | 6900 b/s

At 25 MB | CPU | Memory | Network (I/O) | Disk (W)
Sys-lvl non-collocated | 0.16% | 9.8% | 0/0 Mbps | 800 b/s
App-lvl non-collocated | 0.2% | 11.4% | 0/0 Mbps | 2600 b/s
43. Discussion
• Availability analysis over one year
• Mean Time To Recovery (MTTR) taken as the
failover time
• Mean Time To Failure (MTTF) taken as that of an
Apache server (788.4 h/year) [3]
• Assuming a failover time 50 times greater
than the measured one
• High Availability (HA) = 99.999% (five
nines)
44. Discussion
Availability analysis (25 MB)
Service | MTTR at 25 MB (s) | MTTR at 25 MB with factor 50 (s) | MTTF (s) | Availability with factor 50 (%)
System-level collocated warm | 0.38636 | 19.318 | 2838240 | 99.9993
Application-level collocated warm | 1.27823 | 63.9115 | 2838240 | 99.997
Application-level collocated hot | 0.25802 | 12.901 | 2838240 | 99.9995
System-level non-collocated | 3.5441 | 177.205 | 2838240 | 99.9937
Application-level non-collocated | 1.38795 | 69.3975 | 2838240 | 99.997
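The availability values for 25 MB are consistent with the usual steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below reproduces two rows under that assumption; the formula is inferred from the numbers rather than stated on the slide, and the function name is hypothetical.

```python
MTTF_S = 788.4 * 3600  # 788.4 h expressed in seconds (2838240 s)


def availability_percent(mttr_s, mttf_s=MTTF_S, factor=50):
    """Steady-state availability in percent, with MTTR scaled by `factor`."""
    scaled_mttr = mttr_s * factor
    return 100.0 * mttf_s / (mttf_s + scaled_mttr)


# System-level collocated warm at 25 MB: measured MTTR of 0.38636 s.
sys_warm = availability_percent(0.38636)
# Application-level collocated hot at 25 MB: measured MTTR of 0.25802 s.
app_hot = availability_percent(0.25802)
```

Rounded to four decimal places, `sys_warm` and `app_hot` match the 99.9993 and 99.9995 reported for those two services, which suggests that this is the formula behind the analysis.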
45. Discussion
Availability analysis (prediction until 100 MB)
Service | MTTR at 100 MB (s) | MTTR at 100 MB with factor 50 (s) | MTTF (s) | Availability with factor 50 (%)
System-level collocated warm | 0.5902 | 29.51 | 2838240 | 99.9989
Application-level collocated warm | 3.8621 | 193.1 | 2838240 | 99.993
Application-level collocated hot | 0.2677 | 13.385 | 2838240 | 99.9995
System-level non-collocated | 9.7999 | 498.995 | 2838240 | 99.9824
Application-level non-collocated | 4.321 | 216.05 | 2838240 | 99.9923
48. Conclusions
• Impact of changing from non-collocated to
collocated?
– Failover time: large decrease
– Checkpoint time: large increase
– Resource consumption: similar, except for
CPU and disk (greater in collocated)
49. Conclusions
• Bottlenecks of the system-level and
application-level?
– App: disk, CPU in standby (hot), and
development time
– Sys: CPU, network, and NFS
53. Contributions
• Short paper accepted with the results of
Experiment I, entitled:
“Failover Time Evaluation Between
Checkpoint Services in Multi-tier Stateful
Applications”
IM-2017, Exp. Session (Qualis B1)
54. Future Work
As future work, we will study
• Scalability of the services
• Resource consumption on the Experiment
Instance
57. References
• [1] KANSO, Ali; LEMIEUX, Yves. Achieving High Availability at the Application Level in the Cloud. In: 2013 IEEE Sixth International Conference on Cloud Computing. IEEE, 2013. p. 778-785.
• [2] LI, Wubin; KANSO, Ali; GHERBI, Abdelouahed. Leveraging Linux containers to achieve high availability for cloud services. In: Cloud Engineering (IC2E), 2015 IEEE International Conference on. IEEE, 2015. p. 76-83.
• [3] MELO, R. M. D. et al. Redundant VoD streaming service in a private cloud: availability modeling and sensitivity analysis. Mathematical Problems in Engineering, Hindawi Publishing Corporation, v. 2014, 2014.
64. Objectives
• General
– Carry out a consistent comparison between
checkpointing at system and application levels
• Specific
– Develop the two modes following the SAF standard
– Compare the services on the following metrics:
• Failover time
• Checkpoint time
• Load generated in the application
65. Application
• The application generates a new base state if
– a threshold defined by the developer has been reached
– a time limit has been reached
App: 20 new messages!
App: 120 seconds without updates!
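A minimal sketch of the two triggers described above, using the example values from the slide (20 new messages, 120 seconds without updates). The class and method names are hypothetical, and measuring the time limit from the last base state is an assumption; the slide does not say exactly how the timer is reset.

```python
import time


class BaseStateTrigger:
    """Decides when the application should generate a new base state."""

    def __init__(self, message_threshold=20, time_limit_s=120.0,
                 clock=time.monotonic):
        self.message_threshold = message_threshold
        self.time_limit_s = time_limit_s
        self.clock = clock  # injectable so the logic can be tested
        self.pending_messages = 0
        self.last_base_state_at = clock()

    def record_message(self):
        """Count one state change (for example, a new chat message)."""
        self.pending_messages += 1

    def should_generate(self):
        """True if the message threshold or the time limit has been reached."""
        elapsed = self.clock() - self.last_base_state_at
        return (self.pending_messages >= self.message_threshold
                or elapsed >= self.time_limit_s)

    def mark_generated(self):
        """Reset the counters after a new base state has been saved."""
        self.pending_messages = 0
        self.last_base_state_at = self.clock()
```

The application would call `record_message()` on every update, poll `should_generate()`, and call `mark_generated()` once the new base state has been handed to the checkpoint service.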
72. CS System-level
• LXC must be configured to allow CRIU to
checkpoint and restore containers
73. Evaluation II
• Methodology
– Checkpoint time is presented as a mean with a
95% Confidence Interval (CI)
– Resource consumption values are means with a 95% CI
for the active and standby instances
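A short sketch of how a mean with a 95% CI can be computed from repeated measurements. The slides do not say which interval method was used; this version uses the normal approximation as an assumption, and for small samples a Student's t quantile would give a wider, more appropriate interval. The function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def mean_ci95(samples):
    """Return (mean, half_width) of a 95% CI using the normal approximation."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    z = NormalDist().inv_cdf(0.975)  # two-sided 95% quantile, about 1.96
    half_width = z * stdev(samples) / n ** 0.5
    return mean(samples), half_width


# Placeholder checkpoint times in seconds (NOT measured values).
m, hw = mean_ci95([1.0, 2.0, 3.0, 4.0, 5.0])
```

The interval is then reported as `m ± hw`.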
74. CS System-level
• In non-collocated mode, the checkpoint process consists of:
– saving the container via CRIU and storing its
memory context in a file system shared
between Manager and Agent
• In collocated mode:
– saving the container via CRIU and sending the state via
rsync to all standby instances
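The collocated flow can be sketched as a list of commands: one container dump followed by one transfer per standby instance. This assumes that LXC's `lxc-checkpoint` front end to CRIU is used; the slides do not give the exact commands, and the container name, paths, and host names below are hypothetical. The functions only build the command lines and do not execute anything.

```python
import shlex


def checkpoint_command(container, checkpoint_dir):
    """Dump the container state through CRIU into checkpoint_dir."""
    return ["lxc-checkpoint", "-n", container, "-D", checkpoint_dir]


def restore_command(container, checkpoint_dir):
    """Restore (-r) the container from a previously written checkpoint."""
    return ["lxc-checkpoint", "-r", "-n", container, "-D", checkpoint_dir]


def rsync_command(checkpoint_dir, standby_host, remote_dir):
    """Copy the checkpoint directory contents to one standby instance."""
    source = checkpoint_dir.rstrip("/") + "/"
    return ["rsync", "-az", source, f"{standby_host}:{remote_dir}"]


def collocated_checkpoint_plan(container, checkpoint_dir, standby_hosts,
                               remote_dir):
    """Commands for one checkpoint plus a transfer to every standby."""
    plan = [checkpoint_command(container, checkpoint_dir)]
    plan += [rsync_command(checkpoint_dir, host, remote_dir)
             for host in standby_hosts]
    return plan


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in collocated_checkpoint_plan(
            "chat-backend", "/var/ckpt/chat", ["standby1"], "/var/ckpt/chat"):
        print(shlex.join(cmd))
```

In non-collocated mode the rsync step would be dropped, with `checkpoint_dir` pointing at the shared file system instead.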
Emphasize that a PaaS has preconfigured modules and that a provider must always maintain the reliability of the service
Provides web page where users choose products. ***to users**
Try to emphasize that stateful is the most important part, rather than multi-tier. If possible, say that multi-tier makes checkpointing the application harder
Remember the PaaS provider's reliability with respect to the developer
I said maybe...
Emphasize mainly Checkpoint Saving and Failover, so that people do not forget them
Try to finish this slide before the 6-minute mark
If possible, state the definition of availability
Go through this faster
Evaluate the performance, not propose
SAF, not OpenSAF
Say that we focused on the backend because of its greater exchange and storage of messages
Put sys level before app level, saying that we used existing tools such as CRIU, rsync, etc.
To save time, maybe make an animation explaining how both app lvl and sys lvl work
Make animations of the failovers
You rascal, remember that you are the one who built the app-lvl
Just insert the REST calls to differentiate the modes
Add the REST calls here too, highlighting the failover
Figure of the scenario
Explain in bullet points what was done
Add the numbers as percentages
Table summarizing the results. Highlight the main results
Table summarizing the results
Add discussion of the other results
Make a name for yourself
Development of the app
Comparison of the services
Paper
Turn into bullet points
Template
Explain CRIU
Table to summarize
Shorten the titles
Check the subtopics
Enlarge the legends
Table
Mention that the data is compressed
PaaS with legacy systems
t-test with a backup slide
Remove
It would be great if we surveyed the existing PaaS offerings and added a figure showing which ones include checkpointing. Maybe that project doc contains more information
Be careful with "do a checkpoint"; say "checkpoint the application" instead
Define different levels: those that restart the app when it fails and those that perform checkpoint/failover
CloudFoundry: No checkpoint/ No Restart application
Tsuru: No checkpoint/ Yes restart application
Deis: No checkpoint / No restart application
Flynn: No checkpoint / No restart application
Openshift: No checkpoint / No restart application
Go through this faster
Another animation here would be good
I said maybe...
Give a brief explanation of CRIU
To save time, maybe make an animation explaining how both app lvl and sys lvl work