Cloud providers like Amazon or Google offer a great user experience for creating and managing PaaS and IaaS services. But is it possible to reproduce the same experience and flexibility locally, in an on-premises datacenter? This talk describes the success story of building a private cloud on top of a DC/OS cluster. It is used to host and share services like Hadoop or Kafka between development teams, and to dynamically manage services and resource pools with GKE integration.
6. How a resource request works
(Diagram: Dev Team, IT, Infra Team)
The Dev Team asks for a separate Kafka instance:
• request IT to provide a Linux VM
• configure access
• request the Infrastructure Team to set up Kafka
• add monitoring
• add a health check
10. Maintain infrastructure manually
Using Docker allowed us to standardize the deployment process behind a Docker-based interface.
A Docker image for each component had to be prepared and preconfigured for our needs.
11. Maintain infrastructure manually
Cloud engineers have plenty of good examples of what component deployment automation should look like.
The next step was to provide a user experience simple enough to keep this process stupidly simple.
15. It’s all about Jenkins
The short story of how we started to use Mesos
16. Jenkins for CI/CD
• ~300 repositories
• ~500 builds per day
• ~1 build per minute
• ~40 Windows slaves
17. Jenkins for CI/CD
It requires additional interfaces to provide and process information for developers and build engineers.
23. Jenkins for CI/CD
The common infrastructure includes:
• Elastic + Kibana
• Go APIs
• Angular SPAs
• MySQL and Redis
• Zabbix for Jenkins slave monitoring
• SonarQube
28. Jenkins for CI/CD
The most common issues were:
• Out of free disk space
• Service faults
• Network outages
• VM shutdowns
• CPU/MEM overload
• Broken deployed packages
29. Jenkins for CI/CD
And here is the moment you realize that having a lot of pets in your datacenter may not be a good idea.
31. Jenkins for CI/CD
Requirements for infrastructure management:
• Must be distributed
• Supports health checks
• Docker based
• Easy to deploy and maintain
• Can scale
• Supports persistent storage
• User friendly
• Self-recoverable
41. Resource Management in Mesos
Mesos is responsible for running tasks and managing resources on the nodes. When a new task starts, the Mesos master automatically assigns it to a node depending on resource usage.
(Diagram: Marathon + ZooKeeper managing several Mesos agents)
42. Resource Management in Mesos
When an Agent Node is initialized and the Mesos slave is started, it analyzes the available resources and connects to the cluster with a predefined resource pool.
(Diagram: workerA: 4 CPU, 8 GB RAM, 10 GB HDD; workerB: 2 CPU, 4 GB RAM, 10 GB HDD)
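By default the agent auto-detects its resources; they can also be pinned explicitly at startup. A minimal sketch of joining the cluster with a fixed pool (host names and values are assumptions):

# workerA from the diagram: 4 CPU, 8 GB RAM, 10 GB disk (mem/disk are in MB).
# Without --resources, the agent advertises whatever it auto-detects.
mesos-agent --master=zk://master1.company.local:2181/mesos \
            --containerizers=docker,mesos \
            --resources='cpus:4;mem:8192;disk:10240'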
43. Resource Management in Mesos
A new service can be started only with a predefined amount of requested resources. To change this amount, a full service redeployment is needed.
(Diagram: a task requesting 2 CPU, 6 GB RAM, 5 GB HDD next to workerA: 4 CPU, 8 GB RAM, 10 GB HDD and workerB: 2 CPU, 4 GB RAM, 10 GB HDD)
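In Marathon terms, the requested amount is part of the app definition itself (mem and disk are in MB), which is why changing it means redeploying. A sketch of creating such an app via the Marathon REST API; the master URL, image, and app id are assumptions:

# Request 2 CPU, 6 GB RAM and 5 GB disk for a single instance.
curl -X POST http://dcos-master.company.local/marathon/v2/apps \
     -H 'Content-Type: application/json' \
     -d '{
           "id": "/devteam1/my-service",
           "container": {"type": "DOCKER",
                         "docker": {"image": "company/my-service:latest"}},
           "cpus": 2, "mem": 6144, "disk": 5120,
           "instances": 1
         }'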
44. Resource Management in Mesos
When a new deployment starts, the task is placed on an arbitrary Agent Node that has enough free resources.
If resources are lacking, the deployment is put into the Pending state.
(Diagram: the 2 CPU / 6 GB RAM / 5 GB HDD task fits on workerA but not on workerB)
45. Resource Management in Mesos
During service scaling, the resource request and assignment operations are performed for each instance separately.
(Diagram: two instances of a 1 CPU / 1 GB RAM task, each placed independently across workerA and workerB)
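Scaling only changes the instance count on the same app; Marathon then runs the request/assignment cycle once per new instance. A sketch, with the same assumed URL as above:

# Scale to two instances; each one is placed independently.
curl -X PUT http://dcos-master.company.local/marathon/v2/apps/devteam1/my-service \
     -H 'Content-Type: application/json' \
     -d '{"instances": 2}'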
46. Resource Management in Mesos
The killer feature of DC/OS is better resource utilization: each machine can potentially host more than one component.
Health checking, deployment, distribution, and resource management are handled on the DC/OS (Mesos) side.
(Diagram: workerA and workerB sharing multiple deployed task instances)
47. Resource Management in Mesos
DC/OS takes care of health checking both the Agent Nodes and each running service instance.
If a service becomes unhealthy, that service (or all services on a broken node) is redeployed to other nodes.
(Diagram: a failed task instance being redeployed from one worker to another)
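Instance health checks are declared on the app, and Marathon restarts instances that fail them. A sketch of adding an HTTP check to the app above (path and thresholds are assumptions):

# The instance is killed and redeployed after 3 consecutive failed checks.
curl -X PUT http://dcos-master.company.local/marathon/v2/apps/devteam1/my-service \
     -H 'Content-Type: application/json' \
     -d '{
           "healthChecks": [{
             "protocol": "HTTP", "path": "/health", "portIndex": 0,
             "gracePeriodSeconds": 300, "intervalSeconds": 30,
             "timeoutSeconds": 10, "maxConsecutiveFailures": 3
           }]
         }'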
52. Summary
• DC/OS is all about resource usage
• Each Agent Node has a declared resource scope
• Each service has a declared resource request
• DC/OS deploys new tasks only when free resources are available
• An Agent Node is shared between services as long as it has free resources
• Each service and Agent Node has its own health check; on a failed state the task is redeployed
• Bonus: if a master fails, all Agent Nodes and deployed services keep working
54. Hiding Pets behind the Cattle
Managing hardware for a Docker orchestration private cloud
55. DC/OS Deployment – Bootstrap Node
The bootstrap node is used only during the installation and upgrade process, so there are no specific recommendations for high-performance storage or separate mount points.
(Diagram: Bootstrap Node next to three Master nodes)
56. DC/OS Deployment – Master Nodes
Master nodes should be joined into an HA cluster.
(Diagram: Bootstrap Node next to the three-master cluster)

            Minimum     Recommended
Nodes       1*          3 or 5
Processor   4 cores     4 cores
Memory      32 GB RAM   32 GB RAM
Hard disk   120 GB      120 GB
57. DC/OS Deployment – Agent Nodes
Agent Nodes are the worker nodes where tasks and services run. They support Docker or any other Mesos runtime.
(Diagram: Agent Nodes attached to the master cluster)
61. How we started
• Vagrant + VirtualBox
• Mini PCs
• 8 TVs with Mini PCs
• Ubuntu 16.04
• Daily usage: Scrum and monitoring dashboards
• 8 * 2 CPU = 16 CPU
• 8 * 4 GB RAM = 32 GB RAM
• PCs
• VMware
• Google Cloud
62. DC/OS Initial Setup
We started with a simple configuration: one master node and one slave node dedicated to service deployments.
Elastic was the first try: it aggregated a lot of logs and went into a broken state from time to time.
(Diagram: single-master cluster used by DevOps)
63. DC/OS Initial Setup – first customer
The scheme was very simple: when we needed some service running, we requested a VMware machine for it and added it as a DC/OS Agent.
Behind the scenes, all VMs became DC/OS nodes.
(Diagram: master cluster plus a 4 CPU / 8 GB agent VM, used by DevOps)
64. DC/OS Initial Setup – TV boxes
The best move was to use the TV boxes that run the scrum meetings in all the office rooms.
That gave us a lot of free resources just for fun:
10 * 2 CPU = 20 CPU
10 * 4 GB = 40 GB
(Diagram: cluster at 24 CPU / 48 GB)
65. DC/OS Initial Setup – internal services
This setup allowed us to run all the services required for the internal needs of the infrastructure teams... and a little more, like bots for Slack, Sonar, etc.
(Diagram: cluster at 24 CPU / 48 GB, now including a Slack bot)
66. DC/OS Initial Setup – issues
The main issues at this stage were:
• Master Node performance
Master nodes lacked resources, which often caused DC/OS UI or Marathon failures. The temporary solution was a master node restart.
• Agent Node failures
Out of free disk space, machine shutdown, high CPU load, out of memory: the most common reasons for failures.
68. DC/OS Initial Setup – Cluster
With more than one Agent, the single master became the weak point:
• In case of failure, the whole system goes down
• Lack of performance
(Diagram: master expanded into a three-node cluster, at 24 CPU / 48 GB)
70. DC/OS Initial Setup – more VMs
With this setup, DC/OS became ready to take external requests.
(Diagram: three-master cluster at 40 CPU / 60 GB, used by a Dev team)
71. DC/OS Initial Setup – Hardware PCs
A few dedicated PCs joined the cluster in the worker Agent role.
(Diagram: cluster at 60 CPU / 90 GB, used by a Dev team)
72. DC/OS Initial Setup – Self-Bootstrap
To allow new nodes to be set up automatically without logging in, Chef was added.
Running as part of DC/OS Services, Chef lets us bootstrap new nodes much more quickly.
(Diagram: cluster at 60 CPU / 90 GB)
73. DC/OS Initial Setup – Google Cloud
Creating a scaling group in the cloud makes the DC/OS cluster effectively unlimited in resources.
(Diagram: cluster at 100 CPU / 160 GB, including a cloud unit)
74. Summary
• The number of nodes grew gradually
• At first, the Infra Team used DC/OS only for its own needs
• Monitoring and bootstrapping the cluster required some extra resources: Zabbix and Chef
• Mixing different node types increased flexibility and accelerated growth
• Adding Google Cloud instances eliminated the cluster size limit; with a hybrid cloud, DC/OS can grow much more quickly
75. Docker as a Service
Available Docker images and use cases for a shared infrastructure cluster
76. Who are You – Mr. Microservice?
Meet Mr. Microservice. He is written in .NET Standard/C#, self-hostable, and HTTP/REST friendly.
(Diagram: Mr. Microservice)
77. Who are You – Mr. Microservice?
Mr. Microservice uses Consul for service discovery. To find his friends and call them in an HA way, he uses the Fabio load balancer, as sketched below.
(Diagram: Mr. Microservice + Consul + Fabio-lb: discovery)
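Fabio builds its routing table from Consul: a service registered with a urlprefix- tag is picked up automatically. A sketch of registering Mr. Microservice with the local Consul agent (service name, port, and path are assumptions):

# Register the service with a health check; Fabio routes /mr-microservice to it.
curl -X PUT http://localhost:8500/v1/agent/service/register \
     -H 'Content-Type: application/json' \
     -d '{
           "Name": "mr-microservice",
           "Port": 5000,
           "Tags": ["urlprefix-/mr-microservice"],
           "Check": {"HTTP": "http://localhost:5000/health", "Interval": "10s"}
         }'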
78. Who are You – Mr. Microservice?
Some of Mr. Microservice's old friends prefer to talk through RabbitMQ (using RPC).
(Diagram: Mr. Microservice + Consul + Fabio-lb: events, discovery)
79. Who are You – Mr. Microservice?
Shipping logs in a high-load system is a real art. Kafka helps make this contraption much simpler and more stable.
Unfortunately, users will not be happy searching log records with kafka-console-consumer.sh; Elastic and Kibana are the right tools for dealing with logs.
(Diagram: Mr. Microservice + Consul + Fabio-lb: logs, events, discovery)
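For context, searching logs straight from Kafka looks roughly like this, which is exactly what Kibana spares users from (broker address and topic name are assumptions):

# Grepping a whole topic from the beginning: workable once, painful daily.
./kafka-console-consumer.sh --bootstrap-server kafka.marathon.mesos:9092 \
                            --topic service-logs --from-beginning | grep ERROR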
80. Who are You – Mr. Microservice?
In a large distributed system it is hard to make proper scaling decisions without knowing some of the system internals.
Collecting and analyzing metrics is the best approach to gathering runtime data continuously. InfluxDB and Grafana are among the popular tools for this.
(Diagram: Mr. Microservice + Consul + Fabio-lb: logs, metrics, events, discovery)
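Pushing a metric point into InfluxDB is one HTTP call in its line protocol; a sketch against the 1.x write endpoint (host, database, and measurement names are assumptions):

# One point: measurement, a tag, a field value; the timestamp defaults to "now".
curl -X POST 'http://influxdb.marathon.mesos:8086/write?db=metrics' \
     --data-binary 'request_latency,service=mr-microservice value=0.042'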
81. Who are You – Mr. Microservice?
Zipkin is a great tool for collecting runtime traces, enabled through the microservice's configuration.
(Diagram: Mr. Microservice + Consul + Fabio-lb: logs, metrics, trace, events, discovery)
82. Who are You – Mr. Microservice?
A distributed cache is something you really need in a high-load system.
Aerospike is one of the possible solutions.
(Diagram: as above, with cache added)
83. Who are You – Mr. Microservice?
Oh, your Mr. Microservice has its own state? Greetings, here is your MongoDB / MySQL / PostgreSQL / Cassandra!
(Diagram: as above, with state added)
84. Who are You – Mr. Microservice?
HashiCorp Vault is good enough for securing configuration settings, but it also requires some initial infrastructure setup.
(Diagram: as above, with config added)
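Once Vault is initialized and a token is issued, reading and writing config secrets is plain HTTP; a sketch against a KV v1 mount (host, secret path, and token variable are assumptions):

# Store a setting, then read it back.
curl -H "X-Vault-Token: $VAULT_TOKEN" -X POST \
     http://vault.marathon.mesos:8200/v1/secret/mr-microservice \
     -d '{"db_password": "s3cr3t"}'
curl -H "X-Vault-Token: $VAULT_TOKEN" \
     http://vault.marathon.mesos:8200/v1/secret/mr-microservice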
85. Who are You – Mr. Microservice?
But sometimes all of this is a bit too specific, even for experienced DevOps engineers.
(Diagram: the full picture: discovery, events, logs, metrics, trace, cache, state, config)
88. Solution 1 – add services quickly
The service catalog lets you choose a service from a predefined list and deploy it in one click.
If needed, your own repository can be added.
89. Solution 1 – add services quickly
For all the remaining cases, manual service deployment is available (a sketch of option 1 follows the list):
1. Single container (Docker)
2. Bash runtime
3. Multi-container
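A single-container deployment is just a Marathon app with a Docker section; hostPort 0 asks DC/OS to pick the port automatically. A sketch (master URL, image, and app id are assumptions):

curl -X POST http://dcos-master.company.local/marathon/v2/apps \
     -H 'Content-Type: application/json' \
     -d '{
           "id": "/devteam1/nginx",
           "cpus": 0.5, "mem": 256, "instances": 1,
           "container": {
             "type": "DOCKER",
             "docker": {
               "image": "nginx:alpine", "network": "BRIDGE",
               "portMappings": [{"containerPort": 80, "hostPort": 0}]
             }
           }
         }'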
90. Solution 2 – Folder Structure
Services can be organized in a folder structure. This feature makes it possible to isolate environments for different dev teams.
91. Solution 3 – Mesos DNS vs Marathon-LB
Mesos DNS, integrated with the company DNS server, allows accessing each service directly by Agent IP/port.
Marathon-LB is an L4 load balancer that routes requests to the target Agent IP/port through HAProxy.
92. DNS and LB – Services access
The same service can be deployed under different names and at different nesting levels. DC/OS takes care of assigning ports automatically.
(Diagram: AgentA 192.168.101.21 and AgentB 192.168.101.26, with consul at 192.168.101.21:15632, consul under devteam1 at 192.168.101.21:15444, consul under devteam2 at 192.168.101.26:15327, and Marathon-LB at 192.168.101.26:15121)
93. DNS and LB – Services access
dig consul.marathon.mesos
It is easy to integrate Mesos DNS with your company DNS servers.
(Diagram: as above)
94. DNS and LB – Services access
dig consul.devteam2.marathon.mesos
All running services launched on DC/OS get an FQDN based upon the service that launched them, in the form <service-name>.<group-name>.<framework-name>.mesos.
(Diagram: as above)
95. DNS and LB – Services access
curl consul.marathon.mesos:15632
But to access a service over TCP or HTTP, the client needs to know the port of the running service instance. Ports can be assigned automatically (the default) or manually.
(Diagram: as above)
96. DNS and LB – Services access
curl consul.devteam2.marathon.mesos:15327
This approach has one significant limitation: you always have to take care of the ports yourself.
(Diagram: as above)
97. DNS and LB – Services access
dig consul.devteam1.company.local
Another approach is to use an internal Marathon-LB and manage each service manually on the DNS server.
(Diagram: DNS maps consul.devteam1.company.local to 192.168.101.26, where Marathon-LB runs)
98. DNS and LB – Services access
curl consul.devteam1.company.local:80
In this case, Marathon-LB takes care of the port redirection automatically (here, forwarding to the consul instance on port 15444); the labels that drive this are sketched below.
(Diagram: DNS, then Marathon-LB at 192.168.101.26, then consul on port 15444)
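Marathon-LB decides what to expose from app labels; a sketch of the labels that would put the devteam1 consul behind the virtual host above (the HAProxy group name is an assumption):

# HAPROXY_0_VHOST maps the app's first port to the given host name on port 80.
curl -X PUT http://dcos-master.company.local/marathon/v2/apps/devteam1/consul \
     -H 'Content-Type: application/json' \
     -d '{
           "labels": {
             "HAPROXY_GROUP": "internal",
             "HAPROXY_0_VHOST": "consul.devteam1.company.local"
           }
         }'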
99. Summary
• DC/OS allows building complex DEV/UAT environments based on Docker infrastructure
• The simplest deployment path is the Universe catalog, with well-known services deployed in one click
• Each service can be placed in a separate folder
• Mesos DNS includes the full folder structure in the service DNS name
• Marathon-LB can proxy any external call through HAProxy to the target service instance (translating IP and port)