Consul administration at scale

Pierre Souchay
Discovery Team @Criteo
Twitter: @vizionr
Github: pierresouchay

Our 30-minute presentation
1. Numbers: What we do
2. Make it work 24/7: Our pillars
3. Tools to scale: What’s new?
4. Consul everywhere: Benefits
5. Tools / References: Q&A

The discovery team
• Consul in use for 3+ years at Criteo
• Dedicated team: 6 months old, 5 people
• SDK development (JVM / C# / Python) and tools (GUIs)
• Handles all the infrastructure, on call 24/7
• Architecture; 1st external Consul contributor (70+ PRs)

Our Infrastructure
• Prod: 35k bare-metal hosts (40/60 Windows/Linux), 8 DCs (2 Hadoop)
• 3,200 kinds of services with 260k instances
• Up to 2.5M req/sec, 100+ PB of data in Hadoop
• More than 300 developers: we MUST scale users too

Consul to rule them all
• Automatic load balancer provisioning (F5 / HAProxy)
• SDKs provide discovery for all apps
• DNS provides discovery for systems that are not Consul-aware
• Bare-metal systems / Hadoop / Mesos (~Nomad)
• K/V for configuration of some tools

When Consul is down, Criteo is down.

Rule #1 - (1/3) Full automation - as predictable as possible
• 35k Consul agents installed by Chef
• Registration of services
  • by Chef with helpers: standardized/easy
  • in Mesos: standardized/automatic

Rule #1 - (2/3) Full automation - as predictable as possible
• More than 3k services, with service registration protected by ACLs
• ACLs as a Service, through a REST API
• No service conflicts by default; goal: 1 ACL per service
• Help people add service metadata: version, alerts...
• Deploy in preprod, check ACLs, go to prod

Rule #1 - (3/3) Full automation - as predictable as possible
• Secure by default in order to be predictable
• Nobody can write to the APIs from outside localhost
  • https://github.com/hashicorp/consul/issues/4712
  • Available in Consul 1.4.2+
• Reduce entropy added by humans

Rule #2 - Metrics (1/3)
• Blackbox monitoring (5+ probes in each DC)
• Register a service, then wait for its publication in the Consul catalog
• SLA: objective of 1s to register a service, 3s at most
• When the SLA is violated, wake up the on-call engineer
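
The probe logic above can be sketched as a small polling loop. This is only a minimal illustration, not Criteo's actual probe: the helper names (`wait_for_publication`, `classify`) are hypothetical, and treating the 1s-to-3s range as "above objective" without paging is an assumption. In a real probe the block would register a service through the local agent and then query the catalog; here a simulated catalog stands in for it.

```ruby
# Illustrative probe helpers; all names here are hypothetical.
OBJECTIVE_SECONDS = 1.0 # target registration time from the slide
MAX_SECONDS = 3.0       # upper bound from the slide

def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Calls the block repeatedly until it returns a truthy value or the timeout
# expires. Returns the elapsed seconds, or nil if the timeout was reached.
def wait_for_publication(timeout: MAX_SECONDS, interval: 0.05)
  started = monotonic_now
  loop do
    published = yield
    elapsed = monotonic_now - started
    return elapsed if published
    return nil if elapsed >= timeout

    sleep interval
  end
end

# Maps a measured latency to an action. Paging only above MAX_SECONDS is an
# assumption about how the SLA is enforced.
def classify(elapsed)
  return :page_on_call if elapsed.nil? || elapsed > MAX_SECONDS
  return :above_objective if elapsed > OBJECTIVE_SECONDS

  :ok
end

# Usage with a simulated catalog that shows the service on the third poll.
polls = 0
elapsed = wait_for_publication(interval: 0.01) { (polls += 1) >= 3 }
puts "published after #{polls} polls: #{classify(elapsed)}"
```

Using a monotonic clock keeps the measurement immune to wall-clock adjustments, which matters when the threshold is only a few seconds.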

• Consul metrics
  • Native Prometheus support
  • Additional on-call alerts
  • Track new usages (increase in RPCs, DNS calls…)
  • Debug when things go wrong

• consul-templaterb: metrics.erb exports to Prometheus
  • Provides the rate of changes
  • Provides the number of instances Passing/Warning/Critical
  • View from an agent's point of view, not from the Consul servers

Rule #3 - Logs, info and History
• Logs in Kibana for Consul servers and a few canary agents
  • Analyzed regularly for early error detection
• Expose all data to everybody
  • Instant view of all services
  • Timeline of changes for all services

Rule #4 - Ready to patch
• Consul fork: upstream with patches
• Ready to go to prod in less than 2 hours
• Compare metrics after deployment
• Preprod → Observe → Prod
• Deploy feature by feature, no bulk updates

Rule #5 - Work on upstream
• Look at all issues on GitHub
  • Look for known patterns
  • Check whether an issue might impact us
• Open a PR when an issue is potentially critical for us (e.g. #5050)

Consul-UI: scalable UI to show all details about a service

Consul-UI: timeline of changes: no longer an ops problem

Consul-templaterb metrics.erb: changes/sec on a service
Changes/sec is a good indicator. It lets you detect:
- deployments (right)
- incidents or future incidents
- optimizations to perform
Many optimizations/fixes came from Criteo:
- #3889, #4720 and many more merged
- #5050 improved performance by more than 100x

Consul-Templaterb: script everything! 1/2
<%
  # Generates a shell script that deregisters every instance tagged `marathon`
  # that has no health check besides SerfHealth (fewer than 2 checks in total).
  instances_to_cleanup = 0
  total_instances = 0
  datacenters.each do |dc|
    services(dc: dc, tag: 'marathon').each do |service_name, tags|
      service(service_name, dc: dc, tag: 'marathon').each do |snode|
        total_instances += 1
        if snode['Checks'].count < 2
          instances_to_cleanup += 1
%>ssh $SSH_OPTIONS <%= snode['Node']['Node'] %> "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/<%= snode['Service']['ID'] %>"
<%
        end
      end
    end
  end
%>
echo Found <%= instances_to_cleanup %> / <%= total_instances %> instances to cleanup

Consul-Templaterb: script everything! 2/2
Call it once…
$ consul-templaterb -c <CONSUL_ADDR> ./clean_svcs_without_hc.sh.erb --once && bash ./clean_svcs_without_hc.sh
Or automatically every minute!
--wait 60 --template "clean_svcs_without_hc.sh.erb:./result.sh:bash ./result.sh"
Example of the generated script:
ssh $SSH_OPTIONS mesos-slave017-pa4.central.criteo.preprod "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/marathon-app-deepr-pipeline-31510-23d44ebc2b8d11e9b0125065f387ef80"
ssh $SSH_OPTIONS mesos-slave019-pa4.central.criteo.preprod "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/marathon-app-jtc-jtc-app-31934-1d5a8e202ba611e9b0125065f387ef80"
echo Found 2 / 1812 instances to cleanup

Useful configuration hints
• See the video from Michael Stewart (To 20,000 Nodes and Beyond)
• We had the same stories and found the same tricks
• Read the Consul docs: everything is an RPC, there is no cache by default
• Use discovery_max_stale to scale servers horizontally
• Use a TTL for DNS and allow_stale = true
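
As a rough illustration of the last two hints, the snippet below builds an agent configuration fragment and prints it as JSON. The option names (`discovery_max_stale`, `dns_config.allow_stale`, `dns_config.max_stale`, `dns_config.node_ttl`, `dns_config.service_ttl`) are Consul agent settings; the helper name and every duration are placeholders chosen for the example, not the values used at Criteo.

```ruby
require 'json'

# Builds a configuration hash that lets any server answer discovery and DNS
# queries from slightly stale data. Durations are placeholder values.
def stale_friendly_config(max_stale: '5s', service_ttl: '5s', node_ttl: '30s')
  {
    'discovery_max_stale' => max_stale,
    'dns_config' => {
      'allow_stale' => true,
      'max_stale' => max_stale,
      'node_ttl' => node_ttl,
      'service_ttl' => { '*' => service_ttl }
    }
  }
end

# Print a fragment that could be saved as a JSON agent configuration file.
puts JSON.pretty_generate(stale_friendly_config)
```

Stale reads can be served by any server rather than only the leader, which is what makes adding servers increase read capacity.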

Benefits (1/2)
• Inversion of Control
  • Monitoring can be automated: passing/critical ratio > 0.5
  • Everything is As A Service; users are free to experiment (full network automation, for instance)
• Everything is standardized
  • ServiceMeta standardization, LB weights...
  • Build features on top of services: monitoring, version tracking
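
One way to read the automated monitoring rule is sketched below. It assumes the ratio means the share of passing instances among all instances of a service, and that an instance takes the worst status among its checks; both are interpretations, and the function names are hypothetical. The input mirrors the shape of the entries returned by Consul's health endpoint for a service (a `Checks` array whose items carry a `Status`).

```ruby
# Severity order used to pick the worst check status of an instance.
SEVERITY = { 'passing' => 0, 'warning' => 1, 'critical' => 2 }.freeze

# Worst status among an instance's checks; unknown statuses count as critical.
def instance_status(entry)
  statuses = entry.fetch('Checks', []).map { |check| check['Status'] }
  statuses.max_by { |status| SEVERITY.fetch(status, 2) } || 'passing'
end

# Share of instances whose checks are all passing, between 0.0 and 1.0.
def passing_ratio(entries)
  return 0.0 if entries.empty?

  passing = entries.count { |entry| instance_status(entry) == 'passing' }
  passing.to_f / entries.size
end

# The service is considered healthy while the ratio stays above the threshold.
def healthy?(entries, threshold: 0.5)
  passing_ratio(entries) > threshold
end

# Usage with hand-written sample data: two passing instances, one critical.
sample = [
  { 'Checks' => [{ 'Status' => 'passing' }] },
  { 'Checks' => [{ 'Status' => 'passing' }, { 'Status' => 'passing' }] },
  { 'Checks' => [{ 'Status' => 'passing' }, { 'Status' => 'critical' }] }
]
puts "ratio=#{passing_ratio(sample).round(2)} healthy=#{healthy?(sample)}"
```

Because the rule only needs data that every service already publishes, the same alert can be applied to all services without per-service configuration.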

Benefits (2/2)
• Debugging is easier
  • One single place to look for configuration
  • LB and API load balancing work the same way
• Nothing is hidden: people can troubleshoot by themselves
  • The team is not a SPOF for debugging issues

Open-Source Tools
• consul-templaterb: https://github.com/criteo/consul-templaterb/
  • Script/hack/automate easily: supports hot reload
  • Provides Consul-UI as well as Consul-timeline
  • Provides additional Prometheus endpoints (service changes)
• https://github.com/pierresouchay/consul-ops-tools
  • Small scripts to help debug Consul (will be enriched)
• A Consul Story: To 20,000 Nodes and Beyond (video)

Q&A
