The document provides an overview of Consul administration at scale at Criteo. Key points:
- Criteo uses Consul for service discovery across 35k servers with 3200 services and 260k instances
- A dedicated team of 5 people manages Consul infrastructure and tools 24/7
- Automation is key to make Consul predictable at scale through standardized service registration, ACLs, and automation tools
- Metrics, logs, and monitoring are critical to detect issues with Consul and the services it manages
2. 2 •
1. Numbers / What we do
2. Make it work 24/7 / Our pillars
3. Tools to scale / What’s new?
4. Consul everywhere / Benefits
5. Tools / References / Q&A
Our 30-minute presentation
4. 4 •
• Consul in use for 3+ years @criteo
• Dedicated team is 6 months old, 5 people
• SDKs development (JVM / C# / Python), tools (GUIs)
• Handle all infrastructure, on-call 24/7
• Architecture, 1st external Consul contributor (70+ PRs)
The discovery team
5. 5 •
• Prod 35k bare-metal hosts (40/60 Win/Linux), 8 DCs (2 Hadoop)
• 3,200 kinds of services with 260k instances
• Up to 2.5M req/sec, 100+ PB of data in Hadoop
• More than 300 developers: we MUST scale users too
Our Infrastructure
6. 6 •
• Automatic Load Balancers provisioning (F5/HaProxy)
• SDKs provide discovery for all apps
• DNS provides discovery for systems that are not Consul-aware
• Bare Metal systems / Hadoop / Mesos (~Nomad)
• K/V for configuration of some tools
Consul to rule them all
9. 9 •
• 35k Consul agents installed by Chef
• Service registration:
• by Chef with helpers: standardized/easy
• in Mesos, standardized/automatic
Rule #1 - (1/3) Full automation - as predictable as possible
10. 10 •
• More than 3k services, protected service registration by ACLs
• ACLs as a Service REST API
• No service Conflict by default, Goal: 1 ACL per Service
• Help people add service metadata: version, alerts...
• Deploy in preprod, check ACLs, Go Prod
Rule #1 - (2/3) Full automation - as predictable as possible
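The "1 ACL per service" goal can be illustrated with a minimal sketch that generates the rule text for a single-service policy. It assumes Consul's ACL rule syntax (`service "<name>" { policy = "write" }`). The helper name `service_policy_rules` and the naming convention it enforces are hypothetical, not Criteo's actual ACL-as-a-Service API.

```ruby
# Hypothetical naming convention (an assumption, not a Consul requirement):
# lowercase letters, digits and dashes, not starting or ending with a dash.
SERVICE_NAME = /\A[a-z0-9]([a-z0-9-]*[a-z0-9])?\z/

# Returns the ACL rule text granting write access to exactly one service,
# so that a token bound to it cannot register any other service.
def service_policy_rules(service_name)
  unless service_name.match?(SERVICE_NAME)
    raise ArgumentError, "invalid service name: #{service_name.inspect}"
  end

  <<~HCL
    service "#{service_name}" {
      policy = "write"
    }
  HCL
end

puts service_policy_rules('my-web-app')
```

Generating the rule from the service name, rather than letting users write it by hand, is one way to keep registrations conflict-free by default.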
11. 11 •
• Secure by default in order to be predictable
• Nobody can write on APIs outside of localhost
• https://github.com/hashicorp/consul/issues/4712
• Available in Consul 1.4.2+
• Reduce entropy added by humans
Rule #1 - (3/3) Full automation - as predictable as possible
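A minimal sketch of what restricting writes to localhost could look like, assuming the feature referenced above is the agent's `http_config.allow_write_http_from` setting (a list of CIDR networks allowed to call write endpoints). The `write_allowed?` helper is hypothetical and only mimics the allow-list check for IPv4 addresses. It is not Consul's implementation.

```ruby
require 'ipaddr'
require 'json'

# Networks allowed to call agent write endpoints (loopback only here).
ALLOWED_WRITERS = ['127.0.0.0/8'].freeze

# Builds the agent configuration fragment as a Ruby hash.
def agent_config(allowed = ALLOWED_WRITERS)
  { 'http_config' => { 'allow_write_http_from' => allowed } }
end

# Hypothetical helper: true when the IPv4 client address falls inside one
# of the allowed networks.
def write_allowed?(client_ip, allowed = ALLOWED_WRITERS)
  ip = IPAddr.new(client_ip)
  allowed.any? { |cidr| IPAddr.new(cidr).include?(ip) }
end

puts JSON.pretty_generate(agent_config)
```

With this allow-list, `write_allowed?('127.0.0.1')` holds while a request from `10.1.2.3` would be rejected, so only local automation can change what the agent registers.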
12. 12 •
• Blackbox monitoring (5+ probes in each DC)
• Register a service, wait for its publication in the Consul catalog
• SLA: objective 1s to register a service, up to 3s max
• When SLA is violated, wake up the on-call
Rule #2 - Metrics (1/3)
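The blackbox probe described above can be sketched as follows. This assumes a local agent on `localhost:8500` and uses the standard agent and catalog HTTP endpoints. The helper names, the probe service name and the polling interval are hypothetical choices, and a production probe would also need ACL tokens and error handling.

```ruby
require 'json'
require 'net/http'

# Classifies a measured latency against the SLA: objective 1s, 3s max.
def classify_latency(seconds, objective: 1.0, max: 3.0)
  return :ok if seconds <= objective
  return :slow if seconds <= max

  :violated
end

# Polls `visible` (a callable returning true once the service is published)
# and returns the elapsed seconds, or nil if `timeout` is reached first.
def wait_for_publication(visible, timeout: 3.0, interval: 0.05)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  loop do
    found = visible.call
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    return elapsed if found
    return nil if elapsed >= timeout

    sleep interval
  end
end

# Registers a throwaway service, waits for it in the catalog, cleans up.
def probe(agent: 'http://localhost:8500', name: "blackbox-probe-#{Process.pid}")
  uri = URI(agent)
  Net::HTTP.start(uri.host, uri.port) do |http|
    register = Net::HTTP::Put.new('/v1/agent/service/register',
                                  'Content-Type' => 'application/json')
    register.body = JSON.generate('Name' => name)
    http.request(register)

    visible = lambda do
      res = http.get("/v1/catalog/service/#{name}")
      res.is_a?(Net::HTTPSuccess) && !JSON.parse(res.body).empty?
    end
    elapsed = wait_for_publication(visible)

    http.request(Net::HTTP::Put.new("/v1/agent/service/deregister/#{name}"))
    elapsed.nil? ? :violated : classify_latency(elapsed)
  end
end
```

Calling `probe` against a running agent returns `:ok`, `:slow` or `:violated`. Paging the on-call would hang off the `:violated` result.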
13. 13 •
• Consul Metrics
• Native Prometheus Support
• Additional on-call alerts
• Track new usages (increase of RPCs, DNS calls…)
• Debug when things go wrong
Rule #2 - Metrics (2/3)
14. 14 •
• Consul-templaterb : metrics.erb export to Prometheus
• Provides rate of changes
• Provides instances Passing/Warning/Critical
• View from an agent's point of view, not the Consul server's
Rule #2 - Metrics (3/3)
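As an illustration of the Passing/Warning/Critical export, here is a minimal sketch that counts instances by their worst check status and renders Prometheus-style lines. The input shape follows Consul's health endpoint entries (each with a `Checks` array). The metric name `consul_service_instances` and the helper names are hypothetical, and this is not the actual `metrics.erb` shipped with consul-templaterb.

```ruby
SEVERITY = { 'passing' => 0, 'warning' => 1, 'critical' => 2 }.freeze

# An instance is as healthy as its worst check. Unknown statuses are
# treated as critical; an instance without checks counts as passing.
def instance_status(entry)
  statuses = entry.fetch('Checks', []).map do |check|
    SEVERITY.key?(check['Status']) ? check['Status'] : 'critical'
  end
  statuses.max_by { |status| SEVERITY[status] } || 'passing'
end

# Counts instances per status for one service.
def count_by_status(entries)
  counts = { 'passing' => 0, 'warning' => 0, 'critical' => 0 }
  entries.each { |entry| counts[instance_status(entry)] += 1 }
  counts
end

# Renders one Prometheus text-format line per status.
def prometheus_lines(service, entries)
  count_by_status(entries).map do |status, count|
    %(consul_service_instances{service="#{service}",status="#{status}"} #{count})
  end
end

sample = [
  { 'Checks' => [{ 'Status' => 'passing' }, { 'Status' => 'passing' }] },
  { 'Checks' => [{ 'Status' => 'passing' }, { 'Status' => 'warning' }] },
  { 'Checks' => [{ 'Status' => 'critical' }, { 'Status' => 'passing' }] }
]
puts prometheus_lines('web', sample)
```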
15. 15 •
• Logs in Kibana for Consul servers and a few canary agents
• Analyzed regularly for early error detection
• Expose all data to everybody
• Instant view of all services
• Timeline of changes for all services
Rule #3 - Logs, info and History
16. 16 •
• Consul fork: upstream with patches
• Ready to go to prod in less than 2 hours
• Compare metrics after deployment
• Preprod → Observe → Prod
• Deploy feature per feature, no bulk updates
Rule #4 - Ready to patch
17. 17 •
• Look at all issues on github
• Look for known patterns
• Check if issue might impact us
• PR when issue is potentially critical for us (ex: #5050)
Rule #5 - Work on upstream
21. 21 •
Changes/sec is a good indicator; it lets you detect:
- deployments (right)
- incidents or future incidents
- optimizations to perform
Many optimizations/fixes from Criteo:
- #3889, #4720 and many more merged
- #5050 allowed us to improve performance by more than 100x!
Consul-templaterb metrics.erb: changes/sec on a service
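A changes/sec indicator can be derived from two samples of a change counter. The sketch below assumes a monotonically increasing counter of service changes. The `Sample` struct, the reset handling and the spike factor are hypothetical illustration choices, not how `metrics.erb` computes its rate.

```ruby
# One observation of the cumulative number of changes seen for a service.
Sample = Struct.new(:time, :changes)

# Rate of change between two samples, in changes per second.
def changes_per_second(previous, current)
  elapsed = current.time - previous.time
  return 0.0 if elapsed <= 0

  delta = current.changes - previous.changes
  # A lower counter means it was reset; count from zero in that case.
  delta = current.changes if delta.negative?
  delta / elapsed.to_f
end

# Flags a rate well above the usual baseline, e.g. a deployment or incident.
def spike?(rate, baseline, factor: 5.0)
  baseline.positive? && rate > baseline * factor
end

rate = changes_per_second(Sample.new(0, 100), Sample.new(10, 150))
puts "#{rate} changes/sec"
```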
22. 22 •
Consul-Templaterb: script everything! 1/2
<%
# This script cleans up all instances of services tagged `marathon` that have
# fewer than 2 health checks, i.e. only the node-level SerfHealth check.
instances_to_cleanup = 0
total_instances = 0
datacenters.each do |dc|
  services(dc: dc, tag: 'marathon').each do |service_name, tags|
    service(service_name, dc: dc, tag: 'marathon').each do |snode|
      total_instances += 1
      if snode['Checks'].count < 2
        instances_to_cleanup += 1
%>ssh $SSH_OPTIONS <%= snode['Node']['Node'] %> "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/<%= snode['Service']['ID'] %>"
<%
      end
    end
  end
end %>
echo Found <%= instances_to_cleanup %> / <%= total_instances %> instances to cleanup
24. 24 •
• See video from Michael Stewart (To 20,000 Nodes and Beyond)
• We had the same stories, found the same tricks
• Read the Consul docs: everything is an RPC, and there is no cache by default
• Use discovery_max_stale to scale servers horizontally
• Use TTLs for DNS and set allow_stale = true
Useful configuration hints
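The hints above map to agent configuration keys. The sketch below builds such a fragment and prints it as JSON. The keys (`discovery_max_stale`, `dns_config.allow_stale`, `dns_config.max_stale`, `dns_config.service_ttl`, `dns_config.node_ttl`) are Consul agent options, while the durations and the `scaling_hints` helper are illustrative assumptions, not Criteo's production values.

```ruby
require 'json'

# Builds an agent configuration fragment that lets any server answer
# discovery requests from slightly stale data, and that makes DNS
# answers cacheable through TTLs.
def scaling_hints(max_stale: '5s', service_ttl: '5s', node_ttl: '30s')
  {
    'discovery_max_stale' => max_stale,
    'dns_config' => {
      'allow_stale' => true,
      'max_stale' => max_stale,
      'service_ttl' => { '*' => service_ttl },
      'node_ttl' => node_ttl
    }
  }
end

puts JSON.pretty_generate(scaling_hints)
```

Stale reads trade a bounded amount of freshness for the ability to spread read load across all servers instead of sending it all to the leader.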
26. 26 •
• Inversion of Control
• Monitoring can be automated: ratio > 0.5 passing/critical
• Everything is As A Service, Users are free to experiment (full network automation for instance)
• Everything is standardized
• ServiceMeta standardization, LB weights...
• Build features on top of services: Monitoring, versions tracking
Benefits (1/2)
27. 27 •
• Debug is easier
• One single place to look for configuration
• LB/API Load balancing works the same way
• Nothing is hidden: people can troubleshoot themselves
• The team is not a SPOF to debug issues
Benefits (2/2)
29. 29 •
• Consul-templaterb : https://github.com/criteo/consul-templaterb/
• Script/Hack/Automate it easily: supports hot-reload
• Provides Consul-UI as well as Consul-timeline
• Provides additional Prometheus endpoints (service changes)
• https://github.com/pierresouchay/consul-ops-tools
• small scripts to help debug Consul (will be enriched)
• A Consul Story: To 20,000 Nodes and Beyond (video)
Open-Source Tools