Pierre Souchay
Discovery Team @Criteo
Twitter: @vizionr
Github: pierresouchay
Consul Administration At Scale

Our 30-minute presentation
1. Numbers: What we do
2. Make it work 24/7: Our pillars
3. Tools to scale: What’s new?
4. Consul everywhere: Benefits
5. Tools / References: Q&A

Numbers

The discovery team
• Consul in use for 3+ years at Criteo
• Dedicated team is 6 months old, 5 people
• SDK development (JVM / C# / Python), tools (GUIs)
• Handles all infrastructure, on-call 24/7
• Architecture, 1st external Consul contributor (70+ PRs)

Our Infrastructure
• Prod: 35k bare-metal hosts (40/60 Windows/Linux), 8 DCs (2 Hadoop)
• 3,200 kinds of services with 260k instances
• Up to 2.5M req/sec, 100+ PB of data in Hadoop
• More than 300 developers: we MUST scale users too

Consul to rule them all
• Automatic load balancer provisioning (F5/HAProxy)
• SDKs provide discovery for all apps
• DNS provides discovery for systems that are not Consul-aware
• Bare-metal systems / Hadoop / Mesos (~Nomad)
• K/V for configuration of some tools
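
For systems that only speak DNS, discovery comes down to resolving a well-known name. The sketch below builds such a name following Consul's standard lookup format, [tag.]<service>.service[.<datacenter>].<domain>. The service name `web`, the tag `http`, and the datacenter `par` are made-up values for illustration, not names from the talk.

```ruby
# Build the DNS name that Consul answers for a service lookup.
# Format: [tag.]<service>.service[.<datacenter>].<domain>
def consul_dns_name(service, tag: nil, datacenter: nil, domain: 'consul')
  labels = []
  labels << tag if tag                  # optional tag filter
  labels << service << 'service'
  labels << datacenter if datacenter    # optional explicit datacenter
  labels << domain                      # 'consul' is the default DNS domain
  labels.join('.')
end

puts consul_dns_name('web')
puts consul_dns_name('web', tag: 'http', datacenter: 'par')
```

A legacy application would then simply resolve the returned name through the local Consul agent.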

When Consul is down, Criteo is down.

Make it work 24/7

Rule #1 - (1/3) Full automation - as predictable as possible
• 35k Consul agents installed by Chef
• Service registration
  - by Chef with helpers: standardized/easy
  - in Mesos: standardized/automatic
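
To illustrate what a registration helper buys you, here is a minimal sketch that renders a Consul service definition from a few arguments, so every team produces the same shape of file. The function name `service_definition`, the TCP check, and its intervals are assumptions made for this example; the actual Criteo Chef helpers are not shown in the slides.

```ruby
require 'json'

# Hypothetical helper: every caller goes through the same function,
# so every generated service definition has the same structure.
def service_definition(name:, port:, tags: [], meta: {})
  {
    'service' => {
      'name' => name,
      'port' => port,
      'tags' => tags.sort,   # sorted so the rendered file is stable across runs
      'meta' => meta,
      'checks' => [
        # Assumed default check: a plain TCP connect on the service port.
        { 'name' => "#{name} TCP check", 'tcp' => "localhost:#{port}",
          'interval' => '10s', 'timeout' => '2s' }
      ]
    }
  }
end

puts JSON.pretty_generate(
  service_definition(name: 'web', port: 8080, tags: %w[prod http], meta: { 'version' => '1.2.3' })
)
```

The printed JSON could be dropped into the agent's configuration directory by the configuration management tool.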

Rule #1 - (2/3) Full automation - as predictable as possible
• More than 3k services; service registration protected by ACLs
• ACLs as a Service: REST API
• No service conflicts by default; goal: 1 ACL per service
• Help people add service metadata: version, alerts...
• Deploy in preprod, check ACLs, go to prod
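
The "1 ACL per service" goal can be pictured as generating one small rule set per service name. The sketch below renders Consul ACL rules that grant write access to a single service. The function name and the naming-convention regex are assumptions for illustration; the "ACLs as a Service" REST API itself is not described in the slides.

```ruby
# Render ACL rules that allow registering exactly one service name.
def service_acl_rules(service_name)
  # Assumed naming convention: lowercase letters, digits and dashes only.
  unless service_name.match?(/\A[a-z0-9][a-z0-9-]*\z/)
    raise ArgumentError, "unexpected service name: #{service_name.inspect}"
  end

  <<~HCL
    service "#{service_name}" {
      policy = "write"
    }
  HCL
end

puts service_acl_rules('web')
```

Because the rules name one service only, a token created from them cannot register under another team's service name.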

Rule #1 - (3/3) Full automation - as predictable as possible
• Secure by default in order to be predictable
• Nobody can write to the APIs from outside localhost
  - https://github.com/hashicorp/consul/issues/4712
  - Available in Consul 1.4.2+
• Reduce entropy added by humans
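
The localhost-only write rule amounts to checking the caller's address against a list of allowed networks. The sketch below shows that check in isolation; it is not Consul's implementation. In the agent configuration this appears to correspond to the `allow_write_http_from` list under `http_config`, which is worth confirming against the documentation for your version. The constant and function names are invented for the example.

```ruby
require 'ipaddr'

# Networks allowed to call write endpoints: loopback only (IPv4 and IPv6).
ALLOWED_WRITE_NETWORKS = ['127.0.0.0/8', '::1/128'].map { |cidr| IPAddr.new(cidr) }.freeze

# True when remote_ip belongs to one of the allowed networks.
def write_allowed?(remote_ip, networks = ALLOWED_WRITE_NETWORKS)
  ip = IPAddr.new(remote_ip)
  # Compare only within the same address family (IPv4 vs IPv6).
  networks.any? { |net| net.family == ip.family && net.include?(ip) }
end

puts write_allowed?('127.0.0.1')
puts write_allowed?('10.1.2.3')
```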

Rule #2 - Metrics (1/3)
• Blackbox monitoring (5+ probes in each DC)
• Register a service, wait for its publication in the Consul catalog
• SLA: objective of 1s to register a service, 3s at most
• When the SLA is violated, wake up the on-call engineer
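
A blackbox probe of this kind can be sketched as: register, poll until the service is visible, and classify the elapsed time against the two thresholds above. The code below is an illustration only. The `register` and `visible` callables stand in for the real HTTP calls to the agent and the catalog, and the polling interval, give-up delay, and status names are assumptions.

```ruby
# Thresholds taken from the slide: 1 s objective, 3 s hard limit.
SLA_OBJECTIVE_SECONDS = 1.0
SLA_MAX_SECONDS = 3.0

def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Map a registration latency (in seconds) to a probe status.
def classify_latency(elapsed)
  return :ok if elapsed <= SLA_OBJECTIVE_SECONDS
  return :degraded if elapsed <= SLA_MAX_SECONDS
  :page_on_call
end

# register: callable that registers the probe service.
# visible:  callable returning true once the service appears in the catalog.
def probe_registration(register:, visible:, poll_interval: 0.05, give_up_after: 10.0)
  started = monotonic_now
  register.call
  until visible.call
    elapsed = monotonic_now - started
    return { elapsed: elapsed, status: :timeout } if elapsed > give_up_after
    sleep poll_interval
  end
  elapsed = monotonic_now - started
  { elapsed: elapsed, status: classify_latency(elapsed) }
end

# Dry run with stand-in callables: the service is "visible" immediately.
result = probe_registration(register: -> {}, visible: -> { true })
puts result[:status]
```

Passing the two operations in as callables keeps the timing logic testable without a running Consul cluster.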

Rule #2 - Metrics (2/3)
• Consul metrics
• Native Prometheus support
• Additional on-call alerts
• Track new usages (increase in RPCs, DNS calls…)
• Debug when things go wrong

Rule #2 - Metrics (3/3)
• Consul-templaterb: metrics.erb exports to Prometheus
• Provides the rate of changes
• Provides instance counts: passing/warning/critical
• View from an agent's point of view, not from the Consul servers
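
Counting instances per state can be sketched on top of the entries returned by Consul's health endpoint for a service, where each entry carries a `Checks` array and each check has a `Status`. The aggregation rule used here (any critical check makes the instance critical, otherwise any warning makes it warning) is the usual convention; whether metrics.erb computes it exactly this way is an assumption.

```ruby
# Aggregate the checks of one instance into a single status.
def instance_status(entry)
  statuses = entry.fetch('Checks', []).map { |check| check['Status'] }
  return 'critical' if statuses.include?('critical')
  return 'warning' if statuses.include?('warning')
  'passing'
end

# Count instances per aggregated status.
def count_by_status(entries)
  counts = { 'passing' => 0, 'warning' => 0, 'critical' => 0 }
  entries.each { |entry| counts[instance_status(entry)] += 1 }
  counts
end

# Made-up sample data in the shape of a health response.
sample = [
  { 'Checks' => [{ 'Status' => 'passing' }, { 'Status' => 'passing' }] },
  { 'Checks' => [{ 'Status' => 'passing' }, { 'Status' => 'warning' }] },
  { 'Checks' => [{ 'Status' => 'critical' }, { 'Status' => 'warning' }] }
]
p count_by_status(sample)
```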

Rule #3 - Logs, info and history
• Logs in Kibana for Consul servers and a few canary agents
• Analyzed regularly for early error detection
• Expose all data to everybody
• Instant view of all services
• Timeline of changes for all services

Rule #4 - Ready to patch
• Consul fork: upstream with patches
• Ready to go to prod in less than 2 hours
• Compare metrics after deployment
• Preprod → Observe → Prod
• Deploy feature by feature, no bulk updates

Rule #5 - Work on upstream
• Look at all issues on GitHub
• Look for known patterns
• Check whether an issue might impact us
• Open a PR when an issue is potentially critical for us (e.g. #5050)
Tools/Hints to scale

Consul-UI: scalable UI to show all details about a service

Consul-UI: timeline of changes, not an ops problem anymore

Consul-templaterb metrics.erb: changes/sec on a service
Changes/sec is a good indicator; it lets you detect:
- deployments (right)
- incidents or future incidents
- optimizations to perform
Many optimizations/fixes from Criteo:
- #3889, #4720 and many more merged
- #5050 allowed us to improve performance by more than 100x!
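
A changes/sec series can be derived from a cumulative change counter sampled at regular intervals. The sketch below shows that computation; how metrics.erb derives its own figure is not described in the slides, so the sample format and the counter-reset handling are assumptions.

```ruby
# samples: array of [unix_time_seconds, cumulative_change_count], oldest first.
# Returns an array of [time, changes_per_second] for each consecutive pair.
def changes_per_second(samples)
  samples.each_cons(2).map do |(t0, c0), (t1, c1)|
    dt = t1 - t0
    next nil unless dt.positive?     # skip duplicate or out-of-order samples
    delta = c1 - c0
    delta = c1 if delta.negative?    # counter reset, e.g. after a restart
    [t1, delta.to_f / dt]
  end.compact
end

# Made-up samples: quiet, then a burst of changes such as a deployment.
p changes_per_second([[0, 0], [10, 50], [20, 50], [30, 350]])
```

A sustained jump in the resulting rate is the kind of signal the slide associates with deployments or incidents.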

Consul-Templaterb: script everything! 1/2
<%
# This script cleans up all services tagged `marathon` that have fewer than 2 health checks,
# i.e. instances whose only check is the node-level SerfHealth.
instances_to_cleanup = 0
total_instances = 0
datacenters.each do |dc|
  services(dc: dc, tag: 'marathon').each do |service_name, tags|
    service(service_name, dc: dc, tag: 'marathon').each do |snode|
      total_instances += 1
      if snode['Checks'].count < 2
        instances_to_cleanup += 1
%>ssh $SSH_OPTIONS <%= snode['Node']['Node'] %> "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/<%= snode['Service']['ID'] %>"
<%
      end
    end
  end
end %>
echo Found <%= instances_to_cleanup %> / <%= total_instances %> instances to cleanup

Consul-Templaterb: script everything! 2/2
Call it once…
$ consul-templaterb -c <CONSUL_ADDR> ./clean_svcs_without_hc.sh.erb --once && bash ./clean_svcs_without_hc.sh
Or automatically every minute!
--wait 60 --template "clean_svcs_without_hc.sh.erb:./result.sh:bash ./result.sh"
ssh $SSH_OPTIONS mesos-slave017-pa4.central.criteo.preprod "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/marathon-app-deepr-pipeline-31510-23d44ebc2b8d11e9b0125065f387ef80"
ssh $SSH_OPTIONS mesos-slave019-pa4.central.criteo.preprod "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/marathon-app-jtc-jtc-app-31934-1d5a8e202ba611e9b0125065f387ef80"
echo Found 2 / 1812 instances to cleanup

Useful configuration hints
• See the video by Michael Stewart (A Consul Story: To 20,000 Nodes and Beyond)
• We had the same stories and found the same tricks
• Read the Consul docs: everything is an RPC; there is no cache by default
• Use discovery_max_stale to scale servers horizontally
• Use a TTL for DNS and allow_stale = true
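
Put together, these hints map onto a handful of agent configuration keys. The sketch below generates such a fragment as JSON. The keys (`discovery_max_stale`, and `allow_stale`, `max_stale`, `service_ttl`, `node_ttl` under `dns_config`) are standard Consul agent options, but the durations are placeholder values chosen for the example, not Criteo's settings.

```ruby
require 'json'

# Build an agent configuration fragment that favours stale reads and DNS caching.
# The default durations are illustrative assumptions; tune them for your own cluster.
def scaling_config(max_stale: '10s', dns_ttl: '5s')
  {
    # Lets discovery queries be answered by any server, not only the leader.
    'discovery_max_stale' => max_stale,
    'dns_config' => {
      'allow_stale' => true,
      'max_stale' => max_stale,
      'service_ttl' => { '*' => dns_ttl },  # TTL applied to all service lookups
      'node_ttl' => dns_ttl
    }
  }
end

puts JSON.pretty_generate(scaling_config)
```

The trade-off is freshness: clients may see data that is a few seconds old in exchange for spreading read load across servers and resolvers.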
Consul Everywhere

Benefits (1/2)
• Inversion of Control
• Monitoring can be automated: ratio > 0.5 passing/critical
• Everything is "as a service"; users are free to experiment (full network automation, for instance)
• Everything is standardized
• ServiceMeta standardization, LB weights...
• Build features on top of services: monitoring, version tracking
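
One possible reading of the "ratio > 0.5" rule is that a service counts as healthy while more than half of its instances are passing. The sketch below encodes that reading; the function name, the treatment of a service with no instances, and the interpretation itself are assumptions rather than something the slides spell out.

```ruby
# True when the share of passing instances is strictly above the threshold.
def service_healthy?(passing:, critical:, threshold: 0.5)
  total = passing + critical
  return false if total.zero?   # assumed: no instances at all counts as unhealthy
  passing.to_f / total > threshold
end

puts service_healthy?(passing: 8, critical: 2)
puts service_healthy?(passing: 2, critical: 8)
```

Because the rule depends only on data already in Consul, the same alert can be applied to every service without per-team configuration.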

Benefits (2/2)
• Debugging is easier
• One single place to look for configuration
• LB/API load balancing works the same way
• Nothing is hidden: people can troubleshoot by themselves
• The team is not a SPOF for debugging issues
Tools / references

Open-Source Tools
• Consul-templaterb: https://github.com/criteo/consul-templaterb/
  - Script/hack/automate easily: supports hot-reload
  - Provides Consul-UI as well as Consul-timeline
  - Provides additional Prometheus endpoints (service changes)
• https://github.com/pierresouchay/consul-ops-tools
  - Small scripts to help debug Consul (will be enriched)
• A Consul Story: To 20,000 Nodes and Beyond (video)
Q&A
Discovery Team @Criteo
Twitter: @vizionr
Github: pierresouchay
