Configuration management is at the core of Ops. It is the biggest enabler of any compute operation, small or large. Over the past decade, we have switched from thinking about the machines we configure to thinking about the software and services we control. With that change of mindset, the tools we use have changed as well. Traditional tools like Puppet, Chef, Salt and Ansible are slowly declining, while newer tools such as Terraform, Pulumi, Helm and Kustomize are on the rise. In this talk I will describe the pain points and opportunities of this transformation, and suggest a future direction based on tools developed at big-tech companies (mainly Facebook and Google).
12. CFEngine - The bad things
Written in C - Hard to extend.
Limited file content operations.
bundle agent example {
  files:
    "/tmp/testfile"
      create => "true",
      edit_line => proper_greetings;
}
bundle edit_line proper_greetings {
  delete_lines:
    ".*";
  insert_lines:
    "Hello World!";
}
24. Configuration Management challenges
1. Configuration evaluated on the production machines
2. Hard to test (result of problem #1)
3. Too many configuration formats
4. YAML+Templating = 💔
28. “Because our systems are ultimately managed by humans, humans are responsible for configuration. The quality of the human-computer interface of a system’s configuration impacts an organization’s ability to run that system reliably.”
- Štěpán Davidovič, Google SRE Workbook
https://sre.google/workbook/configuration-design/
31. Protoconf Goals
- Deliver configs to all clusters in seconds, not minutes.
- Configs should have schemas with type safety.
- Configs should be coded, then materialized.
- Config changes should be reviewed.
- Configs should be easy to test and validate.
- Configs should be consumable from all popular languages.
- Both humans and machines should be able to change configs.
32. Define the config schema
The developer will define the config struct in protobuf
// file: ./src/myproject/myconfig.proto
syntax = "proto3";

message MyConfig {
  uint32 connection_timeout = 1;
  uint32 max_retries = 2;
  NestedStruct another_struct = 3;
}

message NestedStruct {
  string hello_world = 1;
}
https://developers.google.com/protocol-buffers
https://docs.protoconf.sh/getting-started/
34. Code your config
The developer will then create a `.pconf` file to populate the config struct with the required values.
"""
file: ./src/myproject/myconfig.pconf
"""
load("myconfig.proto", "MyConfig", "NestedStruct")

def main():
    return MyConfig(
        connection_timeout=5,
        max_retries=5,
        another_struct=NestedStruct(
            hello_world="Hello World!"
        )
    )
https://docs.protoconf.sh/getting-started/
39. Learn More & Contribute
- Docs site:
- https://docs.protoconf.sh
- Star us on Github
- protoconf/protoconf
- Join us on Discord
- https://discord.protoconf.sh
- Follow us on Twitter:
- @protoconfdev
Editor's Notes
For years, configuration management was limited to the operating system, which is not as relevant these days as it used to be. Today I am going to present a more holistic approach that will allow you to regain control over your production, as well as an open source tool that uses this approach.
My name is Shahar Mintz, I am a DevOps consultant.
I previously worked at start-ups, corporates and big tech companies. I’ve seen big production systems and small ones.
I’ve faced configuration challenges of all sizes.
Trust me, I’ve seen horrible things.
- joined Facebook in 2013 along with the Onavo team,
- 400 physical servers configured and controlled by Puppet.
- Puppet was a one-stop shop: it controlled everything
Five years later, when I left Facebook in 2018, it was a completely different world.
- Kubernetes started to take over the world.
- Prometheus changed the way everyone was doing monitoring.
- and Terraform provisioning
I felt like the configuration management solutions I knew were not adapted to this new world and went on a quest to find a new solution.
But let’s first try to understand: why do we even need configuration management?
- Who switched jobs lately?
- you probably got a pile of links with documentation
- VPN, local dev environment
- Outdated, overlapping, links to broken pages
This is how production is without configuration management.
- release new version
- deploy new server
- outdated docs that link to other docs
- links back
- “spaghetti doc”.
keep it all in one place.
The number of steps goes down.
keep your production in shape. But not only that.
It helps you recover from errors faster.
Can you imagine:
- starting your production again
- in a new region
- after years of manual changes
- during an outage
without configuration management?
Should be 2 minutes in, take a water break
Imperative means you define the set of actions you want the computer to take.
Declarative means you define the state you want the computer to be in.
MARK BURGESS, the author of CFEngine
describes the imperative approach to configuration management as climbing a mountain. Every time you reach a local peak, you have to step down a bit in order to climb up again.
We really want to declare the desired state at the bottom of the valley so every time we run into an obstacle, it will be easier to choose a better path.
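To make the distinction concrete, here is a minimal Python sketch. It is an illustration only: the package names and the `reconcile` helper are hypothetical and are not taken from CFEngine or any other tool. The imperative version returns the same actions regardless of the machine's state, while the declarative version derives its actions from the gap between the current state and the desired state.

```python
def imperative_plan() -> list[str]:
    """Imperative: a fixed list of actions, regardless of the machine's state."""
    return ["install nginx", "install curl"]


def reconcile(current: set[str], desired: set[str]) -> list[str]:
    """Declarative: compute only the actions needed to reach the desired state."""
    actions = [f"install {pkg}" for pkg in sorted(desired - current)]
    actions += [f"remove {pkg}" for pkg in sorted(current - desired)]
    return actions


if __name__ == "__main__":
    desired = {"nginx", "curl"}
    # A machine that drifted: curl is present, telnet was added by hand.
    print(reconcile({"curl", "telnet"}, desired))
    # A machine that already matches the desired state needs no actions.
    print(reconcile({"nginx", "curl"}, desired))
```

Because the declarative plan is computed from the difference, running it a second time on a converged machine yields an empty plan, which is what makes repeated runs safe.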
In 1993, MARK BURGESS was at the Oslo University.
- few Unix Workstations at the University
- configured by scripts
- which failed for unpredictable states
- wrote CFEngine with a declarative DSL
- operator focus on intent
- set of promises
- eventually meet the desired state.
*** Take a water break, let people read ***
- 1990s to early 2000s
- internet became 100 times more popular
- bigger and bigger operations
- CFEngine was a key player
- some companies ran tens of thousands of machines.
- written in C and was hard to extend.
- `delete_lines` and `insert_lines`.
*** Take a break, let people read ***
Both addressed these two major shortcomings of CFEngine.
- written in Ruby
- easier to extend
- employed ERB for powerful templating
- introduced clear separation between different aspects of the configuration management code.
- Model-View-Controller (MVC) inspiration
- Rails and Django.
- servers and database
- reports
- store machines' state in the database
- use values from the database as inputs to other machines.
- service discovery at Onavo.
*** Take a break ***
In the 2010s
- shift to the cloud
- auto-scaling
- spot instances.
- machines coming and going away regularly,
- time of config execution is crucial.
- written in Ruby with no parallelism.
- could take 5 minutes, sometimes 15 minutes or 1 hour.
- sometimes longer than the lifetime of the cloud instance.
- addresses performance
- Python
- parallelism
- server initiation
- never used Ansible
- didn’t like Salt
- short golden hour
- Docker
- who cares about the OS?
- easy to bake container images
- predictable results
- better understanding of cloud operation
- adopt the declarative approach to provisioning (with Terraform)
When we started to adopt cloud-native software, we started to get away from the idea of a centralized configuration management system.
Many teams now need to choose whether to focus on Kubernetes or Terraform, and the bridge between the two is not trivial.
- Helm and Kustomize
- single cluster? fine
- reality: DB? CDN? Monitoring?
- perform upgrades and maintenance
- multiple clusters?
- sync them?
- migrate workload between clusters?
- write a doc for that :)
- recap
- few machines
- declarative
- scale
- hit walls
- limited flexibility
- cloud
- time to become operational is too long
- containers & SaaS
- We stopped thinking about the OS
- no solution for centralized configuration
- some challenges have never been addressed
- test in prod!
- stop making config formats!
- your software is not special and it doesn’t need its own configuration format
- yaml
*** water break ***
- focus on software, not environment
- config is a reference to memory addresses
- fill the values
- software wants static files
- no logic
- known schema
- human wants high level code
- writes logic
- compile
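The split between what humans write and what software reads can be sketched in a few lines of Python. This is an assumed illustration, not protoconf's implementation: `build_config`, `materialize`, the environment names and the `backends` field are all hypothetical. The logic lives in the high-level code, and the compile step emits plain data with no logic left in it.

```python
import json


def build_config(env: str) -> dict:
    """High-level code written by humans: conditionals and loops live here."""
    replicas = 3 if env == "prod" else 1
    return {
        "connection_timeout": 5,
        "max_retries": 5 if env == "prod" else 1,
        # Hypothetical field, used only to show generated values.
        "backends": [f"{env}-backend-{i}" for i in range(replicas)],
    }


def materialize(env: str) -> str:
    """Compile step: the result is a static document the software can read."""
    return json.dumps(build_config(env), indent=2, sort_keys=True)


if __name__ == "__main__":
    print(materialize("prod"))
```

Sorting the keys keeps the materialized output stable between runs, so a later diff shows only real changes.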
- google sre workbook
- chapter about configuration design
- human-computer interface
- run systems reliably
- from model to software
- writes controller to generate
- materialized configs
- test locally
- commit to git
- see the diff between the current and next version
- hook ships to prod
In brief, we could write controllers that generate configuration files based on our models.
The compilation process runs on the dev machine. The materialized results are then ready to be tested locally, either by running unit tests against them or by running your code with the new configuration.
After we have compiled the configuration files locally and tested them, we check them in to git. Adding them to git gives us another opportunity to validate the diff between the current and next version.
A post-merge hook should take the materialized configs and deploy them to production.
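The local test and the review diff can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `validate` rules and their limits (a 1 to 60 second timeout, at most 10 retries) are invented for the example, and `config_diff` only imitates what a git diff of the materialized files would show.

```python
import difflib
import json


def validate(config: dict) -> list[str]:
    """Return a list of problems found in a materialized config (hypothetical rules)."""
    errors = []
    if not 1 <= config.get("connection_timeout", 0) <= 60:
        errors.append("connection_timeout must be between 1 and 60 seconds")
    if config.get("max_retries", 0) > 10:
        errors.append("max_retries must not exceed 10")
    return errors


def config_diff(current: dict, proposed: dict) -> list[str]:
    """Unified diff between two materialized configs, as a reviewer would see it."""
    before = json.dumps(current, indent=2, sort_keys=True).splitlines()
    after = json.dumps(proposed, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(before, after, fromfile="current", tofile="next", lineterm="")
    )


if __name__ == "__main__":
    current = {"connection_timeout": 5, "max_retries": 5}
    proposed = {"connection_timeout": 5, "max_retries": 7}
    print(validate(proposed))
    print("\n".join(config_diff(current, proposed)))
```

Both checks run on the dev machine or in CI, before anything reaches production, which addresses the "configuration evaluated on the production machines" problem listed earlier.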
- learned -> creates tool
- inspired by Configerator
- Protobuf by Google
- interface definition language
- compiles into code native to the language you use
- consistency between languages
- how software wants it.
- writes validation in python-like DSL
- code your config
- like humans
- use high level language
- compile it and test the outcome
- to consume the config, read it from the gRPC agent.
- after push
- insert to kv store (consul, etcd, zookeeper)
- battle tested software
- agent detects changes and streams updates to the application
- no restarts
- agents run as sidecar, reduces blast radius if fails
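The delivery path described above can be sketched in Python. This is a simplified, assumed model and not protoconf's agent or its gRPC API: `InMemoryStore` stands in for a KV store such as Consul, etcd or ZooKeeper, `Agent.poll_once` replaces a real watch, and the callback replaces the stream to the application.

```python
from typing import Callable


class InMemoryStore:
    """Hypothetical stand-in for a KV store; each write bumps a version number."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[int, str]] = {}

    def put(self, key: str, value: str) -> None:
        version = self._data.get(key, (0, ""))[0] + 1
        self._data[key] = (version, value)

    def get(self, key: str) -> tuple[int, str]:
        return self._data.get(key, (0, ""))


class Agent:
    """Watches one key and pushes new values to the application callback."""

    def __init__(self, store: InMemoryStore, key: str, on_update: Callable[[str], None]) -> None:
        self._store = store
        self._key = key
        self._on_update = on_update
        self._seen_version = 0

    def poll_once(self) -> bool:
        """Deliver the value if it changed since the last poll; report whether it did."""
        version, value = self._store.get(self._key)
        if version == self._seen_version:
            return False
        self._seen_version = version
        self._on_update(value)
        return True


if __name__ == "__main__":
    store = InMemoryStore()
    agent = Agent(store, "myproject/myconfig", lambda v: print("new config:", v))
    store.put("myproject/myconfig", '{"max_retries": 5}')
    agent.poll_once()  # delivers the first version
    agent.poll_once()  # nothing changed, nothing delivered
```

The application only registers a callback and keeps running while values change, which is the "no restarts" property from the list above.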
- alter configs programmatically
- part of CI
- UI for less technical staff
- join the revolution
- let’s change the face of configuration management
- let’s fix cloud native era configuration practices