Site Reliability Engineering©2015 LinkedIn Corporation. All Rights Reserved.
SaltStack at Web Scale…Better, Stronger, Faster
Who’s this guy?
What is SRE?
 Hybrid of operations and engineering
 Heavily involved in architecture and design
 Application support ninjas
 Masters of automation
So, what do I do with salt?
 Heavy user
 Active developer
 Administrator (less so)
What’s LinkedIn?
 Professional social network
 You probably all have an account
 You probably all get email from us too
Salt @ LinkedIn
 When LinkedIn started
– Aug 2011: Salt 0.8.9
– ~5k minions
 When I got involved
– May 2012: Salt 0.9.9
– ~10k minions
 Last SaltConf
– Salt 2014.1
– ~30k minions
 Now
– 2014.7 (starting 2015.2)
– ~70k minions
How to scale 101
 We can rebuild it
 We have the technology
 Better Reliability
 Stronger Availability
 Faster Performance
Building it Better: What is Reliability?
 Being reliable! ( not helpful)
 Maintainability
 Debuggability
 Not breaking
Building it Better: Maintainability
 Encapsulation/Generalization
– Make systems that are responsible for their own things
– Reuse code as much as possible
 Documentation
 Examples:
– States: each state module only knows how to interact with its own stuff
– Channels: don’t have to use SREQ directly (handle all the auth, retries, etc.)
– Job cache: single place where all of the returners (master and minion) live
Building it Better: Maintainability
 Tests!
– Write them!
– Write the negative ones
– Keep them up to date with your changes
 Don’t be that guy
Building it Better: Debuggability
 Logs
– Logging on useful events (such as AES key rotation)
– Debug messages
– Tuning log level on your install
 Fire events
– Filesystem update
– AES key rotation
– Etc.
 Setproctitle: setting useful process titles for ps output
Building it Better: Debuggability
 Useful error messages
Building it Better: Debugging state output
SLS
foo:
  cmd.run:
    - name: 'date'
    - prereq:
      - cmd: fail_one  # anything that will fail test=True
bar:
  cmd.run:
    - name: 'exit 0'
    - cwd: 1  # bad value
Output
ID: foo
Function: cmd.run
Name: date
Result: False
Comment: One or more requisite failed
----------
ID: bar
Function: cmd.run
Name: exit 0
Result: False
Comment: One or more requisite failed
Building it Better: Debugging state output
New Output
ID: foo
Function: cmd.run
Name: date
Result: False
Comment: One or more requisite failed: {'test.fail_one': 'An exception occurred in this state: Traceback (most recent call last):\n <INSERT TRACEBACK>'}
----------
ID: bar
Function: cmd.run
Name: exit 0
Result: False
Comment: An exception occurred in this state: Traceback (most recent call last): <INSERT TRACEBACK>
Building it Better: Debugging state output
SLS
foo:
  cmd.run:
    - name: 'date'
    - prereq:
      - foo: fail_one
      - cmd: work
fail_one:
  foo.run:
    - name: 'exit 1'
work:
  cmd.run:
    - name: 'date'
Output
ID: foo
Function: cmd.run
Name: date
Result: False
Comment: One or more requisite failed
----------
ID: work
Function: cmd.run
Name: date
Result: False
Comment: One or more requisite failed
Building it Better: Debugging state output
New Output
ID: foo
Function: cmd.run
Name: date
Result: False
Comment: One or more requisite failed: {'test.fail_one': "State 'foo.run' was not found in SLS 'test'\n"}
----------
ID: work
Function: cmd.run
Name: date
Result: None
Comment: Command "date" would have been executed
Building it Better: Not-breaking…ability
 Expect failure and code defensively, things fail
– Hardware
– Network
 Modules can be… problematic
Building it Better: Not-breaking…ability
Some are obvious:
exit(0)
def test():
    return True
Building it Better: Not-breaking…ability
Some less so
import requests
ret = requests.get('http://fileserver.corp/somefile')  # runs at import time
def test():
    return True
Building it Better: Not-breaking…ability
Some are very hard to find
WARNING: Mixing fork() and threads detected; memory leaked.
Building it Better: Not-breaking…ability
Jira
import gevent.monkey
gevent.monkey.patch_all()
<the rest of the library>
That’s not good!
Building it Better: LazyLoader
 Major overhaul of Salt’s loader system
 Only load modules when you are going to use them
– This means that bad/malicious modules will only affect their own uses
 In addition fixes a few other things
– __context__ is now actually global for a given module dict (e.g. __salt__)
– Submodules work (and reload correctly)
 New in 2015.2
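The load-on-first-use idea from the LazyLoader slide can be sketched in a few lines. This is an illustration only, not Salt's LazyLoader: the `LazyModules` class is hypothetical, and it imports standard-library modules by name with `importlib` purely to show that nothing is loaded until it is asked for.

```python
import importlib
from collections.abc import Mapping


class LazyModules(Mapping):
    """Dict-like view over modules that imports each one on first access."""

    def __init__(self, names):
        self._names = list(names)   # modules we know about but have not imported
        self._loaded = {}           # name -> module, filled in on demand

    def __getitem__(self, name):
        if name not in self._names:
            raise KeyError(name)
        if name not in self._loaded:
            # The import happens here, so a broken module only hurts its own callers.
            self._loaded[name] = importlib.import_module(name)
        return self._loaded[name]

    def __iter__(self):
        return iter(self._names)

    def __len__(self):
        return len(self._names)

    def loaded(self):
        """Names that have actually been imported so far."""
        return sorted(self._loaded)


mods = LazyModules(["json", "textwrap", "no_such_module_for_demo"])
print(mods.loaded())                 # nothing imported yet
print(mods["json"].dumps({"a": 1}))  # imports json on first use
print(mods.loaded())
```

The bad entry (`no_such_module_for_demo`) never causes an error unless somebody actually asks for it, which is the sandboxing effect described above.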
Building it Stronger: What is Availability?
 Being available?
 Uptime
 Lots of 9’s!
 Things to consider:
– What has to work to be “available”?
 Minions working?
 Reactor working?
 Pillars working?
 Job cache?
– How do you do maintenance?
 Scheduled downtime?
 HA system you can work on live?
 How do you measure it?
Building it Stronger: Availability of minions
 Availability for a platform doesn’t just mean “it’s not broken”
 Almost always a perception problem
 Some examples:
– Can’t run “salt” on my box (not on the master)
– The CLI return didn’t have all of my hosts! (your box is dead…)
– I re-imaged my box and it’s not getting jobs anymore (key pair changed)
– “Salt isn’t working” – usually not “salt”
Building it Stronger: Availability of minions
We have to be proactive:
 Documentation/training
 Monitoring: Minion metrics
– Module sync time
– Connectivity to master
– Uptime
 Auto-remediation:
– Re-imaged boxes: keys are regenerated
 We have an internal tool that keeps track of hosts
 Use the internal tool to determine whether the host is the same
 Simple reactor SLS to run a custom runner on auth failure
– SaltGuard
 Mismatched minion_id and hostname
 Detects and reports when master public key changed
 And much more!
Building it Stronger: Availability of Master
 Master Metrics
– Collect metrics about how you use salt
– The Reactor is great for generating such metrics
 How many jobs were published
 Which jobs were published
 Number of worker processes (and stats per process)
 Number of auths
 Things we noticed
– Reactor doesn’t seem to run on all events??
– Mworkers going defunct
– Publisher process dies every day, requiring a bounce of the master
Building it Stronger: Availability of Master
 Reactor missing events
– A runner had a code path that called exit(), which killed the reactor
 Mworkers going defunct
– When the minion didn’t call recv(), the socket on the master side would break
– Fix: handle that case and reset the socket on the master
 Publisher dying
– Found zmq pub socket memory leak (grows to ~48gb of memory in 1 day)
– Some work with zmq community– but slow going
 Process Manager
– I originally wrote this for netapi, but I generalized it for arbitrary use
– Now the master’s parent process is simply a process manager
– Then we noticed the master restarted every 24h on some (not all) masters
 Re-found bug in python subprocess (http://bugs.python.org/issue1731717)
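The reactor failure above came from a runner calling exit(). One defensive pattern is to catch `SystemExit` where third-party code is invoked, so a single runner cannot take down the calling process. This is a sketch of the idea, not the change made in Salt; `run_guarded`, `bad_runner`, and `good_runner` are hypothetical names.

```python
import logging
import sys

log = logging.getLogger("reactor-sketch")


def run_guarded(func, *args, **kwargs):
    """Call func and report failure instead of letting it kill the caller.

    Returns (ok, value). SystemExit is not a subclass of Exception, so it
    has to be caught explicitly.
    """
    try:
        return True, func(*args, **kwargs)
    except SystemExit as exc:
        log.error("%s tried to exit with code %r", func.__name__, exc.code)
        return False, None
    except Exception:
        log.exception("%s raised", func.__name__)
        return False, None


def bad_runner():
    sys.exit(1)      # would normally terminate the whole process


def good_runner():
    return "done"


print(run_guarded(bad_runner))    # (False, None)
print(run_guarded(good_runner))   # (True, 'done')
```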
Building it Stronger: High Availability master
 Today: Active/Passive master pair
 Problems moving to Active/Active
– Job returns (cache)
 Where do they go?
 How do people retrieve them?
 Retention policy?
– Load distribution
 How to get the minions to connect evenly
 Redistribute for maintenance
– Clients (people running commands)
 Which one do they connect to?
 How do they find their job returns later?
Building it Stronger: High Availability master
 Tools we have:
– Failover Minion
 Minion with N masters configured– will find one that’s available
 Tradeoff: eventually available (in failover)
– Multi-minion
 Listen to N masters and do all of their jobs
 Tradeoff: nothing for the minion 
– Syndic
 Allow a master to “proxy” jobs from a higher level master
 Tradeoff: Another master to maintain, still single point of failure
– Multi-Syndic
 Allow a single syndic to “proxy” multiple masters,
 syndic_mode to control forwarding
 Tradeoff: Another master to maintain, (potentially) complicated forwarding
Building it Stronger: High Availability master
Option #1: N masters with N minions
– Cons: “sharded” masters, SPOF for each minion
[Diagram: Master, Master; Minion, Minion]
Building it Stronger: High Availability master
Option #2: N masters with Failover minion
– Cons: “sharded” masters
[Diagram: Master, Master; Minion, Minion]
Building it Stronger: High Availability master
Option #3: N masters with Multi-minion
– Cons: vertical scaling of master
[Diagram: Master, Master; Minion, Minion]
Building it Stronger: High Availability master
Option #4: N top level masters + N multi-syndics + Multi-minion
– Cons: Complex topology, duplicate publishes
[Diagram: Master, Master; Syndic, Syndic; Minion, Minion]
Building it Stronger: High Availability master
Option #5: N top level masters + N multi-syndics + failover-minion
– Cons: Complex topology, minimum 4 “masters”
[Diagram: Master, Master; Syndic, Syndic; Minion, Minion, Minion]
Building it Stronger: High Availability master
Option #6: Multi-master + multi-syndic (in cluster mode) + failover-minion
– Cons: (not as many)
[Diagram: Master + Syndic, Master + Syndic; Minion, Minion]
Building it Faster: Performance
 What drives performance?
– Throughput problems (need to run more jobs!)
– Latency problems (need those runs faster!)
– Capacity problems (need to use fewer boxes!)
Building it Faster: How to Performance?
 Do less
 Do less faster
 Do less faster better
 Things to watch out for:
– Concurrency is hard (deadlocks, infinite spinning)
– If making it faster is making it more confusing, it’s probably not the right way
– Prioritize your optimizations
Building it Faster: Small optimizations
 Use libraries as recommended
– disable GC during msgpack (read the docs)
– Runner/Wheel client used to spin while waiting for returns (instead of calling get_event with a timeout)
 Sometimes slower is faster
– compiled regex is slower to create, but faster to execute
 Disk is SLOOOOW
– Make AES key not hit disk (Mworker had to stat a file on every iteration)
– Removed “verify” from all CLI scripts (except the daemons)
 Use the language!
– Built-ins where possible
– Care about your memory (iterate, not copy)
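To illustrate the "spin while waiting" point, here is a generic sketch that uses the standard library's `queue.Queue` rather than Salt's event bus. `wait_for_return` is a hypothetical helper; the blocking `get(timeout=...)` call plays the role that calling get_event with a timeout plays in the slide.

```python
import queue
import threading


def wait_for_return(events, timeout):
    """Block until an event arrives or the timeout expires; no busy loop."""
    try:
        return events.get(timeout=timeout)
    except queue.Empty:
        return None


events = queue.Queue()

# Simulate a job return arriving a little later from another thread.
threading.Timer(0.05, events.put, args=({"jid": "demo", "ret": True},)).start()

print(wait_for_return(events, timeout=1.0))   # the event, once it arrives
print(wait_for_return(events, timeout=0.05))  # None: nothing else arrived
```

While the caller is blocked in `get`, it uses no CPU, which is the difference from polling in a tight loop.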
Python copy vs. iterate
Copy                        Iterate
for k in _dict.values():    for k in _dict.itervalues():
for k in _dict.keys():      for k in _dict:
k = _dict.keys()[0]         k = _dict.iterkeys().next()
v = _dict.values()[0]       v = _dict.itervalues().next()
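The table uses Python 2 method names. In Python 3, `keys()` and `values()` return views rather than lists, so the same copy-versus-iterate distinction is spelled differently. A small sketch, using a made-up `_dict`:

```python
_dict = {"a": 1, "b": 2, "c": 3}

# Iterating over a view does not copy the dictionary's contents.
for v in _dict.values():
    pass
for k in _dict:
    pass

# First key / first value without building a list.
first_key = next(iter(_dict))
first_value = next(iter(_dict.values()))

# The copying form still works, but allocates a full list first.
first_key_copy = list(_dict.keys())[0]

print(first_key, first_value, first_key_copy)  # a 1 a
```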
Building it Faster: Reactor
 In addition to the reactor dying, we noticed significant event loss
 Turns out the reactor was creating all of the clients (runner, wheel,
LocalClient) on each reaction execution!
– Created cachedict to cache these for a configurable amount of time
 Then the reactor was consuming the entire box!
– Found that the reactor fired events for runner starts– which could then be
reacted on– causing infinite recursion in certain cases
– Made the reactor not fire start events for runners (CHECK: user?)
 Then we found that on master startup the Master host would be out of PIDs
– Reactor would daemonize a process for each event it reacted to
– Switched reactor to a threadpool (to limit concurrency and CPU usage)
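The client-caching fix above amounts to a dictionary whose entries expire. A minimal sketch of that idea follows; it is not Salt's cachedict, and `ExpiringCache` and `make_client` are hypothetical names used only for illustration.

```python
import time


class ExpiringCache:
    """Cache values for ttl seconds, rebuilding them with a factory on a miss."""

    def __init__(self, ttl, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock          # injectable so tests can control time
        self._data = {}              # key -> (expiry_time, value)

    def get(self, key, factory):
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: reuse the cached object
        value = factory()            # expired or missing: build a new one
        self._data[key] = (now + self._ttl, value)
        return value


created = []


def make_client():
    """Stand-in for constructing an expensive client object."""
    created.append(object())
    return created[-1]


cache = ExpiringCache(ttl=60)
first = cache.get("runner", make_client)
second = cache.get("runner", make_client)
print(first is second, len(created))   # True 1
```

Two reactions inside the TTL share one client, so the expensive construction happens once instead of on every reaction.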
Building it Faster: LazyLoader
 In addition to sandboxing, the LazyLoader is much faster
 Instead of loading everything in case you might use one module, we load just in time
 In local testing (salt-call) this cuts ~50% off the time for a local test.ping
Old
$ time salt-call --local test.ping
local:
True
real 0m2.908s
user 0m1.976s
sys 0m0.906s
New
$ time salt-call --local test.ping
local:
True
real 0m1.562s
user 0m0.920s
sys 0m0.631s
Building it Faster: Auth
 Biggest problem (today) with performance/scale for salt is the auth storm
 What causes it?
– Salt zmq uses AES crypto– mostly with a shared symmetric key
– When the key rotates, the next job sent to a minion will trigger a re-auth
– ZMQ pub/sub sockets by default send all messages everywhere
 Which means if the key rotates and someone pings a minion, all minions will auth
– Bounce of all (or a lot of) minions causes this as well
Building it Faster: Auth
 This “storm” state doesn’t always clear up on its own
– ZMQ doesn’t tell you if the client for a message is connected
– If a client has left, we don’t know– so we have to execute the job
 Well, that seems bad…
Building it Faster: Auth
 How can we fix it?
– Only send the publish job to the minions that need it (zmq_filtering)
 Meaning all minions won’t re-auth at the same time
 Not a “fix”, but avoids the storm if we don’t need it
 Useful for large publish jobs (since you have to send it to your targets, not everyone)
– Make the minions back off when the auth times out
 acceptance_wait_time: how long to wait after a failure
 acceptance_wait_time_max: if multiple failures, increase backoff up to this
 Great, all good right?
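The backoff described above can be sketched as a capped, growing delay. The parameter names below echo acceptance_wait_time and acceptance_wait_time_max, but the linear growth rule and the optional jitter are assumptions made for illustration, not Salt's exact behavior; `next_wait` is a hypothetical helper.

```python
import random


def next_wait(attempt, wait_time=10, wait_time_max=60, jitter=0.0, rng=random):
    """Seconds to wait before retry number `attempt` (1-based).

    The delay grows linearly with each failure and is capped at wait_time_max.
    Optional jitter spreads clients out so they do not retry in lockstep.
    """
    base = min(wait_time * attempt, wait_time_max)
    return base + rng.uniform(0, jitter)


print([next_wait(n) for n in range(1, 9)])
# [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 60.0, 60.0]
```

With tens of thousands of clients, the cap bounds how long any one of them stays away, while the growth keeps repeated failures from hammering the server at a fixed rate.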
Building it Faster: Auth
 It turns out each minion actually auth’s 3 times during startup (mine, req
channel, pillar fetch)
 Well, just pass it around then!
– Actually, not that simple.
– Some of these can share, but others don’t have access to the main daemon
 Solution? Singleton Auth
– What’s a Singleton? A single instance of a class, so you don’t have to pass it around; the class will only ever return one instance to you
– This means everyone can just “create” Auth() and it will take care of just
making one
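A minimal sketch of the singleton pattern described above, assuming one shared instance per key (for example, per master address). The `Auth` class here is a toy written for this example and is not Salt's implementation.

```python
import threading


class Auth:
    """One shared instance per distinct key (for example, per master address)."""

    _instances = {}
    _lock = threading.Lock()

    def __new__(cls, key):
        with cls._lock:
            inst = cls._instances.get(key)
            if inst is None:
                inst = super().__new__(cls)
                inst._setup(key)         # run setup exactly once per key
                cls._instances[key] = inst
            return inst

    def _setup(self, key):
        self.key = key
        self.auth_count = 0

    def authenticate(self):
        """Pretend to authenticate; counts how often it really happens."""
        if self.auth_count == 0:
            self.auth_count += 1
        return self.auth_count


a = Auth("master-1")
b = Auth("master-1")
c = Auth("master-2")
a.authenticate()
b.authenticate()
print(a is b, a is c, a.auth_count)   # True False 1
```

Callers in different parts of the program can each write `Auth("master-1")` without coordinating, and the authentication work still happens once.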
Building it Faster: salt-api
 What is salt-api?
– Netapi modules: generic network interfaces to salt
– Only ones today are rest modules
 Why? It allows for easy integration with salt
– Deployment
– Auto Remediation
– GUI
– Anything else!
Building it Faster: salt-api
 Problems with the current implementations (wsgi and cherrypy)
– …
– Concurrency limitations
 Cherrypy/wsgi are threaded servers
– Request comes in, gets picked up by a thread
– That thread will handle the job and then wait on the response
 The wait on the return event can take a significant amount of time, all the
while the thread is blocked waiting on the response– we can do better!
Building it Faster: Saltnado!
 Tornado implementation of Salt-api
 What is tornado?
– Network server library
– IOLoop
– Coroutines and Futures (probably do a quick explanation of what that is)
Building it Faster: Saltnado!
 Tornado hello world:
class HelloWorldHandler(RequestHandler):
    def get(self):
        self.write("hello world")
Building it Faster: Saltnado!
 Tornado with callbacks:
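# Callback style, for Tornado versions before 6.0
import tornado.web
from tornado.httpclient import AsyncHTTPClient
from tornado.web import RequestHandler

class AsyncHandler(RequestHandler):
    @tornado.web.asynchronous  # the decorator lives in tornado.web, not tornado.gen
    def get(self):
        http_client = AsyncHTTPClient()
        http_client.fetch("http://example.com",
                          callback=self.on_fetch)

    def on_fetch(self, response):
        do_something_with_response(response)
        self.render("template.html")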
Building it Faster: Saltnado!
 Tornado with coroutines:
class GenAsyncHandler(RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        http_client = AsyncHTTPClient()
        response = yield http_client.fetch("http://example.com")
        do_something_with_response(response)
        self.render("template.html")
Building it Faster: Saltnado!
 What does this all mean for salt?
– Event driven API
– No concurrency limitations: long-running jobs are now just as expensive as short-running jobs to the API
– Test coverage
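Why an event-driven API removes the per-request thread cost can be shown with a small sketch. It uses the standard library's asyncio instead of Tornado, and `wait_for_job` is a hypothetical stand-in for waiting on a job's return event; the point is only that many waits overlap on a single thread.

```python
import asyncio
import time


async def wait_for_job(jid, seconds):
    """Stand-in for waiting on a job's return event."""
    await asyncio.sleep(seconds)      # yields to the event loop while waiting
    return {"jid": jid, "ret": True}


async def main(n_jobs=100, seconds=0.1):
    start = time.monotonic()
    # All waits overlap on one thread; none of them blocks the others.
    results = await asyncio.gather(
        *(wait_for_job(i, seconds) for i in range(n_jobs))
    )
    return results, time.monotonic() - start


results, elapsed = asyncio.run(main())
print(len(results), "jobs finished in", round(elapsed, 2), "seconds")
```

In a threaded server the same 100 waits would each hold a thread for the full duration; here the total time stays close to a single wait.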
Takeaways
 Better Reliability
– Write maintainable, debuggable, working code
– Write tests and keep them up-to-date
 Stronger Availability
– Determine what availability means to your use
– Proactive measuring, monitoring, and remediating
 Faster Performance
– Do less faster better
– Use the language effectively
– Prioritize your performance improvements
Got more questions about Salt @ LinkedIn
 Got questions?
– Drop by our SaltConf booth! (do we have one?)
– Connect with me on LinkedIn www.linkedin.com/in/jacksontj
– Jacksontj on #salt on freenode
SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster

 
Automate your development environment with Jira and Saltstack
Automate your development environment with Jira and SaltstackAutomate your development environment with Jira and Saltstack
Automate your development environment with Jira and Saltstack
 
Using SaltStack to Auto Triage and Remediate Production Systems
Using SaltStack to Auto Triage and Remediate Production SystemsUsing SaltStack to Auto Triage and Remediate Production Systems
Using SaltStack to Auto Triage and Remediate Production Systems
 
Security & DevOps- Ways To Make Sure Your Apps & Infrastructure Are Secure
Security & DevOps- Ways To Make Sure Your Apps & Infrastructure Are SecureSecurity & DevOps- Ways To Make Sure Your Apps & Infrastructure Are Secure
Security & DevOps- Ways To Make Sure Your Apps & Infrastructure Are Secure
 
Fall 2016 ats summit - Parent & Origin Selection
Fall 2016 ats summit  - Parent & Origin SelectionFall 2016 ats summit  - Parent & Origin Selection
Fall 2016 ats summit - Parent & Origin Selection
 

Semelhante a SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster

SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power Tools
SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power ToolsSaltConf14 - Thomas Jackson, LinkedIn - Safety with Power Tools
SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power ToolsSaltStack
 
Serverless in production, an experience report (FullStack 2018)
Serverless in production, an experience report (FullStack 2018)Serverless in production, an experience report (FullStack 2018)
Serverless in production, an experience report (FullStack 2018)Yan Cui
 
Serverless in Production, an experience report (AWS UG South Wales)
Serverless in Production, an experience report (AWS UG South Wales)Serverless in Production, an experience report (AWS UG South Wales)
Serverless in Production, an experience report (AWS UG South Wales)Yan Cui
 
Serverless in production, an experience report
Serverless in production, an experience reportServerless in production, an experience report
Serverless in production, an experience reportYan Cui
 
Serverless in production, an experience report (Going Serverless, 28 Feb 2018)
Serverless in production, an experience report (Going Serverless, 28 Feb 2018)Serverless in production, an experience report (Going Serverless, 28 Feb 2018)
Serverless in production, an experience report (Going Serverless, 28 Feb 2018)Domas Lasauskas
 
Abusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec gloryAbusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec gloryPriyanka Aash
 
АНДРІЙ ШУМАДА «To Cover Uncoverable» Online WDDay 2022 js
АНДРІЙ ШУМАДА «To Cover Uncoverable» Online WDDay 2022 jsАНДРІЙ ШУМАДА «To Cover Uncoverable» Online WDDay 2022 js
АНДРІЙ ШУМАДА «To Cover Uncoverable» Online WDDay 2022 jsWDDay
 
"To cover uncoverable", Andrii Shumada
"To cover uncoverable", Andrii Shumada"To cover uncoverable", Andrii Shumada
"To cover uncoverable", Andrii ShumadaFwdays
 
Cloud adoption fails - 5 ways deployments go wrong and 5 solutions
Cloud adoption fails - 5 ways deployments go wrong and 5 solutionsCloud adoption fails - 5 ways deployments go wrong and 5 solutions
Cloud adoption fails - 5 ways deployments go wrong and 5 solutionsYevgeniy Brikman
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0Joakim Lindbom
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesTodd Palino
 
Deploying 3 times a day without a downtime @ Rocket Tech Summit in Berlin
Deploying 3 times a day without a downtime @ Rocket Tech Summit in BerlinDeploying 3 times a day without a downtime @ Rocket Tech Summit in Berlin
Deploying 3 times a day without a downtime @ Rocket Tech Summit in BerlinAlessandro Nadalin
 
Build reactive systems on lambda
Build reactive systems on lambdaBuild reactive systems on lambda
Build reactive systems on lambdaYan Cui
 
The future of paas is serverless
The future of paas is serverlessThe future of paas is serverless
The future of paas is serverlessYan Cui
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
 
Pain Driven Development by Alexandr Sugak
Pain Driven Development by Alexandr SugakPain Driven Development by Alexandr Sugak
Pain Driven Development by Alexandr SugakSigma Software
 
ITT 2015 - Kirk Pepperdine - The (not so) Dark Art of Performance Tuning, fro...
ITT 2015 - Kirk Pepperdine - The (not so) Dark Art of Performance Tuning, fro...ITT 2015 - Kirk Pepperdine - The (not so) Dark Art of Performance Tuning, fro...
ITT 2015 - Kirk Pepperdine - The (not so) Dark Art of Performance Tuning, fro...Istanbul Tech Talks
 
Real World Problem Solving Using Application Performance Management 10
Real World Problem Solving Using Application Performance Management 10Real World Problem Solving Using Application Performance Management 10
Real World Problem Solving Using Application Performance Management 10CA Technologies
 
DevOps: Find Solutions, Not More Defects
DevOps: Find Solutions, Not More DefectsDevOps: Find Solutions, Not More Defects
DevOps: Find Solutions, Not More DefectsTechWell
 

Semelhante a SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster (20)

SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power Tools
SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power ToolsSaltConf14 - Thomas Jackson, LinkedIn - Safety with Power Tools
SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power Tools
 
Serverless in production, an experience report (FullStack 2018)
Serverless in production, an experience report (FullStack 2018)Serverless in production, an experience report (FullStack 2018)
Serverless in production, an experience report (FullStack 2018)
 
Serverless in Production, an experience report (AWS UG South Wales)
Serverless in Production, an experience report (AWS UG South Wales)Serverless in Production, an experience report (AWS UG South Wales)
Serverless in Production, an experience report (AWS UG South Wales)
 
Serverless in production, an experience report
Serverless in production, an experience reportServerless in production, an experience report
Serverless in production, an experience report
 
Serverless in production, an experience report (Going Serverless, 28 Feb 2018)
Serverless in production, an experience report (Going Serverless, 28 Feb 2018)Serverless in production, an experience report (Going Serverless, 28 Feb 2018)
Serverless in production, an experience report (Going Serverless, 28 Feb 2018)
 
Abusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec gloryAbusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec glory
 
АНДРІЙ ШУМАДА «To Cover Uncoverable» Online WDDay 2022 js
АНДРІЙ ШУМАДА «To Cover Uncoverable» Online WDDay 2022 jsАНДРІЙ ШУМАДА «To Cover Uncoverable» Online WDDay 2022 js
АНДРІЙ ШУМАДА «To Cover Uncoverable» Online WDDay 2022 js
 
"To cover uncoverable", Andrii Shumada
"To cover uncoverable", Andrii Shumada"To cover uncoverable", Andrii Shumada
"To cover uncoverable", Andrii Shumada
 
Cloud adoption fails - 5 ways deployments go wrong and 5 solutions
Cloud adoption fails - 5 ways deployments go wrong and 5 solutionsCloud adoption fails - 5 ways deployments go wrong and 5 solutions
Cloud adoption fails - 5 ways deployments go wrong and 5 solutions
 
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.02014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
2014-10 DevOps NFi - Why it's a good idea to deploy 10 times per day v1.0
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
 
Deploying 3 times a day without a downtime @ Rocket Tech Summit in Berlin
Deploying 3 times a day without a downtime @ Rocket Tech Summit in BerlinDeploying 3 times a day without a downtime @ Rocket Tech Summit in Berlin
Deploying 3 times a day without a downtime @ Rocket Tech Summit in Berlin
 
Build reactive systems on lambda
Build reactive systems on lambdaBuild reactive systems on lambda
Build reactive systems on lambda
 
The future of paas is serverless
The future of paas is serverlessThe future of paas is serverless
The future of paas is serverless
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
Pain Driven Development by Alexandr Sugak
Pain Driven Development by Alexandr SugakPain Driven Development by Alexandr Sugak
Pain Driven Development by Alexandr Sugak
 
ITT 2015 - Kirk Pepperdine - The (not so) Dark Art of Performance Tuning, fro...
ITT 2015 - Kirk Pepperdine - The (not so) Dark Art of Performance Tuning, fro...ITT 2015 - Kirk Pepperdine - The (not so) Dark Art of Performance Tuning, fro...
ITT 2015 - Kirk Pepperdine - The (not so) Dark Art of Performance Tuning, fro...
 
Real World Problem Solving Using Application Performance Management 10
Real World Problem Solving Using Application Performance Management 10Real World Problem Solving Using Application Performance Management 10
Real World Problem Solving Using Application Performance Management 10
 
DevOps: Find Solutions, Not More Defects
DevOps: Find Solutions, Not More DefectsDevOps: Find Solutions, Not More Defects
DevOps: Find Solutions, Not More Defects
 

Último

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Último (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

SaltConf 2015: Salt stack at web scale: Better, Stronger, Faster

  • 1. SaltStack at Web Scale… Better, Stronger, Faster
      Site Reliability Engineering. ©2015 LinkedIn Corporation. All Rights Reserved.
  • 2. Who’s this guy?
  • 3. What is SRE?
      - Hybrid of operations and engineering
      - Heavily involved in architecture and design
      - Application support ninjas
      - Masters of automation
  • 4. So, what do I do with salt?
      - Heavy user
      - Active developer
      - Administrator (less so)
  • 5. What’s LinkedIn?
      - Professional social network
      - You probably all have an account
      - You probably all get email from us too
  • 6. Salt @ LinkedIn
      - When LinkedIn started
        - Aug 2011: Salt 0.8.9
        - ~5k minions
      - When I got involved
        - May 2012: Salt 0.9.9
        - ~10k minions
      - Last SaltConf
        - Now: 2014.01
        - ~30k minions
      - Now
        - 2014.7 (starting 2015.2)
        - ~70k minions
  • 8. How to scale 101
      - We can rebuild it
      - We have the technology
      - Better Reliability
      - Stronger Availability
      - Faster Performance
  • 9. Building it Better: What is Reliability?
      - Being reliable! (not helpful)
      - Maintainability
      - Debuggability
      - Not breaking
  • 10. Building it Better: Maintainability
      - Encapsulation/Generalization
        - Make systems that are responsible for their own things
        - Reuse code as much as possible
      - Documentation
      - Examples:
        - States: each state module only knows how to interact with its own stuff
        - Channels: don’t have to use SREQ directly (they handle all the auth, retries, etc.)
        - Job cache: single place where all of the returners (master and minion) live
  • 11. Building it Better: Maintainability
      - Tests!
        - Write them!
        - Write the negative ones
        - Keep them up to date with your changes
      - Don’t be that guy
  • 12. Building it Better: Debuggability
      - Logs
        - Logging on useful events (such as AES key rotation)
        - Debug messages
        - Tuning log level on your install
      - Fire events
        - Filesystem update
        - AES key rotation
        - Etc.
      - Setproctitle: setting useful process titles for ps output
  • 13. Building it Better: Debuggability
      - Useful error messages
  • 14. Building it Better: Debugging state output
      SLS
          foo:
            cmd.run:
              - name: 'date'
              - prereq:
                - cmd: fail_one  # anything that will fail test=True
          bar:
            cmd.run:
              - name: 'exit 0'
              - cwd: 1  # bad value
      Output
                ID: foo
          Function: cmd.run
              Name: date
            Result: False
           Comment: One or more requisite failed
          ----------
                ID: bar
          Function: cmd.run
              Name: exit 0
            Result: False
           Comment: One or more requisite failed
  • 15. Building it Better: Debugging state output
      New Output
                ID: foo
          Function: cmd.run
              Name: date
            Result: False
           Comment: One or more requisite failed: {'test.fail_one': 'An exception occurred in this state: Traceback (most recent call last):\n <INSERT TRACEBACK>'}
          ----------
                ID: bar
          Function: cmd.run
              Name: exit 0
            Result: False
           Comment: An exception occurred in this state: Traceback (most recent call last): <INSERT TRACEBACK>
  • 17. Building it Better: Debugging state output
      SLS
          foo:
            cmd.run:
              - name: 'date'
              - prereq:
                - foo: fail_one
                - cmd: work
          fail_one:
            foo.run:
              - name: 'exit 1'
          work:
            cmd.run:
              - name: 'date'
      Output
                ID: foo
          Function: cmd.run
              Name: date
            Result: False
           Comment: One or more requisite failed
          ----------
                ID: work
          Function: cmd.run
              Name: date
            Result: False
           Comment: One or more requisite failed
  • 18. Building it Better: Debugging state output
      New Output
                ID: foo
          Function: cmd.run
              Name: date
            Result: False
           Comment: One or more requisite failed: {'test.fail_one': "State 'foo.run' was not found in SLS 'test'\n"}
          ----------
                ID: work
          Function: cmd.run
              Name: date
            Result: None
           Comment: Command "date" would have been executed
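The difference between the old and new output above is that the comment now carries the reason each requisite failed. The slides do not show how Salt builds that comment, so the following is only a minimal sketch of the idea; the function name `requisite_comment` and the `'<sls>.<id>'` tag format are assumptions chosen to mirror the output shown.

```python
def requisite_comment(failed):
    """Build a state comment that names each failed requisite.

    `failed` maps a requisite tag (assumed format: '<sls>.<id>') to the
    comment that requisite returned. This is a hypothetical helper, not
    Salt's implementation.
    """
    # Sort so the comment is stable regardless of insertion order.
    details = {tag: failed[tag] for tag in sorted(failed)}
    return "One or more requisite failed: {0}".format(details)


comment = requisite_comment(
    {"test.fail_one": "State 'foo.run' was not found in SLS 'test'\n"}
)
print(comment)
```

Carrying the underlying comment through means the person reading the highstate output sees the root cause on the state that was skipped, rather than having to hunt for it.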
  • 20. Building it Better: Not-breaking…ability
      - Expect failure and code defensively, things fail
        - Hardware
        - Network
      - Modules can be… problematic
  • 22. Building it Better: Not-breaking…ability
      Some are obvious:
          exit(0)

          def test():
              return True
  • 23. Building it Better: Not-breaking…ability
      Some less so:
          import requests
          ret = requests.get('http://fileserver.corp/somefile')

          def test():
              return True
  • 24. Building it Better: Not-breaking…ability
      Some are very hard to find:
          WARNING: Mixing fork() and threads detected; memory leaked.
  • 25. Building it Better: Not-breaking…ability
      Jira:
          import gevent.monkey
          gevent.monkey.patch_all()
          <the rest of the library>
      That’s not good!
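The module examples above all misbehave at import time, before any function is called. As a minimal sketch of coding defensively against that (not Salt's loader; `load_module` and the two source strings are hypothetical), a loader can execute each module in isolation and treat `SystemExit` as a load failure. Note that `SystemExit` does not derive from `Exception`, so it has to be caught explicitly.

```python
import types

# Hypothetical module source with an import-time side effect,
# modelled on the "obvious" example above.
BAD_MODULE_SOURCE = """
import sys
sys.exit(0)

def test():
    return True
"""

GOOD_MODULE_SOURCE = """
def test():
    return True
"""


def load_module(name, source):
    """Execute module source and return (module, error_message)."""
    module = types.ModuleType(name)
    try:
        exec(compile(source, name, "exec"), module.__dict__)
    except (Exception, SystemExit) as exc:
        # One bad module is reported instead of taking the process down.
        return None, "{0}: {1!r}".format(name, exc)
    return module, None


bad, bad_error = load_module("bad", BAD_MODULE_SOURCE)
good, good_error = load_module("good", GOOD_MODULE_SOURCE)
print(bad_error)
print(good.test())
```

This only contains the simplest case. Side effects such as a blocking network call or monkey-patching at import time still happen; the sketch just keeps them from killing the loading process outright.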
  • 26. Building it Better: LazyLoader
      - Major overhaul of Salt’s loader system
      - Only load modules when you are going to use them
        - This means that bad/malicious modules will only affect their own uses
      - In addition, fixes a few other things
        - __context__ is now actually global for a given module dict (e.g. __salt__)
        - Submodules work (and reload correctly)
      - New in 2015.2
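To illustrate the "only load modules when you are going to use them" idea, here is a minimal sketch of a lazily populated function dictionary. It is not Salt's LazyLoader; the class `LazyDict`, its `load_calls` attribute, and the loader callables are all assumptions made for the example.

```python
class LazyDict:
    """Map 'module.function' keys to callables, loading modules on first use."""

    def __init__(self, loaders):
        # loaders: module name -> zero-argument callable returning
        # a dict of {function name: callable}
        self._loaders = loaders
        self._loaded = {}
        self.load_calls = []  # records which modules were actually loaded

    def __getitem__(self, key):
        mod_name, _, func_name = key.partition(".")
        if mod_name not in self._loaders:
            raise KeyError(key)
        if mod_name not in self._loaded:
            self.load_calls.append(mod_name)
            self._loaded[mod_name] = self._loaders[mod_name]()
        return self._loaded[mod_name][func_name]


def load_test():
    return {"ping": lambda: True}


def load_broken():
    # Stands in for a module that fails at import time.
    raise RuntimeError("broken module")


funcs = LazyDict({"test": load_test, "broken": load_broken})
print(funcs["test.ping"]())  # only the 'test' module is loaded here
print(funcs.load_calls)
```

Because `broken` is never touched unless someone asks for one of its functions, its failure stays confined to the callers that actually use it.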
  • 28. Building it Stronger: What is Availability?
      - Being available?
      - Uptime
      - Lots of 9’s!
      - Things to consider:
        - What has to work to be “available”?
          - Minions working?
          - Reactor working?
          - Pillars working?
          - Job cache?
        - How do you do maintenance?
          - Scheduled downtime?
          - HA system you can work on live?
      - How do you measure it?
  • 29. Building it Stronger: Availability of minions
      - Availability for a platform doesn’t just mean “it’s not broken”
      - Almost always a perception problem
      - Some examples:
        - Can’t run “salt” on my box (not on the master)
        - The CLI return didn’t have all of my hosts! (your box is dead…)
        - I re-imaged my box and it’s not getting jobs anymore (key pair changed)
        - “Salt isn’t working” (usually not “salt”)
  • 30. Building it Stronger: Availability of minions
      We have to be proactive:
      - Documentation/training
      - Monitoring: minion metrics
        - Module sync time
        - Connectivity to master
        - Uptime
      - Auto-remediation:
        - Re-imaged boxes: keys are regenerated
          - We have an internal tool which keeps track of hosts
          - Use the internal tool to determine if the host is the same
          - Simple reactor SLS to run a custom runner on auth failure
        - SaltGuard
          - Mismatched minion_id and hostname
          - Detects and reports when the master public key changed
          - And much more!
  • 31. Building it Stronger: Availability of Master
      - Master metrics
        - Collect metrics about how you use salt
        - The Reactor is great for generating such metrics
          - How many jobs published
          - What jobs were published
          - Number of worker processes (and stats per process)
          - Number of auths
      - Things we noticed
        - Reactor doesn’t seem to run on all events??
        - Mworkers going defunct
        - Publisher process dies every day, requiring a bounce of the master
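The slides do not show how these metrics were collected, so the following is only a minimal sketch of the counting step, assuming events arrive as dictionaries with a `tag` key. The function `count_events` and the sample events are hypothetical; the tags are modelled on Salt's `salt/job/<jid>/new` and `salt/auth` event tags.

```python
from collections import Counter


def count_events(events):
    """Count events by the first two '/'-separated components of the tag."""
    counts = Counter()
    for event in events:
        parts = event["tag"].split("/")
        counts["/".join(parts[:2])] += 1
    return counts


# Illustrative sample data, not real output from a master.
sample_events = [
    {"tag": "salt/job/20150301000000000000/new"},
    {"tag": "salt/job/20150301000000000001/new"},
    {"tag": "salt/auth"},
]

metrics = count_events(sample_events)
print(metrics["salt/job"], metrics["salt/auth"])
```

Grouping on a tag prefix keeps the number of distinct metric names small, which matters when every job ID would otherwise create its own counter.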
  • 33. Building it Stronger: Availability of Master
      - Reactor missing events
          - A runner had a condition which called exit(), which killed the reactor
      - Mworkers going defunct
          - When the minion didn’t call recv(), the socket on the master side would break
          - Handle that case and reset the socket on the master
      - Publisher dying
          - Found a zmq pub socket memory leak (grows to ~48 GB of memory in 1 day)
          - Some work with the zmq community, but slow going
      - Process Manager
          - I originally wrote this for netapi, but I generalized it for arbitrary use
          - Now the master’s parent process is simply a process manager
          - Then we noticed the master restarted every 24h on some (not all) masters
              - Re-found a bug in python subprocess (http://bugs.python.org/issue1731717)
  • 34. Building it Stronger: High Availability master
      - Today: Active/Passive master pair
      - Problems moving to Active/Active
          - Job returns (cache)
              - Where do they go?
              - How do people retrieve them?
              - Retention policy?
          - Load distribution
              - How to get the minions to connect evenly
              - Redistribute for maintenance
          - Clients (people running commands)
              - Which one do they connect to?
              - How do they find their job returns later?
  • 35. Building it Stronger: High Availability master
      - Tools we have:
          - Failover Minion
              - Minion with N masters configured, will find one that’s available
              - Tradeoff: eventually available (in failover)
          - Multi-minion
              - Listen to N masters and do all of their jobs
              - Tradeoff: nothing for the minion
          - Syndic
              - Allow a master to “proxy” jobs from a higher level master
              - Tradeoff: Another master to maintain, still single point of failure
          - Multi-Syndic
              - Allow a single syndic to “proxy” multiple masters
              - syndic_mode to control forwarding
              - Tradeoff: Another master to maintain, (potentially) complicated forwarding
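The difference between a failover minion and a multi-minion can be sketched in a few lines. This is only an illustration: `pick_failover_master`, `pick_multi_masters` and the `is_available` callback are hypothetical names standing in for a real connection attempt, not Salt's implementation.

```python
def pick_failover_master(masters, is_available):
    """Failover minion: use the first configured master that answers."""
    for master in masters:
        if is_available(master):
            return master
    return None  # nothing reachable yet; a real minion would retry later


def pick_multi_masters(masters, is_available):
    """Multi-minion: stay connected to every reachable master at once."""
    return [master for master in masters if is_available(master)]
```

The failover variant is why that option is "eventually available": a minion only notices a dead master when it next tries to reach it, then moves down the list.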
  • 36. Building it Stronger: High Availability master
      Option #1: N masters with N minions
          - Cons: “sharded” masters, SPOF for each minion
      (Diagram: Master, Master, Minion, Minion)
  • 37. Building it Stronger: High Availability master
      Option #2: N masters with Failover minion
          - Cons: “sharded” masters
      (Diagram: Master, Master, Minion, Minion)
  • 38. Building it Stronger: High Availability master
      Option #3: N masters with Multi-minion
          - Cons: vertical scaling of master
      (Diagram: Master, Master, Minion, Minion)
  • 39. Building it Stronger: High Availability master
      Option #4: N top level masters + N multi-syndics + Multi-minion
          - Cons: Complex topology, duplicate publishes
      (Diagram: Master, Master, Syndic, Syndic, Minion, Minion)
  • 40. Building it Stronger: High Availability master
      Option #5: N top level masters + N multi-syndics + failover-minion
          - Cons: Complex topology, minimum 4 “masters”
      (Diagram: Master, Master, Syndic, Syndic, Minion, Minion, Minion)
  • 41. Building it Stronger: High Availability master
      Option #6: Multi-master + multi-syndic (in cluster mode) + failover-minion
          - Cons: (not as many)
      (Diagram: Master + Syndic, Master + Syndic, Minion, Minion)
  • 43. Building it Faster: Performance
      - What drives performance?
          - Throughput problems (need to run more jobs!)
          - Latency problems (need those runs faster!)
          - Capacity problems (need to use fewer boxes!)
  • 45. Building it Faster: How to Performance?
      - Do less
      - Do less faster
      - Do less faster better
      - Things to watch out for:
          - Concurrency is hard (deadlocks, infinite spinning)
          - If making it faster is making it more confusing, it’s probably not the right way
          - Prioritize your optimizations
  • 46. Building it Faster: Small optimizations
      - Use libraries as recommended
          - Disable GC during msgpack (read the docs)
          - Runner/Wheel client used to spin while waiting for returns (instead of calling get_event with a timeout)
      - Sometimes slower is faster
          - Compiled regex is slower to create, but faster to execute
      - Disk is SLOOOOW
          - Make AES key not hit disk (Mworker had to stat a file on every iteration)
          - Removed “verify” from all CLI scripts (except the daemons)
      - Use the language!
          - Built-ins where possible
          - Care about your memory (iterate, not copy)
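Two of these small optimizations are easy to show with the standard library alone. The sketch below is illustrative only: `json` stands in for msgpack, a plain `queue.Queue` stands in for Salt's event bus, and the names `gc_paused`, `decode_many` and `wait_for_return` are made up for this example.

```python
import gc
import json
import queue
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the cyclic garbage collector for a block, then restore it."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def decode_many(payloads):
    """Deserialize a batch of payloads without GC passes in between."""
    with gc_paused():
        return [json.loads(payload) for payload in payloads]


def wait_for_return(returns, timeout):
    """Block until a return arrives or the timeout expires (no busy loop)."""
    try:
        return returns.get(timeout=timeout)
    except queue.Empty:
        return None
```

The blocking `get(timeout=...)` is the general shape of "call get_event with a timeout": the waiting thread sleeps instead of burning CPU in a polling loop.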
  • 47. Python copy vs. iterate
      Copy                          Iterate
      for k in _dict.values():      for k in _dict.itervalues():
      for k in _dict.keys():        for k in _dict:
      k = _dict.keys()[0]           k = _dict.iterkeys().next()
      v = _dict.values()[0]         v = _dict.itervalues().next()
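The slide uses Python 2 idioms. In Python 3, `keys()` and `values()` already return views rather than list copies, and `itervalues()`/`iterkeys()` no longer exist, so the same "iterate, don't copy" idea looks roughly like this (the helper names are illustrative):

```python
def first_key(d):
    """Return the first key without building a list of all keys."""
    return next(iter(d))


def first_value(d):
    """Return the first value without building a list of all values."""
    return next(iter(d.values()))


def total(d):
    """Sum the values by iterating the view directly (no copy)."""
    return sum(d.values())
```

By contrast, `list(d.values())[0]` would reintroduce the full copy that the slide warns about.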
  • 48. Building it Faster: Reactor
      - In addition to the reactor dying, we noticed significant event loss
      - Turns out the reactor was creating all of the clients (runner, wheel, LocalClient) on each reaction execution!
          - Created cachedict to cache these for a configurable amount of time
      - Then the reactor was consuming the entire box!
          - Found that the reactor fired events for runner starts, which could then be reacted on, causing infinite recursion in certain cases
          - Made the reactor not fire start events for runners (CHECK: user?)
      - Then we found that on master startup the Master host would be out of PIDs
          - Reactor would daemonize a process for each event it reacted to
          - Switched reactor to a threadpool (to limit concurrency and CPU usage)
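The two reactor fixes (cache expensive clients for a while, and bound concurrency with a thread pool) can be sketched as follows. This is an illustration under assumptions, not Salt's cachedict or reactor code: `ExpiringCache`, `react_all`, the `factory` callback and the injectable `clock` are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ExpiringCache:
    """Build values on demand and reuse them until they are ttl seconds old."""

    def __init__(self, ttl, factory, clock=time.monotonic):
        self._ttl = ttl
        self._factory = factory
        self._clock = clock
        self._entries = {}  # key -> (created_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now - entry[0] >= self._ttl:
            # Missing or stale: build a fresh value and remember when.
            entry = (now, self._factory(key))
            self._entries[key] = entry
        return entry[1]


def react_all(events, handler, max_workers=4):
    """Handle events on a fixed-size pool instead of one process per event."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(handler, events))
```

With a fixed `max_workers`, a burst of events at master startup queues up behind the pool rather than consuming a PID per event.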
  • 49. Building it Faster: LazyLoader
      - In addition to sandboxing, the LazyLoader is much faster
      - Instead of loading everything in case you might use one module, we load just in time
      - In local testing (salt-call) this cuts ~50% off of the time for a local test.ping
      Old:
          $ time salt-call --local test.ping
          local:
              True
          real 0m2.908s
          user 0m1.976s
          sys 0m0.906s
      New:
          $ time salt-call --local test.ping
          local:
              True
          real 0m1.562s
          user 0m0.920s
          sys 0m0.631s
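The just-in-time idea behind a lazy loader fits in a small mapping-like class. This is a simplified sketch, not Salt's LazyLoader: `LazyModules` is a hypothetical name, and it imports ordinary Python modules by name rather than Salt execution modules.

```python
import importlib


class LazyModules:
    """Import a module the first time it is asked for, then reuse it."""

    def __init__(self, importer=importlib.import_module):
        self._importer = importer
        self._loaded = {}

    def __getitem__(self, name):
        if name not in self._loaded:
            # Pay the import cost only for modules that are actually used.
            self._loaded[name] = self._importer(name)
        return self._loaded[name]

    def loaded(self):
        """Names of the modules imported so far."""
        return sorted(self._loaded)
```

A command such as `test.ping` then pays for one module import instead of all of them, which is the kind of saving the timings above show.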
  • 50. Building it Faster: Auth
      - Biggest problem (today) with performance/scale for salt is the auth storm
      - What causes it?
          - Salt zmq uses AES crypto, mostly with a shared symmetric key
          - When the key rotates, the next job sent to a minion will trigger a re-auth
          - ZMQ pub/sub sockets by default send all messages everywhere
              - Which means if the key rotates and someone pings a minion, all minions will auth
          - Bounce of all (or a lot of) minions causes this as well
  • 53. Building it Faster: Auth
      - This “storm” state doesn’t always clear up on its own
          - ZMQ doesn’t tell you if the client for a message is connected
          - If a client has left, we don’t know, so we have to execute the job
      - Well, that seems bad…
  • 55. Building it Faster: Auth
      - How can we fix it?
          - Only send the publish job to the minions that need it (zmq_filtering)
              - Meaning all minions won’t re-auth at the same time
              - Not a “fix”, but avoids the storm if we don’t need it
              - Useful for large publish jobs (since you only have to send it to your targets, not everyone)
          - Make the minions back off when the auth times out
              - acceptance_wait_time: how long to wait after a failure
              - acceptance_wait_time_max: if multiple failures, increase backoff up to this
      - Great, all good right?
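To make the two backoff settings concrete, here is a minimal sketch of the delay schedule they imply. It assumes the wait grows by `acceptance_wait_time` after each failed attempt and is capped at `acceptance_wait_time_max`; the linear growth and the helper name `backoff_delays` are assumptions for illustration, not a copy of the minion's code.

```python
def backoff_delays(wait_time, wait_time_max, attempts):
    """Return the delay before each retry, growing linearly up to a cap."""
    delays = []
    delay = wait_time
    for _ in range(attempts):
        delays.append(delay)
        # Assumed policy: add wait_time per failure, never exceed the cap.
        delay = min(delay + wait_time, wait_time_max)
    return delays
```

For example, `backoff_delays(10, 35, 5)` gives `[10, 20, 30, 35, 35]`, so repeated failures spread retries out instead of hammering the master at a fixed interval.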
  • 57. Building it Faster: Auth
      - It turns out each minion actually auths 3 times during startup (mine, req channel, pillar fetch)
      - Well, just pass it around then!
          - Actually, not that simple.
          - Some of these can share, but others don’t have access to the main daemon
      - Solution? Singleton Auth
          - What’s a Singleton? A single instance of a class, so you don’t have to pass it around; the class will only ever return one instance to you
          - This means everyone can just “create” Auth() and it will take care of just making one
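A singleton of this kind can be sketched in plain Python by overriding `__new__`. The example below is an assumption-laden illustration, not Salt's Auth class: keying instances by a `master_uri` string, the `sign_ins` counter and the placeholder credentials are all invented for the example.

```python
import threading


class Auth:
    """One shared instance per master URI within a process."""

    _instances = {}
    _lock = threading.Lock()

    def __new__(cls, master_uri):
        with cls._lock:
            instance = cls._instances.get(master_uri)
            if instance is None:
                instance = super().__new__(cls)
                # State is set here (not in __init__) so that repeated
                # Auth(...) calls never reset an existing instance.
                instance.master_uri = master_uri
                instance.sign_ins = 0
                instance._creds = None
                cls._instances[master_uri] = instance
            return instance

    def credentials(self):
        """Sign in on first use, then hand back the cached result."""
        with self._lock:
            if self._creds is None:
                self.sign_ins += 1
                self._creds = "session-for-" + self.master_uri  # placeholder
            return self._creds
```

Every caller can simply write `Auth("tcp://master:4506").credentials()`, and only the first call performs a sign-in. Note that this only shares state inside a single process.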
  • 59. Building it Faster: salt-api
      - What is salt-api?
          - Netapi modules: generic network interfaces to salt
          - Only ones today are rest modules
      - Why? It allows for easy integration with salt
          - Deployment
          - Auto Remediation
          - GUI
          - Anything else!
  • 60. Building it Faster: salt-api
      - Problems with the current implementations (wsgi and cherrypy)
          - …
          - Concurrency limitations
      - Cherrypy/wsgi are threaded servers
          - Request comes in, gets picked up by a thread
          - That thread will handle the job and then wait on the response
      - The wait on the return event can take a significant amount of time, all the while the thread is blocked waiting on the response. We can do better!
  • 61. Building it Faster: Saltnado!
      - Tornado implementation of Salt-api
      - What is tornado?
          - Network server library
          - IOLoop
          - Coroutines and Futures (probably do a quick explanation of what that is)
  • 62. Building it Faster: Saltnado!
      - Tornado hello world:
          from tornado.web import RequestHandler

          class HelloWorldHandler(RequestHandler):
              def get(self):
                  self.write("hello world")
  • 63. Building it Faster: Saltnado!
      - Tornado with callbacks:
          import tornado.web
          from tornado.httpclient import AsyncHTTPClient
          from tornado.web import RequestHandler

          class AsyncHandler(RequestHandler):
              # Callback style; this decorator and callback= need Tornado < 6.0
              @tornado.web.asynchronous
              def get(self):
                  http_client = AsyncHTTPClient()
                  http_client.fetch("http://example.com", callback=self.on_fetch)

              def on_fetch(self, response):
                  do_something_with_response(response)
                  self.render("template.html")
  • 64. Building it Faster: Saltnado!
      - Tornado with coroutines:
          import tornado.gen
          from tornado.httpclient import AsyncHTTPClient
          from tornado.web import RequestHandler

          class GenAsyncHandler(RequestHandler):
              @tornado.gen.coroutine
              def get(self):
                  http_client = AsyncHTTPClient()
                  response = yield http_client.fetch("http://example.com")
                  do_something_with_response(response)
                  self.render("template.html")
  • 65. Building it Faster: Saltnado!
      - What does this all mean for salt?
          - Event driven API
          - No concurrency limitations: long running jobs are now just as expensive as short running jobs to the API
          - Test coverage
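Why an event-driven API makes long jobs cheap can be shown without Tornado. The sketch below deliberately swaps in the standard library's `asyncio` so that it runs with no dependencies; `run_job`, `handle_requests` and `timed` are hypothetical names, and `asyncio.sleep` stands in for waiting on a job's return event.

```python
import asyncio
import time


async def run_job(name, duration):
    """Pretend to wait for a job return without blocking a thread."""
    await asyncio.sleep(duration)
    return name


async def handle_requests(jobs):
    """Wait on all jobs concurrently on a single event loop."""
    return await asyncio.gather(*(run_job(n, d) for n, d in jobs))


def timed(jobs):
    """Run the jobs and report the results plus wall-clock time."""
    start = time.monotonic()
    results = asyncio.run(handle_requests(jobs))
    return results, time.monotonic() - start
```

Twenty jobs that each wait 0.1 s finish in roughly 0.1 s total rather than 2 s, because a waiting request holds a suspended coroutine rather than a blocked thread.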
  • 66. Takeaways
      - Better Reliability
          - Write maintainable, debuggable, working code
          - Write tests and keep them up-to-date
      - Stronger Availability
          - Determine what availability means to your use
          - Proactive measuring, monitoring, and remediating
      - Faster Performance
          - Do less faster better
          - Use the language effectively
          - Prioritize your performance improvements
  • 67. Got more questions about Salt @ LinkedIn
      - Got questions?
          - Drop by our SaltConf booth! (do we have one?)
          - Connect with me on LinkedIn: www.linkedin.com/in/jacksontj
          - Jacksontj on #salt on freenode

Editor's Notes

  1. This talk will discuss best practices for scaling SaltStack from thousands to hundreds of thousands of minions. But the devil is in the details: how do you scale without losing performance while making sure it all works? At LinkedIn we've learned some valuable lessons as we've grown our SaltStack footprint. We'll discuss how to run SaltStack, how to not run SaltStack, and how we've contributed to the Salt project to help make it better, stronger and faster.
  2. 0.8.9: runners just added; outputters just added; cross calling salt modules using __salt__. 0.9.9: highstate test=True; external pillar; minion swarm.
  3. Bad kwargs in “bar”
  4. We can be sure of it, since (as part of the fix) I added regression tests 
  5. Fix for issue where master comm errors cause minions to delete all modules. Remove default 2h timeout of pillar fetches (stalls daemon).
  6. Doing work on import means it will happen a *lot*
  7. Normal module, doing nothing (except imports) in the module, then we got this error
  8. With all of these we need some way to sandbox modules from each other, and *more* importantly from breaking the core daemons
  9. More than just “is salt-master running”
  10. To be proactive, we have to know what’s broken (and what breaks most)
  11. Sometimes performance is about managing resource usage more than going faster
  12. Well, that’s not good…. But I guess we can deal with that…
  13. Woohoo! Now we have *one* sign-in per minion on start!
  14. Callbacks become a nightmare, as anyone with JavaScript experience can tell you