Fabric, Cuisine and Watchdog for server administration in Python


  1. Fabric, Cuisine & Watchdog. Sébastien Pierre, ffunction inc. @Montréal Python, February 2011. www.ffctn.com
  2. How to use Python for server administration, thanks to Fabric, Cuisine* and Watchdog* (*custom tools).
  3. The way we use servers has changed.
  4. The era of dedicated servers: hosted in your server room or in colocation. Diagram: web server, database server, email server.
  5. The era of dedicated servers (build of slide 4). Sysadmins typically SSH in and configure the servers live.
  6. The era of dedicated servers (build of slide 4). The servers are conservatively managed; updates are risky.
  7. The era of slices/VPS: Linode.com, Amazon EC2. Diagram: a group of numbered slices. We now have multiple small virtual servers (slices/VPS).
  8. The era of slices/VPS (build of slide 7). Often located in different data centers.
  9. The era of slices/VPS (build of slide 7). ...and sometimes with different providers.
  10. The era of slices/VPS (build of slide 7), adding IWeb.com with dedicated server 1 and dedicated server 2. We sometimes even still have physical, dedicated servers.
  11. The challenge: order server, set up server, deploy application.
  12. The challenge (build of slide 11). Make this process as fast (and simple) as possible.
  13. The challenge (build of slide 12). Callouts: create users and groups; customize config files; install base packages.
  14. The challenge (build of slide 12). Callouts: install app-specific packages; deploy the application; start services.
  15. The challenge.
  16. The challenge. Quickly integrate your new server into the existing architecture.
  17. The challenge. ...and make sure it's running!
  18. Today's menu.
      ● Fabric: interact with your remote machines as if they were local.
      ● Cuisine: takes care of users, groups, packages and configuration of your new machine.
      ● Watchdog: ensures that your servers and services are up and running.
  19. Today's menu (build of slide 18). Callout: "Made by" (Cuisine and Watchdog are the custom tools noted on slide 2).
  20. Part 1: Fabric - http://fabfile.org. Application deployment & systems administration tasks.
  21. Fabric is a Python library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
  22. Same definition as slide 21, with the callout: Wait... what does that mean?
  23. Streamlining SSH. By hand:
            version = os.popen("ssh myserver 'cat /proc/version'").read()
      Using Fabric:
            version = run("cat /proc/version")
  24. Streamlining SSH. By hand: same command as on slide 23. Using Fabric:
            from fabric.api import *
            env.hosts = ["myserver"]
            version = run("cat /proc/version")
  25. Streamlining SSH (build of slide 24). You can specify multiple hosts and run the same commands across them.
  26. Streamlining SSH (build of slide 24). Connections will be lazily created and pooled.
  27. Streamlining SSH (build of slide 24). Failures ($STATUS) will be handled just like in Make.
  28. Example: installing packages.
            sudo("aptitude install nginx")
            if run("dpkg -s %s | grep 'Status:' ; true" % package).find("installed") == -1:
                sudo("aptitude install '%s'" % (package))
  29. Example: installing packages (build of slide 28). It's easy to take action depending on the result.
  30. Example: installing packages (build of slide 28). Note that we add true so that the run() always succeeds* (*there are other ways...).
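The failure handling described on slides 27 to 30 can be tried locally without Fabric. The following is a minimal sketch, assuming a POSIX `sh` is available. `local_run` and `status_is_installed` are hypothetical helpers, not part of Fabric: the first mimics checking a command's exit status, the second parses a `dpkg -s` style `Status:` line. Comparing the last word of that line is a stricter variant of the slide's `find("installed")`, chosen here because dpkg states such as `half-installed` also contain that substring.

```python
import subprocess


def local_run(command):
    """Run a shell command locally; return (exit_status, stdout_text)."""
    proc = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
    return proc.returncode, proc.stdout


def status_is_installed(status_output):
    """Return True if the first 'Status:' line ends with the word 'installed'."""
    for line in status_output.splitlines():
        if line.startswith("Status:"):
            return line.split()[-1] == "installed"
    return False


# grep finds no 'Status:' line, so the pipeline reports a non-zero status...
failed_status, _ = local_run("echo 'no such package' | grep 'Status:'")
# ...while appending "; true" makes the command line as a whole succeed.
forced_status, _ = local_run("echo 'no such package' | grep 'Status:' ; true")
```

With `; true` appended, the caller decides what to do from the captured text instead of from the exit status, which is the pattern the slide relies on.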
  31. Example: retrieving system status.
            disk_usage = run("df -kP")
            mem_usage = run("cat /proc/meminfo")
            cpu_usage = run("cat /proc/stat")
            print disk_usage, mem_usage, cpu_usage
  32. Example: retrieving system status (build of slide 31). Very useful for getting live information from many different servers.
  33. Fabfile.py
            from fabric.api import *
            from mysetup import *
            env.hosts = ["server1.myapp.com"]
            def setup():
                install_packages("...")
                update_configuration()
                create_users()
                start_daemons()
      Run with:
            $ fab setup
  34. Fabfile.py (build of slide 33). Just like Make, you write rules that do something.
  35. Fabfile.py (build of slide 33). ...and you can specify on which servers the rules will run.
  36. Multiple hosts.
            env.hosts = [
                "db1.myapp.com",
                "db2.myapp.com",
                "db3.myapp.com"
            ]
            @hosts("db1.myapp")
            def backup_db():
                run(...)
  37. Roles.
            env.roledefs = {
                'web': ['www1', 'www2', 'www3'],
                'dns': ['ns1', 'ns2']
            }
      Run with:
            $ fab -R web setup
  38. Roles (build of slide 37). Will run the setup rule only on hosts that are members of the web role.
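To make the role selection on slides 37 and 38 concrete, here is a small plain-Python sketch of the idea. It is not Fabric's implementation: `hosts_for_roles` and `run_rule` are hypothetical names, and the "rule" is a stand-in function that only builds a string instead of opening an SSH connection.

```python
def hosts_for_roles(roledefs, roles):
    """Return the hosts belonging to the given roles, in order, without duplicates."""
    selected = []
    for role in roles:
        for host in roledefs[role]:
            if host not in selected:
                selected.append(host)
    return selected


def run_rule(rule, hosts):
    """Call the rule once per host and collect the results."""
    return [rule(host) for host in hosts]


roledefs = {
    "web": ["www1", "www2", "www3"],
    "dns": ["ns1", "ns2"],
}

# Roughly what "fab -R web setup" expresses: run setup on the web hosts only.
results = run_rule(lambda host: "setup on " + host, hosts_for_roles(roledefs, ["web"]))
```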
  39. What's good about Fabric?
      ● Low-level: basically an ssh() command that returns the result.
      ● Simple primitives: run(), sudo(), get(), put(), local(), prompt(), reboot().
      ● No magic: no DSL, no abstraction, just a remote command API.
  40. What could be improved?
      ● Ease common admin tasks: user and group creation; file and directory operations.
      ● Abstract primitives: like install package, so that it works with different OSes.
      ● Templates: to make creating/updating configuration files easy.
  41. Cuisine: Chef-like functionality for Fabric.
  42. Part 2: Cuisine.
  43. What is Opscode's Chef? http://wiki.opscode.com/display/chef/Home
      ● Recipes: scripts/packages to install and configure services and applications.
      ● API: a DSL-like Ruby API to interact with the OS (create users, groups, install packages, etc.).
      ● Architecture: client-server or "solo" mode to push and deploy your new configurations.
  44. What I liked about Chef.
      ● Flexible: you can use the API or shell commands.
      ● Structured: helped me have a clear decomposition of the services installed per machine.
      ● Community: lots of recipes already available from http://cookbooks.opscode.com/
  45. What I didn't like.
      ● Too many files and directories: code is spread out, hard to get the big picture.
      ● Abstraction overload: API not very well documented, frequent fallbacks to plain shell scripts within the recipe.
      ● No "smart" recipe: recipes are applied all the time, even when it's not necessary.
  46. The question that kept coming... Django recipe: 5 files, 2 directories. What it does, in essence:
            sudo aptitude install apache2 python django-python
  47. The question that kept coming... (build of slide 46). Is this really necessary for what I want to do?
  48. What I loved about Fabric.
      ● Bare metal: ssh() function, simple and elegant set of primitives.
      ● No magic: no abstraction, no model, no compilation.
      ● Two-way communication: easy to change the rule's behaviour according to the output (e.g. do not install something that's already installed).
  49. What I needed. Diagram: Fabric as the base layer.
  50. What I needed (build of slide 49): adds File I/O.
  51. What I needed (build of slide 49): adds User/Group Management.
  52. What I needed (build of slide 49): adds Package Management.
  53. What I needed (build of slide 49): adds Text processing & Templates.
  54. How I wanted it.
      ● Simple "flat" API: [object]_[operation], where operation is one of "create", "read", "update", "write", "remove", "ensure", etc.
      ● Driven by need: only implement a feature if I have a real need for it.
      ● No magic: everything is implemented using sh-compatible commands.
      ● No unnecessary structure: everything fits in one file, no imposed file layout.
  55. Cuisine: example fabfile.py
            from cuisine import *
            env.hosts = ["server1.myapp.com"]
            def setup():
                package_ensure("python", "apache2", "python-django")
                user_ensure("admin", uid=2000)
                upstart_ensure("django")
      Run with:
            $ fab setup
  56. Cuisine: example fabfile.py (build of slide 55). Fabric's core functions are already imported.
  57. Cuisine: example fabfile.py (build of slide 55). Callout: Cuisine's API calls.
  58. File I/O.
  59. Cuisine: File I/O.
      ● file_exists: does the remote file exist?
      ● file_read: reads a remote file
      ● file_write: writes data to a remote file
      ● file_append: appends data to a remote file
      ● file_attribs: chmod & chown
      ● file_remove
  60. Cuisine: File I/O (build of slide 59). Supports owner/group and mode change.
  61. Cuisine: File I/O (directories).
      ● dir_exists: does the remote directory exist?
      ● dir_ensure: ensures that a directory exists
      ● dir_attribs: chmod & chown
      ● dir_remove
  62. Cuisine: File I/O +
      ● file_update(location, updater=lambda _:_)
            package_ensure("mongodb-snapshot")
            def update_configuration( text ):
                res = []
                for line in text.split("\n"):
                    if line.strip().startswith("dbpath="):
                        res.append("dbpath=/data/mongodb")
                    elif line.strip().startswith("logpath="):
                        res.append("logpath=/data/logs/mongodb.log")
                    else:
                        res.append(line)
                return "\n".join(res)
            file_update("/etc/mongodb.conf", update_configuration)
  63. Cuisine: File I/O + (build of slide 62). This replaces the values for configuration entries dbpath and logpath.
  64. Cuisine: File I/O + (build of slide 62). The remote file will only be changed if the content is different.
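The updater pattern from slides 62 to 64 can be exercised on a local file. This is a minimal sketch under stated assumptions: `update_configuration` follows the slide, while `local_file_update` is a hypothetical local stand-in for Cuisine's remote `file_update`, written only to illustrate the "write only if the content differs" behaviour. The sample configuration values are made up.

```python
import os
import tempfile


def update_configuration(text):
    """Rewrite the dbpath= and logpath= entries, leaving other lines untouched."""
    res = []
    for line in text.split("\n"):
        if line.strip().startswith("dbpath="):
            res.append("dbpath=/data/mongodb")
        elif line.strip().startswith("logpath="):
            res.append("logpath=/data/logs/mongodb.log")
        else:
            res.append(line)
    return "\n".join(res)


def local_file_update(path, updater):
    """Hypothetical local stand-in for file_update: write only when content changes."""
    with open(path) as f:
        old = f.read()
    new = updater(old)
    if new == old:
        return False
    with open(path, "w") as f:
        f.write(new)
    return True


with tempfile.TemporaryDirectory() as tmp:
    conf = os.path.join(tmp, "mongodb.conf")
    with open(conf, "w") as f:
        f.write("dbpath=/var/lib/mongodb\nlogpath=/var/log/mongodb.log\nport=27017\n")
    first_run = local_file_update(conf, update_configuration)   # content changes
    second_run = local_file_update(conf, update_configuration)  # nothing left to do
    with open(conf) as f:
        updated = f.read()
```

Running the update twice shows why the rule is safe to re-apply: the second call finds identical content and leaves the file alone.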
  65. User Management.
  66. Cuisine: User Management.
      ● user_exists: does the user exist?
      ● user_create: creates the user
      ● user_ensure: creates the user if it doesn't exist
  67. Cuisine: Group Management.
      ● group_exists: does the group exist?
      ● group_create: creates the group
      ● group_ensure: creates the group if it doesn't exist
      ● group_user_exists: does the user belong to the group?
      ● group_user_add: adds the user to the group
      ● group_user_ensure
  68. Package Management.
  69. Cuisine: Package Management.
      ● package_exists: is the package available?
      ● package_installed: is it installed?
      ● package_install: installs the package
      ● package_ensure: ...only if it's not installed
      ● package_upgrade: upgrades the/all package(s)
  70. Text & Templates.
  71. Cuisine: Text transformation. text_ensure_line(text, lines)
            file_update(
                "/home/user/.profile",
                lambda _:text_ensure_line(_,
                    "PYTHONPATH=/opt/lib/python:${PYTHONPATH};"
                    "export PYTHONPATH"
            ))
  72. Cuisine: Text transformation (build of slide 71). Ensures that the PYTHONPATH variable is set and exported. If not, these lines will be appended.
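The behaviour described on slide 72 can be sketched in a few lines of plain Python. `ensure_lines` is a hypothetical approximation written for this illustration, not Cuisine's own code: it appends a line only when an identical line is not already present, so applying it repeatedly changes nothing. The starting profile content is made up.

```python
def ensure_lines(text, *lines):
    """Append each given line to text unless an identical line is already present."""
    result = text.split("\n")
    for line in lines:
        if line not in result:
            result.append(line)
    return "\n".join(result)


profile = "export EDITOR=vi"
wanted = "PYTHONPATH=/opt/lib/python:${PYTHONPATH};export PYTHONPATH"

once = ensure_lines(profile, wanted)   # the line is missing, so it is appended
twice = ensure_lines(once, wanted)     # the line is present, so nothing changes
```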
  73. Cuisine: Text transformation. text_replace_line(text, old, new, find=.., process=...)
            configuration = local_read("server.conf")
            for key, value in variables.items():
                configuration, replaced = text_replace_line(
                    configuration,
                    key + "=",
                    key + "=" + repr(value),
                    process=lambda text:text.split("=")[0].strip()
                )
  74. Cuisine: Text transformation (build of slide 73). Replaces lines that look like VARIABLE=VALUE with the actual values from the variables dictionary.
  75. Cuisine: Text transformation (build of slide 73). The process lambda transforms input lines before comparing them. Here the lines are stripped of spaces and of their value.
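To see what the process lambda on slide 75 does, here is a self-contained sketch. `replace_line` is a hypothetical approximation of the slide's text_replace_line (it omits the find parameter and is not Cuisine's code), and the configuration text and variables are invented for the example.

```python
def replace_line(text, old, new, process=lambda line: line):
    """Replace every line whose processed form equals the processed `old`.

    Returns the new text and the number of replaced lines.
    """
    result = []
    replaced = 0
    for line in text.split("\n"):
        if process(line) == process(old):
            result.append(new)
            replaced += 1
        else:
            result.append(line)
    return "\n".join(result), replaced


def key_of(line):
    """Keep only the part before '=', without surrounding spaces."""
    return line.split("=")[0].strip()


configuration = "port = 80\nhost=localhost\n# comment"
variables = {"port": 8080, "host": "example.org"}
total = 0
for key, value in variables.items():
    configuration, replaced = replace_line(
        configuration, key + "=", key + "=" + repr(value), process=key_of
    )
    total += replaced
```

Because both sides of the comparison are reduced to the bare key, `port = 80` matches `port=` despite the extra spaces and the old value. Note that `repr()` keeps the quotes around string values.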
  76. Cuisine: Text transformation. text_strip_margin(text)
            file_write(".profile", text_strip_margin(
                """
                |export PATH="$HOME/bin":$PATH
                |set -o vi
                """
            ))
  77. Cuisine: Text transformation (build of slide 76). Everything after the | separator will be output as content. This makes it easy to embed text templates within functions.
  78. Cuisine: Text transformation. text_template(text, variables)
            text_template(text_strip_margin(
                """
                |cd ${DAEMON_PATH}
                |exec ${DAEMON_EXEC_PATH}
                """
            ), dict(
                DAEMON_PATH="/opt/mongodb",
                DAEMON_EXEC_PATH="/opt/mongodb/mongod"
            ))
  79. Cuisine: Text transformation (build of slide 78). This is a simple wrapper around Python's (safe) string.Template substitution.
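The two helpers from slides 76 to 79 combine naturally, and the idea can be reproduced with the standard library alone. In this sketch `strip_margin` and `render_template` are hypothetical re-creations, not Cuisine's code; the only real API used is `string.Template.safe_substitute`. The `${DAEMON_USER}` placeholder is added here purely to show what "safe" means.

```python
import string


def strip_margin(text, margin="|"):
    """Keep only lines containing the margin marker, dropping everything up to it."""
    kept = []
    for line in text.split("\n"):
        parts = line.split(margin, 1)
        if len(parts) == 2:
            kept.append(parts[1])
    return "\n".join(kept)


def render_template(text, variables):
    """Substitute ${NAME} placeholders; unknown names are left untouched."""
    return string.Template(text).safe_substitute(variables)


script = render_template(
    strip_margin(
        """
        |cd ${DAEMON_PATH}
        |exec ${DAEMON_EXEC_PATH} --user ${DAEMON_USER}
        """
    ),
    dict(DAEMON_PATH="/opt/mongodb", DAEMON_EXEC_PATH="/opt/mongodb/mongod"),
)
```

Safe substitution leaves `${DAEMON_USER}` in place instead of raising an error, which is convenient when a template is filled in over several passes.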
  80. Cuisine: Goodies.
      ● ssh_keygen: generates DSA keys
      ● ssh_authorize: authorizes your key on the remote server
      ● mode_sudo: run() always uses sudo
      ● upstart_ensure: ensures the given daemon is running
      & more!
  81. Why use Cuisine?
      ● Simple API for remote-server manipulation: files, users, groups, packages.
      ● Shell commands for specific tasks only: avoid problems with your shell commands by only using run() for very specific tasks.
      ● Cuisine tasks are not stupid: *_ensure() commands won't do anything if it's not necessary.
  82. Limitations.
      ● Limited to sh-shells: operations will not work under csh.
      ● Only written/tested for Ubuntu Linux: contributors could easily port commands.
  83. Get started! On GitHub: http://github.com/sebastien/cuisine. 1 short Python file. Documented API.
  84. Part 3: Watchdog. Server and services monitoring.
  85. The problem.
  86. The problem. Low disk space.
  87. The problem. Archive files, rotate logs, purge cache.
  88. The problem. HTTP server has high latency.
  89. The problem. Restart HTTP server.
  90. The problem. System load is too high.
  91. The problem. Re-nice important processes.
  92. We want to be notified when incidents happen.
  93. We want automatic actions to be taken whenever possible.
  94. (Some of the) existing solutions.
      ● Monit, God, Supervisord, Upstart: focus on starting/restarting daemons and services.
      ● Munin, Cacti: focus on visualization of RRDTool data.
      ● Collectd: focus on collecting and publishing data.
  95. The ideal tool.
      ● Wide spectrum: data collection, service monitoring, actions.
      ● Easy setup and deployment: no complex installation or configuration.
      ● Flexible server architecture: can monitor local or remote processes.
      ● Customizable and extensible: from restarting daemons to monitoring whole servers.
  96. Hello, Watchdog! Diagram: SERVICE.
  97. Hello, Watchdog! Diagram: SERVICE, RULE.
  98. Hello, Watchdog! A service is a collection of RULES.
  99. Hello, Watchdog! Diagram: SERVICE; RULE (HTTP request; CPU, disk, mem %; process status; I/O bandwidth).
  100. Hello, Watchdog! (build of slide 99). Each rule retrieves data and processes it. Rules can SUCCEED or FAIL.
  101. Hello, Watchdog! (build of slide 99). The diagram adds ACTION.
  102. Hello, Watchdog! (build of slide 99). ACTION (logging; XMPP and email notifications; start/stop process; ...).
  103. Hello, Watchdog! (build of slide 102). Actions are bound to rules, triggered on rule SUCCESS or FAILURE.
  104. Execution Model. Diagram: MONITOR.
  105. Execution Model. Diagram: MONITOR, SERVICE DEFINITION, RULE (frequency in ms).
  106. Execution Model (build of slide 105). Services are registered in the monitor.
  107. Execution Model (build of slide 105). Rules defined in the service are executed every N ms (frequency).
  108. Execution Model (build of slide 105). The diagram adds SUCCESS and FAILURE branches leading to ACTIONs.
  109. Execution Model (build of slide 108). If the rule SUCCEEDS, actions will be sequentially executed.
  110. Execution Model (build of slide 108). If the rule FAILS, failure actions will be sequentially executed.
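The execution model on slides 104 to 110 boils down to rules with success and failure actions, grouped into services. The sketch below illustrates that structure in plain Python. The `Rule` and `Service` classes here are hypothetical and far simpler than Watchdog's own: they run each rule once and ignore frequencies and threading.

```python
class Rule:
    """A check that either succeeds or fails, with actions bound to each outcome."""

    def __init__(self, check, success=(), fail=()):
        self.check = check
        self.success = list(success)
        self.fail = list(fail)

    def run(self):
        """Evaluate the check, then run the matching actions one after another."""
        try:
            ok = bool(self.check())
        except Exception:
            # In this sketch, a check that raises counts as a failure.
            ok = False
        for action in (self.success if ok else self.fail):
            action()
        return ok


class Service:
    """A named collection of rules."""

    def __init__(self, name, rules):
        self.name = name
        self.rules = list(rules)

    def run_once(self):
        """Run every rule once; a real monitor would repeat this at each rule's frequency."""
        return [rule.run() for rule in self.rules]


log = []
service = Service("demo", [
    Rule(lambda: True, success=[lambda: log.append("ok-1"), lambda: log.append("ok-2")]),
    Rule(lambda: 1 / 0, fail=[lambda: log.append("failed")]),
])
outcomes = service.run_once()
```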
  111. Monitoring a remote machine.
            #!/usr/bin/env python
            from watchdog import *
            Monitor(
                Service(
                    name = "google-search-latency",
                    monitor = (
                        HTTP(
                            GET="http://www.google.ca/search?q=watchdog",
                            freq=Time.s(1),
                            timeout=Time.ms(80),
                            fail=[
                                Print("Google search query took more than 50ms")
                            ]
                        )
                    )
                )
            ).run()
  112. Monitoring a remote machine (same code as slide 111). A monitor is like the "main" for Watchdog. It actively monitors services.
  113. Monitoring a remote machine (same code as slide 111). Don't forget to call run() on it.
  114. Monitoring a remote machine (same code as slide 111). The service monitors the rules.
  115. Monitoring a remote machine (same code as slide 111). The HTTP rule allows testing a URL, and we display a message in case of failure.
  116. Monitoring a remote machine (same code as slide 111). If there is a 4XX or it times out, the rule will fail and display an error message.
  117. Monitoring a remote machine. Output:
            $ python example-service-monitoring.py
            2011-02-27T22:33:18 watchdog --- #0 (runners=1,threads=2,duration=0.57s)
            2011-02-27T22:33:18 watchdog [!] Failure on HTTP(GET="www.google.ca:80/search?q=watchdog",timeout=0.08) : Socket error: timed out
            Google search query took more than 50ms
            2011-02-27T22:33:19 watchdog --- #1 (runners=1,threads=2,duration=0.73s)
            2011-02-27T22:33:20 watchdog --- #2 (runners=1,threads=2,duration=0.54s)
            2011-02-27T22:33:21 watchdog --- #3 (runners=1,threads=2,duration=0.69s)
            2011-02-27T22:33:22 watchdog --- #4 (runners=1,threads=2,duration=0.77s)
            2011-02-27T22:33:23 watchdog --- #5 (runners=1,threads=2,duration=0.70s)
  118. Sending Email Notification.
            send_email = Email(
                "notifications@ffctn.com",
                "[Watchdog]Google Search Latency Error", "Latency was over 80ms",
                "smtp.gmail.com", "myusername", "mypassword"
            )
            […]
            HTTP(
                GET="http://www.google.ca/search?q=watchdog",
                freq=Time.s(1),
                timeout=Time.ms(80),
                fail=[
                    send_email
                ]
            )
  119. Sending Email Notification (same code as slide 118). The Email rule will send an email to notifications@ffctn.com when triggered.
  120. Sending Email Notification (same code as slide 118). This is how we bind the action to the rule failure.
  121. Sending Email+Jabber Notification.
            send_xmpp = XMPP(
                "notifications@jabber.org",
                "Watchdog: Google search latency over 80ms",
                "myuser@jabber.org", "myspassword"
            )
            […]
            HTTP(
                GET="http://www.google.ca/search?q=watchdog",
                freq=Time.s(1),
                timeout=Time.ms(80),
                fail=[
                    send_email, send_xmpp
                ]
            )
  122. Monitoring incident: when something fails repeatedly during a given period of time.
  123. Monitoring incident (build of slide 122). You don't want to be notified all the time, only when it really matters.
  124. Detecting incidents.
            HTTP(
                GET="http://www.google.ca/search?q=watchdog",
                freq=Time.s(1),
                timeout=Time.ms(80),
                fail=[
                    Incident(
                        errors  = 5,
                        during  = Time.s(10),
                        actions = [send_email,send_xmpp]
                    )
                ]
            )
  125. Detecting incidents (same code as slide 124). An incident is a "smart" action: it will only do something when the condition is met.
  126. Detecting incidents (same code as slide 124). When at least 5 errors...
  127. Detecting incidents (same code as slide 124). ...happen over a 10-second period...
  128. Detecting incidents (same code as slide 124). The Incident action will trigger the given actions.
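The condition described on slides 125 to 128 (at least 5 errors within 10 seconds) is a sliding-window count. Below is a minimal sketch of that technique. `IncidentDetector` is a hypothetical class, not Watchdog's Incident. Timestamps are passed in explicitly so the example is deterministic, and clearing the window after firing is an assumption made here to avoid repeated alerts.

```python
from collections import deque


class IncidentDetector:
    """Trigger actions once `errors` failures occur within `during` seconds."""

    def __init__(self, errors, during, actions):
        self.errors = errors
        self.during = during
        self.actions = list(actions)
        self.timestamps = deque()

    def record_failure(self, now):
        """Record a failure at time `now` (seconds); return True if actions fired."""
        self.timestamps.append(now)
        # Forget failures that fall outside the time window.
        while self.timestamps and now - self.timestamps[0] > self.during:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.errors:
            for action in self.actions:
                action()
            # Assumption: start counting afresh once the incident was reported.
            self.timestamps.clear()
            return True
        return False


notifications = []
incident = IncidentDetector(errors=5, during=10, actions=[lambda: notifications.append("alert")])
# Four spread-out failures stay quiet; a burst of five within 10 seconds fires once.
quiet = [incident.record_failure(t) for t in (0, 20, 40, 60)]
burst = [incident.record_failure(t) for t in (100, 101, 102, 103, 104)]
```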
  129. 129. Example: Ensuring a service is running from watchdog import * Monitor( Service( name="myservice-ensure-up", monitor=( HTTP( GET="http://localhost:8000/", freq=Time.ms(500), fail=[ Incident( errors=5, during=Time.s(5), actions=[ Restart("myservice-start.py") ])] )))).run() ffunction inc.
  130. 130. Example: Ensuring a service is running from watchdog import * We test if we can We test if we can Monitor( GET http://localhost:8000 GET http://localhost:8000 Service( within 500ms within 500ms name="myservice-ensure-up", monitor=( HTTP( GET="http://localhost:8000/", freq=Time.ms(500), fail=[ Incident( errors=5, during=Time.s(5), actions=[ Restart("myservice-start.py") ])] )))).run() ffunction inc.
  131. 131. Example: Ensuring a service is running from watchdog import * Monitor( Service( name="myservice-ensure-up", monitor=( HTTP( If we can't reach it during If we can't reach it during GET="http://localhost:8000/",seconds 5 5 seconds freq=Time.ms(500), fail=[ Incident( errors=5, during=Time.s(5), actions=[ Restart("myservice-start.py") ])] )))).run() ffunction inc.
  132. 132. Example: Ensuring a service is running from watchdog import * Monitor( Service( name="myservice-ensure-up", monitor=( HTTP( GET="http://localhost:8000/", freq=Time.ms(500), fail=[ We kill and restart We kill and restart Incident( myservice-start.py myservice-start.py errors=5, during=Time.s(5), actions=[ Restart("myservice-start.py") ])] )))).run() ffunction inc.
  133. 133. Example: Monitoring system health from watchdog import * Monitor ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda v:v["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="watchdog-system-failures.log")] ), ) ) ).run() ffunction inc.
  134. 134. Monitoring system health from watchdog import * Monitor ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda v:v["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="watchdog-system-failures.log")] ), ) ) ).run() ffunction inc.
  135. 135. Monitoring system health (callout: SystemInfo will retrieve system information and return it as a dictionary) from watchdog import * Monitor ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda v:v["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="watchdog-system-failures.log")] ), ) ) ).run() ffunction inc.
  136. 136. Monitoring system health (callout: We log each result by extracting the given value from the result dictionary (memoryUsage, diskUsage, cpuUsage)) from watchdog import * Monitor ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda v:v["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="watchdog-system-failures.log")] ), ) ) ).run() ffunction inc.
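What the `extract` callbacks do can be shown on a plain dictionary. The sketch below is an illustration only: the sample values are invented, `format_result` is a hypothetical stand-in for whatever LogResult does internally, and the extractors take a single argument where the slides' lambdas take two. The keys `memoryUsage`, `diskUsage` and `cpuUsage` are the ones named on the slide. Note that the slide's `reduce(max, ...)` relies on the Python 2 builtin; in Python 3 `reduce` lives in `functools`, and `max(...)` alone gives the same result.

```python
def format_result(prefix, result, extract):
    """Hypothetical stand-in for LogResult: prefix followed by the extracted value."""
    return "%s%s" % (prefix, extract(result))


# Invented sample of the kind of dictionary a system probe might return.
system_info = {
    "memoryUsage": 0.42,
    "diskUsage": {"/": 0.61, "/home": 0.35},
    "cpuUsage": 0.07,
}

lines = [
    format_result("myserver.system.mem=", system_info, lambda r: r["memoryUsage"]),
    # The fullest disk is the interesting one, hence max over all mount points.
    format_result("myserver.system.disk=", system_info, lambda r: max(r["diskUsage"].values())),
    format_result("myserver.system.cpu=", system_info, lambda r: r["cpuUsage"]),
]
```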
  137. 137. Monitoring system health (callout: Bandwidth collects network interface live traffic information) from watchdog import * Monitor ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda v:v["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="watchdog-system-failures.log")] ), ) ) ).run() ffunction inc.
  138. 138. Monitoring system health (callout: But we don't want the total amount, we just want the difference. Delta does just that.) from watchdog import * Monitor ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda _:_["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="watchdog-system-failures.log")] ), ) ) ).run() ffunction inc.
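The idea behind Delta, turning an ever-growing counter into a per-interval difference, can be sketched as follows. This is an assumption about the behaviour the slide describes, not Watchdog's code: the `DeltaTracker` class and its `update` method are hypothetical, and the byte counts are invented. Only the `["total"]["bytes"]` lookup and the division into megabytes come from the slide.

```python
class DeltaTracker:
    """Hypothetical helper: report the change between successive extracted values."""

    def __init__(self, extract):
        self.extract = extract
        self.previous = None

    def update(self, sample):
        """Return the difference from the previous sample, or None for the first one."""
        value = self.extract(sample)
        previous, self.previous = self.previous, value
        return None if previous is None else value - previous


# Same extraction as on the slide: total bytes converted to megabytes.
tracker = DeltaTracker(extract=lambda v: v["total"]["bytes"] / 1000.0 / 1000.0)

# Invented cumulative counters, as if read once per second.
samples = [
    {"total": {"bytes": 1000000}},
    {"total": {"bytes": 3500000}},
    {"total": {"bytes": 4000000}},
]
deltas = [tracker.update(sample) for sample in samples]
```

The first reading has nothing to compare against, which is why the sketch returns `None` for it instead of reporting the full counter as traffic.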
  139. 139. Monitoring system health (callout: We print the result as before) from watchdog import * Monitor ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda _:_["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent=")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="watchdog-system-failures.log")] ), ) ) ).run() ffunction inc.
  140. 140. Monitoring system health (callout: SystemHealth will fail whenever the usage is above the given thresholds) from watchdog import * Monitor ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda _:_["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent=")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="watchdog-system-failures.log")] ), ) ) ).run() ffunction inc.
  141. 141. Monitoring system health (callout: We'll log failures in a log file) from watchdog import * Monitor ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda _:_["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent=")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="watchdog-system-failures.log")] ), ) ) ).run() ffunction inc.
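The threshold check and the failure log from the last two slides can be approximated in a few lines. This is a sketch under stated assumptions, not Watchdog's code: `exceeded` and `log_failures` are hypothetical names, the usage figures are invented, and the log line format is made up. The 0.90 thresholds and the log file name are taken from the slide; the file is written to a temporary directory so the example leaves nothing behind in the working directory.

```python
import os
import tempfile


def exceeded(usage, cpu=0.90, disk=0.90, mem=0.90):
    """Return the sorted names of the metrics whose usage is above its threshold."""
    limits = {"cpu": cpu, "disk": disk, "mem": mem}
    return sorted(name for name, limit in limits.items() if usage.get(name, 0.0) > limit)


def log_failures(path, failures):
    """Append one line per failing metric to the log file at `path`."""
    with open(path, "a") as handle:
        for name in failures:
            handle.write("FAIL %s above threshold\n" % name)


# Invented usage ratios between 0.0 and 1.0.
usage = {"cpu": 0.95, "disk": 0.40, "mem": 0.91}
failures = exceeded(usage)

log_path = os.path.join(tempfile.mkdtemp(), "watchdog-system-failures.log")
log_failures(log_path, failures)
```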
  142. 142. Watchdog: Overview Monitoring DSL Declarative programming to define monitoring strategy Wide spectrum From data collection to incident detection Flexible Does not impose a specific architecture ffunction inc.
  143. 143. Watchdog: Use cases Ensure service availability Test and stop/restart when problems occur Collect system statistics Log or send data through the network Alert on system or service health Take actions when system stats are above a threshold ffunction inc.
  144. 144. Get started ! On Github: http://github.com/sebastien/watchdog 1 Python file Documented API ffunction inc.
  145. 145. Thank you! www.ffctn.com sebastien@ffctn.com github.com/sebastien ffunction inc.
