Python for System Administrators
1.
Roberto Polli - roberto.polli@par-tec.it
Par-Tec Spa - Rome Operation Unit
P.zza S. Benedetto da Norcia, 33
00040, Pomezia (RM) - www.par-tec.it
March 13, 2016
2.
Agenda
Intro
ipython
Path management: 10′
Encoding: 10′
Data Gathering: 20′
  module: psutil
  module: subprocess
  The /proc filesystem
Parsing: 60′
  Regular Expressions
Nosetest Intermezzo: 15′
Processing: 45′
  Distributions
  Deviation
  Correlation
  Plotting Time
End
3.
Who? What? Why?
• Use Python to replace Grep, Awk, Sed and Perl. Speed up your daily job.
• Roberto Polli - Solutions Architect @ par-tec.it. Loves writing in C, Java
and Python. Red Hat Certified Engineer and Virtualization Administrator.
• Par-Tec - Proud sponsor of this talk ;) Contributes to various FLOSS projects and
provides expertise in IT Infrastructure & Services and Business Intelligence
solutions + Vertical Applications for the financial market.
4.
Requirements
• python 2.7+, ipython
• course code from github
# git clone https://github.com/ioggstream/python-course
• test your environment (eg. psutil, numpy, scipy, matplotlib)
# nosetests -vs test_prerequisites.py
• first part: nose, psutil
• second part: scipy, numpy, matplotlib
• ♦ optional/advanced content ♦
5.
How
• Get ready before starting: code is here on github!
• Use notebooks or type everything but #comments and try/except
• Type fast with tab-completion and copy-paste
• Be curious: inspect and print returned variables
• Never* close your iPython session: you'll lose your precious variables
* (ok, sometimes you can).
6.
References
• irc.freenode.net #python - The Python Community :D
• Python Cookbook 3rd ed. O'Reilly - David Beazley and Brian K. Jones
• Programming Python 4th ed. O'Reilly - Mark Lutz
• Dive into Python3 2nd ed. Apress - Mark Pilgrim
• nose.readthedocs.org
• github.com/ioggstream/python-course
7.
iPython I
• Interactive interpreter with tons of features, and the main tool of
our training.
• The most fun way to learn and use python!
• Supports tab-completion, readline, inline help
• Allows pasting from clipboard with %paste, and multi-line editing with
%edit
• Run it with plotting support enabled:
# ipython --pylab
8.
iPython II
# iPython supports inline help: append ? to an object
str?
# We can run commands and capture the output in a variable;
# there is no need to quote when using the ! magic on unix
ret = !cat /etc/hosts
# windows has etc\hosts too ;)
ret = !type c:\windows\system32\drivers\etc\hosts
9.
iPython III
# returned objects can be filtered with
ret.grep('localhost')
# Now get the first space-split column of the output
ret.fields(0)
ret.grep('localhost').fields(0)
# And the last returned value is stored in _
localip = _
# We can type long commands in an editor like 'vi' using
%edit mytmp.py # type print(ret[0]), then exit (eg. :wq!)
> Editing... done. Executing edited code...
10.
Path management: Goal
• Normalize paths on different platforms
• Create, copy and remove folders
• Handle errors
modules: os, os.path, shutil, errno
see also: pathlib on Python 3.4+
11.
Path management: os.path, sys
basedir, hosts = "/", "etc/hosts"
# Check the hosting platform with the sys module
from sys import platform
if platform.startswith('win'):
    basedir = 'c:/windows/system32/drivers'
# Always use the os.path module!
from os.path import join, normpath
hosts = join(basedir, hosts)
hosts = normpath(hosts)
print("Normalized path is", hosts)
12.
Path management: os.path, sys
• os.path is the best way to manage paths!
• multiplatform
• safe
• join avoids adding a redundant '/'
• normpath fixes '/' orientation and redundant '..'
• realpath resolves symlinks
And now, a quick glance at other tools
13.
Move trees: shutil, os, os.path
from os import makedirs # ...tree creation...
from os.path import isdir # ...checking...
from shutil import copytree, rmtree
makedirs("/tmp/py/foo/bar")
# We can copy a whole tree and test it
copytree("/tmp/py/foo", "/tmp/py/foo2")
assert isdir("/tmp/py/foo2/bar")
rmtree("/tmp/py/foo") # ... and finally delete it
assert not isdir("/tmp/py/foo/bar")
14.
Move trees: errno
# We can use exception handlers to investigate errors
try:
    # python2 cannot ignore existing directories...
    makedirs("/tmp/py/foo/bar")
    # ...and raises an OSError
except OSError as e:
    # Just use the errno module to check the error value
    import errno
    assert e.errno == errno.EEXIST
help(makedirs)
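The errno check above can be wrapped in a small helper. The name ensure_dir is hypothetical (it is not part of the course code), and the sketch works in a temporary directory so that /tmp/py is left alone.

```python
import errno
import os
import tempfile

def ensure_dir(path):
    """Create path, ignoring only the 'already exists' error."""
    try:
        os.makedirs(path)
    except OSError as e:
        # Re-raise anything else (eg. EACCES), or a clash with a plain file
        if e.errno != errno.EEXIST or not os.path.isdir(path):
            raise

base = tempfile.mkdtemp()                  # scratch area for the example
target = os.path.join(base, "foo", "bar")
ensure_dir(target)
ensure_dir(target)                         # the second call is a no-op
print(os.path.isdir(target))
```

On Python 3.2+ the same effect is available with os.makedirs(path, exist_ok=True).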
15.
Encoding: Goal
• A string is more than a sequence of bytes
• A string is a couple (bytes, encoding)
• Use unicode literals in python2
• Manage differently encoded filenames
• A string is not a sequence of bytes
modules: os, os.path, glob
16.
Song of Childhood
Als das Kind Kind war, ging es mit hängenden Armen, wollte der Bach sei ein
Fluß, der Fluß sei ein Strom, und diese Pfütze das Meer.
Als das Kind Kind war, wußte es nicht, daß es Kind war, alles war ihm beseelt,
und alle Seelen waren eins.
Als das Kind Kind war, hatte es von nichts eine Meinung, hatte keine
Gewohnheit, saß oft im Schneidersitz, lief aus dem Stand, hatte einen Wirbel
im Haar und machte kein Gesicht beim Fotografieren.
Als das Kind Kind war, fielen ihm die Beeren wie nur Beeren in die Hand und
jetzt immer noch, machten ihm die frischen Walnüsse eine rauhe Zunge und jetzt
immer noch, hatte es auf jedem Berg die Sehnsucht nach dem immer höheren Berg,
und in jeder Stadt die Sehnsucht nach der noch größeren Stadt, und das ist
immer noch so, griff im Wipfel eines Baums nach den Kirschen in einem
Hochgefühl wie auch heute noch, eine Scheu vor jedem Fremden und hat sie immer
noch, wartete es auf den ersten Schnee, und wartet so immer noch.
"When the child was a child, characters were bytes, and strings lists of bytes"
17.
Encoding is a map
# Py3 doesn't need the u
the_string = u"S\u00fcd" # Süd
# can be encoded in different ways
in_utf8 = the_string.encode('utf-8')
in_win = the_string.encode('cp1252')
type(in_utf8) == bytes # byte-sequences
# Decoding bytes using the wrong map...
# ...gives sad results ;)
in_utf8.decode('cp1252') # SÃ¼d
• Encoding is a one-to-one map between a typographical character and a
byte-sequence
• Decoding is its reverse map
char  ascii  utf-8       cp1252
a     [97]   [97]        [97]
ü     -      [195, 188]  [252]
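The table above can be checked from the interpreter. This sketch only uses built-in string methods; the variable names are illustrative.

```python
text = u"S\u00fcd"                      # "Süd", written with an escape
as_utf8 = text.encode("utf-8")          # two bytes for the u-umlaut
as_cp1252 = text.encode("cp1252")       # one byte for the u-umlaut
print(list(bytearray(as_utf8)))         # [83, 195, 188, 100]
print(list(bytearray(as_cp1252)))       # [83, 252, 100]

# Decoding with the map used for encoding gives the text back...
assert as_utf8.decode("utf-8") == text
# ...while the wrong map either garbles the text or fails outright.
garbled = as_utf8.decode("cp1252")
try:
    as_cp1252.decode("utf-8")
except UnicodeDecodeError as e:
    print("cannot decode:", e.reason)
```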
18.
Enters Encoding
# Filenames are binary data! Be careful when reading from
# a (eg. vfat) filesystem!
# To make python2 encoding-aware we should
from __future__ import unicode_literals
# Create 3 windows-encoded filenames in
basedir = "/tmp/py"
# using the provided function
from course import create_wuerstelstrasse
create_wuerstelstrasse(basedir)
19.
Encoded filenames: glob
from glob import glob as ls # expands wildcards like a shell.
files = ls("/tmp/py/*.txt") # To avoid encoding issues ...
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xFC
0xFC == 252 # remember the ü in the cp1252 map?
files = ls(b"/tmp/py/*.txt") # ...we explicitly use bytes
20.
Data Gathering: Goal
Gathering system data with multiplatform and platform-dependent tools.
• Get info from files, /proc and /sys
• Capture command output
• Use psutil to get IO, CPU and memory data
• Parse files with a strategy
modules: psutil, subprocess, os
21.
Data Gathering: grep
def grep(needle, fpath):
    """A minimal grep implementation.
    goal: open() is iterable and doesn't
    need splitlines()
    goal: comprehension can filter iterables
    """
    return [x for x in open(fpath) if needle in x]
# Do we have "localhost" in our "/etc/hosts"?
grep("localhost", "/etc/hosts")
22.
Data Gathering: psutil
# The psutil module is very nice!
import psutil
# Works on Windows, Linux and MacOS
psutil.cpu_percent()
# And its output is easy to manage
psutil.disk_io_counters()
Exercise: Which other information does psutil provide?
23.
Data Gathering: Exercises
Write a vmstat-like function printing every second:
• cpu usage %;
• bytes read and written in the given interval.
• Hint: use psutil, time.sleep(1)
• Hint: try on ipython and then write the function using
%edit vmstat.py
24.
Data Gathering: subprocess
# The check_output function returns the command stdout
from subprocess import check_output
# It takes a list as an argument!
out = check_output("ping -w1 -c1 www.google.com".split())
# and returns a string (bytes on python3)
print(out)
25.
Data Gathering: security
# Be careful with the above code
out = check_output('ls "./may not work.doc"'.split())
# You can use
from shlex import split
out = check_output(split('ls "./will work.xlsx"'))
you = r"can 'even' tokenize \"respecting\" quoted\n chars"
from shlex import shlex
for token in shlex(you):
    print(token)
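The difference between the two split functions is easy to see on a command with a quoted filename. A minimal sketch; the command string is made up and nothing is executed.

```python
import shlex

cmd = 'ls -l "./my report.doc"'
naive = cmd.split()          # breaks the quoted filename in two
safe = shlex.split(cmd)      # honours shell quoting rules
print(naive)  # ['ls', '-l', '"./my', 'report.doc"']
print(safe)   # ['ls', '-l', './my report.doc']
```

Passing the list produced by shlex.split to check_output keeps the filename as a single argument, with no shell involved.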
26.
Data Gathering: subprocess, sys
def sh(cmd, shell=False, timeout=0):
    """Returns an iterable output of a command string, checking ..."""
    from sys import version_info as python_version
    from shlex import split
    if python_version < (3, 3): # ...before using...
        if timeout:
            raise ValueError("Timeout not supported")
        output = check_output(split(cmd), shell=shell)
    else:
        # timeout=0 means "no timeout", which check_output spells None
        output = check_output(split(cmd), shell=shell, timeout=timeout or None)
    return output.splitlines()
27.
Data Gathering: Exercises
Write a simple pgrep-like function for your OS:
• the ppgrep signature is the following
def ppgrep(program):
    """@param program - eg. firefox, explorer.exe"""
    raise NotImplementedError
• it prints a list of processes executing 'program';
• Hint: use subprocess, os, and list-comprehension
items = [x for x in a_list if 'firefox' in x]
28.
♦ Data Gathering: Parsing /proc I ♦
def linux_threads(pid):
    """The Linux /proc filesystem is a cool place to get info."""
    from glob import glob # replaces * and ?
    path = "/proc/{}/task/*/status".format(pid)
    # Pick a set of fields to gather...
    t_info = ('Pid', 'Tgid', 'voluntary') # a tuple
    for t_path in glob(path):
        # ...and use comprehension to get interesting data.
        print([x for x in open(t_path)
               if x.startswith(t_info)]) # accepts tuples!
29.
Data Gathering: Parsing /proc II
# On Linux, /proc/diskstats is the source of I/O info
disk_l = grep("sda", "/proc/diskstats")
# To gather that data we put the headers in a multi-line string
from course import diskstats_headers as headers
disk_info = disk_l[0].split() # Take the 1st entry, split the data
zip(headers, disk_info) # ...and tie them with the headers
list(_) # On py3 you need to iterate the generator!
30.
Data Gathering: Parsing /proc III
# Or create a reusable commodity class with
from collections import namedtuple
# using headers as attributes
# like the one provided by psutil
DiskStats = namedtuple('DiskStats', headers)
# ... and disk_info as values
dstat = DiskStats(*disk_info)
dstat.device, dstat.writes_ms
# Homework: check further features with
import collections
help(collections)
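The namedtuple approach can be tried without a Linux box. In this sketch both the header names and the sample line are made up for illustration: the real diskstats_headers in the course module may differ, and a real /proc/diskstats line has more columns.

```python
from collections import namedtuple

# Hypothetical header names for the first seven columns
headers = "major minor device reads reads_merged sectors_read reads_ms"
# Made-up sample line, trimmed to the same seven columns
sample = "   8       0 sda 4520 130 251342 3180"

# namedtuple accepts the field names as one space-separated string
DiskStats = namedtuple("DiskStats", headers)
dstat = DiskStats(*sample.split())
print(dstat.device, dstat.reads_ms)   # values are still strings
as_dict = dstat._asdict()             # ordered mapping header -> value
```

Since split() returns strings, numeric fields need an explicit int() before doing any arithmetic.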
31.
Parsing: Goal
⢠Plan a parsing strategy
⢠Use basic regular expressions: match, search, sub
⢠Benchmarking a parser
⢠Running nosetests
⢠Write a simple parser
modules: re, nose, %timeit
32.
Parsing is hard...
"System Administrators spent 24.3% of their work-life parsing files."*
* Independent analysis by The GASP Society ;)
(GASP: Grep Awk Sed Perl)
33.
...use a strategy!
1. Collect parsing samples
2. Play in ipython and collect %history
3. Write tests, then the parser
4. Eventually benchmark
34.
Parsing postfix logs
# Before writing the parser, collect samples of
# the interesting lines. For now just
from course import mail_sent, mail_delivered
# and %edit a simple
def test_sent():
    hour, host, to = parse_line(mail_sent)
    assert hour == '08:00:00'
    assert to == 'jon@doe.it'
35.
Parsing lines: split, zip
May 31 08:00:00 test-1 postfix/smtp[169]: 7CD8E730020: to=<joe@foo.it>, relay=mx2.foo.it[10.0.4.5]:25,
...
mail_sent.split() # Start using basic strings in ipython
# Then tie them with zip()
fields, counting = _, zip(range(20), _)
fields = fields[:7] # We just care for the first 7 values
# and pick fields singularly
hour, host, dest = fields[2], fields[3], fields[6]
36.
Parse: Exercise I
In another window
• edit 03_parsing_test.py
• complete the parse_line(line) function
def parse_line(line):
    """Write your function and test it
    with test_sent()"""
    raise NotImplementedError
%paste your solution's code in iPython and run the test functions manually
37.
Python Regexp
# Python supports regular expressions via
import re
# We start showing a grep-reloaded function
def grep(expr, fpath):
    one = re.compile(expr) # ...has two lookup methods...
    assert (one.match # which searches from ^ the beginning
            and one.search) # that searches anywhere
    with open(fpath) as fp:
        return [x for x in fp if one.search(x)]
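The two lookup methods behave differently on the same line. A minimal sketch on a log line shaped like the postfix sample used in this course; the patterns are illustrative.

```python
import re

line = "May 31 08:00:00 test-1 postfix/smtp[169]: 7CD8E730020: to=<joe@foo.it>"
hour = re.compile(r"\d{2}:\d{2}:\d{2}")
month = re.compile(r"[A-Z][a-z]{2}")

print(hour.match(line))            # None: the line starts with "May"
print(hour.search(line).group())   # 08:00:00
print(month.match(line).group())   # May
```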
38.
Splitting with re.split
import sys
from re import split # is a very nice function
# Let's gather some ping stats
if sys.platform.startswith('win'):
    cmd = "ping -n10 www.google.it"
else:
    cmd = "ping -c10 -w10 www.google.it"
# Split for both space and =
ping_output = [split("[ =]", x) for x in sh(cmd)]
39.
Splitting with re.findall
from re import findall # can be misused too ;)
# eg. for adding the ":" to a
mac = "0024e8b43320"
# ...using this
re_hex = '[0-9A-Fa-f]{2}'
mac_address = ':'.join(findall(re_hex, mac))
print("The mac address is ", mac_address)
This also does a bit of validation: only pairs of chars in the 0-F range are kept, while anything else is silently skipped
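Since findall() skips whatever does not match, a stricter check needs an anchored pattern. A minimal sketch; format_mac is a hypothetical helper, not part of the course code.

```python
import re

HEX_PAIR = r"[0-9A-Fa-f]{2}"

def format_mac(raw):
    """Return raw as aa:bb:cc:dd:ee:ff, or raise ValueError."""
    # Anchor the pattern so that stray characters are rejected
    # instead of being silently skipped as findall() alone would do.
    if not re.match(r"(?:%s){6}\Z" % HEX_PAIR, raw):
        raise ValueError("not a 12-digit hex string: %r" % raw)
    return ":".join(re.findall(HEX_PAIR, raw))

print(format_mac("0024e8b43320"))   # 00:24:e8:b4:33:20
```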
40.
Benchmarking in iPython I
• Parsing big files needs benchmarks. The iPython %timeit magic is a good
starting point.
test_regexps = ("..", "[a-fA-F0-9]{2}")
for re_s in test_regexps:
    %timeit ':'.join(findall(re_s, mac))
• We can even compare compiled and inline regexps
import re
for re_s in test_regexps:
    re_c = re.compile(re_s)
    %timeit ':'.join(re_c.findall(mac))
41.
Benchmarking in iPython II
Or find other methods:
• complex...
from re import sub as sed
%timeit sed(r'(..)', r'\1:', mac)
• ...or simple
%timeit ':'.join([mac[i:i+2] for i in range(0, 12, 2)])
• Outside iPython check the timeit module
42.
♦ Parsing: a real world Example ♦
# No need to type this VSAN configuration script
# which uses linux FC information from the /sys filesystem
fc_id_path = "/sys/class/fc_host/host*/port_name"
for x in glob(fc_id_path):
    # ...we boldly skip an explicit close()
    pwwn = open(x).read() # 0x500143802427e66c
    pwwn = pwwn[2:]
    # ...and even use the slower but readable
    pwwn = re.findall(r'..', pwwn)
    print("member pwwn ", ':'.join(pwwn))
43.
Parsing logs: a simple solution
def parse_line(line):
    import re
    # using _ we improve readability
    _, _, hour, host, _, _, dest = line.split()[:7]
    try:
        # and if dest isn't what we expect...
        dest = re.split(r'[<>]', dest)[1]
    except IndexError:
        # ...we set it to None
        dest = None
    return (hour, host, dest)
44.
Parsing logs: II
# Now another test for the delivered messages
# %edit 03_parsing_test
def test_delivered():
    hour, host, destination = parse_line(mail_delivered)
    assert hour == '08:00:00'
    # Delivery logs should have destination == None
    assert destination is None
# Exercise: fix parse_line to work with both tests
# and save the test file
45.
Running nosetest
• Now run the following command from a shell
# nosetests -vs 03_parsing_test.py
03_parsing_test.test_sent ... ok
03_parsing_test.test_delivered ... ok
Ran 2 tests in 0.001s
• Nose is a test framework.
• Nose runs every file matching test_*
• Nose runs every function matching test_*
46.
Simple Test Script
• Open the 02_nosetests_simple.py file
def setup():
    print("is run before the testsuite, while")
def teardown():
    print("after all tests")
def test_one():
    # name a function like test_* to run it!
    assert 1 == 1
def test_two():
    # and use assert to test for success
    assert 1 == 0, "I was expecting 0"
47.
♦ Complete Test Script: I ♦
• A more flexible script is 02_nosetests_full.py which uses a Test class
class Test(object):
    @classmethod
    def setup_class(cls): # is run once at startup,
        # ..eg. to create database structure
        print("setup testsuite environment")
        open("/tmp/test2.out", "w").write("0")
    @classmethod
    def teardown_class(cls): # is run once after all tests to...
        print("cleanup testsuite environment")
        os.unlink("/tmp/test2.out")
48.
♦ Complete Test Script: II ♦
• allowing pre-post testsuite and pre-post test fixtures
class Test(object):
    ...
    # Using a Test class...
    def setup(self):
        print("is_run_before_every_test") #..and..
    def teardown(self):
        print("after_every_test") # eg truncate a table
    # each test can use the prepared environment
    def test_a(self):
        assert os.path.isfile("/tmp/test2.out")
49.
Simple processing: Goal
• Handle gathered data with dict() and zip()
• Find data relations with scipy
• Get essential information like standard deviation σ and distributions δ
• Linear correlation: what it is and when it can help
• Plotting
modules: numpy, scipy, scipy.stats.stats, collections, random, time
50.
The Chicken Paradox
"According to latest statistics,
it appears that you eat one chicken per year:
and, if that doesn't fit your budget,
you'll fit into the statistics anyway,
because someone will eat two." C. A. Salustri
51.
Simple processing: Exercise
How to dismantle the chicken paradox? Gather data!
• Write the following function using our parsing strategy
def ping_rtt(seconds=10):
    """@return: a list of ping RTT"""
    from course import sh
    # get sample output
    # find a solution in ipython
    # test and paste the code
    raise NotImplementedError
• Gather 10 seconds of ping output
• Hint: reuse the sh() function
• Hint: slice and filter lists using comprehension
52.
Distributions: set, defaultdict
A distribution or δ shows the frequency of events, like how many people ate x
chickens ;)
# Create a simple δ with Counter
from collections import Counter
d = Counter(rtt)
# We can even use a more flexible
from collections import defaultdict
d = defaultdict(int)
for x in rtt:
    d[x] += 1
Distributions and Mean are both important!
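Both ways of building a δ give the same result. A sketch on made-up data (chickens eaten per person in a year), since the rtt list depends on the ping exercise.

```python
from collections import Counter, defaultdict

# Made-up sample: chickens eaten per person in one year
chickens = [0, 2, 1, 1, 0, 2, 1, 4]

by_counter = Counter(chickens)            # value -> number of occurrences
by_default = defaultdict(int)
for x in chickens:
    by_default[x] += 1                    # missing keys start from int() == 0

assert dict(by_counter) == dict(by_default)
mean = sum(chickens) / float(len(chickens))
print(sorted(by_counter.items()), mean)
```

The mean alone (1.375 chickens) hides the fact that two people ate none, which is exactly what the distribution shows.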
53.
Standard Deviation: scipy
• The standard deviation σ is defined by
\sigma^2(X) := \frac{\sum (x - \bar{x})^2}{n}
• σ tells if δ is fair or not, and how representative the mean (x̄) is
• matplotlib.mlab.normpdf is a smooth function approximating the histogram
from scipy import std, mean
fair = [1, 1] # chickens
unfair = [0, 2] # chickens
assert mean(fair) == mean(unfair)
# Use standard deviation!
std(fair) # 0
std(unfair) # 1
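The σ values above can be reproduced from the definition with the standard library only. A minimal sketch; mean and std here are local helpers, not the scipy functions.

```python
import math

def mean(xs):
    return sum(xs) / float(len(xs))

def std(xs):
    """Population standard deviation: sqrt(sum((x - mean)^2) / n)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / float(len(xs)))

fair, unfair = [1, 1], [0, 2]        # chickens, as in the slide
print(mean(fair), std(fair))         # 1.0 0.0
print(mean(unfair), std(unfair))     # 1.0 1.0
```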
54.
Simple processing: scipy
Check your computed values against the σ returned by ping (did you notice that
ping returns it?)
"""goal: remember to convert to numeric / float
goal: use scipy
goal: check stdev"""
from scipy import std, mean # max,min are builtin
rtt = ping_rtt()
print(max(rtt), min(rtt), mean(rtt), std(rtt))
55.
Time Distributions: Exercise
⢠Parse the provided maillog in ipython using its ! magic and get an hourly
email δ
⢠Expected output:
time_d = { # mail delivered (removed) between
0: xxx # 00:00 - 00:59
1: xxx # 01:00 - 01:59
..
}
56.
Time Distributions: Exercise Solution
# delivered emails look like the following
# May 14 16:00:04 rpolli postfix/qmgr[122]: 4DC3DA: removed
ret = !grep removed maillog # get the interesting lines
ts = ret.fields(2) # find the timestamp (3rd column)
hours = [int(x[:2]) for x in ts] # keep only the hour of each timestamp
time_d = {x: hours.count(x) for x in set(hours)}
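The same hourly δ can be built in plain Python, without the ! magic. The log lines below are made up in the format shown above, standing in for the provided maillog.

```python
from collections import Counter

# Made-up lines in the maillog format
maillog = """\
May 14 16:00:04 rpolli postfix/qmgr[122]: 4DC3DA: removed
May 14 16:20:41 rpolli postfix/qmgr[122]: 4DC3DB: removed
May 14 23:59:59 rpolli postfix/qmgr[122]: 4DC3DC: removed
May 15 00:00:01 rpolli postfix/smtp[169]: 4DC3DD: to=<joe@foo.it>
"""
removed = [line for line in maillog.splitlines() if "removed" in line]
timestamps = [line.split()[2] for line in removed]      # 3rd column
hours = [int(ts.split(":")[0]) for ts in timestamps]    # hour part only
time_d = dict(Counter(hours))
print(time_d)
```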
57.
Plotting distributions
# To plot data..
from matplotlib import pyplot as plt
# and set the interactive mode
plt.ion()
# Plotting a histogram...
frequency, bins, _ = plt.hist(hours)
# .. returns the data for a
distribution = dict(zip(bins, frequency))
This server works mostly at night...
58.
Size Distributions: Exercise
⢠Create a size δ using hist(..., bins=...)
⢠Hint: help(hist)
size_d = { # mail size between
0: xxx # 0 - 10k
1: xxx # 10k - 20k
..
}
⢠Homework: Use the size δ to ďŹnd size mean and size sigma and compare
with Ď and mean evaluated from the original data-series
59.
♦ Simulating data with σ and x̄ ♦
Mean and standard deviation are useful starting points to simulate data using
the gaussian distribution.
# A mail load generator creating attachments of a given size...
from random import gauss
mail_size = gauss(mean_s, sigma_s) # a random number; size mean and sigma come from the homework
# and use time_d to simulate the load during the day
from time import localtime
hour = localtime().tm_hour
mail_per_minute = time_d[hour] / 60 # minutes in hour
60.
Linear Correlation
# Let's plot the following datasets
# taken from a 4-hour distribution
mail_sent = [1, 5, 500, 250, 100, 7]
kB_s = [70, 300, 29000, 12500, 450, 500]
# A scatter plot can suggest relations
# between data
plt.scatter(mail_sent, kB_s)
[Figure: scatter plot "Correlating mail and thruput"; x axis: kMail sent, y axis: Thruput kB/s]
61.
Linear Correlation
The Pearson Coefficient ρ is a relation indicator.
 0  no relation
 1  direct relation (both datasets increase together)
-1  inverse relation (one increases as the other decreases)
\rho(X, Y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}    (1)
from scipy.stats import pearsonr
ret = pearsonr(mail_sent, kB_s)
print(ret)
> (0.9823, 0.0004)
correlation, probability = ret
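To see what pearsonr computes, the coefficient can be written directly from its definition. This is a plain-Python sketch for illustration, not scipy's implementation, and it returns the coefficient only (no probability).

```python
import math

def pearson(xs, ys):
    """Pearson coefficient computed straight from its definition."""
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4]
direct = pearson(xs, [2 * x + 1 for x in xs])   # both series grow together
inverse = pearson(xs, [-x for x in xs])         # one grows, the other shrinks
print(direct, inverse)
```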
62.
You must (scatter) plot!
ρ does not detect non-linear correlation
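A parabola makes the point: y depends entirely on x, yet the linear coefficient vanishes. The pearson helper below is a plain-Python sketch written for this example, not the scipy function.

```python
import math

def pearson(xs, ys):
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

xs = list(range(-5, 6))       # -5 .. 5, symmetric around zero
ys = [x ** 2 for x in xs]     # y is fully determined by x...
rho = pearson(xs, ys)         # ...yet the contributions cancel out
print(rho)
```

A scatter plot of the same data shows the parabola at a glance, which is why plotting comes before trusting the number.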
63.
Combinations
# Given a table with many data series
from course import table
table = {...
'cpu_usr': [10, 23, 55, ..],
'byte_in': [2132, 3212, 3942, ..], }
# We can combine all their names with
from itertools import combinations
list(combinations(table, 2))
> [('swap_in', 'cpu_sys'),
('swap_in', 'csw'), ('cpu_sys', 'csw')... ]
Combining 4 suits, 2 at a time:
♥♠ ♥♣ ♥♦ ♠♣ ♠♦ ♣♦
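The suits example can be reproduced in a couple of lines. A minimal sketch, with the suits spelled out as plain strings.

```python
from itertools import combinations

suits = ["hearts", "spades", "clubs", "diamonds"]
pairs = list(combinations(suits, 2))   # unordered pairs, no repetitions
print(len(pairs))                      # 6 == 4 * 3 / 2
print(pairs[0], pairs[-1])
```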
64.
Netfishing correlation
We can try every combination between data series and check if there's some
ρ.
for k1, k2 in combinations(table, 2):
    corr, probability = pearsonr(table[k1], table[k2])
    if corr < 0.5:
        # I'm *still* not interested in data under this threshold
        continue
    print("linear correlation between {} and {} is {}".format(
        k1, k2, corr))
65.
Correlating I/O and Context Switch
Now we'll generate some correlation plots from table data, like this one.
66.
Netfishing correlation II
# create all combined plots
for k1, k2 in combinations(table, 2):
    corr, probability = pearsonr(table[k1], table[k2])
    plt.scatter(table[k1], table[k2])
    # 3 digit precision on title
    plt.title("R={:0.3f}".format(corr))
    plt.xlabel(k1); plt.ylabel(k2)
    # save and close the plot
    plt.savefig("{}_{}.png".format(k1, k2)); plt.close()
67.
Mark time with colors
# Get combined data directly via items
# using 3 buckets
buckets = 3
for (k1, v1), (k2, v2) in combinations(table.items(), 2):
    corr, probability = pearsonr(v1, v2)
    length = len(v1)
    # Get an array of colors
    # eg. [0, 0, ..., 1, 1, .., 2, 2, ...]
    colors = [(i * buckets // length) for i in range(length)]
    # map the bucket numbers to colors with c=
    plt.scatter(v1, v2, c=colors)
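The color array is just an integer division of the sample index. A sketch isolating that step; time_buckets is a hypothetical helper name.

```python
def time_buckets(length, buckets=3):
    """Map each sample index to a bucket number, oldest samples first."""
    return [i * buckets // length for i in range(length)]

print(time_buckets(6))       # [0, 0, 1, 1, 2, 2]
print(time_buckets(7, 2))    # [0, 0, 0, 0, 1, 1, 1]
```

The floor division (//) keeps the result an integer on both python2 and python3.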
68.
That's all folks!
Thank you for the attention!
Roberto Polli - roberto.polli@par-tec.it