1. Python in Action
Presented at USENIX LISA Conference
November 16, 2007
David M. Beazley
http://www.dabeaz.com
(Part II - Systems Programming)
Copyright (C) 2007, http://www.dabeaz.com 2- 1
2. Section Overview
• In this section, we're going to get dirty
• Systems Programming
• Files, I/O, file-system
• Text parsing, data decoding
• Processes and IPC
• Networking
• Threads and concurrency
3. Commentary
• I personally think Python is a fantastic tool for
systems programming.
• Modules provide access to most of the major
system libraries I used to access via C
• No enforcement of "morality"
• Decent performance
• It just "works" and it's fun
4. Approach
• I've thought long and hard about how I
would present this part of the class.
• A reference manual approach would
probably be long and very boring.
• So instead, we're going to focus on building
something more in tune with the times
5. "To Catch a Slacker"
• Write a collection of Python programs that can
quietly monitor Firefox browser caches to find
out who has been spending their day reading
Slashdot instead of working on their TPS reports.
• Oh yeah, and be a real sneaky bugger about it.
6. Why this Problem?
• Involves a real-world system and data
• Firefox already installed on your machine (?)
• Cross platform (Linux, Mac, Windows)
• Example of tool building
• Related to a variety of practical problems
• A good tour of "Python in Action"
7. Disclaimers
• I am not involved in browser forensics (or
spyware for that matter).
• I am in no way affiliated with Firefox/Mozilla
nor have I ever seen Firefox source code
• I have never worked with the cache data
prior to preparing this tutorial
• I have never used any third-party tools for
looking at this data.
8. More Disclaimers
• All of the code in this tutorial works with a
standard Python installation
• No third party modules.
• All code is cross-platform
• Code samples are available online at
http://www.dabeaz.com/action/
• Please look at that code and follow along
9. Assumptions
• This is not a tutorial on systems concepts
• You should be generally familiar with
background material (files, filesystems, file
formats, processes, threads, networking,
protocols, etc.)
• Hopefully you can "extrapolate" from the
material presented here to construct more
advanced Python applications.
10. The Big Picture
• We want to write a tool that allows
someone to locate, inspect, and perform
queries across a distributed collection of
Firefox caches.
• For example, the cache directories on all
machines on the LAN of a quasi-evil
corporation.
12. Problem : Finding Files
• Find the Firefox cache
Write a program findcache.py that takes a directory
name as input and recursively scans that directory
and all subdirectories looking for Firefox/Mozilla
cache directories.
• Example:
% python findcache.py /Users/beazley
/Users/beazley/Library/.../qs1ab616.default/Cache
/Users/beazley/Library/.../wxuoyiuf.slt/Cache
%
• Use case: Searching for things on the filesystem.
13. findcache.py
# findcache.py
# Recursively scan a directory looking for
# Firefox/Mozilla cache directories
import sys
import os
if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python findcache.py dirname"
    raise SystemExit(1)

caches = (path for path,dirs,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files)

for name in caches:
    print name
14. The sys module
• The sys module has basic information related
to the execution environment:
sys.argv # A list of the command line options
sys.stdin # Standard input
sys.stdout # Standard output
sys.stderr # Standard error
• For the command 'python findcache.py /Users/beazley':
sys.argv = ['findcache.py',
            '/Users/beazley']
15. Program Termination
• The SystemExit exception forces Python to
exit. Its value is the return code:
if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python findcache.py dirname"
    raise SystemExit(1)
16. os Module
• The os module contains useful OS related
functions (files, processes, etc.)
import os
17. os.walk()
os.walk(topdir)
• Recursively walks a directory tree and
generates a sequence of tuples (path,dirs,files)
path = The current directory name
dirs = List of all subdirectory names in path
files = List of all regular files (data) in path
caches = (path for path,dirs,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files)
18. A Sequence of Caches
• This statement generates a sequence of
directory names where '_CACHE_MAP_' is
contained in the file list:
caches = (path for path,dirs,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files)
• The 'if' clause is the file name check; 'path' is
the directory name that is generated as a result
19. Printing the Result
• This prints the sequence of cache directories
that are generated by the previous statement:
for name in caches:
    print name
20. Commentary
• Our solution is strongly based on a
"declarative" programming style (again)
• We simply write out a sequence of
operations that produce what we want
• Not focused on the underlying mechanics
of how to traverse all of the directories.
21. Big Idea : Iteration
• Python allows iteration to be captured as a
kind of object.
caches = (path for path,dirs,files in os.walk(sys.argv[1])
if '_CACHE_MAP_' in files)
• This de-couples iteration from the code that
uses the iteration
for name in caches:
    print name
• Another usage example:
for name in caches:
    print len(os.listdir(name)), name
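The decoupling can be seen in a tiny self-contained sketch (the path strings below are invented for illustration, not a real cache scan):

```python
# A generator expression captures the iteration itself as an object;
# nothing is scanned until some other code consumes it.
paths = ['a/Cache', 'b/notes', 'c/Cache']
caches = (p for p in paths if p.endswith('Cache'))

# The consuming loop is written separately from the filtering logic.
found = list(caches)
```

Note that a generator is exhausted after one pass; re-create it to iterate again.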
22. Big Idea : Iteration
• Compare to this:
for path,dirs,files in os.walk(sys.argv[1]):
    if '_CACHE_MAP_' in files:
        print len(os.listdir(path)),path
• This code is simple, but the loop and the
code that executes in the loop body are
coupled together
• Not as flexible, but this is somewhat subtle
to wrap your brain around at first.
23. Mini-Reference : sys, os
• sys module
sys.argv # List of command line options
sys.stdin # Standard input
sys.stdout # Standard output
sys.stderr # Standard error
sys.executable # Full path of Python executable
sys.exc_info() # Information on current exception
• os module
os.walk(dir) # Recursively walk dir producing a
# sequence of tuples (path,dlist,flist)
os.listdir(dir) # Return a list of all files in dir
• SystemExit exception
raise SystemExit(n) # Exit with integer code n
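The os.walk() pattern can be tried end-to-end on a throw-away directory tree (tempfile usage is an assumption of this sketch, not part of the original tool):

```python
import os
import tempfile

# Build a tiny directory tree containing a '_CACHE_MAP_' marker file,
# then locate it with os.walk() -- the same test findcache.py performs.
top = tempfile.mkdtemp()
cachedir = os.path.join(top, 'profile', 'Cache')
os.makedirs(cachedir)
open(os.path.join(cachedir, '_CACHE_MAP_'), 'w').close()

hits = [path for path, dirs, files in os.walk(top)
        if '_CACHE_MAP_' in files]
```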
24. Problem: Searching for Text
• Extract all URL requests from the cache
Write a program requests.py that scans the contents
of the _CACHE_00n_ files and prints a list of URLs
for documents stored in the cache.
• Example:
% python requests.py /Users/.../qs1ab616.default/Cache
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1.1.j
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
...
%
• Use case: Searching the contents of files for
text patterns.
25. The Firefox Cache
• The cache directory holds two types of data
• Metadata (URLs, headers, etc.).
• Raw data (HTML, JPEG, PNG, etc.)
• This data is stored in two places
• Cryptic files in the Cache directory
• Blocks inside the _CACHE_00n_ files
• Metadata almost always in _CACHE_00n_
26. Possible Solution : Regex
• The _CACHE_00n_ files are encoded in a
binary format, but URLs are embedded
inside as null-terminated text:
\x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f
\xceF\xff\x9f\xce\x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a
\x00\x00\x023HTTP:http://slashdot.org/\x00request-method\x00
GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U; Intel
Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00
request-Accept-Encoding\x00gzip,deflate\x00response-head\x00
HTTP/1.1 200 OK\r\nDate: Sun, 30 Sep 2007 13:07:29 GMT\r\n
Server: Apache/1.3.37 (Unix) mod_perl/1.29\r\nSLASH_LOG_DATA:
shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I
live my life if I can't tell good from evil?\r\nCache-Control:
• Maybe the requests could just be ripped
using a regular expression.
27. A Regex Solution
# requests.py
import re
import os
import sys
cachedir = sys.argv[1]
cachefiles = [ '_CACHE_001_', '_CACHE_002_', '_CACHE_003_' ]
# A regex for embedded URL strings
request_pat = re.compile(r'([a-z]+://.*?)\x00')

# Loop over all files and search for URLs
for name in cachefiles:
    data = open(os.path.join(cachedir,name),"rb").read()
    index = 0
    while True:
        m = request_pat.search(data,index)
        if not m: break
        print m.group(1)
        index = m.end()
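The search loop can be exercised on a small in-memory string (the data below is invented, not a real cache file):

```python
import re

# Find matches one at a time by restarting the search at the end
# of the previous match.
request_pat = re.compile(r'([a-z]+://.*?)\x00')
data = 'junk\x00http://example.com/\x00more\x00ftp://host/\x00'

urls = []
index = 0
while True:
    m = request_pat.search(data, index)
    if not m:
        break
    urls.append(m.group(1))
    index = m.end()
```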
28. The re module
• The re module contains all functionality related
to regular expression pattern matching,
searching, replacing, etc.
• Features are strongly influenced by Perl, but
regexes are not directly integrated into the
Python language.
import re
29. Using re
• Patterns are first specified as strings and
compiled into a regex object:
pat = re.compile(pattern [,flags])
request_pat = re.compile(r'([a-z]+://.*?)\x00')
• The pattern syntax is "standard":
pat*        pat1|pat2
pat+        [chars]
pat?        [^chars]
(pat)       pat{n}
.           pat{n,m}
30. Using re
• All subsequent operations are methods of the
compiled regex pattern:
m = pat.match(data [,start])    # Check for match
m = pat.search(data [,start])   # Search for match
newdata = pat.sub(repl, data)   # Pattern replace
31. Searching for Matches
pat.search(text [,start])
• Searches the string text for the first occurrence
of the regex pattern starting at position start.
• Returns a "MatchObject" if a match is found.
• In the code below, we're finding matches one
at a time:
index = 0
while True:
    m = request_pat.search(data,index)
    if not m: break
    print m.group(1)
    index = m.end()
32. Match Objects
• Regex matches are represented by a MatchObject:
m.group([n])  # Text matched by group n
m.start([n])  # Starting index of group n
m.end([n])    # End index of group n
• In the search loop, m.group(1) is the matching
text for just the URL, and m.end() is the end of
the match:
m = request_pat.search(data,index)
if not m: break
print m.group(1)
index = m.end()
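A quick sketch of the MatchObject attributes, using a toy pattern and string chosen for illustration:

```python
import re

# Group 1 is the parenthesized part (the scheme);
# group 0 is the entire matched text.
pat = re.compile(r'([a-z]+)://')
m = pat.search('see http://x')
```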
33. Groups
• In patterns, parentheses () define groups which
are numbered left to right:
group 0   # The entire pattern
group 1   # Text in first ()
group 2   # Text in next ()
...
• In requests.py, m.group(1) returns the text
matched by the first (): the URL
request_pat = re.compile(r'([a-z]+://.*?)\x00')
34. Mini-Reference : re
• re pattern compilation
pat = re.compile(r'patternstring')
• Pattern syntax
literal # Match literal text
pat* # Match 0 or more repetitions of pat
pat+ # Match 1 or more repetitions of pat
pat? # Match 0 or 1 repetitions of pat
pat1|pat2 # Match pat1 or pat2
(pat) # Match pat (group)
[chars] # Match characters in chars
[^chars] # Match characters not in chars
. # Match any character except \n
\d # Match any digit
\w # Match alphanumeric character
\s # Match whitespace
35. Mini-Reference : re
• Common pattern operations
pat.search(text) # Search text for a match
pat.match(text) # Search start of text for match
pat.sub(repl,text) # Replace pattern with repl
• Match objects
m.group([n]) # Text matched by group n
m.start([n]) # Starting position of group n
m.end([n]) # Ending position of group n
• How to loop over all matches of a pattern
for m in pat.finditer(text):
# m is a MatchObject that you process
...
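The finditer() loop can be sketched with a small in-memory string (hypothetical data in the same null-terminated style as the cache files):

```python
import re

# finditer() yields one MatchObject per match, replacing the
# manual search/index loop.
pat = re.compile(r'([a-z]+://.*?)\x00')
data = 'x\x00http://slashdot.org/\x00y\x00https://example.org/\x00'

urls = [m.group(1) for m in pat.finditer(data)]
```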
36. Mini-Reference : re
• An example of pattern replacement
# This replaces American dates of the form 'mm/dd/yyyy'
# with European dates of the form 'dd/mm/yyyy'.
# This function takes a MatchObject as input and returns
# replacement text as output.
def euro_date(m):
    month = m.group(1)
    day   = m.group(2)
    year  = m.group(3)
    # group() returns strings, so use %s formatting
    return "%s/%s/%s" % (day,month,year)

# Date re pattern and replacement operation
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
newdata = datepat.sub(euro_date,text)
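The function-replacement form of sub() can be tried on a sample string (the dates below are invented; note that group() returns strings, hence %s formatting):

```python
import re

def euro_date(m):
    # Swap 'mm/dd/yyyy' to 'dd/mm/yyyy'
    month, day, year = m.group(1), m.group(2), m.group(3)
    return "%s/%s/%s" % (day, month, year)

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
newdata = datepat.sub(euro_date, "Due 10/30/2007, shipped 9/25/2007")
```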
37. Mini-Reference : re
• There are many more features of the re
module
• Strongly influenced by Perl (feature set)
• Regexs are a library in Python, not integrated
into the language.
• A book on regular expressions may be
essential for advanced functions.
38. File Handling
• What is going on in this statement?
data = open(os.path.join(cachedir,name),"rb").read()
39. os.path module
• os.path has portable file related functions:
os.path.join(name1,name2,...)  # Join path names
os.path.getsize(filename)      # Get the file size
os.path.getmtime(filename)     # Get modification date
• There are many more functions, but this is the
preferred module for basic filename handling
40. os.path.join()
• Creates a fully-expanded pathname:
dirname  = '/foo/bar'
filename = 'name'
os.path.join(dirname,filename)  # '/foo/bar/name'
• Aware of platform differences ('/' vs. '\')
41. Mini-Reference : os.path
os.path.join(s1,s2,...) # Join pathname parts together
os.path.getsize(path) # Get file size of path
os.path.getmtime(path) # Get modify time of path
os.path.getatime(path) # Get access time of path
os.path.getctime(path) # Get creation time of path
os.path.exists(path) # Check if path exists
os.path.isfile(path) # Check if regular file
os.path.isdir(path) # Check if directory
os.path.islink(path) # Check if symbolic link
os.path.basename(path) # Return file part of path
os.path.dirname(path) # Return dir part of path
os.path.abspath(path) # Get absolute path
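A few of these can be demonstrated deterministically with posixpath, the Unix flavor behind os.path (used here so the expected results do not depend on which platform runs the example):

```python
import posixpath

# Join parts into a path, then split the result back apart.
p = posixpath.join('/foo/bar', 'name.txt')

d = posixpath.dirname(p)    # directory part
b = posixpath.basename(p)   # file part
```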
42. Binary I/O
• For all binary files, use modes "rb","wb", etc.
data = open(os.path.join(cachedir,name),"rb").read()
• Disables new-line translation (critical on Windows)
43. Common I/O Shortcuts
# Read an entire file into a string
data = open(filename).read()

# Write a string out to a file
open(filename,"w").write(text)

# Loop over all lines in a file
for line in open(filename):
    ...
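These shortcuts can be exercised round-trip through a temporary file (the tempfile setup is an assumption of this sketch; like the shortcuts themselves, it leaves file closing to the garbage collector):

```python
import os
import tempfile

fname = os.path.join(tempfile.mkdtemp(), 'sample.txt')

# Write a string out to a file
open(fname, 'w').write('line1\nline2\n')

# Read the entire file back into a string
data = open(fname).read()

# Loop over all lines in the file
lines = [line.rstrip() for line in open(fname)]
```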
44. Commentary on Solution
• This regex approach is mostly a hack for this
particular application.
• Reads entire cache files into memory as
strings (may be quite large)
• Only finds URLs, no other metadata
• Some risk of false positives since URLs could
also be embedded in data.
45. Commentary
• We have started to build a collection of
very simple command line tools
• Very much in the "Unix tradition."
• Python makes it easy to create such tools
• More complex applications could be
assembled by simply gluing scripts together
46. Working with Processes
• It is common to write programs that run
other programs, collect their output, etc.
• Pipes
• Interprocess Communication
• Python has a variety of modules for
supporting this.
47. subprocess Module
• A module for creating and interacting with
subprocesses
• Consolidates a number of low-level OS
functions such as system(), execv(), spawnv(),
pipe(), popen2(), etc. into a single module
• Cross platform (Unix/Windows)
48. Example : Slackers
• Find slacker cache entries.
Using the programs findcache.py and requests.py as
subprocesses, write a program that inspects cache
directories and prints out all entries that contain the
word 'slashdot' in the URL.
49. slackers.py
# slackers.py
import sys
import subprocess
# Run findcache.py as a subprocess
finder = subprocess.Popen(
    [sys.executable,"findcache.py",sys.argv[1]],
    stdout=subprocess.PIPE)
dirlist = [line.strip() for line in finder.stdout]

# Run requests.py as a subprocess
for cachedir in dirlist:
    searcher = subprocess.Popen(
        [sys.executable,"requests.py",cachedir],
        stdout=subprocess.PIPE)
    for line in searcher.stdout:
        if 'slashdot' in line: print line,
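The Popen pattern can be tested without the findcache.py/requests.py scripts by launching the interpreter itself with a tiny -c program (a stand-in for the real tools):

```python
import sys
import subprocess

# Launch a python subprocess and read its stdout through a pipe.
# universal_newlines=True makes the pipe yield text in Python 3 as well.
child = subprocess.Popen(
    [sys.executable, '-c', "import sys; sys.stdout.write('a\\nb\\n')"],
    stdout=subprocess.PIPE, universal_newlines=True)

out = [line.strip() for line in child.stdout]
child.wait()
```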
50. Launching a subprocess
• This is launching a python script as a
subprocess, connecting its stdout stream to a
pipe:
finder = subprocess.Popen(
    [sys.executable,"findcache.py",sys.argv[1]],
    stdout=subprocess.PIPE)
• Output is collected from the pipe, with newline
stripping:
dirlist = [line.strip() for line in finder.stdout]
51. Python Executable
• sys.executable is the full pathname of the
python interpreter:
finder = subprocess.Popen(
    [sys.executable,"findcache.py",sys.argv[1]],
    stdout=subprocess.PIPE)
52. Subprocess Arguments
• The list of arguments to the subprocess
corresponds to what would appear on a shell
command line:
[sys.executable,"findcache.py",sys.argv[1]]
[sys.executable,"requests.py",cachedir]
53. slackers.py
• More of the same idea. For each directory we
found in the last step, we run requests.py to
produce requests:
for cachedir in dirlist:
    searcher = subprocess.Popen(
        [sys.executable,"requests.py",cachedir],
        stdout=subprocess.PIPE)
    for line in searcher.stdout:
        if 'slashdot' in line: print line,
54. Commentary
• subprocess is a large module with many options.
• However, it takes care of a lot of annoying
platform-specific details for you.
• Currently the "recommended" way of dealing
with subprocesses.
55. Low Level Subprocesses
• Running a simple system command
os.system("shell command")
• Connecting to a subprocess with pipes
pout, pin = popen2.popen2("shell command")
• Exec/spawn
os.execv(),os.execl(),os.execle(),...
os.spawnv(),os.spawnvl(), os.spawnle(),...
• Unix fork()
os.fork(), os.wait(), os.waitpid(), os._exit(), ...
56. Interactive Processes
• Python does not have built-in support for
controlling interactive subprocesses (e.g.,
"Expect")
• Must install third party modules for this
• Example: pexpect
• http://pexpect.sourceforge.net
57. Commentary
• Writing small Unix-like utilities is fairly
straightforward in Python
• Support for standard kinds of operations (files,
regular expressions, pipes, subprocesses, etc.)
• However, our solution is also kind of clunky
• Only returns some information
• Not particularly memory efficient (reads large
files into memory)
58. Interlude
• Python is well-suited to building libraries
and frameworks.
• In the next part, we're going to take a
totally different approach than simply
writing simple utilities.
• Will build libraries for manipulating cache
data and use those libraries to build tools.
59. Problem : Parsing Data
• Extract the cache data (for real)
Write a module ffcache.py that contains a set of
functions for reading Firefox cache data into useful
data structures that can be used by other programs.
Capture all available information including URLs,
timestamps, sizes, locations, content types, etc.
• Use case: Blood and guts
Writing programs that can process foreign file
formats. Processing binary encoded data. Creating
code for later reuse.
60. The Firefox Cache
• There are four critical files
_CACHE_MAP_ # Cache index
_CACHE_001_ # Cache data
_CACHE_002_ # Cache data
_CACHE_003_ # Cache data
• All files are binary-encoded
• _CACHE_MAP_ is used by Firefox to locate
data, but it is not updated until Firefox exits.
• We will ignore _CACHE_MAP_ since we want
to observe caches of live Firefox sessions.
61. Firefox _CACHE_ Files
• _CACHE_00n_ file organization
Free/used block bitmap 4096 bytes
Blocks Up to 32768 blocks
• The block size varies according to the file:
_CACHE_001_ 256 byte blocks
_CACHE_002_ 1024 byte blocks
_CACHE_003_ 4096 byte blocks
62. Cache Entries
• Each cache entry:
• A maximum of 4 cache blocks
• Can either be data or metadata
• If >16K, written to a file instead
• Notice how all the "cryptic" files are >16K
-rw------- beazley 111169 Sep 25 17:15 01CC0844d01
-rw------- beazley 104991 Sep 25 17:15 01CC3844d01
-rw------- beazley 47233 Sep 24 16:41 021F221Ad01
...
-rw------- beazley 26749 Sep 21 11:19 FF8AEDF0d01
-rw------- beazley 58172 Sep 25 18:16 FFE628C6d01
63. Cache Metadata
• Metadata is encoded as a binary structure
Header 36 bytes
Request String Variable length (in header)
Request Info Variable length (in header)
• Header encoding (binary, big-endian)
0-3 magic (???) unsigned int (0x00010008)
4-7 location unsigned int
8-11 fetchcount unsigned int
12-15 fetchtime unsigned int (system time)
16-19 modifytime unsigned int (system time)
20-23 expiretime unsigned int (system time)
24-27 datasize unsigned int (byte count)
28-31 requestsize unsigned int (byte count)
32-35 infosize unsigned int (byte count)
64. Solution Outline
• Part 1: Parsing Metadata Headers
• Part 2: Getting request information (URL)
• Part 3: Extracting additional content info
• Part 4: Scanning of individual cache files
• Part 5: Scanning an entire directory
• Part 6: Scanning a list of directories
65. Part I - Reading Headers
• Write a function that can parse the binary
metadata header and return the data in a
useful format
66. Reading Headers
import struct
# This function parses a cache metadata header into a dict
# of named fields (listed in _headernames below)
_headernames = ['magic','location','fetchcount',
'fetchtime','modifytime','expiretime',
'datasize','requestsize','infosize']
def parse_meta_header(headerdata):
    head = struct.unpack(">9I",headerdata)
    meta = dict(zip(_headernames,head))
    return meta
67. Reading Headers
• How this is supposed to work:
>>> f = open("Cache/_CACHE_001_","rb")
>>> f.seek(4096) # Skip the bit map
>>> headerdata = f.read(36) # Read 36 byte header
>>> meta = parse_meta_header(headerdata)
>>> meta
{'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
>>>
• Basically, we're parsing the header into a
useful Python data structure (a dictionary)
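Since a real cache file may not be handy, the parse can be verified against a fabricated header packed with struct.pack() (the values below are invented to mirror the session above):

```python
import struct

_headernames = ['magic','location','fetchcount',
                'fetchtime','modifytime','expiretime',
                'datasize','requestsize','infosize']

def parse_meta_header(headerdata):
    head = struct.unpack(">9I", headerdata)
    return dict(zip(_headernames, head))

# Pack 9 big-endian unsigned ints into a 36-byte header and parse it.
values = (0x00010008, 0, 3, 1190829792, 1190829792, 0, 29448, 27, 531)
headerdata = struct.pack(">9I", *values)
meta = parse_meta_header(headerdata)
```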
68. struct module
• The struct module parses binary encoded data
into Python objects.
• You would use this module to pack/unpack raw
binary data from Python strings.
head = struct.unpack(">9I",headerdata)
• The format ">9I" unpacks 9 unsigned 32-bit
big-endian integers
69. struct module
• The result is always a tuple of converted values:
head = struct.unpack(">9I",headerdata)
head = (65544, 0, 1, 1191682051, 1191682051,
        0, 8645, 190, 218)
70. Dictionary Creation
• zip(s1,s2) makes a list of tuples:
zip(_headernames,head)  [('magic',head[0]),
                         ('location',head[1]),
                         ('fetchcount',head[2]),
                         ...]
• dict() then makes a dictionary:
meta = dict(zip(_headernames,head))
71. Commentary
• Dictionaries as data structures
meta = { 'fetchtime' : 1190829792,
'requestsize' : 27,
'magic' : 65544,
'fetchcount' : 3,
'expiretime' : 0,
'location' : 2449473536L,
'modifytime' : 1190829792,
'datasize' : 29448,
'infosize' : 531 }
• Useful if data has many parts
data = f.read(meta[8]) # Huh?!?
vs.
data = f.read(meta['infosize']) # Better
72. Mini-reference : struct
• struct module
items = struct.unpack(fmt,data)
data = struct.pack(fmt,item1,...,itemn)
• Sample Format codes
'c' char (1 byte string)
'b' signed char (8-bit integer)
'B' unsigned char (8-bit integer)
'h' signed short (16-bit integer)
'H' unsigned short (16-bit integer)
'i' int (32-bit integer)
'I' unsigned int (32-bit integer)
'f' 32-bit single precision float
'd' 64-bit double precision float
's' char s[] (String)
'>' Big endian modifier
'<' Little endian modifier
'!' Network order modifier
'n' Repetition count modifier
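As a quick sanity check of these format codes, a small round trip (not in the original deck; the values are illustrative) shows how pack and unpack mirror each other:

```python
import struct

# Pack 9 unsigned 32-bit integers, big-endian -- the same ">9I"
# layout used for the cache metadata header above
values = (65544, 0, 3, 1190829792, 1190829792, 0, 29448, 27, 531)
headerdata = struct.pack(">9I", *values)
assert len(headerdata) == 36          # 9 * 4 bytes, matches f.read(36)

# Unpacking reverses the operation and returns a tuple
unpacked = struct.unpack(">9I", headerdata)
assert unpacked == values
```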
73. Part 2 : Parsing Requests
• Write a function that will read the URL
request string and request information
• Request String : A Null-terminated string
• Request Info : A sequence of Null-terminated
key-value pairs (like a dictionary)
74. Parsing Requests
import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
76. String Stripping
• The request data is a sequence of null-terminated
strings. This splits the data up into parts:
requestdata = 'part\x00part\x00part\x00part\x00...'
requestdata.split('\x00') ->
parts = ['part','part','part','part',...]
import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
77. String Validation
• Individual parts should be printable text (plus the
newline characters '\n' and '\r').
• We use the re module to match each string. This
would help catch cases where we might be reading
bad data (false headers, raw data, etc.).
import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
78. URL Request String
• The request string is the first part. The check that
follows makes sure it's the right size (a further
sanity check on the data integrity).
import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
79. Request Info
• Each request has a set of associated data
represented as key/value pairs:
parts = ['request','key','val','key','val','key','val']
parts[1::2] -> ['key','key','key']
parts[2::2] -> ['val','val','val']
zip(parts[1::2],parts[2::2]) -> [('key','val'),
                                 ('key','val'),
                                 ('key','val')]
import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    # Makes a dictionary from (key,val) tuples
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
80. Fixing the Request
• Cleaning up the request string:
request = "HTTP:http://www.google.com"
request.split(':',1)    -> ['HTTP','http://www.google.com']
request.split(':',1)[1] -> 'http://www.google.com'
# Given a dictionary of header information and a file,
# this function extracts the request data from a cache
# metadata entry and saves it in the dictionary. Returns
# True or False depending on success.
def read_request_data(header,f):
    request = f.read(header['requestsize']).strip('\x00')
    infodata = f.read(header['infosize']).strip('\x00')
    # Validate request and infodata here (nothing now)
    # Turn the infodata into a dictionary
    parts = infodata.split('\x00')
    info = dict(zip(parts[::2],parts[1::2]))
    header['request'] = request.split(':',1)[1]
    header['info'] = info
    return True
81. Commentary
• Emphasize that Python has very powerful
list manipulation primitives
• Indexing
• Slicing
• List comprehensions
• Etc.
• Knowing how to use these leads to rapid
development and compact code
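For instance, the list primitives above can be checked interactively; this small sketch (with made-up values) exercises indexing, strided slicing, and a comprehension on the same list:

```python
parts = ['request', 'key1', 'val1', 'key2', 'val2']

# Indexing: the first element is the request string
assert parts[0] == 'request'

# Strided slices pull out alternating keys and values
assert parts[1::2] == ['key1', 'key2']
assert parts[2::2] == ['val1', 'val2']

# zip + dict pairs them up into a dictionary
info = dict(zip(parts[1::2], parts[2::2]))
assert info == {'key1': 'val1', 'key2': 'val2'}

# A list comprehension filters/transforms in one line
keys = [p for p in parts[1:] if p.startswith('key')]
assert keys == ['key1', 'key2']
```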
82. Part 3: Content Info
• All documents on the internet have
optional content-type, encoding, and
character set information.
• Let's add this information since it will make
it easier for us to determine the type of
files that are stored in the cache (i.e.,
images, movies, HTML, etc.)
83. HTTP Responses
• The cache metadata includes an HTTP
response header
>>> print meta['info']['response-head']
HTTP/1.1 200 OK
Date: Sat, 29 Sep 2007 20:51:37 GMT
Cache-Control: private
Vary: User-Agent
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
>>>
Content type, character set,
and encoding.
84. Solution
# Given a metadata dictionary, this function adds additional
# fields related to the content type, charset, and encoding
import email

def add_content_info(meta):
    info = meta['info']
    if 'response-head' not in info:
        return
    else:
        rhead = info.get('response-head').split("\n",1)[1]
        m = email.message_from_string(rhead)
        content = m.get_content_type()
        encoding = m.get('content-encoding',None)
        charset = m.get_content_charset()
        meta['content-type'] = content
        meta['content-encoding'] = encoding
        meta['charset'] = charset
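To see what the email module is doing for us here, a standalone sketch with a fabricated response header (status line already stripped, as the split("\n",1)[1] above does) looks like this:

```python
import email

# A made-up HTTP response header, minus the status line
rhead = (
    "Date: Sat, 29 Sep 2007 20:51:37 GMT\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Encoding: gzip\n"
)
m = email.message_from_string(rhead)

# The parsed message gives us typed access to the header fields
assert m.get_content_type() == 'text/html'
assert m.get_content_charset() == 'utf-8'
assert m.get('content-encoding') == 'gzip'   # lookups are case-insensitive
```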
85. Internet Data Handling
• Python has a vast assortment of internet data
handling modules.
• email : Parsing of email messages, MIME
headers, etc.
# Given a metadata dictionary, this function adds additional
# fields related to the content type, charset, and encoding
import email

def add_content_info(meta):
    info = meta['info']
    if 'response-head' not in info:
        return
    else:
        rhead = info.get('response-head').split("\n",1)[1]
        m = email.message_from_string(rhead)
        content = m.get_content_type()
        encoding = m.get('content-encoding',None)
        charset = m.get_content_charset()
        meta['content-type'] = content
        meta['content-encoding'] = encoding
        meta['charset'] = charset
86. Internet Data Handling
• In this code, we parse the HTTP response headers
using the email module and extract content-type,
encoding, and charset information.
# Given a metadata dictionary, this function adds additional
# fields related to the content type, charset, and encoding
import email

def add_content_info(meta):
    info = meta['info']
    if 'response-head' not in info:
        return
    else:
        rhead = info.get('response-head').split("\n",1)[1]
        m = email.message_from_string(rhead)
        content = m.get_content_type()
        encoding = m.get('content-encoding',None)
        charset = m.get_content_charset()
        meta['content-type'] = content
        meta['content-encoding'] = encoding
        meta['charset'] = charset
87. Commentary
• Python is heavily used in Internet applications
• There are modules for parsing common types
of data (email, HTML, XML, etc.)
• There are modules for processing bits and
pieces of internet data (URLs, MIME types,
RFC822 headers, etc.)
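For example, pulling a URL apart takes one call. The sketch below uses the modern urllib.parse module (in the Python 2 of this deck, the same functionality lives in the urlparse module):

```python
from urllib.parse import urlparse

# Break a cache request URL into scheme, host, and path
u = urlparse('http://images.slashdot.org/topics/topicstorage.gif')
assert u.scheme == 'http'
assert u.netloc == 'images.slashdot.org'
assert u.path == '/topics/topicstorage.gif'
```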
88. Part 4: File Scanning
• Write a function that scans a single cache
file and produces a sequence of records
containing all of the cache metadata.
• This is just one more of our building blocks
• The goal is to hide some of the nasty bits
89. File Scanning
# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+
                                 meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
90. Usage : File Scanning
• Usage of the scan function
>>> f = open("Cache/_CACHE_001_","rb")
>>> for meta in scan_cachefile(f,256):
...     print meta['request']
...
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
...
• We can just open up a cache file and write a
for-loop to iterate over all of the entries.
91. Python File I/O
• File Objects: Modeled after ANSI C. Files are just
bytes; the file pointer keeps track of the current
position.
f.read()      # Read bytes
f.tell()      # Current fp
f.seek(n,off) # Move fp
# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+
                                 meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
92. Using Earlier Code
• Here we are using our header parsing functions
written in previous parts.
• Note: We are progressively adding more data to
the meta dictionary.
# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+
                                 meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
93. Data Validation
• This is a sanity check to make sure the header
data looks like a valid header.
# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+
                                 meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
94. Generating Results
• We are using yield to produce data for a single
cache entry. If someone uses a for-loop, they will
get all of the entries.
• Note: This allows us to process the cache without
reading all of the data into memory.
# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+
                                 meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
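The yield mechanics can be seen in isolation with a toy record scanner. The record format below is invented (a 4-byte big-endian length followed by a payload), but the read-parse-yield loop has the same shape as scan_cachefile:

```python
import io
import struct

def scan_records(f):
    # Read length-prefixed records until EOF, yielding one at a time
    while True:
        lendata = f.read(4)
        if not lendata:
            break
        (n,) = struct.unpack(">I", lendata)
        yield f.read(n)

# Build a tiny two-record "file" in memory and scan it lazily
data = struct.pack(">I", 5) + b"hello" + struct.pack(">I", 5) + b"world"
records = list(scan_records(io.BytesIO(data)))
assert records == [b"hello", b"world"]
```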
95. Commentary
• Have created a function that can scan a
single _CACHE_00n_ file and produce a
sequence of dictionaries with metadata.
• It's still somewhat low-level
• Just need to package it a little better
96. Part 5 : Scan a Directory
• Write a function that takes the name of a
Firefox cache directory, scans all of the
cache files for metadata, and produces a
single sequence of records.
• Make it real easy to extract data
97. Solution : Directory Scan
# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.
import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
98. Solution : Directory Scan
• General idea: We loop over the three
_CACHE_00n_ files and produce a sequence of
the cache records.
# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.
import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
99. Solution : Directory Scan
• We use the low-level file scanning function here
to generate a sequence of records.
# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.
import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
100. More Generation
• By using yield here, we are chaining together the
results obtained from all three cache files into one
big long sequence of results.
• The underlying mechanics and implementation
details are hidden (the user doesn't care).
# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.
import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
101. Additional Data
• Adding path and file information to the data
(may be useful later).
# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.
import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
102. Usage : Cache Scan
• Usage of the scan function
>>> for meta in scan_cache("Cache/"):
... print meta['request']
...
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
...
• Given the name of a cache directory, we can
just loop over all of the metadata. Trivial!
• With work, could perform various kinds of
queries and processing of the data
103. Another Example
• Find all requests related to Slashdot
>>> for meta in scan_cache("Cache/"):
... if 'slashdot' in meta['request']:
... print meta['request']
...
http://www.slashdot.org/
http://images.slashdot.org/topics/topiccommunications.gif
http://images.slashdot.org/topics/topicstorage.gif
http://images.slashdot.org/comments.css?T_2_5_0_176
...
• Well, that was pretty easy.
104. Another Example
• Find all large JPEG images in the cache
>>> jpegs = (meta for meta in scan_cache("Cache/")
if meta['content-type'] == 'image/jpeg'
and meta['datasize'] > 100000)
>>> for j in jpegs:
... print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/
story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/
apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...
>>>
• That was also pretty easy
105. Part 6 : Scan Everything
• Write a function that takes a list of cache
directories and produces a sequence of all
cache metadata found in all of them.
• A single utility function that lets us query
everything.
106. Scanning Everything
# scan an entire list of cache directories producing
# a sequence of records
def scan(cachedirs):
    if isinstance(cachedirs,str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta
107. Type Checking
• This bit of code is an example of type checking.
• If the argument is a string, we convert it to a list
with one item. This allows the following usage:
scan("CacheDir")
scan(["CacheDir1","CacheDir2",...])
# scan an entire list of cache directories producing
# a sequence of records
def scan(cachedirs):
    if isinstance(cachedirs,str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta
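The accept-a-string-or-a-list idiom can be exercised on its own; everything below (the _fake_caches dictionary and its records) is invented for illustration, standing in for real cache directories:

```python
# Fake stand-ins for real cache directories and their metadata records
_fake_caches = {
    'CacheDir1': [{'request': 'http://www.yahoo.com/'}],
    'CacheDir2': [{'request': 'http://www.slashdot.org/'}],
}

def scan(cachedirs):
    # Normalize: a bare string becomes a one-element list
    if isinstance(cachedirs, str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in _fake_caches[cdir]:
            yield meta

# Both call styles now work
r1 = [m['request'] for m in scan('CacheDir1')]
r2 = [m['request'] for m in scan(['CacheDir1', 'CacheDir2'])]
assert r1 == ['http://www.yahoo.com/']
assert r2 == ['http://www.yahoo.com/', 'http://www.slashdot.org/']
```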
108. Putting it all together
# slack.py
# Find all of those slackers who should be working
import sys, os, ffcache

if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python slack.py dirname"
    raise SystemExit(1)

caches = (path for path,dirs,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files)

for meta in ffcache.scan(caches):
    if 'slashdot' in meta['request']:
        print meta['request']
        print meta['cachedir']
        print
109. Intermission
• Have written a simple library ffcache.py
• The library takes a moderately complex data
processing problem and breaks it up into
pieces.
• About 100 lines of code.
• Now, let's build an application...
110. Problem : CacheSpy
• Big Brother (make an evil sound here)
Write a program that first locates all of the Firefox
cache directories under a given directory. Then
have that program run forever as a network server,
waiting for connections. On each connection, send
back all of the current cache metadata.
• Big Picture
We're going to write a daemon that will find and
quietly report on browser cache contents.
111. cachespy.py
import sys, os, pickle, SocketServer, ffcache
SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
112. SocketServer Module
• SocketServer : A module for easily creating
low-level internet applications using sockets.
import sys, os, pickle, SocketServer, ffcache
SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
113. SocketServer Handlers
• You define a simple class that implements
handle(). This implements the server logic.
import sys, os, pickle, SocketServer, ffcache
SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
114. SocketServer Servers
• Next, you just create a Server object, hook the
handler up to it, and run the server.
import sys, os, pickle, SocketServer, ffcache
SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
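A client for this server would simply read pickles off the socket until EOF. The wire format can be sketched without a network at all: repeated pickle.dump calls on one end, pickle.load in a loop (catching EOFError) on the other. A hypothetical client would wrap its socket with makefile() and do exactly the same thing. The records below are made up:

```python
import io
import pickle

# Writer side: what dump_cache() does to the socket's file object
records = [{'request': 'http://www.slashdot.org/', 'datasize': 8645},
           {'request': 'http://www.yahoo.com/',    'datasize': 29448}]
buf = io.BytesIO()
for meta in records:
    pickle.dump(meta, buf)

# Reader side: load pickles back one at a time until EOF
buf.seek(0)
received = []
while True:
    try:
        received.append(pickle.load(buf))
    except EOFError:
        break

assert received == records
```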