1. Python in Action
Presented at USENIX LISA Conference
November 16, 2007
David M. Beazley
http://www.dabeaz.com
(Part II - Systems Programming)
Copyright (C) 2007, http://www.dabeaz.com 2- 1
2. Section Overview
• In this section, we're going to get dirty
• Systems Programming
• Files, I/O, file-system
• Text parsing, data decoding
• Processes and IPC
• Networking
• Threads and concurrency
3. Commentary
• I personally think Python is a fantastic tool for
systems programming.
• Modules provide access to most of the major
system libraries I used to access via C
• No enforcement of "morality"
• Decent performance
• It just "works" and it's fun
4. Approach
• I've thought long and hard about how I
would present this part of the class.
• A reference manual approach would
probably be long and very boring.
• So instead, we're going to focus on building
something more in tune with the times
5. "To Catch a Slacker"
• Write a collection of Python programs that can
quietly monitor Firefox browser caches to find
out who has been spending their day reading
Slashdot instead of working on their TPS reports.
• Oh yeah, and be a real sneaky bugger about it.
6. Why this Problem?
• Involves a real-world system and data
• Firefox already installed on your machine (?)
• Cross platform (Linux, Mac, Windows)
• Example of tool building
• Related to a variety of practical problems
• A good tour of "Python in Action"
7. Disclaimers
• I am not involved in browser forensics (or
spyware for that matter).
• I am in no way affiliated with Firefox/Mozilla
nor have I ever seen Firefox source code
• I have never worked with the cache data
prior to preparing this tutorial
• I have never used any third-party tools for
looking at this data.
8. More Disclaimers
• All of the code in this tutorial works with a
standard Python installation
• No third party modules.
• All code is cross-platform
• Code samples are available online at
http://www.dabeaz.com/action/
• Please look at that code and follow along
9. Assumptions
• This is not a tutorial on systems concepts
• You should be generally familiar with
background material (files, filesystems, file
formats, processes, threads, networking,
protocols, etc.)
• Hopefully you can "extrapolate" from the
material presented here to construct more
advanced Python applications.
10. The Big Picture
• We want to write a tool that allows
someone to locate, inspect, and perform
queries across a distributed collection of
Firefox caches.
• For example, the cache directories on all
machines on the LAN of a quasi-evil
corporation.
12. Problem : Finding Files
• Find the Firefox cache
Write a program findcache.py that takes a directory
name as input and recursively scans that directory
and all subdirectories looking for Firefox/Mozilla
cache directories.
• Example:
% python findcache.py /Users/beazley
/Users/beazley/Library/.../qs1ab616.default/Cache
/Users/beazley/Library/.../wxuoyiuf.slt/Cache
%
• Use case: Searching for things on the filesystem.
13. findcache.py
# findcache.py
# Recursively scan a directory looking for
# Firefox/Mozilla cache directories
import sys
import os
if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python findcache.py dirname"
    raise SystemExit(1)

caches = (path for path,dirs,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files)

for name in caches:
    print name
14. The sys module
• The sys module has basic information related
to the execution environment:
sys.argv # A list of the command line options
sys.stdin # Standard input
sys.stdout # Standard output
sys.stderr # Standard error
• For the command 'python findcache.py /Users/beazley':
sys.argv = ['findcache.py',
            '/Users/beazley']
15. Program Termination
• The SystemExit exception forces Python to
exit. Its value is the return code:
if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python findcache.py dirname"
    raise SystemExit(1)
16. os Module
• The os module contains useful OS related
functions (files, processes, etc.)
import os
17. os.walk()
os.walk(topdir)
• Recursively walks a directory tree and
generates a sequence of tuples (path,dirs,files)
path = The current directory name
dirs = List of all subdirectory names in path
files = List of all regular files (data) in path
caches = (path for path,dirs,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files)
18. A Sequence of Caches
• This statement generates a sequence of
directory names where '_CACHE_MAP_' is
contained in the file list:
caches = (path for path,dirs,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files)
• The 'if' clause is the file name check; 'path' is
the directory name that is generated as a result
19. Printing the Result
• This prints the sequence of cache directories
that are generated by the previous statement:
for name in caches:
    print name
20. Commentary
• Our solution is strongly based on a
"declarative" programming style (again)
• We simply write out a sequence of
operations that produce what we want
• Not focused on the underlying mechanics
of how to traverse all of the directories.
21. Big Idea : Iteration
• Python allows iteration to be captured as a
kind of object.
caches = (path for path,dirs,files in os.walk(sys.argv[1])
if '_CACHE_MAP_' in files)
• This de-couples iteration from the code that
uses the iteration
for name in caches:
    print name
• Another usage example:
for name in caches:
    print len(os.listdir(name)), name
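The decoupling can be seen in a tiny self-contained sketch (the path strings below are invented for illustration, not a real cache scan):

```python
# A generator expression captures the iteration itself as an object;
# nothing is scanned until some other code consumes it.
paths = ['a/Cache', 'b/notes', 'c/Cache']
caches = (p for p in paths if p.endswith('Cache'))

# The consuming loop is written separately from the filtering logic.
found = list(caches)
```

Note that a generator is exhausted after one pass; re-create it to iterate again.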
22. Big Idea : Iteration
• Compare to this:
for path,dirs,files in os.walk(sys.argv[1]):
    if '_CACHE_MAP_' in files:
        print len(os.listdir(path)),path
• This code is simple, but the loop and the
code that executes in the loop body are
coupled together
• Not as flexible, but this is somewhat subtle
to wrap your brain around at first.
23. Mini-Reference : sys, os
• sys module
sys.argv # List of command line options
sys.stdin # Standard input
sys.stdout # Standard output
sys.stderr # Standard error
sys.executable # Full path of Python executable
sys.exc_info() # Information on current exception
• os module
os.walk(dir) # Recursively walk dir producing a
# sequence of tuples (path,dlist,flist)
os.listdir(dir) # Return a list of all files in dir
• SystemExit exception
raise SystemExit(n) # Exit with integer code n
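The os.walk() pattern can be tried end-to-end on a throw-away directory tree (tempfile usage is an assumption of this sketch, not part of the original tool):

```python
import os
import tempfile

# Build a tiny directory tree containing a '_CACHE_MAP_' marker file,
# then locate it with os.walk() -- the same test findcache.py performs.
top = tempfile.mkdtemp()
cachedir = os.path.join(top, 'profile', 'Cache')
os.makedirs(cachedir)
open(os.path.join(cachedir, '_CACHE_MAP_'), 'w').close()

hits = [path for path, dirs, files in os.walk(top)
        if '_CACHE_MAP_' in files]
```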
24. Problem: Searching for Text
• Extract all URL requests from the cache
Write a program requests.py that scans the contents
of the _CACHE_00n_ files and prints a list of URLs
for documents stored in the cache.
• Example:
% python requests.py /Users/.../qs1ab616.default/Cache
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1.1.j
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
...
%
• Use case: Searching the contents of files for
text patterns.
25. The Firefox Cache
• The cache directory holds two types of data
• Metadata (URLs, headers, etc.).
• Raw data (HTML, JPEG, PNG, etc.)
• This data is stored in two places
• Cryptic files in the Cache directory
• Blocks inside the _CACHE_00n_ files
• Metadata almost always in _CACHE_00n_
26. Possible Solution : Regex
• The _CACHE_00n_ files are encoded in a
binary format, but URLs are embedded
inside as null-terminated text:
\x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f
\xceF\xff\x9f\xce\x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a
\x00\x00\x023HTTP:http://slashdot.org/\x00request-method\x00
GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U; Intel
Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00
request-Accept-Encoding\x00gzip,deflate\x00response-head\x00
HTTP/1.1 200 OK\r\nDate: Sun, 30 Sep 2007 13:07:29 GMT\r\n
Server: Apache/1.3.37 (Unix) mod_perl/1.29\r\nSLASH_LOG_DATA:
shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I
live my life if I can't tell good from evil?\r\nCache-Control:
• Maybe the requests could just be ripped
using a regular expression.
27. A Regex Solution
# requests.py
import re
import os
import sys
cachedir = sys.argv[1]
cachefiles = [ '_CACHE_001_', '_CACHE_002_', '_CACHE_003_' ]
# A regex for embedded URL strings
request_pat = re.compile(r'([a-z]+://.*?)\x00')

# Loop over all files and search for URLs
for name in cachefiles:
    data = open(os.path.join(cachedir,name),"rb").read()
    index = 0
    while True:
        m = request_pat.search(data,index)
        if not m: break
        print m.group(1)
        index = m.end()
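The search loop can be exercised on a small in-memory string (the data below is invented, not a real cache file):

```python
import re

# Find matches one at a time by restarting the search at the end
# of the previous match.
request_pat = re.compile(r'([a-z]+://.*?)\x00')
data = 'junk\x00http://example.com/\x00more\x00ftp://host/\x00'

urls = []
index = 0
while True:
    m = request_pat.search(data, index)
    if not m:
        break
    urls.append(m.group(1))
    index = m.end()
```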
28. The re module
• The re module contains all functionality related
to regular expression pattern matching,
searching, replacing, etc.
• Features are strongly influenced by Perl, but
regexes are not directly integrated into the
Python language.
import re
29. Using re
• Patterns are first specified as strings and
compiled into a regex object:
pat = re.compile(pattern [,flags])
request_pat = re.compile(r'([a-z]+://.*?)\x00')
• The pattern syntax is "standard":
pat*        pat1|pat2
pat+        [chars]
pat?        [^chars]
(pat)       pat{n}
.           pat{n,m}
30. Using re
• All subsequent operations are methods of the
compiled regex pattern:
m = pat.match(data [,start])    # Check for match
m = pat.search(data [,start])   # Search for match
newdata = pat.sub(repl, data)   # Pattern replace
31. Searching for Matches
pat.search(text [,start])
• Searches the string text for the first occurrence
of the regex pattern starting at position start.
• Returns a "MatchObject" if a match is found.
• In the code below, we're finding matches one
at a time:
index = 0
while True:
    m = request_pat.search(data,index)
    if not m: break
    print m.group(1)
    index = m.end()
32. Match Objects
• Regex matches are represented by a MatchObject:
m.group([n])  # Text matched by group n
m.start([n])  # Starting index of group n
m.end([n])    # End index of group n
• In the search loop, m.group(1) is the matching
text for just the URL, and m.end() is the end of
the match:
m = request_pat.search(data,index)
if not m: break
print m.group(1)
index = m.end()
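A quick sketch of the MatchObject attributes, using a toy pattern and string chosen for illustration:

```python
import re

# Group 1 is the parenthesized part (the scheme);
# group 0 is the entire matched text.
pat = re.compile(r'([a-z]+)://')
m = pat.search('see http://x')
```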
33. Groups
• In patterns, parentheses () define groups which
are numbered left to right:
group 0   # The entire pattern
group 1   # Text in first ()
group 2   # Text in next ()
...
• In requests.py, m.group(1) returns the text
matched by the first (): the URL
request_pat = re.compile(r'([a-z]+://.*?)\x00')
34. Mini-Reference : re
• re pattern compilation
pat = re.compile(r'patternstring')
• Pattern syntax
literal # Match literal text
pat* # Match 0 or more repetitions of pat
pat+ # Match 1 or more repetitions of pat
pat? # Match 0 or 1 repetitions of pat
pat1|pat2 # Match pat1 or pat2
(pat) # Match pat (group)
[chars] # Match characters in chars
[^chars] # Match characters not in chars
. # Match any character except \n
\d # Match any digit
\w # Match alphanumeric character
\s # Match whitespace
35. Mini-Reference : re
• Common pattern operations
pat.search(text) # Search text for a match
pat.match(text) # Search start of text for match
pat.sub(repl,text) # Replace pattern with repl
• Match objects
m.group([n]) # Text matched by group n
m.start([n]) # Starting position of group n
m.end([n]) # Ending position of group n
• How to loop over all matches of a pattern
for m in pat.finditer(text):
# m is a MatchObject that you process
...
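The finditer() loop can be sketched with a small in-memory string (hypothetical data in the same null-terminated style as the cache files):

```python
import re

# finditer() yields one MatchObject per match, replacing the
# manual search/index loop.
pat = re.compile(r'([a-z]+://.*?)\x00')
data = 'x\x00http://slashdot.org/\x00y\x00https://example.org/\x00'

urls = [m.group(1) for m in pat.finditer(data)]
```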
36. Mini-Reference : re
• An example of pattern replacement
# This replaces American dates of the form 'mm/dd/yyyy'
# with European dates of the form 'dd/mm/yyyy'.
# This function takes a MatchObject as input and returns
# replacement text as output.
def euro_date(m):
    month = m.group(1)
    day   = m.group(2)
    year  = m.group(3)
    # group() returns strings, so use %s formatting
    return "%s/%s/%s" % (day,month,year)

# Date re pattern and replacement operation
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
newdata = datepat.sub(euro_date,text)
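The function-replacement form of sub() can be tried on a sample string (the dates below are invented; note that group() returns strings, hence %s formatting):

```python
import re

def euro_date(m):
    # Swap 'mm/dd/yyyy' to 'dd/mm/yyyy'
    month, day, year = m.group(1), m.group(2), m.group(3)
    return "%s/%s/%s" % (day, month, year)

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
newdata = datepat.sub(euro_date, "Due 10/30/2007, shipped 9/25/2007")
```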
37. Mini-Reference : re
• There are many more features of the re
module
• Strongly influenced by Perl (feature set)
• Regexs are a library in Python, not integrated
into the language.
• A book on regular expressions may be
essential for advanced functions.
38. File Handling
• What is going on in this statement?
data = open(os.path.join(cachedir,name),"rb").read()
39. os.path module
• os.path has portable file related functions:
os.path.join(name1,name2,...)  # Join path names
os.path.getsize(filename)      # Get the file size
os.path.getmtime(filename)     # Get modification date
• There are many more functions, but this is the
preferred module for basic filename handling
40. os.path.join()
• Creates a fully-expanded pathname:
dirname  = '/foo/bar'
filename = 'name'
os.path.join(dirname,filename)  # '/foo/bar/name'
• Aware of platform differences ('/' vs. '\')
41. Mini-Reference : os.path
os.path.join(s1,s2,...) # Join pathname parts together
os.path.getsize(path) # Get file size of path
os.path.getmtime(path) # Get modify time of path
os.path.getatime(path) # Get access time of path
os.path.getctime(path) # Get creation time of path
os.path.exists(path) # Check if path exists
os.path.isfile(path) # Check if regular file
os.path.isdir(path) # Check if directory
os.path.islink(path) # Check if symbolic link
os.path.basename(path) # Return file part of path
os.path.dirname(path) # Return dir part of path
os.path.abspath(path) # Get absolute path
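A few of these can be demonstrated deterministically with posixpath, the Unix flavor behind os.path (used here so the expected results do not depend on which platform runs the example):

```python
import posixpath

# Join parts into a path, then split the result back apart.
p = posixpath.join('/foo/bar', 'name.txt')

d = posixpath.dirname(p)    # directory part
b = posixpath.basename(p)   # file part
```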
42. Binary I/O
• For all binary files, use modes "rb","wb", etc.
data = open(os.path.join(cachedir,name),"rb").read()
• Disables new-line translation (critical on Windows)
43. Common I/O Shortcuts
# Read an entire file into a string
data = open(filename).read()

# Write a string out to a file
open(filename,"w").write(text)

# Loop over all lines in a file
for line in open(filename):
    ...
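These shortcuts can be exercised round-trip through a temporary file (the tempfile setup is an assumption of this sketch; like the shortcuts themselves, it leaves file closing to the garbage collector):

```python
import os
import tempfile

fname = os.path.join(tempfile.mkdtemp(), 'sample.txt')

# Write a string out to a file
open(fname, 'w').write('line1\nline2\n')

# Read the entire file back into a string
data = open(fname).read()

# Loop over all lines in the file
lines = [line.rstrip() for line in open(fname)]
```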
44. Commentary on Solution
• This regex approach is mostly a hack for this
particular application.
• Reads entire cache files into memory as
strings (may be quite large)
• Only finds URLs, no other metadata
• Some risk of false positives since URLs could
also be embedded in data.
45. Commentary
• We have started to build a collection of
very simple command line tools
• Very much in the "Unix tradition."
• Python makes it easy to create such tools
• More complex applications could be
assembled by simply gluing scripts together
46. Working with Processes
• It is common to write programs that run
other programs, collect their output, etc.
• Pipes
• Interprocess Communication
• Python has a variety of modules for
supporting this.
47. subprocess Module
• A module for creating and interacting with
subprocesses
• Consolidates a number of low-level OS
functions such as system(), execv(), spawnv(),
pipe(), popen2(), etc. into a single module
• Cross platform (Unix/Windows)
48. Example : Slackers
• Find slacker cache entries.
Using the programs findcache.py and requests.py as
subprocesses, write a program that inspects cache
directories and prints out all entries that contain the
word 'slashdot' in the URL.
49. slackers.py
# slackers.py
import sys
import subprocess
# Run findcache.py as a subprocess
finder = subprocess.Popen(
    [sys.executable,"findcache.py",sys.argv[1]],
    stdout=subprocess.PIPE)
dirlist = [line.strip() for line in finder.stdout]

# Run requests.py as a subprocess
for cachedir in dirlist:
    searcher = subprocess.Popen(
        [sys.executable,"requests.py",cachedir],
        stdout=subprocess.PIPE)
    for line in searcher.stdout:
        if 'slashdot' in line: print line,
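The Popen pattern can be tested without the findcache.py/requests.py scripts by launching the interpreter itself with a tiny -c program (a stand-in for the real tools):

```python
import sys
import subprocess

# Launch a python subprocess and read its stdout through a pipe.
# universal_newlines=True makes the pipe yield text in Python 3 as well.
child = subprocess.Popen(
    [sys.executable, '-c', "import sys; sys.stdout.write('a\\nb\\n')"],
    stdout=subprocess.PIPE, universal_newlines=True)

out = [line.strip() for line in child.stdout]
child.wait()
```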
50. Launching a subprocess
• This is launching a python script as a
subprocess, connecting its stdout stream to a
pipe:
finder = subprocess.Popen(
    [sys.executable,"findcache.py",sys.argv[1]],
    stdout=subprocess.PIPE)
• Output is collected from the pipe, with newline
stripping:
dirlist = [line.strip() for line in finder.stdout]
51. Python Executable
• sys.executable is the full pathname of the
python interpreter:
finder = subprocess.Popen(
    [sys.executable,"findcache.py",sys.argv[1]],
    stdout=subprocess.PIPE)
52. Subprocess Arguments
• The list of arguments to the subprocess
corresponds to what would appear on a shell
command line:
[sys.executable,"findcache.py",sys.argv[1]]
[sys.executable,"requests.py",cachedir]
53. slackers.py
• More of the same idea. For each directory we
found in the last step, we run requests.py to
produce requests:
for cachedir in dirlist:
    searcher = subprocess.Popen(
        [sys.executable,"requests.py",cachedir],
        stdout=subprocess.PIPE)
    for line in searcher.stdout:
        if 'slashdot' in line: print line,
54. Commentary
• subprocess is a large module with many options.
• However, it takes care of a lot of annoying
platform-specific details for you.
• Currently the "recommended" way of dealing
with subprocesses.
55. Low Level Subprocesses
• Running a simple system command
os.system("shell command")
• Connecting to a subprocess with pipes
pout, pin = popen2.popen2("shell command")
• Exec/spawn
os.execv(),os.execl(),os.execle(),...
os.spawnv(),os.spawnvl(), os.spawnle(),...
• Unix fork()
os.fork(), os.wait(), os.waitpid(), os._exit(), ...
56. Interactive Processes
• Python does not have built-in support for
controlling interactive subprocesses (e.g.,
"Expect")
• Must install third party modules for this
• Example: pexpect
• http://pexpect.sourceforge.net
57. Commentary
• Writing small Unix-like utilities is fairly
straightforward in Python
• Support for standard kinds of operations (files,
regular expressions, pipes, subprocesses, etc.)
• However, our solution is also kind of clunky
• Only returns some information
• Not particularly memory efficient (reads large
files into memory)
58. Interlude
• Python is well-suited to building libraries
and frameworks.
• In the next part, we're going to take a
totally different approach than simply
writing simple utilities.
• Will build libraries for manipulating cache
data and use those libraries to build tools.
59. Problem : Parsing Data
• Extract the cache data (for real)
Write a module ffcache.py that contains a set of
functions for reading Firefox cache data into useful
data structures that can be used by other programs.
Capture all available information including URLs,
timestamps, sizes, locations, content types, etc.
• Use case: Blood and guts
Writing programs that can process foreign file
formats. Processing binary encoded data. Creating
code for later reuse.
60. The Firefox Cache
• There are four critical files
_CACHE_MAP_ # Cache index
_CACHE_001_ # Cache data
_CACHE_002_ # Cache data
_CACHE_003_ # Cache data
• All files are binary-encoded
• _CACHE_MAP_ is used by Firefox to locate
data, but it is not updated until Firefox exits.
• We will ignore _CACHE_MAP_ since we want
to observe caches of live Firefox sessions.
61. Firefox _CACHE_ Files
• _CACHE_00n_ file organization
Free/used block bitmap 4096 bytes
Blocks Up to 32768 blocks
• The block size varies according to the file:
_CACHE_001_ 256 byte blocks
_CACHE_002_ 1024 byte blocks
_CACHE_003_ 4096 byte blocks
62. Cache Entries
• Each cache entry:
• A maximum of 4 cache blocks
• Can either be data or metadata
• If >16K, written to a file instead
• Notice how all the "cryptic" files are >16K
-rw------- beazley 111169 Sep 25 17:15 01CC0844d01
-rw------- beazley 104991 Sep 25 17:15 01CC3844d01
-rw------- beazley 47233 Sep 24 16:41 021F221Ad01
...
-rw------- beazley 26749 Sep 21 11:19 FF8AEDF0d01
-rw------- beazley 58172 Sep 25 18:16 FFE628C6d01
63. Cache Metadata
• Metadata is encoded as a binary structure
Header 36 bytes
Request String Variable length (in header)
Request Info Variable length (in header)
• Header encoding (binary, big-endian)
0-3 magic (???) unsigned int (0x00010008)
4-7 location unsigned int
8-11 fetchcount unsigned int
12-15 fetchtime unsigned int (system time)
16-19 modifytime unsigned int (system time)
20-23 expiretime unsigned int (system time)
24-27 datasize unsigned int (byte count)
28-31 requestsize unsigned int (byte count)
32-35 infosize unsigned int (byte count)
64. Solution Outline
• Part 1: Parsing Metadata Headers
• Part 2: Getting request information (URL)
• Part 3: Extracting additional content info
• Part 4: Scanning of individual cache files
• Part 5: Scanning an entire directory
• Part 6: Scanning a list of directories
65. Part I - Reading Headers
• Write a function that can parse the binary
metadata header and return the data in a
useful format
66. Reading Headers
import struct
# This function parses a cache metadata header into a dict
# of named fields (listed in _headernames below)
_headernames = ['magic','location','fetchcount',
'fetchtime','modifytime','expiretime',
'datasize','requestsize','infosize']
def parse_meta_header(headerdata):
    head = struct.unpack(">9I",headerdata)
    meta = dict(zip(_headernames,head))
    return meta
67. Reading Headers
• How this is supposed to work:
>>> f = open("Cache/_CACHE_001_","rb")
>>> f.seek(4096) # Skip the bit map
>>> headerdata = f.read(36) # Read 36 byte header
>>> meta = parse_meta_header(headerdata)
>>> meta
{'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
>>>
• Basically, we're parsing the header into a
useful Python data structure (a dictionary)
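Since a real cache file may not be handy, the parse can be verified against a fabricated header packed with struct.pack() (the values below are invented to mirror the session above):

```python
import struct

_headernames = ['magic','location','fetchcount',
                'fetchtime','modifytime','expiretime',
                'datasize','requestsize','infosize']

def parse_meta_header(headerdata):
    head = struct.unpack(">9I", headerdata)
    return dict(zip(_headernames, head))

# Pack 9 big-endian unsigned ints into a 36-byte header and parse it.
values = (0x00010008, 0, 3, 1190829792, 1190829792, 0, 29448, 27, 531)
headerdata = struct.pack(">9I", *values)
meta = parse_meta_header(headerdata)
```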
68. struct module
• The struct module parses binary encoded data
into Python objects.
• You would use this module to pack/unpack raw
binary data from Python strings.
head = struct.unpack(">9I",headerdata)
• The format ">9I" unpacks 9 unsigned 32-bit
big-endian integers
69. struct module
• The result is always a tuple of converted values:
head = struct.unpack(">9I",headerdata)
head = (65544, 0, 1, 1191682051, 1191682051,
        0, 8645, 190, 218)
70. Dictionary Creation
• zip(s1,s2) makes a list of tuples:
zip(_headernames,head)  [('magic',head[0]),
                         ('location',head[1]),
                         ('fetchcount',head[2]),
                         ...]
• dict() then makes a dictionary:
meta = dict(zip(_headernames,head))
71. Commentary
• Dictionaries as data structures
meta = { 'fetchtime' : 1190829792,
'requestsize' : 27,
'magic' : 65544,
'fetchcount' : 3,
'expiretime' : 0,
'location' : 2449473536L,
'modifytime' : 1190829792,
'datasize' : 29448,
'infosize' : 531 }
• Useful if data has many parts
data = f.read(meta[8]) # Huh?!?
vs.
data = f.read(meta['infosize']) # Better
72. Mini-reference : struct
• struct module
items = struct.unpack(fmt,data)
data = struct.pack(fmt,item1,...,itemn)
• Sample Format codes
'c' char (1 byte string)
'b' signed char (8-bit integer)
'B' unsigned char (8-bit integer)
'h' signed short (16-bit integer)
'H' unsigned short (16-bit integer)
'i' int (32-bit integer)
'I' unsigned int (32-bit integer)
'f' 32-bit single precision float
'd' 64-bit double precision float
's' char s[] (String)
'>' Big endian modifier
'<' Little endian modifier
'!' Network order modifier
'n' Repetition count modifier
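As a quick sanity check of these format codes, a small round trip (not in the original deck; the values are illustrative) shows how pack and unpack mirror each other:

```python
import struct

# Pack 9 unsigned 32-bit integers, big-endian -- the same ">9I"
# layout used for the cache metadata header above
values = (65544, 0, 3, 1190829792, 1190829792, 0, 29448, 27, 531)
headerdata = struct.pack(">9I", *values)
assert len(headerdata) == 36          # 9 * 4 bytes, matches f.read(36)

# Unpacking reverses the operation and returns a tuple
unpacked = struct.unpack(">9I", headerdata)
assert unpacked == values
```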
73. Part 2 : Parsing Requests
• Write a function that will read the URL
request string and request information
• Request String : A Null-terminated string
• Request Info : A sequence of Null-terminated
key-value pairs (like a dictionary)
74. Parsing Requests
import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
76. String Stripping
• The request data is a sequence of null-terminated
strings. This splits the data up into parts:
requestdata = 'part\x00part\x00part\x00part\x00...'
requestdata.split('\x00') ->
parts = ['part','part','part','part',...]
import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
77. String Validation
• Individual parts should be printable text (plus the
newline characters '\n' and '\r').
• We use the re module to match each string. This
would help catch cases where we might be reading
bad data (false headers, raw data, etc.).
import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
78. URL Request String
• The request string is the first part. The check that
follows makes sure it's the right size (a further
sanity check on the data integrity).
import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
79. Request Info
• Each request has a set of associated data
represented as key/value pairs:
parts = ['request','key','val','key','val','key','val']
parts[1::2] -> ['key','key','key']
parts[2::2] -> ['val','val','val']
zip(parts[1::2],parts[2::2]) -> [('key','val'),
                                 ('key','val'),
                                 ('key','val')]
import re
part_pat = re.compile(r'[\n\r -~]*$')

def parse_request_data(meta,requestdata):
    parts = requestdata.split('\x00')
    for part in parts:
        if not part_pat.match(part):
            return False
    request = parts[0]
    if len(request) != (meta['requestsize'] - 1):
        return False
    # Makes a dictionary from (key,val) tuples
    info = dict(zip(parts[1::2],parts[2::2]))
    meta['request'] = request.split(':',1)[1]
    meta['info'] = info
    return True
80. Fixing the Request
• Cleaning up the request string:
request = "HTTP:http://www.google.com"
request.split(':',1)    -> ['HTTP','http://www.google.com']
request.split(':',1)[1] -> 'http://www.google.com'
# Given a dictionary of header information and a file,
# this function extracts the request data from a cache
# metadata entry and saves it in the dictionary. Returns
# True or False depending on success.
def read_request_data(header,f):
    request = f.read(header['requestsize']).strip('\x00')
    infodata = f.read(header['infosize']).strip('\x00')
    # Validate request and infodata here (nothing now)
    # Turn the infodata into a dictionary
    parts = infodata.split('\x00')
    info = dict(zip(parts[::2],parts[1::2]))
    header['request'] = request.split(':',1)[1]
    header['info'] = info
    return True
81. Commentary
• Emphasize that Python has very powerful
list manipulation primitives
• Indexing
• Slicing
• List comprehensions
• Etc.
• Knowing how to use these leads to rapid
development and compact code
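For instance, the list primitives above can be checked interactively; this small sketch (with made-up values) exercises indexing, strided slicing, and a comprehension on the same list:

```python
parts = ['request', 'key1', 'val1', 'key2', 'val2']

# Indexing: the first element is the request string
assert parts[0] == 'request'

# Strided slices pull out alternating keys and values
assert parts[1::2] == ['key1', 'key2']
assert parts[2::2] == ['val1', 'val2']

# zip + dict pairs them up into a dictionary
info = dict(zip(parts[1::2], parts[2::2]))
assert info == {'key1': 'val1', 'key2': 'val2'}

# A list comprehension filters/transforms in one line
keys = [p for p in parts[1:] if p.startswith('key')]
assert keys == ['key1', 'key2']
```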
82. Part 3: Content Info
• All documents on the internet have
optional content-type, encoding, and
character set information.
• Let's add this information since it will make
it easier for us to determine the type of
files that are stored in the cache (i.e.,
images, movies, HTML, etc.)
83. HTTP Responses
• The cache metadata includes an HTTP
response header
>>> print meta['info']['response-head']
HTTP/1.1 200 OK
Date: Sat, 29 Sep 2007 20:51:37 GMT
Cache-Control: private
Vary: User-Agent
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
>>>
Content type, character set,
and encoding.
84. Solution
# Given a metadata dictionary, this function adds additional
# fields related to the content type, charset, and encoding
import email

def add_content_info(meta):
    info = meta['info']
    if 'response-head' not in info:
        return
    else:
        rhead = info.get('response-head').split("\n",1)[1]
        m = email.message_from_string(rhead)
        content = m.get_content_type()
        encoding = m.get('content-encoding',None)
        charset = m.get_content_charset()
        meta['content-type'] = content
        meta['content-encoding'] = encoding
        meta['charset'] = charset
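To see what the email module is doing for us here, a standalone sketch with a fabricated response header (status line already stripped, as the split("\n",1)[1] above does) looks like this:

```python
import email

# A made-up HTTP response header, minus the status line
rhead = (
    "Date: Sat, 29 Sep 2007 20:51:37 GMT\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Encoding: gzip\n"
)
m = email.message_from_string(rhead)

# The parsed message gives us typed access to the header fields
assert m.get_content_type() == 'text/html'
assert m.get_content_charset() == 'utf-8'
assert m.get('content-encoding') == 'gzip'   # lookups are case-insensitive
```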
85. Internet Data Handling
• Python has a vast assortment of internet data
handling modules.
• email : Parsing of email messages, MIME
headers, etc.
# Given a metadata dictionary, this function adds additional
# fields related to the content type, charset, and encoding
import email

def add_content_info(meta):
    info = meta['info']
    if 'response-head' not in info:
        return
    else:
        rhead = info.get('response-head').split("\n",1)[1]
        m = email.message_from_string(rhead)
        content = m.get_content_type()
        encoding = m.get('content-encoding',None)
        charset = m.get_content_charset()
        meta['content-type'] = content
        meta['content-encoding'] = encoding
        meta['charset'] = charset
86. Internet Data Handling
• In this code, we parse the HTTP response headers
using the email module and extract content-type,
encoding, and charset information.
# Given a metadata dictionary, this function adds additional
# fields related to the content type, charset, and encoding
import email

def add_content_info(meta):
    info = meta['info']
    if 'response-head' not in info:
        return
    else:
        rhead = info.get('response-head').split("\n",1)[1]
        m = email.message_from_string(rhead)
        content = m.get_content_type()
        encoding = m.get('content-encoding',None)
        charset = m.get_content_charset()
        meta['content-type'] = content
        meta['content-encoding'] = encoding
        meta['charset'] = charset
87. Commentary
• Python is heavily used in Internet applications
• There are modules for parsing common types
of data (email, HTML, XML, etc.)
• There are modules for processing bits and
pieces of internet data (URLs, MIME types,
RFC822 headers, etc.)
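For example, pulling a URL apart takes one call. The sketch below uses the modern urllib.parse module (in the Python 2 of this deck, the same functionality lives in the urlparse module):

```python
from urllib.parse import urlparse

# Break a cache request URL into scheme, host, and path
u = urlparse('http://images.slashdot.org/topics/topicstorage.gif')
assert u.scheme == 'http'
assert u.netloc == 'images.slashdot.org'
assert u.path == '/topics/topicstorage.gif'
```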
88. Part 4: File Scanning
• Write a function that scans a single cache
file and produces a sequence of records
containing all of the cache metadata.
• This is just one more of our building blocks
• The goal is to hide some of the nasty bits
89. File Scanning
# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+
                                 meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
90. Usage : File Scanning
• Usage of the scan function
>>> f = open("Cache/_CACHE_001_","rb")
>>> for meta in scan_cachefile(f,256):
...     print meta['request']
...
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
...
• We can just open up a cache file and write a
for-loop to iterate over all of the entries.
91. Python File I/O
• File Objects: Modeled after ANSI C. Files are just
bytes; the file pointer keeps track of the current
position.
f.read()      # Read bytes
f.tell()      # Current fp
f.seek(n,off) # Move fp
# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+
                                 meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
92. Using Earlier Code
• Here we are using our header parsing functions
written in previous parts.
• Note: We are progressively adding more data to
the meta dictionary.
# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+
                                 meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
93. Data Validation
• This is a sanity check to make sure the header
data looks like a valid header.
# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+
                                 meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
94. Generating Results
• We are using yield to produce data for a single
cache entry. If someone uses a for-loop, they will
get all of the entries.
• Note: This allows us to process the cache without
reading all of the data into memory.
# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize   # Maximum size of an entry
    f.seek(4096)            # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+
                                 meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)
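The yield mechanics can be seen in isolation with a toy record scanner. The record format below is invented (a 4-byte big-endian length followed by a payload), but the read-parse-yield loop has the same shape as scan_cachefile:

```python
import io
import struct

def scan_records(f):
    # Read length-prefixed records until EOF, yielding one at a time
    while True:
        lendata = f.read(4)
        if not lendata:
            break
        (n,) = struct.unpack(">I", lendata)
        yield f.read(n)

# Build a tiny two-record "file" in memory and scan it lazily
data = struct.pack(">I", 5) + b"hello" + struct.pack(">I", 5) + b"world"
records = list(scan_records(io.BytesIO(data)))
assert records == [b"hello", b"world"]
```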
95. Commentary
• Have created a function that can scan a
single _CACHE_00n_ file and produce a
sequence of dictionaries with metadata.
• It's still somewhat low-level
• Just need to package it a little better
96. Part 5 : Scan a Directory
• Write a function that takes the name of a
Firefox cache directory, scans all of the
cache files for metadata, and produces a
single sequence of records.
• Make it real easy to extract data
97. Solution : Directory Scan
# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.
import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
98. Solution : Directory Scan
• General idea: We loop over the three
_CACHE_00n_ files and produce a sequence of
the cache records.
# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.
import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
99. Solution : Directory Scan
• We use the low-level file scanning function here
to generate a sequence of records.
# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.
import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
100. More Generation
• By using yield here, we are chaining together the
results obtained from all three cache files into one
big long sequence of results.
• The underlying mechanics and implementation
details are hidden (the user doesn't care).
# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.
import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
101. Additional Data
• Adding path and file information to the data
(may be useful later).
# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.
import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir'] = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()
102. Usage : Cache Scan
• Usage of the scan function
>>> for meta in scan_cache("Cache/"):
... print meta['request']
...
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
...
• Given the name of a cache directory, we can
just loop over all of the metadata. Trivial!
• With work, could perform various kinds of
queries and processing of the data
103. Another Example
• Find all requests related to Slashdot
>>> for meta in scan_cache("Cache/"):
... if 'slashdot' in meta['request']:
... print meta['request']
...
http://www.slashdot.org/
http://images.slashdot.org/topics/topiccommunications.gif
http://images.slashdot.org/topics/topicstorage.gif
http://images.slashdot.org/comments.css?T_2_5_0_176
...
• Well, that was pretty easy.
104. Another Example
• Find all large JPEG images in the cache
>>> jpegs = (meta for meta in scan_cache("Cache/")
if meta['content-type'] == 'image/jpeg'
and meta['datasize'] > 100000)
>>> for j in jpegs:
... print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/
story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/
apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...
>>>
• That was also pretty easy
105. Part 6 : Scan Everything
• Write a function that takes a list of cache
directories and produces a sequence of all
cache metadata found in all of them.
• A single utility function that lets us query
everything.
106. Scanning Everything
# scan an entire list of cache directories producing
# a sequence of records
def scan(cachedirs):
    if isinstance(cachedirs,str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta
107. Type Checking
• This bit of code is an example of type checking.
• If the argument is a string, we convert it to a list
with one item. This allows the following usage:
scan("CacheDir")
scan(["CacheDir1","CacheDir2",...])
# scan an entire list of cache directories producing
# a sequence of records
def scan(cachedirs):
    if isinstance(cachedirs,str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta
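The accept-a-string-or-a-list idiom can be exercised on its own; everything below (the _fake_caches dictionary and its records) is invented for illustration, standing in for real cache directories:

```python
# Fake stand-ins for real cache directories and their metadata records
_fake_caches = {
    'CacheDir1': [{'request': 'http://www.yahoo.com/'}],
    'CacheDir2': [{'request': 'http://www.slashdot.org/'}],
}

def scan(cachedirs):
    # Normalize: a bare string becomes a one-element list
    if isinstance(cachedirs, str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in _fake_caches[cdir]:
            yield meta

# Both call styles now work
r1 = [m['request'] for m in scan('CacheDir1')]
r2 = [m['request'] for m in scan(['CacheDir1', 'CacheDir2'])]
assert r1 == ['http://www.yahoo.com/']
assert r2 == ['http://www.yahoo.com/', 'http://www.slashdot.org/']
```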
108. Putting it all together
# slack.py
# Find all of those slackers who should be working
import sys, os, ffcache

if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python slack.py dirname"
    raise SystemExit(1)

caches = (path for path,dirs,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files)

for meta in ffcache.scan(caches):
    if 'slashdot' in meta['request']:
        print meta['request']
        print meta['cachedir']
        print
109. Intermission
• Have written a simple library ffcache.py
• The library takes a moderately complex data
processing problem and breaks it up into
pieces.
• About 100 lines of code.
• Now, let's build an application...
110. Problem : CacheSpy
• Big Brother (make an evil sound here)
Write a program that first locates all of the Firefox
cache directories under a given directory. Then
have that program run forever as a network server,
waiting for connections. On each connection, send
back all of the current cache metadata.
• Big Picture
We're going to write a daemon that will find and
quietly report on browser cache contents.
111. cachespy.py
import sys, os, pickle, SocketServer, ffcache
SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
112. SocketServer Module
• SocketServer : A module for easily creating
low-level internet applications using sockets.
import sys, os, pickle, SocketServer, ffcache
SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
113. SocketServer Handlers
• You define a simple class that implements
handle(). This implements the server logic.
import sys, os, pickle, SocketServer, ffcache
SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
114. SocketServer Servers
• Next, you just create a Server object, hook the
handler up to it, and run the server.
import sys, os, pickle, SocketServer, ffcache
SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
          if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()
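A client for this server would simply read pickles off the socket until EOF. The wire format can be sketched without a network at all: repeated pickle.dump calls on one end, pickle.load in a loop (catching EOFError) on the other. A hypothetical client would wrap its socket with makefile() and do exactly the same thing. The records below are made up:

```python
import io
import pickle

# Writer side: what dump_cache() does to the socket's file object
records = [{'request': 'http://www.slashdot.org/', 'datasize': 8645},
           {'request': 'http://www.yahoo.com/',    'datasize': 29448}]
buf = io.BytesIO()
for meta in records:
    pickle.dump(meta, buf)

# Reader side: load pickles back one at a time until EOF
buf.seek(0)
received = []
while True:
    try:
        received.append(pickle.load(buf))
    except EOFError:
        break

assert received == records
```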