Python in Action
                              Presented at USENIX LISA Conference
                                       November 16, 2007
                                               David M. Beazley
                                            http://www.dabeaz.com

                                        (Part II - Systems Programming)

Copyright (C) 2007, http://www.dabeaz.com                                 2- 1
Section Overview
                • In this section, we're going to get dirty
                • Systems Programming
                    • Files, I/O, file-system
                    • Text parsing, data decoding
                    • Processes and IPC
                    • Networking
                    • Threads and concurrency
Copyright (C) 2007, http://www.dabeaz.com                     2- 2
Commentary
                • I personally think Python is a fantastic tool for
                        systems programming.
                • Modules provide access to most of the major
                        system libraries I used to access via C
                • No enforcement of "morality"
                • Decent performance
                • It just "works" and it's fun
Copyright (C) 2007, http://www.dabeaz.com                             2- 3
Approach

                  • I've thought long and hard about how I
                         would present this part of the class.
                  • A reference manual approach would
                         probably be long and very boring.
                  • So instead, we're going to focus on building
                         something more in tune with the times



Copyright (C) 2007, http://www.dabeaz.com                          2- 4
"To Catch a Slacker"
            • Write a collection of Python programs that can
                    quietly monitor Firefox browser caches to find
                    out who has been spending their day reading
                    Slashdot instead of working on their TPS reports.
            • Oh yeah, and be a real sneaky bugger about it.



Copyright (C) 2007, http://www.dabeaz.com                           2- 5
Why this Problem?
                  • Involves a real-world system and data
                  • Firefox already installed on your machine (?)
                  • Cross platform (Linux, Mac, Windows)
                  • Example of tool building
                  • Related to a variety of practical problems
                  • A good tour of "Python in Action"
Copyright (C) 2007, http://www.dabeaz.com                           2- 6
Disclaimers
                • I am not involved in browser forensics (or
                        spyware for that matter).
                • I am in no way affiliated with Firefox/Mozilla
                        nor have I ever seen Firefox source code
                • I have never worked with the cache data
                        prior to preparing this tutorial
                • I have never used any third-party tools for
                        looking at this data.

Copyright (C) 2007, http://www.dabeaz.com                          2- 7
More Disclaimers
                • All of the code in this tutorial works with a
                        standard Python installation
                • No third party modules.
                • All code is cross-platform
                • Code samples are available online at
                                 http://www.dabeaz.com/action/

                 • Please look at that code and follow along
Copyright (C) 2007, http://www.dabeaz.com                         2- 8
Assumptions
                  • This is not a tutorial on systems concepts
                  • You should be generally familiar with
                         background material (files, filesystems, file
                         formats, processes, threads, networking,
                         protocols, etc.)
                  • Hopefully you can "extrapolate" from the
                         material presented here to construct more
                         advanced Python applications.


Copyright (C) 2007, http://www.dabeaz.com                             2- 9
The Big Picture
                  • We want to write a tool that allows
                         someone to locate, inspect, and perform
                         queries across a distributed collection of
                         Firefox caches.
                  • For example, the cache directories on all
                         machines on the LAN of a quasi-evil
                         corporation.



Copyright (C) 2007, http://www.dabeaz.com                             2- 10
The Firefox Cache
                • The Firefox browser keeps a disk cache of
                       recently visited sites
                % ls Cache/
                -rw-------    1 beazley    111169 Sep 25 17:15 01CC0844d01
                -rw-------    1 beazley    104991 Sep 25 17:15 01CC3844d01
                -rw-------    1 beazley     47233 Sep 24 16:41 021F221Ad01
                ...
                -rw-------    1 beazley     26749 Sep 21 11:19 FF8AEDF0d01
                -rw-------    1 beazley     58172 Sep 25 18:16 FFE628C6d01
                -rw-------    1 beazley   1939456 Sep 25 19:14 _CACHE_001_
                -rw-------    1 beazley   2588672 Sep 25 19:14 _CACHE_002_
                -rw-------    1 beazley   4567040 Sep 25 18:44 _CACHE_003_
                -rw-------    1 beazley     33044 Sep 23 21:58 _CACHE_MAP_


                 • A bunch of cryptically named files.
Copyright (C) 2007, http://www.dabeaz.com                                                            2- 11
Problem : Finding Files
                   • Find the Firefox cache
                           Write a program findcache.py that takes a directory
                           name as input and recursively scans that directory
                           and all subdirectories looking for Firefox/Mozilla
                           cache directories.
                   • Example:
                           % python findcache.py /Users/beazley
                           /Users/beazley/Library/.../qs1ab616.default/Cache
                           /Users/beazley/Library/.../wxuoyiuf.slt/Cache
                           %


                    • Use case: Searching for things on the filesystem.
Copyright (C) 2007, http://www.dabeaz.com                                       2- 12
findcache.py
                # findcache.py
                # Recursively scan a directory looking for
                # Firefox/Mozilla cache directories

                import sys
                import os

                if len(sys.argv) != 2:
                    print >>sys.stderr,"Usage: python findcache.py dirname"
                    raise SystemExit(1)

                caches = (path for path,dirs,files in os.walk(sys.argv[1])
                               if '_CACHE_MAP_' in files)

                for name in caches:
                    print name




Copyright (C) 2007, http://www.dabeaz.com                                     2- 13
The sys module
                The sys module has basic information related
                to the execution environment:

                    sys.argv      # A list of the command line options
                    sys.stdin     # Standard input
                    sys.stdout    # Standard output
                    sys.stderr    # Standard error

                sys.stdin, sys.stdout, and sys.stderr are the
                standard I/O files.

                For findcache.py, invoked as before:

                    sys.argv = ['findcache.py', '/Users/beazley']
Copyright (C) 2007, http://www.dabeaz.com                                             2- 14
Program Termination
                In findcache.py, a usage error terminates the
                program by raising an exception:

                    if len(sys.argv) != 2:
                        print >>sys.stderr,"Usage: python findcache.py dirname"
                        raise SystemExit(1)

                SystemExit exception

                Forces Python to exit. Value is return code.
Copyright (C) 2007, http://www.dabeaz.com                                           2- 15
os Module
                os module

                Contains useful OS related functions
                (files, processes, etc.)

                    import os

                    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                                   if '_CACHE_MAP_' in files)
Copyright (C) 2007, http://www.dabeaz.com                                          2- 16
os.walk()
                os.walk(topdir)

                Recursively walks a directory tree and
                generates a sequence of tuples (path,dirs,files)

                    path  = The current directory name
                    dirs  = List of all subdirectory names in path
                    files = List of all regular files (data) in path

                findcache.py feeds this sequence into a
                generator expression:

                    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                                   if '_CACHE_MAP_' in files)
Copyright (C) 2007, http://www.dabeaz.com                                            2- 17
A Sequence of Caches
                This statement generates a sequence of
                directory names where '_CACHE_MAP_' is
                contained in the file list:

                    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                                   if '_CACHE_MAP_' in files)

                path is the directory name that is generated
                as a result; the if clause is the file name check.
Copyright (C) 2007, http://www.dabeaz.com                                          2- 18
Printing the Result
                This prints the sequence of cache directories
                that are generated by the previous statement:

                    for name in caches:
                        print name
Copyright (C) 2007, http://www.dabeaz.com                                     2- 19
Commentary
                     • Our solution is strongly based on a
                             "declarative" programming style (again)
                     • We simply write out a sequence of
                             operations that produce what we want
                     • Not focused on the underlying mechanics
                             of how to traverse all of the directories.



Copyright (C) 2007, http://www.dabeaz.com                                 2- 20
Big Idea : Iteration
                 • Python allows iteration to be captured as a
                         kind of object.
                        caches = (path for path,dirs,files in os.walk(sys.argv[1])
                                       if '_CACHE_MAP_' in files)

                • This de-couples iteration from the code that
                        uses the iteration
                        for name in caches:
                            print name


                • Another usage example:
                        for name in caches:
                            print len(os.listdir(name)), name


Copyright (C) 2007, http://www.dabeaz.com                                      2- 21
Big Idea : Iteration
                 • Compare to this:
                            for path,dirs,files in os.walk(sys.argv[1]):
                                if '_CACHE_MAP_' in files:
                                    print len(os.listdir(path)),path


                 • This code is simple, but the loop and the
                         code that executes in the loop body are
                         coupled together
                 • Not as flexible; the decoupled style is
                        somewhat subtle to wrap your brain around
                        at first.
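
                 • Because the iteration is captured as an object,
                        more processing can be layered on without
                        touching the traversal code. A sketch (the
                        100-file threshold is an arbitrary example):

                            big_caches = (name for name in caches
                                               if len(os.listdir(name)) > 100)
                            for name in big_caches:
                                print name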


Copyright (C) 2007, http://www.dabeaz.com                                  2- 22
Mini-Reference : sys, os
                   • sys module
                             sys.argv          #   List of command line options
                             sys.stdin         #   Standard input
                             sys.stdout        #   Standard output
                             sys.stderr        #   Standard error
                             sys.executable    #   Full path of Python executable
                             sys.exc_info()    #   Information on current exception

                   • os module
                             os.walk(dir)      # Recursively walk dir producing a
                                               # sequence of tuples (path,dlist,flist)

                             os.listdir(dir)   # Return a list of all files in dir

                   • SystemExit exception
                             raise SystemExit(n) # Exit with integer code n


Copyright (C) 2007, http://www.dabeaz.com                                             2- 23
Problem: Searching for Text
            • Extract all URL requests from the cache
                   Write a program requests.py that scans the contents
                   of the _CACHE_00n_ files and prints a list of URLs
                   for documents stored in the cache.
            • Example:
                   % python requests.py /Users/.../qs1ab616.default/Cache
                   http://www.yahoo.com/
                   http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1.1.j
                   http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
                   http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
                   ...
                   %

            • Use case: Searching the contents of files for
                   text patterns.
Copyright (C) 2007, http://www.dabeaz.com                                      2- 24
The Firefox Cache
               • The cache directory holds two types of data
                    • Metadata (URLs, headers, etc.).
                    • Raw data (HTML, JPEG, PNG, etc.)
               • This data is stored in two places
                    • Cryptic files in the Cache directory
                    • Blocks inside the _CACHE_00n_ files
               • Metadata almost always in _CACHE_00n_
Copyright (C) 2007, http://www.dabeaz.com                      2- 25
Possible Solution : Regex
            • The _CACHE_00n_ files are encoded in a
                   binary format, but URLs are embedded
                   inside as null-terminated text:
                    \x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f
                    \xceF\xff\x9f\xce\x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a
                    \x00\x00\x023HTTP:http://slashdot.org/\x00request-method\x00
                    GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U; Intel
                     Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00
                    request-Accept-Encoding\x00gzip,deflate\x00response-head\x00
                    HTTP/1.1 200 OK\r\nDate: Sun, 30 Sep 2007 13:07:29 GMT\r\n
                    Server: Apache/1.3.37 (Unix) mod_perl/1.29\r\nSLASH_LOG_DATA:
                     shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I
                    live my life if I can't tell good from evil?\r\nCache-Control:


             • Maybe the requests could just be ripped
                    using a regular expression.
Copyright (C) 2007, http://www.dabeaz.com                                      2- 26
A Regex Solution
              # requests.py
              import re
              import os
              import sys

              cachedir = sys.argv[1]
              cachefiles = [ '_CACHE_001_', '_CACHE_002_', '_CACHE_003_' ]

              # A regex for embedded URL strings
              request_pat = re.compile(r'([a-z]+://.*?)\x00')

              # Loop over all files and search for URLs
              for name in cachefiles:
                  data = open(os.path.join(cachedir,name),"rb").read()
                  index = 0
                  while True:
                      m = request_pat.search(data,index)
                      if not m: break
                      print m.group(1)
                      index = m.end()
Copyright (C) 2007, http://www.dabeaz.com                                    2- 27
The re module
              re module

              Contains all functionality related to regular
              expression pattern matching, searching,
              replacing, etc.

              Features are strongly influenced by Perl, but
              regexes are not directly integrated into the
              Python language.

                  import re

                  # A regex for embedded URL strings
                  request_pat = re.compile(r'([a-z]+://.*?)\x00')
Copyright (C) 2007, http://www.dabeaz.com                                                     2- 28
Using re
              Patterns are first specified as strings and
              compiled into a regex object:

                  pat = re.compile(pattern [,flags])

              The pattern syntax is "standard":

                  pat*            pat1|pat2
                  pat+            [chars]
                  pat?            [^chars]
                  (pat)           pat{n}
                  .               pat{n,m}
Copyright (C) 2007, http://www.dabeaz.com                                    2- 29
Using re
              All subsequent operations are methods of the
              compiled regex pattern:

                  m = pat.match(data [,start])    # Check for match
                  m = pat.search(data [,start])   # Search for match
                  newdata = pat.sub(repl, data)   # Pattern replace

              requests.py uses search() in a loop:

                  while True:
                      m = request_pat.search(data,index)
                      if not m: break
                      print m.group(1)
                      index = m.end()
Copyright (C) 2007, http://www.dabeaz.com                                 2- 30
Searching for Matches
              pat.search(text [,start])

              Searches the string text for the first occurrence
              of the regex pattern starting at position start.
              Returns a "MatchObject" if a match is found.

              In the code below, we're finding matches one
              at a time:

                  index = 0
                  while True:
                      m = request_pat.search(data,index)
                      if not m: break
                      print m.group(1)
                      index = m.end()
Copyright (C) 2007, http://www.dabeaz.com                                     2- 31
Match Objects
              Regex matches are represented by a MatchObject:

                  m.group([n])       # Text matched by group n
                  m.start([n])       # Starting index of group n
                  m.end([n])         # End index of group n

              In the search loop, m.group(1) is the matching
              text for just the URL and m.end() is the end of
              the match:

                      print m.group(1)
                      index = m.end()
Copyright (C) 2007, http://www.dabeaz.com                                  2- 32
Groups
              In patterns, parentheses () define groups which
              are numbered left to right:

                  group 0            # The entire pattern
                  group 1            # Text in first ()
                  group 2            # Text in next ()
                  ...

              In requests.py, group 1 is the URL captured by
              the parentheses:

                  request_pat = re.compile(r'([a-z]+://.*?)\x00')
                  ...
                  print m.group(1)
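
              • A tiny interactive illustration of group
                     numbering (the date string is a made-up example):

                         >>> m = re.match(r'(\d+)/(\d+)', '30/09')
                         >>> m.group(0)
                         '30/09'
                         >>> m.group(1)
                         '30'
                         >>> m.group(2)
                         '09'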
Copyright (C) 2007, http://www.dabeaz.com                                    2- 33
Mini-Reference : re
                  • re pattern compilation
                          pat = re.compile(r'patternstring')


                  • Pattern syntax
                          literal           #   Match   literal text
                          pat*              #   Match   0 or more repetitions of pat
                          pat+              #   Match   1 or more repetitions of pat
                          pat?              #   Match   0 or 1 repetitions of pat
                           pat1|pat2         #   Match   pat1 or pat2
                           (pat)             #   Match   pat (group)
                           [chars]           #   Match   characters in chars
                           [^chars]          #   Match   characters not in chars
                           .                 #   Match   any character except \n
                           \d                #   Match   any digit
                           \w                #   Match   alphanumeric character
                           \s                #   Match   whitespace


Copyright (C) 2007, http://www.dabeaz.com                                              2- 34
Mini-Reference : re
               • Common pattern operations
                        pat.search(text)        # Search text for a match
                        pat.match(text)         # Search start of text for match
                        pat.sub(repl,text)      # Replace pattern with repl

                • Match objects
                        m.group([n])            # Text matched by group n
                        m.start([n])            # Starting position of group n
                        m.end([n])              # Ending position of group n


                • How to loop over all matches of a pattern
                        for m in pat.finditer(text):
                            # m is a MatchObject that you process
                            ...
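
                 • For example, the URL scan in requests.py could
                        be written with finditer() instead of the
                        manual search loop (a sketch):

                            for m in request_pat.finditer(data):
                                print m.group(1)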




Copyright (C) 2007, http://www.dabeaz.com                                          2- 35
Mini-Reference : re
                   • An example of pattern replacement
                          # This replaces American dates of the form 'mm/dd/yyyy'
                          # with European dates of the form 'dd/mm/yyyy'.

                          # This function takes a MatchObject as input and returns
                          # replacement text as output.

                           def euro_date(m):
                              month = m.group(1)
                              day   = m.group(2)
                              year  = m.group(3)
                              return "%s/%s/%s" % (day,month,year)

                           # Date re pattern and replacement operation
                           datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
                          newdata = datepat.sub(euro_date,text)



Copyright (C) 2007, http://www.dabeaz.com                                            2- 36
Mini-Reference : re
               • There are many more features of the re
                      module
               • Strongly influenced by Perl (feature set)
                • Regexes are a library in Python, not integrated
                       into the language.
               • A book on regular expressions may be
                      essential for advanced functions.


Copyright (C) 2007, http://www.dabeaz.com                         2- 37
File Handling
              What is going on in this statement?

                  for name in cachefiles:
                      data = open(os.path.join(cachedir,name),"rb").read()
Copyright (C) 2007, http://www.dabeaz.com                                    2- 38
os.path module
              os.path has portable file related functions

                  os.path.join(name1,name2,...)   # Join path names
                  os.path.getsize(filename)       # Get the file size
                  os.path.getmtime(filename)      # Get modification date

              There are many more functions, but this is the
              preferred module for basic filename handling.
Copyright (C) 2007, http://www.dabeaz.com                                    2- 39
os.path.join()
              Creates a fully-expanded pathname:

                  dirname = '/foo/bar'
                  filename = 'name'

                  os.path.join(dirname,filename)

                      '/foo/bar/name'

              Aware of platform differences ('/' vs. '\')
Copyright (C) 2007, http://www.dabeaz.com                                    2- 40
Mini-Reference : os.path
                      os.path.join(s1,s2,...)   #   Join pathname parts together
                      os.path.getsize(path)     #   Get file size of path
                      os.path.getmtime(path)    #   Get modify time of path
                      os.path.getatime(path)    #   Get access time of path
                      os.path.getctime(path)    #   Get creation time of path
                      os.path.exists(path)      #   Check if path exists
                      os.path.isfile(path)      #   Check if regular file
                      os.path.isdir(path)       #   Check if directory
                      os.path.islink(path)      #   Check if symbolic link
                      os.path.basename(path)    #   Return file part of path
                       os.path.dirname(path)     #   Return dir part of path
                      os.path.abspath(path)     #   Get absolute path
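
                    • A small usage sketch combining a few of these
                           (cachedir is a hypothetical directory name):

                               for name in os.listdir(cachedir):
                                   fullname = os.path.join(cachedir,name)
                                   if os.path.isfile(fullname):
                                       print os.path.getsize(fullname), name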




Copyright (C) 2007, http://www.dabeaz.com                                          2- 41
Binary I/O
              For all binary files, use modes "rb","wb", etc.

                  data = open(os.path.join(cachedir,name),"rb").read()

              This disables new-line translation (critical on
              Windows).
Copyright (C) 2007, http://www.dabeaz.com                                    2- 42
Common I/O Shortcuts
              # Read an entire file into a string
              data = open(filename).read()

              # Write a string out to a file
              open(filename,"w").write(text)

              # Loop over all lines in a file
              for line in open(filename):
                  ...
Copyright (C) 2007, http://www.dabeaz.com                                    2- 43
Commentary on Solution
               • This regex approach is mostly a hack for this
                       particular application.
               • Reads entire cache files into memory as
                       strings (may be quite large)
               • Only finds URLs, no other metadata
               • Some risk of false positives since URLs could
                       also be embedded in data.
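
               • One way around the memory issue: scan each file
                       in chunks, carrying the unmatched tail into the
                       next chunk. A sketch (the 64K chunk size is
                       arbitrary; assumes no URL is longer than 1K):

                           f = open(filename,"rb")
                           data = f.read(65536)
                           while data:
                               pos = 0
                               for m in request_pat.finditer(data):
                                   print m.group(1)
                                   pos = m.end()
                               # carry at most 1K of unmatched tail forward
                               tail = data[max(pos, len(data)-1024):]
                               data = tail + f.read(65536)
                               if len(data) == len(tail): break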


Copyright (C) 2007, http://www.dabeaz.com                        2- 44
Commentary
                   • We have started to build a collection of
                          very simple command line tools
                   • Very much in the "Unix tradition."
                   • Python makes it easy to create such tools
                   • More complex applications could be
                          assembled by simply gluing scripts together



Copyright (C) 2007, http://www.dabeaz.com                               2- 45
Working with Processes
                   • It is common to write programs that run
                          other programs, collect their output, etc.
                   • Pipes
                   • Interprocess Communication
                   • Python has a variety of modules for
                          supporting this.



Copyright (C) 2007, http://www.dabeaz.com                              2- 46
subprocess Module
            • A module for creating and interacting with
                   subprocesses
            • Consolidates a number of low-level OS
                   functions such as system(), execv(), spawnv(),
                   pipe(), popen2(), etc. into a single module
            • Cross platform (Unix/Windows)
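
             • A minimal sketch of the pattern used on the
                    next slides (the command is a placeholder):

                        import subprocess
                        p = subprocess.Popen(["ls","-l"],
                                             stdout=subprocess.PIPE)
                        for line in p.stdout:
                            print line,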

Copyright (C) 2007, http://www.dabeaz.com                           2- 47
Example : Slackers

             • Find slacker cache entries.
                    Using the programs findcache.py and requests.py as
                    subprocesses, write a program that inspects cache
                    directories and prints out all entries that contain the
                    word 'slashdot' in the URL.




Copyright (C) 2007, http://www.dabeaz.com                                     2- 48
slackers.py
               # slackers.py
               import sys
               import subprocess

               # Run findcache.py as a subprocess
               finder = subprocess.Popen(
                          [sys.executable,"findcache.py",sys.argv[1]],
                          stdout=subprocess.PIPE)

               dirlist = [line.strip() for line in finder.stdout]

                # Run requests.py as a subprocess
               for cachedir in dirlist:
                   searcher = subprocess.Popen(
                           [sys.executable,"requests.py",cachedir],
                            stdout=subprocess.PIPE)
                   for line in searcher.stdout:
                       if 'slashdot' in line: print line,


Copyright (C) 2007, http://www.dabeaz.com                                2- 49
Launching a subprocess
               This is launching a python script as a
               subprocess, connecting its stdout stream
               to a pipe:

                   # Run findcache.py as a subprocess
                   finder = subprocess.Popen(
                              [sys.executable,"findcache.py",sys.argv[1]],
                              stdout=subprocess.PIPE)

               The list comprehension collects the output,
               stripping the newline from each line:

                   dirlist = [line.strip() for line in finder.stdout]
Copyright (C) 2007, http://www.dabeaz.com                                  2- 50
Python Executable
               sys.executable is the full pathname of the
               python interpreter:

                   finder = subprocess.Popen(
                              [sys.executable,"findcache.py",sys.argv[1]],
                              stdout=subprocess.PIPE)
Copyright (C) 2007, http://www.dabeaz.com                                2- 51
Subprocess Arguments
               The first argument to Popen is the list of
               arguments to the subprocess. It corresponds to
               what would appear on a shell command line:

                   [sys.executable,"findcache.py",sys.argv[1]]
Copyright (C) 2007, http://www.dabeaz.com                                  2- 52
slackers.py
               More of the same idea. For each directory we
               found in the last step, we run requests.py to
               produce requests:

                   # Run requests.py as a subprocess
                   for cachedir in dirlist:
                       searcher = subprocess.Popen(
                               [sys.executable,"requests.py",cachedir],
                                stdout=subprocess.PIPE)
                       for line in searcher.stdout:
                           if 'slashdot' in line: print line,
Copyright (C) 2007, http://www.dabeaz.com                                2- 53
Commentary

             • subprocess is a large module with many options.
             • However, it takes care of a lot of annoying
                    platform-specific details for you.
             • Currently the "recommended" way of dealing
                    with subprocesses.




Copyright (C) 2007, http://www.dabeaz.com                        2- 54
Low Level Subprocesses
            • Running a simple system command
                       os.system("shell command")

            • Connecting to a subprocess with pipes
                       pout, pin = popen2.popen2("shell command")


            • Exec/spawn
                       os.execv(),os.execl(),os.execle(),...
                       os.spawnv(),os.spawnvl(), os.spawnle(),...

            • Unix fork()
                       os.fork(), os.wait(), os.waitpid(), os._exit(), ...



Copyright (C) 2007, http://www.dabeaz.com                                    2- 55
Interactive Processes
              • Python does not have built-in support for
                     controlling interactive subprocesses (e.g.,
                     "Expect")
              • Must install third party modules for this
              • Example: pexpect
              • http://pexpect.sourceforge.net
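
              • A sketch of the flavor of pexpect (assuming its
                     spawn/expect/sendline API; the ftp session
                     details are hypothetical):

                         import pexpect
                         child = pexpect.spawn("ftp ftp.example.com")
                         child.expect("Name .*: ")
                         child.sendline("anonymous")
                         child.expect("Password:")
                         child.sendline("guest@example.com")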

Copyright (C) 2007, http://www.dabeaz.com                          2- 56
Commentary
              • Writing small Unix-like utilities is fairly
                     straightforward in Python
              • Support for standard kinds of operations (files,
                     regular expressions, pipes, subprocesses, etc.)
              • However, our solution is also kind of clunky
              • Only returns some information
              • Not particularly memory efficient (reads large
                     files into memory)

Copyright (C) 2007, http://www.dabeaz.com                              2- 57
Interlude
                • Python is well-suited to building libraries
                       and frameworks.
                • In the next part, we're going to take a
                       totally different approach than simply
                       writing simple utilities.
                • Will build libraries for manipulating cache
                       data and use those libraries to build tools.



Copyright (C) 2007, http://www.dabeaz.com                             2- 58
Problem : Parsing Data
            • Extract the cache data (for real)
                   Write a module ffcache.py that contains a set of
                   functions for reading Firefox cache data into useful
                   data structures that can be used by other programs.

                   Capture all available information including URLs,
                   timestamps, sizes, locations, content types, etc.

            • Use case: Blood and guts
                   Writing programs that can process foreign file
                   formats. Processing binary encoded data. Creating
                   code for later reuse.

Copyright (C) 2007, http://www.dabeaz.com                                 2- 59
The Firefox Cache
            • There are four critical files
                        _CACHE_MAP_         #   Cache   index
                        _CACHE_001_         #   Cache   data
                        _CACHE_002_         #   Cache   data
                        _CACHE_003_         #   Cache   data

             • All files are binary-encoded
             • _CACHE_MAP_ is used by Firefox to locate
                    data, but it is not updated until Firefox exits.
             • We will ignore _CACHE_MAP_ since we want
                    to observe caches of live Firefox sessions.

Copyright (C) 2007, http://www.dabeaz.com                              2- 60
Firefox _CACHE_ Files
            • _CACHE_00n_ file organization
                        +---------------------------+
                        |  Free/used block bitmap   |  4096 bytes
                        +---------------------------+
                        |                           |
                        |          Blocks           |  Up to 32768 blocks
                        |                           |
                        +---------------------------+

             • The block size varies according to the file:
                            _CACHE_001_                  256 byte blocks
                            _CACHE_002_                 1024 byte blocks
                            _CACHE_003_                 4096 byte blocks
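
              • So a block's file offset is fixed by its number
                     and the block size. A sketch (assumes blocks
                     are numbered from 0 immediately after the
                     4096-byte bitmap):

                         blocksize = { '_CACHE_001_' : 256,
                                       '_CACHE_002_' : 1024,
                                       '_CACHE_003_' : 4096 }

                         def block_offset(filename, n):
                             return 4096 + n*blocksize[filename]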


Copyright (C) 2007, http://www.dabeaz.com                                             2- 61
Cache Entries
            • Each cache entry:
                 • A maximum of 4 cache blocks
                 • Can either be data or metadata
                 • If >16K, written to a file instead
            • Notice how all the "cryptic" files are >16K
              -rw-------                beazley   111169 Sep 25 17:15 01CC0844d01
              -rw-------                beazley   104991 Sep 25 17:15 01CC3844d01
              -rw-------                beazley    47233 Sep 24 16:41 021F221Ad01
              ...
              -rw-------                beazley    26749 Sep 21 11:19 FF8AEDF0d01
              -rw-------                beazley    58172 Sep 25 18:16 FFE628C6d01
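
              • A quick sanity check along these lines (a sketch;
                     cachedir as before):

                         for name in os.listdir(cachedir):
                             if not name.startswith('_CACHE_'):
                                 fullname = os.path.join(cachedir,name)
                                 print os.path.getsize(fullname), name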

Copyright (C) 2007, http://www.dabeaz.com                                           2- 62
Cache Metadata
              • Metadata is encoded as a binary structure
                                            Header         36 bytes
                                      Request String       Variable length (in header)
                                       Request Info        Variable length (in header)

               • Header encoding (binary, big-endian)
                              0-3           magic (???)   unsigned   int   (0x00010008)
                              4-7           location      unsigned   int
                              8-11          fetchcount    unsigned   int
                              12-15         fetchtime     unsigned   int   (system time)
                              16-19         modifytime    unsigned   int   (system time)
                              20-23         expiretime    unsigned   int   (system time)
                              24-27         datasize      unsigned   int   (byte count)
                              28-31         requestsize   unsigned   int   (byte count)
                              32-35         infosize      unsigned   int   (byte count)
Copyright (C) 2007, http://www.dabeaz.com                                                  2- 63
Solution Outline
                   • Part 1: Parsing Metadata Headers
                   • Part 2: Getting request information (URL)
                   • Part 3: Extracting additional content info
                   • Part 4: Scanning of individual cache files
                   • Part 5: Scanning an entire directory
                   • Part 6: Scanning a list of directories
Copyright (C) 2007, http://www.dabeaz.com                         2- 64
Part I - Reading Headers


                • Write a function that can parse the binary
                       metadata header and return the data in a
                       useful format




Copyright (C) 2007, http://www.dabeaz.com                         2- 65
Reading Headers
           import struct

           # This function parses a cache metadata header into a dict
           # of named fields (listed in _headernames below)

           _headernames = ['magic','location','fetchcount',
                           'fetchtime','modifytime','expiretime',
                           'datasize','requestsize','infosize']

           def parse_meta_header(headerdata):
               head = struct.unpack(">9I",headerdata)
               meta = dict(zip(_headernames,head))
               return meta




Copyright (C) 2007, http://www.dabeaz.com                               2- 66
Reading Headers
             • How this is supposed to work:
                 >>> f = open("Cache/_CACHE_001_","rb")
                 >>> f.seek(4096)                # Skip the bit map
                 >>> headerdata = f.read(36)     # Read 36 byte header
                 >>> meta = parse_meta_header(headerdata)
                 >>> meta
                 {'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
                 'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
                 'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
                 >>>


             • Basically, we're parsing the header into a
                    useful Python data structure (a dictionary)


Copyright (C) 2007, http://www.dabeaz.com                                  2- 67
struct module
            import struct

            The struct module parses binary encoded data
            into Python objects. You would use this module
            to pack/unpack raw binary data from Python
            strings.

                head = struct.unpack(">9I",headerdata)

            ">9I" unpacks 9 unsigned 32-bit big-endian
            integers.
Copyright (C) 2007, http://www.dabeaz.com                                     2- 68
struct module
            The result of struct.unpack() is always a tuple
            of converted values:

                head = struct.unpack(">9I",headerdata)

                head = (65544, 0, 1, 1191682051, 1191682051,
                        0, 8645, 190, 218)
Copyright (C) 2007, http://www.dabeaz.com                               2- 69
Dictionary Creation
            zip(s1,s2) makes a list of tuples:

                zip(_headernames,head)        [('magic',head[0]),
                                               ('location',head[1]),
                                               ('fetchcount',head[2]),
                                               ...
                                              ]

            dict() then makes a dictionary from those pairs:

                meta = dict(zip(_headernames,head))
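
            • A tiny illustration of the same pattern:

                  d = dict(zip(['a','b','c'], [1,2,3]))
                  # d == {'a': 1, 'b': 2, 'c': 3}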


Copyright (C) 2007, http://www.dabeaz.com                               2- 70
Commentary
                   • Dictionaries as data structures
                             meta           = { 'fetchtime'     :   1190829792,
                                                'requestsize'   :   27,
                                                'magic'         :   65544,
                                                'fetchcount'    :   3,
                                                'expiretime'    :   0,
                                                'location'      :   2449473536L,
                                                'modifytime'    :   1190829792,
                                                'datasize'      :   29448,
                                                'infosize'      :   531 }

                   • Useful if data has many parts
                           data = f.read(meta[8])                         # Huh?!?

                                               vs.

                           data = f.read(meta['infosize'])                # Better

Copyright (C) 2007, http://www.dabeaz.com                                            2- 71
Mini-reference : struct
                   • struct module
                           items = struct.unpack(fmt,data)
                           data = struct.pack(fmt,item1,...,itemn)

                   • Sample Format codes
                          'c'               char (1 byte string)
                          'b'               signed char (8-bit integer)
                          'B'               unsigned char (8-bit integer)
                          'h'               signed short (16-bit integer)
                          'H'               unsigned short (16-bit integer)
                          'i'               int (32-bit integer)
                          'I'               unsigned int (32-bit integer)
                          'f'               32-bit single precision float
                          'd'               64-bit double precision float
                          's'               char s[] (String)
                          '>'               Big endian modifier
                          '<'               Little endian modifier
                          '!'               Network order modifier
'n'               Repetition count prefix (a decimal count, e.g. '9I')
Copyright (C) 2007, http://www.dabeaz.com                                     2- 72
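
For example, a quick round trip through struct with the same ">9I"
format used for the cache headers (the nine values are taken from the
sample metadata dictionary shown earlier):

            >>> import struct
            >>> data = struct.pack(">9I", 65544, 2449473536, 3,
            ...                    1190829792, 1190829792, 0,
            ...                    29448, 27, 531)
            >>> len(data)        # 9 unsigned 32-bit ints = 36 bytes
            36
            >>> struct.unpack(">9I", data)
            (65544, 2449473536, 3, 1190829792, 1190829792, 0, 29448, 27, 531)
            >>>
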
Part 2 : Parsing Requests

                   • Write a function that will read the URL
                          request string and request information
                   • Request String : A Null-terminated string
                   • Request Info : A sequence of Null-terminated
                          key-value pairs (like a dictionary)




Copyright (C) 2007, http://www.dabeaz.com                           2- 73
Parsing Requests
            import re
            part_pat = re.compile(r'[\n\r -~]*$')

            def parse_request_data(meta,requestdata):
                parts = requestdata.split('\x00')
                for part in parts:
                    if not part_pat.match(part):
                        return False

                request = parts[0]
                if len(request) != (meta['requestsize'] - 1):
                    return False

                info = dict(zip(parts[1::2],parts[2::2]))
                meta['request'] = request.split(':',1)[1]
                meta['info'] = info
                return True



Copyright (C) 2007, http://www.dabeaz.com                            2- 74
Usage : Requests
             • Usage of the function:
                >>> f = open("Cache/_CACHE_001_","rb")
                >>> f.seek(4096)                # Skip the bit map
                >>> headerdata = f.read(36)     # Read 36 byte header
                >>> meta = parse_meta_header(headerdata)
                >>> requestdata = f.read(meta['requestsize']+meta['infosize'])
                >>> parse_request_data(meta,requestdata)
                True
                >>> meta['request']
                'http://www.yahoo.com/'
                >>> meta['info']
                {'request-method': 'GET', 'request-User-Agent': 'Mozilla/5.0
                (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/
                20070914 Firefox/2.0.0.7', 'charset': 'UTF-8', 'response-
                head': 'HTTP/1.1 200 OK\r\nDate: Wed, 26 Sep 2007 18:03:17
                ...' }
                >>>

Copyright (C) 2007, http://www.dabeaz.com                                  2- 75
String Stripping
             The request data is a sequence of null-terminated
             strings. This splits the data up into parts.

                 requestdata = 'part\x00part\x00part\x00part\x00...'
                 requestdata.split('\x00')
                 parts = ['part','part','part','part',...]

            import re
            part_pat = re.compile(r'[\n\r -~]*$')

            def parse_request_data(meta,requestdata):
                parts = requestdata.split('\x00')
                for part in parts:
                    if not part_pat.match(part):
                        return False

                request = parts[0]
                if len(request) != (meta['requestsize'] - 1):
                    return False

                info = dict(zip(parts[1::2],parts[2::2]))
                meta['request'] = request.split(':',1)[1]
                meta['info'] = info
                return True

Copyright (C) 2007, http://www.dabeaz.com                            2- 76
String Validation
             Individual parts are printable characters except for
             newline characters ('\n\r').

             We use the re module to match each string. This
             would help catch cases where we might be reading
             bad data (false headers, raw data, etc.).

            import re
            part_pat = re.compile(r'[\n\r -~]*$')

            def parse_request_data(meta,requestdata):
                parts = requestdata.split('\x00')
                for part in parts:
                    if not part_pat.match(part):
                        return False

                request = parts[0]
                if len(request) != (meta['requestsize'] - 1):
                    return False

                info = dict(zip(parts[1::2],parts[2::2]))
                meta['request'] = request.split(':',1)[1]
                meta['info'] = info
                return True
Copyright (C) 2007, http://www.dabeaz.com                            2- 77
URL Request String
             The request string is the first part. The check that
             follows makes sure it's the right size (a further sanity
             check on the data integrity).

            import re
            part_pat = re.compile(r'[\n\r -~]*$')

            def parse_request_data(meta,requestdata):
                parts = requestdata.split('\x00')
                for part in parts:
                    if not part_pat.match(part):
                        return False

                request = parts[0]
                if len(request) != (meta['requestsize'] - 1):
                    return False

                info = dict(zip(parts[1::2],parts[2::2]))
                meta['request'] = request.split(':',1)[1]
                meta['info'] = info
                return True

Copyright (C) 2007, http://www.dabeaz.com                                2- 78
Request Info
             Each request has a set of associated data represented
             as key/value pairs.

                 parts = ['request','key','val','key','val','key','val']
                 parts[1::2]  ->  ['key','key','key']
                 parts[2::2]  ->  ['val','val','val']
                 zip(parts[1::2],parts[2::2])  ->  [('key','val'),
                                                    ('key','val'),
                                                    ('key','val')]

            import re
            part_pat = re.compile(r'[\n\r -~]*$')

            def parse_request_data(meta,requestdata):
                parts = requestdata.split('\x00')
                for part in parts:
                    if not part_pat.match(part):
                        return False

                request = parts[0]
                if len(request) != (meta['requestsize'] - 1):
                    return False

                # Makes a dictionary from (key,val) tuples
                info = dict(zip(parts[1::2],parts[2::2]))
                meta['request'] = request.split(':',1)[1]
                meta['info'] = info
                return True
Copyright (C) 2007, http://www.dabeaz.com                                          2- 79
Fixing the Request
           # Given a dictionary of header information and a file,
           # this function extracts the request data from a cache
           # metadata entry and saves it in the dictionary. Returns
           # True or False depending on success.

           def read_request_data(header,f):
               request = f.read(header['requestsize']).strip('\x00')
               infodata = f.read(header['infosize']).strip('\x00')

               # Validate request and infodata here (nothing now)

               # Turn the infodata into a dictionary
               parts = infodata.split('\x00')
               info = dict(zip(parts[::2],parts[1::2]))

               header['request'] = request.split(':',1)[1]
               header['info'] = info
               return True

           Cleaning up the request string:

               request = "HTTP:http://www.google.com"
               request.split(':',1)     -> ['HTTP','http://www.google.com']
               request.split(':',1)[1]  -> 'http://www.google.com'

Copyright (C) 2007, http://www.dabeaz.com                              2- 80
Commentary
                 • Emphasize that Python has very powerful
                         list manipulation primitives
                                  • Indexing
                                  • Slicing
                                  • List comprehensions
                                  • Etc.
                 • Knowing how to use these leads to rapid
                         development and compact code

Copyright (C) 2007, http://www.dabeaz.com                    2- 81
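
For example, the slicing-and-zip idiom used throughout this section,
in a short interactive session (the values are illustrative):

            >>> parts = ['request','key1','val1','key2','val2']
            >>> parts[1::2]
            ['key1', 'key2']
            >>> parts[2::2]
            ['val1', 'val2']
            >>> dict(zip(parts[1::2],parts[2::2]))   # key order may vary
            {'key2': 'val2', 'key1': 'val1'}
            >>> [p for p in parts if p.startswith('key')]
            ['key1', 'key2']
            >>>
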
Part 3: Content Info
                   • All documents on the internet have
                          optional content-type, encoding, and
                          character set information.
                   • Let's add this information since it will make
                          it easier for us to determine the type of
                          files that are stored in the cache (i.e.,
                          images, movies, HTML, etc.)



Copyright (C) 2007, http://www.dabeaz.com                             2- 82
HTTP Responses
                   • The cache metadata includes an HTTP
                          response header
                           >>> print meta['info']['response-head']
                           HTTP/1.1 200 OK
                           Date: Sat, 29 Sep 2007 20:51:37 GMT
                           Cache-Control: private
                           Vary: User-Agent
                           Content-Type: text/html; charset=utf-8
                           Content-Encoding: gzip

                           >>>


                                            Content type, character set,
                                            and encoding.

Copyright (C) 2007, http://www.dabeaz.com                                  2- 83
Solution
         # Given a metadata dictionary, this function adds additional
         # fields related to the content type, charset, and encoding

         import email
         def add_content_info(meta):
             info = meta['info']
             if 'response-head' not in info:
                 return
             else:
rhead = info.get('response-head').split("\n",1)[1]
                 m = email.message_from_string(rhead)
                 content = m.get_content_type()
                 encoding = m.get('content-encoding',None)
                 charset = m.get_content_charset()
                 meta['content-type'] = content
                 meta['content-encoding'] = encoding
                 meta['charset'] = charset


Copyright (C) 2007, http://www.dabeaz.com                               2- 84
Internet Data Handling
          Python has a vast assortment of
          internet data handling modules.

          email: Parsing of email messages,
          MIME headers, etc.

          # Given a metadata dictionary, this function adds additional
          # fields related to the content type, charset, and encoding

          import email
          def add_content_info(meta):
              info = meta['info']
              if 'response-head' not in info:
                  return
              else:
                  rhead = info.get('response-head').split("\n",1)[1]
                  m = email.message_from_string(rhead)
                  content = m.get_content_type()
                  encoding = m.get('content-encoding',None)
                  charset = m.get_content_charset()
                  meta['content-type'] = content
                  meta['content-encoding'] = encoding
                  meta['charset'] = charset

Copyright (C) 2007, http://www.dabeaz.com                                     2- 85
Internet Data Handling
          In this code, we parse the HTTP
          response headers using the email
          module and extract content-type,
          encoding, and charset information.

          # Given a metadata dictionary, this function adds additional
          # fields related to the content type, charset, and encoding

          import email
          def add_content_info(meta):
              info = meta['info']
              if 'response-head' not in info:
                  return
              else:
                  rhead = info.get('response-head').split("\n",1)[1]
                  m = email.message_from_string(rhead)
                  content = m.get_content_type()
                  encoding = m.get('content-encoding',None)
                  charset = m.get_content_charset()
                  meta['content-type'] = content
                  meta['content-encoding'] = encoding
                  meta['charset'] = charset

Copyright (C) 2007, http://www.dabeaz.com                               2- 86
Commentary
                 • Python is heavily used in Internet applications
                 • There are modules for parsing common types
                        of data (email, HTML, XML, etc.)
                 • There are modules for processing bits and
                        pieces of internet data (URLs, MIME types,
                        RFC822 headers, etc.)




Copyright (C) 2007, http://www.dabeaz.com                            2- 87
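
For example, a small taste of two such modules, urlparse and
mimetypes (the URL and filename below are illustrative):

            >>> import urlparse, mimetypes
            >>> p = urlparse.urlparse('http://images.slashdot.org/topics/topicstorage.gif')
            >>> p[0], p[1], p[2]            # scheme, host, path
            ('http', 'images.slashdot.org', '/topics/topicstorage.gif')
            >>> mimetypes.guess_type('story.jpg')
            ('image/jpeg', None)
            >>>
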
Part 4: File Scanning

                   • Write a function that scans a single cache
                          file and produces a sequence of records
                          containing all of the cache metadata.
                   • This is just one more of our building blocks
                   • The goal is to hide some of the nasty bits


Copyright (C) 2007, http://www.dabeaz.com                           2- 88
File Scanning
      # Scan a single file in the firefox cache
      def scan_cachefile(f,blocksize):
          maxsize = 4*blocksize     # Maximum size of an entry
          f.seek(4096)              # Skip the bit-map
          while True:
              headerdata = f.read(36)
              if not headerdata: break
              meta = parse_meta_header(headerdata)
              if (meta['magic'] == 0x00010008 and
                  meta['requestsize'] + meta['infosize'] < maxsize):
                      requestdata = f.read(meta['requestsize']+
                                           meta['infosize'])
                      if parse_request_data(meta,requestdata):
                           add_content_info(meta)
                           yield meta

                         # Move the file pointer to the start of the next block
                         fp = f.tell()
                         if (fp % blocksize):
                             f.seek(blocksize - (fp % blocksize),1)
Copyright (C) 2007, http://www.dabeaz.com                                         2- 89
Usage : File Scanning
             • Usage of the scan function
                   >>> f = open("Cache/_CACHE_001_","rb")
>>> for meta in scan_cachefile(f,256):
                   ...        print meta['request']
                   ...
                   http://www.yahoo.com/
                   http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
                   http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
                   ...


             • We can just open up a cache file and write a
                    for-loop to iterate over all of the entries.



Copyright (C) 2007, http://www.dabeaz.com                                     2- 90
Python File I/O
       File Objects
       Modeled after ANSI C.

       Files are just bytes.
       File pointer keeps track.

           f.read()        # Read bytes
           f.tell()        # Current fp
           f.seek(n,off)   # Move fp

       # Scan a single file in the firefox cache
       def scan_cachefile(f,blocksize):
           maxsize = 4*blocksize     # Maximum size of an entry
           f.seek(4096)              # Skip the bit-map
           while True:
               headerdata = f.read(36)
               if not headerdata: break
               meta = parse_meta_header(headerdata)
               if (meta['magic'] == 0x00010008 and
                   meta['requestsize'] + meta['infosize'] < maxsize):
                       requestdata = f.read(meta['requestsize']+
                                            meta['infosize'])
                       if parse_request_data(meta,requestdata):
                           add_content_info(meta)
                           yield meta

               # Move the file pointer to the start of the next block
               fp = f.tell()
               if (fp % blocksize):
                   f.seek(blocksize - (fp % blocksize),1)
Copyright (C) 2007, http://www.dabeaz.com                                         2- 91
Using Earlier Code
       Here we are using our header parsing functions
       written in previous parts.

       Note: We are progressively adding more data
       to a dictionary.

       # Scan a single file in the firefox cache
       def scan_cachefile(f,blocksize):
           maxsize = 4*blocksize     # Maximum size of an entry
           f.seek(4096)              # Skip the bit-map
           while True:
               headerdata = f.read(36)
               if not headerdata: break
               meta = parse_meta_header(headerdata)
               if (meta['magic'] == 0x00010008 and
                   meta['requestsize'] + meta['infosize'] < maxsize):
                       requestdata = f.read(meta['requestsize']+
                                            meta['infosize'])
                       if parse_request_data(meta,requestdata):
                           add_content_info(meta)
                           yield meta

               # Move the file pointer to the start of the next block
               fp = f.tell()
               if (fp % blocksize):
                   f.seek(blocksize - (fp % blocksize),1)
Copyright (C) 2007, http://www.dabeaz.com                                              2- 92
Data Validation
       This is a sanity check to make sure the header
       data looks like a valid header.

       # Scan a single file in the firefox cache
       def scan_cachefile(f,blocksize):
           maxsize = 4*blocksize     # Maximum size of an entry
           f.seek(4096)              # Skip the bit-map
           while True:
               headerdata = f.read(36)
               if not headerdata: break
               meta = parse_meta_header(headerdata)
               if (meta['magic'] == 0x00010008 and
                   meta['requestsize'] + meta['infosize'] < maxsize):
                       requestdata = f.read(meta['requestsize']+
                                            meta['infosize'])
                       if parse_request_data(meta,requestdata):
                           add_content_info(meta)
                           yield meta

               # Move the file pointer to the start of the next block
               fp = f.tell()
               if (fp % blocksize):
                   f.seek(blocksize - (fp % blocksize),1)
Copyright (C) 2007, http://www.dabeaz.com                                         2- 93
Generating Results
       We are using yield to produce data for a
       single cache entry. If someone uses a for-
       loop, they will get all of the entries.

       Note: This allows us to process the cache
       without reading all of the data into memory.

       # Scan a single file in the firefox cache
       def scan_cachefile(f,blocksize):
           maxsize = 4*blocksize     # Maximum size of an entry
           f.seek(4096)              # Skip the bit-map
           while True:
               headerdata = f.read(36)
               if not headerdata: break
               meta = parse_meta_header(headerdata)
               if (meta['magic'] == 0x00010008 and
                   meta['requestsize'] + meta['infosize'] < maxsize):
                       requestdata = f.read(meta['requestsize']+
                                            meta['infosize'])
                       if parse_request_data(meta,requestdata):
                           add_content_info(meta)
                           yield meta

               # Move the file pointer to the start of the next block
               fp = f.tell()
               if (fp % blocksize):
                   f.seek(blocksize - (fp % blocksize),1)
Copyright (C) 2007, http://www.dabeaz.com                                         2- 94
Commentary

                   • Have created a function that can scan a
                          single _CACHE_00n_ file and produce a
                          sequence of dictionaries with metadata.
                   • It's still somewhat low-level
                   • Just need to package it a little better


Copyright (C) 2007, http://www.dabeaz.com                           2- 95
Part 5 : Scan a Directory

                   • Write a function that takes the name of a
                          Firefox cache directory, scans all of the
                          cache files for metadata, and produces a
                          single sequence of records.
                   • Make it real easy to extract data


Copyright (C) 2007, http://www.dabeaz.com                             2- 96
Solution : Directory Scan
          # Given the name of a Firefox cache directory, the function
          # scans all of the _CACHE_00n_ files for metadata. A sequence
          # of dictionaries containing metadata is returned.

          import os
          def scan_cache(cachedir):
              files = [('_CACHE_001_',256),
                       ('_CACHE_002_',1024),
                       ('_CACHE_003_',4096)]

                   for cname,blocksize in files:
                       cfile = open(os.path.join(cachedir,cname),"rb")
                       for meta in scan_cachefile(cfile,blocksize):
                           meta['cachedir'] = cachedir
                           meta['cachefile'] = cname
                           yield meta
                       cfile.close()


Copyright (C) 2007, http://www.dabeaz.com                                 2- 97
Solution : Directory Scan
           General idea:

           We loop over the three _CACHE_00n_ files and
           produce a sequence of the cache records.

           # Given the name of a Firefox cache directory, the function
           # scans all of the _CACHE_00n_ files for metadata. A sequence
           # of dictionaries containing metadata is returned.

           import os
           def scan_cache(cachedir):
               files = [('_CACHE_001_',256),
                        ('_CACHE_002_',1024),
                        ('_CACHE_003_',4096)]

               for cname,blocksize in files:
                   cfile = open(os.path.join(cachedir,cname),"rb")
                   for meta in scan_cachefile(cfile,blocksize):
                       meta['cachedir'] = cachedir
                       meta['cachefile'] = cname
                       yield meta
                   cfile.close()

Copyright (C) 2007, http://www.dabeaz.com                                 2- 98
Solution : Directory Scan
           We use the low-level file scanning function
           here to generate a sequence of records.

           # Given the name of a Firefox cache directory, the function
           # scans all of the _CACHE_00n_ files for metadata. A sequence
           # of dictionaries containing metadata is returned.

           import os
           def scan_cache(cachedir):
               files = [('_CACHE_001_',256),
                        ('_CACHE_002_',1024),
                        ('_CACHE_003_',4096)]

               for cname,blocksize in files:
                   cfile = open(os.path.join(cachedir,cname),"rb")
                   for meta in scan_cachefile(cfile,blocksize):
                       meta['cachedir'] = cachedir
                       meta['cachefile'] = cname
                       yield meta
                   cfile.close()
Copyright (C) 2007, http://www.dabeaz.com                                 2- 99
More Generation
          By using yield here, we are chaining together the
          results obtained from all three cache files into one
          big long sequence of results.

          The underlying mechanics and implementation
          details are hidden (the user doesn't care).

          # Given the name of a Firefox cache directory, the function
          # scans all of the _CACHE_00n_ files for metadata. A sequence
          # of dictionaries containing metadata is returned.

          import os
          def scan_cache(cachedir):
              files = [('_CACHE_001_',256),
                       ('_CACHE_002_',1024),
                       ('_CACHE_003_',4096)]

              for cname,blocksize in files:
                  cfile = open(os.path.join(cachedir,cname),"rb")
                  for meta in scan_cachefile(cfile,blocksize):
                      meta['cachedir'] = cachedir
                      meta['cachefile'] = cname
                      yield meta
                  cfile.close()

Copyright (C) 2007, http://www.dabeaz.com                                2-100
Additional Data
           Adding path and file information to the data
           (may be useful later).

           # Given the name of a Firefox cache directory, the function
           # scans all of the _CACHE_00n_ files for metadata. A sequence
           # of dictionaries containing metadata is returned.

           import os
           def scan_cache(cachedir):
               files = [('_CACHE_001_',256),
                        ('_CACHE_002_',1024),
                        ('_CACHE_003_',4096)]

               for cname,blocksize in files:
                   cfile = open(os.path.join(cachedir,cname),"rb")
                   for meta in scan_cachefile(cfile,blocksize):
                       meta['cachedir'] = cachedir
                       meta['cachefile'] = cname
                       yield meta
                   cfile.close()

Copyright (C) 2007, http://www.dabeaz.com                                     2-101
Usage : Cache Scan
             • Usage of the scan function
                   >>> for meta in scan_cache("Cache/"):
                   ...        print meta['request']
                   ...
                   http://www.yahoo.com/
                   http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
                   http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
                   ...

             • Given the name of a cache directory, we can
                    just loop over all of the metadata. Trivial!
             • With work, could perform various kinds of
                    queries and processing of the data
Copyright (C) 2007, http://www.dabeaz.com                                     2-102
Another Example
              • Find all requests related to Slashdot
                      >>> for meta in scan_cache("Cache/"):
                      ...        if 'slashdot' in meta['request']:
                      ...             print meta['request']
                      ...
                      http://www.slashdot.org/
                      http://images.slashdot.org/topics/topiccommunications.gif
                      http://images.slashdot.org/topics/topicstorage.gif
                      http://images.slashdot.org/comments.css?T_2_5_0_176
                      ...

              • Well, that was pretty easy.

Copyright (C) 2007, http://www.dabeaz.com                                     2-103
Another Example
             • Find all large JPEG images in the cache
             >>> jpegs = (meta for meta in scan_cache("Cache/")
                                  if meta['content-type'] == 'image/jpeg'
                                  and meta['datasize'] > 100000)
             >>> for j in jpegs:
             ...     print j['request']
             ...
             http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/
             story.jpg
             http://images.salon.com/ent/video_dog/ifc/2007/09/28/
             apocalypse/story.jpg
             http://www.lakesideinns.com/images/fallroadphoto2006.jpg
             ...
             >>>

               • That was also pretty easy
Copyright (C) 2007, http://www.dabeaz.com                                2-104
Part 6 : Scan Everything

                   • Write a function that takes a list of cache
                          directories and produces a sequence of all
                          cache metadata found in all of them.
                    • A single utility function that lets us query
                           everything.




Copyright (C) 2007, http://www.dabeaz.com                              2-105
Scanning Everything
                       # scan an entire list of cache directories producing
                       # a sequence of records

                       def scan(cachedirs):
                           if isinstance(cachedirs,str):
                               cachedirs = [cachedirs]
                           for cdir in cachedirs:
                               for meta in scan_cache(cdir):
                                   yield meta




Copyright (C) 2007, http://www.dabeaz.com                                     2-106
Type Checking
                        # scan an entire list of cache directories producing
                        # a sequence of records

                        def scan(cachedirs):
                            if isinstance(cachedirs,str):
                                cachedirs = [cachedirs]
                            for cdir in cachedirs:
                                for meta in scan_cache(cdir):
                                    yield meta

                      This bit of code is an example of type checking.

                      If the argument is a string, we convert it to a list
                      with one item. This allows the following usage:

                           scan("CacheDir")
                           scan(["CacheDir1","CacheDir2",...])

Copyright (C) 2007, http://www.dabeaz.com                                     2-107
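
So both calling styles work (the directory names here are hypothetical):

            >>> for meta in scan("Cache/"):               # one directory
            ...     print meta['request']
            ...
            >>> for meta in scan(["Cache1/","Cache2/"]):  # several directories
            ...     print meta['request']
            ...

A check against basestring would also accept unicode directory names;
str is what the slide code checks.
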
Putting it all together
               # slack.py
               # Find all of those slackers who should be working
               import sys, os, ffcache

               if len(sys.argv) != 2:
                   print >>sys.stderr,"Usage: python slack.py dirname"
                   raise SystemExit(1)

               caches = (path for path,dirs,files in os.walk(sys.argv[1])
                              if '_CACHE_MAP_' in files)

               for meta in ffcache.scan(caches):
                   if 'slashdot' in meta['request']:
                      print meta['request']
                      print meta['cachedir']
                      print



Copyright (C) 2007, http://www.dabeaz.com                                   2-108
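
The generator expression works because os.walk() yields one
(path, dirs, files) tuple for each directory under the root. For
example (the output path below is hypothetical):

            >>> import os
            >>> for path, dirs, files in os.walk("/home/slacker"):
            ...     if '_CACHE_MAP_' in files:
            ...         print path
            ...
            /home/slacker/.mozilla/firefox/x1y2z3.default/Cache
            >>>
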
Intermission

                   • Have written a simple library ffcache.py
                    • Library takes a moderately complex data
                           processing problem and breaks it up into
                           pieces.
                   • About 100 lines of code.
                   • Now, let's build an application...

Copyright (C) 2007, http://www.dabeaz.com                            2-109
Problem : CacheSpy
                   • Big Brother (make an evil sound here)
                          Write a program that first locates all of the Firefox
                          cache directories under a given directory. Then
                          have that program run forever as a network server,
                          waiting for connections. On each connection, send
                          back all of the current cache metadata.

                    • Big Picture
                          We're going to write a daemon that will find and
                          quietly report on browser cache contents.


Copyright (C) 2007, http://www.dabeaz.com                                        2-110
cachespy.py
           import sys, os, pickle, SocketServer, ffcache

           SPY_PORT = 31337
           caches = [path for path,dname,files in os.walk(sys.argv[1])
                          if '_CACHE_MAP_' in files]

           def dump_cache(f):
               for meta in ffcache.scan(caches):
                   pickle.dump(meta,f)

           class SpyHandler(SocketServer.BaseRequestHandler):
               def handle(self):
                   f = self.request.makefile()
                   dump_cache(f)
                   f.close()

           SocketServer.TCPServer.allow_reuse_address = True
           serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
           print "CacheSpy running on port %d" % SPY_PORT
           serv.serve_forever()
Copyright (C) 2007, http://www.dabeaz.com                                2-111
SocketServer Module
             SocketServer

             A module for easily creating
             low-level internet applications
             using sockets.

            import sys, os, pickle, SocketServer, ffcache

            SPY_PORT = 31337
            caches = [path for path,dname,files in os.walk(sys.argv[1])
                           if '_CACHE_MAP_' in files]

            def dump_cache(f):
                for meta in ffcache.scan(caches):
                    pickle.dump(meta,f)

            class SpyHandler(SocketServer.BaseRequestHandler):
                def handle(self):
                    f = self.request.makefile()
                    dump_cache(f)
                    f.close()

            SocketServer.TCPServer.allow_reuse_address = True
            serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
            print "CacheSpy running on port %d" % SPY_PORT
            serv.serve_forever()
Copyright (C) 2007, http://www.dabeaz.com                                2-112
SocketServer Handlers
             You define a simple class that
             implements handle().

             This implements the server logic.

            import sys, os, pickle, SocketServer, ffcache

            SPY_PORT = 31337
            caches = [path for path,dname,files in os.walk(sys.argv[1])
                           if '_CACHE_MAP_' in files]

            def dump_cache(f):
                for meta in ffcache.scan(caches):
                    pickle.dump(meta,f)

            class SpyHandler(SocketServer.BaseRequestHandler):
                def handle(self):
                    f = self.request.makefile()
                    dump_cache(f)
                    f.close()

            SocketServer.TCPServer.allow_reuse_address = True
            serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
            print "CacheSpy running on port %d" % SPY_PORT
            serv.serve_forever()
Copyright (C) 2007, http://www.dabeaz.com                                2-113
SocketServer Servers
             Next, you just create a Server object,
             hook the handler up to it, and run the
             server.

            import sys, os, pickle, SocketServer, ffcache

            SPY_PORT = 31337
            caches = [path for path,dname,files in os.walk(sys.argv[1])
                           if '_CACHE_MAP_' in files]

            def dump_cache(f):
                for meta in ffcache.scan(caches):
                    pickle.dump(meta,f)

            class SpyHandler(SocketServer.BaseRequestHandler):
                def handle(self):
                    f = self.request.makefile()
                    dump_cache(f)
                    f.close()

            SocketServer.TCPServer.allow_reuse_address = True
            serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
            print "CacheSpy running on port %d" % SPY_PORT
            serv.serve_forever()
Copyright (C) 2007, http://www.dabeaz.com                                2-114
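
A matching test client is a minimal sketch away: connect to the spy
port, wrap the socket in a file, and unpickle records until the stream
ends. This is just a sketch (the host name is hypothetical), not code
from the slides:

            # cacheclient.py (sketch)
            import socket, pickle

            def fetch_cache(host, port=31337):
                s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                s.connect((host, port))
                f = s.makefile("rb")
                try:
                    while True:
                        yield pickle.load(f)   # one metadata dict per record
                except EOFError:
                    pass                       # server closed the connection
                finally:
                    f.close()
                    s.close()

            for meta in fetch_cache("localhost"):
                print meta['request']
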
Histor y of HAM Radio presentation slidevu2urc
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Recently uploaded (20)

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Python in Action (Part 2)

  • 1. Python in Action Presented at USENIX LISA Conference November 16, 2007 David M. Beazley http://www.dabeaz.com (Part II - Systems Programming) Copyright (C) 2007, http://www.dabeaz.com 2- 1
  • 2. Section Overview • In this section, we're going to get dirty • Systems Programming • Files, I/O, file-system • Text parsing, data decoding • Processes and IPC • Networking • Threads and concurrency Copyright (C) 2007, http://www.dabeaz.com 2- 2
  • 3. Commentary • I personally think Python is a fantastic tool for systems programming. • Modules provide access to most of the major system libraries I used to access via C • No enforcement of "morality" • Decent performance • It just "works" and it's fun Copyright (C) 2007, http://www.dabeaz.com 2- 3
  • 4. Approach • I've thought long and hard about how I would present this part of the class. • A reference manual approach would probably be long and very boring. • So instead, we're going to focus on building something more in tune with the times Copyright (C) 2007, http://www.dabeaz.com 2- 4
  • 5. "To Catch a Slacker" • Write a collection of Python programs that can quietly monitor Firefox browser caches to find out who has been spending their day reading Slashdot instead of working on their TPS reports. • Oh yeah, and be a real sneaky bugger about it. Copyright (C) 2007, http://www.dabeaz.com 2- 5
  • 6. Why this Problem? • Involves a real-world system and data • Firefox already installed on your machine (?) • Cross platform (Linux, Mac, Windows) • Example of tool building • Related to a variety of practical problems • A good tour of "Python in Action" Copyright (C) 2007, http://www.dabeaz.com 2- 6
  • 7. Disclaimers • I am not involved in browser forensics (or spyware for that matter). • I am in no way affiliated with Firefox/Mozilla nor have I ever seen Firefox source code • I have never worked with the cache data prior to preparing this tutorial • I have never used any third-party tools for looking at this data. Copyright (C) 2007, http://www.dabeaz.com 2- 7
  • 8. More Disclaimers • All of the code in this tutorial works with a standard Python installation • No third party modules. • All code is cross-platform • Code samples are available online at http://www.dabeaz.com/action/ • Please look at that code and follow along Copyright (C) 2007, http://www.dabeaz.com 2- 8
  • 9. Assumptions • This is not a tutorial on systems concepts • You should be generally familiar with background material (files, filesystems, file formats, processes, threads, networking, protocols, etc.) • Hopefully you can "extrapolate" from the material presented here to construct more advanced Python applications. Copyright (C) 2007, http://www.dabeaz.com 2- 9
  • 10. The Big Picture • We want to write a tool that allows someone to locate, inspect, and perform queries across a distributed collection of Firefox caches. • For example, the cache directories on all machines on the LAN of a quasi-evil corporation. Copyright (C) 2007, http://www.dabeaz.com 2- 10
  • 11. The Firefox Cache • The Firefox browser keeps a disk cache of recently visited sites % ls Cache/ -rw------- 1 beazley 111169 Sep 25 17:15 01CC0844d01 -rw------- 1 beazley 104991 Sep 25 17:15 01CC3844d01 -rw------- 1 beazley 47233 Sep 24 16:41 021F221Ad01 ... -rw------- 1 beazley 26749 Sep 21 11:19 FF8AEDF0d01 -rw------- 1 beazley 58172 Sep 25 18:16 FFE628C6d01 -rw------- 1 beazley 1939456 Sep 25 19:14 _CACHE_001_ -rw------- 1 beazley 2588672 Sep 25 19:14 _CACHE_002_ -rw------- 1 beazley 4567040 Sep 25 18:44 _CACHE_003_ -rw------- 1 beazley 33044 Sep 23 21:58 _CACHE_MAP_ • A bunch of cryptically named files. Copyright (C) 2007, http://www.dabeaz.com 2- 11
  • 12. Problem : Finding Files • Find the Firefox cache Write a program findcache.py that takes a directory name as input and recursively scans that directory and all subdirectories looking for Firefox/Mozilla cache directories. • Example: % python findcache.py /Users/beazley /Users/beazley/Library/.../qs1ab616.default/Cache /Users/beazley/Library/.../wxuoyiuf.slt/Cache % • Use case: Searching for things on the filesystem. Copyright (C) 2007, http://www.dabeaz.com 2- 12
  • 13. findcache.py
        # findcache.py
        # Recursively scan a directory looking for
        # Firefox/Mozilla cache directories
        import sys
        import os

        if len(sys.argv) != 2:
            print >>sys.stderr,"Usage: python findcache.py dirname"
            raise SystemExit(1)

        caches = (path for path,dirs,files in os.walk(sys.argv[1])
                       if '_CACHE_MAP_' in files)

        for name in caches:
            print name
    Copyright (C) 2007, http://www.dabeaz.com 2- 13
  • 14. The sys module • The sys module has basic information related to the execution environment. • sys.argv : A list of the command line options. In findcache.py, sys.argv = ['findcache.py', '/Users/beazley'] • sys.stdin, sys.stdout, sys.stderr : Standard I/O files Copyright (C) 2007, http://www.dabeaz.com 2- 14
  • 15. Program Termination • raise SystemExit(1) raises the SystemExit exception, which forces Python to exit. The value is the return code. • In findcache.py, it is raised after printing a usage message when the wrong number of command line arguments is given. Copyright (C) 2007, http://www.dabeaz.com 2- 15
  • 16. os Module • The os module contains useful OS related functions (files, processes, etc.). • findcache.py imports it in order to use os.walk(). Copyright (C) 2007, http://www.dabeaz.com 2- 16
  • 17. os.walk() • os.walk(topdir) recursively walks a directory tree and generates a sequence of tuples (path,dirs,files) • path : The current directory name • dirs : List of all subdirectory names in path • files : List of all regular files (data) in path Copyright (C) 2007, http://www.dabeaz.com 2- 17
  • 18. A Sequence of Caches • The generator expression caches = (path for path,dirs,files in os.walk(sys.argv[1]) if '_CACHE_MAP_' in files) generates a sequence of directory names where '_CACHE_MAP_' is contained in the file list. • path is the directory name that is generated as a result; '_CACHE_MAP_' in files is the file name check. Copyright (C) 2007, http://www.dabeaz.com 2- 18
  • 19. Printing the Result • The final loop, for name in caches: print name, prints the sequence of cache directories generated by the previous statement. Copyright (C) 2007, http://www.dabeaz.com 2- 19
  • 20. Commentary • Our solution is strongly based on a "declarative" programming style (again) • We simply write out a sequence of operations that produce what we want • Not focused on the underlying mechanics of how to traverse all of the directories. Copyright (C) 2007, http://www.dabeaz.com 2- 20
  • 21. Big Idea : Iteration • Python allows iteration to be captured as a kind of object. caches = (path for path,dirs,files in os.walk(sys.argv[1]) if '_CACHE_MAP_' in files) • This de-couples iteration from the code that uses the iteration for name in caches: print name • Another usage example: for name in caches: print len(os.listdir(name)), name Copyright (C) 2007, http://www.dabeaz.com 2- 21
  • 22. Big Idea : Iteration • Compare to this: for path,dirs,files in os.walk(sys.argv[1]): if '_CACHE_MAP_' in files: print len(os.listdir(path)),path • This code is simple, but the loop and the code that executes in the loop body are coupled together • Not as flexible, but this is somewhat subtle to wrap your brain around at first. Copyright (C) 2007, http://www.dabeaz.com 2- 22
  • 23. Mini-Reference : sys, os • sys module sys.argv # List of command line options sys.stdin # Standard input sys.stdout # Standard output sys.stderr # Standard error sys.executable # Full path of Python executable sys.exc_info() # Information on current exception • os module os.walk(dir) # Recursively walk dir producing a # sequence of tuples (path,dlist,flist) os.listdir(dir) # Return a list of all files in dir • SystemExit exception raise SystemExit(n) # Exit with integer code n Copyright (C) 2007, http://www.dabeaz.com 2- 23
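A minimal sketch of our own (not from the slides) combining the reference material above, written in the same Python 2 style as the deck; the script name sizes.py is hypothetical. It reuses the findcache.py generator expression but feeds it to a different consumer:

    # sizes.py (hypothetical companion to findcache.py)
    # Print each cache directory along with the total size of the
    # regular files directly inside it.
    import sys
    import os

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    for name in caches:
        total = sum(os.path.getsize(os.path.join(name,f))
                    for f in os.listdir(name)
                    if os.path.isfile(os.path.join(name,f)))
        print total, name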
  • 24. Problem: Searching for Text • Extract all URL requests from the cache Write a program requests.py that scans the contents of the _CACHE_00n_ files and prints a list of URLs for documents stored in the cache. • Example: % python requests.py /Users/.../qs1ab616.default/Cache http://www.yahoo.com/ http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1.1.j http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png ... % • Use case: Searching the contents of files for text patterns. Copyright (C) 2007, http://www.dabeaz.com 2- 24
  • 25. The Firefox Cache • The cache directory holds two types of data • Metadata (URLs, headers, etc.). • Raw data (HTML, JPEG, PNG, etc.) • This data is stored in two places • Cryptic files in the Cache directory • Blocks inside the _CACHE_00n_ files • Metadata almost always in _CACHE_00n_ Copyright (C) 2007, http://www.dabeaz.com 2- 25
  • 26. Possible Solution : Regex • The _CACHE_00n_ files are encoded in a binary format, but URLs are embedded inside as null-terminated text: \x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f\xceF\xff\x9f\xce\x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a\x00\x00\x023HTTP:http://slashdot.org/\x00request-method\x00GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00request-Accept-Encoding\x00gzip,deflate\x00response-head\x00HTTP/1.1 200 OK\r\nDate: Sun, 30 Sep 2007 13:07:29 GMT\r\nServer: Apache/1.3.37 (Unix) mod_perl/1.29\r\nSLASH_LOG_DATA: shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I live my life if I can't tell good from evil?\r\nCache-Control: • Maybe the requests could just be ripped using a regular expression. Copyright (C) 2007, http://www.dabeaz.com 2- 26
  • 27. A Regex Solution
        # requests.py
        import re
        import os
        import sys

        cachedir   = sys.argv[1]
        cachefiles = [ '_CACHE_001_', '_CACHE_002_', '_CACHE_003_' ]

        # A regex for embedded URL strings
        request_pat = re.compile(r'([a-z]+://.*?)\x00')

        # Loop over all files and search for URLs
        for name in cachefiles:
            data = open(os.path.join(cachedir,name),"rb").read()
            index = 0
            while True:
                m = request_pat.search(data,index)
                if not m: break
                print m.group(1)
                index = m.end()
    Copyright (C) 2007, http://www.dabeaz.com 2- 27
  • 28. The re module • The re module contains all functionality related to regular expression pattern matching, searching, replacing, etc. • Features are strongly influenced by Perl, but regexes are not directly integrated into the Python language. Copyright (C) 2007, http://www.dabeaz.com 2- 28
  • 29. Using re • Patterns are first specified as strings and compiled into a regex object: pat = re.compile(pattern [,flags]) • The pattern syntax is "standard": pat* pat+ pat? pat1|pat2 (pat) . [chars] [^chars] pat{n} pat{n,m} Copyright (C) 2007, http://www.dabeaz.com 2- 29
  • 30. Using re • All subsequent operations are methods of the compiled regex pattern: m = pat.match(data [,start]) # Check for match; m = pat.search(data [,start]) # Search for match; newdata = pat.sub(repl, data) # Pattern replace Copyright (C) 2007, http://www.dabeaz.com 2- 30
  • 31. Searching for Matches • pat.search(text [,start]) searches the string text for the first occurrence of the regex pattern, starting at position start. Returns a "MatchObject" if a match is found. • In the requests.py loop, we're finding matches one at a time. Copyright (C) 2007, http://www.dabeaz.com 2- 31
  • 32. Match Objects • Regex matches are represented by a MatchObject: m.group([n]) # Text matched by group n; m.start([n]) # Starting index of group n; m.end([n]) # End index of group n • In requests.py, m.group(1) is the matching text for just the URL, and m.end() is the end of the match. Copyright (C) 2007, http://www.dabeaz.com 2- 32
  • 33. Groups • In patterns, parentheses () define groups which are numbered left to right: group 0 # The entire pattern; group 1 # Text in first () group; group 2 # Text in next () group; ... Copyright (C) 2007, http://www.dabeaz.com 2- 33
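An aside, not in the original slides: a quick interactive illustration of the group numbering, using a URL-style pattern of our own.

    >>> import re
    >>> m = re.compile(r'([a-z]+)://(.*)').match('http://slashdot.org/')
    >>> m.group(0)
    'http://slashdot.org/'
    >>> m.group(1)
    'http'
    >>> m.group(2)
    'slashdot.org/'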
  • 34. Mini-Reference : re • re pattern compilation pat = re.compile(r'patternstring') • Pattern syntax literal # Match literal text pat* # Match 0 or more repetitions of pat pat+ # Match 1 or more repetitions of pat pat? # Match 0 or 1 repetitions of pat pat1|pat2 # Match pat1 or pat2 (pat) # Match pat (group) [chars] # Match characters in chars [^chars] # Match characters not in chars . # Match any character except \n \d # Match any digit \w # Match alphanumeric character \s # Match whitespace Copyright (C) 2007, http://www.dabeaz.com 2- 34
  • 35. Mini-Reference : re • Common pattern operations pat.search(text) # Search text for a match pat.match(text) # Search start of text for match pat.sub(repl,text) # Replace pattern with repl • Match objects m.group([n]) # Text matched by group n m.start([n]) # Starting position of group n m.end([n]) # Ending position of group n • How to loop over all matches of a pattern for m in pat.finditer(text): # m is a MatchObject that you process ... Copyright (C) 2007, http://www.dabeaz.com 2- 35
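As an aside (not in the slides): the explicit search/index loop in requests.py can also be written with pat.finditer(), which handles the position bookkeeping itself. A sketch in the same Python 2 style as the deck; the name requests2.py is ours.

    # requests2.py (hypothetical variant of requests.py using finditer)
    import re
    import os
    import sys

    cachedir    = sys.argv[1]
    cachefiles  = [ '_CACHE_001_', '_CACHE_002_', '_CACHE_003_' ]
    request_pat = re.compile(r'([a-z]+://.*?)\x00')

    for name in cachefiles:
        data = open(os.path.join(cachedir,name),"rb").read()
        # finditer() yields one MatchObject per embedded URL
        for m in request_pat.finditer(data):
            print m.group(1)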
  • 36. Mini-Reference : re • An example of pattern replacement # This replaces American dates of the form 'mm/dd/yyyy' # with European dates of the form 'dd/mm/yyyy'. # This function takes a MatchObject as input and returns # replacement text as output. def euro_date(m): month = m.group(1) day = m.group(2) year = m.group(3) return "%s/%s/%s" % (day,month,year) # Date re pattern and replacement operation datepat = re.compile(r'(\d+)/(\d+)/(\d+)') newdata = datepat.sub(euro_date,text) Copyright (C) 2007, http://www.dabeaz.com 2- 36
  • 37. Mini-Reference : re • There are many more features of the re module • Strongly influenced by Perl (feature set) • Regexs are a library in Python, not integrated into the language. • A book on regular expressions may be essential for advanced functions. Copyright (C) 2007, http://www.dabeaz.com 2- 37
  • 38. File Handling • What is going on in this statement? data = open(os.path.join(cachedir,name),"rb").read() Copyright (C) 2007, http://www.dabeaz.com 2- 38
  • 39. os.path module • os.path has portable file related functions: os.path.join(name1,name2,...) # Join path names; os.path.getsize(filename) # Get the file size; os.path.getmtime(filename) # Get modification date • There are many more functions, but this is the preferred module for basic filename handling Copyright (C) 2007, http://www.dabeaz.com 2- 39
  • 40. os.path.join() • Creates a fully-expanded pathname: dirname = '/foo/bar'; filename = 'name'; os.path.join(dirname,filename) gives '/foo/bar/name' • Aware of platform differences ('/' vs. '\') Copyright (C) 2007, http://www.dabeaz.com 2- 40
  • 41. Mini-Reference : os.path os.path.join(s1,s2,...) # Join pathname parts together os.path.getsize(path) # Get file size of path os.path.getmtime(path) # Get modify time of path os.path.getatime(path) # Get access time of path os.path.getctime(path) # Get creation time of path os.path.exists(path) # Check if path exists os.path.isfile(path) # Check if regular file os.path.isdir(path) # Check if directory os.path.islink(path) # Check if symbolic link os.path.basename(path) # Return file part of path os.path.dirname(path) # Return dir part of path os.path.abspath(path) # Get absolute path Copyright (C) 2007, http://www.dabeaz.com 2- 41
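A quick interactive check of a few of these (example ours; output shown for a Unix system):

    >>> import os.path
    >>> os.path.join('/Users/beazley','Cache','_CACHE_001_')
    '/Users/beazley/Cache/_CACHE_001_'
    >>> os.path.basename('/Users/beazley/Cache/_CACHE_001_')
    '_CACHE_001_'
    >>> os.path.dirname('/Users/beazley/Cache/_CACHE_001_')
    '/Users/beazley/Cache'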
  • 42. Binary I/O • For all binary files, use modes "rb","wb", etc.: data = open(os.path.join(cachedir,name),"rb").read() • This disables new-line translation (critical on Windows) Copyright (C) 2007, http://www.dabeaz.com 2- 42
  • 43. Common I/O Shortcuts # Read an entire file into a string data = open(filename).read() # Write a string out to a file open(filename,"w").write(text) # Loop over all lines in a file for line in open(filename): ... Copyright (C) 2007, http://www.dabeaz.com 2- 43
  • 44. Commentary on Solution • This regex approach is mostly a hack for this particular application. • Reads entire cache files into memory as strings (may be quite large) • Only finds URLs, no other metadata • Some risk of false positives since URLs could also be embedded in data. Copyright (C) 2007, http://www.dabeaz.com 2- 44
  • 45. Commentary • We have started to build a collection of very simple command line tools • Very much in the "Unix tradition." • Python makes it easy to create such tools • More complex applications could be assembled by simply gluing scripts together Copyright (C) 2007, http://www.dabeaz.com 2- 45
  • 46. Working with Processes • It is common to write programs that run other programs, collect their output, etc. • Pipes • Interprocess Communication • Python has a variety of modules for supporting this. Copyright (C) 2007, http://www.dabeaz.com 2- 46
  • 47. subprocess Module • A module for creating and interacting with subprocesses • Consolidates a number of low-level OS functions such as system(), execv(), spawnv(), pipe(), popen2(), etc. into a single module • Cross platform (Unix/Windows) Copyright (C) 2007, http://www.dabeaz.com 2- 47
  • 48. Example : Slackers • Find slacker cache entries. Using the programs findcache.py and requests.py as subprocesses, write a program that inspects cache directories and prints out all entries that contain the word 'slashdot' in the URL. Copyright (C) 2007, http://www.dabeaz.com 2- 48
  • 49. slackers.py
        # slackers.py
        import sys
        import subprocess

        # Run findcache.py as a subprocess
        finder = subprocess.Popen(
                    [sys.executable,"findcache.py",sys.argv[1]],
                    stdout=subprocess.PIPE)
        dirlist = [line.strip() for line in finder.stdout]

        # Run requests.py as a subprocess
        for cachedir in dirlist:
            searcher = subprocess.Popen(
                    [sys.executable,"requests.py",cachedir],
                    stdout=subprocess.PIPE)
            for line in searcher.stdout:
                if 'slashdot' in line:
                    print line,
    Copyright (C) 2007, http://www.dabeaz.com 2- 49
  • 50. Launching a subprocess • finder = subprocess.Popen([sys.executable,"findcache.py",sys.argv[1]], stdout=subprocess.PIPE) launches a python script as a subprocess, connecting its stdout stream to a pipe. • dirlist = [line.strip() for line in finder.stdout] collects the output, with newline stripping. Copyright (C) 2007, http://www.dabeaz.com 2- 50
  • 51. Python Executable • sys.executable is the full pathname of the python interpreter, used here to launch the child scripts. Copyright (C) 2007, http://www.dabeaz.com 2- 51
  • 52. Subprocess Arguments • The first argument to Popen is the list of arguments to the subprocess. It corresponds to what would appear on a shell command line. Copyright (C) 2007, http://www.dabeaz.com 2- 52
  • 53. slackers.py • More of the same idea. For each directory we found in the last step, we run requests.py to produce requests. Copyright (C) 2007, http://www.dabeaz.com 2- 53
  • 54. Commentary • subprocess is a large module with many options. • However, it takes care of a lot of annoying platform-specific details for you. • Currently the "recommended" way of dealing with subprocesses. Copyright (C) 2007, http://www.dabeaz.com 2- 54
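One more subprocess pattern worth knowing (a sketch of ours, not from the slides): if you want all of a child's output at once rather than a line at a time, communicate() reads the piped streams to EOF and waits for the process to exit. This assumes the same findcache.py script and an example cache path.

    import sys
    import subprocess

    finder = subprocess.Popen(
                [sys.executable,"findcache.py","/Users/beazley"],
                stdout=subprocess.PIPE)
    out, err = finder.communicate()   # err is None (stderr not piped)
    if finder.returncode != 0:
        raise SystemExit(1)
    dirlist = out.splitlines()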
  • 55. Low Level Subprocesses • Running a simple system command os.system("shell command") • Connecting to a subprocess with pipes pout, pin = popen2.popen2("shell command") • Exec/spawn os.execv(),os.execl(),os.execle(),... os.spawnv(),os.spawnvl(), os.spawnle(),... • Unix fork() os.fork(), os.wait(), os.waitpid(), os._exit(), ... Copyright (C) 2007, http://www.dabeaz.com 2- 55
  • 56. Interactive Processes • Python does not have built-in support for controlling interactive subprocesses (e.g., "Expect") • Must install third party modules for this • Example: pexpect • http://pexpect.sourceforge.net Copyright (C) 2007, http://www.dabeaz.com 2- 56
  • 57. Commentary • Writing small Unix-like utilities is fairly straightforward in Python • Support for standard kinds of operations (files, regular expressions, pipes, subprocesses, etc.) • However, our solution is also kind of clunky • Only returns some information • Not particularly memory efficient (reads large files into memory) Copyright (C) 2007, http://www.dabeaz.com 2- 57
  • 58. Interlude • Python is well-suited to building libraries and frameworks. • In the next part, we're going to take a totally different approach than simply writing simple utilities. • Will build libraries for manipulating cache data and use those libraries to build tools. Copyright (C) 2007, http://www.dabeaz.com 2- 58
  • 59. Problem : Parsing Data • Extract the cache data (for real) Write a module ffcache.py that contains a set of functions for reading Firefox cache data into useful data structures that can be used by other programs. Capture all available information including URLs, timestamps, sizes, locations, content types, etc. • Use case: The blood and guts of systems programming. Writing programs that can process foreign file formats. Processing binary encoded data. Creating code for later reuse. Copyright (C) 2007, http://www.dabeaz.com 2- 59
  • 60. The Firefox Cache • There are four critical files _CACHE_MAP_ # Cache index _CACHE_001_ # Cache data _CACHE_002_ # Cache data _CACHE_003_ # Cache data • All files are binary-encoded • _CACHE_MAP_ is used by Firefox to locate data, but it is not updated until Firefox exits. • We will ignore _CACHE_MAP_ since we want to observe caches of live Firefox sessions. Copyright (C) 2007, http://www.dabeaz.com 2- 60
  • 61. Firefox _CACHE_ Files • _CACHE_00n_ file organization: a free/used block bitmap (4096 bytes), followed by up to 32768 blocks. • The block size varies according to the file: _CACHE_001_ 256 byte blocks; _CACHE_002_ 1024 byte blocks; _CACHE_003_ 4096 byte blocks Copyright (C) 2007, http://www.dabeaz.com 2- 61
  • 62. Cache Entries • Each cache entry: • A maximum of 4 cache blocks • Can either be data or metadata • If >16K, written to a file instead • Notice how all the "cryptic" files are >16K -rw------- beazley 111169 Sep 25 17:15 01CC0844d01 -rw------- beazley 104991 Sep 25 17:15 01CC3844d01 -rw------- beazley 47233 Sep 24 16:41 021F221Ad01 ... -rw------- beazley 26749 Sep 21 11:19 FF8AEDF0d01 -rw------- beazley 58172 Sep 25 18:16 FFE628C6d01 Copyright (C) 2007, http://www.dabeaz.com 2- 62
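The block arithmetic implied here, as a small sketch (the helper name blocks_needed is ours, not Firefox's): an entry of n bytes occupies ceil(n/blocksize) blocks, and anything needing more than 4 blocks of the largest size (4*4096 = 16K) ends up in its own file.

    def blocks_needed(nbytes, blocksize):
        # Number of blocks an entry of nbytes occupies (rounded up)
        return (nbytes + blocksize - 1) // blocksize

    assert blocks_needed(10000, 4096) == 3   # fits in _CACHE_003_
    assert blocks_needed(20000, 4096) == 5   # > 4 blocks: separate file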
  • 63. Cache Metadata • Metadata is encoded as a binary structure: a 36-byte header, followed by a variable-length request string and variable-length request info (both lengths given in the header). • Header encoding (binary, big-endian): 0-3 magic (???) unsigned int (0x00010008); 4-7 location unsigned int; 8-11 fetchcount unsigned int; 12-15 fetchtime unsigned int (system time); 16-19 modifytime unsigned int (system time); 20-23 expiretime unsigned int (system time); 24-27 datasize unsigned int (byte count); 28-31 requestsize unsigned int (byte count); 32-35 infosize unsigned int (byte count) Copyright (C) 2007, http://www.dabeaz.com 2- 63
  • 64. Solution Outline • Part 1: Parsing Metadata Headers • Part 2: Getting request information (URL) • Part 3: Extracting additional content info • Part 4: Scanning of individual cache files • Part 5: Scanning an entire directory • Part 6: Scanning a list of directories Copyright (C) 2007, http://www.dabeaz.com 2- 64
  • 65. Part I - Reading Headers • Write a function that can parse the binary metadata header and return the data in a useful format Copyright (C) 2007, http://www.dabeaz.com 2- 65
  • 66. Reading Headers
        import struct

        # This function parses a cache metadata header into a dict
        # of named fields (listed in _headernames below)
        _headernames = ['magic','location','fetchcount',
                        'fetchtime','modifytime','expiretime',
                        'datasize','requestsize','infosize']

        def parse_meta_header(headerdata):
            head = struct.unpack(">9I",headerdata)
            meta = dict(zip(_headernames,head))
            return meta
    Copyright (C) 2007, http://www.dabeaz.com 2- 66
  • 67. Reading Headers • How this is supposed to work: >>> f = open("Cache/_CACHE_001_","rb") >>> f.seek(4096) # Skip the bit map >>> headerdata = f.read(36) # Read 36 byte header >>> meta = parse_meta_header(headerdata) >>> meta {'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544, 'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L, 'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531} >>> • Basically, we're parsing the header into a useful Python data structure (a dictionary) Copyright (C) 2007, http://www.dabeaz.com 2- 67
  • 68. struct module • The struct module parses binary encoded data into Python objects. You would use this module to pack/unpack raw binary data from Python strings. • In parse_meta_header(), struct.unpack(">9I",headerdata) unpacks 9 unsigned 32-bit big-endian integers. Copyright (C) 2007, http://www.dabeaz.com 2- 68
  • 69. struct module • The result is always a tuple of converted values, e.g. head = (65544, 0, 1, 1191682051, 1191682051, 0, 8645, 190, 218) Copyright (C) 2007, http://www.dabeaz.com 2- 69
  • 70. Dictionary Creation • zip(s1,s2) makes a list of tuples: zip(_headernames,head) gives [('magic',head[0]), ('location',head[1]), ('fetchcount',head[2]), ...] • dict() then makes a dictionary from those (key,value) tuples. Copyright (C) 2007, http://www.dabeaz.com 2- 70
  • 71. Commentary • Dictionaries as data structures meta = { 'fetchtime' : 1190829792, 'requestsize' : 27, 'magic' : 65544, 'fetchcount' : 3, 'expiretime' : 0, 'location' : 2449473536L, 'modifytime' : 1190829792, 'datasize' : 29448, 'infosize' : 531 } • Useful if data has many parts data = f.read(meta[8]) # Huh?!? vs. data = f.read(meta['infosize']) # Better Copyright (C) 2007, http://www.dabeaz.com 2- 71
  • 72. Mini-reference : struct • struct module items = struct.unpack(fmt,data) data = struct.pack(fmt,item1,...,itemn) • Sample Format codes 'c' char (1 byte string) 'b' signed char (8-bit integer) 'B' unsigned char (8-bit integer) 'h' signed short (16-bit integer) 'H' unsigned short (16-bit integer) 'i' int (32-bit integer) 'I' unsigned int (32-bit integer) 'f' 32-bit single precision float 'd' 64-bit double precision float 's' char s[] (String) '>' Big endian modifier '<' Little endian modifier '!' Network order modifier 'n' Repetition count modifier Copyright (C) 2007, http://www.dabeaz.com 2- 72
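A round-trip sanity check of our own (not from the slides) that the ">9I" format really does describe the 36-byte header:

    >>> import struct
    >>> values = (0x00010008, 0, 1, 1191682051, 1191682051, 0, 8645, 190, 218)
    >>> headerdata = struct.pack(">9I", *values)
    >>> len(headerdata)
    36
    >>> struct.unpack(">9I", headerdata) == values
    True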
  • 73. Part 2 : Parsing Requests • Write a function that will read the URL request string and request information • Request String : A Null-terminated string • Request Info : A sequence of Null-terminated key-value pairs (like a dictionary) Copyright (C) 2007, http://www.dabeaz.com 2- 73
  • 74. Parsing Requests
        import re
        part_pat = re.compile(r'[\n\r -~]*$')

        def parse_request_data(meta,requestdata):
            parts = requestdata.split('\x00')
            for part in parts:
                if not part_pat.match(part):
                    return False
            request = parts[0]
            if len(request) != (meta['requestsize'] - 1):
                return False
            info = dict(zip(parts[1::2],parts[2::2]))
            meta['request'] = request.split(':',1)[1]
            meta['info'] = info
            return True
    Copyright (C) 2007, http://www.dabeaz.com 2- 74
  • 75. Usage : Requests • Usage of the function: >>> f = open("Cache/_CACHE_001_","rb") >>> f.seek(4096) # Skip the bit map >>> headerdata = f.read(36) # Read 36 byte header >>> meta = parse_meta_header(headerdata) >>> requestdata = f.read(meta['requestsize']+meta['infosize']) >>> parse_request_data(meta,requestdata) True >>> meta['request'] 'http://www.yahoo.com/' >>> meta['info'] {'request-method': 'GET', 'request-User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7', 'charset': 'UTF-8', 'response-head': 'HTTP/1.1 200 OK\r\nDate: Wed, 26 Sep 2007 18:03:17 ...' } >>> Copyright (C) 2007, http://www.dabeaz.com 2- 75
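To see what the validation regex accepts and rejects (example ours): the class [\n\r -~] covers newlines plus the printable ASCII range from space to tilde, so binary junk fails the match.

    >>> import re
    >>> part_pat = re.compile(r'[\n\r -~]*$')
    >>> part_pat.match('request-method') is not None
    True
    >>> part_pat.match('abc\x01def') is not None
    False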
  • 76. String Stripping • The request data is a sequence of null-terminated strings. parts = requestdata.split('\x00') splits the data up into parts: requestdata = 'part\x00part\x00part\x00part\x00...' becomes parts = ['part','part','part','part',...] Copyright (C) 2007, http://www.dabeaz.com 2- 76
  • 77. String Validation • Individual parts are printable characters except for newline characters ('\n\r'). • We use the re module to match each string. This helps catch cases where we might be reading bad data (false headers, raw data, etc.). Copyright (C) 2007, http://www.dabeaz.com 2- 77
  • 78. URL Request String • The request string is the first part, request = parts[0]. The check that follows makes sure it's the right size (a further sanity check on the data integrity). Copyright (C) 2007, http://www.dabeaz.com 2- 78
  • 79. Request Info • Each request has a set of associated data represented as key/value pairs: parts = ['request','key','val','key','val','key','val'] parts[1::2] gives ['key','key','key']; parts[2::2] gives ['val','val','val']; zip(parts[1::2],parts[2::2]) gives [('key','val'), ('key','val'), ('key','val')] • dict() then makes a dictionary from the (key,val) tuples. Copyright (C) 2007, http://www.dabeaz.com 2- 79
  • 80. Fixing the Request • Cleaning up the request string: request = "HTTP:http://www.google.com"; request.split(':',1) gives ['HTTP','http://www.google.com']; taking [1] gives 'http://www.google.com'.
        # Given a dictionary of header information and a file,
        # this function extracts the request data from a cache
        # metadata entry and saves it in the dictionary. Returns
        # True or False depending on success.
        def read_request_data(header,f):
            request = f.read(header['requestsize']).strip('\x00')
            infodata = f.read(header['infosize']).strip('\x00')
            # Validate request and infodata here (nothing now)
            # Turn the infodata into a dictionary
            parts = infodata.split('\x00')
            info = dict(zip(parts[::2],parts[1::2]))
            header['request'] = request.split(':',1)[1]
            header['info'] = info
            return True
    Copyright (C) 2007, http://www.dabeaz.com 2- 80
  • 81. Commentary • Emphasize that Python has very powerful list manipulation primitives • Indexing • Slicing • List comprehensions • Etc. • Knowing how to use these leads to rapid development and compact code Copyright (C) 2007, http://www.dabeaz.com 2- 81
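The slicing idiom from parse_request_data(), in isolation (example ours):

    >>> parts = ['request','key1','val1','key2','val2']
    >>> parts[1::2]
    ['key1', 'key2']
    >>> parts[2::2]
    ['val1', 'val2']
    >>> sorted(dict(zip(parts[1::2],parts[2::2])).items())
    [('key1', 'val1'), ('key2', 'val2')]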
  • 82. Part 3: Content Info • All documents on the internet have optional content-type, encoding, and character set information. • Let's add this information since it will make it easier for us to determine the type of files that are stored in the cache (i.e., images, movies, HTML, etc.) Copyright (C) 2007, http://www.dabeaz.com 2- 82
  • 83. HTTP Responses • The cache metadata includes an HTTP response header >>> print meta['info']['response-head'] HTTP/1.1 200 OK Date: Sat, 29 Sep 2007 20:51:37 GMT Cache-Control: private Vary: User-Agent Content-Type: text/html; charset=utf-8 Content-Encoding: gzip >>> Content type, character set, and encoding. Copyright (C) 2007, http://www.dabeaz.com 2- 83
  • 84. Solution
        # Given a metadata dictionary, this function adds additional
        # fields related to the content type, charset, and encoding
        import email

        def add_content_info(meta):
            info = meta['info']
            if 'response-head' not in info:
                return
            else:
                rhead = info.get('response-head').split("\n",1)[1]
                m = email.message_from_string(rhead)
                content  = m.get_content_type()
                encoding = m.get('content-encoding',None)
                charset  = m.get_content_charset()
                meta['content-type'] = content
                meta['content-encoding'] = encoding
                meta['charset'] = charset
    Copyright (C) 2007, http://www.dabeaz.com 2- 84
  • 85. Internet Data Handling • Python has a vast assortment of internet data handling modules. • email : Parsing of email messages, MIME headers, etc. Copyright (C) 2007, http://www.dabeaz.com 2- 85
  • 86. Internet Data Handling • In this code, we parse the HTTP response headers using the email module and extract content-type, encoding, and charset information. Copyright (C) 2007, http://www.dabeaz.com 2- 86
  • 87. Commentary • Python is heavily used in Internet applications • There are modules for parsing common types of data (email, HTML, XML, etc.) • There are modules for processing bits and pieces of internet data (URLs, MIME types, RFC822 headers, etc.) Copyright (C) 2007, http://www.dabeaz.com 2- 87
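A self-contained illustration of the email-module trick from add_content_info() (the header block here is made up):

    >>> import email
    >>> rhead = 'Content-Type: text/html; charset=utf-8\r\nContent-Encoding: gzip\r\n\r\n'
    >>> m = email.message_from_string(rhead)
    >>> m.get_content_type()
    'text/html'
    >>> m.get_content_charset()
    'utf-8'
    >>> m.get('content-encoding')
    'gzip'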
  • 88. Part 4: File Scanning • Write a function that scans a single cache file and produces a sequence of records containing all of the cache metadata. • This is just one more of our building blocks • The goal is to hide some of the nasty bits Copyright (C) 2007, http://www.dabeaz.com 2- 88
  • 89. File Scanning
        # Scan a single file in the firefox cache
        def scan_cachefile(f,blocksize):
            maxsize = 4*blocksize   # Maximum size of an entry
            f.seek(4096)            # Skip the bit-map
            while True:
                headerdata = f.read(36)
                if not headerdata: break
                meta = parse_meta_header(headerdata)
                if (meta['magic'] == 0x00010008 and
                    meta['requestsize'] + meta['infosize'] < maxsize):
                    requestdata = f.read(meta['requestsize']+
                                         meta['infosize'])
                    if parse_request_data(meta,requestdata):
                        add_content_info(meta)
                        yield meta
                # Move the file pointer to the start of the next block
                fp = f.tell()
                if (fp % blocksize):
                    f.seek(blocksize - (fp % blocksize),1)
    Copyright (C) 2007, http://www.dabeaz.com 2- 89
  • 90. Usage : File Scanning • Usage of the scan function: >>> f = open("Cache/_CACHE_001_","rb") >>> for meta in scan_cachefile(f,256): ... print meta['request'] ... http://www.yahoo.com/ http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/ http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif ... • We can just open up a cache file and write a for-loop to iterate over all of the entries. Copyright (C) 2007, http://www.dabeaz.com 2- 90
Python File I/O

File Objects: modeled after ANSI C. Files are just bytes, and the
file pointer keeps track of the current position.

    f.read()       # Read bytes
    f.tell()       # Current fp
    f.seek(n,off)  # Move fp

# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize     # Maximum size of an entry
    f.seek(4096)              # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)

Copyright (C) 2007, http://www.dabeaz.com 2- 91
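If the file-pointer model is unfamiliar, this tiny session shows
tell() and seek() at work. Any binary file will do; the cache file
name here is just an example:

f = open("Cache/_CACHE_001_","rb")
f.seek(4096)        # jump past the 4096-byte bit-map
print f.tell()      # 4096
data = f.read(36)   # read a 36-byte header
print f.tell()      # 4132
f.seek(220, 1)      # whence=1: move relative to the current position
f.close()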
Using Earlier Code

Here we are using our header parsing functions written in previous
parts. Note: we are progressively adding more data to the dictionary.

# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize     # Maximum size of an entry
    f.seek(4096)              # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)

Copyright (C) 2007, http://www.dabeaz.com 2- 92
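parse_meta_header() and parse_request_data() were written in an
earlier part of this tutorial that is not reproduced here. For readers
following along without it, a rough sketch of what the header parser
might look like, based only on how it is used above (a 36-byte header
yielding 'magic', 'requestsize', and 'infosize'). The endianness and
field order here are assumptions, not the deck's actual code:

import struct

def parse_meta_header(headerdata):
    # Treat the 36-byte header as nine big-endian 32-bit integers
    head = struct.unpack(">9I", headerdata)
    meta = { 'magic'       : head[0],
             'fetchcount'  : head[2],   # guessed positions, for
             'datasize'    : head[6],   # illustration only
             'requestsize' : head[7],
             'infosize'    : head[8] }
    return meta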
Data Validation

This is a sanity check to make sure the header data looks like a
valid header.

# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize     # Maximum size of an entry
    f.seek(4096)              # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)

Copyright (C) 2007, http://www.dabeaz.com 2- 93
Generating Results

We are using yield to produce data for a single cache entry. If
someone uses a for-loop, they will get all of the entries.

Note: this allows us to process the cache without reading all of the
data into memory.

# Scan a single file in the firefox cache
def scan_cachefile(f,blocksize):
    maxsize = 4*blocksize     # Maximum size of an entry
    f.seek(4096)              # Skip the bit-map
    while True:
        headerdata = f.read(36)
        if not headerdata: break
        meta = parse_meta_header(headerdata)
        if (meta['magic'] == 0x00010008 and
            meta['requestsize'] + meta['infosize'] < maxsize):
            requestdata = f.read(meta['requestsize']+meta['infosize'])
            if parse_request_data(meta,requestdata):
                add_content_info(meta)
                yield meta
        # Move the file pointer to the start of the next block
        fp = f.tell()
        if (fp % blocksize):
            f.seek(blocksize - (fp % blocksize),1)

Copyright (C) 2007, http://www.dabeaz.com 2- 94
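A stripped-down generator shows the laziness that scan_cachefile()
relies on: the function body only runs far enough to produce the next
value.

def countdown(n):
    while n > 0:
        yield n      # execution pauses here until the next value is asked for
        n -= 1

c = countdown(3)
print c.next()       # 3; nothing past the first yield has run yet
for x in c:
    print x          # 2, 1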
Commentary
• Have created a function that can scan a single _CACHE_00n_ file and produce a sequence of dictionaries with metadata.
• It's still somewhat low-level
• Just need to package it a little better
Copyright (C) 2007, http://www.dabeaz.com 2- 95
Part 5 : Scan a Directory
• Write a function that takes the name of a Firefox cache directory, scans all of the cache files for metadata, and produces a single sequence of records.
• Make it real easy to extract data
Copyright (C) 2007, http://www.dabeaz.com 2- 96
Solution : Directory Scan

# Given the name of a Firefox cache directory, the function
# scans all of the _CACHE_00n_ files for metadata. A sequence
# of dictionaries containing metadata is returned.

import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir']  = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()

Copyright (C) 2007, http://www.dabeaz.com 2- 97
Solution : Directory Scan

General idea: we loop over the three _CACHE_00n_ files and produce a
sequence of the cache records.

import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir']  = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()

Copyright (C) 2007, http://www.dabeaz.com 2- 98
Solution : Directory Scan

We use the low-level file scanning function here to generate a
sequence of records.

import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir']  = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()

Copyright (C) 2007, http://www.dabeaz.com 2- 99
More Generation

By using yield here, we are chaining together the results obtained
from all three cache files into one big long sequence of results.
The underlying mechanics and implementation details are hidden (the
user doesn't care).

import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir']  = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()

Copyright (C) 2007, http://www.dabeaz.com 2-100
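The same chaining pattern in miniature: an outer generator that
splices several inner sequences into one stream.

def chained(sources):
    # Yield every item from every source, in order
    for src in sources:
        for item in src:
            yield item

for x in chained([[1,2],[3,4],[5]]):
    print x      # 1 2 3 4 5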
Additional Data

Adding path and file information to the data (may be useful later).

import os

def scan_cache(cachedir):
    files = [('_CACHE_001_',256),
             ('_CACHE_002_',1024),
             ('_CACHE_003_',4096)]
    for cname,blocksize in files:
        cfile = open(os.path.join(cachedir,cname),"rb")
        for meta in scan_cachefile(cfile,blocksize):
            meta['cachedir']  = cachedir
            meta['cachefile'] = cname
            yield meta
        cfile.close()

Copyright (C) 2007, http://www.dabeaz.com 2-101
Usage : Cache Scan
• Usage of the scan function

>>> for meta in scan_cache("Cache/"):
...     print meta['request']
...
http://www.yahoo.com/
http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
...

• Given the name of a cache directory, we can just loop over all of the metadata. Trivial!
• With a little more work, we could perform various kinds of queries and processing of the data
Copyright (C) 2007, http://www.dabeaz.com 2-102
Another Example
• Find all requests related to Slashdot

>>> for meta in scan_cache("Cache/"):
...     if 'slashdot' in meta['request']:
...         print meta['request']
...
http://www.slashdot.org/
http://images.slashdot.org/topics/topiccommunications.gif
http://images.slashdot.org/topics/topicstorage.gif
http://images.slashdot.org/comments.css?T_2_5_0_176
...

• Well, that was pretty easy.
Copyright (C) 2007, http://www.dabeaz.com 2-103
Another Example
• Find all large JPEG images in the cache

>>> jpegs = (meta for meta in scan_cache("Cache/")
...               if meta['content-type'] == 'image/jpeg'
...               and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...
>>>

• That was also pretty easy
Copyright (C) 2007, http://www.dabeaz.com 2-104
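In the same spirit, a sketch of a slightly bigger query, assuming the
same metadata fields as the examples above: total cache bytes held per
content type.

totals = {}
for meta in scan_cache("Cache/"):
    # Entries without a parsed response header have no 'content-type' key
    ctype = meta.get('content-type','unknown')
    totals[ctype] = totals.get(ctype,0) + meta['datasize']

for ctype, nbytes in sorted(totals.items()):
    print "%-30s %10d" % (ctype, nbytes)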
Part 6 : Scan Everything
• Write a function that takes a list of cache directories and produces a sequence of all cache metadata found in all of them.
• A single utility function that lets us query everything.
Copyright (C) 2007, http://www.dabeaz.com 2-105
Scanning Everything

# Scan an entire list of cache directories producing
# a sequence of records

def scan(cachedirs):
    if isinstance(cachedirs,str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta

Copyright (C) 2007, http://www.dabeaz.com 2-106
Type Checking

This bit of code is an example of type checking. If the argument is a
string, we convert it to a list with one item. This allows the
following usage:

    scan("CacheDir")
    scan(["CacheDir1","CacheDir2",...])

def scan(cachedirs):
    if isinstance(cachedirs,str):
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta

Copyright (C) 2007, http://www.dabeaz.com 2-107
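One refinement worth considering (not in the original code): in
Python 2, a unicode pathname would fail the isinstance(...,str) test,
so basestring is the safer check since it covers both string types.

def scan(cachedirs):
    if isinstance(cachedirs, basestring):   # matches str and unicode
        cachedirs = [cachedirs]
    for cdir in cachedirs:
        for meta in scan_cache(cdir):
            yield meta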
Putting it all together

# slack.py
# Find all of those slackers who should be working

import sys, os, ffcache

if len(sys.argv) != 2:
    print >>sys.stderr,"Usage: python slack.py dirname"
    raise SystemExit(1)

caches = (path for path,dirs,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files)

for meta in ffcache.scan(caches):
    if 'slashdot' in meta['request']:
        print meta['request']
        print meta['cachedir']
        print

Copyright (C) 2007, http://www.dabeaz.com 2-108
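If os.walk is new to you, it yields one (path, dirs, files) tuple for
every directory under the root, which is what makes the one-line cache
search above work. A quick sketch; the root directory here is
hypothetical:

import os

for path, dirs, files in os.walk("/home"):
    # Keep only directories that contain a Firefox cache map file
    if '_CACHE_MAP_' in files:
        print "found a cache at", path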
Intermission
• Have written a simple library ffcache.py
• The library takes a moderately complex data processing problem and breaks it up into pieces.
• About 100 lines of code.
• Now, let's build an application...
Copyright (C) 2007, http://www.dabeaz.com 2-109
Problem : CacheSpy
• Big Brother (make an evil sound here)
  Write a program that first locates all of the Firefox cache directories under a given directory. Then have that program run forever as a network server, waiting for connections. On each connection, send back all of the current cache metadata.
• Big Picture
  We're going to write a daemon that will find and quietly report on browser cache contents.
Copyright (C) 2007, http://www.dabeaz.com 2-110
cachespy.py

import sys, os, pickle, SocketServer, ffcache

SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()

Copyright (C) 2007, http://www.dabeaz.com 2-111
SocketServer Module

SocketServer: a module for easily creating low-level internet
applications using sockets.

import sys, os, pickle, SocketServer, ffcache

SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()

Copyright (C) 2007, http://www.dabeaz.com 2-112
SocketServer Handlers

You define a simple class that implements handle(). This implements
the server logic.

import sys, os, pickle, SocketServer, ffcache

SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()

Copyright (C) 2007, http://www.dabeaz.com 2-113
SocketServer Servers

Next, you just create a Server object, hook the handler up to it, and
run the server.

import sys, os, pickle, SocketServer, ffcache

SPY_PORT = 31337

caches = [path for path,dname,files in os.walk(sys.argv[1])
               if '_CACHE_MAP_' in files]

def dump_cache(f):
    for meta in ffcache.scan(caches):
        pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
print "CacheSpy running on port %d" % SPY_PORT
serv.serve_forever()

Copyright (C) 2007, http://www.dabeaz.com 2-114
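To talk to this server, a client only has to connect and unpickle
records until the stream ends. A minimal sketch, assuming only the
pickle stream produced by dump_cache() above; the name fetch_cache is
our own and the tutorial's actual client may differ:

import socket, pickle

def fetch_cache(host, port=31337):
    # Connect to a cachespy server and yield metadata records
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, port))
    f = s.makefile("rb")
    try:
        while True:
            yield pickle.load(f)
    except EOFError:
        pass          # server closed the connection; no more records
    f.close()
    s.close()

for meta in fetch_cache("localhost"):
    if 'slashdot' in meta['request']:
        print meta['request']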