1. I Say Emulate; He Says Migrate
Are emulation or migration feasible preservation strategies?
National Library of Australia
Prepared by: Andrew Stawowczyk Long
Presented by: David Pearson
2. Archiving the Web
• Many institutions actively harvest the web
• Collecting scale varies
• Preservation practices are not well understood or implemented
• Collecting intent may differ depending on the institution
3. Web Archives
• Type
  • Text oriented
  • Multimedia (video/audio) oriented
  • Picture oriented
  • Databases
  • Combination of all types
• Storage
  • Uncompressed
  • Compressed (WARC)
  • Combination
4. Web Objects and Elements
• Challenge: Web archives may contain any type of digital object
• Common objects
  • HTML/XML and related (htm, html, xml, css, etc.)
  • Images (raster images - JPEG, GIF, PNG)
  • Media
    • Audio files (au, wav, aiff, midi, mp3)
    • Video files (mov, mpg, wmv, rm)
• Other objects
  • File archives (usually compressed - zip, tar, gz, arc, sit)
  • Images (raster images - bmp, tiff)
  • Images (vector images - SVG)
  • Text files (txt, csv, rtf)
  • Document files
    • PDF
    • Microsoft Word, Excel, PowerPoint
5. Comparative statistics of NLA web collections

           PANDORA (selective)   .au Domain Harvests
Files      73 million            2.3 billion
Size       3.26 TB               78.75 TB

Domain Harvest   2005          2006          2007          2008
Unique files     185 million   596 million   516 million   1 billion
Hosts crawled    811,523       1,046,038     1,247,614     3,038,658
Size             6.69 TB       19.04 TB      18.47 TB      34.55 TB
6. What are we preserving?
Preservation Intent
• Preservation of:
  • Physical media?
  • Bit-stream (logical form of data)?
  • Action (rendering data into something useful to user)?
  • User experience?
• Important Considerations
  • Creator's perceived intent
  • Institution's preservation intent
Based on Heslop and Davis (2002)
7. What are we preserving?
Properties
• Object Properties
  (Properties regarded as important will vary depending on the intention of the collecting institution)
  • Derived from file format
  • High-level - e.g. layout, formatting
  • Measured - identified directly by computer
  • Intended - set by the collecting body
8. Possible Preservation Actions 1
• Emulation
  The original environment is recreated on contemporary hardware using specialised software (an emulator) together with the original software.
• Renderers
  Specialised software that operates in the contemporary environment and is used to access (render) the original files. It is similar to emulation.
9. Possible Preservation Actions 2
• Migration
  Original file formats are migrated (converted) to another format that is supported by current hardware/software, e.g. MS Word 3.0 to MS Word 2008.
10. Possible Preservation Actions 3
Not long-term sustainable
• Technological Museum
  Collect and maintain the original hardware and software
• Take No Action
  Do nothing
11. Digital Preservation
Preliminaries
• Collection objects need to be correctly recognised and identified
• Preservation intent(s) need to be defined
• High-level preservation actions need to be defined (e.g. shall we use emulation or migration?)
• Practical-level preservation actions need to be defined
Object Format + Preservation Intent = Appropriate Action
Dilemma:
How can data be migrated properly if the preservation intent(s) are unknown or not defined?
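The relation "Object Format + Preservation Intent = Appropriate Action" can be pictured as a simple lookup. The sketch below is only an illustration: the rule table, the format and intent labels, and the function name `choose_action` are hypothetical and are not taken from the NLA project.

```python
from typing import Optional

# Hypothetical rule table: (format family, preservation intent) -> action.
# The entries are illustrative assumptions, not project findings.
RULES = {
    ("image", "bit-stream"): "store unchanged",
    ("image", "action"): "migrate",
    ("html", "user experience"): "emulate",
}


def choose_action(object_format: str, intent: Optional[str]) -> str:
    """Return a preservation action, or flag the case as undecided."""
    if intent is None:
        # The dilemma on the slide: without a defined intent there is
        # no basis for choosing between emulation and migration.
        return "undecided: preservation intent not defined"
    return RULES.get(
        (object_format, intent),
        "undecided: no rule for this combination",
    )


if __name__ == "__main__":
    print(choose_action("image", "action"))
    print(choose_action("html", None))
```

The point of the sketch is the failure case: when the intent is missing, the lookup cannot return an action, which mirrors the dilemma stated above.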
12. Tools Required for Emulation
• Emulators
  • Fast, stable, flexible, extendable
• Licensed Operating Systems
• Various drivers
• Web browsers
• Browser plug-ins
• Other programs as required (e.g. Java, Adobe Acrobat Reader)
13. Tools Required in Migration
• Format identifiers
• Format converters
• Link updaters
• QA automatons
CAMiLEON project - Migration on Request Tool
XENA
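To make "format identifiers" concrete, here is a minimal sketch of signature-based identification: compare the first bytes of a file with known magic numbers. It is a much-simplified illustration of the general idea and not the implementation of any of the tools named in these slides; the names `SIGNATURES`, `identify` and `identify_file` are assumptions made for the example.

```python
# Well-known leading byte signatures for a few common formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP"),  # also matches ZIP-based document formats
]


def identify(header: bytes) -> str:
    """Return a format name for the given leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first 16 bytes of a file and identify them."""
    with open(path, "rb") as fh:
        return identify(fh.read(16))


if __name__ == "__main__":
    print(identify(b"%PDF-1.4"))
    print(identify(b"plain text"))
```

A check like this says nothing about what is inside a container format, which is one reason identification at archive scale needs richer signatures than a short prefix.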
14. Project Tests
General Testing Environment
• Large slice of the uncompressed PANDORA archive (random selection)
• The whole Domain Harvest archive (WARC files) was not included in the tests
• Multiple hardware combinations
• Multiple OS combinations
• Multiple Web Browsers
15. Project Tests
Material Sample
Testing the industrial-scale tools
• PANDORA slice
  • 861 GB
  • 18,019,172 files
  • 2,379,326 folders
Testing object properties
• Smaller slice of the PANDORA slice
  • 20 objects of each selected type
  • Audio, HTML, images, PDF, video, ZIP, MS documents
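The sample figures above imply a collection of many small files. The arithmetic below is a rough sketch that assumes the stated size means 861 gigabytes with 1 GB = 10^9 bytes; the slides do not say which unit convention was used, and the function names are made up for the example.

```python
SAMPLE_BYTES = 861 * 10**9   # assumption: 861 GB, decimal gigabytes
SAMPLE_FILES = 18_019_172
SAMPLE_FOLDERS = 2_379_326


def average_file_size_bytes() -> float:
    """Mean file size in the PANDORA slice, in bytes."""
    return SAMPLE_BYTES / SAMPLE_FILES


def files_per_folder() -> float:
    """Mean number of files per folder in the PANDORA slice."""
    return SAMPLE_FILES / SAMPLE_FOLDERS


if __name__ == "__main__":
    print(f"average file size: {average_file_size_bytes() / 1000:.1f} kB")
    print(f"files per folder:  {files_per_folder():.2f}")
```

Under that assumption the average file is a little under 48 kB, with roughly 7.6 files per folder, so per-file overhead in any tool is likely to matter more than raw data volume.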
16. Project Tests
Methodology
• Large sample testing (861 GB, 18,019,172 files)
  • Attempt to identify objects in the sample using DROID
  • Attempt to migrate JPEG images to PNG and update links
• Small sample testing
  • Select a smaller sub-sample, with objects mostly created before the year 2000
  • Identify objects in the sample
  • View and experience selected objects in contemporary environments using various platforms, OS and browsers
  • View and experience selected objects in old environments using emulations on various platforms, with different OS and browsers
  • Migrate selected objects and review them in various environments
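The "update links" step that follows a JPEG-to-PNG migration can be sketched as a text rewrite over the archived HTML. This is a minimal illustration and not the tool used in the project: it handles only quoted `href`/`src` values ending in `.jpg` or `.jpeg`, and the function name `update_links` is an assumption made for the example.

```python
import re

# Quoted href/src attribute values that end in .jpg or .jpeg.
_LINK = re.compile(
    r"""(?P<attr>\b(?:href|src)\s*=\s*)(?P<q>["'])(?P<url>[^"']*?)\.jpe?g(?P=q)""",
    re.IGNORECASE,
)


def update_links(html: str) -> str:
    """Rewrite matching link targets so they end in .png instead."""
    return _LINK.sub(
        lambda m: f"{m['attr']}{m['q']}{m['url']}.png{m['q']}",
        html,
    )


if __name__ == "__main__":
    page = '<a href="photos/cat.jpeg"><img src="thumbs/cat.JPG"></a>'
    print(update_links(page))
```

A real link updater would also need to skip links that point outside the archive, handle unquoted attributes, URLs with query strings, and references in CSS or scripts, and confirm that the migrated PNG actually exists before rewriting.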
17. Project Tests
Tools tested
• Common
  • DROID
  • JHOVE
  • TrID
  • File Identifier
  • Lister (dev. in-house)
• OS
  • MS Win XP Pro
  • MS Win 3.1
  • MS Win 98SE
  • Ubuntu 9.04
• Web Browsers
  • MS IE 7
  • Firefox 3
  • Arachne 1.2
  • Mosaic 2
  • Netscape 4
• Emulation
  • QEMU
  • Bochs
  • MS Virtual PC (not exactly an emulator)
  • Dioscuri
• Migration
  • ImageMagick
  • MediaCoder
  • Swf>>avi
  • OpenOffice Tools
  • XENA
18. Project Tests
Control - Current Environment
• Properties observed in selected files
Object Basic Characteristics (based on Emulation Project by KB)
1. Content: the text, images, etc. from the object
2. Structure: the cohesion between different parts of the object
3. Context: the meaning of the object
4. Appearance: the way an object is presented to the user
5. Behaviour: the interaction of the object with the user or system
E.g. for HTML pages:
• Rendering of text, images, media files
  • Font, layout, colours, contrast, brightness, animation smoothness, sound quality, etc.
• Object dependencies
• Mouse & keyboard behaviour
• Data extraction
19. Project Tests
Emulated Environments
• Hardware
  • Dell Optiplex GX620, P4, 4.4GHz x 3.39GHZ, 3.5Gb RAM
  • Power Mac G4
EMULATORS:
• Bochs
  • Host: WinXP Pro v2002 SP3; Ubuntu 9.04
  • Client: Win 3.1, MS DOS 6.2; WinXP Pro SP2
• Dioscuri 0.4.0
  • Host: WinXP Pro v2002 SP3
  • Client: Win 3.1, MS DOS 6.2
20. Project Tests
Emulated Environments
• Qemu
  • Host: MS WinXP Pro v2002 SP3
    • Clients: MS Win98SE, MS Win 3.1, MS DOS 6.2, Ubuntu 9.04
  • Host: Ubuntu 9.04
    • Clients: MS WinXP Pro SP2 (P4, 12.92GHz, 256Mb RAM), MS Win98SE, MS Win 3.1
• Microsoft Virtual PC
  • Host: MS WinXP Pro v2002 SP3
    • Clients: MS Win 3.1, MS Win98SE
21. Tests - Summary
Emulation
• Setting up emulators was relatively simple
• Additional software (especially for working with disk images) proved to be extremely useful
• Licensing was at times a big obstacle (e.g. it is impossible to emulate a Macintosh environment legally)
• Many dependencies exist. It is a complex task to make programs work correctly.
  • e.g. Windows XP requires internet or over-the-phone activation after 30 days
22. Tests - Summary
Emulation
• All
  • Some of the DLLs in Win 3.1 were incompatible with the 16-bit Netscape and Mosaic programs
• Bochs 2.3.7 for Windows
  • Extremely slow in GUI environments
  • No full screen mode. Limited end-user experience.
• Dioscuri
  • Sluggish at times
  • Had problems with some of the disk images created in WinImage
• Qemu 0.9.0 for Windows and Linux
  • Much faster but still sluggish at times
  • Win98SE couldn't run in hi-res, hi-colour mode
• Microsoft Virtual PC
  • Relatively fast (it is virtualisation software on a PC) but still sluggish at times
23. Tests - Summary
Migration Environment
• Dell Optiplex GX620
• MS Windows XP Pro v2002 SP3
• Networked drive with PANDORA sample
24. Tests - Summary
Migration
• Available tools are imperfect and slow
  • e.g. DROID took more than two weeks to examine slightly over 18 million files, and many of them were not recognised
• It is very difficult to examine the contents of container formats (e.g. avi or rm)
• Network connections need to be as fast as possible
• It is difficult to make an informed decision about migration without a clearly defined preservation intent
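The DROID timing gives a rough upper bound on identification throughput. The calculation below is a sketch under stated assumptions: "more than two weeks" is taken as exactly 14 days, so the real rate was lower, and the extrapolation to the 2.3 billion files reported earlier for the .au Domain Harvests is a naive linear scaling added here for illustration, not a result from the project.

```python
FILES_EXAMINED = 18_019_172
DAYS_TAKEN = 14                      # assumption: "more than two weeks" ~ 14 days
DOMAIN_HARVEST_FILES = 2_300_000_000  # figure from the statistics slide

SECONDS_PER_DAY = 24 * 60 * 60


def files_per_second() -> float:
    """Upper bound on the observed identification rate."""
    return FILES_EXAMINED / (DAYS_TAKEN * SECONDS_PER_DAY)


def years_for_domain_harvests() -> float:
    """Naive linear extrapolation to the whole Domain Harvest collection."""
    seconds = DOMAIN_HARVEST_FILES / files_per_second()
    return seconds / SECONDS_PER_DAY / 365


if __name__ == "__main__":
    print(f"rate: at most {files_per_second():.1f} files/s")
    print(f"domain harvests: about {years_for_domain_harvests():.1f} years")
```

At that rate, just under 15 files per second, a single pass over the Domain Harvest collection would take years on the same setup, which is consistent with the concern that available tools are too slow for collections of this size.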
25. Tests - General Comments
• No proven methods exist
  Real-world testing is needed
• Most documented approaches are ad hoc - no commodity solutions
• Tools are few and inadequate
26. Tests - General Comments
• Preservation policies, especially about preservation intent, are needed
• Significant resources are needed to tackle the problem in practice
27. Andrew Stawowczyk Long
Strategist
Digital Preservation Standards
NLA
anlong@nla.gov.au
David Pearson
Director (Acting)
Web Archiving and Digital Preservation Branch
NLA
dapearso@nla.gov.au
Project Report is due end of October 2009