1. I Say Emulate; He Says Migrate
Are emulation or migration feasible
preservation strategies?
National Library of Australia
Prepared by:
Andrew Stawowczyk Long
Presented by:
David Pearson
2. Archiving the Web
• Many institutions actively harvest the web
• Collecting scale varies
• Preservation practices are not well understood or implemented
• Collecting intent may differ depending on the
institution
3. Web Archives
• Type
• Text oriented
• Multimedia (video/audio) oriented
• Picture oriented
• Databases
• Combination of all types
• Storage
• Uncompressed
• Compressed (WARC)
• Combination
4. Web Objects and Elements
• Challenge: Web archives may contain any type of digital object
• Common objects
• HTML/XML and related (htm, html, xml, css, etc.)
• Images (raster images – JPEG, GIF, PNG)
• Media
• Audio files (au, wav, aiff, midi, mp3)
• Video files (mov, mpg, wmv, rm)
• Other objects
• File Archives (usually compressed – zip, tar, gz, arc, sit)
• Images (raster images – bmp, tiff)
• Images (vector images - SVG)
• Text files (txt, csv, rtf)
• Document files
• PDF
• Microsoft Word, Excel, Power Point
5. Comparative statistics of NLA web collections
Collection             Files          Size
PANDORA (selective)    73 million     3.26 TB
.au Domain Harvests    2.3 billion    78.75 TB

Domain Harvest    2005           2006           2007           2008
Unique files      185 million    596 million    516 million    1 billion
Hosts crawled     811,523        1,046,038      1,247,614      3,038,658
Size              6.69 TB        19.04 TB       18.47 TB       34.55 TB
6. What are we preserving?
Preservation Intent
• Preservation of:
• Physical media?
• Bit-stream (logical form of data)?
• Action (rendering data into something useful to user)?
• User experience?
• Important Considerations
• Creator’s perceived intent
• Institution’s preservation intent
Based on Heslop and Davis (2002)
7. What are we preserving?
Properties
• Object Properties
(The properties regarded as important vary with the intent of the
collecting institution)
• Derived from file format
• High-level – e.g. layout, formatting
or WEB
• Measured – identified directly by computer
• Intended – Set by the collecting body
8. Possible Preservation Actions 1
• Emulation
The original environment is recreated on contemporary hardware using
specialised software (an emulator) together with the original software.
• Renderers
• Specialised software that operates in the contemporary environment
and is used to access (render) original files. It is similar to emulation.
9. Possible Preservation Actions 2
• Migration
Original files are migrated (converted) to another format that is
supported by current hardware/software,
e.g. MS Word 3.0 to MS Word 2008
10. Possible Preservation Actions 3
Not long-term sustainable
• Technological Museum
Collect and maintain the original hardware and software
• Take No Action
Do nothing
11. Digital Preservation
Preliminaries
• Collection objects need to be correctly recognised and
identified
• Preservation intent(s) need to be defined
• High-level preservation actions need to be defined (e.g. shall
we use emulation or migration?)
• Practical-level preservation actions need to be defined
Object Format + Preservation Intent = Appropriate Action
Dilemma:
How to properly migrate data if preservation intent(s) are
unknown or not defined?
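The rule "Object Format + Preservation Intent = Appropriate Action" can be read as a lookup keyed on both values. The following is a minimal sketch only: the table entries, the intent labels and the function name `choose_action` are illustrative assumptions, not decisions made by the project.

```python
from typing import Optional

# Hypothetical decision table (assumption): the project did not publish
# such a mapping; these entries only illustrate the shape of the rule.
ACTIONS = {
    ("jpeg", "content"): "migrate",
    ("doc", "content"): "migrate",
    ("html", "user experience"): "emulate",
}


def choose_action(object_format: str, intent: Optional[str]) -> str:
    """Return a preservation action for a (format, intent) pair."""
    if intent is None:
        # This is the dilemma on the slide: with no defined intent,
        # no action can be justified.
        return "undecided: define preservation intent first"
    key = (object_format.lower(), intent.lower())
    return ACTIONS.get(key, "undecided: no rule for this combination")


print(choose_action("JPEG", "content"))
print(choose_action("html", None))
```

The point of the sketch is that the lookup has two keys: a format alone never selects an action.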
12. Tools Required for Emulation
• Emulators
• Fast, stable, flexible, extendable
• Licensed Operating Systems
• Various drivers
• Web browsers
• Browser plug-ins
• Other programs as required (e.g. Java, Adobe Acrobat
Reader)
13. Tools Required in Migration
• Format identifiers
• Format converters
• Link updaters
• QA automatons
CAMiLEON project – Migration on Request Tool
XENA
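To illustrate what a format identifier does, here is a minimal sketch that recognises a few formats by their leading "magic" bytes. It is not how DROID or any other tool named in these slides is implemented; the names `SIGNATURES`, `identify` and `identify_file` are hypothetical, and the signature table is deliberately tiny.

```python
# Illustrative magic-byte table (assumption: a real identifier relies on
# a much larger signature registry and on more than the first bytes).
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also matches other ZIP-based containers
]


def identify(header: bytes) -> str:
    """Guess a format name from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first 16 bytes of a file and identify them."""
    with open(path, "rb") as fh:
        return identify(fh.read(16))


print(identify(b"%PDF-1.4\n"))
print(identify(b"plain text"))
```

A file extension can be wrong or missing, which is why identification by content is listed as a separate tool category.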
14. Project Tests
General Testing Environment
• Large slice of uncompressed PANDORA
archive (random selection)
• The whole Domain Harvest archive (WARC files) was not
included in the tests
• Multiple hardware combinations
• Multiple OS combinations
• Multiple Web Browsers
15. Project Tests
Material Sample
Testing the industrial scale tools
• PANDORA slice
• 861 GB
• 18,019,172 files
• 2,379,326 folders
Testing object properties
• Smaller subset of the PANDORA slice
• 20 objects of each selected type
• Audio, HTML, images, PDF, video, ZIP, MS documents
16. Project Tests
Methodology
• Large sample testing (861 GB, 18,019,172 files)
• Attempt to identify objects in the sample using DROID
• Attempt to migrate jpeg images to png and update links
• Small sample testing
• Select a smaller sub-sample, with objects mostly created before 2000
• Identify objects in the sample
• View and experience selected objects in contemporary environments using
various platforms, OS and browsers
• View and experience selected objects in old environments using
emulations on various platforms, using different OS and browsers
• Migrate selected objects and review them in various environments
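The "update links" half of the JPEG-to-PNG step can be sketched as follows. This is an assumption-laden illustration, not the tool used in the tests: the function name `update_links` is hypothetical, only quoted `src` and `href` values ending in .jpg or .jpeg are rewritten, and the image conversion itself is left to a separate converter.

```python
import re

# Matches src="...jpg" or href='...jpeg' (case-insensitive).
# Assumption: the URL ends with the extension; links carrying query
# strings or fragments are deliberately left untouched.
_JPEG_LINK = re.compile(
    r"""(?P<attr>\b(?:src|href)\s*=\s*)(?P<q>["'])(?P<url>[^"']*?)\.jpe?g(?P=q)""",
    re.IGNORECASE,
)


def update_links(html: str) -> str:
    """Point JPEG links in an HTML string at the migrated PNG files."""
    return _JPEG_LINK.sub(
        lambda m: f"{m['attr']}{m['q']}{m['url']}.png{m['q']}", html
    )


page = '<a href="index.html"><img src="images/logo.JPG"></a>'
print(update_links(page))
```

A regular expression keeps the sketch short, but it cannot see links built by scripts or style sheets, which is one reason link updating needs QA after migration.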
17. Project Tests
Tools tested
• Common
  • DROID
  • JHOVE
  • TrID
  • File Identifier
  • Lister (dev. in-house)
• OS
  – MS Win XP Pro
  – MS Win 3.1
  – MS Win 98SE
  – Ubuntu 9.04
• Web Browsers
  – MS IE 7
  – Firefox 3
  – Arachne 1.2
  – Mosaic 2
  – Netscape 4
• Emulation
  • QEMU
  • Bochs
  • MS Virtual PC (Not exactly an emulator)
  • Dioscuri
• Migration
  • ImageMagick
  • MediaCoder
  • Swf>>avi
  • OpenOffice Tools
  • XENA
18. Project Tests
Control – Current Environment
• Properties observed in selected files
Object Basic Characteristics (based on Emulation Project by KB)
1. Content: the text, images, etc. from the object.
2. Structure: the cohesion between different parts of the object.
3. Context: the meaning of the object.
4. Appearance: the way an object is presented to the user.
5. Behaviour: the interaction of the object with the user or system.
E.g. for HTML pages:
• Rendering of text, images, media files
  • Font, layout, colours, contrast, brightness, animation smoothness, sound quality, etc.
• Object dependencies
• Mouse & keyboard behaviour
• Data extraction
19. Project Tests
Emulated Environments
• Hardware
• Dell Optiplex GX620, P4, 4.4GHz x 3.39GHz, 3.5GB RAM
• Power Mac G4
EMULATORS:
• Bochs
• Host: WinXP Pro v2002 SP3
Ubuntu 9.04
• Client: Win 3.1, MS DOS 6.2
WinXP Pro SP2
• Dioscuri 0.4.0
• Host: WinXP Pro v2002 SP3
• Client: Win3.1, MS DOS 6.2
20. Project Tests
Emulated Environments
• Qemu
• Host: MS WinXP Pro v2002 SP3
• Clients: MS Win98SE
MS Win 3.1
MS DOS 6.2
Ubuntu 9.04
• Host: Ubuntu 9.04
• Clients: MS WinXP Pro SP2, P4, 12.92GHz, 256Mb RAM
MS Win98SE
MS Win 3.1
• Microsoft Virtual PC
• Host: MS WinXP Pro v2002 SP3
• Clients: MS Win 3.1
MS Win98SE
21. Tests - Summary
Emulation
• Setting up emulators was relatively simple.
• Additional software (especially for working with disk images)
proved to be extremely useful.
• Licensing was at times a big obstacle (e.g. it was impossible to
emulate a Macintosh environment legally).
• Many dependencies exist. It is a complex task to make
programs work correctly.
  • e.g. Windows XP requires internet or over-the-phone activation after 30 days
22. Tests – Summary
Emulation
• All emulators
  • Some of the DLLs in Win 3.1 were incompatible with the 16-bit Netscape and Mosaic
programs
• Bochs 2.3.7 for Windows
• Extremely slow in GUI environments
• No full screen mode. Limited end-user experience.
• Dioscuri
• Sluggish at times
• Could not use some of the disk images created in WinImage
• Qemu 0.9.0 for Windows and Linux
• Much faster but still sluggish at times
• Win98SE couldn't run in hi-res, hi-colour mode
• Microsoft Virtual PC
• Relatively fast (it is virtualisation software on PC) but still sluggish at times
23. Tests - Summary
Migration Environment
•Dell Optiplex GX620
•MS Windows XP Pro v2002 SP3
•Networked drive with PANDORA sample
24. Tests - Summary
Migration
• Available tools are imperfect and slow.
  • e.g. DROID took more than two weeks to examine slightly over 18 million
files, and many of them were not recognised
• It is very difficult to examine the contents of container
formats (e.g. avi or rm)
• Network connections need to be as fast as possible
• It is difficult to make an informed decision about
migration without a clearly defined preservation intent
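The DROID timing implies a throughput that is easy to work out. The sketch below takes the file count from the material sample (18,019,172 files) and treats "more than two weeks" as exactly 14 days, so the rate it gives is an upper bound of roughly 15 files per second. The extrapolation to the 2.3 billion files of the domain harvests is an assumption for illustration only: those files sit in WARC containers and were not part of the tests.

```python
SECONDS_PER_DAY = 86_400


def files_per_second(n_files: int, days: float) -> float:
    """Average identification rate over a run of the given length."""
    return n_files / (days * SECONDS_PER_DAY)


def days_needed(n_files: int, rate: float) -> float:
    """Days required to process n_files at a constant rate (files/s)."""
    return n_files / rate / SECONDS_PER_DAY


# Sample figures from the slides; 14 days is a lower bound on run time.
rate = files_per_second(18_019_172, 14)

# Assumption: the same rate applied to the 2.3 billion harvest files.
harvest_days = days_needed(2_300_000_000, rate)

print(f"rate: {rate:.1f} files/s")
print(f"domain harvests at that rate: {harvest_days / 365:.1f} years")
```

At that rate a single pass over the domain harvests would take several years, which puts the "imperfect and slow" finding in concrete terms.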
25. Tests - General Comments
• No proven methods exist
Real-world testing is needed
• Most documented approaches are ad hoc; there are no
commodity solutions
• Tools are few and inadequate
26. Tests - General Comments
• Preservation policies, especially on
preservation intent, are needed
• Significant resources are needed to tackle
the problem in practice
27. Andrew Stawowczyk Long
Strategist
Digital Preservation Standards
NLA
anlong@nla.gov.au
David Pearson
Director (Acting)
Web Archiving and Digital Preservation Branch
NLA
dapearso@nla.gov.au
Project Report is due end of October 2009