I Say Emulate; He Says Migrate

  Are emulation or migration feasible
       preservation strategies?

                          National Library of Australia
                          Prepared by:
                          Andrew Stawowczyk Long
                          Presented by:
                          David Pearson
Archiving the Web
• Many institutions actively harvest the web
• Collecting scale varies
• Preservation practices are not well understood or
  implemented
• Collecting intent may differ depending on the
  institution



Web Archives
• Type
  •   Text oriented
  •   Multimedia (video/audio) oriented
  •   Picture oriented
  •   Databases
  •   Combination of all types
• Storage
  • Uncompressed
  • Compressed (WARC)
  • Combination
Web Objects and Elements
• Challenge: Web archives may contain any type of digital object
• Common objects
    • HTML/XML and related (htm, html, xml, css, etc.)
    • Images (raster images – JPEG, GIF, PNG)
    • Media
         • Audio files (au, wav, aiff, midi, mp3)
         • Video files (mov, mpg, wmv, rm)
• Other objects
    • File Archives (usually compressed – zip, tar, gz, arc, sit)
    • Images (raster images – bmp, tiff)
    • Images (vector images - SVG)
    • Text files (txt, csv, rtf)
    • Document files
         • PDF
         • Microsoft Word, Excel, Power Point
Comparative statistics of NLA web collections

  Collection               Files          Size
  PANDORA (selective)      73 million     3.26 TB
  .au Domain Harvests      2.3 billion    78.75 TB

  Domain Harvest     2005           2006           2007           2008
  Unique files       185 million    596 million    516 million    1 billion
  Hosts crawled      811,523        1,046,038      1,247,614      3,038,658
  Size               6.69 TB        19.04 TB       18.47 TB       34.55 TB
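A quick arithmetic check of the .au Domain Harvest figures, as a minimal sketch. It assumes that the 2006 size (printed without a unit) is in TB and that "1 billion" means 1,000 million; the variable names are illustrative only.

```python
# Yearly .au Domain Harvest figures as printed on the slide.
# Assumption: the 2006 size (19.04) is in TB like the other years.
harvest_size_tb = {2005: 6.69, 2006: 19.04, 2007: 18.47, 2008: 34.55}
# Assumption: "1 billion" unique files in 2008 is taken as 1000 million.
unique_files_millions = {2005: 185, 2006: 596, 2007: 516, 2008: 1000}

# Sum the yearly values and compare with the collection totals.
total_size_tb = round(sum(harvest_size_tb.values()), 2)
total_files_billions = sum(unique_files_millions.values()) / 1000

print(f"Total size: {total_size_tb} TB")
print(f"Total unique files: {total_files_billions:.1f} billion")
```

Under those assumptions the yearly values add up to the collection totals quoted for the .au Domain Harvests (78.75 TB and roughly 2.3 billion files).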
What are we preserving?
                          Preservation Intent

• Preservation of:
   •   Physical media?
   •   Bit-stream (logical form of data)?
   •   Action (rendering data into something useful to user)?
   •   User experience?
• Important Considerations
   • Creator’s perceived intent
   • Institution’s preservation intent




         Based on Heslop and Davis (2002)
What are we preserving?
                            Properties

• Object Properties
  (Properties regarded as important would vary depending on the
  intention of the collecting institution)

   •   Derived from file format
   •   High-level – e.g. layout, formatting
   •   Measured – identified directly by computer
   •   Intended – set by the collecting body




Possible Preservation Actions 1
• Emulation
    The original environment is recreated on contemporary hardware using
      specialised software (an emulator) and the original software.

• Renderers
    Specialised software that operates in the contemporary environment
      and is used to access (render) the original files. This is similar
      to emulation.




Possible Preservation Actions 2
• Migration
   Original file formats are migrated (converted) to
   another format, which is supported by current
   hardware/software.
   e.g. MS Word 3.0 to MS Word 2008




Possible Preservation Actions 3
                     Not long-term sustainable

• Technological Museum
  Collect and maintain the original hardware and software


• Take No Action
  Do nothing




Digital Preservation
                             Preliminaries
• Collection objects need to be correctly recognised and
  identified
• Preservation intent(s) need to be defined
• High-level preservation actions need to be defined (e.g. shall
  we use emulation or migration?)
• Practical-level preservation actions need to be defined

     Object Format + Preservation Intent = Appropriate Action



  Dilemma:
  How can data be migrated properly if the preservation intent(s) are
  unknown or not defined?
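The formula "Object Format + Preservation Intent = Appropriate Action" can be read as a lookup, sketched below. The table entries, the `ACTIONS` name and the `choose_action` function are all hypothetical illustrations and do not represent NLA policy. The point is only that no action can be chosen when the intent is missing.

```python
from typing import Optional

# Hypothetical policy table: (format family, preservation intent) -> action.
# Every entry is illustrative only; none of this is NLA policy.
ACTIONS = {
    ("image", "bit-stream"): "bit-level storage only",
    ("image", "action"): "migration",
    ("document", "action"): "migration",
    ("html", "user experience"): "emulation",
}


def choose_action(object_format: str, intent: Optional[str]) -> str:
    """Return a preservation action, or an 'undecided' marker."""
    if intent is None:
        # The dilemma from the slide: without a defined intent, no rule applies.
        return "undecided: preservation intent not defined"
    return ACTIONS.get(
        (object_format, intent),
        "undecided: no rule for this combination",
    )


print(choose_action("html", "user experience"))
print(choose_action("image", None))
```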
Tools Required for Emulation
• Emulators
    • Fast, stable, flexible, extendable
•   Licensed operating systems
•   Various drivers
•   Web browsers
•   Browser plug-ins
•   Other programs as required (e.g. Java, Adobe Acrobat
    Reader)


Tools Required in Migration
•   Format identifiers
•   Format converters
•   Link updaters
•   QA automatons




CAMiLEON project – Migration on Request Tool
XENA
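As a rough illustration of what a format identifier does, the sketch below matches the leading bytes of a file against a few well-known signatures. It is far simpler than the signature files that real identification tools rely on, and the `SIGNATURES`, `identify` and `identify_file` names are assumptions made for this example.

```python
# Leading byte signatures ("magic numbers") for a few formats named in the slides.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP"),
    (b"\x1f\x8b", "GZIP"),
]


def identify(header: bytes) -> str:
    """Return a format name based on the first bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first 16 bytes of a file and identify its format."""
    with open(path, "rb") as fh:
        return identify(fh.read(16))


print(identify(b"%PDF-1.4 example"))
print(identify(b"plain text"))
```

A check on leading bytes alone cannot look inside container formats, and a ZIP signature also matches other ZIP-based formats, so results like these need further inspection.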
Project Tests
           General Testing Environment

• Large slice of uncompressed PANDORA
  archive (random selection)
• The whole Domain Harvest archive (WARC files) was not
  included in the tests
• Multiple hardware combinations
• Multiple OS combinations
• Multiple Web Browsers


Project Tests
                          Material Sample

Testing the industrial scale tools
• PANDORA slice
  • 861 GB
  • 18,019,172 files
  • 2,379,326 folders
Testing object properties
• Smaller subset of the PANDORA slice
  • 20 objects of each selected type
     • Audio, HTML, images, PDF, video, ZIP, MS documents
Project Tests
                              Methodology
• Large sample testing (861 GB, 18,019,172 files)
       • Attempt to identify objects in the sample using DROID
       • Attempt to migrate JPEG images to PNG and update links


• Small sample testing
       • Select a smaller sub-sample, with objects mostly created before the year 2000
       • Identify objects in the sample
       • View and experience selected objects in contemporary environments using
         various platforms, OS and browsers
       • View and experience selected objects in old environments using
         emulations on various platforms, using different OS and browsers
       • Migrate selected objects and review them in various environments
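The "update links" step of a JPEG-to-PNG migration could look roughly like the sketch below. It only rewrites `src` and `href` attribute values that end in `.jpg` or `.jpeg`; the image files themselves would have to be converted separately. The `update_links` name and the regular expression are assumptions for illustration, not the tool used in the project.

```python
import re

# Matches src="..." or href="..." values that end in .jpg or .jpeg.
# Links with query strings or unquoted values are deliberately left alone.
_JPEG_LINK = re.compile(
    r'''(?P<attr>\b(?:src|href)\s*=\s*)(?P<q>["'])(?P<stem>[^"']*?)\.jpe?g(?P=q)''',
    re.IGNORECASE,
)


def update_links(html: str) -> str:
    """Point JPEG references in an HTML string at the migrated PNG files."""
    return _JPEG_LINK.sub(
        lambda m: f"{m['attr']}{m['q']}{m['stem']}.png{m['q']}",
        html,
    )


page = '<a href="index.html"><img src="images/logo.JPG" alt="logo"></a>'
print(update_links(page))
```

A regular expression is a shortcut here. References inside CSS, scripts or malformed markup would be missed, which is one reason automated QA is listed among the required tools.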


Project Tests
                              Tools tested
• Common
 •   DROID
 •   JHOVE
 •   TrID
 •   File Identifier
 •   Lister (developed in-house)
 •   OS
       –   MS Win XP Pro
       –   MS Win 3.1
       –   MS Win 98SE
       –   Ubuntu 9.04
 •   Web Browsers
       –   MS IE 7
       –   Firefox 3
       –   Arachne 1.2
       –   Mosaic 2
       –   Netscape 4
• Emulation
 •   QEMU
 •   Bochs
 •   MS Virtual PC (not exactly an emulator)
 •   Dioscuri
• Migration
 •   ImageMagick
 •   MediaCoder
 •   Swf>>avi
 •   OpenOffice Tools
 •   XENA
Project Tests
                       Control – Current Environment

• Properties observed in selected files
  Object Basic Characteristics (based on Emulation Project by KB)
      1. Content : the text, images, etc. from the object
      2. Structure : the cohesion between different parts of the object
      3. Context : the meaning of the object.
      4. Appearance : the way an object is presented to the user.
      5. Behaviour : the interaction of the object with the user or system.

E.g. for HTML pages:
  • Rendering of text, images, media files
       •   Font, layout, colours, contrast, brightness, animation smoothness, sound quality, etc.
  • Object dependencies
  • Mouse and keyboard behaviour
  • Data extraction

Project Tests
                    Emulated Environments
• Hardware
   • Dell Optiplex GX620, P4, 4.4GHz x 3.39GHz, 3.5 GB RAM
   • Power Mac G4

EMULATORS:
• Bochs
   • Host:         WinXP Pro v2002 SP3
                   Ubuntu 9.04
   • Client:       Win 3.1, MS DOS 6.2
                   WinXP Pro SP2
• Dioscuri 0.4.0
   • Host:         WinXP Pro v2002 SP3
   • Client:       Win3.1, MS DOS 6.2

Project Tests
                  Emulated Environments

• Qemu
   • Host:      MS WinXP Pro v2002 SP3
   • Clients:   MS Win98SE
                MS Win 3.1
                MS DOS 6.2
                Ubuntu 9.04


   • Host:      Ubuntu 9.04
   • Clients:   MS WinXP Pro SP2, P4, 12.92GHz, 256 MB RAM
                MS Win98SE
                MS Win 3.1
• Microsoft Virtual PC
   • Host:      MS WinXP Pro v2002 SP3
   • Clients:   MS Win 3.1
                MS Win98SE
Tests - Summary
                             Emulation


• Setting up the emulators was relatively simple.
• Additional software (especially for working with disk images)
  proved extremely useful.
• Licensing was at times a big obstacle (e.g. it is impossible to
  emulate a Macintosh environment legally).
• Many dependencies exist, so making programs work correctly
  is a complex task.
   • e.g. Windows XP requires internet or over-the-phone activation after 30 days




Tests – Summary
                                       Emulation

• All emulators
    • Some of the DLL libraries in Win 3.1 were incompatible with the 16-bit
      Netscape and Mosaic programs
• Bochs 2.3.7 for Windows
    • Extremely slow in GUI environments
    • No full-screen mode, which limits the end-user experience
• Dioscuri
    • Sluggish at times
    • Had trouble with some of the disk images created in WinImage
• Qemu 0.9.0 for Windows and Linux
    • Much faster but still sluggish at times
    • Win98SE could not run in hi-res, hi-colour mode
• Microsoft Virtual PC
    • Relatively fast (it is virtualisation software on a PC) but still sluggish at times
Tests - Summary
             Migration Environment




• Dell Optiplex GX620
• MS Windows XP Pro v2002 SP3
• Networked drive with PANDORA sample




Tests - Summary
                                Migration

• Available tools are imperfect and slow.
   • e.g. DROID took more than two weeks to examine slightly over 18 million
     files, and many of them were not recognised
• It is very difficult to examine the contents of container
  formats (e.g. avi or rm).
• Network connections need to be as fast as possible.
• It is difficult to make an informed decision about
  migration without a clearly defined preservation intent.
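To put the DROID timing in perspective, the arithmetic below converts "more than two weeks for slightly over 18 million files" into an upper bound on throughput. The extrapolation to the whole .au Domain Harvest is a hypothetical illustration: it assumes a single process running at the same rate, which was not something the project tested.

```python
FILES_IN_SAMPLE = 18_019_172  # PANDORA slice used for large sample testing
SECONDS_PER_DAY = 24 * 60 * 60

# "More than two weeks" means 14 days is a lower bound on the duration,
# so the rate computed here is an upper bound.
min_days = 14
max_rate = FILES_IN_SAMPLE / (min_days * SECONDS_PER_DAY)
print(f"At most about {max_rate:.1f} files per second")

# Hypothetical extrapolation (assumption): the same rate applied serially
# to the 2.3 billion files of the .au Domain Harvests.
domain_files = 2_300_000_000
years = domain_files / max_rate / SECONDS_PER_DAY / 365
print(f"At least about {years:.1f} years for {domain_files:,} files")
```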



Tests - General Comments

• No proven methods exist
  • Real-world testing is needed
  • Most documented approaches are ad hoc; there are no
    commodity solutions
• Tools are few and inadequate




Tests - General Comments


• Preservation policies, especially on
  preservation intent, are needed
• Significant resources are needed to practically
  tackle the problem




Andrew Stawowczyk Long
Strategist
Digital Preservation Standards
NLA
anlong@nla.gov.au

David Pearson
Director (Acting)
Web Archiving and Digital Preservation Branch
NLA
dapearso@nla.gov.au




                 The project report is due at the end of October 2009


I say emulate

  • 1. I Say Emulate; He Says Migrate Are emulation or migration feasible preservation strategies? National Library of Australia Prepared by: Andrew Stawowczyk Long Presented by: 1 David Pearson
  • 2. Archiving the Web • Many institutions actively harvest the web • Collecting scale vary • Preservation practices not well understood and implemented • Collecting intent may differ depending on the institution 2
  • 3. Web Archives • Type • Text oriented • Multimedia (video/audio) oriented • Picture oriented • Databases • Combination of all types • Storage • Uncompressed • Compressed (WARC) • Combination 3
  • 4. Web Objects and Elements • Challenge: Web archives may contain any type of digital object • Common objects • HTML/XML and related (htm, html, xml, css, etc.) • Images (raster images – JPEG, GIF, PNG) • Media • Audio files (au, wav, aiff, midi, mp3) • Video files (mov, mpg, wmv, rm) • Other objects • File Archives (usually compressed – zip, tar, gz, arc, sit) • Images (raster images – bmp, tiff) • Images (vector images - SVG) • Text files (txt, csv, rtf) • Document files • PDF • Microsoft Word, Excel, Power Point 4
  • 5. Comparative statistics of NLA web collections PANDORA (selective) .au Domain Harvests Files: 73 million Files: 2.3 billion Size: 3.26 TB Size: 78.75 TB Domain 2005 2006 2007 2008 Harvest Unique 185 million 596 million 516 million 1 billion files Hosts 811,523 1,046,038 1,247,614 3,038,658 crawled Size 6.69 TB 19.04 18.47TB 34.55 TB 5
  • 6. What are we preserving? Preservation Intent • Preservation of: • Physical media? • Bit-stream (logical form of data)? • Action (rendering data into something useful to user)? • User experience? • Important Considerations • Creator’s perceived intent • Institution’s preservation intent 6 Based on Heslop and Davis (2002)
  • 7. What are we preserving? Properties • Object Properties (Properties regarded as important would vary depending on the intention of the collecting institution) • • Derived from file format High-level – e.g. layout, formatting or WEB • Measured – identified directly by computer • Intended – Set by the collecting body 7
  • 8. Possible Preservation Actions 1 • Emulation The original environment is recreated on a contemporary hardware using specialised software (emulator) and original software. • Renderers • Specialised software, operating in the contemporary environment and used to access (render) original files. It is similar to emulation. 8
  • 9. Possible Preservation Actions 2 • Migration Original file formats are migrated (converted) to another format, which is supported by current hardware/software. e.g. MS Word 3.0 to MS Word 2008 9
  • 10. Possible Preservation Actions 3 Not long-term sustainable • Technological Museum Collect and maintain the original hardware and software • Take No Action Do nothing 10
  • 11. Digital Preservation Preliminaries • Collection objects need to be correctly recognised and identified • Preservation intent(s) need to be defined • High-level preservation actions need to be defined (e.g. shall we use emulation or migration?) • Practical-level preservation actions need to be defined Object Format + Preservation Intent = Appropriate Action Dillema: How to properly migrate data if preservation intent(s) are unknown or not defined 11
  • 12. Tools Required for Emulation
      • Emulators
        • Fast, stable, flexible, extendable
      • Licensed operating systems
      • Various drivers
      • Web browsers
      • Browser plug-ins
      • Other programs as required (e.g. Java, Adobe Acrobat Reader)
  • 13. Tools Required in Migration
      • Format identifiers
      • Format converters
      • Link updaters
      • QA automatons
      Examples:
      • CAMiLEON project – Migration on Request tool
      • XENA
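To make the "format identifier" role concrete, here is a minimal sketch that recognises a handful of formats from their leading bytes. It is a deliberately simplified stand-in, not a description of how any of the tools named in this deck work; the `SIGNATURES` table and the `identify` function are assumptions made for this example, and only well-known file signatures are included.

```python
# Well-known leading byte sequences ("magic numbers") for a few common formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP"),
]


def identify(header: bytes) -> str:
    """Return a format name based on the first bytes of a file, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first 16 bytes of a file and identify it."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

Because it looks at content rather than file extensions, this kind of check still works when a harvested file carries a wrong or missing extension, but it says nothing about what is inside container formats.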
  • 14. Project Tests: General Testing Environment
      • Large slice of the uncompressed PANDORA archive (random selection)
      • The whole Domain Harvest archive (WARC files) was not included in the tests
      • Multiple hardware combinations
      • Multiple OS combinations
      • Multiple web browsers
  • 15. Project Tests: Material Sample
      • Testing the industrial-scale tools
        • PANDORA slice
        • 861 GB
        • 18,019,172 files
        • 2,379,326 folders
      • Testing object properties
        • Smaller sub-sample of the PANDORA slice
        • 20 objects of each selected type
        • Audio, HTML, images, PDF, video, zip, MS documents
  • 16. Project Tests: Methodology
      • Large sample testing (861 GB, 18,019,172 files)
        • Attempt to identify objects in the sample using DROID
        • Attempt to migrate JPEG images to PNG and update links
      • Small sample testing
        • Select a smaller sub-sample, with objects mostly created before the year 2000
        • Identify objects in the sample
        • View and experience selected objects in contemporary environments using various platforms, OS and browsers
        • View and experience selected objects in old environments using emulation on various platforms, with different OS and browsers
        • Migrate selected objects and review them in various environments
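The "update links" step that follows a JPEG-to-PNG migration can be sketched as a rewrite of `src` and `href` attribute values. This is a rough illustration under stated assumptions, not the tool used in the tests: `LINK_RE` and `update_links` are names invented here, only quoted attribute values ending in `.jpg` or `.jpeg` are handled, and links carrying query strings, CSS `url(...)` references and script-generated paths are ignored.

```python
import re

# Matches a quoted src/href value that ends in .jpg or .jpeg (any letter case).
# Group 1 is everything up to the extension, group 2 is the closing quote.
LINK_RE = re.compile(
    r'((?:src|href)\s*=\s*["\'][^"\']*?)\.jpe?g(["\'])',
    re.IGNORECASE,
)


def update_links(html: str) -> str:
    """Point JPEG links at the migrated PNG files, leaving other links untouched."""
    return LINK_RE.sub(r"\1.png\2", html)
```

For example, `update_links('<img src="pics/cat.JPG">')` yields `<img src="pics/cat.png">`, while a link to `page.html` is left as it was. A production link updater would more likely parse the HTML than rely on a regular expression, and would need a QA pass to confirm that every rewritten target exists.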
  • 17. Project Tests: Tools Tested
      • Common
        • DROID
        • JHOVE
        • TrID
        • File Identifier
        • Lister (developed in-house)
      • Migration
        • ImageMagick
        • MediaCoder
        • Swf>>avi
        • OpenOffice Tools
        • XENA
      • Emulation
        • QEMU
        • Bochs
        • MS Virtual PC (not exactly an emulator)
        • Dioscuri
      • OS
        – MS Win XP Pro
        – MS Win 3.1
        – MS Win 98SE
        – Ubuntu 9.04
      • Web Browsers
        – MS IE 7
        – Firefox 3
        – Arachne 1.2
        – Mosaic 2
        – Netscape 4
  • 18. Project Tests: Control – Current Environment
      • Properties observed in selected files
      Object basic characteristics (based on the Emulation Project by KB):
        1. Content: the text, images, etc. from the object
        2. Structure: the cohesion between different parts of the object
        3. Context: the meaning of the object
        4. Appearance: the way an object is presented to the user
        5. Behaviour: the interaction of the object with the user or system
      E.g. for HTML pages:
        • Rendering of text, images, media files
        • Font, layout, colours, contrast, brightness, animation smoothness, sound quality, etc.
        • Object dependencies
        • Mouse and keyboard behaviour
        • Data extraction
  • 19. Project Tests: Emulated Environments
      • Hardware
        • Dell Optiplex GX620, P4, 4.4GHz x 3.39GHZ, 3.5 GB RAM
        • Power Mac G4
      • Emulators
        • Bochs
          • Host: WinXP Pro v2002 SP3, Ubuntu 9.04
          • Client: Win 3.1, MS DOS 6.2, WinXP Pro SP2
        • Dioscuri 0.4.0
          • Host: WinXP Pro v2002 SP3
          • Client: Win 3.1, MS DOS 6.2
  • 20. Project Tests: Emulated Environments (continued)
      • Qemu
        • Host: MS WinXP Pro v2002 SP3
          • Clients: MS Win98SE, MS Win 3.1, MS DOS 6.2, Ubuntu 9.04
        • Host: Ubuntu 9.04
          • Clients: MS WinXP Pro SP2 (P4, 12.92GHz, 256 MB RAM), MS Win98SE, MS Win 3.1
      • Microsoft Virtual PC
        • Host: MS WinXP Pro v2002 SP3
          • Clients: MS Win 3.1, MS Win98SE
  • 21. Tests - Summary: Emulation
      • Setting up the emulators was relatively simple
      • Additional software (especially for working with disk images) proved extremely useful
      • Licensing was at times a big obstacle (e.g. it was impossible to emulate a Macintosh environment legally)
      • Many dependencies exist, and making programs work correctly is a complex task
        • e.g. Windows XP requires internet or over-the-phone activation after 30 days
  • 22. Tests – Summary: Emulation
      • All emulators
        • Some of the DLL libraries in Win 3.1 were incompatible with the 16-bit Netscape and Mosaic programs
      • Bochs 2.3.7 for Windows
        • Extremely slow in GUI environments
        • No full-screen mode, which limits the end-user experience
      • Dioscuri
        • Sluggish at times
        • Did not accept some of the disk images created in WinImage
      • Qemu 0.9.0 for Windows and Linux
        • Much faster but still sluggish at times
        • Win98SE could not run in hi-res, hi-colour mode
      • Microsoft Virtual PC
        • Relatively fast (it is virtualisation software running on a PC) but still sluggish at times
  • 23. Tests - Summary: Migration Environment
      • Dell Optiplex GX620
      • MS Windows XP Pro v2002 SP3
      • Networked drive with PANDORA sample
  • 24. Tests - Summary: Migration
      • Available tools are imperfect and slow
        • e.g. DROID took more than two weeks to examine slightly over 18 million files, and many of them were not recognised
      • It is very difficult to examine the contents of container formats (e.g. avi or rm)
      • Network connections need to be as fast as possible
      • It is difficult to make an informed decision about migration without a clearly defined preservation intent
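The DROID observation implies a throughput ceiling that can be worked out from figures already in this deck. The sketch below does that arithmetic and then extrapolates to the 2.3 billion files of the .au Domain Harvests from slide 5. The extrapolation is an assumption made here for illustration: it supposes a single identification run on comparable hardware with linear scaling, and it ignores that the domain harvests are stored as WARC files.

```python
FILES_IN_SAMPLE = 18_019_172             # PANDORA slice examined with DROID
SECONDS_IN_TWO_WEEKS = 14 * 24 * 60 * 60

# The run took "more than two weeks", so this is an upper bound on the rate.
max_rate = FILES_IN_SAMPLE / SECONDS_IN_TWO_WEEKS

DOMAIN_HARVEST_FILES = 2_300_000_000     # .au Domain Harvests total (slide 5)
SECONDS_IN_YEAR = 365 * 24 * 60 * 60

# Lower bound on the time needed at that rate (assumes linear scaling).
min_years = DOMAIN_HARVEST_FILES / max_rate / SECONDS_IN_YEAR

print(f"at most {max_rate:.1f} files per second")
print(f"at least {min_years:.1f} years for the domain harvests at that rate")
```

The result is a rate of roughly 15 files per second at best, which under these assumptions would put a full pass over the domain harvests at close to five years.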
  • 25. Tests - General Comments
      • No proven methods exist; real-world testing is needed
      • Most documented approaches are ad hoc, with no commodity solutions
      • Tools are few and inadequate
  • 26. Tests - General Comments (continued)
      • Preservation policies, especially on preservation intent, are needed
      • Significant resources are needed to tackle the problem in practice
  • 27. Contacts
      Andrew Stawowczyk Long, Strategist, Digital Preservation Standards, NLA, anlong@nla.gov.au
      David Pearson, Director (Acting), Web Archiving and Digital Preservation Branch, NLA, dapearso@nla.gov.au
      The project report is due at the end of October 2009.