SlideShare uma empresa Scribd logo
1 de 25
WARCreate
          Create Wayback-Consumable WARC Files from Any Webpage


                      Mat Kelly, Michele C. Weigle, Michael L. Nelson
                           {mkelly,mweigle,mln}@cs.odu.edu
                          Old Dominion University; Norfolk, VA




July 25, 2012
Arlington, Virginia                   Digital Preservation 2012         warcreate.com
What is WARCreate?
      • Google Chrome extension
      • Creates WARC files
      • Enables preservation by users from their
        browser
      • First steps in bringing Institutional
        Archiving facilities to the PC



July 25, 2012                                                2
Arlington, Virginia        Digital Preservation 2012   warcreate.com
Target Content
      • Unreachable by web crawlers
             – Behind authentication
             – Not listed in search engines (Deep Web)
      • Private
             – We don’t want our bank statements in Wayback
      • Non-pertinent to public
             – Others have little interest in our Facebook
               comments

July 25, 2012                                                      3
Arlington, Virginia            Digital Preservation 2012     warcreate.com
Preserving More!
      • Much digital
        information is
        needlessly lost
      • User chooses
        what they deem
        important
      • Compatible with
        standard archiving
        tools.

July 25, 2012                                               4
Arlington, Virginia       Digital Preservation 2012   warcreate.com
WYSIWYG
                                                           Archive created from
            Facebook-Supplied Data Dump
                                                          WARCreate in Wayback




July 25, 2012                                                                      5
Arlington, Virginia                  Digital Preservation 2012               warcreate.com
WYSIWYG
                                                               Archive created from
            Using Scraping Tools (e.g. wget)
                                                              WARCreate in Wayback




July 25, 2012                                                                          6
Arlington, Virginia                      Digital Preservation 2012               warcreate.com
WYSIWYG
                                                              Archive created from
                 A Crawler Has No Context
                                                             WARCreate in Wayback




July 25, 2012                                                                         7
Arlington, Virginia                     Digital Preservation 2012               warcreate.com
WYSIWYG
                                                            Archive created from
                 IA/HERITRIX OBEY ROBOTS
                                                           WARCreate in Wayback




July 25, 2012                                                                       8
Arlington, Virginia                   Digital Preservation 2012               warcreate.com
Goals
      • Make it easy to use (GUI-based, no cmd line)
      • Make it useful (fill the need)
      • Demonstrate novelty of browser-instigated
        preservation
      • Show value of WARC format for Personal Web
        preservation
      • Bring WARC format to Personal Digital
        Archiving
July 25, 2012                                             9
Arlington, Virginia     Digital Preservation 2012   warcreate.com
Creating a WARC
July 25, 2012                                          11
Arlington, Virginia   Digital Preservation 2012   warcreate.com
I’ve Made a WARC. Now what?
      • What you do with the archive is up to you.
             – Install it in your local Wayback instance
      • Who has their own Wayback Instance!?
             – Wayback is free & open source
      • That seems like a lot of work!
             – One additional reason for users NOT to preserve
               what they would like archived


July 25, 2012                                                    12
Arlington, Virginia             Digital Preservation 2012   warcreate.com
WARC Creation & Replay
      6. User Selects website the HTTP Headers
      3. Local visits acaptures using their browser
      1. WARCWayback instance indexes WARC
      4. WARCreate “Generatelocally
      5.
      2.       accesses local wayback
                generated, saved WARC”
      …to directory accessible to local wayback
      to view preserved content
      button in WARCreate

                                                           1
                       2                                   3       6



                           4                                   5
July 25, 2012                                                               13
Arlington, Virginia            Digital Preservation 2012               warcreate.com
Suite Installation & Interaction
      • Drag & Drop .zip to hd

      • Start relevant services
        using GUI
      • Execute WARCreate process

      • View Archive at http://localhost/wayback

July 25, 2012                                              14
Arlington, Virginia       Digital Preservation 2012   warcreate.com
Replay of Preserved Twitter page
July 25, 2012                                                 15
Arlington, Virginia          Digital Preservation 2012   warcreate.com
And My Bank Statements?
      • Preserved content:
             – never leaves WARC files
             – never leaves local machine
      • WARCreate provides preliminary
        encoding/encryption support
      • Wayback instance is hosted on your own
        machine – no external access by default


July 25, 2012                                                  16
Arlington, Virginia           Digital Preservation 2012   warcreate.com
Why Use a Client-Side Server?
   •    Server scripts do what JS can’t
   •    Can reside on your machine!
   •    Controls are GUI based
   •    Resource fetching w/o XSS issues
                                                       WARCreate         Memento
                          Local Wayback Instance       Server-Side       Proxy
                                                        Support
                      …          Tomcat                               Apache       Built On


                             XAMPP-Based Personal Web Archiving Suite

July 25, 2012                                                                           18
Arlington, Virginia                       Digital Preservation 2012                warcreate.com
Extras: Memento Support
      • Suite’s includes tailored
        Timegate
      • Memento abstraction is
        beyond WARC
      • Point MementoFox
        (or other Memento tools)
        to localhost


July 25, 2012                                                 19
Arlington, Virginia          Digital Preservation 2012   warcreate.com
How it All Relates
          BROWSER                   Personal Archives Accessible at localhost




                                                   Local
       Browser Extensions                        Timegate


           MementoFox                                                               Local
                                                                                  Wayback
                                                                                  Instance
                                                         WARC/1.0
                                                         WARC-Type: warcinfo
                                                         WARC-Date: 2012-07-
            WARCreate                                    15T22:15:59.485Z
                                                         WARC-Filename:            Index WARCs
                            Generates WARC file          2220471175c820fee3fec9
                                                         86040ebd1f.warc




July 25, 2012                                                                                     20
Arlington, Virginia                   Digital Preservation 2012                              warcreate.com
Contribution of Work
      • Facilitate browser-based Personal Web
        Archiving
      • Determine feasibility of fully Client-Side
        Preservation
      • Integrate with existing tools for establishing
        use cases



July 25, 2012                                                21
Arlington, Virginia         Digital Preservation 2012   warcreate.com
WARCreate
          Create Wayback-Consumable WARC Files from Any Webpage



                      http://WARCreate.com


July 25, 2012                                                     22
Arlington, Virginia           Digital Preservation 2012     warcreate.com
Backup Slides




July 25, 2012                                            23
Arlington, Virginia     Digital Preservation 2012   warcreate.com
Future Work
      •    Decouple from “server”
      •    Refine Memento integration
      •    Reference full WARC spec
      •    Built-in WARC validation
      •    Built-in replay
      •    Compression
      •    Optimization (removing duplicates)
      •    …many more
July 25, 2012                                               24
Arlington, Virginia        Digital Preservation 2012   warcreate.com
Extras: Configuration Sanity Check
      • Server scipts make up for                    ✗
                                                     ✗
                                                         WARC Validation
                                                         AJAX XSS Circumvention
        Javascript shortcomings                      ✗
                                                     ✗
                                                         HTML5 Sandbox Escaping
                                                         Memento Support
      • The server can reside on                     ✗   Local Wayback Instance
                                                    In WARCreate
        your machine!
      • Setup,Start,Stop are
                   GUI based



July 25, 2012                                                                 25
Arlington, Virginia     Digital Preservation 2012                        warcreate.com
Extras: Configuration Sanity Check
                                                          ✓   WARC Validation
      + Apache allows generated                           ✓   AJAX XSS Circumvention
        WARCs to be validated                             ✓
                                                          ?
                                                              HTML5 Sandbox Escaping
                                                              Memento Support
                                                          ✗   Local Wayback Instance
      + Javascript cannot write to
                                                         In WARCreate
        disk, server-side scripts can
      + Server prevents hot-
        linking & has security
      = Content better preserved
        using server techs


July 25, 2012                                                                      26
Arlington, Virginia          Digital Preservation 2012                        warcreate.com
Extras: Configuration Sanity Check
      • Memento requires Wayback ✓ WARCXSS Circumvention
                                   ✓ AJAX
                                           Validation

        Wayback requires Tomcat    ✓ HTML5 Sandbox Escaping
                                   ✓ Memento Support
        ∴ Memento requires Tomcat ✓ Local Wayback Instance
                                 In WARCreate
      • Memento Timegate req’s
           Python+modules
           (pre-packaged + included)




July 25, 2012                                                    27
Arlington, Virginia             Digital Preservation 2012   warcreate.com

Mais conteúdo relacionado

Semelhante a WARCreate - Create Wayback-Consumable WARC Files from Any Webpage

WKWebView in Production
WKWebView in ProductionWKWebView in Production
WKWebView in ProductionJeremy Wiebe
 
CA Plex on Apple Mac, iOS, Android
CA Plex on Apple Mac, iOS, AndroidCA Plex on Apple Mac, iOS, Android
CA Plex on Apple Mac, iOS, AndroidCM First Group
 
Building Rich Internet Applications Using Google Web Toolkit
Building Rich Internet Applications Using  Google Web ToolkitBuilding Rich Internet Applications Using  Google Web Toolkit
Building Rich Internet Applications Using Google Web Toolkitrajivmordani
 
Mavenizing your Liferay project
Mavenizing your Liferay projectMavenizing your Liferay project
Mavenizing your Liferay projectmimacom
 
Getting Started with Docker
Getting Started with DockerGetting Started with Docker
Getting Started with Dockervisual28
 
Ii 1300-java essentials for android
Ii 1300-java essentials for androidIi 1300-java essentials for android
Ii 1300-java essentials for androidAdrian Mikeliunas
 
Gotta Persist 'Em All: Realm as Replacement for SQLite
Gotta Persist 'Em All: Realm as Replacement for SQLiteGotta Persist 'Em All: Realm as Replacement for SQLite
Gotta Persist 'Em All: Realm as Replacement for SQLiteSiena Aguayo
 
Android operating system
Android operating systemAndroid operating system
Android operating systemDev Savalia
 
DrupalCamp ATL 2010: Not all CMSs are created equal
DrupalCamp ATL 2010: Not all CMSs are created equalDrupalCamp ATL 2010: Not all CMSs are created equal
DrupalCamp ATL 2010: Not all CMSs are created equalandrewmriley
 
GWT HJUG Presentation
GWT HJUG PresentationGWT HJUG Presentation
GWT HJUG PresentationDerrick Bowen
 
Plesk WP Toolkit - Growing Together @Cloudfest 2022
Plesk WP Toolkit - Growing Together @Cloudfest 2022Plesk WP Toolkit - Growing Together @Cloudfest 2022
Plesk WP Toolkit - Growing Together @Cloudfest 2022Plesk
 
Presentation building your cloud with v mware
Presentation   building your cloud with v mwarePresentation   building your cloud with v mware
Presentation building your cloud with v mwaresolarisyourep
 
Presentation building your cloud with v mware
Presentation   building your cloud with v mwarePresentation   building your cloud with v mware
Presentation building your cloud with v mwarexKinAnx
 
DevoxxBelgium_StatefulCloud.pptx
DevoxxBelgium_StatefulCloud.pptxDevoxxBelgium_StatefulCloud.pptx
DevoxxBelgium_StatefulCloud.pptxGrace Jansen
 
Developing Java EE Applications on IntelliJ IDEA with Oracle WebLogic 12c
Developing Java EE Applications on IntelliJ IDEA with Oracle WebLogic 12cDeveloping Java EE Applications on IntelliJ IDEA with Oracle WebLogic 12c
Developing Java EE Applications on IntelliJ IDEA with Oracle WebLogic 12cBruno Borges
 
JavaFX Versus HTML5 - JavaOne 2014
JavaFX Versus HTML5 - JavaOne 2014JavaFX Versus HTML5 - JavaOne 2014
JavaFX Versus HTML5 - JavaOne 2014Ryan Cuprak
 
Tools for Managing the Past Web
Tools for Managing the Past WebTools for Managing the Past Web
Tools for Managing the Past WebMichele Weigle
 
Digital Preservation 2013
Digital Preservation 2013Digital Preservation 2013
Digital Preservation 2013Mat Kelly
 
Ugly truths about html5 moosecon - robert virkus - 2013-03-07
Ugly truths about html5   moosecon - robert virkus - 2013-03-07Ugly truths about html5   moosecon - robert virkus - 2013-03-07
Ugly truths about html5 moosecon - robert virkus - 2013-03-07Enough Software
 

Semelhante a WARCreate - Create Wayback-Consumable WARC Files from Any Webpage (20)

WKWebView in Production
WKWebView in ProductionWKWebView in Production
WKWebView in Production
 
CA Plex on Apple Mac, iOS, Android
CA Plex on Apple Mac, iOS, AndroidCA Plex on Apple Mac, iOS, Android
CA Plex on Apple Mac, iOS, Android
 
jQuery On Rails
jQuery On RailsjQuery On Rails
jQuery On Rails
 
Building Rich Internet Applications Using Google Web Toolkit
Building Rich Internet Applications Using  Google Web ToolkitBuilding Rich Internet Applications Using  Google Web Toolkit
Building Rich Internet Applications Using Google Web Toolkit
 
Mavenizing your Liferay project
Mavenizing your Liferay projectMavenizing your Liferay project
Mavenizing your Liferay project
 
Getting Started with Docker
Getting Started with DockerGetting Started with Docker
Getting Started with Docker
 
Ii 1300-java essentials for android
Ii 1300-java essentials for androidIi 1300-java essentials for android
Ii 1300-java essentials for android
 
Gotta Persist 'Em All: Realm as Replacement for SQLite
Gotta Persist 'Em All: Realm as Replacement for SQLiteGotta Persist 'Em All: Realm as Replacement for SQLite
Gotta Persist 'Em All: Realm as Replacement for SQLite
 
Android operating system
Android operating systemAndroid operating system
Android operating system
 
DrupalCamp ATL 2010: Not all CMSs are created equal
DrupalCamp ATL 2010: Not all CMSs are created equalDrupalCamp ATL 2010: Not all CMSs are created equal
DrupalCamp ATL 2010: Not all CMSs are created equal
 
GWT HJUG Presentation
GWT HJUG PresentationGWT HJUG Presentation
GWT HJUG Presentation
 
Plesk WP Toolkit - Growing Together @Cloudfest 2022
Plesk WP Toolkit - Growing Together @Cloudfest 2022Plesk WP Toolkit - Growing Together @Cloudfest 2022
Plesk WP Toolkit - Growing Together @Cloudfest 2022
 
Presentation building your cloud with v mware
Presentation   building your cloud with v mwarePresentation   building your cloud with v mware
Presentation building your cloud with v mware
 
Presentation building your cloud with v mware
Presentation   building your cloud with v mwarePresentation   building your cloud with v mware
Presentation building your cloud with v mware
 
DevoxxBelgium_StatefulCloud.pptx
DevoxxBelgium_StatefulCloud.pptxDevoxxBelgium_StatefulCloud.pptx
DevoxxBelgium_StatefulCloud.pptx
 
Developing Java EE Applications on IntelliJ IDEA with Oracle WebLogic 12c
Developing Java EE Applications on IntelliJ IDEA with Oracle WebLogic 12cDeveloping Java EE Applications on IntelliJ IDEA with Oracle WebLogic 12c
Developing Java EE Applications on IntelliJ IDEA with Oracle WebLogic 12c
 
JavaFX Versus HTML5 - JavaOne 2014
JavaFX Versus HTML5 - JavaOne 2014JavaFX Versus HTML5 - JavaOne 2014
JavaFX Versus HTML5 - JavaOne 2014
 
Tools for Managing the Past Web
Tools for Managing the Past WebTools for Managing the Past Web
Tools for Managing the Past Web
 
Digital Preservation 2013
Digital Preservation 2013Digital Preservation 2013
Digital Preservation 2013
 
Ugly truths about html5 moosecon - robert virkus - 2013-03-07
Ugly truths about html5   moosecon - robert virkus - 2013-03-07Ugly truths about html5   moosecon - robert virkus - 2013-03-07
Ugly truths about html5 moosecon - robert virkus - 2013-03-07
 

Mais de Mat Kelly

Aggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkAggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkMat Kelly
 
Client-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderClient-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderMat Kelly
 
A Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesA Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesMat Kelly
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Mat Kelly
 
Exploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesExploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesMat Kelly
 
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...Mat Kelly
 
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...Mat Kelly
 
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryFacilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryMat Kelly
 
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mat Kelly
 
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Mat Kelly
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital PreservationMat Kelly
 
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Mat Kelly
 
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemIEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemMat Kelly
 
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMaking Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMat Kelly
 
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...Mat Kelly
 
The Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedThe Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedMat Kelly
 
NDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link RestorationNDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link RestorationMat Kelly
 
NDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive FacebookNDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive FacebookMat Kelly
 

Mais de Mat Kelly (19)

Aggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkAggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity Framework
 
Client-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderClient-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer Header
 
A Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesA Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web Archives
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 
Exploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesExploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web Archives
 
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
 
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
 
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryFacilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
 
Slides
SlidesSlides
Slides
 
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
 
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013
 
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemIEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
 
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMaking Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
 
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
 
The Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedThe Revolution Will Not Be Archived
The Revolution Will Not Be Archived
 
NDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link RestorationNDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link Restoration
 
NDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive FacebookNDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive Facebook
 

Último

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Último (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

WARCreate - Create Wayback-Consumable WARC Files from Any Webpage

  • 1. WARCreate Create Wayback-Consumable WARC Files from Any Webpage Mat Kelly, Michele C. Weigle, Michael L. Nelson {mkelly,mweigle,mln}@cs.odu.edu Old Dominion University; Norfolk, VA July 25, 2012 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 2. What is WARCreate? • Google Chrome extension • Creates WARC files • Enables preservation by users from their browser • First steps in bringing Institutional Archiving facilities to the PC July 25, 2012 2 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 3. Target Content • Unreachable by web crawlers – Behind authentication – Not listed in search engines (Deep Web) • Private – We don’t want our bank statements in Wayback • Non-pertinent to public – Others have little interest in our Facebook comments July 25, 2012 3 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 4. Preserving More! • Much digital information is needlessly lost • User chooses what they deem important • Compatible with standard archiving tools. July 25, 2012 4 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 5. WYSIWYG Archive created from Facebook-Supplied Data Dump WARCreate in Wayback July 25, 2012 5 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 6. WYSIWYG Archive created from Using Scraping Tools (e.g. wget) WARCreate in Wayback July 25, 2012 6 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 7. WYSIWYG Archive created from A Crawler Has No Context WARCreate in Wayback July 25, 2012 7 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 8. WYSIWYG Archive created from IA/HERITRIX OBEY ROBOTS WARCreate in Wayback July 25, 2012 8 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 9. Goals • Make it easy to use (GUI-based, no cmd line) • Make it useful (fill the need) • Demonstrate novelty of browser-instigated preservation • Show value of WARC format for Personal Web preservation • Bring WARC format to Personal Digital Archiving July 25, 2012 9 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 10. Creating a WARC July 25, 2012 11 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 11. I’ve Made a WARC. Now what? • What you do with the archive is up to you. – Install it in your local Wayback instance • Who has their own Wayback Instance!? – Wayback is free & open source • That seems like a lot of work! – One additional reason for users NOT to preserve what they would like archived July 25, 2012 12 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 12. WARC Creation & Replay 6. User Selects website the HTTP Headers 3. Local visits acaptures using their browser 1. WARCWayback instance indexes WARC 4. WARCreate “Generatelocally 5. 2. accesses local wayback generated, saved WARC” …to directory accessible to local wayback to view preserved content button in WARCreate 1 2 3 6 4 5 July 25, 2012 13 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 13. Suite Installation & Interaction • Drag & Drop .zip to hd • Start relevant services using GUI • Execute WARCreate process • View Archive at http://localhost/wayback July 25, 2012 14 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 14. Replay of Preserved Twitter page July 25, 2012 15 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 15. And My Bank Statements? • Preserved content: – never leaves WARC files – never leaves local machine • WARCreate provides preliminary encoding/encryption support • Wayback instance is hosted on your own machine – no external access by default July 25, 2012 16 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 16. Why Use a Client-Side Server? • Server scripts do what JS can’t • Can reside on your machine! • Controls are GUI based • Resource fetching w/o XSS issues WARCreate Memento Local Wayback Instance Server-Side Proxy Support … Tomcat Apache Built On XAMPP-Based Personal Web Archiving Suite July 25, 2012 18 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 17. Extras: Memento Support • Suite’s includes tailored Timegate • Memento abstraction is beyond WARC • Point MementoFox (or other Memento tools) to localhost July 25, 2012 19 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 18. How it All Relates BROWSER Personal Archives Accessible at localhost Local Browser Extensions Timegate MementoFox Local Wayback Instance WARC/1.0 WARC-Type: warcinfo WARC-Date: 2012-07- WARCreate 15T22:15:59.485Z WARC-Filename: Index WARCs Generates WARC file 2220471175c820fee3fec9 86040ebd1f.warc July 25, 2012 20 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 19. Contribution of Work • Facilitate browser-based Personal Web Archiving • Determine feasibility of fully Client-Side Preservation • Integrate with existing tools for establishing use cases July 25, 2012 21 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 20. WARCreate Create Wayback-Consumable WARC Files from Any Webpage http://WARCreate.com July 25, 2012 22 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 21. Backup Slides July 25, 2012 23 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 22. Future Work • Decouple from “server” • Refine Memento integration • Reference full WARC spec • Built-in WARC validation • Built-in replay • Compression • Optimization (removing duplicates) • …many more July 25, 2012 24 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 23. Extras: Configuration Sanity Check • Server scipts make up for ✗ ✗ WARC Validation AJAX XSS Circumvention Javascript shortcomings ✗ ✗ HTML5 Sandbox Escaping Memento Support • The server can reside on ✗ Local Wayback Instance In WARCreate your machine! • Setup,Start,Stop are GUI based July 25, 2012 25 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 24. Extras: Configuration Sanity Check ✓ WARC Validation + Apache allows generated ✓ AJAX XSS Circumvention WARCs to be validated ✓ ? HTML5 Sandbox Escaping Memento Support ✗ Local Wayback Instance + Javascript cannot write to In WARCreate disk, server-side scripts can + Server prevents hot- linking & has security = Content better preserved using server techs July 25, 2012 26 Arlington, Virginia Digital Preservation 2012 warcreate.com
  • 25. Extras: Configuration Sanity Check • Memento requires Wayback ✓ WARCXSS Circumvention ✓ AJAX Validation Wayback requires Tomcat ✓ HTML5 Sandbox Escaping ✓ Memento Support ∴ Memento requires Tomcat ✓ Local Wayback Instance In WARCreate • Memento Timegate req’s Python+modules (pre-packaged + included) July 25, 2012 27 Arlington, Virginia Digital Preservation 2012 warcreate.com

Notas do Editor

  1. IntroduceSoftware project – WARCreate- Chrome extension Allows create WARC file from any webpage you can view in your browser.
  2. WARC format prev inaccessible directly to users Using tool, users utilize format. Bringing WARC format to user is signif b/cfirst step in porting practices, tools and facilities from conventnl web archiving to casual archivist oranyone wants to preserve content in a standard way.
  3. If vistd WB Mach, noticed lg part of web not preserved Content | auth & data inacces crawlers IS NOT ARCHIVEDMaybe this is good, not arcving bank statmtsDontwantCrawlers !=auth w/ bank & archive spending habits but what if I wanted to archive them? Instead bank st, wanted to preserve FB acct content or privTwitt feed? No facil exist that allow me to archi this  WARC format.
  4. Since bank st, twit feed et al are not arch, what else isn’t preservd?A lg majority of web inaccessible Heritr & IA or simply not presvd one reas|anoth. EvrthnHeritrx can archive, we can see in browserWe also see things Heritr != access to. Unarchd info ==more importnt than info currently archd. WE should det what’s impt. – can w/ WARCreate
  5. High Levlproj - don't want transf the complications formt to user but inst allowuser to util format w/o knowing ins/outs. Wanna make it easy to use.Interface,as see = simple at root but powerful below root. Wanna fulfill goal of presvingcontt || auth & contt currently being lost/time bc !being preserved.Esp. wanna show browser,when extdd, used as tool for preservation in way that mksmost sense forpers web preserv, i.e. we preserve what we can see and deem important.Finly,wanna show that WARC formt is applicble beyondscope of convent web archiving.
  6. Said before, Interface is simple. 1.User navigates to webpg want archived, clicks WC icon, clksgenerate warc button& out pops aWARC. Simple.Threre are many user customizable options hidden beneath simplicity but for simple WARC creation of the webpage you’re viewing, it’s one-click and you’re done.
  7. So, you've used WCto create WARC. Mission success! Let's all go home.What good is presrvdformt if cannot be re-expercd? IA’s WB is FOSS! You can dl copy and replay these WARCs. But that sounds excessive &difficult and who really wants to host their own WB intance? THESEprecisely somereasons that have prevntdutilizn of convtnl web archpractices & tools into realm of personal web archiving. But we have a solution for this, too.
  8. y
  9. Might STILLcomplctd, so here's a better synopsis. Drag&Drop zipped suite toroot of HD. Start Apache serv by clicking a button. Start Tomcat serv by clicking a button. CreateWARC from any page on the web using WC. Save WARC to your HD. Wait forWB instc to index -> any content you can view is then preserved into WARC format and replayable from your local machine.
  10. Apache = web server, my bank statements? Exposed to world?Evrthng in Suite is on your machine. Aside frmgrabng content you wannapresv from web, none output is exposed. Further, I’ve worked in preliminary support for encding/encrypn to ease nerves a littlebut real power = your priv archives are !expsd as would be if handedoff to instit. Meanwhile, ur able toutilinstitfacilitts.
  11. K,snds good. Doesn’t do?* doesnot archive entire website into single WARC. -purely WYSIWYG - What you see is what you get. *does not submit WARCs to IA. So, even if y’adPersWebArchwantd world to see one day, tool willntauto’ly submit WARCs.*only implmnts subset WARC formt. While thsWARCs validt, compreh support format has mny edge cases and must take into account all of the nuances of the web experience.*Doesntprvdmeans replay - job of the supplemtry PWA suite, which will be freely available with WC.
  12. Some extras, configing all packages on a system is hard. WC contnssanity checks -ensure u setup the suite correctly and can take advantage of the full extent of the extension. By using this client-side server approach, able to overcome someshort comings of JS &web browser. Start off, !Apache | !Tomcat runningCan still create WARCs using just extension but !validated &might run into website-based security issues trying to store sites' content locally. So, we dragged the suite to our hard drive, per the procedure before and we're now ready to fire up Apache. We click the button.
  13. Why wasnt Memento support integr into gendWARCs? Simput,beyond abstraction WARCs & instd in realm of replay rather than src replay. By utilizng suite that origly built to supplmnt WC also able incld facilities for Memento to handle reqtsapproply& retn to userdate of archive reqd from local WB inst.
  14. Tie all together:WC gens WARCsSuite contains support for Wayback and a host of WARC tools that supplement creation WARCs. Alsoincluded custom Memento Timegate in suite*so when you go to query your persWebArchvson your machine, matter put’nglocalhostaddr oftimegate =*allthe your personal web archives are temporally navigable.
  15. Synop*wanted show browser bsdpersWebArch very feasible endeavor to implmt* By taking into acct pre-existing tools, able to show that WARCs we’re creating aren't some esoteric format that, too, will be lost in time when format dies ->* rather an established, standard format that integrates with pre-existing offerings.
  16. Fut. Work WARCreate* Decouple from “server”* Refine Memento integration* Reference full WARC spec* Built-in WARC validation* Built-in replay* Compression* Optimization (removing duplicates)* …many more
  17. Some extras, configing all packages on a system is hard. WC contnssanity checks -ensure u setup the suite correctly and can take advantage of the full extent of the extension. By using this client-side server approach, able to overcome someshort comings of JS &web browser. Start off, !Apache | !Tomcat runningCan still create WARCs using just extension but !validated &might run into website-based security issues trying to store sites' content locally. So, we dragged the suite to our hard drive, per the procedure before and we're now ready to fire up Apache. We click the button.
  18. Custom conf of Apach suite gives* access to WARC validatn*allows overcome protection of website in keeping us from preserving content *able to interact with local filesys - something Javascript has severe limitations on doing. Replay sys isn't up yet though & that's because the Java-based FOSS Wayback requires Tomcat. Also, wanna able jump in time between our archives using Me, so we cannot experience this without our Wayback instance being up. We start up Tomcat.
  19. &now compltpkg =operatnlIncldd w/ suite is impl of Memento framework built specif for persWebArch.Thssupprt is fairly new and not tied into suite yet, so reqs a lil more s/w, also included