Transactional Web Archives


                                    Robert Sanderson
                                    Lyudmila Balakireva
                                      Harihar Shankar
                                   Herbert Van de Sompel

                                 Los Alamos National Laboratory
                                        Research Library


                                   Web Archive Globalization
                                         JCDL 2011, Ottawa
                                          Jun 16-17, 2011


This research is funded by the Library of Congress

             Memento: Transactional Archiving
    Web Archive Globalization Workshop, Jun 16-17 2010   1
Transactional Web Archiving


  Transactional   Archiving?

  Server Side Capture
       Submission, Storage, Access

  Browser   Side Capture
       Submission, Storage, Access

  Memento    for Access



Transactional Archiving?

  Current   web archives actively crawl the web




  For example, Heritrix from the Internet Archive and the many archives that use it




Transactional Archiving?

  Transactional archives passively accept submitted HTTP transactions between browser and server




  For   example, TTApache, PageVault and Everlast.




Why Transactional Archiving?

  Issues   with crawler based archiving:
        Can be rejected       (robots.txt, by user-agent, by host IP)
        Can be deceived       (cloaking: geo-location, by user-agent)
        Can be trapped        (infinite auto-generated pages)
        Don't necessarily capture well used resources
        Require constant and massive bandwidth



  None of these is true for Transactional Archiving, but it has its own, different set of challenges




Transactional Archiving?

  Need   to record transactions between browser and server
        Server side: Servers to be archived must cooperate
        Browser side: Many browsers must cooperate

  Need   to transfer data to archive: either batch mode or real-time
  Archive   must trust submission to be authentic

  Deduplication   challenges as can't control what will be submitted:
        Aliases: Different URL, same response
        Negotiation: Same URL, different response
        Determine "significant" change in response
        Other factors for what to archive/throw away?
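
The alias and negotiation cases above can be sketched as a small index keyed by a digest of the response body. This is only an illustration under stated assumptions, not the archives' actual logic: the `DedupIndex` class, its method names, and the verdict labels are hypothetical.

```python
import hashlib
from collections import defaultdict


class DedupIndex:
    """Hypothetical in-memory index of submitted responses."""

    def __init__(self):
        self.urls_by_digest = defaultdict(set)  # body digest -> URLs that returned it
        self.digests_by_url = defaultdict(set)  # URL -> body digests it has returned

    def submit(self, url: str, body: bytes) -> str:
        """Record a response and classify it against earlier submissions."""
        digest = hashlib.sha256(body).hexdigest()
        seen_urls = self.urls_by_digest[digest]
        seen_digests = self.digests_by_url[url]
        if url in seen_urls:
            verdict = "duplicate"  # same URL, same response
        elif seen_urls:
            verdict = "alias"      # different URL, same response
        elif seen_digests:
            verdict = "variant"    # same URL, different response
        else:
            verdict = "new"
        seen_urls.add(url)
        seen_digests.add(digest)
        return verdict
```

A "variant" verdict alone cannot tell content negotiation apart from a genuine change over time; distinguishing the two would need the request headers as well.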



Server Side Capture

  Approach:

        Willing server records the request and response headers and
         response body just before returning to the browser
        Server sends to an archive for storage




Server Side Capture/Submission

  Developer:    Luda Balakireva

  Capture     Implementation
        Apache connection filter module implemented in C to trap
         URL, headers and response body
        Module POSTs to a configurable URL in real time

  Submission     Implementation
        Java/Grizzly+Jersey for handling submission interface
              Can also be deployed under Tomcat or GlassFish
        BerkeleyDB for storing metadata
        Headers and response body data stored in file system
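
The storage split described above (metadata in a key-value store, headers and body in the file system) might look roughly like the following. This is a Python sketch under assumptions, not the Java implementation: the function name, the file naming scheme, and the record key are invented for illustration, and a plain dict stands in for BerkeleyDB.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def store_transaction(root: Path, metadata_db: dict, url: str,
                      request_headers: dict, response_headers: dict,
                      body: bytes) -> str:
    """Store one captured transaction and return its (hypothetical) record key."""
    observed = datetime.now(timezone.utc).isoformat()
    key = hashlib.sha256(f"{url}|{observed}".encode("utf-8")).hexdigest()
    # Headers and response body go to the file system.
    (root / f"{key}.headers.json").write_text(
        json.dumps({"request": request_headers, "response": response_headers}))
    (root / f"{key}.body").write_bytes(body)
    # Only small lookup metadata goes to the key-value store
    # (a plain dict stands in for BerkeleyDB in this sketch).
    metadata_db.setdefault(url, []).append({"key": key, "observed": observed})
    return key
```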



Server Side Capture

  Direct   server to server upload, in real time:
         Most configurations will have server/archive in close network
          proximity
         Reduces wait time between observation and being
          discoverable in archive




Server Side Capture: Issues



  If   archive is not local, network latency may be an issue
            But could be amortized by batch upload

  Size of dataset could be very large for dynamically generated pages
            But could be reduced by better detection of high value
             changes compared to counters, timestamps, etc.

  Content      Negotiation problematic!

  Capture      of pages with “attack vector” query params
            index.html?f=/etc/passwd
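
For the dataset-size issue above, one possible way to separate high-value changes from counters and timestamps is to strip volatile fragments before comparing digests. The patterns and function names below are assumptions made for this sketch; a real deployment would need patterns tuned per site.

```python
import hashlib
import re

# Hypothetical patterns for volatile page furniture.
VOLATILE_PATTERNS = [
    re.compile(rb"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?"),  # ISO-like timestamps
    re.compile(rb"(?i)(visitors?|hits|views)\s*:\s*\d+"),        # hit counters
]


def stable_digest(body: bytes) -> str:
    """Digest of the body with volatile fragments removed."""
    for pattern in VOLATILE_PATTERNS:
        body = pattern.sub(b"", body)
    return hashlib.sha256(body).hexdigest()


def is_significant_change(old_body: bytes, new_body: bytes) -> bool:
    """True when the bodies differ in more than the volatile fragments."""
    return stable_digest(old_body) != stable_digest(new_body)
```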



Browser Side Capture

  Approach:

        Willing browser records the request and response headers
         and response body after receiving from server
        Browser sends to an archive for storage




Browser Side Capture/Submission

  Developer:   Rob Sanderson

  Capture   Implementation
        Firefox add-on captures headers and body and writes to
         temporary storage on local disk
        After configurable amount of data stored, module compresses
         and moves to a shared Dropbox folder for batch upload
        (Limited) Ability to detect and ignore private data

  Submission   Implementation
        Dropbox used as transfer, temporary storage mechanism
        Python monitor system on top of Dropbox
        Cassandra (NoSQL hash store) for storing metadata
        Response body and headers stored in pair-tree file system
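
A pair-tree layout spreads files across nested two-character directories so that no single directory grows too large. The sketch below is a simplification and an assumption about how such paths could be derived: it hashes the URL first, rather than applying the character-cleaning rules of the Pairtree specification, and the `depth` parameter is invented for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def pairtree_path(identifier: str, depth: int = 4) -> PurePosixPath:
    """Map an identifier (e.g. a URL) to a nested two-character directory path."""
    digest = hashlib.sha1(identifier.encode("utf-8")).hexdigest()
    # Take the first `depth` two-character pairs of the digest as directories.
    pairs = [digest[i:i + 2] for i in range(0, 2 * depth, 2)]
    return PurePosixPath("pairtree_root", *pairs, digest)
```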

Browser Side Submission

  Reasons   for Dropbox rather than direct upload:
       Batch upload via existing infrastructure reduces bandwidth
       Increases Firefox responsiveness
       Batch processing can be scheduled as needed
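
The batch approach can be sketched as a spooler that buffers captured records and writes a compressed batch into a shared folder once a size threshold is reached. This is an assumption-laden Python illustration, not the Firefox add-on's code: the `BatchSpooler` class, the JSON-lines format, and the file naming are hypothetical, and record bodies are assumed to be JSON-serializable.

```python
import gzip
import json
import time
from pathlib import Path


class BatchSpooler:
    """Hypothetical buffer that flushes compressed batches to a shared folder."""

    def __init__(self, shared_dir: Path, max_bytes: int = 1_000_000):
        self.shared_dir = shared_dir
        self.max_bytes = max_bytes  # configurable amount of data per batch
        self.pending = []
        self.pending_bytes = 0
        self.batches_written = 0

    def add(self, record: dict) -> None:
        line = json.dumps(record).encode("utf-8")
        self.pending.append(line)
        self.pending_bytes += len(line)
        if self.pending_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        """Write pending records as one gzip file; return its path, or None."""
        if not self.pending:
            return None
        name = f"batch-{int(time.time())}-{self.batches_written}.jsonl.gz"
        target = self.shared_dir / name
        with gzip.open(target, "wb") as out:
            out.write(b"\n".join(self.pending))
        self.pending, self.pending_bytes = [], 0
        self.batches_written += 1
        return target
```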




Browser Side Capture/Submission

  [Screenshot with callouts: Upload Preferences; Public/Private Status Icon]




Browser Side: Issues

  Privacy!    Privacy! Privacy!
        Difficult to determine if resource should be captured or not
        Current approach:
              No HTTPS
              Check for “log out”, “sign out” etc in body
              Check for usernames, personal name in body, headers
              Blacklist for domains

  Bandwidth

        Slow-down while uploading batch file noticeable on home connections
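
The privacy checks listed under "Current approach" can be read as a single filter that errs on the side of not capturing. The sketch below is a Python approximation under assumptions, not the add-on's code: the marker list, the placeholder blacklist entries, and the `looks_private` name are all hypothetical.

```python
from urllib.parse import urlsplit

LOGOUT_MARKERS = ("log out", "logout", "sign out", "signout")  # assumed marker list
DOMAIN_BLACKLIST = {"mail.example.com", "bank.example.com"}    # placeholder hosts


def looks_private(url: str, headers: dict, body: str, user_names: list) -> bool:
    """Return True if the transaction should NOT be captured."""
    parts = urlsplit(url)
    if parts.scheme == "https":                       # no HTTPS
        return True
    if (parts.hostname or "").lower() in DOMAIN_BLACKLIST:
        return True
    text = body.lower()
    if any(marker in text for marker in LOGOUT_MARKERS):
        return True                                   # page implies a logged-in session
    header_text = " ".join(f"{k}: {v}" for k, v in headers.items()).lower()
    # Usernames or personal names appearing in the body or headers.
    return any(name.lower() in text or name.lower() in header_text
               for name in user_names if name)
```

Substring checks like these produce both false positives and false negatives, which is consistent with the slide's description of the ability as limited.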




Memento in One Slide




Access via Memento

  Both   archives provide Memento TimeGates for access

  TimeGates     can be used with MementoFox:
         Endorsed Firefox add-on:             http://bit.ly/memfox
         Must be configured with Dropbox archive TimeGate
         Processes every HTTP request, not just HTML page


  Distributed   access is intentional design feature
         Possible to construct views from multiple archives:
          Server side will have most authentic copy, but may embed
          image from another server, only in Dropbox archive
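
At its core, a TimeGate takes the datetime the client asks for (sent in the `Accept-Datetime` request header) and picks the archived copy closest to it. The selection step might be sketched as follows; the `select_memento` function and the in-memory mapping of capture times to URIs are assumptions for illustration, and the redirect to the chosen copy is omitted.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def select_memento(accept_datetime: str, mementos: dict) -> str:
    """Pick the memento URI captured closest to the requested datetime.

    `mementos` maps timezone-aware capture datetimes to memento URIs.
    """
    wanted = parsedate_to_datetime(accept_datetime)
    closest = min(mementos, key=lambda captured: abs(captured - wanted))
    return mementos[closest]
```

Because each resource is resolved independently, a client could send the page request to one archive's TimeGate and an embedded image request to another's, which is the multi-archive view the slide describes.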




Server Side Archive: Access

  Access     to archive via Memento TimeGate
        Implemented in Grizzly server using Jersey library
  Original   Server uses HTTP Link header to point to archive


  Export functionality to WARC format is also available, to extract data in batch mode
        By datetime of last update
        By URL
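
The Link header that the original server uses to point at the archive could be built along these lines. The `rel="timegate"` relation comes from the Memento protocol; the TimeGate URL layout (base URL followed by the original URL) and the function name are assumptions made for this sketch.

```python
def timegate_link_header(timegate_base: str, original_url: str) -> str:
    """Build a Link header value advertising the TimeGate for a resource."""
    # Assumed layout: the TimeGate URL is the base followed by the original URL.
    return f'<{timegate_base}{original_url}>; rel="timegate"'
```

For example, `timegate_link_header("http://archive.example/timegate/", "http://example.org/page")` yields `<http://archive.example/timegate/http://example.org/page>; rel="timegate"`.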




Browser Side Archive: Access

  Apache/Python   Memento TimeGate for access
       Archive provides combined, anonymous TimeGate
       Also provides per-user TimeGates to see own archive
       Per-user TimeGates are currently secured only through obscurity
       Export functionality is yet to be implemented




Access via Memento




  [Screenshot with callouts: Experimental Transactional Archive; TimeGate Preferences]




Summary

  Implemented     and tested two types of Transactional Archive:
        Server Side
        Browser Side
  Transactional Archives lack many of the challenges of Crawler-based Archives (but have their own)

  Implemented     Memento TimeGates for Transactional Archives:
        Does not require rewriting URIs for self-contained-ness
        Works well with automated, distributed access patterns


  Access   via Browser add-on is fast and seamless
  Server   and Browser archiving code will be released



Memento wants to make Navigating the Web’s Past Easy




Learn:          http://www.mementoweb.org/
Talk:    http://groups.google.com/group/memento-dev
Use:                   http://bit.ly/memfox


Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Último (20)

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Transactional Archiving (Web Archive Globalization Workshop)

  • 1. Transactional Web Archives
      • Robert Sanderson, Lyudmila Balakireva, Harihar Shankar, Herbert Van de Sompel
      • Los Alamos National Laboratory, Research Library
      • Web Archive Globalization, JCDL 2011, Ottawa, Jun 16-17, 2011
      • This research is funded by the Library of Congress
      • Slide footer: Memento: Transactional Archiving, Web Archive Globalization Workshop, Jun 16-17 2010
  • 2. Transactional Web Archiving
      • Transactional Archiving?
      • Server Side Capture: Submission, Storage, Access
      • Browser Side Capture: Submission, Storage, Access
      • Memento for Access
  • 3. Transactional Archiving?
      • Current web archives actively crawl the web
      • For example, Heritrix from the Internet Archive and the many archives that use it
  • 4. Transactional Archiving?
      • Transactional archives passively accept submitted HTTP transactions between browser and server
      • For example, TTApache, PageVault and Everlast
  • 5. Why Transactional Archiving?
      • Issues with crawler-based archiving:
          • Can be rejected (robots.txt, by user-agent, by host IP)
          • Can be deceived (cloaking: geo-location, by user-agent)
          • Can be trapped (infinite auto-generated pages)
          • Don't necessarily capture well-used resources
          • Require constant and massive bandwidth
      • None of these are true for Transactional Archiving, but it has its own, different set of challenges
  • 6. Transactional Archiving?
      • Need to record transactions between browser and server
          • Server side: Servers to be archived must cooperate
          • Browser side: Many browsers must cooperate
      • Need to transfer data to archive: either batch mode or real-time
      • Archive must trust submission to be authentic
      • Deduplication challenges, as the archive can't control what will be submitted:
          • Aliases: Different URL, same response
          • Negotiation: Same URL, different response
          • Determine "significant" change in response
      • Other factors for what to archive/throw away?
  • 7. Server Side Capture
      • Approach:
          • Willing server records the request and response headers and response body just before returning to the browser
          • Server sends to an archive for storage
  • 8. Server Side Capture/Submission
      • Developer: Luda Balakireva
      • Capture Implementation
          • Apache connection filter module implemented in C to trap URL, headers and response body
          • Module POSTs to a configurable URL in real time
      • Submission Implementation
          • Java/Grizzly+Jersey for handling submission interface
          • Can also be deployed under Tomcat or GlassFish
          • BerkeleyDB for storing metadata
          • Headers and response body data stored in file system
  • 9. Server Side Capture
      • Direct server to server upload, in real time:
          • Most configurations will have server/archive in close network proximity
          • Reduces wait time between observation and being discoverable in archive
  • 10. Server Side Capture: Issues
      • If archive is not local, network latency may be an issue
          • But could be amortized by batch upload
      • Size of dataset could be very large for dynamically generated pages
          • But could be reduced by better detection of high-value changes, as opposed to counters, timestamps, etc.
      • Content negotiation is problematic!
      • Capture of pages with “attack vector” query params
          • index.html?f=/etc/passwd
  • 11. Browser Side Capture
      • Approach:
          • Willing browser records the request and response headers and response body after receiving from server
          • Browser sends to an archive for storage
  • 12. Browser Side Capture/Submission
      • Developer: Rob Sanderson
      • Capture Implementation
          • Firefox add-on captures headers and body and writes to temporary storage on local disk
          • After a configurable amount of data is stored, module compresses it and moves it to a shared Dropbox folder for batch upload
          • (Limited) ability to detect and ignore private data
      • Submission Implementation
          • Dropbox used as transfer and temporary storage mechanism
          • Python monitor system on top of Dropbox
          • Cassandra (NoSQL hash store) for storing metadata
          • Response body and headers stored in pair-tree file system
  • 13. Browser Side Submission
      • Reasons for Dropbox rather than direct upload:
          • Batch upload via existing infrastructure reduces bandwidth
          • Increases Firefox responsiveness
          • Batch processing can be scheduled as needed
  • 14. Browser Side Capture/Submission: Upload Preferences; Public/Private Status Icon
  • 15. Browser Side: Issues
      • Privacy! Privacy! Privacy!
          • Difficult to determine if resource should be captured or not
          • Current approach:
              • No HTTPS
              • Check for “log out”, “sign out”, etc. in body
              • Check for usernames, personal name in body, headers
              • Blacklist for domains
      • Bandwidth
          • Slow-down while uploading batch file is noticeable on home connections
  • 16. Memento in One Slide
  • 17. Access via Memento
      • Both archives provide Memento TimeGates for access
      • TimeGates can be used with MementoFox:
          • Endorsed Firefox add-on: http://bit.ly/memfox
          • Must be configured with Dropbox archive TimeGate
          • Processes every HTTP request, not just HTML page
      • Distributed access is an intentional design feature
          • Possible to construct views from multiple archives: the server side will have the most authentic copy, but the page may embed an image from another server that is only in the Dropbox archive
  • 18. Server Side Archive: Access
      • Access to archive via Memento TimeGate
          • Implemented in Grizzly server using Jersey library
          • Original server uses HTTP Link header to point to archive
      • Export functionality to WARC format is also available, to extract data in batch mode
          • By datetime of last update
          • By URL
  • 19. Browser Side Archive: Access
      • Apache/Python Memento TimeGate for access
      • Archive provides combined, anonymous TimeGate
      • Also provides per-user TimeGates to see own archive
          • Per-user TimeGates are currently secure only through obscurity
      • Export functionality is yet to be implemented
  • 20. Access via Memento: Experimental Transactional Archive TimeGate Preferences
  • 21. Summary
      • Implemented and tested two types of Transactional Archive:
          • Server Side
          • Browser Side
      • Transactional Archives lack many of the challenges of crawler-based archives (but have their own)
      • Implemented Memento TimeGates for Transactional Archives:
          • Does not require rewriting URIs for self-contained-ness
          • Works well with automated, distributed access patterns
          • Access via browser add-on is fast and seamless
      • Server and Browser archiving code will be released
  • 22. Memento wants to make Navigating the Web’s Past Easy
      • Learn: http://www.mementoweb.org/
      • Talk: http://groups.google.com/group/memento-dev
      • Use: http://bit.ly/memfox
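
The deduplication challenges listed on slide 6 (aliases, changed or negotiated responses, and "significant" change) can be illustrated with a small sketch. This is not the authors' implementation: the class `DedupIndex`, the function names, and the volatile-content patterns below are all hypothetical, chosen only to show one way an archive might classify a submitted response by hashing its body after stripping counters and timestamps.

```python
import hashlib
import re

# Hypothetical patterns for content that changes without being "significant"
# (assumption: the archive treats timestamps and hit counters as noise).
VOLATILE_PATTERNS = [
    re.compile(rb"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"),
    re.compile(rb"(visitors?|hits|views)\s*:\s*\d+", re.IGNORECASE),
]


def normalize(body: bytes) -> bytes:
    """Remove volatile fragments and collapse whitespace before hashing."""
    for pattern in VOLATILE_PATTERNS:
        body = pattern.sub(b"", body)
    return b" ".join(body.split())


def digest(body: bytes) -> str:
    """SHA-256 digest of the normalized response body."""
    return hashlib.sha256(normalize(body)).hexdigest()


class DedupIndex:
    """In-memory index mapping digests to URLs and URLs to digests."""

    def __init__(self):
        self.urls_by_digest = {}   # digest -> set of URLs that returned it
        self.digests_by_url = {}   # URL -> set of digests seen for it

    def submit(self, url: str, body: bytes) -> str:
        """Classify a submission as 'new', 'alias', 'duplicate' or 'changed'."""
        d = digest(body)
        content_known = d in self.urls_by_digest
        url_known = url in self.digests_by_url
        seen_for_url = d in self.digests_by_url.get(url, set())

        self.urls_by_digest.setdefault(d, set()).add(url)
        self.digests_by_url.setdefault(url, set()).add(d)

        if seen_for_url:
            return "duplicate"   # same URL, same (normalized) response
        if content_known:
            return "alias"       # different URL, same response
        if url_known:
            return "changed"     # same URL, different response
        return "new"


index = DedupIndex()
print(index.submit("http://example.org/", b"<p>Hello</p> hits: 10"))
print(index.submit("http://example.org/index.html", b"<p>Hello</p> hits: 11"))
print(index.submit("http://example.org/", b"<p>Bonjour</p>"))
```

A body hash alone cannot tell content negotiation apart from a change over time; a fuller version would presumably also key on the negotiated request and response headers, which both capture approaches record.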