Perseverance on Persistence

Presentation given at EuropeanaTech 2018 in Rotterdam, The Netherlands. Provides a summary of insights gained from about a decade of work on challenges related to the temporal aspects of the web and to persistence.

  1. Herbert Van de Sompel (@hvdsomp), Los Alamos National Laboratory. EuropeanaTech 2018, Rotterdam, The Netherlands, 15/05/18. Perseverance on Persistence: a future-note about the past
  2. OAI-ORE
  3. 2006 • OAI-ORE observation: Scholarly assets are rapidly becoming compound, consisting of multiple resources with various: • Relationships • Interdependencies • How to convey this compound-ness in an interoperable manner so that applications can access and consume such assets? http://www.openarchives.org/ore/1.0/toc
  4. ORE Insight 1 - Web-Centric Interoperability Paradigm • Address interoperability challenges from the perspective of the web • The resource at the center of the universe • The notion of a repository (or even of a web server) does not exist in the architecture of the web • Neither does the notion of a Digital Object • The tools of the interoperability trade are the primitives of the web
  5. Tools of the Web-Centric Interoperability Trade • Resource • URI • HTTP as the API: HEAD/GET, POST, PUT, DELETE • Representation • Media Type • Link • Content Negotiation • Typed Link • Controlled Vocabularies for Typed Links (W3C Architecture of the World Wide Web; RDF, RDFS, OWL)
  6. (no text on slide)
  7. OAI-ORE in EDM Europeana v1.0 2009
  8. ORE Insight 2 – How to Access Temporal State of an Aggregation • The web-centric ORE approach allowed using off-the-shelf web tools to archive evolving compound objects • Evolving versions of Resource Maps and Aggregated Resources were captured in a web archive • But how to use the URI of the Aggregation or Resource Map to see the status of an Aggregation at a specific moment in the past? • H. Van de Sompel (2007) Compound Information Object Prototype Demonstration https://www.dropbox.com/s/dd7xd427y90q4jx/CT_Watch_hvds_20070703.mov?dl=0
  9. Memento • H. Van de Sompel, M. L. Nelson, R. Sanderson (2013) RFC 7089 - HTTP Framework for Time-Based Access to Resource States – Memento. https://tools.ietf.org/html/rfc7089
  10. Tools of the Web-Centric Interoperability Trade – HTTP Stack • Resource • URI • HTTP as the API • Representation • Media Types • Link • Content Negotiation, e.g. for preferred Media Type • Typed Link • Controlled Vocabularies for Typed Links (W3C Architecture of the World Wide Web; HTTP Links, IANA link relation registry, community link relation types) • HATEOAS – Hypermedia As The Engine Of Application State http://en.wikipedia.org/wiki/HATEOAS
  11. Original Resource and Mementos
  12. Bridge from Present to Past
  13. Bridge from Present to Past
  14. Bridge from Past to Present
  15. timegate Link: Link to Your Own History • Can link to preferred web archive, but also: • Maintain your own resource version history • timegate link to your own history • Distributed management of resource history • Uniform access to resource history across systems • Follow links across systems subject to time
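
The bridge between an original resource and its past versions can be sketched in a few lines. In the Memento framework (RFC 7089) a client states the desired moment in an `Accept-Datetime` request header, and servers advertise `original`, `timegate`, and `memento` relations in the HTTP `Link` response header. The Python sketch below only formats such a header value and parses a `Link` header; the hosts `a.example.org` and `arxiv.example.net` are made-up placeholders, and the regular expressions are a simplification rather than a complete Web Linking parser.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# One link-value: <URI> followed by ;name=value parameters (quoted or bare).
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+=(?:"[^"]*"|[^;,\s]+))*)')
PARAM_RE = re.compile(r'([\w-]+)=(?:"([^"]*)"|([^;,\s]+))')


def accept_datetime(moment: datetime) -> str:
    """Format a datetime as an Accept-Datetime header value (always GMT)."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def links_by_rel(link_header: str) -> dict:
    """Group the links of an HTTP Link header by relation type."""
    result = {}
    for uri, raw_params in LINK_RE.findall(link_header):
        params = {name: quoted or bare
                  for name, quoted, bare in PARAM_RE.findall(raw_params)}
        # A rel value may hold several space-separated relation types.
        for rel in params.get("rel", "").split():
            result.setdefault(rel, []).append({"uri": uri, **params})
    return result


# Hypothetical Link header as an original resource or Memento might return it.
EXAMPLE_LINK = (
    '<http://a.example.org/>; rel="original", '
    '<http://arxiv.example.net/timegate/http://a.example.org/>; rel="timegate", '
    '<http://arxiv.example.net/web/20010911203610/http://a.example.org/>; '
    'rel="memento"; datetime="Tue, 11 Sep 2001 20:36:10 GMT"'
)

if __name__ == "__main__":
    print(accept_datetime(datetime(2007, 3, 20, tzinfo=timezone.utc)))
    print(links_by_rel(EXAMPLE_LINK)["timegate"][0]["uri"])
```

A client would send the formatted value to the URI found under `timegate` and follow the redirect to the Memento closest to that moment.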
  16. No timegate Link – Client Intelligence • Client uses TimeGate of its preferred web archive, but: • Internet Archive is massive, yet other archives hold substantial unique materials • Introduce aggregated TimeGate: Memento Aggregator
  17. Routing TimeGate Requests Using Machine Learning • Memento Aggregator covers 20+ web archives • Distributed systems problem: As the number of archives (and incoming requests) grows, sending requests to each archive for every incoming request is not feasible • Response times • Load on distributed archives • After various optimization attempts, devised an approach using binary classifiers per web archive: • Trained on the basis of cached URIs, using URI features only • Operational since 2016: 80% reduction in the number of queries, 1/3 reduction in response times, recall 85% • Bornand, N., Balakireva, L., Van de Sompel, H. (2016) Routing Memento Requests Using Binary Classifiers. JCDL16. https://arxiv.org/abs/1606.09136
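
The routing idea, one binary classifier per archive that looks only at the URI, can be illustrated with a toy model. The sketch below is not the classifier or feature set from the cited paper: it swaps in a small naive Bayes model with add-one smoothing, and the features (`tld`, `domain`, path `depth`), the training URIs, and the archive name `uk_archive` are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def uri_features(uri: str) -> set:
    """Derive features from the URI string alone; no content is fetched."""
    parts = urlsplit(uri)
    labels = (parts.hostname or "").split(".")
    depth = len([seg for seg in parts.path.split("/") if seg])
    return {f"tld={labels[-1]}",
            f"domain={'.'.join(labels[-2:])}",
            f"depth={min(depth, 5)}"}


class ArchiveClassifier:
    """Naive Bayes guess at whether one archive holds captures of a URI."""

    def __init__(self) -> None:
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}

    def train(self, uri: str, held: bool) -> None:
        self.totals[held] += 1
        self.counts[held].update(uri_features(uri))

    def predict(self, uri: str) -> bool:
        scores = {}
        seen = sum(self.totals.values())
        for label in (True, False):
            # Log prior plus log likelihoods, both with add-one smoothing.
            score = math.log((self.totals[label] + 1) / (seen + 2))
            for feat in uri_features(uri):
                score += math.log((self.counts[label][feat] + 1)
                                  / (self.totals[label] + 2))
            scores[label] = score
        return scores[True] >= scores[False]


def route(uri: str, classifiers: dict) -> list:
    """Return the names of the archives worth querying for this URI."""
    return [name for name, clf in classifiers.items() if clf.predict(uri)]


# Toy training data standing in for an aggregator's cache of past lookups.
uk_archive = ArchiveClassifier()
for held_uri in ("http://www.bbc.co.uk/news", "http://www.gov.uk/",
                 "http://example.co.uk/a/b"):
    uk_archive.train(held_uri, True)
for missing_uri in ("http://example.com/", "http://www.nytimes.com/world",
                    "http://example.org/a"):
    uk_archive.train(missing_uri, False)

if __name__ == "__main__":
    classifiers = {"uk_archive": uk_archive}
    print(route("http://www.bl.uk/collection", classifiers))
    print(route("http://www.example.com/page", classifiers))
```

Only archives whose classifier answers yes receive the request, which is where the reduction in queries would come from; a wrong "no" costs recall.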
  18. Various Memento Tools (client/server) https://github.com/machawk1/awesome-memento [Screenshot labels: From Internet Archive; Today; Select Date; Mar 20 2007; Apr 03 2007]
  19. Pockets of Persistence
  20. Creating Pockets of Persistence • With Memento’s time travel capability in place, what would it take to support faithfully navigating the web of the Past? • There are two major forces that hinder achieving this goal: • Link rot: A link stops working altogether • Content drift: The linked content changes over time and may eventually no longer be representative of the content that was originally linked • Without these forces at work, the web of the Present would be the same as the web of the Past • But that clearly is not the case
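
The two forces can be told apart mechanically. The sketch below is a deliberately strict simplification: it treats any failed fetch as link rot and any byte-level change as content drift, whereas a study of real pages would need a similarity measure that tolerates cosmetic changes. The function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from typing import Optional


def digest(body: bytes) -> str:
    """Fingerprint of a representation as it was retrieved."""
    return hashlib.sha256(body).hexdigest()


def classify_link(status: Optional[int], body: Optional[bytes],
                  digest_when_linked: str) -> str:
    """Compare the current response with the fingerprint taken when linking."""
    if status is None or status >= 400 or body is None:
        return "link rot"        # the link no longer yields a representation
    if digest(body) != digest_when_linked:
        return "content drift"   # it still resolves, but to different content
    return "intact"


if __name__ == "__main__":
    linked = digest(b"<html>telescope results</html>")
    print(classify_link(404, None, linked))
    print(classify_link(200, b"<html>domain for sale</html>", linked))
    print(classify_link(200, b"<html>telescope results</html>", linked))
```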
  21. Hyperlinks in Theory
  22. Hyperlinks in Reality
  23. Hyperlinks in Reality
  24. Link Rot
  25. Link Rot - PMC • Martin Klein, Herbert Van de Sompel, Robert Sanderson, Harihar Shankar, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
  26. Hyperlinks in Reality
  27. Content Drift
  28. Content Drift
  29. Content Drift: http://icecube.wisc.edu/ on May 8 2009 (left) and August 27 2009 (right)
  30. No Content Drift: http://www.ifa.hawaii.edu/~cowie/k_table.html on June 9 1997 (left) and March 2016 (right)
  31. Content Drift - PMC • Shawn Jones, Herbert Van de Sompel, Harihar Shankar, Martin Klein, et al. (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. In: PLOS ONE https://doi.org/10.1371/journal.pone.0167475
  32. Creating Pockets of Persistence • What would it take to really support faithfully navigating the web of the Past? • This challenge exists for the entire web. Some communities with well-managed collections care about addressing it: • Scholarly communication • Cultural heritage • Legal publications • Journalism • Wikipedia • Why? • Link Rot: quality of service • Content Drift: integrity of the record, reliable evidence, revisiting the state of knowledge, transparency of editorial process, …
  33. US Supreme Court Opinion – Link Rot Activism http://ssnat.com
  34. Two Types of Links from a Managed Collection
  35. Take 1 – PID Approach • PID for B
  36. Managed Collection => Managed Collection
  37. PID Approach • Combat: • Link Rot: Link to PID; Redirect to current location • Content Drift: Mint a PID per version; Link to version PID • With PID links: • Web of Present = Web of Past
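
A minimal sketch of the PID mechanics described on this slide: the identifier stays fixed while the location behind it may change, and each version gets its own identifier. The class `PidResolver`, the `pid:example/...` identifier scheme, and the URLs are hypothetical and do not model any particular PID system.

```python
class PidResolver:
    """Toy resolver: a PID is stable, the URL it redirects to is not."""

    def __init__(self) -> None:
        self._location = {}  # PID -> current URL

    def mint(self, pid: str, url: str) -> str:
        """Register a new PID; minting the same PID twice is an error."""
        if pid in self._location:
            raise ValueError(f"PID already minted: {pid}")
        self._location[pid] = url
        return pid

    def relocate(self, pid: str, new_url: str) -> None:
        """Combat link rot: content moved, the PID keeps working."""
        if pid not in self._location:
            raise KeyError(pid)
        self._location[pid] = new_url

    def resolve(self, pid: str) -> str:
        """Return the URL a client would be redirected to."""
        return self._location[pid]


if __name__ == "__main__":
    resolver = PidResolver()
    # Combat content drift: one PID per version, so a link pins a version.
    v1 = resolver.mint("pid:example/B.v1", "http://old.example.org/B/1")
    v2 = resolver.mint("pid:example/B.v2", "http://old.example.org/B/2")
    resolver.relocate(v1, "http://new.example.org/archive/B/1")
    print(resolver.resolve(v1))
    print(resolver.resolve(v2))
```

The sketch also shows where the labor sits: someone has to call `relocate` every time the content moves.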
  38. (no text on slide)
  39. URI References - PMC • Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102
  40. cite-as Relation Type • Herbert Van de Sompel et al. (2018) cite-as: A Link Relation to Convey a Preferred URI for Referencing. https://datatracker.ietf.org/doc/draft-vandesompel-citeas/ http://signposting.org
  41. PID Approach – Division of Labor
  42. Managed Collection => Web at Large
  43. (no text on slide)
  44. PID Approach
  45. Take 2 – Robust Links Approach
  46. Managed Collection => Web at Large
  47. Snapshot Approach • Combat: • Link Rot & Content Drift: Custodian of A creates snapshot of B, in web archive or locally • Regarding links: • Intuition suggests linking to the snapshot of B …
  48. Linking to Snapshot of B = Potentially Creating a Rotten Link • Existing practice for linking to snapshots: <a href="URL of snapshot of B"> • Problems with existing practice: ◦ Impossible to visit the original URI, if desired ◦ Requires the permanent existence/uptime of the archive that holds the snapshot • One link rot problem replaced by another http://robustlinks.mementoweb.org/about/
  49. Permanent Existence/Uptime of Archives? Remnant of discontinued web archive http://mummify.it captured on February 14 2014 https://web.archive.org/web/20140214233752/https://www.mummify.it/
  50. Permanent Existence/Uptime of Archives? http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html
  51. Permanent Existence/Uptime of Archives? http://web.archive.org/web/20121101043952/http://vogin.nl on March 6 2017 at 15:59 CET
  52. Decorate the Link • Proposed practice for linking to captures, either: <a href="URL of snapshot of B" data-originalurl="B" data-versiondate="datetime of snapshot of B"> or: <a href="B" data-versionurl="URL of snapshot of B" data-versiondate="datetime of snapshot of B"> http://robustlinks.mementoweb.org/spec/
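
Both decoration patterns above can be produced by one small helper. The sketch below uses the `data-originalurl`, `data-versionurl`, and `data-versiondate` attributes named on the slide; the function name `robust_link`, its parameters, and the use of a plain ISO 8601 date for `data-versiondate` are assumptions made for this example, so the Robust Links specification should be consulted for the accepted datetime formats.

```python
from datetime import date
from html import escape


def robust_link(original_url: str, snapshot_url: str, snapshot_date: date,
                text: str, link_to_snapshot: bool = False) -> str:
    """Build an anchor that carries original URL, snapshot URL, and date."""
    if link_to_snapshot:
        # href points at the snapshot; the original URI rides along.
        href, extra_name, extra_value = snapshot_url, "data-originalurl", original_url
    else:
        # href points at the original; the snapshot URI rides along.
        href, extra_name, extra_value = original_url, "data-versionurl", snapshot_url
    return (f'<a href="{escape(href, quote=True)}" '
            f'{extra_name}="{escape(extra_value, quote=True)}" '
            f'data-versiondate="{snapshot_date.isoformat()}">'
            f'{escape(text)}</a>')


if __name__ == "__main__":
    print(robust_link(
        "http://vogin.nl",
        "http://web.archive.org/web/20121101043952/http://vogin.nl",
        date(2012, 11, 1),
        "VOGIN",
    ))
```

Because the anchor keeps the original URI and the date, a client can still look for a capture in a different archive if either the original or the linked snapshot disappears.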
  53. Robust Links: Link Decoration in Action • JavaScript makes the link decorations actionable • Van de Sompel H. & Nelson, M.L. (2015) Reminiscing about 15 years of interoperability efforts. In: D-Lib Magazine. https://doi.org/10.1045/november2015-vandesompel
  54. Robust Links: Refuse to Die
  55. (no text on slide)
  56. Snapshot Approach – Division of Labor
  57. Managed Collection => Managed Collection
  58. Cool URI Approach • Combat: • Link Rot: Link to B; Redirect to current location • Content Drift: Generic URI; Version URIs • With Cool URI links: • Tension between linking to generic URI and version URI
  59. Robust Links: Refuse to Die
  60. (no text on slide)
  61. Cool URI Approach – Division of Labor
  62. Robust Links Approach
  63. Summary [comparison of the PID and Robust Links (RL) approaches, including labor; table contents not recoverable]
  64. Robust Links for Linked Data? • Sanderson, R., Ciccarese, P., and Young, B. (2017) Web Annotation Vocabulary. W3C Recommendation 23 February 2017. https://www.w3.org/TR/annotation-vocab/
  65. Handling Resource Versions, Captures [diagram labels: B; B t1; B t2]
  66. Systems with Resource Versions
  67. DBpedia Snapshot Archive Using HDT, TPF, Memento • Temporal: subject URI access; ?s ?p ?o queries; SPARQL queries • Vander Sande, M., Verborgh, R., Hochstenbach, P., and Van de Sompel, H. (2017) Towards sustainable publishing and querying of distributed Linked Data archives.
  68. Memento Tracer http://tracer.mementoweb.org
  69. Resource Capture: Tension Between Scale and Quality • Web crawling: optimized for scale • Problems with capturing resources accessible via interactive affordances • webrecorder.io: optimized for quality • Personal archiving • User records web navigation session • Not used for archiving at scale • LOCKSS: optimized for scholarly journals • Pages in Publisher/Journal portals share layout, affordances • Heuristics per publisher/journal to improve capture quality
  70. Memento Tracer: New Sweet Spot Between Scale and Quality • ~ web crawling: server-side process to capture resources • ~ LOCKSS: leverages the insight that web publications in any given portal are based on the same template: • share layout • share interactive affordances • ~ webrecorder.io: human guidance to achieve quality • But, with Memento Tracer: • user does not record a specific web publication • user records heuristics that apply to a class of web publications
  71. Memento Tracer
  72. A Trace for slideshare Presentations { "portal_url_match": "(slideshare.net)/([^/]+)/([^/]+)", "actions": [{ "action_order": "1", "value": "div.j-next-btn.arrow-right", "type": "CSSSelector", "action": "repeated_click", "repeat_until": { "condition": "changes", "type": "resource_url" } }, { "action_order": "2", "value": "div.notranslate.transcript.add-padding-right.j-transcript a", "type": "CSSSelector", "action": "click" } ], …
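
To show how such a Trace could be consumed, the sketch below checks whether a Trace applies to a URL and lists its actions in execution order. It uses only the fields visible on the slide, whose Trace is truncated, so the real format may carry more; the helper names `trace_applies` and `ordered_actions` and the sample URLs are hypothetical, and this is not the Memento Tracer implementation.

```python
import re

# The fields shown on the slide; actions are deliberately listed out of order.
TRACE = {
    "portal_url_match": "(slideshare.net)/([^/]+)/([^/]+)",
    "actions": [
        {
            "action_order": "2",
            "value": "div.notranslate.transcript.add-padding-right.j-transcript a",
            "type": "CSSSelector",
            "action": "click",
        },
        {
            "action_order": "1",
            "value": "div.j-next-btn.arrow-right",
            "type": "CSSSelector",
            "action": "repeated_click",
            "repeat_until": {"condition": "changes", "type": "resource_url"},
        },
    ],
}


def trace_applies(trace: dict, url: str) -> bool:
    """A Trace covers every web publication whose URL matches its pattern."""
    return re.search(trace["portal_url_match"], url) is not None


def ordered_actions(trace: dict) -> list:
    """Sort actions by action_order, which the Trace stores as a string."""
    return sorted(trace["actions"], key=lambda a: int(a["action_order"]))


if __name__ == "__main__":
    url = "https://www.slideshare.net/someuser/some-deck"  # hypothetical
    if trace_applies(TRACE, url):
        for step in ordered_actions(TRACE):
            print(step["action_order"], step["action"], step["value"])
```

One Trace therefore serves a whole class of pages: any URL matching `portal_url_match` gets the same sequence of clicks.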
  73. Memento Tracer: Experimental • Promising results, thus far • Currently investigating challenges, including: • User interface to support recording Traces for complex sequences of interactions • Limitations of the browser event listener approach for recording Traces • Language used to express Traces • Organization of the shared repository for Traces • Selection of a Trace for capturing a web publication in cases where different page layouts and interactive affordances are available for web publications that share a URI pattern
  74. Demo: Recording a Trace for a Web Publication https://github.com/www.gorillatoolkit/pkg/mux
  75. Demo: Capturing another Web Publication Using the Trace https://github.com/mementoweb/node-solid-server
  76. Demo: Capturing another Web Publication Using the Trace https://github.com/mementoweb/node-solid-server
  77. Demo: Playing Back the Captured Web Publication • Capture of https://github.com/mementoweb/node-solid-server
  78. Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp • Perseverance on Persistence: a future-note about the past
