SlideShare uma empresa Scribd logo
1 de 43
Visualizing Digital Collections
of Web Archives
Mat Kelly, Michael L. Nelson, Michele C. Weigle
Old Dominion University
Web Archiving Collaboration: New Tools and Models
Columbia University, New York, NY
June 4, 2015
@machawk1http://ws-dl.cs.odu.edu
Motivation for Thumbnail
Summarization
• Change over time - aboutness
Apple.com has > 17k mementos
Many Nearly Identical
(apple.com)
Methods of Summarization
• Including all mementos
– many redundant thumbnails
– temporally/spatially/cognitively expensive
• Naively excluding images
– missing important captures in summary
• Compare image thumbnails
– temporally expensive for identifying unique
thumbnails
Comparing mementos’ markup can identify
sufficiently unique mementos
Analyzing Markup
<title>Apple</title>
<meta property="analytics-
track" content="Apple - Index/Tab"
/>
<meta property="analytics-s-
channel" content="homepage" />
<meta property="analytics-s-
bucket-0"
content="appleglobal,applehome"
/>
<meta property="analytics-s-
bucket-1"
content="apple{COUNTRY_CODE}gl
obal,apple{COUNTRY_CODE}home"
/>
8664ee964799c38c156d8f0
39dae8330
apple.com at Mar 17, 2008 HTML for memento SimHash for HTML
SimHash?
HTML snippet for
memento
First k characters of
markup
Second k characters of
markup
64th k characters of
markup
63rd k characters of
markup
markup length
64
k =
…
Hash to a
character
Hash to a
character
Hash to a
character
Hash to a
character
c
3
9
f
…
SimHash vs. Other Hashes
• md5(“aaaaaaaaaaaaaaa”)
12f9cf6998d52dbe773b06f848bb3608
• md5(“aaaaaaabaaaaaaa”)
e984cee68697eb77577717b532171493
• simhash(“aaaaaaaaaaaaaaa”)
8664ee964799c38c156d8f039dae8330
• simhash(“aaaaaaabaaaaaaa”)
8664ee964799a48c156d8f039dae8330
Why SimHash?
• SimHash identifies similarities between
documents
• Conventional hashing algorithms are for
identifying differences
– Drastically different output from similar content
• To remove redundancies, we want to detect
when temporally adjacent mementos are
sufficiently dissimilar
SimHashes for Mementos
HTML of apple.com
March 3, 2008
HTML of apple.com
March 5, 2008
HTML of apple.com
April 12, 2008
HTML of apple.com
October 4, 2008
c39f0abc...b9
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
Identifying Similarity by
Calculating Hamming Distance
HTML of apple.com
March 3, 2008
HTML of apple.com
March 5, 2008
HTML of apple.com
April 12, 2008
HTML of apple.com
October 4, 2008
c39f0abc...b9
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
HAMMING DISTANCE
2
1
7
N/A
pivot
Sliding Hamming Distance
• Selection based on previously selected memento
• Sliding pivot
ΔM3 ΔM3
ΔM3
ΔM6
ΔM6
ΔM6
ΔM6
ΔM0
ΔM0
ΔM0
Project Goals
Develop tools that implement thumbnail
summarization for TimeMaps
• Web Service
– Allows anyone to view TimeMap using thumbnail
summarization
• Wayback add-on
– Allows any archive using wayback to provide this
service to users
• Embeddable version
– Allow web page authors to embed overview of
past versions of page on live web page
AlSummarization
• SimHash-based summarization scheme
created by Ahmed AlSum
• AlSum + Summarization = AlSummarization
A. AlSum, and M. L. Nelson. “Thumbnail Summarization Techniques for Web Archives.” In
Proceedings of the 36TH European Conference on Information Retrieval, ECIR 2014, 2014.
Dr. Nelson’s Homepage
• URI-R: http://www.cs.odu.edu/~mln
• Append onto service URI for summary
– http://service/http://www.cs.odu.edu/~mln
Anatomy of the Visualization
3 presentations of the Summary
Temporally sorted
mementos
Memento metadata
Additional (optional) Endpoint
Parameters
• Access – tailors user interface
– Interactive, Embed, Wayback
• Strategy – to use alternative summarization
– alSummarization, yearly, skipListed, random
• http://service/?
o access=wayback&URI-
R=http://www.cs.odu.edu/~mln
o access=wayback&strategy=random&URI-
R=http://www.cs.odu.edu/~mln
Programmatic Flow
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
User Requests URI-R Summary
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
Service Relays URI-R to Archive
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
Service queries archive for all mementos
for URI-R
URI-Ms returned to Service
User’s Browser Thumbnails
Service
Memento-Compliant
Archive
Archive returns TimeMap with URI-Ms to
thumbnail service
TM
Service fetches HTML for each
Memento
Thumbnails
Service
Service generates SimHash for
Each Mementos’ HTML
Thumbnails
Service
c39f0abc...b9
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
c770ad1b...b9
Service Calculates
Hamming Distance
Thumbnails
Service
Mementos in summary selected based on
hamming distance
c39f0abc...b9
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
c770ad1b...b9
Hd()
2
1
7
0
Preliminary UI returned to user
User’s Browser Thumbnails
Service
Templated HTML interface is returned to
user with placeholders for thumbnails
HTML
interface
c39d0abc...c9
c39d0abc...b9
c770ad1b...b9
2
1
0
Service Generates Thumbnails for
Mementos in Summary
Thumbnails
Service
c39f0abc...b9
c770ad1b...b9
Hd()
7
Thumbnails Served to User
User’s Browser Thumbnails
Service
Asynchronous polling from HTML pages
populates placeholder images once available
HTML
interface
Core Implementation
•
• for thumbnail generation
• abstractions preserved for code reuse
and extensibility
• Code documented to facilitate extensibility,
usage, and fixes
http://github.com/machawk1/ArchiveThumbnails
Initializing the service
$ npm install
$ node alSummarization.js
* Local resource (css, js,etc.) server
listening on Port 1338...
* Thumbnails service started on Port 15421
> Try localhost:15421/?URI-
R=http://matkelly.com in your web browser
for sample execution.
User/Service Administrator simply enters:
Service responds and is ready for query:
Online vs. Offline Generation
• Online Thumbnail Summarization
– Fetch each mementos’ HTML
– Calculate SimHashes
– Calculate Hamming Distance (HD)
– Select Mementos That Pass HD threshold
– Generate Thumbnails of Mementos
• Offline Thumbnail Summarization
– All of the above performed a priori
– Data potentially updated on access
Adaptive Strategies
• Very large TimeMaps are temporally
expensive to generate
• Default behavior:
if(timeRequirement == tooLong):
use(naiveStrategy)
• User can explicitly override behavior
Other Summarization Strategies
• Random Selection
– k mementos, uniform selection
• Interval
– every mth memento, m = n/k
• Temporal Interval
– One memento/year, reverse chronological
monthly back-fill
• Temporally Uniform Trimming when k > 15
Grid View
AlSummarization vs Random
Dr. Nelson’s Homepage
Random Strategy
Dr. Nelson’s Homepage
AlSummarization Strategy
Grid View
AlSummarization vs Interval
Dr. Nelson’s Homepage
Interval Strategy
Dr. Nelson’s Homepage
AlSummarization Strategy
Grid View
AlSummarization vs Temporal Interval
Dr. Nelson’s Homepage
Temporal Interval Strategy
Dr. Nelson’s Homepage
AlSummarization Strategy
Asynchronous Polling
Server-side SimHash Caching
Four Summarization Strategies
OpenWayback Integration
Service Embedding
• <object data=http://service/http://yoururl.com”
type=“text/html”>
</object>
-or-
• <iframe src=“http://service/http://yoururl.com”>
</iframe>
Visualizing Digital Collections
of Web Archives
• Codebase:
– github.com/machawk1/ArchiveThumbnails
• Service URI:
– http://wsdl-docker.cs.odu.edu:15421
Live
Demo

Mais conteúdo relacionado

Mais procurados

Carpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncCarpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSync
nisohq
 
ResourceSync Overview
ResourceSync OverviewResourceSync Overview
ResourceSync Overview
Herbert Van de Sompel
 
Tune-up electronic resources in Alma for better discovery_May_06_2016
Tune-up electronic resources in Alma for better discovery_May_06_2016Tune-up electronic resources in Alma for better discovery_May_06_2016
Tune-up electronic resources in Alma for better discovery_May_06_2016
mahongzn
 

Mais procurados (11)

Carpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSyncCarpenter - Wolfram Data Summit ResourceSync
Carpenter - Wolfram Data Summit ResourceSync
 
ResourceSync: Web-Based Resource Synchronization
ResourceSync: Web-Based Resource SynchronizationResourceSync: Web-Based Resource Synchronization
ResourceSync: Web-Based Resource Synchronization
 
NISO ResourceSync Training Session
NISO ResourceSync Training SessionNISO ResourceSync Training Session
NISO ResourceSync Training Session
 
Today's forecast for your campus: BLUEcloud
 Today's forecast for your campus: BLUEcloud Today's forecast for your campus: BLUEcloud
Today's forecast for your campus: BLUEcloud
 
ResourceSync: Web-based Resource Synchronization
ResourceSync: Web-based Resource SynchronizationResourceSync: Web-based Resource Synchronization
ResourceSync: Web-based Resource Synchronization
 
Publishing "5 star" data: the case for RDF
Publishing "5 star" data: the case for RDFPublishing "5 star" data: the case for RDF
Publishing "5 star" data: the case for RDF
 
ResourceSync Overview
ResourceSync OverviewResourceSync Overview
ResourceSync Overview
 
Tune-up electronic resources in Alma for better discovery_May_06_2016
Tune-up electronic resources in Alma for better discovery_May_06_2016Tune-up electronic resources in Alma for better discovery_May_06_2016
Tune-up electronic resources in Alma for better discovery_May_06_2016
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited Lecture
 
OpenGLAM: LOD and American Art
OpenGLAM: LOD and American ArtOpenGLAM: LOD and American Art
OpenGLAM: LOD and American Art
 
Intro cOMPUTERS
Intro cOMPUTERSIntro cOMPUTERS
Intro cOMPUTERS
 

Semelhante a Visualizing Digital Collections of Web Archives from Columbia Web Archiving Collaboration Conference

To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
Tony Erwin
 
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
Tony Erwin
 
Web technologies course, an introduction
Web technologies course, an introductionWeb technologies course, an introduction
Web technologies course, an introduction
Piero Fraternali
 
TrueReusableCode-BigDataCodeCamp2016
TrueReusableCode-BigDataCodeCamp2016TrueReusableCode-BigDataCodeCamp2016
TrueReusableCode-BigDataCodeCamp2016
Eduard Lazar
 

Semelhante a Visualizing Digital Collections of Web Archives from Columbia Web Archiving Collaboration Conference (20)

Architecture Patterns - Open Discussion
Architecture Patterns - Open DiscussionArchitecture Patterns - Open Discussion
Architecture Patterns - Open Discussion
 
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
 
JavaOne: Efficiently building and deploying microservices
JavaOne: Efficiently building and deploying microservicesJavaOne: Efficiently building and deploying microservices
JavaOne: Efficiently building and deploying microservices
 
Tech talk-live-alfresco-drupal
Tech talk-live-alfresco-drupalTech talk-live-alfresco-drupal
Tech talk-live-alfresco-drupal
 
Cloud Services Powered by IBM SoftLayer and NetflixOSS
Cloud Services Powered by IBM SoftLayer and NetflixOSSCloud Services Powered by IBM SoftLayer and NetflixOSS
Cloud Services Powered by IBM SoftLayer and NetflixOSS
 
How Responsive Do You Want Your Website?
How Responsive Do You Want Your Website?How Responsive Do You Want Your Website?
How Responsive Do You Want Your Website?
 
Guide to Application Performance: Planning to Continued Optimization
Guide to Application Performance: Planning to Continued OptimizationGuide to Application Performance: Planning to Continued Optimization
Guide to Application Performance: Planning to Continued Optimization
 
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
 
Monitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusMonitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with Prometheus
 
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservic...
 
Web technologies course, an introduction
Web technologies course, an introductionWeb technologies course, an introduction
Web technologies course, an introduction
 
HTML5 on Mobile(For Designer)
HTML5 on Mobile(For Designer)HTML5 on Mobile(For Designer)
HTML5 on Mobile(For Designer)
 
Membase Introduction
Membase IntroductionMembase Introduction
Membase Introduction
 
e-Learning Delivery System : The Challenges
e-Learning Delivery System : The Challengese-Learning Delivery System : The Challenges
e-Learning Delivery System : The Challenges
 
Salesforce Performance hacks - Client Side
Salesforce Performance hacks - Client SideSalesforce Performance hacks - Client Side
Salesforce Performance hacks - Client Side
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
 
4. Web programming MVC.pptx
4. Web programming  MVC.pptx4. Web programming  MVC.pptx
4. Web programming MVC.pptx
 
2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix2014 09-12 lambda-architecture-at-indix
2014 09-12 lambda-architecture-at-indix
 
TrueReusableCode-BigDataCodeCamp2016
TrueReusableCode-BigDataCodeCamp2016TrueReusableCode-BigDataCodeCamp2016
TrueReusableCode-BigDataCodeCamp2016
 
Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016
 

Mais de Mat Kelly

Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
Mat Kelly
 
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Mat Kelly
 
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMaking Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Mat Kelly
 
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
Mat Kelly
 
NDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link RestorationNDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link Restoration
Mat Kelly
 
NDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive FacebookNDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive Facebook
Mat Kelly
 

Mais de Mat Kelly (20)

Aggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkAggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity Framework
 
Client-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderClient-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer Header
 
A Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesA Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web Archives
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 
Exploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesExploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web Archives
 
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
 
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryFacilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
 
Slides
SlidesSlides
Slides
 
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
 
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013
 
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemIEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
 
Digital Preservation 2013
Digital Preservation 2013Digital Preservation 2013
Digital Preservation 2013
 
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMaking Enterprise-Level Archive Tools Accessible for Personal Web Archiving
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
 
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...An Extensible Framework for Creating Personal Web Archives of Content Behind ...
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
 
The Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedThe Revolution Will Not Be Archived
The Revolution Will Not Be Archived
 
WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
WARCreate - Create Wayback-Consumable WARC Files from Any WebpageWARCreate - Create Wayback-Consumable WARC Files from Any Webpage
WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
 
NDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link RestorationNDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - YouTube Link Restoration
 
NDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive FacebookNDIIPP/NDSA 2011 - Archive Facebook
NDIIPP/NDSA 2011 - Archive Facebook
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Visualizing Digital Collections of Web Archives from Columbia Web Archiving Collaboration Conference

  • 1. Visualizing Digital Collections of Web Archives Mat Kelly, Michael L. Nelson, Michele C. Weigle Old Dominion University Web Archiving Collaboration: New Tools and Models Columbia University, New York, NY June 4, 2015 @machawk1http://ws-dl.cs.odu.edu
  • 2. Motivation for Thumbnail Summarization • Change over time - aboutness
  • 3. Apple.com has > 17k mementos
  • 5. Methods of Summarization • Including all mementos – many redundant thumbnails – temporally/spatially/cognitively expensive • Naively excluding images – missing important captures in summary • Compare image thumbnails – temporally expensive for identifying unique thumbnails Comparing mementos’ markup can identify sufficiently unique mementos
  • 6. Analyzing Markup <title>Apple</title> <meta property="analytics- track" content="Apple - Index/Tab" /> <meta property="analytics-s- channel" content="homepage" /> <meta property="analytics-s- bucket-0" content="appleglobal,applehome" /> <meta property="analytics-s- bucket-1" content="apple{COUNTRY_CODE}gl obal,apple{COUNTRY_CODE}home" /> 8664ee964799c38c156d8f0 39dae8330 apple.com at Mar 17, 2008 HTML for memento SimHash for HTML
  • 7. SimHash? HTML snippet for memento First k characters of markup Second k characters of markup 64th k characters of markup 63rd k characters of markup markup length 64 k = … Hash to a character Hash to a character Hash to a character Hash to a character c 3 9 f …
  • 8. SimHash vs. Other Hashes • md5(“aaaaaaaaaaaaaaa”) 12f9cf6998d52dbe773b06f848bb3608 • md5(“aaaaaaabaaaaaaa”) e984cee68697eb77577717b532171493 • simhash(“aaaaaaaaaaaaaaa”) 8664ee964799c38c156d8f039dae8330 • simhash(“aaaaaaabaaaaaaa”) 8664ee964799a48c156d8f039dae8330
  • 9. Why SimHash? • SimHash identifies similarities between documents • Conventional hashing algorithms are for identifying differences – Drastically different output from similar content • To remove redundancies, we want to detect when temporally adjacent mementos are sufficiently dissimilar
  • 10. SimHashes for Mementos HTML of apple.com March 3, 2008 HTML of apple.com March 5, 2008 HTML of apple.com April 12, 2008 HTML of apple.com October 4, 2008 c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9
  • 11. Identifying Similarity by Calculating Hamming Distance HTML of apple.com March 3, 2008 HTML of apple.com March 5, 2008 HTML of apple.com April 12, 2008 HTML of apple.com October 4, 2008 c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 HAMMING DISTANCE 2 1 7 N/A pivot
  • 12.
  • 13. Sliding Hamming Distance • Selection based on previously selected memento • Sliding pivot ΔM3 ΔM3 ΔM3 ΔM6 ΔM6 ΔM6 ΔM6 ΔM0 ΔM0 ΔM0
  • 14. Project Goals Develop tools that implement thumbnail summarization for TimeMaps • Web Service – Allows anyone to view TimeMap using thumbnail summarization • Wayback add-on – Allows any archive using wayback to provide this service to users • Embeddable version – Allow web page authors to embed overview of past versions of page on live web page
  • 15. AlSummarization • SimHash-based summarization scheme created by Ahmed AlSum • AlSum + Summarization = AlSummarization A. AlSum, and M. L. Nelson. “Thumbnail Summarization Techniques for Web Archives.” In Proceedings of the 36TH European Conference on Information Retrieval, ECIR 2014, 2014.
  • 16. Dr. Nelson’s Homepage • URI-R: http://www.cs.odu.edu/~mln • Append onto service URI for summary – http://service/http://www.cs.odu.edu/~mln
  • 17. Anatomy of the Visualization 3 presentations of the Summary Temporally sorted mementos Memento metadata
  • 18. Additional (optional) Endpoint Parameters • Access – tailors user interface – Interactive, Embed, Wayback • Strategy – to use alternative summarization – alSummarization, yearly, skipListed, random • http://service/? o access=wayback&URI- R=http://www.cs.odu.edu/~mln o access=wayback&strategy=random&URI- R=http://www.cs.odu.edu/~mln
  • 19. Programmatic Flow User’s Browser Thumbnails Service Memento-Compliant Archive
  • 20. User Requests URI-R Summary User’s Browser Thumbnails Service Memento-Compliant Archive
  • 21. Service Relays URI-R to Archive User’s Browser Thumbnails Service Memento-Compliant Archive Service queries archive for all mementos for URI-R
  • 22. URI-Ms returned to Service User’s Browser Thumbnails Service Memento-Compliant Archive Archive returns TimeMap with URI-Ms to thumbnail service TM
  • 23. Service fetches HTML for each Memento Thumbnails Service
  • 24. Service generates SimHash for Each Mementos’ HTML Thumbnails Service c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 c770ad1b...b9
  • 25. Service Calculates Hamming Distance Thumbnails Service Mementos in summary selected based on hamming distance c39f0abc...b9 c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 c770ad1b...b9 Hd() 2 1 7 0
  • 26. Preliminary UI returned to user User’s Browser Thumbnails Service Templated HTML interface is returned to user with placeholders for thumbnails HTML interface
  • 27. c39d0abc...c9 c39d0abc...b9 c770ad1b...b9 2 1 0 Service Generates Thumbnails for Mementos in Summary Thumbnails Service c39f0abc...b9 c770ad1b...b9 Hd() 7
  • 28. Thumbnails Served to User User’s Browser Thumbnails Service Asynchronous polling from HTML pages populates placeholder images once available HTML interface
  • 29. Core Implementation • • for thumbnail generation • abstractions preserved for code reuse and extensibility • Code documented to facilitate extensibility, usage, and fixes http://github.com/machawk1/ArchiveThumbnails
  • 30. Initializing the service $ npm install $ node alSummarization.js * Local resource (css, js,etc.) server listening on Port 1338... * Thumbnails service started on Port 15421 > Try localhost:15421/?URI- R=http://matkelly.com in your web browser for sample execution. User/Service Administrator simply enters: Service responds and is ready for query:
  • 31. Online vs. Offline Generation • Online Thumbnail Summarization – Fetch each mementos’ HTML – Calculate SimHashes – Calculate Hamming Distance (HD) – Select Mementos That Pass HD threshold – Generate Thumbnails of Mementos • Offline Thumbnail Summarization – All of the above performed a priori – Data potentially updated on access
  • 32. Adaptive Strategies • Very large TimeMaps are temporally expensive to generate • Default behavior: if(timeRequirement == tooLong): use(naiveStrategy) • User can explicitly override behavior
  • 33. Other Summarization Strategies • Random Selection – k mementos, uniform selection • Interval – every mth memento, m = n/k • Temporal Interval – One memento/year, reverse chronological monthly back-fill • Temporally Uniform Trimming when k > 15
  • 34. Grid View AlSummarization vs Random Dr. Nelson’s Homepage Random Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  • 35. Grid View AlSummarization vs Interval Dr. Nelson’s Homepage Interval Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  • 36. Grid View AlSummarization vs Temporal Interval Dr. Nelson’s Homepage Temporal Interval Strategy Dr. Nelson’s Homepage AlSummarization Strategy
  • 41. Service Embedding • <object data=http://service/http://yoururl.com” type=“text/html”> </object> -or- • <iframe src=“http://service/http://yoururl.com”> </iframe>
  • 42. Visualizing Digital Collections of Web Archives • Codebase: – github.com/machawk1/ArchiveThumbnails • Service URI: – http://wsdl-docker.cs.odu.edu:15421