Thumbnail Summarization
Techniques For Web Archives
Ahmed AlSum*
Stanford University Libraries
Stanford CA, USA
aalsum@stanford.edu
Michael L. Nelson
Old Dominion University
Norfolk VA, USA
mln@cs.odu.edu
The 36th European Conference on Information Retrieval.
ECIR 2014, Amsterdam, Netherlands, 2014
* The research was conducted while Ahmed AlSum was at Old Dominion University
ECIR 2014 Amsterdam, Netherlands
What is a Web Archive?
http://www.cs.odu.edu
Memento Terminology
• Original Resource (URI-R, R): http://www.amazon.com
• Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com
• TimeMap (URI-T, TM)
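The URI-M shown on this slide embeds the capture time as a 14-digit timestamp followed by the original URI. The following is a minimal sketch of splitting such a URI-M into its Memento datetime and URI-R. The function name `parse_uri_m` and the regular expression are illustrative assumptions; other archives may use different URI patterns.

```python
import re
from datetime import datetime

# Assumed pattern: http://web.archive.org/web/<YYYYMMDDhhmmss>/<URI-R>
URI_M_PATTERN = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")

def parse_uri_m(uri_m):
    """Return (memento_datetime, uri_r) for a Wayback-style URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError("not a Wayback-style URI-M: " + uri_m)
    timestamp, uri_r = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), uri_r

captured, original = parse_uri_m(
    "http://web.archive.org/web/20110411070244/http://amazon.com"
)
print(captured.isoformat(), original)  # 2011-04-11T07:02:44 http://amazon.com
```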
Thumbnails in Web Archive
Internet Archive, UK Web Archive
Thumbnail Creation Challenges
• Scalability in time
  • The Internet Archive (IA) may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in space
  • IA would need 355 TB to store one thumbnail per memento.
• Page quality
Thumbnail Usage Challenges
• This is a partial view of the first 700 thumbnails out of the 10,500 available mementos for www.apple.com
From 10,500 Mementos to 69 Thumbnails.
How many thumbnails do we need?
www.unfi.com on the live Web
40 Thumbnails are good.
METHODOLOGY
Visual Similarity and Text Similarity
[Figure: example page pairs labeled "Similar" and "Different"; panel label: HTML Text]
Correlation between Visual Similarity and Text Similarity
• Text Similarity
  • SimHash
  • DOM Tree
  • Embedded resources
  • Memento Datetime (capture time)
• Visual Similarity
  • Number of different pixels
Text Similarity: SimHash
• Compute 64-bit SimHash fingerprints with k = 4 for the two pages, then calculate the distance between them using the Hamming distance.
Example:
• 12 Sep 2012 - 00:12:27, SimHash: 147EDAA9977E9400
• 12 Sep 2012 - 19:54:05, SimHash: 157EFAAC97189100
• Distance: 12 bits
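The SimHash comparison can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that k = 4 means 4-token shingles and uses MD5 (truncated to 64 bits) as the per-shingle hash, neither of which the slide specifies. The names `shingles`, `simhash64`, and `hamming_distance` are hypothetical.

```python
import hashlib

def shingles(tokens, k=4):
    """Return overlapping k-token shingles (k is assumed to be the shingle size)."""
    if len(tokens) < k:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def simhash64(text, k=4):
    """64-bit SimHash over k-token shingles; MD5 is an assumed per-shingle hash."""
    counts = [0] * 64
    for shingle in shingles(text.split(), k):
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # A fingerprint bit is set when most shingle hashes had that bit set.
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

first = int("147EDAA9977E9400", 16)
second = int("157EFAAC97189100", 16)
slide_distance = hamming_distance(first, second)
print(slide_distance)
```

Applied to the two fingerprints exactly as transcribed on the slide, this function counts 10 differing bits rather than the 12 reported there, so the printed hex strings may not be the exact values behind that figure.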
Text Similarity: DOM Tree
• Convert each web page to a DOM tree.
• Calculate the difference using the Levenshtein distance.
  • Levenshtein distance: the number of insert, update, and delete operations.
Pawlik, M., & Augsten, N. (2011). RTED: a robust algorithm for the tree edit distance. Proceedings of the VLDB Endowment, 5(4), 334–345.
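As a rough illustration of the DOM comparison, the sketch below flattens each page into its sequence of opening tags and computes a Levenshtein distance over those sequences. This is a deliberate simplification: it is not the RTED tree edit distance cited above, and the names `TagSequence`, `tag_sequence`, and `levenshtein` are hypothetical.

```python
from html.parser import HTMLParser

class TagSequence(HTMLParser):
    """Collect opening-tag names in document order (a flattened view of the DOM)."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags

def levenshtein(a, b):
    """Number of insert, update, and delete operations turning sequence a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # update or keep
        previous = current
    return previous[-1]

old = "<html><body><div><p>Hi</p></div></body></html>"
new = "<html><body><div><p>Hi</p><img src='a.png'></div><span>x</span></body></html>"
dom_distance = levenshtein(tag_sequence(old), tag_sequence(new))
print(dom_distance)  # 2: the new page adds an <img> and a <span>
```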
Text Similarity: Embedded resources
• Extract the embedded resources from each page.
• Calculate the total number of new resources that have been added and the resources that have been removed.
Example (12 Sep 2012 - 00:12:27 vs. 12 Sep 2012 - 19:54:05):
            Addition   Removal
  Total     4          11
  Images    1          9
  JS        1          0
  CSS       2          2
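The addition and removal counts can be sketched with set differences over the resources found in each page. This is a minimal sketch under stated assumptions: only `img`, `script`, and stylesheet `link` tags are treated as embedded resources, and the names `ResourceCollector`, `embedded_resources`, and `resource_changes` are hypothetical.

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect URLs of embedded images, scripts, and stylesheets."""
    def __init__(self):
        super().__init__()
        self.resources = {"Images": set(), "JS": set(), "CSS": set()}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower()
        if tag == "img" and attrs.get("src"):
            self.resources["Images"].add(attrs["src"])
        elif tag == "script" and attrs.get("src"):
            self.resources["JS"].add(attrs["src"])
        elif tag == "link" and attrs.get("href") and "stylesheet" in rel:
            self.resources["CSS"].add(attrs["href"])

def embedded_resources(html):
    collector = ResourceCollector()
    collector.feed(html)
    collector.close()
    return collector.resources

def resource_changes(old_html, new_html):
    """Return {type: (additions, removals)} plus a 'Total' entry."""
    old, new = embedded_resources(old_html), embedded_resources(new_html)
    changes = {kind: (len(new[kind] - old[kind]), len(old[kind] - new[kind]))
               for kind in old}
    changes["Total"] = (sum(added for added, _ in changes.values()),
                        sum(removed for _, removed in changes.values()))
    return changes

old_page = '<img src="a.png"><img src="b.png"><link rel="stylesheet" href="s.css">'
new_page = '<img src="a.png"><script src="app.js"></script><link rel="stylesheet" href="s.css">'
changes = resource_changes(old_page, new_page)
print(changes)
```

In this toy pair, one image is removed and one script is added, so the total is one addition and one removal.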
Text Similarity: Memento datetime
• Calculate the difference, in seconds, between the capture times of the two pages.
Example: 12 Sep 2012 - 00:12:27 vs. 12 Sep 2012 - 19:54:05; difference: 70942 sec
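The capture-time difference is a plain subtraction of two datetimes. A minimal sketch, with `capture_gap_seconds` as a hypothetical helper name:

```python
from datetime import datetime

def capture_gap_seconds(first, second):
    """Absolute difference between two capture times, in whole seconds."""
    return abs(int((second - first).total_seconds()))

first_capture = datetime(2012, 9, 12, 0, 12, 27)
second_capture = datetime(2012, 9, 12, 19, 54, 5)
gap = capture_gap_seconds(first_capture, second_capture)
print(gap)
```

With the two timestamps exactly as displayed on the slide, this yields 70898 seconds, slightly less than the 70942 sec shown there, so the displayed times may not be the exact values behind that figure.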
Visual Similarity
• Count the number of different pixels between two thumbnails. We resize the thumbnails to several dimensions (e.g., 64x64 and 128x128) and calculate the Manhattan distance between each pair.
Example: 12 Sep 2012 - 00:12:27 vs. 12 Sep 2012 - 19:54:05; distance: 0.65
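A normalized Manhattan distance between two thumbnails can be sketched as below. The sketch assumes the thumbnails are already resized to the same dimensions and converted to grayscale rows of intensities; resizing itself would need an image library and is not shown. The normalization to the range 0 to 1 is an assumption, as is the name `manhattan_distance`.

```python
def manhattan_distance(thumb_a, thumb_b, max_value=255):
    """Normalized Manhattan (L1) distance between two equally sized thumbnails.

    Each thumbnail is a list of rows of pixel intensities in [0, max_value].
    Returns 0.0 for identical images and 1.0 for maximally different ones.
    """
    same_shape = len(thumb_a) == len(thumb_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(thumb_a, thumb_b))
    pixels = sum(len(row) for row in thumb_a)
    if not same_shape or pixels == 0:
        raise ValueError("thumbnails must be non-empty and have the same dimensions")
    total = sum(abs(pixel_a - pixel_b)
                for row_a, row_b in zip(thumb_a, thumb_b)
                for pixel_a, pixel_b in zip(row_a, row_b))
    return total / (max_value * pixels)

a = [[0, 0], [255, 255]]
b = [[0, 255], [255, 0]]
visual_distance = manhattan_distance(a, b)
print(visual_distance)  # 0.5: two of the four pixels are fully different
```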
EXPERIMENT
Calculate the correlation between Visual Similarity and
Text Similarity
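The slide does not say which correlation coefficient was used. As one possibility, the sketch below computes the Pearson correlation between a text-similarity feature and the visual distance for a set of memento pairs. The function name `pearson` is hypothetical and the sample values are made up for illustration; they are not results from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(spread_x * spread_y)

# Made-up values for illustration only (one entry per memento pair).
simhash_distances = [2, 5, 9, 14, 20]
visual_distances = [0.05, 0.10, 0.30, 0.45, 0.70]
print(round(pearson(simhash_distances, visual_distances), 3))
```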
Dataset: Fortune 500
• 499,540 mementos from 488 TimeMaps.
• For each memento, we download the HTML and capture the thumbnail using PhantomJS.
Correlation between Visual Similarity and Text Similarity
[Figure: four panels: SimHash, DOM tree, Embedded resources, Memento Datetime]
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
SELECTION ALGORITHMS
Using text similarity features to predict the visual
similarity.
#1: Threshold Grouping
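The slide gives only the name of this technique. One plausible reading, sketched below purely as an assumption, is to walk the TimeMap in capture order and start a new group whenever a memento's SimHash distance from the most recently selected memento exceeds a threshold. The function names and the threshold value are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def threshold_grouping(fingerprints, threshold):
    """Return indices of the mementos that start a new group.

    fingerprints: SimHash values in capture order.
    A memento starts a new group when its distance from the most recently
    selected memento exceeds the threshold; otherwise it joins that group.
    """
    selected = []
    for index, fingerprint in enumerate(fingerprints):
        if not selected:
            selected.append(index)
        elif hamming_distance(fingerprints[selected[-1]], fingerprint) > threshold:
            selected.append(index)
    return selected

# Toy fingerprints (assumed values, not from the dataset).
timemap = [0b0000, 0b0001, 0b0111, 0b0110, 0b11110000]
groups = threshold_grouping(timemap, threshold=1)
print(groups)  # [0, 2, 4]
```

Because each memento is compared only with the last selected one, a sketch like this can process new captures as they arrive.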
#2: Clustering technique
• Input:
  • A TimeMap with n mementos
  • A set of features, for example F = {SimHash, Memento-Datetime}
• Task:
  • Cluster the n mementos into K clusters.
#2: Clustering technique
[Figure: two panels: SimHash Feature; SimHash and Datetime Features]
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
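The clustering step can be sketched with a small K-medoids loop over SimHash fingerprints. This is a simplified illustration, not the exact algorithm of Park and Jun: initial medoids are simply the first K items, the only feature is SimHash with Hamming distance, and the names `k_medoids` and `hamming_distance` are hypothetical. A different `distance` function could combine several features.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def k_medoids(items, k, distance, max_iterations=100):
    """Cluster items into k groups; return (medoid_indices, assignment)."""
    if not 1 <= k <= len(items):
        raise ValueError("k must be between 1 and the number of items")

    def assign(medoids):
        # Attach every item to its nearest medoid.
        return [min(range(k), key=lambda c: distance(items[i], items[medoids[c]]))
                for i in range(len(items))]

    medoids = list(range(k))  # simplification: the first k items
    for _ in range(max_iterations):
        assignment = assign(medoids)
        new_medoids = []
        for cluster in range(k):
            members = [i for i, c in enumerate(assignment) if c == cluster]
            if not members:  # keep the old medoid for an empty cluster
                new_medoids.append(medoids[cluster])
                continue
            # The new medoid minimizes total distance to the cluster's members.
            new_medoids.append(min(
                members,
                key=lambda m: sum(distance(items[m], items[o]) for o in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)

# Toy fingerprints (assumed values, not from the dataset).
fingerprints = [0b0000, 0b11110000, 0b0001, 0b0011, 0b11110001, 0b11110011]
medoids, assignment = k_medoids(fingerprints, k=2, distance=hamming_distance)
print(medoids, assignment)  # [2, 4] [0, 1, 0, 0, 1, 1]
```

One thumbnail per cluster, for example the medoid, could then represent the group. The result depends on the initial medoids, which is one reason a more careful initialization is usually preferred.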
#3: Time Normalization
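The slide gives only the name of this technique. One plausible reading, sketched below purely as an assumption, is to use the Memento datetime alone: split the time span covered by the TimeMap into equal slots and keep the first memento in each non-empty slot. The function name `select_by_time_slots` and the slot rule are hypothetical.

```python
from datetime import datetime

def select_by_time_slots(capture_times, k):
    """Keep the first memento in each of k equal-width time slots.

    capture_times: datetimes sorted in ascending order.
    Returns the indices of the selected mementos.
    """
    if k < 1 or not capture_times:
        return []
    start = capture_times[0]
    span = (capture_times[-1] - start).total_seconds()
    selected, seen_slots = [], set()
    for index, moment in enumerate(capture_times):
        if span == 0:
            slot = 0
        else:
            offset = (moment - start).total_seconds()
            slot = min(int(offset / span * k), k - 1)
        if slot not in seen_slots:
            seen_slots.add(slot)
            selected.append(index)
    return selected

# Toy capture times (assumed values, not from the dataset).
times = [datetime(2012, 1, 1), datetime(2012, 1, 2), datetime(2012, 8, 1),
         datetime(2012, 8, 2), datetime(2012, 12, 31)]
picked = select_by_time_slots(times, k=2)
print(picked)  # [0, 2]
```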
Selection Algorithms Comparison
                         Threshold Grouping   K clustering   Time Normalization
TimeMap Reduction        27%                  9% to 12%      23%
Image Loss               28                   78 - 101       109
# Features               1 feature            1 or more      1 feature
Preprocessing required   Yes                  Yes            No
Efficient processing     Medium               Extensive      Light
Incremental              Yes                  No             Yes
Online/offline           Both                 Both           Both
Generalization outside the Web Archive
• Summarize a website of n pages with only k thumbnails
Conclusions
• We explored the similarity between the text and the visual appearance of web pages.
• We found that the SimHash difference between HTML text and the Levenshtein distance between HTML DOM trees have the highest correlation with visual similarity.
• We presented three algorithms to select k thumbnails from the n mementos in each TimeMap.
Contact: aalsum@stanford.edu, @aalsum


Editor's Notes

  1. Verbally show this is the end. Explain that this is an initial step in this area.