Browsing and Recomposition Policies to Minimize Temporal Error When Utilizing Web Archives
1. BROWSING AND
RECOMPOSITION POLICIES
TO MINIMIZE TEMPORAL
ERROR WHEN UTILIZING
WEB ARCHIVES
SCOTT G. AINSWORTH
OLD DOMINION UNIVERSITY
COMPUTER SCIENCE
JCDL 2013
JULY 23-25, 2013
INDIANAPOLIS, INDIANA USA
10. Joint Conference on Digital Libraries (JCDL) 2013
QUESTIONS
• How much temporal drift do users experience?
• How much temporal spread exists in composite
mementos?
• How can drift and spread be minimized?
• What factors contribute, positively or
negatively, to drift and spread?
• Does combining multiple archives produce
better results?
• Would users with differing goals benefit from
different minimization policies and heuristics?
• How can temporal coherence be displayed to
users—simply?
7/23/13 Scott G. Ainsworth • Michael L. Nelson
12. Joint Conference on Digital Libraries (JCDL) 2013
RELATED WORK
Web Crawling for Search Engines
• Douglis – Change rates
• Cho – Optimal crawling strategies, change rates,
Web evolution
Web Archiving
• Masanés – Web Archiving: Issues and Methods
• Jaffe & Kirkpatrick – Internet Archive architecture
• Moore et al. – Heritrix crawler
13. Joint Conference on Digital Libraries (JCDL) 2013
RELATED WORK
Control Crawl Data Quality, Future collections
• Spaniol et al. – crawling strategy
• Denev et al. – change rates by MIME type and
depth
• Ben Saad et al. – metadata from crawl used to
select best results from archive
Our Focus: Existing Data Quality
• Existing collections
• Datetime selection policies
14. Joint Conference on Digital Libraries (JCDL) 2013
RELATED WORK
Use Patterns
• AlNoamany et al. – Archive Access Patterns
  • Humans vs. Robots
  • Dip, dive, slide, & skim
Identifying Duplicates
• Simple identity – images, other binary formats
  • Direct comparison
  • Hash comparison
• HTML, CSS (text)
  • Shingling, Jaccard distances, etc.
  • SimHash ← most promising
15. Joint Conference on Digital Libraries (JCDL) 2013
RELATED WORK – MEMENTO*
• HTTP extension for datetime negotiation
Request
GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
…
Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
…
Response
HTTP/1.1 200 OK
…
Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
…
*https://datatracker.ietf.org/doc/draft-vandesompel-memento/
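The header exchange on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a Memento client: it only builds the Accept-Datetime request header and parses a Memento-Datetime response header, with no network access. The helper names accept_datetime_header and memento_datetime are made up for this sketch, and the response header is copied from the slide rather than fetched from an archive.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> dict:
    """Build the request header a client would send to a TimeGate (hypothetical helper)."""
    # format_datetime emits an HTTP-style date and derives the weekday itself.
    return {"Accept-Datetime": format_datetime(target.astimezone(timezone.utc), usegmt=True)}


def memento_datetime(headers: dict) -> datetime:
    """Parse the datetime of the memento the archive actually returned (hypothetical helper)."""
    return parsedate_to_datetime(headers["Memento-Datetime"])


# Datetime the user asks for (from the slide's request).
target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
request_headers = accept_datetime_header(target)

# Response header copied from the slide; a real client would read it from the HTTP reply.
response_headers = {"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"}
returned = memento_datetime(response_headers)

# The gap between what was requested and what was returned.
gap = returned - target
print(request_headers["Accept-Datetime"])
print("requested vs. returned gap:", gap)
```

The gap printed at the end is the quantity the rest of the talk is concerned with: the archive returns the nearest memento it holds, not necessarily one at the requested datetime.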
17. Joint Conference on Digital Libraries (JCDL) 2013
HOW MUCH IS ARCHIVED?
Source: JCDL '11
35 – 90%   At least one archived copy
17 – 49%   2 – 5 copies
1 – 8%     6 – 10 copies
8 – 63%    > 10 copies
Chart legend: Internet Archive, Search Engine, Other
32. Joint Conference on Digital Libraries (JCDL) 2013
EMBEDDED RESOURCES
Resource                          Memento-Datetime      Delta
http://www.cs.odu.edu             2005-05-14 01:36:08
mm_menu.js                        2005-05-23 02:39:12   9.0 d
style.css                         2005-05-23 02:39:39   9.0 d
gfx-logo-odu-crown.gif            2005-05-23 02:39:39   9.0 d
ddmenu_ddown.js                   2005-05-23 02:39:43   9.0 d
university.js                     2005-05-23 02:39:56   9.0 d
rmenu_1st_about.png               2005-06-01 13:40:25   18.5 d
rmenu_bottom_229.gif              2005-06-01 14:07:29   18.5 d
shadow-bl.gif                     2005-06-01 14:55:53   18.6 d
ecsbdg.jpg                        2005-06-01 14:56:17   18.6 d
shadow-br.gif                     2005-06-01 15:18:18   18.6 d
gfx-btn-go-dblue.gif              2005-06-01 15:34:19   18.6 d
shadow-tr.gif                     2005-06-01 15:55:57   18.6 d
header-right1.gif                 2005-06-01 16:06:16   18.6 d
spacer.gif                        2005-06-01 16:23:10   18.6 d
jimcheng.gif                      2005-06-01 16:37:39   18.6 d
jsmith.gif                        2005-06-01 16:58:50   18.6 d
rmenu_1st_featured_alumni.png     2005-06-01 21:21:45   18.8 d
hmenu_college_...-new.png         2005-12-21 20:14:25   7.3 mo
rmenu_1st_upcoming_news.png       2005-12-21 20:15:14   7.3 mo
rmenu_1st_upcoming_events.png     2005-12-21 21:01:12   7.3 mo
lmenu_1st_resources.png           2005-12-28 17:47:41   7.5 mo
bullet_blue_triangle.gif          2005-12-28 19:43:48   7.5 mo
logo-cs.gif                       2005-12-28 19:54:29   7.5 mo
rmenu_1st_featured_student.png    2007-06-12 02:36:07   2.1 years
shadow-b.gif                      2007-06-21 02:35:17   2.1 years
shadow-r.gif                      404 Not Found

Embedded Resources    26
Mean Delta            125.9 days
Standard Deviation    207.7 days
Spread                2.1 years
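The summary figures can be recomputed from the Memento-Datetimes listed on this slide. The sketch below assumes that delta is the absolute distance from the root, that the mean and standard deviation cover only the 25 embedded resources that have a memento (shadow-r.gif returned 404), that the standard deviation is the population form, and that spread is the distance between the earliest and latest memento in the composite, root included. Those choices are assumptions made here, not statements from the slide.

```python
from datetime import datetime
from statistics import mean, pstdev

FMT = "%Y-%m-%d %H:%M:%S"
root = datetime.strptime("2005-05-14 01:36:08", FMT)

# Memento-Datetimes of the embedded resources, copied from the slide.
embedded = [
    "2005-05-23 02:39:12", "2005-05-23 02:39:39", "2005-05-23 02:39:39",
    "2005-05-23 02:39:43", "2005-05-23 02:39:56", "2005-06-01 13:40:25",
    "2005-06-01 14:07:29", "2005-06-01 14:55:53", "2005-06-01 14:56:17",
    "2005-06-01 15:18:18", "2005-06-01 15:34:19", "2005-06-01 15:55:57",
    "2005-06-01 16:06:16", "2005-06-01 16:23:10", "2005-06-01 16:37:39",
    "2005-06-01 16:58:50", "2005-06-01 21:21:45", "2005-12-21 20:14:25",
    "2005-12-21 20:15:14", "2005-12-21 21:01:12", "2005-12-28 17:47:41",
    "2005-12-28 19:43:48", "2005-12-28 19:54:29", "2007-06-12 02:36:07",
    "2007-06-21 02:35:17",
]
times = [datetime.strptime(s, FMT) for s in embedded]

SECONDS_PER_DAY = 86400
# Delta: absolute distance of each embedded memento from the root, in days.
deltas = [abs((t - root).total_seconds()) / SECONDS_PER_DAY for t in times]

mean_delta = mean(deltas)
std_delta = pstdev(deltas)  # population standard deviation (assumption)

# Spread: earliest to latest memento in the composite, root included.
all_times = times + [root]
spread_days = (max(all_times) - min(all_times)).total_seconds() / SECONDS_PER_DAY
spread_years = spread_days / 365.25

print(f"mean delta: {mean_delta:.1f} days")
print(f"std dev:    {std_delta:.1f} days")
print(f"spread:     {spread_years:.1f} years")
```

Under these assumptions the results land close to the figures in the summary. Because the standard deviation is larger than the mean, the two mementos captured in 2007 dominate both statistics.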
47. Joint Conference on Digital Libraries (JCDL) 2013
FIRST EXPERIMENT
• 1,000 URIs from DMOZ (Open Directory)
• Download all timemaps
• Download all composite mementos
• Download all embedded resources
• Single and Multiple Archives
• Four Heuristics
48. Joint Conference on Digital Libraries (JCDL) 2013
PRELIMINARY RESULTS 1
Count Description Percent
1,000 Root URI-Rs
910 Root timemaps 91%
87,847 Root URI-Ms in timemaps
96.5 URI-Ms per Root URI-R
85,570 Root mementos downloaded 97%
1,488,420 Embedded URI-Rs
17.4 Embedded URI-Rs per Root memento
49. Joint Conference on Digital Libraries (JCDL) 2013
PRELIMINARY RESULTS 2
Description                    Minimize     Minimize     3-Month
                               Distance,    Distance,    Window,
                               Single       Multi-       Multi-
                               Archive      Archive      Archive
Embedded URI-Rs                1,488,440    1,488,420    1,447,351
Embedded URI-Ms in timemaps    1,169,787    1,186,456    500,541
URI-M/Embedded URI-R           0.79         0.80         0.35
% Complete                     73.8%        75.4%        33.8%
Mean spread                    200.2        200.1        15.1
Standard Deviation             219.2        219.9        14.3
53. Joint Conference on Digital Libraries (JCDL) 2013
FUTURE WORK
Timemaps, Redirection, Missing Mementos
• Timemaps only tell part of the story
• URI-R redirection (302 from source)
• URI-M redirection (Archive action)
• Mementos in timemaps but not accessible
• Policies must consider user needs
• Leave it missing
• Show “best” substitute
54. Joint Conference on Digital Libraries (JCDL) 2013
FUTURE WORK
Similarity & Duplication
• Deltas are currently |root – embedded|
• If bracketing mementos are identical,
should delta be zero?
• HTML is usually modified by the archive
• Can’t check for equality
• Shingling? SimHash?
[Figure: timeline with tick marks at –30d, 0, and +30d]
56. Joint Conference on Digital Libraries (JCDL) 2013
FUTURE WORK
Policies & Heuristics
• Drift
• Sliding target
• Sticky target
• Spread
• Minimize distance
• Past only
• Past preferred
• Near or within distance
• Single vs. multi-archive
• Refine to meet user expectations
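The spread policies named on this slide can be sketched as one selection function over the candidate mementos of an embedded resource. The slide gives only the policy names, so the semantics below are assumptions made for illustration: "minimize distance" picks the nearest memento in either direction, "past only" considers only mementos at or before the target, "past preferred" falls back to future mementos when no past one exists, and "within distance" leaves the resource missing when the best candidate is too far away. The function name select and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def select(candidates: Sequence[datetime], target: datetime,
           policy: str = "minimize_distance",
           max_distance: Optional[timedelta] = None) -> Optional[datetime]:
    """Pick one memento datetime for a target datetime, or None to leave it missing."""
    past = [c for c in candidates if c <= target]
    if policy == "minimize_distance":
        pool = list(candidates)
    elif policy == "past_only":
        pool = past
    elif policy == "past_preferred":
        pool = past or list(candidates)  # fall back to future mementos
    else:
        raise ValueError(f"unknown policy: {policy}")
    if not pool:
        return None
    best = min(pool, key=lambda c: abs(c - target))
    # Optional "near or within distance" filter (assumed semantics).
    if max_distance is not None and abs(best - target) > max_distance:
        return None
    return best


# Hypothetical candidate datetimes for one embedded resource.
candidates = [datetime(2005, 5, 1), datetime(2005, 5, 20), datetime(2005, 12, 21)]
target = datetime(2005, 5, 14)

print(select(candidates, target))                                  # nearest in either direction
print(select(candidates, target, policy="past_only"))              # nearest at or before target
print(select(candidates, target, max_distance=timedelta(days=3)))  # None: nothing close enough
```

A distance limit trades completeness for spread: more resources are left missing, but the ones that are shown sit closer to the root.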
58. Joint Conference on Digital Libraries (JCDL) 2013
CONCLUSION
Extensive research on improving acquisition exists
Best use of existing collections needs study
We are looking at
• Characterizing existing holdings
• Policies that minimize impact of drift and spread
• Characterizing memento and walk status
Please forgive the long title. Let me explain it with a fable…
The rest of this presentation will take the following form: a brief discussion of related work and how this research improves our knowledge, a description of how we measured drift, a review of the results, and a quick look at how this work can be refined.
A student at ODU becomes curious about the history of the Computer Science Department and visits the Internet Archive’s Wayback Machine.
The student enters http://www.cs.odu.edu and is shown the available dates. The student navigates to 2005 and selects 14 May @ 01:36:08.
The student reviews the Computer Science page. Finding the College of Sciences link interesting, the student clicks on it.
After reviewing the College of Sciences page, the student returns to the Computer Science page, and…
Whoa! That’s not what was expected!
What just happened? We expected the left side, but got the right side. This is a result of applying the Sliding Target Policy. Highlight the temporal drift.
Let us return to temporal spread. Even though the display is May 14, 2005, (CLICK) the resources were captured at very different times: (CLICK) some days, (CLICK) some months, (CLICK) even years (in this case an image in the footer).
This leads to questions:
The majority of work to date has focused on improving the quality of data acquisition. Spaniol et al. focused on crawling strategy. Denev et al. looked at change rates by MIME type. Ben Saad et al. used crawl metadata to improve presentation to the user. Our focus is getting the best results from existing collections. After all, we can’t go back and “fix” past data acquisition.
Memento is an HTTP extension for datetime negotiation, now implemented by the Internet Archive, Archive.is, UK National Archive, and UK Web Archive. This is a very abbreviated introduction to the Memento API. The Memento API allows an HTTP client to negotiate a datetime. On request, the client adds the Accept-Datetime header. On reply, the server sends the Memento-Datetime header, indicating the actual datetime of the memento returned. Memento-Datetime is generally the acquisition datetime of the archived copy.
At JCDL 2011, we published “How Much of the Web Is Archived?” This density chart gives a sense of Web archival patterns. Each row represents a single URI, so row 200 represents the 200th URI. The rows are ordered such that the URI with the earliest memento is on the bottom. The empty rows at the top are URIs that are not archived. Each dot represents a single memento. Most mementos, the brown dots, come from the Internet Archive. The blue dots are search engine caches. Note that since this study was completed, the search engine caches have all locked down; effectively, they are no longer viable sources. The red dots represent other archives (WebCite, etc.).
We have investigated the temporal drift that occurs while browsing archives. (CLICK) Let us pick up from the introduction.
This is an example of the “Sliding Target Policy.” Here is how it works: We started on the May 14 page we selected. When College of Sciences was clicked, May 14 was used as the target.
And April 22 was the nearest Memento (archived version). When Computer Science was clicked, April 22 was used as the target.
And March 31 was the nearest Memento.
“What if the target datetime is held steady instead of being allowed to drift?” The Memento extension to HTTP enables this.
Sticky target can be accomplished using the MementoFox extension to Firefox. MementoFox allows the desired datetime to be entered and remain fixed. (CLICK) The nearest Memento is retrieved. (CLICK) In this case, it is the May 14 Computer Science page, the same one we selected using the Wayback Machine UI. When College of Sciences is clicked… (CLICK)
The April 22 page is shown again: the target datetime is still 2005-05-14, so April 22 is still the nearest. (CLICK) When Computer Science is clicked again…
May 15 is shown as expected. (PAUSE)
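The difference between the two policies comes down to one line: whether the target datetime is updated after each click. The sketch below is a minimal simulation under stated assumptions. The timemaps are hypothetical, with dates echoing the walkthrough and invented times of day, and the names nearest and walk are made up for illustration. It does not talk to any archive.

```python
from datetime import datetime


def nearest(timemap, target):
    """Return the memento datetime closest to the target datetime."""
    return min(timemap, key=lambda m: abs(m - target))


def walk(timemaps, clicks, start, sticky):
    """Follow a sequence of clicked URIs and return the memento shown at each step."""
    target = start
    shown = []
    for uri in clicks:
        memento = nearest(timemaps[uri], target)
        shown.append((uri, memento))
        if not sticky:
            target = memento  # sliding: the page just shown becomes the new target
    return shown


# Hypothetical timemaps: "cs" is the Computer Science page, "sci" the College of Sciences page.
timemaps = {
    "cs": [datetime(2005, 3, 31, 12, 0, 0), datetime(2005, 5, 14, 1, 36, 8)],
    "sci": [datetime(2005, 4, 22, 0, 0, 0)],
}
start = datetime(2005, 5, 14, 1, 36, 8)
clicks = ["sci", "cs"]  # click College of Sciences, then return to Computer Science

sliding_walk = walk(timemaps, clicks, start, sticky=False)
sticky_walk = walk(timemaps, clicks, start, sticky=True)

for name, result in (("sliding", sliding_walk), ("sticky", sticky_walk)):
    uri, memento = result[-1]
    print(name, uri, memento.date(), "drift:", abs(memento - start).days, "days")
```

With these hypothetical timemaps the sliding walk ends on the March 31 Computer Science memento, while the sticky walk returns to the memento the user started from.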
The data is variable enough that the median is the best measure of central tendency. The main point of this graph is that the sticky policy reins in drift, while the sliding policy allows it to continue to increase. Notes: The initial up curve is due to choosing a known Memento-Datetime. We suspect the drop starting at steps 42+ is due to large, self-referencing sites (101celebrities.com) and clusters of related sites.
Let us return to temporal spread. Most web pages are composed from multiple resources, some of which are circled here. (WAIT FOR ANIMATION)
We call the collection of all mementos required to display a web page a composite memento. A composite memento consists of a root and embedded mementos and can be represented as a tree. (It is actually a graph, but can be represented as a tree without loss of generality.) (CLICK) The root is represented as URI-M0 at the top of the tree on the right. Embedded mementos, such as images, are also represented in the tree. Embedded mementos can themselves have embedded mementos, for example HTML in a frame. (The ODU CS home page had frames in its 1990s versions, but no longer does.)
This is a list of all the mementos that comprise http://www.cs.odu.edu. It is a bit of an eye chart, so here is a summary. (CLICK) There are 26 embedded mementos (27 total including the root). The mean delta (distance from root) is 125.9 days. The standard deviation is 207.7 days, which does not bode well for the mean. Here’s the kicker: the spread is 2.1 years!
Assume we have a composite resource with two embedded images. The graph on the right represents two composite mementos for this resource. The red diamonds are the root mementos, captured at different datetimes. Roots are centered at 0 delta; embedded mementos are offset by their delta. The blue and orange diamonds represent the embedded mementos. Orange mementos are from the same domain as the root. Blue mementos are from a different domain. Gray diamonds represent reused mementos.
Now let’s have a look at the full chart (as of mid-2012) for cs.odu.edu. (CLICK) Here is the 2005-05-14 page we have been looking at. (CLICK) Here is a page from 2011, (CLICK) and one from 2011. Several things stand out: The maximum spread is nearly 7 years (2005 row). Many embedded resources were acquired well after the corresponding root memento. Reuse appears very high.
Consider 2 mementos, 1 root and 1 embedded. (EXPLAIN why there is only one) In this case the embedded memento was captured after the root. (POINT OUT WHICH IS WHICH) Is this coherent? Hard to tell.
But add the Last-Modified date and it becomes clearer. In this case, the embedded memento’s Last-Modified and Memento-Datetime “bracket” the root, providing evidence that the embedded memento existed when the root was captured.
So we consider it coherent.
But what happens when the root is not bracketed? In this case, there is evidence that the embedded memento did NOT exist when the root was captured. We consider this a temporal coherence violation.
Similarly, if Last-Modified is missing, the embedded memento cannot be shown to be temporally coherent. But should it be a violation? It could actually be coherent. We are still gathering data on this one.
Similarly, if the embedded memento was captured before the root, was it still in existence when the root was captured? Probably, but more study is required.
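The cases walked through above can be collected into a small decision function. This is a sketch of the reasoning in these notes, not a finished classifier: the function name classify and the four label strings are chosen here for illustration, and the two unresolved cases are reported as open rather than decided.

```python
from datetime import datetime
from typing import Optional


def classify(root: datetime, embedded: datetime,
             last_modified: Optional[datetime] = None) -> str:
    """Classify one embedded memento against the root's Memento-Datetime.

    root          -- Memento-Datetime of the root memento
    embedded      -- Memento-Datetime of the embedded memento
    last_modified -- Last-Modified of the embedded memento, if the archive reports one
    """
    if embedded == root:
        return "coherent"
    if embedded < root:
        # Captured before the root: it probably still existed, but this needs more study.
        return "probably coherent"
    # From here on the embedded memento was captured after the root.
    if last_modified is None:
        # Without Last-Modified, the root cannot be bracketed.
        return "undetermined"
    if last_modified <= root:
        # Last-Modified and Memento-Datetime bracket the root.
        return "coherent"
    # The resource changed after the root was captured.
    return "violation"


root = datetime(2005, 5, 14, 1, 36, 8)
later = datetime(2005, 6, 1, 13, 40, 25)

print(classify(root, later, last_modified=datetime(2005, 5, 1)))   # bracketed
print(classify(root, later, last_modified=datetime(2005, 5, 20)))  # not bracketed
print(classify(root, later))                                       # Last-Modified missing
print(classify(root, datetime(2005, 5, 1)))                        # captured before the root
```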
Recall the single memento, root not bracketed pattern.
What happens if a second memento for the embedded resource is available? We can’t prove either existed when the root was captured. It opens another possibility…
Comparing the mementos. Here we introduce similarity measures. For images, direct comparison is appropriate because archives leave these alone. For text, HTML in particular, archives annotate the content with metadata (added comments). In this case we must use a similarity measure such as shingling or SimHash.
If the contents are equal, there is evidence that the embedded memento existed when the root was captured.
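For text, one way to compare two mementos despite archive annotations is word shingling with Jaccard similarity. The sketch below is a minimal illustration of that measure only, not SimHash. The shingle size, the sample HTML, and the archive comment are invented for the example, and choosing a threshold for "equal enough" is left open, as it is in these notes.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical page content and a hypothetical comment added by an archive.
original = "<html><body><h1>Computer Science</h1><p>Welcome to the department</p></body></html>"
annotated = original + "<!-- hypothetical comment added by the archive -->"

identical = jaccard(shingles(original), shingles(original))
similar = jaccard(shingles(original), shingles(annotated))
print("identical copies:", identical)
print("annotated copy:  ", round(similar, 2))
```

A byte-for-byte comparison would report the annotated copy as different, while the similarity score stays high because only the shingles touching the added comment are new.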
Real-world access patterns are needed to bring results more in line with actual user experience. Do real humans go 50 steps? If not, why? Is there no need? Is the interface a problem? Does it get too weird? Try to avoid sites humans would avoid (very subjective: I avoid 101celebrities.com, but you might like it). We suspect both drift and spread are influenced not just by single large domains, but also by clusters of related domains, Amazon.com and amazon-images.com for instance. Sussing out related domains will help clarify results.
Timemaps only tell part of the story. Memento-Datetimes in timemaps frequently redirect to a different datetime or URI. This is reflected in the drift research but not the spread research. This redirection will change the deltas. Besides, what does it mean when we are redirected to another datetime? (We suspect the archive has recognized a duplicate.) Another common occurrence is missing mementos: they are in the timemap but not available in the archive. Our research to date simply lists these as missing. But as policies and heuristics are developed, user priorities might require several responses (leave it missing, substitute the next nearest, etc.).
Delta is the absolute value of the difference between the root and embedded Memento-Datetimes. But there are other conditions that could or should indicate a delta of 0 instead. These all revolve around determining that no change has occurred. One of these is bracketing mementos. Explain the chart… However, HTML is problematic because comments are added by the archives, so we can’t check for equality. What similarity measure or measures are reasonable substitutes for equality?
Succinctly communicating the status of a composite memento or walk to the user is important. (CLICK) This just isn’t very user friendly. (CLICK) We need a single mutable icon or symbol that can be easily explained and understood. We may need several, one for casual users and one for researchers. (CLICK) For multiple-archive composites, we also need to acknowledge their contribution.
Finally, policies and heuristics must be developed. For example, in the drift work we used sliding and sticky.