TPDL 2015 - Profiling Web Archives
1. Profiling Web Archives
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Herbert Van de Sompel, Lyudmila L. Balakireva, and Harihar Shankar
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Supported in part by the International Internet Preservation Consortium (IIPC)
9. Long Tail of Archives
● The 400B+ web pages at the Internet Archive (IA) do not cover everything
● The top three archives after IA produce a full TimeMap 52% of the time (AlSum et al., TPDL 2013)
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
10. Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things
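The idea of a small profile that predicts whether an archive may hold mementos of a URI-R can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `host_segments` and `path_segments` policy knobs, and the SURT-like key format are assumptions made for the example.

```python
from urllib.parse import urlsplit


def uri_key(uri, host_segments=2, path_segments=0):
    """Reduce a URI-R to a SURT-like key such as 'com,example)/'.

    host_segments and path_segments are hypothetical policy knobs:
    how many reversed host labels and leading path parts to keep.
    """
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()
    labels = host.split(".")[::-1]  # www.example.com -> com, example, www
    key = ",".join(labels[:host_segments]) + ")/"
    segments = [s for s in parts.path.split("/") if s]
    if path_segments:
        key += "/".join(segments[:path_segments])
    return key


def build_profile(uri_rs, **policy):
    """Summarize an archive's holdings as {URI-Key: number of URI-Rs}."""
    profile = {}
    for uri in uri_rs:
        key = uri_key(uri, **policy)
        profile[key] = profile.get(key, 0) + 1
    return profile


def may_hold(profile, uri, **policy):
    """True if the archive might hold the URI-R; False means it does not."""
    return uri_key(uri, **policy) in profile


# Hypothetical holdings, for illustration only.
holdings = [
    "http://example.com/news/1",
    "http://www.example.com/about",
    "http://odu.edu/",
]
profile = build_profile(holdings)
```

Because every held URI-R contributes its own key, a lookup built this way cannot miss a held URI-R; it can only over-predict, for example by routing `http://example.com/unseen` to the archive. Keeping more host or path segments makes the profile larger and the prediction tighter.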
33. Future Work
● Generating sample URI sets
● Profiling via sampling
● Language profiles
● Evaluation of combination profiles, such as URI-Key along with Datetime
● Profiles for uses other than Memento routing, such as profiles based on site classification (e.g., news, wiki, social media, blog)
34. Conclusions
● Generated profiles with different policies for two archives
● Examined cost-precision tradeoffs of various policies
● Related CDX Size, URI-M, URI-R, and URI-Key
● Gained up to 22% routing precision with <5% relative cost without any false negatives
● Code @ GitHub:/oduwsdl/archive_profiler
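The cost-precision tradeoff mentioned above can be illustrated with a toy measurement. The definitions used here are assumptions for the sake of the example, not necessarily the ones used in the talk: relative cost is taken as the number of profile keys divided by the number of distinct URI-Rs held, and routing precision as the fraction of routed lookups that the archive actually holds. The `evaluate`, `host_key`, and `full_key` names and the sample URIs are hypothetical.

```python
from urllib.parse import urlsplit


def evaluate(holdings, queries, key_fn):
    """Measure a keying policy against held URI-Rs and sample lookups."""
    held = set(holdings)
    profile = {key_fn(u) for u in held}
    routed = [q for q in queries if key_fn(q) in profile]
    hits = [q for q in routed if q in held]
    # A false negative is a held URI-R that the profile fails to route.
    missed = [q for q in queries if q in held and key_fn(q) not in profile]
    return {
        "relative_cost": len(profile) / len(held),
        "precision": len(hits) / len(routed) if routed else 0.0,
        "false_negatives": len(missed),
    }


def host_key(uri):
    """Coarse policy (assumed): keep only the host name."""
    return urlsplit(uri).hostname


def full_key(uri):
    """Finest policy (assumed): keep the whole URI-R."""
    return uri


holdings = ["http://a.com/1", "http://a.com/2", "http://b.org/x", "http://b.org/y"]
queries = ["http://a.com/1", "http://a.com/9", "http://b.org/x", "http://c.net/"]

coarse = evaluate(holdings, queries, host_key)
fine = evaluate(holdings, queries, full_key)
```

In this toy data the host-only policy halves the profile size at the price of one misrouted lookup, while the full-URI policy is exact but as large as the holdings themselves. Neither policy produces false negatives, since each profile is derived from the holdings it summarizes.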