Profiling Web Archives IIPC GA 2015
1. Profiling Web Archives
Sawood Alam, Michael L. Nelson
Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529
Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
2. Memento Aggregator
Aggregates ~20 archives and counting
Only a few archives return good results
for any query
Time, network, and resource wastage
Query routing can be helpful
3. Long Tail Matters
400B+ web pages at IA do
not cover everything
Top three archives after IA
produce full TimeMap
52% of the time
Targeted crawls
Special focus archives
Restricted resources
Private archives
@PT_WebArchive (PortugueseWebArchive), 1:12 PM, 5 Jan 2015:
"The Portuguese Web Archive and Memento unveil the first homepage of the Smithsonian Institution from May 1995... fb.me/3VAo6gEba"
@textfiles (Jason Scott), 2:37 PM, 22 Apr 2015:
"Dennis Ritchie's Homepage has been deleted: cm.bell-labs.com/cm/cs/who/dmr/ and the site has a robots.txt that blocks it from the Wayback."
7. A Client Request
Canonical URL
Accept-Datetime (optional)
Accept-Language (optional)
GET /timegate/http://www.cnn.com/ HTTP/1.1
Host: mementoweb.org
Accept: text/html,application/xhtml+xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Datetime: Sat, 16 Jun 2012 00:00:00 GMT
Accept-Language: en-US,en;q=0.8
Cache-Control: max-age=0
If-Modified-Since: Thu, 23 Apr 2015 16:51:50 GMT
If-None-Match: "7ff8-5146718929580"
Connection: keep-alive
Cookie: __unam=34c3c7d-14ce917ce62-43c38e5e-7...
User-Agent: Mozilla/5.0 Linux x86_64 Chrome/42.0.2311.90...
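The request above can be sketched with Python's standard library. This is a minimal illustration, not the aggregator's own client: the helper name `build_timegate_request` is hypothetical, the TimeGate prefix is taken from the Host and path shown on the slide, and the request is only built, never sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_timegate_request(timegate, uri_r, when=None, language=None):
    """Build (but do not send) a TimeGate request for the original URI."""
    req = Request(timegate + uri_r, method="GET")
    if when is not None:
        # Accept-Datetime carries an HTTP date expressed in GMT.
        req.add_header("Accept-Datetime", format_datetime(when, usegmt=True))
    if language is not None:
        req.add_header("Accept-Language", language)
    return req


req = build_timegate_request(
    "http://mementoweb.org/timegate/",
    "http://www.cnn.com/",
    when=datetime(2012, 6, 16, tzinfo=timezone.utc),
    language="en-US,en;q=0.8",
)
print(req.full_url)
# urllib stores header names capitalized as "Accept-datetime".
print(req.get_header("Accept-datetime"))
```

Only the canonical URL is mandatory here; the two optional headers are the extra attributes a profile could use for routing.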
8. An Archive Response
Canonical URL (known)
Memento-Datetime
Original Content-Language (optional)
HTTP/1.1 200 OK
Server: Tengine/2.0.3
Date: Sun, 26 Apr 2015 00:25:57 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 85945
Connection: keep-alive
set-cookie: wayback_server=37; Domain=archive.org; Path=/; Expires=Tue, 26-Ma
Memento-Datetime: Sat, 25 Apr 2015 13:38:16 GMT
Link: ; rel="original", ; rel="timemap"; type="application/link-format",
X-Archive-Guessed-Charset: UTF-8
X-Archive-Orig-via: 1.1 varnish, 1.1 varnish, 1.1 varnish
X-Archive-Orig-content-language: en
X-Archive-Orig-x-content-type-options: nosniff
X-Archive-Orig-vary: Accept-Encoding,Cookie
X-Archive-Orig-content-type: text/html; charset=UTF-8
X-Archive-Orig-cache-control: private, s-maxage=0, max-age=0, must-revalidate
X-Archive-Orig-server: Apache
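A small sketch of how the two profile-relevant attributes could be pulled out of such a response. The function name `memento_info` is hypothetical, and the `X-Archive-Orig-content-language` header is specific to the response shown above; other archives may expose the original language differently or not at all.

```python
from email.utils import parsedate_to_datetime


def memento_info(headers):
    """Extract the memento datetime and original content language, if present."""
    lowered = {name.lower(): value for name, value in headers.items()}
    raw_datetime = lowered.get("memento-datetime")
    return {
        "memento_datetime": parsedate_to_datetime(raw_datetime) if raw_datetime else None,
        "language": lowered.get("x-archive-orig-content-language"),
    }


headers = {
    "Memento-Datetime": "Sat, 25 Apr 2015 13:38:16 GMT",
    "X-Archive-Orig-content-language": "en",
}
info = memento_info(headers)
# 14-digit timestamp, the same shape used in CDX files
timestamp = info["memento_datetime"].strftime("%Y%m%d%H%M%S")
print(timestamp, info["language"])
```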
9. A CDX Snippet
Canonical URL
Memento Datetime
cnn.com/ 20080226193757 http://www.cnn.com/ text/html 200 2Q4OZSVKPZMUF36UN6
cnn.com/ 20090314024036 http://www.cnn.com/ text/html 200 4PVCGT22VVTDJ3GXIJ
cnn.com/ 20090314024036 http://www.cnn.com/ text/html 200 4PVCGT22VVTDJ3GXIJ
i.cdn.travel.cnn.com/ 20130102083554 http://i.cdn.travel.cnn.com/ text/html
i.cdn.travel.cnn.com/ 20130404172913 http://i.cdn.travel.cnn.com/ text/html
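A minimal sketch of reading the two highlighted fields from a CDX line. It assumes space-separated fields in the order visible in the snippet (URL key, 14-digit timestamp, original URL, MIME type); real CDX files declare their field order in a header line, so `parse_cdx_line` is illustrative only.

```python
from datetime import datetime


def parse_cdx_line(line):
    """Split one CDX line into the fields a profiler needs."""
    fields = line.split()
    return {
        "urlkey": fields[0],
        "datetime": datetime.strptime(fields[1], "%Y%m%d%H%M%S"),
        "uri_r": fields[2],
        # Later fields may be missing or truncated, so treat them as optional.
        "mime": fields[3] if len(fields) > 3 else None,
    }


record = parse_cdx_line("cnn.com/ 20080226193757 http://www.cnn.com/ text/html 200")
print(record["uri_r"], record["datetime"].isoformat())
```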
10. Complete URI-R Profiling
Sanderson et al. created a URI-R profile for various archives
Extracted every URI-R from all the CDX files
Gained complete knowledge of the holdings of the participating archives
Profiles were huge
Difficult to keep up-to-date
Misses URI-Rs added later in the archive
11. TLD-only Profiling
AlSum et al. created a TLD
profile for various archives
Collected statistics about
various archives on
various TLDs
Lightweight profiles
Lots of false-positives
All the ".com" queries will
be routed to an archive
that has only a few URI-Rs
with ".com" TLD
12. Middle Ground
Partial URI-Rs, such as:
Registered domain name
Complete domain name (along with any sub-domains)
Complete domain name and first few path segments
Registered domain name and counts of other segments
such as sub-domain, path, and query parameter
Combining above with other attributes such as Content-
Language and Memento-Datetime
13. Archive Profile
High-level digest of an archive
Predicts presence of mementos of a URI-R in an archive
Provides various statistics about the holdings
Small in size
Publicly available
Easy to update and partially patch
Useful for Memento query routing and other things
14. Structure
Archive metadata
Statistics:
  Profile types:
    Keys: Frequency measurements
15. Profile types
URI-R based
Complete URI-R
TLD only
URI-R hashes, such as:
Only first few segments of the URI-R (Sub-URI)
Registered domain name along with counts of other
segments (Segment-Digest)
Language
Datetime
Many more...
16. Keys
Depend on the profile type
Control the balance between profile size and details
URI-R: "bbc.co.uk/images/Logo.png?height=80&width=200"
TLD: ".uk"
Sub-URI: "uk)/", "uk,co)/", "uk,co,bbc)/", "uk,co,bbc)/images"
Seg-Digest: "0/bbc.co.uk/4"
Language: "en-GB"
Datetime: "201403" # YYYYMM
17. Frequency Measurements
Can have the same structure for all profile types
Flexible to choose the attribute set to be included
Affects the profile complexity
Predicts the presence of the mementos of a URI-R
"uk,co,bbc)/":
  urim:
    max: 2
    min: 1
    total: 128
  urir: 115
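The record above can be derived from per-URI-R memento counts. The sketch below assumes the input is a mapping from each URI-R under a key to its number of mementos; the function name `measure` and the toy data are hypothetical.

```python
def measure(memento_counts):
    """Summarize memento counts for all URI-Rs that fall under one profile key."""
    counts = list(memento_counts.values())
    return {
        "urim": {"max": max(counts), "min": min(counts), "total": sum(counts)},
        "urir": len(counts),
    }


# Toy data: three URI-Rs under the key "uk,co,bbc)/"
stats = measure({
    "bbc.co.uk/": 2,
    "bbc.co.uk/news": 1,
    "bbc.co.uk/sport": 1,
})
print(stats)
```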
18. Horizontal and Vertical Holdings
"uk,co,bbc)/":
  urim:
    max: 100
    min: 100
    total: 100
  urir: 1
"uk,co,bbc)/":
  urim:
    max: 1
    min: 1
    total: 100
  urir: 100
"uk,co,bbc)/":
  urim:
    max: 20
    min: 5
    total: 100
  urir: 10
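One way to read the three records is through the average number of mementos per URI-R: a high value suggests many captures of few resources, a value near 1 suggests single captures of many resources. This interpretation and the helper `mementos_per_uri_r` are assumptions made for illustration; the slide itself does not label the records.

```python
def mementos_per_uri_r(record):
    """Average number of mementos held for each URI-R under a key."""
    return record["urim"]["total"] / record["urir"]


records = {
    "first": {"urim": {"max": 100, "min": 100, "total": 100}, "urir": 1},
    "second": {"urim": {"max": 1, "min": 1, "total": 100}, "urir": 100},
    "third": {"urim": {"max": 20, "min": 5, "total": 100}, "urir": 10},
}
depths = {name: mementos_per_uri_r(record) for name, record in records.items()}
print(depths)
```

All three records have the same total, so the total alone cannot distinguish these holdings; the URI-R count is what separates them.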
19. Sample Profile
---
"@context": "https://oduwsdl.github.io/context/archprofile.jsonld"
"@id": "http://www.webarchive.org.uk/ukwa/"
about:
  accesspoint: "http://www.webarchive.org.uk/wayback/"
  mechanism: "http://oduwsdl.github.io/terms/mechanism#cdx"
  name: "UKWA 1996 Collection"
  profile_updated: "2015-01-20T17:25:30Z"
  suburi_class: "http://oduwsdl.github.io/terms/suburi#H3P1"
  more_meta_data: "..."
stats:
  language:
    "en-US":
      urim: {max: 13, min: 1, total: 47529}
      urir: 25621
    "more_languages": "..."
  suburi:
    "uk)/":
      urim: {max: 8, min: 1, total: 932432}
      urir: 867817
    "uk,co)/":
      urim: {max: 8, min: 1, total: 410979}
      urir: 378686
20. URI-R Based Profiles
URI-R preprocessing
Canonicalize
Apply SURT
Split segments
Extract registered domain
Count segments (sub-domain, path, query params)
Generate all Sub-URIs
Incrementally add segments from left-to-right
Only up to max host and path segments config
Create Segment-Digest with registered domain
Prefix sub-domain count
Suffix path and query params count
21. Key Generation
https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f
Intermediate Values
{canonical_url: "bbc.co.uk/images/Logo.png?height=80&width=200",
 surt_url: "uk,co,bbc)/images/Logo.png?height=80&width=200",
 reg_domain: "bbc.co.uk", path_initial: "i",
 subdomain_count: 1, path_count: 2, query_params_count: 2}
Sub-URI (H3P1)
["uk)/",
 "uk,co)/",
 "uk,co,bbc)/",
 "uk,co,bbc)/images"]
SegDigest (include_path_initial)
"1/bbc.co.uk/i4"
22. Implementation
A Python module to generate Sub-URIs from SURT
GitHub: /oduwsdl/suburi_generator
Various profile generation scripts
GitHub: /oduwsdl/archive_profiler
23. Canonicalization
Remove "http(s)", "www", and fragment of a URI
Downcase hostname
Remove some known query params, e.g., "jsessionid"
Sort query params by keys and values (secondary)
URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
Canonicalize(URL)
# => "bbc.co.uk/images/Logo.png?height=80&width=200"
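The four rules above can be sketched as follows. This is a simplified illustration, not the authors' canonicalizer: the `SESSION_PARAMS` set is a placeholder containing only the one parameter named on the slide, the input is assumed to carry a scheme, ports are dropped, and percent-encoded query values are decoded rather than preserved.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumption: illustrative list; a real canonicalizer would know more parameters.
SESSION_PARAMS = {"jsessionid"}


def canonicalize(url):
    """Strip scheme, 'www', and fragment; lowercase the host; sort query params."""
    parts = urlsplit(url)  # the fragment is split off and never used
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    if host.startswith("www."):
        host = host[len("www."):]
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    params.sort()  # by key first, then by value
    query = "&".join(f"{key}={value}" for key, value in params)
    return host + (parts.path or "/") + ("?" + query if query else "")


can_url = canonicalize("https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f")
print(can_url)  # bbc.co.uk/images/Logo.png?height=80&width=200
```

Note that only the hostname is lowercased; the path keeps its case (`Logo.png`), matching the slide's output.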
24. Sort-friendly URI Reordering Transform (SURT)
Take canonical URL as input
Join hostname segments by commas in reverse order
Separate hostname and path by closing parenthesis
Can_URL = "bbc.co.uk/images/Logo.png?height=80&width=200"
SURT(Can_URL)
# => "uk,co,bbc)/images/Logo.png?height=80&width=200"
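A minimal sketch of the transform as described on this slide, assuming the input is already canonical (no scheme, no port, at least a host). The function name `surt` is chosen for illustration and covers only the two rules listed here.

```python
def surt(can_url):
    """Reverse the hostname labels, join them with commas, and close with ')'."""
    host, _, rest = can_url.partition("/")
    return ",".join(reversed(host.split("."))) + ")/" + rest


surt_url = surt("bbc.co.uk/images/Logo.png?height=80&width=200")
print(surt_url)  # uk,co,bbc)/images/Logo.png?height=80&width=200
```

Because the most general label comes first, lexicographic sorting groups all URLs of a domain and its sub-domains together.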
25. Sub-URI
Take SURT URL as input
Incrementally add segments from left-to-right one-by-one
Stop when the hostname or path segment limit set by the policy is reached
Return the list of all Sub-URIs
SURT_URL = "uk,co,bbc)/images/Logo.png?height=80&width=200"
SubURI(SURT_URL, policy="H3P1")
# => ["uk)/",
#     "uk,co)/",
#     "uk,co,bbc)/",
#     "uk,co,bbc)/images"]
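The steps above can be sketched in a few lines. This is not the `suburi_generator` module itself; the function name `sub_uris`, the reading of `x` as "no limit", and the choice to add path segments only once the full hostname is included are assumptions made for this example.

```python
import re


def sub_uris(surt_url, policy="H3P1"):
    """List Sub-URI keys of a SURT URL under an H<hosts>P<paths> policy."""
    match = re.fullmatch(r"H(\d+|x)P(\d+|x)", policy)
    if match is None:
        raise ValueError(f"unsupported policy: {policy}")
    # None as a slice bound means "no limit".
    max_host = None if match.group(1) == "x" else int(match.group(1))
    max_path = None if match.group(2) == "x" else int(match.group(2))

    host, _, rest = surt_url.partition(")/")
    path = rest.split("?", 1)[0]  # query parameters never become key segments
    all_host = host.split(",")
    host_segments = all_host[:max_host]
    path_segments = [seg for seg in path.split("/") if seg][:max_path]

    keys = [",".join(host_segments[:i]) + ")/" for i in range(1, len(host_segments) + 1)]
    if len(host_segments) == len(all_host):
        base = keys[-1]
        for i in range(1, len(path_segments) + 1):
            keys.append(base + "/".join(path_segments[:i]))
    return keys


keys = sub_uris("uk,co,bbc)/images/Logo.png?height=80&width=200", policy="H3P1")
print(keys)
```

Raising the limits produces more, longer keys and therefore a larger but more precise profile.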
26. URL to Sub-URI
URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
Can_URL = Canonicalize(URL)
# => "bbc.co.uk/images/Logo.png?height=80&width=200"
SURT_URL = SURT(Can_URL)
# => "uk,co,bbc)/images/Logo.png?height=80&width=200"
Sub_URIs = SubURI(SURT_URL, policy="H3P1")
# => ["uk)/",
#     "uk,co)/",
#     "uk,co,bbc)/",
#     "uk,co,bbc)/images"]
27. Segment Count Digest
Extract registered domain name and initial letter of path
Count sub-domain and trailing (path + query) segments
Serialize as follows:
{subdomain_count}/{reg_domain}/{path_initial}?{trailing_count}
URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
Segments = Segmentize(URL)
# => {reg_domain: "bbc.co.uk",
#     path_initial: "i",
#     subdomain_count: 1,
#     path_count: 2,
#     query_params_count: 2,
#     trailing_count: 4}
SegDigest(Segments, policy="exclude_path_initial")
# => "1/bbc.co.uk/4"
SegDigest(Segments, policy="include_path_initial")
# => "1/bbc.co.uk/i4"
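A hedged sketch of the two functions named above. Finding the registered domain properly requires the Public Suffix List; the tiny `PUBLIC_SUFFIXES` set below is a stand-in assumed for this example, and the snake_case function names are illustrative rather than the authors' API.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumption: toy suffix set; a real implementation would use the Public Suffix List.
PUBLIC_SUFFIXES = {"uk", "co.uk", "com", "org"}


def registered_domain_index(labels):
    """Index of the first label of the registered domain (longest suffix wins)."""
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return max(i - 1, 0)
    raise ValueError("no known public suffix")


def segmentize(url):
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    start = registered_domain_index(labels)
    path_segments = [seg for seg in parts.path.split("/") if seg]
    query_count = len(parse_qsl(parts.query, keep_blank_values=True))
    return {
        "reg_domain": ".".join(labels[start:]),
        "path_initial": path_segments[0][0] if path_segments else "",
        "subdomain_count": start,
        "path_count": len(path_segments),
        "query_params_count": query_count,
        "trailing_count": len(path_segments) + query_count,
    }


def seg_digest(segments, policy="exclude_path_initial"):
    initial = segments["path_initial"] if policy == "include_path_initial" else ""
    return f'{segments["subdomain_count"]}/{segments["reg_domain"]}/{initial}{segments["trailing_count"]}'


segments = segmentize("https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f")
print(seg_digest(segments))                          # 1/bbc.co.uk/4
print(seg_digest(segments, "include_path_initial"))  # 1/bbc.co.uk/i4
```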
28. JSON Serialization
Can have complex nested data structure
JSON-LD for linked data
No partial key lookup
Unsuitable for text processing tools
Allows processing only when fully loaded
A single malformed character makes it unparsable
Difficult to patch
{
  "suburi": {
    "uk)/": {
      "urim": {
        "max": 8,
        "min": 1,
        "total": 932432
      },
      "urir": 867817
    },
    "uk,co)/": {
      "urim": {
        "max": 8,
        "min": 1,
        "total": 410979
      },
      "urir": 378686
    },
    "uk,co,bbc)/": {
      "urim": {
        "max": 2,
        "min": 1,
        "total": 128
29. CDX-JSON Serialization
Fusion of CDX and JSON file formats
A key followed by strict single line JSON value
Unlike CDX, values can have arbitrary attributes
Text processing tool friendly
No single root node or single document restrictions
Enables binary search
Enables partial key lookup
Error resilient
@context "https://oduwsdl.github.io/contexts/archiveprofile.jsonld"
@id "http://www.webarchive.org.uk/ukwa/"
@about {"name": "UKWA 1996 Collection", "type": "suburi#H3P1", "...":
uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817},
uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686
uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115},
uk,co,bbc)/images {"urim": {"max": 1, "min": 1, "total": 3}, "urir":
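The properties listed above (line-oriented, binary-searchable, partial key lookup, error resilient) can be illustrated with a short sketch. It assumes each data line is a key, one space, then a complete single-line JSON value with no trailing comma, and that lines starting with `@` are metadata; `parse_cdxj` and `lookup_prefix` are hypothetical names. The last sample line is left truncated on purpose to show that one bad line does not spoil the rest.

```python
import bisect
import json

SAMPLE = """\
uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817}
uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686}
uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115}
uk,co,bbc)/images {"urim": {"max": 1, "min": 1, "total": 3}, "urir":
"""


def parse_cdxj(text):
    """Parse key/JSON lines independently, skipping metadata and malformed lines."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("@"):
            continue
        key, _, value = line.partition(" ")
        try:
            records.append((key, json.loads(value)))
        except json.JSONDecodeError:
            continue  # a malformed line only costs that one record
    records.sort(key=lambda record: record[0])
    return records


def lookup_prefix(records, prefix):
    """Binary-search to the first key >= prefix, then collect matching keys."""
    keys = [key for key, _ in records]
    start = bisect.bisect_left(keys, prefix)
    matches = []
    for key, value in records[start:]:
        if not key.startswith(prefix):
            break
        matches.append((key, value))
    return matches


records = parse_cdxj(SAMPLE)
hits = lookup_prefix(records, "uk,co,")
print(len(records), hits)
```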
30. Merging
Only process new data to periodically update for
freshness
Parallel processing
Difficult to keep detailed measures with absolute values
Derived simple heuristic measures to predict presence of
mementos
31. Merging Example
Base Profile
com,cnn)/ {"urir_sum": 30, "sources": 1},
uk,co,bbc)/ {"urir_sum": 20, "sources": 1}
New Profile
com,cnn)/ {"urir_sum": 10, "sources": 1},
com,usatoday)/ {"urir_sum": 5, "sources": 1}
Merged Profile
com,cnn)/ {"urir_sum": 40, "sources": 2},
uk,co,bbc)/ {"urir_sum": 20, "sources": 1},
com,usatoday)/ {"urir_sum": 5, "sources": 1}
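The merge shown above amounts to summing the additive fields key by key. A minimal sketch, assuming every field in a record is a summable number (as `urir_sum` and `sources` are); `merge_profiles` is a hypothetical name.

```python
def merge_profiles(base, new):
    """Combine two profiles by summing the counters of keys they share."""
    merged = {key: dict(stats) for key, stats in base.items()}
    for key, stats in new.items():
        if key in merged:
            for field, value in stats.items():
                merged[key][field] = merged[key].get(field, 0) + value
        else:
            merged[key] = dict(stats)
    return merged


base = {
    "com,cnn)/": {"urir_sum": 30, "sources": 1},
    "uk,co,bbc)/": {"urir_sum": 20, "sources": 1},
}
new = {
    "com,cnn)/": {"urir_sum": 10, "sources": 1},
    "com,usatoday)/": {"urir_sum": 5, "sources": 1},
}
merged = merge_profiles(base, new)
print(merged)
```

Because the same URI-R may appear in both sources, `urir_sum` is an upper bound rather than an exact count of distinct URI-Rs, which is why the merged measures are heuristic.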
35. Evaluation
Relate CDX Size, URI-M, URI-R, and Sub-URI
Analyze profile growth
Estimate Relative Cost
Evaluate Routing Precision vs. Relative Cost
$\text{Relative Cost} = \dfrac{|\text{Keys in the Profile}|}{|\text{URI-R in the Archive}|}$
$\text{Routing Precision} = \dfrac{|\text{URI-R Present in the Archive}|}{|\text{URI-R Predicted by the Profile in Archive}|}$
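Both ratios are simple to compute once the sets are known. The sketch below uses invented toy sets purely to show the arithmetic; the function names and the sample URI-Rs are hypothetical and do not come from the evaluation data.

```python
def relative_cost(profile_keys, archive_uri_rs):
    """Number of profile keys per URI-R held by the archive."""
    return len(profile_keys) / len(archive_uri_rs)


def routing_precision(present, predicted):
    """Share of the URI-Rs routed to the archive that it actually holds."""
    return len(present) / len(predicted)


# Toy example: 2 profile keys summarize an archive of 4 URI-Rs.
profile_keys = {"uk,co,bbc)/", "com,cnn)/"}
archive_uri_rs = {"bbc.co.uk/", "bbc.co.uk/news", "cnn.com/", "cnn.com/world"}
# Of 5 queried URI-Rs the profile predicts as present, 2 are really held.
predicted = {"bbc.co.uk/", "bbc.co.uk/sport", "cnn.com/", "cnn.com/tech", "cnn.com/us"}
present = predicted & archive_uri_rs

cost = relative_cost(profile_keys, archive_uri_rs)
precision = routing_precision(present, predicted)
print(cost, precision)
```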
36. UKWA Dataset
Yearly data as separate collections
Average CDX line size: 275 bytes
URI-M/URI-R ratio: 2.46
37. Accumulated URI-R Growth (UKWA)
Successive yearly data
was merged
Follows Heaps' Law
$C_r = K C_m^{\beta}$
K = 3.897
β = 0.892
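The fitted curve can be evaluated directly. The sketch assumes, from the slide title and the usual form of Heaps' law, that the input is the accumulated number of URI-Ms and the output the expected number of distinct URI-Rs; the constants are the fitted values reported above and the function name is hypothetical.

```python
def expected_uri_r_count(uri_m_count, k=3.897, beta=0.892):
    """Heaps'-law estimate of distinct URI-Rs for a given number of URI-Ms."""
    return k * uri_m_count ** beta


for uri_m_count in (10**6, 10**7, 10**8):
    estimate = expected_uri_r_count(uri_m_count)
    print(uri_m_count, round(estimate))
```

With β below 1 the estimate grows sublinearly, so each additional batch of mementos contributes proportionally fewer new URI-Rs.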
38. Sub-URI Key Growth (UKWA)
Slope of the fit line is the
Relative Cost for the
profile policy
Complete URI-R profile
has Relative Cost 1
40. Search Precision of Various Profiles
Search Precision wrt TLD-only profile
Double for H3P0
Five fold for HxP1
Segment-Digest is as good as H3P0
41. Relative Cost vs. Search Precision
Up to 22% routing precision with <5% Relative Cost
<0.3% of the sample URIs from MementoProxy and IAWayback logs are present in UKWA
Shallow crawling of UKWA results in higher cost
43. Future Work
Generating sample URI sets
Profiling via sampling
Language profiles
Evaluation of combination profiles such as Sub-URI along
with Datetime
Profiles for usage other than Memento routing, such as,
Media-type profiles (e.g., images, pdf, audio etc.)
Site classification based profiles (e.g., news, wiki, social
media, blog etc.)
44. Conclusions
Generated profiles with different policies for two archives
Examined cost-accuracy trade-offs of various profiles
Related CDX Size, URI-M, URI-R, and Sub-URI
Gained up to 22% routing precision with <5% relative cost
without any false negatives
<5% of the queried URIs are present in each of the
individual archives
Implementation code is available at:
GitHub: /oduwsdl/suburi_generator
GitHub: /oduwsdl/archive_profiler