Presented at ACM CIKM 2019. Used by a variety of researchers, web archive collections have become invaluable sources of evidence. If a researcher is presented with a web archive collection that they did not create, how do they know what is inside so that they can use it for their own research? Search engine results and social media links are represented as surrogates, small easily digestible summaries of the underlying page. Search engines and social media have a different focus, and hence produce different surrogates than web archives. Search engine surrogates help a user answer the question "Will this link meet my information need?" Social media surrogates help a user decide "Should I click on this?" Our use case is subtly different. We hypothesize that groups of surrogates together are useful for summarizing a collection. We want to help users answer the question of "What does the underlying collection contain?" But which surrogate should we use? With Mechanical Turk participants, we evaluate six different surrogate types against each other. We find that the type of surrogate does not influence the time to complete the task we presented the participants. Of particular interest are social cards, surrogates typically found on social media, and browser thumbnails, screen captures of web pages rendered in a browser. At p=0.0569, and p=0.0770, respectively, we find that social cards and social cards paired side-by-side with browser thumbnails probably provide better collection understanding than the surrogates currently used by the popular Archive-It web archiving platform. We measure user interactions with each surrogate and find that users interact with social cards less than other types. The results of this study have implications for our web archive summarization work, live web curation platforms, social media, and more.
Social Cards Probably Provide For Better Understanding Of Web Archive Collections
1. Shawn M. Jones, Michele C. Weigle, Michael L. Nelson
sjone@cs.odu.edu mweigle@cs.odu.edu mln@cs.odu.edu
@shawnmjones @weiglemc @phonedude_mln
Old Dominion University
Web Science and Digital Libraries Research Group
@WebSciDL
2. @shawnmjones @WebSciDL
Curators Build Web Archive Collections
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Tucson Shootings, 2008 Olympics, University of Utah
3. @shawnmjones @WebSciDL
Web archive collections consist of mementos
– different versions of the same page over time
University of Utah Office of Admissions, from the University of Utah Web Archive Collection: mementos from 2013, 2015, and 2018
Tumblr Black Lives Matter Blog, from the #blacklivesmatter Collection: mementos from 2/12/2015, 3/5/2015, and 4/1/2015
4. @shawnmjones @WebSciDL
Archive-It allows curators to easily create
web archive collections
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive
collections. Curators can supply live web resources as seeds and establish crawling schedules of those
seeds to create mementos.
5. @shawnmjones @WebSciDL
… and these collections are used by other researchers
The collection curator is not the only user of the
collection!
These collections live a life after their curator
has stopped adding to them.
6. @shawnmjones @WebSciDL
The problem…
There are multiple collections
about the same concept.
The metadata for each collection is
non-existent, or inconsistently
applied.
Many collections have
1000s of seeds with multiple
mementos.
There are more than 8000
collections.
Human review of these
mementos for collection
understanding is an expensive
proposition.
7. @shawnmjones @WebSciDL
Collections provide web pages based on a theme – we can
summarize those collections by using the best mementos
supporting that theme…
Web sites may group some
content, but curators theme some
of this content into collections
which we can reduce to stories.
Our stories consist of ~28
representative mementos, making
them much smaller than the
collections from which they come.
8. @shawnmjones @WebSciDL
Surrogates provide a visual summary of the content
behind a URI…
Long URI:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
The same URI represented by a
browser thumbnail surrogate:
The same URI represented by a
social card surrogate:
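A social card is typically assembled from metadata embedded in the page behind the URI. The following is a minimal sketch of that idea, assuming the page exposes Open Graph `<meta>` tags or at least a `<title>`; the names `CardFieldParser` and `build_card` are hypothetical, and this is not how any particular card service is implemented.

```python
from html.parser import HTMLParser


class CardFieldParser(HTMLParser):
    """Collect <title> text and Open Graph <meta> values from a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            prop = attrs.get("property") or attrs.get("name") or ""
            # Keep only Open Graph fields such as og:title, og:description.
            if prop.startswith("og:") and attrs.get("content"):
                self.fields[prop[3:]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["html_title"] = self.fields.get("html_title", "") + data


def build_card(uri, html):
    """Return a dict with the fields a social card would display."""
    parser = CardFieldParser()
    parser.feed(html)
    parser.close()
    f = parser.fields
    return {
        "uri": uri,
        # Prefer og:title, then the HTML <title>, then the bare URI.
        "title": f.get("title") or f.get("html_title", "").strip() or uri,
        "description": f.get("description", ""),
        "image": f.get("image"),
    }
```

When no metadata is available, the sketch falls back to the bare URI, which is exactly the case where a long URI like the one above tells the reader very little.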
9. @shawnmjones @WebSciDL
Social media storytelling uses surrogates to provide a
“summary of summaries”
2 resources are shown from this Wakelet story
6 resources are shown from this Storify story
Each surrogate summarizes a
web resource.
Each story groups the
surrogates, summarizing the
topic.
We want to use this technique
to summarize web archive
collections because users are
already familiar with this
visualization paradigm.
10. @shawnmjones @WebSciDL
We want to help the user answer the question of
“What does the underlying collection contain?”
There are many types of surrogates. Which one
best conveys the concepts of the underlying
collection?
If we already have the mementos describing a
collection, we need to visualize them with some
type of surrogate.
11. @shawnmjones @WebSciDL
Storytelling implemented
From this: 31,863 seed mementos of 955 seeds,
potentially 31K+ documents to review
To this: a story built from a representative sample of
~30 mementos visualized as surrogates
13. @shawnmjones @WebSciDL
But, alas the metadata does not help…
The metadata on seeds and collections is
optional.
As the number of seeds increases, less
metadata is present per seed.
A variety of metadata fields can be used for surrogates,
with the most popular field being Title.
Even so, 54.60% of Archive-It
surrogates consist of only the seed URL and capture
dates.
14. @shawnmjones @WebSciDL
Understanding Archive-It Collections… Manually
Step 0: Decide on your
information need
Step 1: Decide upon search terms
and find a collection
via the search engine
Step 2: Select a collection from
the many results available
15. @shawnmjones @WebSciDL
Understanding Archive-It Collections… Manually
Step 3: View the seed-centric collection page
Step 4: Choose a seed from this list that may meet the
information need. This collection contains 1,149 seeds.
How do you choose one without metadata to guide
you?
16. @shawnmjones @WebSciDL
Understanding Archive-It Collections… Manually
Step 5: View the mementos associated with that
seed.
This seed has 923 seed mementos. There are
more mementos linked from these mementos.
Step 6: Read the text of the memento to learn about its
contents.
17. @shawnmjones @WebSciDL
Understanding Archive-It Collections… Manually
Step 7: Follow links and review more
content until you reach a page that was
not archived.
Step 8: Repeat steps 4-7 until enough information about the collection has
been amassed to determine if it meets your
information need. This collection contained 80,484 seed
mementos.
18. @shawnmjones @WebSciDL
In spite of the lack of information in Archive-It surrogates,
some information can be gleaned from the URI…
Thus, surrogates like these may still yield enough
information for collection understanding.
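As a rough illustration of what can be gleaned from a URI alone, the sketch below pulls out the host and the readable words in the path. The function name `uri_hints`, the list of ignored words, and the example URI are assumptions made for illustration only.

```python
import re
from urllib.parse import urlsplit

# Path fragments that carry no topical meaning (an assumed, incomplete list).
NOISE = {"index", "html", "htm", "php", "www"}


def uri_hints(uri):
    """Return the host and the word-like tokens found in a URI's path."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    tokens = re.split(r"[^A-Za-z0-9]+", parts.path)
    words = [
        t.lower()
        for t in tokens
        # Skip short fragments, pure numbers (dates, ids), and noise words.
        if len(t) > 2 and not t.isdigit() and t.lower() not in NOISE
    ]
    return {"host": host, "path_words": words}


# Hypothetical seed URI used only to show the shape of the result.
hints = uri_hints("http://www.example.org/2011/US/01/08/arizona.shooting/index.html")
```

Here `hints` holds the publisher-like host and a couple of topical words, which is roughly the information a reader extracts by eye from a URL-only surrogate.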
19. @shawnmjones @WebSciDL
Existing surrogate services create a confusing
experience for mementos
Who published these resources?
Archive-It?
CNN?
Is the story author sharing fake news?
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
embed.rocks social card
embed.ly social card
20. @shawnmjones @WebSciDL
Neither social media services nor card services were
reliable for storytelling, so we created MementoEmbed…
Information in the
MementoEmbed social
card is separated to
avoid issues of
confusion about
attribution.
MementoEmbed is
archive-aware. It can
locate information
about the memento
that is not available in
other cards.
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
22. @shawnmjones @WebSciDL
We shared these stories, rendered using 6 different surrogate
types, with MT participants…
Archive-It like
Social Card
Browser thumbnails
Social Card With Thumbnail as Image (sc/t)
Social Card With Thumbnail to Right (sc+t)
Social Card with Thumbnail on Hover (sc^t)
• 4 stories of 15-17 mementos selected by human
Archive-It curators from their collections
• 6 different surrogate types
• Social cards and thumbnails were produced by
MementoEmbed
• 24 different story-surrogate combinations
• 120 MT participants
• They were given 30 seconds to view each story
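The count of 24 follows from pairing each of the 4 stories with each of the 6 surrogate types. A small sketch of that enumeration, where the story labels are placeholders rather than the names used in the study:

```python
from itertools import product

# Surrogate type labels as they appear on the slide.
SURROGATE_TYPES = [
    "Archive-It like",
    "Social Card",
    "Browser thumbnails",
    "sc/t",
    "sc+t",
    "sc^t",
]

# Placeholder story labels; the actual story names are not given here.
STORIES = ["story-1", "story-2", "story-3", "story-4"]

# Every story rendered with every surrogate type.
story_surrogate_combinations = list(product(STORIES, SURROGATE_TYPES))
n_combinations = len(story_surrogate_combinations)
```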
23. @shawnmjones @WebSciDL
And then we asked them which 2 of 6 mementos come
from the same collection…
• Each participant was shown a list of 6 surrogates of the same type as the story they just viewed.
• They were asked to choose the 2 that they thought came from the same collection.
• They were given as much time as they wished to answer the question.
• This is similar to the Sentence Verification Task from reading comprehension studies.
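One plausible way to score this task is to count how many of the 2 selected surrogates really came from the collection, giving a score of 0, 1, or 2 per question. The sketch below makes that assumption explicit and also computes the score expected from guessing at random; the function names are hypothetical and the scoring rule is an assumption, not a description of the study's analysis.

```python
from fractions import Fraction
from itertools import combinations


def score(selected, correct):
    """Number of selected surrogates that belong to the collection."""
    return len(set(selected) & set(correct))


def chance_score(n_options=6, n_correct=2, n_picks=2):
    """Average score over every possible way to pick n_picks options."""
    options = range(n_options)
    correct = set(range(n_correct))  # which options are correct is arbitrary
    picks = list(combinations(options, n_picks))
    total = sum(score(p, correct) for p in picks)
    return Fraction(total, len(picks))


# With 2 correct surrogates among 6, there are 15 possible pairs to choose.
baseline = chance_score()
```

Under these assumptions a participant guessing blindly would average 2/3 of a correct answer per question, which gives a reference point for reading per-surrogate means.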
24. @shawnmjones @WebSciDL
Response times per surrogate had interesting means,
but the differences were not statistically significant at p < 0.05
p = 0.190
p = 0.202
25. @shawnmjones @WebSciDL
Correct answers per surrogate indicate that social
cards probably outperform the Archive-It surrogate
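[Bar chart: Correct Answers Per Surrogate, showing the median and mean for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t]
p = 0.0569 (social cards compared with the Archive-It surrogate)
p = 0.0770 (sc+t compared with the Archive-It surrogate)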
26. @shawnmjones @WebSciDL
Whenever thumbnails are present, more users interact
with them
We could not detect if participants were zooming in to view thumbnails, but most hovered when confronted
with a thumbnail, regardless of surrogate.
For browser thumbnails alone, most of the participants clicked the link to view the actual memento behind the
surrogate.
27. @shawnmjones @WebSciDL
Conclusions
54.60% of Archive-It surrogates have only a URL and capture dates
• in spite of this, some information could still be gleaned from the URL
Results from our MT study with 120 participants:
• differences in response times were not statistically significant at p < 0.05
• for correct answers, social cards probably outperform the existing Archive-It surrogate at p = 0.0569
• when thumbnails are present, more participants interacted with them
• when thumbnails are present, more participants clicked through to view the page underneath rather than relying upon the surrogate
Conclusions:
• thumbnails encourage more interaction, specifically clicking through to the underlying page, than social cards
• social cards probably outperform the status quo Archive-It surrogate in terms of correct answers at p = 0.0569
• social cards probably provide for better understanding of web archive collections
For more information, see https://arxiv.org/abs/1905.11342