Improving Collection Understanding For Web Archives With Storytelling: Shining Light Into Dark and Stormy Archives - PhD Defense

Collections are the tools that people use to make sense of an ever-increasing number of archived web pages. As collections themselves grow, we need tools to make sense of them. Tools that work on the general web, like search engines, are not a good fit for these collections because search engines do not currently represent multiple document versions well. Web archive collections themselves are vast, some containing hundreds of thousands of documents. There are also thousands of collections, many of which cover the same topic. Few collections include standardized metadata. Too many documents from too many collections with not enough metadata makes collection understanding an expensive proposition.

This dissertation establishes a five-process model to assist with web archive collection understanding. This model aims to automatically produce a social media story -- a visualization paradigm with which most web users are already familiar. Each social media story contains surrogates, which are summaries of individual documents. Collected together, these surrogates summarize the overall topic of the story. When produced by our storytelling model, they summarize the topic of a web archive collection.

We develop and test a framework of primitives for selecting the exemplars that best represent a collection. We establish that algorithms built from these primitives select exemplars that are otherwise undiscoverable using conventional search engine methods. We generate story metadata to improve the information scent of a story so users can understand it better. After an analysis showing that existing platforms perform poorly for web archives and a user study establishing the best surrogate type, we generate document metadata for the exemplars with machine learning. We then visualize the story and document metadata together and distribute the result to satisfy the information needs of multiple personas who benefit from our model.

Our tools serve as a reference implementation of our Dark and Stormy Archives storytelling model. Hypercane selects exemplars and generates story metadata. MementoEmbed generates document metadata. Raintale visualizes and distributes the story based on the story metadata and the document metadata of these exemplars. By providing understanding at a glance, our stories save users the time and effort of reading thousands of documents and, most importantly, help them understand web archive collections.


  1. Improving Collection Understanding For Web Archives With Storytelling: Shining Light Into Dark and Stormy Archives
      Shawn M. Jones
      Los Alamos National Laboratory, Research Library Prototyping Team
      Web Science and Digital Libraries Research Group, Old Dominion University
      Dissertation Defense, 2021/08/05
      Slide footers carry the handles @shawnmjones, @WebSciDL, and @StormyArchives.
      Thanks to:
  2. Outline
      1. Motivation And Research Questions
      2. Background And Related Work
      3. Selecting Exemplars And Generating Story Metadata
      4. Generating Document Metadata
      5. Visualizing And Distributing Stories
      6. Contributions And Conclusion
  3. November 8, 2019
      During the second week of November 2019, the National Center for Medical Intelligence shared intelligence based on "monitoring of internal Chinese communications" that warned of a potential novel coronavirus pandemic coming out of Wuhan. COVID-19 was not named and was only known to a small group in the US. No news coverage existed.
      Source: https://en.wikipedia.org/wiki/Timeline_of_the_COVID-19_pandemic_in_2019
  4. December 16, 2019
      The first documented COVID-19 hospital admission was on December 16, 2019. COVID-19 was still not well known and received no news coverage.
  5. January 13, 2020
      One month later, CNN carries a coronavirus category on its front page.
  6. February 28, 2020
      Another month goes by with more front-page articles about coronavirus.
  7. March 13, 2020
      Two weeks later, CNN had many front-page articles about coronavirus, with a special Coronavirus heading for more articles.
  8. March 20, 2020
      A week later, states are locking down.
  9. March 27, 2020
      A week later, the US has the most cases of any country.
  10. A web archive helped me tell this story.
      These mementos are stored in the Internet Archive. They are full captures of the web code that existed on those dates.
  11. What other stories can we tell with web archives?
      (Section: Motivation and Research Questions)
  12. Natasha is studying how disasters shape cultures...
      Today she is reviewing the South Louisiana Flood of 2016. Sources like Wikipedia now have a summary of the event after the fact. She wants to know about the news reporting as it was at the time of the event.
  13. Per Nwala et al., news articles about the event tend to slide down search results as we get further from the event.
      She knows that five years later, it is harder to find news articles from the event itself.
      Figure legend: green = coverage of the event; red = summaries of the event.
      Reference: A. C. Nwala, M. C. Weigle, and M. L. Nelson, “Scraping SERPs for Archival Seeds: It Matters When You Start,” in ACM/IEEE JCDL, 2018. https://doi.org/10.1145/3197026.3197056.
  14. Natasha also knows that news articles are updated with more current and correct information.
      She wants to know about the news reporting as it was at the time of the event.
      Image labels: "Today" and "8/14/2016 during event".
  15. Natasha knows that any time that we need proof that X said Y at date D, we need web archives.
      She knows that web archives contain not just “screenshots” but full captures of web code as mementos. To start, she must know a URL and capture datetime. Then she can view a memento. And she can review its code, if needed.
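
Slide 15's starting point (a URL plus a capture datetime) can be sketched in Python. The sketch assumes the Internet Archive's Wayback Machine URI pattern of a 14-digit UTC timestamp followed by the original URL. The helper name `memento_uri` and the example capture time are illustrative and do not come from the slides.

```python
from datetime import datetime, timezone

# Replay endpoint of the Internet Archive's Wayback Machine
WAYBACK_PREFIX = "https://web.archive.org/web"


def memento_uri(original_url: str, capture: datetime) -> str:
    """Build a Wayback Machine URI for original_url at the given capture datetime."""
    # The Wayback Machine expresses capture datetimes as YYYYMMDDhhmmss in UTC
    timestamp = capture.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Hypothetical capture time on one of the dates used in the opening story
capture = datetime(2020, 3, 13, 12, 0, 0, tzinfo=timezone.utc)
print(memento_uri("https://www.cnn.com/", capture))
```

The Wayback Machine generally redirects such a request to the capture closest to the requested timestamp, so the datetime does not have to match a capture exactly.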
  16. Natasha also knows that archivists create web archive collections based on a theme.
  17. With these themed collections, she can discover documents that once existed and match her event or topic.
      Examples:
      • Virginia Tech: Crisis, Tragedy, and Recovery Network capturing coverage of the 2011 Tucson Shootings
      • University of Utah capturing its web presence over time
  18. Natasha has discovered multiple sites with themed web archive collections.
      • Library of Congress
      • Archive-It (by the Internet Archive)
      • Trove
      • Conifer
      Each site has different capabilities and different types of collections.
  19. Natasha chooses to look through the themed collections at Archive-It.
      As a popular subscription service of the Internet Archive, Archive-It helps archivists create themed collections. These collections consist of seeds. Mementos are observations of a seed at different points in time. For each seed, there are multiple mementos. The seed shown on the slide has 7 mementos (captured 7 times).
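
The relationship slide 19 describes (a collection holds seeds, and each seed holds mementos captured at different times) can be modeled with a few Python dataclasses. This is a minimal sketch, not Archive-It's data model: the class names, field names, and example URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Memento:
    """One observation of a seed at a specific capture datetime."""
    uri: str
    captured: datetime


@dataclass
class Seed:
    """An original resource that the archivist chose to capture."""
    url: str
    mementos: List[Memento] = field(default_factory=list)


@dataclass
class Collection:
    """A themed set of seeds."""
    name: str
    seeds: List[Seed] = field(default_factory=list)

    def memento_count(self) -> int:
        # Total captures across every seed in the collection
        return sum(len(seed.mementos) for seed in self.seeds)


# A hypothetical seed captured on seven different days, like the seed on the slide
seed = Seed(url="https://example.org/news")
for day in range(1, 8):
    seed.mementos.append(
        Memento(uri=f"https://archive.example/{day}/news", captured=datetime(2016, 8, day))
    )

collection = Collection(name="Example themed collection", seeds=[seed])
print(collection.memento_count())
```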
  20. There are multiple collections about the subject. Which one should she work with?
      This is not the only disaster she is studying. She needs to waste as little time as possible.
  21. Natasha is not alone: 44 Archive-It collections match the search query “human rights”.
      How are they different from each other? Which one is best for her needs?
  22. Rustam needs to study how the Boston Marathon Bombing unfolded…
      Reviewing different mementos of the same seed allows Rustam to understand when the public learned of different events, including when misinformation was corrected. Rather than digging through collections manually, how can Rustam discover and view this more quickly?
  23. Olayinka wants to understand what different news sources revealed on the same day…
      Today she is trying to understand the different reporting on the September 11th Attacks. How can Olayinka discover and view this more quickly?
  24. Elbert is an archivist who wants to promote his collections, so others are aware of them…
      He wants to help visitors like Natasha, Rustam, and Olayinka notice his collections and use them. How does he create enticing visualizations that people can understand with minimal effort?
  25. Ling is an archivist who inherited a collection from another archivist, and she needs to understand it so she can make decisions about it…
      Her collection has hundreds of thousands of seeds. Her predecessor did not provide much metadata with the collection. Archivists can add metadata to collections, but many Archive-It collections contain little metadata. The more metadata a reader needs to understand a collection, the less they have available.
  26. Ling knows she is not alone: the collections are often built automatically, making it difficult to know what they contain.
      Ling knows that the automation makes it expensive to add metadata to thousands of documents after they are collected.
      Slide attribution: Web Archiving Technical Lead of the British Library.
  27. All these personas need a faster method of collection understanding.
      | Persona  | Information need                                      | Role      | Understanding needs           |
      | Natasha  | Quickly compare collections                           | Visitor   | Overall collection            |
      | Rustam   | Follow a source over time                             | Visitor   | Aspect (Page) of a collection |
      | Olayinka | Understand a time from different sources              | Visitor   | Aspect (Time) of a collection |
      | Elbert   | Promote collections and help visitors understand them | Archivist | Overall collection            |
      | Ling     | Understand a collection that they inherited           | Archivist | Overall collection            |
  28. All are faced with more than 14,000 collections at Archive-It alone.
      More than 14,000 collections exist as of the end of 2020.
      [Chart: # of New Archive-It Collections Per Year, 2005 to 2020. Axes: Year, # of Collections. Series: All Collections, Only Private Collections, Only Public Collections.]
  29. The problem, summarized
      • There are multiple collections about the same concept.
      • It is difficult to easily expose aspects (e.g., time, page) of collections.
      • The metadata for each collection is non-existent, or inconsistently applied.
      • Many collections have 1000s of seeds with multiple mementos.
      • There are more than 14,000 collections.
      • Human review of these mementos for collection understanding is an expensive proposition.
  30. Our proposal: a visualization made of exemplar mementos
      • Our visualization is a summary that will act like an abstract
      • Pirolli and Card’s Information Foraging Theory:
          • maximize the value of the information gained from our summaries
          • minimize the cost of interacting with the collection
          • ensure that our exemplar mementos have good information scent
              • contain cues that the memento will address a user’s needs
      From this: 318 seeds with 2421 mementos. To something like this: a social media story of ~28 surrogates.
      Reference: P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
  31. 31. @shawnmjones @StormyArchives Users already interact with pages like this every day 31 A story on Wakelet about the 2021 Capitol Attack Motivation and Research Questions A Twitter Moment of astronaut Michael Collins Twitter creates Moments that present surrogates linking to content about a topic of interest. Educators, librarians, and others create stories on Wakelet about different subjects.
  32. 32. @shawnmjones @StormyArchives Social media stories apply visualizations that users already know how to understand 32 An individual surrogate summarizes a web resource. When we combine surrogates into a story, we summarize a topic. Motivation and Research Questions
  33. 33. @shawnmjones @StormyArchives We developed a five-process storytelling model based on existing work on summarization and storytelling. The five processes: Select Exemplars, Generate Story Metadata, Generate Document Metadata, Visualize The Story, Distribute The Story. [Diagram labels: exemplar mementos; collection title: 2013 Boston Marathon Bombing; collected by: Internet Archive Global Events; collection URL; image data...; seed data...; top terms; top entities...; title: Boston Marathon Explosions...; description: “The grace this tragedy exposed...”; striking image..] AlNoamany found that popular stories contain 28 elements, so we have a target of 28 exemplars. AlNoamany pioneered this work combining web archive collections with Storify, but Storify is now gone. Motivation and Research Questions
  34. 34. @shawnmjones @StormyArchives Our five-process storytelling model maps to our research questions 34 RQ1: What types of web archive collections exist and what are their structural features? RQ2: What approaches work best for selecting exemplars from web archive collections? RQ3: What surrogates work best for understanding groups of mementos? RQ4: What methods that automate the creation of surrogates produce results that best match humans’ behavior? Generate Story Metadata Select Exemplars Generate Document Metadata Visualize The Story Distribute The Story Examples and Use Cases for our Personas Motivation and Research Questions
  35. 35. @shawnmjones @StormyArchives Our Dark and Stormy Archives Tools serve as a reference implementation of our storytelling process 35 Motivation and Research Questions
  36. 36. @shawnmjones @WebSciDL Outline 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion 36
  37. 37. @shawnmjones @StormyArchives URIs identify resources 37 T. Berners-Lee, et al. “RFC 3986 – Uniform Resource Identifier (URI): Generic Syntax”. https://www.rfc-editor.org/rfc/rfc3986.txt, 2005. Jacobs, I. and Walsh, N. eds., “Architecture of the World Wide Web, Vol. 1.” https://www.w3.org/TR/webarch/, 2003. URIs are a superset of identifiers that contains URLs, URNs, etc. Background and Related Work URIs identify resources, which have different representations depending on the visitor’s needs.
  38. 38. @shawnmjones @StormyArchives HTML is the file format we use for web resources 38 HTML contains links to other pages, identified by URIs. Background and Related Work
  39. 39. @shawnmjones @StormyArchives Web archives apply crawlers to quickly visit pages and follow links to build collections Background and Related Work The page to be crawled is a seed or original resource. An observation of that original resource at a specific time is a memento. We use the term URI-M to denote a memento URI. The datetime of a memento’s capture is its memento-datetime. Crawlers save web resources in the WARC file format:
      WARC/1.0
      WARC-Date: 2016-04-01T18:08:53Z
      WARC-Type: response
      WARC-Record-ID: <urn:uuid:8ede0256-790b-4378-a469-cfdcf0b9f8a6>
      WARC-Target-URI: http://philadelphia.feb.gov/
      WARC-Payload-Digest: sha1:USWZWZNSVCOSXA2MBSHMRAVZY7R2ZQSY
      WARC-Block-Digest: sha1:NNBOIVAWIP63IESE3JCE36X6BRE2YYZG
      Content-Type: application/http; msgtype=response
      Content-Length: 19556

      HTTP/1.0 200 OK
      server: Apache-Coyote/1.1
      content-type: text/html;charset=utf-8
      content-length: 35917
      date: Sat, 08 May 2021 16:04:56 GMT

      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      ...
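To make the WARC record on this slide concrete, here is a minimal sketch that reads the named header fields from the start of a record using only the Python standard library. The function name `parse_warc_headers` is hypothetical and the sample is an abbreviated copy of the slide's record; real crawls are normally read with a dedicated WARC library rather than hand-rolled parsing like this.

```python
def parse_warc_headers(record_text):
    """Split the leading WARC header block into (version, {field: value})."""
    lines = record_text.strip().splitlines()
    version = lines[0].strip()  # e.g. "WARC/1.0"
    fields = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the WARC header block
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


# Abbreviated copy of the record shown on the slide.
sample = """WARC/1.0
WARC-Date: 2016-04-01T18:08:53Z
WARC-Type: response
WARC-Target-URI: http://philadelphia.feb.gov/
Content-Type: application/http; msgtype=response
Content-Length: 19556

HTTP/1.0 200 OK
"""

version, fields = parse_warc_headers(sample)
print(fields["WARC-Target-URI"], fields["WARC-Date"])
```

In this record, `WARC-Target-URI` names the original resource that was crawled and `WARC-Date` records when the capture was made.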
  43. 43. @shawnmjones @StormyArchives A TimeMap gives us a listing of the mementos available for an original resource Background and Related Work Callouts on the slide mark the original resource “now” and mementos from 1997, 1999, and 2009:
      <http://www.cs.odu.edu>;rel="original",
      <https://web.archive.org/web/19970102130137/http://cs.odu.edu:80/>;rel="memento"; datetime="Thu, 02 Jan 1997 13:01:37 GMT",
      <https://web.archive.org/web/19970606105039/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Fri, 06 Jun 1997 10:50:39 GMT",
      <http://archive.md/19970606105039/http://www.cs.odu.edu/>;rel="memento"; datetime="Fri, 06 Jun 1997 10:50:39 GMT",
      <https://web.archive.org/web/19971010201632/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Fri, 10 Oct 1997 20:16:32 GMT",
      <https://web.archive.org/web/19971211124211/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Thu, 11 Dec 1997 12:42:11 GMT",
      ...
      <https://web.archive.org/web/19990502033600/http://cs.odu.edu:80/>;rel="memento"; datetime="Sun, 02 May 1999 03:36:00 GMT",
      ...
      <https://arquivo.pt/wayback/20091223043049mp_/http://www.cs.odu.edu/>;rel="memento"; datetime="Wed, 23 Dec 2009 04:30:49 GMT",
      ...
      Van de Sompel, H. Nelson, M. & Sanderson, R. “RFC 7089 – HTTP Framework for Time-Based Access to Resource States -- Memento”. http://www.rfc-editor.org/info/rfc7089. 2013.
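As a rough illustration of how a TimeMap like the one above can be consumed, the sketch below pulls the original resource and the dated mementos out of link-format text. The function name `parse_timemap` is hypothetical, and the regular expression assumes `rel` appears before `datetime` as it does on the slide; a production parser would have to handle other attribute orders and additional relation types.

```python
import re
from email.utils import parsedate_to_datetime

# Matches one entry such as: <uri>;rel="memento"; datetime="Thu, 02 Jan 1997 13:01:37 GMT"
ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"(?:\s*;\s*datetime="([^"]+)")?')


def parse_timemap(link_text):
    """Return (original URI, list of (memento-datetime, URI-M) sorted by time)."""
    original = None
    mementos = []
    for uri, rel, dt in ENTRY.findall(link_text):
        rels = rel.split()  # rel may hold several values, e.g. "first memento"
        if "original" in rels:
            original = uri
        elif "memento" in rels and dt:
            mementos.append((parsedate_to_datetime(dt), uri))
    return original, sorted(mementos)


# Three entries copied from the slide, deliberately out of order.
sample = (
    '<http://www.cs.odu.edu>;rel="original",\n'
    '<https://web.archive.org/web/19990502033600/http://cs.odu.edu:80/>;rel="memento"; '
    'datetime="Sun, 02 May 1999 03:36:00 GMT",\n'
    '<https://web.archive.org/web/19970102130137/http://cs.odu.edu:80/>;rel="memento"; '
    'datetime="Thu, 02 Jan 1997 13:01:37 GMT"'
)

original, mementos = parse_timemap(sample)
for memento_datetime, urim in mementos:
    print(memento_datetime.isoformat(), urim)
```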
  44. 44. @shawnmjones @StormyArchives Others have tackled portions of the problem of summarizing web archives, but only AlNoamany addressed all processes 44 Background and Related Work Some have conflated our steps of generating metadata and visualizing it. Many have focused, and continue to focus, on selecting exemplar words, sentences, images, video clips, and more for summarization. Those who have evaluated surrogates in the past focused on whether the participant chose the correct search engine result, not on understanding. Attempts to manually apply metadata to these collections are hampered by the scale of the problem.
  45. 45. @shawnmjones @StormyArchives AlNoamany identified the characteristics of social media stories and Archive-It collections 45 Background and Related Work Select Exemplars Generate Story Metadata Generate Document Metadata Visualize The Story Distribute The Story By analyzing the characteristics of stories and collections, she determined that popular stories contain 28 elements. Our model maps to hers but expands her visualize step. AlNoamany’s sieve diagram gives us one solution for storytelling. We will explore others. Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From Archived Collections,” in ACM Web Science, pp. 309–318, 2017. https://doi.org/10.1145/3091478.3091508.
  46. 46. @shawnmjones @StormyArchives Select Exemplars AlNoamany extracted some story metadata and relied on Storify to create and distribute the resulting visualization. 46 Background and Related Work Generate Story Metadata Generate Document Metadata Visualize The Story Distribute The Story Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From Archived Collections,” in ACM Web Science, pp. 309–318, 2017. https://doi.org/10.1145/3091478.3091508.
  47. 47. @shawnmjones @StormyArchives Her proof-of-concept generated some document metadata and relied on Storify to generate the rest. 47 Background and Related Work Generate Story Metadata Generate Document Metadata Select Exemplars Visualize The Story Distribute The Story Storify AlNoamany’s Proof-of-Concept (POC) Both POC and Storify Generated Portions of Document Metadata Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From Archived Collections,” in ACM Web Science, pp. 309–318, 2017. https://doi.org/10.1145/3091478.3091508.
  48. 48. @shawnmjones @StormyArchives She generated many different stories based on exemplars selected by her proof-of-concept 48 Generate Story Metadata Generate Document Metadata Select Exemplars Visualize The Story Distribute The Story Storify AlNoamany’s Proof-of-Concept (POC) Both POC and Storify Generated Portions of Document Metadata Background and Related Work Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From Archived Collections,” in ACM Web Science, pp. 309–318, 2017. https://doi.org/10.1145/3091478.3091508.
  49. 49. @shawnmjones @StormyArchives Through a user study, she demonstrated that participants could tell the difference between her solution’s stories and randomly generated stories 49 Background and Related Work Participants could not tell the difference between her solution’s stories and those generated by human archivists Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From Archived Collections,” in ACM Web Science, pp. 309–318, 2017. https://doi.org/10.1145/3091478.3091508.
  50. 50. @shawnmjones @StormyArchives Unfortunately, her solution is difficult to generalize 50 Generate Story Metadata Generate Document Metadata Select Exemplars Visualize The Story Distribute The Story Storify AlNoamany’s Proof-of-Concept (POC) Both POC and Storify Generated Portions of Document Metadata Background and Related Work Adobe shut down the Storify platform in 2018. AlNoamany’s POC focused on Archive-It.
  51. 51. @shawnmjones @WebSciDL Outline 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion 51
  52. 52. @shawnmjones @StormyArchives As collection users, what structural features can we view from outside? 52 § Using only structural features is advantageous because it saves one from having to download a collection’s content. § These structural features give us different insight than can be provided by text analysis or metadata. 81,014 seeds 486,227 seed mementos Structural features shown here: • number of seeds • number of mementos S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P Selecting Exemplars and Generating Story Metadata
  53. 53. @shawnmjones @StormyArchives Was the collection built from web sites belonging to one domain or many? 53 Many domains One domain Structural feature discussed here: • domain diversity S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P Selecting Exemplars and Generating Story Metadata
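One crude way to quantify the "one domain or many" question is the ratio of unique hostnames to seeds. This is only an assumption made for illustration; the cited paper's exact definition of domain diversity may differ, and the name `domain_diversity` and the example seeds are hypothetical.

```python
from urllib.parse import urlparse


def domain_diversity(seed_uris):
    """Ratio of unique hostnames to seeds.

    Equals 1/len(seed_uris) when every seed shares one host and 1.0 when
    every seed comes from a different host.
    """
    hosts = [urlparse(uri).hostname for uri in seed_uris]
    return len(set(hosts)) / len(hosts)


seeds = [
    "http://a.example.org/news/1",
    "http://a.example.org/news/2",
    "http://b.example.org/",
    "http://c.example.org/about",
]
print(domain_diversity(seeds))
```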
  54. 54. @shawnmjones @StormyArchives Were most of the web pages in the collection top-level pages or specific articles deeper in a web site? 54 Top-level pages Deeper links Structural feature discussed here: • path depth diversity • most frequent path depth S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P Selecting Exemplars and Generating Story Metadata
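A small sketch of the path depth features, under the assumption that depth is the number of non-empty path segments in a seed URI, so a top-level page has depth 0. The function names and example seeds are hypothetical, and the paper may count depth differently.

```python
from collections import Counter
from urllib.parse import urlparse


def path_depth(uri):
    """Number of non-empty path segments; a top-level page has depth 0."""
    return len([seg for seg in urlparse(uri).path.split("/") if seg])


def most_frequent_path_depth(seed_uris):
    """The depth shared by the largest number of seeds."""
    depths = Counter(path_depth(uri) for uri in seed_uris)
    return depths.most_common(1)[0][0]


seeds = [
    "http://example.com/",
    "http://example.org/",
    "http://example.net/",
    "http://example.com/news/2013/story.html",
]
print([path_depth(s) for s in seeds], most_frequent_path_depth(seeds))
```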
  55. 55. @shawnmjones @StormyArchives Growth curves provide some understanding of collection curation behavior • Skew of the collection’s holdings: indicates temporality of collection • Skew of the curatorial involvement with the collection: when seeds were added; when interest was lost or regained [Figure: example growth curves labeled with positive and negative skew] S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P Selecting Exemplars and Generating Story Metadata
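A growth curve can be sketched as the cumulative fraction of items (seeds or mementos) added as a fraction of the collection's lifespan elapses. The code below is an illustration under that assumption; `growth_curve` and the sample dates are hypothetical, not the authors' implementation.

```python
from datetime import datetime, timedelta


def growth_curve(datetimes):
    """Return (fraction of lifespan elapsed, fraction of items added) points."""
    ordered = sorted(datetimes)
    start, end = ordered[0], ordered[-1]
    span = (end - start).total_seconds() or 1.0  # avoid dividing by zero
    total = len(ordered)
    return [
        ((dt - start).total_seconds() / span, (i + 1) / total)
        for i, dt in enumerate(ordered)
    ]


# Hypothetical collection: three items arrive in the first 20 days of a 100-day life.
first = datetime(2010, 1, 1)
added = [first + timedelta(days=d) for d in (100, 0, 10, 20)]
curve = growth_curve(added)
for elapsed, fraction in curve:
    print(f"{elapsed:.2f} of lifespan -> {fraction:.2f} of items")
```

A curve that rises steeply near the start, as in this sample, means most of the items were added early in the collection's life.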
  56. 56. @shawnmjones @StormyArchives We discovered four semantic categories in Archive-It collections 56 Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 4.2% of collections Some evaluated by AlNoamany In a study of 3,382 Archive-It collections Selecting Exemplars and Generating Story Metadata S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  57. 57. @shawnmjones @StormyArchives Self-Archiving collections dominate Archive-It 57 54.1% of collections 27.6% 14.1% In a study of 3,382 Archive-It collections Selecting Exemplars and Generating Story Metadata Subject-based Time Bounded – Expected Time Bounded – Spontaneous 4.2% Organizations archiving themselves or those they are responsible for. S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  58. 58. @shawnmjones @StormyArchives Subject-based collections come in second 58 27.6% of collections 14.1% In a study of 3,382 Archive-It collections Selecting Exemplars and Generating Story Metadata Time Bounded – Expected Time Bounded – Spontaneous 4.2% Collections centered on a subject that is not ephemeral. 54.1% Self-archiving S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  59. 59. @shawnmjones @StormyArchives Time Bounded – Expected collections summarize events we anticipate 59 14.1% of collections In a study of 3,382 Archive-It collections Selecting Exemplars and Generating Story Metadata Time Bounded – Spontaneous 4.2% Collections about an anticipated event. 54.1% Self-archiving 27.6% Subject-based S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  60. 60. @shawnmjones @StormyArchives Time Bounded – Spontaneous collections summarize unexpected events 60 4.2% of collections In a study of 3,382 Archive-It collections Selecting Exemplars and Generating Story Metadata Collections about an unexpected event. Some of these were evaluated by AlNoamany. 54.1% Self-archiving 27.6% Subject-based 14.1% Time Bounded – Expected S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  61. 61. @shawnmjones @StormyArchives We can bridge the structural to the descriptive… 61 Self-Archiving 54.1% of collections Subject-based 27.6% of collections Time Bounded – Expected 14.1% of collections Time Bounded – Spontaneous 4.2% of collections Some evaluated by AlNoamany Using the structural features mentioned previously, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720 Selecting Exemplars and Generating Story Metadata S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  62. 62. @shawnmjones @StormyArchives RQ1: What types of web archive collections exist and what structural features do they have? Based on a manual review of 3,382 Archive-It collections, we classified them into 4 types (type, % of Archive-It collections: description; example collection):
      • Self-Archiving, 54.1%: an organization archiving itself; example: University of Utah Web Archive
      • Subject-based, 27.6%: seeds bound by single topic; example: Environmental Justice
      • Time Bounded – Expected, 14.1%: an expected event or time period; example: 2008 Olympics
      • Time Bounded – Spontaneous, 4.2%: unexpected event; example: Tucson Shootings
      Growth curves give us some idea of the curatorial involvement with a collection over time. When selecting exemplars, we need to summarize the collection in terms of time and topic. The shapes of these growth curves indicate how we might cluster in time. This example growth curve shows us that 30% of the seeds were added early in the collection’s life. Structurally, for seeds, we can study the: • distribution of domains • distribution of path depths • most frequent path depth • query string usage S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P Selecting Exemplars and Generating Story Metadata
  63. 63. @shawnmjones @StormyArchives Identifying off-topic mementos is key to choosing exemplar mementos 63 Hacked Moved on from topic Collections have a topic. Seeds are selected to support that topic. Mementos are observations of seeds. Some of these versions are off-topic. Excluding these off-topic mementos from consideration is key to selecting exemplars. Web Page Gone Account Suspension S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87 Selecting Exemplars and Generating Story Metadata
  64. 64. @shawnmjones @StormyArchives We found that Word Count had the best F1 score for identifying off-topic mementos 64 We reused AlNoamany’s labeled dataset. She did not try: • Sorensen-Dice • Simhash of raw content • Simhash of TF • Gensim LSI Our word count accuracy came out ahead of AlNoamany’s. S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87 Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web archives,” International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5 Y. AlNoamany and S. M. Jones, “Off-Topic Gold Standard Dataset,” GitHub. 2018. https://github.com/oduwsdl/offtopic-goldstandard-data Selecting Exemplars and Generating Story Metadata
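The word count measure can be sketched as follows: compare each memento's word count with that of the first memento in its TimeMap (assumed here to be on-topic) and flag mementos whose count drops sharply. The function names are hypothetical and the default `threshold` is an illustrative assumption, not a value taken from these slides.

```python
import re


def word_count(text):
    """Count word-like tokens in a memento's extracted text."""
    return len(re.findall(r"\w+", text))


def off_topic_by_word_count(memento_texts, threshold=-0.85):
    """Flag mementos whose word count fell sharply relative to the first memento.

    `threshold` is the relative change below which a memento is flagged; the
    default is an assumption made for this sketch.
    """
    baseline = word_count(memento_texts[0])
    flags = []
    for text in memento_texts:
        change = (word_count(text) - baseline) / baseline if baseline else 0.0
        flags.append(change < threshold)
    return flags


# Hypothetical TimeMap: the third memento has collapsed to a one-word page.
texts = ["word " * 100, "word " * 95, "gone"]
print(off_topic_by_word_count(texts))
```

A page that has been replaced by an account suspension or "page gone" notice usually carries far fewer words than the original, which is the situation this kind of check is meant to catch.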
  65. 65. @shawnmjones @StormyArchives Filtering off-topic mementos is just one step in a set of algorithmic primitives for selecting exemplars 65 We can filter the collection to get a good set of exemplars and then randomly sample from the remainder. Selecting Exemplars and Generating Story Metadata
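The filter-then-sample idea on this slide can be sketched in a few lines. The function name, the `keep` predicate, and the dictionary layout of the mementos are hypothetical choices for illustration; this is not Hypercane's interface.

```python
import random


def filter_then_sample(mementos, keep, k, seed=None):
    """Filter the collection with `keep`, then randomly sample up to k mementos."""
    candidates = [m for m in mementos if keep(m)]
    rng = random.Random(seed)  # seeded so a run can be repeated
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical collection where every third memento was judged off-topic.
collection = [{"urim": f"urim-{i}", "off_topic": i % 3 == 0} for i in range(30)]
chosen = filter_then_sample(collection, lambda m: not m["off_topic"], k=5, seed=42)
print([m["urim"] for m in chosen])
```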
  66. 66. @shawnmjones @StormyArchives Ordering allows us to create meaning from a list of mementos 66 We can order the collection by some feature and then systematically sample every jth memento from the remainder. Selecting Exemplars and Generating Story Metadata
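Ordering followed by systematic sampling can be illustrated like this, assuming the sample starts with the first memento in the ordering. The function name and the tuple layout of the mementos are hypothetical.

```python
def order_then_systematic_sample(mementos, key, j):
    """Order by `key`, then keep every jth memento, starting with the first."""
    ordered = sorted(mementos, key=key)
    return ordered[::j]


# Hypothetical mementos as (memento-datetime string, URI-M) pairs, initially unordered.
collection = [(f"2013-04-{day:02d}", f"urim-{day}") for day in (9, 3, 1, 7, 5, 2, 8, 4, 6, 10)]
sampled = order_then_systematic_sample(collection, key=lambda m: m[0], j=3)
print(sampled)
```

Ordering by memento-datetime before sampling keeps the sample spread across the collection's timeline.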
  67. 67. @shawnmjones @StormyArchives Scoring gives us an idea of how well a memento meets the information needs represented by a function. We can combine filter, score, and order to create a simple search engine. [Diagram: web archive collection → filter (include-only mementos containing a given pattern; reduces number of pages to consider) → score (score results with BM25; scores results) → order (order by descending score; orders results) → exemplars] Selecting Exemplars and Generating Story Metadata
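The filter, score, and order pipeline can be sketched as a toy search engine. The BM25 below is a plain textbook Okapi variant written for illustration, with conventional defaults for `k1` and `b`; all function names are hypothetical, and the scoring used by the authors' tools may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = (sum(len(t) for t in tokenized) / n) or 1.0
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def search(query, docs, pattern):
    kept = [d for d in docs if re.search(pattern, d, re.IGNORECASE)]  # filter
    if not kept:
        return []
    scored = zip(bm25_scores(query, kept), kept)                      # score
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)   # order
    return [doc for _, doc in ranked]


docs = [
    "boston red sox win game",
    "boston marathon explosions near finish line",
    "recipe for clam chowder",
]
results = search("marathon explosions", docs, pattern="boston")
print(results)
```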
  68. 68. @shawnmjones @StormyArchives Clustering based on a feature allows us to imbue subsets of mementos with meaning 68 With these primitives, we can reproduce AlNoamany’s Algorithm which we will now call DSA1. Selecting Exemplars and Generating Story Metadata
  69. 69. @shawnmjones @StormyArchives These primitives allow us to create other algorithms for selecting exemplars that tell the story the user desires 69 DSA2 focuses on representing collection growth curves and scoring mementos by their surrogate metadata. DSA3 focuses on mementos that best match the collection topic. DSA4 focuses on finding the most novel mementos in the collection. Selecting Exemplars and Generating Story Metadata
  70. 70. @shawnmjones @StormyArchives Search engines are the de-facto method of exploring collections; if we consider them a baseline, then how retrievable are the exemplars produced by DSA algorithms? 70 Selecting Exemplars and Generating Story Metadata We loaded 8 different Archive-It collections into different instances of the SolrWayback web archive search engine. We also executed 4 different DSA algorithms to produce exemplars from these collections. Web archive collection exemplars Web archive collection exemplars exemplars Web archive collection Web archive collection exemplars
  71. 71. @shawnmjones @StormyArchives We then generated queries with four different methods based on the content of the exemplars produced by each DSA algorithm 71 Selecting Exemplars and Generating Story Metadata
  72. 72. @shawnmjones @StormyArchives We visualized the percentage of exemplars that were never retrieved by any query Selecting Exemplars and Generating Story Metadata x-axis: the number of search results to review before we find the exemplar. y-axis: the percentage of exemplars that have zero retrievability. In this graph, we are reporting zero retrievability with: • queries from doc2query-T5 • for exemplars chosen by DSA3. At 10 search results, 57.82% of the exemplars were not retrieved. After 1000 search results, 36.05% of the exemplars were not retrieved.
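The measure plotted on this slide can be sketched as the percentage of exemplars that do not appear within the first `cutoff` results of any query. The function name and the toy result lists are hypothetical.

```python
def zero_retrievability(exemplars, ranked_results, cutoff):
    """Percentage of exemplars absent from the top `cutoff` results of every query."""
    retrieved = set()
    for results in ranked_results:
        retrieved.update(results[:cutoff])
    missed = [e for e in exemplars if e not in retrieved]
    return 100.0 * len(missed) / len(exemplars)


# Hypothetical data: four exemplars and the ranked results of two queries.
exemplars = ["a", "b", "c", "d"]
ranked_results = [["a", "x", "b"], ["y", "c", "z"]]
for cutoff in (1, 3):
    print(cutoff, zero_retrievability(exemplars, ranked_results, cutoff))
```

Raising the cutoff can only lower the percentage, which matches the shape of the curve described on the slide.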
  74. 74. @shawnmjones @StormyArchives For all query methods the DSA algorithms’ exemplars have similar retrievability 74 Selecting Exemplars and Generating Story Metadata If all pages are relevant, then DSA algorithms produce mementos with more novelty than standard query methods can retrieve with a state-of-the-art web archive search engine. DSA4 was designed to surface more novel mementos and meets its goal in these results.
  75. 75. @shawnmjones @StormyArchives RQ2: Which approaches work best for selecting exemplars from web archive collections? 75 We established that four different algorithms produced from these primitives will select exemplars that were not retrievable using standard query methods and a state-of-the-art web archive search engine. Removing off-topic mementos is but one step toward selecting exemplars. We devised a set of primitives for creating many different types of sampling algorithms that consider structural features. An important step in selecting exemplars to summarize the collection is identifying off-topic mementos. We found that word count differences work best. S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87 Selecting Exemplars and Generating Story Metadata
  76. 76. @shawnmjones @StormyArchives We implemented these primitives as part of Hypercane 76 Hypercane was used to conduct the experiments in this section. Selecting Exemplars and Generating Story Metadata S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson. 2021. Hypercane: Intelligent Sampling for Web Archive Collections. In ACM/IEEE JCDL 2021. [to be published in September 2021]
  77. 77. @shawnmjones @WebSciDL Outline 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion 77
  78. 78. @shawnmjones @StormyArchives We evaluated 55 platforms in 2017 and found that existing social platforms do not reliably produce surrogates for mementos 78 Generating Document Metadata If we cannot rely upon the service to generate a surrogate, we will have to create our own. Which surrogate works best for understanding web archive collections? S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
  79. 79. @shawnmjones @StormyArchives We reused exemplars that archivists had selected to describe their own collections to create stories with different surrogates... 79 Generating Document Metadata
  80. 80. @shawnmjones @StormyArchives Archive-It like surrogates visualize these mementos as they are on Archive-It 80 Archive-It like surrogate S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata This is a screenshot of a subset of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  81. 81. @shawnmjones @StormyArchives Browser thumbnails are screenshots of the page in a browser 81 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata Browser thumbnails Browser thumbnails are a popular surrogate type used at web archives. This is a screenshot of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  82. 82. @shawnmjones @StormyArchives Social cards come from social media platforms 82 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata Social cards Social cards are a type of surrogate typically found on social media platforms like Facebook or Twitter. These social cards were specially designed to include information from web archives. This is a screenshot of a subset of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  83. 83. @shawnmjones @StormyArchives sc/t combines social cards and thumbnails 83 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata sc/t We replaced the striking image of the social card with a browser thumbnail. This is a screenshot of a subset of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  84. 84. @shawnmjones @StormyArchives sc+t places the social card to the left and a thumbnail to the right 84 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata sc+t Our thought was that more information was better. This is a screenshot of a subset of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  85. 85. @shawnmjones @StormyArchives sc^t is interactive 85 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata sc^t When a user hovers over the striking image the browser thumbnail appears. This provides both types of surrogates in a smaller space. This is a screenshot of a subset of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  86. 86. @shawnmjones @StormyArchives We then presented these stories to Mechanical Turk (MT) participants 86 Archive-It like Social Card Browser thumbnails Social Card With Thumbnail as Image (sc/t) Social Card With Thumbnail to Right (sc+t) Social Card with Thumbnail on Hover (sc^t) • 4 stories of 15-17 URI-Ms selected by human Archive-It curators from their collections • 6 different surrogate types • 24 different story-surrogate combinations • 120 MT participants • Given 30 seconds to view each story S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata
  87. 87. @shawnmjones @StormyArchives And then asked them which of the following come from the same collection… 87 • Each participant was shown a list of 6 surrogates of the same type as the story they just viewed. • They were asked to choose the 2 that they thought came from the same collection. • They were given as much time as they wished to answer the question. • This process is like the Sentence Verification Task used in reading comprehension studies. S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata
  88. 88. @shawnmjones @StormyArchives Social cards probably outperform the Archive-It surrogate for participants’ correct answers [Chart: Correct Answers Per Surrogate; surrogates: Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, sc^t; series: Median, Mean; annotated p = 0.0569 and p = 0.0770] S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata
  89. 89. @shawnmjones @StormyArchives Social cards produced less interaction while participants viewed their stories 89 We measured clicks and hovers by participants while they were viewing their stories. For browser thumbnails alone, most of the participants clicked the link to view the actual memento behind the surrogate. S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata
  90. 90. @shawnmjones @StormyArchives RQ3: What surrogates work best for understanding groups of mementos? 90 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Correct answers per surrogate indicate that social cards probably outperform the Archive-It surrogate • 4 stories of 15-17 mementos selected by human curators from their own collections • 6 different surrogate types • 24 different story-surrogate combinations • Each given 30 seconds to view a story, then asked a question From a user study with 120 Mechanical Turk participants: With social cards, users were able to correctly answer our questions without as much interaction. Generating Document Metadata
  91. 91. @shawnmjones @StormyArchives Social cards are generated based on the HTML metadata that authors provide og:title -or- twitter:title -or- <title> og:description -or- twitter:description -or- description og:image -or- twitter:image Without twitter:card and og:title or twitter:title, Twitter gives up and does not generate a card. Facebook parses the <title> and produces a card with just a title. S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images for Social Cards. In ACM WebSci ‘21. https://arxiv.org/pdf/2103.04899. 91 Generating Document Metadata What do we do if this metadata does not exist?
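The fallback order on this slide (og:title, then twitter:title, then <title>, and likewise for description and image) can be sketched with the Python standard library. This is a minimal illustration, not the code used in the study or in MementoEmbed; the names `CardMetadataParser`, `first_present`, and `card_fields` are hypothetical.

```python
from html.parser import HTMLParser


class CardMetadataParser(HTMLParser):
    """Collect <meta> values and the <title> text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # OGP fields use property=, Twitter and plain HTML fields use name=
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                # Keep the first value seen for each field
                self.meta.setdefault(key.lower(), attrs["content"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.title = (self.title or "") + data.strip()


def first_present(values):
    """Return the first non-empty value, or None if every value is missing."""
    for value in values:
        if value:
            return value
    return None


def card_fields(html):
    """Apply the slide's fallback order to one HTML document (hypothetical helper)."""
    parser = CardMetadataParser()
    parser.feed(html)
    m = parser.meta
    return {
        "title": first_present([m.get("og:title"), m.get("twitter:title"), parser.title]),
        "description": first_present(
            [m.get("og:description"), m.get("twitter:description"), m.get("description")]
        ),
        "image": first_present([m.get("og:image"), m.get("twitter:image")]),
    }
```

A field that comes back as `None` is one that would have to be generated, which is the question the slide ends on.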
  92. 92. @shawnmjones @StormyArchives We analyzed 277,724 news articles captured by the Internet Archive from 1998 to 2016, and found different rates of metadata adoption OGP = Open Graph Protocol Facebook Cards 150 billion documents in the Internet Archive were captured before 2010 and thus have no card metadata 92 Generating Document Metadata S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505. S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.]
  93. 93. @shawnmjones @StormyArchives By applying author behavior, we can generate descriptions 93 Generating Document Metadata We used the existing field values, written by page authors, as ground truth data. It tells us that authors tend to write card descriptions that have the following lengths: • 268 characters • 52 words • 2 sentences We can use this length as input to automatic text summarization algorithms.
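One way to use these lengths is as a budget for an extractive summarizer. The sketch below is only an illustration: it scores sentences by average word frequency, a simple stand-in rather than the summarization algorithm evaluated in the work, and the function name `summarize` and the regular-expression sentence splitter are assumptions.

```python
import re
from collections import Counter

# Budgets taken from the slide: author-written descriptions run about
# 268 characters and 2 sentences.
MAX_CHARS = 268
MAX_SENTENCES = 2


def summarize(text, max_chars=MAX_CHARS, max_sentences=MAX_SENTENCES):
    """Pick the highest-scoring sentences that fit inside the length budget."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average document-wide frequency of the sentence's words
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        cost = len(sentences[i]) + (1 if chosen else 0)  # +1 for the joining space
        if len(chosen) < max_sentences and used + cost <= max_chars:
            chosen.append(i)
            used += cost
    # Emit the chosen sentences in their original order
    return " ".join(sentences[i] for i in sorted(chosen))
```

Swapping the `score` function for a graph-based ranker would keep the same budget logic.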
  94. 94. @shawnmjones @StormyArchives Generating Document Metadata If no metadata exists, we can select a striking image from the images available in the document Which of the images outlined in red is the striking one chosen by the author? How would a machine know which one to choose if there were no striking image specified in the metadata? 94
  95. 95. @shawnmjones @StormyArchives Our generic image selection approach has 3 steps 1. Score each image in the document by some approach (e.g., ML probability, feature value) 2. Sort the list of images by descending score (e.g., highest ML probability is first, image with most colors is first) 3. Choose the image at the beginning of the list (highest scoring) [Figure: candidate images sorted by color count (154,131; 48,020; 44,737; 30,940; 3,816 colors) and sorted by classifier probability (0.3623; 0.1948; 0.1259; 0.1116; 0.11)] Generating Document Metadata
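The three steps above can be sketched as a single function that takes any scoring callable. The candidate records and their values below are hypothetical and only show how a color-count score and a classifier-probability score plug into the same routine.

```python
def select_striking_image(images, score):
    """Score each image, sort by descending score, and return the top one."""
    if not images:
        return None
    ranked = sorted(images, key=score, reverse=True)  # steps 1 and 2
    return ranked[0]                                  # step 3


# Hypothetical candidates; the field names and values are illustrative only.
candidates = [
    {"uri": "logo.png", "colors": 3816, "probability": 0.11},
    {"uri": "photo.jpg", "colors": 154131, "probability": 0.36},
    {"uri": "banner.jpg", "colors": 48020, "probability": 0.19},
]

by_colors = select_striking_image(candidates, score=lambda img: img["colors"])
by_classifier = select_striking_image(candidates, score=lambda img: img["probability"])
```

Keeping the score as a parameter is what makes the approach generic: only the scoring function changes between feature-based and classifier-based selection.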
  96. 96. @shawnmjones @StormyArchives We visualized how well different approaches performed at choosing a striking image that was perceptually the same as the author’s 96 The best approach starts here As we proceed to the right, we accept more images as perceptually equal to the one selected by the approach All lines converge as any image becomes acceptable as correct Higher scores indicate more accurate answers Remember: we are trying to find the approach that best selects the striking image chosen by the author Generating Document Metadata S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images for Social Cards. In ACM WebSci ‘21. https://doi.org/10.1145/3447535.3462505.
  97. 97. @shawnmjones @StormyArchives We found that Random Forest performed best with base image features quickly calculated via standard image libraries 97 S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images for Social Cards. In ACM WebSci ‘21. https://doi.org/10.1145/3447535.3462505. Generating Document Metadata P@1=0.831 MRR=0.883 base image features: • byte size • width in pixels • height in pixels • negative space (# of histogram cols = 0) • size in pixels • aspect ratio • number of colors
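For readers unfamiliar with the two scores on this slide, P@1 (precision at 1) and MRR (mean reciprocal rank) can be computed as follows. This sketch uses the standard definitions and assumes exact matching between the ranked image and the author's choice; the study itself compares images perceptually, which is not modeled here.

```python
def precision_at_1(rankings, truths):
    """Fraction of documents whose top-ranked image is the author's image."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if ranked and ranked[0] == truth)
    return hits / len(rankings)


def mean_reciprocal_rank(rankings, truths):
    """Average of 1/rank of the author's image (0 if it was never ranked)."""
    total = 0.0
    for ranked, truth in zip(rankings, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(rankings)


# Hypothetical rankings for three documents and the author's chosen image in each
rankings = [["a", "b"], ["b", "a"], ["c", "d", "e"]]
truths = ["a", "a", "e"]
p_at_1 = precision_at_1(rankings, truths)      # 1 of 3 documents correct at rank 1
mrr = mean_reciprocal_rank(rankings, truths)   # (1 + 1/2 + 1/3) / 3
```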
  98. 98. @shawnmjones @StormyArchives RQ4: What methods that automate the creation of surrogates produce results that best match humans' behavior? 98 Generating Document Metadata Authors write card descriptions that are 268 characters, 52 words, or 2 sentences long. We can use this length as input to automatic text summarization algorithms, like TextRank. With base image features Random Forest performed best for choosing the same striking image as the author. S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505. S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.] We analyzed the metadata usage of news article mementos over time. Metadata fields associated with cards had astronomical growth.
  99. 99. @shawnmjones @StormyArchives We implemented these results as part of MementoEmbed. Surrogate types shown: Cards, Browser Thumbnails, Imagereels, Word Clouds. Generating Document Metadata As an archive-aware surrogate service, MementoEmbed provides different types of surrogates for mementos. It also has an extensive API for generating document metadata. S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
  100. 100. @shawnmjones @WebSciDL Outline 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion 100
  101. 101. @shawnmjones @StormyArchives Because Storify was gone, we created Raintale for visualizing and distributing stories 101 Visualizing And Distributing Stories S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137 Storify provided an API, allowing us to configure the look and feel of our story. With this functionality gone, we created Raintale, a platform agnostic storytelling tool that generates files or social media posts.
  102. 102. @shawnmjones @WebSciDL Remember, Elbert wants to promote his collections for others, and he uses the DSA Toolkit to do so 102 Today he is promoting a collection about COVID-19. Visualizing And Distributing Stories From this: 23,376 mementos To this: a sample of 36 mementos visualized as social cards, phrases, and images S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
  103. 103. @shawnmjones @StormyArchives Elbert applies all processes of our storytelling model 103 Visualizing And Distributing Stories Generate Story Metadata Select Exemplars Generate Document Metadata Visualize The Story Distribute The Story S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
  104. 104. @shawnmjones @WebSciDL Remember, Natasha needs to compare collections to each other 104 Today she is reviewing different collections about shootings. Virginia Tech El Paso Norway Visualizing And Distributing Stories S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
  105. 105. @shawnmjones @StormyArchives Ling inherited a collection and needs to know what it contains 105 Ling can apply our processes with a different template to include other information, like structural features. Visualizing And Distributing Stories To this: 50 exemplars, structural features, metadata analysis, growth curves, and more From this: 88,755 mementos and no metadata
  106. 106. @shawnmjones @WebSciDL Rustam wants to see how a page changed over time 106 Visualizing And Distributing Stories Generate Story Metadata Select Exemplars Generate Document Metadata Visualize The Story Distribute The Story Rustam uses Hypercane to help him choose a page and then view its change over time.
  107. 107. @shawnmjones @StormyArchives Rustam chooses one of Raintale’s default templates because he is using the DSA Toolkit for exploration 107 Visualizing And Distributing Stories Rustam’s story seems plain, but he is really interested in the changing text over time.
  108. 108. @shawnmjones @WebSciDL Olayinka wants to see what different news sources said on the same day in different years 108 Visualizing And Distributing Stories With our SHARI process, she can compare different years to each other 2018 US Elections 2020 COVID-19 2019 Mass shootings in El Paso and Dayton S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00139
  109. 109. @shawnmjones @StormyArchives Olayinka can look through the stories produced by our SHARI process to perform her comparisons 109 Visualizing And Distributing Stories Our process is not just limited to our implementation, and allows us to incorporate input from other systems, like StoryGraph. Generate Document Metadata Visualize The Story Distribute The Story Select Exemplars Generate Story Metadata S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00139
  110. 110. @shawnmjones @WebSciDL Outline 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion 110
  111. 111. @shawnmjones @StormyArchives We presented a model for storytelling with web archives 111 Contributions
  112. 112. @shawnmjones @WebSciDL We established a vocabulary for different types and structural features of collections. Based on a manual review of 3,382 Archive-It collections, we classified them into 4 types:
      Type | % of Archive-It Collections | Description | Example Collection
      Self-Archiving | 54.1% | an organization archiving itself | University of Utah Web Archive
      Subject-based | 27.6% | seeds bound by single topic | Environmental Justice
      Time Bounded – Expected | 14.1% | an expected event or time period | 2008 Olympics
      Time Bounded – Spontaneous | 4.2% | unexpected event | Tucson Shootings
      Growth curves give us some idea of the curatorial involvement with a collection over time. Structurally, for seeds, we can study the: • distribution of domains • distribution of path depths • most frequent path depth • query string usage S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P Contributions
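The seed-level structural features listed on this slide can be computed directly from a list of seed URIs. The sketch below is an illustration under stated assumptions: path depth is taken to be the number of non-empty path segments, and the function name `seed_features` and the output keys are hypothetical rather than taken from the authors' tools.

```python
from collections import Counter
from urllib.parse import urlparse


def seed_features(seed_uris):
    """Summarize domains, path depths, and query string usage for a list of seeds."""
    domains = Counter()
    depths = Counter()
    with_query = 0
    for uri in seed_uris:
        parts = urlparse(uri)
        domains[parts.netloc.lower()] += 1
        # Assumption: depth = number of non-empty segments in the path
        depth = len([segment for segment in parts.path.split("/") if segment])
        depths[depth] += 1
        if parts.query:
            with_query += 1
    n = len(seed_uris)
    return {
        "domain_distribution": dict(domains),
        "path_depth_distribution": dict(depths),
        "most_frequent_path_depth": depths.most_common(1)[0][0] if depths else None,
        "query_string_share": with_query / n if n else 0.0,
    }


# Hypothetical seeds for illustration
features = seed_features([
    "http://example.org/",
    "http://example.org/a/b",
    "http://news.example.com/a/b?id=1",
    "http://news.example.com/x/y",
])
```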
  113. 113. @shawnmjones @StormyArchives Word count is a fast, effective intra-TimeMap method of identifying off-topic mementos. Off-topic examples shown: technical problems, page gone, hacking, moving on from topic. S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87 Contributions
  114. 114. @shawnmjones @WebSciDL We devised a set of primitives for intelligently selecting exemplars from web archive collections 114 Contributions
  115. 115. @shawnmjones @StormyArchives Hypercane implements our primitives for selecting exemplars 115 ACM/IEEE JCDL 2021 ACM SIGWEB Newsletter 2021 S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. Hypercane: Intelligent Sampling for Web Archive Collections. In ACM/IEEE Joint Conference on Digital Libraries, 2021. [To be published in 2021] S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. Hypercane: Toolkit for Summarizing Large Collections of Archived Pages. In SIGWEB Newsletter Autumn, 2021. [To be published in 2021] Contributions
  116. 116. @shawnmjones @WebSciDL We created four different algorithms from these primitives and found that they produce exemplars with low retrievability with a state-of-the-art search engine 116 We applied four different query methods to the mementos surfaced by these algorithms. As designed, our DSA4 algorithm surfaced more novel exemplars than those discoverable via the search engine. We measured mean retrievability and zero retrievability to determine how easy a document was to retrieve with the given query method. Contributions
  117. 117. @shawnmjones @StormyArchives Our user study provides engineers support for choosing social cards over other surrogate types 117 From our user study, correct answers per surrogate indicate that social cards probably outperform the Archive-It surrogate With social cards, users were able to correctly answer our questions without as much interaction. ACM CIKM 2019 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM International Conference on Information and Knowledge Management, 2019. https://doi.org/10.1145/3357384.3358039. Contributions
  118. 118. @shawnmjones @WebSciDL We established methods for generating the metadata for social cards if it does not exist 118 S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505. ACM Web Science 2021 For choosing striking images, we trained classifiers using base image features (e.g., pixel size, color count) to choose the same striking image that web page authors chose. Random Forest with these base image features performed best. Contributions
  119. 119. @shawnmjones @StormyArchives We explored the reasons for metadata adoption S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.] Many efforts have been made to encourage metadata adoption by web page authors. Once social card metadata became available, its use skyrocketed! Contributions
  120. 120. @shawnmjones @WebSciDL We released MementoEmbed and Raintale as reference implementations for visualizing and distributing stories 120 WADL 2020 WADL 2020 We detailed how to generate document metadata with MementoEmbed and visualize and distribute the story with Raintale. We also provided an example of these processes for a day’s news. Contributions S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137 S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00139
  121. 121. @shawnmjones @WebSciDL And I am eager to apply this expertise at Los Alamos National Laboratory’s Information Sciences Division (CCS-3) 121 https://oduwsdl.github.io/dsa-puddles/shawnmjones/
  122. 122. @shawnmjones @StormyArchives Using our model and the lessons from these research questions, we have implemented tools to tell stories that summarize web archive collections Generate Story Metadata Select Exemplars Generate Document Metadata Visualize The Story Distribute The Story Read the dissertation for: • use cases • more example stories • details on experiments • details on these tools • examples with web archives other than Archive-It A sample of future work ideas: • better summary evaluation • augmenting collections with live web metadata • entity/topic cards rather than social cards • summarizing scholar output, project status, scatter/gather interfaces • solving corporate intranet search problems Contributions: • 5-process model for automatic storytelling • vocabulary for types of web archive collections • structural features of web archive collections • word count works best for identifying off-topic mementos • set of primitives for building algorithms • algorithms built with primitives select novel exemplars that a standard search engine did not discover • social cards provide better understanding than the existing state-of-the-art web archive surrogates • machine learning can select the same striking images as a page author • Hypercane, MementoEmbed, and Raintale as implementations Conclusion https://oduwsdl.github.io/dsa/
  123. 123. @shawnmjones @StormyArchives What story will you tell with web archives? https://oduwsdl.github.io/dsa/
  124. 124. @shawnmjones @WebSciDL Backup Slides 124
  125. 125. @shawnmjones @StormyArchives As collection users, we view Archive-It collections from outside… 125 • Curators select seeds, which are captured as seed mementos • Deep mementos are created from other pages linked to seeds S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P Selecting Exemplars and Generating Story Metadata
  126. 126. @shawnmjones @StormyArchives Response times per surrogate had interesting means, but the differences were not statistically significant at p < 0.05 [Chart: Response Times Per Surrogate; surrogates: Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, sc^t; series: Median, Mean; annotated p = 0.190 and p = 0.202] S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
  127. 127. @shawnmjones @StormyArchives The Off-Topic Memento Toolkit (OTMT) compares a seed’s first memento with the seed’s other mementos via different measures…
      Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
      Byte Count | 0.0 | -1.0 | No | bytecount
      Word Count | 0.0 | -1.0 | Yes | wordcount
      Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
      Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
      Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
      Simhash of raw memento | 0 | 64 | No | simhash-raw
      Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
      Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
      S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
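Three of the measures in this table are simple enough to sketch. Jaccard and Sørensen-Dice distance follow their standard set-based definitions. The word count score is written here as a relative change from the first memento, an assumption chosen because it matches the table's endpoints (0.0 when identical, -1.0 when every word is gone); the OTMT's exact formula and preprocessing may differ, and this code does not call the OTMT itself.

```python
import re


def tokens(text):
    """Lowercased word tokens; a stand-in for the toolkit's preprocessing."""
    return re.findall(r"\w+", text.lower())


def jaccard_distance(first, other):
    a, b = set(tokens(first)), set(tokens(other))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sorensen_dice_distance(first, other):
    a, b = set(tokens(first)), set(tokens(other))
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))


def relative_word_count_change(first, other):
    """Assumed form of the word count score: (count_other - count_first) / count_first."""
    n_first = len(tokens(first))
    if n_first == 0:
        return 0.0
    return (len(tokens(other)) - n_first) / n_first


# Hypothetical text of a seed's first memento and of a later memento
first_memento = "egypt protest news today"
later_memento = "egypt protest sports"
```

With these inputs the two mementos share 2 of 5 distinct words, so the Jaccard distance is 0.6, and the later memento has one fewer word, so the word count change is -0.25.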
  128. 128. @shawnmjones @StormyArchives Does most of the collection exist earlier or later in its life? 128 This collection was created in March 2010. Most of its mementos come from 2016 – 2018. Most of this collection exists later in its life. Structural feature discussed here: • area under the seed memento growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  129. 129. @shawnmjones @StormyArchives When did the curator select and archive a collection’s contents? 129 This collection was created in March 2006. Some of the seeds were selected in 2006. Many of the seeds were selected all along its life. It has mementos as recent as July 2018. Structural feature discussed here: • area under the seed growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  130. 130. @shawnmjones @StormyArchives Did the curator create a collection intended to archive new versions of the same web pages repeatedly? 130 This collection was created in June 2014. The seeds were selected toward the beginning of its life. Mementos were captured all during its life. Structural feature discussed here: • area under the seed growth curve • area under the seed memento growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  131. 131. @shawnmjones @StormyArchives The Memento Protocol provides us a standard method for acquiring information from web archives. Background and Related Work Memento gives us TimeGates – identified by URI-G – for finding a specific memento based on its original resource and capture datetime, its memento-datetime. Memento also gives us TimeMaps – identified by URI-T – for listing all of the mementos for an original resource and their memento-datetimes. Example TimeMap (the slide labels its URI-R, URI-T, URI-G, URI-M, and memento-datetime values):
      <http://a.example.org>;rel="original",
      <http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format"; from="Tue, 20 Jun 2000 18:02:59 GMT"; until="Wed, 21 Jun 2000 04:41:56 GMT",
      <http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
      <http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
      <http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
      <http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
      <http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT" ...
      Van de Sompel, H. Nelson, M. & Sanderson, R. “RFC 7089 – HTTP Framework for Time-Based Access to Resource States -- Memento”. http://www.rfc-editor.org/info/rfc7089. 2013.
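A link-format TimeMap like the one on this slide can be read with a small parser. The sketch below is a simplified reading of the format, assuming every parameter value is double-quoted as in the example; it is not a complete link-format parser, and `parse_timemap` is a hypothetical name.

```python
import re
from email.utils import parsedate_to_datetime

# One entry: <uri> followed by zero or more ;key="value" parameters
ENTRY = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(link_format_text):
    """Return (memento-datetime, URI-M) pairs sorted by capture time."""
    mementos = []
    for uri, params in ENTRY.findall(link_format_text):
        attrs = dict(PARAM.findall(params))
        rels = attrs.get("rel", "").split()
        # rel may hold several values, e.g. "first memento"
        if "memento" in rels and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


sample = (
    '<http://a.example.org>;rel="original", '
    '<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate", '
    '<http://arxiv.example.net/web/20000621011731/http://a.example.org>; '
    'rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT", '
    '<http://arxiv.example.net/web/20000620180259/http://a.example.org>; '
    'rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT"'
)
mementos = parse_timemap(sample)
```

Matching on the angle brackets rather than splitting on commas matters here, because the datetime values themselves contain commas.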
  132. 132. @shawnmjones @StormyArchives We use surrogates all of the time! Browser Thumbnail (example from UK Web Archive) Text snippet (example from Bing) Social Card (example from Facebook) Text + Thumbnail (example from Internet Archive) S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018. Motivation and Research Questions
  133. 133. @shawnmjones @WebSciDL Surrogates are not new! Traditional surrogates contain metadata generated by humans to convey aboutness 133 An individual surrogate summarizes an item. Card catalogs, however, were not stories, just manual methods for finding individual items in collections. Motivation and Research Questions
  134. 134. @shawnmjones @StormyArchives Surrogates provide a visual summary of the content behind a URI… Long URI: https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582 The same URI as a browser thumbnail surrogate: [image] The same URI as a social card surrogate: [image] Background and Related Work
  135. 135. @shawnmjones @WebSciDL Social media storytelling uses surrogates to provide a “summary of summaries” 135 2 resources are shown from this Wakelet story 6 resources are shown from this Storify story Each surrogate summarizes a web resource. Each story groups the surrogates, summarizing the topic. We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.
  136. 136. @shawnmjones @StormyArchives The Problem: Understanding web archive collections is costly • There are multiple collections about the “same concept.” • The metadata for each collection is non-existent, or inconsistently applied. • A seed is a web page to be crawled. • A memento is an observation of a seed at a specific point in time. • Many collections have 1000s of seeds with multiple mementos. • There are more than 14,000 collections. • Archive-It is a popular platform, but other web archive collection platforms exist (e.g., Library of Congress, Conifer, Trove). • Existing solutions do not handle the time dimension inherent to web archive collections. [Chart annotation: more seeds = less metadata]
  137. 137. @shawnmjones @StormyArchives 137 Our Solution: Social media storytelling uses groups of surrogates to provide a “summary of summaries” Each surrogate summarizes a web resource. Each story groups the surrogates, summarizing the topic. We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm. We established a five-process model for storytelling with web archive collections A surrogate summarizes a web page. This surrogate type is called a social card. Storytelling is the visualization. Our contribution is the automation that selects the exemplars and metadata that make this story.
  138. 138. @shawnmjones @StormyArchives The problem, summarized • There are multiple collections about the same concept. • The metadata for each collection is non-existent, or inconsistently applied. • Many collections have 1000s of seeds with multiple mementos. • There are more than 14,000 collections. • Human review of these mementos for collection understanding is an expensive proposition.
  139. 139. @shawnmjones @StormyArchives Archive-It allows easy collection creation Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos. 139
  140. 140. @shawnmjones @StormyArchives Reviewing mementos manually is costly This collection has 132,599 seeds, many with multiple mementos Some collections have 1000s of seeds Each seed can have many mementos In some cases, this can require reviewing 100,000+ documents to understand the collection 140
  141. 141. @shawnmjones @StormyArchives More Archive-It collections are added every year. More than 14,000 collections exist as of the end of 2020. [Chart: # of New Archive-It Collections Per Year, 2005 to 2020; y-axis: # of Collections, 0 to 2,500; series: All Collections, Only Private Collections, Only Public Collections]
  142. 142. @shawnmjones @StormyArchives Latent Semantic Analysis for document clustering LSA utilizes a term-document matrix • rows correspond to terms and columns correspond to documents • elements are typically weighted via TF-IDF • if TF-IDF, then each weight is proportional to the number of times the term appears in each document • use singular value decomposition to create two new matrices • the last of these matrices contains a set of documents with coordinates for each cluster LSA requires that the user supply the desired number of topics; it will be difficult to generalize this number across types of collections. [Figure: dark cells indicate high weights; high weights signify clustering.] Wikipedia contributors. (2019, July 26). Latent semantic analysis. In Wikipedia, The Free Encyclopedia. Retrieved 21:31, July 31, 2019, from https://en.wikipedia.org/w/index.php?title=Latent_semantic_analysis&oldid=907976703
  143. 143. @shawnmjones @StormyArchives Latent Dirichlet Allocation For a corpus $D$ consisting of $M$ documents each of length $N_i$: 1. Choose $\theta_i \sim \mathrm{Dir}(\alpha)$, where $i \in \{1, \dots, M\}$ and $\mathrm{Dir}(\alpha)$ is a Dirichlet distribution with symmetric parameter $\alpha$, which typically is sparse ($\alpha < 1$) 2. Choose $\varphi_k \sim \mathrm{Dir}(\beta)$, where $k \in \{1, \dots, K\}$ and $\beta$ typically is sparse 3. For each of the word positions $i, j$, where $i \in \{1, \dots, M\}$ and $j \in \{1, \dots, N_i\}$: (a) choose a topic $z_{ij} \sim \mathrm{Multinomial}(\theta_i)$; (b) choose a word $w_{ij} \sim \mathrm{Multinomial}(\varphi_{z_{ij}})$ Notation: $K$ is the number of topics requested by the user; $M$ is the number of documents in the corpus; $N_i$ is the number of words in document $i$; $\varphi_k$ is the word distribution for topic $k$; $\theta_i$ is the topic distribution for document $i$; $z_{ij}$ is the topic for the $j$-th word in document $i$; $w_{ij}$ is a specific word in document $i$. Example of a multinomial: the probability of the counts of each side when rolling a k-sided die n times. It will be difficult to generalize this number ($K$) across types of collections. Wikipedia contributors. (2019, July 25). Latent Dirichlet allocation. In Wikipedia, The Free Encyclopedia. Retrieved 20:13, July 31, 2019, from https://en.wikipedia.org/w/index.php?title=Latent_Dirichlet_allocation&oldid=907806560
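The generative process on this slide can be simulated with the Python standard library, which may help make the roles of K, the per-document topic distribution, and the per-topic word distribution concrete. This is a sketch of the generative story only, not of LDA inference; the function names, default parameter values, and sample vocabulary are assumptions.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow
        return [1.0 / size] * size
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocabulary, num_topics, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Generate documents by following the three steps of the generative process."""
    rng = random.Random(seed)
    # Step 2: one word distribution per topic
    phi = [sample_dirichlet(beta, len(vocabulary), rng) for _ in range(num_topics)]
    corpus = []
    for n_i in doc_lengths:
        # Step 1: one topic distribution per document
        theta = sample_dirichlet(alpha, num_topics, rng)
        document = []
        for _ in range(n_i):
            z = sample_categorical(theta, rng)   # Step 3a: choose a topic
            w = sample_categorical(phi[z], rng)  # Step 3b: choose a word from that topic
            document.append(vocabulary[w])
        corpus.append(document)
    return corpus


# Hypothetical vocabulary, K = 2 topics, and two documents of lengths 5 and 3
corpus = generate_corpus(["archive", "memento", "seed", "story"], 2, [5, 3], seed=1)
```

The user-supplied `num_topics` is the K that the slide notes is hard to generalize across collection types.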
  144. Background and Related Work: Many have tackled selecting exemplar sentences or images from a document; few have covered selecting exemplar documents from a corpus over time. We are inspired by these solutions and will apply some of their ideas in a moment. Examples shown: Silva et al. word graphs; Sipos et al. influential author clusters. References: Silva and Sampaio. 2014. Using Luhn’s Automatic Abstract Method to Create Graphs of Words for Document Visualization. Social Networking. 65-70. https://doi.org/10.4236/sn.2014.32008. R. Sipos et al. 2012. Temporal corpus summarization using submodular word coverage. In ACM CIKM 2012, 754-763. https://doi.org/10.1145/2396761.2396857.
  145. Background and Related Work: Existing tools for web archive collections require that the user have access to WARCs. Tools shown: ArchiveSpark; Archives Unleashed Cloud (now part of Archive-It); Warclight. Archivists are the only ones likely to have that access. We want anyone to be able to summarize a collection. Stories also need URIs for linking surrogates. WARCs alone cannot do this. References: Holzmann et al. 2016. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. In ACM/IEEE JCDL 2016, 83-92. https://doi.org/10.1145/2910896.2910902. Ruest et al. 2014. archivesunleashed/warclight – A Rails engine supporting the discovery of web archives. https://github.com/archivesunleashed/warclight. Deschamps et al. 2019. The Cost of a WARC: Analyzing Web Archives in the Cloud. In ACM/IEEE JCDL 2019, 261-264. https://doi.org/10.1109/JCDL.2019.00043.
  146. Background and Related Work: Existing work on generating story metadata relies on archivists to manually review and annotate each seed or memento. Scale is the greatest challenge here. Web archive collections grow quickly, and archivists have a hard time keeping up with the number of documents to annotate. Encoded Archival Description could work, if there were not thousands of documents to annotate. Reference: D. V. Pitti, “Encoded Archival Description,” D-Lib Magazine, vol. 5, no. 11, 1999. https://doi.org/10.1045/november99-pitti.
  147. Background and Related Work: Other studies on surrogates did not focus on whether participants understood the underlying collection, but on whether participants chose the correct search result for a query. These studies did not compare thumbnails to social cards directly. Web archives love using thumbnails, but is there something better for visitors?
  148. Background and Related Work: Others tried to visualize whole collections at once or created solutions specific to a web archive. Examples shown: Conta Me Histórias; Padia et al. References: R. Campos et al. 2021. Automatic generation of timelines for past-web events. The Past Web: Exploring Web Archives, 225-242. https://doi.org/10.1007/978-3-030-63291-5_18. K. Padia, Y. AlNoamany, and M. C. Weigle, “Visualizing digital collections at Archive-It,” in Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, (Washington, DC, USA), pp. 15–18, 2012. https://doi.org/10.1145/2232817.2232821.
  149. Web surrogates provide a visual summary of the content behind a URI… Long URI: https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582 The same URI as a browser thumbnail surrogate: [image] The same URI as a social card surrogate: [image]
  150. Social media storytelling uses surrogates to provide a “summary of summaries”. [Images: 2 resources are shown in this Wakelet story; 6 resources are shown in this Storify story.] Each surrogate summarizes a web resource. Each story groups the surrogates, summarizing the topic. We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.
  151. DSA2 Algorithm
  152. DSA3 Algorithm
  153. DSA4 Algorithm

Collections are the tools that people use to make sense of an ever-increasing number of archived web pages. As collections themselves grow, we need tools to make sense of them. Tools that work on the general web, like search engines, are not a good fit for these collections because search engines do not currently represent multiple document versions well. Web archive collections themselves are vast, some containing hundreds of thousands of documents. There are also thousands of collections, many of which cover the same topic. Few collections include standardized metadata. Too many documents from too many collections with not enough metadata makes collection understanding an expensive proposition. This dissertation establishes a five-process model to assist with web archive collection understanding. This model aims to automatically produce a social media story, a visualization paradigm with which most web users are already familiar. Each social media story contains surrogates, which are summaries of individual documents. These surrogates, when collected together, summarize the overall topic of the story. After applying our storytelling model, they summarize the topic of a web archive collection. We develop and test a framework to select the best exemplars that represent a collection. We establish that algorithms produced from these primitives select exemplars that are otherwise undiscoverable using conventional search engine methods. We generate story metadata to improve the information scent of a story so users can understand it better. After an analysis showing that existing platforms perform poorly for web archives and a user study establishing the best surrogate type, we generate document metadata for the exemplars with machine learning. We then visualize the story and document metadata together and distribute it to satisfy the information needs of multiple personas who benefit from our model.
Our tools serve as a reference implementation of our Dark and Stormy Archives storytelling model. Hypercane selects exemplars and generates story metadata. MementoEmbed generates document metadata. Raintale visualizes and distributes the story based on the story metadata and the document metadata of these exemplars. By providing understanding at a glance, our stories save users the time and effort of reading thousands of documents and, most importantly, help them understand web archive collections.
