Web archives and the problem of access: prototyping a researcher dashboard for the UK Government Web Archive
By Mark Bell, Tom Storrar and Jane Winters
January 2020
ArAcAi - The Problem of Access: Prototyping a Researcher Dashboard for the UK Government Web Archive
1. Web archives and the problem of access: prototyping a researcher dashboard for the UK Government Web Archive
Mark Bell, Tom Storrar and Jane Winters
15 January 2020
2. The UK Government Web Archive (UKGWA)
The National Archives is the official archive of UK government: collecting, preserving and giving access to 1,000 years of history
Alongside paper and digitised records, our web archive collections are growing rapidly, and the UKGWA is our largest collection:
■ 1996 to present: over 23 years of government websites and social media
■ 6 billion resources, 150TB+ (compressed) data
■ It covers gov.uk domains but much more, too - wherever government hosts content (at present, over 800 websites!)
3. Use of the UKGWA
The UKGWA is openly available and well-used
Typical routes into the content of the collection include:
■ Through Google and other search engines
■ Redirection to it from government websites or from references to historic documents within other documents
■ Direct “research sessions” - often by returning users who have a specific information need. They will often use our search service: https://webarchive.nationalarchives.gov.uk/search/
■ Increasing use of the collection “as data” - but this is challenging in a number of ways
4. What do researchers want to do with the UKGWA?
❏ Essential primary source for the history of the late 20th and early 21st
century (mid 1990s to the present day)
❏ Record of government (central and local) and its interactions with its
citizens online
❏ Need to understand both its scope and its scale, and this means moving
beyond keyword searching (the default for many humanities researchers)
❏ Gain insight into the collection processes, how these have changed over
time, and the factors that have influenced when and how data is
harvested (these are patchwork or ‘Frankenstein’ archives)
5. What do researchers want to do with the UKGWA?
❏ Extract different kinds of data from the archive (text, images, pages with navigation removed etc.)
❏ Analyse trends in the data, e.g. cultural and linguistic change
❏ Study online networks of government and the flow of information between and within departments
❏ Deploy visualisation to aid navigation and analysis (macro- and micro-level)
Elevation for clock dial for Big Ben tower
6. Web archiving as collaboration
❏ The challenges posed by web archives (for researchers, web archivists
and research software engineers) are too complex to be solved by
individuals or organisations working on their own
❏ Researchers need web archivists, and web archivists need researchers
❏ Through collaboration, we can develop a robust community of practice
and knowledge
❏ We can argue for enhanced access to web archives, for researchers and
the wider public
❏ We can experiment, innovate and sometimes fail
❏ We can make the case for greater investment in web archiving (and in
web archiving institutions)
9. What are we analysing? - Macroscopic view
Archive
-> Domain
-> Sub-domain
-> Page
-> Resource
10. What are we analysing? - Content
History of salt
The craving for salt
Human beings have an intimate relationship with salt. Our
tears, blood and sweat taste of salt.
The chemical reactions inside our bodies need sodium - one of
the two elements that make up salt (with chloride).
We can't survive without sodium, but it was about five million
years before humans began to eat their sodium as salt.
Hunters in Greenland ate no salt until they were introduced to
it by whaling Europeans in the 17th century. Like our
prehistoric forebears, Lapps, Samoyeds, Kirghiz, Bedouin,
Masai and Zulus used to consume all the sodium they needed
from the animals and fish they ate.
Agriculture and salt
Archaeologists believe that salt eating developed as humans
learned how to keep animals and grow crops in the years after
10,000 BC. As the proportion of meat in their diet fell, people
had to find salt for themselves and for their domesticated
animals.
13. What are we analysing? - Site Structure
https://webarchive.nationalarchives.gov.uk/20190102181627/https://www.gov.uk/guidance/cartels-confess-and-apply-for-leniency
Warning: This doesn’t exist!
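An archived address like the one above packs several pieces of structural information into a single string. The sketch below is one possible way to split it apart. It assumes that the 14-digit segment is a capture time in YYYYMMDDhhmmss form (the usual convention in web archive replay URLs), and the function name and returned keys are hypothetical, not part of any UKGWA tool.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

# Prefix taken from the example URL on this slide.
ARCHIVE_PREFIX = "https://webarchive.nationalarchives.gov.uk/"
# Assumption: 14 digits (YYYYMMDDhhmmss), a slash, then the original URL.
CAPTURE_PATTERN = re.compile(r"^(\d{14})/(.+)$")


def parse_archive_url(url):
    """Split an archived URL into capture time, original URL, host and path."""
    if not url.startswith(ARCHIVE_PREFIX):
        raise ValueError("not a UKGWA-style archive URL")
    match = CAPTURE_PATTERN.match(url[len(ARCHIVE_PREFIX):])
    if match is None:
        raise ValueError("no 14-digit timestamp found after the prefix")
    timestamp, original = match.groups()
    parts = urlsplit(original)
    return {
        "captured": datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        "original": original,
        "host": parts.netloc,
        # Non-empty path segments, useful for reasoning about site structure.
        "path_segments": [seg for seg in parts.path.split("/") if seg],
    }


example = parse_archive_url(
    "https://webarchive.nationalarchives.gov.uk/20190102181627/"
    "https://www.gov.uk/guidance/cartels-confess-and-apply-for-leniency"
)
```

The host and path segments give a starting point for grouping captures by domain or by section of a site.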
14. Topic Modelling
0 : research councils council innovation rcuk funding public government review business executive working training development work group
1 : museum maritime national greenwich royal nmm time london observatory family house rights world visit reserved events
2 : day information fruit local health navigation legal school scheme contact children vegetable healthy vegetables department content
3 : ocr science information gateway aqa including edexcel chemistry physics teachers webpage wjec teaching revision gcse century
4 : science triple learning support resources latest physics students schools programme teaching teachers gcse resource feedback comments
5 : food eat foods people bacteria meat fish agency fridge don standards raw cooked pregnant date find
6 : army museum national british war general nam enquiries pm services quick britain follow world field soldiers
7 : salt eat fruit foods fat food high good eating day milk diet children vitamin vegetables healthy
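Topic lists in this "number : terms" form can be read back into a program and used to label pages. The sketch below parses three of the lines above and assigns a piece of text to the topic whose terms it overlaps most. Word overlap is a deliberately crude stand-in chosen for illustration; it is not how a topic model infers topics, and the function names are hypothetical.

```python
import re

# Three of the topic lines shown on this slide, copied verbatim.
TOPIC_LINES = [
    "5 : food eat foods people bacteria meat fish agency fridge don standards raw cooked pregnant date find",
    "6 : army museum national british war general nam enquiries pm services quick britain follow world field soldiers",
    "7 : salt eat fruit foods fat food high good eating day milk diet children vitamin vegetables healthy",
]


def parse_topics(lines):
    """Turn 'N : term term ...' lines into {N: set of terms}."""
    topics = {}
    for line in lines:
        number, _, terms = line.partition(" : ")
        topics[int(number)] = set(terms.split())
    return topics


def best_topic(text, topics):
    """Return the topic number sharing the most distinct words with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    overlap = {number: len(words & terms) for number, terms in topics.items()}
    # Ties resolve to the topic that appears first in the input.
    return max(overlap, key=overlap.get)


topics = parse_topics(TOPIC_LINES)
salt_topic = best_topic("Our tears, blood and sweat taste of salt.", topics)
army_topic = best_topic("Soldiers of the British army", topics)
```

Here the sentence from the salt page lands in topic 7 and the army sentence in topic 6.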
15. Doc2Vec - Like word2vec but with documents
● Find similar documents
● Group documents together
● Enable semantic search
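The "find similar documents" idea can be illustrated without a trained model. The sketch below swaps in plain bag-of-words vectors and cosine similarity in place of Doc2Vec embeddings, so it only captures shared vocabulary and not the semantic similarity a learned model aims for. All names and the two sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Bag-of-words vector: lower-cased word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, documents):
    """Rank documents (name -> text) by similarity to the query, best first."""
    query_vector = vectorise(query)
    scores = {
        name: cosine(query_vector, vectorise(text))
        for name, text in documents.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up miniature documents, for illustration only.
documents = {
    "salt": "salt in food and healthy eating",
    "army": "army museum war soldiers",
}
ranking = most_similar("eating less salt", documents)
```

With document embeddings in place of the word counts, the same ranking step would support grouping and search.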
19. Components of a dashboard
Scope - Select sites for analysis: manual or by similarity
Granularity - Level to perform analysis: archive, domain, page
Time - Filter by time period: state at time; activity during period
Compare - Compare change in one set of sites with another
Content/Structure - Analyse by content or structure (page, site, network)
Visualise - Charts, networks, word clouds etc.
£ - Charges: paying for computation
Export - Exporting results and visualisations
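One way to picture how these components fit together is as a single query object that a dashboard could pass to its analysis back end. The sketch below is purely illustrative: the class, its field names and the allowed values are assumptions drawn from the labels on this slide, not a description of the prototype's actual design.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

GRANULARITIES = {"archive", "domain", "page"}   # levels named on the slide
ANALYSIS_MODES = {"content", "structure"}       # "Content/Structure"


@dataclass
class DashboardQuery:
    """Hypothetical container for one researcher request."""
    sites: List[str]                            # Scope
    granularity: str = "domain"                 # Granularity
    start: Optional[date] = None                # Time (period start)
    end: Optional[date] = None                  # Time (period end)
    analyse_by: str = "content"                 # Content/Structure
    compare_with: Optional[List[str]] = None    # Compare
    export_format: str = "csv"                  # Export

    def validate(self):
        """Return a list of problems; an empty list means the query is usable."""
        problems = []
        if not self.sites:
            problems.append("no sites selected")
        if self.granularity not in GRANULARITIES:
            problems.append("unknown granularity: " + self.granularity)
        if self.analyse_by not in ANALYSIS_MODES:
            problems.append("unknown analysis mode: " + self.analyse_by)
        if self.start and self.end and self.start > self.end:
            problems.append("start date is after end date")
        return problems


ok_query = DashboardQuery(sites=["www.gov.uk"], start=date(2010, 1, 1), end=date(2019, 12, 31))
bad_query = DashboardQuery(sites=[], granularity="resource")
```

Validating the request up front matters when computation is charged for, since a malformed query would otherwise cost money before failing.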
20. Key Context on the creation of the UKGWA
Web archives are created through actions and decisions, both human and machine.
Human actions involve decisions on when and how to capture a resource or a website, but also why. Data on this is kept as part of the archive but most of it is not public.
Machines make decisions based on the parameters or rules they are given by human actors. We can add trust and transparency to this process by revealing as much of this as we can to our users.
We can commit to publishing this knowledge, but publishing it in a way that adds to users’ comprehension of the web archive is a challenge.
Static datasets (csv) are a start, leading to queryable ones (APIs…)
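To show what working with a static csv dataset of capture context might look like, here is a minimal sketch. The column names and the two rows are invented placeholders; no published UKGWA dataset is being described, and a real file would have its own schema.

```python
import csv
import io

# Placeholder data with hypothetical columns; not real UKGWA figures.
SAMPLE_CSV = """site,captures,crawl_frequency
example-a.gov.uk,12,monthly
example-b.gov.uk,3,annual
"""


def sites_with_min_captures(csv_text, minimum):
    """Return the sites whose capture count is at least `minimum`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["site"] for row in reader if int(row["captures"]) >= minimum]


frequently_captured = sites_with_min_captures(SAMPLE_CSV, 5)
```

A queryable service would move this kind of filter to the server, so researchers would not need to download the whole file first.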
21. Collaborate!
We’re not alone; we are part of a vibrant community of web archives and researchers.
We are taking inspiration (and code!) from the great work being done by Archives Unleashed, the Internet Archive, the British Library and many others.
We’ve also been gaining more and more hands-on experience of running research projects using UKGWA data, for example, recently:
■ Alan Turing Institute Data Challenge - Identifying Topics and Trends (December 2019)
■ CAS Network Analysis Workshop (June 2019)
These are crucial to our work and there are many more to come!
22. Conclusion
❏ Bring stakeholders together regularly (workshops, hackathons etc.)
❏ A wide range of skills and expertise is required but some interventions can lower barriers
❏ Artificial intelligence is already helping us to explore web archives, and will continue to transform access
❏ … but it is not enough on its own
Wartime storage of documents in the Long Gallery at Haddon Hall