Web archives and the
problem of access:
prototyping a researcher
dashboard for the UK
Government Web Archive
Mark Bell, Tom Storrar and
Jane Winters
15 January 2020
The National Archives is the official archive of UK government: collecting,
preserving and giving access to 1,000 years of history
Alongside paper and digitised records, our web archive collections are
growing rapidly, and the UKGWA is our largest collection:
■ 1996 to present: over 23 years of government websites and social
media
■ 6 billion resources, 150TB+ (compressed) data
■ It has gov.uk domains but lots more, too - wherever government
hosts content (at present, over 800 websites!)
The UK Government Web Archive (UKGWA)
The UKGWA is openly available and well-used
Typical routes into the content of the collection include:
■ Through Google and other search engines
■ Redirection to it from government websites or from references to
historic documents within other documents
■ Direct “research sessions” - often returning users who have a specific
information need. They will often use our search service:
https://webarchive.nationalarchives.gov.uk/search/
■ Increasing use of the collection “as data” - but this is challenging in a
number of ways
Use of the UKGWA
What do researchers want to do with the
UKGWA?
❏ Essential primary source for the history of the late 20th and early 21st
century (mid 1990s to the present day)
❏ Record of government (central and local) and its interactions with its
citizens online
❏ Need to understand both its scope and its scale, and this means moving
beyond keyword searching (the default for many humanities researchers)
❏ Gain insight into the collection processes, how these have changed over
time, and the factors that have influenced when and how data is
harvested (these are patchwork or ‘Frankenstein’ archives)
❏ Extract different kinds of data
from the archive (text, images,
remove navigation etc.)
❏ Analyse trends in the data, e.g.
cultural and linguistic change
❏ Study online networks of
government and the flow of
information between and
within departments
❏ Deploy visualisation to aid
navigation and analysis
(macro- and micro-level)
Elevation for clock dial for Big Ben tower
Web archiving as collaboration
❏ The challenges posed by web archives (for researchers, web archivists
and research software engineers) are too complex to be solved by
individuals or organisations working on their own
❏ Researchers need web archivists, and web archivists need researchers
❏ Through collaboration, we can develop a robust community of practice
and knowledge
❏ We can argue for enhanced access to web archives, for researchers and
the wider public
❏ We can experiment, innovate and sometimes fail
❏ We can make the case for greater investment in web archiving (and in
web archiving institutions)
Dashboard basics
Rise and Fall of the Web
What are we analysing? - Macroscopic view
Archive
-> Domain
-> Sub-domain
-> Page
-> Resource
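The hierarchy above can be sketched as a simple roll-up of resource counts at each level. This is only an illustration, not the dashboard's implementation: the record layout (one tuple per resource) and the names `LEVELS` and `count_by_level` are hypothetical.

```python
from collections import Counter

# Hypothetical record layout: one tuple per archived resource,
# ordered from the broadest level to the narrowest.
LEVELS = ("domain", "sub_domain", "page", "resource")


def count_by_level(records, level):
    """Count resources grouped by every level down to and including `level`."""
    depth = LEVELS.index(level) + 1
    # Truncating each record to `depth` fields groups it with its siblings.
    return Counter(record[:depth] for record in records)


records = [
    ("gov.uk", "www", "/a", "a.css"),
    ("gov.uk", "www", "/a", "a.png"),
    ("gov.uk", "www", "/b", "b.html"),
    ("nationalarchives.gov.uk", "webarchive", "/", "index.html"),
]

per_domain = count_by_level(records, "domain")
per_page = count_by_level(records, "page")
```

Moving between the macroscopic and the detailed view is then a matter of changing the `level` argument.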
What are we analysing? - Content
History of salt
The craving for salt
Human beings have an intimate relationship with salt. Our
tears, blood and sweat taste of salt.
The chemical reactions inside our bodies need sodium - one of
the two elements that make up salt (with chloride).
We can't survive without sodium, but it was about five million
years before humans began to eat their sodium as salt.
Hunters in Greenland ate no salt until they were introduced to
it by whaling Europeans in the 17th century. Like our
prehistoric forebears, Lapps, Samoyeds, Kirghiz, Bedouin,
Masai and Zulus used to consume all the sodium they needed
from the animals and fish they ate.
Agriculture and salt
Archaeologists believe that salt eating developed as humans
learned how to keep animals and grow crops in the years after
10,000 BC. As the proportion of meat in their diet fell, people
had to find salt for themselves and for their domesticated
animals.
Content
What is content?
Content
What are we analysing? - Page Structure
What are we analysing? - Site Structure
https://webarchive.nationalarchives.gov.uk/20190102181627/https://www.gov.uk/guidance/cartels-confess-and-apply-for-leniency
Warning: This doesn’t exist!
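The archived URL on this slide appears to combine a 14-digit capture timestamp with the original address. A minimal sketch of splitting the two follows, assuming the timestamp is in YYYYMMDDhhmmss form (inferred from the example, not confirmed by the slides); the function name `parse_archived_url` is hypothetical.

```python
import re
from datetime import datetime

# Assumed pattern: archive host, 14-digit timestamp, then the original URL.
ARCHIVE_URL = re.compile(
    r"^https?://webarchive\.nationalarchives\.gov\.uk/(\d{14})/(.+)$"
)


def parse_archived_url(url):
    """Return (capture time, original URL) for an archived UKGWA address."""
    match = ARCHIVE_URL.match(url)
    if match is None:
        raise ValueError(f"not a recognised archived URL: {url!r}")
    timestamp, original = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original


captured, original = parse_archived_url(
    "https://webarchive.nationalarchives.gov.uk/20190102181627/"
    "https://www.gov.uk/guidance/cartels-confess-and-apply-for-leniency"
)
```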
Topic Modelling
0 : research councils council innovation rcuk funding public government review business executive working training development work group
1 : museum maritime national greenwich royal nmm time london observatory family house rights world visit reserved events
2 : day information fruit local health navigation legal school scheme contact children vegetable healthy vegetables department content
3 : ocr science information gateway aqa including edexcel chemistry physics teachers webpage wjec teaching revision gcse century
4 : science triple learning support resources latest physics students schools programme teaching teachers gcse resource feedback comments
5 : food eat foods people bacteria meat fish agency fridge don standards raw cooked pregnant date find
6 : army museum national british war general nam enquiries pm services quick britain follow world field soldiers
7 : salt eat fruit foods fat food high good eating day milk diet children vitamin vegetables healthy
Doc2Vec - Like word2vec but with documents
● Find similar documents
● Group documents together
● Enable semantic search
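To illustrate what "find similar documents" means in practice, here is a sketch that compares documents as vectors. It does not use Doc2Vec: a much simpler bag-of-words cosine similarity stands in for the learned document embeddings, and all names here are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as lower-cased word counts (not a learned embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents):
    """Return the name of the document whose vector is closest to the query."""
    query_vector = term_vector(query)
    return max(
        documents,
        key=lambda name: cosine_similarity(query_vector, term_vector(documents[name])),
    )


documents = {
    "salt": "salt eat fruit foods fat food",
    "army": "army museum national british war",
}
best = most_similar("history of salt in food", documents)
```

With document embeddings the comparison step is the same, but the vectors capture meaning rather than exact word overlap, which is what makes grouping and semantic search possible.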
Document Summarisation
Scale reduction
[Diagram: a site tree (Home Page; Sub-section A, Sub-section B; Page 1 to Page 4; PDF 1, PDF 2) reduced to a smaller tree; labels: "10s of millions", "1000s"]
Change over time
Content
Structure
Static Dormant
Components of a dashboard
■ Scope: select sites for analysis, manually or by similarity
■ Granularity: level at which to perform analysis (archive, domain, page)
■ Time: filter by time period (state at time; activity during period)
■ Compare: compare change in one set of sites with another
■ Content/Structure: analyse by content or structure (page, site, network)
■ Visualise: charts, networks, word clouds etc.
■ £: charges (paying for computation)
■ Export: exporting results and visualisations
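The two time filters mentioned for the dashboard, "state at time" and "activity during period", could be distinguished along these lines. This is a sketch under assumptions: captures are represented as a sorted list of datetimes, and the function names are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def state_at(captures, when):
    """Return the most recent capture at or before `when`, or None if there is none.

    `captures` must be sorted in ascending order.
    """
    index = bisect_right(captures, when)
    return captures[index - 1] if index else None


def activity_during(captures, start, end):
    """Return every capture made between `start` and `end`, inclusive."""
    return [capture for capture in captures if start <= capture <= end]


captures = [datetime(2010, 1, 1), datetime(2015, 6, 1), datetime(2019, 1, 2)]
snapshot = state_at(captures, datetime(2016, 1, 1))
recent = activity_during(captures, datetime(2014, 1, 1), datetime(2020, 1, 1))
```

The first answers "what did the site look like on this date?", the second "what was captured in this window?".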
Web archives are created through actions and decisions, both human and
machine.
Human actions involve decisions on when and how to capture a resource
or a website but also why. Data on this is kept as part of the archive but
most of it is not public.
Machines make decisions based on the parameters or rules they are
provided by human actors. We can add trust and transparency to this
process by revealing as much of this as we can to our users.
We can commit to publishing this knowledge, but publishing it in a way
that adds to users’ comprehension of the web archive is a challenge.
Static datasets (csv) are a start, leading to queryable ones (APIs…)
Key Context on the creation of the UKGWA
We’re not alone; we are part of a vibrant community of web archives and
researchers.
We are taking inspiration (and code!) from the great work being done by
Archives Unleashed, the Internet Archive, the British Library and many
others.
We’ve also been gaining more and more hands-on experience of
running research projects using UKGWA data, for example, recently:
■ Alan Turing Institute Data Challenge - Identifying Topics and Trends
(December 2019)
■ CAS Network Analysis Workshop (June 2019)
These are crucial to our work and there are many more to come!
Collaborate!
❏ Bring stakeholders together
regularly (workshops, hackathons
etc.)
❏ A wide range of skills and expertise
are required but some
interventions can lower barriers
❏ Artificial intelligence is already
helping us to explore web archives,
and will continue to transform
access
❏ … but it is not enough on its own
Conclusion
Wartime storage of documents in the
Long Gallery at Haddon Hall
Colossus electronic digital computer

ArAcAi - The Problem of Access: Prototyping a Researcher Dashboard for the UK Government Web Archive
