This document provides information about a workshop on newspaper data visualization hosted by the British Library and London College of Communication. It discusses the British Library's collection of over 34,000 newspaper titles containing 450 million pages. It outlines plans to digitize 1.3 million additional newspaper pages by 2022 and make metadata and text data openly available. The workshop goals are to help researchers understand how to visualize and analyze the complexities of the library's newspaper collection using tools like Python, R, Voyant, and Palladio and methods like named entity recognition and text mining.
2. www.bl.uk
Newspapers at the British Library
• The national collection, from 1619 to present day
• Over 34,000 titles, or 60 million individual issues, or 450 million pages
• 25,000 titles from the UK and Ireland
• 34 million pages are available at www.britishnewspaperarchive.co.uk
• To date, the selection of newspaper titles for digitisation has been undertaken by Findmypast
• To date, all data created through digitisation is owned by Findmypast
Heritage Made Digital Newspapers
• British Library project to digitise newspapers for itself, alongside the Findmypast operation
• Target: 1.3 million pages by 2022
• Over 200 titles, with a focus on titles in poor or unfit condition published in London
• Titles will all be made available on the British Newspaper Archive
• All titles will eventually be openly available online
• All metadata, including derived data (OCR, entities), is owned by the British Library and will be made openly available, with other newspaper data to follow
Our newspaper data plans
• To encourage multiple uses of data derived from newspapers
• Treating newspaper data as a ‘collection’ in its own right
Outputs
• Bibliographical list of all BL UK and Irish newspapers
• HMD and other newspaper data openly available through BL’s new digital repository
• Visualisations of the collection to help explore and understand it
Users
• Academics using ‘big data’ for new kinds of research
• General users unfamiliar with data or easy-to-use tools
• Creatives
Workshop goals
• We aim to develop a series of workshops with London College of Communication, which will integrate newspaper data analytics with creative design
• Today is a trial workshop to test ideas
• The wider goal is to help researchers understand how to visualise the complexities of the Library’s newspaper collection
Programming tools we use
• Python: programming language, becoming the industry standard for data analytics
• R: programming language with good visualisation packages
• Jupyter Notebooks & JupyterHub: tools which allow for interactive and shareable Python code, e.g. https://github.com/GLAM-Workbench (BL Labs version coming soon)
• Learn how to use these at https://programminghistorian.org/ and https://software-carpentry.org/
• Watch out for Adult Learning courses at the British Library next autumn
Methods used for today
• Named Entity Recognition using the Python library NLTK
• Named Entity Recognition is the term for a range of methods for detecting certain types of words in text, such as the names of people, places and organisations
• Can be used to analyse the prominence of certain people or places in a large dataset
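To make the idea of "prominence" concrete, here is a minimal Python sketch. It does not use NLTK: it swaps in a crude stand-in that treats runs of capitalised words as candidate names and counts them across articles. The sample articles, the stop list and the function names are all made up for illustration, and a trained recogniser such as the one in NLTK would be far more accurate.

```python
import re
from collections import Counter

# Runs of capitalised words, e.g. "Queen Victoria" or "London".
# This is a rough heuristic, not a trained named entity recogniser.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

# Hypothetical stop list: capitalised words that usually only start a sentence.
SENTENCE_STARTERS = {"The", "A", "An", "In", "On", "At", "It", "He", "She", "They"}


def candidate_entities(text):
    """Return capitalised word sequences that might be names."""
    found = []
    for match in CANDIDATE.finditer(text):
        words = match.group().split()
        # Strip leading sentence starters such as "The" or "In".
        while words and words[0] in SENTENCE_STARTERS:
            words = words[1:]
        if words:
            found.append(" ".join(words))
    return found


def entity_prominence(articles):
    """Count how often each candidate name appears across all articles."""
    counts = Counter()
    for article in articles:
        counts.update(candidate_entities(article))
    return counts


# Made-up sample text, standing in for OCR text from newspaper pages.
articles = [
    "Queen Victoria arrived in London yesterday. Mr Gladstone met her at Paddington.",
    "The crowds in London cheered as Queen Victoria passed.",
]

prominence = entity_prominence(articles)
for name, count in prominence.most_common():
    print(name, count)
```

On real OCR text this heuristic will pick up plenty of noise (headlines, days of the week, misread characters), which is one reason to prefer a proper recogniser for anything beyond a quick look.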
• Simple text mining with R and the library tidytext
• Counting word frequencies can help us to ‘see inside’ large amounts of text, and detect patterns we might otherwise miss
• We might use it to understand the focus of a particular newspaper title, or to compare reporting over different chunks of time
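The workshop uses R and tidytext for this; the sketch below shows the same counting idea in Python with only the standard library, so it is an illustration of the method rather than the workshop's own code. The sample texts, the years and the short stop-word list are invented, and a real analysis would use a fuller stop-word list.

```python
import re
from collections import Counter

# Hypothetical, very short stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "was", "on", "at", "is"}


def word_frequencies(text):
    """Lower-case the text, split it into words and count the non-stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


# Made-up sample text keyed by year, standing in for chunks of newspaper text.
articles_by_year = {
    1850: "The railway opened in the city. The railway company celebrated.",
    1900: "The motor car appeared in the city. The motor trade grew.",
}

# One frequency table per year, so the years can be compared side by side.
frequencies = {
    year: word_frequencies(text) for year, text in articles_by_year.items()
}

for year, counts in frequencies.items():
    print(year, counts.most_common(3))
```

Comparing the tables for each year is a simple way to see how the vocabulary of reporting shifts between time periods.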