Enabling complex analysis of large scale digital collections
1. Research data spring
Enabling Complex Analysis of Large Scale Digital Collections, 14/7/2015
Lots of money has been spent digitising heritage collections. Digitised heritage
collections are data. But scholars without computational training often don't know
what to ask of large quantities of data. They often lack access to high-performance
computing facilities, and don't know how to use them.
We have addressed this fundamental problem by extending research data
management processes in order to enable novel research in the arts, humanities,
and social and historical sciences and a deeper understanding of emerging
research needs. In our first phase, we have successfully implemented large
scale, complex search of a digitised collection: now we scale up…
2. More & more digitised content is in the public domain
3. UK eScience infrastructure not used in A+H or SHS
4. Phase 1: take 64,000 British Library digitised books
5. See how we can analyse them using UCL’s HPC
6. Moving beyond restrictive basic searches
7. Team
James Hetherington
Research Software Engineer
8. Work with researchers 1: detect trends
Anne Welsh
Lecturer in Library and
Information Studies, UCL
Interested in growth of professions
in the Victorian era.
Needs to be able to run AND, OR,
NOT and AND NOT Boolean queries:
beyond the capabilities of current
large-scale digitisation search
functions.
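A minimal sketch of the kind of Boolean query described above, not the project's actual cluster code. The function names (`tokens`, `matches`), the parameter names and the three sample texts are all hypothetical, and the corpus is assumed to be available as plain text per book.

```python
import re

def tokens(text):
    """Return the set of lower-cased words in one book or page."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(text, all_of=(), any_of=(), none_of=()):
    """AND over all_of, OR over any_of, NOT over none_of."""
    words = tokens(text)
    return (all(w in words for w in all_of)
            and (not any_of or any(w in words for w in any_of))
            and not any(w in words for w in none_of))

# Hypothetical sample texts standing in for digitised books.
books = {
    "A": "The librarian and the engineer met in London.",
    "B": "An engineer wrote of steam.",
    "C": "The librarian kept a catalogue.",
}

# "librarian AND NOT engineer"
hits = sorted(key for key, text in books.items()
              if matches(text, all_of=["librarian"], none_of=["engineer"]))
print(hits)  # ['C']
```

Combining the three keyword arguments covers AND, OR, NOT and AND NOT in a single call; a real query over the full corpus would run the same test in parallel across books.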
9. Work with researchers 2: compare data sources
Oliver Duke-Williams
Lecturer in Digital
Information Studies, UCL
Interested in history of
demographics and health data.
Can we track the prevalence of
diseases in the corpus, and do
they relate to known
epidemics, using existing data?
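One way to approach this question is to count mentions of a disease per publication year and normalise by the number of words published that year. The sketch below assumes each book is available as a (year, text) pair; the function name and the sample data are hypothetical, and the per-1,000-words scale is an arbitrary choice.

```python
import re
from collections import defaultdict

def normalised_mentions(books, term):
    """books: iterable of (year, text). Returns {year: mentions per 1,000 words}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in books:
        words = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(words)
        hits[year] += words.count(term)
    return {year: 1000 * hits[year] / totals[year]
            for year in totals if totals[year]}

# Hypothetical sample data, not drawn from the British Library corpus.
sample = [
    (1849, "cholera spread fast cholera"),
    (1850, "a quiet year"),
]
rates = normalised_mentions(sample, "cholera")
```

Normalising by word count matters because the number of digitised books varies from year to year, so raw counts alone would mostly track the size of the corpus.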
10. [Chart annotations: reported deaths and major UK cholera outbreaks]
Deaths in England    1838      1839
Measles              6,514     10,937
Whooping cough       9,107     8,165
Consumption          59,025    59,559
Cholera:
• 1831-2: first outbreak in UK, c. 55,000 deaths
• 1848-49: 53,293 deaths (England)
• 1853-54: c. 11,000 UK deaths ('John Snow / Broad Street pump' epidemic)
• 1863, East London: c. 6,000 deaths
11. Work with researchers 3: visualise content
Will Finley
PhD Student, History,
University of Sheffield
Interested in History of Printed
Book Illustration 1750-1850.
How can we analyse and
visualise how the size and
placing of illustrations in the
corpus changes over time?
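As a rough illustration of how this question could become a computation, the sketch below averages the share of the page occupied by an illustration, grouped by decade. It assumes illustration and page areas have already been extracted from the digitised books; the record layout, the function name and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def area_share_by_decade(records):
    """records: iterable of (year, illustration_area, page_area).

    Returns {decade: mean fraction of the page taken up by an illustration}.
    """
    shares = defaultdict(list)
    for year, illustration_area, page_area in records:
        decade = year // 10 * 10
        shares[decade].append(illustration_area / page_area)
    return {decade: mean(values) for decade, values in sorted(shares.items())}

# Hypothetical sample records: (year, illustration area, page area).
sample = [
    (1752, 25.0, 100.0),
    (1758, 75.0, 100.0),
    (1801, 75.0, 100.0),
]
by_decade = area_share_by_decade(sample)
```

The resulting decade-to-share mapping can be plotted directly as a time series; placement on the page could be summarised in the same way by replacing the area ratio with a position measure.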
12. All outputs documented on github
» https://github.com/UCL-dataspring
13. Including all code, recipes, & visualisations
14. Explained in a series of blog posts
» http://britishlibrary.typepad.co.uk/digital-scholarship/2015/07/turning-research-questions-into-computational-queries.html
http://bit.ly/dataspring
15. Overview
» Not a Research Project
» Not an API
» Not replicating existing search facilities
» How can we provide access to data and compute?
» What are the technical issues in using eScience infrastructure
for cultural and heritage datasets?
» How can we train people in the A+H, and Libraries, to use
this?
» How can we scale this up across the arts and humanities, and
social and historical sciences?
16. Scaling Up 1: more, different data
• 25,000 texts from the first phase
of EEBO-TCP
• 1473 to 1700, 2m pages, 1b
words, public domain
• Little overlap with BL data
• We have global search of the BL
data working. Adding EEBO-TCP
will allow us to compare different
ingest issues
• Inform data service providers
about issues in using different
textual data sets
17. Scaling Up 2: More researchers, understanding needs
18. Scaling Up 3: making researchers into independent users
• Moving away from the “tame programmer in
the room”
• Building a set of reusable recipes
• Training A+H, SHS researchers and Librarians
to be able to run queries themselves
• Core set of fundamental queries that can be
tweaked by individual researchers to search for
unique terms
• By end of Phase 2: Have
researchers searching
successfully without the help of
programmers or data scientists
• In prep for Phase 3: where we
train others from the UK in the
set up and query of textual data
using existing HPC facilities.
19. Plan & Outputs
» Month 1: identify researchers. Ingest EEBO-TCP. Stress test existing queries, develop search templates
» Month 2: Training with core set of researchers to adopt and implement queries. Documentation and
development of training.
» Month 3: Independent Search workshops – software carpentry for A+H research computing
» Month 4: Reflection, write up, preparation of public facing materials that tell others how to do this.
» Fully documented on Github Repo
› https://github.com/UCL-dataspring
› Cluster code
› Raw results
› Visualisations
› User guides
» Publicly presented (will also set up dedicated blog, social media channels, etc in Phase 2)
» Submission of academic paper re project to leading conference
20. Funding
» Pitching for the whole £40,000
» We need adequate funding to pay for a research
programmer to:
› Set up the infrastructure for training
› Prepare training materials
› Ingest new data set
» Also, other staff time, data preservation costs, travel
between sites
» Full support from UCL in FEC
21. Phase 2: Make digitised books truly searchable
22. Not for the pitch, but please fill in
» Contact person: still Melissa Terras
» Social media presence: @melissaterras and @j_w_baker
Editor's Notes
Data for all diseases, normalised by number of words.
Notes:
'consumption' is an interesting test case for later: by looking at frequencies of proximate words, can we make a good guess as to whether any given reference is to consumption as a disease (the word was used as a common name for a form of tuberculosis), or just the word 'consumption'? In order to do so we need a reference set of word frequency data, but that can be confidently built by looking at proximate frequencies for the word 'tuberculosis'.
Cholera is the most interesting set of results here
Image shows major UK outbreaks; other outbreaks were occurring in the rest of the world at other times; first outbreak c. 1817, Bengal.
There are pronounced spikes in the 1870s and 1880s; these are not associated with UK epidemics, but there were outbreaks in the US and elsewhere. It would be interesting to look more closely at these later clusters.
Other diseases – there is some apparent relationship – more mentions of 'consumption' than of measles / whooping cough, and more deaths – BUT – this is an unfiltered use of the word 'consumption'.
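The 'consumption' test case above could be sketched as follows: build a profile of words found near 'tuberculosis', then score each occurrence of 'consumption' by how much of its neighbourhood appears in that profile. This is an illustrative assumption about how the filtering might work, not a method the project has implemented; the stopword list, window size, function names and sample sentences are all hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real run would need a fuller one.
STOPWORDS = {"the", "of", "a", "and", "in", "with", "had", "after", "she"}

def context_counts(texts, target, window=3):
    """Count non-stopwords within `window` words of each occurrence of `target`."""
    counts = Counter()
    for text in texts:
        words = [w for w in re.findall(r"[a-z]+", text.lower())
                 if w not in STOPWORDS]
        for i, word in enumerate(words):
            if word == target:
                start = max(0, i - window)
                counts.update(words[start:i] + words[i + 1:i + 1 + window])
    return counts

def disease_score(text, target, reference, window=3):
    """Fraction of the words near `target` that also occur in the reference profile."""
    near = context_counts([text], target, window)
    total = sum(near.values())
    if total == 0:
        return 0.0
    return sum(n for word, n in near.items() if word in reference) / total

# Hypothetical reference sentences mentioning 'tuberculosis'.
reference = context_counts(
    ["She died of tuberculosis after a long cough.",
     "The patient with tuberculosis had a cough and fever."],
    "tuberculosis",
)

medical = disease_score("The patient died of consumption after a cough.",
                        "consumption", reference)
economic = disease_score("The consumption of coal in factories rose.",
                         "consumption", reference)
```

With these sample sentences the medical use scores higher than the economic one; on the real corpus a threshold would have to be chosen by inspecting a sample of scored references by hand.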
Data sources:
Deaths in England: Chadwick, E (1842) The Sanitary Conditions of the Labouring Population.
Cholera deaths: various sources, mostly Wall, A J (1893) Asiatic Cholera: Its History, Pathology, and Modern Treatment, plus some not-fully-cited narratives found via Google.