UKSG 2015 Mechanical curator and British Library labs
1. The Mechanical Curator, maps
and the online community
Ben O’Steen, British Library Labs
@benosteen ben.osteen@bl.uk
2. Andrew W. Mellon funded project seeking to bring
researchers and our digital data closer together.
There is a significant gap between them for many reasons.
3. I’m there to work out what bridges to build.
6. Modern research forces us to re-evaluate what is
meant by ‘access’
Enabling compute for example:
Distant reading, machine learning, statistical methods -
an ever-growing list.
7. Infancy of understanding
Large-scale analysis of
text is evolving but
young.
Exasperating situation
where ‘black boxes’ of
algorithms are used to
draw conclusions.
http://www.scottbot.net/HIAL/?p=41271
8. “Black Boxes”:
a misnomer
It is legitimate and
useful to use code that
you could not write.
It is not legitimate to
simply believe the
‘label’ on the side of
the box.
E.g. “Sentiment
Analysis” is often
nothing of the sort.
9. Quoting Scott Weingart: (emphasis mine)
● Do sentiment analysis algorithms agree with one another enough to be considered
valid?
● Do sentiment analysis results agree with humans performing the same task
enough to be considered valid?
● Is Jockers’ instantiation of aggregate sentiment analysis validly measuring
anything besides random fluctuations?
● Is aggregate sentiment analysis, by human or machine, a valid method for revealing
plot arcs?
● If aggregate sentiment analysis finds common but distinct patterns and they don’t seem to
map onto plot arcs, can they still be valid measurements of anything at all?
● Can a subjective concept, whether measured by people or machines, actually be
considered invalid or valid?
(again from http://www.scottbot.net/HIAL/?p=41271)
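The first two questions above are about agreement, and that much can be checked without opening the box. A minimal sketch, assuming two sentiment tools (or a tool and a human) have labelled the same passages: Cohen's kappa measures how far their agreement exceeds chance. The labels below are made up for illustration, and `cohens_kappa` is a helper written here, not part of any particular library.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two labellers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each labeller assigned labels independently,
    # at their own observed label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six passages from two different tools.
tool_a = ["pos", "neg", "neg", "pos", "neu", "pos"]
tool_b = ["pos", "neg", "pos", "pos", "neu", "neg"]

kappa = cohens_kappa(tool_a, tool_b)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 means the two sources nearly always agree; a value near 0 means they agree no more often than chance would predict. It says nothing about whether either is measuring what its label claims.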
10. Do researchers need to “level up” and become
machine learning experts to use it?
11. In short, no.
We do not require scientists to have a master’s degree in
Statistics to publish on numerical results, nor to be
prize-winning novelists to write research papers.*
There is a middle ground between treating something as
magic and being an expert in the field.
Without evidence or trials, I cannot say who specifically
(librarian, data scientist, PI, consultant, etc.) is best
placed to gain and use this knowledge.
* Although it likely couldn’t hurt, given the papers I’ve read.
12. Let’s consider a real
example.
Peter Francois,
2013 British Library Labs Competition winner
13. “I am interested in
travel accounts in
Europe during the
19th Century”
14. “The Great Unread”, Graphs, Maps, Trees
and Franco Moretti
From a review of “Graphs, Maps, Trees”:
“Professor Franco Moretti argues heretically that literature scholars should
stop reading books and start counting, graphing, and mapping them
instead [...]”
“For any given period scholars focus on a select group of a mere few
hundred texts: the canon. As a result, they have allowed a narrow
distorting slice of history to pass for the total picture.”
“Moretti offers bar charts, maps, and time lines instead, developing the
idea of "distant reading," set forth in his path-breaking essay
"Conjectures on World Literature," into a full-blown experiment in literary
historiography, where the canon disappears into the larger literary
system.”
16. Bias in digitisation
The tool was made to give a statistically valid sample.
Because only a paltry amount has been digitised, it showed
how skewed the digital corpus is compared to the overall holdings.
Allen B. Riddell, in “Where are the novels?”*, estimates
from HathiTrust’s corpus that:
“... about 58%—somewhere between 47% and 68%—of
the 2,903 novels [all publications in English between 1800
and 1836] have publicly accessible scans.”
* (2012) https://ariddell.org/where-are-the-novels.html
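To make "statistically valid sample" concrete, here is a rough sketch of the general idea: draw a simple random sample of catalogue records, check how many have been digitised, and report an interval rather than a single number. The catalogue below is synthetic, the record identifiers are invented, and the Wilson score interval is my choice for illustration; it is not necessarily how the tool or Riddell's estimate was produced.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Synthetic stand-in for a catalogue: record id -> has it been digitised?
# (Hypothetical data: every fourth record is marked as digitised.)
catalogue = {f"rec{i:05d}": (i % 4 == 0) for i in range(10_000)}

# A fixed seed keeps the draw reproducible, so others can check the sample.
rng = random.Random(2015)
sample_ids = rng.sample(sorted(catalogue), 200)

digitised = sum(catalogue[record_id] for record_id in sample_ids)
low, high = wilson_interval(digitised, len(sample_ids))
print(f"{digitised}/{len(sample_ids)} digitised, interval {low:.2f} to {high:.2f}")
```

The point of sampling from the whole catalogue, rather than from what happens to be online, is that the digitised subset can then be compared against the holdings it came from.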
18. Presentation shapes research questions
“On The Road”, Jack Kerouac
(via http://www.openculture.com/2007/08/on_the_road_the_original_scroll.html)
23. Impact?
Hard to measure but:
- 17-20 million hits on average every month,
over 250 million in 14 months.
- Over 200,000 tags added.
- Over 5,500 clicks on ‘purchase a high
resolution version’
- Hundreds of contributors.
- Iterative crowdsourcing is ongoing.
- https://commons.wikimedia.org/wiki/Commons:British_Library/Mechanical_Curator_collection/map_tag_status
26. Rethinking access
What if everything had (at least) one URL?
Every book?
Every article?
Every page?
Every paragraph?
What if that URL worked in predictable ways?
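One way to picture "predictable" is a scheme where the URL for a smaller unit is always the URL of its container plus one more step. The sketch below is purely illustrative: the `https://example.org` base, the path layout and both function names are assumptions of mine, not an existing British Library scheme.

```python
from urllib.parse import quote

BASE = "https://example.org"  # hypothetical host, for illustration only


def item_url(book, page=None, paragraph=None):
    """Build a URL for a book, a page in it, or a paragraph on that page."""
    if paragraph is not None and page is None:
        raise ValueError("a paragraph URL needs a page number")
    # Percent-encode the identifier so any string is safe as one path segment.
    parts = ["book", quote(str(book), safe="")]
    if page is not None:
        parts += ["page", str(int(page))]
    if paragraph is not None:
        parts += ["paragraph", str(int(paragraph))]
    return BASE + "/" + "/".join(parts)


def parent_url(url):
    """Drop the last level (its name and its id) to get the container's URL."""
    without_id, _, _ = url.rpartition("/")
    without_level, _, _ = without_id.rpartition("/")
    return without_level


paragraph_url = item_url("000123456", page=12, paragraph=3)
print(paragraph_url)
print(parent_url(paragraph_url))
```

With a layout like this, a program that holds any one URL can work out the URLs around it without scraping an interface or asking a person.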
39. Iterative crowdsourcing* and
curation
Release data with the attitude that people will
tell you why it is wrong and give them tools to
fix it.
Georeferencing maps found in books gives
data that can be used to generate more
specific metadata about what those books
concern.
* A term I have borrowed from Mia Ridge
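As a small sketch of how georeferencing could feed back into metadata: once volunteers have pinned points on a map to real coordinates, the bounding box of those points can be matched against a gazetteer to suggest place names for the book. The control points and the three-entry gazetteer below are hypothetical (coordinates approximate), and a real workflow would need a proper gazetteer and handling for maps that cross the antimeridian.

```python
def bounding_box(points):
    """Smallest lon/lat box containing every (longitude, latitude) point."""
    points = list(points)
    if not points:
        raise ValueError("need at least one georeferenced point")
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    return {"west": min(lons), "south": min(lats),
            "east": max(lons), "north": max(lats)}


def places_within(bbox, gazetteer):
    """Sorted names of gazetteer places whose coordinates fall inside the box."""
    return sorted(
        name for name, (lon, lat) in gazetteer.items()
        if bbox["west"] <= lon <= bbox["east"]
        and bbox["south"] <= lat <= bbox["north"]
    )


# Hypothetical control points pinned for one map, as (longitude, latitude).
control_points = [(-3.40, 55.80), (-3.00, 56.10), (-3.10, 55.85)]

# Hypothetical gazetteer; coordinates are approximate.
gazetteer = {
    "Edinburgh": (-3.19, 55.95),
    "London": (-0.13, 51.51),
    "Paris": (2.35, 48.86),
}

bbox = bounding_box(control_points)
subject_places = places_within(bbox, gazetteer)
print(bbox, subject_places)
```

The suggested places are candidates for people to confirm or correct, which fits the iterative approach: release, let others say why it is wrong, and fix it.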
42. Light-hearted but underlines a
crucial pattern of access
Interfaces to content need to expect and to
cater to machine access.
A human may not be present to say, ‘log in’.
Keyword search is useless as a filtering
mechanism.
Text- and data-mining is like throwing a
magnet into a haystack, without knowing if
there are any steel needles in there.
44. Off the Map
2014 Winners
2014 winning team:
Gothulus Rift
University of South Wales
Created a Fonthill Abbey-inspired
game called Nix using
Oculus Rift
Blog: http://nixgamedevblog.blogspot.co.uk
YouTube flythrough: http://youtu.be/8ESieZO4VHw
45. Off the Map 2015
Alice’s Adventures Off the Map
Part of the British Library's celebrations for the 150th
anniversary of Alice in Wonderland
http://gamecity.org/alices-adventures-off-the-map/
46. British Library Labs Competitions
http://labs.bl.uk/British+Library+Labs+Competition+2015
Unofficial descriptions of the two main aspects
of this:
“Tell us your ideas”
and
“Show us what you have done”