Big Data in the Arts and Humanities: Stirling presentation
1. Big Data in the Arts and Humanities
Andrew Prescott, University of Glasgow
AHRC Theme Leader for Digital Transformations
University of Stirling Literature and Languages Seminar, 17 February 2016
2. Neurone activity in the brain of a zebra fish embryo.
Each video sequence is one terabyte in size.
Ahrens, M. B. & Keller, P. J. Nature Meth. http://dx.doi.org/10.1038/NMETH.2434 (2013)
3. The high frequency telescopes of the Square Kilometre Array will produce
1 exabyte per day (more than current global internet traffic) in the first
phase. This will eventually rise to many petabits (10^15 bits) per second,
more than 10 times current global internet traffic.
4. BIG HUMANITIES DATASETS
Sound and Video:
• Shoah Holocaust survivor testimonial collection is 20 terabytes
(cf. Sloan Digital Sky Survey, 10 terabytes)
• The BBC’s digital assets are estimated at about 52 petabytes of data
Structured data:
• US National Archives and Records Administration: 142 TB of data;
estimated 347 PB by 2022
• Ancestry holds 14 billion records and is adding 2 million records daily.
Brightsolid's (Findmypast) new data centre in Aberdeen will have 400
petabytes of storage
• Web archives: multi-petabyte
Linguistic corpora:
• Corpus of Contemporary American English: 450 million words
• Wikipedia Corpus: 1.9 billion words
• Google American books n-grams: 155 billion words (a toy illustration of
n-gram counting follows this slide)
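As a rough illustration of what an n-gram corpus records, here is a minimal
Python sketch using a single invented sentence (not Google's actual pipeline):
an n-gram count simply tallies how often each sequence of n consecutive words
appears in a body of text.

```python
from collections import Counter

# Toy example: count 2-grams (word pairs) in one invented sentence.
# The Google books n-gram data does the same across millions of scanned volumes.
text = "the humanities are changing and the humanities are growing"
words = text.split()
bigrams = Counter(zip(words, words[1:]))
print(bigrams.most_common(3))
```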
5. THE CHANGING NATURE OF THE PRIMARY
MATERIALS OF HUMANITIES RESEARCH
• The papers of the British prime minister William Ewart
Gladstone (1809-1898): approx. 160,000 documents
in 762 volumes.
• Margaret Thatcher archive: 1 million documents in
3,000 boxes occupying 300 metres of shelving
• Enron Corporation Corpus, acquired by Federal
Energy Regulatory Commission during enquiry into
corporation’s collapse. Approx. 600,000 e-mails
generated by 158 employees; about 423MB (zipped).
6. Electronic records from the Executive Office of the President during the
presidency of George W. Bush: 82 TB of data; 200+ million e-mail messages;
3+ million digital photographs; 30+ million other electronic records
http://www.georgewbushlibrary.smu.edu/Research/Electronic-Records.aspx
7. ###### Begin Original ARMS Header ######
RECORD TYPE: PRESIDENTIAL (NOTES MAIL)
CREATOR:Sandy Kress ( CN=Sandy Kress/OU=OPD/O=EOP [ OPD ] ) CREATION DATE/TIME:14-JUN-2001 17:13:17.00
SUBJECT:: Education statement
TO:Claire E. Buchan ( CN=Claire E. Buchan/OU=WHO/O=EOP@EOP [ WHO ] ) READ:UNKNOWN
###### End Original ARMS Header ######
---------------------- Forwarded by Sandy Kress/OPD/EOP on 06/14/2001 05:13 PM ---------------------------
Sarah Pfeifer 06/14/2001 04:59:34 PM Record Type: Record
To: Sarah E. Youssef/OPD/EOP@EOP, Brian R. Besanceney/OPD/EOP@EOP, Sandy Kress/OPD/EOP@EOP
cc:
Subject: Education statement
---------------------- Forwarded by Sarah Pfeifer/OPD/EOP on 06/14/2001 04:59 PM ---------------------------
Sarah Pfeifer 06/14/2001 04:59:00 PM Record Type: Record
To: See the distribution list at the bottom of this message cc:
Subject: Education statement
This statement has been approved by the President. Harriet called me several minutes ago with one last change, which I have
incorporated.
Message Sent To:_____________________________________________________________ Harriet Miers/WHO/EOP@EOP
John Gardner/WHO/EOP@EOP Barbara A. Barclay/WHO/EOP@EOP Debra D. Bird/WHO/EOP@EOP Carolyn E. Cleveland/
WHO/EOP@EOP
E-mail by B. Alexander (Sandy) Kress, Senior Adviser to President
George W. Bush on Education, concerning the drafting of the No Child
Left Behind Act in 2001
http://www.georgewbushlibrary.smu.edu/en/Research/Electronic-Records/Email.aspx#Email
8. • Visualisation of relationships between terms in Wikileaks Significant
Action Reports relating to Iraq
• Big data: ‘whose size forces us to look beyond the tried-and-true methods
that are prevalent at that time’ (Jacobs, 2009)
• Illustrates how big data is already a current issue for humanities
researchers
• Suggests humanities are becoming not only more quantitative, but also
more visual, haptic and exploratory
9. collateral exposure..?POSSIBLE INFORMATION
media diversion..?POSSIBLE INFORMATION
Extract from project publication for Insurance.AES256 by Michael Takeo
Magruder (2011), using Wikileaks material to reflect on issues of
information freedom and secrecy in today's ever-shifting media landscape.
http://www.takeo.org/nspace/2011-insurance_aes256/
10. Portfolio of Big Data projects funded by UK
Arts and Humanities Research Council,
2014-15
• Dealing with large textual corpora: UK statute law; mining
the history of medicine
• Linking existing databases: Snapdrgn; Big Data History of
Music
• Annotation of unstructured data: DEEP film access;
optical music recognition; Lost Visions
• Visualisation: International crime fiction; Seeing Data
• Critical study of data: Our Data Ourselves; Secret Life of a
Weather Datum
11. Portfolio of Big Data projects funded by UK
Arts and Humanities Research Council,
2014-15
• Mapping: Literary History of Edinburgh;
• Internet of Things: archaeological 3D imaging; Tangible Memories
• Reflects range of activities currently used in ‘Big Humanities’.
• Does anything link these together methodologically? Do they
represent anything different from what we have previously done?
• Is there a ‘Big Data moment’, or is it simply that data and expertise are
now available on a larger scale?
• What distinctive contributions can the arts and humanities make
to the Big Data debates?
12. HAVE WE BEEN HERE FOR A LONG TIME?
• If Big Data is defined as data whose size requires us to look beyond
tried-and-true methods, it has been with us since antiquity
• Invention of writing linked to government
need to manage information
• 1086: Detailed register of property in
Domesday Book
• 12th century: development of pipe rolls
and use of counters in government
accounting
• 13th century: alphabetisation of the Bible by a team of Dominican friars
13. WHY BIG DATA IS DIFFERENT
• Historical examples like Domesday Book or the census were inventories:
descriptive and backward-looking
• The aim of Big Data techniques is predictive: ‘We know what you are going
to do tomorrow’ (credit score agency)
• Results derive from quantity of data rather than quality; methods are
‘inherently inexact but the vast amount of data compensates for the
imperfections’ (Mayer-Schönberger, p. 187)
• Ignores causal relationships and looks for correlations, e.g. how lifestyle
factors predict the likelihood of adhering to medical prescriptions (a toy
correlation sketch follows this slide)
• About quantity (not quality); storage (not curation); statistics (not
logic); syntax (not semantics) [thanks to Volker Markl]
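To make the shift from causes to correlations concrete, here is a minimal
Python sketch with entirely invented figures (not data from any study cited
above): the analysis reports only that two variables move together, and says
nothing about why.

```python
import numpy as np

# Invented toy data: two observed variables with no causal claim attached.
hours_online = np.array([1.0, 2.5, 3.0, 4.5, 5.0, 6.5, 7.0])
purchases = np.array([0, 1, 1, 2, 3, 3, 4])

# Pearson correlation coefficient: a single number summarising co-movement.
r = np.corrcoef(hours_online, purchases)[0, 1]
print(f"correlation = {r:.2f}")  # a high r licenses prediction, not explanation
```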
14. EXAMPLES OF PREDICTIVE ANALYTICS
• Driven largely by finance and retail, but rapidly spreading into other
sectors
• Chicago: the Automated Preventive Rodent Baiting Program analyses 31
indicators to predict where rodent infestations will occur (a toy sketch of
this kind of risk scoring follows this slide)
• New York: predicting where unlicensed building conversions have
occurred to target inspections and issue vacate orders
• Chicago: Predictive Policing System
• AHRC programme includes projects on online betting on election
results, and on legislation
• AHRC-Nesta project to use predictive analytics to improve museum
attendance
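A minimal sketch of what such predictive scoring involves, in Python with
invented indicator data and far fewer variables than the real programmes use
(this is not the Chicago model): a classifier is trained on past outcomes and
then ranks new cases by predicted risk so that action can be targeted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data: each row is a location, each column an indicator
# (e.g. past complaints, restaurant density, vacant buildings nearby).
X_past = np.array([[3, 1, 0], [8, 4, 2], [1, 0, 1], [9, 5, 3], [2, 2, 0]])
y_past = np.array([0, 1, 0, 1, 0])  # 1 = infestation was actually found

model = LogisticRegression().fit(X_past, y_past)

# Score new locations; the highest-risk ones are inspected first.
X_new = np.array([[7, 3, 1], [1, 1, 0]])
print(model.predict_proba(X_new)[:, 1])  # predicted probability of infestation
```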
15. Use of big data techniques in choosing film directors,
cast, crew, etc.: the-numbers.com
16. Use of predictive analytics to ‘optimise scripts’ in film and TV:
epagogix.com
John Wiley is considering using IBM PureData analytics in a similar way
for scientific and academic publishing
17. CHALLENGES OF BIG DATA TO THE ARTS
AND HUMANITIES
• Not simply about role of quantification or scientific method in arts and
humanities
• Challenges assumptions about role of information in research: if data
is big enough, messy or poorly curated data need not be an issue
• Questions existing research methods: ‘data-driven research’
• Undermines assumptions about causality and human agency
• Role of retail and financial agencies in developing these methods - the
enclosure of data
• Challenges existing critical and theoretical frameworks: not ‘end of
theory’ but ‘big data needs big theory’
18. HOW THE ARTS AND HUMANITIES CAN
ADDRESS BIG DATA CHALLENGES
• Developing new theoretical frameworks and responses: critical
data studies
• Providing models in areas such as causality and ‘messiness of
data’
• Exploring the spaces and flow of big data
• Promoting moral values of humanities research in a big data world
• Role of design
• ‘Radical contextualisation’ of big data
• Humanisation of big data
19. THE NEED FOR BIG THEORY
• Chris Anderson in Wired 2008: ‘Out with every theory of human
behavior, from linguistics to sociology. Forget taxonomy, ontology,
and psychology. Who knows why people do what they do? The point
is they do it, and we can track and measure it with unprecedented
fidelity. With enough data, the numbers speak for themselves’.
• New York Times, 2010: ‘The next big idea in language, history and
the arts? Data. Members of a new generation of digitally savvy
humanists argue it is time to stop looking for inspiration in the next
political or philosophical ‘ism’ and start exploring how technology is
changing our understanding of the liberal arts. This latest frontier is
about method, they say, using powerful technologies and vast stores
of digitised materials that previous humanities scholars did not have’.
• Charles Darwin (cited by Callebaut): ‘all observation must be for or
against some view if it is to be of any service’
20. THE NEED FOR BIG THEORY
• Bowker (2006): Raw data is both an oxymoron and a
bad idea; to the contrary, data should be cooked with
care
• Huggett (2014): Data are not 'out there', waiting to be
discovered; if anything, data are waiting to be created.
Information about the past is situated, contingent, and
incomplete; data are theory-laden, and relationships
are constantly changing depending on context.
• Kitchin and Lauriault (2014): Data are situated,
contingent, relational, and framed, and used
contextually to try and achieve certain aims and goals
21. CRITICAL DATA STUDIES
Dalton and Thatcher, What does a critical data studies look like,
and why do we care? Seven points for a critical approach to ‘big
data’ (Society and Space, 2014)
1. situate data regimes in time and space
2. expose data as inherently political and whose interests they
serve
3. unpack the complex, non-deterministic relationship between
data and society
4. illustrate the ways in which data are never raw
5. expose the fallacies that data can speak for themselves and
that big data will replace small data
6. explore how new data regimes can be used in socially
progressive ways
7. examine how academia engages with new data regimes and
the opportunities of such engagement
24. RETHINKING THE IMPLICATIONS OF BIG DATA
• Is a switch from causality to correlation so radical?
• As long ago as 1946, the historian Marc Bloch argued
against the ‘idol of origins’ and sought a history with
stronger social and cultural understanding
• Pioneering work of humanities scholarship such as the
Annales School of historians has a lot to contribute in terms
of integrating methodology, data and new techniques
• Continued importance of critical understanding of data, as
the Google Flu Trends controversy illustrates
• Experience of humanities scholars in dealing with complex
and messy historical datasets is potentially very relevant
25. Visualisation of ontology for linking information
about people in the ancient world developed by
the Standards for Networking Ancient
Prosopographies project:
snapdrgn.net
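A minimal sketch, using Python's rdflib with invented URIs (not SNAP:DRGN's
actual vocabulary or identifiers), of the basic move such an ontology enables:
asserting that records in separate prosopographical databases refer to the
same person, so that queries can cross the datasets.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, OWL

g = Graph()

# Hypothetical namespaces standing in for two separate person databases;
# these are illustrative only, not real project URIs.
LGPN = Namespace("http://example.org/lgpn/")
TM = Namespace("http://example.org/trismegistos/")

# Two records, one in each database, that describe the same ancient person.
g.add((LGPN.person_401, FOAF.name, Literal("Apollonios")))
g.add((TM.per_12345, FOAF.name, Literal("Apollonios")))

# A single linking assertion lets queries traverse both datasets as one graph.
g.add((LGPN.person_401, OWL.sameAs, TM.per_12345))

print(g.serialize(format="turtle"))
```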
28. Common Design Strategies for Exploring Signaling Networks in Biology
and Intellectual Geographies in History
Erica Savig, M.Arch., PhD Candidate, Cancer Biology, Stanford University
(Lab of Garry P. Nolan); National Science Foundation Graduate Research
Fellow; Stanford Graduate Research Fellow
Nicole Coleman, Director, Humanities + Design, Stanford University
29. Diagram: parametric modelling quantitatively maps single-cell protein
levels to individual qualitative components (‘Component and Behavior’
panels for Proteins 1, 2 and 3)
33. Tim Hitchcock on Big Data, Small Data and Meaning
(historyonics.blogspot.co.uk):
‘Big Data’ supposedly lets you get away with dirty data. In contrast,
humanists do read the data; and do so with a sharp eye for its
individual rhythms and peculiarities – its weirdness.
In the rush towards 'Big Data' – the Longue durée, and automated
network analysis; towards a vision of Humanist scholarship in which
Bayesian probability is as significant as biblical allusion, the most
urgent need seems to me to be to find the tools that allow us to do the
job of close reading of all the small data that goes to make the bigger
variety…we need to be able to contextualise every single word in a
representation of every word, ever. Every gesture contextualised in the
collective record all gestures; and every brushstroke, in the collective
knowledge of every painting.
34. Towards a ‘radical contextualisation’: Mapping Metaphor
with the Historical Thesaurus of the English Language
http://blogs.arts.gla.ac.uk/metaphor/