Visualizing the Transcribe Bentham Corpus
Frédérique Mélanie, Estelle Tieberghien, Pablo Ruiz Fabo,
Thierry Poibeau
LATTICE Lab: ENS – CNRS – U Paris 3, PSL – USPC
Tim Causer, Melissa Terras
UCL Bentham Project, UCL Digital Humanities
UCLDH Seminar, December 2016
2. Outline
• UCL Bentham Project & Transcribe Bentham
• How to navigate this corpus? Visualizations
– Lexical extraction
– Co-occurrence networks
• Static view and Temporal evolution
• Evaluation and Challenges
• Other corpus explorations via visualization
• Distant Reading Module, WordTree
• Other lexical analyses
3. Jeremy Bentham (1748-1832)
• Jurist, philosopher, and legal and social reformer
• Leading theorist in Anglo-American philosophy of law
• Influenced the development of welfarism
• Advocated utilitarianism
• Animal rights
• Work on the “panopticon”
• Not founder of UCL, but...
• 60,000 folios in UCL Special Collections
• 40,000 untranscribed
• Auto-icon
4. The Bentham Project
• http://www.ucl.ac.uk/Bentham-Project/
• Since 1959
• “aims to produce a new scholarly
edition of the works and
correspondence of Jeremy Bentham”
• Twenty-six volumes of the new
Collected Works have been published
• 50 years to transcribe 20,000 folios
• Previous AHRC grant catalogued the
manuscripts
– http://www.benthampapers.ucl.ac.uk/
10. Facts and Figures (as of 1st July 2016)
• 16,205 manuscripts transcribed/partially-transcribed
• 15,351 (94%) checked and approved
• 83,955 visits
• 34,359 unique views
• Average session time: 14 minutes 13 seconds
• 140 countries
• 514 people have transcribed something
• Most of the work done by the 26 Super Transcribers
• Average of 54 transcripts edited per week since the start of the project
• Average of 56 per week during the last twelve months
• Greatest number of transcripts in any one week: 300 (w/c 14 June 2014)
11. Transcribe Bentham progress, 8 September 2010 to 20 March 2015
[Figure: chart of manuscripts worked on and completed transcripts over time; x-axis: 8 Sep 2010 to 6 Mar 2015; y-axis: 0 to 12,000; annotated events: NYT article, BL manuscripts made available]
12. With thanks to:
• Prof Philip Schofield (UCL Bentham Project, Principal Investigator)
• Dr Tim Causer (Bentham Project)
• Dr Kris Grint (Bentham Project)
• Richard Davis (University of London Computer Centre)
• José Martin (ULCC)
• Martin Moyle (UCL Library Services)
• Lesley Pitman (UCL Library Services)
• Tony Slade (UCL Creative Media)
• Miguel Faleiro Rodrigues, Alejandro Salinas Lopez, and Raheel Nabi (UCL Creative Media)
• Dr Arnold Hunt (British Library)
• Anna-Maria Sichani (Bentham Project)
• Dr Justin Tonra (National University of Ireland Galway) and Dr Valerie Wallace (Victoria University Wellington), both formerly of the Bentham Project
• All the partners in Transcriptorium: http://transcriptorium.eu/consortium/
• And Transcribe Bentham’s volunteers!
• Project previously funded by the AHRC and the Andrew W. Mellon Foundation
13. Outline
• UCL Bentham Project & Transcribe Bentham
• How to navigate this corpus? Visualizations
– Lexical extraction
– Co-occurrence networks
• Static view and Temporal evolution
• Evaluation and Challenges
• Other corpus explorations via visualization
• Distant Reading Module, WordTree
• Other lexical analyses
15. Relevant access to a large corpus
• A search index?
• Topic models?
• Corpus cartography?
Challenges for this corpus
• Not an all-English corpus
• Difficulties posed by an historical variety
• Technical language
• Revision history, additions and deletions
16. Stats for analyzed corpus sample
• Total TEI files: 29,900
• In English: 29,400
• That we dated: 16,700
• We only visualized English transcripts that
we could date (with a simple heuristic)¹
• Work is based on ca. 55% of all the
TEI files in our sample
¹ We did not use the corpus’ date metadata for this exercise
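The slides do not say what the dating heuristic was. Purely as an illustration of what a "simple heuristic" might look like, the sketch below picks the most frequently mentioned year in a transcript that falls within Bentham's lifetime. The function name `guess_year`, the regular expression, and the year bounds are all assumptions made for this example, not the method actually used.

```python
import re
from collections import Counter

# Assumed bounds: Bentham's lifetime (1748-1832), as given on the biography slide.
YEAR_MIN, YEAR_MAX = 1748, 1832
# Four-digit numbers starting with 17 or 18 (hypothetical pattern).
YEAR_RE = re.compile(r"\b(1[78]\d{2})\b")


def guess_year(text):
    """Return the most frequent plausible year mentioned in `text`, or None."""
    years = [int(y) for y in YEAR_RE.findall(text)]
    years = [y for y in years if YEAR_MIN <= y <= YEAR_MAX]
    if not years:
        return None
    return Counter(years).most_common(1)[0][0]
```

A transcript for which `guess_year` returns `None` would simply be left out of the dated sample, which matches the idea of visualizing only the transcripts that could be dated.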
17. Corpus Cartography
• Lexical extraction (of relevant sequences)
• Clustering based on similarity measures
• Visual representation (map of the corpus)
based on layout algorithms
18. Cartography tool: CorText
• CorText Manager covers all cartography
steps:
– Lexical extraction
– Clustering
– Visualization
• Each step can be used independently,
thanks to standard import/export formats
20. Outline
• UCL Bentham Project & Transcribe Bentham
• How to navigate this corpus? Visualizations
– Lexical extraction
– Co-occurrence networks
• Static view and Temporal evolution
• Evaluation and Challenges
• Other corpus explorations via visualization
• Distant Reading Module, WordTree
• Other lexical analyses
21. Lexical Extraction
• CorText native option
– Noun-Phrase chunks (based on TreeTagger)
• Our options:
– Entity Linking / Wikification to DBpedia
– Keyphrase extraction tools like YaTeA
• In all cases: manual selection of pre-ranked
candidate terms by a domain expert
22. Entity Linking / Wikification
• Given a database with encyclopedic
knowledge (e.g. Wikipedia)
– Finds references (mentions) to database terms in the text
– Handles variability in the mentions for a term
[Figure: a database term matched to its mentions in the corpus: judicatory, judicial, judicature, Judicatory, Judicial]
26. Entity Linking / Wikification
• Tool: DBpedia Spotlight
• Compares the context of sequences of
words in a text against DBpedia articles:
– Term definition’s text
– Links
– DBpedia structure (redirections etc.)
• Assigns a DBpedia term to the sequence if
a good match is found
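As a rough idea of how such a tool is queried, here is a minimal sketch of a client for the DBpedia Spotlight web service. It assumes the public demo endpoint at `api.dbpedia-spotlight.org` and the JSON response layout with a `Resources` list carrying `@surfaceForm` and `@URI` fields; the helper names and the default confidence value are choices made for this example. The slides do not state how the service was configured for the Bentham corpus.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public demo endpoint; a locally hosted server would use another URL.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build a GET request asking Spotlight for JSON annotations of `text`."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{SPOTLIGHT_URL}?{query}", headers={"Accept": "application/json"}
    )


def parse_annotations(payload):
    """Return (mention, DBpedia URI) pairs from a decoded Spotlight response."""
    return [(r["@surfaceForm"], r["@URI"]) for r in payload.get("Resources", [])]


def annotate(text, confidence=0.5):
    """Send `text` to the service and return the linked mentions (needs network access)."""
    with urllib.request.urlopen(build_request(text, confidence)) as response:
        return parse_annotations(json.load(response))
```

Raising `confidence` makes the service keep only the matches it is surer about, which is one way to limit spurious links when a present-day knowledge base is applied to older texts.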
27. Entity Linking / Wikification
Example terms and their variants
Term          Variants
Judiciary     judicature, judicatory, judicial
Jury          jury, juries
Monarch       king, monarch
Quantity      amount, quantity
Saint Peter   Simon Peter, Cephas
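Once mentions are linked, the variants can be collapsed onto one term before counting. The sketch below encodes the example table above as a lookup; the names `VARIANTS`, `LOOKUP`, and `canonical` are introduced here for illustration, and matching case-insensitively is an assumption.

```python
# Term -> variants, copied from the example table.
VARIANTS = {
    "Judiciary": ["judicature", "judicatory", "judicial"],
    "Jury": ["jury", "juries"],
    "Monarch": ["king", "monarch"],
    "Quantity": ["amount", "quantity"],
    "Saint Peter": ["Simon Peter", "Cephas"],
}

# Inverted index: lower-cased variant -> term.
LOOKUP = {v.lower(): term for term, vs in VARIANTS.items() for v in vs}


def canonical(mention):
    """Return the term a mention belongs to, or None if it is unknown."""
    return LOOKUP.get(mention.lower())
```

With this mapping, "Judicatory" and "judicial" both count towards "Judiciary", so spelling and capitalization differences do not split a concept across several nodes of the map.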
28. Entity Linking / Wikification
• Applying a current knowledge-base
(DBpedia) to 18th-19th century texts
• Is this a valid method?
29. Keyphrase extraction
• YaTeA (Aubin and Hamon, 2006)
• Extracts noun-phrases of configurable
structure and length
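To make "noun phrases of configurable structure and length" concrete, here is a small sketch that is not YaTeA itself. It assumes the text has already been part-of-speech tagged with coarse tags (`ADJ`, `NOUN`, and so on), and keeps runs of adjectives and nouns that end in a noun and fall within a length range. The function name and the tag set are assumptions for this example.

```python
def noun_phrases(tagged, min_len=2, max_len=4):
    """Return maximal ADJ/NOUN runs that end in a noun, within the length bounds.

    `tagged` is a sequence of (word, tag) pairs.
    """
    phrases, run = [], []
    # A sentinel token flushes the last run at the end of the input.
    for word, tag in list(tagged) + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
            continue
        # Drop trailing adjectives so that the phrase ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if min_len <= len(run) <= max_len:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases
```

Changing `min_len` and `max_len`, or the set of tags allowed inside a run, is the kind of configuration the slide refers to; a real term extractor would add ranking and filtering on top.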
30. Outline
• UCL Bentham Project & Transcribe Bentham
• How to navigate this corpus? Visualizations
– Lexical extraction
– Co-occurrence networks
• Static view and Temporal evolution
• Evaluation and Challenges
• Other corpus explorations via visualization
• Distant Reading Module, WordTree
• Other lexical analyses
31. Clustering
• CorText offers several similarity metrics
– We chose the default method (distributional)
for homogeneous networks (Weeds & Weir 2005)
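The following sketch shows the general idea of a distributional similarity between terms: two terms are similar when they co-occur with the same other terms. It is loosely inspired by the additive precision and recall measures of Weeds & Weir (2005), with positive PMI as the feature weight and a harmonic mean as the combination. It is not CorText's implementation, and all function names are assumptions for this example.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(docs):
    """Count, for each term, how often it shares a document with every other term."""
    counts = defaultdict(Counter)
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def pmi_weights(counts):
    """Weight each (term, co-occurring term) pair by PMI, keeping positive values only."""
    total = sum(sum(c.values()) for c in counts.values())
    marginal = {t: sum(c.values()) for t, c in counts.items()}
    weights = {}
    for term, feats in counts.items():
        weights[term] = {}
        for feat, n in feats.items():
            pmi = math.log(n * total / (marginal[term] * marginal[feat]))
            if pmi > 0:
                weights[term][feat] = pmi
    return weights


def similarity(weights, a, b):
    """Harmonic mean of additive precision and recall over shared features."""
    wa, wb = weights.get(a, {}), weights.get(b, {})
    shared = wa.keys() & wb.keys()
    if not shared:
        return 0.0
    precision = sum(wa[f] for f in shared) / sum(wa.values())
    recall = sum(wb[f] for f in shared) / sum(wb.values())
    return 2 * precision * recall / (precision + recall)
```

Pairs with a high score become edges of the co-occurrence network, and the clusters on the map are groups of terms that are densely connected in this sense.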
32. Visualization
• Static (one map for all dated transcripts)
• Dynamic: temporal slices on the corpus
– Heatmaps
– “River” or Sankey networks (“Tubes layout”)
http://apps.lattice.cnrs.fr/bentham
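For the dynamic views, the dated transcripts have to be cut into temporal slices before anything is drawn. A minimal sketch of that step, assuming each transcript is a `(year, text)` pair and that a term counts once per document in which it appears. The function name, the fixed slice width, and the substring matching are simplifications for this example.

```python
from collections import Counter


def slice_counts(dated_docs, terms, start, end, width):
    """Return {slice_start_year: Counter(term -> number of documents containing it)}."""
    slices = {year: Counter() for year in range(start, end + 1, width)}
    for year, text in dated_docs:
        if not start <= year <= end:
            continue  # outside the period being mapped
        key = start + ((year - start) // width) * width
        lowered = text.lower()
        for term in terms:
            if term in lowered:
                slices[key][term] += 1
    return slices
```

Each slice then gives one column of a heatmap (terms by period), and comparing the clusters of consecutive slices is what the "river" layout visualizes.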
54. Outline
• UCL Bentham Project & Transcribe Bentham
• How to navigate this corpus? Visualizations
– Lexical extraction
– Co-occurrence networks
• Static view and Temporal evolution
• Evaluation and Challenges
• Other corpus explorations via visualization
• Distant Reading Module, WordTree
• Other lexical analyses
55. Evaluation
• Static maps: terms in the clusters
correspond closely to issues dealt with by
Bentham for the thematic areas of each
cluster
• Heatmaps: The evolution depicted
corresponds to the evolution of topics in
Bentham’s work
• DBpedia vs. keyphrase extraction: The
keyphrases provide more relevant
evidence for specialized scholars; a
general encyclopedia can help other users
60. Evolution of a lexical item
[Figure: temporal evolution of a lexical item]
Temporal evolution profiles:
– Here: rising, but present at all dates
– Other examples: falling, regular spikes, etc.
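A profile such as "rising, but present at all dates" can be read off a term's frequency series, one value per temporal slice. The sketch below labels a series by the sign of its least-squares slope, relative to its mean level. The function names and the tolerance are assumptions for this example, and it does not try to detect the "regular spikes" profile.

```python
def profile(series, tolerance=0.05):
    """Label a frequency series as 'rising', 'falling', or 'flat'."""
    n = len(series)
    if n < 2:
        return "flat"
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    # Least-squares slope of frequency against slice index.
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den
    # Compare the slope with the mean level so that the threshold is scale-free.
    if mean_y == 0 or abs(slope) < tolerance * mean_y:
        return "flat"
    return "rising" if slope > 0 else "falling"


def present_at_all_dates(series):
    """True if the term occurs in every temporal slice."""
    return all(value > 0 for value in series)
```

The example on the slide would correspond to a series for which `profile` returns "rising" and `present_at_all_dates` is true.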
79. Summary
• Accessing a large unedited corpus
– Cartography methods
• Lexical extraction
• Maps
– Static picture of the corpus
– Temporal evolution
– Other visualizations (Distant Reading Module, WordTree)
• Domain-expert feedback
• Challenges
• Other lexical analyses
http://apps.lattice.cnrs.fr/bentham
80. Bibliography
Aubin, S., and Hamon, T. (2006). Improving Term
Extraction with Terminological Resources. In
Advances in Natural Language Processing: 5th
International Conference on NLP, FinTAL 2006, pp.
380-387. LNAI 4139. Springer.
Auer, Sören, et al. (2007). DBpedia: A nucleus for a
web of open data. The Semantic Web. Springer.
Causer, Tim, and Terras, Melissa (2014a). Many
hands make light work. Many hands together
make merry work: Transcribe Bentham and
crowdsourcing manuscript collections. In
Crowdsourcing Our Cultural Heritage, ed. M. Ridge.
Ashgate.
Causer, Tim, and Terras, Melissa (2014b).
Crowdsourcing Bentham: Beyond the Traditional
Boundaries of Academic History. International
Journal of Humanities and Arts Computing, 8.
Chavalarias, David, and Jean-Philippe Cointet. (2013).
Phylomemetic Patterns in Science Evolution—The
Rise and Fall of Scientific Fields. PLoS ONE 8 (2)
Cortext Manager Documentation (2016).
https://docs.cortext.net/.
Mendes, Pablo N., Max Jakob, Andrés García-Silva,
and Christian Bizer. (2011). DBpedia Spotlight:
Shedding Light on the Web of Documents. In
Proceedings of the 7th International Conference on
Semantic Systems, 1–8. ACM.
Mélanie, F., Tieberghien, E., Ruiz, P., Poibeau, T.,
Causer, T., and Terras, M. (2016). Mapping the Bentham
Corpus. In Digital Humanities Conference (DH
2016). Kraków, Poland.
Poibeau, T. and Ruiz, P. (2015). Generating Navigable
Semantic Maps from Social Sciences Corpora. In
Digital Humanities Conference (DH 2015). Sydney,
Australia.
Rule, Alix, Jean-Philippe Cointet, and Peter S.
Bearman. (2015). Lexical Shifts, Substantive
Changes, and Continuity in State of the Union
Discourse, 1790–2014. Proceedings of the National
Academy of Sciences 112 (35)
Venturini, T., N. Baya Laffite, J.-P. Cointet, I. Gray, V.
Zabban, and K. De Pryck. (2014). Three Maps and
Three Misunderstandings: A Digital Mapping of
Climate Diplomacy. Big Data & Society 1
Weeds, J., and Weir, D. (2005). Co-occurrence retrieval: A
flexible framework for lexical distributional similarity.
Computational Linguistics 31(4), 439–475.
Wattenberg, M., and Viégas, F. B. (2008). The word tree,
an interactive visual concordance. IEEE
Transactions on Visualization and Computer Graphics,
14(6), pp. 1221-1228.