Visualizing the Transcribe Bentham Corpus

Frédérique Mélanie, Estelle Tieberghien, Pablo Ruiz Fabo,
Thierry Poibeau
LATTICE Lab: ENS – CNRS – U Paris 3, PSL – USPC
Tim Causer, Melissa Terras
UCL Bentham Project, UCL Digital Humanities
UCLDH Seminar, December 2016

Outline
• UCL Bentham Project & Transcribe Bentham
• How to navigate this corpus? Visualizations
– Lexical extraction
– Co-occurrence networks
• Static view and Temporal evolution
• Evaluation and Challenges
• Other corpus explorations via visualization
• Distant Reading Module, WordTree
• Other lexical analyses

Jeremy Bentham (1748-1832)
•Jurist, philosopher, and legal and
social reformer
•Leading theorist in Anglo-American
philosophy of law
•Influenced the development of
welfarism
•Advocated utilitarianism
•Animal rights
•Work on the “panopticon”
•Not founder of UCL, but...
•60,000 folios in UCL Special Collections
•40,000 untranscribed
•Auto-icon

The Bentham Project
• http://www.ucl.ac.uk/Bentham-Project/
• Since 1959
• “aims to produce a new scholarly
edition of the works and
correspondence of Jeremy Bentham”
• Twenty-six volumes of the new
Collected Works have been published
• 50 years to transcribe 20,000 folios
• Previous AHRC grant catalogued the
manuscripts
– http://www.benthampapers.ucl.ac.uk/

Facts and Figures (as of 1st July 2016)
• 16,205 manuscripts transcribed/partially-transcribed
• 15,351 (94%) checked and approved
• 83,955 visits
• 34,359 unique views
• Average session time: 14 minutes 13 seconds
• 140 countries
• 514 people have transcribed something
• Most of the work done by the 26 Super Transcribers
• Average of 54 transcripts edited per week since the start of the project
• Average of 56 per week during the last twelve months
• Greatest number of transcripts in any one week: 300 (w/c 14 June 2014)

Transcribe Bentham progress, 8 September 2010 to 20 March 2015
[Figure: chart with a y-axis from 0 to 12,000 and a date axis from September 2010 to March 2015. Series: "Manuscripts worked on" and "Completed transcripts". Annotated events: "NYT article" and "BL manuscripts made available".]

With thanks to:
•Prof Philip Schofield (UCL Bentham Project, Principal
Investigator)
•Dr Tim Causer (Bentham Project)
•Dr Kris Grint (Bentham Project)
•Richard Davis (University of London Computer Centre)
•José Martin (ULCC)
•Martin Moyle (UCL Library Services)
•Lesley Pitman (UCL Library Services)
•Tony Slade (UCL Creative Media)
•Miguel Faleiro Rodrigues, Alejandro Salinas Lopez, and
Raheel Nabi (UCL Creative Media)
•Dr Arnold Hunt (British Library)
•Anna-Maria Sichani (Bentham Project)
•Dr Justin Tonra (National University of Ireland Galway)
and Dr Valerie Wallace (Victoria University Wellington),
both formerly of the Bentham Project
•All the partners in Transcriptorium
http://transcriptorium.eu/consortium/
•And Transcribe Bentham’s volunteers!
•Project previously funded by the AHRC and the Andrew W.
Mellon Foundation

Relevant access to a large corpus
• A search index?
• Topic models?
• Corpus cartography?
Challenges for this corpus
• Not an all-English corpus
• Difficulties posed by an historical variety
• Technical language
• Revision history, additions and deletions

Stats for analyzed corpus sample
• Total TEI files: 29,900
• In English: 29,400
• That we dated: 16,700
• We only visualized English transcripts that
we could date (with a simple heuristic) [1]
• Work is based on ca. 55% of all the
TEI files in our sample
[1] We were not using the corpus’ date metadata for this exercise

Corpus Cartography
• Lexical extraction (of relevant sequences)
• Clustering based on similarity measures
• Visual representation (map of the corpus)
based on layout algorithms

Cartography tool: CorText
• CorText Manager covers all cartography
steps:
– Lexical extraction
– Clustering
– Visualization
• Each step can be used independently,
thanks to standard import/export formats

Tools combined with CorText
CARTOGRAPHY STEP          TOOLS and RESOURCES
Lexical Extraction        DBpedia Spotlight; YaTeA; human domain-expert
Clustering                CorText Analysis
Visualization (static)    Gephi + Sigma JS plugin; CorText MapExplorer; Inkscape
Visualization (dynamic)   CorText Heatmaps, Tubes, Distant Reading

Lexical Extraction
• CorText native option
– Noun-Phrase chunks (based on TreeTagger)
• Our options:
– Entity Linking / Wikification to DBpedia
– Keyphrase extraction tools like YaTeA
• In all cases: manual selection of pre-ranked
candidate terms by a domain-expert

Entity Linking / Wikification
• Given a database with encyclopedic
knowledge (e.g. Wikipedia)
- Finds references (mentions) to DB terms in text
- Deals with variability in the mentions for a term

[Figure: "Database" and "Corpus" panels; corpus mentions shown: judicatory, judicial, judicature, Judicatory, Judicial]

• Tool: DBpedia Spotlight
• Compares the context of sequences of
words in a text against DBpedia articles:
– Term definition’s text
– Links
– DBpedia structure (redirections etc.)
• Assigns a DBpedia term to the sequence if
a good match is found

Example terms and their variants
Term          Variants
Judiciary     judicature, judicatory, judicial
Jury          jury, juries
Monarch       king, monarch
Quantity      amount, quantity
Saint Peter   Simon Peter, Cephas
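
The term/variant table above can be read as a lookup from surface forms to one canonical term. The following is only a minimal sketch of that idea, assuming the mentions have already been spotted in the text; the hand-written dictionary stands in for what an entity-linking tool would return, and `count_terms` and `sample_mentions` are hypothetical names used for illustration.

```python
from collections import Counter

# Variant table copied from the slide. In a real run these links would come
# from an entity-linking tool, not from a hand-written dictionary.
TERM_VARIANTS = {
    "Judiciary": ["judicature", "judicatory", "judicial"],
    "Jury": ["jury", "juries"],
    "Monarch": ["king", "monarch"],
    "Quantity": ["amount", "quantity"],
    "Saint Peter": ["Simon Peter", "Cephas"],
}

# Invert the table: lowercased surface form -> canonical term.
VARIANT_TO_TERM = {
    variant.lower(): term
    for term, variants in TERM_VARIANTS.items()
    for variant in variants
}


def count_terms(mentions):
    """Count canonical terms for a list of already-spotted mentions."""
    counts = Counter()
    for mention in mentions:
        term = VARIANT_TO_TERM.get(mention.lower())
        if term is not None:  # unknown mentions are ignored
            counts[term] += 1
    return counts


# Hypothetical mentions, for illustration only.
sample_mentions = ["judicatory", "Judicial", "king", "juries", "happiness"]
term_counts = count_terms(sample_mentions)
```

Grouping variants under one term before counting is what lets "judicatory" and "Judicial" contribute to the same node in a map.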

• Applying a current knowledge-base
(DBpedia) to 18th-19th century texts
• Is this a valid method?

Keyphrase extraction
• YaTeA (Aubin and Hamon, 2006)
• Extracts noun-phrases of configurable
structure and length
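
To make "noun-phrases of configurable structure and length" concrete, here is a minimal sketch that collects adjective-plus-noun runs from tokens that are already part-of-speech tagged. It is not YaTeA's algorithm: the pattern (zero or more adjectives followed by one or more nouns), the tag names "ADJ" and "NOUN", and the function name are assumptions made for illustration.

```python
def extract_noun_phrases(tagged_tokens, min_len=2, max_len=4):
    """Return ADJ* NOUN+ runs whose token count lies in [min_len, max_len].

    tagged_tokens is a list of (word, tag) pairs.
    """
    phrases = []
    i = 0
    n = len(tagged_tokens)
    while i < n:
        j = i
        # Optional leading adjectives.
        while j < n and tagged_tokens[j][1] == "ADJ":
            j += 1
        k = j
        # At least one noun must follow.
        while k < n and tagged_tokens[k][1] == "NOUN":
            k += 1
        if k > j:
            if min_len <= k - i <= max_len:
                phrases.append(" ".join(word for word, _ in tagged_tokens[i:k]))
            i = k
        else:
            # No noun here: move past this token (or past the adjectives).
            i = max(j, i + 1)
    return phrases


# Hypothetical tagged input, for illustration only.
tagged = [
    ("the", "DET"), ("greatest", "ADJ"), ("happiness", "NOUN"),
    ("of", "ADP"), ("the", "DET"), ("greatest", "ADJ"), ("number", "NOUN"),
    ("is", "VERB"), ("the", "DET"), ("penal", "ADJ"), ("code", "NOUN"),
]
candidates = extract_noun_phrases(tagged)
```

Candidates produced this way would still go to a domain expert for manual selection, as the slides describe.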

Clustering
• CorText offers several similarity metrics
– we chose the default method (distributional)
for homogeneous networks (Weeds & Weir 2005)
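
As a rough illustration of what a distributional similarity between terms means, the sketch below builds a co-occurrence vector for each term (which other terms it shares a document with) and compares two terms by cosine. Cosine is a stand-in chosen for brevity: it is not claimed to be CorText's metric or the Weeds & Weir framework, and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations
from math import sqrt


def cooccurrence_vectors(documents):
    """Map each term to a Counter of the terms it shares a document with."""
    vectors = defaultdict(Counter)
    for terms in documents:
        for a, b in permutations(set(terms), 2):
            vectors[a][b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical documents, each reduced to its extracted terms.
docs = [
    ["jury", "judge", "evidence"],
    ["jury", "judge", "court"],
    ["prison", "inspector", "labour"],
    ["prison", "inspector", "court"],
]
vectors = cooccurrence_vectors(docs)
sim_jury_judge = cosine(vectors["jury"], vectors["judge"])
sim_jury_prison = cosine(vectors["jury"], vectors["prison"])
```

Terms that appear in similar contexts score higher, so "jury" ends up closer to "judge" than to "prison"; a clustering step would then group terms on the basis of such scores.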

Visualization
• Static (one map for all dated transcripts)
• Dynamic: temporal slices on the corpus
– Heatmaps
– “River” or Sankey networks (“Tubes layout”)
http://apps.lattice.cnrs.fr/bentham
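
The temporal slices behind the dynamic views can be pictured as grouping dated transcripts into periods and counting terms per period. The sketch below assumes ten-year slices and hypothetical input data; the slice width actually used in CorText is not stated in the slides.

```python
from collections import Counter, defaultdict


def decade_label(year):
    """Return a label such as '1800-1809' for the decade containing year."""
    start = (year // 10) * 10
    return f"{start}-{start + 9}"


def term_counts_by_decade(dated_docs):
    """Group (year, terms) pairs by decade and count terms in each slice."""
    slices = defaultdict(Counter)
    for year, terms in dated_docs:
        slices[decade_label(year)].update(terms)
    return slices


# Hypothetical dated documents, for illustration only.
dated_docs = [
    (1803, ["panopticon", "prison"]),
    (1808, ["panopticon"]),
    (1812, ["evidence", "jury"]),
    (1817, ["evidence"]),
]
slices = term_counts_by_decade(dated_docs)
```

Each slice can then be mapped or compared on its own, which is what a per-subcorpus heatmap does.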

Static visualization
CorText network visualized with Gephi

Example term: happiness
CorText network made interactive thanks to Gephi’s Sigma JS Exporter

Examples: nodes linking clusters

Heatmaps: Saliency per subcorpus

Heatmaps: 1800-1809 subcorpus

Heatmaps: 1810-1819 subcorpus

Dynamic visualization
[Figure: time axis marked 1795, 1800, 1805, 1810]

Evaluation
• Static maps: terms in the clusters
correspond closely to issues dealt with by
Bentham for the thematic areas of each
cluster
• Heatmaps: The evolution depicted
corresponds to the evolution of topics in
Bentham’s work
• DBpedia vs. keyphrase extraction: The
keyphrases provide more relevant
evidence for specialized scholars, a
general encyclopedia can help other users

Challenges
• Deleted material
• Additions

Challenges
Thematic Variety
• Animal Welfare
• Arts
• Capital punishment
• Civil Code
• Constitutional Code
• Convict transportation
• Correspondence
• Crime & Punishment
• Education
• Law
• Legislation
• Moral Philosophy
• New South Wales
• Panopticon
• Penal Code
• Political Economy
• Preventive Police
• Religion
• Science
• Sexual Morality
• Torture
Formal Variety
• Text sheets
• Copies / Fair copies
• Marginal summary sheets
• Correspondence
• Collectanea
• Rudiments
• Spencers
From http://www.transcribe-bentham.da.ulcc.ac.uk/td/Manuscripts and
http://www.benthampapers.ucl.ac.uk/help.aspx?subject=category

Distant Reading Module
• Follow evolution of selected lexical
sequences

Evolution of a lexical item
Temporal evolution
Temporal evolution profiles:
- Here: Rising, but present at all dates
- Other examples: falling, regular spikes etc.
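
A temporal profile such as "rising, but present at all dates" can be derived from a term's share of all tokens in each period. This is a minimal sketch under that assumption; the counts are hypothetical and the labels are a simplification, not the Distant Reading Module's own logic.

```python
def relative_frequencies(term_counts, total_counts):
    """Per-period share of a term; both lists are aligned by period."""
    return [
        count / total if total else 0.0
        for count, total in zip(term_counts, total_counts)
    ]


def describe_profile(freqs):
    """Return (trend, present_at_all_dates) for a series of frequencies."""
    pairs = list(zip(freqs, freqs[1:]))
    if all(a < b for a, b in pairs):
        trend = "rising"
    elif all(a > b for a, b in pairs):
        trend = "falling"
    else:
        trend = "irregular"
    return trend, all(f > 0 for f in freqs)


# Hypothetical counts for one term over four periods.
freqs = relative_frequencies([2, 5, 9, 16], [1000, 1000, 1000, 1000])
profile = describe_profile(freqs)
```

Using shares instead of raw counts keeps periods with many transcripts from looking like peaks of interest in the term.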

Context evolution: Bump Charts
• Example: evil
[Figure: evolution of neighbours (bump charts)]

Relations in the context: Egonetworks
• Example: relations among neighbours of
evil

Evolution of neighbours’ relations
[Figure: egonetworks (Period 2)]

Other Lexical Analyses
• TXM “textometry” tool
– Automatic part-of-speech tagging
– Partition texts according to metadata
– Query corpus using linguistic criteria
– Statistical analyses (overrepresentation, underrepresentation)
[ http://textometrie.ens-lyon.fr/?lang=en ]

Lexical Analysis with TXM
• Partition the corpus according to Category,
Year, Decade, Main headings, or other
available metadata

Lexical Analysis with TXM
Number of words per Category

Lexical Analyses with TXM
• Over- (or under-) representation of given
words per decade (after partitioning per decade)
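
One way to quantify over-representation of a word in a decade is a hypergeometric tail probability: given the word's total frequency in the corpus, how likely is it to occur at least as often as observed in a part of that size? The sketch below assumes that model; the slides do not say how TXM computes its scores, and the numbers are hypothetical.

```python
from math import comb


def upper_tail(corpus_size, part_size, word_total, word_in_part):
    """P(X >= word_in_part) for a hypergeometric draw.

    corpus_size:  tokens in the whole corpus
    part_size:    tokens in the part (e.g. one decade)
    word_total:   occurrences of the word in the whole corpus
    word_in_part: occurrences of the word in the part
    """
    denominator = comb(corpus_size, part_size)
    numerator = sum(
        comb(word_total, k) * comb(corpus_size - word_total, part_size - k)
        for k in range(word_in_part, min(word_total, part_size) + 1)
    )
    return numerator / denominator


# Hypothetical toy corpus: 10 tokens, a part of 5 tokens, and a word that
# occurs 4 times overall, all 4 of them inside the part.
p_over = upper_tail(10, 5, 4, 4)
```

A small probability suggests the word is over-represented in that part; the mirror-image lower tail would flag under-representation.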

TXM linguistic queries
• Evil followed by a noun, per text-category

TXM linguistic queries
• Sentences containing an adjective + evil
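
Queries like "an adjective + evil" or "evil followed by a noun" amount to matching consecutive tokens on word form and part-of-speech tag. The sketch below shows that matching in plain Python rather than in TXM's own query syntax; the tag names and the sample tokens are assumptions for illustration.

```python
def find_bigrams(tagged_tokens, first, second):
    """Return (word1, word2) for consecutive tokens matching both tests."""
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:])
        if first(w1, t1) and second(w2, t2)
    ]


def is_adjective(word, tag):
    return tag == "ADJ"


def is_noun(word, tag):
    return tag == "NOUN"


def is_evil(word, tag):
    return word.lower() == "evil"


# Hypothetical tagged tokens, for illustration only.
tokens = [
    ("a", "DET"), ("greater", "ADJ"), ("evil", "NOUN"), ("than", "ADP"),
    ("the", "DET"), ("evil", "ADJ"), ("consequences", "NOUN"),
]
adjective_then_evil = find_bigrams(tokens, is_adjective, is_evil)
evil_then_noun = find_bigrams(tokens, is_evil, is_noun)
```

Counting such matches separately for each text category gives the per-category breakdown described above.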

Summary
• Accessing a large unedited corpus
– Cartography methods
• Lexical extraction
• Maps
– Static picture of the corpus
– Temporal evolution
– Other visualizations (Distant Reading, WordTree)
• Domain-expert feedback
• Challenges
• Other lexical analyses
http://apps.lattice.cnrs.fr/bentham

Bibliography
Aubin, S., and Hamon, T. (2006) Improving Term
Extraction with Terminological Resources. In
Advances in Natural Language Processing: 5th
International Conference on NLP, FinTAL 2006, pp.
380-387. LNAI 4139. Springer.
Auer, Sören, et al. (2007). DBpedia: A nucleus for a
web of open data. The Semantic Web. Springer.
Causer, Tim, and Terras, Melissa (2014a). Many
hands make light work. Many hands together
make merry work: Transcribe Bentham and
crowdsourcing manuscript collections, in
Crowdsourcing Our Cultural Heritage, ed. M. Ridge,
Ashgate
Causer, Tim, and Terras, Melissa (2014b).
Crowdsourcing Bentham: Beyond the Traditional
Boundaries of Academic History, International
Journal of Humanities and Arts Computing, 8
Chavalarias, David, and Jean-Philippe Cointet. (2013).
Phylomemetic Patterns in Science Evolution—The
Rise and Fall of Scientific Fields. PLoS ONE 8 (2)
Cortext Manager Documentation (2016).
https://docs.cortext.net/.
Mendes, Pablo N., Max Jakob, Andrés García-Silva,
and Christian Bizer. (2011). DBpedia Spotlight:
Shedding Light on the Web of Documents. In
Proceedings of the 7th International Conference on
Semantic Systems, 1–8. ACM.
Mélanie, F., Tieberghien, E., Ruiz, P., Poibeau, T.,
Causer, T., and Terras, M. (2016). Mapping the Bentham
Corpus. In Digital Humanities Conference (DH
2016). Kraków, Poland.
Poibeau, T. and Ruiz, P. (2015). Generating Navigable
Semantic Maps from Social Sciences Corpora. In
Digital Humanities Conference (DH 2015). Sydney,
Australia.
Rule, Alix, Jean-Philippe Cointet, and Peter S.
Bearman. (2015). Lexical Shifts, Substantive
Changes, and Continuity in State of the Union
Discourse, 1790–2014. Proceedings of the National
Academy of Sciences 112 (35)
Venturini, T., N. Baya Laffite, J.-P. Cointet, I. Gray, V.
Zabban, and K. De Pryck. (2014). Three Maps and
Three Misunderstandings: A Digital Mapping of
Climate Diplomacy. Big Data & Society 1
Weeds J, Weir D (2005). Co-occurrence retrieval: A
flexible framework for lexical distributional similarity.
In Computational Linguistics 31(4), 439–475.
Wattenberg, M. and Viégas, F.B., 2008. The word tree,
an interactive visual concordance. In IEEE
transactions on visualization and computer graphics,
14(6), pp.1221-1228.

& return you all due thanks
pablo.ruiz.fabo@ens.fr http://www.lattice.cnrs.fr/Pablo-Ruiz-Fabo,541
http://apps.lattice.cnrs.fr/
