[poster] Extracting Information From Classics Scholarly Texts
1. EXTRACTING INFORMATION FROM CLASSICS SCHOLARLY TEXTS
Matteo Romanello, matteo.romanello@kcl.ac.uk
Goal HIDDEN WORD PUZZLE
The project at a glance
● PhD research project in Digital Humanities
Devising an automatic system to improve To solve the puzzle find the
(DH) information retrieval over a discipline-specif c
i words in the schema by
● discipline: DH, Classics (Greek and Latin
corpus of unstructured texts. using a word list as clue.
literature)
CORPUS: Open Access At the end you'll have added
● topic: extracting structured information from a information to the initially
collection of Classics journal papers
corpus of unstructured texts chaotic picture.
Why automatic? Because automatic means also Steps
Gone digital. What changed? scalable when you are dealing with a huge quantity
of data. 1. Building the corpus (OCR, preprocessing)
We are moving from books to e- Information retrieval: the task of retrieving
books, and from journals to e- information (most of the times accomplished by 2. Making the data sources interoperable
journals as we are using them using search engines) (when the same entity E appears in DB1 and DB2,
almost daily. the information about E in DB1 have to be added
Corpus of unstructured texts: collection of plain
Is our way of accessing texts, without any kind of mark-up (such as XML). to information about E in DB2)
information actually changed
with the use of digital tools? 3. Finding in the corpus the mentions of
Information can be REALIA (place, names, work passages, etc.)
Did just the format change or accessed using multiple
are we provided with innovative access points that are 4. Disambiguating the mentions of REALIA
ways of accessing information meaningful for scholars
based on digital technologies? in a specif c f eld.
i i 5. Automatic creation of new indices to the
texts
Access points to information in Classics Method
Expected results
Print resources 1. Reuse existing data resources containing
structured information (such as gazetteers, ●Providing automatically multiple
● Table of Content (TOC)
authority lists, etc.) stored using different data meaningful entry points to information
● Indexes (index of citations, index of greek word,
index of geographic place, index of names, etc.) formats (Relational DataBases, XML f les, i ● Enrich the corpus with links to navigate
etc.)
Electronic resources through resources
● TOCs 2. Apply Computational Linguistic and
● Access through search engines Natural Language Processing algorithms ● Exploiting extracted information to
● ? for the information extraction improve user access to the corpus
* usually provided just for monographs because expensive 3. Use structured data as training data for ●Demonstrate the scalability of the
to be produced the algorithms which “mines” the unstructured approach
text corpus
Centre of Computing in the Humanities (CCH), King's College London