Günter Mühlberger (University of Innsbruck, AT): The READ project. Objectives, tasks and partner organisations
co:op-READ-Convention Marburg
Technology meets Scholarship, or how Handwritten Text Recognition will Revolutionize Access to Archival Collections.
With a special focus on biographical data in archives
Hessian State Archives Marburg, Friedrichsplatz 15, D-35037 Marburg
19-21 January 2016
2. Facts and Figures
• READ
• Recognition and Enrichment of Archival Documents
• 13 Partners, coordinated by the University of Innsbruck
• 10 Institutions as associated partners via a Memorandum of
Understanding
• Duration: 1.1.2016 to 30.6.2019
• Grant: 8.2 million EUR
• Objectives
• Applied research in pattern recognition and human language
technology
• Services for archives, humanities scholars, volunteers and
computer scientists
• Network building among those user groups
3. READ Consortium
READ Partners
• University of Innsbruck (co-ordinator)
• University of London
• Technical University Valencia
• Technical University Lausanne
• University College London
• University of Rostock
• National Centre for Scientific Research - Demokritos
• XEROX – European Research Centre
• Technical University Vienna
• University of Leipzig
• National Archive Finland
• Diocesan Archive Passau
4. READ MoU Partners
• Australian National Library
• Gottfried Wilhelm Leibniz Bibliothek
• National Library of Spain
• Centre virtuel de la connaissance sur l'Europe, Digital Humanities Lab (Luxembourg)
• The Linnean Society of London
• The Hessian State Archive Marburg (Germany)
• The Munch Museum (Norway)
• The Civic Archives of Bozen-Bolzano (Italy)
• Music and Instrument Museum Leipzig
• The University and Research Library Erfurt/Gotha (Germany)
• Friedrich-Alexander-Universität Erlangen/Nürnberg
• PLANET GmbH (Germany)
5. What will remain once the
project has finished its work in
June 2019?
6. Publications
• H2020 Grant Agreement
• Article 29.2 Open access to scientific publications
• Each beneficiary must ensure open access (free of
charge online access for any user) to all peer-reviewed
scientific publications relating to its results.
• Open Access
• Gold route (publishing in open access journals)
• E.g. FrontiersIn from EPFL (Technical University Lausanne)
• Green route (self-archiving in a repository)
• Key Performance Indicator
• 15-25 scientific publications per year
7. Research Data
• H2020 Grant Agreement
• 29.3 Open access to research data
• Regarding the digital research data generated in the
action (‘data’), the beneficiaries must:
• (a) deposit in a research data repository and take
measures to make it possible for third parties to access,
mine, exploit, reproduce and disseminate — free of
charge for any user — the following:
• (i) the data, including associated metadata, needed to
validate the results presented in scientific publications as
soon as possible;
8. What are Research Data in READ?
• Images and corresponding Reference Data
(= ground truth)
• Images = Raw material
• Reference data or ground truth = the expected, perfect
output
• Data = what is actually produced by an algorithm/tool
• Example
• Image of a page
• Correct text of a page = reference data
• Data = the text produced by an HTR engine
• Difference between expected result and actual result =
the result of a scientific experiment, e.g. measured as
Word Error Rate
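The comparison between expected and actual result can be sketched in a few lines. The following is a minimal illustration, not the evaluation tool used in READ: it computes the Word Error Rate as the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference text. The function name and the sample strings are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference text must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented sample: reference data vs. the output of an HTR engine.
reference = "in the year of our Lord 1700"
htr_output = "in the yeer of our Lord 1700"
print(f"WER: {word_error_rate(reference, htr_output):.3f}")
```

One misread word out of seven reference words gives a WER of 1/7. Real evaluations usually also normalise punctuation and case before comparing, which this sketch leaves out.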
9. Research Data are used for…
• Evaluation
• Difference between expected and actual result
• Problem description / requirements specification
• What do we actually expect from an algorithm or tool?
• Simple with HTR, but becomes much more complicated with
Layout Analysis
• E.g. do we need the whole text of a page, or maybe just
person names within one column of a table? Such questions
need to be defined and need to be reflected in the design of
the reference data
• Machine learning (training data)
• Machine learning tools need training data
• Reference data are the basis for this training process
10. Research Data in READ
• Key Performance Indicator
• 3 million images with at least 50,000 pages of reference
data at the end of the project
• Why such a large amount?
• Our objective is that the READ dataset is reasonably
representative of many document types in archives, and of
the writing and layout styles of several centuries and
languages
• We are therefore very much interested in any kind of
digitised document collection
• Progress in computer science is strongly connected to
the availability of large data sets
11. Research Data for Competitions
• Key Performance Indicator
• READ will organise several research competitions at various
conferences
• Competitions
• Nowadays a popular way to measure the progress of research
in a specific field. E.g. line detection, or text recognition, or
writer retrieval…
• Evaluation of competition results
• Depends on the availability of reference data
• Attractiveness of competitions
• Dependent on the challenge itself, but also on the size of
dataset and the quality of reference data
• 160,000 EUR is budgeted for subcontracts to produce
reference data
12. Research Data in READ
• Images will be connected with reference data such as:
• Correct text (e.g. on page or line level)
• Correct writer attribution (e.g. letters with names of writers)
• Correct person names on page level
• Correct layout elements, e.g. text lines, text blocks, tables, or
forms
• Detailed descriptions of tables or forms
• Everything which is interesting for archives, scholars, the
public!
• Data will be made available e.g. via Zenodo or other
Research Data Platforms
• Archives are encouraged to provide their collections!
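To make the idea of connecting an image with several kinds of reference data concrete, here is a minimal sketch of a per-page record. It is an illustration only: the class names, field names and sample values are assumptions for this example and do not describe the actual READ or Transkribus data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TextLine:
    """One text line: its outline on the image and its correct text."""
    polygon: List[Tuple[int, int]]  # outline in pixel coordinates
    text: str


@dataclass
class PageReference:
    """Hypothetical reference data attached to one page image."""
    image_file: str
    lines: List[TextLine] = field(default_factory=list)
    writer: Optional[str] = None  # correct writer attribution, if known
    person_names: List[str] = field(default_factory=list)  # names on the page

    def page_text(self) -> str:
        """Correct text on page level, assembled from the line level."""
        return "\n".join(line.text for line in self.lines)


# Invented sample page with two lines of reference text.
page = PageReference(
    image_file="letter_0001.jpg",
    lines=[
        TextLine(polygon=[(10, 10), (500, 10), (500, 40), (10, 40)], text="Dear Sir,"),
        TextLine(polygon=[(10, 50), (500, 50), (500, 80), (10, 80)], text="I have received your letter."),
    ],
    writer="Unknown clerk",
    person_names=[],
)
print(page.page_text())
```

Keeping text on line level and deriving the page text from it means the same record can serve both line-level and page-level evaluation.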
13. Open Source Software
• Release as open source
• Not an obligation under the Grant Agreement, but a requirement
of the specific e-Infrastructure call of the EU
• Foreseen for (nearly) all software tools in the project
• During 2016 we will take the first steps and move parts
of the software to GitHub or a similar platform
• Advantage
• Many tools are research tools and therefore not easy
to deploy
• Their integration in Transkribus will allow users to try
out the tools beforehand
14. Interim summary
• Open Access to publications
• E.g. via Open Access publishers
• Open Research Data (images and reference data)
• E.g. via repositories, such as Zenodo (run by CERN)
• Open Source for the software tools
• E.g. via open software repositories, such as GitHub
An (expert) user will have “everything together” to
dive deeper into the results of the project
15. Open Platform
Build a platform which provides recognition,
transcription and enrichment of historical documents
as a general infrastructure for archives, libraries,
humanities scholars, volunteers, the public – and
computer scientists.
16. Why a Platform? (1)
• Software as a Service (SaaS)
• Implementing the full range of tools from READ requires
a lot of work and know-how
• The barrier to entry for archives and humanities scholars is
much lower since the services can be accessed and used via
the Internet
• E.g. users are free to upload their documents, to run tests and
then to decide which services they want to use
• Machine Learning
• Most tools require large amounts of training data
• The more data are available on the platform, the better the
chance of improving accuracy
• E.g.: if a user in Greifswald transcribes a German text from
1700 these data may also be used to train the HTR engine for
a user in Bavaria. Or in the US.
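The Greifswald example can be sketched as a simple selection over a shared pool of transcriptions. This is a hedged illustration of the idea only: the record fields, the function name and the 50-year tolerance are assumptions for this example, not a description of how the platform selects training data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcription:
    """Hypothetical record of one transcribed page in the shared pool."""
    contributor: str
    language: str
    year: int
    text: str


def select_training_data(
    pool: List[Transcription], language: str, year: int, tolerance: int = 50
) -> List[Transcription]:
    """Pick pooled transcriptions in the same language and from a similar period,
    regardless of who contributed them."""
    return [
        t for t in pool
        if t.language == language and abs(t.year - year) <= tolerance
    ]


# Invented pool: contributions from different users and places.
pool = [
    Transcription("user in Greifswald", "German", 1700, "..."),
    Transcription("user in Bavaria", "German", 1720, "..."),
    Transcription("user in London", "English", 1705, "..."),
]

# Training material for a German document from about 1710.
selected = select_training_data(pool, language="German", year=1710)
print([t.contributor for t in selected])
```

The point of the sketch is that the contributor plays no role in the selection: data produced by one user becomes training material for any other user with similar documents.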
17. Why a platform? (2)
• Cooperation
• Successful digitisation projects need collaboration between
content holders, scholars, computer scientists and volunteers
• Platform serves as a mediation tool between these
stakeholder groups
• E.g. they can define requirements, produce reference data,
implement new services, edit and correct results in a shared
manner
• Standardisation
• The full benefits of the technology can only be enjoyed if a
wide range of standards is followed
• De-facto standardisation by using the same platform and
tools
• E.g. the real benefit of digital editions will be enjoyed once
they are centrally accessible
18. Service Platform
• READ Service Platform = Transkribus
• We are obliged to run the service platform from the very first
day of the project
• We are also obliged to provide a business plan in month 12
• And to implement this business plan after month 12
• Final objective
• To run and maintain the service platform also after the end of
the project
• A business model needs to be developed
• General approach
• Service levels
• To provide free services for everyone; service fees will
apply only once certain limits are exceeded
19. Overview of tools and services
• Handwritten Text Recognition
• HTR based on hidden Markov models (HMM) and on neural networks (NN)
• Keyword Spotting
• Query by Example
• Query by String
• Image Preprocessing
• Binarisation, Enhancement
• Layout Analysis
• Basic analysis of words, lines, region types (text, graphical,…)
• Table and Forms Recognition
• Generic and template based recognition
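As an illustration of the binarisation step listed under image preprocessing, the sketch below applies Otsu's method, a classic global thresholding technique. The slides do not say which binarisation algorithms the READ tools use, so this is an assumption chosen for the example; the function names and pixel values are also invented.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the grey level t (0-255) that maximises the between-class variance.
    Pixels <= t form the dark class, pixels > t the bright class."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(256):
        weight_low += histogram[t]
        sum_low += t * histogram[t]
        if weight_low == 0:
            continue  # no dark pixels yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # all pixels are in the dark class
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold


def binarise(pixels: List[int], threshold: int) -> List[int]:
    """Map grey values to 0 (ink) or 1 (background)."""
    return [0 if p <= threshold else 1 for p in pixels]


# Invented grey values: three dark ink pixels, three bright paper pixels.
grey = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(grey)
print(t, binarise(grey, t))
```

A global threshold like this works on evenly lit scans; historical documents with stains or bleed-through usually need locally adaptive methods, which is one reason binarisation is a research topic in its own right.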
20. Overview of tools and services
• Document Understanding
• Columns, marginalia, date, etc.
• Automatic Writer Identification and Retrieval
• Training and retrieval of specific writers/writing styles
• Language Toolkit
• Adaptation of language resources to support HTR
• Text2Image matching
• Matching existing text with images
• E-Learning module
• Online training tool for students and volunteers to practise deciphering
handwritten documents
• ScanApp
• The mobile phone as document scanner with direct connection to the
Transkribus service platform
• And many more…