IMPACT Final Conference - Ulrich Reffle
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Analysis and Post-Correction of OCR-processed
historical documents
Ulrich Reffle
CIS
University of Munich
Overview
Document specific analysis of OCR results of historical documents
A system for interactive OCR post-correction
24.10.2011 Ulrich Reffle uli@cis.uni-muenchen.de 2
Document specific analysis of OCR
results of historical documents
Why do we need special methods?
Problems specific to the processing of historical language in the context of
mass digitization:
– High OCR error rates
– No standardized language
Special resources and methods are needed for OCR, post-processing and
Information Retrieval
[Diagram: Digital image → OCR → OCR result → Post-correction → IR, annotated with "Problem of historical language variation"]
Why do we need special methods?
Diversity of input material makes document specific parameter settings
important:
– Distribution of spelling variants
– Special vocabulary
– OCR channel model
[Diagram: Digital image → OCR → OCR result → Post-correction → IR, annotated with "Problem of historical language variation"]
Document specific language and error profiles
Language and error profiles provide document specific characteristics of
the language and OCR errors.
Language profile: shares of foreign languages (such as Latin, French),
frequencies for language modeling, important patterns of spelling variation
(in English: e.g. o→ou, v→u)
Error profile: estimated error rate, important error patterns (like e→c, i→l),
frequent erroneous words
Language and error profiles are computed fully automatically; no manual
interaction or ground truth is needed.
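The slides do not say how the profiles are computed. Purely as an illustration of what the pattern counts in an error profile look like, the sketch below counts character-level rewrite patterns between an assumed intended word and the OCR token. The function names and the input pairs are hypothetical, and the standard-library aligner `difflib.SequenceMatcher` stands in for whatever alignment the real system uses. In the actual setting no ground truth is available, so the intended words would have to be guessed by the profiler itself.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_patterns(intended: str, ocr_token: str) -> list[str]:
    """Describe how `intended` was rewritten into `ocr_token`, e.g. ['e→c']."""
    # autojunk=False keeps the alignment predictable for short strings.
    matcher = SequenceMatcher(a=intended, b=ocr_token, autojunk=False)
    patterns = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Insertions and deletions show up with an empty left or right side.
            patterns.append(f"{intended[i1:i2]}→{ocr_token[j1:j2]}")
    return patterns


def pattern_counts(pairs) -> Counter:
    """Count patterns over (intended word, OCR token) pairs (assumed input)."""
    counts = Counter()
    for intended, ocr_token in pairs:
        counts.update(edit_patterns(intended, ocr_token))
    return counts


# Hypothetical pairs; a real profiler would have to infer the intended words.
pairs = [("the", "thc"), ("time", "timc"), ("and", "aud")]
counts = pattern_counts(pairs)
```

Sorting `counts.most_common()` gives a ranked list of patterns comparable in shape to the error patterns named above.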
Global profile of a document
Language profile:
  Pattern   Frequency      Lexicon           %
  t→th      120            Modern            82%
  i→y       106            Historic          9%
  ä→a       38             Place names       6%
  …         …              Latin             3%
Error profile:
  Pattern   Frequency      Words             %
  e→c       51             Correct words     72%
  n→u       45             Erroneous words   20%
  t→i       34             Unknown words     8%
  …         …
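The percentage columns of such a profile can be read as shares of tokens per category. A minimal sketch, assuming each token has already been assigned one category label (the labels and the function name are hypothetical; how the real system assigns them is not described on the slides):

```python
from collections import Counter


def category_shares(labels) -> dict[str, float]:
    """Return the percentage share of each category label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no tokens, no shares
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical labels for 100 tokens, chosen to mirror the lexicon column.
labels = ["modern"] * 82 + ["historic"] * 9 + ["place name"] * 6 + ["latin"] * 3
shares = category_shares(labels)
```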
Local profile of all words of a document
Weighted set of interpretations/ correction suggestions for each word of the
document.
Example OCR tokens: „theil“, „hatn“
Suggestions for „hatn“:
  Correction suggestion   Modern spelling   Probability
  hath                    has               0.95
  hat                     hat               0.01
  hate                    hate              0.04
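A local profile entry can be pictured as a small ranked list per token. The sketch below is one possible representation, not the system's actual data model: the `Suggestion` class, its field names, and the raw weights are assumptions made for illustration. It normalizes raw weights into probabilities and picks the top suggestion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    correction: str   # suggested reading of the OCR token
    modern: str       # modern spelling of that reading
    weight: float     # raw score or, after normalizing, a probability


def normalize(suggestions: list[Suggestion]) -> list[Suggestion]:
    """Scale weights so they sum to 1 and sort the best suggestion first."""
    total = sum(s.weight for s in suggestions)
    if total <= 0:
        return []  # nothing usable to rank
    scaled = [Suggestion(s.correction, s.modern, s.weight / total) for s in suggestions]
    return sorted(scaled, key=lambda s: s.weight, reverse=True)


# Hypothetical raw weights for the token "hatn".
local_profile = {
    "hatn": normalize([
        Suggestion("hat", "hat", 1.0),
        Suggestion("hath", "has", 95.0),
        Suggestion("hate", "hate", 4.0),
    ])
}
best = local_profile["hatn"][0]
```

Keeping the modern spelling next to each suggestion lets downstream retrieval index the modern form while the editor shows the historical one.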
Summary
Document specific profiles …
– are computed in a fully automated way from OCR output
– provide characteristics of language and OCR error channel in order to adapt
OCR and downstream processes.
System for interactive post-correction
of OCR results
Post-correction system
A graphical user interface for fast and convenient post-correction
specifically for OCRed historical documents
Novel possibilities for detection, presentation and correction of systematic
OCR errors.
Post-correction system
[Screenshot: post-correction interface with three areas: OCR editor, special functionality, image]
Proper treatment of spelling variants
Historical spelling variants are identified with the help of historical lexica and
language profiles.
Local profiles include non-modern words as correction suggestions.
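One way to picture how variant patterns connect a historical token to a modern lexicon entry is the sketch below. It is an assumption-laden illustration, not the method used by the system: the patterns are read as modern→historical rewrites, only one rewrite is applied per variant, and the tiny lexicon and function names are hypothetical.

```python
def historical_variants(word: str, patterns) -> set[str]:
    """Apply each (modern, historic) pattern once, at every position it matches."""
    variants = set()
    for modern, historic in patterns:
        start = word.find(modern)
        while start != -1:
            variants.add(word[:start] + historic + word[start + len(modern):])
            start = word.find(modern, start + 1)
    return variants


def modern_candidates(token: str, lexicon, patterns) -> list[str]:
    """Return modern lexicon words of which `token` is a one-step variant."""
    return sorted(w for w in lexicon if token in historical_variants(w, patterns))


# Patterns taken from the language profile example; the lexicon is made up.
patterns = [("t", "th"), ("i", "y")]
lexicon = {"teil", "heil"}
candidates = modern_candidates("theil", lexicon, patterns)
```

A token that matches this way would be treated as a legitimate historical spelling rather than flagged as an OCR error.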
Conventional correction methods
Correcting words in the text view
– Manual input
– Selection of a correction suggestion
Batch-Correction of systematic OCR errors
Systematic OCR errors are identified by the error profile.
Batches of errors can be corrected with just a few keystrokes.
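The idea of batch correction can be sketched as grouping suspicious tokens by the error pattern behind their top suggestion, then accepting a whole group at once. The tuple layout, function names, and sample data below are hypothetical; the actual interface and its data structures are not described on the slides.

```python
from collections import defaultdict


def group_by_pattern(suspects) -> dict:
    """Group (position, ocr_token, suggestion, pattern) tuples by pattern."""
    batches = defaultdict(list)
    for position, token, suggestion, pattern in suspects:
        batches[pattern].append((position, token, suggestion))
    return dict(batches)


def apply_batch(tokens, batch) -> list[str]:
    """Return a copy of `tokens` with every correction in `batch` applied."""
    corrected = list(tokens)
    for position, token, suggestion in batch:
        # Only replace if the token is still what the suggestion was made for.
        if corrected[position] == token:
            corrected[position] = suggestion
    return corrected


# Hypothetical OCR output and suspects derived from an error profile.
tokens = ["thc", "cat", "timc", "aud"]
suspects = [
    (0, "thc", "the", "e→c"),
    (2, "timc", "time", "e→c"),
    (3, "aud", "and", "n→u"),
]
batches = group_by_pattern(suspects)
fixed = apply_batch(tokens, batches["e→c"])
```

Accepting the `e→c` batch corrects both affected tokens in one step and leaves the `n→u` case for a separate decision.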
Evaluation
User experiment with 14 participants.
Novel technology makes correction up to 2.7 times faster.
Availability
The graphical interface will be distributed as open source.
Document pre-processing to obtain language and error profiles is covered
by a US patent application.
– Pre-processing is offered as a web service, currently free of charge.
Thank you!
http://ocr.cis.uni-muenchen.de
uli@cis.uni-muenchen.de