OCR challenges in historic documents and the contribution of IMPACT
IFLA 2010 Satellite Meeting "New Techniques for Old Documents", 16-18 August 2010, Uppsala, Sweden.
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Clemens Neudecker, KB National Library of the Netherlands
18/08/2010 - IFLA satellite meeting, Uppsala
Background
Text that is not digital is virtually invisible
OCR (optical character recognition) technology does not produce
satisfactory results for historic documents
A lack of institutional knowledge and expertise leads to
“re-inventing the wheel”
Innovate OCR software and language technology
Share best practice and build capacity across Europe
(Guidelines, Training, Workshops)
IMPACT – Improving access to text
Funded by the EC as part of the 7th Framework Programme
Coordinated by KB – National Library of the Netherlands
EU funding: € 12 100 000
26 partners: Libraries, Research Institutes, Industry Partners
Start date: 1 January 2008
Duration: 48 months
2011: Center of Competence
Historic material: different problems
I. OCR errors
Damaged material, bad quality scans, difficult layout,
historic fonts, …
II. Historical language
Spelling variants, orthographical variants, inflected forms, …
Bad OCR results…
la 112 B ik e my lat arrived the
>Pylades,-. lliot; aod. Abe- 3ineva, CNeee 4orn Neath,
' titch ,cuim; ,'t;ohn_ IoMelwl fri ytiil SUn-
.die8; ,FrietndiLp, St&ar, froniidon, 'Ui wine and
grocerieu ;: ;aletn, Bker, from Liverpool,. witfi eoal.;'
4Stalled the AluidonG.: ceror' Lkndon, with sundries;
: ;Two Rrothwsj'@ Whe~atn-;- Pylade', Eiot; Har'tinny,;
;: Fisbley; ::Iiiveiy Peggy:-(flth add tie JAne, Redman,
for eathly Newpot;agd llford; -Tw Br.otherAs, lawces,
fos Lysixowjvithbinehol V pirI-ihzure;vi etsey, Per-wIliti;
iIudstry, ModA - ~tbi ,Al~t,,'enniugs, for
.:IP1~iOntI, StIth Ltu .c*ar An'l? Hawkinss foir
ouck , + iii ballasto I _______~ ~ ~ ~~~Ai
Bleed through & shine through
General description:
When the printing ink was not dry, the letters of one page also
appear on the facing page (bleed through). Also, if the paper is
relatively thin, the ink on the other side of the page may shine
through.
Effects on OCRing:
High, since the interfering marks are the same ink (though lighter)
and directly disturb the shapes of the characters.
IMPACT: Binarisation
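The slides do not say which binarisation method IMPACT uses. As a rough illustration of what binarisation does, here is a minimal sketch using Otsu's global threshold as a stand-in technique; the function names and the list-of-lists image format are assumptions made for this example only.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue                      # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break                         # every pixel is already in the dark class
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two classes are well separated.
        between = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between > best_variance:
            best_variance, best_threshold = between, level
    return best_threshold


def binarise(image):
    """Map a 2D list of grey values to 1 (ink) and 0 (paper)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]
```

A single global threshold only suppresses show-through that is lighter than the chosen level; unevenly lit or stained pages generally call for locally adaptive methods.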
Annotations in the text
General description:
All notes, lines and drawings added by users, as well as stamps,
tapes etc. used within libraries.
Effects on OCRing:
High, since both segmentation and the recognition process itself
are disturbed.
IMPACT: Improved binarisation
[Figure panels: Original | State of the art | IMPACT]
Warping of paper
General description:
Due to humidity, a page of an old book is very rarely flat; it is
usually warped. Even pressing the paper against a glass plate does
not remove the warping.
Effects on OCRing:
Can be relatively high, especially in combination with bad printing
(e.g. characters not aligned on the baseline of a line).
IMPACT: Border removal
IMPACT: Geometric correction I
IMPACT: Geometric correction II
Gothic typeface
General description:
Historic fonts, obsolete characters such as the long s
Effects on OCRing:
High, since such fonts and characters are often not recognised
correctly.
IMPACT: Improved recognition
Complex layout
General description:
Due to difficult layouts, pages can be segmented incorrectly.
Effects on OCRing:
High, since the text then does not come out in the correct reading
order.
IMPACT: Segmentation
[Figure panels: Blocks/Regions | Words | Glyphs]
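The three levels named on this slide (blocks/regions, words, glyphs) can be pictured as a nested structure. The sketch below is only an illustration of that hierarchy; the class names, the bounding-box format and the `text()` helpers are assumptions, not IMPACT's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Bounding box as (left, top, right, bottom) in pixels; an assumed convention.
Box = Tuple[int, int, int, int]


@dataclass
class Glyph:
    box: Box
    char: str = ""          # recognised character, empty until OCR has run


@dataclass
class Word:
    box: Box
    glyphs: List[Glyph] = field(default_factory=list)

    def text(self) -> str:
        # A word's text is the concatenation of its glyphs in order.
        return "".join(g.char for g in self.glyphs)


@dataclass
class Block:
    box: Box
    words: List[Word] = field(default_factory=list)

    def text(self) -> str:
        # Words are assumed to be stored in reading order already.
        return " ".join(w.text() for w in self.words)
```

If a page is segmented into the wrong blocks, everything below that level inherits the mistake, which is why layout errors have such a strong effect on the final text.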
IMPACT: Functional extension parser
Recognition of the structure of book pages
– Print space
– Standard font of the main text
– Page numbers
Enrichment of OCR results with structural information
Bad printing: blurred, broken, faded characters
General description:
Depending on the printing technology used, letters may be blurred,
broken or dotted.
Effects on OCRing:
High, since characters are broken apart or merged together.
IMPACT: Cooperative correction
Integrated web-based system for cooperative correction of OCR results
Character/Word/Page mode
Collaboratively correct OCR errors and use results for improving OCR
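The slides do not describe how corrections from several users are reconciled. One simple possibility is majority voting; the sketch below assumes that approach, and the function name and the `min_votes` parameter are hypothetical.

```python
from collections import Counter


def consensus(ocr_word, corrections, min_votes=2):
    """Return the most frequently submitted correction for one word.

    Falls back to the original OCR output when there are no corrections
    or when the leading correction has fewer than min_votes submissions.
    """
    if not corrections:
        return ocr_word
    candidate, votes = Counter(corrections).most_common(1)[0]
    return candidate if votes >= min_votes else ocr_word
```

Accepted corrections of this kind could then serve as ground truth for retraining or evaluating the OCR engine, which is the feedback loop the slide refers to.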
IMPACT: Word spotting
Alternative technique for indexing
historical documents
After word segmentation relevant
words are detected and highlighted
Key words can be e.g. person and
location names
Historical language
Historical variants of the Dutch word ‘wereld’ (world):
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt
wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels
zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts
werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts
werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
IMPACT: Historical dictionaries
OCR:
Lexica for German, Dutch, English, French, Spanish, Polish,
Bulgarian and Czech
Generic tools for building historical lexica
FineReader with built-in standard Dutch dictionary: werreid
FineReader with IMPACT dictionary of historical Dutch: werreld
RETRIEVAL:
Key in ‘wereld’ and find ‘werreld’
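The retrieval idea (key in the modern form, find the historical one) amounts to expanding a query with the variants listed in a lexicon. The following is a minimal sketch under that assumption: the `LEXICON` fragment uses a few variants from the 'wereld' list in these slides, while the function names and the whitespace tokenisation are illustrative choices, not IMPACT's implementation.

```python
# Hypothetical lexicon fragment: modern lemma -> attested historical variants.
LEXICON = {
    "wereld": ["werelt", "weerelt", "werreld", "waereld", "vvereld"],
}


def expand_query(term, lexicon=LEXICON):
    """Return the query term followed by its known historical variants."""
    return [term] + [v for v in lexicon.get(term, []) if v != term]


def search(term, documents, lexicon=LEXICON):
    """Return the indices of documents containing the term or any variant."""
    wanted = set(expand_query(term, lexicon))
    return [
        i for i, doc in enumerate(documents)
        if wanted & set(doc.lower().split())   # naive whitespace tokenisation
    ]
```

Expanding at query time keeps the OCR text untouched, so the original historical spelling is still what the user sees in the result.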
IMPACT: Linguistic post-correction
The colors indicate different types of analysis results,
like a word being found in the historical or hypothetical
dictionary, or a supposed OCR error, etc.
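The kind of per-word analysis described here can be approximated with a lexicon lookup plus edit distance. This is a simplified sketch, not the IMPACT post-correction tool: it leaves out the hypothetical dictionary, and the labels, the `analyse` function and the `max_distance` parameter are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def analyse(token, historical, max_distance=1):
    """Label a token as a known historical word, a supposed OCR error, or unknown.

    historical is a list of word forms from a historical lexicon.
    Returns a (label, suggestion) pair.
    """
    if token in historical:
        return ("historical", token)
    best = min(historical, key=lambda w: edit_distance(token, w), default=None)
    if best is not None and edit_distance(token, best) <= max_distance:
        return ("ocr_error", best)
    return ("unknown", token)
```

With a lexicon containing 'werreld', the FineReader output 'werreid' from the previous slide is one substitution away and would be flagged as a supposed OCR error with 'werreld' as the suggestion.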
IMPACT: Interoperability framework
Interaction, Modularisation, Evaluation
Thank you!
http://www.impact-project.eu/
impact@kb.nl
@impactocr
http://impactocr.wordpress.com/