IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




Analysis and Post-Correction of OCR-processed
historical documents
Ulrich Reffle

CIS
University of Munich




Overview
 Document specific analysis of OCR results of historical documents
 A system for interactive OCR post-correction




24.10.2011 Ulrich Reffle uli@cis.uni-muenchen.de                                                                                                         2




Document specific analysis of OCR
results of historical documents








   Why do we need special methods?
           Problems specific to the processing of historical language in the context of
           mass digitization:
             – High OCR error rates
             – No standardized language
                         Special resources and methods are needed for OCR, post-processing and
                          Information Retrieval
           [Figure: processing pipeline Digital image → OCR → OCR result → Post-correction → IR, annotated "Problem of historical language variation"]




   Why do we need special methods?
           Diversity of input material makes document specific parameter settings
           important:
             – Distribution of spelling variants
             – Special vocabulary
             – OCR channel model

           [Figure: processing pipeline Digital image → OCR → OCR result → Post-correction → IR, annotated "Problem of historical language variation"]




Document specific language and error profiles
 Language and error profiles provide document specific characteristics of
  the language and OCR errors.
 Language profile: shares of foreign languages (such as Latin, French),
  frequencies for language modeling, important patterns of spelling variation
  (in English: e.g. o→ou, v→u)
 Error profile: estimated error rate, important error patterns (like e→c, i→l),
  frequent erroneous words
 Language and error profiles are computed fully automatically; no manual
  interaction or ground truth is needed.







      Global profile of a document

      Language profile
         Pattern   Frequency        Lexicon           Share
         t→th      120              Modern            82%
         i→y       106              Historic          9%
         ä→a       38               Place names       6%
         …         …                Latin             3%

      Error profile
         Pattern   Frequency        Word class        Share
         e→c       51               Correct words     72%
         n→u       45               Erroneous words   20%
         t→i       34               Unknown words     8%
         …         …
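
A global profile of this kind could be tallied from per-word interpretations roughly as sketched below. This is an illustrative assumption, not the profiler's actual algorithm: the `interpretations` list, its weights, and the function name `global_profile` are hypothetical. Only the pattern notation (t→th, e→c, n→u, i→y) comes from the table above.

```python
from collections import Counter

# Hypothetical input: one entry per OCR token, giving the spelling-variation
# patterns and OCR error patterns of its assumed interpretation, plus a weight.
interpretations = [
    (["t→th"], [], 0.9),
    (["t→th"], ["e→c"], 0.8),
    ([], ["n→u"], 0.95),
    (["i→y"], [], 0.7),
]


def global_profile(interps):
    """Sum interpretation weights per pattern for both parts of the profile."""
    language_patterns = Counter()
    error_patterns = Counter()
    for variants, errors, weight in interps:
        for pattern in variants:
            language_patterns[pattern] += weight
        for pattern in errors:
            error_patterns[pattern] += weight
    return language_patterns, error_patterns


language_patterns, error_patterns = global_profile(interpretations)
for pattern, weight in language_patterns.most_common():
    print(f"language  {pattern:6} {weight:.2f}")
for pattern, weight in error_patterns.most_common():
    print(f"error     {pattern:6} {weight:.2f}")
```

Weighting each pattern by the plausibility of its interpretation, rather than counting every occurrence as 1, keeps uncertain readings from dominating the profile.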





   Local profile of all words of a document
    Weighted set of interpretations/correction suggestions for each word of the
     document.
    Example OCR tokens: „theil“, „hatn“
    Suggestions for „hatn“:
   Correction suggestion    Modern spelling    Probability
   hath                     has                0.95
   hat                      hat                0.01
   hate                     hate               0.04
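
The local profile for one token can be pictured as a small ranked list. The sketch below is only an illustration: the `Suggestion` class, the `local_profile` dictionary layout, and the `threshold` parameter are assumptions, while the three suggestions and their probabilities for „hatn“ are the ones shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    correction: str     # proposed reading of the OCR token
    modern: str         # modern spelling of that reading
    probability: float  # weight of this interpretation


# Values for the token "hatn" are taken from the slide; the layout is assumed.
local_profile = {
    "hatn": [
        Suggestion("hath", "has", 0.95),
        Suggestion("hat", "hat", 0.01),
        Suggestion("hate", "hate", 0.04),
    ],
}


def ranked(token):
    """Return the suggestions for a token, most probable first."""
    return sorted(local_profile.get(token, []),
                  key=lambda s: s.probability, reverse=True)


def best(token, threshold=0.9):
    """Return the top suggestion only if it is confident enough, else None."""
    candidates = ranked(token)
    if candidates and candidates[0].probability >= threshold:
        return candidates[0]
    return None


top = best("hatn")
print(top.correction, top.modern, top.probability)
```

Keeping the modern spelling next to the historical correction lets a retrieval component index „hath“ under "has" without rewriting the corrected text.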







Summary
 Document specific profiles …
          – are computed in a fully automated way from OCR output
          – provide characteristics of language and OCR error channel in order to adapt
            OCR and downstream processes.








System for interactive post-correction
of OCR results








Post-correction system
 A graphical user interface for fast and convenient post-correction
  specifically for OCRed historical documents
 Novel possibilities for detection, presentation and correction of systematic
  OCR errors.








   Post-correction system
   [Screenshot: the post-correction interface, with areas labelled "OCR Editor", "Special functionality" and "Image"]




Proper treatment of spelling variants
 Historical spelling variants are identified with the help of historical lexica and
  language profiles.
 Local profiles include non-modern words as correction suggestions.
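
One way to picture how a lexicon and variation patterns might work together is sketched below. It is a simplified assumption, not the system's actual matching method: the tiny `lexicon`, the token „bey“, and the function `modern_candidates` are hypothetical. The token „theil“ and the patterns t→th and i→y appear elsewhere in these slides.

```python
def modern_candidates(token, lexicon, patterns):
    """Find modern words that explain `token` by undoing one variation pattern.

    `patterns` holds (modern, historical) pairs, e.g. ("t", "th") for t→th.
    Each occurrence of the historical side is replaced by the modern side,
    and the result is kept if it is a known modern word.
    """
    found = []
    for modern, historical in patterns:
        start = token.find(historical)
        while start != -1:
            candidate = token[:start] + modern + token[start + len(historical):]
            if candidate in lexicon:
                found.append((candidate, f"{modern}→{historical}"))
            start = token.find(historical, start + 1)
    return found


# Hypothetical modern lexicon and pattern list for illustration only.
lexicon = {"teil", "bei"}
patterns = [("t", "th"), ("i", "y")]

print(modern_candidates("theil", lexicon, patterns))
print(modern_candidates("bey", lexicon, patterns))
```

A token explained this way is a legitimate historical spelling and not an OCR error, so it can be offered as a correction suggestion in its own right.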








Conventional correction methods
 Correcting words in the text view
          – Manual input
          – Selection of a correction suggestion








Batch-Correction of systematic OCR errors
 Systematic OCR errors are identified by the error profile.
 Batches of errors can be corrected with just a few keystrokes.
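
The idea of batch correction can be sketched as grouping tokens by the error pattern that explains them and then accepting a whole group at once. The token records, field names, and both functions below are hypothetical and only illustrate the principle. The patterns e→c and n→u are the ones named in the error profile.

```python
from collections import defaultdict

# Hypothetical tokens: OCR text, top correction suggestion, and the OCR error
# pattern that explains the difference (None if the token looks correct).
tokens = [
    {"ocr": "thc", "suggestion": "the", "pattern": "e→c"},
    {"ocr": "house", "suggestion": "house", "pattern": None},
    {"ocr": "bcd", "suggestion": "bed", "pattern": "e→c"},
    {"ocr": "aud", "suggestion": "and", "pattern": "n→u"},
]


def batches(tokens):
    """Group token positions by the error pattern that explains them."""
    groups = defaultdict(list)
    for index, token in enumerate(tokens):
        if token["pattern"] is not None:
            groups[token["pattern"]].append(index)
    return dict(groups)


def correct_batch(tokens, pattern):
    """Accept the suggestion for every token in one batch; leave others as is."""
    return [t["suggestion"] if t["pattern"] == pattern else t["ocr"]
            for t in tokens]


print(batches(tokens))
print(correct_batch(tokens, "e→c"))
```

In an interactive setting the user would inspect one batch, deselect any false positives, and confirm the remainder in a single step.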








Evaluation
 User experiment with 14 participants.
 Novel technology makes correction up to 2.7 times faster.








Availability
 The graphical interface will be distributed as open source.
 Document pre-processing to obtain language and error profiles is covered
  by a US patent application.
          – Pre-processing is offered as a web service, currently free of charge.








                                                             Thank you!

                                           http://ocr.cis.uni-muenchen.de
                                              uli@cis.uni-muenchen.de





More Related Content

What's hot

Résumé of Liudvikas Paskevicius
Résumé of Liudvikas PaskeviciusRésumé of Liudvikas Paskevicius
Résumé of Liudvikas Paskeviciusliudvikasp
 
Is There a Palce for Technology in the University Language Instruction?
Is There a Palce for Technology in the University Language Instruction? Is There a Palce for Technology in the University Language Instruction?
Is There a Palce for Technology in the University Language Instruction? Marta Borowiak-Dostatnia
 
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...Christoph Lange
 
Digital Humanities @ Net7
Digital Humanities @ Net7Digital Humanities @ Net7
Digital Humanities @ Net7Net7
 

What's hot (7)

Résumé of Liudvikas Paskevicius
Résumé of Liudvikas PaskeviciusRésumé of Liudvikas Paskevicius
Résumé of Liudvikas Paskevicius
 
Peter Doorn
Peter DoornPeter Doorn
Peter Doorn
 
my updated CV (resume)
my updated CV (resume)my updated CV (resume)
my updated CV (resume)
 
Michel Alexandre Salim\'s resume
Michel Alexandre Salim\'s resumeMichel Alexandre Salim\'s resume
Michel Alexandre Salim\'s resume
 
Is There a Palce for Technology in the University Language Instruction?
Is There a Palce for Technology in the University Language Instruction? Is There a Palce for Technology in the University Language Instruction?
Is There a Palce for Technology in the University Language Instruction?
 
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
 
Digital Humanities @ Net7
Digital Humanities @ Net7Digital Humanities @ Net7
Digital Humanities @ Net7
 

Viewers also liked

BL Demo Day - July2011 - (4) OCR for IMPACT Part 1
BL Demo Day - July2011 - (4) OCR for IMPACT Part 1BL Demo Day - July2011 - (4) OCR for IMPACT Part 1
BL Demo Day - July2011 - (4) OCR for IMPACT Part 1IMPACT Centre of Competence
 
BL Demo Day - July2011 - (3) Image Enhancement for OCR
BL Demo Day - July2011 - (3) Image Enhancement for OCRBL Demo Day - July2011 - (3) Image Enhancement for OCR
BL Demo Day - July2011 - (3) Image Enhancement for OCRIMPACT Centre of Competence
 
IMPACT Final Conference - Majlis Bremer Laamanen
IMPACT Final Conference - Majlis Bremer LaamanenIMPACT Final Conference - Majlis Bremer Laamanen
IMPACT Final Conference - Majlis Bremer LaamanenIMPACT Centre of Competence
 
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...IMPACT Centre of Competence
 
IMPACT Final Conference - Language Parallel Sessions - Gotscharek
IMPACT Final Conference - Language Parallel Sessions -  GotscharekIMPACT Final Conference - Language Parallel Sessions -  Gotscharek
IMPACT Final Conference - Language Parallel Sessions - GotscharekIMPACT Centre of Competence
 
IMACT Final Conference - Language Parallel Sessions - Erjavec
IMACT Final Conference - Language Parallel Sessions - ErjavecIMACT Final Conference - Language Parallel Sessions - Erjavec
IMACT Final Conference - Language Parallel Sessions - ErjavecIMPACT Centre of Competence
 
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...IMPACT Centre of Competence
 
IMPACT Final Conference - Apostolos Antonacopoulos
IMPACT Final Conference - Apostolos AntonacopoulosIMPACT Final Conference - Apostolos Antonacopoulos
IMPACT Final Conference - Apostolos AntonacopoulosIMPACT Centre of Competence
 
IMPACT/myGrid Hackathon - Taverna Server as a Portal
IMPACT/myGrid Hackathon - Taverna Server as a PortalIMPACT/myGrid Hackathon - Taverna Server as a Portal
IMPACT/myGrid Hackathon - Taverna Server as a PortalIMPACT Centre of Competence
 
IMPACT/myGrid Hackathon - Introduction to IMPACT
IMPACT/myGrid Hackathon - Introduction to IMPACTIMPACT/myGrid Hackathon - Introduction to IMPACT
IMPACT/myGrid Hackathon - Introduction to IMPACTIMPACT Centre of Competence
 

Viewers also liked (20)

IMPACT Final Conference - Jesse de Does
IMPACT Final Conference - Jesse de DoesIMPACT Final Conference - Jesse de Does
IMPACT Final Conference - Jesse de Does
 
IMPACT Final Conference - Katrien Depuydt
IMPACT Final Conference - Katrien DepuydtIMPACT Final Conference - Katrien Depuydt
IMPACT Final Conference - Katrien Depuydt
 
IMPACT Final Conference - Asaf Tzadok
IMPACT Final Conference - Asaf TzadokIMPACT Final Conference - Asaf Tzadok
IMPACT Final Conference - Asaf Tzadok
 
BL Demo Day - July2011 - (4) OCR for IMPACT Part 1
BL Demo Day - July2011 - (4) OCR for IMPACT Part 1BL Demo Day - July2011 - (4) OCR for IMPACT Part 1
BL Demo Day - July2011 - (4) OCR for IMPACT Part 1
 
BL Demo Day - July2011 - (3) Image Enhancement for OCR
BL Demo Day - July2011 - (3) Image Enhancement for OCRBL Demo Day - July2011 - (3) Image Enhancement for OCR
BL Demo Day - July2011 - (3) Image Enhancement for OCR
 
IMPACT Final Conference - Michael Fuchs
IMPACT Final Conference - Michael FuchsIMPACT Final Conference - Michael Fuchs
IMPACT Final Conference - Michael Fuchs
 
IMPACT Final Conference - Clemens Neudecker
IMPACT Final Conference - Clemens NeudeckerIMPACT Final Conference - Clemens Neudecker
IMPACT Final Conference - Clemens Neudecker
 
IMPACT Final Conference - Khalil Rouhana
IMPACT Final Conference - Khalil  RouhanaIMPACT Final Conference - Khalil  Rouhana
IMPACT Final Conference - Khalil Rouhana
 
IMPACT Final Conference - Majlis Bremer Laamanen
IMPACT Final Conference - Majlis Bremer LaamanenIMPACT Final Conference - Majlis Bremer Laamanen
IMPACT Final Conference - Majlis Bremer Laamanen
 
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...
IMPACT Final Conference - Research Parallel Sessions - 01 impact conference_r...
 
IMPACT Final Conference - Language Parallel Sessions - Gotscharek
IMPACT Final Conference - Language Parallel Sessions -  GotscharekIMPACT Final Conference - Language Parallel Sessions -  Gotscharek
IMPACT Final Conference - Language Parallel Sessions - Gotscharek
 
IMPACT Final Conference - Steven Krauwer
IMPACT Final Conference - Steven KrauwerIMPACT Final Conference - Steven Krauwer
IMPACT Final Conference - Steven Krauwer
 
IMACT Final Conference - Language Parallel Sessions - Erjavec
IMACT Final Conference - Language Parallel Sessions - ErjavecIMACT Final Conference - Language Parallel Sessions - Erjavec
IMACT Final Conference - Language Parallel Sessions - Erjavec
 
IMPACT Final Conference - Paul Fogel
IMPACT Final Conference - Paul FogelIMPACT Final Conference - Paul Fogel
IMPACT Final Conference - Paul Fogel
 
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...
IMPACT Final Conference - Research Parallel Sessions02 research session_ncsr_...
 
IMPACT Final Conference - Apostolos Antonacopoulos
IMPACT Final Conference - Apostolos AntonacopoulosIMPACT Final Conference - Apostolos Antonacopoulos
IMPACT Final Conference - Apostolos Antonacopoulos
 
IMPACT Final Conference - Gregory Crane
IMPACT Final Conference - Gregory CraneIMPACT Final Conference - Gregory Crane
IMPACT Final Conference - Gregory Crane
 
IMPACT Final Conference - Aly Conteh
IMPACT Final Conference - Aly ContehIMPACT Final Conference - Aly Conteh
IMPACT Final Conference - Aly Conteh
 
IMPACT/myGrid Hackathon - Taverna Server as a Portal
IMPACT/myGrid Hackathon - Taverna Server as a PortalIMPACT/myGrid Hackathon - Taverna Server as a Portal
IMPACT/myGrid Hackathon - Taverna Server as a Portal
 
IMPACT/myGrid Hackathon - Introduction to IMPACT
IMPACT/myGrid Hackathon - Introduction to IMPACTIMPACT/myGrid Hackathon - Introduction to IMPACT
IMPACT/myGrid Hackathon - Introduction to IMPACT
 

Similar to IMPACT Final Conference - Ulrich Reffle

TR5 Prolifer and Post-Correction System. Ludwig Maximilians
TR5 Prolifer and Post-Correction System. Ludwig MaximiliansTR5 Prolifer and Post-Correction System. Ludwig Maximilians
TR5 Prolifer and Post-Correction System. Ludwig MaximiliansBiblioteca Nacional de España
 
The Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesThe Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesMichael Day
 
Towards a Human Language Project for Multilingual Europe: AI and Interpretation
Towards a Human Language Project for Multilingual Europe: AI and InterpretationTowards a Human Language Project for Multilingual Europe: AI and Interpretation
Towards a Human Language Project for Multilingual Europe: AI and InterpretationGeorg Rehm
 
Workflow Development for OCR (and beyond)
Workflow Development for OCR (and beyond)Workflow Development for OCR (and beyond)
Workflow Development for OCR (and beyond)cneudecker
 
M&L 2012 - Translectures: tackling the translation issue in a cost effective ...
M&L 2012 - Translectures: tackling the translation issue in a cost effective ...M&L 2012 - Translectures: tackling the translation issue in a cost effective ...
M&L 2012 - Translectures: tackling the translation issue in a cost effective ...Media & Learning Conference
 
The META-NET Strategic Research Agenda for Multilingual Europe 2020
The META-NET Strategic Research Agenda for Multilingual Europe 2020The META-NET Strategic Research Agenda for Multilingual Europe 2020
The META-NET Strategic Research Agenda for Multilingual Europe 2020Georg Rehm
 
Multilingual challenges in Europeana
Multilingual challenges in EuropeanaMultilingual challenges in Europeana
Multilingual challenges in EuropeanaAntoine Isaac
 
Europeana Newspapers LFT Infoday Muehlberger
Europeana Newspapers LFT Infoday MuehlbergerEuropeana Newspapers LFT Infoday Muehlberger
Europeana Newspapers LFT Infoday MuehlbergerEuropeana Newspapers
 
IMPACT HPC Cloud Day
IMPACT HPC Cloud DayIMPACT HPC Cloud Day
IMPACT HPC Cloud Daycneudecker
 
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...Georg Rehm
 
Centre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens NeudeckerCentre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens NeudeckerBiblioteca Nacional de España
 
AI for Translation Technologies and Multilingual Europe
AI for Translation Technologies and Multilingual EuropeAI for Translation Technologies and Multilingual Europe
AI for Translation Technologies and Multilingual EuropeGeorg Rehm
 
French Presidency - 1 march 2022
French Presidency - 1 march 2022French Presidency - 1 march 2022
French Presidency - 1 march 2022Europeana
 
2012 oct 22 shaping access presentation_alt
2012 oct 22  shaping access presentation_alt2012 oct 22  shaping access presentation_alt
2012 oct 22 shaping access presentation_altEuropeana
 
Mahmoud Resume English
Mahmoud Resume EnglishMahmoud Resume English
Mahmoud Resume Englishmahmoudfem
 
Multilingualism for Digital Europe
Multilingualism for Digital EuropeMultilingualism for Digital Europe
Multilingualism for Digital EuropeGeorg Rehm
 
Promoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language TechnologyPromoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language Technologytechiaith
 
Bratislava WS - Schlarb - ONB - technical tools_pdf
Bratislava WS - Schlarb - ONB - technical tools_pdfBratislava WS - Schlarb - ONB - technical tools_pdf
Bratislava WS - Schlarb - ONB - technical tools_pdfIMPACT Centre of Competence
 
how to innovate lexicography by means of research infrastructures
how to innovate lexicography by means of research infrastructureshow to innovate lexicography by means of research infrastructures
how to innovate lexicography by means of research infrastructureseveline wandl-vogt
 

Similar to IMPACT Final Conference - Ulrich Reffle (20)

TR5 Prolifer and Post-Correction System. Ludwig Maximilians
TR5 Prolifer and Post-Correction System. Ludwig MaximiliansTR5 Prolifer and Post-Correction System. Ludwig Maximilians
TR5 Prolifer and Post-Correction System. Ludwig Maximilians
 
The Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiativesThe Improving Access to Text (IMPACT) project and other European initiatives
The Improving Access to Text (IMPACT) project and other European initiatives
 
Towards a Human Language Project for Multilingual Europe: AI and Interpretation
Towards a Human Language Project for Multilingual Europe: AI and InterpretationTowards a Human Language Project for Multilingual Europe: AI and Interpretation
Towards a Human Language Project for Multilingual Europe: AI and Interpretation
 
博物館科技前瞻2010 horizon-report-museum-edition
博物館科技前瞻2010 horizon-report-museum-edition博物館科技前瞻2010 horizon-report-museum-edition
博物館科技前瞻2010 horizon-report-museum-edition
 
Workflow Development for OCR (and beyond)
Workflow Development for OCR (and beyond)Workflow Development for OCR (and beyond)
Workflow Development for OCR (and beyond)
 
M&L 2012 - Translectures: tackling the translation issue in a cost effective ...
M&L 2012 - Translectures: tackling the translation issue in a cost effective ...M&L 2012 - Translectures: tackling the translation issue in a cost effective ...
M&L 2012 - Translectures: tackling the translation issue in a cost effective ...
 
The META-NET Strategic Research Agenda for Multilingual Europe 2020
The META-NET Strategic Research Agenda for Multilingual Europe 2020The META-NET Strategic Research Agenda for Multilingual Europe 2020
The META-NET Strategic Research Agenda for Multilingual Europe 2020
 
Multilingual challenges in Europeana
Multilingual challenges in EuropeanaMultilingual challenges in Europeana
Multilingual challenges in Europeana
 
Europeana Newspapers LFT Infoday Muehlberger
Europeana Newspapers LFT Infoday MuehlbergerEuropeana Newspapers LFT Infoday Muehlberger
Europeana Newspapers LFT Infoday Muehlberger
 
IMPACT HPC Cloud Day
IMPACT HPC Cloud DayIMPACT HPC Cloud Day
IMPACT HPC Cloud Day
 
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...
Language Technologies for Big Data – A Strategic Agenda for the Multilingual ...
 
Centre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens NeudeckerCentre of Competence in digitisation. Clemens Neudecker
Centre of Competence in digitisation. Clemens Neudecker
 
AI for Translation Technologies and Multilingual Europe
AI for Translation Technologies and Multilingual EuropeAI for Translation Technologies and Multilingual Europe
AI for Translation Technologies and Multilingual Europe
 
French Presidency - 1 march 2022
French Presidency - 1 march 2022French Presidency - 1 march 2022
French Presidency - 1 march 2022
 
2012 oct 22 shaping access presentation_alt
2012 oct 22  shaping access presentation_alt2012 oct 22  shaping access presentation_alt
2012 oct 22 shaping access presentation_alt
 
Mahmoud Resume English
Mahmoud Resume EnglishMahmoud Resume English
Mahmoud Resume English
 
Multilingualism for Digital Europe
Multilingualism for Digital EuropeMultilingualism for Digital Europe
Multilingualism for Digital Europe
 
Promoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language TechnologyPromoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language Technology
 
Bratislava WS - Schlarb - ONB - technical tools_pdf
Bratislava WS - Schlarb - ONB - technical tools_pdfBratislava WS - Schlarb - ONB - technical tools_pdf
Bratislava WS - Schlarb - ONB - technical tools_pdf
 
how to innovate lexicography by means of research infrastructures
how to innovate lexicography by means of research infrastructureshow to innovate lexicography by means of research infrastructures
how to innovate lexicography by means of research infrastructures
 

IMPACT Final Conference - Ulrich Reffle

  • 1. Analysis and Post-Correction of OCR-processed historical documents. Ulrich Reffle, CIS, University of Munich. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  • 2. Overview (24.10.2011, Ulrich Reffle, uli@cis.uni-muenchen.de)
      – Document-specific analysis of OCR results of historical documents
      – A system for interactive OCR post-correction
  • 3. Document-specific analysis of OCR results of historical documents
  • 4. Why do we need special methods? Problems specific to the processing of historical language in the context of mass digitization:
      – High OCR error rates
      – No standardized language
    Special resources and methods are needed for OCR, post-processing and Information Retrieval.
    [Diagram: Digital image → OCR → OCR result → Post-Correction → IR, labelled "Problem of historical language variation"]
  • 5. Why do we need special methods? Diversity of input material makes document-specific parameter settings important:
      – Distribution of spelling variants
      – Special vocabulary
      – OCR channel model
    [Diagram: Digital image → OCR → OCR result → Post-Correction → IR, labelled "Problem of historical language variation"]
  • 6. Document-specific language and error profiles
      – Language and error profiles provide document-specific characteristics of the language and OCR errors.
      – Language profile: shares of foreign languages (such as Latin, French), frequencies for language modeling, important patterns of spelling variation (in English: e.g. o→ou, v→u)
      – Error profile: estimated error rate, important error patterns (like e→c, i→l), frequent erroneous words
      – Language and error profiles are computed fully automatically; no manual interaction or ground truth is needed.
  • 7. Global profile of a document
      Language profile
        – Pattern frequencies: t→th 120, i→y 106, ä→a 38, …
        – Lexicon shares: Modern 82%, Historic 9%, Place names 6%, Latin 3%
      Error profile
        – Pattern frequencies: e→c 51, n→u 45, t→i 34, …
        – Word shares: Correct words 72%, Erroneous words 20%, Unknown words 8%
  • 8. Local profile of all words of a document
      – Weighted set of interpretations/correction suggestions for each word of the document.
      – Example tokens shown on the slide: „theil“, „hatn“
      – Suggestions for „hatn“ (correction suggestion, modern spelling, probability):
          hath, has, 0.95
          hat, hat, 0.01
          hate, hate, 0.04
  • 9. Summary. Document-specific profiles …
      – are computed in a fully automated way from OCR output
      – provide characteristics of language and OCR error channel in order to adapt OCR and downstream processes.
  • 10. System for interactive post-correction of OCR results
  • 11. Post-correction system
      – A graphical user interface for fast and convenient post-correction specifically for OCRed historical documents
      – Novel possibilities for detection, presentation and correction of systematic OCR errors.
  • 12. Post-correction system [Screenshot with labelled areas: OCR editor, special functionality, image]
  • 13. Proper treatment of spelling variants
      – Historical spelling variants are identified with the help of historical lexica and language profiles.
      – Local profiles include non-modern words as correction suggestions.
  • 14. Conventional correction methods. Correcting words in the text view:
      – Manual input
      – Selection of a correction suggestion
  • 15. Batch correction of systematic OCR errors
      – Systematic OCR errors are identified by the error profile.
      – Batches of errors can be corrected with just a few keystrokes.
  • 16. Evaluation
      – User experiment with 14 participants.
      – Novel technology makes correction up to 2.7 times faster.
  • 17. Availability
      – The graphical interface is going to be distributed as open source.
      – Document pre-processing to obtain language and error profiles is protected by a US patent application.
      – Pre-processing is offered as a web service, as of now free of charge.
  • 18. Thank you! http://ocr.cis.uni-muenchen.de uli@cis.uni-muenchen.de
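
The slides describe local profiles (slide 8), error profiles (slides 6 and 7) and batch correction (slide 15) only at a high level. The following is a minimal Python sketch of how these three ideas could fit together, not the IMPACT implementation. The `Candidate` class, the function names, the token list and the OCR patterns attached to each candidate (written correct→recognized) are all assumptions made for illustration; only the three suggestions for „hatn“ and their probabilities come from slide 8.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    suggestion: str             # corrected token, possibly in historical spelling
    modern: str                 # modern-spelling equivalent
    probability: float          # weight of this interpretation
    ocr_pattern: Optional[str]  # hypothetical OCR error pattern, None if no OCR error


LocalProfile = Dict[str, List[Candidate]]

# Suggestions and probabilities for "hatn" follow slide 8; the patterns are invented.
LOCAL_PROFILE: LocalProfile = {
    "hatn": [
        Candidate("hath", "has", 0.95, "h→n"),
        Candidate("hat", "hat", 0.01, None),
        Candidate("hate", "hate", 0.04, "e→n"),
    ],
    # Entirely hypothetical second entry, added so that two patterns occur.
    "tirne": [Candidate("time", "time", 0.90, "m→rn")],
}

# Hypothetical OCR output; tokens without a profile entry are treated as correct.
TOKENS = ["she", "hatn", "a", "booke", "and", "he", "hatn", "tirne"]


def best(candidates: List[Candidate]) -> Candidate:
    """Return the most probable interpretation of a token."""
    return max(candidates, key=lambda c: c.probability)


def error_profile(tokens: List[str], profile: LocalProfile) -> Counter:
    """Count OCR error patterns over the top-ranked interpretation of each token."""
    counts: Counter = Counter()
    for token in tokens:
        if token in profile:
            top = best(profile[token])
            if top.ocr_pattern is not None:
                counts[top.ocr_pattern] += 1
    return counts


def batch_correct(tokens: List[str], profile: LocalProfile, pattern: str) -> List[str]:
    """Replace every token whose top interpretation is explained by `pattern`."""
    corrected = []
    for token in tokens:
        top = best(profile[token]) if token in profile else None
        if top is not None and top.ocr_pattern == pattern:
            corrected.append(top.suggestion)
        else:
            corrected.append(token)
    return corrected


if __name__ == "__main__":
    print(error_profile(TOKENS, LOCAL_PROFILE).most_common())
    print(batch_correct(TOKENS, LOCAL_PROFILE, "h→n"))
```

In this sketch, choosing one pattern from the error profile corrects all affected tokens in a single step, which is the effect slide 15 attributes to batch correction. How the real system estimates the probabilities and patterns is not stated in the slides.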