Correcting OCR results with computational linguistic methods
Michel Généreux, EURAC Bozen / Bolzano
in collaboration with Egon W. Stemle, Lionel Nicolas, Verena Lyding and Katalin Szabò
Institute for Specialised Communication and Multilingualism 
Michel Généreux 27.10.2014
Introduction
The OPATCH project (Open Platform for Access to and Analysis of Textual Documents of Cultural Heritage) aims to create an advanced online search infrastructure for research in a historical newspaper archive.
• Duration: 24 months (Jan 2014 – Dec 2015)
• Funding: Autonome Provinz Bozen-Südtirol, Landesgesetz Nr. 14, „Forschung und Innovation“
• Partners:
  • Landesbibliothek Dr. Friedrich Teßmann, Bozen
  • Institut für Corpuslinguistik und Texttechnologie (ICLTT), Wien
To implement this, OPATCH builds on computational linguistic (CL) methods for structural parsing, word class tagging and named entity recognition.
Dating from between 1910 and 1920, the newspapers are set in the blackletter Fraktur typeface, and the paper quality has degraded with age.
Hence, in OPATCH we start from highly error-prone OCR-ed text, in quantities that cannot realistically be corrected manually.
A Glance at the Teßmann collection 
626,287 pages for a total of 819,310,354 tokens;
616,751,127 tokens are in the reference dictionary, so a degree of cleanness of 0.75.
Our post-OCR correction system is based on 10 OCR-ed pages with cleanness of 0.70.
Given that the dictionary covers on average 91% of all words, 0.90 cleanness is almost perfect OCR.
Focusing on the best pages from the Teßmann collection ...
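As a minimal sketch of the cleanness figure quoted above, assuming it is simply the share of tokens found in the reference dictionary (the function name `cleanness` is ours, not the project's):

```python
def cleanness(tokens_in_dictionary: int, total_tokens: int) -> float:
    """Share of tokens that are found in the reference dictionary."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return tokens_in_dictionary / total_tokens


# Token counts for the whole Teßmann collection, as quoted on the slide.
collection = cleanness(616_751_127, 819_310_354)
print(f"{collection:.2f}")  # prints 0.75
```

Under this reading, a page whose words are all correct would still score only about 0.91, because the dictionary itself covers roughly 91% of all words.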
OPATCH 100k pages
Newsp_Decade                    Pages    Tokens  Ratio in dict.  Cumul. pages  Cumul. tokens  Cumul. tokens in dict.  Cumul. % in dict.
PM_1900                          1032   1137449           0.901          1032        1137449                 1024804              90.1%
AM_1890                             8      9457           0.899          1040        1146906                 1033302              90.1%
FA_1900                             8      8652           0.887          1048        1155558                 1040978              90.1%
PM_1910                           737    750131           0.886          1785        1905689                 1705286              89.5%
IS_1900                          2970   2817010           0.879          4755        4722699                 4180623              88.5%
WB_1900                            72     45481           0.872          4827        4768180                 4220278              88.5%
...
SVB_1890                         9749  14306473           0.809         70546       92484068                75861319              82.0%
SVB_1900 (Volksblatt)           17545  26314923           0.807         88091      118798991                97103539              81.7%
BZN_1890 (Bozner Nachrichten)   16015  12153784           0.806        104106      130952775               106894635              81.6%
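The cumulative columns of the table above can be recomputed with a short sketch. It assumes that sources are ranked by their share of in-dictionary tokens, best first. The per-row in-dictionary counts are not printed on the slide; here they are derived from differences in the cumulative column, and the function name `cumulative` is ours.

```python
# (newspaper_decade, pages, tokens, tokens_in_dictionary)
# In-dictionary counts are derived from differences in the cumulative column.
rows = [
    ("AM_1890", 8, 9_457, 8_498),
    ("PM_1900", 1032, 1_137_449, 1_024_804),
    ("FA_1900", 8, 8_652, 7_676),
]


def cumulative(rows):
    """Rank sources by cleanness (best first) and accumulate the totals."""
    ordered = sorted(rows, key=lambda r: r[3] / r[2], reverse=True)
    out, pages, tokens, in_dict = [], 0, 0, 0
    for name, p, t, d in ordered:
        pages, tokens, in_dict = pages + p, tokens + t, in_dict + d
        out.append((name, pages, tokens, in_dict, in_dict / tokens))
    return out


for name, pages, tokens, in_dict, ratio in cumulative(rows):
    print(f"{name:<10}{pages:>6}{tokens:>10}{in_dict:>10}{ratio:>8.1%}")
```

Cutting the ranked list at a page budget (for example 100k or 5k pages) then gives the cleanest sub-corpus of that size.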
OPATCH unannotated 5k pages
We select the 5,000 cleanest pages from the Teßmann collection above for the years 1910-1920. The average cleanness of this corpus is 91%, so no cleaning is needed. This results in the following unannotated example corpus:
OPATCH annotated 5k pages 
Automated annotations for Part-of-Speech (POS), Lemmas and Named Entities (NE). We also have a list of 
roughly 31k locations and names for South Tyrol compiled by Teßmann. 
Bozen hat ein Museum für Ötzi.
TOKEN   POS    LEMMA   NE
Bozen   NE     Bozen   I-LOC
hat     VAFIN  haben
ein     ART    eine
Museum  NN     Museum
für     APPR   für
Ötzi    NE     Ötzi    I-PER
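A small sketch of how such annotations could be read programmatically. It assumes one token per line with whitespace-separated columns and an optional named-entity tag; the actual file format used in OPATCH is not stated on the slide, and the `Token` type and `parse_line` helper are hypothetical.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    form: str
    pos: str
    lemma: str
    ne: Optional[str]  # None when the token carries no named-entity tag


def parse_line(line: str) -> Token:
    """Parse one 'form POS lemma [NE]' line (assumed column layout)."""
    fields = line.split()
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 columns, got {len(fields)}: {line!r}")
    form, pos, lemma = fields[:3]
    ne = fields[3] if len(fields) == 4 else None
    return Token(form, pos, lemma, ne)


sample = """\
Bozen NE Bozen I-LOC
hat VAFIN haben
ein ART eine
Museum NN Museum
für APPR für
Ötzi NE Ötzi I-PER
"""

tokens = [parse_line(line) for line in sample.splitlines() if line.strip()]
locations = [t.form for t in tokens if t.ne == "I-LOC"]
print(locations)  # prints ['Bozen']
```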
Corpora 
Ten OCR-ed pages with their manually corrected versions. 
• 10,468 tokens and 3,621 types. Eight pages (8,324/2,487) are 
used as training data and two pages (2,144/1,134) for testing. 
• More than one out of two tokens is misrecognized, among 
which almost half (48%) need a minimum of three edit 
operations for correction. 
Reference corpus: 5M words and 5M bigrams 
• From the Web and also from http://www.gutenberg.org/ (Romane und Erzählungen, 1910-20)
• The dictionary covers 91% of all words in the ten OCR-ed pages
Approach for correcting OCR-ed text
1. Probability models. Collate and tally all edit operations (delete, insert and replace) needed to transform each unrecognized token in the OCR-ed training texts into its corrected form in the Gold Standard (GS), e.g. n|u 98 (the n/u confusion, tallied 98 times).
* We obtain two probability models: constrained and unconstrained.
2. Candidate generation is achieved by finding the closest entries in the dictionary, applying the minimum number of edit operations to an unrecognized OCR-ed token. The number of candidates is a function of the maximum number of edit operations allowed and the model used.
* wundestc → wundesten: replacing ’c’ with ’e’ and inserting an ’n’ after the ’e’.
3. Selection of the most suitable candidate, given relative frequency and context:
... word word word word word WORD word word word word ...
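Steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: candidates are found with plain Levenshtein distance rather than the tallied probability models, the ranking (context support first, then word frequency, then distance) is our own simplification, and all counts are invented toy data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost delete, insert and replace."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # replace ca by cb (free if equal)
            ))
        prev = cur
    return prev[-1]


def generate_candidates(token, dictionary, max_edits=2):
    """Dictionary entries within max_edits of token, closest first."""
    scored = ((edit_distance(token, w), w) for w in dictionary)
    return sorted((d, w) for d, w in scored if d <= max_edits)


def select(candidates, left, right, unigrams, bigrams):
    """Pick the candidate best supported by context, then by frequency."""
    def key(item):
        distance, word = item
        context = bigrams.get((left, word), 0) + bigrams.get((word, right), 0)
        return (context, unigrams.get(word, 0), -distance)
    return max(candidates, key=key)[1]


# Toy counts, invented for illustration only.
unigrams = {"wundeste": 6, "wundesten": 4, "wunderte": 9}
bigrams = {("am", "wundesten"): 3, ("wundesten", "Punkt"): 2}

candidates = generate_candidates("wundestc", unigrams)
best = select(candidates, "am", "Punkt", unigrams, bigrams)
print(candidates)  # [(1, 'wundeste'), (2, 'wunderte'), (2, 'wundesten')]
print(best)        # wundesten
```

In this toy setting the closest entry (wundeste, one edit) loses to wundesten (two edits) because only the latter is supported by the surrounding words, which is the role the context plays in step 3.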
Experiment 1 
Artificially created errors 
• To achieve this, we extracted random trigrams (left context, target, right context) from the Gold Standard (GS) and applied the edit error model in reverse.
• Errors were introduced up to a limit of two each for the target and the context words.
• At the end of this process, we have two context words and five candidates, including the target.
• En is the maximum number of edit operations performed to generate candidates.
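A hedged sketch of how errors might be injected by running an error model in reverse. The model layout, the helper names `corrupt` and `noisy_trigram`, and all counts are hypothetical; for brevity only substitutions and deletions are modelled, not insertions.

```python
import random

# Hypothetical error model: gold character -> {OCR output: tally}.
# An empty string as OCR output stands for a deleted character.
# All counts are invented for illustration.
ERROR_MODEL = {
    "e": {"c": 40, "o": 5},
    "n": {"u": 30, "": 4},
    "s": {"f": 25},
}


def corrupt(word, model, max_errors=2, rng=random):
    """Apply the error model in reverse: turn a correct word into a noisy one."""
    chars = list(word)
    # Only positions whose character has a recorded confusion can be corrupted.
    positions = [i for i, c in enumerate(chars) if c in model]
    rng.shuffle(positions)
    for i in positions[:max_errors]:
        outcomes, weights = zip(*model[chars[i]].items())
        chars[i] = rng.choices(outcomes, weights=weights, k=1)[0]
    return "".join(chars)


def noisy_trigram(left, target, right, model, rng=random):
    """Corrupt the target and both context words of a gold-standard trigram."""
    return tuple(corrupt(w, model, rng=rng) for w in (left, target, right))


rng = random.Random(0)  # fixed seed so that runs are repeatable
print(noisy_trigram("am", "wundesten", "Punkt", ERROR_MODEL, rng))
```

Sampling outcomes in proportion to their tallies keeps frequent confusions frequent in the artificial data, and the cap of two errors per word mirrors the limit stated above.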
Experiment 2: Real errors 
Discussion
The approach we presented to correct OCR errors considered four features of two types: edit distance and n-gram frequencies.
Results showed that a simple scoring system can correct OCR-ed texts with very high accuracy under idealized conditions: no more than two edit operations and a wide-coverage dictionary.
These conditions do not always hold in practice, in which case the observed accuracy drops to 10%. Wrong substitutions by the OCR process have also been neglected.
Nevertheless, we can expect to improve our dictionary coverage so that very noisy OCR-ed texts (i.e. 48% of errors at a distance of at least three from the target) can be corrected with accuracies of up to 20%.
OCR-ed texts with less challenging error patterns can be corrected with accuracies of up to 61% (distance two) and 86% (distance one).
Reference: Michel Généreux, Egon W. Stemle, Verena Lyding and Lionel Nicolas. Correcting OCR errors for German in Fraktur font. La prima Conferenza di Linguistica Computazionale Italiana, Pisa, 9-10 December 2014.

 

Último

Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 

Último (20)

Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 

Correcting OCR Results with CL Methods

  • 1. Correcting OCR results with computational linguistic methods. Michel Généreux, EURAC, Bozen / Bolzano, in collaboration with Egon W. Stemle, Lionel Nicolas, Verena Lyding and Katalin Szabò. Institute for Specialised Communication and Multilingualism, 27.10.2014
  • 2. Introduction. The OPATCH project (Open Platform for Access to and Analysis of Textual Documents of Cultural Heritage) aims to create an advanced online search infrastructure for research in a historical newspaper archive.
    • Duration: 24 months (Jan 2014 – Dec 2015)
    • Funding: Autonome Provinz Bozen-Südtirol, Landesgesetz Nr. 14, „Forschung und Innovation“
    • Partners: Landesbibliothek Dr. Friedrich Teßmann, Bozen; Institut für Corpuslinguistik und Texttechnologie (ICLTT), Wien
    To implement this, OPATCH builds on computational linguistic (CL) methods for structural parsing, word class tagging and named entity recognition. The newspapers date from between 1910 and 1920, are set in the blackletter Fraktur typeface, and their paper quality has degraded with age. Hence, in OPATCH we start from highly error-prone OCR-ed text, in quantities that cannot realistically be corrected manually.
  • 3. A glance at the Teßmann collection.
    • 626,287 pages for a total of 819,310,354 tokens; 616,751,127 tokens are in the reference dictionary, so a degree of cleanness of 0.75.
    • Our post-OCR correction system is based on 10 OCR-ed pages with cleanness of 0.70.
    • Given that the dictionary covers on average 91% of all words, 0.90 cleanness is almost perfect OCR.
    Focusing on the best pages from the Teßmann collection ...
  • 4. OPATCH 100k pages (SVB: Volksblatt; BZN: Bozner Nachrichten)

    Newsp_Decade   Pages   Tokens     Share in dict.   Cumul. pages   Cumul. tokens   Cumul. tokens in dict.   Cumul. share
    PM_1900         1032    1137449   0.901                    1032         1137449                  1024804          90.1%
    AM_1890            8       9457   0.899                    1040         1146906                  1033302          90.1%
    FA_1900            8       8652   0.887                    1048         1155558                  1040978          90.1%
    PM_1910          737     750131   0.886                    1785         1905689                  1705286          89.5%
    IS_1900         2970    2817010   0.879                    4755         4722699                  4180623          88.5%
    WB_1900           72      45481   0.872                    4827         4768180                  4220278          88.5%
    ...
    SVB_1890        9749   14306473   0.809                   70546        92484068                 75861319          82.0%
    SVB_1900       17545   26314923   0.807                   88091       118798991                 97103539          81.7%
    BZN_1890       16015   12153784   0.806                  104106       130952775                106894635          81.6%
  • 5. OPATCH unannotated 5k pages. We select the 5,000 cleanest pages from the Teßmann collection above for the years 1910-1920. The average cleanness of this corpus is 91%, so no cleaning is needed. This results in the un-annotated example corpus shown on the slide.
  • 6. OPATCH annotated 5k pages. Automated annotations for Part-of-Speech (POS), Lemmas and Named Entities (NE). We also have a list of roughly 31k locations and names for South Tyrol compiled by Teßmann. Example: Bozen hat ein Museum für Ötzi.

    Token    POS     LEMMA    NE
    Bozen    NE      Bozen    I-LOC
    hat      VAFIN   haben
    ein      ART     eine
    Museum   NN      Museum
    für      APPR    für
    Ötzi     NE      Ötzi     I-PER
  • 7. Corpora. Ten OCR-ed pages with their manually corrected versions.
    • 10,468 tokens and 3,621 types. Eight pages (8,324 tokens / 2,487 types) are used as training data and two pages (2,144 / 1,134) for testing.
    • More than one out of two tokens is misrecognized, and almost half of these (48%) need a minimum of three edit operations for correction.
    Reference corpus: 5M words and 5M bigrams.
    • From the Web and also from http://www.gutenberg.org/ (Romane und Erzählungen, 1910-20)
    • The dictionary covers 91% of all words in the ten OCR-ed pages
  • 8. Approach for correcting OCR-ed text.
    1. Probability models. Collate and tally all edit operations (delete, insert and replace) needed to transform each unrecognized token from the training OCR-ed texts into its corrected form in the Gold Standard (e.g. n|u 98). We obtain two probability models: constrained and unconstrained.
    2. Candidate generation is achieved by finding the closest entries in the dictionary, applying the minimum number of edit operations to an unrecognized OCR-ed token. The number of candidates is a function of the maximum number of edit operations allowed and the model used. Example: wundestc → wundesten, by replacing ’c’ with ’e’ and inserting an ’n’ after the ’e’.
    3. Selection of the most suitable candidate, given relative frequency and context: ... word word word word word WORD word word word word ...
  • 9. Experiment 1: Artificially created errors.
    • To achieve this, we extracted random trigrams (left context, target, right context) from the Gold Standard (GS) and applied the edit error model in reverse.
    • Errors were introduced up to a limit of two per target and per context word.
    • At the end of this process, we have two context words and five candidates, including the target.
    • En is the maximum number of edit operations performed to generate candidates.
  • 10. Experiment 2: Real errors
  • 11. Discussion. The approach we presented to correct OCR errors considered four features of two types: edit distance and n-gram frequencies. Results showed that a simple scoring system can correct OCR-ed texts with very high accuracy under idealized conditions: no more than two edit operations and a wide-coverage dictionary. These conditions do not always hold in practice, and the observed accuracy then drops to 10%. Wrong substitutions by the OCR process have also been neglected. Nevertheless, we can expect to improve our dictionary coverage so that very noisy OCR-ed texts (i.e. 48% of errors at a distance of at least three from the target) can be corrected with accuracies up to 20%. OCR-ed texts with less challenging error patterns can be corrected with accuracies up to 61% (distance two) and 86% (distance one).
    Reference: Michel Généreux, Egon W. Stemle, Verena Lyding and Lionel Nicolas. Correcting OCR errors for German in Fraktur font. La prima Conferenza di Linguistica Computazionale Italiana (First Italian Conference on Computational Linguistics), Pisa, 9-10 December 2014.
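
The cleanness measure (slide 3) and the candidate generation and selection steps (slide 8) can be illustrated with a minimal Python sketch. It is not the OPATCH implementation: it assumes unweighted Levenshtein distance in place of the constrained and unconstrained probability models, and plain word frequency in place of the bigram context features. The function names (`cleanness`, `edit_distance`, `candidates`, `best_candidate`) and the toy word counts are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cleanness(tokens, dictionary):
    """Share of tokens that appear in the reference dictionary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in dictionary) / len(tokens)


def edit_distance(a, b):
    """Unweighted Levenshtein distance: delete, insert and replace cost 1.

    Assumption: the slides weight edit operations with learned
    probabilities; this sketch treats them all alike.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace ca by cb, or match
            ))
        prev = curr
    return prev[-1]


def candidates(token, dictionary, max_edits=2):
    """Dictionary entries within max_edits of token, closest first."""
    scored = ((edit_distance(token, w), w) for w in dictionary)
    return sorted((d, w) for d, w in scored if d <= max_edits)


def best_candidate(token, freq, max_edits=2):
    """Closest candidate; ties are broken by higher frequency.

    Assumption: freq maps dictionary words to counts in a reference
    corpus. The token is returned unchanged if nothing is close enough.
    """
    cands = candidates(token, freq.keys(), max_edits)
    if not cands:
        return token
    return min(cands, key=lambda dw: (dw[0], -freq[dw[1]]))[1]


if __name__ == "__main__":
    # Toy frequency list (hypothetical counts).
    freq = Counter({"wundesten": 3, "wunder": 5, "bunde": 2})
    print(edit_distance("wundestc", "wundesten"))
    print(best_candidate("wundestc", freq))
    print(round(616_751_127 / 819_310_354, 2))
```

With the collection totals from slide 3, the ratio of dictionary tokens to all tokens rounds to the reported cleanness of 0.75. The brute-force scan over the whole dictionary keeps the sketch short; a large dictionary would call for an index or a trie-based search instead.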