TR5 Profiler and Post-Correction System  Ludwig-Maximilians-Universität München Centrum für Informations- und Sprachverarbeitung
TR5 Post-Correction System
Customizable user interface: OCR and image fragments; correction candidates, special functions; complete image; font size
View: OCR and image clippings
View: Original image
Word-by-word correction of text
Batch correction: efficient postcorrection
Postcorrection system: Evaluation (Ulrich Reffle, 4 July 2011)
Correction system
Why another postcorrection system?
Underlying language technology
Text and error profiles
Patterns, OCR errors
Historical variant and OCR error patterns
Historical variants: teil → theil
OCR error patterns: theil → iheil
Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’.
Absolute frequency: the pattern was found 120 times in the current document.
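The relation between the two figures can be sketched as follows. This is a minimal illustration, assuming that the relative frequency is the absolute pattern count divided by the number of occurrences of the pattern's left side; the function name and the toy word list are hypothetical and not taken from the slides.

```python
def relative_frequency(pattern_hits: int, left_side_occurrences: int) -> float:
    """Share of left-side occurrences that were rewritten by the pattern."""
    if left_side_occurrences == 0:
        return 0.0
    return pattern_hits / left_side_occurrences


# Hypothetical toy data, not taken from the slides.
modern_words = ["teil", "tat", "rot", "mut", "tor"]
# Every 't' in the modern forms is a place where t -> th could apply.
t_occurrences = sum(word.count("t") for word in modern_words)
# Assumed absolute frequency of t -> th in this toy document.
th_rewrites = 3

print(t_occurrences, relative_frequency(th_rewrites, t_occurrences))
```

With real data, the absolute frequency would be the 120 hits reported for the document, and the denominator would be the total number of ‘t’ positions in the underlying modern word forms.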
Occurrences of spelling variant “i->y”: +0.999771
Occurrences of OCR error “i->y”: +0.000224948
Computation of profile: initialization [Diagram: OCR result w0, w1, w2, w3, …; initial global profile]
Computation of profile: global to local [Diagram: OCR result w0, w1, w2, w3, …; initial global profile; local profile listing interpretations for each of w0 to w3]
Computation of profile: local to global [Diagram: OCR result w0, w1, w2, w3, …; local profile listing interpretations for each of w0 to w3; global profile]
Computation of profile: iteration [Diagram: OCR result w0, w1, w2, w3, …; local profile; global profile]
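The four steps above (initialization, global to local, local to global, iteration) can be read as an alternating re-estimation loop. The sketch below is only one plausible reading under stated assumptions, not the Profiler's actual algorithm: each token comes with a list of candidate interpretations, each candidate is a pair (historical patterns, OCR error patterns), a candidate's weight is the product of its pattern probabilities, and the global profile is the sum of fractional counts divided by the number of tokens. All names and the toy tokens are hypothetical.

```python
from collections import defaultdict
from math import prod


def local_profile(candidates, hist_p, ocr_p, floor=1e-6):
    """Global to local: weight each candidate interpretation of one token."""
    if not candidates:
        return []
    scores = [
        prod(hist_p.get(p, floor) for p in hist)
        * prod(ocr_p.get(p, floor) for p in ocr)
        for hist, ocr in candidates
    ]
    total = sum(scores)
    return [s / total for s in scores]


def global_profile(tokens, weights):
    """Local to global: sum fractional pattern counts over all tokens."""
    hist_counts, ocr_counts = defaultdict(float), defaultdict(float)
    for candidates, ws in zip(tokens, weights):
        for (hist, ocr), w in zip(candidates, ws):
            for p in hist:
                hist_counts[p] += w
            for p in ocr:
                ocr_counts[p] += w
    n = max(len(tokens), 1)
    return ({p: c / n for p, c in hist_counts.items()},
            {p: c / n for p, c in ocr_counts.items()})


def compute_profile(tokens, hist_p, ocr_p, iterations=5):
    """Iterate: local profiles from the global one, then a new global profile."""
    weights = []
    for _ in range(iterations):
        weights = [local_profile(c, hist_p, ocr_p) for c in tokens]
        hist_p, ocr_p = global_profile(tokens, weights)
    return hist_p, ocr_p, weights


# Hypothetical tokens: "theil" and "bey" are unambiguous historical variants,
# "seyn" could be a historical i->y variant or an OCR i->y error.
tokens = [
    [(("t->th",), ())],
    [(("i->y",), ()), ((), ("i->y",))],
    [(("i->y",), ())],
]
hist_p, ocr_p, weights = compute_profile(
    tokens, {"t->th": 0.1, "i->y": 0.1}, {"i->y": 0.1}
)
print(weights[1])  # weights of the two readings of the ambiguous token
```

In this toy run the unambiguous evidence for the historical pattern i->y pulls the ambiguous token toward the historical reading with each iteration, which mirrors the fractional occurrence values shown earlier for “i->y”.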
Evaluation: Measures
(1) Global profiles: percentage of matches for the first 10 patterns in the ranked output lists. Two values: historical patterns, OCR patterns.
(2) OCR error detection: precision and recall for the OCR errors detected by the Profiler.
(3) Indirect evaluation (for instance, by means of the postcorrection system).
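Measures (1) and (2) can be sketched in a few lines. The slides do not define the top-10 match precisely, so the function below assumes one plausible reading: the share of the first k reference patterns that also appear among the first k predicted patterns. Function names and sample data are hypothetical.

```python
def precision_recall(detected: set, actual: set) -> tuple:
    """Precision and recall of detected OCR errors against the true errors."""
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def top_k_match(ranked_predicted: list, ranked_reference: list, k: int = 10) -> float:
    """Share of the first k reference patterns found in the first k predictions."""
    reference = ranked_reference[:k]
    if not reference:
        return 0.0
    predicted = set(ranked_predicted[:k])
    return sum(p in predicted for p in reference) / len(reference)


# Hypothetical token positions flagged as OCR errors vs. the true error positions.
detected_errors = {1, 2, 3, 4}
actual_errors = {2, 3, 5, 6, 7, 8}
precision, recall = precision_recall(detected_errors, actual_errors)

# Hypothetical ranked pattern lists.
match = top_k_match(["t->th", "i->y", "u->v"], ["t->th", "u->v", "ei->ey"], k=3)
print(precision, recall, match)
```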
Evaluation: Data preparation
(1) Deep evaluation: for each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated.
    ++ fully accurate
    -- manual work
(2) Shallow evaluation: the OCR'ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretations are automatically assigned from the ground truth.
    ++ no manual work
    -- not completely accurate
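The automatic alignment used for the shallow evaluation can be illustrated with a token-level diff. This is a minimal sketch that swaps in Python's standard `difflib.SequenceMatcher` as the aligner; the slides do not say which alignment method was actually used, and the sample tokens are hypothetical.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_tokens, truth_tokens):
    """Pair OCR tokens with ground-truth tokens; None marks a missing partner."""
    pairs = []
    matcher = SequenceMatcher(a=ocr_tokens, b=truth_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # One-to-one pairing of the tokens in this block.
            pairs.extend(zip(ocr_tokens[i1:i2], truth_tokens[j1:j2]))
        else:
            # Insertions, deletions and uneven replacements stay unpaired.
            pairs.extend((tok, None) for tok in ocr_tokens[i1:i2])
            pairs.extend((None, tok) for tok in truth_tokens[j1:j2])
    return pairs


ocr = ["ein", "iheil", "der", "Welt"]
truth = ["ein", "theil", "der", "Welt"]
pairs = align_tokens(ocr, truth)
# Tokens that differ from their ground-truth partner are OCR error candidates.
ocr_errors = [(o, t) for o, t in pairs if o != t]
print(ocr_errors)
```

Uneven blocks are left unpaired here, which is one reason such an evaluation is not completely accurate: merged or split tokens need extra handling.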
Evaluation: Data
Deep: Eckartshausen (100 pages), Briefkunst (40 pages)
Shallow: 5 books each from the 16th, 17th and 18th century
Evaluation: Eckartshausen
Graphical Evaluation: Eckartshausen
Graphical Evaluation: diacritics [Figure panels: Hist. Var., OCR]
Shallow Evaluation Results

                                                16th            17th            18th
HIST Patterns, first 10                         60%             74%             78%
OCR Patterns, first 10                          48%             70%             50%
Error Detection, Precision                      95%             92%             81%
Error Detection, Recall                         49%             43%             45%
Content Words Errors                            64%             44%             16%
Easy Interactive Correction per 10,000 words    ≈ 3000 words    ≈ 1892 words    ≈ 720 words

Editor's notes

  1. DictModule name=“modern” File=“../dicts/modern.dic” max_ocr_errors=3 max_spelling_variants