eMOP Post-OCR Triage
Diagnosing Page Image Problems with Post-OCR Triage for eMOP
Matthew Christy, Loretta Auvil, Dr. Ricardo Gutierrez-Osuna, Boris Capitanu, Anshul Gupta, Elizabeth Grumbach
eMOP Info
eMOP Website
• emop.tamu.edu/
• DH2014 Presentation: emop.tamu.edu/post-processing
• eMOP Workflows: emop.tamu.edu/workflows
• Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
More eMOP
• Facebook: The Early Modern OCR Project
• Twitter: #emop, @IDHMC_Nexus, @matt_christy, @EMGrumbach
DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP
The Numbers
Page Images
• Early English Books Online (ProQuest) EEBO: ~125,000 documents, ~13 million page images (1475-1700)
• Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800)
• Total: >300,000 documents & 45 million page images.
Ground Truth
• Text Creation Partnership TCP: ~46,000 double-keyed, hand-transcribed documents
  • 44,000 EEBO
  • 2,200 ECCO
DH2014 - Book History and Software Tools: Examining Typefaces for OCR Training in eMOP
Page Images
The Constraints
• 45 million page images!
• Only 2 years
• Small IDHMC team focused on gathering data and training Tesseract for early modern typefaces
• Great team of collaborators focusing on post-processing
  • Software Environment for the Advancement of Scholarly Research (SEASR), University of Illinois, Urbana-Champaign
  • Perception, Sensing, and Instrumentation (PSI) Lab, Texas A&M University
• Everything must be open-source
Solution
• Focus our efforts on post-processing triage and recovery
• Triage system will score page results and route pages to be corrected or analyzed for problems
• Results:
  1. Good quality, corrected OCR output
  2. A DB of tagged pages indicating pre-processing needs
Post-Processing Triage
Post-Processing
Triage
Treatment Diagnosis
Triage: De-noising
• Uses hOCR results
1. Determine average word bounding box size
2. Weed out boxes that are too big or too small
3. But keep small boxes that have neighbors that are “words”
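The three de-noising steps above can be sketched as follows. This is a minimal illustration, not the eMOP implementation: the `Box` type, the size thresholds (`min_ratio`, `max_ratio`), the pixel gap (`max_gap`), and the reading of “words” as boxes of normal size are all assumptions. In practice the boxes would be read from the `bbox` values in the hOCR output.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Box:
    """A word bounding box (hypothetical type; coordinates in pixels)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def area(self) -> int:
        return max(0, self.x1 - self.x0) * max(0, self.y1 - self.y0)


def is_neighbor(a: Box, b: Box, max_gap: int) -> bool:
    """True if the boxes overlap vertically and lie within max_gap pixels horizontally."""
    vertical_overlap = min(a.y1, b.y1) - max(a.y0, b.y0) > 0
    horizontal_gap = max(a.x0, b.x0) - min(a.x1, b.x1)
    return vertical_overlap and horizontal_gap <= max_gap


def denoise(boxes, min_ratio=0.25, max_ratio=4.0, max_gap=30):
    """Drop boxes far from the average size, but keep small boxes next to word-sized ones."""
    if not boxes:
        return []
    # Step 1: average word bounding box size (area is an assumed measure of "size").
    avg_area = mean(b.area for b in boxes)
    low, high = min_ratio * avg_area, max_ratio * avg_area
    # Boxes of normal size are treated as "words" (assumption).
    words = [b for b in boxes if low <= b.area <= high]
    kept = []
    for b in boxes:
        if low <= b.area <= high:
            kept.append(b)  # Step 2: normal-sized boxes survive
        elif b.area < low and any(is_neighbor(b, w, max_gap) for w in words):
            kept.append(b)  # Step 3: small box with a word-sized neighbor
    return kept
```

Because a plain mean is pulled upward by very large boxes, a median may be a more robust choice on pages with large blotches; the slides say only "average".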
Triage: De-noising
Before: 35% After: 58%
Triage: De-noising
Before After
Triage: Estimated Correctability
Page Evaluation
• Determine how correctable a page’s OCR results are by examining the text.
• The score is based on the ratio of words that fit the correctable profile to the total number of words
Correctable Profile
1. Clean tokens:
  • remove leading and trailing punctuation
  • remaining token must have at least 3 letters
2. Spell check tokens >1 character
3. Check token profile:
  • contain at most 2 non-alpha characters, and
  • at least 1 alpha character,
  • have a length of at least 3,
  • and do not contain 4 or more repeated characters in a run
4. Also consider length of tokens compared to average for the page
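A rough sketch of the score, under stated assumptions: the slides do not say how the spell check and the token profile combine, so here a token counts as correctable if it is longer than one character and found in a dictionary, or if it has at least 3 letters and fits the profile. The `dictionary` argument is a hypothetical stand-in for a real spell checker, and step 4 (token length relative to the page average) is omitted.

```python
import re
import string

# Matches the same character repeated 4 or more times in a run.
REPEAT_RUN = re.compile(r"(.)\1{3,}")


def clean(token: str) -> str:
    """Step 1: remove leading and trailing punctuation."""
    return token.strip(string.punctuation)


def fits_profile(token: str) -> bool:
    """Step 3: at most 2 non-alpha characters, at least 1 alpha, length >= 3, no long runs."""
    alpha = sum(ch.isalpha() for ch in token)
    non_alpha = len(token) - alpha
    return (
        len(token) >= 3
        and alpha >= 1
        and non_alpha <= 2
        and REPEAT_RUN.search(token) is None
    )


def correctability(tokens, dictionary) -> float:
    """Ratio of tokens that look correctable to the total number of tokens."""
    if not tokens:
        return 0.0
    correctable = 0
    for raw in tokens:
        token = clean(raw)
        letters = sum(ch.isalpha() for ch in token)
        if len(token) > 1 and token.lower() in dictionary:
            correctable += 1  # Step 2: passes the (stand-in) spell check
        elif letters >= 3 and fits_profile(token):
            correctable += 1  # Steps 1 and 3: enough letters and a plausible shape
    return correctable / len(tokens)
```

For the tokens `["tbat", "I", "thoughc,", "of", "~~~~", "aaaaab", "x9#@!k", "the."]` and the dictionary `{"of", "the", "that"}`, four of the eight tokens qualify (`tbat`, `thoughc`, `of`, `the`), giving a score of 0.5.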
Triage: Estimated Correctability
Treatment: Page Correction
1. Preliminary cleanup
  • remove punctuation from beginning/end of tokens
  • remove empty lines and empty tokens
  • combine hyphenated tokens that appear at the end of a line
  • retain cleaned & original tokens as “suggestions”
2. Apply common transformations and period-specific dictionary lookups to gather suggestions for words.
  • transformation rules: rn->m; c->e; 1->l; e
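Steps 1 and 2 might look like the sketch below. It is illustrative only: `RULES` holds just the three rules legible on the slide (the list is cut off there), `dictionary` is a hypothetical stand-in for the period-specific dictionary, and applying one rule at one position at a time is an assumption about how the transformations are used.

```python
import string

# Only the transformation rules that are legible on the slide.
RULES = [("rn", "m"), ("c", "e"), ("1", "l")]


def tokenize(lines):
    """Split lines into tokens, dropping empty lines and joining end-of-line hyphenations."""
    tokens = []
    carry = ""  # first half of a word hyphenated at the end of the previous line
    for line in lines:
        words = line.split()
        if not words:
            continue  # empty line
        if carry:
            words[0] = carry + words[0]
            carry = ""
        if words[-1].endswith("-") and len(words[-1]) > 1:
            carry = words.pop()[:-1]
        tokens.extend(words)
    if carry:
        tokens.append(carry + "-")  # nothing followed, so keep the token as it was
    return tokens


def variants(token):
    """Apply each rule at each single position where it matches."""
    found = set()
    for old, new in RULES:
        start = token.find(old)
        while start != -1:
            found.add(token[:start] + new + token[start + len(old):])
            start = token.find(old, start + 1)
    return found


def suggestions(token, dictionary):
    """Original and cleaned tokens, plus any transformed variant found in the dictionary."""
    cleaned = token.strip(string.punctuation)
    found = {token, cleaned}
    found |= {v for v in variants(cleaned) if v.lower() in dictionary}
    found.discard("")
    return found
```

With the dictionary `{"man", "like", "kine"}`, `suggestions("rnan", ...)` yields `{"rnan", "man"}` and `suggestions("kinc.", ...)` yields `{"kinc.", "kinc", "kine"}`.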
Treatment: Page Correction
3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our (sanitized, period-specific) Google 3-gram dataset
  • if no context is found and only one additional suggestion was made from transformation or dictionary, then replace with this suggestion
  • if no context is found and the “clean” token from above is in the dictionary, replace with this token
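The context check and its two fallback rules can be sketched as below. The names are hypothetical: `trigram_counts` is a plain dict standing in for the sanitized Google 3-gram dataset, and checking the two fallback rules in the order listed is an assumption.

```python
from itertools import product


def windows(items, size=3):
    """All consecutive runs of `size` items (the sliding window)."""
    return [items[i:i + size] for i in range(len(items) - size + 1)]


def context_matches(window_candidates, trigram_counts):
    """Every candidate combination found in the trigram table, highest count first."""
    hits = {}
    for combo in product(*(sorted(c) for c in window_candidates)):
        key = tuple(word.lower() for word in combo)
        if key in trigram_counts:
            hits[key] = trigram_counts[key]
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)


def fallback(original, cleaned, extra_suggestions, dictionary):
    """Pick a replacement when no context match was found for a token."""
    if len(extra_suggestions) == 1:
        # Exactly one suggestion came from the transformations or dictionary.
        return next(iter(extra_suggestions))
    if cleaned.lower() in dictionary:
        # The cleaned token is itself a dictionary word.
        return cleaned
    return original
```

The number of combinations checked per window is the product of the candidate-set sizes, so large candidate sets make each lookup more expensive.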
Treatment: Page Correction
window: tbat l thoughc
Candidates used for context matching:
• tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)
• l -> Set(l)
• thoughc -> Set(thoughc, thought, though)
ContextMatch: that l thought (matchCount: 1844, volCount: 1474)
window: l thoughc Ihe
Candidates used for context matching:
• l -> Set(l)
• thoughc -> Set(thoughc, thought, though)
• Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
ContextMatch: l though the (matchCount: 497, volCount: 486)
ContextMatch: l thought she (matchCount: 1538, volCount: 997)
ContextMatch: l thought the (matchCount: 2496, volCount: 1905)
tbat I thoughc Ihe Was
Treatment: Page Correction
window: thoughc Ihe Was
Candidates used for context matching:
• thoughc -> Set(thoughc, thought, though)
• Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
• Was -> Set(Was)
ContextMatch: though ice was (matchCount: 121, volCount: 120)
ContextMatch: though ike was (matchCount: 65, volCount: 59)
ContextMatch: though she was (matchCount: 556,763, volCount: 364,965)
ContextMatch: though the was (matchCount: 197, volCount: 196)
ContextMatch: thought ice was (matchCount: 45, volCount: 45)
ContextMatch: thought ike was (matchCount: 112, volCount: 108)
ContextMatch: thought she was (matchCount: 549,531, volCount: 325,822)
ContextMatch: thought the was (matchCount: 91, volCount: 91)
that I thought she was
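The slides do not say how the overlapping windows are reconciled. In the last window "though she was" has the highest matchCount, yet the final text reads "thought". One rule that is consistent with this example is sketched below; it is an assumption, not the documented eMOP method. Each window votes for the words of its best match, and ties are broken by matchCount. The `MATCHES` data is copied from the three windows above, and `combine` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# (window start index, matched trigram, matchCount), copied from the slides.
MATCHES = [
    (0, ("that", "l", "thought"), 1844),
    (1, ("l", "though", "the"), 497),
    (1, ("l", "thought", "she"), 1538),
    (1, ("l", "thought", "the"), 2496),
    (2, ("though", "ice", "was"), 121),
    (2, ("though", "ike", "was"), 65),
    (2, ("though", "she", "was"), 556763),
    (2, ("though", "the", "was"), 197),
    (2, ("thought", "ice", "was"), 45),
    (2, ("thought", "ike", "was"), 112),
    (2, ("thought", "she", "was"), 549531),
    (2, ("thought", "the", "was"), 91),
]


def combine(tokens, matches):
    """Majority vote over each window's best match; ties go to the higher matchCount."""
    best = {}  # window start -> (trigram, count) with the highest count
    for start, trigram, count in matches:
        if start not in best or count > best[start][1]:
            best[start] = (trigram, count)

    votes = defaultdict(Counter)     # token position -> word -> number of windows voting
    strength = defaultdict(Counter)  # token position -> word -> best supporting count
    for start, (trigram, count) in best.items():
        for offset, word in enumerate(trigram):
            pos = start + offset
            votes[pos][word] += 1
            strength[pos][word] = max(strength[pos][word], count)

    result = []
    for pos, token in enumerate(tokens):
        if pos in votes:
            token = max(votes[pos], key=lambda w: (votes[pos][w], strength[pos][w]))
        result.append(token)
    return result
```

Called with `["tbat", "l", "thoughc", "Ihe", "Was"]`, this returns `["that", "l", "thought", "she", "was"]`: "thought" wins two windows to one, and "she" beats "the" on matchCount. The `l` token is unchanged because its only candidate is itself.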
Treatment: Results
Treatment: Results
Diagnosis: Page Tagging
• Tags pages with problems that prevent good OCR results
• Can be used to apply appropriate pre-processing and re-OCRing
• Eventually, will end up with a list of pages that simply need to be re-digitized
• This will be the first time any comprehensive analysis has been done on these page images.
• Users tag sample pages in a desktop version of Picasa
• Machine learning algorithms use those tags to learn how to recognize skew, warp, noise, etc.
• Have developed algorithms to:
  • measure skew
  • measure noise
Further/Current Work
• Identifying multiple pages/columns in an image
• Predicting Juxta scores for documents without corresponding ground truth
• Identifying warp
• Identifying and fixing incorrect word order in hOCR output
  • can occur on pages with skew, vertical lines, decorative drop-caps, etc.
  • will affect scoring and context-based corrections
Develop measure of noisiness
Develop measure of skew-ness
The end
For eMOP questions, please contact us at:
mchristy@tamu.edu
egrumbac@tamu.edu

mchristy-Dh2014- emop-postOCR-triage

  • 1. eMOP Post-OCR Triage Diagnosing Page Image Problems with Post-OCR Triage for eMOP Matthew Christy, Loretta Auvil, Dr. Ricardo Gutierrez- Osuna, Boris Capitanu, Anshul Gupta, Elizabeth Grumbach
  • 2.  emop.tamu.edu/  DH2014 Presentation  emop.tamu.edu/post- processing  eMOP Workflows  emop.tamu.edu/workflow s  Mellon Grant Proposal  idhmc.tamu.edu/projects /Mellon/eMOPPublic.pdf eMOP Info eMOP Website More eMOP  Facebook  The Early Modern OCR Project  Twitter  #emop  @IDHMC_Nexus  @matt_christy  @EMGrumbach DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 2
  • 3. The Numbers Page Images  Early English Books online (Proquest) EEBO: ~125,000 documents, ~13 million pages images (1475-1700)  Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800)  Total: >300,000 documents & 45 million page images. Ground Truth  Text Creation Partnership TCP: ~46,000 double-keyed hand transcribed docuemnts  44,000 EEBO  2,200 ECCO DH2014 - Book History and Software Tools: Examining Typefaces for OCR Training in eMOP 3
  • 4. Page Images DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 4
  • 5. The Constraints  45 million page images!  Only 2 years  Small IDHMC team focused on gather data and training Tesseract for early modern typefaces  Great team of collaborators focusing on post-processing  Software Environment for the Advancement of Scholarly Research (SEASR) – University of Illinois, Urbana-Champaign  Perception, Sensing, and Instrumentation (PSI) Lab, Texas A&M University  Everything must be open- source  Focus our efforts on post- processing triage and recovery  Triage system will score page results and route pages to be corrected or analyzed for problems  Results: 1. Good quality, corrected OCR output 2. A DB of tagged pages indicating pre-processing needs DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 5 Solution
  • 6. Post-ProcessingTriage DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 6
  • 7. Post-Processing DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 7 Triage Treatment Diagnosis
  • 8. Triage:De-noising DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 8  Uses hOCR results 1. Determine average word bounding box size 2. Weed out boxes are that too big or too small 3. But keep small boxes that have neighbors that are “words”
  • 9. Triage: De-noising DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 9 Before: 35% After: 58%
  • 10. Triage: De-noising DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 10 Before After
  • 11. Triage: Estimated Correctability DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 11 Page Evaluation  Determine how correctable a page’s OCR results are by examining the text.  The score is based on the ratio of words that fit the correctable profile to the total number of words Correctable Profile 1. Clean tokens:  remove leading and trailing punctuation  remaining token must have at least 3 letters 2. Spell check tokens >1 character 3. Check token profile :  contain at most 2 non-alpha characters, and  at least 1 alpha character,  have a length of at least 3,  and do not contain 4 or more repeated characters in a run 4. Also consider length of tokens compared to average for the page
  • 12. DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 12Triage: Estimated Correctability
  • 13. Treatment: Page Correction DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 13 1. Preliminary cleanup  remove punctuation from begin/end of tokens  remove empty lines and empty tokens  combine hyphenated tokens that appear at the end of a line  retain cleaned & original tokens as “suggestions” 2. Apply common transformations and period specific dictionary lookups to gather suggestions for words.  transformation rules: rn->m; c->e; 1->l; e
  • 14. DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 14Treatment: Page Correction 3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our(sanitized, period-specific) Google 3- gram dataset  if no context is found and only one additional suggestion was made from transformation or dictionary, then replace with this suggestion  if no context and “clean” token from above is in the dictionary, replace with this token
  • 15. DH2014 - Diagnosing Page Image Problems with Post-OCR Triage for eMOP 15Treatment: Page Correction window: tbat l thoughc Candidates used for context matching:  tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)  l -> Set(l)  thoughc -> Set(thoughc, thought, though) ContextMatch: that l thought (matchCount: 1844 , volCount: 1474) window: l thoughc Ihe Candidates used for context matching:  l -> Set(l)  thoughc -> Set(thoughc, thought, though)  Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the) ContextMatch: l though the (matchCount: 497 , volCount: 486) ContextMatch: l thought she (matchCount: 1538 , volCount: 997) ContextMatch: l thought the (matchCount: 2496 , volCount: 1905) tbat I thoughc Ihe Was
  • 16. Treatment: Page Correction. window: thoughc Ihe Was. Candidates used for context matching: thoughc -> Set(thoughc, thought, though); Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the); Was -> Set(Was). ContextMatch: though ice was (matchCount: 121, volCount: 120). ContextMatch: though ike was (matchCount: 65, volCount: 59). ContextMatch: though she was (matchCount: 556,763, volCount: 364,965). ContextMatch: though the was (matchCount: 197, volCount: 196). ContextMatch: thought ice was (matchCount: 45, volCount: 45). ContextMatch: thought ike was (matchCount: 112, volCount: 108). ContextMatch: thought she was (matchCount: 549,531, volCount: 325,822). ContextMatch: thought the was (matchCount: 91, volCount: 91). Corrected text: that I thought she was
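The per-window lookup in this worked example can be sketched as follows. This is a hedged illustration, not the project's code: `context_matches` and `fallback` are hypothetical names, the 3-gram dataset is modeled as a plain dict from lowercase word triples to match counts, and lookups are assumed to be case-insensitive. The slides do not spell out how matches from overlapping windows are combined into the final correction, so that step is deliberately left out.

```python
from itertools import product


def context_matches(candidates, trigram_counts):
    """Try every combination of candidates for one 3-word window and
    return the combinations found in the 3-gram table, best count first."""
    found = {}
    for combo in product(*candidates):
        key = tuple(word.lower() for word in combo)
        if key in trigram_counts:
            found[key] = trigram_counts[key]
    return sorted(found.items(), key=lambda item: item[1], reverse=True)


def fallback(original, cleaned, extra_suggestions, dictionary):
    """Rules for a token with no context match, in the order listed on
    the slide; keeping the original otherwise is an assumption."""
    if len(extra_suggestions) == 1:
        return next(iter(extra_suggestions))
    if cleaned.lower() in dictionary:
        return cleaned
    return original


# Counts taken from the "l thoughc Ihe" window above (matchCount only);
# the Ihe candidate list is shortened for readability.
counts = {
    ("l", "though", "the"): 497,
    ("l", "thought", "she"): 1538,
    ("l", "thought", "the"): 2496,
}
window = [["l"], ["thoughc", "thought", "though"], ["ihe", "she", "the"]]
ranked = context_matches(window, counts)
```

Ranking by `matchCount` alone is a simplification; the slide output also reports a `volCount` for each match, which a fuller version could use as a tie-breaker or a second signal.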
  • 17. Treatment: Results
  • 18. Treatment: Results
  • 19. Diagnosis: Page Tagging. Tags pages with problems that prevent good OCR results. Can be used to apply appropriate pre-processing and re-OCRing. Eventually, will end up with a list of pages that simply need to be re-digitized. This will be the first time any comprehensive analysis has been done on these page images. Users tag sample pages in a desktop version of Picasa. Machine learning algorithms use those tags to learn how to recognize skew, warp, noise, etc. Have developed algorithms to: measure skew; measure noise.
  • 20. Further/Current Work: Identifying multiple pages/columns in an image. Predicting Juxta scores for documents without corresponding ground truth. Identifying warp. Identifying and fixing incorrect word order in hOCR output (can occur on pages with skew, vertical lines, decorative drop-caps, etc.; will affect scoring and context-based corrections). Develop measure of noisiness. Develop measure of skew-ness.
  • 21. The end. For eMOP questions please contact us at: mchristy@tamu.edu, egrumbac@tamu.edu

Editor's Notes

  1. The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project, run out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques for applying Optical Character Recognition (OCR) to early modern English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability of, 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). More generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
  2. Some were great; most were not: noisy, skewed, warped. Or they posed challenges for OCR engines: multiple pages per image, multiple columns, images & decorative elements, marginalia, missing margins. Many were terrible.
  3. CONSTRAINTS: We knew there were plenty of pre-processing algorithms to solve many of these problems, but given these constraints we felt we couldn’t conceivably pre-process all pages with all algorithms. SOLUTION: By making our triage system more robust we could attempt to correct as much as possible, but also identify page image problems and tag each page in the DB so that we’d know what pre-processing should be applied in order to get better results when re-OCRing later.
  4. Before: 55% After: 73%
  5. This will be the first time that any sort of comprehensive analysis has been done on the page images of these collections.