eMOP Book History Tools
Book History and Software Tools: Examining Typefaces for OCR
Training in eMOP
Matt Christy,
Todd Samuelson,
Katayoun Torabi,
Bryan Tarpley,
Elizabeth Grumbach
eMOP Info
• eMOP Website: emop.tamu.edu/
• DH2014 Presentation: emop.tamu.edu/book-history-tools
• eMOP Workflows: emop.tamu.edu/workflows
• Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
More eMOP
• Facebook: Early Modern OCR Project
• Twitter: #emop, @IDHMC_Nexus, @matt_christy, @EMGrumbach
DH2014 - Book History and Software Tools: Examining Typefaces for OCR Training in eMOP
Early Modern OCR Project
• The Early Modern OCR Project (eMOP) is a grant project funded by the
Andrew W. Mellon Foundation and run by the Initiative for Digital
Humanities, Media, and Culture (IDHMC) at Texas A&M University. It
develops and tests tools and techniques for applying Optical Character
Recognition (OCR) to early modern English documents from the hand-press
period, roughly 1475-1800.
• eMOP aims to improve the visibility of early modern texts by
making their contents fully searchable. The current
paradigm of searching special collections for early modern
materials by either metadata alone or “dirty” OCR is
insufficient for scholarly research.
Specifically, eMOP's goal is to make machine readable, or to
improve the machine readability of, 305,000 documents
(45 million pages) of text from two major
proprietary databases: Eighteenth
Century Collections Online (ECCO)
and Early English Books Online (EEBO).
More generally, our aim is to use typeface
and book history techniques to train
modern OCR engines specifically on
the typefaces in our collection of
documents, and thereby improve the
accuracy of the OCR results.
Training Tesseract
Aletheia
• www.primaresearch.org/tools.php
• Available for free but requires registration.
• Created by PRImA Research Labs, University of Salford, UK.
• Windows-based tool.
• Developed as a ground-truth creation tool.
• Used by eMOP undergraduate student workers to create training
data for a desired typeface for Tesseract.
• Can identify glyphs on a page image and record their page
coordinates and Unicode values.
Aletheia: Workflow
• Binarization and Denoise are native Aletheia functions.
• A team of undergraduate student workers refines and
corrects glyph boxes and Unicode values where needed.
• Output: a set of PAGE XML files with page coordinates and
Unicode values for every identified glyph on each processed
TIFF image.
Aletheia: Glyph Recognition
Uses Tesseract to find glyphs
Aletheia: I/O
We then convert each PAGE XML
file to a Tesseract box file using
XSLT.
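The conversion step can be sketched in a few lines. The project did this with XSLT; the Python below is only an illustrative stand-in, not eMOP's stylesheet. It assumes the PAGE XML stores each glyph outline in a `points="x,y x,y ..."` attribute on `Coords` and the image height in `imageHeight` on `Page`. The names `local`, `page_xml_to_box`, and `SAMPLE` are hypothetical, and the sample document is trimmed rather than schema-valid.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Drop any '{namespace}' prefix so the PAGE schema version does not matter."""
    return tag.rsplit("}", 1)[-1]


def page_xml_to_box(xml_text, page_index=0):
    """Return one Tesseract box-file line per labeled Glyph in a PAGE XML string."""
    root = ET.fromstring(xml_text)
    page = next(el for el in root.iter() if local(el.tag) == "Page")
    height = int(page.get("imageHeight"))
    lines = []
    for glyph in (el for el in root.iter() if local(el.tag) == "Glyph"):
        coords = next(c for c in glyph if local(c.tag) == "Coords")
        # Assumption: outline given as a 'points' attribute, not Point children.
        pts = [tuple(int(v) for v in p.split(",")) for p in coords.get("points").split()]
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        text = next(
            (u.text for u in glyph.iter() if local(u.tag) == "Unicode" and u.text),
            None,
        )
        if text is None:
            continue  # skip glyphs that carry no Unicode value
        # PAGE measures y from the top edge; box files measure from the bottom.
        left, right = min(xs), max(xs)
        bottom, top = height - max(ys), height - min(ys)
        lines.append(f"{text} {left} {bottom} {right} {top} {page_index}")
    return lines


# Trimmed, illustrative PAGE document holding a single long-s glyph.
SAMPLE = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">'
    '<Page imageFilename="page.tif" imageWidth="100" imageHeight="200">'
    '<TextRegion id="r1"><TextLine id="l1"><Word id="w1">'
    '<Glyph id="g1"><Coords points="10,20 30,20 30,50 10,50"/>'
    '<TextEquiv><Unicode>&#x17F;</Unicode></TextEquiv></Glyph>'
    '</Word></TextLine></TextRegion></Page></PcGts>'
)

box_lines = page_xml_to_box(SAMPLE)
```

Each output line follows the box-file layout of symbol, left, bottom, right, top, and page number. The one step that is easy to get wrong is the vertical flip, since the two formats place the origin at opposite edges of the image.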
Tesseract Training
Franken+
1. Windows-based tool that uses a
MySQL database.
2. Developed for eMOP by IDHMC
graduate student worker Bryan
Tarpley.
3. Designed to be easily used by
eMOP undergraduate student
workers.
4. Takes Aletheia's output files as
input.
5. Outputs the same box files and TIFF
images that Tesseract's first stage
of native training produces.
• Available open-source at:
github.com/idhmc-tamu/FrankenPlus
Franken+ Workflow
1. Groups all glyphs with
the same Unicode
value into one window
for comparison.
2. Uses all selected glyphs
to create a Franken-page
image (TIFF) using
a selected text as a
base.
3. Outputs the same box
files and TIFF images
that Tesseract's first
stage of native training produces.
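The three workflow steps above can be sketched as plain data handling. This is not Franken+'s actual code: it leaves out the image compositing and only shows the grouping and the box-line bookkeeping. It assumes each exemplar is a dict with hypothetical keys `id`, `char`, `w`, and `h`, sets all glyphs on a single baseline (ignoring descenders), and cycles through exemplars in order.

```python
from collections import defaultdict
from itertools import cycle


def group_by_unicode(glyphs):
    """Step 1: collect every exemplar under its Unicode value."""
    groups = defaultdict(list)
    for glyph in glyphs:
        groups[glyph["char"]].append(glyph)
    return dict(groups)


def layout_franken_line(text, groups, page_height, x0=10, baseline=50, gap=2, space=8):
    """Steps 2 and 3: set a base text with exemplars and emit box-file lines.

    'baseline' is measured from the top of the page, as in image coordinates.
    Returns (exemplar_id, box_line) pairs; a real tool would also paste each
    exemplar's pixels into the TIFF at the same position.
    """
    pickers = {ch: cycle(exemplars) for ch, exemplars in groups.items()}
    x = x0
    placed = []
    for ch in text:
        if ch == " ":
            x += space
            continue
        if ch not in pickers:
            continue  # no exemplar available for this character
        exemplar = next(pickers[ch])
        left, right = x, x + exemplar["w"]
        # Box files measure y from the bottom edge of the image.
        bottom = page_height - baseline
        top = bottom + exemplar["h"]
        placed.append((exemplar["id"], f"{ch} {left} {bottom} {right} {top} 0"))
        x = right + gap
    return placed


glyphs = [
    {"id": "a1", "char": "a", "w": 10, "h": 12},
    {"id": "a2", "char": "a", "w": 11, "h": 12},
    {"id": "b1", "char": "b", "w": 9, "h": 20},
]
groups = group_by_unicode(glyphs)
placed = layout_franken_line("ab a", groups, page_height=100)
```

Cycling through exemplars means each variant of a letter appears on the synthetic page, so the training data reflects the variation in the real type rather than one idealized form.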
Franken+ Ingestion
Franken+
• All exemplars of the same glyph are displayed together.
• Users can quickly identify and deselect:
  • Incorrectly labeled glyphs
  • Incomplete glyphs
  • Unrepresentative exemplars
  • Different-sized glyphs
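In Franken+ this review is done by eye. As a hedged illustration of the last criterion only, the sketch below flags exemplars whose width or height differs from the group median by more than a chosen tolerance. The function name `flag_size_outliers`, the dict keys, and the 25% tolerance are all assumptions made for this example, not part of the tool.

```python
from statistics import median


def flag_size_outliers(exemplars, tolerance=0.25):
    """Return ids of exemplars whose width or height strays from the group median."""
    med_w = median(e["w"] for e in exemplars)
    med_h = median(e["h"] for e in exemplars)
    flagged = []
    for e in exemplars:
        too_wide_or_narrow = abs(e["w"] - med_w) > tolerance * med_w
        too_tall_or_short = abs(e["h"] - med_h) > tolerance * med_h
        if too_wide_or_narrow or too_tall_or_short:
            flagged.append(e["id"])
    return flagged


# Four exemplars of one glyph; "d" is about twice as wide as the rest.
exemplars = [
    {"id": "a", "w": 10, "h": 12},
    {"id": "b", "w": 11, "h": 12},
    {"id": "c", "w": 10, "h": 13},
    {"id": "d", "w": 20, "h": 12},
]
suspects = flag_size_outliers(exemplars)
```

A check like this could only pre-sort candidates for a human reviewer. Mislabeled or unrepresentative glyphs still need visual inspection.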
Franken+
Training Tesseract
Thiſ great conſumption to a fever turn'd,
And ſo the oꝗld had fitſ; it joy'd, it mourn'd;
And, aſ men thinke, that Agueſ phy ck are,
And th'Ague being ſpent, give over care.
Žo thou cke World, mꝗſtak'ſt thy ſelże to bee
Well, when ãlaſ, thou'rt in a Lethargie.
Her death did wound and tame thee than, and than
Thou might'ſt ha e better ſpar'd the Sunne, or man.
That wound waſ deep, but 'tiſ more miżery,
That thou haſt loſt thy ſenſe and memor .
'Twaſ heavy then to heare thy voyce of mone,
But thiſ iſ worſe, that thou art ſpeechle e growne.
Thou haſt forgot thy name thou hadſt; thou waſt
Nothing but ee, and her thou haſt o'rpaſt.
For aſ a child kept from the Fount, untill
Ä prince, expe ed long, come to fulfill
The ceremonieſ, thou unnam'd had'ſt laid,
Had not her comming, thee her palace made:
Her name defin'd thee, gave thee forme, and frame,
And thou forgett'ſt to celebrate th n me.
Some monethſ e hath beene dead (but beìng dead,
Meaſureſ of timeſ are all determined)
But long e'ath beene away, long, long, et none
Offerſ to tell uſ who it iſ that'ſ gone.
But aſ in ſtateſ doubtfull of future heireſ,
When ckne e without remedie empaireſ
The preſent Prince, they're loth it ould be ſaid,
The Prince doth langui , or the Prince iſ dead:
So mankinde feeling no a generall tha ,
Franken+ Results
AFTER
BEFORE
eMOP Tesseract Training
S-face / Y-face
Weiss, Adrian. “Font Analysis as a Bibliographical Method: the
Elizabethan Play-Quarto Printers and Compositors.” Studies
in Bibliography 43 (1990): 95-164.
• Weiss organized late 16th- and early 17th-century
typefaces into these two general types (named for the
first works in which they were identified):
  • Y-face, from an edition of The Malcontent
  • S-face, from Ben Jonson's Sejanus
S-face / Y-face
Other Applications
• A close examination of the typefaces used by a printer
• An investigation of the typefaces used in a work or in the
same editions of a work
• A reexamination of typefaces classified via a system
(Proctor-Haebler)
The end
For eMOP questions, please
contact us at:
mchristy@tamu.edu
egrumbac@tamu.edu

  • 1. eMOP Book History Tools Book History and Software Tools: Examining Typefaces for OCR Training in eMOP Matt Christy, Todd Samuelson, Katayoun Torabi, Bryan Tarpley, Elizabeth Grumbach
  • 2.  emop.tamu.edu/  Dh2014 Presentation  emop.tamu.edu/book- history-tools  eMOP Workflows  emop.tamu.edu/workflows  Mellon Grant Proposal  idhmc.tamu.edu/projects/ Mellon/eMOPPublic.pdf eMOP Info eMOP Website More eMOP  Facebook  Early Modern OCR Project  Twitter  #emop  @IDHMC_Nexus  @matt_christy  @EMGrumbach DH2014 - Book History and Software Tools: Examining Typefaces for OCR Training in eMOP 2
  • 3. Early Modern OCR Project. The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation-funded grant project, run out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques for applying Optical Character Recognition (OCR) to early modern English documents from the hand press period, roughly 1475-1800. eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research. Specifically, eMOP’s goal is to make machine readable, or improve the machine readability of, 305,000 documents (45 million pages of text) from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, our aim is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results.
  • 4. Training Tesseract
  • 5. Aletheia. www.primaresearch.org/tools.php. Available for free but requires registration. Created by PRImA Research Labs, University of Salford, UK. Windows-based tool. Developed as a groundtruth creation tool. Used by eMOP undergraduate student workers to create training data for a desired typeface for Tesseract. Can identify glyphs on a page image with page coordinates and Unicode values.
  • 6. Aletheia: Workflow. Binarization and Denoise are native Aletheia functions. A team of undergraduate student workers refines and corrects glyph boxes and Unicode values where needed. Output: a set of PAGE XML files with page coordinates and Unicode values for every identified glyph on each processed TIFF image.
  • 7. Aletheia: Glyph Recognition. Uses Tesseract to find glyphs.
  • 8. Aletheia: I/O. We then convert each PAGE XML file to a Tesseract box file using XSLT.
  • 9. Tesseract Training
  • 10. Franken+. 1. Windows-based tool that uses a MySQL DB. 2. Developed for eMOP by IDHMC graduate student worker Bryan Tarpley. 3. Designed to be easily used by eMOP undergraduate student workers. 4. Takes Aletheia's output files as input. 5. Outputs the same box files and TIFF images as those produced in Tesseract's first stage of native training. Available open-source at: github.com/idhmc-tamu/FrankenPlus
  • 11. Franken+ Workflow. 1. Groups all glyphs with the same Unicode value into one window for comparison. 2. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base. 3. Outputs the same box files and TIFF images as those produced in Tesseract's first stage of native training.
  • 12. Franken+ Ingestion
  • 13. Franken+. All exemplars of the same glyph are displayed together. Users can quickly identify and deselect: incorrectly labeled glyphs; incomplete glyphs; unrepresentative exemplars; different-sized glyphs.
  • 14. Franken+
  • 15. Training Tesseract. Thiſ great conſumption to a fever turn'd, And ſo the oꝗld had fitſ; it joy'd, it mourn'd; And, aſ men thinke, that Agueſ phy ck are, And th'Ague being ſpent, give over care. Žo thou cke World, mꝗſtak'ſt thy ſelże to bee Well, when ãlaſ, thou'rt in a Lethargie. Her death did wound and tame thee than, and than Thou might'ſt ha e better ſpar'd the Sunne, or man. That wound waſ deep, but 'tiſ more miżery, That thou haſt loſt thy ſenſe and memor . 'Twaſ heavy then to heare thy voyce of mone, But thiſ iſ worſe, that thou art ſpeechle e growne. Thou haſt forgot thy name thou hadſt; thou waſt Nothing but ee, and her thou haſt o'rpaſt. For aſ a child kept from the Fount, untill Ä prince, expe ed long, come to fulfill The ceremonieſ, thou unnam'd had'ſt laid, Had not her comming, thee her palace made: Her name defin'd thee, gave thee forme, and frame, And thou forgett'ſt to celebrate th n me. Some monethſ e hath beene dead (but beìng dead, Meaſureſ of timeſ are all determined) But long e'ath beene away, long, long, et none Offerſ to tell uſ who it iſ that'ſ gone. But aſ in ſtateſ doubtfull of future heireſ, When ckne e without remedie empaireſ The preſent Prince, they're loth it ould be ſaid, The Prince doth langui , or the Prince iſ dead: So mankinde feeling no a generall tha ,
  • 16. Franken+ Results (image panels: AFTER, BEFORE)
  • 17. eMOP Tesseract Training
  • 18. S-face / Y-face. Weiss, Adrian. “Font Analysis as a Bibliographical Method: the Elizabethan Play-Quarto Printers and Compositors.” Studies in Bibliography 43 (1990): 95-164. Weiss organized late 16th- and early 17th-century typefaces into these two general types (named for the first works in which they were identified): Y-face, from an edition of The Malcontent; S-face, from Ben Jonson's Sejanus.
  • 19. S-face / Y-face
  • 20. Other Applications. A close examination of the typefaces used by a printer. An investigation of the typefaces used in a work or in the same editions of a work. A reexamination of typefaces classified via a system (Proctor-Haebler).
  • 21. The end. For eMOP questions please contact us at: mchristy@tamu.edu, egrumbac@tamu.edu
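
The conversion on slide 8 (PAGE XML to Tesseract box file) was done with XSLT; the stylesheet itself is not part of this transcript. As a rough illustration of what such a conversion involves, here is a minimal Python sketch, not eMOP's actual code. It assumes each `Glyph` element carries a `Coords` element (either a `points` attribute or `Point` children) and a `TextEquiv/Unicode` value, and that the `Page` element records `imageHeight`. The function names and the `SAMPLE` document are hypothetical. PAGE coordinates start at the top-left of the image while box files measure from the bottom-left, so the y values are flipped.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def glyph_points(coords):
    """Return [(x, y), ...] from a Coords element in either notation."""
    if coords.get("points"):
        return [tuple(int(v) for v in pair.split(","))
                for pair in coords.get("points").split()]
    return [(int(p.get("x")), int(p.get("y")))
            for p in coords if local(p.tag) == "Point"]


def page_to_box(xml_text, page_index=0):
    """Build box-file lines: '<char> <left> <bottom> <right> <top> <page>'."""
    root = ET.fromstring(xml_text)
    page = next(el for el in root.iter() if local(el.tag) == "Page")
    height = int(page.get("imageHeight"))
    lines = []
    for glyph in (el for el in root.iter() if local(el.tag) == "Glyph"):
        coords = next(c for c in glyph if local(c.tag) == "Coords")
        char = next((u.text for u in glyph.iter()
                     if local(u.tag) == "Unicode"), None)
        if not char:
            continue  # skip glyphs with no transcription
        pts = glyph_points(coords)
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        # Flip the y axis: box files count upward from the bottom edge.
        lines.append(f"{char} {min(xs)} {height - max(ys)} "
                     f"{max(xs)} {height - min(ys)} {page_index}")
    return lines


# Hypothetical one-glyph document, for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="p.tif" imageWidth="100" imageHeight="200">
    <TextRegion id="r1"><TextLine id="l1"><Word id="w1">
      <Glyph id="g1">
        <Coords points="10,20 30,20 30,50 10,50"/>
        <TextEquiv><Unicode>ſ</Unicode></TextEquiv>
      </Glyph>
    </Word></TextLine></TextRegion>
  </Page>
</PcGts>"""

print("\n".join(page_to_box(SAMPLE)))
```

Writing the returned lines to a `.box` file next to the matching TIFF gives the pair of files that Tesseract's training tools expect as input.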

Editor's Notes

  1. Aletheia: Created by PRImA Research Labs at the University of Salford, as a groundtruth creation tool. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
  2. This is cheating: the result of scanning the same page we used to create the training.
  3. So we think that Franken+ can be a really useful tool for the close examination of typefaces and book history, and opens up some of the admittedly tedious work to non-experts.
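
The grouping step that Franken+ performs (all exemplars of a glyph shown together so that poor ones can be deselected) can be illustrated with a small Python sketch. This is not Franken+'s code, which is a Windows tool backed by MySQL. The glyph records, field names, and the size heuristic are all hypothetical. In Franken+ the deselection is done by a person looking at the exemplars, so the automatic flagging below is only a stand-in for that judgment.

```python
from collections import defaultdict
from statistics import median


def group_by_unicode(glyphs):
    """Map each Unicode value to the list of exemplars labelled with it."""
    groups = defaultdict(list)
    for g in glyphs:
        groups[g["char"]].append(g)
    return dict(groups)


def size_outliers(exemplars, tolerance=0.25):
    """Return exemplars whose height is far from the group's median height.

    'Far' means differing by more than `tolerance` times the median;
    this threshold is an assumption, not a Franken+ setting.
    """
    mid = median(g["height"] for g in exemplars)
    return [g for g in exemplars if abs(g["height"] - mid) > tolerance * mid]


# Hypothetical glyph records, as might be read from Aletheia's output.
GLYPHS = [
    {"id": "g1", "char": "ſ", "height": 40},
    {"id": "g2", "char": "ſ", "height": 42},
    {"id": "g3", "char": "ſ", "height": 70},
    {"id": "g4", "char": "e", "height": 22},
]

groups = group_by_unicode(GLYPHS)
flagged = [g["id"] for g in size_outliers(groups["ſ"])]
print(flagged)
```

A reviewer would still inspect the flagged exemplars, since an unusual size may reflect a different typeface on the page as easily as a bad glyph box.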