SlideShare uma empresa Scribd logo
1 de 51
Baixar para ler offline
Beatrice Alex
balex@inf.ed.ac.uk
DATeCH 2014, May 20th 2014
Estimating and Rating the Quality of Optical
Character Recognised Text
OVERVIEW
Background: Trading Consequences
OCR accuracy estimation
Motivation
Related work
OCR errors in text mining (eye-balling data versus
quantitative evaluation)
Computing text quality
Manual vs. automatic rating
Summary and conclusion
DATeCH 2014, May 20th 2014
TRADING CONSEQUENCES
JISC/SSHRC Digging into Data Challenge II (2
year project, 2012-2013)
Text mining, data extraction and information
visualisation to explore big historical datasets.
Focus on how commodities were traded across
the globe in the 19th century.
Help historians to discover novel patterns and
explore new research questions.
DATeCH 2014, May 20th 2014
PROJECT TEAM
Ewan Klein, Bea Alex, Claire Grover,
Richard Tobin: text mining
Colin Coates, Andrew Watson:
historical analysis
Jim Clifford: historical analysis
James Reid, Nicola Osborne: data
management, social media
Aaron Quigley, Uta Hinrichs:
information visualisation
DATeCH 2014, May 20th 2014
TRADITIONAL
HISTORICAL RESEARCH
Gillow and the Use of Mahogany in the Eighteenth
Century, Adam Bowett, Regional Furniture, v.XII,
1998.
Global Fats Supply 1894-98
DATeCH 2014, May 20th 2014
PROJECT OVERVIEW
Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
DOCUMENT COLLECTIONS
Collection # of Documents # of Images
House of Commons
Parliamentary Papers
(ProQuest)
118,526 6,448,739
Early Canadiana Online 83,016 3,938,758
Directors’ Letters of
Correspondence (Kew)
14,340 n/a
Confidential Prints (Adam
Matthews)
1,315 140,010
Foreign and
Commonwealth Office
Collection
1,000 41,611
Asia and the West (Gale) 4,725 948,773 (OCRed: 450,841)
Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
DOCUMENT COLLECTIONS
Collection # of Documents # of Images
House of Commons
Parliamentary Papers
(ProQuest)
118,526 6,448,739
Early Canadiana Online 83,016 3,938,758
Directors’ Letters of
Correspondence (Kew)
14,340 n/a
Confidential Prints (Adam
Matthews)
1,315 140,010
Foreign and
Commonwealth Office
Collection
1,000 41,611
Asia and the West (Gale) 4,725 948,773 (OCRed: 450,841)
Over 10 million document pages,
Over 7 billion word tokens.
Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
OCR-ED TEXT
DATeCH 2014, May 20th 2014
OCR-ED TEXT
DATeCH 2014, May 20th 2014
OCR-ED TEXT
DATeCH 2014, May 20th 2014
OCR-ED TEXT
DATeCH 2014, May 20th 2014
WHY OCR ACCURACY
ESTIMATION?
A reasonable amount of already digitised books (some
with very bad text quality). Can we mine some of them
now.
To what extent do OCR errors affect text mining?
What is their effect when dealing with big data?
What text is of sufficient high quality to be
understood? How bad is too bad? What happens to
the rest?
Can we measure text quality? How does it compare to
human quality ranking of text?
DATeCH 2014, May 20th 2014
RELATED WORK
Some OCR output contains character-based accuracy rates
which can be very deceptive.
Popat, 2009:
Extensive study on quality ranking of short OCRed text
snippets in different languages. Examined rank order of
text snippets of inter-, intra- and machine ratings.
Compared spatial and sequential character n-gram-based
approaches to a dictionary-based approach (web corpus,
capped at 50K most frequent words per language).
Compared random to balanced (stratified) sampling.
Metric: average rank correlation.
DATeCH 2014, May 20th 2014
OCR ERRORS AND BIG DATA
Are OCR errors negligible when mining big data to
detect trends?
Our data suffers from all the common OCR error
types (at best just a few character insertions,
substitutions and deletions), at worst much worse
(page upside down).
Character confusion examples:
e -> c, a -> o, h -> b, l -> t, m -> n, f -> s
DATeCH 2014, May 20th 2014
OCR ERRORS
PQIS All Team Meeting, ProQuest, April 23rd 2014
OCR ERRORS
PQIS All Team Meeting, ProQuest, April 23rd 2014
OCR ERRORS
PQIS All Team Meeting, ProQuest, April 23rd 2014
OCR ERRORS
PQIS All Team Meeting, ProQuest, April 23rd 2014
OCR ERRORS
PQIS All Team Meeting, ProQuest, April 23rd 2014
OCR ERRORS AND TEXT MINING
Need a more quantitative analysis.
Built a commodity and location recognition tool.
Evaluated it against manually annotated gold
standard.
DATeCH 2014, May 20th 2014
OCR ERRORS AND TEXT MINING
32.6% of false negative commodity mentions (101
of 310) contain OCR errors (= 9.1% of all
commodity mentions in the gold standard)
sainon, rubher, tmber
30.2% of false negative location mentions (467 of
1,549) contain OCR errors (= 14.8% of all location
mentions in the gold standard)
Montreai, Montroal, Mont- treal and 10NTREAL.
DATeCH 2014, May 20th 2014
OCR ERRORS AND TEXT MINING
DATeCH 2014, May 20th 2014
PREDICTING TEXT QUALITY
Can we compute a simple quality score for a large
data collection (i.e. over 7 billion words)?
How easily can humans perform document-level
quality rating?
DATeCH 2014, May 20th 2014
COMPUTING TEXT QUALITY
Simple document-level quality
score to get a rough estimate
of how good a document is.
Word tokens found in an
English dictionary (aspell “en”)
and Roman/Arabic numbers
over all word tokens in the text.
Scores range between 0 and 1.
Caveat: it does not consider
historic variants.
DATeCH 2014, May 20th 2014
COMPUTING TEXT QUALITY
Score distribution over the English Early Canadiana
Online data (55,313 documents).
DATeCH 2014, May 20th 2014
DATA PREPARATION
DATeCH 2014, May 20th 2014
Early Canadiana Online (books, magazines and
government publications relevant to Canadian
history ranging from 1600 to the 1940s)
83,016 documents (almost 4 million images
containing text mostly in English and French but
also in 10 First Nation languages, European
languages and Latin).
Language identification (or meta data information)
to retain only English content (55,313 documents).
DATA PREPARATION
DATeCH 2014, May 20th 2014
Ran the automatic scoring over all English ECO documents.
Applied stratified sampling to collect 100 documents by
randomly selecting:
20 documents where 0 >= SQ < 0:2,
20 documents where 0.2 >= SQ < 0.4,
20 documents where 0.4 >= SQ < 0.6,
20 documents where 0.6 >= SQ < 0.8,
20 documents where 0.8 >= SQ < 1.
Shuffled documents and removed the quality score.
MANUAL RATING
DATeCH 2014, May 20th 2014
Two raters looked at each document and rated it on a 5-point scale.
5 ... OCR quality is high. There are few errors. The text is easily
readable and understandable.
4 ... OCR quality is good. There are some errors but they are limited
in number and the text is still mostly readable and understandable.
3 ... OCR quality is mediocre. There are numerous OCR errors and
only part of the text is readable and understandable.
2 ... OCR quality is low. There is a large number of OCR errors which
seriously affect the readability and understandability of the majority of
the text.
1 ... OCR quality is extremely low. The text is so full of errors that it
is not readable and understandable.
Weighted Kappa: 0.516
INTER-RATER AGREEMENT
DATeCH 2014, May 20th 2014
Weighted Kappa: 0.516
INTER-RATER AGREEMENT
DATeCH 2014, May 20th 2014
INTER-RATER AGREEMENT
DATeCH 2014, May 20th 2014
Weighted Kappa: 0.60
AUTOMATIC VS HUMAN
DATeCH 2014, May 20th 2014
AUTOMATIC VS HUMAN
DATeCH 2014, May 20th 2014
AUTOMATIC VS HUMAN
DATeCH 2014, May 20th 2014
AUTOMATIC VS HUMAN
DATeCH 2014, May 20th 2014
AUTOMATIC VS HUMAN
DATeCH 2014, May 20th 2014
AUTOMATIC VS HUMAN
DATeCH 2014, May 20th 2014
THRESHOLD?
DATeCH 2014, May 20th 2014
CONCLUSION
We applied a simple quality scoring method to a large document
collection and showed that automatic rating correlates with human
rating.
Document-level rating is not easy to do manually.
Automatic document-level rating is not ideal but it give us a first “taste”
of how good the quality of a document is. It is much more consistent
than a person doing the same task.
Many OCR errors are noise in big data but when added up they affect
a significant amount of text.
We found that named entities are effected worse than common words.
HSS scholars need to be made much more aware of OCR errors
affecting their search results for historical collections.
DATeCH 2014, May 20th 2014
FUTURE WORK
Consider publication date and digitisation date when
doing OCR quality estimation.
Examine the bad documents identify those worth
post-correcting.
AHRC big data project (Palimpsest) on mining and
geo-referencing literature set in Edinburgh.
Collaboration with literary scholars interested in loco-
specificity and its context in literature.
DATeCH 2014, May 20th 2014
THANK YOU
Rating annotation guidelines and doubly rated data available on GitHub (digtrade)
Contact: balex@inf.ed.ac.uk
Website: http://tradingconsequences.blogs.edina.ac.uk/
Twitter: @digtrade
DATeCH 2014, May 20th 2014
BRINGING ARCHIVES ALIVE
DATeCH 2014, May 20th 2014
BRINGING ARCHIVES ALIVE
!
DATeCH 2014, May 20th 2014
BRINGING ARCHIVES ALIVE
!!
DATeCH 2014, May 20th 2014
BRINGING ARCHIVES ALIVE
DATeCH 2014, May 20th 2014
SYSTEM
Documents Text Mining
Annotated
Documents
XML 2 RDB
Commodities
RDB
Lexicons
& Gazetteers
Query Interface
Visualisation
Commodities
Ontology
SKOS
Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
MINED INFORMATION
Example sentence:
Normalised and grounded entities:
commodity: cassia bark [concept: Cinnamomum cassia]
date: 1871 (year=1871)
location: Padang (lat=-0.94924;long=100.35427;country=ID)
location: America (lat=39.76;long=-98.50;country=n/a)
quantity + unit: 6,127 piculs
Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
MINED INFORMATION
Example sentence:
Extracted entity attributes and relations:
origin location: Padang
destination location: America
commodity–date relation: cassia bark – 1871
commodity–location relation: cassia bark – Padang
commodity–location relation: cassia bark – America
Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
EDINBURGH GEOPARSER
Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
CONCLUSION
Importance of two-way collaboration between
technology and humanities expert in digital HSS
projects.
Value of iterative development and rapid prototyping.
Geo-referencing text is very important for historical
analysis.
Most OCR errors are noise in big data but HSS
scholars need to be made more aware of OCR errors
affecting their search results for historical collections.
DATeCH 2014, May 20th 2014

Mais conteúdo relacionado

Semelhante a Datech2014 - Session 5 - Estimating and Rating Quality of Optical Character Recognised Text

A demonstration of transparent and scalable OpenURL quality metrics for use i...
A demonstration of transparent and scalable OpenURL quality metrics for use i...A demonstration of transparent and scalable OpenURL quality metrics for use i...
A demonstration of transparent and scalable OpenURL quality metrics for use i...alc28
 
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUESA STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUESijcsitcejournal
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftSebastian Hellmann
 
BiographyNet: Linking the world of History
BiographyNet: Linking the world of HistoryBiographyNet: Linking the world of History
BiographyNet: Linking the world of HistoryBiographyNet
 
Ocls 4th annual breakfast 2016
Ocls 4th annual breakfast 2016Ocls 4th annual breakfast 2016
Ocls 4th annual breakfast 2016Jan Dawson
 
mchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-toolsmchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-toolsMatt Christy
 
Wei Xu - Innovative Applications of AI Panel
Wei Xu - Innovative Applications of AI PanelWei Xu - Innovative Applications of AI Panel
Wei Xu - Innovative Applications of AI PanelRehgan Avon
 
Connecting Librarians to Researchers
Connecting Librarians to ResearchersConnecting Librarians to Researchers
Connecting Librarians to ResearchersQSR International
 
Usable Language | How Content Shapes The User Experience
Usable Language | How Content Shapes The User ExperienceUsable Language | How Content Shapes The User Experience
Usable Language | How Content Shapes The User ExperienceRandall Snare
 
Computer Software in Qualitative Research: An Introduction to NVivo
Computer Software in Qualitative Research: An Introduction to NVivoComputer Software in Qualitative Research: An Introduction to NVivo
Computer Software in Qualitative Research: An Introduction to NVivoAdam Perzynski, PhD
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...PhD Assistance
 
Europeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday GenereuxEuropeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday GenereuxEuropeana Newspapers
 
Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)Péter Király
 

Semelhante a Datech2014 - Session 5 - Estimating and Rating Quality of Optical Character Recognised Text (20)

A demonstration of transparent and scalable OpenURL quality metrics for use i...
A demonstration of transparent and scalable OpenURL quality metrics for use i...A demonstration of transparent and scalable OpenURL quality metrics for use i...
A demonstration of transparent and scalable OpenURL quality metrics for use i...
 
Carpenter, McCraken, Ventimiglia, Noonan, and Walker "KBART and the OpenURL: ...
Carpenter, McCraken, Ventimiglia, Noonan, and Walker "KBART and the OpenURL: ...Carpenter, McCraken, Ventimiglia, Noonan, and Walker "KBART and the OpenURL: ...
Carpenter, McCraken, Ventimiglia, Noonan, and Walker "KBART and the OpenURL: ...
 
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUESA STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draft
 
BiographyNet: Linking the world of History
BiographyNet: Linking the world of HistoryBiographyNet: Linking the world of History
BiographyNet: Linking the world of History
 
Ocls 4th annual breakfast 2016
Ocls 4th annual breakfast 2016Ocls 4th annual breakfast 2016
Ocls 4th annual breakfast 2016
 
OCR, optical character reader
OCR, optical character readerOCR, optical character reader
OCR, optical character reader
 
mchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-toolsmchristy-DH2014-emop-bookhistory-tools
mchristy-DH2014-emop-bookhistory-tools
 
co:op-READ-Convention Marburg - Enrique Vidal
co:op-READ-Convention Marburg - Enrique Vidalco:op-READ-Convention Marburg - Enrique Vidal
co:op-READ-Convention Marburg - Enrique Vidal
 
Wei Xu - Innovative Applications of AI Panel
Wei Xu - Innovative Applications of AI PanelWei Xu - Innovative Applications of AI Panel
Wei Xu - Innovative Applications of AI Panel
 
Connecting Librarians to Researchers
Connecting Librarians to ResearchersConnecting Librarians to Researchers
Connecting Librarians to Researchers
 
Usable Language | How Content Shapes The User Experience
Usable Language | How Content Shapes The User ExperienceUsable Language | How Content Shapes The User Experience
Usable Language | How Content Shapes The User Experience
 
CUA Engineering104
CUA Engineering104CUA Engineering104
CUA Engineering104
 
Transkribus | Günter Mühlberger
Transkribus | Günter MühlbergerTranskribus | Günter Mühlberger
Transkribus | Günter Mühlberger
 
Computer Software in Qualitative Research: An Introduction to NVivo
Computer Software in Qualitative Research: An Introduction to NVivoComputer Software in Qualitative Research: An Introduction to NVivo
Computer Software in Qualitative Research: An Introduction to NVivo
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
Future of Natural Language Processing - Potential Lists of Topics for PhD stu...
 
Europeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday GenereuxEuropeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday Genereux
 
Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)
 
Kohacon2016
Kohacon2016Kohacon2016
Kohacon2016
 

Mais de IMPACT Centre of Competence

Mais de IMPACT Centre of Competence (20)

Session6 01.helmut schmid
Session6 01.helmut schmidSession6 01.helmut schmid
Session6 01.helmut schmid
 
Session1 03.hsian-an wang
Session1 03.hsian-an wangSession1 03.hsian-an wang
Session1 03.hsian-an wang
 
Session7 03.katrien depuydt
Session7 03.katrien depuydtSession7 03.katrien depuydt
Session7 03.katrien depuydt
 
Session7 02.peter kiraly
Session7 02.peter kiralySession7 02.peter kiraly
Session7 02.peter kiraly
 
Session6 04.giuseppe celano
Session6 04.giuseppe celanoSession6 04.giuseppe celano
Session6 04.giuseppe celano
 
Session6 03.sandra young
Session6 03.sandra youngSession6 03.sandra young
Session6 03.sandra young
 
Session6 02.jeremi ochab
Session6 02.jeremi ochabSession6 02.jeremi ochab
Session6 02.jeremi ochab
 
Session5 04.evangelos varthis
Session5 04.evangelos varthisSession5 04.evangelos varthis
Session5 04.evangelos varthis
 
Session5 03.george rehm
Session5 03.george rehmSession5 03.george rehm
Session5 03.george rehm
 
Session5 02.tom derrick
Session5 02.tom derrickSession5 02.tom derrick
Session5 02.tom derrick
 
Session5 01.rutger vankoert
Session5 01.rutger vankoertSession5 01.rutger vankoert
Session5 01.rutger vankoert
 
Session4 04.senka drobac
Session4 04.senka drobacSession4 04.senka drobac
Session4 04.senka drobac
 
Session3 04.arnau baro
Session3 04.arnau baroSession3 04.arnau baro
Session3 04.arnau baro
 
Session3 03.christian clausner
Session3 03.christian clausnerSession3 03.christian clausner
Session3 03.christian clausner
 
Session3 02.kimmo ketunnen
Session3 02.kimmo ketunnenSession3 02.kimmo ketunnen
Session3 02.kimmo ketunnen
 
Session3 01.clemens neudecker
Session3 01.clemens neudeckerSession3 01.clemens neudecker
Session3 01.clemens neudecker
 
Session2 04.ashkan ashkpour
Session2 04.ashkan ashkpourSession2 04.ashkan ashkpour
Session2 04.ashkan ashkpour
 
Session2 03.juri opitz
Session2 03.juri opitzSession2 03.juri opitz
Session2 03.juri opitz
 
Session2 02.christian reul
Session2 02.christian reulSession2 02.christian reul
Session2 02.christian reul
 
Session2 01.emad mohamed
Session2 01.emad mohamedSession2 01.emad mohamed
Session2 01.emad mohamed
 

Último

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 

Último (20)

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 

Datech2014 - Session 5 - Estimating and Rating Quality of Optical Character Recognised Text

  • 1. Beatrice Alex balex@inf.ed.ac.uk DATeCH 2014, May 20th 2014 Estimating and Rating the Quality of Optical Character Recognised Text
  • 2. OVERVIEW Background: Trading Consequences OCR accuracy estimation Motivation Related work OCR errors in text mining (eye-balling data versus quantitative evaluation) Computing text quality Manual vs. automatic rating Summary and conclusion DATeCH 2014, May 20th 2014
  • 3. TRADING CONSEQUENCES JISC/SSHRC Digging into Data Challenge II (2 year project, 2012-2013) Text mining, data extraction and information visualisation to explore big historical datasets. Focus on how commodities were traded across the globe in the 19th century. Help historians to discover novel patterns and explore new research questions. DATeCH 2014, May 20th 2014
  • 4. PROJECT TEAM Ewan Klein, Bea Alex, Claire Grover, Richard Tobin: text mining Colin Coates, Andrew Watson: historical analysis Jim Clifford: historical analysis James Reid, Nicola Osborne: data management, social media Aaron Quigley, Uta Hinrichs: information visualisation DATeCH 2014, May 20th 2014
  • 5. TRADITIONAL HISTORICAL RESEARCH Gillow and the Use of Mahogany in the Eighteenth Century, Adam Bowett, Regional Furniture, v.XII, 1998. Global Fats Supply 1894-98 DATeCH 2014, May 20th 2014
  • 6. PROJECT OVERVIEW Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
  • 7. DOCUMENT COLLECTIONS Collection # of Documents # of Images House of Commons Parliamentary Papers (ProQuest) 118,526 6,448,739 Early Canadiana Online 83,016 3,938,758 Directors’ Letters of Correspondence (Kew) 14,340 n/a Confidential Prints (Adam Matthews) 1,315 140,010 Foreign and Commonwealth Office Collection 1,000 41,611 Asia and the West (Gale) 4,725 948,773 (OCRed: 450,841) Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
  • 8. DOCUMENT COLLECTIONS Collection # of Documents # of Images House of Commons Parliamentary Papers (ProQuest) 118,526 6,448,739 Early Canadiana Online 83,016 3,938,758 Directors’ Letters of Correspondence (Kew) 14,340 n/a Confidential Prints (Adam Matthews) 1,315 140,010 Foreign and Commonwealth Office Collection 1,000 41,611 Asia and the West (Gale) 4,725 948,773 (OCRed: 450,841) Over 10 million document pages, Over 7 billion word tokens. Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
  • 9. OCR-ED TEXT DATeCH 2014, May 20th 2014
  • 10. OCR-ED TEXT DATeCH 2014, May 20th 2014
  • 11. OCR-ED TEXT DATeCH 2014, May 20th 2014
  • 12. OCR-ED TEXT DATeCH 2014, May 20th 2014
  • 13. WHY OCR ACCURACY ESTIMATION? A reasonable amount of already digitised books (some with very bad text quality). Can we mine some of them now. To what extent do OCR errors affect text mining? What is their effect when dealing with big data? What text is of sufficient high quality to be understood? How bad is too bad? What happens to the rest? Can we measure text quality? How does it compare to human quality ranking of text? DATeCH 2014, May 20th 2014
  • 14. RELATED WORK Some OCR output contains character-based accuracy rates which can be very deceptive. Popat, 2009: Extensive study on quality ranking of short OCRed text snippets in different languages. Examined rank order of text snippets of inter-, intra- and machine ratings. Compared spatial and sequential character n-gram-based approaches to a dictionary-based approach (web corpus, capped at 50K most frequent words per language). Compared random to balanced (stratified) sampling. Metric: average rank correlation. DATeCH 2014, May 20th 2014
  • 15. OCR ERRORS AND BIG DATA Are OCR errors negligible when mining big data to detect trends? Our data suffers from all the common OCR error types (at best just a few character insertions, substitutions and deletions), at worst much worse (page upside down). Character confusion examples: e -> c, a -> o, h -> b, l -> t, m -> n, f -> s DATeCH 2014, May 20th 2014
  • 16. OCR ERRORS PQIS All Team Meeting, ProQuest, April 23rd 2014
  • 17. OCR ERRORS PQIS All Team Meeting, ProQuest, April 23rd 2014
  • 18. OCR ERRORS PQIS All Team Meeting, ProQuest, April 23rd 2014
  • 19. OCR ERRORS PQIS All Team Meeting, ProQuest, April 23rd 2014
  • 20. OCR ERRORS PQIS All Team Meeting, ProQuest, April 23rd 2014
  • 21. OCR ERRORS AND TEXT MINING Need a more quantitative analysis. Built a commodity and location recognition tool. Evaluated it against manually annotated gold standard. DATeCH 2014, May 20th 2014
  • 22. OCR ERRORS AND TEXT MINING 32.6% of false negative commodity mentions (101 of 310) contain OCR errors (= 9.1% of all commodity mentions in the gold standard) sainon, rubher, tmber 30.2% of false negative location mentions (467 of 1,549) contain OCR errors (= 14.8% of all location mentions in the gold standard) Montreai, Montroal, Mont- treal and 10NTREAL. DATeCH 2014, May 20th 2014
  • 23. OCR ERRORS AND TEXT MINING DATeCH 2014, May 20th 2014
  • 24. PREDICTING TEXT QUALITY Can we compute a simple quality score for a large data collection (i.e. over 7 billion words)? How easily can humans perform document-level quality rating? DATeCH 2014, May 20th 2014
  • 25. COMPUTING TEXT QUALITY Simple document-level quality score to get a rough estimate of how good a document is. Word tokens found in an English dictionary (aspell “en”) and Roman/Arabic numbers over all word tokens in the text. Scores range between 0 and 1. Caveat: it does not consider historic variants. DATeCH 2014, May 20th 2014
  • 26. COMPUTING TEXT QUALITY Score distribution over the English Early Canadiana Online data (55,313 documents). DATeCH 2014, May 20th 2014
  • 27. DATA PREPARATION DATeCH 2014, May 20th 2014 Early Canadiana Online (books, magazines and government publications relevant to Canadian history ranging from 1600 to the 1940s) 83,016 documents (almost 4 million images containing text mostly in English and French but also in 10 First Nation languages, European languages and Latin). Language identification (or meta data information) to retain only English content (55,313 documents).
  • 28. DATA PREPARATION DATeCH 2014, May 20th 2014 Ran the automatic scoring over all English ECO documents. Applied stratified sampling to collect 100 documents by randomly selecting: 20 documents where 0 >= SQ < 0:2, 20 documents where 0.2 >= SQ < 0.4, 20 documents where 0.4 >= SQ < 0.6, 20 documents where 0.6 >= SQ < 0.8, 20 documents where 0.8 >= SQ < 1. Shuffled documents and removed the quality score.
  • 29. MANUAL RATING DATeCH 2014, May 20th 2014 Two raters looked at each document and rated it on a 5-point scale. 5 ... OCR quality is high. There are few errors. The text is easily readable and understandable. 4 ... OCR quality is good. There are some errors but they are limited in number and the text is still mostly readable and understandable. 3 ... OCR quality is mediocre. There are numerous OCR errors and only part of the text is readable and understandable. 2 ... OCR quality is low. There is a large number of OCR errors which seriously affect the readability and understandability of the majority of the text. 1 ... OCR quality is extremely low. The text is so full of errors that it is not readable and understandable.
  • 30. Weighted Kappa: 0.516 INTER-RATER AGREEMENT DATeCH 2014, May 20th 2014
  • 31. Weighted Kappa: 0.516 INTER-RATER AGREEMENT DATeCH 2014, May 20th 2014
  • 32. INTER-RATER AGREEMENT DATeCH 2014, May 20th 2014 Weighted Kappa: 0.60
  • 33. AUTOMATIC VS HUMAN DATeCH 2014, May 20th 2014
  • 34. AUTOMATIC VS HUMAN DATeCH 2014, May 20th 2014
  • 35. AUTOMATIC VS HUMAN DATeCH 2014, May 20th 2014
  • 36. AUTOMATIC VS HUMAN DATeCH 2014, May 20th 2014
  • 37. AUTOMATIC VS HUMAN DATeCH 2014, May 20th 2014
  • 38. AUTOMATIC VS HUMAN DATeCH 2014, May 20th 2014
  • 40. CONCLUSION We applied a simple quality scoring method to a large document collection and showed that automatic rating correlates with human rating. Document-level rating is not easy to do manually. Automatic document-level rating is not ideal but it give us a first “taste” of how good the quality of a document is. It is much more consistent than a person doing the same task. Many OCR errors are noise in big data but when added up they affect a significant amount of text. We found that named entities are effected worse than common words. HSS scholars need to be made much more aware of OCR errors affecting their search results for historical collections. DATeCH 2014, May 20th 2014
  • 41. FUTURE WORK Consider publication date and digitisation date when doing OCR quality estimation. Examine the bad documents identify those worth post-correcting. AHRC big data project (Palimpsest) on mining and geo-referencing literature set in Edinburgh. Collaboration with literary scholars interested in loco- specificity and its context in literature. DATeCH 2014, May 20th 2014
  • 42. THANK YOU Rating annotation guidelines and doubly rated data available on GitHub (digtrade) Contact: balex@inf.ed.ac.uk Website: http://tradingconsequences.blogs.edina.ac.uk/ Twitter: @digtrade DATeCH 2014, May 20th 2014
  • 43. BRINGING ARCHIVES ALIVE DATeCH 2014, May 20th 2014
  • 44. BRINGING ARCHIVES ALIVE ! DATeCH 2014, May 20th 2014
  • 45. BRINGING ARCHIVES ALIVE !! DATeCH 2014, May 20th 2014
  • 46. BRINGING ARCHIVES ALIVE DATeCH 2014, May 20th 2014
  • 47. SYSTEM Documents Text Mining Annotated Documents XML 2 RDB Commodities RDB Lexicons & Gazetteers Query Interface Visualisation Commodities Ontology SKOS Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
  • 48. MINED INFORMATION Example sentence: Normalised and grounded entities: commodity: cassia bark [concept: Cinnamomum cassia] date: 1871 (year=1871) location: Padang (lat=-0.94924;long=100.35427;country=ID) location: America (lat=39.76;long=-98.50;country=n/a) quantity + unit: 6,127 piculs Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
  • 49. MINED INFORMATION Example sentence: Extracted entity attributes and relations: origin location: Padang destination location: America commodity–date relation: cassia bark – 1871 commodity–location relation: cassia bark – Padang commodity–location relation: cassia bark – America Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
  • 50. EDINBURGH GEOPARSER Scotland’s National Collections and the Digital Humanities, Edinburgh, 14/02/2014
  • 51. CONCLUSION Importance of two-way collaboration between technology and humanities expert in digital HSS projects. Value of iterative development and rapid prototyping. Geo-referencing text is very important for historical analysis. Most OCR errors are noise in big data but HSS scholars need to be made more aware of OCR errors affecting their search results for historical collections. DATeCH 2014, May 20th 2014