Impact of Crowdsourcing OCR Improvements
on Retrievability Bias
Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Lynda Hardman

Centrum Wiskunde & Informatica, Amsterdam, NL
Motivation: Retrievability (Bias)
• Introduced by Azzopardi et al. in 2008 [1]
• Retrievability score counts how often a document is retrieved as one of the top K documents by a given set of queries
• Gini coefficient quantifies inequality in the distribution of
scores
[1]  L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th
ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
Study on Retrievability Bias (JCDL2016)
• Follow-up study of Querylog-based Assessment of Retrievability Bias in
a Large Newspaper Corpus
• Large-scale study based on 102 million newspaper items, 4 million
simulated queries and 957,239 real user queries
• Findings:
• Large inequalities among the documents indicating retrievability bias
• Document length impacts retrieval, no evidence for other technical
bias found
• Simulated queries yield very different results from real queries; experiments should take operators and facets into account
Potential Causes for
Retrievability Bias
• Skills and interest of users
• Collection bias
• Ranking algorithm
• UI design
• (OCR) quality
Courante uyt Italien, Duytslandt, &c (14-06-1618)
Research Questions
• RQ1: Relation between OCR quality and
retrievability
• RQ2: Direct impact of correction on
retrievability bias of corrected documents
• RQ3: Indirect impact of correction of a
fraction of documents on non-corrected
ones
How does bias caused by
OCR quality impact my (re-)search
results?
Experimental Setup
Documents & Queries
• Subset of the historic newspaper
archive maintained by the National
Library of the Netherlands (public,
KB)

• Ground truth set of 100 manually
corrected newspaper issues (822
articles) published in the 17th century
and WWII period (public, KB)

• Character error rates (CER)
computed with [1]

• User queries collected from delpher.nl (confidential, KB, same as in previous study); stopwords and short terms removed, deduplicated
[1] https://www.digitisation.eu/
De geus onder studenten (14-10-1940)
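To make the CER bullet concrete, here is a minimal sketch of one common definition of character error rate: the Levenshtein edit distance between the OCR output and the ground truth, divided by the length of the ground truth. This is an assumption for illustration only; the slides do not say how the tool referenced in [1] computes CER, and the function names and the sample strings below are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion from a
                            curr[j - 1] + 1,       # insertion into a
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance normalised by ground-truth length (assumed definition)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


# Hypothetical example: a long s misread as "f".
cer = character_error_rate("Amfterdam", "Amsterdam")
print(f"CER = {cer:.1%}")
```

Normalising by the ground-truth length means the rate can exceed 100% when the OCR output contains many spurious characters.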
4 Corpora
• Ground truth set (822 documents):
• uncorrected
• corrected
• Ground truth + mixed in (1644 documents):
• uncorrected
• partially corrected
Setup - Retrievability
Indri search engine [1], run on documents and queries
• We report on c=1, c=10, c=100 and c=infinite
• Carried out on each of the four corpora
[1] http://www.lemurproject.org/
Retrievability Scores r(d)
Azzopardi et al. introduced a way to measure how retrieval systems influence the accessibility of documents in a collection [1]. The retrievability score of a document d, r(d), measures how accessible a document is. It is determined by several factors, including the matching function of the retrieval system and the number of documents a user is willing to evaluate. The retrievability score is the result of a cumulative scoring function, defined as:
r(d) = \sum_{q \in Q} o_q \cdot f(k_{dq}, c),
where c defines the number of documents a user is willing to examine in a ranked list. We use cutoff values c = 1[...], c = 100, and c = 1000. The coefficient o_q weights the importance of a query. We assign equal weights, with o_q = 1. The function f(k_{dq}, c) is a generalized utility/cost function [...]
• r(d): retrievability score for a document d
• c: cutoff value
• k_{dq}: rank of document d in the result list of a query q
• q \in Q: for all queries q in a query set Q
• o_q: possibility to give more weight to certain queries; we use o_q = 1
[1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
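The cumulative scoring function can be sketched in a few lines. The sketch assumes that f(k_{dq}, c) is 1 when document d is ranked within the top c results for query q and 0 otherwise, matching the "top K documents" description in the motivation slide, and that o_q = 1 unless a weight is supplied. The function name, the input layout (query mapped to a ranked list of document ids) and the toy data are hypothetical, not the authors' pipeline.

```python
from collections import defaultdict


def retrievability(ranked_results, c=None, weights=None):
    """Compute r(d) for every document that is retrieved at least once.

    ranked_results: dict mapping query -> list of document ids, best rank first
                    (each document is assumed to occur at most once per list).
    c:              cutoff; None stands for c = infinite (whole result list).
    weights:        optional dict query -> o_q; missing queries get o_q = 1.
    """
    scores = defaultdict(float)
    for query, ranking in ranked_results.items():
        o_q = 1.0 if weights is None else weights.get(query, 1.0)
        top = ranking if c is None else ranking[:c]
        for doc_id in top:          # f(k_dq, c) = 1 for these documents
            scores[doc_id] += o_q
    return dict(scores)


# Toy result lists (hypothetical).
results = {
    "amsterdam": ["d1", "d2", "d3"],
    "willem": ["d2", "d1"],
    "nieuwe": ["d2"],
}
r_c1 = retrievability(results, c=1)
r_all = retrievability(results)
print(r_c1, r_all)
```

Documents that never appear in any result list are absent from the returned dict; they have r(d) = 0 and need to be added back before computing inequality measures over the whole collection.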
Impact Assessment
• Wealth: How many documents were retrieved in total?
• Sum of all r(d) scores
• Equality: How are r(d) scores distributed among documents?
• Gini coefficient
• Retrieval per document/query:
• Changes due to correction
• Impact of individual (query) terms
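Both summary measures are straightforward to compute once r(d) is known for every document. The sketch below uses the standard rank-based formula for the Gini coefficient of a sorted sample; the function names and sample scores are hypothetical, and the slides do not state which Gini formulation the authors used.

```python
def wealth(scores):
    """Sum of all r(d) scores."""
    return sum(scores)


def gini(scores):
    """Gini coefficient: 0 means perfectly equal; (n - 1) / n is the maximum for n items."""
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending scores, ranks starting at 1.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


equal = [5, 5, 5, 5]        # every document retrieved equally often
skewed = [0, 0, 0, 10]      # one document receives all retrievals
print(wealth(equal), gini(equal))
print(wealth(skewed), gini(skewed))
```

The list passed to `gini` has to include documents with r(d) = 0; leaving them out would understate the inequality.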
OCR Quality & Retrievability
RQ1: What is the relation between a document’s OCR character error rate and its
retrievability score?
RQ1: OCR Quality & Retrievability
• CER in 17cent collection significantly higher
• r(d) scores higher in WWII collection
• Correlation between r(d) and CER: -0.57 (Pearson) and -0.61 (Spearman) with p<0.001
Figure: r(d) score plotted against character error rate (CER, 0% to 80%), with legends for document length and subset (17cent, WWII)
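For readers who want to reproduce this kind of analysis on their own data, the following is a minimal sketch of the two correlation coefficients named above. Spearman is computed as Pearson on average ranks. The data points are hypothetical and chosen only to show a negative relation; the sketch does not compute p-values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical documents: CER and r(d) score.
cer = [0.05, 0.10, 0.20, 0.40, 0.60]
r_d = [900, 700, 650, 200, 100]
print(pearson(cer, r_d), spearman(cer, r_d))
```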
Direct Impact of OCR Quality
RQ2: How does the correction of OCR errors impact the retrievability bias of the
corrected documents?
Direct Impact
Uncorrected → Corrected (complete corpus was corrected)
Impact of Correction on Wealth
• More documents are retrieved from the corrected documents
• Number of queries with results increased by 8%
• Impact is largest for users willing to look at the entire result list
Figure: Sum of all r(d) scores (wealth) by condition
• c=1: error-prone 338,139, corrected 365,855 (+8%)
• c=10: error-prone 1,750,340, corrected 2,023,283 (+16%)
• c=100: error-prone 4,341,536, corrected 5,477,566 (+26%)
• c=infinite: error-prone 4,521,030, corrected 6,033,099 (+34%)
Impact on Equality
• Correction lowers inequality among documents
• In contrast to earlier findings, Gini coefficients do not decrease with larger c’s
• Correction fixes more false negatives (FN) than false positives (FP) (c=infinite):
• Increases both wealth and equality
Figure: Direct Impact: Gini Coefficients (Gini at c = 1, 10, 100 and infinite for conditions 822GTcor and 822GTerr)
Retrieval per Document
• Few documents lose r(d) scores after correction. This is good: they are former FP that were caused by OCR errors and are no longer retrieved
• Most documents, however, gain. The 17cent corpus improves to a larger extent but still remains at a lower level
Retrieval per Query
• Only 44% of the queries retrieved at least one document
• Despite small collection size, we see large gains
• Some queries lose because they retrieved FP from the uncorrected document set
Retrieval per Query
Top 10 terms cause 35% of the
wealth increase. These terms:
1. Appear very frequently in
user queries and
2. Are highly susceptible to
OCR errors in the
documents
Conclusion: Real queries are
also a source of bias
Figure: Cumulative r(d) difference (%) for query terms ordered by difference in impact (descending)
Query term   Frequency in queries   Frequency in 822GTerr   Frequency in 822GTcor   Cum. impact
nieuwe       1,903                  99                      166                     7.36%
amsterdam    7,885                  41                      57                      14.65%
ende         185                    103                     480                     18.69%
heer         826                    20                      89                      21.99%
overleden    3,698                  5                       18                      24.78%
groot        1,573                  125                     153                     27.33%
willem       5,375                  5                       13                      29.81%
twee         319                    64                      175                     31.83%
drie         401                    34                      120                     33.81%
oude         991                    50                      78                      35.41%
* English translations: new, Amsterdam, end, Mister, died/dead, grand/large, Willem (name), two, three, old
Figure 5: The accumulated impact scores of single-term queries show that very few query terms contribute a large fraction of the overall wealth. The top ten query terms ac[...]
Figure 4: Queries ordered by their gain/loss in number of retrieved documents. The position on the y-axis represents the number of documents retrieved from 822GTcor.
[...] histograms. The distributions of the differences in r(d) scores in Table 2 show that for all cutoff values, the median of the differences is positive, and increases from 8 (c = 1) to 912 (c = ∞). The maximum loss and the maximum gain in r(d) scores increase for larger cutoff values c, the latter to a much larger extent. Note that for c = 1 and c = 10 the entire first quartile is filled with documents that scored worse in the corrected version. This shows that the competition in the top results makes the gain of some documents the loss of others.
Increased retrieval per query. In a final step, we investigated [...]
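A cumulative impact column of this kind can be derived by sorting terms by their individual gain and accumulating each term's share of the total. The sketch below shows that calculation on made-up numbers; the term names and gains are hypothetical, and the slides do not describe the exact procedure the authors used.

```python
from itertools import accumulate


def cumulative_impact(gain_per_term):
    """Return (term, cumulative share in %) pairs, largest gain first."""
    ordered = sorted(gain_per_term.items(), key=lambda item: item[1], reverse=True)
    total = sum(gain for _, gain in ordered)
    if total == 0:
        raise ValueError("total gain is zero, shares are undefined")
    running = accumulate(gain for _, gain in ordered)
    return [(term, 100.0 * cum / total) for (term, _), cum in zip(ordered, running)]


# Hypothetical per-term gains in summed r(d) after correction.
gains = {"term_c": 15, "term_a": 50, "term_d": 5, "term_b": 30}
impact = cumulative_impact(gains)
for term, share in impact:
    print(f"{term}: {share:.2f}%")
```

With real data some terms lose retrievals after correction, so individual gains can be negative and the cumulative curve can rise above 100% before settling at 100%.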
Indirect Impact of OCR Quality
RQ3: How does the correction of a fraction of error-prone documents influence the
retrievability of non-corrected ones?
Indirect Impact
Uncorrected → Mixed (half of the corpus was corrected)
• 50% same documents as in previous RQ
• 50% new documents (we’re mainly interested in these documents)
Equality still increases!
• Equality in r(d) scores is higher in
the corrected document collection
• Again, correction has decreased
retrievability bias
Figure: Indirect Impact: Gini Coefficients (Gini at c = 1, 10, 100 and infinite for conditions 1644err and 1644mix)
Impact on Wealth
Figure: Wealth by cutoff c for conditions 1644_err and 1644_mix, in three panels
Collection      c          1644_err     1644_mix
Complete        1          353,613      376,139
Complete        10         2,099,816    2,307,996
Complete        100        6,698,945    7,676,830
Complete        infinity   8,008,574    9,520,643
Ground Truth    1          180,079      225,809
Ground Truth    10         1,112,705    1,420,322
Ground Truth    100        3,783,514    4,898,694
Ground Truth    infinity   4,521,030    6,033,099
Mixed-in        1          173,534      150,330
Mixed-in        10         987,111      887,674
Mixed-in        100        2,915,431    2,778,136
Mixed-in        infinity   3,487,544    3,487,544
• Complete: Correction increases wealth
• GT only:
• Increase in wealth
• c=1: +20%
• c=10: +22%
• c=100: +23%
• c=infinite: +25%
• Mixed-in only:
• Decrease in wealth:
• c=1: -13%
• c=10: -10%
• c=100: -5%
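The mixed-in percentages can be checked against the chart values with a short calculation. The sketch assumes that the change is measured relative to the uncorrected condition (1644_err) and that, for the mixed-in panel, the smaller value of each pair belongs to 1644_mix; both are inferences from the slide, not stated by the authors. The helper name is hypothetical.

```python
def relative_change(baseline, value):
    """Change of value relative to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Wealth of the mixed-in documents per cutoff: (1644_err, 1644_mix).
mixed_in = {
    "c=1": (173_534, 150_330),
    "c=10": (987_111, 887_674),
    "c=100": (2_915_431, 2_778_136),
    "c=infinity": (3_487_544, 3_487_544),
}

changes = {
    cutoff: round(relative_change(err, mix))
    for cutoff, (err, mix) in mixed_in.items()
}
print(changes)
```

At c=infinity the wealth is identical in both conditions, which is what one would expect if the mixed-in documents themselves are unchanged and only their competition for the top ranks differs.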
Retrieval per Document (mixed-in only, c=10)
• Most documents’ scores change very little; where they do change, they lose r(d) scores
• 171 documents gain r(d) scores
• Benefit from FP matches that disappeared
Conclusions
• In our study, OCR correction
• Increases overall retrievability
• Reduces retrievability bias, even in a partially corrected corpus
• Higher scores are caused by a small set of terms that are
• frequent in queries and
• susceptible to OCR errors
• Using real user queries is essential to understand actual bias caused by OCR errors.
We would like to thank the National Library of the Netherlands (KB) for making the newspaper corpus and the (sensitive) user data available to us for research.
This research is partly funded by the Dutch COMMIT/ program, the
WebART project and the VRE4EIC project, a project that has received
funding from the European Union’s Horizon 2020 research and innovation
program under grant agreement No 676247.

More Related Content

Similar to Impact of Crowdsourcing OCR Improvements on Retrievability Bias

IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...onlmcq
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspacePrakash Dubey
 
Query expansion_group42_ire
Query expansion_group42_ireQuery expansion_group42_ire
Query expansion_group42_ireKovidaN
 
An introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalAn introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalMounia Lalmas-Roelleke
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Primya Tamil
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Techniquekevig
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Techniquekevig
 
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITYDIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITYIJDKP
 
Analysis of data quality and information quality problems in digital manufact...
Analysis of data quality and information quality problems in digital manufact...Analysis of data quality and information quality problems in digital manufact...
Analysis of data quality and information quality problems in digital manufact...Mary Montoya
 
Informatio retrival evaluation
Informatio retrival evaluationInformatio retrival evaluation
Informatio retrival evaluationNidhirBiswas
 
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ijnlc
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESkevig
 
Workflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingWorkflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingRayhan Ferdous
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 
ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS) ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS) cseij
 
Hybrid geo textual index structure
Hybrid geo textual index structureHybrid geo textual index structure
Hybrid geo textual index structurecseij
 
Scalable and efficient cluster based framework for
Scalable and efficient cluster based framework forScalable and efficient cluster based framework for
Scalable and efficient cluster based framework foreSAT Publishing House
 
Scalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexingScalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexingeSAT Journals
 

Similar to Impact of Crowdsourcing OCR Improvements on Retrievability Bias (20)

IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspace
 
Query expansion_group42_ire
Query expansion_group42_ireQuery expansion_group42_ire
Query expansion_group42_ire
 
An introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalAn introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information Retrieval
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Technique
 
Testing Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting TechniqueTesting Different Log Bases for Vector Model Weighting Technique
Testing Different Log Bases for Vector Model Weighting Technique
 
ME Synopsis
ME SynopsisME Synopsis
ME Synopsis
 
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITYDIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
 
Analysis of data quality and information quality problems in digital manufact...
Analysis of data quality and information quality problems in digital manufact...Analysis of data quality and information quality problems in digital manufact...
Analysis of data quality and information quality problems in digital manufact...
 
Informatio retrival evaluation
Informatio retrival evaluationInformatio retrival evaluation
Informatio retrival evaluation
 
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
 
Workflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingWorkflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to Reporting
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS) ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
 
Hybrid geo textual index structure
Hybrid geo textual index structureHybrid geo textual index structure
Hybrid geo textual index structure
 
Scalable and efficient cluster based framework for
Scalable and efficient cluster based framework forScalable and efficient cluster based framework for
Scalable and efficient cluster based framework for
 
Scalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexingScalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexing
 

More from Myriam Traub

Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus
Querylog-based Assessment of Retrievability Bias in a  Large Newspaper CorpusQuerylog-based Assessment of Retrievability Bias in a  Large Newspaper Corpus
Querylog-based Assessment of Retrievability Bias in a Large Newspaper CorpusMyriam Traub
 
Effectiveness of Gamesourcing Expert Painting Annotations
Effectiveness of Gamesourcing Expert Painting AnnotationsEffectiveness of Gamesourcing Expert Painting Annotations
Effectiveness of Gamesourcing Expert Painting AnnotationsMyriam Traub
 
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool Criticism
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool CriticismThe Nature Of Digitally-Produced Data: Towards Social-Scientific Tool Criticism
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool CriticismMyriam Traub
 
Querylog-based Assessment of Retrievability Bias in Delpher
Querylog-based Assessment of Retrievability Bias in DelpherQuerylog-based Assessment of Retrievability Bias in Delpher
Querylog-based Assessment of Retrievability Bias in DelpherMyriam Traub
 
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital ArchivesImpact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital ArchivesMyriam Traub
 
Estimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
Estimating the Impact of OCR Quality on Research Tasks in the Digital HumanitiesEstimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
Estimating the Impact of OCR Quality on Research Tasks in the Digital HumanitiesMyriam Traub
 
Measuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations
Measuring the Effectiveness of Gamesourcing Expert Oil Painting AnnotationsMeasuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations
Measuring the Effectiveness of Gamesourcing Expert Oil Painting AnnotationsMyriam Traub
 

More from Myriam Traub (8)

Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus
Querylog-based Assessment of Retrievability Bias in a  Large Newspaper CorpusQuerylog-based Assessment of Retrievability Bias in a  Large Newspaper Corpus
Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus
 
Effectiveness of Gamesourcing Expert Painting Annotations
Effectiveness of Gamesourcing Expert Painting AnnotationsEffectiveness of Gamesourcing Expert Painting Annotations
Effectiveness of Gamesourcing Expert Painting Annotations
 
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool Criticism
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool CriticismThe Nature Of Digitally-Produced Data: Towards Social-Scientific Tool Criticism
The Nature Of Digitally-Produced Data: Towards Social-Scientific Tool Criticism
 
Querylog-based Assessment of Retrievability Bias in Delpher
Querylog-based Assessment of Retrievability Bias in DelpherQuerylog-based Assessment of Retrievability Bias in Delpher
Querylog-based Assessment of Retrievability Bias in Delpher
 
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital ArchivesImpact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
 
Tool Criticism
Tool CriticismTool Criticism
Tool Criticism
 
Estimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
Estimating the Impact of OCR Quality on Research Tasks in the Digital HumanitiesEstimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
Estimating the Impact of OCR Quality on Research Tasks in the Digital Humanities
 
Measuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations
Measuring the Effectiveness of Gamesourcing Expert Oil Painting AnnotationsMeasuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations
Measuring the Effectiveness of Gamesourcing Expert Oil Painting Annotations
 


Impact of Crowdsourcing OCR Improvements on Retrievability Bias

  • 1. Impact of Crowdsourcing OCR Improvements on Retrievability Bias. Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Lynda Hardman. Centrum Wiskunde & Informatica, Amsterdam, NL
  • 2. Motivation: Retrievability (Bias) • Introduced by Azzopardi et al. in 2008 [1] • Retrievability score counts how often a document is retrieved as one of the top K documents by a given set of queries • Gini coefficient quantifies inequality in the distribution of scores [1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
  • 3. Study on Retrievability Bias (JCDL2016) • Follow-up study of Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus • Large-scale study based on 102 million newspaper items, 4 million simulated queries and 957,239 real user queries • Findings: • Large inequalities among the documents, indicating retrievability bias • Document length impacts retrieval; no evidence for other technical bias found • Simulated queries yield very different results than real queries; experiments should take operators and facets into account
  • 4. Potential Causes for Retrievability Bias • Skills and interest of users • Collection bias • Ranking algorithm • UI design • (OCR) quality [Image: Courante uyt Italien, Duytslandt, &c (14-06-1618)]
  • 6. Research Questions • RQ1: Relation between OCR quality and retrievability • RQ2: Direct impact of correction on retrievability bias of corrected documents • RQ3: Indirect impact of correction of a fraction of documents on non-corrected ones • How does bias caused by OCR quality impact my (re-)search results?
  • 8. Documents & Queries • Subset of the historic newspaper archive maintained by the National Library of the Netherlands (public, KB) • Ground truth set of 100 manually corrected newspaper issues (822 articles) published in the 17th century and the WWII period (public, KB) • Character error rates (CER) computed with [1] • User queries collected from delpher.nl (confidential, KB, same as in previous study); stopwords and short terms removed, deduplicated [1] https://www.digitisation.eu/ [Image: De geus onder studenten (14-10-1940)]
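A minimal sketch of how a character error rate can be computed, assuming the common definition "edit distance between OCR text and ground truth, divided by the length of the ground truth". The slides compute CER with the tool referenced as [1]; the function names below are hypothetical and this is not that tool's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Assumed definition: edit distance normalised by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

For example, an OCR reading "amfterdam" of the ground-truth word "amsterdam" contains one substitution in nine characters, so this definition gives a CER of 1/9.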
  • 9. 4 Corpora • Ground truth set (822 documents): • uncorrected • corrected • Ground truth + mixed in (1644 documents): • uncorrected • partially corrected
  • 10. Setup - Retrievability • [Diagram: documents and queries are fed into the Indri search engine [1]] • We report on c=1, c=10, c=100 and c=infinite • Carried out on each of the four corpora [1] http://www.lemurproject.org/
  • 16. Retrievability Scores r(d) • Excerpt from the paper (cropped on the slide): "Azzopardi et al. introduced a way to measure how retrieval systems influence the accessibility of documents in a collection [1]. The retrievability score of a document d, r(d), measures how accessible a document is. It is determined by several factors, including the matching function of the retrieval system and the number of documents a user is willing to evaluate. The retrievability score is the result of a cumulative scoring function, defined as: $r(d) = \sum_{q \in Q} o_q \cdot f(k_{dq}, c)$, where c defines the number of documents a user is willing to examine in a ranked list. We use cutoff values c = 1 […], c = 100, and c = 1000. The coefficient $o_q$ weights the importance of a query. We assign equal weights, with $o_q = 1$. The function $f(k_{dq}, c)$ is a generalized utility/cost function […]" • Annotations: • $r(d)$: retrievability score for a document d • $c$: cutoff value • $k_{dq}$: rank of document d in the result list of a query q • $q \in Q$: for all queries q in a query set Q • $o_q$: possibility to give more weight to certain queries, we use $o_q = 1$ [1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
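The definition above can be sketched in a few lines of Python. This is an illustration, not the authors' pipeline: it assumes the cumulative form of f (a document counts for a query if its rank is within the cutoff c, matching the "top K documents" description on slide 2), and the function name and input format (a dict from query to ranked document ids) are hypothetical.

```python
from collections import defaultdict


def retrievability_scores(ranked_results, c=None, weights=None):
    """Compute r(d) for every document that is retrieved at least once.

    ranked_results: dict mapping a query to its list of document ids,
                    best-ranked first (assumed input format).
    c:              cutoff; None stands for c = infinite.
    weights:        optional dict query -> o_q; defaults to o_q = 1.
    """
    scores = defaultdict(float)
    for query, ranking in ranked_results.items():
        o_q = 1.0 if weights is None else weights.get(query, 1.0)
        # Assumed f(k_dq, c): 1 if the rank of d is within the cutoff, else 0.
        top = ranking if c is None else ranking[:c]
        for doc_id in top:
            scores[doc_id] += o_q
    return dict(scores)
```

Documents that are never retrieved do not appear in the result; they have r(d) = 0 and need to be added back before computing any inequality measure over the whole collection.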
  • 17. Impact Assessment • Wealth: How many documents were retrieved in total? • Sum of all r(d) scores • Equality: How are r(d) scores distributed among documents? • Gini coefficient • Retrieval per document/query: • Changes due to correction • Impact of individual (query) terms
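Both summary measures are straightforward to compute from a list of r(d) scores. The sketch below uses the standard sorted-values formula for the Gini coefficient; the function names are hypothetical, and the slides do not state which implementation the authors used.

```python
def wealth(scores):
    """Wealth: the sum of all r(d) scores."""
    return sum(scores)


def gini(scores):
    """Gini coefficient of non-negative scores.

    0 means every document has the same r(d); values close to 1 mean
    a few documents collect almost all of the wealth.
    """
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending values x_1..x_n:
    # G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i)
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(values, start=1))
    return weighted / (n * total)
```

The score list should include documents with r(d) = 0; leaving them out makes the collection look more equal than it is.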
  • 18. OCR Quality & Retrievability • RQ1: What is the relation between a document’s OCR character error rate and its retrievability score?
  • 19. RQ1: OCR Quality & Retrievability • CER in 17cent collection significantly higher • r(d) scores higher in WWII collection • Correlation between r(d) and CER: -0.57 (Pearson) and -0.61 (Spearman) with p<0.001 [Figure: r(d) score plotted against character error rate (CER), with legends for document length and subset (17cent, WWII)]
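For readers who want to reproduce this kind of analysis on their own data, here is a small dependency-free sketch of the two correlation coefficients. It does not compute p-values, the function names are hypothetical, and the slides do not say which software produced the reported numbers.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (dev_x * dev_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Called with the per-document CER values as x and the r(d) scores as y, a negative result means that documents with more OCR errors tend to be retrieved less often.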
  • 20. Direct Impact of OCR Quality • RQ2: How does the correction of OCR errors impact the retrievability bias of the corrected documents?
  • 22. Impact of Correction on Wealth • More documents retrieved from corrected documents • Number of queries with results increased by 8% • Impact is largest for users willing to look at the entire result list
    [Figure: Sum of all r(d) scores (wealth) by cutoff and condition]
    Cutoff        error-prone    corrected    Change
    c=1               338,139      365,855      +8%
    c=10            1,750,340    2,023,283     +16%
    c=100           4,341,536    5,477,566     +26%
    c=infinite      4,521,030    6,033,099     +34%
  • 23. Impact on Equality • Correction lowers inequality among documents • In contrast to earlier findings, Gini coefficients do not decrease with larger c’s • Correction fixes more false negatives (FN) than false positives (FP) at c=infinite: • Increases both wealth and equality [Figure: Direct Impact: Gini coefficients per cutoff c (1, 10, 100, infinite) for conditions 822GTcor and 822GTerr]
  • 24. Retrieval per Document • Few documents lose r(d) scores after correction: Good, these are former FP caused by OCR errors and no longer retrieved • Most documents, however, gain, with the 17cent corpus improving to a larger extent but still remaining at a lower level
  • 25. Retrieval per Query • Only 44% of the queries retrieved at least one document • Despite small collection size, we see large gains • Some queries lose because they retrieved FP from the uncorrected document set
  • 26. Retrieval per Query • Top 10 terms* cause 35% of the wealth increase. These terms: 1. appear very frequently in user queries and 2. are highly susceptible to OCR errors in the documents • Conclusion: Real queries are also a source of bias • * new, Amsterdam, end, Mister, died/dead, grand/large, Willem (name), two, three, old
    [Figure: cumulative r(d) difference (%) over query terms ordered by difference in impact (descending)]
    Excerpt from the paper (cropped on the slide): "Figure 4: Queries ordered by their gain/loss in number of retrieved documents. The position on the y-axis represents the number of documents retrieved from 822GTcor. […] histograms. The distributions of the differences in r(d) scores in Table 2 show that for all cutoff values, the median of the differences is positive, and increases from 8 (c = 1) to 912 (c = ∞). The maximum loss and the maximum gain in r(d) scores increase for larger cutoff values c, the latter to a much larger extent. Note that for c = 1 and c = 10 the entire first quartile is filled with documents that scored worse in the corrected version. This shows that the competition in the top results makes the gain of some documents the loss of others. Increased retrieval per query: In a final step, we investigated […]"
    Query term    Frequency in queries    822GTerr    822GTcor    Cum. impact
    nieuwe                       1,903          99         166          7.36%
    amsterdam                    7,885          41          57         14.65%
    ende                           185         103         480         18.69%
    heer                           826          20          89         21.99%
    overleden                    3,698           5          18         24.78%
    groot                        1,573         125         153         27.33%
    willem                       5,375           5          13         29.81%
    twee                           319          64         175         31.83%
    drie                           401          34         120         33.81%
    oude                           991          50          78         35.41%
    "Figure 5: The accumulated impact scores of single-term queries show that very few query terms contribute a large fraction of the overall wealth. The top ten query terms ac[…]"
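The cumulative-impact curve can be sketched as follows: sort terms by their gain, then report the running share of the total gain. This is an illustration only. How the authors derived each term's impact is not spelled out on the slide, so the function simply takes a per-term gain as input, and both the function name and the toy numbers in the usage note are made up.

```python
def cumulative_impact(gain_per_term):
    """Return (term, cumulative share in %) pairs, largest gain first.

    gain_per_term: dict mapping a query term to its gain in r(d)
                   after correction (assumed to be given).
    """
    total = sum(gain_per_term.values())
    if total <= 0:
        raise ValueError("total gain must be positive")
    rows = []
    running = 0.0
    ordered = sorted(gain_per_term.items(), key=lambda item: item[1], reverse=True)
    for term, gain in ordered:
        running += gain
        rows.append((term, 100.0 * running / total))
    return rows
```

With made-up gains {"a": 50, "b": 30, "c": 15, "d": 5}, the first two terms already account for 80% of the total, which is the kind of concentration the slide reports for the top ten real query terms.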
  • 27. Indirect Impact of OCR Quality • RQ3: How does the correction of a fraction of error-prone documents influence the retrievability of non-corrected ones?
  • 31. Indirect Impact • [Diagram: two versions of the corpus, "Uncorrected" and "Mixed" (half of the corpus was corrected)] • 50% same documents as in previous RQ • 50% new documents • Annotation: "We’re mainly interested in these documents"
  • 32. Equality still increases! • Equality in r(d) scores is higher in the corrected document collection • Again, correction has decreased retrievability bias [Figure: Indirect Impact: Gini coefficients per cutoff c (1, 10, 100, infinite) for conditions 1644err and 1644mix]
  • 35. Impact on Wealth • Complete: Correction increases wealth • GT only: Increase in wealth: c=1: +20%, c=10: +22%, c=100: +23%, c=infinite: +25% • Mixed-in only: Decrease in wealth: c=1: -13%, c=10: -10%, c=100: -5%
    [Figure: Wealth by cutoff for conditions 1644_err and 1644_mix, in three panels]
    Panel                                 Cutoff         1644_err     1644_mix
    Complete Document Collection          c=1             353,613      376,139
    Complete Document Collection          c=10          2,099,816    2,307,996
    Complete Document Collection          c=100         6,698,945    7,676,830
    Complete Document Collection          c=infinity    8,008,574    9,520,643
    Ground Truth Document Collection      c=1             180,079      225,809
    Ground Truth Document Collection      c=10          1,112,705    1,420,322
    Ground Truth Document Collection      c=100         3,783,514    4,898,694
    Ground Truth Document Collection      c=infinity    4,521,030    6,033,099
    Mixed-in Document Collection          c=1             173,534      150,330
    Mixed-in Document Collection          c=10            987,111      887,674
    Mixed-in Document Collection          c=100         2,915,431    2,778,136
    Mixed-in Document Collection          c=infinity    3,487,544    3,487,544
  • 36. Retrieval per Document (mixed-in only, c=10) • Most documents’ scores change very little, and if they change, they lose r(d) scores • 171 documents gain r(d) scores • Benefit from FP matches that disappeared
  • 38. Conclusions • In our study, OCR correction • Increases overall retrievability • Reduces retrievability bias, even in a partially corrected corpus • Higher scores caused by small set of terms that are • frequent in queries and • susceptible to OCR errors • Using real user queries is essential to understand actual bias caused by OCR errors.
  • 39. Impact of Crowdsourcing OCR Improvements on Retrievability Bias • We would like to thank the National Library of the Netherlands (KB) for making the newspaper corpus and the (sensitive) user data available to us for research. • This research is partly funded by the Dutch COMMIT/ program, the WebART project and the VRE4EIC project, a project that has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 676247.