Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that a document's OCR character error rate and its retrievability score are strongly negatively correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potentially lower ranking of uncorrected documents into account.
Impact of Crowdsourcing OCR Improvements on Retrievability Bias
1. Impact of Crowdsourcing OCR Improvements on Retrievability Bias
Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Lynda Hardman
Centrum Wiskunde & Informatica, Amsterdam, NL
2. Motivation: Retrievability (Bias)
• Introduced by Azzopardi et al. in 2008 [1]
• Retrievability score counts how
often a document is retrieved as one of
the top K documents by a given set of queries
• Gini coefficient quantifies inequality in the distribution of
scores
[1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th
ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
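As a concrete illustration of the second measure, here is a minimal Python sketch (our own, not the paper's implementation) of the Gini coefficient over a set of r(d) scores; the function name and the example scores are made up.

```python
import numpy as np

def gini(scores):
    """Gini coefficient of a distribution of retrievability scores:
    0 means all documents are equally retrievable; values near 1
    indicate that retrieval is concentrated on few documents."""
    x = np.sort(np.asarray(scores, dtype=float))
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    index = np.arange(1, n + 1)  # ranks of the sorted scores
    return (2 * np.sum(index * x)) / (n * x.sum()) - (n + 1) / n

print(gini([5, 5, 5, 5, 5]))   # 0.0 (perfect equality)
print(gini([0, 0, 0, 1, 20]))  # ~0.78 (strong bias)
```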
3. Study on Retrievability Bias (JCDL 2016)
• This work follows up on "Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus"
• Large-scale study based on 102 million newspaper items, 4 million simulated queries and 957,239 real user queries
• Findings:
  • Large inequalities among the documents, indicating retrievability bias
  • Document length impacts retrieval; no evidence for other technical bias was found
  • Simulated queries yield very different results than real queries; experiments should take operators and facets into account
5. Research Questions
• RQ1: What is the relation between OCR quality and retrievability?
• RQ2: What is the direct impact of correction on the retrievability bias of the corrected documents?
• RQ3: What is the indirect impact of correcting a fraction of the documents on the non-corrected ones?
Motivating user question: "How does bias caused by OCR quality impact my (re-)search results?"
8. Documents & Queries
• Subset of the historic newspaper
archive maintained by the National
Library of the Netherlands (public,
KB)
• Ground truth set of 100 manually
corrected newspaper issues (822
articles) published in the 17th century
and WWII period (public, KB)
• Character error rates (CER)
computed with [1]
• User queries collected from
delpher.nl (confidential, KB, same as
in previous study), stopwords, short
term removed, deduplicated
7
[1] https://www.digitisation.eu/ De geus onder studenten
(14-10-1940)
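The slides compute CER with the digitisation.eu tooling; purely to illustrate the measure itself, here is a minimal sketch of CER as Levenshtein distance between OCR output and the corrected reference, normalized by reference length. The function and example strings are our own.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance between the OCR output
    (hypothesis) and the manually corrected text (reference),
    normalized by the reference length."""
    m, n = len(reference), len(hypothesis)
    # One-row dynamic programming over the edit-distance matrix.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(cer("de geus onder studenten", "dc gcus ondcr studcntcn"))  # ~0.22
```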
9. Four Corpora
• Ground truth set (822 documents):
  • uncorrected
  • corrected
• Ground truth + mixed-in documents (1,644 documents):
  • uncorrected
  • partially corrected
10. Setup – Retrievability
• Documents are indexed and queries are run with the Indri search engine [1]
• We report on c = 1, c = 10, c = 100 and c = infinite
• Carried out on each of the four corpora
[1] http://www.lemurproject.org/
11. Retrievability Scores r(d)
Azzopardi et al. introduced a way to measure how retrieval systems influence the accessibility of documents in a collection [1]. The retrievability score of a document d, r(d), measures how accessible a document is. It is determined by several factors, including the matching function of the retrieval system and the number of documents a user is willing to evaluate. The retrievability score is the result of a cumulative scoring function, defined as:

r(d) = \sum_{q \in Q} o_q \cdot f(k_{dq}, c)

where c defines the number of documents a user is willing to examine in a ranked list (we report on c = 1, 10, 100 and infinite), the coefficient o_q weights the importance of a query (we assign equal weights, with o_q = 1), and f(k_{dq}, c) is a generalized utility/cost function.

Reading the formula:
• r(d): retrievability score for a document d
• Q: the sum runs over all queries q in the query set Q
• o_q: possibility to give more weight to certain queries; we use o_q = 1
• k_{dq}: rank of document d in the result list of a query q
• c: cutoff value
(A minimal code sketch follows the reference below.)
[1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 561–570, New York, NY, USA, 2008. ACM.
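A minimal sketch of the cumulative scoring function, using the common binary utility choice f(k_dq, c) = 1 if k_dq ≤ c and 0 otherwise, and o_q = 1 as on the slide. This is our own illustration, not the paper's implementation; the query names and document ids are made up.

```python
from collections import Counter

def retrievability(run, c=None, o_q=lambda q: 1.0):
    """Cumulative retrievability r(d) = sum over queries q of o_q * f(k_dq, c).

    `run` maps each query to its ranked list of document ids.
    f(k_dq, c) is binary: 1 if document d is ranked within the top c
    results for q, else 0. c=None models c = infinite.
    """
    r = Counter()
    for q, ranking in run.items():
        for d in (ranking if c is None else ranking[:c]):
            r[d] += o_q(q)
    return r

run = {"willem": ["d3", "d1"], "amsterdam": ["d1", "d2", "d3"]}
print(retrievability(run, c=1))     # Counter({'d3': 1.0, 'd1': 1.0})
print(retrievability(run, c=None))  # every retrieved document counts
```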
17. Impact Assessment
• Wealth: how many documents were retrieved in total? Measured as the sum of all r(d) scores.
• Equality: how are the r(d) scores distributed among the documents? Measured by the Gini coefficient.
• Retrieval per document/query: changes due to correction, and the impact of individual (query) terms.
(A sketch of the wealth and per-document measures follows below.)
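To make the first and third measures concrete, a small sketch assuming the retrievability() helper from the previous slide; the helper names are ours.

```python
def wealth(r):
    """Wealth: the sum of all r(d) scores, i.e. the total number of
    (query, document) retrievals within the cutoff."""
    return sum(r.values())

def r_changes(r_err, r_cor):
    """Per-document difference in r(d) between the uncorrected (r_err)
    and corrected (r_cor) runs; positive values mean the document
    gained retrievability through OCR correction."""
    docs = set(r_err) | set(r_cor)
    return {d: r_cor.get(d, 0) - r_err.get(d, 0) for d in docs}
```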
18. OCR Quality & Retrievability
RQ1: What is the relation between a document's OCR character error rate and its retrievability score?
19. RQ1: OCR Quality & Retrievability
• The CER in the 17cent collection is significantly higher
• r(d) scores are higher in the WWII collection
• Correlation between r(d) and CER: −0.57 (Pearson) and −0.61 (Spearman), both with p < 0.001 (see the sketch below)
[Scatter plot: r(d) score (0–60,000) against character error rate (0%–80%); point size encodes document length (1000/2000/3000), color the subset (17cent vs. WWII)]
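The reported correlations can be computed with standard tools; a minimal scipy sketch, where the CER and r(d) values are made up for illustration.

```python
from scipy.stats import pearsonr, spearmanr

# One entry per document: its character error rate and its r(d) score.
cer_values = [0.02, 0.08, 0.15, 0.40, 0.65]  # illustrative values
rd_values  = [5400, 4800, 3100, 900, 120]

print(pearsonr(cer_values, rd_values))   # strong negative linear correlation
print(spearmanr(cer_values, rd_values))  # strong negative rank correlation
```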
20. Direct Impact of OCR Quality
RQ2: How does the correction of OCR errors impact the retrievability bias of the corrected documents?
22. Impact of Correction on Wealth
• More documents are retrieved from the corrected documents
• The number of queries with results increased by 8%
• The impact is largest for users willing to look at the entire result list
Sum of all r(d) scores (wealth), error-prone → corrected:
• c = 1: 338,139 → 365,855 (+8%)
• c = 10: 1,750,340 → 2,023,283 (+16%)
• c = 100: 4,341,536 → 5,477,566 (+26%)
• c = infinite: 4,521,030 → 6,033,099 (+34%)
23. Impact on Equality
• Correction lowers the inequality among documents
• In contrast to earlier findings, the Gini coefficients do not decrease for larger values of c
• Correction fixes more false negatives than false positives (c = infinite): it increases both wealth and equality
[Bar chart: "Direct Impact: Gini Coefficients" (0.0–0.6) for c = 1, 10, 100, infinite, comparing 822GTcor and 822GTerr]
24. Retrieval per Document
• Few documents lose r(d) scores after correction. This is good: these are former false positives caused by OCR errors that are no longer retrieved.
• Most documents, however, gain, with the 17cent corpus improving to a larger extent but still remaining at a lower level.
25. Retrieval per Query
• Only 44% of the queries retrieved at least one document
• Despite the small collection size, we see large gains
• Some queries lose because they retrieved false positives from the uncorrected document set
26. Retrieval per Query
The top 10 terms cause 35% of the wealth increase. These terms:
1. appear very frequently in user queries, and
2. are highly susceptible to OCR errors in the documents.
Conclusion: real queries are also a source of bias. (A sketch of the cumulative-impact computation follows below.)

[Figure: cumulative r(d) difference (%) against query terms ordered by difference in impact (descending). Paper captions shown on the slide: "Figure 4: Queries ordered by their gain/loss in number of retrieved documents. The position on the y-axis represents the number of documents retrieved from 822GTcor." and "Figure 5: The accumulated impact scores of single-term queries show that very few query terms contribute a large fraction of the overall wealth."]

Top 10 single-term queries (translations from the slide: new, Amsterdam, end, Mister, died/dead, grand/large, Willem (name), two, three, old):

Query term | Freq. in queries | Freq. in 822GTerr | Freq. in 822GTcor | Cum. impact
nieuwe     | 1,903 | 99  | 166 | 7.36%
amsterdam  | 7,885 | 41  | 57  | 14.65%
ende       | 185   | 103 | 480 | 18.69%
heer       | 826   | 20  | 89  | 21.99%
overleden  | 3,698 | 5   | 18  | 24.78%
groot      | 1,573 | 125 | 153 | 27.33%
willem     | 5,375 | 5   | 13  | 29.81%
twee       | 319   | 64  | 175 | 31.83%
drie       | 401   | 34  | 120 | 33.81%
oude       | 991   | 50  | 78  | 35.41%

From the paper excerpt on the slide: the distributions of the differences in r(d) scores show that for all cutoff values the median of the differences is positive, increasing from 8 (c = 1) to 912 (c = infinite). The maximum loss and the maximum gain in r(d) scores grow with larger cutoff values, the latter to a much larger extent. For c = 1 and c = 10 the entire first quartile consists of documents that scored worse in the corrected version: the competition in the top results makes the gain of some documents the loss of others.
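The cumulative-impact column can be reproduced from per-term gains in wealth; a minimal sketch, where the function name and the toy numbers are ours.

```python
def cumulative_impact(term_gains):
    """Order query terms by their gain in total r(d) (descending) and
    report each term's cumulative share of the overall increase."""
    total = sum(term_gains.values())
    out, cum = [], 0.0
    for term, gain in sorted(term_gains.items(), key=lambda kv: -kv[1]):
        cum += gain
        out.append((term, round(100 * cum / total, 2)))
    return out

# Toy example: the first term alone covers half of the total increase.
print(cumulative_impact({"nieuwe": 50, "heer": 30, "oude": 20}))
# [('nieuwe', 50.0), ('heer', 80.0), ('oude', 100.0)]
```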
27. Indirect Impact of OCR Quality
RQ3: How does the correction of a fraction of the error-prone documents influence the retrievability of the non-corrected ones?
30. Indirect Impact
• Two corpora: a mixed corpus in which half of the documents were corrected, and a fully uncorrected corpus
• In the mixed corpus, 50% are the same documents as in the previous RQ (corrected) and 50% are new, uncorrected documents
• We are mainly interested in these non-corrected documents
32. Equality still increases!
• Equality in r(d) scores is higher in the partially corrected (mixed) collection
• Again, correction has decreased the retrievability bias
[Bar chart: "Indirect Impact: Gini Coefficients" (0.0–0.8) for c = 1, 10, 100, infinite, comparing 1644err and 1644mix]
36. Retrieval per Document (mixed-in only, c = 10)
• Most documents' scores change very little; when they do change, they mostly lose r(d) scores
• 171 documents gain r(d) scores: they benefit from false-positive matches that disappeared after correction
38. Conclusions
• In our study, OCR correction
  • increases overall retrievability
  • reduces retrievability bias, even in a partially corrected corpus
• The higher scores are caused by a small set of terms that are frequent in queries and susceptible to OCR errors
• Using real user queries is essential to understand the actual bias caused by OCR errors
39. Impact of Crowdsourcing OCR Improvements on Retrievability Bias
We would like to thank the National Library of the Netherlands (KB) for making the newspaper corpus and the (sensitive) user data available to us for research.
This research is partly funded by the Dutch COMMIT/ program, the WebART project and the VRE4EIC project, a project that has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 676247.