20130412 Productivity of the crowd [acrl indianapolis]
1. Productivity of the
crowd
Slides @ http://bit.ly/crowdsourceacrl2013
Frederick Zarndt
Chair, IFLA Newspapers Section
CCS / Digital Divide Data / DL Consulting
@cowboyMontana, #crowdsourceacrl2013
frederick@frederickzarndt.com
Brian Geiger
Director, Center for Bibliographic Studies and Research
bgeiger@ucr.edu
Photo held by John Oxley Library, State Library of
Queensland. Original from Courier-mail, Brisbane,
Queensland, Australia.
5. Demographics
“British volunteers for "Kitchener's Army" waiting for their pay in the
churchyard of St. Martin-in-the-Fields, Trafalgar Square, London”
Public domain photo from Imperial War Museum
15. ?
Photo held by John Oxley Library, State Library of
Queensland. Original from Courier-mail, Brisbane,
Queensland, Australia.
16. User Demographic
genealogists and family historians 50+ years old
• In 2012 the National Library of Australia reported
that ~50% of Trove users are family historians
• National Library of New Zealand survey found that
~50% of PapersPast users are genealogists
• A 2013 California Digital Newspaper Collection
survey shows that more than 65% of its users are
genealogists; 75% are 50 years old or older
• A 2012 Utah Digital Newspapers survey showed
that 72% of its users are genealogists*
• A 2013 Cambridge Public Library survey shows
that more than 80% of its users are genealogists;
73% are 50 years old or older
*John Herbert and Randy Olsen. “Small town papers: Still delivering the news”. Paper given at 2012 World
Library and Information Congress. Helsinki. August 2012.
17. raw OCR text newspaper image
Deaths. lln»rieff, Esq. of <c .. Qn.
Sunday, the till. greatly Drandrellt, of
Orms4irJi.- ~ ; ;✓ ' • * On ijfr r inn
ljjjil F iij '11 f Havodivyd,
Carnarvonshire, S ; **" *- ' « ' March
Oxford, F. Tfovmeud, Uerald. » • V .
•On Tncsdav last, Mr. Charles.
IWilinson, this 8 ; had vf thesis#,, a
week ago, which tcrminate<i'iu his
death. . / ' ■ O'i Sunday, dJst nit. at.
AsbtCnvHall, mar Lancaster,
Mr.,Geo. Worn ick, many years
house'steward hit late Once The
Hamilton and Brandon. He locked
himself h»oWn'r«wte<: soon. twelve
o'clock" that dny, and fii»-d a loaded
pistol "through Ins bead, 1 which
instantaneously killed him. Coronet's
Verdict, shot himself in a temporary fit of
Friday week,
Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.
18. Edwin Klijn (Koninklijke Bibliotheek, the Netherlands)
reports raw OCR character accuracies of 68% for early 20th
century newspapers
Rose Holley (National Library of Australia) reports that raw OCR
character accuracy varied from 71% to 98% on a sample of Trove
digitized newspapers
Edwin Klijn. “The current state-of-art in newspaper digitization.” D-Lib Magazine. January/February 2008.
Rose Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper
digitisation programs.” D-Lib Magazine. March/April 2009.
Public domain graphic images courtesy of Wikimedia Commons.
Graphic is logo for Accuracy in Media (http://www.aim.org/)
19. Crowdsourcing is the practice of obtaining
needed services, ideas, or content by
soliciting contributions from a large group of
people, and especially from an online
community, rather than from traditional
employees or suppliers. ... [It] is different
from ordinary outsourcing since it is a task or
problem that is outsourced to an undefined
public rather than a specific, named group.
Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/
Crowdsourcing (accessed March 17, 2013)
20. Motivation
Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in
Crowdsourcing – A Study on Mechanical Turk.”
21. You can make a
difference
Graphic courtesy of TYPEinspire (http://typeinspire.com/)
23. uncorrected OCR accuracy by newspaper title
Title                                                             raw character accuracy   ~raw word accuracy*
PRP  Pacific Rural Press 1871 - 1922                              92.6%                    68.1%
SFC  San Francisco Call 1890 - 1913                               92.6%                    68.1%
LAH  Los Angeles Herald 1873 - 1910                               88.7%                    54.9%
LH   Livermore Herald 1877 - 1899                                 88.6%                    54.6%
DAC  Daily Alta California 1841 - 1891                            88.2%                    53.4%
CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880 86.5%                    48.4%
SN   Sausalito News 1885 - 1922                                   70.4%                    17.3%
*Word accuracy assumes average word length is 5 characters
24. corrected OCR accuracy by newspaper title
Title                                                             raw character accuracy   corrected accuracy
PRP  Pacific Rural Press 1871 - 1922                              92.6%                    99.3%
SFC  San Francisco Call 1890 - 1913                               92.6%                    99.6%
LAH  Los Angeles Herald 1873 - 1910                               88.7%                    99.1%
LH   Livermore Herald 1877 - 1899                                 88.6%                    99.9%
DAC  Daily Alta California 1841 - 1891                            88.2%                    99.9%
CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880 86.5%                    99.8%
SN   Sausalito News 1885 - 1922                                   70.4%                    100.0%
25. corrected OCR accuracy by newspaper title
Title             raw character accuracy   ~raw word accuracy*   corrected accuracy   ~corrected word accuracy*
PRP 1871 - 1922   92.6%                    68.1%                 99.3%                96.5%
SFC 1890 - 1913   92.6%                    68.1%                 99.6%                98.0%
LAH 1873 - 1910   88.7%                    54.9%                 99.1%                95.6%
LH  1877 - 1899   88.6%                    54.6%                 99.9%                99.5%
DAC 1841 - 1891   88.2%                    53.4%                 99.9%                99.5%
CFJ 1855 - 1880   86.5%                    48.4%                 98.3%                91.8%
SN  1885 - 1922   70.4%                    17.3%                 100.0%               100.0%
*Word accuracy assumes average word length is 5 characters
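The word-accuracy columns can be reproduced with a short calculation. This is a minimal sketch, assuming a word counts as correct only when every one of its characters is correct and that character errors are independent; the function name word_accuracy and the dictionary layout are illustrative, not taken from the slides.

```python
# Sketch: approximate word accuracy from character accuracy.
# Assumption: character errors are independent, so a word of n characters
# is fully correct with probability (character accuracy) ** n.

def word_accuracy(char_accuracy: float, word_length: int = 5) -> float:
    """Probability that every character of a word is recognized correctly."""
    return char_accuracy ** word_length

# (raw, corrected) character accuracies per title, copied from the table above.
titles = {
    "PRP": (0.926, 0.993),
    "SFC": (0.926, 0.996),
    "LAH": (0.887, 0.991),
    "LH": (0.886, 0.999),
    "DAC": (0.882, 0.999),
    "CFJ": (0.865, 0.983),
    "SN": (0.704, 1.000),
}

for title, (raw, corrected) in titles.items():
    print(f"{title:<4} raw word {word_accuracy(raw):6.1%}   "
          f"corrected word {word_accuracy(corrected):6.1%}")
```

Because accuracy is raised to the fifth power, a few points of character accuracy translate into a much larger swing in word accuracy.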
26. correction accuracy by user
User   average uncorrected text accuracy   average corrected text accuracy
A      70.4%                               100.0%
B      87.1%                               99.5%
C      95.4%                               99.5%
D      86.5%                               98.3%
E      95.3%                               100.0%
F      91.0%                               100.0%
G      91.0%                               99.8%
H      90.5%                               99.0%
I      96.6%                               99.8%
J      94.8%                               100.0%
K      86.8%                               99.3%
27. Crowdsourcing
benefits
Public domain photo courtesy of US Navy
28. Economics
Financial value of outsourced OCR text correction for
newspapers?
The Assumptions
• 25 to 50 characters per line in a newspaper column:
Assume 40 characters per line (CDNC sample average)
• Outsourced text transcription or correction costs USD
$0.35 to $1.20 per 1000 characters: Assume $0.50 per
1000 characters
29. Economics
• 578,000 lines x 40 characters per line x 1/1000 x
$0.50 = $11,560
• 68,908,757 lines x 40 characters per line x
1/1000 x $0.50 = $1,378,175
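The same arithmetic as a small function, so that the assumed characters per line or price per 1000 characters can be varied. A minimal sketch; the function and parameter names are illustrative, and the defaults are the assumptions stated on the preceding slide.

```python
# Sketch: value of outsourced OCR text correction.
# Defaults are the slide's assumptions: 40 characters per line and
# USD 0.50 per 1000 characters.

def outsourced_cost(lines: int,
                    chars_per_line: int = 40,
                    usd_per_1000_chars: float = 0.50) -> float:
    """Cost in USD of having `lines` lines corrected by a vendor."""
    return lines * chars_per_line / 1000 * usd_per_1000_chars

for line_count in (578_000, 68_908_757):
    print(f"{line_count:>12,} lines -> ${outsourced_cost(line_count):,.0f}")
```

Rerunning with usd_per_1000_chars set to 0.35 or 1.20 gives the low and high ends of the quoted price range.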
30. Economics
Financial value of in-house OCR text correction?
The Assumptions
• Correction takes 15 seconds per line
• Cost is hourly wage plus benefits of lowest level
employee, $10 for CDNC, $41.88* for Australia
*AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate
avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.
31. Economics
• 578,000 lines x 15 seconds per line x 1/3600 hrs
per second x $10.00 per hr = $24,083
• 68,908,757 lines x 15 seconds per line x 1/3600 hrs
per second x $41.88 per hr = $12,024,578
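The in-house estimate converts lines to staff hours and then to dollars. A minimal sketch under the slide's assumptions (15 seconds per line, hourly cost of the lowest level employee); the function and parameter names are illustrative.

```python
# Sketch: value of in-house OCR text correction.
# Assumptions from the slides: 15 seconds per line; hourly cost (wage plus
# benefits) of USD 10.00 for CDNC and USD 41.88 for Australia.

SECONDS_PER_HOUR = 3600

def in_house_cost(lines: int,
                  hourly_cost_usd: float,
                  seconds_per_line: float = 15) -> float:
    """Cost in USD of staff correcting `lines` lines."""
    hours = lines * seconds_per_line / SECONDS_PER_HOUR
    return hours * hourly_cost_usd

print(f"${in_house_cost(578_000, hourly_cost_usd=10.00):,.0f}")
print(f"${in_house_cost(68_908_757, hourly_cost_usd=41.88):,.0f}")
```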
32. Accuracy
“His Accuracy Depends on Ours!"
Office for Emergency Management. Office of War Information.
Domestic Operations Branch. Bureau of Special Services. [Photo
held at US National Archives and Records Administration]
33. Accuracy
How does low text accuracy affect search recall?
The Facts
Average uncorrected OCR character accuracy of the
CDNC data is ~89%
Average length of an English word is 5 characters
Average word accuracy is 89% x 89% x 89% x 89% x
89% = 55.8% - round up to 60% or 6 out of 10 words
correct
Public domain graphic images courtesy of Wikimedia Commons.
35. Accuracy
The Facts
Average corrected character accuracy of the CDNC
data is ~99.4%
Average word accuracy of the CDNC corrected text
is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%
Public domain graphic images courtesy of Wikimedia Commons.
37. Accuracy
A search for my grandmother’s maiden name
“Arndt” gives 11,154 results*
* Search performed 8 April 2013
Public domain graphic image courtesy of Wikimedia Commons.
38. Accuracy
A search for my grandmother’s maiden name
“Arndt” gives 11,154 results*
If text accuracy is 55.8% (same as uncorrected CDNC
sample), then 8,835 instances of “Arndt” were not found
* Search performed 8 April 2013
Public domain graphic images courtesy of Wikimedia Commons.
39. Accuracy
A search for my grandmother’s maiden name
“Arndt” gives 11,154 results*
If text accuracy is 55.8% (same as uncorrected CDNC
sample), then 8,835 instances of “Arndt” were not found
If text accuracy is 97.0%, then 345 instances of “Arndt”
were not found
* Search performed 8 April 2013
Public domain graphic images courtesy of Wikimedia Commons.
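The slides do not show how the "not found" counts were derived. One reading that reproduces both figures, offered here as an assumption, is that each true occurrence of the name is found with probability equal to the word accuracy a, so the number of results returned is roughly a times the true number of occurrences. The symbols below are illustrative.

```latex
\begin{align*}
N_{\text{true}} &\approx \frac{N_{\text{found}}}{a}, \qquad
N_{\text{missed}} = N_{\text{true}} - N_{\text{found}}
                  = N_{\text{found}} \cdot \frac{1 - a}{a} \\
a = 0.558:\quad N_{\text{missed}} &\approx 11{,}154 \times \frac{0.442}{0.558} \approx 8{,}835 \\
a = 0.970:\quad N_{\text{missed}} &\approx 11{,}154 \times \frac{0.030}{0.970} \approx 345
\end{align*}
```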
40. Accuracy
Suppose the name is longer than 5 characters?
The Facts
Assume that average uncorrected / corrected OCR
character accuracy is ~89% / ~99%, the same as CDNC.
Name         name length   raw text accuracy   corrected text accuracy
Eklund       6             49.7%               94.2%
Kennedy      7             44.2%               93.25%
Espinosa     8             39.4%               92.3%
Bonaparte    9             35.0%               91.4%
Chatterjee   10            31.2%               90.4%
Public domain graphic images courtesy of Wikimedia Commons.
41. Accuracy
Searches done 19-Mar-2013 (6,025,474 pages
from 1836 to 1922).
Name         Number of search results   Missing results with raw text accuracy   Missing results with corrected text accuracy
Eklund       2,951                      2,987                                    182
Kennedy      360,723                    455,392                                  26,111
Espinosa     1,918                      2,950                                    160
Bonaparte    44,664                     82,947                                   4,203
Chatterjee   19                         42                                       2
Public domain graphic images courtesy of Wikimedia Commons.
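The per-name figures can be approximated by combining name length with the number of search results. This is a minimal sketch, assuming independent character errors and that the results returned are roughly the true occurrences multiplied by the chance of the whole name being recognized; the helper names are illustrative. Outputs can differ slightly from the slide values, presumably because the slide figures were computed from rounded percentages.

```python
# Sketch: chance of a name being fully recognized, and the estimated number
# of occurrences a search misses, for the names on the slides.
# Assumptions: independent character errors; found ~= true * hit_rate.

RAW_CHAR_ACCURACY = 0.89        # ~uncorrected CDNC character accuracy
CORRECTED_CHAR_ACCURACY = 0.99  # ~corrected CDNC character accuracy

def hit_rate(char_accuracy: float, name_length: int) -> float:
    """Probability that all characters of the name are correct."""
    return char_accuracy ** name_length

def estimated_missed(found: int, rate: float) -> float:
    """Occurrences not returned, given `found` results and a hit rate."""
    return found * (1 - rate) / rate

# Search result counts copied from the slide (searches done 19-Mar-2013).
searches = {
    "Eklund": 2_951,
    "Kennedy": 360_723,
    "Espinosa": 1_918,
    "Bonaparte": 44_664,
    "Chatterjee": 19,
}

for name, found in searches.items():
    raw = hit_rate(RAW_CHAR_ACCURACY, len(name))
    corrected = hit_rate(CORRECTED_CHAR_ACCURACY, len(name))
    print(f"{name:<11}{len(name):>3}{raw:>8.1%}{corrected:>8.1%}"
          f"{estimated_missed(found, raw):>10,.0f}"
          f"{estimated_missed(found, corrected):>10,.0f}")
```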
42. Hard-to-measure-but-
shouldn’t-be-overlooked
benefits
Public domain photo “A useful instruction for young sailors from the Royal Hospital
School, Greenwich” from the National Maritime Museum.
43. HTMBSBO benefit
“when someone transcribes a document, they are
actually better fulfilling the mission of a cultural
heritage organization than someone who simply stops
by to flip through the pages”
Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/
44. HTMBSBO benefit
“in addition to increasing search accuracy or lowering
the costs of document transcription, crowdsourcing is
the single greatest advancement in getting people using
and interacting with library collections”
Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/
45. Cognitive surplus
... people are learning to use their free time for creative activities
rather than consumptive ones [such as watching TV] ...
... the total human cognitive effort in creating all of Wikipedia in
every language is about one hundred million hours ...
... Americans alone watch two hundred billion hours of TV every
year, or enough time, if it were devoted to projects similar to
Wikipedia, to create about 2000 of them ...
Clay Shirky. Cognitive surplus: Creativity and generosity in a connected age. Penguin Press. New York. 2010.
46. Conclusion of the Sonata for piano #32, opus 111
by Ludwig van Beethoven
47. ?
48. Try crowdsourcing!
Correct California newspapers at http://cdnc.ucr.edu
Correct Australian newspapers http://trove.nla.gov.au
Correct Cambridge MA newspapers http://bit.ly/cambridgepublic
Correct Tennessee newspapers http://tndp.lib.utk.edu
Correct Virginia newspapers http://virginiachronicle.com
49. Hãy thử crowdsourcing! (Try crowdsourcing!)
Correct Vietnamese newspapers http://bit.ly/nationallibraryofvietnam
Попробуйте краудсорсинга! (Try crowdsourcing!)
Or try Russian language periodicals http://bit.ly/russianperiodicals
Kokeile crowdsourcing! (Try crowdsourcing!)
Or try Finnish newspapers http://digi.lib.helsinki.fi/sanomalehti
50. Motivation
Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in
Crowdsourcing – A Study on Mechanical Turk.”
51. Motivation
Trove users’ report
• “I enjoy the correction - it’s a great way to learn more about past
history and things of interest whilst doing a ‘service to the
community’ by correcting text for the benefit of others.”
• “I have recently retired from IT and thought that I could be of
some assistance to the project. It benefits me and other people. It
helps with family research.”
From Rose Holley in “Many Hands Make Light Work.” National Library of Australia March 2009.
52. Motivation
CDNC users’ report
“I am interested in all kinds of history. I have pursued genealogy as a
hobby for many years. I correct text at CDNC because I see it as a
constructive way to contribute to a worthwhile project. Because I am
interested in history, I enjoy it.”
Wesley, California
Personal communications with CDNC text correctors.
53. Motivation
CDNC users’ report
“I only correct the text on articles of local interest - nothing at state,
national or international level, no advertisements, etc. The objective
is to be able to help researchers to locate local people, places,
organizations and events using the on-line search at CDNC. I correct
local news & gossip, personal items, real estate transactions, superior
court proceedings, county and local board of supervisors meetings,
obituaries, birth notices, marriages, yachting news, etc.”
Ann, California
Personal communications with CDNC text correctors.
54. Motivation
CDNC users’ report
“I am correcting text for the Coronado Tent City Program for 1903.
It is important to correct any problems with personal names and
other information so that researchers will be able to search by
keyword and be assured of retrieving desired results. ... type fonts
cause a great deal of difficulty in digitizing the text and can cause
problems for searchers. Also, many of the guests' names at Tent
City and Hotel Del Coronado were taken from the registration books
and reported in the Program. This led to many problems in spelling
of last names and the editors were not careful to be consistent in the
spellings. This Program is an important resource since it provides
an excellent picture of daily life in Tent City and captures much of
the history of Coronado itself.”
Gene, California
Personal communications with CDNC text correctors.
55. Motivation
CDNC users’ report
“I have always been interested in history, especially the
development of the American West, and nothing brings it alive
better than newspapers of the time. I believe them to be an
invaluable source of knowledge for us and future generations.”
David, United Kingdom
Personal communications with CDNC text correctors.
56. Motivation
CDNC users’ report
“CDNC is an excellent source of information matching my
personal interest in such topics as sea history, development of
shipbuilding, clippers and other ships etc. ... Unfortunately, the
quality of text ... is rather poor I’m afraid. This is why I started to
do all corrections necessary for myself ... and to leave the
corrected text for use of others. ... I am not doing this very
regularly as this is just my hobby and pleasure.”
Jerzey, Poland
Personal communications with CDNC text correctors.
57. Other resources
Mapping Texts at http://mappingtexts.stanford.edu/
Wragge Labs at http://wraggelabs.com/
Wikipedia list of crowdsourcing projects
https://en.wikipedia.org/wiki/
List_of_crowdsourcing_projects
58. As of 17-Mar-2013 the National Library of Australia’s (http://trove.nla.gov.au/) Alexa Internet traffic
rank is 14,490 (global) / 330 (Australia). Trove gets ~75% of all National Library web traffic.
59. National Library of
Australia
• Online since 2008
• 8,000,000+ pages
• Top text corrector 1,772,090 lines
• 2,400,000+ lines corrected each month (average for
Mar 2012 to Mar 2013)
• 90,489,875 lines corrected as of Mar 2013, up from
61,682,883 lines corrected Mar 2012
• 88,935 total registered users
• 8,743 active users
Statistics from private communication with the National Library of Australia Oct 2012
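The monthly average follows from the two cumulative totals above. A minimal sketch, assuming the two figures are cumulative counts taken twelve months apart; the variable names are illustrative.

```python
# Sketch: average lines corrected per month on Trove, from the two
# cumulative totals on the slide (assumed to be twelve months apart).

lines_mar_2012 = 61_682_883
lines_mar_2013 = 90_489_875
months = 12

monthly_average = (lines_mar_2013 - lines_mar_2012) / months
print(f"{monthly_average:,.0f} lines corrected per month on average")
```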
60. As of 17-Mar-2013 National Library of Finland’s (http://www.nationallibrary.fi/) Alexa Internet global
traffic rank is 4,303,901. Its Internet traffic rank for Finland was 199 as of 2-Apr-2012.
61. National Library of
Finland
• Digitalkoot is a project to improve OCR text in digitized
newspapers -- by playing games!
• Digitalkoot is a collaboration between the National
Library and Microtask
• Players correct OCR text by playing Myyräsillassa
(Mole Bridge) or Myyräjahdissa (Mole Hunt)
• National Library has 4,000,000+ digitized pages
• 109,321 registered players (October 2012)
• Since February 2011 8,024,530 micro-tasks have been
completed
62. As of 17-Mar-2013 UC Riverside’s Alexa Internet traffic rank is 11,782 (global) / 4,120 (USA).
CDNC gets ~3.30% of all UC Riverside web traffic.
63. California Digital
Newspaper Collection
• CDNC began digitizing newspapers in 2005 as part of
the Library of Congress National Digital Newspapers
Program (NDNP)
• Newspapers digitized to article-level in addition to
page-level as required by NDNP (same as Utah Digital
Newspapers)
• Since 2009 hosted on Veridian at http://cdnc.ucr.edu
• Collection size 55,970 issues, 495,175 pages, 5,658,224
articles, 498,000,000+ lines (Mar 2013)
64. OCR text correction
• OCR text correction added August 2011
• Corrections are done line by line
• ~578,000 lines of text corrected as of Oct 2012
• 935,398 lines of text corrected as of Mar 2013
• ~2% of the collection corrected, 98% to go!
• Top corrector: 327,244 lines, more than twice the second-ranked corrector
66. Cambridge Public Library
Historic Newspaper Collection
• Cambridge Historic Newspapers online since Jan 2012.
• Cambridge Massachusetts Public Library digitized local
newspapers (http://cambridge.dlconsulting.com/)
• Newspapers digitized to article-level
• Collection size 6,346 issues, 59,070 pages, 669,406
articles (Mar-2013)
• Collection includes 13,099 obituary cards