Productivity of the
               crowd
               Slides @ http://bit.ly/crowdsourceacrl2013

                            Frederick Zarndt
                    Chair, IFLA Newspapers Section
                CCS / Digital Divide Data / DL Consulting
                @cowboyMontana, #crowdsourceacrl2013
                    frederick@frederickzarndt.com

                          Brian Geiger
    Director, Center for Bibliographic Studies and Research
                        bgeiger@ucr.edu
Photo held by John Oxley Library, State Library of
Queensland. Original from Courier-mail, Brisbane,
Queensland, Australia.
News
Crowds
Crowds
   +
 News
Demographics



“British volunteers for "Kitchener's Army" waiting for their pay in the
churchyard of St. Martin-in-the-Fields, Trafalgar Square, London”
Public domain photo from Imperial War Museum
purpose / motive /reason

       50%
purpose / motive /reason

       50%
purpose / motive /reason

       72%
purpose / motive /reason

       80%
purpose / motive /reason

       67%
?
age
50%
age
80%
age
67%
?
User Demographic
genealogists and family historians 50+ years old

• In 2012 the National Library of Australia reported
  that ~50% of Trove users are family historians
• A National Library of New Zealand survey found that
  ~50% of PapersPast users are genealogists
• A 2013 California Digital Newspaper Collection
  survey shows that more than 65% of its users are
  genealogists; 75% are 50 years old or older
• A 2012 Utah Digital Newspapers survey showed
  that 72% of its users are genealogists*
• A 2013 Cambridge Public Library survey shows
  that more than 80% of its users are genealogists;
  73% are 50 years old or older

*John Herbert and Randy Olsen. “Small town papers: Still delivering the news”. Paper given at 2012 World
Library and Information Congress. Helsinki. August 2012.
raw OCR text                                     newspaper image
Deaths. lln»rieff, Esq. of <c .. Qn.
Sunday, the till. greatly Drandrellt, of
Orms4irJi.- ~ ; ;✓ ' • * On ijfr r inn
ljjjil F iij '11 f Havodivyd,
Carnarvonshire, S ; **" *- ' « ' March
Oxford, F. Tfovmeud, Uerald. » • V .
•On Tncsdav last, Mr. Charles.
IWilinson, this 8 ; had vf thesis#,, a
week ago, which tcrminate<i'iu his
death. . / ' ■ O'i Sunday, dJst nit. at.
AsbtCnvHall, mar Lancaster,
Mr.,Geo. Worn ick, many years
house'steward hit late Once The
Hamilton and Brandon. He locked
himself h»oWn'r«wte<: soon. twelve
o'clock" that dny, and fii»-d a loaded
pistol "through Ins bead, 1 which
instantaneously killed him. Coronet's
Verdict, shot himself in a temporary fit of
Friday week,



Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.
Edwin Klijn (Koninklijke Bibliotheek, the Netherlands)
         reports raw OCR character accuracies of 68% for early 20th
         century newspapers

Rose Holley (National Library of Australia) reports raw OCR
         character accuracy varying from 71% to 98% on a sample of Trove
         digitized newspapers

Edwin Klijn. “The current state-of-art in newspaper digitization.” D-Lib Magazine. January/February 2008.
Rose Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper
digitisation programs.” D-Lib Magazine. March/April 2009.
Public domain graphic images courtesy of Wikimedia Commons.
Graphic is logo for Accuracy in Media (http://www.aim.org/)
Crowdsourcing is the practice of obtaining
                needed services, ideas, or content by
           soliciting contributions from a large group of
                people, and especially from an online
              community, rather than from traditional
              employees or suppliers. ... [It] is different
           from ordinary outsourcing since it is a task or
            problem that is outsourced to an undefined
             public rather than a specific, named group.


Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/
Crowdsourcing (accessed March 17, 2013)
Motivation
Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in
Crowdsourcing – A Study on Mechanical Turk.”
You can make a
               difference




Graphic courtesy of TYPEinspire (http://typeinspire.com/)
User    Lines corrected    Lines corrected    User
   1            242,965          1,456,906       1
   2             87,515          1,385,369       2
   3             31,318          1,010,360       3
   4             24,144            960,230       4
   5             23,184            847,340       5
   6             19,240            786,147       6
   7             18,898            657,187       7
   8             16,875            600,513       8
   9             11,784            582,276       9
  10              9,762            565,384      10

Statistics from Oct 2012
uncorrected OCR accuracy by newspaper title

Title                                                               raw character accuracy   ~raw word accuracy*
PRP Pacific Rural Press 1871 - 1922                                 92.6%                    68.1%
SFC San Francisco Call 1890 - 1913                                  92.6%                    68.1%
LAH Los Angeles Herald 1873 - 1910                                  88.7%                    54.9%
LH Livermore Herald 1877 - 1899                                     88.6%                    54.6%
DAC Daily Alta California 1841 - 1891                               88.2%                    53.4%
CFJ California Farmer and Journal of Useful Sciences 1855 - 1880    86.5%                    48.4%
SN Sausalito News 1885 - 1922                                       70.4%                    17.3%

*Word accuracy assumes average word length is 5 characters
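The word-accuracy column follows from the character-accuracy column under the slide's stated assumption of 5-character words. A minimal Python sketch of that arithmetic is below; it additionally assumes that character errors are independent of one another, and the function name `word_accuracy` and the dictionary of titles are illustrative, not part of the original analysis.

```python
def word_accuracy(char_accuracy: float, word_length: int = 5) -> float:
    """Chance that every character of a word is correct.

    Assumes each character is recognized independently with the same
    probability, so the per-character accuracies simply multiply.
    """
    return char_accuracy ** word_length


# Raw character accuracies as given in the table, keyed by title abbreviation.
raw_char_accuracy = {
    "PRP": 0.926,
    "SFC": 0.926,
    "LAH": 0.887,
    "LH": 0.886,
    "DAC": 0.882,
    "CFJ": 0.865,
    "SN": 0.704,
}

for title, accuracy in raw_char_accuracy.items():
    print(f"{title}: {accuracy:.1%} characters -> {word_accuracy(accuracy):.1%} words")
```

Because the accuracy is raised to the fifth power, a few points of character accuracy translate into a much larger gap in word accuracy, which is why Sausalito News falls so far below the other titles.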
corrected OCR accuracy by newspaper title

Title                                                               raw character accuracy   corrected accuracy
PRP Pacific Rural Press 1871 - 1922                                 92.6%                    99.3%
SFC San Francisco Call 1890 - 1913                                  92.6%                    99.6%
LAH Los Angeles Herald 1873 - 1910                                  88.7%                    99.1%
LH Livermore Herald 1877 - 1899                                     88.6%                    99.9%
DAC Daily Alta California 1841 - 1891                               88.2%                    99.9%
CFJ California Farmer and Journal of Useful Sciences 1855 - 1880    86.5%                    98.3%
SN Sausalito News 1885 - 1922                                       70.4%                    100.0%
corrected OCR accuracy by newspaper title

Title             raw character accuracy   ~raw word accuracy*   corrected accuracy   ~corrected word accuracy*
PRP 1871 - 1922   92.6%                    68.1%                 99.3%                96.5%
SFC 1890 - 1913   92.6%                    68.1%                 99.6%                98.0%
LAH 1873 - 1910   88.7%                    54.9%                 99.1%                95.6%
LH 1877 - 1899    88.6%                    54.6%                 99.9%                99.5%
DAC 1841 - 1891   88.2%                    53.4%                 99.9%                99.5%
CFJ 1855 - 1880   86.5%                    48.4%                 98.3%                91.8%
SN 1885 - 1922    70.4%                    17.3%                 100.0%               100.0%

*Word accuracy assumes average word length is 5 characters
correction accuracy by user

User   average uncorrected text accuracy   average corrected text accuracy
A      70.4%                               100.0%
B      87.1%                               99.5%
C      95.4%                               99.5%
D      86.5%                               98.3%
E      95.3%                               100.0%
F      91.0%                               100.0%
G      91.0%                               99.8%
H      90.5%                               99.0%
I      96.6%                               99.8%
J      94.8%                               100.0%
K      86.8%                               99.3%
Crowdsourcing
   benefits




        Public domain photo courtesy of US Navy
$
                 Economics

   Financial value of outsourced OCR text correction for
   newspapers?
   The Assumptions
$ 25 to 50 characters per line in a newspaper column:
  Assume 40 characters per line (CDNC sample average)
$ Outsourced text transcription or correction costs USD
  $0.35 to $1.20 per 1000 characters: Assume $0.50 per
  1000 characters
$
       Economics


$ 578,000 lines x 40 characters per line x 1/1000 x
  $0.50 = $11,560


$ 68,908,757 lines x 40 characters per line x
  1/1000 x $0.50 = $1,378,175
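The outsourcing estimate can be reproduced, and its sensitivity to the quoted price range checked, with a short Python sketch. The function name `outsourced_cost_usd` is illustrative. The 40 characters per line and the $0.35 / $0.50 / $1.20 prices come from the assumptions slide. The low and high figures the loop prints are simple extrapolations from that price range, not numbers from the original slides.

```python
def outsourced_cost_usd(lines: int,
                        chars_per_line: int = 40,
                        usd_per_1000_chars: float = 0.50) -> float:
    """Estimated vendor cost of correcting `lines` lines of OCR text."""
    return lines * chars_per_line / 1000 * usd_per_1000_chars


# Line counts taken from the slide above.
line_counts = [578_000, 68_908_757]

for lines in line_counts:
    low = outsourced_cost_usd(lines, usd_per_1000_chars=0.35)
    assumed = outsourced_cost_usd(lines)  # the slide's $0.50 assumption
    high = outsourced_cost_usd(lines, usd_per_1000_chars=1.20)
    print(f"{lines:,} lines: ${low:,.0f} to ${high:,.0f}, ${assumed:,.0f} at the assumed price")
```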
$
                                   Economics

               Financial value of in-house OCR text correction?
               The Assumptions
         $ Correction takes 15 seconds per line
         $ Cost is hourly wage plus benefits of lowest level
           employee, $10 for CDNC, $41.88* for Australia




*AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate
avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.
$
          Economics


$ 578,000 lines x 15 seconds per line x 1/3600 hrs
  per second x $10.00 per hr = $24,083


$ 68,908,757 lines x 15 seconds per line x 1/3600 hrs
  per second x $41.88 per hr = $12,024,578
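The in-house estimate is the same kind of back-of-the-envelope calculation, with time per line and an hourly wage in place of a per-character price. A minimal Python sketch follows; the function name `in_house_cost_usd` is illustrative, and the defaults are the assumptions stated on the slides (15 seconds per line, $10 per hour for CDNC).

```python
def in_house_cost_usd(lines: int,
                      seconds_per_line: float = 15,
                      hourly_wage_usd: float = 10.00) -> float:
    """Estimated staff cost of correcting `lines` lines of OCR text."""
    hours = lines * seconds_per_line / 3600  # 3600 seconds per hour
    return hours * hourly_wage_usd


cdnc = in_house_cost_usd(578_000)                               # $10.00 per hour
australia = in_house_cost_usd(68_908_757, hourly_wage_usd=41.88)  # USD wage from the footnote

print(f"578,000 lines at $10.00/hr:     ${cdnc:,.0f}")
print(f"68,908,757 lines at $41.88/hr:  ${australia:,.0f}")
```

Changing `seconds_per_line` or `hourly_wage_usd` shows how strongly the avoided-cost figure depends on those two assumptions.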
Accuracy



“His Accuracy Depends on Ours!"
Office for Emergency Management. Office of War Information.
Domestic Operations Branch. Bureau of Special Services. [Photo
held at US National Archives and Records Administration]
Accuracy

        How does low text accuracy affect search recall?
        The Facts
          Average uncorrected OCR character accuracy of the
        CDNC data is ~89%
             Average length of an English word is 5 characters
           Average word accuracy is 89% x 89% x 89% x 89% x
        89% = 55.8% - round up to 60% or 6 out of 10 words
        correct


Public domain graphic images courtesy of Wikimedia Commons.
Search recall no text correction

[Graphic: instances of “ARNDT” found vs. instances of “ARNDT” not found]

Public domain graphic images courtesy of Wikimedia Commons.
Image © Nevit Dilmen found at Wikimedia commons
Accuracy


         The Facts
            Average corrected character accuracy of the CDNC
         data is ~99.4%
             Average word accuracy of the CDNC corrected text
         is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%




Public domain graphic images courtesy of Wikimedia Commons.
Search recall with text correction

[Graphic: instances of “ARNDT” found vs. instances of “ARNDT” not found]

Public domain graphic image courtesy of Wikimedia Commons.
Image © Nevit Dilmen found at Wikimedia commons
Accuracy

          A search for my grandmother’s maiden name
         “Arndt” gives 11,154 results*
           If text accuracy is 55.8% (same as uncorrected CDNC
         sample), then 8,835 instances of “Arndt” were not found
            If text accuracy is 97.0%, then 345 instances of “Arndt”
         were not found




*   Search performed 8 April 2013
Public domain graphic images courtesy of Wikimedia Commons.
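The missed-result figures can be reconstructed as follows: if a search returns N hits and only a fraction p of the occurrences are spelled correctly in the text, the true number of occurrences is roughly N / p, so about N / p - N were missed. The Python sketch below is one reading of how the slide's numbers were obtained (it reproduces them), and the function name `estimated_missed` is illustrative.

```python
def estimated_missed(found: int, word_accuracy: float) -> int:
    """Estimate how many occurrences a search missed.

    Assumes `found` hits represent the fraction `word_accuracy` of all
    true occurrences, so the true total is found / word_accuracy.
    """
    estimated_total = found / word_accuracy
    return round(estimated_total - found)


arndt_hits = 11_154  # search results for "Arndt" reported on the slide

print(estimated_missed(arndt_hits, 0.558))  # uncorrected CDNC word accuracy
print(estimated_missed(arndt_hits, 0.970))  # corrected CDNC word accuracy
```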
Accuracy

      Suppose the name is longer than 5 characters?
      The Facts
         Assume that average uncorrected / corrected OCR
      character accuracy is ~89% / ~99%, the same as CDNC.

      Name         name length   raw text accuracy   corrected text accuracy
      Eklund       6             49.7%               94.2%
      Kennedy      7             44.2%               93.2%
      Espinosa     8             39.4%               92.3%
      Bonaparte    9             35.0%               91.4%
      Chatterjee   10            31.2%               90.4%
Public domain graphic images courtesy of Wikimedia Commons.
Accuracy

             Searches done 19-Mar-2013 (6,025,474 pages
             from 1836 to 1922).


            Name         Number of search results   Missing results with raw text accuracy   Missing results with corrected text accuracy
            Eklund       2,951                      2,987                                    182
            Kennedy      360,723                    455,392                                  26,111
            Espinosa     1,918                      2,950                                    160
            Bonaparte    44,664                     82,947                                   4,203
            Chatterjee   19                         42                                       2



Public domain graphic images courtesy of Wikimedia Commons.
Hard-to-measure-but-
shouldn’t-be-overlooked
       benefits




  Public domain photo “A useful instruction for young sailors from the Royal Hospital
  School, Greenwich” from the National Maritime Museum.
HTMBSBO benefit

        “when someone transcribes a document, they are
         actually better fulfilling the mission of a cultural
      heritage organization than someone who simply stops
                   by to flip through the pages”




Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/
HTMBSBO benefit

      “in addition to increasing search accuracy or lowering
      the costs of document transcription, crowdsourcing is
     the single greatest advancement in getting people using
             and interacting with library collections”




Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/
Cognitive surplus


        ... people are learning to use their free time for creative activities
        rather than consumptive ones [such as watching TV] ...

        ... the total human cognitive effort in creating all of Wikipedia in
        every language is about one hundred million hours ...

        ... Americans alone watch two hundred billion hours of TV every
        year, or enough time, if it would be devoted to projects similar to
        Wikipedia, to create about 2000 of them ...




Clay Shirky. Cognitive surplus: Creativity and generosity in a connected age. Penguin Press. New York. 2010.
Conclusion of the Sonata for piano #32, opus 111
          by Ludwig van Beethoven
?
Try crowdsourcing!

 Correct California newspapers at http://cdnc.ucr.edu


 Correct Australian newspapers http://trove.nla.gov.au


 Correct Cambridge MA newspapers http://bit.ly/cambridgepublic


 Correct Tennessee newspapers http://tndp.lib.utk.edu


 Correct Virginia newspapers http://virginiachronicle.com
Hãy thử crowdsourcing!
 Correct Vietnamese newspapers http://bit.ly/nationallibraryofvietnam




Попробуйте краудсорсинга!
 Or try Russian language periodicals http://bit.ly/russianperiodicals




  Kokeile crowdsourcing!
 Or try Finnish newspapers http://digi.lib.helsinki.fi/sanomalehti
Motivation
                                          Trove users’ report


             • “I enjoy the correction - it’s a great way to learn more about past
             history and things of interest whilst doing a ‘service to the
             community’ by correcting text for the benefit of others.”

             • “I have recently retired from IT and thought that I could be of
             some assistance to the project. It benefits me and other people. It
             helps with family research.”




From Rose Holley in “Many Hands Make Light Work.” National Library of Australia March 2009.
Motivation
                                        CDNC users’ report



             “I am interested in all kinds of history. I have pursued genealogy as a
               hobby for many years. I correct text at CDNC because I see it as a
             constructive way to contribute to a worthwhile project. Because I am
                                interested in history, I enjoy it.”
                                                     Wesley, California




Personal communications with CDNC text correctors.
Motivation
                                        CDNC users’ report


                “I only correct the text on articles of local interest - nothing at state,
               national or international level, no advertisements, etc.  The objective
                    is to be able to help researchers to locate local people, places,
               organizations and events using the on-line search at CDNC.  I correct
               local news & gossip, personal items, real estate transactions, superior
                court proceedings, county and local board of supervisors meetings,
                       obituaries, birth notices, marriages, yachting news, etc.”
                                                     Ann, California




Personal communications with CDNC text correctors.
Motivation
                                        CDNC users’ report

             “I am correcting text for the Coronado Tent City Program for 1903. 
               It is important to correct any problems with personal names and
                 other information so that researchers will be able to search by
              keyword and be assured of retrieving desired results. ... type fonts
              cause a great deal of difficulty in digitizing the text and can cause
               problems for searchers.  Also, many of the guests' names at Tent
            City and Hotel Del Coronado were taken from the registration books
            and reported in the Program.  This led to many problems in spelling
            of last names and the editors were not careful to be consistent in the
             spellings.  This Program is an important resource since it provides
              an excellent picture of daily life in Tent City and captures much of
                                 the history of Coronado itself.”
                                                     Gene, California

Personal communications with CDNC text correctors.
Motivation
                                        CDNC users’ report



                     “I have always been interested in history, especially the
                 development of the American West, and nothing brings it alive
                   better than newspapers of the time. I believe them to be an
                 invaluable source of knowledge for us and future generations.”
                                                 David, United Kingdom




Personal communications with CDNC text correctors.
Motivation
                              CDNC users’ report

                 “CDNC is an excellent source of information matching my
               personal interest in such topics as sea history, development of
              shipbuilding, clippers and other ships etc. ... Unfortunately, the
             quality of text ... is rather poor I’m afraid. This is why I started to
                 do all corrections necessary for myself ... and to leave the
                corrected text for use of others. ... I am not doing this very
                      regularly as this is just my hobby and pleasure.”
                                           Jerzey, Poland




Personal communications with CDNC text correctors.
Other resources

Mapping Texts at http://mappingtexts.stanford.edu/




            Wragge Labs at http://wraggelabs.com/




            Wikipedia list of crowdsourcing projects
                    https://en.wikipedia.org/wiki/
                  List_of_crowdsourcing_projects
As of 17-Mar-2013 the National Library of Australia’s (http://trove.nla.gov.au/) Alexa Internet traffic
   rank is 14,490 (global) / 330 (Australia). Trove gets ~75% of all National Library web traffic.
National Library of
                          Australia
             • Online since 2008
             • 8,000,000+ pages
             • Top text corrector 1,772,090 lines
             • 2,400,000+ lines corrected each month (average for
               Mar 2012 to Mar 2013)
             • 90,489,875 lines corrected as of Mar 2013, up from
               61,682,883 lines corrected Mar 2012
             • 88,935 total registered users
             • 8,743 active users

Statistics from private communication with the National Library of Australia Oct 2012
As of 17-Mar-2013 National Library of Finland’s (http://www.nationallibrary.fi/) Alexa Internet global
       traffic rank is 4,303,901. Its Internet traffic rank for Finland was 199 as of 2-Apr-2012.
National Library of
              Finland
• Digitalkoot is a project to improve OCR text in digitized
  newspapers -- by playing games!
• Digitalkoot is a collaboration between the National
  Library and Microtask
• Players correct OCR text by playing Myyräsillassa
  (Mole Bridge) or Myyräjahdissa (Mole Hunt)
• National Library has 4,000,000+ digitized pages
• 109,321 registered players (October 2012)
• Since February 2011 8,024,530 micro-tasks have been
  completed
As of 17-Mar-2013 UC Riverside’s Alexa Internet traffic rank is 11,782 (global) / 4,120 (USA).
                    CDNC gets ~3.30% of all UC Riverside web traffic.
California Digital
    Newspaper Collection
• CDNC began digitizing newspapers in 2005 as part of
  the Library of Congress National Digital Newspapers
  Program (NDNP)
• Newspapers digitized to article-level in addition to
  page-level as required by NDNP (same as Utah Digital
  Newspapers)
• Since 2009 hosted on Veridian at http://cdnc.ucr.edu
• Collection size 55,970 issues, 495,175 pages, 5,658,224
  articles, 498,000,000+ lines (Mar 2013)
OCR text correction

• OCR text correction added August 2011
• Corrections are done line by line
• ~578,000+ lines of text corrected Oct 2012
• ~935,398+ lines of text corrected Mar 2013
• ~2% of the collection corrected, 98% to go!
• Top corrector: 327,244 lines, more than twice as many as the second-ranked corrector
Cambridge Public Library
    Historic Newspaper Collection


• Cambridge Historic Newspapers online since Jan 2012.
• Cambridge Massachusetts Public Library digitized local
  newspapers (http://cambridge.dlconsulting.com/)
• Newspapers digitized to article-level
• Collection size 6,346 issues, 59,070 pages, 669,406
  articles (Mar-2013)
• Collection includes 13,099 obituary cards

 
An international survey of born digital legal deposit policies and practices ...
An international survey of born digital legal deposit policies and practices ...An international survey of born digital legal deposit policies and practices ...
An international survey of born digital legal deposit policies and practices ...
 
Rootstech 2015 finding and using digitized historical newspapers workshop [20...
Rootstech 2015 finding and using digitized historical newspapers workshop [20...Rootstech 2015 finding and using digitized historical newspapers workshop [20...
Rootstech 2015 finding and using digitized historical newspapers workshop [20...
 
20140410 ifla digitization workshop [idlc kuala lumpur]
20140410 ifla digitization workshop [idlc kuala lumpur]20140410 ifla digitization workshop [idlc kuala lumpur]
20140410 ifla digitization workshop [idlc kuala lumpur]
 
What did you say? Intercultural expectations, misunderstandings, and communic...
What did you say? Intercultural expectations, misunderstandings, and communic...What did you say? Intercultural expectations, misunderstandings, and communic...
What did you say? Intercultural expectations, misunderstandings, and communic...
 
20140628 crowdsourcing, family history, and long tails for libraries [ala ann...
20140628 crowdsourcing, family history, and long tails for libraries [ala ann...20140628 crowdsourcing, family history, and long tails for libraries [ala ann...
20140628 crowdsourcing, family history, and long tails for libraries [ala ann...
 
20140408 digital newspapers collections [idlc kuala lumpur]
20140408 digital newspapers collections [idlc kuala lumpur]20140408 digital newspapers collections [idlc kuala lumpur]
20140408 digital newspapers collections [idlc kuala lumpur]
 
20131019 digital collections - if you build them will anyone visit [library 2...
20131019 digital collections - if you build them will anyone visit [library 2...20131019 digital collections - if you build them will anyone visit [library 2...
20131019 digital collections - if you build them will anyone visit [library 2...
 
20130903 what did you say? interculture communication [hamburg]
20130903 what did you say? interculture communication [hamburg]20130903 what did you say? interculture communication [hamburg]
20130903 what did you say? interculture communication [hamburg]
 
201308 wlic standards committee zarndt et al the alto editorial board collabo...
201308 wlic standards committee zarndt et al the alto editorial board collabo...201308 wlic standards committee zarndt et al the alto editorial board collabo...
201308 wlic standards committee zarndt et al the alto editorial board collabo...
 
2013 ifla satellite zarndt et al [marketing cultural heritage digital collect...
2013 ifla satellite zarndt et al [marketing cultural heritage digital collect...2013 ifla satellite zarndt et al [marketing cultural heritage digital collect...
2013 ifla satellite zarndt et al [marketing cultural heritage digital collect...
 


20130412 Productivity of the crowd [acrl indianapolis]

  • 1. Productivity of the crowd Slides @ http://bit.ly/crowdsourceacrl2013 Frederick Zarndt Chair, IFLA Newspapers Section CCS / Digital Divide Data / DL Consulting @cowboyMontana, #crowdsourceacrl2013 frederick@frederickzarndt.com Brian Geiger Director, Center for Bibliographic Studies and Research bgeiger@ucr.edu Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.
  • 4. Crowds + News
  • 5. Demographics “British volunteers for "Kitchener's Army" waiting for their pay in the churchyard of St. Martin-in-the-Fields, Trafalgar Square, London” Public domain photo from Imperial War Museum
  • 6. purpose / motive /reason 50%
  • 7. purpose / motive /reason 50%
  • 8. purpose / motive /reason 72%
  • 9. purpose / motive /reason 80%
  • 10. purpose / motive /reason 67%
  • 11. ? Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.
  • 15. ? Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.
  • 16. User Demographic: genealogists and family historians 50+ years old
      • In 2012 the National Library of Australia reported that ~50% of Trove users are family historians
      • National Library of New Zealand survey found that ~50% of PapersPast users are genealogists
      • A 2013 California Digital Newspaper Collection survey shows that more than 65% of its users are genealogists; 75% are 50 years old or older
      • A 2012 Utah Digital Newspapers survey showed that 72% of its users are genealogists*
      • A 2013 Cambridge Public Library survey shows that more than 80% of its users are genealogists; 73% are 50 years old or older
      *John Herbert and Randy Olsen. “Small town papers: Still delivering the news”. Paper given at 2012 World Library and Information Congress. Helsinki. August 2012.
  • 17. raw OCR text newspaper image Deaths. lln»rieff, Esq. of <c .. Qn. Sunday, the till. greatly Drandrellt, of Orms4irJi.- ~ ; ;✓ ' • * On ijfr r inn ljjjil F iij '11 f Havodivyd, Carnarvonshire, S ; **" *- ' « ' March Oxford, F. Tfovmeud, Uerald. » • V . •On Tncsdav last, Mr. Charles. IWilinson, this 8 ; had vf thesis#,, a week ago, which tcrminate<i'iu his death. . / ' ■ O'i Sunday, dJst nit. at. AsbtCnvHall, mar Lancaster, Mr.,Geo. Worn ick, many years house'steward hit late Once The Hamilton and Brandon. He locked himself h»oWn'r«wte<: soon. twelve o'clock" that dny, and fii»-d a loaded pistol "through Ins bead, 1 which instantaneously killed him. Coronet's Verdict, shot himself in a temporary fit of Friday week, Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.
  • 18. Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw OCR character accuracies of 68% for early 20th century newspapers. Rose Holley (National Library of Australia) reports that raw OCR character accuracy varied from 71% to 98% on a sample of Trove digitized newspapers. Edwin Klijn. “The current state-of-art in newspaper digitization.” D-Lib Magazine. January/February 2008. Rose Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs.” D-Lib Magazine. March/April 2009. Public domain graphic images courtesy of Wikimedia Commons. Graphic is logo for Accuracy in Media (http://www.aim.org/)
  • 19. Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers. ... [It] is different from ordinary outsourcing since it is a task or problem that is outsourced to an undefined public rather than a specific, named group. Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed March 17, 2013)
  • 20. Motivation Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in Crowdsourcing – A Study on Mechanical Turk.”
  • 21. You can make a difference Graphic courtesy of TYPEinspire (http://typeinspire.com/)
  • 22. Top ten users by lines corrected (statistics from Oct 2012)
      User   Lines corrected   Lines corrected   User
      1      242,965           1,456,906         1
      2      87,515            1,385,369         2
      3      31,318            1,010,360         3
      4      24,144            960,230           4
      5      23,184            847,340           5
      6      19,240            786,147           6
      7      18,898            657,187           7
      8      16,875            600,513           8
      9      11,784            582,276           9
      10     9,762             565,384           10
  • 23. uncorrected OCR accuracy by newspaper title
      Title                                                               raw character accuracy   ~raw word accuracy*
      PRP  Pacific Rural Press 1871 - 1922                                92.6%                    68.1%
      SFC  San Francisco Call 1890 - 1913                                 92.6%                    68.1%
      LAH  Los Angeles Herald 1873 - 1910                                 88.7%                    54.9%
      LH   Livermore Herald 1877 - 1899                                   88.6%                    54.6%
      DAC  Daily Alta California 1841 - 1891                              88.2%                    53.4%
      CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880   86.5%                    48.4%
      SN   Sausalito News 1885 - 1922                                     70.4%                    17.3%
      *Word accuracy assumes average word length is 5 characters
  • 24. corrected OCR accuracy by newspaper title
      Title                                                               raw character accuracy   corrected accuracy
      PRP  Pacific Rural Press 1871 - 1922                                92.6%                    99.3%
      SFC  San Francisco Call 1890 - 1913                                 92.6%                    99.6%
      LAH  Los Angeles Herald 1873 - 1910                                 88.7%                    99.1%
      LH   Livermore Herald 1877 - 1899                                   88.6%                    99.9%
      DAC  Daily Alta California 1841 - 1891                              88.2%                    99.9%
      CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880   86.5%                    99.8%
      SN   Sausalito News 1885 - 1922                                     70.4%                    100.0%
  • 25. corrected OCR accuracy by newspaper title
      Title              raw character accuracy   ~raw word accuracy*   corrected accuracy   ~corrected word accuracy*
      PRP  1871 - 1922   92.6%                    68.1%                 99.3%                96.5%
      SFC  1890 - 1913   92.6%                    68.1%                 99.6%                98.0%
      LAH  1873 - 1910   88.7%                    54.9%                 99.1%                95.6%
      LH   1877 - 1899   88.6%                    54.6%                 99.9%                99.5%
      DAC  1841 - 1891   88.2%                    53.4%                 99.9%                99.5%
      CF   1855 - 1880   86.5%                    48.4%                 98.3%                91.8%
      SN   1885 - 1922   70.4%                    17.3%                 100.0%               100.0%
      *Word accuracy assumes average word length is 5 characters
  • 26. correction accuracy by user
      User   average uncorrected text accuracy   average corrected text accuracy
      A      70.4%                               100.0%
      B      87.1%                               99.5%
      C      95.4%                               99.5%
      D      86.5%                               98.3%
      E      95.3%                               100.0%
      F      91.0%                               100.0%
      G      91.0%                               99.8%
      H      90.5%                               99.0%
      I      96.6%                               99.8%
      J      94.8%                               100.0%
      K      86.8%                               99.3%
  • 27. Crowdsourcing benefits Public domain photo courtesy of US Navy
  • 28. Economics: Financial value of outsourced OCR text correction for newspapers? The assumptions: (1) 25 to 50 characters per line in a newspaper column; assume 40 characters per line (CDNC sample average). (2) Outsourced text transcription or correction costs USD $0.35 to $1.20 per 1000 characters; assume $0.50 per 1000 characters.
  • 29. Economics: 578,000 lines x 40 characters per line x 1/1000 x $0.50 = $11,560. 68,908,757 lines x 40 characters per line x 1/1000 x $0.50 = $1,378,175.
  • 30. Economics: Financial value of in-house OCR text correction? The assumptions: (1) Correction takes 15 seconds per line. (2) Cost is hourly wage plus benefits of lowest level employee: $10 for CDNC, $41.88* for Australia. *AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.
  • 31. Economics: 578,000 lines x 15 seconds per line x 1/3600 hours per second x $10.00 per hour = $24,083. 68,908,757 lines x 15 seconds per line x 1/3600 hours per second x $41.88 per hour = $12,024,578.
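The arithmetic on the economics slides can be re-derived with a short script. This is only a sketch: the function names are invented for illustration, and the default values are the assumptions stated on the slides (40 characters per line, $0.50 per 1000 characters, 15 seconds per line).

```python
def outsourced_cost(lines, chars_per_line=40, usd_per_1000_chars=0.50):
    """Cost in USD of paying a vendor to correct `lines` lines of OCR text."""
    return lines * chars_per_line / 1000 * usd_per_1000_chars


def in_house_cost(lines, usd_per_hour, seconds_per_line=15):
    """Cost in USD of staff time to correct `lines` lines of OCR text."""
    hours = lines * seconds_per_line / 3600
    return hours * usd_per_hour


# Line counts and hourly wages as given on the slides.
for lines, wage in [(578_000, 10.00), (68_908_757, 41.88)]:
    print(f"{lines:>12,} lines: "
          f"outsourced ${outsourced_cost(lines):,.0f}, "
          f"in-house ${in_house_cost(lines, wage):,.0f}")
```

Changing the defaults (for example, a different per-character rate) shows how sensitive the avoided-cost estimate is to each assumption.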
  • 32. Accuracy “His Accuracy Depends on Ours!" Office for Emergency Management. Office of War Information. Domestic Operations Branch. Bureau of Special Services. [Photo held at US National Archives and Records Administration]
  • 33. Accuracy How does low text accuracy affect search recall? The Facts Average uncorrected OCR character accuracy of the CDNC data is ~89% Average length of an English word is 5 characters Average word accuracy is 89% x 89% x 89% x 89% x 89% = 55.8% - round up to 60% or 6 out of 10 words correct Public domain graphic images courtesy of Wikimedia Commons.
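As a rough check of the figure above, word accuracy can be approximated as character accuracy raised to the power of the word length. This treats character errors as independent of one another, which is a simplifying assumption; the function name is illustrative only.

```python
def word_accuracy(char_accuracy, word_length=5):
    """Probability that every character of a word is correct,
    assuming each character is right with probability `char_accuracy`
    and that errors are independent."""
    return char_accuracy ** word_length


# ~89% is the average uncorrected character accuracy quoted for the CDNC data.
print(f"{word_accuracy(0.89):.1%}")
```

The same function applies to longer or shorter words by passing a different `word_length`.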
  • 34. Search recall, no text correction [graphic: instances of “ARNDT” found vs. instances of “ARNDT” not found] Public domain graphic images courtesy of Wikimedia Commons. Image © Nevit Dilmen found at Wikimedia Commons
  • 35. Accuracy The Facts Average corrected character accuracy of the CDNC data is ~99.4% Average word accuracy of the CDNC corrected text is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0% Public domain graphic images courtesy of Wikimedia Commons.
  • 36. Search recall, with text correction [graphic: instances of “ARNDT” found vs. instances of “ARNDT” not found] Public domain graphic image courtesy of Wikimedia Commons. Image © Nevit Dilmen found at Wikimedia Commons
  • 37. Accuracy A search for my grandmother’s maiden name “Arndt” gives 11,154 results* * Search performed 8 April 2013 Public domain graphic image courtesy of Wikimedia Commons.
  • 38. Accuracy A search for my grandmother’s maiden name “Arndt” gives 11,154 results* If text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,835 instances of “Arndt” were not found * Search performed 8 April 2013 Public domain graphic images courtesy of Wikimedia Commons.
  • 39. Accuracy A search for my grandmother’s maiden name “Arndt” gives 11,154 results* If text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,835 instances of “Arndt” were not found If text accuracy is 97.0%, then 345 instances of “Arndt” were not found * Search performed 8 April 2013 Public domain graphic images courtesy of Wikimedia Commons.
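The "not found" figures follow from a simple estimate: if only a fraction of the true occurrences are spelled correctly in the OCR text, the hits returned are that fraction of the real total, so the real total is roughly hits divided by word accuracy. A minimal sketch of that reasoning, with an illustrative function name:

```python
def estimated_missed(found, word_accuracy):
    """Estimate how many occurrences a search missed, given the number
    it found and the fraction of words assumed to be OCR'd correctly."""
    estimated_total = found / word_accuracy
    return round(estimated_total - found)


found = 11_154  # results for "Arndt" reported on the slide
print(estimated_missed(found, 0.558))  # uncorrected word accuracy
print(estimated_missed(found, 0.970))  # corrected word accuracy
```

The estimate assumes the search term is affected by OCR errors at the collection's average rate, which may not hold for any particular name.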
  • 40. Accuracy: Suppose the name is longer than 5 characters? The Facts: Assume that average uncorrected / corrected OCR character accuracy is ~89% / ~99%, same as CDNC. Public domain graphic images courtesy of Wikimedia Commons.
      Name         name length   raw text accuracy   corrected text accuracy
      Eklund       6             49.7%               94.2%
      Kennedy      7             44.2%               93.25%
      Espinosa     8             39.4%               92.3%
      Bonaparte    9             35.0%               91.4%
      Chatterjee   10            31.2%               90.4%
  • 41. Accuracy: Searches done 19-Mar-2013 (6,025,474 pages from 1836 to 1922). Public domain graphic images courtesy of Wikimedia Commons.
      Name         Number of search results   Missing results with raw text accuracy   Missing results with corrected text accuracy
      Eklund       2,951                      2,987                                    182
      Kennedy      360,723                    455,392                                  26,111
      Espinosa     1,918                      2,950                                    160
      Bonaparte    44,664                     82,947                                   4,203
      Chatterjee   19                         42                                       2
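The per-name figures can be recomputed from the search counts and the text accuracies listed on the preceding slide. The sketch below uses the accuracies exactly as printed there (not recomputed from character accuracy); the data layout and function names are illustrative.

```python
# (name, search results, raw text accuracy, corrected text accuracy),
# copied from the slides.
ROWS = [
    ("Eklund", 2_951, 0.497, 0.942),
    ("Kennedy", 360_723, 0.442, 0.9325),
    ("Espinosa", 1_918, 0.394, 0.923),
    ("Bonaparte", 44_664, 0.350, 0.914),
    ("Chatterjee", 19, 0.312, 0.904),
]


def missing(found, accuracy):
    """Estimated occurrences a search did not return."""
    return round(found / accuracy - found)


def missing_table(rows=ROWS):
    """Map each name to (missing with raw text, missing with corrected text)."""
    return {name: (missing(found, raw), missing(found, corrected))
            for name, found, raw, corrected in rows}


for name, (raw_missing, corrected_missing) in missing_table().items():
    print(f"{name:<11} {raw_missing:>8,} {corrected_missing:>7,}")
```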
  • 42. Hard-to-measure-but- shouldn’t-be-overlooked benefits Public domain photo “A useful instruction for young sailors from the Royal Hospital School, Greenwich” from the National Maritime Museum.
  • 43. HTMBSBO benefit “when someone transcribes a document, they are actually better fulfilling the mission of a cultural heritage organization than someone who simply stops by to flip through the pages” Paraphrased from Trevor Owen’s Crowdstorming blog http://crowdstorming.wordpress.com/
  • 44. HTMBSBO benefit “in addition to increasing search accuracy or lowering the costs of document transcription, crowdsourcing is the single greatest advancement in getting people using and interacting with library collections” Paraphrased from Trevor Owen’s Crowdstorming blog http://crowdstorming.wordpress.com/
  • 45. Cognitive surplus ... people are learning to use their free time for creative activities rather than consumptive ones [such as watching TV] ... ... the total human cognitive effort in creating all of Wikipedia in every language is about one hundred million hours ... ... Americans alone watch two hundred billion hours of TV every year, or enough time, if it would be devoted to projects similar to Wikipedia, to create about 2000 of them ... Clay Shirky. Cognitive surplus: Creativity and generosity in a connected age. Penguin Press. New York. 2010.
  • 46. Conclusion of the Sonata for piano #32, opus 111 by Ludwig van Beethoven
  • 47. ? Slides @ http://bit.ly/crowdsourceacrl2013 Frederick Zarndt Chair, IFLA Newspapers Section CCS / Digital Divide Data / DL Consulting @cowboyMontana, #crowdsourceacrl2013 frederick@frederickzarndt.com Brian Geiger Director, Center for Bibliographic Studies and Research bgeiger@ucr.edu Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.
  • 48. Try crowdsourcing! Correct California newspapers at http://cdnc.ucr.edu Correct Australian newspapers http://trove.nla.gov.au Correct Cambridge MA newspapers http://bit.ly/cambridgepublic Correct Tennessee newspapers http://tndp.lib.utk.edu Correct Virginia newspapers http://virginiachronicle.com
  • 49. Hãy thử crowdsourcing! (Vietnamese: “Try crowdsourcing!”) Correct Vietnamese newspapers http://bit.ly/nationallibraryofvietnam Попробуйте краудсорсинга! (Russian: “Try crowdsourcing!”) Or try Russian language periodicals http://bit.ly/russianperiodicals Kokeile crowdsourcing! (Finnish: “Try crowdsourcing!”) Or try Finnish newspapers http://digi.lib.helsinki.fi/sanomalehti
  • 50. Motivation Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in Crowdsourcing – A Study on Mechanical Turk.”
  • 51. Motivation Trove users’ report • “I enjoy the correction - it’s a great way to learn more about past history and things of interest whilst doing a ‘service to the community’ by correcting text for the benefit of others.” • “I have recently retired from IT and thought that I could be of some assistance to the project. It benefits me and other people. It helps with family research.” From Rose Holley in “Many Hands Make Light Work.” National Library of Australia March 2009.
  • 52. Motivation CDNC users’ report “I am interested in all kinds of history. I have pursued genealogy as a hobby for many years. I correct text at CDNC because I see it as a constructive way to contribute to a worthwhile project. Because I am interested in history, I enjoy it.” Wesley, California Personal communications with CDNC text correctors.
  • 53. Motivation CDNC users’ report “I only correct the text on articles of local interest - nothing at state, national or international level, no advertisements, etc.  The objective is to be able to help researchers to locate local people, places, organizations and events using the on-line search at CDNC.  I correct local news & gossip, personal items, real estate transactions, superior court proceedings, county and local board of supervisors meetings, obituaries, birth notices, marriages, yachting news, etc.” Ann, California Personal communications with CDNC text correctors.
  • 54. Motivation CDNC users’ report “I am correcting text for the Coronado Tent City Program for 1903.  It is important to correct any problems with personal names and other information so that researchers will be able to search by keyword and be assured of retrieving desired results. ... type fonts cause a great deal of difficulty in digitizing the text and can cause problems for searchers.  Also, many of the guests' names at Tent City and Hotel Del Coronado were taken from the registration books and reported in the Program.  This led to many problems in spelling of last names and the editors were not careful to be consistent in the spellings.  This Program is an important resource since it provides an excellent picture of daily life in Tent City and captures much of the history of Coronado itself.” Gene, California Personal communications with CDNC text correctors.
  • 55. Motivation CDNC users’ report “I have always been interested in history, especially the development of the American West, and nothing brings it alive better than newspapers of the time. I believe them to be an invaluable source of knowledge for us and future generations.” David, United Kingdom Personal communications with CDNC text correctors.
  • 56. Motivation CDNC users’ report “CDNC is an excellent source of information matching my personal interest in such topics as sea history, development of shipbuilding, clippers and other ships etc. ... Unfortunately, the quality of text ... is rather poor I’m afraid. This is why I started to do all corrections necessary for myself ... and to leave the corrected text for use of others. .... I am not doing this very regularly as this is just my hobby and pleasure.” Jerzey, Poland Personal communications with CDNC text correctors.
  • 57. Other resources Mapping Texts at http://mappingtexts.stanford.edu/ Wragge Labs at http://wraggelabs.com/ Wikipedia list of crowdsourcing projects https://en.wikipedia.org/wiki/List_of_crowdsourcing_projects
  • 58. As of 17-Mar-2013 the National Library of Australia’s (http://trove.nla.gov.au/) Alexa Internet traffic rank is 14,490 (global) / 330 (Australia). Trove gets ~75% of all National Library web traffic.
  • 59. National Library of Australia • Online since 2008 • 8,000,000+ pages • Top text corrector 1,772,090 lines • 2,400,000+ lines corrected each month (average for Mar 2012 to Mar 2013) • 90,489,875 lines corrected as of Mar 2013, up from 61,682,883 lines corrected Mar 2012 • 88,935 total registered users • 8,743 active users Statistics from private communication with the National Library of Australia Oct 2012
  • 60. As of 17-Mar-2013 National Library of Finland’s (http://www.nationallibrary.fi/) Alexa Internet global traffic rank is 4,303,901. Its Internet traffic rank for Finland was 199 as of 2-Apr-2012.
  • 61. National Library of Finland • Digitalkoot is a project to improve OCR text in digitized newspapers -- by playing games! • Digitalkoot is a collaboration between the National Library and Microtask • Players correct OCR text by playing Myyräsillassa (Mole Bridge) or Myyräjahdissa (Mole Hunt) • National Library has 4,000,000+ digitized pages • 109,321 registered players (October 2012) • Since February 2011 8,024,530 micro-tasks have been completed
  • 62. As of 17-Mar-2013 UC Riverside’s Alexa Internet traffic rank is 11,782 (global) / 4,120 (USA). CDNC gets ~3.30% of all UC Riverside web traffic.
  • 63. California Digital Newspaper Collection • CDNC began digitizing newspapers in 2005 as part of the Library of Congress National Digital Newspapers Program (NDNP) • Newspapers digitized to article-level in addition to page-level as required by NDNP (same as Utah Digital Newspapers) • Since 2009 hosted on Veridian at http://cdnc.ucr.edu • Collection size 55,970 issues, 495,175 pages, 5,658,224 articles, 498,000,000+ lines (Mar 2013)
  • 64. OCR text correction • OCR text correction added August 2011 • Corrections are done line by line • ~578,000+ lines of text corrected Oct 2012 • ~935,398+ lines of text corrected Mar 2013 • ~2% of the collection corrected, 98% to go! • Top corrector 327,244 lines > 2x 2nd corrector
  • 66. Cambridge Public Library Historic Newspaper Collection • Cambridge Historic Newspapers online since Jan 2012. • Cambridge Massachusetts Public Library digitized local newspapers (http://cambridge.dlconsulting.com/) • Newspapers digitized to article-level • Collection size 6,346 issues, 59,070 pages, 669,406 articles (Mar-2013) • Collection includes 13,099 obituary cards