No tempest in my teapot:
Analysis of Crowdsourced Data
 and User Experiences at the
 California Digital Newspaper
           Collection


                         Brian Geiger
   Director, Center for Bibliographical Studies and Research
            California Digital Newspaper Collection

                     Frederick Zarndt
               Chair, IFLA Newspapers Section

                                       Photo held by John Oxley Library, State Library of Queensland. Original from

                                       Courier-mail, Brisbane, Queensland, Australia.
Crowds
The Wisdom of Crowds

In 2004 James Surowiecki published “The Wisdom
of Crowds: Why the Many Are Smarter Than the
Few and How Collective Wisdom Shapes Business,
Economies, Societies and Nations”. In it he asserts that

     a crowd of people who are diverse,
     independent, and decentralized usually makes
     better judgments or decisions than any single
     person
“crowdsourcing”

was coined by Jeff Howe in “The rise of
crowdsourcing” published in Wired magazine June
2006.
A Google advanced search for
“crowdsourcing” from 1-Jun-2006, the date
 of publication of Jeff Howe’s Wired magazine
   article, to 1-Jun-2007 gives 44,600 hits.
A date range of 1-Jun-2011 to 1-Jun-2012 gives
               2,680,000 hits.




       Searches used the Internet Archive’s Wayback Machine
Crowdsourcing is a process that
              involves outsourcing tasks to a distributed
               group of people. ... the difference between
              crowdsourcing and ordinary outsourcing is
               that a task or problem is outsourced to an
                 undefined public rather than a specific
                     body, such as paid employees.



Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing
(accessed June 1, 2012)
Crowdsourcing is a type of participative online activity in
           which an individual, an institution, a non-profit
           organization, or company proposes to a group of individuals
           of varying knowledge, heterogeneity, and number, via a
           flexible open call, the voluntary undertaking of a task. The
           undertaking of the task, of variable complexity and
           modularity, and in which the crowd should participate
           bringing their work, money, knowledge and/or experience,
           always entails mutual benefit. The user will receive the
           satisfaction of a given type of need, be it economic, social
           recognition, self-esteem, or the development of individual
           skills, while the crowdsourcer will obtain and utilize to their
           advantage that what the user has brought to the venture,
           whose form will depend on the type of activity undertaken.



Enrique Estellés-Arolas and Fernando González-Ladrón-de-Guevara. Towards an integrated crowdsourcing definition.
Journal of Information Science XX(X). 2012. pp. 1-14.
crowd*

[Diagram: the crowd* family of terms: crowdsourcing, crowdcollaboration, crowdfunding, citizen science, crowdcasting, crowdvoting]
what is Alexa?
•   Alexa collects and analyzes Internet data for purposes of web analytics. Web analytics is
    the measurement, collection, analysis and reporting of Internet data for the purposes of
    understanding and optimizing web usage. Alexa is now a subsidiary of Amazon.

•   Alexa was founded in 1996 by Brewster Kahle (Internet Archive) and Bruce Gilliat.

•   Alexa operations include archiving web pages as they are crawled. This database served
    as the basis for the creation of the Internet Archive, accessible through the Wayback
    Machine.

•   Alexa continually crawls all publicly-available websites to create a series of snapshots of
    the web.

•   Alexa gathers information from a variety of sources to provide key statistics about each
    site on the web, for example, Traffic Rank, the number of PageViews, and site Speed,
    Bounce Rate, etc. This information is derived from Alexa toolbar users (~6,000,000
    worldwide).
definitions
        •   A PageView is a request for a file whose type is defined as a page.

        •   A Unique Visitor is a uniquely identified client generating requests on the
            web server or viewing pages within a defined time period (i.e. day, week or
            month). A Unique Visitor counts once within the timescale.

        •   A Visit is a series of page requests from the same uniquely identified client
            with a time of no more than 30 minutes between each page request.

        •   Bounce Rate is the percentage of visits where the visitor enters and exits at
            the same page without visiting any other pages on the site in between.

        •   World | Country Rank is a function of the average daily unique visits and
            the number of unique pages requested.




definitions adapted from Wikipedia http://en.wikipedia.org/wiki/Web_analytics
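The Visit and Bounce Rate definitions above can be turned into a small calculation. The following is a minimal sketch, assuming page requests arrive as (client id, timestamp, page) tuples; the function names and the input layout are illustrative and are not taken from Alexa or any analytics product.

```python
from datetime import datetime, timedelta

# Maximum idle time between two requests of the same visit (from the definition above).
VISIT_TIMEOUT = timedelta(minutes=30)

def split_into_visits(requests):
    """Group (client_id, timestamp, page) requests into visits.

    A new visit starts when a client has been idle for more than
    VISIT_TIMEOUT. Returns a list of visits, each a list of pages.
    """
    visits = []
    last_seen = {}  # client_id -> (timestamp of last request, index of its visit)
    for client_id, timestamp, page in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(client_id)
        if previous is not None and timestamp - previous[0] <= VISIT_TIMEOUT:
            index = previous[1]
            visits[index].append(page)
        else:
            visits.append([page])
            index = len(visits) - 1
        last_seen[client_id] = (timestamp, index)
    return visits

def bounce_rate(visits):
    """Percentage of visits in which only one distinct page was requested."""
    if not visits:
        return 0.0
    bounces = sum(1 for pages in visits if len(set(pages)) == 1)
    return 100.0 * bounces / len(visits)
```

Real analytics tools also have to decide how to identify a client (cookie, IP address, login), which this sketch leaves to the caller.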
crowdsourcing




         Amazon Mechanical Turk was launched Nov 2005
Alexa global rank of Amazon Mechanical Turk (13-Jun-2012): 6,022
crowdsourcing




Each day 200,000,000 reCAPTCHAs are solved by humans around the world
crowdvoting

 Iowa Electronic Market was 1st
 launched in 1995

 Alexa global traffic rank of Iowa
 Electronic Market (6-Aug-2012):
 11,290

 Alexa US traffic rank of Iowa
 Electronic Market (6-Aug-2012):
 3,923
citizen science




           Galaxy Zoo was 1st launched July 2007
Alexa global traffic rank of Galaxy Zoo (13-Jun-2012): 557,766
crowdfunding




                 Kickstarter was 1st launched in 2008
      Alexa global traffic rank of Kickstarter (6-Aug-2012): 752
27,528 projects successfully funded with more than USD $254,000,000
crowdcollaboration
Wikipedia

•   Began 2001

•   Now in 285 languages

•   3,900,000+ articles in English, 1,400,000+ in German, 1,250,000+ in
    French, 1,050,000 in Dutch

•   40 Wikipedia languages with more than 100,000 articles

•   112 Wikipedia languages with more than 10,000 articles

•   400,000,000 unique visitors per month

•   85,000 active contributors

•   Alexa global traffic rank: #6 in worldwide web traffic
FamilySearch Indexing was 1st launched (beta) 2004
Alexa global / country traffic rank of FamilySearch (13-Jun-2012): 4,352 / 1,357
• Started (beta) 2004
• More than 780,000 worldwide registered volunteers
  from ~25 countries index records relevant to family
  history
• Approximately 100,000 active volunteers each month
• UI in Chinese, English, German, French, Italian,
  Japanese, Korean, Portuguese, and Russian
• Blind double-key entry with arbitration / reconciliation
• More than 1,500,088,741 records indexed (July 2012)
• Accuracy typically > 99.95%
Project Gutenberg was 1st launched Dec 1971
Alexa global traffic rank of Project Gutenberg (13-Jun-2012): 5,744
• Started Dec 1971
• Worldwide volunteers transcribe or proofread OCR’d
  public domain books through Distributed Proofreaders
• 40,000 books completed (July 2012)
• Partner / affiliated projects for Australia, Canada,
  Europe, Germany, Luxembourg, Philippines, Runeberg
  (Nordic literature), Russia, Taiwan
Alexa global / country traffic rank of National Library of Australia (31-Oct-2012): 15,519 / 406
                      Trove gets ~72% of all National Library web traffic.
National Library of
          Australia
• Online since 2008
• 7,200,000+ pages
• Top text corrector 1,250,000 lines (June 2012)
• 2,450,000+ lines corrected each month (average
  for 1st 6 months 2012)
• 68,908,757 lines corrected as of July 2012, up
  from 42,411,468 lines corrected July 2011.
• 63,613 total registered users (July 2012)
• 4,146 active users (June 2012)
Alexa global / country traffic rank of National Library of Finland
          2,535,854 (31-Oct-2012) / 199 (2-Apr-2012)
National Library of
             Finland
• Digitalkoot is a project to improve OCR text in
  digitized newspapers -- by playing games!
• Digitalkoot is a collaboration between the National
  Library and Microtask
• Players correct OCR text by playing Myyräsillassa
  (Mole Bridge) or Myyräjahdissa (Mole Hunt)
• National Library has 4,000,000+ digitized pages
• 109,321 registered players (October 2012)
• Since February 2011 8,024,530 micro-tasks have
  been completed
Alexa global / country traffic rank of UC Riverside (31-Oct-2012): 12,439 / 4,717
                CDNC gets ~1.84% of all UC Riverside web traffic.
California Digital
 Newspaper Collection
• CDNC began digitizing newspapers in 2005 as
  part of NDNP
• Newspapers digitized to article-level as well as
  to page-level as required by NDNP
• Hosted on Veridian beginning 2009
• Collection size 55,970 issues, 495,175 pages,
  5,658,224 articles, 498,000,000+ lines
OCR text correction

• OCR text correction added August 2011
• Corrections are done line by line
• ~578,000+ lines of text corrected (Oct 2012)
• ~0.12% of the collection corrected (578,000 of 498,000,000+ lines), 99.88% to go!
• Top corrector: 243,000 lines, more than 2x the 2nd corrector
Top ten text correctors by lines corrected

 User   CDNC lines corrected   NLA lines corrected
   1          242,965               1,456,906
   2           87,515               1,385,369
   3           31,318               1,010,360
   4           24,144                 960,230
   5           23,184                 847,340
   6           19,240                 786,147
   7           18,898                 657,187
   8           16,875                 600,513
   9           11,784                 582,276
  10            9,762                 565,384
uncorrected OCR accuracy by newspaper title

Title                                                             OCR character   ~OCR word
                                                                    accuracy      accuracy*

PRP Pacific Rural Press 1871 - 1922                                   92.6%         68.1%
SFC San Francisco Call 1890 - 1913                                    92.6%         68.1%
LAH Los Angeles Herald 1873 - 1910                                    88.7%         54.9%
LH Livermore Herald 1877 - 1899                                       88.6%         54.6%
DAC Daily Alta California 1841 - 1891                                 88.2%         53.4%
CFJ California Farmer and Journal of Useful Sciences 1855 - 1880      86.5%         48.4%
SN Sausalito News 1885 - 1922                                         70.4%         17.3%

*Word accuracy assumes average word length is 5 characters
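The approximate word accuracies follow from the character accuracies under the footnote's assumption of 5-character words, plus the further assumption that each character is recognized independently of its neighbors. A minimal sketch of that arithmetic (the function name and dictionary are illustrative only):

```python
def word_accuracy(char_accuracy, word_length=5):
    """Chance that every character of a word is correct, assuming
    independent character errors and a fixed word length."""
    return char_accuracy ** word_length

# Uncorrected OCR character accuracy by title, copied from the table above.
CHAR_ACCURACY = {
    "PRP": 0.926, "SFC": 0.926, "LAH": 0.887, "LH": 0.886,
    "DAC": 0.882, "CFJ": 0.865, "SN": 0.704,
}

for title, accuracy in CHAR_ACCURACY.items():
    print(f"{title}: {word_accuracy(accuracy):.1%}")
```

Real OCR errors tend to cluster (a smudged word is often wrong in several characters at once), so this estimate is only a rough guide.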
OCR accuracy by newspaper title

Title                                                             OCR character   Corrected
                                                                    accuracy      accuracy

PRP Pacific Rural Press 1871 - 1922                                   92.6%         99.3%
SFC San Francisco Call 1890 - 1913                                    92.6%         99.6%
LAH Los Angeles Herald 1873 - 1910                                    88.7%         99.1%
LH Livermore Herald 1877 - 1899                                       88.6%         99.9%
DAC Daily Alta California 1841 - 1891                                 88.2%         99.9%
CFJ California Farmer and Journal of Useful Sciences 1855 - 1880      86.5%         98.3%
SN Sausalito News 1885 - 1922                                         70.4%        100.0%
corrected accuracy by newspaper title

Title               OCR character   ~OCR word   Corrected   ~Corrected word
                      accuracy      accuracy*   accuracy       accuracy*

PRP 1871 - 1922         92.6%         68.1%       99.3%          96.5%
SFC 1890 - 1913         92.6%         68.1%       99.6%          98.0%
LAH 1873 - 1910         88.7%         54.9%       99.1%          95.6%
LH 1877 - 1899          88.6%         54.6%       99.9%          99.5%
DAC 1841 - 1891         88.2%         53.4%       99.9%          99.5%
CFJ 1855 - 1880         86.5%         48.4%       98.3%          91.8%
SN 1885 - 1922          70.4%         17.3%      100.0%         100.0%

*Word accuracy assumes average word length is 5 characters
correction accuracy by user

 User   Average OCR   Correction
          accuracy     accuracy
  A        70.4%       100.0%
  B        87.1%        99.5%
  C        95.4%        99.5%
  D        86.5%        98.3%
  E        95.3%       100.0%
  F        91.0%       100.0%
  G        91.0%        99.8%
  H        90.5%        99.0%
  I        96.6%        99.8%
  J        94.8%       100.0%
  K        86.8%        99.3%
the long tail* of crowdsourced OCR text correction

                      a probability distribution has a long tail if a larger
                     share of the population rests within its tail than it would
                                 under a normal distribution

                     the most productive users are a small fraction of the
                     total user population but account for ~50% of total
                     production; said a different way, the many users who are
                     individually less productive are, taken together, as
                     important as the most productive users




The phrase “long tail” was popularized by Chris Anderson in the October 2004 Wired magazine article The Long Tail
and by Clay Shirky’s February 2003 essay “Power laws, web logs, and inequality”.
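One way to check the "~50% of total production" claim against real data is to count how many of the top contributors are needed to reach half of all lines corrected. The sketch below is an illustration, not the authors' method; the function name is hypothetical, and the CDNC figures are the top-ten counts and the approximate Oct 2012 total quoted on earlier slides.

```python
def contributors_for_share(lines_per_user, share=0.5, total=None):
    """Smallest number of top contributors whose combined output reaches
    `share` of `total`. If `total` is omitted, the sum of the given counts
    is used. Returns None if the given contributors never reach the target."""
    ranked = sorted(lines_per_user, reverse=True)
    if total is None:
        total = sum(ranked)
    target = share * total
    running = 0
    for count, lines in enumerate(ranked, start=1):
        running += lines
        if running >= target:
            return count
    return None

# Top ten CDNC correctors (lines corrected) and approximate total lines corrected.
CDNC_TOP_TEN = [242_965, 87_515, 31_318, 24_144, 23_184,
                19_240, 18_898, 16_875, 11_784, 9_762]
CDNC_TOTAL_LINES = 578_000  # approximate, so the result is approximate too

print(contributors_for_share(CDNC_TOP_TEN, total=CDNC_TOTAL_LINES))
```

Because the total is rounded and only the top ten correctors are listed, treat the output as an order-of-magnitude check rather than an exact figure.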
OCR text correction long tails

[Charts: lines corrected by text corrector at CDNC (top corrector 242,965 lines) and at NLA (top corrector 1,456,906 lines), each with the 50% of total production point marked]
Motivation
Graphic from Kaufmann et al. “More than fun and money. Worker Motivation
in Crowdsourcing – A Study on Mechanical Turk.”
Wisdom of crowds

Diversity          Each person should have private information
                   even if it's just an eccentric interpretation of the
                   known facts.
Independence       People's opinions aren't determined by the
                   opinions of those around them.
Decentralization   People are able to specialize and draw on local
                   knowledge.
Aggregation        Some mechanism exists for turning private
                   judgments into a collective decision.

James Surowiecki, The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective
Wisdom Shapes Business, Economies, Societies and Nations, Anchor Books, New York, 2005.
Cognitive surplus

    ... people are learning to use their free time for creative
    activities rather than consumptive ones [such as watching
    TV] ...

    ... the total human cognitive effort in creating all of
    Wikipedia in every language is about one hundred million
    hours ...

    ... Americans alone watch two hundred billion hours of TV
    every year, or enough time, if it would be devoted to projects
    similar to Wikipedia, to create about 2000 of them ...


Clay Shirky. Cognitive surplus: Creativity and generosity in a connected age. Penguin Press. New York. 2010.
Motivation
             Genealogists and family historians
             • National Library of Australia’s 2012 Trove
               status report showed that ~50% of Trove users
               are family historians
             • National Library of New Zealand survey found
               that ~50% of PapersPast users are genealogists
             • California Digital Newspaper Collection spring
               2012 survey discovered that ~70% of its users
               are genealogists; 75% are 50 years old or older
             • A Utah Digital Newspapers survey showed that
               72% of its users are genealogists
Motivation
                                  Trove users’ report


           • “I enjoy the correction - it’s a great way to learn more
           about past history and things of interest whilst doing a
           ‘service to the community’ by correcting text for the benefit
           of others.”
           • “I have recently retired from IT and thought that I could be
           of some assistance to the project. It benefits me and other
           people. It helps with family research.”




From Rose Holley in “Many Hands Make Light Work.” National Library of Australia March 2009.
Motivation
                                 CDNC users’ report


          “I am interested in all kinds of history. I have pursued genealogy
           as a hobby for many years. I correct text at CDNC because I see
            it as a constructive way to contribute to a worthwhile project.
                     Because I am interested in history, I enjoy it.”
                                              Wesley, California




Personal communications with CDNC text correctors.
Motivation
                                 CDNC users’ report

               “I only correct the text on articles of local interest - nothing at
                state, national or international level, no advertisements, etc. 
                The objective is to be able to help researchers to locate local
                 people, places, organizations and events using the on-line
              search at CDNC.  I correct local news & gossip, personal items,
             real estate transactions, superior court proceedings, county and
               local board of supervisors meetings, obituaries, birth notices,
                                marriages, yachting news, etc.”
                                                     Ann, California




Personal communications with CDNC text correctors.
Motivation
                                 CDNC users’ report
           “I am correcting text for the Coronado Tent City Program for
            1903.  It is important to correct any problems with personal
           names and other information so that researchers will be able
             to search by keyword and be assured of retrieving desired
                results. ... type fonts cause a great deal of difficulty in
          digitizing the text and can cause problems for searchers.  Also,
               many of the guests' names at Tent City and Hotel Del
          Coronado were taken from the registration books and reported
           in the Program.  This led to many problems in spelling of last
          names and the editors were not careful to be consistent in the
             spellings.  This Program is an important resource since it
             provides an excellent picture of daily life in Tent City and
                  captures much of the history of Coronado itself.”
                                               Gene, California
Personal communications with CDNC text correctors.
Motivation
                                 CDNC users’ report


              “I have always been interested in history, especially the
          development of the American West, and nothing brings it alive
            better than newspapers of the time. I believe them to be an
          invaluable source of knowledge for us and future generations.”
                                         David, United Kingdom




Personal communications with CDNC text correctors.
Motivation
                                 CDNC users’ report

                “CDNC is an excellent source of information matching my
               personal interest in such topics as sea history, development
                      of shipbuilding, clippers and other ships etc. ...
                  Unfortunately, the quality of text ... is rather poor I’m
                afraid. This is why I started to do all corrections necessary
                   for myself ... and to leave the corrected text for use of
                others. .... I am not doing this very regularly as this is just
                                   my hobby and pleasure.”
                                                Jerzey, Poland




Personal communications with CDNC text correctors.
Website traffic

         After a crowdsourcing transcription project of diaries from the
         American War Between the States, Nicole Saylor, Head of Digital
         Library Services at the University of Iowa Libraries, reported



                   “On June 9, 2011, we went from about 1000
                   daily hits to our digital library on a really good
                   day to more than 70,000.”


Nicole Saylor interviewed by Trevor Owens. “Crowdsourcing the Civil War: Insights Interview with Nicole Saylor” blog post
at http://blogs.loc.gov/digitalpreservation/2011/12/crowdsourcing-the-civil-war-insights-interview-with-nicole-saylor/.
Dec 6, 2011.
Website traffic
    Website traffic at CDNC before / after implementing crowdsourcing

                   before crowdsourcing          after crowdsourcing
                   11-Jun-2011 to 12-Jul-2011    11-Jun-2012 to 12-Jul-2012    change

visits                   17,485                        21,488                  +22.9%
unique visitors          11,381                        13,376                  +17.5%
visit duration           9m 24s                        11m 7s                  +18.3%
bounce rate              51.3%                         44.5%                   -6.8 points
pages per visit          14.9                          11.7                    -21.5%
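The change column mixes two kinds of figures: relative percent changes for the counts and durations, and a percentage-point difference for bounce rate. A short sketch that reproduces both from the before and after values (the function names are illustrative):

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

def point_change(before_pct, after_pct):
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct

print(round(percent_change(17_485, 21_488), 1))              # visits
print(round(percent_change(11_381, 13_376), 1))              # unique visitors
print(round(percent_change(9 * 60 + 24, 11 * 60 + 7), 1))    # visit duration in seconds
print(round(percent_change(14.9, 11.7), 1))                  # pages per visit
print(round(point_change(51.3, 44.5), 1))                    # bounce rate, in points
print(round(percent_change(51.3, 44.5), 1))                  # bounce rate, relative
```

Expressed as a relative change, the bounce rate fell by about 13.3%, which is a stronger-looking figure than the 6.8-point drop for the same data.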
Crowdsourcing
   benefits




        Public domain photo courtesy of US Navy
Economics

   Financial value of outsourced OCR text correction
   for newspapers?
   The Assumptions
• 25 to 50 characters per line in a newspaper column:
  Assume 40 characters per line (CDNC sample average)
• Outsourced text transcription or correction costs USD
  $0.35 to $1.20 per 1000 characters: Assume $0.50
  per 1000 characters
Economics

• CDNC: 578,000 lines x 40 characters per line x
  1/1000 x $0.50 = $11,560
• National Library of Australia: 68,908,757 lines x 40 characters per line x
  1/1000 x $0.50 = $1,378,175
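The outsourcing estimate is easy to rerun with other assumptions. This is a minimal sketch using the slide's defaults (40 characters per line, USD $0.50 per 1000 characters); the function name is illustrative.

```python
def outsourced_value(lines, chars_per_line=40, usd_per_1000_chars=0.50):
    """Estimated cost in USD of paying a vendor to correct `lines` lines."""
    return lines * chars_per_line / 1000 * usd_per_1000_chars

cdnc_lines = 578_000      # CDNC lines corrected, Oct 2012
nla_lines = 68_908_757    # National Library of Australia lines corrected, July 2012

print(round(outsourced_value(cdnc_lines)))
print(round(outsourced_value(nla_lines)))

# The quoted vendor price range ($0.35 to $1.20) brackets the CDNC estimate.
print(round(outsourced_value(cdnc_lines, usd_per_1000_chars=0.35)))
print(round(outsourced_value(cdnc_lines, usd_per_1000_chars=1.20)))
```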
Economics

                  Financial value of in-house OCR text
                  correction?
                  The Assumptions
           • Correction takes 15 seconds per line
           • Cost is hourly wage plus benefits of lowest level
             employee, $10 for CDNC, $41.88* for Australia


*AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate avoided costs
due to crowdsourced OCR text correction in its 2012 Trove Status Report.
Economics

• CDNC: 578,000 lines x 15 seconds per line x 1/3600 hrs
  per second x $10.00 per hr = $24,083
• National Library of Australia: 68,908,757 lines x 15 seconds per line x 1/3600
  hrs per second x $41.88 per hr = $12,024,578
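The in-house estimate converts lines to staff hours and multiplies by an hourly cost. A minimal sketch using the slide's assumptions (15 seconds per line; USD $10.00 per hour for CDNC and USD $41.88 per hour for Australia); the function name is illustrative.

```python
SECONDS_PER_HOUR = 3600

def in_house_value(lines, hourly_cost_usd, seconds_per_line=15):
    """Estimated cost in USD of having paid staff correct `lines` lines."""
    hours = lines * seconds_per_line / SECONDS_PER_HOUR
    return hours * hourly_cost_usd

print(round(in_house_value(578_000, hourly_cost_usd=10.00)))
print(round(in_house_value(68_908_757, hourly_cost_usd=41.88)))
```

The result scales linearly with the seconds-per-line assumption, so halving or doubling that figure halves or doubles the estimate.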
Accuracy



“His Accuracy Depends on Ours!”
Office for Emergency Management. Office of War
Information. Domestic Operations Branch. Bureau of
Special Services. [Photo held at US National Archives and
Records Administration]
Accuracy

          • Edwin Klijn (Koninklijke Bibliotheek, the Netherlands)
          reports raw OCR character accuracies of 68% for early 20th
          century newspapers
          • Rose Holley (National Library of Australia) reports that raw
          OCR character accuracy varied from 71% to 98% in a
          sample of Trove digitized newspapers



Edwin Klijn. “The current state-of-art in newspaper digitization.” D-Lib Magazine. January/February 2008.
Rose Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation
programs.” D-Lib Magazine. March/April 2009.
Public domain graphic courtesy of Wikimedia Commons.
Accuracy
                   Mapping texts* assesses digitization quality of digital
                     newspapers by comparing the number of words
                    recognized to the total number of words scanned




*Mapping texts is a collaboration between the University of North Texas and Stanford University aimed at experimenting
with new methods for finding and analyzing meaningful patterns embedded in massive collections of digital newspapers.
Accuracy
       How does low text accuracy affect search recall?
       The Facts
       • Average uncorrected OCR character accuracy of the
         CDNC sample data is ~89%
       • Average length of an English word is 5 characters
       • Average word accuracy is 89% x 89% x 89% x 89% x
         89% = 55.8% - round up to 60% or 6 out of 10 words
         correct


Public domain graphic courtesy of Wikimedia Commons.
Search recall no text correction


[Graphic: scattered instances of “ARNDT” on a page, each marked as either found or not found by search]
Accuracy


       The Facts
       • Average corrected character accuracy of the CDNC
         sample data is ~99.4%
       • Average word accuracy of CDNC corrected text is
         99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%




Public domain graphic courtesy of Wikimedia Commons.
Search recall with text correction




[Graphic: scattered instances of “ARNDT” on a page, each marked as either found or not found by search]
Accuracy

         A search for “Arndt” at Chronicling America
         gives 10,267 results*
         • If Chronicling America text accuracy is 55.8% (same
           as uncorrected CDNC sample), then 8,133 instances
           of “Arndt” were not found
         • If text accuracy is 97.0%, then 317 instances of
           “Arndt” were not found
     *   Search performed 31 Oct 2012
         Alexa global / country traffic rank of Library of Congress (31-Oct-2012): 4,056 / 1,317
                 Chronicling America gets ~7.1% of all Library of Congress web traffic.
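The missed-instance figures can be reproduced by treating the hits returned as the correctly recognized share of all true occurrences. This is a sketch under that assumption (a single-word search term, and every occurrence equally likely to be recognized correctly); the function name is illustrative.

```python
def missed_hits(found, word_accuracy):
    """Estimated number of occurrences a search did not find.

    If `found` hits represent a fraction `word_accuracy` of all true
    occurrences, the estimated total is found / word_accuracy, and the
    remainder was missed.
    """
    estimated_total = found / word_accuracy
    return estimated_total - found

hits = 10_267  # results for "Arndt" at Chronicling America, 31 Oct 2012

print(missed_hits(hits, 0.558))   # uncorrected text
print(missed_hits(hits, 0.970))   # corrected text
```

The results agree with the slide's 8,133 and 317 to within one instance; the small difference comes down to how the intermediate values are rounded.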

Public domain graphic courtesy of Wikimedia Commons.
Hard-to-measure-but-
shouldn’t-be-overlooked
       benefits




  Public domain photo “A useful instruction for young sailors from the Royal Hospital
  School, Greenwich” from the National Maritime Museum.
HTMBSBO benefit

        “when someone transcribes a document, they are
         actually better fulfilling the mission of a cultural
      heritage organization than someone who simply stops
                   by to flip through the pages”




Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/
HTMBSBO benefit

      “in addition to increasing search accuracy or lowering
      the costs of document transcription, crowdsourcing is
     the single greatest advancement in getting people using
             and interacting with library collections”




Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/
Crowdsourcing considerations
• How to market / advertise
  crowdsourcing?
• How to motivate
  crowdsourcers?
• Is authentication / identity of
  crowdsourcers an issue?
• How to administer
  crowdsourced data?


                                    Photo of Aleister Crowley [Public domain] from Wikimedia
                                                            Commons
Conclusions
             • Lots of crowdsourcing in cultural heritage
               organizations and elsewhere
             • Benefits are multi-faceted: Economic, data
               accuracy, patron engagement, increased web
               traffic




Conclusion of the Sonata for piano #32, opus 111 by
Ludwig van Beethoven
Try crowdsourcing!
                         Correct California newspapers text
                                http://cdnc.ucr.edu

                         Correct Australian newspapers text
                              http://trove.nla.gov.au

                         Correct Cambridge MA newspapers text
                             http://bit.ly/cambridgepublic

                         Correct Russian language periodicals
                           http://bit.ly/russianperiodicals


Others soon to follow: Library of Virginia, University of Tennessee,
                 National Library of Singapore, ...
?
        Brian Geiger
      bgeiger@ucr.edu

       Frederick Zarndt
frederick@frederickzarndt.com


                  Photo held by John Oxley Library, State Library of Queensland. Original from

                  Courier-mail, Brisbane, Queensland, Australia.

Mais conteúdo relacionado

Mais procurados

Discovering Library2.0 Libraryservices For The Google Generation Sconul June ...
Discovering Library2.0 Libraryservices For The Google Generation Sconul June ...Discovering Library2.0 Libraryservices For The Google Generation Sconul June ...
Discovering Library2.0 Libraryservices For The Google Generation Sconul June ...Ken Chad Consulting Ltd
 
Library collections and the emerging scholarly record
Library collections and the emerging scholarly recordLibrary collections and the emerging scholarly record
Library collections and the emerging scholarly recordlisld
 
Towards collaboration at scale: Libraries, the social and the technical
Towards collaboration at scale:  Libraries, the social and the technicalTowards collaboration at scale:  Libraries, the social and the technical
Towards collaboration at scale: Libraries, the social and the technicallisld
 
Taiga4lightningtalks
Taiga4lightningtalksTaiga4lightningtalks
Taiga4lightningtalkskantelman
 
Collection directions - towards collective collections
Collection directions - towards collective collectionsCollection directions - towards collective collections
Collection directions - towards collective collectionslisld
 
Rightscaling, engagement, learning: reconfiguring the library for a network e...
Rightscaling, engagement, learning: reconfiguring the library for a network e...Rightscaling, engagement, learning: reconfiguring the library for a network e...
Rightscaling, engagement, learning: reconfiguring the library for a network e...lisld
 
Social metadata for libraries, archives and museums: Research findings from t...
Social metadata for libraries, archives and museums: Research findings from t...Social metadata for libraries, archives and museums: Research findings from t...
Social metadata for libraries, archives and museums: Research findings from t...Rose Holley
 
A view of WorldCat- SNBU 2010
A view of WorldCat- SNBU 2010A view of WorldCat- SNBU 2010
A view of WorldCat- SNBU 2010OCLC LAC
 
The library in the life of the user
The library in the life of the userThe library in the life of the user
The library in the life of the userlisld
 
What business are we in?
What business are we in?What business are we in?
What business are we in?lisld
 

20121105 no tempest in my teapot [dlf forum denver]

  • 1. No tempest in my teapot: Analysis of Crowdsourced Data and User Experiences at the California Digital Newspaper Collection Brian Geiger Director, Center for Bibliographical Studies and Research California Digital Newspaper Collection Frederick Zarndt Chair, IFLA Newspapers Section Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.
  • 3. The Wisdom of Crowds In 2004 James Surowiecki published “The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations”. In it he asserts a crowd of persons that are diverse, independent, and decentralized usually make better judgements or decisions than single persons
  • 4. “crowdsourcing” was coined by Jeff Howe in “The rise of crowdsourcing” published in Wired magazine June 2006.
  • 5. A Google advanced search for “crowdsourcing” from 1-Jun-2006, the date of publication of Jeff Howe’s Wired magazine article, to 1-Jun-2007 gives 44,600 hits. A date range of 1-Jun-2011 to 1-Jun-2012 gives 2,680,000 hits. Searches used the Internet Archive’s Wayback Machine.
  • 6. Crowdsourcing is a process that involves outsourcing tasks to a distributed group of people. ... the difference between crowdsourcing and ordinary outsourcing is that a task or problem is outsourced to an undefined public rather than a specific body, such as paid employees. Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed June 1, 2012)
  • 7. Crowdsourcing is a type of participative online activity in which an individual, an institution, a non-profit organization, or company proposes to a group of individuals of varying knowledge, heterogeneity, and number, via a flexible open call, the voluntary undertaking of a task. The undertaking of the task, of variable complexity and modularity, and in which the crowd should participate bringing their work, money, knowledge and/or experience, always entails mutual benefit. The user will receive the satisfaction of a given type of need, be it economic, social recognition, self-esteem, or the development of individual skills, while the crowdsourcer will obtain and utilize to their advantage that what the user has brought to the venture, whose form will depend on the type of activity undertaken. Enrique Estellés-Arolas and Fernando González-Ladrón-de-Guevara. Towards an integrated crowdsourcing definition. Journal of Information Science XX(X). 2012. pp. 1-14.
  • 8. crowd*: crowdcollaboration, crowdsourcing, crowdfunding, citizen science, crowdcasting, crowdvoting
  • 9. what is Alexa?
      • Alexa collects and analyzes Internet data for purposes of web analytics. Web analytics is the measurement, collection, analysis and reporting of Internet data for the purposes of understanding and optimizing web usage. Alexa is now a subsidiary of Amazon.
      • Alexa was founded in 1996 by Brewster Kahle (Internet Archive) and Bruce Gilliat.
      • Alexa's operations include archiving webpages as they are crawled. This database served as the basis for the creation of the Internet Archive, accessible through the Wayback Machine.
      • Alexa continually crawls all publicly available websites to create a series of snapshots of the web.
      • Alexa gathers information from a variety of sources to provide key statistics about each site on the web, for example Traffic Rank, the number of PageViews, site Speed, and Bounce Rate. This information is derived from Alexa toolbar users (~6,000,000 worldwide).
  • 10. definitions
      • A PageView is a request for a file whose type is defined as a page.
      • A Unique Visitor is a uniquely identified client generating requests on the web server or viewing pages within a defined time period (e.g. day, week or month). A Unique Visitor counts once within the timescale.
      • A Visit is a series of page requests from the same uniquely identified client with a time of no more than 30 minutes between each page request.
      • Bounce Rate is the percentage of visits where the visitor enters and exits at the same page without visiting any other pages on the site in between.
      • World | Country Rank is a function of the average daily unique visits and the number of unique pages requested.
      definitions adapted from Wikipedia http://en.wikipedia.org/wiki/Web_analytics
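The Visit, Unique Visitor, and Bounce Rate definitions above can be sketched in a few lines. This is a minimal illustration, not how Alexa or any analytics product actually computes them: the function names, the `(client_id, timestamp, page)` tuple layout, and the sample requests are all assumptions, and a bounce is approximated as a visit with exactly one page request.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)  # gap that ends a visit, per the definition above


def split_into_visits(requests):
    """Group (client_id, timestamp, page) requests into visits.

    A new visit starts when the same client has been idle for more than
    VISIT_TIMEOUT. Returns a list of visits, each a list of pages.
    """
    visits = []
    open_visits = {}  # client_id -> (index into visits, time of last request)
    for client_id, timestamp, page in sorted(requests, key=lambda r: r[1]):
        current = open_visits.get(client_id)
        if current is None or timestamp - current[1] > VISIT_TIMEOUT:
            visits.append([page])
            index = len(visits) - 1
        else:
            index = current[0]
            visits[index].append(page)
        open_visits[client_id] = (index, timestamp)
    return visits


def bounce_rate(visits):
    """Percentage of visits made up of a single page request (assumed proxy for a bounce)."""
    if not visits:
        return 0.0
    bounces = sum(1 for pages in visits if len(pages) == 1)
    return 100.0 * bounces / len(visits)


def unique_visitors(requests):
    """Number of distinct clients in the period covered by the requests."""
    return len({client_id for client_id, _, _ in requests})


# Hypothetical sample data: client "a" returns after a 45 minute gap.
t0 = datetime(2012, 6, 11, 9, 0)
sample_requests = [
    ("a", t0, "/home"),
    ("a", t0 + timedelta(minutes=5), "/search"),
    ("a", t0 + timedelta(minutes=50), "/home"),
    ("b", t0 + timedelta(minutes=1), "/home"),
]
sample_visits = split_into_visits(sample_requests)
```

In the sample, client "a" produces two visits because the gap between its second and third requests exceeds 30 minutes, while it still counts as one unique visitor.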
  • 11. crowdsourcing: Amazon Mechanical Turk was launched Nov 2005. Alexa global rank of Amazon Mechanical Turk (13-Jun-2012): 6,022
  • 12. crowdsourcing: Each day 200,000,000 reCAPTCHAs are solved by humans around the world.
  • 13. crowdvoting: Iowa Electronic Market was 1st launched in 1995. Alexa global traffic rank of Iowa Electronic Market (6-Aug-2012): 11,290. Alexa US traffic rank of Iowa Electronic Market (6-Aug-2012): 3,923
  • 14. citizen science: Galaxy Zoo was 1st launched July 2007. Alexa global traffic rank of Galaxy Zoo (13-Jun-2012): 557,766
  • 15. crowdfunding: Kickstarter was 1st launched in 2008. Alexa global traffic rank of Kickstarter (6-Aug-2012): 752. 27,528 projects successfully funded with more than USD $254,000,000
  • 17. Wikipedia
      • Began 2001
      • Now in 285 languages
      • 3,900,000+ articles in English, 1,400,000+ in German, 1,250,000+ in French, 1,050,000 in Dutch
      • 40 Wikipedia languages with more than 100,000 articles
      • 112 Wikipedia languages with more than 10,000 articles
      • 400,000,000 unique visitors per month
      • 85,000 active contributors
      • Alexa global traffic rank: #6 in worldwide web traffic
  • 19. FamilySearch Indexing was 1st launched (beta) in 2004. Alexa global / country traffic rank of FamilySearch (13-Jun-2012): 4,352 / 1,357
  • 20.
      • Started (beta) 2004
      • More than 780,000 registered volunteers worldwide, from ~25 countries, index records relevant to family history
      • Approximately 100,000 active volunteers each month
      • UI in Chinese, English, German, French, Italian, Japanese, Korean, Portuguese, and Russian
      • Blind double-key entry with arbitration / reconciliation
      • More than 1,500,088,741 records indexed (July 2012)
      • Accuracy typically > 99.95%
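Blind double-key entry with arbitration can be pictured as follows: two volunteers key the same record without seeing each other's work, matching values are accepted, and disagreements go to a third party. The sketch below is only an illustration of that general idea, not FamilySearch's actual workflow; `reconcile_field`, `reconcile_record`, the field names, and the stand-in arbitrator are all hypothetical.

```python
def reconcile_field(first_key, second_key, arbitrate):
    """Accept a value keyed twice, or defer to an arbitrator on disagreement."""
    first, second = first_key.strip(), second_key.strip()
    if first == second:
        return first
    return arbitrate(first, second)


def reconcile_record(first_pass, second_pass, arbitrate):
    """Reconcile two dicts holding the same fields keyed by different volunteers."""
    return {
        field: reconcile_field(first_pass[field], second_pass[field], arbitrate)
        for field in first_pass
    }


# Hypothetical record keyed independently by two volunteers.
volunteer_a = {"surname": "Arndt", "year": "1903"}
volunteer_b = {"surname": "Arnelt", "year": "1903 "}

disputed = []


def stand_in_arbitrator(first, second):
    """Placeholder for a human arbitrator: logs the dispute and keeps the first value."""
    disputed.append((first, second))
    return first


accepted = reconcile_record(volunteer_a, volunteer_b, stand_in_arbitrator)
```

Only the surname reaches the arbitrator here; the year differs by trailing whitespace alone and is accepted automatically.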
  • 21. Project Gutenberg was 1st launched Dec 1971. Alexa global traffic rank of Project Gutenberg (13-Jun-2012): 5,744
  • 22.
      • Started Dec 1971
      • Worldwide volunteers transcribe or proofread OCR’d public domain books through Distributed Proofreaders
      • 40,000 books completed (July 2012)
      • Partner / affiliated projects for Australia, Canada, Europe, Germany, Luxembourg, Philippines, Runeberg (Nordic literature), Russia, Taiwan
  • 23. Alexa global / country traffic rank of National Library of Australia (31-Oct-2012): 15,519 / 406. Trove gets ~72% of all National Library web traffic.
  • 24. National Library of Australia
      • Online since 2008
      • 7,200,000+ pages
      • Top text corrector 1,250,000 lines (June 2012)
      • 2,450,000+ lines corrected each month (average for 1st 6 months 2012)
      • 68,908,757 lines corrected as of July 2012, up from 42,411,468 lines corrected July 2011
      • 63,613 total registered users (July 2012)
      • 4,146 active users (June 2012)
  • 25. Alexa global / country traffic rank of National Library of Finland: 2,535,854 (31-Oct-2012) / 199 (2-Apr-2012)
  • 26. National Library of Finland
      • Digitalkoot is a project to improve OCR text in digitized newspapers -- by playing games!
      • Digitalkoot is a collaboration between the National Library and Microtask
      • Players correct OCR text by playing Myyräsillassa (Mole Bridge) or Myyräjahdissa (Mole Hunt)
      • National Library has 4,000,000+ digitized pages
      • 109,321 registered players (October 2012)
      • Since February 2011 8,024,530 micro-tasks have been completed
  • 27. Alexa global / country traffic rank of UC Riverside (31-Oct-2012): 12,439 / 4,717. CDNC gets ~1.84% of all UC Riverside web traffic.
  • 28. California Digital Newspaper Collection
      • CDNC began digitizing newspapers in 2005 as part of NDNP
      • Newspapers digitized to article-level as well as to page-level as required by NDNP
      • Hosted on Veridian beginning 2009
      • Collection size 55,970 issues, 495,175 pages, 5,658,224 articles, 498,000,000+ lines
  • 29. OCR text correction
      • OCR text correction added August 2011
      • Corrections are done line by line
      • ~578,000+ lines of text corrected (Oct 2012)
      • ~0.12% of the collection corrected (578,000 of 498,000,000+ lines), ~99.88% to go!
      • Top corrector ~243,000 lines, more than 2x the 2nd corrector
  • 30.
      User   Lines corrected (CDNC)   Lines corrected (NLA)
      1      242,965                  1,456,906
      2      87,515                   1,385,369
      3      31,318                   1,010,360
      4      24,144                   960,230
      5      23,184                   847,340
      6      19,240                   786,147
      7      18,898                   657,187
      8      16,875                   600,513
      9      11,784                   582,276
      10     9,762                    565,384
  • 31. uncorrected OCR accuracy by newspaper title
      Title                                                               OCR character accuracy   ~OCR word accuracy*
      PRP  Pacific Rural Press 1871 - 1922                                92.6%                    68.1%
      SFC  San Francisco Call 1890 - 1913                                 92.6%                    68.1%
      LAH  Los Angeles Herald 1873 - 1910                                 88.7%                    54.9%
      LH   Livermore Herald 1877 - 1899                                   88.6%                    54.6%
      DAC  Daily Alta California 1841 - 1891                              88.2%                    53.4%
      CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880   86.5%                    48.4%
      SN   Sausalito News 1885 - 1922                                     70.4%                    17.3%
      *Word accuracy assumes average word length is 5 characters
  • 32. OCR accuracy by newspaper title
      Title                                                               OCR character accuracy   Corrected accuracy
      PRP  Pacific Rural Press 1871 - 1922                                92.6%                    99.3%
      SFC  San Francisco Call 1890 - 1913                                 92.6%                    99.6%
      LAH  Los Angeles Herald 1873 - 1910                                 88.7%                    99.1%
      LH   Livermore Herald 1877 - 1899                                   88.6%                    99.9%
      DAC  Daily Alta California 1841 - 1891                              88.2%                    99.9%
      CFJ  California Farmer and Journal of Useful Sciences 1855 - 1880   86.5%                    99.8%
      SN   Sausalito News 1885 - 1922                                     70.4%                    100.0%
  • 33. corrected accuracy by newspaper title
      Title             OCR character accuracy   ~OCR word accuracy*   Corrected accuracy   ~Corrected word accuracy*
      PRP 1871 - 1922   92.6%                    68.1%                 99.3%                96.5%
      SFC 1890 - 1913   92.6%                    68.1%                 99.6%                98.0%
      LAH 1873 - 1910   88.7%                    54.9%                 99.1%                95.6%
      LH 1877 - 1899    88.6%                    54.6%                 99.9%                99.5%
      DAC 1841 - 1891   88.2%                    53.4%                 99.9%                99.5%
      CFJ 1855 - 1880   86.5%                    48.4%                 98.3%                91.8%
      SN 1885 - 1922    70.4%                    17.3%                 100.0%               100.0%
      *Word accuracy assumes average word length is 5 characters
  • 34. correction accuracy by user
      User   Average OCR accuracy   Correction accuracy
      A      70.4%                  100.0%
      B      87.1%                  99.5%
      C      95.4%                  99.5%
      D      86.5%                  98.3%
      E      95.3%                  100.0%
      F      91.0%                  100.0%
      G      91.0%                  99.8%
      H      90.5%                  99.0%
      I      96.6%                  99.8%
      J      94.8%                  100.0%
      K      86.8%                  99.3%
  • 35. the long tail* of crowdsourced OCR text correction
      *A probability distribution has a long tail if a larger share of the population rests within its tail than it would under a normal distribution.
      • The most productive users represent a small fraction of the total user population and ~50% of total production. Said a different way, the much larger group of individually less productive users is, taken together, as important as the most productive users.
      • The phrase “long tail” was popularized by Chris Anderson in the October 2004 Wired magazine article “The Long Tail” and by Clay Shirky’s February 2003 essay “Power laws, web logs, and inequality”.
  • 36. OCR text correction long tails [Charts: NLA lines corrected by text corrector (top corrector 1,456,906) and CDNC lines corrected by text corrector (top corrector 242,965), each with 50% markers]
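The "~50% of total production" claim can be checked against the per-user figures from slide 30. The sketch below is an illustration under stated assumptions: `users_needed_for_share` is a hypothetical helper, the totals (~578,000 lines for CDNC, 68,908,757 for NLA) are the approximate figures quoted on earlier slides, and only the top ten correctors are known, so the function returns `None` when those ten do not reach the requested share.

```python
def users_needed_for_share(lines_per_user, share=0.5, total=None):
    """Smallest number of top users whose lines reach `share` of `total`.

    If `total` is omitted, the sum of the listed users is used instead.
    Returns None when the listed users never reach the target.
    """
    ordered = sorted(lines_per_user, reverse=True)
    if total is None:
        total = sum(ordered)
    target = share * total
    running = 0
    for count, lines in enumerate(ordered, start=1):
        running += lines
        if running >= target:
            return count
    return None


# Top ten correctors as listed on slide 30.
cdnc_top_ten = [242_965, 87_515, 31_318, 24_144, 23_184,
                19_240, 18_898, 16_875, 11_784, 9_762]
nla_top_ten = [1_456_906, 1_385_369, 1_010_360, 960_230, 847_340,
               786_147, 657_187, 600_513, 582_276, 565_384]

cdnc_users_for_half = users_needed_for_share(cdnc_top_ten, total=578_000)
nla_users_for_half = users_needed_for_share(nla_top_ten, total=68_908_757)
```

With these inputs the two largest CDNC correctors already pass half of the ~578,000 corrected lines, whereas the NLA top ten together fall well short of half of 68,908,757, so the NLA call returns `None`.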
  • 37. Motivation Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in Crowdsourcing – A Study on Mechanical Turk.”
  • 38. Wisdom of crowds
      • Diversity: Each person should have private information even if it's just an eccentric interpretation of the known facts.
      • Independence: People's opinions aren't determined by the opinions of those around them.
      • Decentralization: People are able to specialize and draw on local knowledge.
      • Aggregation: Some mechanism exists for turning private judgments into a collective decision.
      James Surowiecki, The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations, Anchor Books, New York, 2005.
  • 39. Cognitive surplus ... people are learning to use their free time for creative activities rather than consumptive ones [such as watching TV] ... ... the total human cognitive effort in creating all of Wikipedia in every language is about one hundred million hours ... ... Americans alone watch two hundred billion hours of TV every year, or enough time, if it would be devoted to projects similar to Wikipedia, to create about 2000 of them ... Clay Shirky. Cognitive surplus: Creativity and generosity in a connected age. Penguin Press. New York. 2010.
  • 40. Motivation: Genealogists and family historians
      • National Library of Australia’s 2012 Trove status report showed that ~50% of Trove users are family historians
      • National Library of New Zealand survey found that ~50% of PapersPast users are genealogists
      • California Digital Newspaper Collection spring 2012 survey discovered that ~70% of its users are genealogists; 75% are 50 years old or older
      • A Utah Digital Newspapers survey showed that 72% of its users are genealogists
  • 41. Motivation: Trove users’ report
      • “I enjoy the correction - it’s a great way to learn more about past history and things of interest whilst doing a ‘service to the community’ by correcting text for the benefit of others.”
      • “I have recently retired from IT and thought that I could be of some assistance to the project. It benefits me and other people. It helps with family research.”
      From Rose Holley in “Many Hands Make Light Work.” National Library of Australia March 2009.
  • 42. Motivation CDNC users’ report “I am interested in all kinds of history. I have pursued genealogy as a hobby for many years. I correct text at CDNC because I see it as a constructive way to contribute to a worthwhile project. Because I am interested in history, I enjoy it.” Wesley, California Personal communications with CDNC text correctors.
  • 43. Motivation CDNC users’ report “I only correct the text on articles of local interest - nothing at state, national or international level, no advertisements, etc.  The objective is to be able to help researchers to locate local people, places, organizations and events using the on-line search at CDNC.  I correct local news & gossip, personal items, real estate transactions, superior court proceedings, county and local board of supervisors meetings, obituaries, birth notices, marriages, yachting news, etc.” Ann, California Personal communications with CDNC text correctors.
  • 44. Motivation CDNC users’ report “I am correcting text for the Coronado Tent City Program for 1903.  It is important to correct any problems with personal names and other information so that researchers will be able to search by keyword and be assured of retrieving desired results. ... type fonts cause a great deal of difficulty in digitizing the text and can cause problems for searchers.  Also, many of the guests' names at Tent City and Hotel Del Coronado were taken from the registration books and reported in the Program.  This led to many problems in spelling of last names and the editors were not careful to be consistent in the spellings.  This Program is an important resource since it provides an excellent picture of daily life in Tent City and captures much of the history of Coronado itself.” Gene, California Personal communications with CDNC text correctors.
  • 45. Motivation CDNC users’ report “I have always been interested in history, especially the development of the American West, and nothing brings it alive better than newspapers of the time. I believe them to be an invaluable source of knowledge for us and future generations.” David, United Kingdom Personal communications with CDNC text correctors.
  • 46. Motivation CDNC users’ report “CDNC is an excellent source of information matching my personal interest in such topics as sea history, development of shipbuilding, clippers and other ships etc. ... Unfortunately, the quality of text ... is rather poor I’m afraid. This is why I started to do all corrections necessary for myself ... and to leave the corrected text for use of others. ... I am not doing this very regularly as this is just my hobby and pleasure.” Jerzey, Poland Personal communications with CDNC text correctors.
  • 48. Website traffic After a crowdsourcing transcription project of diaries from the American War Between the States, Nicole Saylor, Head of Digital Library Services at the University of Iowa Libraries, reported “On June 9, 2011, we went from about 1000 daily hits to our digital library on a really good day to more than 70,000.” Nicole Saylor interviewed by Trevor Owens. “Crowdsourcing the Civil War: Insights Interview with Nicole Saylor” blog post at http://blogs.loc.gov/digitalpreservation/2011/12/crowdsourcing-the-civil-war-insights-interview-with-nicole-saylor/. Dec 6, 2011.
  • 49. Website traffic: Website traffic at CDNC before / after implementing crowdsourcing
      Metric            before crowdsourcing          after crowdsourcing           change
                        11-Jun-2011 / 12-Jul-2011     11-Jun-2012 / 12-Jul-2012
      visits            17,485                        21,488                        +22.9%
      unique visitors   11,381                        13,376                        +17.5%
      visit duration    9m 24s                        11m 7s                        +18.3%
      bounce rate       51.3%                         44.5%                         -6.8%
      pages per visit   14.9                          11.7                          -21.5%
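The "change" column can be reproduced with a relative-change formula. The sketch below is a minimal illustration with hypothetical helper names. One detail it brings out: the bounce rate row matches the difference in percentage points (44.5 minus 51.3) and not a relative change, which would be larger in magnitude.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' duration to seconds so durations can be compared."""
    return 60 * minutes + seconds


visits_change = percent_change(17_485, 21_488)
unique_visitors_change = percent_change(11_381, 13_376)
visit_duration_change = percent_change(to_seconds(9, 24), to_seconds(11, 7))
pages_per_visit_change = percent_change(14.9, 11.7)

# The bounce rate row is a percentage-point difference, not a relative change.
bounce_rate_point_change = 44.5 - 51.3
bounce_rate_relative_change = percent_change(51.3, 44.5)

for label, value in [
    ("visits", visits_change),
    ("unique visitors", unique_visitors_change),
    ("visit duration", visit_duration_change),
    ("pages per visit", pages_per_visit_change),
    ("bounce rate (points)", bounce_rate_point_change),
]:
    print(f"{label}: {value:+.1f}")
```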
  • 51. Crowdsourcing benefits Public domain photo courtesy of US Navy
  • 52. Economics: Financial value of outsourced OCR text correction for newspapers? The Assumptions
      • 25 to 50 characters per line in a newspaper column: Assume 40 characters per line (CDNC sample average)
      • Outsourced text transcription or correction costs USD $0.35 to $1.20 per 1000 characters: Assume $0.50 per 1000 characters
  • 53. Economics
      • CDNC: 578,000 lines x 40 characters per line x 1/1000 x $0.50 = $11,560
      • NLA: 68,908,757 lines x 40 characters per line x 1/1000 x $0.50 = $1,378,175
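The outsourced-value arithmetic is easy to rerun with other assumptions. A minimal sketch, where `outsourced_value` is a hypothetical helper and the defaults are the 40 characters per line and $0.50 per 1000 characters assumed on the previous slide:

```python
def outsourced_value(lines_corrected, chars_per_line=40, usd_per_1000_chars=0.50):
    """Estimated cost in USD of paying a vendor to correct the given number of lines."""
    return lines_corrected * chars_per_line / 1000 * usd_per_1000_chars


cdnc_value = outsourced_value(578_000)
nla_value = outsourced_value(68_908_757)

# Bounds using the extremes of the stated ranges (25 to 50 characters per line,
# $0.35 to $1.20 per 1000 characters), applied to the CDNC line count.
cdnc_low = outsourced_value(578_000, chars_per_line=25, usd_per_1000_chars=0.35)
cdnc_high = outsourced_value(578_000, chars_per_line=50, usd_per_1000_chars=1.20)

print(f"CDNC: ${cdnc_value:,.0f} (range ${cdnc_low:,.0f} to ${cdnc_high:,.0f})")
print(f"NLA:  ${nla_value:,.0f}")
```

The bounds show how sensitive the estimate is to the two assumptions: for CDNC the stated ranges span roughly $5,000 to $35,000 around the $11,560 midpoint estimate.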
  • 54. Economics: Financial value of in-house OCR text correction? The Assumptions
      • Correction takes 15 seconds per line
      • Cost is hourly wage plus benefits of lowest level employee, $10 for CDNC, $41.88* for Australia
      *AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.
  • 55. Economics
      • CDNC: 578,000 lines x 15 seconds per line x 1/3600 hours per second x $10.00 per hour = $24,083
      • NLA: 68,908,757 lines x 15 seconds per line x 1/3600 hours per second x $41.88 per hour = $12,024,578
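The in-house estimate converts lines to staff hours and then to dollars. A minimal sketch under the slide's assumptions (15 seconds per line, $10.00 per hour for CDNC, $41.88 per hour for NLA); the function names are hypothetical:

```python
SECONDS_PER_HOUR = 3600


def correction_hours(lines_corrected, seconds_per_line=15):
    """Staff hours needed to correct the given number of lines."""
    return lines_corrected * seconds_per_line / SECONDS_PER_HOUR


def in_house_value(lines_corrected, hourly_cost_usd, seconds_per_line=15):
    """Estimated labor cost in USD of correcting the lines with paid staff."""
    return correction_hours(lines_corrected, seconds_per_line) * hourly_cost_usd


cdnc_in_house = in_house_value(578_000, hourly_cost_usd=10.00)
nla_in_house = in_house_value(68_908_757, hourly_cost_usd=41.88)

print(f"CDNC: {correction_hours(578_000):,.0f} hours, ${cdnc_in_house:,.0f}")
print(f"NLA:  {correction_hours(68_908_757):,.0f} hours, ${nla_in_house:,.0f}")
```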
  • 56. Accuracy “His Accuracy Depends on Ours!” Office for Emergency Management. Office of War Information. Domestic Operations Branch. Bureau of Special Services. [Photo held at US National Archives and Records Administration]
  • 57. Accuracy
      • Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw OCR character accuracies of 68% for early 20th century newspapers
      • Rose Holley (National Library of Australia) reports raw OCR character accuracy varied from 71% to 98% on a sample of Trove digitized newspapers
      Edwin Klijn. “The current state-of-art in newspaper digitization.” D-Lib Magazine. January/February 2008.
      Rose Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs.” D-Lib Magazine. March/April 2009.
      Public domain graphic courtesy of Wikimedia Commons.
  • 58. Accuracy Mapping texts* assesses digitization quality of digital newspapers by comparing the number of words recognized to the total number of words scanned *Mapping texts is a collaboration between the University of North Texas and Stanford University aimed at experimenting with new methods for finding and analyzing meaningful patterns embedded in massive collections of digital newspapers.
  • 59. Accuracy: How does low text accuracy affect search recall? The Facts
      • Average uncorrected OCR character accuracy of the CDNC sample data is ~89%
      • Average length of an English word is 5 characters
      • Average word accuracy is 89% x 89% x 89% x 89% x 89% = 55.8%; round up to 60%, or 6 out of 10 words correct
      Public domain graphic courtesy of Wikimedia Commons.
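The word-accuracy estimate raises character accuracy to the power of the word length. The sketch below restates that calculation; `word_accuracy` is a hypothetical helper name, and the model carries the slide's simplifying assumptions that every word is 5 characters long and that character errors are independent of one another.

```python
def word_accuracy(char_accuracy, word_length=5):
    """Probability that every character in a word is correct.

    Assumes character errors are independent and all words have the same length.
    """
    return char_accuracy ** word_length


uncorrected = word_accuracy(0.89)   # CDNC sample, raw OCR
corrected = word_accuracy(0.994)    # CDNC sample, after user correction

print(f"uncorrected word accuracy: {uncorrected:.1%}")
print(f"corrected word accuracy:   {corrected:.1%}")
```

The same function reproduces the word-accuracy columns in the per-title tables, for example 92.6% character accuracy giving about 68.1% word accuracy.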
  • 60. Search recall, no text correction [Graphic: ten instances of “ARNDT”, each marked as found or not found]
  • 61. Accuracy: The Facts
      • Average corrected character accuracy of the CDNC sample data is ~99.4%
      • Average word accuracy of CDNC corrected text is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%
      Public domain graphic courtesy of Wikimedia Commons.
  • 62. Search recall, with text correction [Graphic: ten instances of “ARNDT”, each marked as found or not found]
  • 63. Accuracy: A search for “Arndt” at Chronicling America gives 10,267 results*
      • If Chronicling America text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,133 instances of “Arndt” were not found
      • If text accuracy is 97.0%, then 317 instances of “Arndt” were not found
      *Search performed 31 Oct 2012
      Alexa global / country traffic rank of Library of Congress (31-Oct-2012): 4,056 / 1,317. Chronicling America gets ~7.1% of all Library of Congress web traffic.
      Public domain graphic courtesy of Wikimedia Commons.
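The missed-instance figures follow from treating word accuracy as search recall: if only a fraction of the true occurrences is findable, the hits returned are that fraction of the real total. A minimal sketch of that reasoning, with `missed_instances` as a hypothetical helper; it assumes a single-word query and that recall equals word accuracy, as the slide does.

```python
def missed_instances(found, word_accuracy):
    """Estimated occurrences a search failed to return.

    Assumes `found` hits represent `word_accuracy` of all true occurrences,
    so the estimated total is found / word_accuracy.
    """
    estimated_total = found / word_accuracy
    return estimated_total - found


found_hits = 10_267  # "Arndt" results reported for Chronicling America

missed_uncorrected = missed_instances(found_hits, 0.558)
missed_corrected = missed_instances(found_hits, 0.970)

print(f"missed at 55.8% word accuracy: about {missed_uncorrected:,.0f}")
print(f"missed at 97.0% word accuracy: about {missed_corrected:,.0f}")
```

The results land within one instance of the slide's 8,133 and 317; the small difference comes down to how the fractional estimate is rounded.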
  • 64. Hard-to-measure-but-shouldn’t-be-overlooked benefits. Public domain photo “A useful instruction for young sailors from the Royal Hospital School, Greenwich” from the National Maritime Museum.
  • 65. HTMBSBO benefit “when someone transcribes a document, they are actually better fulfilling the mission of a cultural heritage organization than someone who simply stops by to flip through the pages” Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/
  • 66. HTMBSBO benefit “in addition to increasing search accuracy or lowering the costs of document transcription, crowdsourcing is the single greatest advancement in getting people using and interacting with library collections” Paraphrased from Trevor Owens’s Crowdstorming blog http://crowdstorming.wordpress.com/
  • 67. Crowdsourcing considerations
      • How to market / advertise crowdsourcing?
      • How to motivate crowdsourcers?
      • Is authentication / identity of crowdsourcers an issue?
      • How to administer crowdsourced data?
      Photo of Aleister Crowley [Public domain] from Wikimedia Commons
  • 68. Conclusions
      • Lots of crowdsourcing in cultural heritage organizations and elsewhere
      • Benefits are multi-faceted: economic, data accuracy, patron engagement, increased web traffic
      Conclusion of the Sonata for piano #32, opus 111 by Ludwig van Beethoven
  • 69. Try crowdsourcing!
      • Correct California newspapers text http://cdnc.ucr.edu
      • Correct Australian newspapers text http://trove.nla.gov.au
      • Correct Cambridge MA newspapers text http://bit.ly/cambridgepublic
      • Correct Russian language periodicals http://bit.ly/russianperiodicals
      • Others soon to follow: Library of Virginia, University of Tennessee, National Library of Singapore, ...
  • 70. ? Brian Geiger bgeiger@ucr.edu Frederick Zarndt frederick@frederickzarndt.com Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.