Digitization: Image Capture and Workflow - Constance Rinaldo


The 2015 annual meeting of the Global Biodiversity Heritage Library Network will be held in Brazil and will address the state of development of the Biodiversity Heritage Library (BHL) and biodiversity information systems.

Organized by the SciELO and BIOTA Programs of FAPESP, the meeting is aimed at researchers and professionals working in biodiversity and scientific information. The scientific program will feature national and international authorities and experts.

The Global BHL Network (gBHL) includes participants from South Africa, Australia, Brazil, China, Egypt, the United States, and Europe. BHL works collaboratively to promote open access to biodiversity literature as part of the global biodiversity community.

  • [1 min]
    To me, collection management is a continuous cycle of pre-digitization and post-digitization workflows.
    Getting the content scanned is one thing; managing the content after it has been scanned is just as important.
    You’ll notice that our users play a key role in the cycle.
  • At the start of the project, we tried to scan as much as possible as fast as possible (“feed the beast”), starting with the low-hanging fruit.

    As the project matures, it becomes more difficult to find material to scan that is in good condition, in the public domain, or even on the shelf!
    We hired a full-time in-house scanner to handle folios and rare, fragile material.
    Most staff, like the scanning itself, are funded by grants. They are not permanent, which means the work is not truly programmatic or part of the infrastructure.
  • 18 plus institutions, 30 plus people, 4 plus time zones
  • Workflow has become more complicated
    Difficulty finding books that are easy to scan
    Copyright review takes time
    Fragile books need repair
    Same quantity of work but different type, slower collection growth
  • Upload a spreadsheet of titles you plan to scan. Include OCLC number, title, volume number, author, publisher, and date.
    The tool tries to find matches in other submitted spreadsheets.
    Lesson: your metadata is always worse than you think it is.
    Problem: the tool does not match against BHL in real time, so you still must check BHL to be sure, and that doesn’t always happen.
    Problem: the fuzzy matching algorithm is not that great. It works best against numbers such as the OCLC number (OCLC maintains WorldCat, a union catalog for libraries).
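A minimal sketch of the kind of matching described above, assuming each spreadsheet row is a dict with the hypothetical keys "oclc" and "title". This is not the BHL tool's actual algorithm. It only illustrates why an identifier match is more dependable than a fuzzy title match: it checks the OCLC number first and falls back to a similarity ratio on normalized titles.

```python
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase the title and keep only letters, digits, and single spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(kept.split())


def find_matches(row, other_rows, threshold=0.9):
    """Return (other_row, reason) pairs that look like the same title.

    The keys "oclc" and "title" and the 0.9 threshold are assumptions
    made for this illustration.
    """
    matches = []
    for other in other_rows:
        # An identical, non-empty OCLC number is the most reliable signal.
        if row.get("oclc") and row.get("oclc") == other.get("oclc"):
            matches.append((other, "oclc"))
            continue
        # Otherwise compare normalized titles with a similarity ratio.
        ratio = SequenceMatcher(
            None, normalize_title(row["title"]), normalize_title(other["title"])
        ).ratio()
        if ratio >= threshold:
            matches.append((other, "title"))
    return matches
```

A title-only match would still need a human check, which fits the lesson above: abbreviations and cataloging differences easily push a true duplicate below any fixed threshold.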
  • Repurpose a generic “issue tracking” system to do many things
    -track requests for scanning
    -track titles libraries plan to scan (serial volumes)
    -track metadata error reports
    -track website bugs
    Comment trail can be very long. Conversation vs. database. Confusing to database people (me) but shows history of selection.
    The selection refinement process can take a long time!
  • Some background:
    Most BHL US/UK libraries use the Internet Archive (IA) as our scanning “vendor” (partner). This was part of the original BHL formation and the grant agreement with MacArthur.
    IA was chosen because it is committed to Open Access, is a non-profit, and offers low-cost services that go beyond digitization.

    Members can also do their own scanning or contract with another vendor, but all scans must be “staged” at the Internet Archive.
    Members provide basic metadata from their library catalogs.
  • This decision to scan physical units of books is based on the limitations of available library data. Libraries typically assign data at the “title” level, with maybe some data about individual volumes of a serial.
    Workflow is designed around scanning physical books. We are working on incorporating born-digital publications.
    Focus is on the information content of the book rather than the book-as-historical-object
  • TO REITERATE: For BHL,
    IA – petaboxes
    SI – Isilon
    Total BHL storage currently ~ 90TB. It is so low because IA supplies compressed JP2s, and we store them in a .zip file.
  • Images scanned by library or other vendor
    Metadata harvested via Z39.50
    Additional metadata for item and pages entered by library staff using Macaw software (mimics IA biblio software)

  • Scanned by library or other vendor
    Smithsonian Libraries uses 2 systems:
    Phase One P65+ 60MP camera on a copy stand
    BC100: dual-camera system with 40MP scanning backs; CaptureOne image editing software
    Macaw for extra metadata
  • Analysis of MARCXML records in IA (not all books have MARC records) for 050 and 090 (call number) and 650 (subject headings)
    Capture - Harvest
    Automated, scheduled tasks
    Books from Internet Archive
    subject terms
    library “call numbers”
    Manually entering an identifier
    Article citations
    BioStor
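As a rough illustration of the MARCXML analysis mentioned in this note, the sketch below pulls call numbers (fields 050 and 090) and subject headings (field 650) out of a single MARCXML record using only the Python standard library. The function name and the output formatting are assumptions; this is not the harvesting code BHL runs.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC 21 slim") records.
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}


def extract_call_numbers_and_subjects(marcxml_text):
    """Return (call_numbers, subjects) found in one MARCXML record.

    Hypothetical helper: subfields are simply joined in document order.
    """
    root = ET.fromstring(marcxml_text)
    call_numbers, subjects = [], []
    for datafield in root.iter(f"{{{MARC_URI}}}datafield"):
        tag = datafield.get("tag")
        parts = [
            sub.text.strip()
            for sub in datafield.findall("marc:subfield", MARC_NS)
            if sub.text
        ]
        if not parts:
            continue
        if tag in ("050", "090"):
            # 050 = Library of Congress call number, 090 = local call number.
            call_numbers.append(" ".join(parts))
        elif tag == "650":
            # 650 = topical subject heading; subdivisions joined with " -- ".
            subjects.append(" -- ".join(parts))
    return call_numbers, subjects
```

Records without a MARC file, as noted above, would simply yield two empty lists and need another source for subject terms.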
  • Curation includes adding and correcting metadata for books, merging records and authors, removing volumes that are outside of the scope of the collection, rescanning books with errors
    Title id 3971
    Edit record for item (from MARC)
    Edit volumes attached to the title record – correct volume information, re-order volumes
  • 4 levels of descriptive metadata (administrative, structural data produced while scanning)
    “Title” MARC record from library catalog
    Transformed into MARCXML and MODS
    “Volume” information from catalog or entered by human, stored in xml file
    “Segment” (article) information entered by human OR from BioStor etc.
    “Page” metadata entered by human, stored in xml file that provides structure to digital object
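One way to picture how the four levels nest is the small data model below. The class and field names are hypothetical and are not the BHL or Macaw schema. The sketch only shows that a title holds volumes, a volume holds pages, and a segment (article) is a range of pages within a volume.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    sequence: int                 # position of the image within the scan
    number: Optional[str] = None  # printed page number, if there is one
    page_type: str = "Text"       # e.g. "Text" or "Illustration"


@dataclass
class Segment:
    title: str            # article title
    first_sequence: int   # first page image of the article
    last_sequence: int    # last page image of the article


@dataclass
class Volume:
    label: str
    pages: List[Page] = field(default_factory=list)
    segments: List[Segment] = field(default_factory=list)

    def pages_in(self, segment):
        """Return the pages that fall inside a segment's page range."""
        return [
            page
            for page in self.pages
            if segment.first_sequence <= page.sequence <= segment.last_sequence
        ]


@dataclass
class Title:
    name: str  # comes from the library catalog record
    volumes: List[Volume] = field(default_factory=list)
```

Treating a segment as a page range over the scanned volume reflects the note's point that the physical unit is scanned first and article-level information is layered on afterwards.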
  • Item 22379
    Add article “segment” title and other information
    “paginate” = add page data
  • Run taxonomic intelligence to find names (this shows manual editing, but it is an automated process)
    Discovering and storing species names associated with pages enables creation of “species bibliographies”, connections to EOL.org, spLink and other useful tools
  • Won’t show portal functionality – save for William tomorrow.
    Users are a big part of the data management process (collection management)
  • Here is a request from Dr. Karl Siegert that BHL scan Annales de l’Institut Pasteur.
    As far as we know, responding to user requests is rare in the Digital Library world.
    MBLWHOI has this title, as does Cornell University
  • Digitization: Image Capture and Workflow - Constance Rinaldo

    1. Digitization: Image Capture and Workflow. Martin Kalfatovic, Keri Thompson & Connie Rinaldo
    2. Collection Management Cycle: Selection → Refinement → Digitization → Curation → Use → Selection
    3. Communication
    4. Collection Management Cycle: Selection → Refinement → Digitization → Curation → Use → Selection
    5. • Workflow has become more complicated • Difficulty finding books that are easy to scan • Copyright review takes time • Fragile books need repair • The same amount of work, but a different kind
    6. Upload a spreadsheet of titles you plan to scan. Include OCLC number, title, volume number, author, publisher, and date. The tool tries to find matches in other submitted spreadsheets. Lesson: your metadata is always worse than you think
    7. GEMINI: A Critical Tool. Title, volumes needed. Which library has which volumes, additional information. Conversation about which volumes need to be scanned
    8. Selection → Refinement → Digitization → Curation → Use → Selection
    9. Capture: Scanning • Purpose: to provide an accurate digital representation of the original object • One page per image (except field notebooks: 2 pages per image) • No image editing • Reuse existing metadata • in the library catalog • other sources (BioStor etc.)
    10. Capture: Scanning • Most BHL US/UK libraries use the Internet Archive (IA) for scanning books • Some shared funds/one contract for all of BHL • Open Access, non-profit • Low-cost services • Each member library has its own workflow • Members provide basic metadata from the library catalog • In-house digitization or contract with another vendor • Macaw
    11. • Scan books from cover to cover, one image per page • Also called a "volume" or "item": a physical unit, not an intellectual unit, i.e., a book = multiple articles or a book = a monograph [diagram labels: cover, good stuff, cover]
    12. Primary file storage and "staging area" is at the Internet Archive in San Francisco, USA. Secondary backup is at the Smithsonian, including TIFFs for volumes scanned in-house (SIL), ~90TB. Partial replication in Alexandria, Egypt
    13. In-house scanning: Images scanned by the library or another vendor. Metadata harvested via Z39.50. Additional metadata for the item and pages entered by library staff using Macaw software (mimics IA biblio software)
    14. Smithsonian Libraries uses 2 Phase One systems: a P65+ 60MP camera on a copy stand and the BC100, a dual-camera 40MP system, with CaptureOne software. For folios (> 36cm) and fragile books. EXCEPT the Field Notebooks Project (Smithsonian Archives): 2 pages per image for notebooks, flatbed scanner for letters
    15. Capture: Harvest • Automated, scheduled tasks • Books already in the Internet Archive • Subject terms • Library "call numbers" • BioStor/articles
    16. Selection → Refinement → Digitization → Curation → Use → Selection
    17. CURATION. Interface for staff to edit records and put serial volumes in order. Curation includes adding and editing metadata for books, merging records and authors, removing volumes that are outside the scope of the collection, and rescanning books with errors
    18. MACAW: MetadatA Collection And Workflow, A Critical Tool. Allows people to enter page-level metadata such as page number and page type (picture, text, etc.). Creates XML files to upload to IA. Replicates software functionality from the Internet Archive. Installed on a shared SI server for partners to use
    19. Metadata • "Title": MARC record from the library catalog • Transformed into MARCXML and MODS • "Volume" information from the catalog or entered by humans, stored in an XML file • "Segment" (article) information entered by humans or from BioStor etc. (after scanning) • "Page" metadata entered by humans, stored in the XML file that provides structure to the digital object
    20. Add page-level metadata, such as page numbers or article titles
    21. • Other files derived from Internet Archive processes – PDF – DjVu (OCR text: .txt and .xml) – ePub/Daisy/Kindle • Other files created by BHL processes – Taxonomic names – OCR text – BHL METS
    22. Discovering and storing species names associated with pages allows the creation of "species bibliographies," connections to EOL.org, and connections to GBIF
    23. Selection → Refinement → Digitization → Curation → Use → Selection
    24. Gemini: Users can (and do!) • Report technical problems • Request new functionality • Report data errors • Request scanning of specific titles
    25. Gemini: Title, volumes needed. Which library has which volumes, additional information. Assigned to Cornell University. Requestor. As far as we know, responding to user requests is rare in the Digital Library world.
    26. Smithsonian Libraries Workflow [diagram: library catalog, database, Macaw, Internet Archive, and BHL; steps include moving and de-duplicating, tracking and shipping, scanning and metadata harvesting, creating page metadata, creating derivatives, transforming and packaging, MARC to MARCXML, adding the BHL URL to the MARC record, species names, and quality control (% sample)]
    27. Thank you!
    28. Serial Gemini workflow
