The 2015 annual meeting of the Global Biodiversity Heritage Library network will be held in Brazil and will address the state of development of the Biodiversity Heritage Library (BHL) and biodiversity information systems.
Organized by the SciELO and BIOTA programs of FAPESP, the meeting is aimed at researchers and professionals working with biodiversity and scientific information. The scientific program will feature national and international authorities and specialists.
The BHL Global Network (gBHL) includes participants from South Africa, Australia, Brazil, China, Egypt, the United States, and Europe. BHL works collaboratively to promote open access to biodiversity literature as part of the global biodiversity community.
6. • Workflow has become more complicated
• Difficulty finding books that are easy to scan
• Reviewing titles in copyright takes time
• Fragile books need repair
• The same amount of work, but a different kind
7. Upload a spreadsheet of the titles you plan to scan. Include OCLC number, title, volume number, author, publisher, date
Tool tries to find matches in other submitted spreadsheets
Lesson: your metadata is always worse than you think
8. Title, volumes needed
Which library has which volumes, additional information
Conversation about which volumes need to be scanned
GEMINI: A Critical Tool
10. • Purpose: to provide an accurate digital representation of the original object
• one page per image
• (except field notebooks: 2 pages per image)
• no image editing
• Reuse existing metadata
• in the library catalog
• other sources (BioStor etc.)
Capture: Scanning
11. Capture-Scanning
• Most BHL US/UK libraries use the Internet Archive (IA) for scanning books
• Some shared funds/one contract for all BHL
• Open Access, nonprofit
• Inexpensive services
• Each member library has its own workflow
• Members provide basic metadata from the library catalog
• In-house digitization or contract with another vendor
• MACAW
12. • Scan books from cover to cover, one image per page
• A book (also called a "volume" or "item") is a physical unit, not an intellectual unit, i.e., a book = multiple articles or a book = a monograph
[Figure labels: Cover, Cover, good stuff]
13. Partial replication in Alexandria, Egypt
Secondary backup is at the Smithsonian, including TIFFs of volumes scanned in-house (SIL)
~ 90TB
Primary file storage and "staging area" is at the Internet Archive in San Francisco, USA
14. Images scanned by the library or other vendor
Metadata collected through Z39.50
Additional metadata for the item and pages entered by library staff using the Macaw software (mimics IA biblio software)
In-house scanning
15. Smithsonian Libraries:
uses 2 Phase One systems:
P65 60 MP camera on a copy stand and BC100, dual-camera 40 MP
CaptureOne software
For folios (> 36 cm), fragile books
EXCEPT Field Notebooks Project (Smithsonian Archives): 2 pages per image for notebooks, letters
flatbed scanner
16. Capture: Harvest
• Automated, scheduled tasks
• Books already in the Internet Archive
• subject terms
• Library "call numbers"
• BioStor/articles
18. Interface for staff to edit records and put serial volumes in order
Curation includes adding and editing metadata for books, merging records and authors, removing volumes that are outside the scope of the collection, rescanning books with errors.
CURATION
19. Allows people to enter page-level metadata such as page number, page type (picture, text, etc.)
Creates XML files to upload to IA
Replicates software functionality from the Internet Archive
Installed on a shared SI server for partners to use
MACAW: MetadatA Collection And Workflow
A Critical Tool
20. • "Title": MARC record from the library catalog
• Transformed into MARCXML and MODS
• "Volume": information from the catalog or entered by humans, stored in XML
• "Segment" (article): information entered by humans or from BioStor etc. (after scanning)
• "Page": metadata entered by humans, stored in the XML file that provides structure to the digital object
Metadata
22. • Other files derived from Internet Archive processes
– PDF
– DjVu (OCR text: .txt and .xml)
– ePub/Daisy/Kindle
• Other files created by BHL processes
– Taxonomic names
– OCR text
– BHL METS
23. Discovering and storing species names associated with pages allows the creation of "species bibliographies," EOL.org connections, GBIF connections
25. Users can (and do!)
Report technical problems
Request new functionality
Report data errors
Request scanning of specific titles
Gemini
26. Which library has which volumes, additional information
Gemini
Title, volumes needed
Assigned to Cornell University
Requestor
As far as we know, responding to user requests is rare in the digital library world.
[1 min]
Collection management, to me, is a continuous cycle of pre-digitization and post-digitization workflows
Getting the content scanned is one thing; managing the content after it’s been scanned is just as important
You’ll notice that our users play a key role in the cycle
At the start of the project, we tried to scan as much as possible as fast as possible (“feed the beast”) = low-hanging fruit
As the project matures, it becomes more difficult to find material to scan that is in good condition, that is in the Public Domain, or that is on the shelf!
Hired a full-time in-house scanner to do folios and rare, fragile material
Most staff, like scanning, is funded by grants. Not permanent, which means not truly programmatic/infrastructure.
18 plus institutions, 30 plus people, 4 plus time zones
Workflow has become more complicated
Difficulty finding books that are easy to scan
Copyright review takes time
Fragile books need repair
Same quantity of work but different type, slower collection growth
Upload spreadsheet of titles you plan to scan. Include OCLC number, Title, Volume Number, Author, Publisher, Date
Tool tries to find matches in other submitted spreadsheets
Lesson: your metadata is always worse than you think it is
Problem: does not match against BHL in real time. You still must check BHL to be sure, and that doesn’t always happen.
Problem: fuzzy matching algorithm is not that great. Works best against numbers such as the OCLC number (OCLC/WorldCat is a union catalog for libraries). Your metadata is always worse than you think.
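To make the matching problem concrete, here is a minimal sketch of the idea, not the actual tool: it tries an exact OCLC number match first and falls back to a fuzzy title comparison. The function names, the row layout, the 0.9 threshold, and the sample OCLC number "1234" are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(cleaned.split())


def find_matches(planned, submitted, threshold=0.9):
    """Return (planned_row, submitted_row, reason) for likely duplicates."""
    matches = []
    for p in planned:
        for s in submitted:
            # An identical, non-empty OCLC number is the strongest signal.
            if p.get("oclc") and p.get("oclc") == s.get("oclc"):
                matches.append((p, s, "oclc"))
                continue
            # Otherwise compare normalized titles; this is the weak, fuzzy path.
            ratio = SequenceMatcher(
                None, normalize(p["title"]), normalize(s["title"])
            ).ratio()
            if ratio >= threshold:
                matches.append((p, s, "title"))
    return matches


# Hypothetical spreadsheet rows; the OCLC number is a placeholder.
planned = [
    {"oclc": "1234", "title": "Annales de l'Institut Pasteur"},
    {"oclc": "", "title": "Journal of Botany"},
]
submitted = [
    {"oclc": "1234", "title": "Ann. Inst. Pasteur"},
    {"oclc": "", "title": "Journal of botany."},
    {"oclc": "", "title": "Journal of Zoology"},
]
results = find_matches(planned, submitted)
```

In this sample the abbreviated title "Ann. Inst. Pasteur" is only caught because the OCLC numbers agree, which illustrates why number matching works better than title matching.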
Repurpose a generic “issue tracking” system to do many things
-track requests for scanning
-track titles libraries plan to scan (serial volumes)
-track metadata error reports
-track website bugs
Comment trail can be very long. Conversation vs. database. Confusing to database people (me) but shows history of selection.
The selection refinement process can take a long time!
Some background:
Most BHL US/UK libraries use Internet Archive as our scanning “vendor” (partner). This was part of the original BHL formation and grant agreement with MacArthur.
IA was chosen because it is committed to Open Access, is non-profit, and offers low-cost services – more than just digitization
Members can also do their own scanning, or contract with another vendor, but all scans must be “staged” at Internet Archive
Members provide basic metadata from their library catalogs
This decision to scan physical units of books is based on the limitations of available library data. Libraries typically assign data at the “title” level, with maybe some data about individual volumes of a serial.
Workflow is designed around scanning physical books. We are working on incorporating born-digital publications.
Focus is on the information content of the book rather than the book-as-historical-object
TO REITERATE: For BHL,
IA – petaboxes
SI – Isilon
Total BHL storage currently ~ 90TB. It is so low because IA supplies compressed JP2s, and we store them in a .zip file.
Images scanned by library or other vendor
Metadata harvested via Z39.50
Additional metadata for item and pages entered by library staff using Macaw software (mimics IA biblio software)
Scanned by library or other vendor
Smithsonian Libraries uses 2 systems:
P65+ 60MP camera on a copy stand
BC100 – dual camera 40MP scanning backs
CaptureOne image editing software
Macaw for extra metadata
Analysis of MARCXML records in IA (not all books have MARC records) for fields 050 and 090 (call numbers) and 650 (subject headings)
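As a rough illustration of that analysis, the sketch below pulls call numbers (050, 090) and subject headings (650) out of one MARCXML record with Python's standard library. The sample record and the function names are invented for the example; the real harvest code is not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC21 slim") records.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="050" ind1=" " ind2="4">
    <subfield code="a">QH1</subfield>
    <subfield code="b">.A1</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Natural history</subfield>
    <subfield code="v">Periodicals.</subfield>
  </datafield>
</record>"""


def fields(record, tag):
    """Return each datafield with the given tag as a list of subfield values."""
    found = []
    for df in record.findall(f"marc:datafield[@tag='{tag}']", MARC_NS):
        found.append([sf.text for sf in df.findall("marc:subfield", MARC_NS)])
    return found


def harvest(marcxml):
    """Extract call numbers (050, 090) and subject headings (650)."""
    record = ET.fromstring(marcxml)
    call_numbers = fields(record, "050") + fields(record, "090")
    subjects = fields(record, "650")
    return {
        "call_numbers": [" ".join(parts) for parts in call_numbers],
        "subjects": [" -- ".join(parts) for parts in subjects],
    }


info = harvest(SAMPLE)
```

A record with no 050, 090, or 650 fields simply yields empty lists, which matches the note that not every book has usable MARC data.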
Capture - Harvest
Automated, scheduled tasks
Books from Internet Archive
subject terms
library “call numbers”
Manually entering in identifier
Article citations
BioStor
Curation includes adding and correcting metadata for books, merging records and authors, removing volumes that are outside of the scope of the collection, rescanning books with errors
Title id 3971
Edit record for item (from MARC)
Edit volumes attached to the title record – correct volume information, re-order volumes
4 levels of descriptive metadata (administrative, structural data produced while scanning)
“Title” MARC record from library catalog
Transformed into MARCXML and MODS
“Volume” information from catalog or entered by human, stored in xml file
“Segment” (article) information entered by human OR from BioStor etc.
“Page” metadata entered by human, stored in xml file that provides structure to digital object
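One way to picture the four levels is as nested records. This is a hypothetical model for illustration only: the class and field names are assumptions, not the BHL database schema, and the ids in the example are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    sequence: int                 # order of the image within the scanned item
    number: Optional[str] = None  # printed page number, if any
    page_type: str = "Text"       # e.g. "Text", "Illustration", "Cover"


@dataclass
class Segment:
    title: str
    source: str                   # "human" or an external source such as BioStor
    pages: List[Page] = field(default_factory=list)


@dataclass
class Volume:
    item_id: int
    label: str                    # volume information, e.g. "v.1"
    pages: List[Page] = field(default_factory=list)
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Title:
    title_id: int
    name: str                     # taken from the library catalog MARC record
    volumes: List[Volume] = field(default_factory=list)


# Placeholder example: one title, one volume, three pages, one article.
pages = [Page(1, None, "Cover"), Page(2, "1"), Page(3, "2")]
article = Segment("An example article", "human", pages[1:])
volume = Volume(10, "v.1", pages, [article])
title = Title(1, "An example journal", [volume])
```

The nesting mirrors the point made earlier in the talk: the volume is the physical unit that gets scanned, while segments are intellectual units layered on top of its pages.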
Item 22379
Add article “segment” title and other information
“paginate” = add page data
Run taxonomic intelligence to find names (this shows manual editing, but it is an automated process)
Discovering and storing species names associated with pages enables creation of ”species bibliographies”, connections to EOL.org, spLink and other useful tools
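A "species bibliography" is essentially an inverted index from a name to the pages that mention it. The sketch below assumes the name-finding step has already produced a list of names per page; the function name, the item ids, and the sample names are illustrative assumptions, not BHL data.

```python
from collections import defaultdict


def build_species_index(page_names):
    """Map each name to the sorted list of (item_id, page) pairs that mention it."""
    index = defaultdict(set)
    for (item_id, page), names in page_names.items():
        for name in names:
            index[name].add((item_id, page))
    return {name: sorted(refs) for name, refs in index.items()}


# Hypothetical output of the name-finding step: (item id, page sequence) -> names.
page_names = {
    (101, 12): ["Panthera onca", "Puma concolor"],
    (101, 13): ["Panthera onca"],
    (205, 4): ["Puma concolor"],
}
index = build_species_index(page_names)
```

Looking up a name in `index` then returns every scanned page where it occurs, which is the kind of list that links to external tools can be built from.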
Won’t show portal functionality – save for William tomorrow.
Users are a big part of the data management process (collection management)
Here is a request from Dr. Karl Siegert that BHL scan Annales de l’Institut Pasteur.
As far as we know, responding to user requests is rare in the Digital Library world.
MBLWHOI has this title, as does Cornell University