Digitizing Biodiversity Literature
(Technical Content)
CONABIO, México
William Ulate, BHL Technical Director
17 December 2014
Digitization Workflow
Insert Smithsonian Macaw software here
Hardware & Software
Hardware
Using a Scribe Station
Scanning by Internet Archive
Northeast Regional Scanning Facility (Boston)
New Jersey Facility
Natural History Museum, London
Fedscan (Library of Congress)
Internet Archive (San Francisco)
Smithsonian Libraries
Missouri Botanical Garden (Non-Scribe operation)
Hardware & Software
Hardware
Using a Scribe Station
“Off-the-shelf” scanners or good-quality digital cameras
Software
Wonderfetch -> Partner Meta App (when using Scribe machines)
identifier
search_id
title
volume
creator
date
call_number
language
subject
publisher
description
page-progression
possible-copyright-status
licenseurl
Software
Wonderfetch -> Partner Meta App (when using Scribe machines)
Macaw
Scanning Software: Macaw
Uploading directly to Internet Archive (for example: MBG's
Botanicus http://www.botanicus.org/)
Standards and formats to consider
The simplest way to contribute a text item to IA is currently as a single PDF file. IA
creates a second PDF with a text layer if none exists.
Items can be submitted as a stack of image files, one image per page. The files
can be in JPEG2000, JPG, or TIFF format, but with strict requirements for
how the files in an image stack are to be named, and the stack needs to be
packed into a single .zip or .tar file before submission.
When IA (Archive.org) scans a book for a Contributing Library, they use the
custom-engineered "Scribe" workstation, but for many materials, adequate
images can be made with off-the-shelf scanners or good-quality digital
cameras.
For best results, it is recommended to use the highest resolution your device is
capable of. Most images IA processes were produced at a resolution of 300-600 ppi.
Standards and formats to consider
BHL recommends following, in part, the DLF's "Benchmark for
Faithful Digital Reproductions of Monographs and Serials"
(available online at
http://www.diglib.org/standards/bmarkfin.htm).
Bitonal: 600 dpi, 1-bit or bitonal TIFF images
Grayscale: 300 dpi, 8-bit grayscale uncompressed TIFF, or lossless
compressed image (e.g. LZW, JPEG2000 [*.jp2]).
Color: 300 dpi, 24-bit color uncompressed TIFF, or lossless compressed
images (e.g. LZW, JPEG2000 [*.jp2]).
NOTE: the above specifications are the preferred ones. BHL
will, however, accept lossy files. In the case of JPEG2000,
files with a compression level of 85% are acceptable.
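As an illustration, the preferred specifications above can be written as a small lookup that a scanning script checks before upload. This is only a sketch: the names `ScanSpec`, `PREFERRED` and `meets_preferred` are hypothetical, and treating the listed dpi values as minimums is an assumption, not something the benchmark text states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSpec:
    min_dpi: int    # assumption: the listed dpi is treated as a minimum
    bit_depth: int  # bits per pixel


# Preferred settings as listed on the slide.
PREFERRED = {
    "bitonal": ScanSpec(min_dpi=600, bit_depth=1),
    "grayscale": ScanSpec(min_dpi=300, bit_depth=8),
    "color": ScanSpec(min_dpi=300, bit_depth=24),
}


def meets_preferred(mode: str, dpi: int, bit_depth: int) -> bool:
    """Return True when a scan matches the preferred settings for its mode."""
    spec = PREFERRED[mode]
    return dpi >= spec.min_dpi and bit_depth == spec.bit_depth
```

A scan that fails this check may still be acceptable, since BHL also accepts lossy files; the check only flags departures from the preferred settings.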
Standards and formats to consider
Currently, BHL data can be downloaded as MODS,
EndNote and BibTeX. See our wiki page for more
information:
http://biodivlib.wikispaces.com/Data+Exports#x--MODS
Title metadata, as well as pagination, descriptive and page
order (structural) metadata, is being copied into METS
files in the "biodiversity" collection at IA.
The purpose of these METS files is to accommodate our
pagination data.
These METS files are pagination-specific and do not include
the item/volume information.
If bibliographic metadata for BHL content is required, it
can be found in the MODS files on the Data Exports page.
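A minimal sketch of reading bibliographic fields from a MODS record with Python's standard library follows. The sample record is invented for illustration and is not taken from a BHL export; only the MODS v3 namespace and element names are standard. The function name `summarize_mods` is hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record, for illustration only.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example Flora</title></titleInfo>
  <name><namePart>Doe, Jane</namePart></name>
  <originInfo>
    <publisher>Example Press</publisher>
    <dateIssued>1901</dateIssued>
  </originInfo>
</mods>"""


def summarize_mods(xml_text: str) -> dict:
    """Pull a few title-level fields out of a single MODS record."""
    root = ET.fromstring(xml_text)

    def text(path: str):
        node = root.find(path, MODS_NS)
        return node.text if node is not None else None

    return {
        "title": text("mods:titleInfo/mods:title"),
        "creator": text("mods:name/mods:namePart"),
        "publisher": text("mods:originInfo/mods:publisher"),
        "date": text("mods:originInfo/mods:dateIssued"),
    }
```

Fields that are absent from a record come back as `None`, so the caller can decide how to handle incomplete metadata.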
Standards and formats to consider
For the future, we are looking at serving OLEF as
an envelope format to share information with
other BHL Nodes.
See http://www.bhle.eu/bhl-schema/v0.3/ and
http://www.slideshare.net/HeimoRainer/bhleuropemetadataharmonisationtdwg20111018kollerwhrainer/6
Metadata generation and
indexing strategy
Each item to be uploaded needs a unique
identifier within our central repository, currently
Internet Archive (archive.org) and a folder with
such name is created to hold the uploaded and
generated (derivative) files.
Within BHL we record metadata at 3 levels of
bibliographic granularity – Title, Item & Page –
as well as metadata for the Creator(s) of the
title.
Metadata generation and
indexing strategy
Scanned material (jp2.zip), basic title-level metadata
(marc.xml), item-level metadata (meta.xml) and page-level
metadata (scandata.xml) are uploaded to Internet Archive
(IA), in the 'biodiversity' collection.
JP2.zip: The compressed JP2 images (Compression Quality 15) that
IA will use for delivering pages to the Read Online feature,
following a very specific naming convention for the filenames:
master image files are named with the local library identifier + a 4-digit
sequence number (with no gaps).
MARC.xml: The MARC record for the title from the library catalog in
MARCXML format
Title, *Abbreviation, *Creator, Description, Publisher, Start Date Published,
End Date Published, Local Library Identifier, *OCLC Number, *ISSN,
*ISBN, *Call Number, *Subject, *Language, Date Created, Date Last
Modified, *Foreign Keys
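The naming rule for the JP2 stack (local library identifier plus a 4-digit sequence number, with no gaps) can be sketched as below. The underscore separator, the starting number 0001, and the function names are assumptions made for illustration; the exact convention should be taken from the upload documentation.

```python
import io
import re
import zipfile


def page_filenames(identifier, page_count, ext="jp2"):
    """Generate gap-free names: identifier + 4-digit sequence number."""
    # Assumption: "_" separator and numbering that starts at 0001.
    return [f"{identifier}_{seq:04d}.{ext}" for seq in range(1, page_count + 1)]


def missing_sequences(filenames, identifier, ext="jp2"):
    """Return the sequence numbers missing below the highest one found."""
    pattern = re.compile(re.escape(identifier) + r"_(\d{4})\." + re.escape(ext))
    seen = set()
    for name in filenames:
        match = pattern.fullmatch(name)
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [seq for seq in range(1, max(seen) + 1) if seq not in seen]


def build_image_zip(pages):
    """Pack {filename: image bytes} into one in-memory zip archive."""
    buffer = io.BytesIO()
    # JP2 files are already compressed, so they are stored without deflating.
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for name in sorted(pages):
            archive.writestr(name, pages[name])
    return buffer.getvalue()
```

Running `missing_sequences` before packing gives an early warning when a page image was skipped during scanning.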
Metadata generation and
indexing strategy
META.xml: The item-level information (even where redundant with
the title-level information), including the title, author, publisher,
copyright information, digitizing sponsor, date published, type
of item, and who originally uploaded it. IA may also update
this XML file with information as it processes the pages of the
item.
Barcode, Sequence, Local Library Identifier, +Start Volume, End
Volume, +Start Date, End Date, *Language, Scanning Institution,
*Scanning Contributor, *Scanning Sponsor, Date Created, Date
Last Modified
SCANDATA.xml: An XML file (scandata.xml) recording
information about each page image (handSide, cropBox,
original width & height, etc.)
FileName, Sequence, *Page number, *Page Type, Year, Volume,
IssuePrefix, Issue, Date Created, Date Last Modified
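For illustration, an item-level META.xml could be assembled from a dictionary of fields such as those listed earlier for the Partner Meta App (identifier, title, creator, and so on). This is a sketch only: the root element name `metadata`, the sample values, and the function name `build_meta_xml` are assumptions, and the authoritative layout is the one described in the upload documentation.

```python
import xml.etree.ElementTree as ET


def build_meta_xml(fields: dict) -> str:
    """Serialize item-level fields as a flat XML document."""
    root = ET.Element("metadata")  # assumed root element name
    for name, value in fields.items():
        # A list value becomes one repeated element per entry.
        values = value if isinstance(value, list) else [value]
        for entry in values:
            ET.SubElement(root, name).text = str(entry)
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
example = build_meta_xml({
    "identifier": "exampleflora00doej",
    "title": "Example Flora",
    "creator": "Doe, Jane",
    "subject": ["Botany", "Plants"],
    "page-progression": "lr",
})
```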
Metadata generation and
indexing strategy
CREATOR: A “Creator” is defined as a person or
company responsible for the creation of the Title.
Name, *Role, Date of Birth, Date of Death, Biography
A detailed description of the contents of each one
of these files and the whole process of
Uploading content to IA is available at:
http://biodivlib.wikispaces.com/Upload
Metadata generation and
indexing strategy
Internet Archive runs the OCR process and
generates “derivative files” that include:
The resulting files of the OCR process with ABBYY
FineReader (djvu, djvu.txt, djvu.xml, abbyy.gz)
A 100x152 pixel GIF with a looping, animated thumbnail of
the first 20 pages of a book.
The presentation version on BHL in PDF format.
The MARC record in binary and XML formats.
And others (for a more detailed description, see
http://biodivlib.wikispaces.com/Download+All+File+Types+and+Descriptions)
Metadata generation and
indexing strategy
The metadata from new items added to the BHL
collection is loaded into the database and indexed
for use in searches through the Portal and API
services.
Periodically, the OCR pages are run through
taxonomic name services, such as TaxonFinder (ubio.org) and, soon, GNRDS
(Global Names resolution tools and services:
resolver.globalnames.org), to mine for new taxon names.
Taxon names are added to the database and written
back into Internet Archive (names.xml).
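To make the idea of mining OCR text for names concrete, here is a deliberately naive sketch. It is not the TaxonFinder algorithm: it only picks out "Capitalized lowercase" word pairs as candidate binomials and then keeps those found in a list of known names. The pattern, the function names, and the sample text and name list are all invented for illustration.

```python
import re
from collections import Counter

# Naive candidate pattern: capitalized genus followed by a lowercase epithet.
CANDIDATE = re.compile(r"\b([A-Z][a-z]{2,}) ([a-z]{3,})\b")


def candidate_names(ocr_text: str) -> Counter:
    """Count every string that merely looks like a binomial name."""
    return Counter(f"{genus} {epithet}"
                   for genus, epithet in CANDIDATE.findall(ocr_text))


def verified_names(candidates, known_names) -> set:
    """Keep only candidates present in a reference list of names."""
    return {name for name in candidates if name in known_names}


page_text = ("Quercus alba grows near Pinus strobus. "
             "The bark of Quercus alba is pale.")
known = {"Quercus alba", "Pinus strobus"}  # stand-in for a names service

candidates = candidate_names(page_text)
verified = verified_names(candidates, known)
```

The sample shows why candidate strings far outnumber verified names: "The bark" matches the pattern but is dropped at the verification step.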
Online Platform
Capture System
Scribe machines
Macaw
Publication
BHL Portal
BookViewer
PDF Generator
Online Platform
Publication
BHL API
(biodivlib.wikispaces.com/Developer+Tools+and+API)
The BHL Application Programming Interface (API) is a set of
REST-like web services that can be invoked via HTTP queries
(GET/POST requests) or SOAP.
Responses can be received in one of three formats: JSON, XML,
or XML wrapped in a SOAP envelope.
We are currently developing a new API v3, closer to a RESTful
design than previous versions, using resource-centric
URLs (where possible) and GET/PUT/POST/DELETE verbs.
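A query to the HTTP interface can be sketched as building a GET URL from an operation name and its parameters. Everything specific here is an assumption to be checked against the Developer Tools and API page: the endpoint URL, the operation name `GetItemMetadata`, and the parameter names `itemid`, `apikey` and `format`. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed v2 endpoint; verify against the API documentation.
API_V2_ENDPOINT = "https://www.biodiversitylibrary.org/api2/httpquery.ashx"


def build_query(op: str, api_key: str, response_format: str = "json", **params) -> str:
    """Build a GET URL for one API operation."""
    query = {"op": op, **params, "apikey": api_key, "format": response_format}
    return f"{API_V2_ENDPOINT}?{urlencode(query)}"


# Hypothetical call: item id and key are placeholders.
url = build_query("GetItemMetadata", "YOUR-API-KEY", itemid=1234)
```

The resulting URL can be passed to any HTTP client; switching `response_format` to `"xml"` would request the XML response instead of JSON.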
Online Platform
Publication
Data Exports (biodivlib.wikispaces.com/Data+Exports)
Online Platform
Management
BHL Admin Dashboard
Admin Functions
(Alert Message, Image Server, Collections, Institutions,
Languages, Page Types, PDF Requests, Segment Types)
Library Functions
(Titles/Items/Segments /Pagination/Authors)
Science Functions (Names (Taxa) on a Page)
Library Statistics
(Titles/Items/Pages/Names/Segments/Items with Segments,
Names, Pages with Names)
Growth Statistics
(Titles/Items/Pages/Names/Segments new this Month/Year)
Online Platform
Management
BHL Admin Dashboard
PDF Generation Statistics (Generated: 174,162)
Internet Archive Harvesting Statistics (Complete: 119,125 items)
BioStor Harvest Statistics (Published: 11,126 as of Aug. 29, 2013)
DOI Assignment Statistics (DOI Approved: 57,338 as of Aug 29,
2013)
Web Traffic Statistics (API v2, OpenURL)
Reports
(Item Pagination, Title Import History, Character Encoding Problems,
DOIs by Institution, Monographic Contributions,
Items by Contributor)
Deduplication
• We try to avoid duplication where possible
• Tools
• Serials = Scanlist
• Monographs = Monographic deduper
• Check the BHL before you send for scanning
• We do our best but duplication happens
• Post-digitization, we merge titles as necessary
Online Platform
Management
Monographic Deduping Tool
The MBLWHOI Library has been working on a tool that
assists with de-duplicating the monographs that BHL
members are sending to IA for scanning.
The application is ready for use and it's entirely web-based,
requiring no client or user configuration.
The monographic deduper acts as a master database that
contains records for all of the monographs that any BHL
partner institution has scanned.
Online Platform
Management
Monographic Deduping Tool
In addition, a process is in place that allows
material ingested from the Internet Archive, but not
contributed by a BHL partner institution, to be added to the
deduper database.
Ultimately, the monographic deduper database should be
seen as a living record of accountability that communicates
to staff collaborating in the BHL network a partner's
promise to digitize a particular monographic title.
Online Platform
Management
Serials Bid List
A catalogue that lets users browse and search
serial titles held by BHL member institutions using
advanced filtering.
Technical Group at MBG
Mike Lichtenberg
Developer
Trish Rose-Sandler
Data Analyst
William Ulate
Technical Director
Technical Support
MBG IT Division
Manage servers, systems and
telecommunications.
Installs software needed
And others:
MBL
Smithsonian
Internet Archive
BHL-Australia
BHL-Europe
Technical Advisory Group
BHL Architecture: Window Seat Ed.
[Architecture diagram; labels: Firewall, Internet Archive, Images (JP2), PDF, Coordinate-based OCR, XML metadata, BHL DB, Storage, Logic, Access, APIs, UI, Data Exports, Data Transform Utilities, Geocoding, Name Finding]
Projects
Global Names
Art of Life
Purposeful Gaming
Digging into Data
Scientific Name Extraction
TaxonFinder algorithm in production since 2008
More than 100 million candidate name strings
More than 1.5 million unique, verified names
Available through UI, APIs, Data Exports & Internet
Archive
New collaboration with Global Names project
Improved algorithm, better precision & recall
More data with TaxonFinder and Neti Neti!
http://gnrd.globalnames.org/
Taxon Names
BEFORE
Name Instances 101,591,803 101,288,804
Unique Names 7,498,554 7,464,924
Verified Names 1,905,507 1,902,803
EOL Names 63,130,350 62,963,582
EOL Pages 13,579,868 13,532,684
AFTER
Name Instances 151,222,182 150,066,425
Unique Names 29,246,382 29,091,767
Verified Names 10,153,165 10,109,540
EOL Names 87,791,695 87,135,089
EOL Pages 15,466,713 15,342,867
Article-level metadata
Chapter-level metadata
Treatment-level metadata
Part-level metadata
Articles in the BHL UI
See also:
Related Titles
Digitization workflow
1. Titles vs. Items vs. Segments
2. Metadata we need:
• MARC for book and journal titles
• Volume information
• Page data
BHL Term     | Titles                 | Items         | Segments
Library Term | Book or Journal Titles | Volume, Piece | Articles, Book chapters
Meaning      | Conceptual unit        | Object        | Section of consecutive pages
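The Title / Item / Segment distinction above can be pictured as a small data model. The class and field names below are hypothetical and are not BHL's actual schema; the sketch only shows that a title groups items, an item holds pages, and a segment is a run of consecutive pages within an item.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    sequence: int          # position within the item
    page_number: str = ""  # printed page number, if any


@dataclass
class Segment:
    title: str             # e.g. an article or book chapter
    first_sequence: int
    last_sequence: int


@dataclass
class Item:
    volume: str            # the physical object: a volume or piece
    pages: List[Page] = field(default_factory=list)
    segments: List[Segment] = field(default_factory=list)

    def pages_in(self, segment: Segment) -> List[Page]:
        """Return the consecutive pages covered by a segment."""
        return [p for p in self.pages
                if segment.first_sequence <= p.sequence <= segment.last_sequence]


@dataclass
class Title:
    name: str              # the conceptual unit: a book or journal title
    items: List[Item] = field(default_factory=list)


journal = Title("Example Journal of Botany")
volume = Item("v.1 (1901)", pages=[Page(n, str(n)) for n in range(1, 11)])
volume.segments.append(Segment("An example article", 3, 6))
journal.items.append(volume)
```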
Art of Life
Macaw
https://github.com/cajunjoel/macaw-book-metadata-tool
Reviewing Metadata
Manually built:
1,714 sets
89,457 images
Purposeful Gaming
[Image: page of historical text; the extracted characters are unreadable]
OCR Improvements
Gaming
Transcription
Purposeful Gaming
Looking at…
Crowdsourcing Markup & Annotation
Purposeful Gaming
DIGITALKOOT
Joint project run by the National Library of
Finland and Microtask to index the library's
enormous archives so that they are
searchable on the Internet for easier
access to the Finnish cultural heritage.
Purposeful Gaming
DIGITALKOOT
Launched on Feb 8 2011, nearly 110 000
participants completed over 8 million word
fixing tasks by Nov 29 2012
DigiTalkoot enabled volunteers to participate in
this fixing work by playing games.
Purposeful gaming and BHL:
engaging the public in improving and
enhancing access to digital texts
IMLS Grant Program:
National Leadership Grants for Libraries
Partners:
Missouri Botanical Garden
Harvard University
Cornell University
New York Botanical Garden
P.I.: Trish Rose-Sandler, Missouri Botanical Garden
Dates: Dec 2013 – Nov. 2015
Project objectives and benefits
Test new means of crowdsourcing to support the
enhancement of content in BHL
Demonstrate whether digital games are an effective tool for
analyzing and improving digital outputs from OCR and
transcription
Benefits of gaming include:
improved access to content by providing richer and more
accurate data;
an extension of limited staff resources; and
exposure of library content to communities who may not
know about the collections otherwise.
OCR Improvements
German text interpreted by the OCR process as:
“unb auf ben ©elnrgen be6 fublic{)en”
AOCR Improvements
Different resulting texts from parsing the phrase:
“und auf den Gebirgen des südlichen Deutschlands”
(“and on the mountains of southern Germany”)
# | IA OCR          | OCR 2        | Transcription 1 | Transcription 2 |
1 | unb             | und          | und             | und             | Ok
2 | den             | ben          | den             | den             | Ok
3 | ©elnrgen        | ©ebirgen     | Bebirgen        | Gebirgen        | X
4 | be6             | des          | de5             | des             | Chk
5 | fublic{)en      | fublichen    | Füdlichen       | Südlichen       | X
6 | £)eittfc{)(anb6 | Deutfchlanbs | Deutfchlands    | Deutschlands    | X
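The comparison above can be read as a simple vote among the four readings of each word. The sketch below reproduces the last column under an inferred rule, which is an assumption rather than the project's stated method: three or more identical readings give "Ok", exactly two give "Chk" (check), and no agreement gives "X". The function name `consensus` is hypothetical.

```python
from collections import Counter

# The four readings per word: IA OCR, OCR 2, Transcription 1, Transcription 2.
READINGS = [
    ("unb", "und", "und", "und"),
    ("den", "ben", "den", "den"),
    ("©elnrgen", "©ebirgen", "Bebirgen", "Gebirgen"),
    ("be6", "des", "de5", "des"),
    ("fublic{)en", "fublichen", "Füdlichen", "Südlichen"),
    ("£)eittfc{)(anb6", "Deutfchlanbs", "Deutfchlands", "Deutschlands"),
]


def consensus(readings):
    """Return (agreed word or None, status) for one word position."""
    word, votes = Counter(readings).most_common(1)[0]
    if votes >= 3:   # assumed threshold for automatic acceptance
        return word, "Ok"
    if votes == 2:   # assumed threshold for human review
        return word, "Chk"
    return None, "X"


statuses = [consensus(row)[1] for row in READINGS]
```

Words marked "X" are the ones where a game or further transcription would be needed to settle the reading.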
Purposeful Gaming
iDigBio's aOCR Hackathon
Improve OCR parsing of labels with clear metrics
(datasets, output formats, scoring algorithm)
Libraries of regular expressions to clean up each field (different
error correction for latitude/longitude coordinates than for
personal names or herbarium catalog numbers)
Tool for classifying segments of the image before
submitting to OCR
Do a first pass of OCR to clean images before sending
them to a second, 'real' pass of OCR
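Field-specific cleanup, as mentioned above, might look like the following sketch, where each field type gets its own correction rules. The specific character confusions (O/0, l/1, S/5, B/8), the field names, and the function names are assumptions chosen for illustration, not rules taken from the hackathon.

```python
import re

# In numeric fields, letters are likely misread digits.
TO_DIGITS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})
# In name fields, digits are likely misread letters.
TO_LETTERS = str.maketrans({"0": "O", "1": "l", "5": "S"})

# Numeric part, then an optional trailing hemisphere letter.
COORDINATE = re.compile(r"(.*?)\s*([NSEW])?")


def clean_coordinate(raw: str) -> str:
    """Fix digit confusions but keep a trailing hemisphere letter."""
    number, hemisphere = COORDINATE.fullmatch(raw.strip()).groups()
    cleaned = number.translate(TO_DIGITS)
    return f"{cleaned} {hemisphere}" if hemisphere else cleaned


def clean_person_name(raw: str) -> str:
    """Fix letter confusions and collapse repeated whitespace."""
    return re.sub(r"\s+", " ", raw.translate(TO_LETTERS)).strip()


CLEANERS = {
    "latitude": clean_coordinate,
    "longitude": clean_coordinate,
    "collector": clean_person_name,
}


def clean_field(field_name: str, raw: str) -> str:
    """Apply the cleaner registered for a field, or just trim it."""
    return CLEANERS.get(field_name, str.strip)(raw)
```

For example, `clean_field("latitude", "3O.5l S")` yields `30.51 S`, while the same substitutions are not applied to a collector's name.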
iDigBio's CITScribe Hackathon
1. Interoperability between public participation
tools and biodiversity data systems
2. Transcription quality assessment/quality
control (QA/QC) and the reconciliation of
replicate transcriptions
3. Integration of optical character recognition
(OCR) into the transcription workflow
4. User engagement
NfN & iDigBio's CITScribe Hackathon
Jason Best's DarwinScore
Ben Brumfield's Handwriting Gibberish Detector
Dictionaries to improve crowdsourcing consensus
(e.g., names of collectors, scientific names)
Word clouds created using n-gram scoring,
faceting, and Solr for indexing, plus Carrot2 for
specimen selection (visualize and explore the
usage of a word of interest from the word cloud),
and a data cleaning step (the system highlights
infrequent words).
NESCent EOL-BHL Research
Sprint
There is no place like home: Defining “habitat” for
biodiversity science
Robert D. Stevenson
UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston,
MA 02125-3393
Carl Nordman (Natureserve) and
Evangelos Pafilis
Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion,
71003, Crete, Greece
NESCent EOL-BHL Research
Sprint
Assessing Risk Status of Mexican Amphibians
Through Data Mining.
Esther Quintero and Bárbara Ayala
National Commission for Knowledge and Use of
Biodiversity (CONABIO)
and
Anne Thessen
Marine Biological Laboratory and Arizona State University
Planning for global change: using species interactions in conservation
Nicole F. Angeli, Emma P. Gomez, Margot A. Wood,
Applied Biodiversity Sciences Program, Texas A&M University, College
Station, Texas
nangeli1@jhu.edu
Tweet me @auratus_nicole
and
Javier Otegui
University of Colorado-Boulder
There is no place like home: Defining “habitat” for biodiversity science
Robert D. Stevenson
UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA
02125-3393
Carl Nordman (Natureserve)
Evangelos Pafilis
Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003,
Crete, Greece
http://epafilis.info/ , vagpafilis@gmail.com
Evolution in the usage of anatomical concepts in
the biodiversity literature
Todd Vision (tjv@bio.unc.edu),
Prashanti Manda (manda.prashanti@gmail.com), and
Dongye Meng (dmeng@cs.unc.edu)
University of North Carolina at Chapel Hill
Some preliminary observations…
Our API seemed to work fine
Access via a taxon (or a group), for example:
“I want to harvest all pages with names from this taxon (Chordata) or this
common name (Vertebrate)”.
Groups started getting results after 2.5 days.
The structure of BHL was explained so researchers could understand
the title, item, page and part levels and define what they wanted. Example:
one group was looking for terms in the titles and the parts' titles.
Others said they would harvest the OCR from IA, although they would
not be able to harvest the text at page-by-page granularity (only at
item level).
Mining Biodiversity
Mining Biodiversity: Enriching Biodiversity Heritage
with Text Mining and Social Media
One of the international projects that won in the third
round of the 2013 Digging Into Data Challenge
Promote the development of innovative computational
techniques to apply to big data in the humanities and
social sciences
The National Centre for Text Mining (UK)
Missouri Botanical Garden (US)
Dalhousie University's Big Data Analytics Institute (Canada)
Social Media Lab (Canada)
MiBIO: Mining Biodiversity
1. Automatic error correction of OCR text errors.
2. Crowdsource annotation of legacy texts with semantic metadata.
3. Adapt text mining techniques to extract terminology, entities and
significant events automatically and to track terminology evolution over
time.
4. Use Interactive visualization techniques to help users manage search
results through next generation browsing capabilities, assisted by a
semantic similarity network of important terms and entities.
5. Design of a social media layer, serving as an environment for diverse
users to interact and collaborate on science, public education,
awareness and outreach.
MiBIO: Mining Biodiversity
Crowdsource Markup
Display text                              | Species Profile Model category
General/summary                           | TaxonBiology
Geographic range                          | Distribution
Habitat                                   | Habitat
Food sources and feeding behavior         | TrophicStrategy
Physical description (general)            | Description
Physical description (detailed morphology) | DiagnosticDescription
Visit to NaCTeM, Feb. 17, 2014
NaCTeM's Biodiversity-relevant tools
ANNOTATION PLATFORM
Remote Processing
Workflows processed on remote
machines. No attendance
needed
Workflows
GUI for creating single-flow and
multi-branch workflows
Workflow Designer
User Interaction
Annotation Editor allows for
making changes while
processing
Annotator/Curator
WebService
Third-party
applications
Processing Components
Data (de)serialisation, search
engines, NLP, NER, etc.
Developers
Workflows view
Processes View
Documents view
Workflow editor
Workflow as a Web service
http://argo.nactem.ac.uk/test/services/webservice/314
INPUT
OUTPUT
NAMED ENTITY RECOGNISERS
AND NORMALISERS
Automatically recognised
named entities
Linking to external dictionaries
Species and habitat recognition
EVENT EXTRACTORS
Events: associations between
entities
SEMANTIC SEARCH
TERM EXTRACTION
Ryerson University SocialLab's
Netlytic.org
http://miningbiodiversity.com
http://miningbiodiversity.org/
Thank you
William Ulate
BHL Technical Director
Missouri Botanical Garden
william.ulate@mobot.org
Skype: william_ulate_r
Thank you!
And thanks to Bianca Crowley
for the workflow slides

A new flora fauna mycota should...
 
BHL Technical Update (May 2013)
BHL Technical Update (May 2013)BHL Technical Update (May 2013)
BHL Technical Update (May 2013)
 
Global BHL Update May 2013
Global BHL Update May 2013Global BHL Update May 2013
Global BHL Update May 2013
 
The BHL way to content
The BHL way to contentThe BHL way to content
The BHL way to content
 
TDWG 2012 Poster for Art of Life project
TDWG 2012 Poster for Art of Life projectTDWG 2012 Poster for Art of Life project
TDWG 2012 Poster for Art of Life project
 
BHL Technical Projects Updates
BHL Technical Projects UpdatesBHL Technical Projects Updates
BHL Technical Projects Updates
 
The Biodiversity Heritage Library: an Open Global Resource of Literature for ...
The Biodiversity Heritage Library: an Open Global Resource of Literature for ...The Biodiversity Heritage Library: an Open Global Resource of Literature for ...
The Biodiversity Heritage Library: an Open Global Resource of Literature for ...
 
BHL: Toward a Global, Sustainable Resource
BHL: Toward a Global, Sustainable ResourceBHL: Toward a Global, Sustainable Resource
BHL: Toward a Global, Sustainable Resource
 
Global BHL Meeting Action Items
Global BHL Meeting Action ItemsGlobal BHL Meeting Action Items
Global BHL Meeting Action Items
 

Digitizing Biodiversity Literature: an overview of the BHL for CONABIO

  • 1.
  • 2. Digitizing Biodiversity Literature (Technical Content). CONABIO, Mexico. William Ulate, BHL Technical Director. 17 December 2014.
  • 3.
  • 5. Hardware & Software. Hardware: using a Scribe station.
  • 6. Scanning by Internet Archive: Northeast Regional Scanning Facility (Boston); New Jersey Facility; Natural History Museum, London; Fedscan (Library of Congress); Internet Archive (San Francisco); Smithsonian Libraries; Missouri Botanical Garden (non-Scribe operation).
  • 7. Hardware & Software. Hardware: using a Scribe station; off-the-shelf scanners or good-quality digital cameras. Software: Wonderfetch -> Partner Meta App (when using Scribe machines). Fields: identifier, search_id, title, volume, creator, date, call_number, language, subject, publisher, description, page-progression, possible-copyright-status, licenseurl.
  • 8. Hardware & Software. Hardware: using a Scribe station; off-the-shelf scanners or good-quality digital cameras. Software: Wonderfetch -> Partner Meta App (when using Scribe machines); Macaw.
  • 10. Hardware & Software. Hardware: using a Scribe station; off-the-shelf scanners or good-quality digital cameras. Software: Wonderfetch -> Partner Meta App (when using Scribe machines); Macaw; uploading directly to Internet Archive (for example, MBG's Botanicus, http://www.botanicus.org/).
  • 11. Standards and formats to consider. The simplest way to contribute a text item to IA is currently as a single PDF file; IA creates a second PDF with a text layer if none exists. Items can also be submitted as a stack of image files, one image per page. The files can be in JPEG2000, JPG, or TIFF format, but there are strict requirements for how the files in an image stack are named, and the stack must be packed into a single .zip or .tar file before submission. When IA (Archive.org) scans a book for a Contributing Library, it uses the custom-engineered "Scribe" workstation, but for many materials adequate images can be made with off-the-shelf scanners or good-quality digital cameras. For best results, use the highest resolution your device is capable of. Most images IA processes were produced at a resolution of 300-600 ppi.
  • 12. Standards and formats to consider. BHL recommends following, in part, the DLF's "Benchmark for Faithful Digital Reproductions of Monographs and Serials" (available online at http://www.diglib.org/standards/bmarkfin.htm):
      Bitonal: 600 dpi, 1-bit or bitonal TIFF images.
      Grayscale: 300 dpi, 8-bit grayscale uncompressed TIFF, or lossless compressed image (e.g. LZW, JPEG2000 [*.jp2]).
      Color: 300 dpi, 24-bit color uncompressed TIFF, or lossless compressed image (e.g. LZW, JPEG2000 [*.jp2]).
      NOTE: the above specifications are the preferred ones. BHL will, however, accept lossy files. In the case of JPEG2000, files with a compression level of 85% are acceptable.
  • 13. Standards and formats to consider. Currently, BHL data can be downloaded as MODS, EndNote and BibTeX. See our wiki page for more information: http://biodivlib.wikispaces.com/Data+Exports#x--MODS. Title metadata, as well as pagination, descriptive and page-order (structural) metadata, is being copied into METS files in the 'biodiversity' collection at IA. The purpose of these METS files is to accommodate our pagination data. They are pagination-specific and do not include the item/volume information. If bibliographic metadata for BHL content is required, it can be found in the MODS files on the Data Exports page.
  • 14. Standards and formats to consider. For the future, we are looking at serving OLEF as an envelope format to share information with other BHL Nodes. (See http://www.bhle.eu/bhl-schema/v0.3/ and http://www.slideshare.net/HeimoRainer/bhleuropemetadataharmonisationtdwg20111018kollerwhrainer/6)
  • 15. Metadata generation and indexing strategy. Each item to be uploaded needs a unique identifier within our central repository, currently Internet Archive (archive.org), and a folder with that name is created to hold the uploaded and generated (derivative) files. Within BHL we record metadata at 3 levels of bibliographic granularity (Title, Item and Page) as well as metadata for the Creator(s) of the title.
  • 16. Metadata generation and indexing strategy. Scanned material (jp2.zip), basic title-level metadata (marc.xml), item-level metadata (meta.xml) and page-level metadata (scandata.xml) are uploaded to Internet Archive (IA), in the 'biodiversity' collection.
      JP2.zip: The compressed JP2 images (Compression Quality 15) that IA will use for delivering pages to the Read Online feature. The filenames follow a very specific naming convention: master image files are named with the local library identifier + a 4-digit sequence number (with no gaps).
      MARC.xml: The MARC record for the title from the library catalog, in MARCXML format. Fields: Title, *Abbreviation, *Creator, Description, Publisher, Start Date Published, End Date Published, Local Library Identifier, *OCLC Number, *ISSN, *ISBN, *Call Number, *Subject, *Language, Date Created, Date Last Modified, *Foreign Keys.
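The naming rule on this slide (local library identifier plus a 4-digit sequence number, with no gaps) and the packing of the image stack into a single .zip can be sketched as follows. This is a minimal illustration, not IA's or BHL's tooling: the function names, the underscore separator, and the choice to start numbering at 1 are all assumptions that should be checked against the upload documentation.

```python
import io
import zipfile


def page_filenames(identifier, page_count, ext="jp2", start=1, sep="_"):
    """Build gap-free page filenames: identifier + 4-digit sequence number.

    `sep` and `start` are illustrative assumptions; the slide only states
    "local library identifier + 4-digit sequence number (with no gaps)".
    """
    if page_count < 1:
        raise ValueError("an item needs at least one page")
    last = start + page_count - 1
    if start < 0 or last > 9999:
        raise ValueError("sequence numbers must fit in 4 digits")
    return [f"{identifier}{sep}{n:04d}.{ext}" for n in range(start, last + 1)]


def pack_stack(pages, identifier, ext="jp2"):
    """Pack page images (bytes, in reading order) into one in-memory zip."""
    names = page_filenames(identifier, len(pages), ext)
    buffer = io.BytesIO()
    # JP2 data is already compressed, so the images are stored as-is.
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for name, data in zip(names, pages):
            archive.writestr(name, data)
    return buffer.getvalue()
```

Because the names are generated from the page count rather than typed by hand, a gap in the sequence cannot occur; a missing scan shows up as a wrong page count instead.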
  • 17. Metadata generation and indexing strategy.
      META.xml: The item-level information (even where redundant with the title-level information), including the title, author, publisher, copyright information, digitizing sponsor, date published, type of item, and who originally uploaded it. IA may also update this XML file with information as it processes the pages of the item. Fields: Barcode, Sequence, Local Library Identifier, +Start Volume, End Volume, +Start Date, End Date, *Language, Scanning Institution, *Scanning Contributor, *Scanning Sponsor, Date Created, Date Last Modified.
      SCANDATA.xml: An XML file (scandata.xml) recording information about each page image (handSide, cropBox, original width and height, etc.). Fields: FileName, Sequence, *Page number, *Page Type, Year, Volume, IssuePrefix, Issue, Date Created, Date Last Modified.
  • 18. Metadata generation and indexing strategy. CREATOR: A “Creator” is defined as a person or company responsible for the creation of the Title. Fields: Name, *Role, Date of Birth, Date of Death, Biography. A detailed description of the contents of each of these files, and of the whole process of uploading content to IA, is available at: http://biodivlib.wikispaces.com/Upload
  • 19. Metadata generation and indexing strategy. Internet Archive runs the OCR process and generates “derivative files” that include: the output files of the OCR process with ABBYY FineReader (djvu, djvu.txt, djvu.xml, abbyy.gz); a 100x152 pixel GIF with a looping, animated thumbnail of the first 20 pages of a book; the presentation version on BHL in PDF format; the MARC record in binary and XML formats; and others (for a more detailed description, see http://biodivlib.wikispaces.com/Download+All+File+Types+and+Descriptions).
  • 20. Metadata generation and indexing strategy. The metadata from new items added to the BHL collection is loaded into the database and indexed for use in searches through the Portal and API services. Periodically, the OCR pages are run through taxonomic name-finding services, such as TaxonFinder (ubio.org) and, soon, GNRDS (Global Names resolution tools and services: resolver.globalnames.org), to mine for new taxon names. Taxon names are added to the database and written back into Internet Archive (names.xml).
  • 21. Online Platform. Capture System: Scribe machines, Macaw. Publication: BHL Portal, BookViewer, PDF Generator.
  • 22. Online Platform: Publication. BHL API (biodivlib.wikispaces.com/Developer+Tools+and+API). The BHL Application Programming Interface (API) is a set of REST-like web services that can be invoked via HTTP queries (GET/POST requests) or SOAP. Responses can be received in one of three formats: JSON, XML, or XML wrapped in a SOAP envelope. We are currently developing a new API v3, closer to a RESTful design than previous versions, using resource-centric URLs (where possible) and GET/PUT/POST/DELETE verbs.
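As a rough illustration of invoking the API through an HTTP GET query, the sketch below only builds the request URL; it does not send it. The base URL, the parameter names (`op`, `format`, `apikey`) and the `NameSearch` operation with its `name` parameter are assumptions made for the example and should be verified against the Developer Tools and API page before use.

```python
from urllib.parse import urlencode

# Assumption: endpoint of the HTTP query interface; confirm in the API docs.
BASE_URL = "https://www.biodiversitylibrary.org/api2/httpquery.ashx"


def build_query(op, api_key, fmt="json", **params):
    """Return a GET URL for one API operation.

    `op` names the operation, `params` carries its arguments, and `fmt`
    selects the response format (JSON or XML, as listed on the slide).
    """
    if fmt not in ("json", "xml"):
        raise ValueError("format must be 'json' or 'xml'")
    query = {"op": op, **params, "format": fmt, "apikey": api_key}
    return f"{BASE_URL}?{urlencode(query)}"


# Hypothetical call: look up pages mentioning a taxon name.
url = build_query("NameSearch", "YOUR-API-KEY", name="Chordata")
print(url)
```

The resulting URL can be fetched with any HTTP client; keeping URL construction separate from the request makes it easy to log or test queries without network access.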
  • 23. Online Platform Publication Data Exports (biodivlib.wikispaces.com/Data+Exports)
  • 24. Online Platform: Management. BHL Admin Dashboard: Admin Functions (Alert Message, Image Server, Collections, Institutions, Languages, Page Types, PDF Requests, Segment Types); Library Functions (Titles/Items/Segments/Pagination/Authors); Science Functions (Names (Taxa) on a Page); Library Statistics (Titles/Items/Pages/Names/Segments/Items with Segments, Names, Pages with Names); Growth Statistics (Titles/Items/Pages/Names/Segments new this Month/Year).
  • 25. Online Platform Management BHL Admin Dashboard PDF Generation Statistics (Generated: 174,162) Internet Archive Harvesting Statistics (Complete: 119,125 items) BioStor Harvest Statistics (Published: 11,126 as of Aug. 29, 2013) DOI Assignment Statistics (DOI Approved: 57,338 as of Aug 29, 2013) Web Traffic Statistics (API v2, OpenURL) Reports (Item Pagination, Title Import History, Character Encoding Problems, DOIs by Institution, Monographic Contributions, Items by Contributor)
  • 26. Deduplication. We try to avoid duplication where possible. Tools: for serials, the Scanlist; for monographs, the Monographic deduper. Check the BHL before you send material for scanning. We do our best, but duplication happens; post-digitization, we merge titles as necessary.
  • 27. Online Platform: Management. Monographic Deduping Tool. The MBLWHOI Library has been working on a tool that assists with de-duplicating the monographs that BHL members are sending to IA for scanning. The application is ready for use and it's entirely web-based, requiring no client or user configuration. The monographic deduper acts as a master database that contains records for all of the monographs that any BHL partner institution has scanned.
  • 28. Online Platform: Management. Monographic Deduping Tool. In addition, there is a process in place that allows material ingested from the Internet Archive, but not contributed by a BHL partner institution, to be added to the deduper database. Ultimately, the Monographic deduper database should be seen as a living record of accountability that communicates, to staff collaborating in the BHL network, a partner's promise to digitize a particular monographic title.
  • 29. Online Platform Management Serials Bid List It is a catalogue that allows users to browse and search Serials titles held by BHL member institutions using advanced filtering.
  • 30. Technical Group at MBG Mike Lichtenberg Developer Trish Rose-Sandler Data Analyst William Ulate Technical Director
  • 31. Technical Support MBG IT Division Manage servers, systems and telecommunications. Installs software needed And others: MBL Smithsonian Internet Archive BHL-Australia BHL-Europe
  • 33. Firewall Images (JP2) PDF Coordinate-based OCR XML metadata BHL Architecture: Window Seat Ed. BHL DB Internet Archive Storage Logic APIs UI Data Exports Access Data Transform Utilities Geocoding Name Finding
  • 34.
  • 35. Projects Global Names Art of Life Purposeful Gaming Digging into Data
  • 36. Scientific Name Extraction TaxonFinder algorithm in production since 2008 More than 100 million candidate name strings More than 1.5 million unique, verified names Available through UI, APIs, Data Exports & Internet Archive New collaboration with Global Names project Improved algorithm, better precision & recall More data with TaxonFinder and Neti Neti! http://gnrd.globalnames.org/
  • 37. Taxon Names (two values per metric, as on the slide).
      BEFORE: Name Instances 101,591,803 / 101,288,804; Unique Names 7,498,554 / 7,464,924; Verified Names 1,905,507 / 1,902,803; EOL Names 63,130,350 / 62,963,582; EOL Pages 13,579,868 / 13,532,684.
      AFTER: Name Instances 151,222,182 / 150,066,425; Unique Names 29,246,382 / 29,091,767; Verified Names 10,153,165 / 10,109,540; EOL Names 87,791,695 / 87,135,089; EOL Pages 15,466,713 / 15,342,867.
  • 38.
  • 39.
  • 41. Articles in the BHL UI
  • 42.
  • 45.
  • 46. Digitization workflow. 1. Titles vs. Items vs. Segments. 2. Metadata we need: MARC for book and journal titles; volume information; page data.
      BHL Term | Library Term | Meaning
      Titles | Book or Journal Titles | Conceptual unit
      Items | Volume, Piece | Object
      Segments | Articles, Book chapters | Section of consecutive pages
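The Title / Item / Segment distinction on this slide can be made concrete with a small data model. This is only a sketch to show how the three levels relate; the class and field names are illustrative assumptions and do not reflect the actual BHL database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    sequence: int          # position of the scanned image within the item
    page_number: str = ""  # printed page number, if any


@dataclass
class Segment:
    """An article or chapter: a section of consecutive pages in one item."""
    title: str
    first_sequence: int
    last_sequence: int


@dataclass
class Item:
    """A physical object, such as a volume or piece."""
    identifier: str
    volume: str = ""
    pages: List[Page] = field(default_factory=list)
    segments: List[Segment] = field(default_factory=list)

    def add_segment(self, title, first_sequence, last_sequence):
        """Register a segment, requiring every page in the range to exist."""
        known = {page.sequence for page in self.pages}
        wanted = range(first_sequence, last_sequence + 1)
        if first_sequence > last_sequence or not all(n in known for n in wanted):
            raise ValueError("a segment must cover consecutive existing pages")
        segment = Segment(title, first_sequence, last_sequence)
        self.segments.append(segment)
        return segment

    def segment_pages(self, segment):
        return [p for p in self.pages
                if segment.first_sequence <= p.sequence <= segment.last_sequence]


@dataclass
class Title:
    """The conceptual unit: a book or journal title with one or more items."""
    name: str
    items: List[Item] = field(default_factory=list)
```

Storing a segment as a first/last sequence pair, rather than as a list of pages, encodes the "consecutive pages" rule directly in the model.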
  • 47.
  • 53.
  • 58.
  • 60.
  • 62. (Slide image of a printed page; the extracted text is illegible OCR output.)
  • 64. OCR Improvements Transcription Purposeful Gaming Looking at… Crowdsourcing Markup & Annotation
  • 65. Purposeful Gaming: DIGITALKOOT. A joint project run by the National Library of Finland and Microtask to index the library's enormous archives so that they are searchable on the Internet, for easier access to Finnish cultural heritage.
  • 66. Purposeful Gaming: DIGITALKOOT. Launched on Feb 8, 2011; nearly 110,000 participants completed over 8 million word-fixing tasks by Nov 29, 2012. DigiTalkoot enabled volunteers to participate in this fixing work by playing games.
  • 67. Purposeful gaming and BHL: engaging the public in improving and enhancing access to digital texts IMLS Grant Program: National Leadership Grants for Libraries Partners: Missouri Botanical Garden Harvard University Cornell University New York Botanical Garden P.I.: Trish Rose-Sandler, Missouri Botanical Garden Dates: Dec 2013 – Nov. 2015
  • 68. Project objectives and benefits Test new means of crowdsourcing to support the enhancement of content in BHL Demonstrate if digital games are an effective tool for analyzing and improving digital outputs from OCR and transcription Benefits of gaming include: improved access to content by providing richer and more accurate data; an extension of limited staff resources; and exposure of library content to communities who may not know about the collections otherwise.
  • 69. OCR Improvements German text interpreted by the OCR process as: “unb auf ben ©elnrgen be6 fublic{)en”
  • 70. AOCR Improvements. Different resulting texts from parsing the phrase “und auf den Gebirgen des südlichen Deutschlands” (“and on the mountains of southern Germany”):
      # | IA OCR | OCR 2 | Transcription 1 | Transcription 2 | Result
      1 | unb | und | und | und | Ok
      2 | den | ben | den | den | Ok
      3 | ©elnrgen | ©ebirgen | Bebirgen | Gebirgen | X
      4 | be6 | des | de5 | des | Chk
      5 | fublic{)en | fublichen | Füdlichen | Südlichen | X
      6 | £)eittfc{)(anb6 | Deutfchlanbs | Deutfchlands | Deutschlands | X
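The slide does not define the Ok / Chk / X labels. One reading that reproduces all six rows is a simple vote across the four readings of each word: accept when at least three of four agree, flag for checking when exactly one reading is repeated, and reject when no reading repeats. The sketch below implements that reading; the function name and the thresholds are assumptions, not the project's documented algorithm.

```python
from collections import Counter


def classify(readings, ok_share=0.75):
    """Vote across alternative readings of one word.

    Returns (word, label): "Ok" if the top reading reaches `ok_share` of
    the votes, "Chk" if a single reading is repeated but falls short of
    that share, and (None, "X") if there is no usable agreement.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    counts = Counter(readings)
    word, votes = counts.most_common(1)[0]
    if votes / len(readings) >= ok_share:
        return word, "Ok"
    top_is_unique = list(counts.values()).count(votes) == 1
    if votes >= 2 and top_is_unique:
        return word, "Chk"
    return None, "X"


# The six rows from the slide: IA OCR, OCR 2, Transcription 1, Transcription 2.
rows = [
    ["unb", "und", "und", "und"],
    ["den", "ben", "den", "den"],
    ["©elnrgen", "©ebirgen", "Bebirgen", "Gebirgen"],
    ["be6", "des", "de5", "des"],
    ["fublic{)en", "fublichen", "Füdlichen", "Südlichen"],
    ["£)eittfc{)(anb6", "Deutfchlanbs", "Deutfchlands", "Deutschlands"],
]
labels = [classify(row)[1] for row in rows]
```

Words labelled X are the natural candidates to hand to human players, since no automated source agrees with another.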
  • 72.
  • 73. iDigBio's aOCR Hackathon. Improve OCR parsing of labels with clear metrics (datasets, output formats, scoring algorithm). Libraries of regular expressions to clean up each field (different error correction for latitude/longitude coordinates than for personal names or herbarium catalog numbers). A tool for classifying segments of the image before submitting them to OCR. A first pass of OCR to clean images before sending them to a second, 'real' pass of OCR.
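The idea of field-specific regular-expression libraries can be illustrated with two tiny cleaners: one for coordinates, where letters that look like digits are mapped to digits, and one for personal names, where the substitution runs the other way. The patterns, character mappings and function names below are assumptions chosen for the example, not the hackathon's actual code.

```python
import re

# Letters commonly confused with digits in numeric fields (illustrative).
TO_DIGITS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})
# Digits commonly confused with letters in names (illustrative).
TO_LETTERS = str.maketrans({"0": "o", "1": "l", "5": "s"})

# Degrees, minutes and hemisphere, tolerating lookalike letters and spacing.
COORDINATE = re.compile(
    r"([0-9OolISB]{1,3})\s*[°º*]\s*([0-9OolISB]{1,2})\s*['’`]\s*([NSEW])"
)


def clean_coordinate(text):
    """Normalize 'deg° min' H'; return the input unchanged if it does not match."""
    match = COORDINATE.fullmatch(text.strip())
    if match is None:
        return text
    degrees, minutes, hemisphere = match.groups()
    return f"{degrees.translate(TO_DIGITS)}°{minutes.translate(TO_DIGITS)}'{hemisphere}"


def clean_name(text):
    """Collapse whitespace and map lookalike digits back to letters."""
    return re.sub(r"\s+", " ", text.strip()).translate(TO_LETTERS)


# One cleaner per field type, so each field gets its own error correction.
CLEANERS = {"coordinate": clean_coordinate, "collector": clean_name}


def clean_field(kind, text):
    return CLEANERS[kind](text)
```

Keeping the hemisphere letter in its own capture group matters here: the same "S" that means "south" would otherwise be rewritten to "5" by the digit mapping.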
  • 74. iDigBio's CITScribe Hackathon. 1. Interoperability between public participation tools and biodiversity data systems. 2. Transcription quality assessment/quality control (QA/QC) and the reconciliation of replicate transcriptions. 3. Integration of optical character recognition (OCR) into the transcription workflow. 4. User engagement.
  • 75. NfN & iDigBio's CITScribe Hackathon. Jason Best's DarwinScore. Ben Brumfield's Handwriting Gibberish Detector. Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names). Word clouds created using n-gram scoring, faceting, and Solr for indexing, plus Carrot2 for specimen selection (to visualize and explore the usage of a word of interest from the word cloud) and a data cleaning step (the system highlights infrequent words).
  • 76. NESCent EOL-BHL Research Sprint There is no place like home: Defining “habitat” for biodiversity science Robert D. Stevenson UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA 02125-3393 Carl Nordman (Natureserve) and Evangelos Pafilis Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003, Crete, Greece
  • 77. NESCent EOL-BHL Research Sprint Assessing Risk Status of Mexican Amphibians Through Data Mining. Esther Quintero and Bárbara Ayala National Commission for Knowledge and Use of Biodiversity (CONABIO) and Anne Thessen Marine Biological Laboratory and Arizona State University
  • 78. Planning for global change: using species interactions in conservation Nicole F. Angeli, Emma P. Gomez, Margot A. Wood, Applied Biodiversity Sciences Program, Texas A&M University, College Station, Texas nangeli1@jhu.edu Tweet me @auratus_nicole and Javier Otegui University of Colorado-Boulder
  • 79. There is no place like home: Defining “habitat” for biodiversity science Robert D. Stevenson UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA 02125-3393 Carl Nordman (Natureserve) Evangelos Pafilis Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003, Crete, Greece http://epafilis.info/ , vagpafilis@gmail.com
  • 80. Evolution in the usage of anatomical concepts in the biodiversity literature Todd Vision (tjv@bio.unc.edu), Prashanti Manda (manda.prashanti@gmail.com), and Dongye Meng (dmeng@cs.unc.edu) University of North Carolina at Chapel Hill
  • 81. NESCent EOL-BHL Research Sprint Evolution in the usage of anatomical concepts in the biodiversity literature Todd Vision (tjv@bio.unc.edu), Prashanti Manda (manda.prashanti@gmail.com), and Dongye Meng University of North Carolina at Chapel Hill
  • 82. Some preliminary observations… Our API seemed to work fine. Access via a taxon (or a group), for example: “I want to harvest all pages with names from this taxon (Chordata) or this common name (Vertebrate)”. Groups started getting results after 2.5 days. The structure of BHL was explained so researchers could understand the title, item, page and part levels and define what they wanted. For example, one group was looking for terms in the titles and the parts' titles. Others said they would harvest the OCR from IA, although they would not be able to harvest the text at page-by-page granularity (only at item level).
  • 83. NESCent EOL-BHL Research Sprint There is no place like home: Defining “habitat” for biodiversity science Robert D. Stevenson UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA 02125-3393 Carl Nordman (Natureserve) and Evangelos Pafilis Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003, Crete, Greece
  • 85. Mining Biodiversity: Enriching Biodiversity Heritage with Text Mining and Social Media. One of the international projects that won in the third round of the 2013 Digging Into Data Challenge, which promotes the development of innovative computational techniques to apply to big data in the humanities and social sciences. Partners: The National Centre for Text Mining (UK); Missouri Botanical Garden (US); Dalhousie University's Big Data Analytics Institute (Canada); Social Media Lab (Canada).
  • 86. MiBIO: Mining Biodiversity. 1. Automatic correction of OCR text errors. 2. Crowdsourced annotation of legacy texts with semantic metadata. 3. Adapt text mining techniques to extract terminology, entities and significant events automatically and to track terminology evolution over time. 4. Use interactive visualization techniques to help users manage search results through next-generation browsing capabilities, assisted by a semantic similarity network of important terms and entities. 5. Design of a social media layer, serving as an environment for diverse users to interact and collaborate on science, public education, awareness and outreach.
  • 88. Crowdsource Markup.
      Display text | Species Profile Model category
      General/summary | TaxonBiology
      Geographic range | Distribution
      Habitat | Habitat
      Food sources and feeding behavior | TrophicStrategy
      Physical description (general) | Description
      Physical description (detailed morphology) | DiagnosticDescription
  • 89. Visit to NaCTeM, Feb. 17, 2014
  • 92. Remote Processing: workflows processed on remote machines; no attendance needed. Workflows: GUI for creating single-flow and multi-branch workflows (Workflow Designer). User Interaction: Annotation Editor allows for making changes while processing (Annotator/Curator). WebService: third-party applications. Processing Components: data (de)serialisation, search engines, NLP, NER, etc. (Developers).
  • 97. Workflow as a Web service
  • 98. Workflow as a Web service http://argo.nactem.ac.uk/test/services/webservice/314 INPUT OUTPUT
  • 102. Linking to external dictionaries
  • 103. Species and habitat recognition
  • 106.
  • 108.
  • 109.
  • 110.
  • 112.
  • 115. Thank you William Ulate BHL Technical Director Missouri Botanical Garden william.ulate@mobot.org Skype: william_ulate_r
  • 116. Thank you! And thanks to Bianca Crowley for the workflow slides