1. Notes from Nature
Citizen Science data transcription
Peter Oboyski, Jun Ying Lim, Joyce Gross,
Chris Snyder*, Arfon Smith*, Joanie Ball,
Kip Will, Rosemary Gillespie
Essig Museum of Entomology
* Zooniverse Citizen Science Alliance
3. How does it work?
• Introduction to CalBug
• What is Zooniverse?
• What do we provide?
• What happens online?
• What do we get back?
• Technical issues
• Maintaining interest
• How can you get involved?
4. What is CalBug?
NSF - ADBC grant
Collaboration among the eight major
entomology collections in California
Digitize 1.2 million specimens
Essig Museum of Entomology
California Academy of Sciences
California State Collection of Arthropods
Bohart Museum, UC Davis
Entomology Research Museum, UC Riverside
San Diego Natural History Museum
Santa Barbara Museum of Natural History
LA County Museum
6. Berkeley Natural History Museums
• In development
– Integrating point data (specimen records) with
Habitat, Range maps, Elevation, Climate, etc.
– Historical recreation of the environment
– Predict potential impacts of environmental change
– Facilitate land use/management decisions
7. Digitization workflow
• Handling & Imaging
– (Optional) Sort by locality, date, sex, etc.
– Remove labels, add unique identifier
– Take digital image, name and save file
– Replace labels, return to collection
• Data Capture
– Manually enter data into MySQL database
– Online crowd-sourcing of manual data entry
– Optical Character Recognition (OCR) & automated data parsing
• Data Manipulation
– Error checking
– Geographic referencing
– Aggregate data in online cache
– Temporospatial analyses
8. Why Image Labels?
• Magnify difficult to read labels
• Verbatim archive of label data
– Essential for proofing data
– Useful for taxonomists interested in label data
• Data capture can be done remotely
9. Digital camera tethered to computer
Average 50-55 images per hour
Including imaging, file renaming, and upload
Filename = EMEC218958 Paracotalpa ursina.jpg
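The naming convention above (unique identifier, then genus and species) can be built and checked programmatically. The following is a minimal Python sketch, not the museum's actual batch-renaming process. The function names `build_filename` and `parse_filename`, and the assumption that every identifier is the prefix `EMEC` followed by digits, are illustrative only.

```python
import re
from pathlib import Path

# Assumed pattern: "EMEC" + digits, a space, capitalized genus, a space, lowercase species.
FILENAME_RE = re.compile(
    r"^(?P<catalog>EMEC\d+) (?P<genus>[A-Z][a-z]+) (?P<species>[a-z-]+)$"
)

def build_filename(catalog: str, genus: str, species: str, ext: str = ".jpg") -> str:
    """Compose an image filename from the unique identifier and the current name."""
    return f"{catalog} {genus} {species}{ext}"

def parse_filename(filename: str) -> dict:
    """Split a filename back into its parts; raise ValueError if it does not fit."""
    stem = Path(filename).stem  # drop the extension
    match = FILENAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return match.groupdict()

if __name__ == "__main__":
    name = build_filename("EMEC218958", "Paracotalpa", "ursina")
    print(name)
    print(parse_filename(name))
```

Because the identifier leads the filename, a later change of genus or species only requires renaming the file; the identifier still ties the image to the specimen record.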
24. Images in, transcriptions out
• We supply jpeg images
– 400 DPI (300 DPI good)
– Deposited as zip file
– Stored in Amazon Cloud
• In development
– Automated service to
upload images to Amazon Cloud
– Ability to prioritize
image sets
• Zooniverse provides
– MongoDB data dump
– 1 record = 1 transcription
– 4 transcriptions / image
• In development
– Automated daily dump
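Since each record in the dump is one transcription and each image is transcribed four times, a first processing step is to group records by image. The sketch below assumes the dump has been exported as one JSON object per line and that each record carries an `image_id` field; both the export layout and the field name are hypothetical, not the actual Zooniverse schema.

```python
import json
from collections import defaultdict

def group_by_image(dump_lines, image_key="image_id"):
    """Group transcription records (one JSON object per line) by their image."""
    groups = defaultdict(list)
    for line in dump_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        groups[record[image_key]].append(record)
    return dict(groups)

def incomplete_images(groups, expected=4):
    """List images that have fewer than the expected number of transcriptions."""
    return sorted(image for image, records in groups.items() if len(records) < expected)

if __name__ == "__main__":
    sample = [
        '{"image_id": "EMEC218958", "county": "Alameda"}',
        '{"image_id": "EMEC218958", "county": "Alameda"}',
        '{"image_id": "EMEC218959", "county": "Marin"}',
    ]
    grouped = group_by_image(sample)
    print({image: len(records) for image, records in grouped.items()})
    print(incomplete_images(grouped))
```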
25. Reconciling transcriptions
• Drop down lists (Country, State, County, Date)
are compared for exact match
– Occasionally missing, sometimes wrong
– Majority rule
• Free-form text fields (Locality, Collectors)
are much more problematic
– Transcribers asked to record label data verbatim
– Punctuation, capitalization, spacing between words
– Misspelling, expanding abbreviations, interpretations
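The two cases above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's code: `majority_value` applies majority rule to a drop-down field and ignores missing entries, and `normalize_text` removes the punctuation, capitalization, and spacing differences so that free-form transcriptions can be compared. Both function names and the `min_votes` threshold are hypothetical.

```python
from collections import Counter

def majority_value(values, min_votes=2):
    """Return the most common non-empty value, or None if there is no clear winner."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return None
    counts = Counter(present).most_common()
    winner, votes = counts[0]
    tied = len(counts) > 1 and counts[1][1] == votes
    if votes < min_votes or tied:
        return None  # no consensus; flag the record for manual review
    return winner

def normalize_text(text):
    """Lowercase, replace punctuation with spaces, and collapse runs of whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(kept.lower().split())

if __name__ == "__main__":
    print(majority_value(["Alameda", "Alameda", "", "Contra Costa"]))
    print(normalize_text("Berkeley,  Strawberry Cyn."))
```

Normalization handles only the superficial differences. Misspellings, expanded abbreviations, and interpretations still differ after normalizing and need approximate matching.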
26. Reconciling transcriptions
• Developing scripts in R to reconcile free-form text
• Text matching for maximum correspondence among
multiple transcriptions (cf. DNA alignment methods)
• Final result = 1 transcription in our database
with links to the 4 original transcriptions
marked as Citizen Science transcribed record
• Vetting by CalBug personnel still necessary, but we can
prioritize based on record-matching confidence scores
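One simple way to approximate this idea is shown below. It is a Python sketch that swaps in the standard library's `difflib.SequenceMatcher` for the R scripts and alignment methods mentioned above: it picks the transcription most similar to the others rather than building an aligned consensus, and reports the mean pairwise similarity as a confidence score. The `reconcile` function and the way the score is defined are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 (no match) to 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()

def reconcile(transcriptions):
    """Return (best_transcription, confidence) for a list of free-form transcriptions."""
    if not transcriptions:
        raise ValueError("need at least one transcription")
    if len(transcriptions) == 1:
        return transcriptions[0], 0.0  # nothing to compare against

    def mean_similarity(i):
        # Average similarity of candidate i to every other transcription.
        others = [
            similarity(transcriptions[i], text)
            for j, text in enumerate(transcriptions)
            if j != i
        ]
        return sum(others) / len(others)

    best = max(range(len(transcriptions)), key=mean_similarity)
    pairs = list(combinations(transcriptions, 2))
    confidence = sum(similarity(a, b) for a, b in pairs) / len(pairs)
    return transcriptions[best], confidence

if __name__ == "__main__":
    texts = [
        "Berkeley, Strawberry Canyon",
        "Berkeley, Strawberry Canyon",
        "Berkeley, Strawberry Canyon",
        "Berkely Strawbery Canyon",
    ]
    print(reconcile(texts))
```

Records could then be sorted by ascending confidence so that personnel vet the least consistent transcriptions first.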
29. Generating & Maintaining Interest
• Popular media, social media, and press releases
– Only so many occasions for a press release
• Campaigns
– Highlight particular taxa, habitats, geographic regions
• Education
– High quality, high resolution photo of species transcribed
– Create links to other services to learn more about species
• Competitions
– Prizes are worth more than badges
– However, need to watch for bad data in pursuit of prize
30. How can you get involved?
• Right now you cannot
• iDigBio is interested in getting involved
• iDigBio hosting a hackathon in December
• Begin building up collections of images
31. Thank you
And a HUGE thank you to the
CalBug Army
who image our specimens
Chris Amy, Maritess Aristorenas, Jazmin Calderon, Alex Carolina, Sonia Castillo, Matthew Chan, Sabina Cook, Alex Darwish, John Davie, Jesson Go, Nick
Grady-Grote, Ginger Haight, Laura Hayes, Dennis Ho, Aubrey Huey, Leah Humphreys, Veronica Hurd, Hanna Huynh, Eseosa Igbinedion, Ilona Istenes, Emma
Kohlsmith, Asia Kwan, Tiffany Kyo, Jerry Lee, Ken Lee, Christina Lew, Maggie Lewis, Alex Lim, Derick Matano, Christian Munevar, Frank Ngo, Kent Nguyen,
Minh Nguyen, Riley O'Brien, Marielle Pinheiro, Rammonhan Reddy, Jessica Rothery, Stacey Rutherford, Anna Szendrenyi, Anni Sheh, Hannah Shin, Erika So,
Mee Thao, Cindy Truong, Darleen Tu, Skyler Valle, Daug Vaughn, Hayden Wong, Yiu Kei Wong, Keane Yang, Kevin Yao, Frances Zhang
Editor's Notes
Notes from Nature is a collaboration between Zooniverse, a citizen science portal that hosts a number of citizen science projects with a very large following, and three collections partners: CalBug; SERNEC (SouthEast Regional Network of Expertise and Collections); and the Ornithology Collection of the Natural History Museum, London.
The site went live while I was at the iDigBio meeting at the Field Museum in April. Since that time we have surpassed a quarter million transcriptions by over 3,500 citizen scientists.
CalBug is an NSF-ADBC collaborative project among the eight major arthropod collections in California to digitize over one million specimens from our combined collections. Although we are collecting all the data together in a single cache and sharing techniques and workflows, each museum has developed its own approach based on the people and resources they have available. Therefore, what I am presenting is the approach we use at the Essig Museum, which may be somewhat different from the other institutions.
The goal is to make California arthropod diversity data available online through our own web service as well as through aggregators such as GBIF.
Our workflow for digitization can be broken down into three general categories. First is specimen handling and imaging where we remove the labels (from pinned specimens), add unique identifiers (we use datamatrix barcodes), and image the labels placed next to the specimens. Next we capture data from the images either with our own people directly in our own MySQL database, or through our citizen science project, Notes from Nature. We are also looking into ways to incorporate OCR into data capture. Finally, the data are proofed, georeferenced, aggregated and analyzed.
During the iDigBio meeting at the Field Museum in Chicago in April I learned that although many institutions are doing some form of imaging, hardly any were using the images as part of their databasing workflow! Personally I see an overwhelming benefit to imaging the individual specimens with their labels.
Here is an example of one of our pinned specimens. We use a digital camera tethered to a computer. Using IrfanView software to batch process image files we rename each file to include the unique identifier, genus, and species name. Although the genus and species name may change for this specimen over time, it is critical that these elements are in the filename for fast and efficient management of image files.
And now … slide scanning