Presented by Joshua Polansky at the Annual Conference of the Visual Resources Association, April 18th - April 21st, 2012, in Albuquerque, New Mexico.
The Cataloguing Case Studies session will explore metadata migration, workflows, cloud computing, and tagging and how they can be applied to digital collections. Mary Alexander of the University of Alabama will present on the second of two migrations that have taken place at the University of Alabama Libraries and the importance of metadata schema and workflows in that process. Joshua Polansky of the University of Washington will describe his automated workflow using optical character recognition (OCR), Apple Automator, and Microsoft Excel to speed the process of collecting metadata for 75,000 digital assets. Elizabeth Berenz of ARTstor will look at the advantages of cloud-based software for image management, using Shared Shelf as a working example. Finally, Ian McDermott will demonstrate the advantages of expert tagging and annotation in improving metadata. His presentation will focus on two ARTstor collections that could benefit from the knowledge of the larger ARTstor community: the Gernsheim Photographic Corpus of Drawings and the Larry Qualls Archive of contemporary art exhibitions.
MODERATOR:
Jeannine Keefer, University of Richmond, VA
PRESENTERS:
Mary Alexander, University of Alabama
Elizabeth Berenz, ARTstor
Ian McDermott, ARTstor
Joshua Polansky, University of Washington
VRA 2012, Cataloging Case Studies, ROBOCATALOGING
1. ROBOCATALOGING
Accelerated workflows using OCR and automation
Joshua Polansky
University of Washington
College of Built Environments
Visual Resources Collection
Cataloging Case Studies, April 21, 2012
2. University of Washington College of Built Environments
Visual Resources Collection
Serves the departments of Architecture, Construction Management,
Landscape Architecture and Urban Design & Planning
Analog collection:
• 130,000 35mm slides accessioned and cataloged since the 1950s
• Typewritten records; no digital database or online component until 2002
5. The big question:
Automated processes exist for batch
digitizing analog photos.
Is it possible to batch digitize old cataloging data, too?
Good cataloging information here,
researched and typed years ago.
More good data, including source
and a unique accession number.
6. Paper records to the rescue
Binders and binders of accession records
Pristine label photocopies
7. A closer look at the slide label
Architect
Building name
Location / Year
View
Source
Photocopied label edge that will interfere with OCR later
Collection ID that appears on every label in this form
Accession number
8. The big challenge:
• Digitize these typewritten pages
• Sort slide label text into distinct columns in Excel
• Identify each record with its accession number
• Do it all with common or affordable tools
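To picture the "sort into columns" step, here is a minimal Python sketch. It is not the workflow from the talk, which used Acrobat, Excel, and Automator. It assumes each OCR'd label arrives as six non-empty lines in the order shown on the slide label (architect, building name, location / year, view, source, accession number) and that labels are separated by blank lines. The field names, the `parse_labels` helper, and the sample text are all hypothetical, and the collection ID is left out for simplicity.

```python
import re

# Field order follows the slide label; the names themselves are hypothetical.
FIELDS = ["architect", "building", "location_year", "view", "source", "accession"]


def parse_labels(ocr_text):
    """Split OCR text into one dict per label.

    Assumes labels are separated by blank lines and that a clean label has
    exactly len(FIELDS) non-empty lines. Anything else is set aside for
    manual review instead of being guessed at.
    """
    records, rejects = [], []
    for block in re.split(r"\n\s*\n", ocr_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        if len(lines) == len(FIELDS):
            records.append(dict(zip(FIELDS, lines)))
        else:
            rejects.append(lines)
    return records, rejects


# Hypothetical sample: one clean label plus a stray line of OCR noise.
sample = """Example Architect
Example Building
Example City / 1900
Exterior view
Example source
00001

stray photocopy edge
"""

records, rejects = parse_labels(sample)
print(records[0]["accession"], len(rejects))
```

Keeping the rejected blocks separate mirrors the cleanup reality of OCR work: a label with a missing or extra line is easier to fix by hand than to untangle after it has been shifted into the wrong columns.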
10. Hardware
Apple iMac
• 2010 model
• OS 10.6
Any recent Mac will do (OS 10.4 or higher)
Photo: Alvaro Farfán via Flickr. 3392225359
11. Hardware
Epson Perfection V500 scanner
• With optional Automatic Document
Feeder for stacks of 30 sheets at a time
• Standard transparency unit makes it
useful for other scanning projects
• Retails for less than $300 with ADF
Photo: Alvaro Farfán via Flickr. 3392225359
13. Software
Photo: Zak Moreira via Flickr. 3425393424
14. Adobe Photoshop CS4
• Resize and realign scanned page into a
single-column tif with Actions
Adobe Acrobat Pro
• Create a pdf of each tif
• Analyze pdf with optical character recognition
(OCR) and make pdf text selectable
16. Microsoft Excel 2008
• Receive text from Acrobat in columns
• After text manipulation and sorting, output
in a cross-platform format like csv
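The sort-and-export step described for Excel could also be scripted. The following is a minimal Python sketch, assuming the records are already dictionaries keyed by column name and that accession numbers are zero-padded strings, so a plain text sort puts them in order. The `records_to_csv` helper, the column names, and the sample rows are hypothetical.

```python
import csv
import io


def records_to_csv(records, columns):
    """Return csv text for the records, sorted by accession number.

    Sorting is lexical, so it only matches numeric order when the
    accession numbers are zero-padded to the same width (an assumption).
    """
    ordered = sorted(records, key=lambda r: r["accession"])
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return buffer.getvalue()


# Hypothetical rows, deliberately out of order.
rows = [
    {"accession": "00002", "architect": "Example B"},
    {"accession": "00001", "architect": "Example A"},
]
csv_text = records_to_csv(rows, ["accession", "architect"])
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same writer could target a file opened with `newline=""`.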
Apple Automator
• Execute workflows to control multiple
applications. Launch, copy, paste,
manipulate, save, repeat.
• Create Folder Actions for Finder automation
Automator Virtual Input
• Virtual Input: Extend the functionality of
Automator for even more control over
apps, mouse, keyboard
17. Automator
• Comes standard with
Mac OS X 10.4+
• Allows scripting and
workflow creation via
GUI
• Can perform operations
within an application or
across multiple
applications
25. Goal
• Produce nearly perfect metadata,
clean enough to import into
existing database
26. Goal
• Produce nearly perfect metadata,
clean enough to import into
existing database
Actual outcome
• Produced pretty good metadata
• Spent lots of time on data cleanup
to get there
27. Goal
• Use tools on hand; any new tools
should be cheap or useful for
other projects
28. Goal
• Use tools on hand; any new tools
should be cheap or useful for
other projects
Actual outcome
• Used standard software, plus one
new application ($25)
• iMac is a student workstation
• Epson scanner is in use for print
and film scanning plus pdf creation
29. Goal
• Have 75,000 new records ready
to pair with images and publish
to MDID
30. Goal
• Have 75,000 new records ready
to pair with images and publish
to MDID
Actual outcome
• Got 75,000 records!
• Created a searchable shelf list and
archival finding aid
• With further data cleanup, the
original goal of MDID use can be
achieved
32. • Every Mac comes with Automator
and it is easy to learn
• You probably have OCR tools on
your computer right now
• Experimenting can produce great
results
Photo: JF Sebastian via Flickr. 412874324
33.
Photo credits
• Software icons and screenshots by Adobe, Apple,
Microsoft and Singed Labcoat
• Kraftwerk images by Flickr users Zak Moreira,
Alvaro Farfán and JF Sebastian
• Other photo and video by UW CBE VRC
Thank you
Rainer Metzger
University of Washington
Photo: JF Sebastian via Flickr. 412874324