Introduction to and overview of digital repository projects at Northwestern University, developed for a guest lecture in the Digital Curation course at the Dominican University Graduate School of Library and Information Science. Based in part on an earlier presentation developed by Steve DiDomenico and Claire Stewart.
2. Claire Stewart
Director, Center for Scholarly Communication and Digital Curation
Head, Digital Collections, Library Technology Division
Northwestern University
claire-stewart@northwestern.edu
5. Tweeted in 2012 by Gail Steinhart, Head of Research Services, Mann Library, Cornell University
6. Vines, T. H., Albert, A. Y. K., Andrew, R. L., Débarre, F., Bock, D. G., Franklin, M. T., … Rennison, D. J. (2013). The Availability of Research Data Declines Rapidly with Article Age. Current Biology, 24(1), 94–97. doi:10.1016/j.cub.2013.11.014
“The major cause of the reduced data availability for older papers was the rapid increase in the proportion of data sets reported as either lost or on inaccessible storage media. For papers where authors reported the status of their data, the odds of the data being extant decreased by 17% per year (Figure 1D).” [emphasis added]
The Availability of Research Data Declines Rapidly with Article Age
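To make the quoted 17% figure concrete, here is a minimal sketch of how a constant yearly decline in odds compounds over time. The only number taken from the paper is the 17% per year; the function names and the starting probability in the usage note are illustrative assumptions, not values from the study.

```python
import math

# 17% yearly decline in the odds of data being extant, from the quoted passage.
ANNUAL_ODDS_FACTOR = 1 - 0.17


def odds_multiplier(years):
    """Factor by which the odds shrink after the given number of years."""
    return ANNUAL_ODDS_FACTOR ** years


def probability_after(initial_probability, years):
    """Convert a starting probability to odds, apply the decline, convert back.

    The starting probability is a hypothetical input, not a figure from the paper.
    """
    odds = initial_probability / (1 - initial_probability)
    odds *= odds_multiplier(years)
    return odds / (1 + odds)


def years_to_halve_odds():
    """Number of years for the odds to fall to half their starting value."""
    return math.log(0.5) / math.log(ANNUAL_ODDS_FACTOR)
```

Under this model the odds halve roughly every 3.7 years. Calling `probability_after(0.9, 10)` shows what a hypothetical 90% starting chance would look like a decade later if the same rate held throughout.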
7. What is a repository and why should I care?
The Repository:
• A concept
• All the stuff
• A set of technologies
9. Repository as service
• Description and characterization: descriptive, provenance and technical metadata
• Selection, conversion, digitization
• Deposit and versioning
• Interoperability, APIs for ingest, discovery
• Access control, copyright support and other legal/regulatory compliance
• Persistence
– Stable, permanent links (URLs, DOIs, etc.)
– Health of digital objects
– Replication and dark archiving
– Migration or emulation, virtualization
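As one illustration of the "APIs for ingest, discovery" item above, the sketch below builds a harvesting request using OAI-PMH, a widely used protocol for exposing repository metadata. Whether any particular repository offers an OAI-PMH endpoint is an assumption here, and the base URL and function name are hypothetical placeholders; only the `verb`, `metadataPrefix`, and `set` request parameters come from the protocol itself.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is a hypothetical endpoint; oai_dc is the Dublin Core format
    that every OAI-PMH repository is required to support.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional: restrict harvesting to one set (for example, one collection).
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


print(list_records_url("https://repository.example.edu/oai"))
```

A harvester would fetch this URL, parse the XML response, and follow any resumption token to page through the remaining records.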
15. Northwestern Books and the Book Workflow Interface
2009
Mellon-funded
Now used for all in-house book digitization
books.northwestern.edu
16. Every page of each digitized book has this information:
Datastream                                 ID             MIME type        Schema/ontology
Dublin Core metadata                       DC             text/xml         OAI_DC
MODS metadata                              MODS           text/xml         MODS
Relationship metadata                      RELS-EXT       text/xml         RELS-EXT
OCR PDF file                               PDF            application/pdf
OCR XML                                    OCR XML        text/xml         ABBYY OCR
OCR Text                                   OCR TEXT       text/plain
Source camera image file                   ARCHV-IMG      image/jpeg
Source technical metadata in MIX           ARCHIV-TECHMD  text/xml         MIX
Source camera technical metadata in EXIF   ARCHV-EXIF     text/xml         Exif as XML
Corrected image file                       PROC-IMG       image/jpeg
Corrected image technical metadata in MIX  PROC-TECHMD    text/xml         MIX
Delivery image JPEG2000 file               DELIV-IMG      image/jp2
Delivery image technical metadata in MIX   DELIV-TECHMD   text/xml         MIX
SVG for delivery mechanism                 DELIV-OPS      text/xml         SVG
Viewer HTML                                HTML           text/html        HTML
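The per-page datastreams listed above can be modeled as a small data structure, which makes it easy to ask questions such as how many streams are XML or which ones carry MIX technical metadata. This is only an illustrative sketch: the `Datastream` type and helper functions are hypothetical and do not reflect how the repository software itself represents datastreams. The IDs, MIME types, and schemas are copied from the slide as written.

```python
from collections import Counter
from typing import NamedTuple


class Datastream(NamedTuple):
    description: str
    dsid: str
    mimetype: str
    schema: str  # empty string where the slide lists no schema


# Values transcribed from the slide, including the ID spellings as shown there.
PAGE_DATASTREAMS = [
    Datastream("Dublin Core metadata", "DC", "text/xml", "OAI_DC"),
    Datastream("MODS metadata", "MODS", "text/xml", "MODS"),
    Datastream("Relationship metadata", "RELS-EXT", "text/xml", "RELS-EXT"),
    Datastream("OCR PDF file", "PDF", "application/pdf", ""),
    Datastream("OCR XML", "OCR XML", "text/xml", "ABBYY OCR"),
    Datastream("OCR Text", "OCR TEXT", "text/plain", ""),
    Datastream("Source camera image file", "ARCHV-IMG", "image/jpeg", ""),
    Datastream("Source technical metadata in MIX", "ARCHIV-TECHMD", "text/xml", "MIX"),
    Datastream("Source camera technical metadata in EXIF", "ARCHV-EXIF", "text/xml", "Exif as XML"),
    Datastream("Corrected image file", "PROC-IMG", "image/jpeg", ""),
    Datastream("Corrected image technical metadata in MIX", "PROC-TECHMD", "text/xml", "MIX"),
    Datastream("Delivery image JPEG2000 file", "DELIV-IMG", "image/jp2", ""),
    Datastream("Delivery image technical metadata in MIX", "DELIV-TECHMD", "text/xml", "MIX"),
    Datastream("SVG for delivery mechanism", "DELIV-OPS", "text/xml", "SVG"),
    Datastream("Viewer html", "HTML", "text/html", "HTML"),
]


def count_by_mimetype(datastreams):
    """Count how many datastreams share each MIME type."""
    return Counter(ds.mimetype for ds in datastreams)


def ids_using_schema(datastreams, schema):
    """Return the datastream IDs whose schema/ontology matches exactly."""
    return [ds.dsid for ds in datastreams if ds.schema == schema]
```

For example, `count_by_mimetype(PAGE_DATASTREAMS)` shows that most of the fifteen streams attached to a page are XML metadata rather than image or text content, and `ids_using_schema(PAGE_DATASTREAMS, "MIX")` picks out the three technical metadata streams.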
17. By the numbers — # of objects
As of November 2013:
• Finding aids: 1,114
• Digitized books: 3,491
• Digitized book pages: 835,806
• Image objects: 216,271
• A few others, including 3D objects, and collection objects
A total of 1,187,414 objects in the repository
Every object has several datastreams (files, descriptive metadata, technical metadata, etc.)
18. By the numbers — storage
As of Feb 5, 2014:
97.1 TB of content on the repository (including digitized collections queued for ingestion) and the JPEG2000 server.
Library & NUIT purchased 200 TB of storage replicated between the Evanston and Chicago campuses (that is, over 400 TB in total).
19. Digital preservation/persistence
• Persistent URLs
• Mirrored storage (as of fall 2014)
• PREMIS (preservation) metadata
• Routine health checks for data
• Geographically distributed storage
• Dark archives
• Migration/virtualization services
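One common way to implement the "routine health checks for data" item above is a fixity check: record a checksum for each file at ingest, then periodically recompute it and compare. The sketch below assumes SHA-256 and a simple in-memory manifest. Both are illustrative choices, and the function names and file name are hypothetical rather than a description of the checks actually run on this repository.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 digest, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest):
    """Return the paths whose current checksum differs from the recorded one.

    manifest maps file path -> checksum recorded at ingest (hypothetical format).
    A missing file raises an error here; a real service would report it instead.
    """
    return [path for path, expected in manifest.items() if sha256_of(path) != expected]


def demo():
    """Ingest one stand-in file, verify it, corrupt it, and verify again."""
    with tempfile.TemporaryDirectory() as tmp:
        page = Path(tmp) / "page-0001.jp2"
        page.write_bytes(b"pretend image bytes")
        manifest = {str(page): sha256_of(page)}
        before = check_fixity(manifest)
        page.write_bytes(b"pretend image bytez")  # simulate silent corruption
        after = check_fixity(manifest)
        return before, [Path(p).name for p in after]
```

Running `demo()` returns an empty list for the first check and the name of the altered file for the second. When a mismatch is found, a replicated or dark-archive copy gives the repository something to restore from.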
20. Distributed storage and dark archives
• DuraCloud
• Amazon Glacier
• Digital Preservation Network (DPN)
23. Digital Image Library (DIL)
2007: Provost funded move from Art History to the Library, expansion to other disciplines
115,000 images in Hydra + Fedora
Moving all legacy digital collections into DIL & its Hydra counterparts in 2014-2015
images.northwestern.edu
24. Avalon
IMLS-funded project with Indiana University
Releases:
• 0 July 2012
• 0.5 October 2012
• 1.0 May 2013
• 2.0 October 2013 (NU pilot)
First NU production with R3, expected within the next month
media.northwestern.edu (dev/demo)
25. Scholarly communication and digital curation
• Options for archiving scholarly materials
• Authors' rights, copyright help and education, open access support
• E-science and research data life cycle
• Digital humanities
• Library-based publishing
• Responding to funder requirements
26. Hydramata (formerly Shared IR)
Five-institution project to develop a next-generation institutional repository solution in Hydra
27. Expanding our repository program
• Massive storage, planning for growth, sustainability
• Digital preservation services
o Offsite third copy (DPN, DuraCloud, Glacier)
o Verification services
• Research computing
o Research data lifecycle: how to capture metadata early? What to keep?
o Automate deposit from Vault?
• Shared infrastructure and services whenever possible
• Deeper collaboration with NUIT, Research, central admin, schools
28. Discussion and questions