2. HATHITRUST
A Shared Digital Repository
We’re Preserving the Past, What About the Present?
NISO Webinar: Ensuring the Preservation of E-Books
May 23, 2012
Jeremy York, Project Librarian, HathiTrust
3. Outline
• About HathiTrust
• Preservation and Access Strategies
• What about the present?
4. Partnership
Arizona State University
Baylor University
Boston College
Boston University
California Digital Library
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Florida State University
Getty Research Institute
Harvard University Library
Indiana University
Johns Hopkins University
Lafayette College
Library of Congress
Massachusetts Institute of Technology
McGill University
Michigan State University
New York Public Library
New York University
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Stanford University
Texas A&M University
Universidad Complutense de Madrid
University of Arizona
University of Calgary
University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz)
The University of Chicago
University of Connecticut
University of Florida
University of Illinois
University of Illinois at Chicago
The University of Iowa
University of Maryland
University of Miami
University of Michigan
University of Minnesota
University of Missouri
University of Nebraska-Lincoln
The University of North Carolina at Chapel Hill
University of Notre Dame
University of Pennsylvania
University of Pittsburgh
University of Utah
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Washington University
Yale University Library
5. The Name
• The meaning behind the name
– Hathi (hah-tee): Hindi for elephant
– Big, strong
– Never forgets, wise
– Secure
– Trustworthy
6. Strategic Advisory Board
Guidance on Policy, Planning
• 12-member Board of Governors
• Executive Committee
• Executive Director
Executive Committee
Budget/Finances, Decision-making
HathiTrust
7. Digital Repository
• Launched 2008
• Initial focus on digitized book and journal content
– 10,309,742 total volumes
– 5,464,306 book titles
– 271,119 serial titles
– 3,001,018 public domain (~29%)
• “Light” archive
8. Collections and Collaboration
• Comprehensive collection
– Preservation…with Access
• Shared strategies
– Copyright
– Collection management, development
– Preservation
– Discovery / Use
– Bibliographic Indeterminacy
– Efficient user services
• Public Good
17. Collections: Dates and Languages
Languages
• English: 48%
• German: 9%
• French: 7%
• Spanish: 5%
• Chinese: 4%
• Russian: 4%
• Japanese: 3%
• Italian: 3%
• Arabic: 2%
• Latin: 1%
• Remaining Languages: 14%
18. To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge
19. • Rights holders open access
• Publishers deposit master files
• Publish directly into the repository
20. jPach: Journal Publishing in HathiTrust
• http://lib.umich.edu/jpach
• Package of tools to enable publication of open access journals
• Includes modifications to existing code base; new components to facilitate ingest, display, and discoverability of born-digital open-access journal literature
• Allow integration with popular journal publishing tools such as Open Journal Systems (OJS)
21. Key Elements
• Openness
– Content must be licensed for perpetual open access
• Additional formats
– Fixity of bitstream guaranteed where preservation specifications cannot be developed
• Allow download of content not rendered in the interface
• Support articles and contextual information (lists of editors, submission requirements)
• Support for revisions to content
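Guaranteeing "fixity of bitstream" is commonly done by recording a checksum at deposit and recomputing it later. The following is a minimal sketch of that general technique, not a description of HathiTrust's actual tooling; the choice of SHA-256 and the function names `sha256_of` and `fixity_ok` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files never need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, recorded_digest: str) -> bool:
    """True if the file's bytes still match the digest recorded at deposit.

    Hypothetical helper: a real repository would also log when the check
    ran and decide what to do on a mismatch.
    """
    return sha256_of(path) == recorded_digest.lower()
```

A check like this says nothing about whether the file can still be rendered; it only confirms that the bits have not changed, which is the guarantee available when no format-specific preservation plan exists.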
29. File Format Considerations in
the Preservation of e-Books
Sheila Morrissey
Senior Research Developer, Portico
NISO Webinar: Heritage Lost? Ensuring
the Preservation of E-books
May 23, 2012
30. Portico - Third Party Preservation
Portico is among the largest community-supported digital archives in the world.
Working with libraries, publishers, and funders, we preserve e-journals, e-books, and other electronic scholarly content to ensure researchers and students will have access to it in the future.
31. Portico - Participating Content
Over 2,000 societies and associations have committed content to Portico through 147 publisher agreements.
Committed Content
» E-journal titles 13,675
» E-book titles 129,781
» D-collections 46
33. Portico - Audit and Certification
In 2010, Portico became the first digital preservation service to be independently audited by the Center for Research Libraries (CRL) and subsequently certified as a trusted, reliable digital preservation solution that serves the needs of the library community.
34. Portico - History
• 2002: Launch of Electronic Archiving Initiative by JSTOR
• 2005: Portico launched
• 2006: Portico ingests initial e-journal content into the archive
• 2007: Portico makes first trigger title available
• 2009: Portico ingests initial e-book content into the archive
• 2009: Portico fulfills first PCA claim
• 2009: CRL audit of Portico begins
• 2010: Portico ingests initial d-collection content
35. Digital Preservation
Digital preservation is the series of management policies and activities
necessary to ensure the enduring usability, authenticity, discoverability,
and accessibility of content over the very long-term. The key goals of
digital preservation include:
• Usability: the intellectual content of the item must remain usable via the delivery mechanism of current technology
• Authenticity: the provenance of the content must be proven and the content an authentic replica of the original
• Discoverability: the content must have logical bibliographic metadata so that it can be found by end users through time
• Accessibility: the content must be available for use to the appropriate community
36. Preservation: Legal aspects
Legal right to preserve content
» Not always the same as access rights
» Specified in contracts
» Includes embedded or supplemental files, such as images
» DRM removed
39. Usability: Rendition and Delivery
Content is rendered to support current delivery
platform, i.e. web browser.
… rendered & delivered …
Rendition engine can be modified to meet new
technology requirements.
40. Portico – Another Look at the History
• 2002: Launch of Electronic Archiving Initiative by JSTOR
• 2005: Portico launched
• 2006: Portico ingests initial e-journal content into the archive
• 2007: Portico makes first trigger title available; iPhone, Kindle 1
• 2009: Portico ingests initial e-book content; Kindle 2, Nook
• 2010: Portico ingests initial d-collection content; iPad 1, Nook Color
• 2011: iPad 2, Kindle Fire, Nook Simple Touch, ePub3
• 2012: iPad 3
54. E-Book Packages in Portico Submissions
Flat directory
» ONIX xml file with bibliographic metadata, one PDF file per book
Front Cover image JPG files
55. E-Book Packages in Portico Submissions
TAR file (multiple books per file)
» XML manifest file
» One directory for each book,
Proprietary XML file (3 possible versions of XML) with bibliographic metadata
Subdirectory with files for front matter “chapters” (XML, PDF, OCR of PDF)
Subdirectory with files for regular “chapters” (XML, PDF, OCR of PDF)
Subdirectory with files for back matter “chapters” (XML, PDF, OCR of PDF)
Subdirectory with TIFF file for cover image of book
56. E-Book Packages in Portico Submissions
ZIP file (sometimes one book per file, sometimes multiple books)
» Sometimes flat (all books at one level)
» Sometimes one directory for each book,
Sometimes cover images (JPG or TIFF)
Sometimes one PDF for entire book in addition to PDF for each chapter
» Sometimes a manifest
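Because every "sometimes" above is a branch an ingest workflow has to handle, one option is to describe a package before processing it rather than assume a layout. The sketch below illustrates that idea only; it is not Portico's workflow. The function name `describe_package` and the rule that a manifest is any file with "manifest" in its name are hypothetical.

```python
import zipfile
from pathlib import PurePosixPath

# Cover images on these slides are JPG or TIFF.
COVER_EXTENSIONS = {".jpg", ".jpeg", ".tif", ".tiff"}


def describe_package(source) -> dict:
    """Summarize a submitted ZIP (path or file-like object) without assuming a layout."""
    with zipfile.ZipFile(source) as archive:
        # Skip explicit directory entries; only files matter here.
        names = [n for n in archive.namelist() if not n.endswith("/")]
    paths = [PurePosixPath(n) for n in names]
    # Any file nested below the root implies a per-book (or similar) directory.
    top_dirs = {p.parts[0] for p in paths if len(p.parts) > 1}
    return {
        "flat": not top_dirs,
        "book_directories": sorted(top_dirs),
        "pdf_count": sum(p.suffix.lower() == ".pdf" for p in paths),
        "has_cover_image": any(p.suffix.lower() in COVER_EXTENSIONS for p in paths),
        # Assumption: manifests are recognizable by file name.
        "has_manifest": any("manifest" in p.name.lower() for p in paths),
    }
```

The returned summary could then route a package to the handler written for its shape, or flag it for review when no known shape matches.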
66. E-book formats in Portico Submissions
PDF
» One file per chapter
» One file per book
TIFF
» One file per page
JPEG
» One file per page
XML
» For bibliographic metadata
» Proprietary
» ONIX variants
» NLM variants
67. Looking ahead: EPUB 3
EPUB 3 (http://idpf.org/epub/30)
» “EPUB defines a means of representing,
packaging and encoding structured and
semantically enhanced Web content--
including HTML5, CSS, SVG, images,
and other resources-- for distribution in a
single-file format.”
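The "single-file format" in the quote is a ZIP-based container. As a minimal sketch, assuming the EPUB container conventions that the first ZIP entry is an uncompressed file named `mimetype` containing `application/epub+zip` and that `META-INF/container.xml` is present, a quick structural check might look like this. The function name `looks_like_epub` is hypothetical, and passing the check does not make a file a valid EPUB.

```python
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(source) -> bool:
    """Check basic EPUB container conventions on a path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        entries = archive.infolist()
        # The 'mimetype' entry is expected to come first in the archive.
        if not entries or entries[0].filename != "mimetype":
            return False
        first = entries[0]
        # It should be stored without compression and hold the exact media type.
        stored = first.compress_type == zipfile.ZIP_STORED
        declared = archive.read(first) == EPUB_MIMETYPE
        has_container = "META-INF/container.xml" in archive.namelist()
    return stored and declared and has_container
```

A check this shallow is only a first filter; conformance to the content specifications needs a dedicated validator.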
68. Looking ahead: EPUB 3
EPUB 3
» Web standards for key component
technologies
» Free and open specification
» Must work in at least some appliance outside publisher’s own workflow
70. EPUB3 Formats
“Profiles” of standard formats for authoring content
» XHTML5, SVG 1.1, CSS 2.1, CSS 3
Constraints (extensions to HTML5, constraints on SVG)
Specs a “moving target”
Conforming readers must support rendition of certain formats
» Image, audio, video
Defined fallbacks
Globalization, Encoding, Fonts
71. Complications: The New “Browser Wars”
Amazon
» Announces it is replacing MOBI with K8
iBooks
» Different mimetype
» Proprietary extension of CSS Media Queries
» Proprietary XML namespace
» Etc.
72. Complications: “More What You’d Call ‘Guidelines’ Than Actual Rules”
Pirates of the Caribbean: The Curse of the Black Pearl. The Walt Disney Company (2003)
73. Questions or Comments?
Sheila Morrissey
sheila.morrissey@ithaka.org
@sheilaMorr
www.portico.org