BagIt
1.
2. Origin: It started with a simple need
As the Library of Congress began to deal with increasing amounts of digital content,
they faced some issues:
• How do they know what files they have and who they belong to?
• How do they get files from where they are to where they need to be?
The Library of Congress Repository Development Center began working on a solution--
tools for transfer activities including:
• Adding digital content to the collections (whether internal or external data)
• Moving digital content between storage systems
• Review of digital files for fixity, quality and/or authoritativeness
• Inventorying and recording transfer life cycle events for digital files
3. Origin: It evolved naturally from that need
Here is what Leslie Johnston (Library of Congress contributor) and John Kunze (California Digital Library
co-creator) shared about the project’s origin:
4. Origin: But what is it exactly?
• The name comes from the concept of “bag it and tag it”. BagIt allows for the transfer of digital
files by packaging them into a digital “bag” that is accessible for the library to download.
• A bag is like a folder or directory on a computer; it can hold documents, photos, movies, music,
or even other folders.
• A bag comprises three main elements:
1. A bag declaration text file (like a seal of authenticity)
2. A text-file manifest (tag) listing the files in the collection
3. A subdirectory filled with the digital content
• A bag can also contain an optional text file with a small amount of administrative metadata (e.g.
contact info for the collection owner and a description of the collection)
• Once a bag is sent, the receiving computer can analyze the manifest and run checksums on the
contents; if the checksums match (i.e. the files are unchanged), the transfer is successful.
• It’s that simple!
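As a rough illustration of the three elements and the checksum comparison described above, here is a minimal Python sketch using only the standard library. The file names bagit.txt, manifest-md5.txt and data/ follow the BagIt specification draft rather than anything stated on these slides, and the helper names make_minimal_bag and checksums_match are hypothetical. It is a sketch under those assumptions, not the Library of Congress tooling.

```python
import hashlib
from pathlib import Path


def make_minimal_bag(bag_dir: Path, files: dict[str, bytes]) -> None:
    """Write payload files under data/ plus a declaration and a manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)

    manifest_lines = []
    for name, content in sorted(files.items()):
        # Element 3: the digital content lives in the data/ subdirectory.
        (data_dir / name).write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        # Each manifest line pairs a checksum with a path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    # Element 1: the bag declaration.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Element 2: the manifest listing every payload file.
    (bag_dir / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def checksums_match(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.md5((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True
```

Calling make_minimal_bag(Path("mybag"), {"report.txt": b"..."}) on the sending side and checksums_match(Path("mybag")) on the receiving side mirrors the "send, then compare checksums" flow in the last two bullets.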
5. Evolution: Community involvement
• Working with John Kunze of the California Digital Library, Andy Boyko, Justin Littman, Liz Madden,
and Brian Vargas of the Library produced a draft version of BagIt (initially referred to as the “LC
Package Specification”) in December 2008.
• This was posted on the LOC and California Digital Library sites and as an internet “Request for
Comment” (RFC).
• It was also promoted on blogs, in conference presentations, articles, etc. NDIIPP strongly
encouraged partners to “bag” their content for transfer.
• Through the process, project managers began learning what was still missing and where the
specification needed clarification.
• The team then launched a Digital Curation Google group to support the activities of this
participatory community and encourage open, public discussion.
• BagIt is now on version 0.97, having undergone several iterative revisions (6 drafts to date).
6. Evolution: Tools
• BagIt was intended to be simple enough for users to work with directly. However, the community increasingly
began to request tools to help with the use of BagIt, as well as the source code so that they could develop
their own further tools.
• The LOC developed three initial scripts, key utilities for the movement and validation of bagged content, and
released them through SourceForge on December 18, 2008 under a BSD license (an open-source license).
These tools have been rather popular, with 4,617 downloads to date (31 this week).
• The Parallel Retriever: automates the retrieval of remote resources such as web pages, files on an FTP
server, or files on a network drive, and then wraps them into a package that meets the BagIt
specification.
• The Bag Validator Script: checks that a bag meets the standards of the specification (i.e. all files listed in
the manifest are in the data directory, there are no files in the directory not in the manifest, and there
are no duplicate entries in the manifest)
• VerifyIt Script: verifies the checksums of files in a bag against the manifest each time the files are
moved or copied.
• They later released the BagIt Library (BIL) – a Java library to support key functionality such as creating,
manipulating, validating, and verifying Bags, and reading from and writing to a number of formats.
• A client-side Bagger application was also underway in 2009. Bagger is intended to provide a graphical desktop
for the Bagging of content, and ideally will require no client-side IT support or infrastructure.
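The three structural checks attributed to the Bag Validator Script above (every listed file exists, no unlisted files, no duplicate entries) can be sketched in a few lines of Python. This is an illustrative reimplementation, not the LOC script itself. The function name structural_problems is hypothetical, and the manifest-md5.txt and data/ names are taken from the BagIt specification draft.

```python
from collections import Counter
from pathlib import Path


def structural_problems(bag_dir: Path, manifest_name: str = "manifest-md5.txt") -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks pass."""
    lines = (bag_dir / manifest_name).read_text(encoding="utf-8").splitlines()
    # Each manifest line is "<checksum> <path>"; keep only the path.
    listed = [line.split(maxsplit=1)[1] for line in lines if line.strip()]

    # Every regular file found under data/, as a path relative to the bag root.
    on_disk = {
        p.relative_to(bag_dir).as_posix()
        for p in (bag_dir / "data").rglob("*")
        if p.is_file()
    }

    problems = []
    # Check 1: no duplicate entries in the manifest.
    for path, count in Counter(listed).items():
        if count > 1:
            problems.append(f"duplicate manifest entry: {path}")
    # Check 2: all files listed in the manifest are in the data directory.
    for path in sorted(set(listed) - on_disk):
        problems.append(f"listed but missing: {path}")
    # Check 3: no files in the data directory are absent from the manifest.
    for path in sorted(on_disk - set(listed)):
        problems.append(f"present but not listed: {path}")
    return problems
```

Note that this only checks structure; comparing checksums is the separate job the slides assign to the VerifyIt Script.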
7. Evolution: Adaptations
The BagIt tool set became the LOC’s first open source software release. Since then, several BagIt specific
tools have been created to simplify the process in several programming environments (it was originally
designed for use with Unix utilities):
• Python BagIt Library: at least two recent versions exist for this, one completed by Andrew
Hankinson (https://github.com/ahankinson/bagit) and the other by Ed Summers
(https://github.com/edsu/bagit). These libraries can be used to create BagIt style packages
programmatically in Python or from the command line.
• Drupal– Mark Jordan developed a Drupal module for BagIt (http://drupal.org/project/bagit).
• Ruby– Francesco Lazzarino at the Florida Center for Library Automation developed a Ruby
adaptation for BagIt (https://github.com/tipr/bagit).
• PHP– A PHP implementation of BagIt was created by Wayne Graham and Mark Jordan
(https://github.com/scholarslab/BagItPHP).
• RESTful Bag Storage Proposal- Chris Adams developed this draft protocol for serving BagIt
repositories RESTfully (https://github.com/acdha/restful-bag-server).
8. Practicalities: Where does BagIt fit?
“Why are such transfer tools and processes so important? Transfer processes are not surprisingly
linked with preservation, as the tasks performed during the transfer of files must follow a
documented workflow and be recorded in order to mitigate preservation risks... While initial
interest in this problem space came from the need to better manage transfers from external
partners to the Library, the transfer and transport of files within the organization for the purpose
of archiving, transformation, and delivery is an increasingly large part of daily operations. The
digitization of an item can create one or hundreds of files, each of which might have many
derivative versions, and which might reside in multiple locations simultaneously to serve different
purposes. Developing tools to manage such transfer tasks reduce the number of tasks performed
and tracked by humans, and automatically provides for the validation and verification of files with
each transfer event.”
-- from “Releasing Open Source at the Library of Congress” by Leslie Johnston
9. Practicalities: What’s so special about BagIt?
• Bags are uncomplicated, and are therefore able to transcend differences in institutional
data, data architecture, formats and practices.
• Bags have built-in inventory checking (validation) to help ensure that the content is
transferred unchanged and fully intact.
• Unlike packaging tools such as zip or tar, BagIt does not require special software to extract
the files.
• Zip and tar also condense all included files into a single archive file, whereas BagIt creates a
logical package in which files keep their individuality and are simply stored in a traditional
folder or directory container.
• There is no limit to the number / type of files that can be transferred through the use of BagIt.
• Bags are flexible and can work in many different settings– including situations when the
content is located in many different places.
• A bag’s metadata is machine readable, meaning that data can be ingested automatically.
• Bags can be used over computer networks or through the use of portable storage devices.
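To make the "machine readable" point concrete, here is a small Python sketch that reads the optional administrative metadata file into a dictionary. It assumes the file is bag-info.txt with "Label: value" lines and indented continuation lines, as described in the BagIt specification draft; the function name parse_bag_info and the sample values are made up for illustration.

```python
def parse_bag_info(text: str) -> dict[str, list[str]]:
    """Parse "Label: value" metadata into a dict; labels may repeat, so values are lists."""
    metadata: dict[str, list[str]] = {}
    last_label = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and last_label is not None:
            # An indented line continues the value of the previous label.
            metadata[last_label][-1] += " " + line.strip()
            continue
        label, _, value = line.partition(":")
        last_label = label.strip()
        metadata.setdefault(last_label, []).append(value.strip())
    return metadata


# Hypothetical sample content for illustration only.
SAMPLE = """Source-Organization: Example State Archives
Contact-Name: Jane Doe
External-Description: Scanned county maps,
  1900 to 1950
"""

info = parse_bag_info(SAMPLE)
print(info["Contact-Name"][0])
```

An ingest script could use a dictionary like this to populate catalog fields without anyone retyping the contact or description details.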
10. Practicalities: Who Is Using BagIt?
• As of 2009, a significant percentage of the 130 NDIIPP partners were already utilizing the BagIt
specification in their preservation transfers to the Library.
• A few of the organizations who are using BagIt include:
The University of Virginia Libraries
The Stanford Digital Repository
Archivematica
Ghent University Library
The Dryad Data Repository
The University of North Texas
Central Connecticut State University
Towards Interoperable Preservation Repositories (including the Florida Center for Library
Automation, Cornell University, and New York University)
11. Practicalities: BagIt Usage Highlights
• The Stanford Digital Repository: Having had success using BagIt to move geospatial data from the National Geospatial Digital
Archive project from Stanford to the Library of Congress, they settled on BagIt as the primary transfer format for content being
deposited into their repository (ingest stage of OAIS) (http://www.dlib.org/dlib/september10/cramer/09cramer.html).
• Ghent University Library: They currently use BagIt as the archival format for their digital collections. They also use it as an
interchange format for adding new external collections (e.g. Google Books) to the local repositories.
http://www.slideshare.net/hochstenbach/grep-ghent-university-repository
• The Dryad Data Repository: Dryad (a repository of data underlying scientific publications) uses the BagIt specification to share
data and related metadata with TreeBASE, a repository of phylogenetic information.
http://wiki.datadryad.org/BagIt_Handshaking
• Towards Interoperable Preservation Repositories (TIPR): A partnership between the Florida Center for Library Automation,
Cornell University, and New York University to develop, test and promote a standard interchange format for exchanging
information packages among OAIS-based repositories. The proposed format uses the BagIt specification to exchange
package bundles via HTTP. (http://wiki.fcla.edu:8000/TIPR); (https://github.com/tipr/bagit/)
12. The Process: Tutorials
• The North Carolina State Archives has provided a set of 10 thorough tutorials to explain the
BagIt process. The first video includes a summary of the steps involved; the second set
explains the installation process; and the third details creation and verification step-by-step:
http://www.youtube.com/playlist?list=PL1763D432BE25663D&feature=plcp
• The NDIIPP-funded GeoMAPP project has published a BagIt User Guide that can be found at:
http://www.geomapp.net/docs/Using_BagIt_ver2_geomapp_FINAL_20110321.pdf
• The Library of Congress NDIIPP Partner Tools and Services Inventory page includes a brief
description of BagIt, a PDF of the latest version of the BagIt specification, links to some of the
BagIt tools, and a brief video demonstrating the BagIt process:
http://www.digitalpreservation.gov/partners/resources/tools/index.html#b
13. Four Steps to use BagIt
The process is as simple as 1, 2, 3, 4…
Prepare Files for Transfer → Create & Verify Bag → Copy & Verify Bag → Extract Files for Use
14. Image courtesy of the GeoMapp.net BagIt Guide
http://www.geomapp.net/docs/Using_BagIt_ver2_geomapp_FINAL_20110321.pdf
15. Prepare files for transfer
• A bag must have three things– a bag declaration, a list of the content files
(manifest), and the content itself
• Validate content and metadata
• Perform virus check (suggested)
16. Create and verify the bag
• Attach portable drive to computer (or use shared drive)
• Create a new folder to serve as the holding place for your bag
• Use the “BagIt” command to create the bag on this drive
• Verify the bag by using the “verifyvalid” command
17. Copy and Verify the bag
• Copy the bag to a staging area
• Validate the received bag
• Run virus check software on the bag
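The copy and validate steps on this slide can be scripted. The Python sketch below copies a bag into a staging directory and then re-checks every payload file against the manifest, hashing in chunks so that large files are not loaded into memory at once. The function name copy_and_verify is hypothetical, the manifest-md5.txt name comes from the BagIt specification draft, and the virus check is deliberately left out because it depends on local software.

```python
import hashlib
import shutil
from pathlib import Path


def copy_and_verify(source_bag: Path, staging_dir: Path) -> Path:
    """Copy a bag into staging_dir, then verify the copy against its manifest."""
    target = staging_dir / source_bag.name
    shutil.copytree(source_bag, target)

    manifest = (target / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        digest = hashlib.md5()
        with open(target / rel_path, "rb") as handle:
            # Read in 1 MiB chunks to keep memory use flat for large files.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected:
            raise ValueError(f"checksum mismatch after copy: {rel_path}")
    return target
```

Raising an error on the first mismatch is one possible design choice; a production script might instead collect all mismatches and record them as transfer events.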
18. Extract files for use
• Unpack the bag
• Your files are now ready for use!
19. Challenges: Limiting Usage Factors
• Lack of information: The LOC website contains little information aside from what is
included in their brief 3-minute video and short printed description. It is also hard to
find much more from outside online sources. It would be useful to have
further example implementations to really understand how it can be used and
what the advantages are over other formats such as zip files.
• Learning curve: Most of the documentation language is complicated, and would
not be easy for the average person to understand. BagIt doesn’t currently have an
easy-to-use GUI to make the process simple for non-technical users. Bagger
may help with this, but there is little information out there about the Bagger
interface.
20. ?
And that concludes our tour
of BagIt…
Any Questions?
21. Additional Sources
"BagIt File Packaging Format." IETF Documents. Internet Engineering Task Force, 15 Apr 2011. Web. 1 Apr 2012.
<http://tools.ietf.org/html/draft-kunze-bagit-06>.
BagIt: Transferring Content for Digital Preservation. 2009. video. The Library of Congress, Washington, DC.
Web. 1 Apr 2012. <http://www.digitalpreservation.gov/multimedia/videos/bagit0609.html>.
Johnston, Leslie. "Releasing Open Source at the Library of Congress." OCLC Systems & Services: International Digital
Library Perspectives. 26.2 (2010): 94-102.
Johnston, Leslie, and John Kunze. "BagIt funding and versions." 29 Mar 2012. N.p., Online Posting to Digital Curation
Google Group. Web. 1 Apr. 2012.
<http://groups.google.com/group/digital-curation/browse_thread/thread/ace8eafae819762b?pli=1>.
Lavoie, Brian. "The Open Archival Information System Reference Model: Introductory Guide." Technology Watch
Report. 04-01 (2004).
Lazorchak, Butch. "From There to Here, from Here to There, Digital Content is Everywhere!." The Signal: Digital
Preservation. The Library of Congress, 3 Jan 2012. Web. 1 Apr 2012.
<http://blogs.loc.gov/digitalpreservation/2012/01/from-there-to-here-from-here-to-there-digital-content-is-everywhere/>.
Willett, Perry. "BagIt File Packaging Format." California Digital Library, 10 Feb 2012. Web. 1 Apr 2012.
<https://wiki.ucop.edu/display/Curation/BagIt>.
Editor's Notes
Ingest: in practice, BagIt might be used to send packets of information to a digital preservation repository (as part of an AIP packet)