The document discusses a project to investigate using Archivematica, an open-source digital preservation system, to provide digital preservation functionality for research data at the Universities of Hull and York. The project involved three phases: exploring Archivematica and research data needs, developing Archivematica features, and implementing proof-of-concept systems at both universities. Key findings included that Archivematica could meet many preservation needs but had limitations identifying research file formats, and that collaboration was important for addressing challenges in preserving research data long-term.
3. Research at Hull and York
• University of Hull:
– 5 Faculties, 11 Schools
– c. 22,000 students
– 62% research classed as 3* or 4* in REF 2014
– In top 50 UK institutions by ‘research power’
• University of York:
– 30+ academic departments
– c. 16,000 students
– Ranked in the top ten of UK universities for research council income
– Secured £46 million in research council income in 2014/15
4. Why do we need digital preservation for research data?
We can’t ignore digital preservation – moving targets for data
retention mean we need to take this seriously
Funder requirements around retention:
• NERC - data should be retained for a minimum of 10 years but for projects
of major importance this may need to be 20 years or longer
• STFC - expect data to be retained for a minimum of 10 years and data that
cannot be re-measured should be retained indefinitely
• Wellcome Trust – expect data to be kept for a minimum of 10 years but
suggest longer periods for certain types of data
5. Why do we need digital preservation for research data?
University of York RDM questionnaire 2013:
Which data management issues have you come across in your research
over the last five years?
“Inability to read files in old software formats on old media or because of
expired software licences”
24% of the 181 researchers who answered this question reported that this had
been a problem for them
6. Jisc Research Data Spring initiative
• A three phase funding programme starting in March 2015
• Looking for ideas for technical tools to help with RDM
• Ideas crowd-sourced and voted on by anyone interested
• Best ideas invited to pitch for funding
• See https://www.jisc.ac.uk/rd/projects/research-data-spring for more information
7. Project aim - our pitch!
“…to investigate Archivematica and
explore how it might be used to
provide digital preservation
functionality within a wider
infrastructure for Research Data
Management.”
8. Project structure
• Phase 1 – explore: testing, research, thinking (3 months)
• Phase 2 – develop: make Archivematica better for RDM, plan
implementation (4 months)
• Phase 3 – implement: set up proofs of concept at York and
Hull and further investigate the file format problem (6
months)
9. The team
University of Hull:
• Chris Awre – Head of Information Services, Library and Learning
Innovation
• Richard Green – Independent Consultant
• Simon Wilson – University Archivist
University of York:
• Julie Allinson – Manager, Digital York
• Jen Mitcham – Digital Archivist
11. What were we trying to achieve?
A feasibility study:
• What does research data look like?
• What does Archivematica do to preserve data?
• How can Archivematica integrate with our other RDM systems?
• What are our institutional requirements for digital preservation?
• Does Archivematica meet those requirements?
• Where does Archivematica fall short?
All written up and published in a report…
http://dx.doi.org/10.6084/m9.figshare.1481170
12. What is Archivematica?
● Free and open-source digital preservation system (AGPLv3) designed to
maintain standards-based, long-term access to digital objects
● Allows users to process digital objects from ingest to access using the OAIS
functional model
● Implements format normalization upon ingest and preserves originals to
support emulation and migration strategies
13. What is Archivematica?
● Archivematica is a processing pipeline consisting of a bundle of open-source
tools and python scripts which deliver a series of preservation micro-services
● Archivematica is designed to output high-quality, standards-compliant
Archival Information Packages (AIPs)
● Standards used: BagIt, METS, PREMIS
15. Why would we recommend Archivematica for RDM?
• It is flexible and can be configured in different ways for different institutional
needs and workflows
• It allows many of the tasks around digital preservation to be carried out in an
automated fashion
• It can be used alongside other existing systems as part of a wider workflow for
research data
• It is a good digital preservation solution for those with limited resources
• It is an evolving solution that is continually driven and enhanced by and for
the digital preservation community
• It gives institutions greater confidence that they will be able to continue to
provide access to usable copies of research data over time
16. What are the downsides?
• It isn’t a magic bullet
• There is no guarantee your data will be readable in the future
• It can only be as good as current digital preservation practice
• It can be fiddly to install correctly
• The GUI isn’t that intuitive
• You need staff who understand it
17. How could you use Archivematica?
• Host it in-house and link it to an existing repository/access system (for example
DSpace, CONTENTdm, Fedora/Hydra ...or a CRIS)
• Host it in-house and use as a standalone system (you would need to have a
storage system in place and establish a way of facilitating access to the data)
• Sign up for a hosted instance of Archivematica with archivesDIRECT
(combines Archivematica with DuraCloud storage)
• Sign up for a hosted instance of Archivematica with Arkivum (combines
Archivematica with Arkivum storage)
19. Phase 2 development work
• Six different areas of work
• Development carried out by Artefactual Systems from July
2015 to January 2016
• Weekly Google Hangouts to report on progress
• Will be available in Archivematica soon...
20. Deliverable 1
Problem: Research Data needs to be kept,
but we don’t know if anyone will ever want it
and it might be *massive*
The Solution: enable the DIP to be generated ‘on request’
and not as part of the initial ingest
21. Deliverable 2
Problem: We want to be able to grab the DIP,
and metadata about it and pull it into our
repository
The Solution: a library to help with parsing and creating
METS files
https://github.com/artefactual-labs/mets-reader-writer
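To give a feel for what pulling file information out of a METS file involves, here is a minimal sketch using only the Python standard library. It does not use the mets-reader-writer library itself, and the function name `list_mets_files` and the sample XML are hypothetical, written for illustration; only the METS and XLink namespace URIs are standard.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespace URIs.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}


def list_mets_files(mets_xml):
    """Return one dict per <mets:file>: its ID, the USE of its fileGrp, and its location.

    Hypothetical helper for illustration; it only reads top-level fileGrp elements.
    """
    root = ET.fromstring(mets_xml)
    files = []
    for grp in root.iterfind(".//mets:fileSec/mets:fileGrp", NS):
        use = grp.get("USE", "")
        for f in grp.iterfind("mets:file", NS):
            flocat = f.find("mets:FLocat", NS)
            href = None
            if flocat is not None:
                # xlink:href is a namespaced attribute, so use the {uri}name form.
                href = flocat.get("{%s}href" % NS["xlink"])
            files.append({"id": f.get("ID"), "use": use, "href": href})
    return files


# Tiny hand-written METS fragment (an assumption, not real Archivematica output).
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                     xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-1">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"
                     xlink:href="objects/results.csv"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

if __name__ == "__main__":
    for entry in list_mets_files(SAMPLE):
        print(entry)
```

A repository integration would typically need more than this (descriptive and PREMIS metadata, the structural map), which is the kind of work a dedicated library is meant to take on.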
22. Deliverable 3
Problem: We want to be able to report on what
we have
The Solution: a search API to answer basic questions about
number of files in storage, their formats, date of ingest etc.
23. Deliverable 4
Problem: With large datasets, the current
checksum mechanism in Archivematica could be
a bottleneck
The Solution: support for multiple checksum algorithms
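The slide does not say which algorithms were added, so the following is only a sketch of the general idea: the checksum algorithm becomes a parameter, and the file is read in chunks so that very large datasets never have to fit in memory. The function name `checksum` and its parameters are hypothetical; `hashlib.new` is the standard-library call that selects an algorithm by name.

```python
import hashlib
import io


def checksum(stream, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a binary stream using the named algorithm.

    Hypothetical helper: reads chunk_size bytes at a time so memory use
    stays constant regardless of file size.
    """
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    payload = b"abc"
    # The same bytes, checked with a faster and a stronger algorithm.
    for name in ("md5", "sha256"):
        print(name, checksum(io.BytesIO(payload), name))
```

The trade-off is speed against collision resistance: a lighter algorithm may be acceptable for routine fixity checks on very large files, while a stronger one is preferable where tamper evidence matters.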
24. Deliverable 5
Problem: What about all those file formats that
Archivematica can’t identify?
The Solution: mechanism for running file identification with
multiple tools and a report of unidentified formats.
...and I’ll talk a bit more about file format identification later!
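As a rough illustration of running identification with more than one tool and reporting what is left over, the sketch below tries each identifier in turn and tallies the extensions of files that none of them recognise. The two "tools" are toy stand-ins written for this example (they are not DROID, Siegfried or any other real tool), and all function names are hypothetical.

```python
from collections import Counter
from pathlib import PurePath


def identify_by_signature(name, head):
    """Toy signature check: recognises only PDF and PNG magic bytes."""
    if head.startswith(b"%PDF-"):
        return "PDF"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    return None


def identify_by_extension(name, head):
    """Toy fallback: a very small extension lookup table."""
    known = {".csv": "CSV", ".txt": "Plain text"}
    return known.get(PurePath(name).suffix.lower())


def identify_all(files, tools):
    """Try each tool in order; return (identified mapping, unidentified names)."""
    identified, unidentified = {}, []
    for name, head in files:
        for tool in tools:
            fmt = tool(name, head)
            if fmt:
                identified[name] = fmt
                break
        else:
            # No tool returned a format for this file.
            unidentified.append(name)
    return identified, unidentified


def unidentified_report(unidentified):
    """Count unidentified files by extension."""
    return Counter(
        PurePath(name).suffix.lower() or "(no extension)" for name in unidentified
    )


if __name__ == "__main__":
    sample = [
        ("paper.pdf", b"%PDF-1.4"),
        ("table.csv", b"x,y\n1,2\n"),
        ("run1.dat", b"\x00\x01\x02"),
        ("README", b"hello"),
    ]
    found, missing = identify_all(sample, [identify_by_signature, identify_by_extension])
    print(found)
    print(unidentified_report(missing))
```

Ordering the tools from most to least reliable means a signature match always wins over an extension guess.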
25. Deliverable 6
Problem: We want to make it easier for
institutions to adopt Archivematica
The Solution: a webinar describing Archivematica’s
Automation Tools
26. What worked well
• Artefactual staff were good people to work with (and
patient)
• Artefactual have the bigger picture in mind and really
want to understand the use cases
• Our work builds on work that others have done and is
being used and built on by future work that is in the
pipeline:
– Search API work being looked at by Bentley Historical
Library
– DIP generation by Simon Fraser University
27. What didn’t work well
• Many of the areas of development were only partially
solved through our work:
– the problems were big and complex – what does
success look like?
– perhaps we tried to do too much?
• It was hard to prove the impact of our checksum work
• Solving the file identification problem is a huge task and
needed more thought....
28. Implementation plans
Hull and York also worked on implementation plans for
Archivematica. This was key because….
Deciding to use a system is easy...deciding exactly
how to use it is much harder
Separate plans created for Hull and York as different RDM
systems were in place and there were different institutional
needs and priorities.
30. York p-o-c implementation
York wanted to provide:
• an easy way of depositing data
• a way of monitoring datasets for
RDM staff
• a way of requesting access to data
with:
• data sent to Archivematica
• dataset metadata pulled from
PURE
31. York p-o-c implementation
• Metadata from PURE pulled in nightly or on demand
• Fedora objects created for the dataset to store local admin info and help
connect the PURE and Archivematica records
• Visual representation of workflow status
32. York PCDM modelling
• Dataset = dataset record from PURE
• Individual data files are stored, but the folder structure is not
• Folder structure is available in the Archivematica METS
• A dataset can be made up of multiple ‘Packages’ of data, e.g. a newer version
33. What next in York?
• Our RDM staff love the p-o-c and we have agreement to
turn it into a production system over the autumn/winter
• This has been a helpful exercise for broader data modelling /
Hydra implementation at York
• York is a pilot in the Jisc Shared Service for Research Data
and will move forward with this work over the next couple
of years
34. Hull p-o-c implementation
• Hull keen to make Archivematica part of a workflow for any
type of repository content – not just research data. You may
have seen a poster at Hydra Connect last year:
Hull’s p-o-c implements most of the automated bulk ingest route, creates
AIP(s) and builds repository objects from the DIP(s)
35. Hull p-o-c implementation
• User assembles files and simple descriptive
file(s) in Box folder. Shares the folder with
Archivematica
• System checks folder contents and if OK
creates a bag (BagIt standard) for each
object which is passed to Archivematica
• Archivematica processes the bag to create
an AIP which goes to a preservation store…
• …and also a DIP which is passed to the DIP
processor
• DIP processor creates Hydra objects from
the DIP contents and injects them into the
repository QA queue…
• …matched to the AIP by UUID
Thanks to Cottage Labs
for all the new
development work!
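The bagging step in the workflow above can be sketched as follows. This is not the Cottage Labs code; it is a minimal standard-library illustration of what a BagIt bag consists of (a `data/` payload directory, a `bagit.txt` declaration and a checksum manifest). The function name `make_minimal_bag` is hypothetical, and a production workflow would more likely rely on a maintained library such as the Library of Congress bagit-python package.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def make_minimal_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write bagit.txt plus a SHA-256 manifest.

    Hypothetical helper for illustration; bag_dir must not already exist.
    """
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # also creates bag_dir

    # Bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # One manifest line per payload file: "<checksum>  <path relative to bag>".
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        deposit = Path(tmp) / "deposit"
        deposit.mkdir()
        (deposit / "readme.txt").write_bytes(b"abc")
        bag = make_minimal_bag(deposit, Path(tmp) / "bag")
        print(sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*")))
```

Because the manifest travels with the payload, the receiving system can verify that nothing was lost or altered between the shared folder and ingest.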
36. Hull p-o-c options
• Depositors have several options:
• A folder containing multiple data files and one descriptive file ➔ a single
AIP and a single repository object with (optionally) one or more surrogate
files for download (so can be a “metadata-only” record)
• A folder containing multiple files and a csv file (one row per file) ➔
multiple AIPs with multiple repository objects, each with (optionally) a
surrogate for download
• A folder containing the top-level folder of a structure ➔ a zipped structure
in a single AIP and a single repository object (optionally) containing the
zipped file for download
37. What’s next in Hull?
• We hope to be able to take the p-o-c work and
turn it into a production system
• Hull is the UK’s “City of Culture” next year and
there will be a great deal of digital material that
the University Archives want to capture for
posterity
39. File formats problem
Research data file formats are:
• Numerous
• Sometimes a bit obscure
• Sometimes very big
• Ever-changing
• Often very new
This means they can be hard to preserve... The first hurdle is
that we can’t identify them. If we can’t identify them how can
we carry out preservation activities?
41. Can we identify our research data?
We ran DROID* over the research data deposited with Research
Data York over the past year.
Out of 3752 individual files:
• only 37% (1382) of the files were identified (with varying
degrees of accuracy)
• there were 34 different identified file formats in the sample
* DROID is a free tool from The National Archives that can be used to automatically
identify file formats
42. Unidentified research data files
Files not identified by DROID (listed by file ext):
– 107 different file extensions not identified
– huge number with no extension (help!)
– how do we solve the .dat file problem?
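A tally like the one above could be produced from an identification report along the following lines. This is a sketch, not the script used at York: it assumes a CSV export with `TYPE`, `EXT` and `PUID` columns, where an empty `PUID` means the file was not identified, and the function name and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def summarise_export(csv_text):
    """Return (total files, identified files, Counter of unidentified extensions).

    Assumes TYPE, EXT and PUID columns; rows whose TYPE is 'Folder' are skipped.
    """
    total = identified = 0
    unidentified_exts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (row.get("TYPE") or "").strip().lower() == "folder":
            continue
        total += 1
        if (row.get("PUID") or "").strip():
            identified += 1
        else:
            ext = (row.get("EXT") or "").strip().lower()
            unidentified_exts[ext or "(no extension)"] += 1
    return total, identified, unidentified_exts


# Hand-written sample rows, for illustration only.
SAMPLE = """NAME,TYPE,EXT,PUID
data,Folder,,
paper.pdf,File,pdf,fmt/18
run1.dat,File,dat,
run2.DAT,File,DAT,
notes,File,,
"""

if __name__ == "__main__":
    total, identified, exts = summarise_export(SAMPLE)
    print(f"{identified} of {total} files identified")
    print(exts.most_common())
```

The same arithmetic applied to the York figures gives 1382 / 3752 ≈ 36.8%, which is the 37% quoted on the previous slide.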
46. Impact
“In many ways the project at York and Hull felt like a precursor to the Shared
Services pilot; highlighting both the potential problems in working with a wide
range of stakeholders and systems, as well as the massive benefits possible
from pooling our collective knowledge and resources to tackle the technical
challenges which remain in RDM.”
From ‘Unlocking Research’ blog from the University of Cambridge Office of
Scholarly Communication (16 September 2016)
“I've just read your paper on linking repositories and Archivematica -
fascinating and full of very useful information! I will certainly be following up
with many of the links and ideas you presented, especially the Jisc Research
Data Shared Service work on digital preservation, and I will discuss with my
colleagues the possibility of using Archivematica at our data centre and our
options for collaboration. Many of our issues relate to the long tail of research
data and the preservation of data already archived.”
Comment on Digital Archiving blog (18 October 2016)
47. Challenges
• Impact of short, but focused, timeframes and short lead-in times
• Access to appropriate skills (mainly technical development) limited scope of
work
• Limited budget, hence ‘parsimonious’ approach to making the best use of this
• Interpretations of digital archiving across preservation, RDM and IT
communities
• Balancing dissemination with actual doing!
48. What have we learned?
• Archivematica can be used to manage the preservation of research data
• And that this can be embedded within similar, but different, institutional workflows
• There is benefit in getting focused systems to do what they do best rather than
adding functionality to any particular one
• There is a file format recognition issue that will affect long-term preservation of
research data files
• But there is a way to address this through the development of additional file signatures
• There is real benefit in working collaboratively in this area, both within the project
and beyond it, to identify common ways of tackling the problem of preserving
research data
49. Jisc Research Data Shared Service
• Almost in parallel with the Research Data Spring projects Jisc were planning a
Research Data Shared Service
• The resulting system will be managed and hosted, and will offer three core
modules: repository, preservation and reporting
• Phase 1 and 2 reports from Hull and York very influential for the preservation
module
• Commercial and open source offerings for each module, including Archivematica
(for preservation) and Hydra (for the repo)
• Over 20 pilot institutions recruited (including York) – all identified preservation
as a priority
• https://www.jisc.ac.uk/rd/projects/research-data-shared-service