Filling the Digital Preservation Gap
Chris Awre (@clawre) and Jenny Mitcham (@Jenny_Mitcham)
7th or 8th November 2016
Introduction: who we are and why we are
doing this
Research at Hull and York
• University of Hull:
– 5 Faculties, 11 Schools
– c. 22,000 students
– 62% research classed as 3* or 4* in REF 2014
– In top 50 UK institutions by ‘research power’
• University of York:
– 30+ academic departments
– c. 16,000 students
– Ranked in the top ten of UK universities for research council income
– Secured £46 million in research council income in 2014/15
Why do we need digital preservation for research data?
We can’t ignore digital preservation – moving targets for data
retention mean we need to take this seriously
Funder requirements around retention:
• NERC - data should be retained for a minimum of 10 years but for projects
of major importance this may need to be 20 years or longer
• STFC - expect data to be retained for a minimum of 10 years and data that
cannot be re-measured should be retained indefinitely
• Wellcome Trust – expect data to be kept for a minimum of 10 years but
suggest longer periods for certain types of data
Why do we need digital preservation for research data?
University of York RDM questionnaire 2013:
Which data management
issues have you come
across in your research
over the last five years?
24% of 181 researchers who answered this
question admitted this had been a problem for
them
“Inability to read files in old software
formats on old media or because of
expired software licences”
Jisc Research Data Spring initiative
• A three phase funding programme starting in March 2015
• Looking for ideas for technical tools to help with RDM
• Ideas crowd-sourced and voted on by anyone interested
• Best ideas invited to pitch for funding
• See https://www.jisc.ac.uk/rd/projects/research-data-spring for more information
Project aim - our pitch!
“…to investigate Archivematica and
explore how it might be used to
provide digital preservation
functionality within a wider
infrastructure for Research Data
Management.”
Project structure
• Phase 1 – explore: testing, research, thinking (3 months)
• Phase 2 – develop: make Archivematica better for RDM, plan
implementation (4 months)
• Phase 3 – implement: set up proofs of concept at York and
Hull and further investigate the file format problem (6
months)
The team
University of Hull:
• Chris Awre – Head of Information Services, Library and Learning
Innovation
• Richard Green – Independent Consultant
• Simon Wilson – University Archivist
University of York:
• Julie Allinson – Manager, Digital York
• Jen Mitcham – Digital Archivist
Phase 1 - Explore
What were we trying to achieve?
A feasibility study:
• What does research data look like?
• What does Archivematica do to preserve data?
• How can Archivematica integrate with our other RDM systems?
• What are our institutional requirements for digital preservation?
• Does Archivematica meet those requirements?
• Where does Archivematica fall short?
All written up and published in a report…
http://dx.doi.org/10.6084/m9.figshare.1481170
What is Archivematica?
● Free and open-source digital preservation system (AGPLv3) designed to
maintain standards-based, long-term access to digital objects
● Allows users to process digital objects from ingest to access using OAIS
functional model
● Implements format normalization upon ingest and preserves originals to
support emulation and migration strategies
What is Archivematica?
● Archivematica is a processing pipeline consisting of a bundle of open-source
tools and python scripts which deliver a series of preservation micro-services
● Archivematica is designed to output high-quality, standards-compliant
Archival Information Packages (AIPs)
● BagIt, METS, PREMIS
Archivematica development partners
and more!
Why would we recommend Archivematica for RDM?
• It is flexible and can be configured in different ways for different institutional
needs and workflows
• It allows many of the tasks around digital preservation to be carried out in an
automated fashion
• It can be used alongside other existing systems as part of a wider workflow for
research data
• It is a good digital preservation solution for those with limited resources
• It is an evolving solution that is continually driven and enhanced by and for
the digital preservation community
• It gives institutions greater confidence that they will be able to continue to
provide access to usable copies of research data over time
What are the downsides?
• It isn’t a magic bullet
• There is no guarantee your data will be readable in the future
• It can only be as good as current digital preservation practice
• It can be fiddly to install correctly
• The GUI isn’t that intuitive
• You need staff who understand it
How could you use Archivematica?
• Host it in-house and link it to an existing repository/access system (for example
DSpace, CONTENTdm, Fedora/Hydra ...or a CRIS)
• Host it in-house and use as a standalone system (you would need to have a
storage system in place and establish a way of facilitating access to the data)
• Sign up for a hosted instance of Archivematica with archivesDIRECT
(combines Archivematica with DuraCloud storage)
• Sign up for a hosted instance of Archivematica with Arkivum (combines
Archivematica with Arkivum storage)
Phase 2 - Develop
Phase 2 development work
• Six different areas of work
• Development carried out by Artefactual Systems from July
2015 to January 2016
• Weekly Google Hangouts to report on progress
• Will be available in Archivematica soon...
Deliverable 1
Problem: Research Data needs to be kept,
but we don’t know if anyone will ever want it
and it might be *massive*
The Solution: enable the DIP to be generated ‘on request’
and not as part of the initial ingest
Deliverable 2
Problem: We want to be able to grab the DIP,
and metadata about it and pull it into our
repository
The Solution: a library to help with parsing and creating
METS files
https://github.com/artefactual-labs/mets-reader-writer
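As a rough illustration of what parsing a METS file involves, the sketch below lists the files recorded in a METS fileSec using only the Python standard library. It does not use the mets-reader-writer library linked above, and the sample XML, file IDs and paths are made up for the example rather than taken from real Archivematica output.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the METS and XLink standards.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# A tiny hand-written METS fragment (illustrative only).
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-1">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"
                     xlink:href="objects/results.csv"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="preservation">
      <mets:file ID="file-2">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"
                     xlink:href="objects/results-normalized.csv"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_files(mets_xml):
    """Return (use, file_id, path) tuples for every file in the fileSec."""
    root = ET.fromstring(mets_xml)
    href_attr = "{%s}href" % NS["xlink"]
    found = []
    for group in root.iterfind("mets:fileSec/mets:fileGrp", NS):
        for file_el in group.iterfind("mets:file", NS):
            flocat = file_el.find("mets:FLocat", NS)
            path = flocat.get(href_attr) if flocat is not None else None
            found.append((group.get("USE"), file_el.get("ID"), path))
    return found


for use, file_id, path in list_files(SAMPLE_METS):
    print(use, file_id, path)
```

A repository integration could use a listing like this to decide which files from a DIP to pull in and how to label them.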
Deliverable 3
Problem: We want to be able to report on what
we have
The Solution: a search API to answer basic questions about
number of files in storage, their formats, date of ingest etc.
Deliverable 4
Problem: With large datasets, the current
checksum mechanism in Archivematica could be
a bottleneck
The Solution: support for multiple checksum algorithms
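To show what a selectable checksum algorithm can look like in practice, here is a minimal sketch using Python's hashlib. It is not Archivematica's implementation: the function name and chunk size are assumptions, and the point is only that the algorithm becomes a parameter and that large files are hashed in chunks.

```python
import hashlib
import io


def checksum(stream, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a binary stream in chunks so large files never sit fully in memory."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


data = b"example research data"
for name in ("md5", "sha1", "sha256", "sha512"):
    print(name, checksum(io.BytesIO(data), name))
```

With a real file, the stream would come from open(path, "rb"). Which algorithm is fast enough for very large datasets depends on the hardware and would need measuring.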
Deliverable 5
Problem: What about all those file formats that
Archivematica can’t identify?
The Solution: mechanism for running file identification with
multiple tools and a report of unidentified formats.
...and I’ll talk a bit more about file format identification later!
Deliverable 6
Problem: We want to make it easier for
institutions to adopt Archivematica
The Solution: a webinar describing Archivematica’s
Automation Tools
What worked well
• Artefactual staff were good people to work with (and
patient)
• Artefactual have the bigger picture in mind and really
want to understand the use cases
• Our work builds on work that others have done and is
being used and built on by future work that is in the
pipeline:
– Search API work being looked at by Bentley Historical
Library
– DIP generation by Simon Fraser University
What didn’t work well
• Many of the areas of development were only partially
solved through our work:
– the problems were big and complex – what does
success look like?
– perhaps we tried to do too much?
• It was hard to prove the impact of our checksum work
• Solving the file identification problem is a huge task and
needed more thought....
Implementation plans
Hull and York also worked on implementation plans for
Archivematica. This was key because….
Deciding to use a system is easy...deciding exactly
how to use it is much harder
Separate plans created for Hull and York as different RDM
systems were in place and there were different institutional
needs and priorities.
Phase 3 - Implement
York p-o-c implementation
York wanted to provide:
• an easy way of depositing data
• a way of monitoring datasets for
RDM staff
• a way of requesting access to data
with:
• data sent to Archivematica
• dataset metadata pulled from
PURE
York p-o-c implementation
Metadata from PURE
pulled in nightly or
on demand
Fedora objects created for the dataset to
store local admin info and help connect
the PURE and Archivematica records
Visual representation of
workflow status
York PCDM modelling
Dataset = Dataset record from
PURE
Individual data
files stored, but
folder structure
is not
Folder structure
available in
Archivematica
METS
Dataset can be
made up of
multiple
‘Packages’ of data,
e.g. a newer version
What next in York?
• Our RDM staff love the p-o-c and we have agreement to
turn it into a production system over the autumn/winter
• This has been a helpful exercise for broader data modelling /
Hydra implementation at York
• York is a pilot in the Jisc Shared Service for Research Data
and will move forward with this work over the next couple
of years
Hull p-o-c implementation
• Hull keen to make Archivematica part of a workflow for any
type of repository content – not just research data. You may
have seen a poster at Hydra Connect last year:
Hull’s p-o-c
implements most of
the automated bulk
ingest route, creates
AIP(s) and builds
repository objects
from the DIP(s)
Hull p-o-c implementation
• User assembles files and simple descriptive
file(s) in Box folder. Shares the folder with
Archivematica
• System checks folder contents and if OK
creates a bag (BagIt standard) for each
object which is passed to Archivematica
• Archivematica processes the bag to create
an AIP which goes to a preservation store…
• …and also a DIP which is passed to the DIP
processor
• DIP processor creates Hydra objects from
the DIP contents and injects them into the
repository QA queue…
• …matched to the AIP by UUID
Thanks to Cottage Labs
for all the new
development work!
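The bag-creation step in the workflow above can be sketched with the standard library alone. This is not the Hull code: the function names, the in-memory payload standing in for the shared Box folder, and the choice of SHA-256 and BagIt version 1.0 are all assumptions made for illustration. The layout (bagit.txt, a data/ payload directory and a manifest of checksums) follows the BagIt specification.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write a minimal BagIt bag: bagit.txt, data/ and a SHA-256 manifest.

    `payload` maps relative file names to bytes (a stand-in for the
    files a depositor has placed in a shared folder).
    """
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name in sorted(payload):
        target = data_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload[name])
        digest = hashlib.sha256(payload[name]).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir):
    """Recompute each payload checksum and compare it with the manifest."""
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "dataset-bag", {"results.csv": b"a,b\n1,2\n"})
    print(sorted(p.name for p in bag.iterdir()), verify_bag(bag))
```

Because the manifest travels with the payload, the receiving system can check that nothing was altered or lost in transfer before processing begins.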
Hull p-o-c options
• Depositors have several options:
• A folder containing multiple data files and one descriptive file ➔ a single
AIP and a single repository object with (optionally) one or more surrogate
files for download (so can be a “metadata-only” record)
• A folder containing multiple files and a csv file (one row per file) ➔
multiple AIPs with multiple repository objects, each with (optionally) a
surrogate for download
• A folder containing the top-level folder of a structure ➔ a zipped structure
in a single AIP and a single repository object (optionally) containing the
zipped file for download
What’s next in Hull?
• We hope to be able to take the p-o-c work and
turn it into a production system
• Hull is the UK’s “City of Culture” next year and
there will be a great deal of digital material that
the University Archives want to capture for
posterity
Phase 3 - The file formats problem
File formats problem
Research data file formats are:
• Numerous
• Sometimes a bit obscure
• Sometimes very big
• Ever-changing
• Often very new
This means they can be hard to preserve... The first hurdle is
that we can’t identify them. If we can’t identify them how can
we carry out preservation activities?
Top research data applications at York
Can we identify our research data?
We ran Droid* over the research data deposited with Research
Data York over the past year.
Out of 3752 individual files:
• only 37% (1382) of the files were identified (with varying
degrees of accuracy)
• there were 34 different identified file formats in the sample
* Droid is a free tool from The National Archives that can be used to automatically
identify file formats
Unidentified research data files
Files not identified by Droid (listed by file ext):
– 107 different file extensions not identified
– huge number with no extension (help!)
– how do we solve the .dat file problem?
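A tally like the one above can be produced from any identification tool's output. The sketch below assumes the results have already been reduced to (file path, format ID) pairs, with None meaning the file was not identified. That input shape, the sample rows and the format IDs in them are illustrative and are not Droid's export format. The last lines check the identification rate quoted on the previous slide.

```python
from collections import Counter
from pathlib import PurePosixPath


def unidentified_by_extension(results):
    """Count unidentified files per (lower-cased) file extension.

    `results` is an iterable of (file_path, format_id) pairs where
    format_id is None when the tool could not identify the file.
    """
    counts = Counter()
    for file_path, format_id in results:
        if format_id is None:
            ext = PurePosixPath(file_path).suffix.lower() or "(no extension)"
            counts[ext] += 1
    return counts


# Illustrative rows only; the format IDs are placeholders.
sample = [
    ("run1/output.dat", None),
    ("run1/notes.txt", "x-fmt/111"),
    ("run2/output.DAT", None),
    ("run2/README", None),
    ("run2/plot.png", "fmt/11"),
]

for ext, count in unidentified_by_extension(sample).most_common():
    print(ext, count)

# Figures from the slide: 1382 of 3752 files identified.
rate = 1382 / 3752
print(f"{rate:.1%} identified")
```

Sorting the unidentified extensions by frequency gives a simple way to prioritise which formats to research first.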
Supporting
signature
development at
The National
Archives
Creating our own signatures
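For readers unfamiliar with file signatures, the idea is to recognise a format from characteristic bytes at a known position in the file rather than from its extension. The sketch below is a deliberately simplified illustration and is not how PRONOM signatures are expressed. The PNG and PDF byte sequences are the well-known ones for those formats, while the third entry is a hypothetical research data format invented for the example.

```python
# (label, byte offset, expected bytes at that offset)
SIGNATURES = [
    ("PNG image", 0, b"\x89PNG\r\n\x1a\n"),
    ("PDF document", 0, b"%PDF-"),
    ("Hypothetical instrument format", 0, b"INSTR01"),  # invented for illustration
]


def identify(header, signatures=SIGNATURES):
    """Return the label of the first signature matching the file header, or None."""
    for label, offset, magic in signatures:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None


print(identify(b"%PDF-1.4\n..."))
print(identify(b"INSTR01\x00\x01binary payload"))
print(identify(b"just some plain text"))
```

Adding support for a new format then amounts to researching its distinguishing byte patterns and registering them, which is the kind of work that feeds into a shared signature registry.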
Conclusions and where to find out more
Impact
“In many ways the project at York and Hull felt like a precursor to the Shared
Services pilot; highlighting both the potential problems in working with a wide
range of stakeholders and systems, as well as the massive benefits possible
from pooling our collective knowledge and resources to tackle the technical
challenges which remain in RDM.”
From ‘Unlocking Research’ blog from the University of Cambridge Office of
Scholarly Communication (16 September 2016)
“I've just read your paper on linking repositories and Archivematica -
fascinating and full of very useful information! I will certainly be following up
with many of the links and ideas you presented, especially the Jisc Research
Data Shared Service work on digital preservation, and I will discuss with my
colleagues the possibility of using Archivematica at our data centre and our
options for collaboration. Many of our issues relate to the long tail of research
data and the preservation of data already archived.”
Comment on Digital Archiving blog (18 October 2016)
Challenges
• Impact of short, but focused, timeframes and short lead-in times
• Access to appropriate skills (mainly technical development) limited scope of
work
• Limited budget, hence ‘parsimonious’ approach to making the best use of this
• Interpretations of digital archiving across preservation, RDM and IT
communities
• Balancing dissemination with actual doing!
What have we learned?
• Archivematica can be used to manage the preservation of research data
• And that this can be embedded within similar, but different, institutional workflows
• There is benefit in getting focused systems to do what they do best rather than
adding functionality to any particular one
• There is a file format recognition issue that will affect long-term preservation of
research data files
• But there is a way to address this through the development of additional file signatures
• There is real benefit in working collaboratively in this area, both within the project
and beyond it, to identify common ways of tackling the problem of preserving
research data
Jisc Research Data Shared Service
• Almost in parallel with the Research Data Spring projects Jisc were planning a
Research Data Shared Service
• The resulting system will be managed and hosted, and will offer three core
modules: repository, preservation and reporting
• Phase 1 and 2 reports from Hull and York very influential for the preservation
module
• Commercial and open source offerings for each module, including Archivematica
(for preservation) and Hydra (for the repo)
• Over 20 pilot institutions recruited (including York) – all identified preservation
as a priority
• https://www.jisc.ac.uk/rd/projects/research-data-shared-service
Further information
Website: http://www.york.ac.uk/borthwick/projects/archivematica
Blog: http://digital-archiving.blogspot.co.uk/
Reports: https://figshare.com/

Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 

Using Archivematica to preserve research data

  • 1. Filling the Digital Preservation Gap Chris Awre (@clawre) and Jenny Mitcham (@Jenny_Mitcham) 7th or 8th November 2016
  • 2. Introduction: who we are and why we are doing this
  • 3. Research at Hull and York • University of Hull : – 5 Faculties, 11 Schools – c. 22,000 students – 62% research classed as 3* or 4* in REF 2014 – In top 50 UK institutions by ‘research power’ • University of York: – 30+ academic departments – c. 16,000 students – Ranked in the top ten of UK universities for research council income – Secured £46 million in research council income in 2014/15
  • 4. Why do we need digital preservation for research data? We can’t ignore digital preservation – moving targets for data retention mean we need to take this seriously Funder requirements around retention: • NERC - data should be retained for a minimum of 10 years but for projects of major importance this may need to be 20 years or longer • STFC - expect data to be retained for a minimum of 10 years and data that cannot be re-measured should be retained indefinitely • Wellcome Trust – expect data to be kept for a minimum of 10 years but suggest longer periods for certain types of data
  • 5. Why do we need digital preservation for research data? University of York RDM questionnaire 2013: Which data management issues have you come across in your research over the last five years? 24% of 181 researchers who answered this question admitted this had been a problem for them “Inability to read files in old software formats on old media or because of expired software licences”
  • 6. Jisc Research Data Spring initiative • A three phase funding programme starting in March 2015 • Looking for ideas for technical tools to help with RDM • Ideas crowd-sourced and voted on by anyone interested • Best ideas invited to pitch for funding • See https://www.jisc.ac.uk/rd/projects/research-data-spring for more information
  • 7. Project aim - our pitch! “…to investigate Archivematica and explore how it might be used to provide digital preservation functionality within a wider infrastructure for Research Data Management.”
  • 8. Project structure • Phase 1 – explore: testing, research, thinking (3 months) • Phase 2 – develop: make Archivematica better for RDM, plan implementation (4 months) • Phase 3 – implement: set up proofs of concept at York and Hull and further investigate the file format problem (6 months)
  • 9. The team University of Hull: • Chris Awre – Head of Information Services, Library and Learning Innovation • Richard Green – Independent Consultant • Simon Wilson – University Archivist University of York: • Julie Allinson – Manager, Digital York • Jen Mitcham – Digital Archivist
  • 10. Phase 1 - Explore
  • 11. What were we trying to achieve? A feasibility study: • What does research data look like? • What does Archivematica do to preserve data? • How can Archivematica integrate with our other RDM systems? • What are our institutional requirements for digital preservation? • Does Archivematica meet those requirements? • Where does Archivematica fall short? All written up and published in a report… http://dx.doi.org/10.6084/m9.figshare.1481170
  • 12. What is Archivematica? ● Free and open-source digital preservation system (AGPLv3) designed to maintain standards-based, long-term access to digital objects ● Allows users to process digital objects from ingest to access using OAIS functional model ● Implements format normalization upon ingest and preserves originals to support emulation and migration strategies
  • 13. What is Archivematica? ● Archivematica is a processing pipeline consisting of a bundle of open-source tools and Python scripts which deliver a series of preservation micro-services ● Archivematica is designed to output high-quality, standards-compliant Archival Information Packages (AIPs) ● BagIt, METS, PREMIS
  • 15. Why would we recommend Archivematica for RDM? • It is flexible and can be configured in different ways for different institutional needs and workflows • It allows many of the tasks around digital preservation to be carried out in an automated fashion • It can be used alongside other existing systems as part of a wider workflow for research data • It is a good digital preservation solution for those with limited resources • It is an evolving solution that is continually driven and enhanced by and for the digital preservation community • It gives institutions greater confidence that they will be able to continue to provide access to usable copies of research data over time
  • 16. What are the downsides? • It isn’t a magic bullet • There is no guarantee your data will be readable in the future • It can only be as good as current digital preservation practice • It can be fiddly to install correctly • The GUI isn’t that intuitive • You need staff who understand it
  • 17. How could you use Archivematica? • Host it in-house and link it to an existing repository/access system (for example DSpace, CONTENTdm, Fedora/Hydra ...or a CRIS) • Host it in-house and use as a standalone system (you would need to have a storage system in place and establish a way of facilitating access to the data) • Sign up for a hosted instance of Archivematica with archivesDIRECT (combines Archivematica with DuraCloud storage) • Sign up for a hosted instance of Archivematica with Arkivum (combines Archivematica with Arkivum storage)
  • 18. Phase 2 - Develop
  • 19. Phase 2 development work • Six different areas of work • Development carried out by Artefactual Systems from July 2015 to January 2016 • Weekly Google Hangouts to report on progress • Will be available in Archivematica soon...
  • 20. Deliverable 1 Problem: Research Data needs to be kept, but we don’t know if anyone will ever want it and it might be *massive* The Solution: enable the DIP to be generated ‘on request’ and not as part of the initial ingest
  • 21. Deliverable 2 Problem: We want to be able to grab the DIP, and metadata about it and pull it into our repository The Solution: a library to help with parsing and creating METS files https://github.com/artefactual-labs/mets-reader-writer
  • 22. Deliverable 3 Problem: We want to be able to report on what we have The Solution: a search API to answer basic questions about number of files in storage, their formats, date of ingest etc.
  • 23. Deliverable 4 Problem: With large datasets, the current checksum mechanism in Archivematica could be a bottleneck The Solution: support for multiple checksum algorithms
  • 24. Deliverable 5 Problem: What about all those file formats that Archivematica can’t identify? The Solution: mechanism for running file identification with multiple tools and a report of unidentified formats. ...and I’ll talk a bit more about file format identification later!
  • 25. Deliverable 6 Problem: We want to make it easier for institutions to adopt Archivematica The Solution: a webinar describing Archivematica’s Automation Tools
  • 26. What worked well • Artefactual staff were good people to work with (and patient) • Artefactual have the bigger picture in mind and really want to understand the use cases • Our work builds on work that others have done and is being used and built on by future work that is in the pipeline: – Search API work being looked at by Bentley Historical Library – DIP generation by Simon Fraser University
  • 27. What didn’t work well • Many of the areas of development were only partially solved through our work: – the problems were big and complex – what does success look like? – perhaps we tried to do too much? • It was hard to prove the impact of our checksum work • Solving the file identification problem is a huge task and needed more thought....
  • 28. Implementation plans Hull and York also worked on implementation plans for Archivematica. This was key because…. Deciding to use a system is easy...deciding exactly how to use it is much harder Separate plans created for Hull and York as different RDM systems were in place and there were different institutional needs and priorities.
  • 29. Phase 3 - Implement
  • 30. York p-o-c implementation York wanted to provide: • an easy way of depositing data • a way of monitoring datasets for RDM staff • a way of requesting access to data with: • data sent to Archivematica • dataset metadata pulled from PURE
  • 31. York p-o-c implementation • Metadata from PURE pulled in nightly or on-demand • Fedora objects created for the dataset to store local admin info and help connect the PURE and Archivematica records • Visual representation of workflow status
  • 32. York PCDM modelling • Dataset = Dataset record from PURE • Individual data files are stored, but the folder structure is not • Folder structure is available in the Archivematica METS • A Dataset can be made up of multiple ‘Packages’ of data, e.g. a newer version
  • 33. What next in York? • Our RDM staff love the p-o-c and we have agreement to turn it into a production system over the autumn/winter • This has been a helpful exercise for broader data modelling / Hydra implementation at York • York is a pilot in the Jisc Shared Service for Research Data and will move forward with this work over the next couple of years
  • 34. Hull p-o-c implementation • Hull keen to make Archivematica part of a workflow for any type of repository content – not just research data. You may have seen a poster at Hydra Connect last year: Hull’s p-o-c implements most of the automated bulk ingest route, creates AIP(s) and builds repository objects from the DIP(s)
  • 35. Hull p-o-c implementation • User assembles files and simple descriptive file(s) in Box folder. Shares the folder with Archivematica • System checks folder contents and if OK creates a bag (BagIt standard) for each object which is passed to Archivematica • Archivematica processes the bag to create an AIP which goes to a preservation store… • …and also a DIP which is passed to the DIP processor • DIP processor creates Hydra objects from the DIP contents and injects them into the repository QA queue… • …matched to the AIP by UUID Thanks to Cottage Labs for all the new development work!
  • 36. Hull p-o-c options • Depositors have several options: • A folder containing multiple data files and one descriptive file ➔ a single AIP and a single repository object with (optionally) one or more surrogate files for download (so can be a “metadata-only” record) • A folder containing multiple files and a csv file (one row per file) ➔ multiple AIPs with multiple repository objects, each with (optionally) a surrogate for download • A folder containing the top-level folder of a structure ➔ a zipped structure in a single AIP and a single repository object (optionally) containing the zipped file for download
  • 37. What’s next in Hull? • We hope to be able to take the p-o-c work and turn it into a production system • Hull is the UK’s “City of Culture” next year and there will be a great deal of digital material that the University Archives want to capture for posterity
  • 38. Phase 3 - The file formats problem
  • 39. File formats problem Research data file formats are: • Numerous • Sometimes a bit obscure • Sometimes very big • Ever-changing • Often very new This means they can be hard to preserve... The first hurdle is that we can’t identify them. If we can’t identify them how can we carry out preservation activities?
  • 40. Top research data applications at York
  • 41. Can we identify our research data? We ran Droid* over the research data deposited with Research Data York over the past year. Out of 3752 individual files: • only 37% (1382) of the files were identified (with varying degrees of accuracy) • there were 34 different identified file formats in the sample * Droid is a free tool from The National Archives that can be used to automatically identify file formats
  • 42. Unidentified research data files Files not identified by Droid (listed by file ext): – 107 different file extensions not identified – huge number with no extension (help!) – how do we solve the .dat file problem?
  • 44. Creating our own signatures
  • 45. Conclusions and where to find out more
  • 46. Impact “In many ways the project at York and Hull felt like a precursor to the Shared Services pilot; highlighting both the potential problems in working with a wide range of stakeholders and systems, as well as the massive benefits possible from pooling our collective knowledge and resources to tackle the technical challenges which remain in RDM.” From ‘Unlocking Research’ blog from the University of Cambridge Office of Scholarly Communication (16 September 2016) “I've just read your paper on linking repositories and Archivematica - fascinating and full of very useful information! I will certainly be following up with many of the links and ideas you presented, especially the Jisc Research Data Shared Service work on digital preservation, and I will discuss with my colleagues the possibility of using Archivematica at our data centre and our options for collaboration. Many of our issues relate to the long tail of research data and the preservation of data already archived.” Comment on Digital Archiving blog (18 October 2016)
  • 47. Challenges • Impact of short, but focused, timeframes and short lead in times • Access to appropriate skills (mainly technical development) limited scope of work • Limited budget, hence ‘parsimonious’ approach to making the best use of this • Interpretations of digital archiving across preservation, RDM and IT communities • Balancing dissemination with actual doing!
  • 48. What have we learned? • Archivematica can be used to manage the preservation of research data • And that this can be embedded within similar, but different, institutional workflows • There is benefit in getting focused systems to do what they do best rather than adding functionality to any particular one • There is a file format recognition issue that will affect long-term preservation of research data files • But there is a way to address this through the development of additional file signatures • There is real benefit in working collaboratively in this area, both within the project and beyond it, to identify common ways of tackling the problem of preserving research data
  • 49. Jisc Research Data Shared Service • Almost in parallel with the Research Data Spring projects Jisc were planning a Research Data Shared Service • The resulting system will be managed and hosted, and will offer three core modules : repository, preservation and reporting • Phase 1 and 2 reports from Hull and York very influential for the preservation module • Commercial and open source offerings for each module, including Archivematica (for preservation) and Hydra (for the repo) • Over 20 pilot institutions recruited (including York) – all identified preservation as a priority • https://www.jisc.ac.uk/rd/projects/research-data-shared-service
  • 51. Further information Website: http://www.york.ac.uk/borthwick/projects/archivematica Blog: http://digital-archiving.blogspot.co.uk/ Reports: https://figshare.com/
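
Slide 23 proposes support for multiple checksum algorithms so that fixity checking is less of a bottleneck on large datasets. The sketch below is not Archivematica's implementation. It only illustrates the general idea of reading a file once in chunks and feeding each chunk to whichever algorithms are configured. The function name `file_checksums`, its default algorithm list and the chunk size are assumptions made for illustration.

```python
import hashlib


def file_checksums(path, algorithms=("md5", "sha256"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for one file, reading it only once.

    The algorithm names are passed to hashlib.new(); the defaults here are
    illustrative and not a statement of what Archivematica offers.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so very large files never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Choosing a cheaper algorithm for routine checks and a stronger one for deposit is a policy decision. The code only shows that switching algorithms need not change the surrounding workflow.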
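
Slide 35 describes the Hull system creating a bag (BagIt standard) for each object before passing it to Archivematica. As a rough illustration of what a bag contains, the sketch below builds a minimal one using only the Python standard library: a `data/` payload directory, a `bagit.txt` declaration and a `manifest-sha256.txt` payload manifest. The function name `make_minimal_bag` is hypothetical, and the choice of SHA-256 and BagIt version 1.0 are assumptions. The Hull proof of concept will have differed in detail, and a production workflow would more likely rely on an established BagIt library.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write the two required bag files.

    Hypothetical helper for illustration only; bag_dir/data must not exist yet.
    """
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    payload = bag_dir / "data"
    shutil.copytree(source_dir, payload)  # also creates bag_dir if needed

    # Bag declaration: BagIt version and tag file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path relative to bag root>" line per file.
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir
```

The manifest is what lets the receiving system confirm that every payload file arrived intact, which is why the bag is a convenient hand-over format between the deposit folder and the preservation system.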
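
Slides 41 and 42 report that only 1382 of 3752 files (37%) were identified, and that the unidentified files were then listed by file extension. A report of that kind can be sketched as below. The input shape is an assumption: each row is a hypothetical `(file name, format identifier or None)` pair, for example one derived from an identification tool's export. The function name `summarise_identification` is also illustrative.

```python
from collections import Counter
from pathlib import PurePath


def summarise_identification(rows):
    """Summarise identification results and tally unidentified extensions.

    rows: iterable of (file_name, format_id), where format_id is None or ""
    when the file was not identified (an assumed input shape).
    """
    total = 0
    identified = 0
    unidentified_extensions = Counter()
    for file_name, format_id in rows:
        total += 1
        if format_id:
            identified += 1
            continue
        # Lower-case so ".DAT" and ".dat" are counted together.
        extension = PurePath(file_name).suffix.lower() or "(no extension)"
        unidentified_extensions[extension] += 1
    share = 100 * identified / total if total else 0.0
    return {
        "total": total,
        "identified": identified,
        "identified_percent": share,
        "unidentified_extensions": unidentified_extensions,
    }
```

Sorting the resulting counter with `most_common()` gives a prioritised list of extensions to research first, which fits the signature-creation work mentioned on slide 44.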