
Integration - the heart of researcher centric research data management systems - Steve Mackey, Arkivum


  1. Integration – the heart of researcher centric research data management systems Steve Mackey 15 January 2015
  2. Agenda • Who we are, what we do • How it works • RDM systems, where it fits • Workflows • Integrations 21 October 2014
  3. Archive storage with a difference • Flagship Arkivum100 service with 100% data integrity guarantee • World-wide professional indemnity insurance – Arkivum100 • Long term contracts for enterprise data archiving • Fully automated and managed solution • Audited and certified to ISO27001 • Data escrow, exit plan, no lock-in
  4. Keeping Data Alive for 25+ Years • Adding media – effectively continual process • Monthly checks and maintenance updates • Annual data retrieval and integrity checks • Hardware refresh • Software migration • Hardware migration • Tape format migration – LTO n to LTO n+2 • Support and admin staff migration • Change of supplier of products and services • 3-5 year obsolescence of servers, operating systems and software
  5. Arkivum Appliance • CIFS/NFS presentation (integrates easily to local file systems) • Simple administration of user access permissions and storage allocations • Robust REST API for application integration • GUI for file ingest status, recovery pre-staging, security • Ingest triggered by: timeout, checksum exchange, manifest (bulk) • Checksum/fixity chain of custody from ingest through replication • Immutable (WORM) • Regular (6 monthly) data copy read verify • Offline Escrow data copy (open source, self describing) • Data encryption throughout – keys only held by customer
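The REST API bullet above is the main hook for application integration. As a minimal sketch, assuming a hypothetical JSON endpoint for file ingest status (the real Arkivum API paths and field names are not shown in these slides), a client might look like this:

```python
import json
from urllib import request


class ApplianceClient:
    """Sketch of a client for an appliance-style REST API.

    The endpoint path and JSON fields below are illustrative
    assumptions, not the vendor's documented interface.
    """

    def __init__(self, base_url, opener=request.urlopen):
        self.base_url = base_url.rstrip("/")
        self._open = opener  # injectable, so the client can be tested offline

    def file_status(self, path):
        # Hypothetical: GET /api/files/status?path=... -> {"state": "amber", ...}
        url = f"{self.base_url}/api/files/status?path={path}"
        with self._open(url) as resp:
            return json.load(resp)

    def is_safe_to_delete(self, path):
        # Only a 'green' file is covered by the integrity guarantee,
        # so only then is it safe to remove local copies.
        return self.file_status(path).get("state") == "green"
```

Injecting the opener keeps the sketch testable without a live appliance; a real integration would also need authentication and error handling.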
  6.–14. [Diagram sequence showing the Arkivum service step by step: Original Datasets & Files are copied for ingest to the Arkivum Gateway on the appliance; an Encrypted Archive is created and checked against the original via a decrypted object to give a Validated Archive; the Validated Archive is replicated to Archive Copy 1 and then Archive Copy 2; an Escrow Copy is written; a Cached Copy remains on the appliance.]
  15. http://datablog.is.ed.ac.uk/2013/12/06/the-four-quadrants-of-research-data-curation-systems/ [Quadrant diagram labels: PURE, Elements, Converis; ePrints, DSpace, Hydra; Figshare; Re3data.org; Landing pages; CKAN; Institutional storage]
  16. Workflows • RDM Workflow – the sequence of repeatable processes (steps) through which Research Data passes during its lifecycle, including the steps involved in its creation, curation, preservation, access and eventual disposal.
  17. RDM Workflows Report • JISC Research Data Spring • A Consortial Approach to Building an Integrated RDM System – “Small and Specialist” • http://dx.doi.org/10.6084/m9.figshare.1476832
  18. Researcher Centric Workflow
  19. [Workflow diagram. Systems: Researcher (web browser), Local Research Data, HR system, Figshare (Amazon), DataCite (BL), CRIS (Elements), Journal, Repository (DSpace), Archive (Arkivum). Numbered flows: 1. Researcher details; 2. Data files; 3. Data Description; 4. Mint DOI; 5. Data DOI; 6. Data DOI; 7. Article; 8. Data DOI; 9. Article and Article DOI; 10. Article and Article DOI; 11. Article DOI; 12. Dataset Description and Data DOI; 13. Dataset Description and Data DOI; 14. Data files; 15. Data is safe; 16. Data is safe.]
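Step 4 of the workflow above, "Mint DOI", goes through DataCite. As an illustrative sketch, a DataCite MDS-style registration body pairing a DOI with its landing page can be built as below; the endpoint, authentication and metadata deposit steps are omitted, so treat the details as assumptions to check against the DataCite documentation:

```python
def datacite_doi_request(doi, landing_url):
    """Build the plain-text body used to register a DOI against a
    landing page, in the style of the DataCite MDS API (PUT /doi).

    Sketch only: real registration also requires a prior metadata
    deposit and HTTP basic authentication with a datacentre account.
    """
    if not doi.startswith("10."):
        raise ValueError("a DOI must start with the '10.' prefix")
    return f"doi={doi}\nurl={landing_url}"
```

For example, the report DOI from slide 17 would produce `doi=10.6084/m9.figshare.1476832` followed by its landing-page URL on the next line.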
  20. Why integrate? • Simpler and easier RDM processes from a Researcher perspective, which both encourages adoption and lowers the cost of institutional support to the research base. • Clear and repeatable RDM processes that help ensure higher levels of quality and consistency in RDM across the research base. • Ability to deploy RDM as community-driven shared service(s) so that smaller institutions can ‘join forces’ to benefit from having access to a common RDM infrastructure. • Scaling RDM up across a large research base using automation and ‘factory’ type approaches to achieve ‘economies of scale’ and move away from RDM being a manual and labour intensive endeavour. • Specifically for Archive layer storage this may include: – Confirmation of integrity of received files via checksums/fixity – File archive status reporting – Trigger for original file deletion – File location, data pool management – File recovery staging – Encryption key management
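The first archive-layer integration point on the slide above, confirming the integrity of received files via checksums/fixity, can be sketched as follows. The helper names and the choice of algorithms are illustrative, not part of any vendor API:

```python
import hashlib


def file_checksums(path, algorithms=("md5", "sha256")):
    """Compute several checksums of a file in a single read pass."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}


def fixity_confirmed(local, reported):
    """True only if every algorithm both sides computed agrees.

    'local' is what we computed before transfer; 'reported' is what the
    archive service says it received.
    """
    shared = set(local) & set(reported)
    return bool(shared) and all(local[a] == reported[a] for a in shared)
```

Computing the checksums before the copy and comparing against the service's values afterwards is what catches corruption introduced by the transfer itself.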
  21. Data Archiving - Integrations
  22. Questions?

Editor's Notes

  • Speaker's notes
  • These are just some of the things that will happen over 25 years of trying to retain data.

    In the diagram, a change from blue to yellow is when something happens that has to be managed. In a growing archive, adding or replacing media, e.g. tapes or discs, can be a daily process, so is effectively continual. The archive system needs regular monitoring and maintenance, which might mean monthly checks and updates. Data integrity needs to be actively verified, for example annual retrievals and integrity tests. Then comes obsolescence of hardware and software, meaning refreshes or upgrades that will typically be 3 – 5 years, for example servers, operating systems, application software. The format of the data being held may need to change so it can still be read and even long-lived formats such as PDF-A will eventually be obsolete as they are replaced with something better and applications no longer provide backwards compatibility.

    In addition to technical change, there will be the need to manage transitions of the staff who run the system, for example support staff and administrators. And suppliers of products and services will come and go too. There are very few vendors that have been around for a long time in the IT industry, and mergers, acquisitions, changes in direction and companies simply going bust are all commonplace.

    Basically, the lifetime of the data is longer than the lifetime of almost everything that’s used to keep that data safe and accessible. The key point is that long-term archiving is an active process and there’s always some form of change going on. And when change happens there’s always a risk that something goes wrong, and there’s always the need to validate that the change has been effected properly. This all requires time, expertise and money. Digital archiving is a case of continual interventions to keep content alive and accessible.

  • A file is copied on to the appliance; how it gets there may vary depending on the application and integration method. It's worth remembering that you should confirm the data got onto the appliance safely; some partner products perform the checksum validation to ensure the act of copying in hasn't introduced data corruption.
  • The appliance watches for the file being closed (to ensure we don't try to process incomplete files) and for no further changes being made. It will wait for two complete ‘ingest periods’ to pass before the process begins, at which point the file is marked as ‘Red’. The duration of the ingest period is set on a per ‘data pool’ basis and defaults to ten minutes.
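The two-ingest-period rule above can be sketched as a simple modification-age check; this is a reconstruction of the behaviour described in the notes, not the appliance's actual code:

```python
import os
import time

# Default ingest period: ten minutes, per data pool, as the notes describe.
DEFAULT_INGEST_PERIOD = 10 * 60  # seconds


def ready_for_ingest(path, ingest_period=DEFAULT_INGEST_PERIOD, now=None):
    """A file is only picked up for ingest after two full ingest periods
    have passed with no modification, so half-written files are skipped.
    """
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)
    return age >= 2 * ingest_period
```

A production watcher would also track open file handles rather than relying on modification time alone, but the quiet-period idea is the same.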
  • Multiple checksums are taken of the original file, and stored within the service.

    The file is then encrypted; to ensure the efficiency of the service, larger files are split into ‘chunks’ up to 1GB in size before being encrypted. A key can be set at any point in the file tree and applies to any object below that point. It is important to note that a custom key must be applied to a folder before any data is added below it. Any keys that are used with the service must be kept safe by the client, as Arkivum never has access to these. In addition to keeping digital copies of the keys, it is also recommended that a hardcopy is made and stored securely. Without the keys, it would be impossible to retrieve data from the service.

    An encrypted version of the file is created and then immediately decrypted, and compared with the original. If the encrypted archive is validated, the decrypted copy is removed and multiple checksums of the validated archive are taken and passed for replication into the service.
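The chunk-encrypt-decrypt-compare cycle described in these notes can be sketched as below. The `transform` function is a deliberately insecure stand-in for real encryption (the service itself would use proper cryptography); only the round-trip verification pattern is the point:

```python
import hashlib

CHUNK = 1 << 30  # 1 GiB chunks, as in the notes; smaller values work for demos


def transform(data, key):
    """Stand-in for real encryption: XOR against a key-derived stream.

    It is reversible (applying it twice restores the input) but NOT
    secure -- a real service would use AES or similar.
    """
    stream = hashlib.sha256(key).digest()
    return bytes(b ^ stream[i % len(stream)] for i, b in enumerate(data))


def encrypt_and_verify(data, key, chunk_size=CHUNK):
    """Encrypt in chunks, then immediately decrypt the result and compare
    it with the original before accepting the archive copy -- the
    verification cycle described in the notes.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    encrypted = [transform(c, key) for c in chunks]
    if b"".join(transform(c, key) for c in encrypted) != data:
        raise ValueError("round-trip verification failed")
    return encrypted
```

Verifying immediately after encryption means a bad key, a buggy cipher pass or a memory error is caught before the plaintext is thrown away.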
  • The archive is replicated to our first datacentre; once the transfer has completed, its integrity is confirmed using the checksums created earlier.
  • The archive is then replicated on to our second datacentre, where again the integrity of the transfer is confirmed using the checksums.
  • Once we have two validated copies in the service, the status of the file is updated to ‘Amber’. The file is pretty well protected at this point, but the 100% guarantee does not apply until we reach the ‘Green’ state.
  • A third copy is queued to be written for escrow; the tape is not written until a complete tape's worth has been queued. Currently this is 2.2 TB, and depending on the rate at which data is archived this can mean files remain in the ‘Amber’ state for some time. Where this risk is unacceptable, ‘escrow events’ can be purchased.
  • Once a tape is written and verified, it is scheduled to be couriered to the escrow site. Once a receipt confirming its safe arrival has been received, the status is updated to ‘Green’. At this point the 100% guarantee comes into effect.
  • Only now is it safe for any copies of the file outside of the service to be disposed of, or for it to be excluded from any conventional backups.

    The validated archive remains in the appliance cache but is now marked as being available for deletion when the cache high-water mark is reached.
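The Red/Amber/Green progression in the notes above amounts to a small state machine; the event names here are made up for illustration:

```python
# States a file passes through, per the notes: 'red' once ingest begins,
# 'amber' once two datacentre copies are validated, 'green' once the
# escrow tape's safe arrival is confirmed (when the guarantee applies).
TRANSITIONS = {
    ("red", "second_copy_validated"): "amber",
    ("amber", "escrow_receipt_confirmed"): "green",
}


def next_state(state, event):
    """Advance the archive status; unrecognised events leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def safe_to_delete_original(state):
    """Local copies should only be disposed of in the 'green' state."""
    return state == "green"
```

Making disposal depend on the state, rather than on elapsed time, is what keeps the "only delete originals after Green" rule enforceable in an integration.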
  • But more than just archiving is required of course to achieve these benefits.

    This is a diagram from the University of Edinburgh RDM blog from just before Christmas. It shows the components required, including:

    A Current Research Information System (CRIS) for tracking grants, projects, equipment, research results, etc.
    A Data Asset Register, which might be an Institutional Repository, that provides a public gateway to research done at an institution, both publications and data.
    Then there are the multitude of public data repositories where open data can be deposited.
    And finally a Data Vault as a safe storage facility for research data at various stages in its lifecycle.

  • One data centric way to look at Research Data Management is to consider the processes and infrastructure when research data is created and used, which is the ‘research’ side of the diagram, and the processes and infrastructure that is also needed so that some or all of this research data can be kept and made accessible for future reuse, which is the ‘reuse’ side of the diagram. You’ve got live, active and changing data on the left and then curated, retained and highly managed data assets on the right.

    Traditionally, Researchers occupy the left hand side and the Library, Research Office etc. occupy the right hand side.

    Research Data Management spans the whole space as it covers all aspects of the data lifecycle and should be considered as part of Good Research Practice and hence part of what Researchers do as a matter of course. We might not be there yet, but this is where I think we’d like to be.

    It’s also true that the boundaries are likely to get blurred as increasing amounts of research are data-driven based on existing and shared data sets.

    One of the challenges comes when thinking about all the tools and systems involved.

    So, for example, on the LHS, you might be using a CRIS when developing and bidding for a project. When the project is live, Researchers might be using their own devices, collaboration and sharing platforms, lab systems and a host of other tools or platforms to do their research. There might be HPC systems to process data, or do simulations and modelling, and if data sets are large there could be big data analytics and other funky stuff. At some point, publications are made and the outcomes of the work are released.

    Then comes the question of what to keep, why, who for, and everything needed to ensure that enough context is captured for any data that should be retained for future use. Data might be kept because it's needed for repeatability and verification of the research, or it might be kept because it has value to the researcher or others in future research.

    Tied in with publication, access and meeting funding body requirements are things like minting DOIs, adding records to the IR and storing data in vaults or other facilities that ensure the data is held safely and securely for future access. Then come activities around ensuring data remains usable, which is digital preservation, that access and retention continue to meet policies, and then finally, and last but certainly not least, that use and citation of the data are tracked so impact can be assessed and decisions made on whether to continue keeping it. This might feed back into the CRIS, e.g. for REF, and also for further selection/curation. And again this is an ongoing and cyclic activity.

    What we’re seeing in working with a wide range of Universities is the challenge of how to make these circles meet and work smoothly together. You can’t expect the library or research data service part of an institution to get intimately involved with all the ways in which data is created and used. Likewise, you can’t expect Researchers to have to know, understand and use a whole host of systems and tools on the long-term research data management side, i.e. the right.

    What we’re seeing is a desire and need for the simplest interface between the two, a kind of meeting in the middle, which provides a very simple solution for the Researchers. Almost like a one-stop shop – and crucially one that has value to the Researcher, so it helps motivate their engagement with research data management. For example, helping them get more citations, downloads and collaboration requests based on their data.

    And it's this simple one-stop shop and clear process that I think is so interesting about the Loughborough approach.