Ross King, Project Director of SCAPE, gave a short presentation of the EU-funded SCAPE project, covering tools for planning and monitoring digital preservation, scalable computation and repositories, the SCAPE Testbeds, and where to learn more.
The presentation was given at the workshop ‘Preservation at Scale’ (http://bit.ly/17ppAln), held in connection with the iPres2013 conference in Lisbon, Portugal, in September 2013.
SCAPE - Scalable Preservation Environments
1. Dr. Ross King
AIT Austrian Institute of Technology GmbH
Preservation at Scale Workshop
Lisbon, September 5, 2013
SCAPE
Tools and Infrastructure for Preservation at Scale
2. Outline
• SCAPE Project
• SCAPE Solutions
• Scalable Planning
• Scalable Tools
• Scalable Computation
• Scalable Repositories
• SCAPE Testbeds
• SCAPE Additional Information
• Online Resources
• Training Events
• Contact Information
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
3. SCAPE – what is it about?
• Planning and executing computing-intensive digital preservation
processes such as the large-scale ingestion, characterisation or
migration of large (multi-Terabyte) and complex data sets
• SCAPE results include
• Preservation scenarios
• Preservation tools
• Preservation workflows
• Preservation infrastructure
• Preservation best-practices
SCAPE is a follow-up to the highly successful FP6 IP Planets.
4. SCAPE Project Data
• Project instrument: FP7 Collaborative Project
• 6th Call
• Objective ICT-2009.4.1: Digital Libraries and Digital
Preservation
• Target outcome (a) Scalable systems and services for
preserving digital content
• 10th Call
• Objective ICT-2013.11.4: Supplements to Strengthen
Cooperation in ICT R&D in an Enlarged European Union
• Duration: 44 months (originally 42)
• February 2011 – September 2014 (originally July 2014)
• Budget: 12.0 Million Euro (originally 11.3)
• Funded: 9.2 Million Euro (originally 8.6)
7. Scalable Planning and Watch
• SCOUT: an automated preservation watch system
• Enables planning tools and decision makers to monitor the world and the organisation
• Collects relevant knowledge and enables automated notification
• Open and extensible
• c3po: scalable content profiling
• c3po analyses characterisation data based on FITS
• Scale-out MongoDB (100k/min/node)
• Visual drill-down and well-documented profile
• Automated sample selection
• PLATO 4.1: scalable preservation planning
• www.ifs.tuwien.ac.at/dp/plato
• Technology upgrade - refactored, rebuilt, standardised, tested
• New features
• Groups allow collaborative planning
• Integration of control policies for group
• Quality domain – measures
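The content profiling and automated sample selection described for c3po can be illustrated with a minimal sketch. This is not c3po's code or API: the record layout, the field names (`file`, `format`, `size`, `valid`) and the helper functions are assumptions made for illustration, and real FITS output is XML with many more properties.

```python
import random
from collections import Counter

# Hypothetical, heavily simplified characterisation records.
# Real FITS output is XML and carries far more properties per file.
RECORDS = [
    {"file": "a.tif", "format": "TIFF", "size": 52_000_000, "valid": True},
    {"file": "b.tif", "format": "TIFF", "size": 48_000_000, "valid": False},
    {"file": "c.pdf", "format": "PDF", "size": 1_200_000, "valid": True},
    {"file": "d.mp3", "format": "MP3", "size": 5_500_000, "valid": True},
]


def build_profile(records):
    """Aggregate per-file characterisation data into a collection profile."""
    return {
        "count": len(records),
        "total_size": sum(r["size"] for r in records),
        "formats": dict(Counter(r["format"] for r in records)),
        "invalid": [r["file"] for r in records if not r["valid"]],
    }


def select_samples(records, per_format=1, seed=0):
    """Pick a small random sample of files for each format (seeded for repeatability)."""
    rng = random.Random(seed)
    by_format = {}
    for r in records:
        by_format.setdefault(r["format"], []).append(r["file"])
    return {
        fmt: sorted(rng.sample(files, min(per_format, len(files))))
        for fmt, files in by_format.items()
    }
```

A profile of this kind gives a planner the format distribution and the list of problem files, while the per-format sample provides a manageable set of objects on which candidate preservation actions can be tried.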
8. Scalable Tools
• Tool Wrapper
• Application that adapts existing tools to the SCAPE Platform
• https://github.com/openplanets/scape-toolwrapper
• Enhances wrapped tools
• Standard naming scheme for CC, AS and QA tools
• Standard invocation method (CLI)
• Debian packages for easy deployment on the cluster
• Support for data streaming (useful for Hadoop jobs)
• Generates Preservation Components
• Taverna workflows with embedded metadata for easy discovery
• Automatic publication of components on myExperiment (to support discoverability)
• Standard ports to enable composition of Preservation Components (based on well defined component profiles, CC, AS & QA)
• Digital Preservation Toolkit
• Software suite that contains a large set of DP tools
• 77 operations in total
• Easy to deploy on Linux machines (via apt-get)
• apt-get install digital-preservation-tools
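The idea of a standard invocation method with streaming support can be sketched as follows. This is not the SCAPE Tool Wrapper or its tool specification format: the `TOOL_SPEC` dictionary, the operation name `uppercase` and the `run_operation` helper are hypothetical, and the Python interpreter stands in for an existing command-line tool so that the example runs anywhere.

```python
import subprocess
import sys

# Hypothetical tool specification: operation name -> command line.
# The "tool" here is a one-line Python program that reads stdin and writes
# the upper-cased text to stdout; a real wrapper would name an installed tool.
TOOL_SPEC = {
    "uppercase": [
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())",
    ],
}


def run_operation(name: str, data: bytes) -> bytes:
    """Invoke a wrapped tool uniformly: bytes in on stdin, bytes out on stdout."""
    command = TOOL_SPEC[name]
    result = subprocess.run(
        command,
        input=data,              # streamed to the tool's stdin
        stdout=subprocess.PIPE,  # captured from the tool's stdout
        check=True,              # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Because every operation is called the same way and exchanges data over standard streams, a caller such as a workflow engine or a Hadoop job does not need to know the command-line conventions of each individual tool.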
9. Scalable Computation
• Deployment of environments
• XEN Hypervisor
• Eucalyptus
• Deployment of tools
• Debian Packages
• Tool Spec
• Job Execution Service (JES)
• Apache Oozie
• Apache Hadoop
from digitalbevaring.dk
User view of the SCAPE development cloud at AIT: Eucalyptus web interface, Hybridfox browser add-on, and terminal-based interaction.
10. Scalable Repositories
• Fedora 4.0.0
• All REST, no SOAP
• RDF as first class objects
• JCR 2.0 Implementation (ModeShape)
• Infinispan distributed NoSQL datastore
• Lily 2.0
• Built on top of HBase/HDFS
• Integration of computation and storage
11. SCAPE Architecture
[Figure: SCAPE architecture diagram. Labels recoverable from the diagram: Automated Watch, Automated Watch Sources, Knowledge Source Adaptor, Automated Planning (PLATO), Digital Object Repository, Digital Objects/Metadata, Preservation Plan Store, Plan Management GUI, Execution Platform (JES, Hadoop), Component Catalogue, Component Profile Validator, Taverna Workbench, Client Service, Assessment Data, Publication Platform (LDS3), Loader Application. APIs shown: Plan Management API, JES API, Data Connector API, Component Lookup API, Component Registration API, Push API, Pull API, Watch Request API, Notification API, Report API, LDS3 API.]
13. SCAPE Testbeds
• Large-scale Digital Repositories
• Carry out large scale image migrations
• The master files from legacy digitized image collections are typically TIFF files that can be costly to store due
to their size. The cost benefit can only be realized if one can remove the original TIFFs and this can only be
done if one can provide evidence of successful migration. (2.2 million pages, 80 TB)
• Detect poor sound quality
• In a collection of MP3 files (20 TB, 360,000 files) we have discovered files with very bad sound quality. Before
ingesting everything into our DOMS we would like to be able to discover the bad files and potentially get
those re-digitized from the original analogue media.
• Research Data Sets
• RAW to NeXus conversion
• File size and content volume challenges have been identified for NeXus files.
The RAW to NeXus format migration tool can be customised to account for
various other types of experiment data files during the migration.
However, the scalability challenge here is that these other types of experiment
data files vary significantly between instruments (which are specific to each facility).
from digitalbevaring.dk
See http://wiki.opf-labs.org/display/SP/Scenarios
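The image migration scenario depends on producing evidence that a migrated file still represents the original. The slides do not say which measure is used, so the following is only a sketch under stated assumptions: it uses peak signal-to-noise ratio (PSNR) as one common image similarity metric, assumes both images have already been decoded into flat lists of 8-bit sample values by an external tool, and uses an arbitrary acceptance threshold of 40 dB.

```python
import math


def psnr(original, migrated, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences."""
    if len(original) != len(migrated):
        raise ValueError("images must have the same number of samples")
    if not original:
        raise ValueError("images must not be empty")
    # Mean squared error over all samples.
    mse = sum((a - b) ** 2 for a, b in zip(original, migrated)) / len(original)
    if mse == 0:
        return math.inf  # identical samples: lossless migration
    return 10 * math.log10(max_value ** 2 / mse)


def migration_evidence(original, migrated, threshold_db=40.0):
    """Build a small evidence record; the threshold is an assumed policy value."""
    score = psnr(original, migrated)
    return {"psnr_db": score, "accepted": score >= threshold_db}
```

A record like this, stored per page, is the kind of evidence that would have to exist for all 2.2 million pages before the original TIFFs could be removed, which is why the check has to run at scale.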
14. SCAPE Testbeds
• Web Content
• Quality assurance in web harvesting
• Web crawling is a process that is highly susceptible to errors. Often, essential data is
missed by the crawler and thus not captured and preserved. Currently, quality
assurance requires manual effort and because crawls often contain millions of pages,
manual quality assurance will not be very efficient.
• Data Centers
• Anonymization of medical data
• In order to fulfil the requirements for storing medical data in terms of safety
and security, it will be necessary to develop encryption and anonymization
services that will allow medical data transfer to a data center’s remote storage
facilities. On the one hand, the encryption techniques will be used to secure
sensitive personal data (e.g. internal documents, patient databases) which
must only be accessible to authorized services and users. On the other hand,
the anonymization services will enable medical data (like x-ray generator
outputs, x-ray computed tomography outputs, surgery recordings) to be stored
in the data center without having sensitive data attached.
from digitalbevaring.dk
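The separation of medical data from sensitive personal data can be sketched in a few lines. This is not the SCAPE anonymization service: the field names, the `SENSITIVE_FIELDS` set and the `anonymize` helper are hypothetical, and the approach shown (dropping identifying fields and replacing the patient identifier with a keyed hash) is strictly speaking pseudonymization. Encryption of the data in transit or at rest is not shown.

```python
import hashlib
import hmac

# Hypothetical names of fields that must not reach the remote data center.
SENSITIVE_FIELDS = {"patient_name", "date_of_birth", "address"}


def anonymize(record: dict, secret_key: bytes, id_field: str = "patient_id") -> dict:
    """Return a copy of the record without sensitive fields.

    The patient identifier is replaced by an HMAC-SHA256 pseudonym, so records
    of the same patient can still be linked without revealing the identifier.
    """
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in SENSITIVE_FIELDS and key != id_field
    }
    pseudonym = hmac.new(
        secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned["pseudonym"] = pseudonym
    return cleaned
```

Only the holder of the secret key can reproduce the pseudonym for a given patient, so the key would stay with the data owner while the cleaned records are transferred to remote storage.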
16. Additional Resources of Interest
• Development Infrastructure
• Code repository hosted by the Open Planets Foundation and GitHub
• https://github.com/openplanets/scape/
• Development Wiki
• http://wiki.opf-labs.org/display/SP/Home
• Experimental Workflows
• http://www.myexperiment.org/search?query=SCAPE&type=all&commit=Search
• Publications
• http://www.scape-project.eu/category/publication
• Public Deliverables
• http://www.scape-project.eu/category/deliverable
• Tools
• http://www.scape-project.eu/tools
17. SCAPE Training Events
• Future Formats First:
Application Infrastructures for Action Services
• 16-17 September 2013, London
• Registration: http://scape-future-formats-first.eventbrite.co.uk/
• Critical Path: Effective Evidence Based Preservation Planning
• 13 November 2013, Aarhus
• Hadoop-driven Digital Preservation (Hackathon)
• 2-4 December 2013, Vienna
See http://www.scape-project.eu/events
18. SCAPE Contact Information
• http://www.scape-project.eu/
• Twitter: #scapeproject
• office@list.scape-project.eu
• Dr. Ross King
AIT Austrian Institute of Technology GmbH
Donau-City-Strasse 1
A-1220 Wien