EUDAT and PRACE joined forces to help research communities gain access to high-quality managed e-Infrastructures whose resources can be connected to enable cross-utilization use cases and made accessible without technical barriers. The capability to couple data and compute resources is considered one of the key factors in accelerating scientific innovation and advancing research frontiers. The goal of this session was to present the EUDAT services and the results of the collaboration achieved so far, and to deliver a hands-on session on how to write a Data Management Plan (DMP). The DMP is a useful instrument for researchers to reflect on and communicate how they will deal with their data. It prompts them to think about how they will generate, analyse and share data during their research project and afterwards.
Visit: https://www.eudat.eu/eudat-summer-school
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
1. EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065 www.eudat.eu
Linking HPC to Data
Management
Stéphane COUTIN (CINES)
Giuseppe Fiameni (CINECA)
This work is licensed under the Creative
Commons CC-BY 4.0 licence
2. Objectives
High level presentation of research
data management and H2020 context
Present a simple approach and draft a
DMP for a given case.
3. THE CHANGING DATA LANDSCAPE
Image CC-BY-SA ‘data.path Ryoji.Ikeda - 3’ by r2hox www.flickr.com/photos/rh2ox/9990016123
4. Data explosion
More and more data is
being created
Issue is not creating
data, but being able to
navigate and use it
Data management is
critical to make sure
data are well-organised,
understandable and
reusable
5. Digital data are fragile and susceptible to loss for a wide variety of reasons
Natural disaster
Facilities infrastructure failure
Storage failure
Server hardware/software failure
Application software failure
Format obsolescence
Human error
Malicious attack
Loss of staffing competencies
Loss of institutional commitment
Loss of financial stability
Changes in user expectations
Data loss
Image CC-BY ‘Hard Drive 016’ by Jon Ross www.flickr.com/photos/jon_a_ross/1482849745
6. Link rot – more 404 errors
generated over time
Reference rot* – link rot
plus content drift i.e.
webpages evolving and
no longer reflecting
original content cited
* Term coined by Hiberlink http://hiberlink.org
Data persistency issues
Jonathan D. Wren Bioinformatics 2008;24:1381-1385
8. Why manage research data?
To make your research easier!
To stop yourself drowning in irrelevant stuff
In case you need the data later
To avoid accusations of fraud or bad science
To share your data for others to use and learn from
To get credit for producing it
Because funders or your organisation require it
Well-managed data opens up opportunities
for re-use, integration and new science
9. H2020 open research data pilot
• Already expanded from a select pilot to all work
areas
• All need to consider which data can be made
open
• Mantra = “As open as possible, as closed as
necessary”
• Underlying driver is good (FAIR) data
management
Image CC-BY-SA by SangyaPundir
10. Key requirements of the open data pilot
Beneficiaries participating in the Pilot will:
Deposit data in a research data repository of
their choice
Take measures to make it possible for others to
access, mine, exploit, reproduce and
disseminate the data free of charge
Provide information about tools and instruments
necessary for validating the results (where
possible, provide the tools and instruments
themselves)
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi
/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
12. Suggested DMP creation process
Analyse your project Information System
Suggestion: use a Data Flow Diagram
Apply FAIR principles
Include data life cycle and time dimensions
Estimate costs
Iterate
Get funders' support
Keep the DMP up to date
13. Simple diagram focusing on data dynamics
You can use other diagram types
DFD: Data Flow Diagram
Data
Processing
Data store
External
interaction
Data Flow
14. You and your team are submitting a proposal for a project in the domain of smart cities.
The City has deployed a large set of sensors measuring traffic. The data are collected
in the City datacenter.
You want to develop an application that can forecast the traffic and how it will
be impacted by events such as planned roadworks. This application would run on a PRACE
site, not located in the City. On the PRACE site your storage space is limited to 10 TB.
The application uses the following inputs:
Historical sensor data over the last 12 months: the sensors produce 1 TB of data a day.
You implement a preprocessing module translating those data into a reduced data set
(10 MB per day). It is based on a format you have defined to describe the traffic.
The results provided by the simulation. This enables comparison between forecasted
and actual traffic in order to ‘train’ the application.
Weather data (historical and forecast) provided by the national meteorological agency. They
use the SYNOP format. The volume is negligible.
Results will be accessible to the city council employees.
Create the project data flow diagram and fill in the data summary chapter using a
table.
What would you need in order to use the weather data efficiently?
Exercise – Phase 1
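The volumes given in the exercise can be checked with a short calculation. The sketch below is only an illustration: it assumes decimal units (1 TB = 1,000,000 MB) and takes "12 months" as 365 days, and the variable and function names are invented for this example. It shows that one year of raw sensor data is far larger than the PRACE storage limit, while the reduced data set fits easily, which suggests running the preprocessing before the data are transferred.

```python
# Back-of-the-envelope volume estimate for the exercise.
# Assumptions: decimal units (1 TB = 1,000,000 MB), 12 months = 365 days.
DAYS = 365
RAW_TB_PER_DAY = 1.0        # raw sensor output, from the exercise text
REDUCED_MB_PER_DAY = 10.0   # preprocessed data set, from the exercise text
PRACE_QUOTA_TB = 10.0       # storage limit on the PRACE site
MB_PER_TB = 1_000_000

raw_total_tb = RAW_TB_PER_DAY * DAYS
reduced_total_tb = REDUCED_MB_PER_DAY * DAYS / MB_PER_TB


def fits_quota(volume_tb, quota_tb=PRACE_QUOTA_TB):
    """Return True when the volume fits in the available storage."""
    return volume_tb <= quota_tb


print(f"Raw data over 12 months: {raw_total_tb:.0f} TB, "
      f"fits quota: {fits_quota(raw_total_tb)}")
print(f"Reduced data over 12 months: {reduced_total_tb * 1000:.2f} GB, "
      f"fits quota: {fits_quota(reduced_total_tb)}")
```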
16. Proposed data flow diagram
[Diagram. Elements shown: Sensors collection area, Raw sensor data, Data Preprocessing, Reduced sensor data, Weather data, Data transfer, PRACE HPC Site, Input files, Simulations, PRACE Storage, Output files, extractor, City council employees]
17. Data summary table
Dataset | Description | Origin? Existing? | Format | Size | Who could use it?
--- | --- | --- | --- | --- | ---
Raw sensor data | | Available, collected from sensors | Various | 1 TB per day |
Reduced sensor data | Actual traffic, … | Extracted from raw sensor data | Binary (specific) | 10 MB a day | Our simulation
Weather data | Actual and forecast | Existing. Meteo open data platform | SYNOP | 1 MB a week | Our simulation; citizens, scientists, ..
Simulation results | Forecasted traffic | Results of our simulation | Binary (specific) | 10 MB a day | City council employees, our application
18. Research data lifecycle
CREATING DATA: designing research, DMPs, planning consent, locate existing data, data collection and management, capturing and creating metadata
PROCESSING DATA: entering, transcribing, checking, validating and cleaning data, anonymising data, describing data, manage and store data
ANALYSING DATA: interpreting and deriving data, producing outputs, authoring publications, preparing for sharing
PRESERVING DATA: data storage, back-up & archiving, migrating to best format & medium, creating metadata and documentation
GIVING ACCESS TO DATA: distributing data, sharing data, controlling access, establishing copyright, promoting data
RE-USING DATA: follow-up research, new research, undertake research reviews, scrutinising findings, teaching & learning
Ref: UK Data Archive: http://www.data-archive.ac.uk/create-manage/life-cycle
22. A file format is a convention for how data are
represented on a medium. It can be:
Specified: a description of the convention exists,
and is detailed enough to allow a complete
implementation of it;
Open: the convention is available without any
restrictions on access or implementation;
Standardized: the convention has been adopted
by standardization bodies (ISO, W3C). Example:
PDF/A.
Wide use of a format can also lead to it being
considered a standard, even if there is no official standard for
it. Example: PDF.
Proprietary: the format depends on the existence
of an owner. It can be published. Example: Word.
The durability of a format depends on these
criteria.
Data formats
23. Through a web interface, this tool checks whether a
file is well-formed and valid against the specification of its declared
format, in order to know whether it can be archived.
You just have to upload the file you want to test. The
file is then analysed by the tool, which automatically returns
the result.
If the file is not well-formed or not valid, tutorials are available to help
the user correct it. If the
problem is not resolved, the user can contact the CINES
experts by e-mail.
The list of file formats accepted in PAC (the CINES
Archiving Platform) is available on FACILE
(https://facile.cines.fr/)
FACILE: a format validation tool
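To give a feel for what checking a file against its declared format involves, here is a minimal sketch of the very first step only: comparing the leading bytes of a file with the signature of the declared format. It is not how FACILE works internally, which the slides do not describe. The signature table and the function name are assumptions made for this example, and full validation against a format specification goes much further than this.

```python
import os
import tempfile

# Leading byte signatures of two well-known formats.
# The table is deliberately tiny; it is an illustration, not a registry.
SIGNATURES = {
    "PDF": b"%PDF-",
    "PNG": b"\x89PNG\r\n\x1a\n",
}


def matches_declared_format(path, declared):
    """Return True when the file starts with the declared format's signature."""
    signature = SIGNATURES[declared]
    with open(path, "rb") as handle:
        return handle.read(len(signature)) == signature


# Small demonstration with a throwaway file that only carries a PDF header.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "sample.pdf")
    with open(sample, "wb") as handle:
        handle.write(b"%PDF-1.7\n")
    declared_pdf_ok = matches_declared_format(sample, "PDF")
    declared_png_ok = matches_declared_format(sample, "PNG")

print("Declared as PDF:", declared_pdf_ok)
print("Declared as PNG:", declared_png_ok)
```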
24. Complexity and diversity of file formats
A few ‘pivot’ formats
HDF
NetCDF
A lot of specific binary formats
Need to document the format
Store or reference documentation in the digital
object
Store or reference code
HPC data formats
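One way to store or reference the documentation inside the digital object itself is to put a small descriptive header in front of the binary records. The sketch below is a hypothetical illustration using only the Python standard library: the "TRAF" signature, the field names, the record layout and the documentation path are all invented for this example. Formats such as HDF and NetCDF provide attribute mechanisms that serve a similar purpose.

```python
import io
import json
import struct

MAGIC = b"TRAF"  # hypothetical 4-byte signature for this illustrative format


def write_dataset(stream, records, doc_reference):
    """Write a self-describing header followed by fixed-size binary records."""
    header = {
        "format_version": 1,
        # struct layout: little-endian uint32 sensor id, float32 rate
        "record_layout": "<If",
        "fields": ["sensor_id", "vehicles_per_minute"],
        "documentation": doc_reference,
    }
    header_bytes = json.dumps(header).encode("utf-8")
    stream.write(MAGIC)
    stream.write(struct.pack("<I", len(header_bytes)))
    stream.write(header_bytes)
    for sensor_id, rate in records:
        stream.write(struct.pack(header["record_layout"], sensor_id, rate))


def read_dataset(stream):
    """Read the header first, then use it to decode the records."""
    if stream.read(4) != MAGIC:
        raise ValueError("unexpected file signature")
    (header_length,) = struct.unpack("<I", stream.read(4))
    header = json.loads(stream.read(header_length).decode("utf-8"))
    layout = header["record_layout"]
    record_size = struct.calcsize(layout)
    records = []
    while True:
        chunk = stream.read(record_size)
        if len(chunk) < record_size:
            break
        records.append(struct.unpack(layout, chunk))
    return header, records


buffer = io.BytesIO()
write_dataset(buffer, [(1, 12.5), (2, 40.0)], "docs/traffic-format.txt")
buffer.seek(0)
header, records = read_dataset(buffer)
print(header["fields"], records)
```

Because the reader takes the record layout from the header rather than hard-coding it, someone who receives only the file still has the information needed to decode it, plus a pointer to the fuller documentation.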
25. Licensing research data
• Horizon 2020 guidelines point to CC-BY or CC-0
• EUDAT licensing wizard helps you pick a licence for data & software
(available in B2SHARE)
• DCC How-to guide helps you to license data
www.dcc.ac.uk/resources/how-guides/license-research-data
26. Commonly defined as ‘data about data’, metadata
helps to make data findable and understandable
Metadata can be:
Descriptive: information about the content and
context of the data
Structural: information about the structure of the
data
Administrative: information about the file type, rights
management and preservation processes
What is metadata?
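The three kinds of metadata can be kept together in one record per dataset. The sketch below is an illustration only: the field names do not follow any particular metadata standard, and the values are examples based on the reduced sensor data set from the exercise, with the licence picked arbitrarily for the demonstration.

```python
import json

# One record, grouped by the three kinds of metadata.
# Field names and values are illustrative, not a standard schema.
metadata = {
    "descriptive": {
        "title": "Reduced sensor data",
        "description": "Actual traffic extracted from raw sensor data",
    },
    "structural": {
        "fields": ["sensor_id", "vehicles_per_minute"],
        "granularity": "one file per day",
    },
    "administrative": {
        "file_type": "Binary (specific)",
        "licence": "CC-BY 4.0",
    },
}

# Serialising to JSON gives a plain-text form that can travel with the data.
serialised = json.dumps(metadata, indent=2, sort_keys=True)
restored = json.loads(serialised)
print(serialised)
```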
27. Comprehensive metadata will:
Facilitate data discovery
Help users determine the applicability of the data
Enable interpretation and reuse
Allow any limitations to be understood
Clarify ownership and restrictions on reuse
Offer permanence as it transcends people and time
Provide interoperability
Why use metadata?
28. The good and the bad
More precise and standardised | Ambiguous
--- | ---
Metres / seconds | Furlongs and fortnight
2015-09-10T15:00:01+01:00 | 10th Sept. 2015 15:00:01
Longitudinal wind speed | U
PDF 1.7 | PDF
2008 US Population statistics | Population statistics
Barcelona, Venezuela | Barcelona
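The difference between the two timestamps on this slide is easy to demonstrate. In the sketch below, Python's standard library parses the standardised value directly, including its time zone offset, while the free-text value is rejected and would need custom, guess-prone handling. The variable names are chosen for this example.

```python
from datetime import datetime, timedelta, timezone

precise = "2015-09-10T15:00:01+01:00"   # ISO 8601, with UTC offset
ambiguous = "10th Sept. 2015 15:00:01"  # free text, no time zone

# The standardised value parses without any format hint.
parsed = datetime.fromisoformat(precise)
as_utc = parsed.astimezone(timezone.utc)
print("Offset carried by the value:", parsed.utcoffset())
print("Same instant in UTC:", as_utc.isoformat())

# The free-text value is not accepted by the same parser.
try:
    datetime.fromisoformat(ambiguous)
    ambiguous_parsed = True
except ValueError:
    ambiguous_parsed = False
print("Free-text value parsed:", ambiguous_parsed)
```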
29. Digital preservation context
Main risks deal with:
• Comprehension
• Integrity
• Exploitation
• Valorization
Quality assurance
procedures to be set up for
• Metadata
• File formats
• Representation information
• Storage
• Access
• Technology watch
30. Digital preservation challenges
Set up quality assurance procedures to mitigate the
impact of the four main identified risks when they
occur
Challenge | Solutions
--- | ---
Loss of content knowledge | Metadata; persistent, unique identifiers.
File format obsolescence | Handling of a limited set of durable formats; file format identification and validation; logical migration (format conversion).
Storage media failure | Management of media ageing; physical migration.
Software or hardware disappearance | Technology watch, anticipation, proactivity.
More details at https://www.cines.fr/en/long-term-preservation/
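Integrity is one of the risks named on the previous slide, and a common way to monitor it is to record a checksum when a file is stored and to recompute it later. The slides do not say which mechanism CINES uses, so the sketch below is a generic illustration with the Python standard library; the file name and content are invented for the demonstration.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "result.bin")
    with open(path, "wb") as handle:
        handle.write(b"forecasted traffic")

    # Checksum recorded when the file is stored, kept with its metadata.
    recorded = sha256_of(path)

    # Later check: the digest still matches, so the content is unchanged.
    unchanged = sha256_of(path) == recorded

    # Simulate silent corruption by appending one byte.
    with open(path, "ab") as handle:
        handle.write(b"\x00")
    corrupted = sha256_of(path) != recorded

print("Unchanged file verified:", unchanged)
print("Corruption detected:", corrupted)
```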
31. Certifications
Certification can help in selecting a repository
Certification focuses on:
Organizational infrastructure
Digital object management
Technology
Usually refers to OAIS model
32. OAIS (Open Archival Information System) model
Framework for an archive, now ISO 14721
Defines a functional model and an information model
33. Repository certification: Data Seal of Approval
16 quality guidelines for researchers and institutions that create
digital research files, organizations that archive research files, and
users of research data.
The objectives of the Data Seal of Approval are to safeguard
data, to ensure high quality and to guide reliable management
of research data for the future without requiring the
implementation of new standards, regulations or high costs.
The DSA
Gives researchers and research sponsors the assurance that their
research results will be stored in a reliable manner and can be
reused
Allows data repositories to archive and distribute research
data efficiently
Is part of a European Framework for Audit and Certification of
Trusted Repositories
Online application and self-assessment of the 16 guidelines by the
repository
Review by a member of the DSA Board
34. Formal certification: ISO 16363
ISO 16363 – « Audit and certification of trustworthy
digital repositories »
Evaluation criteria for an auditor to judge whether a
repository is trustworthy
Published in 2012
Strongly based on OAIS reference model
ISO 16919:2014 – « Requirements for bodies
providing audit and certification of candidate
trustworthy digital repositories »
Specifies requirements for bodies providing ISO
16363 audit and certification, and details the
competences that auditors need
35. www.eudat.eu
Thanks – any questions?
Acknowledgements:
Thanks to Mark van de Sanden, Marjan Grootveld, Sarah Jones
and Giuseppe Fiameni for some of the slides