Text (personal views position statement) to accompany presentation on what research infrastructures really need for data, XLDB-Europe, 8-10th June 2011, Edinburgh
What does research infrastructure really need for data?
A personal view based on LifeWatch and ENVRI
Alex Hardisty, Cardiff University
LifeWatch: an ESFRI Research Infrastructure; an e-Infrastructure for
Biodiversity and Ecosystem Science.
What is LifeWatch?
Biodiversity science is the study of the diversity of life on our planet – plants, animals, microorganisms and
viruses – and the environments (ecosystems) they live in. LifeWatch (www.lifewatch.eu) will be an open
access infrastructure, accessed through a single portal (portal.lifewatch.eu) for users from the scientific
community, as well as policy makers and representatives of the private sector. It will allow scientists to
explore, describe and understand patterns in biodiversity, and the processes that maintain biodiversity, in
space and time at the gene, species, ecosystem and landscape levels; and to understand what causes and
affects species diversity.
The innovative design of LifeWatch offers integrated access to large-scale data resources, advanced
algorithms and computational capability through a service-oriented architecture to support creation of new
knowledge. Key elements of the infrastructure will include: distributed observatories/sensors,
interoperable datasets, processing and analytical tools, and both computational capability and capacity.
Data mining, data analysis and modelling allow users to study patterns and mechanisms across different
levels of biodiversity. The LifeWatch infrastructure provides scientific research teams with new
collaborative environments by creating ‘e-Laboratories’ or composing ‘e-Services’. They may share their
data and analytical and modelling algorithms with others, while controlling access. LifeWatch enables
“distributed large scale” and collaborative research on complex and multidisciplinary problems.
Having been in planning for the past 3 years, LifeWatch is now moving into its construction phase. Early Virtual
Labs are likely to support scientific studies of biodiversity in marine wetlands and of the vulnerability of
ecosystems to alien and invasive species. The Biodiversity Virtual e-Laboratory (BioVeL) project (www.biovel.eu)
contributes to the construction by encouraging islands of compatible infrastructure to emerge at
key centres across Europe.
The challenges of scale and heterogeneity
LifeWatch is supported by many good data providers from within the scientific communities (networks of
excellence) for terrestrial ecology, marine ecology and the natural history collections with all their
biological specimens. There are currently about 1800 terrestrial monitoring sites and 200 marine research
sites across Europe. Hundreds of millions of specimens in natural history collections all over Europe are
gradually being indexed and digitised.
Biodiversity data is extremely diverse and heterogeneous. Biodiversity science spans many more familiar
disciplines: biology, botany, zoology, ecology, genetics, soil science, biogeography, climate science,
chemistry - to name but a few. Each of these established scientific communities already has its own way of
Alex Hardisty, XLDB-Europe, Edinburgh, 8-10th June 2011 Page 1
doing things, its own data resources and its own tools. Not only that, but each has its own distinct
vocabularies and conceptual underpinnings. Interoperability is a problem demanding a determined
ontological and thesaurus solution like that used in the medical domain: the Unified Medical Language
System (UMLS) (www.nlm.nih.gov/research/umls).
The interconnections between different biodiversity ideas/concepts, data sources, and the outputs from
data processing, manipulation and modelling are intricate. As well as the traditional sources mentioned
above, genomic data (for example, sequence data, DNA barcodes and phylogenies) are becoming
increasingly important sources. Biodiversity science also demands environmental data (climate, soil, ocean
temperature, etc.), as well as economic and census data for particular types of studies.
Apart from the well known and often large sources - GBIF, EBI, environmental data, census data - there are
numerous small datasets in the hands of individual researchers. If computerised at all, these small datasets
are often held in spreadsheets with no identifiable common structure. There are probably thousands of
them, and multiple tools for processing them too. The biodiversity science community is highly fragmented and
all these kinds of small, personal, group and departmental datasets need to get published and become
discoverable and usable.
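As a rough illustration of what making such spreadsheets usable together can involve, the sketch below maps differing column headers onto a small shared set of field names. It is a minimal sketch only: the alias table, the shared field names and both sample datasets are made up for this example and are not part of LifeWatch.

```python
import csv
import io

# Hypothetical mapping from column headers found in individual spreadsheets
# to a small shared set of field names (both sides are illustrative).
COLUMN_ALIASES = {
    "species": "species",
    "taxon": "species",
    "scientific name": "species",
    "lat": "latitude",
    "latitude": "latitude",
    "lon": "longitude",
    "long": "longitude",
    "longitude": "longitude",
    "date": "date",
    "observed on": "date",
}


def harmonise(csv_text):
    """Return rows as dicts keyed by shared field names; unknown columns are dropped."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for raw in reader:
        row = {}
        for header, value in raw.items():
            key = COLUMN_ALIASES.get(header.strip().lower())
            if key is not None:
                row[key] = value.strip()
        rows.append(row)
    return rows


# Two made-up datasets using different headers for the same kind of record.
dataset_a = "Taxon,Lat,Long,Observed on\nParus major,51.48,-3.18,2011-05-02\n"
dataset_b = (
    "scientific name,latitude,longitude,date,notes\n"
    "Erithacus rubecula,55.95,-3.19,2011-06-08,garden\n"
)

records = harmonise(dataset_a) + harmonise(dataset_b)
for record in records:
    print(record)
```

A hand-maintained alias table like this only scales so far, which is one reason an agreed vocabulary matters more than any individual conversion script.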
LifeWatch aims to support upwards of 25,000 users, primarily from the academic and research community,
and the policymaking community, but also supporting the student education sector and the general public
(citizen science).
The LifeWatch strategy of “Thinking globally, acting locally” addresses these challenges of heterogeneity
and scale. “Thinking globally, acting locally” devises and promotes the pan-European top-down strategies
that foster collaboration and interoperability, and at the local level assists and encourages ‘islands’ of
compliant infrastructure to emerge and fuse.
ENVRI: Common Operations of the ESFRI Environmental Research
Infrastructures
What is ENVRI?
ENVRI is a soon-to-be-funded EC FP7 project that brings together many of the main ESFRI research
infrastructures from the environmental sciences domain. The ENVRI project will contribute to the
construction of these research infrastructures by sharing experiences and technologies and by solving
crucial common technology issues and challenges together. Through cooperation in this project the ESFRI
ENV infrastructures, together with ICT partners, are seeking to increase the interoperability of their data
and facilities to increase the use and effectiveness of their infrastructures. The central goal of the ENVRI
project is to implement harmonised solutions and draw up guidelines for the common needs of the
environmental ESFRI projects, with a special focus on issues such as architectures, metadata frameworks, data
discovery in scattered repositories, visualization and data curation.
ENVRI recognises scientific data services as part of a horizontal set of foundational services that include
communications, distributed computing, and storage. It recognises that data providers, as well as data
users, are users of data services and that there are common requirements irrespective of domain-specific
communities. Community-specific services sit on top of data services and interact with them.
The key to improved interoperability is finding common solutions to common problems that can be
adopted by each research infrastructure as it progresses through its construction phase. Fundamental
common solutions include:
a) A Common Reference Model providing multiple compatible ‘views’ of the research infrastructure for
different purposes.
An ENVRI Common Reference Model is likely to be based on the ISO/IEC 10746 series of Standards for
Open Distributed Processing, presenting 5 viewpoints: i) Science business / enterprise view; ii)
Information view; iii) Computational / services view; iv) Engineering view and v) Technology view.
b) “Standards, Standards, Standards” are required for, at least:
• Data capture from distributed sensors
• Metadata definition
• Management of high volume data
• Execution of workflows
• Visualization of data
• Provenance and annotation
• Interoperability between assets
c) Tools, based on a generic metadata model (the Information viewpoint of the Common Reference Model),
to allow data discovery and access in a federation of distributed digital repositories and
interoperating infrastructures;
d) RDF and OWL frameworks to describe relations between (virtualized) e-Infrastructure components,
and to link semantic descriptions of data with the semantic descriptions of the infrastructure, allowing
the creation of a data-centric network.
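To make item d) more concrete, the sketch below represents RDF-style (subject, predicate, object) statements as plain Python tuples and follows links from a data concept through to the infrastructure that can reach the data. It is an illustration under stated assumptions rather than an ENVRI design: apart from rdf:type, every identifier and predicate name is hypothetical, and a real system would use an RDF store and an OWL ontology instead of an in-memory list.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
# Apart from rdf:type, all identifiers below are hypothetical.
TRIPLES = [
    ("dataset:marine-obs", "rdf:type", "ex:Dataset"),
    ("dataset:marine-obs", "ex:describes", "concept:SpeciesOccurrence"),
    ("dataset:marine-obs", "ex:storedOn", "infra:repository-1"),
    ("dataset:soil-samples", "rdf:type", "ex:Dataset"),
    ("dataset:soil-samples", "ex:describes", "concept:SoilChemistry"),
    ("dataset:soil-samples", "ex:storedOn", "infra:repository-2"),
    ("infra:repository-1", "rdf:type", "ex:StorageService"),
    ("infra:repository-1", "ex:accessibleFrom", "infra:compute-cluster-1"),
    ("infra:repository-2", "rdf:type", "ex:StorageService"),
]


def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def compute_near(concept):
    """Find (dataset, repository, compute) chains for datasets describing a concept."""
    results = []
    for dataset, _, _ in match(predicate="ex:describes", obj=concept):
        for _, _, repo in match(subject=dataset, predicate="ex:storedOn"):
            for _, _, compute in match(subject=repo, predicate="ex:accessibleFrom"):
                results.append((dataset, repo, compute))
    return results


print(compute_near("concept:SpeciesOccurrence"))
```

Because descriptions of the data and descriptions of the infrastructure live in the same graph, a single query can cross from one to the other, which is the sense in which the resulting network is data-centric.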
Riding the Wave: How Europe can gain from the rising tide of scientific
data
The recently published report of the High Level Expert Group on Scientific Data – “Riding the Wave: How
Europe can gain from the rising tide of scientific data” – is an important contribution towards addressing
the question of what research infrastructures really need for data. Neelie Kroes, the Vice-President of the
European Commission responsible for the Digital Agenda has asked: “every citizen and every organisation
involved in scientific research to take note of this report and to use it as a reference point when discussing
the priorities of EU research investments.”
The report may be found here:
http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf