Repositories in an Open Data Ecosystem

Wolfgang Kuchinke
University Duesseldorf, Duesseldorf, Germany
CORBEL Project
W. Kuchinke (2018)
ECRIN – CORBEL WP 3.3 Working Group Meeting
14. Jun 2018, Paris, France

Open Data – Open Science
Towards an ecosystem for Open Data and Sensitive Data

Open data is data that can be freely used, shared and built on by
anyone, anywhere, for any purpose.
Data sharing is the precondition for the reproducibility of research
results.
Open Definition (http://opendefinition.org/okd/)

Data sharing is critical for the reproducibility and progress of
research. Providers of human data (e.g. publicly or privately funded
repositories and data archives) fulfill their social responsibility
toward data donors when their shareable data conform to the FAIR
(findable, accessible, interoperable, reusable) principles.
FAIR data framework

Research data, metadata and data management plans are part of Open
Research Data Management. Research data can contain a wide diversity of
collected information: text or numerical data, biosamples, images,
questionnaires, recorded videos, models, software, reports, workflows, etc.
The type and format of all this information need to be described. For this
purpose, data need to be complemented by proper metadata.
Metadata are essential to recover and reuse research data. Metadata
standards allow interoperability across different systems, such as
repositories. Metadata can be classified into three main types: descriptive,
administrative, and structural. Descriptive metadata serve to discover and
understand a data source and refer, for example, to the title, author,
publication date or abstract, as in the Dublin Core schema.
The Importance of Metadata for
Open Research Data Management
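
As an illustration of descriptive metadata, the following minimal sketch serializes a few Dublin Core elements (title, creator, date, description) into a small XML record. The wrapper element name `metadata`, the function name and all field values are assumptions made for this example; they are not taken from the presentation.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dublin_core_record(fields):
    """Serialize a dict of Dublin Core element names to a small XML record."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only
record = build_dublin_core_record({
    "title": "Example trial dataset",
    "creator": "Example, Researcher",
    "date": "2018-06-14",
    "description": "Short abstract describing the dataset.",
})
print(record)
```

A repository that exposes records of this kind lets other systems discover a dataset by title, author or date without opening the data files themselves.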

Conceptual representation of the life cycle of data in biomedical data
repositories (secure storage of biomedical research and healthcare-related
data): from the moment of data generation, through their utilization and
transformation into useful information and their publication, to their
long-term archiving or destruction
Data life cycle

Data Ecosystem
Repositories are the core components of an Open Data Ecosystem
Many tools and data services support repositories
Different aspects
FAIR principles
Open data and clinical trials data should be stored together
Cloud storage should be enabled
Analysis tools should be provided
Different data repositories should be connected to each other

FAIR data framework
[Figure, modified from: Dep. Med Inf. UMG, Groningen. The diagram shows four stages (Data Sources; Data Integration and Data Curation; Data Storage; Data Usage) together with a Data Governance layer.]
Data sources shown: ePatient Record, Clinical Trial Data, Registry Data, Patient Reported Outcome, Sensor Data, Biomaterial Data, eLab Data, Lifestyle Data, Weather, Medication, Social Media
Integration, curation and storage elements shown: Transform, Ontology match, Linkage, Data Warehousing, Data Marts, Data management and Analysis, Open Data
Usage elements shown: Data query, Data visualisation, Data analysis, Collaboration, Therapy Board, Data transfer, Data sharing, Publication
Data Governance elements shown: Persistent Identifiers, Privacy, Metadata harvesting, Data Dictionary, Consent Management, Identity Management, Anonymisation, Pseudonymisation, Data Annotation

Move Data Governance to the data generation step in the data life cycle
[Figure: the FAIR data framework diagram (Data Sources; Data Integration and Data Curation; Data Storage; Data Usage; Data Governance) with the annotation "Data Governance for privacy protection already at the step of data generation".]

Persistent identifiers, metadata and privacy protection become part of
data generation
Built-in data governance and privacy protection
[Figure: the FAIR data framework diagram with its stages (Data Sources; Data Integration and Data Curation; Data Storage; Data Usage) and the Data Governance elements: Persistent Identifiers, Privacy protection, Metadata harvesting, Data Dictionary, Consent Management, Identity Management, Anonymisation, Pseudonymisation, Data Annotation.]
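
The idea that persistent identifiers, metadata and privacy protection become part of data generation can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any tool named in the presentation: the function names and record fields are hypothetical, a random UUID stands in for a real persistent identifier such as a DOI, and a keyed hash (HMAC-SHA256) stands in for a full pseudonymisation service.

```python
import hashlib
import hmac
import uuid
from datetime import datetime, timezone


def pseudonymise(patient_id: str, secret_key: bytes) -> str:
    """Return a keyed hash, so the same patient always maps to the same pseudonym."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def generate_record(patient_id, measurement, secret_key, consent_given):
    """Create a record that carries governance information from the start."""
    if not consent_given:
        # Consent is checked before any data are stored (assumed policy)
        raise PermissionError("no consent recorded; record is not created")
    return {
        "record_id": f"urn:uuid:{uuid.uuid4()}",           # stand-in for a persistent identifier
        "subject": pseudonymise(patient_id, secret_key),   # the direct identifier is not stored
        "created": datetime.now(timezone.utc).isoformat(), # minimal administrative metadata
        "consent": True,
        "data": measurement,
    }


# Hypothetical usage with made-up values
example = generate_record(
    "patient-001", {"heart_rate": 72}, b"demo-secret-key", consent_given=True
)
print(example["record_id"], example["subject"][:12])
```

Because the pseudonym is derived with a secret key, records from the same person can still be linked inside the ecosystem, while re-identification requires access to the key.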

Components of the Repository Data Ecosystem
An ecosystem suitable even for Sensitive Data

The Comprehensive Knowledge Archive Network (CKAN) is an open-source
open data portal for the storage and distribution of open data.
It is aimed at data publishers who want to make their data open and
available. The system is used both as a public platform on Datahub and in
various government data catalogues (e.g. the UK's data.gov.uk, the Dutch
National Data Register, the United States government's Data.gov and the
Australian government's Gov 2.0).
https://ckan.org/
What is CKAN?

CKAN
Open-source data portal platform
Developed by the OKFN (Open Knowledge Foundation)
It is a complete out-of-the-box software solution
Tools to streamline publishing, sharing, finding and using data
CKAN includes a web interface and the CKAN Action API
Visualizations for structured data resources (such as CSV files)
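
To make the CKAN Action API more concrete, here is a minimal sketch that builds a `package_search` request URL and reads dataset titles from the JSON envelope that the Action API returns (`success` and `result`). The base URL `https://demo.ckan.org`, the helper names and the sample response are assumptions for illustration; no network call is made.

```python
import json
from urllib.parse import urlencode


def package_search_url(base_url: str, query: str, rows: int = 5) -> str:
    """Build the URL for the Action API's package_search call."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> list:
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("CKAN reported a failed action call")
    return [dataset["title"] for dataset in payload["result"]["results"]]


# Assumed CKAN site; replace with the portal you actually use
url = package_search_url("https://demo.ckan.org", "health")

# Hand-written sample in the shape of a package_search response
sample_response = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "hospital-admissions", "title": "Hospital admissions"},
            {"name": "sensor-readings", "title": "Sensor readings"},
        ],
    },
})
print(url)
print(dataset_titles(sample_response))
```

In practice the URL would be fetched with an HTTP client and the response body passed to `dataset_titles`.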

CKAN and reusability of healthcare data
Catalog and metadata saved in CKAN can be harvested via OAI-PMH
Through the CKAN cloud environment, wearable and stationary sensor data stored in individual CKAN instances can be integrated
Analysis and integration of clinical data of users based on diagnostic data saved on the CKAN-based cloud
Prediction of situations, events, and incidents
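
Harvesting via OAI-PMH can be sketched as follows: a harvester sends a `ListRecords` request with a metadata prefix such as `oai_dc` and reads the Dublin Core fields from the XML response. The endpoint `https://ckan.example.org/oai` is hypothetical (in CKAN, OAI-PMH support is typically provided by an extension, and the path depends on the installation), and the sample response is hand-written for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 protocol and Dublin Core
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(endpoint: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{endpoint}?{query}"


def harvested_titles(xml_text: str) -> list:
    """Collect all dc:title values from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    return [title.text for title in root.findall(".//dc:title", NS)]


# Hypothetical endpoint and hand-written sample response
request_url = list_records_url("https://ckan.example.org/oai")
sample_xml = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Wearable sensor dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Stationary sensor dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""
print(request_url)
print(harvested_titles(sample_xml))
```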

The Hyve is a company that provides professional IT services for
open-source biomedical informatics solutions. Its aim is to enhance the
quality and impact of research by enabling scientists in life sciences and
healthcare research to make proper use of open-source software, open data
and open standards.
https://thehyve.nl/
What is the Hyve?

Tools developed by the Hyve
Portfolio of open-source tools and products that facilitate FAIR research data
FAIR Research Data Management in academic hospitals
Research Data Marts
i2b2 / tranSMART and cBioPortal (for oncology-focused medical centers)
With these tools, a robust research data warehouse can be established, which exposes a unified patient-centric view of clinical and molecular data for research and analysis

tranSMART is an open-source data warehouse designed to store large
amounts of clinical data from clinical trials, as well as data from basic
research. In tranSMART, data can be examined for translational
research purposes. tranSMART is built on top of the i2b2 platform, a
clinical data warehouse employing the i2b2 star model. Each of the data
types (e.g., gene expression, SNP or metabolomics) retains its specific
data structure.
What is tranSMART?
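
The star model mentioned above places one central fact table of observations between several dimension tables. The sketch below is a heavily simplified illustration using SQLite: the table and column names follow i2b2 naming (`observation_fact`, `patient_dimension`, `concept_dimension`), but the real schema has many more columns and tables, and the sample rows and the helper function are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dimension (patient_num INTEGER PRIMARY KEY, sex_cd TEXT);
CREATE TABLE concept_dimension (concept_cd TEXT PRIMARY KEY, name_char TEXT);
-- Central fact table: one row per observation, pointing at the dimensions
CREATE TABLE observation_fact (
    patient_num INTEGER REFERENCES patient_dimension(patient_num),
    concept_cd  TEXT REFERENCES concept_dimension(concept_cd),
    nval_num    REAL
);
""")

# Invented sample data
conn.executemany("INSERT INTO patient_dimension VALUES (?, ?)",
                 [(1, "F"), (2, "M"), (3, "F")])
conn.executemany("INSERT INTO concept_dimension VALUES (?, ?)",
                 [("LAB:HBA1C", "HbA1c"), ("VITAL:SBP", "Systolic blood pressure")])
conn.executemany("INSERT INTO observation_fact VALUES (?, ?, ?)",
                 [(1, "LAB:HBA1C", 6.1), (2, "LAB:HBA1C", 7.4),
                  (3, "VITAL:SBP", 128.0), (1, "VITAL:SBP", 135.0)])


def patients_with_concept(connection, concept_name):
    """Return the patient numbers that have at least one observation of a concept."""
    rows = connection.execute(
        """
        SELECT DISTINCT f.patient_num
        FROM observation_fact AS f
        JOIN concept_dimension AS c ON c.concept_cd = f.concept_cd
        WHERE c.name_char = ?
        ORDER BY f.patient_num
        """,
        (concept_name,),
    )
    return [row[0] for row in rows]


print(patients_with_concept(conn, "HbA1c"))
```

Every query follows the same pattern: filter on a dimension, then join to the fact table, which is what makes cohort selection across many data types uniform.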

tranSMART data warehouse
Designed for use in individual clinical studies with hundreds or thousands of participants, in which perhaps tens of thousands of observations were gathered
tranSMART is also being adopted by hospitals and large population studies
An example of a large population study is the Netherlands Twin Register (NTR)
Adding indexes, creating partitions, adding bit strings, saving subject sets for single and combined queries, splitting a query
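
One generic way to read "bit strings" and "saved subject sets" is that each saved set of subjects is stored as a bit string, so that combined queries reduce to bitwise operations. The sketch below illustrates that general technique only; it is an assumption for explanation and does not claim to reproduce how tranSMART implements it. All names and subject numbers are hypothetical.

```python
def to_bits(subject_ids):
    """Encode a set of non-negative subject numbers as an integer bit string."""
    bits = 0
    for subject_id in subject_ids:
        bits |= 1 << subject_id
    return bits


def from_bits(bits):
    """Decode a bit string back into a sorted list of subject numbers."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Saved subject sets from two single queries (invented data)
saved_sets = {
    "diabetes": to_bits([1, 4, 7, 9]),
    "female": to_bits([2, 4, 9, 11]),
}

# Combined queries become cheap bitwise operations on the saved sets
both = saved_sets["diabetes"] & saved_sets["female"]
either = saved_sets["diabetes"] | saved_sets["female"]

print(from_bits(both))
print(from_bits(either))
```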

Glowing Bear: the new tranSMART UI
Sponsored by Pfizer, Sanofi, AbbVie and Roche
Cross-study and ontology term support
Support for time series and longitudinal data
Possibility of saving queries and re-executing them later
Cohort builder

The Dataverse is an open source web application to share, preserve,
cite, explore and analyze research data. Researchers, authors,
publishers, data distributors, and their institutions receive appropriate
credit via a data citation mechanism including a persistent identifier
(e.g., DOI, or Handle). A Dataverse repository hosts multiple
dataverses; each dataverse contains dataset(s) or other dataverses,
and each dataset contains descriptive metadata together with the data.
https://dataverse.org/
What is Dataverse?

Dataverse
Open-source web application for sharing, citing, analyzing, and preserving research data
Developed by the Data Science team at the Institute for Quantitative Social Science (IQSS) at Harvard University
Dataverse code is open-source and free
Supports DataCite and related standards, such as ORCID identifiers
Creates a Digital Object Identifier (DOI) upon deposit

Dataverse repositories
Harvard Dataverse Network hosts the world's largest collection of
social science research
A Dataverse repository is the software installation, which hosts
multiple dataverses
Each dataverse contains datasets, and each dataset contains
descriptive metadata and data files
Dataverses may contain other dataverses
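
The containment structure described here (a repository hosts dataverses, dataverses contain datasets or other dataverses, datasets contain metadata and files) can be sketched as a small recursive data structure. The class and field names are hypothetical and chosen for this illustration; they are not the Dataverse software's own data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    """A dataset bundles descriptive metadata with its data files."""
    title: str
    metadata: Dict[str, str]
    files: List[str] = field(default_factory=list)


@dataclass
class Dataverse:
    """A dataverse contains datasets and may contain other dataverses."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)
    children: List["Dataverse"] = field(default_factory=list)

    def all_datasets(self) -> List[Dataset]:
        """Collect datasets from this dataverse and all nested dataverses."""
        found = list(self.datasets)
        for child in self.children:
            found.extend(child.all_datasets())
        return found


# Invented example hierarchy
project = Dataverse(
    name="Project X",
    datasets=[Dataset("Trial results", {"author": "Example, A."}, ["results.csv"])],
)
university = Dataverse(
    name="Example University",
    datasets=[Dataset("Survey 2018", {"author": "Example, B."}, ["survey.tab"])],
    children=[project],
)
print([dataset.title for dataset in university.all_datasets()])
```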

Dataverse datasets
A dataset in Dataverse is a container for data, documentation, code,
and the corresponding metadata which describe the dataset
From: http://guides.dataverse.org/en/latest/user/dataset-management.html

Dataverse and Cloud Storage
Dataverse installations can be configured to facilitate cloud-based storage and computing
The default configuration for Dataverse uses a local file system for storing data
A cloud-enabled Dataverse installation can use Swift object storage for its data
This allows users to perform computations on data using an integrated cloud environment

Example: DataverseNL
Service for archiving and publishing research data on several levels (faculties, institutions, research groups, projects) within Dutch universities
Possibility to store and share online a large variety of scientific data,
independent of file format, in a secure way
Not suitable for storing (privacy) sensitive data
PSI (Ψ): A Private data Sharing Interface
Privacy Tools Research Group (Harvard)

figshare for academic institutions
figshare helps academic institutions store, share and manage their
research output
Integrates into institution’s CRIS/RIMS, the institutional repository
and archiving solutions
All research on figshare can be pushed to any institutional repository
Control how content is shared internally and publicly
figshare is hosted on Amazon Web Services but can also integrate
with a centralized cloud

Example: University of Sheffield
Custom portal to manage research data

Example: University of Salford / Manchester
Custom portal of figshare

OSF (Open Science Framework)
Cloud-based management of projects
View all projects from one dashboard.
Quickly share files
Share key project information and allow others to use and cite it.
See project changes
View project analytics
Archive data

Analysis of the repository ecosystem components

Role of Repositories in the Data Ecosystem
A multitude of services and tools support research data repositories
Different types of repositories are connected and supplement each other in the storage, release and sharing of data with different degrees of protection and ownership
Tools to analyze, browse and visualize data should be integrated
Real World Data must be smoothly integrated into the research data cycle
Data governance and data privacy protection play an important role
New and efficient tools for anonymisation and data obfuscation are necessary

Overview
Research Data Sharing and Storage Services
A multitude of services and tools support research data repositories to form
an open data ecosystem
[Figure, modified from: Instituuts Data Management Plannen, Groningen. Services are arranged by phase (during research, after research) and by scope (institution, discipline).]
Services shown: BeeHub, B2SAFE, SurfDrive, Local ICT Services, CLARIN INL, DANS, 4TU.Centre for Research Data, Zenodo, B2SHARE, figshare, SURF, addgene, Brainmap.org, NeuroVault, OpenMRI, MycoBank, Language Archive, DataFirst, Dataverse NL, CancerData, DRYAD, Connectome, SeaDataNet, nesstar, TalkBank, OpenML, OpenClinica, Curate Science, EVIDENCIO, OSF, CRCNS, Dataverse, RUG GeoData

Real world evidence (RWE) in medicine means evidence obtained
from real world data (RWD), which is data obtained outside the
context of randomized controlled trials (RCTs) and generated during
routine clinical practice. Real world data come from Electronic
Health Records (EHR), medical claims or billing databases,
registries, patient-generated data, mobile devices, etc. In addition, they
may be derived from retrospective or prospective observational studies
and observational registries.
The necessity for RWD is based on the fact that clinical trials cannot
account for the entire patient population of a particular disease. Patients
who suffer from comorbidities, live in a particular geographic region,
carry genetic variations or are of advanced age generally do not
participate in clinical trials.
What is Real World Evidence?

The management of human health and diseases, including
policy and decision making and the development of efficient
healthcare systems, demands support by efficient and
rigorous evidence-based investigation and evaluation of
research results. Data are therefore central to further
improvements in public health, primary and hospital care, and
especially to the advancement of personalized medicine.
Relevant data should be collected as part of usual
healthcare, from routine administrative sources and from research
studies. Data governance and data privacy protection
should begin as early as possible, ideally during data
generation.
Rigorous evidence-based investigation

Dealing with Real World Clinical Data
Real World Clinical Data play an important role for research
For Patient Reported Outcomes and for Sensor Data: the open-source RADAR stack (RADAR-CNS project)
RADAR-base Management Portal is a one-stop shop for managing remote patient monitoring studies
The RADAR Android apps exchange data directly with patients and care providers
Kafka-based stack: message transport system
European health data networks
Observational Health Data Sciences and Informatics (OHDSI), Observational Medical Outcomes Partnership (OMOP)

Results of Ecosystem Analysis
It doesn't matter where one stores data
Everything is connected
Institutional repositories (dataverses), data marts, general
repositories, domain specific repositories, figshare for data sharing
An ecosystem for open data management
Covers complete data life cycle
Complete projects are supported
FAIR data as basis
tranSMART as integration hub for analysis
Integration of data governance and privacy protection at the stage of
data generation
But can sensitive data really be integrated?
Not yet convincingly shown!

Contact
Wolfgang Kuchinke
Heinrich-Heine University Düsseldorf, Düsseldorf,
Germany
wolfgang.kuchinke@uni-duesseldorf.de
Presentation contains additional material for explanation and workshop.
