4. Architecture overview (WHAT, WHERE, HOW, WHY)
Data provision Layer: Extract, Transform, Load (ETL), Anonymization & pre-processing of existing resources
• Imaging, Video
• Streaming Data
• Un/Semi/Structured Biomedical Data
• Legacy Data
• Simulation Models
• Digital Libraries (PubMed etc)
• Ontologies (UMLS, GO..)
• Clinician knowledge
Federated data Layer & (open) research data infrastructures: Semantic Data modelling, Provenance & Integration
• Multi-modal, vertically integrated, distributed biomedical data
• Biomedical Info
• Registries & Metadata
• Simulation Models
• KDD Results
Data Infrastructures
• ESFRI Infrastructures
• ICOS, EMSO, …
• E-Infrastructures
• OpenAIRE
Middleware Layer: Distributed execution of complex dataflows & distributed querying Engine
• Upper level declarative language and extensible UDFs
• Distributed execution on clouds and ad-hoc clusters
• Distributed Query Engine
• Ontology Based Data Access
Data Processing
• Distribution, Federation, Parallelism
• EXAREME
Application Layer: Data (pre)processing and knowledge discovery platform
• MADRefine module: Data Preprocessing & Transformation, Curation & Validation
• AITION clustering & general KDD: SoA Machine Learning Algorithms, Latent Variable & Topic Modelling
• AITION simulation: Graphical Probabilistic modelling for Statistical simulation
Data Analytics
• Cleaning & curation
• MADRefine
• Modeling, Mining
• AITION
5. OpenAIRE HUB
CERN, zenodo
Visualize - Manage
Enhanced Publications
Get support (NOADs)
Linked Content
Statistics
+++
Search & Browse
Curate & collaborate
Deposit publications & data
Research impact: citations, usage statistics, +++
Link, Classify, De-duplicate, Cite, Text Mine
APIs
Publication repositories (Institutional & Thematic), Open Access Journals
• 17,500,000 OA publications
• 700+ validated repositories
• accessing >5K repos/OA journals
Data repositories, Data Journals
ResearchID (ORCID, ..)
OpenDOAR
…
CRIS Systems
National funding
EC funding
Usage data
Metadata on publications
Metadata on data
Guidelines for Data Providers & Open Data Pilot
Guidelines for Funding Info
Guidelines for Publications
7. ICOS: Integrated Carbon Observation System
Harmonized and High Precision Scientific Data on
Carbon Cycle And Greenhouse Gas Budget and
Perturbations
EMSO: European Multi-disciplinary Seafloor and
water-column Observatory
Ocean observation systems for long-term, high-resolution, (near) real-time monitoring of environmental processes including natural hazards, climate change, and marine ecosystems
8. SIOS: Svalbard Integrated Earth Observing
System
Arctic environmental and climate-related challenges
EURO-ARGO: European contribution to ARGO
Ocean observation for oceanography and climate
IAGOS: In-service Aircraft for a Global
Observing System
Atmospheric composition, aerosol and cloud particles
9. EISCAT_3D: European Incoherent Scatter
Radar systems for the upper atmosphere, the
ionosphere and the Aurora Borealis
EUFAR-COPAL: European Facility for
Airborne Research
Airborne research for the environmental and geosciences in Europe
10. ACTRIS: Aerosols, Clouds and Trace gases RI
Supports models and forecast systems by offering high-quality data on atmospheric gases, clouds, and trace gases
DANUBIUS-RI: Int’l Center for Advanced
Studies on River-Sea Systems
Addressing conflicts between society’s demands,
environmental change and environmental
protection in river–sea systems worldwide.
11. Architecture overview (diagram repeated from slide 4; the ESFRI infrastructure example shown here is ELIXIR)
13. Parallel / distributed execution of complex data flows targeting data analysis and mining
Data remain at source (hospital) – dataflow / query travels
Privacy preserving: transmit only aggregated information from hospital (sufficient statistics)
Advanced data compression, on the data partitioning
Query Language: SQL + UDFs (in Python)
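As a rough illustration of "SQL + UDFs (in Python)", the sketch below registers a Python function as a SQL function and returns only aggregates. It uses Python's built-in sqlite3 module as a stand-in and is not EXAREME's API; the `patients` table, its columns, and the `bmi` UDF are hypothetical.

```python
import sqlite3


def bmi(weight_kg, height_m):
    """Hypothetical UDF: body mass index, or NULL when inputs are missing."""
    if weight_kg is None or not height_m:
        return None
    return weight_kg / (height_m * height_m)


def aggregate_bmi():
    """Run a SQL query that calls the Python UDF and returns only aggregates."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function so SQL can call it as bmi(a, b).
    conn.create_function("bmi", 2, bmi)
    conn.execute(
        "CREATE TABLE patients (id INTEGER, weight_kg REAL, height_m REAL)"
    )
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?, ?)",
        [(1, 80.0, 2.0), (2, 45.0, 1.5), (3, 60.0, None)],
    )
    # Only the count and the average leave the "site", never the raw rows.
    row = conn.execute(
        "SELECT COUNT(bmi(weight_kg, height_m)), AVG(bmi(weight_kg, height_m)) "
        "FROM patients"
    ).fetchone()
    conn.close()
    return row
```

Calling `aggregate_bmi()` returns a `(count, average)` pair; the row with a missing height is skipped because the UDF returns NULL for it.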
14. Query Federation
Decompose query into local and global parts
• Query: “count, avg, std” (result columns: m-name, N, avg, std)
• L (local part): “Σx, Σx², N”
• G (global part): “N, avg, std”
Sites 1 … N each hold a table (id, m-name, m-value)
Run local queries at each site → partial aggregated results (m-name, Σx, Σx², N)
Run global queries over the partial aggregated results → N, avg, std
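The local/global decomposition can be sketched in a few lines of Python. This is a minimal sketch, not EXAREME code: the function names are hypothetical, rows are assumed to be (id, m-name, m-value) tuples, and the standard deviation is assumed to be the population form, since the slide does not say which one is meant.

```python
import math
from collections import defaultdict


def local_query(rows):
    """Runs at one site: reduce raw rows to (sum x, sum x^2, N) per m-name."""
    partial = defaultdict(lambda: [0.0, 0.0, 0])
    for _id, m_name, m_value in rows:
        acc = partial[m_name]
        acc[0] += m_value
        acc[1] += m_value * m_value
        acc[2] += 1
    return {name: tuple(acc) for name, acc in partial.items()}


def global_query(partials):
    """Runs centrally: merge the partial aggregates into (N, avg, std)."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for partial in partials:
        for name, (sx, sx2, n) in partial.items():
            total = totals[name]
            total[0] += sx
            total[1] += sx2
            total[2] += n
    result = {}
    for name, (sx, sx2, n) in totals.items():
        avg = sx / n
        # Population variance (assumption); clamp tiny negative rounding errors.
        variance = max(sx2 / n - avg * avg, 0.0)
        result[name] = (n, avg, math.sqrt(variance))
    return result


site_1 = [(1, "bp", 2.0), (2, "bp", 4.0)]
site_2 = [(3, "bp", 4.0), (4, "bp", 4.0), (5, "bp", 5.0),
          (6, "bp", 5.0), (7, "bp", 7.0), (8, "bp", 9.0)]
stats = global_query([local_query(site_1), local_query(site_2)])
```

Only the three sufficient statistics per measurement name cross the site boundary; the individual `m-value` entries never leave the site.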
15. • Distributed elastic execution
– Parallel aggregations, unions, and joins
– Resources are reserved dynamically
• Iterative dataflow execution
– Support machine learning algorithms
• Novel query optimization techniques
– SQL with User Defined Functions
– Arbitrary user code with unknown properties
– Privacy-aware query optimization
16. • Time and money
• 2-dimensional optimization
Quantum: 1 hour
• Simple map-reduce flow
– A: 1 hour, B: 10 minutes, C: 1 hour
Schedule               Time (hours)  Money (resource hours)  Winner
One host for all ops   18.60         19                      5x cheaper
Different host per op  2.16          102                     9x faster
17. • Optimal dataflow scheduling
• Skyline of all Pareto optimal plans (plot axes: Time vs. Money)
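A skyline over (time, money) can be computed with a sort and a single pass. The sketch below is a generic Pareto-front routine, not the scheduler described in the slides; the first two plans reuse the figures from slide 16, and the third plan is a made-up example of a dominated schedule.

```python
def skyline(plans):
    """Return the plans that no other plan beats on both time and money.

    Each plan is a (name, time, money) tuple; lower is better on both axes.
    """
    # Sort by time, then money, so any plan that could dominate comes first.
    ordered = sorted(plans, key=lambda plan: (plan[1], plan[2]))
    front = []
    best_money = float("inf")
    for name, time, money in ordered:
        # A plan survives only if it is strictly cheaper than every faster plan.
        if money < best_money:
            front.append((name, time, money))
            best_money = money
    return front


plans = [
    ("one host for all ops", 18.60, 19),
    ("different host per op", 2.16, 102),
    ("hypothetical plan", 10.0, 120),  # slower AND more expensive than plan 2
]
pareto = skyline(plans)
```

Here `pareto` keeps the two schedules from the table and drops the hypothetical one, because the per-operator schedule is both faster and cheaper than it.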
18. Architecture overview (diagram repeated from slide 4; the Middleware Layer is labelled EXAREME and the ESFRI infrastructure example is ELIXIR)
19. Workflow diagram
Phases: Model & Verification, Knowledge Discovery, Reasoning & decision support
BOTTOM-UP / TOP-DOWN
Raw data from biomarker-based personalized acquisition
Data Preprocessing, Curation & Validation
Transformed & Validated Data
Data Mining
Disease signatures
Patient grouping & similarity
Variable dependencies & causality
Simulation Models
Create Statistical Simulation Models
Domain knowledge & assumptions
Clinical workflows
Personalized Model Guided Medicine
For a particular patient
Unknown / missing data
Predict value of missing variable
Individualized diagnosis, prognosis & treatment plan
Big Data Analytics
• Capture: multi source, multi modal, multi system
Management
• Data provenance
• Sanitization (Anonymization)
• Process: aggregate, distributed
Analysis
• Privacy preserving: Algorithms, Mechanisms
Modeling
• Personalized
• De-identified
Practice
• Ethics
• Privacy
20. Variables (diagram)
Roles: Predictors, Medication, Outcome
Categories: demographics, imaging, genetics, clinical, lab, other
SEX, AgeOnSet, ILAR
JntActDis, GlbActDis, DisDur, JntLOM, GenEval, CHAQ
ESR, CRP, ANA
MEFN, IL2RA, Poznanski
Synovial volume
NSAID, STEROID, DMARD, BIOLOGIC
JADI
JntLOMDiff, CHAQDiff, ESRDiff, CRPDiff, JntActDisDiff, GlbActDisDiff, GenEvalDiff
BOXValidatedOut, Adapted Sharp/van der Heijde Score Out, JADIOut, Extended BOX
21. Workflow diagram (repeated from slide 19)
22. Extensible validation and data transformation engine
Interactive and efficient web-based interface
Data cleaning:
◦ Typographical error detection (numeric & alphanumeric)
◦ Data cleaning rules (functional dependencies, conditional functional dependencies, denial constraints)
◦ New/derived columns (discretization, computation of medical scores)
◦ Data visualisation (bar charts, pie charts, scatter plots, line charts, etc.)
End-to-end data analysis workflow support (rerun experiments, reproduce results)
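To make the "functional dependencies" cleaning rule concrete, the sketch below flags records that violate a dependency lhs → rhs. It is an illustration only and not the engine's interface; the function name, the record layout, and the `zip`/`city` columns are hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Find values of `lhs` that map to more than one value of `rhs`.

    rows: list of dicts; lhs, rhs: lists of column names.
    Returns {lhs_value: set_of_conflicting_rhs_values}.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[column] for column in lhs)
        seen[key].add(tuple(row[column] for column in rhs))
    # The dependency holds for a key only if it determines exactly one value.
    return {key: values for key, values in seen.items() if len(values) > 1}


records = [
    {"id": 1, "zip": "10431", "city": "Athens"},
    {"id": 2, "zip": "10431", "city": "Athens"},
    {"id": 3, "zip": "10431", "city": "Atens"},   # likely typo
    {"id": 4, "zip": "54124", "city": "Thessaloniki"},
]
conflicts = fd_violations(records, lhs=["zip"], rhs=["city"])
```

In this example `conflicts` has a single entry for zip `10431`, listing both spellings of the city, which is the kind of candidate error a curator would then review.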
23.
24. Workflow diagram (repeated from slide 19)
25. Disease signatures: Latent factors (patterns) that characterize disease
◦ Distribution of most relevant variables for disease (e.g., biomarkers)
◦ Multiple variables per signature, multiple signatures per disease
Patient Cluster: Homogeneous patient group with common characteristics
Patient Similarity: Patients “like” me or mine (patient or clinician role)
◦ “like” = according to different criteria (e.g., allocation on disease signatures)
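One way to read "like = allocation on disease signatures" is to compare patients by how their weight is spread across signatures. The sketch below assumes each patient is a vector of signature weights and uses cosine similarity; both the representation and the metric are assumptions made for illustration, and the slides do not state which criteria are actually used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two signature-allocation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, patients, k=1):
    """Return the ids of the k patients most similar to `target`."""
    scored = [
        (cosine_similarity(patients[target], vector), patient_id)
        for patient_id, vector in patients.items()
        if patient_id != target
    ]
    scored.sort(reverse=True)
    return [patient_id for _score, patient_id in scored[:k]]


# Hypothetical allocations over three disease signatures.
patients = {
    "p1": [0.8, 0.1, 0.1],
    "p2": [0.7, 0.2, 0.1],
    "p3": [0.1, 0.1, 0.8],
    "p4": [0.1, 0.8, 0.1],
}
neighbours = most_similar("p1", patients, k=1)
```

With these made-up vectors, the nearest neighbour of `p1` is `p2`, since both are dominated by the first signature.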
28. Workflow diagram (repeated from slide 19)
29. Bayesian Net: Directed Acyclic Graph + Conditional Prob Distributions
◦ Features (Nodes) & Dependencies (Edges)
◦ Compact representation of joint data distribution
Given: patient data + domain knowledge
Patient  X1  X2  X3  X4  X5  X6  X7  X8
1        Y   N   N   Y   Y   Y   N   Y
:
1000     N   N   Y   N   N   Y   N   N
Find: the network over X1 … X8 (node labels shown in the figure: Smoking, Lung cancer, Chronic bronchitis, Genetic Factor, Allergy)
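The "compact representation of the joint distribution" means the joint probability factorizes into one conditional distribution per node given its parents. The sketch below shows this on a three-node toy network; the structure (smoking as the parent of the two diseases) and every probability in it are invented for illustration and are not taken from the slide's figure or data.

```python
from itertools import product

# Hypothetical structure: each node lists its parents.
parents = {
    "smoking": (),
    "lung_cancer": ("smoking",),
    "bronchitis": ("smoking",),
}
# Hypothetical conditional tables: P(node = True | parent values).
cpt = {
    "smoking": {(): 0.3},
    "lung_cancer": {(True,): 0.1, (False,): 0.01},
    "bronchitis": {(True,): 0.4, (False,): 0.1},
}


def joint(assignment):
    """P(assignment) as the product of each node's conditional probability."""
    probability = 1.0
    for node, node_parents in parents.items():
        p_true = cpt[node][tuple(assignment[p] for p in node_parents)]
        probability *= p_true if assignment[node] else 1.0 - p_true
    return probability


def posterior(query, evidence):
    """P(query = True | evidence), by summing the joint over all assignments."""
    nodes = list(parents)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(nodes)):
        assignment = dict(zip(nodes, values))
        if any(assignment[name] != value for name, value in evidence.items()):
            continue
        p = joint(assignment)
        denominator += p
        if assignment[query]:
            numerator += p
    return numerator / denominator


p_smoker_given_cancer = posterior("smoking", {"lung_cancer": True})
```

Brute-force enumeration is only practical for a handful of variables; it is used here because it makes the factorization explicit.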
31. Workflow diagram (repeated from slide 19)
32.
33. Increased RVD is associated with worse values in every MR aspect (TVPRegurg, PSMotion, RedRV, AV_Block, TriRegurg)
37. Obtaining consent is not straightforward
Anonymisation: necessary but rather complicated; guarantees neither privacy nor data value
“Blending in a crowd” and k-anonymity: privacy is a property of the sanitization, not of its output
How do we define privacy?
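For reference, k-anonymity requires every combination of quasi-identifier values in a released table to be shared by at least k records. A minimal check, assuming records are dictionaries and with hypothetical column names, could look like this:

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return all(count >= k for count in groups.values())


# Hypothetical, already generalized release (age ranges, truncated zip codes).
release = [
    {"age": "30-39", "zip": "104**", "diagnosis": "A"},
    {"age": "30-39", "zip": "104**", "diagnosis": "B"},
    {"age": "40-49", "zip": "541**", "diagnosis": "A"},
    {"age": "40-49", "zip": "541**", "diagnosis": "A"},
]
ok_for_2 = is_k_anonymous(release, ["age", "zip"], k=2)
ok_for_3 = is_k_anonymous(release, ["age", "zip"], k=3)
```

Note that the second group passes the check for k = 2 even though both of its records carry the same diagnosis, which illustrates why the slide treats k-anonymity alone as insufficient.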
38. data publishing: “Sanitization” (anonymisation) hiding individual info (k-anonymity) but preserving (sufficient) aggregated statistics
data mining: Specific algorithms (usually operating in two phases) for classification, clustering, association rules, …
mechanisms: Differential Privacy & Crowd-Blending Privacy perturb data or add noise, ensuring an ε-indistinguishable output distribution
encryption: Fully Homomorphic Encryption (FHE) so that computations and queries run over encrypted data
decentralization: Blockchain to Protect Personal Data - decentralized personal data management, users own and control their data
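The "add noise" mechanism can be illustrated with the standard Laplace mechanism for a counting query, which has sensitivity 1 and therefore needs noise of scale 1/ε. This is a textbook sketch rather than anything described in the slides; the function name and the sample data are hypothetical.

```python
import random


def private_count(values, predicate, epsilon, rng=None):
    """Count matching values, plus Laplace noise calibrated for epsilon-DP.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    true_count = sum(1 for value in values if predicate(value))
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


ages = list(range(100))            # hypothetical cohort
noisy = private_count(ages, lambda age: age < 40, epsilon=1.0)
```

A smaller `epsilon` gives stronger privacy and noisier answers; each released answer also consumes part of the overall privacy budget.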
39. Big data is not only about size
Data is distributed, data is heterogeneous
Processing goes to data, not data to processing
ICT (Data management & processing) advances
◦ Data compression
◦ Federated / privacy-preserving processing
◦ Scalable parallel / distributed processing
◦ Data curation (otherwise: garbage in, garbage out)
◦ Text and data analytics