Analyze Genomes:
A Federated In-Memory Computing Platform for Life Sciences
Dr. Matthieu-P. Schapranow
Festival of Genomics, London, U.K.
Jan 19, 2016
Schapranow, Festival of
Genomics, Jan 19, 2016
Recap: we.analyzegenomes.com
Real-time Analysis of Big Medical Data
In-Memory Database
■  Extensions for Life Sciences
□  Data Exchange, App Store
□  Access Control, Data Protection
□  Fair Use
□  Statistical Tools
□  Real-time Analysis
□  App-spanning User Profiles
■  Combined and Linked Data
□  Genome Data
□  Cellular Pathways
□  Genome Metadata
□  Research Publications
□  Pipeline and Analysis Models
□  Drugs and Interactions
Challenges of Big
Medical Data
Drug Response Analysis
Pathway Topology Analysis
Medical Knowledge Cockpit
Oncolyzer
Clinical Trial Recruitment
Cohort Analysis
...
Indexed Sources
Our Technology
In-Memory Database Technology
Combined column and row store
Map/Reduce
Single and multi-tenancy
Lightweight compression
Insert only for time travel
Real-time replication
Working on integers
SQL interface on columns and rows
Active/passive data store
Minimal projections
Group key
Reduction of software layers
Dynamic multi-threading
Bulk load of data
Object-relational mapping
Text retrieval and extraction engine
No aggregate tables
Data partitioning
Any attribute as index
No disk
On-the-fly extensibility
Analytics on historical data
Multi-core/parallelization
In-Memory Data Management
Overview
Analyze Genomes:
Real-world Examples
■  Advances in Hardware
□  64-bit address space: 4 TB in current server boards
□  4 MB/ms/core data throughput
□  Cost-performance ratio rapidly declining
□  Multi-core architecture (6 x 12 core CPU per blade)
□  Parallel scaling across blades
□  1 blade ≈ 50k USD = 1 enterprise class server
■  Advances in Software
□  Row and Column Store
□  Compression
□  Partitioning
□  Insert Only
□  Parallelization
□  Active & Passive Data Stores
Lightweight Compression
■  Main memory access is the new bottleneck
■  Lightweight compression reduces this bottleneck:
□  It is lossless
□  It improves usage of data bus capacity
□  Queries can work directly on compressed data
Table (excerpt; only RecId is labeled on the slide)
RecId 1  C18.0  Colon   091487
RecId 2  C32.0  Larynx  357982
RecId 3  C00.9  Lip     123489
RecId 4  C18.0  Colon   998711
RecId 5  C20.0  Rectum  215678
RecId 6  C20.0  Rectum  647912
RecId 7  C50.9  Mamma   167898
RecId 8  C18.0  Colon   646470
…

Attribute Vector
RecId  ValueId
1      C18.0
2      C32.0
3      C00.9
4      C18.0
5      C20.0
6      C20.0
7      C50.9
8      C18.0

Data Dictionary
ValueId  Value
1        Larynx
2        Lip
3        Rectum
4        Colon
5        Mamma

Inverted Index
ValueId  RecIdList
1        2
2        3
3        5,6
4        1,4,8
5        7
•  Typical compression factor of 10:1 for enterprise software
•  In financial applications up to 50:1
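The dictionary encoding shown on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the platform's implementation: the function name `dictionary_encode` is hypothetical, and the sketch assigns ValueIds in sorted order, so the ids differ from the numbering used on the slide.

```python
from collections import defaultdict


def dictionary_encode(column):
    """Return (dictionary, attribute_vector, inverted_index) for a column.

    Assumption for illustration: ValueIds start at 1 and follow the
    sorted order of the distinct values.
    """
    dictionary = {value: vid for vid, value in enumerate(sorted(set(column)), start=1)}
    # The attribute vector stores one small integer per record instead of the value.
    attribute_vector = [dictionary[value] for value in column]
    # The inverted index maps each ValueId to the records (1-based) that contain it.
    inverted_index = defaultdict(list)
    for rec_id, vid in enumerate(attribute_vector, start=1):
        inverted_index[vid].append(rec_id)
    return dictionary, attribute_vector, dict(inverted_index)


# The code column from the slide, RecId 1 to 8.
codes = ["C18.0", "C32.0", "C00.9", "C18.0", "C20.0", "C20.0", "C50.9", "C18.0"]
dictionary, vector, index = dictionary_encode(codes)

# Counting records with code C18.0 works on the compressed form only.
colon_count = len(index[dictionary["C18.0"]])
```

The count is answered from the dictionary and the inverted index alone, which is one way to read "work directly on compressed data".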
Combined Column and Row Store
■  Row stores are designed for operative workloads, e.g.
□  Create and maintain metadata for tests
□  Access a complete record of a trial or test series
■  Column stores are designed for analytical work, e.g.
□  Evaluate the number of positive test results
□  Identify correlations or test candidates
■  In-memory approach: combination of both stores
□  Increased performance for analytical work
□  Operative performance remains interactive
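The difference between the two layouts can be illustrated with plain Python containers. This is a sketch under assumptions: the sample records, the `positive` attribute, and both function names are made up for illustration and do not come from the platform.

```python
# Row layout: one tuple per record, convenient for reading a complete record.
rows = [
    (1, "C18.0", True),
    (2, "C32.0", False),
    (3, "C00.9", True),
    (4, "C18.0", True),
]

# Column layout: one list per attribute, convenient for scanning one attribute.
columns = {
    "rec_id": [r[0] for r in rows],
    "code": [r[1] for r in rows],
    "positive": [r[2] for r in rows],
}


def fetch_record(rec_id):
    """Operative access: return the complete record with the given id."""
    return next(r for r in rows if r[0] == rec_id)


def count_positive():
    """Analytical access: touch only the 'positive' column."""
    return sum(columns["positive"])


record = fetch_record(3)
positives = count_positive()
```

The analytical query never reads the `code` values, which is why a column layout suits scans over a single attribute.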
Insert-Only / Append-Only
■  Traditional databases allow four data operations:
□  INSERT, SELECT and
□  DELETE, UPDATE
■  Insert-only requires only INSERT and SELECT to maintain a complete history (bookkeeping systems)
■  Insert-only enables time travel, e.g. to
□  Trace changes and reconstruct decisions
□  Document the complete history of changes, therapies, etc.
□  Enable statistical observations
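A minimal sketch of the insert-only idea, assuming integer timestamps (for example day numbers) and a hypothetical `InsertOnlyTable` class: a change is stored as a new version, and a read names the point in time it wants to see.

```python
class InsertOnlyTable:
    """Append-only store: a change adds a new version, nothing is overwritten."""

    def __init__(self):
        self._versions = []  # (valid_from, key, value) in insertion order

    def insert(self, valid_from, key, value):
        self._versions.append((valid_from, key, value))

    def select(self, key, as_of):
        """Return the value of `key` as it was at time `as_of`, or None."""
        best_time, result = None, None
        for valid_from, k, value in self._versions:
            if k != key or valid_from > as_of:
                continue
            # Later timestamps win; on equal timestamps the later insert wins.
            if best_time is None or valid_from >= best_time:
                best_time, result = valid_from, value
        return result


table = InsertOnlyTable()
table.insert(1, "patient-1", "therapy A")
table.insert(5, "patient-1", "therapy B")  # replaces an UPDATE

before_change = table.select("patient-1", as_of=3)
after_change = table.select("patient-1", as_of=7)
```

Both versions stay available, so the decision made on day 3 can still be reconstructed after the change on day 5.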
Partitioning
■  Horizontal Partitioning
□  Cut long tables into shorter segments
□  E.g. to group samples with same relevance
■  Vertical Partitioning
□  Split off columns to individual resources
□  E.g. to separate personalized data from experiment data
■  Partitioning is the basis for
□  Parallel execution of database queries
□  Implementation of data aging and data retention management
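Both kinds of partitioning can be sketched on a small in-memory table. The function names, the `id` link column, and the sample rows are assumptions for illustration only.

```python
def horizontal_partition(rows, size):
    """Cut a long table into segments of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def vertical_partition(rows, personal_keys):
    """Split each row into a personal part and an experiment part.

    Both parts keep the 'id' column so they can be joined again.
    """
    personal = [{k: v for k, v in r.items() if k in personal_keys or k == "id"} for r in rows]
    experiment = [{k: v for k, v in r.items() if k not in personal_keys} for r in rows]
    return personal, experiment


samples = [
    {"id": 1, "name": "Patient A", "result": 0.7},
    {"id": 2, "name": "Patient B", "result": 0.2},
    {"id": 3, "name": "Patient C", "result": 0.9},
]

segments = horizontal_partition(samples, size=2)
personal, experiment = vertical_partition(samples, personal_keys={"name"})
```

Each horizontal segment can be scanned by a separate worker, while the vertical split lets the personal part be stored and protected separately from the experiment data.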
Multi-core and Parallelization
■  Modern server systems consist of x CPUs, e.g. 6
■  Each CPU consists of y CPU cores, e.g. 12
■  Consider each of the x*y CPU cores as an individual worker, e.g. 6x12 = 72
■  Each worker can perform one task at a time, and all workers run in parallel
■  A full table scan of a database table with 1M entries takes, in the ideal case, 1/(x*y) of the sequential search time when the table is traversed in parallel
□  Reduced response time
□  No need for pre-aggregated totals and redundant data
□  Improved usage of hardware
□  Instant analysis of data
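The scan described on this slide follows a split, scan, and merge pattern, sketched below. Two assumptions apply: the function names are hypothetical, and the sketch uses a thread pool so that it runs anywhere, although CPython threads do not give real CPU parallelism for this kind of work. A database engine would use native threads or processes on the x*y cores.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk, predicate):
    """Count the matching entries in one chunk of the table."""
    return sum(1 for value in chunk if predicate(value))


def parallel_count(values, predicate, workers):
    """Split the table into one chunk per worker, scan the chunks, merge the counts."""
    size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda chunk: scan_chunk(chunk, predicate), chunks))


table = list(range(1_000_000))  # 1M entries, as on the slide
matches = parallel_count(table, lambda v: v % 7 == 0, workers=6 * 12)
```

With 6 x 12 = 72 workers each chunk holds about 13,889 entries, so the ideal scan time is 1/72 of the sequential time, before the cost of splitting and merging is counted.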
Active and Passive Data Store
■  Consider two categories of data stores: active and passive
■  Active data are accessed frequently and updates are expected, e.g.
□  Most recent experiment results, e.g. last two weeks
□  Samples that have not been processed yet
■  Passive data are used for analytical and statistical purposes, e.g.
□  Samples that were processed 5 years ago
□  Metadata about seeds that are no longer produced
■  Moving passive data to slower storage
□  Reduces main memory demands
□  Improves performance for active data
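A data aging rule of this kind can be sketched as a simple classification step. The rule below is an assumption derived from the slide's examples: a sample is active if it is at most two weeks old or has not been processed yet. The field names and the integer day numbers are hypothetical.

```python
def split_active_passive(samples, today, active_days=14):
    """Partition samples into active (recent or unprocessed) and passive ones."""
    active, passive = [], []
    for sample in samples:
        recent = today - sample["day"] <= active_days
        if recent or not sample["processed"]:
            active.append(sample)  # stays in main memory
        else:
            passive.append(sample)  # candidate for slower storage
    return active, passive


samples = [
    {"id": 1, "day": 100, "processed": True},   # processed five days ago
    {"id": 2, "day": 10, "processed": True},    # processed long ago
    {"id": 3, "day": 10, "processed": False},   # old but still unprocessed
]

active, passive = split_active_passive(samples, today=105)
```

Only the passive list would be moved to slower storage, which reduces the main memory needed for the active part.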
Reduction of Application Layers
■  Layers are introduced to abstract from complexity
■  Each layer offers complete functionality, e.g. metadata of samples
■  Fewer layers result in
□  Less code to maintain
□  More specific code
□  Reduced resource demands
□  Improved application performance, because obsolete processing steps are eliminated
In-Memory Database Technology
Hardware Characteristics at HPI FSOC Lab
■  1,000 core cluster at Hasso Plattner Institute with 25 TB main memory
■  25 nodes, each consisting of:
□  40 cores
□  1 TB main memory
□  Intel® Xeon® E7-4870
□  2.40 GHz
□  30 MB cache
In-Memory Database Technology
Use Case: Analysis of Genomic Data
Analysis of Genomic Data   Alignment and Variant Calling   Analysis of Annotations in Worldwide DBs
Bound To                   CPU Performance                 Memory Capacity
Duration                   Hours to days                   Weeks
HPI                        Minutes                         Real-time
In-Memory Technology       Multi-Core                      Partitioning & Compression
Federated In-Memory Database (FIMDB)
Incorporating Local Compute Resources
Figure labels:
■  Site A: FIMDB A.1, A.2, A.3, A.4, A.5
■  Site B: FIMDB B.1, B.2, B.3
■  Cloud Service Provider: FIMDB C.1
■  Federated In-Memory Database Instances
■  Federated In-Memory Database Instance, Algorithms, and Applications Managed by Service Provider
■  Master Data Managed by Service Provider
■  Sensitive Data reside at Site
Multiple Sites Forming the
Federated In-Memory Database System
Federated In-Memory Database System
■  Cloud Provider: Cloud IMDB Instance; Master Data and Shared Algorithms
■  Site A: Local IMDB Instance; Sensitive Data, e.g. Patient Data
■  Site B: Local IMDB Instance; Sensitive Data, e.g. Patient Data
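One way to read the federated setup is that a shared algorithm runs at each site and only its aggregate result leaves the site. The sketch below rests on that assumption, which the slides suggest but do not spell out. The `Site` class, `federated_run`, `count_variants`, and the sample records are all hypothetical.

```python
from collections import Counter


def count_variants(records):
    """Shared algorithm: runs inside a site and returns aggregate counts only."""
    return Counter(r["variant"] for r in records)


class Site:
    """A site with a local database instance holding sensitive records."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # sensitive data, never handed to callers

    def run(self, algorithm):
        # The algorithm comes to the data; only its result is returned.
        return algorithm(self._records)


def federated_run(sites, algorithm):
    """Run the shared algorithm at every site and merge the aggregates."""
    total = Counter()
    for site in sites:
        total.update(site.run(algorithm))
    return total


site_a = Site("Site A", [{"patient": "a1", "variant": "V1"}, {"patient": "a2", "variant": "V2"}])
site_b = Site("Site B", [{"patient": "b1", "variant": "V1"}])

totals = federated_run([site_a, site_b], count_variants)
```

The merged result contains variant counts but no patient identifiers, so the sensitive records stay where they were collected.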
Software Architecture
■  Front-End
□  Any user interface supported, e.g. UI5, JavaScript, iOS
□  Communication protocol: Ajax/JSON
■  API Server
□  Ruby on Rails
□  Translates into database queries
■  Platform
□  IMDB with extensions for Life Sciences, e.g. worker framework, scheduler, third-party tools, etc.
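The API server's role, turning a JSON request from the front-end into a database query, can be sketched as follows. This is not the platform's code: the slides name Ruby on Rails and an in-memory database, while the sketch uses Python and `sqlite3` as stand-ins. The request format, the `samples` table, and `handle_request` are assumptions.

```python
import json
import sqlite3

# Only known column names may be placed into the SQL text.
ALLOWED_COLUMNS = {"code", "site"}


def handle_request(conn, body):
    """Translate a JSON filter request into a parameterized SQL query."""
    request = json.loads(body)
    column = request["filter"]["column"]
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    sql = f"SELECT COUNT(*) FROM samples WHERE {column} = ?"
    # The filter value is passed as a bound parameter, never concatenated.
    (count,) = conn.execute(sql, (request["filter"]["value"],)).fetchone()
    return json.dumps({"count": count})


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (code TEXT, site TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("C18.0", "Colon"), ("C32.0", "Larynx"), ("C18.0", "Colon")],
)

response = handle_request(conn, '{"filter": {"column": "code", "value": "C18.0"}}')
```

The front-end only sees JSON in both directions, so any of the listed user interfaces can use the same endpoint.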
Figure labels: High-Performance In-Memory Genome Cloud Platform
■  Users: Researcher with Web Browser, Clinician with Mobile Device
■  Application Layer: Specific Cloud Application, Specific Cloud Applications, Specific Mobile Application
■  Platform Layer: Analytical Data Processing, Graph Processing, Scheduling and Worker Framework, Annotation and Updater Framework
■  Data Layer: In-Memory Database Instances
□  Master Data, e.g. Reference Genomes, Annotations, and Clinical Trials
□  Transactional Data, e.g. Genome Reads, Patient EMR Data, and Worker Status
■  Further labels: Internet, Intranet, Accounting, Security, Extensions
Keep in contact with us!
Dr. Matthieu-P. Schapranow
Program Manager E-Health
Hasso Plattner Institute
Enterprise Platform & Integration Concepts (EPIC)
August-Bebel-Str. 88
14482 Potsdam, Germany
schapranow@hpi.de
http://we.analyzegenomes.com/
