This deck accompanies the talk "Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences", given at the 2016 Festival of Genomics workshop "Big Medical Data in Precision Medicine: Challenges or Opportunities?" on Jan 19, 2016 in London.
Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences
1. Analyze Genomes:
A Federated In-Memory Computing Platform for Life Sciences
Dr. Matthieu-P. Schapranow
Festival of Genomics, London, U.K.
Jan 19, 2016
2. Recap: we.analyzegenomes.com
Real-time Analysis of Big Medical Data
Schapranow, Festival of Genomics, Jan 19, 2016
Challenges of Big Medical Data
[Figure: platform overview]
■ In-Memory Database
□ Extensions for Life Sciences
□ Data Exchange, App Store
□ Access Control, Data Protection
□ Fair Use
□ Statistical Tools
□ Real-time Analysis
□ App-spanning User Profiles
■ Combined and Linked Data
□ Genome Data
□ Cellular Pathways
□ Genome Metadata
□ Research Publications
□ Pipeline and Analysis Models
□ Drugs and Interactions
■ Drug Response Analysis, Pathway Topology Analysis, Medical Knowledge Cockpit, Oncolyzer, Clinical Trial Recruitment, Cohort Analysis, ...
■ Indexed Sources
3. Our Technology
In-Memory Database Technology
■ Combined column and row store
■ Map/Reduce
■ Single and multi-tenancy
■ Lightweight compression
■ Insert only for time travel
■ Real-time replication
■ Working on integers
■ SQL interface on columns and rows
■ Active/passive data store
■ Minimal projections
■ Group key
■ Reduction of software layers
■ Dynamic multi-threading
■ Bulk load of data
■ Object-relational mapping
■ Text retrieval and extraction engine
■ No aggregate tables
■ Data partitioning
■ Any attribute as index
■ No disk
■ On-the-fly extensibility
■ Analytics on historical data
■ Multi-core/parallelization
4. In-Memory Data Management
Overview
Analyze Genomes:
Real-world Examples
Advances in Hardware
■ 64-bit address space: 4 TB in current server boards
■ 4 MB/ms/core data throughput
■ Cost-performance ratio rapidly declining
■ Multi-core architecture (6 x 12 core CPU per blade)
■ Parallel scaling across blades
■ 1 blade ≈ 50k USD = 1 enterprise-class server
Advances in Software
■ Row and Column Store
■ Compression
■ Partitioning
■ Insert Only
■ Parallelization
■ Active & Passive Data Stores
5. Lightweight Compression
■ Main memory access is the new bottleneck
■ Lightweight compression can reduce this bottleneck:
□ Lossless
□ Improved usage of data bus capacity
□ Work directly on compressed data
Attribute Vector
RecId  ValueId
1      C18.0
2      C32.0
3      C00.9
4      C18.0
5      C20.0
6      C20.0
7      C50.9
8      C18.0

Inverted Index
ValueId  RecIdList
1        2
2        3
3        5,6
4        1,4,8
5        7

Data Dictionary
ValueId  Value
1        Larynx
2        Lip
3        Rectum
4        Colon
5        Mama

Table
RecId  …      …       …
1      C18.0  Colon   091487
2      C32.0  Larynx  357982
3      C00.9  Lip     123489
4      C18.0  Colon   998711
5      C20.0  Rectum  215678
6      C20.0  Rectum  647912
7      C50.9  Mama    167898
8      C18.0  Colon   646470
…
• Typical compression factor of 10:1 for
enterprise software
• In financial applications up to 50:1
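The dictionary, attribute vector, and inverted index on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the function names are hypothetical, and ValueIds are assigned in order of first appearance, so they differ from the numbering shown on the slide.

```python
# Column values taken from the slide's example table (RecId 1..8).
column = ["C18.0", "C32.0", "C00.9", "C18.0", "C20.0", "C20.0", "C50.9", "C18.0"]

def dictionary_encode(values):
    """Return (dictionary, attribute_vector, inverted_index) for one column."""
    dictionary = {}        # value -> ValueId (1-based, order of first appearance)
    attribute_vector = []  # one ValueId per record, position = RecId - 1
    inverted_index = {}    # ValueId -> list of RecIds
    for rec_id, value in enumerate(values, start=1):
        value_id = dictionary.setdefault(value, len(dictionary) + 1)
        attribute_vector.append(value_id)
        inverted_index.setdefault(value_id, []).append(rec_id)
    return dictionary, attribute_vector, inverted_index

dictionary, attribute_vector, inverted_index = dictionary_encode(column)

def rec_ids_for(value):
    """Find records by value using only the integer ValueId."""
    value_id = dictionary.get(value)
    return inverted_index.get(value_id, [])
```

Each distinct string is stored once in the dictionary, while the attribute vector holds only small integers; that is the sense in which queries can work directly on compressed data.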
6. Combined Column and Row Store
■ Row stores are designed for operational workloads, e.g.
□ Create and maintain metadata for tests
□ Access a complete record of a trial or test series
■ Column stores are designed for analytical workloads, e.g.
□ Evaluate the number of positive test results
□ Identify correlations or test candidates
■ In-memory approach: combination of both stores
□ Increased performance for analytical work
□ Operational performance remains interactive
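The difference between the two layouts can be illustrated with plain Python structures. This is a sketch under assumptions: the records and field names are hypothetical and only mirror the slide's examples (fetching a complete record versus counting positive test results).

```python
# Hypothetical test records; field names are illustrative, not from the source.
rows = [
    {"test_id": 1, "patient": "P1", "result": "positive"},
    {"test_id": 2, "patient": "P2", "result": "negative"},
    {"test_id": 3, "patient": "P3", "result": "positive"},
]

def get_record(test_id):
    """Row layout: one complete record per entry, cheap to fetch as a whole."""
    return next(r for r in rows if r["test_id"] == test_id)

# Column layout: one list per attribute, cheap to scan a single attribute.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def count_positive():
    """Analytical query that touches only the 'result' column."""
    return sum(1 for v in columns["result"] if v == "positive")
```

The analytical query reads a single list, while the operational lookup returns all attributes of one record at once.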
7. Insert-Only / Append-Only
■ Traditional databases allow four data operations:
□ INSERT, SELECT
□ DELETE, UPDATE
■ Insert-only needs just INSERT and SELECT and thereby maintains a complete history (as in bookkeeping systems)
■ Insert-only enables time travel, e.g. to
□ Trace changes and reconstruct decisions
□ Document complete history of changes, therapies, etc.
□ Enable statistical observations
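A minimal sketch of the insert-only idea, assuming a simple version counter in place of real timestamps; the key and therapy names are hypothetical. An "update" is just another insert, and a time-travel read ignores all versions newer than the requested one.

```python
import itertools

_version = itertools.count(1)  # monotonically increasing version counter
log = []                       # append-only list of (version, key, value)

def insert(key, value):
    """Record a new version of `key`; earlier versions are never overwritten."""
    version = next(_version)
    log.append((version, key, value))
    return version

def select(key, as_of=None):
    """Return the value of `key` as of version `as_of` (latest if None)."""
    result = None
    for version, k, value in log:
        if k == key and (as_of is None or version <= as_of):
            result = value
    return result

# Hypothetical example: a therapy is changed, but the old entry is kept.
v1 = insert("patient-1/therapy", "therapy A")
v2 = insert("patient-1/therapy", "therapy B")
```

Calling `select("patient-1/therapy", as_of=v1)` reconstructs the state before the change, which is what makes tracing decisions possible.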
8. Partitioning
■ Horizontal Partitioning
□ Cut long tables into shorter segments
□ E.g. to group samples with same relevance
■ Vertical Partitioning
□ Split off columns to individual resources
□ E.g. to separate personalized data from experiment data
■ Partitioning is the basis for
□ Parallel execution of database queries
□ Implementation of data aging and data retention management
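Both partitioning directions can be sketched on a small in-memory table. The table, its column names, and the choice of `year` as partition key are assumptions made for illustration only.

```python
# Hypothetical sample table; column names are assumptions for illustration.
samples = [
    {"sample_id": 1, "patient_name": "A", "year": 2014, "read_count": 100},
    {"sample_id": 2, "patient_name": "B", "year": 2015, "read_count": 250},
    {"sample_id": 3, "patient_name": "C", "year": 2015, "read_count": 175},
]

def horizontal_partition(rows, key):
    """Cut the table into row groups that share the same value of `key`."""
    parts = {}
    for row in rows:
        parts.setdefault(row[key], []).append(row)
    return parts

def vertical_partition(rows, id_col, personal_cols):
    """Split personal columns from experiment columns; both keep the id."""
    personal = [{c: r[c] for c in (id_col, *personal_cols)} for r in rows]
    experiment = [{c: v for c, v in r.items() if c not in personal_cols}
                  for r in rows]
    return personal, experiment

by_year = horizontal_partition(samples, "year")
personal, experiment = vertical_partition(samples, "sample_id", ["patient_name"])
```

Each horizontal partition can be scanned by a separate worker, and the vertical split keeps personal data in a structure that can be stored and protected separately from experiment data.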
9. Multi-core and Parallelization
■ Modern server systems consist of x CPUs, e.g. 6
■ Each CPU consists of y CPU cores, e.g. 12
■ Consider each of the x*y CPU cores as an individual worker, e.g. 6 x 12 = 72
■ Each worker performs one task at a time; all workers run in parallel
■ A full table scan of a database table with 1M entries takes about 1/(x*y) of the sequential search time when traversed in parallel
□ Reduced response time
□ No need for pre-aggregated totals and redundant data
□ Improved usage of hardware
□ Instant analysis of data
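The chunking behind a parallel full table scan can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, and CPython threads do not execute CPU-bound code truly in parallel, so the sketch shows how the work is divided among x*y workers rather than the speed-up itself.

```python
from concurrent.futures import ThreadPoolExecutor

def split(n_rows, n_workers):
    """Return (start, stop) ranges that cover n_rows as evenly as possible."""
    base, extra = divmod(n_rows, n_workers)
    ranges, start = [], 0
    for i in range(n_workers):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def parallel_count(table, predicate, n_workers):
    """Count matching entries by scanning one chunk per worker."""
    def scan(bounds):
        start, stop = bounds
        return sum(1 for i in range(start, stop) if predicate(table[i]))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(scan, split(len(table), n_workers)))

table = list(range(1_000_000))  # stand-in for a column with 1M entries
workers = 6 * 12                # x = 6 CPUs, y = 12 cores each
matches = parallel_count(table, lambda v: v % 1000 == 0, workers)
```

With 72 workers, each one scans at most 13,889 of the 1,000,000 entries, which is where the 1/(x*y) estimate comes from.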
10. Active and Passive Data Store
■ Consider two categories of data stores: active and passive
■ Active data are accessed frequently and updates are expected, e.g.
□ Most recent experiment results, e.g. last two weeks
□ Samples that have not yet been processed
■ Passive data are used for analytical and statistical purposes, e.g.
□ Samples that were processed 5 years ago
□ Metadata about seeds that are no longer produced
■ Moving passive data to slower storage
□ Reduces main memory demands
□ Improves performance for active data
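A data-aging rule along the lines of this slide might look like the sketch below. The record fields, the 14-day window (taken from the "last two weeks" example), and the function name are assumptions for illustration.

```python
from datetime import date, timedelta

def split_active_passive(records, today, active_window_days=14):
    """Partition records into active (recent or unprocessed) and passive."""
    cutoff = today - timedelta(days=active_window_days)
    active, passive = [], []
    for rec in records:
        is_active = (not rec["processed"]) or rec["last_access"] >= cutoff
        (active if is_active else passive).append(rec)
    return active, passive

# Hypothetical records: recently used, still unprocessed, and 5 years old.
records = [
    {"id": 1, "processed": True, "last_access": date(2016, 1, 15)},
    {"id": 2, "processed": False, "last_access": date(2015, 6, 1)},
    {"id": 3, "processed": True, "last_access": date(2011, 1, 10)},
]
active, passive = split_active_passive(records, today=date(2016, 1, 19))
```

Only the `passive` list would be a candidate for relocation to slower storage; the `active` list stays in main memory.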
11. Reduction of Application Layers
■ Layers are introduced to abstract from complexity
■ Each layer offers complete functionality, e.g. metadata of samples
■ Fewer layers result in
□ Less code to maintain
□ More specific code
□ Reduced resource demands
□ Improved application performance by eliminating obsolete processing steps
12. In-Memory Database Technology
Hardware Characteristics at HPI FSOC Lab
■ 1,000-core cluster at Hasso Plattner Institute with 25 TB main memory
■ 25 nodes, each consisting of:
□ 40 cores
□ 1 TB main memory
□ Intel® Xeon® E7-4870
□ 2.40 GHz
□ 30 MB cache
13. In-Memory Database Technology
Use Case: Analysis of Genomic Data
Analysis of Genomic Data   Alignment and Variant Calling   Analysis of Annotations in World-wide DBs
Bound To                   CPU Performance                 Memory Capacity
Duration                   Hours to Days                   Weeks
HPI                        Minutes                         Real-time
In-Memory Technology       Multi-Core                      Partitioning & Compression
14. Federated In-Memory Database (FIMDB)
Incorporating Local Compute Resources
[Figure: federated in-memory database across sites]
■ Sites: Site A, Site B, Cloud Service Provider
■ Federated In-Memory Database instances: FIMDB A.1 to A.5, FIMDB B.1 to B.3, FIMDB C.1
■ Federated In-Memory Database instance, algorithms, and applications managed by service provider
■ Master data managed by service provider
■ Sensitive data reside at site
15. Multiple Sites Forming the
Federated In-Memory Database System
[Figure: Federated In-Memory Database System]
■ Cloud Provider: cloud IMDB instance with master data and shared algorithms
■ Site A: local IMDB instance with sensitive data, e.g. patient data
■ Site B: local IMDB instance with sensitive data, e.g. patient data
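The principle that shared algorithms travel to the sites while sensitive data stay local can be sketched as follows. Everything here is an assumption for illustration: the `Site` class, the variant labels, and the counting algorithm are hypothetical and do not describe the platform's actual interfaces.

```python
class Site:
    """A site holding sensitive records that never leave this object."""

    def __init__(self, name, patient_records):
        self.name = name
        self._records = patient_records  # sensitive data stay local

    def run(self, algorithm):
        """Execute a shared algorithm locally and return only its result."""
        return algorithm(self._records)

def count_variant(variant):
    """Shared algorithm: count local patients carrying `variant`."""
    def algorithm(records):
        return sum(1 for r in records if variant in r["variants"])
    return algorithm

def federated_count(sites, variant):
    """Coordinator: ship the algorithm to each site, combine the aggregates."""
    per_site = {s.name: s.run(count_variant(variant)) for s in sites}
    return per_site, sum(per_site.values())

# Hypothetical data with placeholder variant labels.
site_a = Site("Site A", [{"variants": {"v1", "v2"}}, {"variants": {"v2"}}])
site_b = Site("Site B", [{"variants": {"v2", "v3"}}])
per_site, total = federated_count([site_a, site_b], "v2")
```

Only the per-site counts cross the site boundary; the patient records themselves are never returned to the coordinator.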
16. Software Architecture
■ Front-End
□ Any user interface supported, e.g. UI5, JavaScript, iOS
□ Communication protocol: Ajax/JSON
■ API Server
□ Ruby on Rails
□ Translates into database queries
■ Platform
□ IMDB with extensions for Life Sciences, e.g. worker framework, scheduler, third-party tools, etc.
[Figure: High-Performance In-Memory Genome Cloud Platform]
■ Users: researcher with web browser, clinician with mobile device (Internet, Intranet)
■ Application Layer: specific cloud applications, specific mobile application
■ Platform Layer: analytical data processing, graph processing, scheduling and worker framework, annotation and updater framework
■ Data Layer: in-memory database instances
□ Master data, e.g. reference genomes, annotations, and clinical trials
□ Transactional data, e.g. genome reads, patient EMR data, and worker status
■ Accounting, Security, Extensions
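The API server's role of translating front-end JSON requests into database queries can be sketched as below. The slides name Ruby on Rails for this layer; the Python code here only illustrates the translation step. The request format, the `variants` table, and the use of an in-memory SQLite database as a stand-in for the IMDB are all assumptions.

```python
import json
import sqlite3

ALLOWED_COLUMNS = {"gene", "chromosome"}  # whitelist of filterable columns

def to_query(request_json):
    """Translate a JSON filter request into a parameterized SQL query."""
    filters = json.loads(request_json).get("filters", {})
    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unsupported filter columns: {sorted(unknown)}")
    clauses = [f"{col} = ?" for col in sorted(filters)]
    sql = "SELECT id, gene, chromosome FROM variants"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, [filters[col] for col in sorted(filters)]

# In-memory SQLite stands in for the IMDB in this sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE variants (id INTEGER, gene TEXT, chromosome TEXT)")
db.executemany("INSERT INTO variants VALUES (?, ?, ?)",
               [(1, "BRCA1", "17"), (2, "TP53", "17"), (3, "BRCA2", "13")])

# A front-end request arrives as JSON and the result is returned as JSON.
sql, params = to_query('{"filters": {"chromosome": "17"}}')
response = json.dumps(db.execute(sql + " ORDER BY id", params).fetchall())
```

Column names are checked against a whitelist and values are bound as parameters, so request content never ends up in the SQL text itself.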
17. Keep in contact with us!
Dr. Matthieu-P. Schapranow
Program Manager E-Health
Hasso Plattner Institute
Enterprise Platform & Integration Concepts (EPIC)
August-Bebel-Str. 88
14482 Potsdam, Germany
schapranow@hpi.de
http://we.analyzegenomes.com/