Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI

© 2014 MapR Technologies
Hadoop for Genomics: What you need to know

DNA Sequencing, pre-2004
[Chart over years: CPU (transistors/mm2); HDD (GB/mm2); DNA (bp/$, pre-2004)]

DNA Sequencing, 2004 Disruption
[Chart over years: CPU (transistors/mm2); HDD (GB/mm2); DNA (bp/$, pre-2004); DNA (bp/$, post-2004)]

DNA Sequencing, 2004 Disruption
[Chart over years: CPU (transistors/mm2); HDD (GB/mm2); DNA (bp/$, pre-2004); DNA (bp/$, post-2004)]
Similar disruption occurred for Internet traffic in mid-1990s

Effect: Many DNA-Based Apps Coming…
• 2014: US$ 2B, mostly research, mostly chemical costs
• 2020: US$ 20B, mostly clinical, mostly analytics costs
Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
[Chart: 2014 vs. 2020, Clinical and Non-Clinical, axis 0 to 25]

Genomics Value Chain
Order Test from Clinic => Extract Biosample => BioBank Biosample => DNA Extraction => Sequence Biosample => Secondary Analytics => Tertiary Analytics => Reporting to Clinic
Segments: Academic R&D; Pharma R&D; Clinic Therapy
Axes: Increased scale requirement; Increased feature set requirement

                 Sequence Biosample   Secondary Analytics   Tertiary Analytics
Academic R&D     OK, e.g. ILMN XTen   OK (GATK)             Not OK (manual)
Pharma R&D       OK, e.g. ILMN XTen   Not OK (GATK)         Missing, manual
Clinic Therapy   OK, e.g. ILMN XTen   Missing               Missing
Axes: Increased scale requirement; Increased feature set requirement
Requirements
• Data Intense
• Batch
• High utilization
• Low COGS
Requirements
• Data Intense
• Interactive
• Easy to integrate
• Expressive

Target Application: Alleviate / Prevent (Deterministic) Suffering
[Diagram labels: Patient; DNA Sequencer; Reads; Reference Genome; Variant Calling; Medical Records; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease]

http://steamcommunity.com/app/203160/discussions/0/846956188647169800/
http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
What Does Moore’s Law Feel Like? #Dataviz:
Lara Croft 230=>40,000 Polygons (1996-2014)

Application: Forensics
http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/
http://snapshot.parabon-nanolabs.com/
http://www.nature.com/news/mugshots-built-from-dna-data-1.14899

Growth in Resource Capacity

Disruption Circa 2000
[Chart: NASDAQ Composite]

What Happened?
What did winners do right to survive the .com recession?
[Chart: NASDAQ Composite]

Early 1990s: Early eCommerce Vendor Setup
[Diagram labels: Website (read/write); Back Office (read/write); Storage]

Early 1990s: Early eCommerce Vendor Setup
[Diagram labels: Website (read/write); Back Office (read/write); Storage]
Annotations: SAN & NAS, Oracle; HPC

Late 1990s: Workload became too big
[Diagram labels: Website (x4, read/write); Back Office (x2, read/write); Storage]

Survivor Strategy Revealed: Google Publishes
• 2003: Google Filesystem (aka GFS)
– http://research.google.com/archive/gfs.html
• 2004: MapReduce
– http://research.google.com/archive/mapreduce.html
• 2006: BigTable
– http://research.google.com/archive/bigtable.html

Scale-out with Google FS + MapReduce
[Diagram labels: Website (x4, read/write); Back Office (x2, read/write); Storage + Compute Cluster]

Genomics: Internet Boom Déjà Vu

DNA Sequencing, post-2004
[Chart series: DNA Sequence; NASDAQ Composite]

DNA Sequencing, pre-2004
[Diagram labels: Sequencer; Storage (write-only, read/write); Coordinator / Edge Node; High-Performance Compute Cluster]
Annotations: SAN & NAS; HPC

DNA Sequencing, post-2004
[Diagram labels: DNA Sequencer Cluster (e.g. Illumina X-Ten); Storage (write-only, read/write); Coordinator / Edge Node]

[Diagram labels: Storage (write-only, read/write); Coordinator / Edge Node]
Annotations: HPC bottleneck; Sequencer back-pressure

[Diagram labels: Storage (write-only, read/write); Coordinator / Edge Node]
Annotations: HPC bottleneck; Sequencer back-pressure
NAS doesn’t look like a great solution anymore…

Solution: Implemented 2014 @ Complete Genomics with MapR
[Diagram labels: DNA Sequencer Cluster (e.g. Illumina X-Ten), write-only; Storage + Compute Cluster]
Annotation: Decentralize I/O

MapR NFS Gateway on Application Servers
Application Server
• mapr-nfsserver, Linux NFS Client, MapR client API
• Loopback Mount: localhost:/mapr /mapr
• Translate NFS into API Calls
MapR Data Platform
• mapr-fileserver S1, S2, S3, S4, S5
• Chunks 1 to 5, 256MB each; each chunk number appears three times across the file servers
• MapR Inline Compression
• Supported Compression algorithms (per Directory): LZ4, LZF, ZLIB
• Network Traffic will be compressed automatically
Network Security: MapR RPC Full Wire Encryption
• Client -> Server Communication
• Server -> Server Communication
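
The chunk layout in the diagram above (256MB chunks, three copies of each, spread over five file servers) can be sketched roughly as follows. This is a toy model only: the round-robin placement and the names `chunk_count` and `place_chunks` are assumptions made for illustration and are not MapR's actual placement policy or API.

```python
# Illustrative only: round-robin placement is an assumption, not MapR's real policy.
CHUNK_SIZE = 256 * 1024 * 1024  # 256MB, as on the slide
SERVERS = ["S1", "S2", "S3", "S4", "S5"]
REPLICAS = 3  # the diagram shows each chunk number three times


def chunk_count(file_size, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold file_size bytes (ceiling division)."""
    return max(1, -(-file_size // chunk_size))


def place_chunks(file_size):
    """Map each 1-based chunk number to the servers holding its copies."""
    placement = {}
    for i in range(chunk_count(file_size)):
        # Hypothetical policy: start at server i and take the next REPLICAS servers.
        placement[i + 1] = [SERVERS[(i + r) % len(SERVERS)] for r in range(REPLICAS)]
    return placement


if __name__ == "__main__":
    for chunk, servers in place_chunks(5 * CHUNK_SIZE).items():
        print(chunk, servers)
```

With five chunks and five servers, every server ends up holding three chunk copies, so reads and writes are spread across the cluster rather than funneled through one storage head.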

[WHITEBOARD BREAK]

[REDACTED]

Allows Secondary Analytics to Scale Out
[Diagram labels: Patient; DNA Sequencer; Reads; Reference Genome; Variant Calling; Medical Records; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease]

Secondary Analytics: Acute Pain Point
FastQ Reads => Aligned Reads => Variants
ADAM + Avocado
• Matrix rotation is very I/O intense
• Local de novo is best…
• …only feasible with efficient rotations
Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Zerbino & Birney. 2008
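
The Velvet reference above is about de novo assembly on de Bruijn graphs. As a rough illustration of that data structure (not Velvet's or Avocado's implementation), the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are the k-mers found in the reads. The function name `de_bruijn` and the toy reads are made up for this example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # (k-1)-mer prefix -> (k-1)-mer suffix
    return dict(graph)


if __name__ == "__main__":
    toy_reads = ["TTGGAG", "GGAGCC"]  # made-up reads
    for node, successors in de_bruijn(toy_reads, k=4).items():
        print(node, "->", successors)
```

An assembler then walks paths through this graph; doing that locally around each candidate variant is what makes fast access to the relevant reads important.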

Apache Parquet

Row-Oriented Format
ID     Reference  Position  Next ID  Sequence  Quality
read1  chr1       10000     read2    TTGGAG    ABCDEF
read2  chr1       20000     -        TCGTAA    ABCDEF
read3  chr2       5000      -        GGGAAC    ABCDEF
read4  chr3       1000000   read6    CCCTAC    ABCDEF
read5  chr4       900000    -        TTTAAG    ABCDEF
Figure labels: 0, 5, 20, 40, 57

Row-Oriented Splitting

Column-Oriented Format
ID:        read1, read2, read3, read4, read5
Reference: chr1, chr1, chr2, chr3, chr4
Position:  10000, 20000, 5000, 1000000, 900000
Next ID:   read2, -, -, read6, -
Sequence:  TTGGAG, TCGTAA, GGGAAC, CCCTAC, TTTAAG
Quality:   ABCDEF, ABCDEF, ABCDEF, ABCDEF, ABCDEF
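
The difference between the row and column layouts above can be sketched in a few lines. This is a simplified in-memory model using the slide's example records; it is not Parquet's on-disk format, and the helper names are hypothetical.

```python
# Field names and values are taken from the slide's example table.
FIELDS = ["id", "reference", "position", "next_id", "sequence", "quality"]
ROWS = [
    ("read1", "chr1", 10000, "read2", "TTGGAG", "ABCDEF"),
    ("read2", "chr1", 20000, None, "TCGTAA", "ABCDEF"),
    ("read3", "chr2", 5000, None, "GGGAAC", "ABCDEF"),
    ("read4", "chr3", 1000000, "read6", "CCCTAC", "ABCDEF"),
    ("read5", "chr4", 900000, None, "TTTAAG", "ABCDEF"),
]


def to_columns(rows, fields):
    """Transpose row tuples into one list per field (column-oriented)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(fields)}


def values_touched_row_scan(rows, wanted):
    """A row scan reads every field of every row, even if only a few are wanted."""
    return sum(len(row) for row in rows)


def values_touched_column_scan(columns, wanted):
    """A column scan reads only the requested columns."""
    return sum(len(columns[name]) for name in wanted)


if __name__ == "__main__":
    cols = to_columns(ROWS, FIELDS)
    print(values_touched_row_scan(ROWS, ["position"]))     # all 30 values
    print(values_touched_column_scan(cols, ["position"]))  # only 5 values
```

Storing similar values next to each other (all references, all qualities) is also what lets a columnar format compress well.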

Column-Oriented Format Partitioning
ID:        read1, read2, read3, read4, read5
Reference: chr1, chr1, chr2, chr3, chr4
Position:  10000, 20000, 5000, 1000000, 900000
Next ID:   read2, -, -, read6, -
Sequence:  TTGGAG, TCGTAA, TTGGAG, GGGAAC, TTTAAG
Quality:   ABCDEF, ABCDEF, ABCDEF, ABCDEF, ABCDEF

Column-Oriented Format Splitting

Apache Parquet

Apache Parquet
http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/

Allows Secondary Analytics to Scale Out
[Chart: GATK / HPC method (flat after chromosome split) vs. Hadoop / Spark method]

Tertiary Analytics

Downstream Analytics: GWAS/PheWAS
FastQ Reads => Aligned Reads => Variants => Function => Phenotypes
ADAM + Avocado
Scalable GWAS/PheWAS: “Green Field” Territory

Target Application: Alleviate / Prevent Suffering
[Diagram labels: Patient; DNA Sequencer; Reads; Reference Genome; Variant Calling; Medical Records; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease]

GWAS Overview (Genome-wide Association Study)
• Which genome features are associated with phenotype X?
https://en.wikipedia.org/wiki/Genome-wide_association_study
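
The GWAS question above can be sketched as a per-variant association test. The example below uses a plain Pearson chi-square on a 2x2 carrier/phenotype table; real studies typically use regression models, covariates, and multiple-testing correction, none of which is shown here. The function names and input shapes are assumptions for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a: carriers with phenotype, b: carriers without,
    c: non-carriers with phenotype, d: non-carriers without.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association signal
    return n * (a * d - b * c) ** 2 / denom


def gwas_scan(genotypes, phenotype):
    """Score every variant against one phenotype.

    genotypes: {variant: set of individuals carrying it}
    phenotype: {individual: True if the phenotype is present, else False}
    """
    cases = {person for person, present in phenotype.items() if present}
    controls = set(phenotype) - cases
    scores = {}
    for variant, carriers in genotypes.items():
        a = len(carriers & cases)
        b = len(carriers & controls)
        scores[variant] = chi_square_2x2(a, b, len(cases) - a, len(controls) - b)
    return scores
```

PheWAS is the same loop turned around: fix one variant and score every phenotype against it.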

PheWAS Overview (Phenome-wide …)
• Which phenotypes are associated with genome variant X?
http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/

Genome × Phenome Analysis
For a given population, given SNP 𝛿, and given phenotype ϕ: count the number of occurrences as the value of the matrix
[Matrix figure: SNP axis 𝛿1, 𝛿3, 𝛿5 (SPARSE, Billion+ Genotypes); phenotype axis ϕ1, ϕ3, ϕ5 (SPARSE, Billion+ Phenotypes)]
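
The counting rule above can be sketched with a dictionary that stores only the non-zero cells, which is the usual way to represent a sparse matrix at small scale. The input shape (one pair of SNP set and phenotype set per individual) and the function name are assumptions for illustration.

```python
from collections import Counter


def genome_phenome_counts(population):
    """Count, per (SNP, phenotype) pair, how many individuals have both.

    population: iterable of (snps, phenotypes) pairs, one per individual.
    Only non-zero cells are stored, which keeps the matrix sparse.
    """
    counts = Counter()
    for snps, phenotypes in population:
        for snp in set(snps):
            for phenotype in set(phenotypes):
                counts[(snp, phenotype)] += 1
    return counts


if __name__ == "__main__":
    toy_population = [  # made-up individuals
        ({"d1", "d3"}, {"p1"}),
        ({"d1"}, {"p1", "p3"}),
        ({"d5"}, set()),
    ]
    for cell, count in sorted(genome_phenome_counts(toy_population).items()):
        print(cell, count)
```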

Disease Cause via Genome × Phenome Matrix Factorization
• Row Eigenvectors of X represent
– Sets of related phenotypes (by SNP)
• Column Eigenvectors of Y represent
– Sets of related SNPs (by phenotype)
[Matrix figure: SNP axis 𝛿1, 𝛿3, 𝛿5; phenotype axis ϕ1, ϕ3, ϕ5; Principal Column Vector; Principal Row Vector; Archetype Genotypes; Archetype Phenotypes]
Sparse Matrix Package is Actively Developed in Spark Community
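
One way to obtain a principal row and column vector from a small count matrix is power iteration, sketched below in plain Python. This is an illustrative stand-in, not the Spark sparse-matrix package mentioned above; the function names and the toy matrix are assumptions, and it uses dense lists, so it would not scale to the matrix sizes discussed in the talk.

```python
import math


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def principal_vectors(m, iterations=100):
    """Leading left and right singular vectors of m by power iteration."""
    mt = transpose(m)
    right = normalize([1.0] * len(m[0]))
    left = normalize(matvec(m, right))
    for _ in range(iterations):
        left = normalize(matvec(m, right))
        right = normalize(matvec(mt, left))
    return left, right


if __name__ == "__main__":
    # Toy matrix: rows are SNPs, columns are phenotypes (made-up counts).
    toy = [[2, 2, 0], [2, 2, 0], [0, 0, 1]]
    left, right = principal_vectors(toy)
    print([round(x, 3) for x in left], [round(x, 3) for x in right])
```

In the toy matrix the first two SNPs and the first two phenotypes co-occur, so both leading vectors put their weight on those entries, which is the sense in which they act as archetypes.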

Generalized Approach: Genome × Phenome Tensor
• Maintain individual identity
• Aggregating individuals gives up statistical power
• Leverage pedigrees: individuals are not independent observations
[Figure axes: Variants; Phenotypes]
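
The difference between the tensor and the aggregated matrix can be sketched with a set of (individual, variant, phenotype) triples. The representation and the function names are assumptions for illustration.

```python
from collections import Counter


def aggregate_over_individuals(tensor):
    """Collapse (individual, variant, phenotype) cells into variant x phenotype counts."""
    matrix = Counter()
    for _individual, variant, phenotype in tensor:
        matrix[(variant, phenotype)] += 1
    return matrix


def individuals_for(tensor, variant, phenotype):
    """Who contributes to one cell; this is only answerable from the tensor."""
    return sorted(i for (i, v, p) in tensor if v == variant and p == phenotype)


if __name__ == "__main__":
    toy_tensor = {  # made-up observations
        ("ann", "d1", "p1"),
        ("bob", "d1", "p1"),
        ("bob", "d3", "p1"),
    }
    print(aggregate_over_individuals(toy_tensor))
    print(individuals_for(toy_tensor, "d1", "p1"))
```

Once the counts are aggregated, the link back to individuals (and therefore to pedigrees) is gone, which is the loss of information the bullets above refer to.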

Scalable Variant Store => Root out Disease Causes
Model P ~ F(G)
Fortunately, this has already been done…
Genotypes Med Record Phenotypes, e.g.
disease risk, drug response

Largest Biometric Database in the World
1.2B PEOPLE

Why Create Aadhaar?
• India: 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities, causing leakage of 30-40%
Standardize identity => Stop leakage

Aadhaar Biometric Capture & Index
Raw
Digital
Fingerprint

Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
Figure labels: Low Entropy + Unique; Low Entropy + Infrequent
• 900MM people loaded in 4 years
• In production
– 1MM registrations/day
– 200+ trillion lookups/day
• All built on MapR-DB (HBase)

Consistent, Low Latency
[Chart legend: M7 Read Latency; Others Read Latency]

How Does this Relate to Genomics?
F⁻¹(x): common features
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 7B humans, ~3M variants

How Does this Relate to Genomics?
F⁻¹(x): common features
Phenotype: healthy or sick?
Phenotype Partition => Low Entropy

Individuals and fingerprint minutiae: find rare minutiae to uniquely identify
≈
Medical records and genetic variants: find shared variants to get disease root cause
Takeaway 1: Don’t reinvent the wheel
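
The analogy above can be sketched with plain set operations: identification looks for features that only one person has, while the genomics question looks for variants shared by the sick partition and absent from the healthy one. This is a toy illustration under strong assumptions (a fully penetrant, single-variant cause); the function names and inputs are hypothetical and do not describe how Aadhaar or any variant store is implemented.

```python
def rare_features(features_by_person, person):
    """Features that only this person has (the fingerprint-style question)."""
    others = set().union(
        *(f for p, f in features_by_person.items() if p != person)
    )
    return features_by_person[person] - others


def shared_variants(genotypes, phenotype):
    """Variants carried by every sick individual and by no healthy one.

    genotypes: {individual: set of variants}
    phenotype: {individual: True if sick, False if healthy}
    """
    sick = [genotypes[i] for i, is_sick in phenotype.items() if is_sick]
    healthy = [genotypes[i] for i, is_sick in phenotype.items() if not is_sick]
    if not sick:
        return set()
    in_all_sick = set.intersection(*sick)
    in_any_healthy = set().union(*healthy)
    return in_all_sick - in_any_healthy


if __name__ == "__main__":
    toy_genotypes = {"a": {"v1", "v2"}, "b": {"v1", "v3"}, "c": {"v2", "v3"}}
    toy_phenotype = {"a": True, "b": True, "c": False}
    print(shared_variants(toy_genotypes, toy_phenotype))
```

Both functions scan the same kind of individual-by-feature table, which is the point of the comparison: the data shape is the same and only the question is inverted.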

Takeaway 2: Evolution, not Revolution
[Chart series: DNA Sequence; NASDAQ Composite]

Thank You
@allenday // @mapr
Now a few slides about MapR’s product…
…and proposed next actions

“Quick Start” Package
Engagement includes:
1. Identification of data sources, transformations and reporting engines
2. Access and use of the solution template including source code
3. Training on customizing the solution template to the organization’s requirements
4. Deployment architecture document that enables a production deployment plan for the specific solution
Deliverables: SOLUTION TEMPLATE; KNOWLEDGE TRANSFER; DEPLOYMENT ARCHITECTURE

“Quick Start” 1 – Resequencing with Hadoop
• Reduces Storage Hardware Requirements
• Accelerates Data Processing Time
• Minimal impact to existing data pipelines

“Quick Start” 2 – Variant Analysis with NoSQL
• Present data for exploration
• Operationalize complex workflows
• Web-scale performance

                 Sequence Biosample   Secondary Analytics   Tertiary Analytics
Academic R&D     OK, e.g. ILMN XTen   OK (GATK)             Not OK
Pharma R&D       OK, e.g. ILMN XTen   Not OK                Missing

                 Sequence Biosample   Secondary Analytics   Tertiary Analytics
Academic R&D     OK, e.g. ILMN XTen   OK (GATK)             Not OK
Pharma R&D       OK, e.g. ILMN XTen   Not OK                Missing
Annotations: Addressed by Quick Start 1; Addressed by Quick Start 2

BONUS ROUND

Genealogy Company
Slides credit: Bill Yetman, Hadoop Summit 2014
http://slidesha.re/1vRh3kY

GERMLINE is…
• …an algorithm that finds hidden relationships within a pool of DNA
• …the reference implementation of that algorithm written in C++.
• You can find it here:
http://www1.cs.columbia.edu/~gusev/germline/

Projected GERMLINE run times (in hours)
[Chart: Hours (0 to 700) vs. Samples (2,500 to 122,500); series: GERMLINE run times; Projected GERMLINE run times]
700 hours = 29+ days
EXPONENTIAL COMPLEXITY

GERMLINE: What’s the Problem?
• GERMLINE (the implementation) was not meant to be used in an industrial setting
– Stateless, single threaded, prone to swapping (heavy memory usage)
– GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to a crawl
• Put simply: GERMLINE couldn't scale

Run times for matching (in hours)
[Chart: Hours (0 to 180) vs. Samples; series: GERMLINE run times; Jermline run times; Projected GERMLINE run times; annotations: EXPONENTIAL; LINEAR; HBase Refactor]

• Paper submitted describing the implementation
• Releasing as an Open Source project soon
• [HBase Schema/Algorithm Slides]

Further Growth & Optimization

Underdog (Strand Phasing) performance
– Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation
With improved accuracy!
Underdog replaces Beagle
[Chart: Total Run Size; Total Beagle-Underdog Duration; axis 0 to 80,000]

Pipeline steps and incremental change…
– Incremental change over time
– Supporting the business in a “just in time” Agile way
[Chart: axis 0 to 250,000; run sizes from 500 to 395,432; series: Beagle-Underdog Phasing; Pipeline Finalize; Relationship Processing; Germline-Jermline Results Processing; Germline-Jermline Processing; Beagle Post Phasing; Admixture; Plink Prep; Pipeline Initialization]
Annotations: Jermline replaces Germline; Ethnicity V2 Release; Underdog Replaces Beagle; AdMixture on Hadoop

…while the business continues to grow rapidly
[Chart: DNA Database Size (# of processed samples), 0 to 450,000, Jan-12 through Apr-14]
