Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Allen Day, PhD
1. © 2014 MapR Technologies 1
Hadoop for Genomics: What you need to know
2. © 2014 MapR Technologies 2
DNA Sequencing, pre-2004
[Chart over years: CPU (transistors/mm2), HDD (GB/mm2), DNA (bp/$, pre-2004)]
3. © 2014 MapR Technologies 3
DNA Sequencing, 2004 Disruption
[Chart over years: CPU (transistors/mm2), HDD (GB/mm2), DNA (bp/$, pre-2004), DNA (bp/$, post-2004)]
4. © 2014 MapR Technologies 4
DNA Sequencing, 2004 Disruption
[Chart over years: CPU (transistors/mm2), HDD (GB/mm2), DNA (bp/$, pre-2004), DNA (bp/$, post-2004)]
Similar disruption occurred for Internet traffic in the mid-1990s
5. © 2014 MapR Technologies 5
Effect: Many DNA-Based Apps Coming…
• 2014: US$ 2B, mostly research, mostly chemical costs
• 2020: US$ 20B, mostly clinical, mostly analytics costs
Source: Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
[Chart: 2014 vs. 2020, split into Clinical and Non-Clinical; axis 0 to 25]
6. © 2014 MapR Technologies 6
Genomics Value Chain
Steps: Order Test from Clinic, Extract Biosample, BioBank Biosample, DNA Extraction, Sequence Biosample, Secondary Analytics, Tertiary Analytics, Reporting to Clinic
Segments: Academic R&D, Pharma R&D, Clinic Therapy
Axes: Increased scale requirement; Increased feature set requirement
7. © 2014 MapR Technologies 7
Genomics Value Chain
                 Sequence Biosample    Secondary Analytics    Tertiary Analytics
Academic R&D     OK, e.g. ILMN XTen    OK (GATK)              Not OK (manual)
Pharma R&D       OK, e.g. ILMN XTen    Not OK (GATK)          Missing, manual
Clinic Therapy   OK, e.g. ILMN XTen    Missing                Missing
Increased scale requirement
Increased feature set requirement
Requirements
• Data Intense
• Batch
• High utilization
• Low COGS
Requirements
• Data Intense
• Interactive
• Easy to integrate
• Expressive
8. © 2014 MapR Technologies 8
Target Application: Alleviate / Prevent (Deterministic) Suffering
[Diagram nodes: Patient, DNA Sequencer, Reads, Reference Genome, Variant Calling, Medical Records, Genotype/Phenotype/Individual Matrix, Cure & Prevent Disease]
9. © 2014 MapR Technologies 9
What Does Moore’s Law Feel Like? #Dataviz: Lara Croft 230=>40,000 Polygons (1996-2014)
http://steamcommunity.com/app/203160/discussions/0/846956188647169800/
http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
10. © 2014 MapR Technologies 10
Application: Forensics
http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/
http://snapshot.parabon-nanolabs.com/
http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
11. © 2014 MapR Technologies 11
Growth in Resource Capacity
12. © 2014 MapR Technologies 12
Disruption Circa 2000
[Chart: NASDAQ Composite]
13. © 2014 MapR Technologies 13
What Happened?
What did winners do right to survive the .com recession?
[Chart: NASDAQ Composite]
14. © 2014 MapR Technologies 14
Early 1990s: Early eCommerce Vendor Setup
[Diagram: Website and Back Office each read/write to shared Storage]
15. © 2014 MapR Technologies 15
Early 1990s: Early eCommerce Vendor Setup
[Diagram: Website and Back Office each read/write to shared Storage]
Annotations: <= SAN & NAS, Oracle; <= HPC
16. © 2014 MapR Technologies 16
Late 1990s: Workload became too big
[Diagram: four Website nodes and two Back Office nodes read/write to shared Storage]
17. © 2014 MapR Technologies 17
Survivor Strategy Revealed: Google Publishes
• 2003: Google Filesystem (aka GFS)
– http://research.google.com/archive/gfs.html
• 2004: MapReduce
– http://research.google.com/archive/mapreduce.html
• 2006: BigTable
– http://research.google.com/archive/bigtable.html
18. © 2014 MapR Technologies 18
Scale-out with Google FS + MapReduce
[Diagram: four Website nodes and two Back Office nodes read/write to a Storage + Compute Cluster]
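As a rough illustration of the MapReduce model named on this slide, the sketch below runs the map, shuffle, and reduce phases in a single process. The record layout (read ID, chromosome) and all function names are hypothetical; a real Hadoop job would distribute each phase across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit one (key, value) pair per record; the key is the chromosome."""
    _read_id, chromosome = record
    yield chromosome, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: (read ID, chromosome) pairs.
reads = [("read1", "chr1"), ("read2", "chr1"), ("read3", "chr2")]
counts = map_reduce(reads)
```

Because each map call sees only one record and each reduce call sees only one key, both phases can run in parallel on the nodes that already hold the data.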
19. © 2014 MapR Technologies 19
Genomics: Internet Boom Déjà Vu
20. © 2014 MapR Technologies 20
DNA Sequencing, post-2004
[Chart: DNA Sequence vs. NASDAQ Composite]
21. © 2014 MapR Technologies 21
DNA Sequencing, pre-2004
[Diagram: Sequencer (write-only) to Storage; High-Performance Compute Cluster with Coordinator / Edge Node (read/write) to Storage]
Annotations: SAN & NAS =>; HPC =>
22. © 2014 MapR Technologies 22
DNA Sequencing, post-2004
[Diagram: DNA Sequencer Cluster (e.g. Illumina X-Ten) (write-only) to Storage; High-Performance Compute Cluster with Coordinator / Edge Node (read/write) to Storage]
23. © 2014 MapR Technologies 23
DNA Sequencing, post-2004
[Diagram: DNA Sequencer Cluster (e.g. Illumina X-Ten) (write-only) to Storage; High-Performance Compute Cluster with Coordinator / Edge Node (read/write) to Storage]
Annotations: HPC bottleneck; Sequencer back-pressure
24. © 2014 MapR Technologies 24
DNA Sequencing, post-2004
[Diagram: DNA Sequencer Cluster (e.g. Illumina X-Ten) (write-only) to Storage; High-Performance Compute Cluster with Coordinator / Edge Node (read/write) to Storage]
Annotations: HPC bottleneck; Sequencer back-pressure
NAS doesn’t look like a great solution anymore…
25. © 2014 MapR Technologies 25
Solution: Implemented 2014 @ Complete Genomics with MapR
[Diagram: DNA Sequencer Cluster (e.g. Illumina X-Ten) (write-only) to Storage + Compute Cluster]
Decentralize I/O
26. © 2014 MapR Technologies 26
MapR NFS Gateway on Application Servers
Application Server
• Linux NFS Client, loopback mount: localhost:/mapr /mapr
• mapr-nfsserver: translates NFS into API calls (MapR client API)
MapR Data Platform
• mapr-fileserver nodes S1, S2, S3, S4, S5
• Chunks 1 to 5, 256MB each, with MapR inline compression
• [Diagram: each chunk number appears three times across the fileservers]
Network Security: MapR RPC Full Wire Encryption
• Client -> Server Communication
• Server -> Server Communication
Supported compression algorithms (per directory): LZ4, LZF, ZLIB
Network traffic will be compressed automatically
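To make the chunking in this diagram concrete, here is a minimal sketch of splitting a file into 256MB chunks and spreading copies over five fileservers. The replica count of three is an assumption read off the diagram, and the round-robin placement is purely illustrative; it is not MapR's actual placement policy.

```python
CHUNK_SIZE = 256 * 1024 * 1024  # 256MB chunks, as shown on the slide


def chunk_count(file_size_bytes, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold the file (ceiling division)."""
    return max(1, -(-file_size_bytes // chunk_size))


def place_chunks(n_chunks, servers, replicas=3):
    """Assign each chunk to `replicas` servers, round-robin (assumed policy)."""
    placement = {}
    for chunk in range(1, n_chunks + 1):
        start = (chunk - 1) % len(servers)
        placement[chunk] = [
            servers[(start + r) % len(servers)] for r in range(replicas)
        ]
    return placement


servers = ["S1", "S2", "S3", "S4", "S5"]
placement = place_chunks(chunk_count(5 * CHUNK_SIZE), servers)
```

With five chunks, five servers, and three copies each, every server ends up holding three chunks, so reads and writes are spread over all nodes rather than funneled through one storage head.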
28. © 2014 MapR Technologies 28
[REDACTED]
29. © 2014 MapR Technologies 29
Allows Secondary Analytics to Scale Out
[Diagram nodes: Patient, DNA Sequencer, Reads, Reference Genome, Variant Calling, Medical Records, Genotype/Phenotype/Individual Matrix, Cure & Prevent Disease]
30. © 2014 MapR Technologies 30
Secondary Analytics: Acute Pain Point
Pipeline: FastQ Reads, Aligned Reads, Variants (ADAM + Avocado)
Matrix rotation is very I/O intense
Local de novo is best… …only feasible with efficient rotations
Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Zerbino & Birney. 2008
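The Velvet reference on this slide concerns de Bruijn graphs. As a rough sketch of the idea (not Velvet's implementation), each k-mer in a read becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name and the choice of k are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# One short read taken from the example table later in the deck.
graph = de_bruijn(["TTGGAG"], 3)
```

An assembler then looks for paths through this graph; doing that locally, per region, needs fast access to all reads overlapping the region, which is where the I/O cost mentioned above comes from.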
32. © 2014 MapR Technologies 32
Row-Oriented Format
ID     Reference  Position  Next ID  Sequence  Quality
read1  chr1       10000     read2    TTGGAG    ABCDEF
read2  chr1       20000     -        TCGTAA    ABCDEF
read3  chr2       5000      -        GGGAAC    ABCDEF
read4  chr3       1000000   read6    CCCTAC    ABCDEF
read5  chr4       900000    -        TTTAAG    ABCDEF
[Diagram offsets: 0, 5, 20, 40, 57]
34. © 2014 MapR Technologies 34
Column-Oriented Format
ID: read1, read2, read3, read4, read5
Reference: chr1, chr1, chr2, chr3, chr4
Position: 10000, 20000, 5000, 1000000, 900000
Next ID: read2, -, -, read6, -
Sequence: TTGGAG, TCGTAA, GGGAAC, CCCTAC, TTTAAG
Quality: ABCDEF, ABCDEF, ABCDEF, ABCDEF, ABCDEF
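The difference between the two layouts can be sketched in a few lines. The example below transposes the rows from the row-oriented slide into one list per field, so a query that needs only one field touches only that list. The field names and the helper `to_columns` are illustrative; a format such as Parquet adds encoding, compression, and on-disk layout on top of this idea.

```python
FIELDS = ("id", "reference", "position", "next_id", "sequence", "quality")

# Rows as shown on the row-oriented slide.
rows = [
    ("read1", "chr1", 10000, "read2", "TTGGAG", "ABCDEF"),
    ("read2", "chr1", 20000, "-", "TCGTAA", "ABCDEF"),
    ("read3", "chr2", 5000, "-", "GGGAAC", "ABCDEF"),
    ("read4", "chr3", 1000000, "read6", "CCCTAC", "ABCDEF"),
    ("read5", "chr4", 900000, "-", "TTTAAG", "ABCDEF"),
]


def to_columns(rows, fields=FIELDS):
    """Transpose a list of row tuples into a dict of per-field lists."""
    return {name: list(values) for name, values in zip(fields, zip(*rows))}


columns = to_columns(rows)

# A scan over positions reads one list instead of every full row.
positions = columns["position"]
chr1_reads = [
    read_id
    for read_id, ref in zip(columns["id"], columns["reference"])
    if ref == "chr1"
]
```

Values within one column are also similar to each other (for example the repeated quality string), which is why columnar files tend to compress well.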
35. © 2014 MapR Technologies 35
Column-Oriented Format Partitioning
ID: read1, read2, read3, read4, read5
Reference: chr1, chr1, chr2, chr3, chr4
Position: 10000, 20000, 5000, 1000000, 900000
Next ID: read2, -, -, read6, -
Sequence: TTGGAG, TCGTAA, TTGGAG, GGGAAC, TTTAAG
Quality: ABCDEF, ABCDEF, ABCDEF, ABCDEF, ABCDEF
36. © 2014 MapR Technologies 36
Column-Oriented Format Splitting
38. © 2014 MapR Technologies 38
Apache Parquet
http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
39. © 2014 MapR Technologies 39
Allows Secondary Analytics to Scale Out
[Chart: GATK / HPC method (flat after chromosome split) vs. Hadoop / Spark method]
40. © 2014 MapR Technologies 40
Tertiary Analytics
41. © 2014 MapR Technologies 41
Downstream Analytics: GWAS/PheWAS
Pipeline: FastQ Reads, Aligned Reads, Variants, Function, Phenotypes
ADAM + Avocado
Scalable GWAS/PheWAS: “Green Field” Territory
42. © 2014 MapR Technologies 42
Target Application: Alleviate / Prevent Suffering
[Diagram nodes: Patient, DNA Sequencer, Reads, Reference Genome, Variant Calling, Medical Records, Genotype/Phenotype/Individual Matrix, Cure & Prevent Disease]
43. © 2014 MapR Technologies 43
GWAS Overview (Genome-wide Association Study)
• Which genome features are associated with phenotype X?
https://en.wikipedia.org/wiki/Genome-wide_association_study
44. © 2014 MapR Technologies 44
PheWAS Overview (Phenome-wide …)
• Which phenotypes are associated with genome variant X?
http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/
45. © 2014 MapR Technologies 45
Genome × Phenome Analysis
For given population, given SNP 𝛿, and given phenotype ϕ: count the number of occurrences as the value of the matrix
[Diagram: matrix with rows 𝛿1, 𝛿3, 𝛿5 and columns ϕ1, ϕ3, ϕ5; axes labeled SPARSE, Billion+ Genotypes and SPARSE, Billion+ Phenotypes]
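The counting rule on this slide can be sketched as a sparse co-occurrence count: for every individual, each (SNP, phenotype) pair they carry adds one to that matrix cell. The input layout and the names below are hypothetical, and a dictionary of counts stands in for a distributed sparse matrix.

```python
from collections import Counter
from itertools import product


def genome_phenome_counts(population):
    """Count, per (SNP, phenotype) pair, how many individuals have both."""
    counts = Counter()
    for snps, phenotypes in population:
        for pair in product(set(snps), set(phenotypes)):
            counts[pair] += 1
    return counts


# Hypothetical population: one (SNPs, phenotypes) record per individual.
population = [
    ({"snp1", "snp3"}, {"pheno1"}),
    ({"snp1"}, {"pheno1", "pheno3"}),
    ({"snp5"}, {"pheno5"}),
]
counts = genome_phenome_counts(population)
```

Only pairs that actually occur are stored, which matters when both axes run to billions of entries and almost all cells are zero.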
46. © 2014 MapR Technologies 46
Disease Cause via Genome × Phenome Matrix Factorization
• Row Eigenvectors of X represent
– Sets of related phenotypes (by SNP)
• Column Eigenvectors of Y represent
– Sets of related SNPs (by phenotype)
[Diagram: matrix with rows 𝛿1, 𝛿3, 𝛿5 and columns ϕ1, ϕ3, ϕ5; labels: Principal Column Vector, Archetype Genotypes, Archetype Phenotypes, Principal Row Vector]
Sparse Matrix Package is Actively Developed in Spark Community
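One way to read the "principal vectors" on this slide is as the leading singular vectors of the SNP × phenotype count matrix; that reading is an assumption, since the slide does not define X and Y. Under it, a minimal power-iteration sketch in plain Python looks like this (a production system would use a distributed sparse linear algebra package instead).

```python
import math


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def normalize(v):
    """Return the unit vector and the original length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v], norm


def principal_vectors(m, iterations=100):
    """Leading left vector u, singular value sigma, and right vector v."""
    mt = transpose(m)
    v, _ = normalize([1.0] * len(m[0]))
    for _ in range(iterations):
        # Power iteration on (m^T m) converges to the leading right vector.
        v, _ = normalize(matvec(mt, matvec(m, v)))
    u, sigma = normalize(matvec(m, v))
    return u, sigma, v


# Hypothetical 2 x 2 count matrix: rows are SNPs, columns are phenotypes.
u, sigma, v = principal_vectors([[3.0, 0.0], [0.0, 1.0]])
```

Here `u` weights SNPs and `v` weights phenotypes, which corresponds loosely to the "archetype genotypes" and "archetype phenotypes" labels in the diagram.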
47. © 2014 MapR Technologies 47
Generalized Approach: Genome × Phenome Tensor
• Maintain individual identity
• Aggregating individuals gives up statistical power
• Leverage pedigrees – Individuals are not independent observations
[Diagram axes: Variants, Phenotypes (shown twice)]
48. © 2014 MapR Technologies 48
Scalable Variant Store => Root out Disease Causes
Model P ~ F(G)
Fortunately, this has already been done…
Genotypes Med Record Phenotypes, e.g.
disease risk, drug response
49. © 2014 MapR Technologies 49
Largest Biometric Database in the World
1.2B PEOPLE
50. © 2014 MapR Technologies 50
Why Create Aadhaar?
• India: 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities causing leakage of 30-40%
Standardize identity => Stop leakage
51. © 2014 MapR Technologies 51
Aadhaar Biometric Capture & Index
[Diagram: Raw Digital Fingerprint]
52. © 2014 MapR Technologies 52
Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
Labels: Low Entropy + Unique; Low Entropy + Infrequent
• 900MM people loaded in 4 years
• In production
– 1MM registrations/day
– 200+ trillion lookups/day
• All built on MapR-DB (HBase)
53. © 2014 MapR Technologies 53
Consistent, Low Latency
[Chart legend: M7 Read Latency vs. Others Read Latency]
54. © 2014 MapR Technologies 54
How Does this Relate to Genomics?
F⁻¹(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 7B humans, ~3M variants
55. © 2014 MapR Technologies 55
How Does this Relate to Genomics?
F⁻¹(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Phenotype: healthy or sick?
Phenotype Partition => Low Entropy
56. © 2014 MapR Technologies 56
Takeaway 1: Don’t reinvent the wheel
individuals × fingerprint minutiae ≈ medical records × genetic variants
Find rare minutiae to uniquely identify
Find shared variants to get disease root cause
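The analogy on this slide comes down to two queries over the same shape of data, a set of features per individual: features that are rare across the population identify someone, while features shared across a cohort point to a common cause. The sketch below illustrates both on toy data; the record layout and function names are hypothetical.

```python
from collections import Counter


def feature_frequency(records):
    """How many individuals carry each feature."""
    return Counter(f for feats in records.values() for f in set(feats))


def rare_features(records, max_count=1):
    """Per individual, the features carried by at most `max_count` people."""
    freq = feature_frequency(records)
    return {
        who: {f for f in feats if freq[f] <= max_count}
        for who, feats in records.items()
    }


def shared_features(records, cohort):
    """Features carried by every individual in the cohort."""
    sets = [set(records[who]) for who in cohort]
    return set.intersection(*sets) if sets else set()


# Hypothetical records: individual -> set of variants (or minutiae).
records = {"p1": {"v1", "v2"}, "p2": {"v1", "v3"}, "p3": {"v4"}}
identifying = rare_features(records)
common_to_sick = shared_features(records, ["p1", "p2"])
```

In the fingerprint case the first query is the useful one; in the genomics case the second one is, with the cohort chosen by phenotype (for example, everyone marked as sick).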
57. © 2014 MapR Technologies 57
Takeaway 2: Evolution, not Revolution
[Chart: DNA Sequence vs. NASDAQ Composite]
58. © 2014 MapR Technologies 58
Thank You
@allenday // @mapr
Now a few slides about MapR’s product…
…and proposed next actions
59. © 2014 MapR Technologies 59
“Quick Start” Package
Engagement includes:
1. Identification of data sources, transformations and reporting engines
2. Access and use of the solution template including source code
3. Training on customizing the solution template to the organization’s requirements
4. Deployment architecture document that enables a production deployment plan for the specific solution
Components: SOLUTION TEMPLATE, KNOWLEDGE TRANSFER, DEPLOYMENT ARCHITECTURE
60. © 2014 MapR Technologies 60
“Quick Start” 1 – Resequencing with Hadoop
• Reduces Storage Hardware Requirements
• Accelerates Data Processing Time
• Minimal impact to existing data pipelines
“Quick Start” 2 – Variant Analysis with NoSQL
• Present data for exploration
• Operationalize complex workflows
• Web-scale performance
61. © 2014 MapR Technologies 62
Genomics Value Chain
                 Sequence Biosample    Secondary Analytics    Tertiary Analytics
Academic R&D     OK, e.g. ILMN XTen    OK (GATK)              Not OK
Pharma R&D       OK, e.g. ILMN XTen    Not OK                 Missing
Clinic Therapy   OK, e.g. ILMN XTen    Missing                Missing
62. © 2014 MapR Technologies 63
Genomics Value Chain
                 Sequence Biosample    Secondary Analytics    Tertiary Analytics
Academic R&D     OK, e.g. ILMN XTen    OK (GATK)              Not OK
Pharma R&D       OK, e.g. ILMN XTen    Not OK                 Missing
Clinic Therapy   OK, e.g. ILMN XTen    Missing                Missing
Annotations: Addressed by Quick Start 1; Addressed by Quick Start 2
63. © 2014 MapR Technologies 64
BONUS ROUND
64. © 2014 MapR Technologies 65
Genealogy Company
Slides credit: Bill Yetman, Hadoop Summit 2014
http://slidesha.re/1vRh3kY
65. © 2014 MapR Technologies 66
GERMLINE is…
• …an algorithm that finds hidden relationships within a pool of DNA
• …the reference implementation of that algorithm written in C++.
• You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
66. © 2014 MapR Technologies 67
Projected GERMLINE run times (in hours)
[Chart: Hours (0 to 700) vs. Samples (2,500 to 122,500); series: GERMLINE run times, Projected GERMLINE run times]
700 hours = 29+ days
EXPONENTIAL COMPLEXITY
67. © 2014 MapR Technologies 68
GERMLINE: What’s the Problem?
• GERMLINE (the implementation) was not meant to be used in an industrial setting
– Stateless, single threaded, prone to swapping (heavy memory usage)
– GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to a crawl
• Put simply: GERMLINE couldn't scale
68. © 2014 MapR Technologies 69
Run times for matching (in hours)
[Chart: Hours (0 to 180) vs. Samples; series: GERMLINE run times, Jermline run times, Projected GERMLINE run times]
Annotations: EXPONENTIAL, LINEAR, HBase Refactor
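The deck does not show how the HBase refactor works, so the following is only a guess at the general idea: keep state between runs in an index keyed by (position, DNA word), so that adding a sample costs a handful of lookups instead of a comparison against every existing sample. The class, the word size, and the in-memory dictionary standing in for an HBase table are all assumptions, not the Jermline schema.

```python
from collections import defaultdict


class SegmentIndex:
    """Toy stand-in for a persistent (position, word) -> samples table."""

    def __init__(self, word_size=4):
        self.word_size = word_size
        self.index = defaultdict(set)

    def _words(self, haplotype):
        """Cut the haplotype into fixed, non-overlapping words."""
        step = self.word_size
        for start in range(0, len(haplotype) - step + 1, step):
            yield start, haplotype[start:start + step]

    def add(self, sample_id, haplotype):
        """Index a new sample; return earlier samples sharing any word."""
        matches = defaultdict(list)
        for key in self._words(haplotype):
            for other in self.index[key]:
                matches[other].append(key[0])
            self.index[key].add(sample_id)
        return dict(matches)


idx = SegmentIndex(word_size=4)
first = idx.add("A", "ACGTACGT")
second = idx.add("B", "ACGTTTTT")
third = idx.add("C", "GGGGACGT")
```

Because earlier samples are never reprocessed, total work grows roughly with the number of new samples and the matches they produce, which is consistent with the "LINEAR" label on the chart.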
69. © 2014 MapR Technologies 70
• Paper submitted describing the implementation
• Releasing as an Open Source project soon
• [HBase Schema/Algorithm Slides]
70. © 2014 MapR Technologies 71
Further Growth & Optimization
71. © 2014 MapR Technologies 72
Underdog (Strand Phasing) performance
– Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation
– With improved accuracy!
Underdog replaces Beagle
[Chart: Total Run Size and Total Beagle-Underdog Duration; axis 0 to 80,000]
72. © 2014 MapR Technologies 73
Pipeline steps and incremental change…
– Incremental change over time
– Supporting the business in a “just in time” Agile way
[Chart: duration of each pipeline step per run; vertical axis 0 to 250,000, horizontal axis values 500 to 395,432]
Series: Beagle-Underdog Phasing, Pipeline Finalize, Relationship Processing, Germline-Jermline Results Processing, Germline-Jermline Processing, Beagle Post Phasing, Admixture, Plink Prep, Pipeline Initialization
Annotations: Jermline replaces Germline; Ethnicity V2 Release; Underdog Replaces Beagle; AdMixture on Hadoop
73. © 2014 MapR Technologies 74
…while the business continues to grow rapidly
[Chart: DNA Database Size (# of processed samples, 0 to 450,000), Jan-12 to Apr-14]
Editor's Notes
clinical 49 Increase GDP by 2% BOOM LSH
This chart shows that MapR-DB (the database in the MapR Enterprise Database Edition, formerly known as M7), shown in blue, consistently reads data quickly with no spikes. Other distributions suffer from periodic “housekeeping” tasks like compactions (defragmentation) and garbage collection, leading to sharp spikes in read delays.
Graph of each step in the pipeline for every run. This graph shows how important it is to measure everything. Some steps have been greatly reduced or eliminated. Light blue is the matching step. You can see it going quadratic and then the change when ‘J’ (Jermline) was released.