HBaseCon 2012 | Leveraging HBase for the World's Largest Curated Genomic Data Collection
1. Leveraging HBase for the World's Largest Curated Genomic Data Collection
Satnam Alag, Ph.D.
VP of Engineering
satnam@nextbio.com
© 2012 NextBio | All rights reserved | This information is proprietary and confidential.
NEXTBIO 2008
3. Genomic Big Data
[Chart: timeline with ticks at 2000, 2003, 2006, 2009, 2012; legible labels: Tumorscape, Internal Data]
4. Use Case 1: HBase to Store Variant Data
• Each genome has ~4 million variants
• Immutable – write once, never change, read many times
• Bloom filters are useful
• Batch import of data via HFile
• Data accessed together is collocated in a region
• HBase cluster kept separate from Hadoop
• All the smarts are in the keys
Rows across the various tables in HBase:
1 Genome        10 Million rows
100 Genomes     1 Billion rows
100K Genomes    1 Trillion rows
100M Genomes    1 Quadrillion rows (1,000,000,000,000,000)
Fortunately, HBase cluster access can be partitioned by the application when required
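The scaling figures above follow from a flat rate of about 10 million rows per genome. A minimal sketch of that arithmetic is below; the per-table breakdown uses the row counts quoted on the later slides for Tables 1 to 3, and reading them as the source of the rounded 10 million is an assumption, not something the deck states.

```python
# Slide figure: roughly 10 million HBase rows per genome across all tables.
ROWS_PER_GENOME = 10_000_000

# Per-dataset row counts quoted on later slides (the dictionary names are hypothetical).
PER_TABLE = {
    "table1_display_order": 4_000_000,    # 1 row per SNP
    "table2_gene_class": 500_000,         # genes x mutation classes
    "table3_chromosome_range": 4_000_000, # 1 row per SNP
}

def total_rows(genomes: int) -> int:
    """Total rows for a given number of genomes at the flat per-genome rate."""
    return genomes * ROWS_PER_GENOME

for genomes in (1, 100, 100_000, 100_000_000):
    print(f"{genomes:>11,} genomes -> {total_rows(genomes):,} rows")

# The three quoted tables sum to 8.5 million, the same order as the rounded 10 million.
print(f"per-table sum: {sum(PER_TABLE.values()):,}")
```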
5. Accessing Data with Pagination
Table 1:
Key: Bioset Id + Display Order
Columns
Pagination Example:
Page 5, Page Size = 100
Retrieve 100 rows from Display Order = 400-500
Number of rows = 1 per SNP, on the order of 4 million
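The pagination scheme above can be sketched as a range scan over composite keys. The sketch below is illustrative only: the fixed-width big-endian encoding, the 0-based display order, and the helper names `row_key`, `page_bounds`, and `scan` are assumptions, and a sorted Python list stands in for the HBase table.

```python
import struct
from bisect import bisect_left

def row_key(bioset_id: int, display_order: int) -> bytes:
    # Big-endian fixed-width fields, so byte-wise key order matches numeric order.
    return struct.pack(">QI", bioset_id, display_order)

def page_bounds(bioset_id: int, page: int, page_size: int):
    """Start key (inclusive) and stop key (exclusive) for a 1-based page number."""
    start = (page - 1) * page_size
    return row_key(bioset_id, start), row_key(bioset_id, start + page_size)

def scan(sorted_keys, start_key: bytes, stop_key: bytes):
    """Stand-in for a range scan: every key with start_key <= key < stop_key."""
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]

# Hypothetical data: two biosets with 1,000 rows each.
keys = sorted(row_key(b, d) for b in (7, 8) for d in range(1000))

start, stop = page_bounds(bioset_id=7, page=5, page_size=100)
rows = scan(keys, start, stop)
first = struct.unpack(">QI", rows[0])
last = struct.unpack(">QI", rows[-1])
print(len(rows), first, last)
```

Because the display order is the trailing part of the key, a page maps to one contiguous key range, so the cost of fetching page 5 does not depend on how many rows precede it.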
6. Accessing Data with Keys
Table 1:
Key: Bioset Id + Display Order
Keys returned by search index
7. Filtering Data with Pagination
Table 1:
Key: Bioset Id + Display Order
Table 2:
Key: Id+GeneId+MutationClass
Columns: Counts, Keys to Table 1
Example:
Gene: ESR1, Class: Missense
Page Size = 100
Retrieve rows from Table 2
Retrieve rows by keys from Table 1
Number of rows: on the order of 0.5 million per dataset (# genes x classes)
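The two-step lookup above can be sketched as follows. This is a minimal illustration, assuming Table 2 stores a match count plus the display orders of the matching Table 1 rows; plain dictionaries stand in for the HBase tables, and all names and sample values are hypothetical.

```python
from typing import NamedTuple

class IndexRow(NamedTuple):
    count: int          # total matches for this gene and mutation class
    table1_keys: list   # display orders of the matching Table 1 rows

# Hypothetical stand-ins for the two tables.
table1 = {(42, d): {"snp": f"rs{d}"} for d in range(1000)}
table2 = {(42, "ESR1", "Missense"): IndexRow(count=3, table1_keys=[5, 17, 230])}

def filtered_page(bioset_id, gene, mutation_class, page, page_size):
    """Return (total match count, rows for the requested 1-based page)."""
    # Step 1: one read from Table 2 gives the count and the Table 1 keys.
    index_row = table2.get((bioset_id, gene, mutation_class))
    if index_row is None:
        return 0, []
    # Step 2: slice the key list for this page and fetch those rows from Table 1.
    start = (page - 1) * page_size
    wanted = index_row.table1_keys[start:start + page_size]
    return index_row.count, [table1[(bioset_id, d)] for d in wanted]

total, rows = filtered_page(42, "ESR1", "Missense", page=1, page_size=100)
print(total, [r["snp"] for r in rows])
```

Storing the count next to the keys lets the application show the total number of hits without touching Table 1.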
8. Powering the Genome Browser
Table 1:
Key: Bioset Id + Display Order
Table 2:
Key: Id+GeneId+MutationClass
Table 3:
Key: Id+ChromosomeId+Range+DisplayOrder
Example:
Chr: 6
Specified Range
Retrieve all rows
1 row per SNP, ~4 million per dataset
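One way to read the Table 3 key is that "Range" is a fixed-width position bucket, so a browser window maps to a contiguous run of keys. The slide does not say how Range is encoded, so the bucket size, the tuple keys, and the helper names below are all assumptions; a sorted list stands in for the table.

```python
from bisect import bisect_left

BUCKET_SIZE = 1_000_000  # assumed width of one "Range" bucket, in base pairs

def table3_key(bioset_id, chromosome, position, display_order):
    # Tuples compare field by field, mimicking a sorted composite row key.
    return (bioset_id, chromosome, position // BUCKET_SIZE, display_order)

def browse(rows, bioset_id, chromosome, start, end):
    """rows: sorted list of (key, position). Returns display orders of SNPs
    with start <= position < end on the given chromosome."""
    lo_key = (bioset_id, chromosome, start // BUCKET_SIZE, 0)
    hi_key = (bioset_id, chromosome, (end - 1) // BUCKET_SIZE + 1, 0)
    # A 1-tuple sorts before any (key, position) pair with the same key.
    lo = bisect_left(rows, (lo_key,))
    hi = bisect_left(rows, (hi_key,))
    # Edge buckets may hold positions outside the window, so filter them out.
    return [key[3] for key, pos in rows[lo:hi] if start <= pos < end]

# Hypothetical SNP positions: four on chromosome 6, one on chromosome 7.
snps = [(6, 150_000), (6, 1_200_000), (6, 2_500_000), (6, 2_900_000), (7, 10)]
rows = sorted((table3_key(1, c, p, i), p) for i, (c, p) in enumerate(snps))

print(browse(rows, bioset_id=1, chromosome=6, start=1_000_000, end=2_600_000))
```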
9. Use Case 2: Correlation Data
10. Use Case 2
• Each Correlation score stored as a row
• HFile created for new score
• Over 20 billion correlations
T1: scorebioset (base table)
key: biosetid_1 [+] biosetid_2
[Figure: correlation matrix with rows and columns B1, B2, …, Bn, Bn+1]
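The pair-keyed layout above can be sketched as follows: when a new bioset is added, its scores against existing biosets become a sorted batch of rows, and all correlations for one bioset come back from a single prefix scan. Storing each pair in both orders is an assumption made here so that either bioset can serve as the scan prefix; the key encoding and helper names are also hypothetical, and a sorted list stands in for both the HFile batch and the table.

```python
import struct
from bisect import bisect_left

def pair_key(bioset_a: int, bioset_b: int) -> bytes:
    # Fixed-width big-endian ids keep byte order aligned with numeric order.
    return struct.pack(">QQ", bioset_a, bioset_b)

def rows_for_new_bioset(new_id: int, scores: dict):
    """scores maps existing bioset id -> correlation score.
    Returns (key, score) rows in both orderings, sorted by key
    (bulk-load files need their rows in key order)."""
    rows = []
    for other_id, score in scores.items():
        rows.append((pair_key(new_id, other_id), score))
        rows.append((pair_key(other_id, new_id), score))
    rows.sort()
    return rows

def correlations_of(table, bioset_id: int) -> dict:
    """Prefix scan over a sorted (key, score) list: all partners of bioset_id."""
    lo = bisect_left(table, (pair_key(bioset_id, 0),))
    hi = bisect_left(table, (pair_key(bioset_id + 1, 0),))
    return {struct.unpack(">QQ", key)[1]: score for key, score in table[lo:hi]}

# Hypothetical scores for a new bioset 3 against existing biosets 1 and 2.
batch = rows_for_new_bioset(3, {1: 0.8, 2: -0.1})
print(correlations_of(batch, 3))
print(correlations_of(batch, 1))
```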
11. Lessons Learnt
• HBase Works Well For
-- Immutable Data
-- Insertions Using HFiles
-- Billions of Rows
-- Intelligence in Key Definition
• Road to Production
-- Redundant Data in Database