SlideShare uma empresa Scribd logo
1 de 22
Processing Genetic Data at
Scale
Mark Schroering
Software Architect
@MarkSchroering
#IndyCloudConf @MarkSchroering
About Me
Mark Schroering
Twitter: @MarkSchroering
Github: mschroering
Life sciences company headquartered in
Indianapolis with offices in Salt Lake City
and Research Triangle Park (RTP)
#IndyCloudConf @MarkSchroering
Sequencing costs have dropped significantly
https://www.researchgate.net/figure/The-change-over-time-for-cost-per-raw-megabase-of-
DNA-sequencing-Source_fig1_261879801
#IndyCloudConf @MarkSchroering
Precision Medicine is on the rise
Source: Frost & Sullivan
#IndyCloudConf @MarkSchroering
A lot of genetic data is being
generated
A human genome has ~3 billion base pairs
AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCA
GTCAGATTTACCCTGGCTCACCTTGGCGTCGCGTCCGGCGGCAAACTA
AGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCT
About 0.1% of the genome is different among
individuals
~3 million germline variants (mutations) per person
#IndyCloudConf @MarkSchroering
Traditionally variant data has been transferred
and shared as files - Variant Call Format (VCF)
#CHROM POS ID REF ALT
QUAL
chr1 120056534 . G A .
chr3 178936091 . G A .
chr11 108198392 . T TA .
chr12 69233096 . C T .
chr13 32913764 . A G .
CLI tools exist that can search and transform the data
>> ./vcftools --vcf input_data.vcf --chr1 --from-bp 1000000 --to-bp 2000000
#IndyCloudConf @MarkSchroering
Variants can be “annotated” to include additional
information from other sources
##INFO=<ID=CLNSRCID,Number=.,Type=String,Description="Variant Clinical Channel IDs">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - unknown, 1 - untested,
2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7
- histocompatibility, 255 - other">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 985955 rs199476396 G C . .
CLNSRCID=103320.0001;CLNSIG=5;
1 1199489 rs207460006 G A . .
CLNSRCID=.;CLNSIG=1;
ClinVar - Database of variants that pertain to human health
COSMIC - Database of Cancer variants
#IndyCloudConf @MarkSchroering
The Problem
Create a cloud based solution that
can store and efficiently analyze
large genomic data sets to
provide meaningful insights to
clinicians and researchers in a
responsive manner
#IndyCloudConf @MarkSchroering
Solution
Data Lake
Indexed Variants
Variant Annotations
MicroservicesAPI Gateway
Spark
Transformation
Jobs
Batch Annotation
Jobs
Application Analytics Storage Ingestion
#IndyCloudConf @MarkSchroering
Ingestion - Convert from legacy file formats
Variants File Data lake in S3Apache Parquet
Apache Parquet provides efficient columnar storage and integrates with technologies like Spark,
Redshift Spectrum, and Athena
https://github.com/lifeomic/spark-vcf - Natively load variant files into a Spark Dataframe/Dataset
#IndyCloudConf @MarkSchroering
Ingestion - Variant Annotation
Variants File(s) Data lake in S3
Legacy variant annotation tools are CLI based which made AWS Batch a good candidate for the
annotation process.
ClinVar
COSMIC
Dockerized
annotations tools
#IndyCloudConf @MarkSchroering
Ingestion - Lessons Learned
Utilize DynamoDB on-demand provisioning for
tables that have unpredictable spikes in
read/write capacity (released Nov’ 18)
● DynamoDB capacity auto-scaling is slow to
react to spikes in throughput
● With on-demand provisioning, you pay per
request
● Request rates are still capped by max table
throughput and account limits
#IndyCloudConf @MarkSchroering
Ingestion - Lessons Learned
Be aware of limits (hard and soft) put in place by your cloud provider on
compute resources. Large spikes of ingestion requests can result in failures.
Solutions:
● Add rate limiting to your API and force clients to slow down
● Add a queue to capture requests and process them when resources are
available
#IndyCloudConf @MarkSchroering
Ingestion - Lessons Learned
Utilize Spot (Preemptible) compute to save cost for big data ingestion tasks
Solutions:
● Use a Batch Spot Compute Environment and set the retry strategy for jobs
to >= 1 to allow jobs to be retried should an instance get terminated
● Use Spot Instance Fleets in EMR
● Have monitoring in place to get notified when jobs fail
#IndyCloudConf @MarkSchroering
Analytics - Index Variants and Annotations
Data Lake
Variant attributes needed for analytics are stored in PostgreSQL. Full annotation records are stored in
DynamoDB.
Indexed variants
in PostgreSQL
Annotations
#IndyCloudConf @MarkSchroering
Analytics - Lessons Learned
Partition large tables for better query performance
Solutions:
● PostgreSQL offers table partitioning by range (defined by a key column) or
explicit listings. After partitioning a large table we saw dramatic
improvements in query performance. Updates to a partition also do not
impact query performance of other partitions.
#IndyCloudConf @MarkSchroering
Analytics - Lessons Learned
Use a data lake for storing raw data
Solutions:
● Store raw data in S3 in a big data friendly format like Apache Parquet
○ Do not throw any data away. You may need it later
○ Can rebuild indexed data stores or create new ones as needed using
the raw data
#IndyCloudConf @MarkSchroering
Application - Provide query results to client
The Lambda function executes a query against the PostgreSQL database and joins annotation records
from DynamoDB.
Indexed variants
in PostgreSQL
Annotations
API Gateway Microservice
#IndyCloudConf @MarkSchroering
Application - Lessons Learned
Use reader endpoints to get better performance
for AWS Aurora databases
● High write load caused by large ingestion
jobs will not impact query performance for
clients needing read only access
● Reader endpoint load balances for query
intensive applications
#IndyCloudConf @MarkSchroering
Current Storage and Performance Stats
● ~101,706,611 processed unique variants
● S3 Storage: ~289 GB
● DynamoDB Storage: ~94 GB
● PostgreSQL Snapshot: ~3.3 TB
● Query Response Times
○ Min: ~214ms
○ Avg: ~1s
○ Max: ~3.5s
#IndyCloudConf @MarkSchroering
“A system is never finished being developed until it
ceases to be used.”
@jerryweinberg
Questions?
Mark Schroering
Software Architect
@MarkSchroering

Mais conteúdo relacionado

Mais procurados

Stream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobStream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobDatabricks
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataDataWorks Summit
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Vectorized R Execution in Apache Spark
Vectorized R Execution in Apache SparkVectorized R Execution in Apache Spark
Vectorized R Execution in Apache SparkDatabricks
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
 
Big Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al EssaBig Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al EssaSpark Summit
 
Dataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSDataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSAlasdair Gray
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...AWS Chicago
 
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...Uri Savelchev
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Databricks
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeDatabricks
 
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken SanfordMigrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken SanfordSri Ambati
 
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike LimcacoExtending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike LimcacoData Con LA
 
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkReal-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkDatabricks
 
GraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandGraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandOntotext
 
Real-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesReal-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesDatabricks
 

Mais procurados (20)

Stream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobStream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the Job
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Vectorized R Execution in Apache Spark
Vectorized R Execution in Apache SparkVectorized R Execution in Apache Spark
Vectorized R Execution in Apache Spark
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
Big Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al EssaBig Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al Essa
 
Dataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSDataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLS
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat Detection
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...
 
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken SanfordMigrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
 
Extending Analytic Reach
Extending Analytic ReachExtending Analytic Reach
Extending Analytic Reach
 
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike LimcacoExtending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
 
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkReal-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
 
Query O
Query OQuery O
Query O
 
GraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandGraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on Demand
 
Real-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesReal-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on Kubernetes
 

Semelhante a Processing genetic data at scale

Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCMark Smith
 
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Ahsan Javed Awan
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriChetan Khatri
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 Databricks
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileRoy Kim
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudRick Bilodeau
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudStreamsets Inc.
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
Big data at AWS Chicago User Group - 2014
Big data at AWS Chicago User Group - 2014Big data at AWS Chicago User Group - 2014
Big data at AWS Chicago User Group - 2014AWS Chicago
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...Andre Essing
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...Amazon Web Services
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015Iulia Emanuela Iancuta
 

Semelhante a Processing genetic data at scale (20)

Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
 
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Big data at AWS Chicago User Group - 2014
Big data at AWS Chicago User Group - 2014Big data at AWS Chicago User Group - 2014
Big data at AWS Chicago User Group - 2014
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 

Último

UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 

Último (20)

UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 

Processing genetic data at scale

  • 1. Processing Genetic Data at Scale Mark Schroering Software Architect @MarkSchroering
  • 2. #IndyCloudConf @MarkSchroering About Me Mark Schroering Twitter: @MarkSchroering Github: mschroering Life sciences company headquartered in Indianapolis with offices in Salt Lake City and Research Triangle Park (RTP)
  • 3. #IndyCloudConf @MarkSchroering Sequencing costs have dropped significantly https://www.researchgate.net/figure/The-change-over-time-for-cost-per-raw-megabase-of- DNA-sequencing-Source_fig1_261879801
  • 4. #IndyCloudConf @MarkSchroering Precision Medicine is on the rise Source: Frost & Sullivan
  • 5. #IndyCloudConf @MarkSchroering A lot of genetic data is being generated A human genome has ~3 billion base pairs AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCA GTCAGATTTACCCTGGCTCACCTTGGCGTCGCGTCCGGCGGCAAACTA AGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCT About 0.1% of the genome is different among individuals ~3 million germline variants (mutations) per person
  • 6. #IndyCloudConf @MarkSchroering Traditionally variant data has been transferred and shared as files - Variant Call Format (VCF) #CHROM POS ID REF ALT QUAL chr1 120056534 . G A . chr3 178936091 . G A . chr11 108198392 . T TA . chr12 69233096 . C T . chr13 32913764 . A G . CLI tools exist that can search and transform the data >> ./vcftools --vcf input_data.vcf --chr1 --from-bp 1000000 --to-bp 2000000
  • 7. #IndyCloudConf @MarkSchroering Variants can be “annotated” to include additional information from other sources ##INFO=<ID=CLNSRCID,Number=.,Type=String,Description="Variant Clinical Channel IDs"> ##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - unknown, 1 - untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other"> #CHROM POS ID REF ALT QUAL FILTER INFO 1 985955 rs199476396 G C . . CLNSRCID=103320.0001;CLNSIG=5; 1 1199489 rs207460006 G A . . CLNSRCID=.;CLNSIG=1; ClinVar - Database of variants that pertain to human health COSMIC - Database of Cancer variants
  • 8. #IndyCloudConf @MarkSchroering The Problem Create a cloud based solution that can store and efficiently analyze large genomic data sets to provide meaningful insights to clinicians and researchers in a responsive manner
  • 9. #IndyCloudConf @MarkSchroering Solution Data Lake Indexed Variants Variant Annotations MicroservicesAPI Gateway Spark Transformation Jobs Batch Annotation Jobs Application Analytics Storage Ingestion
  • 10. #IndyCloudConf @MarkSchroering Ingestion - Convert from legacy file formats Variants File Data lake in S3Apache Parquet Apache Parquet provides efficient columnar storage and integrates with technologies like Spark, Redshift Spectrum, and Athena https://github.com/lifeomic/spark-vcf - Natively load variant files into a Spark Dataframe/Dataset
  • 11. #IndyCloudConf @MarkSchroering Ingestion - Variant Annotation Variants File(s) Data lake in S3 Legacy variant annotation tools are CLI based which made AWS Batch a good candidate for the annotation process. ClinVar COSMIC Dockerized annotations tools
  • 12. #IndyCloudConf @MarkSchroering Ingestion - Lessons Learned Utilize DynamoDB on-demand provisioning for tables that have unpredictable spikes in read/write capacity (released Nov’ 18) ● DynamoDB capacity auto-scaling is slow to react to spikes in throughput ● With on-demand provisioning, you pay per request ● Request rates are still capped by max table throughput and account limits
  • 13. #IndyCloudConf @MarkSchroering Ingestion - Lessons Learned Be aware of limits (hard and soft) put in place by your cloud provider on compute resources. Large spikes of ingestion requests can result in failures. Solutions: ● Add rate limiting to your API and force clients to slow down ● Add a queue to capture requests and process them when resources are available
  • 14. #IndyCloudConf @MarkSchroering Ingestion - Lessons Learned Utilize Spot (Preemptible) compute to save cost for big data ingestion tasks Solutions: ● Use a Batch Spot Compute Environment and set the retry strategy for jobs to >= 1 to allow jobs to be retried should an instance get terminated ● Use Spot Instance Fleets in EMR ● Have monitoring in place to get notified when jobs fail
  • 15. #IndyCloudConf @MarkSchroering Analytics - Index Variants and Annotations Data Lake Variant attributes needed for analytics are stored in PostgreSQL. Full annotation records are stored in DynamoDB. Indexed variants in PostgreSQL Annotations
  • 16. #IndyCloudConf @MarkSchroering Analytics - Lessons Learned Partition large tables for better query performance Solutions: ● PostgreSQL offers table partitioning by range (defined by a key column) or explicit listings. After partitioning a large table we saw dramatic improvements in query performance. Updates to a partition also do not impact query performance of other partitions.
  • 17. #IndyCloudConf @MarkSchroering Analytics - Lessons Learned Use a data lake for storing raw data Solutions: ● Store raw data in S3 in a big data friendly format like Apache Parquet ○ Do not throw any data away. You may need it later ○ Can rebuild indexed data stores or create new ones as needed using the raw data
  • 18. #IndyCloudConf @MarkSchroering Application - Provide query results to client The Lambda function executes a query against the PostgreSQL database and joins annotation records from DynamoDB. Indexed variants in PostgreSQL Annotations API Gateway Microservice
  • 19. #IndyCloudConf @MarkSchroering Application - Lessons Learned Use reader endpoints to get better performance for AWS Aurora databases ● High write load caused by large ingestion jobs will not impact query performance for clients needing read only access ● Reader endpoint load balances for query intensive applications
  • 20. #IndyCloudConf @MarkSchroering Current Storage and Performance Stats ● ~101,706,611 processed unique variants ● S3 Storage: ~289 GB ● DynamoDB Storage: ~94 GB ● PostgreSQL Snapshot: ~3.3 TB ● Query Response Times ○ Min: ~214ms ○ Avg: ~1s ○ Max: ~3.5s
  • 21. #IndyCloudConf @MarkSchroering “A system is never finished being developed until it ceases to be used.” @jerryweinberg