Processing genetic data at scale

Mark Schroering
Software Architect
@MarkSchroering

#IndyCloudConf @MarkSchroering
About Me
Mark Schroering
Twitter: @MarkSchroering
Github: mschroering
Life sciences company headquartered in
Indianapolis with offices in Salt Lake City
and Research Triangle Park (RTP)

Sequencing costs have dropped significantly
https://www.researchgate.net/figure/The-change-over-time-for-cost-per-raw-megabase-of-DNA-sequencing-Source_fig1_261879801

Precision Medicine is on the rise
Source: Frost & Sullivan

A lot of genetic data is being generated
A human genome has ~3 billion base pairs
AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCA
GTCAGATTTACCCTGGCTCACCTTGGCGTCGCGTCCGGCGGCAAACTA
AGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCT
About 0.1% of the genome differs between individuals
~3 million germline variants (mutations) per person

Traditionally, variant data has been transferred and shared as files in Variant Call Format (VCF)
#CHROM POS ID REF ALT QUAL
chr1 120056534 . G A .
chr3 178936091 . G A .
chr11 108198392 . T TA .
chr12 69233096 . C T .
chr13 32913764 . A G .
CLI tools exist that can search and transform the data
>> ./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000
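The region filter performed by the CLI can also be sketched in a few lines of Python. This is a minimal illustration, not how vcftools is implemented: the `Variant` tuple and the `parse_vcf` and `in_region` helpers are hypothetical names, and only the first five whitespace-separated columns are read.

```python
from typing import Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf(lines) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping blank and '#' header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split()[:5]
        yield Variant(chrom, int(pos), ref, alt)


def in_region(variants, chrom, start, end):
    """Keep variants on `chrom` whose position lies in [start, end]."""
    return [v for v in variants if v.chrom == chrom and start <= v.pos <= end]


SAMPLE = """#CHROM POS ID REF ALT QUAL
chr1 120056534 . G A .
chr3 178936091 . G A .
chr11 108198392 . T TA .
"""

hits = in_region(parse_vcf(SAMPLE.splitlines()), "chr1", 120000000, 121000000)
print(hits)
```

Scanning a flat file this way touches every record for every query, which is one reason the file-based approach becomes slow at scale.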

Variants can be “annotated” to include additional information from other sources
##INFO=<ID=CLNSRCID,Number=.,Type=String,Description="Variant Clinical Channel IDs">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - unknown, 1 - untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 985955 rs199476396 G C . . CLNSRCID=103320.0001;CLNSIG=5;
1 1199489 rs207460006 G A . . CLNSRCID=.;CLNSIG=1;
ClinVar - Database of variants that pertain to human health
COSMIC - Database of Cancer variants
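An annotated INFO column is a semicolon-separated list of KEY=VALUE pairs, so it can be decoded with a small parser. The sketch below uses the CLNSIG codes from the header shown above; the `parse_info` and `clinical_significance` helpers are hypothetical names, and the code assumes a single CLNSIG value per record and treats a missing CLNSIG as "unknown".

```python
# Code-to-label mapping copied from the CLNSIG header description.
CLNSIG_LABELS = {
    "0": "unknown", "1": "untested", "2": "non-pathogenic",
    "3": "probable-non-pathogenic", "4": "probable-pathogenic",
    "5": "pathogenic", "6": "drug-response", "7": "histocompatibility",
    "255": "other",
}


def parse_info(info: str) -> dict:
    """Split 'KEY=VALUE;KEY=VALUE;' into a dict; keys without '=' map to True."""
    fields = {}
    for item in info.strip(";").split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def clinical_significance(info: str) -> str:
    """Translate the CLNSIG code into its label (assumption: one code per record)."""
    code = parse_info(info).get("CLNSIG", "0")
    return CLNSIG_LABELS.get(code, "other")


print(clinical_significance("CLNSRCID=103320.0001;CLNSIG=5;"))
```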

The Problem
Create a cloud-based solution that can store and efficiently analyze large genomic data sets to provide meaningful insights to clinicians and researchers in a responsive manner

Solution
[Architecture diagram]
Layers: Application, Analytics, Storage, Ingestion
Components: API Gateway, Microservices, Indexed Variants, Variant Annotations, Data Lake, Spark Transformation Jobs, Batch Annotation Jobs

Ingestion - Convert from legacy file formats
[Diagram: Variants File, Apache Parquet, Data lake in S3]
Apache Parquet provides efficient columnar storage and integrates with technologies like Spark, Redshift Spectrum, and Athena
https://github.com/lifeomic/spark-vcf - Natively load variant files into a Spark Dataframe/Dataset
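To illustrate why a columnar layout suits analytics, the sketch below pivots row-oriented variant records into per-column lists using plain Python. It is not Parquet and not the spark-vcf loader; it only shows the layout idea, and `to_columns` is a hypothetical helper. Real Parquet files add encoding, compression, and per-column statistics on top of this.

```python
def to_columns(rows, names):
    """Pivot row tuples into a dict of column name -> list of values."""
    columns = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


rows = [
    ("chr1", 120056534, "G", "A"),
    ("chr3", 178936091, "G", "A"),
    ("chr11", 108198392, "T", "TA"),
]
table = to_columns(rows, ("chrom", "pos", "ref", "alt"))

# A query that only needs positions reads one column and ignores the rest.
max_pos = max(table["pos"])
print(max_pos)
```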

Ingestion - Variant Annotation
[Diagram: Variants File(s), Dockerized annotation tools (ClinVar, COSMIC), Data lake in S3]
Legacy variant annotation tools are CLI-based, which made AWS Batch a good candidate for the annotation process.

Ingestion - Lessons Learned
Utilize DynamoDB on-demand capacity for tables that have unpredictable spikes in read/write capacity (released Nov '18)
● DynamoDB capacity auto-scaling is slow to react to spikes in throughput
● With on-demand capacity, you pay per request
● Request rates are still capped by max table throughput and account limits
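As a rough sketch, an on-demand table is requested by setting the billing mode to `PAY_PER_REQUEST` and omitting provisioned throughput. The helper, table name, and key name below are hypothetical; the function only builds the parameter dictionary so it can be inspected without AWS credentials.

```python
def on_demand_table_params(table_name: str, hash_key: str) -> dict:
    """Build create_table parameters for an on-demand (pay-per-request) table."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": hash_key, "AttributeType": "S"},
        ],
        "KeySchema": [{"AttributeName": hash_key, "KeyType": "HASH"}],
        # No ProvisionedThroughput block: capacity is billed per request.
        "BillingMode": "PAY_PER_REQUEST",
    }


# Hypothetical table and key names, for illustration only.
params = on_demand_table_params("variant-annotations", "variant_id")
print(params["BillingMode"])

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("dynamodb").create_table(**params)
```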

Be aware of limits (hard and soft) put in place by your cloud provider on compute resources. Large spikes of ingestion requests can result in failures.
Solutions:
● Add rate limiting to your API and force clients to slow down
● Add a queue to capture requests and process them when resources are available
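One common way to implement the rate-limiting option is a token bucket. The sketch below is a generic, single-process illustration and not necessarily what the speaker's API uses; the `TokenBucket` class and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Consume a token and return True, or return False if none are left."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=5, capacity=10)
print(bucket.allow())
```

An ingestion endpoint would call `allow()` per request and answer with HTTP 429 when it returns False, which pushes clients to back off and retry later.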

Utilize Spot (Preemptible) compute to save cost for big data ingestion tasks
Solutions:
● Use a Batch Spot Compute Environment and set the retry strategy for jobs to >= 1 to allow jobs to be retried should an instance get terminated
● Use Spot Instance Fleets in EMR
● Have monitoring in place to get notified when jobs fail
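A minimal sketch of a Batch job request with retries enabled follows. It reads the ">= 1" above as "at least one retry"; AWS Batch itself counts total attempts (1 to 10), so one retry corresponds to two attempts. The helper and the job, queue, and definition names are hypothetical.

```python
def spot_job_request(job_name, job_queue, job_definition, retries=2):
    """Build submit_job parameters that tolerate Spot interruptions."""
    if retries < 1:
        raise ValueError("need at least one retry to survive a Spot interruption")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Batch counts attempts, not retries: N retries means N + 1 attempts,
        # capped at the service maximum of 10.
        "retryStrategy": {"attempts": min(retries + 1, 10)},
    }


# Hypothetical names, for illustration only.
request = spot_job_request("annotate-variants", "spot-queue", "annotator:1")
print(request["retryStrategy"])

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("batch").submit_job(**request)
```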

Analytics - Index Variants and Annotations
[Diagram: Data Lake, Indexed variants in PostgreSQL, Annotations]
Variant attributes needed for analytics are stored in PostgreSQL. Full annotation records are stored in DynamoDB.

Analytics - Lessons Learned
Partition large tables for better query performance
Solutions:
● PostgreSQL offers table partitioning by range (defined by a key column) or by list (explicit key values). After partitioning a large table we saw dramatic improvements in query performance. Updates to one partition also do not impact query performance of other partitions.
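The sketch below generates PostgreSQL declarative partitioning DDL (available since PostgreSQL 10) for a list-partitioned table. The slides do not say which key was used, so partitioning by chromosome, the table name, and the column set are all assumptions made for illustration; `partition_ddl` is a hypothetical helper.

```python
def partition_ddl(parent: str, column: str, values: list) -> list:
    """Return DDL for a list-partitioned parent table and one child per value.

    Identifiers are interpolated directly, so only pass trusted names.
    """
    statements = [
        f"CREATE TABLE {parent} (chrom text NOT NULL, pos bigint NOT NULL, "
        f"ref text NOT NULL, alt text NOT NULL) PARTITION BY LIST ({column});"
    ]
    for value in values:
        statements.append(
            f"CREATE TABLE {parent}_{value} PARTITION OF {parent} "
            f"FOR VALUES IN ('{value}');"
        )
    return statements


ddl = partition_ddl("variants", "chrom", ["chr1", "chr2", "chr3"])
for statement in ddl:
    print(statement)
```

A query that filters on the partition key, such as `WHERE chrom = 'chr1'`, lets the planner skip the other partitions entirely.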

Analytics - Lessons Learned
Use a data lake for storing raw data
Solutions:
● Store raw data in S3 in a big data friendly format like Apache Parquet
○ Do not throw any data away. You may need it later
○ Can rebuild indexed data stores or create new ones as needed using the raw data

Application - Provide query results to client
[Diagram: API Gateway, Microservice, Indexed variants in PostgreSQL, Annotations]
The Lambda function executes a query against the PostgreSQL database and joins annotation records from DynamoDB.
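The join step can be sketched as follows, with the two data stores replaced by in-memory stand-ins so the example runs anywhere. The function names, the `variant_id` key, and the record contents are hypothetical; in a real Lambda, `variant_rows` would come from a PostgreSQL query and `fetch_annotations` would wrap a DynamoDB batch read.

```python
def attach_annotations(variant_rows, fetch_annotations):
    """Join indexed variant rows with their full annotation records by id."""
    ids = [row["variant_id"] for row in variant_rows]
    annotations = fetch_annotations(ids)  # one batched lookup, not one per row
    return [
        {**row, "annotation": annotations.get(row["variant_id"])}
        for row in variant_rows
    ]


# In-memory stand-ins with made-up values, not real annotations.
variant_rows = [
    {"variant_id": "v1", "chrom": "chr1", "pos": 120056534},
    {"variant_id": "v2", "chrom": "chr3", "pos": 178936091},
]
annotation_store = {"v1": {"note": "example annotation"}}


def fetch_annotations(ids):
    return {i: annotation_store[i] for i in ids if i in annotation_store}


results = attach_annotations(variant_rows, fetch_annotations)
print(results)
```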

Application - Lessons Learned
Use reader endpoints to get better performance for AWS Aurora databases
● High write load caused by large ingestion jobs will not impact query performance for clients needing read only access
● Reader endpoint load balances for query intensive applications
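In application code, this usually comes down to choosing a connection host per workload. The sketch below is a hypothetical illustration: the `AuroraEndpoints` class and both hostnames are placeholders, not the speaker's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuroraEndpoints:
    writer: str  # cluster endpoint: accepts reads and writes
    reader: str  # reader endpoint: load-balances across read replicas

    def for_query(self, read_only: bool) -> str:
        """Send read-only work to the reader endpoint, everything else to the writer."""
        return self.reader if read_only else self.writer


# Placeholder hostnames, for illustration only.
endpoints = AuroraEndpoints(
    writer="writer.db.example.internal",
    reader="reader.db.example.internal",
)
print(endpoints.for_query(read_only=True))
```

Ingestion jobs would connect with `read_only=False`, while the client-facing query service would always use the reader host.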

Current Storage and Performance Stats
● ~101,706,611 processed unique variants
● S3 Storage: ~289 GB
● DynamoDB Storage: ~94 GB
● PostgreSQL Snapshot: ~3.3 TB
● Query Response Times
○ Min: ~214ms
○ Avg: ~1s
○ Max: ~3.5s

“A system is never finished being developed until it ceases to be used.”
@jerryweinberg

Questions?
Mark Schroering
Software Architect
@MarkSchroering
