Insights from Building the Future of Drug Discovery with Apache Spark with Lukas Habegger

Lukas Habegger, Associate Director Bioinformatics
Regeneron Genetics Center (RGC)
#EntSAIS14

Outline
• Current state of drug discovery and development
• Benefits of leveraging human genetics data
• Overview of the Regeneron Genetics Center (RGC)
• Challenges on the road to delivering on the promises of big data and genomics in drug discovery
• Overview of how the RGC leverages Databricks’ Unified Analytics Platform and Apache Spark
• Discussion of key engineering innovations
• Conclusions & lessons learned

Current state of drug discovery and development:
Maximizing chances of success with human genetics
• 95% of experimental medicines fail in development; costs exceed $2B per approved drug
• Higher probability of success for drugs with strong human genetics evidence
• >$100B spent on worldwide R&D by biopharma industry → only 10–20 new drugs per year
• Target bottleneck: <1,000 genes (<5% of all genes) account for targets of all drugs currently in development
Herper M. Forbes.com. The Truly Staggering Cost of Inventing New Drugs. https://www.forbes.com/sites/matthewherper/2012/02/10/the-truly-staggering-cost-of-inventing-new-drugs/#355471a54a94. Feb. 10, 2012.
Herper M. Forbes.com. How the Staggering Cost of Inventing New Drugs Is Shaping the Future of Medicine. https://www.forbes.com/sites/matthewherper/2013/08/11/how-the-staggering-cost-of-inventing-new-drugs-is-shaping-the-future-of-medicine/#30f1a95113c3. Aug. 11, 2013.
Booth B. Forbes.com. A Billion Here, A Billion There: The Cost of Making a Drug Revisited. https://www.forbes.com/sites/brucebooth/2014/11/21/a-billion-here-a-billion-there-the-cost-of-making-a-drug-revisited/#6034e7f226a8. Nov. 21, 2014.
Nat Genet. 2015 Aug;47(8):856-60. doi: 10.1038/ng.3314. Nat Rev Drug Discov. 2013 Aug;12(8):581-94. doi: 10.1038/nrd4051. Nat Rev Drug Discov. 2017 Jan;16(1):19-34. doi: 10.1038/nrd.2016.230.
You cannot pursue modern drug discovery and development without incorporating human genetics

Why is human genetics such a powerful tool for drug
discovery?
Figure: a DNA mutation (example: A → T) is classified by its impact on the gene product (loss-of-function, neutral, gain-of-function) and by its impact on disease (protective, neutral, damaging)

PCSK9: A success story where human genetics
evidence played a key role in drug development
• Loss-of-function mutations in PCSK9 protect against heart disease
• Regeneron developed a drug to block PCSK9, which has been shown to be effective in preventing heart disease
Figure: a DNA mutation (example: A → T), its impact on the gene product (loss-of-function, neutral, gain-of-function), and its impact on disease, with a drug added to the diagram

The goal of the RGC is to build one of the world’s largest
genotype-phenotype resources
• Regeneron has a long history of commitment to genetics-based science, and a track record of integrating human genetics into development programs, delivering new medicines to patients
• Regeneron established the Regeneron Genetics Center (RGC) in 2014
• Goal: build one of the world’s most comprehensive genetics databases to supplement our state-of-the-art drug development pipeline
• To date, the RGC has sequenced DNA from more than 300,000 individuals

Breadth of human genetics resources: RGC network of
60+ collaborators representing over 1 million samples
Figure categories: founder populations, phenotype-specific cohorts, family studies, general population

RGC collaboration with UK Biobank: RGC will sequence
~500K participants over 3-5 years

Automation is key to enable large-scale data production
and analysis
• Automated biobank (1.4M samples)
• Library preparation (>300,000 samples / year)
• Sequencing instruments (>300,000 samples / year)
• 100% cloud-based informatics & analysis
A scalable informatics platform is needed to analyze this data and make it accessible to a broad set of users

How do we analyze our data to gain novel insights?
Approach and desired goal
• Approach:
1. Sequence a large number of individuals to identify their mutations
2. Obtain paired clinical data (traits derived from de-identified electronic medical records)
3. Test for correlations/associations between all mutations and traits
4. Mine association results in various ways to gain insights
• Desired goal: a Mutation Matrix (MM: individuals × mutations) and a Trait Matrix (TM: individuals × traits) feed an analytical engine that produces Association Results (AR: mutation : trait)
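Step 3 of the approach (test every mutation against every trait) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RGC's pipeline: the toy matrices, the names `mutation_matrix`, `trait_matrix`, and `associate`, and the use of a plain Pearson correlation as a stand-in for the real association test are all hypothetical.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no measurable association
    return cov / sqrt(var_x * var_y)


def associate(mutation_matrix, trait_matrix):
    """Cross join every mutation with every trait and score each pair."""
    results = []
    pairs = product(mutation_matrix.items(), trait_matrix.items())
    for (mutation, genotypes), (trait, values) in pairs:
        results.append(
            {"mutation": mutation, "trait": trait, "r": pearson(genotypes, values)}
        )
    return results


# Hypothetical toy data: 4 individuals, made-up values.
# MM: allele counts per individual; TM: one measurement per individual.
mutation_matrix = {"mut_1": [0, 1, 1, 2], "mut_2": [0, 0, 0, 0]}
trait_matrix = {"trait_a": [1.0, 2.0, 2.0, 3.0], "trait_b": [5.0, 5.0, 4.0, 1.0]}

association_results = associate(mutation_matrix, trait_matrix)
```

The output has one row per mutation : trait pair, so its size is the product of the two matrix widths. That product is what makes the cross join expensive at scale.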

It’s more complicated – lack of data unification
Figure (desired goal vs. reality): in the desired state, MM (individuals × mutations) and TM (individuals × traits) feed an analytical engine that produces AR (mutation : trait); in reality, the matrices and results are spread across separate pVCF and txt files
• Data is decentralized and stored in different formats
• Data is organized in different ways (e.g., not squared off, transposed, custom representations and indexing schemes)
• Asking simple questions requires many time-consuming data wrangling steps

It’s more complicated – data from multiple cohorts
Figure (desired goal vs. reality): in the desired state, a single MM and TM feed an analytical engine that produces AR; in reality, there are several sets of matrices (GT, MM, TM) and results (AR), stored as pVCF and txt files
• The RGC has data from multiple collaborators
• Data is not always consistent
• Limited functionality to unify / aggregate matrices from multiple cohorts

It’s more complicated – scalability issues
• Large inputs (MM & TM)
• MM x TM cross join
• Massive outputs (AR)
Figure (desired goal vs. reality): the MM, TM, analytical engine, and AR diagram, with scale annotations of 10s of millions, 100s of billions, and 10s of thousands

How do we find out what these mutations do?
The Databricks solution
• RGC established a major partnership with Databricks in 2017
• RGC is leveraging the Databricks Unified Analytics Platform to create a unified data & compute infrastructure:
1. Developed efficient and unified data representations
2. Implemented scalable production workflows optimized for analyzing billions of rows
3. Created a unified codebase to enable all levels of users to perform computation
Figure: MM (individuals × mutations) and TM (individuals × traits) feed an analytical engine that produces AR (mutation : trait)

The RGC has developed easy-to-use web applications
to make the data accessible to a broad set of users
Figure: Architecture of RGC web applications. Components: Web Application, Queries Library, Databricks Cluster; labeled flows: Query, Results
Goal: to enable everyone in the drug development process to easily access, analyze, and extract insights from the RGC’s data

The RGC Results Browser enables users to query
billions of association results
• Goal: Efficiently search billions of association results across multiple cohorts
• The data set is updated when association results from a new cohort become available
• Size of the current data set: >67 billion association results (>200 billion results for the next update)

Optimizations to the ETL workflow have significantly
reduced the time to ingest the association results
• Association results are ingested and merged from multiple cohorts
• Spark-based solution scales linearly with cluster size
– Several optimizations have made the process more efficient
– Migration of other QC processes into this workflow enables an end-to-end Spark solution
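The "ingested and merged from multiple cohorts" step can be pictured with a small pure-Python sketch. It only shows the shape of such a merge and makes several assumptions: results are keyed by (mutation, trait), each row carries a `p_value` field, and the cohort names and values are invented. The production workflow described on the slide runs on Spark and is not shown here.

```python
from collections import defaultdict


def merge_cohorts(cohort_results):
    """Merge per-cohort association results into one table.

    The merged table is keyed by (mutation, trait); each entry maps
    cohort name -> p-value, so one lookup answers a query across cohorts.
    """
    merged = defaultdict(dict)
    for cohort, rows in cohort_results.items():
        for row in rows:
            key = (row["mutation"], row["trait"])
            merged[key][cohort] = row["p_value"]
    return dict(merged)


# Hypothetical input: two cohorts with partly overlapping results.
cohort_results = {
    "cohort_x": [
        {"mutation": "mut_1", "trait": "trait_a", "p_value": 1e-8},
        {"mutation": "mut_2", "trait": "trait_a", "p_value": 0.4},
    ],
    "cohort_y": [
        {"mutation": "mut_1", "trait": "trait_a", "p_value": 3e-5},
    ],
}

merged = merge_cohorts(cohort_results)
```

Keying on the pair rather than on the cohort is the design choice that matters: the input arrives organized by cohort, while queries ask about a mutation or trait across all cohorts.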

Optimizing the partitioning scheme has significantly
reduced the query response time
• The input data is naturally organized by cohort, which is not optimized for queries
Figure labels: AR; chromosomal location; gene density; results density; chromosomal boundaries; partition density; variable range width & count; range partitioned
• Optimizations reduced the query response time from >30 minutes to <3 seconds
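One way to read the "variable range width & count" and "range partitioned" labels is as equal-count range partitioning along the genome: ranges are narrow where results are dense and wide where they are sparse, so each partition holds a similar number of rows. The sketch below illustrates that general idea in plain Python. It is an assumption about the technique, not the RGC's actual scheme, and the positions, partition count, and function names are hypothetical.

```python
from bisect import bisect_right


def range_boundaries(positions, num_partitions):
    """Choose split points so each range holds roughly the same number of results.

    boundaries[i] is the first position that belongs to partition i + 1.
    """
    ordered = sorted(positions)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(position, boundaries):
    """Index of the partition whose range contains the given position."""
    return bisect_right(boundaries, position)


def partitions_for_range(start, end, boundaries):
    """Partitions a query over [start, end] must read; all others are skipped."""
    first = partition_for(start, boundaries)
    last = partition_for(end, boundaries)
    return list(range(first, last + 1))


# Hypothetical positions: dense near the start, sparse afterwards.
positions = [100, 110, 120, 130, 140, 150, 5000, 9000]
boundaries = range_boundaries(positions, num_partitions=4)

# Count how many results land in each partition.
counts = [0, 0, 0, 0]
for p in positions:
    counts[partition_for(p, boundaries)] += 1

touched = partitions_for_range(125, 145, boundaries)
```

A query for a narrow genomic window then reads only the partitions whose ranges overlap it, rather than scanning every cohort's results.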

Demo notebook: mining association results and
extracting key insights

The RGC has recently identified a new potential drug
target for treating liver disease
Source: https://endpts.com/the-pcsk9-of-nash-regeneron-and-alnylam-join-forces-to-tackle-a-promising-target-for-severe-liver-diseases/

Liver disease can be detected based on enzyme levels
in the blood
• Two enzymes are typically analyzed to evaluate liver damage:
– AST (Aspartate transaminase)
– ALT (Alanine transaminase)
• Elevated levels of AST and ALT are indicative of liver damage
– Necessary but not sufficient
• Goal: identify loss-of-function mutations that are associated with lower AST and ALT levels (protective effect)
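The goal stated above amounts to a filter over association results. A minimal sketch, assuming each result row has `mutation`, `consequence`, `trait`, `effect` (negative meaning lower enzyme levels), and `p_value` fields. These field names, the toy rows, and the 5e-8 cutoff (a commonly used genome-wide significance threshold) are illustrative assumptions, not the RGC's schema or criteria.

```python
def protective_lof_candidates(results, traits=("AST", "ALT"), p_threshold=5e-8):
    """Return loss-of-function mutations significantly associated with
    lower levels of every trait in `traits`."""
    hits = {}
    for row in results:
        if (
            row["consequence"] == "loss_of_function"
            and row["trait"] in traits
            and row["effect"] < 0  # negative effect = lower enzyme level
            and row["p_value"] < p_threshold
        ):
            hits.setdefault(row["mutation"], set()).add(row["trait"])
    # Keep only mutations that passed the filter for all requested traits.
    return sorted(m for m, seen in hits.items() if seen == set(traits))


# Hypothetical rows; mutation names and numbers are made up.
results = [
    {"mutation": "mut_1", "consequence": "loss_of_function", "trait": "AST", "effect": -0.3, "p_value": 1e-12},
    {"mutation": "mut_1", "consequence": "loss_of_function", "trait": "ALT", "effect": -0.4, "p_value": 1e-15},
    {"mutation": "mut_2", "consequence": "loss_of_function", "trait": "AST", "effect": -0.2, "p_value": 1e-10},
    {"mutation": "mut_3", "consequence": "loss_of_function", "trait": "AST", "effect": 0.5, "p_value": 1e-20},
    {"mutation": "mut_3", "consequence": "loss_of_function", "trait": "ALT", "effect": 0.6, "p_value": 1e-20},
    {"mutation": "mut_4", "consequence": "missense", "trait": "AST", "effect": -0.3, "p_value": 1e-12},
    {"mutation": "mut_4", "consequence": "missense", "trait": "ALT", "effect": -0.3, "p_value": 1e-12},
]

candidates = protective_lof_candidates(results)
```

Only `mut_1` passes here: `mut_2` lacks an ALT association, `mut_3` raises rather than lowers the enzymes, and `mut_4` is not loss-of-function.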

Manhattan plot for AST: Several mutations in the
genome are associated with this liver trait
Which peak / mutation is the most interesting?
Figure: Manhattan plot for AST with the HSD17B13 peak labeled

• The mutation of interest is associated with a broad spectrum of liver disease traits
• All of these associations confer protection from liver disease

Conclusions & lessons learned
• At Regeneron, our goal is to bring the power of science to medicine and develop new medicines for patients in need
• Incorporating human genetics evidence is critical for pursuing modern drug discovery; the RGC is building one of the world’s largest genetics databases to identify new potential drug targets
• Our strategic partnership with Databricks has enabled us to build a state-of-the-art data science platform from scratch by:
– Developing efficient and unified data representations
– Building out scalable workflows to mine billions of rows and addressing key bottlenecks (e.g., reducing the ETL time from weeks to hours and optimizing the query response time to <3s)
– Creating a unified codebase to enable all levels of users to perform computation
• Most importantly, the Databricks Unified Analytics Platform brings our data, tools, and people together to accelerate innovation

Acknowledgements
• RGC-LT
– Alan Shuldiner
– Aris Baras
– Aris Economides
– Jeffrey Reid
– John Overton
• RGC-GI
– Alicia Hawes
– Ashish Yadav
– Claire Chai
– Evan Maxwell
– Gisu Eom
– Jeff Staples
– John Penn
– Leland Barnard
– Shareef Khalid
– Sheldon Bai
– Suganthi Balasubramanian
– Young Hahn
• RGC
– Alexander Li
– Alexander Lopez
– Amy Damask
– Charlie Paulding
– Claudia Schurmann
– Colm O’Dushlaine
– Cristopher Van Hout
– Dylan Sun
– Jan Freudenberg
– Kavita Praveen
– Kia Manoochehri
– Lauren Gurski
– Manasi Pradhan
– Mike Norsen
– Nehal Gosalia
– Nila Banerjee
– Rick Ulloa
– Shane McCarthy
– Tanya Teslovich Dostal
– Tony Marcketta
• Databricks
– Ali Ghodsi
– Ali Hodroj
– Allan Marcos
– Ambareesh Kulkarni
– Bavesh Patel
– Christopher Hoshino-Fish
– David Weaver
– Francis Gerace
– Hossein Falaki
– Ion Stoica
– Juliusz Sompolski
– Li Yu
– Navid Bazzazzadeh
– Paris Georgallis
– Ram Sriharsha
– Ronak Shah
– Shiva Bhattacharjee
– Vida Ha
– Yongsheng Huang
• REGN-IT
– Abdul Shaik
– Allen Chiang
– Brandon Fetch
– Christopher McCabe
– Dale Cochran
– David Glosser
– Long Le
– Michael Phillips
– Mohammad Saeed
– Pat Leblanc
– Sal Mineo
– Shaw Nawaz
– Shiva Ravi
– Stephen Huvane
– Vin Dahake
– Weylin Preodor

Questions?
https://tinyurl.com/yaqwl2bt
We are hiring!
