Human genetics holds the key to understanding the pathogenesis of many devastating diseases, such as type 2 diabetes and Alzheimer’s disease. The discovery, development, and commercialization of new classes of drugs can take 10 to 15 years and more than $5 billion in R&D investment, only for fewer than 5% of the drugs to make it to market. Committed to creating therapeutic innovations, Regeneron has built one of the world’s most comprehensive genetics databases to supplement our state-of-the-art drug development pipeline. While these massive volumes of data provide an unprecedented opportunity to gain novel therapeutic insights, Regeneron has encountered a number of challenges on the road to delivering on the promises of big data and genomics in drug discovery. For example, how do you enable fast and accurate queries over more than 80 billion data points? And how do you expedite novel statistical tests on terabyte-scale data?
This presentation will share Regeneron’s vision for building a scalable and performant informatics infrastructure to accelerate genetics-driven drug development. Specifically, we highlight key challenges in establishing one of the world’s largest clinical genetics databases, provide an overview of how Regeneron leverages the Databricks Unified Analytics Platform and Apache Spark, and discuss in detail the key engineering innovations that have already come out of this collaboration.
Insights from Building the Future of Drug Discovery with Apache Spark with Lukas Habegger
1. Lukas Habegger, Associate Director Bioinformatics
Regeneron Genetics Center (RGC)
Insights from Building the Future of Drug Discovery with Apache Spark
#EntSAIS14
2. Outline
• Current state of drug discovery and development
• Benefits of leveraging human genetics data
• Overview of the Regeneron Genetics Center (RGC)
• Challenges on the road to delivering on the promises of big data and genomics in drug discovery
• Overview of how the RGC leverages Databricks’ Unified Analytics Platform and Apache Spark
• Discussion of key engineering innovations
• Conclusions & lessons learned
3. Current state of drug discovery and development: Maximizing chances of success with human genetics
• 95% of experimental medicines fail in development; costs exceed $2B per approved drug
• Higher probability of success for drugs with strong human genetics evidence
• >$100B spent on worldwide R&D by the biopharma industry → only 10–20 new drugs per year
• Target bottleneck: <1,000 genes (<5% of all genes) account for the targets of all drugs currently in development
Herper M. Forbes.com. The Truly Staggering Cost of Inventing New Drugs. https://www.forbes.com/sites/matthewherper/2012/02/10/the-truly-staggering-cost-of-inventing-new-drugs/#355471a54a94. Feb. 10, 2012.
Herper M. Forbes.com. How the Staggering Cost of Inventing New Drugs Is Shaping the Future of Medicine. https://www.forbes.com/sites/matthewherper/2013/08/11/how-the-staggering-cost-of-inventing-new-drugs-is-shaping-the-future-of-medicine/#30f1a95113c3. Aug. 11, 2013.
Booth B. Forbes.com. A Billion Here, A Billion There: The Cost of Making a Drug Revisited. https://www.forbes.com/sites/brucebooth/2014/11/21/a-billion-here-a-billion-there-the-cost-of-making-a-drug-revisited/#6034e7f226a8. Nov. 21, 2014.
Nat Genet. 2015 Aug;47(8):856-60. doi: 10.1038/ng.3314. Nat Rev Drug Discov. 2013 Aug;12(8):581-94. doi: 10.1038/nrd4051. Nat Rev Drug Discov. 2017 Jan;16(1):19-34. doi: 10.1038/nrd.2016.230.
You cannot pursue modern drug discovery and development without incorporating human genetics
4. Why is human genetics such a powerful tool for drug discovery?
[Figure: a DNA mutation (example: A → T) can have a loss-of-function, neutral, or gain-of-function impact on the gene product, and a protective, neutral, or damaging impact on disease.]
7. Why is human genetics such a powerful tool for drug discovery?
[Figure: a DNA mutation (example: A → T) can have a loss-of-function, neutral, or gain-of-function impact on the gene product, and a protective, neutral, or damaging impact on disease; a "Drug" element is added next to "Loss-of-function".]
8. PCSK9: A success story where human genetics evidence played a key role in drug development
[Figure: a DNA mutation (example: A → T) and a drug, with their impact on the gene product (loss-of-function, neutral, gain-of-function) and on disease (protective, neutral, damaging).]
• Loss-of-function mutations in PCSK9 protect against heart disease
• Regeneron developed a drug to block PCSK9, which has been shown to be effective in preventing heart disease
9. The goal of the RGC is to build one of the world’s largest genotype-phenotype resources
• Regeneron has a long history of commitment to genetics-based science, and a track record of integrating human genetics into development programs, delivering new medicines to patients
• Regeneron established the Regeneron Genetics Center (RGC) in 2014
• Goal: build one of the world’s most comprehensive genetics databases to supplement our state-of-the-art drug development pipeline
• To date, the RGC has sequenced DNA from more than 300,000 individuals
10. Breadth of human genetics resources: RGC network of 60+ collaborators representing over 1 million samples
[Figure: collaborator network grouped into founder populations, phenotype-specific cohorts, family studies, and general population.]
12. RGC collaboration with UK Biobank: RGC will sequence ~500K participants over 3-5 years
13. Automation is key to enable large-scale data production and analysis
• Automated biobank (1.4M samples)
• Library preparation (>300,000 samples / year)
• Sequencing instruments (>300,000 samples / year)
• 100% cloud-based informatics & analysis
A scalable informatics platform is needed to analyze this data and make it accessible to a broad set of users
14. How do we analyze our data to gain novel insights? Approach and desired goal
• Approach:
  1. Sequence a large number of individuals to identify their mutations
  2. Obtain paired clinical data (traits derived from de-identified electronic medical records)
  3. Test for correlations/associations between all mutations and traits
  4. Mine association results in various ways to gain insights
[Figure: Desired goal. A Mutation Matrix (MM; individuals, mutations) and a Trait Matrix (TM; individuals, traits) feed an analytical engine that produces Association Results (AR; mutation : trait).]
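Step 3 of the approach can be illustrated on toy data. The sketch below is a minimal, single-machine illustration and not the RGC's analytical engine: the matrices, the identifiers (`mut_A`, `trait_X`, and so on), and the choice of a Pearson correlation as the association statistic are all assumptions made for the example.

```python
from itertools import product
from math import sqrt

# Toy Mutation Matrix (MM): one genotype vector per mutation, one entry per
# individual (0, 1, or 2 copies of the alternate allele). Names are made up.
mutation_matrix = {
    "mut_A": [0, 1, 2, 0, 1, 2],
    "mut_B": [1, 0, 1, 0, 1, 0],
}

# Toy Trait Matrix (TM): one measurement per individual, in the same
# individual order as the MM. Names and values are made up.
trait_matrix = {
    "trait_X": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "trait_Y": [5.0, 5.5, 4.0, 6.0, 4.5, 5.0],
}


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def associate(mm, tm):
    """Return one association result per (mutation, trait) pair."""
    return {(m, t): pearson(mm[m], tm[t]) for m, t in product(mm, tm)}


association_results = associate(mutation_matrix, trait_matrix)
```

Every mutation is paired with every trait, so the number of results is the product of the two matrix widths; that product is what makes the real workload hard to scale.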
15. How do we analyze our data to gain novel insights? It’s more complicated – lack of data unification
[Figure: Desired goal vs. reality. Desired goal: the Mutation Matrix (MM) and Trait Matrix (TM) feed an analytical engine that produces Association Results (AR). Reality: the MM, TM, and AR are spread across pVCF and txt files (results files).]
• Data is decentralized and stored in different formats
• Data is organized in different ways (e.g., not squared off, transposed, custom representations and indexing schemes)
• Asking simple questions requires many time-consuming data wrangling steps
16. How do we analyze our data to gain novel insights? It’s more complicated – data from multiple cohorts
[Figure: Desired goal vs. reality. Desired goal: the Mutation Matrix (MM) and Trait Matrix (TM) feed an analytical engine that produces Association Results (AR). Reality: several separate GT, MM, TM, and AR data sets, stored as pVCF and txt files (results files).]
• The RGC has data from multiple collaborators
• Data is not always consistent
• Limited functionality to unify / aggregate matrices from multiple cohorts
17. How do we analyze our data to gain novel insights? It’s more complicated – scalability issues
[Figure: Desired goal vs. reality. Desired goal: the Mutation Matrix (MM) and Trait Matrix (TM) feed an analytical engine that produces Association Results (AR). Reality: the same pipeline annotated with data sizes of 10s of millions, 100s of billions, and 10s of thousands.]
• Large inputs (MM & TM)
• MM x TM cross join
• Massive outputs (AR)
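The scale of the MM x TM cross join follows from simple arithmetic. The sketch below assumes that the slide's "10s of millions" refers to mutations, "10s of thousands" to traits, and "100s of billions" to association results; the specific counts are illustrative values chosen within those ranges, not RGC figures.

```python
from itertools import product


def cross_join_size(n_mutations, n_traits):
    """Number of (mutation, trait) pairs produced by a full cross join."""
    return n_mutations * n_traits


# A toy cross join that can be enumerated directly.
toy_pairs = list(product(["m1", "m2", "m3"], ["t1", "t2"]))

# Illustrative magnitudes only (assumed, not taken from the talk).
n_mutations = 20_000_000  # "10s of millions"
n_traits = 10_000         # "10s of thousands"
n_results = cross_join_size(n_mutations, n_traits)  # 200 billion pairs
```

Because the output grows with the product of the two inputs, adding traits or newly sequenced mutations multiplies the result volume rather than adding to it.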
18. How do we find out what these mutations do? The Databricks solution
• The RGC established a major partnership with Databricks in 2017
• The RGC is leveraging the Databricks Unified Analytics Platform to create a unified data & compute infrastructure:
  1. Developed efficient and unified data representations
  2. Implemented scalable production workflows optimized for analyzing billions of rows
  3. Created a unified codebase to enable all levels of users to perform computation
[Figure: a Mutation Matrix (MM; individuals, mutations) and a Trait Matrix (TM; individuals, traits) feed an analytical engine that produces Association Results (AR; mutation : trait).]
19. The RGC has developed easy-to-use web applications to make the data accessible to a broad set of users
[Figure: Architecture of RGC web applications, showing a web application, a queries library, and a Databricks cluster exchanging queries and results over the MM, TM, and AR data.]
Goal: to enable everyone in the drug development process to easily access, analyze, and extract insights from the RGC’s data
20. The RGC Results Browser enables users to query billions of association results
• Goal: Efficiently search billions of association results across multiple cohorts
• The data set is updated when association results from a new cohort become available
• Size of the current data set: >67 billion association results (>200 billion results for the next update)
21. Optimizations to the ETL workflow have significantly reduced the time to ingest the association results
• Association results are ingested and merged from multiple cohorts
• Spark-based solution scales linearly with cluster size
– Several optimizations have made the process more efficient
– Migration of other QC processes into this workflow enables an end-to-end Spark solution
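The ingest-and-merge step can be pictured as mapping each cohort's raw rows into one shared schema and applying QC along the way. The sketch below is a single-machine illustration under stated assumptions: the field names (`beta`, `p`), the `AssociationResult` schema, the cohort names, and the p-value range check used as a QC rule are all hypothetical, and the talk's actual workflow runs on Spark rather than plain Python.

```python
from typing import Dict, Iterator, List, NamedTuple


class AssociationResult(NamedTuple):
    """Unified schema for one association result (hypothetical)."""
    cohort: str
    mutation: str
    trait: str
    effect: float
    p_value: float


def ingest_cohort(cohort: str, rows: List[Dict[str, str]]) -> Iterator[AssociationResult]:
    """Convert one cohort's raw rows into the unified schema, dropping rows that fail QC."""
    for row in rows:
        p = float(row["p"])
        if not (0.0 <= p <= 1.0):  # hypothetical QC rule: p-values must lie in [0, 1]
            continue
        yield AssociationResult(cohort, row["mutation"], row["trait"], float(row["beta"]), p)


def merge_cohorts(cohorts: Dict[str, List[Dict[str, str]]]) -> List[AssociationResult]:
    """Merge all cohorts into one list, ordered by mutation, then trait, then cohort."""
    merged: List[AssociationResult] = []
    for cohort, rows in cohorts.items():
        merged.extend(ingest_cohort(cohort, rows))
    merged.sort(key=lambda r: (r.mutation, r.trait, r.cohort))
    return merged


# Made-up input: two cohorts reporting on the same mutation, plus one bad row.
raw = {
    "cohort_1": [
        {"mutation": "1:1000:A:T", "trait": "AST", "beta": "-0.12", "p": "1e-9"},
        {"mutation": "1:2000:G:C", "trait": "AST", "beta": "0.03", "p": "1.7"},  # fails QC
    ],
    "cohort_2": [
        {"mutation": "1:1000:A:T", "trait": "AST", "beta": "-0.10", "p": "3e-7"},
    ],
}
merged = merge_cohorts(raw)
```

Folding the QC check into the same pass as the schema conversion mirrors the slide's point about moving QC into the workflow: each row is read once and either kept in unified form or dropped.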
22. Optimizing the partitioning scheme has significantly reduced the query response time
• The input data is naturally organized by cohort, which is not optimized for queries
[Figure: association results (AR) along chromosomal location, comparing gene density, results density, and partition density; the data is range partitioned within chromosomal boundaries, with variable range width and count.]
• Optimizations reduced the query response time from >30 minutes to <3 seconds
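One way to read "variable range width & count" is equal-count range partitioning: each chromosome is cut into ranges that hold roughly the same number of results, so dense regions get narrow ranges and sparse regions get wide ones. The sketch below illustrates that idea on toy positions. It is an assumption about the general technique, not the RGC's actual scheme, and the function names, chromosome labels, and `target_rows` parameter are hypothetical.

```python
from bisect import bisect_right


def build_partition_starts(positions_by_chrom, target_rows):
    """For each chromosome, return the sorted start positions of its partitions.

    Each partition holds roughly `target_rows` results. Partitions never cross
    a chromosome boundary because every chromosome is split independently.
    """
    starts_by_chrom = {}
    for chrom, positions in positions_by_chrom.items():
        ordered = sorted(positions)
        starts = []
        for i in range(0, len(ordered), target_rows):
            # Skip a start equal to the previous one so that all results at
            # the same position stay in the same partition.
            if not starts or ordered[i] > starts[-1]:
                starts.append(ordered[i])
        starts_by_chrom[chrom] = starts
    return starts_by_chrom


def partitions_for_query(starts_by_chrom, chrom, lo, hi):
    """Return the (chrom, index) partitions that can contain positions in [lo, hi]."""
    starts = starts_by_chrom.get(chrom, [])
    first = max(bisect_right(starts, lo) - 1, 0)
    last = bisect_right(starts, hi) - 1
    return [(chrom, i) for i in range(first, last + 1)]


# Toy data: a dense cluster of results followed by a sparse tail.
toy_positions = {
    "chr1": list(range(100, 110)) + [1000, 5000, 9000],
    "chr2": [50, 60],
}
starts = build_partition_starts(toy_positions, target_rows=4)
hits = partitions_for_query(starts, "chr1", 102, 105)
```

A query for a chromosomal region then touches only the few partitions whose ranges overlap it, instead of scanning every cohort's data.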
24. The RGC has recently identified a new potential drug target for treating liver disease
Source: https://endpts.com/the-pcsk9-of-nash-regeneron-and-alnylam-join-forces-to-tackle-a-promising-target-for-severe-liver-diseases/
25. Liver disease can be detected based on enzyme levels in the blood
• Two enzymes are typically analyzed to evaluate liver damage:
– AST (Aspartate transaminase)
– ALT (Alanine transaminase)
• Elevated levels of AST and ALT are indicative of liver damage
– Necessary but not sufficient
• Goal: identify loss-of-function mutations that are associated with lower AST and ALT levels (protective effect)
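The stated goal translates into a filter over association results: keep loss-of-function mutations whose effect on both AST and ALT is negative and statistically significant. The sketch below is a minimal illustration on made-up rows. The field names, the mutation identifiers, and the use of the conventional genome-wide significance threshold of 5e-8 are assumptions, not details from the talk.

```python
# Conventional genome-wide significance threshold (assumed for this example).
GENOME_WIDE_SIGNIFICANCE = 5e-8

# Made-up association results; "effect" < 0 means lower enzyme levels.
results = [
    {"mutation": "mut_1", "is_lof": True, "trait": "AST", "effect": -0.20, "p_value": 1e-12},
    {"mutation": "mut_1", "is_lof": True, "trait": "ALT", "effect": -0.25, "p_value": 4e-15},
    {"mutation": "mut_2", "is_lof": True, "trait": "AST", "effect": 0.30, "p_value": 2e-10},
    {"mutation": "mut_2", "is_lof": True, "trait": "ALT", "effect": 0.10, "p_value": 1e-3},
    {"mutation": "mut_3", "is_lof": False, "trait": "AST", "effect": -0.15, "p_value": 1e-9},
    {"mutation": "mut_3", "is_lof": False, "trait": "ALT", "effect": -0.18, "p_value": 3e-9},
]


def protective_candidates(rows, traits=("AST", "ALT"), alpha=GENOME_WIDE_SIGNIFICANCE):
    """Return loss-of-function mutations significantly associated with lower levels of every trait."""
    hits = {}
    for row in rows:
        if (row["is_lof"] and row["trait"] in traits
                and row["effect"] < 0 and row["p_value"] < alpha):
            hits.setdefault(row["mutation"], set()).add(row["trait"])
    # Keep only mutations that passed the filter for all requested traits.
    return sorted(m for m, seen in hits.items() if seen == set(traits))


candidates = protective_candidates(results)
```

In this toy input only `mut_1` qualifies: `mut_2` raises enzyme levels, and `mut_3` is not a loss-of-function mutation.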
27. Manhattan plot for AST: Several mutations in the genome are associated with this liver trait
What peak / mutation is the most interesting?
[Figure: Manhattan plot for AST; one peak is labeled HSD17B13.]
29. • The mutation of interest is associated with a broad spectrum of liver disease traits
• All of these associations confer protection from liver disease
30. Conclusions & lessons learned
• At Regeneron, our goal is to bring the power of science to medicine and develop new medicines for patients in need
• Incorporating human genetics evidence is critical for pursuing modern drug discovery; the RGC is building one of the world’s largest genetics databases to identify new potential drug targets
• Our strategic partnership with Databricks has enabled us to build a state-of-the-art data science platform from scratch by:
– Developing efficient and unified data representations
– Building out scalable workflows to mine billions of rows and addressing key bottlenecks (e.g., reducing the ETL time from weeks to hours and optimizing the query response time to <3s)
– Creating a unified codebase to enable all levels of users to perform computation
• Most importantly, the Databricks Unified Analytics Platform brings our data, tools, and people together to accelerate innovation
31. Acknowledgements
• RGC-LT
– Alan Shuldiner
– Aris Baras
– Aris Economides
– Jeffrey Reid
– John Overton
• RGC-GI
– Alicia Hawes
– Ashish Yadav
– Claire Chai
– Evan Maxwell
– Gisu Eom
– Jeff Staples
– John Penn
– Leland Barnard
– Shareef Khalid
– Sheldon Bai
– Suganthi Balasubramanian
– Young Hahn
• RGC
– Alexander Li
– Alexander Lopez
– Amy Damask
– Charlie Paulding
– Claudia Schurmann
– Colm O’Dushlaine
– Cristopher Van Hout
– Dylan Sun
– Jan Freudenberg
– Kavita Praveen
– Kia Manoochehri
– Lauren Gurski
– Manasi Pradhan
– Mike Norsen
– Nehal Gosalia
– Nila Banerjee
– Rick Ulloa
– Shane McCarthy
– Tanya Teslovich Dostal
– Tony Marcketta
• Databricks
– Ali Ghodsi
– Ali Hodroj
– Allan Marcos
– Ambareesh Kulkarni
– Bavesh Patel
– Christopher Hoshino-Fish
– David Weaver
– Francis Gerace
– Hossein Falaki
– Ion Stoica
– Juliusz Sompolski
– Li Yu
– Navid Bazzazzadeh
– Paris Georgallis
– Ram Sriharsha
– Ronak Shah
– Shiva Bhattacharjee
– Vida Ha
– Yongsheng Huang
• REGN-IT
– Abdul Shaik
– Allen Chiang
– Brandon Fetch
– Christopher McCabe
– Dale Cochran
– David Glosser
– Long Le
– Michael Phillips
– Mohammad Saeed
– Pat Leblanc
– Sal Mineo
– Shaw Nawaz
– Shiva Ravi
– Stephen Huvane
– Vin Dahake
– Weylin Preodor