Diamond Age Data Science and Zafgen, Inc. co-present their work on using bioinformatics data effectively at a small therapeutics company.
Eleanor Howe, PhD, CEO of Diamond Age, presents on the different types of computational biologist, the characteristics of a good bioinformatics team, and the pluses and minuses of using deep learning/AI in a discovery biology context.
Huseyin Mehmet, VP of Discovery Research at Zafgen, describes his team's work with Diamond Age and how it uses their capabilities to inform Zafgen's drug development. He discusses biotech companies' need for a diverse, experienced bioinformatics team.
Using Bioinformatics Data to Inform Therapeutics Discovery and Development
1. From data to insights and action: Strategies to take your bioinformatics to the next level
Eleanor Howe, Diamond Age Data Science
Huseyin Mehmet, Zafgen, Inc.
December 7, 2018
2. What is this talk about?
• Who are we? What is computational biology?
• Lessons learned from working with our customers
• Our ongoing relationship with Zafgen
• Q&A
3. Eleanor Howe, PhD
Background in molecular biology, statistics, programming and computational biology/bioinformatics
eleanor@diamondage.com
4. Diamond Age Data Science
www.diamondage.com
Bioinformatics/computational biology consulting
Project-based analysis
Staff augmentation
Pipeline development
“Drop-in” bioinformatics department
The Diamond Age: or, A Young Lady’s Illustrated Primer, by Neal Stephenson
5. Team
Chris Friedline: Sequencing, software engineering
Somdutta Saha: Computational chemistry and proteomics
Bruce Romano: Mathematics and data science
Nicholas Crawford: Human genetics and GWAS
Mike DeRan: Cancer and diabetes therapeutics, scRNA-seq
Max Marin: RNA splicing
Zarko Boskovic: Medicinal chemistry and metabolomics
Chris Dwan: IT and data security
7. Computational Biology
Computational biology is data science for biology.
Bioinformatics is sometimes a synonym for computational biology.
Other times, bioinformatics refers to software engineering for biology.
9. Drug discovery requires evaluation of diverse, complex data
• Sequence analysis is very different from proteomics
• Knowing the landscape of available datasets is key
• Individual bioinformaticians tend to specialize in one sub-field or another
10. Public datasets are a gold mine
• Cancer Cell Line Encyclopedia (CCLE)
• The Cancer Genome Atlas
• Gene Expression Omnibus
• Cancer Dependency Map (DepMap)
• UK Biobank
• DrugBank
• VarSome
• GTEx
11. But the real gems come from your own experiments
It’s not possible to validate a drug target using public datasets alone.
The public datasets are general, and cover only the most common diseases or disease subtypes.
The most useful results come from combining custom-generated data with public data.
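As a rough illustration of combining custom-generated data with public data, the Python sketch below joins a hypothetical in-house expression table to a hypothetical public reference table by gene identifier and reports the log2 fold change for genes present in both. The gene names, the values, and the function name `compare_to_reference` are invented for illustration and do not come from any real dataset.

```python
import math


def compare_to_reference(in_house, public):
    """Return log2(in-house / public) for genes found in both tables.

    Both arguments map a gene identifier to a positive expression value.
    Genes missing from either table are skipped, not imputed.
    """
    shared = sorted(set(in_house) & set(public))
    return {gene: math.log2(in_house[gene] / public[gene]) for gene in shared}


# Hypothetical values, for illustration only.
in_house = {"GENE_A": 40.0, "GENE_B": 5.0, "GENE_C": 12.0}
public = {"GENE_A": 10.0, "GENE_B": 20.0, "GENE_D": 7.0}

log2_fc = compare_to_reference(in_house, public)
for gene, value in log2_fc.items():
    print(f"{gene}: log2 fold change = {value:+.1f}")
```

In practice the hard part is usually upstream of this join: agreeing on gene identifiers, units, and normalization across the two sources.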
12. CROs do the basics well
• Ocean Ridge, Novogene ($200 transcriptome!)
• Good for the basics - RNA-seq, DNA-seq, proteomics, metabolomics
• Reasonable standardized analysis pipelines
• Challenges:
  • combining multiple datasets across experiments or across CROs
  • more involved analysis (e.g. splicing)
• Do a thorough cost-comparison when considering an academic collaborator
  • Also ask them when their student is graduating.
13. What additional expertise do you need?
Early-stage “traditional” therapeutics companies don’t need a full-time computational biologist. Part time can work fine.
When the company expands, hire a computational biologist with substantial experience, or an analyst with some kind of advisor available.
14. Know what you need
Computational biologist: Experience/training in all three areas (biology, programming, statistics)
Analyst: Biology + programming, with an advisor to help with the statistics
Methods developer: Wants to build new analytical tools
15. What expertise do you need?
For Teams:
• Cross-discipline expertise
  - biology, chemistry, computer science, statistics
• Communication skills
• Lateral thinking
16. Expertise gets you fast answers
The problem:
Get a terabyte of data from a USB hard drive to the cloud in time to analyze a dataset for a conference
The solution:
Bicycle across the Charles
3 Gb/s bicycle (latency of 1.2M ms)
Datacenter internet connection
Markley Data Center
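The comparison behind the bicycle can be sketched in a few lines of Python. Only the terabyte payload and the 1.2M ms transit time come from the slide; the function name `upload_seconds` and the two network link speeds are assumptions chosen for illustration, and protocol overhead is ignored.

```python
def upload_seconds(size_bytes, link_bits_per_second):
    """Time to push size_bytes through a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


ONE_TERABYTE = 10**12                   # bytes (decimal terabyte)
BIKE_RIDE_SECONDS = 1_200_000 / 1000    # 1.2M ms from the slide

# Hypothetical link speeds in bits per second; not taken from the talk.
links = {
    "100 Mb/s office uplink": 100 * 10**6,
    "1 Gb/s office uplink": 10**9,
}

for name, speed in links.items():
    hours = upload_seconds(ONE_TERABYTE, speed) / 3600
    print(f"{name}: {hours:.1f} hours to upload 1 TB")

print(f"Bike ride to the datacenter: {BIKE_RIDE_SECONDS / 60:.0f} minutes")
```

The bicycle has terrible latency and excellent bandwidth, which is the right trade-off when the deadline depends on the whole dataset arriving rather than on the first byte.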
19. Deep Learning / Artificial Intelligence
Deep learning is “new” in that it’s a more complex version of older technology: a neural network
Modern compute power allows for powerful classifiers trained on very large datasets
20. The basics of machine learning (and DL)
Deep Learning works in a similar way to other types of machine learning.
The algorithms use larger datasets and are more complex. But the overall workflow is the same.
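The shared workflow (split the data, train on one part, evaluate on the held-out part) can be sketched with a deliberately simple model. The example below uses a nearest-centroid classifier on synthetic two-feature data; the classifier choice, the function names, and the data are all assumptions for illustration, not anything shown in the talk. A deep network would replace `train` and `predict`, while the surrounding steps stay the same.

```python
import random


def split(samples, test_fraction, seed):
    """Shuffle and split (features, label) pairs into train and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train(samples):
    """Fit a nearest-centroid model: one mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest in squared distance."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: distance(model[label]))


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(model, features) == label for features, label in samples)
    return hits / len(samples)


# Synthetic, well-separated data; not real measurements.
rng = random.Random(0)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], "control") for _ in range(50)]
data += [([rng.gauss(6, 1), rng.gauss(6, 1)], "treated") for _ in range(50)]

train_set, test_set = split(data, test_fraction=0.2, seed=1)
model = train(train_set)
print(f"held-out accuracy: {accuracy(model, test_set):.2f}")
```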
21. Should you use deep learning?
Is your training data:
• Large? 100,000+ to 1M+ samples.
• Well-annotated? Gene expression data usually isn’t.
• Representative of the questions you want to answer?
In discovery biology, the data is usually not there. Hence “discovery”.
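The checklist above can be written down as a small helper that lists the reasons a dataset may not suit deep learning. This is only a sketch: the function name `deep_learning_concerns` and the example dataset are hypothetical, and the 100,000-sample default simply mirrors the slide's lower bound as a rule of thumb.

```python
def deep_learning_concerns(n_samples, well_annotated, representative,
                           min_samples=100_000):
    """Return the reasons a training set may not suit deep learning.

    min_samples follows the slide's lower bound of 100,000 samples;
    treat it as a rule of thumb, not a hard threshold.
    """
    concerns = []
    if n_samples < min_samples:
        concerns.append(
            f"only {n_samples:,} samples (rule of thumb: {min_samples:,}+)"
        )
    if not well_annotated:
        concerns.append("labels are sparse or unreliable")
    if not representative:
        concerns.append("data does not cover the question being asked")
    return concerns


# Hypothetical discovery-biology dataset: a few hundred expression profiles.
concerns = deep_learning_concerns(
    n_samples=300, well_annotated=False, representative=True
)
for concern in concerns:
    print("-", concern)
```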
22. Good use-cases for deep learning
• Image processing
  • Diagnostics from histology, radiology
  • High-content screening
• Biochemical structure/sequence
  • Epitope prediction
  • Protein folding (DeepMind)
• Single-cell RNA-seq (potentially)
23. Should you use deep learning? (cont)
Do you need an interpretable model?
  Deep learning is a black box
Have you tried everything else?
  Linear models, random forests, other ML techniques
These tools are often faster, cheaper, and easier to understand and implement
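To make the interpretability point concrete, here is a minimal sketch of the simplest baseline on that list: a one-variable linear model fitted by ordinary least squares. The dose and response values and the function name `fit_line` are hypothetical. The fitted slope and intercept can be read and sanity-checked directly, which is what a black-box model does not offer.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical dose-response readings, for illustration only.
dose = [0.0, 1.0, 2.0, 3.0, 4.0]
response = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(dose, response)
print(f"response = {slope:.2f} * dose + {intercept:.2f}")
```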
26. Zafgen, Inc.
• Publicly traded bio-pharmaceutical company
• Founded 12 years ago (IPO in 2014)
• Virtual company
• Bringing MetAP2 inhibitors to market
• Areas of interest: Metabolic disease
27. Zafgen and Diamond Age
Diamond Age acts as a virtual bioinformatics
department for Zafgen
• Data Analysis
• Data Management
• Hypothesis generation
• Technology recommendations
28. What Diamond Age has done for Zafgen
• Transcriptional profiling
• Proteomics/phosphoproteomics
• Metabolomics
• Clinical outcomes
• Custom apps for client needs
29. The benefits
What can Zafgen do now that it couldn’t before?
• Iterative data generation
• Cross-dataset analyses
• Confidence in analysis results from CROs
• Link between pre-clinical and clinical data
• Cost efficiencies / value for money