20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser University

Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience

GOOGLE CONFIDENTIAL
Google Cloud
Run your apps on the same system as Google

Google is good at handling massive volumes of data
uploads per minute: 400hrs
users: 500M+
search index: 100PB+
query response time: 0.25s

Google can handle massive volumes of genomic data
uploads per minute: 400hrs (~8 WGS)
users: 500M+ (>100x US PhDs)
search index: 100PB+ (~1M WGS)
query response time: 0.25s (0.25s)

Deep Neural Networks: Algorithms that Learn
● Modernization of artificial neural networks
● Made of simple mathematical units,
organized in layers, that together can
compute some (arbitrary) function
● more layers = deeper = more general
● Learn from raw, heterogeneous data
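As a minimal sketch of "simple mathematical units, organized in layers", the snippet below wires two tiny layers together by hand. The weights are hypothetical and hand-picked (not learned) so that the stack computes XOR, a function that no single linear unit can represent; a real network would learn such weights from data.

```python
def dense(inputs, weights, biases):
    """One layer: each unit is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]


def relu(values):
    """Simple non-linearity applied between layers."""
    return [max(0.0, v) for v in values]


def forward(x, layers):
    """Apply each (weights, biases) layer in order, with ReLU between layers."""
    for i, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if i < len(layers) - 1:
            x = relu(x)
    return x


# Hypothetical hand-picked weights (an assumption for illustration only).
XOR_LAYERS = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),  # hidden: h1 = relu(a+b), h2 = relu(a+b-1)
    ([[1.0, -2.0]], [0.0]),                   # output: h1 - 2*h2
]

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, forward([a, b], XOR_LAYERS))
```

Removing the hidden layer leaves a single weighted sum, which cannot separate the XOR cases; that is the sense in which extra layers make the model more general.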

Image understanding is (getting) better than human level
ImageNet Challenge: Given an image, predict one of 1000+ classes
[Chart: % errors, with human performance marked*]
* Human performance based on analysis done by Andrej Karpathy.

ImageNet Challenge
“Given an image, predict one of 1000+ classes”
Image credit: 360phot0.blogspot.com

AI & ML: what you need to know
Artificial Intelligence: Make Intelligent Machines
programming a computer to be intelligent is hard
Machine Learning: Make Machines Learn
programming a computer to learn to be intelligent is easier and progress is measurable

Google Genomics
August 2015

Google Genomics is more than infrastructure
General-purpose cloud infrastructure: Virtual Machines & Storage, Data Services & Tools
Genomics-specific features: Genomics API

Google’s vision to tackle complex health data
BioQuery Analysis Engine
Medical Records, Genomics, Devices, Imaging, Patient Reports
Baseline Study Data, Public Data, Private Data
Pharma, Health Providers, …

BASELINE STUDY: BIG DATA ANALYSIS
3.75 TERABYTES PER HUMAN
1.00 TB GENOME
2.00 TB EPIGENOME
0.70 TB TRANSCRIPTOME
0.06 TB METABOLOME
0.04 TB PROTEOME
~1 MB STANDARD LAB TESTS
5-YR LONGITUDINAL STUDY
Validate a pipeline to process complex phenotypic, biochemical,
and genomic data
● Pilot Study (N=200)
○ Determine optimal biospecimen collection strategy for stable sampling
and reproducible assays
○ Determine optimal assay methodology
○ Validate quality control methods
○ Validate device data against surrogate and primary endpoints
● Baseline Study (N=10,000+)
○ 6 cohorts from low to high risk for cardiovascular disease and cancer
○ Characterize human systems biology
○ Define normal values for a given parameter in heterogeneous states
○ Predict meaningful events
○ Validate wearable devices for human monitoring
○ Characterize transitions in disease state

Knowledge: populations cluster together

Bioinformatics scientist: BigQuery enables fast tertiary analysis

Google Cloud Platform
Dataflow + BigQuery
Dataflow: Used for Extract, Transform, Load (ETL), analytics, real-time computation and process orchestration. cloud.google.com/dataflow
BigQuery: Run SQL queries against multi-terabyte datasets in seconds. cloud.google.com/bigquery
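To make "SQL for tertiary analysis" concrete, here is a small sketch of the kind of aggregate query one might run over genotype calls. It uses Python's built-in sqlite3 as a local stand-in for BigQuery, and the table layout (`genotypes` with `sample`, `population`, `variant`, `alt_count`) is a hypothetical schema invented for this example, not the schema of any Google dataset.

```python
import sqlite3


def allele_frequencies(rows):
    """Compute alternate-allele frequency per population and variant.

    rows: iterable of (sample, population, variant, alt_count), where
    alt_count is 0, 1 or 2 copies of the alternate allele in a diploid sample.
    """
    conn = sqlite3.connect(":memory:")  # local stand-in, not BigQuery
    conn.execute(
        "CREATE TABLE genotypes "
        "(sample TEXT, population TEXT, variant TEXT, alt_count INTEGER)"
    )
    conn.executemany("INSERT INTO genotypes VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT population,
               variant,
               SUM(alt_count) * 1.0 / (2 * COUNT(*)) AS alt_freq
        FROM genotypes
        GROUP BY population, variant
        ORDER BY population, variant
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Made-up toy data for illustration.
TOY_ROWS = [
    ("s1", "POP_A", "chr1:1000:A>G", 2),
    ("s2", "POP_A", "chr1:1000:A>G", 1),
    ("s3", "POP_B", "chr1:1000:A>G", 0),
    ("s4", "POP_B", "chr1:1000:A>G", 1),
]

for row in allele_frequencies(TOY_ROWS):
    print(row)
```

The same GROUP BY pattern is what lets per-population summaries be computed in one pass; on a hosted warehouse the query text would be similar while the connection code would differ.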

Dataflow + BigQuery

TensorFlow
Released in Nov. 2015
#1 repository for “machine learning” category on GitHub

Transfer Learning
Quickly able to Learn New Concepts
“t-rex”, “quidditch”
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

TensorFlow powered Cucumber Sorter

Google’s Carbon-Neutral, Self-Optimizing Data Centers
The Dalles, Oregon, USA
⬇40% Data Center cooling energy
⬇15% overall Power Usage Effectiveness (PUE) overhead

Verily: Assisting Pathologists in Detecting Cancer with Deep Learning
research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html
Prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint. We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset.
Model generalized very well, even to images that were acquired from a different hospital using different scanners. For full details, see our paper “Detecting Cancer Metastases on Gigapixel Pathology Images”.

Integration with Geospatial, Management, and Terrestrial Sensor Data
anezconsulting.com/precision-agronomy/

Descartes Labs - Google Cloud Customer
medium.com/@stevenpbrumby/corn-in-the-usa-d487dce84ee1
Cloud ML Engine, TensorFlow

Phenomobile, http://www.mdpi.com/2073-4395/4/3/349/htm
See also: http://www.genomes2fields.org/

Temporo-Spatial Imaging of Growing Plants

Genomics & Genetics Problems:
How to Start Applying DNNs?
Must-haves for deep learning:
● Lots of data: >50k examples, >1M examples ideal
● High-quality input and labels for training
● Label ~ F(data): the function F is unknown, but one certainly exists
● High-quality previous efforts, so we know that DNNs are key
○ i.e. hard to solve with classical statistical approaches
SNP and indel calling from NGS data

Verily | Confidential & Proprietary
Calling genetic variation may seem easy...

... but lots of places in the genome are difficult

Creating a universal SNP and small indel variant caller with deep neural networks
Ryan Poplin, Cory McLean, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam Gross, Madeleine Cule, Peyton Greenside, Justin Zook, Marc Salit, Mark DePristo
Verily Life Sciences, October 2016

DNN (Inception V3) Predicts True Genotype from Pileup Images
Input: Millions of labeled pileup images from gold standard samples (raw pixels)
Output: Probability of diploid genotype states { HOM_REF, HET, HOM_VAR }
{ 0.001, 0.994, 0.005 }
{ 0.001, 0.990, 0.009 }
{ 0.000, 0.001, 0.999 }
{ 0.600, 0.399, 0.001 }
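The output vectors on this slide can be turned into genotype calls by taking the most probable state. The sketch below does that and also attaches a Phred-scaled confidence; the Phred scaling and the cap of 99 are assumptions added for illustration and are not stated in the slides or claimed to be what DeepVariant reports.

```python
import math

GENOTYPES = ("HOM_REF", "HET", "HOM_VAR")


def call_genotype(probs, max_quality=99):
    """Pick the most likely diploid genotype from a 3-way probability vector.

    Returns (genotype, quality), where quality is -10*log10(P(call is wrong)),
    rounded and capped at max_quality (an illustrative convention).
    """
    if len(probs) != len(GENOTYPES):
        raise ValueError("expected one probability per genotype state")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    best = max(range(len(probs)), key=lambda i: probs[i])
    p_error = 1.0 - probs[best]
    if p_error <= 0.0:
        quality = max_quality
    else:
        quality = min(max_quality, int(round(-10.0 * math.log10(p_error))))
    return GENOTYPES[best], quality


# The four example outputs shown on the slide.
SLIDE_OUTPUTS = [
    (0.001, 0.994, 0.005),
    (0.001, 0.990, 0.009),
    (0.000, 0.001, 0.999),
    (0.600, 0.399, 0.001),
]

for probs in SLIDE_OUTPUTS:
    print(probs, call_genotype(probs))
```

The last vector shows why a confidence value is useful: HOM_REF is the top state, but at 0.600 the call is far less certain than the other three.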

Using deep learning for ultra-accurate mutation detection
Input: Millions of labeled pileup image stacks from gold standard sample (raw pixels)
Output: Probability distribution over the three diploid genotype states { HOM_REF, HET, HOM_VAR }
{ 0.001, 0.994, 0.005 }
{ 0.001, 0.990, 0.009 }
{ 0.000, 0.001, 0.999 }
{ 0.600, 0.399, 0.001 }

Example DNA read pileup “images”
Panels: true SNPs, true indels, false variants
red = {A,C,G,T}. green = {quality score}. blue = {read strand}. alpha = {matches ref genome}.
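The channel scheme above can be sketched as a small encoder that maps each aligned base to an RGBA pixel. Only the channel meanings come from the slide; the specific intensity values, the quality cap of 40, and the restriction to gapless reads spanning the whole window are hypothetical simplifications for this example.

```python
# Hypothetical intensity choices; the slide only says which channel carries what.
BASE_TO_RED = {"A": 250, "C": 180, "G": 100, "T": 30}
FORWARD_BLUE, REVERSE_BLUE = 240, 70
MATCH_ALPHA, MISMATCH_ALPHA = 255, 50


def encode_pixel(base, quality, is_forward, ref_base, max_quality=40):
    """Encode one aligned base as an (R, G, B, A) tuple."""
    red = BASE_TO_RED.get(base, 0)                       # base identity
    green = int(255 * min(quality, max_quality) / max_quality)  # base quality
    blue = FORWARD_BLUE if is_forward else REVERSE_BLUE  # read strand
    alpha = MATCH_ALPHA if base == ref_base else MISMATCH_ALPHA  # matches ref?
    return (red, green, blue, alpha)


def encode_pileup(reference, reads):
    """Build an image with one row per read and one column per position.

    reads: list of (bases, qualities, is_forward); each read must cover the
    whole reference window with no indels (a simplification).
    """
    image = []
    for bases, qualities, is_forward in reads:
        if not (len(bases) == len(qualities) == len(reference)):
            raise ValueError("read must span the reference window")
        image.append([
            encode_pixel(b, q, is_forward, r)
            for b, q, r in zip(bases, qualities, reference)
        ])
    return image


toy_image = encode_pileup(
    "ACGT",
    [("ACGT", [40, 40, 20, 0], True),    # forward read, matches reference
     ("ATGT", [40, 40, 40, 40], False)], # reverse read, mismatch at column 1
)
for row in toy_image:
    print(row)
```

A column where many rows share a low alpha value is what a candidate variant looks like in this encoding, which is the visual pattern an image classifier can pick up on.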

PrecisionFDA: unique opportunity with blinded truth sample
NA12878

DeepVariant won an award at PrecisionFDA competition
[Chart values: 99.85, 99.70, 98.91]
● Overall F-measure combines SNP and indel performance
● Blinded sample shows no overfitting to NA12878 with Verily’s pipelines
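For readers unfamiliar with the metric, the F-measure is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive and false-negative counts; pooling SNP and indel counts to get an "overall" value is one plausible way to combine them and is an assumption here, since the slides do not say how PrecisionFDA combined the two. The counts in the usage lines are made up.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def overall_f_measure(snp_counts, indel_counts):
    """Pool (tp, fp, fn) counts across variant types, then score.

    Pooling is an illustrative assumption, not the documented contest method.
    """
    tp, fp, fn = (s + i for s, i in zip(snp_counts, indel_counts))
    return f_measure(tp, fp, fn)


# Made-up counts for illustration.
print(f_measure(90, 10, 10))
print(overall_f_measure((90, 10, 10), (10, 0, 10)))
```

Because the harmonic mean is dominated by the smaller of the two terms, a caller cannot score well by trading many missed variants for few false calls, or the reverse.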

Example: GATK Analysis Pipeline
Dockerflow: Dataflow + Docker
Old way: install applications on host
[Diagram: several apps sharing one set of libs on the host kernel]
New way: deploy containers
[Diagram: each app packaged with its own libs, on a shared kernel]
Makefiles, CWL, WDL (on a virtual machine)
Benefits
● Decouple process management from host configuration
● Portable across OS distros and clouds
● Consistent environment from development to production
● Immutable images

To run it:
> java -jar target/dockerflow*dependencies.jar \
    --project=YOUR_PROJECT \
    --workflow-file=hello.yaml \
    --workspace=gs://YOUR_BUCKET/YOUR_FOLDER \
    --runner=DataflowPipelineRunner
[Diagram components: Sequencer, DNA Reads, PubSub Queue, Your Aligner, Your Variant Caller, Your Other Tool, Variant Calls, Genomics API, BigQuery]

Public Datasets Project
https://cloud.google.com/bigquery/public-data/
A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1 TB per month is free).

Population-scale Genome Projects
Medical (Human): Platinum Genomes, 1000 Genomes
Veterinary / Agriculture: 1000 Bulls, 10K Dog Genomes
Agriculture: Open Cannabis Project (see deck), 3K Rice, Genome To Fields, Panzea (1000 Maize)
Personal Genome Project, Human Microbiome Project, NCBI GEO Human 100K, Cancer Genome Atlas
Many Other Interesting Datasets...
