SlideShare uma empresa Scribd logo
1 de 28
Baixar para ler offline
High-throughputGenomics
at Your Fingertips with Apache Spark
Erwin Datema
Roeland van Ham
Spark Summit EU 2016, October 26, Brussels
Overview
High-throughput Genomics at Your Fingertips with Apache Spark
• Disclaimer : I am a scientist in computational biology (bioinformatics)
• I am not a computer scientist
• I am not a data scientist
• Scope
• KeyGene’s journey into Spark to analyze genomics data
• Goal: enable interactive genomics data processing and querying
• Told from a user’s perspective
• Contents
• Introduction to KeyGene
• Crash Course Genomics
• Big Data Challenges
The crop innovation company 2
Global trends
The crop innovation company 33
World population grows from 7
to 9 billion people in 2050
Climate change
Limited/bad land, water and fossil
fuels
More obese people
Malnourished people
Agricultural challenges
The crop innovation company 4
YIELD: Producing more food on less land
Population
3.0 billion
Population
4.4 billion
Population
6.0 billion
Population
7.5 billion
4.3 hectares
arable land
per person
3.o hectares
arable land
per person
2.2 hectares
arable land
per person
1.8 hectares
arable land
per person
Genetic improvementof crops
The crop innovation company 5
Molecular breeding
Molecular mutagenesis
Our strategy:
Use of natural genetic
variation in crops
Not GM:
At this moment too
costly (20-100 mil €)
Regulatory, societal
and technical hurdles
About KeyGene
The crop innovation company 6
working for the
future of global
agriculture
Current
shareholders
Founded in
1989
>
Go-to Ag Biotech
company for higher
crop yield & quality
R&D strategy
The crop innovation company 7
Fundamental
research
Developing
technologies
& traits
Applying
molecular
breeding of
crops
Breeding
Seed
products
Market
Universities
KeyGene
Partners breeding industry
Big Data in Genomics
Relevance
The crop innovation company 8
image source: https://www.genome.gov/images/content/costpermb2015_4.jpg datasource: https://www.ncbi.nlm.nih.gov/genbank/statistics/
• Genomic data is being produced on an unprecedented scale
• The cost to sequence a genome is now a few thousand dollars
• We can routinely sequence tens to hundreds of individual plants
Totaldata volume (# bases) in NCBIGenBank
Plant Genomics
Different from human genomics!
The crop innovation company 9
one genome many genomes
genomics as a diagnostic tool genomics as a tool to direct breeding
“simple” genetics complex genetics
high quality, shared resources variable quality, fragmented resources
understand and cure diseases improve crop yield and quality
~16 Gb~5.5 Gb~3.5 Gb~850 Mb~450 Mb
Crash Course Genomics
Plant genome sizes
The crop innovation company 10
OnionWheatPepperPotatoMelon
diploidhexaploiddiploidtetraploiddiploid
Mb = megabases
Gb = gigabases
diploid = two sets of chromosomes
tetraploid = four sets of chromosomes
Crash Course Genomics
DNA, chromosomes, nucleotides
• DNA consists of four different elements (nucleotides or bases)
• We represent DNA as strings of characters from the alphabet A C G T
The crop innovation company 11
• DNA is organized into chromosomes
• Each chromosome contains millions of
nucleotides (characters)
• We can only ‘read’ short pieces of DNA
(hundreds to thousands of nucleotides)
Crash Course Genomics
Genes and traits
The crop innovation company 12
drought tolerance fruit size
genome
genes
• Genes are (some of) the functional elements in the genome
• We represent genes as an interval on a chromosome
Crash Course Genomics
Polymorphisms and their impact
The crop innovation company 13
large fruits
small fruits
no impact
Crash Course Genomics
Read alignment and variant calling
The crop innovation company 14
• The ‘reference genome’ represents the known sequence of a species
• Reads are aligned against the reference genome (string similarity search)
• Complex: sequence variation, repetitive regions and data errors
• Variants are called from differences observed in ‘pile-ups’ of reads
Crash Course Genomics
Population-scale genome sequencing
The crop innovation company 15
genome
genes
reads
(individual 1)
reads
(individual 2)
reads
(individual n)
5 Gigabases
1 billionreads
1 billionreads
1 billionreads
• Align a billion reads x 1,000 individuals to a 5 Gb genome
• Call hundreds of millions (up to potentially billions) of sequence variants
Crash Course Genomics
Recap
• Genome sequences are represented as strings of A C G T
• Variation between genome sequences underlies differences in traits
• We can “read” the genome in little pieces
• High throughput, massively parallel sequencing technologies
• Up to thousands of individual plants from a given species
• Genomics data analysis is challenging
• Rapid increase in data generation
• Rapid turnover of sequencing technologies and their outputs
• Scientific software is (often) bad (Nature News, Oct 13 2010)
The crop innovation company 16
Genomics data analysis
High Performance Computing
• Computational challenges
• Align billions of reads to the reference genome
• Call millions or billions of sequence variants
• Determine the small number of variants that impact a given trait
• HPC infrastructure (e.g. SGE clusters) are the de facto standard
• Manually split large datasets (to accommodate the job scheduler)
• Manually deal with failures: check logs, resubmit jobs…
• Many software tools are in fact large, monolithic “pipelines”
• No fine-grained control over resource usage
• A single error often implies a complete re-run of the analysis
The crop innovation company 17
High Performance Computing
Big Data technologies
The crop innovation company 18
compute node
compute node
compute node
compute node
compute node
submit host
compute node
compute node
storage
storage
storage
storage
storage
storage
storage
storage
storageserver
computecluster
Conventional Compute Cluster
Expensive, proprietary storage
Expensive network connections
Expensive, high-reliability hardware
Spark Cluster
Sparkcluster
compute + storage
compute + storage
compute + storage
compute + storage
compute + storage
admin / name node
compute + storage
compute + storage
Commodity hardware
Linear scalability
Fault tolerance
Genomics application landscape
Opportunities for Big Data technologies
19
?
Conventional
Compute
Cluster
Compute tool
High Memory
Hardware
Accelerated
(GPU / FPGA)
Spark
(and Hadoop)
Big Data in Genomics
Challenges
The crop innovation company 20
FASTQ SAM BAM VCF
DNA sequencer reads alignments variants
• Genomics is a dynamic, rapidly changing field
• Data generators and analysis algorithms are in constant flux
• Tools are generally built around flat, text-based file formats
• Workflow is file-centric (POSIX file system; no streaming…)
Big Data in Genomics
Our ‘Sparkified’ pipeline
The crop innovation company 21
Scala
BWA
GATK4
Mark
Duplicates
ADAM
Guacamole
FASTQ
HDFS
SAM
HDFS
BAM
HDFS
VCF
local
read alignment processing variant calling
Genomics on Spark
Solutions to legacy designs
The crop innovation company 22
FASTQ
DNA sequencer reads
FASTQ
reads
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
identifier
sequence
separator
quality
RDD[String]
f1.zip(f2).map
(e => e._1+ e._2)
‘interleaved’ reads
class FastQRecordReader extends RecordReader
Genomics on Spark
Read alignment
The crop innovation company 23
FASTQ
FASTQ
FASTQ
FASTQ
FASTQ RDD
Scala
partition 1
partition 2
partition n
…
BWA
partition 1
partition 2
…
partition n
memory
RDD SAM
Scala
HDFSHDFS
perl wrapper for BWA
called by rdd.pipe()
class FastQRecordReader
extends RecordReader
sort output
and add header
Genomics on Spark
Processing and variant calling
The crop innovation company 24
Broad Institute,Cambridge
Broad Institute,Cambridge
AMPlab, University of California
• Alignment processing
• Variant calling (in development)
• Data schemas + APIs
• Variant calling (Guacamole)
• Variant analysis
• You’ve all attended the Keynote talk…
Genomics on Spark
Variant selection and analysis
The crop innovation company 25
VCF
Scala
parquet
HDFSlocal
RDD tempTable
SQL
selection
Scala
VCF
localmemory
Variant selection
phenotypes GWAS
• Interactive, “real-time” selection of variant data with simple SQL queries
• GWAS analysis on Spark (e.g. hail) or conventional infra (e.g. PLINK)
Big Data in Genomics
KeyGene’s ambition
The crop innovation company 26
Wrap-up
Conclusions and lessons learnt
• Initial success in applying Spark to plant genomics
• Proof-of-concept for enabling interactive GWAS analysis on Spark
• Spark appears to be a good fit for (some of our current) Genomics problems
• Developer community needed to translate core Genomics applications!
• Paradigm shift required to move away from flat POSIX files…
• Opportunities for streaming data analysis
The crop innovation company 27
The End
High-throughput Genomics at Your Fingertips with Apache Spark
The crop innovation company 28
• Erwin Datema erwin.datema@keygene.com
• Roeland van Ham roeland.van-ham@keygene.com

Mais conteúdo relacionado

Mais procurados

Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
Databricks
 
Population-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysisPopulation-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysis
Denis C. Bauer
 

Mais procurados (20)

A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
 
Genome Big Data
Genome Big DataGenome Big Data
Genome Big Data
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangement
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
 
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
 
Managing Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger InstituteManaging Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger Institute
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomics
 
Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocado
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science research
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use cases
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAM
 
Population-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysisPopulation-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysis
 
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by...
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud Experiences
 

Destaque

Destaque (20)

Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier Aguedes
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef Habdank
 
The Spark (R)evolution in The Netherlands
The Spark (R)evolution in The NetherlandsThe Spark (R)evolution in The Netherlands
The Spark (R)evolution in The Netherlands
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko Korndorf
 
A Spark Framework For < $100, < 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For < $100, < 1 Hour, Accurate Personalized DNA Analy...A Spark Framework For < $100, < 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For < $100, < 1 Hour, Accurate Personalized DNA Analy...
 
Spark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg Schad
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine Learning
 
Democratizing AI with Apache Spark
Democratizing AI with Apache SparkDemocratizing AI with Apache Spark
Democratizing AI with Apache Spark
 
Spark Summit EU talk by Sudeep Das and Aish Faenton
Spark Summit EU talk by Sudeep Das and Aish FaentonSpark Summit EU talk by Sudeep Das and Aish Faenton
Spark Summit EU talk by Sudeep Das and Aish Faenton
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Spark Summit EU talk by Luc Bourlier
Spark Summit EU talk by Luc BourlierSpark Summit EU talk by Luc Bourlier
Spark Summit EU talk by Luc Bourlier
 
Solving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized GenomicsSolving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized Genomics
 
Spark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza Karimi
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean Wampler
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
 

Semelhante a Spark Summit EU talk by Erwin Datema and Roeland van Ham

Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
David Ruau
 
Accelerate pharmaceutical r&d with mongo db
Accelerate pharmaceutical r&d with mongo dbAccelerate pharmaceutical r&d with mongo db
Accelerate pharmaceutical r&d with mongo db
MongoDB
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
c.titus.brown
 
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
StampedeCon
 

Semelhante a Spark Summit EU talk by Erwin Datema and Roeland van Ham (20)

Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
 
Accelerate Pharmaceutical R&D with Big Data and MongoDB
Accelerate Pharmaceutical R&D with Big Data and MongoDBAccelerate Pharmaceutical R&D with Big Data and MongoDB
Accelerate Pharmaceutical R&D with Big Data and MongoDB
 
Accelerate pharmaceutical r&d with mongo db
Accelerate pharmaceutical r&d with mongo dbAccelerate pharmaceutical r&d with mongo db
Accelerate pharmaceutical r&d with mongo db
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010
Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010 Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010
Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
Michael Reich, GenomeSpace Workshop, fged_seattle_2013
Michael Reich, GenomeSpace Workshop, fged_seattle_2013Michael Reich, GenomeSpace Workshop, fged_seattle_2013
Michael Reich, GenomeSpace Workshop, fged_seattle_2013
 
Platforms CIBERER and INB-ELIXIR-es
Platforms CIBERER and INB-ELIXIR-esPlatforms CIBERER and INB-ELIXIR-es
Platforms CIBERER and INB-ELIXIR-es
 
Variant analysis and whole exome sequencing
Variant analysis and whole exome sequencingVariant analysis and whole exome sequencing
Variant analysis and whole exome sequencing
 
EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017 EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017
 
HUGenomics: a support for personalized medicine research
HUGenomics: a support for personalized medicine researchHUGenomics: a support for personalized medicine research
HUGenomics: a support for personalized medicine research
 
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
 
Introduction to 16S Microbiome Analysis
Introduction to 16S Microbiome AnalysisIntroduction to 16S Microbiome Analysis
Introduction to 16S Microbiome Analysis
 
SFSCON23 - Michele Finelli - Management of large genomic data with free software
SFSCON23 - Michele Finelli - Management of large genomic data with free softwareSFSCON23 - Michele Finelli - Management of large genomic data with free software
SFSCON23 - Michele Finelli - Management of large genomic data with free software
 
Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinity
 
HUG @ NGCLE@e-Novia 15.11.2017
HUG @ NGCLE@e-Novia 15.11.2017HUG @ NGCLE@e-Novia 15.11.2017
HUG @ NGCLE@e-Novia 15.11.2017
 
Life Sciences De-Mystified - Mark Bünger - PICNIC '10
Life Sciences De-Mystified - Mark Bünger - PICNIC '10Life Sciences De-Mystified - Mark Bünger - PICNIC '10
Life Sciences De-Mystified - Mark Bünger - PICNIC '10
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
Bda2015 tutorial-part2-data&databases
Bda2015 tutorial-part2-data&databasesBda2015 tutorial-part2-data&databases
Bda2015 tutorial-part2-data&databases
 

Mais de Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Mais de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 

Último (20)

Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 

Spark Summit EU talk by Erwin Datema and Roeland van Ham

  • 1. High-throughputGenomics at Your Fingertips with Apache Spark Erwin Datema Roeland van Ham Spark Summit EU 2016, October 26, Brussels
  • 2. Overview High-throughput Genomics at Your Fingertips with Apache Spark • Disclaimer : I am a scientist in computational biology (bioinformatics) • I am not a computer scientist • I am not a data scientist • Scope • KeyGene’s journey into Spark to analyze genomics data • Goal: enable interactive genomics data processing and querying • Told from a user’s perspective • Contents • Introduction to KeyGene • Crash Course Genomics • Big Data Challenges The crop innovation company 2
  • 3. Global trends The crop innovation company 33 World population grows from 7 to 9 billion people in 2050 Climate change Limited/bad land, water and fossil fuels More obese people Malnourished people
  • 4. Agricultural challenges The crop innovation company 4 YIELD: Producing more food on less land Population 3.0 billion Population 4.4 billion Population 6.0 billion Population 7.5 billion 4.3 hectares arable land per person 3.o hectares arable land per person 2.2 hectares arable land per person 1.8 hectares arable land per person
  • 5. Genetic improvementof crops The crop innovation company 5 Molecular breeding Molecular mutagenesis Our strategy: Use of natural genetic variation in crops Not GM: At this moment too costly (20-100 mil €) Regulatory, societal and technical hurdles
  • 6. About KeyGene The crop innovation company 6 working for the future of global agriculture Current shareholders Founded in 1989 > Go-to Ag Biotech company for higher crop yield & quality
  • 7. R&D strategy The crop innovation company 7 Fundamental research Developing technologies & traits Applying molecular breeding of crops Breeding Seed products Market Universities KeyGene Partners breeding industry
  • 8. Big Data in Genomics Relevance The crop innovation company 8 image source: https://www.genome.gov/images/content/costpermb2015_4.jpg datasource: https://www.ncbi.nlm.nih.gov/genbank/statistics/ • Genomic data is being produced on an unprecedented scale • The cost to sequence a genome is now a few thousand dollars • We can routinely sequence tens to hundreds of individual plants Totaldata volume (# bases) in NCBIGenBank
  • 9. Plant Genomics Different from human genomics! The crop innovation company 9 one genome many genomes genomics as a diagnostic tool genomics as a tool to direct breeding “simple” genetics complex genetics high quality, shared resources variable quality, fragmented resources understand and cure diseases improve crop yield and quality
  • 10. ~16 Gb~5.5 Gb~3.5 Gb~850 Mb~450 Mb Crash Course Genomics Plant genome sizes The crop innovation company 10 OnionWheatPepperPotatoMelon diploidhexaploiddiploidtetraploiddiploid Mb = megabases Gb = gigabases diploid = two sets of chromosomes tetraploid = four sets of chromosomes
  • 11. Crash Course Genomics DNA, chromosomes, nucleotides • DNA consists of four different elements (nucleotides or bases) • We represent DNA as strings of characters from the alphabet A C G T The crop innovation company 11 • DNA is organized into chromosomes • Each chromosome contains millions of nucleotides (characters) • We can only ‘read’ short pieces of DNA (hundreds to thousands of nucleotides)
  • 12. Crash Course Genomics Genes and traits The crop innovation company 12 drought tolerance fruit size genome genes • Genes are (some of) the functional elements in the genome • We represent genes as an interval on a chromosome
  • 13. Crash Course Genomics Polymorphisms and their impact The crop innovation company 13 large fruits small fruits no impact
  • 14. Crash Course Genomics Read alignment and variant calling The crop innovation company 14 • The ‘reference genome’ represents the known sequence of a species • Reads are aligned against the reference genome (string similarity search) • Complex: sequence variation, repetitive regions and data errors • Variants are called from differences observed in ‘pile-ups’ of reads
  • 15. Crash Course Genomics Population-scale genome sequencing The crop innovation company 15 genome genes reads (individual 1) reads (individual 2) reads (individual n) 5 Gigabases 1 billionreads 1 billionreads 1 billionreads • Align a billion reads x 1,000 individuals to a 5 Gb genome • Call hundreds of millions (up to potentially billions) of sequence variants
  • 16. Crash Course Genomics Recap • Genome sequences are represented as strings of A C G T • Variation between genome sequences underlies differences in traits • We can “read” the genome in little pieces • High throughput, massively parallel sequencing technologies • Up to thousands of individual plants from a given species • Genomics data analysis is challenging • Rapid increase in data generation • Rapid turnover of sequencing technologies and their outputs • Scientific software is (often) bad (Nature News, Oct 13 2010) The crop innovation company 16
  • 17. Genomics data analysis High Performance Computing • Computational challenges • Align billions of reads to the reference genome • Call millions or billions of sequence variants • Determine the small number of variants that impact a given trait • HPC infrastructure (e.g. SGE clusters) are the de facto standard • Manually split large datasets (to accommodate the job scheduler) • Manually deal with failures: check logs, resubmit jobs… • Many software tools are in fact large, monolithic “pipelines” • No fine-grained control over resource usage • A single error often implies a complete re-run of the analysis The crop innovation company 17
  • 18. High Performance Computing Big Data technologies The crop innovation company 18 compute node compute node compute node compute node compute node submit host compute node compute node storage storage storage storage storage storage storage storage storageserver computecluster Conventional Compute Cluster Expensive, proprietary storage Expensive network connections Expensive, high-reliability hardware Spark Cluster Sparkcluster compute + storage compute + storage compute + storage compute + storage compute + storage admin / name node compute + storage compute + storage Commodity hardware Linear scalability Fault tolerance
  • 19. Genomics application landscape Opportunities for Big Data technologies 19 ? Conventional Compute Cluster Compute tool High Memory Hardware Accelerated (GPU / FPGA) Spark (and Hadoop)
  • 20. Big Data in Genomics Challenges The crop innovation company 20 FASTQ SAM BAM VCF DNA sequencer reads alignments variants • Genomics is a dynamic, rapidly changing field • Data generators and analysis algorithms are in constant flux • Tools are generally built around flat, text-based file formats • Workflow is file-centric (POSIX file system; no streaming…)
  • 21. Big Data in Genomics Our ‘Sparkified’ pipeline The crop innovation company 21 Scala BWA GATK4 Mark Duplicates ADAM Guacamole FASTQ HDFS SAM HDFS BAM HDFS VCF local read alignment processing variant calling
  • 22. Genomics on Spark Solutions to legacy designs The crop innovation company 22 FASTQ DNA sequencer reads FASTQ reads @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 identifier sequence separator quality RDD[String] f1.zip(f2).map (e => e._1+ e._2) ‘interleaved’ reads class FastQRecordReader extends RecordReader
  • 23. Genomics on Spark Read alignment The crop innovation company 23 FASTQ FASTQ FASTQ FASTQ FASTQ RDD Scala partition 1 partition 2 partition n … BWA partition 1 partition 2 … partition n memory RDD SAM Scala HDFSHDFS perl wrapper for BWA called by rdd.pipe() class FastQRecordReader extends RecordReader sort output and add header
  • 24. Genomics on Spark Processing and variant calling The crop innovation company 24 Broad Institute,Cambridge Broad Institute,Cambridge AMPlab, University of California • Alignment processing • Variant calling (in development) • Data schemas + APIs • Variant calling (Guacamole) • Variant analysis • You’ve all attended the Keynote talk…
  • 25. Genomics on Spark Variant selection and analysis The crop innovation company 25 VCF Scala parquet HDFSlocal RDD tempTable SQL selection Scala VCF localmemory Variant selection phenotypes GWAS • Interactive, “real-time” selection of variant data with simple SQL queries • GWAS analysis on Spark (e.g. hail) or conventional infra (e.g. PLINK)
  • 26. Big Data in Genomics KeyGene’s ambition The crop innovation company 26
  • 27. Wrap-up Conclusions and lessons learnt • Initial success in applying Spark to plant genomics • Proof-of-concept for enabling interactive GWAS analysis on Spark • Spark appears to be a good fit for (some of our current) Genomics problems • Developer community needed to translate core Genomics applications! • Paradigm shift required to move away from flat POSIX files… • Opportunities for streaming data analysis The crop innovation company 27
  • 28. The End High-throughput Genomics at Your Fingertips with Apache Spark The crop innovation company 28 • Erwin Datema erwin.datema@keygene.com • Roeland van Ham roeland.van-ham@keygene.com