@mirocupak
Miro Cupak
VP Engineering, DNAstack
28/09/2018
How we've made a global search engine for genetic data
https://www.ga4gh.org/
https://beacon-network.org
Background
https://www.nature.com/news/technology-the-1-000-genome-1.14901
sequencing cost decreasing exponentially (3M times since 2000)
Background
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
genomic data volume increasing exponentially (1M times since 2000)
Background
[Chart: data volumes by 2025 (GB), lower and upper bounds for Twitter, YouTube, and genomics; axis from 0 to 4e+10]
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
up to 2 billion human genomes sequenced in the next 10 years (more data annually than uploaded to Twitter and YouTube)
What does it mean?
❓ Problem: Too much data for any single institution. Not enough data to make new discoveries.
❗ Key obstacle: Discovering data.
💡 Solution: Federated system capable of executing cross-dataset and cross-institution queries.
Beacon Project
https://beacon-project.io/
• initiative started in 2014 across many groups within GA4GH
• experiment to test the willingness to share in the simplest of all technical contexts
• simple web service
  • receives questions of the form "Do you have information about this mutation?"
  • responds with yes or no (optionally with additional metadata)
• design principles
  • A beacon has to be technically simple.
  • A beacon has to minimize risks associated with genomic data sharing.
  • It has to be possible to make a beacon publicly available.
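The beacon idea can be sketched in a few lines. This is a minimal illustration rather than the Beacon specification: the `Variant` and `Beacon` classes, the `has_variant` method, and the in-memory variant set are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variant:
    """A single-nucleotide query: chromosome, position, and allele."""
    chromosome: str
    position: int
    allele: str


@dataclass
class Beacon:
    """Hypothetical in-memory beacon; a real one would sit in front of a variant store."""
    name: str
    variants: set = field(default_factory=set)

    def has_variant(self, query: Variant) -> bool:
        # The only thing disclosed is whether the variant is present,
        # not who carries it or how often it occurs.
        return query in self.variants


beacon = Beacon("example-beacon", {Variant("7", 140453136, "C")})
print(beacon.has_variant(Variant("7", 140453136, "C")))  # True
print(beacon.has_variant(Variant("13", 32936732, "C")))  # False
```

Answering with a bare yes or no is what keeps the service simple and limits what a public endpoint reveals.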
Standard: Before Beacon Network
• no formal specification
• receives questions of the form "Do you have information about this mutation?"
• responds with yes or no
• 4 public beacons, each with a different API
Standard: Before Beacon Network
• request method
• supported parameters
• parameter names
• chromosome identifiers
• positional base
• assembly notation
• supported alleles
• dataset support
• response format
• data included in the response
Standard: 0.1
• 2014
• really simple (2 records)
• true/false response
• format: Avro
• too vague
• not enough traction
Standard: 0.2
• 2015
• true/false/overlap/null response
• datasets
• data use conditions
• self description
• complex (9 records)
• format: Avro
• not well adopted
• not polished enough
Standard: 0.3
• 2016
• simplified 0.2
• true/false/null response
• data model improvements, extended metadata and response, improved support for datasets and cross-dataset queries, data versioning
• modular and extensible
• tooling
• format: Avro → Proto3
• based on real needs, successful
Standard: 0.4
• 2018
• stable and more flexible
• support for more complex mutations
• improved error handling
• improved data use conditions
• developer experience
• format: Proto3 → OpenAPI
Standard: 1.0
• promoted 0.4
• extended documentation and best practices
Finally ready! 🎉
Beacon Network
Data
• access data stored in a relational database
Service
• communication with other subsystems
• query normalization
• aggregators
• participant resolution
• query distribution
• audit trail
• L1 parallelization
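Query distribution and aggregation in the service layer might look roughly like the sketch below. Reading "L1 parallelization" as running the per-beacon queries concurrently is an assumption, as are the function names, the modelling of a beacon as a plain callable, and the aggregation rule (any yes wins; no usable answer gives `None`).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

# A beacon is modelled as a function from a query string to True, False,
# or None (no answer, e.g. the beacon does not support the query).
BeaconFn = Callable[[str], Optional[bool]]


def query_one(beacon: BeaconFn, query: str) -> Optional[bool]:
    """Query a single beacon, treating any failure as 'no answer'."""
    try:
        return beacon(query)
    except Exception:
        return None


def distribute(beacons: Dict[str, BeaconFn], query: str) -> Dict[str, Optional[bool]]:
    """Send the same query to every beacon concurrently and collect the answers."""
    with ThreadPoolExecutor(max_workers=max(1, len(beacons))) as pool:
        futures = {name: pool.submit(query_one, fn, query) for name, fn in beacons.items()}
        return {name: future.result() for name, future in futures.items()}


def aggregate(responses: Dict[str, Optional[bool]]) -> Optional[bool]:
    """True if any beacon says yes, False if at least one says no, else None."""
    answers = [r for r in responses.values() if r is not None]
    if any(answers):
        return True
    return False if answers else None


def failing(query: str) -> bool:
    raise TimeoutError("beacon unreachable")


beacons = {
    "a": lambda q: q == "7:140453136:C",
    "b": lambda q: False,
    "c": failing,
}
responses = distribute(beacons, "7:140453136:C")
print(responses)             # {'a': True, 'b': False, 'c': None}
print(aggregate(responses))  # True
```

Isolating each beacon's failure in `query_one` means one unreachable participant cannot fail the whole federated query.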
Processor
• executing a query against a beacon and processing its response
• management of a flexible, dynamic and easily extensible query execution pipeline
• pipeline stages resolution (CDI and EJB)
• L2 parallelization
• cross-assembly query handling
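The query execution pipeline can be pictured as an ordered list of interchangeable stages. The sketch below is an illustration only: the deck says stages are resolved through CDI and EJB, whereas here a plain Python list stands in for that mechanism, and the `Pipeline` class, the stage functions, and the `beacon.example.org` URL are all hypothetical.

```python
import json
from typing import Any, Callable, List

Stage = Callable[[Any], Any]


class Pipeline:
    """Runs a value through an ordered, replaceable list of stages."""

    def __init__(self, stages: List[Stage]):
        self.stages = list(stages)

    def run(self, value: Any) -> Any:
        for stage in self.stages:
            value = stage(value)
        return value


def convert(query: dict) -> dict:
    # Translate parameters into the form this particular beacon expects.
    chrom = query["chrom"]
    return {**query, "chrom": chrom[3:] if chrom.startswith("chr") else chrom}


def build_request(query: dict) -> str:
    template = "https://beacon.example.org/query?chrom={chrom}&pos={pos}&allele={allele}"
    return template.format(**query)


def fetch(url: str) -> str:
    # A real fetcher would go over the network; this stub returns a canned body.
    return '{"exists": true}'


def parse(raw: str) -> bool:
    return json.loads(raw)["exists"]


pipeline = Pipeline([convert, build_request, fetch, parse])
print(pipeline.run({"chrom": "chr7", "pos": 140453136, "allele": "C"}))  # True
```

Because each stage only depends on the previous stage's output, a beacon with an unusual API needs only its own converter or parser swapped in, not a new pipeline.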
Converter
• first stage in the query execution pipeline
• translating query parameters
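A converter might translate a normalized query into one beacon's conventions for chromosome identifiers, positional base, and assembly notation. The `BeaconDialect` fields and the `convert` function below are assumptions made for illustration, not the project's actual classes.

```python
from dataclasses import dataclass

# UCSC-style names for the GRCh assemblies.
ASSEMBLY_ALIASES = {"GRCh37": "hg19", "GRCh38": "hg38"}


@dataclass(frozen=True)
class BeaconDialect:
    """What one particular beacon expects; all fields are illustrative."""
    chromosome_prefix: str = ""        # e.g. "chr" if the beacon wants "chr13"
    zero_based: bool = False           # True if positions start at 0
    ucsc_assembly_names: bool = False  # True if the beacon wants "hg19", not "GRCh37"


def convert(chrom: str, pos: int, assembly: str, dialect: BeaconDialect) -> dict:
    """Translate a normalized query (bare chromosome, 1-based position,
    GRCh assembly name) into one beacon's dialect."""
    return {
        "chromosome": dialect.chromosome_prefix + chrom,
        "position": pos - 1 if dialect.zero_based else pos,
        "assembly": ASSEMBLY_ALIASES[assembly] if dialect.ucsc_assembly_names else assembly,
    }


print(convert("13", 32936732, "GRCh37", BeaconDialect("chr", True, True)))
# {'chromosome': 'chr13', 'position': 32936731, 'assembly': 'hg19'}
```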
Requester
• second stage in the query execution pipeline
• constructing beacon requests based on their URIs and parameters produced by the converters
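As a rough sketch of what a requester does, the function below builds (but does not send) an HTTP request from a beacon URI and already-converted parameters, using only the Python standard library. The `build_request` name, the example URI, and the parameter names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit
from urllib.request import Request


def build_request(beacon_uri: str, params: dict, method: str = "GET") -> Request:
    """Build (but do not send) an HTTP request for one beacon."""
    if method == "GET":
        parts = urlsplit(beacon_uri)
        # Keep any query string already present in the beacon's URI.
        query = "&".join(p for p in (parts.query, urlencode(params)) if p)
        return Request(urlunsplit(parts._replace(query=query)), method="GET")
    # Beacons that expect POST get the parameters as a form-encoded body.
    return Request(beacon_uri, data=urlencode(params).encode("utf-8"), method="POST")


request = build_request(
    "https://beacon.example.org/query",
    {"chromosome": "13", "position": 32936732, "allele": "C"},
)
print(request.get_method(), request.full_url)
# GET https://beacon.example.org/query?chromosome=13&position=32936732&allele=C
```

Keeping request construction separate from sending makes this stage easy to test without any network access.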
Fetcher
• third stage in the query execution pipeline
• unit actually talking to the API of beacons
• submitting requests over the network and obtaining the raw response
Parser
• last stage in the pipeline
• extracting information of interest from the raw response obtained by a fetcher
• dealing with various formats
• handling metadata, multiple responses, errors
• response normalization
• parallelized
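Response normalization across differing formats could be sketched as follows. The response shapes handled here (a bare yes/no, a top-level `exists` field, a nested `response.exists` field, an `error` field) are invented for illustration and are not taken from any particular beacon.

```python
import json
from typing import Optional


def parse_response(raw: str) -> Optional[bool]:
    """Normalize a raw beacon response to True, False, or None (no usable answer)."""
    text = raw.strip()
    # Plain-text beacons: a bare yes/no or true/false.
    plain = {"yes": True, "true": True, "no": False, "false": False}
    if text.lower() in plain:
        return plain[text.lower()]
    try:
        body = json.loads(text)
    except ValueError:
        return None  # Not a format this sketch understands.
    if not isinstance(body, dict) or body.get("error"):
        return None
    # JSON beacons: the answer may sit at the top level or under "response".
    for container in (body, body.get("response")):
        if isinstance(container, dict) and isinstance(container.get("exists"), bool):
            return container["exists"]
    return None


samples = ["YES", '{"exists": false}', '{"response": {"exists": true}}',
           '{"error": "bad request"}', "<html>"]
for raw in samples:
    print(repr(raw), "->", parse_response(raw))
```

Mapping errors and unrecognized bodies to `None` rather than `False` keeps "this beacon could not answer" distinct from "this beacon does not have the variant".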
Mapper
• translation between different representations of objects
REST
• handling client requests
• data serialization
Size
100+ installations
40+ institutions
18 countries
6 continents
Users
16k users
141 countries
Searches
Assemblies
• 83% GRCh37
• 6% GRCh38
• 11% Others
Chromosomes
• 18% Chr. 2
• 14% Chr. 17
• 11% Chr. 1
• 11% Chr. 13
• 7% Chr. 7
• 39% Others
Variants
• 6% 2 : 38938 C (FAM110C)
• 6% 13 : 32936732 C (BRCA2)
• 3% 22 : 46546565 A (PPARA)
• 3% 2 : 45895 G (FAM110C)
• 2% 7 : 140453136 C (BRAF)
• 2% 1 : 43815163 C (MPL)
• 1% 1 : 115258747 A (NRAS)
• 1% 14 : 23894969 A (MYH7)
• 1% 2 : 29432776 C (ALK)
• 1% 2 : 212289100 C (ERBB4)
• 74% Others
85k distinct mutations
Deleteriousness
[Histograms: number of variants (log scale, 1 to 1,000,000) by score (0.00 to 0.98), one panel per predictor]
• SIFT (Sorting Intolerant From Tolerant): 69% damaging, 31% tolerated
• PolyPhen-2 HDIV (Polymorphism Phenotyping v2): 55% probably damaging, 22% possibly damaging, 23% benign
Rarity
[Histogram: number of variants (log scale, 1 to 10,000) by allele frequency (0.00 to 0.99)]
• 25% rare variants (1,000 Genomes Project)
Genes
Rank  Symbol   Name                                              Share
1     FAM110C  Family With Sequence Similarity 110 Member C      11%
2     BRCA1    BRCA1, DNA Repair Associated                      10%
3     BRCA2    BRCA2, DNA Repair Associated                      9%
4     PPARA    Peroxisome Proliferator Activated Receptor Alpha  4%
5     ERBB4    Erb-B2 Receptor Tyrosine Kinase 4                 3%
6     BRAF     B-Raf Proto-Oncogene, Serine/Threonine Kinase     3%
7     MPL      MPL Proto-Oncogene, Thrombopoietin Receptor       2%
8     MYH7     Myosin Heavy Chain 7                              2%
9     KIT      KIT Proto-Oncogene Receptor Tyrosine Kinase       1%
10    RET      Ret Proto-Oncogene                                1%
      Others                                                     53%
Disorders & clinical abnormalities
Rank  OMIM                                      HPO
1     Pancreatic cancer, susceptibility to, 4   Autosomal dominant inheritance
2     Breast-ovarian cancer, familial, 1        Autosomal recessive inheritance
3     Fanconi anemia, complementation group D1  Scoliosis
4     Prostate cancer                           Short stature
5     Pancreatic cancer 2                       Cognitive impairment
6     Medulloblastoma                           Constipation
7     Glioblastoma 3                            Somatic mutation
8     Breast-ovarian cancer, familial, 2        Cafe-au-lait spot
9     Breast cancer, male, susceptibility to    Failure to thrive
10    Wilms tumor                               Nausea and vomiting
Questions?
@mirocupak

Machine Learning Software Engineering Patterns and Their Engineering
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 

How we've made a global search engine for genetic data

  • 1. How we've made a global search engine for genetic data
      • Miro Cupak (@mirocupak), VP Engineering, DNAstack
      • 28/09/2018
  • 7. Background
      • chart: data volumes by 2025 (GB), lower and upper bounds for Twitter, YouTube and genomics
      • up to 2 billion human genomes sequenced in the next 10 years (more data annually than uploaded to Twitter and YouTube)
      • source: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  • 8. What does it mean?
      • ❓ Problem: Too much data for any single institution. Not enough data to make new discoveries.
      • ❗ Key obstacle: Discovering data.
      • 💡 Solution: Federated system capable of executing cross-dataset and cross-institution queries.
  • 9. Beacon Project (https://beacon-project.io/)
      • initiative started in 2014 across many groups within GA4GH
      • experiment to test the willingness to share in the simplest of all technical contexts
      • simple web service
          • receives questions of the form "Do you have information about this mutation?"
          • responds with yes or no (optionally additional metadata)
      • design principles
          • A beacon has to be technically simple.
          • A beacon has to minimize risks associated with genomic data sharing.
          • It has to be possible to make a beacon publicly available.
  • 10. Standard: Before Beacon Network
      • no formal specification
      • receives questions of the form "Do you have information about this mutation?"
      • responds with yes or no
      • 4 public beacons, each API different
          • request method
          • supported parameters
          • parameter names
          • chromosome identifiers
          • positional base
          • assembly notation
          • supported alleles
          • dataset support
          • response format
          • data included in the response
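
The beacon idea on slides 9 and 10 can be sketched in a few lines of Java. This is an illustration only, not the Beacon API: the classes `BeaconQuery` and `InMemoryBeacon`, the `chr` prefix handling, the 0-based internal coordinate and the `GRCh37` assembly label are all assumptions, chosen to show why differing chromosome identifiers and positional bases had to be reconciled before one query could be sent to several beacons.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: not the GA4GH Beacon API, only the yes/no idea behind it.
class BeaconSketchDemo {
    public static void main(String[] args) {
        InMemoryBeacon beacon = new InMemoryBeacon();
        // Illustrative variant; treating the position as 1-based is an assumption.
        beacon.add(BeaconQuery.of("13", 32936732, true, "C", "GRCh37"));

        // Same variant asked with a "chr" prefix, a 0-based position and a lower-case allele.
        System.out.println(beacon.exists(BeaconQuery.of("chr13", 32936731, false, "c", "GRCh37"))); // true
        // Different allele at the same position.
        System.out.println(beacon.exists(BeaconQuery.of("13", 32936732, true, "T", "GRCh37")));     // false
    }
}

final class BeaconQuery {
    final String chromosome; // normalized: no "chr" prefix, upper case
    final long position;     // normalized: 0-based
    final String allele;     // normalized: upper case
    final String assembly;   // passed through unchanged

    private BeaconQuery(String chromosome, long position, String allele, String assembly) {
        this.chromosome = chromosome;
        this.position = position;
        this.allele = allele;
        this.assembly = assembly;
    }

    // Builds a normalized query so that differently phrased questions compare equal.
    static BeaconQuery of(String chromosome, long position, boolean oneBased,
                          String allele, String assembly) {
        String chr = chromosome.trim().toUpperCase(Locale.ROOT);
        if (chr.startsWith("CHR")) {
            chr = chr.substring(3);
        }
        long pos = oneBased ? position - 1 : position;
        if (pos < 0) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        return new BeaconQuery(chr, pos, allele.trim().toUpperCase(Locale.ROOT), assembly);
    }

    String key() {
        return assembly + ":" + chromosome + ":" + position + ":" + allele;
    }
}

final class InMemoryBeacon {
    private final Set<String> knownVariants = new HashSet<>();

    void add(BeaconQuery variant) {
        knownVariants.add(variant.key());
    }

    // "Do you have information about this mutation?" answered with yes or no.
    boolean exists(BeaconQuery query) {
        return knownVariants.contains(query.key());
    }
}
```

Normalizing at construction time keeps the beacon itself trivial, which is in the spirit of the "technically simple" design principle; a real beacon would sit in front of a variant store rather than an in-memory set.
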
  • 12. Standard: 0.1
      • 2014
      • really simple (2 records)
      • true/false response
      • format: Avro
      • too vague
      • not enough traction
  • 13. Standard: 0.2
      • 2015
      • true/false/overlap/null response
      • datasets
      • data use conditions
      • self description
      • complex (9 records)
      • format: Avro
      • not well adopted
      • not polished enough
  • 14. Standard: 0.3
      • 2016
      • simplified 0.2
      • true/false/null response
      • data model improvements, extended metadata and response, improved support for datasets and cross-dataset queries, data versioning
      • modular and extensible
      • tooling
      • format: Avro → Proto3
      • based on real needs, successful
  • 15. Standard: 0.4
      • 2018
      • stable and more flexible
      • support for more complex mutations
      • improved error handling
      • improved data use conditions
      • developer experience
      • format: Proto3 → OpenAPI
  • 16. Standard: 1.0
      • promoted 0.4
      • extended documentation and best practices
      • Finally ready! 🎉
  • 18. Data
      • access data stored in a relational database
  • 19. Service
      • communication with other subsystems
      • query normalization
      • aggregators
      • participant resolution
      • query distribution
      • audit trail
      • L1 parallelization
  • 20. Processor
      • executing a query against a beacon and processing its response
      • management of a flexible, dynamic and easily extensible query execution pipeline
      • pipeline stages resolution (CDI and EJB)
      • L2 parallelization
      • cross-assembly query handling
  • 21. Converter
      • first stage in the query execution pipeline
      • translating query parameters
  • 22. Requester
      • second stage in the query execution pipeline
      • constructing beacon requests based on their URIs and parameters produced by the converters
  • 23. Fetcher
      • third stage in the query execution pipeline
      • unit actually talking to the API of beacons
      • submitting requests over the network and obtaining the raw response
  • 24. Parser
      • last stage in the pipeline
      • extracting information of interest from the raw response obtained by a fetcher
      • dealing with various formats
      • handling metadata, multiple responses, errors
      • response normalization
      • parallelized
  • 25. Mapper
      • translation between different representations of objects
  • 26. REST
      • handling client requests
      • data serialization
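
The four-stage pipeline on slides 20 to 24 (converter, requester, fetcher, parser) can be sketched as a chain of functions, with one pipeline per beacon and an aggregator that queries all beacons concurrently. This is a minimal sketch under loud assumptions: the deck resolves stages with CDI and EJB, whereas this example wires plain `java.util.function.Function` objects by hand; the class names, the `*.example` URLs and the response formats are invented; the fetchers are stubs that never touch the network; and reading "L1 parallelization" as "beacons are queried in parallel" is an interpretation, not something the slides state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch: every class, URL and response format here is invented.
class PipelineSketchDemo {
    public static void main(String[] args) {
        // Beacon A: 1-based positions, plain-text YES/NO answers.
        BeaconPipeline a = new BeaconPipeline("beacon-a",
            q -> "chrom=" + q.chromosome + "&pos=" + q.position + "&allele=" + q.allele,
            params -> "https://beacon-a.example/query?" + params,
            url -> url.endsWith("chrom=13&pos=32936732&allele=C") ? "YES" : "NO",
            raw -> raw.equals("YES"));
        // Beacon B: "chr" prefix, 0-based positions, JSON-like answers.
        BeaconPipeline b = new BeaconPipeline("beacon-b",
            q -> "ref=chr" + q.chromosome + "&start=" + (q.position - 1) + "&alt=" + q.allele,
            params -> "https://beacon-b.example/api?" + params,
            url -> "{\"exists\":false}",
            raw -> raw.contains("\"exists\":true"));
        // Beacon C: the stubbed fetcher fails, so the aggregated answer is null.
        BeaconPipeline c = new BeaconPipeline("beacon-c",
            q -> "q=" + q.chromosome + ":" + q.position + ":" + q.allele,
            params -> "https://beacon-c.example/?" + params,
            url -> { throw new IllegalStateException("simulated timeout"); },
            raw -> raw.equals("true"));

        VariantQuery query = new VariantQuery("13", 32936732L, "C");
        System.out.println(BeaconAggregator.queryAll(Arrays.asList(a, b, c), query));
        // prints {beacon-a=true, beacon-b=false, beacon-c=null}
    }
}

final class VariantQuery {
    final String chromosome; // without "chr" prefix
    final long position;     // 1-based in this sketch
    final String allele;

    VariantQuery(String chromosome, long position, String allele) {
        this.chromosome = chromosome;
        this.position = position;
        this.allele = allele;
    }
}

final class BeaconPipeline {
    final String name;
    private final Function<VariantQuery, Boolean> stages;

    BeaconPipeline(String name,
                   Function<VariantQuery, String> converter, // query -> beacon-specific parameters
                   Function<String, String> requester,       // parameters -> request URL
                   Function<String, String> fetcher,         // request URL -> raw response (stubbed)
                   Function<String, Boolean> parser) {       // raw response -> normalized answer
        this.name = name;
        this.stages = converter.andThen(requester).andThen(fetcher).andThen(parser);
    }

    Boolean execute(VariantQuery query) {
        return stages.apply(query);
    }
}

final class BeaconAggregator {
    // Queries all beacons concurrently; a failing beacon yields null instead of an error.
    static Map<String, Boolean> queryAll(List<BeaconPipeline> beacons, VariantQuery query) {
        List<CompletableFuture<Boolean>> pending = new ArrayList<>();
        for (BeaconPipeline beacon : beacons) {
            pending.add(CompletableFuture.supplyAsync(() -> beacon.execute(query))
                                         .exceptionally(error -> null));
        }
        Map<String, Boolean> answers = new LinkedHashMap<>();
        for (int i = 0; i < beacons.size(); i++) {
            answers.put(beacons.get(i).name, pending.get(i).join());
        }
        return answers;
    }
}
```

Keeping each stage a separate function means a new beacon with an unusual API only needs its own converter or parser, which is one way to read "flexible, dynamic and easily extensible" on slide 20. Mapping a failed beacon to `null` mirrors the true/false/null response of the 0.3 standard.
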
  • 31. Chromosomes
      • Chr. 2: 18%
      • Chr. 17: 14%
      • Chr. 1: 11%
      • Chr. 13: 11%
      • Chr. 7: 7%
      • Others: 39%
  • 32. Variants (85k distinct mutations)
      • 2 : 38938 C (FAM110C): 6%
      • 13 : 32936732 C (BRCA2): 6%
      • 22 : 46546565 A (PPARA): 3%
      • 2 : 45895 G (FAM110C): 3%
      • 7 : 140453136 C (BRAF): 2%
      • 1 : 43815163 C (MPL): 2%
      • 1 : 115258747 A (NRAS): 1%
      • 14 : 23894969 A (MYH7): 1%
      • 2 : 29432776 C (ALK): 1%
      • 2 : 212289100 C (ERBB4): 1%
      • Others: 74%
  • 33. Deleteriousness
      • two histograms: number of variants by score (0.00 to 0.98)
      • SIFT (Sorting Intolerant From Tolerant): 69% damaging, 31% tolerated
      • PolyPhen-2 HDIV (Polymorphism Phenotyping v2): 55% probably damaging, 22% possibly damaging, 23% benign
  • 34. Rarity
      • 25% rare variants (1,000 Genomes Project)
      • histogram: number of variants by allele frequency (0.00 to 0.99)
  • 35. Genes (symbol, name, share)
      • 1. FAM110C: Family With Sequence Similarity 110 Member C (11%)
      • 2. BRCA1: BRCA1, DNA Repair Associated (10%)
      • 3. BRCA2: BRCA2, DNA Repair Associated (9%)
      • 4. PPARA: Peroxisome Proliferator Activated Receptor Alpha (4%)
      • 5. ERBB4: Erb-B2 Receptor Tyrosine Kinase 4 (3%)
      • 6. BRAF: B-Raf Proto-Oncogene, Serine/Threonine Kinase (3%)
      • 7. MPL: MPL Proto-Oncogene, Thrombopoietin Receptor (2%)
      • 8. MYH7: Myosin Heavy Chain 7 (2%)
      • 9. KIT: KIT Proto-Oncogene Receptor Tyrosine Kinase (1%)
      • 10. RET: Ret Proto-Oncogene (1%)
      • Others: 53%
  • 36. Disorders & clinical abnormalities
      • 1. OMIM: Pancreatic cancer, susceptibility to, 4. HPO: Autosomal dominant inheritance
      • 2. OMIM: Breast-ovarian cancer, familial, 1. HPO: Autosomal recessive inheritance
      • 3. OMIM: Fanconi anemia, complementation group D1. HPO: Scoliosis
      • 4. OMIM: Prostate cancer. HPO: Short stature
      • 5. OMIM: Pancreatic cancer 2. HPO: Cognitive impairment
      • 6. OMIM: Medulloblastoma. HPO: Constipation
      • 7. OMIM: Glioblastoma 3. HPO: Somatic mutation
      • 8. OMIM: Breast-ovarian cancer, familial, 2. HPO: Cafe-au-lait spot
      • 9. OMIM: Breast cancer, male, susceptibility to. HPO: Failure to thrive
      • 10. OMIM: Wilms tumor. HPO: Nausea and vomiting