Background
The past decade has seen a significant increase in
high-throughput experimental studies that catalog variants
using massively parallel sequencing. New insights of
biological significance can be gained by combining this
information with annotations based on genomic location.
However, efforts to obtain such insights by integrating and
mining variant data have had limited success so far, and no
method has yet been developed that is scalable, practical,
and applicable to millions of variants and their related
annotations. We explored the use of graph data structures as
a proof of concept for scalable interpretation of the impact
of variant-related data.
Methods and Materials
Results
Leveraging Graph Data Structures for Variant Data and Related Annotations
Chris Zawora1, Jesse Milzman1, Yatpang Cheung1, Akshay Bhushan1, Michael S. Atkins2, Hue Vuong3, F. Pascal Girard2, Uma Mudunuri3
1Georgetown University, Washington, DC, 2FedCentric Technologies, LLC, Fairfax, VA, 3Frederick National Laboratory for Cancer Research, Frederick, MD
Conclusion
References
The 1000 Genomes Project Consortium. (2010). A map of human
genome variation from population-scale sequencing. Nature, 467,
1061–1073.
Gregg, B. (2014). Systems performance: enterprise and the cloud.
Prentice Hall: Upper Saddle River, NJ.
Acknowledgments
FedCentric acknowledges Frank D’Ippolito, Shafigh
Mehraeen, Margreth Mpossi, Supriya Nittoor, and Tianwen
Chu for their invaluable assistance with this project.
This project has been funded in whole or in part with federal funds from the National
Cancer Institute, National Institutes of Health, under contract HHSN261200800001E. The
content of this publication does not necessarily reflect the views or policies of the
Department of Health and Human Services, nor does mention of trade names, commercial
products, or organizations imply endorsement by the U.S. Government.
Contract HHSN261200800001E - Funded by the National Cancer Institute
DEPARTMENT OF HEALTH AND HUMAN SERVICES • National Institutes of Health • National Cancer Institute
Frederick National Laboratory is a Federally Funded Research and Development Center
operated by Leidos Biomedical Research, Inc., for the National Cancer Institute
Introduction
Traditional approaches to data mining and integration in
research have relied on relational databases or custom
programming to derive insights from research-related data.
However, as more next-generation sequencing (NGS) data
become available, these approaches limit the exploration of
certain hypotheses. One such limitation is in mining variant
data from publicly available resources such as the 1000
Genomes Project and TCGA.
Although applications exist for quickly finding public data
with a given set of variants or for looking up minor allele
frequencies, no application can be applied generically
across all projects to let researchers mine the data
globally and find patterns applicable to their specific
research interests.
In this pilot project, we investigated whether graph
database structures are applicable for mining variants from
individuals and populations in a scalable manner and for
understanding their impact by integrating them with known
annotations.
Phase I: As an initial evaluation of the graph structures we
ran several simple queries, also feasible through a
relational architecture, and measured performance
speeds.
Simple Query Examples
• Get all information for a single variant
• Find annotations within a range of genomic locations
• Find variants associated with specific clinical
phenotypes
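As an illustration of the second query type, the sketch below finds annotations within a range of genomic locations using a sorted per-chromosome list and binary search. It is a minimal stand-in written for this text, not the poster's graph database or query language; the class name `LocationIndex` and the sample annotations are hypothetical.

```python
import bisect
from collections import defaultdict


class LocationIndex:
    """Toy in-memory index of annotations keyed by chromosome and position.

    Hypothetical illustration only; not the graph database used in the poster.
    """

    def __init__(self):
        # chromosome -> (sorted positions, annotations in the same order)
        self._by_chrom = defaultdict(lambda: ([], []))

    def add(self, chrom, pos, annotation):
        positions, annotations = self._by_chrom[chrom]
        # Insert at the index that keeps the positions list sorted.
        i = bisect.bisect_right(positions, pos)
        positions.insert(i, pos)
        annotations.insert(i, annotation)

    def in_range(self, chrom, start, end):
        """Return annotations with start <= position <= end on one chromosome."""
        positions, annotations = self._by_chrom.get(chrom, ([], []))
        lo = bisect.bisect_left(positions, start)
        hi = bisect.bisect_right(positions, end)
        return annotations[lo:hi]


index = LocationIndex()
index.add("chr1", 300, "gene:C")      # made-up annotations
index.add("chr1", 100, "gene:A")
index.add("chr1", 200, "clinvar:B")
print(index.in_range("chr1", 100, 200))
```

Each lookup costs two binary searches plus the size of the result, which is the kind of access pattern a relational index also handles well.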
Performance speeds
• Query times in milliseconds
• Better than or equal to relational database query times
Queries
• Developed a new SQL-like query language called
SparkQL
• Makes queries easier to write for users who are not programmers
Ingestion Times
• Slower than expected
• Sparksee is a multi-threaded, single-writer database
• Writes one node/edge at a time
• Each write involves creating connections with existing
nodes
• Slows down as the graph size increases
Solution: Implement multi-threaded insertions in
combination with internal data structures to efficiently
find nodes and create edges
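The lookup half of this solution can be sketched as follows: a hash map from a node's key to its ID lets each edge insertion find its endpoints in constant average time instead of searching the graph. This is an assumption-laden illustration, not the Sparksee API; `TinyGraph`, `ingest`, and the sample identifiers are hypothetical, and the multi-threaded part is not shown.

```python
class TinyGraph:
    """Hypothetical in-memory stand-in for a graph store (not the Sparksee API)."""

    def __init__(self):
        self.nodes = []        # node id -> key
        self.edges = []        # (source id, target id) pairs
        self._id_by_key = {}   # key -> node id, for O(1) average lookups

    def get_or_create(self, key):
        """Return the id for key, creating the node on first sight."""
        node_id = self._id_by_key.get(key)
        if node_id is None:
            node_id = len(self.nodes)
            self.nodes.append(key)
            self._id_by_key[key] = node_id
        return node_id

    def add_edge(self, source_key, target_key):
        source = self.get_or_create(source_key)
        target = self.get_or_create(target_key)
        self.edges.append((source, target))


def ingest(graph, records):
    """Load (individual, variant) pairs as edges between typed nodes."""
    for individual, variant in records:
        graph.add_edge(("individual", individual), ("variant", variant))


graph = TinyGraph()
# Made-up sample and variant identifiers.
ingest(graph, [("sample_1", "var_a"), ("sample_2", "var_a"), ("sample_1", "var_b")])
print(len(graph.nodes), len(graph.edges))
```

Because the index is separate from the edge list, the cost of finding an existing node does not grow with the size of the graph.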
High Degree Vertices
• Nodes with millions of edges
• Stored in a non-distributed, list-like format
• Searches for a specific edge might be slow
Example: nodes representing individuals with millions
of variants
Solution: Explore other graph clustering approaches
that can essentially condense the information
presented
Our results indicate that a graph database, run on an
in-memory machine, can be a powerful tool for cancer
research. Performance using the graph database for finding
details on specific nodes or a list of nodes is better than
or equal to that of a well-architected relational database.
We also see promising initial results for identifying
correlations between genetic changes and specific phenotype
conditions.
We conclude that an in-memory graph database
would allow researchers to run known queries while also
providing the opportunity to develop algorithms to explore
complex correlations. Graph models are useful for mining
complex data sets and could prove essential in the
development and implementation of tools aiding precision
medicine.
Results (cont.)
Hardware
• FedCentric Labs’ SGI UV300 system: x86/Linux system,
scales up to 1,152 cores & 64TB of memory
• Data in memory, very low latency, high performance
(Fig. 2)
Fig. 1: Graphs handle data complexity intuitively and interactively
Fig. 4: Results of spectral clustering of 1000 Genomes data
Methods and Materials (cont.)
Data
• SNPs from the 1000 Genomes Project
• Phenotype conditions from ClinVar
• Gene mappings & mRNA transcripts from Entrez Gene
• Amino acid changes from UniProt
The Graph
• Variants and annotations mapped to reference genomic
locations (Fig. 3)
• Includes all chromosomes and genomic locations
• 180 million nodes and 12 billion edges.
Fig. 2: Latency matters
Fig. 3: The Graph Model
Graph Architecture
• Sparsity Technologies’ Sparksee graph database
• API supports C, C++, C#, Python, and Java
• Implements graph and its attributes as maps and sparse
bitmap structures
• Allows it to scale with very limited memory requirements.
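To show why bitmap structures suit this kind of data, the sketch below maps each attribute value to a bitmap of node IDs, so that combining two conditions becomes a single bitwise AND. It uses plain Python integers as bitsets, which are dense rather than compressed, so it illustrates the idea only and makes no claim about how Sparksee implements it; `BitmapIndex` and `ids` are hypothetical names.

```python
from collections import defaultdict


class BitmapIndex:
    """Maps an attribute value to a bitmap of node ids (bit i set = node i)."""

    def __init__(self):
        # value -> Python int used as a bitset (illustrative, not compressed)
        self._bitmaps = defaultdict(int)

    def add(self, value, node_id):
        self._bitmaps[value] |= 1 << node_id

    def nodes_with(self, value):
        """Return the bitmap for value, or 0 if the value was never added."""
        return self._bitmaps.get(value, 0)


def ids(bitmap):
    """Expand a bitmap into a sorted list of node ids."""
    out = []
    node_id = 0
    while bitmap:
        if bitmap & 1:
            out.append(node_id)
        bitmap >>= 1
        node_id += 1
    return out


index = BitmapIndex()
for node in (0, 2, 5):
    index.add("chr1", node)          # made-up attribute values
for node in (2, 3, 5):
    index.add("pathogenic", node)

# Nodes that carry both attributes: one bitwise AND over the two bitmaps.
print(ids(index.nodes_with("chr1") & index.nodes_with("pathogenic")))
```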
Phase II: We explored complex patterns and clusters inside
the graph using queries, such as spectral clustering, that
were not feasible through the relational architecture.
Complex Query Examples
• Compare variant profiles and find individuals that are
closely related
• Compare annotation profiles to find clusters of
populations
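One simple way to compare variant profiles is set overlap. The sketch below scores each pair of individuals by Jaccard similarity (shared variants divided by all variants in either profile). The poster does not state which similarity measure was used, so Jaccard is an assumption here, and the function names and profiles are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared variants divided by all variants seen in either profile."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_pairs(profiles, threshold):
    """Return (name, name, similarity) for pairs at or above the threshold."""
    pairs = []
    for first, second in combinations(sorted(profiles), 2):
        score = jaccard(profiles[first], profiles[second])
        if score >= threshold:
            pairs.append((first, second, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda pair: -pair[2])


# Made-up profiles: each individual maps to a set of variant ids.
profiles = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {7, 8},
}
print(closest_pairs(profiles, threshold=0.5))
```

The pairwise scores form a similarity matrix, which is the usual input to a clustering step.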
Phase II Results
• Eight populations with 25 individuals from each
population
• Strong eigenvalue support (near zero) for 3 main clusters
• Cluster pattern supported by population genetics (Fig. 4)
Performance speeds
• Spectral clustering took ca. 2 minutes
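The eigenvalue criterion can be illustrated with a small example. For a similarity matrix W with degree matrix D, the graph Laplacian L = D - W has one zero eigenvalue per disconnected group, so the count of near-zero eigenvalues suggests the number of clusters. The sketch below assumes NumPy and the unnormalized Laplacian; the poster does not say which Laplacian or tolerance was used, and the toy matrix is made up. A full spectral clustering would also cluster the corresponding eigenvectors, which is not shown.

```python
import numpy as np


def laplacian_eigenvalues(similarity):
    """Eigenvalues (ascending) of the unnormalized graph Laplacian L = D - W."""
    w = np.asarray(similarity, dtype=float)
    degree = np.diag(w.sum(axis=1))
    # eigvalsh is appropriate because L is symmetric when W is symmetric.
    return np.linalg.eigvalsh(degree - w)


def count_clusters(similarity, tol=1e-8):
    """Count near-zero eigenvalues; tol is an assumed, illustrative cutoff."""
    return int(np.sum(np.abs(laplacian_eigenvalues(similarity)) < tol))


# Toy symmetric similarity matrix with three disconnected pairs of individuals.
w = np.zeros((6, 6))
for i, j in [(0, 1), (2, 3), (4, 5)]:
    w[i, j] = w[j, i] = 1.0

print(laplacian_eigenvalues(w))
print(count_clusters(w))
```

On real data the groups are rarely fully disconnected, so the small eigenvalues are close to zero rather than exactly zero and the cutoff is usually chosen by looking for a gap in the sorted eigenvalues.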

2015 GU-ICBI Poster (third printing)
