Intended for a mixed, general audience of clinicians, business interests, and research scientists. No audio is included; however, the event was recorded and posted to YouTube by Genome Atlantic: http://www.youtube.com/watch?v=FLVjwOngu-Q
Bioinformatics in Gene Research
1. Bioinformatics in Genetics
Research
Genetics Noon Symposium Series
Daniel Gaston, PhD
Dr. Karen Bedard Lab, Department of Pathology
November 21st, 2012
2. IGNITE
Orphan Diseases: Identifying Genes and Novel
Therapeutics to Enhance Treatment
Identify causative genetic variations in orphan
diseases with an emphasis on Atlantic Canada
Develop animal and cell culture models
Identify and develop novel therapeutics
igniteproject.ca
4. Outline
Introduction
Bioinformatics in Disease Genomics
Next-Generation Sequencing
Genomics in Research and the Clinic
The Data Deluge and its Solutions
Bioinformatic Methods for Analyzing Genomic Data
Case Studies
Conclusion
5. Bioinformatics in Disease Genomics
Handling and long-term storage of raw data
(sequencing, gene expression, etc.)
Maintenance and support of computational
infrastructure
Experimental design
Data analysis
Methods development
Analysis pipelines
Statistical analyses
Algorithm design
14. Disease Genomics: Hunting Down Pathogenic Genetic Variation
[Diagram: a reference gene (Exon 1, Intron 1, Exon 2) with its splice sites, start codon, and TAA stop codon, transcribed into the mRNA coding for the protein. In the patient's copy of the gene (Exon 1, Intron 1, Exon 2), the TAA stop codon reads TAC, which codes for Tyr.]
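The diagram's example, a stop codon (TAA) that reads as tyrosine (TAC) in the patient, can be worked through in a few lines. This is only a sketch: the codon table is deliberately partial (it holds just the codons needed here, taken from the standard genetic code), and the function name `describe_codon_change` is hypothetical, not part of any tool used in the talk.

```python
# Partial codon table: only the codons needed for this illustration
# (values follow the standard genetic code).
CODON_TABLE = {
    "ATG": "Met",   # start codon
    "TAA": "Stop",
    "TAG": "Stop",
    "TGA": "Stop",
    "TAC": "Tyr",
    "TAT": "Tyr",
}


def describe_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Describe the effect of replacing ref_codon with alt_codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if ref_aa == "Stop":
        return f"stop loss (Stop -> {alt_aa})"
    if alt_aa == "Stop":
        return f"stop gain ({ref_aa} -> Stop)"
    return f"amino acid change ({ref_aa} -> {alt_aa})"


# The change shown on the slide: reference TAA, patient TAC.
print(describe_codon_change("TAA", "TAC"))  # stop loss (Stop -> Tyr)
```

A real annotation tool would use the full 64-codon table and the gene's reading frame; the point here is only that a single-base change at the stop codon lets translation continue past the normal end of the protein.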
16. Disease Genomics: Research vs Clinic
Still predominantly research oriented
Complex/Common disease
Mendelian disorders
Cancer genomics
Clinical genomics starting to gain traction
Cancer genomics
Cancer subtype identification
Personalized medicine and predicting outcomes
Mendelian disorders
Early diagnosis
Cost effectiveness
17. Clinical Genomics
Children’s Mercy Hospital NICU
In the US >20% of infant deaths due to genetic disease
Serial sequencing of candidate genes too slow
19. Children’s Mercy Hospital NICU
50-hour differential diagnosis of monogenic disease
Sample preparation and sequencing: 30.5 hours
Automated bioinformatics analysis: 17.5 hours
Previous high-throughput sequencing methods: 19 days
Test on seven infants, two previously diagnosed using
standard methods, five undiagnosed
Caveats
Bioinformatics portion not available outside of hospital
Requires thorough clinical phenotyping using a controlled
vocabulary
Generates a large amount of data
20. The Data Deluge
4 million genetic variants
2 million associated with protein-coding genes
10,000 possibly of disease-causing type
1,500 with <1% frequency in population
22. Exome Sequencing
Exome: Portion of genome composed of protein-
coding exons and functional RNA sequences
1.5 - 2% of human genome (50 Mb)
> 85% of monogenic diseases due to variants in
exome
Complete exome sequencing: ~ $1000/sample
23. Caveats
Incomplete and non-uniform coverage of exome
Systematic bias (GC content)
Random sampling
Not all genetic variants amenable to discovery
Non-coding variants
Structural variants
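Incomplete and non-uniform coverage is usually quantified by asking what fraction of targeted positions reach a minimum read depth. The sketch below assumes a plain list of per-position depths; the function name `coverage_summary` and the 20x threshold are illustrative assumptions, not values from the talk.

```python
def coverage_summary(depths, min_depth=20):
    """Summarize per-position read depths over the targeted region.

    depths    : list of read depths, one per targeted base (illustrative input)
    min_depth : depth considered sufficient for variant calling (assumed value)
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "mean_depth": sum(depths) / len(depths),
        "fraction_at_min_depth": covered / len(depths),
    }


# Toy example: a healthy-looking mean depth can hide poorly covered positions.
summary = coverage_summary([0, 5, 12, 25, 30, 48, 60, 80])
print(summary)  # mean_depth 32.5, fraction_at_min_depth 0.625
```

In this toy input the mean depth is well above the threshold, yet more than a third of the positions fall below it, which is why mean coverage alone is a poor guide to whether a variant could have been missed.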
30. It Sounds Simple, but…
For every stage there are multiple programs
available and published in the literature
For every program there is a wide variety of
parameter values and options. Defaults are often “good
enough”, but not always
Best combinations of programs and options are not well
understood
Protocols are changing rapidly as new technologies and
methods are developed
Different centres and groups use slightly different
workflows with similar, but not identical, results
33. If a problem cannot be solved, enlarge it.
--Dwight D. Eisenhower
34. Annotations Associated with Genomic Variants
Is variant in a known protein-coding gene?
What does the gene do?
What molecular pathways?
What protein-protein interactions?
What tissues is it expressed in?
When in development?
Has this variant been seen before?
What population(s)? With what frequency?
Has it been seen in local sequencing projects?
Is there any known clinical significance?
What is the effect of the variation?
Does it change the resulting protein? How?
37. Potential Pitfalls with Annotation Sources
Databases often overlap and agree, but there may
be disagreements
Source of information: Predicted versus
experimental
Incorrect and out-of-date information
Large-scale un-validated versus manually curated
datasets
39. IGNITE Data Pipeline and Integration
[Diagram. Data sources: Gene Annotations, Mapped Region(s), Gene Definitions, Known Genes, Pathway and Interactions. Output: Annotated Genomic Variants. Steps: Filter, Sort, Prioritize.]
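The annotate, filter, sort, and prioritize flow can be sketched as a pair of small functions. This is not the IGNITE pipeline itself: the record fields (`id`, `gene`, `freq`, `pathway`), the gene names, and the 1% cutoff are all hypothetical placeholders chosen for illustration.

```python
def annotate(variants, gene_annotations):
    """Attach per-gene annotation fields to each variant record."""
    return [{**v, **gene_annotations.get(v.get("gene"), {})} for v in variants]


def filter_sort_prioritize(variants, known_genes, max_freq=0.01):
    """Keep rare variants, then rank known genes first and rarer variants first."""
    rare = [v for v in variants if v.get("freq", 0.0) < max_freq]
    return sorted(
        rare,
        key=lambda v: (v.get("gene") not in known_genes, v.get("freq", 0.0)),
    )


# Hypothetical input records and annotation sources.
variants = [
    {"id": "var1", "gene": "GENE_A", "freq": 0.20},
    {"id": "var2", "gene": "GENE_B", "freq": 0.002},
    {"id": "var3", "gene": "GENE_C", "freq": 0.0},
]
gene_annotations = {"GENE_B": {"pathway": "example pathway"}}
known_genes = {"GENE_B"}

ranked = filter_sort_prioritize(annotate(variants, gene_annotations), known_genes)
print([v["id"] for v in ranked])  # ['var2', 'var3']
```

Keeping annotation separate from filtering means the same annotated records can be re-filtered with different thresholds without repeating the annotation step.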
40. Filtering the Data: Categorization
[Decision tree; labels listed by level]
4 million variants
Intronic; Exonic; Intergenic
Unknown; Splice Site; Silent Mutation; Splice Site; Amino Acid Changing
Potential Disease Causing (two branches)
Known Genetic Disease Variant; Amino Acid Change Likely Pathogenic; Amino Acid Change Likely Benign; Stop Loss / Stop Gain; Known Polymorphism in Population
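The first levels of this categorization can be sketched as a function that assigns each variant to a bucket and then counts the buckets. The field names (`region`, `splice_site`, `consequence`) and the toy records are assumptions for illustration; real annotation tools use their own vocabularies.

```python
from collections import Counter


def categorize(variant):
    """Assign a variant to one of the top-level categories on the slide."""
    region = variant["region"]  # "intronic", "exonic", or "intergenic"
    if region == "intergenic":
        return "intergenic"
    if variant.get("splice_site", False):
        return f"{region}: splice site"
    if region == "intronic":
        return "intronic: unknown"
    # Exonic, not at a splice site: fall back to the coding consequence.
    return f"exonic: {variant.get('consequence', 'unknown')}"


# Hypothetical variant records.
variants = [
    {"region": "intergenic"},
    {"region": "intronic"},
    {"region": "intronic", "splice_site": True},
    {"region": "exonic", "consequence": "silent"},
    {"region": "exonic", "consequence": "amino acid changing"},
    {"region": "exonic", "consequence": "amino acid changing"},
]

counts = Counter(categorize(v) for v in variants)
for category, n in counts.most_common():
    print(category, n)
```

Counting per category at each step is a cheap sanity check: if a bucket is unexpectedly empty or dominant, the annotation step is the first place to look.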
41. Filtering the Data: Common or Rare?
Variants in dbSNP – Typically known polymorphisms,
unlikely to be associated with rare disease
Variants with relatively high frequency in control
populations (1000 Genomes, HapMap, EVS, 2800
Exomes)
Number of times variant previously seen at
sequencing centre/locally
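The three criteria above can be combined into a single rarity check. This is a minimal sketch under stated assumptions: the field names (`in_dbsnp`, `control_freqs`, `local_count`) and both thresholds are hypothetical, and excluding every dbSNP variant outright is a simplification that a real workflow may want to relax.

```python
def is_rare(variant, max_freq=0.01, max_local_count=2, exclude_dbsnp=True):
    """Return True if the variant passes all three common-or-rare checks.

    variant: dict with optional keys (all hypothetical names)
      in_dbsnp      - bool, variant is listed in dbSNP
      control_freqs - dict of control population name -> allele frequency
      local_count   - times the variant was seen in local sequencing projects
    """
    if exclude_dbsnp and variant.get("in_dbsnp", False):
        return False
    control_freqs = variant.get("control_freqs", {})
    if any(freq >= max_freq for freq in control_freqs.values()):
        return False
    return variant.get("local_count", 0) <= max_local_count


candidates = [
    {"id": "a", "in_dbsnp": True, "control_freqs": {"1000 Genomes": 0.12}},
    {"id": "b", "control_freqs": {"1000 Genomes": 0.0004, "EVS": 0.03}},
    {"id": "c", "control_freqs": {"1000 Genomes": 0.0004}, "local_count": 40},
    {"id": "d", "control_freqs": {"1000 Genomes": 0.0004}, "local_count": 1},
]
print([v["id"] for v in candidates if is_rare(v)])  # ['d']
```

Variant "c" shows why the local count matters: it looks rare in the public control set but recurs locally, which points to a local polymorphism or a systematic sequencing artifact rather than a cause of rare disease.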
45. Notes on Filtering and Variant Annotation
Very important to be aware of population when
referencing frequency of a variant. Incorrect
background leads to incorrect assumptions about
prevalence
Reasonably well-sampled local populations are
better than any other reference
Strike a balance between hard filtering for variants of
largest potential effect and being inclusive so as not to
miss variants
Some genes acquire large-effect variants (stop loss /
stop gain, etc.) frequently. Some genes can be lost
without causing disease
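The preference for a reasonably well-sampled local population can be expressed as a simple fallback rule. This is a sketch, not a recommendation: the function name `pick_reference_frequency`, the population labels, the example counts, and the minimum sample size of 200 chromosomes are all assumptions made for illustration.

```python
def pick_reference_frequency(counts, local="local", min_local_n=200):
    """Choose which population's allele frequency to use as the background.

    counts      : dict of population name -> (allele_count, allele_number)
    local       : key of the local population (hypothetical label)
    min_local_n : chromosomes needed to call the local sample
                  "reasonably well sampled" (assumed threshold)
    Returns (population_name, allele_frequency).
    """
    if local in counts:
        allele_count, allele_number = counts[local]
        if allele_number >= min_local_n:
            return local, allele_count / allele_number
    # Otherwise fall back to the population with the most sampled chromosomes.
    name, (allele_count, allele_number) = max(
        counts.items(), key=lambda item: item[1][1]
    )
    if allele_number == 0:
        raise ValueError("no population has any sampled chromosomes")
    return name, allele_count / allele_number


# Hypothetical founder variant: rare globally, much more common locally.
counts = {"local": (12, 400), "global": (5, 5000)}
print(pick_reference_frequency(counts))
```

With these made-up counts the global frequency is 0.1% while the local frequency is 3%, so a variant that looks like a strong rare-disease candidate against the global background turns out to be fairly common in the population actually being studied.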
55. Conclusions
Bioinformatics is involved at every stage of genomic
research from experimental design through to final
analysis
Standards and best practices do exist, but are
rapidly evolving as new technologies and methods
are developed
Progress towards automatic generation of clinically
interpretable genomics studies
Annotation, filtering, and prioritization of genetic
variants crucial
Balance between false positive calls and false
negatives
56. Where Are We Headed?
Integration of more data sources
Gene expression
More annotation sources
Controlled phenotype vocabularies
Gene Ontology terms
Predictive models
Recessive versus Dominant inheritance and Penetrance
“New” and Emerging Technologies
RNA-Seq (Gene Expression)
ChIP-Seq (Protein-DNA binding)
Single-Molecule Sequencing
57. Acknowledgements
Dalhousie University
Dr. Karen Bedard
Dr. Chris McMaster
Dr. Andrew Orr
Dr. Conrad Fernandez
Dr. Marissa Leblanc
Mat Nightingale
Bedard Lab
McGill/Genome Quebec
Dr. Jacek Majewski
Jeremy Schwartzentruber
Dr. Sarah Dyack
Dr. Johane Robataille
Genome Atlantic
IGNITE