Bioinformatics in Gene Research

Bioinformatics in Genetics Research

Genetics Noon Symposium Series
Daniel Gaston, PhD
Dr. Karen Bedard Lab, Department of Pathology

November 21st, 2012

IGNITE

 Orphan Diseases: Identifying Genes and Novel
Therapeutics to Enhance Treatment
 Identify causative genetic variations in orphan
diseases with an emphasis on Atlantic Canada
 Develop animal and cell culture models
 Identify and develop novel therapeutics
 igniteproject.ca

Outline

 Introduction
 Bioinformatics in Disease Genomics
 Next-Generation Sequencing
 Genomics in Research and the Clinic
 The Data Deluge and its Solutions
 Bioinformatic Methods for Analyzing Genomic Data
 Case Studies
 Conclusion

Bioinformatics in Disease Genomics
 Handling and long-term storage of raw data
(sequencing, gene expression, etc.)
 Maintenance and support of computational
infrastructure
 Experimental design
 Data analysis
 Methods development
 Analysis pipelines
 Statistical analyses
 Algorithm design

‘Next-Generation’ Sequencing and
Disease Genomics

Disease Genomics: Hunting Down Pathogenic
Genetic Variation

[Figure: a reference gene (Exon 1, Intron 1, Exon 2) with its start codon, splice sites, and TAA stop codon, and the mRNA coding for protein. In the patient's copy of the gene, a genetic variation changes the TAA stop codon to TAC (Tyr).]
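The TAA to TAC change in the diagram can be classified programmatically. The sketch below is illustrative only: it builds the standard genetic code and labels a single-codon substitution. The function name `classify_codon_change` and its category labels are assumptions made for this example, not part of any tool mentioned in the talk.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Label a single-codon substitution (hypothetical helper for illustration)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if ref_aa == "*":
        return "stop loss"   # the reference stop codon now encodes an amino acid
    if alt_aa == "*":
        return "stop gain"   # a premature stop codon is introduced
    return "amino acid changing"


# The change from the slide: stop codon TAA becomes TAC, which encodes Tyr (Y).
print(classify_codon_change("TAA", "TAC"))
```

A stop-loss change like this one lets translation run past the normal end of the protein, which is why such variants are usually treated as high-impact candidates.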

Disease Genomics: Research vs Clinic
 Still predominantly research oriented
 Complex/Common disease
 Mendelian disorders
 Cancer genomics
 Clinical genomics starting to gain traction
 Cancer genomics
 Cancer subtype identification
 Personalized medicine and predicting outcomes
 Early diagnosis
 Cost effectiveness

Clinical Genomics

 Children’s Mercy Hospital NICU
 In the US >20% of infant deaths due to genetic disease
 Serial sequencing of candidate genes too slow

Children’s Mercy Hospital NICU
 50-hour differential diagnosis of monogenic disease
 Sample preparation and sequencing: 30.5 hours
 Automated bioinformatics analysis: 17.5 hours
 Previous high-throughput sequencing methods: 19 days
 Test on seven infants, two previously diagnosed using
standard methods, five undiagnosed
 Caveats
 Bioinformatics portion not available outside of hospital
 Requires thorough clinical phenotyping using a controlled
vocabulary
 Generates a large amount of data

The Data Deluge

4 million genetic variants
2 million associated with protein-coding genes
10,000 possibly of disease-causing type
1,500 with <1% frequency in population

Surviving the Data Deluge

Reducing the Search Space: Exome Sequencing

Exome Sequencing

 Exome: Portion of genome composed of protein-coding
exons and functional RNA sequences

 1.5 - 2% of human genome (50 Mb)

 > 85% of monogenic diseases due to variants in
exome

 Complete exome sequencing: ~ $1000/sample
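As a quick sanity check on the 1.5 - 2% figure, the arithmetic can be written out. The 50 Mb exome size is from the slide; the genome size of roughly 3.1 billion base pairs is an assumption added here for the calculation.

```python
EXOME_BP = 50_000_000        # 50 Mb, from the slide
GENOME_BP = 3_100_000_000    # assumption: approximate haploid human genome size

exome_fraction = EXOME_BP / GENOME_BP
print(f"Exome is about {exome_fraction:.1%} of the genome")
```

Under that assumption the exome comes out at about 1.6% of the genome, inside the quoted range.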

Caveats

 Incomplete and non-uniform coverage of exome
 Systematic bias (GC content)
 Random sampling

 Not all genetic variants amenable to discovery
 Non-coding variants
 Structural variants
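One way to look for the GC-content bias mentioned above is to compute the GC fraction of each capture target and flag the extremes. This is a minimal sketch: the target names, sequences, and the 0.35/0.65 thresholds are all hypothetical and chosen only for illustration.

```python
def gc_content(seq: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def flag_extreme_gc(targets: dict, low: float = 0.35, high: float = 0.65) -> list:
    """Return names of targets whose GC fraction falls outside [low, high].

    The thresholds are illustrative assumptions, not published cut-offs.
    """
    return [name for name, seq in targets.items()
            if not low <= gc_content(seq) <= high]


# Toy sequences, not real exons.
targets = {
    "exon_A": "ATGCATGCAT",   # GC fraction 0.4
    "exon_B": "GGGCCCGCGC",   # GC fraction 1.0
    "exon_C": "ATATATTAAT",   # GC fraction 0.0
}
print(flag_extreme_gc(targets))
```

Flagged targets are the ones where low or uneven coverage would be least surprising, so missing variant calls there deserve extra caution.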

Surviving the Data Deluge

Bioinformatics

Typical Bioinformatics Workflow
QC of Raw Data

Map to Reference

QC

Find Variants

QC

Annotate

Filter
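The workflow above can be sketched as an ordered list of stages with QC checkpoints that halt the run when a check fails. Everything in this example is hypothetical: the stage functions are placeholders standing in for real mapping, variant-calling, and annotation programs, and the metric names and thresholds are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Stage = Callable[[Dict], Dict]


class QCFailure(Exception):
    """Raised when a QC checkpoint rejects the data."""


def require(key: str, minimum: float) -> Stage:
    """Build a QC checkpoint that fails if data[key] is below minimum."""
    def check(data: Dict) -> Dict:
        if data.get(key, 0) < minimum:
            raise QCFailure(f"{key}={data.get(key)} is below {minimum}")
        return data
    return check


def placeholder(updates: Dict) -> Stage:
    """Stand-in for a real tool: just records the metrics it would produce."""
    return lambda data: {**data, **updates}


def run_workflow(data: Dict, stages: List[Tuple[str, Stage]]) -> Tuple[Dict, List[str]]:
    """Run stages in order, returning the final data and the stages completed."""
    completed = []
    for name, stage in stages:
        data = stage(data)
        completed.append(name)
    return data, completed


# Stage order follows the slide; all numbers are made up.
stages = [
    ("QC of raw data", require("mean_base_quality", 20)),
    ("Map to reference", placeholder({"fraction_mapped": 0.97})),
    ("QC", require("fraction_mapped", 0.90)),
    ("Find variants", placeholder({"n_variants": 4_000_000})),
    ("QC", require("n_variants", 1)),
    ("Annotate", placeholder({"annotated": True})),
    ("Filter", placeholder({"filtered": True})),
]

result, completed = run_workflow({"mean_base_quality": 32}, stages)
print(" -> ".join(completed))
```

Keeping the stage list as data makes it easy to swap one program for another at a given step, which matters when several tools exist for every stage.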

It Sounds Simple, But…
 For every stage there are multiple programs
available and published in the literature
 For every program there is a wide variety of
parameter values and options. Defaults are often “good
enough”, but not always
 Best combinations of programs and options are not well
understood
 Protocols are changing rapidly as new technologies and
methods are developed
 Different centres and groups use slightly different
workflows with similar, but not identical, results

If a problem cannot be solved, enlarge it.
-- Dwight D. Eisenhower

Annotations Associated with Genomic Variants
 Is variant in a known protein-coding gene?
 What does the gene do?
 What molecular pathways?
 What protein-protein interactions?
 What tissues is it expressed in?
 When in development?
 Has this variant been seen before?
 What population(s)? With what frequency?
 Has it been seen in local sequencing projects?
 Is there any known clinical significance?
 What is the effect of the variation?
 Does it change the resulting protein? How?

Potential Pitfalls with Annotation Sources

 Databases often overlap and agree, but there may
be disagreements
 Source of information: Predicted versus
experimental
 Incorrect and out-of-date information
 Large-scale un-validated versus manually curated
datasets
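A simple guard against these pitfalls is to keep every source's call for a variant side by side and flag disagreements instead of silently trusting one database. The sketch below assumes a hypothetical record layout (`variant`, `source`, `significance`, `evidence`) and uses invented source names; it is not the schema of any real annotation database.

```python
from collections import defaultdict


def reconcile(records: list) -> dict:
    """Group annotation records by variant and flag conflicting calls."""
    by_variant = defaultdict(list)
    for record in records:
        by_variant[record["variant"]].append(record)

    report = {}
    for variant, recs in by_variant.items():
        calls = sorted({r["significance"] for r in recs})
        report[variant] = {
            "calls": calls,
            "conflict": len(calls) > 1,
            # Note whether any supporting record is experimental, not predicted.
            "has_experimental": any(r["evidence"] == "experimental" for r in recs),
        }
    return report


# Toy records with made-up variants and sources.
records = [
    {"variant": "var1", "source": "source_A", "significance": "pathogenic", "evidence": "predicted"},
    {"variant": "var1", "source": "source_B", "significance": "benign", "evidence": "experimental"},
    {"variant": "var2", "source": "source_A", "significance": "benign", "evidence": "predicted"},
    {"variant": "var2", "source": "source_B", "significance": "benign", "evidence": "predicted"},
]

report = reconcile(records)
for variant, summary in report.items():
    print(variant, summary)
```

Variants flagged with `conflict` are candidates for manual review, with experimentally supported records weighted above predictions.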

Bioinformatics Analyses of Genomic
Variants

Combining Data Sources and Filtering

IGNITE Data Pipeline and Integration

[Diagram labels: Gene Annotations; Annotated Genomic Variants; Mapped Region(s); Gene Definitions; Known Genes; Pathway and Interactions; Filter, Sort, Prioritize]

Filtering the Data: Categorization
[Figure: decision tree categorizing 4 million variants]
Level 1: Intronic; Exonic; Intergenic
Level 2: Unknown; Splice Site; Amino Acid Changing; Silent Mutation; Splice Site
Level 3: Potential Disease Causing (two branches)
Level 4: Known Genetic Disease Variant; Amino Acid Change Likely Pathogenic; Amino Acid Change Likely Benign; Stop Loss / Stop Gain; Known Polymorphism in Population

Filtering the Data: Common or Rare?

 Variants in dbSNP – Typically known polymorphisms,
unlikely to be associated with rare disease
 Variants with relatively high frequency in control
populations (1000 Genomes, HapMap, EVS, 2800
Exomes)
 Number of times variant previously seen at
sequencing centre/locally
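The common-versus-rare criteria above can be sketched as a filter over annotated variants. The 1% frequency cut-off echoes the "<1% frequency in population" figure from the Data Deluge slide; the record layout, the local-count limit of 2, and the example variants are assumptions made for illustration.

```python
def max_control_frequency(variant: dict) -> float:
    """Highest allele frequency reported across control populations (0.0 if unseen)."""
    return max(variant["control_freqs"].values(), default=0.0)


def is_rare(variant: dict, max_freq: float = 0.01, max_local_count: int = 2) -> bool:
    """Keep variants that are rare in controls and rarely seen locally.

    max_local_count is a hypothetical limit on how often the variant has
    been seen in other samples at the same sequencing centre.
    """
    return (max_control_frequency(variant) < max_freq
            and variant["local_count"] <= max_local_count)


# Toy variants with invented frequencies.
variants = [
    {"id": "var1", "control_freqs": {"1000 Genomes": 0.20, "EVS": 0.18}, "local_count": 40},
    {"id": "var2", "control_freqs": {"1000 Genomes": 0.002}, "local_count": 0},
    {"id": "var3", "control_freqs": {}, "local_count": 15},
]

rare = [v["id"] for v in variants if is_rare(v)]
print(rare)
```

The third variant shows why the local count matters: it is absent from the public control sets but has been seen many times locally, which points to a locally common variant or a systematic artifact.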

Notes on Filtering and Variant Annotation
 Very important to be aware of population when
referencing frequency of a variant. Incorrect
background leads to incorrect assumptions on
prevalence
 Reasonably well-sampled local populations are
better than any other reference
 Strike a balance between hard filtering for variants of
largest potential effect and being inclusive to not
miss variants
 Some genes acquire large effect variants (stop loss /
stop gain, etc.) frequently. Some genes can be lost
without causing disease

Applications to Real Data

Charcot-Marie-Tooth Disease and Cutis Laxa

Charcot-Marie-Tooth: Genetic Mapping

Chromosome 9: 120,962,282 - 133,033,431

Cutis Laxa: Genetic Mapping

Chromosome 17: 79,596,811 - 81,041,077
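The two mapped intervals can be represented as simple regions, both to see how large each search space is and to test whether a gene falls inside it. The coordinates are from the slides; treating them as 1-based inclusive positions is an assumption, and the gene interval used in the overlap check is hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int

    def length(self) -> int:
        """Size in base pairs, assuming 1-based inclusive coordinates."""
        return self.end - self.start + 1

    def overlaps(self, other: "Region") -> bool:
        """True if both regions are on the same chromosome and share any base."""
        return (self.chrom == other.chrom
                and self.start <= other.end
                and other.start <= self.end)


CMT_REGION = Region("9", 120_962_282, 133_033_431)
CUTIS_LAXA_REGION = Region("17", 79_596_811, 81_041_077)

# Hypothetical gene interval, used only to demonstrate the overlap test.
example_gene = Region("9", 125_000_000, 125_050_000)

print(f"CMT region: {CMT_REGION.length() / 1e6:.1f} Mb")
print(f"Cutis laxa region: {CUTIS_LAXA_REGION.length() / 1e6:.2f} Mb")
print(CMT_REGION.overlaps(example_gene))
```

The Charcot-Marie-Tooth interval spans about 12.1 Mb and the cutis laxa interval about 1.44 Mb, so mapping alone narrows the search to a small fraction of the genome before any variant filtering.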

Charcot-Marie-Tooth
 143 genes in region
 13 known genes in genome
 MPZ
 PMP22
 GDAP1
 KIF1B
 MFN2
 SOX
 EGR2
 DNM2
 RAB7
 LITAF (SIMPLE)
 GARS
 YARS
 LMNA

Cutis Laxa
 52 genes in region
 5 known genes in genome
 ATP6V0A2
 ELN
 FBLN5
 EFEMP2
 SCYL1BP1
 ALDH18A1

Pathway and Interaction Data

Charcot-Marie-Tooth
 37 pathways
 Clathrin-derived vesicle budding
 Lysosome vesicle biogenesis
 Endocytosis
 Golgi-associated vesicle biogenesis
 Membrane trafficking
 Trans-Golgi network vesicle budding
 Primarily LMNA or DNM2

Cutis Laxa
 10 pathways
 Phagosome
 Collecting duct acid secretion
 Lysosome
 Protein digestion and absorption
 Metabolic pathways
 Oxidative phosphorylation
 Arginine and proline metabolism
 Primarily ATP6V0A2

Results: Charcot-Marie-Tooth
 8 Genes Prioritized
Gene      Interactions   Pathway
LRSAM1    Multiple       Endocytosis
DNM1      DNM2           -
FNBP1     DNM2           -
TOR1A     LMNA           -
STXBP1    Multiple       Five
SH3GLB2   -              Endocytosis
PIP5KL1   -              Endocytosis
FAM125B   -              Endocytosis

 For more information
 Guernsey et al (2010) PLoS Genetics. 6(8): e1001081

Results: Cutis Laxa
 10 genes prioritized
Gene      Interactions   Pathway
HEXDC     Multiple       Phagosome
HG5       -              Phagosome
HG5       Multiple       Lysosome, Protein digestion
SIRT7     Multiple       Metabolic Pathways
FASN      -              Metabolic Pathways
DCXR      -              Metabolic Pathways
PYCR1     -              Metabolic Pathways, Arginine/Proline
PCYT2     -              Metabolic Pathways
ARHGDIA   -              Oxidative Phosphorylation

 For more information
 Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9

Conclusions
 Bioinformatics is involved at every stage of genomic
research from experimental design through to final
analysis
 Standards and best practices do exist, but are
rapidly evolving as new technologies and methods
are developed
 Progress towards automatic generation of clinically
interpretable genomics studies
 Annotation, filtering, and prioritization of genetic
variants crucial
 Balance between false positive calls and false
negatives

Where Are We Headed?

 Integration of more data sources
 Gene expression
 More annotation sources
 Controlled phenotype vocabularies
 Gene Ontology terms
 Predictive models
 Recessive versus Dominant inheritance and Penetrance
 “New” and Emerging Technologies
 RNA-Seq (Gene Expression)
 ChIP-Seq (Protein-DNA binding)
 Single-Molecule Sequencing

Acknowledgements
 Dalhousie University           McGill/Genome Quebec
 Dr. Karen Bedard               Dr. Jacek Majewski
 Dr. Chris McMaster             Jeremy Schwartzentruber
 Dr. Andrew Orr
 Dr. Conrad Fernandez
 Dr. Marissa Leblanc            Dr. Sarah Dyack
 Mat Nightingale                Dr. Johane Robataille
 Bedard Lab
 Genome Atlantic
 IGNITE
