O slideshow foi denunciado.
Seu SlideShare está sendo baixado. ×

AlgoAlignementGenomicSequences.ppt

Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio

Confira estes a seguir

1 de 59 Anúncio

Mais Conteúdo rRelacionado

Semelhante a AlgoAlignementGenomicSequences.ppt (20)

Mais recentes (20)

Anúncio

AlgoAlignementGenomicSequences.ppt

  1. 1. Algorithms for Alignment of Genomic Sequences Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004
  2. 2. Conservation Implies Function Exon Gene CNS: Other Conserved
  3. 3. Edit Distance Model (1) Weighted sum of insertions, deletions & mutations to transform one string into another AGGCACA--CA AGGCACACA | |||| || or | || || A--CACATTCA ACACATTCA
  4. 4. Edit Distance Model (2) Given: x, y Define: F(i,j) = Score of best alignment of x1…xi to y1…yj Recurrence: F(i,j) = max (F(i-1,j) – GAP_PENALTY, F(i,j-1) – GAP_PENALTY, F(i-1,j-1) + SCORE(xi, yj))
  5. 5. Edit Distance Model (3) F(i,j) = Score of best alignment ending at i,j Time O( n2 ) for two seqs, O( nk ) for k seqs F(i,j) F(i,j-1) F(i-1,j-1) F(i-1,j) AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
  6. 6. Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
  7. 7. Local Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC F(i,j) = max (F(i,j), 0) Return all paths with a position i,j where F(i,j) > C Time O( n2 ) for two seqs, O( nk ) for k seqs
  8. 8. Heuristic Local Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC BLAST FASTA
  9. 9. CHAOS: CHAins Of Seeds 1. Find short matching words (seeds) 2. Chain them 3. Rescore chain
  10. 10. CHAOS: Chaining the Seeds • Find seeds at current location in seq1 location in seq1 seed seq1 seq2
  11. 11. CHAOS: Chaining the Seeds location in seq1 distance cutoff seed seq1 seq2 • Find seeds at current location in seq1
  12. 12. CHAOS: Chaining the Seeds location in seq1 distance cutoff gap cutoff seed seq1 seq2 • Find seeds at current location in seq1
  13. 13. CHAOS: Chaining the Seeds • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box location in seq1 distance cutoff gap cutoff seed Search box seq1 seq2
  14. 14. CHAOS: Chaining the Seeds • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal location in seq1 distance cutoff gap cutoff seed Search box seq1 seq2 Range of search
  15. 15. CHAOS: Chaining the Seeds • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal. • Pick a previous seed that maximizes the score of chain location in seq1 distance cutoff gap cutoff seed Search box seq1 seq2 Range of search
  16. 16. CHAOS: Chaining the Seeds • Find seeds at current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal. • Pick a previous seed that maximizes the score of chain location in seq1 distance cutoff gap cutoff seed Search box seq1 seq2 Range of search Time O(n log n), where n is number of seeds.
  17. 17. CHAOS Scoring • Initial score = # matching bp - gaps • Rapid rescoring: extend all seeds to find optimal location for gaps
  18. 18. Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
  19. 19. Global Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z
  20. 20. LAGAN: 1. FIND Local Alignments 1. Find Local Alignments 2. Chain Local Alignments 3. Restricted DP
  21. 21. LAGAN: 2. CHAIN Local Alignments 1. Find Local Alignments 2. Chain Local Alignments 3. Restricted DP
  22. 22. LAGAN: 3. Restricted DP 1. Find Local Alignments 2. Chain Local Alignments 3. Restricted DP
  23. 23. MLAGAN: 1. Progressive Alignment Given N sequences, phylogenetic tree Align pairwise, in order of the tree (LAGAN) Human Baboon Mouse Rat
  24. 24. MLAGAN: 2. Multi-anchoring X Z Y Z X/Y Z To anchor the (X/Y), and (Z) alignments:
  25. 25. Cystic Fibrosis (CFTR), 12 species • Human sequence length: 1.8 Mb • Total genomic sequence: 13 Mb Human Baboon Cat Dog Cow Pig Mouse Rat Chimp Chicken Fugufish Zebrafish
  26. 26. CFTR (cont’d ) 90 550 99.7% Mammals LAGAN 90 862 96% Chicken & Fishes Chicken & Fishes Mammals 670 4547 99.8% MLAGAN 98% MAX MEMORY (Mb) TIME (sec) % Exons Aligned
  27. 27. Automatic computational system for comparative analysis of pairs of genomes http://pipeline.lbl.gov Alignments (all pair combinations): Human Genome (Golden Path Assembly) Mouse assemblies: Arachne, Phusion (2001) MGSC v3 (2002) Rat assemblies: January 2003, February 2003 ---------------------------------------------------------- D. Melanogaster vs D. Pseudoobscura February 2003
  28. 28. Tandem Local/Global Approach •Finding a likely mapping for a contig (BLAT)
  29. 29. Progressive Alignment Scheme yes no yes no Human, Mouse and Rat genomes Pairwise M/R mapping Aligned M&R fragments Unaligned M&R sequences Map to Human Genome Mapping aligned fragments by union of M&R local BLAT hits on the human genome H/M/R MLAGAN alignment M/R pairwise alignment M/H and R/H pairwise alignment Unassigned M&R DNA fragments yes no
  30. 30. Computational Time 23 dual 2.2GHz Intel Xeon node PC cluster. Pair-wise rat/mouse – 4 hours Pair-wise rat/human and mouse/human – 2 hours Multiple human/mouse/rat – 9 hours Total wall time: ~ 15 hours
  31. 31. Distribution of Large Indels 0 20 40 60 80 100 120 140 160 180 200 100 150 200 250 300 350 400 450 500 550 Indel length Count
  32. 32. Evolution Over a Chromosome
  33. 33. Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
  34. 34. Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication
  35. 35. Local & Global Alignment AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC Local Global
  36. 36. Glocal Alignment Problem Find least cost transformation of one sequence into another using new operations •Sequence edits •Inversions •Translocations •Duplications •Combinations of above AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
  37. 37. Shuffle-LAGAN A glocal aligner for long DNA sequences
  38. 38. S-LAGAN: Find Local Alignments 1. Find Local Alignments 2. Build Rough Homology Map 3. Globally Align Consistent Parts
  39. 39. S-LAGAN: Build Homology Map 1. Find Local Alignments 2. Build Rough Homology Map 3. Globally Align Consistent Parts
  40. 40. Building the Homology Map d a b c Chain (using Eppstein Galil); each alignment gets a score which is MAX over 4 possible chains. Penalties are affine (event and distance components) Penalties: a) regular b) translocation c) inversion d) inverted translocation
  41. 41. S-LAGAN: Build Homology Map 1. Find Local Alignments 2. Build Rough Homology Map 3. Globally Align Consistent Parts
  42. 42. S-LAGAN: Global Alignment 1. Find Local Alignments 2. Build Rough Homology Map 3. Globally Align Consistent Parts
  43. 43. S-LAGAN Results (CFTR) L o c a l G l o c a l
  44. 44. S-LAGAN Results (CFTR) H u m / M u s H u m / R a t
  45. 45. S-LAGAN Results (IGF cluster)
  46. 46. S-LAGAN results (HOX) • 12 paralogous genes • Conserved order in mammals
  47. 47. S-LAGAN results (HOX) • 12 paralogous genes • Conserved order in mammals
  48. 48. S-LAGAN Results (Chr 20) • Human Chr 20 v. homologous Mouse Chr 2. • 270 Segments of conserved synteny • 70 Inversions
  49. 49. S-LAGAN Results (Whole Genome) LAGAN S-LAGAN Total 37% 38% Exon 93% 96% Ups200 78% 81% CPU Time 350 Hrs 450 Hrs • Used Berkeley Genome Pipeline • % Human genome aligned with mouse sequence • Evaluation criteria from Waterston, et al (Nature 2002)
  50. 50. Rearrangements in Human v. Mouse Preliminary conclusions: • Rearrangements come in all sizes • Duplications worse conserved than other rearranged regions • Simple inversions tend to be most common and most conserved
  51. 51. What is next? (Shuffle) • Better algorithm and scoring • Whole genome synteny mapping • Multiple Glocal Alignment(!?)
  52. 52. Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
  53. 53. Biological Story • Math1 (Mouse Atonal Homologue 1, also ATOH) is a gene that is responsible for nervous system development
  54. 54. Align Human, Mouse, Rat & Fugu
  55. 55. Detailed Alignment hum_a : CAATAGAGGGTCTGGCAGAGGCTC---------------------CTGGC @ 57336/400001 mus_a : CAATAGAGGGGCTGGCAGAGGCTC---------------------CTGGC @ 78565/400001 rat_a : CAATAGAGGGGCTGGCAGAGACTC---------------------CTGGC @ 112663/369938 fug_a : TGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCGTGGGC @ 36013/68174 hum_a : CGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 57386/400001 mus_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 78615/400001 rat_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 112713/369938 fug_a : CGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCGAGCGC @ 36063/68174
  56. 56. Can we align human & fly??? CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA
  57. 57. Putting it all together CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA
  58. 58. Overview • Local Alignment (CHAOS) • Multiple Global Alignment (LAGAN) - Whole Genome Alignment • Glocal Alignment (Shuffle-LAGAN) • Biological Story
  59. 59. Acknowledgments Stanford: Serafim Batzoglou Arend Sidow Matt Scott Gregory Cooper Chuong (Tom) Do Sanket Malde Kerrin Small Mukund Sundararajan Berkeley: Inna Dubchak Alexander Poliakov Göttingen: Burkhard Morgenstern Rat Genome Sequencing Consortium http://lagan.stanford.edu/

×