Dot plots

                    Dr Avril Coghlan
                   alc@sanger.ac.uk

Note: this talk contains animations, which can only be seen by
downloading the slides and using ‘View Slide Show’ in PowerPoint
Dot plots
• How can we compare the human & Drosophila
  melanogaster Eyeless protein sequences?
  One method is a dot plot
• A dot plot is a graphical method for assessing
  similarity between 2 sequences
  Make a matrix (table) with one row for each letter in sequence 1, & one
       column for each letter in sequence 2
  Colour in each cell where the letters in the 2 sequences are identical
  Regions of local similarity between the 2 sequences appear as diagonal
       lines of coloured cells (‘dots’)
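e.g. for sequences ‘RQQEPVRSTC’ (sequence 1, rows) and ‘QQESGPVRST’
(sequence 2, columns), where x = coloured cell and . = empty cell:

               Q  Q  E  S  G  P  V  R  S  T
           R   .  .  .  .  .  .  .  x  .  .
           Q   x  x  .  .  .  .  .  .  .  .
           Q   x  x  .  .  .  .  .  .  .  .
           E   .  .  x  .  .  .  .  .  .  .
           P   .  .  .  .  .  x  .  .  .  .
           V   .  .  .  .  .  .  x  .  .  .
           R   .  .  .  .  .  .  .  x  .  .
           S   .  .  .  x  .  .  .  .  x  .
           T   .  .  .  .  .  .  .  .  .  x
           C   .  .  .  .  .  .  .  .  .  .

     Regions of local similarity between the 2 sequences appear as
     diagonal lines (here QQE and PVRST)
     Some off-diagonal dots may be due to chance similarities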
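
The recipe for a basic dot plot can be sketched in a few lines of Python. This is only an illustration, not the code used to draw the slides: the function names dot_matrix and render are made up for this example, and 'x' and '.' stand in for coloured and empty cells.

```python
def dot_matrix(seq1, seq2):
    """Return a list of rows; cell [i][j] is True when seq1[i] == seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


def render(seq1, seq2, matrix):
    """Draw the matrix as text: 'x' for a coloured cell, '.' for an empty one."""
    lines = ["  " + " ".join(seq2)]  # header row: the letters of sequence 2
    for letter, row in zip(seq1, matrix):
        lines.append(letter + " " + " ".join("x" if cell else "." for cell in row))
    return "\n".join(lines)


seq1, seq2 = "RQQEPVRSTC", "QQESGPVRST"
matrix = dot_matrix(seq1, seq2)
print(render(seq1, seq2, matrix))
```

Rows follow sequence 1 and columns follow sequence 2, as in the table description, so a shared stretch of letters shows up as a run of 'x' going down and to the right.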
Problem
• Make a dot-plot for DNA sequences “GCATCGGC” &
  “CCATCGCCATCG”. Are there regions of similarity?
Answer
• Make a dot-plot for DNA sequences “GCATCGGC” &
  “CCATCGCCATCG”. Are there regions of similarity?
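  (x = coloured cell, . = empty cell)

       C  C  A  T  C  G  C  C  A  T  C  G
   G   .  .  .  .  .  x  .  .  .  .  .  x
   C   x  x  .  .  x  .  x  x  .  .  x  .
   A   .  .  x  .  .  .  .  .  x  .  .  .
   T   .  .  .  x  .  .  .  .  .  x  .  .
   C   x  x  .  .  x  .  x  x  .  .  x  .
   G   .  .  .  .  .  x  .  .  .  .  .  x
   G   .  .  .  .  .  x  .  .  .  .  .  x
   C   x  x  .  .  x  .  x  x  .  .  x  .

  CATCG (positions 2 to 6 of sequence 1) appears twice in sequence 2
  (positions 2 to 6 and 8 to 12), giving two diagonal lines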
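
Diagonal lines can also be listed by a program rather than read by eye. The sketch below walks along every diagonal and reports each unbroken run of identical letters. The helper name diagonal_runs and the cut-off min_length=3 are assumptions made for this illustration; they do not come from the slides.

```python
def diagonal_runs(seq1, seq2, min_length=3):
    """Return (start1, start2, text) for each maximal run of identical letters
    along a diagonal that is at least min_length long (0-based start indices)."""
    runs = []
    n, m = len(seq1), len(seq2)
    for offset in range(-(n - 1), m):  # offset = column index - row index
        i = max(0, -offset)
        j = i + offset
        run_start = None
        # Go one step past the end of the diagonal so a trailing run is closed.
        while i <= n and j <= m:
            matched = i < n and j < m and seq1[i] == seq2[j]
            if matched and run_start is None:
                run_start = i
            elif not matched and run_start is not None:
                if i - run_start >= min_length:
                    runs.append((run_start, run_start + offset, seq1[run_start:i]))
                run_start = None
            i += 1
            j += 1
    return runs


for start1, start2, text in diagonal_runs("GCATCGGC", "CCATCGCCATCG"):
    print(f"{text}: sequence 1 position {start1 + 1}, sequence 2 position {start2 + 1}")
# prints:
# CATCG: sequence 1 position 2, sequence 2 position 2
# CATCG: sequence 1 position 2, sequence 2 position 8
```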
Dot plots with thresholds
• If you colour in all cells with an identical letter, some
  dots may be due to chance similarities
• Therefore, it is common to use a threshold to decide
  whether to plot a ‘dot’ in a cell
  A window of a certain size (e.g. window size = 3) is moved along all possible
        diagonals, one-by-one
  A score is calculated for each position of the window on a diagonal:
        the number of identical letters in the window
  If the score is equal to or above the threshold (e.g. threshold = score of
        2), all the cells in the window are coloured in
  The window size and threshold for the dot plot
        are chosen by trial-and-error
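
The window-and-threshold procedure can be sketched as follows. This is one possible reading of the steps above, written for illustration: the function name windowed_dot_matrix is hypothetical, and the sketch assumes that only windows lying entirely inside the matrix are scored.

```python
def windowed_dot_matrix(seq1, seq2, window=3, threshold=2):
    """Colour every cell of each diagonal window whose identity count
    is greater than or equal to the threshold."""
    n, m = len(seq1), len(seq2)
    coloured = [[False] * m for _ in range(n)]
    # (i, j) is the first cell of a window; visiting every (i, j) covers
    # every window position on every diagonal.
    for i in range(n - window + 1):
        for j in range(m - window + 1):
            score = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if score >= threshold:
                for k in range(window):
                    coloured[i + k][j + k] = True
    return coloured


seq1, seq2 = "GCATCGGC", "CCATCGCCATCG"
grid = windowed_dot_matrix(seq1, seq2, window=3, threshold=2)
print("  " + " ".join(seq2))
for letter, row in zip(seq1, grid):
    print(letter + " " + " ".join("x" if cell else "." for cell in row))
```

Raising the threshold towards the window size removes more isolated matches, at the cost of also breaking up diagonals that contain a few mismatches.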
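e.g. for sequences “GCATCGGC” and “CCATCGCCATCG”, using a window
      size of 3, and a threshold of ≥2:

[Animation: a window of 3 cells slides along each diagonal of the matrix with
rows G C A T C G G C and columns C C A T C G C C A T C G]

          Score = 2 or 3: ≥ threshold → colour in
          Score = 0 or 1: < threshold → do not colour in
          and so on....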
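
To see the individual window scores, the short sketch below scores every window position on a single diagonal. The helper window_scores and its offset argument (column index minus row index, so 0 is the main diagonal) are assumptions made for this example. The slide does not say which diagonal the animation follows.

```python
def window_scores(seq1, seq2, offset=0, window=3):
    """Score every window position on one diagonal.
    offset = column index - row index; 0 is the main diagonal."""
    i = max(0, -offset)
    j = i + offset
    matches = []
    while i < len(seq1) and j < len(seq2):
        matches.append(int(seq1[i] == seq2[j]))  # 1 for identical letters, else 0
        i += 1
        j += 1
    # Each score is the number of identities in `window` consecutive cells.
    return [sum(matches[k:k + window]) for k in range(len(matches) - window + 1)]


print(window_scores("GCATCGGC", "CCATCGCCATCG", offset=0))
# prints [2, 3, 3, 3, 2, 2]
```

With a threshold of ≥2, every window on this diagonal would be coloured in; a diagonal holding only scattered chance matches would give scores of 0 or 1 and stay blank.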
Real data: fruitfly & human Eyeless
• A dot plot of fruitfly & human Eyeless proteins:
  [Figure: dot plot with axes labelled ‘Fruitfly Eyeless’ and ‘Human Eyeless’;
  window size = 10, threshold = 3]
  Do you think we chose a good value for the
  window-size and threshold?
Real data: fruitfly & human Eyeless
• Here is a dot plot of fruitfly and human Eyeless
  proteins, made using windowsize=10, threshold=5:
  [Figure: dot plot with axes labelled ‘Fruitfly Eyeless’ and ‘Human Eyeless’;
  window size = 10, threshold = 5]
  Are there any regions of similarity?
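
A rough calculation suggests why the threshold matters so much for the two Eyeless plots. The sketch below treats each cell in a window as an independent chance match with probability 1/20. That is a deliberately crude assumption (20 equally frequent amino acids), and real proteins have uneven composition, so the figures are indicative only. The function name chance_hit_probability is made up for this example.

```python
from math import comb


def chance_hit_probability(window, threshold, p_match=1 / 20):
    """P(at least `threshold` identities in one window) under a binomial model
    in which every cell matches independently with probability p_match."""
    return sum(
        comb(window, k) * p_match**k * (1 - p_match) ** (window - k)
        for k in range(threshold, window + 1)
    )


for threshold in (3, 5):
    p = chance_hit_probability(window=10, threshold=threshold)
    print(f"window = 10, threshold = {threshold}: P(chance hit) = {p:.2e}")
```

Under this model about 1% of windows pass a threshold of 3 by chance, against well under 0.01% for a threshold of 5. That is consistent with a threshold of 5 giving a much cleaner plot.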
Pros and cons of dot plots
• Advantages
  A dot plot can be used to identify long regions of strong similarity
  between two sequences
  It produces a plot, which is easy to make and to interpret
  It can be used to compare very short or long sequences (even whole
        chromosomes – millions of bases)
• Disadvantages
  The best window size and threshold must be found by trial and error
  A dot plot can only compare 2 sequences at a time, not more than 2
  It doesn’t tell you what mutations occurred in the region of
  similarity (if there is one) since the two sequences shared a
  common ancestor
Software for making dotplots
• dotPlot() function in the SeqinR R package
  Allows you to specify a window size and threshold
  If the score in a window is ≥ the threshold, it colours in the 1st cell in
        the window (not all cells)
• EMBOSS dottup
  Allows you to specify a window size but not a threshold
  If all cells in a window are identities, it colours in all cells in the window
• EMBOSS dotmatcher
  Allows you to specify a window size and threshold
  Instead of using the number of identities in a window as the window
        score, it calculates a more complex score based on the
        similarities of the bases/amino acids
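
The two colouring conventions described above (first cell only, or all cells in the window) give visibly different plots from the same scores. The sketch below imitates both in plain Python. It is not the implementation of SeqinR or EMBOSS and does not call either tool; the function windowed_dots and its colour argument are hypothetical.

```python
def windowed_dots(seq1, seq2, window, threshold, colour="all"):
    """colour='all' marks every cell of a passing window;
    colour='first' marks only the first cell of a passing window."""
    n, m = len(seq1), len(seq2)
    grid = [[False] * m for _ in range(n)]
    for i in range(n - window + 1):
        for j in range(m - window + 1):
            score = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if score >= threshold:
                cells = range(window) if colour == "all" else range(1)
                for k in cells:
                    grid[i + k][j + k] = True
    return grid


def count_dots(grid):
    """Number of coloured cells in the grid."""
    return sum(sum(row) for row in grid)


seq1, seq2 = "RQQEPVRSTC", "QQESGPVRST"
for mode in ("all", "first"):
    grid = windowed_dots(seq1, seq2, window=3, threshold=3, colour=mode)
    print(mode, count_dots(grid))
```

With the first-cell convention each diagonal line is drawn shorter by (window size - 1) cells, which is worth remembering when comparing plots made by different programs.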
Problem
• Make a dot-plot for amino acid sequences
  “RQQEPVRSTC” and “QQESGPVRST”, using a
  window size of 3, and a threshold of ≥3
Answer
•   Make a dot-plot for sequences “RQQEPVRSTC” and “QQESGPVRST”,
    using window size: 3, threshold: ≥3

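    (x = coloured cell, . = empty cell)

                Q  Q  E  S  G  P  V  R  S  T
            R   .  .  .  .  .  .  .  .  .  .
            Q   x  .  .  .  .  .  .  .  .  .
            Q   .  x  .  .  .  .  .  .  .  .
            E   .  .  x  .  .  .  .  .  .  .
            P   .  .  .  .  .  x  .  .  .  .
            V   .  .  .  .  .  .  x  .  .  .
            R   .  .  .  .  .  .  .  x  .  .
            S   .  .  .  .  .  .  .  .  x  .
            T   .  .  .  .  .  .  .  .  .  x
            C   .  .  .  .  .  .  .  .  .  .

    Only the QQE and PVRST diagonals remain; the isolated matches are not coloured in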
Further reading
•   Chapter 3 in Introduction to Computational Genomics, by Cristianini & Hahn
•   Practical on dot plots in R in the Little Book of R for Bioinformatics:
    https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html

Dotplots for Bioinformatics


Editor's Notes

  1. In R:
       setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
       library("seqinr")
       seq1 <- "RQQEPVRSTC"
       seq2 <- "QQESGPVRST"
       seq1b <- s2c(seq1)
       seq2b <- s2c(seq2)
       source("dotplot.R")
       makeDotPlot1(seq1b,seq2b,dotsize=1)
  2. In R:
       setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
       library("seqinr")
       seq1 <- "GCATCGGC"
       seq2 <- "CCATCGCCATCG"
       seq1b <- s2c(seq1)
       seq2b <- s2c(seq2)
       source("dotplot.R")
       makeDotPlot1(seq1b,seq2b,dotsize=1)
  3. In R:
       setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
       library("seqinr")
       seq1 <- "GCATCGGC"
       seq2 <- "CCATCGCCATCG"
       seq1b <- s2c(seq1)
       seq2b <- s2c(seq2)
       source("dotplot.R")
       makeDotPlot2(seq1b,seq2b,dotsize=1,windowsize=3,threshold=2)
  4. In R:
       setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
       library("seqinr")
       seq1 <- read.fasta("human.fa") # human Eyeless
       seq2 <- read.fasta("fly.fa") # fruitfly Eyeless
       seq1b <- seq1[[1]]
       seq2b <- seq2[[1]]
       source("dotplot.R")
       makeDotPlot2(seq1b,seq2b,dotsize=1,windowsize=10,threshold=3)
     Saved picture as dotplot2.png
  5. In R:
       setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
       library("seqinr")
       seq1 <- read.fasta("human.fa") # human Eyeless
       seq2 <- read.fasta("fly.fa") # fruitfly Eyeless
       seq1b <- seq1[[1]]
       seq2b <- seq2[[1]]
       source("dotplot.R")
       makeDotPlot2(seq1b,seq2b,dotsize=1,windowsize=10,threshold=5)
     Saved picture as dotplot1.png
  6. In R:
       setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
       library("seqinr")
       seq1 <- "RQQEPVRSTC"
       seq2 <- "QQESGPVRST"
       seq1b <- s2c(seq1)
       seq2b <- s2c(seq2)
       source("dotplot.R")
       makeDotPlot2(seq1b,seq2b,dotsize=1,windowsize=3,threshold=3)