SlideShare uma empresa Scribd logo
1 de 35
Baixar para ler offline
http://www.bits.vib.be/training
search engines



                                                    lennart martens
                                             lennart.martens@ugent.be
                               Lennart MARTENS
                             lennart.martens@ebi.ac.uk
                              Computational Omics and Systems Biology Group
                                 Proteomics Services Group
                              European Bioinformatics Institute
                                 Department of Medical Protein Research, VIB
                                     Hinxton, Cambridge
                                       United Kingdom
                                 Department of Biochemistry, Ghent University
                                        www.ebi.ac.uk
Lennart Martens                                      Ghent, Belgium
                           BITS MS Data Processing – Search Engines
lennart.martens@UGent.be   UGent, Gent, Belgium – 19 September 2011
THREE TYPICAL PRE-PROCESSING STEPS




Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
Noise thresholding
                                          precursor
                                                                       Global thresholding




                                          precursor
                                                                         Local thresholding




Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
Charge deconvolution (peptides)




                    From: http://www.purdue.edu/dp/bioscience/images/spectrum.jpg

Lennart Martens                  BITS MS Data Processing – Search Engines
lennart.martens@UGent.be          UGent, Gent, Belgium – 19 September 2011
Charge deconvolution (proteins)




                           From: Gill et al, EMBO Journal, 2000

Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
Centroiding (peak picking)




                     Monoisotopic mass                                       Average mass




                 x                                                       x
Lennart Martens              BITS MS Data Processing – Search Engines
lennart.martens@UGent.be      UGent, Gent, Belgium – 19 September 2011
Combined results




                      A total ion current chromatogram, corrected by
                                typical pre-processing steps.


                           From: Last et al, Nature Rev. Mol. Cell Bio., 2007

Lennart Martens                   BITS MS Data Processing – Search Engines
lennart.martens@UGent.be           UGent, Gent, Belgium – 19 September 2011
Data size reduction
                         60



                                                                                                     Q-TOF II
                                                                                                     Q-TOF      Esquire HCT
                                                                                                                Esquire HCT
                         50




                         40



 File size
        File size (MB)




   (MB)
                         30

                              51.4


                         20




                                           24.5           25.8
                                                                 23.7
                         10




                                                                                      0.7        0.2              0.3       0.1
                         0
                                     RAW                  RAW GZIPped                   Peak lists              Peak lists GZIPped
                                                                        Data type
                                                                          Data type


                                                  See: Martens et al., Proteomics, 2005

Lennart Martens                                   BITS MS Data Processing – Search Engines
lennart.martens@UGent.be                           UGent, Gent, Belgium – 19 September 2011
MS/MS IDENTIFICATION
PEPTIDE FRAGMENTATION FINGERPRINTING




Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
Peptide sequences and MS/MS spectra

                                LENNART
intensity



                                                                        LENNAR
                                RT

                                                          NNART
                                                      NART
                                        LEN                                      LENNART
                                                         LENNA                   LENNART
                                           ART                          ENNART
                 T                            LENN
                     L
                               LE
             L             E        N          N         A            R          T
                                                                                           m/z
Lennart Martens                      BITS MS Data Processing – Search Engines
lennart.martens@UGent.be              UGent, Gent, Belgium – 19 September 2011
Peptide fragment fingerprinting (PFF)
                                                                                     Int
                                                    YSFVATAER
                                                                                                      m/z
                                                                                     Int
                                                     HETSINGK
                                    in silico                           in silico    Int
                                                                                                      m/z


                                                  MILQEESTVYYR
                                     digest                             MS/MS
                                                                                                      m/z
                                                                                     Int
                                                    SEFASTPINK
                                                          …                                           m/z



    protein sequence database                   peptide sequences                      theoretical MS/MS
                                                                                            spectra



                    1) YSFVATAER 34
                                                         in silico
                    2) YSFVSAIR 12
                    3) FFLIGGGGK 12                      matching
                           peptide scores
                                                                               experimental MS/MS spectrum

Lennart Martens                    BITS MS Data Processing – Search Engines
lennart.martens@UGent.be            UGent, Gent, Belgium – 19 September 2011
Three types of PFF identification

   Spectral comparison
                                                        theoretical            compare   experimental
     database                 sequence
                                                         spectrum                         spectrum


   Sequencial comparison
                                                compare           de novo                experimental
     database                 sequence
                                                                 sequence                 spectrum


   Threading comparison
                                                              thread                     experimental
     database                sequence
                                                                                          spectrum


                           From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007

Lennart Martens                    BITS MS Data Processing – Search Engines
lennart.martens@UGent.be            UGent, Gent, Belgium – 19 September 2011
The most popular algorithms


      • MASCOT (Matrix Science)
          http://www.matrixscience.com

      • SEQUEST (Scripps, Thermo Fisher Scientific)
          http://fields.scripps.edu/sequest


      • X!Tandem (The Global Proteome Machine Organization)
          http://www.thegpm.org/TANDEM


      • OMSSA (NCBI)
          http://pubchem.ncbi.nlm.nih.gov/omssa/




Lennart Martens                   BITS MS Data Processing – Search Engines
lennart.martens@UGent.be           UGent, Gent, Belgium – 19 September 2011
Overall concept of scores and cut-offs

                                Incorrect identifications                  Threshold score




                                                                   Correct
                                                                identifications




                                 False negatives                False positives
                           Adapted from: www.proteomesoftware.com – Wiki pages

Lennart Martens                    BITS MS Data Processing – Search Engines
lennart.martens@UGent.be            UGent, Gent, Belgium – 19 September 2011
Playing with probabilistic cut-off scores

                             higher stringency




                                     6%                                                      100%

                                                                                             90%
                                     5%
                                                                                             80%

                                     4%
                                                                       identifications       70%

                                                                                             60%

                                     3%                                                      50%

                                                      false positives                        40%
                                     2%
                                                                                             30%

                                                                                             20%
                                     1%
                                                                                             10%

                                     0%                                                      0%
                                             p=0.05        p=0.01       p=0.005   p=0.0005



Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
SEQUEST
   • Very well established search engine
   • Can be used for MS/MS (PFF) identifications
   • Based on a cross-correlation score (includes peak height)
   • Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994
   • Provides preliminary (Sp) score, rank, cross-correlation score (XCorr),
       and score difference between the top tow ranks (deltaCn, ∆Cn)
   • Thresholding is up to the user, and is commonly done per charge state


             ������
   • Many extensions exist to perform a more automatic validation of results

    ������������ = � ������������ ∙ ������(������+������)
           ������=1
                   1
                                 +75

    XCorr = ������0 − 151            � ������������
                                                                            XCorr 1 − XCorr 2
                                ������=−75
                                                         deltaCn=
                                                                                XCorr 1
Lennart Martens                 BITS MS Data Processing – Search Engines
lennart.martens@UGent.be         UGent, Gent, Belgium – 19 September 2011
SEQUEST: some additional pictures




   From: MacCoss et al., Anal. Chem. 2002




                                                           From: Peng et al., J. Prot. Res.. 2002

Lennart Martens                 BITS MS Data Processing – Search Engines
lennart.martens@UGent.be         UGent, Gent, Belgium – 19 September 2011
Mascot


    • Very well established search engine, Perkins, Electrophoresis 1999
    • Can do MS (PMF) and MS/MS (PFF) identifications
    • Based on the MOWSE score,
    • Unpublished core algorithm (trade secret)
    • Predicts an a priori threshold score that identifications need to pass
    • From version 2.2, Mascot allows integrated decoy searches
    • Provides rank, score, threshold and expectation value per identification
    • Customizable confidence level for the threshold score




Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
Mascot: some additional pictures

                                 40
    Average identity threshold




                                 35        y = 8.3761x - 34.089
                                                 2
                                            6%R = 0.9985                                                                      100%
 Average identitythreshold




                                 30

                                 25                                                                                           90%
                                            5%
                                 20                                                                                           80%
                                 15                                                                                           70%
                                            4%
                                 10
                                                                                                            identifications   60%
                                  5
                                            3%                                                                                50%
                                  0
                                   6.50   7.00              7.50              8.00         8.50                               40%
                                            2%          log10(number of AA)
                                                                                                                              30%
                                                                                        false positives
                                                                                                                              20%
                                            1%
                                                                                                                              10%

                                            0%                                                                                0%
                                                         p=0.05                p=0.01             p=0.005         p=0.0005



Lennart Martens                                             BITS MS Data Processing – Search Engines
lennart.martens@UGent.be                                     UGent, Gent, Belgium – 19 September 2011
X!Tandem


   • A successful open source search engine, Craig and Beavis, RCMS 2003
   • Can be used for MS/MS (PFF) identifications
                                                                                     n         
   • Based on a hyperscore (Pi is either 0 or 1):                      HyperScore =  ∑ Ii * Pi  * Nb !* Ny !
                                                                                     i =0      
   • Relies on a hypergeometric distribution (hence hyperscore)
   • Published core algorithm, and is freely available
   • Provides hyperscore and expectancy score (the discriminating one)
   • X!Tandem is fast and can handle modifications in an iterative fashion
   • Has rapidly gained popularity as (auxiliary) search engine




Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
X!Tandem: some additional pictures
                  60                                                                                      4

                                                                                                         3.5
                  50
                                                                                                          3




                                                                                        log(# results)
                  40
 # results




                                                                                                         2.5

                  30                                                                                      2

                                                                                                         1.5
                  20
                                                                                                          1

                  10                                                                                     0.5

                                                                                                          0
                       0                                                                                       20   25   30   35       40   45   50
                           0        20         40         60            80      100                                       hyperscore
                                              hyperscore
                                                                                                                               significance
                  6
                                                                                                                                threshold
                  4
 log(# results)




                  2

                  0

                  -2                                                               Adapted from: Brian Searle, ProteomeSoftware,
                  -4
                                                                               http://www.proteomesoftware.com/XTandem_edited.pdf
                  -6

                  -8
                                                                       E-value=e-8.2
            -10
                       0       20        40     60   80          100
                                    hyperscore
Lennart Martens                                            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be                                       UGent, Gent, Belgium – 19 September 2011
A note on how the scores differ

             SEQUEST       Accuracy Score                    Relative Score



                              XCorr                               DeltaCn
             X! Tandem




                           HyperScore                              E-Value



                                                           Adapted from: Brian Searle, ProteomeSoftware

Lennart Martens                 BITS MS Data Processing – Search Engines
lennart.martens@UGent.be         UGent, Gent, Belgium – 19 September 2011
OMSSA


   • A successful open source search engine, Geer, JPR 2004
   • Can be used for MS/MS (PFF) identifications
   • Relies on a Poisson distribution
   • Published core algorithm, and is freely available
   • Provides an expectancy score, similar to the BLAST E-value
   • OMSSA was recently upgraded to take peak intensity into account
   • Good really good marks in a recently published comparative study




Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
OMSSA: some additional pictures




       Yeast lysate spectrum, m/z matches of               Validation of the Poisson distribution model:
     fragment peak matches versus all NCBI nr               mean number of modelled and measured
     sequence library. Poisson distribution fitted.           matching peaks (against the NCBI nr
                                                               database) for two mass tolerances.

                             Adapted from: Geer et al., J. Prot. Res., 2004

Lennart Martens                  BITS MS Data Processing – Search Engines
lennart.martens@UGent.be          UGent, Gent, Belgium – 19 September 2011
COMPARATIVE STUDIES




Lennart Martens               BITS MS Data Processing – Search Engines
lennart.martens@UGent.be       UGent, Gent, Belgium – 19 September 2011
Kapp et al., Proteomics, 2005




Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
Balgley et al., Mol. Cell. Proteomics, 2007




                                                1.6x more?!




Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
Combining the output of search algorithms

                                  Mascot                                                SEQUEST
                                  3229                                                     3792
                                                212                            486
                                              (+4,2%)                        (+9,6%)
                           ProteinSolver
                               3203
                                                                179          168             Phenyx
                                                     40
                                                                                             3186
                                             329                                     380
                                           (+6,5%)        501          348         (+7,5%)


                                                                1776
                                                      139                    96
                                                            195        77

                                                                146




                 Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center,
                        Ruhr-Universität Bochum; Human Brain Proteome Project

Lennart Martens                       BITS MS Data Processing – Search Engines
lennart.martens@UGent.be                 UGent, Gent, Belgium – 19 September 2011
SEQUENCIAL COMPARISON
                             ALGORITHMS




Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
Sequence tags




                                                                  sequence tag




        The concept of sequence tags was introduced by Mann and Wilm
             (Mann,and Wilm, Anal. Chem. 1994, 66: 4390-4399).

                          Image from: Matthias Wilm, EMBL Heidelberg, Germany
            http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/activities/SeqTag.html
Lennart Martens                 BITS MS Data Processing – Search Engines
lennart.martens@UGent.be         UGent, Gent, Belgium – 19 September 2011
GutenTag, DirecTag, TagRecon


   • Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010
   • Recent implementations of the sequence tag approach
   • Refine hits by peak mapping in a second stage to resolve ambiguities
   • Rely on a empirical fragmentation model
   • Published core algorithms, DirecTag and TagRecon freely available
   • Most useful to retrieve unexpected peptides (modifications, variations)
   • Entire workflows exist (e.g., combination with IDPicker)




Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
GutenTag: some additional pictures




                           From: Tabb et al., Anal. Chem., 2003

Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011
De novo compared to sequence tags




                        Example of a manual de novo of an MS/MS spectrum
                       No more database necessary to extract a sequence!


                     Algorithms                                      References

                       Lutefisk                            Dancik 1999, Taylor 2000
                      Sherenga                            Fernandez-de-Cossio 2000
                       PEAKS                                 Ma 2003, Zhang 2004
                      PepNovo                            Frank 2005, Grossmann 2005
                          …                                           …


Lennart Martens                   BITS MS Data Processing – Search Engines
lennart.martens@UGent.be           UGent, Gent, Belgium – 19 September 2011
Thank you!
                Questions?
Lennart Martens            BITS MS Data Processing – Search Engines
lennart.martens@UGent.be    UGent, Gent, Belgium – 19 September 2011

Mais conteúdo relacionado

Destaque

Destaque (19)

Utilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanaUtilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humana
 
Text mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformaticsText mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformatics
 
Managing your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformaticsManaging your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformatics
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in R
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-Seq
 
Mass Spectrometry: Protein Identification Strategies
Mass Spectrometry: Protein Identification StrategiesMass Spectrometry: Protein Identification Strategies
Mass Spectrometry: Protein Identification Strategies
 
BITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics dataBITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics data
 
BITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra toolBITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra tool
 
Productivity tips - Introduction to linux for bioinformatics
Productivity tips - Introduction to linux for bioinformaticsProductivity tips - Introduction to linux for bioinformatics
Productivity tips - Introduction to linux for bioinformatics
 
BITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome levelBITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome level
 
BITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry dataBITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry data
 
BITS - Comparative genomics: gene family analysis
BITS - Comparative genomics: gene family analysisBITS - Comparative genomics: gene family analysis
BITS - Comparative genomics: gene family analysis
 
The structure of Linux - Introduction to Linux for bioinformatics
The structure of Linux - Introduction to Linux for bioinformaticsThe structure of Linux - Introduction to Linux for bioinformatics
The structure of Linux - Introduction to Linux for bioinformatics
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5
 
RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
 

Mais de BITS

Vnti11 basics course
Vnti11 basics courseVnti11 basics course
Vnti11 basics course
BITS
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structure
BITS
 

Mais de BITS (13)

BITS - Overview of sequence databases for mass spectrometry data analysis
BITS - Overview of sequence databases for mass spectrometry data analysisBITS - Overview of sequence databases for mass spectrometry data analysis
BITS - Overview of sequence databases for mass spectrometry data analysis
 
BITS - Introduction to proteomics
BITS - Introduction to proteomicsBITS - Introduction to proteomics
BITS - Introduction to proteomics
 
BITS - Introduction to Mass Spec data generation
BITS - Introduction to Mass Spec data generationBITS - Introduction to Mass Spec data generation
BITS - Introduction to Mass Spec data generation
 
BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2
 
Marcs (bio)perl course
Marcs (bio)perl courseMarcs (bio)perl course
Marcs (bio)perl course
 
Basics statistics
Basics statistics Basics statistics
Basics statistics
 
Cytoscape: Integrating biological networks
Cytoscape: Integrating biological networksCytoscape: Integrating biological networks
Cytoscape: Integrating biological networks
 
Cytoscape: Gene coexppression and PPI networks
Cytoscape: Gene coexppression and PPI networksCytoscape: Gene coexppression and PPI networks
Cytoscape: Gene coexppression and PPI networks
 
Genevestigator
GenevestigatorGenevestigator
Genevestigator
 
BITS: UCSC genome browser - Part 1
BITS: UCSC genome browser - Part 1BITS: UCSC genome browser - Part 1
BITS: UCSC genome browser - Part 1
 
Vnti11 basics course
Vnti11 basics courseVnti11 basics course
Vnti11 basics course
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structure
 
BITS: Introduction to Linux - Software installation the graphical and the co...
BITS: Introduction to Linux -  Software installation the graphical and the co...BITS: Introduction to Linux -  Software installation the graphical and the co...
BITS: Introduction to Linux - Software installation the graphical and the co...
 

Último

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 

Último (20)

Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 

BITS - Search engines for mass spec data

  • 2. search engines lennart martens lennart.martens@ugent.be Lennart MARTENS lennart.martens@ebi.ac.uk Computational Omics and Systems Biology Group Proteomics Services Group European Bioinformatics Institute Department of Medical Protein Research, VIB Hinxton, Cambridge United Kingdom Department of Biochemistry, Ghent University www.ebi.ac.uk Lennart Martens Ghent, Belgium BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 3. THREE TYPICAL PRE-PROCESSING STEPS Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 4. Noise thresholding precursor Global thresholding precursor Local thresholding Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 5. Charge deconvolution (peptides) From: http://www.purdue.edu/dp/bioscience/images/spectrum.jpg Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 6. Charge deconvolution (proteins) From: Gill et al, EMBO Journal, 2000 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 7. Centroiding (peak picking) Monoisotopic mass Average mass x x Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 8. Combined results A total ion current chromatogram, corrected by typical pre-processing steps. From: Last et al, Nature Rev. Mol. Cell Bio., 2007 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 9. Data size reduction 60 Q-TOF II Q-TOF Esquire HCT Esquire HCT 50 40 File size File size (MB) (MB) 30 51.4 20 24.5 25.8 23.7 10 0.7 0.2 0.3 0.1 0 RAW RAW GZIPped Peak lists Peak lists GZIPped Data type Data type See: Martens et al., Proteomics, 2005 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 10. MS/MS IDENTIFICATION PEPTIDE FRAGMENTATION FINGERPRINTING Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 11. Peptide sequences and MS/MS spectra LENNART intensity LENNAR RT NNART NART LEN LENNART LENNA LENNART ART ENNART T LENN L LE L E N N A R T m/z Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 12. Peptide fragment fingerprinting (PFF) Int YSFVATAER m/z Int HETSINGK in silico in silico Int m/z MILQEESTVYYR digest MS/MS m/z Int SEFASTPINK … m/z protein sequence database peptide sequences theoretical MS/MS spectra 1) YSFVATAER 34 in silico 2) YSFVSAIR 12 3) FFLIGGGGK 12 matching peptide scores experimental MS/MS spectrum Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 13. Three types of PFF identification Spectral comparison theoretical compare experimental database sequence spectrum spectrum Sequencial comparison compare de novo experimental database sequence sequence spectrum Threading comparison thread experimental database sequence spectrum From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 14. The most popular algorithms • MASCOT (Matrix Science) http://www.matrixscience.com • SEQUEST (Scripps, Thermo Fisher Scientific) http://fields.scripps.edu/sequest • X!Tandem (The Global Proteome Machine Organization) http://www.thegpm.org/TANDEM • OMSSA (NCBI) http://pubchem.ncbi.nlm.nih.gov/omssa/ Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 15. Overall concept of scores and cut-offs Incorrect identifications Threshold score Correct identifications False negatives False positives Adapted from: www.proteomesoftware.com – Wiki pages Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 16. Playing with probabilistic cut-off scores higher stringency 6% 100% 90% 5% 80% 4% identifications 70% 60% 3% 50% false positives 40% 2% 30% 20% 1% 10% 0% 0% p=0.05 p=0.01 p=0.005 p=0.0005 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 17. SEQUEST • Very well established search engine • Can be used for MS/MS (PFF) identifications • Based on a cross-correlation score (includes peak height) • Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994 • Provides preliminary (Sp) score, rank, cross-correlation score (XCorr), and score difference between the top tow ranks (deltaCn, ∆Cn) • Thresholding is up to the user, and is commonly done per charge state ������ • Many extensions exist to perform a more automatic validation of results ������������ = � ������������ ∙ ������(������+������) ������=1 1 +75 XCorr = ������0 − 151 � ������������ XCorr 1 − XCorr 2 ������=−75 deltaCn= XCorr 1 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 18. SEQUEST: some additional pictures From: MacCoss et al., Anal. Chem. 2002 From: Peng et al., J. Prot. Res.. 2002 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 19. Mascot • Very well established search engine, Perkins, Electrophoresis 1999 • Can do MS (PMF) and MS/MS (PFF) identifications • Based on the MOWSE score, • Unpublished core algorithm (trade secret) • Predicts an a priori threshold score that identifications need to pass • From version 2.2, Mascot allows integrated decoy searches • Provides rank, score, threshold and expectation value per identification • Customizable confidence level for the threshold score Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 20. Mascot: some additional pictures 40 Average identity threshold 35 y = 8.3761x - 34.089 2 6%R = 0.9985 100% Average identitythreshold 30 25 90% 5% 20 80% 15 70% 4% 10 identifications 60% 5 3% 50% 0 6.50 7.00 7.50 8.00 8.50 40% 2% log10(number of AA) 30% false positives 20% 1% 10% 0% 0% p=0.05 p=0.01 p=0.005 p=0.0005 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 21. X!Tandem • A successful open source search engine, Craig and Beavis, RCMS 2003 • Can be used for MS/MS (PFF) identifications  n  • Based on a hyperscore (Pi is either 0 or 1): HyperScore =  ∑ Ii * Pi  * Nb !* Ny !  i =0  • Relies on a hypergeometric distribution (hence hyperscore) • Published core algorithm, and is freely available • Provides hyperscore and expectancy score (the discriminating one) • X!Tandem is fast and can handle modifications in an iterative fashion • Has rapidly gained popularity as (auxiliary) search engine Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 22. X!Tandem: some additional pictures 60 4 3.5 50 3 log(# results) 40 # results 2.5 30 2 1.5 20 1 10 0.5 0 0 20 25 30 35 40 45 50 0 20 40 60 80 100 hyperscore hyperscore significance 6 threshold 4 log(# results) 2 0 -2 Adapted from: Brian Searle, ProteomeSoftware, -4 http://www.proteomesoftware.com/XTandem_edited.pdf -6 -8 E-value=e-8.2 -10 0 20 40 60 80 100 hyperscore Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 23. A note on how the scores differ SEQUEST Accuracy Score Relative Score XCorr DeltaCn X! Tandem HyperScore E-Value Adapted from: Brian Searle, ProteomeSoftware Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 24. OMSSA • A successful open source search engine, Geer, JPR 2004 • Can be used for MS/MS (PFF) identifications • Relies on a Poisson distribution • Published core algorithm, and is freely available • Provides an expectancy score, similar to the BLAST E-value • OMSSA was recently upgraded to take peak intensity into account • Good really good marks in a recently published comparative study Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 25. OMSSA: some additional pictures Yeast lysate spectrum, m/z matches of Validation of the Poisson distribution model: fragment peak matches versus all NCBI nr mean number of modelled and measured sequence library. Poisson distribution fitted. matching peaks (against the NCBI nr database) for two mass tolerances. Adapted from: Geer et al., J. Prot. Res., 2004 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 26. COMPARATIVE STUDIES Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 27. Kapp et al., Proteomics, 2005 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 28. Balgley et al., Mol. Cell. Proteomics, 2007 1.6x more?! Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 29. Combining the output of search algorithms Mascot SEQUEST 3229 3792 212 486 (+4,2%) (+9,6%) ProteinSolver 3203 179 168 Phenyx 40 3186 329 380 (+6,5%) 501 348 (+7,5%) 1776 139 96 195 77 146 Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center, Ruhr-Universität Bochum; Human Brain Proteome Project Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 30. SEQUENCIAL COMPARISON ALGORITHMS Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 31. Sequence tags sequence tag The concept of sequence tags was introduced by Mann and Wilm (Mann,and Wilm, Anal. Chem. 1994, 66: 4390-4399). Image from: Matthias Wilm, EMBL Heidelberg, Germany http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/activities/SeqTag.html Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 32. GutenTag, DirecTag, TagRecon • Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010 • Recent implementations of the sequence tag approach • Refine hits by peak mapping in a second stage to resolve ambiguities • Rely on a empirical fragmentation model • Published core algorithms, DirecTag and TagRecon freely available • Most useful to retrieve unexpected peptides (modifications, variations) • Entire workflows exist (e.g., combination with IDPicker) Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 33. GutenTag: some additional pictures From: Tabb et al., Anal. Chem., 2003 Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 34. De novo compared to sequence tags Example of a manual de novo of an MS/MS spectrum No more database necessary to extract a sequence! Algorithms References Lutefisk Dancik 1999, Taylor 2000 Sherenga Fernandez-de-Cossio 2000 PEAKS Ma 2003, Zhang 2004 PepNovo Frank 2005, Grossmann 2005 … … Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011
  • 35. Thank you! Questions? Lennart Martens BITS MS Data Processing – Search Engines lennart.martens@UGent.be UGent, Gent, Belgium – 19 September 2011