PRIDE: Quality control in a proteomics
data repository
Attila Csordas
Proteomics Services Team
Biocuration Conference
April 2nd, 2012



Overview

              who are we?

             what are we dealing with?

              manual curation and submission

              quick detour: ProteomeXchange

              automated curation & submission pipeline

              conclusion


PRIDE: http://www.ebi.ac.uk/pride
       The PRoteomics IDEntifications database is a centralised, primary,
       archival, public data repository for MS/MS proteomics data. It contains
       peptide identifications, protein identifications, mass spectra,
       protein expression values and metadata.




Acknowledgements
                 colleagues at the PRIDE team




                             @pride_ebi

                         pride-ebi@ebi.ac.uk
                         pride-support@ebi.ac.uk


       http://code.google.com/p/pride-toolsuite/
       http://code.google.com/p/pride-converter-2/


Mass spectrometry
analytical technique measuring the mass-to-charge (m/z) ratio of charged
        particles to determine masses of particles, composition of
        samples/molecules and chemical structures of molecules




Shotgun/bottom-up proteomics

[Diagram: protocol. Labels: proteins, peptides, MS analysis, fragmentation, MS/MS analysis, sequence database]



What is a PRIDE submission?




growth of core data types
[Chart: labelled values 130 million, 23 million, 4.6 million]




Manual curation and submission process
Search engine output + spectra: Mascot (.dat), X!Tandem (.xml) + mgf
   -> PRIDE Converter
   -> PRIDE XML




PRIDE Inspector

initial assessment
of data quality

visualise/check data

summary charts

support for submitters &
reviewers/editors

more flexible than web
interface




Frequent Data Quality Issues

   1. syntactic problems                e.g. <SearchEngine>PeptideShaker</SearchEngine>, <PeptideItem>

   2a. core data missing                no protein/peptide identifications

   2b. or metadata missing              no species

   3. inconsistent/incorrect data       protein modifications
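Issue types 2a and 2b above are mechanical checks, so they can be sketched in a few lines. This is a minimal illustration only: the `SubmissionSummary` record and its field names are hypothetical and are not taken from the PRIDE XML schema or the PRIDE tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubmissionSummary:
    """Hypothetical per-file summary; field names are illustrative only."""
    species: Optional[str]
    n_proteins: int
    n_peptides: int
    n_spectra: int


def find_issues(summary: SubmissionSummary) -> List[str]:
    """Return human-readable descriptions of issue types 2a and 2b."""
    issues = []
    # 2a: core data missing (no protein or peptide identifications)
    if summary.n_proteins == 0 or summary.n_peptides == 0:
        issues.append("core data missing: no protein/peptide identifications")
    # 2b: metadata missing (no species annotation)
    if not summary.species:
        issues.append("metadata missing: no species")
    return issues
```

Syntactic problems (type 1) would be caught earlier by XML validation, and inconsistent data (type 3) needs domain checks such as the delta m/z analysis.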




Delta m/z of detected peptide precursors


delta m/z = experimental precursor ion m/z - theoretical precursor ion m/z




   source of delta m/z outliers: incorrect or missing protein
   modifications and charge state misassignments
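The delta m/z definition can be made concrete with a small sketch. It assumes standard monoisotopic residue masses and protonated precursors. The function names and the 0.5 m/z outlier tolerance are hypothetical choices for illustration, not values used by PRIDE.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the residue sum (Da)
PROTON = 1.007276   # mass of one proton (Da)


def theoretical_mz(sequence, charge, modification_mass=0.0):
    """Theoretical precursor m/z of a peptide carrying `charge` protons."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + modification_mass
    return (neutral + charge * PROTON) / charge


def delta_mz(experimental_mz, sequence, charge, modification_mass=0.0):
    """Experimental minus theoretical precursor m/z."""
    return experimental_mz - theoretical_mz(sequence, charge, modification_mass)


def is_outlier(delta, tolerance=0.5):
    """Flag a delta m/z outside an assumed, illustrative tolerance."""
    return abs(delta) > tolerance
```

A missing modification shows up as a delta close to the modification mass divided by the charge. A wrong charge state shifts the theoretical m/z itself, which typically gives a much larger delta.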




Fixing modifications based on delta m/z outliers
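One way to turn a delta m/z outlier into a candidate fix is to multiply the delta by the charge and compare the resulting mass shift against known modification masses. The sketch below is not the PRIDE team's procedure. The short modification table uses standard monoisotopic mass shifts, and the function name and 0.01 Da tolerance are hypothetical.

```python
# Monoisotopic mass shifts (Da) of a few common modifications.
KNOWN_MODIFICATIONS = {
    "Deamidated": 0.984016,
    "Oxidation": 15.994915,
    "Acetyl": 42.010565,
    "Carbamidomethyl": 57.021464,
    "Phospho": 79.966331,
}


def suggest_modifications(delta_mz, charge, tolerance=0.01):
    """Return names of modifications whose mass matches delta_mz * charge.

    `tolerance` is an assumed mass tolerance in Da, for illustration only.
    """
    mass_shift = delta_mz * charge
    return [
        name
        for name, mod_mass in KNOWN_MODIFICATIONS.items()
        if abs(mass_shift - mod_mass) <= tolerance
    ]
```

For example, a doubly charged precursor with a delta m/z of about 7.9975 corresponds to a mass shift of about 15.995 Da, which would suggest an unannotated oxidation.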








but the manual approach does not scale!




10 times as many (and as big) submissions per day?




single point of submission of data to the main repositories to encourage data exchange

[Diagram: ProteomeXchange data flow. Labels:
  data types: published, raw, reprocessed
  sources: individual submissions, large-scale submissions
  repositories: EBI PRIDE, raw files archive, PeptideAtlas, UniProt, other DBs (GPMDB, …)
  consumers: users]



PX submission pipeline




PX Tool -> Validation -> Submission -> Publication -> Proteome Central

Files: raw files, PRIDE XML, summary




Automated regular submission pipeline
         curation-submission time is ~1/6th of manual time

                            actionable curation summary

  number of files: 3
  Project: Combined personal saliva proteome and microbioproteome
  XML generator software: PRIDE Converter Toolsuite 2.0-SNAPSHOT

  Filename    Size     Species        #Proteins   #Peptides   #Spectra   #Unidentified spectra   PTMs   % delta m/z outliers
  22143.xml   3.3 GB   Homo sapiens   4128        60544       184209     123665                  3      0.0
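Two of the summary columns can be derived from the others, as the sketch below shows. It assumes the #Peptides column counts identified spectra. That matches the figures on the slide (184209 - 60544 = 123665), but the slide does not state it. The function name, the returned keys and the 0.5 m/z tolerance are hypothetical.

```python
def curation_summary(n_spectra, n_identified, deltas, tolerance=0.5):
    """Derive the unidentified-spectra count and the % delta m/z outliers.

    `deltas` holds one delta m/z value per identified precursor;
    `tolerance` is an assumed outlier threshold, for illustration only.
    """
    if n_identified > n_spectra:
        raise ValueError("more identifications than spectra")
    outliers = sum(1 for d in deltas if abs(d) > tolerance)
    pct_outliers = 100.0 * outliers / len(deltas) if deltas else 0.0
    return {
        "unidentified_spectra": n_spectra - n_identified,
        "pct_delta_mz_outliers": round(pct_outliers, 1),
    }
```

A curator can then act on the two derived numbers directly: a high outlier percentage points to modification or charge-state problems, and a very high unidentified count may deserve a second look at the search settings.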




Conclusion

                growing amount of data


                increasingly complex data


                scalability issues


              overcoming them by automation
              and new, smarter curation strategies




Thanks for the attention!




acsordas@ebi.ac.uk
        Q&A                 @attilacsordas


