From Raw Data to MetaData Files
Yasset Perez Riverol
Proteomics & Bioinformatics, CIGB
Common Proteomic Workflow
Mixture/Sample → Separation Techniques (1D, 2D) → LC-MS/MS → Identification (OMSSA)
– Different providers (annotations, software converters & viewers).
– For raw data formats, there is also the very real problem of “aging”.
Different:
– Protocols.
– Outputs.
– Providers.
Different:
– Strategies.
– Search engines.
– Post-processing analysis.
– File outputs.
LC-MS/MS (different instruments)
Raw File Raw File Raw File Raw File
Raw data is binary! You cannot read it with Notepad, and you cannot read it without the vendors' programs and libraries.
Peaks without processing!
LC-MS/MS (“aging” problem.) 
Thermo XCalibur MassLynx Trapper Compass 
FrameWork 
Next to the problem with proprietary raw data formats, there is also the very real problem of “aging” that comes with any binary formatted data. As time goes by, support for certain formats tends to evaporate and, within the space of several years, readers can no longer be found for the format.
Martens et al., Proteomics 2005, 5, 3501–3505
Information inside Raw files 
• Raw files contain all the individual peaks as registered 
by the instrument detector. 
• Peaks without processing!!! 
• For LC-MS machines, they can also store elution profiles and
times for the LC part.
• Depending on the vendor and make of the machine, 
other useful instrument-related information can be 
stored in these files as well.
File Formats Evolution
Pure peak formats (pkl, ms2, mgf) → mzXML (2004) → mzData (2006) → mzML (2008)
Nature Biotechnology. 2004, 22 (11), 1459-1466.
mzData, http://psidev.info/index.php?q=node/80#mzdata.
Mol Cell Proteomics. 2011, 10(1),
Pure Peak File 
mgf (Mascot Generic Format)
BEGIN IONS 
PEPMASS=406.283 
CHARGE=2+,3+ 
TITLE=Experiment_1 
145.119100 8 
217.142900 75 
409.221455 11 
438.314735 46 
567.400183 24 
714.447552 31 
116.113400 72 
91.2165000 32 
405.288933 94 
39.3021000 12 
549.379462 21 
715.466300 81 
15.1098000 62 
45.1358430 28 
pkl (peak list) 
814.27 22673800 1 
221.06 2529.3 
223.84 220.9 
226.91 1026.9 
227.97 1037.9 
231.06 110.6 
239.05 7193.1 
239.74 2513.3 
240.27 363.4 
240.79 1314.7 
241.45 629.9 
254.85 332.5 
259.71 200.5 
260.93 2437.7 
dta 
539.3453 2 
86.1006 4.0000 
112.1109 3.0000 
115.0906 2.0000 
120.0817 5.0000 
175.0219 2.0000 
225.1467 2.0000 
225.7205 2.0000 
228.1194 2.0000 
230.1106 2.0000 
234.1836 2.0000 
238.6206 2.0000 
240.1569 3.0000 
251.1396 2.0000 
254.1557 2.0000 
261.1669 9.0000 
261.6609 2.0000 
268.1504 8.0000
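The mgf listing on this slide can be read with a few lines of code. The following is a minimal Python sketch, not a complete parser. It assumes each spectrum is wrapped in BEGIN IONS / END IONS (the closing END IONS line does not appear on the slide and is assumed here from the usual MGF layout), treats KEY=VALUE lines as header parameters, and reads every other line as an "m/z intensity" pair. The function name parse_mgf is hypothetical.

```python
from typing import Dict, List

Spectrum = Dict[str, object]


def parse_mgf(text: str) -> List[Spectrum]:
    """Parse MGF text into a list of {'params': {...}, 'peaks': [(mz, intensity), ...]}."""
    spectra: List[Spectrum] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            # Start collecting a new spectrum.
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            # Close the spectrum (assumed terminator, see lead-in).
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Header parameter such as PEPMASS=406.283
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z followed by intensity
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


# First lines of the slide's example, with the assumed END IONS added.
example = """BEGIN IONS
PEPMASS=406.283
CHARGE=2+,3+
TITLE=Experiment_1
145.119100 8
217.142900 75
409.221455 11
END IONS
"""

spectra = parse_mgf(example)
print(spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
```

Header values are kept as strings on purpose: a field such as CHARGE=2+,3+ can hold several values, and interpreting it is left to the caller.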
mzXML 
mzXML 
Parent FileList 
MsInstrument 
SeparationTechnique 
dataProcessing 
spotting
scanList 
scan 
scanDescription 
msLevel 
PrecursorList 
binaryDataArray 
binaryDataArray 
• • • 
scan 
scan 
scanOrigin 
deisotoped 
centroided 
deconvoluted 
mzXML was the first XML-based file format developed for proteomics experiments. It was developed by the System Biology Group, USA.
The annotations in the file are string based, that is, they are stored as (name attribute, value) pairs.
It does not support chromatogram information.
It is very difficult to extend. The structure of the file does not allow new parameters or features to be defined for each element. For example, msInstrument is defined only by the name of the instrument. Also, if the spectrum has been preprocessed with any program, it is difficult to incorporate that information.
There are currently more than 4 versions of the schema. The schema is supported by the System Biology Group, USA-Zurich.
Controlled Vocabularies & 
Ontology Lookup Service 
TOF, T.O.F., time of flight, time-of-flight → 100173
OLS
• It is a web-service-oriented system developed in Java.
• It was developed and is maintained by the PRIDE team!
• We have the service installed on a local machine!
• I know the library and the source code. We have a strong collaboration with the developers of the service!
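The idea behind a controlled vocabulary can be illustrated with a toy lookup in which several free-text spellings resolve to one term identifier. This is only a sketch: the table is hard-coded, the identifier is the one shown on the slide, and the names CV_SYNONYMS and resolve_term are hypothetical. A real application would query a service such as OLS instead of a local dictionary.

```python
from typing import Optional

# Free-text spellings from the slide, all mapped to the identifier shown there.
CV_SYNONYMS = {
    "tof": "100173",
    "t.o.f.": "100173",
    "time of flight": "100173",
    "time-of-flight": "100173",
}


def resolve_term(label: str) -> Optional[str]:
    """Return the term identifier for a free-text label, or None if it is unknown."""
    return CV_SYNONYMS.get(label.strip().lower())


for label in ["TOF", "T.O.F.", "Time of Flight", "time-of-flight", "orbitrap"]:
    print(label, "->", resolve_term(label))
```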
mzML 
mzML 
cvList 
referenceableParamGroupList 
sampleList 
instrumentConfigurationList 
softwareList 
dataProcessingList 
acquisitionSettingsList 
run 
spectrum 
spectrumDescription 
precursorList 
scan 
binaryDataArray 
binaryDataArray 
• • • 
spectrumList 
spectrum 
spectrum 
• • • 
chromatogramList 
chromatogram 
chromatogram 
• • • 
chromatogram 
binaryDataArray 
binaryDataArray 
Metadata about the spectra plus all the spectra themselves.
The header at the top of the file encodes information about the source of the data as well as information about the sample, instrument and software that processed the data.
CV terms are used to define the metadata and the properties of each element (software, instrument, sample, scan settings, etc.).
Chromatograms may be encoded in mzML in a special element that contains one or more cvParams to describe the type of chromatogram, followed by two base64-encoded binary data arrays.
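A base64-encoded binary data array can be decoded with the Python standard library alone. The sketch below assumes 64-bit little-endian floats and optional zlib compression. In a real mzML file the precision and compression of each array are declared by cvParams on the binaryDataArray element, so these defaults are assumptions. The function names are hypothetical, and the example builds its own payload rather than reading a real file.

```python
import base64
import struct
import zlib
from typing import List


def encode_array(values: List[float], compress: bool = False) -> str:
    """Pack floats as little-endian 64-bit values, optionally zlib-compress, then base64-encode."""
    raw = struct.pack("<%dd" % len(values), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text: str, compressed: bool = False) -> List[float]:
    """Reverse of encode_array: base64-decode, optionally decompress, unpack 64-bit floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // 8  # 8 bytes per 64-bit float
    return list(struct.unpack("<%dd" % count, raw))


# A made-up chromatogram: one time array and one intensity array.
times = [0.0, 0.5, 1.0, 1.5]
intensities = [10.0, 250.0, 900.0, 120.0]

encoded_times = encode_array(times)
encoded_intensities = encode_array(intensities, compress=True)

decoded = list(zip(decode_array(encoded_times),
                   decode_array(encoded_intensities, compressed=True)))
print(decoded)
```

The two arrays are kept separate, one for the x axis and one for the y axis, which mirrors the "two binary data arrays" layout described on the slide.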
Comparison table
Metadata / file format     mzML  mzData  mzXML  mgf  pkl  ms2  dta
Species                    X     X       -      -    -    -    -
Tissue                     X     X       -      -    -    -    -
Instrument                 X     X       X      -    -    -    -
Experiment Description     X     -       -      -    -    -    -
References                 X     -       -      -    -    -    -
Contacts                   X     X       X      -    -    -    -
Additional                 X*    X       X      -    -    -    -
Samples                    X     X       -      -    -    -    -
Instrument Configuration   X     X       X      -    -    -    -
Data Processing            X     X       X      -    -    -    -
* FileContent / creationDate
mzML is supported by:
- Institute for Systems Biology, Seattle.
- Swiss Institute of Bioinformatics and Geneva Bioinformatics, Switzerland.
- European Bioinformatics Institute, Hinxton, UK.
- Thermo Fisher, San Jose, CA.
- Indigo Biosystems, Carmel, IN.
mzML and mzXML are compatible with:
- Mascot, X! Tandem, OMSSA.
- PeptideProphet
It is not binary! You can read it with Notepad, but also with your own libraries and code.
ProteoWizard
msConvert
API
Thermo API, Bruker API, Agilent API, Waters API
File input supported:
– Thermo
– Bruker
– Agilent
– Waters
– pkl
– mgf
– dta
– ms2
File output supported:
– mzML
– mzXML
– mzData
– pkl
– mgf
Cross-platform!
Still growing…
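A conversion with msConvert is normally launched from the command line. The sketch below only builds the argument list in Python and does not run anything. It assumes the ProteoWizard executable is installed as msconvert and is on the PATH, and it uses the output flags --mzML, --mzXML and --mgf together with -o for the output directory. The helper name and the file names are hypothetical, and the flags should be checked against the installed version.

```python
from typing import List

# Output format name -> msconvert flag (subset of the formats listed on the slide).
OUTPUT_FLAGS = {"mzML": "--mzML", "mzXML": "--mzXML", "mgf": "--mgf"}


def build_msconvert_command(input_file: str, output_format: str, output_dir: str) -> List[str]:
    """Return the argument list for one msconvert run; nothing is executed here."""
    if output_format not in OUTPUT_FLAGS:
        raise ValueError("unsupported output format: %s" % output_format)
    return ["msconvert", input_file, OUTPUT_FLAGS[output_format], "-o", output_dir]


command = build_msconvert_command("sample.raw", "mzML", "converted")
print(" ".join(command))

# To actually run the conversion (requires ProteoWizard to be installed):
#   import subprocess
#   subprocess.run(command, check=True)
```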
Identification
Database Search: Mascot, X! Tandem, OMSSA, Phenyx (combined with Percolator, PeptideProphet, Scaffold)
De Novo Sequencing: Peaks, PepNovo
Spectral Library: SpectraST, NIST
Thousands of approaches! You can combine different programs, with different parameters, and different workflows.
File Formats?
AnalysisXML: v1.0 candidate (Dec 08)
.dat
pepXML, protXML, AnalysisXML
Seattle Proteome Center at the Institute for Systems Biology
Programs with Excel output
OMSSA
Programs with their own output format
mzIdentML
Collection of use cases agreed to cover, e.g. PMF, MS/MS, sequence tag, de novo, spectral library.
Protein Result Set
  Ambiguity Group 1
    Protein Hypothesis 1: Pep Evidence 1, Pep Evidence 2
    Protein Hypothesis 2: Pep Evidence 1, Pep Evidence 2
  Ambiguity Group 2
    Protein Hypothesis 1: Pep Evidence 1, Pep Evidence 2
    …
Multiple search engines! Protocol descriptions, database properties, search engines, parameters, modifications…
Fully compatible with ontologies!
Supported by Mascot!
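The grouping on this slide (a result set containing ambiguity groups, each holding protein hypotheses backed by peptide evidence) maps naturally onto nested records. The following is an illustrative Python sketch of that shape only. The class names, field names, accessions and peptide sequences are all hypothetical; they are not the mzIdentML schema element names or any library's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PeptideEvidence:
    peptide_sequence: str


@dataclass
class ProteinHypothesis:
    accession: str
    evidence: List[PeptideEvidence] = field(default_factory=list)


@dataclass
class AmbiguityGroup:
    name: str
    hypotheses: List[ProteinHypothesis] = field(default_factory=list)


@dataclass
class ProteinResultSet:
    groups: List[AmbiguityGroup] = field(default_factory=list)

    def proteins_supported_by(self, sequence: str) -> List[str]:
        """List the accessions of every hypothesis that cites the given peptide."""
        return [
            hypothesis.accession
            for group in self.groups
            for hypothesis in group.hypotheses
            if any(e.peptide_sequence == sequence for e in hypothesis.evidence)
        ]


# Made-up data: two proteins share both peptides, so they land in one ambiguity group.
shared = [PeptideEvidence("PEPTIDEK"), PeptideEvidence("SAMPLER")]
results = ProteinResultSet(groups=[
    AmbiguityGroup("group1", [ProteinHypothesis("PROT_A", list(shared)),
                              ProteinHypothesis("PROT_B", list(shared))]),
    AmbiguityGroup("group2", [ProteinHypothesis("PROT_C", [PeptideEvidence("ANOTHERK")])]),
])

print(results.proteins_supported_by("PEPTIDEK"))
```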
mzIdentML
• Results in mzIdentML format can be exported directly from Mascot (export of version 1.1 available in version 2.3).
• Converters are currently available for Sequest and Proteome Discoverer output (.msf and .protXML), e.g. within ProCon (http://www.medizinisches-proteom-center.de/ProCon), and for OMSSA and X!Tandem (http://code.google.com/p/mzidentml-parsers/).
• The pipeline applications Scaffold (import into Scaffold PTM and export of mzIdentML available in Scaffold version 3) and TPP (results can be exported to mzIdentML via the ProteoWizard converter) also support it.
• A beta exporter is also available for Phenyx.
• OpenMS implements C++ code for reading and (as of release 1.9) writing mzIdentML.
• An open-source Java API for reading and writing mzIdentML has also been developed, available from http://code.google.com/p/jmzidentml/
Gels (nobody cares)
– Only limited support for the storage of detailed descriptions of all stages of a gel-based proteomics workflow.
– Information is mostly restricted to unstructured text paragraphs.
Different scenarios: OFFGEL electrophoresis, 1D, 2D
One of the reasons is the lack of widely accepted standards for representing gel data and the difficulties encountered in modelling the range of workflows employed in different settings.
GelML
GelML is basically a metadata file that contains the URI of the image file.
The structure of the schema is complex! One of the reasons is the number of different protocols.
It is not well documented, has a small community behind it, and is not widely adopted in the community!
Before Technical Things!
• The number of tools based on XML standard files is growing exponentially.
Why:
– Easy to read and write!
– They are standards!
– Repository support (PRIDE, PeptideAtlas).
– They contain enough information for most programs.
APIs
• jmzml: library to read/write information from mzML files.
• jmzidentml: library to read/write information from mzIdentML files.
• jgelml: library to read/write information from GelML files (currently in development).
• Developed by the PRIDE team.
• Java libraries.
• Still growing.
• Open source and free.
ms-core-api
Applications: proteolims, N-terminal Identification, Web services
ms-core-api
APIs: jmzml, jmzxml, jmzData, jmzReader, jmzidml, jgelml
ms-core-api is a Java framework, a common object model to represent different file formats.
Supported now:
– mzIdentML
– mzML, mzData, mzXML
– PRIDE XML, PRIDE database
– pkl, mgf, ms2, dta
– GelML (work in progress)
Cross-platform and well documented!
The aim of the ms-core-api library is to guarantee a common language of objects and classes for our current development tools!
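The idea of a common object model can be sketched in a few lines: each format gets its own small reader, but every reader returns the same object. This Python sketch is illustrative only. ms-core-api is a Java library and its real classes are not shown in the slides, so the names Spectrum, read_dta, read_pkl and read_spectrum are hypothetical. It also assumes that the first line of a dta file holds the precursor mass and charge, and that the first line of a pkl file holds the precursor m/z, intensity and charge, as the listings earlier in the deck suggest.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Spectrum:
    """Format-independent spectrum shared by all readers."""
    precursor: float
    charge: int
    peaks: List[Tuple[float, float]]
    source_format: str


def _peaks(lines: List[str]) -> List[Tuple[float, float]]:
    # Each peak line is "m/z intensity".
    return [(float(mz), float(inten)) for mz, inten in (l.split()[:2] for l in lines)]


def read_dta(text: str) -> Spectrum:
    lines = [l for l in text.splitlines() if l.strip()]
    mass, charge = lines[0].split()[:2]  # assumed header: precursor mass, charge
    return Spectrum(float(mass), int(charge), _peaks(lines[1:]), "dta")


def read_pkl(text: str) -> Spectrum:
    lines = [l for l in text.splitlines() if l.strip()]
    mz, _intensity, charge = lines[0].split()[:3]  # assumed header: m/z, intensity, charge
    return Spectrum(float(mz), int(charge), _peaks(lines[1:]), "pkl")


# Registry of readers: adding a format means adding one entry here.
READERS: Dict[str, Callable[[str], Spectrum]] = {"dta": read_dta, "pkl": read_pkl}


def read_spectrum(text: str, fmt: str) -> Spectrum:
    return READERS[fmt](text)


# First lines of the dta and pkl listings from the deck.
dta_text = "539.3453 2\n86.1006 4.0000\n112.1109 3.0000\n"
pkl_text = "814.27 22673800 1\n221.06 2529.3\n223.84 220.9\n"

for fmt, text in [("dta", dta_text), ("pkl", pkl_text)]:
    s = read_spectrum(text, fmt)
    print(s.source_format, s.precursor, s.charge, len(s.peaks))
```

Code that consumes Spectrum objects, such as a viewer or a report, never needs to know which file format the data came from, which is the point of sharing one object model across tools.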
The relevance of the API concept
• Different programs can be used to implement the main functionalities.
• If you have APIs, then you just need to think about integration, scalability and presentation…
• Easy to maintain, to scale and to share…
• They are the “MAIN CORE”!
ms-core-api: good for…?
Spectrum Viewer
Identification Report
Think about reviewing our experiments
MetaData Report
Reviewer Panel
Conclusion
• mzML is the current standard for MS/MS data storage.
• mzIdentML will be the future standard in the proteomics community for peptide/protein identification storage.
• GelML is not widely adopted in the community, but so far it is the best option for gel information storage.
• ms-core-api supports mzML and mzIdentML and, in the near future, GelML.

Standarization in Proteomics: From raw data to metadata files

  • 1. From Raw Data to MetaData Files Yasset Perez Riverol Proteomics & Bioiforma4cs CIGB
  • 2. Common Proteomic Workflow Mixture/ Sample Separa4on Techniques (1D, 2D) LC MS/MS Iden4fica4on OMSSA – Different providers: (annotations, software converters & viewers) – For Raw data formats, there is also the very real problem of “aging”. Different: – Protocols. – Outputs. – Providers. Different: – Strategies. – Search Engines. – Post-Processing Analysis. – File Outputs.
  • 3. LC-MS/MS (different instruments). Each instrument produces its own raw file. Raw data is binary! That means you can't read it with Notepad, and you can't read it without the vendor's programs and libraries either. Peaks without processing!
  • 4. LC-MS/MS (“aging” problem). Thermo XCalibur MassLynx Trapper Compass FrameWork. “Next to the problem with proprietary raw data formats, there is also the very real problem of ‘aging’ that comes with any binary formatted data. As time goes by, support for certain formats tends to evaporate and within the space of several years, readers can no longer be found for the format.” Martens et al., Proteomics 2005, 5, 3501–3505
  • 5. Information inside Raw files • Raw files contain all the individual peaks as registered by the instrument detector. • Peaks without processing! • For LC-MS machines, they can also store elution profiles and times for the LC part. • Depending on the vendor and make of the machine, other useful instrument-related information can be stored in these files as well.
  • 6. File Formats Evolution: Pure Peaks Formats (pkl, ms2, mgf); mzXML (2004); mzData (2006); mzML (2008). References: Nature Biotechnology. 2004, 22 (11), 1459-1466. mzData, http://psidev.info/index.php?q=node/80#mzdata. Mol Cell Proteomics. 2011,10(1),
  • 7. Pure Peak File
      mgf (mascot generic file):
        BEGIN IONS
        PEPMASS=406.283
        CHARGE=2+,3+
        TITLE=Experiment_1
        145.119100 8
        217.142900 75
        409.221455 11
        438.314735 46
        567.400183 24
        714.447552 31
        116.113400 72
        91.2165000 32
        405.288933 94
        39.3021000 12
        549.379462 21
        715.466300 81
        15.1098000 62
        45.1358430 28
      pkl (peak list):
        814.27 22673800 1
        221.06 2529.3
        223.84 220.9
        226.91 1026.9
        227.97 1037.9
        231.06 110.6
        239.05 7193.1
        239.74 2513.3
        240.27 363.4
        240.79 1314.7
        241.45 629.9
        254.85 332.5
        259.71 200.5
        260.93 2437.7
      dta:
        539.3453 2
        86.1006 4.0000
        112.1109 3.0000
        115.0906 2.0000
        120.0817 5.0000
        175.0219 2.0000
        225.1467 2.0000
        225.7205 2.0000
        228.1194 2.0000
        230.1106 2.0000
        234.1836 2.0000
        238.6206 2.0000
        240.1569 3.0000
        251.1396 2.0000
        254.1557 2.0000
        261.1669 9.0000
        261.6609 2.0000
        268.1504 8.0000
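To make the structure of a pure peak file concrete, here is a minimal sketch of an mgf reader in Python. The function name `parse_mgf`, the `Spectrum` class and the `MGF_EXAMPLE` constant are illustrative names, not part of any library mentioned in the slides. The example reuses the first peaks from the slide and adds the closing `END IONS` line that the mgf format expects but the slide does not show.

```python
from dataclasses import dataclass, field


@dataclass
class Spectrum:
    """One BEGIN IONS / END IONS block: header parameters plus peaks."""
    params: dict = field(default_factory=dict)
    peaks: list = field(default_factory=list)


def parse_mgf(text):
    """Parse mgf text into a list of Spectrum objects (illustrative helper)."""
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = Spectrum()
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Header line such as PEPMASS=406.283
                key, value = line.split("=", 1)
                current.params[key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity
                fields = line.split()
                mz = float(fields[0])
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current.peaks.append((mz, intensity))
    return spectra


# First peaks of the slide's example; END IONS added to close the block.
MGF_EXAMPLE = """BEGIN IONS
PEPMASS=406.283
CHARGE=2+,3+
TITLE=Experiment_1
145.119100 8
217.142900 75
409.221455 11
END IONS
"""

spectra = parse_mgf(MGF_EXAMPLE)
```

Note how little the format carries: a precursor mass, candidate charges, a title and the peaks. Nothing about the sample, the instrument or the processing survives, which is the gap the XML formats in the next slides try to close.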
  • 8. mzXML. [Schema diagram: mzXML, ParentFileList, MsInstrument, SeparationTechnique, dataProcessing, spotting, scanList, scan (scanDescription, msLevel, PrecursorList, binaryDataArray), scanOrigin, deisotoped, centroided, deconvoluted.] mzXML was the first XML-based file format developed for proteomics experiments. It was developed by the System Biology Group, USA. The annotations in the file are string-based, that is, they are stored as (attribute name, value) pairs. It does not support chromatogram information. It is very difficult to extend: the structure of the file does not allow new parameters or features to be defined for each element. For example, msInstrument is defined only by the name of the instrument. Also, if the spectrum has been preprocessed with some program, it is difficult to incorporate that information. Currently there are more than 4 versions of the schema. The schema is supported by the System Biology Group, USA-Zurich.
  • 9. Controlled Vocabularies & Ontology Lookup Service. Spelling variants such as TOF, T.O.F., time of flight and time-of-flight all map to a single controlled-vocabulary term (100173).
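The idea behind a controlled vocabulary can be sketched in a few lines of Python: every spelling variant resolves to one accession. This is only an illustration under stated assumptions. The `CV_TERMS` table is a hypothetical one-entry vocabulary that uses the number shown on the slide as its accession, and `build_index` and `lookup` are made-up helper names, not the Ontology Lookup Service API.

```python
import re

# Hypothetical one-term vocabulary; the accession is the number on the slide.
CV_TERMS = {
    "100173": {
        "name": "time-of-flight",
        "synonyms": ["TOF", "T.O.F.", "time of flight"],
    },
}


def _normalize(label):
    """Lower-case a label and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def build_index(terms):
    """Map every normalized name and synonym to its accession."""
    index = {}
    for accession, term in terms.items():
        for label in [term["name"], *term["synonyms"]]:
            index[_normalize(label)] = accession
    return index


def lookup(label, index):
    """Return the accession for a free-text label, or None if unknown."""
    return index.get(_normalize(label))


index = build_index(CV_TERMS)
accessions = {lookup(s, index) for s in ["TOF", "T.O.F.", "time of flight", "time-of-flight"]}
```

All four spellings end up in the same single-element set, so a file that stores the accession rather than the free text can be compared across laboratories and software packages.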
  • 10. OLS • It is a web-service-oriented system developed in Java. • It was developed and is maintained by the PRIDE Team! • We have the service installed on a local machine! • I know the library and the source code. We have a strong collaboration with the developers of the service!
  • 11. mzML. [Schema diagram: mzML contains cvList, referenceableParamGroupList, sampleList, instrumentConfigurationList, softwareList, dataProcessingList, acquisitionSettingsList and run; the run holds a spectrumList (each spectrum with spectrumDescription, precursorList, scan and binaryDataArray elements) and a chromatogramList (each chromatogram with binaryDataArray elements).] The file holds metadata about the spectra plus all the spectra themselves. The header at the top of the file encodes information about the source of the data as well as the sample, instrument and software that processed the data. CV terms are used to define the metadata and the properties of each element (software, instrument, sample, scan settings, etc.). Chromatograms may be encoded in mzML in a special element that contains one or more cvParams to describe the type of chromatogram, followed by two base64-encoded binary data arrays.
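The base64-encoded binary data arrays mentioned above can be illustrated with a short Python sketch. It assumes the usual mzML conventions: little-endian 32-bit or 64-bit floats, optionally zlib-compressed, then base64-encoded. In a real file the precision and compression are declared by cvParams on the binaryDataArray element; here they are passed as plain arguments, and `encode_array` and `decode_array` are illustrative names.

```python
import base64
import struct
import zlib


def encode_array(values, double_precision=True, compress=False):
    """Pack floats little-endian, optionally zlib-compress, then base64-encode."""
    code = "d" if double_precision else "f"
    payload = struct.pack("<%d%s" % (len(values), code), *values)
    if compress:
        payload = zlib.compress(payload)
    return base64.b64encode(payload).decode("ascii")


def decode_array(text, double_precision=True, compressed=False):
    """Reverse of encode_array: base64-decode, decompress, unpack floats."""
    payload = base64.b64decode(text)
    if compressed:
        payload = zlib.decompress(payload)
    width = 8 if double_precision else 4
    count = len(payload) // width
    code = "d" if double_precision else "f"
    return list(struct.unpack("<%d%s" % (count, code), payload[: count * width]))


# Round trip with a few m/z values taken from the mgf example in slide 7.
mz_values = [145.1191, 217.1429, 409.221455]
encoded = encode_array(mz_values, double_precision=True, compress=True)
decoded = decode_array(encoded, double_precision=True, compressed=True)
```

The encoded string is plain ASCII, which is why the peak data can live inside an XML document. With 64-bit floats the round trip is exact; with 32-bit floats the decoded values are only approximately equal to the input.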
  • 12. Comparison table
      Metadata / file format     mzML  mzData  mzXML  mgf  pkl  ms2  dta
      Species                    X     X       -      -    -    -    -
      Tissue                     X     X       -      -    -    -    -
      Instrument                 X     X       X      -    -    -    -
      Experiment Description     X     -       -      -    -    -    -
      References                 X     -       -      -    -    -    -
      Contacts                   X     X       X      -    -    -    -
      Additional                 X*    X       X      -    -    -    -
      Samples                    X     X       -      -    -    -    -
      Instrument Configuration   X     X       X      -    -    -    -
      Data Processing            X     X       X      -    -    -    -
      * FileContent / creationDate
      mzML is supported by: – Institute for Systems Biology, Seattle. – Swiss Institute for Bioinformatics and Geneva Bioinformatics, Switzerland. – European Bioinformatics Institute, Hinxton, UK. – Thermo Fisher, San Jose, CA. – Indigo Biosystems, Carmel, IN.
      mzML and mzXML are compatible with: – Mascot, X!Tandem, OMSSA. – PeptideProphet.
      It is not binary! That means you can read it with Notepad, and also with your own libraries and code.
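Because the format is text, a general-purpose XML library is enough to pull metadata out of it. The sketch below uses Python's standard `xml.etree.ElementTree` on a hand-written, simplified fragment in the style of mzML. The fragment, the `FRAGMENT` constant and the `spectrum_metadata` helper are illustrative assumptions, not a complete or validated mzML document.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment in the style of mzML (no XML namespace).
FRAGMENT = """<spectrumList count="1">
  <spectrum index="0" id="scan=1" defaultArrayLength="14">
    <cvParam cvRef="MS" accession="MS:1000580" name="MSn spectrum" value=""/>
    <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
  </spectrum>
</spectrumList>"""


def spectrum_metadata(xml_text):
    """Return {spectrum id: {cvParam name: value}} for every spectrum element."""
    root = ET.fromstring(xml_text)
    result = {}
    for spectrum in root.iter("spectrum"):
        params = {
            cv.get("name"): cv.get("value")
            for cv in spectrum.findall("cvParam")
        }
        result[spectrum.get("id")] = params
    return result


metadata = spectrum_metadata(FRAGMENT)
```

Real mzML files declare an XML namespace, so with ElementTree the element names would have to be namespace-qualified. The fragment omits the namespace to keep the sketch short.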
  • 13. ProteoWizard msConvert. Vendor APIs: Thermo API, Bruker API, Agilent API, Waters API. File Input Supported: – Thermo – Bruker – Agilent – Waters – pkl – mgf – dta – ms2. File Output Supported: – mzML – mzXML – mzData – pkl – mgf. Cross-platform!
  • 15. Identification. Database Search: Mascot, Mascot Percolator, PeptideProphet, Scaffold, X!Tandem, OMSSA, Phenyx. De Novo Sequencing: PEAKS, PepNovo. Spectral Library: SpectraST, NIST. Thousands of approaches! That means you can combine different programs, with different parameters, in different workflows.
  • 16. File Formats? .dat files; pepXML and protXML (Seattle Proteome Center at the Institute for Systems Biology); AnalysisXML (v1.0 candidate, Dec 08); programs with Excel output; OMSSA; programs with their own output format.
  • 17. mzidentml. Collection of use cases agreed to cover: e.g. PMF, MS/MS, sequence tag, de novo, spectral library. [Diagram: a Protein Result Set contains Ambiguity Groups (1, 2, …); each Ambiguity Group contains Protein Hypotheses (1, 2, …); each Protein Hypothesis references Peptide Evidence entries (1, 2, …).] Multiple search engines! Protocol descriptions, database properties, search engines, parameters, modifications. Fully compatible with ontologies! Supported by Mascot!
  • 18. mzidentml • Results in mzIdentML format can be exported directly from Mascot (export of version 1.1 available in version 2.3). • Converters are currently available for Sequest and Proteome Discoverer output (.msf and .protXML), e.g. within ProCon: http://www.medizinisches-proteom-center.de/ProCon. • Parsers are available for OMSSA and X!Tandem (http://code.google.com/p/mzidentml-parsers/). • The pipeline applications Scaffold (import into Scaffold PTM and export of mzIdentML available in Scaffold version 3) and TPP (results can be exported to mzIdentML via the ProteoWizard converter). • A beta exporter is also available for Phenyx. • OpenMS implements C++ code for reading (and, as of release 1.9, writing) mzIdentML. • An open-source Java API for reading and writing mzIdentML has also been developed, available from http://code.google.com/p/jmzidentml/
  • 19. Gels (nobody cares) – Only limited support for the storage of detailed descriptions of all stages of a gel-based proteomics workflow. – Information is mostly restricted to unstructured text paragraphs. Different scenarios: OFFGEL electrophoresis, 1D, 2D. One of the reasons is the lack of widely accepted standards for representing gel data and the difficulties encountered in modelling the range of workflows employed in different settings.
  • 20. gelml. GelML is basically a metadata file that contains the URI of the image file. The structure of the schema is complex! One of the reasons is the number of different protocols. It is not well documented, has a small community behind it, and is not widely adopted in the community!
  • 21. Before Technical Things! • The number of tools based on XML standard files is growing rapidly. Why: – Easy to read and write! – They are standards! – Repository support (PRIDE, PeptideAtlas). – They hold enough information for most programs.
  • 22. APIs • jmzml: library to read/write information from mzml files. • jmzidentml: library to read/write information from mzidentml files. • jgelml: library to read/write information from gelml files (under development). • Developed by the PRIDE team. • Java libraries. • Still growing. • Open-source and free.
  • 23. ms-core-api. [Diagram: applications (proteolims, N-terminal Identification, Web services) sit on top of ms-core-api, which sits on top of the APIs jmzml, jmzxml, jmzData, jmzReader, jmzidml, jgelml.] ms-core-api is a Java framework, a common object model to represent different file formats. Supported now: – mzidentml – mzml, mzData, mzXML – pride xml, pride database – pkl, mgf, ms2, dta – gelml (work in progress). Cross-platform and well documented! The aim of the ms-core-api library is to give our current development tools a common language of objects and classes!
  • 24. The relevance of the API concept • Different programs can be used to implement the main functionalities. • If you have APIs, then you just need to think about integration, scalability and presentation. • Easy to maintain, to scale and to share. • They are the “MAIN CORE”!
  • 25. ms-core-api: good for…? Spectrum Viewer, Identification Report
  • 26. Think about reviewing our experiments: MetaData Report, Reviewer Panel
  • 27. Conclusion • mzml is the current standard for MS/MS storage. • mzidentml will be the future standard in the proteomics community for peptide/protein identification storage. • gelml is not widely adopted in the community but is so far the best option for gel information storage. • ms-core-api supports mzml and mzidentml and, in the near future, gelml.