Distributed Approach
for Peptide
Identification
By
Naga Venkata Krishna Abhinav Vedanbhatla
Outline
• Background
• About C-Ranker
• Problem Statement
• Proposed Solution
• Architecture
• Implementation
• Execution Environment
• Results
• Conclusions & Future Scope
Background
Protein
• Biomolecules consisting of one or more chains of amino acids
• Proteins differ from one another primarily in their sequence of
amino acids
• A protein is characterized by the sequence of amino acids as they
occur in the protein
Proteins (Cont.…)
• Proteins perform a vast array of functions within
living organisms.
• A protein contains at least one long polypeptide
• Proteins are involved in almost every biological process
happening in an organism’s body.
• Proteins are an important part of drug development, which targets specific
metabolic pathways.
Peptide
• Short chain of amino acids
• Peptides are distinguished from proteins on the basis of size
• A protein is first digested into peptides and then each peptide is
identified individually to infer the protein identity.
Finding Peptide and Protein Relationship?
• Peptide Identification using mass spectrometry
Determine the sequence of Peptides
• Peptide mass fingerprinting (PMF) in MS spectra
Peptide Identification Using MS/MS Spectra
• Sequence database searching (for the large-scale dataset)
• de novo sequencing (new protein discovery)
• Post sequence database searching (extension of sequence
database search)
Peptide Identification (cont…)
• Mass Spectrometry (MS) strategy
• Sequence database searching
• Combination of both: dominant method for peptide identification
• Which results in more spectra from MS
Sequence Database Searching Algorithms
• SEQUEST
• Mascot
Post Database Search Algorithms
• Machine learning algorithms have been proposed to identify the correct
peptide-spectrum matches (PSMs)
Post Database Search Algorithms
• PeptideProphet
• Learns distribution of scores and properties
• Percolator
• Search scores considered reliable for high values and low values
• CRanker
• Fuzzy SVM and silhouette index.
CRanker
C-Ranker
• Identifies the correct PSMs in the output of the (peptide) database search.
• Developed in MATLAB and C
Why CRanker?
• According to research by Dr. Zhonghang Xia, it is the best
among the alternatives.
• Easy to parallelize, so it can run on a network of
computers rather than on a single computer and work at a
larger scale
Why CRanker? (cont…)
Data   PeptideProphet   CRanker   Overlap
UPS1   582              576       509
PBMC   34035            34273     32243
The overlap of the total PSMs identified by
PeptideProphet and CRanker is 88.4% and 94.8% on
UPS1 and PBMC, respectively
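The slide does not say which tool's count serves as the denominator for the overlap percentages. As a small worked sketch, the snippet below computes the overlap relative to each tool's total from the counts in the table; the function name `overlap_percent` is made up for illustration.

```python
def overlap_percent(overlap: int, total: int) -> float:
    """Share of `total` PSMs that both tools identified, in percent."""
    return round(100.0 * overlap / total, 1)


# Counts copied from the table above.
counts = {
    "UPS1": {"PeptideProphet": 582, "CRanker": 576, "Overlap": 509},
    "PBMC": {"PeptideProphet": 34035, "CRanker": 34273, "Overlap": 32243},
}

if __name__ == "__main__":
    for dataset, row in counts.items():
        for tool in ("PeptideProphet", "CRanker"):
            print(dataset, tool, overlap_percent(row["Overlap"], row[tool]))
```

Which denominator the author used is an open question; the sketch only shows the arithmetic for both choices.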
CRanker Execution: Step 1
[Diagram: the C-Ranker Read Stage reads InputFileName.txt, loads the raw PSM data into main memory, and generates InputFileName.mat]
CRanker Execution: Step 2
[Diagram: the C-Ranker Solve Stage reads InputFileName.txt, loads the PSM records into main memory, and creates InputFileName_score.mat]
CRanker Execution: Step 3
[Diagram: the C-Ranker Write Stage reads InputFileName.txt and InputFileName_score.mat, loads the PSM scores into main memory, and creates OutputFile.txt]
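The three execution steps above form a chain in which each stage reads files and leaves a file behind for the next stage. The following is a minimal sketch of that file-to-file flow, not C-Ranker's actual code: JSON files stand in for the `.mat` files, the tab-separated input layout is assumed, and the scoring is a placeholder that simply reuses a numeric column. All function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def read_stage(txt_path: Path) -> Path:
    """Read stage: parse the raw PSM text file and save the records."""
    records = [line.split("\t") for line in txt_path.read_text().splitlines() if line.strip()]
    # JSON stand-in for InputFileName.mat (assumption for this sketch).
    records_path = txt_path.with_name(txt_path.stem + "_records.json")
    records_path.write_text(json.dumps(records))
    return records_path


def solve_stage(records_path: Path) -> Path:
    """Solve stage: load the records and save one score per record."""
    records = json.loads(records_path.read_text())
    # Placeholder scoring: reuse the numeric value in column 1.
    # The real C-Ranker scoring model is not reproduced here.
    scores = [float(record[1]) for record in records]
    # JSON stand-in for InputFileName_score.mat.
    score_name = records_path.stem.replace("_records", "_score") + ".json"
    score_path = records_path.with_name(score_name)
    score_path.write_text(json.dumps(scores))
    return score_path


def write_stage(txt_path: Path, score_path: Path, out_path: Path) -> Path:
    """Write stage: join the original rows with their scores."""
    rows = [line for line in txt_path.read_text().splitlines() if line.strip()]
    scores = json.loads(score_path.read_text())
    lines = [f"{row}\t{score}" for row, score in zip(rows, scores)]
    out_path.write_text("\n".join(lines) + "\n")
    return out_path


def run_pipeline(txt_path: Path, out_path: Path) -> Path:
    """Run the read, solve, and write stages in order."""
    records_path = read_stage(txt_path)
    score_path = solve_stage(records_path)
    return write_stage(txt_path, score_path, out_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "InputFileName.txt"
        source.write_text("psm1\t2.5\npsm2\t0.7\n")
        result = run_pipeline(source, Path(tmp) / "OutputFile.txt")
        print(result.read_text())
```

Because every stage communicates only through files, a wrapper can run the stages unchanged on any machine that has the input file, which is the property the distributed approach relies on.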
Problem Statement
• C-Ranker needs a computer with high computational power
• For a dataset of about 400,000 PSM records, a run may take about 5
to 8 [hours] on a normal PC
• Poor Resource Management
• Need to address future big data sets
Can’t we change C-Ranker?
• Research is ongoing to optimize C-Ranker.
• Distributed approach of C-Ranker.
• It would need the complete code to be rewritten!
Brainstorming
• Can we divide the 400,000 PSM records across 4 machines and do the
job? (And who will do the dividing?)
• Increase computational power?
Constraints!
• Restrictions on changing the C-Ranker design and code (I am not
experienced enough to do so)
• Should not change the execution flow of C-Ranker.
Shortlisted Approach
• Fundamental Distributed Approach
Why Distributed Framework?
• It can handle bigger datasets than it would be able to in a
centralized setting.
• Requires less memory per computer and each computer can
have commodity hardware.
• Cheaper to have multiple commodity hardware computers
than having a single high-performance high-end system
capable of achieving similar goals.
Job Execution in Distributed Approach
Proposed Solution
• A framework to execute C-Ranker on distributed nodes.
• Designed so that it can work with other post-database-search
algorithms like C-Ranker with minimal changes
• Compare the time taken to generate the distributed output of C-Ranker
with that of the actual output
• Make sure the C-Ranker algorithm executes correctly on the set of
predefined nodes
Architecture
Implementation
Data Flow Details in the Original Single-Threaded
C-Ranker
Data Flow Details in Distributed C-Ranker
(Dividing)
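The dividing diagram itself is not preserved in this transcript. As a hedged illustration of the idea raised under Brainstorming (splitting about 400,000 PSM records across 4 machines), the sketch below cuts a record list into contiguous, nearly equal chunks. The function name `divide_records` and the even, contiguous split are assumptions; the actual framework may divide the data differently.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def divide_records(records: Sequence[T], num_workers: int) -> List[List[T]]:
    """Split records into num_workers contiguous chunks of near-equal size."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    base, extra = divmod(len(records), num_workers)
    chunks: List[List[T]] = []
    start = 0
    for index in range(num_workers):
        # The first `extra` workers each take one additional record.
        size = base + (1 if index < extra else 0)
        chunks.append(list(records[start:start + size]))
        start += size
    return chunks


if __name__ == "__main__":
    sizes = [len(chunk) for chunk in divide_records(range(400_000), 4)]
    print(sizes)
```

Contiguous chunks keep the original record order, so the per-worker outputs can later be concatenated in worker order.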
Data Flow For a Single Worker Host
Data Flow Details in Distributed C-Ranker
(Merging)
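The merging diagram is likewise missing from the transcript. A minimal sketch of one plausible merge step follows: the text outputs of the workers are concatenated in worker order, keeping a header line only once. Whether the real output files carry a header, and the name `merge_outputs`, are assumptions made for illustration.

```python
from typing import List


def merge_outputs(parts: List[List[str]], has_header: bool = True) -> List[str]:
    """Concatenate per-worker output lines, keeping only the first header."""
    merged: List[str] = []
    for index, lines in enumerate(parts):
        if has_header and index > 0:
            lines = lines[1:]  # drop the repeated header of later workers
        merged.extend(lines)
    return merged


if __name__ == "__main__":
    worker_a = ["id\tscore", "psm1\t0.91"]
    worker_b = ["id\tscore", "psm2\t0.12"]
    print(merge_outputs([worker_a, worker_b]))
```

Note that scores produced independently on different chunks are not necessarily on a common scale; this sketch only covers the file handling, not any rescaling.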
Execution Environment
• JAVA
• MATLAB MCR environment
• Apache Tomcat web server
Input Data used
Hardware Used to Observe Results
Servers            Server_1    Server_2        Server_3    Server_4
Memory             8 GB        4 GB            4 GB        4 GB
Processor          i5          i5              i5          i5
Operating System   Windows 7   Windows Vista   Windows 7   Windows 7
Comparison of C-Ranker on distributed approach
with C-Ranker on an Apache Hadoop Framework
Cluster 1
PBMC data (KB)   C-Ranker execution time in hrs (Cluster 1 Hadoop)   Distributed approach for C-Ranker execution time in hrs
11221            6.5                                                 3.56
12816            9.9                                                 8.1
31422            10.2                                                8.25
48486            15.2                                                9.2
Results
Results for testData.xls (409 KB)
Results for Pbmc_orbit_mips.xls (11221 KB)
Results for Pbmc_orbit_nomips.xls (12816
KB)
Results for Pbmc_velos_mips.xls (31422 KB)
Results for Pbmc_velos_nomips.xls (48486 KB)
Memory usage for testData.xls (409 KB)
Memory Usage for Pbmc_orbit_mips.xls
(11221 KB)
Memory Usage for Pbmc_orbit_nomips.xls
(12816 KB)
Memory Usage for Pbmc_velos_mips.xls
(31422 KB)
Memory usage for Pbmc_velos_nomips.xls
(48486 KB)
Memory Usage
Difference in Memory Usage
Upgraded Hardware to Compare with Cluster 2 Hadoop
Servers            Server_1    Server_2    Server_3    Server_4
Memory             12 GB       8 GB        12 GB       8 GB
Processor          i7          i5          i7          i7
Operating System   Windows 8   Windows 7   Windows 7   Windows 7
Comparison of C-Ranker on distributed approach
with C-Ranker on an Apache Hadoop Framework
Cluster 2
PBMC data (KB)   Distributed approach for C-Ranker execution time in hrs (new results)   C-Ranker execution time in hrs (Cluster 2 Hadoop)
11221            1.7                                                                     1.3
12816            3.82                                                                    1.58
31422            4.1                                                                     3.4
48486            5.93                                                                    4.5
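To make the two comparison tables easier to read side by side, the sketch below divides the Hadoop time by the distributed-approach time for each input file. A ratio above 1 means the distributed approach finished sooner. The numbers are copied from the tables; the dictionary layout and the function name are illustrative choices only.

```python
from typing import Dict, Tuple

# Input size in KB -> (Hadoop hours, distributed-approach hours).
CLUSTER_1: Dict[int, Tuple[float, float]] = {
    11221: (6.5, 3.56),
    12816: (9.9, 8.1),
    31422: (10.2, 8.25),
    48486: (15.2, 9.2),  # the slide prints this Hadoop time as "15. 2"
}
CLUSTER_2: Dict[int, Tuple[float, float]] = {
    11221: (1.3, 1.7),
    12816: (1.58, 3.82),
    31422: (3.4, 4.1),
    48486: (4.5, 5.93),
}


def hadoop_to_distributed_ratios(table: Dict[int, Tuple[float, float]]) -> Dict[int, float]:
    """Return Hadoop time divided by distributed time for each input size."""
    return {size: hadoop / distributed for size, (hadoop, distributed) in table.items()}


if __name__ == "__main__":
    for name, table in (("Cluster 1", CLUSTER_1), ("Cluster 2", CLUSTER_2)):
        for size, ratio in hadoop_to_distributed_ratios(table).items():
            print(f"{name} {size} KB: {ratio:.2f}")
```

On these figures the distributed approach is ahead of Hadoop Cluster 1 on every file, while Hadoop Cluster 2 is ahead of the distributed approach on the upgraded hardware.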
Cost Calculation of Apache Hadoop Cluster 1 and
Cluster 2
PBMC data size (KB)   Cost of Hadoop Cluster 1 ($)   Cost of Hadoop Cluster 2 ($)
11221                 3.4581                         0.6916
12816                 5.2668                         0.8410
31422                 5.4264                         1.8088
48486                 8.8064                         2.3941
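The slides do not state how these costs were calculated. One way to check the figures is to divide each cost by the matching Hadoop execution time from the comparison tables, which yields an implied price per hour. The sketch below does exactly that; treating the cost as a flat hourly charge is an assumption, and the function name is hypothetical.

```python
from typing import List, Tuple

# (Hadoop execution time in hours, cost in dollars), in table row order.
CLUSTER_1: List[Tuple[float, float]] = [
    (6.5, 3.4581), (9.9, 5.2668), (10.2, 5.4264), (15.2, 8.8064),
]
CLUSTER_2: List[Tuple[float, float]] = [
    (1.3, 0.6916), (1.58, 0.8410), (3.4, 1.8088), (4.5, 2.3941),
]


def implied_hourly_rates(rows: List[Tuple[float, float]]) -> List[float]:
    """Return cost divided by hours for every (hours, cost) row."""
    return [cost / hours for hours, cost in rows]


if __name__ == "__main__":
    for name, rows in (("Cluster 1", CLUSTER_1), ("Cluster 2", CLUSTER_2)):
        print(name, [round(rate, 3) for rate in implied_hourly_rates(rows)])
```

Seven of the eight rows come out at roughly $0.532 per hour. The last Cluster 1 row gives about $0.579, so that cost or time may have been transcribed differently in the source.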
Conclusion and Future Scope
Conclusion
• Reduces the execution time
• No additional cost (no need for high-performance computing machines)
• No need to change the current structure of C-Ranker
Conclusion (Cont.…)
• Better resource management, for example of memory
• No need to change the implementation of CRanker
Future Scope
• The same distributed approach can be used with Percolator and
PeptideProphet to see how well they perform
• Additionally, one can use an ensemble method to combine the
results of the three tools.
Questions??
Thank you
