Distributed approach for Peptide Identification

By Naga Venkata Krishna Abhinav Vedanbhatla

Outline
• Background
• About C-Ranker
• Problem Statement
• Proposed Solution
• Architecture
• Implementation
• Execution Environment
• Results
• Conclusions & Future Scope

Protein
• Biomolecules consisting of one or more chains of amino acids
• Proteins differ from one another primarily in their sequence of
amino acids
• A protein is characterized by the sequence of amino acids as they
occur in the protein

Proteins (cont.)
• Proteins perform a vast array of functions within living organisms.
• A protein contains at least one long polypeptide.
• Proteins are involved in almost every biological process happening in an organism’s body.
• Important part of drug development to target specific metabolic pathways.

Peptide
• Short chain of amino acids
• Peptides are distinguished from proteins on the basis of size
• A protein is first digested into peptides, and then each peptide is
identified individually to infer the protein identity.

Finding the Peptide and Protein Relationship
• Peptide Identification using mass spectrometry

Determine the sequence of Peptides
• Peptide mass fingerprinting (PMF) in MS spectra

Peptide Identification Using MS/MS Spectra
• Sequence database searching (for the large-scale dataset)
• de novo sequencing (new protein discovery)
• Post-database searching (extension of sequence database searching)

Peptide Identification (cont.)
• Mass spectrometry (MS) strategy
• Sequence database searching
• The combination of both is the dominant method for peptide identification
• This results in more spectra from MS

Sequence Database Searching Algorithms
• SEQUEST
• Mascot

Post-Database Search Algorithms
• Machine learning algorithms have been proposed to identify the correct
peptide-spectrum matches (PSMs)

Post-Database Search Algorithms (cont.)
• PeptideProphet: learns the distribution of scores and properties
• Percolator: search scores are considered reliable at high and low values
• CRanker: fuzzy SVM and silhouette index

C-Ranker
• Identifies the correct PSMs in the output of the (peptide) database search.
• Developed in MATLAB and C

Why CRanker?
• Based on research by Dr. Zhonghang Xia, it is the best among these
algorithms.
• Easy to parallelize, so it can run on a network of computers rather
than on a single computer and work at a larger scale

Why CRanker? (cont.)
Data    PeptideProphet    CRanker    Overlap
UPS1    582               576        509
PBMC    34035             34273      32243
The overlap of the total PSMs identified by PeptideProphet and CRanker
is 88.4% on UPS1 and 94.8% on PBMC.

CRanker Execution: Step 1
• C-Ranker Read Stage
• Reads InputFileName.txt
• Loads raw PSM data into main memory
• Generates InputFileName.mat

C-Ranker Solve Stage
• Reads InputFileName.txt
• Loads PSM records into main memory
• Creates InputFileName_score.mat

C-Ranker Write Stage
• Reads InputFileName.txt and InputFileName_score.mat
• Loads PSM scores into main memory
• Creates OutputFile.txt

Problem Statement
• C-Ranker needs a computer with high computational power
• For a dataset of about 400,000 PSM records, it may take about 5
to 8 hours on a normal PC
• Poor resource management
• Need to handle future big data sets

Can’t we change C-Ranker?
• Research is ongoing to optimize C-Ranker.
• A distributed version of C-Ranker is one option.
• But it would require rewriting the complete code!

Brainstorming
• Can we divide (and who will divide?) the 400,000 PSM records across 4
machines and do the job?
• Can we increase the computational power?
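The dividing idea above can be pictured with a minimal Python sketch. It is not C-Ranker code: the function name split_records and the use of plain integers as stand-ins for PSM records are assumptions made for illustration only.

```python
def split_records(records, n_workers):
    """Split records into n_workers contiguous chunks whose sizes differ by at most one."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    base, extra = divmod(len(records), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` chunks take one additional record each.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


# Hypothetical stand-in for 400,000 PSM records divided across 4 machines.
records = list(range(400_000))
chunks = split_records(records, 4)
print([len(c) for c in chunks])  # [100000, 100000, 100000, 100000]
```

Contiguous chunks keep the original record order, so concatenating the chunks again reproduces the input exactly.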

Constraints!
• Restrictions on changing the C-Ranker design and code (I am not
experienced enough to do so)
• Should not change the execution flow of C-Ranker.

Shortlisted Approach
• Fundamental Distributed Approach

Why Distributed Framework?
• It can handle bigger datasets than it would be able to in a
centralized setting.
• Requires less memory per computer and each computer can
have commodity hardware.
• Cheaper to have multiple commodity hardware computers
than having a single high-performance high-end system
capable of achieving similar goals.

Job Execution in Distributed Approach

Proposed Solution
• A framework to execute C-Ranker on distributed nodes.
• Designed so that it can work with other post-database searching
algorithms like C-Ranker with minimal changes
• Compare the time taken to generate the distributed output of C-Ranker
with the time taken to generate the actual output
• Make sure the C-Ranker algorithm executes correctly on the set of
predefined nodes
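As a rough illustration of running one job on a set of predefined nodes, the sketch below dispatches one chunk per host in parallel. It is only a sketch under stated assumptions: run_on_worker and run_distributed are hypothetical names, and run_on_worker merely tags records locally where the real framework would send the chunk to a worker host and run the unmodified C-Ranker there.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_worker(host, chunk):
    # Hypothetical stand-in: a real worker would receive the chunk and
    # run unmodified C-Ranker on it. Here each record is only tagged
    # with the name of the host that handled it.
    return [(record, host) for record in chunk]


def run_distributed(chunks, hosts):
    """Run one chunk per host concurrently and return results in host order."""
    if len(chunks) != len(hosts):
        raise ValueError("need exactly one chunk per host")
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        # Executor.map yields results in the order of its inputs,
        # regardless of which worker finishes first.
        return list(pool.map(run_on_worker, hosts, chunks))


hosts = ["Server_1", "Server_2"]
results = run_distributed([[1, 2], [3]], hosts)
print(results)
```

Keeping results in host order means a later merging step can rely on the same order in which the data was divided.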

Data Flow Details in the Original Single-Threaded
C-Ranker

Data Flow Details in Distributed C-Ranker
(Dividing)

Data Flow For a Single Worker Host

Data Flow Details in Distributed C-Ranker
(Merging)
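The merging step named in the slide title above could look roughly like the following sketch. It assumes, purely for illustration, that each worker returns its output as a list of text lines whose first line is a shared header; the function name merge_outputs and that file layout are assumptions, not details taken from C-Ranker.

```python
def merge_outputs(partial_outputs):
    """Concatenate per-worker outputs, keeping the shared header line only once."""
    merged = []
    for lines in partial_outputs:
        if not lines:
            continue  # a worker that produced no output contributes nothing
        header, rows = lines[0], lines[1:]
        if not merged:
            merged.append(header)
        elif header != merged[0]:
            raise ValueError("partial outputs have different headers")
        merged.extend(rows)
    return merged


# Hypothetical partial outputs from two worker hosts.
part_1 = ["id\tscore", "psm1\t0.9", "psm2\t0.4"]
part_2 = ["id\tscore", "psm3\t0.7"]
print(merge_outputs([part_1, part_2]))
```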

Execution Environment
• JAVA
• MATLAB MCR environment
• Apache Tomcat web server

Hardware Used to Observe Results
Servers             Server_1     Server_2         Server_3     Server_4
Memory              8GB          4GB              4GB          4GB
Processor           i5           i5               i5           i5
Operating System    Windows 7    Windows Vista    Windows 7    Windows 7

Comparison of C-Ranker on Distributed Approach
with C-Ranker on an Apache Hadoop Framework (Cluster 1)
PBMC data (KB)    C-Ranker execution time in hrs (Cluster 1 Hadoop)    Distributed approach for C-Ranker execution time in hrs
11221             6.5                                                   3.56
12816             9.9                                                   8.1
31422             10.2                                                  8.25
48486             15.2                                                  9.2
Results for testData.xls (409 KB)

Results for Pbmc_orbit_mips.xls (11221 KB)

Results for Pbmc_orbit_nomips.xls (12816
KB)

Results for Pbmc_velos_mips.xls (31422 KB)

Results for Pbmc_velos_nomips.xls (48486 KB)

Memory usage for testData.xls (409 KB)

Memory Usage for Pbmc_orbit_mips.xls
(11221 KB)

Memory Usage for Pbmc_orbit_nomips.xls
(12816 KB)

Memory Usage for Pbmc_velos_mips.xls
(31422 KB)

Memory usage for Pbmc_velos_nomips.xls
(48486 KB)

Upgraded Hardware to Compare with Cluster 2 Hadoop
Servers             Server_1     Server_2     Server_3     Server_4
Memory              12GB         8GB          12GB         8GB
Processor           i7           i5           i7           i7
Operating System    Windows 8    Windows 7    Windows 7    Windows 7

Comparison of C-Ranker on Distributed Approach
with C-Ranker on an Apache Hadoop Framework (Cluster 2)
PBMC data (KB)    Distributed approach for C-Ranker execution time in hrs (new results)    C-Ranker execution time in hrs (Cluster 2 Hadoop)
11221             1.7                                                                        1.3
12816             3.82                                                                       1.58
31422             4.1                                                                        3.4
48486             5.93                                                                       4.5

Cost Calculation of Apache Hadoop Cluster 1 and Cluster 2
PBMC data size (KB)    Cost of Hadoop Cluster 1 ($)    Cost of Hadoop Cluster 2 ($)
11221                  3.4581                          0.6916
12816                  5.2668                          0.8410
31422                  5.4264                          1.8088
48486                  8.8064                          2.3941
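The slides do not say how these costs were derived. As a consistency check only, the sketch below divides each cost by the Hadoop execution time reported for the same dataset in the two comparison tables, which gives an implied hourly rate. The function name implied_rates and the reading of the costs as time multiplied by a flat rate are assumptions, not something the source states.

```python
# Hadoop execution times in hours and costs in dollars, in the order
# 11221, 12816, 31422, 48486 KB, copied from the tables in this deck.
hadoop_hours = {
    "Cluster 1": [6.5, 9.9, 10.2, 15.2],
    "Cluster 2": [1.3, 1.58, 3.4, 4.5],
}
hadoop_costs = {
    "Cluster 1": [3.4581, 5.2668, 5.4264, 8.8064],
    "Cluster 2": [0.6916, 0.8410, 1.8088, 2.3941],
}


def implied_rates(costs, hours):
    """Return cost divided by hours for each dataset, rounded to 3 decimals."""
    return [round(cost / hour, 3) for cost, hour in zip(costs, hours)]


for cluster in hadoop_hours:
    print(cluster, implied_rates(hadoop_costs[cluster], hadoop_hours[cluster]))
```

Under this reading, every row comes out at about $0.532 per hour except the 48486 KB row on Cluster 1, which comes out at about $0.579 per hour.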

Conclusion
• Reduces the execution time
• Cost free (no need for high-performance computing machines)
• No need to change the current structure of C-Ranker

Conclusion (cont.)
• Better resource management, for example of memory
• No need to change the implementation of CRanker

Future Scope
• The same distributed approach can be used with Percolator and
PeptideProphet to see how well they perform
• Additionally, one can use an ensemble method to combine the
results of the three tools.
