4. Protein
• Biomolecules consisting of one or more chains of amino acids
• Proteins differ from one another primarily in their sequence of
amino acids
• A protein is characterized by the sequence of amino acids as they
occur in the protein
5. Proteins (Cont.…)
• Proteins perform a vast array of functions within
living organisms.
• A protein contains at least one long polypeptide
• Proteins are involved in almost every biological process
happening in an organism’s body.
• Proteins are an important part of drug development, where drugs
target specific metabolic pathways.
6. Peptide
• Short chain of amino acids
• Peptides are distinguished from proteins on the basis of size
• A protein is first digested into peptides and then each peptide is
identified individually to infer the protein identity.
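The digestion step can be sketched in code. This is a minimal illustration that assumes trypsin-like cleavage (cut after K or R unless the next residue is P), a common convention that the slides do not specify. The function name `digest` and the example sequence are hypothetical.

```python
def digest(protein):
    """Split a protein sequence into peptides.

    Assumption (not from the slides): trypsin-like cleavage, i.e. cut
    after K or R unless the next residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # trailing peptide with no K/R at its end
    return peptides


if __name__ == "__main__":
    # Hypothetical sequence used only for illustration.
    print(digest("MKWVTFRPLLK"))
```

Joining the returned peptides in order gives back the original sequence, so no residue is lost by the split.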
7. Finding Peptide and Protein Relationship?
• Peptide Identification using mass spectrometry
9. Peptide Identification Using MS/MS Spectra
• Sequence database searching (for the large-scale dataset)
• de novo sequencing (new protein discovery)
• Post sequence database searching (extension of sequence
database search)
10. Peptide Identification (cont…)
• Mass Spectrometry (MS) strategy
• Sequence database searching
• Combination of both: dominant method for peptide identification
• Which results in more spectra from MS
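As a rough illustration of sequence database searching, the sketch below selects candidate peptides whose theoretical mass lies within a tolerance of an observed precursor mass. Real search engines also score fragment ions against the MS/MS spectrum; that step is omitted here. The residue masses are standard monoisotopic values, while the function names, the tolerance and the tiny database are assumptions made for illustration.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water accounts for the free N- and C-termini


def peptide_mass(peptide):
    """Theoretical monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def candidates(observed_mass, database, tolerance=0.01):
    """Return database peptides within `tolerance` Da of the observed mass."""
    return [p for p in database
            if abs(peptide_mass(p) - observed_mass) <= tolerance]


if __name__ == "__main__":
    # Hypothetical four-peptide database.
    db = ["GAK", "KAG", "GGG", "WWW"]
    print(candidates(peptide_mass("GAK"), db))
```

Peptides with the same composition, such as GAK and KAG, share a precursor mass, which is why the fragment (MS/MS) spectrum is needed to tell them apart.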
12. Post Database Algorithms
• Machine learning algorithms have been proposed to identify the
correct peptide-spectrum matches (PSMs)
13. Post Database Search Algorithms
• PeptideProphet
  - Learns the distribution of scores and properties
• Percolator
  - Search scores considered reliable for high values and low values
• CRanker
  - Fuzzy SVM and silhouette index
15. C-Ranker
• Identifies the correct PSMs among those output by the peptide database search.
• Developed in Matlab and C
16. Why CRanker?
• Based on research by Dr. Zhonghang Xia, it performs best
among the compared methods.
• Easy to parallelize to make it work on a network of
computers rather than on a single computer to work on a
larger scale
17. Why CRanker? (cont…)
Data    PeptideProphet    CRanker    Overlap
UPS1    582               576        509
PBMC    34035             34273      32243
The overlap of the total PSMs identified by PeptideProphet and CRanker
is 88.4% on UPS1 and 94.8% on PBMC.
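The overlap percentages can be checked against the counts in the table. The slides do not say which count serves as the denominator; the sketch below assumes the overlap is divided by the smaller of the two result sets. The function name is hypothetical.

```python
counts = {
    # dataset: (PeptideProphet PSMs, CRanker PSMs, overlapping PSMs)
    "UPS1": (582, 576, 509),
    "PBMC": (34035, 34273, 32243),
}


def overlap_percent(a, b, shared):
    # Assumption: overlap is measured against the smaller result set.
    return 100.0 * shared / min(a, b)


for name, (pp, cr, shared) in counts.items():
    print(f"{name}: {overlap_percent(pp, cr, shared):.1f}%")
```

Under that assumption the UPS1 figure comes out at 88.4%, and the PBMC figure lands within a tenth of a percentage point of the reported 94.8%.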
18. CRanker Execution: Step 1
[Diagram: the C-Ranker Read Stage reads InputFileName.txt, loads the raw PSM data into main memory, and generates InputFileName.mat.]
19. CRanker Execution: Step 2
[Diagram: the C-Ranker Solve Stage reads InputFileName.txt, loads the PSM records into main memory, and creates InputFileName_score.mat.]
20. CRanker Execution: Step 3
[Diagram: the C-Ranker Write Stage reads InputFileName.txt and InputFileName_score.mat, loads the PSM scores into main memory, and creates OutputFile.txt.]
22. Problem Statement
• C-Ranker needs a computer with high computation power
• For a dataset of about 400,000 PSM records, it may take about 5
to 8 hours on a normal PC
• Poor Resource Management
• Need to address future big data sets
23. Can’t we change C-Ranker?
• Research is ongoing to optimize C-Ranker.
• Distributed approach to C-Ranker.
• Requires rewriting the complete code!
24. Brainstorming
• Can we divide (and who will divide) the 400,000 PSM records across 4
machines and do the job?
• Increase computational power?
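The idea of dividing the records can be sketched as follows. This is only an illustration of an even split; `split_records` is a hypothetical helper, and integers stand in for PSM records.

```python
def split_records(records, n_machines):
    """Split records into n_machines contiguous chunks of near-equal size."""
    base, extra = divmod(len(records), n_machines)
    chunks, start = [], 0
    for i in range(n_machines):
        # The first `extra` chunks take one additional record each.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    psm_records = list(range(400_000))  # placeholders for PSM records
    for i, chunk in enumerate(split_records(psm_records, 4), start=1):
        print(f"machine {i}: {len(chunk)} records")
```

Whether scoring the chunks independently gives the same result as scoring the full dataset is a separate question that this sketch does not address.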
25. Constraints!
• Restrictions on changing the C-Ranker design and code (I am not
experienced enough to do so)
• Should not change the execution flow of C-Ranker.
27. Why Distributed Framework?
• It can handle bigger datasets than it would be able to in a
centralized setting.
• Requires less memory per computer and each computer can
have commodity hardware.
• Cheaper to have multiple commodity hardware computers
than having a single high-performance high-end system
capable of achieving similar goals.
30. Proposed Solution
• A framework to execute C-Ranker on distributed nodes.
• Designed so that it may work with other post-database-search
algorithms like C-Ranker with minimal changes
• Compare the time taken to generate the distributed output of C-Ranker
with that of the actual output
• Make sure the C-Ranker algorithm executes correctly on the set of
predefined nodes
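A minimal sketch of such a framework is shown below, under stated assumptions: threads simulate the predefined nodes, and `run_on_node` is a hypothetical stand-in for launching unmodified C-Ranker on a remote machine. None of these names come from C-Ranker itself.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_node(node, chunk):
    """Hypothetical stand-in for running unmodified C-Ranker on one node.

    A real framework would copy the chunk to `node`, launch C-Ranker there
    and fetch its output file; here each record is just tagged with the node.
    """
    return [(record, node) for record in chunk]


def run_distributed(records, nodes):
    # Round-robin split: node i receives records i, i+n, i+2n, ...
    chunks = [records[i::len(nodes)] for i in range(len(nodes))]
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = list(pool.map(run_on_node, nodes, chunks))
    # Merge the per-node outputs into one combined result.
    return [row for node_rows in results for row in node_rows]


if __name__ == "__main__":
    servers = ["Server_1", "Server_2", "Server_3", "Server_4"]
    merged = run_distributed(list(range(10)), servers)
    print(len(merged), "records processed")
```

Because the per-node step is an ordinary function, another post-database-search tool could be swapped in by replacing `run_on_node` only, which is the kind of minimal change the design aims for.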
40. Hardware Used to Observe Results
Servers             Server_1     Server_2         Server_3     Server_4
Memory              8 GB         4 GB             4 GB         4 GB
Processor           i5           i5               i5           i5
Operating System    Windows 7    Windows Vista    Windows 7    Windows 7
41. Comparison of C-Ranker on distributed approach with C-Ranker on an Apache Hadoop Framework Cluster 1
PBMC data (KB)    C-Ranker on Hadoop Cluster 1 (hrs)    C-Ranker, distributed approach (hrs)
11221             6.5                                   3.56
12816             9.9                                   8.1
31422             10.2                                  8.25
48486             15.2                                  9.2
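The relative time saving per dataset can be worked out from the table above. The sketch below is plain arithmetic on the reported hours; the function name is hypothetical.

```python
rows = [
    # (PBMC data in KB, Hadoop Cluster 1 hours, distributed approach hours)
    (11221, 6.5, 3.56),
    (12816, 9.9, 8.1),
    (31422, 10.2, 8.25),
    (48486, 15.2, 9.2),
]


def time_saved_percent(hadoop_hours, distributed_hours):
    """Percentage of the Hadoop run time saved by the distributed approach."""
    return 100.0 * (hadoop_hours - distributed_hours) / hadoop_hours


for size_kb, hadoop, distributed in rows:
    saved = time_saved_percent(hadoop, distributed)
    print(f"{size_kb} KB: {saved:.1f}% less time")
```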
55. Upgraded hardware to compare with Hadoop Cluster 2
Servers             Server_1     Server_2     Server_3     Server_4
Memory              12 GB        8 GB         12 GB        8 GB
Processor           i7           i5           i7           i7
Operating System    Windows 8    Windows 7    Windows 7    Windows 7
56. Comparison of C-Ranker on distributed approach with C-Ranker on an Apache Hadoop Framework Cluster 2
PBMC data (KB)    C-Ranker, distributed approach (hrs, new results)    C-Ranker on Hadoop Cluster 2 (hrs)
11221             1.7                                                  1.3
12816             3.82                                                 1.58
31422             4.1                                                  3.4
48486             5.93                                                 4.5
57. Cost Calculation of Apache Hadoop Cluster 1 and Cluster 2
PBMC data size (KB)    Cost of Hadoop Cluster 1 ($)    Cost of Hadoop Cluster 2 ($)
11221                  3.4581                          0.6916
12816                  5.2668                          0.8410
31422                  5.4264                          1.8088
48486                  8.8064                          2.3941
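The slides do not say how the cost figures were derived. One way to sanity-check them is to divide each cost by the corresponding Hadoop execution time from the comparison tables, which gives an implied hourly rate. Treating cost as proportional to run time is an assumption, not something the slides state, and the function name is hypothetical.

```python
hadoop_runs = {
    # PBMC KB: (Cluster 1 hours, Cluster 1 cost $, Cluster 2 hours, Cluster 2 cost $)
    11221: (6.5, 3.4581, 1.3, 0.6916),
    12816: (9.9, 5.2668, 1.58, 0.8410),
    31422: (10.2, 5.4264, 3.4, 1.8088),
    48486: (15.2, 8.8064, 4.5, 2.3941),
}


def implied_rate(cost, hours):
    """Dollars per hour implied by a cost and a run time."""
    return cost / hours


for size_kb, (h1, c1, h2, c2) in hadoop_runs.items():
    print(f"{size_kb} KB: cluster 1 ${implied_rate(c1, h1):.3f}/h, "
          f"cluster 2 ${implied_rate(c2, h2):.3f}/h")
```

Most rows come out near $0.53 per hour. The largest Cluster 1 row is the exception, so that entry may be worth rechecking against the original data.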
59. Conclusion
• Reduces the execution time
• No additional hardware cost (no need for high-performance machines)
• No need to change the current structure of C-Ranker
60. Conclusion (Cont.…)
• Better resource management (for example, memory)
• No need to change the implementation of CRanker
61. Future Scope
• The same distributed approach can be used with Percolator and
PeptideProphet to see how well they perform
• Additionally, one can use an ensemble method to combine the
results of the three tools.