4. CALIT2 Compute Grid
48 dual-core, dual-CPU 64-bit machines
192 SGE slots
Red Hat-based ‘Rocks Clusters’ Linux
distribution (see http://rocksclusters.org)
‘Rocks Rolls’
Bio-roll (/opt/Bio)
Used to image/install each node separately,
including local Perl module installs (patches)
6. SOS Cluster Global Mounts
/share/apps
applications (and related files) are installed here,
analysis data should not be stored here
/home/thumper6
a global mount point --- 18T(!!!) storage volume
on which all analysis data/results should be
stored
/opt/Bio
tools such as clustalw, EMBOSS, hmmer and NCBI
BLAST are installed under here
7. SOS Local Mounts
(on each grid node)
/state/partition1
local storage device on each grid node available
for local scratch space (438G)
/tmp
system tmp partition (7G)
10. pg0-0.camera.calit2.net
CGI scripts run as the user 'apache' on pg0-0, but ‘apache’ has
sudo permissions for user 'ergatis'
The two files in the install which run RunWorkflow and
KillWorkflow (ergatis/kill_wf.cgi, ergatis/Ergatis/Pipeline.pm)
have been modified, and 'sudo -u ergatis ' has been prepended
to their normal execution strings
IdGenerator.pm has been modified to use JCVIIdGenerator.pm
Many of the settings in ergatis.ini have been changed from
defaults, including disabling a number of the components
When updating the Ergatis CGI directory from the SVN
repository, a backup copy should be set aside in advance
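The sudo change on this host amounts to putting a fixed prefix in front of the command that the CGI layer would otherwise run. A minimal sketch in Python, for illustration only (the real files are Perl; the function name `as_ergatis` and the placeholder argument are hypothetical, and the actual RunWorkflow arguments are not shown in these slides):

```python
import shlex

def as_ergatis(command, user="ergatis"):
    """Return `command` (a list of arguments) prefixed so sudo runs it as `user`."""
    return ["sudo", "-u", user] + list(command)

# Placeholder argument: the real RunWorkflow argument list is not given here.
cmd = as_ergatis(["RunWorkflow", "PLACEHOLDER_ARGS"])
print(shlex.join(cmd))
```

Because the prefix is added in one place, a fresh checkout from SVN that lacks it is easy to spot and re-patch.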
11. SGE/Workflow Notes
Two SGE queues have been configured for ergatis:
ergatis.q (192 slots)
ergatis-fast.q (144 slots)
ergatis.q is a subordinate queue of ergatis-fast.q
ergatis.q is set as default queue for user ‘ergatis’ by specifying ‘-q ergatis.q’ in
/home/ergatis/.sge_request
Workflow version 3.0 is installed
/share/apps/workflow
Workflow requires that the SGE queue's prolog and epilog scripts be set to the
following:
prolog=/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue
epilog=/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue
The queue configuration can be checked using the command
'qconf -sq ergatis.q'
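The prolog/epilog check can be automated. The sketch below (Python, illustrative only) assumes that 'qconf -sq' prints one "name value" pair per line and wraps long values with a trailing backslash; the function names are hypothetical.

```python
import subprocess

# Values required by Workflow, as listed on this slide.
EXPECTED = {
    "prolog": "/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue",
    "epilog": "/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue",
}

def parse_qconf(text):
    """Parse 'qconf -sq' style output into a {name: value} dict."""
    # Assumption: long values are wrapped with a trailing backslash.
    joined = text.replace("\\\n", " ")
    config = {}
    for line in joined.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Collapse runs of whitespace so wrapped values compare cleanly.
            config[parts[0]] = " ".join(parts[1].split())
    return config

def check_queue(text):
    """Return the names of settings that differ from what Workflow requires."""
    config = parse_qconf(text)
    return [key for key, value in EXPECTED.items() if config.get(key) != value]

def fetch_queue_config(queue="ergatis.q"):
    """Run qconf; only works on a host where SGE is installed."""
    result = subprocess.run(["qconf", "-sq", queue],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

On a submit host, `check_queue(fetch_queue_config("ergatis.q"))` would return an empty list when both scripts are set as required.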
12. Ergatis Application Install
The main ergatis application install directory is under /share/apps/ergatis
The chado-v1r12b1 release is the current version installed
direct copy of the install located at /usr/local/devel/ANNOTATION/ard/ at JCVI
Perl wrappers were modified via sed to the correct local directory structures
A proper install wasn't done because no working installer script was available at the
time
/share/apps/ergatis/chado-v1r12b1
symlinked to /share/apps/ergatis/current
Executables which some ergatis components use, but which are not installed with
Ergatis (e.g.: JCVI internal scripts) are located under /share/apps/ergatis/bin
External tools which are not globally installed on sos are installed under
/share/apps/ergatis/external_apps
Ergatis global directories (global_id_repository, global_saved_templates) are
located under /share/apps/ergatis/ergatis_global
13. Ergatis Data Locations
All ergatis data should be put under /home/thumper6/ergatis
Project repositories are located under
/home/thumper6/ergatis/projects
or symlink /share/apps/ergatis/projects
CAMERA project repository is
/home/thumper6/ergatis/projects/camera
Databases are located under /home/thumper6/ergatis/db
or symlink /share/apps/ergatis/db
Global scratch space is under /home/thumper6/ergatis/scratch
or symlink /share/apps/ergatis/scratch
14. ikelite.rocksclusters.org
Fewer machines than the sos cluster (~20 slots?)
Initial test ergatis install was done here
(similar directory structure to sos)
Completely distinct from sos cluster
Sandbox
Shibu, Weizhong Li and others run computes
here (e.g.: clustering pipeline)
17. Challenges
All computes in pipeline must be performed on
multi-sequence input/output files, as the filesystem
cannot physically support 12M+ individual FASTA
input files/output files
other partitioning solutions could work(?) but most tools
support multiple sequence inputs anyway
Overall total space consumption was an issue when
computes were running on the TIGR grid, but this is not
as much of an issue (currently) on the CALIT2 grid
The solution here was to keep all inputs/outputs gzipped
during pipeline execution, at the cost of some performance
loss (using things like zcat -f | with NCBI BLAST, etc.)
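A script in the pipeline can get the same behaviour as zcat -f (read the file whether or not it is compressed) without shelling out. This is a minimal Python sketch for illustration only; the function names are hypothetical and this is not the pipeline's Perl code.

```python
import gzip

def open_maybe_gzipped(path):
    """Open a text file whether or not it is gzip-compressed (like `zcat -f`)."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")

def read_fasta(path):
    """Yield (header, sequence) pairs from a multi-sequence FASTA file."""
    header, chunks = None, []
    with open_maybe_gzipped(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Streaming one record at a time keeps memory use flat, which matters when a single multi-sequence file holds a large slice of the 12M+ reads.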
22. CAMERA rRNA Finder Overview
BLAST vs. a database of coded pooled rRNA
subunit sequences
BLAST prefilter step with loose parameters
blastall -p blastn -i reads.fsa -d rrna_db.fsa -e 0.1 -F 'T' -b 1 -v 1 -z 3000000000 -W 9
Reads with prefilter hits are searched using strict
parameters
blastall -p blastn -i aligned.fsa -d rrna_db.fsa -e 1e-4 -F 'm L' -b 1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T
Collapse aligned intervals of the same rRNA type
and extract the highest scoring alignments from
each region
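The collapse step can be sketched as an interval merge per rRNA type. This Python sketch is illustrative only and makes several assumptions not stated on the slide: hits are (rrna_type, start, end, score) tuples on a single read, two hits belong to the same region when they overlap by at least one base, and a single best hit is kept per region. The camera_rrna_finder component may use different criteria.

```python
from collections import defaultdict

def collapse_hits(hits):
    """Merge overlapping hits of the same rRNA type on one read.

    hits: iterable of (rrna_type, start, end, score).
    Returns a list of (rrna_type, region_start, region_end, best_hit).
    """
    by_type = defaultdict(list)
    for hit in hits:
        by_type[hit[0]].append(hit)

    regions = []
    for rrna_type, type_hits in sorted(by_type.items()):
        type_hits.sort(key=lambda h: (h[1], h[2]))
        current = None  # [region_start, region_end, best_hit]
        for hit in type_hits:
            _, start, end, score = hit
            if current is not None and start <= current[1]:
                # Overlaps the open region: extend it, keep the better hit.
                current[1] = max(current[1], end)
                if score > current[2][3]:
                    current[2] = hit
            else:
                if current is not None:
                    regions.append((rrna_type, current[0], current[1], current[2]))
                current = [start, end, hit]
        if current is not None:
            regions.append((rrna_type, current[0], current[1], current[2]))
    return regions
```

Hits of different rRNA types are never merged, so a 16S and a 23S alignment covering the same coordinates stay as two separate regions.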
25. rRNA Finder DB
/usr/local/annotation/CAMERA/CustomDB/camera_rRNA_finder.all_rRNA.coded.cdhit_80.fsa
5S
Sequences from Archaea, Bacteria and Eukaryota were
obtained from the 5S Ribosomal RNA Database
http://biobases.ibch.poznan.pl/5SData/
16S
Sequences for Archaea and Bacteria were obtained from the
Green Genes 16S db
http://greengenes.lbl.gov/
18S
Source was Doug Rusch's 18S database prepared for the GOS
paper
23S
Source was Doug Rusch's 23S database prepared for the GOS
paper.
26. rRNA Finder DB
Fasta headers were coded as follows:
>#S [D] ...original.header...
where # is one of (5, 16, 18, 23) and D is one of
(A, B, E). The camera_rrna_finder
component expects this format.
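A coded header in this format can be split back into its parts with one regular expression. The Python sketch below is illustrative only; the function name is hypothetical, and reading A, B, E as Archaea, Bacteria, Eukaryota is an assumption based on the source databases listed on the previous slide.

```python
import re

# '>#S [D] original header' where # is 5, 16, 18 or 23 and D is A, B or E.
HEADER_RE = re.compile(r"^>?(5|16|18|23)S \[([ABE])\](?: (.*))?$")

def parse_coded_header(header):
    """Return (subunit, domain_code, original_header) for a coded FASTA header."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError("not a coded rRNA header: %r" % header)
    return match.group(1) + "S", match.group(2), match.group(3) or ""
```

For example, `parse_coded_header(">16S [B] some original header")` returns `("16S", "B", "some original header")`, and a header with any other subunit or domain code raises `ValueError`.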
27. rRNA Finder DB
CD-HIT was run on the entire database to cluster sequences with
high similarity to reduce the database size but maintain a range
of diverse sequences
Command line:
/usr/local/devel/ANNOTATION/bwhitty/cdhit/cd-hit/cd-hit-est -i input_database.fsa -o output_database.fsa -c 0.8 -n 4
Consistency of clustering was checked with a Perl script to
ensure no heterogeneous clustering
(e.g.: 18S and 16S clustering together)
Clusters were consistent
Database size was reduced from 65,591 sequences to 1,329
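The consistency check was done with a Perl script that is not shown here. A comparable check can be sketched in Python as below, for illustration only. It assumes the usual CD-HIT .clstr layout (a '>Cluster N' line followed by member lines that contain '>name...') and relies on the coded headers, so that each member name starts with its subunit code.

```python
import re

# Member names start with the subunit code because of the coded headers.
MEMBER_RE = re.compile(r">(5S|16S|18S|23S)\b")

def mixed_clusters(clstr_lines):
    """Return {cluster_name: set_of_rRNA_types} for clusters mixing types."""
    clusters = {}
    current = None
    for line in clstr_lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = set()
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].add(match.group(1))
    return {name: types for name, types in clusters.items() if len(types) > 1}
```

An empty result corresponds to the outcome reported on this slide: no heterogeneous clusters.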
32. The absence of called ORFs in this region of the read is due to the soft-masked rRNA sequence
RNAmmer didn’t identify the 23S sequence, though it is capable of finding 23S
39. Thoughts on Specifications
Annotation rules should not be literally codified as
Perl code (and only Perl code)!!!
(especially when the “decision makers” never look at the code)
What tools do we trust?
What cutoffs do we use?
What evidence/data types do we consider?
These will (and in some cases should) change over time
40. More Thoughts
Specifications are easier to change than
code, so code should be written to support
change
But unless they’re defined first, the
specifications will be a moving target
41. (My) Design Objectives
Must be able to add/remove annotation data
sources as the annotation SOP changes
Must be able to easily change the ways in
which these annotation data types are
applied/combined to produce final annotation
Must be able to change/expand the types of
final annotation data we are producing
42. Object-Oriented Design Approach
OOP in Perl == *, but lesser of two evils
(don’t ask me what the other evil is, but it must be pretty evil)
Encapsulates possible sources of change and prevents
them from affecting downstream components
(like HACCP)
Polymorphism of $parser->parse($infile) producing
annotation objects is nice
Re-use was not really a motive here
*Damian Conway, in his OOP Perl book, says using OOP in Perl yields a 5X performance hit
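The polymorphism point can be illustrated with two parsers that share one parse() interface. This Python sketch is illustrative only and is not the pipeline's Perl code; the class names and both input formats are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class AnnotationData:
    polypeptide_id: str
    evidence_type: str
    attributes: dict = field(default_factory=dict)

class Parser(ABC):
    @abstractmethod
    def parse(self, lines):
        """Return a list of AnnotationData objects parsed from `lines`."""

class TabECParser(Parser):
    """Hypothetical input format: polypeptide_id <TAB> EC number."""
    def parse(self, lines):
        results = []
        for line in lines:
            pid, ec = line.rstrip("\n").split("\t")
            results.append(AnnotationData(pid, "PRIAM", {"EC": [ec]}))
        return results

class NameParser(Parser):
    """Hypothetical input format: polypeptide_id=common name."""
    def parse(self, lines):
        results = []
        for line in lines:
            pid, name = line.rstrip("\n").split("=", 1)
            results.append(AnnotationData(pid, "PandaBLASTP::Characterized",
                                          {"common_name": [name]}))
        return results

def load_all(sources):
    """sources: iterable of (parser, lines). The caller never checks the parser class."""
    results = []
    for parser, lines in sources:
        results.extend(parser.parse(lines))
    return results
```

Adding a new annotation source then means adding one Parser subclass; `load_all` and everything downstream stay unchanged.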
43. Annotation Pipeline Overview
Annotation Tool(s)
Annotation Source Data
Parser(s)
Annotation Data Object(s)
Annotation Rules
Final Annotation Data
We can make changes to the annotation rules, without having to necessarily re-run or re-parse the data
44. Design Objectives for Parsers
A parser must:
Produce polypeptides with associated AnnotationData objects of a defined type
Produce AnnotationData objects with attributes specified in a consistent way
E.g.: All parsers should produce EC number attributes in the full four-field form, anywhere from ‘1.1.1.1’ to
‘1.-.-.-’, and never a truncated form like ‘1.-’. Multiple values should be split. Any clean-up or
verification should be done before the AnnotationData object is created; if the data is
invalid, the attribute should not be populated, or the object should not be created.
Produce annotation data objects that are independent of the source annotation
data they were parsed from
e.g.: They have already been canonized as a type of ‘trusted annotation evidence
type’ when they are created as AnnotationData objects. These trusted types are
defined in the annotation SOP.
These features create a separation between how trusted evidence is defined
(input data), and how the evidence is used to produce annotation (annotation
rules)
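The EC number requirement can be sketched as a clean-up function that a parser calls before it creates an AnnotationData object. This Python sketch is illustrative only; the function name and the accepted separators (whitespace, comma, semicolon) are assumptions, and preliminary EC numbers with an 'n' prefix in a field are treated as invalid here.

```python
import re

EC_FIELD = re.compile(r"^(\d+|-)$")

def normalize_ec(raw):
    """Split `raw` into EC numbers and return only valid four-field values."""
    results = []
    for token in re.split(r"[\s,;]+", raw.strip()):
        if not token:
            continue
        token = re.sub(r"^EC[:\s]*", "", token, flags=re.I)  # drop an 'EC:' prefix
        fields = token.split(".")
        if not 1 <= len(fields) <= 4:
            continue
        if not all(EC_FIELD.match(f) for f in fields):
            continue
        fields += ["-"] * (4 - len(fields))  # '1.-' becomes '1.-.-.-'
        # The first field must be a number, and no number may follow a dash.
        valid = fields[0] != "-"
        seen_dash = False
        for f in fields:
            if f == "-":
                seen_dash = True
            elif seen_dash:
                valid = False
        if valid:
            results.append(".".join(fields))
    return results
```

An empty list tells the parser to leave the EC attribute unpopulated, which matches the rule that invalid data never reaches an AnnotationData object.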
46. AnnotationRules
AnnotationRules object implements the rules
from the annotation SOP document
AnnotationRules::PredictedProtein takes a
Polypeptide object with associated
AnnotationData objects of varying type and
applies the annotation rules to create a final
AnnotationData object
47. AnnotationRules
Rules are encoded as an array in the following
format:
ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2
Where OPERATOR is one of:
= for assign attribute (if unassigned)
+ for append attribute
- for overwrite attribute
Additional operators can be defined, because operators are applied
through a hash of handler subroutines
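The rule format and the handler lookup can be sketched as follows. This is Python for illustration only, with a dict standing in for the Perl hash of handler subroutines; the function names and the shape of the `evidence` structure are hypothetical.

```python
def assign_if_unset(final, attr, values):
    """'=': set the attribute only if it has no value yet."""
    if not final.get(attr):
        final[attr] = list(values)

def append_values(final, attr, values):
    """'+': add values that are not already present."""
    existing = final.setdefault(attr, [])
    existing.extend(v for v in values if v not in existing)

def overwrite(final, attr, values):
    """'-': replace whatever is there."""
    final[attr] = list(values)

# New operators are added by registering another handler here.
HANDLERS = {"=": assign_if_unset, "+": append_values, "-": overwrite}

def parse_rule(rule):
    """Split 'ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2' into its parts."""
    annotation_type, operator, attributes = rule.split("|")
    return annotation_type, operator, attributes.split()

def apply_rules(rules, evidence):
    """evidence: {annotation_type: {attribute: [values]}} for one polypeptide."""
    final = {}
    for rule in rules:
        annotation_type, operator, attributes = parse_rule(rule)
        source = evidence.get(annotation_type)
        if source is None:
            continue
        for attr in attributes:
            values = source.get(attr)
            if values:
                HANDLERS[operator](final, attr, values)
    return final
```

Because rules are applied in array order, with '=' the first evidence type that supplies an attribute wins, so the order of the array encodes the priority of the evidence types.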
48. AnnotationRules::PredictedProtein
my @annotation_order = (
## equivalog level tigrfam hits
'TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FullLength::Exception|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FRAG::Equivalog|=|GO',
'TIGRFAM::FRAG::Exception|=|GO',
'TIGRFAM::FRAG::HypotheticalEquivalog|=|GO',
'TIGRFAM::FullLength::Domain|=|GO',
'PandaBLASTP::Characterized|=|GO',
'PRIAM|=|GO EC',
## equivalog level hits vs tigrfam frag
'TIGRFAM::FRAG::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FRAG::Exception|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FRAG::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
## characterized high confidence blast hit
'PandaBLASTP::Characterized|=|common_name gene_symbol',
## pfam and non-equivalog tigrfams
'PFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
'PFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FullLength::Subfamily|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FullLength::Superfamily|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FullLength::EquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FullLength::HypotheticalEquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FullLength::SubfamilyDomain|=|common_name gene_symbol GO EC TIGR_role',
'TIGRFAM::FullLength::Domain|=|common_name gene_symbol GO EC TIGR_role',
…
54. Future Development
(My 2 cents)
Pipeline development must be driven by annotation SOP development
work
Feedback on pipeline bugs must be vigilantly kept separate from feedback
on annotation SOP bugs
First discuss and update the SOP, then modify the code
Cluster summary annotation
Shortest path here seems to be a combination of GO Slim and EC
assignments? GO consortium makes some scripts available for
summarizing sets of GO assignments
If using the current code, PolypeptideSet container class exists already.
Cluster members can be added to a PolypeptideSet and that can be used
as input to an AnnotationRules::FinalCluster object that is similar to the one
for PredictedProtein, but with a different set of handler routines.
Incremental clustering pipeline
Good luck