CAMERA metagenomic annotation pipeline

CAMERA Annotation Pipelines
(and related infrastructure)

Brett Whitty
12/20/2007

Overview

 Compute Infrastructure
 GOS/CAMERA ncRNA/ORF calling pipeline
 rRNA finding pipeline
 ORF calling
 GOS (incremental) protein clustering
 CAMERA Annotation Pipeline
 Specifications
 Implementation

CALIT2 Compute Grid

 48 dual-core, dual-CPU 64-bit machines
 192 SGE slots
 Red Hat-based ‘Rocks Clusters’ Linux
distribution (see http://rocksclusters.org)
 ‘Rocks Rolls’
 Bio-roll (/opt/Bio)
 Used to image/install each node separately,
including local Perl module installs (patches)

sos.camera.calit2.net

 Head node of sos cluster
 SSH into here
 Is not an SGE submit host

SOS Cluster Global Mounts
 /share/apps
 applications (and related files) are installed here,
analysis data should not be stored here
 /home/thumper6
 a global mount point --- 18T(!!!) storage volume
on which all analysis data/results should be
stored
 /opt/Bio
 tools such as clustalw, EMBOSS, hmmer, ncbi
blast are installed under here

SOS Local Mounts
(on each grid node)

 /state/partition1
 local storage device on each grid node available
for local scratch space (438G)
 /tmp
 system tmp partition (7G)

pg0-0.camera.calit2.net

 SSH accessible only through head
 Is an SGE submit host
 Running apache and postgres servers


 http://web1.camera.calit2.net/ergatis/

 /var/www/cgi-bin/ergatis
 /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force
https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/htdocs ergatis

 /var/www/html/ergatis
 /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force
https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/cgi-bin ergatis

 CGI scripts run as the user 'apache' on pg0-0, but 'apache' has
sudo permissions for user 'ergatis'
 The two CGI scripts in the install which run RunWorkflow and
KillWorkflow (ergatis/kill_wf.cgi, ergatis/Ergatis/Pipeline.pm)
have been modified, and 'sudo -u ergatis ' has been prepended
to their normal execution strings

 IdGenerator.pm has been modified to use JCVIIdGenerator.pm

 Many of the settings in ergatis.ini have been changed from
defaults, including disabling a number of the components
 When updating the Ergatis CGI directory from the SVN
repository, a backup copy should be set aside in advance

SGE/Workflow Notes
 Two SGE queues have been configured for ergatis:
 ergatis.q (192 slots)
 ergatis-fast.q (144 slots)
 ergatis.q is a subordinate queue of ergatis-fast.q

 ergatis.q is set as default queue for user ‘ergatis’ by specifying ‘-q ergatis.q’ in
/home/ergatis/.sge_request

 Workflow version 3.0 is installed
 /share/apps/workflow

 Workflow requires that the SGE queue's prolog and epilog scripts be set to the
following:
 prolog=/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue
 epilog=/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue

 The queue configuration can be checked using the command
'qconf -sq ergatis.q'

Ergatis Application Install
 The main ergatis application install directory is under /share/apps/ergatis

 The chado-v1r12b1 release is the current version installed
 direct copy of the install located at /usr/local/devel/ANNOTATION/ard/ at JCVI
 Perl wrappers were modified via sed to the correct local directory structures
 Proper install wasn't done because no working installer script was available at the
time

 /share/apps/ergatis/chado-v1r12b1
symlinked to /share/apps/ergatis/current

 Executables that some Ergatis components use but that are not installed with
Ergatis (e.g.: JCVI internal scripts) are located under /share/apps/ergatis/bin

 External tools which are not globally installed on sos are installed under
/share/apps/ergatis/external_apps

 Ergatis global directories (global_id_repository, global_saved_templates) are
located under /share/apps/ergatis/ergatis_global

Ergatis Data Locations
 All ergatis data should be put under /home/thumper6/ergatis

 Project repositories are located under
/home/thumper6/ergatis/projects
or symlink /share/apps/ergatis/projects

 CAMERA project repository is
/home/thumper6/ergatis/projects/camera

 Databases are located under /home/thumper6/ergatis/db
or symlink /share/apps/ergatis/db

 Global scratch space is under /home/thumper6/ergatis/scratch
or symlink /share/apps/ergatis/scratch

ikelite.rocksclusters.org

 Fewer machines than sos cluster (~20 slots?)
 Initial test ergatis install was done here
(similar directory structure to sos)
 Completely distinct from sos cluster
 Sandbox
 Shibu, Weizhong Li and others run computes
here (e.g.: clustering pipeline)

GOS/CAMERA Pipelines Overview

[Diagram; boxes as labeled on the slide:]
 Metagenomic Reads
 ncRNA/ORF Finding Pipeline
 ORFs/peptides
 Incremental Clustering Pipeline
 Annotation Pipeline
 Cluster Memberships

Challenges
 All computes in the pipeline must be performed on
multi-sequence input/output files, as the filesystem
cannot physically support 12M+ individual FASTA
input/output files
 other partitioning solutions could work(?) but most tools
support multiple sequence inputs anyway

 Overall total space consumption was an issue when
computes were running on the TIGR grid, but it is less
of an issue (currently) on the CALIT2 grid
 Solution here was to keep all inputs/outputs gzipped
during pipeline execution, at the cost of some performance
loss (using things like zcat -f | with NCBI BLAST, etc.)
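The gzipped-everywhere approach can be mirrored inside a script as well as on the command line. The following is a minimal sketch, written in Python rather than the pipeline's Perl; the function names are hypothetical and not part of Ergatis. Like zcat -f, it reads a file whether or not it is actually compressed.

```python
import gzip


def open_maybe_gzipped(path):
    """Open a file for text reading, decompressing only if it is gzipped."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzipped multi-FASTA file."""
    header, chunks = None, []
    with open_maybe_gzipped(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Streaming records one at a time keeps memory use flat even for multi-sequence files with millions of entries.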

GOS/CAMERA ncRNA and
ORF Finding Pipeline

GOS/CAMERA ncRNA and ORF Finding Pipeline Overview

[Flow diagram; labels as extracted from the slide:]
 Reads
 Find tRNAs, Extract tRNAs, tRNAs FASTA
 Soft-Mask tRNAs
 Find rRNAs, Extract rRNAs, rRNAs FASTA
 Soft-Mask rRNAs
 Metagene: ORFs FASTA, Peptides FASTA
 GOS ORF calling: ORFs FASTA, Peptides FASTA
 ORF stats, ORF overlaps

GOS/CAMERA
ncRNA and ORF Finding Pipeline
CAMERA-specific
Ergatis components

CAMERA rRNA Finder Overview
 BLAST vs. a database of coded pooled rRNA
subunit sequences
 BLAST prefilter step with loose parameters
 blastall -p blastn -i reads.fsa -d rrna_db.fsa -e 0.1 -F 'T' -b 1 -v 1
-z 3000000000 -W 9
 Reads with prefilter hits are searched using strict
parameters
 blastall -p blastn -i aligned.fsa -d rrna_db.fsa -e 1e-4 -F 'm L' -b
1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T
 Collapse aligned intervals of the same rRNA type
and extract the highest scoring alignments from
each region
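The collapse step can be sketched as an interval merge. This is an illustration in Python, not the camera_rrna_finder code: it assumes hits on a single read are given as (rrna_type, begin, end, score) tuples, that "region" means a run of overlapping hits of the same type, and that the best hit is the one with the highest score. The function name is hypothetical.

```python
from collections import defaultdict


def collapse_hits(hits):
    """Merge overlapping hits of the same rRNA type on one read.

    hits: iterable of (rrna_type, begin, end, score) with begin < end.
    Returns a list of (rrna_type, region_begin, region_end, best_hit),
    sorted by position, where best_hit is the highest-scoring original
    hit inside that region.
    """
    by_type = defaultdict(list)
    for hit in hits:
        by_type[hit[0]].append(hit)

    regions = []
    for rrna_type, type_hits in by_type.items():
        type_hits.sort(key=lambda h: (h[1], h[2]))
        begin, end, best = None, None, None
        for hit in type_hits:
            _, h_begin, h_end, h_score = hit
            if begin is not None and h_begin <= end:
                # Overlaps the open region: extend it and keep the best hit.
                end = max(end, h_end)
                if h_score > best[3]:
                    best = hit
            else:
                if begin is not None:
                    regions.append((rrna_type, begin, end, best))
                begin, end, best = h_begin, h_end, hit
        if begin is not None:
            regions.append((rrna_type, begin, end, best))
    return sorted(regions, key=lambda r: (r[1], r[2], r[0]))
```

Hits of different rRNA types are never merged with each other, so a read carrying both a 16S and a 23S fragment yields two separate regions.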

camera_rrna_finder

Custom DB

rRNA Finder DB
/usr/local/annotation/CAMERA/CustomDB/camera_rRNA_finder.all_rRNA.coded.cdhit_80.fsa

 5S
 Sequences from Archaea, Bacteria and Eukaryota were
obtained from the 5S Ribosomal RNA Database
 http://biobases.ibch.poznan.pl/5SData/
 16S
 Sequences for Archaea and Bacteria were obtained from the
Green Genes 16S db
 http://greengenes.lbl.gov/
 18S
 Source was Doug Rusch's 18S database prepared for the GOS
paper
 23S
 Source was Doug Rusch's 23S database prepared for the GOS
paper.

rRNA Finder DB

Fasta headers were coded as follows:

>#S [D] ...original.header...

where # is one of (5, 16, 18, 23) and D is one of
(A, B, E). The camera_rrna_finder
component expects this format.
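A header in this coding could be checked as follows. This is a sketch in Python under two assumptions that the slide does not confirm: that the square brackets are literal (e.g. ">16S [B] ..."), and that A, B and E stand for Archaea, Bacteria and Eukaryota. The function name is hypothetical.

```python
import re

# Assumed literal layout: >16S [B] original header text
CODED_HEADER = re.compile(r"^>(5|16|18|23)S \[([ABE])\]\s*(.*)$")

# Assumed meaning of the domain codes.
DOMAINS = {"A": "Archaea", "B": "Bacteria", "E": "Eukaryota"}


def parse_coded_header(line):
    """Return subunit, domain and original header, or None if not coded."""
    match = CODED_HEADER.match(line.strip())
    if match is None:
        return None
    subunit, domain, original = match.groups()
    return {
        "subunit": subunit + "S",
        "domain": DOMAINS[domain],
        "original": original,
    }
```

Returning None for anything that does not match makes it easy to reject a database file that was not coded before it is handed to the component.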

rRNA Finder DB
 CD-HIT was run on the entire database to cluster sequences with
high similarity to reduce the database size but maintain a range
of diverse sequences

Command line:
/usr/local/devel/ANNOTATION/bwhitty/cdhit/cd-hit/cd-hit-est -i
input_database.fsa -o output_database.fsa -c 0.8 -n 4

 Consistency of clustering was checked with a Perl script to
ensure no heterogeneous clustering
(e.g.: 18S and 16S clustering together)
 Clusters were consistent
 Database size was reduced from 65,591 sequences to 1,329

FASTA Headers
 >HOT_READ_85779353 /accession=DU765170.1 /sample_id=JGI_SMPL_HF4000_12-21-03
/template_id=JGI_TMPL_ANIW12796 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
/clr_range_begin=0 /clr_range_end=1088 /length=1088
 >JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 /read_id=HOT_READ_85760722
/begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234
/read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02
/template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
/clr_range_begin=0 /clr_range_end=841 /length=841"
 >JCVI_PEP_1108836626525 /orf_id=JCVI_ORF_1108836626524 /read_id=HOT_READ_85760722
/begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234
/read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02
/template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
 >JCVI_NT_1108826205795 /read_id=HOT_READ_85801707 /begin=785 /end=858 /orientation=1
/type=Asn_tRNA /ergatis_id=1108826197895 /defline="HOT_READ_85801707
/accession=DU787412.1 /sample_id=JGI_SMPL_HF770_12-21-03
/template_id=JGI_TMPL_APKH2110 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
 >JCVI_NT_1108806998652 /read_id=HOT_READ_85760731 /begin=55 /end=847 /orientation=0
/type=23S_rRNA /ergatis_id=1108826197895 /read_defline="/accession=DU750895.1
/sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1714
/sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=847
/length=847"
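Headers in this style can be split into an identifier and a set of /key=value attributes. The sketch below is in Python and is not part of the pipeline; the function name is hypothetical. It treats a double-quoted value such as read_defline as a single attribute, so the keys nested inside it are not mistaken for top-level ones.

```python
import re

# A value is either a double-quoted string or a run of non-whitespace.
ATTRIBUTE = re.compile(r'/(\w+)=("[^"]*"|\S+)')


def parse_defline(defline):
    """Split a header into (identifier, {key: value}) using /key=value tokens."""
    text = defline.lstrip(">").strip()
    identifier, _, rest = text.partition(" ")
    attributes = {}
    for key, value in ATTRIBUTE.findall(rest):
        attributes[key] = value.strip('"')
    return identifier, attributes
```

A header whose closing quote is missing would have its nested keys read as top-level attributes, so such headers are best repaired or rejected before parsing.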

[Figure callouts:]
 The absence of called ORFs in this region of the read is due to the soft-masked rRNA sequence
 RNAmmer didn’t identify the 23S sequence, though it is capable of finding 23S
 Again, RNAmmer failed to identify rRNA sequence
 These ORFs have >150 unmasked bases
 BLAST-based approach does a pretty good job of finding correct boundaries
 BLAST-based rRNA finding appears to outperform RNAmmer for 23S sequences, and some 16S

GOS (Incremental)
Clustering Pipeline

http://camera.venterinstitute.org/wiki/display/V

Clustering Overview

[Diagram; labels as extracted from the slide:]
 All Public Proteins + GOS ORFs
 Core Clusters
 GOS Cluster v1.2
 Non-Redundant Sequence Representatives (90% Identity CD-HIT)
 Longest Sequence Representatives
 Historical Artifacts (with respect to annotation)

CAMERA Polypeptide
Annotation Pipeline

Thoughts on Specifications
 Annotation rules should not be literally codified as
Perl code (and only Perl code)!!!
(especially when the “decision makers” never look at the code)

 What tools do we trust?
 What cutoffs do we use?
 What evidence/data types do we consider?

 These will (in some cases should) change over time

More Thoughts

 Specifications are easier to change than
code, so code should be written to support
change

 But unless they’re defined first, the
specifications will be a moving target

(My) Design Objectives

 Must be able to add/remove annotation data
sources as the annotation SOP changes
 Must be able to easily change the ways in
which these annotation data types are
applied/combined to produce final annotation
 Must be able to change/expand the types of
final annotation data we are producing

Object-Oriented Design Approach

 OOP in Perl == *, but lesser of two evils
(don’t ask me what the other evil is, but it must be pretty evil)

 Encapsulates possible sources of change and prevents
them from affecting downstream components
(like HACCP)
 Polymorphism of $parser->parse($infile) producing
annotation objects is nice
 Re-use was not really a motive here

*Damian Conway in his OOP Perl book says that using OOP in Perl yields a 5X performance hit

Annotation Pipeline Overview

[Flow diagram, top to bottom:]
 Annotation Tool(s)
 Annotation Source Data
 Parser(s)
 Annotation Data Object(s)
 Annotation Rules
 Final Annotation Data

We can make changes to the annotation rules without
necessarily having to re-run or re-parse the data

Design Objectives for Parsers
A parser must:
 Produce polypeptides with associated AnnotationData objects of a defined type
 Produce AnnotationData object with attributes specified in a consistent way
 E.g.: All parsers should produce EC number attributes in the full four-field form
(‘1.1.1.1’ through ‘1.-.-.-’), never a truncated form such as ‘1.-’. Multiple values
should be split. Any clean-up or verification should be done before the AnnotationData
object is created; if the data is invalid, the attribute should not be populated, or
the object should not be created.
 Produce annotation data objects that are independent of the source annotation
data they were parsed from
 I.e.: They have already been canonicalized as a ‘trusted annotation evidence
type’ when they are created as AnnotationData objects. These trusted types are
defined in the annotation SOP.

 These features create a separation between how trusted evidence is defined
(input data), and how the evidence is used to produce annotation (annotation
rules)
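The EC clean-up requirement above can be sketched as a small normalizer. This is an illustration in Python rather than the pipeline's Perl parsers; the function name, the accepted separators, and the choice to drop (rather than repair) malformed values are assumptions.

```python
import re

EC_FIELD = re.compile(r"^(\d+|-)$")


def normalize_ec(raw):
    """Return a list of four-field EC numbers parsed from a raw string.

    Values are split on whitespace, commas, or semicolons. Truncated
    values such as '1.-' are padded to '1.-.-.-'; invalid values are
    dropped so that no attribute is populated from bad data.
    """
    normalized = []
    for token in re.split(r"[\s,;]+", raw.strip()):
        token = re.sub(r"^EC[:\s]*", "", token, flags=re.IGNORECASE)
        fields = token.split(".")
        if not 2 <= len(fields) <= 4:
            continue
        if not all(EC_FIELD.match(field) for field in fields):
            continue
        fields += ["-"] * (4 - len(fields))
        if fields[0] == "-":
            continue
        # Once a field is unspecified, every later field must be too.
        first_dash = fields.index("-") if "-" in fields else 4
        if any(field != "-" for field in fields[first_dash:]):
            continue
        normalized.append(".".join(fields))
    return normalized
```

A parser would call this before creating the AnnotationData object and skip the EC attribute when the returned list is empty.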

AnnotationData Objects

[Class diagram; labels as extracted from the slide:]
 AnnotationData::Polypeptide (Polypeptide)
 holds AnnotationData Object(s)
 AnnotationData
 type: [some string]
 attributes: common_name, gene_symbol, EC, GO, TIGR_role, …

AnnotationRules

 AnnotationRules object implements the rules
from the annotation SOP document

 AnnotationRules::PredictedProtein takes a
Polypeptide object with associated
AnnotationData objects of varying type and
applies the annotation rules to create a final
AnnotationData object

AnnotationRules
 Rules are encoded as an array in the following
format:
ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2

 Where OPERATOR is one of:
 = to assign an attribute (if unassigned)
 + to append to an attribute
 - to overwrite an attribute

 New operators can be added easily, because operators are
applied through a hash of handler subroutines
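The rule format and the handler table can be sketched as follows. This is a Python illustration of the idea, not the Perl AnnotationRules code: the function names, the dict-based evidence layout, and the example values are all assumptions made for the sketch.

```python
def assign(final, name, values):
    """'=': set the attribute only if it has no value yet."""
    if not final.get(name):
        final[name] = list(values)


def append(final, name, values):
    """'+': add values that are not already present."""
    current = final.setdefault(name, [])
    current.extend(v for v in values if v not in current)


def overwrite(final, name, values):
    """'-': replace whatever is there."""
    final[name] = list(values)


# Operators are looked up in a table, so adding one means adding an entry.
HANDLERS = {"=": assign, "+": append, "-": overwrite}


def apply_rules(rules, evidence, handlers=HANDLERS):
    """Apply ordered rules to the evidence for one polypeptide.

    rules: list of 'ANNOTATION_TYPE|OPERATOR|ATTR1 ATTR2' strings.
    evidence: {annotation_type: {attribute: [values]}}.
    Returns the final {attribute: [values]} annotation.
    """
    final = {}
    for rule in rules:
        annotation_type, operator, attributes = rule.split("|")
        source = evidence.get(annotation_type)
        if source is None:
            continue
        for name in attributes.split():
            values = source.get(name)
            if values:
                handlers[operator](final, name, values)
    return final


# Made-up evidence for illustration only.
rules = [
    "TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role",
    "PRIAM|=|GO EC",
    "PandaBLASTP::Characterized|=|common_name gene_symbol",
]
evidence = {
    "PRIAM": {"EC": ["1.1.1.1"]},
    "PandaBLASTP::Characterized": {"common_name": ["example protein"]},
}
final = apply_rules(rules, evidence)
```

Because every rule here uses =, the order of the array acts as a priority list: the first evidence type that supplies an attribute wins, and later rules only fill in what is still unassigned.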

AnnotationRules::PredictedProtein
 my @annotation_order = (
 ## equivalog level tigrfam hits
 'TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
 'TIGRFAM::FullLength::Exception|=|common_name gene_symbol GO EC TIGR_role',
 'TIGRFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',

 'TIGRFAM::FRAG::Equivalog|=|GO',
 'TIGRFAM::FRAG::Exception|=|GO',
 'TIGRFAM::FRAG::HypotheticalEquivalog|=|GO',
 'TIGRFAM::FullLength::Domain|=|GO',
 'PandaBLASTP::Characterized|=|GO',

 'PRIAM|=|GO EC',

 ## equivalog level hits vs tigrfam frag
 'TIGRFAM::FRAG::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
 'TIGRFAM::FRAG::Exception|=|common_name gene_symbol GO EC TIGR_role',
 'TIGRFAM::FRAG::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',

 ## characterized high confidence blast hit
 'PandaBLASTP::Characterized|=|common_name gene_symbol',

 ## pfam and non-equivalog tigrfams
 'PFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
 'PFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
 'TIGRFAM::FullLength::Subfamily|=|common_name gene_symbol GO EC TIGR_role',
 'TIGRFAM::FullLength::Superfamily|=|common_name gene_symbol GO EC TIGR_role',
 'TIGRFAM::FullLength::EquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
 'TIGRFAM::FullLength::HypotheticalEquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
 'TIGRFAM::FullLength::SubfamilyDomain|=|common_name gene_symbol GO EC TIGR_role',
 'TIGRFAM::FullLength::Domain|=|common_name gene_symbol GO EC TIGR_role',
 …

CAMERA Annotation Pipeline

CAMERA-specific
Ergatis components

CAMERA-specific Code in SVN

 http://iwebsvn.tigr.org/listing.php?repname=ANNO

Future Development
(My 2 cents)

 Pipeline development must be driven by annotation SOP development
work
 Feedback on pipeline bugs must be vigilantly kept separate from feedback
on annotation SOP bugs
 First discuss and update the SOP, then modify the code
 Cluster summary annotation
 Shortest path here seems to be a combination of GO Slim and EC
assignments? GO consortium makes some scripts available for
summarizing sets of GO assignments
 If using the current code, PolypeptideSet container class exists already.
Cluster members can be added to a PolypeptideSet and that can be used
as input to an AnnotationRules::FinalCluster object that is similar to the one
for PredictedProtein, but with a different set of handler routines.
 Incremental clustering pipeline
 Good luck 
