CAMERA Annotation Pipelines
      (and related infrastructure)



            Brett Whitty
            12/20/2007
Overview

 Compute Infrastructure
 GOS/CAMERA ncRNA/ORF calling pipeline
   rRNA finding pipeline
   ORF calling
 GOS (incremental) protein clustering
 CAMERA Annotation Pipeline
   Specifications
   Implementation
Compute Infrastructure
CALIT2 Compute Grid

 48 dual-core dual-CPU 64-bit machines
    192 SGE slots
 Redhat-based ‘Rocks Clusters’ Linux
  distribution (see http://rocksclusters.org)
 ‘Rocks Rolls’
   Bio-roll (/opt/Bio)
   Used to image/install each node separately,
    including local Perl module installs (patches)
sos.camera.calit2.net

 Head node of sos cluster
    SSH into here
 Is not an SGE submit host
SOS Cluster Global Mounts
 /share/apps
    applications (and related files) are installed here,
     analysis data should not be stored here
 /home/thumper6
    a global mount point --- 18T(!!!) storage volume
     on which all analysis data/results should be
     stored
 /opt/Bio
    tools such as clustalw, EMBOSS, hmmer, ncbi
     blast are installed under here
SOS Local Mounts
                   (on each grid node)


 /state/partition1
    local storage device on each grid node available
     for local scratch space (438G)
 /tmp
    system tmp partition (7G)
pg0-0.camera.calit2.net

 SSH accessible only through head
 Is an SGE submit host
 Running apache and postgres servers
pg0-0.camera.calit2.net

 http://web1.camera.calit2.net/ergatis/


 /var/www/cgi-bin/ergatis
     /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force
      https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/htdocs ergatis




 /var/www/html/ergatis
     /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force
      https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/cgi-bin ergatis
pg0-0.camera.calit2.net
 CGI scripts run as the user 'apache' on pg0-0, but ‘apache’ has
  sudo permissions for user 'ergatis'
    The two CGI scripts in the install that run RunWorkflow and
     KillWorkflow (ergatis/kill_wf.cgi, ergatis/Ergatis/Pipeline.pm)
     have been modified, and 'sudo -u ergatis ' has been prepended
     to their normal execution strings

 IdGenerator.pm has been modified to use JCVIIdGenerator.pm

 Many of the settings in ergatis.ini have been changed from
  defaults, including disabling a number of the components
    When updating the Ergatis CGI directory from the SVN
     repository, a backup copy should be set aside in advance
SGE/Workflow Notes
   Two SGE queues have been configured for ergatis:
        ergatis.q (192 slots)
        ergatis-fast.q (144 slots)
   ergatis.q is a subordinate queue of ergatis-fast.q

   ergatis.q is set as default queue for user ‘ergatis’ by specifying ‘-q ergatis.q’ in
    /home/ergatis/.sge_request

   Workflow version 3.0 is installed
        /share/apps/workflow

   Workflow requires that the SGE queue's prolog and epilog scripts be set to the
    following:
        prolog=/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue
        epilog=/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue

   The queue configuration can be checked using the command
    'qconf -sq ergatis.q'
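
The prolog/epilog check above could be scripted. What follows is only a minimal sketch: it assumes that 'qconf -sq' prints one whitespace-separated "attribute value" pair per line, with a trailing backslash for continuation lines. The function names and the sample text are hypothetical and were not taken from the cluster.

```python
EXPECTED_ARGS = "$host $job_owner $job_id $job_name $queue"
EXPECTED = {
    "prolog": "/share/apps/workflow/bin/prolog " + EXPECTED_ARGS,
    "epilog": "/share/apps/workflow/bin/epilog " + EXPECTED_ARGS,
}


def parse_queue_config(text):
    """Parse 'attribute value' lines into a dict (assumed qconf -sq layout)."""
    # Join continuation lines that end with a backslash.
    joined = text.replace("\\\n", " ")
    config = {}
    for line in joined.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Collapse runs of whitespace so values compare cleanly.
            config[parts[0]] = " ".join(parts[1].split())
    return config


def check_workflow_hooks(config):
    """Return the names of hooks whose value differs from the expected one."""
    return [name for name, value in EXPECTED.items() if config.get(name) != value]


# Hypothetical sample text. In practice it would come from something like
# subprocess.run(["qconf", "-sq", "ergatis.q"], capture_output=True, text=True).stdout
SAMPLE = """qname                 ergatis.q
prolog                /share/apps/workflow/bin/prolog $host $job_owner \\
                      $job_id $job_name $queue
epilog                NONE
"""

config = parse_queue_config(SAMPLE)
bad_hooks = check_workflow_hooks(config)
print(bad_hooks)
```

In this made-up sample the prolog matches and the epilog does not, so only the epilog would be reported.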
Ergatis Application Install
   The main ergatis application install directory is under /share/apps/ergatis

   The chado-v1r12b1 release is the current version installed
        direct copy of the install located at /usr/local/devel/ANNOTATION/ard/ at JCVI
        Perl wrappers were modified via sed to the correct local directory structures
        A proper install wasn't done because no working installer script was available at the
         time

   /share/apps/ergatis/chado-v1r12b1
    symlinked to /share/apps/ergatis/current

   Executables which some ergatis components use, but are not installed with
    Ergatis (e.g.: JCVI internal scripts) are located under /share/apps/ergatis/bin

   External tools which are not globally installed on sos are installed under
    /share/apps/ergatis/external_apps

   Ergatis global directories (global_id_repository, global_saved_templates) are
    located under /share/apps/ergatis/ergatis_global
Ergatis Data Locations
 All ergatis data should be put under /home/thumper6/ergatis

 Project repositories are located under
  /home/thumper6/ergatis/projects
  or symlink /share/apps/ergatis/projects

 CAMERA project repository is
  /home/thumper6/ergatis/projects/camera

 Databases are located under /home/thumper6/ergatis/db
  or symlink /share/apps/ergatis/db

 Global scratch space is under /home/thumper6/ergatis/scratch
  or symlink /share/apps/ergatis/scratch
ikelite.rocksclusters.org

 Fewer machines than sos cluster (~20 slots?)
 Initial test ergatis install was done here
  (similar directory structure to sos)
 Completely distinct from sos cluster
 Sandbox
 Shibu, Weizhong Li and others run computes
  here (e.g.: clustering pipeline)
Pipelines
GOS/CAMERA Pipelines Overview



     Metagenomic Reads


  ncRNA/ORF Finding Pipeline

                               Incremental Clustering
        ORFs/peptides                Pipeline


      Annotation Pipeline      Cluster Memberships
Challenges
 All computes in pipeline must be performed on
  multi-sequence input/output files, as the filesystem
  cannot physically support 12M+ individual FASTA
  input files/output files
    other partitioning solutions could work(?) but most tools
     support multiple sequence inputs anyway

 Overall total space consumption was an issue when
  computes were running on TIGR grid, but this is not
  as much an issue (currently) on CALIT2 grid
    Solution here was to keep all inputs/outputs gzipped
     during pipeline execution, at the cost of some performance
     loss (using things like zcat -f | with NCBI BLAST, etc.)
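
For scripts inside the pipeline, the same "gzipped or not, multi-sequence" handling can be done without a shell pipe. The sketch below is illustrative only: the function names and the tiny demo file are hypothetical. It checks for the gzip magic number so that plain and compressed files are read the same way, similar in spirit to zcat -f.

```python
import gzip
import os
import tempfile


def open_maybe_gzipped(path):
    """Open a text file that may or may not be gzip-compressed."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fasta(path):
    """Yield (header, sequence) pairs from a multi-sequence FASTA file."""
    header, chunks = None, []
    with open_maybe_gzipped(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Demonstration on a small, made-up two-record file.
with tempfile.TemporaryDirectory() as tmpdir:
    gz_path = os.path.join(tmpdir, "reads.fsa.gz")
    with gzip.open(gz_path, "wt") as out:
        out.write(">read_1\nACGT\nACGT\n>read_2\nTTGA\n")
    records = list(read_fasta(gz_path))

print(records)
```

Because records are yielded one at a time, a file holding millions of sequences never has to be split into individual FASTA files or held in memory at once.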
GOS/CAMERA ncRNA and
  ORF Finding Pipeline
GOS/CAMERA ncRNA and ORF
     Finding Pipeline Overview
            Reads

        Find tRNAs           Extract tRNAs   tRNAs FASTA

     Soft-Mask tRNAs

        Find rRNAs           Extract rRNAs   rRNAs FASTA

     Soft-Mask rRNAs                          ORFs FASTA
                              Metagene
     GOS ORF calling                         Peptides FASTA

                                              ORFs FASTA
ORF stats     ORF overlaps
                                             Peptides FASTA
GOS/CAMERA
ncRNA and ORF Finding Pipeline
                    CAMERA-specific
                   Ergatis components
camera_extract_trna
CAMERA rRNA Finder Overview
 BLAST vs. a database of coded pooled rRNA
  subunit sequences
 BLAST prefilter step with loose parameters
    blastall -p blastn -i reads.fsa -d rrna_db.fsa -e 0.1 -F 'T' -b 1 -v 1
     -z 3000000000 -W 9
 Reads with prefilter hits are searched using strict
  parameters
    blastall -p blastn -i aligned.fsa -d rrna_db.fsa -e 1e-4 -F 'm L' -b
     1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T
 Collapse aligned intervals of the same rRNA type
  and extract the highest scoring alignments from
  each region
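
The collapsing step can be pictured with a short sketch. This is not the camera_rrna_finder code. It assumes each hit has already been reduced to (rRNA type, start, end, score) on a single read, that "same region" means overlapping intervals of the same type, and that one best-scoring hit is kept per region. All names and numbers are hypothetical.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "rrna_type start end score")


def collapse_hits(hits):
    """Merge overlapping hits of the same rRNA type into regions.

    Returns a list of (rrna_type, region_start, region_end, best_hit).
    """
    regions = []
    by_type = {}
    for hit in hits:
        by_type.setdefault(hit.rrna_type, []).append(hit)
    for rrna_type in sorted(by_type):
        current = None  # [start, end, best_hit] of the region being built
        for hit in sorted(by_type[rrna_type], key=lambda h: (h.start, h.end)):
            if current is not None and hit.start <= current[1]:
                # Overlaps the open region: extend it, keep the better hit.
                current[1] = max(current[1], hit.end)
                if hit.score > current[2].score:
                    current[2] = hit
            else:
                if current is not None:
                    regions.append((rrna_type, *current))
                current = [hit.start, hit.end, hit]
        if current is not None:
            regions.append((rrna_type, *current))
    return regions


# Made-up hits on one read.
hits = [
    Hit("23S", 55, 400, 310.0),
    Hit("23S", 350, 847, 520.0),
    Hit("16S", 10, 120, 95.0),
    Hit("23S", 900, 950, 40.0),
]
regions = collapse_hits(hits)
for region in regions:
    print(region[:3], region[3].score)
```

The first two 23S hits overlap and collapse into one region (55 to 847) represented by the higher-scoring alignment. The 16S hit is kept separate even though its coordinates are nearby, because only hits of the same type are collapsed.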
camera_filter_blast
camera_rrna_finder




Custom DB
rRNA Finder DB
  /usr/local/annotation/CAMERA/CustomDB/camera_rRNA_finder.all_rRNA.coded.cdhit_80.fsa



 5S
    Sequences from Archaea, Bacteria and Eukaryota were
     obtained from the 5S Ribosomal RNA Database
    http://biobases.ibch.poznan.pl/5SData/
 16S
    Sequences for Archaea and Bacteria were obtained from the
     Green Genes 16S db
    http://greengenes.lbl.gov/
 18S
    Source was Doug Rusch's 18S database prepared for the GOS
     paper
 23S
    Source was Doug Rusch's 23S database prepared for the GOS
     paper.
rRNA Finder DB

Fasta headers were coded as follows:

>#S [D] ...original.header...

where # is one of (5, 16, 18, 23) and D is one of
 (A, B, E). The camera_rrna_finder
 component expects this format.
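
A sketch of how a coded header could be decoded is shown below. It is not taken from camera_rrna_finder. The slide does not say whether the square brackets around D are literal, so the pattern accepts both forms, and the expansion of A/B/E to Archaea/Bacteria/Eukaryota is an assumption based on the source databases listed earlier. The example headers are made up.

```python
import re

# Accepts both '>16S [B] ...' and '>16S B ...', since the slide does not say
# whether the brackets are literal (an assumption of this sketch).
CODED_HEADER = re.compile(r"^>?(5|16|18|23)S\s+\[?([ABE])\]?\s+(.*)$")

# Assumed meaning of the one-letter domain codes.
DOMAINS = {"A": "Archaea", "B": "Bacteria", "E": "Eukaryota"}


def parse_coded_header(header):
    """Return (subunit, domain, original_header), or None if not coded."""
    match = CODED_HEADER.match(header.strip())
    if match is None:
        return None
    subunit, domain, original = match.groups()
    return subunit + "S", DOMAINS[domain], original


parsed = parse_coded_header(">16S [B] example_id some original description")
rejected = parse_coded_header(">28S [B] not a supported subunit")
print(parsed, rejected)
```

Returning None for anything that does not match makes it easy to spot database entries that were not coded before the finder is run.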
rRNA Finder DB
 CD-HIT was run on the entire database to cluster sequences with
  high similarity to reduce the database size but maintain a range
  of diverse sequences

Command line:
/usr/local/devel/ANNOTATION/bwhitty/cdhit/cd-hit/cd-hit-est -i
   input_database.fsa -o output_database.fsa -c 0.8 -n 4

 Consistency of clustering was checked with a Perl script to
  ensure no heterogeneous clustering
  (e.g.: 18S and 16S clustering together)
 Clusters were consistent
 Database size was reduced from 65,591 sequences to 1,329
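
The original consistency check was a Perl script that is not shown in these slides. A rough equivalent is sketched below, assuming the usual CD-HIT .clstr layout (a ">Cluster N" line followed by one member line per sequence) and assuming that the member names in that file start with the coded subunit (e.g. ">16S..."). The sample text is made up and deliberately contains one mixed cluster.

```python
import re

# Picks the coded subunit (5S, 16S, 18S, 23S) out of a member line.
MEMBER = re.compile(r">(\d+S)\b")


def find_mixed_clusters(clstr_text):
    """Return {cluster_name: [subunit types]} for clusters mixing types."""
    clusters = {}
    current = None
    for line in clstr_text.splitlines():
        if line.startswith(">Cluster"):
            current = line[1:].strip()
            clusters[current] = set()
        elif current is not None:
            match = MEMBER.search(line)
            if match:
                clusters[current].add(match.group(1))
    return {
        name: sorted(types)
        for name, types in clusters.items()
        if len(types) > 1
    }


# Made-up .clstr content: Cluster 1 is heterogeneous on purpose.
SAMPLE_CLSTR = (
    ">Cluster 0\n"
    "0\t1500nt, >16S... *\n"
    "1\t1480nt, >16S... at +/85.20%\n"
    ">Cluster 1\n"
    "0\t1800nt, >18S... *\n"
    "1\t1490nt, >16S... at +/81.00%\n"
)

mixed = find_mixed_clusters(SAMPLE_CLSTR)
print(mixed)
```

An empty result would correspond to the "clusters were consistent" outcome reported above.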
rRNA Finder
open_reading_frames
ORF Overlaps/ORF Stats
FASTA Headers
   >HOT_READ_85779353 /accession=DU765170.1 /sample_id=JGI_SMPL_HF4000_12-21-03
    /template_id=JGI_TMPL_ANIW12796 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=1088 /length=1088
   >JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 /read_id=HOT_READ_85760722
    /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234
    /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02
    /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=841 /length=841"
   >JCVI_PEP_1108836626525 /orf_id=JCVI_ORF_1108836626524 /read_id=HOT_READ_85760722
    /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234
    /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02
    /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=841 /length=841"
   >JCVI_NT_1108826205795 /read_id=HOT_READ_85801707 /begin=785 /end=858 /orientation=1
    /type=Asn_tRNA /ergatis_id=1108826197895 /defline="HOT_READ_85801707
    /accession=DU787412.1 /sample_id=JGI_SMPL_HF770_12-21-03
    /template_id=JGI_TMPL_APKH2110 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT
    /clr_range_begin=0 /clr_range_end=902 /length=902"
   >JCVI_NT_1108806998652 /read_id=HOT_READ_85760731 /begin=55 /end=847 /orientation=0
    /type=23S_rRNA /ergatis_id=1108826197895 /read_defline="/accession=DU750895.1
    /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1714
    /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=847
    /length=847"
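
These deflines are straightforward to parse: an identifier followed by /key=value fields, where a value may be double-quoted and may itself contain a nested defline. The sketch below is illustrative only. The function names are hypothetical, and the example header is the ORF header from this slide with its read_defline abridged.

```python
import re

# A field is /key=value, where value is either "quoted text" or one token.
FIELD = re.compile(r'/(\w+)=("[^"]*"|\S+)')


def parse_fields(text):
    """Return a dict of the /key=value fields found in text."""
    return {key: value.strip('"') for key, value in FIELD.findall(text)}


def parse_defline(defline):
    """Split a defline into (sequence_id, fields)."""
    seq_id, _, rest = defline.lstrip(">").partition(" ")
    return seq_id, parse_fields(rest)


ORF_HEADER = (
    ">JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 "
    "/read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 "
    "/5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 "
    '/read_defline="/accession=DU750886.1 /site_id=HAWAII_SITE_HOT /length=841"'
)

seq_id, fields = parse_defline(ORF_HEADER)
# The quoted read_defline can be parsed again to reach the read's own fields.
read_fields = parse_fields(fields["read_defline"])
print(seq_id, fields["length"], read_fields["length"])
```

All values come back as strings; converting coordinates such as begin and end to integers is left to the caller.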
The absence of called ORFs in this region of the read is due to the soft-masked rRNA sequence

RNAmmer didn’t identify the 23S sequence, though it is capable of finding 23S
Again, RNAmmer failed to identify rRNA sequence
These ORFs have >150 unmasked bases

BLAST-based approach does a pretty good job of finding correct boundaries
BLAST-based rRNA finding appears to outperform RNAmmer for 23S sequences, and some 16S
GOS (Incremental)
    Clustering Pipeline

http://camera.venterinstitute.org/wiki/display/V
Clustering Overview
[Diagram. Labels: All Public Proteins + GOS ORFs; Core Cluster (repeated); GOS; v1.2; Historical Artifacts (with respect to annotation); Longest Sequence Representatives; Non-Redundant 90% Identity CD-HIT Sequence Representatives]
CAMERA Polypeptide
Annotation Pipeline
Thoughts on Specifications
 Annotation rules should not be literally codified as
  Perl code (and only Perl code)!!!
  (especially when the “decision makers” never look at the code)


 What tools do we trust?
 What cutoffs do we use?
 What evidence/data types do we consider?


 These will (in some cases should) change over time
More Thoughts

 Specifications are easier to change than
  code, so code should be written to support
  change

 But unless they’re defined first, the
  specifications will be a moving target
(My) Design Objectives

 Must be able to add/remove annotation data
  sources as the annotation SOP changes
 Must be able to easily change the ways in
  which these annotation data types are
  applied/combined to produce final annotation
 Must be able to change/expand the types of
  final annotation data we are producing
Object-Oriented Design Approach

 OOP in Perl == ☹*, but lesser of two evils
    (don’t ask me what the other evil is, but it must be pretty evil)



 Encapsulates possible sources of change and prevents
    them from affecting downstream components
    (like HACCP)
 Polymorphism of $parser->parse($infile) producing
  annotation objects is nice
 Re-use was not really a motive here


*Damian Conway, in his OOP Perl book, says that using OOP in Perl yields a 5X performance hit
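
The polymorphism point can be illustrated with a small sketch. The pipeline itself is Perl; the example below is written in Python purely for illustration. Both input formats, both parser classes and all sample values are hypothetical. Each parser hides its own input format behind the same parse() call and hands back the same kind of annotation object, so downstream code does not need to know which tool produced the data.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationData:
    polypeptide_id: str
    type: str
    attributes: dict = field(default_factory=dict)


class Parser:
    """Common interface: parse(lines) yields AnnotationData objects."""

    def parse(self, lines):
        raise NotImplementedError


class TabHitParser(Parser):
    """Hypothetical parser for 'polypeptide<TAB>EC' lines."""

    annotation_type = "PRIAM"

    def parse(self, lines):
        for line in lines:
            pep_id, ec = line.rstrip("\n").split("\t")
            yield AnnotationData(pep_id, self.annotation_type, {"EC": [ec]})


class KeyValueParser(Parser):
    """Hypothetical parser for 'polypeptide key=value key=value' lines."""

    annotation_type = "PandaBLASTP::Characterized"

    def parse(self, lines):
        for line in lines:
            pep_id, *pairs = line.split()
            attributes = {}
            for pair in pairs:
                key, _, value = pair.partition("=")
                attributes.setdefault(key, []).append(value)
            yield AnnotationData(pep_id, self.annotation_type, attributes)


def collect(parsers_and_inputs):
    """Downstream code only ever calls parser.parse(...)."""
    results = []
    for parser, lines in parsers_and_inputs:
        results.extend(parser.parse(lines))
    return results


results = collect([
    (TabHitParser(), ["PEP_1\t1.1.1.1\n"]),
    (KeyValueParser(), ["PEP_1 gene_symbol=adhA\n"]),
])
print(results)
```

Adding a new evidence source then means adding one parser class, with no change to collect() or to anything that consumes its output.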
Annotation Pipeline Overview
            Annotation Tool(s)
         Annotation Source Data
                 Parser(s)
         Annotation Data Object(s)
             Annotation Rules
           Final Annotation Data

We can make changes to the annotation rules without necessarily having to re-run or re-parse the data
Design Objectives for Parsers
A parser must:
 Produce polypeptides with associated AnnotationData objects of a defined type
 Produce AnnotationData objects with attributes specified in a consistent way
        E.g.: All parsers should produce EC number attributes in the same four-field form, anywhere from
         ‘1.1.1.1’ down to ‘1.-.-.-’, and never a truncated form such as ‘1.-’. Multiple values should be split. Any clean-up or
         verification should be done before the AnnotationData object is created; if the data is
         invalid, the attribute should not be populated, or the object should not be created.
   Produce annotation data objects that are independent of the source annotation
    data they were parsed from
        e.g.: They have already been canonized as a type of ‘trusted annotation evidence
         type’ when they are created as AnnotationData objects. These trusted types are
         defined in the annotation SOP.

   These features create a separation between how trusted evidence is defined
    (input data), and how the evidence is used to produce annotation (annotation
    rules)
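
As an illustration of the "clean up before the object is created" requirement, here is a minimal sketch of EC number normalization. It is not the pipeline's code. It assumes that valid input is one to four dot-separated fields, each a number or '-', that short forms are padded out to four fields, and that no number may follow a '-'. The function names are hypothetical.

```python
import re


def normalize_ec(raw):
    """Return a four-field EC number such as '1.-.-.-', or None if invalid."""
    fields = raw.strip().split(".")
    if not 1 <= len(fields) <= 4:
        return None
    if not all(re.fullmatch(r"\d+|-", f) for f in fields):
        return None
    # Pad truncated forms such as '1.-' out to four fields.
    fields += ["-"] * (4 - len(fields))
    # The class (first field) must be a number.
    if fields[0] == "-":
        return None
    # Once a '-' appears, only '-' may follow.
    seen_dash = False
    for value in fields:
        if value == "-":
            seen_dash = True
        elif seen_dash:
            return None
    return ".".join(fields)


def split_ec_values(raw):
    """Split a multi-valued string; keep valid, normalized, unique values."""
    values = []
    for piece in re.split(r"[,;\s]+", raw):
        if piece:
            ec = normalize_ec(piece)
            if ec is not None and ec not in values:
                values.append(ec)
    return values


print(normalize_ec("1.-"))
print(split_ec_values("1.1.1.1, 2.7.-; bogus"))
```

A parser following this approach would simply leave the EC attribute unset when split_ec_values() returns an empty list.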
AnnotationData Objects
              AnnotationData


    AnnotationData::Polypeptide
                                        Polypeptide
type:
          [some string]
attributes:                       AnnotationData Object(s)
          common_name
          gene_symbol
          EC
          GO
          TIGR_role
          …
AnnotationRules

 AnnotationRules object implements the rules
 from the annotation SOP document

 AnnotationRules::PredictedProtein takes a
 Polypeptide object with associated
 AnnotationData objects of varying type and
 applies the annotation rules to create a final
 AnnotationData object
AnnotationRules
 Rules are encoded as an array in the following
  format:
ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2

 Where OPERATOR is one of:
   = for assign attribute (if unassigned)
   + for append attribute
   - for overwrite attribute

 Additional operators can be defined, because operators are applied
  through a hash of handler subroutines
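
The rule format and the handler table can be sketched as follows. The real implementation is Perl (AnnotationRules); this Python version is only an illustration of the idea, with hypothetical function names and made-up evidence values. It assumes that rules are applied in list order and that '=' means "first rule to supply a value wins".

```python
def assign(final, attr, values):
    """'=': set the attribute only if it has not been assigned yet."""
    if attr not in final:
        final[attr] = list(values)


def append(final, attr, values):
    """'+': add values to whatever is already assigned."""
    existing = final.setdefault(attr, [])
    for value in values:
        if value not in existing:
            existing.append(value)


def overwrite(final, attr, values):
    """'-': replace any existing assignment."""
    final[attr] = list(values)


# Operator symbol -> handler; new operators are added by extending this table.
HANDLERS = {"=": assign, "+": append, "-": overwrite}


def apply_rules(rules, evidence):
    """Apply rules in order.

    evidence maps an annotation type to a dict of attribute -> list of values.
    """
    final = {}
    for rule in rules:
        annotation_type, operator, attributes = rule.split("|")
        data = evidence.get(annotation_type)
        if data is None:
            continue
        for attr in attributes.split():
            if data.get(attr):
                HANDLERS[operator](final, attr, data[attr])
    return final


RULES = [
    "TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role",
    "PRIAM|=|GO EC",
    "PandaBLASTP::Characterized|=|common_name gene_symbol",
]

# Made-up evidence for one polypeptide.
evidence = {
    "PRIAM": {"EC": ["1.1.1.1"], "GO": ["GO:0004022"]},
    "PandaBLASTP::Characterized": {"common_name": ["alcohol dehydrogenase"]},
}
final = apply_rules(RULES, evidence)
print(final)
```

Because the rules are plain strings and the operators live in a lookup table, changing the annotation SOP means editing the rule list rather than the control flow.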
AnnotationRules::PredictedProtein
    my @annotation_order = (
           ## equivalog level tigrfam hits
           'TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
           'TIGRFAM::FullLength::Exception|=|common_name gene_symbol GO EC TIGR_role',
           'TIGRFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',

            'TIGRFAM::FRAG::Equivalog|=|GO',
            'TIGRFAM::FRAG::Exception|=|GO',
            'TIGRFAM::FRAG::HypotheticalEquivalog|=|GO',
            'TIGRFAM::FullLength::Domain|=|GO',
            'PandaBLASTP::Characterized|=|GO',

            'PRIAM|=|GO EC',

            ## equivalog level hits vs tigrfam frag
            'TIGRFAM::FRAG::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FRAG::Exception|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FRAG::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',

            ## characterized high confidence blast hit
            'PandaBLASTP::Characterized|=|common_name gene_symbol',

            ## pfam and non-equivalog tigrfams
            'PFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
            'PFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::Subfamily|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::Superfamily|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::EquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::HypotheticalEquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::SubfamilyDomain|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::Domain|=|common_name gene_symbol GO EC TIGR_role',
            …
CAMERA Annotation Pipeline




       CAMERA-specific
      Ergatis components
camera_annotation_parser
camera_annotation_rules
camera_annotation_rules
CAMERA-specific Code in SVN

 http://iwebsvn.tigr.org/listing.php?repname=ANNO
Future Development
                                     (My 2 cents)



   Pipeline development must be driven by annotation SOP development
    work
      Feedback on pipeline bugs must be vigilantly kept separate from feedback
       on annotation SOP bugs
      First discuss and update the SOP, then modify the code
   Cluster summary annotation
      Shortest path here seems to be a combination of GO Slim and EC
       assignments? GO consortium makes some scripts available for
       summarizing sets of GO assignments
      If using the current code, PolypeptideSet container class exists already.
       Cluster members can be added to a PolypeptideSet and that can be used
       as input to an AnnotationRules::FinalCluster object that is similar to the one
       for PredictedProtein, but with a different set of handler routines.
   Incremental clustering pipeline
        Good luck ☺

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 

Último (20)

WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 

CAMERA metagenomic annotation pipeline

  • 1. CAMERA Annotation Pipelines (and related infrastructure) Brett Whitty 12/20/2007
  • 2. Overview
      - Compute Infrastructure
      - GOS/CAMERA ncRNA/ORF calling pipeline
          - rRNA finding pipeline
          - ORF calling
      - GOS (incremental) protein clustering
      - CAMERA Annotation Pipeline
          - Specifications
          - Implementation
  • 4. CALIT2 Compute Grid
      - 48 dual-core, dual-CPU, 64-bit machines
          - 192 SGE slots
      - Red Hat-based ‘Rocks Clusters’ Linux distribution (see http://rocksclusters.org)
      - ‘Rocks Rolls’
          - Bio-roll (/opt/Bio)
          - Used to image/install each node separately, including local Perl module installs (patches)
  • 5. sos.camera.calit2.net
      - Head node of the sos cluster
          - SSH into here
      - Is not an SGE submit host
  • 6. SOS Cluster Global Mounts
      - /share/apps
          - applications (and related files) are installed here; analysis data should not be stored here
      - /home/thumper6
          - a global mount point --- 18T(!!!) storage volume on which all analysis data/results should be stored
      - /opt/Bio
          - tools such as clustalw, EMBOSS, hmmer, and NCBI BLAST are installed under here
  • 7. SOS Local Mounts (on each grid node)
      - /state/partition1
          - local storage device on each grid node, available for local scratch space (438G)
      - /tmp
          - system tmp partition (7G)
  • 8. pg0-0.camera.calit2.net
      - SSH accessible only through the head node
      - Is an SGE submit host
      - Running Apache and Postgres servers
  • 9. pg0-0.camera.calit2.net
      - http://web1.camera.calit2.net/ergatis/
      - /var/www/cgi-bin/ergatis
          - /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/htdocs ergatis
      - /var/www/html/ergatis
          - /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/cgi-bin ergatis
  • 10. pg0-0.camera.calit2.net
      - CGI scripts run as the user 'apache' on pg0-0, but 'apache' has sudo permissions for user 'ergatis'
      - The two CGI scripts in the install which run RunWorkflow and KillWorkflow (ergatis/kill_wf.cgi, ergatis/Ergatis/Pipeline.pm) have been modified, and 'sudo -u ergatis ' has been prepended to their normal execution strings
      - IdGenerator.pm has been modified to use JCVIIdGenerator.pm
      - Many of the settings in ergatis.ini have been changed from their defaults, including disabling a number of the components
      - When updating the Ergatis CGI directory from the SVN repository, a backup copy should be set aside in advance
  • 11. SGE/Workflow Notes
      - Two SGE queues have been configured for ergatis:
          - ergatis.q (192 slots)
          - ergatis-fast.q (144 slots)
      - ergatis.q is a subordinate queue of ergatis-fast.q
      - ergatis.q is set as the default queue for user ‘ergatis’ by specifying ‘-q ergatis.q’ in /home/ergatis/.sge_request
      - Workflow version 3.0 is installed
          - /share/apps/workflow
      - Workflow requires that the SGE queue's prolog and epilog scripts be set to the following:
          - prolog=/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue
          - epilog=/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue
      - The queue configuration can be checked using the command 'qconf -sq ergatis.q'
  • 12. Ergatis Application Install
      - The main ergatis application install directory is under /share/apps/ergatis
      - The chado-v1r12b1 release is the current version installed
          - direct copy of the install located at /usr/local/devel/ANNOTATION/ard/ at JCVI
          - Perl wrappers were modified via sed to match the local directory structure
          - A proper install wasn't done because no working installer script was available at the time
          - /share/apps/ergatis/chado-v1r12b1 symlinked to /share/apps/ergatis/current
      - Executables which some ergatis components use, but which are not installed with Ergatis (e.g.: JCVI internal scripts), are located under /share/apps/ergatis/bin
      - External tools which are not globally installed on sos are installed under /share/apps/ergatis/external_apps
      - Ergatis global directories (global_id_repository, global_saved_templates) are located under /share/apps/ergatis/ergatis_global
  • 13. Ergatis Data Locations
      - All ergatis data should be put under /home/thumper6/ergatis
      - Project repositories are located under /home/thumper6/ergatis/projects or symlink /share/apps/ergatis/projects
          - CAMERA project repository is /home/thumper6/ergatis/projects/camera
      - Databases are located under /home/thumper6/ergatis/db or symlink /share/apps/ergatis/db
      - Global scratch space is under /home/thumper6/ergatis/scratch or symlink /share/apps/ergatis/scratch
  • 14. ikelite.rocksclusters.org
      - Fewer machines than the sos cluster (~20 slots?)
      - Initial test ergatis install was done here (similar directory structure to sos)
      - Completely distinct from the sos cluster
      - Sandbox
      - Shibu, Weizhong Li and others run computes here (e.g.: clustering pipeline)
  • 16. GOS/CAMERA Pipelines Overview [diagram; labels: Metagenomic Reads, ncRNA/ORF Finding Pipeline, ORFs/peptides, Incremental Clustering Pipeline, Annotation Pipeline, Cluster Memberships]
  • 17. Challenges
      - All computes in the pipeline must be performed on multi-sequence input/output files, as the filesystem cannot physically support 12M+ individual FASTA input/output files
          - other partitioning solutions could work(?) but most tools support multiple-sequence inputs anyway
      - Overall total space consumption was an issue when computes were running on the TIGR grid, but this is not as much of an issue (currently) on the CALIT2 grid
          - The solution here was to keep all inputs/outputs gzipped during pipeline execution, at the cost of some performance loss (using things like zcat -f | with NCBI BLAST, etc.)
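The gzipped multi-sequence approach on this slide can be illustrated with a minimal sketch, written in Python rather than the pipeline's Perl. It streams records from one multi-FASTA file that may or may not be gzipped, loosely mirroring `zcat -f`. The function names `open_maybe_gzip` and `read_fasta` are hypothetical and not part of Ergatis.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzipped text file transparently (similar in spirit to `zcat -f`)."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open multi-FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)
```

Because records are yielded one at a time, a file holding millions of sequences never has to be unpacked to disk or held in memory at once.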
  • 18. GOS/CAMERA ncRNA and ORF Finding Pipeline
  • 19. GOS/CAMERA ncRNA and ORF Finding Pipeline Overview [flow diagram; labels: Reads, Find tRNAs, Extract tRNAs, tRNAs FASTA, Soft-Mask tRNAs, Find rRNAs, Extract rRNAs, rRNAs FASTA, Soft-Mask rRNAs, Metagene (ORFs FASTA, Peptides FASTA), GOS ORF calling (ORFs FASTA, Peptides FASTA), ORF stats, ORF overlaps]
  • 20. GOS/CAMERA ncRNA and ORF Finding Pipeline CAMERA-specific Ergatis components
  • 22. CAMERA rRNA Finder Overview
      - BLAST vs. a database of coded, pooled rRNA subunit sequences
      - BLAST prefilter step with loose parameters
          - blastall -p blastn -i reads.fsa -d rrna_db.fsa -e 0.1 -F 'T' -b 1 -v 1 -z 3000000000 -W 9
      - Reads with prefilter hits are searched using strict parameters
          - blastall -p blastn -i aligned.fsa -d rrna_db.fsa -e 1e-4 -F 'm L' -b 1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T
      - Collapse aligned intervals of the same rRNA type and extract the highest scoring alignments from each region
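The final step on this slide (collapsing aligned intervals of the same rRNA type and keeping the best alignment per region) can be sketched as follows. This is an illustration in Python, not the camera_rrna_finder code. The tuple layout `(rrna_type, begin, end, score)` and the rule that touching or overlapping hits belong to one region are assumptions made for the example.

```python
from collections import defaultdict


def collapse_hits(hits):
    """Merge overlapping hits of the same rRNA type on one read.

    hits: iterable of (rrna_type, begin, end, score) with begin < end.
    Returns a list of (rrna_type, region_begin, region_end, best_hit),
    sorted by region start, where best_hit is the highest scoring hit
    that fell inside the merged region.
    """
    by_type = defaultdict(list)
    for hit in hits:
        by_type[hit[0]].append(hit)

    regions = []
    for rrna_type, type_hits in by_type.items():
        type_hits.sort(key=lambda h: (h[1], h[2]))
        cur_begin = cur_end = best = None
        for hit in type_hits:
            _, begin, end, score = hit
            if best is None or begin > cur_end:
                # Gap before this hit: close the previous region, start a new one.
                if best is not None:
                    regions.append((rrna_type, cur_begin, cur_end, best))
                cur_begin, cur_end, best = begin, end, hit
            else:
                # Overlaps the open region: extend it and keep the better hit.
                cur_end = max(cur_end, end)
                if score > best[3]:
                    best = hit
        if best is not None:
            regions.append((rrna_type, cur_begin, cur_end, best))

    regions.sort(key=lambda r: (r[1], r[2], r[0]))
    return regions
```

Hits of different rRNA types are never merged with each other, which matches the "same rRNA type" wording on the slide.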
  • 25. rRNA Finder DB
      - /usr/local/annotation/CAMERA/CustomDB/camera_rRNA_finder.all_rRNA.coded.cdhit_80.fsa
      - 5S
          - Sequences from Archaea, Bacteria and Eukaryota were obtained from the 5S Ribosomal RNA Database
          - http://biobases.ibch.poznan.pl/5SData/
      - 16S
          - Sequences for Archaea and Bacteria were obtained from the Green Genes 16S db
          - http://greengenes.lbl.gov/
      - 18S
          - Source was Doug Rusch's 18S database prepared for the GOS paper
      - 23S
          - Source was Doug Rusch's 23S database prepared for the GOS paper
  • 26. rRNA Finder DB
      - FASTA headers were coded as follows:
          - >#S [D] ...original.header...
          - where # is one of (5, 16, 18, 23) and D is one of (A, B, E)
      - The camera_rrna_finder component expects this format.
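As an illustration of the coded header format described on this slide, here is a small Python sketch that validates and splits such a header. The mapping of A, B and E to Archaea, Bacteria and Eukaryota is an assumption inferred from the domains named on the previous slide, and `parse_coded_header` is a hypothetical helper, not part of the camera_rrna_finder component.

```python
import re

# ">#S [D] original header", with # in (5, 16, 18, 23) and D in (A, B, E).
CODED_HEADER = re.compile(r"^>?(5|16|18|23)S\s+\[([ABE])\]\s+(.*)$")

# Assumed meaning of the domain codes (not stated explicitly in the slides).
DOMAINS = {"A": "Archaea", "B": "Bacteria", "E": "Eukaryota"}


def parse_coded_header(header):
    """Split a coded rRNA database header into subunit, domain and original header."""
    match = CODED_HEADER.match(header.strip())
    if match is None:
        raise ValueError("not a coded rRNA header: %r" % header)
    return {
        "subunit": match.group(1) + "S",
        "domain": DOMAINS[match.group(2)],
        "original": match.group(3),
    }
```

Rejecting malformed headers up front keeps a database build from silently including sequences whose type cannot be recovered later.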
  • 27. rRNA Finder DB
      - CD-HIT was run on the entire database to cluster sequences with high similarity, reducing the database size while maintaining a range of diverse sequences
          - Command line: /usr/local/devel/ANNOTATION/bwhitty/cdhit/cd-hit/cd-hit-est -i input_database.fsa -o output_database.fsa -c 0.8 -n 4
      - Consistency of clustering was checked with a Perl script to ensure no heterogeneous clustering (e.g.: 18S and 16S clustering together)
          - Clusters were consistent
      - Database size was reduced from 65,591 sequences to 1,329
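The consistency check mentioned above was done with a Perl script that is not shown in the slides. A rough Python equivalent might look like the sketch below. It assumes CD-HIT's usual `.clstr` layout (a `>Cluster N` line followed by member lines containing `>name...`) and assumes each member name begins with the coded subunit type such as `16S`. Both are assumptions for illustration.

```python
import re

# Member names are assumed to start with the coded subunit type, e.g. ">16S...".
MEMBER_TYPE = re.compile(r">((?:5|16|18|23)S)")


def heterogeneous_clusters(clstr_lines):
    """Return {cluster_name: sorted rRNA types} for clusters that mix subunit types."""
    types_by_cluster = {}
    current = None
    for line in clstr_lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]
            types_by_cluster[current] = set()
        elif line and current is not None:
            match = MEMBER_TYPE.search(line)
            if match:
                types_by_cluster[current].add(match.group(1))
    return {
        name: sorted(types)
        for name, types in types_by_cluster.items()
        if len(types) > 1
    }
```

An empty result corresponds to the "Clusters were consistent" outcome reported on the slide.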
  • 31. FASTA Headers
      - >HOT_READ_85779353 /accession=DU765170.1 /sample_id=JGI_SMPL_HF4000_12-21-03 /template_id=JGI_TMPL_ANIW12796 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=1088 /length=1088
      - >JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 /read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=841 /length=841"
      - >JCVI_PEP_1108836626525 /orf_id=JCVI_ORF_1108836626524 /read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=841 /length=841"
      - >JCVI_NT_1108826205795 /read_id=HOT_READ_85801707 /begin=785 /end=858 /orientation=1 /type=Asn_tRNA /ergatis_id=1108826197895 /defline="HOT_READ_85801707 /accession=DU787412.1 /sample_id=JGI_SMPL_HF770_12-21-03 /template_id=JGI_TMPL_APKH2110 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=902 /length=902"
      - >JCVI_NT_1108806998652 /read_id=HOT_READ_85760731 /begin=55 /end=847 /orientation=0 /type=23S_rRNA /ergatis_id=1108826197895 /read_defline="/accession=DU750895.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1714 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=847 /length=847"
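The headers above follow a `/key=value` convention, with the source read's defline nested inside a double-quoted value. A minimal Python sketch for splitting such a header is shown below. `parse_defline` is a hypothetical helper, and the handling of quoted values is inferred only from the examples on this slide.

```python
import re

# A value is either a double-quoted string (may contain spaces and nested
# /key=value pairs) or a run of non-whitespace characters.
ATTRIBUTE = re.compile(r'/(\w+)=("[^"]*"|\S*)')


def parse_defline(defline):
    """Return (sequence_id, {key: value}) for a '/key=value' style FASTA header."""
    defline = defline.strip().lstrip(">")
    seq_id, _, rest = defline.partition(" ")
    attributes = {}
    for key, value in ATTRIBUTE.findall(rest):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # strip the surrounding quotes
        attributes[key] = value
    return seq_id, attributes
```

Because a quoted value is consumed as one token, the nested `/length=841` of the read does not overwrite the ORF's own `/length=234`.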
  • 32. The absence of called ORFs in this region of the read is due to the soft-masked rRNA sequence. RNAmmer didn’t identify the 23S sequence, though it is capable of finding 23S.
  • 33. Again, RNAmmer failed to identify rRNA sequence
  • 34. These ORFs have >150 unmasked bases. The BLAST-based approach does a pretty good job of finding correct boundaries.
  • 35. BLAST-based rRNA finding appears to outperform RNAmmer for 23S sequences, and some 16S
  • 36. GOS (Incremental) Clustering Pipeline http://camera.venterinstitute.org/wiki/display/V
  • 37. Clustering Overview [diagram; recoverable labels: All Public Proteins + GOS ORFs, Core Cluster (repeated), GOS v1.2, Non-Redundant Sequence Representatives, 90% Identity CD-HIT, Longest Sequence Representatives, Historical Artifacts (with respect to annotation)]
  • 39. Thoughts on Specifications
      - Annotation rules should not be literally codified as Perl code (and only Perl code)!!! (especially when the “decision makers” never look at the code)
          - What tools do we trust?
          - What cutoffs do we use?
          - What evidence/data types do we consider?
      - These will (in some cases should) change over time
  • 40. More Thoughts
      - Specifications are easier to change than code, so code should be written to support change
      - But unless they’re defined first, the specifications will be a moving target
  • 41. (My) Design Objectives
      - Must be able to add/remove annotation data sources as the annotation SOP changes
      - Must be able to easily change the ways in which these annotation data types are applied/combined to produce final annotation
      - Must be able to change/expand the types of final annotation data we are producing
  • 42. Object-Oriented Design Approach
      - OOP in Perl == *, but lesser of two evils (don’t ask me what the other evil is, but it must be pretty evil)
      - Encapsulates possible sources of change and prevents them from affecting downstream components (like HACCP)
      - Polymorphism of $parser->parse($infile) producing annotation objects is nice
      - Re-use was not really a motive here
      - *Damian Conway in his OOP Perl book says using OOP in Perl yields a 5X performance hit
  • 43. Annotation Pipeline Overview [flow diagram: Annotation Tool(s) → Annotation Source Data → Parser(s) → Annotation Data Object(s) → Annotation Rules → Final Annotation Data]
      - We can make changes to the annotation rules without necessarily having to re-run or re-parse the data
  • 44. Design Objectives for Parsers
      - A parser must:
          - Produce polypeptides with associated AnnotationData objects of a defined type
          - Produce AnnotationData objects with attributes specified in a consistent way
              - E.g.: All parsers should produce EC number attributes that look like ‘1.1.1.1’ -> ‘1.-.-.-’, not sometimes ‘1.-’. Multiple values should be split. Any clean-up or verification should be done before the AnnotationData object is created; if the data is invalid, the attribute should not be populated, or the object should not be created.
          - Produce annotation data objects that are independent of the source annotation data they were parsed from
              - E.g.: They have already been canonized as a type of ‘trusted annotation evidence type’ when they are created as AnnotationData objects. These trusted types are defined in the annotation SOP.
      - These features create a separation between how trusted evidence is defined (input data), and how the evidence is used to produce annotation (annotation rules)
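The EC number requirement above (always four fields, multiple values split, invalid data dropped before an AnnotationData object is built) could be met with a clean-up step like the Python sketch below. `normalize_ec` is a hypothetical helper, and the accepted separators and the optional `EC` prefix are assumptions rather than part of the annotation SOP.

```python
import re


def normalize_ec(raw):
    """Return a list of four-field EC numbers found in raw; invalid entries are dropped.

    '1.-' becomes '1.-.-.-'; '1.1.1.1, 2.7.-' becomes ['1.1.1.1', '2.7.-.-'].
    """
    results = []
    for token in re.split(r"[\s,;]+", raw.strip()):
        # Tolerate an optional "EC" or "EC:" prefix (assumed input variation).
        token = re.sub(r"^EC[:\s]*", "", token, flags=re.IGNORECASE)
        if not token:
            continue
        fields = token.split(".")
        if len(fields) > 4 or not all(f == "-" or f.isdigit() for f in fields):
            continue
        fields += ["-"] * (4 - len(fields))  # pad short forms such as '1.-'
        if fields[0] == "-":
            continue
        # Once a field is unspecified, every later field must be unspecified too.
        first_dash = fields.index("-") if "-" in fields else 4
        if any(f != "-" for f in fields[first_dash:]):
            continue
        results.append(".".join(fields))
    return results
```

An empty list signals that the EC attribute should not be populated at all, which follows the rule that invalid data never reaches the AnnotationData object.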
  • 45. AnnotationData Objects [diagram: an AnnotationData::Polypeptide (Polypeptide) holds one or more AnnotationData object(s); each AnnotationData has type: [some string] and attributes: common_name, gene_symbol, EC, GO, TIGR_role, …]
  • 46. AnnotationRules
      - The AnnotationRules object implements the rules from the annotation SOP document
      - AnnotationRules::PredictedProtein takes a Polypeptide object with associated AnnotationData objects of varying type and applies the annotation rules to create a final AnnotationData object
  • 47. AnnotationRules
      - Rules are encoded as an array in the following format:
          - ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2
      - Where OPERATOR is one of:
          - = for assign attribute (if unassigned)
          - + for append attribute
          - - for overwrite attribute
      - Any operator can be defined, as operators are applied through a hash of handler subroutines
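The rule format and operator handlers described above can be sketched in Python (the real implementation is Perl, using a hash of handler subroutines). The evidence layout `{annotation_type: {attribute: [values]}}` and all function names here are assumptions made for the example; only the `TYPE|OPERATOR|ATTRIBUTES` format and the three operators come from the slide.

```python
def assign_if_unset(final, attribute, values):
    """'=' : assign the attribute only if it has no value yet."""
    if not final.get(attribute):
        final[attribute] = list(values)


def append_values(final, attribute, values):
    """'+' : append to whatever is already assigned."""
    final.setdefault(attribute, []).extend(values)


def overwrite(final, attribute, values):
    """'-' : replace any existing value."""
    final[attribute] = list(values)


# Dispatch table standing in for the Perl hash of handler subroutines.
HANDLERS = {"=": assign_if_unset, "+": append_values, "-": overwrite}


def parse_rule(rule):
    """Split 'ANNOTATION_TYPE|OPERATOR|ATTR1 ATTR2' into its three parts."""
    annotation_type, operator, attribute_field = rule.split("|")
    return annotation_type, operator, attribute_field.split()


def apply_rules(rules, evidence):
    """Apply rules in order to evidence of the form {type: {attribute: [values]}}."""
    final = {}
    for rule in rules:
        annotation_type, operator, attributes = parse_rule(rule)
        source = evidence.get(annotation_type)
        if source is None:
            continue  # no evidence of this type for the polypeptide
        handler = HANDLERS[operator]
        for attribute in attributes:
            if source.get(attribute):
                handler(final, attribute, source[attribute])
    return final
```

With `=` rules, the order of the array sets the priority: the first evidence type that supplies an attribute wins. Changing the annotation policy then means reordering or editing rule strings rather than rewriting code.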
  • 48. AnnotationRules::PredictedProtein
        my @annotation_order = (
            ## equivalog level tigrfam hits
            'TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::Exception|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FRAG::Equivalog|=|GO',
            'TIGRFAM::FRAG::Exception|=|GO',
            'TIGRFAM::FRAG::HypotheticalEquivalog|=|GO',
            'TIGRFAM::FullLength::Domain|=|GO',
            'PandaBLASTP::Characterized|=|GO',
            'PRIAM|=|GO EC',

            ## equivalog level hits vs tigrfam frag
            'TIGRFAM::FRAG::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FRAG::Exception|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FRAG::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',

            ## characterized high confidence blast hit
            'PandaBLASTP::Characterized|=|common_name gene_symbol',

            ## pfam and non-equivalog tigrfams
            'PFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
            'PFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::Subfamily|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::Superfamily|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::EquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::HypotheticalEquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::SubfamilyDomain|=|common_name gene_symbol GO EC TIGR_role',
            'TIGRFAM::FullLength::Domain|=|common_name gene_symbol GO EC TIGR_role',
            …
  • 49. CAMERA Annotation Pipeline CAMERA-specific Ergatis components
  • 53. CAMERA-specific Code in SVN
      - http://iwebsvn.tigr.org/listing.php?repname=ANNO
  • 54. Future Development (My 2 cents)
      - Pipeline development must be driven by annotation SOP development work
          - Feedback on pipeline bugs must be vigilantly kept separate from feedback on annotation SOP bugs
          - First discuss and update the SOP, then modify the code
      - Cluster summary annotation
          - Shortest path here seems to be a combination of GO Slim and EC assignments? The GO consortium makes some scripts available for summarizing sets of GO assignments
          - If using the current code, a PolypeptideSet container class exists already. Cluster members can be added to a PolypeptideSet, and that can be used as input to an AnnotationRules::FinalCluster object that is similar to the one for PredictedProtein, but with a different set of handler routines.
      - Incremental clustering pipeline
          - Good luck :)