Fields bosc2010 bio_perl

1. BioPerl Update 2010: Towards a Modern BioPerl Chris Fields (UIUC) BOSC 7-10-10

2. Present Day BioPerl ✤ Addressing new bioinformatics problems ✤ Collaborations in Open Bioinformatics Foundation ✤ Google Summer of Code

3. Towards a Modern BioPerl ✤ Lowering the barrier for new users to become involved ✤ Using Modern Perl language features ✤ Dealing with the BioPerl monolith

4. BioPerl 2.0? ✤ BioPerl and Modern Perl OOP (Moose) ✤ BioPerl and Perl 6

5. Background ✤ Started in 1996, many contributors over the years ✤ Jason Stajich (UCR) ✤ Ian Korf (Wash U) ✤ Hilmar Lapp (NESCent) ✤ Chris Mungall (NCBO) ✤ Heikki Lehväslaiho (KAUST) ✤ Brian Osborne (BioTeam) ✤ Georg Fuellen (Bielefeld) ✤ Steve Trutane (Stanford) ✤ Ewan Birney (Sanger, EBI) ✤ Sendu Bala (Sanger) ✤ Aaron Mackey (Univ. Virginia) ✤ Dave Messina (Sonnhammer Lab) ✤ Chris Dagdigian (BioTeam) ✤ Mark Jensen (TCGA) ✤ Steven Brenner (UC-Berkeley) ✤ Rob Buels (SGN) ✤ Lincoln Stein (OICR, CSHL) ✤ Many, many more!

6. Background ✤ Open source: ‘Released under the same license as Perl itself’ i.e. Artistic ✤ http://bioperl.org ✤ Core developers - make releases, drive the project, set vision ✤ Regular contributors - have direct commit access

7. BioPerl Distributions ✤ BioPerl Core - the main distribution (aka ‘bioperl-live’ if using dev version) ✤ BioPerl-Run - Perl ‘wrappers’ for common bioinformatics tools ✤ BioPerl-DB - BioSQL ORM to BioPerl classes

8. Biological Sequences ✤ Bio::Seq - sequence record class

    #!/bin/perl -w
    use Modern::Perl;
    use Bio::Seq;

    my $seq_obj = Bio::Seq->new(-seq        => "aaaatgggggggggggccccgtt",
                                -display_id => "ABC12345",
                                -desc       => "example 1",
                                -alphabet   => "dna");

    say $seq_obj->display_id;   # ABC12345
    say $seq_obj->desc;         # example 1
    say $seq_obj->seq;          # aaaatgggggggggggccccgtt

    my $revcom = $seq_obj->revcom;  # new Bio::Seq, but revcom
    say $revcom->seq;           # aacggggcccccccccccatttt
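As a rough illustration of what the revcom call on this slide computes, here is a minimal core-Perl sketch that reverse-complements a plain string. It is not BioPerl's implementation. The helper name revcom_string is made up for this example, and it handles only unambiguous bases.

```perl
use strict;
use warnings;

# Reverse-complement a plain DNA string (unambiguous bases only, case preserved).
# Hypothetical helper for illustration; Bio::Seq->revcom returns a new object instead.
sub revcom_string {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar context: reverses the characters
    $rc =~ tr/acgtACGT/tgcaTGCA/;   # complement each base
    return $rc;
}

# Same sequence as on the slide.
print revcom_string('aaaatgggggggggggccccgtt'), "\n";
```

Bio::Seq additionally carries the display ID, description and alphabet along with the sequence, which a bare string cannot do.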

9. Sequence I/O ✤ Bio::SeqIO - sequence I/O stream classes (pluggable)

    #!/usr/bin/perl -w
    use Modern::Perl;
    use Bio::SeqIO;

    my ($infile, $outfile) = @ARGV;
    my $in  = Bio::SeqIO->new(-file   => $infile,
                              -format => 'genbank');
    my $out = Bio::SeqIO->new(-file   => ">$outfile",
                              -format => 'fasta');

    while (my $seq_obj = $in->next_seq) {
        say $seq_obj->display_id;
        $out->write_seq($seq_obj);
    }

10. Sequence Features ✤ Bio::SeqFeature::Generic - generic SF implementation

GenBank File

    source          1..2629
                    /organism="Enterococcus faecalis OG1RF"
                    /mol_type="genomic DNA"
                    /strain="OG1RF"
                    /db_xref="taxon:474186"
    gene            25..>2629
                    /gene="pyr operon"
                    /note="pyrimidine biosynthetic operon"

Script

    use Modern::Perl;
    use Bio::SeqIO;

    my $in = Bio::SeqIO->new(-file   => shift,
                             -format => 'genbank');

    while (my $seq_obj = $in->next_seq) {
        for my $feat_obj ($seq_obj->get_SeqFeatures) {
            say "Primary tag: ".$feat_obj->primary_tag;
            say "Location: ".$feat_obj->location->to_FTstring;
            for my $tag ($feat_obj->get_all_tags) {
                say " tag: $tag";
                for my $value ($feat_obj->get_tag_values($tag)) {
                    say " value: $value";
                }
            }
        }
    }

Output

    Primary tag: source
    Location: 1..2629
     tag: db_xref
     value: taxon:474186
     tag: mol_type
     value: genomic DNA
     tag: organism
     value: Enterococcus faecalis OG1RF
     tag: strain
     value: OG1RF

13. Report Parsing

    Query= gi|1786183|gb|AAC73113.1| (AE000111) aspartokinase I,
    homoserine dehydrogenase I [Escherichia coli]
             (820 letters)

    Database: ecoli.aa
               4289 sequences; 1,358,990 total letters

    Searching..................................................done

                                                                       Score     E
    Sequences producing significant alignments:                        (bits)  Value

    gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogen...  1567  0.0
    gb|AAC76922.1| (AE000468) aspartokinase II and homoserine dehydr...   332  1e-91
    gb|AAC76994.1| (AE000475) aspartokinase III, lysine sensitive [E...   184  3e-47
    gb|AAC73282.1| (AE000126) uridylate kinase [Escherichia coli]          42  3e-04

    >gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogenase I
               [Escherichia coli]
               Length = 820

     Score = 1567 bits (4058), Expect = 0.0
     Identities = 806/820 (98%), Positives = 806/820 (98%)

    Query: 1   MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60
               MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA
    Sbjct: 1   MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60

14. Report Parsing ✤ Bio::SearchIO

    #!/usr/bin/perl -w
    use Modern::Perl;
    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'blast',
                                -file   => 'ecoli.bls');

    while( my $result = $in->next_result ) {
        while( my $hit = $result->next_hit ) {
            while( my $hsp = $hit->next_hsp ) {
                say "Query=".$result->query_name;
                say " Hit=".$hit->name;
                say " Length=".$hsp->length('total');
                say " Percent_id=".$hsp->percent_identity."\n";
            }
        }
    }

Output

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC73113.1|
     Length=820
     Percent_id=98.2926829268293

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC76922.1|
     Length=821
     Percent_id=29.5980511571255

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC76994.1|
     Length=471
     Percent_id=30.1486199575372

    Query=gi|1786183|gb|AAC73113.1|
     Hit=gb|AAC73282.1|
     Length=97
     Percent_id=28.8659793814433
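To connect the parsed output to the raw report on slide 13, the first Percent_id value can be reproduced by hand from the "Identities = 806/820" line. The sketch below assumes percent identity is simply identical positions divided by aligned length. That matches the value printed for the first hit, but it is not taken from the Bio::SearchIO source, and the helper name is made up for illustration.

```perl
use strict;
use warnings;

# Percent identity = identical positions / aligned length * 100.
# Hypothetical helper; assumed to mirror what the HSP object reports.
sub percent_identity {
    my ($identical, $aligned_length) = @_;
    die "aligned length must be positive" unless $aligned_length > 0;
    return 100 * $identical / $aligned_length;
}

# "Identities = 806/820" from the first hit in the report.
printf "%.2f\n", percent_identity(806, 820);   # prints 98.29
```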

15. Local/Remote Database Interfaces ✤ Bio::DB::GenBank

    #!/bin/perl -w
    use Modern::Perl;
    use Bio::DB::GenBank;

    my $db_obj = Bio::DB::GenBank->new;   # query NCBI nuc db
    my $seq_obj = $db_obj->get_Seq_by_acc('A00002');

    say $seq_obj->display_id;   # A00002
    say $seq_obj->length();     # 194

✤ Also EntrezGene, GenPept, RefSeq, UniProt, EBI, etc.

16. And Lots More! ✤ Bio::Align/IO ✤ Bio::Map/IO ✤ Bio::Assembly/IO ✤ Bio::Restriction/IO ✤ Bio::Tree/IO ✤ Bio::Structure/IO ✤ Local flatfile databases ✤ Bio::Graphics ✤ Bio::Tools::Run (catch-all namespace) ✤ SeqFeature databases ✤ Bio::Factory (create objects) ✤ Bio::Pedigree/IO ✤ Bio::Range/Location ✤ Bio::Coordinate/IO

17. Current Development

18. Next-Gen Sequence ✤ Second-generation/next-generation sequencing ✤ This is Lincoln Stein ✤ There is a reason he is smiling...

19. Next-Gen Sequence ✤ Bio-SamTools - support for SAM and BAM data (via SamTools) ✤ Bio-BigFile - support for BigWig/BigBed (via Jim Kent’s UCSC tools) ✤ Separate CPAN distributions ✤ GBrowse (Lincoln’s talk this afternoon), BioPerl ✤ Via SeqFeatures (high-level API for both modules) ✤ Via Bio::Assembly and BioPerl-Run (using the above modules)

20. Data Courtesy R. Khetani, M. Hudson, G. Robinson

21. New Tools/Wrappers ✤ BowTie ✤ Infernal v.1.0 ✤ BWA ✤ NCBI eUtils (SOAP, CGI-based) ✤ MAQ ✤ TopHat/CuffLinks (upcoming) ✤ BEDTools (beta) ✤ The Cloud - bioperl-max ✤ SAMTools ✤ HMMER3 ✤ BLAST+ ✤ PAML
Contributors: Mark Jensen, Thomas Sharpton, Dave Messina, Kai Blin, Dan Kortschak

22. Collaborations

Peter J. A. Cock (1,*), Christopher J. Fields (2), Naohisa Goto (3), Michael L. Heuer (4) and Peter M. Rice (5). "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants." Survey and Summary, Nucleic Acids Research, 2010, Vol. 38, No. 6, 1767–1771. doi:10.1093/nar/gkp1137. Published online 16 December 2009. Received October 13, 2009; Revised November 13, 2009; Accepted November 17, 2009.

(1) Plant Pathology, SCRI, Invergowrie, Dundee DD2 5DA, UK
(2) Institute for Genomic Biology, 1206 W. Gregory Drive, M/C 195, University of Illinois at Urbana-Champaign, IL 61801, USA
(3) Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita, Osaka 565-0871, Japan
(4) Harbinger Partners, Inc., 855 Village Center Drive, Suite 356, St. Paul, MN 55127, USA
(5) EMBL Outstation - Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
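The FASTQ variants described in the paper cited on this slide differ mainly in how quality scores are encoded as ASCII characters. The core-Perl sketch below decodes a quality string for the two Phred-based variants: Sanger (ASCII offset 33) and Illumina 1.3+ (ASCII offset 64). The original Solexa variant uses a different score scale and is deliberately left out. The function and hash names are made up for illustration and are not part of BioPerl.

```perl
use strict;
use warnings;

# ASCII offsets for the Phred-based FASTQ variants.
my %OFFSET = (sanger => 33, illumina => 64);

# Convert a quality string into a list of integer Phred scores.
# Hypothetical helper for illustration only.
sub decode_quality {
    my ($qual, $variant) = @_;
    my $offset = $OFFSET{$variant};
    die "unknown variant: $variant" unless defined $offset;
    return map { ord($_) - $offset } split //, $qual;
}

print join(' ', decode_quality('!+5?I', 'sanger')), "\n";   # prints 0 10 20 30 40
```

Reading a file with the wrong offset shifts every score by 31, which is why parsers need to know the variant up front.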

23. The Google Summer of Code ✤ O|B|F was accepted this year for the first time ✤ Headed by Rob Buels (SGN), with some help from Hilmar Lapp and myself ✤ Six projects, covering BioPerl, BioJava, Biopython, BioRuby

24. The Google Summer of Code ✤ BioPerl has actually been part of the Google Summer of Code for the last three years (as have many other Bio*): ✤ NESCent - admin: H. Lapp: ✤ 2008 - PhyloXML parsing (student: Mira Han) ✤ 2009 - NeXML parsing (student: Chase Miller) ✤ O|B|F - admin: R. Buels: ✤ 2010 - Alignment subsystem refactoring (student: Jun Yin)

25. GSoC - Alignment Subsystem ✤ Clean up current code ✤ Include capability of dealing with large datasets ✤ Target next-gen data, very large alignments? ✤ Abstract the backend (DB, memory, etc.) ✤ SAM/BAM may work (via Bio::DB::SAM) ✤ ...but what about protein sequences?

26. Towards a Modern BioPerl

27. Towards a Modern BioPerl ✤ BioPerl will be turning 15 soon ✤ What can we improve? ✤ What can we do with the current code? ✤ Maybe reuse some of it in a BioPerl 2.0? ✤ Or a BioPerl 6?

28. What We Can Do Now ✤ Lower the barrier ✤ Use Modern Perl ✤ Deal with the monolith

29. Lower the Barrier ✤ We have already started on this - May 2010 ✤ Migrate source code repository to git and GitHub ✤ Original BioPerl developers are added as collaborators on GitHub... ✤ ...but now anyone can ‘fork’ BioPerl, make changes, submit ‘pull requests’, etc. ✤ Since May, we have had many forks and pull requests with code reviews (so a decent success)

30. Using Modern Perl ✤ Minimal version of Perl required for BioPerl is v5.6.1 ✤ Even v5.8.1 is considered quite old ✤ Both the 5.6.x and 5.8.x releases are EOL (as of Dec. 2008)

32. Using Modern Perl

say

    print "I like newlines\n";
    say "I like newlines";

defined-or

    # ||= replaces any false value, even a defined one such as 0 or ''
    $foo ||= 'default';

    # the older way to default only an undefined value
    if (!defined($foo)) {
        $foo = 'default'
    }

    # defined-or: assigns only when $foo is undef
    $foo //= 'default';

yada yada

    sub implement_me {
        shift->throw_not_implemented
    }

    sub implement_me { ... } # yada yada
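The practical difference between ||= and //= on this slide shows up with values that are defined but false. The following core-Perl sketch (Perl 5.10 or later for //=) runs both operators over a few inputs. The two helper names are made up for illustration.

```perl
use strict;
use warnings;

# Default with ||= : replaces any false value (undef, 0, '').
sub default_or {
    my ($value) = @_;
    $value ||= 'default';
    return $value;
}

# Default with //= : replaces only undef (needs Perl 5.10+).
sub default_defined_or {
    my ($value) = @_;
    $value //= 'default';
    return $value;
}

# Compare both operators on false-but-defined, undefined and true inputs.
for my $input (0, '', undef, 'set') {
    my $shown = defined $input ? "'$input'" : 'undef';
    printf "%-6s ||= gives %-10s //= gives %s\n",
        $shown,
        "'" . default_or($input) . "'",
        "'" . default_defined_or($input) . "'";
}
```

A legitimate 0 or empty string survives //= but is silently overwritten by ||=, which is the kind of bug the newer operator avoids.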

33. Using Modern Perl

Smart Match

    if ($key ~~ %hash) {    # like exists
        # do something
    }

    if ($foo ~~ /\d+/ ) {   # like =~
        # do something
    }

given/when

    given ($foo) {
        when (%lookup)       { ... }
        when (/^(\d+)/)      { ... }
        when (/^[A-Za-z]+/)  { ... }
        default              { ... }
    }

34. Dealing with the Monolith ✤ Release manager nightmares: ✤ Remote databases disappear (XEMBL) ✤ Others change service or URLs (SeqHound) ✤ Services become obsolete (Pise) ✤ Developers move on, disappear, modules bit-rot (not saying :) ✤ How do we solve this problem?

35. Dealing with the Monolith

    Distribution           Classes   Tests (Files)
    bioperl-live (Core)    874       23146 (341)
    bioperl-run            123*      2468 (80)
    bioperl-db             72        113 (16)
    bioperl-network        9         327 (9)

    * Had 285 more prior to Pise module removal!

36. Dealing with the Monolith ✤ Maybe we shouldn’t be friendly to the monolith ✤ Maybe we should ‘blow it up’ ✤ (Of course, that means make the code modular) ✤ It was originally designed with that somewhat in mind (interfaces)

37. Dealing with the Monolith ✤ Separate distributions make it easier to submit fixes as needed ✤ However, separate distributions make developing a little trickier ✤ Can we create a distribution that resembles BioPerl as users know it? ✤ Is this something we should worry about? ✤ YES ✤ Don’t alienate end-users!

38. Towards BioPerl 2.0? ✤ Biome: BioPerl with Moose ✤ BioPerl6: self-explanatory

39. Biome ✤ BioPerl classes implemented in Moose ✤ GitHub: http://github.com/cjfields/biome ✤ Implemented: Ranges, Locations, simple PrimarySeq, Annotation, SeqFeatures, prototype SeqIO ✤ Interfaces converted to Moose Roles ✤ ‘Type’-checking used for data types

40. Role (with attributes) and Class

Role

    package Biome::Role::Range;
    use Biome::Role;
    use Biome::Types qw(SequenceStrand);

    requires 'to_string';

    # Attributes
    has strand => (
        isa     => SequenceStrand,
        is      => 'rw',
        default => 0,
        coerce  => 1
    );

    has start => (
        is  => 'rw',
        isa => 'Int',
    );

    has end => (
        is  => 'rw',
        isa => 'Int'
    );

    sub length {
        $_[0]->end - $_[0]->start + 1;
    }

Class

    package Biome::Range;
    use Biome;

    with 'Biome::Role::Range';

    sub to_string {
        my ($self) = @_;
        return sprintf("(%s, %s) strand=%s",
                       $self->start,
                       $self->end,
                       $self->strand);
    }

41. BioPerl 6 ✤ BioPerl6: http://github.com/cjfields/bioperl6 ✤ Little has been done beyond simple implementations ✤ Code is open to anyone for experimentation ✤ Ex: Philip Mabon donated a FASTA grammar:

42. Grammar (FASTA)

    grammar Bio::Grammar::Fasta {
        token TOP {
            ^<fasta>+ $
        }
        token fasta {
            <description_line> <sequence>
        }
        token description_line {
            ^^\> <id> <.ws> <description> \n
        }
        token id {
            | <identifier>
            | <generic_id>
        }
        token identifier {
            \S+
        }
        token generic_id {
            \S+
        }
        token description {
            \N+
        }
        token sequence {
            <-[>]>+
        }
    }

43. Actions (FASTA)

    class Bio::Grammar::Actions::Fasta {
        method TOP($/){
            my @matches = gather for $/<fasta> -> $m {
                take $m.ast;
            };
            make @matches;
        }
        method fasta($/){
            my $id   = $/<description_line>.ast<id>;
            my $desc = $/<description_line>.ast<description>;
            my $obj  = Bio::PrimarySeq.new(
                display_id  => $id,
                description => $desc,
                seq         => $/<sequence>.ast);
            make $obj;
        }
        method description_line($/){
            make $/;
        }
        method id($/) {
            make $/;
        }
        method description($/){
            make $/;
        }
        method sequence($/){
            make (~$/).subst("\n", '', :g);
        }
    }

50. Acknowledgements ✤ All BioPerl developers ✤ Chris Dagdigian and Mauricio Herrera Cuadra (O|B|F gurus) ✤ Cross-Collaborative work: Peter Cock (Biopython), Pjotr Prins (BioLib, BioRuby), Naohisa Goto (BioRuby), Michael Heuer and Andreas Prlic (BioJava), Peter Rice (EMBOSS) ✤ Questions? Do we even have time?
