2. Present Day BioPerl
✤ Addressing new bioinformatics problems
✤ Collaborations in Open Bioinformatics Foundation
✤ Google Summer of Code
3. Towards a Modern BioPerl
✤ Lowering the barrier for new users to become involved
✤ Using Modern Perl language features
✤ Dealing with the BioPerl monolith
4. BioPerl 2.0?
✤ BioPerl and Modern Perl OOP (Moose)
✤ BioPerl and Perl 6
5. Background
✤ Started in 1996, many contributors over the years
✤ Jason Stajich (UCR) ✤ Ian Korf (Wash U)
✤ Hilmar Lapp (NESCent) ✤ Chris Mungall (NCBO)
✤ Heikki Lehväslaiho (KAUST) ✤ Brian Osborne (BioTeam)
✤ Georg Fuellen (Bielefeld) ✤ Steve Trutane (Stanford)
✤ Ewan Birney (Sanger, EBI) ✤ Sendu Bala (Sanger)
✤ Aaron Mackey (Univ. Virginia) ✤ Dave Messina (Sonnhammer Lab)
✤ Chris Dagdigian (BioTeam) ✤ Mark Jensen (TCGA)
✤ Steven Brenner (UC-Berkeley) ✤ Rob Buels (SGN)
✤ Lincoln Stein (OICR, CSHL) ✤ Many, many more!
6. Background
✤ Open source: ‘Released under the same license as Perl itself’, i.e. Artistic
✤ http://bioperl.org
✤ Core developers - make releases, drive the project, set vision
✤ Regular contributors - have direct commit access
7. BioPerl Distributions
✤ BioPerl Core - the main distribution (aka ‘bioperl-live’ if using dev version)
✤ BioPerl-Run - Perl ‘wrappers’ for common bioinformatics tools
✤ BioPerl-DB - BioSQL ORM to BioPerl classes
8. Biological Sequences
✤ Bio::Seq - sequence record class
#!/usr/bin/perl -w
use Modern::Perl;
use Bio::Seq;

my $seq_obj = Bio::Seq->new(-seq        => "aaaatgggggggggggccccgtt",
                            -display_id => "ABC12345",
                            -desc       => "example 1",
                            -alphabet   => "dna");

say $seq_obj->display_id;       # ABC12345
say $seq_obj->desc;             # example 1
say $seq_obj->seq;              # aaaatgggggggggggccccgtt

my $revcom = $seq_obj->revcom;  # new Bio::Seq object holding the reverse complement
say $revcom->seq;               # aacggggcccccccccccatttt
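For readers new to the term, here is a minimal plain-Perl sketch of what a reverse complement is. It is an illustration only, not how Bio::Seq implements revcom; the sub name is an assumption made for this example, and it handles only the four unambiguous DNA bases.

```perl
use strict;
use warnings;
use feature 'say';

# Reverse-complement a DNA string: reverse it, then swap each base
# for its pairing partner (A<->T, C<->G). Case is preserved.
# Illustrative helper only; not part of BioPerl.
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;   # complement each base
    return $rc;
}

say revcom('aaaatggc');   # gccatttt
```

Applying it twice returns the original string, which is a quick sanity check for any revcom routine.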
9. Sequence I/O
✤ Bio::SeqIO - sequence I/O stream classes (pluggable)
#!/usr/bin/perl -w
use Modern::Perl;
use Bio::SeqIO;

my ($infile, $outfile) = @ARGV;

my $in  = Bio::SeqIO->new(-file   => $infile,
                          -format => 'genbank');
my $out = Bio::SeqIO->new(-file   => ">$outfile",
                          -format => 'fasta');

while (my $seq_obj = $in->next_seq) {
    say $seq_obj->display_id;
    $out->write_seq($seq_obj);
}
10. Sequence Features
✤ Bio::SeqFeature::Generic - generic SF implementation
use Modern::Perl;
use Bio::SeqIO;

my $in = Bio::SeqIO->new(-file   => shift,
                         -format => 'genbank');

while (my $seq_obj = $in->next_seq) {
    for my $feat_obj ($seq_obj->get_SeqFeatures) {
        say "Primary tag: ".$feat_obj->primary_tag;
        say "Location: ".$feat_obj->location->to_FTstring;
        for my $tag ($feat_obj->get_all_tags) {
            say " tag: $tag";
            for my $value ($feat_obj->get_tag_values($tag)) {
                say " value: $value";
            }
        }
    }
}

GenBank File
source    1..2629
          /organism="Enterococcus faecalis OG1RF"
          /mol_type="genomic DNA"
          /strain="OG1RF"
          /db_xref="taxon:474186"
gene      25..>2629
          /gene="pyr operon"
          /note="pyrimidine biosynthetic operon"

Output
Primary tag: source
Location: 1..2629
 tag: db_xref
 value: taxon:474186
 tag: mol_type
 value: genomic DNA
 tag: organism
 value: Enterococcus faecalis OG1RF
 tag: strain
 value: OG1RF
13. Report Parsing
Query= gi|1786183|gb|AAC73113.1| (AE000111) aspartokinase I,
homoserine dehydrogenase I [Escherichia coli]
(820 letters)
Database: ecoli.aa
4289 sequences; 1,358,990 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogen... 1567 0.0
gb|AAC76922.1| (AE000468) aspartokinase II and homoserine dehydr... 332 1e-91
gb|AAC76994.1| (AE000475) aspartokinase III, lysine sensitive [E... 184 3e-47
gb|AAC73282.1| (AE000126) uridylate kinase [Escherichia coli] 42 3e-04
>gb|AAC73113.1| (AE000111) aspartokinase I, homoserine dehydrogenase I [Escherichia
coli]
Length = 820
Score = 1567 bits (4058), Expect = 0.0
Identities = 806/820 (98%), Positives = 806/820 (98%)
Query: 1 MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA
Sbjct: 1 MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA 60
14. Report Parsing
✤ Bio::SearchIO
#!/usr/bin/perl -w
use Modern::Perl;
use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'blast',
                            -file   => 'ecoli.bls');

while( my $result = $in->next_result ) {
    while( my $hit = $result->next_hit ) {
        while( my $hsp = $hit->next_hsp ) {
            say "Query=".$result->query_name;
            say " Hit=".$hit->name;
            say " Length=".$hsp->length('total');
            say " Percent_id=".$hsp->percent_identity."\n";
        }
    }
}

Output
Query=gi|1786183|gb|AAC73113.1|
 Hit=gb|AAC73113.1|
 Length=820
 Percent_id=98.2926829268293

Query=gi|1786183|gb|AAC73113.1|
 Hit=gb|AAC76922.1|
 Length=821
 Percent_id=29.5980511571255

Query=gi|1786183|gb|AAC73113.1|
 Hit=gb|AAC76994.1|
 Length=471
 Percent_id=30.1486199575372

Query=gi|1786183|gb|AAC73113.1|
 Hit=gb|AAC73282.1|
 Length=97
 Percent_id=28.8659793814433
15. Local/Remote Database Interfaces
✤ Bio::DB::GenBank
#!/usr/bin/perl -w
use Modern::Perl;
use Bio::DB::GenBank;
my $db_obj = Bio::DB::GenBank->new; # query NCBI nuc db
my $seq_obj = $db_obj->get_Seq_by_acc('A00002');
say $seq_obj->display_id; # A00002
say $seq_obj->length(); # 194
✤ Also EntrezGene, GenPept, RefSeq, UniProt, EBI, etc.
18. Next-Gen Sequence
✤ Second-generation/next-generation sequencing
✤ This is Lincoln Stein
✤ There is a reason he is smiling...
19. Next-Gen Sequence
✤ Bio-SamTools - support for SAM and BAM data (via SamTools)
✤ Bio-BigFile - support for BigWig/BigBed (via Jim Kent’s UCSC tools)
✤ Separate CPAN distributions
✤ GBrowse (Lincoln’s talk this afternoon), BioPerl
✤ Via SeqFeatures (high-level API for both modules)
✤ Via Bio::Assembly and BioPerl-Run (using the above modules)
21. New Tools/Wrappers
✤ BowTie
✤ BWA
✤ MAQ
✤ BEDTools (beta)
✤ SAMTools
✤ HMMER3
✤ BLAST+
✤ PAML
✤ Infernal v.1.0
✤ NCBI eUtils (SOAP, CGI-based)
✤ TopHat/CuffLinks (upcoming)
✤ The Cloud - bioperl-max
Mark Jensen, Thomas Sharpton, Dave Messina, Kai Blin, Dan Kortschak
22. Collaborations
The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants
Peter J. A. Cock (SCRI), Christopher J. Fields (University of Illinois at Urbana-Champaign), Naohisa Goto (Osaka University), Michael L. Heuer (Harbinger Partners, Inc.) and Peter M. Rice (EMBL Outstation - Hinxton, European Bioinformatics Institute)
Nucleic Acids Research, 2010, Vol. 38, No. 6, 1767–1771 (Survey and Summary), doi:10.1093/nar/gkp1137
Published online 16 December 2009
23. The Google Summer of Code
✤ O|B|F was accepted this year for the first time
✤ Headed by Rob Buels (SGN), with some help from Hilmar Lapp and myself
✤ Six projects, covering BioPerl, BioJava, Biopython, BioRuby
24. The Google Summer of Code
✤ BioPerl has actually been part of the Google Summer of Code for the last three years (as have many other Bio* projects):
✤ NESCent - admin: H. Lapp:
✤ 2008 - PhyloXML parsing (student: Mira Han)
✤ 2009 - NeXML parsing (student: Chase Miller)
✤ O|B|F - admin: R. Buels:
✤ 2010 - Alignment subsystem refactoring (student: Jun Yin)
25. GSoC - Alignment Subsystem
✤ Clean up current code
✤ Include capability of dealing with large datasets
✤ Target next-gen data, very large alignments?
✤ Abstract the backend (DB, memory, etc.)
✤ SAM/BAM may work (via Bio::DB::SAM)
✤ ...but what about protein sequences?
27. Towards a Modern BioPerl
✤ BioPerl will be turning 15 soon
✤ What can we improve?
✤ What can we do with the current code?
✤ Maybe some that we can use in a BioPerl 2.0?
✤ Or a BioPerl 6?
28. What We Can Do Now
✤ Lower the barrier
✤ Use Modern Perl
✤ Deal with the monolith
29. Lower the Barrier
✤ We have already started on this - May 2010
✤ Migrate source code repository to git and GitHub
✤ Original BioPerl developers are added as collaborators on GitHub...
✤ ...but now anyone can ‘fork’ BioPerl, make changes, submit ‘pull requests’, etc.
✤ Since May, we have had many forks and pull requests with code reviews (so a decent success)
30. Using Modern Perl
✤ Minimal version of Perl required for BioPerl is v5.6.1
✤ Even v5.8.1 is considered quite old
✤ Both the 5.6.x and 5.8.x releases are EOL (as of Dec. 2008)
32. Using Modern Perl
say
print "I like newlines\n";
say "I like newlines";

defined-or
# ||= also replaces defined-but-false values (0, '')
$foo ||= 'default';
if (!defined($foo)) {
    $foo = 'default'
}
$foo //= 'default';

yada yada
sub implement_me {
    shift->throw_not_implemented
}
sub implement_me { ... } # yada yada
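The difference between ||= and //= only shows up for values that are defined but false. The following minimal sketch (Perl 5.10 or later; the two sub names are made up for this example) makes that visible.

```perl
use strict;
use warnings;
use feature 'say';    # say and // both need Perl 5.10+

# Each helper returns what $foo holds after applying one idiom.
# Hypothetical names, for illustration only.
sub with_or         { my ($foo) = @_; $foo ||= 'default'; return $foo }
sub with_defined_or { my ($foo) = @_; $foo //= 'default'; return $foo }

say with_or(0);               # default  (0 is false, so it is replaced)
say with_defined_or(0);       # 0        (0 is defined, so it is kept)
say with_defined_or(undef);   # default  (undef is replaced)
```

For sequence coordinates, scores, or strand values where 0 is legitimate data, //= is usually the safer default.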
33. Using Modern Perl
Smart Match
if ($key ~~ %hash) { # like exists
    # do something
}
if ($foo ~~ /\d+/ ) { # like =~
    # do something
}

given/when
given ($foo) {
    when (%lookup) { ... }
    when (/^(\d+)/) { ... }
    when (/^[A-Za-z]+/) { ... }
    default { ... }
}
34. Dealing with the Monolith
✤ Release manager nightmares:
✤ Remote databases disappear (XEMBL)
✤ Others change service or URLs (SeqHound)
✤ Services become obsolete (Pise)
✤ Developers move on, disappear, modules bit-rot (not saying :)
✤ How do we solve this problem?
35. Dealing with the Monolith
                       Classes   Tests (Files)
bioperl-live (Core)    874       23146 (341)
bioperl-run            123*      2468 (80)
bioperl-db             72        113 (16)
bioperl-network        9         327 (9)
* Had 285 more prior to Pise module removal!
36. Dealing with the Monolith
✤ Maybe we shouldn’t be friendly to the monolith
✤ Maybe we should ‘blow it up’
✤ (Of course, that means make the code modular)
✤ It was originally designed with that somewhat in mind (interfaces)
37. Dealing with the Monolith
✤ Separate distributions make it easier to submit fixes as needed
✤ However, separate distributions make developing a little trickier
✤ Can we create a distribution that resembles BioPerl as users know it?
✤ Is this something we should worry about?
✤ YES
✤ Don’t alienate end-users!
39. Biome
✤ BioPerl classes implemented in Moose
✤ GitHub: http://github.com/cjfields/biome
✤ Implemented: Ranges, Locations, simple PrimarySeq, Annotation, SeqFeatures, prototype SeqIO
✤ Interfaces converted to Moose Roles
✤ ‘Type’-checking used for data types
40. Role
package Biome::Role::Range;

use Biome::Role;
use Biome::Types qw(SequenceStrand);

requires 'to_string';

# Attributes
has strand => (
    isa     => SequenceStrand,
    is      => 'rw',
    default => 0,
    coerce  => 1
);

has start => (
    is  => 'rw',
    isa => 'Int',
);

has end => (
    is  => 'rw',
    isa => 'Int'
);

sub length {
    $_[0]->end - $_[0]->start + 1;
}

Class
package Biome::Range;

use Biome;

with 'Biome::Role::Range';

sub to_string {
    my ($self) = @_;
    return sprintf("(%s, %s) strand=%s",
        $self->start,
        $self->end,
        $self->strand);
}
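To show what the role and class amount to at run time, here is a rough plain-Perl equivalent that needs neither Moose nor Biome. The package name Sketch::Range is hypothetical, and the sketch skips the type checking and coercion that the Biome version gets from its attribute declarations.

```perl
use strict;
use warnings;
use feature 'say';

package Sketch::Range;    # hypothetical name, not part of Biome

sub new {
    my ($class, %args) = @_;
    my $self = {
        start  => $args{start},
        end    => $args{end},
        strand => $args{strand} // 0,   # same default as the role
    };
    return bless $self, $class;
}

# Read-only accessors (the Biome attributes are 'rw').
sub start  { $_[0]{start} }
sub end    { $_[0]{end} }
sub strand { $_[0]{strand} }

# Coordinates are inclusive, hence the + 1.
sub length {
    my ($self) = @_;
    return $self->end - $self->start + 1;
}

sub to_string {
    my ($self) = @_;
    return sprintf("(%s, %s) strand=%s",
        $self->start, $self->end, $self->strand);
}

package main;

my $range = Sketch::Range->new(start => 10, end => 19, strand => 1);
say $range->length;       # 10
say $range->to_string;    # (10, 19) strand=1
```

The Moose version earns its keep by replacing the hand-written constructor and accessors with declarations, and by letting any class that consumes the role reuse length.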
41. BioPerl 6
✤ BioPerl6: http://github.com/cjfields/bioperl6
✤ Little has been done beyond simple implementations
✤ Code is open to anyone for experimentation
✤ Ex: Philip Mabon donated a FASTA grammar:
43. Grammar (FASTA) and Actions (FASTA)
Grammar (FASTA)
grammar Bio::Grammar::Fasta {
    token TOP {
        ^<fasta>+ $
    }
    token fasta {
        <description_line> <sequence>
    }
    token description_line {
        ^^\> <id> <.ws> <description> \n
    }
    token id {
        | <identifier>
        | <generic_id>
    }
    token identifier {
        \S+
    }
    token generic_id {
        \S+
    }
    token description {
        \N+
    }
    token sequence {
        <-[>]>+
    }
}

Actions (FASTA)
class Bio::Grammar::Actions::Fasta {
    method TOP($/){
        my @matches = gather for $/<fasta> -> $m {
            take $m.ast;
        };
        make @matches;
    }
    method fasta($/){
        my $id = $/<description_line>.ast<id>;
        my $desc = $/<description_line>.ast<description>;
        my $obj = Bio::PrimarySeq.new(
            display_id  => $id,
            description => $desc,
            seq         => $/<sequence>.ast);
        make $obj;
    }
    method description_line($/){
        make $/;
    }
    method id($/) {
        make $/;
    }
    method description($/){
        make $/;
    }
    method sequence($/){
        make (~$/).subst("\n", '', :g);
    }
}
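For comparison, the same record structure (a '>' line with id and description, followed by sequence lines) can be parsed in Perl 5 without a grammar. This is a minimal sketch, not BioPerl's FASTA parser: the sub name and hash keys are assumptions chosen to mirror the actions class, and unlike the grammar it accepts records with no description.

```perl
use strict;
use warnings;
use feature 'say';

# Parse FASTA text into a list of { id, description, seq } hashes.
# Illustrative helper only; not part of BioPerl.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    # Every record starts with '>' at the beginning of a line.
    for my $chunk (split /^>/m, $text) {
        next unless $chunk =~ /\S/;    # skip the empty piece before the first '>'
        my ($header, @lines) = split /\n/, $chunk;
        my ($id, $desc) = split /\s+/, $header, 2;
        my $seq = join '', @lines;     # joining the lines drops the newlines
        $seq =~ s/\s+//g;              # remove any remaining whitespace
        push @records, {
            id          => $id,
            description => $desc // '',
            seq         => $seq,
        };
    }
    return @records;
}

my @recs = parse_fasta(">seq1 example one\nACGT\nTTGA\n>seq2\nGGCC\n");
say "$_->{id}: $_->{seq}" for @recs;
```

The grammar version separates recognition (tokens) from object construction (actions), which is what makes it easy to swap in a different actions class for the same format.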
50. Acknowledgements
✤ All BioPerl developers
✤ Chris Dagdigian and Mauricio Herrera Cuadra (O|B|F gurus)
✤ Cross-collaborative work: Peter Cock (Biopython), Pjotr Prins (BioLib, BioRuby), Naohisa Goto (BioRuby), Michael Heuer and Andreas Prlic (BioJava), Peter Rice (EMBOSS)
✤ Questions? Do we even have time?