1. Pistoia Alliance Sequence Squeeze
Using a competition model to spur development of novel open-source algorithms
Richard Holland (Eagle/Pistoia), Nick Lynch (AZ/Pistoia)
BOSC, July 2012
©Eagle Genomics Ltd.
2. Order of Service
• What/who is the Pistoia Alliance?
• What is/was Sequence Squeeze?
• Who won, how, and why?
• Why did Pistoia do this?
• Why is this good for BOSC delegates?
• Will it happen again?
Pistoia Alliance Sequence Squeeze, ©Eagle Genomics Ltd, July 14, 2012
3. What/who is the Pistoia Alliance?
4. Who is Pistoia?
• The Pistoia Alliance is
– global
– not-for-profit
– precompetitive alliance
– life science companies, vendors, publishers, and academic groups
– aims to lower barriers to innovation
– by improving the interoperability of R&D business processes.
• We differ from standards groups because
– we bring together the key constituents to identify the root causes that lead to R&D inefficiencies
– develop best practices and technology pilots to overcome common obstacles.
5. What is/was Sequence Squeeze?
6. The NGS problem
• Storing millions of NGS reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate.
• There is a need for a novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.
7. What was Sequence Squeeze?
• Contest to find a better FASTQ compression algorithm
– easiest format for ranking entries in an automated setting.
• Open source, non-restrictive licence required for entries
– benefit the whole community.
• Entries tested on an extract of the 1000 Genomes data stored in AWS.
• Prize fund of US$15,000 to the best algorithm submitted before the closing date of 15 March 2012.
• Winner was announced at the Pistoia Alliance Conference in Boston MA on 24 April 2012
– more on that story later.
• Organised and administered by Eagle under contract to Pistoia.
8. Who entered?
• 108 distinct entries.
• But all these from only 12 entrants!
– some entrants were groups or consortia but most were individuals.
• Public leaderboard encouraged fiercer competition.
• Entrants seemingly driven to outdo their competitors.
9. Who judged?
• Yingrui Li – Duty Operation Officer of Science & Technology Department of the BGI-Shenzhen.
• Nick Lynch – President of the Pistoia Alliance (2009-11).
• Guy Coates – leader of the Informatics Systems Group at the Wellcome Trust Sanger Institute.
• Tim Fennell – Assistant Director for Sequencing Pipeline Informatics at the Broad Institute.
10. Who won, how, and why?
11. What were the results?
• Entrants were judged by
– compression ratio
– compression time and memory
– decompression time and memory
– accuracy (lossiness – 100% target)
– manual review for code quality, scalability, and other factors.
• The same three people showed up at the top of every category
– in a different order
– with different versions of their entries.
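The automated judging criteria listed above can be approximated in a few lines. The sketch below is not the contest's actual harness. It assumes that "compression ratio" means compressed size as a percentage of original size (consistent with the percentages quoted in the results), uses Python's standard bz2 module as a stand-in entry, and runs on a small synthetic FASTQ built by a hypothetical make_fastq helper. Memory usage is left out because measuring it portably needs more than a short example.

```python
import bz2
import time


def make_fastq(n_reads, read_len=50):
    """Build a small synthetic FASTQ payload (illustrative data only)."""
    bases = "ACGT"
    records = []
    for i in range(n_reads):
        seq = "".join(bases[(i * 7 + j * 3) % 4] for j in range(read_len))
        qual = "".join(chr(33 + (i + j) % 40) for j in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def score(compress, decompress, data):
    """Time one compress/decompress round trip and check it is lossless."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    return {
        # Assumption: ratio = compressed size / original size, as a percentage.
        "ratio_percent": 100.0 * len(packed) / len(data),
        "compress_seconds": t1 - t0,
        "decompress_seconds": t2 - t1,
        # The 100% accuracy target: output must match the input byte for byte.
        "lossless": restored == data,
    }


data = make_fastq(2000)
result = score(bz2.compress, bz2.decompress, data)
print(result)
```

Any pair of functions with the same bytes-in, bytes-out shape can be passed to score, which is roughly what makes ranking many entries on a leaderboard practical.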
12. Who won, and why?
• James Bonfield won overall
– majority of top places in each category
– using various versions of his entry
– forming a suite of suitable tools.
• 11.41% compression ratio (test data ~6GB)
– or 109.90 seconds compression time
– or 100.91 seconds decompression time
– or 35.76MB compression memory usage
– or 16.01MB decompression memory usage
– but not all at once!
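To put the best ratio in concrete terms, the arithmetic can be worked through as below. This is a rough illustration only: it assumes the ratio is compressed size divided by original size, and it treats the "~6GB" test set as exactly 6.0 GB, which the slide does not state.

```python
def compressed_size_gb(original_gb, ratio_percent):
    """Estimated compressed size, assuming ratio = compressed / original."""
    return original_gb * ratio_percent / 100.0


# Assumption: "~6GB" is taken as exactly 6.0 GB for the sake of the estimate.
best_case_gb = compressed_size_gb(6.0, 11.41)
print(f"{best_case_gb:.2f} GB")  # prints "0.68 GB"
```

Under those assumptions the best-ratio entry would store the test set in roughly 0.68 GB, a saving of about 89%.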
13. Implications of winning entry
• The approach is very simple – essentially:
– convert the FASTQ to BAM alignments against a reference genome, preserving quality scores.
– compress the BAM files.
• Many other entries followed the same pattern:
– convert to some other format then compress using standard techniques.
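The "convert to some other format, then compress with standard techniques" pattern can be illustrated with a much simpler transform than the winning entry's. The sketch below does not align reads or produce BAM. As a stand-in, it splits a FASTQ into three homogeneous streams (names, sequences, qualities) and compresses each with Python's standard bz2 module. The function names are hypothetical, and the code assumes four lines per record, a bare "+" separator line, and a trailing newline.

```python
import bz2


def split_streams(fastq_text):
    """Separate a 4-line-per-record FASTQ into names, sequences, qualities."""
    lines = fastq_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    # Lines 2, 6, 10, ... are the '+' separators and are assumed to be bare.
    return lines[0::4], lines[1::4], lines[3::4]


def pack(fastq_text):
    """Transform step followed by a standard compressor on each stream."""
    return [
        bz2.compress("\n".join(stream).encode("ascii"))
        for stream in split_streams(fastq_text)
    ]


def unpack(blobs):
    """Reverse the transform: decompress each stream and re-interleave."""
    names, seqs, quals = (
        bz2.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    return "".join(f"{n}\n{s}\n+\n{q}\n" for n, s, q in zip(names, seqs, quals))


sample = "@r1\nACGTACGT\n+\nIIIIHHHH\n@r2\nACGTTCGT\n+\nIIIIGGGG\n"
assert unpack(pack(sample)) == sample  # lossless round trip
```

Grouping like with like gives a general-purpose compressor more regular input to work with; whether it actually beats compressing the raw file depends on the data.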
14. Other interesting results
• Matt Mahoney (Dell) submitted a specialised version of the standard tool paq which performed extremely well.
• Even vanilla paq wasn't too bad.
• Discarding the quality scores entirely gets a compression ratio of 2.87% vs. the original FASTQ (not FASTA).
• If this contest truly represented the latest and greatest ideas in the field, then NGS storage must therefore either be
– highly compressed, very slow access,
– or less compressed, relatively fast access.
• It's quite hard to beat bzip2.
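The effect of discarding quality scores can be seen even on toy data. The sketch below is illustrative only and will not reproduce the contest figures: it generates a synthetic FASTQ with uniformly random bases and qualities (a hypothetical synthetic_fastq helper, which is harsher than real data), compresses it with Python's standard bz2 module with and without the quality lines, and reports both sizes as a percentage of the original FASTQ.

```python
import bz2
import random


def synthetic_fastq(n_reads, read_len=100, seed=0):
    """Return (name, sequence, quality) tuples with random content."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randrange(41)) for _ in range(read_len))
        records.append((f"@read{i}", seq, qual))
    return records


def bz2_percent(text, baseline_len):
    """Compressed size of text as a percentage of baseline_len bytes."""
    return 100.0 * len(bz2.compress(text.encode("ascii"))) / baseline_len


records = synthetic_fastq(1000)
full = "".join(f"{n}\n{s}\n+\n{q}\n" for n, s, q in records)
no_quals = "".join(f"{n}\n{s}\n" for n, s, _ in records)

# Both percentages are relative to the original FASTQ, qualities included.
with_quals_pct = bz2_percent(full, len(full))
without_quals_pct = bz2_percent(no_quals, len(full))
print(f"with qualities:    {with_quals_pct:.1f}%")
print(f"without qualities: {without_quals_pct:.1f}%")
```

Quality strings draw on a larger alphabet than the four bases, so they tend to dominate the compressed size; dropping them is lossy and would miss the contest's 100% accuracy target.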
15. And unexpected benefits
David Flanders (Eagle CEO) and John Wise (Pistoia chairman) present James Bonfield with his prize.
James Bonfield donated his entire prize fund (US$15,000) to charity.
50% to the Wellcome Trust Sanger Institute.
50% to the British Heart Foundation.
16. Publication
• Formal paper being written at the moment by James Bonfield
– in collaboration with close-second Matt Mahoney
– and judge Nick Lynch
– and the authors of other significant entries.
• Source code of ALL entries is available at www.sequencesqueeze.org
– all under BSD licence
– all hosted at SourceForge or similar
– click entry names to be taken to download page.
• Interviews with entrants at the Pistoia blog www.pistoiaalliance.org/blog
– search for articles with the tag ‘compression algorithms’.
17. Why did Pistoia do this?
18. Why did Pistoia do this?
• Encouraging innovation through prize-backed contests.
• Open innovation model allows industry to state its requirements
– then let the free market decide how to deliver something that satisfies these.
19. Why did Pistoia do this?
• Typical bioinformatics open-source hackers do things because they enjoy them
– but sometimes also because of the challenge, the kudos, the satisfaction of solving a real-world problem.
• James’ charity donation is a great example of this
– he wasn’t in it for the money
– but the prize fund created a tangible goal to aim at.
• Amazon kindly sponsored vouchers for all participants that should have covered the cost of developing and submitting an entry
– contest was AWS-based
– entries had to be submitted as S3 buckets.
20. Why did Pistoia do this?
• Leaderboard encouraged competition
– one-upmanship
– innovation.
• Does not discourage collaboration
– James and Matt both discussed their entries with the data compression community at encode.ru
21. Why did Pistoia do this?
• BSD-licence requirement ensured that the winning entry was not going to be available only to those willing to pay a fee.
• Entire community benefits, not just Pistoia members or those with deep pockets to pay for software licence agreements.
22. Why is this good for BOSC delegates?
23. Why is this good for BOSC delegates?
• If the entries had been closed/commercial then only organisations willing to pay to licence/buy the resulting products would benefit.
• But this way the entire community benefits from results, for free, without restriction.
• Beneficiaries include big pharma and other large corporations that commissioned the contest
– but also all universities
– all non-profits
– all small businesses in biotech
– and everyone else involved in NGS work.
• Pistoia is about pre-competitive alliance
– there is no reason to make the Alliance’s output exclusive
– they are there to develop and share ideas, not to build an empire.
24. Will it happen again?
25. Will it happen again?
• Pleased with outcome and level of interest.
• So, yes.
• Goal is to run two such contests a year.
• But, your community needs you!
– we need a topic/subject/idea that can be rationally/objectively judged/ranked
– and that is relevant to the research activities of life science companies and other Pistoia members.
• Ideas can be sent to Pistoia Ops team c/o execdirector@pistoiaalliance.org
26. Credits
• Pistoia Alliance for the idea and funding.
• Eagle for organising and administering.
• All contestants for entering.
• 1000 Genomes for the test data.
• AWS for sponsoring participants.
• BOSC/OBF for accepting this talk.
27. www.pistoiaalliance.org
richard.holland@eaglegenomics.com
www.sequencesqueeze.org
+44 (0)1223 654481 x3
(ideas to: execdirector@pistoiaalliance.org)
www.eaglegenomics.com
@eaglegen
blog.eaglegenomics.com
facebook.com/eaglegenomics
@sequencesqueeze
www.pistoiaalliance.org/blog
@pistoiaalliance
Eagle® is a registered trademark no. 010418135 of Eagle Genomics Ltd.
Postal address: Eagle Genomics Ltd., Babraham Research Campus, Cambridge CB22 3AT, United Kingdom.