The size, number and complexity of macromolecular structures have been growing dramatically in recent years, making visualisation and analysis of macromolecules non-trivial and sometimes impossible. At the same time, developments within genomics, web-based game development and Big Data mean that hardware and software now support such analysis. However, existing macromolecular file formats present an I/O bottleneck, meaning the power of such technologies cannot be harnessed. In this work we present a modern MacroMolecular Transmission Format (MMTF). MMTF is 91% smaller than mmCIF and is up to two orders of magnitude faster to parse. Together, these improvements provide a paradigm shift in the way structural biology can be carried out. The largest structures can now be visualised on all devices, and the entire archive can be interactively queried and analysed in seconds through an efficient in-memory representation.
Small, fast and useful – MMTF a new paradigm in macromolecular data transmission
mmtf.rcsb.org
Anthony R. Bradley, Alexander S. Rose, Yana Valasatava, Jose M. Duarte, Andreas Prlić, Peter W. Rose
Yet another file format???
Applications
Funding and acknowledgements
BD2K Targeted Software Development, Grant Number: U01 CA198942
Get the data
Three ways to get involved
http://mmtf.rcsb.org/
Already several early adopters
APIs provided
Cole Christie and Chris Randle
• Steep increase in atoms per structure (37% between 2012 and 2016)
• 10,000 new structures added per year
• 68 of the 100 largest structures were deposited in the past three years
• Largest structure contains 2.5 M atoms
• EM has seen a sharp rise in recent years
Outcomes
• Small: ~75% compression over GZIP-compressed mmCIF
• Fast: parsing 2 orders of magnitude faster
• Self-contained: no need for calls to external resources
• Useful: bonding (bond order) and secondary structure info included in all files
What is it?
• Binary: MessagePack (binary JSON format) used as a data container, http://msgpack.org/
• Custom lossless compression: delta, run-length and dictionary encoding used to compress data
• Open-source: specification and software libraries developed under Apache/MIT licenses
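The delta and run-length steps named above can be illustrated with a minimal sketch. This is not the MMTF specification's exact codec: the function names and the flat `[value, count, ...]` layout are assumptions chosen for illustration, and dictionary encoding is left out.

```python
from itertools import groupby
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values: List[int]) -> List[int]:
    """Flatten runs of equal values into [value, count, value, count, ...]."""
    encoded: List[int] = []
    for value, run in groupby(values):
        encoded.extend([value, sum(1 for _ in run)])
    return encoded


def run_length_decode(encoded: List[int]) -> List[int]:
    """Expand [value, count, ...] pairs back into the full list."""
    out: List[int] = []
    for value, count in zip(encoded[0::2], encoded[1::2]):
        out.extend([value] * count)
    return out


# Hypothetical column of atom serial numbers that increase by one.
atom_ids = list(range(1, 1001))
packed = run_length_encode(delta_encode(atom_ids))
restored = delta_decode(run_length_decode(packed))
```

Here a column of 1,000 consecutive integers collapses to two integers and decodes back without loss, which is why the two encodings are usually applied together on regular columns.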
Fast
• Whole PDB archive converted to MMTF weekly
• Individual files available from a REST API: wget http://mmtf.rcsb.org/v0.2/full/4hhb.mmtf.gz
• Whole archive as a Hadoop sequence file: wget http://mmtf.rcsb.org/v0.2/hadoopfiles/full.tar
• More details: http://mmtf.rcsb.org/download.html
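The same REST download could be scripted rather than run through wget. The sketch below uses only the Python standard library; the helper names `mmtf_url` and `fetch_mmtf_bytes` are hypothetical, lower-casing the PDB ID is an assumption based on the example URL, and the v0.2 endpoint is the one quoted on this poster and may since have changed.

```python
import gzip
import urllib.request

# Endpoint as quoted on the poster (v0.2); treat it as an assumption.
BASE_URL = "http://mmtf.rcsb.org/v0.2/full"


def mmtf_url(pdb_id: str) -> str:
    """Build the download URL for one entry, e.g. 4hhb."""
    return f"{BASE_URL}/{pdb_id.lower()}.mmtf.gz"


def fetch_mmtf_bytes(pdb_id: str, timeout: float = 30.0) -> bytes:
    """Download one entry and return the gunzipped MessagePack payload."""
    with urllib.request.urlopen(mmtf_url(pdb_id), timeout=timeout) as response:
        return gzip.decompress(response.read())
```

The returned bytes are still MessagePack-encoded, so a further decoding step is needed, for example with one of the MMTF libraries.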
• MMTF allows interactive data mining of the entire PDB archive
• No need for SQL or setting up a database or schema
• Queries on the entire archive in only a couple of minutes
1. Use: use our API to do your own processing
2. Adopt: incorporate MMTF into your toolkit
3. Contribute: fork us on GitHub
Data mining
Efficient contact finding
Fragment generation
• Generate all fragments from the protein chains in the PDB
• Commonly done in, e.g., ab initio structure prediction
• I/O is a key bottleneck in this process
• MMTF allows such analysis to be done in a fraction of the time
• More experiments can be done per day
• No need to compromise on dataset size or parameters
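Fragment generation as described above amounts to sliding a fixed-length window along each chain. The sketch below assumes a chain is already available as a list of residue names; the function name and the toy chain are hypothetical, and real pipelines would typically carry coordinates along with the residue names.

```python
from typing import Iterator, Sequence, Tuple


def fragments(chain: Sequence[str], length: int) -> Iterator[Tuple[int, Tuple[str, ...]]]:
    """Yield (start index, fragment) for every window of `length` residues."""
    if length <= 0:
        raise ValueError("length must be positive")
    # A chain shorter than `length` gives an empty range, so no fragments.
    for start in range(len(chain) - length + 1):
        yield start, tuple(chain[start:start + length])


# Toy chain for illustration only.
chain = ["MET", "VAL", "LEU", "SER", "PRO"]
three_mers = list(fragments(chain, 3))
```

Once the structures are in memory the windowing itself is cheap, which is consistent with the poster's point that I/O, not the algorithm, dominates the run time.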
Benchmark hardware: Mac mini with a 2.6 GHz Intel Core i5 (4 cores) and 16 GB RAM.
Small
High performance analysis
Hadoop sequence files are optimized for fast parallel and sequential access
Spark is a fast in-memory big data engine with clean and expressive APIs: http://spark.apache.org/
• APIs and tools designed using the Apache Spark framework for fast parallel in-memory processing
• Spark handles running code in a multi-threaded manner, so there is no need to manage thread pools
• Python, Java and Scala APIs available
• Spark is used widely in other areas of bioinformatics (e.g., ADAM in genomics, http://bdgenomics.org/)
[Figure: efficient hashing algorithm vs. inefficient looping algorithm]
• Inter-atomic contacts are often analyzed, e.g., for empirical force fields
• MMTF allows an efficient contact-finding algorithm to have a strong impact
• Using mmCIF, the efficient algorithm provides only a ~10% speedup
• Using MMTF, the same algorithm gives a ~90% speedup
• MMTF promotes efficient downstream algorithm design
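The difference between a looping and a hashing contact search can be sketched as follows. This is an illustrative grid-hashing implementation, not the code used for the benchmark; the function names, the toy coordinates and the 4.0 cutoff are assumptions.

```python
from collections import defaultdict
from itertools import combinations, product
from math import dist, floor
from typing import Dict, List, Sequence, Set, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def contacts_looping(points: Sequence[Point], cutoff: float) -> Set[Tuple[int, int]]:
    """Compare every pair of points: O(n^2) distance checks."""
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if dist(p, q) <= cutoff
    }


def contacts_hashing(points: Sequence[Point], cutoff: float) -> Set[Tuple[int, int]]:
    """Bin points into cubic cells of side `cutoff`, then check only the
    27 neighbouring cells of each occupied cell."""
    grid: Dict[Cell, List[int]] = defaultdict(list)
    for index, (x, y, z) in enumerate(points):
        grid[(floor(x / cutoff), floor(y / cutoff), floor(z / cutoff))].append(index)

    found: Set[Tuple[int, int]] = set()
    for (cx, cy, cz), members in grid.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get avoids creating empty cells while iterating over the grid.
            neighbours = grid.get((cx + dx, cy + dy, cz + dz), ())
            for i in members:
                for j in neighbours:
                    if i < j and dist(points[i], points[j]) <= cutoff:
                        found.add((i, j))
    return found


# Toy coordinates along one axis, for illustration only.
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0), (12.5, 0.0, 0.0)]
slow = contacts_looping(points, cutoff=4.0)
fast = contacts_hashing(points, cutoff=4.0)
```

Both functions return the same pairs; the hashing version simply avoids distance checks between points that are too far apart to be in contact, which only pays off once parsing is no longer the dominant cost.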
Element    Occurrences    % of PDB
Carbon     431,487,468    43%
Oxygen     174,153,905    17%
Nitrogen   121,509,487    12%
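An element tally like the one in the table above follows a map-and-reduce pattern: count within each structure, then add the counts together. The sketch below shows that pattern in plain Python on made-up data; the function names and toy structures are hypothetical, and on the full archive the same two steps would be distributed, e.g. with Spark.

```python
from collections import Counter
from functools import reduce
from typing import Dict, Iterable, List


def count_elements(structures: Iterable[List[str]]) -> Counter:
    """Map each structure to its own Counter, then reduce by addition."""
    return reduce(lambda a, b: a + b, map(Counter, structures), Counter())


def as_percentages(counts: Counter) -> Dict[str, float]:
    """Express each element count as a percentage of all atoms counted."""
    total = sum(counts.values())
    return {element: 100.0 * n / total for element, n in counts.items()}


# Two toy "structures", each given as a list of element symbols.
toy_structures = [["C", "C", "N", "O"], ["C", "O", "S"]]
totals = count_elements(toy_structures)
shares = as_percentages(totals)
```

Because the per-structure counts are independent and addition is associative, the work can be split across cores or machines without changing the result.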
• Efficient transmission and parsing of data is integral to Big Data initiatives, e.g., ADAM
• No compressed format for macromolecules
• Processing and analyzing macromolecules is a bottleneck
• Visualizing large structures is challenging
• Clean APIs to the data provided in commonly used languages
• No need to write your own parser
• No more parsers breaking
https://github.com/rcsb/mmtf-python
https://github.com/rcsb/mmtf-java
https://github.com/rcsb/mmtf-javascript
[Figure: Atoms per structure in the PDB]
[Figure: EM atoms added to the PDB]
[Figure: Whole PDB archive, GZIP compressed (MMTF vs. mmCIF)]
[Figure: Time to count all the elements in the PDB (MMTF vs. mmCIF)]
[Figure: Time taken to find all C-alpha-C-alpha contacts using mmCIF and MMTF]
[Figure: Experiments run per 24 hours (MMTF vs. mmCIF)]
Values legible in the charts: 30 GB, 7 GB, <2 minutes, 400 minutes. Other chart label: BioJava.
• The Protein Data Bank (PDB) is a worldwide archive of macromolecular structures
• Established in 1972, it has seen large growth over the past 30 years
• Data are currently stored and transmitted in the PDB and mmCIF archival file formats
• Such formats are not appropriate for web-based and Big Data applications