Small, fast and useful – MMTF a new paradigm in macromolecular data transmission – mmtf.rcsb.org
Anthony R. Bradley, Alexander S. Rose, Yana Valasatava, Jose M. Duarte, Andreas Prlić, Peter W. Rose
Yet another file format???
Applications
Funding and acknowledgements
BD2K Targeted Software Development, Grant Number: U01 CA198942
Cole Christie and Chris Randle
Get the data
Three ways to get involved
http://mmtf.rcsb.org/
Already several early adopters
APIs provided
•  Steep increase in atoms per structure (37% between 2012 and 2016)
•  10,000 new structures added per year
•  68 of the 100 largest structures were deposited in the past three years
•  Largest structure contains 2.5 M atoms
•  Electron microscopy (EM) has seen a sharp rise in recent years
Outcomes
•  Small
   ~75 % compression over GZIP-compressed mmCIF
•  Fast
   Parsing 2 orders of magnitude faster
•  Self-contained
   No need for calls to external resources
•  Useful
   Bonding (bond order) and secondary structure info included in all files
What is it?
•  Binary
   MessagePack (binary JSON format) used as a data container http://msgpack.org/
•  Custom lossless compression
   Delta, run-length and dictionary encoding used to compress data
•  Open-source
   Specification and software libraries developed under Apache/MIT licenses
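As a rough illustration of how delta and run-length encoding can shrink a column of integers (residue numbers, for example), here is a minimal Python sketch. The function names are hypothetical, and the scheme is a simplified stand-in rather than the exact codec defined in the MMTF specification; dictionary encoding is not shown.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum."""
    out: List[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values: List[int]) -> List[int]:
    """Flatten runs into [value, count, value, count, ...]."""
    out: List[int] = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1          # extend the current run
        else:
            out.extend([v, 1])    # start a new run
    return out


def run_length_decode(pairs: List[int]) -> List[int]:
    """Expand [value, count, ...] pairs back into the full list."""
    out: List[int] = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out.extend([value] * count)
    return out


# Ten consecutive residue numbers collapse to a single (value, count) pair.
residue_numbers = list(range(1, 11))
packed = run_length_encode(delta_encode(residue_numbers))
print(packed)  # [1, 10]
assert delta_decode(run_length_decode(packed)) == residue_numbers
```

Chaining the two steps works well on columns that mostly increase by a constant, because the delta step turns them into long runs of the same number. Both steps are exactly reversible, in line with the "lossless" claim above.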
  
Fast	
  
•  Whole PDB archive converted to MMTF weekly
•  Individual files available from a REST API:
   wget http://mmtf.rcsb.org/v0.2/full/4hhb.mmtf.gz
•  Whole archive as a Hadoop sequence file:
   wget http://mmtf.rcsb.org/v0.2/hadoopfiles/full.tar
•  More details:
   http://mmtf.rcsb.org/download.html
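The same REST endpoint can be called from code instead of wget. The sketch below uses only the Python standard library. The URL pattern is copied from the wget example, while the helper names are hypothetical, lowercasing the ID is an assumption based on that example, and the v0.2 endpoint may no longer be served.

```python
import gzip
import urllib.request

# URL prefix taken from the wget example on the poster.
BASE_URL = "http://mmtf.rcsb.org/v0.2/full"


def mmtf_url(pdb_id: str) -> str:
    """Build the download URL for one entry (assumes lowercase IDs, as in 4hhb)."""
    return f"{BASE_URL}/{pdb_id.lower()}.mmtf.gz"


def fetch_mmtf_bytes(pdb_id: str, timeout: float = 30.0) -> bytes:
    """Download one entry and return the gunzipped MessagePack payload."""
    with urllib.request.urlopen(mmtf_url(pdb_id), timeout=timeout) as response:
        return gzip.decompress(response.read())


print(mmtf_url("4HHB"))
```

The returned bytes are still MessagePack-encoded and compressed with the MMTF encodings, so turning them into coordinates is a job for one of the MMTF libraries rather than for this helper.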
  	
  
•  MMTF allows interactive data mining of the entire PDB archive
•  No need for SQL or setting up a database or schema
•  Queries on the entire archive in only a couple of minutes
1.  Use – use our API to do your own processing
2.  Adopt – incorporate MMTF into your toolkit
3.  Contribute – fork us on GitHub
Data mining
Efficient contact finding
Fragment generation
•  Generate all fragments from the protein chains in the PDB
•  Commonly done in, e.g., ab initio structure prediction
•  I/O is a key bottleneck in this process
•  MMTF allows such analysis to be done in a fraction of the time
•  More experiments can be done per day
•  No need to compromise on dataset size or parameters
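The poster does not define what a fragment is. The sketch below assumes the simplest reading: every contiguous fixed-length window over a chain's residue sequence. The function names and the toy chains are hypothetical, and a real pipeline would carry coordinates along with the residue letters.

```python
from typing import Dict, Iterator, Tuple


def chain_fragments(sequence: str, length: int) -> Iterator[Tuple[int, str]]:
    """Yield (start_index, fragment) for every contiguous window of `length`."""
    if length <= 0:
        raise ValueError("length must be positive")
    # Chains shorter than `length` give an empty range and so yield nothing.
    for start in range(len(sequence) - length + 1):
        yield start, sequence[start:start + length]


def all_fragments(
    chains: Dict[str, str], length: int
) -> Iterator[Tuple[str, int, str]]:
    """Yield (chain_id, start_index, fragment) across all chains."""
    for chain_id, sequence in chains.items():
        for start, fragment in chain_fragments(sequence, length):
            yield chain_id, start, fragment


# Toy input: chain "B" is shorter than the window and contributes no fragments.
toy_chains = {"A": "MKTAYIAK", "B": "GS"}
for chain_id, start, fragment in all_fragments(toy_chains, 3):
    print(chain_id, start, fragment)
```

Each chain is read once and every window is a cheap slice, so a loop like this spends most of its time loading and parsing input. That is the I/O bottleneck the bullets above refer to.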
  
Using a Mac mini with a 2.6 GHz Intel Core i5 (4 cores) and 16 GB RAM.
Small	
  
High performance analysis
Hadoop sequence files are optimized for fast parallel and sequential access

Spark is a fast in-memory big data engine with clean and expressive APIs
http://spark.apache.org/
	
  
•  APIs and tools designed using the Apache Spark framework for fast parallel in-memory processing
•  Spark handles running code in a multi-threaded manner, so there is no need to manage thread pools
•  Python, Java and Scala APIs available
•  Spark is used widely in other areas of bioinformatics (e.g., ADAM in genomics http://bdgenomics.org/)
Efficient hashing algorithm
Inefficient looping algorithm
•  Inter-atomic contacts are often analyzed, e.g., for empirical force fields
•  MMTF allows an efficient contact-finding algorithm to have a strong impact
•  Using mmCIF, the efficient algorithm provides only a ~10 % speedup
•  Using MMTF, the same algorithm gives a ~90 % speedup
•  MMTF promotes efficient downstream algorithm design
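The poster names an "efficient hashing algorithm" and an "inefficient looping algorithm" without giving details. One common reading is spatial grid hashing versus an all-pairs loop, and the sketch below compares the two on that assumption. The function names, the cutoff and the sample points are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float, float]


def contacts_looping(points: List[Point], cutoff: float) -> Set[Tuple[int, int]]:
    """Compare every pair of points: O(n^2) distance checks."""
    return {
        (i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if dist(points[i], points[j]) <= cutoff
    }


def contacts_hashing(points: List[Point], cutoff: float) -> Set[Tuple[int, int]]:
    """Hash points into cubic cells of side `cutoff`; compare neighbouring cells only."""
    grid: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for index, point in enumerate(points):
        grid[tuple(floor(c / cutoff) for c in point)].append(index)

    found: Set[Tuple[int, int]] = set()
    for (cx, cy, cz), members in grid.items():
        # Two points within `cutoff` always sit in the same or an adjacent cell.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i < j and dist(points[i], points[j]) <= cutoff:
                        found.add((i, j))
    return found


sample = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
    (5.0, 5.0, 5.0), (5.5, 5.0, 5.0),
    (20.0, 20.0, 20.0),
]
assert contacts_hashing(sample, 1.5) == contacts_looping(sample, 1.5)
print(sorted(contacts_hashing(sample, 1.5)))  # [(0, 1), (2, 3)]
```

The grid version only computes distances between points in the same or adjacent cells, so the number of checks grows roughly linearly with atom count at a fixed cutoff. Once the search itself is that cheap, parsing time dominates the total, which fits the observation above that the efficient algorithm pays off far more with MMTF than with mmCIF.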
  
Element     Occurrences    % of PDB
Carbon      431,487,468    43 %
Oxygen      174,153,905    17 %
Nitrogen    121,509,487    12 %
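A table like this comes down to a count-and-merge over every structure. The sketch below shows that step with a plain Python Counter. The toy lists stand in for the per-structure element arrays an MMTF decoder would supply, and the function names are hypothetical. On Spark the same logic would be a map followed by a reduce.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_elements(structures: Iterable[List[str]]) -> Counter:
    """Merge per-structure element symbols into one archive-wide tally."""
    totals: Counter = Counter()
    for elements in structures:
        totals.update(elements)
    return totals


def percentages(totals: Counter) -> Dict[str, float]:
    """Express each element's count as a percentage of all atoms counted."""
    grand_total = sum(totals.values())
    return {element: 100.0 * n / grand_total for element, n in totals.items()}


# Toy data: two tiny "structures", each given as a list of element symbols.
toy_structures = [["C", "C", "O", "N"], ["C", "O", "H", "H"]]
totals = count_elements(toy_structures)
for element, count in totals.most_common():
    print(f"{element:<3}{count:>4}{percentages(totals)[element]:>8.1f} %")
```

Because Counter objects can be added together, partial tallies from separate workers can be merged in any order. That property is what makes this kind of query easy to parallelise.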
  
•  Efficient transmission and parsing of data is integral to Big Data initiatives, e.g., ADAM
•  No compressed format exists for macromolecules
•  Processing and analyzing macromolecules is a bottleneck
•  Visualizing large structures is challenging
•  Clean APIs to the data provided in commonly used languages
•  No need to write your own parser
•  No more parsers breaking
https://github.com/rcsb/mmtf-python
https://github.com/rcsb/mmtf-java
https://github.com/rcsb/mmtf-javascript
Figure panels (charts not reproduced here; the comparison panels show MMTF against mmCIF):
•  Atoms per structure in the PDB
•  EM atoms added to the PDB
•  Whole PDB archive GZIP compressed
•  Time to count all the elements in the PDB
•  Time taken to find all C-alpha-C-alpha contacts using mmCIF and MMTF
•  Experiments run per 24 hours
Bar labels that survived extraction, without their panel assignment: 30 GB, 7 GB, <2 minutes, 400 minutes
BioJava	
  
•  The Protein Data Bank (PDB) is a worldwide archive of macromolecular structures
•  Established in 1972, it has seen large growth over the past 30 years
•  Data are currently stored and transmitted in the PDB and mmCIF archival file formats
•  Such formats are not appropriate for web-based and Big Data applications
