Compact representation
of 3D macromolecular
structures from the PDB
Presented by Yana Valasatava
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
The PDB evolving complexity
PDB archive: > 30 GB
~250 MB in mmCIF format
Structural biology meets the big-data era:
● Growing size: ~120K structures, with annual growth of ~10K structures
● Evolving complexity: growing compositional heterogeneity and size
● Increasing usage: > 300,000 users per month from over 160 countries
3J3Q
3J3Q has more than 1 million atoms
The PDB has more than 1 billion atoms
Scalability issues
★ Interactive visualization
○ slow network transfer
○ slow parsing
○ slow rendering
★ Mobile visualization
○ limited bandwidth
○ limited memory
★ Large-scale structural analysis
○ slow repeated I/O
○ slow repeated parsing
PDBx/mmCIF
Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes.
repetitive information
redundant annotations
inefficient representation
PDB/MMTF
The MacroMolecular Transmission Format
MMTF has the following advantages:
❏ it occupies less space (less disk I/O)
❏ it is faster to read (no time-consuming string parsing)
❏ it contains precalculated information useful for structural analysis and visualisation (covalent bonds and bond orders)
Fields:
○ Format data (e.g. the version number of the specification)
○ Metadata (e.g. rFree and resolution)
○ Structure data (e.g. number of models, chains, groups, atoms)
○ Chain data (e.g. list of chain IDs, chain names)
○ Group data (e.g. list of group names, formal charges, bonds)
○ Atom data (e.g. B-factors, coordinates, occupancies)
https://github.com/rcsb/mmtf/blob/master/spec.md
MMTF compression pipeline
[Figure: pipeline diagram. Stages labelled in the diagram: extract structural data; calculate bonds, SSE (secondary structure elements); dictionary encoding; integer encoding; delta encoding; run-length encoding; recursive indexing; GZIP.]
The binary container format of MMTF
Compression pipeline: dictionary encoding
Group  Id  Symb.  AtmId  ResId  ChainId  ResNum  x (Å)   y (Å)   z (Å)   Occ.  B-factor
ATOM   1   N      N      ARG    A        18      14.699  61.369  62.050  1.00  39.19
ATOM   2   C      CA     ARG    A        18      14.500  62.241  60.856  1.00  38.35
ATOM   3   C      C      ARG    A        18      13.762  61.516  59.729  1.00  36.05
{ "groupName": "ARG",
"singleLetterCode": "R",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C" ],
"elementList": [ "N", "C", "C"] }
index: 1
SER-GLY-ARG-SER-SER
groupTypeList: [ 2, 0, 1, 2, 2 ]
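The dictionary-encoding idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names GROUP_LIST, encode_group_types and decode_group_types are hypothetical and are not part of mmtf-python, the GLY and SER entries are filled in by analogy with the ARG entry shown above, and the atom lists are truncated to the three backbone atoms from the slide.

```python
# Illustrative sketch only: GROUP_LIST and the helper names are assumptions,
# not the mmtf-python API. Atom lists are truncated to N, CA, C as on the slide.
GROUP_LIST = [
    {"groupName": "GLY", "singleLetterCode": "G",
     "chemCompType": "PEPTIDE LINKING",
     "atomNameList": ["N", "CA", "C"], "elementList": ["N", "C", "C"]},
    {"groupName": "ARG", "singleLetterCode": "R",
     "chemCompType": "L-PEPTIDE LINKING",
     "atomNameList": ["N", "CA", "C"], "elementList": ["N", "C", "C"]},
    {"groupName": "SER", "singleLetterCode": "S",
     "chemCompType": "L-PEPTIDE LINKING",
     "atomNameList": ["N", "CA", "C"], "elementList": ["N", "C", "C"]},
]


def encode_group_types(residue_names, group_list):
    """Replace each residue name by the index of its entry in group_list."""
    index_by_name = {g["groupName"]: i for i, g in enumerate(group_list)}
    return [index_by_name[name] for name in residue_names]


def decode_group_types(group_type_list, group_list):
    """Look each index up again to recover the residue names."""
    return [group_list[i]["groupName"] for i in group_type_list]


sequence = "SER-GLY-ARG-SER-SER".split("-")
group_type_list = encode_group_types(sequence, GROUP_LIST)
print(group_type_list)  # [2, 0, 1, 2, 2]
```

Each per-residue description is stored once, and the chain itself shrinks to a list of small integers, which the later encoding steps can compress further.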
Compression pipeline: encodings
integer + delta: 14.699 -> 14699, 14.500 -> 14500; delta 14500 - 14699 = -199
delta + run-length: 1,2,3 -> 1,1,1 -> 1,3
integer encoding: floating-point numbers are mapped to integers
run-length encoding: stretches of equal values are represented by the value itself and the occurrence count
delta encoding: differences (deltas) between consecutive numbers are stored
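The three encodings defined above can be sketched as small Python functions. This is a minimal illustration, not the mmtf-python implementation: the function names are hypothetical, and the divisor of 1000 is an assumption chosen to match the 3-decimal coordinates in the example.

```python
from itertools import accumulate, groupby


def integer_encode(values, divisor=1000):
    """Map floats to integers by multiplying by divisor and rounding."""
    return [round(v * divisor) for v in values]


def delta_encode(values):
    """Keep the first value, then store each difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Undo delta encoding with a running sum."""
    return list(accumulate(deltas))


def run_length_encode(values):
    """Flatten runs of equal values into [value, count, value, count, ...]."""
    encoded = []
    for value, run in groupby(values):
        encoded.extend([value, len(list(run))])
    return encoded


def run_length_decode(encoded):
    """Expand [value, count, ...] pairs back into the original list."""
    decoded = []
    for value, count in zip(encoded[0::2], encoded[1::2]):
        decoded.extend([value] * count)
    return decoded


x_coords = [14.699, 14.500, 13.762]       # x column of the three ARG atoms
x_ints = integer_encode(x_coords)         # [14699, 14500, 13762]
x_deltas = delta_encode(x_ints)           # [14699, -199, -738]

atom_ids = [1, 2, 3]                      # serial numbers of the same atoms
id_packed = run_length_encode(delta_encode(atom_ids))  # [1, 1, 1] -> [1, 3]
print(x_deltas, id_packed)
```

Coordinates suit integer plus delta encoding because neighbouring atoms are close in space, so the deltas are small. Serial numbers suit delta plus run-length encoding because the deltas are almost always 1.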
Compression pipeline: Recursive Indexing
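Array of 8-bit integer values, so the open interval is (-128, 127). Values that do not fit strictly inside the interval are written as repeated limit values followed by the remainder:
Recursive Indexing: [-50, -128, 7, 127, 268] -> [-50, -128, 0, 7, 127, 0, 127, 127, 14]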
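A minimal Python sketch of recursive indexing for signed 8-bit values follows. It assumes the limits -128 and 127 from the slide. The function names are hypothetical and not taken from any MMTF library.

```python
INT8_MIN, INT8_MAX = -128, 127


def recursive_index_encode(values, lo=INT8_MIN, hi=INT8_MAX):
    """Split each value into limit-sized steps plus a final remainder."""
    encoded = []
    for value in values:
        if value >= 0:
            while value >= hi:      # emit the upper limit until the rest fits
                encoded.append(hi)
                value -= hi
        else:
            while value <= lo:      # emit the lower limit until the rest fits
                encoded.append(lo)
                value -= lo
        encoded.append(value)       # remainder lies strictly inside (lo, hi)
    return encoded


def recursive_index_decode(encoded, lo=INT8_MIN, hi=INT8_MAX):
    """Sum consecutive entries until a non-limit entry closes the value."""
    decoded, total = [], 0
    for entry in encoded:
        total += entry
        if entry != hi and entry != lo:
            decoded.append(total)
            total = 0
    return decoded


original = [-50, -128, 7, 127, 268]
packed = recursive_index_encode(original)
print(packed)  # [-50, -128, 0, 7, 127, 0, 127, 127, 14]
```

Because a limit value always means "more follows", a value exactly equal to a limit (here -128 and 127) is followed by an explicit 0 remainder, as in the slide's example.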
Overview of data
Full format
• all atoms (useful for structural bioinformatics analysis)
• coordinates with 3 decimal places of precision (no loss after decoding)
Reduced format
• C-alpha/phosphate backbone atoms and ligands (useful for visualisation and some structural bioinformatics)
• coordinates with 1 decimal place of precision (a further size reduction of almost 40%)
• exactly the same data structure as the full format (parsers work for both)
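The difference between the two representations can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the atom records are plain dictionaries with hypothetical keys (name, is_ligand, xyz), trace atoms are recognised by the atom names CA and P, and a divisor of 10 stands in for the 1-decimal precision.

```python
TRACE_ATOM_NAMES = {"CA", "P"}  # C-alpha (protein) and phosphate (nucleic acid)


def reduce_atoms(atoms, divisor=10):
    """Keep trace and ligand atoms; store coordinates as integers of 1/divisor Å."""
    reduced = []
    for atom in atoms:
        if atom["name"] in TRACE_ATOM_NAMES or atom["is_ligand"]:
            coords = tuple(round(c * divisor) for c in atom["xyz"])
            reduced.append((atom["name"], coords))
    return reduced


# The three ARG 18 backbone atoms used in the earlier slides.
atoms = [
    {"name": "N", "is_ligand": False, "xyz": (14.699, 61.369, 62.050)},
    {"name": "CA", "is_ligand": False, "xyz": (14.500, 62.241, 60.856)},
    {"name": "C", "is_ligand": False, "xyz": (13.762, 61.516, 59.729)},
]
reduced = reduce_atoms(atoms)
print(reduced)  # [('CA', (145, 622, 609))]
```

Fewer atoms and smaller integers both shrink the encoded arrays, while the layout of the result stays the same as for the full set, so the same decoding code can serve both.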
MMTF size and parsing speed
* Parsing using Java libraries
Using MMTF
To efficiently store, transmit, and visualize the 3D structures of biological macromolecules
To perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
Presented by Anthony Bradley
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
Goals
• Analysis should be easy and simple
• Whole-archive analysis of the PDB should be trivial AND fast
• Big Data tools (e.g. Spark and Hadoop) are available
mmtf-python
mmtf-java
Nobody should (have to) write their own parser. Ever.
MMTF-Spark - Simple API
Continued…..
Data mining - speed advantage
Contact finding
Pros and cons
Pros:
● Looping through the whole library performing simple analyses
● Simple to parallelize code
● Much more complete data
Cons:
● Tied to Java
● Not a magic unicorn
Thanks!
• http://mmtf.rcsb.org/
• https://github.com/rcsb/mmtf-javascript
• https://github.com/rcsb/mmtf-java
• https://github.com/rcsb/mmtf-python
• http://spark.apache.org/
Acknowledgements
NCI/NIH (U01 CA198942)

CADD meeting 08-30-2016
