Open, public access to structural data is of utmost importance for validation, development, testing and training. The Electron Microscopy Data Bank (EMDB) is the authoritative archive for 3DEM data. In 2014 PDBe started EMPIAR, the Electron Microscopy Pilot Image Archive, to store raw image data related to EMDB structures. The main challenge has been the storage and transfer of large datasets. EMPIAR is now fully functional, with routine uploads and downloads in the terabyte range. Its success has spurred interest in wider bio-imaging circles as a working example of image archiving, and possibly even as a prototype for a broader bio-imaging archive. I will describe EMPIAR and discuss the prospects for public archiving of bio-imaging data.
4. Molecular and Cellular Structure
• Maintain and manage archives
• PDB for atomic coordinate models
• EMDB for 3DEM reconstructions
• EMPIAR for 3DEM raw data
• Develop and maintain web services: searching, visualisation and validation
• Facilitate community-wide initiatives
• Key themes: integration with other bioinformatics resources and imaging scales and validation
5. Structural data archives
Archive | Type of data | Founded | Organization | Funding | # people | # entries | Size
PDB | Atomic coordinate models structures | 1971 | wwPDB (EBI, RCSB, PDBj, BMRB) | Core + grants | 60-80 | 124286 | 1 TB (8 MB)
EMDB | 3DEM volume structures | 2002 | EBI (+ RCSB, PDBj) | Core + grants | <10 | 4276 | 340 GB (80 MB)
EMPIAR | Raw image data for EMDB structures | 2014 | EBI | grant | <5 | 61 | 40 TB (660 GB)
Stats until 9th Nov 2016
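The bracketed sizes in the table are not labelled on the slide. Dividing each archive's total size by its entry count gives values close to them, so they appear to be the mean size per entry. A quick check, assuming decimal units (1 TB = 1000 GB = 1,000,000 MB) and using only the totals and entry counts from the table:

```python
# Totals and entry counts copied from the table (statistics to 9 Nov 2016).
# Assumption: decimal units, i.e. 1 TB = 1000 GB = 1,000,000 MB.
ARCHIVES = {
    "PDB": {"total_mb": 1_000_000, "entries": 124286},
    "EMDB": {"total_mb": 340_000, "entries": 4276},
    "EMPIAR": {"total_mb": 40_000_000, "entries": 61},
}


def mean_entry_size_mb(total_mb, entries):
    """Return the mean size of one entry in megabytes."""
    return total_mb / entries


for name, row in ARCHIVES.items():
    size = mean_entry_size_mb(row["total_mb"], row["entries"])
    print(f"{name}: about {size:,.0f} MB per entry")
```

The results land near 8 MB, 80 MB and roughly 656 GB, consistent with the rounded figures on the slide: a raw EMPIAR dataset is several orders of magnitude larger than a map or a model.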
6. What goes where...
• Final single-particle and sub-tomogram average maps must go to EMDB (tomograms strongly recommended)
• Fitted models must go to PDB
• Deposition of raw image data to EMPIAR is encouraged
EMDB: Final map
EMPIAR: Raw image data
PDB: Fitted model
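The routing rules on this slide can be written down as a small lookup table. This is only an illustrative sketch: the key names and the function `where_to_deposit` are hypothetical and do not correspond to any official deposition schema or API.

```python
# Hypothetical lookup table summarising the slide; the key names are
# illustrative only and are not part of any official deposition system.
DEPOSITION_RULES = {
    "single_particle_map": ("EMDB", "required"),
    "subtomogram_average_map": ("EMDB", "required"),
    "tomogram": ("EMDB", "strongly recommended"),
    "fitted_model": ("PDB", "required"),
    "raw_image_data": ("EMPIAR", "encouraged"),
}


def where_to_deposit(data_kind):
    """Return (archive, status) for a kind of data, or raise ValueError."""
    try:
        return DEPOSITION_RULES[data_kind]
    except KeyError:
        raise ValueError(f"unknown data kind: {data_kind!r}") from None


print(where_to_deposit("raw_image_data"))
```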
7. Benefits of public archiving
• Reuse of data
• starting models
• compare structures of different functional states
• different emphasis may lead to new discoveries
• Validation, methods development, testing, training
• Safe storage of data
• Integration of data with other public archives
• A resource for data mining
• Enables a bird's-eye perspective of the field
8. What does archiving involve?
• Working with the community, partners and journals to achieve a consensus on practices, policies and procedures
• Adapting to changing needs of data and metadata collection
• new sample preparation methods
• new validation methods
• Providing means to deposit data, e.g., web-based deposition systems
• Curating data – automated + manual, remediation
• maximize structured annotation, minimize free-text
• Developing added value resources for searching, validating and visualizing data
9. Viability
• Community support
• Value – uploads versus downloads
• Data transfer technologies – Aspera, Globus
• Data storage – file systems, object stores
• Data fidelity – quality measures and validation
• Annotation – structured versus unstructured
• Centralised versus distributed
10. EMPIAR
• Electron microscopy pilot (or public?) image archive
• Started in 2014
• Raw 2D image datasets related to EMDB
• Usage: validation, development, testing, teaching and…
• Safe storage of your data!
• Was the source of data for the EM Map Validation Challenge
• Multi-frame micrographs, averaged micrographs, particle stacks, tilt series
• Uses Aspera, Globus, FTP and HTTP for data transfers
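To see why the choice of transfer technology matters at this scale, a rough estimate of transfer time helps. The sketch below takes the 660 GB per-entry figure quoted for EMPIAR in the archive table. The sustained rates are assumptions chosen for illustration, not measured throughputs of Aspera, Globus, FTP or HTTP.

```python
def transfer_hours(size_gb, rate_mbit_per_s):
    """Hours needed to move size_gb (decimal gigabytes) at a sustained rate in Mbit/s."""
    size_megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return size_megabits / rate_mbit_per_s / 3600


# Assumed sustained rates in Mbit/s; real throughput depends on protocol,
# network path and storage at both ends.
for rate in (100, 1000, 10000):
    print(f"{rate:>6} Mbit/s: {transfer_hours(660, rate):.1f} h")
```

At an assumed 100 Mbit/s a single dataset of this size takes more than half a day, which is why high-throughput transfer tools matter for routine use.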
24. Expert workshop on “3D segmentations and transformations - building bridges between cellular and molecular structural biology”
Madingley Hall, 6-7 Dec 2015
25. File format and translators
• EMDB Segmentation File Format (EMDB-SFF)
• adds structured biological annotation
• handles transforms between tomograms and subtomograms
• Python scripts to read Segger, IMOD and Amira and convert to EMDB-SFF
• Working on displaying segmentations in OMERO
• Public open source distribution through CCP-EM
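To illustrate what "transforms between tomograms and subtomograms" means in practice, the sketch below maps a point from subtomogram coordinates into tomogram coordinates with a 3x4 affine matrix. It is a generic illustration, not the EMDB-SFF file layout or the API of the translator scripts. The function name, the matrix and the coordinates are all hypothetical.

```python
def apply_transform(matrix, point):
    """Apply a 3x4 affine transform (linear part plus translation) to an (x, y, z) point."""
    x, y, z = point
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in matrix
    )


# Hypothetical transform: no rotation, subtomogram origin located at
# (120, 45, 30) in the coordinate frame of the parent tomogram.
SUBTOMO_TO_TOMO = [
    [1.0, 0.0, 0.0, 120.0],
    [0.0, 1.0, 0.0, 45.0],
    [0.0, 0.0, 1.0, 30.0],
]

print(apply_transform(SUBTOMO_TO_TOMO, (10.0, 10.0, 10.0)))
```

Storing one such matrix per segment instance lets a single subtomogram-average segmentation be placed at every position where the particle occurs in the tomogram.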
26. Future directions
• Archiving for related imaging modalities including
• 3D scanning electron microscopy
• correlative light and electron microscopy
• soft X-ray tomography
• Data harvesting pipelines
• Validation
• Deposition support for new kinds of validation data
• Validation servers, e.g., for visual analysis, map versus model FSC
• Data-mining EMDB to develop new validation metrics
• Fast archive-wide sub-structure volumetric (or shape-based) searches
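For reference, the Fourier shell correlation (FSC) named in the validation bullet is conventionally defined per resolution shell as below. For a map-versus-model comparison, F_1 would be the Fourier transform of the deposited map and F_2 that of a map simulated from the fitted model. The symbols are the standard ones and do not appear on the slide.

```latex
\mathrm{FSC}(k) =
  \frac{\sum_{\mathbf{k}' \in S_k} F_1(\mathbf{k}')\, F_2^{*}(\mathbf{k}')}
       {\sqrt{\sum_{\mathbf{k}' \in S_k} \left|F_1(\mathbf{k}')\right|^{2}
              \;\sum_{\mathbf{k}' \in S_k} \left|F_2(\mathbf{k}')\right|^{2}}}
```

Here S_k is the set of Fourier voxels in the shell of radius k, and the asterisk denotes complex conjugation. Values near 1 indicate strong agreement at that spatial frequency, and values near 0 indicate none.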
27. Acknowledgements
• Gerard Kleywegt
• EM group
• Sanja Abbott
• Andrii Iudin
• Paul Korir
• Carlos Lugo
• Eduardo Sanz Garcia
• Jose Salavert Torres (UPV)
• Ingvar Lagerstedt (EL)
• Maya Holmdahl (UU)
• Vladislav Lysenkov (MAMK)
• Birkbeck
• Maya Topf
• Agnel Praveen Joseph
• Helen Saibil
• Baylor – Wah Chiu
• RCSB – Cathy Lawson
• Francis Crick
• Lucy Collinson
• Raffaella Carzaniga
• STFC
• Martyn Winn
• Tom Burnley
• Dundee
• Jason Swedlow
• Josh Moore
• CNB Madrid
• Jose Maria Carazo
• Pablo Conesa
• Jose Miguel de la Rosa Trevin
• Joan Segura Mora
• And many more!