SBGrid (Morin et al., 2013, eLife and www.sbgrid.org) is a Harvard-based global computing consortium for structural biology, with a primary focus on the curation of research software. Dr. Sliz will discuss a recent SBGrid project that aims to establish a repository for experimental datasets from SBGrid laboratories. The talk will address the handling of large data volumes, data validation, and repository sustainability.
Big Data Repository for Structural Biology: Challenges and Opportunities by Piotr Sliz
1. Big Data Repository for
Structural Biology:
Challenges and Opportunities
Piotr Sliz, PhD
sliz@hkl.hms.harvard.edu
SBGrid: http://sbgrid.org
SBGrid Data Bank: http://data.sbgrid.org
Twitter: @SBGrid
YouTube: SBGridTV
SBGrid
Consortium
Support Center at Harvard Medical School
300 Research Groups
13 Countries
Long Term Sustainability: Membership Fee
Harvard Medical School
2. SBGrid supports compilation, installation
and upgrades of ~300 scientific applications
Several Software Categories (EM, NMR, X-ray, Comp Chem, etc.)
Multiple versions of most applications
OS X (10.6-10.10) and Linux support (CentOS 5-7)
No additional, end-user configuration required
Software always works = more time for research
Core Mission:
Grid Computing (Open Science Grid VO + Grid Portal)
General Research Infrastructure (Boston Area)
Training (workshops, software cataloguing, webtales)
Webinars at youtube.com/SBGridTV
Developer Resources
Advocating for Open Source Software
Morin et al. Shining Light into Black Boxes. Science, 2012.
Other Activities:
Additional Publications
Primary Citation:
Other Citations:
3. New Opportunity:
Data
anonymous SBGrid member 1:
“we cannot find the original frames for many of our
structures (move from X to Y), including recent high
impact projects. What do you recommend that we do?”
anonymous SBGrid member 2:
“I was able to locate the data directory
but I must have done a good job
cleaning up the disk space before I
left: usually there are only two .img files
left in the data directory, the 1st and
the last image of a full run.”
Lack of Storage Support
for Diffraction Images
derive
reproduce
improve
correct
• Stokes-Rees, I., Levesque, I., Murphy, F.V., Yang, W., Deacon, A., and Sliz, P. (2012). Adapting federated cyberinfrastructure for shared data collection facilities in structural biology. J. Synchrotron Radiat. 19, 462–467.
• Terwilliger, T.C., and Bricogne, G. (2014). Continuous mutual improvement of macromolecular structure models in the PDB and of X-ray crystallographic software: the dual role of deposited experimental data. Acta Crystallogr. D Biol. Crystallogr. 70, 2533–2543.
• Terwilliger, T.C. (2014). Archiving raw crystallographic data. Acta Crystallogr. D Biol. Crystallogr.
• Guss, J.M., and McMahon, B. (2014). How to make deposition of images a reality. Acta Crystallogr. D Biol. Crystallogr. 70, 2520–2532.
4. Focus on Primary
Data
SBGrid Data Bank. Pilot: May 1st, Production: June 1st, 2015
EZID
Dataset
Lock
BIODBCORE-000683
re3data.org
Data Mining
and
Annotation
7. Data Access Alliance:
Make Data easily accessible for reprocessing
Minimize Project Cost
Increase Redundancy
Challenges
Dataset Size (APIs, Data Access Alliance)
Journal + Data Automation
automated embargo release
cross-referencing
coordination/communication with journals
Data vs Journal Citations
Metrics:
Dataset Deposition Rates
Data Use: DAA Membership vs. direct downloads
Dataset Quality (Level 0-2)
Data Citations
Master Format
OME-TIFF vs DataCite vs Dataverse schema
Transition to a Research Data Management Software
ORCID integration and adoption
8. Opportunities
Better support to ~300 structural biology laboratories:
Compliance
Reproducibility
Integration with PDB and other repositories
Other data types in addition to X-ray diffraction
Thank you
Piotr Sliz, PhD
sliz@hkl.hms.harvard.edu
SBGrid: http://sbgrid.org
SBGrid Data Bank: http://data.sbgrid.org
Twitter: @SBGrid
YouTube: SBGridTV
Stephanie Socias
Pete Meyer
Merce Crosas