Secure data management, analysis, infrastructure and policy in an international context
1. Secure data management, analysis,
infrastructure and policy in an
international context
Steven Newhouse
Head of Technical Services, EMBL-EBI
2. International Collaborative Data Analysis
• Distributed data generation
• Distributed data analysis
• Distributed (in)formal governance
• Increasingly sensitive data
• Increasingly valuable analysis resources
• Increasingly moving closer to production
3. Some Examples
• Worldwide Large Hadron Collider Computing Grid
(WLCG)
• A worldwide federation of federated sites
• EMBL-EBI and ELIXIR
• Infrastructure to support multiple communities
• Global Alliance for Genomics and Health (GA4GH)
• International collaboration to support
4. WLCG Collaboration
WLCG Workshop, Manchester, 19 June 2017
April 2017:
- 63 MoUs
- 167 sites; 42 countries
- 985 PB storage (395 PB disk, 590 PB tape)
5. Security Policy & Operations in e-Infrastructures
• e-Infrastructures:
• Generally federation of clusters/clouds in research
community
• Structured geographically nationally and/or regionally
• Make local resources available to remote users
• Build trust around common policies
• Site Security Policy: What a site commits to
• Acceptable Use Policy: What a user commits to
• Security Operations
• Monitor use to contain & eliminate any security breach
6. WISE
• WISE: Wise Information Security for e-Infrastructures
• Community activity driven by the e-Infrastructures
• Supporting user communities that span e-Infrastructures
• Active Working Groups
• Security for Collaborating e-Infrastructures
• Security Training and Awareness
• Risk Assessment
• Security in Big and Open Data
7. Security for Collaborating e-Infrastructures
Build a trust framework to enable interoperation between
e-Infrastructures and to manage cross-infrastructure security
risks
• Manage risk through mitigation & counter measures
• Minimise impact of a security incident
• Identify the cause of incidents to stop repeats
• Identify users & services to control access to resources
8. Building trust by exposing maturity
• Expose Maturity across different Capabilities
• Operational Security, Incident Response & Traceability
• Participant Responsibilities
• Data Protection
• Capability Maturity Levels
• 0: Not implemented for critical services
• 1: Implemented for critical services but not documented
• 2: Implemented and documented for critical services
• 3: Implemented, documented and reviewed
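The four maturity levels lend themselves to a simple ordered comparison. The sketch below is illustrative only: the capability names, the example site and the partner's requirements are hypothetical, and the deck does not prescribe any particular encoding.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """Capability maturity levels as listed on the slide."""
    NOT_IMPLEMENTED = 0   # not implemented for critical services
    IMPLEMENTED = 1       # implemented but not documented
    DOCUMENTED = 2        # implemented and documented
    REVIEWED = 3          # implemented, documented and reviewed


def unmet(exposed, required):
    """Return the capabilities whose exposed level is below the required one.

    A capability that is not exposed at all is treated as level 0.
    """
    return sorted(
        cap for cap, level in required.items()
        if exposed.get(cap, Maturity.NOT_IMPLEMENTED) < level
    )


# Hypothetical self-assessment published by one infrastructure.
site = {
    "operational_security": Maturity.REVIEWED,
    "incident_response": Maturity.DOCUMENTED,
    "traceability": Maturity.IMPLEMENTED,
}

# Hypothetical minimum levels a collaborating infrastructure asks for.
partner_requires = {
    "operational_security": Maturity.DOCUMENTED,
    "incident_response": Maturity.DOCUMENTED,
    "traceability": Maturity.DOCUMENTED,
    "data_protection": Maturity.IMPLEMENTED,
}

gaps = unmet(site, partner_requires)
print(gaps)
```

Here the gaps are the capabilities the site would need to raise, or expose, before the partner's requirements are met.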
9. EMBL sites – over 1600 people and more than 80 nationalities
• Hamburg: Structural biology
• Heidelberg: Life sciences
• Rome: Epigenetics and neurobiology
• Cambridge (EMBL-EBI): Bioinformatics
• Grenoble: Structural biology
• Barcelona: Tissue biology and disease modelling
10. Data Resources at EMBL-EBI
Literature & ontologies
• Experimental Factor
Ontology
• Gene Ontology
• BioStudies
• Europe PMC
Chemical biology
• ChEBI
• ChEMBL
• SureChEMBL
Molecular structures
• Protein Data Bank in Europe
• Electron Microscopy Data Bank
Gene, protein & metabolite expression
• Expression Atlas
• Metabolights
• PRIDE
• RNA Central
Protein sequences,
families & motifs
• InterPro
• Pfam
• UniProt
Genes, genomes & variation
• Ensembl
• Ensembl Genomes
• GWAS Catalog
• Metagenomics portal
Systems
• BioModels
• BioSamples
• Enzyme Portal
• IntAct
• Reactome
Molecular Archives
• European Nucleotide Archive
• European Variation Archive
• European Genome-phenome Archive
• ArrayExpress
11. Big Data, Big Demand
• ~25 million requests to EMBL-EBI websites every day
• Scientists at over 5 million unique sites use EMBL-EBI websites
• 200 petabytes of scientific data managed by EMBL
12. Storage Use Cases are Evolving
• Evolving away from ‘simple’ archiving
• Challenge used to be scale, now tackling diversity
• Not just diversity in type, but diversity in access
• Common use case
• Public data embargoed before publication
• Hosting sensitive data
• European Genome-phenome Archive (EGA)
• Analysing sensitive data
• Formal access to named individuals for specific research
goals
13. Classifying and controlling the data
• What data do we store?
• Personal, Scientific Research, Administrative, Professional,
Private
• How sensitive is the data?
• Controlled, Confidential, Restricted, Public
• What are the storage options?
• ‘Vault’, Managed, Standard, Any Cloud, EU Cloud, Hosting
End up with a matrix describing what can go
where!
14. Data Sensitivity Classification
Data Type | On Site (inc. Embassy Cloud): Confidential or Controlled | On Site: Restricted | On Site: Restricted Public or Controlled Public | Off-Site: Confidential or Controlled | Off-Site: Restricted | Off-Site: Restricted Public or Controlled Public
Scientific Research | Vault (as contains Personal Data) | Managed | Standard | EMBL Hosting | EMBL Hosting or as specified by the Data Access agreement | Any
Professional | N/A | Standard | Standard | N/A | EMBL Hosting | Any
Administrative | SAP Facility (as contains Personal Data) | Managed | Standard | EMBL Hosting | EMBL Hosting | EMBL Hosting
Private | Standard | Standard | Standard | Any | Any | Any
Personal | On Site: Only as part of the Vault (Scientific Data) or SAP Facility (Administrative Data) | Off-Site: EMBL Hosting
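A matrix like this can be held as a plain lookup table. The sketch below is one possible encoding, not EMBL's implementation: the function and constant names are hypothetical, and the column reading (three sensitivity columns on site, then the same three off site) is an assumption about the slide's layout. The Personal row is left out because it spans columns rather than filling one cell per level.

```python
ON_SITE, OFF_SITE = "on_site", "off_site"

# Column order as read from the slide (assumed layout).
LEVELS = (
    "confidential_or_controlled",
    "restricted",
    "restricted_public_or_controlled_public",
)

# Each tuple follows the order of LEVELS; None stands for "N/A".
PLACEMENT = {
    "scientific_research": {
        ON_SITE: ("Vault", "Managed", "Standard"),
        OFF_SITE: (
            "EMBL Hosting",
            "EMBL Hosting or as specified by the Data Access agreement",
            "Any",
        ),
    },
    "professional": {
        ON_SITE: (None, "Standard", "Standard"),
        OFF_SITE: (None, "EMBL Hosting", "Any"),
    },
    "administrative": {
        ON_SITE: ("SAP Facility", "Managed", "Standard"),
        OFF_SITE: ("EMBL Hosting", "EMBL Hosting", "EMBL Hosting"),
    },
    "private": {
        ON_SITE: ("Standard", "Standard", "Standard"),
        OFF_SITE: ("Any", "Any", "Any"),
    },
}


def allowed_storage(data_type, level, location):
    """Look up where data of this type and sensitivity may be stored."""
    row = PLACEMENT[data_type][location]
    return row[LEVELS.index(level)]


print(allowed_storage("scientific_research", "restricted", ON_SITE))
```

Keeping the matrix as data rather than as branching logic means a policy change is a one-cell edit that can be reviewed directly against the slide.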
15. European Genome-Phenome Archive
• Data hosted by EMBL-EBI and CRG
• Several PB and growing
• Data sets managed through individual Data Access
Committees
• EMBL-EBI data stored in the ‘vault’
• Isolated network area in ISO27K leased data centre space
• Requires two-factor authentication to access
• Data encrypted at rest
• Data released to specific individuals
• Encrypted with unique individual key
16. ELIXIR – Research Infrastructure for Life Science
• Compute: Access, Exchange & Compute on sensitive data
• Data: Sustain core data resources
• Tools: Services & connectors to drive access and exploitation
• Standards: Integration and interoperability of data and services
• Training: Professional skills for managing and exploiting data
17. ELIXIR: European Open Science Cloud
• Cloud activities to support BMS Research Infrastructures
• Commercial cloud providers: Helix Nebula Science Cloud, …
• Community cloud providers: EMBL-EBI, CSC, de.NBI, …
• Sensitive data may have complex requirements
• Not to leave institution or legal jurisdiction
• National legal requirements
• Specific data protection requirements
• Compile maturity matrix around key security features
• Map user requirements to compliant cloud providers
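Mapping requirements onto a maturity matrix can be pictured as a filter over provider records. This is a minimal sketch with made-up data: the provider names, jurisdictions and feature labels are all hypothetical and do not describe any provider named in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """One row of a (hypothetical) security-feature matrix."""
    name: str
    jurisdiction: str
    features: frozenset


def compliant(providers, required_features, allowed_jurisdictions):
    """Return names of providers that offer every required feature
    and operate in one of the allowed jurisdictions."""
    return [
        p.name
        for p in providers
        if required_features <= p.features
        and p.jurisdiction in allowed_jurisdictions
    ]


# Illustrative matrix; none of these entries are real assessments.
matrix = [
    Provider("provider_a", "EU",
             frozenset({"encryption_at_rest", "two_factor_auth", "audit_log"})),
    Provider("provider_b", "EU",
             frozenset({"encryption_at_rest"})),
    Provider("provider_c", "non-EU",
             frozenset({"encryption_at_rest", "two_factor_auth", "audit_log"})),
]

# A user whose data must stay in the EU and needs two security features.
matches = compliant(
    matrix,
    required_features=frozenset({"encryption_at_rest", "two_factor_auth"}),
    allowed_jurisdictions={"EU"},
)
print(matches)
```

Jurisdiction is kept separate from the feature set because the slide treats "not to leave institution or legal jurisdiction" as its own class of requirement.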
20. Overview (genomicsandhealth.org)
• Data Security Work Stream helps assess the security risks
associated with new GA4GH standards
• Work Stream Standards-Development Activity Timeline
• At project start: assessment of security risks associated
with the use case(s) to be addressed
• Prior to standard release: assessment of how the standard
has addressed identified risks, and identification of
residual risk
21. Breach Response Strategy
• Projected timeline: Begun at 2017 Plenary, projected end
date TBD
• Milestones
1) Write Scope and Principles document
2) Inventory practices in place with Driver Projects
3) Define a policy for sharing breach data
4) Develop protocol for sharing breach data
5) Define strategy for responding to breaches associated
with GA4GH standards
22. Authentication and Authorization Infrastructure (AAI)
• Projected timeline: Identification/authentication
development begun in 2017; end date TBD
• Milestones
1) Document OpenID Connect profile developed for and
implemented by ELIXIR Beacons
2) Define authorization use cases
3) Document standard GA4GH OAuth 2.0 authorization
profile for RESTful APIs
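For orientation, the usual way an OAuth 2.0 access token reaches a RESTful API is as a bearer token in the Authorization header. The sketch below shows only that generic mechanism using Python's standard library; it is not the GA4GH profile, and the URL and token are placeholders. The request is built but not sent.

```python
import urllib.request


def authorized_request(url, access_token):
    """Build a GET request carrying an OAuth 2.0 bearer token.

    Obtaining the token (for example through an OpenID Connect
    login) is outside the scope of this sketch.
    """
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Bearer {access_token}")
    req.add_header("Accept", "application/json")
    return req


# Placeholder endpoint and token, for illustration only.
req = authorized_request("https://beacon.example.org/query", "example-token")
print(req.get_method(), req.full_url)
```

The API server then validates the token and decides what the caller may access, which is where the authorization use cases in milestone 2 come in.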
23. Linkages with Other Work Streams
• Breach Information Exchange protocol will be informed by
legal, regulatory, and ethical guidance provided by
Regulatory and Ethics Work Stream
• AAI profiles will consume vocabulary and ontology
developed by Data Use and Research Identities (DURI) Work
Stream
• AAI use cases will be based on APIs being defined by
Genomic Knowledge Sharing (GKS), Clinical and Phenotypic
Data Capture, and Discovery Work Streams
24. Conclusions
• One size does not fit all
• But there are some common approaches that can be
adopted
• Challenge is to build scalable trust networks
• ‘Tea and biscuits’ strategy
• Having confidence in those running sites & services
• Security is just one aspect of data protection
• Understand the data and what you are protecting it from