Embl ebi use-cases_-_t.wildish

Use-cases for the ARCHIVER project
The European Bioinformatics Institute
Tony Wildish
wildish@ebi.ac.uk

What is EMBL-EBI?
• Europe’s home for biological data services, research and training
• A trusted data provider for the life sciences
• Part of the European Molecular Biology Laboratory, an
intergovernmental research organisation
• International: 650 members of staff from 66 nations
• Home of the ELIXIR Technical hub.

Our mission
• Deliver excellent research
• Train the next generation of scientists
• Engage with industry
• Coordinate bioinformatics in Europe
• Deliver scientific services

The European Molecular Biology Laboratory
6 sites in Europe, >1700 personnel, 80+ nationalities
• Heidelberg, Germany: Main Laboratory
• Barcelona, Spain: Tissue Biology, Disease Modeling
• Hinxton, Cambridge, UK: Bioinformatics
• Rome, Italy: Mouse Biology
• Grenoble, France: Structural Biology
• Hamburg, Germany: Structural Biology

Database interactions
• Our collaborative community
facilitates social, scientific and
technical interactions
• This image shows internal
interactions between data
resources, as determined by
the exchange of data.
• The width of each internal arc is
weighted according to the number
of different data types exchanged.

Increasing Data, Increasing Analysis
Storage growth at EBI
• ~40-50% per year
• i.e. doubling every two
years
• No reason to expect
that to slow down
EGA and ENA account for
the bulk of the data
• DNA sequences
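
The link between the annual growth rate and the doubling time can be checked with a short calculation. This is a minimal sketch that assumes constant compound growth; the function name is illustrative and not from the slides.

```python
import math

def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant annual growth rate (0.4 means 40%)."""
    return math.log(2) / math.log(1 + annual_growth)

# The slide quotes ~40-50% growth per year.
for rate in (0.40, 0.50):
    print(f"{rate:.0%} per year -> doubles in {doubling_time_years(rate):.2f} years")
```

Under that assumption, 40% per year doubles in about 2.1 years and 50% per year in about 1.7 years, which is consistent with "doubling every two years".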

Who uses EMBL-EBI services?
See the live map at www.ebi.ac.uk/about/our-impact

Where does our data come from?

Data characteristics
DNA sequence data
○ The bulk of our data; files range from a few MB up to many tens of GB
○ With ‘long-read’ sequencing technology, can we expect file sizes to increase?
Lifetime
○ EBI has custodial responsibility, most of our data is stored ‘forever’
○ Data is immutable (but may be versioned)
Analyses
○ Assembly: stream/index whole file, then random access string matching
○ Query: byte-range lookup
Access
○ POSIX, FTP, HTTP, S3…
○ Data discovery by portal lookup, dedicated portals with cross-references
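
A byte-range lookup over HTTP can be sketched with the standard `Range` request header. This is only an illustration of the access pattern: the URL and the helper name `range_request` are hypothetical, and nothing here is specific to EMBL-EBI services.

```python
import urllib.request

def range_request(url: str, start: int, length: int) -> urllib.request.Request:
    """Build a GET request for `length` bytes starting at byte offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive at both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})

# Hypothetical URL, for illustration only; no network access happens here.
req = range_request("https://example.org/data/sample.bam", start=1_000_000, length=4096)
print(req.get_header("Range"))

# To fetch the bytes: urllib.request.urlopen(req).read()
# A server that honours the range replies with status 206 (Partial Content).
```

The same idea applies to S3-style object stores, which also accept a byte range on object reads.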

Privacy, security
Public
○ Available without authorisation or identification – anonymous FTP
Private, secure
○ Apply to a committee for access, individually encrypted copy provided if granted
Collaboration
○ Team of people with access, varying degrees (R/O, R/W), fluctuating membership
Embargo
○ Public after analysis/publication, or after time window expires

“EMBL on FIRE” - Background
The FIle REplication (FIRE) project started in the Systems and Networking team in 2008
○ Provide efficient, reliable, scalable and replicated data storage (for disaster recovery)
○ Provide a cost-effective and vendor-independent solution
○ Different storage technologies on Replica A and Replica B to mitigate possible data loss
Projects using FIRE include:
○ 1000 Genomes (G1K)
○ European Nucleotide Archive (ENA)
○ European Genome-phenome Archive (EGA)
○ Human Induced Pluripotent Stem Cells Initiative (HIPSCI)
○ Functional Annotation of Animal Genomes (FAANG)
○ BioImaging Data Archive

5-year plan
○ 2018: Stability with 1 PB/month ingress
○ 2019: Become an S3-like cloud with metadata features
○ 2020: Ingress 2 PB/month, egress 60 PB/month
○ 2021: Metadata explorer, ingress 3 PB/month
○ 2022: Not yet defined

“EMBL on FIRE” - Challenges
Cost-effective scaling:
○ Can cloud-based storage offer a cost-efficient approach?
○ How do ingest rates affect this model?
○ Current use is ~1PB download, 2 billion requests, per month
Cost-effective analysis:
○ As the data volume grows, we expect users to switch to cloud-based analysis platforms.
How can we effectively distribute/present the data for analysis?
○ Need a hybrid/multi-cloud model that blurs the boundaries between on-premises and
public cloud
○ Long tail of analysis, effectively no ‘cold data’ -> tiered storage not a panacea
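
To put the usage figures above (~1 PB downloaded and 2 billion requests per month) in per-second terms, here is a back-of-the-envelope calculation. It assumes a 30-day month and decimal petabytes, and gives averages only, not peak load.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumption: 30-day month
PB = 10**15                          # assumption: decimal petabyte

download_bytes = 1 * PB              # ~1 PB download per month (from the slide)
requests = 2_000_000_000             # 2 billion requests per month (from the slide)

avg_request_bytes = download_bytes / requests
requests_per_second = requests / SECONDS_PER_MONTH
avg_gbit_per_second = download_bytes * 8 / SECONDS_PER_MONTH / 1e9

print(f"average request size: {avg_request_bytes / 1e3:.0f} kB")
print(f"average request rate: {requests_per_second:.0f} requests/s")
print(f"average bandwidth:    {avg_gbit_per_second:.1f} Gbit/s")
```

Under these assumptions the averages come to roughly 500 kB per request, about 770 requests per second, and about 3 Gbit/s of sustained egress.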

Caching in the cloud
Why?
○ Increasing data volumes strain our in-house compute resources
○ Many of our data products have regular release cycles, e.g. quarterly
○ Downstream processing becoming a bottleneck, unable to keep up
○ Bandwidth for access to data
○ Some workflows require specialized hardware, e.g. >> 1 TB RAM
○ Prefer to move to the cloud as soon as is cost-effective
How?
○ Hybrid-cloud model, extend on-premises resources transparently into multiple clouds

[Diagram: hybrid-cloud model. Labels: EMBL-EBI Data Centre Space (Clusters, NFS, Object Store); JANET – UK Academic Network; Public Clouds (Cache); Research Team; Service Team; Public Service; Users]

Which data do we cache?
○ Which data is most likely to be used in the future? When?
○ Half our data is less than 2 years old
○ Long tail of analysis, not use-once-and-forget
○ Need monitoring of access patterns and knowledge of file relationships
○ Some knowledge of a-priori requirements, but not complete
How much data to cache?
○ Trade-off between long-term caching and the cost of uploading/downloading data, given available bandwidth
Cache lifetime?
○ Instrument workflows with caching hints?
○ Process-mining to determine which files are used in what manner for a given workflow?
○ How much can we automate this vs. requiring the user to tell us?
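
The trade-off between keeping data cached and re-transferring it on demand can be framed as a break-even interval. This is a deliberately simple sketch: the prices are hypothetical placeholders, and it ignores bandwidth limits, request charges and transfer latency.

```python
def break_even_months(storage_price_gb_month: float, transfer_price_gb: float) -> float:
    """Months of cache storage that cost as much as one re-transfer of the same data."""
    return transfer_price_gb / storage_price_gb_month

def keep_cached(months_between_uses: float,
                storage_price_gb_month: float,
                transfer_price_gb: float) -> bool:
    """Keep the data cached if storing it until the next use is cheaper than re-transferring it."""
    return months_between_uses * storage_price_gb_month < transfer_price_gb

# Hypothetical prices per GB, for illustration only.
STORAGE = 0.02    # per GB per month
TRANSFER = 0.09   # per GB transferred

print(f"break-even: {break_even_months(STORAGE, TRANSFER):.1f} months")
print(keep_cached(3, STORAGE, TRANSFER))    # reused every 3 months
print(keep_cached(12, STORAGE, TRANSFER))   # reused once a year
```

With access-pattern monitoring, `months_between_uses` could be estimated per file or per dataset rather than supplied by the user.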

Caching in the cloud and FIRE?
Cache vs. archive:
○ Cache lifetime goes to infinity -> archive
Moving target
○ Need a process that can evolve over time, over many orders of magnitude
○ Tools & technologies may change, must be fluid

Testing plans
Functionality
○ Ingest + download with multiple clients, rate ~PB/month
○ Clients distributed around several clouds, several locations
○ Byte-range download for subsets of large files
Performance
○ Sustained functionality over long periods – days, not minutes
Security
○ Test RBAC functionality, reliability, usability, latency (e.g. if eventually consistent)
Accounting, billing
○ Ability to get near-realtime ‘cost’ reports, predictions, alerts, breakdowns…

Summary
○ Data growing fast, ~doubling every two years
○ Don’t expect this to slow down anytime soon
○ Cloud-migration for user community just beginning
○ Actively pushing to accelerate this
○ Need a hybrid/multi-cloud storage solution
○ Flexible, performant, cost-effective
