These slides present use cases for the ARCHIVER project at the European Bioinformatics Institute (EMBL-EBI). EBI's data is growing at around 40-50% per year, i.e. roughly doubling every two years, and there is no reason to expect that to slow down. To address this growth, EBI is looking for a hybrid multi-cloud storage solution that enables cost-effective scaling, analysis in the cloud, and caching of frequently accessed data in public clouds. Key challenges include balancing cost and performance across on-premises and cloud storage as data and analysis needs increase.
1. Use-cases for the ARCHIVER project
The European Bioinformatics Institute
Tony Wildish
wildish@ebi.ac.uk
2. What is EMBL-EBI?
• Europe’s home for biological data services, research and training
• A trusted data provider for the life sciences
• Part of the European Molecular Biology Laboratory, an intergovernmental research organisation
• International: 650 members of staff from 66 nations
• Home of the ELIXIR Technical hub.
6. Database interactions
• Our collaborative community facilitates social, scientific and technical interactions
• This image shows internal interactions between data resources, as determined by the exchange of data.
• The width of each internal arc is weighted according to the number of different data types exchanged.
7. Increasing Data, Increasing Analysis
Storage growth at EBI
• ~40-50% per year
• i.e. doubling every two years
• No reason to expect that to slow down
EGA and ENA account for the bulk of the data
• DNA sequences
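The link between "~40-50% per year" and "doubling every two years" can be checked with a few lines of Python. This is a minimal sketch that assumes constant compound growth; the function names are illustrative and not part of any EBI tooling.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant annual growth rate (0.4 means 40%)."""
    return math.log(2) / math.log(1 + annual_growth)


def projected_size(current: float, annual_growth: float, years: float) -> float:
    """Size after `years` of compound growth, in the same unit as `current`."""
    return current * (1 + annual_growth) ** years


for rate in (0.40, 0.50):
    print(f"{rate:.0%} per year: doubles every {doubling_time_years(rate):.2f} years, "
          f"x{projected_size(1.0, rate, 2):.2f} after two years")
```

Under that assumption, 40% per year doubles in about 2.1 years and 50% per year in about 1.7 years, so "doubling every two years" sits inside the quoted range.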
8. Who uses EMBL-EBI services?
See the live map at www.ebi.ac.uk/about/our-impact
10. Data characteristics
DNA sequence data
○ The bulk of our data; files range from a few MB up to many tens of GB
○ With ‘long-read’ sequencing technology, can we expect file sizes to increase?
Lifetime
○ EBI has custodial responsibility, most of our data is stored ‘forever’
○ Data is immutable (but may be versioned)
Analyses
○ Assembly: stream/index whole file, then random access string matching
○ Query: byte-range lookup
Access
○ POSIX, FTP, HTTP, S3…
○ Data discovery by portal lookup, dedicated portals with cross-references
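The "byte-range lookup" query pattern maps naturally onto the HTTP `Range` header when data is served over HTTP or an S3-style interface. The sketch below only builds the request and does not send it; the URL is a placeholder and the helper name `range_request` is hypothetical, not an EBI API.

```python
from urllib.request import Request


def range_request(url: str, start: int, length: int) -> Request:
    """Build (but do not send) an HTTP GET for `length` bytes starting at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive at both ends
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Placeholder URL, for illustration only.
req = range_request("https://example.org/archive/sample.fastq.gz", 1_000_000, 4096)
print(req.get_header("Range"))  # bytes=1000000-1004095
```

A server that honours the range replies with status 206 (Partial Content) and only the requested bytes, so a query against a file of many tens of GB need not transfer the whole file.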
11. Privacy, security
Public
○ Available without authorisation or identification – anonymous FTP
Private, secure
○ Apply to a committee for access, individually encrypted copy provided if granted
Collaboration
○ Team of people with access, varying degrees (R/O, R/W), fluctuating membership
Embargo
○ Public after analysis/publication, or after time window expires
12. “EMBL on FIRE” - Background
The FIle REplication Project started in the Systems and Networking team in 2008
○ Provide an efficient, reliable, scalable and replicated data storage (for disaster recovery)
○ Provide a cost-effective and vendor-independent solution
○ Different storage technologies on Replica A and Replica B to mitigate possible data loss
Projects using FIRE include:
○ 1000 Genomes (G1K)
○ European Nucleotide Archive (ENA)
○ European Genome-phenome Archive (EGA)
○ Human Induced Pluripotent Stem Cells Initiative (HIPSCI)
○ Functional Annotation of Animal Genomes (FAANG)
○ BioImaging Data Archive
15. 5-year plan
2018
○ Stability with 1PB/month ingress
2019
○ Become S3-like cloud with metadata features
2020
○ Ingress 2PB/month
○ Egress 60PB/month
2021
○ Metadata explorer
○ Ingress 3PB/month
2022
○ Not yet defined
16. “EMBL on FIRE” - Challenges
Cost-effective scaling:
○ Can cloud-based storage offer a cost-efficient approach?
○ How do ingest rates affect this model?
○ Current use is ~1PB download, 2 billion requests, per month
Cost-effective analysis:
○ As the data volume grows, we expect users to switch to cloud-based analysis platforms. How can we effectively distribute/present the data for analysis?
○ Need a hybrid/multi-cloud model that blurs the boundaries between on-premises and public cloud
○ Long tail of analysis, effectively no ‘cold data’ -> tiered storage not a panacea
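To get a feel for the current load quoted above (~1PB download and 2 billion requests per month), the monthly totals can be converted into average rates. This is a back-of-the-envelope sketch that assumes a 30-day month, decimal petabytes and perfectly even traffic; real traffic will be burstier than these averages.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month
PETABYTE = 10**15                   # assumption: decimal petabytes


def monthly_to_per_second(amount_per_month: float) -> float:
    """Convert a monthly total into an average per-second rate."""
    return amount_per_month / SECONDS_PER_MONTH


download_bytes = 1 * PETABYTE  # ~1PB download per month (from the slide)
requests = 2_000_000_000       # 2 billion requests per month (from the slide)

byte_rate = monthly_to_per_second(download_bytes)
request_rate = monthly_to_per_second(requests)
mean_request_size = download_bytes / requests

print(f"average egress: {byte_rate / 1e6:.0f} MB/s ({byte_rate * 8 / 1e9:.2f} Gbit/s)")
print(f"average request rate: {request_rate:.0f} requests/s")
print(f"mean bytes per request: {mean_request_size / 1e3:.0f} kB")
```

Under these assumptions the averages come to roughly 390 MB/s, about 770 requests per second and about 500 kB per request. Both the volume and the request count matter when comparing cloud storage offers, since providers commonly charge for egress and per request separately.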
17. Caching in the cloud
Why?
○ Increasing data volumes strain our in-house compute resources
○ Many of our data products have regular release cycles, e.g. quarterly
○ Downstream processing becoming a bottleneck, unable to keep up
○ Bandwidth for access to data
○ Some workflows require specialized hardware, e.g. >> 1 TB RAM
○ Prefer to move to the cloud as soon as is cost-effective
How?
○ Hybrid-cloud model, extend on-premises resources transparently into multiple clouds
18. Caching in the cloud
[Diagram. Labels: EMBL-EBI Data Centre Space; JANET – UK Academic Network; Public Clouds; Clusters; NFS; Object Store; Cache; Research Team; Service Team; Public Service; Users]
19. Caching in the cloud
Which data do we cache?
○ Which data is most likely to be used in the future? When?
○ Half our data is less than 2 years old
○ Long tail of analysis, not use-once-and-forget
○ Need monitoring of access patterns and knowledge of file relationships
○ Some a-priori knowledge of requirements, but not complete
How much data to cache?
○ Trade-off long-term caching vs. cost of upload/download of data, available bandwidth
Cache lifetime?
○ Instrument workflows with caching hints?
○ Process-mining to determine which files are used in what manner for a given workflow?
○ How much can we automate this vs. requiring the user to tell us?
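One way to make these questions concrete is a toy model of a size-limited cache that evicts the least recently used files and lets a workflow pin files it knows it will need again (a "caching hint"). This is an illustrative sketch only, not a description of FIRE or of any planned EBI design; the class `FileCache` and its parameters are hypothetical.

```python
from collections import OrderedDict


class FileCache:
    """Toy size-bounded cache: least-recently-used eviction, with pinning as a hint."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._files: OrderedDict[str, tuple[int, bool]] = OrderedDict()  # name -> (size, pinned)

    def __contains__(self, name: str) -> bool:
        return name in self._files

    def access(self, name: str, size: int, pinned: bool = False) -> bool:
        """Record an access. Returns True on a cache hit, False on a miss."""
        if name in self._files:
            self._files.move_to_end(name)  # mark as most recently used
            return True
        self._files[name] = (size, pinned)
        self.used_bytes += size
        self._evict()
        return False

    def _evict(self) -> None:
        """Drop unpinned files, oldest first, until the cache fits its capacity."""
        for victim in list(self._files):
            if self.used_bytes <= self.capacity_bytes:
                break
            size, pinned = self._files[victim]
            if pinned:
                continue  # a workflow hint says this file is still needed
            del self._files[victim]
            self.used_bytes -= size


cache = FileCache(capacity_bytes=100)
cache.access("a.cram", 60)
cache.access("b.cram", 30)
cache.access("a.cram", 60)  # hit: a.cram becomes the most recently used
cache.access("c.cram", 40)  # over capacity: b.cram is the oldest and is evicted
print(sorted(name for name in ("a.cram", "b.cram", "c.cram") if name in cache))
```

Replaying real access logs through a model like this, with different capacities and pinning rules, is one possible way to explore the trade-off between cache size, hit rate and transfer cost. Plain LRU may be a poor fit where there is a long tail of analysis, which is why access-pattern monitoring and workflow hints are raised above.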
20. Caching in the cloud and FIRE?
Cache vs. archive:
○ Cache lifetime goes to infinity -> archive
Moving target
○ Need a process that can evolve over time, over many orders of magnitude
○ Tools & technologies may change, must be fluid
21. Testing plans
Functionality
○ Ingest + download with multiple clients, rate ~PB/month
○ Clients distributed around several clouds, several locations
○ Byte-range download for subsets of large files
Performance
○ Sustained functionality over long periods – days, not minutes
Security
○ Test RBAC functionality, reliability, usability, latency (e.g. if eventually consistent)
Accounting, billing
○ Ability to get near-realtime ‘cost’ reports, predictions, alerts, breakdowns…
22. Summary
Data growing fast, ~doubling every two years
○ Don’t expect this to slow down anytime soon
Cloud migration for user community just beginning
○ Actively pushing to accelerate this
Need a hybrid/multi-cloud storage solution
○ Flexible, performant, cost-effective