Data-intensive applications on cloud computing resources: Applications in life sciences

Presentation at the de.NBI 2017 symposium “The Future Development of Bioinformatics in Germany and Europe” held at the Center for Interdisciplinary Research (ZiF) of Bielefeld University, October 23-25, 2017.

https://www.denbi.de/symposium2017



  1. Data-intensive applications on cloud computing resources: Applications in life sciences. Ola Spjuth <ola.spjuth@farmbio.uu.se>, Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University
  2. Today: we have access to high-throughput technologies to study biological phenomena
  3. Science for Life Laboratory, Sweden: an internationally leading center that develops and applies large-scale technologies for molecular biosciences, with a focus on health and environment. A national platform since 2013, with Stockholm and Uppsala nodes.
  4. NBIS: coordinated by NBIS; ~63 FTEs (staff >75); staff at all major Swedish universities; 2018 budget ~7.5 M€; the bioinformatics platform of SciLifeLab
  5. Massively parallel sequencing. 2017: a human whole genome sequenced in 3 days for ~$1100; analysis and storage require supercomputers. 2017: Illumina HiSeq X systems, 15K whole human genomes per year. 2016: NGI data velocity 950 Mbp/hour = 16 Mbp/s.
  6. Current mode of operation: scientists transfer samples to the platforms, which perform pre-processing (NGI); data is then delivered for research and analysis (SNIC).
  7. What we sequenced at NGI
  8. Some statistics: storage usage; projects at SNIC-UPPMAX (data-intensive bioinformatics vs. other disciplines); support tickets
  9. NGS users
     • Key observations
       – Batch-oriented on HPC/HTC, shared storage, Linux, open-source software
       – Computations are not very large, seldom multi-node
       – Storage is the biggest challenge: projects do not end, users do not clean up data, and WGS projects are very large
       – Many, inefficient users and lots of software (admin burden, support, education)
       – Free resources (no cost) do not promote efficient usage
     • Investment strategies
       – When investing in computational hardware, it takes a long time from funding decision until the resources are operational (10-12 months on average)
       – Expansion of resources is done at specific points in time, with low flexibility in between
       – Decisions on resources are made by a national board with limited influence from life-science scientists or platforms (Sweden)
  10. Why cloud in the life sciences?
      • Access to resources
        – Flexible configurations
        – On-demand, pay-as-you-go
      • Collaborate on an international level
        – Publish/federate data
        – E.g. large sequencing initiatives, “move compute to the data”
      • New types of analysis environments
        – Hadoop/Spark/Flink etc.
        – Microservices, Docker, Kubernetes, Mesos
  11. Using clouds in bioinformatics. How can we take advantage of cloud resources? Simplest example:
      • Start a VM from a (pre-made) VMI
      • Upload data
      • Perform the scientific task
      • Download results
      • Terminate the VM
      Easy to scale this up to many instances! Or… is it?
      • What if I want to run 100 instances in parallel?
      • What if I want a new tool? Later versions?
      • Do I need to upload data every time?
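The five-step lifecycle above can be sketched in miniature. This is an illustration added here, not from the deck: it makes no real cloud API calls (an actual implementation would go through an IaaS SDK or CLI), and the image name and task are made up.

```python
from contextlib import contextmanager

@contextmanager
def vm(image):
    # Toy stand-in for a cloud VM lifecycle; a real version would call
    # an IaaS SDK to boot an instance from the VMI and to terminate it.
    print(f"starting VM from image {image}")
    try:
        yield {"image": image, "files": {}}
    finally:
        print("terminating VM")

def analyse(files):
    # Stand-in "scientific task": count records per uploaded sample.
    return {name: len(records) for name, records in files.items()}

def run_job(image, data):
    with vm(image) as instance:
        instance["files"].update(data)        # upload data
        results = analyse(instance["files"])  # perform scientific task
        return results                        # download results; leaving
                                              # the `with` terminates the VM

if __name__ == "__main__":
    print(run_job("bio-vmi-2017", {"sampleA": [1, 2, 3], "sampleB": [4, 5]}))
```

Doing this by hand for 100 parallel instances is exactly where the manual approach breaks down, which motivates the virtual-cluster setup on the next slide.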
  12. So we want to set up and use a virtual cluster:
      • Multiple compute nodes
      • Network
      • Distributed storage
      • Firewall, DNS, reverse proxy, etc.
      So, we now have a virtual cluster. And now?
      • Batch-like system: install a queueing system (e.g. SLURM) and bioinformatics software
      • Big Data system: install HDFS + Hadoop/Spark on the nodes
      • Container-based system: install Docker and Kubernetes
      • Data: ingress project data, possibly reference data
      (There are tools that can help automate some of these procedures.)
  13. Challenges with cloud
      • Tradition: strong HPC tradition in academia. Sweden: existing HPC resources funded by the Research Council, with personnel at six centers (SNIC)
      • Economy: the cost model is new, and costs are difficult to assess
      • Data: how to work with large-scale data (TB/PB range)
      • Legal: working with sensitive data
      • Educational: new technology for many
  14. Some SciLifeLab cloud options
  15. SNIC Cloud in Sweden: a geographically distributed, federated IaaS cloud based on 2nd-generation HPC hardware, built using OpenStack
  16. Needs in bioinformatics
      • Primarily resources with a lot of RAM and storage (high I/O)
      • Preferably a transparent system; users don’t want to deal with e-infrastructure at all
      • How to work with storage (tiered?)
      • Is a best-effort SLA enough?
  17. Virtual machines and containers
      • Virtual machines: package entire systems (heavy); completely isolated; suitable in cloud environments
      • Containers: share the OS; smaller, faster, portable (e.g. Docker)
  18. Microservices
  19. Microservices
      • Decompose functionality into smaller, loosely coupled, on-demand services communicating via an API: “do one thing and do it well”
      • Services are easy to replace and language-agnostic: minimize risk, maximize agility
      • Suitable for loosely coupled teams
      • Portable and easy to scale
      • Multiple services can be chained into larger tasks
      Software containers (e.g. Docker) are ideal for microservices!
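To make the “do one thing, expose it via an API” idea concrete, here is a toy single-endpoint service built with Python’s standard library. It is not part of the deck: the endpoint, port, and GC-content task are all made up for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class GCContentHandler(BaseHTTPRequestHandler):
    """A microservice that does exactly one thing:
    compute the GC content of a DNA sequence."""

    def do_GET(self):
        # Path like /gc/ACGTGC -> fraction of G and C bases
        seq = self.path.rsplit("/", 1)[-1].upper()
        gc = sum(b in "GC" for b in seq) / len(seq) if seq else 0.0
        body = json.dumps({"sequence": seq, "gc_content": gc}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

def serve_once(port=8765):
    # Run the service in a background thread; caller shuts it down.
    server = HTTPServer(("127.0.0.1", port), GCContentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

if __name__ == "__main__":
    server = serve_once()
    with urlopen("http://127.0.0.1:8765/gc/ACGTGC") as resp:
        print(json.loads(resp.read()))  # 4 of 6 bases are G/C
    server.shutdown()
```

Because the contract is just HTTP + JSON, the service could be reimplemented in any language, or packaged in a Docker container, without its consumers noticing.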
  20. Orchestrating containers
      • Origin: Google
      • A declarative language for launching containers
      • Start, stop, update, and manage a cluster of machines running containers in a consistent and maintainable way
      • Suitable for microservices
      Containers are scheduled and packed onto nodes.
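The “declarative” point can be made concrete: rather than starting containers imperatively, you describe the desired state and the orchestrator converges the cluster onto it. The sketch below (added here, not from the deck) builds a Kubernetes-style Deployment manifest as a Python dict; the tool name and image are hypothetical examples.

```python
import json

def deployment_manifest(name, image, replicas):
    """Build a Kubernetes-style Deployment manifest: declare the desired
    state (which image, how many replicas) and let the orchestrator
    schedule and pack the containers onto nodes."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }

if __name__ == "__main__":
    # Hypothetical containerized bioinformatics tool
    manifest = deployment_manifest("bwa-aligner", "biocontainers/bwa:0.7.17", replicas=3)
    print(json.dumps(manifest, indent=2))
```

Scaling is then an edit to `replicas` rather than manually starting or stopping containers, which is what makes the approach maintainable for microservices.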
  21. Connecting the microservices
      • A suitable way of using containers is to connect them into a (scientific) workflow
      • Tools like Pachyderm (http://pachyderm.io/), Luigi (https://github.com/spotify/luigi) and Galaxy (https://galaxyproject.org/) can assist with this
      • Goal: reproducible, fault-tolerant, scalable execution
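The execution model behind such workflow tools can be illustrated with a few lines of plain Python. This is a toy added for this write-up, not Luigi’s (or Pachyderm’s, or Galaxy’s) actual API: each task declares its dependencies, and the engine runs each task once, re-using finished outputs.

```python
class Task:
    """Minimal stand-in for a workflow task: declares its upstream
    dependencies and produces an output from their results."""
    requires = []           # upstream Task classes

    def run(self, inputs):  # inputs: outputs of `requires`, in order
        raise NotImplementedError

def execute(task_cls, done=None):
    """Resolve dependencies depth-first, running each task exactly once.
    Caching finished outputs in `done` is what lets a real engine resume
    a failed workflow without redoing completed work."""
    done = {} if done is None else done
    if task_cls not in done:
        inputs = [execute(dep, done) for dep in task_cls.requires]
        done[task_cls] = task_cls().run(inputs)
    return done[task_cls]

# Toy three-step pipeline: raw reads -> quality filter -> count
class RawReads(Task):
    def run(self, inputs):
        return ["ACGT", "NNNN", "GGCC"]

class FilterReads(Task):
    requires = [RawReads]
    def run(self, inputs):
        return [r for r in inputs[0] if "N" not in r]

class CountReads(Task):
    requires = [FilterReads]
    def run(self, inputs):
        return len(inputs[0])

if __name__ == "__main__":
    print(execute(CountReads))  # 2 reads survive filtering
```

In the real tools, each `run` would typically launch a container, which is exactly how containers get connected into a reproducible scientific workflow.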
  22. Virtual Research Environments: a researcher and other researchers each have their own tools and data; VREs aim to bridge this gap!
  23. Virtual Research Environments: the researcher, tools, data, and compute and storage resources, shared with other researchers, form a Virtual Research Environment!
  24. PhenoMeNal (http://phenomenal-h2020.eu/)
      • Horizon 2020 project, 2015-2018
      • Virtual Research Environments (VREs), microservices, workflows
      • Towards interoperable and scalable metabolomics data analysis
      • Private environments for sensitive data
      • Components from GitHub and DockerHub, deployed onto virtual infrastructure
  25. PhenoMeNal approach and stack (KubeNow): Packer, Terraform, kubeadm, kubectl, Cloudflare
      • Enable users to deploy their own virtual infrastructure on an IaaS provider
      • Containerize tools, orchestrate microservices with workflow systems on top of Kubernetes
  26. Users should not see this…
  27. Users should see this!
  28. Start-to-end MS analysis
  29. Deployment on local clouds (Steffen Neumann, IPB Halle)
  30. Two on-premises deployments: MRC-NIHR Phenome Centre; Kultima group (www.caramba.clinic)
  31. Bring compute to the data
      • Moving data can be problematic: e.g. size, legal issues, resources, costs, time…
      • A VRE encompasses all components necessary to carry out the analysis: launch it near the data; re-use the environment, or even a scientific workflow
      • Next step: federate data, federate clouds
  32. Research focus in my group
      • e-Science methods development: smart data management, predictive modeling
      • Applied e-Science research: drug discovery and individualized diagnostics
      • e-Infrastructure development: automation, Big Data
  33. Privacy preservation; workflows; Big Data frameworks; data management and predictive modeling; data federation; compute federation
  34. Selected research questions: How can we improve efficiency on shared HPC for data-intensive bioinformatics? Data locality? Outsourcing? (Martin Dahlö)
      [Figure: per-project efficiency (%) over 2014-2017 for NGS projects and other projects; efficiency feedback to users began partway through the period.]
      1. M. Dahlö, D. Schofield, W. Schaal and O. Spjuth. Tracking the NGS revolution: Usage and system support of bioinformatics projects on shared high-performance computing clusters. In preparation.
      2. O. Spjuth, E. Bongcam-Rudloff, J. Dahlberg, M. Dahlö, A. Kallio, L. Pireddu, F. Vezzi, and E. Korpelainen. Recommendations on e-infrastructures for next-generation sequencing. GigaScience, 2016, 5:26.
      3. S. Lampa, M. Dahlö, P. I. Olason, J. Hagberg, and O. Spjuth. Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data. GigaScience, 2013, 2:9.
  35. Selected research questions: Can Big Data frameworks aid data-intensive bioinformatics? (Laeeq, Valentin, Marco) Efficient virtual screening with Apache Spark and machine learning; a Hadoop pipeline scales better than HPC and is economical for current data sizes.
      1. A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. GigaScience, 2015, 4:26.
      2. L. Ahmed, A. Edlund, E. Laure, and O. Spjuth. Using iterative MapReduce for parallel virtual screening. IEEE 5th International Conference on Cloud Computing Technology and Science (CloudCom), vol. 2, pp. 27-32, 2013.
      3. M. Capuccini, L. Carlsson, U. Norinder and O. Spjuth. Conformal prediction in Spark: Large-scale machine learning with confidence. IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), Limassol, 2015, pp. 61-67.
      4. M. Capuccini, L. Ahmed, W. Schaal, E. Laure and O. Spjuth. Large-scale virtual screening on public cloud resources with Apache Spark. Journal of Cheminformatics, 2017, 9:15.
  36. “EasyMapReduce: Leverage the power of Spark and Docker to scale scientific tools in MapReduce fashion”
      https://spark-summit.org/east-2017/events/easymapreduce-leverage-the-power-of-spark-and-docker-to-scale-scientific-tools-in-mapreduce-fashion/
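The pattern behind EasyMapReduce can be sketched without Spark: partition the records, pipe each partition through an external command-line tool, and combine the results. This illustration is added here and is not EasyMapReduce’s actual code; the POSIX `tr` command stands in for a `docker run <tool>` invocation, and the "uppercase DNA" task is made up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_tool(lines):
    """Pipe one partition of records through an external tool via
    stdin/stdout, the way EasyMapReduce pipes Spark partitions through
    a Docker container. `tr` stands in for the containerized tool."""
    proc = subprocess.run(
        ["tr", "acgt", "ACGT"],          # toy "tool": uppercase DNA
        input="\n".join(lines),
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.splitlines()

def map_reduce(records, n_partitions=2):
    # Split records into partitions, map each partition through the
    # tool in parallel, then concatenate (reduce) the results.
    parts = [records[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(run_tool, parts))
    return [line for part in mapped for line in part]

if __name__ == "__main__":
    print(sorted(map_reduce(["acgt", "ggcc", "ttaa"])))
```

The appeal of the real system is that the tool needs no modification: any program that reads stdin and writes stdout can be scaled out across a cluster this way.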
  37. Selected research questions: How useful are scientific workflows in data-intensive research? (Samuel, Jon)
      • Streamline analysis on high-performance e-infrastructures
      • Support reproducible data analysis
      • Enable large-scale data analysis
      http://scipipe.org | https://github.com/pharmbio/sciluigi | http://pachyderm.io
      1. O. Spjuth et al. Experiences with workflows for automating data-intensive bioinformatics. Biology Direct, 2015, 10(1):43.
      2. S. Lampa, J. Alvarsson and O. Spjuth. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. Journal of Cheminformatics, 2016, 8:67.
  38. Selected research questions: How can we deploy smart, high-availability services with APIs? (Staffan, Jonathan, Arvid)
      OpenRiskNet (http://www.openrisknet.org)
      • Horizon 2020 project, 2017-2020
      • E-infrastructure for chemical safety assessment
      • Multi-tenant virtual environments, microservices
      • APIs, “semantic interoperability”
      • Academia and industry
      • Much focus on standardizing chemical data and predictive modeling
  39. Research questions around the corner
      • Public and private data sources are not static. How can we continuously improve predictive models as data changes?
      • We can generate too much data. Can predictive modeling aid data acquisition, storage and analysis?
  40. Reactive/continuous modeling: data sources are coordinated, integrated, versioned, and monitored; models are trained, assessed, published to the user, and archived.
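A toy version of that loop, added here for illustration only (the class, its methods, and the "model" are all made up, not a real system): monitor the data, and whenever its version changes, train and publish a new model while archiving the previous one.

```python
import hashlib

class ModelRegistry:
    """Sketch of the reactive-modeling loop: retrain only when the data
    changes, keep the current model published, archive superseded ones."""

    def __init__(self):
        self.current = None
        self.archive = []

    def data_version(self, data):
        # Content-derived version tag for a dataset
        return hashlib.sha256(repr(sorted(data)).encode()).hexdigest()[:8]

    def train(self, data):
        # Stand-in "model": just the mean of the observations
        return {"version": self.data_version(data),
                "prediction": sum(data) / len(data)}

    def monitor(self, data):
        version = self.data_version(data)
        if self.current is None or self.current["version"] != version:
            if self.current is not None:
                self.archive.append(self.current)  # archive the old model
            self.current = self.train(data)        # publish the new model
        return self.current

if __name__ == "__main__":
    reg = ModelRegistry()
    reg.monitor([1.0, 2.0, 3.0])
    reg.monitor([1.0, 2.0, 3.0])       # unchanged data: no retraining
    reg.monitor([1.0, 2.0, 3.0, 4.0])  # new data: retrain and archive
    print(len(reg.archive))  # 1
```

Versioning the data alongside the model is what makes every published prediction traceable back to the exact dataset it was trained on.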
  41. HASTE: Hierarchical Analysis of Spatial and TEmporal image data. From intelligent data acquisition via smart data management to confident predictions. PI and Aim 1: Carolina Wählby; Aim 2: Ola Spjuth; Aim 3: Andreas Hellander. 29 MSEK, 2017-2022.
  42. Aim 2: Guiding data acquisition with machine learning
      • Can we use privileged information to improve machine learning models?
      • Can we make a valid ranking and guide data acquisition?
      • Online setting: is something interesting happening? Can we assign valid probabilities for that? Collect more data.
  43. Aim 3: Explore a hierarchical model based on information layers: edge; cloudlet/private cloud; data warehouse and distributed storage.
  44. Acknowledgements: Wes Schaal, Jonathan Alvarsson, Staffan Arvidsson, Arvid Berg, Samuel Lampa, Marco Capuccini, Martin Dahlö, Valentin Georgiev, Anders Larsson, Polina Georgiev, Maris Lapins, Jon-Ander Novella, Lars Carlsson, Ernst Ahlberg, Ola Engqvist. SNIC Science Cloud: Andreas Hellander, Salman Toor. Caramba.clinic: Kim Kultima, Stephanie Herman, Payam Emami.
  45. Research group website: http://pharmb.io. Thank you!
