We have introduced elasticHPC-Docker, a package based on container technology. It enables the creation of a computer cluster with containerized applications and workflows in private and in different commercial clouds using a single interface. It also includes options to manage the cluster, to deploy and run bioinformatics applications for large datasets, and to interface with image registries.
The Case For Docker In Multi-Cloud Enabled Bioinformatics Applications
1. The case for Docker in multi-cloud enabled bioinformatics applications
Ahmed Ali, Mohamed M. ElKalioby, Mohamed Abouelhoda
Nile University, Egypt
Presented By
Mohamed M. El-Kalioby, MSc
2. Introduction
● Next generation sequencing technology has changed the
traditional bioinformatics practice
● Sophisticated multi-step workflows are used to transform the raw
sequence data into knowledge.
● One NGS workflow can include tens of tasks and hundreds of
information sources integrated together to achieve the analysis
goals.
● Medical Variant Detection Workflow is an example of such
workflows.
4. Medical Variant Detection Workflow (2)
● Multiple Versions and Instances of the workflow needed
● Tools and parameters can be changed
● per user, where each one may require certain modules, annotation
databases, and special post-processing;
● per experiment type, e.g., whole genome, whole exome, or RNAseq
in a single or multiplexed mode
● per sequencing platform, e.g., Illumina, Ion Torrent, or any other.
5. Requirements
● Efficient Dynamic Deployment Strategy
● The deployed system should use HPC resources
● Able to consume cloud computing resources (private and public
clouds)
6. Virtualization Technology
● The whole system, with all modules, databases, and
related dependencies, is packaged in a virtual machine
(VM) image.
● These images can then be used to instantiate a virtual
machine running in a private or public cloud.
● Examples from sequence analysis
● Crossbow for NGS read alignment & SNP calling,
● RSD-Cloud for comparative genomics
● … many more
7. Virtualization Technology (2)
● The traditional engine for running virtual machine
instances is based on one of:
● Oracle VirtualBox
● KVM
● Xen Hypervisor
● VMware
8. Docker
● Docker provides a new level of virtualization
● the computing machine (including the operating system) is
not virtualized,
● Only the application and the related dependencies are
encapsulated in a ’virtual’ isolated process
[Figure: (a) Software stack with virtual machines: infrastructure, operating system, virtual machine hypervisor, VM1 … VMn, each running APP1 … APPn. (b) Software stack with containers: infrastructure, operating system, container engine, containers each running APP1 … APPn.]
9. Usage of Docker
[Figure: Docker usage. The Docker client sends requests (pull image, build image, run container, terminate container) to the Docker server (daemon). The daemon downloads/uploads images from the Docker public registry, builds/pushes container images to a local registry, and runs containers on top of the operating system and infrastructure.]
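The client operations named on this slide correspond to standard Docker CLI subcommands. The following is a minimal sketch in Python that only assembles the command lines and prints them; it does not contact a Docker daemon. The image name `bwa:latest`, the registry address `localhost:5000`, and the container name `demo` are hypothetical placeholders, not values from the presentation.

```python
import shlex


def docker_commands(image, registry, container="demo"):
    """Return Docker CLI command lines for a typical image lifecycle.

    All argument values are illustrative placeholders.
    """
    remote = f"{registry}/{image}"
    return [
        ["docker", "pull", image],          # download from the public registry
        ["docker", "tag", image, remote],   # name the image for the local registry
        ["docker", "push", remote],         # upload to the local registry
        ["docker", "run", "-d", "--name", container, remote],  # start a container
        ["docker", "stop", container],      # terminate the container
        ["docker", "rm", container],        # remove the stopped container
    ]


commands = docker_commands("bwa:latest", "localhost:5000")
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run the commands, each list could be passed to `subprocess.run` on a machine where Docker is installed.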
10. Why Docker
● Reduced execution overhead compared to traditional whole
machine virtualization
● Provides an effective solution to the image portability
problem.
● Virtual machine images running in Amazon are not compatible
with those running in Google and vice versa, which directly leads
to duplication of work, since new images must be prepared for each
deployment.
11. Challenges
● Extra layers need to be built on top of Docker to enable the use of HPC resources
(computer cluster) and multi-cloud platforms
● Deployment in different commercial clouds is not an easy task.
● Each cloud platform has different APIs and a different business model.
● Images are compatible with different providers
12. Contribution
● Define a use case scenario for using Docker within a computer cluster for
bioinformatics workflows.
● Evaluate its performance in comparison to native hardware and conventional
virtual machines, in private and public clouds.
● Present a new version of our multi-cloud elasticHPC, referred to as
elasticHPC-Docker, which can
1. enable the user to deploy and run whole multi-step analysis workflows,
2. create a computer cluster with Docker-based applications and define a use case scenario
for it,
3. support the use of private clouds as well as commercial clouds like Amazon and Google.
14. Google
● Google Cloud offers a container service in the form of two products:
1. Container-optimized virtual machine images, which include programs to run standard Docker
images according to a user-defined file in YAML format.
2. Google Kubernetes Engine (GKE), which creates a cluster of virtual machines that can run Docker
images. GKE is based on pods.
● Google has established the Google Container Registry (GCR).
● Cost:
● The container-optimized images and GKE run at no extra cost; the user pays the usual price of the virtual
machines.
● For clusters larger than 5 nodes, GKE charges an extra fee of $0.15 per hour per cluster on top of the usual machine price.
● GKE has two limitations:
1. It does not support Docker’s private images.
2. The cluster size in GKE cannot exceed 100 nodes.
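The GKE pricing rule on this slide can be written as a small calculation. This is a sketch under the figures quoted above (a $0.15 per hour cluster fee above 5 nodes, a 100-node limit); the function name and parameters are hypothetical, and the example machine price is the n1-highmem-8 price listed on the Experiment 2 slide.

```python
def gke_hourly_cost(nodes, machine_price, cluster_fee=0.15,
                    free_up_to=5, max_nodes=100):
    """Hourly cost in USD of a GKE cluster under the pricing quoted on the slide."""
    if nodes < 1 or nodes > max_nodes:
        raise ValueError(f"cluster size must be between 1 and {max_nodes}")
    # The cluster fee applies only when the cluster has more than `free_up_to` nodes.
    fee = cluster_fee if nodes > free_up_to else 0.0
    return nodes * machine_price + fee


small = gke_hourly_cost(4, 0.504)    # no cluster fee
large = gke_hourly_cost(10, 0.504)   # cluster fee applies
print(f"4 nodes:  ${small:.3f}/hour")
print(f"10 nodes: ${large:.3f}/hour")
```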
15. Amazon
● Amazon provides Elastic Container Service (ECS).
● ECS enables the deployment of Docker containers on Amazon EC2.
● Amazon uses docker-compose to manage Docker containers.
● Docker-compose facilitates the process of setting up a multi-container application
by defining the application and all its dependencies in a single file in YAML
format.
● The instantiated machines include programs to automatically configure the
Docker environment.
● Amazon has its own image registry.
● Cost:
● The user pays the same price as for the usual instance types.
● If the load balancing service is selected, the user pays a small extra cost of $0.025 per
hour and $0.008 per GB transferred between instances.
● Limitations:
● It does not support attaching EBS volumes to the running containers.
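The ECS cost items above combine a per-hour and a per-GB charge. A minimal sketch of the arithmetic, assuming the per-GB charge is applied to the data transferred during the hour being priced; the function name and parameters are hypothetical, and the example instance price is the m3.2xlarge price listed on the Experiment 2 slide.

```python
def ecs_hourly_cost(instances, instance_price, load_balanced=False,
                    gb_transferred=0.0, lb_hourly=0.025, lb_per_gb=0.008):
    """Hourly cost in USD of an ECS cluster under the pricing quoted on the slide."""
    cost = instances * instance_price
    if load_balanced:
        # Load balancing adds a flat hourly charge plus a per-GB transfer charge.
        cost += lb_hourly + lb_per_gb * gb_transferred
    return cost


plain = ecs_hourly_cost(4, 0.532)
balanced = ecs_hourly_cost(4, 0.532, load_balanced=True, gb_transferred=10)
print(f"without load balancing: ${plain:.3f}/hour")
print(f"with load balancing:    ${balanced:.3f}/hour")
```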
16. ElasticHPC-Docker Features
● Ability to port any Docker image to, and run it on, either private or commercial clouds.
● Creation and management of a cluster of containers. The cluster can use a single machine or
multiple machines.
● The computer cluster can have nodes from different cloud providers; i.e., some
nodes can come from Amazon and some from Google.
● Ability to create and destroy containers at run time. This makes it possible to
run multiple containers on the same machine, one at a time.
● The package supports scaling up/down of virtual machines (worker nodes) in a
running cluster.
17. ElasticHPC-Docker Features (2)
● The package allows mounting of virtual disks and establishment of a
shared file system for the containers (the default option is NFS). In AWS, we
use EBS volumes, and in Google we use persistent storage disks.
● elasticHPC-Docker automatically configures a job scheduler (including
security settings among the different providers) among the containers. The
default job scheduler is PBS Torque, but SGE is also supported.
● The current package includes many Docker specification files (Dockerfiles)
for the most important tools for NGS data analysis. These include Fastx,
BWA, and GATK.
● It includes a number of structural bioinformatics tools, including AutoDock,
Frodock, AMBER, and GROMACS, among others.
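Once the scheduler is configured, work is typically handed to it as a job script. The sketch below renders a PBS Torque script as a string; it is an illustration rather than part of elasticHPC-Docker. The job name, the file names `ref.fa` and `block_0.fastq`, and the helper `pbs_script` are hypothetical, while the `#PBS` directives and `$PBS_O_WORKDIR` are standard Torque conventions.

```python
def pbs_script(job_name, command, nodes=1, cores_per_node=8):
    """Render a PBS Torque job script that runs `command` in the submission directory."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                          # job name shown by the scheduler
        f"#PBS -l nodes={nodes}:ppn={cores_per_node}",  # requested nodes and cores per node
        "cd $PBS_O_WORKDIR",                            # directory the job was submitted from
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical alignment of one input block with BWA on 8 threads.
script = pbs_script("align_block_0",
                    "bwa mem -t 8 ref.fa block_0.fastq > block_0.sam")
print(script)
```

The resulting text would be saved to a file on the shared file system and submitted with `qsub`.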
18. EHPC-Docker (Use Case)
[Figure: EHPC-Docker cluster architecture. The EHPC-Client runs on the user's PC and communicates with the cluster. The cluster consists of one master node and three worker nodes. Each node runs an EHPC-VM Manager (port 5000) and a container (port 5555; ports 1:4999 and 5001:65535), and has an attached data volume. Arrows show communication with the VM manager, communication with the container service, and communication among containerized services. The nodes share a file system (block storage).]
1. User downloads the EHPC-Docker client.
2. User runs the client to create a cluster on a supported cloud:
a. The client starts the master node.
b. The master node creates the rest of the cluster in parallel.
c. The master node distributes the URL of the image registry.
d. Master and worker nodes retrieve the image and start the containers.
e. Once done, the master node sets up the ports and finalizes the configuration
by setting up the job scheduler and the shared storage.
The cluster is ready.
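The creation steps above can be sketched as a small simulation. Nothing here calls a cloud API or the real EHPC-Docker client: `start_node`, `create_cluster`, and the registry URL are hypothetical stand-ins, and a thread pool models the parallel creation of worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def start_node(name, registry_url):
    """Stand-in for launching a VM, pulling the image, and starting its container."""
    return {"name": name, "image_source": registry_url, "container": "running"}


def create_cluster(worker_count, registry_url):
    # (a) The master node is started first.
    master = start_node("master", registry_url)
    # (b)-(d) Workers are created in parallel; each receives the registry URL,
    # retrieves the image, and starts its container.
    names = [f"worker-{i}" for i in range(worker_count)]
    with ThreadPoolExecutor(max_workers=max(1, worker_count)) as pool:
        workers = list(pool.map(lambda n: start_node(n, registry_url), names))
    # (e) The master finalizes the configuration: scheduler and shared storage.
    return {
        "master": master,
        "workers": workers,
        "scheduler": "PBS Torque",
        "shared_storage": "NFS",
        "ready": True,
    }


cluster = create_cluster(3, "registry.example.org:5000")
print([w["name"] for w in cluster["workers"]], cluster["ready"])
```

`pool.map` returns results in the order of its input, so the worker list stays ordered even though the nodes are started concurrently.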
19. Experiments
● We conducted two experiments:
1. Measure the time for establishing container clusters over different cloud platforms.
2. Measure the performance of using Docker when running the variant detection workflow.
20. Experiment 1
1. GKE is faster than ECS
2. elasticHPC is faster than GKE
3. elasticHPC is close to ECS
21. Experiment 2
● For this experiment, we used an exome dataset from DePristo et al. of size ~ 9 GB.
● (The exome is the set of NGS reads sequenced only from the coding regions of a
genome.)
● The workflow was executed three times independently on Google, AWS, and private
cloud based on OpenStack.
● In each cloud, the 9 GB input data is divided into blocks to be processed in parallel
over the cluster nodes.
● For a fair comparison, we used machines with specifications as similar as possible.
● Amazon: m3.2xlarge (8 cores, Intel 2.5 GHz, 30 GB RAM, SSD disks, $0.532/hour),
● Google: n1-highmem-8 (8 cores, Intel 2.5 GHz, 52 GB RAM, SSD disks, $0.504/hour),
● OpenStack: a local machine with 8 cores and 56 GB RAM.
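Dividing the input into blocks for parallel processing can be illustrated as follows. The slides do not say how the 9 GB input is split, so this sketch makes two assumptions: the reads are in FASTQ format (four lines per record), and records are distributed round-robin. The function name and the toy reads are hypothetical.

```python
def split_fastq_records(lines, n_blocks):
    """Distribute FASTQ records (4 lines each) round-robin over n_blocks blocks."""
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    blocks = [[] for _ in range(n_blocks)]
    for start in range(0, len(lines), 4):
        record_index = start // 4
        # Keep the four lines of a record together in the same block.
        blocks[record_index % n_blocks].extend(lines[start:start + 4])
    return blocks


# Toy input: five reads, split over two cluster nodes.
reads = []
for i in range(5):
    reads += [f"@read{i}", "ACGT", "+", "IIII"]

blocks = split_fastq_records(reads, 2)
for index, block in enumerate(blocks):
    print(f"block {index}: {len(block) // 4} reads")
```

Each block could then be written to the shared file system and submitted as a separate scheduler job.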
25. Conclusion
● We introduced elasticHPC-Docker based on container technology.
● Our package enables the creation of a computer cluster with containerized
applications and workflows in private and in different commercial clouds using
a single interface.
● It includes options to run bioinformatics applications and workflows for large
datasets.
● Through container technology, elasticHPC-Docker provides an efficient
solution to interoperability among commercial clouds.
● It is efficient in practice, with reduced overhead, especially on local infrastructures.
● It is available at http://www.elastichpc.org