“Fast” is not a word that’s usually associated with virtual machines. But that’s exactly what you need to create runtimes that are secure enough for a multitenant environment, yet nimble enough to be well suited for function and container compute platforms.
Let’s learn about what Firecracker is and how it enables fast and secure serverless computing.
Firecracker is an open source virtualization technology. It’s a Virtual Machine Manager (VMM) that uses the Linux Kernel-based Virtual Machine (KVM) to create and manage lightweight virtual machines, dubbed “microVMs”. Before Firecracker, you had to choose between containers with fast startup times and high density, or VMs with strong hardware-virtualization-based security and workload isolation. With Firecracker, you no longer have to choose.
<CLICK> These microVMs provide enhanced security and workload isolation over traditional VMs, while enabling the speed and resource efficiency of containers.
<CLICK> It comes with extremely low resource overhead, which makes it very suitable for serverless computing.
<CLICK> Firecracker was developed at Amazon Web Services to improve the customer experience of services like AWS Lambda and AWS Fargate. When we launched Lambda in November of 2014, we were focused on providing a secure serverless experience. At launch we used per-customer EC2 instances to provide strong security and isolation between customers. As Lambda grew, we saw the need for technology to provide a highly secure, flexible, and efficient runtime environment for services like Lambda and Fargate. We needed something that could give us the hardware virtualization-based security boundaries of virtual machines, while maintaining the smaller package size and agility of containers and functions.
Taking the case of Lambda …
<CLICK> Let’s zoom in on a single Lambda “worker” server. It’s a physical machine, so it has fixed compute resources over time (e.g., CPU, memory).
<CLICK> Lambda customers are paying for a specific compute resource cap per function – which we must guarantee. Each invoke then consumes some unknowable fraction of that cap, and the invokes happen in some unknowable succession. How can we make efficient use of our fixed physical machine?
<CLICK><CLICK><CLICK> We add more functions to the same server. Ideally from different customers. For a large enough N, we can start using statistics and other big-number methods to ensure even near-full usage of the server. Ah, but wait, any of these N functions can run arbitrary binary code that might be malicious. How do we isolate all these execution environments? Containers would work, but they are not the most secure execution environment.
<CLICK> We need to use VMs … but VMs don’t usually work well for high density, oversubscribed, high mutation rate sandbox environments. Firecracker though, is built specifically for, and only for, this type of environment.
Firecracker implements a minimal device model that excludes all non-essential functionality and reduces the attack surface area of the microVM. This improves security, decreases the startup time, and increases hardware utilization. Let’s look at how.
Firecracker microVMs use KVM-based virtualization, which provides enhanced security over traditional VMs. This ensures that workloads from different end customers can run safely on the same machine. And because of the minimal device model, the attack surface area is reduced as well, which provides more security.
<CLICK> In addition to a minimal device model, Firecracker also accelerates kernel loading and provides a minimal guest kernel configuration. The only devices are virtio-net and virtio-block, as well as a basic keyboard with only a few buttons (the reset pin helps when there’s no power management device). This enables fast startup times. Firecracker initiates user space or application code in less than 125 ms and supports peak microVM creation rates of 150 microVMs per second per host.
<CLICK> Each Firecracker microVM runs with a reduced memory overhead of less than 5MiB, enabling a high density of microVMs to be packed on each server.
Firecracker also provides a rate limiter built into every microVM. This enables optimized sharing of network and storage resources, even across thousands of microVMs on a host.
With Firecracker, you can see that we are making the same deep investments in our infrastructure to support serverless computing as we have to support EC2 instances. Firecracker currently supports Intel CPUs, with AMD and Arm support currently in the Alpha stage (working, but not extensively tested). Firecracker will also be integrated with popular container runtimes such as containerd.
Let’s talk about the Firecracker design principles.
Firecracker can safely run workloads from different customers on the same machine. This is possible because it provides hardware virtualization-based security.
Customers can create microVMs with any combination of vCPU and memory to match their application requirements. This maps nicely to the different Lambda and Fargate combinations. We’ll look at that later.
Firecracker microVMs oversubscribe host CPU and memory by default. The degree of oversubscription is controlled by customers, who may factor in workload correlation and load in order to ensure smooth host system operation.
With a microVM configured with a minimal Linux kernel, single-core CPU, and 128 MB of RAM, Firecracker supports a steady mutation rate of ~4 microVMs per host core per second (e.g., one can create ~150 microVMs per second on a host with 36 physical cores).
The number of Firecracker microVMs running simultaneously on a host is limited only by the availability of hardware resources.
Each microVM exposes a host-facing API via an in-process HTTP server.
Each microVM can provide guest-facing access to host-configured metadata via the microVM Metadata Service (/mmds) API.
The API is accessible through HTTP calls on specific URLs carrying JSON modeled data. The transport medium is a Unix Domain Socket.
GET / returns general information about an instance.
Firecracker microVMs can execute actions that can be triggered via PUT requests on the /actions resource. Actions are:
InstanceStart: The InstanceStart action powers on the microVM and starts the guest OS
SendCtrlAltDel: This action will send the CTRL+ALT+DEL key sequence to the microVM. By convention, this sequence has been used to trigger a soft reboot and, as such, most Linux distributions perform an orderly shutdown and reset upon receiving this keyboard input. Since Firecracker exits on CPU reset, SendCtrlAltDel can be used to trigger a clean shutdown of the microVM.
BlockDeviceRescan: The BlockDeviceRescan action is used to trigger a rescan of one of the microVM's attached block devices. Rescanning is necessary when the size of the block device's backing file (on the host) changes and the guest needs to refresh its internal data structures to pick up this change. This action is therefore only allowed after the guest has booted.
GET /machine-config gets the machine configuration of the VM. When called before the PUT operation, it will return the default values for the vCPU count (=1) and memory size (=128 MiB).
PUT updates the Virtual Machine Configuration with the specified input. Firecracker starts with default values for vCPU count (=1) and memory size (=128 MiB). With Hyperthreading enabled, the vCPU count is restricted to be 1 or an even number, otherwise there are no restrictions regarding the vCPU count. If any of the parameters has an incorrect value, the whole update fails.
/drives creates a new drive with the ID specified by the drive_id path parameter. If a drive with the specified ID already exists, it updates its state based on the new input. It will fail if the update is not possible.
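As a sketch of what driving this API looks like (the socket path and the values passed here are assumptions for illustration, not anything Lambda or Fargate actually uses), the machine configuration and actions described above can be exercised from Python’s standard library alone, since the transport is just HTTP over a Unix domain socket:

```python
import http.client
import json
import os
import socket

# Assumed socket path: whatever was passed to `firecracker --api-sock`.
API_SOCKET = "/tmp/firecracker.socket"

def machine_config(vcpu_count=1, mem_size_mib=128):
    """Build the PUT /machine-config body; defaults mirror Firecracker's."""
    return json.dumps({"vcpu_count": vcpu_count, "mem_size_mib": mem_size_mib})

def action(action_type):
    """Build a PUT /actions body, e.g. InstanceStart or SendCtrlAltDel."""
    return json.dumps({"action_type": action_type})

class UnixSocketConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket instead of TCP."""
    def __init__(self, sock_path):
        super().__init__("localhost")
        self.sock_path = sock_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.sock_path)

def send(method, path, body):
    conn = UnixSocketConnection(API_SOCKET)
    conn.request(method, path, body, {"Content-Type": "application/json"})
    return conn.getresponse().status

if __name__ == "__main__" and os.path.exists(API_SOCKET):
    # Requires a running `firecracker --api-sock /tmp/firecracker.socket`.
    send("PUT", "/machine-config", machine_config(vcpu_count=2, mem_size_mib=256))
    send("PUT", "/actions", action("InstanceStart"))
```

Note the shape of the flow: configure the machine first, then trigger InstanceStart; the GET endpoints can be queried the same way to read back defaults.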
Firecracker runs in user space and uses the Linux Kernel-based Virtual Machine (KVM) to create microVMs. The fast startup time and low memory overhead of each microVM enables you to pack thousands of microVMs onto the same machine. This means that every function or container group can be encapsulated with a virtual machine barrier, enabling workloads from different customers to run on the same machine, without any tradeoffs to security or efficiency. Firecracker is an alternative to QEMU, an established VMM with a general purpose and broad feature set that allows it to host a variety of guest operating systems.
<CLICK> You can control the Firecracker process via a RESTful API that enables common actions such as configuring the number of vCPUs or starting the machine. It provides built-in rate limiters, which allows you to granularly control network and storage resources used by thousands of microVMs on the same machine. You can create and configure rate limiters via the Firecracker API and define flexible rate limiters that support bursts or specific bandwidth/operations limitations. Firecracker also provides a metadata service that securely shares configuration information between the host and guest operating system. You can set up and configure the metadata service using the Firecracker API.
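As a hedged sketch of what configuring one of those rate limiters looks like (the drive path, bucket size, and refill time below are made-up values), a PUT /drives request body with a bandwidth token bucket can be built like this; the bucket holds `size` bytes and refills every `refill_time` milliseconds:

```python
import json

def drive_config(drive_id, path_on_host, bytes_per_refill, refill_time_ms):
    """Build a PUT /drives/{drive_id} body with a bandwidth rate limiter."""
    return json.dumps({
        "drive_id": drive_id,
        "path_on_host": path_on_host,        # backing file on the host
        "is_root_device": True,
        "is_read_only": False,
        "rate_limiter": {
            "bandwidth": {
                "size": bytes_per_refill,        # token bucket size, bytes
                "refill_time": refill_time_ms,   # refill interval, ms
            }
        },
    })

# Example: cap the root drive at roughly 10 MB per 100 ms.
body = drive_config("rootfs", "/images/rootfs.ext4", 10_000_000, 100)
```

An "ops" bucket of the same shape can sit alongside "bandwidth" to cap operations per second instead of bytes.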
Let’s look at how AWS Lambda uses Firecracker.
As we build out AWS Lambda, we’re optimizing for security, reliability, performance, and cost – in the serverless domain.
AWS Lambda is event-driven, serverless code execution, currently available in all AWS Regions as a “foundational” service.
We launch Lambda in every new Region that AWS launches.
We build our systems behind the scenes to distribute load, scale up and down, and detect and route around failure … so you don’t need to.
And of course, as we do that, we must preserve isolation and maximize utilization.
Just three years after general availability, AWS Lambda already processes trillions of requests every month, for hundreds of thousands of active customers.
One of the primary systems in the Lambda architecture is called a Worker – this is where we provision a secure environment for customer code execution.
What does a Worker do?
It creates and manages a collection of Sandboxes
It sets limits on Sandbox … such as memory/CPU available for function execution
It downloads customer code and mounts it for execution
It manages multiple Language Runtimes
It executes Customer Code through Initialization and Invoke
And finally …
It manages AWS owned agents for monitoring and operational controls … like CloudWatch
Let’s look a little closer at the logical view of a Lambda worker.
At the top is your code, this is the most important part. This is what we run on your behalf. This is your zip, your layers and of course any language that you want to bring along.
We support a number of languages, through different Runtimes, including Node, Python, Java, C#, and more.
Underneath the Runtime is a Sandbox that hosts the runtime. This is the copy of Linux that we provide; it’s what you see when you look around the file system.
All of these containers run on a Guest OS – we use Amazon Linux. The Guest OS is multiplexed across hardware using virtualization.
That virtualization is enabled using a Hypervisor, and a Host OS that the Hypervisor runs in. The Host OS is also Amazon Linux for us.
And finally we have the Physical System Hardware.
To keep workloads safe and separate …
Code, runtime, and sandbox are only ever used for a single function. Multiple invocations will land in the same sandbox: if you call a function, and call it again and again, the calls go to the same sandbox serially. Invocations never overlap concurrently in a sandbox; when they would, that’s where we scale up. We do that for a whole lot of good reasons, but the biggest of those is efficiency. And the tmpfs that comes with a sandbox is never shared across multiple sandboxes.
Guest operating systems are shared across multiple functions in an account but are never used across multiple AWS accounts. There is a 1:N mapping from an AWS account to EC2 instances (or an equivalent hypervisor-isolated environment). So we never use the same virtual machine across multiple AWS accounts.
The boundary that we put between different accounts is virtualization. Then, we do share the underlying hardware across multiple AWS accounts. We do this because Lambda functions are really small and underlying hardware is really big. We can’t have a 128MB RAM machine so we use virtualization to chop up a box into multiple pieces.
The question we get asked most often is about ISOLATION. It means two things: one is SECURITY, and the other is OPERATIONAL ISOLATION. By the latter I mean how we run functions at consistent performance when there are other functions on the same hardware.
Let’s take a look at how we do isolation.
There are two ways we run Lambda functions today
One mode is where each worker is a separate EC2 instance. That’s a great security boundary and it’s a fast way to build the functionality. This is how we created Lambda and this mode is used today as well.
And the other mode is using Firecracker.
We run a bare metal instance, the same ones that you can launch using the EC2 console. And then we run thousands of Firecracker microVMs on that hardware.
Firecracker microVM technology provides a sufficient security boundary to host multiple accounts.
Under Firecracker, we are able to run with much more flexibility on high performance EC2 Bare Metal Hardware.
This really simplifies the security model for us. Instead of one function and one account per instance, this is now simplified to one function per microVM, with microVMs from multiple accounts on a piece of hardware. And this is really good for us in a whole lot of ways, which we’ll talk about a bit later.
This is a good programming model as well because it provides good isolation even between functions.
Another optimization that we do in Lambda is how we pick workloads to run on a worker.
So, this is a worker, and yes, it is a server. There are servers in serverless. When we look at this from the server’s perspective rather than the sandbox’s, packing the same workload onto a server is inefficient.
You may think about running multiple copies of the same workload on the same machine: cut it up into multiple sandboxes and run multiple copies of the same workload. It turns out that’s a bad thing to do, because copies of a function tend to consume the same type of resources and be active in the same time intervals. That means when one spikes up in CPU, it’s quite likely another will spike too, because they’re doing the same work. The same goes for memory or whatever else your function is doing.
This really limits how densely you can pack on the hardware. This means your server will be either running hot … or nearly idle.
You can take advantage of statistics and simply put as many uncorrelated workloads on the server as you can. So have a diverse set of workloads instead of multiple copies of the same workload. And this makes the workload way better behaved. It really brings down those peaks and brings up the average.
Because AWS runs so many workloads, we can find the uncorrelated workloads and distribute them across a set of servers to improve this situation.
That way, we have a chance of the workloads packing well together.
We can do better than that where we find workloads that are anti correlated such as down on CPU when another one spikes up.
The most efficient placement strategy is to pick the workloads that pack well together …
… and minimize contention.
So it’s all about putting the workloads where we can get optimum hardware utilization.
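A toy simulation makes the point (this is not Lambda’s actual placement logic; the burst probabilities and load values are invented for illustration): when all N copies of one workload burst on the same trigger, the server’s peak load hits its full capacity, while independent workloads’ bursts rarely line up:

```python
import random

random.seed(0)
N, TICKS = 100, 1000            # workloads per server, time steps simulated
P_BURST, IDLE, BUSY = 0.1, 0.05, 1.0  # assumed burst odds and load levels

def correlated_peak():
    """N copies of one workload: every copy bursts on the same event."""
    peaks = []
    for _ in range(TICKS):
        bursting = random.random() < P_BURST  # one shared trigger
        peaks.append(N * (BUSY if bursting else IDLE))
    return max(peaks)

def uncorrelated_peak():
    """N different workloads: each one bursts independently."""
    peaks = []
    for _ in range(TICKS):
        load = sum(BUSY if random.random() < P_BURST else IDLE
                   for _ in range(N))
        peaks.append(load)
    return max(peaks)

print("correlated peak:", correlated_peak())
print("uncorrelated peak:", uncorrelated_peak())
```

Both fleets have the same average load, but the uncorrelated peak stays far below full capacity, and that headroom is exactly what lets us pack more densely.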
Now, let’s look at how Firecracker is used with Fargate. But before we talk about that, let’s look at the container services landscape.
Your containers can be managed by Amazon ECS or EKS. Amazon ECS is Amazon’s managed container orchestration platform. Amazon EKS provides an upstream compatible managed Kubernetes control plane.
You can run ECS using EC2 virtual machines or using Fargate where you don’t need to manage the servers or clusters and just run containers. EKS data plane can only run using EC2-based instances at this time.
We also have a fully-managed registry service to store container images: ECR
Let’s look at Fargate more closely and see how Firecracker is used there.
AWS Fargate is a compute engine for Amazon ECS that allows you to run containers without having to manage servers or clusters.
With AWS Fargate, you only have to think about the containers so you can just focus on building and operating your application. AWS Fargate eliminates the need to manage a cluster of Amazon EC2 instances. You no longer have to pick the instance types, manage cluster scheduling, or optimize cluster utilization. All of this goes away with Fargate.
AWS Fargate makes it easy to scale your applications. You no longer have to worry about provisioning enough compute resources for your container applications. After you define your application requirements (e.g., CPU, memory, etc.), AWS Fargate manages all the scaling and infrastructure needed to run your containers in a highly-available manner.
AWS Fargate seamlessly integrates with Amazon ECS. You just define your application as you do for Amazon ECS. You package your application into task definitions, specify the CPU and memory needed, define the networking and IAM policies that each container needs, and upload everything to Amazon ECS. After everything is setup, AWS Fargate launches and manages your containers for you.
Just a year after general availability, AWS Fargate runs tens of millions of containers for customers every week.
Fargate tasks can be provisioned using over 40 different combinations of CPU and memory. Fargate takes care of provisioning, maintaining, and scaling the tasks, and customers pay only for what their application uses. Like Lambda, any optimization in placing Fargate tasks is something that customers don’t need to worry about. This is where the Fargate and Firecracker integration helps out.
Let’s take a look.
Containers are a set of processes running in cgroups and namespaces. These constructs provide a weak form of isolation. Although they can isolate well-meaning processes from each other to a certain degree, they were never designed for running hostile multi-tenant workloads side by side on top of the same kernel. In order to provide the desired level of security, each Fargate task has its own isolation boundary and does not share the underlying kernel, CPU resources, memory resources, or elastic network interface with another task.
Fargate maintains a warm pool of EC2 instances. This allows tasks to scale rapidly instead of provisioning an EC2 instance on demand. When a customer requests to run a Fargate task, we match it to an EC2 instance that satisfies the vCPU and memory requirements of the task. This results in some wasted resources on those EC2 instances.
EC2 offers a really large variety of compute instances, whether it’s general purpose, compute-, memory- or storage-optimized, Intel- or AMD-based, GPU-powered or bare metal. But there is no 1:1 match between the vCPU and memory combinations offered by Fargate and EC2 instance types. For example, there is no EC2 instance that offers 0.25 vCPU and 0.5 GB RAM. So we pick an instance type that ensures there is enough vCPU and memory available to run the task. Customers do not see this happening, as it happens in the AWS service account. An EC2 instance is not used across multiple AWS accounts. This is good for the customer, as they get hardware virtualization-based security, but it is inefficient resource utilization, as there is a likely loss of compute and memory.
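To make the inefficiency concrete, here is a hypothetical sketch (the instance shapes listed are examples, not the full EC2 catalog): match a small task to the smallest listed instance that covers its needs, and measure the capacity that goes unused:

```python
# (name, vCPU, memory in GiB) -- illustrative shapes, smallest first.
INSTANCE_SHAPES = [
    ("c5.large", 2, 4.0),
    ("m5.large", 2, 8.0),
    ("m5.xlarge", 4, 16.0),
]

def smallest_fit(task_vcpu, task_mem_gib):
    """Return (instance, wasted vCPU, wasted GiB) for the smallest fit."""
    for name, vcpu, mem in INSTANCE_SHAPES:
        if vcpu >= task_vcpu and mem >= task_mem_gib:
            return name, vcpu - task_vcpu, mem - task_mem_gib
    raise ValueError("no listed instance fits this task")

# A 0.25 vCPU / 0.5 GiB Fargate task still occupies a whole 2 vCPU / 4 GiB
# instance, stranding most of its capacity.
instance, wasted_vcpu, wasted_mem = smallest_fit(0.25, 0.5)
```

Under the one-task-per-instance model, that stranded 1.75 vCPU and 3.5 GiB cannot be given to another customer’s task, which is the waste Firecracker eliminates.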
Just like Lambda, Fargate isolates tasks inside a hypervisor boundary, and Firecracker helps make that more efficient.
With Firecracker,
<CLICK> one task per EC2 instance model goes away
<CLICK> and each Fargate task is now run in a microVM with minimal overhead. This allows us to exactly match the vCPU and memory requirements of the task.
<CLICK> This also means that we don’t need the warm pool and so that goes away as well.
<CLICK> In order to provision Fargate tasks in microVMs, we can use bare metal instances, the same ones that you can provision using the EC2 console.
<CLICK> And because each microVM provides hardware virtualization-based security already, we can pack these microVMs a lot more densely without compromising security. This allows us to utilize EC2 instances more efficiently. The security model also simplifies and allows Fargate tasks from multiple accounts to be spread across multiple instances.
And we also balance the tasks across bare metal instances across AZs. This gives high availability and resiliency to customer applications.
At AWS, we always look for innovative ways to use our resources efficiently and lower our operational cost. This allows us to pass those cost savings to customers. We’ve lowered our prices 69 times (TODO: check this number) since inception and customers love it!
Earlier this year we reduced prices for Fargate task by 30-50%. Innovations such as Firecracker allow us to improve the efficiency of Fargate and help us pass on cost savings to customers.
Firecracker-containerd project enables containerd to manage containers as Firecracker microVMs. Like traditional containers, Firecracker microVMs offer fast start-up and shut-down and minimal overhead. Unlike traditional containers, however, they can provide an additional layer of isolation via the KVM hypervisor.
Because the overhead of Firecracker is low, the achievable container density per host should be comparable to running containers using kernel-based container runtimes, without the isolation compromise of such solutions.
To maintain compatibility with the container ecosystem, where possible, we use container standards such as the OCI image format.
This diagram shows how the containerd runtime creates Firecracker microVMs.
The architecture consists of three main components: the Snapshotter, the Runtime, and the Agent.
A runtime linking containerd (outside the microVM) to the Firecracker virtual machine manager (VMM). The runtime is implemented as an out-of-process shim runtime communicating over ttrpc. It uses the VM disk image and kernel image to create the microVM.
A snapshotter that creates files used as block-devices for pass-through into the microVM. This snapshotter is used for providing the container image to the microVM. The snapshotter runs as an out-of-process gRPC proxy plugin.
An agent running inside the microVM, which invokes runC via containerd's containerd-shim-runc-v1 to create standard Linux containers inside the microVM.
AWS released Firecracker as an open source project at re:Invent 2018. Since then we’ve heard …
In case you have any ideas, we’re currently inviting everyone to contribute 2020 Roadmap proposals on our GitHub repository (until the end of the month).
These are the teams that have integrated with Firecracker so far. <CLICK> Kata Containers and Ignite – we will deep dive into these in the next few slides.
<CLICK> UniK is an orchestration platform for lightweight VMs. It provides tools for compiling application sources or containers into unikernels, which are lightweight bootable disk images, and microVMs. In addition, UniK runs and manages unikernels and microVMs on a variety of cloud providers, as well as locally. In January this year, UniK announced support for Firecracker to launch microVMs.
<CLICK> OSv is a new open-source operating system for virtual machines. OSv was designed from the ground up to execute a single application on top of a hypervisor, resulting in superior performance and effortless management when compared to traditional operating systems, which were designed for a vast range of physical machines.
The easiest way to run OSv on Firecracker is to use a Python script, firecracker.py. It automates the process of launching the Firecracker VMM executable and submitting the necessary REST API calls over a Unix domain socket to create and start an OSv microVM.
Let’s talk about how Firecracker and Kata Containers work together.
First, what is Kata Containers? Kata Containers is an open source project that is building lightweight VMs that feel and perform like containers.
While Kata Containers was initially based on QEMU, the project was designed up front to support multiple hypervisor solutions. Firecracker addresses Kata end users’ requests for a more minimal hypervisor solution for simple use cases. The Kata community began working with Firecracker right after the launch. As a result, Kata Containers 1.5 introduced preliminary support for the Firecracker hypervisor. This is complementary to the project’s existing QEMU support. Given the tradeoff on features available in Firecracker, we expect people will use Firecracker for feature-constrained workloads, and use a minimal QEMU when working with more advanced workloads (for example, if device assignment is necessary, QEMU should be used).
It is possible to utilize runc, Kata + QEMU and Kata + Firecracker in a single Kubernetes cluster, as shown in the diagram.
To achieve this configuration, the cluster must be configured to use either CRI-O or containerd, and must be configured to use the runtimeClass feature of Kubernetes. RuntimeClass is an alpha feature for selecting the container runtime configuration to use to run a pod’s containers.
With runtimeClass configured in Kubernetes as well as in CRI-O/containerd, end users can select the type of isolation they’d like on a per-workload basis. In this example, two runtimeClasses are registered: kata-qemu and kata-fc.
<CLICK> Selecting Firecracker-based isolation is as simple as patching existing workloads with the shown YAML snippet. To utilize QEMU, the runtimeClassName tag would be modified to kata-qemu.
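That snippet looks roughly like this (the pod and container names are made up for illustration); setting `runtimeClassName` on the pod spec selects the registered Kata + Firecracker runtime:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app          # hypothetical workload name
spec:
  runtimeClassName: kata-fc  # switch to kata-qemu for QEMU-backed Kata
  containers:
  - name: app
    image: nginx
```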
TODO: current status?
Weave Ignite is an open source Virtual Machine Manager with a container UX.
With Ignite, you pick an OCI-compliant image (Docker image) that you want to run as a VM, and then just execute ”ignite run” instead of docker run. There’s no need to use VM-specific tools to build .vdi, .vmdk, or .qcow2 images, just do a docker build from any base image you want, and add your preferred contents.
“ignite run” will use Firecracker to boot a new VM in ~125 milliseconds, using a default 4.19 Linux kernel. If you want to use some other kernel, just specify the --kernel flag, pointing to another OCI image containing a kernel at /boot/vmlinux, and optionally your preferred modules. Next, the kernel executes /sbin/init in the VM, and it all starts up. After this, Ignite connects the VMs to any CNI network.
In Git you declaratively store the desired state of a set of VMs you want to manage. ”ignite gitops” reconciles the state from Git, and applies the desired changes as state is updated in the repo. This can then be automated, tracked for correctness, and managed at scale.