Automated Paravirtualization of device drivers in Xen
Nikhil Pujari Vijayakumar M M Sireesh Bolla
Stony Brook University
Abstract:
Xen is an x86 virtual machine monitor which adopts a paravirtualization approach to allow multiple
operating systems to share conventional hardware without sacrificing performance or functionality.
Since Xen's paravirtualization approach provides an idealized virtual machine abstraction,
operating systems need to be ported before they can be installed and run on the Xen VMM. Instead of
emulating existing hardware devices as in full virtualization, Xen provides abstract devices which
implement a high-level interface for each device category. Since the interfaces are abstract and provide
only generic operations for a device class, extra effort is needed to provide the guest operating
systems with the device-specific functionality that the hardware may offer. These non-generic facilities
are typically exposed by the device drivers in the form of device private ioctls, which are generally used
for configuring the devices to enable/disable the non-generic functionality or to collect status information.
As of Xen 3.3, the netfront driver does not implement device private ioctls. User mode configuration
programs like ifconfig which understand the underlying devices use these ioctls. Our method
implements conversion from the local ioctl call in a Linux DomU to a remote call to the Dom0. This is
done by including a generic ioctl wrapper in netfront driver and a watch in the netback driver. This
automates the process of exposing arbitrary functionality provided by the real network hardware to the
DomU through device private ioctls. Only a specification with the ioctl numbers and buffer sizes needs
to be provided; a tool reads this specification and writes the entries to the XenStore.
Motivation:
Paravirtualization is a virtualization technique that presents a software interface to virtual machines that
is similar but not identical to that of the underlying hardware. Paravirtualization may allow the virtual
machine monitor (VMM) to be simpler or virtual machines that run on it to achieve performance closer
to non-virtualized hardware. However, operating systems must be explicitly ported to run on top of a
paravirtualized VMM.
Most full virtualization solutions provide emulated forms of simple devices. The emulated devices are
typically chosen to be common hardware, so it is likely that drivers exist already for any given guest.
Paravirtualized guests, however, need to be modified anyway, so the requirement for
the virtual environment to use existing drivers disappears. Xen provides abstract devices which
correspond to device categories, e.g. it provides an abstract block device instead of a SCSI or an
IDE device. This device abstraction provides the generic calls corresponding to that device
category, e.g. read and write calls for the block device. This is done to achieve efficient I/O
virtualization as opposed to emulation of devices in full virtualization. One of the important
optimizations included in this approach is the grouping of I/O operations, which improves efficiency.
Hardware manufacturers provide generic functionalities for that device class as well as additional
device specific functionalities. For example, a cd/dvd drive which falls into the block device category
provides the generic read and write capabilities, as well as offering the “special”/non-generic
capability of multisession recording. An ethernet device may offer special capabilities such as jumbo frames
and checksum calculations, in addition to the generic functionalities of send and receive.
The device drivers provided by Xen to the guest operating systems need to be modified in order to
enable the guests to exploit these special features provided by the hardware. The aim of the project is to
automate this process of modification to the Xen ethernet device drivers as much as possible, given the
specifications of the ethernet device/NIC to be used.
The initial aim of the project was to evaluate the feasibility of porting arbitrary network device drivers
to Xen and whether we could automate the process. After examining Xen’s approach to and
implementation of network I/O virtualization, we determined that there is little need to port network
device drivers to Xen, since the real device driver and Xen’s split drivers are two separate components
of the Xen network I/O chain. Existing drivers could be loaded in Xen Dom0 and network I/O would
work without any modifications to Xen split drivers. Hence the project objectives were modified to
exploring and implementing ways to expose device specific functionality to DomU’s and to do that in
an automated way, given a specification of the non-generic functionalities implemented by the real
network driver.
The following sections consist of an overview of the components of Xen's I/O virtualization
architecture which we had to study and use in order to implement our mechanisms to achieve this aim.
Overview of Xen device driver virtualization:
In Xen, the Domain0 is a privileged domain i.e. a privileged VM which hosts the administrative
management and control interface. It provides
the capability to create and terminate other domains (unprivileged domains, or DomU’s) and to
control their associated scheduling parameters, physical memory allocations and the access they are
given to the machine's physical disks and network devices. It also supports the creation of the virtual
network interfaces (VIFs) and virtual block devices (VBDs) which are used by the unprivileged
guests.
Xen implements a Split device driver model for device driver virtualization. The Dom0 is in control of
the actual hardware devices and virtual devices are exported for the DomU’s to use. Some
domains can also be given control of particular hardware devices, which then become
driver domains, but this is done only if an IOMMU is available; otherwise it compromises security.
The actual hardware driver resides in the driver domain/Dom0, and the virtual device driver is split into
two parts, viz. the frontend driver and the backend driver. They are separated by a virtual bus, XenBus,
which is roughly modeled after a device bus such as PCI. The backend driver resides in the driver
domain/Dom0 and the frontend driver resides in the guest.
The network frontend driver i.e. netfront driver acts as a device driver for the virtual network. It
communicates with the network backend driver i.e. netback driver with the help of shared memory ring
buffers and an event channel which is used for asynchronous notifications.
The event channel is the analog of hardware interrupts. The netback driver communicates with the
hardware through the actual hardware driver.
XenStore is another important component of Xen architecture. It is a database of configuration
information to be shared between domains. In relation to device drivers, it also fulfills the function of a
device tree, which is generally the result of querying an external bus such as the PCI bus. It is used to
communicate to the frontend driver the information about the domain hosting the backend driver,
information about the shared memory, the event channel to be used and the device-specific information.
Xen networking:
The Xen network interface employs two I/O ring buffers, one for incoming packets and one for outgoing.
Ring buffers are producer–consumer queues implemented in shared memory. These ring buffers are
used to transmit instructions, while the actual data is transferred through shared memory pages via
the grant mechanism. A grant reference refers to the shared memory page which acts as a buffer for the
actual data transfer.
Each transmission request contains a grant reference and an offset within the granted page. This allows
transmit and receive buffers to be reused, preventing the TLB from needing frequent updates.
A similar arrangement is used for receiving packets. The DomU guest inserts a receive request into
the ring indicating where to store a packet, and the Dom0 component places the contents there.
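This request/response structure can be sketched with a minimal userspace model. This is an illustration only: names such as tx_request, ring_put and ring_get are hypothetical, and the real Xen rings live in a page shared between domains, carry both requests and responses, and are paired with event-channel notifications.

```c
#include <stdint.h>

#define RING_SIZE 8            /* must be a power of two */

/* Simplified transmit request: a grant reference naming a shared page
 * plus an offset/length within it, in the spirit of Xen's transmit
 * requests. */
struct tx_request {
    uint32_t gref;             /* grant reference of the data page */
    uint16_t offset;           /* offset of the packet in that page */
    uint16_t size;             /* packet length in bytes */
};

/* One ring: the producer advances req_prod, the consumer req_cons.
 * The indices only ever grow and are reduced modulo RING_SIZE on
 * access, so (req_prod - req_cons) is the number of pending entries. */
struct ring {
    uint32_t req_prod;
    uint32_t req_cons;
    struct tx_request ring[RING_SIZE];
};

static int ring_put(struct ring *r, const struct tx_request *req)
{
    if (r->req_prod - r->req_cons == RING_SIZE)
        return -1;                            /* ring full */
    r->ring[r->req_prod % RING_SIZE] = *req;
    r->req_prod++;                            /* the real code issues a memory
                                               * barrier and an event-channel
                                               * notification here */
    return 0;
}

static int ring_get(struct ring *r, struct tx_request *req)
{
    if (r->req_cons == r->req_prod)
        return -1;                            /* ring empty */
    *req = r->ring[r->req_cons % RING_SIZE];
    r->req_cons++;
    return 0;
}
```

In the real interface one such ring carries transmit requests from DomU to Dom0 and a second carries receive requests in the other direction, with the packet payloads travelling through the granted pages rather than the ring itself.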
For each new DomU, Xen creates a new pair of "connected virtual ethernet interfaces", with one end in
DomU and the other in Dom0. For Linux DomU's, the device the guest sees is named eth0. The other end
of that virtual ethernet interface pair exists within Dom0 as interface vif<id#>.0.
The default Xen configuration uses bridging within Dom0 to allow all domains to appear on the
network as individual hosts. When xend (the Xen Daemon) starts it runs a script named network-bridge
which creates a new bridge named xenbr0. The virtual network interfaces in the Dom0 are connected to
the real physical interface using this bridge. The network card runs in promiscuous mode. Each guest
gets its own MAC address assigned to its virtual interface. This allows all the guests to appear on the
network as individual hosts.
A packet arrives at the hardware, is handled by the real ethernet driver and appears on peth0, the real
ethernet interface. The interface peth0 is bound to the bridge, so the packet is passed to the bridge from there.
This step happens at the Ethernet level; no IP addresses are set on peth0 or the bridge. The bridge then distributes
the packet, just like a switch would. It is passed to the appropriate virtual interface based on the MAC
address and from there it is delivered to the correct guest domain.
XenStore and XenBus:
XenStore is a hierarchical namespace (similar to sysfs or Open Firmware) which is shared between
domains. The interdomain communication primitives exposed by Xen are very low-level (virtual IRQ
and shared memory). XenStore is implemented on top of these primitives and provides some higher
level operations (read a key, write a key, enumerate a directory, notify when a key changes value).
XenStore is a database, hosted by domain 0, that supports transactions and atomic operations. It's
accessible by either a Unix domain socket in Dom0, a kernel-level API, or an ioctl interface via
/proc/xen/xenbus.
XenStore is used to store information about the domains during their execution and as a mechanism of
creating and controlling DomU devices.
XenBus provides an in-kernel API used by virtual I/O drivers to interact with XenStore.
There are three main paths in XenStore:
/vm - stores configuration information about domain
/local/domain - stores information about the domain on the local node (domid, etc.)
/tool - stores information for the various tools
The /local path currently only contains one directory, /local/domain, which is indexed by domain id and
contains the running domain information. It contains directories for each of the device backends, for
example vbd for block devices and vif for network devices, in the directory /local/domain/<domain-
id>/backend. It consists of status entries and entries for the names and ids of the various entities such as
the DomU, the bridge to which it is connected, and the MAC address. This is the directory in which we
can store configuration information specific to our netfront-netback drivers.
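As an illustration, the backend portion of this tree for a guest's first virtual interface might look roughly as follows (the exact key names vary between Xen versions, and the values here are made up):

```
/local/domain/0/backend/vif/<domain-id>/0/
    frontend = /local/domain/<domain-id>/device/vif/0
    mac      = aa:00:00:51:02:f3
    bridge   = xenbr0
    state    = 4
```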
All Xen virtual device drivers register themselves with XenBus at initialization. Most initialization and
setup is postponed until XenBus calls the probe function, which is very similar to how the PCI probe
function gets called in real ethernet drivers.
There are two classes of API which are used to write/read/modify XenStore. One set of API is for
accessing XenStore from tools, while the other is an in-kernel API used to access XenStore from
inside the driver code.
XenStore API for tools:
The whole set of functions can be found in the file /tools/xenstore/xs.h. It contains functions such as
xs_mkdir, xs_read, xs_write, xs_directory and xs_rm, which create directories, read/write entries inside
directories, read directory contents, and remove entries/directories, respectively. These functions are very
similar to the set of POSIX functions for file/directory operations.
These functions can be called from C programs or perl/python scripts to create/modify/destroy entries
in XenStore. Various Xen tools use them to operate on XenStore.
XenStore in-kernel API or XenBus API:
This set of functions can be found in the file /include/xen/xenbus.h. It includes functions such as
xenbus_register_frontend/backend, xenbus_read/write, xenbus_mkdir/rm, xenbus_printf/scanf and
register/unregister_xenbus_watch, which register frontend/backend drivers, create/modify/destroy
XenStore entries, and set/unset watches on XenStore entries.
XenStore Transactions:
Transactions provide developers with a method for ensuring that multiple operations on the Xenstore
are seen as a single atomic operation. Any time multiple operations must be performed before any
changes are seen by watchers, a transaction must be used to encapsulate the changes.
A transaction is started by calling xenbus_transaction_start() for the directory whose contents
need to be changed or read. The XenStore API functions can then be used to read/write values
in the desired entries. The transaction is ended by calling xenbus_transaction_end().
Similar functions exist which can be called from userspace tools to modify or read values from
XenStore.
XenStore Watches:
A watch is the functionality provided by XenStore which allows for registering callback functions
which are invoked when a particular XenStore entry or any entry below the directory being watched, is
changed. This allows drivers or applications to respond immediately to changes in the XenStore.
Drivers can register a watch by using the function register_xenbus_watch() which takes as input a
structure of type xenbus_watch which contains the XenStore entry/directory to be watched and a
pointer to the callback function.
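The pattern can be modeled in ordinary userspace C as follows. This is a hypothetical model for illustration only; register_watch and store_write here stand in for the real register_xenbus_watch and the XenStore write path, and the key property shown is that a watch fires for the watched entry or anything below it.

```c
#include <stddef.h>
#include <string.h>

/* Minimal model of a XenStore watch: a watched path prefix plus a
 * callback invoked when that entry, or anything below it, changes. */
struct watch {
    const char *node;                              /* watched path */
    void (*callback)(struct watch *w, const char *path);
};

#define MAX_WATCHES 16
static struct watch *watches[MAX_WATCHES];
static int nr_watches;

static int register_watch(struct watch *w)
{
    if (nr_watches == MAX_WATCHES)
        return -1;
    watches[nr_watches++] = w;
    return 0;
}

/* Called on every store write: fire each watch whose node is a
 * prefix of the path that changed. */
static void store_write(const char *path)
{
    for (int i = 0; i < nr_watches; i++)
        if (strncmp(path, watches[i]->node, strlen(watches[i]->node)) == 0)
            watches[i]->callback(watches[i], path);
}

/* Example callback: count how many times the watched subtree changed. */
static int fired;
static void on_change(struct watch *w, const char *path)
{
    (void)w; (void)path;
    fired++;
}
```

Our netback-side watch described later follows exactly this shape: it watches the ioctl directory and its callback runs whenever the netfront side writes an ioctl input underneath it.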
Design and Implementation:
Network interfaces are represented inside the Linux kernel by struct net_device. Network drivers
populate the structure and register it with the kernel at initialization time. It is the very core of the
network driver layer and contains all the different types of information pertaining to the interface like
the interface name, hardware information like DMA channel and IRQ assigned to the device, interface
information such as MAC address and flags, a function dispatch table with functions such as open,
close, transmit, do_ioctl, change_mtu etc.
The do_ioctl method is generally used to implement non-standard functionality specific to the device.
When the ioctl system call is invoked on a socket, the command number is one of the symbols defined
in <linux/sockios.h>, and the sock_ioctl function directly invokes a protocol-specific function. Any
ioctl command that is not recognized by the protocol layer is passed to the device layer. These device-
related ioctl commands accept a third argument from user space, a struct ifreq *. This structure is
defined in <linux/if.h>. In addition to using the standardized calls, each interface can define its own
ioctl commands. The ioctl implementation for sockets recognizes 16 commands as private to the
interface: SIOCDEVPRIVATE through SIOCDEVPRIVATE+15. When one of these commands is
recognized, dev->do_ioctl is called in the relevant interface driver. The function receives the same
struct ifreq * pointer that the general-purpose ioctl function uses:
int (*do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
The ifr pointer points to a kernel-space address that holds a copy of the structure passed by the user.
After do_ioctl returns, the structure is copied back to user space. Therefore, the driver can use the
private commands to both receive and return data. The device-specific commands can choose to use the
fields in struct ifreq, but they already convey a standardized meaning, and it's unlikely that the driver
can adapt the structure to its needs. The field ifr_data is a caddr_t item (a pointer) that is meant to be
used for device-specific needs. The driver and the program used to invoke its ioctl commands should
agree about the use of ifr_data. This pointer can point to arbitrary configuration data understood both
by the application and driver.
The rest of the methods are standard methods which are supported by every interface and hence by the
netfront interface provided by Xen to the DomU guest. We have written a generic ioctl wrapper
function, the address of which is assigned to the do_ioctl member of the netfront interface.
Generally the ifr_data field points to a structure or a buffer with arbitrary amount of data. If it points to
a structure then the size of the buffer pointed by it can be derived from the structure definition. But if it
points to arbitrary data, then the method generally followed by driver developers to communicate
the size is to encode it in the first 4 bytes of the buffer. The driver first reads the size from the buffer and
then reads the rest of the buffer.
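The size-prefix convention can be sketched as follows (the helper names are hypothetical, and a real driver would additionally validate the size and agree with its tool on byte order; host order is assumed here):

```c
#include <stdint.h>
#include <string.h>

/* Pack a variable-length payload for ifr_data: the first 4 bytes hold
 * the payload size, and the payload itself follows. */
static void pack_buffer(uint8_t *buf, const void *payload, uint32_t size)
{
    memcpy(buf, &size, sizeof size);           /* size in first 4 bytes */
    memcpy(buf + sizeof size, payload, size);  /* then the data */
}

/* What the consumer does on the other side: read the size first, then
 * read exactly that many payload bytes. Returns the payload size. */
static uint32_t unpack_buffer(const uint8_t *buf, void *payload)
{
    uint32_t size;
    memcpy(&size, buf, sizeof size);
    memcpy(payload, buf + sizeof size, size);
    return size;
}
```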
A summary of our implementation is as follows. The specification of non-standard functionalities
implemented by the real network driver is provided in the form of a list of private ioctls(indicated by
the command, which lies between SIOCDEVPRIVATE through SIOCDEVPRIVATE+15) implemented
by the real network driver and the sizes of the buffers pointed to by the ifr_data field. If the buffer points to
an arbitrary amount of data, then this is encoded in the specification by putting -1 in the size field. A script reads
the list of ioctls and buffer sizes from the specification and creates corresponding fields in the XenStore
in the directory /local/domain/<domid>/ioctl/, which has been created beforehand to house the ioctls.
In the entry for each ioctl, entries are created for the input and return values and the return status.
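Under this scheme a specification could look like the fragment below. The format shown is purely hypothetical and only illustrates the idea: one line per private ioctl, giving the command and the ifr_data buffer size in bytes, with -1 marking a buffer whose size is encoded in its first 4 bytes.

```
# private ioctl        ifr_data buffer size
SIOCDEVPRIVATE+1       32
SIOCDEVPRIVATE+2       128
SIOCDEVPRIVATE+3       -1
```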
When a private ioctl is invoked from a DomU, the ioctl wrapper function reads the size field from the
corresponding entry. It then reads the whole struct ifreq and starts a transaction to write the ifreq
structure to input entry under that ioctl in the XenStore. For each ioctl we have included a return_ready
entry under it in the XenStore. It is a Boolean entry which indicates whether the return field is ready,
i.e. whether the netback driver has written the ioctl return value in the return entry of that ioctl. After
writing the input ifreq to the input entry, the netfront ioctl wrapper writes 0 to the return_ready entry and
ends the transaction. It then enters a loop to poll the return_ready status.
Netback driver registers a watch on the ioctl directory at the time of its initialization in the netback_init
function. This watch is triggered when the netfront do_ioctl writes the ioctl input to the XenStore. The
watch handler reads the ifreq structure from the XenStore and calls the real device driver's do_ioctl function. It
invokes the dev_get_by_name function, passing it the name of the real network interface, “peth0”.
The real interface is renamed from eth0 to peth0, when Xend brings up the network bridge. This
function returns the net_device structure for the real network interface. Then do_ioctl function is
invoked on it.
The return value of the ioctl is passed back in the same ifreq structure which was passed to it. This structure
is then written back by the watch handler to the return entry under the ioctl in the XenStore, and the return
status is written to its entry. The watch handler then sets the return_ready value under that ioctl to one.
As soon as return_ready becomes one, the netfront ioctl wrapper comes out of the loop and starts a
transaction to read the return status. If the status indicates that the ioctl call was a success then it reads
the return value, writes it back to the buffer pointed by ifreq and ends the transaction and returns the
status.
Permissions:
DomU guests can be allowed or disallowed to invoke ioctls using the XenStore permissions API. These
permissions can be set from tools, which are perl/python scripts or C programs, from the Dom0. When
we wish to disallow a particular DomU to invoke a particular ioctl we can revoke read and write
permissions for it on the path /local/domain/<domain-id>/ioctl/<ioctl-number>. This can be done by
invoking the following function from a python script in Dom0,
xstransact.SetPermissions(path, { 'dom'   : dom1,
                                  'read'  : False,
                                  'write' : False })
The corresponding C function is xs_set_permissions. The 'dom' entry denotes the domain whose read and
write privileges are being revoked, and path is the path of the ioctl entry in the XenStore. When
permission is to be granted, this function can again be invoked with the values in the read and write
fields as True.
Conclusion:
Our method automates the process of exposing non-generic functionality implemented by the real
network driver in the form of device private ioctls. XenStore provides the infrastructure to encode the
ioctl information and the input and return values. The transactions API ensures that the reads and writes
are consistent and atomic. Since reads and writes to the XenStore are essentially file I/O, they are
slower than the data transfer mechanism between frontend and backend using shared memory. But
using the data transfer mechanism to transport the input and output of the ioctl would have required
major changes to the Xen networking code. In principle, XenStore is a mechanism to store
configuration information and ioctls are also generally used for configuration purposes. Also since
ioctls do not do data transfer themselves, their execution is not as time critical as that of the data
transmit and receive.
Newer NICs provide hardware support for virtualization which enable the NIC to be shared between
different VMs efficiently and safely. Various efforts are underway to enable Xen to use the large set of
diverse and evolving functionalities being provided by newer NICs. The Xen netchannel2 protocol
along with a high level network I/O virtualization management system is being developed to address
this need. The manager would relieve users of the need to make decisions and configurations that are
customized to the underlying hardware capabilities. Instead, the manager would allow users to specify
policies at a high level and then determine the appropriate low-level configurations specific to the
particular hardware environment that would implement the policies. Thus, the manager would provide
a clean separation between user-relevant policies, and the hardware and software mechanisms that are
used to implement the policies.