Hardware-Accelerated Virtualization in the ARM Cortex™ Processors
1. Hardware-Accelerated Virtualization in the ARM Cortex™ Processors. John Goodacre, Director, Program Management, ARM Processor Division, ARM Ltd., Cambridge, UK. 2nd November 2010.
24. System Characteristics of big.LITTLE
PERFORMANCE
- SMS and voice do not need a 1 GHz processor.
- The browser needs full performance for complex rendering, but very little while the page is simply being read.
- You do not know which combination of apps the user will run, yet the smartphone must remain responsive.
29. Spanning Hypervisor Framework
[Diagram: the ARM Secure and Non-Secure execution environments shown across the exception levels – TrustZone Monitor, Hyp, Kernel/Supervisor, and User (application layer). The Secure side hosts SoC management domains (ST/TEI) and a secured-services domain on an RTOS/µkernel, using para-virtualization with platform service APIs and a secured hardware timer tick. The Non-Secure side hosts an open-platform hypervisor providing full virtualization (KVM) for an open OS (Linux) and its applications. Virtualized TrustZone calls and a resource/driver API bridge the two worlds; an NoC with system MMU (sMMU) connects coherent platform resources and accelerators (e.g. DSP/GPU), exposed through a virtualized heterogeneous API driver (OpenMP RT).]
Project website: http://www.virtical.eu
This work at ARM is funded in part by ICT-Virtical, a European project supported under the Seventh Framework Programme (FP7) for research and technological development.
Editor's Notes
Key features introduced on the Cortex-A15 that have not been seen before: Virtualization – it could be done before in SOFTWARE, but the Cortex-A15 adds HARDWARE support, making it much more efficient. Large physical addressing – previous ARM cores were limited to 4GB of memory, of which only 2-3GB is typically usable because of the address mapping of peripherals etc. Increasing the number of physical address bits from 32 to 40 raises the addressable memory from 2^32 = 4GB to 2^40 = 1TB. Note that the virtual address space available to a single program is still limited to 32-bit/4GB. AMBA 4 system coherency – allows different clusters of processors, or other cached masters, to share data efficiently. ARM has previously supported up to 4xMP; AMBA 4 allows MP to scale well beyond that, with current plans covering 8- and 16-core systems. AMBA 4 can in theory support more cores than that, but a Cache Coherent Interconnect with more than four ports would need to be created to realize such 16+ core systems. (ARM currently has plans for two- and four-port CCIs.)
To do: emphasize what the VMM does. The guest OS can now manage its own page tables and disable interrupts – details on this later. The VMM can be written to trap all hardware accesses by the OS, meaning that the OS source need not be changed (though you might still change it for performance reasons). For instance, the system can be configured so that all CP15 register accesses are trapped to Hyp mode, where they are handled by the VMM; Hyp mode can then alias identification registers such as the CPU ID. The interaction between Virtualization and TrustZone is not a subject for now, but the two can co-exist, and Virtualization is only supported in the Non-Secure world. A sketch of this trap configuration follows below.
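As an illustration of how a hypervisor might arm such traps, the minimal sketch below writes the ARMv7 Hyp Configuration Register (HCR) from Hyp mode. The bit positions follow the ARMv7-A Architecture Reference Manual, but the macro names and helper function are illustrative assumptions, not taken from any real header, and a real hypervisor sets many more bits than shown.

    /* Minimal sketch: assumes execution in Hyp mode (PL2) on an ARMv7-A
     * core with the Virtualization Extensions. Macro names are ours. */
    #define HCR_VM    (1u << 0)   /* enable stage-2 address translation      */
    #define HCR_TID3  (1u << 18)  /* trap guest reads of ID register group 3 */
    #define HCR_TSC   (1u << 19)  /* trap guest SMC (Secure Monitor) calls   */

    static inline void write_hcr(unsigned int val)
    {
        /* HCR is accessed via CP15: MCR p15, 4, <Rt>, c1, c1, 0 */
        __asm__ volatile("mcr p15, 4, %0, c1, c1, 0" : : "r"(val));
    }

    void hyp_setup_traps(void)
    {
        /* Trap ID-register reads so the VMM can present an aliased CPU ID,
         * and turn on stage-2 translation for the guest. */
        write_hcr(HCR_VM | HCR_TID3 | HCR_TSC);
    }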
The LPAE extension allows each OS to live in a 32-bit virtual address space while the VMM maps multiple such systems within the 40-bit physical address space. The guest OS believes the IPA is the real PA; the second-stage translation is invisible to it. Without the Virtualization Extensions, the VMSAv7 page tables have been extended to allow generation of 40-bit physical addresses, but only at 16MB super-section granularity; support for super-sections and 40-bit PA generation is optional in VMSAv7. The LPAE extension (always used for the stage-2 translation in a virtualized system) allows 4KB granularity over the entire 40-bit address range. Support for LPAE without Virtualization is permitted (but not the other way round). A toy example of the stage-2 mapping follows below.
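As a toy illustration of the second-stage mapping, the function below flattens the real multi-level LPAE stage-2 walk, which the MMU performs in hardware, into a single table lookup: at 4KB granularity the low 12 bits of the intermediate physical address (IPA) pass through unchanged while the page number is remapped by the hypervisor's table. All names here are hypothetical.

    typedef unsigned long long pa_t;  /* wide enough for a 40-bit physical address */

    /* Toy stage-2 translation: one flat array stands in for the real
     * multi-level LPAE page-table walk done by the hardware. */
    pa_t stage2_translate(pa_t ipa, const pa_t *s2_page_map)
    {
        pa_t page_base = s2_page_map[ipa >> 12]; /* guest page no. -> real page base */
        return page_base | (ipa & 0xFFFull);     /* 4KB page offset is unchanged     */
    }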
ANIMATED. Pretty simplified, but it still shows the delivery of a virtual interrupt. What is missing is the incredibly complex process of the hypervisor restoring the guest OS context, deciding which CPU to deliver the virtual interrupt to, or any one of the 101 details that the hypervisor must deal with; a minimal sketch of the injection step itself follows below.
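To make the "delivery of a virtual interrupt" step concrete, here is a minimal sketch of how a hypervisor might inject one through a list register of the GIC's virtualization interface (GICH). The field layout follows the GICv2 architecture, but the base address is a made-up platform constant, and real code must first find a free list register and manage its state thereafter.

    #define GICH_BASE 0x2C004000u                 /* hypothetical platform address */
    #define GICH_LR0  (*(volatile unsigned int *)(GICH_BASE + 0x100))

    /* Mark virtual interrupt 'virq' pending for the current virtual CPU.
     * GICv2 list-register fields: [9:0] virtual ID, [29:28] state (01 = pending). */
    void inject_virq(unsigned int virq)
    {
        GICH_LR0 = (1u << 28) | (virq & 0x3FFu);
    }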
A big.LITTLE system can use the right core for the right task. This is built on coherency and software compatibility:
- Software migration across processors has minimal cost.
- The migration is transparent to applications and the OS.
- There is a great deal of innovation and differentiation possible here through tuning of OS power management, use of hypervisors, and more.
[Internal note: the idea is to rebut asynchronous DVFS without specifically alluding to it.]
Even with near-perfect voltage scaling, the "big" processor cannot match the power and energy efficiency of the simple Kingfisher pipeline for lower-intensity workloads such as MP3 playback, basic OS UI and communication activity, and scrolling. Kingfisher establishes a new bar for power-efficient performance on low- and mid-intensity workloads.
Lead-in to next slide: for the more interactive, high-intensity workloads at the upper end of the spectrum, the Cortex-A15 provides the instantaneous responsiveness.
GIC-400 transports interrupts across clusters under hypervisor control
Phase I: a single cluster is in operation at any given time.
- A good fit for Linux: the OS estimates system load and invokes performance governors at preset watermarks, and the governor back end can synchronously call the switcher software.
- When performance on one cluster falls below or exceeds the required level, execution switches seamlessly to the appropriate cluster. The core state switch occurs in under 20 microseconds.
- The OS remains oblivious to the state switch; OS load estimation then re-calculates system capability.
- The virtualizer handles out-of-bound accesses, so the OS can be built for either Cortex-A15 or Kingfisher while the switcher handles core switching. Being OS-agnostic, the scheme is easily extended to support other systems.
Phase II: the virtualizer is a natural progression of the switching work – a single OS with simultaneous execution across both clusters.
- The switcher layer is redundant, so the design is consequently simpler; the virtualizer continues to handle out-of-bound micro-architectural accesses.
- The dual-cluster system appears logically as a single cluster of processors with varying capability.
- Native support requires minimal extension to the Linux kernel: a machine description for the "SMP" platform, inter-cluster signalling modifications, and mapping CPUID references to the CPU within the target cluster.
- This is still about mechanics, not policy: the OS scheduler still won't make intelligent scheduling decisions. There is scope for "manual" improvement by forcefully binding threads statically to a suitable processor domain, using cgroups and cpusets (see the sketch below).
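As one concrete way to pin a thread to a chosen cluster from user space, the sketch below uses the standard Linux sched_setaffinity() call. The CPU numbering is an assumption (here CPUs 0-3 are imagined to form the "big" cluster); on a real platform it would come from the CPU topology rather than being hard-coded.

    #define _GNU_SOURCE
    #include <sched.h>

    /* Bind the calling thread to an assumed "big" cluster (CPUs 0-3).
     * On a real platform, read the CPU topology instead of hard-coding. */
    int bind_to_big_cluster(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        for (int cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &mask);
        return sched_setaffinity(0, sizeof(mask), &mask); /* 0 = this thread */
    }

Equivalently, an administrator can impose the same static partitioning without code changes through the cpuset cgroup controller.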