HKG15-107: ACPI Power Management on ARM64 Servers
---------------------------------------------------
Speaker: Ashwin Chaugule
Date: February 9, 2015
---------------------------------------------------
★ Session Summary ★
Status of CPPC with runtime PM and discussion on idle PM with ACPI
--------------------------------------------------
★ Resources ★
Pathable: https://hkg15.pathable.com/meetings/250767
Video: https://www.youtube.com/watch?v=eDDgYIkUHLI
Etherpad: http://pad.linaro.org/p/hkg15-107
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2015 - #HKG15
February 9-13th, 2015
Regal Airport Hotel Hong Kong Airport
---------------------------------------------------
http://www.linaro.org
http://connect.linaro.org
2. Overview
● CPU Performance management
○ CPPC (Collaborative Processor
Performance Control)
○ PCC (Platform Communication Channel)
○ State of patchwork
○ Next steps
● CPU idle management overview
● Device power management overview
4. Power Management overview
● Overall goal is to run the system as efficiently as possible, balancing power and performance
● Active power management
○ Minimize power when the system is active and running
● Idle power management
○ Go to the deepest possible idle state, with the most power savings, while honoring the workload's desired response time
● Limits management
○ Deliver the maximum possible performance within the system constraints
● Servers are plugged in and not backed by batteries
○ Cost of power is significant in TCO
● Server workloads typically have a high dynamic range of CPU utilization
○ Bursts of activity depending on time zones, holiday sales, etc.
○ Not always running at peak CPU utilization
○ Need to be very efficient across the whole range
5. CPU Performance Management
● CPPC = Collaborative Processor Performance Control
● New method to manage CPU performance
● Defined since ACPI v5.0+
● Preferred method for ARM64 servers vs PSS
● Richer interface supersedes ~12 ACPI objects and notifications
● Performance requests are made on an abstract, unitless, continuous scale
● Firmware on the remote processor is free to interpret values however it wants
○ Can choose to map unit as CPU freq. similar to “p-states”
○ Could be a combination of freq + other architecture specific performance knobs
● Handling in firmware prevents risk of preempting freq transitions in the kernel
● Also allows for much wider portability
● OS should not assume any specific meaning to the performance scale
● A per-CPU table (CPC) describes each CPU's performance capabilities and controls
● Contents of table can be registers (h/w, memory mapped or PCC) or static integers
6. Alternate method
● PSS = Performance Supported States
○ Discretized table of CPU frequencies
○ Assumes all CPUs have identical P states
● Requires x86-like mechanisms that write a register to change CPU frequency
● Processor Throttling Controls
○ PTC, TSS, TPC
○ Throttling states available to the CPU as a percentage of max
● Needs ARM specific spec updates
7. CPPC high level flow
● Platform enumerates the CPU performance range to the OS
● Highest Performance:
○ Highest performance capability of a CPU
● Nominal Performance:
○ Maximum sustained performance level
● Lowest Nonlinear Performance:
○ Lowest performance level at which non-linear power savings are achievable; running below this level could be suboptimal
● Lowest Performance:
○ Lowest performance capability
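These four capability levels can be sketched as a simple struct with an ordering check; the field names below are illustrative, not the kernel's actual layout:

```c
/* Illustrative container for the four enumerated capability levels;
 * not the kernel's actual struct. */
struct cppc_perf_caps {
    unsigned int highest;           /* peak capability */
    unsigned int nominal;           /* max sustained level */
    unsigned int lowest_nonlinear;  /* below this, savings turn suboptimal */
    unsigned int lowest;            /* minimum capability */
};

/* On the abstract scale the four levels must be monotonically ordered. */
static int cppc_caps_valid(const struct cppc_perf_caps *c)
{
    return c->highest >= c->nominal &&
           c->nominal >= c->lowest_nonlinear &&
           c->lowest_nonlinear >= c->lowest;
}
```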
8. CPPC high level flow
● OS requests desired performance
● Maximum Performance:
○ Upper bound on desired performance
● Desired Performance:
○ Ideal desired performance level
● Performance Reduction Tolerance:
○ Deviation below Desired Performance that the platform is allowed to run at. If the OS requests Desired performance over a specific Time Window, then this is the average performance to be delivered over that Time Window. The Time Window is specified in another register.
● Minimum Performance:
○ Lower bound on desired performance
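A minimal sketch of the request window, with hypothetical names: the platform may deliver anywhere between Minimum and Maximum, so a well-formed request keeps Desired inside that window:

```c
/* Illustrative request triple; field names are not the kernel's. */
struct cppc_perf_req {
    unsigned int max_perf;   /* upper bound on desired performance */
    unsigned int min_perf;   /* lower bound on desired performance */
    unsigned int desired;    /* ideal desired performance level */
};

/* The platform may deliver anywhere in [min_perf, max_perf], so a
 * well-formed request clamps desired into that window. */
static unsigned int cppc_clamp_desired(const struct cppc_perf_req *r)
{
    if (r->desired > r->max_perf)
        return r->max_perf;
    if (r->desired < r->min_perf)
        return r->min_perf;
    return r->desired;
}
```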
9. Other CPPC feedback regs
● Platform may be aware of power budgets and thermal constraints
● It can limit delivered performance by reading instantaneous values of specific sensors or counters
● Provides notification back to the OS when limits change
● Reference Performance Counter:
○ Counts at a fixed rate when the processor is active
● Delivered Performance Counter:
○ Counts at the rate of the current performance level, taking Desired into account
● Guaranteed Performance:
○ Sustained performance level deliverable by the platform given current constraints
○ Raises a notification when this level changes
● Performance Limited Register:
○ In the event of some constraint (e.g. a thermal excursion), this register defines 2 bits that indicate the platform unexpectedly delivered less than Desired, or less than Minimum, performance
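Together, the two counters give the average performance delivered over a sampling window: delivered = reference performance × Δdelivered / Δreference. A sketch, with illustrative names:

```c
/* Average delivered performance over a sampling window, computed from
 * the two feedback counters (delta = end - start of the window):
 *   delivered = reference_perf * delta_delivered / delta_reference
 * reference_perf is the performance level the reference counter ticks
 * at. Names are illustrative. */
static unsigned long cppc_delivered_perf(unsigned long reference_perf,
                                         unsigned long delta_delivered,
                                         unsigned long delta_reference)
{
    if (!delta_reference)
        return 0;   /* no reference ticks elapsed; nothing to report */
    return reference_perf * delta_delivered / delta_reference;
}
```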
10. Per CPU CPPC descriptor
● Each entry of the descriptor is either an integer or a register
● A register can be described as a hardware register, System I/O, or a PCC register
● PCC registers have their own format: the register entry encodes the PCC subspace ID and an offset into that subspace's shared communication region
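One way to model an integer-or-register entry is a tagged union; the type and field names below are hypothetical, not the kernel's:

```c
/* Tagged-union sketch of a CPC entry; names are hypothetical. */
enum cpc_entry_type { CPC_INTEGER, CPC_REGISTER };
enum cpc_reg_space  { CPC_SPACE_HW, CPC_SPACE_SYSTEM_IO, CPC_SPACE_PCC };

struct cpc_entry {
    enum cpc_entry_type type;
    union {
        unsigned long long value;         /* CPC_INTEGER: static integer */
        struct {                          /* CPC_REGISTER */
            enum cpc_reg_space space;
            unsigned char subspace_id;    /* used for PCC registers */
            unsigned long long address;   /* address, or offset into PCC region */
        } reg;
    };
};

/* Static integers can be consumed directly; registers need an accessor
 * for their address space (and, for PCC, a channel transaction). */
static int cpc_entry_is_static(const struct cpc_entry *e)
{
    return e->type == CPC_INTEGER;
}
```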
11. PCC: Platform Communication Channel
● ACPI v5.0+ defines a mailbox-like mechanism for the OS to communicate with a remote processor (e.g. a BMC) and back
● ACPI table for PCC (PCCT) defines a list of PCC subspaces/channels
● Each subspace entry defines:
○ Shared communication region address
○ Command and status fields for this region
○ Doorbell semantics for channel
● PCC commands are client specific
○ Clients defined in the current ACPI v5.1 spec include
■ CPPC
■ MPST (Memory node power state table)
■ RAS
● Doorbell protocol defines exclusivity of access to PCC channel between OS and
remote processor
● Supports async mode of notification from remote via IRQ
12. PCC: High level flow
● PCC Reads:
○ Client acquires a PCC channel lock (client specific)
○ Rings doorbell with READ cmd
■ Client waits for command completion
○ Client reads data updated by remote processor in comm space
○ Client releases PCC channel lock
● PCC Writes:
○ Client acquires a PCC channel lock (client specific)
○ Client writes data to comm space
○ Rings doorbell with WRITE cmd
■ Client waits for command completion
○ Client releases PCC channel lock
● If command completion fails, Client must retry or assume failure
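The doorbell handshake shared by both flows can be sketched as follows, assuming an illustrative shared-region layout and a bounded poll count standing in for the PCCT-specified timeout:

```c
#include <stdint.h>

#define PCC_CMD_READ  0x00
#define PCC_CMD_WRITE 0x01
#define PCC_STATUS_CMD_COMPLETE (1u << 0)

/* Illustrative layout of a PCC shared communication region. */
struct pcc_shmem {
    uint32_t signature;
    uint16_t command;
    uint16_t status;
    uint8_t  payload[64];   /* client-specific data */
};

/* Write the command, "ring the doorbell", then poll for completion.
 * A real driver rings a platform doorbell register and bounds the wait
 * with latency values from the PCCT; here max_polls stands in for that
 * timeout, and the remote processor is expected to set the
 * command-complete bit in status. */
static int pcc_cmd(volatile struct pcc_shmem *sh, uint16_t cmd,
                   unsigned int max_polls)
{
    sh->command = cmd;
    /* ring_doorbell() would go here */
    while (max_polls--)
        if (sh->status & PCC_STATUS_CMD_COMPLETE)
            return 0;       /* command completed */
    return -1;              /* caller must retry or assume failure */
}
```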
13. Linux support for CPPC + PCC
● PCC
○ Integrated as a mailbox controller
○ Initial patchwork in upstream kernels today (3.19-rcX)
● CPPC
○ CPPC parsing methods abstracted into separate files
○ CPUFreq driver that plugs into existing governors (e.g. ondemand)
■ ondemand ignores the current CPU frequency, which could lead to a suboptimal choice of the next frequency
■ Patchwork (v4) with CPUFreq integration under review
○ Investigating a PID-style governor
■ Early patchwork adapted the governor from intel_pstate
■ Experiments on ARM64 led to extensive modifications in the way CPU busyness is calculated
● Frequency-weighted CPU busyness including idle time
● Move busyness math to a workqueue
■ The intel_pstate PID suboptimally raises the next frequency request if the workload doesn't cause the timer to defer > 30 ms
■ Needs more experimentation on silicon
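The frequency-weighted busyness idea can be sketched as scaling raw busyness by the ratio of current to maximum frequency, so a core that is fully busy only because it runs slowly does not look fully loaded. The formula and names below are illustrative, not the actual patchwork:

```c
/* Scale raw busyness (0-100) by the ratio of current to maximum
 * frequency: a core that is 100% busy while running at half speed
 * represents only ~50% of the demand it would at full speed. */
static unsigned int freq_weighted_busy(unsigned int busy_pct,
                                       unsigned int cur_freq_khz,
                                       unsigned int max_freq_khz)
{
    return busy_pct * cur_freq_khz / max_freq_khz;
}
```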
14. CPPC + PCC
[Block diagram: Linux-side components (CPUFreq governors, CPPC CPUFreq driver, CPPC driver with inbuilt governor, CPPC lib, CPPC tables, PCC driver, PCCT table) reaching the CPU performance handlers on the remote processor via hardware registers/System I/O or the PCC firmware interface.]
15. CPU idle management overview
● As of the current spec (v5.1)
● C states defined for each processor
○ C0 - On
○ C1 - Cn -> ascending order of idleness
● _CST object for each processor
● Each object defines attributes for that idle state
● _CSD object for each processor defines C state cross dependencies
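An idle governor's use of such per-state attributes can be sketched as picking the deepest C state whose exit latency still meets the current latency requirement. This is a hypothetical helper, not the actual menu governor:

```c
/* Illustrative per-state attribute; real _CST entries also carry
 * power and other information. */
struct c_state {
    unsigned int exit_latency_us;
};

/* Return the index of the deepest idle state whose exit latency meets
 * the requirement, or -1 to stay in C0 (no idle state acceptable).
 * States are assumed ordered from shallowest to deepest. */
static int pick_c_state(const struct c_state *states, int n,
                        unsigned int latency_req_us)
{
    int i, best = -1;

    for (i = 0; i < n; i++)
        if (states[i].exit_latency_us <= latency_req_us)
            best = i;
    return best;
}
```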
16. CPU idle management overview
● _CST and _CSD don't scale well to heterogeneous architectures
● They assume the same number of power states at each processor
● Can't express device power state dependencies
● Can't express power resource dependencies
● No notion of the effect on caches at each level of the hierarchy
● WIP to address shortcomings in the spec
● Plan to use existing governors + PSCI methods
17. Device PM overview
● Devices may define Dx states
○ D0 - ON
○ D3 - OFF
○ D1/D2 - possible intermediate states
○ D3hot - off (like D3), but the device may remain enumerable and its context preserved
● Platform-specific details handled inside _PSx control methods
○ Called as needed by OSPM as the device transitions through Dx states
● Power resources handled in PowerResource objects
○ Each power resource supports _ON, _OFF and _STA (status) methods
○ Devices have _PRx lists which reference the power resources needed in each Dx state
● 2 options to do device PM:
○ Manage power resources inside _PSx, called on entry to a Dx state
○ Declare power resources separately with their own _ON/_OFF
■ Define device dependencies and let OSPM manage _ON/_OFF
● Should not have to rely on the clk/regulator frameworks in Linux
18. Device PM state transitions
● Device state transitions
1. Device wakeup (due to user request or interrupt)
a) If the device depends on a power resource, all required power resources must be turned on prior to enabling the device
2. Keep alive while there are ongoing requests
3. Device inactive (no device requests for some time)
● Power resources track all dependent devices (multiple devices may share the same power resource)
● Power resource state transitions
A. All dependent devices are inactive (D3)
B. A dependent device is attempting wakeup
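The sharing rule above amounts to reference counting: a power resource turns on with its first active dependent device and off only when the last one goes idle. A sketch with illustrative names:

```c
/* Reference-counted shared power resource; names are illustrative. */
struct power_resource {
    int on;      /* 1 = ON, 0 = OFF */
    int users;   /* number of active dependent devices */
};

/* A device waking up takes a reference; the first user powers the
 * resource on (the _ON method would run here). */
static void pr_get(struct power_resource *pr)
{
    if (pr->users++ == 0)
        pr->on = 1;
}

/* A device going inactive (D3) drops its reference; _OFF runs only
 * when no dependent device still needs the resource. */
static void pr_put(struct power_resource *pr)
{
    if (--pr->users == 0)
        pr->on = 0;
}
```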