Tommaso Cucinotta - Low-latency and power-efficient audio applications on Linux

Low-latency and power-efficient audio applications on Linux
Tommaso Cucinotta
tommaso.cucinotta@santannapisa.it

About me
LinuxLab 2018 – T. Cucinotta – Real-Time Systems Laboratory (RETIS)
■ 2016-present: Associate Professor at the Real-Time Systems
Laboratory (RETIS) of Scuola Superiore Sant’Anna, teaching
Component-Based Software Design, Cloud Computing, Big Data, ...
■ 2014-2016: Software Development Engineer at AWS, improving
the real-time performance and scalability of DynamoDB
■ 2012-2014: Researcher at Alcatel-Lucent Bell Labs, investigating
the security and real-time performance of cloud infrastructures, with
a focus on IMS and NFV
■ 2005-2012: Researcher at RETIS, investigating adaptive
real-time scheduling for multimedia applications on Linux
■ 2001-2004: PhD in Computer Security & Smart-Card Based
Authentication, RETIS

About the RETIS
■ Belongs to the Institute of Communications, Information and Perception
Technologies of Scuola Superiore Sant’Anna in Pisa
■ Research specialties
◆ predictable execution of software through
■ mechanisms at operating system and kernel level
■ design methodologies and tools
■ performance and timing analysis
◆ provide real-time support for emerging computing platforms
■ multi-core and heterogeneous platforms (big.LITTLE, GPGPU, FPGA)
■ distributed infrastructures for cloud & big-data computing and NFV
◆ make real-time systems resource- and energy-efficient
◆ hard real-time use-cases: automotive, industrial automation, railroads
◆ soft real-time use-cases: multimedia, health-care, telecommunications

Introduction
Common multimedia processing case: audio playback and
video streaming
■ Works without particular precautions
■ No interactivity or low-latency requirements
■ 100s of ms, or even seconds, of data can be pre-buffered
and pre-processed
■ the run-time platform (user-space + kernel) only needs to
deliver pre-processed A/V frames to the underlying
hardware on time

What about interactivity?

Problem
Interactive multimedia processing
■ low-latency requirement from when a user interaction
happens, to when it is reflected in the output A/V stream

Examples
■ video editing: change filter(s) and/or parameters in a
real-time video processing pipeline
■ on-line interactive services: e.g., office automation
■ gaming, VR, AR: user interacts with the environment
and/or other users (e.g., multi-player shooting)
■ software-based sound synthesis: user presses one or
more instrument keys / controllers

Problem
Interactive multimedia processing: how can we achieve low
latency?
■ Digital Audio Workstation (DAW)
◆ DSPs do the real-time work
◆ the general-purpose OS and software just take care
of configuring the DSP pipeline and its parameters

■ EXPENSIVE! → Software-based solutions
■ “1-system 1-function” paradigm
◆ device dedicated to a single application
◆ nothing else runs with real-time requirements
◆ we can use priorities to minimize interferences

Real-time audio processing
Commonly found guidelines for low-latency, skip-free
interactive audio processing
e.g., from http://jackaudio.org/faq/linux_rt_config.html
■ create group of users who can gain RT priority
groupadd audio
cat /etc/security/limits.d/99-realtime.conf
audio - rtprio 99
audio - memlock unlimited
■ add unprivileged user to the new group
usermod -a -G audio yourUserID
■ install a “real-time / low-latency” kernel
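
Once the limits above are in place, an unprivileged application can ask for real-time priority and locked memory by itself. The following is a minimal sketch of that step, not code from the JACK FAQ: the helper names rtprio_soft_limit() and go_realtime() are hypothetical, while getrlimit(), sched_setscheduler() and mlockall() are the standard Linux calls.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Read the soft RLIMIT_RTPRIO of the calling process, i.e. the highest
 * real-time priority it may request without extra privileges.
 * Returns 0 on success, -1 on failure. */
int rtprio_soft_limit(unsigned long long *limit)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_RTPRIO, &rl) < 0)
        return -1;
    *limit = (unsigned long long)rl.rlim_cur;
    return 0;
}

/* Try to make the calling thread SCHED_FIFO at the given priority and to
 * lock its memory, so that page faults cannot delay the audio path.
 * Returns 0 on success, -1 if either step is refused. */
int go_realtime(int prio)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = prio;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
        perror("sched_setscheduler");
        return -1;
    }
    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) {
        perror("mlockall");
        return -1;
    }
    return 0;
}
```

A user outside the audio group would typically see go_realtime() fail with a permission error, which is a quick way to check that the limits file has been picked up after logging in again.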

So, problem solved?

What about energy?
Plenty of energy saving features in the hardware
■ Dynamic Voltage and Frequency Scaling (DVFS)
■ Performance states (P-states),
Operating Performance Points (OPP)
■ Core idle states (C-states)
■ Turbo Boosting (hmmm....): spike-up CPU frequency
when/if possible

Useful in a number of cases (both battery-operated and not)
■ laptops, tablets, smartphones
■ desktop PCs, servers

All bad for performance stability!

Platform stability
Energy-saving features in the hardware adversely impact
performance stability and software predictability
■ DVFS → CPUs run at different frequencies over time
◆ frequency islands: groups of CPUs are constrained to
the same frequency
■ P-states → even less control over what frequency the CPU(s)
are running at
◆ frequency control in hardware, with only high-level tunables
exposed to software (minPct, maxPct)
■ C-states → time to enter and exit an idle state is variable
◆ going to a deep-idle state is worthwhile only when staying
there for a minimum residency time

C-states
C-state   wake-up latency (µs)   residency time (µs)
POLL        0                       0
C1          2                       2
C1E        10                      20
C3         70                     100
C6         85                     200
C7s       124                     800
C8        200                     800
C9        480                    5000
C10       890                    5000
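
The trade-off behind these figures can be sketched as a selection rule: go as deep as possible, as long as the expected idle time covers the residency and the wake-up latency is tolerable. The code below is only an illustration of that rule under the assumption that both columns are in microseconds, as cpuidle reports them; the function pick_cstate() is hypothetical and much simpler than the kernel's idle governors.

```c
#include <stddef.h>

/* One row of the table above (both figures assumed to be microseconds). */
struct cstate {
    const char *name;
    unsigned int exit_latency_us;
    unsigned int target_residency_us;
};

static const struct cstate cstates[] = {
    { "POLL",   0,    0 },
    { "C1",     2,    2 },
    { "C1E",   10,   20 },
    { "C3",    70,  100 },
    { "C6",    85,  200 },
    { "C7s",  124,  800 },
    { "C8",   200,  800 },
    { "C9",   480, 5000 },
    { "C10",  890, 5000 },
};

/* Return the deepest state whose residency fits in the predicted idle time
 * and whose wake-up latency stays within the tolerated limit. */
const char *pick_cstate(unsigned int predicted_idle_us,
                        unsigned int latency_limit_us)
{
    size_t n = sizeof(cstates) / sizeof(cstates[0]);
    const char *best = cstates[0].name;   /* POLL always qualifies */

    for (size_t i = 1; i < n; i++) {
        if (cstates[i].target_residency_us <= predicted_idle_us &&
            cstates[i].exit_latency_us <= latency_limit_us)
            best = cstates[i].name;
    }
    return best;
}
```

With a predicted idle time of 150 µs and a 100 µs latency limit, this rule stops at C3, since C6 would need at least 200 µs of residency.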

Making the platform stable
How users typically make the computing platform (more)
stable/predictable
■ turn off Turbo Boosting
■ disable DVFS (leverage it to fix the frequency), e.g.:
◆ performance governor or
◆ userspace governor if/when available
■ fix performance % with the P-state driver (minPct=maxPct)
■ inhibit deep-idle states
◆ echo 1 > /sys/devices/system/cpu/cpu<n>/cpuidle/state<s>/disable
◆ echo 1 > /sys/devices/system/cpu/cpu0/power/pm_qos_resume_latency_us


■ or, just run:
◆ yes > /dev/null & (once per CPU)
■ Bad for energy consumption!
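
The per-state "disable" knob mentioned above can also be driven from a program instead of the shell. This is a small sketch under the assumption that the sysfs layout is the one shown in the echo command; the helper names cpuidle_disable_path() and cpuidle_disable() are hypothetical, and writing the file needs root.

```c
#include <stdio.h>

/* Build the sysfs path of the "disable" knob for one idle state of one CPU.
 * Returns 0 on success, -1 if the buffer is too small. */
int cpuidle_disable_path(char *buf, size_t len, int cpu, int state)
{
    int n = snprintf(buf, len,
                     "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/disable",
                     cpu, state);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" to the knob, like the echo command does (needs root).
 * Returns 0 on success, -1 on any failure. */
int cpuidle_disable(int cpu, int state)
{
    char path[128];
    FILE *f;
    int rc;

    if (cpuidle_disable_path(path, sizeof(path), cpu, state) < 0)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    rc = (fputs("1\n", f) == EOF) ? -1 : 0;
    if (fclose(f) == EOF)
        rc = -1;
    return rc;
}
```

An application could call cpuidle_disable() only for the CPUs its audio thread is pinned to, which limits the energy cost compared to keeping every core busy.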

Why audio skips
∎ audio burst in playback (top)
∎ fill-level of audio ring buffer (middle)
∎ RT app thread (bottom)
∎ big ring buffer → high latency!
∎ empty ring buffer → audible glitch!
∎ small ring buffer periodically refilled
→ low latency, glitch-free playback!
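
The relation between buffer size, latency and glitches can be put into numbers. The sketch below is an illustration only: the frame counts and the 48 kHz rate used in the example are assumptions, and count_underruns() is a deliberately simple model in which the hardware drains one period per tick.

```c
/* Worst-case buffering latency, in milliseconds, of a ring buffer holding
 * `frames` audio frames played back at `rate_hz` frames per second. */
double buffer_latency_ms(unsigned int frames, unsigned int rate_hz)
{
    return 1000.0 * frames / rate_hz;
}

/* Count glitches: the hardware drains `period` frames per tick, and the
 * application adds refill[i] frames before tick i (capped at `capacity`).
 * An underrun is a tick where fewer than `period` frames are available. */
unsigned int count_underruns(unsigned int capacity, unsigned int period,
                             const unsigned int *refill, unsigned int ticks)
{
    unsigned int fill = 0, underruns = 0;

    for (unsigned int i = 0; i < ticks; i++) {
        fill += refill[i];
        if (fill > capacity)
            fill = capacity;
        if (fill < period) {
            underruns++;
            fill = 0;      /* whatever was there gets played, then silence */
        } else {
            fill -= period;
        }
    }
    return underruns;
}
```

For example, a 128-frame buffer at 48 kHz holds about 2.67 ms of audio, and a single missed refill in that model already produces one underrun, whereas a buffer four periods deep absorbs three missed refills at the price of four times the latency.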

Android audio architecture
Android audio applications
∎ android.media APIs
◇ playing/recording audio files, Internet streaming
◇ use of large buffers (no low-latency use-cases)
◇ regular mixer thread
Low-latency audio applications
∎ native APIs (OpenSL ES, AAudio)
◇ low-latency audio processing
◇ rely on FastMixer and ALSA
∎ critically low-latency
◇ exclusive mode in AAudio / ALSA

Power management in Android
■ schedutil selects the minimum operating performance point (OPP)
able to satisfy demand
■ based on CPU utilization statistics
◆ Per-Entity Load-Tracking (PELT)
■ exponentially weighted task utilization
■ slow to detect workload changes (ramp-up, cool-down)
e.g., it may take 50–100 ms to detect a 90% increase in CPU demand
◆ Window-Assisted Load-Tracking (WALT)
■ max{last window util., avg util. over past N windows}
e.g., with the past three 10 ms windows, spike detection takes 10 ms and
cool-down takes 30 ms
■ it quickly forgets a task's demand when the task is off the runqueue
■ WALT is more reactive than PELT, but not enough for very
dynamic workloads
■ can we improve on that?
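
The difference in reactivity between the two trackers can be reproduced with a toy model. This sketch is not the kernel code: it assumes a PELT-like signal updated once per millisecond with a 32 ms half-life, and a WALT-like estimate over a caller-supplied window history; both function names are hypothetical.

```c
/* PELT-like signal: geometric decay per 1 ms step with a 32 ms half-life
 * (decay^32 = 0.5). Returns how many steps a signal starting at 0 needs to
 * reach `target` (0..1) once the task suddenly becomes 100% busy. */
unsigned int pelt_ramp_up_ms(double target)
{
    const double decay = 0.97857206;   /* approximately 0.5^(1/32) */
    double util = 0.0;
    unsigned int ms = 0;

    while (util < target && ms < 10000) {
        util = util * decay + (1.0 - decay);   /* one fully busy step */
        ms++;
    }
    return ms;
}

/* WALT-like estimate: max of the last window and the mean of all n windows.
 * win[n - 1] is the most recent window, each value a utilization in 0..1.
 * The caller must pass n >= 1. */
double walt_estimate(const double *win, unsigned int n)
{
    double sum = 0.0;

    for (unsigned int i = 0; i < n; i++)
        sum += win[i];
    double avg = sum / n;
    return win[n - 1] > avg ? win[n - 1] : avg;
}
```

Under these assumptions the PELT-like signal needs on the order of a hundred 1 ms steps to climb from 0 to 90%, while the WALT-like estimate jumps to the new level after a single busy window and needs all N windows to pass before it fully decays.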

SCHED_DEADLINE
SCHED_DEADLINE from RETIS+Evidence
(ACTORS EU project)
■ mainline since v3.14 (2014)
■ reservation-based scheduling
■ a task is reserved a given runtime within a deadline
every period
struct sched_attr attr = {
    .size = sizeof(struct sched_attr),
    .sched_policy = SCHED_DEADLINE,
    .sched_flags = 0, // optionally SCHED_FLAG_RECLAIM | SCHED_FLAG_RESET_ON_FORK
    .sched_runtime = runtime_us * 1000,   // the kernel expects nanoseconds
    .sched_deadline = deadline_us * 1000,
    .sched_period = period_us * 1000
};
if (sched_setattr(0, &attr, 0) < 0) {     // pid 0 = the calling thread
    perror("sched_setattr() failed");
    exit(EXIT_FAILURE);
}
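
The snippet above relies on a sched_setattr() wrapper that many C libraries do not provide. One way to fill the gap, sketched here under stated assumptions, is to go through syscall() with a local copy of the attribute structure: struct dl_sched_attr mirrors the original layout of the kernel's struct sched_attr under a different name to avoid header clashes, and make_dl_attr() and set_deadline() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6   /* policy number from the kernel UAPI headers */
#endif

/* Local mirror of the kernel's struct sched_attr (original 48-byte layout). */
struct dl_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;    /* nanoseconds */
    uint64_t sched_deadline;   /* nanoseconds */
    uint64_t sched_period;     /* nanoseconds */
};

/* Fill in a reservation from parameters given in microseconds. */
struct dl_sched_attr make_dl_attr(uint64_t runtime_us, uint64_t deadline_us,
                                  uint64_t period_us)
{
    struct dl_sched_attr a;

    memset(&a, 0, sizeof(a));
    a.size = sizeof(a);
    a.sched_policy = SCHED_DEADLINE;
    a.sched_runtime = runtime_us * 1000;
    a.sched_deadline = deadline_us * 1000;
    a.sched_period = period_us * 1000;
    return a;
}

/* Apply the reservation to the calling thread (pid 0).
 * Returns 0 on success, -1 with errno set otherwise. */
int set_deadline(uint64_t runtime_us, uint64_t deadline_us, uint64_t period_us)
{
    struct dl_sched_attr a = make_dl_attr(runtime_us, deadline_us, period_us);

    return (int)syscall(SYS_sched_setattr, 0, &a, 0);
}
```

The call is expected to fail without sufficient privileges, or when the kernel's admission control rejects the request, so callers should check the return value and keep a fallback policy.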

SCHED_DEADLINE
How does SCHED_DEADLINE relate to POSIX RT?
■ any SCHED_DEADLINE task runs before any
POSIX RT or CFS task
◆ based on resource reservations (next slide)
◆ throttling safeguard to avoid locking up the CPU
(can be disabled if needed)
■ any POSIX RT (FIFO/RR) task runs before any
CFS task
◆ based on priorities
◆ throttling safeguard to avoid locking up the CPU
(can be disabled if needed)
■ Completely Fair Scheduler (CFS) tasks run when
no SCHED_DEADLINE or RT task can
◆ based on weights (weighted fair scheduler)

SCHED_DEADLINE
Main SCHED_DEADLINE properties
■ based on EDF (optimal on uni-processors) and the (Hard)
Constant Bandwidth Server (CBS)
■ temporal isolation: a task's inability to respect its
runtime doesn’t affect others
■ on multi-processors: anything from G-EDF (tardiness
bound) to P-EDF
When a task tries to exceed its runtime
■ it gets throttled (original behaviour)
■ it opportunistically gets extra runtime (GRUB), if
SCHED_FLAG_RECLAIM is used
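
EDF's optimality on a single processor has a practical consequence: a set of reservations with deadlines equal to periods is schedulable as long as their total utilization does not exceed 1. The sketch below shows only this textbook test; the struct and function names are hypothetical, and the kernel's own admission control applies a configurable cap and handles multiple CPUs.

```c
#include <stddef.h>

/* A reservation: `runtime_us` of CPU time every `period_us`. */
struct reservation {
    unsigned long runtime_us;
    unsigned long period_us;
};

/* Total utilization = sum of runtime / period over all reservations. */
double total_utilization(const struct reservation *r, size_t n)
{
    double u = 0.0;

    for (size_t i = 0; i < n; i++)
        u += (double)r[i].runtime_us / (double)r[i].period_us;
    return u;
}

/* Textbook EDF test for one CPU with deadlines equal to periods:
 * returns 1 if the set is schedulable, 0 otherwise. */
int edf_schedulable(const struct reservation *r, size_t n)
{
    return total_utilization(r, n) <= 1.0;
}
```

Three reservations of 2, 3 and 4 ms every 10 ms add up to 90% and pass the test, while adding a fourth one of 2 ms every 10 ms pushes the total to 110% and fails it.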

SCHED_DEADLINE and schedutil
The OPP chosen by schedutil depends on overall system
utilization, which includes:
■ SCHED_DEADLINE tasks’ utilization: runtime / period
Dynamic workload demand changes via sched_setattr():
■ readily accounted for by schedutil

Does it work? Results on a HiKey 960 board:
■ energy-efficient set-up: glitch-free playback at 2.67 ms latency, vs
26.67 ms for mainline Android using SCHED_FIFO and WALT, at the
cost of +6.25% power consumption
■ low-latency set-up: at 2.67 ms latency, saved 40% energy w.r.t.
mainline Android using SCHED_FIFO and WALT

Heterogeneous Architectures
ARM big.LITTLE (and DynamIQ) architectures
■ tasks can migrate among big and LITTLE cores (same
ISA)
■ big cores: high-performance workloads
■ LITTLE cores: energy-efficient workloads
ARM Energy Aware Scheduling (EAS)
■ give kernel awareness of the CPU capacity associated
with big and LITTLE cores
■ give kernel clues as to how capacity of big and LITTLE
cores scales with CPU frequency
■ provide CFS with more informed task placement and
migration decisions

Capacity enhancement patches
SCHED_DEADLINE improvements to account for CPU
capacity
■ runtime is specified in terms of the fastest CPU at the
fastest frequency
◆ it gets automatically rescaled using the CPU capacity
figures
■ if there’s a choice, prefer LITTLE cores before going to
big ones
■ proper consideration of CPU capacity in schedutil
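
The rescaling can be pictured as charging the budget in "fastest CPU at fastest frequency" time units. The sketch below illustrates that idea only: it assumes capacities expressed on a 0..1024 scale, and scaled_exec_ns() is a hypothetical name rather than the function used in the patches.

```c
#include <stdint.h>

#define CAPACITY_SCALE 1024u   /* capacity of the fastest CPU at its top OPP */

/* Scale the time a task actually ran into "fastest CPU at fastest frequency"
 * units, so the budget drains more slowly on slow or down-clocked CPUs.
 * freq_cap: current frequency relative to the CPU's maximum (0..1024).
 * cpu_cap:  this CPU's capacity relative to the fastest CPU (0..1024). */
uint64_t scaled_exec_ns(uint64_t delta_ns, uint32_t freq_cap, uint32_t cpu_cap)
{
    uint64_t by_freq = (delta_ns * freq_cap) / CAPACITY_SCALE;

    return (by_freq * cpu_cap) / CAPACITY_SCALE;
}
```

A task that runs for 1000 ns on a core with half the capacity, clocked at half its maximum frequency, is charged only 250 ns of its runtime, so the same reservation stays meaningful wherever the task is placed.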

Related publications
∎ A. Balsini, Towards Hard and Soft Real-time Operating Systems for Multicore Heterogeneous
Architectures, PhD dissertation, 2018
∎ A. Balsini et al., Modeling and simulation of power consumption and execution times for
real-time tasks on embedded heterogeneous architectures, EWILI 2018
∎ T. Cucinotta et al., Improving Responsiveness of Time-Sensitive Applications by Exploiting
Dynamic Task Dependencies, Wiley SPE 2018
∎ C. Scordino et al., Energy-aware real-time scheduling in the linux kernel, ACM SAC 2018
∎ D. B. de Oliveira et al., Nested Locks in the Lock Implementation: The Real-Time
Read-Write Semaphores on Linux, RTSOPS 2018
∎ M. Marinoni et al., Allocation and control of computing resources for real-time Virtual
Network Functions, SOFTNETWORKING 2018
∎ T. Cucinotta et al., Adaptive Real-Time Scheduling for Legacy Multimedia Applications, ACM
TECS 2012
∎ J. Lelli et al., An Experimental Comparison of Different Real-Time Schedulers on Multicore
Systems, Elsevier JSS 2012
∎ T. Cucinotta et al., Virtualised e-Learning on the IRMOS Real-time Cloud, Springer SOCA’12
∎ T. Cucinotta et al., A robust mechanism for adaptive scheduling of multimedia applications,
ACM TECS 2011
∎ T. Cucinotta et al., Low-Latency Audio on Linux by Means of Real-Time Scheduling, LAC’11
∎ T. Cucinotta et al., Virtualised e-Learning with Real-Time Guarantees on the IRMOS
Platform, IEEE SOCA 2010
∎ T. Cucinotta and L. Palopoli, QoS Control for Pipelines of Tasks Using Multiple Resources,
IEEE TOC 2010
∎ L. Palopoli et al., AQuoSA - Adaptive Quality of Service Architecture, Wiley SPE 2008
∎ L. Abeni et al., QoS Management through adaptive reservations, Springer RTSJ 2005

Q&A
Thanks for listening!
Questions?
http://retis.santannapisa.it/~tommaso
tommaso.cucinotta@santannapisa.it
