
CAPS_Presentation.pdf



In live streaming applications, a fixed set of bitrate-resolution pairs (known as a bitrate ladder) is generally used to avoid the additional pre-processing run-time of analyzing the complexity of every video and determining an optimized bitrate ladder. Furthermore, live encoders use the fastest available preset for encoding to ensure the minimum possible streaming latency. For live encoders, the encoding speed is expected to equal the video framerate. An optimized encoding preset may result in (i) increased Quality of Experience (QoE) and (ii) improved CPU utilization while encoding. In this light, this paper introduces a Content-Adaptive encoder Preset prediction Scheme (CAPS) for adaptive live video streaming applications. In this scheme, the encoder preset is determined using Discrete Cosine Transform (DCT)-energy-based low-complexity spatial and temporal features for every video segment, the number of CPU threads allocated to each encoding instance, and the target encoding speed. Experimental results show that CAPS yields an overall quality improvement of 0.83 dB PSNR and 3.81 VMAF at the same bitrate, compared to fastest-preset encoding of the HTTP Live Streaming (HLS) bitrate ladder using the x265 open-source HEVC encoder. This is achieved while maintaining the desired encoding speed and reducing CPU idle time.


  1. Content-adaptive Encoder Preset Prediction for Adaptive Live Streaming. Vignesh V Menon1, Hadi Amirpour1, Prajit T Rajendran2, Mohammad Ghanbari1,3, and Christian Timmerer1. 1 Christian Doppler Laboratory ATHENA, Alpen-Adria-Universität, Klagenfurt, Austria; 2 Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France; 3 School of Computer Science and Electronic Engineering, University of Essex, UK. 9 December 2022. Vignesh V Menon, Content-adaptive Encoder Preset Prediction for Adaptive Live Streaming.
  2. Outline: 1 Introduction; 2 Research Problem; 3 Content-Adaptive Encoder Preset Prediction Scheme (CAPS); 4 Evaluation; 5 Conclusions.
  3. Introduction: HTTP Adaptive Streaming (HAS). HTTP Adaptive Streaming (HAS)1 has become the de-facto standard for delivering video content to clients with various internet speeds and device types. Traditionally, a fixed bitrate ladder, e.g., the HTTP Live Streaming (HLS) bitrate ladder2, is used in live streaming. 1 A. Bentaleb et al. "A Survey on Bitrate Adaptation Schemes for Streaming Media Over HTTP". In: IEEE Communications Surveys & Tutorials 21.1 (2019), pp. 562–585. doi: 10.1109/COMST.2018.2862938. 2 https://developer.apple.com/documentation/http_live_streaming/hls_authoring_specification_for_apple_devices, last access: Nov 30, 2022.
  4. Introduction: Live encoding in HAS. Figure: Encoding time of the HLS bitrate ladder2 representations of the Wood s000 sequence (5-second duration, 24 fps) of the VCD dataset, using the ultrafast preset of x265 and 8 CPU threads. For every representation, maintaining a fixed encoding speed, equal to the video framerate and independent of the video content, is a key goal for a live encoder. A reduction in encoding speed may lead to the unacceptable outcome of dropped frames during transmission, eventually decreasing the Quality of Experience (QoE).a An increase in encoding speed leads to more CPU idle time! a Pradeep Ramachandran et al. "Content Adaptive Live Encoding with Open Source Codecs". In: Proceedings of the 11th ACM Multimedia Systems Conference. MMSys '20. Istanbul, Turkey: Association for Computing Machinery, 2020, pp. 345–348. isbn: 9781450368452. doi: 10.1145/3339825.3393580.
  5. Introduction: Live encoding in HAS. The preset for the fastest encoding (ultrafast for x2643 and x2654) is used as the encoding preset for all live content, independent of the dynamic complexity of the content. The resulting encode is sub-optimal, especially when the type of content changes dynamically, which is the typical use case for live streams.5 When the content becomes easier to encode, the encoder achieves a higher encoding speed than the target encoding speed. This, in turn, introduces unnecessary CPU idle time as the encoder waits for the video feed. 3 https://www.videolan.org/developers/x264.html, last access: Nov 30, 2022. 4 https://www.videolan.org/developers/x265.html, last access: Nov 30, 2022. 5 Qingxiong Huangyuan et al. "Performance evaluation of H.265/MPEG-HEVC encoders for 4K video sequences". In: Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific. 2014, pp. 1–8. doi: 10.1109/APSIPA.2014.7041782.
  6. Research Problem. For easy-to-encode content, the encoder preset needs to be configured such that the encoding speed can be reduced while remaining compatible with the expected live encoding speed, improving the quality of the encoded content. When the content becomes complex again, the encoder preset needs to be reconfigured back to a faster configuration that achieves live encoding speed.6 This paper targets an encoding scheme that determines the encoding preset configuration dynamically, which: is adaptive to the video content; maximizes CPU utilization for a given target encoding speed; maximizes compression efficiency. 6 Sergey Zvezdakov, Denis Kondranin, and Dmitriy Vatolin. "Machine-Learning-Based Method for Content-Adaptive Video Encoding". In: 2021 Picture Coding Symposium (PCS). 2021, pp. 1–5. doi: 10.1109/PCS50896.2021.9477507.
  7. Content-Adaptive encoder Preset prediction Scheme (CAPS). Figure: The encoding pipeline using CAPS envisioned in this paper: a video segment and a priori information (bitrate set B, resolution set R, target speed, codec, CPU threads) feed video complexity feature extraction (E, h, L) and encoding preset prediction, which configures one encoder instance per representation (rep1 … repN).
  8. CAPS Phase 1: Video Complexity Feature Extraction. Compute texture energy per block: a DCT-based energy function determines the block-wise feature of each frame, defined as:

H_k = Σ_{i=0}^{w−1} Σ_{j=0}^{w−1} e^{|(ij/(wh))² − 1|} · |DCT(i,j)|   (1)

where w×w is the size of the block (so h = w here), and DCT(i,j) is the (i,j)th DCT component when i + j > 0, and 0 otherwise. The energy values of the blocks in a frame are averaged to determine the energy per frame:7,8

E_s = ( Σ_{k=0}^{K−1} H_{s,k} ) / (K · w²)   (2)

where K is the number of blocks in frame s. 7 Michael King, Zinovi Tauber, and Ze-Nian Li. "A New Energy Function for Segmentation and Compression". In: 2007 IEEE International Conference on Multimedia and Expo. 2007, pp. 1647–1650. doi: 10.1109/ICME.2007.4284983. 8 Vignesh V Menon et al. "Efficient Content-Adaptive Feature-Based Shot Detection for HTTP Adaptive Streaming". In: 2021 IEEE International Conference on Image Processing (ICIP). 2021, pp. 2174–2178. doi: 10.1109/ICIP42928.2021.9506092.
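The texture energy of Eqs. (1)-(2) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's VCA implementation: the block size of 32×32, the grayscale `numpy` frame input, and the helper names are our assumptions.

```python
import numpy as np
from scipy.fft import dctn

def block_energy(block: np.ndarray, w: int) -> float:
    """Texture energy H_k of one w x w block (Eq. 1): exponentially
    weighted sum of absolute DCT coefficients, with DCT(0,0) set to 0."""
    coeffs = np.abs(dctn(block, norm="ortho"))
    i, j = np.meshgrid(np.arange(w), np.arange(w), indexing="ij")
    weight = np.exp(np.abs((i * j / (w * w)) ** 2 - 1))
    weight[0, 0] = 0.0  # Eq. (1) defines the DC term as 0
    return float((weight * coeffs).sum())

def frame_energy(frame: np.ndarray, w: int = 32) -> float:
    """Average energy E_s per frame (Eq. 2) over non-overlapping blocks."""
    rows, cols = frame.shape
    energies = [
        block_energy(frame[y:y + w, x:x + w], w)
        for y in range(0, rows - w + 1, w)
        for x in range(0, cols - w + 1, w)
    ]
    return sum(energies) / (len(energies) * w * w)
```

A flat frame has no AC energy, so `frame_energy(np.zeros((64, 64)))` returns 0.0, matching the intent of Eq. (1) as a texture (not brightness) measure.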
  9. CAPS Phase 1: Video Complexity Feature Extraction. h_s: the sum of absolute differences (SAD) between the block-level energy values of frame s and those of the previous frame s − 1:

h_s = ( Σ_{k=0}^{K−1} |H_{s,k} − H_{s−1,k}| ) / (K · w²)   (3)

where K denotes the number of blocks in frame s. The luminescence of the kth non-overlapping block of the sth frame is defined as:

L_{s,k} = √(DCT(0,0))   (4)

The block-wise luminescence is averaged per frame, denoted as L_s:

L_s = ( Σ_{k=0}^{K−1} L_{s,k} ) / (K · w²)   (5)
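Given per-block energies and DC coefficients, Eqs. (3)-(5) reduce to simple array reductions. A sketch, with function names of our own choosing:

```python
import numpy as np

def temporal_feature(H_cur: np.ndarray, H_prev: np.ndarray, w: int) -> float:
    """h_s (Eq. 3): SAD between the block energies of frame s and
    frame s-1, normalised by K * w^2."""
    K = H_cur.size
    return float(np.abs(H_cur - H_prev).sum() / (K * w * w))

def luminescence_feature(dc_coeffs: np.ndarray, w: int) -> float:
    """L_s (Eqs. 4-5): average sqrt of each block's DC coefficient
    DCT(0,0), normalised by K * w^2."""
    K = dc_coeffs.size
    return float(np.sqrt(dc_coeffs).sum() / (K * w * w))
```

For example, with block energies [2, 4] in frame s and [1, 1] in frame s−1 and w = 1, h_s = (1 + 3) / 2 = 2.0.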
  10. CAPS Phase 2: Encoding Preset Prediction. Figure: Encoding preset prediction architecture: the features E, h, L, log(r), and log(b) feed a model set that predicts the encoding times t̂_{p_min} … t̂_{p_max}; preset selection then outputs the optimized preset p̂. Model set: models to predict the encoding time for each preset. Preset selection: select the optimized preset that yields the target encoding time.
  11. CAPS Phase 2: Encoding Preset Prediction, Model Set. Random Forest (RF) models are trained to predict the encoding times for the pre-defined set of encoding presets (P). The minimum and maximum encoder presets (p_min and p_max, respectively) are chosen based on the target encoder. For example, the x265 HEVC9 encoder supports encoding presets ranging from 0 to 9 (i.e., ultrafast to placebo). The model set predicts the encoding times for each of the presets in P as t̂_{p_min} to t̂_{p_max}. 9 G. J. Sullivan et al. "Overview of the high efficiency video coding (HEVC) standard". In: IEEE Transactions on Circuits and Systems for Video Technology 22.12 (2012), pp. 1649–1668.
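The model set can be sketched as one Random Forest regressor per preset over the five features E, h, L, log(r), log(b). The training data below is synthetic and for illustration only; the paper trains on measured encoding times, and the preset range 0-8 follows the experimental results slide.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 5))                 # columns: E, h, L, log(r), log(b)

model_set = {}
for preset in range(9):                  # presets 0 (ultrafast) .. 8
    # synthetic target: slower presets take proportionally longer to encode
    y = (preset + 1) * (X @ np.ones(5)) + rng.normal(0.0, 0.1, 200)
    model_set[preset] = RandomForestRegressor(
        n_estimators=50, random_state=0).fit(X, y)

segment = rng.random((1, 5))             # features of a new video segment
pred_times = {p: float(m.predict(segment)[0]) for p, m in model_set.items()}
```

Each model answers one question, "how long would this segment take at preset p?", so the whole dictionary of predictions is available to the preset-selection step at once.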
  12. CAPS Phase 2: Encoding Preset Prediction, Preset Selection. The preset whose predicted encoding time is closest to the target encoding time T is selected, where T is defined as:

T = n / f   (6)

where n is the number of frames in the segment and f is the video framerate. The constraint ensures that the encoding time does not exceed the target encoding time T:

p̂ = argmin_{p ∈ [p_min, p_max]} |T − t̂_p|  s.t.  t̂_p ≤ T   (7)

where p̂ is the selected optimum preset and t̂_p̂ is its predicted encoding time.
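Eq. (7) is a constrained nearest-match over the predicted times. A sketch; the fallback branch is our assumption, since the slide does not say what happens when even the fastest preset misses the target:

```python
def select_preset(pred_times: dict[int, float], T: float) -> int:
    """Eq. (7): among presets whose predicted encoding time does not
    exceed the target time T, choose the one closest to T."""
    feasible = {p: t for p, t in pred_times.items() if t <= T}
    if not feasible:
        # Assumption (not stated on the slide): if every preset misses
        # the target, fall back to the fastest preset available.
        return min(pred_times)
    return min(feasible, key=lambda p: abs(T - feasible[p]))
```

For a 120-frame segment at 24 fps, T = 120 / 24 = 5 s (Eq. 6); with predicted times {0: 1.0, 3: 3.5, 6: 4.8, 8: 6.0}, preset 6 is selected, since 4.8 s is the feasible time closest to 5 s.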
  13. Evaluation: Test Methodology. Dataset: Video Complexity Dataset (VCD)10 (500 Ultra HD sequences, 5 s duration, 24 fps).

Table: Encoder settings.
Encoder: x265 v3.5
Target encoding speed/time: 24 fps / 5 seconds
CPU threads: 8
Rate control: CBR

Table: Representations considered in this paper2.
Representation ID       01     02     03     04     05     06     07     08     09     10     11     12
r (height in pixels)   360    432    540    540    540    720    720   1080   1080   1440   2160   2160
b (in Mbps)          0.145  0.300  0.600  0.900  1.600  2.400  3.400  4.500  5.800  8.100 11.600 16.800

E, h, and L features are extracted from the video segments using VCA11 run in eight CPU threads. 10 Hadi Amirpour et al. "VCD: Video Complexity Dataset". In: Proceedings of the 13th ACM Multimedia Systems Conference. MMSys '22. Athlone, Ireland: Association for Computing Machinery, 2022, pp. 234–239. isbn: 9781450392839. doi: 10.1145/3524273.3532892. 11 Vignesh V Menon et al. "VCA: Video Complexity Analyzer". In: Proceedings of the 13th ACM Multimedia Systems Conference. MMSys '22. Athlone, Ireland: Association for Computing Machinery, 2022, pp. 259–264. isbn: 9781450392839. doi: 10.1145/3524273.3532896.
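The representation table above transcribes directly into code; the tuple layout and the helper assembling the predictor's input vector are our own illustration, not part of the paper:

```python
import math

# (representation id, height r in pixels, bitrate b in Mbps), from the table above.
HLS_LADDER = [
    ("01", 360, 0.145), ("02", 432, 0.300), ("03", 540, 0.600),
    ("04", 540, 0.900), ("05", 540, 1.600), ("06", 720, 2.400),
    ("07", 720, 3.400), ("08", 1080, 4.500), ("09", 1080, 5.800),
    ("10", 1440, 8.100), ("11", 2160, 11.600), ("12", 2160, 16.800),
]

def rep_features(E: float, h: float, L: float, r: int, b: float) -> list[float]:
    """Feature vector [E, h, L, log(r), log(b)] fed to the preset predictor."""
    return [E, h, L, math.log(r), math.log(b)]
```

One preset prediction is made per ladder entry, so a live deployment would evaluate `rep_features` twelve times per segment, once per representation.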
  14. Experimental Results. Figure: Average relative importance of features in the encoding time prediction of 2160p encoding for all presets.

Table: Average encoding time prediction performance of every preset considered for experimental validation (p = 0: ultrafast).
Preset    0     1     2     3     4     5     6     7     8
R²      0.97  0.96  0.97  0.97  0.97  0.96  0.97  0.98  0.98
MAE     0.06  0.09  0.12  0.17  0.22  0.31  0.33  0.41  0.49
  15. Experimental Results. Figure: Average preset chosen for each representation in the HLS bitrate ladder. On average, representation 01 (0.145 Mbps) chooses the slow preset (p = 6), while representations 11 (11.6 Mbps) and 12 (16.8 Mbps) choose the ultrafast preset (p = 0).
  16. Experimental Results. Figure: (a) Average encoding time, (b) average PSNR, and (c) average VMAF for each representation. Using the ultrafast preset for all representations introduces significant CPU idle time for lower-bitrate representations, whereas CAPS yields lower CPU idle time when the encodings are carried out concurrently. Using CAPS, visual quality improves significantly at lower-bitrate representations: CAPS yields an overall BD-VMAF of 3.81 and a BD-PSNR of 0.83 dB.
  17. Conclusions. This paper proposed CAPS, a content-adaptive encoder preset prediction scheme for adaptive live streaming applications. CAPS predicts the optimized encoder preset for a given target bitrate and resolution for each segment, which helps improve the compression efficiency of video encodings. DCT-energy-based features are used to determine segments' spatial and temporal complexity. CAPS yields lower idle time and an overall quality improvement of 0.83 dB PSNR and 3.81 VMAF with the same bitrate, compared to the fastest preset (ultrafast) x265 encoding of the reference HLS bitrate ladder.
  18. Q & A. Thank you for your attention! Vignesh V Menon (vignesh.menon@aau.at)
