Past Cisco reports projected that video would account for 80% of internet data traffic by 2023. Due to the pandemic and the recently imposed remote-work lifestyle, this figure is expected to increase even further. Beyond on-demand and conferencing services, a growing number of users generate, store, and share their own content, usually through social media or video-sharing platforms. Meanwhile, from the video coding perspective, as video technologies evolve towards improved compression performance, their complexity increases as well.
A challenge that many video service providers face is the heterogeneity of networks and display devices for streaming, as well as the wide variety of content with different encoding performance. In the past, a fixed bitrate ladder based on a “one-size-fits-all” approach was employed. A content-tailored ladder can serve each video better, but such a solution is highly demanding: the computational and financial cost of constructing the convex hull per video by encoding at all resolutions and quantization levels is huge. In this talk, we present a content-gnostic approach that exploits machine learning to predict the bitrate ladder with only a small number of encodes required.
3. I. Motivation
Cisco's internet data traffic reports estimate that the share of video data will
reach 80% by 2022 and will keep increasing. [1]
Due to the pandemic and the recent shift towards a remote-work lifestyle, this figure
is probably almost a reality.
Video providers employ adaptive streaming to address users' specifications.
Traditionally, this is achieved by creating several versions of a video sequence
using different encoding parameters, such as resolution.
This, however, requires a huge number of encodings, which impacts time, cost
and energy (increased CO2 footprint).
Figure: Distribution of energy consumption for production and use in 2017. Segments: TVs (production), Computers (production), Smartphones (production), Others, Terminals (use), Networks (use), Data Centers (use). Shares shown: 11%, 17%, 11%, 6%, 20%, 16%, 19%.
“…as of the end of December last year,
the maximum number of daily meeting
participants, both free and paid,
conducted on Zoom was approximately
10 million. In March this year, we reached
more than 200 million daily meeting
participants, both free and paid.” [2]
Eric S. Yuan
Founder and CEO, Zoom
4. I. Motivation
Fig.1 Sample frames from a dataset of 100 4K sequences.
Fig.2 PSNR-log(Rate) curves across resolutions (x-axis: Bitrate (kbps), log scale; y-axis: PSNR (dB); curves: 4K RQs, FHD RQs, HD RQs).
One ladder
does not fit all!
Table 1 The encoding ladder presented in Apple Tech Note TN2224.
5. I. Motivation
How can we find the “best” bitrate ladder per content so that we do not compromise the quality of
experience?
How could we make this process more computationally efficient without degrading the delivered
video quality?
Table 1 The encoding ladder presented in Apple Tech Note TN2224.
Table 2 Netflix’s per-title encoding can change both the
number of rungs and their resolution. [3, 4]
Other Per-Title Approaches: Bitmovin, Mux, CAMBRIA, etc.
6. I. Motivation
How can we find the “best” bitrate ladder per content so that we do
not compromise the quality of experience?
How could we make this process more computationally efficient
without degrading the delivered video quality?
Fig.3 RD curves and convex hull (labels: Convex Hull - Optimal Encoding Solution; Sub-optimal Encoding Solution; Practical Approach).
Ideally, the optimal solution would be to build the ladder by sampling
the convex hull of the rate-quality (RQ) curves across resolutions.
We propose a content-gnostic, machine-learning-based approach that predicts the
bitrate ladder.
7. II. Content Features and Compression
Fig.4 Correlation matrix of HM coding statistics to
spatio-temporal features. [5]
Fig.5 Examples of predicted PSNR-Rate curves. [5]
8. III. Proposed Framework
Fig.2 PSNR-log(Rate) curves across resolutions.
Fig.6 Example of RQ curves’ intersection (x-axis: log(Bitrate (kbps)); y-axis: PSNR (dB); curves: 4K, FHD, HD, Convex Hull; marked cross-over points: {QP_4K, QP_FHD^low} and {QP_FHD^high, QP_HD}).
Finding the cross-over points helps define
where the resolution switches on the convex hull.
We assume that the RQ curves intersect in an ordered, monotonic fashion (e.g. 2160p intersects with
1080p, 1080p with 720p, etc.).
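To make the idea of a cross-over point concrete, here is a minimal sketch, not the authors' implementation. It assumes each RQ curve is available as a list of (bitrate in kbps, PSNR in dB) samples, interpolates both curves linearly in the log-rate domain, and scans for the rate at which the higher resolution starts to deliver better quality. The function names, the scan resolution, and the sample values are all hypothetical; mapping the resulting rate back to the pair of cross-over QPs is left out.

```python
import math

def interp(x, xs, ys):
    """Piecewise-linear interpolation of the samples (xs, ys) at x (xs increasing)."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x is outside the sampled range")

def crossover_log_rate(curve_hi, curve_lo, steps=1000):
    """Return log10(bitrate) where the higher-resolution curve overtakes the
    lower-resolution one, or None if no such point exists in the shared range.

    Each curve is a list of (bitrate_kbps, quality) pairs sorted by bitrate.
    """
    x_hi = [math.log10(r) for r, _ in curve_hi]
    x_lo = [math.log10(r) for r, _ in curve_lo]
    q_hi = [q for _, q in curve_hi]
    q_lo = [q for _, q in curve_lo]
    lo = max(x_hi[0], x_lo[0])      # start of the shared log-rate range
    hi = min(x_hi[-1], x_lo[-1])    # end of the shared log-rate range
    if lo >= hi:
        return None
    prev_x = prev_d = None
    for k in range(steps + 1):
        x = min(hi, lo + (hi - lo) * k / steps)
        d = interp(x, x_hi, q_hi) - interp(x, x_lo, q_lo)
        if prev_d is not None and prev_d < 0 <= d:
            # Refine linearly inside the step that brackets the sign change.
            return prev_x + (x - prev_x) * (-prev_d) / (d - prev_d)
        prev_x, prev_d = x, d
    return None

# Hypothetical RQ samples, for illustration only.
curve_4k = [(1000, 30.0), (10000, 40.0), (100000, 50.0)]
curve_fhd = [(1000, 33.0), (10000, 39.0), (100000, 45.0)]
x_cross = crossover_log_rate(curve_4k, curve_fhd)
print("switch to 4K above roughly", round(10 ** x_cross), "kbps")
```

Under the monotonic-intersection assumption, the same routine would be applied to each adjacent pair of resolutions in turn.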
9. III. Proposed Framework
Fig.7 Scatterplots of cross-over QPs. Left panel: QP_4K (x-axis) vs QP_FHD^low (y-axis), PCC: .9917, SROCC: .9888. Right panel: QP_FHD^high (x-axis) vs QP_HD (y-axis), PCC: .9817, SROCC: .9538.
This relation can be used to improve cross-over QP predictions.
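One simple way to see how a strong correlation between cross-over QPs can be exploited is sketched below. It computes the Pearson correlation and a least-squares line between two sets of cross-over QPs, then uses the line to estimate one QP from the other. This is only an illustration: the QP values are hypothetical, and the talk itself feeds previously predicted QPs into a regression model as features rather than using a plain linear fit.

```python
def pearson(xs, ys):
    """Pearson linear correlation coefficient (PCC) of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Least-squares slope and intercept of y as a linear function of x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical cross-over QPs for seven sequences, for illustration only.
qp_4k = [18, 22, 25, 29, 33, 38, 42]
qp_fhd_low = [17, 20, 24, 27, 32, 36, 41]

slope, intercept = fit_line(qp_4k, qp_fhd_low)
print("PCC:", round(pearson(qp_4k, qp_fhd_low), 4))
print("estimated QP_FHD^low for QP_4K = 30:", round(slope * 30 + intercept, 1))
```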
10. III. Proposed Framework
Fig.8 Proposed method (block diagram).
Training Process blocks: Training Videos @ Native Resolution; Downscaling Resolution; Training Videos @ all considered resolutions; Video Codec; Decoded Training Videos @ all considered resolutions; Upscaling Resolution; Upscaled Decoded Training Videos @ Native Resolution; Quality Metrics Computation; Quality Metric Values for Training Videos; Ground-truth RQ Convex Hull; Training Videos Cross-over QPs; Content Features Extraction; Spatio-temporal Features of Training Videos; Machine Learning-based Regression.
Testing Process blocks: Testing Videos @ Native Spatial Resolution; Content Features Extraction; Spatio-temporal Features of Testing Videos; Machine Learning-based Regression; Predicted Cross-over QPs per Resolution; Video Codec; Decoded Testing Videos @ Cross-over QPs; Upscaled Decoded Testing Videos @ Cross-over QPs; Quality Metric Values for Testing Videos at Cross-over Points; Bitrate of Cross-over Points; RQ Convex Hull Fitting; Predicted Bitrate Ladder (RQ Convex Hull Eq., Rate-QP Eq., Resolution Switching Rate points).
12. III. Proposed Framework
We fit the convex hull with a 3rd-order polynomial.
A 3rd-order polynomial has four coefficients, so after determining the cross-over QPs we need four encodes to determine
the polynomial parameters.
Then, we can sample the convex hull and build the bitrate ladder.
Table 3 Fitted Models.
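The following sketch illustrates why four encodes are enough: four (rate, quality) points determine a 3rd-order polynomial uniquely, which can then be evaluated at any rate. It uses Lagrange interpolation rather than a least-squares fit, and it assumes the polynomial is expressed in log2(bitrate); both choices, the function name, and the sample values are assumptions made for illustration, not details taken from the talk.

```python
import math

def cubic_through(points):
    """Return a function evaluating the unique 3rd-order polynomial that
    passes through four (x, y) points with distinct x values."""
    if len(points) != 4:
        raise ValueError("exactly four points are required")

    def evaluate(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)  # Lagrange basis factor
            total += term
        return total

    return evaluate

# Hypothetical encodes at the cross-over QPs: (bitrate in kbps, PSNR in dB).
encodes = [(1500, 33.1), (4000, 36.8), (9000, 39.4), (20000, 41.2)]
hull = cubic_through([(math.log2(rate), psnr) for rate, psnr in encodes])

# Sample the fitted hull at a few candidate ladder rates.
for rate in (2000, 4000, 8000, 16000):
    print(rate, "kbps ->", round(hull(math.log2(rate)), 2), "dB")
```

With more than four encodes available, a least-squares fit of the same polynomial would be the natural replacement.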
13. III. Proposed Framework
Building the bitrate ladder:
1. Determine the operational bitrate range;
2. Sample the bitrate: $R_{L,i} \simeq 2R_{L,i-1}$ or $\log_2(R_{L,i}) \simeq 1 + \log_2(R_{L,i-1})$, where $R_{L,i} \in (R_{\min}, R_{\max})$;
3. Sample the quality: $Q_{L,i}(R_{L,i}) \le Q_{\max}$ and $\frac{dQ_{L,i}}{dR_{L}} > \epsilon$, where $\epsilon \to 0$.
Fig.10 PSNR-Rate Ladder (x-axis: log2(Bitrate); y-axis: PSNR (dB)).
Fig.11 VMAF-Rate Ladder (x-axis: log2(Bitrate); y-axis: VMAF).
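The three steps for building the ladder can be sketched as follows. This is a minimal illustration under stated assumptions: the fitted convex hull is available as a function of bitrate (`quality_of`, a hypothetical name), each rung doubles the previous rate, and the slope condition is approximated by a finite difference between consecutive rungs. The quality model and the thresholds are made up for the example.

```python
import math

def build_ladder(quality_of, r_min, r_max, q_max, eps=1e-3):
    """Sample a bitrate ladder from a rate-quality function.

    Rungs start at r_min and double until r_max is exceeded. Sampling stops
    early when quality exceeds q_max or when the quality gain per kbps
    between consecutive rungs is no longer above eps.
    """
    ladder = []
    rate = r_min
    while rate <= r_max:
        quality = quality_of(rate)
        if quality > q_max:
            break
        if ladder:
            prev_rate, prev_quality = ladder[-1]
            if (quality - prev_quality) / (rate - prev_rate) <= eps:
                break
        ladder.append((rate, quality))
        rate *= 2
    return ladder

# Hypothetical quality model (PSNR in dB as a function of kbps), illustration only.
def psnr_model(rate_kbps):
    return 20 + 2 * math.log2(rate_kbps)

for rate, psnr in build_ladder(psnr_model, r_min=500, r_max=16000, q_max=45, eps=1e-4):
    print(rate, "kbps ->", round(psnr, 2), "dB")
```

In practice `quality_of` would be the fitted convex hull, and each rung would also carry the resolution selected by the cross-over rates.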
14. IV. Results
Table 4 List of Features [5].
Fig.12 Example of content dependency of cross-over QPs (x-axis: stdTC std; y-axis: QP_4K).
15. IV. Results
Fig.1 Sample frames from a dataset of 100 4K sequences.
Fig.13 Spatial and Temporal Information of the dataset.
16. IV. Results
From the PCS 2019 paper [6]: we tested the proposed framework with HM16.20, considering the resolutions
{2160p, 1080p, 720p}.
A Lanczos-3 filter (ffmpeg implementation) was used for the spatial down/up-sampling.
We compare our method against two state-of-the-art solutions:
• Brute force method: we performed encodings with a QP step equal to 1. The brute force method theoretically
creates the optimal convex hull. This is considered our ground truth.
• Interpolation-based method: 7 encodings per resolution (using equidistant QPs to cover the range), with
piece-wise cubic Hermite interpolation for the in-between QPs. This method constructs
a suboptimal convex hull, but it can provide a good approximation of it while significantly
reducing the number of pre-encodes.
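As a rough illustration of what either baseline produces once the encodes are available, the sketch below extracts the upper convex hull from a pooled set of (log-rate, quality, resolution) points using a monotone-chain scan. It is an assumption about how the hull could be computed, not the procedure used in [6]; the function names and the sample points are hypothetical.

```python
def cross(o, a, b):
    """2-D cross product of vectors o->a and o->b (uses the first two fields)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_convex_hull(points):
    """Upper convex hull of (log_rate, quality, label) points, left to right,
    truncated at the highest-quality point."""
    hull = []
    for p in sorted(set(points)):
        # Drop the last hull point while it lies on or below the new chord.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    best = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[:best + 1]

# Hypothetical encodes pooled across two resolutions, for illustration only.
points = [
    (1, 30.0, "HD"), (2, 34.0, "HD"), (3, 36.0, "HD"),
    (2, 32.0, "4K"), (3, 37.5, "4K"), (4, 40.0, "4K"),
]
for log_rate, quality, resolution in upper_convex_hull(points):
    print(log_rate, quality, resolution)
```

The label attached to each hull point shows where the optimal resolution switches as the rate increases.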
17. IV. Results
We applied feature selection, specifically Recursive Feature Elimination, on the set of spatio-temporal
features.
We performed a sequential prediction of the QPs starting from the highest resolution:
• For the QP4K prediction, we relied only on spatio-temporal features.
• For the rest of the predictions, we made use of the identified relations and considered the previously predicted QPs (of the
higher resolutions) as features.
We tested various regression methods, such as support vector machines (SVMs) with different kernels and random forests (RFs), but Gaussian processes (GPs) were the best
performing models.
To avoid overfitting, we performed 10-fold cross-validation.
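For readers unfamiliar with the setup, here is a small self-contained sketch of Gaussian process regression evaluated with k-fold cross-validation. It computes the GP posterior mean with a squared-exponential kernel and reports the mean absolute error over the held-out folds. The kernel, its hyperparameters, the interleaved fold assignment, and the data are all assumptions chosen for illustration; the talk does not specify them, and a real experiment would normally rely on an established library.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gp_fit_predict(X_train, y_train, X_test, length=1.0, noise=1e-2):
    """GP posterior mean, using the training mean as a constant prior mean."""
    mean = sum(y_train) / len(y_train)
    K = [[rbf(a, b, length) + (noise if i == j else 0.0)
          for j, b in enumerate(X_train)] for i, a in enumerate(X_train)]
    alpha = solve(K, [y - mean for y in y_train])
    return [mean + sum(rbf(x, a, length) * w for a, w in zip(X_train, alpha))
            for x in X_test]

def kfold_mae(X, y, k=10, **gp_args):
    """Mean absolute error over k folds; fold i holds samples i, i+k, i+2k, ..."""
    errors = []
    for fold in range(k):
        test_idx = list(range(fold, len(X), k))
        train_idx = [i for i in range(len(X)) if i % k != fold]
        preds = gp_fit_predict([X[i] for i in train_idx],
                               [y[i] for i in train_idx],
                               [X[i] for i in test_idx], **gp_args)
        errors += [abs(p - y[i]) for p, i in zip(preds, test_idx)]
    return sum(errors) / len(errors)

# Hypothetical one-dimensional feature and cross-over QPs, illustration only.
X = [[i / 29] for i in range(30)]
qp_4k = [20 + 20 * x[0] for x in X]
print("10-fold MAE:", kfold_mae(X, qp_4k, k=10, length=0.5, noise=1e-4))
```

For the sequential scheme, the QP predicted for the higher resolution would be appended (after scaling) to the feature vector before predicting the next cross-over QP.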
19. IV. Results
The different distributions are due to the different reference convex hulls.
Fig.17 BDRate Histogram.
Fig.18 BDPSNR Histogram.
Most outliers correspond to sequences that do not comply with the hypothesis that the RQ curves intersect in a resolution-monotonic manner.
21. IV. Results
94.2% fewer encodings compared to the brute
force method and 80.95% fewer compared to the
interpolation-based method.
Proposed method overhead: the ratio of the average feature
extraction time for a sequence at 4K resolution to
the average 4K encoding time for a sequence at
QP=27 is 0.18.
Table 6 Comparison of the number of encodes required per method.
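The 80.95% figure can be checked from numbers stated elsewhere in the talk: two encodes per RQ intersection point with two intersections across three resolutions gives 4 encodes, against 7 encodes at each of 3 resolutions for the interpolation-based method. The short calculation below reproduces it. The brute-force encode count is not stated in this text, so the last line only back-solves the count implied by the reported 94.2%; treat that value as an inference, not as data from Table 6.

```python
def reduction(proposed, baseline):
    """Fraction of encodes saved by the proposed method relative to a baseline."""
    return 1 - proposed / baseline

proposed_encodes = 2 * 2   # two encodes per cross-over, two cross-overs
interp_encodes = 7 * 3     # seven QPs at each of three resolutions

print(f"vs interpolation-based: {reduction(proposed_encodes, interp_encodes):.2%}")

# Back-solved from the reported 94.2% saving; an inference, not a reported count.
implied_brute_force = proposed_encodes / (1 - 0.942)
print("implied brute-force encodes:", round(implied_brute_force))
```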
22. V. Conclusion and Future Work
Conclusions:
We proposed a method that predicts the bitrate ladders of the considered resolutions based on spatio-temporal
features extracted from the uncompressed videos at their native resolution and on a few video encodings (two
encodes per RQ intersection point).
The first results are promising compared to the ground truth, while requiring 94.2% and 81% fewer pre-encodes
than the brute force method and the interpolation-based method, respectively.
Future Work:
Our focus will be on validating the presented method across different codecs.
We will also work on identifying cross-codec optimization of bitrate ladders.
23. References
1. “Global Mobile Data Traffic Forecast Update 2017-2022”, White Paper, Cisco, 2018.
2. E. S. Yuan, “A message to our users”, https://blog.zoom.us/a-message-to-our-users/
3. J. De Cock, Z. Li, M. Manohara, and A. Aaron, “Complexity-based consistent quality encoding in the Cloud”, IEEE ICIP 2016.
4. J. Sole, L. Guo, A. Norkin, M. Afonso, K. Swanson, and A. Aaron, “Performance comparison of video coding standards: an adaptive streaming perspective,” https://medium.com/netflix-techblog/performance-comparison-of-video-coding-standards-an-adaptive-streaming-perspective-d45d0183ca95, 2018.
5. A. Katsenou, M. Afonso, D. Agrafiotis, and D. R. Bull, “Predicting Video Rate-Distortion Curves using Textural Features,” in PCS 2016.
6. A. V. Katsenou, J. Sole, and D. R. Bull, “Content-gnostic Bitrate Ladder Prediction for Adaptive Video Streaming,” in PCS 2019.