SlideShare uma empresa Scribd logo
1 de 16
AVX-512 assembly in FFmpeg
Kieran Kunhya <kieran@kunhya.com>
What is Assembly Language?
• A low-level (close to the hardware)
programming language
• Intel x86 assembly language is
(somewhat) backwards compatible to
the 8008 CPU from 1972!
• Languages like C are compiled into
Assembly Language
• Mnemonics are human readable
versions of machine code
What is AVX-512?
• New SIMD (Single Instruction Multiple Data) Instruction Set for Intel since
2017, and more recently AMD CPUs
• 512-bit register sizes (zmm)
• Many new instructions
• Opmasks
• Comparison instructions
• Many things not so interesting in multimedia (e.g cryptography, neural
networks)
• Lots of fancy words, but high schoolers have written assembly in FFmpeg!
Why is this relevant now?
• AVX-512 has been around since 2017. Why is it relevant
now?
• Old Skylake (server) systems had large performance
throttling when using larger registers. AVX-512 remained
unused in multimedia
• Could still use new instructions with smaller registers
• Can be beneficial in some cases (see later)
• Ice Lake (10th and 11th Gen Intel) first to have no throttling
How to get started?
• Intel have removed AVX-512 support
in consumer processors (from 12th
Gen) 
• Available in AMD Zen 4
• Still exists in Intel Server CPUs (Xeon).
Available from cloud providers
• Easiest way is to buy an Intel NUC
(11th Gen)
Existing work in multimedia using AVX-512
• dav1d project added AVX-512 support
• AV1 decoding, particularly beneficial owing to large block sizes
• 10-20% faster *overall* decoding
• Classic FFmpeg/x264 approach to assembly
• No intrinsics, nor inline assembly
• Detect CPU capabilities and set function pointers to appropriate
functions
• Messy Venn Diagram of capabilities, but two in practice we care
about
CPU_FLAGS in FFmpeg
int av_get_cpu_flags(void);
#define AV_CPU_FLAG_AVX512 0x100000 ///<
AVX-512 functions: requires OS support even if
YMM/ZMM registers aren't used
#define AV_CPU_FLAG_AVX512ICL 0x200000 ///<
F/CD/BW/DQ/VL/VNNI/IFMA/VBMI/VBMI2/VPOPCNTDQ/BIT
ALG/GFNI/VAES/VPCLMULQDQ
Lanes
• Older AVX ymm registers are split
into lanes
• Instructions (mainly) operate in lanes
• Can be tricky to move data between
lanes
• Limitation on AVX2 code
16-byte Lane 1 16-byte Lane 0
256-bit ymm Register (old)
K-mask registers
• A new set of registers k0-k7 that allow the destination
register to remain unchanged or set to zero
• For example, addition, but only some values:
• paddw zmm1{k1}, zmm2, zmm3
• kmovX instructions to manipulate k-mask registers
vpermb
• Byte shuffles (permute) are the one
of the most important instructions
in multimedia
• Similar to existing pshufb
instruction but cross-lane
• Need to use k-masks with vpermb
to zero out a byte
• e.g for zigzag scan
• Also, VPERMT2B, permute from
two registers
A B C D E F G H
E A 0 C C B H D
Variable shifts
• vpsrlvw/vpsrlvd/vpsrlvq – variable right shift (logical)
• vpsllvw/vpsllvd/vpsllvq – variable left shift (logical)
• (letter soup, especially arithmetic shifts!)
• Historically had to use multiple instructions and various trickery
to achieve right shifts. Had many limitations.
• Faster than multiply for left shifts
vpternlogd
• The kitchen sink of instructions!
• Allows a programmable truth table to be
implemented per bit input in each register
• Can replace up to eight instructions!
• e.g vpternlogd zmm0, zmm1, zmm2,
0xca
• zmm0 = zmm0 ? zmm1 : zmm2; (for each bit)
• http://0x80.pl/articles/avx512-ternary-
functions.html
Example: v210enc (1)
• .loop:
• movu ym1, [yq + 2*widthq]
• vinserti32x4 m1, [uq + 1*widthq], 2
• vinserti32x4 m1, [vq + 1*widthq], 3
• vpermb m1, m2, m1 ; uyvx yuyx vyux yvyx
• CLIPUB m1, m4, m5
•
pmaddubsw m0, m1, m3 ; shift high and low samples of each dword and mask out other bits
• pslld m1, 4 ; shift center sample of each dword
• vpternlogd m0, m1, m6, 0xd8 ; C?B:A ; merge and mask out bad bits from B
•
movu [dstq], m0
• […]
• jl .loop
• Takes 3x 8-bit samples, extends to 10-bit and packs into 32-bits
Example: v210enc (2)
• Benchmarks (decicyles)
• Skylake
• v210_planar_pack_8_c: 2373.5
• […]
• v210_planar_pack_8_avx2: 194.0
• v210_planar_pack_8_avx512: 174.0
• vpternlogd on a shorter ymm register
• Ice Lake
• v210_planar_pack_8_c: 2743.6
• v210_planar_pack_8_avx2: 246.6
• v210_planar_pack_8_avx512: 238.6
• v210_planar_pack_8_avx512icl: 122.1
• vpermb and zmm nearly twice as fast as avx512 ymm, and more than twenty times
faster than C!
What AVX-512 code next?
• Anything involving line/frame based processing
• e.g filters, scalers etc.
• Comparisons
• Vpternlogd in many places (e.g 3-way boolean)
• Also change variable shifts in many places
• Intel manual is very verbose (useful in some cases)
• https://www.officedaytime.com/simd512e/
Any questions?

Mais conteĂşdo relacionado

Mais procurados

RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects
RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA InterconnectsRDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects
RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects
inside-BigData.com
 
Ipv6 소켓프로그래밍
Ipv6 소켓프로그래밍Ipv6 소켓프로그래밍
Ipv6 소켓프로그래밍
Heo Seungwook
 
Thin wafer processing and Dicing equipment market - 2016 Report by Yole Devel...
Thin wafer processing and Dicing equipment market - 2016 Report by Yole Devel...Thin wafer processing and Dicing equipment market - 2016 Report by Yole Devel...
Thin wafer processing and Dicing equipment market - 2016 Report by Yole Devel...
Yole Developpement
 
Silicon Photonics 2014 Report by Yole Developpement
Silicon Photonics 2014 Report by Yole DeveloppementSilicon Photonics 2014 Report by Yole Developpement
Silicon Photonics 2014 Report by Yole Developpement
Yole Developpement
 

Mais procurados (20)

RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects
RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA InterconnectsRDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects
RDMA, Scalable MPI-3 RMA, and Next-Generation Post-RDMA Interconnects
 
Scaling FreeSWITCH Performance
Scaling FreeSWITCH PerformanceScaling FreeSWITCH Performance
Scaling FreeSWITCH Performance
 
第一回Web技術勉強会 efkスタック編
第一回Web技術勉強会 efkスタック編第一回Web技術勉強会 efkスタック編
第一回Web技術勉強会 efkスタック編
 
Fan-Out and Embedded Die: Technologies & Market Trends 2015 Report by Yole De...
Fan-Out and Embedded Die: Technologies & Market Trends 2015 Report by Yole De...Fan-Out and Embedded Die: Technologies & Market Trends 2015 Report by Yole De...
Fan-Out and Embedded Die: Technologies & Market Trends 2015 Report by Yole De...
 
第11回ACRiウェビナー_東工大/坂本先生ご講演資料
第11回ACRiウェビナー_東工大/坂本先生ご講演資料第11回ACRiウェビナー_東工大/坂本先生ご講演資料
第11回ACRiウェビナー_東工大/坂本先生ご講演資料
 
Svelte5でのevent受け渡し in Svelte Japan Offline Meetup #2
Svelte5でのevent受け渡し in Svelte Japan Offline Meetup #2Svelte5でのevent受け渡し in Svelte Japan Offline Meetup #2
Svelte5でのevent受け渡し in Svelte Japan Offline Meetup #2
 
Asterisk WebRTC frontier: realize client SIP Phone with sipML5 and Janus Gateway
Asterisk WebRTC frontier: realize client SIP Phone with sipML5 and Janus GatewayAsterisk WebRTC frontier: realize client SIP Phone with sipML5 and Janus Gateway
Asterisk WebRTC frontier: realize client SIP Phone with sipML5 and Janus Gateway
 
Patent Watch report on EV, Self Driving Car, V2X
Patent Watch report on EV, Self Driving Car, V2XPatent Watch report on EV, Self Driving Car, V2X
Patent Watch report on EV, Self Driving Car, V2X
 
Silicon Photonics Market & Technology 2020
Silicon Photonics Market & Technology 2020Silicon Photonics Market & Technology 2020
Silicon Photonics Market & Technology 2020
 
GROWI Users Meetup #2
GROWI Users Meetup #2GROWI Users Meetup #2
GROWI Users Meetup #2
 
Ipv6 소켓프로그래밍
Ipv6 소켓프로그래밍Ipv6 소켓프로그래밍
Ipv6 소켓프로그래밍
 
virtio勉強会 #1 「virtioの基本的なところ(DRAFT版)」
virtio勉強会 #1 「virtioの基本的なところ(DRAFT版)」virtio勉強会 #1 「virtioの基本的なところ(DRAFT版)」
virtio勉強会 #1 「virtioの基本的なところ(DRAFT版)」
 
Thin wafer processing and Dicing equipment market - 2016 Report by Yole Devel...
Thin wafer processing and Dicing equipment market - 2016 Report by Yole Devel...Thin wafer processing and Dicing equipment market - 2016 Report by Yole Devel...
Thin wafer processing and Dicing equipment market - 2016 Report by Yole Devel...
 
Silicon Photonics 2014 Report by Yole Developpement
Silicon Photonics 2014 Report by Yole DeveloppementSilicon Photonics 2014 Report by Yole Developpement
Silicon Photonics 2014 Report by Yole Developpement
 
Gpu vs fpga
Gpu vs fpgaGpu vs fpga
Gpu vs fpga
 
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみようPythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
 
Wolfspeed C3M™ Platform SiC 900V MOSFET - teardown reverse costing report pub...
Wolfspeed C3M™ Platform SiC 900V MOSFET - teardown reverse costing report pub...Wolfspeed C3M™ Platform SiC 900V MOSFET - teardown reverse costing report pub...
Wolfspeed C3M™ Platform SiC 900V MOSFET - teardown reverse costing report pub...
 
MAP 実装してみた
MAP 実装してみたMAP 実装してみた
MAP 実装してみた
 
IIJmio meeting 31 音声通信の世界
IIJmio meeting 31 音声通信の世界IIJmio meeting 31 音声通信の世界
IIJmio meeting 31 音声通信の世界
 
Kamailio, FreeSWITCH, and You
Kamailio, FreeSWITCH, and YouKamailio, FreeSWITCH, and You
Kamailio, FreeSWITCH, and You
 

Semelhante a AVX512 assembly language in FFmpeg

Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
Haris456
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_Griffin
Peter Griffin
 

Semelhante a AVX512 assembly language in FFmpeg (20)

WAN - trends and use cases
WAN - trends and use casesWAN - trends and use cases
WAN - trends and use cases
 
Andes RISC-V processor solutions
Andes RISC-V processor solutionsAndes RISC-V processor solutions
Andes RISC-V processor solutions
 
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
 
#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...
#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...
#OSSPARIS19 : A virtual machine approach for microcontroller programming : th...
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_Griffin
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]
 
Advanced High-Performance Computing Features of the Open Power ISA
Advanced High-Performance Computing Features of the Open Power ISAAdvanced High-Performance Computing Features of the Open Power ISA
Advanced High-Performance Computing Features of the Open Power ISA
 
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. TanenbaumA Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack
 
Something about SSE and beyond
Something about SSE and beyondSomething about SSE and beyond
Something about SSE and beyond
 
Haskell Symposium 2010: An LLVM backend for GHC
Haskell Symposium 2010: An LLVM backend for GHCHaskell Symposium 2010: An LLVM backend for GHC
Haskell Symposium 2010: An LLVM backend for GHC
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computing
 
AsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) Hypervisor
AsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) HypervisorAsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) Hypervisor
AsiaBSDCon2023 - Hardening Emulated Devices in OpenBSD’s vmd(8) Hypervisor
 
Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bits
 
ARM AAE - Architecture
ARM AAE - ArchitectureARM AAE - Architecture
ARM AAE - Architecture
 
Processors selection
Processors selectionProcessors selection
Processors selection
 
Pentium iii
Pentium iiiPentium iii
Pentium iii
 
ARM Architecture
ARM ArchitectureARM Architecture
ARM Architecture
 
ARM Introduction.pptx
ARM Introduction.pptxARM Introduction.pptx
ARM Introduction.pptx
 

Mais de Kieran Kunhya

The challenges of generating 2110 streams on Standard IT Hardware
The challenges of generating 2110 streams on Standard IT HardwareThe challenges of generating 2110 streams on Standard IT Hardware
The challenges of generating 2110 streams on Standard IT Hardware
Kieran Kunhya
 
London Video Tech - Adventures in cutting every last millisecond from glass-t...
London Video Tech - Adventures in cutting every last millisecond from glass-t...London Video Tech - Adventures in cutting every last millisecond from glass-t...
London Video Tech - Adventures in cutting every last millisecond from glass-t...
Kieran Kunhya
 

Mais de Kieran Kunhya (16)

Baby Demuxed's First Assembly Language Function
Baby Demuxed's First Assembly Language FunctionBaby Demuxed's First Assembly Language Function
Baby Demuxed's First Assembly Language Function
 
Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...
Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...
Stable Feed and Lower Costs with Use of 5G and Satellite Stable Feed and Lowe...
 
Moving to software-based production workflows and containerisation of media a...
Moving to software-based production workflows and containerisation of media a...Moving to software-based production workflows and containerisation of media a...
Moving to software-based production workflows and containerisation of media a...
 
Private 5G Networks at the Queen's Funeral and Elsewhere
Private 5G Networks at the Queen's Funeral and ElsewherePrivate 5G Networks at the Queen's Funeral and Elsewhere
Private 5G Networks at the Queen's Funeral and Elsewhere
 
IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...
IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...
IBC 2022 IP Showcase - Timestamps in ST 2110: What They Mean and How to Measu...
 
5G for onboard racing car video
5G for onboard racing car video5G for onboard racing car video
5G for onboard racing car video
 
Ground-Cloud-Cloud-Ground - NAB 2022 IP Showcase
Ground-Cloud-Cloud-Ground - NAB 2022 IP ShowcaseGround-Cloud-Cloud-Ground - NAB 2022 IP Showcase
Ground-Cloud-Cloud-Ground - NAB 2022 IP Showcase
 
How to explain ST 2110 to a six year old.
How to explain ST 2110 to a six year old.How to explain ST 2110 to a six year old.
How to explain ST 2110 to a six year old.
 
The challenges of generating 2110 streams on Standard IT Hardware
The challenges of generating 2110 streams on Standard IT HardwareThe challenges of generating 2110 streams on Standard IT Hardware
The challenges of generating 2110 streams on Standard IT Hardware
 
Experiences from weekly sports broadcasts over 5G - what's possible and what ...
Experiences from weekly sports broadcasts over 5G - what's possible and what ...Experiences from weekly sports broadcasts over 5G - what's possible and what ...
Experiences from weekly sports broadcasts over 5G - what's possible and what ...
 
Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...
Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...
Native IP Decoding MPEG-TS Video to Uncompressed IP (and Vice versa) on COTS ...
 
London Video Tech - Adventures in cutting every last millisecond from glass-t...
London Video Tech - Adventures in cutting every last millisecond from glass-t...London Video Tech - Adventures in cutting every last millisecond from glass-t...
London Video Tech - Adventures in cutting every last millisecond from glass-t...
 
Don't just go IP - Go IT
Don't just go IP - Go ITDon't just go IP - Go IT
Don't just go IP - Go IT
 
Using IT Equipment in Live Broadcast
Using IT Equipment in Live BroadcastUsing IT Equipment in Live Broadcast
Using IT Equipment in Live Broadcast
 
Implementing Uncompressed over IP in software and the pitfalls
Implementing Uncompressed over IP in software and the pitfallsImplementing Uncompressed over IP in software and the pitfalls
Implementing Uncompressed over IP in software and the pitfalls
 
FOSS in Broadcast
FOSS in BroadcastFOSS in Broadcast
FOSS in Broadcast
 

Último

Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
Christopher Logan Kennedy
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 

AVX512 assembly language in FFmpeg

  • 1. AVX-512 assembly in FFmpeg Kieran Kunhya <kieran@kunhya.com>
  • 2. What is Assembly Language? • A low-level (close to the hardware) programming language • Intel x86 assembly language is (somewhat) backwards compatible to the 8008 CPU from 1972! • Languages like C are compiled into Assembly Language • Mnemonics are human readable versions of machine code
  • 3. What is AVX-512? • New SIMD (Single Instruction Multiple Data) Instruction Set for Intel since 2017, and more recently AMD CPUs • 512-bit register sizes (zmm) • Many new instructions • Opmasks • Comparison instructions • Many things not so interesting in multimedia (e.g cryptography, neural networks) • Lots of fancy words, but high schoolers have written assembly in FFmpeg!
  • 4. Why is this relevant now? • AVX-512 has been around since 2017. Why is it relevant now? • Old Skylake (server) systems had large performance throttling when using larger registers. AVX-512 remained unused in multimedia • Could still use new instructions with smaller registers • Can be beneficial in some cases (see later) • Ice Lake (10th and 11th Gen Intel) first to have no throttling
  • 5. How to get started? • Intel have removed AVX-512 support in consumer processors (from 12th Gen)  • Available in AMD Zen 4 • Still exists in Intel Server CPUs (Xeon). Available from cloud providers • Easiest way is to buy an Intel NUC (11th Gen)
  • 6. Existing work in multimedia using AVX-512 • dav1d project added AVX-512 support • AV1 decoding, particularly beneficial owing to large block sizes • 10-20% faster *overall* decoding • Classic FFmpeg/x264 approach to assembly • No intrinsics, nor inline assembly • Detect CPU capabilities and set function pointers to appropriate functions • Messy Venn Diagram of capabilities, but two in practice we care about
  • 7. CPU_FLAGS in FFmpeg int av_get_cpu_flags(void); #define AV_CPU_FLAG_AVX512 0x100000 ///< AVX-512 functions: requires OS support even if YMM/ZMM registers aren't used #define AV_CPU_FLAG_AVX512ICL 0x200000 ///< F/CD/BW/DQ/VL/VNNI/IFMA/VBMI/VBMI2/VPOPCNTDQ/BIT ALG/GFNI/VAES/VPCLMULQDQ
  • 8. Lanes • Older AVX ymm registers are split into lanes • Instructions (mainly) operate in lanes • Can be tricky to move data between lanes • Limitation on AVX2 code 16-byte Lane 1 16-byte Lane 0 256-bit ymm Register (old)
  • 9. K-mask registers • A new set of registers k0-k7 that allow the destination register to remain unchanged or set to zero • For example, addition, but only some values: • paddw zmm1{k1}, zmm2, zmm3 • kmovX instructions to manipulate k-mask registers
  • 10. vpermb • Byte shuffles (permute) are the one of the most important instructions in multimedia • Similar to existing pshufb instruction but cross-lane • Need to use k-masks with vpermb to zero out a byte • e.g for zigzag scan • Also, VPERMT2B, permute from two registers A B C D E F G H E A 0 C C B H D
  • 11. Variable shifts • vpsrlvw/vpsrlvd/vpsrlvq – variable right shift (logical) • vpsllvw/vpsllvd/vpsllvq – variable left shift (logical) • (letter soup, especially arithmetic shifts!) • Historically had to use multiple instructions and various trickery to achieve right shifts. Had many limitations. • Faster than multiply for left shifts
  • 12. vpternlogd • The kitchen sink of instructions! • Allows a programmable truth table to be implemented per bit input in each register • Can replace up to eight instructions! • e.g vpternlogd zmm0, zmm1, zmm2, 0xca • zmm0 = zmm0 ? zmm1 : zmm2; (for each bit) • http://0x80.pl/articles/avx512-ternary- functions.html
  • 13. Example: v210enc (1) • .loop: • movu ym1, [yq + 2*widthq] • vinserti32x4 m1, [uq + 1*widthq], 2 • vinserti32x4 m1, [vq + 1*widthq], 3 • vpermb m1, m2, m1 ; uyvx yuyx vyux yvyx • CLIPUB m1, m4, m5 • pmaddubsw m0, m1, m3 ; shift high and low samples of each dword and mask out other bits • pslld m1, 4 ; shift center sample of each dword • vpternlogd m0, m1, m6, 0xd8 ; C?B:A ; merge and mask out bad bits from B • movu [dstq], m0 • […] • jl .loop • Takes 3x 8-bit samples, extends to 10-bit and packs into 32-bits
  • 14. Example: v210enc (2) • Benchmarks (decicyles) • Skylake • v210_planar_pack_8_c: 2373.5 • […] • v210_planar_pack_8_avx2: 194.0 • v210_planar_pack_8_avx512: 174.0 • vpternlogd on a shorter ymm register • Ice Lake • v210_planar_pack_8_c: 2743.6 • v210_planar_pack_8_avx2: 246.6 • v210_planar_pack_8_avx512: 238.6 • v210_planar_pack_8_avx512icl: 122.1 • vpermb and zmm nearly twice as fast as avx512 ymm, and more than twenty times faster than C!
  • 15. What AVX-512 code next? • Anything involving line/frame based processing • e.g filters, scalers etc. • Comparisons • Vpternlogd in many places (e.g 3-way boolean) • Also change variable shifts in many places • Intel manual is very verbose (useful in some cases) • https://www.officedaytime.com/simd512e/