Moving your NEON optimizations 
to a 64-bit world 
Ian Rickards 
November 2014
What is NEON? 
 NEON™ is a wide SIMD data processing architecture
 Extension of the ARM® instruction set 
 32 registers, 64-bit wide 
(AArch64: 32 registers, 128-bit wide) 
 NEON Instructions perform “Packed SIMD” 
processing 
 Registers are considered as vectors of elements 
of the same data type 
 Data types can be: signed/unsigned 8-bit, 16-bit, 
32-bit, 64-bit, single precision float 
(AArch64: Double precision float) 
 Instructions perform the same operation in all 
lanes 
[Figure: an operation applied lane by lane to source registers Dn and Dm, producing destination register Dd]
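To make "the same operation in all lanes" concrete, here is a minimal sketch in C. It assumes a compiler that defines `__ARM_NEON` when NEON is available and falls back to a plain loop otherwise; `add_lanes_u8` is a hypothetical helper name, not something from the slides.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Adds eight unsigned 8-bit lanes: out[i] = a[i] + b[i] (mod 256).
 * add_lanes_u8 is a hypothetical helper name used only for this sketch. */
void add_lanes_u8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t va = vld1_u8(a);       /* load 8 bytes into a D-register */
    uint8x8_t vb = vld1_u8(b);
    uint8x8_t vd = vadd_u8(va, vb);  /* one operation, eight independent adds */
    vst1_u8(out, vd);
#else
    /* Scalar fallback with the same per-lane behaviour. */
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(a[i] + b[i]);
    }
#endif
}
```

Each lane wraps on its own: adding 1 to a lane holding 255 gives 0 in that lane and does not carry into its neighbour.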
Changes in AArch64 
 More registers 
 AArch32: 16x128-bit “Q-regs” Q0-Q15 
 AArch64: 32x128-bit “V-regs” V0-V31 
 ‘Dual view’ no longer packed 
 32 registers of each type: S, D, V 
 Clearer mapping of overlap 
 Asm language changes 
 No ‘v’ in mnemonics 
 Width specifier moved to register description 
 128-bit Q-regs renamed V-regs 
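AArch32 register overlap: Q0 overlaps D0 and D1, which overlap S0-S3
AArch64 register overlap: S0 and D0 are the low bits of V0; S1 and D1 are the low bits of V1
AArch32: vadd.i8 d0, d1, d2 (8 = element size in bits, i.e. byte elements)
AArch64: add v0.8b, v1.8b, v2.8b (8 = number of elements)
Element size specifiers: B byte (8b), H halfword (16b), S single (32b), D double (64b)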
New capabilities in AArch64 ‘AdvancedSIMD’ 
 Reduction across all lanes (e.g. ADDV)
 Rounding mode specified in the instruction
 Not just ‘round to nearest’
 Double precision
 Single destination register
 VZIP, VSWP now take 2 instructions
 Crypto (ARMv8 but part of NEON)
 INS instruction
 Table lookup with larger tables
 Saturating accumulate signed/unsigned
 AdvancedSIMD scalar
 REMOVED: high reg mapping, conditional execution
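As one illustration of the larger table lookup, the sketch below uses the AArch64-only intrinsic `vqtbl1q_u8`, which indexes a 16-byte table held in a single 128-bit register (the AArch32 `vtbl1_u8` intrinsic takes an 8-byte table). The hex-digit use case and the helper name `nibbles_to_hex` are assumptions made for the example; other targets take a scalar fallback.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Converts 16 values in the range 0..15 to ASCII hex digits.
 * nibbles_to_hex is a hypothetical helper name for this sketch. */
void nibbles_to_hex(const uint8_t nibbles[16], uint8_t out[16])
{
    static const uint8_t digits[16] = {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };
#if defined(__aarch64__)
    uint8x16_t table = vld1q_u8(digits);        /* whole table in one V-register */
    uint8x16_t index = vld1q_u8(nibbles);
    uint8x16_t hex = vqtbl1q_u8(table, index);  /* out-of-range indices give 0 */
    vst1q_u8(out, hex);
#else
    /* Scalar fallback mirroring the out-of-range behaviour. */
    for (int i = 0; i < 16; i++) {
        out[i] = (nibbles[i] < 16) ? digits[nibbles[i]] : 0;
    }
#endif
}
```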
NEON use cases 
 General purpose SIMD/DSP processing useful for many applications
 Support all new multimedia codecs
 Watch any video in any format
 Edit & enhance captured videos
 Video stabilization
 Antialiased rendering & compositing
 Advanced user interfaces
 Game processing
 Process megapixel photos quickly
 Voice recognition
 Powerful multichannel hi-fi audio processing
NEON advantages 
 Easy to program 
 Clean vector architecture 
 Off the shelf tools, OS support, commercial 
& opensource ecosystem support 
 Easy to debug 
 Single flow of control 
 No separate DSP debugger 
 Fewer cycles needed 
 NEON provides a real-world 1.5x - 4x speedup on typical video codecs
 Individual simple DSP algorithms can show a larger performance boost (4x-8x)
 Provides overall power saving and 
increased processing capabilities 
 No overheads to ‘calling’ NEON
How to use NEON 
Approaches, from no effort to some effort:
 Automatically via OS
 Opensource libraries
 Vectorizing compilers: use NEON automatically from “C”
 JIT compilers: LLVM (e.g. Android Renderscript)
 Commercial vendors: e.g. commercial HEVC decoder
 C intrinsics
 Assembler
Automatically via OS - NEON in Android 
 Wide use of NEON optimizations in 
current Android source tree 
 Many apps use NEON 
 Games (every game engine) 
 VR 
 Media editing / photo effects 
 Content creation 
Component                              ARMv7 NEON
VP8 (Google webm) decoder & encoder    YES (asm)
VP9 (Google webm) decoder & encoder    YES (asm)
JPEG                                   YES (asm)
Google WebP                            YES (asm)
PNG decoder                            YES (asm)
H.264 s/w decoder                      YES (asm)
AMR WB encoder                         YES (asm)
Skia                                   YES (asm)
WebRTC                                 YES (asm – FFT etc)
Renderscript                           YES (via LLVM backend)
Blink (Chromium browser)               YES (intrinsics)
NEON optimizations in opensource 
 Google WebM – 17,000 lines NEON code – both VP9 and VP8 
 Bluez – official Linux Bluetooth protocol stack 
 Pixman (part of cairo 2D graphics library) 
 ffmpeg (libav) – libavcodec - LGPL media player 
 X264 - GPL H.264 encoder – can be used for video conferencing 
 Eigen2 – C++ vector math / linear algebra template library 
 Theorarm – libtheora NEON version (optimized by Google) 
 Android libjpeg / libjpeg-turbo – optimized JPEG decode 
 libpng – NEON optimized PNG decode 
 FFTW – NEON enabled FFT library 
 Liboil / liborc – runtime compiler for SIMD processing 
 webkit - used by Google Chrome browser 
 Ne10 – library of low-level optimized math routines 
 Cocos2d-x – 2D game engine uses Ne10 
 Skia – 2D graphics library 
 Android Renderscript 
Android Native Development Kit (NDK) for ARM 
 NDK is a toolkit to enable application developers to write native applications for the 
ARM processor 
 NEON fully supported since NDK r5 
 64-bit support released in NDK r10 for “L” 
Android™ applications can be written in Java, native ARM code, or a combination of the two
Android ABI    NEON support?
armeabi        No
armeabi-v7a    Optional - check cpu flags for NEON and ARMv8 crypto
arm64-v8a      Yes: NEON always present
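One possible way to "check cpu flags" on armeabi-v7a is sketched below. It assumes a Linux-based libc that provides `getauxval` and reads the kernel's hardware capability bits; `have_neon` is a hypothetical helper name. The NDK also ships a cpufeatures helper library that can be used for the same purpose.

```c
#if defined(__arm__) && defined(__linux__)
#include <sys/auxv.h>            /* getauxval, AT_HWCAP */
#ifndef HWCAP_NEON
#define HWCAP_NEON (1 << 12)     /* bit value from the Linux ARM <asm/hwcap.h> */
#endif
#endif

/* Returns 1 if NEON can be used, 0 otherwise.
 * have_neon is a hypothetical helper name for this sketch. */
int have_neon(void)
{
#if defined(__aarch64__)
    return 1;                    /* arm64-v8a: NEON always present */
#elif defined(__arm__) && defined(__linux__)
    /* 32-bit ARM: NEON is optional, so ask the kernel at run time. */
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return 0;                    /* not an ARM target */
#endif
}
```

A typical use is to select between a NEON code path and a plain C path once at start-up, based on the return value.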
Using vectorizing compiler – gcc 
test.c:
    int a[256], b[256], c[256];
    void foo(void) {
        int i;
        for (i = 0; i < 256; i++) {
            a[i] = b[i] + c[i];
        }
    }

AArch32
gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp test.c
    .L2:
        vldmia   r1!, {d18-d19}
        vldmia   r2!, {d16-d17}
        vadd.i32 q8, q9, q8
        vstmia   r3!, {d16-d17}
        cmp      r3, r0
        bne      .L2

AArch64
gcc -S -O3 test.c
    .L2:
        ldr q0, [x0], 16
        ldr q1, [x2], 16
        cmp x0, x3
        add v0.4s, v0.4s, v1.4s
        str q0, [x1], 16
        bne .L2

 -ftree-vectorize is enabled by default at gcc -O3
 Example built with linaro-gcc-4.9-2014.05 aarch64 release
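The loop on this slide works on global arrays of known size. When the same loop takes pointer arguments, the compiler cannot assume that the arrays do not overlap. The sketch below uses the C99 `restrict` qualifier to state that; `add_arrays` is a hypothetical name, and whether a given compiler version vectorizes it is something to confirm in the generated assembly.

```c
#include <stddef.h>

/* Element-wise a[i] = b[i] + c[i].
 * 'restrict' promises the compiler that the three arrays do not overlap,
 * which removes one obstacle to auto-vectorization.
 * add_arrays is a hypothetical name for this sketch. */
void add_arrays(int *restrict a, const int *restrict b,
                const int *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

Building with the same `gcc -S -O3` commands and looking for `vadd.i32` (AArch32) or `add v0.4s` (AArch64) in the output shows whether the loop was vectorized.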
What is vectorizing? 
 Adding 16x 32-bit values (scalar)
 Using NEON with Q-registers
[Figure: summing the values with Q-registers. AArch32 uses VADD followed by VPADD; AArch64 uses ADD followed by ADDV (new).]
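The same sum of 16 values can be sketched with intrinsics. This is an illustration under stated assumptions, not the code behind the diagram: `sum16_s32` is a hypothetical helper name, the AArch64 path uses `vaddvq_s32` (ADDV), the AArch32 path uses two `vpadd_s32` steps, and a scalar loop covers targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Sums sixteen 32-bit integers.
 * sum16_s32 is a hypothetical helper name for this sketch. */
int32_t sum16_s32(const int32_t v[16])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Vertical adds: four Q-registers collapse into one. */
    int32x4_t acc = vaddq_s32(vaddq_s32(vld1q_s32(v), vld1q_s32(v + 4)),
                              vaddq_s32(vld1q_s32(v + 8), vld1q_s32(v + 12)));
#if defined(__aarch64__)
    return vaddvq_s32(acc);              /* ADDV: reduce across all lanes */
#else
    /* Two pairwise adds reduce four lanes to one. */
    int32x2_t half = vpadd_s32(vget_low_s32(acc), vget_high_s32(acc));
    half = vpadd_s32(half, half);
    return vget_lane_s32(half, 0);
#endif
#else
    int32_t sum = 0;                     /* scalar fallback */
    for (int i = 0; i < 16; i++) {
        sum += v[i];
    }
    return sum;
#endif
}
```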
Intrinsics 
 Include intrinsics header file (ACLE standard)
    #include <arm_neon.h>
 Use special NEON data types which correspond to D and Q registers, e.g.
    int8x8_t   D-register  8x 8-bit values
    int16x4_t  D-register  4x 16-bit values
    int32x4_t  Q-register  4x 32-bit values
 Use NEON intrinsics versions of instructions
    vin1 = vld1q_s32(ptr);
    vout = vaddq_s32(vin1, vin2);
    vst1q_s32(ptr, vout);
 Strongly typed!
 Use vreinterpret_s16_s32( ) to change the type
 Fully compatible with AArch64

static inline void Filter_32_opaque_neon(unsigned x, unsigned y,
                                         SkPMColor a00, SkPMColor a01,
                                         SkPMColor a10, SkPMColor a11,
                                         SkPMColor *dst) {
    uint8x8_t vy, vconst16_8, v16_y, vres;
    uint16x4_t vx, vconst16_16, v16_x, tmp;
    uint32x2_t va0, va1;
    uint16x8_t tmp1, tmp2;
    vy = vdup_n_u8(y);                 // duplicate y into vy
    vconst16_8 = vmov_n_u8(16);        // set up constant in vconst16_
    v16_y = vsub_u8(vconst16_8, vy);   // v16_y = 16-y
    va0 = vdup_n_u32(a00);             // duplicate a00
    va1 = vdup_n_u32(a10);             // duplicate a10
    va0 = vset_lane_u32(a01, va0, 1);  // set top to a01
    va1 = vset_lane_u32(a11, va1, 1);  // set top to a11
    tmp1 = vmull_u8(vreinterpret_u8_u32(va0), v16_y); // tmp1 = [
    tmp2 = vmull_u8(vreinterpret_u8_u32(va1), vy);    // tmp2 = [
    // (excerpt: the rest of the function is not shown)
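The load, add and store intrinsics named on this slide can be put together as a complete function. The sketch below is an assumption-laden illustration: `add_s32` is a hypothetical helper name, the NEON path is taken only when the compiler defines `__ARM_NEON`, and a scalar loop handles leftover elements (or everything on other targets). The same source builds for AArch32 and AArch64.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* out[i] = in1[i] + in2[i] for n elements.
 * add_s32 is a hypothetical helper name for this sketch. */
void add_s32(const int32_t *in1, const int32_t *in2, int32_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        int32x4_t vin1 = vld1q_s32(in1 + i);    /* load 4x 32-bit values */
        int32x4_t vin2 = vld1q_s32(in2 + i);
        int32x4_t vout = vaddq_s32(vin1, vin2); /* add all four lanes */
        vst1q_s32(out + i, vout);               /* pointer first, then value */
    }
#endif
    for (; i < n; i++) {                        /* leftover (or all) elements */
        out[i] = in1[i] + in2[i];
    }
}
```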
NEON intrinsics 
Pros
 Readability
 Reusability (inline functions, templates)
 Type checking (vreinterpret)
 Easier to debug
 Portability to AArch64
 Compiler can combine instructions (e.g. MAC)
 Compiler does register allocation
 Compiler does instruction scheduling
Cons
 Little control over registers used
 Does not always generate the code you expect
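A small sketch of the "compiler can combine instructions" point: the multiply and the add are written as two separate intrinsics, and the compiler is free to emit a single multiply-accumulate for them. Whether it does so depends on the compiler and flags, so treat that as something to check in the assembly output; `mac4_s32` is a hypothetical helper name.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* acc[i] += a[i] * b[i] for four 32-bit lanes.
 * mac4_s32 is a hypothetical helper name for this sketch. */
void mac4_s32(int32_t acc[4], const int32_t a[4], const int32_t b[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int32x4_t vacc = vld1q_s32(acc);
    int32x4_t prod = vmulq_s32(vld1q_s32(a), vld1q_s32(b));
    vacc = vaddq_s32(vacc, prod);   /* compiler may fuse the mul and add */
    vst1q_s32(acc, vacc);
#else
    for (int i = 0; i < 4; i++) {   /* scalar fallback */
        acc[i] += a[i] * b[i];
    }
#endif
}
```

This also shows the first "con": the source does not say which registers or exact instructions are used, so the result is only visible in the generated code.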
Compatibility 
 C/intrinsics will port with no effort
 Asm requires reworking of .s file (mostly cosmetic, but can take advantage of additional registers)
 AArch64 NEON optimization in progress
 ARM & Linaro working on key Android libraries using intrinsics
 ffmpeg AArch64 NEON decoders (asm)
 X264 AArch64 NEON encoder (asm)
AArch64 NEON coding technique    Compatible?
Vectorized “C”                   Fully compatible
Intrinsics (“arm_neon.h”)        Fully compatible
Asm (.s)                         Some porting required
Library routines                 Yes, if library available
Fastest JPEG codecs: Android & libjpeg-turbo 
 NEON optimizations integrated into
 Official Android 4.4 (Kitkat) and later (plus partner-specific versions)
 Libjpeg-turbo (opensource)
 Significantly improves speed of multi-megapixel image decode
 Benchmarked on 1.7GHz ARM Cortex®-A15
 Optimized 34.4 cycles/pix => 0.4s total for image
 Test image: 19.4Mpix (5944 x 3268)
http://commons.wikimedia.org/wiki/File:Willaerts_Adam_The_Embarkation_of_the_Elector_Palantine_Oil_Canvas-huge.jpg
Cycles/pixel JPEG decode (lower is better)
Unoptimized Android 4.4 'djpeg -fast'    103.9
Optimized Android 4.4 'djpeg -fast'      50.5
Optimized libjpeg-turbo-1.3.0 'djpeg'    34.4
NEON 3.0x faster
X264 – high quality H.264 encoding 
 Can be used for highest-quality offline encoding: movies & tv content for on-demand services
 Full ARMv7 NEON optimizations
 5300 lines of NEON asm
 New: AArch64 NEON (Aug 2014)
 Performance results from Cortex-A15 processor @1.7GHz
[Chart: No asm vs NEON optimized]
Test clip        NEON optimized
park_joy main    350%
park_joy high    315%
sintel main      382%
sintel high      370%
File (media.xiph.org)           Resolution
park_joy_420_720p50.y4m         1280x720
sintel_trailer_2k_480p24.y4m    854x480
Wide range of NEON enabled low-cost dev boards 
 Odroid XU3 $179
 4x 2.0GHz Cortex-A15 ‘Octa’ big.LITTLE
 Cubieboard4 CC-A80
 4x 2.0GHz Cortex-A15 + 4x 1.3GHz Cortex-A7
 Cubietruck $65
 4x Cortex-A7
 Chromebook2
 4x Cortex-A15 ‘Octa’ big.LITTLE
 64-bit: ARM “Juno” 
 64-bit: Nexus 9
NEON summary 
 NEON in AArch64 is much improved 
 More registers 
 New instructions 
 Cleaner instruction set 
 Migrating to 64-bit 
 Use C or NEON intrinsics for best 
portability 
 Asm best in special circumstances, 
e.g. video codecs 
 Normally straightforward to port ARMv7 NEON to AArch64 NEON
 NDK r10 provides full support – 
start testing apps now! 
 Existing NEON documentation still very 
relevant 
 NEON Programmer’s Guide 
 Blog entries 
 http://www.arm.com/community/ 
 Tune for AArch64 
 Extra registers 
 Double precision float 
 New instructions
Thank You 
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU 
and/or elsewhere. All rights reserved. Any other marks featured may be trademarks of their respective owners

More Related Content

What's hot

HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MLinaro
 
LAS16-402: ARM Trusted Firmware – from Enterprise to Embedded
LAS16-402: ARM Trusted Firmware – from Enterprise to EmbeddedLAS16-402: ARM Trusted Firmware – from Enterprise to Embedded
LAS16-402: ARM Trusted Firmware – from Enterprise to EmbeddedLinaro
 
Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_Linaro
 
HKG18-402 - Build secure key management services in OP-TEE
HKG18-402 - Build secure key management services in OP-TEEHKG18-402 - Build secure key management services in OP-TEE
HKG18-402 - Build secure key management services in OP-TEELinaro
 
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSDLAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSDLinaro
 
OPTEE on QEMU - Build Tutorial
OPTEE on QEMU - Build TutorialOPTEE on QEMU - Build Tutorial
OPTEE on QEMU - Build TutorialDalton Valadares
 
Androidへのdebianインストール奮闘記
Androidへのdebianインストール奮闘記Androidへのdebianインストール奮闘記
Androidへのdebianインストール奮闘記Tomoya Kawanishi
 
Linux Porting to a Custom Board
Linux Porting to a Custom BoardLinux Porting to a Custom Board
Linux Porting to a Custom BoardPatrick Bellasi
 
Hybrid Memory Cube: Developing Scalable and Resilient Memory Systems
Hybrid Memory Cube: Developing Scalable and Resilient Memory SystemsHybrid Memory Cube: Developing Scalable and Resilient Memory Systems
Hybrid Memory Cube: Developing Scalable and Resilient Memory SystemsMicronTechnology
 
Intel(r) Quick Assist Technology Overview
Intel(r) Quick Assist Technology OverviewIntel(r) Quick Assist Technology Overview
Intel(r) Quick Assist Technology OverviewMichelle Holley
 
Embedded Linux BSP Training (Intro)
Embedded Linux BSP Training (Intro)Embedded Linux BSP Training (Intro)
Embedded Linux BSP Training (Intro)RuggedBoardGroup
 
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM SystemsXPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM SystemsThe Linux Foundation
 
Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...
Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...
Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...Linaro
 
U Boot or Universal Bootloader
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal BootloaderSatpal Parmar
 
Secure storage updates - SFO17-309
Secure storage updates - SFO17-309Secure storage updates - SFO17-309
Secure storage updates - SFO17-309Linaro
 
Arm device tree and linux device drivers
Arm device tree and linux device driversArm device tree and linux device drivers
Arm device tree and linux device driversHoucheng Lin
 
TEE - kernel support is now upstream. What this means for open source security
TEE - kernel support is now upstream. What this means for open source securityTEE - kernel support is now upstream. What this means for open source security
TEE - kernel support is now upstream. What this means for open source securityLinaro
 
X / DRM (Direct Rendering Manager) Architectural Overview
X / DRM (Direct Rendering Manager) Architectural OverviewX / DRM (Direct Rendering Manager) Architectural Overview
X / DRM (Direct Rendering Manager) Architectural OverviewMoriyoshi Koizumi
 
LCU14-103: How to create and run Trusted Applications on OP-TEE
LCU14-103: How to create and run Trusted Applications on OP-TEELCU14-103: How to create and run Trusted Applications on OP-TEE
LCU14-103: How to create and run Trusted Applications on OP-TEELinaro
 

What's hot (20)

HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8M
 
LAS16-402: ARM Trusted Firmware – from Enterprise to Embedded
LAS16-402: ARM Trusted Firmware – from Enterprise to EmbeddedLAS16-402: ARM Trusted Firmware – from Enterprise to Embedded
LAS16-402: ARM Trusted Firmware – from Enterprise to Embedded
 
Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_
 
HKG18-402 - Build secure key management services in OP-TEE
HKG18-402 - Build secure key management services in OP-TEEHKG18-402 - Build secure key management services in OP-TEE
HKG18-402 - Build secure key management services in OP-TEE
 
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSDLAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
LAS16-210: Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
 
OPTEE on QEMU - Build Tutorial
OPTEE on QEMU - Build TutorialOPTEE on QEMU - Build Tutorial
OPTEE on QEMU - Build Tutorial
 
Androidへのdebianインストール奮闘記
Androidへのdebianインストール奮闘記Androidへのdebianインストール奮闘記
Androidへのdebianインストール奮闘記
 
Linux Porting to a Custom Board
Linux Porting to a Custom BoardLinux Porting to a Custom Board
Linux Porting to a Custom Board
 
Hybrid Memory Cube: Developing Scalable and Resilient Memory Systems
Hybrid Memory Cube: Developing Scalable and Resilient Memory SystemsHybrid Memory Cube: Developing Scalable and Resilient Memory Systems
Hybrid Memory Cube: Developing Scalable and Resilient Memory Systems
 
Intel(r) Quick Assist Technology Overview
Intel(r) Quick Assist Technology OverviewIntel(r) Quick Assist Technology Overview
Intel(r) Quick Assist Technology Overview
 
Andes RISC-V processor solutions
Andes RISC-V processor solutionsAndes RISC-V processor solutions
Andes RISC-V processor solutions
 
Embedded Linux BSP Training (Intro)
Embedded Linux BSP Training (Intro)Embedded Linux BSP Training (Intro)
Embedded Linux BSP Training (Intro)
 
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM SystemsXPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
 
Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...
Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...
Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...
 
U Boot or Universal Bootloader
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal Bootloader
 
Secure storage updates - SFO17-309
Secure storage updates - SFO17-309Secure storage updates - SFO17-309
Secure storage updates - SFO17-309
 
Arm device tree and linux device drivers
Arm device tree and linux device driversArm device tree and linux device drivers
Arm device tree and linux device drivers
 
TEE - kernel support is now upstream. What this means for open source security
TEE - kernel support is now upstream. What this means for open source securityTEE - kernel support is now upstream. What this means for open source security
TEE - kernel support is now upstream. What this means for open source security
 
X / DRM (Direct Rendering Manager) Architectural Overview
X / DRM (Direct Rendering Manager) Architectural OverviewX / DRM (Direct Rendering Manager) Architectural Overview
X / DRM (Direct Rendering Manager) Architectural Overview
 
LCU14-103: How to create and run Trusted Applications on OP-TEE
LCU14-103: How to create and run Trusted Applications on OP-TEELCU14-103: How to create and run Trusted Applications on OP-TEE
LCU14-103: How to create and run Trusted Applications on OP-TEE
 

Viewers also liked

Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsLinaro
 
Q4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerQ4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerLinaro
 
中華チップ全盛時代のARM SoCの選び方_公開版
中華チップ全盛時代のARM SoCの選び方_公開版中華チップ全盛時代のARM SoCの選び方_公開版
中華チップ全盛時代のARM SoCの選び方_公開版kinneko
 
LAS16-406: Android Widevine on OP-TEE
LAS16-406: Android Widevine on OP-TEELAS16-406: Android Widevine on OP-TEE
LAS16-406: Android Widevine on OP-TEELinaro
 
組み込み関数(intrinsic)によるSIMD入門
組み込み関数(intrinsic)によるSIMD入門組み込み関数(intrinsic)によるSIMD入門
組み込み関数(intrinsic)によるSIMD入門Norishige Fukushima
 
Software, Over the Air (SOTA) for Automotive Grade Linux (AGL)
Software, Over the Air (SOTA) for Automotive Grade Linux (AGL)Software, Over the Air (SOTA) for Automotive Grade Linux (AGL)
Software, Over the Air (SOTA) for Automotive Grade Linux (AGL)Leon Anavi
 
[NDC2015] 언제 어디서나 프로파일링 가능한 코드네임 JYP 작성기 - 라이브 게임 배포 후에도 프로파일링 하기
[NDC2015] 언제 어디서나 프로파일링 가능한 코드네임 JYP 작성기 - 라이브 게임 배포 후에도 프로파일링 하기[NDC2015] 언제 어디서나 프로파일링 가능한 코드네임 JYP 작성기 - 라이브 게임 배포 후에도 프로파일링 하기
[NDC2015] 언제 어디서나 프로파일링 가능한 코드네임 JYP 작성기 - 라이브 게임 배포 후에도 프로파일링 하기Jaeseung Ha
 
LAS16-504: Secure Storage updates in OP-TEE
LAS16-504: Secure Storage updates in OP-TEELAS16-504: Secure Storage updates in OP-TEE
LAS16-504: Secure Storage updates in OP-TEELinaro
 
SFO15-503: Secure storage in OP-TEE
SFO15-503: Secure storage in OP-TEESFO15-503: Secure storage in OP-TEE
SFO15-503: Secure storage in OP-TEELinaro
 
Introduction to Optee (26 may 2016)
Introduction to Optee (26 may 2016)Introduction to Optee (26 may 2016)
Introduction to Optee (26 may 2016)Yannick Gicquel
 
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEE
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEEBKK16-110 A Gentle Introduction to Trusted Execution and OP-TEE
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEELinaro
 
HKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting ReviewHKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting ReviewLinaro
 
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3Linaro
 
Arm v8 instruction overview android 64 bit briefing
Arm v8 instruction overview android 64 bit briefingArm v8 instruction overview android 64 bit briefing
Arm v8 instruction overview android 64 bit briefingMerck Hung
 
[NDC2015] C++11 고급 기능 - Crow에 사용된 기법 중심으로
[NDC2015] C++11 고급 기능 - Crow에 사용된 기법 중심으로[NDC2015] C++11 고급 기능 - Crow에 사용된 기법 중심으로
[NDC2015] C++11 고급 기능 - Crow에 사용된 기법 중심으로Jaeseung Ha
 
오픈 소스를 활용한 캐쥬얼 게임 서버 프레임워크 개발
오픈 소스를 활용한 캐쥬얼 게임 서버 프레임워크 개발오픈 소스를 활용한 캐쥬얼 게임 서버 프레임워크 개발
오픈 소스를 활용한 캐쥬얼 게임 서버 프레임워크 개발주항 박
 
BUD17-DF15 - Optimized Android N MR1 + 4.9 Kernel
BUD17-DF15 - Optimized Android N MR1 + 4.9 KernelBUD17-DF15 - Optimized Android N MR1 + 4.9 Kernel
BUD17-DF15 - Optimized Android N MR1 + 4.9 KernelLinaro
 
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARMXPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARMThe Linux Foundation
 

Viewers also liked (20)

Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON Intrinsics
 
Q4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerQ4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-Vectorizer
 
中華チップ全盛時代のARM SoCの選び方_公開版
中華チップ全盛時代のARM SoCの選び方_公開版中華チップ全盛時代のARM SoCの選び方_公開版
中華チップ全盛時代のARM SoCの選び方_公開版
 
64-bit Android
64-bit Android64-bit Android
64-bit Android
 
LAS16-406: Android Widevine on OP-TEE
LAS16-406: Android Widevine on OP-TEELAS16-406: Android Widevine on OP-TEE
LAS16-406: Android Widevine on OP-TEE
 
組み込み関数(intrinsic)によるSIMD入門
組み込み関数(intrinsic)によるSIMD入門組み込み関数(intrinsic)によるSIMD入門
組み込み関数(intrinsic)によるSIMD入門
 
Software, Over the Air (SOTA) for Automotive Grade Linux (AGL)
Software, Over the Air (SOTA) for Automotive Grade Linux (AGL)Software, Over the Air (SOTA) for Automotive Grade Linux (AGL)
Software, Over the Air (SOTA) for Automotive Grade Linux (AGL)
 
[NDC2015] 언제 어디서나 프로파일링 가능한 코드네임 JYP 작성기 - 라이브 게임 배포 후에도 프로파일링 하기
[NDC2015] 언제 어디서나 프로파일링 가능한 코드네임 JYP 작성기 - 라이브 게임 배포 후에도 프로파일링 하기[NDC2015] 언제 어디서나 프로파일링 가능한 코드네임 JYP 작성기 - 라이브 게임 배포 후에도 프로파일링 하기
[NDC2015] 언제 어디서나 프로파일링 가능한 코드네임 JYP 작성기 - 라이브 게임 배포 후에도 프로파일링 하기
 
EXAME-PARTE-II
EXAME-PARTE-IIEXAME-PARTE-II
EXAME-PARTE-II
 
LAS16-504: Secure Storage updates in OP-TEE
LAS16-504: Secure Storage updates in OP-TEELAS16-504: Secure Storage updates in OP-TEE
LAS16-504: Secure Storage updates in OP-TEE
 
SFO15-503: Secure storage in OP-TEE
SFO15-503: Secure storage in OP-TEESFO15-503: Secure storage in OP-TEE
SFO15-503: Secure storage in OP-TEE
 
Introduction to Optee (26 may 2016)
Introduction to Optee (26 may 2016)Introduction to Optee (26 may 2016)
Introduction to Optee (26 may 2016)
 
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEE
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEEBKK16-110 A Gentle Introduction to Trusted Execution and OP-TEE
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEE
 
HKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting ReviewHKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting Review
 
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3
 
Arm v8 instruction overview android 64 bit briefing
Arm v8 instruction overview android 64 bit briefingArm v8 instruction overview android 64 bit briefing
Arm v8 instruction overview android 64 bit briefing
 
[NDC2015] C++11 고급 기능 - Crow에 사용된 기법 중심으로
[NDC2015] C++11 고급 기능 - Crow에 사용된 기법 중심으로[NDC2015] C++11 고급 기능 - Crow에 사용된 기법 중심으로
[NDC2015] C++11 고급 기능 - Crow에 사용된 기법 중심으로
 
오픈 소스를 활용한 캐쥬얼 게임 서버 프레임워크 개발
오픈 소스를 활용한 캐쥬얼 게임 서버 프레임워크 개발오픈 소스를 활용한 캐쥬얼 게임 서버 프레임워크 개발
오픈 소스를 활용한 캐쥬얼 게임 서버 프레임워크 개발
 
BUD17-DF15 - Optimized Android N MR1 + 4.9 Kernel
BUD17-DF15 - Optimized Android N MR1 + 4.9 KernelBUD17-DF15 - Optimized Android N MR1 + 4.9 Kernel
BUD17-DF15 - Optimized Android N MR1 + 4.9 Kernel
 
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARMXPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
 

Similar to Moving NEON to 64 bits

0xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp020xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp02chon2010
 
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMDEdge AI and Vision Alliance
 
LCE12: LCE12 ARMv8 Plenary
LCE12: LCE12 ARMv8 PlenaryLCE12: LCE12 ARMv8 Plenary
LCE12: LCE12 ARMv8 PlenaryLinaro
 
Embedded Recipes 2018 - Upstream multimedia on amlogic so cs from fiction t...
Embedded Recipes 2018 - Upstream multimedia on amlogic so cs   from fiction t...Embedded Recipes 2018 - Upstream multimedia on amlogic so cs   from fiction t...
Embedded Recipes 2018 - Upstream multimedia on amlogic so cs from fiction t...Anne Nicolas
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoEmbarcados
 
Chapter_01_See_Program_Running.pptx
Chapter_01_See_Program_Running.pptxChapter_01_See_Program_Running.pptx
Chapter_01_See_Program_Running.pptxWaleedAbdullah2k19EE
 
Linxu conj2016 96boards
Linxu conj2016 96boardsLinxu conj2016 96boards
Linxu conj2016 96boardsLF Events
 
Efficient JIT to 32-bit Arches
Efficient JIT to 32-bit ArchesEfficient JIT to 32-bit Arches
Efficient JIT to 32-bit ArchesNetronome
 
0xdroid -- community-developed Android distribution by 0xlab
0xdroid -- community-developed Android distribution by 0xlab0xdroid -- community-developed Android distribution by 0xlab
0xdroid -- community-developed Android distribution by 0xlabNational Cheng Kung University
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)
Design of 32 Bit Processor Using 8051 and Leon3 (Progress Report)Talal Khaliq
 
Scalability for All: Unreal Engine* 4 with Intel
Scalability for All: Unreal Engine* 4 with Intel Scalability for All: Unreal Engine* 4 with Intel
Scalability for All: Unreal Engine* 4 with Intel Intel® Software
 
"Combining Flexibility and Low-Power in Embedded Vision Subsystems: An Applic...
"Combining Flexibility and Low-Power in Embedded Vision Subsystems: An Applic..."Combining Flexibility and Low-Power in Embedded Vision Subsystems: An Applic...
"Combining Flexibility and Low-Power in Embedded Vision Subsystems: An Applic...Edge AI and Vision Alliance
 
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral IntegrationA 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral IntegrationTalal Khaliq
 
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Laurent Leturgez
 
Embedded Recipes 2019 - Making embedded graphics less special
Embedded Recipes 2019 - Making embedded graphics less specialEmbedded Recipes 2019 - Making embedded graphics less special
Embedded Recipes 2019 - Making embedded graphics less specialAnne Nicolas
 
Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7Kynetics
 

Similar to Moving NEON to 64 bits (20)

0xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp020xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp02
 
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
 
LCE12: LCE12 ARMv8 Plenary
LCE12: LCE12 ARMv8 PlenaryLCE12: LCE12 ARMv8 Plenary
LCE12: LCE12 ARMv8 Plenary
 
Embedded Recipes 2018 - Upstream multimedia on amlogic so cs from fiction t...
Moving NEON to 64 bits

  • 1. Moving your NEON optimizations to a 64-bit world. Ian Rickards, November 2014
  • 2. What is NEON?
      - NEON™ is a wide SIMD data processing architecture
          - Extension of the ARM® instruction set
          - 32 registers, 64-bit wide (AArch64: 32 registers, 128-bit wide)
      - NEON instructions perform “Packed SIMD” processing
          - Registers are considered as vectors of elements of the same data type
          - Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single precision float (AArch64: double precision float)
          - Instructions perform the same operation in all lanes
      - [Diagram: source registers Dn and Dm, one operation per lane, destination register Dd]
  • 3. Changes in AArch64
      - More registers
          - AArch32: 16x128-bit “Q-regs” Q0-Q15
          - AArch64: 32x128-bit “V-regs” V0-V31
      - ‘Dual view’ no longer packed
          - 32 registers of each type: S, D, V
          - Clearer mapping of overlap
      - Asm language changes
          - No ‘v’ in mnemonics
          - Width specifier moved to register description
          - 128-bit Q-regs renamed V-regs
      - [Diagram, AArch32: S0-S3 and D0-D1 are packed into Q0]
      - [Diagram, AArch64: S0 and D0 are the low part of V0; S1 and D1 are the low part of V1]
      - AArch32 example: vaddq.8 q0, q1, q2 (8 = byte elements)
      - AArch64 example: add v0.8B, v1.8B, v2.8B (8 = number of elements; note that .8B selects a 64-bit vector, and the full 128-bit form is .16B)
      - Element size letters: B byte, H halfword (16b), S single (32b), D double (64b)
  • 4. New capabilities in AArch64 ‘AdvancedSIMD’
      - Reduction across all lanes: ADDV
      - Rounding mode specified in the instruction
          - Not just ‘round to nearest’
      - Double precision
      - Single destination register
          - VZIP, VSWP now 2 instructions
      - Crypto (ARMv8 but part of NEON)
      - Ins instruction
      - Table lookup with a larger table
      - Saturating accumulate signed/unsigned
      - AdvancedSIMD scalar
      - REMOVED: high reg mapping, conditional execution
  • 5. NEON use cases
      - General purpose SIMD/DSP processing useful for many applications
      - Support all new multimedia codecs
      - Examples shown on the slide:
          - Watch any video in any format
          - Edit & enhance captured videos
          - Video stabilization
          - Antialiased rendering & compositing
          - Advanced user interfaces
          - Game processing
          - Process megapixel photos quickly
          - Voice recognition
          - Powerful multichannel hi-fi audio processing
  • 6. NEON advantages
      - Easy to program
          - Clean vector architecture
          - Off the shelf tools, OS support, commercial & opensource ecosystem support
      - Easy to debug
          - Single flow of control
          - No separate DSP debugger
      - Fewer cycles needed
          - NEON will provide real-world 1.5x - 4x performance on typical video codecs
          - Individual simple DSP algorithms can show larger performance boost (4x-8x)
          - Provides overall power saving and increased processing capabilities
          - No overheads to ‘calling’ NEON
  • 7. How to use NEON
      - Automatically via OS
      - Opensource libraries
      - Vectorizing compilers
          - Use NEON automatically from “C”
      - JIT compilers
          - LLVM (e.g. Android Renderscript)
      - Commercial vendors
          - e.g. commercial HEVC decoder
      - C intrinsics
      - Assembler
      - [Diagram: the options are laid out along a scale from “No effort” to “Some effort”]
  • 8. Automatically via OS - NEON in Android
      - Wide use of NEON optimizations in current Android source tree
      - Many apps use NEON
          - Games (every game engine)
          - VR
          - Media editing / photo effects
          - Content creation

        Component                               ARMv7 NEON
        VP8 (Google webm) decoder & encoder     YES (asm)
        VP9 (Google webm) decoder & encoder     YES (asm)
        JPEG                                    YES (asm)
        Google WebP                             YES (asm)
        PNG decoder                             YES (asm)
        H.264 s/w decoder                       YES (asm)
        AMR WB encoder                          YES (asm)
        Skia                                    YES (asm)
        WebRTC                                  YES (asm - FFT etc)
        Renderscript                            YES (via LLVM backend)
        Blink (Chromium browser)                YES (intrinsics)
  • 9. NEON optimizations in opensource
      - Google WebM: 17,000 lines NEON code, both VP9 and VP8
      - Bluez: official Linux Bluetooth protocol stack
      - Pixman (part of cairo 2D graphics library)
      - ffmpeg (libav): libavcodec, LGPL media player
      - X264: GPL H.264 encoder, can be used for video conferencing
      - Eigen2: C++ vector math / linear algebra template library
      - Theorarm: libtheora NEON version (optimized by Google)
      - Android libjpeg / libjpeg-turbo: optimized JPEG decode
      - libpng: NEON optimized PNG decode
      - FFTW: NEON enabled FFT library
      - Liboil / liborc: runtime compiler for SIMD processing
      - webkit: used by Google Chrome browser
      - Ne10: library of low-level optimized math routines
      - Cocos2d-x: 2D game engine, uses Ne10
      - Skia: 2D graphics library
      - Android Renderscript
  • 10. Android Native Development Kit (NDK) for ARM
      - NDK is a toolkit to enable application developers to write native applications for the ARM processor
      - NEON fully supported since NDK r5
      - 64-bit support released in NDK r10 for “L”
      - Android™ applications can be written in Java, native ARM code, or a combination of the two

        Android ABI     NEON support?
        armeabi         No
        armeabi-v7a     Optional - check cpu flags for NEON and ARMv8 crypto
        arm64-v8a       Yes: NEON always present
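
The "check cpu flags" row for armeabi-v7a could be handled at run time along the following lines. This is a minimal sketch, not taken from the slides: it assumes a Linux or Android target where `getauxval` and the kernel `HWCAP_NEON` bit are available, and the helper name `have_neon` is hypothetical. The NDK also ships its own cpufeatures helper library (`android_getCpuFeatures()` with `ANDROID_CPU_ARM_FEATURE_NEON`), which may be the more conventional choice inside an NDK project.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__arm__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_NEON on 32-bit ARM */
#endif

/* Hypothetical helper: true if NEON code paths may be taken at run time. */
static bool have_neon(void)
{
#if defined(__aarch64__)
    /* arm64-v8a: NEON (Advanced SIMD) is always present. */
    return true;
#elif defined(__linux__) && defined(__arm__)
    /* armeabi-v7a: NEON is optional, so ask the kernel. */
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    /* Any other target (assumption): no NEON path. */
    return false;
#endif
}
```

A caller would typically evaluate `have_neon()` once at start-up and select between a NEON routine and a plain C fallback through a function pointer.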
  • 11. Using vectorizing compiler - gcc
      - Source (test.c):

            int a[256], b[256], c[256];

            void foo(void)
            {
                int i;
                for (i = 0; i < 256; i++) {
                    a[i] = b[i] + c[i];
                }
            }

      - AArch32: gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp test.c

            .L2:
                vldmia   r1!, {d18-d19}
                vldmia   r2!, {d16-d17}
                vadd.i32 q8, q9, q8
                vstmia   r3!, {d16-d17}
                cmp      r3, r0
                bne      .L2

      - AArch64: gcc -S -O3 test.c

            .L2:
                ldr  q0, [x0],16
                ldr  q1, [x2],16
                cmp  x0, x3
                add  v0.4s, v0.4s, v1.4s
                str  q0, [x1],16
                bne  .L2

      - gcc -ftree-vectorize is default at -O3
      - Example built with linaro-gcc-4.9-2014.05 aarch64 release
  • 12. What is vectorizing?
      - Adding 16x 32-bit values (scalar)
      - Using NEON with Q-registers
      - [Diagram: on AArch32 the sum is built from VADD steps followed by VPADD pairwise adds; on AArch64 it is built from ADD steps followed by a single ADDV, marked “New”]
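
The reduction pictured on this slide might be written with intrinsics roughly as follows. This is a sketch under assumptions, not the slide's own code: the function name `sum16_s32` is hypothetical, and a scalar fallback is included so the same file also builds on targets without NEON. The NEON paths wrap on overflow, whereas the scalar fallback relies on the sum not overflowing.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum 16 signed 32-bit values. */
static int32_t sum16_s32(const int32_t *p)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four Q-register loads cover all 16 values. */
    int32x4_t v0 = vld1q_s32(p);
    int32x4_t v1 = vld1q_s32(p + 4);
    int32x4_t v2 = vld1q_s32(p + 8);
    int32x4_t v3 = vld1q_s32(p + 12);

    /* Lane-wise adds leave four partial sums in one register. */
    int32x4_t acc = vaddq_s32(vaddq_s32(v0, v1), vaddq_s32(v2, v3));

#if defined(__aarch64__)
    /* AArch64: one across-lanes reduction (ADDV). */
    return vaddvq_s32(acc);
#else
    /* AArch32: add the two halves, then one pairwise add (VPADD). */
    int32x2_t half = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
    half = vpadd_s32(half, half);
    return vget_lane_s32(half, 0);
#endif
#else
    /* Scalar fallback for targets without NEON. */
    int32_t sum = 0;
    for (int i = 0; i < 16; i++) {
        sum += p[i];
    }
    return sum;
#endif
}
```

The only source-level difference between the two NEON paths is the final reduction, which is where the new AArch64 instruction applies.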
  • 13. Intrinsics
      - Include intrinsics header file (ACLE standard)

            #include <arm_neon.h>

      - Use special NEON data types which correspond to D and Q registers, e.g.

            int8x8_t     D-register    8x 8-bit values
            int16x4_t    D-register    4x 16-bit values
            int32x4_t    Q-register    4x 32-bit values

      - Use NEON intrinsics versions of instructions

            vin1 = vld1q_s32(ptr);
            vout = vaddq_s32(vin1, vin2);
            vst1q_s32(ptr, vout);   /* pointer first, then the vector to store */

      - Strongly typed!
          - Use vreinterpret_s16_s32( ) to change the type
      - Fully compatible with AArch64
      - Example excerpt shown on the slide (the function is cut off, and the last two comments are truncated in the source):

            static inline void Filter_32_opaque_neon(unsigned x, unsigned y,
                                                     SkPMColor a00, SkPMColor a01,
                                                     SkPMColor a10, SkPMColor a11,
                                                     SkPMColor *dst) {
                uint8x8_t vy, vconst16_8, v16_y, vres;
                uint16x4_t vx, vconst16_16, v16_x, tmp;
                uint32x2_t va0, va1;
                uint16x8_t tmp1, tmp2;

                vy = vdup_n_u8(y);                 // duplicate y into vy
                vconst16_8 = vmov_n_u8(16);        // set up constant in vconst16_8
                v16_y = vsub_u8(vconst16_8, vy);   // v16_y = 16-y

                va0 = vdup_n_u32(a00);             // duplicate a00
                va1 = vdup_n_u32(a10);             // duplicate a10
                va0 = vset_lane_u32(a01, va0, 1);  // set top to a01
                va1 = vset_lane_u32(a11, va1, 1);  // set top to a11

                tmp1 = vmull_u8(vreinterpret_u8_u32(va0), v16_y); // tmp1 = [
                tmp2 = vmull_u8(vreinterpret_u8_u32(va1), vy);    // tmp2 = [
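
To show the load / add / store intrinsics from this slide in a complete function, here is a minimal sketch. The function name `add_s32` and the scalar tail loop are illustrative assumptions rather than slide content; the intrinsics themselves are the ones named on the slide, and the same source is intended to compile unchanged for ARMv7 NEON and AArch64.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
static void add_s32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four 32-bit lanes per iteration in Q registers. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);
        int32x4_t vb = vld1q_s32(b + i);
        vst1q_s32(dst + i, vaddq_s32(va, vb));
    }
#endif

    /* Scalar loop: handles the tail, or everything when NEON is absent. */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Because the types are strict, passing an `int16x8_t` to `vaddq_s32` would be rejected at compile time, which is the type checking the slide refers to.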
  • 14. NEON intrinsics
      - Pros
          - Readability
          - Reusability (inline functions, templates)
          - Type checking (vreinterpret)
          - Easier to debug
          - Portability to AArch64
          - Compiler can combine instructions (e.g. MAC)
          - Compiler does register allocation
          - Compiler does instruction scheduling
      - Cons
          - Little control over registers used
          - Does not always generate the code you expect
  • 15. Compatibility
      - C/intrinsics will port with no effort
      - Asm requires reworking of .s file (mostly cosmetic, but can take advantage of additional registers)
      - AArch64 NEON optimization in progress
          - ARM & Linaro working on key Android libraries using intrinsics
          - ffmpeg AArch64 NEON decoders (asm)
          - X264 AArch64 NEON encoder (asm)

        NEON coding technique           AArch64 compatible?
        Vectorized “C”                  Fully compatible
        Intrinsics (“arm_neon.h”)       Fully compatible
        Asm (.s)                        Some porting required
        Library routines                Yes, if library available
  • 16. Fastest JPEG codecs: Android & libjpeg-turbo
      - NEON optimizations integrated into
          - Official Android 4.4 (Kitkat) and later (plus partner-specific versions)
          - Libjpeg-turbo (opensource)
      - Significantly improves speed of multi-megapixel image decode
      - Benchmarked on 1.7GHz ARM Cortex®-A15
          - Optimized 34.4 cycles/pix => 0.4s total for image
      - Test image: 19.4Mpix (5944 x 3268)
        http://commons.wikimedia.org/wiki/File:Willaerts_Adam_The_Embarkation_of_the_Elector_Palantine_Oil_Canvas-huge.jpg
      - [Chart: JPEG decode, cycles/pixel, lower is better; NEON 3.0x faster]

        Configuration                               Cycles/pixel
        Unoptimized Android 4.4 'djpeg -fast'       103.9
        Optimized Android 4.4 'djpeg -fast'         50.5
        Optimized libjpeg-turbo-1.3.0 'djpeg'       34.4
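
The "0.4s total" and "3.0x faster" figures follow from the numbers on the slide; the arithmetic can be checked with a small sketch such as the one below. The helper names are hypothetical, and reading "5944 3268" as the image width and height is an assumption (their product, 19,424,992 pixels, matches the stated 19.4Mpix).

```c
/* Hypothetical helper: wall-clock decode time in seconds. */
static double decode_seconds(double cycles_per_pixel, double pixels, double clock_hz)
{
    return cycles_per_pixel * pixels / clock_hz;
}

/* Hypothetical helper: speedup of an optimized build over a baseline,
 * both given in cycles per pixel (lower is better). */
static double speedup(double baseline_cpp, double optimized_cpp)
{
    return baseline_cpp / optimized_cpp;
}
```

With the slide's values, `decode_seconds(34.4, 5944.0 * 3268.0, 1.7e9)` comes to about 0.39 s, and `speedup(103.9, 34.4)` to about 3.0.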
  • 17. X264 - high quality H.264 encoding
      - Can be used for highest-quality offline encoding: movies & tv content for on-demand services
      - Full ARMv7 NEON optimizations
          - 5300 lines of NEON asm
      - New: AArch64 NEON (Aug 2014)
      - Performance results from Cortex-A15 processor @1.7GHz
      - [Chart: “No asm” vs “NEON optimized”, axis from 0% to 450%; data labels in slide order]

        Test case           Data label
        park_joy main       350%
        park_joy high       315%
        sintel main         382%
        sintel high         370%

      - Files (media.xiph.org)
          - park_joy_420_720p50.y4m, 1280x720
          - sintel_trailer_2k_480p24.y4m, 854x480
  • 18. Wide range of NEON enabled low-cost dev boards
      - Odroid XU3, $179
          - 4x 2.0GHz Cortex-A15 ‘Octa’ b.L
      - Cubieboard4 CC-A80
          - 4x 2.0GHz Cortex-A15 + 4x 1.3GHz Cortex-A7
      - Cubietruck, $65
          - 4xCortex-A7
      - Chromebook2
          - 4xCortex-A15 ‘Octa’ b.L
      - 64-bit: ARM “Juno”
      - 64-bit: Nexus 9
  • 19. NEON summary
      - NEON in AArch64 is much improved
          - More registers
          - New instructions
          - Cleaner instruction set
      - Migrating to 64-bit
          - Use C or NEON intrinsics for best portability
          - Asm best in special circumstances, e.g. video codecs
          - Normally straightforward to port ARMv7 NEON to AArch64 NEON
          - NDK r10 provides full support: start testing apps now!
      - Existing NEON documentation still very relevant
          - NEON Programmer’s Guide
          - Blog entries
          - http://www.arm.com/community/
      - Tune for AArch64
          - Extra registers
          - Double precision float
          - New instructions
  • 20. Thank You
      - The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. Any other marks featured may be trademarks of their respective owners.