© 2017 Arm Limited
SFO17-314 Optimizing Golang for High Performance with ARM64 Assembly
Wei Xiao
Staff Software Engineer
Wei.Xiao@arm.com
September 27, 2017
Linaro Connect SFO17
Agenda
• Introduction
• Differences from GNU Assembly
• Integrate assembly into Golang
• Optimize CRC32 for arm64
• Optimize SHA256 for arm64
• Optimize IndexByte for arm64
• Work Summary and Next steps
Introduction
• Assembly optimization benefits
• Take advantage of ARMv8 capabilities
– Hardware-specific instructions (such as SVC, AES, SHA, etc.)
– Vector (Single Instruction Multiple Data) Instructions
• Others
– No need for CGo dependency
– Avoid runtime context switching overhead
– Optimized code (vs Go compiler)
– Faster compilation
Assembly Optimization Current Status
• Go Standard packages with assembly optimization
crypto/aes crypto/elliptic crypto/internal/cipherhw crypto/md5
crypto/rc4 crypto/sha1 crypto/sha256 crypto/sha512
hash/crc32 math math/big reflect
runtime runtime/cgo runtime/internal/atomic runtime/internal/sys
strings sync/atomic syscall ……
red – arm64 optimization ongoing
black – no arm64 optimization
Assembly Terminology
• Mnemonic
• CALL, MOVW, MOVD, …
• Register
• R1, F0, V3, …
• Immediate
• $1, $0x100, …
• Memory
• (R1), 8(R3), …
Registers in AArch64
Instruction Differences from GNU Assembly
• Semi-abstract instruction set (Plan 9 from Bell Labs)
• Architecture independent mnemonics like MOVD
• Some architecture aspects shine through
• Assembler may insert prologues, remove ‘unreachable’ instructions
• Instructions may be expanded by the assembler
• Not all instructions available
• BYTE/WORD/LONG directives to lay down opcodes into instruction stream directly
    // func Add(a, b int) int
    TEXT ·Add(SB),$0-24
        MOVD a+0(FP), R0      // first argument
        MOVD b+8(FP), R1      // second argument
        ADD  R1, R0, R0       // R0 = R0 + R1
        MOVD R0, ret+16(FP)   // store the result
        RET
Operand Differences from GNU Assembly
• Data flow from left to right
• ADD R1, R2 → R2 += R1
• SUBW R12<<29, R7, R8 → R8 = R7 - (R12<<29)
• Memory operands: base + offset
• MOVH (R1), R2 → R2 = *R1
• MOVBU 8(R3), R4 → R4 = *(8 + R3)
• MOVD mypackage·myvar(SB), R8 → R8 = myvar
• Addresses
• MOVD $8(R1), R3 → R3 = R1 + 8
• MOVD $·myvar(SB), R4 → R4 = &myvar
• Go declaration referenced above: package mypackage; var myvar int64
• The · separator is the Unicode middle dot (U+00B7)
Go Assembly Extension for arm64
• Extended register, e.g.: ADD Rm.<ext>[<<amount], Rn, Rd
• Arrangement for SIMD instructions, e.g.: VADDP Vm.<T>, Vn.<T>, Vd.<T>
• Width specifier and element index for SIMD instructions, e.g.: VMOV Vn.<T>[index], Rd
• Register List, e.g.: VLD1 (Rn), [Vt1.<T>, Vt2.<T>, Vt3.<T>]
• Register offset variant, e.g.: VLD1.P (Rn)(Rm), [Vt1.<T>, Vt2.<T>]
• Go assembly for ARM64 reference manual: src/cmd/internal/obj/arm64/doc.go
• Full details
• https://go-review.googlesource.com/c/go/+/41654
Assembly Build Rule
• The toolchain selects the appropriate assembly files according to GOOS and GOARCH
• Selection is based on file name suffixes, e.g.
• sys_linux_arm64.s
• sys_darwin_arm64.s
• Example: assembly files for hash/crc32
• crc32_amd64p32.s
• crc32_amd64.s
• crc32_arm64.s
• crc32_ppc64le.s
• crc32_table_ppc64le.s
• crc32_s390x.s
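The suffix rule on this slide can be sketched in Go. This is a simplified illustration and not the toolchain's actual implementation: `selectedFor` and the two lookup tables are hypothetical names, and the tables cover only the targets mentioned on this slide.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, deliberately incomplete tables: only the targets that
// appear on the slide. The real toolchain knows many more.
var knownOS = map[string]bool{"linux": true, "darwin": true}
var knownArch = map[string]bool{
	"amd64": true, "amd64p32": true, "arm64": true, "ppc64le": true, "s390x": true,
}

// selectedFor reports whether an assembly file would be picked for the given
// target under the name_GOOS, name_GOARCH and name_GOOS_GOARCH suffix rule.
func selectedFor(name, goos, goarch string) bool {
	parts := strings.Split(strings.TrimSuffix(name, ".s"), "_")
	n := len(parts)
	switch {
	case n >= 3 && knownOS[parts[n-2]] && knownArch[parts[n-1]]:
		return parts[n-2] == goos && parts[n-1] == goarch
	case n >= 2 && knownArch[parts[n-1]]:
		return parts[n-1] == goarch
	case n >= 2 && knownOS[parts[n-1]]:
		return parts[n-1] == goos
	}
	return true // no recognized suffix: the file is built for every target
}

func main() {
	files := []string{
		"crc32_amd64p32.s", "crc32_amd64.s", "crc32_arm64.s",
		"crc32_ppc64le.s", "crc32_table_ppc64le.s", "crc32_s390x.s",
	}
	for _, f := range files {
		if selectedFor(f, "linux", "arm64") {
			fmt.Println("selected:", f)
		}
	}
}
```

Under these assumptions, only crc32_arm64.s is selected when building hash/crc32 for linux/arm64.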
Prototype
• Function call is the bridge between Go and assembly
• Function declaration
• src/runtime/timestub.go
• func walltime() (sec int64, nsec int32)
• Function assembly implementation
• runtime/sys_linux_arm64.s
• TEXT directive fields annotated on the slide, in order: package (optional), middle dot, function name, flag (optional), stack frame size, arguments size (optional)
Pseudo-registers
• FP: Frame Pointer
• Points to the bottom of the argument list
• Offsets are positive
• Offsets must include a name, e.g. arg+0(FP)
• SP: Stack Pointer
• Points to the top of the space allocated for local variables
• Offsets are negative
• Offsets must include a name, e.g. ptr-8(SP)
• SB: Static Base
• Named offsets from a global base
[Figure: two stack diagrams, one for FP offsets and one for SP offsets, each drawn from low address to high address]
Calling Convention
• All arguments are passed on the stack
• Offsets from FP
• Return arguments follow input arguments
• Start of return arguments aligned to pointer size
• All registers are caller saved, except:
• Stack pointer register (RSP)
• G context pointer register (R28)
• Frame pointer (R29)
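The offset rules above can be worked through with a small Go sketch. It assumes the stack-based convention described on this slide with 8-byte pointers; `field`, `alignUp` and `layout` are hypothetical helper names, not part of any Go tool.

```go
package main

import "fmt"

// field describes one argument or result: its name, size and alignment in bytes.
type field struct {
	name        string
	size, align int
}

const ptrSize = 8 // pointer size on arm64

// alignUp rounds n up to the next multiple of a.
func alignUp(n, a int) int { return (n + a - 1) / a * a }

// layout assigns an FP offset to every argument and result and returns the
// total argument size (the number after the dash in "$frame-args").
func layout(args, results []field) (map[string]int, int) {
	offsets := make(map[string]int)
	off := 0
	for _, f := range args {
		off = alignUp(off, f.align)
		offsets[f.name] = off
		off += f.size
	}
	if len(results) > 0 {
		off = alignUp(off, ptrSize) // results start at a pointer-aligned offset
	}
	for _, f := range results {
		off = alignUp(off, f.align)
		offsets[f.name] = off
		off += f.size
	}
	return offsets, off
}

func main() {
	// func castagnoliUpdate(crc uint32, p []byte) uint32
	// A slice occupies three words: base pointer, length, capacity.
	args := []field{{"crc", 4, 4}, {"p_base", 8, 8}, {"p_len", 8, 8}, {"p_cap", 8, 8}}
	results := []field{{"ret", 4, 4}}
	offsets, size := layout(args, results)
	for _, f := range append(args, results...) {
		fmt.Printf("%s+%d(FP)\n", f.name, offsets[f.name])
	}
	fmt.Println("argument size:", size)
}
```

For castagnoliUpdate this yields ret+32(FP) and an argument size of 36, which agrees with the `$0-36` annotation in the CRC32 assembly shown later in the deck.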
arm64 Stack Frame
[Figure: stack frame layouts without and with a frame pointer, drawn from low address to high address]
Optimize CRC32 for arm64 – Before
• Pure Go table-driven implementation
src/hash/crc32/crc32_generic.go
    func simpleUpdate(crc uint32, tab *Table, p []byte) uint32 {
        crc = ^crc
        for _, v := range p {
            crc = tab[byte(crc)^v] ^ (crc >> 8)
        }
        return ^crc
    }
Optimize CRC32 for arm64 – After
• Assembly for arm64
src/hash/crc32/crc32_arm64.s
    // func castagnoliUpdate(crc uint32, p []byte) uint32
    TEXT ·castagnoliUpdate(SB),NOSPLIT,$0-36
        MOVWU crc+0(FP), R9       // CRC value
        MOVD p+8(FP), R13         // data pointer
        MOVD p_len+16(FP), R11    // len(p)

        CMP $8, R11
        BLT less_than_8

    update:
        MOVD.P 8(R13), R10
        CRC32CX R10, R9
        SUB $8, R11

        CMP $8, R11
        BLT less_than_8

        JMP update
    …
    done:
        MOVWU R9, ret+32(FP)
        RET

Argument frame layout (offsets from FP):
    crc      0(FP)
    p.base   8(FP)
    p.len   16(FP)
    p.cap   24(FP)
    ret     32(FP)
Optimize CRC32 for arm64 – Result
• Optimization with assembly
• 2X-7X speedup
Optimize SHA256 for arm64
• SHA256 introduction
block rounds K Hash
SHA-256 512bits 64 32bits 32bits 256bits
Optimize SHA256 for arm64 – Message schedule
src/crypto/sha256/sha256block.go
    for i := 0; i < 16; i++ {
        j := i * 4
        w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
    }
    for i := 16; i < 64; i++ {
        v1 := w[i-2]
        t1 := (v1>>17 | v1<<(32-17)) ^ (v1>>19 | v1<<(32-19)) ^ (v1 >> 10)
        v2 := w[i-15]
        t2 := (v2>>7 | v2<<(32-7)) ^ (v2>>18 | v2<<(32-18)) ^ (v2 >> 3)
        w[i] = t1 + w[i-7] + t2 + w[i-16]
    }
arm64 equivalent of the second loop (pseudocode, four words per iteration):
    for i := 16; i < 64; i+=4 {
        SHA256SU0 Vn.S4, Vd.S4
        SHA256SU1 Vm.S4, Vn.S4, Vd.S4
    }
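A standalone version of the message schedule, written with `math/bits` rotations instead of paired shifts, may make the recurrence easier to follow. The function name `schedule` is hypothetical; the input in `main` is the padded block for an empty message (a single 0x80 byte followed by zeros).

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// schedule expands one 64-byte block into the 64-word message schedule.
// RotateLeft32 with a negative count rotates right.
func schedule(p []byte) [64]uint32 {
	var w [64]uint32
	for i := 0; i < 16; i++ {
		w[i] = binary.BigEndian.Uint32(p[i*4:])
	}
	for i := 16; i < 64; i++ {
		v1 := w[i-2]
		s1 := bits.RotateLeft32(v1, -17) ^ bits.RotateLeft32(v1, -19) ^ (v1 >> 10)
		v2 := w[i-15]
		s0 := bits.RotateLeft32(v2, -7) ^ bits.RotateLeft32(v2, -18) ^ (v2 >> 3)
		w[i] = s1 + w[i-7] + s0 + w[i-16]
	}
	return w
}

func main() {
	block := make([]byte, 64)
	block[0] = 0x80 // padding byte of the empty message
	w := schedule(block)
	fmt.Printf("w[16]=%08x w[17]=%08x w[18]=%08x\n", w[16], w[17], w[18])
}
```

Each new word depends on w[i-2], so a scalar loop produces one word at a time; the SHA256SU0/SHA256SU1 pair on the slide produces four words per iteration.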
Optimize SHA256 for arm64 – Hash Computation
src/crypto/sha256/sha256block.go
    for i := 0; i < 64; i++ {
        t1 := h + ((e>>6 | e<<(32-6)) ^ (e>>11 | e<<(32-11)) ^ (e>>25 | e<<(32-25))) + ((e & f) ^ (^e & g)) + _K[i] + w[i]

        t2 := ((a>>2 | a<<(32-2)) ^ (a>>13 | a<<(32-13)) ^ (a>>22 | a<<(32-22))) + ((a & b) ^ (a & c) ^ (b & c))

        h = g
        g = f
        f = e
        e = d + t1
        d = c
        c = b
        b = a
        a = t1 + t2
    }
arm64 equivalent (pseudocode, four rounds per iteration):
    for i := 0; i < 64; i+=4 {
        SHA256H Vm, Vn, Vd.4S
        SHA256H2 Vm, Vn, Vd.4S
    }
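To isolate what a single iteration of the loop does, here is one round as a pure function on the eight working variables. `round` and `rotr` are hypothetical names; the constant and schedule word are passed in, so no `_K` table is needed for the sketch.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rotr rotates x right by n bits.
func rotr(x uint32, n int) uint32 { return bits.RotateLeft32(x, -n) }

// round applies one SHA-256 round to the state (a..h), using round
// constant k and message schedule word w, and returns the new state.
func round(s [8]uint32, k, w uint32) [8]uint32 {
	a, b, c, d, e, f, g, h := s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7]
	t1 := h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (^e & g)) + k + w
	t2 := (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))
	// Six of the eight values only shift position; a and e are recomputed.
	return [8]uint32{t1 + t2, a, b, c, d + t1, e, f, g}
}

func main() {
	state := [8]uint32{1, 2, 3, 4, 5, 6, 7, 8}
	fmt.Println(round(state, 0, 0))
}
```

Only two of the eight words are recomputed per round while the rest shift by one position, which is what lets the SHA256H and SHA256H2 instructions on the slide cover four rounds per loop iteration.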
Optimize SHA256 for arm64 – Implementation
src/crypto/sha256/sha256block_arm64.s
Optimize SHA256 for arm64 – Result
• Optimization with assembly
• 2X-16X speedup
Optimize IndexByte for arm64 – Before
[Figure: input bytes "H E L L O W O R L D …" with register annotations R0, R1 and R2 = 'D']
src/runtime/asm_arm64.s
Optimize IndexByte for arm64 – After
• Assembly implementation with SIMD
• SIMD instruction: CMEQ Vm.B16, Vn.B16, Vd.B16
Compare 16 bytes in parallel
More details:
• Input slice shorter than 16
• Input slice address not 16-byte aligned
• Input slice size not 16-byte aligned
• Count trailing zeros (not leading zeros)
• Implementation:
• https://go-review.googlesource.com/c/go/+/41654
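The approach listed above can be emulated in portable Go to show why trailing zeros are counted. This is a sketch under assumptions, not the code from the linked change: `indexByte` is a hypothetical name, the inner loop stands in for a single CMEQ over 16 bytes, and alignment handling is omitted.

```go
package main

import (
	"fmt"
	"math/bits"
)

// indexByte returns the index of the first c in s, or -1 if absent.
// It scans 16 bytes per step and records matches in a 16-bit mask,
// emulating the result of one 16-lane byte comparison.
func indexByte(s []byte, c byte) int {
	i := 0
	for ; len(s)-i >= 16; i += 16 {
		var mask uint16
		for j, b := range s[i : i+16] {
			if b == c {
				mask |= 1 << uint(j) // bit j is set when lane j matches
			}
		}
		if mask != 0 {
			// The lowest set bit is the earliest match, so count
			// trailing zeros rather than leading zeros.
			return i + bits.TrailingZeros16(mask)
		}
	}
	// Tail: fewer than 16 bytes remain, fall back to a byte loop.
	for ; i < len(s); i++ {
		if s[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(indexByte([]byte("HELLO WORLD"), 'D'))
}
```

With lane 0 mapped to bit 0, several matches in one 16-byte chunk still resolve to the first one, which is the reason the slide calls out trailing rather than leading zeros.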
Optimize IndexByte for arm64 – Result
• Optimization with SIMD
• 1.5X-8X speedup
Work Summary
Disassembler (arm64):
• https://go-review.googlesource.com/c/arch/+/43651
• https://go-review.googlesource.com/c/arch/+/56810
• https://go-review.googlesource.com/c/go/+/58930
• https://go-review.googlesource.com/c/go/+/56331
• https://go-review.googlesource.com/c/go/+/49530
Assembler (arm64):
• https://go-review.googlesource.com/c/go/+/33594
• https://go-review.googlesource.com/c/go/+/33595
• https://go-review.googlesource.com/c/go/+/41511
• https://go-review.googlesource.com/c/go/+/41654
• https://go-review.googlesource.com/c/go/+/45850
• https://go-review.googlesource.com/c/go/+/54951
• https://go-review.googlesource.com/c/go/+/54990
• https://go-review.googlesource.com/c/go/+/57852
• https://go-review.googlesource.com/c/go/+/58350
• https://go-review.googlesource.com/c/go/+/56030
• https://go-review.googlesource.com/c/go/+/46438
• https://go-review.googlesource.com/c/go/+/41653
Optimizations:
• https://go-review.googlesource.com/c/go/+/40074
• https://go-review.googlesource.com/c/go/+/61550
• https://go-review.googlesource.com/c/go/+/61570
• https://go-review.googlesource.com/c/go/+/33597
• https://go-review.googlesource.com/c/go/+/64490
• https://go-review.googlesource.com/c/go/+/55610
Others:
• https://go-review.googlesource.com/c/go/+/61511
• https://go-review.googlesource.com/c/go/+/62850
• https://go-review.googlesource.com/c/go/+/45112
• https://go-review.googlesource.com/c/go/+/44390
• https://go-review.googlesource.com/c/go/+/42971
• https://go-review.googlesource.com/c/go/+/40511
• https://go-review.googlesource.com/c/arch/+/37172
Next Steps
• Crypto optimizations:
• aes, elliptic, …
• SIMD optimizations:
• strings, bytes, runtime, reflect, …
• Compiler SSA arm64 back-end optimizations
• Others
• Internal arm64 linker
• Tools for arm64: race detector, memory sanitizer, …
• New architecture features
• ...
Thank You!
Danke!
Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!
CGo
[Diagram: CGo bridges the Go ABI and the C ABI]
    package print

    // #include <stdio.h>
    // #include <stdlib.h>
    import "C"
    import "unsafe"

    func Print(s string) {
        cs := C.CString(s)
        C.fputs(cs, (*C.FILE)(C.stdout))
        C.free(unsafe.Pointer(cs))
    }
Branch Differences from GNU Assembly
• On arm64: B is an alias for JMP, BL is an alias for CALL
• Jump to labels
        JMP L1
        NOP
    L1:
        NOP
    L2: NOP
        NOP
        B L2
• Call and indirect jump
        BL $p·foo
        MOV $p·foo, R3
        CALL (R3)
        B (R3)
        MOV 0(R26), R4
        JMP (R4)
• Jump relative to PC (useful in macros!)
        JMP 2(PC)
        NOP
        NOP
        NOP
        NOP
        JMP -2(PC)

Mais conteúdo relacionado

Mais de Linaro

Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Linaro
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteLinaro
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopLinaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allLinaro
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorLinaro
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMULinaro
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MLinaro
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation Linaro
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootLinaro
 
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...Linaro
 
HKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready ProgramHKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready ProgramLinaro
 
HKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NNHKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NNLinaro
 
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...Linaro
 
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...Linaro
 
HKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: IntroductionHKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: IntroductionLinaro
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersLinaro
 
HKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightHKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightLinaro
 

Mais de Linaro (20)

Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening Keynote
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8M
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
 
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
 
HKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready ProgramHKG18-317 - Arm Server Ready Program
HKG18-317 - Arm Server Ready Program
 
HKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NNHKG18-312 - CMSIS-NN
HKG18-312 - CMSIS-NN
 
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
 
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
HKG18-300K2 - Keynote: Tomas Evensen - All Programmable SoCs? – Platforms to ...
 
HKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: IntroductionHKG18-212 - Trusted Firmware M: Introduction
HKG18-212 - Trusted Firmware M: Introduction
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 Servers
 
HKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightHKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with Coresight
 

Último

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 

Último (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 

Optimizing GoLang for High Performance with ARM64 Assembly - SFO17-314

  • 1. © 2017 Arm Limited SFO17-314 Optimizing Golang for High Performance with ARM64 AssemblyWei Xiao Staff Software Engineer Wei.Xiao@arm.com September 27, 2017 Linaro Connect SFO17
  • 2. © 2017 Arm Limited2 Agenda • Introduction • Differences from GNU Assembly • Integrate assembly into Golang • Optimize CRC32 for arm64 • Optimize SHA256 for arm64 • Optimize IndexByte for arm64 • Work Summary and Next steps
  • 3. © 2017 Arm Limited3 Introduction • Assembly optimization benefits • Take advantages of ARMv8 capabilities – Hardware specific instructions (such as SVC, AES, SHA and etc.) – Vector (Single Instruction Multiple Data) Instructions • Others – No need for CGo dependency – Avoid runtime context switching overhead – Optimized code (vs Go compiler) – Faster compilation
  • 4. © 2017 Arm Limited4 Assembly Optimization Current Status • Go Standard packages with assembly optimization crypto/aes crypto/elliptic crypto/internal/cipherhw crypto/md5 crypto/rc4 crypto/sha1 crypto/sha256 crypto/sha512 hash/crc32 math math/big reflect runtime runtime/cgo runtime/internal/atomicruntime/internal/sys strings sync/atomic syscall …… red – arm64 optimization ongoing black – no arm64 optimization
  • 5. © 2017 Arm Limited5 Assembly Terminology • Mnemonic • CALL, MOVW, MOVD, … • Register • R1, F0, V3, … • Immediate • $1, $0x100, … • Memory • (R1), 8(R3), … Registers in AArch64
  • 6. © 2017 Arm Limited6 Instruction Differences from GNU Assembly • Semi-abstract instruction set (Plan 9 from Bell Labs) • Architecture independent mnemonics like MOVD • Some architecture aspects shine through • Assembler may insert prologues, remove ‘unreachable’ instructions • Instructions may be expanded by the assembler • Not all instructions available • BYTE/WORD/LONG directives to lay down opcodes into instruction stream directly 1 // func Add(a, b int) int 2 TEXT ·Add(SB),$0-24 3 MOVD arg1+0(FP), R0 4 MOVD arg2+8(FP), R1 5 ADD R1, R0, R0 6 MOVD R0, ret+16(FP) 7 RET
  • 7. © 2017 Arm Limited7 Operand Differences from GNU Assembly • Data flow from left to right • ADD R1, R2 → R2 += R1 • SUBW R12<<29, R7, R8 → R8 = R7 – (R12<<29) • Memory operands: base + offset • MOVH (R1), R2 → R2 = *R1 • MOVBU 8(R3), R4 → R4 = *(8 + R3) • MOVD mypackage·myvar(SB), R8 → R8 = *myvar • Addresses • MOVD $8(R1), R3 → R3 = R1 + 8 • MOVD $·myvar(SB), R4 → R4 = &myvar package mypackage var myvar int64 Unicode U+00B7
  • 8. © 2017 Arm Limited8 Go Assembly Extension for arm64 • Extended register, e.g.: ADD Rm.<ext>[<<amount], Rn, Rd • Arrangement for SIMD instructions, e.g.: VADDP Vm.<T>, Vn.<T>, Vd.<T> • Width specifier and element index for SIMD instructions, e.g.: VMOV Vn.<T>[index], Rd • Register List, e.g.: VLD1 (Rn), [Vt1.<T>, Vt2.<T>, Vt3.<T>] • Register offset variant, e.g.: VLD1.P (Rn)(Rm), [Vt1.<T>, Vt2.<T>] • Go assembly for ARM64 reference manual: src/cmd/internal/obj/arm64/doc.go • Full details • https://go-review.googlesource.com/c/go/+/41654
  • 9. Assembly Build Rule • The toolchain selects the appropriate assembly files according to GOOS and GOARCH • Selection uses file name suffixes, e.g. sys_linux_arm64.s, sys_darwin_arm64.s • Example: assembly files for hash/crc32: crc32_amd64p32.s, crc32_amd64.s, crc32_arm64.s, crc32_ppc64le.s, crc32_table_ppc64le.s, crc32_s390x.s
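The suffix rule on this slide can be sketched in Go. This is a simplified illustration, not the toolchain's actual logic: matchFile, knownOS and knownArch are hypothetical names, the OS and architecture sets are small subsets, and build tags inside files are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative subsets only; the real toolchain knows many more values.
var knownOS = map[string]bool{"linux": true, "darwin": true, "windows": true}
var knownArch = map[string]bool{"amd64": true, "amd64p32": true, "arm64": true, "ppc64le": true, "s390x": true}

// matchFile reports whether an assembly file name would be selected for the
// target goos/goarch, looking only at _GOOS, _GOARCH and _GOOS_GOARCH suffixes.
func matchFile(name, goos, goarch string) bool {
	parts := strings.Split(strings.TrimSuffix(name, ".s"), "_")
	n := len(parts)
	if n >= 3 && knownOS[parts[n-2]] && knownArch[parts[n-1]] {
		return parts[n-2] == goos && parts[n-1] == goarch
	}
	if n >= 2 && knownArch[parts[n-1]] {
		return parts[n-1] == goarch
	}
	if n >= 2 && knownOS[parts[n-1]] {
		return parts[n-1] == goos
	}
	return true // the name carries no OS or architecture constraint
}

func main() {
	files := []string{
		"crc32_amd64p32.s", "crc32_amd64.s", "crc32_arm64.s",
		"crc32_ppc64le.s", "crc32_table_ppc64le.s", "crc32_s390x.s",
	}
	for _, f := range files {
		if matchFile(f, "linux", "arm64") {
			fmt.Println("selected for linux/arm64:", f)
		}
	}
}
```

With the hash/crc32 file list from the slide, only crc32_arm64.s passes the filter for a linux/arm64 target.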
  • 10. Prototype • A function call is the bridge between Go and assembly • Function declaration in src/runtime/timestub.go: func walltime() (sec int64, nsec int32) • Function assembly implementation in runtime/sys_linux_arm64.s • (Figure: annotated TEXT directive; labelled parts: package (optional), middle dot, function name, flag (optional), stack frame size, arguments size (optional))
  • 11. Pseudo-registers • FP: Frame Pointer • Points to the bottom of the argument list • Offsets are positive • Offsets must include a name, e.g. arg+0(FP) • SP: Stack Pointer • Points to the top of the space allocated for local variables • Offsets are negative • Offsets must include a name, e.g. ptr-8(SP) • SB: Static Base • Named offsets from a global base • (Figure: two stack diagrams, each drawn from low address to high address)
  • 12. Calling Convention • All arguments are passed on the stack • Offsets from FP • Return arguments follow input arguments • Start of return arguments aligned to pointer size • All registers are caller saved, except: • Stack pointer register (RSP) • G context pointer register (R28) • Frame pointer (R29)
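To make the layout rules concrete, here is a small Go sketch that computes FP offsets for a signature under the rules on this slide: arguments are laid out in order at their natural alignment, and the return area starts at the next pointer-size boundary. The field type and the layout helper are hypothetical names introduced for illustration only.

```go
package main

import "fmt"

const ptrSize = 8 // pointer size on arm64

// field describes one argument or result slot (hypothetical helper type).
type field struct {
	name        string
	size, align int
}

// alignUp rounds off up to the next multiple of a.
func alignUp(off, a int) int { return (off + a - 1) / a * a }

// layout returns the FP offset of every field and the total argument size.
func layout(args, rets []field) (map[string]int, int) {
	offs := make(map[string]int)
	off := 0
	for _, f := range args {
		off = alignUp(off, f.align)
		offs[f.name] = off
		off += f.size
	}
	off = alignUp(off, ptrSize) // results start on a pointer-size boundary
	for _, f := range rets {
		off = alignUp(off, f.align)
		offs[f.name] = off
		off += f.size
	}
	return offs, off
}

func main() {
	// func castagnoliUpdate(crc uint32, p []byte) uint32
	args := []field{{"crc", 4, 4}, {"p_base", 8, 8}, {"p_len", 8, 8}, {"p_cap", 8, 8}}
	rets := []field{{"ret", 4, 4}}
	offs, size := layout(args, rets)
	fmt.Println(offs["crc"], offs["p_base"], offs["p_len"], offs["p_cap"], offs["ret"], size)
	// prints: 0 8 16 24 32 36
}
```

The total of 36 bytes corresponds to the argument size in a frame specification such as $0-36, and the same helper yields 24 for func Add(a, b int) int.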
  • 13. arm64 Stack Frame • (Figure: stack frame layouts without frame pointer and with frame pointer, drawn from low address to high address)
  • 14. Optimize CRC32 for arm64 – Before • Pure Go table-driven implementation: src/hash/crc32/crc32_generic.go
      func simpleUpdate(crc uint32, tab *Table, p []byte) uint32 {
          crc = ^crc
          for _, v := range p {
              crc = tab[byte(crc)^v] ^ (crc >> 8)
          }
          return ^crc
      }
  • 15. Optimize CRC32 for arm64 – After • Assembly for arm64: src/hash/crc32/crc32_arm64.s
      // func castagnoliUpdate(crc uint32, p []byte) uint32
      TEXT ·castagnoliUpdate(SB),NOSPLIT,$0-36
          MOVWU crc+0(FP), R9      // CRC value
          MOVD  p+8(FP), R13       // data pointer
          MOVD  p_len+16(FP), R11  // len(p)

          CMP $8, R11
          BLT less_than_8

      update:
          MOVD.P  8(R13), R10      // load 8 bytes, then advance the pointer by 8
          CRC32CX R10, R9          // fold 8 bytes into the CRC
          SUB     $8, R11

          CMP $8, R11
          BLT less_than_8

          JMP update
      …
      done:
          MOVWU R9, ret+32(FP)
          RET
    • Argument frame: crc at 0(FP), p.base at 8(FP), p.len at 16(FP), p.cap at 24(FP), ret at 32(FP)
  • 16. Optimize CRC32 for arm64 – Result • Optimization with assembly • 2X-7X speedup
  • 17. Optimize SHA256 for arm64 • SHA256 introduction • (Figure: SHA-256 processes each 512-bit block in 64 rounds, using 32-bit words and 32-bit round constants K, and produces a 256-bit hash)
  • 18. Optimize SHA256 for arm64 – Message schedule • Pure Go: src/crypto/sha256/sha256block.go
      for i := 0; i < 16; i++ {
          j := i * 4
          w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
      }
      for i := 16; i < 64; i++ {
          v1 := w[i-2]
          t1 := (v1>>17 | v1<<(32-17)) ^ (v1>>19 | v1<<(32-19)) ^ (v1 >> 10)
          v2 := w[i-15]
          t2 := (v2>>7 | v2<<(32-7)) ^ (v2>>18 | v2<<(32-18)) ^ (v2 >> 3)
          w[i] = t1 + w[i-7] + t2 + w[i-16]
      }
    • With arm64 instructions (pseudocode, four schedule words per iteration):
      for i := 16; i < 64; i+=4 {
          SHA256SU0 Vn.S4, Vd.S4
          SHA256SU1 Vm.S4, Vn.S4, Vd.S4
      }
  • 19. Optimize SHA256 for arm64 – Hash Computation • Pure Go: src/crypto/sha256/sha256block.go
      for i := 0; i < 64; i++ {
          t1 := h + ((e>>6 | e<<(32-6)) ^ (e>>11 | e<<(32-11)) ^ (e>>25 | e<<(32-25))) + ((e & f) ^ (^e & g)) + _K[i] + w[i]

          t2 := ((a>>2 | a<<(32-2)) ^ (a>>13 | a<<(32-13)) ^ (a>>22 | a<<(32-22))) + ((a & b) ^ (a & c) ^ (b & c))

          h = g
          g = f
          f = e
          e = d + t1
          d = c
          c = b
          b = a
          a = t1 + t2
      }
    • With arm64 instructions (pseudocode, four rounds per iteration):
      for i := 0; i < 64; i+=4 {
          SHA256H Vm, Vn, Vd.4S
          SHA256H2 Vm, Vn, Vd.4S
      }
  • 20. Optimize SHA256 for arm64 – Implementation src/crypto/sha256/sha256block_arm64.s
  • 21. Optimize SHA256 for arm64 – Result • Optimization with assembly • 2X-16X speedup
  • 22. Optimize IndexByte for arm64 – Before • src/runtime/asm_arm64.s • (Figure: the input “HELLO WORLD …” scanned for the byte ‘D’, with registers R0, R1 and R2 marked on the diagram)
  • 23. Optimize IndexByte for arm64 – After • Assembly implementation with SIMD • SIMD instruction: CMEQ Vm.B16, Vn.B16, Vd.B16 compares 16 bytes in parallel • More details: • Input slice shorter than 16 • Input slice address not 16-byte aligned • Input slice size not 16-byte aligned • Count trailing zeros (not leading zeros) • Implementation: https://go-review.googlesource.com/c/go/+/41654
  • 24. Optimize IndexByte for arm64 – Result • Optimization with SIMD • 1.5X-8X speedup
  • 25. Work Summary
    Disassembler (arm64): https://go-review.googlesource.com/c/arch/+/43651 https://go-review.googlesource.com/c/arch/+/56810 https://go-review.googlesource.com/c/go/+/58930 https://go-review.googlesource.com/c/go/+/56331 https://go-review.googlesource.com/c/go/+/49530
    Assembler (arm64): https://go-review.googlesource.com/c/go/+/33594 https://go-review.googlesource.com/c/go/+/33595 https://go-review.googlesource.com/c/go/+/41511 https://go-review.googlesource.com/c/go/+/41654 https://go-review.googlesource.com/c/go/+/45850 https://go-review.googlesource.com/c/go/+/54951 https://go-review.googlesource.com/c/go/+/54990 https://go-review.googlesource.com/c/go/+/57852 https://go-review.googlesource.com/c/go/+/58350 https://go-review.googlesource.com/c/go/+/56030 https://go-review.googlesource.com/c/go/+/46438 https://go-review.googlesource.com/c/go/+/41653
    Optimizations: https://go-review.googlesource.com/c/go/+/40074 https://go-review.googlesource.com/c/go/+/61550 https://go-review.googlesource.com/c/go/+/61570 https://go-review.googlesource.com/c/go/+/33597 https://go-review.googlesource.com/c/go/+/64490 https://go-review.googlesource.com/c/go/+/55610
    Others: https://go-review.googlesource.com/c/go/+/61511 https://go-review.googlesource.com/c/go/+/62850 https://go-review.googlesource.com/c/go/+/45112 https://go-review.googlesource.com/c/go/+/44390 https://go-review.googlesource.com/c/go/+/42971 https://go-review.googlesource.com/c/go/+/40511 https://go-review.googlesource.com/c/arch/+/37172
  • 26. Next Steps • Crypto optimizations: aes, elliptic, … • SIMD optimizations: strings, bytes, runtime, reflect, … • Compiler SSA arm64 back-end optimizations • Others • Internal arm64 linker • Tools for arm64: race detector, memory sanitizer, … • New architecture features • ...
  • 28. CGo • (Figure: CGo sits between the Go ABI and the C ABI)
      package print

      // #include <stdio.h>
      // #include <stdlib.h>
      import "C"
      import "unsafe"

      func Print(s string) {
          cs := C.CString(s)               // copy the Go string into C memory
          C.fputs(cs, (*C.FILE)(C.stdout))
          C.free(unsafe.Pointer(cs))       // release the C copy
      }
  • 29. Branch Difference from GNU Assembly • On arm64: B is an alias for JMP, BL is an alias for CALL
    • Jump to labels:
          JMP L1
          NOP
      L1: NOP
      L2: NOP
          NOP
          B L2
    • Call and indirect jump:
          BL $p.foo
          MOV $p·foo, R3
          CALL (R3)
          B (R3)
          MOV 0(R26), R4
          JMP (R4)
    • Jump relative to PC (useful in macros!):
          JMP 2(PC)
          NOP
          NOP
          NOP
          NOP
          JMP -2(PC)