Ekon24 from Delphi to AVX2

Arnaud Bouchez - Synopse
Rewrite for Performance
From Delphi to AVX2

Welcome to
a fun/wakeup session
about performance
hashes
and assembly mystery

Arnaud Bouchez
• Open Source Founder
mORMot
SynPDF
• Delphi and FPC expert
DDD, SOA, ORM, MVC
Performance, SOLID
• Synopse Consulting
https://synopse.info

Menu du jour
• The Hash-Table Mystery
• Branches Are Evil
• SIMD Assembly: SSE2
• SIMD Assembly: AVX2
• Conclusion

The Hash-Table Mystery
mORMot is Fast
and tries to be always faster
so works hard for it

One core component
is TDynArrayHasher
= a hasher for a dynamic array
<> a hashed list
(it does not own the data)
Used e.g. by the TDynArray wrapper,
the TSynDictionary class,
the in-memory ORM engine

How does a Hash-Table work?
bucketindex := hash(key) mod bucketscount
for O(1) retrieval instead of O(n) manual lookup

crc32c()
(hardware accelerated SSE4.2)

xxhash32()
(on non-Intel or old CPUs)
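As an illustration of the hashing step, here is a minimal C sketch of a software CRC-32C followed by the `hash(key) mod bucketscount` reduction. The names `crc32c_soft` and `bucket_index` are hypothetical, and this slow bitwise loop is not mORMot's implementation: it only computes the same Castagnoli polynomial that the SSE4.2 opcode accelerates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
   Reference code only: the SSE4.2 crc32 opcode does this in hardware. */
static uint32_t crc32c_soft(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            /* mask is all ones when the low bit is set, else zero */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* bucketindex := hash(key) mod bucketscount */
static size_t bucket_index(const void *key, size_t len, size_t bucket_count)
{
    assert(bucket_count > 0);
    return crc32c_soft(key, len) % bucket_count;
}
```

The standard check value for this CRC is 0xE3069283 for the ASCII string "123456789", which makes the routine easy to validate against a hardware version.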

mORMot prefers indexes for efficiency
(and doesn’t store the hashcode since crc32c is fast)

mORMot stores keys with values
within a (dynamic) array

mORMot can hash several keys
in the same (dynamic) array

It is easy to insert a new item
if we properly handle hash collisions
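A minimal C sketch of the idea above: the table stores only indexes into an array owned by the caller, and collisions are resolved by probing the next slot. Everything here is an assumption for illustration (the `IndexTable` type, the fixed `BUCKETS` size, the toy hash, and linear probing itself); the slides do not say which collision strategy TDynArrayHasher uses.

```c
#include <assert.h>
#include <stdint.h>

#define BUCKETS 8      /* hypothetical fixed size, must exceed the item count */
#define EMPTY   (-1)

/* Toy multiplicative hash for the sketch only (not crc32c). */
static uint32_t toy_hash(int32_t key) { return (uint32_t)key * 2654435761u; }

typedef struct {
    const int32_t *items;   /* data owned by the caller, not by the table */
    int32_t slot[BUCKETS];  /* each slot holds an index into items[] */
    int used;               /* number of occupied slots */
} IndexTable;

static void table_init(IndexTable *t, const int32_t *items)
{
    t->items = items;
    t->used = 0;
    for (int i = 0; i < BUCKETS; i++)
        t->slot[i] = EMPTY;
}

/* Returns the slot that refers to key, or the first empty slot on its path. */
static int find_slot(const IndexTable *t, int32_t key)
{
    int s = (int)(toy_hash(key) % BUCKETS);
    while (t->slot[s] != EMPTY && t->items[t->slot[s]] != key)
        s = (s + 1) % BUCKETS;          /* collision: try the next slot */
    return s;
}

/* Registers items[index] in the table. */
static void table_add(IndexTable *t, int32_t index)
{
    int s = find_slot(t, t->items[index]);
    if (t->slot[s] == EMPTY) {
        assert(t->used < BUCKETS - 1);  /* keep one slot free so probing ends */
        t->used++;
    }
    t->slot[s] = index;
}

/* Returns the index of key in items[], or EMPTY if it is not present. */
static int32_t table_find(const IndexTable *t, int32_t key)
{
    return t->slot[find_slot(t, key)];
}
```

Because no hash code is stored, every probe compares against the item itself through its index, which is cheap when the data sits in a contiguous array.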

The Hard Thing is Deletion
you cannot just reset the slot
since the indexes of the following items changed

In case of deletion, we may:
1. Re-compute the whole hash table
   What mORMot did for years. Not too bad in practice.
2. Adjust the indexes
   Brute force O(n) algorithm.
3. Use another algorithm
   More complex, and usually stores the data.
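Option 2 can be sketched in C as follows. Once an item is removed from the array, every later item moves down by one, so every stored index above the deleted one must be decremented. The name `adjust_naive` is hypothetical, and clearing the slot of the deleted item itself is assumed to happen elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* P[] holds the indexes stored in the hash table slots.
   Decrement every index that pointed past the deleted item. */
static void adjust_naive(int32_t *P, size_t count, int32_t deleted)
{
    assert(P != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (P[i] > deleted)   /* data-dependent branch, one per slot */
            P[i]--;
}
```

The loop touches every slot once, hence the O(n) cost per deletion.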

On Deletion, Adjust the Indexes
Brute force O(n) algorithm
Seems simple, lean and efficient.
Let’s try deleting 1/128th of 200,000 items!

But not really fast on a huge count.
23 #195075 adjust=4.27s 548.6MB/s hash=2.47ms
Why?

Branches Are Evil
Alt-F2 : The Obvious Pascal → asm → CPU Flow

Branches Are Evil
Zilog Z80
nostalgic sight:
“Why would I need more than
16KB RAM on my ZX81?”

Branches Are Evil
Processors Learn to Predict Branches
Since Pentium 4
In case of misprediction,
execution pipelines need to be flushed
… just as if you needed to rewind a tape
… or as if JS needed to garbage collect

Branches Are Evil
Each CPU Vendor and Architecture
changes the execution plan
and some even introduced Artificial Intelligence
i.e. a CPU is a very complex beast
→ don’t trust the code, nor the asm!

Branches Are Evil
Be your own CPU: Let’s Predict !

Branches Are Evil
2 is always taken, 3 is taken every time but the last,
and 1 is “randomly” taken… so not predictable...
(1, 2 and 3 label the branches in the slide’s listing)

Branches Are Evil
Source:
https://lemire.me/blog/2019/10/16/benchmarking-is-hard-processors-learn-to-predict-branches/

Branches Are Evil
Pseudo code:
while (howmany != 0) {
    val = random();
    if (val is an odd integer) {
        out[index] = val;
        index += 1;
    }
    howmany--;
}

Branches Are Evil
The more trials, the better prediction…
the CPU somehow learns from its mistakes!

Branches Are Evil
Perfect prediction!

Branches Are Evil
… but prediction has a depth
From Lemire:
“This perfect prediction on the AMD Rome
falls apart if you grow the problem
from 2000 to 10,000 values: the best
prediction goes from a 0.1% error rate
to a 33% error rate.”

Branches Are Evil
… but prediction has a depth
From Lemire:
“You should probably avoid benchmarking
branchy code over small problems.”
That’s why I hate microbenchmarks!
And in the Delphi world, I have seen so many!

Branches Are Evil
Branch Misprediction Hurts
if … then …
dec(P[i]) branch is taken or not taken evenly
in an unpredictable manner
(as random as the hash function itself)
Note: unrolling doesn’t help, by definition

Branches Are Evil
What about Going Parallel?
We could divide P[] into sections, and use threads
- it should scale up to how many CPU cores we have
- but we are in a low-level library, so threads are unavailable
- there should be a better way

Branches Are Evil
Introducing a Branch-Less Loop

Branches Are Evil
ord(P[count] > delete)
boolean-to-integer expression returns
either 0 (false) or 1 (true)
and has no branch

Branches Are Evil
FACT: it is actually faster to execute
dec(P[count], 0);
than to handle a mispredicted branch…
(i.e. execute nothing)
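The same trick can be sketched in C, where a comparison already evaluates to 0 or 1, just like `ord()` of a boolean in Pascal. The name `adjust_branchless` is hypothetical. One caveat: an optimizing C compiler may turn the branchy form into the same machine code on its own, so the timings measured on the Pascal version are not guaranteed to carry over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchless adjustment: always subtract, but subtract 0 when the
   stored index is not above the deleted one. */
static void adjust_branchless(int32_t *P, size_t count, int32_t deleted)
{
    assert(P != NULL || count == 0);
    while (count > 0) {                     /* loop branch: easy to predict */
        count--;
        P[count] -= (P[count] > deleted);   /* comparison yields 0 or 1 */
    }
}
```

The only branch left is the loop condition, which is taken on every iteration except the last.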

Branches Are Evil
while count > 0 is very likely to loop
therefore easy to predict
(by CPU Scheduler convention,
a backward or “upper” jump is estimated as most probable)

Branches Are Evil
ord(P[count] > delete)
compiles to very efficient asm
(branchless setl opcode)

Branches Are Evil
Here, a little unrolling (slightly) helps…
since it avoids an unlikely count <= 0 condition/branch
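A possible shape for the unrolled loop, sketched in C. The unroll factor of 4 and the name `adjust_unrolled` are assumptions for illustration, since the slide does not show the factor used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchless adjustment, four slots per loop iteration. */
static void adjust_unrolled(int32_t *P, size_t count, int32_t deleted)
{
    size_t i = 0;
    assert(P != NULL || count == 0);
    for (; i + 4 <= count; i += 4) {    /* one loop test per four slots */
        P[i]     -= (P[i]     > deleted);
        P[i + 1] -= (P[i + 1] > deleted);
        P[i + 2] -= (P[i + 2] > deleted);
        P[i + 3] -= (P[i + 3] > deleted);
    }
    for (; i < count; i++)              /* up to three remaining slots */
        P[i] -= (P[i] > deleted);
}
```

The four statements are independent of each other, which also shows how the work maps onto a 4-lane SIMD register.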

Branches Are Evil
Numbers Are Talking
naïve if adjust=4.27s 548.6MB/s
branchless adjust=520.85ms 4.3GB/s
We have more than 8X better performance,
in pure Pascal code!

SIMD Assembly: SSE2
Can SIMD Improve It Further?
SIMD = Single Instruction,
Multiple Data

SIMD Assembly: SSE2
Can SIMD Improve It Further?
• Data Alignment Restrictions
• Gathering/Scattering is Tricky
• Architecture Specific
• Not native to Delphi or FPC compilers
• Sometimes needs setup/clear

SIMD Assembly: SSE2
SSE2 SIMD Instructions
• Introduced by Intel in 2000 (Pentium 4)
• XMM0 to XMM7 Registers
in 32-bit mode
• XMM0 to XMM15
in x86_64 mode

SIMD Assembly: SSE2
• Each 128-bit XMM Register can handle
Two 64-bit Doubles or Integers
Four 32-bit Integers
Eight 16-bit or Sixteen 8-bit Integers

SIMD Assembly: SSE2
We need to SIMD the following code:
We can identify two blocks of 4 integers = 128 bits each

SIMD Assembly: SSE2
1. Prepare and Align the Input
Parameters: rcx=P edx=deleted r8=count

SIMD Assembly: SSE2
2. Processing Loop

SIMD Assembly: SSE2
3. Trailing Bytes
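The three steps above can be sketched with SSE2 intrinsics in C instead of hand-written assembly. This is an illustration under assumptions, not the asm from the slides: the name `adjust_sse2` is hypothetical, one 128-bit block is handled per iteration, and a scalar fallback is compiled when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

static void adjust_sse2(int32_t *P, size_t count, int32_t deleted)
{
    size_t i = 0;
    assert(P != NULL || count == 0);
#ifdef HAVE_SSE2
    /* 1. Prepare: scalar steps until P + i is 16-byte aligned */
    while (i < count && ((uintptr_t)(P + i) & 15) != 0) {
        P[i] -= (P[i] > deleted);
        i++;
    }
    const __m128i limit = _mm_set1_epi32(deleted);  /* deleted in all 4 lanes */
    /* 2. Processing loop: four 32-bit integers per iteration */
    for (; i + 4 <= count; i += 4) {
        __m128i v  = _mm_load_si128((const __m128i *)(P + i));
        __m128i gt = _mm_cmpgt_epi32(v, limit);     /* -1 where v > deleted, else 0 */
        _mm_store_si128((__m128i *)(P + i), _mm_add_epi32(v, gt)); /* v + (-1) */
    }
#endif
    /* 3. Trailing items, fewer than four with SSE2 */
    for (; i < count; i++)
        P[i] -= (P[i] > deleted);
}
```

The comparison mask is all ones (that is, -1) in each lane where the condition holds, so adding the mask performs the conditional decrement without any branch.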

SIMD Assembly: SSE2
Numbers Are Talking
sse2 adjust=201.53ms 11.3GB/s
We expected X4
but we got a little less than X3
(pretty good, to be fair)

SIMD Assembly: SSE2
Help Needed?
https://www.agner.org/optimize/
The “Optimization Bible” (also per-CPU timing)
https://gcc.godbolt.org/
Check what best compilers do
https://www.felixcloutier.com/x86/
OpCode Reference Documentation

SIMD Assembly: AVX2
AVX2 SIMD Instructions
• AVX introduced in Sandy Bridge 2011
YMM 256-bit registers (floating point)
New 128-bit instructions
New coding scheme (VEX)
• AVX2 introduced in Haswell 2013
256-bit integer operations
Fused Multiply-Add (FMA) ops

SIMD Assembly: AVX2
• Each 256-bit YMM Register can handle
Four 64-bit Doubles or Integers
Eight 32-bit Integers
Sixteen 16-bit or Thirty-two 8-bit Integers

SIMD Assembly: AVX2
• Before using them:
Check the CPUID flag
Ensure the OS is AVX2-Aware
• AVX2 is Supported in FPC asm
• AVX2 is Not Supported in Delphi asm

SIMD Assembly: AVX2
SSE2 Processing Loop

SIMD Assembly: AVX2
New AVX2 Processing Loop
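A hedged C sketch of what a 256-bit loop may look like with AVX2 intrinsics, including a run-time check before using it. Assumptions: the names `adjust_avx2` and `adjust_avx2_blocks` are hypothetical, the code relies on the GCC/Clang `target` attribute and `__builtin_cpu_supports` builtin as a stand-in for the CPUID check, and unaligned loads are used for simplicity. It is not the asm shown on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX2_TARGET 1

/* Compiled for AVX2 even when the rest of the file is not.
   Returns how many items were processed (a multiple of 8). */
static __attribute__((target("avx2")))
size_t adjust_avx2_blocks(int32_t *P, size_t count, int32_t deleted)
{
    const __m256i limit = _mm256_set1_epi32(deleted);
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {    /* eight 32-bit integers per iteration */
        __m256i v  = _mm256_loadu_si256((const __m256i *)(P + i));
        __m256i gt = _mm256_cmpgt_epi32(v, limit);  /* -1 where v > deleted */
        _mm256_storeu_si256((__m256i *)(P + i), _mm256_add_epi32(v, gt));
    }
    return i;
}
#endif

static void adjust_avx2(int32_t *P, size_t count, int32_t deleted)
{
    size_t i = 0;
    assert(P != NULL || count == 0);
#ifdef HAVE_AVX2_TARGET
    if (__builtin_cpu_supports("avx2"))     /* run-time CPU feature check */
        i = adjust_avx2_blocks(P, count, deleted);
#endif
    for (; i < count; i++)                  /* trailing items or scalar fallback */
        P[i] -= (P[i] > deleted);
}
```

The loop body is the SSE2 one with registers twice as wide, so at best it halves the instruction count; it cannot help once memory access is the limit.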

SIMD Assembly: AVX2
Numbers Are Talking
sse2 adjust=201.53ms 11.3GB/s
avx2 adjust=161.73ms 14.1GB/s
We got only about 25% better numbers
→ We saturated the CPU bandwidth

Conclusion
• On Deletion, TDynArrayHasher
is not a bottleneck any more
• The TDynArray.Delete data move
takes most time now
• We have a nice pure-pascal version

Conclusion
• Branches are Evil
• Never Trust Micro Benchmarks
• Unrolling is no magic
• Branchless is magic: 8 X faster
• SIMD is worth it if really needed
for another 3 X boost

From Delphi to AVX2
Questions?
No Marmots Were Harmed in the Making of This Session
