Arnaud Bouchez - Synopse
Rewrite for Performance
From Delphi to AVX2
Welcome to
a fun/wakeup session
about performance
hashes
and assembly mystery
Arnaud Bouchez
• Open Source Founder
mORMot
SynPDF
• Delphi and FPC expert
DDD, SOA, ORM, MVC
Performance, SOLID
• Synopse Consulting
https://synopse.info
Menu du jour
• The Hash-Table Mystery
• Branches Are Evil
• SIMD Assembly: SSE2
• SIMD Assembly: AVX2
• Conclusion
The Hash-Table Mystery
mORMot is Fast
and always tries to be faster
so it works hard at it
The Hash-Table Mystery
One core component
is TDynArrayHasher
= a hasher for a dynamic array
<> a hashed list
(it does not own the data)
The Hash-Table Mystery
TDynArrayHasher is used e.g. by
• the TDynArray wrapper
• the TSynDictionary class
• the in-memory ORM engine
The Hash-Table Mystery
How does a Hash-Table work?
bucketindex := hash(key) mod bucketscount
for O(1) retrieval instead of O(n) manual lookup
The Hash-Table Mystery
How does a Hash-Table work?
crc32c()
(hardware accelerated SSE4.2)
The Hash-Table Mystery
How does a Hash-Table work?
xxhash32()
(on non-Intel or old CPUs)
The Hash-Table Mystery
How does a Hash-Table work?
mORMot prefers indexes for efficiency
(and doesn’t store the hash code, since crc32c is fast)
The Hash-Table Mystery
How does a Hash-Table work?
mORMot stores keys with values
within a (dynamic) array
The Hash-Table Mystery
How does a Hash-Table work?
mORMot can hash several keys
in the same (dynamic) array
The Hash-Table Mystery
How does a Hash-Table work?
It is easy to insert a new item
if we properly handle hash collisions
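To make the previous slides concrete, here is a minimal sketch in C (not mORMot's actual Pascal code) of a hash table that stores only indexes into a separate array of items. Everything in it is an assumption made for illustration: FNV-1a stands in for crc32c, collisions are handled by linear probing, the table has a fixed size, and the names `idx_table`, `table_init`, `table_add` and `table_find` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define SLOTS 64            /* number of buckets (hypothetical fixed size) */
#define EMPTY (-1)          /* marks a free slot */

typedef struct {
    const char **items;     /* the data stays in the caller's array */
    int32_t slot[SLOTS];    /* each slot holds an index into items[], or EMPTY */
} idx_table;

/* FNV-1a, used here only as a stand-in for crc32c / xxhash32 */
static uint32_t hash_str(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

static void table_init(idx_table *t, const char **items) {
    t->items = items;
    for (int i = 0; i < SLOTS; i++) t->slot[i] = EMPTY;
}

/* register items[index]; on collision, probe the next slot.
   Assumes fewer than SLOTS items, so a free slot always exists. */
static void table_add(idx_table *t, int32_t index) {
    uint32_t b = hash_str(t->items[index]) % SLOTS;   /* bucket index */
    while (t->slot[b] != EMPTY) b = (b + 1) % SLOTS;
    t->slot[b] = index;
}

/* return the index of key in items[], or -1 if it is not registered */
static int32_t table_find(const idx_table *t, const char *key) {
    uint32_t b = hash_str(key) % SLOTS;
    while (t->slot[b] != EMPTY) {
        if (strcmp(t->items[t->slot[b]], key) == 0) return t->slot[b];
        b = (b + 1) % SLOTS;
    }
    return -1;
}
```

Since the table holds indexes rather than copies of the data, removing one item from the array invalidates every stored index that points past it, which is the deletion problem discussed next.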
The Hash-Table Mystery
How does a Hash-Table work?
The hard part is deletion:
you cannot just reset the slot,
since the indexes of the following items have changed
The Hash-Table Mystery
In case of deletion, we may:
1. Re-compute the whole hash table
What mORMot did for years. Not too bad in practice.
2. Adjust the indexes
Brute force O(n) algorithm.
3. Use another algorithm
More complex, and usually stores the data.
The Hash-Table Mystery
On Deletion, Adjust the Indexes
Brute force O(n) algorithm
Seems simple, lean and efficient.
Let’s try deleting 1/128th of 200,000 items!
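The "adjust the indexes" loop can be sketched as follows. The deck shows its Pascal source only as an image, so this is a C approximation under assumptions: the hash table is a plain array of 32-bit indexes, and `adjust_naive` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* After item number `deleted` is removed from the data array, every item
   stored after it moves down by one position, so every stored index
   greater than `deleted` has to be decremented. */
static void adjust_naive(int32_t *p, int32_t deleted, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (p[i] > deleted) {   /* data-dependent branch, once per slot */
            p[i]--;
        }
    }
}
```

One call walks the whole table, so deleting many items in a row repeats this O(n) pass each time.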
The Hash-Table Mystery
On Deletion, Adjust the Indexes
Brute force O(n) algorithm
But not really fast on a huge count.
23 #195075 adjust=4.27s 548.6MB/s hash=2.47ms
Why????
Branches Are Evil
Alt-F2: The Obvious Pascal → asm → CPU Flow
Branches Are Evil
Zilog Z80
nostalgic sight:
“Why would I need more than
16KB RAM on my ZX81?”
Branches Are Evil
Processors Learn to Predict Branches
Since the original Pentium, and critical since the deep-pipelined Pentium 4
In case of misprediction,
execution pipelines need to be flushed
… just as if you needed to rewind a tape
… or as if JS needed to garbage collect
Branches Are Evil
Processors Learn to Predict Branches
Each CPU Vendor and Architecture
changes the execution plan
and some even introduced Artificial Intelligence
i.e. a CPU is a very complex beast
→ don’t trust the code, nor the asm!
Branches Are Evil
Be your own CPU: Let’s Predict!
Branches Are Evil
Branch 2 is always taken, branch 3 is taken every time but the last,
and branch 1 is “randomly” taken… so it is not predictable.
Branches Are Evil
Processors Learn to Predict Branches
Source:
https://lemire.me/blog/2019/10/16/benchmarking-is-hard-processors-learn-to-predict-branches/
Branches Are Evil
Processors Learn to Predict Branches
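Code (C, adapted from the pseudo code; needs <stdlib.h>, out[] holds at least howmany values):
size_t index = 0;
while (howmany != 0) {
    long val = random();       // POSIX pseudo-random number
    if ((val & 1) != 0) {      // val is odd: a coin-flip branch
        out[index] = val;
        index += 1;
    }
    howmany--;
}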
Branches Are Evil
Processors Learn to Predict Branches
The more trials, the better prediction…
the CPU somehow learns from its mistakes!
Branches Are Evil
Processors Learn to Predict Branches
Perfect prediction!
Branches Are Evil
Processors Learn to Predict Branches
… but prediction has a depth
From Lemire:
“This perfect prediction on the AMD Rome
falls apart if you grow the problem
from 2000 to 10,000 values: the best
prediction goes from a 0.1% error rate
to a 33% error rate.”
Branches Are Evil
Processors Learn to Predict Branches
… but prediction has a depth
From Lemire:
“You should probably avoid benchmarking
branchy code over small problems.”
That’s why I hate microbenchmarks!
And in the Delphi world, I have seen so many of them!
Branches Are Evil
Branch Misprediction Hurts
if … then …
The dec(P[i]) branch is taken or not taken evenly,
in an unpredictable manner
(as random as the hash function itself)
Note: unrolling doesn’t help, by definition
Branches Are Evil
What about Going Parallel?
We could divide P[] into sections, and use threads
- it should scale up to how many CPU cores we have
- but we are in a low-level library, so threads are unavailable
- there should be a better way
Branches Are Evil
Introducing a Branch-Less Loop
ord(P[count] > delete)
boolean-to-integer expression returns
either 0 (false) or 1 (true)
and has no branch
Branches Are Evil
Introducing a Branch-Less Loop
FACT: it is actually faster to execute
dec(P[count], 0);
than to handle a mispredicted branch…
(i.e. execute nothing)
Branches Are Evil
Introducing a Branch-Less Loop
while count > 0 is very likely to loop,
and is therefore easy to predict
(by static prediction convention,
a backward “upper” jump is assumed to be most probably taken)
Branches Are Evil
Introducing a Branch-Less Loop
ord(P[count] > delete)
compiles to very efficient asm
(branchless setl opcode)
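The branch-less idea can be sketched in C, where a comparison yields 0 or 1 just like the Pascal ord() of a boolean. This is an approximation of the loop described on these slides, not the mORMot source, and `adjust_branchless` is a hypothetical name. Whether the compiler really emits branch-free code (for instance a setcc opcode) depends on the compiler and its options, so checking the generated asm is still advised.

```c
#include <stddef.h>
#include <stdint.h>

/* Same result as the naive "if greater then decrement" loop, but the
   comparison result (0 or 1) is subtracted unconditionally, so the loop
   body holds no data-dependent branch. */
static void adjust_branchless(int32_t *p, int32_t deleted, size_t count) {
    while (count > 0) {                     /* loop branch: easy to predict */
        count--;
        p[count] -= (p[count] > deleted);   /* subtract 1 or 0 */
    }
}
```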
Branches Are Evil
Introducing a Branch-Less Loop
Here, a little unrolling (slightly) helps…
since it avoids an unlikely count <= 0 condition/branch
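A possible shape for such unrolling, again as a C sketch under assumptions: the factor of four and the name `adjust_unrolled` are chosen for illustration, since the deck does not state the unrolling factor in text.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch-less adjustment, four slots per iteration: the loop condition
   is evaluated four times less often than in the one-by-one version. */
static void adjust_unrolled(int32_t *p, int32_t deleted, size_t count) {
    while (count >= 4) {
        count -= 4;
        p[count]     -= (p[count]     > deleted);
        p[count + 1] -= (p[count + 1] > deleted);
        p[count + 2] -= (p[count + 2] > deleted);
        p[count + 3] -= (p[count + 3] > deleted);
    }
    while (count > 0) {                     /* remaining 0..3 slots */
        count--;
        p[count] -= (p[count] > deleted);
    }
}
```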
Branches Are Evil
Numbers Are Talking
naïve if adjust=4.27s 548.6MB/s
branchless adjust=520.85ms 4.3GB/s
We have roughly 8X better performance (4.27 s down to 520.85 ms),
in pure pascal code!
SIMD Assembly: SSE2
Can SIMD Improve It Further?
SIMD = Single Instruction,
Multiple Data
SIMD Assembly: SSE2
Can SIMD Improve It Further?
• Data Alignment Restrictions
• Gathering/Scattering is Tricky
• Architecture Specific
• Not native to Delphi or FPC compilers
• Sometimes needs setup/clear
SIMD Assembly: SSE2
SSE2 SIMD Instructions
• Introduced by Intel in 2000 (Pentium 4)
• XMM0 to XMM7 Registers
in 32-bit mode
• XMM0 to XMM15
in x86_64 mode
SIMD Assembly: SSE2
SSE2 SIMD Instructions
• Each 128-bit XMM Register can handle
Two 64-bit Doubles or Integers
Four 32-bit Integers
Eight 16-bit or Sixteen 8-bit Integers
SIMD Assembly: SSE2
We need to SIMD the following code:
We can identify two blocks of 4 integers = 128 bits each
SIMD Assembly: SSE2
1. Prepare and Align the Input
Parameters: rcx=P edx=deleted r8=count
SIMD Assembly: SSE2
2. Processing Loop
SIMD Assembly: SSE2
3. Trailing Bytes
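The three steps above can be sketched with C intrinsics instead of hand-written asm. This is an illustration under assumptions, not the asm shown in the talk: it handles one XMM register (four integers) per iteration where the slide mentions two blocks, `adjust_sse2` is a hypothetical name, and the parameter order simply mirrors the P / deleted / count registers listed on the slide. Without SSE2 at compile time the vector part is skipped and the final scalar loop does all the work.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

static void adjust_sse2(int32_t *p, int32_t deleted, size_t count) {
    /* 1. prepare: scalar steps until p is 16-byte aligned */
    while (count > 0 && ((uintptr_t)p & 15) != 0) {
        *p -= (*p > deleted);
        p++;
        count--;
    }
#if defined(__SSE2__)
    /* 2. processing loop: four 32-bit integers per XMM register */
    const __m128i del = _mm_set1_epi32(deleted);
    while (count >= 4) {
        __m128i v  = _mm_load_si128((const __m128i *)p);
        __m128i gt = _mm_cmpgt_epi32(v, del);   /* -1 where v > deleted, else 0 */
        v = _mm_add_epi32(v, gt);               /* adding -1 decrements */
        _mm_store_si128((__m128i *)p, v);
        p += 4;
        count -= 4;
    }
#endif
    /* 3. trailing items (all of them when SSE2 is not available) */
    while (count > 0) {
        *p -= (*p > deleted);
        p++;
        count--;
    }
}
```

The signed compare produces an all-ones mask (-1) for each matching lane, so a single add replaces the "compare, convert to 0/1, subtract" sequence of the scalar version.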
SIMD Assembly: SSE2
Numbers Are Talking
naïve if adjust=4.27s 548.6MB/s
branchless adjust=520.85ms 4.3GB/s
sse2 adjust=201.53ms 11.3GB/s
We expected X4
but we got about X2.6
(pretty good, to be fair)
SIMD Assembly: SSE2
Help Needed?
https://www.agner.org/optimize/
The “Optimization Bible” (also per-CPU timing)
https://gcc.godbolt.org/
Check what best compilers do
https://www.felixcloutier.com/x86/
OpCode Reference Documentation
SIMD Assembly: AVX2
AVX2 SIMD Instructions
• AVX introduced in Sandy Bridge 2011
256-bit YMM registers, for floating point only
New 128-bit instructions
New (VEX) coding scheme
• AVX2 introduced in Haswell 2013
256-bit integer instructions on the YMM registers
Fused Multiply-Add (FMA) ops arrived alongside
SIMD Assembly: AVX2
AVX2 SIMD Instructions
• Each 256-bit YMM Register can handle
Four 64-bit Doubles or Integers
Eight 32-bit Integers
Sixteen 16-bit or Thirty-two 8-bit Integers
SIMD Assembly: AVX2
AVX2 SIMD Instructions
• Before using them:
Check the CPUID flag
Ensure the OS is AVX2-Aware
• AVX2 is Supported in FPC asm
• AVX2 is Not Supported in Delphi asm
SIMD Assembly: AVX2
SSE2 Processing Loop
SIMD Assembly: AVX2
New AVX2 Processing Loop
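A hedged C counterpart of the AVX2 loop, with the runtime check mentioned on the previous slide. Assumptions: a GCC or Clang toolchain on x86_64, where `__builtin_cpu_supports("avx2")` and the `target("avx2")` function attribute exist; the builtin is expected to cover both the CPUID flag and OS support, but that is worth verifying for your toolchain. Unaligned loads are used for simplicity, and `adjust_avx2`, `adjust_scalar` and `adjust_best` are hypothetical names, not the talk's asm.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* AVX2 code generation is enabled for this single function only */
__attribute__((target("avx2")))
static void adjust_avx2(int32_t *p, int32_t deleted, size_t count) {
    const __m256i del = _mm256_set1_epi32(deleted);
    while (count >= 8) {    /* eight 32-bit integers per YMM register */
        __m256i v  = _mm256_loadu_si256((const __m256i *)p);
        __m256i gt = _mm256_cmpgt_epi32(v, del);   /* -1 where v > deleted */
        v = _mm256_add_epi32(v, gt);               /* adding -1 decrements */
        _mm256_storeu_si256((__m256i *)p, v);
        p += 8;
        count -= 8;
    }
    while (count > 0) {     /* trailing items */
        *p -= (*p > deleted);
        p++;
        count--;
    }
}
#endif

/* portable branch-less fallback */
static void adjust_scalar(int32_t *p, int32_t deleted, size_t count) {
    while (count > 0) {
        count--;
        p[count] -= (p[count] > deleted);
    }
}

/* use the AVX2 path only when the running CPU reports support for it */
static void adjust_best(int32_t *p, int32_t deleted, size_t count) {
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        adjust_avx2(p, deleted, count);
        return;
    }
#endif
    adjust_scalar(p, deleted, count);
}
```

Dispatching once per call keeps the check out of the hot loop; a library would more likely resolve the function pointer once at startup.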
SIMD Assembly: AVX2
Numbers Are Talking
naïve if adjust=4.27s 548.6MB/s
branchless adjust=520.85ms 4.3GB/s
sse2 adjust=201.53ms 11.3GB/s
avx2 adjust=161.73ms 14.1GB/s
We got only about 25% better numbers
→ We saturated the CPU bandwidth
Conclusion
• On Deletion, TDynArrayHasher
is not a bottleneck any more
• The TDynArray.Delete data move
takes most of the time now
• We have a nice pure-pascal version
Conclusion
• Branches are Evil
• Never Trust Micro Benchmarks
• Unrolling is no magic
• Branchless is magic: about 8 X faster
• SIMD is worth it if really needed
for another 3 X boost
From Delphi to AVX2
Questions?
No Marmots Were Harmed in the Making of This Session

Ekon24 from Delphi to AVX2

  • 1. Arnaud Bouchez - Synopse Rewrite for Performance From Delphi to AVX2
  • 2. Welcome to a fun/wakeup session about performance hashes and assembly mystery
  • 3. Arnaud Bouchez • Open SourceFounder mORMot SynPDF • Delphiand FPC expert DDD, SOA, ORM, MVC Performance,SOLID • SynopseConsulting https://synopse.info
  • 4. Menu du jour • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 5. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 7. The Hash-Table Mystery mORMot is Fast and tries to be always faster
  • 8. The Hash-Table Mystery mORMot is Fast and tries to be always faster so works hard for it
  • 9. The Hash-Table Mystery One core component is TDynArrayHasher = a hasher for a dynamic array
  • 10. The Hash-Table Mystery One core component is TDynArrayHasher = a hasher for a dynamic array <> a hashed list (it does not own the data)
  • 11. The Hash-Table Mystery One core component is TDynArrayHasher = a hasher for a dynamic array Used e.g. by the TDynArray wrapper the TSynDictionary class the in-memory ORM engine
  • 12. The Hash-Table Mystery How does a Hash-Table work? bucketindex := hash(key) mod bucketscount for O(1) retrieval instead of O(n) manual lookup
  • 13. The Hash-Table Mystery How does a Hash-Table work? crc32c() (hardware accelerated SSE4.2)
  • 14. The Hash-Table Mystery How does a Hash-Table work? xxhash32() (on non-Intel or old CPUs)
  • 15. The Hash-Table Mystery How does a Hash-Table work? mORMot prefers indexes for efficiency (and don’t store the hashcode since crc32c is fast)
  • 16. The Hash-Table Mystery How does a Hash-Table work? mORMot stores keys with values within a (dynamic) array
  • 17. The Hash-Table Mystery How does a Hash-Table work? mORMot can hash several keys in the same (dynamic) array
  • 18. The Hash-Table Mystery How does a Hash-Table work? It is easy to insert a new item
  • 19. The Hash-Table Mystery How does a Hash-Table work? It is easy to insert a new item if we handle properly hash collision
  • 20. The Hash-Table Mystery How does a Hash-Table work? the Hard Thing is for Deletion you can not just reset the slot since indexes changed
  • 21. The Hash-Table Mystery In case of deletion, we may: 1. Re-compute the whole hash table 2. Adjust the indexes 3. Use other algorithm
  • 22. The Hash-Table Mystery In case of deletion, we may: 1. Re-compute the whole hash table What mORMot did for years. Not too bad in practice. 2. Adjust the indexes Brute force O(n) algorithm. 3. Use other algorithm More complex, and usually stores the data.
  • 23. The Hash-Table Mystery In case of deletion, we may: 1. Re-compute the whole hash table What mORMot did for years. Not too bad in practice. 2. Adjust the indexes Brute force O(n) algorithm. 3. Use other algorithm More complex, and usually stores the data.
  • 24.
  • 25. The Hash-Table Mystery On Deletion, Adjust the Indexes Brute force O(n) algorithm Seems simple, lean and efficient.
  • 26. The Hash-Table Mystery On Deletion, Adjust the Indexes Brute force O(n) algorithm Seems simple, lean and efficient. Let’s try deleting 1/128th of 200,000 items !
  • 27. The Hash-Table Mystery On Deletion, Adjust the Indexes Brute force O(n) algorithm But not really fast on huge count. 23 #195075 adjust=4.27s 548.6MB/s hash=2.47ms Why????
  • 28. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 29. Branches Are Evil Alt-F2 : The Obvious Pascal  asm  CPU Flow
  • 30. Branches Are Evil Alt-F2 : The Obvious Pascal  asm  CPU Flow
  • 31. Branches Are Evil Zilog Z80 nostalgic sight: “Why would I need more than 16KB RAM on my ZX81?”
  • 33. Branches Are Evil Processors Learn to Predict Branches Since Pentium 4 In case of misprediction, execution pipelines need to be flushed … just as if you needed to rewind a tape
  • 34. Branches Are Evil Processors Learn to Predict Branches Since Pentium 4 In case of misprediction, execution pipelines need to be flushed … just as if JS needed to garbage collect
  • 35. Branches Are Evil Processors Learn to Predict Branches Each CPU Vendor and Architecture changes the execution plan and even introduced Artificial Intelligence i.e. a CPU is a very complex beast  don’t trust the code, nor the asm!
  • 36. Branches Are Evil Be your own CPU: Let’s Predict !
  • 37. Branches Are Evil Branch 2 is always taken, branch 3 is taken every time except the last, and branch 1 is “randomly” taken… so not predictable.
  • 38. Branches Are Evil Processors Learn to Predict Branches Source: https://lemire.me/blog/2019/10/16/benchmarking-is-hard-processors-learn-to-predict-branches/
  • 39. Branches Are Evil Processors Learn to Predict Branches Pseudo code:
        while (howmany != 0) {
          val = random();
          if (val is an odd integer) {
            out[index] = val;
            index += 1;
          }
          howmany--;
        }
  • 40. Branches Are Evil Processors Learn to Predict Branches The more trials, the better prediction… the CPU somehow learns from its mistakes!
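A runnable C rendering of that pseudo code might look like this. It is only a sketch: the xorshift32 generator stands in for `random()`, and `next_random`, `collect_odd` and the `seed` parameter are hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Small xorshift32 generator standing in for random(); hypothetical helper. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Copies the odd values among `howmany` pseudo-random draws into `out`
   and returns how many values were written. */
size_t collect_odd(uint32_t *out, size_t howmany, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;   /* xorshift state must be non-zero */
    size_t index = 0;
    while (howmany != 0) {
        uint32_t val = next_random(&state);
        if (val & 1u) {                  /* the hard-to-predict branch */
            out[index] = val;
            index += 1;
        }
        howmany--;
    }
    return index;
}
```

With a fixed seed, every call replays the same sequence of branch outcomes, which is the situation where repeated trials give the predictor something to learn.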
  • 41. Branches Are Evil Processors Learn to Predict Branches
  • 42. Branches Are Evil Processors Learn to Predict Branches Perfect prediction! ☺
  • 43. Branches Are Evil Processors Learn to Predict Branches … but prediction has a depth ☹ From Lemire: “This perfect prediction on the AMD Rome falls apart if you grow the problem from 2000 to 10,000 values: the best prediction goes from a 0.1% error rate to a 33% error rate.”
  • 45. Branches Are Evil Processors Learn to Predict Branches … but prediction has a depth ☹ From Lemire: “You should probably avoid benchmarking branchy code over small problems.” That’s why I hate microbenchmarks! And in the Delphi world, I have seen so many!
  • 46. Branches Are Evil Branch Misprediction Hurts The if … then … dec(P[i]) branch is taken or not taken evenly, in an unpredictable manner (as random as the hash function itself)
  • 47. Branches Are Evil Branch Misprediction Hurts The if … then … dec(P[i]) branch is taken or not taken evenly, in an unpredictable manner. Note: unrolling doesn’t help, by definition
  • 48. Branches Are Evil What about Going Parallel? We could divide P[] into sections, and use threads - it should scale up to how many CPU cores we have - but we are in a low-level library, so threads are unavailable - there should be a better way
  • 49. Branches Are Evil Introducing a Branch-Less Loop
  • 50. Branches Are Evil Introducing a Branch-Less Loop ord(P[count] > delete) boolean-to-integer expression returns either 0 (false) or 1 (true) and has no branch
  • 51. Branches Are Evil Introducing a Branch-Less Loop FACT: it is actually faster to execute dec(P[count], 0); than to handle a mispredicted branch… (i.e. execute nothing)
  • 52. Branches Are Evil Introducing a Branch-Less Loop while count > 0 is very likely to loop, therefore easy to predict (by static branch-prediction convention, a backward or “upper” jump is estimated as most probable)
  • 53. Branches Are Evil Introducing a Branch-Less Loop ord(P[count] > delete) compiles to very efficient asm (branchless setl opcode)
  • 54. Branches Are Evil Introducing a Branch-Less Loop Here, a little unrolling (slightly) helps… since it avoids an unlikely count <= 0 condition/branch
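The branchless idea described in the slides above can be sketched in C, where a comparison already evaluates to 0 or 1, just like `ord(P[count] > delete)` in Pascal. The function name `adjust_branchless` and the unroll factor of four are assumptions made for this illustration, not the code shown in the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical branchless variant: always subtract, but subtract 0 or 1. */
void adjust_branchless(int32_t *idx, size_t count, int32_t deleted)
{
    /* A little unrolling: fewer loop-condition checks per item processed. */
    while (count >= 4) {
        idx[0] -= (idx[0] > deleted);   /* comparison yields 0 or 1, no jump */
        idx[1] -= (idx[1] > deleted);
        idx[2] -= (idx[2] > deleted);
        idx[3] -= (idx[3] > deleted);
        idx += 4;
        count -= 4;
    }
    /* Remaining 0..3 items. */
    while (count > 0) {
        idx[0] -= (idx[0] > deleted);
        idx++;
        count--;
    }
}
```

The only branches left are the loop conditions, which are taken almost every time and are therefore easy to predict.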
  • 55. Branches Are Evil Numbers Are Talking naïve if adjust=4.27s 548.6MB/s branchless adjust=520.85ms 4.3GB/s We have almost 10X better performance, in pure pascal code !
  • 57. SIMD Assembly: SSE2 Can SIMD Improve It Further? SIMD = Single Instruction, Multiple Data
  • 58. SIMD Assembly: SSE2 Can SIMD Improve It Further? • Data Alignment Restrictions • Gathering/Scattering is Tricky • Architecture Specific • Not native to Delphi or FPC compilers • Sometimes needs setup/clear
  • 59. SIMD Assembly: SSE2 SSE2 SIMD Instructions • Introduced by Intel in 2000 (Pentium 4) • XMM0 to XMM7 Registers in 32-bit mode • XMM0 to XMM15 in x86_64 mode
  • 60. SIMD Assembly: SSE2 SSE2 SIMD Instructions • Each 128-bit XMM Register can handle Two 64-bit Doubles or Integers Four 32-bit Integers Eight 16-bit or Sixteen 8-bit Integers
  • 61. SIMD Assembly: SSE2 SSE2 SIMD Instructions
  • 62. SIMD Assembly: SSE2 We need to SIMD the following code:
  • 63. SIMD Assembly: SSE2 We need to SIMD the following code: We can identify two 4-integers = 128-bit blocks
  • 64. SIMD Assembly: SSE2 1. Prepare and Align the Input Parameters: rcx=P edx=deleted r8=count
  • 65. SIMD Assembly: SSE2 2. Processing Loop
  • 66. SIMD Assembly: SSE2 3. Trailing Bytes
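The three steps above (align, process two 128-bit blocks per iteration, finish the tail) can be sketched with C intrinsics instead of hand-written assembly. This is an illustration under assumptions, not the asm from the talk: `adjust_sse2` is a hypothetical name, and the code falls back to the scalar loop when SSE2 is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

void adjust_sse2(int32_t *idx, size_t count, int32_t deleted)
{
#ifdef HAVE_SSE2
    /* 1. Prepare: scalar steps until idx sits on a 16-byte boundary. */
    while (count > 0 && ((uintptr_t)idx & 15) != 0) {
        *idx -= (*idx > deleted);
        idx++;
        count--;
    }
    const __m128i limit = _mm_set1_epi32(deleted);
    /* 2. Processing loop: two XMM registers = 8 integers per iteration. */
    while (count >= 8) {
        __m128i a = _mm_load_si128((const __m128i *)idx);
        __m128i b = _mm_load_si128((const __m128i *)(idx + 4));
        /* cmpgt gives -1 in each lane greater than `deleted`, 0 elsewhere;
           adding that mask subtracts 1 from exactly the matching lanes. */
        a = _mm_add_epi32(a, _mm_cmpgt_epi32(a, limit));
        b = _mm_add_epi32(b, _mm_cmpgt_epi32(b, limit));
        _mm_store_si128((__m128i *)idx, a);
        _mm_store_si128((__m128i *)(idx + 4), b);
        idx += 8;
        count -= 8;
    }
#endif
    /* 3. Trailing items (or everything, when SSE2 is not compiled in). */
    while (count > 0) {
        *idx -= (*idx > deleted);
        idx++;
        count--;
    }
}
```

The compare-then-add trick is the SIMD counterpart of the branchless scalar expression: the comparison result is used as data, never as a jump condition.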
  • 67. SIMD Assembly: SSE2 Numbers Are Talking naïve if adjust=4.27s 548.6MB/s branchless adjust=520.85ms 4.3GB/s sse2 adjust=201.53ms 11.3GB/s We expected X4 but we got a little less than X3 (pretty good, to be fair)
  • 68. SIMD Assembly: SSE2 Help Needed? https://www.agner.org/optimize/ The “Optimization Bible” (also per-CPU timing) https://gcc.godbolt.org/ Check what best compilers do https://www.felixcloutier.com/x86/ OpCode Reference Documentation
  • 70. SIMD Assembly: AVX2 AVX2 SIMD Instructions • AVX introduced in Sandy Bridge 2011 YMM 256-bit registers (floating point) New VEX coding scheme • AVX2 introduced in Haswell 2013 256-bit integer instructions Fused Multiply-Add (FMA) ops arrived alongside
  • 71. SIMD Assembly: AVX2 AVX2 SIMD Instructions • Each 256-bit YMM Register can handle Four 64-bit Doubles or Integers Eight 32-bit Integers Sixteen 16-bit or Thirty-two 8-bit Integers
  • 72. SIMD Assembly: AVX2 AVX2 SIMD Instructions • Before using them: Check the CPUID flag Ensure the OS is AVX2-Aware • AVX2 is Supported in FPC asm • AVX2 is Not Supported in Delphi asm
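As a rough illustration of the "check before use" rule above, GCC and Clang expose a built-in CPU feature test on x86. The wrapper name `cpu_has_avx2` is hypothetical; whether the built-in also covers the OS-side check depends on the compiler runtime, so treat that part as an assumption to verify for your toolchain.

```c
#include <stdbool.h>

/* Returns true when the compiler runtime reports AVX2 as usable. */
bool cpu_has_avx2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                       /* safe to call repeatedly */
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;   /* unknown compiler or non-x86 target: stay on the safe path */
#endif
}
```

A caller would typically evaluate this once at startup and select the AVX2, SSE2 or pure scalar routine accordingly.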
  • 73. SIMD Assembly: AVX2 SSE2 Processing Loop
  • 74. SIMD Assembly: AVX2 New AVX2 Processing Loop
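A possible shape for the AVX2 loop, again sketched with C intrinsics rather than the FPC asm from the talk. One 256-bit YMM register replaces the two XMM registers of the SSE2 loop. Assumptions to note: `adjust_avx2` and `adjust_avx2_blocks` are hypothetical names, unaligned loads are used for simplicity, and the GCC/Clang `target` attribute plus `__builtin_cpu_supports` handle compile-time and run-time selection.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* Built with AVX2 enabled for this function only; call it only after a
   successful run-time CPU check. Returns the number of items processed. */
__attribute__((target("avx2")))
static size_t adjust_avx2_blocks(int32_t *idx, size_t count, int32_t deleted)
{
    const __m256i limit = _mm256_set1_epi32(deleted);
    size_t done = 0;
    while (count - done >= 8) {                 /* 8 integers per YMM */
        __m256i v = _mm256_loadu_si256((const __m256i *)(idx + done));
        /* mask is -1 where lane > deleted, so the add subtracts 1 there */
        v = _mm256_add_epi32(v, _mm256_cmpgt_epi32(v, limit));
        _mm256_storeu_si256((__m256i *)(idx + done), v);
        done += 8;
    }
    return done;
}
#endif

void adjust_avx2(int32_t *idx, size_t count, int32_t deleted)
{
    size_t done = 0;
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2"))
        done = adjust_avx2_blocks(idx, count, deleted);
#endif
    /* Scalar branchless tail, or the whole array without AVX2. */
    for (; done < count; done++)
        idx[done] -= (idx[done] > deleted);
}
```

The loop body performs the same amount of work per item as the SSE2 version, just on twice as many lanes.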
  • 75. SIMD Assembly: AVX2 Numbers Are Talking naïve if adjust=4.27s 548.6MB/s branchless adjust=520.85ms 4.3GB/s sse2 adjust=201.53ms 11.3GB/s avx2 adjust=161.73ms 14.1GB/s We got only 30% better numbers ☹ We saturated the CPU bandwidth ☺
  • 77. Conclusion • On Deletion, TDynArrayHasher is not a bottleneck any more • The TDynArray.Delete data move takes most time now • We have a nice pure-pascal version
  • 78. Conclusion • Branches are Evil • Never Trust Micro Benchmarks • Unrolling is no magic • Branchless is magic: 10 X faster • SIMD is worth it if really needed for another 3 X boost
  • 79. From Delphi to AVX2 Questions? No Marmots Were Harmed in the Making of This Session