SlideShare uma empresa Scribd logo
1 de 40
Baixar para ler offline
Three Optimization Tips for C++

                                Andrei Alexandrescu, Ph.D.
                                   Research Scientist, Facebook
                                  andrei.alexandrescu@fb.com




© 2012- Facebook. Do not redistribute.                            1 / 33
This Talk




         • Basics
         • Reduce strength
         • Minimize array writes




© 2012- Facebook. Do not redistribute.   2 / 33
Things I Shouldn’t Even




© 2012- Facebook. Do not redistribute.       3 / 33
Today’s Computing Architectures



         • Extremely complex
         • Trade reproducible performance for average
           speed
         • Interrupts, multiprocessing are the norm
         • Dynamic frequency control is becoming
           common
         • Virtually impossible to get identical timings
           for experiments




© 2012- Facebook. Do not redistribute.                     4 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”
         • “Computation is faster than data”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”
         • “Computation is faster than data”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”
         • “Computation is faster than data”


         • The only good intuition: “I should time this.”



© 2012- Facebook. Do not redistribute.                      5 / 33
Paradox




      Measuring gives you a
      leg up on experts who
      don’t need to measure


© 2012- Facebook. Do not redistribute.   6 / 33
Common Pitfalls


         • Measuring speed of debug builds
         • Different setup for baseline and measured
             ◦ Sequencing: heap allocator
             ◦ Warmth of cache, files, databases, DNS
         • Including ancillary work in measurement
             ◦ malloc, printf common
         • Mixtures: measure ta + tb , improve ta ,
           conclude tb got improved
         • Optimize rare cases, pessimize others



© 2012- Facebook. Do not redistribute.                 7 / 33
Optimizing Rare Cases




© 2012- Facebook. Do not redistribute.   8 / 33
More generalities




         • Prefer static linking and PDC
         • Prefer 64-bit code, 32-bit data
         • Prefer (32-bit) array indexing to pointers
            ◦ Prefer a[i++] to a[++i]
         • Prefer regular memory access patterns
         • Minimize flow, avoid data dependencies




© 2012- Facebook. Do not redistribute.                  9 / 33
Storage Pecking Order



         • Use enum for integral constants
         • Use static const for other immutables
            ◦ Beware cache issues
         • Use stack for most variables
         • Globals: aliasing issues
         • thread_local slowest, use local caching
            ◦ 1 instruction in Windows, Linux
            ◦ 3-4 in OSX




© 2012- Facebook. Do not redistribute.               10 / 33
Reduce Strength




© 2012- Facebook. Do not redistribute.             11 / 33
Strength reduction



         • Speed hierarchy:
            ◦ comparisons
            ◦ (u)int add, subtract, bitops, shift
            ◦ FP add, sub (separate unit!)
            ◦ Indexed array access
            ◦ (u)int32 mul; FP mul
            ◦ FP division, remainder
            ◦ (u)int division, remainder




© 2012- Facebook. Do not redistribute.              12 / 33
Your Compiler Called




        I get it. a >>= 1 is the
           same as a /= 2.


© 2012- Facebook. Do not redistribute.   13 / 33
Integrals



         • Prefer 32-bit ints to all other sizes
            ◦ 64 bit may make some code slower
            ◦ 8, 16-bit computations use conversion to
              32 bits and back
            ◦ Use small ints in arrays
         • Prefer unsigned to signed
            ◦ Except when converting to floating point
         • “Most numbers are small”




© 2012- Facebook. Do not redistribute.                   14 / 33
Floating Point



         • Double precision as fast as single precision
         • Extended precision just a bit slower
         • Do not mix the three
         • 1-2 FP addition/subtraction units
         • 1-2 FP multiplication/division units
         • SSE accelerates throughput for certain
           computation kernels
         • ints→FPs cheap, FPs→ints expensive




© 2012- Facebook. Do not redistribute.                    15 / 33
Advice




      Design algorithms to
     use minimum operation
            strength


© 2012- Facebook. Do not redistribute.   16 / 33
Strength reduction: Example


         • Digit count in base-10 representation
       uint32_t digits10(uint64_t v) {
          uint32_t result = 0;
          do {
             ++result;
             v /= 10;
          } while (v);
          return result;
       }

         • Uses integral division extensively
            ◦ (Actually: multiplication)


© 2012- Facebook. Do not redistribute.             17 / 33
Strength reduction: Example

       uint32_t digits10(uint64_t v) {
          uint32_t result = 1;
          for (;;) {
             if (v < 10) return result;
             if (v < 100) return result + 1;
             if (v < 1000) return result + 2;
             if (v < 10000) return result + 3;
             // Skip ahead by 4 orders of magnitude
             v /= 10000U;
             result += 4;
          }
       }

         • More comparisons and additions, fewer /=
         • (This is not loop unrolling!)
© 2012- Facebook. Do not redistribute.                18 / 33
Minimize Array Writes




© 2012- Facebook. Do not redistribute.        20 / 33
Minimize Array Writes: Why?



         •   Disables enregistering
         •   A write is really a read and a write
         •   Aliasing makes things difficult
         •   Maculates the cache



         • Generally just difficult to optimize




© 2012- Facebook. Do not redistribute.              21 / 33
Minimize Array Writes


       uint32_t u64ToAsciiClassic(uint64_t value, char* dst) {
          // Write backwards.
          auto start = dst;
          do {
             *dst++ = ’0’ + (value % 10);
             value /= 10;
          } while (value != 0);
          const uint32_t result = dst - start;
          // Reverse in place.
          for (dst--; dst > start; start++, dst--) {
             std::iter_swap(dst, start);
          }
          return result;
       }




© 2012- Facebook. Do not redistribute.                           22 / 33
Minimize Array Writes
         • Gambit: make one extra pass to compute
            length
       uint32_t uint64ToAscii(uint64_t v, char *const buffer) {
          auto const result = digits10(v);
          uint32_t pos = result - 1;
          while (v >= 10) {
             auto const q = v / 10;
             auto const r = static_cast<uint32_t>(v % 10);
             buffer[pos--] = ’0’ + r;
             v = q;
          }
          assert(pos == 0);
          // Last digit is trivial to handle
          *buffer = static_cast<uint32_t>(v) + ’0’;
          return result;
       }


© 2012- Facebook. Do not redistribute.                            23 / 33
Improvements




         •   Fewer array writes
         •   Regular access patterns
         •   Fast on small numbers
         •   Data dependencies reduced




© 2012- Facebook. Do not redistribute.   24 / 33
One More Pass




         • Reformulate digits10 as search
         • Convert two digits at a time




© 2012- Facebook. Do not redistribute.      26 / 33
uint32_t         digits10(uint64_t v) {
          if (v         < P01) return 1;
          if (v         < P02) return 2;
          if (v         < P03) return 3;
          if (v         < P12) {
             if         (v < P08) {
                        if (v < P06) {
                           if (v < P04) return 4;
                           return 5 + (v < P05);
                        }
                        return 7 + (v >= P07);
                   }
                   if (v < P10) {
                      return 9 + (v >= P09);
                   }
                   return 11 + (v >= P11);
             }
             return 12 + digits10(v / P12);
       }

© 2012- Facebook. Do not redistribute.              27 / 33
unsigned u64ToAsciiTable(uint64_t value, char* dst) {
          static const char digits[201] =
             "0001020304050607080910111213141516171819"
             "2021222324252627282930313233343536373839"
             "4041424344454647484950515253545556575859"
             "6061626364656667686970717273747576777879"
             "8081828384858687888990919293949596979899";
          uint32_t const length = digits10(value);
          uint32_t next = length - 1;
          while (value >= 100) {
             auto const i = (value % 100) * 2;
             value /= 100;
             dst[next] = digits[i + 1];
             dst[next - 1] = digits[i];
             next -= 2;
          }




© 2012- Facebook. Do not redistribute.                         28 / 33
// Handle last 1-2 digits
             if (value < 10) {
                dst[next] = ’0’ + uint32_t(value);
             } else {
                auto i = uint32_t(value) * 2;
                dst[next] = digits[i + 1];
                dst[next - 1] = digits[i];
             }
             return length;
       }




© 2012- Facebook. Do not redistribute.               29 / 33
Summary




© 2012- Facebook. Do not redistribute.             32 / 33
Summary




         • You can’t improve what you can’t measure
            ◦ Pro tip: You can’t measure what you don’t
              measure
         • Reduce strength
         • Minimize array writes




© 2012- Facebook. Do not redistribute.                    33 / 33

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Understanding of linux kernel memory model
Understanding of linux kernel memory modelUnderstanding of linux kernel memory model
Understanding of linux kernel memory model
 
Binary exploitation - AIS3
Binary exploitation - AIS3Binary exploitation - AIS3
Binary exploitation - AIS3
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01
 
Multithread & shared_ptr
Multithread & shared_ptrMultithread & shared_ptr
Multithread & shared_ptr
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machine
 
Linux SMEP bypass techniques
Linux SMEP bypass techniquesLinux SMEP bypass techniques
Linux SMEP bypass techniques
 
How A Compiler Works: GNU Toolchain
How A Compiler Works: GNU ToolchainHow A Compiler Works: GNU Toolchain
How A Compiler Works: GNU Toolchain
 
GCGC- CGCII 서버 엔진에 적용된 기술 (2) - Perfornance
GCGC- CGCII 서버 엔진에 적용된 기술 (2) - PerfornanceGCGC- CGCII 서버 엔진에 적용된 기술 (2) - Perfornance
GCGC- CGCII 서버 엔진에 적용된 기술 (2) - Perfornance
 
Interpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratchInterpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratch
 
Windows IOCP vs Linux EPOLL Performance Comparison
Windows IOCP vs Linux EPOLL Performance ComparisonWindows IOCP vs Linux EPOLL Performance Comparison
Windows IOCP vs Linux EPOLL Performance Comparison
 
LinuxのFull ticklessを試してみた
LinuxのFull ticklessを試してみたLinuxのFull ticklessを試してみた
LinuxのFull ticklessを試してみた
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
NDC 2017 하재승 NEXON ZERO (넥슨 제로) 점검없이 실시간으로 코드 수정 및 게임 정보 수집하기
NDC 2017 하재승 NEXON ZERO (넥슨 제로) 점검없이 실시간으로 코드 수정 및 게임 정보 수집하기NDC 2017 하재승 NEXON ZERO (넥슨 제로) 점검없이 실시간으로 코드 수정 및 게임 정보 수집하기
NDC 2017 하재승 NEXON ZERO (넥슨 제로) 점검없이 실시간으로 코드 수정 및 게임 정보 수집하기
 
Windows Registered I/O (RIO) vs IOCP
Windows Registered I/O (RIO) vs IOCPWindows Registered I/O (RIO) vs IOCP
Windows Registered I/O (RIO) vs IOCP
 
Introduction to gdb
Introduction to gdbIntroduction to gdb
Introduction to gdb
 
Modern C++의 타입 추론과 람다, 컨셉
Modern C++의 타입 추론과 람다, 컨셉Modern C++의 타입 추론과 람다, 컨셉
Modern C++의 타입 추론과 람다, 컨셉
 
NDC 11 자이언트 서버의 비밀
NDC 11 자이언트 서버의 비밀NDC 11 자이언트 서버의 비밀
NDC 11 자이언트 서버의 비밀
 
Multiplayer Game Sync Techniques through CAP theorem
Multiplayer Game Sync Techniques through CAP theoremMultiplayer Game Sync Techniques through CAP theorem
Multiplayer Game Sync Techniques through CAP theorem
 
Virtual Machine Constructions for Dummies
Virtual Machine Constructions for DummiesVirtual Machine Constructions for Dummies
Virtual Machine Constructions for Dummies
 
用十分鐘 向jserv學習作業系統設計
用十分鐘  向jserv學習作業系統設計用十分鐘  向jserv學習作業系統設計
用十分鐘 向jserv學習作業系統設計
 

Destaque

Stabilizer: Statistically Sound Performance Evaluation
Stabilizer: Statistically Sound Performance EvaluationStabilizer: Statistically Sound Performance Evaluation
Stabilizer: Statistically Sound Performance Evaluation
Emery Berger
 
Generic Programming Galore Using D
Generic Programming Galore Using DGeneric Programming Galore Using D
Generic Programming Galore Using D
Andrei Alexandrescu
 
Three, no, Four Cool Things About D
Three, no, Four Cool Things About DThree, no, Four Cool Things About D
Three, no, Four Cool Things About D
Andrei Alexandrescu
 
C++ idioms by example (Nov 2008)
C++ idioms by example (Nov 2008)C++ idioms by example (Nov 2008)
C++ idioms by example (Nov 2008)
Olve Maudal
 
Code Optimization
Code OptimizationCode Optimization
Code Optimization
guest9f8315
 

Destaque (18)

Dconf2015 d2 t4
Dconf2015 d2 t4Dconf2015 d2 t4
Dconf2015 d2 t4
 
Dconf2015 d2 t3
Dconf2015 d2 t3Dconf2015 d2 t3
Dconf2015 d2 t3
 
Solid C++ by Example
Solid C++ by ExampleSolid C++ by Example
Solid C++ by Example
 
Stabilizer: Statistically Sound Performance Evaluation
Stabilizer: Statistically Sound Performance EvaluationStabilizer: Statistically Sound Performance Evaluation
Stabilizer: Statistically Sound Performance Evaluation
 
ACCU Keynote by Andrei Alexandrescu
ACCU Keynote by Andrei AlexandrescuACCU Keynote by Andrei Alexandrescu
ACCU Keynote by Andrei Alexandrescu
 
Generic Programming Galore Using D
Generic Programming Galore Using DGeneric Programming Galore Using D
Generic Programming Galore Using D
 
Three, no, Four Cool Things About D
Three, no, Four Cool Things About DThree, no, Four Cool Things About D
Three, no, Four Cool Things About D
 
iOS 6 Exploitation 280 days later
iOS 6 Exploitation 280 days lateriOS 6 Exploitation 280 days later
iOS 6 Exploitation 280 days later
 
Deep C Programming
Deep C ProgrammingDeep C Programming
Deep C Programming
 
C++11
C++11C++11
C++11
 
C++ idioms by example (Nov 2008)
C++ idioms by example (Nov 2008)C++ idioms by example (Nov 2008)
C++ idioms by example (Nov 2008)
 
Insecure coding in C (and C++)
Insecure coding in C (and C++)Insecure coding in C (and C++)
Insecure coding in C (and C++)
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
Code Optimization
Code OptimizationCode Optimization
Code Optimization
 
Code generation
Code generationCode generation
Code generation
 
H2O - making HTTP better
H2O - making HTTP betterH2O - making HTTP better
H2O - making HTTP better
 
TDD in C - Recently Used List Kata
TDD in C - Recently Used List KataTDD in C - Recently Used List Kata
TDD in C - Recently Used List Kata
 
Deep C
Deep CDeep C
Deep C
 

Semelhante a Three Optimization Tips for C++

Database , 6 Query Introduction
Database , 6 Query Introduction Database , 6 Query Introduction
Database , 6 Query Introduction
Ali Usman
 
Disaster Recovery with MySQL and Tungsten
Disaster Recovery with MySQL and TungstenDisaster Recovery with MySQL and Tungsten
Disaster Recovery with MySQL and Tungsten
Jeff Mace
 
Tales from the Field
Tales from the FieldTales from the Field
Tales from the Field
MongoDB
 
SOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principlesSOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principles
Sergey Karpushin
 
Clean code, Feb 2012
Clean code, Feb 2012Clean code, Feb 2012
Clean code, Feb 2012
cobyst
 

Semelhante a Three Optimization Tips for C++ (20)

Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
6-Query_Intro (5).pdf
6-Query_Intro (5).pdf6-Query_Intro (5).pdf
6-Query_Intro (5).pdf
 
Writing Readable Code
Writing Readable CodeWriting Readable Code
Writing Readable Code
 
Database , 6 Query Introduction
Database , 6 Query Introduction Database , 6 Query Introduction
Database , 6 Query Introduction
 
How To Handle Your Tech Debt Better - Sean Moir
How To Handle Your Tech Debt Better - Sean MoirHow To Handle Your Tech Debt Better - Sean Moir
How To Handle Your Tech Debt Better - Sean Moir
 
Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!
Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!
Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!
 
Performance #5 cpu and battery
Performance #5  cpu and batteryPerformance #5  cpu and battery
Performance #5 cpu and battery
 
4 colin walls - self-testing in embedded systems
4   colin walls - self-testing in embedded systems4   colin walls - self-testing in embedded systems
4 colin walls - self-testing in embedded systems
 
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
 
Optimizing Browser Rendering
Optimizing Browser RenderingOptimizing Browser Rendering
Optimizing Browser Rendering
 
Disaster Recovery with MySQL and Tungsten
Disaster Recovery with MySQL and TungstenDisaster Recovery with MySQL and Tungsten
Disaster Recovery with MySQL and Tungsten
 
Tales from the Field
Tales from the FieldTales from the Field
Tales from the Field
 
Where Did All These Cycles Go?
Where Did All These Cycles Go?Where Did All These Cycles Go?
Where Did All These Cycles Go?
 
SOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principlesSOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principles
 
Cassandra introduction mars jug
Cassandra introduction mars jugCassandra introduction mars jug
Cassandra introduction mars jug
 
Clean code, Feb 2012
Clean code, Feb 2012Clean code, Feb 2012
Clean code, Feb 2012
 
Pragmatic Performance from NDC Oslo 2019
Pragmatic Performance from NDC Oslo 2019Pragmatic Performance from NDC Oslo 2019
Pragmatic Performance from NDC Oslo 2019
 
Interactive DSML Design
Interactive DSML DesignInteractive DSML Design
Interactive DSML Design
 
BRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkBRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning Talk
 
Cassandra On EC2
Cassandra On EC2Cassandra On EC2
Cassandra On EC2
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Three Optimization Tips for C++

  • 1. Three Optimization Tips for C++ Andrei Alexandrescu, Ph.D. Research Scientist, Facebook andrei.alexandrescu@fb.com © 2012- Facebook. Do not redistribute. 1 / 33
  • 2. This Talk • Basics • Reduce strength • Minimize array writes © 2012- Facebook. Do not redistribute. 2 / 33
  • 3. Things I Shouldn’t Even © 2012- Facebook. Do not redistribute. 3 / 33
  • 4. Today’s Computing Architectures • Extremely complex • Trade reproducible performance for average speed • Interrupts, multiprocessing are the norm • Dynamic frequency control is becoming common • Virtually impossible to get identical timings for experiments © 2012- Facebook. Do not redistribute. 4 / 33
  • 5. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions © 2012- Facebook. Do not redistribute. 5 / 33
  • 6. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” © 2012- Facebook. Do not redistribute. 5 / 33
  • 7. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” © 2012- Facebook. Do not redistribute. 5 / 33
  • 8. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” • “Data is faster than computation” © 2012- Facebook. Do not redistribute. 5 / 33
  • 9. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” • “Data is faster than computation” © 2012- Facebook. Do not redistribute. 5 / 33
  • 10. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” • “Data is faster than computation” • “Computation is faster than data” © 2012- Facebook. Do not redistribute. 5 / 33
  • 11. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” • “Data is faster than computation” • “Computation is faster than data” © 2012- Facebook. Do not redistribute. 5 / 33
  • 12. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” • “Data is faster than computation” • “Computation is faster than data” • The only good intuition: “I should time this.” © 2012- Facebook. Do not redistribute. 5 / 33
  • 13. Paradox Measuring gives you a leg up on experts who don’t need to measure © 2012- Facebook. Do not redistribute. 6 / 33
  • 14. Common Pitfalls • Measuring speed of debug builds • Different setup for baseline and measured ◦ Sequencing: heap allocator ◦ Warmth of cache, files, databases, DNS • Including ancillary work in measurement ◦ malloc, printf common • Mixtures: measure ta + tb , improve ta , conclude tb got improved • Optimize rare cases, pessimize others © 2012- Facebook. Do not redistribute. 7 / 33
  • 15. Optimizing Rare Cases © 2012- Facebook. Do not redistribute. 8 / 33
  • 16. More generalities • Prefer static linking and PDC • Prefer 64-bit code, 32-bit data • Prefer (32-bit) array indexing to pointers ◦ Prefer a[i++] to a[++i] • Prefer regular memory access patterns • Minimize flow, avoid data dependencies © 2012- Facebook. Do not redistribute. 9 / 33
  • 17. Storage Pecking Order • Use enum for integral constants • Use static const for other immutables ◦ Beware cache issues • Use stack for most variables • Globals: aliasing issues • thread_local slowest, use local caching ◦ 1 instruction in Windows, Linux ◦ 3-4 in OSX © 2012- Facebook. Do not redistribute. 10 / 33
  • 18. Reduce Strength © 2012- Facebook. Do not redistribute. 11 / 33
  • 19. Strength reduction • Speed hierarchy: ◦ comparisons ◦ (u)int add, subtract, bitops, shift ◦ FP add, sub (separate unit!) ◦ Indexed array access ◦ (u)int32 mul; FP mul ◦ FP division, remainder ◦ (u)int division, remainder © 2012- Facebook. Do not redistribute. 12 / 33
  • 20. Your Compiler Called I get it. a >>= 1 is the same as a /= 2. © 2012- Facebook. Do not redistribute. 13 / 33
  • 21. Integrals • Prefer 32-bit ints to all other sizes ◦ 64 bit may make some code slower ◦ 8, 16-bit computations use conversion to 32 bits and back ◦ Use small ints in arrays • Prefer unsigned to signed ◦ Except when converting to floating point • “Most numbers are small” © 2012- Facebook. Do not redistribute. 14 / 33
  • 22. Floating Point • Double precision as fast as single precision • Extended precision just a bit slower • Do not mix the three • 1-2 FP addition/subtraction units • 1-2 FP multiplication/division units • SSE accelerates throughput for certain computation kernels • ints→FPs cheap, FPs→ints expensive © 2012- Facebook. Do not redistribute. 15 / 33
  • 23. Advice Design algorithms to use minimum operation strength © 2012- Facebook. Do not redistribute. 16 / 33
  • 24. Strength reduction: Example • Digit count in base-10 representation uint32_t digits10(uint64_t v) { uint32_t result = 0; do { ++result; v /= 10; } while (v); return result; } • Uses integral division extensively ◦ (Actually: multiplication) © 2012- Facebook. Do not redistribute. 17 / 33
  • 25. Strength reduction: Example uint32_t digits10(uint64_t v) { uint32_t result = 1; for (;;) { if (v < 10) return result; if (v < 100) return result + 1; if (v < 1000) return result + 2; if (v < 10000) return result + 3; // Skip ahead by 4 orders of magnitude v /= 10000U; result += 4; } } • More comparisons and additions, fewer /= • (This is not loop unrolling!) © 2012- Facebook. Do not redistribute. 18 / 33
  • 26.
  • 27. Minimize Array Writes © 2012- Facebook. Do not redistribute. 20 / 33
  • 28. Minimize Array Writes: Why? • Disables enregistering • A write is really a read and a write • Aliasing makes things difficult • Maculates the cache • Generally just difficult to optimize © 2012- Facebook. Do not redistribute. 21 / 33
  • 29. Minimize Array Writes uint32_t u64ToAsciiClassic(uint64_t value, char* dst) { // Write backwards. auto start = dst; do { *dst++ = ’0’ + (value % 10); value /= 10; } while (value != 0); const uint32_t result = dst - start; // Reverse in place. for (dst--; dst > start; start++, dst--) { std::iter_swap(dst, start); } return result; } © 2012- Facebook. Do not redistribute. 22 / 33
  • 30. Minimize Array Writes • Gambit: make one extra pass to compute length uint32_t uint64ToAscii(uint64_t v, char *const buffer) { auto const result = digits10(v); uint32_t pos = result - 1; while (v >= 10) { auto const q = v / 10; auto const r = static_cast<uint32_t>(v % 10); buffer[pos--] = ’0’ + r; v = q; } assert(pos == 0); // Last digit is trivial to handle *buffer = static_cast<uint32_t>(v) + ’0’; return result; } © 2012- Facebook. Do not redistribute. 23 / 33
  • 31. Improvements • Fewer array writes • Regular access patterns • Fast on small numbers • Data dependencies reduced © 2012- Facebook. Do not redistribute. 24 / 33
  • 32.
  • 33. One More Pass • Reformulate digits10 as search • Convert two digits at a time © 2012- Facebook. Do not redistribute. 26 / 33
  • 34. uint32_t digits10(uint64_t v) { if (v < P01) return 1; if (v < P02) return 2; if (v < P03) return 3; if (v < P12) { if (v < P08) { if (v < P06) { if (v < P04) return 4; return 5 + (v < P05); } return 7 + (v >= P07); } if (v < P10) { return 9 + (v >= P09); } return 11 + (v >= P11); } return 12 + digits10(v / P12); } © 2012- Facebook. Do not redistribute. 27 / 33
  • 35. unsigned u64ToAsciiTable(uint64_t value, char* dst) { static const char digits[201] = "0001020304050607080910111213141516171819" "2021222324252627282930313233343536373839" "4041424344454647484950515253545556575859" "6061626364656667686970717273747576777879" "8081828384858687888990919293949596979899"; uint32_t const length = digits10(value); uint32_t next = length - 1; while (value >= 100) { auto const i = (value % 100) * 2; value /= 100; dst[next] = digits[i + 1]; dst[next - 1] = digits[i]; next -= 2; } © 2012- Facebook. Do not redistribute. 28 / 33
  • 36. // Handle last 1-2 digits if (value < 10) { dst[next] = ’0’ + uint32_t(value); } else { auto i = uint32_t(value) * 2; dst[next] = digits[i + 1]; dst[next - 1] = digits[i]; } return length; } © 2012- Facebook. Do not redistribute. 29 / 33
  • 37.
  • 38.
  • 39. Summary © 2012- Facebook. Do not redistribute. 32 / 33
  • 40. Summary • You can’t improve what you can’t measure ◦ Pro tip: You can’t measure what you don’t measure • Reduce strength • Minimize array writes © 2012- Facebook. Do not redistribute. 33 / 33