Three Optimization Tips for C++

                                Andrei Alexandrescu, Ph.D.
                                   Research Scientist, Facebook
                                  andrei.alexandrescu@fb.com




© 2012- Facebook. Do not redistribute.                            1 / 33
This Talk




         • Basics
         • Reduce strength
         • Minimize array writes




Things I Shouldn’t Even




Today’s Computing Architectures



         • Extremely complex
         • Trade reproducible performance for average
           speed
         • Interrupts, multiprocessing are the norm
         • Dynamic frequency control is becoming
           common
         • Virtually impossible to get identical timings
           for experiments




Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”
         • “Computation is faster than data”


         • The only good intuition: “I should time this.”



Paradox




      Measuring gives you a
      leg up on experts who
      don’t need to measure


Common Pitfalls


         • Measuring speed of debug builds
         • Different setup for baseline and measured
             ◦ Sequencing: heap allocator
             ◦ Warmth of cache, files, databases, DNS
         • Including ancillary work in measurement
             ◦ malloc, printf common
         • Mixtures: measure t_a + t_b, improve t_a,
           conclude t_b got improved
         • Optimize rare cases, pessimize others
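
A minimal timing sketch that tries to sidestep some of the pitfalls above: it warms up before measuring, keeps ancillary work out of the timed region, and stops the optimizer from deleting the measured code. The names `keep` and `bestTimeNs` are hypothetical helpers, not from the talk, and reporting the fastest run is only one of several reasonable policies.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical sink (not from the slides): storing to a volatile keeps the
// optimizer from discarding the work being timed.
inline void keep(uint64_t x) {
  static volatile uint64_t sink;
  sink = x;
}

// Hypothetical helper: calls f() once to warm up, then `runs` more times,
// and returns the fastest single run in nanoseconds.
template <class F>
double bestTimeNs(F f, int runs = 100) {
  using Clock = std::chrono::steady_clock;
  f();  // warm-up call, excluded from the measurement
  double best = std::numeric_limits<double>::infinity();
  for (int i = 0; i < runs; ++i) {
    auto const t0 = Clock::now();
    f();
    auto const t1 = Clock::now();
    double const ns =
        std::chrono::duration<double, std::nano>(t1 - t0).count();
    if (ns < best) best = ns;
  }
  return best;
}
```

A call might look like `bestTimeNs([&] { keep(digits10(v)); })`, built with optimizations on and with any printing done after the call returns, so that `printf` stays outside the timed region.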



Optimizing Rare Cases




More generalities




         • Prefer static linking and PDC (position-dependent code)
         • Prefer 64-bit code, 32-bit data
         • Prefer (32-bit) array indexing to pointers
            ◦ Prefer a[i++] to a[++i]
         • Prefer regular memory access patterns
         • Minimize control flow, avoid data dependencies




Storage Pecking Order



         • Use enum for integral constants
         • Use static const for other immutables
            ◦ Beware cache issues
         • Use stack for most variables
         • Globals: aliasing issues
         • thread_local slowest, use local caching
            ◦ 1 instruction in Windows, Linux
            ◦ 3-4 in OSX
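
One way to read "use local caching" for `thread_local`, as a sketch only: copy the value into a stack variable, work on the copy, and write it back once. The counter `tlsCounter` and both function names are hypothetical. An optimizer may well reduce both loops to the same code, so this shows the shape of the idiom and is not a measured win.

```cpp
#include <cstdint>

// Hypothetical per-thread counter (not from the slides).
thread_local uint64_t tlsCounter = 0;

// Touches the thread_local on every iteration.
void bumpDirect(uint32_t n) {
  for (uint32_t i = 0; i < n; ++i) ++tlsCounter;
}

// Reads the thread_local once, works on a stack copy, writes back once.
void bumpCached(uint32_t n) {
  uint64_t local = tlsCounter;
  for (uint32_t i = 0; i < n; ++i) ++local;
  tlsCounter = local;
}
```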




Reduce Strength




Strength reduction



         • Speed hierarchy:
            ◦ comparisons
            ◦ (u)int add, subtract, bitops, shift
            ◦ FP add, sub (separate unit!)
            ◦ Indexed array access
            ◦ (u)int32 mul; FP mul
            ◦ FP division, remainder
            ◦ (u)int division, remainder




Your Compiler Called




        I get it. a >>= 1 is the
           same as a /= 2.


Integrals



         • Prefer 32-bit ints to all other sizes
            ◦ 64 bit may make some code slower
            ◦ 8, 16-bit computations use conversion to
              32 bits and back
            ◦ Use small ints in arrays
         • Prefer unsigned to signed
            ◦ Except when converting to floating point
         • “Most numbers are small”
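
A small sketch of one reason unsigned can be cheaper than signed, using division by 2 as the example. For unsigned values a right shift and a division always agree, so a single shift suffices. Signed division rounds toward zero, so a bare shift would give the wrong answer for negative odd values and the compiler has to emit a fix-up. The function names are hypothetical.

```cpp
#include <cstdint>

// Unsigned: shifting right by 1 and dividing by 2 always agree,
// so the compiler is free to emit a single shift.
inline bool shiftEqualsDivide(uint32_t a) { return (a >> 1) == (a / 2); }

// Signed: division rounds toward zero, so -3 / 2 is -1, whereas a
// flooring shift would produce -2. Extra instructions are needed.
inline int32_t halveSigned(int32_t a) { return a / 2; }
```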




Floating Point



         • Double precision as fast as single precision
         • Extended precision just a bit slower
         • Do not mix the three
         • 1-2 FP addition/subtraction units
         • 1-2 FP multiplication/division units
         • SSE accelerates throughput for certain
           computation kernels
         • ints→FPs cheap, FPs→ints expensive




Advice




      Design algorithms to
     use minimum operation
            strength


Strength reduction: Example


         • Digit count in base-10 representation
       uint32_t digits10(uint64_t v) {
          uint32_t result = 0;
          do {
             ++result;
             v /= 10;
          } while (v);
          return result;
       }

         • Uses integral division extensively
            ◦ (Actually: multiplication, since compilers replace
              division by a constant with a multiply and shift)
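
The "actually: multiplication" remark can be illustrated for the 32-bit case, which is simpler than the 64-bit `v` on the slide. The sketch below divides by 10 with a multiply and a shift; the function name `div10ByMultiply` is hypothetical, and the exact code a given compiler emits may differ.

```cpp
#include <cstdint>

// Computes x / 10 without a divide instruction.
// 0xCCCCCCCD is ceil(2^35 / 10); the 64-bit product shifted right by 35
// equals x / 10 for every 32-bit x.
inline uint32_t div10ByMultiply(uint32_t x) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(x) * 0xCCCCCCCDull) >> 35);
}
```

Even in this form each `v /= 10` still costs a multiplication, which sits well above comparisons and additions in the speed hierarchy.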


Strength reduction: Example

       uint32_t digits10(uint64_t v) {
          uint32_t result = 1;
          for (;;) {
             if (v < 10) return result;
             if (v < 100) return result + 1;
             if (v < 1000) return result + 2;
             if (v < 10000) return result + 3;
             // Skip ahead by 4 orders of magnitude
             v /= 10000U;
             result += 4;
          }
       }

         • More comparisons and additions, fewer /=
         • (This is not loop unrolling!)
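
Before timing the two versions against each other, it is worth checking that they agree. The sketch below renames them `digits10Loop` and `digits10Skip4` (hypothetical names, so both can live in one file) and compares them around every power of ten that fits in 64 bits.

```cpp
#include <cstdint>

// Baseline from the earlier slide: one division per digit.
uint32_t digits10Loop(uint64_t v) {
  uint32_t result = 0;
  do {
    ++result;
    v /= 10;
  } while (v);
  return result;
}

// Reduced-strength version: one division per four digits.
uint32_t digits10Skip4(uint64_t v) {
  uint32_t result = 1;
  for (;;) {
    if (v < 10) return result;
    if (v < 100) return result + 1;
    if (v < 1000) return result + 2;
    if (v < 10000) return result + 3;
    v /= 10000U;  // skip ahead by 4 orders of magnitude
    result += 4;
  }
}

// Returns true when both versions agree at 10^k and 10^k - 1
// for k = 0 .. 19, and at the largest 64-bit value.
bool digits10VersionsAgree() {
  uint64_t p = 1;
  for (int k = 0; k < 20; ++k) {
    if (digits10Loop(p) != digits10Skip4(p)) return false;
    if (digits10Loop(p - 1) != digits10Skip4(p - 1)) return false;
    if (k < 19) p *= 10;  // 10^19 is the largest power of ten in uint64_t
  }
  return digits10Loop(UINT64_MAX) == digits10Skip4(UINT64_MAX);
}
```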
Minimize Array Writes




Minimize Array Writes: Why?



         •   Disables enregistering
         •   A write is really a read and a write
         •   Aliasing makes things difficult
         •   Maculates the cache



         • Generally just difficult to optimize




Minimize Array Writes


       uint32_t u64ToAsciiClassic(uint64_t value, char* dst) {
          // Write backwards.
          auto start = dst;
          do {
             *dst++ = '0' + (value % 10);
             value /= 10;
          } while (value != 0);
          const uint32_t result = dst - start;
          // Reverse in place.
          for (dst--; dst > start; start++, dst--) {
             std::iter_swap(dst, start);
          }
          return result;
       }




Minimize Array Writes
         • Gambit: make one extra pass to compute
            length
       uint32_t uint64ToAscii(uint64_t v, char *const buffer) {
          auto const result = digits10(v);
          uint32_t pos = result - 1;
          while (v >= 10) {
             auto const q = v / 10;
             auto const r = static_cast<uint32_t>(v % 10);
             buffer[pos--] = '0' + r;
             v = q;
          }
          assert(pos == 0);
          // Last digit is trivial to handle
          *buffer = static_cast<uint32_t>(v) + '0';
          return result;
       }


Improvements




         •   Fewer array writes
         •   Regular access patterns
         •   Fast on small numbers
         •   Data dependencies reduced
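
To make "fewer array writes" concrete, the sketch below instruments both conversions and counts stores into the destination buffer. The struct `WriteCount` and both function names are hypothetical, and the digit count uses a plain loop to keep the block short. Counting stores says nothing about actual speed, which still has to be timed.

```cpp
#include <algorithm>
#include <cstdint>

struct WriteCount {
  uint32_t length;  // digits produced
  uint32_t writes;  // stores into the destination buffer
};

// Classic approach: write digits in reverse order, then swap in place.
WriteCount classicWrites(uint64_t value, char* dst) {
  uint32_t writes = 0;
  char* start = dst;
  do {
    *dst++ = static_cast<char>('0' + value % 10);
    ++writes;
    value /= 10;
  } while (value != 0);
  uint32_t const length = static_cast<uint32_t>(dst - start);
  for (dst--; dst > start; start++, dst--) {
    std::iter_swap(dst, start);
    writes += 2;  // each swap stores two characters
  }
  return {length, writes};
}

// Length-first approach: every digit is stored once, in its final slot.
WriteCount lengthFirstWrites(uint64_t v, char* buffer) {
  uint32_t length = 1;
  for (uint64_t t = v; t >= 10; t /= 10) ++length;
  uint32_t pos = length - 1;
  uint32_t writes = 0;
  while (v >= 10) {
    buffer[pos--] = static_cast<char>('0' + v % 10);
    ++writes;
    v /= 10;
  }
  buffer[0] = static_cast<char>('0' + v);
  ++writes;
  return {length, writes};
}
```

For the five-digit input 12345 the classic version performs 9 stores (5 digits plus 2 swaps) while the length-first version performs 5.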




One More Pass




         • Reformulate digits10 as search
         • Convert two digits at a time




       uint32_t digits10(uint64_t v) {
          // P01 ... P12 are taken to be the powers of ten 10^1 ... 10^12
          if (v < P01) return 1;
          if (v < P02) return 2;
          if (v < P03) return 3;
          if (v < P12) {
             if (v < P08) {
                if (v < P06) {
                   if (v < P04) return 4;
                   return 5 + (v >= P05);
                }
                return 7 + (v >= P07);
             }
             if (v < P10) {
                return 9 + (v >= P09);
             }
             return 11 + (v >= P11);
          }
          return 12 + digits10(v / P12);
       }

       unsigned u64ToAsciiTable(uint64_t value, char* dst) {
          static const char digits[201] =
             "0001020304050607080910111213141516171819"
             "2021222324252627282930313233343536373839"
             "4041424344454647484950515253545556575859"
             "6061626364656667686970717273747576777879"
             "8081828384858687888990919293949596979899";
          uint32_t const length = digits10(value);
          uint32_t next = length - 1;
          while (value >= 100) {
             auto const i = (value % 100) * 2;
             value /= 100;
             dst[next] = digits[i + 1];
             dst[next - 1] = digits[i];
             next -= 2;
          }
          // Handle last 1-2 digits
          if (value < 10) {
             dst[next] = '0' + uint32_t(value);
          } else {
             auto i = uint32_t(value) * 2;
             dst[next] = digits[i + 1];
             dst[next - 1] = digits[i];
          }
          return length;
       }
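
The pieces from the last slides can be put together into one compilable sketch. The constants `P01` ... `P12` are not defined on the slides, so they are assumed here to be the powers of ten, and `digits10` uses `>=` in every boolean term so that five-digit values count as five. The wrapper `toDecimal` is a hypothetical convenience, not part of the talk.

```cpp
#include <cstdint>
#include <string>

// Assumed definitions: P0k is 10^k.
constexpr uint64_t P01 = 10ULL, P02 = 100ULL, P03 = 1000ULL, P04 = 10000ULL,
                   P05 = 100000ULL, P06 = 1000000ULL, P07 = 10000000ULL,
                   P08 = 100000000ULL, P09 = 1000000000ULL,
                   P10 = 10000000000ULL, P11 = 100000000000ULL,
                   P12 = 1000000000000ULL;

// Digit count as a search over powers of ten instead of repeated division.
uint32_t digits10(uint64_t v) {
  if (v < P01) return 1;
  if (v < P02) return 2;
  if (v < P03) return 3;
  if (v < P12) {
    if (v < P08) {
      if (v < P06) {
        if (v < P04) return 4;
        return 5 + (v >= P05);
      }
      return 7 + (v >= P07);
    }
    if (v < P10) return 9 + (v >= P09);
    return 11 + (v >= P11);
  }
  return 12 + digits10(v / P12);
}

// Writes the decimal digits of value into dst (no terminator) and
// returns how many characters were written. Two digits per division.
uint32_t u64ToAsciiTable(uint64_t value, char* dst) {
  static const char digits[201] =
      "0001020304050607080910111213141516171819"
      "2021222324252627282930313233343536373839"
      "4041424344454647484950515253545556575859"
      "6061626364656667686970717273747576777879"
      "8081828384858687888990919293949596979899";
  uint32_t const length = digits10(value);
  uint32_t next = length - 1;
  while (value >= 100) {
    auto const i = (value % 100) * 2;
    value /= 100;
    dst[next] = digits[i + 1];
    dst[next - 1] = digits[i];
    next -= 2;
  }
  if (value < 10) {  // one digit left
    dst[next] = static_cast<char>('0' + value);
  } else {           // two digits left
    auto const i = static_cast<uint32_t>(value) * 2;
    dst[next] = digits[i + 1];
    dst[next - 1] = digits[i];
  }
  return length;
}

// Hypothetical wrapper for convenient use and testing.
std::string toDecimal(uint64_t value) {
  char buf[20];  // 2^64 - 1 has 20 decimal digits
  uint32_t const len = u64ToAsciiTable(value, buf);
  return std::string(buf, len);
}
```

The caller owns the buffer and must provide room for up to 20 characters; the function does not append a null terminator.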




Summary




         • You can’t improve what you can’t measure
            ◦ Pro tip: You can’t measure what you don’t
              measure
         • Reduce strength
         • Minimize array writes



