MEMORY OPTIMIZATION Christer Ericson Sony Computer Entertainment, Santa Monica (christer_ericson@playstation.sony.com)
Talk contents 1/2
Talk contents 2/2
Problem statement
Need more justification? 1/3 Instruction parallelism: SIMD instructions consume data at 2-8 times the rate of normal instructions!
Need more justification? 2/3 Proebsting's law: improvements to compiler technology double program performance every ~18 years! Corollary: don't expect the compiler to do it for you!
Need more justification? 3/3 (on Moore's law)
Brief cache review
The memory hierarchy Roughly:
    CPU (registers)   1 cycle
    L1 cache          ~1-5 cycles
    L2 cache          ~5-20 cycles
    Main memory       ~40-100 cycles
Some cache specs
                L1 cache (I/D)      L2 cache
    PS2         16K/8K† 2-way       N/A
    GameCube    32K/32K‡ 8-way      256K 2-way unified
    XBOX        16K/16K 4-way       128K 8-way unified
    PC          ~32-64K             ~128-512K
Foes: the 3 C's of cache misses (compulsory, capacity, and conflict misses)
Friends: Introducing the 3 R's
                Compulsory   Capacity   Conflict
    Rearrange   X            (x)        X
    Reduce      X            X          (x)
    Reuse                    (x)        X
Measuring cache utilization
Code cache optimization 1/2
Code cache optimization 2/2
Data cache optimization
Prefetching and preloading
Software prefetching

    // Loop through and process all 4n elements
    for (int i = 0; i < 4 * n; i++)
        Process(elem[i]);

    const int kLookAhead = 4;  // Some elements ahead
    for (int i = 0; i < 4 * n; i += 4) {
        Prefetch(elem[i + kLookAhead]);
        Process(elem[i + 0]);
        Process(elem[i + 1]);
        Process(elem[i + 2]);
        Process(elem[i + 3]);
    }
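Prefetch() above is a placeholder; the slides do not name a specific intrinsic. As a rough sketch, on GCC or Clang it could be mapped to the portable prefetch builtin, with the loop then passing the element's address:

    // Sketch only (assumes GCC/Clang): second argument 0 = prefetch for read,
    // third argument 3 = high temporal locality (keep in all cache levels).
    #define Prefetch(addr) __builtin_prefetch((addr), 0, 3)

    // Used as: Prefetch(&elem[i + kLookAhead]);

The lookahead distance (kLookAhead) has to be tuned so the prefetched line arrives before the loop reaches it, without being evicted again in the meantime.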
Greedy prefetching

    void PreorderTraversal(Node *pNode) {
        // Greedily prefetch left traversal path
        Prefetch(pNode->left);
        // Process the current node
        Process(pNode);
        // Greedily prefetch right traversal path
        Prefetch(pNode->right);
        // Recursively visit left then right subtree
        PreorderTraversal(pNode->left);
        PreorderTraversal(pNode->right);
    }
Preloading (pseudo-prefetch)

    Elem a = elem[0];
    for (int i = 0; i < 4 * n; i += 4) {
        Elem e = elem[i + 4];  // Cache miss, non-blocking
        Elem b = elem[i + 1];  // Cache hit
        Elem c = elem[i + 2];  // Cache hit
        Elem d = elem[i + 3];  // Cache hit
        Process(a);
        Process(b);
        Process(c);
        Process(d);
        a = e;
    }

(NB: This code reads one element beyond the end of the elem array.)
Structures
Field reordering
Fields that are accessed together should be stored together: Foo() only touches key and pNext, so moving pNext up next to key keeps the traversal data in one cache line instead of two.

    void Foo(S *p, void *key, int k) {
        while (p) {
            if (p->key == key) {
                p->count[k]++;
                break;
            }
            p = p->pNext;
        }
    }

Before:
    struct S {
        void *key;
        int count[20];
        S *pNext;
    };

After:
    struct S {
        void *key;
        S *pNext;
        int count[20];
    };
Hot/cold splitting
Split rarely accessed (cold) fields off into a separate structure, reached through a pointer from the hot part:

Hot fields:
    struct S {
        void *key;
        S *pNext;
        S2 *pCold;
    };

Cold fields:
    struct S2 {
        int count[10];
    };
Hot/cold splitting
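A small sketch (an assumption, not from the slides) of how the split version might be allocated, keeping the cold block in a separate allocation so it never shares cache lines with the hot fields:

    struct S2 { int count[10]; };                   // cold fields
    struct S  { void *key; S *pNext; S2 *pCold; };  // hot fields

    // Hypothetical constructor: the cold part would ideally come from its
    // own pool, so arrays of S stay small and densely packed.
    S *NewS(void *key) {
        S *p = new S;
        p->key   = key;
        p->pNext = 0;
        p->pCold = new S2();  // zero-initialized cold block, allocated elsewhere
        return p;
    }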
Beware compiler padding
Decreasing size!

    struct X {          // as declared
        int8  a;
        int64 b;
        int8  c;
        int16 d;
        int64 e;
        float f;
    };

    struct Y {          // what the compiler actually lays out
        int8  a, pad_a[7];
        int64 b;
        int8  c, pad_c[1];
        int16 d, pad_d[2];
        int64 e;
        float f, pad_f[1];
    };

    struct Z {          // fields sorted by decreasing size
        int64 b;
        int64 e;
        float f;
        int16 d;
        int8  a;
        int8  c;
    };
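To see what the compiler actually did, sizeof and offsetof can simply be printed; a minimal sketch, assuming the fixed-width typedefs below (the slides do not define int8/int16/int64), with typical values noted for a target where int64 is 8-byte aligned:

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef int8_t  int8;   // assumed typedefs, not from the slides
    typedef int16_t int16;
    typedef int64_t int64;

    struct X { int8 a; int64 b; int8 c; int16 d; int64 e; float f; };
    struct Z { int64 b; int64 e; float f; int16 d; int8 a; int8 c; };

    int main(void) {
        // Typically 40 vs 24 bytes when int64 is 8-byte aligned.
        printf("sizeof(X) = %zu, sizeof(Z) = %zu\n",
               sizeof(struct X), sizeof(struct Z));
        // Padding inserted after 'a' in X (typically 7 bytes).
        printf("padding after a in X: %zu bytes\n",
               offsetof(struct X, b) - sizeof(int8));
        return 0;
    }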
Cache performance analysis
Tree data structures
Breadth-first order
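For a complete (or nearly complete) binary tree, breadth-first order means storing the nodes level by level in one array: children are found by index arithmetic, no child pointers are stored, and the top levels, which every query touches, pack into a few cache lines. A minimal sketch with illustrative names (an assumption, not from the slides):

    // Implicit breadth-first layout: the root is nodes[0] and
    // node i's children live at 2*i + 1 and 2*i + 2.
    static int Left(int i)   { return 2 * i + 1; }
    static int Right(int i)  { return 2 * i + 2; }
    static int Parent(int i) { return (i - 1) / 2; }

    // A node is a leaf when its left child index falls outside the array.
    static int IsLeaf(int i, int numNodes) { return Left(i) >= numNodes; }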
Depth-first order
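Depth-first (preorder) order is the other common linearization: each node's left child is stored immediately after it, so descending the left spine is a purely sequential memory access, and only the right child needs an explicit link. A sketch under those assumptions (field names are illustrative, not from the slides):

    // Preorder-linearized node: the left child, when present, is the next
    // array element (i + 1); the right child's index is stored explicitly.
    typedef struct {
        int   rightChild;  // index of right child, or -1 if there is none
        int   hasLeft;     // non-zero if element i + 1 is this node's left child
        float data;        // per-node payload
    } DfsNode;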
van Emde Boas layout
A compact static k-d tree
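One common way to make a static k-d tree node compact is to pack the split axis and a leaf flag into the low bits of a single 32-bit word, with the rest of the word holding either the split value or an index into separate leaf data; the layout and names below are illustrative assumptions, not taken from the slides:

    #include <stdint.h>

    // 32-bit node: bits 0-1 = 0/1/2 for the split axis (x/y/z), 3 = leaf.
    // Bits 2-31 hold either a quantized split value or a leaf-data index.
    // Children can live at implicit array positions (e.g. 2*i+1 and 2*i+2),
    // so no child pointers need to be stored.
    typedef struct { uint32_t bits; } KdNode;

    static int      KdIsLeaf(KdNode n)  { return (n.bits & 3u) == 3u; }
    static int      KdAxis(KdNode n)    { return (int)(n.bits & 3u); }
    static uint32_t KdPayload(KdNode n) { return n.bits >> 2; }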
Linearization caching
Memory allocation policy
The curse of aliasing
Aliasing is multiple references to the same storage location:

    int n;
    int *p1 = &n;
    int *p2 = &n;

Aliasing is also missed opportunities for optimization. What value is returned here? Who knows!

    int Foo(int *a, int *b) {
        *a = 1;
        *b = 2;
        return *a;
    }
The curse of aliasing
How do we do 'anti-aliasing'?
Matrix multiplication 1/3
Consider optimizing a 2x2 matrix multiplication:

    void Mat22mul(float a[2][2], float b[2][2], float c[2][2]) {
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                a[i][j] = 0.0f;
                for (int k = 0; k < 2; k++)
                    a[i][j] += b[i][k] * c[k][j];
            }
        }
    }

How do we typically optimize it? Right, unrolling!
Matrix multiplication 2/3
Straightforward unrolling results in this:

    // 16 memory reads, 4 writes
    void Mat22mul(float a[2][2], float b[2][2], float c[2][2]) {
        a[0][0] = b[0][0]*c[0][0] + b[0][1]*c[1][0];
        a[0][1] = b[0][0]*c[0][1] + b[0][1]*c[1][1];  // (1)
        a[1][0] = b[1][0]*c[0][0] + b[1][1]*c[1][0];  // (2)
        a[1][1] = b[1][0]*c[0][1] + b[1][1]*c[1][1];  // (3)
    }

But because a may alias b and c, each write to a forces the remaining inputs to be refetched.
Matrix multiplication 3/3
A correct approach is instead to consume all inputs before producing outputs:

    // 8 memory reads, 4 writes
    void Mat22mul(float a[2][2], float b[2][2], float c[2][2]) {
        float b00 = b[0][0], b01 = b[0][1];
        float b10 = b[1][0], b11 = b[1][1];
        float c00 = c[0][0], c01 = c[0][1];
        float c10 = c[1][0], c11 = c[1][1];

        a[0][0] = b00*c00 + b01*c10;
        a[0][1] = b00*c01 + b01*c11;
        a[1][0] = b10*c00 + b11*c10;
        a[1][1] = b10*c01 + b11*c11;
    }
Abstraction penalty problem
C++ abstraction penalty
C++ abstraction penalty
Pointer members in classes may alias other members:

    class Buf {
    public:
        void Clear() {
            for (int i = 0; i < numVals; i++)
                pBuf[i] = 0;
        }
    private:
        int numVals, *pBuf;
    };

numVals is not a local variable and may be aliased by pBuf, so the code is likely to refetch numVals each iteration!
C++ abstraction penalty
We know that aliasing won't happen, and can manually solve the aliasing issue by writing the code as:

    class Buf {
    public:
        void Clear() {
            for (int i = 0, n = numVals; i < n; i++)
                pBuf[i] = 0;
        }
    private:
        int numVals, *pBuf;
    };
C++ abstraction penalty
Since pBuf[i] can only alias numVals in the first iteration, a quality compiler can fix this problem by peeling the loop once, turning it into:

    void Clear() {
        if (numVals >= 1) {
            pBuf[0] = 0;
            for (int i = 1, n = numVals; i < n; i++)
                pBuf[i] = 0;
        }
    }

Q: Does your compiler do this optimization?!
Type-based alias analysis: use language types to disambiguate memory references!
Type-based alias analysis
Compatibility of C/C++ types
What TBAA can do for you
It can turn this (possible aliasing between v[i] and *n):

    void Foo(float *v, int *n) {
        for (int i = 0; i < *n; i++)
            v[i] += 1.0f;
    }

into this (no aliasing possible, so fetch *n once!):

    void Foo(float *v, int *n) {
        int t = *n;
        for (int i = 0; i < t; i++)
            v[i] += 1.0f;
    }
What TBAA can also do

Illegal C/C++ code!
    uint32 i;
    float f;
    i = *((uint32 *)&f);

Allowed by gcc
    uint32 i;
    union { float f; uint32 i; } u;
    u.f = f;
    i = u.i;

Required by standard
    uint32 i;
    union { float f; uchar8 c[4]; } u;
    u.f = f;
    i = (u.c[3]<<24L) + (u.c[2]<<16L) + ...;
Restrict-qualified pointers
Using the restrict keyword
Given this code:

    void Foo(float v[], float *c, int n) {
        for (int i = 0; i < n; i++)
            v[i] = *c + 1.0f;
    }

You really want the compiler to treat it as if written:

    void Foo(float v[], float *c, int n) {
        float tmp = *c + 1.0f;
        for (int i = 0; i < n; i++)
            v[i] = tmp;
    }

But because of possible aliasing it cannot!
Using the restrict keyword
For example, the code might be called as:

    float a[10];
    a[4] = 0.0f;
    Foo(a, &a[4], 10);

giving for the first version:

    v[] = 1, 1, 1, 1, 1, 2, 2, 2, 2, 2

and for the second version:

    v[] = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

The compiler must be conservative, and cannot perform the optimization!
Solving the aliasing problem
The fix? Declaring the output as restrict:

    void Foo(float * restrict v, float *c, int n) {
        for (int i = 0; i < n; i++)
            v[i] = *c + 1.0f;
    }
'const' doesn't help
Some might think this would work:

    void Foo(float v[], const float *c, int n) {
        for (int i = 0; i < n; i++)
            v[i] = *c + 1.0f;
    }

Since *c is const, v[i] cannot write to it, right? Wrong: const only promises that *c will not be written through c, not that it cannot be modified through v.
SIMD + restrict = TRUE
Without restrict, stores to a may alias the loads from b and c, so the operations must be performed sequentially:

    void VecAdd(int *a, int *b, int *c) {
        for (int i = 0; i < 4; i++)
            a[i] = b[i] + c[i];
    }

With restrict, the loads and stores are independent, and the operations can be performed in parallel!

    void VecAdd(int * restrict a, int *b, int *c) {
        for (int i = 0; i < 4; i++)
            a[i] = b[i] + c[i];
    }
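As a hedged illustration (not from the slides) of what "performed in parallel" can mean in practice, the restrict-qualified version maps directly onto one 4-wide integer add, written here with x86 SSE2 intrinsics and assuming 16-byte-aligned arrays:

    #include <emmintrin.h>   // SSE2

    // Sketch only: with 'restrict' the compiler may auto-vectorize the loop
    // into something equivalent to this hand-written version.
    void VecAdd(int * restrict a, const int *b, const int *c) {
        __m128i vb = _mm_load_si128((const __m128i *)b);
        __m128i vc = _mm_load_si128((const __m128i *)c);
        _mm_store_si128((__m128i *)a, _mm_add_epi32(vb, vc));
    }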
Restrict-qualified pointers
Tips for avoiding aliasing
That’s it! – Resources 1/2
Resources 2/2