SlideShare uma empresa Scribd logo
1 de 60
MEMORYMEMORY
OPTIMIZATIONOPTIMIZATION
Christer EricsonChrister Ericson
Sony Computer Entertainment, SantaSony Computer Entertainment, Santa
MonicaMonica
(christer_ericson@playstation.sony.com)(christer_ericson@playstation.sony.com)
Talk contents 1/2Talk contents 1/2
â–șProblem statementProblem statement
 Why “memory optimization?”Why “memory optimization?”
â–șBrief architecture overviewBrief architecture overview
 The memory hierarchyThe memory hierarchy
â–șOptimizing for (code and) data cacheOptimizing for (code and) data cache
 General suggestionsGeneral suggestions
 Data structuresData structures
â–șPrefetching and preloadingPrefetching and preloading
â–șStructure layoutStructure layout
â–șTree structuresTree structures
â–șLinearization cachingLinearization caching
â–ș


Talk contents 2/2Talk contents 2/2
â–ș


â–șAliasingAliasing
 Abstraction penalty problemAbstraction penalty problem
 Alias analysis (type-based)Alias analysis (type-based)
 ‘‘restrict’ pointersrestrict’ pointers
 Tips for reducing aliasingTips for reducing aliasing
Problem statementProblem statement
â–ș For the last 20-something years
For the last 20-something years

 CPU speeds have increased ~60%/yearCPU speeds have increased ~60%/year
 Memory speeds only decreased ~10%/yearMemory speeds only decreased ~10%/year
â–ș Gap covered by use of cache memoryGap covered by use of cache memory
â–ș Cache is under-exploitedCache is under-exploited
 Diminishing returns for larger cachesDiminishing returns for larger caches
â–ș Inefficient cache use = lower performanceInefficient cache use = lower performance
 How increase cache utilization? Cache-awareness!How increase cache utilization? Cache-awareness!
Need more justification? 1/3Need more justification? 1/3
SIMD instructions consume
data at 2-8 times the rate
of normal instructions!
Instruction parallelism:
Need more justification? 2/3Need more justification? 2/3
Improvements to
compiler technology
double program performance
every ~18 years!
Proebsting’s law:
Corollary: Don’t expect the compiler to do it for you!Corollary: Don’t expect the compiler to do it for you!
Need more justification? 3/3Need more justification? 3/3
On Moore’s law:
â–șConsoles don’t follow it (as
such)
 Fixed hardware
 2nd/3rd generation titles must get
improvements from somewhere
Brief cache reviewBrief cache review
â–ș CachesCaches
 Code cache for instructions, data cache for dataCode cache for instructions, data cache for data
 Forms a memory hierarchyForms a memory hierarchy
â–ș Cache linesCache lines
 Cache divided into cache lines of ~32/64 bytes eachCache divided into cache lines of ~32/64 bytes each
 Correct unit in which to count memory accessesCorrect unit in which to count memory accesses
â–ș Direct-mappedDirect-mapped
 For n KB cache, bytes at k, k+n, k+2n, 
 map to sameFor n KB cache, bytes at k, k+n, k+2n, 
 map to same
cache linecache line
â–ș N-way set-associativeN-way set-associative
 Logical cache line corresponds to N physical linesLogical cache line corresponds to N physical lines
 Helps minimize cache line thrashingHelps minimize cache line thrashing
The memory hierarchyThe memory hierarchy
Main memory
L2 cache
L1 cache
CPU
~1-5 cycles
~5-20 cycles
~40-100 cycles
1 cycle
Roughly:
Some cache specsSome cache specs
L1 cache (I/D) L2 cache
PS2 16K/8K16K/8K††
2-way2-way N/AN/A
GameCube 32K/32K32K/32K‡‡
8-way8-way 256K 2-way unified256K 2-way unified
XBOX 16K/16K 4-way16K/16K 4-way 128K 8-way unified128K 8-way unified
PC ~32-64K~32-64K ~128-512K~128-512K
â–ș††
16K data scratchpad important part of design16K data scratchpad important part of design
â–ș‡‡
configurable as 16K 4-way + 16K scratchpadconfigurable as 16K 4-way + 16K scratchpad
Foes: 3 C’s of cache missesFoes: 3 C’s of cache misses
â–ș Compulsory missesCompulsory misses
 Unavoidable misses when data read for first timeUnavoidable misses when data read for first time
â–ș Capacity missesCapacity misses
 Not enough cache space to hold all active dataNot enough cache space to hold all active data
 Too much data accessed inbetween successive useToo much data accessed inbetween successive use
â–ș Conflict missesConflict misses
 Cache thrashing due to data mapping to same cacheCache thrashing due to data mapping to same cache
lineslines
Friends: Introducing the 3Friends: Introducing the 3
R’sR’s
Compulsory Capacity Conflict
Rearrange XX (x)(x) XX
Reduce XX XX (x)(x)
Reuse (x)(x) XX
â–ș Rearrange (code, data)Rearrange (code, data)
 Change layout to increase spatial localityChange layout to increase spatial locality
â–ș Reduce (size, # cache lines read)Reduce (size, # cache lines read)
 Smaller/smarter formats, compressionSmaller/smarter formats, compression
â–ș Reuse (cache lines)Reuse (cache lines)
 Increase temporal (and spatial) localityIncrease temporal (and spatial) locality
Measuring cache utilizationMeasuring cache utilization
â–șProfileProfile
 CPU performance/event countersCPU performance/event counters
â–șGive memory access statisticsGive memory access statistics
â–șBut not access patterns (e.g. stride)But not access patterns (e.g. stride)
 Commercial productsCommercial products
â–șSN Systems’ Tuner, Metrowerks’ CATS, Intel’sSN Systems’ Tuner, Metrowerks’ CATS, Intel’s
VTuneVTune
 Roll your ownRoll your own
â–șIn gcc ‘-p’ option + defineIn gcc ‘-p’ option + define _mcount()_mcount()
â–șInstrument code with calls to logging classInstrument code with calls to logging class
 Do back-of-the-envelope comparisonDo back-of-the-envelope comparison
â–șStudy the generated codeStudy the generated code
Code cache optimization 1/2Code cache optimization 1/2
â–șLocalityLocality
 Reorder functionsReorder functions
â–șManually within fileManually within file
â–șReorder object files during linking (order in makefile)Reorder object files during linking (order in makefile)
â–ș__attribute__ ((section ("xxx")))__attribute__ ((section ("xxx"))) in gccin gcc
 Adapt coding styleAdapt coding style
â–șMonolithic functionsMonolithic functions
â–șEncapsulation/OOP is less code cache friendlyEncapsulation/OOP is less code cache friendly
 Moving targetMoving target
 Beware various implicit functions (e.g. fptodp)Beware various implicit functions (e.g. fptodp)
Code cache optimization 2/2Code cache optimization 2/2
â–șSizeSize
 Beware: inlining, unrolling, large macrosBeware: inlining, unrolling, large macros
 KISSKISS
â–șAvoid featuritisAvoid featuritis
â–șProvide multiple copies (also helps locality)Provide multiple copies (also helps locality)
 Loop splitting and loop fusionLoop splitting and loop fusion
 Compile for size (‘-Os’ in gcc)Compile for size (‘-Os’ in gcc)
 Rewrite in asm (where it counts)Rewrite in asm (where it counts)
â–șAgain, study generated codeAgain, study generated code
 Build intuition about code generatedBuild intuition about code generated
Data cache optimizationData cache optimization
â–șLots and lots of stuff
Lots and lots of stuff

 ““Compressing” dataCompressing” data
 Blocking and strip miningBlocking and strip mining
 Padding data to align to cache linesPadding data to align to cache lines
 Plus other things I won’t go intoPlus other things I won’t go into
â–șWhat I will talk about
What I will talk about

 Prefetching and preloading data into cachePrefetching and preloading data into cache
 Cache-conscious structure layoutCache-conscious structure layout
 Tree data structuresTree data structures
 Linearization cachingLinearization caching
 Memory allocationMemory allocation
 Aliasing and “anti-aliasing”Aliasing and “anti-aliasing”
Prefetching and preloadingPrefetching and preloading
â–șSoftware prefetchingSoftware prefetching
 Not too early – data may be evicted before useNot too early – data may be evicted before use
 Not too late – data not fetched in time for useNot too late – data not fetched in time for use
 GreedyGreedy
â–șPreloading (pseudo-prefetching)Preloading (pseudo-prefetching)
 Hit-under-miss processingHit-under-miss processing
const int kLookAhead = 4; // Some elements ahead
for (int i = 0; i < 4 * n; i += 4) {
Prefetch(elem[i + kLookAhead]);
Process(elem[i + 0]);
Process(elem[i + 1]);
Process(elem[i + 2]);
Process(elem[i + 3]);
}
Software prefetchingSoftware prefetching
// Loop through and process all 4n elements
for (int i = 0; i < 4 * n; i++)
Process(elem[i]);
Greedy prefetchingGreedy prefetching
void PreorderTraversal(Node *pNode) {
// Greedily prefetch left traversal path
Prefetch(pNode->left);
// Process the current node
Process(pNode);
// Greedily prefetch right traversal path
Prefetch(pNode->right);
// Recursively visit left then right subtree
PreorderTraversal(pNode->left);
PreorderTraversal(pNode->right);
}
Preloading (pseudo-Preloading (pseudo-
prefetch)prefetch)
Elem a = elem[0];
for (int i = 0; i < 4 * n; i += 4) {
Elem e = elem[i + 4]; // Cache miss, non-blocking
Elem b = elem[i + 1]; // Cache hit
Elem c = elem[i + 2]; // Cache hit
Elem d = elem[i + 3]; // Cache hit
Process(a);
Process(b);
Process(c);
Process(d);
a = e;
}(NB: This code reads one element beyond the end of the elem array.)(NB: This code reads one element beyond the end of the elem array.)
StructuresStructures
â–șCache-conscious layoutCache-conscious layout
 Field reordering (usually grouped conceptually)Field reordering (usually grouped conceptually)
 Hot/cold splittingHot/cold splitting
â–șLet use decide formatLet use decide format
 Array of structuresArray of structures
 Structures of arraysStructures of arrays
â–șLittle compiler supportLittle compiler support
 Easier for non-pointer languages (Java)Easier for non-pointer languages (Java)
 C/C++: do it yourselfC/C++: do it yourself
void Foo(S *p, void *key, int k) {
while (p) {
if (p->key == key) {
p->count[k]++;
break;
}
p = p->pNext;
}
Field reorderingField reordering
struct S {
void *key;
int count[20];
S *pNext;
};
struct S {
void *key;
S *pNext;
int count[20];
};
â–șLikely accessedLikely accessed
together sotogether so
store themstore them
together!together!
struct S {
void *key;
S *pNext;
S2 *pCold;
};
struct S2 {
int count[10];
};
Hot/cold splittingHot/cold splitting
â–șAllocate all ‘struct S’ from a memory poolAllocate all ‘struct S’ from a memory pool
 Increases coherenceIncreases coherence
â–șPrefer array-style allocationPrefer array-style allocation
 No need for actual pointer to cold fieldsNo need for actual pointer to cold fields
Hot fields:Hot fields: Cold fields:Cold fields:
Hot/cold splittingHot/cold splitting
Beware compiler paddingBeware compiler padding
struct X {
int8 a;
int64 b;
int8 c;
int16 d;
int64 e;
float f;
};
Assuming 4-byte floats, for most compilers sizeof(X) == 40,Assuming 4-byte floats, for most compilers sizeof(X) == 40,
sizeof(Y) == 40, and sizeof(Z) == 24.sizeof(Y) == 40, and sizeof(Z) == 24.
struct Z {
int64 b;
int64 e;
float f;
int16 d;
int8 a;
int8 c;
};
struct Y {
int8 a, pad_a[7];
int64 b;
int8 c, pad_c[1];
int16 d, pad_d[2];
int64 e;
float f, pad_f[1];
};
Decreasingsize!
Cache performance analysisCache performance analysis
â–ș Usage patternsUsage patterns
 ActivityActivity –– indicates hot or cold fieldindicates hot or cold field
 CorrelationCorrelation –– basis for field reorderingbasis for field reordering
â–ș Logging toolLogging tool
 Access all class members through accessor functionsAccess all class members through accessor functions
 Manually instrument functions to call Log() functionManually instrument functions to call Log() function
 Log() function
Log() function

â–ș takes object type + member field as argumentstakes object type + member field as arguments
â–ș hash-maps current args to count field accesseshash-maps current args to count field accesses
â–ș hash-maps current + previous args to track pairwise accesseshash-maps current + previous args to track pairwise accesses
Tree data structuresTree data structures
â–șRearrangeRearrange nodesnodes
 Increase spatial localityIncrease spatial locality
 Cache-aware vs. cache-oblivious layoutsCache-aware vs. cache-oblivious layouts
â–șReduceReduce sizesize
 Pointer elimination (using implicit pointers)Pointer elimination (using implicit pointers)
 ““Compression”Compression”
â–șQuantize valuesQuantize values
â–șStore data relative to parent nodeStore data relative to parent node
Breadth-first orderBreadth-first order
â–ș Pointer-less: Left(n)=2n, Right(n)=2n+1Pointer-less: Left(n)=2n, Right(n)=2n+1
â–ș Requires storage for complete tree of height HRequires storage for complete tree of height H
Depth-first orderDepth-first order
â–ș Left(n) = n + 1, Right(n) = stored indexLeft(n) = n + 1, Right(n) = stored index
â–ș Only stores existing nodesOnly stores existing nodes
van Emde Boas layoutvan Emde Boas layout
â–ș ““Cache-oblivious”Cache-oblivious”
â–ș Recursive constructionRecursive construction
A compact static k-d treeA compact static k-d tree
union KDNode {
// leaf, type 11
int32 leafIndex_type;
// non-leaf, type 00 = x,
// 01 = y, 10 = z-split
float splitVal_type;
};
Linearization cachingLinearization caching
â–șNothing better than linear dataNothing better than linear data
 Best possible spatial localityBest possible spatial locality
 Easily prefetchableEasily prefetchable
â–șSo linearize data at runtime!So linearize data at runtime!
 Fetch data, store linearized in a custom cacheFetch data, store linearized in a custom cache
 Use it to linearize
Use it to linearize

â–șhierarchy traversalshierarchy traversals
â–șindexed dataindexed data
â–șother random-access stuffother random-access stuff
Memory allocation policyMemory allocation policy
â–șDon’t allocate from heap, use poolsDon’t allocate from heap, use pools
 No block overheadNo block overhead
 Keeps data togetherKeeps data together
 Faster too, and no fragmentationFaster too, and no fragmentation
â–șFree ASAP, reuse immediatelyFree ASAP, reuse immediately
 Block is likely in cache so reuse its cachelinesBlock is likely in cache so reuse its cachelines
 First fit, using free listFirst fit, using free list
The curse of aliasingThe curse of aliasing
What is aliasing?What is aliasing?
int Foo(int *a, int *b) {
*a = 1;
*b = 2;
return *a;
}
int n;
int *p1 = &n;
int *p2 = &n;
Aliasing is also missed opportunities for optimizationAliasing is also missed opportunities for optimization
What value is
returned here?
Who knows!
Aliasing is multiple
references to the
same storage location
The curse of aliasingThe curse of aliasing
â–șWhat is causing aliasing?What is causing aliasing?
 PointersPointers
 Global variables/class members make it worseGlobal variables/class members make it worse
â–șWhat is the problem with aliasing?What is the problem with aliasing?
 Hinders reordering/elimination of loads/storesHinders reordering/elimination of loads/stores
â–șPoisoning data cachePoisoning data cache
â–șNegatively affects instruction schedulingNegatively affects instruction scheduling
â–șHinders common subexpression elimination (CSE),Hinders common subexpression elimination (CSE),
loop-invariant code motion, constant/copyloop-invariant code motion, constant/copy
propagation, etc.propagation, etc.
How do we do ‘anti-How do we do ‘anti-
aliasing’?aliasing’?
â–șWhat can be done about aliasing?What can be done about aliasing?
 Better languagesBetter languages
â–șLess aliasing, lower abstraction penaltyLess aliasing, lower abstraction penalty††
 Better compilersBetter compilers
â–șAlias analysis such as type-based alias analysisAlias analysis such as type-based alias analysis††
 Better programmers (aiding theBetter programmers (aiding the
compiler)compiler)
â–șThat’s you, after the next 20 slides!That’s you, after the next 20 slides!
 Leap of faithLeap of faith
â–ș-fno-aliasing-fno-aliasing ††
To be definedTo be defined
Matrix multiplication 1/3Matrix multiplication 1/3
Mat22mul(float a[2][2], float b[2][2], float c[2][2]){
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 2; j++) {
a[i][j] = 0.0f;
for (int k = 0; k < 2; k++)
a[i][j] += b[i][k] * c[k][j];
}
}
}
Consider optimizing a 2x2 matrix multiplication:Consider optimizing a 2x2 matrix multiplication:
How do we typically optimize it? Right, unrolling!How do we typically optimize it? Right, unrolling!
â–ș But wait! There’s a hidden assumption! a is not b or c!But wait! There’s a hidden assumption! a is not b or c!
â–ș Compiler doesn’t (cannot) know this!Compiler doesn’t (cannot) know this!
 (1) Must refetch b[0][0] and b[0][1](1) Must refetch b[0][0] and b[0][1]
 (2) Must refetch c[0][0] and c[1][0](2) Must refetch c[0][0] and c[1][0]
 (3) Must refetch b[0][0], b[0][1], c[0][0] and c[1][0](3) Must refetch b[0][0], b[0][1], c[0][0] and c[1][0]
Matrix multiplication 2/3Matrix multiplication 2/3
Staightforward unrolling results in this:Staightforward unrolling results in this:
// 16 memory reads, 4 writes
Mat22mul(float a[2][2], float b[2][2], float c[2][2]){
a[0][0] = b[0][0]*c[0][0] + b[0][1]*c[1][0];
a[0][1] = b[0][0]*c[0][1] + b[0][1]*c[1][1]; //(1)
a[1][0] = b[1][0]*c[0][0] + b[1][1]*c[1][0]; //(2)
a[1][1] = b[1][0]*c[0][1] + b[1][1]*c[1][1]; //(3)
}
Matrix multiplication 3/3Matrix multiplication 3/3
// 8 memory reads, 4 writes
Mat22mul(float a[2][2], float b[2][2], float c[2][2]){
float b00 = b[0][0], b01 = b[0][1];
float b10 = b[1][0], b11 = b[1][1];
float c00 = c[0][0], c01 = c[0][1];
float c10 = c[1][0], c11 = c[1][1];
a[0][0] = b00*c00 + b01*c10;
a[0][1] = b00*c01 + b01*c11;
a[1][0] = b10*c00 + b11*c10;
a[1][1] = b10*c01 + b11*c11;
A correct approach is instead writing it as:A correct approach is instead writing it as:

before
producing
outputs
Consume
inputs

Abstraction penalty problemAbstraction penalty problem
â–șHigher levels of abstraction have a negativeHigher levels of abstraction have a negative
effect on optimizationeffect on optimization
 Code broken into smaller generic subunitsCode broken into smaller generic subunits
 Data and operation hidingData and operation hiding
â–șCannot make local copy of e.g. internal pointersCannot make local copy of e.g. internal pointers
â–șCannot hoist constant expressions out of loopsCannot hoist constant expressions out of loops
â–șEspecially because of aliasing issuesEspecially because of aliasing issues
C++ abstraction penaltyC++ abstraction penalty
â–ș Lots of (temporary) objects aroundLots of (temporary) objects around
 IteratorsIterators
 Matrix/vector classesMatrix/vector classes
â–ș Objects live in heap/stackObjects live in heap/stack
 Thus subject to aliasingThus subject to aliasing
 Makes tracking of current member value very difficultMakes tracking of current member value very difficult
 But tracking required to keep values in registers!But tracking required to keep values in registers!
â–ș Implicit aliasing through theImplicit aliasing through the thisthis pointerpointer
 Class members are virtually as bad as global variablesClass members are virtually as bad as global variables
C++ abstraction penaltyC++ abstraction penalty
class Buf {
public:
void Clear() {
for (int i = 0; i < numVals; i++)
pBuf[i] = 0;
}
private:
int numVals, *pBuf;
}
Pointer members in classes may alias other members:Pointer members in classes may alias other members:
Code likely to refetchCode likely to refetch numValsnumVals each iteration!each iteration!
numVals not a
local variable!
May be
aliased
by pBuf!
class Buf {
public:
void Clear() {
for (int i = 0, n = numVals; i < n; i++)
pBuf[i] = 0;
}
private:
int numVals, *pBuf;
}
C++ abstraction penaltyC++ abstraction penalty
We know that aliasing won’t happen, and canWe know that aliasing won’t happen, and can
manually solve the aliasing issue by writing code as:manually solve the aliasing issue by writing code as:
C++ abstraction penaltyC++ abstraction penalty
void Clear() {
if (numVals >= 1) {
pBuf[0] = 0;
for (int i = 1, n = numVals; i < n; i++)
pBuf[i] = 0;
}
}
SinceSince pBuf[i]pBuf[i] can only aliascan only alias numValsnumVals in the firstin the first
iteration, a quality compiler can fix this problem byiteration, a quality compiler can fix this problem by
peeling the loop once, turning it into:peeling the loop once, turning it into:
Q: DoesQ: Does youryour compiler do this optimization?!compiler do this optimization?!
Type-based alias analysisType-based alias analysis
â–ș Some aliasing the compiler can catchSome aliasing the compiler can catch
 A powerful tool isA powerful tool is type-based alias analysistype-based alias analysis
Use language
types to
disambiguate
memory
references!
Type-based alias analysisType-based alias analysis
â–șANSI C/C++ states that
ANSI C/C++ states that

 Each area of memory can only be associatedEach area of memory can only be associated
with one type during its lifetimewith one type during its lifetime
 Aliasing may only occur between references ofAliasing may only occur between references of
the samethe same compatiblecompatible typetype
â–șEnables compiler to rule out aliasingEnables compiler to rule out aliasing
between references of non-compatible typebetween references of non-compatible type
 Turned on with –fstrict-aliasing in gccTurned on with –fstrict-aliasing in gcc
Compatibility of C/C++Compatibility of C/C++
typestypes
â–șIn short
In short

 Types compatible if differing byTypes compatible if differing by signedsigned,,
unsignedunsigned,, constconst oror volatilevolatile
 charchar andand unsigned charunsigned char compatible with anycompatible with any
typetype
 Otherwise not compatibleOtherwise not compatible
â–ș(See standard for full details.)(See standard for full details.)
What TBAA can do for youWhat TBAA can do for you
void Foo(float *v, int *n) {
for (int i = 0; i < *n; i++)
v[i] += 1.0f;
}
void Foo(float *v, int *n) {
int t = *n;
for (int i = 0; i < t; i++)
v[i] += 1.0f;
}
into this:into this:
It can turn this:It can turn this:
Possible aliasing
between
v[i] and *n
No aliasing possible
so fetch *n once!
What TBAA can also doWhat TBAA can also do
uint32 i;
float f;
i = *((uint32 *)&f);
uint32 i;
union {
float f;
uchar8 c[4];
} u;
u.f = f;
i = (u.c[3]<<24L)+
(u.c[2]<<16L)+
...;
â–ș Cause obscure bugs in non-conforming code!Cause obscure bugs in non-conforming code!
 Beware especially so-called “type punning”Beware especially so-called “type punning”
uint32 i;
union {
float f;
uint32 i;
} u;
u.f = f;
i = u.i;
Required
by standard
Allowed
By gcc
Illegal
C/C++ code!
Restrict-qualified pointersRestrict-qualified pointers
â–ș restrictrestrict keywordkeyword
 New to 1999 ANSI/ISO C standardNew to 1999 ANSI/ISO C standard
 Not in C++ standard yet, but supported by many C++Not in C++ standard yet, but supported by many C++
compilerscompilers
 A hint only, so may do nothing and still be conformingA hint only, so may do nothing and still be conforming
â–ș A restrict-qualified pointer (or reference)
A restrict-qualified pointer (or reference)

 

is basically a promise to the compiler that for theis basically a promise to the compiler that for the
scope of the pointer, the target of the pointer will only bescope of the pointer, the target of the pointer will only be
accessed through that pointer (and pointers copied fromaccessed through that pointer (and pointers copied from
it).it).
 (See standard for full details.)(See standard for full details.)
Using the restrict keywordUsing the restrict keyword
void Foo(float v[], float *c, int n) {
for (int i = 0; i < n; i++)
v[i] = *c + 1.0f;
}
Given this code:Given this code:
You really want the compiler to treat it as if written:You really want the compiler to treat it as if written:
But because of possible aliasing it cannot!But because of possible aliasing it cannot!
void Foo(float v[], float *c, int n) {
float tmp = *c + 1.0f;
for (int i = 0; i < n; i++)
v[i] = tmp;
}
v[] = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
v[] = 1, 1, 1, 1, 1, 2, 2, 2, 2, 2
Using the restrict keywordUsing the restrict keyword
giving for the first version:giving for the first version:
and for the second version:and for the second version:
float a[10];
a[4] = 0.0f;
Foo(a, &a[4], 10);
For example, the code might be called as:For example, the code might be called as:
The compiler must be conservative, andThe compiler must be conservative, and
cannot perform the optimization!cannot perform the optimization!
void Foo(float * restrict v, float *c, int n) {
for (int i = 0; i < n; i++)
v[i] = *c + 1.0f;
}
Solving the aliasing problemSolving the aliasing problem
The fix? Declaring the output asThe fix? Declaring the output as restrictrestrict::
â–ș Alas, in practice may need to declareAlas, in practice may need to declare bothboth pointers restrict!pointers restrict!
 A restrict-qualified pointer can grant access to non-restrict pointerA restrict-qualified pointer can grant access to non-restrict pointer
 Full data-flow analysis required to detect thisFull data-flow analysis required to detect this
 However, two restrict-qualified pointers are trivially non-aliasing!However, two restrict-qualified pointers are trivially non-aliasing!
 Also may work declaring second argument as “float *Also may work declaring second argument as “float * constconst c”c”
void Foo(float v[], const float *c, int n) {
for (int i = 0; i < n; i++)
v[i] = *c + 1.0f;
}
‘‘const’ doesn’t helpconst’ doesn’t help
Some might think this would work:Some might think this would work:
â–ș Wrong!Wrong! constconst promises almost nothing!promises almost nothing!
 SaysSays *c*c is const throughis const through cc,, notnot thatthat *c*c is const in generalis const in general
 Can be cast awayCan be cast away
 For detecting programming errors, not fixing aliasingFor detecting programming errors, not fixing aliasing
Since *c is const, v[i]
cannot write to it, right?
SIMD + restrict = TRUESIMD + restrict = TRUE
â–ș restrictrestrict enables SIMD optimizationsenables SIMD optimizations
void VecAdd(int *a, int *b, int *c) {
for (int i = 0; i < 4; i++)
a[i] = b[i] + c[i];
}
void VecAdd(int * restrict a, int *b, int *c) {
for (int i = 0; i < 4; i++)
a[i] = b[i] + c[i];
}
Independent loads and
stores. Operations can
be performed in parallel!
Stores may alias loads.
Must perform operations
sequentially.
Restrict-qualified pointersRestrict-qualified pointers
â–șImportant, especially with C++Important, especially with C++
 Helps combat abstraction penalty problemHelps combat abstraction penalty problem
â–șBut beware
But beware

 Tricky semantics, easy to get wrongTricky semantics, easy to get wrong
 Compiler won’t tell you about incorrect useCompiler won’t tell you about incorrect use
 Incorrect use = slow painful death!Incorrect use = slow painful death!
Tips for avoiding aliasingTips for avoiding aliasing
â–ș Minimize use of globals, pointers, referencesMinimize use of globals, pointers, references
 Pass small variables by-valuePass small variables by-value
 Inline small functions taking pointer or referenceInline small functions taking pointer or reference
argumentsarguments
â–ș Use local variables as much as possibleUse local variables as much as possible
 Make local copies of global and class member variablesMake local copies of global and class member variables
â–ș Don’t take the address of variables (with &)Don’t take the address of variables (with &)
â–ș restrictrestrict pointers and referencespointers and references
â–ș Declare variables close to point of useDeclare variables close to point of use
â–ș Declare side-effect free functions asDeclare side-effect free functions as constconst
â–ș Do manual CSE, especially of pointer expressionsDo manual CSE, especially of pointer expressions
That’s it!That’s it! –– Resources 1/2Resources 1/2
â–ș Ericson, Christer. Real-time collision detection. Morgan-Ericson, Christer. Real-time collision detection. Morgan-
Kaufmann, 2005. (Chapter on memory optimization)Kaufmann, 2005. (Chapter on memory optimization)
â–ș Mitchell, Mark. Type-based alias analysis. Dr. Dobb’sMitchell, Mark. Type-based alias analysis. Dr. Dobb’s
journal, October 2000.journal, October 2000.
â–ș Robison, Arch. Restricted pointers are coming. C/C++Robison, Arch. Restricted pointers are coming. C/C++
Users Journal, July 1999.Users Journal, July 1999.
http://www.cuj.com/articles/1999/9907/9907d/9907d.htmhttp://www.cuj.com/articles/1999/9907/9907d/9907d.htm
â–ș Chilimbi, Trishul. Cache-conscious data structures - designChilimbi, Trishul. Cache-conscious data structures - design
and implementation. PhD Thesis. University of Wisconsin,and implementation. PhD Thesis. University of Wisconsin,
Madison, 1999.Madison, 1999.
â–ș Prokop, Harald. Cache-oblivious algorithms. Master’sProkop, Harald. Cache-oblivious algorithms. Master’s
Thesis. MIT, June, 1999.Thesis. MIT, June, 1999.
â–ș 


Resources 2/2Resources 2/2
â–ș 


â–ș Gavin, Andrew. Stephen White. Teaching an old dog newGavin, Andrew. Stephen White. Teaching an old dog new
bits: How console developers are able to improvebits: How console developers are able to improve
performance when the hardware hasn’t changed.performance when the hardware hasn’t changed.
Gamasutra. November 12, 1999Gamasutra. November 12, 1999
http://www.gamasutra.com/features/19991112/GavinWhite_01.htmhttp://www.gamasutra.com/features/19991112/GavinWhite_01.htm
â–ș Handy, Jim. The cache memory book. Academic Press,Handy, Jim. The cache memory book. Academic Press,
1998.1998.
â–ș Macris, Alexandre. Pascal Urro. Leveraging the power ofMacris, Alexandre. Pascal Urro. Leveraging the power of
cache memory. Gamasutra. April 9, 1999cache memory. Gamasutra. April 9, 1999
http://www.gamasutra.com/features/19990409/cache_01.htmhttp://www.gamasutra.com/features/19990409/cache_01.htm
â–ș Gross, Ornit. Pentium III prefetch optimizations using theGross, Ornit. Pentium III prefetch optimizations using the
VTune performance analyzer. Gamasutra. July 30, 1999VTune performance analyzer. Gamasutra. July 30, 1999
httphttp://www.gamasutra.com/features/19990730/sse_prefetch_01.htm://www.gamasutra.com/features/19990730/sse_prefetch_01.htm
â–ș Truong, Dan. François Bodin. AndrTruong, Dan. François Bodin. AndrĂ©Ă© Seznec. ImprovingSeznec. Improving
cache behavior of dynamically allocated data structures.cache behavior of dynamically allocated data structures.

Mais conteĂșdo relacionado

Mais procurados

09 searching[1]
09 searching[1]09 searching[1]
09 searching[1]Kamal Shrish
 
Understanding DLmalloc
Understanding DLmallocUnderstanding DLmalloc
Understanding DLmallocHaifeng Li
 
DUG'20: 07 - Storing High-Energy Physics data in DAOS
DUG'20: 07 - Storing High-Energy Physics data in DAOSDUG'20: 07 - Storing High-Energy Physics data in DAOS
DUG'20: 07 - Storing High-Energy Physics data in DAOSAndrey Kudryavtsev
 
An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"sandinmyjoints
 
Steps to Shrink Datafiles
Steps to Shrink DatafilesSteps to Shrink Datafiles
Steps to Shrink DatafilesRam Kumar
 
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)Kyle Davis
 
GPU Computing with CUDA
GPU Computing with CUDAGPU Computing with CUDA
GPU Computing with CUDAPriyankaSaini94
 
Make money fast! department of computer science-copypasteads.com
Make money fast!   department of computer science-copypasteads.comMake money fast!   department of computer science-copypasteads.com
Make money fast! department of computer science-copypasteads.comjackpot201
 
Modern software design in Big data era
Modern software design in Big data eraModern software design in Big data era
Modern software design in Big data eraBill GU
 
Data model for analysis of scholarly documents in the MapReduce paradigm
Data model for analysis of scholarly documents in the MapReduce paradigm Data model for analysis of scholarly documents in the MapReduce paradigm
Data model for analysis of scholarly documents in the MapReduce paradigm Adam Kawa
 
Gur1009
Gur1009Gur1009
Gur1009Cdiscount
 
Writing file system in CPython
Writing file system in CPythonWriting file system in CPython
Writing file system in CPythondelimitry
 
DSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams
DSTree: A Tree Structure for the Mining of Frequent Sets from Data StreamsDSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams
DSTree: A Tree Structure for the Mining of Frequent Sets from Data StreamsAllenWu
 
Large scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in BioinformaticsLarge scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in BioinformaticsNtino Krampis
 

Mais procurados (20)

09 searching[1]
09 searching[1]09 searching[1]
09 searching[1]
 
Understanding DLmalloc
Understanding DLmallocUnderstanding DLmalloc
Understanding DLmalloc
 
DUG'20: 07 - Storing High-Energy Physics data in DAOS
DUG'20: 07 - Storing High-Energy Physics data in DAOSDUG'20: 07 - Storing High-Energy Physics data in DAOS
DUG'20: 07 - Storing High-Energy Physics data in DAOS
 
An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"
 
Steps to Shrink Datafiles
Steps to Shrink DatafilesSteps to Shrink Datafiles
Steps to Shrink Datafiles
 
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
 
User biglm
User biglmUser biglm
User biglm
 
GPU Computing with CUDA
GPU Computing with CUDAGPU Computing with CUDA
GPU Computing with CUDA
 
Make money fast! department of computer science-copypasteads.com
Make money fast!   department of computer science-copypasteads.comMake money fast!   department of computer science-copypasteads.com
Make money fast! department of computer science-copypasteads.com
 
Modern software design in Big data era
Modern software design in Big data eraModern software design in Big data era
Modern software design in Big data era
 
lec4_ref.pdf
lec4_ref.pdflec4_ref.pdf
lec4_ref.pdf
 
Data model for analysis of scholarly documents in the MapReduce paradigm
Data model for analysis of scholarly documents in the MapReduce paradigm Data model for analysis of scholarly documents in the MapReduce paradigm
Data model for analysis of scholarly documents in the MapReduce paradigm
 
Gur1009
Gur1009Gur1009
Gur1009
 
Python for R users
Python for R usersPython for R users
Python for R users
 
Writing file system in CPython
Writing file system in CPythonWriting file system in CPython
Writing file system in CPython
 
Mifare Desfire Technology
Mifare Desfire TechnologyMifare Desfire Technology
Mifare Desfire Technology
 
Vault2016
Vault2016Vault2016
Vault2016
 
Statistical computing 01
Statistical computing 01Statistical computing 01
Statistical computing 01
 
DSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams
DSTree: A Tree Structure for the Mining of Frequent Sets from Data StreamsDSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams
DSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams
 
Large scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in BioinformaticsLarge scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in Bioinformatics
 

Semelhante a Gdc03 ericson memory_optimization

Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimizationguest3eed30
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory OptimizationWei Lin
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)RichardWarburton
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonJAXLondon2014
 
Optimizing columnar stores
Optimizing columnar storesOptimizing columnar stores
Optimizing columnar storesIstvan Szukacs
 
Optimizing columnar stores
Optimizing columnar storesOptimizing columnar stores
Optimizing columnar storesIstvan Szukacs
 
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2Hsien-Hsin Sean Lee, Ph.D.
 
.NET Fest 2018. Maarten Balliauw. Let’s refresh our memory! Memory management...
.NET Fest 2018. Maarten Balliauw. Let’s refresh our memory! Memory management....NET Fest 2018. Maarten Balliauw. Let’s refresh our memory! Memory management...
.NET Fest 2018. Maarten Balliauw. Let’s refresh our memory! Memory management...NETFest
 
DotNetFest - Let’s refresh our memory! Memory management in .NET
DotNetFest - Let’s refresh our memory! Memory management in .NETDotNetFest - Let’s refresh our memory! Memory management in .NET
DotNetFest - Let’s refresh our memory! Memory management in .NETMaarten Balliauw
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit
 
Why learn Internals?
Why learn Internals?Why learn Internals?
Why learn Internals?Shaul Rosenzwieg
 
ScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedJ On The Beach
 
[ENG] Hacktivity 2013 - Alice in eXploitland
[ENG] Hacktivity 2013 - Alice in eXploitland[ENG] Hacktivity 2013 - Alice in eXploitland
[ENG] Hacktivity 2013 - Alice in eXploitlandZoltan Balazs
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
[Ruxcon 2011] Post Memory Corruption Memory Analysis
[Ruxcon 2011] Post Memory Corruption Memory Analysis[Ruxcon 2011] Post Memory Corruption Memory Analysis
[Ruxcon 2011] Post Memory Corruption Memory AnalysisMoabi.com
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle CoherenceBen Stopford
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief ...
Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief ...Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief ...
Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief ...buildacloud
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
MongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsMongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsServer Density
 

Semelhante a Gdc03 ericson memory_optimization (20)

Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
 
Optimizing columnar stores
Optimizing columnar storesOptimizing columnar stores
Optimizing columnar stores
 
Optimizing columnar stores
Optimizing columnar storesOptimizing columnar stores
Optimizing columnar stores
 
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
 
.NET Fest 2018. Maarten Balliauw. Let’s refresh our memory! Memory management...
.NET Fest 2018. Maarten Balliauw. Let’s refresh our memory! Memory management....NET Fest 2018. Maarten Balliauw. Let’s refresh our memory! Memory management...
.NET Fest 2018. Maarten Balliauw. Let’s refresh our memory! Memory management...
 
DotNetFest - Let’s refresh our memory! Memory management in .NET
DotNetFest - Let’s refresh our memory! Memory management in .NETDotNetFest - Let’s refresh our memory! Memory management in .NET
DotNetFest - Let’s refresh our memory! Memory management in .NET
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Why learn Internals?
Why learn Internals?Why learn Internals?
Why learn Internals?
 
ScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous Speed
 
[ENG] Hacktivity 2013 - Alice in eXploitland
[ENG] Hacktivity 2013 - Alice in eXploitland[ENG] Hacktivity 2013 - Alice in eXploitland
[ENG] Hacktivity 2013 - Alice in eXploitland
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
[Ruxcon 2011] Post Memory Corruption Memory Analysis
[Ruxcon 2011] Post Memory Corruption Memory Analysis[Ruxcon 2011] Post Memory Corruption Memory Analysis
[Ruxcon 2011] Post Memory Corruption Memory Analysis
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle Coherence
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief ...
Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief ...Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief ...
Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief ...
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
MongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsMongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & Analytics
 

Último

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...gurkirankumar98700
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

Gdc03 ericson memory_optimization

  • 1. MEMORYMEMORY OPTIMIZATIONOPTIMIZATION Christer EricsonChrister Ericson Sony Computer Entertainment, SantaSony Computer Entertainment, Santa MonicaMonica (christer_ericson@playstation.sony.com)(christer_ericson@playstation.sony.com)
  • 2. Talk contents 1/2Talk contents 1/2 â–șProblem statementProblem statement  Why “memory optimization?”Why “memory optimization?” â–șBrief architecture overviewBrief architecture overview  The memory hierarchyThe memory hierarchy â–șOptimizing for (code and) data cacheOptimizing for (code and) data cache  General suggestionsGeneral suggestions  Data structuresData structures â–șPrefetching and preloadingPrefetching and preloading â–șStructure layoutStructure layout â–șTree structuresTree structures â–șLinearization cachingLinearization caching â–ș


  • 3. Talk contents 2/2Talk contents 2/2 â–ș

 â–șAliasingAliasing  Abstraction penalty problemAbstraction penalty problem  Alias analysis (type-based)Alias analysis (type-based)  ‘‘restrict’ pointersrestrict’ pointers  Tips for reducing aliasingTips for reducing aliasing
  • 4. Problem statementProblem statement â–ș For the last 20-something years
For the last 20-something years
  CPU speeds have increased ~60%/yearCPU speeds have increased ~60%/year  Memory speeds only decreased ~10%/yearMemory speeds only decreased ~10%/year â–ș Gap covered by use of cache memoryGap covered by use of cache memory â–ș Cache is under-exploitedCache is under-exploited  Diminishing returns for larger cachesDiminishing returns for larger caches â–ș Inefficient cache use = lower performanceInefficient cache use = lower performance  How increase cache utilization? Cache-awareness!How increase cache utilization? Cache-awareness!
  • 5. Need more justification? 1/3Need more justification? 1/3 SIMD instructions consume data at 2-8 times the rate of normal instructions! Instruction parallelism:
  • 6. Need more justification? 2/3Need more justification? 2/3 Improvements to compiler technology double program performance every ~18 years! Proebsting’s law: Corollary: Don’t expect the compiler to do it for you!Corollary: Don’t expect the compiler to do it for you!
  • 7. Need more justification? 3/3Need more justification? 3/3 On Moore’s law: â–șConsoles don’t follow it (as such)  Fixed hardware  2nd/3rd generation titles must get improvements from somewhere
  • 8. Brief cache reviewBrief cache review â–ș CachesCaches  Code cache for instructions, data cache for dataCode cache for instructions, data cache for data  Forms a memory hierarchyForms a memory hierarchy â–ș Cache linesCache lines  Cache divided into cache lines of ~32/64 bytes eachCache divided into cache lines of ~32/64 bytes each  Correct unit in which to count memory accessesCorrect unit in which to count memory accesses â–ș Direct-mappedDirect-mapped  For n KB cache, bytes at k, k+n, k+2n, 
 map to sameFor n KB cache, bytes at k, k+n, k+2n, 
 map to same cache linecache line â–ș N-way set-associativeN-way set-associative  Logical cache line corresponds to N physical linesLogical cache line corresponds to N physical lines  Helps minimize cache line thrashingHelps minimize cache line thrashing
  • 9. The memory hierarchyThe memory hierarchy Main memory L2 cache L1 cache CPU ~1-5 cycles ~5-20 cycles ~40-100 cycles 1 cycle Roughly:
  • 10. Some cache specsSome cache specs L1 cache (I/D) L2 cache PS2 16K/8K16K/8K†† 2-way2-way N/AN/A GameCube 32K/32K32K/32K‡‡ 8-way8-way 256K 2-way unified256K 2-way unified XBOX 16K/16K 4-way16K/16K 4-way 128K 8-way unified128K 8-way unified PC ~32-64K~32-64K ~128-512K~128-512K â–ș†† 16K data scratchpad important part of design16K data scratchpad important part of design â–ș‡‡ configurable as 16K 4-way + 16K scratchpadconfigurable as 16K 4-way + 16K scratchpad
  • 11. Foes: 3 C’s of cache missesFoes: 3 C’s of cache misses â–ș Compulsory missesCompulsory misses  Unavoidable misses when data read for first timeUnavoidable misses when data read for first time â–ș Capacity missesCapacity misses  Not enough cache space to hold all active dataNot enough cache space to hold all active data  Too much data accessed inbetween successive useToo much data accessed inbetween successive use â–ș Conflict missesConflict misses  Cache thrashing due to data mapping to same cacheCache thrashing due to data mapping to same cache lineslines
  • 12. Friends: Introducing the 3Friends: Introducing the 3 R’sR’s Compulsory Capacity Conflict Rearrange XX (x)(x) XX Reduce XX XX (x)(x) Reuse (x)(x) XX â–ș Rearrange (code, data)Rearrange (code, data)  Change layout to increase spatial localityChange layout to increase spatial locality â–ș Reduce (size, # cache lines read)Reduce (size, # cache lines read)  Smaller/smarter formats, compressionSmaller/smarter formats, compression â–ș Reuse (cache lines)Reuse (cache lines)  Increase temporal (and spatial) localityIncrease temporal (and spatial) locality
  • 13. Measuring cache utilizationMeasuring cache utilization â–șProfileProfile  CPU performance/event countersCPU performance/event counters â–șGive memory access statisticsGive memory access statistics â–șBut not access patterns (e.g. stride)But not access patterns (e.g. stride)  Commercial productsCommercial products â–șSN Systems’ Tuner, Metrowerks’ CATS, Intel’sSN Systems’ Tuner, Metrowerks’ CATS, Intel’s VTuneVTune  Roll your ownRoll your own â–șIn gcc ‘-p’ option + defineIn gcc ‘-p’ option + define _mcount()_mcount() â–șInstrument code with calls to logging classInstrument code with calls to logging class  Do back-of-the-envelope comparisonDo back-of-the-envelope comparison â–șStudy the generated codeStudy the generated code
  • 14. Code cache optimization 1/2Code cache optimization 1/2 â–șLocalityLocality  Reorder functionsReorder functions â–șManually within fileManually within file â–șReorder object files during linking (order in makefile)Reorder object files during linking (order in makefile) â–ș__attribute__ ((section ("xxx")))__attribute__ ((section ("xxx"))) in gccin gcc  Adapt coding styleAdapt coding style â–șMonolithic functionsMonolithic functions â–șEncapsulation/OOP is less code cache friendlyEncapsulation/OOP is less code cache friendly  Moving targetMoving target  Beware various implicit functions (e.g. fptodp)Beware various implicit functions (e.g. fptodp)
  • 15. Code cache optimization 2/2Code cache optimization 2/2 â–șSizeSize  Beware: inlining, unrolling, large macrosBeware: inlining, unrolling, large macros  KISSKISS â–șAvoid featuritisAvoid featuritis â–șProvide multiple copies (also helps locality)Provide multiple copies (also helps locality)  Loop splitting and loop fusionLoop splitting and loop fusion  Compile for size (‘-Os’ in gcc)Compile for size (‘-Os’ in gcc)  Rewrite in asm (where it counts)Rewrite in asm (where it counts) â–șAgain, study generated codeAgain, study generated code  Build intuition about code generatedBuild intuition about code generated
  • 16. Data cache optimizationData cache optimization â–șLots and lots of stuff
Lots and lots of stuff
  ““Compressing” dataCompressing” data  Blocking and strip miningBlocking and strip mining  Padding data to align to cache linesPadding data to align to cache lines  Plus other things I won’t go intoPlus other things I won’t go into â–șWhat I will talk about
What I will talk about
  Prefetching and preloading data into cachePrefetching and preloading data into cache  Cache-conscious structure layoutCache-conscious structure layout  Tree data structuresTree data structures  Linearization cachingLinearization caching  Memory allocationMemory allocation  Aliasing and “anti-aliasing”Aliasing and “anti-aliasing”
  • 17. Prefetching and preloadingPrefetching and preloading â–șSoftware prefetchingSoftware prefetching  Not too early – data may be evicted before useNot too early – data may be evicted before use  Not too late – data not fetched in time for useNot too late – data not fetched in time for use  GreedyGreedy â–șPreloading (pseudo-prefetching)Preloading (pseudo-prefetching)  Hit-under-miss processingHit-under-miss processing
  • 18. const int kLookAhead = 4; // Some elements ahead for (int i = 0; i < 4 * n; i += 4) { Prefetch(elem[i + kLookAhead]); Process(elem[i + 0]); Process(elem[i + 1]); Process(elem[i + 2]); Process(elem[i + 3]); } Software prefetchingSoftware prefetching // Loop through and process all 4n elements for (int i = 0; i < 4 * n; i++) Process(elem[i]);
  • 19. Greedy prefetchingGreedy prefetching void PreorderTraversal(Node *pNode) { // Greedily prefetch left traversal path Prefetch(pNode->left); // Process the current node Process(pNode); // Greedily prefetch right traversal path Prefetch(pNode->right); // Recursively visit left then right subtree PreorderTraversal(pNode->left); PreorderTraversal(pNode->right); }
  • 20. Preloading (pseudo-Preloading (pseudo- prefetch)prefetch) Elem a = elem[0]; for (int i = 0; i < 4 * n; i += 4) { Elem e = elem[i + 4]; // Cache miss, non-blocking Elem b = elem[i + 1]; // Cache hit Elem c = elem[i + 2]; // Cache hit Elem d = elem[i + 3]; // Cache hit Process(a); Process(b); Process(c); Process(d); a = e; }(NB: This code reads one element beyond the end of the elem array.)(NB: This code reads one element beyond the end of the elem array.)
  • 21. StructuresStructures â–șCache-conscious layoutCache-conscious layout  Field reordering (usually grouped conceptually)Field reordering (usually grouped conceptually)  Hot/cold splittingHot/cold splitting â–șLet use decide formatLet use decide format  Array of structuresArray of structures  Structures of arraysStructures of arrays â–șLittle compiler supportLittle compiler support  Easier for non-pointer languages (Java)Easier for non-pointer languages (Java)  C/C++: do it yourselfC/C++: do it yourself
  • 22. void Foo(S *p, void *key, int k) { while (p) { if (p->key == key) { p->count[k]++; break; } p = p->pNext; } Field reorderingField reordering struct S { void *key; int count[20]; S *pNext; }; struct S { void *key; S *pNext; int count[20]; }; â–șLikely accessedLikely accessed together sotogether so store themstore them together!together!
  • 23. struct S { void *key; S *pNext; S2 *pCold; }; struct S2 { int count[10]; }; Hot/cold splittingHot/cold splitting â–șAllocate all ‘struct S’ from a memory poolAllocate all ‘struct S’ from a memory pool  Increases coherenceIncreases coherence â–șPrefer array-style allocationPrefer array-style allocation  No need for actual pointer to cold fieldsNo need for actual pointer to cold fields Hot fields:Hot fields: Cold fields:Cold fields:
  • 25. Beware compiler paddingBeware compiler padding struct X { int8 a; int64 b; int8 c; int16 d; int64 e; float f; }; Assuming 4-byte floats, for most compilers sizeof(X) == 40,Assuming 4-byte floats, for most compilers sizeof(X) == 40, sizeof(Y) == 40, and sizeof(Z) == 24.sizeof(Y) == 40, and sizeof(Z) == 24. struct Z { int64 b; int64 e; float f; int16 d; int8 a; int8 c; }; struct Y { int8 a, pad_a[7]; int64 b; int8 c, pad_c[1]; int16 d, pad_d[2]; int64 e; float f, pad_f[1]; }; Decreasingsize!
  • 26. Cache performance analysisCache performance analysis â–ș Usage patternsUsage patterns  ActivityActivity –– indicates hot or cold fieldindicates hot or cold field  CorrelationCorrelation –– basis for field reorderingbasis for field reordering â–ș Logging toolLogging tool  Access all class members through accessor functionsAccess all class members through accessor functions  Manually instrument functions to call Log() functionManually instrument functions to call Log() function  Log() function
Log() function
 â–ș takes object type + member field as argumentstakes object type + member field as arguments â–ș hash-maps current args to count field accesseshash-maps current args to count field accesses â–ș hash-maps current + previous args to track pairwise accesseshash-maps current + previous args to track pairwise accesses
  • 27. Tree data structuresTree data structures â–șRearrangeRearrange nodesnodes  Increase spatial localityIncrease spatial locality  Cache-aware vs. cache-oblivious layoutsCache-aware vs. cache-oblivious layouts â–șReduceReduce sizesize  Pointer elimination (using implicit pointers)Pointer elimination (using implicit pointers)  ““Compression”Compression” â–șQuantize valuesQuantize values â–șStore data relative to parent nodeStore data relative to parent node
  • 28. Breadth-first orderBreadth-first order â–ș Pointer-less: Left(n)=2n, Right(n)=2n+1Pointer-less: Left(n)=2n, Right(n)=2n+1 â–ș Requires storage for complete tree of height HRequires storage for complete tree of height H
  • 29. Depth-first orderDepth-first order â–ș Left(n) = n + 1, Right(n) = stored indexLeft(n) = n + 1, Right(n) = stored index â–ș Only stores existing nodesOnly stores existing nodes
  • 30. van Emde Boas layoutvan Emde Boas layout â–ș ““Cache-oblivious”Cache-oblivious” â–ș Recursive constructionRecursive construction
  • 31. A compact static k-d treeA compact static k-d tree union KDNode { // leaf, type 11 int32 leafIndex_type; // non-leaf, type 00 = x, // 01 = y, 10 = z-split float splitVal_type; };
  • 32. Linearization cachingLinearization caching â–șNothing better than linear dataNothing better than linear data  Best possible spatial localityBest possible spatial locality  Easily prefetchableEasily prefetchable â–șSo linearize data at runtime!So linearize data at runtime!  Fetch data, store linearized in a custom cacheFetch data, store linearized in a custom cache  Use it to linearize
Use it to linearize
 â–șhierarchy traversalshierarchy traversals â–șindexed dataindexed data â–șother random-access stuffother random-access stuff
  • 33.
  • 34. Memory allocation policyMemory allocation policy â–șDon’t allocate from heap, use poolsDon’t allocate from heap, use pools  No block overheadNo block overhead  Keeps data togetherKeeps data together  Faster too, and no fragmentationFaster too, and no fragmentation â–șFree ASAP, reuse immediatelyFree ASAP, reuse immediately  Block is likely in cache so reuse its cachelinesBlock is likely in cache so reuse its cachelines  First fit, using free listFirst fit, using free list
  • 35. The curse of aliasingThe curse of aliasing What is aliasing?What is aliasing? int Foo(int *a, int *b) { *a = 1; *b = 2; return *a; } int n; int *p1 = &n; int *p2 = &n; Aliasing is also missed opportunities for optimizationAliasing is also missed opportunities for optimization What value is returned here? Who knows! Aliasing is multiple references to the same storage location
  • 36. The curse of aliasingThe curse of aliasing â–șWhat is causing aliasing?What is causing aliasing?  PointersPointers  Global variables/class members make it worseGlobal variables/class members make it worse â–șWhat is the problem with aliasing?What is the problem with aliasing?  Hinders reordering/elimination of loads/storesHinders reordering/elimination of loads/stores â–șPoisoning data cachePoisoning data cache â–șNegatively affects instruction schedulingNegatively affects instruction scheduling â–șHinders common subexpression elimination (CSE),Hinders common subexpression elimination (CSE), loop-invariant code motion, constant/copyloop-invariant code motion, constant/copy propagation, etc.propagation, etc.
  • 37. How do we do ‘anti-How do we do ‘anti- aliasing’?aliasing’? â–șWhat can be done about aliasing?What can be done about aliasing?  Better languagesBetter languages â–șLess aliasing, lower abstraction penaltyLess aliasing, lower abstraction penalty††  Better compilersBetter compilers â–șAlias analysis such as type-based alias analysisAlias analysis such as type-based alias analysis††  Better programmers (aiding theBetter programmers (aiding the compiler)compiler) â–șThat’s you, after the next 20 slides!That’s you, after the next 20 slides!  Leap of faithLeap of faith â–ș-fno-aliasing-fno-aliasing †† To be definedTo be defined
  • 38. Matrix multiplication 1/3Matrix multiplication 1/3 Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ for (int i = 0; i < 2; i++) { for (int j = 0; j < 2; j++) { a[i][j] = 0.0f; for (int k = 0; k < 2; k++) a[i][j] += b[i][k] * c[k][j]; } } } Consider optimizing a 2x2 matrix multiplication:Consider optimizing a 2x2 matrix multiplication: How do we typically optimize it? Right, unrolling!How do we typically optimize it? Right, unrolling!
  • 39. â–ș But wait! There’s a hidden assumption! a is not b or c!But wait! There’s a hidden assumption! a is not b or c! â–ș Compiler doesn’t (cannot) know this!Compiler doesn’t (cannot) know this!  (1) Must refetch b[0][0] and b[0][1](1) Must refetch b[0][0] and b[0][1]  (2) Must refetch c[0][0] and c[1][0](2) Must refetch c[0][0] and c[1][0]  (3) Must refetch b[0][0], b[0][1], c[0][0] and c[1][0](3) Must refetch b[0][0], b[0][1], c[0][0] and c[1][0] Matrix multiplication 2/3Matrix multiplication 2/3 Staightforward unrolling results in this:Staightforward unrolling results in this: // 16 memory reads, 4 writes Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ a[0][0] = b[0][0]*c[0][0] + b[0][1]*c[1][0]; a[0][1] = b[0][0]*c[0][1] + b[0][1]*c[1][1]; //(1) a[1][0] = b[1][0]*c[0][0] + b[1][1]*c[1][0]; //(2) a[1][1] = b[1][0]*c[0][1] + b[1][1]*c[1][1]; //(3) }
  • 40. Matrix multiplication 3/3Matrix multiplication 3/3 // 8 memory reads, 4 writes Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ float b00 = b[0][0], b01 = b[0][1]; float b10 = b[1][0], b11 = b[1][1]; float c00 = c[0][0], c01 = c[0][1]; float c10 = c[1][0], c11 = c[1][1]; a[0][0] = b00*c00 + b01*c10; a[0][1] = b00*c01 + b01*c11; a[1][0] = b10*c00 + b11*c10; a[1][1] = b10*c01 + b11*c11; A correct approach is instead writing it as:A correct approach is instead writing it as: 
before producing outputs Consume inputs

  • 41. Abstraction penalty problemAbstraction penalty problem â–șHigher levels of abstraction have a negativeHigher levels of abstraction have a negative effect on optimizationeffect on optimization  Code broken into smaller generic subunitsCode broken into smaller generic subunits  Data and operation hidingData and operation hiding â–șCannot make local copy of e.g. internal pointersCannot make local copy of e.g. internal pointers â–șCannot hoist constant expressions out of loopsCannot hoist constant expressions out of loops â–șEspecially because of aliasing issuesEspecially because of aliasing issues
  • 42. C++ abstraction penaltyC++ abstraction penalty â–ș Lots of (temporary) objects aroundLots of (temporary) objects around  IteratorsIterators  Matrix/vector classesMatrix/vector classes â–ș Objects live in heap/stackObjects live in heap/stack  Thus subject to aliasingThus subject to aliasing  Makes tracking of current member value very difficultMakes tracking of current member value very difficult  But tracking required to keep values in registers!But tracking required to keep values in registers! â–ș Implicit aliasing through theImplicit aliasing through the thisthis pointerpointer  Class members are virtually as bad as global variablesClass members are virtually as bad as global variables
  • 43. C++ abstraction penaltyC++ abstraction penalty class Buf { public: void Clear() { for (int i = 0; i < numVals; i++) pBuf[i] = 0; } private: int numVals, *pBuf; } Pointer members in classes may alias other members:Pointer members in classes may alias other members: Code likely to refetchCode likely to refetch numValsnumVals each iteration!each iteration! numVals not a local variable! May be aliased by pBuf!
  • 44. class Buf { public: void Clear() { for (int i = 0, n = numVals; i < n; i++) pBuf[i] = 0; } private: int numVals, *pBuf; } C++ abstraction penaltyC++ abstraction penalty We know that aliasing won’t happen, and canWe know that aliasing won’t happen, and can manually solve the aliasing issue by writing code as:manually solve the aliasing issue by writing code as:
  • 45. C++ abstraction penaltyC++ abstraction penalty void Clear() { if (numVals >= 1) { pBuf[0] = 0; for (int i = 1, n = numVals; i < n; i++) pBuf[i] = 0; } } SinceSince pBuf[i]pBuf[i] can only aliascan only alias numValsnumVals in the firstin the first iteration, a quality compiler can fix this problem byiteration, a quality compiler can fix this problem by peeling the loop once, turning it into:peeling the loop once, turning it into: Q: DoesQ: Does youryour compiler do this optimization?!compiler do this optimization?!
  • 46. Type-based alias analysisType-based alias analysis â–ș Some aliasing the compiler can catchSome aliasing the compiler can catch  A powerful tool isA powerful tool is type-based alias analysistype-based alias analysis Use language types to disambiguate memory references!
  • 47. Type-based alias analysisType-based alias analysis â–șANSI C/C++ states that
ANSI C/C++ states that
  Each area of memory can only be associatedEach area of memory can only be associated with one type during its lifetimewith one type during its lifetime  Aliasing may only occur between references ofAliasing may only occur between references of the samethe same compatiblecompatible typetype â–șEnables compiler to rule out aliasingEnables compiler to rule out aliasing between references of non-compatible typebetween references of non-compatible type  Turned on with –fstrict-aliasing in gccTurned on with –fstrict-aliasing in gcc
  • 48. Compatibility of C/C++Compatibility of C/C++ typestypes â–șIn short
In short
  Types compatible if differing byTypes compatible if differing by signedsigned,, unsignedunsigned,, constconst oror volatilevolatile  charchar andand unsigned charunsigned char compatible with anycompatible with any typetype  Otherwise not compatibleOtherwise not compatible â–ș(See standard for full details.)(See standard for full details.)
  • 49. What TBAA can do for youWhat TBAA can do for you void Foo(float *v, int *n) { for (int i = 0; i < *n; i++) v[i] += 1.0f; } void Foo(float *v, int *n) { int t = *n; for (int i = 0; i < t; i++) v[i] += 1.0f; } into this:into this: It can turn this:It can turn this: Possible aliasing between v[i] and *n No aliasing possible so fetch *n once!
  • 50. What TBAA can also doWhat TBAA can also do uint32 i; float f; i = *((uint32 *)&f); uint32 i; union { float f; uchar8 c[4]; } u; u.f = f; i = (u.c[3]<<24L)+ (u.c[2]<<16L)+ ...; â–ș Cause obscure bugs in non-conforming code!Cause obscure bugs in non-conforming code!  Beware especially so-called “type punning”Beware especially so-called “type punning” uint32 i; union { float f; uint32 i; } u; u.f = f; i = u.i; Required by standard Allowed By gcc Illegal C/C++ code!
  • 51. Restrict-qualified pointersRestrict-qualified pointers â–ș restrictrestrict keywordkeyword  New to 1999 ANSI/ISO C standardNew to 1999 ANSI/ISO C standard  Not in C++ standard yet, but supported by many C++Not in C++ standard yet, but supported by many C++ compilerscompilers  A hint only, so may do nothing and still be conformingA hint only, so may do nothing and still be conforming â–ș A restrict-qualified pointer (or reference)
A restrict-qualified pointer (or reference)
  

is basically a promise to the compiler that for theis basically a promise to the compiler that for the scope of the pointer, the target of the pointer will only bescope of the pointer, the target of the pointer will only be accessed through that pointer (and pointers copied fromaccessed through that pointer (and pointers copied from it).it).  (See standard for full details.)(See standard for full details.)
  • 52. Using the restrict keywordUsing the restrict keyword void Foo(float v[], float *c, int n) { for (int i = 0; i < n; i++) v[i] = *c + 1.0f; } Given this code:Given this code: You really want the compiler to treat it as if written:You really want the compiler to treat it as if written: But because of possible aliasing it cannot!But because of possible aliasing it cannot! void Foo(float v[], float *c, int n) { float tmp = *c + 1.0f; for (int i = 0; i < n; i++) v[i] = tmp; }
  • 53. v[] = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 v[] = 1, 1, 1, 1, 1, 2, 2, 2, 2, 2 Using the restrict keywordUsing the restrict keyword giving for the first version:giving for the first version: and for the second version:and for the second version: float a[10]; a[4] = 0.0f; Foo(a, &a[4], 10); For example, the code might be called as:For example, the code might be called as: The compiler must be conservative, andThe compiler must be conservative, and cannot perform the optimization!cannot perform the optimization!
  • 54. void Foo(float * restrict v, float *c, int n) { for (int i = 0; i < n; i++) v[i] = *c + 1.0f; } Solving the aliasing problemSolving the aliasing problem The fix? Declaring the output asThe fix? Declaring the output as restrictrestrict:: â–ș Alas, in practice may need to declareAlas, in practice may need to declare bothboth pointers restrict!pointers restrict!  A restrict-qualified pointer can grant access to non-restrict pointerA restrict-qualified pointer can grant access to non-restrict pointer  Full data-flow analysis required to detect thisFull data-flow analysis required to detect this  However, two restrict-qualified pointers are trivially non-aliasing!However, two restrict-qualified pointers are trivially non-aliasing!  Also may work declaring second argument as “float *Also may work declaring second argument as “float * constconst c”c”
  • 55. void Foo(float v[], const float *c, int n) { for (int i = 0; i < n; i++) v[i] = *c + 1.0f; } ‘‘const’ doesn’t helpconst’ doesn’t help Some might think this would work:Some might think this would work: â–ș Wrong!Wrong! constconst promises almost nothing!promises almost nothing!  SaysSays *c*c is const throughis const through cc,, notnot thatthat *c*c is const in generalis const in general  Can be cast awayCan be cast away  For detecting programming errors, not fixing aliasingFor detecting programming errors, not fixing aliasing Since *c is const, v[i] cannot write to it, right?
  • 56. SIMD + restrict = TRUESIMD + restrict = TRUE â–ș restrictrestrict enables SIMD optimizationsenables SIMD optimizations void VecAdd(int *a, int *b, int *c) { for (int i = 0; i < 4; i++) a[i] = b[i] + c[i]; } void VecAdd(int * restrict a, int *b, int *c) { for (int i = 0; i < 4; i++) a[i] = b[i] + c[i]; } Independent loads and stores. Operations can be performed in parallel! Stores may alias loads. Must perform operations sequentially.
  • 57. Restrict-qualified pointersRestrict-qualified pointers â–șImportant, especially with C++Important, especially with C++  Helps combat abstraction penalty problemHelps combat abstraction penalty problem â–șBut beware
But beware
  Tricky semantics, easy to get wrongTricky semantics, easy to get wrong  Compiler won’t tell you about incorrect useCompiler won’t tell you about incorrect use  Incorrect use = slow painful death!Incorrect use = slow painful death!
  • 58. Tips for avoiding aliasingTips for avoiding aliasing â–ș Minimize use of globals, pointers, referencesMinimize use of globals, pointers, references  Pass small variables by-valuePass small variables by-value  Inline small functions taking pointer or referenceInline small functions taking pointer or reference argumentsarguments â–ș Use local variables as much as possibleUse local variables as much as possible  Make local copies of global and class member variablesMake local copies of global and class member variables â–ș Don’t take the address of variables (with &)Don’t take the address of variables (with &) â–ș restrictrestrict pointers and referencespointers and references â–ș Declare variables close to point of useDeclare variables close to point of use â–ș Declare side-effect free functions asDeclare side-effect free functions as constconst â–ș Do manual CSE, especially of pointer expressionsDo manual CSE, especially of pointer expressions
  • 59. That’s it!That’s it! –– Resources 1/2Resources 1/2 â–ș Ericson, Christer. Real-time collision detection. Morgan-Ericson, Christer. Real-time collision detection. Morgan- Kaufmann, 2005. (Chapter on memory optimization)Kaufmann, 2005. (Chapter on memory optimization) â–ș Mitchell, Mark. Type-based alias analysis. Dr. Dobb’sMitchell, Mark. Type-based alias analysis. Dr. Dobb’s journal, October 2000.journal, October 2000. â–ș Robison, Arch. Restricted pointers are coming. C/C++Robison, Arch. Restricted pointers are coming. C/C++ Users Journal, July 1999.Users Journal, July 1999. http://www.cuj.com/articles/1999/9907/9907d/9907d.htmhttp://www.cuj.com/articles/1999/9907/9907d/9907d.htm â–ș Chilimbi, Trishul. Cache-conscious data structures - designChilimbi, Trishul. Cache-conscious data structures - design and implementation. PhD Thesis. University of Wisconsin,and implementation. PhD Thesis. University of Wisconsin, Madison, 1999.Madison, 1999. â–ș Prokop, Harald. Cache-oblivious algorithms. Master’sProkop, Harald. Cache-oblivious algorithms. Master’s Thesis. MIT, June, 1999.Thesis. MIT, June, 1999. â–ș 


  • 60. Resources 2/2Resources 2/2 â–ș 

 â–ș Gavin, Andrew. Stephen White. Teaching an old dog newGavin, Andrew. Stephen White. Teaching an old dog new bits: How console developers are able to improvebits: How console developers are able to improve performance when the hardware hasn’t changed.performance when the hardware hasn’t changed. Gamasutra. November 12, 1999Gamasutra. November 12, 1999 http://www.gamasutra.com/features/19991112/GavinWhite_01.htmhttp://www.gamasutra.com/features/19991112/GavinWhite_01.htm â–ș Handy, Jim. The cache memory book. Academic Press,Handy, Jim. The cache memory book. Academic Press, 1998.1998. â–ș Macris, Alexandre. Pascal Urro. Leveraging the power ofMacris, Alexandre. Pascal Urro. Leveraging the power of cache memory. Gamasutra. April 9, 1999cache memory. Gamasutra. April 9, 1999 http://www.gamasutra.com/features/19990409/cache_01.htmhttp://www.gamasutra.com/features/19990409/cache_01.htm â–ș Gross, Ornit. Pentium III prefetch optimizations using theGross, Ornit. Pentium III prefetch optimizations using the VTune performance analyzer. Gamasutra. July 30, 1999VTune performance analyzer. Gamasutra. July 30, 1999 httphttp://www.gamasutra.com/features/19990730/sse_prefetch_01.htm://www.gamasutra.com/features/19990730/sse_prefetch_01.htm â–ș Truong, Dan. François Bodin. AndrTruong, Dan. François Bodin. AndrĂ©Ă© Seznec. ImprovingSeznec. Improving cache behavior of dynamically allocated data structures.cache behavior of dynamically allocated data structures.