2. Talk contents 1/2
► Problem statement
  ▪ Why "memory optimization"?
► Brief architecture overview
  ▪ The memory hierarchy
► Optimizing for (code and) data cache
  ▪ General suggestions
  ▪ Data structures
    ► Prefetching and preloading
    ► Structure layout
    ► Tree structures
    ► Linearization caching
    ► …
3. Talk contents 2/2
► …
► Aliasing
  ▪ Abstraction penalty problem
  ▪ Alias analysis (type-based)
  ▪ "restrict" pointers
  ▪ Tips for reducing aliasing
4. Problem statement
► For the last 20-something years…
  ▪ CPU speeds have increased ~60%/year
  ▪ Memory access times only decreased ~10%/year
► Gap covered by use of cache memory
► Cache is under-exploited
  ▪ Diminishing returns for larger caches
► Inefficient cache use = lower performance
  ▪ How to increase cache utilization? Cache-awareness!
5. Need more justification? 1/3
Instruction parallelism:
SIMD instructions consume data at 2-8 times the rate of normal instructions!
6. Need more justification? 2/3
Proebsting's law:
Improvements to compiler technology double program performance every ~18 years!
Corollary: Don't expect the compiler to do it for you!
7. Need more justification? 3/3
On Moore's law:
► Consoles don't follow it (as such)
  ▪ Fixed hardware
  ▪ 2nd/3rd generation titles must get improvements from somewhere
8. Brief cache review
► Caches
  ▪ Code cache for instructions, data cache for data
  ▪ Forms a memory hierarchy
► Cache lines
  ▪ Cache divided into cache lines of ~32/64 bytes each
  ▪ Correct unit in which to count memory accesses
► Direct-mapped
  ▪ For a cache of size n, bytes at addresses k, k+n, k+2n, … map to the same cache line
► N-way set-associative
  ▪ Logical cache line corresponds to N physical lines
  ▪ Helps minimize cache line thrashing
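The mapping rule above can be sketched in a few lines. The cache size, line size, and the names SetIndex and ConflictsDirectMapped are illustrative assumptions, not the parameters of any console discussed in this talk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical cache geometry, chosen only for illustration.
constexpr std::size_t kLineSize  = 32;         // bytes per cache line
constexpr std::size_t kCacheSize = 16 * 1024;  // total cache size in bytes

// Index of the set an address maps to in an N-way set-associative cache
// (ways == 1 gives the direct-mapped case).
constexpr std::size_t SetIndex(std::uintptr_t addr, std::size_t ways) {
    return (addr / kLineSize) % (kCacheSize / (kLineSize * ways));
}

// True when two addresses map to the same line of a direct-mapped cache,
// which happens whenever they are a multiple of the cache size apart.
constexpr bool ConflictsDirectMapped(std::uintptr_t a, std::uintptr_t b) {
    return SetIndex(a, 1) == SetIndex(b, 1);
}
```

With more ways there are fewer sets, but each set can hold several of the conflicting lines at once, which is why associativity reduces thrashing.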
9. The memory hierarchy
Roughly:
CPU            1 cycle
L1 cache       ~1-5 cycles
L2 cache       ~5-20 cycles
Main memory    ~40-100 cycles
10. Some cache specs
            L1 cache (I/D)       L2 cache
PS2         16K/8K† 2-way        N/A
GameCube    32K/32K‡ 8-way       256K 2-way unified
XBOX        16K/16K 4-way        128K 8-way unified
PC          ~32-64K              ~128-512K
► † 16K data scratchpad important part of design
► ‡ configurable as 16K 4-way + 16K scratchpad
11. Foes: 3 C's of cache misses
► Compulsory misses
  ▪ Unavoidable misses when data read for first time
► Capacity misses
  ▪ Not enough cache space to hold all active data
  ▪ Too much data accessed in between successive uses
► Conflict misses
  ▪ Cache thrashing due to data mapping to same cache lines
12. Friends: Introducing the 3 R's
             Compulsory   Capacity   Conflict
Rearrange    X            (x)        X
Reduce       X            X          (x)
Reuse                     (x)        X
► Rearrange (code, data)
  ▪ Change layout to increase spatial locality
► Reduce (size, # cache lines read)
  ▪ Smaller/smarter formats, compression
► Reuse (cache lines)
  ▪ Increase temporal (and spatial) locality
13. Measuring cache utilization
► Profile
  ▪ CPU performance/event counters
    ► Give memory access statistics
    ► But not access patterns (e.g. stride)
  ▪ Commercial products
    ► SN Systems' Tuner, Metrowerks' CATS, Intel's VTune
  ▪ Roll your own
    ► In gcc "-p" option + define _mcount()
    ► Instrument code with calls to logging class
  ▪ Do back-of-the-envelope comparison
► Study the generated code
14. Code cache optimization 1/2
► Locality
  ▪ Reorder functions
    ► Manually within file
    ► Reorder object files during linking (order in makefile)
    ► __attribute__ ((section ("xxx"))) in gcc
  ▪ Adapt coding style
    ► Monolithic functions
    ► Encapsulation/OOP is less code cache friendly
  ▪ Moving target
  ▪ Beware various implicit functions (e.g. fptodp)
15. Code cache optimization 2/2
► Size
  ▪ Beware: inlining, unrolling, large macros
  ▪ KISS
    ► Avoid featuritis
    ► Provide multiple copies (also helps locality)
  ▪ Loop splitting and loop fusion
  ▪ Compile for size ("-Os" in gcc)
  ▪ Rewrite in asm (where it counts)
► Again, study generated code
  ▪ Build intuition about code generated
16. Data cache optimization
► Lots and lots of stuff…
  ▪ "Compressing" data
  ▪ Blocking and strip mining
  ▪ Padding data to align to cache lines
  ▪ Plus other things I won't go into
► What I will talk about…
  ▪ Prefetching and preloading data into cache
  ▪ Cache-conscious structure layout
  ▪ Tree data structures
  ▪ Linearization caching
  ▪ Memory allocation
  ▪ Aliasing and "anti-aliasing"
17. Prefetching and preloading
► Software prefetching
  ▪ Not too early: data may be evicted before use
  ▪ Not too late: data not fetched in time for use
  ▪ Greedy
► Preloading (pseudo-prefetching)
  ▪ Hit-under-miss processing
18. Software prefetching
// Loop through and process all 4n elements
for (int i = 0; i < 4 * n; i++)
    Process(elem[i]);

const int kLookAhead = 4; // Some elements ahead
for (int i = 0; i < 4 * n; i += 4) {
    Prefetch(elem[i + kLookAhead]);
    Process(elem[i + 0]);
    Process(elem[i + 1]);
    Process(elem[i + 2]);
    Process(elem[i + 3]);
}
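The slide leaves Prefetch() undefined. As a minimal sketch, assuming GCC or Clang, it could be built on the `__builtin_prefetch` intrinsic. The pointer-taking signature, the SumWithPrefetch helper, and the look-ahead distance of 16 are all assumptions made for illustration. The right distance depends on the platform and on how much work is done per element.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: GCC or Clang, where __builtin_prefetch is available.
// On other compilers this sketch falls back to a no-op.
inline void Prefetch(const void *p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p);
#else
    (void)p;
#endif
}

// Sums an array while prefetching a fixed distance ahead.
inline long SumWithPrefetch(const int *elem, std::size_t count) {
    const std::size_t kLookAhead = 16;   // placeholder tuning value
    long sum = 0;
    for (std::size_t i = 0; i < count; i++) {
        if (i + kLookAhead < count)      // never form an address past the array
            Prefetch(&elem[i + kLookAhead]);
        sum += elem[i];
    }
    return sum;
}
```

A prefetch is only a hint, so the function returns the same result whether or not the hint has any effect.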
19. Greedy prefetching
void PreorderTraversal(Node *pNode) {
    // Stop at empty subtrees
    if (!pNode) return;
    // Greedily prefetch left traversal path
    Prefetch(pNode->left);
    // Process the current node
    Process(pNode);
    // Greedily prefetch right traversal path
    Prefetch(pNode->right);
    // Recursively visit left then right subtree
    PreorderTraversal(pNode->left);
    PreorderTraversal(pNode->right);
}
20. Preloading (pseudo-prefetch)
Elem a = elem[0];
for (int i = 0; i < 4 * n; i += 4) {
    Elem e = elem[i + 4]; // Cache miss, non-blocking
    Elem b = elem[i + 1]; // Cache hit
    Elem c = elem[i + 2]; // Cache hit
    Elem d = elem[i + 3]; // Cache hit
    Process(a);
    Process(b);
    Process(c);
    Process(d);
    a = e;
}
(NB: This code reads one element beyond the end of the elem array.)
21. Structures
► Cache-conscious layout
  ▪ Field reordering (usually grouped conceptually)
  ▪ Hot/cold splitting
► Let usage decide format
  ▪ Array of structures
  ▪ Structures of arrays
► Little compiler support
  ▪ Easier for non-pointer languages (Java)
  ▪ C/C++: do it yourself
22. Field reordering
void Foo(S *p, void *key, int k) {
    while (p) {
        if (p->key == key) {
            p->count[k]++;
            break;
        }
        p = p->pNext;
    }
}
Before:
struct S {
    void *key;
    int count[20];
    S *pNext;
};
After:
struct S {
    void *key;
    S *pNext;
    int count[20];
};
► key and pNext are likely accessed together, so store them together!
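To make the effect of the reordering measurable, the sketch below compares how many bytes one traversal step has to span in each layout. The names SBefore, SAfter, kHotSpanBefore and kHotSpanAfter are hypothetical, introduced only for this comparison.

```cpp
#include <cassert>
#include <cstddef>

// Layout before reordering: the count array sits between the two fields
// that the list walk reads on every node.
struct SBefore {
    void    *key;
    int      count[20];
    SBefore *pNext;
};

// Layout after reordering: key and pNext are adjacent.
struct SAfter {
    void   *key;
    SAfter *pNext;
    int     count[20];
};

// Bytes from the start of key to the end of pNext, i.e. the span of
// memory that a single traversal step touches.
constexpr std::size_t kHotSpanBefore = offsetof(SBefore, pNext) + sizeof(SBefore *);
constexpr std::size_t kHotSpanAfter  = offsetof(SAfter, pNext) + sizeof(SAfter *);
```

On a typical platform with 4-byte int, the span drops from more than 80 bytes (several cache lines) to two pointers, which fit in a single line.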
23. Hot/cold splitting
Hot fields:
struct S {
    void *key;
    S *pNext;
    S2 *pCold;
};
Cold fields:
struct S2 {
    int count[10];
};
► Allocate all "struct S" from a memory pool
  ▪ Increases coherence
► Prefer array-style allocation
  ▪ No need for actual pointer to cold fields
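The array-style variant can be sketched as two parallel arrays that share an index, so the hot part needs no pointer to the cold part. The names Hot, Cold, NodePool, Add and Bump are hypothetical, and using indices instead of pointers for the list links is a design choice of this sketch, not something the slide prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hot part: everything the search loop reads.
struct Hot {
    const void *key;
    int         next;   // index of the next node, or -1 at the end of the list
};

// Cold part: only touched once the matching node has been found.
struct Cold {
    int count[10];
};

// Parallel arrays: hot[i] and cold[i] describe the same logical node.
struct NodePool {
    std::vector<Hot>  hot;
    std::vector<Cold> cold;

    // Adds a node in front of the list that starts at index 'next'.
    int Add(const void *key, int next) {
        hot.push_back(Hot{key, next});
        cold.push_back(Cold{});          // value-initialized: all counts zero
        return static_cast<int>(hot.size()) - 1;
    }

    // Walks the list from 'head' and increments count[k] of the match.
    bool Bump(int head, const void *key, int k) {
        assert(k >= 0 && k < 10);
        for (int i = head; i != -1; i = hot[i].next) {
            if (hot[i].key == key) {
                cold[i].count[k]++;
                return true;
            }
        }
        return false;
    }
};
```

The search loop only ever pulls Hot entries into the cache. A Cold entry is read once per successful lookup.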
25. Beware compiler padding
struct X {
    int8 a;
    int64 b;
    int8 c;
    int16 d;
    int64 e;
    float f;
};
struct Y {
    int8 a, pad_a[7];
    int64 b;
    int8 c, pad_c[1];
    int16 d, pad_d[2];
    int64 e;
    float f, pad_f[1];
};
Decreasing size!
struct Z {
    int64 b;
    int64 e;
    float f;
    int16 d;
    int8 a;
    int8 c;
};
Assuming 4-byte floats, for most compilers sizeof(X) == 40, sizeof(Y) == 40, and sizeof(Z) == 24.
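The padding can be checked at compile time rather than taken on trust. The sketch below assumes the slide's int8/int16/int64 correspond to the fixed-width types from `<cstdint>`. kPayload, kPaddingX and kPaddingZ are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct X {                // declaration order as on the slide
    std::int8_t  a;
    std::int64_t b;
    std::int8_t  c;
    std::int16_t d;
    std::int64_t e;
    float        f;
};

struct Z {                // same members, sorted by decreasing size
    std::int64_t b;
    std::int64_t e;
    float        f;
    std::int16_t d;
    std::int8_t  a;
    std::int8_t  c;
};

// Sum of the member sizes: a lower bound that no ordering can beat.
constexpr std::size_t kPayload = 2 * sizeof(std::int64_t) + sizeof(float) +
                                 sizeof(std::int16_t) + 2 * sizeof(std::int8_t);

// Bytes lost to padding in each layout.
constexpr std::size_t kPaddingX = sizeof(X) - kPayload;
constexpr std::size_t kPaddingZ = sizeof(Z) - kPayload;

static_assert(sizeof(Z) <= sizeof(X), "sorting by size should not grow the struct");
```

The exact sizes depend on the platform's alignment rules, so the check is written as an inequality rather than against the 40 and 24 quoted on the slide.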
26. Cache performance analysis
► Usage patterns
  ▪ Activity: indicates hot or cold field
  ▪ Correlation: basis for field reordering
► Logging tool
  ▪ Access all class members through accessor functions
  ▪ Manually instrument functions to call Log() function
  ▪ Log() function…
    ► takes object type + member field as arguments
    ► hash-maps current args to count field accesses
    ► hash-maps current + previous args to track pairwise accesses
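A minimal sketch of such a Log() function follows. The AccessLog class, the FieldId alias and the Count/PairCount queries are hypothetical names. `std::map` stands in for the hash map mentioned on the slide, because `std::pair` has no standard hash. Identifying fields by strings is a readability shortcut. A real tool would more likely use small integer IDs.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// A field is identified by (type name, member name).
using FieldId = std::pair<std::string, std::string>;

class AccessLog {
public:
    // Called from every instrumented accessor.
    void Log(const std::string &type, const std::string &member) {
        const FieldId cur(type, member);
        fieldCount_[cur]++;                             // activity
        if (hasPrev_)
            pairCount_[std::make_pair(prev_, cur)]++;   // correlation
        prev_ = cur;
        hasPrev_ = true;
    }

    // Number of logged accesses to field f.
    int Count(const FieldId &f) const {
        auto it = fieldCount_.find(f);
        return it == fieldCount_.end() ? 0 : it->second;
    }

    // Number of times an access to b directly followed an access to a.
    int PairCount(const FieldId &a, const FieldId &b) const {
        auto it = pairCount_.find(std::make_pair(a, b));
        return it == pairCount_.end() ? 0 : it->second;
    }

private:
    std::map<FieldId, int> fieldCount_;
    std::map<std::pair<FieldId, FieldId>, int> pairCount_;
    FieldId prev_;
    bool hasPrev_ = false;
};
```

High field counts point at hot fields, and high pair counts point at fields worth placing next to each other.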
27. Tree data structures
► Rearrange nodes
  ▪ Increase spatial locality
  ▪ Cache-aware vs. cache-oblivious layouts
► Reduce size
  ▪ Pointer elimination (using implicit pointers)
  ▪ "Compression"
    ► Quantize values
    ► Store data relative to parent node
28. Breadth-first order
► Pointer-less: Left(n)=2n, Right(n)=2n+1
► Requires storage for complete tree of height H
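The implicit indexing can be sketched as follows, using 1-based indices so that the slide's formulas hold exactly. Parent(), SlotsForLevels(), Contains() and the `used` flag array are assumptions added for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 1-based implicit indexing: slot 0 is left unused.
inline std::size_t Left(std::size_t n)   { return 2 * n; }
inline std::size_t Right(std::size_t n)  { return 2 * n + 1; }
inline std::size_t Parent(std::size_t n) { return n / 2; }

// Array slots needed for a complete tree with 'levels' levels,
// including the unused slot 0.
inline std::size_t SlotsForLevels(unsigned levels) {
    return std::size_t(1) << levels;
}

// Searches a binary search tree stored in breadth-first order.
// used[i] marks the slots that hold a node; absent nodes still cost storage.
inline bool Contains(const std::vector<int> &keys,
                     const std::vector<bool> &used, int key) {
    std::size_t n = 1;   // root
    while (n < keys.size() && used[n]) {
        if (key == keys[n]) return true;
        n = (key < keys[n]) ? Left(n) : Right(n);
    }
    return false;
}
```

No child pointers are stored, but the array must be sized for the complete tree even when the actual tree is sparse.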
29. Depth-first order
► Left(n) = n + 1, Right(n) = stored index
► Only stores existing nodes
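A possible realization of the depth-first layout is sketched below. The slide only gives the two index rules. The hasLeft flag, the -1 sentinel, and the DfsNode, TreeNode, Flatten and Contains names are assumptions of this sketch.

```cpp
#include <cassert>
#include <vector>

struct DfsNode {
    int  key;
    bool hasLeft;   // if set, the left child is the next array element
    int  right;     // stored index of the right child, or -1 if none
};

// Pointer-based input tree, used only to build the linear layout.
struct TreeNode {
    int key;
    TreeNode *left;
    TreeNode *right;
};

// Appends the subtree rooted at t in preorder; returns its index, or -1.
inline int Flatten(const TreeNode *t, std::vector<DfsNode> &out) {
    if (!t) return -1;
    const int self = static_cast<int>(out.size());
    out.push_back(DfsNode{t->key, t->left != nullptr, -1});
    Flatten(t->left, out);                  // lands at self + 1 if present
    const int r = Flatten(t->right, out);
    out[self].right = r;                    // index stays valid after growth
    return self;
}

// Binary-search-tree lookup on the flattened array.
inline bool Contains(const std::vector<DfsNode> &nodes, int key) {
    int n = nodes.empty() ? -1 : 0;
    while (n != -1) {
        const DfsNode &node = nodes[n];
        if (key == node.key) return true;
        if (key < node.key) n = node.hasLeft ? n + 1 : -1;
        else                n = node.right;
    }
    return false;
}
```

Only existing nodes take up space, and following left children walks through consecutive memory.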
30. van Emde Boas layout
► "Cache-oblivious"
► Recursive construction
31. A compact static k-d tree
union KDNode {
// leaf, type 11
int32 leafIndex_type;
// non-leaf, type 00 = x,
// 01 = y, 10 = z-split
float splitVal_type;
};
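The union packs a two-bit type tag into the same 32-bit word as the payload. The sketch below assumes the tag lives in the two lowest bits (for a split node, the two least significant mantissa bits) and that float is a 32-bit IEEE-754 value. It uses `std::memcpy` rather than a union to move bits between float and integer. MakeSplit, MakeLeaf and the accessors are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Node kinds, stored in the two lowest bits of the 32-bit word.
enum : std::uint32_t { kSplitX = 0, kSplitY = 1, kSplitZ = 2, kLeaf = 3 };

struct KDNode {
    std::uint32_t bits;
};
static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Stores a split value, giving up the two least significant mantissa bits.
inline KDNode MakeSplit(float splitVal, std::uint32_t axis) {
    assert(axis <= kSplitZ);
    std::uint32_t u;
    std::memcpy(&u, &splitVal, sizeof u);
    return KDNode{(u & ~std::uint32_t(3)) | axis};
}

// Stores a leaf index in the upper 30 bits.
inline KDNode MakeLeaf(std::uint32_t leafIndex) {
    assert(leafIndex < (std::uint32_t(1) << 30));
    return KDNode{(leafIndex << 2) | kLeaf};
}

inline std::uint32_t Type(KDNode n)      { return n.bits & 3; }
inline std::uint32_t LeafIndex(KDNode n) { return n.bits >> 2; }

// Recovers the split value with the tag bits cleared.
inline float SplitValue(KDNode n) {
    const std::uint32_t u = n.bits & ~std::uint32_t(3);
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

Clearing two mantissa bits perturbs the split plane very slightly, which is usually acceptable for a spatial partition.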
32. Linearization caching
► Nothing better than linear data
  ▪ Best possible spatial locality
  ▪ Easily prefetchable
► So linearize data at runtime!
  ▪ Fetch data, store linearized in a custom cache
  ▪ Use it to linearize…
    ► hierarchy traversals
    ► indexed data
    ► other random-access stuff
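As a minimal sketch of the idea for a hierarchy traversal, the class below copies node payloads into a contiguous array in traversal order and reuses that array until the tree is marked as changed. The Node and LinearCache names, the single-tree assumption, and the explicit Invalidate() call are all choices made for this illustration.

```cpp
#include <cassert>
#include <vector>

struct Node {
    int   value;
    Node *left;
    Node *right;
};

// Custom cache: the payload of one tree, copied out in traversal order.
class LinearCache {
public:
    // Returns the linearized values, rebuilding only when marked dirty.
    const std::vector<int> &Get(const Node *root) {
        if (dirty_) {
            data_.clear();
            Linearize(root);
            dirty_ = false;
        }
        return data_;
    }

    // Must be called whenever the tree changes.
    void Invalidate() { dirty_ = true; }

private:
    void Linearize(const Node *n) {
        if (!n) return;
        data_.push_back(n->value);   // preorder, matching the traversal
        Linearize(n->left);
        Linearize(n->right);
    }

    std::vector<int> data_;
    bool dirty_ = true;
};
```

Repeated passes then iterate over data_ linearly instead of chasing pointers. The cost is that stale data is returned until Invalidate() is called.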
34. Memory allocation policy
► Don't allocate from heap, use pools
  ▪ No block overhead
  ▪ Keeps data together
  ▪ Faster too, and no fragmentation
► Free ASAP, reuse immediately
  ▪ Block is likely in cache so reuse its cachelines
  ▪ First fit, using free list
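A fixed-size pool with a free list might look like the sketch below. The Pool class and kNoSlot are hypothetical names. For simplicity the free-list links are kept in a separate index array rather than inside the free blocks themselves, and T must be default-constructible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kNoSlot = static_cast<std::size_t>(-1);

// Fixed-capacity pool of T. The most recently freed slot is handed out
// first, so its cache lines are likely still warm.
template <typename T>
class Pool {
public:
    explicit Pool(std::size_t capacity)
        : slots_(capacity), next_(capacity), freeHead_(capacity ? 0 : kNoSlot) {
        for (std::size_t i = 0; i < capacity; i++)
            next_[i] = (i + 1 < capacity) ? i + 1 : kNoSlot;
    }

    // Returns a slot, or nullptr when the pool is exhausted.
    T *Alloc() {
        if (freeHead_ == kNoSlot) return nullptr;
        const std::size_t i = freeHead_;
        freeHead_ = next_[i];
        return &slots_[i];
    }

    // Puts a slot obtained from Alloc() back at the head of the free list.
    void Free(T *p) {
        const std::size_t i = static_cast<std::size_t>(p - slots_.data());
        assert(i < slots_.size());
        next_[i] = freeHead_;
        freeHead_ = i;
    }

private:
    std::vector<T> slots_;            // all objects live in one contiguous block
    std::vector<std::size_t> next_;   // free-list links
    std::size_t freeHead_;
};
```

Allocation and release are a couple of index updates each, and every object comes from the same contiguous block.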
35. The curse of aliasing
What is aliasing?
Aliasing is multiple references to the same storage location
int n;
int *p1 = &n;
int *p2 = &n;
Aliasing is also missed opportunities for optimization
int Foo(int *a, int *b) {
    *a = 1;
    *b = 2;
    return *a;
}
What value is returned here? Who knows!
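The answer depends entirely on the caller, which the following sketch makes concrete. CallWithoutAliasing and CallWithAliasing are hypothetical helpers added around the slide's Foo().

```cpp
#include <cassert>

int Foo(int *a, int *b) {
    *a = 1;
    *b = 2;
    return *a;   // must be reloaded: the store to *b may have changed *a
}

// Distinct objects: the store through b cannot affect *a.
inline int CallWithoutAliasing() {
    int x = 0, y = 0;
    return Foo(&x, &y);
}

// Both pointers refer to n: the store through b overwrites *a.
inline int CallWithAliasing() {
    int n = 0;
    int *p1 = &n;
    int *p2 = &n;
    return Foo(p1, p2);
}
```

Because both calls are legal, the compiler cannot simply return the constant 1 from Foo(). It has to emit the reload.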
36. The curse of aliasing
► What is causing aliasing?
  ▪ Pointers
  ▪ Global variables/class members make it worse
► What is the problem with aliasing?
  ▪ Hinders reordering/elimination of loads/stores
    ► Poisoning data cache
    ► Negatively affects instruction scheduling
    ► Hinders common subexpression elimination (CSE), loop-invariant code motion, constant/copy propagation, etc.
37. How do we do "anti-aliasing"?
► What can be done about aliasing?
  ▪ Better languages
    ► Less aliasing, lower abstraction penalty†
  ▪ Better compilers
    ► Alias analysis such as type-based alias analysis†
  ▪ Better programmers (aiding the compiler)
    ► That's you, after the next 20 slides!
  ▪ Leap of faith
    ► -fno-aliasing
† To be defined
38. Matrix multiplication 1/3
Consider optimizing a 2x2 matrix multiplication:
void Mat22mul(float a[2][2], float b[2][2], float c[2][2]) {
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            a[i][j] = 0.0f;
            for (int k = 0; k < 2; k++)
                a[i][j] += b[i][k] * c[k][j];
        }
    }
}
How do we typically optimize it? Right, unrolling!
39. Matrix multiplication 2/3
Straightforward unrolling results in this:
// 16 memory reads, 4 writes
void Mat22mul(float a[2][2], float b[2][2], float c[2][2]) {
    a[0][0] = b[0][0]*c[0][0] + b[0][1]*c[1][0];
    a[0][1] = b[0][0]*c[0][1] + b[0][1]*c[1][1]; //(1)
    a[1][0] = b[1][0]*c[0][0] + b[1][1]*c[1][0]; //(2)
    a[1][1] = b[1][0]*c[0][1] + b[1][1]*c[1][1]; //(3)
}
► But wait! There's a hidden assumption! a is not b or c!
► Compiler doesn't (cannot) know this!
  ▪ (1) Must refetch b[0][0] and b[0][1]
  ▪ (2) Must refetch c[0][0] and c[1][0]
  ▪ (3) Must refetch b[0][0], b[0][1], c[0][0] and c[1][0]
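One possible way to avoid the refetches, sketched here as an assumption rather than as the talk's own solution, is to copy every input into a local variable before the first store. The local names b00 through c11 are introduced for this sketch.

```cpp
#include <cassert>

// Reads every input exactly once (8 loads), then performs the 4 stores.
// Because all loads happen before any store, the result is also well
// defined when a is the same matrix as b or c.
void Mat22mul(float a[2][2], float b[2][2], float c[2][2]) {
    const float b00 = b[0][0], b01 = b[0][1], b10 = b[1][0], b11 = b[1][1];
    const float c00 = c[0][0], c01 = c[0][1], c10 = c[1][0], c11 = c[1][1];
    a[0][0] = b00 * c00 + b01 * c10;
    a[0][1] = b00 * c01 + b01 * c11;
    a[1][0] = b10 * c00 + b11 * c10;
    a[1][1] = b10 * c01 + b11 * c11;
}
```

Locals whose address is never taken cannot be aliased by the stores to a, so the compiler is free to keep them in registers.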
41. Abstraction penalty problem
► Higher levels of abstraction have a negative effect on optimization
  ▪ Code broken into smaller generic subunits
  ▪ Data and operation hiding
    ► Cannot make local copy of e.g. internal pointers
    ► Cannot hoist constant expressions out of loops
► Especially because of aliasing issues
42. C++ abstraction penalty
► Lots of (temporary) objects around
  ▪ Iterators
  ▪ Matrix/vector classes
► Objects live in heap/stack
  ▪ Thus subject to aliasing
  ▪ Makes tracking of current member value very difficult
  ▪ But tracking required to keep values in registers!
► Implicit aliasing through the this pointer
  ▪ Class members are virtually as bad as global variables
43. C++ abstraction penalty
Pointer members in classes may alias other members:
class Buf {
public:
    void Clear() {
        for (int i = 0; i < numVals; i++)
            pBuf[i] = 0;
    }
private:
    int numVals, *pBuf;
};
numVals is not a local variable! It may be aliased by pBuf!
Code likely to refetch numVals each iteration!
44. C++ abstraction penalty
We know that aliasing won't happen, and can manually solve the aliasing issue by writing code as:
class Buf {
public:
    void Clear() {
        for (int i = 0, n = numVals; i < n; i++)
            pBuf[i] = 0;
    }
private:
    int numVals, *pBuf;
};
45. C++ abstraction penalty
Since pBuf[i] can only alias numVals in the first iteration, a quality compiler can fix this problem by peeling the loop once, turning it into:
void Clear() {
    if (numVals >= 1) {
        pBuf[0] = 0;
        for (int i = 1, n = numVals; i < n; i++)
            pBuf[i] = 0;
    }
}
Q: Does your compiler do this optimization?!
46. Type-based alias analysis
► Some aliasing the compiler can catch
  ▪ A powerful tool is type-based alias analysis
Use language types to disambiguate memory references!
47. Type-based alias analysis
► ANSI C/C++ states that…
  ▪ Each area of memory can only be associated with one type during its lifetime
  ▪ Aliasing may only occur between references of the same compatible type
► Enables compiler to rule out aliasing between references of non-compatible type
  ▪ Turned on with -fstrict-aliasing in gcc
48. Compatibility of C/C++ types
► In short…
  ▪ Types compatible if differing by signed, unsigned, const or volatile
  ▪ char and unsigned char compatible with any type
  ▪ Otherwise not compatible
► (See standard for full details.)
49. What TBAA can do for you
It can turn this:
void Foo(float *v, int *n) {
    for (int i = 0; i < *n; i++)
        v[i] += 1.0f;
}
Possible aliasing between v[i] and *n
into this:
void Foo(float *v, int *n) {
    int t = *n;
    for (int i = 0; i < t; i++)
        v[i] += 1.0f;
}
No aliasing possible so fetch *n once!
50. What TBAA can also do
► Cause obscure bugs in non-conforming code!
  ▪ Beware especially so-called "type punning"
Illegal C/C++ code!
uint32 i;
float f;
i = *((uint32 *)&f);
Allowed by gcc
uint32 i;
union {
    float f;
    uint32 i;
} u;
u.f = f;
i = u.i;
Required by standard
uint32 i;
union {
    float f;
    uchar8 c[4];
} u;
u.f = f;
i = (u.c[3]<<24L)+
    (u.c[2]<<16L)+
    ...;
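A further option, not shown on the slide, is to copy the bytes with `std::memcpy`, which is well defined in both C and C++ and which compilers commonly reduce to a single register move. The sketch assumes a 32-bit float. FloatBits and BitsToFloat are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Copies the object representation of f into an integer.
inline std::uint32_t FloatBits(float f) {
    std::uint32_t i;
    std::memcpy(&i, &f, sizeof i);
    return i;
}

// The inverse: reinterprets 32 bits as a float.
inline float BitsToFloat(std::uint32_t i) {
    float f;
    std::memcpy(&f, &i, sizeof f);
    return f;
}
```

Since no pointer of the wrong type is ever dereferenced, type-based alias analysis has nothing to miscompile here.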
51. Restrict-qualified pointers
► restrict keyword
  ▪ New to 1999 ANSI/ISO C standard
  ▪ Not in C++ standard yet, but supported by many C++ compilers
  ▪ A hint only, so may do nothing and still be conforming
► A restrict-qualified pointer (or reference)…
  ▪ …is basically a promise to the compiler that for the scope of the pointer, the target of the pointer will only be accessed through that pointer (and pointers copied from it).
  ▪ (See standard for full details.)
52. Using the restrict keyword
Given this code:
void Foo(float v[], float *c, int n) {
    for (int i = 0; i < n; i++)
        v[i] = *c + 1.0f;
}
You really want the compiler to treat it as if written:
void Foo(float v[], float *c, int n) {
    float tmp = *c + 1.0f;
    for (int i = 0; i < n; i++)
        v[i] = tmp;
}
But because of possible aliasing it cannot!
53. Using the restrict keyword
For example, the code might be called as:
float a[10];
a[4] = 0.0f;
Foo(a, &a[4], 10);
giving for the first version:
v[] = 1, 1, 1, 1, 1, 2, 2, 2, 2, 2
and for the second version:
v[] = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
The compiler must be conservative, and cannot perform the optimization!
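The difference between the two versions can be reproduced directly. In the sketch below, FooReload and FooHoisted are hypothetical names for the first and second version, and LastElementWhenAliased is a helper that performs the aliased call.

```cpp
#include <cassert>

// First version: re-reads *c on every iteration.
void FooReload(float v[], float *c, int n) {
    for (int i = 0; i < n; i++)
        v[i] = *c + 1.0f;
}

// Second version: the hoisted form the compiler would like to generate.
void FooHoisted(float v[], float *c, int n) {
    float tmp = *c + 1.0f;
    for (int i = 0; i < n; i++)
        v[i] = tmp;
}

// Calls one version with c pointing into v and returns the last element.
template <typename Fn>
float LastElementWhenAliased(Fn foo) {
    float a[10] = {};
    a[4] = 0.0f;
    foo(a, &a[4], 10);
    return a[9];
}
```

In the first version the store to v[4] changes *c from 0 to 1, so every later element becomes 2. The hoisted version never sees that change.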
54. Solving the aliasing problem
The fix? Declaring the output as restrict:
void Foo(float * restrict v, float *c, int n) {
    for (int i = 0; i < n; i++)
        v[i] = *c + 1.0f;
}
► Alas, in practice may need to declare both pointers restrict!
  ▪ A restrict-qualified pointer can grant access to non-restrict pointer
  ▪ Full data-flow analysis required to detect this
  ▪ However, two restrict-qualified pointers are trivially non-aliasing!
  ▪ Also may work declaring second argument as "float * const c"
55. "const" doesn't help
Some might think this would work:
void Foo(float v[], const float *c, int n) {
    for (int i = 0; i < n; i++)
        v[i] = *c + 1.0f;
}
Since *c is const, v[i] cannot write to it, right?
► Wrong! const promises almost nothing!
  ▪ Says *c is const through c, not that *c is const in general
  ▪ Can be cast away
  ▪ For detecting programming errors, not fixing aliasing
56. SIMD + restrict = TRUE
► restrict enables SIMD optimizations
void VecAdd(int *a, int *b, int *c) {
    for (int i = 0; i < 4; i++)
        a[i] = b[i] + c[i];
}
Stores may alias loads. Must perform operations sequentially.
void VecAdd(int * restrict a, int *b, int *c) {
    for (int i = 0; i < 4; i++)
        a[i] = b[i] + c[i];
}
Independent loads and stores. Operations can be performed in parallel!
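In C++ the keyword is only available as a compiler extension. A minimal sketch, assuming a compiler that accepts `__restrict` (GCC, Clang and MSVC do), wraps it in a macro. The RESTRICT macro name is hypothetical, and all three pointers are qualified here, following the earlier advice that qualifying only the output may not be enough in practice.

```cpp
#include <cassert>

// Assumption: __restrict is accepted by the compiler in use; elsewhere
// the macro expands to nothing and the code still compiles.
#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#define RESTRICT __restrict
#else
#define RESTRICT
#endif

// The caller promises that a, b and c do not overlap.
void VecAdd(int *RESTRICT a, const int *RESTRICT b, const int *RESTRICT c) {
    for (int i = 0; i < 4; i++)
        a[i] = b[i] + c[i];
}
```

Calling this with overlapping arrays breaks the promise and is undefined behavior, which the compiler will not diagnose.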
57. Restrict-qualified pointers
► Important, especially with C++
  ▪ Helps combat abstraction penalty problem
► But beware…
  ▪ Tricky semantics, easy to get wrong
  ▪ Compiler won't tell you about incorrect use
  ▪ Incorrect use = slow painful death!
58. Tips for avoiding aliasing
► Minimize use of globals, pointers, references
  ▪ Pass small variables by-value
  ▪ Inline small functions taking pointer or reference arguments
► Use local variables as much as possible
  ▪ Make local copies of global and class member variables
► Don't take the address of variables (with &)
► restrict pointers and references
► Declare variables close to point of use
► Declare side-effect free functions as const
► Do manual CSE, especially of pointer expressions
59. That's it! - Resources 1/2
► Ericson, Christer. Real-time collision detection. Morgan-Kaufmann, 2005. (Chapter on memory optimization)
► Mitchell, Mark. Type-based alias analysis. Dr. Dobb's Journal, October 2000.
► Robison, Arch. Restricted pointers are coming. C/C++ Users Journal, July 1999. http://www.cuj.com/articles/1999/9907/9907d/9907d.htm
► Chilimbi, Trishul. Cache-conscious data structures - design and implementation. PhD Thesis. University of Wisconsin, Madison, 1999.
► Prokop, Harald. Cache-oblivious algorithms. Master's Thesis. MIT, June, 1999.
► …