Branch Prediction Contest: Implementation of the Piecewise Linear Prediction Algorithm

Prosunjit Biswas
Department of Computer Science, University of Texas at San Antonio
Abstract

A branch predictor's accuracy is very important for harnessing the instruction-level parallelism (ILP) available in today's microprocessors, especially superscalar processors. Among branch predictors, neural predictors such as the Scaled Neural Analog Predictor (SNAP) and the Piecewise Linear Branch Predictor outperform other state-of-the-art designs. In this final project for the Computer Architecture course (CS-5513), I have studied various neural predictors and implemented the Piecewise Linear Branch Predictor following the algorithm given in a research paper by Dr. Daniel A. Jimenez. The hardware budget for this project is restricted, and I have implemented the predictor within a predefined budget of 64 KB of memory. I am also competing in the branch prediction contest.

Keywords: Piecewise Linear, Neural Network, Branch Prediction.

I. INTRODUCTION

Neural branch predictors are the most accurate predictors in the literature, but they were long considered impractical due to the high latency associated with prediction. This latency is due to the complex computation that must be carried out to determine the excitation of an artificial neuron [3].

Piecewise Linear Branch Prediction [1] improved both accuracy and latency over previous neural predictors. This predictor works by developing a set of linear functions, one for each program path to the branch to be predicted, that separate predicted-taken from predicted-not-taken branches.

In the paper Piecewise Linear Branch Prediction, Daniel A. Jimenez proposed two versions of the prediction algorithm: i) the Idealized Piecewise Linear Branch Predictor and ii) a Practical Piecewise Linear Branch Predictor. In this project, I have focused on the idealized predictor.

II. RELATED WORKS

Perceptron prediction was one of the first attempts in branch prediction history to approach branch prediction with neural networks. This predictor improved the misprediction rate on a composite trace of SPEC 2000 benchmarks by 14.7% [2], but unfortunately it was impractical due to its high latency.

Fast Path-Based Neural Branch Prediction [4] is another attempt, combining path and pattern history to overcome the limitations associated with earlier neural predictors. It improved accuracy over previous neural predictors while achieving significantly lower latency, improving the IPC of an aggressively clocked microarchitecture by 16% over the earlier perceptron predictor.

The Scaled Neural Analog Predictor, or SNAP, is another recently proposed neural branch predictor; it uses the concept of piecewise linear branch prediction and relies on a mixed analog/digital implementation, reducing both latency and power consumption compared to other available neural predictors [5]. Fig. 1 shows the comparative performance of notable branch prediction approaches on a set of SPEC CPU 2000 and 2006 integer benchmarks.

Fig. 1. Performance of different branch predictors over SPEC CPU 2000 and 2006 integer benchmarks (courtesy of "An Optimized Scaled Neural Branch Predictor" by Daniel A. Jimenez).

III. THE ALGORITHM

The branch predictor algorithm has two major parts: i) the prediction algorithm and ii) the train/update algorithm. Before going into the implementation of these two algorithms, we discuss the state and variables they use. The three-dimensional array W is the data structure that stores the branch weights; it is used in both the prediction and the update algorithm.
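The predictor state around the array W can be sketched in C++ as follows. This is a minimal sketch: the dimension sizes and the names N and M are illustrative assumptions (the actual dimensions are chosen by tuning within the memory budget), not the exact configuration used in the implementation.

```cpp
#include <cstdint>

// Illustrative sizes (assumptions for this sketch; the real dimensions
// are chosen by tuning under the 64 KB budget).
const int N = 64;          // first dimension: hashed branch address
const int M = 16;          // second dimension: path-history addresses (GA)
const int H = 64;          // third dimension: history length / GHR positions
const int SAT_VAL = 127;   // saturation bound for the 1-byte weights

int8_t       W[N][M][H];   // per-(branch, path-address, position) weights
unsigned int GA[H];        // addresses of the last H branches (path history)
bool         GHR[H];       // taken/not-taken outcomes of the last H branches
```

With 1-byte weights, this configuration of W alone occupies 64 x 16 x 64 = 65,536 bytes, which is why the final implementation trims one dimension to fit the budget.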
Fig. 2: The array W with its corresponding indices.

The first dimension of W is indexed by the branch address, generally taken from the last 8-10 bits of the instruction address. For each branch being predicted, the algorithm keeps a history of the branches that precede it on the dynamic path leading to that branch. The second dimension, indexed by the entries of GA, tracks this per-branch dynamic path history. The third dimension, indexed by i as in GHR[i], tracks the position of the address GA[i] in the global branch history register GHR.

Some of the important variables of the algorithm are also described here for clarity:

GA: an array of addresses that keeps the path history associated with each branch. As a new branch is executed, its address is shifted into the first position of the array.

GHR: an array of Boolean true/false values that keeps track of the taken/not-taken outcome of recent branches.

H: the length of the history register.

output: an integer value computed by the predictor to decide the direction of the current branch.

Table I: The prediction algorithm.

branch_update *predict (branch_info & b) {
    bi = b;
    if (b.br_flags & BR_CONDITIONAL) {
        // index the first dimension of W with bits 2-7 of the branch address
        address = (((b.address >> 4) & 0x0F) << 2) | ((b.address >> 2) & 0x03);
        output = W[address][0][0];             // bias weight
        for (int i = 0; i < H; i++) {
            if (GHR[i] == true)
                output += W[address][GA[i]][i];
            else
                output -= W[address][GA[i]][i];
        }
        u.direction_prediction (output >= 0);
    } else {
        u.direction_prediction (false);
    }
    u.target_prediction (0);
    return &u;
}

Table II: The update/train algorithm.

void update (branch_update *u, bool taken, unsigned int target) {
    if (bi.br_flags & BR_CONDITIONAL) {
        // train only on a misprediction or when |output| is below the threshold
        if (abs (output) < theta || ((output >= 0) != taken)) {
            if (taken) {
                if (W[address][0][0] < SAT_VAL)    // saturate the bias weight
                    W[address][0][0]++;
            } else {
                if (W[address][0][0] > -SAT_VAL)
                    W[address][0][0]--;
            }
            for (int i = 0; i < H; i++) {
                if (GHR[i] == taken) {
                    if (W[address][GA[i]][i] < SAT_VAL)
                        W[address][GA[i]][i]++;
                } else {
                    if (W[address][GA[i]][i] > -SAT_VAL)
                        W[address][GA[i]][i]--;
                }
            }
        }
        shift_update_GA (address);
        shift_update_GHR (taken);
    }
}

IV. TUNING PERFORMANCE

Besides the algorithm itself, the MPKI (mispredictions per kilo-instruction) rate depends on the sizes of the dimensions of the array W. I have measured MPKI for several configurations of W; Table III shows the results.

Table III: MPKI of the piecewise linear predictor within the 64 KB budget.

W[i][GA[i]][GHR[i]]   MPKI
W[64][16][64]         3.982
W[128][16][32]        4.217
W[64][8][128]         4.292
W[32][16][128]        5.807
W[64][64][16]         4.826

The table shows that the predictor performs best when the three dimensions have 64, 16, and 64 entries, respectively.

V. TWEAKING INSTRUCTION ADDRESS
I have found that, rather than taking the lowest bits of the address directly, discarding the 2 least significant bits and then taking bits 3-8 makes the predictor more accurate. This decreases aliasing and thus improves the prediction rate slightly.
Fig. 3: Tweaking the branch address for performance speed-up.

VI. RESULT

The misprediction rate of the benchmarks under the piecewise linear algorithm is shown in Fig. 4. Fig. 5 shows a comparison of different prediction algorithms (piecewise linear, perceptron, and gshare) across the given benchmarks.

Fig. 4: Misprediction rate of different benchmarks (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 201.compress, 202.jess, 205.raytrace, 209.db, 213.javac, 222.mpegaudio, 227.mtrt, 228.jack, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf) using the piecewise linear prediction algorithm.

Fig. 5: Comparison of prediction algorithms against different benchmarks on the given 64 KB budget.

VII. 64 KB BUDGET CALCULATION

I have limited the implementation of the piecewise linear prediction algorithm to 64 KB + 256 bytes of memory. The algorithm performs better as the memory limit increases. Table IV shows the calculation for the 64 KB + 256 byte budget.

Table IV: 64 KB (65,536 byte) memory budget calculation.

Data structure / variable                                    Memory
W[64][16][63], 1 byte per weight                             64,512 bytes
Constants (SIZE, H, SAT_VAL, theta, N), 1 byte each (< 128)  5 bytes
GA (63 entries x 6 bits / 8)                                 48 bytes
GHR (63 entries x 1 bit / 8)                                 8 bytes
Variables (address, output), 2 x 4 bytes                     8 bytes
Total                                                        64,581 bytes

VIII. CONCLUSION

In this individual course final project, I have implemented the piecewise linear branch prediction algorithm. In my implementation, I have achieved an MPKI of 3.982 at best. I believe it is also possible to enhance the performance of this algorithm with better implementation tricks. I have also compared the performance of the piecewise linear algorithm with the perceptron and gshare algorithms: within the same memory limit, piecewise linear prediction performs significantly better than the other two.

REFERENCES

[1] D. A. Jimenez, "Piecewise linear branch prediction," in Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA-32), June 2005.

[2] D. A. Jimenez and C. Lin, "Dynamic branch prediction with perceptrons," in Proceedings of the Seventh International Symposium on High Performance Computer Architecture (HPCA-7), January 2001.

[3] A. Lakshminarayanan and S. Shriraghavan, "Neural Branch Prediction," available at http://webspace.ulbsibiu.ro/lucian.vintan/html/neuralpredictors.pdf

[4] D. A. Jimenez, "Fast Path-Based Neural Branch Prediction," in Proceedings of the 36th Annual International Symposium on Microarchitecture (MICRO-36), pp. 243-252, December 2003.

[5] D. A. Jimenez, "An optimized scaled neural branch predictor," in Proceedings of the 2011 IEEE 29th International Conference on Computer Design (ICCD), pp. 113-118, October 2011.