(12) United States Patent
Le et al.

(10) Patent No.: US 7,051,177 B2
(45) Date of Patent: May 23, 2006
(54) METHOD FOR MEASURING MEMORY LATENCY IN A HIERARCHICAL MEMORY SYSTEM
(75) Inventors: Hung Qui Le, Austin, TX (US); Alexander Erik Mericas, Austin, TX (US); Robert Dominick Mirabella, Round Rock, TX (US); Toshihiko Kurihara, Kanagawa-ken (JP); Michitaka Okuno, Nagano (JP); Masahiro Tokoro, Kanagawa (JP)

(73) Assignee: International Business Machines Corporation, Armonk, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 535 days.
(21) Appl. No.: 10/210,359

(22) Filed: Jul. 31, 2002

(65) Prior Publication Data
US 2004/0024982 A1, Feb. 5, 2004
(51) Int. Cl.
G06F 12/00 (2006.01)
G06F 12/14 (2006.01)
G06F 12/16 (2006.01)
G06F 13/00 (2006.01)
G06F 13/28 (2006.01)
(52) U.S. Cl. 711/167; 711/113; 711/137; 711/202; 710/100; 710/260; 712/200; 712/216; 712/219; 712/227

(58) Field of Classification Search: 711/113, 137, 167, 202; 710/100, 260; 712/200, 216, 219, 227
See application file for complete search history.
(56) References Cited

U.S. PATENT DOCUMENTS

5,664,193 A * 9/1997 Tirumalai, 717/153
5,964,867 A * 10/1999 Anderson et al., 712/219
2002/0087811 A1 * 7/2002 Khare et al., 711/146
2003/0005252 A1 * 1/2003 Wilson et al., 711/167

OTHER PUBLICATIONS

Singhal et al., Apr. 18-21, 1994, IEEE, pp. 48-59.*

* cited by examiner
Primary Examiner: Donald Sparks
Assistant Examiner: Bao Q Truong
(74) Attorney, Agent, or Firm: Mark E. McBurney; Dillon & Yudell LLP
(57) ABSTRACT

A method for determining the latency for a particular level of memory within a hierarchical memory system is disclosed. One performance monitor counter is allocated to count the number of loads (load counter) and another to count the number of cycles (cycle counter). The method begins with a processor determining which load to select for measurement. In response to the determination, the cycle counter value is stored in a rewind register. The processor issues the load and begins counting cycles. In response to the load completing, the level of memory for the load is determined. If the load was executed from the desired memory level, the load counter is incremented. Otherwise, the cycle counter is rewound to its previous value.

12 Claims, 4 Drawing Sheets
[Front-page figure: excerpt of the FIG. 4 flowchart. The processor copies the latency PMC value to the rewind counter, issues the load, and starts the latency counter; when the load completes, the processor determines whether it was satisfied from the desired level of memory, incrementing the load counter if so and restoring the latency PMC value from the rewind counter if not.]
[Sheet 1 of 4: FIG. 1, block diagram of data processing system 100, showing CPU 102, graphics adapter 104, display 106, system bus 108, system memory 110, I/O bus bridge 112, I/O bus 114, nonvolatile storage 116, and input device 118.]
[Sheet 3 of 4: FIG. 3, block diagram of a processor core, showing IFAR 318, I-cache 320, BR scan 312, BR predict 316, instruction queue 322, the decode, crack and group formation unit 324, the BR/CR, FX/LS 1, FX/LS 2, and FP issue queues 326, 328a, 328b, and 330, the execution units, storage queue 314, and D-cache 312.]
[Sheet 4 of 4: FIG. 4, flow chart of the measurement method: the processor selects a load for measurement (402), copies the latency PMC value to the rewind counter (404), issues the load (406), and starts the latency counter (408); the processor then determines whether the load was satisfied from the desired level of memory (410), restoring the latency PMC value from the rewind counter if not (412) and incrementing the load counter if so (414).]
METHOD FOR MEASURING MEMORY LATENCY IN A HIERARCHICAL MEMORY SYSTEM
BACKGROUND OF THE INVENTION
The present invention is related to the subject matter of the following commonly assigned, copending U.S. patent application: Ser. No. 10/210,357, entitled "SPECULATIVE COUNTING OF PERFORMANCE EVENTS WITH REWIND COUNTER," filed Jul. 31, 2002. The content of the above-referenced application is incorporated herein by reference.
1. Technical Field

This invention relates to performance monitoring for a microprocessor, more particularly to monitoring memory latency, and still more particularly to monitoring memory latency for a microprocessor having a hierarchical memory system.
2. Description of the Related Art

Processors often contain several levels of memory for performance and cost reasons. Generally, memory levels closest to the processor are small and fast, while memory farther from the processor is larger and slower. The level of memory closest to the processor is the Level 1 (L1) cache, which provides a limited amount of high-speed memory. The next closest level of memory to the processor is the Level 2 (L2) cache. The L2 cache is generally larger than the L1 cache, but takes longer to access than the L1 cache. The system main memory is the level of memory farthest from the processor. Accessing main memory consumes considerably more time than accessing lower levels of memory.
When a processor requests data from a memory address, the L1 cache is examined for the data. If the data is present, it is returned to the processor. Otherwise, the L2 cache is queried for the requested memory data. If the data is not present, the L2 cache acquires the requested memory address data from the system main memory. As data passes from main memory to each lower level of memory, the data is stored to permit more rapid access on subsequent requests.
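For illustration only (this sketch is not part of the patent), the lookup order just described can be rendered in a few lines of Python; the cache and address names are hypothetical and the caches are modeled as plain dictionaries:

    # Minimal sketch of the hierarchical lookup described above.
    def load(address, l1, l2, main_memory):
        if address in l1:                 # L1 hit: fastest path
            return l1[address]
        if address in l2:                 # L2 hit: fill L1 for next time
            l1[address] = l2[address]
            return l1[address]
        data = main_memory[address]       # miss everywhere: go to main memory
        l2[address] = data                # store at each lower level so
        l1[address] = data                # subsequent requests are faster
        return data

    main_memory = {0x100: "payload"}
    l1, l2 = {}, {}
    print(load(0x100, l1, l2, main_memory))  # satisfied from main memory
    print(load(0x100, l1, l2, main_memory))  # satisfied from L1 this time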
Additionally, many modern microprocessors include a Performance Monitor Unit (PMU). The PMU contains one or more counters (PMCs) that accumulate the occurrence of internal events that impact or are related to the performance of a microprocessor. For example, a PMU may monitor processor cycles, instructions completed, or delay cycles executing a load from memory. These statistics are useful in optimizing the architecture of a microprocessor and the instructions executed by a microprocessor.
While a PMU may accumulate the number of delay cycles executing a load in a PMC, this value is not always useful, as the count does not indicate how much each level of memory contributed to the count. Performance engineers are often interested in the contributions to the load delay by each level of memory. Currently, there is no method of crisply, or accurately, counting the number of delay cycles attributable to a particular level of memory in a hierarchical memory system.
The method currently used to determine delay cycles while accessing a particular level of memory involves setting a threshold value. As a processor is required to search memory levels farther away, the number of delay cycles increases noticeably. If the number of delay cycles versus level of memory were plotted, there would be sharp rises in the delay cycles for each level of memory moving away from the processor. Accordingly, the present method of determining delay cycles for a particular level of memory sets a threshold value depending on the level of memory to be measured.
Typically, the system main memory is first measured with a large threshold value, since accesses to main memory take longer. If a load delay exceeds the threshold, then the delay is attributed to main memory. Having a delay cycle count for main memory, the next lower level of memory (assume L2) is measured. The threshold is set accordingly and all delays exceeding the threshold are counted. The count also includes delays from accessing main memory; however, since the number of delay cycles for main memory is already approximately known, the delay cycle count for L2 is obtained by subtracting the delay cycle count for main memory from the count obtained using the threshold for L2. The process is repeated for each lower level of memory.
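A worked example of this subtraction, with thresholds and cycle counts invented purely for illustration:

    # Hypothetical counts illustrating the threshold/subtraction method.
    MAIN_THRESHOLD = 300       # cycles; only main-memory loads exceed this
    L2_THRESHOLD = 50          # cycles; L2 and main-memory loads exceed this

    delays_over_main = 40_000  # pass 1: delay cycles above MAIN_THRESHOLD
    delays_over_l2 = 65_000    # pass 2: delay cycles above L2_THRESHOLD
                               # (counts the main-memory delays again)

    l2_delays = delays_over_l2 - delays_over_main  # approximate L2 share
    print(l2_delays)           # 25000 cycles attributed to L2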
The problem with using a threshold to measure memory latency in a hierarchical memory system is that it does not accurately determine the delay for each level of memory and requires several passes to determine the delay cycle counts for lower levels of memory. A memory access to a lower level of memory may exceed the threshold for a higher level of memory under certain circumstances, which would result in the delay being attributed to the incorrect level of memory.
Therefore, there is a need for a new and improved method for accurately counting the number of delay cycles attributable to a particular level of memory in a hierarchical memory system.
SUMMARY OF THE INVENTION
As will be seen, the foregoing invention satisfies the foregoing needs and accomplishes additional objectives. Briefly described, the present invention provides an improved method for counting the number of delay cycles attributable to a particular level of memory within a hierarchical memory system.
According to one aspect of the present invention, a method for counting the number of delay cycles attributable to a particular level of memory within a hierarchical memory system is described. One performance monitor counter is allocated to count the number of loads (load counter) and another to count the number of cycles (cycle counter). The system and method begin with a processor determining which load to select for measurement. In response to the determination, the cycle counter value is stored in a rewind register. The processor issues the load and begins counting cycles. In response to the load completing, the level of memory for the load is determined. If the load was executed from the desired memory level, the load counter is incremented. Otherwise, the cycle counter is rewound to its previous value.
BRIEF DESCRIPTION OF THE DRAWINGS
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is a block diagram of an exemplary computer system used in the present invention;

FIG. 2 depicts an exemplary processor used with the present invention;
FIG. 3 illustrates an exemplary processor core used with the present invention; and

FIG. 4 is a flow chart depicting one possible set of steps taken to carry out the present invention, allowing for the number of delay cycles for a particular level of memory in a hierarchical memory system to be determined.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Referring now to the drawing figures, in which like numerals indicate like elements or steps throughout the several views, the preferred embodiment of the present invention will be described. In general, the present invention provides for counting the number of delay cycles attributable to a particular level of memory within a hierarchical memory system.
With reference now to FIG. 1, there is depicted a block diagram of a data processing system in which a preferred embodiment of the present invention may be implemented. Data processing system 100 may be, for example, one of the models of personal computers available from International Business Machines Corporation of Armonk, N.Y. Data processing system 100 includes a central processing unit (CPU) 102, which is connected to a system bus 108. In the exemplary embodiment, data processing system 100 includes a graphics adapter 104, also connected to system bus 108, for providing user interface information to a display 106.
Also connected to system bus 108 are a system memory 110 and an input/output (I/O) bus bridge 112. I/O bus bridge 112 couples an I/O bus 114 to system bus 108, relaying and/or transforming data transactions from one bus to the other. Peripheral devices such as nonvolatile storage 116, which may be a hard disk drive, and input device 118, which may include a conventional mouse, a trackball, or the like, are connected to I/O bus 114.
The exemplary embodiment shown in FIG. 1 is provided solely for the purposes of explaining the invention, and those skilled in the art will recognize that numerous variations are possible, both in form and function. For instance, data processing system 100 might also include a compact disk read-only memory (CD-ROM) or digital video disk (DVD) drive, a sound card and audio speakers, and numerous other optional components. All such variations are believed to be within the spirit and scope of the present invention.
The CPU 102 described in FIG. 1 is preferably a microprocessor such as the POWER4 (TM) chip manufactured by International Business Machines, Inc. of Armonk, N.Y.
With reference now to FIG. 2, such an exemplary microprocessor is depicted as CPU 102. In the preferred embodiment, at least two processor cores 202a and 202b are included in CPU 102. Processor cores 202 share a unified second-level cache system, depicted as L2 caches 204a-204c, through a core interface unit (CIU) 206. CIU 206 is a crossbar switch between the L2 caches 204a-204c, each implemented as a separate, autonomous cache controller, and the two processor cores 202. Each L2 cache 204 can operate concurrently and feed multiple bytes of data per cycle. CIU 206 connects each of the three L2 caches 204 to either an L1 data cache (shown as D-cache 312 in FIG. 3) or an L1 instruction cache (shown as I-cache 320 in FIG. 3) in either of the two processor cores. Additionally, CIU 206 accepts stores from CPU 102 across multiple-byte-wide buses and sequences them to the L2 caches 204. Each CPU 102 has associated with it a noncacheable (NC) unit 208, responsible for handling instruction-serializing functions and performing any noncacheable operations in the storage hierarchy. Logically, NC unit 208 is part of L2 cache 204.
An L3 directory 210 for a third-level cache, L3 (not shown), and an associated L3 controller 212 are also part of CPU 102. The actual L3 may be onboard CPU 102 or on a separate chip. A separate functional unit, referred to as a fabric controller 214, is responsible for controlling dataflow between the L2 cache, including L2 cache 204 and NC unit 208, and L3 controller 212. Fabric controller 214 also controls input/output (I/O) dataflow to other CPUs 102 and other I/O devices (not shown). For example, a GX controller 216 can control a flow of information into and out of CPU 102, either through a connection to another CPU 102 or to an I/O device.
Also included within CPU 102 are functions logically called pervasive functions. These include a trace and debug facility 218 used for first-failure data capture, a built-in self-test (BIST) engine 220, a performance-monitoring unit (PMU) 222, a service processor (SP) controller 224 used to interface with a service processor (not shown) to control the overall data processing system 100 shown in FIG. 1, a power-on reset (POR) sequencer 226 for sequencing logic, and error detection and logging circuitry 228.

As depicted, PMU 222 includes performance monitor counters (PMCs) 223a-c. PMCs 223a-c may be allocated to count various events related to CPU 102. For example, PMCs 223a-c may be utilized in determining cycles per instruction (CPI), load delay, execution delay, and data dependency delay. In the present invention, PMCs 223a-c are utilized to maintain counts of the number of loads and the number of delay cycles attributable to a particular memory level.
With reference now to FIG. 3, there is depicted a high-level block diagram of processor core 202 depicted in FIG. 2. The two processor cores 202 shown in FIG. 2 are on a single chip and are identical, providing a two-way Symmetric Multiprocessing (SMP) model to software. Under the SMP model, either idle processor core 202 can be assigned any task, and additional CPUs 102 can be added to improve performance and handle increased loads.
The internal microarchitecture of processor core 202 is preferably a speculative superscalar out-of-order execution design. In the exemplary configuration depicted in FIG. 3, multiple instructions can be issued each cycle, with one instruction being executed each cycle in each of a branch (BR) execution unit 302, a condition register (CR) execution unit 304 for executing CR-modifying instructions, fixed-point (FX) execution units 306a and 306b for executing fixed-point instructions, load-store execution units (LSUs) 310a and 310b for executing load and store instructions, and floating-point (FP) execution units 308a and 308b for executing floating-point instructions. LSUs 310, each capable of performing address-generation arithmetic, work with data cache (D-cache) 312 and storage queue 314 to provide data to FP execution units 308.
Branch-prediction scan logic (BR scan) 312 scans fetched instructions located in the instruction cache (I-cache) 320, looking for multiple branches each cycle. Depending upon the branch type found, a branch-prediction mechanism denoted as BR predict 316 is engaged to help predict the branch direction or the target address of the branch, or both. That is, for conditional branches, the branch direction is predicted, and for unconditional branches, the target address is predicted. Branch instructions flow through an instruction-fetch address register (IFAR) 318, the I-cache 320, an instruction queue 322, a decode, crack and group (DCG) unit 324, and a branch/condition register (BR/CR) issue queue 326 until the branch instruction ultimately reaches and is executed in BR execution unit 302, where actual outcomes of the branches are determined. At that point, if the predictions were found to be correct, the branch instructions are simply completed like all other instructions. If a prediction is found to be incorrect, the instruction-fetch logic, including BR scan 312 and BR predict 316, causes the mispredicted instructions to be discarded and begins refetching instructions along the corrected path.
Instructions are fetched from I-cache 320 on the basis of the contents of IFAR 318. IFAR 318 is normally loaded with an address determined by the branch-prediction logic described above. For cases in which the branch-prediction logic is in error, the branch-execution unit will cause IFAR 318 to be loaded with the corrected address of the instruction stream to be fetched. Additionally, there are other factors that can cause a redirection of the instruction stream, some based on internal events, others on interrupts from external events. In any case, once IFAR 318 is loaded, the I-cache 320 is accessed and retrieves multiple instructions per cycle. The I-cache 320 is accessed using an I-cache directory (IDIR) (not shown), which is indexed by the effective address of the instruction to provide the required real addresses. On an I-cache 320 miss, instructions are returned from the L2 cache 204 illustrated in FIG. 2.
With reference now to FIG. 4, a flow chart of one possible set of steps to carry out the present invention is depicted. Prior to the execution of the steps in FIG. 4, a performance monitor counter is allocated to count latency cycles for a predetermined level of the memory hierarchy (latency counter). A second performance monitor counter is allocated to count the total number of loads from the predetermined level (load counter).
As illustrated at step 402, a processor selects a load instruction for measurement. The load instruction may be selected by any number of means known in the art, such as random selection based on position in an internal queue, or filtering of instructions based on some characteristic of the instruction. After the processor selects a load for measurement, the processor causes the latency counter value to be copied to a rewind register, as depicted at step 404.
Once the value of the latency counter is preserved in the rewind register, the processor is ready to issue the load, as illustrated at step 406. While the processor is executing the load, the processor increments the latency counter each cycle, as depicted at step 408. After the load has completed, the storage system returns an indicator specifying which level of the hierarchy the load was satisfied from. The processor is then able to determine whether the load was satisfied from the predetermined level of memory, as illustrated at step 410.
If the load was not satisfied from the predetermined level of memory, the processor restores the latency counter value from the rewind counter, as depicted at step 412. By restoring the latency counter to the rewind counter value, the latency counter discards the latency cycles attributed to loads from levels other than the predetermined level of memory. If the load was satisfied from the predetermined level of memory, the processor increments the load counter, as illustrated at step 414. The processor does not need to rewind the latency counter, as the cycles accumulated were attributable to the predetermined level of memory.
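Gathering steps 402 through 414, a minimal sketch of the rewind-counter method follows (again, not part of the patent); the level names, per-level cycle costs, and random selection stand in for the real storage-system indicator and are assumptions for illustration only:

    import random

    # Invented per-level latencies; the random choice below stands in for
    # the storage system's report of which level satisfied each load.
    LATENCY = {"L1": 2, "L2": 12, "MAIN": 300}

    def measure(num_loads, desired_level):
        latency_counter = 0                       # latency PMC
        load_counter = 0                          # load PMC
        for _ in range(num_loads):
            rewind = latency_counter              # step 404: copy PMC to rewind counter
            level = random.choice(list(LATENCY))  # step 406: issue the load
            latency_counter += LATENCY[level]     # step 408, modeled as one lump-sum add
            if level == desired_level:            # step 410: check the returned level
                load_counter += 1                 # step 414: count the load, keep cycles
            else:
                latency_counter = rewind          # step 412: discard those cycles
        return latency_counter, load_counter

    cycles, loads = measure(1000, "L2")
    print(f"{loads} L2 loads, {cycles} cycles, {cycles / loads:.1f} cycles per load")

Dividing the final latency counter by the load counter then yields an average latency for the predetermined level, with cycles from loads satisfied elsewhere discarded exactly rather than approximated by a threshold.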
Those skilled in the art will readily appreciate that the method of the present invention may be carried out in different manners. For example, instead of using a rewind counter, the processor could accumulate the number of latency cycles for the current load in a separate counter. Once the load completed, the separate counter could be added to the latency counter if the load was satisfied from the predetermined level of memory.
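A sketch of that accumulator variant, under the same invented latencies as the previous sketch:

    import random

    # Accumulator variant: cycles for the current load collect in a separate
    # per-load counter and are committed to the latency PMC only if the load
    # was satisfied from the predetermined level. Latencies invented, as above.
    LATENCY = {"L1": 2, "L2": 12, "MAIN": 300}

    def measure_accumulate(num_loads, desired_level):
        latency_counter, load_counter = 0, 0
        for _ in range(num_loads):
            level = random.choice(list(LATENCY))
            current = LATENCY[level]         # separate counter for this load only
            if level == desired_level:       # load complete: commit or discard
                latency_counter += current
                load_counter += 1
        return latency_counter, load_counter

    print(measure_accumulate(1000, "L2"))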
The present invention has been described in relation to particular embodiments which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those skilled in the art to which the present invention pertains without departing from its spirit and scope. For example, while the present invention has been described in terms of a processor with two processor cores, the present invention has use in processors of any number of processor cores. Accordingly, the scope of the present invention is defined by the appended claims rather than the foregoing discussion.
What is claimed is:
1. A method for determining the latency of a desired level of memory within a hierarchical memory system, said method comprising the steps of:

issuing a load from a computer microprocessor to a hierarchical memory system including a plurality of levels;

incrementing a latency counter during performance of said load;

receiving a response to said load;

determining if said response to said load was issued from a desired level from among said plurality of levels of said hierarchical memory system;

in response to determining that said response to said load was issued from said desired level of said hierarchical memory system, recording a latency value of said latency counter in a secondary storage location; and

in response to determining that said response to said load was not issued from said desired level of said hierarchical memory system, discarding said latency value of said latency counter.
2. The method of claim 1, wherein:

said step of recording further comprises storing a present value of said latency counter in a rewind counter prior to issuing a subsequent load; and

resetting said present value of said latency counter with said stored value from said rewind counter in response to determining said subsequent load was not satisfied from said desired level of memory.
3. The method as described in claim 1, wherein the step of incrementing said latency counter comprises the step of adding a latency of said load to a latency value of a previous load in an accumulator.

4. The method as described in claim 1, wherein said incrementing step further comprises incrementing a performance monitor counter.
5. A system for determining the latency of a desired level of memory within a hierarchical memory system, said system comprising:

means for, responsive to issuance of a load from a computer microprocessor to a hierarchical memory system including a plurality of levels, incrementing a latency counter during performance of a load;

means, in response to receiving a response to said load, for determining if said response to said load was issued from a desired level from among said plurality of levels of said hierarchical memory system;

means, responsive to determining that said response to said load was issued from said desired level of said hierarchical memory system, for recording a latency value from said latency counter in a secondary storage location; and

means, responsive to determining that said response to said load was not issued from said desired level of said hierarchical memory system, for discarding said latency value of said latency counter.
6. The system as described by claim 5, wherein:

said means for recording further comprises means for storing a present value of said latency counter in a rewind counter prior to issuing a subsequent load; and

means for resetting said present value of said latency counter with said stored value from said rewind counter in response to determining said subsequent load was not satisfied from said desired level of memory.

7. The system of claim 5, said means for incrementing further comprising means for adding said latency value of said load to a latency value of a previous load in an accumulator.

8. The system as described in claim 5, wherein said latency counter is a performance monitor counter.
9. A method for determining the latency of a predetermined level of memory within a hierarchical memory system used with a computer microprocessor having a latency counter, said method comprising the steps of:

said computer microprocessor determining if a selected load for measurement was issued from said predetermined level of memory;

incrementing said latency counter in response to said determination;

said computer microprocessor storing a present value of said latency counter as a stored value in a rewind counter prior to issuing said load selected for measurement;

said computer microprocessor incrementing said latency counter during execution of said load selected for measurement; and

said computer microprocessor resetting said present value of said latency counter with said stored value in said rewind counter in response to said computer microprocessor determining said load was not satisfied from said predetermined level of memory.
10. A system for determining the latency of a predetermined level of memory within a hierarchical memory system used with a computer microprocessor, said system comprising:

means for said computer microprocessor determining if a selected load for measurement was issued from said predetermined level of memory; and

a latency counter, said latency counter incrementing in response to said determination, wherein said latency counter is a rewind counter, said rewind counter storing a present value and a rewind value, and wherein said present value of said rewind counter is stored as said rewind value prior to said microprocessor issuing said load selected for measurement.

11. The system as described in claim 10, wherein said present value of said rewind counter increments during execution of said load selected for measurement.

12. The system as described in claim 11, wherein said present value of said rewind counter is reset to the rewind value in response to said computer microprocessor determining said load was not satisfied from said predetermined level of memory.
