Late-Binding: Enabling Unordered Load-Store Queues
Simha Sethumadhavan    Franziska Roesner    Doug Burger    Stephen W. Keckler
Department of Computer Sciences
The University of Texas at Austin

Joel S. Emer
VSSAD, Intel
Hudson, MA
joel.emer@intel.com
ABSTRACT
Conventional load/store queues (LSQs) are an impediment to both
power-efficient execution in superscalar processors and scaling to
large-window designs. In this paper, we propose techniques to
improve the area and power efficiency of LSQs by allocating en-
tries when instructions issue (“late binding”), rather than when they
are dispatched. This approach enables lower occupancy and thus
smaller LSQs. Efficient implementations of late-binding LSQs,
however, require the entries in the LSQ to be unordered with re-
spect to age. In this paper, we show how to provide full LSQ
functionality in an unordered design with only small additional
complexity and negligible performance losses. We show that late-
binding, unordered LSQs work well for small-window superscalar
processors, but can also be scaled effectively to large, kilo-window
processors by breaking the LSQs into address-interleaved banks.
To handle the increased overflows, we apply classic network flow
control techniques to the processor micronetworks, enabling low-
overhead recovery mechanisms from bank overflows. We evalu-
ate three such mechanisms: instruction replay, skid buffers, and
virtual-channel buffering in the on-chip memory network. We show
that for an 80-instruction window, the LSQ can be reduced to 32 en-
tries. For a 1024-instruction window, the unordered, late-binding
LSQ works well with four banks of 48 entries each. By applying
a Bloom filter as well, this design achieves full hardware memory
disambiguation for a 1,024-instruction window while searching, on
average, only 8 and 12 CAM entries per load and store access,
respectively.
Categories and Subject Descriptors
C [Computer Systems Organization]; C.1 [Processor Architectures]; C.1.1 [Single
Data Stream Architectures]: Single-instruction-stream, single-data-stream
processors (SISD)
Additional Classification: C.2 [Computer-Communication Networks]; C.2.1
[Network Architecture and Design]: Packet-switching networks
General Terms
Design, Performance
Keywords
Late-binding, Memory Disambiguation, Network flow control
PREPRINT: To Appear in the International Symposium on Com-
puter Architecture 2007.
[Figure 1 depicts a six-stage superscalar pipeline (Fetch, Dispatch, Issue,
Execute, Writeback, Commit) containing memory instructions A-I: (a) a
conventional ordered LSQ, (b) an unordered LB-LSQ, and (c) a banked,
large-window LB-LSQ in which a bank overflows.]
Figure 1: High-Level Depiction of Ordered vs. Unordered LSQs.
1. INTRODUCTION
A long-standing challenge in computer architecture has been
handling memory dependences in a scalable manner without sac-
rificing programmability or parallelism. Dataflow computers in the
1970’s and 80’s supported high concurrency, but required uncon-
ventional programming models (write-once memory semantics) to
prevent incorrect memory behavior. Modern superscalar proces-
sors enforce memory dependences while supporting standard pro-
gramming models using load/store queues (LSQs). These LSQs,
however, are one of the primary structures inhibiting the scaling of
superscalar processors to larger in-flight windows, since they have
typically supported only a few tens of in-flight loads and stores due
to power and complexity limits.
These limits have led an industry-wide shift to chip multiproces-
sors (CMPs). Much current research is focusing on CMPs in which
the individual processors are smaller, lower power, and simpler,
effectively reducing single-thread performance to support more
threads. This trend is reducing the instruction-level parallelism that
can be exploited, placing a higher burden on software and/or pro-
grammers to use a fine-grained chip multiprocessor effectively. An
alternative to CMPs is tiled large-window microarchitectures that
can scale to thousands of in-flight instructions, while potentially
exploiting multiple threads in addition to providing a large win-
dow for single-threaded execution. Memory disambiguation at this
scale, however, has been a key limitation for these types of archi-
tectures. This paper describes an LSQ design that both provides
power and area improvements for conventional superscalar designs
and which provides the ability to scale to thousands of instructions
power efficiently. The two new techniques that provide this scaling
are unordered late binding, in which loads and stores allocate LSQ
entries after they issue, and lightweight overflow control, which en-
ables good performance with unordered LSQs that are divided into
multiple small, address-interleaved partitions.
Late binding: Traditional LSQs allocate entries at fetch or dis-
patch time, and deallocate them at commit. Thus, entries in a conventional
LSQ are physically age-ordered, a feature that LSQ designs
exploit to provide the necessary functionality efficiently. When
an LSQ reaches capacity, the microarchitecture typically throttles
fetch until a load or store commits and is removed from the LSQ.
Figure 1 shows a simple six-stage pipeline diagram with nine mem-
ory instructions (loads and stores labeled A through I) in different
stages. As shown in Figure 1a, a conventional eight-entry LSQ is
full after H is dispatched, stalling the fetch of later instructions.
Unordered, late-binding LSQs (ULB-LSQs) can reduce both the
average occupancy time and average number of entries in use.
ULB-LSQs achieve this reduction by allocating entries only when
a load or store issues, instead of when it is fetched, permitting
a smaller, more efficient structure. Figure 1b shows the smaller
ULB-LSQ that must be sufficiently large only to capture the in-
flight loads and stores after they have issued. Figure 2 shows the
potential savings of late-binding, issue-time allocation. On the Al-
pha 21264 microarchitecture, only 32 or fewer memory instructions
must be buffered in the LSQ for 99% of the execution cycles, even
though the original 21264 design had a combined LQ/SQ capacity
of 64 entries.
To achieve this reduction, however, the entries in an ULB-LSQ
must be unordered; loads and stores are allocated LSQ entries in
their dynamic issue order as opposed to the traditional in-order
fetch sequence. Maintaining age order with issue-time allocation
would require a complex and power-inefficient compacting-FIFO-
like circuit [7]. This paper describes an ULB-LSQ design that re-
quires only small additional complexity while performing compa-
rably to a traditional ordered LSQ.
Low overhead overflow handling: A second advantage above
and beyond the reduced size is that ULB-LSQs lend themselves
naturally to address partitioning, enabling smaller banks that are
indexed by address. However, smaller banks will experience more
overflows. In small-window superscalar processors, flushing the
pipeline on an ULB-LSQ overflow is an acceptable policy, since
the ULB-LSQ can be sized to save area and power over a con-
ventional design while overflowing infrequently. However, ULB-
LSQs can also provide efficient memory disambiguation for large-
window processors, in which thousands of instructions may be in
flight [12, 14, 24], by exploiting the ability to address interleave
LSQ banks. The late binding and unordered organization are both
necessary to support an undersized, address-interleaved LSQ, since
the mapping of loads and stores to LSQ banks cannot be known at
fetch/dispatch time.
In large-window designs, however, the probability of overflows
Figure 2: Potential for undersizing: In an Alpha 21264, for 99% of
the execution cycles across 18 SPEC2000 benchmarks, only 32 or fewer
memory instructions are in flight between execute and commit.
grows, since the global ULB-LSQ is divided into smaller inter-
leaved partitions that are more susceptible to overflows when the
distribution of loads and stores to banks is unbalanced. Figure 1c
shows how load imbalance for an address-interleaved ULB-LSQ
increases the likelihood of overflow. In the example, each of four
address-interleaved partitions holds one entry. Whereas instruction
E could have been held in the centralized ULB-LSQ of Figure 1b,
E conflicts in the banked example with instruction B, which was
mapped to the same partition.
As the banks are shrunk and the number of overflows increases,
the conventional techniques of throttling fetch or flushing the
pipeline become too expensive. By incorporating techniques to
handle overflows with low overhead, an ULB-LSQ for a large-
window design can consist of small partitions, resulting in low
energy per load or store, while still supporting high performance
across a large-window processor. We observe that in a distributed,
large-window processor, LSQ banks can be treated as clients on
a micronetwork, with overflows handled using traditional network
flow-control techniques adapted for a distributed processor mi-
croarchitecture. We evaluate three such techniques in this paper:
instruction replay, skid buffers, and micronet virtual channels.
This paper shows that ULB-LSQs with appropriate overflow sup-
port work well for both small and large-window processors. For an
Alpha 21264 microarchitecture with an 80-instruction window, a
32-entry ULB-LSQ using flush-on-overflow provides the same per-
formance as a 64-entry split LQ/SQ. For the 1,024-instruction win-
dow TRIPS processor, four banks of 48 entries each, using virtual
channel flow control to handle overflows, provide the same per-
formance as an idealized 1,024-entry LSQ. By adding a Bloom Fil-
ter [22], the banked ULB-LSQ design incurs, for a kilo-instruction
window, an average of only eight and twelve CAM entries searched
per load and store, respectively.
2. PRIOR LSQ INNOVATIONS
Most of the recent LSQ optimization proposals, including pro-
posals for non-associative LQ/SQs, focus primarily on reducing the
dynamic power and latency of LSQs. Typically these schemes also
reduce or eliminate the power-hungry CAMs, but add numerous
predictors and other significant supporting structures outside of the
LSQ, effectively increasing the area of the total memory disam-
biguation logic.
Cain and Lipasti eliminate the associative load queue by having
loads compare their speculative values to the cached values at com-
mit time to detect mis-speculations [13]. Roth proposed enhance-
ments to Cain and Lipasti’s scheme, to reduce the number of loads
that have to perform commit time checking [20]. These mecha-
nisms eliminate the CAM from the LQ but require additional bits
in the load queue RAM (64bits) to hold the load values. Roth’s
mechanism requires additional storage of 1KB (SSBF) for support-
ing commit-time filtering.
Sha, Martin and Roth extend Roth’s scheme by using a modified
dependence predictor to match loads with the precise store buffer
slots from which they are likely to receive forwarded data [27].
This solution, coupled with the non-associative load queue scheme
proposed in Roth’s earlier work, can completely eliminate all the
associative structures from the LSQ. However, instead of the as-
sociative CAMs, this solution requires large multi-ported depen-
dence/delay predictors (approximately 23.5KB, as reported in [4]),
effectively improving dynamic power at the expense of area.
Garg, Rashid and Huang propose another mechanism for elim-
inating associative LQ/SQs [6]. In the first phase of two-phase
processing, loads and stores speculatively obtain their value from
a L0 cache. In the second phase, the memory instructions are
Scheme             ROB   Unopt. Depth   Unopt. Storage  Opt. Depth    CAM Width  RAM Width  Opt. Storage  Supporting       Total Area Ratio
                   Size  (LQ/SQ)        (KB, LQ/SQ)     (LQ/SQ)       (LQ/SQ)    (LQ/SQ)    (KB, LQ/SQ)   Structures (KB)  (after:before)
SVW-NLQ            512   128/64         1.00/1          128/64        0/12       92/64      1.44/0.78     1                1.61
SQIP               512   128/64         1.00/1          128/64        0/0        92/64      1.44/0.5      23.5             12.72
Garg et al.        512   64/48          0.50/0.75       64/48         0/0        92/0       0.72/0        16               13.38
NoSQ               128   40/24          0.31/0.375      40/0          0/0        92/0       0.45/0        11               16.65
FnF                512   128/64         1.00/1          128/0         0/0        92/0       1.44/0        23.75            12.59
Stone et al.       1024  N/A            N/A             --            --         --         --            18               N/A
LateBinding-Alpha  80    32/32          0.25/0.5        32 (unified)  12         92         0.5           0.0625           0.75
LateBinding-TRIPS  1024  1024 (unified) 16              192 (unified) 12         92         3             0.25             0.20
Table 1: Area for LSQ and supporting structures for recent related work
re-executed in program order, without any speculation, and ac-
cess the regular L1 cache. Any difference in the load values be-
tween the two phases results in corrective action. This mechanism,
while eliminating the CAM, requires a 16KB L0 cache and an age-
ordered queue for holding the values read during the first phase.
Subramaniam and Loh [4] and Sha, Martin, and Roth [28] both
propose methods for completely eliminating the store queue, by by-
passing store values through the register file and LQ, respectively.
Sha, Martin and Roth avoid speculative store buffering by execut-
ing stores in program order, while Subramaniam and Loh do so by
using the ROB or the physical register file. Both proposals use so-
phisticated dependence predictors that are smaller than their earlier
schemes but still require additional area for dependence predictors
and combinational logic for renaming.
Akkary, Rajwar, and Srinivasan propose two-level store buffers,
each of which is centralized, fully associative, and age ordered [5].
Stores first enter the L1 store buffer, and when it overflows they are
moved to the L2 store buffer. Both buffers support forwarding and
speculation checking, but stores always commit from the second
level buffer. This scheme reduces power and delay, but still requires
a worst-case sized L2 store buffer and is thus area inefficient.
Stone, Woley, and Frank suggest partitioning the LSQ into its
three functions, and distributing and address interleaving all of
them [3]. The three structures are a set-associative cache for for-
warding (80-bit wide, 512 sets, 2-way), a non-associative FIFO
for commit, and an address-indexed timestamp table (8K entry, 8-
bit wide) for checking speculation. The partitioning removes the
CAMs but has high area overheads.
Torres et al. propose using a distributed, address-partitioned,
forwarding buffer backed up by a centralized, fully associative age-
indexed LSQ [1]. Baugh and Zilles use a small centralized forward-
ing buffer, but address-partitioned unordered structures for viola-
tion detection [10]. Both these schemes increase the area required
for memory disambiguation to optimize for latency and power.
Table 1 summarizes the ROB size, the area of the LSQ before
and after the optimizations, the size of the supporting structures re-
quired by these optimizations (but which may already be present
in the design for performance reasons), and finally the ratio of to-
tal area required for memory disambiguation, before and after the
optimizations. We computed the area of the memory structures,
in bytes of storage, assuming the CAM cell area to be three times
larger than the RAM cell. We also assumed 40-bit addresses and
64-bit data, and that all the unoptimized designs had 12-bit par-
tial addresses in CAMs and rest of the address bits in the RAMs.
The depth of the queues, however, is different for each of these
structures. The table shows that the proposed schemes add area
overhead by factors of 1.5 to 16.5. When discounting the de-
pendence predictor, the area overheads are between factors of 1.5
and 13.
In contrast to all of the above schemes, the design proposed in
this paper uses late binding to reduce the area and latency with-
out any additional state outside of the LSQ. Dynamic power reduc-
tion, however, requires additional state in the form of address-based
Bloom filters [22]. These structures take up only a few hundred
bytes and can even be further reduced by optimizations suggested
by Castro et al. [2].
Other researchers have also applied issue-time binding, explic-
itly or implicitly, to improve the efficiency of microarchitectural
structures. Monreal et al. use late allocation to reduce the num-
ber of physical registers [19]. The effectiveness of some LSQ op-
timizations like address-based filters [22] or the small associative
forwarding buffers [10, 8, 1] can also be explained in part by late
allocation.
3. BACKGROUND
Most LSQs designed to date have been age indexed, because age
indexing supports the physical sorting of instructions in the LSQ
based on their age. LSQs must support three functions: commit
of stores, detection of memory ordering violations, and forwarding
of earlier store values to later loads. The physical ordering makes
some of these operations simpler to support, but is not fundamen-
tally required for committing of stores and violation detection, and
only provides speed benefits for store forwarding in rare cases.
In an age-indexed LSQ, the address and value of an in-flight
memory instruction are stored in an LSQ slot obtained by decod-
ing the age of the memory instruction. This organization results in
a LSQ that is physically ordered; the relative ages of two instruc-
tions can be determined by examining the physical locations they
occupy in the LSQ. For example, it is simple to determine that in-
the instruction at slot 5 is older than the one at slot 8 because slot 5
is physically “before” slot 8. Additionally, this mechanism allows
determining the relative order between all instructions in the LSQ
that satisfy some criterion (e.g., a matching address). For example,
if slots 25, 28 and 29 are occupied in the LSQ, linearly scanning
the LSQ from position 29 will reveal the most recent older instruc-
tion first (28) and then the next oldest (25) and so on. In some
cases, circuit implementations exploit the physical ordering to ac-
celerate LSQ operations. To understand the design changes that
an unordered LSQ requires, it is instructive to examine how LSQ
ordering supports the three functions that the LSQ combines.
Commit: The LSQ buffers all stores to avoid potential write-
after-write hazards between stores to the same address that execute
out-of-order. Additionally, the stores cannot be written out until
they are non-speculative. Once a store is determined to be non-
speculative, the store address and value are written to the cache
using the age supplied from the ROB/control logic. With ordering,
age-based indexed lookup is sufficient. Without ordering, a search
is necessary to find the oldest store to commit.
Operation    Search  Input    Output        Num matches  Sorting Required
Forwarding   <       LD age   Older STs     >= 1         Yes
Violation    >       ST age   Younger LDs   >= 1         No
Commit       ==      ROB age  ST to commit  1            No
Table 2: Summary of LSQ operations and ordering requirements
Violation Detection: The LSQ must report a violation when it
detects that a load following a store in program order, and to the same
address, executed before the store. To support this operation, the
LSQ buffers all in-flight memory data addresses, and when a store
arrives at the LSQ, checks for any issued loads younger than and
to the same address as the store. If there are any matching loads, a
violation is reported. For this operation, the LSQ need only deter-
mine the set of younger load instructions. It does not require sorting
among the multiple matching loads based on their age. In an ordered
LSQ circuit implementation, the age of the incoming instruction is
decoded into a bit mask and all bits “before” the decoded bit vector
are set. In the case of store forwarding with multiple matches, the
most recent store and successive older stores can be found
by linearly scanning the bit mask.
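As a concrete illustration of this mask generation, the following C sketch
models the behavior (not the circuit) of an ordered LSQ; the names and the
64-slot size are illustrative, and wrap-around of a circular queue is
ignored for brevity.

#include <stdint.h>

/* In an ordered LSQ, slot index equals age order, so "all entries
 * before slot k" is simply a mask of the k low-order bits. */
static inline uint64_t older_mask(unsigned slot)    /* slots 0..63 */
{
    return (slot == 0) ? 0 : (~0ULL >> (64 - slot));
}

/* Violation check for an incoming store: any issued load that is
 * younger (a higher slot) and matches the store's address.
 * addr_matches is the bit vector produced by the address CAM. */
static inline int store_violates(uint64_t addr_matches, unsigned st_slot)
{
    uint64_t younger = ~older_mask(st_slot + 1); /* slots above st_slot */
    return (addr_matches & younger) != 0;
}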
Store Forwarding: The LSQ must support forwarding from the
uncommitted buffered stores in the LSQ to in-flight loads that issue
after the stores. When a load arrives at the LSQ, it checks for older
stores to the same address. If there are matching stores, the LSQ
ensures that the load obtains its value from the most recent match-
ing store preceding the load. To support this functionality when
a load matches multiple stores, the LSQ sorts the matching stores
based on their ages and processes the matching stores until all the
load bytes have been obtained or until there are no matching stores.
The age-indexing policy requires an LSQ that is sized large
enough to hold all in-flight memory instructions (M slots, where
M is the memory instruction window size), which
results in a physically ordered LSQ. The ability to sort through
multiple matching instructions is especially useful for forwarding
values from multiple matching stores to a load, but a coarser age
comparison is sufficient for implementing the other LSQ operations
(Table 2). Additionally, the LSQ allocation policy is conservative.
Even though the LSQ slots are occupied only after the instructions
execute, they are allocated early, during instruction dispatch. Tra-
ditional age-indexed LSQs are thus both overdesigned in terms of
functionality and overprovisioned in terms of size.
4. AN UNORDERED, LATE-BINDING LSQ
DESIGN
Late-Binding LSQs address the inefficiencies resulting from the
“worst-case” design policies used in traditional LSQs. By allocat-
ing the memory instruction in the LSQ at issue, the sizes of ULB-
[Figure 3: The Age-Indexed LSQ. An M-entry structure (M = memory
instruction window size) comprising age-decode allocation, an address CAM
(1 search, 1 read, and 1 write port), ordering logic, a RAM data array, and
commit logic.]
LSQs are correspondingly reduced. Allocating memory instructions
at issue requires a set of mechanisms different from allocation in
age-indexed LSQs. When a load or store instruction arrives at the
ULB-LSQ, the hardware simply allocates an entry from a pool of
free LSQ slots instead of indexing by age. This allocation policy
results in an LSQ that is physically unordered in which the age of
the instruction has no relation to the slot occupied by the instruc-
tion.
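A minimal C sketch of this issue-time allocation follows; the entry fields,
the 32-slot size, and the bitmap encoding are assumptions for illustration,
not details of any particular implementation.

#include <stdint.h>

#define U 32                      /* ULB-LSQ slots, U << M */

struct ulb_entry {
    uint64_t addr;                /* searched by the address CAM */
    uint16_t age;                 /* searched by the age CAM     */
    uint8_t  is_load;
    uint8_t  wrmask;              /* bytes written, for stores   */
};

struct ulb_lsq {
    struct ulb_entry slot[U];
    uint32_t free;                /* bit i set => slot i is free */
};

/* Returns the allocated slot, or -1 on overflow -- the event the
 * flow-control mechanisms of Section 6 must handle. */
static int ulb_alloc(struct ulb_lsq *q, uint64_t addr, uint16_t age,
                     int is_load, uint8_t wrmask)
{
    if (q->free == 0)
        return -1;                       /* overflow */
    int s = __builtin_ctz(q->free);      /* GCC/Clang: lowest free slot */
    q->free &= q->free - 1;              /* clear lowest set bit */
    q->slot[s] = (struct ulb_entry){ addr, age, (uint8_t)is_load, wrmask };
    return s;    /* note: the slot number bears no relation to age */
}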
To compensate for the lack of ordering, the ULB-LSQs take a
more direct approach to determining the age by explicitly storing
the age information in a separate age CAM. The age CAM is a
special type of CAM that can output greater/lesser/equal results in-
stead of just the equality matches. The LSQ functions that used the
implicit age information in the age-indexed LSQ for implementing
the LSQ operations now use the explicit associative age CAM to
determine younger and older instructions. Figures 3 and 4 illustrate
and contrast the structures used in the ULB-LSQ and traditional
LSQ implementations, where M is the memory instruction window
size and U is the ULB-LSQ size.
To support commits, the small age CAM is associatively
searched with the age supplied by the ROB. The address and data
from the exact matching entry are read out from the CAM and
RAM respectively, and sent to the caches. This extra associative
search is avoidable if the baseline architecture holds the ULB-LSQ
slot id allocated to the store in the ROB.
To support violation detection, when a store arrives it searches
the address CAM to identify matching loads, and searches the age
CAM using the greater-than operator to identify younger loads.
The LSQ then performs a logical AND of the results of the two
searches. If any of the resulting bits is one, then a violation is
flagged. Detecting violations is simpler in this LSQ compared to
age-indexed LSQs, since no generation of age masks is necessary.
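Reusing the ulb_lsq type from the allocation sketch above, the check can be
modeled in C as below; the two CAM searches are written as one loop over all
U slots, whereas the hardware performs them in parallel and ANDs the
resulting match vectors.

/* Violation check for a store arriving with address st_addr and age
 * st_age: flag any occupied load entry that matches the address
 * (address-CAM equality) and is younger (age-CAM greater-than). */
static int ulb_store_violates(const struct ulb_lsq *q,
                              uint64_t st_addr, uint16_t st_age)
{
    for (int s = 0; s < U; s++) {
        if ((q->free >> s) & 1) continue;            /* empty slot */
        const struct ulb_entry *e = &q->slot[s];
        if (e->is_load && e->addr == st_addr && e->age > st_age)
            return 1;                /* younger load already issued */
    }
    return 0;
}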
Supporting forwarding is more involved because the ULB-LSQ
does not have the total order readily available. In the case of only
one match, the loss of order does not pose a problem; however
when there are multiple matches, the matches must logically be
processed from most recent to the oldest. In the ULB-LSQ, on
multiple store matches, the age of each match is read out from the
ULB-LSQ, one per cycle, and decoded into a per-byte bit vector.
Bytes to forward to the load replace bytes from other stores if the
later-found store is more recent in program order. This step recon-
structs the physical ordering between the matches from the ULB-
LSQ, but may take multiple cycles to do so. Once the ordering is
available, store forwarding proceeds exactly as in an age-indexed
LSQ [23]. Thus, compared to the age-indexed LSQ, which may
already require multiple cycles to forward from multiple stores, the
ULB-LSQ requires additional cycles for creating the decoded bit-
vector. However, as we will show in the next section, these addi-
tional cycles rarely affect performance because multiple store for-
warding is uncommon in many benchmarks.
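A sketch of this order reconstruction, again reusing the types above; the
per-slot store data array and the 8-byte granularity are assumptions. Each
outer iteration corresponds to reading out one match's age and updating the
per-byte vector.

/* Forward to a load at (ld_addr, ld_age): for each byte, keep the
 * value from the youngest store that is still older than the load. */
static void ulb_forward(const struct ulb_lsq *q,
                        const uint8_t data[U][8],   /* store data RAM */
                        uint64_t ld_addr, uint16_t ld_age,
                        uint8_t out[8], uint8_t *byte_valid)
{
    uint16_t best_age[8] = {0};
    *byte_valid = 0;

    for (int s = 0; s < U; s++) {
        if ((q->free >> s) & 1) continue;
        const struct ulb_entry *e = &q->slot[s];
        if (e->is_load || e->addr != ld_addr || e->age >= ld_age)
            continue;                               /* want older stores */
        for (int b = 0; b < 8; b++) {
            if (!(e->wrmask & (1u << b))) continue; /* byte not written */
            if (!(*byte_valid & (1u << b)) || e->age > best_age[b]) {
                out[b] = data[s][b];
                *byte_valid |= (uint8_t)(1u << b);
                best_age[b] = e->age;
            }
        }
    }
    /* bytes still missing from *byte_valid come from the cache */
}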
[Figure 4: The ULB-LSQ Microarchitecture. A U-entry structure (U = size of
the LB-LSQ, U << M) comprising free-slot allocation, a Log2(M)-bit age CAM
(2 search, 1 read, and 1 write ports), an address CAM (1 search, 1 read,
and 1 write port), ordering logic, a RAM data array, and commit logic.]
4.1 Performance Results
The performance of the ULB-LSQ depends on the number of
entries in the LSQ, which affects the number of LSQ overflows.
Performance will also be affected by the relative cost of each over-
flow and the additional penalty for multi-cycle store forwarding (in
the case of multiple address matches). We modeled the ULB-LSQ
in the sim-alpha [15] simulator and simulated single SimPoint re-
gions of 100M instructions for the 18 SPEC benchmarks compatible with our
experimental infrastructure.
In this set of experiments, ULB-LSQ overflows are handled by
partially flushing the pipeline and re-starting execution from the
oldest unarrived memory instruction at the time of the overflow.
The penalty of an overflow is 15 cycles, which matches the cost
of a branch misprediction. Table 3 summarizes the unordered LSQ
behavior and statistics. The first two columns show the number of
memory instructions between the commit and execute stages for
95% and 99% of the execution cycles. The next eight columns
show the number of overflows per 1000 memory instructions, and
the performance normalized against the Alpha, for ULB-LSQ sizes
ranging from 16 to 40 entries. The final column shows the percent-
age of dynamic loads that match zero or one store. From
the table, for 14 of the 18 benchmarks, for 99% of the cycle time,
there are 32 or fewer uncommitted but executed memory instruc-
tions. This explains why a 32-entry ULB-LSQ does not show any
performance degradation.
To isolate the performance impact of slower multiple forward-
ing, we increased the latency of store forwarding in the baseline
simulator without changing the LQ/SQ size. In this experiment, if
a load matches N (N > 1) older stores, then store forwarding
takes an additional N cycles even if the load does not have to get its
data from all N stores. The results showed no performance degra-
dation for any of the benchmarks because loads rarely need to get
data from multiple stores. From Table 3, the number of loads that
have to forward from two or more stores is less than 0.2% on the
average.
Another interesting trend in Table 3 is that the performance
impact of the number of overflows is different for high- vs. low-
IPC benchmarks. For instance, for the 16-entry LSQ, even though
179.art incurs more flushes per 1K instructions than 178.galgel
(651 vs. 463), the performance degradation is higher in the case
of 178.galgel (96% vs. 24%). This data indicates that unless low-
overhead overflow mechanisms are invented for machines that sup-
port high ILP, the performance impact of undersizing the LSQ can
be quite high.
4.2 Power
This section examines the dynamic power consumption of an
ULB-LSQ against the Alpha LQ/SQ organization. The ULB-LSQ
design holds 32 memory operations, while the Alpha has a separate
32-entry LQ and SQ. Although the size of the ULB-LSQ is smaller,
each memory operation has to access the same number of address
CAM entries in both designs because the Alpha has a partitioned
LQ/SQ. Even so, the ULB-LSQ will have higher dynamic power
per access because of the additional accesses to the age CAM.
To measure the increase in power per access, we assumed that
the LSQ power is approximately equal to the power consumed by
the CAM and the RAM. Thus, the power-per-access of the Alpha
LQ or SQ will be purely from the address CAM (C_addr) and the
RAM (C_ram), while the power consumed by the ULB-LSQ will
be the power from the address CAM, the age CAM (C_age), and the
RAM. Since the ULB-LSQ and Alpha have the same number of
entries, their sizes will be the same. Thus the relative power increase will
be C_age / (C_addr + C_ram).
We synthesized the design using IBM's 130nm ASIC methodol-
ogy with the frequency set at 450MHz, and verified that the age
CAM will fit in the same cycle time as the address CAM. Even
though the delay of the age CAM was approximately 20% more
than the address CAM, the delay was still not long enough to be on
the critical path. Thus, assuming that the LB-CAMs can be run at
the same clock frequency as the traditional LQ/SQ, the power in-
crease is simply the ratio of the capacitances. From our synthesis,
the capacitance of the address CAM (32x40b, 1 search, 1 read and
1 write port) was 144.5pF (C_addr), while the capacitance of the age
CAM was 12.86pF (C_age). Thus the power overhead is roughly 8%
even after neglecting the power due to the RAM.
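For concreteness, the overhead is just the ratio of the two synthesized
capacitances (a worked check of the figure above; including C_ram in the
denominator only lowers it):

\[ \frac{C_{age}}{C_{addr}} = \frac{12.86\,\mathrm{pF}}{144.5\,\mathrm{pF}} \approx 0.089 \]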
However, both the ULB-LSQ and the Alpha LSQ can benefit
from Bloom filters. With a 64-entry, 8-bit counting Bloom fil-
ter [16], we observed that 88.7% of the age and address searches
can be filtered. Applying this optimization, the additional power
required by the unordered design can be reduced to under 1% of
the power consumed by the original Alpha LSQ.
4.3 Implementation Complexity
The ULB-LSQ design differs from a traditional LSQ in the fol-
lowing ways: (1) the entries are managed as a free-list, (2) multiple
store forwarding requires additional control logic to scan through
matching entries, and (3) the LSQ must detect and react to over-
flows. These differences are not a significant source of complex-
ity because many existing microarchitectural structures implement
similar functionality. For example, MSHRs and the physical reg-
ister files are managed as free-lists. The scanning logic has been
implemented in traditional LSQs which do not flush on multiple
matches. Overflow is flagged when there are no free LSQ entries,
and is simple to implement. Furthermore, the ULB-LSQ does not
require any modifications to the load/store pipeline. The LSQ oper-
ations are pipelined in the exact same way as the age-ordered LSQ
implementation in the POWER4 [26].
These results showed that for small-window processors like the
Alpha 21264, even with simplistic overflow handling mechanisms,
the queue sizes can be reduced by half without affecting perfor-
mance. The smaller queue size directly translates to improvements
in static power, dynamic power, and latency. Most important, the
unordered nature of the ULB-LSQ allows it to be partitioned with-
out introducing additional area overheads as explained next.
5. LARGE-WINDOW PROCESSORS
While high-performance uniprocessor design has fallen out of
favor with industry due to scaling difficulties, many researchers are
examining novel and scalable means for extracting more ILP by us-
ing a larger instruction window. Such designs often employ some
form of partitioning to implement larger microarchitectural logical
structures without sacrificing clock cycle time. In particular, the
architectural trends motivating the design of partitioned LSQs in-
clude (1) very large instruction windows with hundreds of in-flight
memory instructions, and (2) partitioning of microarchitectures for
scaling to higher execution and local memory bandwidth.
5.1 Partitioning the LSQ
In a partitioned microarchitecture, an LSQ partition is best
matched with a partition of an address-interleaved level-1 cache.
Partitioning an age-ordered LSQ is not straightforward as the per-
instruction age identifiers do not indicate dependence relationships
between load and store instructions. Distributing age-interleaved
partitions to address-interleaved caches will effectively result in
significant cross-partition communication among LSQ and cache
banks and between matching loads and stores to the same address.
Benchmark | Occupancy, % program cycles (95%, 99%) | Flushes per 1K mem instr for 16/24/32/40 LSQ entries | Normalized IPC for 16/24/32/40 LSQ entries | Baseline IPC | % of LDs matching 0 or 1 STs
164.gzip 7 22 6 4 3 1 1.00 1.00 1.00 1.00 1.60 99.95
175.vpr 14 21 10 2 0 0 0.93 1.00 1.00 1.00 0.87 99.72
177.mesa 8 14 8 0 0 0 0.05 1.00 1.00 1.00 1.16 99.8
178.galgel 24 24 463 88 0 0 0.04 0.56 1.00 1.00 2.70 100
179.art 35 43 651 131 33 11 0.76 0.91 1.00 1.00 0.63 99.95
183.equake 3 6 0 0 0 0 1.00 1.00 1.00 1.00 0.96 99.91
188.ammp 11 15 4 0 0 0 1.01 1.00 1.00 1.00 1.31 99.91
189.lucas 11 13 0 0 0 0 1.02 1.00 1.00 1.00 0.76 100
197.parser 10 17 4 0 0 0 1.00 1.00 1.00 1.00 1.17 99.84
252.eon 15 20 28 2 0 0 0.94 1.00 1.00 1.00 1.17 98.58
253.perlbmk 9 13 2 0 0 0 1.00 1.00 1.00 1.00 0.83 98.91
254.gap 6 12 0 0 0 0 1.00 1.00 1.00 1.00 1.11 99.92
256.bzip2 7 9 0 0 0 0 1.00 1.00 1.00 1.00 1.82 99.9
173.applu 22 25 2 1 0 0 0.99 1.00 1.00 1.00 0.62 100
181.mcf 51 51 360 251 171 98 1.01 1.01 1.01 1.01 0.20 99.92
176.gcc 31 32 28 22 2 1 0.99 1.00 1.00 1.00 1.21 99.88
171.swim 15 15 1 0 0 0 1.00 1.00 1.00 1.00 0.88 100
172.mgrid 19 35 23 10 4 2 1.07 1.00 0.99 1.00 0.87 99.98
Average 0.88 0.97 1.00 1.00 1.10 99.79
Table 3: Performance of an ULB-LSQ on an 80-window ROB machine.
Address-interleaved LSQs are a better match for address inter-
leaved partitioned memory systems. However, address interleaving
also means that there is no guarantee that the loads and stores will
be distributed evenly across the partitions. Such an LSQ must func-
tion correctly even when all in-flight memory instructions map to
the same partition. Such imbalance is uncommon, but may arise
when the application is loading from or storing to a sequential ar-
ray of characters. A system could tolerate the imbalance by sizing
each partition for the worst case, but the total LSQ size would be N
times the instruction window size, for N partitions. In the remain-
der of this paper, we explore better solutions that instead undersize
the LSQ partitions in the same manner as Section 4 and gracefully
tolerate the rare overflow conditions with minimal effect on per-
formance.
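For illustration, a cache-line-interleaved mapping can be sketched in C as
below; the 64-byte line size and four banks are assumptions in the spirit
of the TRIPS configuration described later (a 4-bank, cache-line-interleaved
DL1; see Table 4).

#include <stdint.h>

#define LINE_BITS 6                      /* assumed 64B cache lines */
#define NBANKS    4                      /* power of two */

/* All accesses within one cache line go to one bank; consecutive
 * lines rotate through the banks. A loop walking a byte array
 * therefore hits the same bank for 64 consecutive accesses, which
 * is exactly the imbalance described above. */
static inline unsigned lsq_bank(uint64_t addr)
{
    return (unsigned)((addr >> LINE_BITS) & (NBANKS - 1));
}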
Related Work: Research proposals for clustered architec-
tures [30, 9] employ multiple partitions of an age-indexed LSQ,
but instead of reserving a slot in each of the LSQ partitions, they
use memory bank predictors [25] to predict a target bank and re-
serve a slot there. If the bank prediction is low-confidence, slots
are reserved in all banks. While this approach is better than con-
servatively reserving a slot in each partition, it still wastes space
because of conservative dispatch allocation. The first memory dis-
ambiguation hardware to be address indexed was MultiScalar’s Ad-
dress Resolution Buffer (ARB) [17]. Loads and stores would index
into this structure, where age tags were stored to assist in forward-
ing values and detecting ordering violations. The ARB caused a
high overhead pipeline flush if it overflowed.
5.2 Large-window Processor Model
To examine the viability of the partitioned ULB-LSQ in a large-
window processor, we use the TRIPS processor microarchitecture
as a baseline. The TRIPS processor is a partitioned microarchi-
tecture that enables a window of up to 1024 instructions and up to
256 in-flight memory instructions. All major components of the
processor are partitioned and distributed including fetch, issue, and
memory access. The overflow handling mechanisms described in
the rest of this paper are built on top of the TRIPS microarchitec-
ture.
The processor is composed of an array of 16 execution units con-
nected via a routed operand network. Instructions are striped across
4 instruction cache banks which are accessed in parallel to fetch
TRIPS instruction blocks. Instructions are delivered to the execu-
tion units where each instruction waits until its operands arrive. The
primary memory system (level-1 data cache, LSQs, dependence
predictors and miss handling units) is divided into multiple banks
which are also attached to the routed operand network. Cache lines
are interleaved across the banks, which enables up to 4 memory in-
structions per cycle to enter the level-1 cache pipelines. Figure 5
shows an abstract view of the TRIPS microarchitecture, highlight-
ing the parts that are relevant to memory instructions. Additional
details about the TRIPS microarchitecture can be found in [21].
The features of a distributed microarchitecture most relevant to
the design of LSQs can be distilled down to a few principles which
are not unique to TRIPS. First is an address-interleaved distributed
cache in which multiple level-1 cache banks independently pre-
serve the proper ordering of load and store instructions to the same
address. Second is the set of distributed execution units which inde-
pendently decides which instructions to issue each cycle. Finally, a
distributed architecture with multiple execution and memory units
must include some form of interconnection network. TRIPS em-
ploys a mesh-routed operand network which can be augmented to
provide multiple virtual channels. Some of the techniques for mit-
igating overflows rely on buffering in-flight memory instructions
within the network buffers. However, other interconnection net-
works could be adapted to buffer in-flight memory instructions
as well. While we examine partitioned LSQ design in a TRIPS
context, we believe that these concepts apply to other architectures
that share these basic characteristics.
Jaleel et al. point out that blindly scaling the larger window
LSQs can be detrimental to performance due to the increase in the
number of replay traps [18]. In their study on a scaled Alpha-21264
core, such traps can occur when load instructions violate the con-
sistency model, when a load needs to obtain its data partially from
the LSQ and partially from the cache, when a load miss cannot be
serviced because of structural hazards, and when a load instruction executes
prematurely. TRIPS avoids these traps and does not suffer from the
performance losses described in [18]. In particular, TRIPS avoids
Figure 5: Overview of the distributed memory system in TRIPS: Each
of the memory partitions includes a portion of the address-interleaved
level-1 cache, a portion of the unordered LSQ, a local dependence pre-
dictor, a miss-handling unit, and a copy of the TLB.
load-load traps with weak memory ordering, wrong-size traps by
supporting partial forwarding in the LSQ, and load-miss traps by using
larger MSHRs [23]. Like the Alpha, TRIPS also uses a dependence
predictor to reduce the number of load-store replay traps which oc-
cur when a load instruction is executed prematurely.
6. MITIGATING LSQ OVERFLOWS
Ideally, a distributed LSQ should be divided into equal-sized
banks, where the aggregate number of LSQ entries equals the average
number of loads and stores in flight, but which shows only minor per-
formance losses over a maximally sized LSQ. When a bank over-
flows, however, if the microarchitecture does not flush the pipeline,
it must find someplace to buffer the incoming instruction. We examine three places
to buffer these instructions: in the execution units, in extensions
to the memory units, or in the network connecting the execution
units to the memory units. The buffering space is much less ex-
pensive than the LSQ space since the buffered locations need not
be searched for memory conflicts, which mitigates the area and en-
ergy overheads of large LSQs. The penalty associated with these
schemes corresponds to different “load loops” and changes as the
time for load execution changes [11].
These buffering approaches effectively stall processing of cer-
tain memory instructions, which could potentially lead to dead-
lock. However, memory instructions can be formed into groups
based on age, with all of the instructions in a group having similar
ages. In a microarchitecture that is block-oriented like TRIPS, the
memory instruction groups correspond to the instruction blocks.
One block is non-speculative, while multiple blocks can be spec-
ulative. By choosing to prioritize the non-speculative instructions
over the speculative instructions, our solutions can reduce the cir-
cumstances for deadlocks and flushing. One possible design would
reserve LSQ entries for the non-speculative block, but our experi-
ments indicated that this approach did not provide any substantive
performance benefits and resulted in a larger than minimum-sized
LSQ.
6.1 Issue Queue Buffering: Memory
Instruction Retry
One common alternative to flushing the pipeline in conventional
processors is to replay individual offending instructions, either by
retracting the instruction back into the issue window, or by logging
the instruction in a retry buffer. In TRIPS retrying means send-
ing an offending instruction back to the ALU where it was issued
and storing it back into its designated reservation station. Since
the reservation station still holds the instruction and its operands,
only a short negative-acknowledgement (NACK) message needs to
be sent back to the execution unit. No additional storage in the
system is required, as the reservation station cannot be reassigned
to another instruction until the prior instruction commits. The is-
sue logic may retry this instruction later according to a number of
possible policies.
Figure 6a shows the basics of this technique applied to LSQ
overflows. When a speculative instruction arrives at a full LSQ,
the memory unit sends a NACK back to that instruction's execu-
tion unit. This policy ensures that speculative instructions will not
prevent a non-speculative instruction from reaching the LSQ. If a
non-speculative instruction arrives at a full LSQ, then the pipeline
must be flushed.
A range of policies are possible for determining when to reissue
a NACKed memory instruction. If the instruction reissues too soon
(i.e. immediately upon NACK), it can degrade performance by
clogging the network, possibly requiring multiple NACKs for the
same instruction. Increased network traffic from NACKs can also delay
older non-speculative instructions from reaching the LSQ partition,
as well as general execution and instructions headed to other LSQ
partitions. Alternatively, the reservation stations can hold NACKed
instructions until a fixed amount of time has elapsed. Waiting re-
quires a counter per NACKed instruction, and may be either too
long (incurring unnecessary latency) or too short (increasing net-
work contention).
Instead, our approach triggers re-issue when the non-speculative
block commits, which has the desirable property that LSQ entries
in the overflowed partition are likely to have been freed. This mech-
anism has two minor overheads, however: an additional state bit for
every reservation station, to indicate that the instruction is ready but
waiting for a block to commit before reissuing; and a control path
to wake up NACKed instructions when the commit signal for the
non-speculative block arrives.
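The policy can be summarized with a small C sketch; the window size, the
per-station bit array, and the reissue callback are illustrative stand-ins
for the reservation-station state and issue logic.

#include <stdbool.h>

#define WINDOW 1024

static bool waiting[WINDOW];   /* one bit per reservation station */

/* A NACK arrived for the instruction at this station: mark it
 * ready-but-waiting rather than reissuing immediately. */
void on_nack(int station) { waiting[station] = true; }

/* The non-speculative block committed, so LSQ entries in the
 * overflowed partition are likely free: wake all parked
 * instructions for reissue. */
void on_block_commit(void (*reissue)(int station))
{
    for (int s = 0; s < WINDOW; s++)
        if (waiting[s]) { waiting[s] = false; reissue(s); }
}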
6.2 Memory Buffering: Skid Buffers
A second overflow-handling technique is to store memory in-
structions waiting to access the LSQ in a skid buffer located in the
memory unit. As shown in Figure 6b, the skid buffer is a simple pri-
ority queue into which memory instructions can be inserted and ex-
tracted. To avoid deadlock, our skid buffers only hold speculative
memory instructions. If an arriving speculative memory instruc-
tion finds the LSQ full, it is inserted into the skid buffer. If the skid
buffer is also full, the block is flushed. Arriving non-speculative
instructions are not placed in the skid buffer. If they find the LSQ
full, they trigger a flush.
When the non-speculative block commits and the next oldest
block becomes non-speculative, all of its instructions that are lo-
cated in the skid buffer must be extracted first and placed into the
LSQ. If the LSQ fills up during this process, the pipeline must be
flushed. Like retry, the key to this approach is to prioritize the non-
speculative instructions and ensure that the speculative instructions
do not impede progress. Skid buffers can reduce the ALU and net-
work contention associated with NACK and instruction replay, but
may result in more flushes if the skid buffer is small.
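The admission policy reduces to a few cases, sketched below in C; the names
are illustrative.

#include <stdbool.h>

enum action { ACCEPT, PARK_IN_SKID, FLUSH };

/* Decide what to do with a memory instruction arriving at a
 * partition, per the deadlock-avoidance rules above: only
 * speculative instructions may wait in the skid buffer. */
enum action admit(bool lsq_full, bool skid_full, bool speculative)
{
    if (!lsq_full)     return ACCEPT;
    if (!speculative)  return FLUSH;        /* oldest work never waits */
    if (!skid_full)    return PARK_IN_SKID;
    return FLUSH;
}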
[Figure 6 shows an ALU, OPN router, and LSQ partition under each mechanism:
(a) NACK and Retry, where a full LSQ NACKs the instruction back to the ALU;
(b) Skid Buffer, where overflowing speculative instructions are parked
beside the LSQ partition; (c) Virtual Channel (VC), with separate VC0/VC1
buffering in the router; each scheme falls back to a flush when its
buffering is exhausted.]
Figure 6: LSQ Flow Control Mechanisms.
6.3 Network Buffering: Virtual
Channel-Based Flow Control
A third approach to handle overflows is to use the buffers in the
network that transmits memory instructions from the execution to
the memory units as temporary storage for memory instructions
when the LSQ is full. In this scheme, the operand network is aug-
mented to have two virtual channels (VCs): one for non-speculative
traffic and one for speculative traffic. When a speculative instruc-
tion is issued at an ALU, its operands and memory requests are
transmitted on the lower priority channel. When a speculative
memory instruction reaches a full LSQ and cannot enter, it remains
in the network and asserts backpressure along the speculative vir-
tual channel. Non-speculative instructions use the higher priority
virtual channel for both operands and memory requests. A non-
speculative memory instruction that finds the LSQ full triggers a
flush to avoid deadlock. Figure 6c shows a diagram of this ap-
proach.
This virtual channel approach has a number of benefits. First,
no new structures are required so logic overhead is only minimally
increased. Additional router buffers are required to implement the
second virtual channel, but our experiments show that two-deep
flit buffers for each virtual channel are sufficient. Second, no ad-
ditional ALU or network contention is induced by NACKs or in-
struction replays. Third, the higher priority virtual channel allows
non-speculative network traffic to bypass speculative traffic. Thus
non-speculative memory instructions are likely to arrive at the LSQ
before speculative memory instructions, which reduces the likeli-
hood of flushing.
Despite its conceptual elegance, this solution requires a number of
changes to the baseline network and execution engine. The baseline
TRIPS implementation includes a number of pertinent features. It
provides a single operand network channel that uses on-off flow
control to exert back-pressure. Each router contains a four-entry
FIFO to implement wormhole routing and the microarchitecture
can flush any in-flight instructions located in any tile or network
router when the block they belong to is flushed. Finally, all of
the core tiles (execution, register file, data cache) of the TRIPS
processor connect to the operand network and will stall issue if
they have a message to inject and the outgoing network FIFO is
full.
Adjusting this network to support VCs requires several augmen-
tations: (1) an additional virtual channel in the operand network to
separate speculative from non-speculative network traffic, includ-
ing the standard buffer capacity and control logic needed by virtual
channels, (2) virtualization of the pipeline registers, which must
stretch into the execution and register tiles to allow non-speculative
instructions to proceed even if speculative instructions are stalling
up the virtual network, (3) issue logic in these tiles that selects non-
speculative instructions over speculative ones when the virtual net-
work is congested, and (4) a means to promote speculative instruc-
tions from the speculative virtual channel to the non-speculative
channel when their block becomes non-speculative.
The trickiest part of this design is the promotion of specula-
tive network packets to the non-speculative virtual channel when
the previous non-speculative block commits. The TRIPS microar-
chitecture already has a commit signal which is distributed in a
pipelined fashion to all of the execution units, memory units, and
routers. When the commit signal indicates that the non-speculative
block has committed, each router must nullify any remaining pack-
ets in the non-speculative virtual channel and copy any packets be-
longing to the new non-speculative block from the speculative VC
to the non-speculative VC.
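A per-router model of this commit-time promotion is sketched below in C;
the two-flit channel depth matches the buffering described above, while the
block-id field and the in-place copy are simplifications (a real router
would arbitrate for free slots).

#include <stdint.h>

#define VC_DEPTH 2                 /* two-deep flit buffers per VC */

struct flit { uint8_t valid; uint8_t block; /* owning block id */ };

struct vc_pair {
    struct flit nonspec[VC_DEPTH];  /* high-priority channel */
    struct flit spec[VC_DEPTH];     /* low-priority channel  */
};

/* On the pipelined commit signal: nullify leftovers of the
 * committed block, then promote the new non-speculative block's
 * packets from the speculative VC. */
void on_commit_signal(struct vc_pair *p, uint8_t committed, uint8_t promoted)
{
    for (int i = 0; i < VC_DEPTH; i++) {
        if (p->nonspec[i].valid && p->nonspec[i].block == committed)
            p->nonspec[i].valid = 0;
        if (p->spec[i].valid && p->spec[i].block == promoted &&
            !p->nonspec[i].valid) {         /* simplification */
            p->nonspec[i] = p->spec[i];
            p->spec[i].valid = 0;
        }
    }
}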
6.4 Flow Control Mechanisms Performance
We implemented these flow control mechanisms on a simulator
that closely models the TRIPS prototype processor [21] and has
been validated to be within 11% of its RTL. The microarchitectural
parameters most relevant to
the experiments are summarized in Table 4.
For each benchmark, we normalize the performance (measured
in cycle counts) to a configuration with maximally sized, 256-entry
LSQ partitions that never overflow. For these experiments, we
sized the skid buffers slightly larger than the expected number of
instructions at each partition (72 minus the LSQ partition size).
For the virtual channel scheme, we divided the four
operand network buffers in the baseline equally between the two
channels. Thus two buffers are provided for the speculative and
non-speculative virtual channels for the VC scheme. We present
results for 28 EEMBC benchmarks (all except cjpeg and djpeg)
and 12 SPEC CPU 2000 (ammp, applu, art, bzip2, crafty, equake,
gap, gzip, mesa, mgrid, swim and wupwise) benchmarks with
MinneSPEC medium-sized reduced inputs. The other benchmarks
are not currently supported in our infrastructure.
For four 48-entry LSQs and thus a total LSQ size of 192 en-
tries (25% undersized), the flush scheme results in an average perfor-
mance loss of 6% for EEMBC benchmarks (Figure 7) and 11%
for SPEC benchmarks (Figure 8). The worst-case slowdowns are
much higher: 180% for idct and 206% for mgrid. These results sup-
port the perspective that traditional flow-control mechanisms are
inadequate for distributed load-store queues. The VC mechanism is
the most robust with 2% average performance degradation and less
than 20% performance degradation in the worst case. As expected,
the Skid buffer scheme performs better than the NACK scheme
because it avoids network congestion from the NACKed
packets, at the cost of extra area.
For six of the SPEC and EEMBC benchmarks, the memory ac-
cesses are unevenly distributed and cause LSQ overflows that re-
Parameter: Configuration

Overview: Out-of-order execution with up to 1024 instructions in flight; up to 256 memory instructions can be simultaneously in flight; up to 4 stores can be committed every cycle.
Instruction Supply: Partitioned 32KB I-cache, 1-cycle hit. Local/Gshare tournament predictor (10K bits, 3-cycle latency) with speculative updates; Local: 512(L1) + 1024(L2), Global: 4096, Choice: 4096, RAS: 128, BTB: 2048.
Data Supply: 4-bank cache-line-interleaved DL1 (8KB/bank, 2-way assoc, writeback, write-around, 2-cycle hit) with one read and one write port per bank to different addresses. Up to 16 outstanding misses per bank to up to four cache lines. 2MB L2, 8-way assoc, LRU, writeback, write-allocate; average (unloaded) L2 hit latency is 15 cycles; average (unloaded) main memory latency is 127 cycles. Best-case load-to-use latency is 5 cycles. Store forwarding latency is variable; minimum penalty is 1 cycle.
Interconnection Network: The banks are arranged in a 5x5 grid connected by a mesh network. Each router uses round-robin arbitration. There are four buffers in each direction per router and 25 routers. The hop latency is 1 cycle.
Simulation: Execution-driven simulator validated to be within 11% of the RTL design. 28 EEMBC benchmarks and 12 SPEC benchmarks simulated with single SimPoints of 100M instructions.

Table 4: Relevant aspects of the TRIPS microarchitecture
[Figure 7 plots slowdowns with respect to a maximally sized LSQ as the
number of LSQ entries varies from 32 to 64 under the FLUSH, NACK, SKID, and
VC schemes: the EEMBC arithmetic mean (28 benchmarks) on the left, and the
three worst EEMBC benchmarks (IDCT, IFFT, BITMNP) on the right.]
Figure 7: Left: Average LSQ Performance for the EEMBC benchmark suite. Right: Three worst benchmarks. bitmnp shows a different trend
because there are fewer LSQ conflict violations in bitmnp when the LSQ capacity is decreased.
realLow_1 = &realData_1[l_1];
imagLow_1 = &imagData_1[l_1];
realHi_1 = &realData_1[i_1];
imagHi_1 = &imagData_1[i_1];
:
:
realData_1[l_1] = *realHi_1 - tRealData_1;
imagData_1[l_1] = *imagHi_1 - tImagData_1;
realData_1[i_1] += tRealData_1;
imagData_1[i_1] += tImagData_1;
Figure 9: Code snippet from idct benchmark.
duce performance significantly. For instance, Figure 9 shows a fre-
quently executed code sequence in idct in the EEMBC suite. The
innermost loop contains two reads and two writes to two different
arrays and the code generated by the compiler aligns both arrays
to 256-byte boundaries. Since the arrays are accessed by the same
indices, all four accesses map to the same bank. This problem is ex-
acerbated by the aggressive loop unrolling of the TRIPS compiler.
The accesses could in theory be distributed by aligning the arrays
differently, but aligning data structures to minimize bank conflicts
is a difficult compiler problem.
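To see why the alignment forces conflicts, consider the bank mapping
sketched in Section 5.1 (assumed 64B lines, 4 banks), under which the bank
is selected by address bits 7:6. Aligning both arrays to 256-byte
boundaries zeroes those bits in each base address, so for a common index
the accesses to realData_1 and imagData_1 land in the same bank:

\[ \mathrm{bank}(base + o\!f\!f) = \left\lfloor \frac{base + o\!f\!f}{64} \right\rfloor \bmod 4 = \left\lfloor \frac{o\!f\!f}{64} \right\rfloor \bmod 4 \quad \text{when } base \equiv 0 \pmod{256} \]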
6.5 Power Efficiency
Three mechanisms are necessary to achieve high power effi-
ciency in large-window processor LSQs: address partitioning, late-
binding and associative search filtering. First, partitioning the LSQ
by addresses naturally divides the number of entries arriving at each
memory partition. Second, late-binding reduces the number of en-
tries in each partition by reducing occupancy. Finally, Bloom fil-
tering reduces the number of memory instructions performing as-
sociative searches [22].
The Bloom filters for the TRIPS study use 8 32-bit registers, one
for each in-flight block. The filters associated with each block are
cleared when the block commits or is flushed (commonly called
flash clearing). This clearing scheme does not require the coun-
ters often needed for Bloom filters and thus reduces the area of
the filter significantly. As shown in Table 5, nearly 70-80% of the
memory instructions (both loads and stores) can be prevented from
performing associative searches in the TRIPS processor by using
Bloom filtering.
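A C sketch of these per-block filters follows; the hash function and the
one-hot 32-bit encoding are assumptions for illustration, while the eight
registers and the flash clearing follow the description above.

#include <stdint.h>

#define NBLOCKS 8                 /* one filter per in-flight block */

static uint32_t filter[NBLOCKS];  /* 8 x 32-bit registers */

static inline uint32_t hash_bit(uint64_t addr)
{
    return 1u << ((addr >> 3) & 31);  /* assumed hash: one of 32 bits */
}

/* Record an address when a memory instruction enters a block. */
void bloom_insert(int block, uint64_t addr) { filter[block] |= hash_bit(addr); }

/* Skip the associative LSQ search unless some in-flight block may
 * hold a matching address (false positives only force a search). */
int bloom_may_match(uint64_t addr)
{
    uint32_t any = 0;
    for (int b = 0; b < NBLOCKS; b++) any |= filter[b];
    return (any & hash_bit(addr)) != 0;
}

/* Flash clear on block commit or flush: no per-entry counters. */
void bloom_clear(int block) { filter[block] = 0; }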
Benchmarks | Average LSQ activity factor under VC (40/48 entries), SKID (40/48), NACK (40/48)
SPEC:  .21 .21 | .27 .30 | .30 .31
EEMBC: .26 .27 | .38 .39 | .38 .39
Table 5: Fraction of loads performing associative searches
Using Bloom filters, however, incurs some additional
power for reading and updating the filters for every memory in-
struction. Using the 130nm ASIC synthesis methodology described
in the Alpha evaluation section, the capacitance of the 48-entry
TRIPS LSQ was computed to be 322pF. The capacitance of the
Bloom filter was 64pF. With the activity reduction of 80%, the ef-
fective capacitance of the combination is 120pF, which is roughly
the capacitance of a 12-entry, 40-bit unfiltered CAM.
In this evaluation, we have not included the additional power
expended by the network routers as it is unclear if they will be
significant. Wang et al. [29] show a 20% increase in power for a
[Figure 8 plots slowdowns with respect to a maximally sized LSQ as the
number of LSQ entries varies from 32 to 64 under the FLUSH, NACK, SKID, and
VC schemes: the SPEC arithmetic mean (12 benchmarks) on the left, and the
three worst SPEC benchmarks (MGRID, APPLU, SWIM) on the right.]
Figure 8: Left: Average LSQ Performance for the SPEC benchmark suite. Right: Three worst benchmarks.
four-fold increase in buffers for implementing virtual channels. The power
increase primarily comes from having four times as many buffers
for implementing the virtual channels. In our scheme we do not
increase the number of buffers. We simply divide the number of
buffers equally between the virtual channels.
6.6 Area
Among the three overflow handling mechanisms, the NACK
mechanism is the most area efficient if the issue window is designed
to hold instructions until explicit deallocation. On the TRIPS pro-
cessor, the NACK scheme requires cumulative storage of 1024 bits
to identify the NACK’ed instructions (one bit for every instruction
in the instruction window) and changes to the issue logic to se-
lect and re-issue the NACK’ed instructions. The VC mechanism
is next best in terms of area efficiency. The area overheads of the
VCs are due to the additional storage required for pipeline priority
registers in the execution units to avoid deadlocks and the combi-
national logic in routers to deal with promotion. The VC scheme
does not require any additional router buffers since the speculative
channels divide the buffers already present in the baseline. The skid
buffer scheme requires the largest amount of storage, although most
of the structure can be implemented as RAMs. A 24-entry skid buffer
supplementing a 40-entry LSQ increases the size of each LSQ
partition by 4%. Overall, using our best scheme, the VC mechanism,
to support a 1024-instruction window, the area after optimizations
is 80% smaller than that of fully replicated LSQs at each LSQ
partition, as shown in Table 1.
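A sketch of the NACK bookkeeping described above follows; only the
1024-bit vector (one bit per window instruction) comes from the text,
while the interface names and re-issue hooks are illustrative
assumptions.

    #include <bitset>

    // One bit per instruction-window slot (1024 bits total) marks
    // instructions whose LSQ allocation was refused.
    class NackTracker {
        std::bitset<1024> nacked_;

    public:
        void onNack(unsigned slot) { nacked_.set(slot); }

        // The issue logic re-selects marked instructions for re-issue;
        // the selection policy itself is left abstract here.
        bool needsReissue(unsigned slot) const { return nacked_.test(slot); }
        void onReissue(unsigned slot)          { nacked_.reset(slot); }
    };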
6.7 Complexity
The VC scheme requires the most changes to the baseline as it
requires virtualization of not only the network routers but also the
execution units that feed the router. For instance, when the low
priority channel in the network is backed up, the issue logic must
supply the network with a high priority instruction even though it
may be in the middle of processing a low priority instruction. The
NACK scheme comes second or third depending on the baseline
architecture: if the baseline allows instructions to be held in the
issue queues until commit, implementing NACK is as simple as
setting a bit in a return packet and routing it back to the source
instead of the destination. However, if instructions are immediately
deallocated from the window upon execution, NACK may be
considerably more complex. The skid buffer solution is probably the
simplest of all the solutions: it requires some form of priority logic
for selecting the oldest instructions, mechanisms for handling in-
validations in the skid buffer and arbitration for the LSQ between
instructions in the skid buffer and new instructions coming into the
LSQ partition. Despite the changes required for the schemes
described here, all the mechanisms are feasible, and the operations
they require have already been implemented in other parts of the
processor.
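To make the skid-buffer requirements concrete, the sketch below
models the arbitration policy under the assumption that parked (and
therefore older) operations always win access to a freed LSQ entry;
the 24-entry/40-entry sizing follows Section 6.6, but the interfaces
are illustrative, not the TRIPS implementation.

    #include <cstdint>
    #include <deque>

    struct MemOp { uint64_t age = 0; uint64_t addr = 0; };

    class SkidBuffer {
        static constexpr unsigned kCapacity = 24;  // backs a 40-entry partition
        std::deque<MemOp> parked_;   // ops that overflowed the LSQ, oldest first

    public:
        bool hasRoom() const { return parked_.size() < kCapacity; }
        void park(const MemOp& op) { parked_.push_back(op); }

        // When an LSQ entry frees up, a parked (older) op takes priority
        // over any instruction newly arriving at the bank.
        bool winsArbitration() const { return !parked_.empty(); }

        MemOp release() {
            MemOp op = parked_.front();
            parked_.pop_front();
            return op;
        }
    };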
7. CONCLUSIONS
Detecting load/store data dependences has been challenging ever
since designers began to implement computer systems that execute
more than one instruction per cycle. As processor designers have
scaled ILP and instruction window size, hardware to enforce proper
memory ordering has become large and power hungry. However,
today’s load/store queue (LSQ) designs make inefficient use of
their storage by allocating LSQ entries earlier than necessary.
By performing late binding, allocating LSQ entries only at in-
struction issue, designers can reduce the occupancy and resultant
size of the load/store queues. This reduction requires that the
queues be unordered. While the unordered property incurs some
extra overhead, such as saving the CAM index in the ROB or
searching for the load or store age at commit, the design is not
intrinsically more complex, and it can achieve performance
equivalent to an ordered LSQ with less area and power.
Many of the recent papers that propose to eliminate the CAMs
do so at the expense of additional state, resulting in increased area
requirements. These designs certainly have some advantages, and
one or more of them may well be the solution that some implemen-
tations eventually use. However, we have shown in this paper that
unordered, late-binding LSQs can be quite competitive for super-
scalar designs, requiring only a small number of CAM entries to
be searched on average, while consuming less area than the CAM-
free approaches. It is unclear which approach suffers from worse
complexity, and will likely remain so until each approach is imple-
mented in a microarchitecture.
However, the most promising aspect of the ULB-LSQ approach
is its partitionability, which was the original impetus for this line
of research. Address-interleaved LSQ banks should be both late-
bound and unordered; the ULB-LSQ design naturally permits the
LSQ to be divided into banks, provided that a mechanism exists to
handle the resultant increase in bank overflows. We observed that,
for distributed microarchitectures that use routed micronetworks to
communicate control, instructions, and data, classic network
flow-control solutions can be embedded into the processor
micronetworks to handle these overflows. We evaluate three such
overflow-handling schemes in the context of the TRIPS
microarchitecture. The best of these schemes (virtual micronet
channels) enables a scalable, distributed load/store queue,
requiring four banks of only 48 entries each to support a
1024-instruction window.
When combined with a Bloom filter, this design greatly reduces the
number of CAM accesses, resulting in an average of only eight and
twelve CAM entries searched per load and store, respectively.
Looking forward, this design supports a microarchitecture now under
development that can run a single thread across a dynamically
specified collection of individual, lightweight processing cores.
When combined, the LSQ banks in the individual cores become part of
a single logical interleaved memory system, permitting the system
to choose between ILP and TLP dynamically using a collection of
composable, lightweight cores on a CMP.
8. ACKNOWLEDGMENTS
We thank M.S. Govindan for his help with the power mea-
surements and K. Coons, S. Sharif and R. Nagarajan for help in
preparing this document. This research is supported by the De-
fense Advanced Research Projects Agency under contract F33615-
01-C-4106 and by NSF CISE Research Infrastructure grant EIA-
0303609.
9. REFERENCES
[1] E. F. Torres, P. Ibanez, V. Vinals, and J. M. Llaberia. Store
Buffer Design in First-Level Multibanked Data Caches. In ISCA,
2005.
[2] F. Castro, L. Pinuel, D. Chaver, M. Prieto, M. Huang, and
F. Tirado. DMDC: Delayed Memory Dependence Checking through
Age-Based Filtering. In MICRO, 2006.
[3] S. S. Stone, K. M. Woley, and M. I. Frank. Address-indexed
memory disambiguation and store-to-load forwarding. In MICRO,
2005.
[4] S. Subramaniam and G. H. Loh. Fire-and-Forget: Load/Store
Scheduling with No Store Queue at All. In MICRO, 2006.
[5] H. Akkary, R. Rajwar, and S. Srinivasan. Checkpoint
processing and recovery: Towards scalable large instruction
window processors. In Proceedings of the 36th Annual
IEEE/ACM International Symposium on Microarchitecture,
pages 423–434, December 2003.
[6] A. Garg, M. W. Rashid, and M. Huang. Slackened Memory
Dependence Enforcement: Combining Opportunistic Forwarding with
Decoupled Verification. In ISCA, 2006.
[7] A. Buyuktosunoglu, D. H. Albonesi, P. Bose, P. W. Cook, and
S. E. Schuster. Tradeoffs in Power-Efficient Issue Queue Design.
In ISLPED, 2002.
[8] Amir Roth. High Bandwidth Load Store Unit for Single- and
Multi-Threaded Processors. Technical Report
MS-CIS-04-09, Dept. of Computer and Information
Sciences, University of Pennsylvania, 2004.
[9] R. Balasubramonian. Dynamic Management of
Microarchitecture Resources in Future Microprocessors.
PhD thesis, University of Rochester, 2003.
[10] L. Baugh and C. Zilles. Decomposing the load-store queue by
function for power reduction and scalability. In the P=ac²
Conference, IBM Research, 2004.
[11] E. Borch, E. Tune, S. Manne, and J. Emer. Loose loops sink
chips. In Proceedings of the 8th International Symposium on
High-Performance Computer Architecture, pages 299–310, February
2002.
[12] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K.
John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald,
W. Yoder, and the TRIPS Team. Scaling to the End of Silicon
with EDGE architectures. IEEE Computer, 37(7):44–55,
July 2004.
[13] H. W. Cain and M. H. Lipasti. Memory ordering: A
value-based approach. In ISCA, 2004.
[14] A. Cristal, O. Santana, F. Cazorla, M. Galluzzi, T. Ramirez,
M. Pericas, and M. Valero. Kilo-instruction processors:
Overcoming the memory wall. IEEE Micro, 25(3):48–57, May/June
2005.
[15] R. Desikan, D. Burger, and S. W. Keckler. Measuring
experimental error in microprocessor simulation. In
Proceedings of the 28th Annual International Symposium on
Computer Architecture, pages 266–277, July 2001.
[16] E. Safi, A. Moshovos, and A. Veneris. L-CBF: A Low Power,
Fast Counting Bloom Filter Implementation. In ISLPED, 2006.
[17] M. Franklin and G. S. Sohi. ARB: A hardware mechanism for
dynamic reordering of memory references. IEEE Transactions on
Computers, 45(5):552–571, 1996.
[18] A. Jaleel and B. Jacob. Using Virtual Load/Store Queues
(VLSQs) to Reduce the Negative Effects of Reordered
Memory Instructions. In HPCA, 2005.
[19] T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, and
V. Vinals. Delaying Physical Register Allocation Through
Virtual-Physical Registers. In MICRO, 1999.
[20] A. Roth. Store vulnerability window (svw): Re-execution
filtering for enhanced load optimization. In ISCA, 2005.
[21] K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan,
S. Drolia, M. Govindan, P. Gratz, D. Gulati, H. Hanson,
C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif,
P. Shivakumar, S. W. Keckler, and D. Burger. Distributed
microarchitectural protocols in the TRIPS prototype
processor. In MICRO, December 2006.
[22] S. Sethumadhavan, R. Desikan, D. Burger, C. R. Moore, and
S. W. Keckler. Scalable memory disambiguation for high ILP
processors. In 36th International Symposium on Microarchitecture,
pages 399–410, December 2003.
[23] S. Sethumadhavan, R. McDonald, R. Desikan, D. Burger, and
S. W. Keckler. Design and Implementation of the TRIPS Primary
Memory System. In ICCD, 2006.
[24] S. T. Srinivasan, R. Rajwar, H. Akkary, A. Gandhi, and
M. Upton. Continual flow pipelines. In ASPLOS, pages
107–119, October 2004.
[25] S. Bieschewski, J.-M. Parcerisa, and A. Gonzalez. Memory
Bank Predictors. In ICCD, 2005.
[26] J. M. Tendler, J. S. Dodson, J. S. Fields, H. Le, and
B. Sinharoy. POWER4 system microarchitecture. IBM Journal of
Research and Development, 46(1):5–26, January 2002.
[27] T. Sha, M. M. K. Martin, and A. Roth. Scalable store-load
forwarding via store queue index prediction. In MICRO, 2005.
[28] T. Sha, M. M. K. Martin, and A. Roth. NoSQ: Store-Load
Communication without a Store Queue. In MICRO, 2006.
[29] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik. Orion: A
power-performance simulator for interconnection networks. In
MICRO, pages 294–305, December 2002.
[30] V. V. Zyuban. Inherently Lower-Power High-Performance
Superscalar Architectures. PhD thesis, University of Notre
Dame, March 2000.