When combined with a Bloom filter, this design greatly reduces the
number of CAM accesses, to an average of only eight per load and
twelve per store.
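
The filtering idea can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the design evaluated in this paper: the class name FilteredLSQ, the single word-address hash, the 64-bit filter size, and the choice to count whole associative searches rather than individual entries are all hypothetical and chosen only for illustration.

```python
class FilteredLSQ:
    """Toy model: a single-hash Bloom filter guards a CAM-like search.

    All names and parameters here are illustrative assumptions.
    """

    def __init__(self, filter_bits=64):
        self.filter_bits = filter_bits
        self.bloom = [False] * filter_bits  # one bit per hash bucket
        self.entries = []                   # (address, value) pairs, oldest first
        self.cam_searches = 0               # associative searches actually performed

    def _bucket(self, address):
        # Hypothetical hash: word address modulo the filter size.
        return (address >> 2) % self.filter_bits

    def store(self, address, value):
        # Record the in-flight store and mark its bucket in the filter.
        self.entries.append((address, value))
        self.bloom[self._bucket(address)] = True

    def load(self, address):
        # A clear bit proves that no in-flight store matches, so the
        # associative search can be skipped entirely.
        if not self.bloom[self._bucket(address)]:
            return None
        # A set bit may be a true match or a false positive: search.
        self.cam_searches += 1
        for addr, value in reversed(self.entries):  # youngest match wins
            if addr == address:
                return value
        return None  # false positive: bucket shared with another address
```

In this sketch a load whose bucket is clear never touches the associative structure, so cam_searches grows only on true matches and false positives. A real filter would also need to clear bits as stores commit (for example with counters), which this model omits.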
Looking forward, this design supports a microarchitecture, currently
under development, that can run a single thread across a dynamically
specified collection of individual lightweight processing cores. When
the cores are combined, their LSQ banks become part of a single
logical interleaved memory system, permitting the system to choose
between ILP and TLP dynamically using a collection of composable,
lightweight cores on a CMP.
8. ACKNOWLEDGMENTS
We thank M. S. Govindan for his help with the power measurements
and K. Coons, S. Sharif, and R. Nagarajan for help in preparing this
document. This research is supported by the Defense Advanced
Research Projects Agency under contract F33615-01-C-4106 and by
NSF CISE Research Infrastructure grant EIA-0303609.