Algorithmic Memory Increases Memory Performance by an Order of Magnitude
1. Algorithmic Memory Increases Memory Performance By an Order of Magnitude
Sundar Iyer
Co-Founder & CTO, Memoir Systems
Track F, Lecture 2: Intellectual Property for SoC & Cores
May 2, 2012
2. Problem: Processor-Embedded Memory Performance Gap
Performance degradation can be more significant and is getting worse!
[Chart: Processor Embedded Memory Performance Gap; axis: Normalized Growth]
*Source: Hennessy and Patterson, 5th Edition
3. Why is Embedded Memory Slow?
[Timing diagram: signals clk, read, addr (A B C D E F G H), and data (A B C D E F G H) over clock cycles 1 to 15]
One operation per memory clock cycle
How can we increase memory performance
without increasing memory clock speed?
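The question can be made concrete with a little arithmetic. If every port completes one operation in every clock cycle, the operation rate in MOPS is the clock in MHz times the number of operations per cycle, so more operations per cycle raise performance at an unchanged clock. The sketch below is illustrative only: the function name `mops` and the 500 MHz clock are assumptions, not figures from the talk.

```python
def mops(clock_mhz: float, ops_per_cycle: int) -> float:
    """Millions of memory operations per second, assuming every port
    completes one operation in every clock cycle."""
    return clock_mhz * ops_per_cycle


# Hypothetical 500 MHz memory clock (an assumption, not from the slides).
CLOCK_MHZ = 500.0

single_port = mops(CLOCK_MHZ, 1)  # one operation per cycle
two_port = mops(CLOCK_MHZ, 2)     # two operations per cycle, same clock
four_port = mops(CLOCK_MHZ, 4)    # four operations per cycle, same clock

print("speedup at the same clock:",
      two_port / single_port, four_port / single_port)
```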
5. Solution Overview
2X Performance for ~15% area overhead
Any Embedded Physical Memory
RTL Based: No Circuit or Layout changes
Simultaneous Accesses to the same Address, Row, Column, or Bank (no exceptions)
Each Port can access the entire Memory Address
Exhaustively Formally Verified & Transparent to end-user
[Figure: Algorithmic Memory built from a grid of physical 1P memories plus Extra Memory, with multiple Addr/Data ports]
Using Physical 1-Port Memory to Build any Multiport Functionality
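To give a feel for how 1-port memories plus some extra memory can present a multiport interface, here is a minimal sketch of a textbook XOR-parity scheme. It is not Memoir's algorithm, which the slides do not describe; the class names, the bank layout, and the choice of "two reads or one write per cycle" behavior are all assumptions made for illustration. Each `OnePortBank` raises an error if it is accessed twice in a cycle, so the model checks that the 1-port constraint is never violated, even when both reads hit the same bank or the same address.

```python
from functools import reduce
from operator import xor


class OnePortBank:
    """Models a physical 1-port memory: at most one access per clock cycle."""

    def __init__(self, rows):
        self.cells = [0] * rows
        self.busy = False  # True once the single port was used this cycle

    def _claim_port(self):
        if self.busy:
            raise RuntimeError("1-port bank accessed twice in one cycle")
        self.busy = True

    def read(self, row):
        self._claim_port()
        return self.cells[row]

    def write(self, row, value):
        self._claim_port()
        self.cells[row] = value


class TwoReadOrOneWriteMemory:
    """Hypothetical 2-read-or-1-write memory made of 1-port data banks
    plus one extra XOR parity bank (illustrative, not Memoir's design)."""

    def __init__(self, num_banks, rows):
        self.rows = rows
        self.banks = [OnePortBank(rows) for _ in range(num_banks)]
        self.parity = OnePortBank(rows)  # the extra memory

    def _locate(self, addr):
        return divmod(addr, self.rows)  # (bank index, row within bank)

    def _new_cycle(self):
        for bank in self.banks + [self.parity]:
            bank.busy = False

    def _xor_others(self, skip, row):
        """XOR of `row` across every data bank except bank `skip`."""
        others = (b.read(row) for i, b in enumerate(self.banks) if i != skip)
        return reduce(xor, others, 0)

    def read2(self, addr_a, addr_b):
        """Two reads in one cycle, to any two addresses."""
        self._new_cycle()
        bank_a, row_a = self._locate(addr_a)
        bank_b, row_b = self._locate(addr_b)
        value_a = self.banks[bank_a].read(row_a)
        if bank_b != bank_a:
            return value_a, self.banks[bank_b].read(row_b)
        # Bank conflict: rebuild the second word from the other banks
        # and the parity bank, leaving the conflicting bank's port alone.
        value_b = self._xor_others(bank_b, row_b) ^ self.parity.read(row_b)
        return value_a, value_b

    def write(self, addr, value):
        """One write per cycle; keeps parity equal to the XOR of all banks."""
        self._new_cycle()
        bank, row = self._locate(addr)
        self.parity.write(row, self._xor_others(bank, row) ^ value)
        self.banks[bank].write(row, value)


mem = TwoReadOrOneWriteMemory(num_banks=4, rows=8)
for addr in range(32):
    mem.write(addr, addr * 3 + 1)

print(mem.read2(0, 5))   # both addresses live in bank 0
print(mem.read2(3, 30))  # addresses in different banks
```

In this sketch the cost is one extra bank and a wider fan-out on conflicting reads; schemes that also allow simultaneous writes need more bookkeeping than is shown here.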
6. Usage & Adoption
Easily Interface
• Presents standard memory interface
• Adds no clock cycle latency
• Used as a drop-in replacement
Readily Integrate
• Fits seamlessly in SoC design flow
• Used in SoCs - ASICs, ASSPs, GPPs
Rapidly Implement
• Supports any process, node or foundry
[Figure: Physical Memory (128 Width, 8K Depth) wrapped by Memoir IP with address/data (A D) ports; Identical Pinout to Standard Memory]
7. Increases Density
[Chart labels: Denser; Physical 1P Memory; Algorithmic 2P Memory; Physical 2P Memory]
Normalized for 1P = 1 Mb/mm2
11. Configurable Performance
[Chart axes: Performance (MOPS); Memory Density (Mb/mm2); Power Efficiency (Mb/mW)]
[Chart regions: Physical Memory; Higher performance algorithmic memories; Higher density algorithmic memories; Power efficient algorithmic memories]
[Point labels: 4P; 2P; Algorithmic 2P; SP]
[Legend: Higher Performance Algorithmic Memory; Area Efficient Algorithmic Memory; Power Efficient Algorithmic Memory]
12. Increases Portfolio of Available Memories
[Diagram: memory port configurations, grouped as Physical Memory and Algorithmic Memory: 1R1W, 1R/4W, 2R/1W, 4R/1W, 1R/2W, 3R1W, 1RW, 2RW, 2R2W, 1R2W, 1R3W, 2R1W, 3R/1W]
13. Rapid Memory Analysis & Generation
Push Button Analysis
[Diagram: tool flow. Feed Inputs (# Read Ports, # Write, # Width, # Depth, …) and Specify Capacity, then GUI, SYN, GEN, CHK, then Generate Memory (Algorithmic Memory), with Real-time Feedback]
[Diagram: optimization targets: Acceleration (2X, 3X, 4X), Reduced Latency, Power Optimization, Area]
[Diagram: Library & Building Blocks: SRAM, Register File, eDRAM, Standard Cell]
14. Multiport Memory Usages
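[Diagram: multiport configurations and their usages; the pairing between the two is not recoverable from the extracted text]
Port configurations: 3R1W, 2R1W, 1R2W, 1R3W, 2R2W, 1R1W, 4Ror1W, 3Ror1W, 2Ror1W
Usages:
• Descriptor and Free Lists, Ingress Buffers
• L2 MAC Lookups, Shared Caches
• Descriptor and Free Lists, Egress Buffers
• Cache Coherency Arrays for L2/L3 Caches
• Netflow, Counters
• State Tables, Linked Lists
• Data and Tag Arrays for L2, L3 Caches
• Route Lookup Tables
• ACL Tables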
15. Exhaustive Formal Verification Reduces Risk
Independently Verify Logic
• Mathematically proven algorithms
• Formally, exhaustively verified RTL
Separately Test Physical Memories
• Supports 3rd party DFT methodology
• Transparent customer BIST, BISR
• Doesn’t need complex multiport BIST
[Figure: Algorithmic Memory (Memoir IP, SCAN) around Physical Memory (SRAM with BIST Wrapper, BIST), with address/data (A D) ports]
16. Tier-1 OEM Evaluation
– Performance, Area and Power Benefits
Large ASIC (24mm x 24mm, Area 576 mm2)
• 800 Mb of total memory
• 165 Memory Instances
Versatile memories required
• 4R/1W, 2R1W, 1R2W memories
Algorithmic Memory Solution (21mm x 21mm, Area 441 mm2)
• Area Savings of 135 mm2 (23% die)
• 136 Memory Instances Accelerated
Power Savings > 12W
4X MOPS for select memories
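The area figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only numbers stated on the slide and assumes square dies, which matches 24 mm x 24 mm = 576 mm2 and 21 mm x 21 mm = 441 mm2; the variable names are illustrative.

```python
# Die edge lengths as stated on the slide (square dies assumed).
before_side_mm = 24
after_side_mm = 21

before_area = before_side_mm ** 2          # mm^2, original large ASIC
after_area = after_side_mm ** 2            # mm^2, with algorithmic memory
savings = before_area - after_area         # mm^2 saved
savings_pct = 100 * savings / before_area  # share of the original die

# Memory instance counts as stated on the slide.
accelerated = 136
total_instances = 165
accelerated_pct = 100 * accelerated / total_instances

print(f"area: {before_area} mm^2 -> {after_area} mm^2")
print(f"savings: {savings} mm^2 ({savings_pct:.1f}% of the original die)")
print(f"instances accelerated: {accelerated_pct:.1f}%")
```

The result agrees with the slide's 135 mm2 and roughly 23% of the die, and shows that about four out of five memory instances were accelerated.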
17. Summary
1. Increases Port and Clock Performance
2. Lowers Area and Power
3. Easy Interface, Integration and Implementation
4. Creates Versatile Memory Portfolio
5. Reduces Cost, Risk and Time to Market
Algorithmic Memories are not a panacea, but present a new solution to
alleviate the processor embedded memory performance gap
18. Q&A
Sundar Iyer
sundaes@memoir-systems.com
Come Visit Our Booth!
Memoir Systems
Editor's Notes
Today, a single-port embedded memory can perform one memory operation per clock cycle. Embedded memory performance has therefore traditionally been tied closely to memory clock speed and is ultimately limited by it. Because embedded memory IP providers, responding to application needs for more on-chip memory, had to make early design trade-offs that favored high density over high speed, memory clock speeds lag behind processor clock speeds. With its Algorithmic Memory technology, Memoir Systems tackles a fundamental question: can we increase memory performance without increasing memory clock speeds? Historically, circuit techniques and advances in lithography have been the approach used at every generation to enhance memory performance. Unfortunately, these approaches alone do not give enough performance improvement and are not keeping up with applications that require higher memory performance. The problem is that we have limited our thinking about embedded memories to a purely circuit- and process-oriented approach, so our focus has been on maximizing the number of transistors on a chip and cranking up the clock speed. This has been successful up to a point, but as transistors approach atomic dimensions, we are running into fundamental physical barriers. For this reason, we need to rethink our approach to embedded memory design.
Algorithmic memory technology increases the density (lowers area) of physical memories. This also reduces the leakage power consumption.
Algorithmic memory technology allows system designers to treat memory performance as a configurable entity with its own set of tradeoffs with respect to speed, area and power.
Algorithmic memories can be generated from a small set of base physical memories and provide a broad portfolio of customized memories with any combination of read and write interfaces.
An algorithmic memory synthesis platform can analyze and estimate the resulting area, power and speed of custom memory configurations in seconds, and generate them in a matter of days.
Orange. Applications?? Compare sizes, area/power.
Logic is scan inserted. A scan chain is the way to test normal logic. The flops in the scan chain are scan-enabled flops.