OCP Server Memory Channel Testing
Overview – Draft Version 0.1 May 4 2015
Contributed by: David Woolf, UNH-IOL (david@iol.unh.edu), and Barbara Aichinger,
FuturePlus Systems (Barb.Aichinger@FuturePlus.com)
Executive Summary: Cloud Computing is pervasive in our society today, and at the
heart of every cloud computing server is DDR Memory. Some data centers have reported that
DDR Memory errors are the #2 failure in their data centers. Error detection and correction
techniques fall short if there are more than 1-2 bit errors in a 64-72 bit line of information.
Industry studies have shown that DDR Memory errors are much more pervasive in the field than
the vendors' data sheets would lead you to believe. Adding a cost-effective and relatively quick
post-validation check of the DDR Memory channel for OCP Servers would add credibility to the
OCP brand. In addition, it would help identify the elusive cause of post-manufacturing field
memory errors seen in data centers across the globe. This effort would be of value to OCP
manufacturers, OCP customers and the Cloud Computing industry in general.
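The limit of error correction mentioned above can be illustrated at a smaller scale. The sketch below is not the 72-bit (64 data plus 8 check) code typically found on ECC DIMMs; it assumes a toy extended Hamming (8,4) code with single-error-correct, double-error-detect (SECDED) behavior, and the function names are hypothetical. It is meant only to show the pattern: one flipped bit is corrected, two are detected, and three can be miscorrected without warning.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, data) where status is ok / corrected / uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # The decoder assumes exactly one flipped bit and "repairs" it.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a non-zero syndrome: two bits flipped.
        status = "uncorrectable"
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, data


word = 0b1010
codeword = encode(word)
for flips in ([], [5], [5, 6], [3, 5, 6]):
    corrupted = list(codeword)
    for pos in flips:
        corrupted[pos] ^= 1
    status, data = decode(corrupted)
    print(f"flipped {flips}: status={status}, data={data:04b}, intact={data == word}")
```

In the three-bit case the decoder reports "corrected" while returning the wrong data, which is the kind of silent failure that makes the underlying error rate of the memory channel matter.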
A second goal of this effort would be to further the investigation into the recently publicized
failure mechanism of DDR3 memory called ‘Row Hammer’. Google has identified this not only
as a reliability issue but as a security risk that can be exploited in order to gain complete control
over a targeted Server.
Goal #1: Memory Channel Validation Audit: Add value to the OCP brand of
Servers by specifying a robust memory channel test procedure. This test procedure is not meant
to be a design validation. It is meant to be an audit that a robust electrical and protocol DDR
Memory Channel validation was done. As an added benefit this procedure can also be used to:
• Spot check motherboards from manufacturing to ensure quality.
• Isolate failing memory channels in the field on servers displaying above-average memory
errors.
• Check for BIOS bugs that program the Memory Controller incorrectly, thus causing JEDEC
specification violations.
Procedure: This testing will be broken into two parts. The first is the electrical audit and the
second is the protocol and timing JEDEC specification testing.
Electrical Audit: This testing is not meant to be an electrical signal integrity validation.
Rather, it ensures that a validation has been done and that the signals at the DDR DIMM
connector are acceptable with regard to signal swing, alignment and data valid eye size, and that
none of the strobe, data, address, command or control signals look appreciably
degraded with respect to their form or function. It will be a qualitative measurement.
Protocol Timing Audit: This test procedure is not meant to replace a protocol timing
validation of the server. Rather it ensures that the BIOS has programmed the memory
controller correctly for key timing parameters of the DDR memory. It also ensures that under
heavy traffic loads the memory controller adheres to the JEDEC specification.
Outline: Electrical Audit of the DDR Memory Channel
Run Eye Scan¹ on all bus signals in all slots:
• Address
• Command
• Control
• Data Signals
• Data Strobes
Check for:
Signal Alignment
o Bytes to each other (fly-by)
o Signals within a byte
o Strobes to Data
    • Read
    • Write
o Command to Address and Control
Data Valid Eye (all signals)
o Signals within a byte
o Composite eye (millions of samples laid on top of each other)
o Burst Scan (beats within a burst)
Signal Swing
o At the start and end of a burst on Data and Data Strobe (DQ/DQS)
o All address, command and control signals
Method of Implementation:
For the audit, a qualitative measurement will be performed. An oscilloscope would be the tool
of choice for a robust validation. However, that is not our goal. For an audit it is much more
cost-effective, in both money and time, to use a logic analyzer and an interposer installed in
the DIMM slot. Using a DIMM slot interposer requires no soldering or delicate, time-consuming probing.
¹ Eye Scan is a general term describing high-speed digital sampling of a signal with a scope, logic
analyzer or protocol analyzer.
Figure 1: Keysight U4154A/B logic analyzer with a FS2510 interposer from FuturePlus Systems
This setup has a sampling resolution of 5 ps x 2 mV. It can sample every signal on the DDR4 bus
simultaneously. All signals are scanned under the same conditions and viewed with respect to
each other. This method does not replace a high-bandwidth scope; however, it does provide a
rapid qualitative insight that will act as an ‘audit’ that the memory slot has good signal quality.
Pass/Fail Criteria
Eye Scan
Address/Command and Control signals all have similar eye shape and opening size. A numeric
output (Excel sheet) from the tool allows the eyes to be compared quickly.
DQ/DQS signals all have similar eye shape and opening size. Byte fly-by is present and
consistent between bytes.
Figure 2: Eye Scan showing alignment of Data and Strobe for a Write
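The "similar eye shape and opening size" criterion above can be sketched as a comparison of each signal against the group median. This is only an illustration: the measurement values, the 15% tolerance and the function name are assumptions, not numbers taken from the tool or from this proposal.

```python
from statistics import median


def flag_eye_outliers(eyes, tolerance=0.15):
    """Return signals whose eye opening differs noticeably from the group.

    eyes: dict mapping signal name -> (eye_width_ps, eye_height_mv), for
          example values exported from the eye scan spreadsheet.
    tolerance: allowed fractional deviation from the group median
               (hypothetical value; a real audit would have to agree on one).
    """
    med_width = median(w for w, _ in eyes.values())
    med_height = median(h for _, h in eyes.values())
    flagged = []
    for name, (width, height) in eyes.items():
        width_off = abs(width - med_width) > tolerance * med_width
        height_off = abs(height - med_height) > tolerance * med_height
        if width_off or height_off:
            flagged.append(name)
    return sorted(flagged)


# Made-up measurements for four data signals in one byte lane.
byte0 = {
    "DQ0": (310, 420),
    "DQ1": (305, 415),
    "DQ2": (300, 410),
    "DQ3": (220, 400),  # noticeably narrower eye than its neighbors
}
print(flag_eye_outliers(byte0))  # ['DQ3']
```

Comparing against the median rather than a fixed limit keeps the check qualitative, in line with the audit's intent: it flags a signal that looks different from its peers without claiming a compliance margin.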
Burst Scan
Beats 0-7 (beats 0-3 for burst chop, BC) show consistently good signal quality. This is checked
for all DQ/DQS signals.
Figure 3: Example of a Burst Scan on Data byte 0 with associated strobe signal
Outline: Protocol Timing Audit of the DDR Memory Channel
The JEDEC specification has hundreds of timing parameters that govern the ordering and timing
between transactions on the DDR memory bus. These are set up by a complicated series of steps at
boot time and, if not configured correctly, can result in corruption of the data in memory.
Ensure that the OCP memory channel adheres to the JEDEC specification with regard to:
o Command to Command timing
o Refresh Rate
o Calibration Commands
o Correct Operation of Mode Register Settings
o ODT operation
o Rank to Rank command timing
o DIMM to DIMM command timing
o Power management operation
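As one example of the checks listed above, the refresh rate item can be sketched as a pass over a captured command trace. The trace format and function name are hypothetical, and the default limits (an average refresh interval of 7.8 us and up to 8 postponed REF commands, giving a maximum gap of 9 x tREFI) are assumptions that should be confirmed against the JEDEC specification for the memory under test.

```python
T_REFI_NS = 7800.0   # assumed average refresh interval (7.8 us)
MAX_POSTPONED = 8    # assumed number of REF commands that may be postponed


def check_refresh(trace, t_refi_ns=T_REFI_NS, max_postponed=MAX_POSTPONED):
    """Check the spacing of REF commands in a command trace.

    trace: iterable of (timestamp_ns, command) tuples in time order.
    Returns (violations, mean_interval_ns), where violations is a list of
    (previous_ref_ns, next_ref_ns) pairs whose gap exceeds the limit.
    """
    ref_times = [ts for ts, cmd in trace if cmd == "REF"]
    limit_ns = (max_postponed + 1) * t_refi_ns
    violations = [
        (prev, cur)
        for prev, cur in zip(ref_times, ref_times[1:])
        if cur - prev > limit_ns
    ]
    if len(ref_times) > 1:
        mean_ns = (ref_times[-1] - ref_times[0]) / (len(ref_times) - 1)
    else:
        mean_ns = None  # not enough REF commands to compute an interval
    return violations, mean_ns


# Made-up trace: the last REF arrives far too late.
trace = [(0, "REF"), (3000, "ACT"), (7800, "REF"), (15600, "REF"), (100000, "REF")]
violations, mean_ns = check_refresh(trace)
print(violations)  # [(15600, 100000)]
```

A mean interval well above tREFI over a long capture would also suggest that the controller is not refreshing often enough on average, even if no single gap exceeds the limit.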
Method of Implementation:
For the audit, a protocol analyzer or logic analyzer based solution will suffice. The measurement
is made at the DIMM slot, so no soldering or delicate, time-consuming probing needs to be
done.
Figure 4: Automated Violation Detection with the FS2800 DDR Detective®
Figure 5: DDR Detective interposer installed in 1 slot of a 2 slot memory channel
Pass/Fail Criteria
All protocol and timing checks defined by the JEDEC specification pass without error, with
each test run for 1 hour.
Software
In order to exercise the memory channel, software benchmarks will be run. They will be selected
using the following criteria:
o Creates near-theoretical bandwidth on the data bus
o Causes a variety of commands on the bus
o Exercises all supported power management commands
o Causes power spikes and inter-symbol interference
o Creates the Row Hammer event
The challenge will be to find the smallest number of different software programs that cause the
above events. It is anticipated that once failure mechanisms are uncovered and better
understood, software that targets those failure mechanisms will be added to the above
list.
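The first criterion above needs a reference figure for "theoretical bandwidth". A minimal sketch of that arithmetic follows, assuming a 64-bit data bus per channel (ECC bits excluded) and hypothetical function names; the measured value in the example is made up.

```python
def peak_bandwidth_gbps(transfer_rate_mts, bus_width_bits=64):
    """Theoretical peak bandwidth of one channel in GB/s (10^9 bytes/s).

    transfer_rate_mts: data rate in mega-transfers per second,
                       for example 1600 for DDR3-1600.
    bus_width_bits: data width of the channel (assumed 64, ECC excluded).
    """
    return transfer_rate_mts * 1e6 * (bus_width_bits / 8) / 1e9


def bus_utilization(measured_gbps, transfer_rate_mts, bus_width_bits=64):
    """Fraction of the theoretical peak that a benchmark achieved."""
    return measured_gbps / peak_bandwidth_gbps(transfer_rate_mts, bus_width_bits)


print(peak_bandwidth_gbps(1600))          # DDR3-1600: 12.8 GB/s per channel
print(peak_bandwidth_gbps(2400))          # DDR4-2400: 19.2 GB/s per channel
print(round(bus_utilization(11.5, 1600), 3))
```

How close to 1.0 a benchmark must come to count as "near theoretical" is not defined in this draft and would need to be agreed as part of the selection criteria.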
Documentation
The following documents will be created and electronically stored for each Server tested, for
each slot of each memory channel:
• Eye Scan results with ‘eye’ Excel spreadsheets for each signal
• Burst Scan for each byte (including ECC if implemented) for both Reads and Writes
• Protocol Violation Report
Example: A Server with 4 memory channels and 3 slots per channel has 12 total slots. The tests
are thus repeated 12 times. Each Eye Scan file will contain all signals for that slot. Each Burst
Scan file will contain all signals for that slot. Each slot will have a protocol violation check
report, so 12 protocol violation reports will be generated. In total this example system will
have 36 files. A summary file can then be created that gives an easy-to-read pass/fail summary.
Thus 37 files will be generated for each Server.
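The file count in the example above generalizes to other Server configurations. The sketch below reproduces the arithmetic; the function name is hypothetical, and it assumes three report files per slot (Eye Scan, Burst Scan, Protocol Violation Report) plus one summary file, as described in the text.

```python
def report_file_counts(channels, slots_per_channel, reports_per_slot=3, summary_files=1):
    """Return (total_slots, per_slot_files, total_files) for one Server.

    reports_per_slot: Eye Scan, Burst Scan and Protocol Violation Report.
    summary_files: the single pass/fail summary for the whole Server.
    """
    total_slots = channels * slots_per_channel
    per_slot_files = total_slots * reports_per_slot
    return total_slots, per_slot_files, per_slot_files + summary_files


# The example from the text: 4 memory channels with 3 slots each.
print(report_file_counts(4, 3))  # (12, 36, 37)
```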
Goal #2: Row Hammer Detection
As geometries shrink and capacities increase, DDR Memory cells become susceptible to leakage
current from adjacent cells. In the case of DDR Memory, a ROW subjected to excessive
ACTIVATE commands can leak current into adjacent ROWS. This ROW is referred to as the
‘aggressor’. If the adjacent ROWS, called the victim ROWS, are near the end of their
refresh cycle, their charge is low. Thus they are susceptible to leakage current that can cause a
bit flip. The failure of a DDR Memory cell to hold its charge due to leakage current from an
adjacent ROW that is targeted with excessive ACTIVATE commands is
known as “Row Hammer”. The name was coined because the ROW is being ‘hammered’ with
ACTIVATE commands.
How should OCP Certification address this problem? The answer is not altogether clear at this
point in time, since this is a relatively new issue for the industry. Here are some of the possible
scenarios.
1. Replace all DDR3 Memory with tested and certified parts that do not have this failure.
It is not clear that this is an economical or viable option.
2. Implement Row Hammer mitigation strategies for DDR3. There are several, but they do
not totally prevent the problem and they carry a power and performance cost for the server.
3. Determine whether the customer's application software creates the Row Hammer event. This
can be detected using hardware test equipment that looks for excessive ACTIVATE commands
being generated by the Memory Controller. If it does not, then perhaps no action needs
to be taken.
4. The DDR Memory DRAM itself is the culprit. Identify which parts are most susceptible
and purge only those parts from critical applications.
5. Move as quickly as possible to DDR4, which some in the industry claim is not susceptible
to Row Hammer failures. This claim has not been proven.
In any event, testing and research by reputable test labs should be undertaken and the results
published. This will arm the OCP community and the Cloud Computing industry with good
information on how to tackle this problem.
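The detection idea in scenario 3, looking for excessive ACTIVATE commands to the same ROW, can be sketched as a count per row within each refresh window. The trace format and function name are hypothetical, the 64 ms window assumes the usual DRAM refresh period, and the threshold is a placeholder rather than a characterized failure limit. Fixed, non-overlapping windows are used for simplicity; a sliding window would be stricter.

```python
from collections import Counter


def find_hammered_rows(trace, window_ns=64_000_000, threshold=100_000):
    """Find rows that receive an excessive number of ACTIVATE commands.

    trace: iterable of (timestamp_ns, command, bank, row) tuples.
    window_ns: length of each counting window (assumed 64 ms refresh period).
    threshold: ACTIVATE count per window that is treated as excessive
               (hypothetical placeholder value).
    Returns a dict mapping (bank, row) to the highest per-window count
    observed, for rows that reached the threshold in at least one window.
    """
    counts = Counter()
    for ts, cmd, bank, row in trace:
        if cmd == "ACT":
            counts[(ts // window_ns, bank, row)] += 1

    worst = {}
    for (_, bank, row), n in counts.items():
        if n >= threshold:
            worst[(bank, row)] = max(n, worst.get((bank, row), 0))
    return worst


# Tiny made-up trace with scaled-down window and threshold for illustration.
demo = [(i * 100, "ACT", 0, 0x10) for i in range(5)]
demo += [(50, "ACT", 0, 0x11), (60, "RD", 0, 0x10)]
print(find_hammered_rows(demo, window_ns=1000, threshold=4))  # {(0, 16): 5}
```

A row reported here is an aggressor candidate only; whether its neighbors actually flip bits depends on the DRAM parts installed, which is what the proposed investigation would characterize.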
Initial Row Hammer Investigation
• Gather information and repeat Row Hammer experiments
• Measure the effectiveness of the mitigation strategies and the power and performance
tradeoffs
• Identify what software shows the problem the quickest
• Compile a Known Good Parts list for DDR3
• Identify what types of applications are most susceptible and publish results
• Develop a network of industry experts to review and publish results
Current Row Hammer Resources
Google’s Article: http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
CMU exposé: http://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf
Various papers and articles on the topic:
https://blogs.synopsys.com/committedtomemory/2015/03/09/row-hammering-what-it-is-and-how-hackers-could-use-it-to-gain-access-to-your-system/
http://www.ddrdetective.com/files/6414/1036/5710/The_Known_Failure_Mechanism_in_DDR3_memory_referred_to_as_Row_Hammer.pdf
https://www.youtube.com/watch?v=7wIUQ04Vkes
Wikipedia Link: http://en.wikipedia.org/wiki/Row_hammer#cite_note-googleprojectzero-4
Products identifying Row Hammer events and causing them:
http://teledynelecroy.com/pressreleases/document.aspx?news_id=1805
http://www.eurosoft-uk.com/eurosoft-test-bulletin-testing-row-hammer/
http://www.ddrdetective.com/files/3314/1036/5702/Description_of_the_Row_Hammer_feature_on_the_FS2800_DDR_Detective.pdf
http://www.memtest86.com/
Summary
Establishing a Memory Channel Validation Audit and developing a well-understood Row Hammer
mitigation strategy will put OCP in a leadership position. OCP is the only industry standards
organization that encompasses the entire ‘food chain’ of Cloud Computing, from component
vendors, OEMs, Server vendors and Software vendors to large data center operators. OCP members
represent the spectrum of industries reliant on Cloud Computing, from the financial sector to
social media. OCP is poised to address this issue where other standards organizations, due to
corporate malfeasance or just plain ignorance, have failed to do so. These goals are attainable
and in the best interests of the OCP community.

Mais conteúdo relacionado

Mais procurados

Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesEd Hunter
 
LTE Testing - Network Performance, Security, and Stability at Massive Scale
LTE Testing - Network Performance, Security, and Stability at Massive ScaleLTE Testing - Network Performance, Security, and Stability at Massive Scale
LTE Testing - Network Performance, Security, and Stability at Massive ScaleIxia
 
FPGA IMPLEMENTATION OF DEBLOCKING FILTER CUSTOM INSTRUCTION HARDWARE ON NIOS-...
FPGA IMPLEMENTATION OF DEBLOCKING FILTER CUSTOM INSTRUCTION HARDWARE ON NIOS-...FPGA IMPLEMENTATION OF DEBLOCKING FILTER CUSTOM INSTRUCTION HARDWARE ON NIOS-...
FPGA IMPLEMENTATION OF DEBLOCKING FILTER CUSTOM INSTRUCTION HARDWARE ON NIOS-...VLSICS Design
 
Agilent flash programming agilent utility card versus deep serial memory-ca...
Agilent flash programming   agilent utility card versus deep serial memory-ca...Agilent flash programming   agilent utility card versus deep serial memory-ca...
Agilent flash programming agilent utility card versus deep serial memory-ca...AgilentT&M EMEA
 
Topic2a ss pipelines
Topic2a ss pipelinesTopic2a ss pipelines
Topic2a ss pipelinesturki_09
 
Design for testability and automatic test pattern generation
Design for testability and automatic test pattern generationDesign for testability and automatic test pattern generation
Design for testability and automatic test pattern generationDilip Mathuria
 
Software development for the COMPASS experiment
Software development for the COMPASS experimentSoftware development for the COMPASS experiment
Software development for the COMPASS experimentbodlosh
 
Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0Michael Christofferson
 
Trends in Mixed Signal Validation
Trends in Mixed Signal ValidationTrends in Mixed Signal Validation
Trends in Mixed Signal ValidationDVClub
 
Making of a PD Data Acqusition System
Making of a PD Data Acqusition SystemMaking of a PD Data Acqusition System
Making of a PD Data Acqusition SystemVishal Mathur
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)Siglos
 
DESIGN APPROACH FOR FAULT TOLERANCE IN FPGA ARCHITECTURE
DESIGN APPROACH FOR FAULT TOLERANCE IN FPGA ARCHITECTUREDESIGN APPROACH FOR FAULT TOLERANCE IN FPGA ARCHITECTURE
DESIGN APPROACH FOR FAULT TOLERANCE IN FPGA ARCHITECTUREVLSICS Design
 
Pin pointpresentation
Pin pointpresentationPin pointpresentation
Pin pointpresentationLevan Huan
 
White Paper: Six-Step Competitive Device Evaluation
White Paper: Six-Step Competitive Device EvaluationWhite Paper: Six-Step Competitive Device Evaluation
White Paper: Six-Step Competitive Device EvaluationIxia
 

Mais procurados (20)

LVTS Projects
LVTS ProjectsLVTS Projects
LVTS Projects
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
LTE Testing - Network Performance, Security, and Stability at Massive Scale
LTE Testing - Network Performance, Security, and Stability at Massive ScaleLTE Testing - Network Performance, Security, and Stability at Massive Scale
LTE Testing - Network Performance, Security, and Stability at Massive Scale
 
FPGA IMPLEMENTATION OF DEBLOCKING FILTER CUSTOM INSTRUCTION HARDWARE ON NIOS-...
FPGA IMPLEMENTATION OF DEBLOCKING FILTER CUSTOM INSTRUCTION HARDWARE ON NIOS-...FPGA IMPLEMENTATION OF DEBLOCKING FILTER CUSTOM INSTRUCTION HARDWARE ON NIOS-...
FPGA IMPLEMENTATION OF DEBLOCKING FILTER CUSTOM INSTRUCTION HARDWARE ON NIOS-...
 
Reliability and clock synchronization
Reliability and clock synchronizationReliability and clock synchronization
Reliability and clock synchronization
 
Agilent flash programming agilent utility card versus deep serial memory-ca...
Agilent flash programming   agilent utility card versus deep serial memory-ca...Agilent flash programming   agilent utility card versus deep serial memory-ca...
Agilent flash programming agilent utility card versus deep serial memory-ca...
 
Topic2a ss pipelines
Topic2a ss pipelinesTopic2a ss pipelines
Topic2a ss pipelines
 
Design for testability and automatic test pattern generation
Design for testability and automatic test pattern generationDesign for testability and automatic test pattern generation
Design for testability and automatic test pattern generation
 
Software development for the COMPASS experiment
Software development for the COMPASS experimentSoftware development for the COMPASS experiment
Software development for the COMPASS experiment
 
Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0
 
Trends in Mixed Signal Validation
Trends in Mixed Signal ValidationTrends in Mixed Signal Validation
Trends in Mixed Signal Validation
 
Rtos by shibu
Rtos by shibuRtos by shibu
Rtos by shibu
 
Making of a PD Data Acqusition System
Making of a PD Data Acqusition SystemMaking of a PD Data Acqusition System
Making of a PD Data Acqusition System
 
No[1][1]
No[1][1]No[1][1]
No[1][1]
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)
 
Real time operating systems
Real time operating systemsReal time operating systems
Real time operating systems
 
dft
dftdft
dft
 
DESIGN APPROACH FOR FAULT TOLERANCE IN FPGA ARCHITECTURE
DESIGN APPROACH FOR FAULT TOLERANCE IN FPGA ARCHITECTUREDESIGN APPROACH FOR FAULT TOLERANCE IN FPGA ARCHITECTURE
DESIGN APPROACH FOR FAULT TOLERANCE IN FPGA ARCHITECTURE
 
Pin pointpresentation
Pin pointpresentationPin pointpresentation
Pin pointpresentation
 
White Paper: Six-Step Competitive Device Evaluation
White Paper: Six-Step Competitive Device EvaluationWhite Paper: Six-Step Competitive Device Evaluation
White Paper: Six-Step Competitive Device Evaluation
 

Semelhante a OCP Server Memory Channel Testing DRAFT

DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4Barbara Aichinger
 
SYBSC IT SEM IV EMBEDDED SYSTEMS UNIT II Embedded Systems Peripherals
SYBSC IT SEM IV EMBEDDED SYSTEMS UNIT II  Embedded Systems PeripheralsSYBSC IT SEM IV EMBEDDED SYSTEMS UNIT II  Embedded Systems Peripherals
SYBSC IT SEM IV EMBEDDED SYSTEMS UNIT II Embedded Systems PeripheralsArti Parab Academics
 
Barbara_Aichinger_Server_Forum_2014
Barbara_Aichinger_Server_Forum_2014Barbara_Aichinger_Server_Forum_2014
Barbara_Aichinger_Server_Forum_2014Barbara Aichinger
 
Applications - embedded systems
Applications - embedded systemsApplications - embedded systems
Applications - embedded systemsDr.YNM
 
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI ConvergenceDAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergenceinside-BigData.com
 
DDR4 Compliance Testing. Its time has come!
DDR4 Compliance Testing.  Its time has come!DDR4 Compliance Testing.  Its time has come!
DDR4 Compliance Testing. Its time has come!Barbara Aichinger
 
Modeling of DDR4 Memory and Advanced Verifications of DDR4 Memory Subsystem
Modeling of DDR4 Memory and Advanced Verifications of DDR4 Memory SubsystemModeling of DDR4 Memory and Advanced Verifications of DDR4 Memory Subsystem
Modeling of DDR4 Memory and Advanced Verifications of DDR4 Memory SubsystemIRJET Journal
 
DDR4 Memory Compliance Testing Barbara Aichinger FuturePlus Systems
DDR4 Memory Compliance Testing   Barbara Aichinger FuturePlus SystemsDDR4 Memory Compliance Testing   Barbara Aichinger FuturePlus Systems
DDR4 Memory Compliance Testing Barbara Aichinger FuturePlus SystemsBarbara Aichinger
 
Webinar: Practical DDR Testing for Compliance, Validation and Debug
Webinar: Practical DDR Testing for Compliance, Validation and DebugWebinar: Practical DDR Testing for Compliance, Validation and Debug
Webinar: Practical DDR Testing for Compliance, Validation and Debugteledynelecroy
 
Polyteda: Power DRC/LVS, October 2016
Polyteda: Power DRC/LVS, October 2016Polyteda: Power DRC/LVS, October 2016
Polyteda: Power DRC/LVS, October 2016Oleksandra Nazola
 
Polyteda Power DRC/LVS July 2016
Polyteda Power DRC/LVS July 2016Polyteda Power DRC/LVS July 2016
Polyteda Power DRC/LVS July 2016Oleksandra Nazola
 
Oracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningOracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningScott Jenner
 
Accelerated development in Automotive E/E Systems using VisualSim Architect
Accelerated development in Automotive E/E Systems using VisualSim ArchitectAccelerated development in Automotive E/E Systems using VisualSim Architect
Accelerated development in Automotive E/E Systems using VisualSim ArchitectDeepak Shankar
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
PowerDRC/LVS 2.2 released by POLYTEDA
PowerDRC/LVS 2.2 released by POLYTEDAPowerDRC/LVS 2.2 released by POLYTEDA
PowerDRC/LVS 2.2 released by POLYTEDAAlexander Grudanov
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Ontico
 
Reproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral ModelsReproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral Modelsfnothaft
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsJames McGalliard
 

Semelhante a OCP Server Memory Channel Testing DRAFT (20)

DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4DesignCon 2015-criticalmemoryperformancemetricsforDDR4
DesignCon 2015-criticalmemoryperformancemetricsforDDR4
 
SYBSC IT SEM IV EMBEDDED SYSTEMS UNIT II Embedded Systems Peripherals
SYBSC IT SEM IV EMBEDDED SYSTEMS UNIT II  Embedded Systems PeripheralsSYBSC IT SEM IV EMBEDDED SYSTEMS UNIT II  Embedded Systems Peripherals
SYBSC IT SEM IV EMBEDDED SYSTEMS UNIT II Embedded Systems Peripherals
 
Barbara_Aichinger_Server_Forum_2014
Barbara_Aichinger_Server_Forum_2014Barbara_Aichinger_Server_Forum_2014
Barbara_Aichinger_Server_Forum_2014
 
Applications - embedded systems
Applications - embedded systemsApplications - embedded systems
Applications - embedded systems
 
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI ConvergenceDAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
 
DDR4 Compliance Testing. Its time has come!
DDR4 Compliance Testing.  Its time has come!DDR4 Compliance Testing.  Its time has come!
DDR4 Compliance Testing. Its time has come!
 
DDR DIMM Design
DDR DIMM DesignDDR DIMM Design
DDR DIMM Design
 
ECI OpenFlow 2.0 the Future of SDN
ECI OpenFlow 2.0 the Future of SDN ECI OpenFlow 2.0 the Future of SDN
ECI OpenFlow 2.0 the Future of SDN
 
Modeling of DDR4 Memory and Advanced Verifications of DDR4 Memory Subsystem
Modeling of DDR4 Memory and Advanced Verifications of DDR4 Memory SubsystemModeling of DDR4 Memory and Advanced Verifications of DDR4 Memory Subsystem
Modeling of DDR4 Memory and Advanced Verifications of DDR4 Memory Subsystem
 
DDR4 Memory Compliance Testing Barbara Aichinger FuturePlus Systems
DDR4 Memory Compliance Testing   Barbara Aichinger FuturePlus SystemsDDR4 Memory Compliance Testing   Barbara Aichinger FuturePlus Systems
DDR4 Memory Compliance Testing Barbara Aichinger FuturePlus Systems
 
Webinar: Practical DDR Testing for Compliance, Validation and Debug
Webinar: Practical DDR Testing for Compliance, Validation and DebugWebinar: Practical DDR Testing for Compliance, Validation and Debug
Webinar: Practical DDR Testing for Compliance, Validation and Debug
 
Polyteda: Power DRC/LVS, October 2016
Polyteda: Power DRC/LVS, October 2016Polyteda: Power DRC/LVS, October 2016
Polyteda: Power DRC/LVS, October 2016
 
Polyteda Power DRC/LVS July 2016
Polyteda Power DRC/LVS July 2016Polyteda Power DRC/LVS July 2016
Polyteda Power DRC/LVS July 2016
 
Oracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningOracle R12 EBS Performance Tuning
Oracle R12 EBS Performance Tuning
 
Accelerated development in Automotive E/E Systems using VisualSim Architect
Accelerated development in Automotive E/E Systems using VisualSim ArchitectAccelerated development in Automotive E/E Systems using VisualSim Architect
Accelerated development in Automotive E/E Systems using VisualSim Architect
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
PowerDRC/LVS 2.2 released by POLYTEDA
PowerDRC/LVS 2.2 released by POLYTEDAPowerDRC/LVS 2.2 released by POLYTEDA
PowerDRC/LVS 2.2 released by POLYTEDA
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
 
Reproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral ModelsReproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral Models
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific Applications
 

OCP Server Memory Channel Testing DRAFT

  • 1. OCP Server Memory Channel Testing Overview – Draft Version 0.1 May 4 2015 Contributed by: David Woolf UNH-OL david@iol.unh.edu and Barbara Aichinger FuturePlus Systems Barb.Aichinger@FuturePlus.com Executive Summary: Cloud Computing is pervasive in our society today and at the heart of every cloud computing server is DDR Memory. Some data centers have reported that DDR Memory errors are the #2 failure in their data centers. Error detection and correction techniques fall short if there are more than 1 -2 bit errors in a 64-72 bit line of information. Industry studies have shown that DDR Memory errors are much more pervasive in the field than the vendors data sheets would lead you to believe. Adding a cost effective and relatively quick post validation check of the DDR Memory channel for OCP Servers would add credibility to the OCP brand. In addition it would help identify the elusive cause of post manufacturing field memory errors seen in data centers across the globe. This effort would be a value to OCP manufacturers, OCP customers and the Cloud Computing industry in general. A second goal of this effort would be to further the investigation into the recently publicized failure mechanism of DDR3 memory called ‘Row Hammer’. Google has identified this not only as a reliability issue but as a security risk that can be exploited in order to gain complete control over a targeted Server. Goal #1: Memory Channel Validation Audit: Add value to the OCP brand of Servers by specifying a robust memory channel test procedure. This test procedure is not meant to be a design validation. It is meant to be an audit that a robust electrical and protocol DDR Memory Channel validation was done. As an added benefit this procedure can also be used to:  Spot check motherboards from manufacturing to ensure quality. 
 Isolate failing memory channels in the field on servers displaying above average memory errors  Check for BIOS bugs that program the Memory Controller incorrectly thus causing JEDEC specification violations Procedure: This testing will be broken into two parts. The first is the electrical audit and the second is the protocol and timing JEDEC specification testing. Electrical Audit: This testing is not meant to be an electrical signal integrity validation. Rather it ensures that a validation has been done and that the signals at the DDR DIMM connector are acceptable with regards to signal swing, alignment, data valid eye size and that none of the strobe signals, data signals, address, command or control signals look appreciably degraded with respect to their form or function. It will be a qualitative measurement.
  • 2. Protocol Timing Audit: This test procedure is not meant to replace a protocol timing validation of the server. Rather it ensures that the BIOS has programmed the memory controller correctly for key timing parameters of the DDR memory. It also ensures that under heavy traffic loads the memory controller adheres to the JEDEC specification. Outline: Electrical Audit of the DDR Memory Channel Run Eye Scan1 on all bus signals in all slots:  Address  Command  Control  Data Signals  Data Strobes Check for Signal Alignment o Bytes to each other (fly by) o Signals within a byte o Strobes to Data  Read  Write o Command to Address and Control Data Valid Eye (all signals) o Signals within a byte o Composite eye (millions of samples laid on top of each other) o Burst Scan (beats within a burst) Signal Swing o At the start and end of a burst on Data and Data Strobe (DQ/DQS) o All address command and control signals Method of Implementation: For the audit a qualitative measurement will be performed. An oscilloscope would be the tool of choice for a robust validation. However that is not our goal. For an audit it is much more cost effective in money and time to use a logic analyzer and an interposer installed in the DIMM slot. Using a DIMM slot interposer results in no soldering or delicate time consuming probing. 1 Eye Scan is a general term describing a high speed digital sampling of a signal with either a scope, logic analyzer or protocol analyzer.
  • 3. Figure 1: Keysight U4154A/B logic analyzer with a FS2510 interposer from FuturePlus Systems This setup has a sampling resolution of 5ps x 2mv. It can sample every signal on the DDR4 bus. simultaneously. All signals are scanned under the same conditions and viewed with respect to each other. This method does not replace a high bandwidth scope however it does provide a rapid qualitative insight that will act as an ‘audit’ that the memory slot has good signal quality. Pass/Fail Criteria Eye Scan Address/Command and Control Signals have similar eye shape and opening size. A numeric output (excel sheet) from the tool can be compared quickly in order to make the comparison. DQ/DQS signals all have similar eye shape and opening size. Byte fly by is present and consistent between bytes Figure 2: Eye Scan showing alignment of Data and Strobe for a Write
Burst Scan
• Beats 0-7 (0-3 for burst chop, BC) are consistent with good signal quality. This is checked for all DQ/DQS signals.

Figure 3: Example of a Burst Scan on Data byte 0 with associated strobe signal

Outline: Protocol Timing Audit of the DDR Memory Channel

The JEDEC specification has hundreds of timing parameters that govern the ordering and timing between transactions on the DDR memory bus. These are set up by a complicated set of steps at boot time and, if not configured correctly, can result in data corruption of the memory.

Ensure that the OCP memory channel adheres to the JEDEC specification with regard to:
o Command to Command timing
o Refresh Rate
o Calibration Commands
o Correct Operation of Mode Register Settings
o ODT operation
o Rank to Rank command timing
o DIMM to DIMM command timing
o Power management operation

Method of Implementation: For the audit a protocol or logic analyzer based solution will suffice. The measurement is made at the DIMM slot, so no soldering or delicate, time-consuming probing needs to be done.
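A command to command timing check of the kind listed above can be illustrated in a few lines. This is only a sketch, assuming the analyzer capture has been exported as time-sorted (time in ns, command, bank) tuples. The trace format, the function name `find_spacing_violations` and the 45 ns limit are hypothetical; real limits must come from the JEDEC specification for the speed bin under test.

```python
def find_spacing_violations(trace, first, second, min_ns):
    """Find cases where `second` follows `first` on the same bank
    sooner than `min_ns`.

    trace: list of (time_ns, command, bank) tuples sorted by time.
    Returns a list of (time_of_first, time_of_second, bank) tuples.
    """
    last_seen = {}   # bank -> time of the most recent `first` command
    violations = []
    for time_ns, command, bank in trace:
        if command == second and bank in last_seen:
            if time_ns - last_seen[bank] < min_ns:
                violations.append((last_seen[bank], time_ns, bank))
        if command == first:
            last_seen[bank] = time_ns
    return violations

# Placeholder limit for illustration only, not a JEDEC value.
ACT_TO_ACT_MIN_NS = 45.0

# Hypothetical capture: bank 0 is re-activated after only 40 ns.
trace = [
    (0.0, "ACT", 0), (15.0, "RD", 0), (30.0, "PRE", 0),
    (40.0, "ACT", 0), (50.0, "ACT", 1), (100.0, "ACT", 0),
]
print(find_spacing_violations(trace, "ACT", "ACT", ACT_TO_ACT_MIN_NS))
# → [(0.0, 40.0, 0)]
```

A protocol analyzer performs many such checks in hardware and in real time; the sketch only shows the shape of a single rule.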
Figure 4: Automated Violation Detection with the FS2800 DDR Detective®

Figure 5: DDR Detective interposer installed in 1 slot of a 2 slot memory channel

Pass/Fail Criteria

All protocol and timing checks defined by the JEDEC Specification pass without error, with each test run for 1 hour.

Software

In order to exercise the memory channel, software benchmarks will be run. They will be selected using the following criteria:
o Near theoretical bandwidth created on the data bus
o Variety of commands caused on the bus
o Exercises all supported power management commands
o Power spikes and inter-symbol interference caused
o Creates the Row Hammer event

The challenge will be to find the smallest number of different software programs that cause the above events. It is anticipated that once failure mechanisms are uncovered and better understood, software that targets creating those failure mechanisms will be added to the above list.

Documentation

The following documents will be created and electronically stored for each Server tested, for each slot of each memory channel:
• Eye Scan results with ‘eye’ Excel spreadsheets for each signal
• Burst Scan for each byte (including ECC if implemented) for both Reads and Writes
• Protocol Violation Report

Example: A Server with 4 memory channels and 3 slots per channel has 12 total slots. The tests are thus repeated 12 times. Each Eye Scan file will contain all signals for that slot. Each Burst Scan file will contain all signals for that slot. Each slot will have a protocol violation check report, thus 12 protocol violation reports will be generated. In total this example system will have 36 files (12 slots x 3 files per slot). A summary file can then be created that gives an easy to read pass/fail summary. Thus 37 files will be generated for each Server.

Goal #2: Row Hammer Detection

As geometries shrink and capacities increase, DDR Memory cells are susceptible to leakage current from adjacent cells. In the case of DDR Memory a ROW subjected to excessive ACTIVATE commands can leak current into adjacent ROWS. This ROW is referred to as the ‘aggressor’. If the adjacent ROWS, called the victim ROWS, are on the tail end of the cyclical refresh cycle their charge is low. Thus they are susceptible to leakage current that can cause a bit flip.
The failure of the DDR Memory cell to hold its charge due to leakage current from an adjacent ROW when the adjacent ROW is targeted with excessive ACTIVATE commands is known as “Row Hammer”. The name was coined because the ROW is being ‘hammered’ with ACTIVATE commands.
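Detecting a Row Hammer event on a captured trace comes down to counting ACTIVATE commands per row within a time window. The sketch below assumes a time-sorted list of (time in ns, bank, row) ACTIVATE records. The function name `find_hammered_rows`, the trace format, and the window and threshold values are hypothetical and scaled down for illustration; in practice the window would be the DRAM refresh window (typically 64 ms) and the threshold would come from testing of the specific parts.

```python
from collections import defaultdict, deque

def find_hammered_rows(activates, window_ns, threshold):
    """Return {(bank, row): peak_count} for rows whose ACTIVATE count
    inside any sliding window of `window_ns` reaches `threshold`.

    activates: iterable of (time_ns, bank, row) sorted by time.
    """
    recent = defaultdict(deque)  # (bank, row) -> timestamps inside the window
    peaks = {}
    for time_ns, bank, row in activates:
        stamps = recent[(bank, row)]
        stamps.append(time_ns)
        # Drop ACTIVATEs that have aged out of the window.
        while stamps[0] <= time_ns - window_ns:
            stamps.popleft()
        if len(stamps) >= threshold:
            key = (bank, row)
            peaks[key] = max(peaks.get(key, 0), len(stamps))
    return peaks

# Hypothetical trace: row 0x1A is activated every 100 ns (the aggressor),
# row 0x2B only three times.
trace = [(t, 0, 0x1A) for t in range(0, 5000, 100)]
trace += [(50, 0, 0x2B), (2050, 0, 0x2B), (4050, 0, 0x2B)]
trace.sort()

hammered = find_hammered_rows(trace, window_ns=10_000, threshold=40)
print(hammered)  # → {(0, 26): 50}
```

The rows physically adjacent to each reported aggressor are the potential victims. Mapping a logical row address to its physical neighbors is device specific and is not attempted here.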
How should OCP Certification address this problem?

The answer is not altogether clear at this point in time, since this is a relatively new issue for the industry. Here are some of the possible scenarios.

1. Replace all DDR3 Memory with tested and certified parts that do not have this failure. It is not clear that this is an economical or viable option.
2. Implement Row Hammer mitigation strategies for DDR3. There are several, but they do not totally prevent the problem and they impose a power and performance hit on the server.
3. Does the customer's application software create the Row Hammer event? This can be detected using hardware test equipment that looks for excessive ACTIVATE commands being generated by the Memory Controller. If it does not, then perhaps no action needs to be taken.
4. The DDR Memory DRAM itself is the culprit. Identify which parts are most susceptible and purge only those parts from critical applications.
5. Move as quickly as possible to DDR4, which some in the industry claim is not susceptible to Row Hammer failures. This claim has not been proven to be correct.

In any event, testing and research by reputable test labs should be undertaken and the results published. This will arm the OCP community and the Cloud Computing industry with good information on how to tackle this problem.
Initial Row Hammer Investigation
• Gather information and repeat Row Hammer experiments
• Measure the effectiveness of the mitigation strategies and the power and performance tradeoffs
• Identify what software shows the problem the quickest
• Compile a Known Good Parts list for DDR3
• Identify what types of applications are most susceptible and publish results
• Develop a network of industry experts to review and publish results

Current Row Hammer Resources

Google’s Article:
http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html

CMU exposé:
http://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf

Various papers and articles on the topic:
https://blogs.synopsys.com/committedtomemory/2015/03/09/row-hammering-what-it-is-and-how-hackers-could-use-it-to-gain-access-to-your-system/
http://www.ddrdetective.com/files/6414/1036/5710/The_Known_Failure_Mechanism_in_DDR3_memory_referred_to_as_Row_Hammer.pdf
https://www.youtube.com/watch?v=7wIUQ04Vkes
Wikipedia Link:
http://en.wikipedia.org/wiki/Row_hammer#cite_note-googleprojectzero-4

Products that identify and cause Row Hammer events:
http://teledynelecroy.com/pressreleases/document.aspx?news_id=1805
http://www.eurosoft-uk.com/eurosoft-test-bulletin-testing-row-hammer/
http://www.ddrdetective.com/files/3314/1036/5702/Description_of_the_Row_Hammer_feature_on_the_FS2800_DDR_Detective.pdf
http://www.memtest86.com/

Summary

Establishing a Memory Channel Validation Audit and developing a well understood Row Hammer mitigation strategy will put OCP in a leadership position. OCP is the only industry standards organization that encompasses the entire ‘food chain’ of Cloud Computing, from component vendors, OEMs, Server vendors and Software vendors to large data center operators. OCP members represent the spectrum of industries reliant on Cloud Computing, from the financial sector to social media. OCP is poised to address this issue where other standards organizations, due to corporate malfeasance or just plain ignorance, have failed to do so. These goals are attainable and in the best interests of the OCP community.