Fault Tolerance
MicroBlaze processor versions v8.00.a and later include fault tolerance features, enabled with the parameter C_FAULT_TOLERANT. These features provide error detection for internal block RAMs, and support for Error Detection and Correction (ECC) in LMB block RAMs. When fault tolerance is enabled, soft errors in block RAM are detected and, where possible, corrected: parity errors in internal block RAMs cause the affected entry to be invalidated or reloaded, and single bit errors in LMB block RAM are corrected by ECC. This significantly reduces overall failure intensity.
In addition to protecting block RAM, the FPGA configuration memory also generally needs to be protected. A detailed explanation of this topic, and further references, can be found in the SEU Strategies for Virtex-5 Devices Application Note (XAPP864):
http://www.xilinx.com/support/documentation/application_notes/xapp864.pdf
Configuration
Using MicroBlaze Configuration
Fault tolerance can be enabled in the MicroBlaze configuration dialog. Note that this must currently be done using the advanced configuration, since the X-button to the left of the parameter must be clicked to override the default setting.
After enabling fault tolerance in MicroBlaze, ECC is automatically enabled in the connected LMB Block RAM Interface Controllers by the tools, when the system is generated. This means that nothing else needs to be configured to enable fault tolerance and minimal ECC support.
It is possible (albeit not recommended) to manually override ECC support, leaving the LMB Block RAM unprotected, by disabling C_ECC in the configuration dialogs of all connected LMB Block RAM Interface Controllers. In this case, the internal MicroBlaze block RAM protection is still enabled, since fault tolerance is enabled.
Using LMB Block RAM Interface Controller Configuration
As an alternative to the method described above, it is also possible to enable ECC in the configuration dialogs of all connected LMB Block RAM Interface Controllers. In this case, fault tolerance is automatically enabled in MicroBlaze by the tools, when the system is generated. This means that nothing else needs to be configured to enable ECC support and MicroBlaze fault tolerance.
ECC must be either enabled in all controllers or disabled in all controllers, which is enforced by a DRC.
It is possible to manually override fault tolerance support in MicroBlaze, by explicitly disabling C_FAULT_TOLERANT in the MicroBlaze configuration dialog. This is not recommended, unless no block RAM is used in MicroBlaze, and there is no need to handle bus exceptions from uncorrectable ECC errors.
Features
An overview of all MicroBlaze fault tolerance features is given here. Further details on each feature can be found in the MicroBlaze Processor Reference Guide (UG081):
http://www.xilinx.com/support/documentation/sw_manuals/xilinx13_1/mb_ref_guide.pdf
The LMB Block RAM Interface Controller v3.00.a provides the LMB ECC implementation. For details, including performance and resource utilization, see the "IP Processor LMB Block RAM Interface Controller" (DS452).
Instruction and Data Cache Protection
To protect the block RAM in the Instruction and Data Cache, parity is used. When a parity error is detected, the corresponding cache line is invalidated. This forces the cache to reload the correct value from external memory. Parity is checked whenever a cache hit occurs.
Note that this scheme only works for write-through, and thus write-back data cache is not available when fault tolerance is enabled. This is enforced by a DRC.
When new values are written to a block RAM in the cache, parity is also calculated and written. One parity bit is used for the tag, one parity bit for the instruction cache data, and one parity bit for each word in a data cache line.
In many cases, enabling fault tolerance does not increase the required number of cache block RAMs, since spare bits can be used for the parity. Any increase in resource utilization, in particular the number of block RAMs, is shown in the MicroBlaze configuration dialog when fault tolerance is enabled.
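The parity check described above can be illustrated with a small host-side model. This is only a sketch: the struct and function names are hypothetical, and even parity over a 32-bit word is an assumption, since the polarity and exact bit grouping used by the hardware are not stated here.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the number of set bits in the word is odd, else 0. */
static unsigned parity32(uint32_t word)
{
    word ^= word >> 16;
    word ^= word >> 8;
    word ^= word >> 4;
    word ^= word >> 2;
    word ^= word >> 1;
    return word & 1u;
}

/* Hypothetical model of one protected word: data plus one stored parity bit. */
struct protected_word {
    uint32_t data;
    unsigned parity;
};

/* Parity is calculated and stored whenever a new value is written. */
static struct protected_word protected_write(uint32_t data)
{
    struct protected_word w;
    w.data = data;
    w.parity = parity32(data);
    return w;
}

/* Returns 1 when the stored parity no longer matches the data. */
static int parity_error(const struct protected_word *w)
{
    assert(w != 0);
    return parity32(w->data) != w->parity;
}
```

A single flipped bit makes parity_error() return 1, which in the cache corresponds to invalidating the line and reloading it from external memory. Two flipped bits in the same word cancel out and go undetected, which is one reason periodic scrubbing matters.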
Memory Management Unit Protection
To protect the block RAM in the MMU Unified Translation Look-Aside Buffer (UTLB), parity is used. When a parity error is detected during an address translation, a TLB miss exception occurs, forcing software to reload the entry.
When a new TLB entry is written using the TLBHI and TLBLO registers, parity is calculated. One parity bit is used for each entry.
Parity is also checked when a UTLB entry is read using the TLBHI and TLBLO registers. When a parity error is detected in this case, the entry is marked invalid by clearing the valid bit.
Enabling fault tolerance does not increase the MMU block RAM size, since a spare bit is available for the parity.
Branch Target Cache Protection
To protect block RAM in the Branch Target Cache, parity is used. When a parity error is detected when looking up a branch target address, the address is ignored, forcing a normal branch.
When a new branch address is written to the Branch Target Cache, parity is calculated. One parity bit is used for each address.
Enabling fault tolerance does not increase the Branch Target Cache block RAM size, since a spare bit is available for the parity.
Exception Handling
With fault tolerance enabled, if an error occurs in LMB block RAM, the LMB Block RAM Interface Controller generates error signals on the LMB bus.
If exceptions are enabled in MicroBlaze, by setting the EE bit in the Machine Status Register, the uncorrectable error signal generates either an instruction bus exception or a data bus exception, depending on the affected interface.
Should a bus exception occur when an exception is in progress, MicroBlaze is halted, and the external error signal is set. This behavior ensures that it is impossible to execute an instruction corrupted by an uncorrectable error.
Software Support
Scrubbing
To ensure that bit errors are not accumulated in block RAMs, they must be periodically scrubbed.
The standalone BSP provides the function microblaze_scrub() to perform scrubbing of the entire LMB block RAM and all MicroBlaze internal block RAMs used in a particular configuration. This function is intended to be called periodically from a timer interrupt routine.
The following example code illustrates how this can be done.
#include "xparameters.h"
#include "xtmrctr.h"
#include "xintc.h"
#include "mb_interface.h"

#define SCRUB_PERIOD ...

/*
 * TMRCTR_DEVICE_ID, INTC_DEVICE_ID, TMRCTR_INTERRUPT_ID and TIMER_CNTR_0
 * are assumed to be defined by the application, typically from the
 * values generated in xparameters.h.
 */

XIntc InterruptController; /* The instance of the Interrupt Controller */
XTmrCtr TimerCounterInst;  /* The instance of the Timer Counter */

void MicroBlazeScrubHandler(void *CallBackRef, u8 TmrCtrNumber)
{
    /* Perform other timer interrupt processing here */
    microblaze_scrub();
}

int main (void)
{
    int Status;

    /*
     * Initialize the timer counter so that it's ready to use,
     * specify the device ID that is generated in xparameters.h
     */
    Status = XTmrCtr_Initialize(&TimerCounterInst, TMRCTR_DEVICE_ID);
    if (Status != XST_SUCCESS) {
        return XST_FAILURE;
    }

    /*
     * Initialize the interrupt controller driver so that the timer
     * counter can be connected to the interrupt subsystem.
     */
    Status = XIntc_Initialize(&InterruptController, INTC_DEVICE_ID);
    if (Status != XST_SUCCESS) {
        return XST_FAILURE;
    }

    /*
     * Connect a device driver handler that will be called when an interrupt
     * for the device occurs, the device driver handler performs the specific
     * interrupt processing for the device. XIntc_Connect takes the
     * interrupt ID of the timer counter, not its device ID.
     */
    Status = XIntc_Connect(&InterruptController, TMRCTR_INTERRUPT_ID,
                           (XInterruptHandler)XTmrCtr_InterruptHandler,
                           (void *) &TimerCounterInst);
    if (Status != XST_SUCCESS) {
        return XST_FAILURE;
    }

    /*
     * Start the interrupt controller such that interrupts are enabled for
     * all devices that cause interrupts, specifying real mode so that
     * the timer counter can cause interrupts thru the interrupt controller.
     */
    Status = XIntc_Start(&InterruptController, XIN_REAL_MODE);
    if (Status != XST_SUCCESS) {
        return XST_FAILURE;
    }

    /*
     * Setup the handler for the timer counter that will be called from the
     * interrupt context when the timer expires, specify a pointer to the
     * timer counter driver instance as the callback reference so the handler
     * is able to access the instance data
     */
    XTmrCtr_SetHandler(&TimerCounterInst, MicroBlazeScrubHandler,
                       &TimerCounterInst);

    /*
     * Enable the interrupt of the timer counter so interrupts will occur
     * and use auto reload mode such that the timer counter will reload
     * itself automatically and continue repeatedly, without this option
     * it would expire once only
     */
    XTmrCtr_SetOptions(&TimerCounterInst, TIMER_CNTR_0,
                       XTC_INT_MODE_OPTION | XTC_AUTO_RELOAD_OPTION);

    /*
     * Set a reset value for the timer counter such that it will expire
     * earlier than letting it roll over from 0, the reset value is loaded
     * into the timer counter when it is started
     */
    XTmrCtr_SetResetValue(&TimerCounterInst, TIMER_CNTR_0, SCRUB_PERIOD);

    /*
     * Start the timer counter such that it's incrementing by default,
     * then wait for it to timeout a number of times
     */
    XTmrCtr_Start(&TimerCounterInst, TIMER_CNTR_0);

    /*
     * The remaining setup is not shown. The timer interrupt must also be
     * enabled in the interrupt controller (XIntc_Enable) and in MicroBlaze
     * (microblaze_enable_interrupts) before the handler can run.
     */
    ...
}
See below for further details on how scrubbing is implemented, including how to calculate the scrubbing rate.
Block RAM Driver
The standalone BSP Block RAM driver is used to access the ECC registers in the LMB Block RAM Interface Controller, and also provides a comprehensive self test.
By implementing the SDK Xilinx C Project "Peripheral Tests", a self test example including the block RAM self test for each LMB Block RAM Interface Controller in the system is generated. Depending on the ECC features enabled in the LMB Block RAM Interface Controller, this code performs all possible tests of the ECC function.
The self test example can be found in the standalone BSP Block RAM driver source code, typically in the subdirectory microblaze_0/libsrc/bram_v3_00_a/src/xbram_selftest.c.
Scrubbing
Scrubbing Methods
Scrubbing is performed using specific methods for the different block RAMs; the method for each is described in the MicroBlaze Processor Reference Guide (UG081).
It is also possible to add interrupts for correctable errors from the LMB Block RAM Interface Controllers, and immediately scrub the failing address in the interrupt handler, although in most cases this only improves reliability slightly.
The failing address can be determined by reading the Correctable Error First Failing Address Register in each of the LMB Block RAM Interface Controllers. To be able to generate an interrupt, C_ECC_STATUS_REGISTERS must be set to 1 in the connected LMB Block RAM Interface Controllers, and to read the failing address, C_CE_FAILING_REGISTERS must be set to 1.
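Scrubbing a single LMB address amounts to a read followed by a write of the same value, as described for the Typical use case. The sketch below shows only that step; scrub_word and scrub_range are hypothetical names, and obtaining the failing address from the controller register is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one word and write the same value back to the same address. */
static void scrub_word(volatile uint32_t *addr)
{
    uint32_t value;

    assert(addr != NULL);
    value = *addr;   /* with ECC enabled, a single bit error is corrected on read */
    *addr = value;   /* the write stores the corrected data back to block RAM */
}

/* Scrub a contiguous range of words, for example from a periodic handler. */
static void scrub_range(volatile uint32_t *base, size_t words)
{
    size_t i;

    assert(base != NULL);
    for (i = 0; i < words; i++) {
        scrub_word(&base[i]);
    }
}
```

In an interrupt handler, the address read from the Correctable Error First Failing Address Register would be passed to scrub_word(). The read and write are two separate accesses, so a real handler has to ensure that no other code writes the same word in between, for example by running with interrupts disabled.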
Calculating Scrubbing Frequency
Scrubbing frequency depends on failure intensity and desired reliability.
The approximate equation to determine the LMB memory scrubbing rate is, in our case, given by:
where Pw is the probability of an uncorrectable error in a memory word, BER is the soft error rate for a single memory bit, and SR is the Scrubbing Rate.
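To show how Pw, BER and SR interact, the sketch below uses a generic second-order model: an uncorrectable error needs two bit errors in the same word within one scrub interval, so Pw is taken as the number of bit pairs in the word times (BER/SR) squared. This model, the word_bits parameter and the function names are assumptions made for illustration; the constant may differ from the one used in the reference guide. BER and SR must use the same time unit.

```c
#include <assert.h>

/* Number of distinct bit pairs in a protected word of word_bits bits. */
static double bit_pairs(unsigned word_bits)
{
    assert(word_bits >= 2);
    return (double)word_bits * (double)(word_bits - 1) / 2.0;
}

/*
 * Assumed model: Pw ~ pairs * (BER / SR)^2, where BER / SR approximates
 * the probability of one bit error during a single scrub interval.
 */
static double uncorrectable_word_probability(double ber, double sr,
                                             unsigned word_bits)
{
    double p_bit;

    assert(ber >= 0.0 && sr > 0.0);
    p_bit = ber / sr;
    return bit_pairs(word_bits) * p_bit * p_bit;
}

/* Returns 1 if scrubbing at rate sr keeps Pw at or below target_pw. */
static int scrub_rate_is_sufficient(double ber, double sr,
                                    unsigned word_bits, double target_pw)
{
    return uncorrectable_word_probability(ber, sr, word_bits) <= target_pw;
}
```

Under this model, doubling the scrubbing rate reduces Pw by a factor of four, so a modest increase in scrubbing rate gives a large reliability gain.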
The soft error rates affecting Block RAM for each product family can be found in the Device Reliability Report, Fourth Quarter 2010 (UG116):
http://www.xilinx.com/support/documentation/user_guides/ug116.pdf
Use Cases
A number of common use cases are described here. These use cases are derived from the "IP Processor LMB Block RAM Interface Controller" (DS452).
Minimal
This system is obtained when enabling fault tolerance in MicroBlaze, without doing any other configuration.
The system is suitable when area constraints are high, and there is no need for testing of the ECC function, or analysis of error frequency and location. No ECC registers are implemented. Single bit errors are corrected by the ECC logic before being passed to MicroBlaze. Uncorrectable errors set an error signal, which generates an exception in MicroBlaze.
Small
This system should be used when it is required to monitor error frequency, but there is no need for testing of the ECC function. It is the Minimal system with the Correctable Error Counter Register added to monitor single bit error rates. If the error rate is too high, the scrubbing rate should be increased to minimize the risk of a single bit error becoming an uncorrectable double bit error. The parameters set are C_ECC = 1 and C_CE_COUNTER_WIDTH = 10.
Typical
This system represents a typical use case, where it is required to monitor error frequency, as well as to generate an interrupt to immediately correct a single bit error through software. It does not provide support for testing of the ECC function. It is the Small system with the Correctable Error First Failing registers and the Status register added. A single bit error latches the address for the access into the Correctable Error First Failing Address Register and sets the CE_STATUS bit in the ECC Status Register. An interrupt is generated, triggering MicroBlaze to read the failing address and then perform a read followed by a write on the failing address. This removes the single bit error from the block RAM, thus reducing the risk of the single bit error becoming an uncorrectable double bit error. The parameters set are C_ECC = 1, C_CE_COUNTER_WIDTH = 10, C_ECC_STATUS_REGISTERS = 1 and C_CE_FAILING_REGISTERS = 1.
Full
This system uses all of the features provided by the LMB Block RAM Interface Controller, to enable full error injection capability, as well as error monitoring and interrupt generation. It is the Typical system with the Uncorrectable Error First Failing registers and the Fault Injection registers added. All features are switched on for full control of ECC functionality, for system debug or for systems with high fault tolerance requirements. The parameters set are C_ECC = 1, C_CE_COUNTER_WIDTH = 10, C_ECC_STATUS_REGISTERS = 1, C_CE_FAILING_REGISTERS = 1, C_UE_FAILING_REGISTERS = 1 and C_FAULT_INJECT = 1.
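Each use case adds parameters to the previous one, which can be summarized as a small lookup. The sketch below is only an illustration in C: the struct, enum and function names are hypothetical, the status parameter is spelled C_ECC_STATUS_REGISTERS as in the scrubbing section, and parameters not mentioned for a use case are assumed to be 0.

```c
#include <assert.h>

/* LMB Block RAM Interface Controller ECC parameters named in the use cases. */
struct lmb_ecc_params {
    int c_ecc;
    int c_ce_counter_width;
    int c_ecc_status_registers;
    int c_ce_failing_registers;
    int c_ue_failing_registers;
    int c_fault_inject;
};

enum use_case {
    USE_CASE_MINIMAL,
    USE_CASE_SMALL,
    USE_CASE_TYPICAL,
    USE_CASE_FULL
};

/* Build the parameter set for a use case; each level extends the one below. */
static struct lmb_ecc_params use_case_params(enum use_case uc)
{
    struct lmb_ecc_params p = { 0, 0, 0, 0, 0, 0 };

    assert((unsigned)uc <= (unsigned)USE_CASE_FULL);
    p.c_ecc = 1;                        /* Minimal: ECC only, no registers */
    if (uc >= USE_CASE_SMALL) {
        p.c_ce_counter_width = 10;      /* Small: correctable error counter */
    }
    if (uc >= USE_CASE_TYPICAL) {
        p.c_ecc_status_registers = 1;   /* Typical: status and interrupt */
        p.c_ce_failing_registers = 1;   /* Typical: first failing address */
    }
    if (uc >= USE_CASE_FULL) {
        p.c_ue_failing_registers = 1;   /* Full: uncorrectable error address */
        p.c_fault_inject = 1;           /* Full: fault injection for testing */
    }
    return p;
}
```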
References
Each of the fault tolerance features is described in detail in the MicroBlaze Processor Reference Guide (UG081):
http://www.xilinx.com/support/documentation/sw_manuals/xilinx13_1/mb_ref_guide.pdf
The LMB ECC implementation is described in further detail in the "IP Processor LMB Block RAM Interface Controller" (DS452), including a complete reference to the internal registers, as well as performance and resource utilization.
The soft error rates affecting block RAM for each product family can be found in the Device Reliability Report, Fourth Quarter 2010 (UG116):
http://www.xilinx.com/support/documentation/user_guides/ug116.pdf
A detailed explanation of configuration memory protection, and further references, can be found in the SEU Strategies for Virtex-5 Devices Application Note (XAPP864):
http://www.xilinx.com/support/documentation/application_notes/xapp864.pdf
Answer Number | Answer Title | Version Found | Version Resolved |
---|---|---|---|
39843 | 13.x EDK - Master Answer Record | N/A | N/A |
AR# 40863 | |
---|---|
Date | 12/15/2012 |
Status | Active |
Type | General Article |
Tools |