
Fault-Tolerance

Fault-tolerant systems are thoroughly discussed in [Cri91], [Nel90] and [Jal94].

A fault can be classified by its duration, nature and extent. The duration can be transient, intermittent or permanent [Laprie book] [Nel90]. A transient fault is often the result of external disturbances (noise, radiation); it corrupts data, exists for a finite time and is non-recurring. Such data corruption is called a soft error, and the rate at which these events occur is called the soft error rate (SER) [BFM$^+$03]. Soft errors are already of great concern in devices built in current technologies. A system with an intermittent fault oscillates between faulty and fault-free operation; this usually results from unstable device operation. Permanent (hard) faults are device conditions that do not correct with time (component failures, physical damage or design errors). Transient and intermittent faults typically occur with a higher frequency than permanent ones and are more difficult to detect, since they may disappear after producing an error. It has been found in practice that over 80% of failures are transient or intermittent. For other classifications (nature, extent of failures), refer to [Laprie] and [Nel90].

According to [Nel90], a fault-tolerant system is achieved through redundancy in (1) hardware, (2) software, (3) information and/or (4) time. Such redundancy can be implemented in static, dynamic or hybrid configurations. A fault-tolerance strategy includes one or more of the following elements:

Masking
is the dynamic correction of generated errors.
Detection
of an error, which is a symptom of a fault.
Containment
is the prevention of error propagation across defined boundaries.
Diagnosis
is the identification of the faulty module responsible for a detected error.
Repair/reconfiguration
is the elimination or replacement of the faulty component, or a means of working around it.
Recovery
is the correction of the system to a state acceptable for continued operation.

To cope with transient faults [BFM$^+$03], temporal redundancy techniques can be used to implement fault detection and masking. For example, consider two kinds of temporal redundancy:

static temporal redundancy
e.g. triple execution with majority voting, which masks any single soft error (a minimal sketch is given after this list). The disadvantage is that it may be too demanding once the hard real-time requirements of automotive systems are taken into account.
dynamic temporal redundancy
e.g. error detection (duplication and comparison) and checkpointing with roll-back. The disadvantage is that it may result in serious performance degradation if the SER is high.
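As a concrete illustration of the static option, the following minimal C sketch executes a task three times and selects the result by majority vote; compute() is a hypothetical task body, not taken from the cited work. The tripled execution time is exactly the overhead that can conflict with hard real-time budgets.

#include <stdio.h>

/* Hypothetical task body that may suffer a transient (soft) error. */
static int compute(int input)
{
    return input * input;               /* placeholder computation */
}

/* Static temporal redundancy: execute the same task three times and
 * take a majority vote, which masks any single soft error.          */
static int compute_triplicated(int input)
{
    int r1 = compute(input);
    int r2 = compute(input);
    int r3 = compute(input);

    if (r1 == r2 || r1 == r3)
        return r1;                      /* r1 agrees with at least one replica */
    return r2;                          /* otherwise r2 and r3 must agree under
                                           a single-fault assumption            */
}

int main(void)
{
    printf("%d\n", compute_triplicated(7));
    return 0;
}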
That is why [BFM$^+$03] proposes a form of spatial redundancy at the chip level to mask the soft errors:
Lock-step dual processor architecture:
behaves like a fail-silent node, providing the capability of detecting any single error.
Loosely-synchronized dual processor architecture:
the OS performs cross-checks and sanity checks to determine which CPU produced the wrong result. The term loosely-coupled means relying primarily on software to ensure reliable operation, while tightly-coupled systems provide a hardware solution [NJD94].
Triple modular redundant (TMR) architecture:
majority voting masks any single CPU fault.
Dual lock-step architecture
the best one: two fail-silent nodes. No sanity check is needed because the nodes (each a lock-step dual processor architecture) have self-checking capability. Software design errors can also be addressed (e.g. by 2-version programming).

A fail-silent node is a self-checking node that either functions correctly or stops functioning after an internal failure is detected [BES$^+$96]. Such a node can be constructed from a number of conventional processors. In a software-implemented (SI) fail-silent node, the non-faulty processors of the node need to execute message order protocols to keep in step and comparison protocols to check each other. A fail-silent node that uses replicated tasks with comparison/voting must incorporate mechanisms to keep the replicas synchronized. Replica synchronization can be done in hardware or in software. There are a few problems with hardware synchronization (see [BES$^+$96]). The software approach, which operates at a higher (application) level, has several advantages over the hardware one:

  1. Technology upgrades rapidly. Since the principles behind the protocols remain the same, the software can be ported relatively easily.
  2. By employing different types of processors within a node, we may gain some fault tolerance against design errors in the processors.
  3. Since replicated tasks do not execute in lock-step, a node is likely to be more robust against transient failures.
In SIFT, fault tolerance is achieved by voting on the output data. Thus, task replicas must be synchronized at the beginning of each iteration (start of frame); a simplified sketch of such frame-based voting is given below. To achieve this, SIFT maintains a global time base and uses static priority based scheduling, which schedules tasks in predefined time frames. The global time base is implemented by keeping the clocks of all correct processors synchronized with a software implementation of a Byzantine-resilient clock synchronization protocol. In terms of performance, however, hardware redundancy management outperforms software: hardware redundancy adds almost no overhead, while software redundancy can consume as much as 80% of the processor throughput. Therefore hybrid solutions have been proposed to circumvent this problem.
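The following simplified C sketch, assuming a hypothetical control_task() run sequentially on one machine, illustrates only the frame-based voting structure: in every frame all replicas are executed and their outputs are voted on before the result is used. The actual SIFT system distributes the replicas over processors and relies on clock synchronization, which is not shown here.

#include <stdio.h>

#define N_REPLICAS 3

/* Hypothetical replicated control task; in SIFT each replica would run
 * on its own processor, here they are simply called in sequence.       */
static int control_task(int replica, int input)
{
    (void)replica;
    return input + 1;                   /* placeholder computation */
}

/* Majority vote over the replica outputs produced in one frame. */
static int vote(const int out[N_REPLICAS])
{
    if (out[0] == out[1] || out[0] == out[2])
        return out[0];
    return out[1];                      /* under a single-fault assumption */
}

int main(void)
{
    int value = 0;

    /* Each iteration models one statically scheduled frame: all replicas
     * are released at the frame start and voted on before the next frame. */
    for (int frame = 0; frame < 5; frame++) {
        int out[N_REPLICAS];
        for (int r = 0; r < N_REPLICAS; r++)
            out[r] = control_task(r, value);

        value = vote(out);
        printf("frame %d: voted output %d\n", frame, value);
    }
    return 0;
}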

Definition 1   A fail-safe system is a system whose failures can only be benign failures.

Definition 2   A fail-silent system is a system whose failures can only be crash failures. A crash failure is a persistent omission failure. Thus, in fail-silence, components either provide correct results or do not provide any result at all.

Definition 3   A fail-stop system is a system whose failures can only be stopping failures. In a stopping failure, system activity, if any, is no longer perceptible to the users, and a constant-value service is delivered. There is a stable storage that allows the state of the system to be recovered (unlike a fail-silent system, whose failure is a crash).

Safety property: something (bad) will not happen during a system execution.
Liveness property: eventually something (good) must happen.
A short survey about safety and liveness properties is given by [Kin94].

A fail-silent node can be in one of three states (figure 1):

  1. Normal state: the node produces correct results.
  2. Failing state: an intermediate state in which the node suffers at most one performance failure (for an explanation of why this state is needed, refer to [BES$^+$96]).
  3. Silent state: No new messages are produced by the node.

A fail-stop node has the following properties [Sch84]:

  1. Halt on failure: it halts immediately when a failure occurs.
  2. Failure status: any processor can detect the halt of another processor.
  3. Stable storage: State is always stored and available after a halt.
In practice, processors are far from providing fail-stop properties.

Redundancy can be categorized as follows (see figure 5):

Hardware redundancy
adding redundancy to hardware.
  1. passive: N-modular redundancy (NMR, e.g. TMR) with a hardware or software voter; the voting can be 1. majority, 2. average, 3. mid-value (good but costly to implement in hardware; a software sketch is given after this list).
  2. active: active replication (see [TPDF98]), stand-by sparing (a spare unit is invoked after a failure occurrence), watchdog timers.
  3. hybrid: NMR with spares. For example, 5 units, 3 in TMR mode and 2 as spares: plain 5MR can tolerate 2 faults, whereas the hybrid configuration tolerates 3 faults that occur sequentially.
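As a software sketch of the mid-value option (the readings and the voter are illustrative, not taken from a particular system), selecting the median of three replicated values tolerates one arbitrarily wrong replica, which is what makes it attractive even though it is costly to build in hardware:

#include <stdio.h>

/* Mid-value selection over three replicated readings: the median is
 * unaffected by a single arbitrarily wrong value.                    */
static int mid_value(int a, int b, int c)
{
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}

int main(void)
{
    /* One replica delivers a wildly wrong reading (250); the mid-value
     * stays close to the two correct replicas.                          */
    printf("%d\n", mid_value(101, 250, 99));    /* prints 101 */
    return 0;
}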
Software redundancy
adding redundancy to software (a detailed explanation is given in [Wil00]).
  1. Single-version software
    Checkpointing (CP) and roll-back
    There exist two kinds of restart recovery: static and dynamic. A static restart is based on returning the module to a predetermined state. This can be a direct return to the initial reset state, or to one of a set of possible states, with the selection being made based on the operational situation at the moment the error was detected. A dynamic restart uses dynamically created checkpoints, which are snapshots of the state at various points during the execution.

    The advantage of these checkpoints is that they are based on states created during operation, and can thus be used to allow forward progress of execution without having to discard all the work done up to the time of error detection.

    Checkpoints can be (1) equidistant: created at fixed intervals or at particular points during the computation determined by some optimizing rule; (2) modular: placed at the end of the sub-modules of the program; (3) random.

    Too frequent checkpointing results in large overhead, while too few checkpoints result in a longer roll-back after a failure and more repeated computation; both may jeopardize real-time requirements.
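A minimal single-process sketch of dynamic restart with equidistant checkpoints is given below; the iterative computation, the injected fault and the checkpoint interval are illustrative assumptions, and a real system would keep the checkpoints in stable storage rather than in a global variable.

#include <setjmp.h>
#include <stdio.h>

/* State of a hypothetical iterative computation. */
struct state { int iteration; long accumulator; };

static struct state current;          /* static storage: safe across longjmp */
static struct state checkpoint;       /* stands in for stable storage        */
static jmp_buf      restart_point;
static int          inject_fault = 1; /* simulate one transient error        */

int main(void)
{
    checkpoint = current;                   /* initial checkpoint            */

    if (setjmp(restart_point) != 0)
        current = checkpoint;               /* roll back to last checkpoint  */

    while (current.iteration < 10) {
        current.accumulator += current.iteration;
        current.iteration++;

        /* Simulated error detection: discard the work done since the last
         * checkpoint and re-execute from there.                             */
        if (current.iteration == 5 && inject_fault) {
            inject_fault = 0;
            longjmp(restart_point, 1);
        }

        if (current.iteration % 3 == 0)     /* equidistant checkpointing     */
            checkpoint = current;
    }
    printf("accumulator = %ld\n", current.accumulator);
    return 0;
}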

  2. Multi-version software
    Recovery blocks
    The Recovery Blocks technique combines the basics of the checkpoint-and-restart approach with multiple versions of a software component, such that a different version is tried after an error is detected (see figure 2). A checkpoint is created before a version executes; it is needed to recover the state after a version fails, so as to provide a valid operational starting point for the next version if an error is detected.
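The structure can be sketched in C as follows; the square-root task, its deliberately failing primary version and the tolerance used in the acceptance test are hypothetical choices made only for illustration (a full implementation would also test the alternate and report a failure if all versions are rejected).

#include <stdio.h>

/* Primary version: pretend it produces a wrong result due to a design fault. */
static double primary_sqrt(double x)
{
    (void)x;
    return -1.0;
}

/* Alternate version: a simple Newton iteration. */
static double alternate_sqrt(double x)
{
    double r = x / 2.0 + 0.5;
    for (int i = 0; i < 20; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Application-dependent acceptance test: the result must behave like a
 * square root of the checkpointed input.                                */
static int acceptable(double x, double r)
{
    double err = r * r - x;
    return r >= 0.0 && err > -1e-6 && err < 1e-6;
}

/* Recovery block: checkpoint the input, try the primary, run the acceptance
 * test, and fall back to the alternate on failure.                          */
static double recovery_block_sqrt(double x)
{
    double saved = x;                   /* checkpoint of the input state   */
    double r = primary_sqrt(x);
    if (acceptable(saved, r))
        return r;
    return alternate_sqrt(saved);       /* roll back and try the alternate */
}

int main(void)
{
    printf("sqrt(2) = %f\n", recovery_block_sqrt(2.0));
    return 0;
}

Note that the acceptance test encodes application knowledge (the result squared must equal the input), in contrast with the generic voter used by N-version programming.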

    N-version programming
    The software equivalent of NMR in hardware redundancy. N-version programming is a multi-version technique in which all the versions are designed to satisfy the same basic requirements, and the decision on output correctness is based on a comparison of all the outputs (see figure 3). The use of a generic decision algorithm (usually a voter) to select the correct output is the fundamental difference between this approach and the Recovery Blocks approach, which requires an application-dependent acceptance test; a small sketch contrasting the two is given after the list below. Since all the versions are built to satisfy the same requirements, the use of N-version programming requires considerable development effort, but the complexity (i.e., development difficulty) is not necessarily much greater than the inherent complexity of building a single version.

Other multi-version techniques include:
N self-checking programming
Consensus recovery blocks
t/(n-1)-variant programming
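To contrast the two decision mechanisms, the following sketch runs three versions of the same specification and selects the output with a generic majority voter that knows nothing about the application. The three versions here are trivially diverse stand-ins invented for the example; real N-version programming would use independently developed versions.

#include <stdio.h>
#include <stdlib.h>

/* Three "versions" of the same specification (absolute value), standing in
 * for implementations produced independently by separate teams.            */
static int version_a(int x) { return x < 0 ? -x : x; }
static int version_b(int x) { return abs(x); }
static int version_c(int x) { return x * ((x > 0) - (x < 0)); }

/* Generic decision algorithm: a majority voter, independent of the
 * application, unlike a recovery block's acceptance test.           */
static int vote3(int a, int b, int c)
{
    if (a == b || a == c)
        return a;
    return b;
}

int main(void)
{
    int x = -42;
    printf("%d\n", vote3(version_a(x), version_b(x), version_c(x)));
    return 0;
}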

Information redundancy
adding redundancy to information/data: error detection (parity bit, checksum) and error correction (triplication, Hamming code).
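As small illustrations of information redundancy (the message bytes and the particular checksum are arbitrary choices for the example), a parity bit detects any single bit flip in a byte, and an additive checksum detects many, though not all, errors in a message:

#include <stdint.h>
#include <stdio.h>

/* Even parity over one byte: a single bit flip changes the parity. */
static unsigned parity(uint8_t b)
{
    unsigned p = 0;
    for (int i = 0; i < 8; i++)
        p ^= (b >> i) & 1u;
    return p;                           /* 1 if an odd number of bits are set */
}

/* Simple additive checksum: chosen so that the message bytes plus the
 * checksum sum to zero modulo 256 at the receiver.                     */
static uint8_t checksum(const uint8_t *msg, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + msg[i]);
    return (uint8_t)(0u - sum);
}

int main(void)
{
    uint8_t msg[] = { 0x12, 0x34, 0x56 };
    printf("parity(0x12) = %u, checksum = 0x%02X\n",
           parity(0x12), checksum(msg, sizeof msg));
    return 0;
}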
Time redundancy
allowing extra time for the re-execution of a task after it fails, or double execution of a task and comparison of the outputs [TPDF98]; re-transmission of information is another example. For example:
output1 = exec();           /* first execution                 */
output2 = exec();           /* re-execution (time redundancy)  */
compare(output1, output2);  /* a mismatch indicates a fault    */

[Gar99] gives a formal view of fault tolerance. Figure 4 illustrates that a distributed program A is said to tolerate faults from a fault class F for an invariant P if there exists a predicate T satisfying the conditions shown there; such a program A is called F-tolerant for P (for details refer to [Gar99]). The paper is also a good guide to fault-tolerant distributed computing (in asynchronous message-passing systems). For state machine approaches to the implementation of fault tolerance, see [Sch90] and [RPL$^+$99].

Figure 1: Fail-silent node states
\begin{figure}\center
\includegraphics[width=2in]{figures/fail-silent-node-states.ps}
\end{figure}

Figure 2: Recovery block model
\begin{figure}\center
\includegraphics[width=4in]{figures/recovery-block-model.ps}
\end{figure}

Figure 3: N-version programming model
\begin{figure}\center
\includegraphics[width=4in]{figures/N-version-programming.ps}
\end{figure}

Figure 4: Schematic overview of Classes
\begin{figure}\center
\includegraphics[width=2.9in]{figures/schematic-F-class.ps}
\end{figure}

Figure 5: The chart of basic FT techniques
\begin{figure}\center
\includegraphics[width=4in]{figures/basic-FT-techniques.ps}
\end{figure}

One of the main advantages of off-the-shelf solutions is that the application does not need to be aware of the fault-tolerance mechanisms that are transparently provided by the architecture to cover execution-platform faults.

Kulkarni claims that a fault-tolerant system consists of a fault-intolerant system and a set of fault-tolerance components [AK98] [Kul99]. For this purpose, detectors and correctors are added to the fault-intolerant system, and fault tolerance is established through formal definitions. A detector is a system component that detects whether some state predicate is true in the current system state. Well-known examples of detectors include comparators, error detection codes, consistency checkers, watchdog programs, snoopers, alarms, snapshot procedures, acceptance tests, and exception conditions. A corrector is a system component that detects whether some state predicate is true in the current system state and corrects the system state in order to make that predicate true whenever it is false. Well-known examples of correctors include voters, error correction codes, reset procedures, rollback recovery, rollforward recovery, constraint (re)satisfaction, exception handlers and alternate procedures in recovery blocks. A small illustrative sketch of a detector/corrector pair is given below.
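In the sketch, the state predicate, the tank-level state and the clamping corrector are invented for this example and are not taken from [AK98] or [Kul99]; a detector simply evaluates the predicate, while a corrector additionally restores it.

#include <stdio.h>

/* A hypothetical controller state with the predicate "0 <= level <= 100". */
struct system_state { int level; };

/* Detector: decides whether the state predicate holds in the current state. */
static int predicate_holds(const struct system_state *s)
{
    return s->level >= 0 && s->level <= 100;
}

/* Corrector: detects a violation and restores the predicate (here by
 * clamping the value, a stand-in for e.g. a reset or rollback procedure). */
static void correct(struct system_state *s)
{
    if (!predicate_holds(s))
        s->level = s->level < 0 ? 0 : 100;
}

int main(void)
{
    struct system_state s = { 250 };          /* state corrupted by a fault */
    if (!predicate_holds(&s)) {               /* detector fires             */
        printf("violation detected (level=%d)\n", s.level);
        correct(&s);                          /* corrector restores the predicate */
    }
    printf("level after correction: %d\n", s.level);
    return 0;
}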


Tolga Ayav 2004-10-25