Next: Failure detection
Up: Towards Fault-Tolerance Aspects for
Previous: Real-Time distributed embedded systems
Fault-tolerant systems are thoroughly discussed in [Cri91], [Nel90] and [Jal94].
A fault can be classified by its duration, nature and extent.
Its duration can be transient, intermittent or permanent [Laprie] [Nel90].
A transient fault is often the result of an external disturbance (noise, radiation): it upsets
data, exists for a finite time and does not recur. Such a data upset is called a soft error,
and the rate at which these events occur is called the soft error rate (SER) [BFM$^+$03].
Soft errors are already of great concern in devices built in current technologies.
A system with an intermittent fault oscillates between faulty and fault-free operation,
usually as a result of unstable device behavior. Permanent (hard) faults are
device conditions that do not correct with time (component failures, physical damage or
design errors). Transient and intermittent faults typically occur more frequently than
permanent ones and are harder to detect, since they may disappear after producing an error.
It has been found in practice that over 80% of failures are transient or intermittent.
For the other classifications (by nature and extent), refer to [Laprie] and [Nel90].
According to [Nel90], a fault-tolerant system is achieved through redundancy in
(1) hardware, (2) software, (3) information and/or (4) time.
Such redundancy can be implemented in static, dynamic or hybrid
configurations. A fault-tolerance strategy includes one or more of the following elements:
- Masking
- is dynamic correction of generated errors.
- Detection
- of an error, which is a symptom of a fault.
- Containment
- is prevention of the propagation of an error.
- Diagnosis
- is identification of the faulty component responsible for a detected error.
- Repair/reconfiguration
- is elimination or replacement of the faulty component, or reconfiguration of the system around it.
- Recovery
- is correction of the system to a state acceptable for continued operation.
To cope with transient faults [BFM$^+$03], temporal redundancy techniques can be used
to implement fault detection and masking. Two kinds of temporal redundancy can be considered:
- static temporal redundancy
- e.g. triple execution with majority voting, which masks any single soft error.
The disadvantage is that it may be too demanding given the hard real-time
requirements of automotive systems.
- dynamic temporal redundancy
- e.g. error detection (duplication and comparison) and
checkpointing with roll-back. The disadvantage is that it may result in serious performance
degradation if the SER is high.
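The static variant (triple execution with majority voting) can be sketched in a few lines of Python; the `task` function and the `inject_fault_in` hook are purely illustrative stand-ins for a real computation and a transient disturbance, not part of [BFM$^+$03]:

```python
def majority(a, b, c):
    """Majority vote: return the value produced by at least two executions."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one execution failed")

def triple_execute(task, inject_fault_in=None):
    """Static temporal redundancy: run the same task three times in sequence.
    inject_fault_in simulates a soft error upsetting one of the executions."""
    results = []
    for i in range(3):
        r = task()
        if i == inject_fault_in:   # simulated transient fault: flip some bits
            r = r ^ 0xFF
        results.append(r)
    return majority(*results)

def task():                        # illustrative computation
    return 42

# A single soft error in any one execution is masked by the vote.
assert triple_execute(task, inject_fault_in=1) == 42
```

A single upset execution is outvoted by the other two; two faulty executions would defeat the vote, matching the single-fault coverage of TMR.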
That is why [BFM$^+$03] proposes a form of spatial redundancy at the chip level to mask
the soft errors:
- Lock-step dual processor architecture:
- behaves like a fail-silent node and provides
the capability of detecting any single error.
- Loosely-synchronized dual processor architecture:
- the OS performs cross-checks and sanity checks to
determine which CPU is producing a wrong result. The term loosely-coupled means relying primarily
on software to ensure reliable operation, while tightly-coupled systems provide
a hardware solution [NJD94].
- Triple modular redundant (TMR) architecture:
- majority voting masks any single CPU fault.
- Dual lock-step architecture:
- the strongest of the four: two fail-silent nodes. No sanity check is needed, because each
node (a lock-step dual processor architecture) has self-checking capability. Software design
errors can also be avoided (e.g. by 2-version programming).
A fail-silent node is a self-checking node that either functions correctly or stops
functioning after an internal failure is detected [BES$^+$96]. Such a node can be
constructed from a number of conventional processors. In a software-implemented (SI)
fail-silent node, the non-faulty processors of the node execute message-order protocols
to keep in step and comparison protocols to check each other. A fail-silent node that uses
replicated tasks with comparison/voting must incorporate mechanisms to keep the replicas
synchronized. Replica synchronization can be done in hardware or in software; hardware
synchronization has a few problems (see [BES$^+$96]).
The software approach, performed at a higher (application) level, has several advantages over the hardware one:
- Technology upgrades rapidly. Since the principles behind the protocols remain the same,
the software can be ported relatively easily.
- By employing different types of processors within a node, some fault tolerance against
design errors in the processors can be obtained.
- Since replicated tasks do not execute in lock-step, a node is likely to be more robust
against transient failures.
In SIFT, fault tolerance is achieved by voting on the output data. Thus, task replicas must be
synchronized at the beginning of each iteration (start of frame). To do this, SIFT maintains a
global time base and uses static priority-based scheduling, which schedules tasks in predefined
time frames. The global time base is implemented by keeping the clocks of all correct processors
synchronized by a software implementation of a Byzantine-resilient clock synchronization protocol.
In terms of performance, however, hardware redundancy management outperforms software:
hardware redundancy has almost no overhead, while software redundancy can consume as much
as 80% of the processor throughput. Hybrid solutions have therefore been proposed to
circumvent this problem.
Definition 1
A fail-safe system is a system whose failures can only be benign failures.
Definition 2
A fail-silent system is a system whose failures can only be crash failures. A crash failure is
a persistent omission failure. Thus, in fail-silence, components either provide correct
results or do not provide any result at all.
Definition 3
A fail-stop system is a system whose failures can only be stopping failures. In a stopping failure,
the system activity, if any, is no longer perceptible to the users, and a constant-value
service is delivered. A stable memory allows the state of the system to be recovered
(unlike a fail-silent system, whose failure is a crash).
Safety property: something (bad) will not happen during a system execution.
Liveness property: eventually something (good) must happen.
A short survey about safety and liveness properties is given by [Kin94].
A fail-silent node can be in one of three states (figure 1):
- Normal state: the node produces correct results.
- Failing state: an intermediate state in which the node suffers at most one
performance failure. For an explanation of why this state is needed, refer to [BES$^+$96].
- Silent state: No new messages are produced by the node.
A fail-stop node has the following properties [Sch84]:
- Halt on failure: the node halts immediately when a failure occurs.
- Failure status: any processor can detect that another processor has halted.
- Stable storage: the state is always stored and remains available after a halt.
In practice, processors are far from providing fail-stop properties.
Redundancy can be categorized as follows (see figure 5):
- Hardware redundancy
- adding redundancy in hardware.
- passive: NMR (e.g. TMR) with a hardware or software voter (voting can be
1. majority, 2. average, 3. mid-value, which is good but costly to implement in hardware).
- active: active replication (see [TPDF98]), stand-by sparing (a spare unit is invoked after
a failure occurs), watchdog timers.
- hybrid: NMR with spares. For example, with 5 units, 3 in TMR mode and 2 spares:
plain 5MR can tolerate 2 faults, whereas the hybrid configuration tolerates 3 faults
that occur sequentially.
- Software redundancy
- adding redundancy in software (a detailed explanation is given in [Wil00]).
- Single-version software
Checkpointing (CP) and roll-back
There exist two kinds of restart recovery: static and dynamic.
A static restart is based on returning the module to a predetermined state.
This can be a direct return to the initial reset state, or to one of a set of possible states,
with the selection being made based on the operational situation at the moment
the error detection occurred. Dynamic restart uses dynamically created checkpoints
that are snapshots of the state at various points during the execution.
The advantage of these checkpoints is that they are based on states created during operation,
and can thus be used to allow forward progress of execution without having to discard all
the work done up to the time of error detection.
CP can be (1) equidistant: checkpoints created at fixed intervals, or at particular points
during the computation determined by some optimizing rule; (2) modular: checkpoints at the
end of the sub-modules of the program; (3) random.
Too frequent checkpointing results in a large overhead, while too few checkpoints increase
the cost of rolling back and repeating code after a failure; both may jeopardize
real-time requirements.
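This trade-off can be made concrete with a small Python sketch of equidistant checkpointing with roll-back; the state layout, the `interval` parameter and the `fail_at` fault-injection hook are illustrative assumptions, not taken from any cited paper:

```python
import copy

class Checkpointed:
    """Holds a mutable state plus the last checkpoint taken of it."""
    def __init__(self, state):
        self.state = state
        self.checkpoint = copy.deepcopy(state)

    def save(self):        # equidistant checkpoint, taken at fixed intervals
        self.checkpoint = copy.deepcopy(self.state)

    def rollback(self):    # restore the last known-consistent state
        self.state = copy.deepcopy(self.checkpoint)

def run(steps, interval, fail_at=None):
    """Accumulate sum(range(steps)); a transient fault at step fail_at is
    masked by rolling back and re-doing the work since the last checkpoint."""
    m = Checkpointed({"i": 0, "sum": 0})
    step = 0
    while m.state["i"] < steps:
        if step % interval == 0:
            m.save()
        if step == fail_at:            # simulated transient fault detected
            fail_at = None             # transient: not re-injected on redo
            m.rollback()               # discard work since last checkpoint
            step += 1
            continue
        m.state["sum"] += m.state["i"]
        m.state["i"] += 1
        step += 1
    return m.state["sum"]

assert run(10, 3) == sum(range(10))
assert run(10, 3, fail_at=5) == sum(range(10))   # fault masked by roll-back
```

The fault costs only the work done since the last checkpoint; shrinking `interval` reduces that re-execution cost but increases the `save` overhead paid on every fault-free run.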
- N-version software
Recovery blocks
The Recovery Blocks technique combines the basics of the checkpoint and restart approach
with multiple versions of a software component such that a different version is tried
after an error is detected (see figure 2).
Checkpoints are created before a version executes; they are needed to recover the state
after a version fails, so that a valid operational starting point is available for the
next version if an error is detected.
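The scheme can be sketched in a few lines of Python; the two versions and the acceptance test below are invented placeholders for illustration (the primary deliberately contains a simulated design fault):

```python
def recovery_block(state, versions, acceptance_test):
    """Recovery Blocks: checkpoint the state, try each version in turn,
    rolling back to the checkpoint before trying the next version."""
    checkpoint = dict(state)            # checkpoint before any version runs
    for version in versions:
        state.clear()
        state.update(checkpoint)        # restore checkpointed state
        version(state)
        if acceptance_test(state):      # application-dependent test
            return state
    raise RuntimeError("all versions failed the acceptance test")

def primary(s):                         # faulty primary version
    s["y"] = s["x"] * -1

def alternate(s):                       # correct alternate version
    s["y"] = s["x"] * 2

accept = lambda s: s["y"] >= 0          # illustrative acceptance test

result = recovery_block({"x": 21}, [primary, alternate], accept)
assert result["y"] == 42                # primary rejected, alternate accepted
```

Note that, unlike N-version voting, correctness hinges entirely on how discriminating the acceptance test is: a weak test accepts wrong outputs, an over-strict one rejects good ones.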
N-version programming
The software equivalent of NMR in hardware redundancy.
N-Version programming is a multi-version technique in which all
the versions are designed to satisfy the same basic requirements and the decision
of output correctness is based on the comparison of all the outputs
(see figure 3).
The use of a generic decision algorithm (usually a voter) to select the correct output
is the fundamental difference of this approach from the Recovery Blocks approach,
which requires an application dependent acceptance test. Since all the versions are
built to satisfy the same requirements, the use of N-version programming requires
considerable development effort but the complexity (i.e., development difficulty)
is not necessarily much greater than the inherent complexity of building a single version.
N Self-Checking Programming
Consensus Recovery Blocks
t/(n-1)-Variant Programming
- Information redundancy
- Adding redundancy to information/data.
Error detection (parity bit, checksum) and error correction (triplication, Hamming code).
- Time redundancy
- allowing extra time to re-execute a task after it fails, or executing a task twice
and comparing the outputs [TPDF98]. Re-transmission of information is another example.
For example:
exec(); save output1;
exec(); save output2;
compare output1 output2;
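The double-execution-and-compare pseudocode above can be made runnable as a Python sketch; the `faults` list is an illustrative stand-in for a transient disturbance upsetting one execution:

```python
def double_execute(task, retries=1):
    """Dynamic time redundancy: execute twice and compare; on mismatch,
    re-execute. Detects a single transient fault (detection, not masking)."""
    for _ in range(retries + 1):
        out1 = task()
        out2 = task()
        if out1 == out2:
            return out1                # outputs agree: accept the result
    raise RuntimeError("persistent mismatch: possible permanent fault")

faults = [3]                           # inject one transient fault
def task():
    # the first call is upset by the injected fault, later calls are clean
    return 7 + (faults.pop() if faults else 0)

assert double_execute(task) == 7       # mismatch detected, retry succeeds
```

A transient fault makes the two outputs disagree, so the pair is re-executed; a mismatch that persists across the retry budget is reported as a possible permanent fault, which is exactly the detection/roll-back cost that grows with the SER.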
[Gar99] gives a formal view of fault tolerance. Figure 4
shows that a distributed program A is said to tolerate faults from a fault class F
for an invariant P if there exists a predicate T satisfying certain conditions;
such a program A is called F-tolerant for P
(for details refer to [Gar99]). The paper is also good guidance
on fault-tolerant distributed computing (in asynchronous message-passing
systems).
For state machine approaches to implementation of fault-tolerance, see
[Sch90] and [RPL$^+$99].
Figure 1: Fail-silent node states
Figure 2: Recovery block model
Figure 3: N-version programming model
Figure 4: Schematic overview of classes
Figure 5: The chart of basic FT techniques
One of the main advantages of off-the-shelf solutions is that the application does not
need to be aware of the fault-tolerance mechanisms that are transparently provided by
the architecture to cover execution-platform faults.
Kulkarni claims that a fault-tolerant system consists of a fault-intolerant
system and a set of fault-tolerance components [AK98] [Kul99]. To this end, they add
detectors and correctors to the fault-intolerant system and achieve fault-tolerance
through formal definitions.
A detector is a system component that detects whether some state predicate
is true at the system state. Well-known examples of detectors include comparators,
error detection codes, consistency checkers, watchdog programs, snoopers, alarms,
snapshot procedures, acceptance tests, and exception conditions.
A corrector is a system component that detects whether some state
predicate is true at the system state and that corrects the system state
in order to truthify that state predicate whenever it is false.
Well-known examples of correctors include voters, error correction codes,
reset procedures, rollback recovery, rollforward recovery, constraint (re)satisfaction,
exception handlers and alternate procedures in recovery blocks.
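This detector/corrector pair can be illustrated with a toy Python example; the bounds predicate and the clamping action are invented for illustration and are not taken from [AK98] or [Kul99]:

```python
def detector(state):
    """Detector: decides whether the state predicate
    '0 <= value <= 100' holds at the current system state."""
    return 0 <= state["value"] <= 100

def corrector(state):
    """Corrector: detects a violation of the predicate and corrects the
    state to truthify it (clamping plays the role of a reset procedure)."""
    if not detector(state):
        state["value"] = min(max(state["value"], 0), 100)
    return state

s = {"value": 250}        # a fault has corrupted the state
assert not detector(s)    # detector reports the predicate is false
corrector(s)
assert detector(s)        # corrector has truthified the predicate
```

The detector alone only reports the violation; the corrector both detects and repairs, which is why a corrector subsumes a detector in Kulkarni's framework.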
Tolga Ayav
2004-10-25