Consensus and Atomic Broadcast have been extensively studied by both theoretical and experimental researchers for over a decade. Informally, a distributed system is asynchronous if there is no bound on message delay, clock drift, or the time necessary to execute a step. Thus, to say that a system is asynchronous is to make no timing assumptions whatsoever. This model is attractive and has recently gained much currency for several reasons: it has simple semantics; applications programmed on the basis of this model are easier to port than those incorporating specific timing assumptions; and, in practice, variable or unexpected workloads are sources of asynchrony, so synchrony assumptions are, at best, probabilistic. Although the asynchronous model of computation is attractive for the reasons outlined above, it is well known that Consensus and Atomic Broadcast cannot be solved deterministically in an asynchronous system that is subject to even a single crash failure [Fischer et al. 1985; Dolev et al. 1987]. Essentially, the impossibility results for Consensus and Atomic Broadcast stem from the inherent difficulty of determining whether a process has actually crashed or is only very slow.
The overhead of CP consists of two components:

1- Context-saving overhead: the time taken to save the global context of a computation is defined as the context-saving overhead. This overhead is proportional to the size of the context. If stable storage is not available at every node in a multiprocessor system, the context is transferred over the network, and the network transmission delay is then also included in the overhead.

2- Coordination overhead: in a parallel/distributed system, coordination among processes is needed to obtain a consistent global state. Special messages and information piggy-backed on regular messages are used to achieve this coordination; the coordination overhead is due to these special messages and piggy-backed information.
Moreover, the frequency of CP, the contents of each CP, and the CP method are important factors in this overhead. Another class of CP optimizations, comprising CP compression, main-memory/copy-on-write CP, and diskless CP, is explained in [BPK94]; these techniques increase CP performance.
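To make the two overhead components concrete, the following is a back-of-the-envelope cost model, a minimal sketch in Python; all parameter names and the example values are illustrative assumptions, not figures from the cited works.

    def context_saving_overhead(context_bytes, storage_bw, network_bw=None):
        # Time to save the global context; proportional to context size.
        # If stable storage is not local, network transfer time is added.
        t = context_bytes / storage_bw
        if network_bw is not None:
            t += context_bytes / network_bw
        return t

    def coordination_overhead(n_processes, msg_latency, msgs_per_process=2):
        # Cost of the special coordination messages; information
        # piggy-backed on regular messages is treated as free here.
        return n_processes * msgs_per_process * msg_latency

    # Example: 64 MB context, 100 MB/s stable storage reached over a
    # 1 Gb/s link, 8 processes exchanging 2 messages of 1 ms latency each.
    ctx = context_saving_overhead(64e6, 100e6, network_bw=125e6)
    coord = coordination_overhead(8, 1e-3)
    print(f"context saving: {ctx:.2f} s, coordination: {coord:.4f} s")

Under these assumed numbers, context saving dominates, which is why the frequency and contents of each CP matter so much in practice.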
In message-based distributed systems (the dynamic CP approach), CP algorithms fall into two categories: synchronous and asynchronous. In synchronous checkpointing schemes, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system. The storage requirement for the checkpoints is minimal, because each process needs to keep at most two checkpoints (one committed and one possibly uncommitted) in stable storage at any given time. The major disadvantages of synchronous checkpointing are that 1) process execution may have to be suspended during the checkpointing coordination, resulting in performance degradation, and 2) it requires extra message overhead to synchronize the checkpointing activity. [KT87] proposes a CP and rollback-recovery algorithm for distributed systems (but for transient failures!). It also explains in detail how to take globally consistent CPs, considering the domino-effect and livelock problems. [LWK03] presents a time-based coordinated CP protocol for mobile networks, which minimizes the size and number of messages because of the limited bandwidth (as in embedded systems). [CL85] presents an algorithm for determining the global state of a distributed system.
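As an illustration of how such coordination can be achieved, the following is a minimal sketch of marker-based global-state recording in the spirit of [CL85]; the Process class, the send callback, and FIFO delivery per channel are assumptions of the sketch, not an implementation from the paper.

    MARKER = object()  # distinguished control message

    class Process:
        def __init__(self, pid, in_channels, send):
            self.pid = pid
            self.send = send                  # send(channel, msg): assumed transport
            self.in_channels = set(in_channels)
            self.recorded_state = None
            self.recording = {}               # channel -> in-transit messages

        def take_snapshot(self, local_state, out_channels):
            # Rule 1: record own state, send a marker on every outgoing
            # channel, and start recording all incoming channels.
            self.recorded_state = local_state
            for ch in out_channels:
                self.send(ch, MARKER)
            self.recording = {ch: [] for ch in self.in_channels}

        def on_receive(self, channel, msg, local_state, out_channels):
            # Rule 2: on the first marker, snapshot (if not yet done) and
            # finalize that channel's state; other messages arriving on
            # channels still being recorded are logged as in-transit.
            if msg is MARKER:
                if self.recorded_state is None:
                    self.take_snapshot(local_state, out_channels)
                self.recording.pop(channel, None)  # channel state now fixed
            elif channel in self.recording:
                self.recording[channel].append(msg)

The recorded process states plus the logged channel contents together form the consistent global state; the markers are exactly the "special messages" responsible for the coordination overhead discussed above.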
In asynchronous checkpointing, processes take local checkpoints periodically without any coordination with each other. This approach allows maximum process autonomy in taking checkpoints and has no message overhead for local checkpointing. A process determines consistent global checkpoints ([CLYL04] gives a definition of a consistent global CP) by communicating with other processes to determine the dependencies among local checkpoints. In asynchronous checkpointing, it can very well happen that processes take checkpoints such that none of the checkpoints lies on a consistent global checkpoint. A local checkpoint that cannot be part of any consistent global checkpoint is said to be useless; a local checkpoint that can be part of a consistent global checkpoint is called a useful checkpoint. In general, a number of the checkpoints taken will be useless. A recent approach is to coordinate the CPs partially, which is called quasi-synchronous CP [MS99].
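The following is a hedged sketch (not the algorithm of [CLYL04]) of how such dependency information can be used to test whether a candidate set of local checkpoints forms a consistent global checkpoint: the set is inconsistent if some message was received before the receiver's checkpoint but sent after the sender's checkpoint (an orphan message). Events are modelled by per-process event counters; all names and data structures are illustrative assumptions.

    def is_consistent(checkpoint, messages):
        # checkpoint: {pid: last local event included in its checkpoint}
        # messages: iterable of (sender, send_event, receiver, recv_event)
        for snd, snd_ev, rcv, rcv_ev in messages:
            sent_before = snd_ev <= checkpoint[snd]
            recv_before = rcv_ev <= checkpoint[rcv]
            if recv_before and not sent_before:
                return False      # orphan message -> inconsistent cut
        return True

    # Example: P0 sends a message at its 3rd event; P1 receives it at its 1st.
    msgs = [("P0", 3, "P1", 1)]
    print(is_consistent({"P0": 3, "P1": 2}, msgs))  # True: send is included
    print(is_consistent({"P0": 2, "P1": 2}, msgs))  # False: orphan message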
According to Figure 6 in [KR00], besides dynamic CP algorithms there are static approaches such as compiler-based methods. Compiler-based CP uses static program analysis to optimize the performance of CP, automated by the compiler or a pre-processor; a method and an application are presented in [BPK94]. [ZB97] proposes an on-line CP algorithm based on optimizing system performance. The authors state that they plan to extend their work to a system that collects CP cost information from previous executions of the program and optimizes CP placement off-line, whose performance could approximate that of the off-line optimal algorithm.
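As a classical point of comparison for such performance-driven CP placement (not the method of [ZB97]), Young's first-order approximation chooses the checkpoint interval that balances checkpoint cost against the expected recomputation after a failure, T_opt ≈ sqrt(2*C*M), where C is the cost of one checkpoint and M is the mean time between failures. A minimal sketch:

    import math

    def young_interval(checkpoint_cost, mtbf):
        # Young's approximation: T_opt = sqrt(2 * C * MTBF)
        return math.sqrt(2 * checkpoint_cost * mtbf)

    # Example (assumed values): 30 s per checkpoint, one failure per 24 h.
    print(f"{young_interval(30.0, 24 * 3600):.0f} s between checkpoints")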
[RPM97] proposes an approach that extends some real-time system calls so that a recovery point is saved when the user invokes them. If checkpointing is done frequently, performance decreases because a great amount of temporary data is stored. The approach tries to reduce this performance loss in two ways: checkpoints are saved only at the end of a control cycle (reducing the number of checkpoints), and only system calls that perform writes are extended (reducing the number of affected system calls); the checkpoint is taken at the end of these services.
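A hypothetical sketch of this idea follows: only write-performing services are wrapped, and a recovery point is saved when such a service completes at the end of a control cycle. The names save_recovery_point and checkpointed_write and the end_of_cycle flag are illustrative assumptions, not the interface of [RPM97].

    import functools

    def save_recovery_point(state):
        # Stand-in for writing the recovery point to stable storage.
        print(f"recovery point saved: {state}")

    def checkpointed_write(write_fn):
        # Wrap only write-type services; read-only calls stay untouched.
        @functools.wraps(write_fn)
        def wrapper(state, *args, end_of_cycle=False, **kwargs):
            result = write_fn(state, *args, **kwargs)
            if end_of_cycle:              # checkpoint once per control cycle
                save_recovery_point(state)
            return result
        return wrapper

    @checkpointed_write
    def write_actuator(state, name, value):
        state[name] = value

    state = {}
    write_actuator(state, "valve", 0.4)                    # no checkpoint
    write_actuator(state, "pump", 1.0, end_of_cycle=True)  # checkpoint here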