
Aspects that can be implemented

  1. To provide information redundancy in communication, we can implement the following aspects. We consider these two properties: 1) a message sent from node $ i$ is received correctly by node $ j$; 2) messages sent from $ i$ are delivered to $ j$ in the order in which $ i$ sent them.
    1. CRC (to detect an erroneous message and discard it, without correcting it!).

      For example, assume we have this code:

        char data[]; // data to send
        int n;       // number of bytes
        ...
        send(data, n);
        ...
        receive(data, n);
        ...

      After weaving, it looks like:

        ...
        data[n] := CRC(data, n);
        n := n + 1;
        send(data, n);
        ...
        receive(data, n);
        if(data[n-1] $ \neq$ CRC(data, n-1)) ERROR();
        ...
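A minimal C sketch of this append-and-check scheme (the CRC-8 polynomial 0x07 and the function names are illustrative assumptions, not part of the original design):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative CRC-8 (polynomial 0x07), computed bitwise. */
uint8_t crc8(const uint8_t *data, size_t n) {
    uint8_t crc = 0;
    for (size_t i = 0; i < n; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}

/* Sender-side weaving: append the CRC byte (the buffer must have
   room for one extra byte). Returns the new length n+1. */
size_t append_crc(uint8_t *data, size_t n) {
    data[n] = crc8(data, n);
    return n + 1;
}

/* Receiver-side weaving: 1 if the trailing CRC byte matches
   (message accepted), 0 if the message must be discarded. */
int check_crc(const uint8_t *data, size_t n) {
    return crc8(data, n - 1) == data[n - 1];
}
```

A corrupted message is simply discarded, matching the detect-and-discard (no correction) policy above.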

    2. Triplication of the message sent (this also provides correction and satisfies property 1; it is a form of time redundancy as well). For example, after weaving:
      ...
      send(data, n);
      send(data, n);
      send(data, n);
      ...
      char mdata[3][];
      receive(mdata[0], n);
      receive(mdata[1], n);
      receive(mdata[2], n);
      data := majority(mdata, 3, n);
      ...
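A possible C sketch of the receiver-side majority function (bitwise 2-of-3 voting per byte; the signature is an assumption, the original only names majority(mdata, 3, n)):

```c
#include <stddef.h>

/* Bitwise 2-of-3 majority over three received copies of a message:
   each output bit is set iff it is set in at least two copies. */
void majority3(const unsigned char *a, const unsigned char *b,
               const unsigned char *c, unsigned char *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)((a[i] & b[i]) | (a[i] & c[i]) | (b[i] & c[i]));
}
```

This corrects any corruption confined to a single one of the three copies, which is how the triplication provides correction and not only detection.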
    3. Sliding Window Protocol (satisfies property 2).
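The in-order delivery of property 2 can be sketched with the receiver side of the simplest window (size 1, i.e. stop-and-wait; the names below are illustrative, not from the original):

```c
/* Receiver state for in-order delivery: the next expected sequence
   number. A full sliding-window protocol generalizes this to a
   window of outstanding sequence numbers. */
typedef struct { unsigned expected; } rx_state;

/* Returns 1 if the frame with sequence number seq is the next one
   in order and should be delivered; duplicates and out-of-order
   frames are rejected (the sender would retransmit them). */
int rx_accept(rx_state *s, unsigned seq) {
    if (seq == s->expected) { s->expected++; return 1; }
    return 0;
}
```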

  2. To cope with transient failures, the following aspect can be implemented.
    1. Triplicate some part of the code on the same processor and apply majority voting on the outputs, i.e.
      s1: exec(); (execution duration is $ c$)
      s2: save output[1]; (execution duration is $ c_s$)
      s3: exec();
      s4: save output[2];
      s5: exec();
      s6: save output[3];
      s7: output:= majority(output[i]);

      Let $ \nu$ be the maximum duration of a failure and $ \tau$ the interval between two consecutive failures. For this scheme to work, we make this assumption: $ \nu \leq c_s$ and $ \tau > 3c$. If the transient failures are short but occur very frequently, then the code part to be triplicated should be reduced so that $ \tau > 3c$ is satisfied, which increases the number of ``saving output'' phases and consequently the overhead.

      Also note that this provides immediate recovery from data flow errors, not from control flow errors (for the detection of control flow errors, signature monitoring can be used).
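The scheme s1..s7 above could be woven roughly as follows in C (an exec() returning a scalar int and the function names are simplifying assumptions):

```c
/* 2-of-3 majority on scalar outputs: with at most one faulty run,
   at least two of the three values agree. */
int majority_of_3(int a, int b, int c) {
    if (a == b || a == c) return a;
    return b;  /* otherwise the majority (if any) is b == c */
}

/* Time-redundant execution: run exec() three times (s1, s3, s5),
   save each output (s2, s4, s6), then vote (s7). */
int run_triplicated(int (*exec)(void)) {
    int output[3];
    for (int i = 0; i < 3; i++)
        output[i] = exec();
    return majority_of_3(output[0], output[1], output[2]);
}
```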

  3. The following aspects seem more mandatory.

    1. Clock synchronization. It may be needed for various fault-tolerant services. We require that the clocks of any two processors be approximately equal at any time, i.e. $ \vert C_i(t)-C_j(t)\vert \leq \beta$.

      An approach to clock synchronization (assuming that all clocks run at the same speed): at the beginning of each cycle (or perhaps more frequently?), the monitor processor broadcasts its clock value ($ C_{mon}$) to all processors. The travel time of this message is $ \delta$, so it is delivered within $ [\delta-\epsilon, \delta+\epsilon]$.

      Each processor makes this adjustment: $ C_{i\in\{1,\ldots,n\}}(t)=C_{mon}+\delta$, since the broadcast value has aged by $ \delta$ in transit. Moreover, in heartbeat messages all processors can send their clock values to the monitor as feedback, so that the monitor can apply a control scheme to adjust the clocks more precisely: it sends $ C_{mon}+CORR_1$ to processor 1, $ C_{mon}+CORR_2$ to processor 2, and so on.
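The per-processor adjustment could look like this in C (integer clock ticks and a correction term corr_i defaulting to 0 are illustrative assumptions):

```c
/* On receiving the monitor's broadcast, processor i resets its clock:
   the broadcast value c_mon has aged by roughly delta in transit, and
   corr_i is the optional per-processor correction the monitor computed
   from heartbeat feedback (0 if unused). */
long long adjust_clock(long long c_mon, long long delta, long long corr_i) {
    return c_mon + corr_i + delta;
}
```

With this adjustment the residual clock error is bounded by the delivery jitter $ \epsilon$, not by $ \delta$ itself.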

    2. Failure detection using heartbeating. Each processor sends heartbeats to the monitor processor with a period in the range $ [\pi-\epsilon_p, \pi+\epsilon_p]$. The travel time of a heartbeat is in $ [\delta-\epsilon_t, \delta+\epsilon_t]$. The monitor runs a watchdog timer ($ wdt_i$) for each processor; if $ wdt_i > \pi+\epsilon_p+\epsilon_t$, then processor $ i$ is considered failed. Note that the monitor can precisely define $ \epsilon_p+\epsilon_t$ at run-time. Therefore, the maximum time needed to detect a failure is $ \pi+\Sigma$, where $ \Sigma$ is $ \epsilon_p+\epsilon_t$.

      After the monitor has received all heartbeats at around time $ k\pi$ (i.e., it resets all watchdogs related to the processors), it may send acknowledgments (with transmission time $ \delta_{ack}$). There are two kinds of ACKs: the monitor can request a rollback to the previous state, or it simply says okay, perhaps including the $ C_{mon}+CORR_i$ information for clock synchronization. Therefore, the recovery time after a failure occurs is $ R=\pi+n(\delta+\delta_{ack})+\Sigma$. This is the cost of recovery, and our real-time constraints should tolerate it.
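A sketch of the monitor's watchdog bookkeeping in C (the processor count N, the tick-driven interface, and the field names are illustrative assumptions):

```c
#define N 4  /* number of monitored processors (illustrative) */

/* Monitor-side state: wdt[i] is the time elapsed since the last
   heartbeat of processor i; failed[i] is latched once the watchdog
   exceeds the bound pi + eps_p + eps_t from the text. */
typedef struct {
    long wdt[N];
    int  failed[N];
} monitor_t;

/* Called when a heartbeat from processor i arrives. */
void heartbeat(monitor_t *m, int i) { m->wdt[i] = 0; }

/* Called every dt time units; declares a processor failed when its
   watchdog passes pi + eps_p + eps_t. */
void tick(monitor_t *m, long dt, long pi, long eps_p, long eps_t) {
    for (int i = 0; i < N; i++) {
        m->wdt[i] += dt;
        if (m->wdt[i] > pi + eps_p + eps_t)
            m->failed[i] = 1;
    }
}
```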

    3. Checkpointing and Rollback. If we synchronize the clocks, then we can use equidistant checkpointing. Each processor checkpoints its local state by sending this state information to the monitor processor with period $ \pi$, so checkpoints are taken at the discrete times $ k\pi$. Let $ \beta$ be the maximum clock difference between any two processors and $ \delta$ the transmission time of a message. For this scheme to work, the assumption is $ \pi \gg \delta+\beta$. On the other hand, there is a problem with orphan and missing messages because of $ \delta$ and $ \beta$. We can solve this problem as follows [Jal94]: to prevent orphan messages, if there is a send() command within the interval $ [k\pi-\beta, k\pi]$, we must place the checkpoint just before the send(). To prevent missing messages, message logging can be used: the sender logs the messages it sends during the interval $ [k\pi-\delta-\beta, k\pi]$ on stable storage. If our assumption is that there is no stable storage on the fail-silent processors and the only processor providing fail-stop behavior is the monitor processor, then the processors should insert these messages into the heartbeat (or checkpointing) messages and send them to the monitor (oops, the overhead is getting bigger and bigger!). If a rollback is performed, these messages are retransmitted after the rollback. In a similar way, the receiver logs the messages in the critical interval $ [k\pi, k\pi+\delta+\beta]$ too.
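The critical logging intervals around each checkpoint time $ k\pi$ can be sketched as predicates on the local time (the modulo formulation below is an illustrative assumption):

```c
/* Sender must log messages sent in [k*pi - delta - beta, k*pi]. */
int sender_must_log(long t, long pi, long delta, long beta) {
    long phase = t % pi;               /* time since the last k*pi */
    return phase >= pi - delta - beta;
}

/* Receiver must log messages received in [k*pi, k*pi + delta + beta]. */
int receiver_must_log(long t, long pi, long delta, long beta) {
    long phase = t % pi;
    return phase <= delta + beta;
}
```

Outside these intervals, no logging is needed, which is why the assumption $ \pi \gg \delta+\beta$ keeps the logging overhead small.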

    4. Atomic Actions.
      Supporting atomicity is easy in uniprocessor systems: a checkpoint is taken at the beginning of the atomic action, and if a failure occurs before the end, the system goes back to the beginning of that action. In distributed systems, supporting atomic actions is more complicated because data can be accessed concurrently... (necessary for embedded systems?)

    5. Data Resiliency by means of Replication.
      If the user enables data resiliency, the aspect will consider the replication of the specified data objects (e.g. the primary site approach). When the primary site fails, one of the backup objects is used. Another aspect that can be used as well is voting on the data objects: this aspect provides that performing an operation on replicated data is decided collectively by the replicas through voting. Voting can mask both node and communication failures. (Is data resiliency necessary for embedded systems?)

    6. Process Resiliency by means of Replication.
      If the user enables task resiliency, the aspect will consider the replication of the specified tasks; then there is no further need for rolling back when the replicated task fails (primary site approach). For a $ k$-resilient task, $ k+1$ replicas are necessary. If the primary replica fails, one of the backup replicas takes the place of the primary.
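Primary-site failover for a $ k$-resilient task can be sketched as choosing the lowest-indexed live replica (the alive[] liveness array is an illustrative assumption; in practice it would be driven by the heartbeat-based failure detector above):

```c
/* k-resilient task: k+1 replicas with ids 0..k. The lowest-indexed
   live replica acts as primary; when it fails, the next live backup
   takes its place. Returns -1 if all replicas have failed, i.e.
   more than k failures occurred. */
int select_primary(const int alive[], int replicas) {
    for (int i = 0; i < replicas; i++)
        if (alive[i])
            return i;
    return -1;
}
```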


Tolga Ayav 2004-10-25