We address the problem of off-line fault-tolerant scheduling of an algorithm onto a given distributed-memory architecture, and we provide a generic algorithm that solves this problem. We take into account two kinds of failures: permanent fail-stop and intermittent fail-silent. The basic technique we use is the replication of operations and data communications. We then discuss the principles that govern the execution of schedules with replication under the state-machine and the primary/backup arbitrations between replicas. We also show how to compute the execution date of each operation and the timeouts used for detecting failures. We end with a heuristic which, using this calculus, computes a possibly non-optimal schedule that tries to locally minimize the total execution time of the distributed fault-tolerant algorithm.
@InProceedings{DGLS01,
  author    = {C. Dima and A. Girault and C. Lavarenne and Y. Sorel},
  title     = {Off-Line Real-Time Fault-Tolerant Scheduling},
  booktitle = {Euromicro Workshop on Parallel and Distributed Processing},
  year      = {2001},
  address   = {Mantova, Italy},
  month     = {February},
  pages     = {410--417},
}