Fault-Tolerance for Real-Time Systems

1. Introduction

2. A fable

3. Our contribution

4. Contributors

5. References

6. Results

1. Introduction

Fault-tolerance is the ability of a system to maintain its functionality, even in the presence of faults. It has been extensively studied in the literature: [ALRL04] and [Lap04] give an exhaustive list of the basic concepts and terminology of fault-tolerance, [Pow92] introduces two fundamental notions for fault-tolerance, namely failure mode assumption and assumption coverage, and [Gär99] formalizes the important underlying notions of fault-tolerance. Concerning real-time systems more specifically, [Rus94] gives a short survey and taxonomy for fault-tolerance and real-time systems, and [Cri93,Jal94] treat in detail the special case of fault-tolerance in distributed systems.

The three basic notions are fault, failure, and error: a fault is a defect or flaw that occurs in some hardware or software component; an error is a manifestation of a fault; a failure is a departure of a system from the service required. A failure in a sub-system may be seen as a fault in the global system. Hence the following causal relationship:

... --> fault --> error --> failure --> fault --> ...

Consider for instance a system running on a multi-processor architecture: a fault in one processor might cause it to crash (i.e., a failure), which will be seen as a fault of the system. Therefore, the ability of the system to function even in the presence of the failure of one processor will be regarded as fault-tolerance instead of failure-tolerance.

Not all faults cause immediate failure: faults may be latent (activated but not apparent at the service level), and later become effective. Fault-tolerant systems attempt to detect and correct latent errors before they become effective. Faults are classified according to the following criteria:

by their nature: accidental or intentional;
by their origin: physical, human, internal, external, conception, operational;
by their persistence: transient or permanent.

Failures are classified according to the following criteria:

by their domain: failures on values and/or timing failures;
by their perception by the user;
by their consequences on the environment.

The means for fault-tolerance are either:

error processing (to remove errors from the system's state), which can be carried out either with recovery (rolling back to a previous correct state) or with compensation (masking errors using the internal redundancy of the system);
fault treatment (to prevent faults from being activated again), which is carried out in two steps: diagnosis (determining the cause, location, and nature of the error) and then passivation (preventing the fault from being activated again).

These means rely on redundancy to treat errors. Three forms of redundancy exist: hardware redundancy (e.g., using a spare processor), software redundancy (e.g., using two implementations of the same module), and time redundancy (e.g., re-executing a module later).
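
As an illustration of compensation and of time redundancy, here is a minimal Python sketch. The function names `majority_vote` and `retry` are hypothetical and chosen for this example only; they do not come from any of the tools mentioned on this page. The first function masks errors by voting over the outputs of replicated modules (hardware or software redundancy), the second re-executes an operation when an error is detected (time redundancy).

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(outputs: Sequence[T]) -> T:
    """Compensation: return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no strict majority: too many replicas disagree")
    return value


def retry(operation: Callable[[], T], attempts: int) -> T:
    """Time redundancy: re-execute the operation until it succeeds."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:  # a detected error triggers a re-execution
            last_error = error
    raise RuntimeError("all attempts failed") from last_error


# Three replicas of the same module, one of which returns a wrong value.
voted = majority_vote([42, 42, 7])
```

With three replicas, the vote masks one erroneous value; masking k erroneous values in this way needs 2k+1 replicas.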

Finally, two things are important when designing fault-tolerant systems: the fault hypothesis (what types of faults we want the system to tolerate) and the fault coverage (the probability that the fault hypothesis is respected when a fault actually occurs in the system).

2. A fable

The following is a fable by the famous French writer and poet Jean de la Fontaine (1621-1695), titled "Le loup, la chèvre et le chevreau" (in English, "The wolf, the goat, and the goat kid"). I think that it illustrates rather neatly the concept of fault-tolerance, or at least the need for it:

    La Bique allant remplir sa traînante mamelle
    Et paître l'herbe nouvelle,
    Ferma sa porte au loquet,
    Non sans dire à son Biquet :
    Gardez-vous sur votre vie
    D'ouvrir que l'on ne vous die,
    Pour enseigne et mot du guet :
    Foin du Loup et de sa race !
    Comme elle disait ces mots,
    Le Loup de fortune passe ;
    Il les recueille à propos,
    Et les garde en sa mémoire.
    La Bique, comme on peut croire,
    N'avait pas vu le glouton.
    Dès qu'il la voit partie, il contrefait son ton,
    Et d'une voix papelarde
    Il demande qu'on ouvre, en disant Foin du Loup,
    Et croyant entrer tout d'un coup.
    Le Biquet soupçonneux par la fente regarde.
    Montrez-moi patte blanche, ou je n'ouvrirai point,
    S'écria-t-il d'abord. (Patte blanche est un point
    Chez les Loups, comme on sait, rarement en usage.)
    Celui-ci, fort surpris d'entendre ce langage,
    Comme il était venu s'en retourna chez soi.
    Où serait le Biquet s'il eût ajouté foi
    Au mot du guet, que de fortune
    Notre Loup avait entendu ?
    Deux sûretés valent mieux qu'une,
    Et le trop en cela ne fut jamais perdu.

The important point here is the moral of the fable, given in its last two lines:

    Deux sûretés valent mieux qu'une,
    Et le trop en cela ne fut jamais perdu.

In English, it translates more or less into:

    Two safeties are better than one,
    And too much in this respect was never a loss.

I find it both amusing and amazing that, back in the seventeenth century, Jean de la Fontaine wrote something so relevant to today's concerns!

3. Our contribution to fault-tolerance

In the past, we have been involved in Tolère, a French "Action de Recherche Coordonnée" funded by Inria, and in EAST-EEA, a European research project on embedded electronics for the automotive domain, which involved various automotive companies and research labs.

3.1. New scheduling/distribution heuristics

Researchers involved: Girault, Sorel, Lavarenne, Sighireanu, Dima, Pinello, Kalla, Assayad, Leignel, Yu, and Leveque.

Our personal contribution to research on fault-tolerant embedded systems consists of several scheduling/distribution heuristics. Their common feature is to take two graphs as input: a data-flow graph ALG describing the algorithm of the application, and a graph ARC describing the target distributed architecture. Below to the left is an example of an algorithm graph: it has nine operations (represented by circles) and eleven data-dependences (represented by green arrows). Among the operations, one is a sensor operation (I), one is an actuator operation (O), and the seven others are computations (A to G). Below to the right is an example of an architecture graph: it has three processors (P1, P2, and P3) and three point-to-point communication links (L1.2, L1.3, and L2.3).

Also given is a table of the Worst-Case Execution Time (WCET) of each operation on each processor, and of the worst-case transmission time of each data-dependence on each communication link. Since the architecture is a priori heterogeneous, these times need not be identical across processors or links. Below is an example of such a table for the operations of ALG. The infinity sign expresses the fact that an operation cannot be executed by a processor: for instance, operation I cannot be executed by processor P3, which accounts for the requirement of certain dedicated hardware.

From these three inputs, the heuristic distributes the operations of ALG onto the processors of ARC and schedules them statically, as well as the communications induced by these scheduling decisions. The output of the heuristic is therefore a static schedule, from which embeddable code can be generated.
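
To make the shape of the inputs and of the output concrete, here is a minimal Python sketch of a plain greedy list scheduler. It is not DSH nor any of the heuristics described below: it simply places each operation on the processor where it finishes earliest. The function name `greedy_schedule`, the four-operation graph, the two processors, the WCET values, and the single uniform inter-processor communication time are all assumptions made for this example (communication links are not modelled individually).

```python
import math
from graphlib import TopologicalSorter


def greedy_schedule(preds, wcet, processors, comm_time):
    """Return {operation: (processor, start, end)} for a data-flow graph.

    preds: {operation: list of predecessor operations} (every operation is a key)
    wcet:  {operation: {processor: worst-case execution time, math.inf if forbidden}}
    comm_time: assumed uniform transfer time between two distinct processors
    """
    free_at = {p: 0 for p in processors}
    schedule = {}
    sorter = TopologicalSorter(preds)
    sorter.prepare()
    while sorter.is_active():
        # Sorting the ready operations makes the result deterministic.
        for op in sorted(sorter.get_ready()):
            best = None
            for p in processors:
                # Data from a predecessor on another processor arrives later.
                data_ready = max(
                    (schedule[q][2] + (0 if schedule[q][0] == p else comm_time)
                     for q in preds[op]),
                    default=0,
                )
                start = max(free_at[p], data_ready)
                end = start + wcet[op][p]
                if best is None or end < best[2]:
                    best = (p, start, end)
            if math.isinf(best[2]):
                raise ValueError(f"operation {op} cannot run on any processor")
            schedule[op] = best
            free_at[best[0]] = best[2]
            sorter.done(op)
    return schedule


# Toy inputs (invented values): sensor I, computations A and B, actuator O.
ALG = {"I": [], "A": ["I"], "B": ["I"], "O": ["A", "B"]}
WCET = {
    "I": {"P1": 1, "P2": math.inf},  # I needs hardware only present on P1
    "A": {"P1": 2, "P2": 3},
    "B": {"P1": 3, "P2": 2},
    "O": {"P1": 1, "P2": 1},
}
result = greedy_schedule(ALG, WCET, ["P1", "P2"], comm_time=1)
makespan = max(end for _p, _start, end in result.values())
```

On these toy inputs, I and A are placed on P1, B and O on P2, and the schedule length is 5 time units.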

For the embeddable code generation, we use SynDEx, a system-level CAD software based on the "Algorithm-Architecture Adequation" (AAA) methodology, for rapid prototyping and optimizing the implementation of distributed real-time embedded applications onto "multicomponent" architectures. It has been designed and developed at INRIA by the AOSTE team. Our heuristics are also implemented inside SynDEx, as an alternative to its own default heuristic (called DSH: Distribution Scheduling Heuristic [GLS99]).

Our fault hypothesis is that the hardware components are fail-silent, meaning that a component is either healthy and works fine, or is faulty and produces no output at all. Recent studies on modern hardware architectures have shown that a fail-silent behavior can be achieved at a reasonable cost [BFM⁺03], so our fault hypothesis is reasonable.

Our contribution consists of several new scheduling/distribution heuristics that generate static schedules which, in addition, tolerate a fixed number of hardware component faults (processors and/or communication links). These new heuristics include:

FTBAR (Fault-Tolerant Based Active Replication) generates a static schedule that tolerates Npf processor faults, by actively replicating all the operations of the algorithm graph ALG exactly Npf+1 times. It works with target architectures having either point-to-point communication links or buses, but assumes that all the communication links are reliable. FTBAR tries to minimize the critical path of the obtained schedule w.r.t. the known WCETs of the operations on the various processors of the architecture [GKS06] [Kal04] [GKS04b] [GKS03] [GKSS03] [GLSS01a] [GLSS01b] [GLSS00].
RBSA (Reliable Bicriteria Scheduling Algorithm) generates a reliable and static schedule, also by actively replicating the operations of the algorithm graph. The difference with FTBAR is that the number of times an operation is replicated depends on the individual reliability of the processors it is scheduled on and on the overall reliability level required by the user. RBSA tries both to minimize the critical path of the obtained schedule and to maximize its reliability (these are the two criteria of this heuristic). Among the bicriteria (length,reliability) scheduling heuristics found in the literature, RBSA is the only one that can actually increase the reliability of the obtained system, because it is the only scheduling heuristic that actively replicates the operations and communications [AGK04].
BSH (Bicriteria Scheduling Heuristics) solves a difficult problem inherent to all bicriteria (length,reliability) scheduling heuristics. This problem is due to the fact that the reliability criterion is intrinsically dependent on the length criterion (under the classical exponential failure model of Shatz and Wang), and incurs three major drawbacks: first, the length criterion overpowers the reliability criterion; second, it is very tricky to control precisely the replication factor of the operations onto the processors, from the beginning to the end of the schedule (in particular, it can cause a "funnel" effect); and third, the reliability is not a monotonic function of the schedule. To solve this problem, we propose a new criterion instead of the reliability, which we call the Global System Failure Rate (GSFR); the GSFR is the failure rate per time unit of the system, seen as if it were a single operation placed onto a single processor. We have conducted extensive simulations that demonstrate that our new bicriteria (length,GSFR) scheduling algorithm BSH avoids all the problems mentioned above, which plague the classical (length,reliability) scheduling heuristics found in the literature [GK09].
TSH (Tricriteria Scheduling Heuristics) extends BSH with the power consumption criterion. From a given software application graph ALG and a given multiprocessor architecture ARC, it produces a static multiprocessor schedule that optimizes three criteria: its length (crucial for real-time systems), its reliability (crucial for dependable systems), and its power consumption (crucial for autonomous systems). TSH uses the active replication of the operations and the data-dependencies to increase the reliability, and uses dynamic voltage scaling (DVS) to lower the power consumption. Moreover, we show that, to produce the Pareto surface of the best tradeoffs between the three criteria, it is necessary to transform two of the criteria into constraints, and then optimize the third criterion under those two constraints. Because list scheduling heuristics cannot backtrack, this in turn requires that the constraints be invariant measures of the schedule under construction. For example, we use the power consumption and not the total energy consumed, because the energy is not an invariant measure of the schedule: indeed, if S' is a prefix schedule of S, then E(S) > E(S'). It follows that ensuring that E(S') is less than the energy constraint Eobj does not ensure that E(S) is less than Eobj. For the same reason, we use the GSFR (Global System Failure Rate, see the previous item) rather than the reliability [AGK12a] [AGK12b] [AGK11].
GRT + eDSH (Graph Redundancy Transformation + extended Distribution Scheduling Heuristic) generates a static schedule that tolerates Npf processor faults and Nlf communication link faults. It first transforms the algorithm graph ALG into another data-flow graph ALG* by adding redundancy into it such that the required number of hardware component faults will be tolerated. During this phase, it also generates exclusion relations between subsets of operations that must be scheduled onto distinct processors, and subsets of data-dependences that must be routed through disjoint paths. Then it uses an extended version of the DSH heuristic to generate a static schedule of ALG* onto ARC, w.r.t. the exclusion relations generated during the first phase [GKS04a].
FPMH (Fault Patterns Merging Heuristic) is an original approach to generate a static schedule of ALG onto ARC tolerant to a given list of fault patterns. A fault pattern is a subset of the architecture's components that can fail simultaneously. Our method involves two steps. First, for each fault pattern, we generate the corresponding reduced architecture (the architecture from which the pattern's components have been removed), and we generate a static schedule of ALG onto this reduced architecture (we use the basic DSH heuristic of SynDEx for this). From N fault patterns, we therefore obtain N basic schedules. The second step consists of merging these N basic schedules into one static schedule that will be, by construction, tolerant to all the specified fault patterns [DGLS01]. The merging operation raises a number of theoretical and practical problems that were hard to solve elegantly and efficiently!
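
The notions of active replication, reliability, and GSFR used by the heuristics above can be illustrated with a small Python sketch. It rests on several assumptions that are ours and not taken from the cited papers: processors fail independently with a constant failure rate (exponential model), communications are ignored, an operation succeeds if at least one of its replicas does, and the GSFR is computed as -log(R)/U, where R is the reliability of the schedule and U the total execution time of all the replicas. The function names are hypothetical.

```python
import math


def replicated_reliability(replicas):
    """Probability that at least one replica of an operation succeeds.

    replicas: list of (failure_rate, duration) pairs, one per processor used.
    """
    all_fail = 1.0
    for failure_rate, duration in replicas:
        all_fail *= 1.0 - math.exp(-failure_rate * duration)
    return 1.0 - all_fail


def schedule_reliability(schedule):
    """Product of the reliabilities of the (replicated) operations."""
    reliability = 1.0
    for replicas in schedule:
        reliability *= replicated_reliability(replicas)
    return reliability


def gsfr(schedule):
    """Failure rate per time unit, as if the schedule were a single operation."""
    utilization = sum(d for replicas in schedule for _rate, d in replicas)
    return -math.log(schedule_reliability(schedule)) / utilization


RATE = 1e-5  # assumed failure rate of the single processor, per time unit
schedule = [[(RATE, 10.0)], [(RATE, 20.0)], [(RATE, 30.0)]]  # no replication
prefixes = [schedule[:k] for k in (1, 2, 3)]
reliabilities = [schedule_reliability(p) for p in prefixes]
rates = [gsfr(p) for p in prefixes]

# Replicating an operation on a second processor increases its reliability.
single = replicated_reliability([(RATE, 10.0)])
double = replicated_reliability([(RATE, 10.0), (RATE, 10.0)])
```

In this example the reliability decreases each time an operation is appended to the schedule, whereas the GSFR of every prefix stays equal to the processor's failure rate. This is the sense in which a rate can serve as a constraint for a heuristic that builds the schedule step by step and cannot backtrack.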

3.2. Discrete controller synthesis

Researchers involved: Girault, Rutten, Abdennebi, Dumitrescu, Taha, Marchand, and Sun.

Another of our contributions (not a heuristic this time) is the use of discrete controller synthesis theory [RW87] to automatically generate fault-tolerant software. The principle is to design the software (for instance to control some plant) by taking into account all possible behaviors, i.e., both the good ones and the bad ones (the faults). We consider that all the fault events are uncontrollable, and we add to the system an environment model that specifies which fault events can occur simultaneously. The advantage of discrete controller synthesis is that it automatically produces a controller that, put in parallel with the system, controls it in such a manner that it satisfies some predefined safety requirements. In our approach, these requirements express precisely the fault tolerance. We have conducted several studies on this approach that demonstrate its feasibility and its elegance. From the point of view of fault-tolerance, our approach is interesting in the sense that, when the controller synthesis actually succeeds in producing a controller, we obtain a system equipped with a dynamic reconfiguration mechanism to handle faults, with a static guarantee that all specified faults will be tolerated during the execution, and with a known bound on the system's reaction time (thanks to optimal controller synthesis) [DGMR10] [GR09] [DGMR07b] [DGMR07a] [DGR04] [GR04]. New developments are towards the efficient implementation of the synthesized fault-tolerant controlled systems, by using the LibDGALS library for dynamic GALS systems.
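
The core of safety-oriented controller synthesis can be sketched in a few lines of Python on an explicit finite automaton. This is a textbook-style fixed-point computation written for illustration, not the synthesis tool used in the cited papers; the function name, the state names, and the event names are hypothetical. A state is kept only if no uncontrollable event (here, a fault) can lead from it to a discarded state; the controller then disables the controllable events that would leave the kept set.

```python
def synthesize_safety_controller(transitions, controllable, bad):
    """Return (safe_states, allowed_controllable_events_per_safe_state).

    transitions:  {(state, event): next_state}
    controllable: events the controller may disable (faults are not in it)
    bad:          states that violate the safety requirement
    """
    states = {s for (s, _event) in transitions} | set(transitions.values())
    safe = states - set(bad)
    changed = True
    while changed:  # greatest fixed point: shrink until stable
        changed = False
        for (src, event), dst in transitions.items():
            if src in safe and event not in controllable and dst not in safe:
                safe.discard(src)  # a fault can force the system out of safe
                changed = True
    allowed = {
        s: {e for (src, e), dst in transitions.items()
            if src == s and e in controllable and dst in safe}
        for s in safe
    }
    return safe, allowed


# A task runs on P1; if P1 fails, the task must be restarted somewhere.
# Environment model: only P1 may fail.
plant = {
    ("run_P1", "fail_P1"): "recover",       # uncontrollable fault event
    ("recover", "restart_on_P1"): "lost",   # restarting on the failed processor
    ("recover", "restart_on_P2"): "run_P2",
}
choices = {"restart_on_P1", "restart_on_P2"}
safe, allowed = synthesize_safety_controller(plant, choices, bad={"lost"})

# Weaker environment model: P2 may fail as well, with no spare left.
plant_two_faults = dict(plant)
plant_two_faults[("run_P2", "fail_P2")] = "lost"
safe2, allowed2 = synthesize_safety_controller(plant_two_faults, choices, bad={"lost"})
```

In the first model the controller only leaves `restart_on_P2` enabled in state `recover`. In the second model `run_P2` is no longer safe and no recovery action remains enabled: the sketch enforces invariance only, so detecting that such a blocking controller is unacceptable would require an additional liveness or non-blocking requirement.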

3.3. Aspect oriented programming: fault-tolerant programs and fault-tolerant circuits

Researchers involved: Fradet, Girault, Ayav, and Burlyaev.

We are investigating the use of aspect oriented programming [KLM⁺97] [BSL01] for automatically transforming a non fault-tolerant program into a fault-tolerant one. As a first step in this direction, we have proposed several automatic program transformations (i.e., not an aspect language yet) to automatically insert heartbeats and checkpoints in a real-time distributed program. We have formalized these transformations as rewriting rules in ML for a simple programming language (with assignment, if-then-else, for loops, and input/output). Our contribution is twofold. First, we have formally proved that our transformations preserve the semantics of the initial program and we have derived formulas to compute the WCET of the obtained program (this WCET can then be checked against the real-time constraints). Second, choosing the lengths of the checkpointing and heartbeating intervals is delicate: long intervals lead to long roll-back times, while too frequent checkpointing leads to high overheads. We have derived formulas for choosing the optimal checkpointing and heartbeating intervals. As a result, the overhead due to adding the fault-tolerance is minimized [AFG08] [AFG06].
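
The tradeoff between roll-back time and checkpointing overhead can be illustrated with Young's classical first-order approximation, which is not the formula derived in [AFG08] [AFG06] (those formulas target the WCET of the transformed program and are not reproduced here). Under the assumptions that faults arrive at a constant rate with mean time between failures M, that saving a checkpoint costs C time units, and that a fault loses on average half an interval of work, the fraction of time wasted with an interval T is approximately C/T + T/(2M), which is minimal for T = sqrt(2CM). The function names below are hypothetical.

```python
import math


def wasted_fraction(interval, checkpoint_cost, mtbf):
    """First-order estimate of the time lost to checkpoints and roll-backs."""
    checkpoint_overhead = checkpoint_cost / interval  # one checkpoint per interval
    rollback_overhead = interval / (2.0 * mtbf)       # half an interval lost per fault
    return checkpoint_overhead + rollback_overhead


def young_interval(checkpoint_cost, mtbf):
    """Interval minimizing wasted_fraction (Young's approximation)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Assumed values: a checkpoint costs 2 time units, one fault every 10000 units.
best = young_interval(checkpoint_cost=2.0, mtbf=10_000.0)
waste_at_best = wasted_fraction(best, 2.0, 10_000.0)
waste_too_frequent = wasted_fraction(best / 2, 2.0, 10_000.0)
waste_too_rare = wasted_fraction(best * 2, 2.0, 10_000.0)
```

With these assumed values the best interval is 200 time units; halving or doubling it raises the estimated waste from 2% to 2.5%.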

New developments concern fault-tolerant circuits, for which we want to propose automatic transformation procedures. These procedures will turn an initial non fault-tolerant circuit into a new fault-tolerant circuit (for instance by replicating portions of the circuit, by adding voters, or by adding error correction blocks). We will also seek to formally prove the correctness of these procedures, manually or with the help of a theorem prover.

3.4. Probabilistic contracts for reliable components

Researchers involved: Xu, Goessler, and Girault.

We are working on a probabilistic contract framework for describing and analysing component-based embedded systems, based on the theory of Interactive Markov Chains (IMC). A contract specifies the assumptions a component makes on its context and the guarantees it provides. Probabilistic transitions allow for uncertainty in the component behavior, e.g. to model observed black-box behavior (internal choice) or reliability. An interaction model specifies how components interact. We provide the ingredients for a component-based design flow, including (1) contract satisfaction and refinement, (2) parallel composition of contracts over disjoint, interacting components, and (3) conjunction of contracts describing different requirements over the same component. By using parametric probabilities in the contracts, we are able to answer questions such as "what is the most permissive component which satisfies a given contract?" [GXG12] [XGG10].

4. Contributors (in chronological order)

Alain Girault: 1998-now (principal investigator)
Yves Sorel: 1998-2006 (INRIA Paris Rocquencourt, on scheduling for fault-tolerance and reliability in SynDEx)
Christophe Lavarenne: 1998-2000 (INRIA Paris Rocquencourt, on scheduling for fault-tolerance and reliability in SynDEx)
Mihaela Sighireanu: 1998-1999 (postdoc; now Assistant Professor at University Paris VII)
Catalin Dima: 1999-2000 (postdoc; now Assistant Professor at University of Paris XII - Créteil)
Claudio Pinello: 2000 (intern; now at Cadence Berkeley Labs after a PhD at UC Berkeley)
Hamoudi Kalla: 2002-now (PhD student, now Assistant Professor at the University of Batna)
Eric Rutten: 2002-now (INRIA Grenoble, on discrete controller synthesis)
Ismail Assayad: 2003-now (master student, now Assistant Professor at the University of Casablanca; works on multicriteria multiprocessor static scheduling heuristics)
Mohamed Abdennebi: 2003 (master student, worked on discrete controller synthesis)
Emil Dumitrescu: 2003-now (postdoc, now Assistant Professor at INSA Lyon; worked on discrete controller synthesis)
Denis Trystram: 2004-now (Professor at INPGrenoble, on scheduling for fault-tolerance and reliability)
Nicolas Leignel: 2004 (master student, on (length,reliability) bicriteria multiprocessor static scheduling heuristics)
Safouan Taha: 2004 (master student, on discrete controller synthesis)
Huafeng Yu: 2004-2005 (master student, on byzantine sensor faults)
Thomas Leveque: 2004 (intern, on fault patterns)
Pascal Fradet: 2004-now (INRIA Grenoble, on aspect oriented programming and automatic program transformations)
Tolga Ayav: 2004-2005 (postdoc, now Assistant Professor at the University of Izmir; worked on aspect oriented programming)
Erik Saule: 2005-2008 (PhD student, on (length,reliability) bicriteria multiprocessor static scheduling heuristics)
Pierre-Francois Dutot: 2006-2009 (Assistant Professor at UJF, on multicriteria branch-and-bound scheduling)
Gérald Vaisman: 2006-2009 (PhD student, on multicriteria branch-and-bound scheduling)
Hervé Marchand: 2006-now (INRIA Rennes, on optimal discrete controller synthesis)
Fanny Dufossé: 2009-now (PhD student at ENS Lyon, on multi-criteria mapping by interval)
Yves Robert: 2009-now (Professor at ENS Lyon, on multi-criteria mapping by interval)
Anne Benoit: 2009-now (Assistant Professor at ENS Lyon, on multi-criteria mapping by interval)
Dana Xu: 2009-2012 (INRIA Paris Rocquencourt, on probabilistic contracts for reliability)
Gregor Goessler: 2009-2012 (INRIA Grenoble, POP ART team, on probabilistic contracts for reliability)
Dmitry Burlyaev: 2012-now (INRIA Grenoble, SPADES team, on automatic program transformations for fault-tolerant circuits)
Wei-Tsun Sun: 2012-now (INRIA Grenoble, SPADES team, on discrete controller synthesis)

5. References (in alphabetical order)

[ALRL04]	A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. on Dependable and Secure Computing, 1(1):11-33, January 2004.
[BFM⁺03]	M. Baleani, A. Ferrari, L. Mangeruca, M. Peri, S. Pezzini, and A. Sangiovanni-Vincentelli. Fault-tolerant platforms for automotive safety-critical applications. In International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES'03, San Jose (CA), USA, November 2003. ACM.
[BSL01]	N. Bouraqadi-Saâdani and T. Ledoux. Le point sur la programmation par aspects. Technique et Science Informatique, 20(4):505-528, 2001.
[Cri93]	F. Cristian. Understanding fault-tolerant distributed systems. Communications of the ACM, 34(2):56-78, February 1993.
[Gär99]	F. Gärtner. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys, 31(1):1-26, March 1999.
[GLS99]	T. Grandpierre, C. Lavarenne, and Y. Sorel. Optimized rapid prototyping for real-time embedded heterogeneous multiprocessors. In 7th International Workshop on Hardware/Software Co-Design, CODES'99, Rome, Italy, May 1999. ACM.
[Jal94]	P. Jalote. Fault-Tolerance in Distributed Systems. Prentice-Hall, Englewood Cliffs, New Jersey, 1994.
[KLM⁺97]	G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Videira Lopes, J.-M. Loingtier, and J. Irwin. Aspect-oriented programming. In European Conference on Object-Oriented Programming, ECOOP'97, volume 1241 of LNCS, pages 220-242, Jyväskylä, Finland, June 1997. Springer-Verlag.
[Lap04]	J.-C. Laprie. Sûreté de fonctionnement informatique : concepts de base et terminologie. Technical report, LAAS-CNRS, Toulouse, France, 2004.
[Pow92]	D. Powell. Failure mode assumption and assumption coverage. In International Symposium on Fault-Tolerant Computing, FTCS-22, pages 386-395, Boston (MA), USA, July 1992. IEEE. Research report LAAS 91462.
[RW87]	P.J. Ramadge and W.M. Wonham. Supervisory control of a class of discrete event processes. SIAM J. Control Optimization, 25(1):206-230, January 1987.
[Rus94]	J. Rushby. Critical system properties: Survey and taxonomy. Reliability Engineering and Systems Safety, 43(2):189-219, 1994. Research report CSL-93-01.

6. Results (in reverse chronological order)

[AGK13]	I. Assayad, A. Girault, and H. Kalla. Tradeoff exploration between reliability, power consumption, and execution time for embedded systems. Int. J. Software Tools for Technology Transfer, 15(3):229-245, June 2013.
[BDGR13]	A. Benoit, F. Dufossé, A. Girault, and Y. Robert. Reliability and performance optimization of pipelined real-time systems. J. of Parallel and Distributed Computing, 73:851-865, 2013.
[AGK12]	I. Assayad, A. Girault, and H. Kalla. Scheduling of real-time embedded systems under reliability and power constraints. In International Conference on Complex Systems, ICCS'12, Agadir, Morocco, November 2012. IEEE.
[GXG12]	G. Goessler, D.N. Xu, and A. Girault. Probabilistic contracts for component-based design. Formal Methods in System Design, 41(2):211-231, 2012.
[AGK11]	I. Assayad, A. Girault, and H. Kalla. Tradeoff exploration between reliability, power consumption, and execution time. In International Conference on Computer Safety, Reliability and Security, SAFECOMP'11, volume 6894 of LNCS, pages 437-451, Napoli, Italy, September 2011. Springer-Verlag.
[BDGR10]	A. Benoit, F. Dufossé, A. Girault, and Y. Robert. Reliability and performance optimization of pipelined real-time systems. In International Conference on Parallel Processing, ICPP'10, pages 150-159, San Diego (CA), USA, September 2010.
[DGMR10]	E. Dumitrescu, A. Girault, H. Marchand, and E. Rutten. Multicriteria optimal reconfiguration of fault-tolerant real-time tasks. In Workshop on Discrete Event Systems, WODES'10, Berlin, Germany, September 2010. IFAC, New-York.
[XGG10]	D.N. Xu, G. Goessler, and A. Girault. Probabilistic contracts for component-based design. In International Symposium on Automated Technology for Verification and Analysis, ATVA'10, volume 6252 of LNCS, pages 325-340, Singapore, Singapore, September 2010. Springer-Verlag.
[GK09]	A. Girault and H. Kalla. A novel bicriteria scheduling heuristics providing a guaranteed global system failure rate. IEEE Trans. Dependable Secure Comput., 6(4):241-254, December 2009.
[GR09]	A. Girault and E. Rutten. Automating the addition of fault tolerance with discrete controller synthesis. Formal Methods in System Design, 35(2):190-225, October 2009.
[GST09]	A. Girault, E. Saule, and D. Trystram. Reliability versus performance for critical applications. J. of Parallel and Distributed Computing, 69(3):326-336, March 2009.
[AFG08]	T. Ayav, P. Fradet, and A. Girault. Implementing fault-tolerance by automatic program transformations. ACM Trans. Embedd. Comput. Syst., 7(4), July 2008. Research report INRIA 5919.
[DGMR07b]	E. Dumitrescu, A. Girault, H. Marchand, and E. Rutten. Synthèse optimale de contrôleurs discrets et systèmes répartis tolérants aux fautes. In Modélisation des Systèmes Réactifs, MSR'07, pages 71-86, Lyon, France, October 2007. Hermes.
[DGMR07a]	E. Dumitrescu, A. Girault, H. Marchand, and E. Rutten. Optimal discrete controller synthesis for modeling fault-tolerant distributed systems. In Workshop on Dependable Control of Discrete Systems, DCDS'07, pages 23-28, Cachan, France, June 2007. IFAC, New-York.
[AFG06]	T. Ayav, P. Fradet, and A. Girault. Implementing fault-tolerance in real-time systems by automatic program transformations. In S.L. Min and W. Yi, editors, International Conference on Embedded Software, EMSOFT'06, pages 205-214, Seoul, South Korea, October 2006. ACM, New-York. Research report INRIA 5919.
[GKS06]	A. Girault, H. Kalla, and Y. Sorel. Transient processor/bus fault tolerance for embedded systems. In IFIP Working Conference on Distributed and Parallel Embedded Systems, DIPES'06, pages 135-144, Braga, Portugal, October 2006. Springer-Verlag.
[Gir06]	A. Girault. System-level design of fault-tolerant embedded systems. ERCIM News, 67:25-26, October 2006.
[GY06]	A. Girault and H. Yu. A flexible method to tolerate value sensor failures. In International Conference on Emerging Technologies and Factory Automation, ETFA'06, pages 86-93, Prague, Czech Republic, September 2006. IEEE, Los Alamitos.
[Kal04]	H. Kalla. Génération automatique de distributions/ordonnancements temps-réel, fiables et tolérants aux fautes. PhD Thesis, INPG, INRIA Grenoble Rhône-Alpes, projet Pop-Art, December 2004.
[DGR04]	E. Dumitrescu, A. Girault, and E. Rutten. Validating fault-tolerant behaviors of synchronous system specifications by discrete controller synthesis. In Workshop on Discrete Event Systems, WODES'04, Reims, France, September 2004. IFAC, New-York.
[Lév04]	T. Lévêque. Fault tolerance adequation in SynDEx. Internship report, Inria Rhône-Alpes, Montbonnot, France, September 2004.
[GR04]	A. Girault and E. Rutten. Discrete controller synthesis for fault-tolerant distributed systems. In International Workshop on Formal Methods for Industrial Critical Systems, FMICS'04, volume 133 of ENTCS, pages 81-100, Linz, Austria, September 2004. Elsevier Science, New-York.
[DGS04]	C. Dima, A. Girault, and Y. Sorel. Static fault-tolerant scheduling with "pseudo-topological" orders. In Joint Conference on Formal Modelling and Analysis of Timed Systems and Formal Techniques in Real-Time and Fault Tolerant System, FORMATS-FTRTFT'04, volume 3253 of LNCS, Grenoble, France, September 2004. Springer-Verlag.
[GKS04a]	A. Girault, H. Kalla, and Y. Sorel. An active replication scheme that tolerates failures in distributed embedded real-time systems. In IFIP Working Conference on Distributed and Parallel Embedded Systems, DIPES'04, Toulouse, France, August 2004. Kluwer Academic Pub., Hingham, MA.
[GKS04b]	A. Girault, H. Kalla, and Y. Sorel. A scheduling heuristics for distributed real-time embedded systems tolerant to processor and communication media failures. Int. J. of Production Research, 42(14):2877-2898, July 2004.
[AGK04]	I. Assayad, A. Girault, and H. Kalla. A bi-criteria scheduling heuristics for distributed embedded systems under reliability and real-time constraints. In International Conference on Dependable Systems and Networks, DSN'04, pages 347-356, Firenze, Italy, June 2004. IEEE, Los Alamitos.
[GKS03]	A. Girault, H. Kalla, and Y. Sorel. Une heuristique d'ordonnancement et de distribution tolérante aux pannes pour systèmes temps-réel embarqués. In Modélisation des Systèmes Réactifs, MSR'03, pages 145-160, Metz, France, October 2003. Hermes.
[GKSS03]	A. Girault, H. Kalla, M. Sighireanu, and Y. Sorel. An algorithm for automatically obtaining distributed and fault-tolerant static schedules. In International Conference on Dependable Systems and Networks, DSN'03, San-Francisco (CA), USA, June 2003. IEEE, Los Alamitos.
[GLSS01a]	A. Girault, C. Lavarenne, M. Sighireanu, and Y. Sorel. Fault-tolerant static scheduling for real-time distributed embedded systems. In 21st International Conference on Distributed Computing Systems, ICDCS'01, pages 695-698, Phoenix (AZ), USA, April 2001. IEEE, Los Alamitos. Extended abstract.
[GLSS01b]	A. Girault, C. Lavarenne, M. Sighireanu, and Y. Sorel. Generation of fault-tolerant static scheduling for real-time distributed embedded systems with multi-point links. In IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, FTPDS'01, San Francisco (CA), USA, April 2001. IEEE, Los Alamitos.
[DGLS01]	C. Dima, A. Girault, C. Lavarenne, and Y. Sorel. Off-line real-time fault-tolerant scheduling. In 9th Euromicro Workshop on Parallel and Distributed Processing, PDP'01, pages 410-417, Mantova, Italy, February 2001.
[GLSS00]	A. Girault, C. Lavarenne, M. Sighireanu, and Y. Sorel. Fault-tolerant static scheduling for real-time distributed embedded systems. Research report 4006, Inria, September 2000.

Last modification January 1st, 2013