TOLÈRE joint research project: Bip - Sosso

Fault tolerant distributed code for embedded systems

Research project context

Constraints for embedded systems

Embedded systems play a growing role in numerous applications, whose behavior they increasingly control automatically. Long present in costly applications (space, aeronautics, military, and so on), where they account for a major part of the application, they are now beginning to appear in consumer applications such as automated car driving. Their main features are:

duality automatic-control/discrete-event: these systems include 1) control laws modeled as differential equations in sampled time, and 2) discrete event systems to schedule the control laws;
critical real-time: timing constraints that are not met may cause a system failure leading to a human, ecological, and/or financial disaster;
limited resources: because this software is embedded, it relies on limited computing power and memory, owing to weight, size, energy consumption (autonomous vehicles), radiation resistance (nuclear or space), or price constraints (consumer applications);
distributed and heterogeneous architecture: these applications are most of the time distributed, 1) to provide enough computing power, and 2) to keep sensors and actuators close to the computing sites.

Synchronous programming [Hal93] offers specification methods and formal verification tools that give satisfactory answers to the needs mentioned above. These methods are based upon modeling with automata, specification in formally defined high-level languages, and theoretical analysis of the models to obtain formal proof methods.

However, the following aspects, which are extremely important in the targeted application fields, are not taken into account by synchronous programming:

Distribution: synchronous languages are parallel, but the parallelism used in the language aims only at making the designer's task easier, and is not related to the system's parallelism. Therefore, synchronous language compilers produce centralized sequential code. This allows the debugging of the program, its formal verification, and various optimizations. In order to obtain distributed code that minimizes the required hardware, one may use the "Algorithm/Architecture Adequation" method [Sor94] developed by the Sosso team at Inria-Rocquencourt, which preserves the properties of synchronous programs mentioned above.
Fault tolerance: an embedded system being intrinsically critical, it is essential to ensure that its software is fault tolerant. Fault tolerance can itself be a reason to distribute the system. In such a case, at the very least, the loss of one computing site must not lead to the loss of the whole application.
Conformity of the implementation to the specification: the formal verification of safety properties is already possible with synchronous languages. But it is also necessary to ensure that the system's specification is complete and correct. This problem is even more difficult if we want to take into account distribution and fault tolerance.

In summary, synchronous programming allows the design of embedded systems with several key advantages, thanks to the synchrony hypothesis: handling of timing constraints, formal verification, and clean and safe programming. Modern distribution techniques then allow the automatic production of distributed code. However, the fault tolerance of the final distributed program is not ensured. We intend to address this difficult problem within a collaboration between the Bip and Sosso teams of Inria.

Software involved in the TOLÈRE research project

The approaches and methods presented above are implemented in two software environments, which allow critical embedded systems to be designed safely and within a reasonable development time.

The Orccad environment [SECK93] (for Open Robot Controller CAD) is a high-level software environment well suited to the specification of robotic applications that combine automatic control and discrete-event aspects. Orccad is developed jointly by the Icare team at Sophia-Antipolis, the Bip team at Montbonnot, and the robotics systems administrators at Montbonnot. Within Orccad, an elementary action is modeled as a Robot-Task (RT), which is a control law merged with a logical reactive behavior. RTs are the interface between continuous time (in fact sampled time) and discrete events. The control laws are built from modules connected through a data-flow network. The logical reactive behavior is specified in the Esterel language [BG92]. These elementary actions are then combined to form Robot-Procedures (RPs) of growing complexity, up to the final application. Each RP, also specified in Esterel, describes the behavior of a robotic mission along with predefined exception handling. Programming all the logical aspects of the application in Esterel lets us benefit from the associated formal proof environment (FC2Tools and Xeve). A prototype version of Orccad is currently available. So far, the generated code targets a single processor running VxWorks or Solaris.
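
The pairing of a sampled control law with a logical reactive behavior can be pictured with a small sketch. This is not Orccad code and does not use Esterel: the class RobotTask, its state names, and the toy proportional law are all assumptions introduced only for illustration.

```python
class RobotTask:
    """One elementary action: a control law guarded by a small event automaton.

    Hypothetical illustration only; names and states are not taken from Orccad.
    """

    def __init__(self, control_law, is_done, is_fault):
        self.control_law = control_law  # sampled-time side: measurement -> command
        self.is_done = is_done          # discrete event: goal reached
        self.is_fault = is_fault        # discrete event: exception raised
        self.state = "RUNNING"

    def step(self, measurement):
        """Run one sampling period: react to events first, then compute a command."""
        if self.state != "RUNNING":
            return None
        if self.is_fault(measurement):
            self.state = "FAILED"
            return None
        if self.is_done(measurement):
            self.state = "DONE"
            return None
        return self.control_law(measurement)


# Hypothetical example: a proportional law driving a position toward 10.
task = RobotTask(
    control_law=lambda pos: 0.5 * (10 - pos),
    is_done=lambda pos: abs(10 - pos) < 1,
    is_fault=lambda pos: pos > 20,
)
position = 0.0
while task.state == "RUNNING":
    command = task.step(position)
    if command is not None:
        position += command  # toy plant: the command is applied as a displacement
print(task.state, position)
```

In this sketch the event tests run before the control law at every sample, so a fault or a completion event always takes priority over computing a new command.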

The SynDEx environment [Sor94] (for Synchronized Distributed Executive), developed by the Sosso team at Rocquencourt, is a software environment dedicated to the multi-processor implementation of synchronous programs. SynDEx supports the "Algorithm/Architecture Adequation" method. It takes into account the real-time and embedding constraints that must be satisfied by the application. The application algorithm is described as a conditioned data-flow graph, either specified graphically or produced by one of the synchronous compilers in the DC format. The heterogeneous target architecture is specified as a network of hardware components, processors, and/or specific circuits, connected through communication media (links, buses). Fast heuristics automatically distribute and schedule the program on the target architecture, as statically as possible, while minimizing its execution duration as well as the number of components required. Finally, SynDEx generates the minimal real-time distributed executive required to run the distributed algorithm.
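
To give an idea of what static distribution and scheduling of a data-flow graph involves, here is a minimal sketch of a greedy list-scheduling heuristic. It is not the SynDEx heuristic, which is not described here; the function greedy_schedule, the operation and processor names, the durations, and the fixed communication cost are all assumptions chosen for the example.

```python
from graphlib import TopologicalSorter


def greedy_schedule(deps, durations, comm_cost):
    """Place each operation on the processor where it finishes earliest.

    deps:      {operation: set of predecessor operations} (the data-flow graph)
    durations: {operation: {processor: execution time}} (heterogeneous costs)
    comm_cost: delay added when a predecessor ran on another processor
    """
    processors = sorted({p for costs in durations.values() for p in costs})
    free_at = {p: 0 for p in processors}  # time at which each processor is idle
    placed = {}                           # operation -> (processor, start, end)

    # Visit operations in an order compatible with the data dependencies.
    for op in TopologicalSorter(deps).static_order():
        best = None
        for proc, cost in sorted(durations[op].items()):
            # Inputs are ready once every predecessor has finished, plus a
            # transfer delay if that predecessor ran on a different processor.
            ready = max(
                (placed[d][2] + (0 if placed[d][0] == proc else comm_cost)
                 for d in deps.get(op, ())),
                default=0,
            )
            start = max(ready, free_at[proc])
            end = start + cost
            if best is None or end < best[2]:
                best = (proc, start, end)
        placed[op] = best
        free_at[best[0]] = best[2]
    return placed


# Hypothetical chain: the sensor and actuator are wired to "cpu" only.
deps = {"sense": set(), "filter": {"sense"}, "control": {"filter"},
        "actuate": {"control"}}
durations = {
    "sense": {"cpu": 2},
    "filter": {"cpu": 4, "dsp": 1},
    "control": {"cpu": 2, "dsp": 2},
    "actuate": {"cpu": 1},
}
schedule = greedy_schedule(deps, durations, comm_cost=1)
for op, (proc, start, end) in schedule.items():
    print(f"{op:8s} on {proc} from {start} to {end}")
```

The sketch only minimizes the finish time of each operation in turn. It ignores conditioning, communication medium contention, and the minimization of the number of components, all of which the text above attributes to SynDEx.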

The TOLÈRE research project

Work topics

For each topic, the researchers involved are indicated in parentheses:

To deal with the automatic-control/discrete-event aspects (Sorel and Simon). In numerous computer systems, in particular embedded ones controlling physical processes, the computing power of the machine is shared between a periodic computing task (control law) and a discrete-event task related to reactive systems (mode change, exception handling, and so on). Since design safety relies on the ability to formally verify the program, both aspects must be made to cooperate and their behaviors must be modeled accurately. Although they are complementary, the two aspects are addressed separately in the robotics field. It seems worthwhile to merge them under a unified model. To do so, several problems must be carefully addressed:
- modeling of the automatic-control/discrete-event cooperation with a graph model unifying the data-flow and control-flow approaches, while taking into account multiple real-time constraints (i.e., several latency and rate constraints),
- preemption and scheduling of multi-task actions within a distributed real-time context,
- use of the DC synchronous format as a common model.
To propose solutions to make the distributed code fault tolerant (Sorel and Girault). This involves two aspects: 1) taking faults into account at the level of the algorithm itself, by modifying it to provide progressively degraded modes, and 2) adding hardware redundancy at the architecture level, with a vote mechanism between processors and communication media, as well as adding dedicated hardware components to embed the most critical parts of the program. For instance, to prevent a faulty processor from blocking the whole application, we can imagine a sub-network dedicated to the detection of faults, in charge of switching the application into a degraded mode. A final point concerns performance, which must remain at a sufficient level.
To propose solutions to ensure that the specification of the problem is complete (Sorel and Simon). Currently, exception handling within Orccad is programmed in the Esterel synchronous language. This specification and coding phase is followed by a formal verification phase performed with FC2Tools and Xeve. During this phase, generic properties are automatically checked. However, application-specific properties must be checked manually, which is of course less reliable. Taking into account distribution and fault tolerance constraints will inevitably make things worse. Another solution consists in synthesizing the discrete-event part from symbolic constraints. These synthesis techniques appear to derive from the Ramadge and Wonham theory [RW87].
To combine Orccad and SynDEx (Sorel, Simon, and Girault). The goal here is to propose a single software environment that is coherent from the system's specification to the real-time optimized implementation. This software environment also has to take into account the fault tolerance aspects, both at the hardware and software levels. To this end, we will study a way to unify the Orccad and SynDEx semantics, with respect to the models described in the previous topics, with DC as an intermediate format. Note that the Esterel compiler can already produce DC code, and that SynDEx accepts DC programs as input.
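
The vote mechanism and the switch to a degraded mode mentioned in the fault tolerance topic can be sketched as follows. This is only an illustration under stated assumptions: the functions vote and select_mode, the replica names, and the three mode names are hypothetical, and replicas are assumed to agree exactly on a correct value.

```python
from collections import Counter


def vote(replies):
    """Majority vote over redundant outputs.

    replies: {replica name: output, or None if the replica did not answer}
    Returns (value, suspects); value is None when no strict majority agrees.
    """
    quorum = len(replies) // 2 + 1  # strict majority of all replicas
    silent = [name for name, out in replies.items() if out is None]
    answers = [out for out in replies.values() if out is not None]
    if answers:
        value, count = Counter(answers).most_common(1)[0]
        if count >= quorum:
            wrong = [name for name, out in replies.items()
                     if out is not None and out != value]
            return value, sorted(silent + wrong)
    # Without a majority, only the silent replicas can be identified.
    return None, sorted(silent)


def select_mode(replies):
    """Pick an operating mode from the vote result (mode names are hypothetical)."""
    value, suspects = vote(replies)
    if value is None:
        return "SAFE_STOP"
    return "DEGRADED" if suspects else "NOMINAL"


print(vote({"p1": 10, "p2": 10, "p3": 12}))
print(select_mode({"p1": 10, "p2": None, "p3": 10}))
```

A real voter would also have to tolerate small numerical differences between replicas and faults in the voter itself; both are left out of this sketch.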

Application domains

The two teams are involved in the following domains:

Underwater robotics: the automatic control of an underwater system is characterized by many degrees of freedom, numerous sensors, and a succession of numerous behaviors. An underwater system is intrinsically distributed (bottom/surface), and the main fault tolerance constraint is that the loss of the link between the surface part and the bottom part must not cause the loss of the bottom part.
Semi-autonomous vehicles: the loss of one wheel controller, for instance, should be detected by the other controllers and should lead to the immediate stop of the vehicle.
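
One common way for controllers to detect the loss of a peer, as in the wheel controller case above, is a heartbeat with a timeout. The sketch below assumes this technique, which the project description does not prescribe; the class HeartbeatMonitor, the controller names, and the 50 ms timeout are hypothetical, and times are passed in explicitly rather than read from a clock.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat of each peer controller (times in milliseconds)."""

    def __init__(self, peers, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_seen = {peer: 0 for peer in peers}

    def beat(self, peer, now_ms):
        """Record that a heartbeat from `peer` was received at `now_ms`."""
        self.last_seen[peer] = now_ms

    def lost_peers(self, now_ms):
        """Return the peers whose last heartbeat is older than the timeout."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now_ms - seen > self.timeout_ms
        )

    def command(self, now_ms):
        """Stop the vehicle as soon as any peer is considered lost."""
        return "STOP" if self.lost_peers(now_ms) else "RUN"


# Hypothetical scenario: "rear_left" stops sending heartbeats after 40 ms.
wheels = ["front_left", "front_right", "rear_left"]
monitor = HeartbeatMonitor(wheels, timeout_ms=50)
for now_ms in (20, 40):
    for peer in wheels:
        monitor.beat(peer, now_ms)
command_at_60 = monitor.command(60)
monitor.beat("front_left", 80)
monitor.beat("front_right", 80)
command_at_100 = monitor.command(100)
lost_at_100 = monitor.lost_peers(100)
print(command_at_60, command_at_100, lost_at_100)
```

The timeout bounds the detection delay, so in a real system it would have to be chosen against the vehicle's stopping constraints.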

Other applications are in the transport (automotive and aeronautics) field.

Inria teams involved in the TOLÈRE research project

Bip Project: Alain Girault (04 76 61 53 51), Daniel Simon (04 76 61 53 28), Eric Rutten (04 76 61 54 02), and Catalin Dima.
Sosso Project: Yves Sorel (01 39 63 52 60) and Christophe Lavarenne (01 39 63 55 80).

References

[BG92] G. Berry and G. Gonthier. The Esterel synchronous programming language: Design, semantics, implementation. Science of Computer Programming, 19(2):87-152, 1992.

[Hal93] N. Halbwachs. Synchronous programming of reactive systems. Kluwer Academic Pub., 1993.

[RW87] P.J. Ramadge and W.M. Wonham. Supervisory control of a class of discrete event processes. SIAM Journal on Control and Optimization, 25(1):206-230, January 1987.

[SECK93] D. Simon, B. Espiau, E. Castillo, and K. Kapellos. Computer-aided design of a generic robot controller handling reactivity and real-time control issues. IEEE Transactions on Control Systems Technology, 1(4), December 1993.

[Sor94] Y. Sorel. Massively parallel computing systems with real-time constraints, the "algorithm/architecture adequation" methodology. In Massively Parallel Computing Systems Conference, Ischia, Italy, May 1994.