TOLÈRE joint research project: Bip - Sosso
Fault tolerant distributed code for
embedded systems
Research project context
Constraints for embedded systems
Embedded systems play a growing part in numerous applications, whose
behavior they control more and more automatically. Long present in
costly applications (space, aeronautics, military, and so on), where
they account for a major part of the application, they are now
beginning to appear in mass-market applications such as automatic car
driving. Their main features are:
- automatic-control/discrete-event duality: these
systems include 1) control laws modeled as differential equations in
sampled time, and 2) discrete event systems to schedule the control
laws;
- critical real-time: timing constraints that are
not met may cause a system failure leading to a human, an
ecological, and/or a financial disaster;
- limited resources: this software is embedded, so
it relies on limited computing power and memory because of weight,
size, energy consumption (autonomous vehicles), radiation
resistance (nuclear or space), or price constraints (mass-market
applications);
- distributed and heterogeneous architecture: these
applications are most of the time distributed, 1) to provide enough
computing power, and 2) to keep sensors and actuators close to the
computing sites.
Synchronous programming [Hal93] offers
specification methods and formal verification tools that give
satisfactory answers to the needs mentioned above. These methods are
based upon modeling with automata, specification with formally defined
high-level languages, and theoretical analysis of the models used, in
order to obtain formal proof methods.
However, the following aspects, which are extremely important in the
targeted application fields, are not taken into account by synchronous
programming:
- Distribution: synchronous languages are parallel,
but the parallelism used in the language aims only at making the
designer's task easier, and is not related to the system's
parallelism. Therefore, synchronous language compilers produce
centralized sequential code. This allows the debugging of the program,
its formal verification, and various optimizations. In order to obtain
distributed code minimizing the required hardware, one may use the
"Algorithm/Architecture Adequation" method [Sor94] developed at Inria-Rocquencourt in the Sosso
team, which preserves the properties of synchronous programs mentioned
above.
- Fault tolerance: since an embedded system is
intrinsically critical, it is essential to ensure that its software is
fault tolerant. This can even be the motivation for distributing it. In
such a case, at the very least, the loss of one computing site must
not lead to the loss of the whole application.
- Conformity of the implementation to the
specification: the formal verification of safety properties
is already possible with synchronous languages. But it is also
necessary to ensure that the system's specification is complete and
correct. This problem is even more difficult if we want to take into
account distribution and fault tolerance.
In summary, synchronous programming allows the design of embedded
systems with several key advantages (control of timing constraints,
formal verification, clean and safe programming) thanks to the synchrony
hypothesis. Modern distribution techniques then allow the automatic
production of distributed code. However, the fault tolerance of the
final distributed program is not ensured. We intend to address this
difficult problem within a collaboration between the Bip and Sosso
teams of Inria.
Software involved in the TOLÈRE research project
The approaches and methods presented above are implemented within two
software environments, which allow critical embedded systems to be
designed safely and within a reasonable development time.
The Orccad environment [SECK93] (for Open Robot Controller CAD) is a
high-level software environment well adapted to the specification of
robotic applications involving automatic control and discrete event
aspects. Orccad is developed jointly by the Icare team at
Sophia-Antipolis, the Bip team at Montbonnot, and the robotics
systems administrators at Montbonnot. Within Orccad, an elementary
action is modeled as a Robot-Task (RT), which is a control law merged
with a logical reactive behavior. RTs are the interface between the
continuous time (in fact sampled time) and the discrete events. The
control laws are built from modules connected through a data-flow
network. The logical reactive behavior is specified in the Esterel
language [BG92]. Such elementary actions are then combined to form
Robot-Procedures (RPs) of growing complexity, up to the final
application. Each RP, also specified in Esterel, describes the behavior
of a robotic mission along with predefined exception handling.
Programming all the logical aspects of the application in Esterel
allows us to benefit from the associated formal proof environment
(FC2Tools and Xeve). A prototype version of Orccad is currently
available. So far, the generated code targets a single processor
running under VxWorks or Solaris.
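To make the Robot-Task notion more concrete, here is a minimal sketch of a control law coupled with a small reactive automaton. It is only an illustration: the class name, the proportional law, the states, and the thresholds are all assumptions, and Python stands in for the Esterel code that Orccad actually uses for the reactive part.

```python
class RobotTask:
    """Hypothetical Robot-Task: a sampled control law plus a reactive automaton."""

    def __init__(self, gain, target, limit):
        self.gain = gain        # proportional gain (assumed control law)
        self.target = target    # set point to reach
        self.limit = limit      # sensor range beyond which the task fails
        self.state = "RUNNING"

    def step(self, measurement):
        """One sampling period: react to events, then compute the command."""
        error = self.target - measurement
        # Discrete-event part: events are derived from the sampled signals.
        if self.state == "RUNNING":
            if abs(measurement) > self.limit:
                self.state = "FAILED"   # exception: sensor out of range
            elif abs(error) < 1e-3:
                self.state = "DONE"     # post-condition reached
        # Sampled-time part: the law is active only while the task runs.
        return self.gain * error if self.state == "RUNNING" else 0.0
```

A Robot-Procedure would then sequence several such tasks according to their final states, for example by starting a recovery task when one of them ends in `FAILED`.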
The SynDEx environment [Sor94] (for Synchronized Distributed
Executive), developed by the Sosso team at Rocquencourt, is a software
environment dedicated to the multi-processor implementation of
synchronous programs. SynDEx supports the "Algorithm/Architecture
Adequation" method. It takes into account the real-time and embedding
constraints that must be satisfied by the application. The application
algorithm is described as a conditioned data-flow graph, either
specified graphically or produced by one of the synchronous compilers
in the DC format. The heterogeneous target architecture is specified
as a network of hardware components, processors, and/or specific
circuits, connected through communication media (links, buses). Fast
heuristics automatically distribute and schedule the program on the
target architecture, as statically as possible, while minimizing its
execution duration as well as the number of necessary components.
Finally, SynDEx generates the minimal real-time distributed executive
required to run the distributed algorithm.
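The idea of statically distributing and scheduling a data-flow graph can be illustrated with a simple greedy list-scheduling sketch. This is not the SynDEx heuristic: the function name, the single fixed communication cost, and the "earliest finish time" rule are assumptions chosen to keep the example short.

```python
from graphlib import TopologicalSorter


def greedy_schedule(deps, durations, comm_cost):
    """Map each operation to the processor where it finishes earliest.

    deps:      {operation: set of predecessor operations}
    durations: {operation: {processor: execution time}} (heterogeneous)
    comm_cost: delay added when a predecessor ran on another processor
    Returns {operation: (processor, start, end)}.
    """
    proc_free = {}  # processor -> time at which it becomes idle
    placed = {}     # operation -> (processor, start, end)
    for op in TopologicalSorter(deps).static_order():
        best = None
        for proc, duration in durations[op].items():
            # Inputs are ready once every predecessor has finished and,
            # if it ran elsewhere, its result has been transferred.
            ready = max(
                (placed[p][2] + (0 if placed[p][0] == proc else comm_cost)
                 for p in deps.get(op, ())),
                default=0,
            )
            start = max(ready, proc_free.get(proc, 0))
            end = start + duration
            if best is None or end < best[2]:
                best = (proc, start, end)
        placed[op] = best
        proc_free[best[0]] = best[2]
    return placed
```

With two independent operations feeding a third one, the sketch places the independent operations on different processors; with a pure chain and a high communication cost, it keeps everything on one processor. The result is a static schedule from which an executive could, in principle, be generated.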
The TOLÈRE research project
Work topics
For each topic, the researchers involved are indicated:
- To deal with the automatic-control/discrete-event aspects
(Sorel and Simon). In numerous computer systems, in particular
in embedded ones controlling physical processes, the computing power
of the machine is shared between a periodic computing task (control
law) and a discrete-event task related to reactive systems (mode
change, exception handling, and so on). Since design safety relies on
the ability to formally verify the program, both aspects must be made
to cooperate and their behaviors must be modeled accurately.
Although they are complementary, both aspects are addressed
separately in the robotics field. It seems interesting to merge them
under a unified model. To do so, several problems must be carefully
addressed:
- modeling of the automatic-control/discrete-event cooperation with
a graph model unifying the data-flow and control-flow
approaches, while taking into account multiple real-time constraints
(i.e., several latency and rate constraints),
- preemption and scheduling of multi-task actions within a
distributed real-time context,
- use of the DC synchronous
format as a common model.
- To propose solutions to make the distributed code fault
tolerant (Sorel and Girault). This involves two aspects: 1) taking
into account faults at the level of the algorithm itself, by modifying
it to provide progressively more degraded modes, and 2) adding hardware
redundancy at the architecture level, with a vote mechanism between
processors and communication media, as well as adding dedicated
hardware components to embed the most critical parts of the
program. For instance, to prevent a faulty processor from blocking the
whole application, we can imagine a sub-network dedicated to the
detection of faults, in charge of switching the application into a
degraded mode. A final point concerns performance, which must remain
at a sufficient level.
- To propose solutions to ensure that the specification of
the problem is complete (Sorel and Simon). Currently,
exception handlers within Orccad are programmed with the Esterel
synchronous language. This specification and coding phase is followed
by a formal verification phase performed with FC2Tools and Xeve. During
this phase, generic properties are automatically checked. However,
application-specific properties must be checked manually, which is of
course less reliable. Taking into account distribution and fault
tolerance constraints will inevitably make things worse. An
alternative is to synthesize the discrete-event part from
symbolic constraints. These synthesis techniques seem to derive from
the Ramadge and Wonham theory [RW87].
- To combine Orccad and SynDEx (Sorel, Simon, and
Girault). The goal here is to propose a single software
environment that is coherent from the system's specification to the
real-time optimized implementation. This software environment also has
to take into account the fault tolerance aspects, both at the hardware
and software levels. This requires studying a way to unify the Orccad
and SynDEx semantics, with respect to the models described in the
previous topics, with DC as an intermediate format. Note that the
Esterel compiler can already produce DC code, and that SynDEx accepts
DC programs as input.
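The vote mechanism between redundant processors mentioned in the fault tolerance topic can be sketched as a strict-majority voter. This is only an illustration under stated assumptions: the function name is hypothetical, a silent processor is represented by `None`, and the decision to switch to a degraded mode is left to the caller.

```python
from collections import Counter


def majority_vote(replies):
    """Return (value, suspects) for a set of replicated results.

    replies: {processor name: value, or None if the processor was silent}
    value is the result reported by a strict majority of all replicas,
    or None when no such majority exists; suspects lists the replicas
    that were silent or disagreed with the majority.
    """
    answered = [v for v in replies.values() if v is not None]
    if answered:
        value, count = Counter(answered).most_common(1)[0]
        if 2 * count > len(replies):
            suspects = sorted(p for p, v in replies.items() if v != value)
            return value, suspects
    # No strict majority: every replica is suspect.
    return None, sorted(replies)
```

With three replicas, one faulty or silent processor is masked and identified; when no majority emerges, the caller gets `None` and can switch the application into a degraded mode.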
Application domains
The two teams are involved in the following domains:
- Underwater robotics: the automatic control of an
underwater system is characterized by many degrees of freedom,
numerous sensors, and the succession of numerous behaviors. An
underwater system is intrinsically distributed (bottom/surface), and
the main fault tolerance constraint is that the loss of the link
between the surface part and the bottom part must not cause the loss
of the bottom part.
- Semi-autonomous vehicles: the loss of one wheel
controller, for instance, should be detected by the other controllers
and should lead to the immediate stop of the vehicle.
Other applications are in the transport (automotive and
aeronautics) field.
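The semi-autonomous vehicle requirement, where the loss of one wheel controller must be detected by the others and stop the vehicle, can be sketched with a heartbeat timeout. The function names, the controller labels, the millisecond timestamps, and the `"STOP"` command are assumptions made for the example, not part of the project's design.

```python
def silent_controllers(last_heartbeat, now, timeout):
    """Names of controllers whose last heartbeat is older than timeout."""
    return sorted(name for name, t in last_heartbeat.items()
                  if now - t > timeout)


def vehicle_command(last_heartbeat, now, timeout, nominal_command):
    """Return the nominal command, or "STOP" if any controller is silent."""
    if silent_controllers(last_heartbeat, now, timeout):
        return "STOP"
    return nominal_command


# Example: timestamps in milliseconds; the rear-left controller went silent.
heartbeats = {"FL": 980, "FR": 990, "RL": 700, "RR": 990}
print(silent_controllers(heartbeats, now=1000, timeout=100))
print(vehicle_command(heartbeats, 1000, 100, "CRUISE"))
```

Each controller would run this check on the heartbeats it receives from its peers, so that detection does not depend on a single supervising site.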
Inria teams involved in the TOLÈRE research project
References
[BG92] G. Berry and G. Gonthier. The Esterel
synchronous programming language: Design, semantics,
implementation. Science of Computer Programming, 19(2):87-152,
1992.
[Hal93] N. Halbwachs. Synchronous
programming of reactive systems. Kluwer Academic Pub., 1993.
[RW87] P.J. Ramadge and W.M. Wonham. Supervisory
control of a class of discrete event processes. SIAM Journal on
Control and Optimization, 25(1):206-230, January 1987.
[SECK93] D. Simon, B. Espiau, E. Castillo, and
K. Kapellos. Computer-aided design of a generic robot controller
handling reactivity and real-time control issues. IEEE Transactions
on Control Systems Technology, 1(4), December 1993.
[Sor94] Y. Sorel. Massively parallel computing
systems with real-time constraints, the "algorithm/architecture
adequation" methodology. In Massively Parallel Computing Systems
Conference, Ischia, Italy, May 1994.