Self-monitoring and self-recovering flight software

Allen Goldberg



Flight software is complex, must be highly reliable, and must be delivered under aggressive cost and schedule constraints. This is particularly true for manned missions, where safety is the primary concern. All fielded software has residual errors in proportion to its size, and the marginal cost of removing these errors grows prohibitively large as residual error rates are driven down. Recognizing these realities, we claim that the very high levels of reliability required for human flight are most economically achieved by overlaying the software with fault detection, isolation, and recovery (FDIR) capabilities. Software FDIR (SFDIR) enables recovery from faults through corrective action, performed either automatically within real-time constraints or with human assistance. SFDIR fixes or contains the impact of faults and reduces the possibility of catastrophic loss.

Hardware FDIR has been implemented in many flight systems, but extending this concept to software raises new challenges. Software has characteristics that differ from those of hardware and that fundamentally affect fault protection. Unlike hardware, software does not wear out. (Computer hardware may be affected by radiation or other physical stresses, but that is not specifically addressed here.) Instead, we focus on the more common design and coding errors. Furthermore, software systems lack a well-defined notion of component and the degree of component independence seen in hardware. A software error in non-critical code can cause loss of mission through deadlock, data or program corruption, or inadequate exception handling. A program error can corrupt the program state, and this corruption may become evident only much later. In our view, these considerations mean that new software architectural concepts must be applied: first, to make the notion of component precise, so that a failure can be attributed to a specific entity; and second, to enable safe recovery strategies, such as component resets or dynamic component replacement.
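As a rough illustration of these two architectural ideas, the following Python sketch treats a component as a named unit with private state, an invariant used for detection, and a known-good initial state used for recovery. All names here (Component, Supervisor, run_cycle, the "logger" and "nav" examples) are hypothetical and chosen only for illustration; this is not the design proposed in this paper, and real flight software would also need timing guarantees that the sketch ignores.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Component:
    """Hypothetical unit of fault attribution with its own private state."""
    name: str
    step: Callable[[dict], None]        # one unit of work on the component state
    invariant: Callable[[dict], bool]   # detection: must hold after every step
    initial_state: Callable[[], dict]   # recovery: known-good state to reset to
    state: dict = field(default_factory=dict)

    def reset(self) -> None:
        """Discard possibly corrupted state and start from the known-good one."""
        self.state = self.initial_state()


class Supervisor:
    """Hypothetical overlay that detects, isolates, and recovers per component."""

    def __init__(self, components: List[Component]) -> None:
        self.components = components
        self.fault_log: List[Tuple[str, str]] = []
        for c in components:
            c.reset()

    def run_cycle(self) -> None:
        for c in self.components:
            try:
                c.step(c.state)
                if not c.invariant(c.state):
                    raise RuntimeError("invariant violated")
            except Exception as exc:
                # Isolation: the fault is attributed to exactly one component.
                self.fault_log.append((c.name, str(exc)))
                # Recovery: reset only that component; the others keep running.
                c.reset()


# Assumed example: a logger whose buffer grows without bound (a resource-usage
# fault) running alongside a healthy navigation counter.
def logger_step(state: dict) -> None:
    state["buffer"].append("telemetry")


def nav_step(state: dict) -> None:
    state["ticks"] += 1


logger = Component("logger", logger_step,
                   invariant=lambda s: len(s["buffer"]) <= 3,
                   initial_state=lambda: {"buffer": []})
nav = Component("nav", nav_step,
                invariant=lambda s: s["ticks"] >= 0,
                initial_state=lambda: {"ticks": 0})

supervisor = Supervisor([logger, nav])
for _ in range(4):
    supervisor.run_cycle()
```

In this sketch the fourth cycle pushes the logger past its assumed buffer limit; the supervisor records the fault against the logger alone and resets it, while the navigation component's state is untouched. Because each component owns its state, a reset cannot disturb its neighbors, which is the property that makes a precise notion of component useful for recovery.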

We propose an approach to SFDIR with the following technical challenges.


One means of realizing such a capability is to bootstrap from existing tools, such as software monitoring with Eagle, design tools, code instrumentation tools, and model-based reasoning tools such as Livingstone II. Concepts and methodologies from IBM's autonomic computing initiative and Stanford's Recovery Oriented Computing are also relevant.

Future flight operations will require very high reliability levels and increasing software autonomy. Not all errors and failures can be prevented, and successful completion of missions will require greater levels of fault protection and recovery than are currently possible. SFDIR increases the reliability of critical flight software by providing reconfigurability, greater margins, and redundancy. Spirit’s flash memory problem (resource usage), Opportunity’s unsafe encounter with a crater top (violation of flight safety rules), and the Ariane 5 overflow (uncaught software exception) are instances of problems that SFDIR may reasonably be expected to detect.

Howard Barringer, Allen Goldberg, Klaus Havelund, Koushik Sen, “Rule-Based Runtime Verification,” VMCAI’03, Venice, Italy, 2003.