Self-monitoring and self-recovering
flight software
Allen Goldberg
Flight software is complex, must be highly reliable, and must be
delivered under aggressive cost and schedule constraints. This is particularly true
for manned missions with safety being the primary concern. All fielded
software has residual errors in proportion to its size, and the
marginal cost to remove these errors grows prohibitively large as
residual error rates are driven down. Recognizing these realities, we
claim that the very high levels of reliability required for human
flight are most economically achieved by overlaying the software with
fault detection, isolation, and recovery capabilities. Software FDIR
(SFDIR) enables recovery from faults with corrective action, performed
either automatically within real time constraints or with human
assistance. SFDIR fixes or contains the impact of faults and reduces
the possibility of catastrophic loss.
Hardware FDIR has been implemented in many flight systems, but
extending this concept to software raises new challenges. Software has
characteristics that differ from those of hardware and that fundamentally affect
fault protection. Unlike hardware, software does not wear out.
(Computer hardware may be affected by radiation or other physical
stresses, but that is not specifically addressed here.) Instead we
focus on the more common design and coding errors. Furthermore,
software systems do not have a well-defined notion of component, nor the
degree of component independence seen in hardware. Note that a software
error in non-critical code has the potential to cause loss of
mission, due to deadlock, data or program corruption, or inadequate
exception handling. A program error can corrupt the program state, and
this corruption may become evident only much later. In
our view, these considerations mean that new software architectural
concepts must be applied: first, to make the notion of component
precise, so that a failure can be attributed to a
specific entity, and second, to enable safe recovery strategies, such as
component resets or dynamic component replacement.
We propose an approach to SFDIR that addresses the following technical challenges.
- Detection: With the aid of automated code instrumentation, detect a
wide range of faults against explicit safety requirement models. There
are existing tools that can be leveraged for this task. For example,
SpecTRM-RL is a TRL 5 software requirements modeling and validation
environment. It
includes an easily learned language for formally modeling software
requirements. These models are executable and formally analyzable.
They therefore provide early requirements validation through both simulated
execution and formal analysis for correctness, consistency,
completeness, and safety. We focus on specification and monitoring of
environmental constraints (i.e., that environmental conditions stay within
the expected envelope), resource constraints (usage of both
computational resources such as memory and network, and spacecraft
resources such as power, life support, and communication), and software
component interface behavior (messages are received within expected
bounds, and there are no deadlocks or livelocks). Constraints expressed in SpecTRM-RL are “compiled”
into monitoring code that is merged into the flight code. The
monitoring code generates an event stream fed to our Eagle monitoring
system to check for constraint violations.
- Isolation: Use of
model-based reasoning, software architectures and program analysis to
trace from symptom to source of error. We shall develop architectural
and system concepts that allow true component separation and component
non-interference. Hardware engineers strive for designs where a failure
in one component can be isolated; but independent software components
running on the same processor do not enjoy such isolation. Security
work on separation kernels and firewalls provides guidance in achieving
isolation. Identification of the faulty software component is achieved
by model-based diagnosis combined with program analysis techniques such
as program slicing and control and data flow analysis. We can leverage
model-based reasoning tools such as NASA ARC's Livingstone II.
- Recovery: Dynamic software reconfiguration to recover from errors in
a safe way, preserving or replacing as much functionality as possible.
We propose conservative recovery strategies, including sending
diagnostic data to crew or ground, reinitializing components (micro
reboot), reconfiguring with reduced functionality, and installing a
predefined replacement component. An overriding imperative in all cases
is “to do no harm.”
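The detection and recovery items above can be illustrated with a minimal sketch, under several assumptions. It does not use SpecTRM-RL, Eagle, or Livingstone II: the constraints are plain Python predicates rather than compiled requirement models, the event types and component names ("mem_used_mb", "msg_latency_s", "nav", "comm") are hypothetical, and isolation is reduced to trusting the component tag carried by each event, where a real system would apply model-based diagnosis and program analysis. The escalation policy (one step up the list of conservative responses per repeated violation by the same component) is likewise an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Event:
    component: str  # hypothetical name of the component that emitted the event
    kind: str       # hypothetical event type, e.g. "mem_used_mb"
    value: float    # measured quantity carried by the event


@dataclass
class Constraint:
    name: str
    kind: str                       # event type this constraint applies to
    holds: Callable[[Event], bool]  # True when the event satisfies the constraint


# Conservative responses named in the Recovery item, mildest first.
RECOVERY_LADDER = [
    "send_diagnostics",
    "reinitialize_component",
    "reduce_functionality",
    "install_replacement",
]


@dataclass
class Monitor:
    constraints: List[Constraint]
    violations: Dict[str, int] = field(default_factory=dict)  # count per component

    def observe(self, event: Event) -> Optional[Tuple[str, str, str]]:
        """Check one event; return (component, constraint, action) on a violation."""
        for c in self.constraints:
            if c.kind == event.kind and not c.holds(event):
                seen = self.violations.get(event.component, 0)
                self.violations[event.component] = seen + 1
                # Escalate one rung per repeated violation, capped at the last rung.
                action = RECOVERY_LADDER[min(seen, len(RECOVERY_LADDER) - 1)]
                return (event.component, c.name, action)
        return None


# Hypothetical constraints: a memory budget and a message deadline.
monitor = Monitor([
    Constraint("memory_budget", "mem_used_mb", lambda e: e.value <= 64.0),
    Constraint("message_deadline", "msg_latency_s", lambda e: e.value <= 0.5),
])

# Hypothetical event stream, as instrumented flight code might emit it.
stream = [
    Event("nav", "mem_used_mb", 40.0),
    Event("nav", "mem_used_mb", 70.0),
    Event("comm", "msg_latency_s", 0.2),
    Event("nav", "mem_used_mb", 72.0),
]

responses = [r for r in map(monitor.observe, stream) if r is not None]
for response in responses:
    print(response)
```

In this sketch the monitor only recommends an action and never executes one, which keeps the checking logic separate from anything that could itself disturb the flight code, in the spirit of the "do no harm" imperative.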
One means of realizing such a capability is to bootstrap from existing
tools, such as software monitoring with Eagle, design tools, code
instrumentation tools, and model-based reasoning tools such as
Livingstone II. Concepts and methodologies from IBM's autonomic
computing initiative and Stanford's Recovery-Oriented Computing are
also relevant.
Future flight operations will require very high reliability levels and
increasing software autonomy. Not all errors and failures can be
prevented, and successful completion of missions will require greater
levels of fault protection and recovery than are currently possible.
SFDIR increases reliability of critical flight software by providing
re-configurability and greater margins and redundancy. Errors such as
Spirit’s flash memory problem (resource usage), Opportunity’s
unsafe encounter with a crater top (violation of flight safety rules),
and the Ariane 5 overflow (uncaught software exception) are instances of
problems that SFDIR may reasonably be expected to detect.
Howard Barringer, Allen Goldberg, Klaus Havelund, Koushik Sen,
“Rule-Based Runtime Verification,” VMCAI’03, Venice, Italy, 2003.