Validation of Fault-Free Behavior of a Reliable Multiprocessor System FTMP: A Case Study

The Fault Tolerant Multi-Processor (FTMP) was designed as a highly survivable avionics system. It has been simulated, modeled and emulated extensively for several years. A prototype has been constructed and delivered to the Avionics Integrated Research Laboratory (AIRLAB) at NASA/Langley Research Center. This report summarizes the analysis of several baseline experiments. These experiments were part of a methodology designed to test and verify the fault-free performance of the system at many levels, from hardware through system software. The experiments, while simple, were fundamental to understanding how to measure FTMP performance. From the experiments it was determined that the clock is a valid time measurement device in the tested cases; that instruction execution times are virtually constant; that the R4 frame size is 40 mS with some variation; and that the frame stretching mechanism has flaws in its implementation that allow the possibility of a frame being stretched indefinitely. Future measurements are planned. Some will broaden the results of these initial experiments. Others will measure the system more dynamically. This includes the implementation of a synthetic workload generation mechanism on FTMP.


Introduction
An aircraft of the 1990's will have computer systems that must function correctly for the aircraft to fly. Many studies have been performed on fault tolerant avionics computers. One such study by NASA, in its Aircraft Energy Efficiency (ACEE) program [9], requires that an aircraft computer failure probability be less than 10^-10 per hour. Systems have been built with this goal in mind (SIFT and FTMP [9,4,3]). Techniques must be developed for measuring the performance and reliability of these systems.
Carnegie-Mellon University has been involved in developing methods of validating and measuring multi-processors. In addition, NASA has held several workshops to determine system validation procedures. One workshop in particular [6] produced a detailed list of validation tasks to verify a system in an orderly manner.
This theory is being tested and extended on the Fault Tolerant Multi-Processor (FTMP). These experiments were performed at NASA-Langley Research Center. The computer systems are in a laboratory called the Avionics Integrated Research Laboratory (AIRLAB). AIRLAB is a facility for developing technologies and methodologies to evaluate and integrate avionics and control functions of future aircraft and to establish a store of performance evaluation and reliability evaluation statistics.
Some background is required to understand the experiments that were performed and how they relate to the overall goals of this research. Section 2 explains FTMP and discusses the theoretical environment and methodology for fault-free validation of FTMP. This section also describes the present state of the environment and validation process. Section 3 explains the experiments to measure some of the processor and operating system characteristics of FTMP. In Section 4, the experimental results are presented and conclusions drawn. A summary and discussion of future work is presented in Section 5.

Background
This section first describes the Fault Tolerant Multi-Processor (FTMP). Next, it discusses the theoretical basis for the experiment environment and the methodology for evaluating fault tolerant multi-processors. Then, the present working environment and state of FTMP validation is explained.

The Fault Tolerant Multi-Processor (FTMP)
The Fault Tolerant Multi-Processor (FTMP) has been discussed in several papers and manuals [3,4]. This section will only describe those details necessary for understanding the experimental results. For more details the interested reader is referred to the literature. Figure 2-1 depicts FTMP at the software level (as seen by the application programmers). There are up to three triads, each with local memory. A triad consists of three processors that the programmer sees as a single processor.
A bus connects the triads to global or main memory, I/O devices, a real-time clock and several latches needed for fault handling. The triads only execute independently when accessing local memory.
Work is performed by tasks. A task is a process that can be started independent of other tasks.
Each triad will run tasks according to a schedule. Each task is assigned a time limit. If a task cannot be completed within the time limit, the task is stopped and the next task started.
Tasks are run within frames. Frames also act as a synchronization mechanism between triads. One of the triads becomes the leader; it starts a frame for itself and signals all of the other triads to start the frame. In the time allotted by the frame, the group of working triads must execute all of the tasks they are assigned. The tasks are in a global linked list with each pointing to the next task (except the last, which has a null pointer). The individual triads access the global list to pick up a task.
If there is more than one triad, some tasks will be executed in parallel. When there are no more tasks available for a triad to execute the triad becomes idle until the end of the frame. At that time, a triad becomes leader and starts a new frame.
In FTMP there are actually three frame sizes, each having a different frequency of execution as seen in Figure 2-2. Each triad has separate pointers to tasks for each rate group. The frame sizes are:
• R4, the basic frame size
• R3, equivalent to 2 R4 frames
• R1, equivalent to 4 R3 frames, the 'major' frame
Task execution becomes more complicated with multiple frame sizes. One triad still signals the start of the R4 frame; however, every second R4 frame it also starts an R3 frame and every eighth R4 frame it starts an R1 frame. The order in which a triad executes tasks for the different frame groups is fairly simple. First, it executes all of the R4 tasks, then (still in the same R4 frame) it executes R3 tasks. If the R3 tasks do not finish before the next R4 frame, execution of the R3 task is suspended and another R4 frame is started. Again, when all of the R4 tasks are done, the R3 tasks are continued. If the R3 tasks are finished, the R1 tasks are started. If these tasks are not finished before the beginning of the next R4 frame, they are suspended and restarted after the R4 tasks are done in the next frame. If another R3 frame starts before the R1 tasks finish, the current R1 task is suspended in the triad until time is available in a frame.
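The rate-group discipline described above can be illustrated with a small simulation. This is a sketch only, not FTMP code: task costs, counts and the 40 mS budget are invented for illustration. Within each frame a triad drains R4 tasks first, then resumes suspended R3 work, then R1 work; anything unfinished stays queued for a later frame.

```python
# Sketch of the rate-group priority discipline (illustrative numbers only).

def run_frame(budget, queues):
    """queues maps rate group -> list of remaining task costs (mS).
    Returns the tasks completed this frame; unfinished work stays queued
    (i.e., is suspended until the next frame)."""
    done = []
    for rate in ("R4", "R3", "R1"):      # fixed priority order
        q = queues[rate]
        while q and budget > 0:
            spent = min(q[0], budget)
            budget -= spent
            q[0] -= spent
            if q[0] == 0:                # task finished within this frame
                done.append(rate)
                q.pop(0)
    return done

frames = []
queues = {"R4": [], "R3": [], "R1": []}
for n in range(8):                       # one 'major' (R1) period = 8 R4 frames
    queues["R4"] = [10, 10]              # R4 tasks restart every frame
    if n % 2 == 0:
        queues["R3"].append(30)          # R3 tasks restart every second R4 frame
    if n % 8 == 0:
        queues["R1"].append(70)          # R1 tasks restart every eighth R4 frame
    frames.append(run_frame(40, queues))
```

With these invented costs the R3 task needs two frames to finish, and the R1 task is repeatedly suspended and still has work left at the end of the major period, mirroring the suspension behavior described in the text.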
There is another interesting item concerning frames. According to the documentation, if a task needs more time in a frame, the frame can be stretched as illustrated in Figure 2-3. An R4 frame is stretched by a specific amount and R3 and R1 frames are stretched by giving them more R4 frames.
The third experiment uncovered some interesting properties of this stretching mechanism.
Time is kept using a global clock. The clock has a resolution of .25 mS (that is, each clock tick is .25 mS).

Experiment Environment
This section will briefly outline the environment proposed for the fault-free validation of FTMP. This environment has been developed for the performance evaluation of Cm* at CMU.
Multiprocessor systems are enormously complex. In order to make them easier to comprehend, it is necessary to divide the system into several levels. One can then proceed from the most primitive level upwards to the highest conceptual level by introducing a series of abstractions. Each abstraction contains only information important to its particular level, and suppresses unnecessary information about lower levels. The levels in a digital system frequently coincide with the system's physical boundaries since the concept of levels was utilized by the system's designers to manage complexity.
Once details at one level are comprehended, only the functionality provided for the next higher level need be considered.
There are several activities in the life of an experiment. First, the code has to be designed and written. Next, it must be compiled, followed by loading, debugging, measurement, and analysis.
Another view of the stages of a system's life is how many of these activities the system directly supports for the user.
The following are three representative stages in the evolution of a typical multiprocessor system.

Stage 1 - Standalone
The system is completed through the instruction set level of abstraction. That is, the instruction set has been defined and the hardware has been implemented. There is virtually no software to support user applications. The only software utility would be a loader whereby programs compiled on another machine can be loaded into the system under test. Experiments are limited to simple, regular, compute-bound algorithms. Only a limited number of parameters may be varied, and this variation requires rewriting of the source code of the experiment. There are several attributes of Stage 1 experiments. The programmer must be a hardware expert since there is little software to provide a higher-level virtual (abstract) machine. Hence the program is tied closely to the hardware. The user must specify where code is placed, define the memory map, and write code to initialize the memory, create processes, manage resources, and collect data.
Typical experiments in Stage 1 include:
• Hardware Saturation. Programs consist of two or three instruction loops with variation in placement of code and data. The capacity of various system hardware resources is determined, as well as the impact of contention for those resources.
• Speedup due to Algorithm/Data Variation. Experiments seek the impact of synchronization for data, as well as variation due to data dependencies and size of data.
• Errors. Diagnostic programs can be continuously run and monitored on the system. The distribution of diagnostic-detected errors can be studied.

Stage 2 - Operating System (OS)
The user is presented the abstraction provided by the executive software. This software provides basic functions such as resource management and scheduling. In programming experiments, the user employs operating system primitives. Hence, the user needs substantial operating system expertise. Also characteristic of this stage is the discrete, incremental nature of the experimentation process; each experiment represents one point in the design space.
The attributes of Stage 2 applications can be stated as follows:
• very regular, data bound, with limited variation of parameters
• the general program organization has a Master process controlling a collection of Slave processes doing the actual computation
• code is replicated
• heavy use of OS mechanisms
Typical experiments are:
• Measurements of the cost-per-feature of the operating system's functions. Experiments statically exercise each OS function on a one-by-one basis. Examples include: memory management, communication primitives, synchronization, scheduling and exception handling.
• Measurements of different implementations of parallel algorithms. The impact of using various strategies in parallel program organization, data structure and resource allocation is studied.

Stage 3 - Integrated Instrumentation Environment
At this stage hardware and software have been provided for generating experimental stimulus, dynamically observing hardware and software activities, and analyzing results [8].
With this enhanced support the user can experiment at the application level of abstraction with full variation of parameters. A major characteristic of this stage is the provision of stimulus generation, monitoring, data collection and analysis grouped under a unique user interface. Also the OS, the support software and the user application are uniformly instrumented enabling improved behavior visibility.
Only with this capability does the interaction between the OS, support software and user application become measurable with acceptable effort. Hence, the programmer can be a relative system novice.
Examples of experiments at this stage include:
• Comparison of various OS policies as reflected by classes of applications.
• Tuning a virtual machine for a specific application.
• Designing application-oriented architectures.
• Study of multiprocessor intercommunication strategies.
• Validation of fault-free performance of an emulated system.
• Study of architectural effectiveness and efficiency.
The handling of faults represents additional load for the avionics system. The fault handling capabilities represent another aspect of system functionality. Whereas a system without faults may be able to meet all of its deadlines, the addition of fault handling workload may cause schedule slippage and/or violations of real-time constraints.
A key part of the Stage 3 methodology is the specification and generation of a controlled parallel workload [5]. Such a workload for avionics applications is given in [1]. The workload is represented as a special-purpose parallel data-flow graph. A run-time experimentation environment provides the capability of controlling, varying and measuring the workload without having to recompile or re-debug the parallel program.

Proposed Methodology
NASA held several workshops to determine system validation procedures. One in particular [6] produced a detailed list of validation tasks to verify a system in an orderly manner. The list was as follows: The first six tasks verify functionality and the next seven verify fault handling performance. Several of these tasks are in process or have been completed. This paper specifically deals with a set of performance baseline experiments (Task 6) for FTMP.

Present and Future Experiment Environment
A significant amount of work was required by AIRLAB personnel to bring the system environment up to Stage 2. At present, each experiment generally requires some code compilation, followed by linkage and downloading of the whole FTMP binary file. Experiments can be designed with modifiable variables so that some variation can be made without having to go through the entire code development cycle. The experiments described in this paper used the modifiable variable approach.
A more specialized workload generation mechanism is being developed for use on real-time multiprocessor systems (FTMP in particular [1]). With this mechanism in place, experiments can be run in an environment somewhere between Stage 2 and Stage 3. This model considers tasks of a specific organization and deals with a simple set of parameters. The system is assumed to be made up of a bus with several processors (each with local memory), one global memory, and I/O.
Operating system tasks are considered part of the system under measurement. Traffic on the bus is restricted to I/O and Inter-Process Communication (IPC), each of which accesses memory.
In this real-time model, tasks are made up of five sections. These sections include read in I/O data, read in IPC data, perform some representative operation, write out I/O data and write out IPC data.
The amount of work performed in each section can be varied by parameters.
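The five-section task shape just described can be sketched as follows. This is a Python stand-in, not the actual workload generator: the function names, the dict standing in for the bus, and the unit-of-work loop are all invented for illustration; only the five-section structure and its parameterization come from the text.

```python
# Sketch of the five-section real-time task (assumed names; the real
# generator runs on FTMP). Each section's amount of work is a parameter.

def make_task(io_in, ipc_in, ops, io_out, ipc_out):
    def task(bus):
        data = [bus["io"].pop(0) for _ in range(io_in)]     # 1. read in I/O data
        data += [bus["ipc"].pop(0) for _ in range(ipc_in)]  # 2. read in IPC data
        acc = 0
        for _ in range(ops):                                # 3. representative operation
            acc += 1
        bus["io"].extend(data[:io_out])                     # 4. write out I/O data
        bus["ipc"].extend([acc] * ipc_out)                  # 5. write out IPC data
        return acc
    return task

bus = {"io": [1, 2, 3], "ipc": [4, 5]}
t = make_task(io_in=2, ipc_in=1, ops=10, io_out=1, ipc_out=1)
result = t(bus)
```

Varying the five parameters varies the work done in each section, which is the control knob the workload model provides.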
The workload structure was designed for simplicity so that variations in the workload parameters and the resulting measurements could be easily understood. The system parameters consist of total I/O, total IPC, and total instruction execution per second. Each system parameter is divided between functions as a percentage of the total work each function performs. Each function is in turn made up of tasks which divide the work of the function as evenly as possible. Measurement of the throughput, system utilization and interaction of the system is done by using the system clock to measure when a task begins and ends. The specific experiments that are reported upon in this paper are:

Present State of Validation Process
• Clock Read Delay. In order for subsequent experiments to be valid, the delay and variation in reading the clock must be determined.
• Processor Performance for Simple Operations. This is a measure of the amount of time required by the processor to perform simple AED instructions, for example 'A = B' or 'A = B + C'.
• R4 Frame Iteration Rate. The measurement of the R4 frame under nominal conditions as well as when stressed by long tasks.

The Experiments
Most of the experiments were organized as in Figure 3-1. The time limit for each experiment task must be large enough that the task can finish. Also, some of the experiment tasks must finish before the 60% mark of the R4 frame has been reached. According to the documentation, after that time an interrupt will occur. The interrupt would invalidate any time measurements.
Using the VAX/FTMP interface program called CTA and a batch command script, the values in the FTMP global memory can be read and stored in a file on the VAX. CTA can also set FTMP memory locations. Therefore, a command script can set CMU.EXEC to 0, wait, read the global array and repeat as many times as desired.
It is important to note that the experiments allow the number of iterations to be changed using CTA.
For example, if the number of iterations is found to be too small to obtain useful results, CMU.RTCNUM can be increased using CTA and the experiment can be run again. Changes can be made without having to recompile, relink and reload FTMP. This saves a great deal of time.
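The data-collection cycle described above can be sketched as a host-side loop. This is purely hypothetical code: `write_word`/`read_word` stand in for CTA's actual set and read commands, and a plain dict stands in for FTMP global memory; only the arm-wait-read-repeat structure comes from the text.

```python
# Hypothetical host-side sketch of the CTA batch-script loop.
import time

memory = {"CMU.EXEC": 1, "CMU.DATA": [0] * 16}   # stand-in for FTMP global memory

def write_word(addr, value):          # stand-in for a CTA set command
    memory[addr] = value

def read_word(addr):                  # stand-in for a CTA read command
    return memory[addr]

def collect(runs, wait_s=0.0):
    """Arm the experiment, wait for the frames to run, dump the global
    array, and repeat as many times as desired."""
    log = []
    for _ in range(runs):
        write_word("CMU.EXEC", 0)                 # arm the experiment task
        time.sleep(wait_s)                        # let the experiment run
        log.append(list(read_word("CMU.DATA")))   # read the global array
    return log

log = collect(3)
```

Because only memory words are written, parameters such as the iteration count can be changed between runs without recompiling, relinking or reloading FTMP.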

Clock Read Time Delay
In order for any subsequent experimental results to be considered valid, the characteristics of the clock must be determined. The delay and variation in reading the clock must be determined, as well as the causes of any variations. If these variations cannot be characterized or minimized, any further experiments using the clock would be suspect. For example, in the Cm* multiprocessor system there was as much as a 4.6% difference in clock frequency, and substantial variation of clock read delays [5].
An example of possible characteristics of clock read delay would be a constant offset that could be subtracted from any future experiment results using the clock.
On FTMP the time is read with the instruction 'HREAD(RT.CLOCK, variable, 2);'. In the experiment task, 16 iterations of 5 clock reads each were made, with the time before starting and the time of the last read being stored in global memory. A second task was created that did exactly the same as the first task except that it did not write to global memory. This second task was placed so that it would start execution immediately after the measurement task. If two triads were in use, the second task would execute in parallel with the first and add contention for the clock.
The experiment was repeated about 100 times for three situations:
1. Triad 1 running alone,
2. Triad 2 running alone,
3. Triads 1 and 2 running simultaneously (contention for the clock).
The first two runs determined single-triad clock read time with no contention, and variation between triads. The third case determined how clock contention affects the clock read time.

Instruction Times
The times for the following AED instructions were measured:
1. Null;
2. A = B; (integer assign)
3. A1 = B1; (real assign)
4. A2 = B2; (long assign)
5. A = B + C; (integer add)
6. A1 = B1 + C1; (real add)
7. A2 = B2 + C2; (long add)
8. A = B * C; (integer multiply)
Each of these instructions was executed in a loop 100 times along with the instruction 'A = B;'. The 'A = B;' instruction was added because the compiler would not accept a null statement for the first instruction. The 'Null' statement was included so that the overhead from clock reading and loop control can be eliminated from the other instructions, leaving only the time for instruction execution.
This task was executed 308 times.
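The overhead-subtraction method above can be sketched in Python. This is not the AED experiment itself: `timeit` stands in for the clock-read/loop harness, and the filler assignment plays the role of the extra 'A = B;'; only the method (time the statement plus filler, time the filler alone, attribute the difference to the statement) mirrors the paper's approach.

```python
# Sketch of timing-by-subtraction: loop cost with the instruction minus
# the cost of the "null" (filler-only) loop isolates the instruction time.
import timeit

def instruction_cost(stmt, loops=100, reps=308):
    with_stmt = min(timeit.repeat(stmt + "; a = b", setup="a=1; b=2; c=3",
                                  number=loops, repeat=reps))
    null_loop = min(timeit.repeat("a = b", setup="a=1; b=2; c=3",
                                  number=loops, repeat=reps))
    return (with_stmt - null_loop) / loops   # seconds per instruction

add_cost = instruction_cost("a = b + c")
mul_cost = instruction_cost("a = b * c")
```

Taking the minimum over many repetitions (308 here, echoing the task's execution count) reduces the influence of interference from other activity.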

Measuring R4 Frame Size
There were three parts to this experiment. The first part was to measure the nominal R4 size. Upon starting, the first R4 task reads the real-time clock. If the global variable CMU.EXEC is set correctly, this time will be stored. This will be done eight consecutive times, giving eight relative R4 frame sizes.
The experiment was run about 100 times each for one triad and two triads.
The second part of the experiment was to determine how the system behaved when an R4 frame was stretched. Only one triad was used. The time limit for the task was set to a very large number (so that the task would not abort before it was finished). Finally, RTC.NUM had to be set to several values that would stretch the frame. These values were 2000, 3000 and 5000. Data was recorded about 100 times for each iteration value.
The last part of the experiment was to determine the effect on the system of an R4 frame with an infinite number of tasks. One of the words of control information associated with a task was a pointer to the next task. An infinite string of tasks could be generated by having a task point to itself as the next task. One R4 task was caused to execute over and over in this way. Another task was checked to see if it ran once the R4 task started repeating.
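The self-pointing trick described above can be sketched with a minimal linked task list. The `Task` class and dispatcher loop are invented stand-ins for FTMP's control words and dispatcher; the step cap exists only so the demonstration terminates.

```python
# Sketch of the global task chain: each task's control word points to the
# next task; a task pointing to itself yields an infinite string of tasks.

class Task:
    def __init__(self, name):
        self.name = name
        self.next = None          # null pointer normally ends the chain

def dispatch(head, max_steps=10):
    """Walk the chain like the dispatcher; cap the steps to expose the loop."""
    run, task = [], head
    while task is not None and len(run) < max_steps:
        run.append(task.name)
        task = task.next
    return run

a, b = Task("A"), Task("B")
a.next = b                        # normal chain: A -> B -> null
normal = dispatch(a)

a.next = a                        # A points to itself: 'infinite' task string
stuck = dispatch(a)               # task B is never reached
```

With the self-pointer in place the dispatcher never advances past task A, which is exactly the condition the experiment used to test whether another task would ever run.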

Read Time Clock Delay
The clock read overhead was virtually constant for a single-triad configuration. The data never varied more than a clock tick (.25 mSec). For the two different triads the results were 859.3 ± .68 µSec for the five reads, or 172 ± .14 µSec per read. A read with no contention on the bus requires 172 µSeconds. Although there is an indication of some variation between triads, it is not significant and within the margin of error for a 95% confidence interval.
In the second measurement, two triads were started, each executing roughly the same code so that contention for the bus is created.
It is evident that the contention for the clock at this rate does not affect the delay in reading the clock greatly (less than 1%). The contention is, however, large enough that the ranges of the 95% confidence intervals for the single-triad read time and double-triad read time do not overlap. These results do not take into account other contention for the bus, such as memory access or I/O device access.
The reason that this variation is so small is that the section of code in the read procedure that actually uses the bus is a small percentage of the whole clock read procedure. Since both contending procedures are exactly the same when in the iteration section, they will tend to be synchronized so that only one will actually request control of the bus at a time. The slight variation from the single triad case could be due to slight variations in the execution rates of the different processors so that occasionally the two triads do conflict. However, this would seem to be very minor.
On the whole, the real-time clock on FTMP should serve as a reliable measurement device with predictable delays that can be factored out of experiments. This is especially true in the single-triad case. However, this assumes that the experimenter has complete control of all of the tasks. If an experimenter on a system with multiple triads lets one triad run uncontrolled, the clock results may not be reliable. The range of system activities under which the clock times are repeatable should be explored further.

Instruction Measurement
The results of the instruction measurements were as follows: the instruction 'A = B + C;' actually used four assembly instructions. Therefore, a rough assembly-level throughput would be 150 KOPS.

Measuring R4 Frame Size
In the first part of the experiment the time that represented one R4 frame varied considerably. Since time was measured at the beginning of the first R4 task of a frame, these variations are probably due to variations in the execution time of the R4 dispatcher. There may be predictable variation caused by the starting of R3 and R1 tasks. However, this could not be determined without a large number of consecutive R4 frame start times, which was not done. The distributions of frame sizes are shown in Figures 4-1 and 4-2.
The distribution looks approximately normal except that the frame sizes near the average occur less frequently than would be expected.
In the second part of the experiment the R4 frame was stretched. In all of these runs at least 70% of the data points were in the two smallest frame sizes of the total range. The rest of the data points were spread out over the rest of the range, as seen in Figures 4-3, 4-4 and 4-5. The task that is stretching the frame can be considered to take constant time, since the major portion is a loop of 'A = B;' which has been shown to be very predictable (see experiment 2).
This would indicate that the variation might be due to some kernel procedures or the R4 dispatcher.
The large number of data points in the first two consecutive frame sizes might indicate a basic frame size when no unusual situations occur. The other data might suggest exceptions that caused some extra instructions to be executed. The range of the data is about 2.5 mSeconds. Using the approximations from experiment 2, the amount of extra code that would be executed during these unusual situations is on the order of 100 AED instructions, or 400 assembly-level instructions.
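The order-of-magnitude estimate above can be checked arithmetically from the experiment-2 numbers (roughly 150 KOPS assembly throughput and about 4 assembly instructions per AED statement); the numbers below are taken from the text, not newly measured.

```python
# Back-of-the-envelope check of the ~100-AED-instruction estimate.

assembly_per_second = 150_000     # rough assembly throughput from experiment 2
asm_per_aed = 4                   # assembly instructions per AED statement
spread_s = 2.5e-3                 # observed 2.5 mS range of frame sizes

asm_instructions = spread_s * assembly_per_second   # ~375 assembly instructions
aed_instructions = asm_instructions / asm_per_aed   # ~94 AED instructions
```

Both figures round to the "order of 100 AED instructions or 400 assembly-level instructions" quoted in the text.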
When the average times were plotted against the iteration rate, a linear relation emerged as depicted in Figure 4-6. The step function (also shown on the graph) was expected because the documentation discusses a timer interrupt that, according to our interpretation, was supposed to happen every 24 mSeconds. In fact, after the first timer interrupt, 24 mSeconds into the R4 frame, the timer was not used until all of the R4 tasks finished. Therefore, the size of the frame would increase linearly above 40 mS.
The final part of the experiment was to determine the behavior of the FTMP system when it executed an 'infinite' number of R4 tasks. In the experiment an R4 task pointed to itself as the next task. If there were no mechanism for aborting a frame, as would be expected from the previous section of the experiment, the R4 frame would continue forever. This could be demonstrated by watching the