Mitigation of Radiation Effects in SRAM-Based FPGAs for Space Applications

The use of static random access memory (SRAM)-based field programmable gate arrays (FPGAs) in harsh radiation environments has grown in recent years. These types of programmable devices require special mitigation techniques targeting the configuration memory, the user logic, and the embedded RAM blocks. This article provides a comprehensive survey of the literature published in this rich research field during the past 10 years. Furthermore, it can also serve as a tutorial for space engineers, scientists, and decision makers who need an introduction to this topic.


INTRODUCTION
Electronics on board modern spacecraft comprise a considerable number of field programmable gate array (FPGA) devices. Although in the early days these devices were mainly used to implement rudimentary glue logic, they enable far more complex operations today. Regardless of the application, such as science, Earth observation, or military surveillance, a trend to ever-increasing payload data volumes can be observed. Thus, data processing in space can be essential for some missions, as payload data downlinks can be too slow to transmit these growing data volumes, even if data compression techniques are applied.
Many payload data processing applications benefit from an efficient implementation in hardware using programmable logic devices. Modern static random access memory (SRAM)-based FPGAs offer huge amounts of logic resources, allow fast clocking, and can be quickly reconfigured, which makes them ideal platforms for the implementation of such algorithms. However, they are prone to radiation effects in space because the state of their memory cells can be flipped by single and multiple event upsets caused by radiation. Hence, design techniques to mitigate radiation effects must be applied to these devices. In the past 10 years, research on such mitigation methodologies has established its own rich field, which we thoroughly survey in this article.

This work is supported by the European Space Agency under the NPI Programme, Airbus Defence and Space UK, and the University of Leicester. Authors' addresses: F. Siegle and T. Vladimirova, Embedded Systems Group, University of Leicester, UK; email: {fs131, tv29}@le.ac.uk; J. Ilstad, ESA-ESTEC, Noordwijk, The Netherlands; email: jorgen.ilstad@esa.int; O. Emam, Airbus Defence and Space, Stevenage, UK; email: omar.emam@astrium.eads.net.
The article is structured as follows. Section 2 answers the question as to why SRAM-based FPGAs are so popular in space engineering and discusses radiation effects in space as well as their effects on SRAM-based FPGAs. Then, in Section 3, terminology, failure modes, and mitigation techniques are outlined. In Section 4, mitigation techniques applied during runtime operation are reviewed, dividing this huge field into three areas: techniques aimed at the (i) configuration memory, (ii) user logic, and (iii) embedded RAM blocks. A brief survey of methodologies that can be applied during design time is then given in Section 5. Section 6 covers the simulation and emulation of radiation effects, including accelerated radiation testing and fault injection. Section 7 is dedicated to purpose-built hardware platforms that have been used in research projects on SRAM-based FPGAs for space applications. Section 8 provides a summary of the reviewed techniques as well as design recommendations. Finally, Section 9 concludes the article.

Why SRAM-Based FPGAs in Space?
FPGAs are commonly used on board spacecraft. The importance of these devices for space applications is illustrated by the figures given in Table I, which show that most integrated circuits (ICs) on board the Sentinel-2 spacecraft, a current mission of the European Space Agency, are FPGAs. As evidenced by Table I, although application-specific integrated circuits (ASICs) and microcontrollers still play an important role for platform applications, payload processing applications are mainly implemented with FPGAs.
Today, three main types of space-qualified FPGA technologies are employed in commercial products. The most common technology is antifuse, which is used by one-time programmable FPGAs. One advantage of these devices is their natural tolerance against radiation effects because the hardware configuration is fixed. In principle, these devices can still suffer from single event upsets (SEUs) in user logic and embedded RAM cells. However, radiation-tolerant versions are available that offer hardened user flip-flops by design, such as the RTAX and RTSX devices by Microsemi. The second technology is based on SRAM; that is, the configuration of the FPGA is stored in volatile memory cells. An obvious benefit of these devices is the possibility to reconfigure the hardware in later design or even mission stages. Furthermore, some of these devices, such as the newer Virtex-4 and Virtex-5 FPGAs by Xilinx, offer high performance and a large amount of logic and embedded memory resources as well as dedicated digital signal processing (DSP) blocks. In contrast to their antifuse counterparts, SRAM-based FPGAs are in principle more susceptible to SEUs because the hardware configuration can be altered by radiation effects (e.g., Virtex-4QV and earlier devices by Xilinx). However, FPGAs with radiation-hardened configuration memory are also available, such as Virtex-5QV devices by Xilinx or ATF280 devices by Atmel.
Recently, flash memory-based FPGAs are also being considered for use in space projects, such as the ProASIC3 device by Microsemi. Similar to SRAM-based FPGAs, they can be reconfigured and offer good performance. The use of such devices on long space missions is, however, problematic due to their rather low immunity to the total ionizing dose (TID) effect and single event latchups (SELs) [Microsemi 2012].
Internal studies at the Jet Propulsion Laboratory (JPL) estimate that the raw, uncompressed data captured from spectroscopy instruments on board recently proposed U.S. missions could reach 1 to 5 TB per day [Norton et al. 2009].
It is therefore recommended to drastically reduce the data volume that must be stored on board and later transmitted to Earth by transforming the raw measurements of payload instruments into intermediate results through onboard processing. In a technology assessment, NASA scientists also found that Xilinx FPGAs are better suited for such high-performance tasks than single-board computers and DSP processors due to their flexibility and their embedded DSP blocks. Apart from the increase in performance, SRAM-based FPGAs offer the capability of being reconfigured, a feature not to be underestimated for space projects. Pingree [2010] describes the typical problem of one-time programmable FPGAs. For one of the instruments on the NASA Juno spacecraft to Jupiter, the engineers had to design and program the FPGA 2 years before launch. Since the FPGA was one-time programmable, it could not be changed or improved without a high impact on the project cost and schedule. Furthermore, as the spacecraft travels for 5 years to Jupiter, instrument calibration activities may be required during that time, in which the FPGA design cannot be changed. With SRAM-based FPGAs, however, hardware updates could be easily applied in later design stages or even in flight.
Most recent publications are concerned with the SRAM-based Xilinx Virtex-4QV and Virtex-5QV FPGAs, as they are the only fast SRAM-based FPGAs at this time that are available in space-qualified versions. Therefore, in the following, the focus is on these devices.

Radiation Effects in SRAM-Based FPGAs for Space
2.2.1. Sources of Radiation Effects. The space radiation environment comprises a large range of energetic particles with energies from several keV up to GeV and beyond. The main elements are as follows [Holmes-Siedle and Adams 1993; ECSS 2008b]:
-Trapped radiation: Energetic electrons and ions are magnetically trapped in the so-called Van Allen radiation belts, which extend from 100km to 65,000km and consist mainly of electrons of up to a few MeV and protons of up to several hundred MeV energy. The Earth's magnetic field is not symmetrical, leading to local distortions. One important distortion is known as the South Atlantic anomaly. Spacecraft passing this area are exposed to an increased level of radiation.
-Galactic cosmic rays: These rays are high-energy charged particles that enter the solar system from outside and are composed of protons, electrons, and fully ionized nuclei.
-Solar energetic particles: These particles are encountered in interplanetary space and close to Earth and are seen in short bursts associated with solar flares and other solar activity. The duration of such bursts can be a few hours up to several days. They consist of protons, electrons, and heavy ions in the energy range of a few tens of keV to GeV and beyond.
In addition, secondary radiation is generated by the interaction of energetic particles with materials. One example is bremsstrahlung, a high-energy electromagnetic radiation that is caused by the deceleration of a charged particle in materials.

2.2.2. Radiation Effects. An overview of common radiation effects that must be mitigated in SRAM-based FPGAs is given in Figure 1. The main effects are as follows:
-TID effect: Ionization of electronic components is caused by electrons, protons, and bremsstrahlung and leads to degradation due to increasing leakage currents and other effects [ECSS 2008a]. Processes that cause ionization are based on photon interaction and include the photoelectric effect, Compton effect, and pair production, all leading to the production of free electrons and hole-electron pairs [Messenger and Ash 1992]. The accumulation of these effects is referred to as TID and is usually measured in krad, with 1 rad = 10⁻² Gy = 6.24 · 10⁷ MeV/g [Dierker 2007]. For space-qualified Virtex-4QV and Virtex-5QV devices, the TID is of no concern because the tolerated dose is guaranteed to be 300 krad for Virtex-4QV devices [Xilinx 2010] and 1 Mrad for Virtex-5QV devices [Xilinx 2012a].
-SEL: An SEL is a potentially destructive single event effect (SEE) that can trigger parasitic PNPN thyristor structures in a device [ECSS 2008a]. Similar to the TID effect, SELs are of no concern for Virtex-4QV and Virtex-5QV devices, as both devices have a guaranteed latchup immunity up to an LET > 100 MeV·cm²·mg⁻¹ [Xilinx 2010, 2012a].
-SEU: This class of SEE is a soft error that changes the state of a bistable element. It is triggered by heavy ions and protons and results from ionization by a single energetic particle or the nuclear reaction products of an energetic proton. The ionization induces a current pulse in a p-n junction whose charge may exceed the critical charge that is required to change the logic state of the element [Dodd et al. 2004].

2.2.3. Single Event Effect Rates. SEU rates in FPGAs depend on the particular device component in which they occur (see Section 3.2). The mitigation strategy for Virtex-4QV FPGAs must mainly take into account nondestructive SEEs, as these devices are tolerant to accumulated ionization and SELs. In contrast, Virtex-5QV devices are radiation hardened by design. This was achieved by replacing the configuration memory and flip-flop cells with dual-node counterparts that require charge collection in at least two active nodes before an upset can occur. Furthermore, all flip-flop inputs are protected by SET filters, and TMR is applied to control circuitry and registers [Swift and Allen 2012]. Static and dynamic cross-sections for most FPGA blocks of the Virtex-4QV family can be found in Allen [2009] and Allen et al. [2008]. Using these cross-sections, SEU and SEFI rates can be calculated for a particular design and orbit. For European space projects, the necessary calculation methods are standardized in ECSS [2008a, 2008b]. A tool that greatly simplifies the SEU prediction according to these standards is OMERE, which was developed by the French company TRAD with support from the French space agency CNES [TRAD 2014].
SEU and SEFI rates for several orbits in quiet solar maximum conditions have been calculated using the CREME96 model. For illustration, rates for two orbits are given in Table II. The first one is a low Earth orbit (LEO) at 800km altitude with an inclination of 22.0°, and the second one is a geostationary Earth orbit (GEO) at 36,000km. The FPGA type is a XQR4VSX55, and it is assumed that all memory cells are used; that is, the upset rates per bit-day are scaled to the whole device. It can be seen that the likelihood of SEFIs is low, with approximately one SEFI every 36 years in LEO and every 103 years in GEO.
Assuming that all flip-flop cells are used, the chance of an upset in these elements is far below 0.1 upsets per device-day. In contrast, if a design heavily utilizes block RAM (BRAM) blocks (in this example, all blocks are used), the probability of an upset is more than 400 times higher than for a flip-flop upset due to the high ratio of BRAM cells to flip-flop cells. For the configuration memory cells, the ratio is even larger: in LEO, more than 7.5 upsets can occur per device-day. It is, however, assumed that all configuration memory cells are utilized, which is unrealistic for a real design.
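The scaling from per-bit upset rates to whole-device rates can be sketched as follows. The per-bit rates and resource counts below are illustrative placeholders, not the values behind Table II.

```python
# Scale per-bit SEU rates to whole-device rates.
# All rates and resource counts are illustrative placeholders only.

def device_rate(upsets_per_bit_day: float, num_bits: int) -> float:
    """Upsets per device-day, assuming every bit is utilized."""
    return upsets_per_bit_day * num_bits

resources = {                      # (per-bit rate, bit count) - assumed
    "flip-flops":           (1.0e-7, 50_000),
    "BRAM cells":           (1.0e-7, 6_000_000),
    "configuration memory": (3.0e-7, 25_000_000),
}

for name, (rate, bits) in resources.items():
    print(f"{name}: {device_rate(rate, bits):.3f} upsets per device-day")
```

Because BRAM and configuration memory dwarf the flip-flop count, their device-level rates dominate even when the per-bit rates are comparable.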
The preceding results show that mitigation techniques must mainly focus on configuration memory and BRAM upsets. Although SEFIs occur only rarely, they can necessitate an undesired full reconfiguration and must therefore be mitigated as well as possible. In contrast, a mitigation strategy for flip-flops may not be necessary for some applications.
In 2012, Quinn et al. presented the first on-orbit results for Virtex-4QV FPGAs [Quinn et al. 2012], collected from an experimental payload launched by the Los Alamos National Laboratory. The system comprises two Virtex-4 FPGAs running the same DSP application. The mitigation strategy is based on triple modular redundancy (TMR) in combination with scrubbing. Using fault injection experiments and the CREME96 model, an observable output error rate of approximately one error every 15 to 25 days was predicted before launch. This rate is based on a calculated configuration memory upset rate of 68 to 89 SEUs per device-day.
The on-orbit results are surprising. First, the measured upset rate per unit is much lower than predicted (19 SEUs per unit-day). Second, the only two measurable output errors were triggered by SEUs in bit locations that could not be predicted by fault injection beforehand. Third, a SelectMAP SEFI was observed, although such a failure should occur only rarely according to worst-case predictions. Finally, the authors were able to observe atypical events in which many bits in a single frame were corrupted at the same time.
The authors assume that the measured upset rate is artificially low due to the shielding of the spacecraft and the duty cycle of the device. It was further found that 8.42% of SEU events are actually multiple bit upsets (MBUs), although most have a size of only two bits. Furthermore, 78% of the SEUs occurred in configurable logic blocks (CLBs), followed by nearly 15% in BRAM interconnect and nearly 6% in input/output blocks (IOBs).

Terminology
Several techniques can be applied during the design process to mitigate soft errors in digital circuits. A classification of these techniques is presented in Section 3.3, which makes use of the terminology introduced in the NASA Fault Management Handbook [NASA 2012]. Although targeting flight systems in general, this terminology proves to be well suited to describing soft error mitigation techniques for FPGAs as well.
A common terminology to describe an abnormal state of a system includes three terms: fault, error, and failure. Although several standards define these terms slightly differently, a fault is usually understood as the cause of an error and an error as the cause of a failure. For instance, in the functional safety standard ISO 26262, a fault is defined as an "abnormal condition that can cause an element or an item to fail." The error is defined as the "discrepancy between a computed, observed or measured value or condition, and the true, specified, or theoretically correct value or condition." Finally, the failure is defined as the "termination of the ability of an element, to perform a function as required."

According to the Fault Management Handbook, failures can be either prevented or tolerated. In the first case, actions are taken to avoid failures either at design time or runtime. Design-time fault avoidance includes "design function and FM [fault management] capabilities to minimize the risk of a fault and resulting failure," whereas operational failure avoidance "predicts that a failure will occur in the future and takes action to prevent it from happening." With failure tolerance, failures are either accepted or mitigated. Failure masking techniques "allow a lower level failure to occur, but mask its effects so that it does not affect the higher level system function." Failure recovery techniques "allow a failure to temporarily compromise the system function, but respond and recover before the failure compromises a mission goal." Finally, goal change strategies "allow a failure to compromise the system function, and respond by changing the system's goals to new, usually degraded goals that can be achieved."

In the following, erroneous FPGA output is seen as a failure. Although the failure is always caused by a fault, a fault does not necessarily lead to a failure.
In an FPGA circuit design, such a fault could be a flipped bit in a flip-flop or a reprogrammed logical operation due to a falsified look-up table (LUT). In any case, only if the faulty resource is actually used in the design will the associated fault finally lead to a failure.

Failure Modes in SRAM-Based FPGAs
An FPGA model, commonly found in the literature [Padovani 2005], that is suitable for illustration of different fault and failure modes of SRAM-based FPGAs is shown in Figure 2. SRAM-based FPGAs comprise a configuration memory layer that stores the configuration of the FPGA in SRAM memory cells and a user logic layer on which the actual circuit is implemented. A typical circuit utilizes sequential and combinational logic elements and often accesses embedded BRAM and/or DSP blocks. Whereas the user flip-flops and other user memory resources as well as the DSP blocks are physically present, combinational logic gates are realized with LUTs within CLBs.
The configuration bits on the configuration memory layer control the resources on the user logic layer, including the wiring between the resources, the content of the LUTs, and the configuration of the configurable logic, BRAM, DSP, and input output blocks.
If an ion hits the FPGA, it can affect memory resources (i) on the configuration memory layer or (ii) on the user logic layer. In both cases, upsets can be seen as faults that may lead to a failure. Fortunately, the system can recover from such failures because affected memory cells can be updated with correct values. Since the configuration bits control "really everything" [Padovani 2005], the configuration memory is the main concern of most mitigation strategies. Although more than 60% of the configuration bits are used to control routing resources, only 10% to 20% of routing resources are used in a typical design [Carmichael and Tseng 2009]. The ratio of used configuration bits to user flip-flop bits is, however, usually still so high that flip-flop upsets account for only a few percent of all upsets. Obviously, configuration bit upsets can lead to much more unpredictable behavior than flip-flop upsets. In contrast to user flip-flops, BRAM upsets can be as much of a concern as configuration memory upsets if large amounts of these resources are utilized in a design.
A fault in the configuration memory may lead to a failure in case the affected configuration bit controls a resource that is utilized by the design. In Xilinx terminology, configuration bits can be classified as essential and critical bits [Xilinx 2012b]. Essential bits are the subset of configuration bits that configure resources used by the design. Thus, a fault affecting an essential bit may lead to a failure. Because not every resource of a design is exercised by the application at all times, only faults in a subset of the essential bits, referred to as critical bits, are guaranteed to manifest as failures.
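The subset relation between these bit classes can be illustrated with a toy sketch; the bit indices below are arbitrary placeholders, not real bitstream addresses.

```python
# Toy illustration: critical bits are a subset of essential bits.
# Bit indices are arbitrary placeholders, not real bitstream addresses.
essential_bits = {3, 7, 12, 20}   # configure resources the design uses
critical_bits = {7, 20}           # corruption guaranteed to cause failure

assert critical_bits <= essential_bits   # critical is a subset of essential

def upset_may_cause_failure(bit: int) -> bool:
    return bit in essential_bits

def upset_will_cause_failure(bit: int) -> bool:
    return bit in critical_bits

assert upset_will_cause_failure(20)
assert upset_may_cause_failure(12) and not upset_will_cause_failure(12)
assert not upset_may_cause_failure(5)    # unused bit: fault stays benign
```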
A fault in a user flip-flop can lead to a failure if its value is used by subsequent circuitry. Although the failure can propagate through the system until it becomes measurable at the output, it is often only of transient nature. If the flip-flop is used in state-dependent logic, however, a failure can be "trapped" in a feedback loop until the logic is reset to a known (initial) state. For instance, if a bit of a counter register is flipped, the counter "jumps" and the output is permanently falsified.
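The counter example can be sketched in a few lines: once an SEU flips one register bit, every subsequent count stays offset until a reset. The counter width and bit position are arbitrary assumptions.

```python
# Sketch: a single bit flip in a counter register permanently offsets
# all subsequent counts until the counter is reset. Width and flipped
# bit position are arbitrary assumptions.

def step(count: int, width: int = 8) -> int:
    """One clock tick of a free-running modulo-2^width counter."""
    return (count + 1) % (1 << width)

count = 5
count ^= 1 << 6          # SEU flips bit 6: the counter "jumps" by 64
for _ in range(10):      # the error persists through every later tick
    count = step(count)

assert count == (5 + 64 + 10) % 256   # output is still offset by 64
```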
A fault in a BRAM cell can lead to a failure with the next read access. Often, the memory is not immediately accessed, and the manifestation of the failure is delayed.

Figure 3 shows an overview of fault management strategies classified according to the aforementioned terminology, together with the corresponding mitigation techniques surveyed in this article.

Classification of Mitigation Techniques for Spaceborne SRAM-Based FPGAs
During runtime, failure masking techniques can be used to tolerate failures. Failure masking is usually achieved by redundancy. Most commonly, spatial redundancy is applied, such as TMR, partial TMR, duplication with compare (DWC), or reduced-precision redundancy (RPR). Alternatively, information redundancy techniques can be used to detect and mask failures in certain types of circuits, such as error detection and correction (EDAC) codes or algorithm-based fault tolerance (ABFT).
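The masking principle behind TMR can be sketched as a bitwise 2-out-of-3 majority vote over three replica outputs; any single corrupted replica is outvoted by the other two.

```python
# Sketch: bitwise majority voting for triple modular redundancy (TMR).
# Any single upset replica is masked by the two healthy ones.

def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)

golden = 0b1011_0010
faulty = golden ^ 0b0100_0000            # one replica takes an upset
assert tmr_vote(golden, golden, faulty) == golden   # failure is masked
```

Note that the voter masks the failure but does not repair it; without scrubbing, a second upset in another replica could defeat the vote.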
Aside from failure masking, failure recovery techniques can be used during runtime as well. Failure recovery is usually done by refreshing the memory, which is often referred to as scrubbing.
During design time, several techniques can help to avoid faults in advance, including tools for the susceptibility analysis (e.g., for the quantification of sensitive configuration bits) and tools that place and route a circuit design in a reliability-oriented way.

MITIGATION DESIGN TECHNIQUES AIMED AT RUNTIME FAILURE TOLERANCE
In this section, a review of the rich field of failure tolerance techniques that can be applied during runtime, targeting both the configuration memory and the user logic, is presented. Different scrubbing approaches are outlined with regard to the configuration memory. Besides implementation-specific differences (Blind vs. Readback Scrubbing, Device vs. Frame-Oriented Scrubbing, External vs. Internal Scrubbing), two fundamentally different concepts commonly found in the literature are discussed as well. The first concept combines periodic scrubbing with a low-level redundancy approach, whereas the second concept implements a fault detection, isolation, and recovery (FDIR) approach in which the configuration memory is only repaired once a failure has been detected in the user logic. For the user logic, different redundancy concepts are surveyed. Again, it turns out that the concepts presented in the literature can be roughly divided into two categories. The first type of spatial redundancy is applied to the netlist of the circuit and is thus a quite low-level approach. The second type is a modular redundancy approach in which whole hardware blocks are operated in hot redundancy.

Configuration Memory
Single and multiple bit upsets in the configuration memory of SRAM-based FPGAs can be mitigated by periodically writing a known-good bitstream to the device. This technique is often referred to as scrubbing, and several types of implementations can be found in research literature and application notes. In the following, the different methodologies and architectures are classified using a similar terminology as introduced in Heiner et al.

4.1.1. Blind Versus Readback Scrubbing. The most basic methodology is blind scrubbing, in which the configuration memory is periodically updated with a known-good copy of the original bitstream. This copy, sometimes referred to as the golden copy, is stored in an external, radiation-hardened memory. An external or internal configuration controller controls the download of the bitstream via one of the configuration interfaces of the FPGA. Using the classification shown in Figure 3, blind scrubbing can be described as an operational failure avoidance methodology because faults are handled in a preventive manner without any knowledge about the current health state of the system.
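A blind scrubbing loop reduces to a very small sketch: the golden copy is rewritten unconditionally, once per period. Here `write_bitstream` is a placeholder for a real configuration-interface driver, and the period and golden copy are assumptions.

```python
# Sketch of blind scrubbing: the golden bitstream is rewritten
# unconditionally on a fixed period, with no readback or health check.
# `write_bitstream`, the golden copy, and the period are placeholders.

import time

def blind_scrub(write_bitstream, golden: bytes,
                cycles: int, period_s: float = 0.0) -> int:
    """Rewrite the golden copy `cycles` times; return the cycle count."""
    for _ in range(cycles):
        write_bitstream(golden)   # overwrite regardless of device state
        time.sleep(period_s)
    return cycles

# Usage with a stand-in write function that just records downloads:
downloads = []
blind_scrub(downloads.append, golden=b"\xaa\x55", cycles=3)
assert downloads == [b"\xaa\x55"] * 3
```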
One concern that is sometimes raised in connection with blind scrubbing is the fact that the configuration controller gains write access to the configuration memory even if there is no need for scrubbing. Since the configuration interface is prone to SEFIs, a bitstream download can be affected by radiation effects, potentially leading to a corrupted design. Therefore, Xilinx recommends a SEFI detection before each write access that includes a FAR check and a status and control register check [Carmichael and Tseng 2009].
To further minimize the risk of a corrupted bitstream download, the readback feature of SRAM-based FPGAs can be utilized for scrubbing. Using one of the configuration interfaces, bitstreams can not only be written to the device but also read back during operation. With this capability, unnecessary write accesses to the configuration memory can be avoided during scrubbing. Before writing a correct bitstream to the device, the current bitstream is read and checked for upsets. Only if upsets are detected is the correct bitstream eventually written. Such a scrubbing methodology can be identified as a failure recovery technique. Two possible detection mechanisms are commonly used. The first one is based on comparison and relies on golden bitstream copies. The current bitstream is read back from the FPGA and compared to the golden copy, either by bitwise comparison or, more simply, by calculating a cyclic redundancy check (CRC) checksum during readback that can then be compared with the CRC value of the golden copy (later referred to as CRC readback scrubbing). If a mismatch is detected, the golden copy is used to overwrite the current bitstream. The second detection mechanism is based on information redundancy and uses the error-correcting code (ECC) bits that are embedded into each configuration frame. This single error correction and double error detection (SECDED) code allows the detection of single and double bit upsets and the correction of single bit upsets. For MBUs with more than two wrong bits, the syndrome value is indeterminate [Xilinx 2009]. During readback, a syndrome value is calculated by ECC logic that must be instantiated as a user primitive called FRAME_ECC_VIRTEX4 for Virtex-4 devices and FRAME_ECC_VIRTEX5 for Virtex-5 devices [Xilinx 2012c]. The syndrome value not only identifies upsets but can also localize single upsets.
Hence, two possible failure recovery methodologies can be combined with the ECC logic: either the erroneous bit is flipped and the corrected bitstream is written back to the device (later referred to as ECC readback scrubbing) or the whole bitstream is overwritten with a golden copy from memory.
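A CRC readback scrubbing cycle can be sketched as follows; `zlib.crc32` stands in for the device's readback CRC logic, and the read/write callbacks are placeholders for a real configuration-interface driver.

```python
# Sketch of CRC readback scrubbing: the bitstream is rewritten only
# when the readback CRC disagrees with the CRC of the golden copy.
# zlib.crc32 stands in for the device's readback CRC logic.

import zlib

def crc_readback_scrub(read_back, write_bitstream, golden: bytes) -> bool:
    """Return True if a mismatch was found and a repair was triggered."""
    if zlib.crc32(read_back()) != zlib.crc32(golden):
        write_bitstream(golden)       # overwrite with the golden copy
        return True
    return False                      # no upset: avoid the write access

# Simulated configuration memory with a single upset bit:
golden = bytes(range(16))
memory = bytearray(golden)
memory[4] ^= 0x10                     # SEU flips one bit in memory

def read_back(): return bytes(memory)
def write(bs): memory[:] = bs

assert crc_readback_scrub(read_back, write, golden) is True   # repaired
assert crc_readback_scrub(read_back, write, golden) is False  # now clean
```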
A methodology that allows the detection and correction of MBUs using a custom-built EDAC core is presented in Lanuzza et al. [2010]. The authors divide a configuration frame into several data segments and interleave the bits of these data segments. Then, an EDAC check code is calculated for each segment. Since adjacent memory cells are distributed over several data segments, MBUs can be detected and corrected. A recent work that advances this concept is proposed in Rao et al. [2014]. Here, the process of detecting MBUs and correcting them is separated. The detection is done using a novel lightweight error detection coding technique called interleaved two-dimensional parity, whereas the correction utilizes so-called erasure codes as can be found in reliable storage devices and similar applications.
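The interleaving idea can be sketched in a few lines: consecutive, physically adjacent bits are dealt round-robin into separate segments, so an adjacent 2-bit MBU appears as one single-bit error in each of two segments, which a per-segment SECDED code could then correct. The frame size, segment count, and error positions below are arbitrary assumptions.

```python
# Sketch of bit interleaving for MBU tolerance: adjacent bits land in
# different segments, so a 2-bit MBU in neighbouring cells becomes one
# single-bit error per segment. Sizes and positions are arbitrary.

def interleave(bits, segments=4):
    """Deal consecutive bits round-robin into `segments` groups."""
    groups = [[] for _ in range(segments)]
    for i, bit in enumerate(bits):
        groups[i % segments].append(bit)
    return groups

frame = [0] * 16
frame[5] ^= 1
frame[6] ^= 1                         # adjacent 2-bit MBU

errors_per_segment = [sum(g) for g in interleave(frame)]
assert max(errors_per_segment) == 1   # no segment sees more than 1 error
```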
Starting with the Virtex-5 architecture, an internal readback CRC logic allows a continuous and automatic readback in the background [Xilinx 2012c]. In the first readback round, a golden CRC checksum is calculated that is later used for comparison with the CRC values of the subsequent rounds. Once a mismatch has been detected, dedicated user logic can initiate a reconfiguration of the device or a bitstream repair using the ECC logic [Chapman 2010]. A summary of blind and readback scrubbing approaches is given in Table III.

4.1.2. Device Versus Frame-Based Scrubbing. All scrubbing methodologies mentioned in the previous paragraph can use different bitstream sizes. However, the configuration memory is typically scrubbed with a full bitstream or on a frame-by-frame basis. The first case, sometimes also referred to as device-based scrubbing, requires a rather simple implementation. Except for the modified header information, the bitstream can be directly downloaded from a memory to the configuration interface. One drawback of this solution is the susceptibility of the configuration interface to SEFIs. If such an upset occurs during download, the whole design is likely to become corrupted. Frame-based scrubbing requires a more complex configuration controller implementation because each frame must be prepared before download. However, the benefit of this approach is the possibility to isolate the effects of a SEFI to a single frame. Aside from the increased implementation complexity, the scrubbing speed is decreased as well. First, a SEFI check must be done before downloading each frame. Second, each frame bitstream comes with an overhead due to its header. Finally, after each frame, a dummy frame must be written to flush the pipeline [Carmichael and Tseng 2009]. In some applications, increased scrubbing speed is desired. This is especially true for applications in which scrubbing is used as the only mitigation technique.
In Sari and Psarakis [2011], such a "low-cost" strategy, based on an idea presented in Asadi and Tahoori [2005], is proposed. The authors point out that many configuration frames are scrubbed even though they contain no or only a small number of essential bits. As a consequence, they propose to constrain the placement of the design in such a way that the number of frames with essential bits is minimized. Then, the frame-based scrubber must take only this subset of frames into account. A summary of device- and frame-based scrubbing approaches is given in Table IV.

4.1.3. Periodic Versus On-Demand Scrubbing. In many designs, the scrubbing process is independent of other mitigation techniques. The configuration memory is then periodically scrubbed or scanned for upsets at a fixed scrubbing rate.
Alternatively, the scrubbing process can also be triggered by a failure detection mechanism. Such a methodology can be advantageous in systems where continuous scrubbing is unwanted. Ultimately, the availability of a system depends on the time a faulty component remains unrepaired. This time can be minimized either by increasing the scrubbing frequency or by implementing a mechanism that can trigger a repair process immediately after failure detection. With the aid of stochastic models, Siegle et al. [2013] show that on-demand scrubbing always maximizes the availability (Figure 4).
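The intuition can be captured with a toy repair-latency model (all numbers below are illustrative assumptions, not measured values): under periodic scrubbing, a fault arriving uniformly within a scrub interval waits half a period on average before repair, whereas on-demand scrubbing repairs after only a short detection-plus-repair latency.

```python
# Toy model of mean repair latency under periodic vs. on-demand
# scrubbing. All numbers are illustrative assumptions.

def mean_wait_periodic(scrub_period_s: float) -> float:
    # A fault arriving uniformly within a scrub interval waits T/2.
    return scrub_period_s / 2.0

def mean_wait_on_demand(detect_s: float, repair_s: float) -> float:
    # Repair starts as soon as the failure is detected.
    return detect_s + repair_s

periodic = mean_wait_periodic(scrub_period_s=10.0)
on_demand = mean_wait_on_demand(detect_s=0.01, repair_s=0.5)

assert on_demand < periodic   # shorter unrepaired time, higher availability
```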
Depending on the implementation, on-demand scrubbing can also save power because the scrubbing logic is only active when required. Becker et al. [2007] analyzed the power consumption of Virtex-II devices during reconfiguration. Although the authors use an LZSS decompression core during reconfiguration, the power consumption is increased by only 95mW. Nevertheless, the avoidance of additional resource overhead is always beneficial, especially for systems in which failure detection mechanisms are implemented anyway.
In the literature, on-demand scrubbing is mainly mentioned in connection with systems utilizing dynamic partial reconfiguration. In many proposed systems, spatial redundancy, such as TMR, is implemented by using redundant reconfigurable modules and a majority voter within the static area. Due to the physical separation of the reconfigurable modules, on-demand scrubbing can be advantageous here. The majority voter can easily be designed as a failure detection mechanism that identifies a faulty module and triggers a scrubbing process on demand, targeting only the faulty component. Paulsson et al. [2006] presented such a system for Virtex-II devices. The authors use so-called dynamic TMR or hardware test benches as failure detection mechanisms and reconfigure a partition only if a failure has been detected. Researchers at the University of Arizona proposed similar mechanisms for their SCARS system, which is based on Virtex-5 devices [Sreeramareddy et al. 2008]. Here, a faulty reconfigurable module is only scrubbed after a failure has been detected by software routines. Jacobs et al. [2012b] propose a similar approach. Again, failures in reconfigurable modules are detected by voters or comparators, and scrubbing is triggered on demand only for the faulty module. Straka et al. [2010b] at the University of Brno also work on a fault-tolerant framework for SRAM-based FPGAs. Similar to the approaches mentioned previously, a so-called generic partial reconfiguration controller receives error signals from reconfigurable modules and triggers on-demand scrubbing if required. Azambuja et al. [2008, 2009] also use majority voters as failure detection mechanisms and scrub a faulty reconfigurable module only after a failure has been detected. The authors emphasize the increased repair speed compared to a full reconfiguration. Iturbe et al. [2009] propose a fault management strategy in which they combine on-demand blind scrubbing, triggered by a majority voter as failure detection mechanism, with ECC readback scrubbing.

Fig. 4. On-demand scrubbing. A failure detection mechanism, such as a majority voter, flags an error code to the configuration controller once a faulty component has been detected. As a consequence, the configuration controller initiates a scrubbing process.
A methodology that aims at speeding up the on-demand scrubbing process is presented in Nazar et al. [2013]. The authors analyze the statistical distribution of sensitive bits within a partial bitstream. Instead of starting the scrubbing process from the first byte position of the bitstream, it is started from the frame for which, according to the authors' calculations, the mean time to recover (MTTR) is minimized. For a set of benchmark circuits, an average MTTR reduction of 30% was achieved. A methodology with a similar aim is presented in Bolchini et al. [2011]. The authors partition a circuit design and apply a specific redundancy scheme (e.g., DWC or TMR) to each partition. If one of these partitions is detected to be faulty, it is scrubbed by an external reconfiguration controller on demand. The authors developed an algorithm that optimizes the floor planning of the different partitions to find an optimal solution in terms of reconfiguration time, area, and performance overhead. Results for a set of example circuits suggest that the reconfiguration time can be greatly reduced, although at the cost of an increased area and performance overhead. A summary of periodic and on-demand scrubbing approaches is given in Table V.

4.1.4. External Versus Internal Scrubbing. The scrubbing logic can be implemented internally or externally. Of the available configuration interfaces, the SelectMAP interface is commonly used for external scrubbing due to its high throughput rates. ICAP, the internal counterpart to SelectMAP, can be used if the scrubbing logic is implemented on the user logic layer. Internal scrubbing is sometimes seen as a "low-budget" solution because it does not necessitate an external configuration controller and a memory for the golden bitstream copies.
It can be argued, however, that in most space applications, a radiation-hardened supervisor and a reliable memory for the initial bitstream configuration are available anyway.
External scrubbing via the SelectMAP interface is commonly seen as the more robust approach and is also recommended by Xilinx [Carmichael and Tseng 2009]. Berg et al. [2008] at NASA came to similar conclusions when comparing an external blind scrubber to an internal ECC readback scrubber by Xilinx. The internal scrubber is based on a PicoBlaze microcontroller, and its design was published in the no longer available application note [Jones 2007]. Using heavy-ion SEE radiation testing, it was found that the external scrubber was always recoverable without the need for a reset or power cycle, whereas the internal scrubber was never recoverable. The internal scrubber consistently reached a state in which it could not operate anymore, either because of MBUs that cannot be handled by the scrubber or because the scrubber itself was hit by ions. Heiner et al. [2008] from Brigham Young University improved the fault tolerance of the same Xilinx scrubber design by applying TMR and BRAM scrubbing. In radiation tests, it was found that the improved scrubber performs much better, but in more than 45% of all tests, the design failed at some point, requiring a subsequent full reconfiguration of the device. The authors assume that the inability of ECC readback scrubbers to repair MBUs was the main reason for this behavior. Ebrahim et al. [2012] from the University of Edinburgh also work on a fault-tolerant ICAP controller in the course of their R3TOS system. The controller, which is based on Xilinx's XPS_HWICAP core, is used not only for scrubbing but also for partial reconfiguration. Similar to the ICAP controller described earlier, the scrubber is an ECC readback scrubber. To improve its fault tolerance, the authors apply spatial redundancy, but instead of applying TMR to the whole controller, only a so-called recovery module is triplicated.
This module, in turn, is able to gain access to the ICAP interface only for the sake of recovering the controller from failures. A summary of external and internal scrubbing approaches is given in Table VI.

4.1.5. Integration with Dynamic Partial Reconfiguration. Most scrubbing approaches described in the literature assume a static user design. If dynamic partial reconfiguration is used as part of normal operation, however, such as to time-share chip area by swapping different modules during runtime, the reconfiguration and scrubbing processes must be orchestrated, because only one of them can access the configuration interface at a time. Furthermore, if blind scrubbing or CRC readback scrubbing is used, the golden bitstream must be kept up to date after each partial reconfiguration to mirror the currently running design. One approach to overcome these problems is described by Heiner et al. [2009]. The authors use a CRC readback scrubber as described earlier. Instead of downloading the bitstream of a reconfigurable module and updating the golden bitstream afterward, the authors suggest simply integrating the bitstream of the reconfigurable module into the golden bitstream. During the next scrubbing cycle, the scrubber detects a discrepancy between the golden bitstream and the bitstream read back from the device due to mismatching CRC sums. As a consequence, it then "repairs" the bitstream by writing the updated frames to the device.

4.2. User Logic
Whereas failure recovery mainly takes place on the configuration memory layer, failure masking is implemented on the user logic layer using some kind of redundancy. Most commonly, spatial redundancy is used, but information and temporal redundancy can be found as well for specific components.
4.2.1. Spatial Redundancy. By far, the most common form of spatial redundancy is TMR. In this approach, all components of a circuit are triplicated as depicted in Figure 5, and a majority voter is placed at the end that chooses the correct output.
To decrease the susceptible area, the circuit can be further partitioned by adding additional voters, as can be seen in Figure 6. The possible increase in availability has been analyzed using Markov chains, showing that the reliability is indeed increased because the area of each circuit stage, and therefore the chance that more than two redundant circuit stages fail, is decreased.
Since the voter is a single point of failure, it is usually triplicated as well. As mentioned earlier, upsets affecting feedback loops, such as counters or state machines, can be problematic because the failure is trapped in the loop. To overcome this problem, voters can be placed inside the feedback loops. This technique, sometimes also referred to as XTMR (Xilinx TMR) [Bridgford et al. 2008; Adell and Allen 2008], synchronizes the flip-flops automatically after repair (Figure 7).
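The 2-of-3 majority function at the heart of every TMR voter reduces, per bit, to a simple Boolean expression. A word-wide software model:

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit agrees with at least
    two of the three redundant inputs, masking a single faulty domain."""
    return (a & b) | (b & c) | (a & c)
```

In an FPGA this expression is instantiated per output bit, and, as noted above, the voter itself is usually triplicated so that it is not a single point of failure.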
Most commonly, TMR is applied to the netlist of a circuit using automatic insertion. Several commercial and academic software tools are available, including the TMRTool by Xilinx [2014], Precision Hi-Rel by Mentor Graphics [2014], and Synplify Premier by Synopsys [2012]. A notable free collection of tools is the BYU EDIF tool suite developed at Brigham Young University [2014].
Researchers at Politecnico di Torino analytically found that TMR-protected circuits are still prone to SEUs, because in some cases one single configuration bit upset can lead to multiple failures on user logic layer, invalidating the TMR approach [Sonza Reorda et al. 2005b]. Logic blocks inside the fabric of the FPGA are interconnected via switch boxes that are built from programmable interconnect points (PIPs). The authors found that one single configuration bit can control two or more PIPs, and they identified three possible modifications caused by one SEU, as depicted in Figure 8. In the figure, given a pair of connections (a), a short between the connections can occur (b), both connections can be opened (c), or a bridge between the connections can be created (d). If the connections belong to two redundant circuits of a TMR system, the voter will choose a wrong or falsified output. For Virtex-II devices, this failure mode, which is sometimes referred to as domain crossing error (DCE), was partly confirmed by Quinn et al. [2007c] by fault injection experiments, although the authors point out that SEU-induced DCEs "are possible when TMR is incompletely applied to a design, but they appear to be rare otherwise." However, the authors found that TMR can even be defeated by MBUs with two or more bits. Using a stochastic model, they predict a worst-case probability for DCEs of 0.36% for Virtex-II devices and up to 1.2% for the Virtex-5 device.
One obvious drawback of TMR is the large area, and thus power, overhead, which can exceed 200%. To decrease the overhead of TMR, several alternatives have been proposed in the literature. One example is partial TMR, discussed in Pratt et al. [2006]. The basic idea is to apply TMR only to feedback paths, and optionally to their inputs, to avoid so-called persistent errors [Morgan et al. 2005] in state-dependent logic. By doing so, only failures of a transient nature can occur. The authors demonstrated for a DSP kernel design that the number of persistent bits decreased by 63% if only the feedback is triplicated, at the cost of 26% hardware overhead. By applying partial TMR to feedback paths and their inputs, the persistent bits were reduced by two orders of magnitude at the cost of 40% hardware overhead. This partial TMR approach is part of the already mentioned BYU EDIF tool suite from Brigham Young University.
Another drawback of TMR is its strong impact on the performance of a circuit, especially if the circuit contains many TMR partitions. For instance, Kastensmidt et al. [2005] analyzed the performance of a digital FIR filter design. Although the implementation without TMR could achieve a performance of 154MHz, the performance of the TMR version with a maximum number of possible partitions dropped down to 123MHz.
A less common form of spatial redundancy is DWC, in which a circuit is duplicated and the outputs of the redundant circuits are compared by a comparator. Naturally, this mechanism can only detect failures rather than mask them. It can be useful for systems that allow downtime but must implement fail-silent behavior, or it can serve as a failure detection mechanism that triggers scrubbing on demand. Johnson et al. [2008] investigated DWC in detail. By means of fault injection experiments and radiation tests, the authors found that DWC can detect approximately 99.85% of all failures at the cost of approximately 100% hardware overhead.
An extension to DWC is proposed by Anderson et al. [2010]. The authors use DWC as a failure masking technique by taking advantage of the probabilistic distribution of a circuit's output. They use a system comprising five cascaded half-band filters whose output is characterized by a distinct, nonuniform distribution. Once the comparator detects a mismatch, it selects the output with the higher probability by checking a stored histogram. Due to the discrete bins of the histogram, the correct detection percentage can be poor for less significant bits. Therefore, the authors combine this approach with an additional history buffer filled with the most recent decisions.
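A minimal software model of this arbitration idea follows. It is a simplified reading of the scheme: the real design bins filter outputs into a stored histogram and adds a history buffer, whereas this sketch counts exact output values; the class and method names are invented.

```python
from collections import Counter

class HistogramDWC:
    """Duplication with comparison, extended to mask failures:
    on a mismatch, pick the output value that has occurred more often."""
    def __init__(self):
        self.hist = Counter()

    def arbitrate(self, out_a, out_b):
        if out_a == out_b:
            self.hist[out_a] += 1   # agreement: record and pass through
            return out_a
        # Mismatch: choose the historically more probable value.
        return out_a if self.hist[out_a] >= self.hist[out_b] else out_b
```

The histogram turns a pure detection scheme into a (probabilistic) masking scheme, at the cost of occasionally selecting the faulty output when the fault mimics a frequent value.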
Instead of applying spatial redundancy to the netlist of a circuit, the whole circuit can also be seen as a module that is then duplicated or triplicated. This approach is easy to implement [Habinc 2002] but lacks the automatic resynchronization after repair, which can be achieved by netlist approaches like XTMR. However, a benefit of modular redundancy is the physical separation of the modules, which allows a partial (on-demand) scrubbing [Azambuja et al. 2008].
Today's use of modular redundancy is often driven by systems that utilize dynamic partial reconfiguration and in which the design is broken up into physically separated partitions anyway. Thus, it is no surprise that the earlier mentioned systems proposed by Paulsson et al. [2006], Jacobs et al. [2012b], and Straka et al. [2010a] are all based on modular redundancy. All of these systems have in common that one or more voters and/or comparators are placed in the static area, similar to the scheme depicted in Figure 9. The failure detection mechanism monitors the output of redundant modules and triggers an on-demand scrubbing process once a failure has been detected. An interesting aspect of such a system is its adaptability, as pointed out in Jacobs et al. [2012b]: since redundant modules can be added and removed on demand, the system  availability can be tuned according to external constraints in terms of area and power overhead.
Another technique that can be understood as a modification of modular TMR is RPR (Figure 10). The idea of applying RPR to FPGAs in space systems goes back to several works at the U.S. Naval Postgraduate School in Monterey [Snodgrass 2006; Sullivan 2008, 2009; Gavros et al. 2011] and was later followed up by Pratt et al. [2011, 2013]. Instead of using three redundant copies of a module, one module processes data with full precision while the two other modules process the data with reduced precision. Hence, RPR is suitable for algorithms that process data represented as a block of bits ordered in increasing or decreasing importance, such as fixed-point numerical problems [Sullivan 2008]. A decision block determines whether a failure has occurred as follows [Pratt et al. 2011]: the full-precision output FP_Out is always chosen if no failure has been detected or if the reduced-precision modules disagree. The decision further depends on a threshold level Th: the full-precision module is assumed to be correct if its output differs by less than Th from the reduced-precision module output RP1_Out. For an FIR filter design, Pratt et al. [2011] showed that the failure rate can be improved by nearly 200 times compared to an unmitigated design at the cost of nearly 70% hardware overhead. For the same circuit, a full TMR mitigation approach improves the failure rate by nearly 1,200 times at the cost of 208% hardware overhead.
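The decision rule just described can be sketched as follows. This is a simplified software reading of the scheme in Pratt et al. [2011], not their exact hardware decision block; the function and parameter names are invented.

```python
def rpr_decide(fp_out, rp1_out, rp2_out, th):
    """Reduced-precision redundancy decision block (simplified sketch):
    trust the full-precision output unless both reduced-precision
    modules agree and the full-precision output strays beyond Th."""
    if rp1_out != rp2_out:
        return fp_out       # RP modules disagree: trust full precision
    if abs(fp_out - rp1_out) < th:
        return fp_out       # FP agrees with the RP consensus within Th
    return rp1_out          # FP likely faulty: fall back to reduced precision
```

The threshold Th trades masking coverage against false alarms: a small Th flags harmless low-order precision differences, while a large Th lets significant full-precision faults pass.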

4.2.2. Information Redundancy.
Although information redundancy techniques are mainly applied to memory and communication channels, several circuits can profit from them as well. Information redundancy techniques add redundant bits to data to be able to detect or even correct falsified information. An example for the first case is the CRC code, whereas error correction can be achieved by Hamming codes.
EDAC techniques are often applied to state machines. The states can be encoded using different coding schemes, such as binary, one-hot, or Gray. In addition, parity bits can be added to achieve a Hamming code, which enables the detection or correction of bit upsets. In Burke and Taft [2004], the robustness of state machines with binary, one-hot, Hamming distance 2 (H2), and Hamming distance 3 (H3) codes was tested using synchronous fault injection. According to the authors, H3 encoding can fully handle single errors and is least affected by double-bit errors. State machines with H2 encoding have fewer overall errors than state machines with one-hot encoding and about half the error rate of state machines with binary encoding. Due to the hardware overhead and the decreased performance, the authors conclude that H2 encoding is the best compromise in terms of size, speed, and fault tolerance. For SRAM-based FPGAs, however, these results should be regarded with care because the influence of configuration memory upsets is not taken into account. Using fault injection, Morgan et al. [2007] found that the additionally required logic can "potentially add more unreliability than the reliability it adds to the original circuit." An actual implementation of a fault-tolerant state machine that uses Hamming codes is described in Skliarova [2005].
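As a concrete instance of the distance-3 ("H3") idea, a Hamming(7,4) code protecting a 4-bit state word can be sketched in software. This is a generic textbook construction, not the specific encoding of Burke and Taft [2004], and an FPGA implementation would of course be XOR trees rather than Python.

```python
# Hamming(7,4): 4 data bits plus 3 parity bits, minimum distance 3,
# so any single bit upset in the stored state word can be corrected.

def hamming74_encode(nibble):
    d = [(nibble >> i) & 1 for i in range(4)]      # data bits d0..d3
    p0 = d[0] ^ d[1] ^ d[3]                        # parity over positions 1,3,5,7
    p1 = d[0] ^ d[2] ^ d[3]                        # parity over positions 2,3,6,7
    p2 = d[1] ^ d[2] ^ d[3]                        # parity over positions 4,5,6,7
    bits = [p0, p1, d[0], p2, d[1], d[2], d[3]]    # codeword positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_correct(cw):
    bits = [(cw >> i) & 1 for i in range(7)]
    s0 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s1 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s2 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s0 | (s1 << 1) | (s2 << 2)          # 0 = no error, else position
    if syndrome:
        cw ^= 1 << (syndrome - 1)                  # flip the faulty bit
    bits = [(cw >> i) & 1 for i in range(7)]
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
```

On an SRAM-based FPGA, the caveat above still applies: the encoder, decoder, and syndrome logic are themselves implemented in upset-prone configuration memory.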
An interesting application of information redundancy for the sake of EDAC is ABFT, which goes back to the work of Huang and Abraham [1984]. ABFT is used to implement fault-tolerant matrix operations. Recently, Jacobs et al. [2012a] investigated the overhead and reliability of ABFT in FPGA systems. The authors use a multiply and accumulate (MAC) unit where the inputs are fed from BRAM and where the output data is written back to BRAM. One of the implementations uses a second MAC unit that generates and validates the checksums. Compared to TMR, the hardware overhead is as follows: 21% LUT overhead (TMR 148%), 24% flip-flop overhead (TMR 84%), 0% BRAM overhead (TMR 200%), and 25% DSP48 overhead (TMR 200%). Of 100,000 injected faults, 1,216 errors occurred in the unmitigated design, 351 errors in the ABFT design, and 42 errors in the TMR design.
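The checksum idea behind ABFT can be illustrated for matrix multiplication: a column-checksum version of A and a row-checksum version of B predict the row and column sums of C = A·B, so a single faulty result element is detectable. This is a plain software sketch of the Huang–Abraham scheme, not the MAC-unit implementation of Jacobs et al. [2012a].

```python
# ABFT checksum check for C = A @ B (pure-Python, lists of lists).

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def abft_check(A, B, C):
    """True if the row/column checksums of C match those predicted by A and B."""
    # (1^T A) B equals the column sums of A B.
    col_sum_row = [sum(col) for col in zip(*A)]
    predicted_cols = matmul([col_sum_row], B)[0]
    actual_cols = [sum(col) for col in zip(*C)]
    # A (B 1) equals the row sums of A B.
    row_sum_col = [[sum(row)] for row in B]
    predicted_rows = [r[0] for r in matmul(A, row_sum_col)]
    actual_rows = [sum(row) for row in C]
    return predicted_cols == actual_cols and predicted_rows == actual_rows
```

The mismatching row and column indices even locate a single faulty element, which is what makes ABFT a correction (not just detection) technique for matrix operations.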

4.3. Block Random Access Memory
The embedded RAM blocks in Virtex devices need special care regarding mitigation because, similar to the configuration memory, upsets in BRAM can accumulate, leading to an ever-decreasing reliability of the memory. Although the BRAM content can be read out via the configuration interface, external scrubbing is not possible during operation because the RAM cannot be accessed by the configuration logic and the user logic at the same time [Xilinx 2009].
The mitigation approach recommended by Xilinx includes TMR for the BRAM (including a triplicated voter) and a memory scrubbing engine implemented on the user logic layer, similar to the scheme depicted in Figure 11. This method can only be applied to single-port RAMs, because they must be replaced by dual-port counterparts, as the second port is required for scrubbing. A counter autoincrements the address of the second port; once the voter detects a failure, the counter is stopped, the voted and thus corrected output is written back to memory, and the counter is started again. Rollins et al. [2010] present a comprehensive comparison of fault-tolerant memories in SRAM-based FPGAs. The study covers the TMR and scrubbing approach described earlier plus several information redundancy techniques, including duplication with error detection codes and EDAC, applied in different configurations (with and without triplicating the logic, with and without memory scrubbing). The fault injection results are similar to what Morgan et al. [2007] observed when applying information redundancy to state machines. First, the area overhead of the information redundancy techniques sometimes exceeds that of the TMR approach, and second, the failure rates are always worse than the rate for the unmitigated design. Only if BRAM is used as ROM do some of the information redundancy techniques perform slightly better than the unmitigated design, but still always worse than TMR.
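The TMR-plus-scrubbing scheme for BRAM can be sketched in software: three memory copies, a bitwise majority voter, and a background scrubber walking the address space through the (conceptual) second port and writing the voted word back. Names and the list-based memory model are illustrative only.

```python
# Software model of BRAM TMR with a background scrubbing engine.

def vote(a, b, c):
    """Bitwise 2-of-3 majority over three redundant memory words."""
    return (a & b) | (b & c) | (a & c)

def scrub_step(mems, addr):
    """One scrubbing cycle: read all three copies at addr and write
    the voted value back, clearing any single-copy upset."""
    words = [m[addr] for m in mems]
    corrected = vote(*words)
    for m in mems:
        m[addr] = corrected
    return corrected

# Usage: three copies, one SEU, one full scrub pass over 8 addresses.
mems = [[0xAA] * 8 for _ in range(3)]
mems[1][3] ^= 0x10                  # upset in copy 1, address 3
for addr in range(8):
    scrub_step(mems, addr)
assert mems[0] == mems[1] == mems[2] == [0xAA] * 8
```

The continuous address sweep is what prevents upsets from accumulating in two copies of the same word, which a voter alone could no longer mask.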

MITIGATION DESIGN TECHNIQUES AIMED AT DESIGN-TIME FAULT AVOIDANCE
Another group of mitigation techniques, which can best be described as fault avoidance techniques applicable at design time, is discussed in this section. This group includes analytic approaches that are aimed not only at analyzing the sensitivity of circuits but also at reducing this sensitivity, such as by rerouting the circuit design.
As already mentioned in Section 4.2, researchers at Politecnico di Torino found that the TMR approach can be invalidated by SEUs because a single configuration bit upset can lead to multiple failures on the user logic layer. By observing and analyzing this fault mechanism, a set of tools has been developed that allows these faults to be avoided at design time.
In Sonza Reorda et al. [2005a], a static analyzer tool (STAR) is presented. Based on the researchers' knowledge of the proprietary bitstream format, the tool is able to determine the critical configuration bits of a circuit design. If TMR is applied to the circuit, the tool also determines the bits that can invalidate the TMR approach as described earlier.
Based on STAR, a reliability-oriented place and route (RoRA) algorithm has been developed [Sterpone and Violante 2006]. By rerouting the circuit, RoRA can avoid the problem of single upsets defeating the TMR approach. The authors were able to demonstrate that RoRA drastically reduces the number of wrong answers of circuitry to which TMR or XTMR is applied. Furthermore, the authors point out that RoRA can identify critical configuration bits much faster than fault injection experiments can. However, compared to the original TMR version, the rerouting decreases the performance of the circuit.
To increase the performance of circuits to which TMR is applied, a tool called V-Place was subsequently presented in Sterpone and Battezzati [2008]; it was shown that this tool can improve the circuit's frequency by up to 44%.
The STAR tool was later updated to STAR-LX. The main advantages are a reduction of the analysis time by more than five times as well as the ability to analyze the dynamic behavior of the design in the presence of SEUs [Sterpone and Violante 2007]. A modification that can analyze the effects of MBUs, called STAR-MCU, was presented subsequently. To mitigate the effects of MBUs, a new placement algorithm called PHAM was presented in Sterpone and Battezzati [2010].
For engineers and scientists who are interested in building their own netlist analysis and CAD tools, researchers from Brigham Young University present an interesting Java toolkit called RapidSmith [Lavin et al. 2011]. The toolkit offers a rich application programming interface (API) to parse, analyze, and manipulate XDL files (which can easily be created from Xilinx netlists). A very recent work that makes use of this toolkit is presented in Sari et al. [2014]. In this work, the authors estimate the susceptibility of an FPGA design. To determine the number of sensitive bits that are responsible for the different SEU-induced effects, as discussed in Section 4.2.1 and shown in Figure 8, the authors conduct a postrouting analysis using appropriate API functions of RapidSmith.

SIMULATION AND EMULATION OF SINGLE EVENT EFFECTS
This section gives a brief overview of techniques for simulation and emulation of SEEs, including accelerated radiation testing and fault injection. These techniques are necessary to validate any mitigation methodology applied to the design.

Accelerated Radiation Testing
Although first in-flight data for Virtex-4 devices have been published [Quinn et al. 2012], the common way to gain reliable static and dynamic cross-sections for these devices is by means of accelerated radiation testing.
To simulate high-energy galactic cosmic rays and solar event heavy ions on the ground, the FPGA is exposed to low-energy ions available in particle accelerators. The quality of the simulation can be evaluated by the amount of energy lost per unit length of track, also referred to as linear energy transfer (LET). Because the SEE-sensitive region is rather thin, ions with lower energies are sufficient for simulation as long as the LET is similar to that of galactic cosmic rays and solar event heavy ions. The typical energy range used for simulation is of the order of several MeV/A, and the penetration range is between 30 and 100 μm. In general, the estimation of the SEU sensitivity using this concept is rather conservative [Barth et al. 2004]. The machine most commonly used for heavy-ion SEU testing is the cyclotron. Several accelerators can be found across Europe, such as GANIL and IPN in France, SIRAD and LNS in Italy, GSI in Germany, and HIF in Belgium [ESA/ESCIES 2010].
Single event phenomena can also be induced by protons. Linear accelerators and cyclotron accelerators are capable of generating protons with sufficient energy to simulate solar flare and proton belt conditions [Holmes-Siedle and Adams 1993].
Regarding Virtex devices, most test result data have been collected at the cyclotron at Texas A&M University and/or at the cyclotron at Lawrence Berkeley National Laboratory and published by Los Alamos National Laboratory, NASA Goddard Space Flight Center, and the Xilinx Radiation Test Consortium. Quinn et al. [2005] present results regarding radiation-induced MBUs in Virtex, Virtex-II, and Virtex-4 devices. In Quinn et al. [2007a], the authors discuss the general reliability concerns of Virtex FPGAs. In Quinn et al. [2007a, 2007c], they present results regarding circuitry to which TMR is applied and discuss the problem of DCEs. In Quinn et al. [2007b], first results for Virtex-5 are published. In 2009, results regarding the SEU susceptibility of logical constants were presented. In the same year, a paper describing their methodology for static and dynamic testing was published. The upset characterization of embedded PowerPC cores is presented in Allen et al. [2007], and a more general characterization of Virtex-4QV FPGAs followed, finally leading to the summary report published by NASA and Xilinx. One year later, a report summarizing the results gathered from dynamic testing and from testing of mitigated designs was published by Allen [2009]. Recently, the static SEU characterization of the Virtex-5QV was presented in Swift and Allen [2012].

Fault Injection
As an alternative to accelerated radiation testing, upsets in the configuration memory can also be emulated by fault injection.
Although many different fault injection implementations have been presented in the literature, the basic structure is similar in many systems (Figure 12). Two devices are simultaneously fed with test vectors. One of the DUTs is used as a "golden" reference while faults are injected into the second DUT. Fault injection is based on bitstream manipulation: often, a configuration frame is read back from the FPGA via one of the configuration interfaces, one or more bits are flipped, and the frame is written back to the device. The outputs of both FPGAs are compared by a mechanism that detects whether the fault injection led to a failure. Alternatively, only one DUT can be used, and its response is compared to golden answers during the fault injection campaign. In the following, two exemplary European systems are presented. A fault injection system with a long history in Europe is FLIPPER, developed by Alderighi et al. [2007] under ESA funding. The system comprises one Virtex-II Pro FPGA as a controller that can be connected to a DUT board hosting the FPGA under test. The controller communicates with software running on a host PC via USB. In contrast to the example depicted in Figure 12, only one DUT is used, and its response is compared to stored answers that are known to be correct. Test vectors can be converted from testbench stimuli and are fed into the DUT after one or more faults have been injected into the bitstream. Results from FLIPPER were compared to accelerated testing results [Alderighi et al. 2010], and the authors conclude that FLIPPER is an effective tool to evaluate different mitigation techniques due to its capability to predict failure rates, provided that raw configuration bit upset rates of the target environment are known. However, the authors also point out that the failure rate might be underestimated because SEUs in flip-flops, SETs, and MBUs are not emulated.
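A minimal software model of such a campaign is sketched below. The "DUT" here is just a lookup function over an emulated configuration frame, all names are invented, and a real injector would read and write frames via SelectMAP, ICAP, or JTAG rather than flipping bits in a bytearray.

```python
# Sketch of a single-bit fault injection campaign over an emulated
# configuration frame.

import random

def inject_fault(frame: bytearray, bit_idx: int) -> None:
    """Flip one configuration bit in place."""
    frame[bit_idx // 8] ^= 1 << (bit_idx % 8)

def campaign(run_dut, golden_frame: bytes, test_vectors, n_faults=100, seed=0):
    """Count how many random single-bit upsets produce an observable failure,
    comparing the faulty DUT against the golden reference for each vector."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_faults):
        faulty = bytearray(golden_frame)
        inject_fault(faulty, rng.randrange(len(golden_frame) * 8))
        for v in test_vectors:
            if run_dut(bytes(faulty), v) != run_dut(golden_frame, v):
                failures += 1
                break   # first mismatch classifies this injection as a failure
    return failures
```

The ratio of failures to injections estimates the fraction of sensitive bits, which is exactly the quantity that systems such as FLIPPER extrapolate to in-orbit failure rates given raw upset rates.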
FLIPPER was also compared to and used for the analysis of the STAR/RoRA tool as described in Section 5 [Alderighi et al. 2008].
The second fault injection system developed under ESA funding is FT-UNSHADES from the University of Seville. In contrast to FLIPPER, its initial aim was the emulation of SEUs and MBUs that originate from SETs and thus manifest in flip-flop cells, rather than the emulation of configuration memory upsets. The system is based on a Virtex-II, and the circuit under test is duplicated and compared within one FPGA. In each campaign, the application is driven to the desired fault injection time, the clock is stopped, the fault is injected into the desired flip-flop(s) of one of the circuits, the clock is restarted, and the outputs of both circuits are compared to detect any mismatch [Aguirre et al. 2007a, 2007b]. Later, the injection system was updated (FT-UNSHADES-uP) to allow a more in-depth analysis of microcontrollers. The updated system uses only one circuit under test, whose output is compared to the theoretically correct output, and it is now also able to inject faults into BRAMs, LUTs, and SRL16s [Napoles et al. 2008]. More recently, FT-UNSHADES2 has been developed [Mogollon et al. 2011]. The system is based on a Virtex-5, and all data management is processed in hardware, leading to much higher fault injection rates. The system can now also be used to inject faults into the configuration memory, and usability was improved by a simplified design flow and a Web browser-based user interface.

RESEARCH PLATFORMS FOR THE SRAM-BASED FPGA IN SPACE
This section presents an overview of several research platforms that comprise SRAM-based FPGAs. It summarizes concepts that could be applicable to future spacecraft data processing systems.
Flight heritage for Virtex-4 and Virtex-5 is relatively rare, and many of the payloads serve only as technology demonstrators so far. From the publicly available information, it seems that none of the platforms utilizes dynamic partial reconfiguration as a functional feature. Although dynamic partial reconfiguration may offer benefits for some projects, it can be assumed that in-flight experience for this unique capability of SRAM-based FPGAs is still a long way off.
However, in the research community, this topic is actively investigated. In the following, six platforms and frameworks are presented that comprise Virtex devices utilizing dynamic partial reconfiguration and specifically target space applications. A summary of the systems is given in Table VII.
The platforms can be coarsely classified by the way in which the reconfigurable modules are used. Most of the platforms implement a system on chip (SoC) in which the reconfigurable modules are connected to a soft central processing unit (CPU) core; that is, the reconfigurable modules are used as hardware accelerators that can be installed on demand. The other group of platforms uses reconfigurable modules as processors that handle data streams independently and thus without CPU interaction.

Reconfigurable System on Chip
A demonstrator platform, the dynamically reconfigurable processing module (DPRM), is under development at the University of Bielefeld and University of Paderborn in Germany [Hagemeyer et al. 2012]. It is based on the RAPTOR-X64 prototype platform. Aside from a communication module, the system comprises two processing modules, with each module including a Virtex-4 FPGA. The reconfiguration controller is part of the FPGA. It is used not only to reconfigure the FPGA via the ICAP interface but also for scrubbing of the configuration memory. The partial reconfigurable modules are connected using so-called embedded macros that embed a bus structure into tiles. The main motivation for such a structure is that, by dividing a partial reconfigurable area into atomic units called tiles, modules of different sizes can be placed more efficiently at runtime. It was found that an embedded bus structure with shared signals optimally supports the flexible placement of the modules due to its homogeneity. The tool STARECS from Politecnico di Torino [Sterpone et al. 2011] is used to analyze the SEU effects on the system, and at present, work on fault-tolerant communication via the embedded macros is in progress.
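The tile-based placement idea can be illustrated with a first-fit allocator over a row of tiles. This is a generic sketch under simplified assumptions, not the actual DPRM placement algorithm:

```python
def first_fit(tiles, size):
    """Tile-based placement sketch: tiles is a list of booleans
    (True = free). A module spanning `size` contiguous tiles is placed
    at the first gap that fits; dividing the area into uniform tiles
    over a homogeneous embedded bus is what makes modules of different
    sizes relocatable at runtime. Returns the start index or None."""
    run = 0
    for i, free in enumerate(tiles):
        run = run + 1 if free else 0
        if run == size:
            start = i - size + 1
            for j in range(start, i + 1):
                tiles[j] = False           # mark tiles occupied
            return start
    return None                            # no contiguous gap large enough
```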
Researchers at the University of Edinburgh are also working toward a partial reconfigurable system on Virtex-4 FPGAs. The main objective of the work is a reliable reconfigurable real-time operating system (R3TOS), introduced in Iturbe et al. [2010b]. Reconfiguration is done by R3TOS internally through the ICAP port. In the course of the research on R3TOS, an area-time response balancing (ATB) algorithm for scheduling real-time hardware tasks was proposed in Iturbe et al. [2010a], as well as a task allocator in Hong et al. [2011] that is able to deal with spontaneously occurring faults. The paradigm followed in Edinburgh is that hardware tasks are handled like normal threads in a high-level programming language (e.g., POSIX threads). To "call" a hardware task, the reconfigurable module needs an appropriate interface, which is proposed in Iturbe et al. [2011a]. In the same work, an interesting approach for intertask communication is presented: instead of utilizing a network with a high resource overhead, the data is simply copied from the output buffer of one module to the input buffer of another module by reading the data through ICAP and writing it back. One benefit of avoiding on-chip communication is that fewer wires need to cross the partial reconfigurable modules, which increases the flexibility regarding module placement. Built on the ICAP-based communication, a second task allocator called Snake is presented in Iturbe et al. [2011b], and because R3TOS makes heavy use of the ICAP port, a fault-tolerant ICAP controller is introduced in Ebrahim et al. [2012].
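The ICAP-based communication scheme amounts to a memory-to-memory copy through the configuration port. A minimal model in Python follows, with a hypothetical `IcapModel` class standing in for real configuration frame access; the addresses are illustrative only.

```python
class IcapModel:
    """Toy model of configuration memory access through ICAP: a flat
    word store indexed by address. Real BRAM readback and writeback go
    through Xilinx configuration frames, which this sketch abstracts."""
    def __init__(self):
        self.mem = {}                          # address -> word

    def readback(self, base, length):
        return [self.mem.get(base + i, 0) for i in range(length)]

    def write(self, base, words):
        for i, w in enumerate(words):
            self.mem[base + i] = w

def icap_copy(icap, src_base, dst_base, length):
    """R3TOS-style intertask communication: move a message from one
    module's output buffer to another module's input buffer by reading
    it back through ICAP and writing it to the destination address."""
    icap.write(dst_base, icap.readback(src_base, length))
```

No on-chip network is needed; the cost is reconfiguration port bandwidth, which is one reason a fault-tolerant ICAP controller becomes critical in such a system.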
Researchers at the University of Florida are working toward a framework for the use of commercial off-the-shelf (COTS) FPGAs in space applications. The system also utilizes dynamic partial reconfiguration, and one main aspect of their work is adaptable fault tolerance that is achieved by adding and removing redundant reconfigurable modules depending on external constraints in terms of availability and power [Jacobs et al. 2012b;Yousuf et al. 2011]. The basic structure of the proposed framework is depicted in Figure 13. Several reconfigurable partitions are connected to a controller that comprises failure detectors. The controller itself is connected via a PLB bus to an on-chip MicroBlaze soft core. The controller, the bus, and the soft core, as well as its peripherals, are placed in the static area of the FPGA. The static area is hardened against SEUs by applying TMR to the netlist of the design. The researchers also investigated the suitability of ABFT for such systems [Jacobs et al. 2012a] and presented their own fault injection system [Cieslewski et al. 2010].
Researchers at the University of Arizona are working on a Virtex-5-based partial reconfigurable system called SCARS, which is introduced in Sreeramareddy et al. [2008] and based on a two-level healing methodology. The system comprises five Virtex-5 FPGAs, each including a MicroBlaze soft core that is responsible for the self-healing. The partial reconfigurable modules are partitioned together with redundant copies into so-called slots and connected to the MicroBlaze processor bus. The software running on the MicroBlaze is responsible for the detection of faulty modules. If a fault is detected, the module is scrubbed through the ICAP interface. If the failure persists, it is seen as a hard error and another redundant module in the slot is activated. The five FPGAs are connected to a master node in a wireless network. Once all modules in a slot are faulty, the task that was running in the faulty slot is moved by the master node to another FPGA.
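The two-level healing flow can be sketched as follows; the `Slot` structure and the callback hooks are hypothetical stand-ins for the MicroBlaze software and the master node, not the SCARS implementation itself.

```python
from dataclasses import dataclass, field

@dataclass
class Slot:
    """One slot: a set of redundant modules, the task it runs, the
    currently active module, and the modules marked as hard-failed."""
    modules: list
    task: str
    active: str
    failed: set = field(default_factory=set)

def heal(slot, scrub_ok, activate_spare, migrate_task):
    """Two-level healing sketch: first try to repair the faulty module
    by scrubbing it through ICAP; if the failure persists, treat it as a
    hard error and switch to a redundant module in the same slot; once
    every module in the slot has failed, ask the master node to move
    the task to another FPGA."""
    if scrub_ok(slot.active):              # level 1: soft-error repair
        return "scrubbed"
    slot.failed.add(slot.active)           # persistent -> hard error
    spares = [m for m in slot.modules if m not in slot.failed]
    if spares:                             # level 2: cold redundancy
        slot.active = spares[0]
        activate_spare(spares[0])
        return "spare activated"
    migrate_task(slot.task)                # slot exhausted: migrate
    return "task migrated"
```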

Reconfigurable Stream Processors
The work on reconfigurable FPGAs at the University of Braunschweig in Germany goes back to a data processing unit for a camera on board the Venus Express mission. In 2007, an update of this architecture was proposed that also allows in-flight partial reconfiguration [Osterloh et al. 2007]. To interconnect the partial reconfigurable modules, a NoC called SoCWire was proposed in Osterloh et al. [2008], which is heavily based on SpaceWire, the only space-qualified point-to-point network architecture. The main motivation for using a NoC approach can be found in Osterloh et al. [2009]: during the reconfiguration process, glitches can occur because frames become directly active during the write cycle. To qualify such a system for space applications, the partial reconfigurable module must be isolated from the host system, which can be optimally achieved by a NoC approach. The successful isolation, and thus protection against such glitches, was proven in Osterloh et al. [2010]. SoCWire became part of a demonstrator platform called the dynamically reconfigurable processing module (DRPM), developed in cooperation with the European Space Agency and Astrium Ltd. in the United Kingdom. The demonstrator comprises one or more modules, each with a radiation-hardened reconfiguration controller and two Virtex-4 devices. In 2011, an advanced microcontroller bus architecture (AMBA) to SoCWire bridge was presented in Michel et al. [2011], and recently a higher-level protocol called SoCP, which is also based on a SpaceWire protocol, was introduced in Michel et al. [2012]. The basic structure of the DRPM can be seen in Figure 14. The reconfiguration controller, depicted on the left-hand side of Figure 14, is implemented on a reliable antifuse FPGA. It comprises a SoC with a LEON3 CPU and several peripherals, such as memory controllers. The Virtex-4 FPGAs, one of them depicted on the right-hand side, are divided into reconfigurable partitions that are interconnected via a SoCWire routing switch.
A more theoretical framework that deals with fault detection, isolation, and recovery (FDIR) for SRAM-based FPGAs is proposed by researchers at the University of Brno in the Czech Republic. Several hardware blocks are arranged in a hardwired processing pipeline. For each hardware block, a different redundancy mode can be applied, such as TMR or DWC. The output of the failure detectors is connected to a bus, and the health status of the hardware blocks is reported via this bus to a reconfiguration controller. In contrast to the other approaches presented here, the reconfiguration controller is implemented in hardware. In the course of this research, the design of online failure checkers was first proposed in Straka et al. [2007] and later extended to the overall framework [Straka et al. 2010a]. The reconfiguration controller is described in Straka et al. [2010b], and subsequent work presented a fault injection system and a dependability analysis for the framework.

SUMMARY OF EXISTING MITIGATION TECHNIQUES
Research on mitigating radiation effects in SRAM-based FPGAs for space applications has engendered a large number of publications. As was shown in Section 3.3, the proposed methodologies can be split into just a few main categories targeting (i) the configuration memory, (ii) the user logic, or (iii) the BRAMs. Regrettably, the open literature does not contain enough design details to compare the existing methods fairly. In addition, on-board designs depend very much on mission objectives and constraints. In most cases, the decision on the use of a particular technique will be based on a trade-off between power, area, and performance overheads as well as achievable system availability. In this section, a summary of the reviewed mitigation methods is presented, illustrated by an example decision strategy for selecting the right mitigation technique. It is our hope that the proposed decision strategy can serve as a guide to researchers and engineers who are new to the field. However, it is expected that designers will exercise their own judgment and draw their own conclusions, taking into account the specifics of their projects when considering the following recommendations. Figure 15 exemplifies the main steps that a decision process for selecting a particular mitigation technique for SRAM-based FPGAs on board spacecraft might involve. Considering that space engineering projects usually require strict verification procedures, it might be wise to choose the solution with the lowest implementation complexity that meets the given constraints.
If the FPGA design is often reconfigured, such as when the chip area is shared by several applications, it could be decided not to apply any mitigation technique at all. This is because every time the system is reconfigured, it is brought back into a safe initial state; that is, possible faults in the configuration memory and user memory are removed as well. This simple solution has no additional power, area, or performance overhead. If it does not lead to a satisfactory system availability, however, one may add periodic scrubbing, which is able to remove faults in the configuration memory while the system is running. Failures can still be "trapped" in state-dependent user logic, but fortunately the ratio of user memory elements to sensitive configuration bits is often small enough that system availability increases significantly anyway.
If frequent full reconfigurations are not part of normal operation, one must consider adding a combination of mitigation techniques. Applying spatial redundancy without implementing a repair strategy (like scrubbing) is not recommended because it only extends the time span until the system becomes unreliable. On the other hand, a strategy based solely on scrubbing is not an ideal solution either because faults can manifest as permanent failures within the user logic. Thus, a failure detection and/or failure masking technique is usually combined with a failure recovery technique.
Most payload data and imaging applications could be classified as either being of a processor type or a stream type. The first type comprises all kinds of microcontrollers and microprocessors or custom-built processors used for data acquisition and similar tasks. These types of applications are never or rarely reset or restarted and may contain a large state space and a huge amount of state variables. Embedded RAM is often used to store data over a long period of time. The second type comprises circuits that can mainly be found in payload data processing applications, for tasks such as data compression, encryption, or filtering. These types of applications process data blockwise, such as image by image, and often the state space is traversed with each data block. Typically, the number of state variables is low, and embedded RAM is mainly used for FIFO buffers.
Two possible mitigation strategies can be followed:
(1) Spatial redundancy (partial TMR or full TMR) is applied to the netlist of the circuit, and the configuration memory is periodically scrubbed.
(2) The whole circuit is duplicated or triplicated (modular redundancy), and a comparator (for duplication) or a majority voter (for triplication) is used as a failure detector. Failure recovery can then be triggered on demand.
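The failure detectors behind the two strategies reduce to a bitwise majority voter (for triplication) and an equality comparator (for duplication). A minimal sketch, with illustrative function names:

```python
def tmr_vote(a, b, c):
    """Bitwise majority voter over three redundant outputs: an upset in
    a single copy is masked. Also reports whether any mismatch was seen,
    so that repair (e.g., scrubbing) can be triggered on demand."""
    voted = (a & b) | (a & c) | (b & c)    # 2-out-of-3 per bit position
    return voted, not (a == b == c)

def dwc_compare(a, b):
    """Duplication with comparison: a fault is only detected, never
    masked; on mismatch the output must be suppressed (fail silent)
    until recovery completes."""
    return (a, False) if a == b else (None, True)
```

For example, `tmr_vote(0b1010, 0b1010, 0b0010)` masks the single upset and returns the correct word together with a raised mismatch flag.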
The spatial redundancy mitigation strategy goes well with the processor type of application because this low-level redundancy approach allows automatic data resynchronization after repair. This strategy is also simple to apply because commercial tools exist that automate the insertion. However, one must carefully verify that the toolchain used does not optimize away the inserted redundancy. If the design is small enough and the power budget relaxed, full TMR leads to the best possible system availability. If the design is too large to apply full TMR, then one could either use a multi-FPGA system or apply partial TMR. With partial TMR, typically only the feedback loops are protected, but not the data path within the user logic. As a consequence, errors will become visible as transient failures at the output of the FPGA. No matter which kind of TMR is applied, one must implement periodic scrubbing as well. If external radiation-hardened hardware is available or can be afforded, external scrubbing is the more reliable approach. Whether blind or readback scrubbing should be used depends on the particular application, but blind scrubbing is surely the solution with the lowest implementation complexity.
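The difference between the two scrubbing variants can be sketched as follows, with the configuration memory modeled as a flat list of frames (real scrubbers operate on configuration frames via SelectMAP or ICAP):

```python
def blind_scrub(config, golden):
    """Blind scrubbing: unconditionally rewrite every frame from the
    golden copy on a fixed period; no readback, lowest complexity."""
    for addr, frame in enumerate(golden):
        config[addr] = frame

def readback_scrub(config, golden):
    """Readback scrubbing: read each frame first and rewrite only on
    mismatch. Returns the number of repaired frames, which can double
    as an upset counter for health monitoring."""
    repaired = 0
    for addr, frame in enumerate(golden):
        if config[addr] != frame:
            config[addr] = frame
            repaired += 1
    return repaired
```

Blind scrubbing writes every frame every pass, whereas readback scrubbing trades extra read traffic and complexity for upset statistics and fewer writes.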
The modular redundancy mitigation strategy goes well with the stream type of application because the user logic can be brought back to a safe initial state after each data block. One drawback of this mitigation approach is that state variables must be synchronized between redundant instances after repair. But since data are processed blockwise and the number of state variables is usually low, technical solutions can be found to execute the data resynchronization between data blocks. Despite the increased implementation complexity, this mitigation approach offers some real benefits. For instance, the configuration memory must only be repaired after a failure has been detected, which can maximize the system availability and minimize the power consumption compared to the periodic scrubbing approach. Often, the stream type of application does not require the maximum possible system availability. For instance, one may tolerate a short downtime with each upset as long as the system is fail silent. In this case, it is sufficient to duplicate the circuit and to use a comparator as a failure detector. If downtime is not an option, modular TMR can be applied. If not enough chip area is available to triplicate the circuit, one can investigate whether RPR is applicable. Then, failures can be masked, but the output of the circuit might be degraded in precision until the faulty module is repaired. Modular redundancy also goes well with multi-FPGA systems because redundant instances can be distributed over several FPGAs.
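One common formulation of RPR, not necessarily the exact scheme used in the works cited above, checks the full-precision result against two cheap reduced-precision replicas and falls back to a degraded value when the full-precision module is suspect:

```python
def rpr_output(full, red1, red2, tol):
    """Reduced-precision redundancy sketch: red1 and red2 are outputs of
    two reduced-precision replicas and tol is their precision bound.
    While the full-precision result agrees with either replica to within
    tol, it is passed through; otherwise the full module is suspected
    faulty and a degraded reduced-precision value is output instead.
    Returns None when the replicas disagree too (no usable majority)."""
    if abs(full - red1) <= tol or abs(full - red2) <= tol:
        return full                        # full-precision result confirmed
    return red1 if abs(red1 - red2) <= tol else None
```

The output thus remains available through an upset in the full-precision module, at the cost of temporarily degraded precision until the module is repaired.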

CONCLUSIONS
SRAM-based FPGAs are well suited for modern approaches to spacecraft data processing because these devices offer high-performance computing capabilities as well as a large amount of logic and memory resources. However, even though some manufacturers offer radiation-hardened devices, most systems will still demand a methodology to (further) mitigate soft errors triggered by radiation effects.
This article is meant as a literature survey and a tutorial for scientists and engineers who need a quick but thorough overview of this topic. Comprehensive coverage of all aspects of mitigating radiation effects in SRAM-based FPGAs on board spacecraft is given. Design guidelines for the right choice of mitigation techniques are provided as well. Although the final choice may depend heavily on project constraints, the proposed recommendations can still serve as a starting point in what is a difficult and multifaceted design process.