A Pipelined Algorithm and Area-Efficient Architecture for Serial Real-Valued FFT

In this brief, we present a novel pipelined architecture for real-valued fast Fourier transform (RFFT), which is dedicated to processing serial input data. An optimized algorithm is proposed for stage division in RFFT to achieve an area-efficient RFFT computing structure with full hardware utilization. A single path butterfly and a real rotator are merged into one processing element (PE) in each stage, except the last stage, to reduce hardware resource utilization. In addition, a novel shift-adder is designed for N-point RFFT with W-bit signals, and a new data management method based on the PEs is proposed, which saves resources with a more regular flow graph.

Hongji Fang, Bo Zhang, Feng Yu, Member, IEEE, Bei Zhao, and Zhenguo Ma

I. INTRODUCTION
Fast Fourier Transform (FFT) has been widely used in many areas of digital signal processing [1]-[3]. Generally, FFT operates on complex input data (CFFT), and it has been studied extensively in recent years [4], [5]. In addition, there has been growing interest in FFT architectures for real-valued signals (RFFT), because most signals in the natural world are real-valued. For example, biomedical signals, including electrocardiograms and electroencephalograms, are real-valued [6]. The FFT of a real-valued signal is conjugate symmetric, so nearly half of the calculations are redundant. Consequently, for real-valued input data, RFFT architectures are more efficient than CFFT architectures.
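The conjugate symmetry mentioned above can be checked numerically. The following is a minimal sketch, not part of the proposed design: it uses a direct O(N^2) DFT, and the helper names `dft` and `is_conjugate_symmetric` as well as the sample values are illustrative assumptions.

```python
import cmath

def dft(x):
    """Direct DFT: X(k) = sum_n x(n) * exp(-2j*pi*n*k/N)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

def is_conjugate_symmetric(spectrum, tol=1e-9):
    """Check X(N-k) == conj(X(k)) for k = 1 .. N-1."""
    n_points = len(spectrum)
    return all(
        abs(spectrum[n_points - k] - spectrum[k].conjugate()) < tol
        for k in range(1, n_points)
    )

# Arbitrary real-valued test signal (hypothetical values)
real_signal = [0.5, -1.0, 2.0, 3.5, -0.25, 1.0, 0.0, -2.0]
spectrum = dft(real_signal)
symmetric = is_conjugate_symmetric(spectrum)
```

Because X(N − k) is the conjugate of X(k), only the bins k = 0 to N/2 carry independent information, which is the redundancy an RFFT architecture exploits.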
In recent years, many computing architectures for RFFT have been proposed [7]-[11]. They can be divided into two categories according to their implementation mode: memory-based and pipelined architectures. Memory-based architectures are more appropriate for applications that are limited by area, while pipelined architectures are suitable for real-time processing. Notably, little research has been conducted to optimize the serial architectures of pipelined RFFT. Chinnapalanichamy and Parhi [11] presented two different RFFT computing architectures for serial input signals. These architectures take advantage of interleaved datapaths to save hardware resources, but their butterflies are only half-utilized. Garrido et al. [9] proposed a real-valued Serial Commutator (RSC) FFT architecture with a novel data management strategy. This design not only achieves 100% hardware utilization but also uses the fewest real adders and real multipliers.
In this brief, we focus on serial pipelined RFFT architectures that can reach minimum hardware usage. We present a pipelined algorithm and an area-efficient architecture with an optimized stage partition strategy based on a modified radix-2 DIF algorithm for serial RFFT. With this architecture, a more regular flow graph is obtained, which simplifies the understanding of the RSC FFT and reduces the number of hardware components. Specifically, one bypass datapath of the single path butterfly at every stage and one register belonging to the rotators in the conventional architecture can be removed, which reduces area compared to the previous work. Moreover, a shift-adder with a Y-bit ripple carry adder instead of a W-bit ripple carry adder is proposed for the last stage, saving hardware resources without adding latency. Here, W represents the word length of each real-valued data, and the relationship between W and Y is described by equation (2).
The rest of this brief is organized as follows: First, Section II reviews the modified radix-2 DIF algorithm for RFFT and explains both the proposed algorithm and its corresponding computing architecture for serial data. Second, the data management of the proposed architecture is presented in Section III. Third, the PEs in the proposed architecture are detailed in Section IV. Section V compares the proposed architecture with others in terms of area and performance and describes its implementation on a hardware platform. Finally, we conclude this brief in Section VI.

II. PROPOSED PIPELINED ALGORITHM AND COMPUTING ARCHITECTURE
The N-point discrete Fourier transform of a signal x(n) is defined as:

$$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \qquad (1)$$

where $W_N^{nk} = e^{-j 2\pi nk/N}$ is referred to as the twiddle factor. When x(n) is real, almost half of the computation is redundant because of the conjugate symmetry. A modified radix-2 RFFT algorithm based on the DIF algorithm is proposed in [8]. A pipelined architecture based on this modified algorithm contains only a real datapath. Furthermore, it is unnecessary to compute all the output frequencies X(k).

Based on the modified radix-2 RFFT algorithm, a novel pipelined algorithm is proposed for the serial N-point RFFT. This algorithm is described by the pseudocode in Algorithm 1 and Algorithm 2. Unlike the traditional approach, we merge the butterflies and rotators into one PE at stages 1 to $\log_2 N - 1$, which is illustrated in Algorithm 1. In addition, the complete pipelined processing procedure is presented in Algorithm 2, which handles an N-point serial data sequence $D_{in}$. The algorithm is divided into $n = \log_2 N$ stages, where s represents the index of each stage.

1549-7747 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
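For orientation, the stage structure of a DIF decomposition can be sketched in software. The block below is the standard complex radix-2 DIF FFT, shown only as an assumed baseline; it is not the modified real-valued algorithm of [8] nor Algorithm 1 or Algorithm 2, and the function name `fft_dif` is hypothetical.

```python
import cmath

def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT.

    len(x) must be a power of two. Each recursion level corresponds to
    one of the log2(N) stages: a butterfly followed by a rotation.
    """
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterfly sums feed the even-indexed output frequencies
    sums = [x[i] + x[i + half] for i in range(half)]
    # Butterfly differences are rotated by W_N^i and feed the odd ones
    diffs = [
        (x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
        for i in range(half)
    ]
    out = [0j] * n
    out[0::2] = fft_dif(sums)
    out[1::2] = fft_dif(diffs)
    return out
```

In this baseline every stage pairs a butterfly with a rotator, which is the pairing that the proposed algorithm merges into a single PE.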
To better illustrate Algorithm 1 and Algorithm 2, a serial pipeline structure for N = 32 points with word length W = 16 is presented in Fig. 1. It comprises $\log_2 N = 5$ stages, each of which includes data commutators and calculating PEs, corresponding to the Rearrange, ButterflyRotator, and ShiftAdder functions in Algorithm 2. Here, ShiftAdder means shift-adding the input data sequence, which is detailed in Section IV.
The correspondence between the segments of the pipeline architecture and the functions in Algorithm 2 is also marked with arrows and letters. For instance, at stage 1, only the butterfly and rotator are required, corresponding to the AB part in Fig. 1.
When $2 \le s \le \log_2 N - 1$, a new data sequence, which we derive, is generated by the data commutators at the beginning of each stage. The Rearrange function can be easily implemented by cascading the basic commutating circuit in Fig. 3(a), as detailed in Section IV. Then, the calculating PE in each stage plays the same role as the ButterflyRotator function shown in Algorithm 1, which is mapped onto the AB, CD, and EF parts in Fig. 1. The architecture and timing diagram of these PEs are detailed in Section IV.
At stage n, there is only one butterfly, which can be implemented in the bit dimension with the shift-adder in Fig. 5(b). The shift-adder adds (subtracts) $D_n(1)$ and $D_n(2)$, whose word length is W, and is designed to reduce hardware usage. To get the correct result, a Y-bit adder (subtractor) is used, which adds (subtracts) Y bits at a time from the LSB to the MSB of the W-bit input data. Shifting both the input data and the partial addition (subtraction) results then generates the complete W-bit result. To obtain pipelined output data, the shift addition must complete the calculation within N − 2 clock cycles. This requirement restricts Y, and the resulting expression for Y is given in equation (2).

For general N, the hardware resources used in this architecture are calculated as follows. The architecture has two real adders per butterfly and rotator PE, and there are $\log_2 N - 2$ butterfly and rotator PEs in the architecture. In addition, there is one adder per single path butterfly and a Y/W adder in the last stage. Therefore, the total number of real adders is $2\log_2 N - 3 + Y/W$. In terms of real multipliers, there is only one in each butterfly and rotator PE, so the number of real multipliers is $\log_2 N - 2$. The total numbers of real data registers and multiplexers are derived in the same way and are compared with other architectures in Table II.
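The adder and multiplier counts above can be evaluated for concrete parameters. The sketch below takes the two closed-form counts from the text. The function `smallest_y` is only an assumption about how Y could be chosen: it supposes that the addition and the subtraction run back to back, each taking ceil(W/Y) cycles, and that both must fit in N − 2 cycles. It is not a reproduction of equation (2). All function names are hypothetical.

```python
from math import ceil, log2

def real_adders(n_points, word_len, y_bits):
    """Total real adders as stated in the text: 2*log2(N) - 3 + Y/W."""
    return 2 * log2(n_points) - 3 + y_bits / word_len

def real_multipliers(n_points):
    """Total real multipliers as stated in the text: log2(N) - 2."""
    return int(log2(n_points)) - 2

def smallest_y(n_points, word_len):
    """ASSUMPTION, not equation (2): smallest Y such that an addition
    plus a subtraction, each ceil(W/Y) cycles, fit in N - 2 cycles."""
    for y_bits in range(1, word_len + 1):
        if 2 * ceil(word_len / y_bits) <= n_points - 2:
            return y_bits
    raise ValueError("no Y-bit adder meets the N - 2 cycle budget")

y_small = smallest_y(32, 16)      # N = 32, W = 16 example of Fig. 1
y_large = smallest_y(1024, 24)    # N = 1024, W = 24 case of Section V
adders_large = real_adders(1024, 24, y_large)
```

Under this assumption the N = 32, W = 16 case yields a 2-bit adder, in line with the 2-bit adder/subtractor described in Section IV, and the N = 1024, W = 24 case rounds to the 17.0 real adders quoted in Section V.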

III. PROPOSED RFFT DATA MANAGEMENT
The data flow graph of the proposed architecture for a 32-point RFFT is shown in Fig. 2. As can be seen, the serial data are processed in continuous-flow clock cycles. In the horizontal direction, there are 5 stages, and the letters A to J appear at the top and bottom of the data flow. They indicate the points of the proposed pipelined circuit shown in Fig. 1. Furthermore, the dashed boxes from A to B, C to D, and E to F represent PE1 in Fig. 1. In the vertical direction, the numbers in the square boxes represent the nk in equation (1), the circled numbers in Fig. 2 show the time instance at which each data arrives, and the red numbers represent the data index of each stage. For example, the first data leaves the end of stage 2 with index number 0 at time instance 8. This number can be calculated by summing the latencies between A and B, B and C, and C and D. Finally, the output frequencies are presented as red numbers within parentheses in the last column of the data index. Notably, the first output frequency is at the circled number 38 in the last stage, whose index is 2r (8r).
In stage 1, combining the butterflies with the rotators into PE1 saves four registers of a rotator and one computing cycle compared to the previous work. Unlike the conventional design, the operators of the butterflies in the dashed boxes change from cycle to cycle, following the pattern {+, −, −, +} every four clock cycles.
Stages 2, 3, and 4 begin with data commutators, which interchange data arriving in different clock cycles by using the commutator circuit shown in Fig. 3(a), as detailed in Section IV. Finally, there is only one butterfly computation at the last stage, which is mapped to the shift-adder in Fig. 5(b). We turn the butterfly into the bit dimension to save resources without additional latency, as detailed in Section IV.
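The interchange performed by a basic commutator can be simulated cycle by cycle. The following is a sketch under an assumed wiring, since Fig. 3(a) is not reproduced here: a buffer of `delay` registers, one multiplexer at the buffer input, and one at the output, as in common bit-reversal circuits. The function name `commute` and the control set `swap_times` are hypothetical.

```python
from collections import deque

def commute(samples, delay, swap_times):
    """Simulate a serial commutator with `delay` registers and two muxes.

    A sample normally leaves `delay` cycles after it arrives. If cycle t
    is in swap_times, the sample arriving at t trades output slots with
    the one that arrived at t - delay.
    """
    buffer = deque([None] * delay)
    outputs = []
    total_cycles = len(samples) + delay  # extra cycles flush the buffer
    for t in range(total_cycles):
        x = samples[t] if t < len(samples) else None
        oldest = buffer.popleft()
        if t in swap_times:
            outputs.append(x)        # output mux: bypass the buffer
            buffer.append(oldest)    # input mux: recirculate stored sample
        else:
            outputs.append(oldest)   # output mux: take the delayed sample
            buffer.append(x)         # input mux: store the new sample
    return outputs[delay:]           # drop the initial fill latency

# Swap the samples arriving at cycles 1 and 3 (delay = 2)
reordered = commute(list(range(8)), delay=2, swap_times={3})
```

Cascading two such blocks with different delays, by feeding the output of one `commute` call into another, gives a software analogue of rearranging a stage's data sequence with two basic circuits.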

IV. PROCESSING ELEMENTS
The structure and the timing diagram of PE1 in the first stage are presented in Fig. 4 and Table I. In this example, the single path butterfly first performs the addition $x_0 + x_{16}$ and then calculates the subtraction $x_0 - x_{16}$ in the next clock cycle. Then, unlike the conventional design, the single path butterfly calculates the subtraction $x_8 - x_{24}$ immediately after $x_0 - x_{16}$. Finally, the single path butterfly performs the addition $x_8 + x_{24}$ and starts the next loop by adding $x_4$ and $x_{20}$ in the following cycle. The rotator of PE1 rotates $x_0 - x_{16}$ together with $x_8 - x_{24}$, and $x_4 - x_{20}$ together with $x_{12} - x_{28}$. The value $x_0 - x_{16}$ and the product $(x_0 - x_{16}) \cos\alpha$ are produced at the same time.
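One way to see why differences that are N/4 apart, such as $x_0 - x_{16}$ and $x_8 - x_{24}$ for N = 32, are rotated together is the following DFT identity, which follows directly from equation (1). The sketch is an illustration only and is not claimed to be the exact formulation used in [8] or in PE1; the helper names `dft` and `bins_4k_plus_1` are hypothetical.

```python
import cmath

def dft(x):
    """Direct DFT following equation (1)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

def bins_4k_plus_1(x):
    """Compute X(4k+1), k = 0 .. N/4-1, from paired differences.

    The two differences (x[i] - x[i+N/2]) and (x[i+N/4] - x[i+3N/4])
    act as real and negated imaginary parts of one complex value,
    which is rotated by W_N^i before an N/4-point DFT.
    """
    n = len(x)
    q = n // 4
    rotated = [
        ((x[i] - x[i + 2 * q]) - 1j * (x[i + q] - x[i + 3 * q]))
        * cmath.exp(-2j * cmath.pi * i / n)
        for i in range(q)
    ]
    return dft(rotated)

# Arbitrary real-valued 32-point test signal (hypothetical values)
signal = [((7 * n) % 11) - 5.0 for n in range(32)]
full = dft(signal)
partial = bins_4k_plus_1(signal)
```

For real input, the remaining odd bins X(4k+3) are the conjugates of bins already computed, so one rotation per pair of differences is enough.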
The commutator circuit that rearranges the data sequences is illustrated in Fig. 3(a) and is referred to as the circuit for bit reversal [13]. This basic circuit contains L registers and two multiplexers, and it interchanges two signals arriving at times $t_{in}$ and $t_{out} = t_{in} + L$. As far as data management is concerned, the data sequence is rearranged by cascading two basic permutation circuits at stage $s = 2, 3, \ldots, \log_2 N - 1$, as shown in Fig. 3(b). In Fig. 3(b), L = 2 s − 1, and the control signals $S_0$ and $S_1$ can be easily generated by one $n = \log_2 N$-bit counter.

In stage 5, we turn the butterfly into the bit dimension with the shift-adder, which consists of three shift registers, one register, and a 2-bit adder/subtractor, as shown in Fig. 5(b). The structure of the 2-bit ripple carry adder is shown in Fig. 5(a). This circuit consists of four full adders (FA), which are used to calculate the addition of 2-bit data [14]. Because the bit width of the real data is 16, we can add two bits at a time from low to high, shifting in each 2-bit addition result. The 16-bit addition result is obtained within 8 clock cycles, and another 8 clock cycles are taken to get the subtraction result. Finally, both the addition and subtraction results are registered until times 68 and 69, as shown in Fig. 2.
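The bit-dimension addition described above can be modeled in a few lines. This is a behavioral sketch, assuming two's-complement words and subtraction by inverting the second operand with an initial carry of 1; the register-level structure of Fig. 5(b) is not modeled, and the function name `shift_add` is hypothetical.

```python
def shift_add(a, b, word_len=16, y_bits=2, subtract=False):
    """Add or subtract two W-bit words, Y bits per clock cycle.

    Returns (result modulo 2**W, number of cycles used).
    """
    mask_w = (1 << word_len) - 1
    mask_y = (1 << y_bits) - 1
    a &= mask_w
    b &= mask_w
    if subtract:
        b = ~b & mask_w              # invert; the carry-in of 1 completes -b
    carry = 1 if subtract else 0
    result = 0
    cycles = 0
    for shift in range(0, word_len, y_bits):
        # One cycle: Y-bit slices from LSB to MSB plus the saved carry
        chunk = ((a >> shift) & mask_y) + ((b >> shift) & mask_y) + carry
        result |= (chunk & mask_y) << shift
        carry = chunk >> y_bits      # carry saved for the next slice
        cycles += 1
    return result & mask_w, cycles

total, add_cycles = shift_add(1234, 4321)
difference, sub_cycles = shift_add(1234, 4321, subtract=True)
```

With W = 16 and Y = 2 each operation takes 8 cycles, matching the cycle counts given in the text, while only a 2-bit adder is needed instead of a 16-bit one.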

V. COMPARISON AND EXPERIMENT RESULTS
Table II compares the area and performance of various N-point pipelined FFT architectures for serial data. We compare the proposed architecture with state-of-the-art pipelined CFFT and RFFT architectures that take serial input. For area, Table II lists the number of real adders, real multipliers, real sample registers, and real multiplexers. The performance in terms of latency and throughput is also listed. To count real adders and multipliers, we consider a complex rotator to be composed of four real multipliers and two real adders.
Table II shows that the proposed architecture uses the minimum number of real adders, real multipliers, registers, and multiplexers. Compared with the CFFT architectures in [4], [12], the proposed structure saves 33.3% of the adders and approximately 50% of the multipliers and registers. As for the RFFT architectures, [11] uses four times as many multipliers, twice as many adders, and more memory than the proposed architecture. Compared with [9], we use a shift-adder with a Y-bit ripple carry adder instead of a W-bit ripple carry adder in the last stage; the area of the W-bit adder is W/Y times that of the Y-bit one. The dependence of ripple carry adder area on bit width was studied in [14]. Therefore, the proposed architecture uses the same number of multipliers as [9], and its total number of adders is 1 − Y/W less than that of [9]. Meanwhile, the proposed architecture reduces the number of registers and multiplexers. For instance, when N = 1024 and W = 24, the total numbers of real adders and real multiplexers in the proposed architecture are 17.0 and 71.0, saving 5.6% and 17.4% compared with [9], respectively.
The experimental results of the proposed RSC architecture for N = 1024 points and of the other serial FFT architectures in [12], [15], [16] for W = 16 and 24 bits are listed in Table III. The results of the proposed design are obtained by using the Vivado tool on a Virtex-7 XC7VX690T Field Programmable Gate Array (FPGA). In the proposed design, DSPs are used to realize real multiplication, so their number is the same as that of real multipliers, and Block-RAMs are used to store the twiddle factors. Table III shows that the LUTs and DSPs of the proposed architecture are decreased by 7.8% and 75.0% compared with [15]. Compared with [12], the proposed architecture achieves a higher clock frequency, and the LUTs and DSPs are decreased by 27.1% and 50.0%. Compared with [16], the proposed design uses more DSPs and Block-RAMs because the architecture in [16] is a multiplierless radix-$2^4$ FFT, which removes the need for DSPs and Block-RAMs. Meanwhile, the slices of the proposed architecture are decreased by 42.2% compared with [16].

VI. CONCLUSION
This brief proposes a novel pipelined algorithm for serial data based on the modified radix-2 DIF RFFT algorithm. The corresponding pipelined architecture obtains a more regular flow graph, which simplifies the understanding of the RSC FFT. Additionally, the proposed design uses the fewest hardware resources in terms of real adders, multipliers, registers, and multiplexers. Overall, the proposed architecture is more area-efficient than the prior designs.

TABLE II: COMPARISON OF PIPELINED HARDWARE ARCHITECTURES FOR THE N-POINTS RFFT ON SERIAL DATA