Ferroelectric FET-Based Time-Mode Multiply-Accumulate Accelerator: Design and Analysis

General-purpose multiply-accumulate (MAC) accelerators have become indispensable in internet-of-things (IoT) edge devices for performing computationally intensive tasks such as deep learning, signal processing, and combinatorial optimization. The throughput and energy-efficiency of conventional digital processors and MAC accelerators are limited by their sparse design, in which the memory and processing blocks are separated as in the von Neumann architecture. Although mixed-signal time-mode MAC accelerators utilizing emerging non-volatile memories appear promising owing to their ability to perform in-memory MAC operations via physical laws, their application is limited by their incompatibility and complex integration with the CMOS process, high sensitivity to process variations, large operating voltages/cell currents, etc. To mitigate these issues, in this work, we propose a time-mode MAC accelerator based on ferroelectric-FinFETs with CMOS-compatible doped-<inline-formula> <tex-math notation="LaTeX">${\text {HfO}_{{2}}}$ </tex-math></inline-formula> in the gate-stack. Our rigorous analysis reveals a trade-off between performance metrics such as the computational precision and the area- and energy-efficiency of the proposed MAC accelerator. Therefore, we provide the necessary design guidelines to further optimize the performance. 
Extensive design space exploration and simulations exploiting an experimentally calibrated compact model for the doped <inline-formula> <tex-math notation="LaTeX">${\text {HfO}_{{2}}}$ </tex-math></inline-formula> ferroelectric capacitor along with an experimentally calibrated baseline FinFET model for 14-nm technology indicate that the proposed MAC accelerator exhibits an energy-efficiency of 800 TeraOperations/Joule, a considerably high area-efficiency of 12.607 bits/<inline-formula> <tex-math notation="LaTeX">$\mu \text {m}^{{2}}$ </tex-math></inline-formula> (including I/O peripheral circuitry), and a throughput of 2.5 TeraOp/s while supporting a 4-bit MAC operation for a square weight matrix of size <inline-formula> <tex-math notation="LaTeX">$200\times200$ </tex-math></inline-formula>, which is sufficient for realistic inference tasks.


I. INTRODUCTION
ENERGY-EFFICIENT edge computing in this era of the internet-of-things (IoT) necessitates the development of dedicated neuromorphic processing engines. The multiply-accumulate (MAC) unit forms the fundamental building block of most hardware systems designed to accelerate computationally intensive applications such as deep learning, signal processing, and combinatorial optimization on mobile IoT devices. The sparse design of digital MAC accelerators such as the graphics processing unit (GPU) and the tensor processing unit (TPU), whereby the memory and processing blocks are separated, leads to significant energy dissipation and delay during dataflow on the bus [1]. Therefore, neuromorphic MAC accelerators, in which the memory and computational blocks are co-located, were proposed based on cross-point arrays of emerging non-volatile memories such as resistive random access memory (RRAM), phase-change memory (PCM), and flash memory [1], [2], [3]. The synaptic weights are stored as the conductance or the threshold voltage of the cells in these memory arrays, and the MAC outputs are readily obtained as the column currents of the crossbar via physical laws such as Ohm's law and Kirchhoff's law when the inputs are encoded as voltages. Although voltage- and current-mode MAC accelerators exhibit an enhanced energy-efficiency compared to digital MAC accelerators, their performance is limited by the use of active analog peripheral circuitry (DAC/ADC/current conveyor-based neurons) [1], [2], [3], [4]. The performance of MAC accelerators can be significantly improved by utilizing digital peripheral circuitry and performing the MAC operation with the inputs encoded in the time-domain rather than in the voltage-domain [5], [6]. The post-layout estimates for such time-mode MAC accelerators based on flash memory and RRAMs have shown enormous potential for energy-efficient edge computing. However, intrinsic device artifacts such as limited scalability (for flash memories), large temporal and spatial variability (for RRAMs), high operating/programming/forming voltage/current, and significant complexity of CMOS back-end-of-line (BEOL) integration limit their widespread application and large-scale hardware realization. Although time-domain MAC accelerators based on the mature SRAM technology perform better than those based on emerging non-volatile memories in terms of variability, operating voltage/current, and ease of large-scale fabrication [7], [8], the large number of transistors utilized for each memory cell (>6 T) [7], [8] limits their scalability and inherent area-efficiency.
Recently, ferroelectric field-effect transistors (FeFETs), specifically those based on doped-hafnium oxide stacks, have attracted significant attention for memory as well as neuromorphic computing owing to their CMOS compatibility, high scalability, high-precision polarization tunability using gate and drain program/erase schemes, and ultralow power consumption [9], [10], [11]. Considering the potential benefits of ferroelectric FinFETs, it becomes imperative to explore their application in time-mode MAC accelerators. Recently, a small-scale (two inputs and one output) proof-of-concept demonstration of the MAC operation utilizing the time-domain analog computing with transient state (TACT) approach based on a Ca0.2Sr0.8Bi2Ta2O9 (CSBT) FeFET was provided in [12]. However, the ferroelectric material (CSBT) used in [12] is not CMOS-compatible, and the planar FeFET dimensions (W/L = 50/7 µm) are too large compared to industry-standard FinFETs. For practical applications, it becomes essential to utilize a more realistic CMOS-compatible technology such as industry-standard FinFETs with doped HfO2 as the ferroelectric layer in the gate-stack. Also, a programmable delay-unit (DU)-based MAC accelerator utilizing a cascaded chain of 2-FinFET-1-ferroelectric-FinFET DUs (with CMOS-compatible HZO in the gate-stack) was proposed in [13]. In DU-based time-domain MAC accelerators, the input bits are fed to a chain of cascaded programmable DUs, and the MAC output is obtained as the delay of the chain [13]. In [13], the programmable DUs were realized by introducing a ferroelectric-FinFET in the pull-down network of a CMOS inverter and tuning the polarization state of the ferroelectric layer to modulate its driving strength and the delay of the DU. Although such a DU-based MAC accelerator, along with a single ferroelectric-FinFET implementing the "sign" activation function, exhibits a significantly high energy-efficiency, the individual DU area (3 FinFETs) is large and such an approach requires n DUs to implement an n-bit MAC operation. Moreover, the peripheral circuits were not realized in [12] and [13], and their contribution to the overall energy/area-efficiency was not considered. To analyze the true potential of FeFET-based time-mode MAC accelerators, a pragmatic approach must be followed that considers the system-level MAC implementation including the input-output peripheral circuits.
To this end, in this work, we propose a design methodology and perform a comprehensive analysis of time-mode inference accelerators utilizing a 1-FinFET (selector)-1-ferroelectric-FinFET (memory) supercell with zirconium-doped hafnium oxide (HZO) in the gate-stack, including the input-output and peripheral circuitry, for practical applications. Our extensive design exploration methodology, utilizing a calibrated compact model for the HZO ferroelectric capacitor integrated with the 14-nm FinFET technology, reveals an unforeseen trade-off between the computational precision and the area- and energy-efficiency of the proposed MAC accelerator. Moreover, the ferroelectric-FinFET-based time-mode MAC accelerator can support 4-bit MAC operations while exhibiting a significantly high energy-efficiency exceeding 800 TeraOperations/Joule and a considerably high area-efficiency exceeding 12.6 bits/µm² for a 200 × 200 square weight matrix and 200 inputs.

II. TIME-MODE MAC OPERATION
In general, any M × N MAC operation can be represented as a weighted sum in which x_i are the normalized inputs and w_ij are the normalized weights. For time-mode MAC accelerators, the inputs are encoded in the time-domain, i.e., the input pulse amplitude is fixed while the pulse duration is proportional to the input x_i. This is in contrast to voltage-mode MAC accelerators, where the voltage amplitudes are proportional to the inputs. Also, unlike voltage-mode MAC implementations, where the weights are encoded as the conductance states of tunable cross-point memory devices, in time-mode MAC accelerators the weights are encoded as reconfigurable/programmable current sinks. Here, we propose to use a 1-FinFET(selector)-1-Fe-FinFET(memory) supercell as the reconfigurable current sink [inset of Fig. 1(a)] for realizing the weights, considering the efficient tunability of the polarization states of the Fe-FinFETs without half-select disturbance in the 1-selector-1-FeFET configuration [12]. Moreover, the Fe-FinFET is connected at the source of the selector FinFET in this supercell architecture to exploit the negative feedback introduced by the source degeneration of the selector, which reduces the sensitivity of the supercell current to variations in its drain voltage and enables efficient operation of the supercell as a current sink. The weights (w_ij) are mapped to the supercell currents (I_ij) by tuning the polarization state of the Fe-FinFETs via the gate program/erase technique [14] for inference applications as in [15], where I_MIN and I_MAX are the minimum and maximum supercell currents corresponding to the minimum and maximum values of the polarization state, respectively, of the HZO ferroelectric in the gate-stack of the Fe-FinFETs. The inputs x_i are converted to voltage pulses with durations t_i = x_i T (where T is the maximum time duration) utilizing digital-to-time converter (DTC) circuitry (Section V) and applied at the gates of the selector FinFET (V_G,Sel) and the Fe-FinFET (V_G,Fe), as shown in Fig. 1(a). A load capacitor connected at the end of the column integrates the current through the reconfigurable supercell sinks for a duration proportional to the inputs. The integrated voltage on the load capacitor is encoded as the duration of the output pulse, t_out,j = Y_j T, with the help of simple neuron circuitry consisting of an S-R latch [Fig. 1(a)].
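The encoding described in this paragraph can be illustrated with a minimal Python sketch. The linear map between I_MIN and I_MAX, the uniform 4-bit quantization, and the function names are assumptions made for illustration; only t_i = x_i T and the existence of the bounds I_MIN and I_MAX come from the text.

```python
def quantize(value, bits=4):
    """Round a normalized value in [0, 1] to one of 2**bits uniform levels.

    Assumption: uniform levels; the text only states 16 polarization states.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    levels = 2 ** bits - 1
    return round(value * levels) / levels


def weight_to_current(w, i_min, i_max):
    """Map a normalized weight to a supercell sink current.

    Assumption: a linear map bounded by I_MIN (w = 0) and I_MAX (w = 1).
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return i_min + w * (i_max - i_min)


def input_to_pulse_width(x, t_window):
    """Time-domain input encoding from the text: t_i = x_i * T."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("input must lie in [0, 1]")
    return x * t_window
```

For example, with the time window T = 32 ns used later in the paper, an input of 0.5 maps to a 16 ns pulse of fixed amplitude.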
The MAC operation is typically performed in two cycles in the time-mode implementation. The load capacitor is first pre-charged to V_Reset during initialization. The input pulses from the DTC are then applied to the gate electrodes of both the selector FinFET and the Fe-FinFET to turn on the reconfigurable current sinks for a duration proportional to the inputs. The load capacitor is discharged by the reconfigurable supercell currents and its voltage decreases, as shown in Fig. 1(b). The durations of the inputs along with the programmed cell currents of the 1FinFET-1Fe-FinFET supercells dictate the charge lost from the column capacitor and hence the MAC output. However, directly sensing the charge lost from the load capacitor during the integration phase (the MAC output) is extremely difficult and requires complex, bulky, and energy-hungry peripheral circuitry. Therefore, in the evaluation phase, we use another supercell array to sink a constant current from the load capacitor and discharge it to the threshold voltage of the S-R latch neuron, which in turn fires a pulse whose width encodes the normalized MAC output. The inputs and outputs in the proposed approach are encoded in the time-domain and are converted from and to the digital domain by the DTC and the time-to-digital converter (TDC), respectively. Assuming a constant current through the programmable current sinks over the entire voltage swing (V_Reset − V_TH) across the load capacitor, at the end of phase I, the integration phase (at time t = T), the voltage at the load capacitor reduces to V_phaseI,j as given by (3). It can be observed from (3) that V_phaseI,j decreases by an amount proportional to the normalized MAC value. The exact value of the MAC operation is extracted in phase II, the evaluation phase. Moreover, to ensure that the minimum value of V_phaseI,j is V_TH (where V_TH is the threshold voltage of the S-R latch) and that the neuron circuitry generates an output pulse only in phase II, the load capacitor value is selected according to (4). In the evaluation phase, the load capacitor is discharged with a constant current M I_MAX through a similar 1FinFET-1Fe-FinFET array with all Fe-FinFETs programmed to the least polarization state, such that the neuron circuit generates an output pulse with a duration of T when the output of the MAC operation is 1, while no pulse is generated if the MAC output is 0. However, with an increase in the number of inputs and weights, the use of an additional 1FinFET-1Fe-FinFET supercell array for discharging the load capacitor in the evaluation phase leads to a significant increase in the area overhead, energy, and overall hardware cost. Although the additional supercell array was proposed to realize an all-Fe-FinFET-based MAC design, a current mirror circuit tuned to M I_MAX can also be used in the evaluation phase to reduce the area overhead when the number of inputs and weights is large [15].
The duration of the output pulse (T_out,j) can then be obtained and rearranged as in (6) and (7). Equations (6) and (7) indicate that the MAC output deviates linearly from the theoretically calculated value (T_out,j = y_j T) when the difference between the minimum supercell current I_MIN and the maximum supercell current I_MAX is not significantly large. The undesirable scaling factor p can be minimized by biasing the supercells to have a large difference between I_MAX and I_MIN. Furthermore, to nullify the additive component (q) of the actual MAC output and to support bipolar weights and MAC products, we utilize a differential approach in which each weight w_ij is implemented as a combination of a positive component w_ij^+ and a negative component w_ij^− such that w_ij = w_ij^+ − w_ij^−. In the differential configuration, each neuron is also realized as two sub-neurons: T_out,j^+, evaluating the MAC output associated with the positive components of the weights w_ij^+, and T_out,j^−, evaluating the negative component of the MAC output. As the sub-outputs (T_out,j^+ and T_out,j^−) are aligned at the end of phase II, their difference, which encodes the final MAC output, can be computed by utilizing simple logic circuitry such as an AND gate [17].
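The two-phase operation and the differential read-out described in this section can be sketched behaviorally in Python for one column with ideal current sinks. This is a sketch under stated assumptions, not the authors' circuit model: the linear weight-to-current map, the 1/M normalization, and the load capacitor sizing C = M I_MAX T / (V_Reset − V_TH) are inferred from the surrounding description, and all function names are hypothetical.

```python
def weight_to_current(w, i_min, i_max):
    # Assumed linear map from a normalized weight in [0, 1] to a sink current.
    return i_min + w * (i_max - i_min)


def output_pulse_width(x, currents, i_max, t_window, v_reset, v_th):
    """Two-phase time-mode MAC for one column with ideal current sinks."""
    m = len(x)
    # Assumed sizing: the worst case (all inputs 1, all cells at I_MAX)
    # discharges the capacitor exactly from V_Reset to V_TH in phase I.
    c_load = m * i_max * t_window / (v_reset - v_th)
    # Phase I: each cell sinks its programmed current for t_i = x_i * T.
    charge = sum(xi * t_window * ii for xi, ii in zip(x, currents))
    v_phase1 = v_reset - charge / c_load
    # Phase II: a constant current M * I_MAX discharges the capacitor
    # until it reaches V_TH, at which point the S-R latch fires.
    t_fire = (v_phase1 - v_th) * c_load / (m * i_max)
    # The output pulse lasts from the firing instant to the end of phase II.
    return t_window - t_fire


def differential_output(x, w_pos, w_neg, i_min, i_max, t_window, v_reset, v_th):
    """Difference of the two sub-neuron pulse widths (positive minus negative)."""
    pos = [weight_to_current(w, i_min, i_max) for w in w_pos]
    neg = [weight_to_current(w, i_min, i_max) for w in w_neg]
    t_pos = output_pulse_width(x, pos, i_max, t_window, v_reset, v_th)
    t_neg = output_pulse_width(x, neg, i_max, t_window, v_reset, v_th)
    return t_pos - t_neg
```

Under these assumptions, each sub-output contains an offset that depends only on the inputs and on I_MIN, so it cancels in the difference, leaving the ideal bipolar MAC value scaled by (I_MAX − I_MIN)/I_MAX. This is consistent with the statement that a large gap between I_MAX and I_MIN reduces the scaling error, although the exact definitions of p and q in the paper may differ from this sketch.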

III. SIMULATION METHODOLOGY
For design exploration and performance evaluation of the time-mode MAC accelerator utilizing Fe-FinFETs, an experimentally calibrated compact model for the ferroelectric FinFETs was used in the commercial SPICE simulator Cadence Spectre. This compact model for Fe-FinFETs is based on the self-consistent solution of the dynamics of the ferroelectric capacitor connected to a baseline industry-standard FinFET model [18], [19]. The 3-D view of the Fe-FinFET used in this study is shown in Fig. 2(a). The parameters of the compact model were tuned to reproduce the experimental ferroelectric capacitor characteristics in [16], as shown in Fig. 2(b). The ferroelectric capacitor model not only captures the static polarization characteristics accurately [Fig. 2(b)] but also reliably reproduces the program/erase behavior, the temporal dynamics during transient analysis, and the history effect of an HZO stack [11], [16]. For the baseline FinFET calibration, experimental data were extracted from a commercially fabricated minimum-channel-length n-FinFET in 14-nm technology having a W_eff per fin of 85 nm and a fin thickness of 7 nm. The device was placed in a ground-signal-ground (GSG) configuration. The on-wafer characteristics were extracted using a Cascade Summit 11K probe station together with a Keysight B1500A parameter analyzer. The experimentally extracted dc I_DS−V_DS and I_DS−V_GS characteristics agree well with the fitted model, as seen in Fig. 3. A 10-nm-thick ferroelectric layer is used in the gate-stack of the Fe-FinFET. Furthermore, a 1-FinFET(selector)-1-Fe-FinFET(memory) supercell was selected as the reconfigurable current sink to avoid write disturb in the unselected cells of the array by turning off their selector FinFETs [12]. By utilizing an optimal weight update scheme and effectively controlling the partial polarization states of the ferroelectric layer, a weight precision of more than 5 bits has already been demonstrated experimentally in FeFETs [14], [20], [21], [22], [23]. Moreover, the proposed MAC accelerator design targets a precision of 4 bits, which works well for certain neuromorphic applications such as computer vision, speech, and NLP, where moderate weight precision does not degrade the accuracy significantly [24], [25], [26]. The output characteristics of the Fe-FinFET with HZO in the gate-stack for different polarization states of the HZO layer (obtained with the incremental step pulse erase (ISPE) mechanism in [27]) are shown in Fig. 4.
The Fe-FinFET was programmed to 16 different polarization states using a series of erase pulses with decreasing amplitude (and constant width), and the output characteristics were obtained by sweeping the drain voltage while applying a read voltage at the gate of the Fe-FinFET (V_G,Fe = 0.4 V) that does not disturb its polarization state. The gate voltage of the selector FinFET (V_G,Sel) is chosen to maximize the range of read currents corresponding to the extreme polarization states while limiting the maximum current that can flow through the supercell to minimize power dissipation. The capability of the HZO layer to be programmed to 16 different polarization states results in 16 different threshold voltages for each Fe-FinFET. Although a 4-bit MAC operation is suitable for computationally intensive tasks such as deep learning, ferroelectric capacitors with a tuning precision ≥5 bits have been realized experimentally [14], [27]. Therefore, it is possible to extend the proposed simulation methodology to ≥5-bit MAC operations. To program a cell, an appropriate programming voltage is applied to the word line of the target cell, while the selected bit line and source line are grounded. Unselected bit lines, source lines, and word lines are kept at an inhibit voltage for program inhibition, while the bulk contacts of all the FeFETs are tied together and connected to ground during the program operation. However, this always-grounded bulk contact scheme may result in write-disturb issues [28]. To mitigate this write-disturb issue, a column-wise body connection scheme has been proposed [29], [30] and can be utilized for efficiently programming the polarization states in the proposed supercells.

IV. OPERATING POINT AND DESIGN SPACE EXPLORATION
In this section, we discuss the trade-offs in selecting an operating point for the supercell to design a highly area- and energy-efficient time-mode MAC accelerator with a computational precision suitable for realistic applications. Constraints such as the size of the weight matrix and input vector, the frequency of the MAC operation (which depends on the time window, T), and the bit-precision of the MAC outputs also need to be considered while choosing an appropriate operating point for the supercell.
Since time-mode MAC accelerators rely on the integration of current through multiple reconfigurable supercells (which should ideally behave like constant current sinks), any mechanism that causes the supercell currents to deviate from their programmed (weight) values introduces an error in the computation (inference) [31], [32], [33]. Also, capacitive coupling or charge sharing between the supercell and the load capacitor integrating the supercell currents can disturb the load capacitor voltage, further degrading the computational accuracy [33]. In addition, the voltage drop across the interconnect wires and the additional signal delays introduced by the parasitic wire capacitances also lead to errors in the computation, more so for large weight matrices (supercell arrays). However, the differential configuration effectively mitigates the impact of the fixed line parasitics. Furthermore, process variations in the input-output (I/O) and peripheral circuitry may introduce errors in the MAC computation during signal conversion from the digital-domain to the time-domain and vice versa. To analyze the impact of these non-ideal effects on the performance of the proposed Fe-FinFET-based time-mode MAC accelerator, we considered, apart from the experimentally calibrated compact model for Fe-FinFETs (which accurately captures artifacts such as the history effect, channel-length modulation (CLM), and drain-induced barrier lowering (DIBL)), the I/O circuitry and the interconnect parasitic resistance and capacitance (and their randomness due to process variation). Moreover, it has been experimentally demonstrated that ferroelectric FETs can be tuned to different polarization states with a precision >5 bits [14], [27]. Therefore, considering a weight tuning precision of more than 5 bits, the computational error E_out,j can be decoupled from the weight tuning error as long as the precision of the MAC operation is ≤5 bits [31], [32], [33]. The utilization of moderate-precision weights (<5 bits) has been shown to be effective for some neuromorphic applications such as computer vision, speech, and NLP without significantly degrading the state-of-the-art accuracies [24], [25], [26]. To estimate the computational precision of the MAC output of the proposed time-domain accelerator based on ferroelectric FETs, we performed MAC-level circuit simulations in the SPICE simulator utilizing an experimentally calibrated compact model for the 1FinFET-1Fe-FinFET supercell array (which takes into account the CLM/DIBL error that changes the supercell current with the voltage drop across the load capacitor) and the parasitic capacitance and resistance values of the interconnects at the 14-nm technology node. We utilized a parasitic line resistance of 43 Ω/µm and a parasitic line capacitance of 0.3 fF/µm for the interconnects in our analysis. We observe that the inevitable change in the cell current due to CLM/DIBL as the voltage at the load capacitor changes causes the simulated MAC output t_out,j^sim obtained from the HSPICE simulations to deviate from the MAC output t_out,j^cal computed by (8) for the reference case with ideal current sinks. The error in the MAC output computed by the proposed ferroelectric FET-based time-domain MAC accelerator is defined as the maximum difference between the simulated output t_out,j^sim and the ideal calculated output t_out,j^cal over all possible combinations of weights and inputs. Furthermore, for benchmarking the time-mode MAC accelerator exploiting Fe-FinFETs against digital MAC implementations, we define an effective computational precision (P_out,j) as P_out,j = −log2(E_out,j) − 1.
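The error and precision definitions above can be made concrete with a short Python sketch. Normalizing the timing error by the time window T is an assumption made here so that the error is dimensionless; the precision formula P = −log2(E) − 1 is taken from the text, and the function names are hypothetical.

```python
import math


def max_normalized_error(t_sim, t_cal, t_window):
    """Largest deviation between simulated and ideal output pulse widths.

    Assumption: the deviation is normalized by the time window T.
    """
    return max(abs(s - c) for s, c in zip(t_sim, t_cal)) / t_window


def effective_precision(error):
    """Effective computational precision in bits: P = -log2(E) - 1."""
    if error <= 0.0:
        raise ValueError("error must be positive")
    return -math.log2(error) - 1
```

With this definition, a worst-case normalized error of 2^-5 (about 3.1% of full scale) corresponds to an effective precision of 4 bits, i.e., the error stays within half of one least-significant bit of a 4-bit output.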
To extract the maximum range of supercell currents corresponding to the extreme polarization states of the HZO in the gate-stack, the gate voltages of the Fe-FinFET (V_G,Fe) and the selector FinFET (V_G,Sel) were chosen to bias the Fe-FinFETs in the sub-threshold regime. For polarization states toward the lower limit of the ferroelectric HZO in the gate-stack, the Fe-FinFETs are biased deeper into the sub-threshold regime, where the supercell current does not change significantly with the polarization state. Since the Fe-FinFETs are inherently MOSFETs, even when biased in the sub-threshold regime, they exhibit a weak drain dependency, i.e., the supercell current varies with the drain voltage of the selector FinFET (which is connected to the load capacitor). This dependency of the supercell current on the load capacitor voltage for different polarization states of the Fe-FinFETs is characterized in Fig. 5 for different input voltages (V_G,Sel and V_G,Fe). The contours indicate the change in the supercell current (e_DIBL/CLM) due to physical mechanisms such as DIBL and CLM when the voltage on the load capacitor changes by 1 mV.
After an extensive analysis, the operating voltages were chosen such that the Fe-FinFETs are biased in the sub-threshold region with a high drain voltage throughout the MAC operation (by limiting the voltage swing across the load capacitor) to significantly suppress any drain dependency. Moreover, in the proposed supercell, the drain of the Fe-FinFET is connected to the source of the selector FinFET, as shown in the inset of Fig. 1(a), which introduces source degeneration in the selector FinFET and provides inherent negative feedback and a large output resistance. This negative feedback facilitates the realization of an efficient current sink: any reduction in the supercell current due to a decrease in the drain (load capacitor) voltage is compensated by an increase in the effective gate-to-source voltage of the selector FinFET (set by the drop across the Fe-FinFET). To suppress the nonlinearity, the source node potential can be clamped to a reference value by employing a Miller integrator, as done in [34]. The analysis of the improvement in linearity and performance obtained by utilizing this scheme in the supercells is an important future work. Also, the effect of noise is inherently captured by the BSIM-CMG model utilized for the 1FinFET-1Fe-FinFET supercells. Moreover, since the proposed scheme utilizes a crossbar array of 1FinFET-1Fe-FinFET supercells that are activated simultaneously in parallel, the signal-to-noise ratio (SNR) improves with the number of supercells. While the signal strength is proportional to M^2 I^2, the shot noise (which is the dominant noise in the sub-threshold mode of operation and is proportional to the cell current) is proportional to M I^2, where M is the number of weights and I is the cell current. Moreover, due to the parallel connection, the overall cell resistance is divided, which decreases the thermal noise by a factor of M and further improves the SNR. Also, as the layers in DNNs become larger, resulting in a large number of weights, more weights tend to be redundant, making the networks more robust to noise [35].
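The SNR scaling argument above can be checked with a few lines of Python. The sketch simply takes the two proportionalities stated in the text (signal ∝ M^2 I^2, shot noise ∝ M I^2) at face value, with the cell current normalized to 1; the function name and the choice of expressing the result in dB relative to a single cell are assumptions.

```python
import math


def shot_noise_snr_gain_db(m):
    """SNR improvement over a single cell when M cells are read in parallel."""
    if m < 1:
        raise ValueError("m must be at least 1")
    signal = m ** 2   # proportional to M^2 I^2, with I normalized to 1
    noise = m         # proportional to M I^2, with I normalized to 1
    return 10.0 * math.log10(signal / noise)
```

Under these proportionalities the SNR grows linearly with M, so a column of 200 supercells gains 10·log10(200) dB over a single supercell.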
We explored different bias conditions for the supercell to find the optimal operating condition with minimal dependency of the supercell current on the load capacitor voltage, as shown in Fig. 5. Although a larger gate bias on the Fe-FinFET and the selector FinFET results in a larger range of drain voltage with minimal variation in the supercell current [Fig. 5(b)], it also increases the maximum current of the supercell, I_MAX (corresponding to the maximum polarization state). A higher I_MAX requires a larger load capacitance [see (4)], increasing the power dissipation and degrading the energy-efficiency of the MAC accelerator. On the other hand, a lower gate bias on the Fe-FinFET and the selector FinFET leads to a smaller value of I_MAX and improves the energy-efficiency. However, it increases the computational error, as it reduces the difference between I_MAX and I_MIN and thereby increases the scaling factor p of (7). Therefore, there exists a trade-off between the computational precision and the energy-efficiency of the time-mode MAC accelerator using Fe-FinFETs.
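The dependence of the load capacitance, and hence of the energy, on I_MAX can be illustrated with a rough Python estimate. The sizing rule C = M I_MAX T / (V_Reset − V_TH) is an assumed form of (4), chosen so that the worst-case phase-I discharge ends exactly at V_TH, and the recharge energy is a first-order estimate for a full swing; neither is taken from the authors' simulations, and the function names are hypothetical.

```python
def load_capacitance(m, i_max, t_window, v_reset, v_th):
    """Assumed sizing: worst-case phase-I discharge ends exactly at V_TH."""
    return m * i_max * t_window / (v_reset - v_th)


def recharge_energy(c_load, v_reset, v_th):
    """Energy drawn from a V_Reset supply to restore a full swing to V_Reset."""
    return c_load * v_reset * (v_reset - v_th)
```

In this estimate both the capacitance and the recharge energy scale linearly with I_MAX, which is why a lower gate bias (smaller I_MAX) favors energy-efficiency at the cost of a smaller gap between I_MAX and I_MIN.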
The dynamic range and sensing margin of the proposed MAC approach depend on the pre-charge voltage V_Reset. Furthermore, V_Reset also dictates the non-ideal current sink behavior of the 1FinFET-1Fe-FinFET supercells due to the CLM/DIBL error. Since the supercells are operated in the sub-threshold regime, where the drain current has the least dependence on the drain voltage (load capacitor voltage) only for high drain voltages, we selected the nominal supply voltage of the FinFET technology node (0.8 V) as V_Reset to minimize the CLM/DIBL errors and realize efficient current sinks. Furthermore, V_Reset is scalable and can be set to the nominal supply voltage of whichever technology node is used for realizing the 1FinFET-1Fe-FinFET supercells. To reduce the dynamic power while realizing the 1FinFET-1Fe-FinFET supercells as efficient current sinks with a low CLM/DIBL error, the load capacitor may be reduced by utilizing the successive integration and rescaling scheme [36].
Apart from the deviation in the calculated MAC output due to DIBL and CLM, the charge-sharing mechanism between the supercell capacitance (C_GD of the selector FinFET) and the load capacitor also changes the voltage across the load capacitor, which dictates the MAC output. Nevertheless, the large value of the load capacitor compared to the parasitic Miller capacitance of the minimum-sized selector FinFET significantly suppresses the charge-sharing effect. Furthermore, switching of the selector FinFETs can also generate spikes leading to a charge disturbance in the load capacitor (similar to clock feedthrough). Moreover, the parasitic resistance in the interconnect between the supercells and the neuron circuitry may also generate spikes leading to a charge disturbance at the load capacitor. These secondary effects are also considered in our SPICE simulations while calculating the overall computational precision of the proposed MAC accelerator. Furthermore, process and temperature variations can impact the multilevel operation in FeFETs and the performance of the peripheral circuitry. However, due to the unavailability of experimental data across different process corners and temperatures, we could not calibrate our model to emulate experimentally observed process/temperature variations and mismatch effects. The analysis of the proposed time-domain MAC accelerator with process/temperature variations and mismatch effects is an important future work. Considering these non-ideal effects, we performed an extensive design exploration to extract optimal performance from the proposed time-mode MAC accelerator based on Fe-FinFETs. Our rigorous analysis indicates that the performance of the MAC accelerator depends significantly on the input voltage of the supercell (which dictates its biasing conditions), the input time window (T), the range of the supercell current, the leakage current of the supercell, and the size of the weight matrix and input vectors. The computational errors for different input time windows, sizes of the symmetric weight matrix, and biasing conditions are shown in Table I. The proposed MAC accelerator exhibits a computational precision of 4 bits for all the input time windows and sizes of the symmetric weight matrix. As the drain voltage does not alter the polarization state of the Fe-FinFETs significantly, we can increase the output voltage swing on the load capacitor and reduce the area and the energy dissipated by minimizing the value of the load capacitor [see (4)] without degrading the computational precision.
When the supercells are biased at a higher input voltage (V_G,Sel = V_G,Fe = 0.4 V), the integrated current range is significantly larger than the leakage current of the additional array even for a large weight matrix size of 200 × 200. However, when a low input voltage (V_G,Sel = V_G,Fe = 0.3 V) is used, the range of the integrated current is comparable to the overall leakage current flowing through the additional supercell array (biased to sink a constant current in the evaluation phase). This leakage current introduces an additional error in phase I, apart from the DIBL/CLM error, when low input voltages are used for the supercells. However, the reduced current range also enables integration during phase I with a smaller load capacitor, leading to an enhanced area- and energy-efficiency. Therefore, there is an inherent trade-off between the area- and energy-efficiency and the computational precision in the proposed MAC accelerator. We also performed Monte-Carlo simulations (with 100 runs) to study the impact of voltage variations on the operation of the proposed time-domain MAC accelerator utilizing Fe-FinFETs. For analyzing the impact of supply voltage variation, a relative change of 10% with a 3σ standard deviation from the nominal value was introduced. The simulation results indicate that the proposed implementation exhibits excellent resiliency against supply voltage variations, achieving a 4-bit computational precision even in their presence.
Furthermore, we have also analyzed the computational error of the proposed MAC accelerator with a 5-bit weight precision (32 polarization states in the Fe-FinFET of the supercell). The proposed MAC accelerator supports only a 4-bit computational precision even if we apply 5-bit inputs and weights. This is attributed to the closely spaced (and indistinguishable) read current characteristics of the supercell when the number of polarization states is increased, especially for the lower values of the polarization states.

V. PERIPHERAL CIRCUITRY AND PERFORMANCE ESTIMATES
The proposed time-mode MAC accelerator exploits mixed-signal processing, i.e., while the inputs and outputs of the MAC accelerator are digital, the MAC operation is performed in the analog domain with the help of Fe-FinFET supercells. Digital inputs are encoded in the time domain and fed to the analog supercells for integration (weighted-sum operation). The time-encoded output pulse is then converted back to the digital domain. For converting the input (and output) signals from the digital domain to the time domain (and vice versa), we require a digital-to-time converter (DTC) for each input and a time-to-digital converter (TDC) block at each output. Therefore, we have designed the DTC and TDC blocks for handling 4-bit inputs and outputs, which are suitable for neuromorphic inference tasks [24], with a minimum time period T = 32 ns. Fig. 6(a) shows the design of the DTC, which consists of three components: a 4-bit counter, a 4-bit comparator, and a one-bit S-R latch. The 4-bit edge-triggered counter with set- and reset-enable signals comprises 24 NAND gates. Set and reset signals are used to ensure that the counter is initialized to 0000 at the end of the integration phase (t = T). The count starts from 0000 and, the moment the counter value reaches the value of the input, the comparator is triggered, and the input is latched to a high logic level through the S-R latch and fed to the supercell array. The proposed 4-bit DTC block consumes an energy of 71.32 fJ and occupies an area of 1.862 µm².
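The counter, comparator, and latch behavior of the DTC described above can be illustrated with a short behavioral sketch. It is not a gate-level model of Fig. 6(a): the function name `dtc_waveform` is hypothetical, and the sketch assumes one output sample per counter clock and 2^4 = 16 counts per time window.

```python
def dtc_waveform(value, bits=4):
    """Behavioral sketch of the DTC: counter + comparator + S-R latch.

    Returns one output sample per counter clock over a single time window.
    Assumption: the window spans 2**bits counts, after which the counter
    and the latch are reset for the next window.
    """
    n_counts = 1 << bits
    if not 0 <= value < n_counts:
        raise ValueError("input does not fit in the counter width")
    latch = 0
    wave = []
    for count in range(n_counts):   # counter runs 0000 ... 1111
        if count == value:          # comparator match sets the S-R latch
            latch = 1
        wave.append(latch)          # latch output drives the supercell row
    return wave


wave = dtc_waveform(3)
print(wave)
```

In this sketch the digital input sets the instant at which the output rises, and the output stays high until the reset at the end of the window.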

TABLE II PERFORMANCE BENCHMARKING OF TIME-DOMAIN VMMS
The TDC, which consists of a 4-bit counter and an AND gate, is shown in Fig. 6(b). The global clock is fed to the AND gate along with the output of the neuron, which is implemented as an S-R latch [Fig. 1(a)]. The output of the AND gate enables the counter, which counts the number of clock pulses that represent the MAC output in the digital domain. A single 4-bit TDC block consumes an energy of 81.88 fJ while occupying an area of 1.47 µm². Although a single inverter at the input of the digital neuron works fine considering the time windows (≥32 ns) used in this work, multiple stages of inverters may be needed for appropriate operation of the neuron circuitry at aggressively scaled time windows due to the poor slope at the input. Moreover, in the TDC, the delay between different stages varies in the presence of PVT variations. Although we could not perform an extensive analysis of the impact of PVT variations on the peripheral circuitry due to the unavailability of experimental data, our prior works on time-domain MAC accelerators at the 55-nm technology node using the process design kit (PDK) from GlobalFoundries [15], targeting an aggressive clock duration of 1 ns, have clearly indicated that the TDC and DTC designs, even without calibration, can support a computational precision >4 bits in the presence of PVT variations. Since we are not targeting an aggressively scaled (sub-ns) clock period, we believe that the PVT variations should not pose a significant difficulty in achieving a computational precision of 4 bits in the proposed MAC accelerator.
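The clock-gating principle of the TDC can be illustrated with the following behavioral sketch. It is not a model of the actual circuit in Fig. 6(b): the function name `tdc_count` is hypothetical, the neuron output is assumed to be sampled once per global clock pulse, and the wrap-around of the 4-bit counter on overflow is an assumption.

```python
def tdc_count(neuron_output, bits=4):
    """Behavioral sketch of the TDC: AND-gated clock driving a counter.

    neuron_output holds one logic level (0 or 1) per global clock pulse.
    Assumption: the counter wraps around on overflow, as a plain
    bits-wide binary counter would.
    """
    mask = (1 << bits) - 1
    count = 0
    for level in neuron_output:
        gated_clock = 1 & level      # AND of the clock pulse and the neuron output
        if gated_clock:
            count = (count + 1) & mask
    return count


# Neuron output high for 7 of the 16 clock pulses in the window.
code = tdc_count([0] * 5 + [1] * 7 + [0] * 4)
print(code)
```

The returned count is the digital representation of the duration for which the neuron output stays high.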
Since the proposed time-mode MAC accelerator exploiting Fe-FinFETs supports a computational precision of 4 bits (which is sufficient for practical inference applications [24]) for different inputs and weight matrix sizes, frequency of MAC operation, voltage swing on the load capacitor, etc., an input voltage (V_{G,Sel} = 0.35 V and V_{G,Fe} = 0.35 V), a voltage swing of 0.2 V on the load capacitor, and a time window T = 32 ns have been selected for MAC-level area and energy estimates. Fig. 7(a) shows the area-efficiency and the area occupied by different components of the MAC accelerator for different sizes of the inputs and outputs and a square weight matrix (N). Fig. 7(a) clearly indicates that the capacitor dominates the area landscape of the time-mode MAC accelerator based on Fe-FinFETs. However, for a 200 × 200 weight matrix and 200 inputs and outputs, the proposed MAC accelerator exhibits an area-efficiency of 12.607 bits/µm², which is significantly larger than that of the prior approaches [15], and a very high throughput of 2.5 TeraOp/s. Also, as can be observed from Fig. 7(b), a considerable amount of energy (70%) is dissipated while charging and discharging the load capacitor during the MAC operation. However, the proposed time-mode MAC accelerator exploiting Fe-FinFETs exhibits a high energy-efficiency of 800 TeraOperations/Joule (Table II). With approaches like successive integration and rescaling, the energy-efficiency can be significantly increased [32], [36]. It may be noted that the contribution of the I/Os and peripheral circuits was considered in these estimates, unlike [12], [13].
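The reported figures can be cross-checked with simple arithmetic. The sketch below is a back-of-the-envelope check, not the estimation flow used in this work. It assumes that each MAC counts as two operations (one multiply and one add) and that one 200 × 200 operation completes per 32 ns time window; the text does not state this counting convention, but it reproduces the quoted throughput. It further assumes that each DTC and TDC performs one conversion per operation.

```python
N = 200                      # inputs, outputs, and square weight-matrix size
T_WINDOW_S = 32e-9           # time window T
OPS_PER_MAC = 2              # assumption: multiply + add counted separately

# Throughput under the assumed counting convention.
ops_per_vmm = OPS_PER_MAC * N * N
throughput_ops_per_s = ops_per_vmm / T_WINDOW_S

# Energy per operation implied by 800 TeraOperations/Joule.
energy_per_op_j = 1.0 / 800e12

# Peripheral totals: one DTC per input and one TDC per output.
E_DTC_J, E_TDC_J = 71.32e-15, 81.88e-15
A_DTC_UM2, A_TDC_UM2 = 1.862, 1.47
periph_energy_j = N * (E_DTC_J + E_TDC_J)      # assumes one conversion each
periph_area_um2 = N * (A_DTC_UM2 + A_TDC_UM2)

print(f"throughput        : {throughput_ops_per_s / 1e12:.2f} TeraOp/s")
print(f"energy per op     : {energy_per_op_j * 1e15:.2f} fJ")
print(f"peripheral energy : {periph_energy_j * 1e12:.2f} pJ")
print(f"peripheral area   : {periph_area_um2:.1f} um^2")
```

Under these assumptions the throughput evaluates to 2.5 TeraOp/s, matching the value quoted above.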

VI. CONCLUSION
In this work, we have proposed a time-mode MAC accelerator exploiting Fe-FinFETs. Our extensive analysis shows that the high supercell leakage current and the DIBL/CLM cause the MAC outputs to deviate from the theoretically calculated values and limit the computational precision to 4 bits. Moreover, our rigorous design space exploration reveals the unforeseen trade-offs between the computational precision and the area- and energy-efficiency of the proposed time-mode MAC accelerator. The significantly high area- and energy-efficiency of the proposed Fe-FinFET-based time-mode MAC accelerator may provide incentive for its experimental realization.

Fig. 1. (a) Schematic view of the proposed time-mode mixed-signal VMM accelerator and (b) timing diagram of input-output, and the voltage across load capacitor.

Fig. 3. FinFET model calibration with experimentally measured data for 14 nm FinFETs. (a) Transfer characteristics for different drain voltages. (b) Output characteristics for different gate voltages.

Fig. 4. Output characteristics of the supercell for different polarization states of the HZO layer in the gate-stack of the Fe-FinFET.

Fig. 7. (a) Area-efficiency and area breakdown. (b) Energy-efficiency and energy breakdown for different sizes of MAC inputs and outputs (N).

TABLE I COMPUTATIONAL ERROR FOR DIFFERENT VMM SIZE FOR