A Technique for Approximate Communication in Network-on-Chips for Image Classification

Approximation is an emerging design methodology for reducing the power consumption and latency of on-chip communication in many computing applications. However, existing approximation techniques either achieve modest improvements in these metrics or require retraining after approximation. Since classifying many images introduces intensive on-chip communication, reductions in both network latency and power consumption are highly desired. In this paper, we propose an approximate communication technique (ACT) to improve the efficiency of on-chip communication for image classification applications. The proposed technique exploits the error tolerance of the image classification process to reduce the power consumption and latency of on-chip communication, resulting in better overall performance for image classification. This is achieved by incorporating novel quality control and data approximation mechanisms that reduce the packet size. In particular, the proposed quality control mechanisms identify the error-resilient variables and automatically adjust the error thresholds of these variables based on the image classification accuracy. The proposed data approximation mechanisms significantly reduce the packet size when these variables are transmitted. The proposed technique reduces the number of flits in each data packet as well as the overall on-chip communication while maintaining excellent image classification accuracy. Cycle-accurate simulation results show that ACT reduces network latency by 23% and dynamic power consumption by 24% compared to existing approximate communication techniques, with less than 0.99% classification accuracy loss.


I. INTRODUCTION
Image classification applications widely use deep convolutional neural networks (CNNs) and are deployed from cloud to edge computational frameworks for a variety of scenarios, such as search engines and self-driving cars [1], [2]. As the complexity of these applications and the resolution of images continue to increase, conventional homogeneous architectures (such as multicore CPUs/GPUs) are constrained by excessive communication latencies and significant power dissipation [3]-[5]. To efficiently process these applications, heterogeneous architectures with pre-processing and inference cores have been proposed [3]-[8].
Pre-processing cores are designed to prepare data by resizing the raw image and then normalizing the value for each pixel into a specific range.
Inference cores are designed to fetch the processed data and the parameters of the CNN model to perform inference. Network-on-chips (NoCs) have been widely used to efficiently connect cores, memory interfaces, and caches in these architectures [9]. Recent research [3], [10] has shown that with a heterogeneous architecture, data transfer can account for up to 34% of the execution time and up to 40% of the overall chip power consumption. Since image classification applications can tolerate errors in the parameters and the inputs, approximation techniques have been proposed to reduce data transfer, thus reducing network latency and power consumption [11], [12]. Existing approximation techniques can be categorized as follows.
Approximate communication techniques [13]-[18] reduce communication latency and power consumption by utilizing packet approximation in NoCs. However, existing techniques rely only on the relative error for data approximation. Since the relative error tolerance of image classification applications is limited, only a few packets can be approximated using existing approximate communication techniques.
CNN approximation techniques [19]-[23] reduce the size of the model using quantization or pruning. However, these techniques do not specifically target image classification. Moreover, as quantizing and pruning the parameters can significantly reduce the classification accuracy, existing techniques require the model to be retrained prior to inference; the retraining process requires substantial time while incurring considerable power consumption. To address these issues, an approximate communication technique (ACT) that enhances communication efficiency for image classification is proposed for heterogeneous systems; it leverages the error tolerance of the image classification application to reduce the transmitted packet size, thus reducing power consumption and network latency. ACT utilizes two approximate communication schemes: one for the pre-processing cores (ACT-P) and one for the inference cores (ACT-I). Each scheme includes quality control and data approximation mechanisms to leverage the error tolerance in multiple steps of the image classification process. Specifically, the contributions of this paper are as follows.
The proposed approximate communication technique (ACT) is applied in the pre-processing cores (ACT-P) and the inference cores (ACT-I) to reduce network latency and dynamic power consumption for image classification applications by leveraging the error tolerance of the application. ACT is implemented with software-hardware co-design.
Performance evaluation results show that, compared to existing approximate communication techniques, ACT reduces network latency and dynamic power consumption by 23% and 24%, respectively, with less than 0.99% classification accuracy loss.
This paper is organized as follows. Section II presents the background for the proposed technique; Section III outlines the basic operational principles of ACT. The implementation is presented in detail in Section IV, while Section V deals with its extensive evaluation. Section VI concludes the paper.

II. BACKGROUND
Approximation techniques are widely used to enhance the efficiency of image classification applications and CNNs [19]-[25]. Existing approximation techniques can be categorized into two types with respect to on-chip communication:
1. Approximate communication techniques, which reduce the power and latency of communication during the execution of an application.
2. Approximation techniques for the CNN model, which reduce the model size prior to execution.
These techniques are reviewed next as relevant to the proposed scheme.

A. APPROXIMATE COMMUNICATION TECHNIQUES
Approximate communication is considered to be an effective approach to improve network performance when an application can tolerate errors [13]-[18]. By reducing accuracy during communication, approximation techniques significantly reduce the network latency and power consumption of on-chip communication. Figure 1 shows an approximate communication NoC [13]-[18] implemented in a heterogeneous multicore system [3]-[8] with an L2 shared cache for CNN inference. The data approximation module in the network interface reduces the packet size by truncation or lossy compression according to the approximation information, which includes the variable error tolerance and type (e.g., integer or floating-point). Consider a cache miss during a memory load or store operation by an X86 CPU for image pre-processing.
Miss on load operation: When a cache miss occurs during a memory load operation, a read request packet is sent to the memory or the shared cache through the NoC. The memory or shared cache uses a read reply packet to send the required data back to the core. If the memory load can be approximated, the read request also carries the approximation information. Subsequently, the data approximation module at the memory/shared cache node reduces the size of the read reply packet accordingly. The approximated read reply packet carries the approximated data and the approximation information to the core. When the approximated read reply reaches the core, the data recovery module recovers the approximated data to its original length in accordance with the approximation information.
Miss on store operation: When a cache miss occurs during a memory store operation, the data is incorporated into a write request packet and sent to the memory or shared cache through the NoC. The data approximation module reduces the size of the write request packet according to the approximation information if the memory store can be approximated. The approximated write request packet carries the approximated data and approximation information to the memory or shared cache node. When the approximated write request reaches the memory or shared cache node, the data recovery module recovers the approximated data to the original length in accordance with the approximation information. After the memory or shared cache has received the data, a write reply is sent back to the core to confirm a successful memory write.
Various data approximation methods [14], [15], [17], [18], [26] have been proposed to reduce the packet size according to the approximation information. However, existing techniques achieve limited improvement when CNNs are used for image classification because the parameters of the model and the inputs cannot be approximated using methods based on the relative error. Figure 2 shows the network latency reduction, normalized to a network with no approximation (baseline), for the existing approximate communication framework (ACF) [18]. The figure indicates that ACF reduces network latency by less than 5% when applied to various state-of-the-art image classification applications [27]-[38] executed on heterogeneous multicore architectures with an NoC.

B. CNN APPROXIMATION TECHNIQUES
Quantization and pruning methods are widely used for deep CNNs in image classification applications to reduce communication traffic and computation [21]. For example, in [23], the size of the deep neural network is significantly reduced using quantization, pruning, and Huffman coding. Existing image classification applications [27]-[38] are implemented using the PyTorch [39] and TensorFlow [40] frameworks, which support CNN quantization and pruning on generic inference cores (e.g., CPUs, GPUs, CNN accelerators). However, existing model approximation techniques have two major limitations.
1. They are developed for generic CNN inference; system performance improvement methods specifically designed for image classification have not been explored. Thus, system performance can be further improved with dedicated optimization techniques.
2. They require the model to be retrained or fine-tuned before classifying images, because these techniques otherwise incur a significant reduction in classification accuracy. This paper aims to approximate the image classification application during the execution process to enhance communication efficiency while incurring only a very limited impact on accuracy.

III. PROPOSED APPROXIMATE COMMUNICATION TECHNIQUE
The proposed approximate communication technique (ACT) reduces the network latency and power consumption of on-chip communication in NoCs. This is accomplished mainly by reducing the size of each packet, exploiting the error-tolerant features of image classification applications. Image classification applications tolerate two types of errors [19], [20], [41]: the first type is image contrast reduction during image pre-processing; the second type is quantization errors in the fully connected layers during model inference. Thus, ACT includes two sets of approximate communication techniques to leverage these two types of error tolerance.
1. The approximate communication scheme for image pre-processing (ACT-P) includes quality control and data approximation mechanisms. The quality control mechanism dynamically adjusts the image contrast and monitors the accuracy of the application to balance accuracy against communication efficiency. The data approximation mechanism for image pre-processing reduces the data size by reducing the image contrast.
2. The approximate communication scheme for model inference (ACT-I) includes quality control and data approximation mechanisms. The quality control mechanism monitors the values of the variables when a fully connected layer is processed. After the quality control mechanism records the maximum/minimum values of the variables, the data approximation mechanism utilizes data quantization to reduce the data size.

A. APPROXIMATE COMMUNICATION FOR IMAGE PRE-PROCESSING (ACT-P)
Recent research has shown that image classification applications are resilient to contrast reduction applied to the raw image prior to inference [19], [41]. In this paper, it is assumed that the level of contrast C ranges from -255 to infinity. When C = 0, there is no adjustment to the image, but when C ∈ (-255, 0), the image contrast is reduced. When C = -255, all values of the pixels (R, G, B) in an image are 128, making the image a solid grey color. Hence, Eq. (1) describes the relationship between the contrast correction factor F and the level of contrast C.
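Eq. (1) is not reproduced in the text above; a plausible reconstruction, assuming the standard contrast correction factor, is consistent with the boundary behavior just described (F = 1 at C = 0 and F = 0 at C = -255):

F = \frac{259\,(C + 255)}{255\,(259 - C)} \qquad \text{(1)}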
As per F above, the contrast reduction for each pixel is performed by Eq. (2), in which the variable P is the value of a color of a pixel (in a range from 0 to 255) and P′ represents the corresponding value after contrast reduction. Figure 3 shows the classification accuracy for a few widely used image classification applications [2] versus the level of contrast reduction; image classification applications can tolerate 23 levels of contrast reduction (i.e., C = -23) with negligible accuracy reduction (0.07% on average). Figure 3 also shows that different image classification applications have different accuracy tolerance for image contrast reduction; for example, for a classification accuracy loss of up to 1%, AlexNet [28] can tolerate 23 levels of contrast reduction (C = -23), while VGG19 can tolerate 90 levels (C = -90). Thus, a quality control mechanism is needed to select the appropriate contrast reduction level for each image classification application to avoid a significant loss in classification accuracy.
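Eq. (2) is likewise reconstructed under the same assumption; it scales each color value around the mid-grey value 128:

P' = F\,(P - 128) + 128 \qquad \text{(2)}

For instance, at C = -255 (F = 0), every pixel value becomes 128, matching the solid-grey case described above.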

1) QUALITY CONTROL
A quality control mechanism for image pre-processing is utilized to maintain the accuracy of image classification. Figure 4 shows the proposed design of the quality control for image classification. This mechanism adjusts the contrast level during the testing process, which is the last step prior to the classification of images by the application. Testing includes three phases.
1. The raw images in the test data set are processed by the core. Unlike the images that the application processes in deployment, the test data set contains both the query data (raw image) and the true value (label) for each image.
2. The model inference is then performed by fetching the processed data and the classification model.
3. Finally, the generated result is processed by the core and compared with the true value. The model accuracy is calculated by comparing the predictions generated by the model with the true values.
The quality control mechanism uses the accuracy calculated by the core to adjust the image contrast. To account for the potential accuracy reduction caused by applying approximate communication for model inference, the accuracy reduction due to image contrast reduction is limited to less than 1%. The proposed quality control mechanism supports the eight contrast reduction levels shown in the left column of Table 1. Thus, the following novel procedure is proposed to determine the image contrast reduction level:
1. During the first phase of the test process, the classification accuracy of the image application is calculated with no image contrast reduction, as in Eq. (3).

\text{Classification Accuracy} = \frac{\text{Number of Correct Classifications}}{\text{Total Number of Classifications}} \qquad \text{(3)}

A correct classification is defined as one in which the image category (e.g., cat, dog, car) with the highest probability (as predicted by the model) exactly matches the expected answer (label).
2. The quality control mechanism gradually reduces the image contrast by stepping through the contrast reduction levels in the left column of Table 1 until the next round of testing shows more than a 1% loss (the threshold) in classification accuracy compared to the base accuracy.
3. The level prior to the last contrast reduction level is then chosen for image classification. For example, if the classification error exceeds 1% at level -68, the preceding level (-45, according to Table 1) is selected for image classification.
Section V.E discusses the impact on classification accuracy when different accuracy-loss thresholds are chosen. During image classification, the true value for each query image is not available, so the contrast reduction level is fixed and registered in the network interface prior to classification.
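To make the procedure concrete, the following minimal Python sketch walks through the selection loop. It is a sketch under stated assumptions, not the paper's implementation: `classify` and `test_set` are hypothetical stand-ins for the application's inference routine and labeled test data, levels -113 and -135 are hypothetical fillers (the text names only levels 0, -23, -45, -68, -90, and -158 from Table 1), and the contrast math uses the Eq. (1)/(2) reconstruction given above.

```python
# Minimal sketch of the ACT-P quality control loop (Section III.A.1).
# Assumptions: `test_set` is a list of (image, label) pairs, where an image is
# a flat list of 8-bit color values; `classify` is the application's inference
# routine; levels -113 and -135 are hypothetical fillers for Table 1.

CONTRAST_LEVELS = [0, -23, -45, -68, -90, -113, -135, -158]
ACC_LOSS_THRESHOLD = 0.01  # 1% accuracy-loss threshold (Section III.A.1)

def reduce_contrast(image, c):
    # Eqs. (1) and (2), as reconstructed above.
    f = 259 * (c + 255) / (255 * (259 - c))
    return [max(0, min(255, round(f * (p - 128) + 128))) for p in image]

def accuracy(classify, test_set, level):
    # Eq. (3): correct classifications / total classifications.
    correct = sum(1 for image, label in test_set
                  if classify(reduce_contrast(image, level)) == label)
    return correct / len(test_set)

def select_contrast_level(classify, test_set):
    base_acc = accuracy(classify, test_set, 0)  # step 1: base accuracy
    chosen = 0
    for level in CONTRAST_LEVELS[1:]:           # step 2: gradually reduce
        if base_acc - accuracy(classify, test_set, level) > ACC_LOSS_THRESHOLD:
            break                                # step 3: keep previous level
        chosen = level
    return chosen
```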

2) DATA APPROXIMATION
The data approximation mechanism reduces the amount of transmitted data for image contrast reduction. The so-called base-delta approximation mechanism is proposed to take advantage of the reduced image contrast for data reduction. Since the difference in values between pixels in an image is small, the base-delta compression mechanism can significantly reduce the number of bits needed to represent each pixel. Moreover, the proposed image contrast reduction process further reduces the difference between pixels, so the data size can be substantially reduced by only transmitting the difference between pixels. Figure 5 shows the design of the proposed base-delta approximation mechanism for image pre-processing. The data approximation process consists of two steps.
Step 1: The image contrast reduction operation is activated with a contrast reduction level.
Step 2: The multipliers and adders then adjust the value for each pixel based on Eqs. (1) and (2).
To reduce its complexity, ACT-P supports eight levels of image contrast reduction. Table 1 shows the mapping of the supported contrast reduction levels (C) to the contrast correction factor (F). The first 8-bit data item is chosen as the base; the remaining data is represented as the distance (delta) to the base. Figure 5a shows the approximation process when image contrast reduction is deactivated (Contrast Reduction Level = 0): the data bypasses the contrast reduction operation and is compressed at full accuracy. Figure 5b shows the approximation process when image contrast reduction is activated.
For example, suppose a packet contains three pixels with values 128 (10000000), 192 (11000000), and 100 (1100100), and that these values can tolerate 68 levels of contrast reduction (C = -68). After approximation, the packet contains only 128 (10000000), -38 (1100110), and 6 (110). After the compressed data arrives at the destination, the data is recovered by adding each delta to the base.
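For illustration, a minimal Python sketch of the base-delta step alone is given below, shown without the contrast-reduction stage for clarity; on the raw example pixels, the deltas are 64 and -28, each requiring fewer bits than a full 8-bit value.

```python
# Minimal sketch of base-delta compression/recovery (Section III.A.2).
# In ACT-P, contrast reduction runs first and shrinks the deltas further;
# that stage is omitted here for clarity.

def base_delta_compress(pixels):
    base = pixels[0]                     # the first 8-bit value is the base
    deltas = [p - base for p in pixels[1:]]
    return base, deltas

def base_delta_recover(base, deltas):
    return [base] + [base + d for d in deltas]

pixels = [128, 192, 100]
base, deltas = base_delta_compress(pixels)   # base = 128, deltas = [64, -28]
assert base_delta_recover(base, deltas) == pixels
```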

B. APPROXIMATE COMMUNICATION FOR MODEL INFERENCE (ACT-I)
Reference [22] has shown that the reduction in classification accuracy is negligible after applying quantization to the parameters and activations (inputs) of the fully connected layers. As the floating-point data type is widely used in image classification applications [27]-[38] to represent parameters and activations, the quantization process maps a floating-point value x ∈ [a, b] to a b-bit integer x_q ∈ [a_q, b_q]; this is computed as per Eq. (4) (where c and d are variables).
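Eq. (4) is not reproduced in the text; a plausible reconstruction, assuming the standard affine quantization mapping with scale c and zero point d, is:

x_q = \operatorname{round}\!\left(\frac{x}{c} + d\right) \qquad \text{(4)}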
Note that when performing quantization, the floating-point 0 must be mapped to a b-bit integer 0. Thus, the relationship between c, d, and the ranges of x and x_q is given as follows.
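A plausible reconstruction of Eq. (5), assuming the endpoints of the floating-point range map to the endpoints of the integer range, is:

\frac{a}{c} + d = a_q, \qquad \frac{b}{c} + d = b_q \qquad \text{(5)}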
In Eq. (5), a and b are the minimum and maximum of the floating-point value, respectively; a_q and b_q are the minimum and maximum of the integer value (i.e., the quantized floating-point value), respectively. c and d are the two variables that must be solved for the quantization process. Eq. (6) illustrates the solution of Eq. (5).
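Solving Eq. (5) for c and d gives a plausible reconstruction of Eq. (6):

c = \frac{b - a}{b_q - a_q}, \qquad d = \frac{a_q\,b - b_q\,a}{b - a} \qquad \text{(6)}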
Reference [22] has shown that the accuracy reduction caused by quantizing IEEE standard 32-bit floating-point data into 8-bit integers in fully connected layers is negligible. This is shown in Figure 6 for different image classification applications in which the parameters and inputs of the fully connected layers are quantized: the quantization process reduces the classification accuracy by only 0.022% on average. This observation suggests that data quantization is an attractive solution to significantly reduce on-chip communication when the model has fully connected layers.
However, as per Eq. (6), the range of x (i.e., a and b) must be known when performing data quantization. Since data (i.e., x) exceeding the range is clipped (by truncation) during the quantization process, the range must be determined dynamically for different data items; otherwise, many data items could be clipped, negatively impacting the model accuracy. Therefore, a novel quality control mechanism is developed to estimate the range of the inputs and parameters in the proposed scheme. Also, once the data range is determined, floating-point operations (such as multiplication, division, and subtraction) must be performed to map the data (as per Eqs. (4) and (6)). These operations are not always acceptable because they could incur significant latency and power overheads. Therefore, the proposed data approximation mechanism uses a variable i to quantize data, as described next. Figure 7 shows the proposed process of quality control for model inference. The proposed quality control mechanism constantly monitors the parameters and inputs of the fully connected layer. To reduce the complexity of data quantization, the new variable i is introduced based on the following observations.

1) QUALITY CONTROL
Observation 1: Quantization maps data from the original range to another range with different granularity, thus causing quantization errors. For example, when quantizing 32-bit floating-point data into 8-bit integers, the granularity of the data range increases from 1/16777216 to 1/255; in this case, the error originates from the decimal part.
Observation 2: For an integer, a deviation within (0, 1) (i.e., adding the integer with a decimal value) is only reflected on a few lower mantissa bits in its floating-point representation, and it has an almost negligible impact on all upper bits. Moreover, the changed mantissa bits are separated from the upper bits related to the sign and the integer part of the data.
Observation 3: Enlarging a floating-point value by 2^i times only changes its exponent bits (where i is a positive integer). Consider Eqs. (7) and (8), which give the value of a 32-bit floating-point datum D [43] and the corresponding value enlarged by 2^i times, respectively; only the exponent value increases by i, while the sign and mantissa remain the same.
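Eqs. (7) and (8) are not reproduced in the text; a reconstruction based on the IEEE 754 single-precision format (sign bit s, 8-bit biased exponent e, 23-bit mantissa m) is:

D = (-1)^{s} \times \left(1 + \frac{m}{2^{23}}\right) \times 2^{\,e - 127} \qquad \text{(7)}

2^{i} \times D = (-1)^{s} \times \left(1 + \frac{m}{2^{23}}\right) \times 2^{\,(e + i) - 127} \qquad \text{(8)}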
Therefore, as per the above observations, a conventional data quantization approach can be replaced by enlarging the original data by 2^i times and then rounding down (i.e., mapping the floating-point data to an integer).
Thus, when the quality control mechanism receives the minimum (a) and maximum (b) values of the weights and biases, i is calculated from a and b using Eq. (9).
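Eq. (9) is not reproduced in the text. A plausible reconstruction, chosen to be consistent with the worked VGG11 example in Section III.B.2 (where max(|a|, |b|) ≈ 0.0903 and the 8-bit target range b_q = 127 yield i = 10), is:

i = \left\lfloor \log_2 \frac{b_q}{\max(|a|,\,|b|)} \right\rfloor \qquad \text{(9)}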
However, the dynamic range of the inputs is not fixed because the query image is changed after each image classification. Thus, the inputs of the fully connected layer are constantly monitored during the image classification to establish the dynamic range of the input to calculate i.
To reduce the hardware overhead, i is limited to 8 bits, and the initial i values for the inputs and parameters are calculated during the testing of the image classification application and registered in the network interface before images are processed. Since an increase in the value range leads to a decrease in i, the quality control mechanism automatically reduces i by 1 when an input value exceeds the dynamic range during classification. The data approximation mechanism then reduces the data size based on i.

2) DATA APPROXIMATION
The proposed data approximation mechanism quantizes data by enlarging the original data by 2^i times and then rounding down. Thus, the quantization error is bounded within a few lower mantissa bits. Only the sign bit, the exponent bits, and a few upper mantissa bits need to be transmitted, because they are separated from the lower bits. Moreover, to perform the multiplication by 2^i, the binary representation of i only needs to be added to the exponent part, i.e., a single binary addition is required rather than floating-point arithmetic operations. As the ranges of the inputs and parameters of the fully connected layers can be determined by the quality control mechanism proposed in the previous section, the value of i is adjusted to guarantee that the quantized data belong to an integer range with an acceptable granularity (so as to provide good classification accuracy, e.g., the 8-bit integer range [20]).
To further reduce the size of the transmitted data, the exponent part is compressed by mapping the data patterns into shorter symbols. Since all integers within the range [2^j, 2^{j+1}) share the same exponent pattern as per Eq. (7) (where j = 1, 2, ..., 127), only a few patterns are needed to represent quantized data that belong to a range significantly smaller than the entire floating-point field. For example, when quantizing data into the 8-bit integer range, only eight exponent patterns can appear (i.e., 00000000 for the value 0, 01111111 for the value 2^0, 10000000 for values within [2^1, 2^2), 10000001 for values within [2^2, 2^3), etc.). In this case, 3-bit symbols, which provide eight combinations, can be used to encode all possible exponent patterns, reducing the size of each exponent from 8 to 3 bits. Next, an example of quantizing 32-bit floating-point values within [-0.0903250, 0.0882086] (i.e., the range of parameters for one fully connected layer of the trained VGG11 [29]) into 8-bit integers illustrates the proposed mechanism. To perform quantization, the original data is enlarged by 2^10 times (i.e., i = 10), because i = 11 would make the quantized data exceed the range [-127, 127]. Therefore, only 15 bits, comprising 1 sign bit, 8 exponent bits, and 6 upper mantissa bits, are sufficient for all quantized data. The exponent part is then further compressed with 3-bit symbols as per the mapping given in Table 2, because only eight exponent patterns are used to represent the quantized data. Overall, the data size is reduced from 32 bits to 10 bits, a reduction of 68.75%.
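The bit-level manipulation can be sketched in Python as follows; the field widths (6 upper mantissa bits, 8-bit exponent adder) follow the VGG11 example above, while the de-scaling step at recovery is an assumption for illustration.

```python
# Minimal sketch of ACT-I data approximation (Section III.B.2): multiply a
# 32-bit float by 2**i via binary addition on its exponent field, then keep
# only the sign bit, the exponent (later mapped to a 3-bit symbol per Table 2),
# and the 6 upper mantissa bits. Value 0 uses a dedicated all-zero pattern in
# the text and is omitted here.
import struct

def float_to_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(b):
    return struct.unpack(">f", struct.pack(">I", b))[0]

def approximate(x, i):
    b = float_to_bits(x)
    sign = b >> 31
    exp = ((b >> 23) & 0xFF) + i     # enlarge by 2**i: 8-bit exponent addition
    man_hi = (b & 0x7FFFFF) >> 17    # keep the 6 upper mantissa bits
    return sign, exp & 0xFF, man_hi  # 1 + 8 + 6 = 15 bits before Table 2 mapping

def recover(sign, exp, man_hi, i):
    man = man_hi << 17               # pad the dropped mantissa bits with zeros
    b = (sign << 31) | (((exp - i) & 0xFF) << 23) | man
    return bits_to_float(b)          # assumed de-scaling back by 2**i

x = 0.0882086                        # max parameter in the VGG11 layer example
print(recover(*approximate(x, 10), 10))  # ~0.088; error only in low mantissa bits
```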
The hardware design for the proposed data approximation mechanism for data quantization is illustrated in Figure 8. Once data approximation is enabled, an 8-bit adder performs the binary addition between the original exponent and the binary representation of i obtained from the quality control logic (i.e., enlarging the data by 2^i times); the mapping hardware then compresses the quantized exponent (Table 2) to further reduce the data size. Finally, the approximated data is sent to the packet encoder for transmission. Note that once the data arrives for neuron computation, the compressed exponent is decompressed according to Table 2, and a few 0 bits are padded to the mantissa to recover the quantized data into the standard 32-bit floating-point format for subsequent computation.

IV. IMPLEMENTATION OF THE APPROXIMATE COMMUNICATION TECHNIQUE (ACT)
An architecture based on hardware-software co-design is proposed in this section to implement ACT for image classification applications. The proposed implementation includes a software interface and an architectural design. The software interface is designed to identify the variables that need to be monitored or approximated during image classification. The network interfaces in the heterogeneous architecture are augmented with data approximation and quality control.

A. SOFTWARE INTERFACE FOR APPROXIMATE COMMUNICATION
ACT approximates pixels in the images when the pre-processing cores convert the raw image. Also, ACT quantizes the inputs and parameters when the inference cores process the fully connected layers. Hence, ACT monitors and approximates the pixels, inputs, and parameters when the image classification application is executed on the heterogeneous architecture. Two specialized instructions are developed to identify these variables in the source code and the on-chip communication. When the application designer programs the pre-processing cores, the variables that store the images are separately annotated in the application. For pre-processing cores that are X86 CPUs, once the program is compiled into X86 instructions, each load or store of an image pixel (mov dst, src) is replaced with (amov dst, src) so that the network interface can identify the image pixels that can be approximated. Similarly, the loads of the parameters and inputs of the fully connected layer (ld dst, src) are replaced with the specialized instruction (ald dst, src). During the execution of an application, these new instructions allow the network interface to identify these variables in the requests and replies.

B. ARCHITECTURE DESIGN OF ACT
ACT augments the network interfaces (NIs) of the pre-processing cores, model inference cores, shared cache, and memory controller with specific hardware for approximation and recovery (Figure 1). Since the approximation logic needs to handle different data at different nodes, the approximation and recovery logic is designed according to the functionality of the node, such as pre-processing or model inference.

1) APPROXIMATE NETWORK INTERFACE (PRE-PROCESSING CORES)
To support ACT-P, the data approximation logic approximates image pixels according to the contrast reduction level. Since images must be processed by the pre-processing core, the write requests and read replies carry image pixels, and the data in these packets can be approximated. Figure 9 shows the proposed approximation logic for the pre-processing core, which includes the data approximation logic and the quality control logic to adjust the image contrast. The design of the data approximation logic for a pre-processing core is described in Section III.A.2. For clarity, only the control signal of the quality control logic is shown in Figure 9. The quality control logic monitors the write requests. If a write request contains raw image data, the quality control logic instructs the data approximation logic to approximate the request according to the current contrast reduction level; 3 bits are used to represent the contrast reduction level so as to support 8 levels (0 to -158). If the write request cannot be approximated, the data approximation logic applies base-delta compression without contrast reduction (level 0). The quality control logic then checks the length of the write request: if it is larger than that of the original write request (Approx. Size > Org. Size), the original request is sent to the packet encoder. Once the memory or shared cache has received the data, a write reply is sent back to the core to confirm a successful memory write. During an image load, the quality control logic attaches the contrast reduction mode information (3 bits) to the read request. Once the read reply packet arrives at the core, the data recovery logic recovers the data into its original form if the packet is compressed; otherwise, it directly forwards the read reply to the core. The data recovery logic for the pre-processing cores decompresses the data by adding each delta back to the base.

2) APPROXIMATE NETWORK INTERFACE (MODEL INFERENCE CORES)
Since the core directly loads and stores data from/to the memory or shared cache, the read and write requests are generated by the node and sent to the memory controller or shared cache. To support ACT-I, the data approximation logic monitors the write requests and read replies to update the dynamic range of the parameters and the inputs of the fully connected layer. Figure 10 shows the proposed approximation logic for model inference. The quality control logic monitors all requests and replies to update i for the inputs; it also controls two demultiplexers and the data approximation logic. Since the destination of a write request could be another model inference node, a memory controller, or a shared cache, the i monitored at a specific node may reflect the dynamic range of only a section of the inputs of the fully connected layer. To find the dynamic range of the inputs for the entire layer, the following procedure is proposed. (1) The quality control logic attaches the i of the inputs to the read request packet if the destination of the packet is the memory controller or shared cache. (2) The quality control logic constantly monitors the i of the write reply packets from the memory controller or shared cache; if the received i is smaller than the current i, the value of i for the inputs at the current node is updated. Therefore, the i at the memory controller and shared cache node reflects the dynamic range of all the inputs when the core loads the data, and when the core stores the result, the i in each node is updated with the i at the memory controller and shared cache.
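A minimal Python sketch of this two-step exchange is given below; the class and packet-field names are illustrative, not the paper's implementation.

```python
# Minimal sketch of the dynamic-range (i) update protocol (Section IV.B.2).
# Inference-node NIs advertise their local i on read requests toward the
# memory controller / shared cache, which tracks the minimum and echoes it
# back on write replies. Names are illustrative.

class InferenceNodeQC:
    def __init__(self, i_init):
        self.i = i_init                    # registered before classification

    def on_read_request(self, packet):
        packet["i"] = self.i               # step (1): attach local i

    def on_write_reply(self, packet):
        if packet["i"] < self.i:           # step (2): adopt a smaller global i
            self.i = packet["i"]

class MemoryControllerQC:
    def __init__(self, i_init):
        self.i = i_init

    def on_read_request(self, packet):
        self.i = min(self.i, packet["i"])  # track the smallest i observed

    def on_write_reply(self, packet):
        packet["i"] = self.i               # broadcast the minimum back
```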
As a model inference core needs to fetch images, parameters, and inputs, the data recovery logic contains two decompression functions. The decompression function for images is the same as that used in the pre-processing core. The decompression function for the parameters and inputs recovers the data based on Table 2, and a few 0 bits are padded to the mantissa to recover the quantized data into the standard 32-bit floating-point format for subsequent computation.

3) APPROXIMATE NETWORK INTERFACE (MEMORY CONTROLLER AND SHARED CACHE)
Since the memory controller and shared cache handle requests from both pre-processing and model inference cores, this interface performs data approximation and recovery functions for both tasks. The network interface also carries the quality control logic for both pre-processing and model inference. Figure 11 shows the approximation logic for the memory controller and shared cache, which consists of the data approximation and quality control logic. The quality control logic monitors the read request packets for the i value from the inference nodes; if the received i value is smaller than the value stored in the quality control logic, the stored i is updated. The updated i is attached to the write replies to update the i stored in the network interface at the model inference node. The quality control logic also monitors the read request packets to receive the contrast level for read reply packet approximation. When a read reply carries data for image pre-processing or model inference, the corresponding data approximation logic is activated to approximate the data based on the contrast level or i. Similar to the quality control logic in the pre-processing core, the quality control logic checks the length of the read reply to the pre-processing core; if the length after base-delta compression is greater than that of the original read reply, the original reply is sent to the packet encoder.
Since the traffic contains the pixels, model parameters, and inputs, the data recovery logic has the recovery functions for both model inference and pre-processing.

V. EVALUATION
In this section, the performance of the approximate communication technique (ACT) is evaluated using the SMAUG [3] simulator. The SMAUG simulation model is modified to support ACT and heterogeneous architectures for image classification. Table 3 shows the settings of the SMAUG simulator. The hardware for data approximation, data recovery, and quality control is implemented in the network interface. Two heterogeneous architectures (i.e., CPU/NDLA [3], [4] and ASIC/ACC [5]) are modified for image classification to show the wide applicability and effectiveness of the proposed technique. The CPU/NDLA is based on Simba [4], and the ASIC/ACC is based on DNPU [5]. The ASIC/ACC system contains seven ASIC pre-processing cores, an X86 CPU as controller, and seven CNN processors. Each CNN processor includes an aggregation core and three convolution cores; thus, the seven CNN processors contain seven aggregation cores and twenty-one convolution cores. All the cores in the two architectures are connected using a 6×6 2D mesh NoC. Table 4 shows the executed image classification applications with their original classification accuracy (Acc.) [27]-[38] and the corresponding contrast reduction levels (C).
We evaluate the proposed technique by comparing it with the approximate communication framework (ACF) [18], Approx-NoC [14], AxBA [17], and the baseline (i.e., an NoC with no approximation) from the communication efficiency perspective, which includes network latency and dynamic power consumption.

A. NETWORK LATENCY
The network latency is defined as the number of clock cycles elapsed between sending a packet at the source node and the successful delivery of the packet to the destination. Thus, the network latency includes the time of three procedures: packet generation at the source node, packet transmission in the network, and data extraction at the destination node. Next, ACT is compared with the baseline, ACF, Approx-NoC, and AxBA.
Heterogeneous system with CPUs for pre-processing and NDLA for model inference: Figure 12 shows the network latency normalized with respect to the baseline. ACT achieves an average network latency reduction of 26% and 23% compared to the baseline and ACF, respectively. This occurs because ACF relies on the relative error, to which image classification applications have limited tolerance, yielding a smaller reduction in data size than ACT. The largest network latency reduction achieved by ACT in the experiment is for VGG11 (45% compared to the baseline), while the smallest improvement is obtained for EfficientNet B7 (14% compared to the baseline).
Heterogeneous system with ASICs for pre-processing and CNN processors for model inference: Figure 13 shows the average network latency normalized with respect to the baseline for the heterogeneous system that uses ASICs for pre-processing and CNN processors for model inference. ACT achieves an average network latency reduction of 22% compared to the baseline. Compared to the heterogeneous system that uses CPUs for pre-processing, ACT achieves a different improvement in network latency when ASICs are used for pre-processing. This is due to the better pre-processing data flow of the ASICs: the CPUs' cache coherence protocol injects more traffic into the network, whereas each ASIC pre-processing core is designed to load images directly from the shared cache or memory. Thus, more packets can be approximated for the CPUs, leading to a larger improvement in network latency than for the ASICs.
Compared to the baseline, existing approximate communication techniques (e.g., Approx-NoC, AxBA, and ACF) achieve marginal improvement in network latency (less than 5% on average), as these techniques only rely on the relative error to approximate data. As a result, existing techniques miss the opportunity of data approximation for image classification applications; however, ACT can achieve a significant latency reduction due to the dual approximate communication scheme. Moreover, the proposed technique significantly reduces the network latency when the model frequently uses the fully connected layer and can tolerate a significant image contrast loss.
For example, Figure 14 shows the share of the fully connected layers in the image classification models. In VGG11, 86% of the data, which includes the inputs and parameters of the fully connected layers, belongs to those layers. As Table 4 shows, the VGG models can tolerate 68 levels of contrast reduction (C = -68) with minimal accuracy loss; the combined effect of the two packet approximation mechanisms therefore leads to a large reduction in packet size when VGG11 is executed on the heterogeneous system with ACT. Figure 15 shows this effect by plotting the compression rate of ACT: VGG11 achieves the highest compression rate (2.42) for the transmitted data packets, which leads to a significant improvement in network latency. In contrast, only 3% of the data in EfficientNet B7 (EffNet B7 in Figure 14) is occupied by the fully connected layers, and the model tolerates only 23 levels of contrast reduction. Therefore, the proposed technique achieves its smallest network latency improvement when EfficientNet B7 is executed on the heterogeneous system.

B. DYNAMIC POWER CONSUMPTION
Dynamic power includes the power consumed by the switching activity of all transistors in the NIs and routers. For all on-chip communication, the results are normalized with respect to the baseline. Figure 16 shows the dynamic power consumption for the CPU/NDLA heterogeneous system. ACT achieves an average dynamic power reduction of 29% and 24% compared with the baseline and ACF, respectively. The power reduction across the applications ranges from 17% to 48% compared to the baseline. Figure 17 compares the dynamic power consumption of the two heterogeneous systems with ACT. As the ASICs improve the data flow for pre-processing, the proposed technique achieves a larger dynamic power saving for the CPU/NDLA system than for the ASIC/ACC system. Figure 18 shows the breakdown of the dynamic power saving of ACT between the approximation for model inference (ACT-I) and image pre-processing (ACT-P). Since EfficientNet B7 is sensitive to contrast reduction, ACT-P achieves its smallest dynamic power saving (16%) for this application. In contrast, AdvProp can tolerate significant image contrast reduction; thus, ACT-P yields a 24% reduction in dynamic power. Overall, the average dynamic power reduction achieved by ACT-P is 21%. The dynamic power saving of ACT-I is determined by the size of the fully connected layers: ACT-I contributes a 30% dynamic power reduction for AlexNet, which contains the largest fully connected layers (96%, as per Figure 14) among the considered applications, and its smallest contribution (1%) occurs for EfficientNet B7, due to the small size of its fully connected layers (3%, as shown in Figure 14). Overall, the average dynamic power reduction achieved by ACT-I is 9%. Therefore, ACT achieves a significant improvement in dynamic power consumption for the two considered heterogeneous systems due to effective packet approximation: the proposed data approximation mechanisms significantly reduce packet size, so less switching activity occurs in the NoC, leading to a significant dynamic power reduction.

C. CLASSIFICATION ACCURACY
Figure 19 shows the accuracy loss (i.e., the loss of classification accuracy) for different image classification applications when ACT and ACF are applied to the different heterogeneous systems. The classification accuracy is measured using the testing data set of ImageNet [2]; 512 randomly selected images from the testing data set are used for testing and for setting the contrast reduction level, and the remaining images are used to measure the accuracy loss of the application. The accuracy loss for all applications is less than 0.99% across all considered heterogeneous systems with ACT. However, ACF incurs a significantly higher quality loss than ACT: the highest accuracy loss with ACF (2.2%) is observed when NASNet-4A is executed on the heterogeneous systems, mainly due to the low relative error tolerance of the application. The highest accuracy loss with ACT (0.85%) is also observed for NASNet-4A. Moreover, the incurred accuracy loss is consistent across all systems, indicating that the proposed quality control mechanisms are effective in maintaining a low accuracy loss during approximate communication.

D. OVERALL SYSTEM PERFORMANCE EVALUATION
The ACT hardware is implemented in Verilog to evaluate its area, static power, and latency. The design is synthesized with 32 nm technology using Synopsys Design Vision. The synthesis results show that the proposed hardware implementation incurs an area of 4.79 mm² for each NI. At a supply voltage of 1.0 V, the proposed technique incurs a static power overhead of 1.7 mW per NI. For a 6×6 2D mesh NoC, the ACT modules occupy 1.7% of the total NoC area and consume 4.7% of the total static power. As for latency, the approximation and data recovery processes for the pre-processing cores require one cycle each, as do the approximation and data recovery processes for the model-inference cores. As for the testing overhead, 5 iterations of testing are needed on average for the quality control mechanism to choose the appropriate contrast reduction level. Compared to the several epochs of retraining required by CNN approximation techniques [19]-[23], testing is very efficient; moreover, the testing overhead can be further reduced by using a small test data set or a predetermined contrast reduction level.
Figure 20 illustrates the speedup in execution time for image classification applications when ACT is implemented. The execution time is defined as the number of clock cycles elapsed between the CPU receiving the image classification request and the computation of the result. A single-core CPU and a CPU+Eyeriss architecture are added to the comparison to show the need for heterogeneous architectures in image classification. Eyeriss [45] is a standalone machine-learning accelerator that relies on an off-chip system interconnect (i.e., PCI Express) to communicate with CPUs. The results of the single CPU, CPU+Eyeriss, and CPU/NDLA with ACT are normalized to the baseline CPU/NDLA; the results of ASIC/ACC with ACT are normalized to the baseline ASIC/ACC. Due to the high latency of off-chip communication, CPU+Eyeriss is 5% slower on average than the baseline CPU/NDLA heterogeneous system. The improvement in execution time with ACT ranges from 9% to 38% across the two heterogeneous architectures; in particular, the execution times for VGG11 are reduced by 38% and 35% for CPU/NDLA and ASIC/ACC with ACT, respectively. Therefore, ACT improves system performance for both heterogeneous architectures by implementing the data approximation mechanisms in the on-chip network.

E. SENSITIVITY ANALYSIS
Figure 21 shows the accuracy loss of the image classification applications when the threshold for accuracy loss changes from 1% to 7%. According to Section III.B and Figure 19, the type of pre-processing or model-inference core (e.g., NDLA or CNN processor) used in a heterogeneous architecture has no effect on the classification accuracy loss; thus, the sensitivity analysis is based on the CPU/NDLA system with ACT, as shown in Figure 21. If the accuracy-loss threshold is set above 1%, the approximation mechanism incurs more than a 1% accuracy loss for the applications that are sensitive to image contrast reduction. Thus, 1% is chosen as the threshold for the quality control mechanism for image pre-processing.

VI. CONCLUSION
In this work, we have proposed an approximate communication technique (ACT) to enhance on-chip communication efficiency for image classification applications. The proposed technique leverages the error tolerance of image classification applications to enhance communication efficiency during the execution of an application. ACT-P and ACT-I are developed for pre-processing and inference, respectively, reducing the transmitted data while maintaining the image classification accuracy. Novel approximate network interfaces for the pre-processing core, inference core, memory controller, and shared cache have been proposed to implement ACT in NoCs. Compared to existing approximate communication techniques [14], [17], [18], ACT significantly reduces the transmitted data by efficiently approximating image classification applications. Compared to existing CNN approximation techniques [19]-[23], ACT eliminates the retraining process, which is time- and energy-consuming. The detailed evaluation shows that compared to the state-of-the-art approximate communication technique (ACF) [18], the proposed technique reduces dynamic power consumption and network latency by 24% and 23%, respectively, with less than 0.99% accuracy loss.