Attention-Based Adversarial Robust Distillation in Radio Signal Classifications for Low-Power IoT Devices

Due to great success of transformers in many applications, such as natural language processing and computer vision, transformers have been successfully applied in automatic modulation classification. We have shown that transformer-based radio signal classification is vulnerable to imperceptible and carefully crafted attacks called adversarial examples. Therefore, we propose a defense system against adversarial examples in transformer-based modulation classifications. Considering the need for computationally efficient architecture particularly for Internet of Things (IoT)-based applications or operation of devices in an environment where power supply is limited, we propose a compact transformer for modulation classification. The advantages of robust training such as adversarial training in transformers may not be attainable in compact transformers. By demonstrating this, we propose a novel compact transformer that can enhance robustness in the presence of adversarial attacks. The new method is aimed at transferring the adversarial attention map from the robustly trained large transformer to a compact transformer. The proposed method outperforms the state-of-the-art techniques for the considered white-box scenarios, including the fast gradient method and projected gradient descent attacks. We have provided reasoning of the underlying working mechanisms and investigated the transferability of the adversarial examples between different architectures. The proposed method has the potential to protect the transformer from the transferability of adversarial examples.

THE Internet of Things (IoT) and mobile networks are evolving rapidly to fulfill the need for ultra-reliable and low-latency performance, seamless connectivity, mobility, and intelligence [1], [2], [3], [4]. It is estimated that over 50 billion devices are wirelessly connected to the Internet, which can sense their surroundings and offer high-quality services. The explosive growth of IoT devices demands efficient management of the already scarce radio spectrum, which is very challenging, particularly in a noncooperative communication environment. As a result, classifying modulation types at the receiver under noncooperative communication conditions becomes a critical task. Automatic modulation classification (AMC) plays a key role in wireless spectrum monitoring by classifying modulations, possibly without prior knowledge of the received signals or channel parameters [5], [6]. It also plays an important role in wireless spectrum anomaly detection, transmitter identification, and radio environment awareness, consequently improving radio spectrum usage and context-aware intelligent decision making in autonomous wireless spectrum monitoring.
Traditionally, AMC was mainly achieved by carefully extracted features (such as higher order cyclic moments) by experts and certain classification criteria [7], [8], [9], [10], [11]. These existing feature-based methods are easy to implement in practice, however, handcrafted features and hardcoding criteria for AMC make scaling to new modulation types challenging. Recently, due to the superior performance of deep learning, many researchers have resorted to various deep neural network (DNN) architectures for AMC [12], [13], [14], [15], [16], [17], [18], [19]. For example, a convolutional neural network (CNN) was used for AMC in [12]. Later convolutional long short-term DNNs (CLDNNs), long short-term memory neural networks (LSTM), and deep residual networks (ResNet) were proposed to improve the classification performance [16]. A complex CNN was proposed in [17] for the identification of signal spectrum information. A spatiotemporal hybrid DNN was proposed in [18] for AMC which is based on multichannels and multifunction blocks. Furthermore, to reduce the communication overhead, Fu et al. [19] proposed an innovative learning framework which is based on the combination of decentralized learning and ensemble learning. With the great success of transformer in the computer vision area [20], [21], [22], the work in [23] has successfully applied transformers in AMC which shows considerable performance improvement compared to the state-of-the-art techniques.
Despite the superior performance of DNNs, several recent research works have pointed out that DNNs are vulnerable to adversarial examples, which are imperceptible and deliberately crafted modifications to the input that result in misclassifications [24]. Adversarial examples have been shown to hinder the operation of several machine learning applications, such as face recognition [25], object detection [26], semantic segmentation [27], natural language processing [28], and malware detection [29]. Notably, adversarial examples using fast gradient methods (FGMs) have been shown to reduce the classification accuracy in AMC [30], [31]. Recently, we have shown that transformer-based AMC is also vulnerable to adversarial examples generated using a projected gradient descent (PGD) method [32].
In practice, AMC can be applied to both military and civilian scenarios. In a networked battlefield, important information may be shared using radio signals by the units of each adversary (opponent transmitter and receiver as indicated in Fig. 1). The allied forces (playing the role of eavesdropper in this scenario) can employ AMC to determine the modulation used and thereby eavesdrop on information transmitted between the adversary units (opponents). To deter the allied forces from eavesdropping on messages, the opponents can apply small perturbations (adversarial perturbations) to the communication signals such that the modulation discovery executed by the allied forces eventually fails. The modulation can still be discovered by the allied forces; however, in this case, an AMC system that is robust against adversarial attacks should be applied, as proposed in this article. To the best of our knowledge, this work is the first to propose a defense for transformer-based AMC in the literature. The proposed defense will be used to protect the allied forces from the adversarial perturbations such that the allied forces can successfully discover the modulation.
TABLE I: Accuracy of the AT-trained compact transformer compared with the AT-trained large transformer against PGD attacks for a wide range of PNR values.
While transformers offer superior performance, this comes at a price in terms of large model size and computational complexity, which may limit reaping their benefits in applications that rely on low-power sensors and IoT networks [33], [34], [35], [36] with possibly a low memory size. Therefore, we propose a novel compact transformer-based defense for low-power IoT devices based on the distillation of knowledge through the adversarial attention map, which is a critical element of the transformer; the details are given in Section III. There are several countermeasures to defend DNNs against adversarial examples [37], [38], [39], [40], [41].
Among these defenses, adversarial training (AT) has been shown to be the most powerful method [42], [43]. It is a data augmentation technique which augments adversarial examples during the training of DNN. While AT offers superior performance in large transformers, such benefits are not apparent in compact transformers as shown in Table I, which shows robustness (classification accuracy in the presence of attacks), especially when the PNR (i.e., the ratio of the adversarial perturbation power to the noise power) is high. Therefore, we focus on transferring the robustness from a large transformer to a compact transformer. For DNN, there have been some works in computer vision applications that considered transferring robustness from a teacher model to a student model. For example, an adversarially robust distillation (ARD) method for ordinary DNN was proposed in [44] to distill robustness into small student networks during knowledge distillation. A reliable introspective adversarial distillation (IAD) technique was proposed in [45] where students partially trust their teachers instead of trusting them fully. The work in [46] used both robust-trained and standard-trained self-teachers, which we call adversarial knowledge distillation (AKD). Furthermore, a robust soft label adversarial distillation (RSLAD) method was proposed in [47] which fully exploits the robust soft labels obtained by a robust teacher model to teach the student's learning. However, these methods consider only the logits (i.e., representations of the penultimate layer of the DNN) information from the robust teacher models which may not be sufficient to maintain robustness. 
In this article, based on a unique architecture of the transformer-based neural network, we propose an attention-based adversarial robustness distillation (ATARD) method for low-power IoT devices to transfer the robustness onto a compact transformer model by learning the adversarial attention map from a robust large transformer model. The proposed ATARD has a better smoothness (i.e., less sensitive to perturbations) than AT, ARD, IAD, AKD, and RSLAD as demonstrated in Section III. Our key contributions are as follows.
1) For the first time in the literature, we propose a defense against adversarial attacks for transformer-based AMC.
2) We propose an ATARD method to transfer the robustness onto a compact transformer model by learning the adversarial attention map from a robust large transformer model. To the best of our knowledge, this is the first work to distill robustness by transferring the adversarial attention map onto small networks.
3) Considering white-box attacks, we show the performance advantage of our proposed method compared to the state-of-the-art techniques, including AT, ARD, IAD, AKD, and RSLAD. Furthermore, the transferability of the adversarial examples among different architectures is shown and the robustness of ATARD is established.
The rest of the sections are arranged as follows. Section II reviews the related works. The proposed methodology is presented in Section III, followed by the results and discussions in Section IV. Finally, conclusions are drawn in Section V.

II. RELATED WORK
First, we elaborate on the related techniques, including AT, ARD, IAD, AKD, and RSLAD. AT can be traced back to Goodfellow et al. [24], in which the clean images and the corresponding adversarial examples were mixed into every mini-batch for training. The loss function of AT is expressed as
$$\mathcal{L}_{\mathrm{AT}} = \alpha\,\mathrm{CE}(f_\theta(x), y) + (1-\alpha)\,\mathrm{CE}(f_\theta(x_{\mathrm{adv}}), y) \tag{1}$$
where CE(·) is the cross-entropy loss, θ is the parameter of the network f(·), x denotes the clean (benign) samples without adversarial perturbation, $x_{\mathrm{adv}}$ is the adversarial counterpart of the corresponding x, and y is the true label of the sample. α accounts for the relative importance between the clean sample loss and the adversarial sample loss. We choose α = 0.5 as suggested in [24]. The first term in (1) guarantees the classification accuracy for benign samples, and the second term maintains the accuracy when the network is exposed to adversarial samples. The optimization for generating the adversarial samples $x_{\mathrm{adv}}$ is as follows:
$$x_{\mathrm{adv}} = \arg\max_{\|x' - x\|_2 \le \varepsilon} \mathrm{CE}(f_\theta(x'), y). \tag{2}$$
We use a 3-step PGD attack as suggested by [41]. The PGD attack is the strongest attack utilizing the local first-order information about the network. The value of α and the generation of $x_{\mathrm{adv}}$ remain the same for the rest of the techniques unless stated otherwise.
ARD was proposed to distill robustness from a robust teacher to a small student network. ARD is analogous to AT but in a distillation setting. During AT, a DNN is trained to predict the true label when its input is exposed to adversarial perturbation. In ARD, by contrast, given a robust (AT-trained) teacher model T, a student model S aims to match the teacher network's output when exposed to an adversary. At the same time, the loss between the output of the student model and the ground-truth label is included to balance the natural accuracy. The loss function of ARD is as follows [44]:
$$\mathcal{L}_{\mathrm{ARD}} = \alpha t^2\,\mathrm{KL}\big(S_\theta^t(x_{\mathrm{adv}}), T^t(x)\big) + (1-\alpha)\,\mathrm{CE}\big(S_\theta^t(x), y\big) \tag{3}$$
where KL(·) is the KL-divergence loss and t is the temperature.
The logits of both the teacher and the student networks are divided by the temperature term t. In this work, we set t = 1.
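As a rough illustration of how the AT and ARD objectives combine their terms, the following standard-library sketch evaluates both losses on hand-written logits. The function names, the toy logits, and the direction of the KL term (teacher distribution as the reference, as is common in distillation code) are assumptions made for illustration and are not taken from [24] or [44].

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax: logits are divided by t before normalizing."""
    scaled = [z / t for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label, t=1.0):
    """Cross-entropy between softmax(logits / t) and a one-hot label index."""
    return -math.log(softmax(logits, t)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def at_loss(clean_logits, adv_logits, label, alpha=0.5):
    """Weighted sum of the clean-sample loss and the adversarial-sample loss."""
    return (alpha * cross_entropy(clean_logits, label)
            + (1.0 - alpha) * cross_entropy(adv_logits, label))

def ard_loss(student_clean, student_adv, teacher_clean, label, alpha=0.5, t=1.0):
    """Distillation term on adversarial inputs plus CE on clean inputs.

    Assumption: the teacher distribution is the KL reference.
    """
    teacher_probs = softmax(teacher_clean, t)
    student_probs = softmax(student_adv, t)
    distill = kl_divergence(teacher_probs, student_probs)
    return alpha * t * t * distill + (1.0 - alpha) * cross_entropy(student_clean, label, t)

# Hypothetical two-class logits, for illustration only.
clean, adv, teacher = [2.0, 0.0], [0.5, 1.0], [1.5, 0.2]
print(at_loss(clean, adv, label=0))
print(ard_loss(clean, adv, teacher, label=0))
```

With t = 1 the temperature scaling is a no-op, so the ARD distillation term reduces to a plain KL divergence between the teacher's clean-input output and the student's adversarial-input output.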
In terms of adversarial robustness, Zhu et al. [45] considered that the teacher models may become unreliable so that adversarial distillation may not work. This is because the teacher models are pretrained on their own adversarial samples, and it is unrealistic to expect the teacher models to be reliable for each adversarial sample queried by student models. Hence, IAD was proposed, in which the student trusts the teacher network only partially rather than fully. To be specific, given a query of a natural sample and its corresponding adversarial sample from the student network, IAD considers three different situations. First, if a teacher performs well on the adversarial sample, then its outputs (or soft labels) can be completely trusted. Second, if a teacher performs well on the clean sample but not on the adversarial sample, its output is partly trusted and the student network will also rely on its own outputs. Otherwise, the student model will only trust its own outputs. The loss function of IAD is given in (4), where α is a parameter introduced to balance the influence of the teacher network in IAD. As shown in (5), α is defined as the probability that the teacher network assigns to the targeted label y when queried by the adversarial data, and β is used to sharpen the prediction probability.
In AKD, the authors propose to inject learned smoothing during AT in order to avoid the overfitting issue of AT-trained models. AKD utilizes both self-training and knowledge distillation to smooth the logits. Specifically, the teacher model is trained using two methods, standard training and robust training, which yields two teacher models. The loss function of AKD is given in (6), where $T_{\mathrm{at}}(x)$ is the AT-trained teacher network and $T_{\mathrm{std}}(x)$ is the standard-trained teacher network. $\lambda_1$ and $\lambda_2$ are hyperparameters, and we use $\lambda_1 = 0.5$ and $\lambda_2 = 0.25$ following the default setting in AKD.
Based on both AT and ARD, the authors of RSLAD [47] observed that using the predictions of an adversarially trained model could improve the robustness. Hence, the RSLAD method was proposed, which brings robust soft labels into full play, i.e., the robust soft labels are used both in the loss function of RSLAD and in the generation of the adversarial counterparts during training. The overall optimization is given in (7), where T(x) denotes the robust soft labels obtained by the adversarially trained teacher model, and the adversarial samples $x_{\mathrm{adv}}$ are generated using (8). No hard labels are used in the objective function of RSLAD, and the commonly used CE loss is replaced by the KL-divergence loss to express the degree of distributional difference between the output probabilities of the two models.

III. PROPOSED ATTENTION-BASED ADVERSARIAL ROBUSTNESS DISTILLATION
In this section, we provide details of our proposed ATARD method for low-power IoT devices. The two main differences between the proposed ATARD and the existing techniques AT, ARD, IAD, AKD, and RSLAD are as follows. First, transformers differ from CNNs in that the self-attention (SA) in transformers not only extracts intrapatch features but also considers the interpatch relations. Therefore, ATARD is based on transformer networks, i.e., the teacher and student networks are both based on the transformer architecture. Second, instead of transferring the robustness through the prediction probability, ATARD transfers the attention map, which carries inherent information, from the teacher model to the student model.

A. Proposed ATARD
Before delving into details, we first provide an introduction to the large teacher transformer model. We adopt the same transformer architecture as that used in [32] for the teacher network as shown in Fig. 2. The input to the teacher transformer is such that the I and Q components form a 2-D image of depth one ($1 \times I_1 \times I_2$ dimension). After going through a convolutional layer and a reshaping layer, this input signal becomes $N_0$ patch embeddings, where $N_0 = (I_2 - N_k)/N_s + 1$, and each patch embedding has $1 \times N_c$ (in our case $N_c = 128$) dimension. Then, a learnable embedding (CLS token) is prepended to the sequence of embedded patches, whose state at the output of the transformer encoder serves as the signal representation. The CLS token has the same dimension as each embedded patch and its parameters are learned during the backpropagation of the training process. Hence, in total, we have $(N_0 + 1)$ patch embeddings denoted as $z_0$, and these patch embeddings are fed into N (in our case N = 4) consecutive transformer encoder layers. These encoder layers do not change the dimension of the input, i.e., the output of these four transformer encoder layers has the same dimension as the input patches, $(N_0 + 1) \times N_c$. After a layer normalization (LN) [48] process, the CLS token information is extracted and processed through a dense layer. Finally, a $1 \times K$ vector is produced which represents the prediction probabilities of K different classes for the input signal.
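The dimension bookkeeping described above can be traced with a short sketch. The values used for the I/Q length, kernel size, and stride (128, 8, and 8) are hypothetical placeholders chosen only to make the arithmetic concrete, and reading $N_k$ and $N_s$ as the kernel size and stride of the convolutional layer is likewise an assumption; K = 11 follows the number of modulation schemes in the data set.

```python
def num_patches(i2, n_k, n_s):
    """Number of patch embeddings N_0 = (I_2 - N_k) / N_s + 1."""
    return (i2 - n_k) // n_s + 1

def encoder_shapes(i2, n_k, n_s, n_c=128, n_layers=4, n_classes=11):
    """List (stage, shape) pairs for the data flowing through the transformer."""
    n0 = num_patches(i2, n_k, n_s)
    seq_len = n0 + 1  # the CLS token is prepended to the N_0 patch embeddings
    shapes = [("patch embeddings", (n0, n_c)),
              ("with CLS token", (seq_len, n_c))]
    # Encoder layers preserve the (N_0 + 1) x N_c shape.
    shapes += [(f"encoder layer {n}", (seq_len, n_c)) for n in range(1, n_layers + 1)]
    shapes.append(("class probabilities", (1, n_classes)))
    return shapes

# Hypothetical settings: I_2 = 128, N_k = 8, N_s = 8 (assumptions, not from the paper).
for stage, shape in encoder_shapes(i2=128, n_k=8, n_s=8):
    print(stage, shape)
```

Calling `encoder_shapes(128, 8, 8, n_c=96, n_layers=2)` gives the corresponding trace for a compact model with two encoder layers and 96-dimensional embeddings.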
The encoder layer, as a key part of the transformer network, consists of two sublayers. The first one is a multihead SA (MSA) function, and the second is a position-wise fully connected feedforward network (FFN). An LN is used around each of the two sublayers, followed by a residual connection [49]. Specifically, with $z'_{n-1}$ denoting the output of the MSA sublayer, the output of the encoder layer $z_n$ can be expressed as follows:
$$z'_{n-1} = \mathrm{MSA}(\mathrm{LN}(z_{n-1})) + z_{n-1} \tag{9}$$
$$z_n = \mathrm{FFN}(\mathrm{LN}(z'_{n-1})) + z'_{n-1}, \quad n = 1, \ldots, N. \tag{10}$$
The standard SA mechanism, as shown in Fig. 3, is a function which converts a query and a set of key-value pairs into an output. As in (11), the output of SA is generated by calculating the weighted sum of the values V with the associated weight A assigned to each value:
$$\mathrm{SA}(Q, K, V) = AV. \tag{11}$$
We denote the weight A as the attention map, which is obtained as a function of the query Q and the corresponding key K with a dimension of $d_k$. Specifically, the weight A is calculated as the scaled dot products of the query with the keys, followed by a softmax function as shown in (12):
$$A = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right). \tag{12}$$
MSA is an extension of SA, in which SA operations are implemented h times in parallel, and their outputs are concatenated and projected. Hence, the output of MSA can be expressed as
$$\mathrm{MSA}(z) = [\mathrm{SA}_1(z); \mathrm{SA}_2(z); \ldots; \mathrm{SA}_h(z)]\,U_{\mathrm{MSA}} \tag{13}$$
where $U_{\mathrm{MSA}}$ is a linear projection matrix.
We now present the proposed ATARD method that transfers the robustness from a large teacher transformer network into a small student transformer model as shown in Fig. 4. The computational complexity is proportional to the total number of parameters of the network. To reduce the computational complexity, we use N = 2 and $N_c = 96$ for the small transformer network, and the total number of parameters is reduced from 801 675 to 230 699. The calculation of the number of parameters for both the teacher and the student transformer networks is shown in the Appendix.
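To make the attention map concrete, the sketch below computes the scaled dot-product weights A of (12) and the SA output of (11) for tiny hand-written matrices using only the standard library. The helper names and the toy values are illustrative assumptions; a practical implementation would use batched tensor operations.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_map(q, k):
    """A = softmax(Q K^T / sqrt(d_k)), applied row by row."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]

def self_attention(q, k, v):
    """Return the SA output A V together with the attention map A."""
    a = attention_map(q, k)
    return matmul(a, v), a

# Toy example: zero queries give uniform attention, so each output row is the mean of V.
q = [[0.0, 0.0], [0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[2.0, 0.0], [0.0, 4.0]]
out, a = self_attention(q, k, v)
print(a)
print(out)
```

Each row of A sums to one, so every output row is a convex combination of the value rows; this row-stochastic matrix is the quantity that is later compared between teacher and student.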
The teacher transformer network is pretrained using AT as in (1) and (2). Then, during the training of the student transformer network, instead of only constraining the loss between the predicted output for the adversarial samples and the true label y, we also force the adversarial attention maps extracted from the teacher transformer network T(·) and the student transformer network S(·) to be as similar as possible. Specifically, during each training iteration of the student network, the adversarial samples $x_{\mathrm{adv}}$ are generated for each benign sample. Feeding $x_{\mathrm{adv}}$ into both teacher and student networks, the adversarial attention maps are extracted. For notational simplicity, we denote the adversarial attention maps for the four encoder layers of the teacher network as $\mathrm{AAM}_T^1$, $\mathrm{AAM}_T^2$, $\mathrm{AAM}_T^3$, and $\mathrm{AAM}_T^4$, respectively. Similarly, the adversarial attention maps for the two encoder layers of the student are denoted as $\mathrm{AAM}_S^1$ and $\mathrm{AAM}_S^2$. As mentioned above, an MSA is used in this work, which means that rather than implementing a single attention function, the query Q and the key K are linearly projected h times with different, learned linear projections. Hence, we have h different queries and keys denoted as $Q_1, \ldots, Q_h$ and $K_1, \ldots, K_h$, respectively. Therefore, in terms of the AAM for both the teacher and the student networks, we use the average of the h keys and queries, and the AAM is computed as given in (14). Then, we force $\mathrm{AAM}_S^1$ to learn from $\mathrm{AAM}_T^1$, $\mathrm{AAM}_T^2$, and $\mathrm{AAM}_T^3$, and force $\mathrm{AAM}_S^2$ to learn from $\mathrm{AAM}_T^2$, $\mathrm{AAM}_T^3$, and $\mathrm{AAM}_T^4$. We denote $N_T$ as the number of transformer encoder layers of the teacher network. Mathematically, the objective function of the proposed ATARD is given in (15), where $x_{\mathrm{adv}}$ is generated in each batch during each training iteration using the 3-step PGD attack. The first term in (15), $\mathrm{CE}(S_\theta^t(x_{\mathrm{adv}}), y)$, is denoted as Loss1, and the rest is denoted as Loss2. The proposed ATARD procedure is given in Algorithm 1.
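The layer pairing described above (student layer 1 learns from teacher layers 1 to 3, student layer 2 from teacher layers 2 to 4) can be sketched as follows. The mean-squared distance between maps, the unweighted sum over pairs, and the generalization of the pairing rule to other depths are assumptions made purely for illustration; the actual distance and weighting are those defined in (15).

```python
def teacher_layers_for(student_layer, n_student=2, n_teacher=4):
    """Teacher layers (1-indexed) that a given student layer learns from.

    Assumed rule: a sliding window of width n_teacher - n_student + 1,
    which reproduces the 2-layer student / 4-layer teacher pairing in the text.
    """
    span = n_teacher - n_student + 1
    return list(range(student_layer, student_layer + span))

def head_average(mats):
    """Element-wise mean of h equally shaped matrices (e.g., the h queries or keys)."""
    h = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[r][c] for m in mats) / h for c in range(cols)] for r in range(rows)]

def map_distance(a, b):
    """Mean squared difference between two attention maps (assumed distance)."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def attention_transfer_loss(student_maps, teacher_maps):
    """Sum of map distances over all paired (student, teacher) layers."""
    n_s, n_t = len(student_maps), len(teacher_maps)
    total = 0.0
    for i, s_map in enumerate(student_maps, start=1):
        for j in teacher_layers_for(i, n_s, n_t):
            total += map_distance(s_map, teacher_maps[j - 1])
    return total

# Toy 2x2 attention maps, for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(teacher_layers_for(1), teacher_layers_for(2))
print(attention_transfer_loss([identity, identity], [identity, identity, identity, uniform]))
```

Because the attention map has size $(N_0 + 1) \times (N_0 + 1)$ regardless of the embedding width $N_c$, teacher and student maps can be compared element by element even though the two networks use different embedding dimensions.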
For a clear comparison, the objective functions of ATARD and the related works are presented in Table II. As seen, the key contribution and the biggest difference between the proposed ATARD and the listed related works is that, instead of transferring the adversarial logits information, ATARD transfers the adversarial attention information from the teacher model to the student model for better robustness against adversarial examples.
Now, we shed light on why the proposed ATARD improves the robustness against adversarial examples from the perspective of the "smoothness" of the neural network. We first give a simple 2-D example to illustrate how a smoother function could help the robustness of a neural network against adversarial examples, as shown in Fig. 5. Given two neural networks $g_1(x)$ and $g_2(x)$ for two-class classification, we assume that the sample x belongs to class 1 when g(x) ≥ 0; otherwise, x belongs to class 2. Starting from the same input x which belongs to class 1, for $g_1(x)$, the perturbation $\Delta x_1$ needs to be added so that the perturbed sample $x_1$ falls into class 2, i.e., $x_1$ is the adversarial version of x for $g_1(x)$. Similarly, for $g_2(x)$, the perturbation $\Delta x_2$ needs to be introduced in order for the perturbed sample $x_2$ to become an adversarial sample. It is apparent that the smoother function $g_2(x)$ requires a larger perturbation for the sample x to become an adversarial example. In other words, given the same amount of perturbation, a smoother function could achieve higher robustness against adversarial examples. To verify the smoothness of our proposed ATARD, we calculate the average $l_2$-norm of the gradient of the loss function CE(f(x), y) with respect to the input for 1000 randomly chosen samples from the testing data set.
The results shown in Table III demonstrate that the proposed ATARD has the smallest gradient norm among all the baseline methods, i.e., ATARD is the least sensitive to adversarial perturbations among all the techniques. Consequently, for a given amount of perturbation, ATARD achieves higher robustness against adversarial examples, as verified by the experimental results in Section IV. Equivalently, the attacker has to apply larger perturbations to make the input signals misclassified; hence, more transmission power is needed for the modulation classification attack to succeed, which hinders stealth operation of the adversarial transmitter.
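The link between gradient norm and required perturbation can be checked on a toy case. The sketch below assumes two linear decision functions $g(x) = w^{T}x + b$ (a simplification chosen for illustration, not the networks of Fig. 5) that assign the same score to the same input; for a linear function, the smallest $l_2$ perturbation that reaches the boundary g = 0 is $|g(x)| / \|w\|_2$.

```python
import math

def decision_value(w, b, x):
    """g(x) = w . x + b; class 1 when g(x) >= 0, class 2 otherwise."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def gradient_norm(w):
    """l2-norm of the input gradient of a linear g, which is simply ||w||_2."""
    return math.sqrt(sum(wi * wi for wi in w))

def min_l2_perturbation(w, b, x):
    """Smallest l2 perturbation that moves x onto the boundary g = 0."""
    return abs(decision_value(w, b, x)) / gradient_norm(w)

# Two hypothetical classifiers with the same score at x but different steepness.
x = [1.0, 1.0]
steep = ([4.0, 3.0], -5.0)    # gradient norm 5
smooth = ([0.8, 0.6], 0.6)    # gradient norm 1

for name, (w, b) in (("g1 (steep)", steep), ("g2 (smooth)", smooth)):
    print(name, decision_value(w, b, x), gradient_norm(w), min_l2_perturbation(w, b, x))
```

Both functions score x at 2, yet the smoother one needs a perturbation five times larger to flip the decision, which mirrors the argument that a smaller input-gradient norm translates into a larger required attack power.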

B. White-Box FGM and PGD Attacks
Now, we give details of the white-box FGM and PGD attacks we used to evaluate the robustness of the proposed defense. According to the attacker's knowledge, adversarial examples can be classified into three different classes: 1) perfect-knowledge white-box attacks; 2) limited-knowledge gray-box attacks; and 3) zero-knowledge black-box attacks. The white-box attack indicates that the adversary has full knowledge of the targeted defense system, including the architectures, the parameters, and the data used by the defender. On the contrary, the black-box attack means the attacker has no knowledge of the targeted system, and the gray-box attack means the attacker has part of the knowledge of the defense. In this work, the white-box attack is considered, because this scenario allows for a worst-case evaluation of the security of the learning algorithms, creating empirical upper bounds on the performance deterioration that may be induced by the system under attack [51].

Algorithm 2: White-box FGM attack
Input:
• input $x_0$ and its true label y
• the number of classes $N_c$
• the prediction probability of the data sample f(·)
• allowed $l_2$-norm of the perturbation ε, allowed PNR and SNR
• the cross-entropy loss between the predicted probability and the targeted wrong class $\mathrm{CE}(f(\cdot), e_t)$
Output: $x'$: the adversarial example.
1: for t in range($N_c$) do
2: $r_{\mathrm{norm}} = \nabla_{x_0}\mathrm{CE}(f(x_0), e_t) \,/\, \|\nabla_{x_0}\mathrm{CE}(f(x_0), e_t)\|_2$
3: $x' = x_0 + \varepsilon \cdot r_{\mathrm{norm}}$
4: until $\arg\max_i f_i(x') \neq y$
5: return $x'$
The algorithm for generating a white-box FGM attack is shown in Algorithm 2, which is adopted from [30]. Specifically, given an input signal $x_0$ and certain PNR and SNR values (the SNR being the ratio of the signal power to the noise power), the allowed $l_2$-norm of the perturbation ε is calculated as
$$\varepsilon = \sqrt{\frac{\mathrm{PNR} \cdot \|x_0\|_2^2}{\mathrm{SNR} + 1}}. \tag{16}$$
The above formula is obtained as described below. The RML data (denoted as x) contain both the signal (with signal power S) and noise (with noise power N). The power of the data x (denoted by $P_x$) is therefore the sum of the signal power and the noise power (because they are independent), i.e., $P_x = S + N$. Then, we obtain $P_x/N = (S + N)/N = \mathrm{SNR} + 1$. Hence, $N = P_x/(\mathrm{SNR} + 1)$. Given the $l_2$-norm of the adversarial perturbation ε, the perturbation power is $\varepsilon^2/L$, where L is the number of elements in the perturbation vector. Then, by definition, $\mathrm{PNR} = (\varepsilon^2/L)/N = \varepsilon^2 (\mathrm{SNR} + 1)/(P_x L)$. Hence, $\varepsilon = \sqrt{\mathrm{PNR} \cdot P_x L/(\mathrm{SNR} + 1)}$, and (16) is obtained by replacing $P_x L$ with its sample estimate, i.e., $\|x_0\|_2^2$. For each possible targeted wrong class, the normalized perturbation is obtained in line 2, where $\nabla_{x_0}\mathrm{CE}(f(x_0), e_t)$ indicates the gradient of the cross-entropy loss between the predicted probability $f(x_0)$ and the targeted wrong class $e_t$ with respect to the input $x_0$. Then, the normalized perturbation is multiplied by the allowed $l_2$-norm of the perturbation ε and added to the original sample in line 3. The algorithm terminates when the predicted label of the generated adversarial sample, $\arg\max_i f_i(x')$, is not equal to the true label y.
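The perturbation budget derived above can be evaluated numerically as follows. This is a minimal sketch that assumes the PNR and SNR are supplied in dB and that the I/Q sample has been flattened into a single list of real values; the function names are illustrative.

```python
import math

def db_to_linear(value_db):
    """Convert a power ratio from dB to linear scale."""
    return 10.0 ** (value_db / 10.0)

def perturbation_budget(x0, pnr_db, snr_db):
    """Allowed l2-norm: eps = sqrt(PNR * ||x0||_2^2 / (SNR + 1))."""
    energy = sum(v * v for v in x0)  # sample estimate of P_x * L
    return math.sqrt(db_to_linear(pnr_db) * energy / (db_to_linear(snr_db) + 1.0))

# Toy input: 256 unit-amplitude entries, so P_x = 1 and L = 256.
x0 = [1.0] * 256
eps = perturbation_budget(x0, pnr_db=-10.0, snr_db=10.0)
print(eps)

# Consistency check: perturbation power over noise power recovers the requested PNR.
noise_power = 1.0 / (db_to_linear(10.0) + 1.0)   # N = P_x / (SNR + 1)
print((eps ** 2 / len(x0)) / noise_power)        # linear PNR, close to 0.1
```

Since the budget scales with the square root of the linear PNR, every 10 dB increase in PNR enlarges the allowed perturbation norm by a factor of about 3.16.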
The PGD attack is considered as it is a strong form of attack utilizing the local first-order information about the network [41]. The algorithm for generating the white-box PGD attack is shown in Algorithm 3, which is adopted from [41].

Algorithm 3: White-box PGD attack
Input:
• input $x_0$ and its true label y
• the step size $\eta_0$
• the prediction probability of the data sample f(·)
• the cross-entropy loss between the predicted probability and the true label CE(f(·), y)
• a projector Π on the $l_2$-norm constraint $\|x - x_0\|_2 \le \varepsilon$
Output: $x'$: the adversarial example.
1: $x' \leftarrow x_0$
2: repeat
3: $x \leftarrow x'$
4: $x^{*} \leftarrow x + \eta_0 \nabla_x \mathrm{CE}(f(x), y)$
5: $x' \leftarrow \Pi(x^{*})$
6: until $\arg\max_i f_i(x') \neq y$
7: return $x'$

Specifically, in line 4, the gradient of the cross-entropy loss between the predicted probability f(x) and the true label y is first computed and a standard gradient step is employed to obtain the updated sample $x^{*}$. Then, a projection procedure is applied to $x^{*}$ in line 5 such that the generated adversarial perturbation is less than a predefined bound ε, which is obtained as in (16). The projection procedure is formulated as the optimization below:
$$\Pi(x^{*}) = \arg\min_{x} \|x - x^{*}\|_2^2 \quad \text{s.t.} \quad \|x - x_0\|_2 \le \varepsilon \tag{17}$$
where $x^{*}$ indicates the updated sample after the standard gradient step $x^{*} = x + \eta_0 \nabla_x \mathrm{CE}(f(x), y)$ and $x_0$ is the original input signal. The solution to (17) is computed as follows:
$$\Pi(x^{*}) = x_0 + \varepsilon\,\frac{x^{*} - x_0}{\max\{\varepsilon, \|x^{*} - x_0\|_2\}}. \tag{18}$$
However, we adopt (19) as the projector in order to make the $l_2$-norm of the generated adversarial perturbation equal to ε, i.e., $\|x' - x_0\|_2 = \varepsilon$, which helps in setting a specific PNR for performance analysis:
$$\Pi(x^{*}) = x_0 + \varepsilon\,\frac{x^{*} - x_0}{\|x^{*} - x_0\|_2}. \tag{19}$$
Finally, the loop terminates when the condition in line 6 is met, i.e., the predicted label of the produced adversarial example, $\arg\max_i f_i(x')$, is unequal to the true label y.
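The difference between projecting onto the $l_2$ ball, as in (18), and rescaling onto its surface, as in (19), can be seen in a small sketch. The function names are illustrative, and the handling of a zero perturbation (returning the input unchanged, since the direction is undefined) is an assumption.

```python
import math

def l2_norm(v):
    """Euclidean norm of a list of reals."""
    return math.sqrt(sum(a * a for a in v))

def project_ball(x_star, x0, eps):
    """Euclidean projection onto {x : ||x - x0||_2 <= eps}."""
    delta = [a - b for a, b in zip(x_star, x0)]
    scale = eps / max(eps, l2_norm(delta))  # equals 1 when already inside the ball
    return [b + scale * d for b, d in zip(x0, delta)]

def project_sphere(x_star, x0, eps):
    """Rescale the perturbation so that ||x - x0||_2 equals eps exactly."""
    delta = [a - b for a, b in zip(x_star, x0)]
    norm = l2_norm(delta)
    if norm == 0.0:
        return list(x0)  # direction undefined; assumption: leave the input unchanged
    return [b + eps * d / norm for b, d in zip(x0, delta)]

x0 = [0.0, 0.0]
print(project_ball([3.0, 4.0], x0, 1.0))    # outside the ball: pulled back to radius 1
print(project_ball([0.3, 0.4], x0, 1.0))    # inside the ball: unchanged
print(project_sphere([0.3, 0.4], x0, 1.0))  # inside the ball: pushed out to radius 1
```

The ball projection leaves small perturbations untouched, whereas the sphere variant always uses the full budget, which is what allows every adversarial example to sit at exactly the target PNR.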

IV. RESULTS AND DISCUSSION
In this section, we present and analyze the experimental results. All algorithms are implemented in PyTorch and executed on an NVIDIA GeForce RTX 2080 Ti GPU.

A. Data Set
The data sets used in this work are the GNU radio ML data set RML2016.10a [52] and RDL2021.12 [53]. The GNU radio ML data set RML2016.10a has 220 000 input samples, each of which corresponds to one modulation type at a specific SNR. This data set contains 11 different modulation schemes, including BPSK, QPSK, 8PSK, QAM16, QAM64, CPFSK, GFSK, PAM4, WBFM, AM-SSB, and AM-DSB. The samples are produced for 20 different SNR levels. Half of the samples are used for training, while the rest are used for testing. Compared to the RML data set, the RDL data set contains two noise types, namely, Gaussian noise and alpha-stable distributed noise. In addition, the RDL data set considers both Rayleigh and Rician fading channels. A total of 110 000 samples are generated for 11 different modulation types with SNR = 10 dB, of which 90% are used for training and the rest are used for testing. The RDL data set has impulsive noise due to the addition of alpha-stable noise. Therefore, to mitigate the effect of impulsive noise, we preprocess the received signal before applying it to the DNN classifiers during the training and testing phases. Accordingly, we compute the standard deviation $\sigma_x$ of the signal after applying a five-sample moving median filter and reject any samples that fall outside of $\pm\sigma_x$. Finally, the data samples are normalized. To assess the effectiveness of the transformer-based neural network against adversarial samples, we generate adversarial attacks using 1000 data samples from the testing set with SNR = 10 dB.

B. Robustness Results Against White-Box FGM and PGD Attacks
Considering the RML2016.10a data set, the classification accuracy of ATARD and other competing techniques in the presence of FGM and PGD attacks is shown in Figs. 6 and 7, respectively. The experiments are repeated ten times and the average performance is presented. Our proposed ATARD has higher robustness than the competing techniques, including AT, ARD, IAD, AKD, and RSLAD, for a wide range of PNR values from −30 to −10 dB. The results for the normal training (NT)-based transformer are also included as the baseline. The robustness improvement becomes more significant when the PNR value is large. Specifically, from Fig. 6, when PNR = −10 dB, the proposed ATARD achieves 71.7% accuracy, which is around 7% higher than ARD and RSLAD and 14% higher than AT. From Fig. 7, when PNR = −10 dB, the classification accuracy against PGD attacks for the proposed ATARD is around 13% higher than AT, and around 10% higher than ARD, AKD, and RSLAD. Compared to NT, the proposed ATARD improves the robustness against both FGM and PGD attacks significantly. Specifically, when PNR = −10 dB, the proposed ATARD achieves 16.8% and 25.8% higher accuracy than NT against FGM and PGD attacks, respectively. Furthermore, from the attacker's point of view, the PGD attack is more powerful than the FGM attack, i.e., the PGD attacks achieve a higher misclassification rate than the FGM attacks for a wide range of PNR values from −30 to −10 dB. For the proposed ATARD, the misclassification rate against PGD attacks is 8.5% higher than that against FGM attacks when PNR = −10 dB.
Furthermore, considering the RDL 2021.12 data set, the classification accuracy against PGD attacks for a wide range of perturbation to generalized noise ratio (PGNR) values for both the Rayleigh and Rician fading channels is presented in Figs. 8 and 9, respectively. As in [53], we considered the identical power for the Gaussian noise and alpha-stable noise. Our definition of PGNR is the ratio of the perturbation power and the alpha-stable noise power. Therefore, the ratio of the perturbation to Gaussian noise power is the same as PGNR, but the ratio of perturbation power to total Gaussian and alpha-stable noise power is PGNR/2 (i.e., 3 dB less than PGNR in dB). Note that for a given SNR and PGNR value, the perturbation to signal ratio (PSR) can be obtained as PSR = PGNR/SNR. As shown in Fig. 8, for Rayleigh fading channels, our proposed ATARD scheme which transfers the attention map outperforms the AT technique and standard training. The improvement of accuracy becomes more significant when PGNR is large. Specifically, when PGNR = −20 dB, the proposed ATARD achieves 19.0% higher accuracy than AT and 39.0% higher accuracy than NT. A similar trend is seen for the Rician channel as shown in Fig. 9. Specifically, the proposed ATARD achieves 9.6% higher accuracy than AT and 28.4% higher accuracy than NT when PGNR = −20 dB.
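The ratio relations above become simple additions and subtractions once expressed in dB. The sketch below assumes all quantities are given in dB and that the Gaussian and alpha-stable noise powers are equal, as stated in the text; the function names are illustrative.

```python
import math

def psr_db(pgnr_db, snr_db):
    """PSR = PGNR / SNR in linear scale, i.e., a difference in dB."""
    return pgnr_db - snr_db

def perturbation_to_total_noise_db(pgnr_db):
    """With equal Gaussian and alpha-stable noise powers, the total noise power
    doubles, so the ratio drops by 10*log10(2), roughly 3 dB."""
    return pgnr_db - 10.0 * math.log10(2.0)

print(psr_db(-20.0, 10.0))                     # -30.0 dB
print(perturbation_to_total_noise_db(-20.0))   # about -23.01 dB
```

For example, at SNR = 10 dB a PGNR of −20 dB corresponds to a perturbation that is 30 dB weaker than the signal itself.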

C. Transferability Between Different Architectures
In this section, we investigate the transferability of the proposed ATARD method. Transferability is a common [...] To investigate the transferability of the proposed method, we use 1000 RML data samples from the testing data set to generate FGM and PGD attacks on two substitute networks: Model-1 and Model-2. Then, the produced adversarial attacks [...] The PGD attacks generated based on both Model-1 and Model-2 could achieve around 36% misclassification rate (i.e., around 64% classification accuracy) when PNR = −10 dB in the absence of a defense mechanism. From Fig. 11, it can be seen that with a defense system such as the proposed ATARD, the transformer model achieves better classification accuracy against PGD attacks transferred from Model-1 for a wide range of PNR values. Specifically, the ATARD achieves 80.0% and 82.9% accuracy when PNR = −10 dB and PNR = −20 dB, respectively, which is roughly 17.1% and 3.8% higher than the NT-trained transformer (i.e., in the absence of the defense mechanism). Similarly, in Fig. 10, the transformer with the ATARD defense system performs better than the NT-trained transformer from −30 to −10 dB. For substitute Model-2, it can be seen from Figs. 12 and 13 that with a defense system such as the proposed ATARD, the transformer model achieves better classification accuracy against FGM and PGD attacks transferred from Model-2 for a wide range of PNR values. Specifically, using PGD attacks, the ATARD achieves 14.0% and 3% higher accuracy than the NT-trained transformer when PNR = −10 dB and PNR = −20 dB, respectively. Meanwhile, using FGM attacks, the ATARD achieves 9.5% and 2.3% higher accuracy than the NT-trained transformer when PNR = −10 dB and PNR = −20 dB, respectively.
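The transfer evaluation described above (craft the attack on a substitute model, then score it on the target model) can be illustrated with a toy sketch. This is not the experimental setup of the paper: the two models here are small linear softmax classifiers standing in for the substitute network and the target transformer, the attack is an L2-normalized FGM step, and `pert_power` is a hypothetical perturbation energy that replaces the PNR setting (mapping a PNR value to an energy would additionally require the noise power).

```python
import math


def scores(W, x):
    """Class scores of a linear classifier with weight rows W."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def predict(W, x):
    s = scores(W, x)
    return max(range(len(s)), key=s.__getitem__)


def input_gradient(W, x, label):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = softmax(scores(W, x))
    p[label] -= 1.0  # softmax output minus one-hot label
    return [sum(p[c] * W[c][j] for c in range(len(W))) for j in range(len(x))]


def fgm(W, x, label, pert_power):
    """One FGM step: move along the normalized gradient with energy pert_power."""
    g = input_gradient(W, x, label)
    norm = math.sqrt(sum(v * v for v in g))
    if norm == 0.0:
        return list(x)
    scale = math.sqrt(pert_power) / norm
    return [xi + scale * gi for xi, gi in zip(x, g)]


def transfer_accuracy(substitute_W, target_W, samples, labels, pert_power):
    """Accuracy of the target model on attacks crafted on the substitute model."""
    correct = 0
    for x, y in zip(samples, labels):
        x_adv = fgm(substitute_W, x, y, pert_power)
        correct += int(predict(target_W, x_adv) == y)
    return correct / len(samples)


# Hypothetical two-class toy data and models.
substitute_W = [[1.0, 0.0], [-1.0, 0.0]]
target_W = [[0.9, 0.1], [-0.9, -0.1]]
samples = [[0.5, 0.0], [-0.5, 0.0]]
labels = [0, 1]

for power in (0.0, 0.01, 1.0):
    acc = transfer_accuracy(substitute_W, target_W, samples, labels, power)
    print(power, acc)
```

Because the attacker never queries the target's gradients, any drop in accuracy measured this way is attributable to transferability alone.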

V. CONCLUSION
We have proposed, for the first time in the literature, a transformer-based defense mechanism, namely, the ATARD method, for modulation classification in low-power IoT devices. The main novelty is that the adversarial attention map is transferred from a large transformer network to a small transformer model. Through experimental results, we have shown that the proposed ATARD scheme achieves better robustness against adversarial examples than all the considered competing techniques, including NT, AT, ARD, IAD, AKD, and RSLAD. We have also provided a possible explanation for the superior performance of ATARD by investigating the smoothness of the loss function, i.e., the ATARD-based transformer model is less sensitive to perturbations. As a result, the attacker has to apply more transmission power to fool the ATARD-based transformer model used by the defender. The transferability of adversarial examples among different architectures was also investigated to show the superior performance of the proposed ATARD-based transformer model against transferred adversarial examples.

APPENDIX
CALCULATION OF THE NUMBER OF PARAMETERS FOR THE TRANSFORMER NETWORK
We compute the number of parameters used in our teacher and student transformer networks. The transformer can be divided into five parts: a convolutional layer, a CLS token, a stack of transformer encoder layers, an LN layer, and a dense layer. The number of parameters for each part is calculated as follows.
For a convolutional layer with $l$ input channels, $k$ output channels, and filter size $n \times m$, the number of parameters is denoted by $N_{\mathrm{CL}}$. The CLS token, as mentioned before, has the same dimension as each embedded patch. Therefore, its number of parameters $N_{\mathrm{cls}}$ is equal to the number of output channels $k$, i.e., $N_{\mathrm{cls}} = k$.

For a stack of $N$ transformer encoder layers, the calculation for each layer can be divided into three parts: an MSA, two LN layers, and an FFN. For the SA, given an input sequence $Z$, three projections are used for the queries $Q$, keys $K$, and values $V$:

$$Q = Z U_Q, \quad K = Z U_K, \quad V = Z U_V$$

where $U_Q$, $U_K$, and $U_V$ are three projection matrices. For the MSA, a linear projection matrix $U_{\mathrm{MSA}}$ is used such that the outputs of the SA operations can be concatenated and projected as shown in (13). Hence, the number of parameters of the MSA, $N_{\mathrm{MSA}}$, follows from the sizes of these four projections. The number of parameters of an LN layer is denoted by $N_{\mathrm{LN}}$. The FFN is a fully connected network with one hidden layer; with a hidden layer of size $s$, its number of parameters is denoted by $N_{\mathrm{FFN}}$. For a simple dense layer with $c$ output classes, the number of parameters is denoted by $N_{\mathrm{DL}}$.

Finally, for the transformer network, the total number of parameters is

$$N_{\mathrm{total}} = N_{\mathrm{CL}} + N_{\mathrm{cls}} + N_{\mathrm{LN}} + N_{\mathrm{DL}} + N \left( N_{\mathrm{MSA}} + 2 N_{\mathrm{LN}} + N_{\mathrm{FFN}} \right).$$
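The total above can be evaluated with a short script. The following is a minimal sketch in which the per-part counts are assumptions based on common layer conventions (a bias term in every convolutional, projection, and dense layer; all four MSA projections of size $k \times k$; a learnable scale and shift in each LN layer), not expressions confirmed by this appendix. Only the final summation follows the formula for $N_{\mathrm{total}}$ given here, and the example dimensions are hypothetical.

```python
def conv_params(l, k, n, m):
    # Assumption: n*m*l weights per output channel plus one bias each.
    return (n * m * l + 1) * k


def msa_params(k):
    # Assumption: U_Q, U_K, U_V, and U_MSA are each k x k with a length-k bias.
    return 4 * (k * k + k)


def ln_params(k):
    # Assumption: one learnable scale and one shift per feature.
    return 2 * k


def ffn_params(k, s):
    # Assumption: k -> s -> k fully connected layers, both with biases.
    return (k + 1) * s + (s + 1) * k


def dense_params(k, c):
    # Assumption: k inputs to c classes, with biases.
    return (k + 1) * c


def total_params(l, k, n, m, num_layers, s, c):
    """Sum the parts following the N_total formula of the appendix."""
    n_cls = k  # the CLS token has the same dimension as an embedded patch
    per_layer = msa_params(k) + 2 * ln_params(k) + ffn_params(k, s)
    return (conv_params(l, k, n, m) + n_cls + ln_params(k)
            + dense_params(k, c) + num_layers * per_layer)


# Hypothetical dimensions, not the teacher or student configuration.
print(total_params(l=1, k=64, n=2, m=8, num_layers=4, s=128, c=11))
```

Under these assumptions the encoder stack dominates the count, so reducing the number of layers $N$ or the embedding size $k$ is the most direct way to shrink the student model.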