JDAT: Joint-Dimension-Aware Transformer with Strong Flexibility for EEG Emotion Recognition



I. INTRODUCTION
As one of the most significant human senses, emotion largely affects people's quality of life. At present, emotion recognition is mainly based on the analysis of physical signals and physiological signals. Physical signals, such as facial expression [1] and speech [2], often generate inaccurate results due to the obscure boundaries between people's external actions. In contrast, physiological signals, including Electroencephalography (EEG) [3], Electromyography (EMG) [4], and Electrocardiography (ECG) [5], reflect human emotion more objectively. EEG is a non-invasive method of Human-Computer Interaction (HCI) [6], which has become efficient and indispensable in human emotion recognition in light of its valuable multidimensional information. In the spatial dimension, earlier works stated that emotional changes mostly affect the EEG signals on the frontal and temporal lobes [7], [8], and experiments demonstrated that channel selection can improve recognition efficiency [9]. In the spectral dimension, positive emotion generates high-frequency EEG signals, while negative emotion is exactly the reverse. Therefore, Differential Entropy (DE) [10] and Power Spectral Density (PSD) [11] features are commonly used to better reflect the spectral characteristics. In the temporal dimension, the signal activation strength of positive emotion is generally higher than that of negative emotion. Furthermore, from the perspective of neuroscience [12], brain activities generate complicated cross-frequency coupling and phase-amplitude coupling across different brain lobes [13]. The above phenomena demonstrate the advantages of utilizing joint dimensional features of EEG to recognize human emotions.
Generally, existing machine learning methods in EEG emotion recognition fall into two main streams: traditional methods and deep learning methods. Among traditional methods, K-Nearest Neighbor (KNN) [14], Bayesian Network (BN) [15], and Support Vector Machine (SVM) [16], [17] appear to be the most popular algorithms. However, these algorithms cannot detect or extract complex EEG features, and they typically obtain only moderate classification results. In recent years, deep learning methods have demonstrated their efficiency and flexibility through excellent performance, especially Convolutional Neural Network (CNN) [18], Long Short-Term Memory (LSTM) [19], and Graph Neural Network (GNN) [20]. After intensive exploration, the attention mechanism has been successfully adapted to EEG emotion recognition. Attention-based Convolutional Recurrent Neural Network (ACRNN) [21], 4D attention-based Neural Network (4D-aNN) [22], and Spatial-Spectral-Temporal based attention 3D dense Network (SST-EmotionNet) [23] are the most successful designs. However, they suffer not only from complex network structures with tremendous parameters, but also from separate attention modules which cannot utilize the joint dimensional information of EEG. In addition, they tend to achieve good results on specific datasets but do not adapt well to other datasets, due to different experimental settings such as channel selection and sampling frequency.
In this paper, we propose the Joint-Dimension-Aware Transformer (JDAT) with strong flexibility for EEG emotion recognition. It adopts the adaptive Multi-head Self-Attention (MSA) mechanism to focus on different dimensional EEG features mutually and jointly, without any hybrid structures or other attention mechanisms. Its structure mainly consists of the Spatial-Spectral-Projection (SSP) Block and the Multidimensional Transformer Encoder (MTE). The SSP Block serves to facilitate the training of the Transformer and automatically assigns the weights of different channels and frequency bands. The MTE simultaneously processes intersecting EEG information of space, frequency, time, and intensity, where the global features are well noticed. In MTE layers, the Spatial-Spectral MSA attends to mixed features of space and frequency, which is capable of discovering the complicated phase couplings and resonance across different brain lobes; the Temporal MSA handles the sequential space-frequency-fused information, which makes the model globally sensitive to signal activation and waveform changes. Meanwhile, we revise the conventional MSA into a squeezed structure, applying the attention between a specific feature point and merged points to reduce the risk of overfitting. The performance of JDAT is evaluated on three datasets: DEAP [24], DREAMER [25], and SEED [26], [27]. Compared with conventional models, our model achieves State-Of-The-Art (SOTA) results on all datasets. The rest of the paper is organized as follows: Section II introduces related work, including attention-based models for EEG emotion recognition and Transformer models for EEG classification. Section III describes the proposed methods in detail, including the pre-processing methods and our proposed JDAT model. Section IV provides our experiments along with the results. Finally, the conclusion and discussion are given in Section V.

II. RELATED WORK

A. Attention-Based EEG Emotion Recognition
Early neural network models in EEG emotion recognition rarely adopted the attention mechanism. For example, Hwang et al. [18] designed a simple CNN similar to LeNet [28] for automatic learning from DE features of EEG signals. Yang et al. [29] integrated LSTM layers into their parallel convolutional recurrent neural network for emotion recognition and achieved satisfactory results. Recently, the attention mechanism has shown its strength in EEG emotion recognition. For instance, Tao et al. [21] proposed ACRNN, a fused model based on channel-wise attention and self-attention, to assign the weights of different channels and seek intrinsic similarity in EEG signals. Xiao et al. [22] proposed 4D-aNN, which adopted spectral and spatial attention mechanisms to adaptively focus on the corresponding features. However, as illustrated in Fig. 1(a), the independent attention modules in ACRNN and 4D-aNN only pay attention to specific dimensional features, rather than treating them in a correlated way. Prominently, SST-EmotionNet proposed by Jia et al. [23] is able to concentrate on spatial-spectral and spatial-temporal features. However, Fig. 1(a) shows that it is still unable to process the global features in a unified attention module. In addition, SST-EmotionNet's attention streams are constructed in parallel branches, which makes the model complex and inflexible. Therefore, unified attention modules processing multidimensional features should be developed to avoid the above-mentioned issues.

B. Transformer for EEG Classification
On the basis of the MSA mechanism, Transformer has shown its strength in both Natural Language Processing (NLP) and Computer Vision (CV). Since the birth of BERT [30], Transformer has become the gold-standard method in most NLP tasks. Afterwards, ViT [31] demonstrated the ability of Transformer in computer vision, in tasks of classification [32], segmentation [33], and object detection [34]. Nowadays, some researchers have made an effort to integrate Transformer into their models for EEG classification tasks, since Transformer can focus on sequential EEG data globally. Pedoeem et al. [35] proposed the Transformer Based Seizure detector (TABS), a hybrid model comprising convolutional layers, fully connected layers, and Transformer, which achieved a good result in seizure detection. Sun et al. [36] constructed several Transformer-based models for motor imagery EEG classification (MI-Transformer), whose results showed that Transformer combined with CNN might serve as a powerful model. However, as illustrated in Fig. 1(b), these works are not aware of the value of joint dimensional EEG information. TABS only takes Transformer as a tool for processing time series. The MI-Transformer models are constructed with attention only in the spatial or temporal dimension, instead of fused dimensions. Recently, Liu et al. [37] built the EEG emotion Transformer (EeT), applying joint attention on spatial and temporal dimensions. However, they ignored the valuable spectral information of EEG signals for emotion recognition. Some other works fused their hybrid structures with MSA layers, the core part of Transformer, in order to strengthen their networks with the help of self-attention [21], [38]. In this work, we take Transformer as the main frame of our proposed model, utilizing multidimensional EEG features jointly and comprehensively for emotion recognition.

III. PROPOSED METHODS

A. Overview
Our proposed methods consist of two main parts: pre-processing and the proposed model. The pre-processing aims to facilitate the training and operation of our model. Our proposed JDAT takes Transformer as the main frame, built with unified squeezed MSA instead of fused attention mechanisms. As shown in Fig. 1(c), it focuses on multiple dimensional EEG features and processes them jointly.

B. Pre-Processing
EEG signals should be pre-processed before being sent into the model. The pre-processing steps include baseline removal, Continuous Wavelet Transform (CWT), and data augmentation.
1) Baseline Removal: Recorded EEG signals usually contain pre-trial (baseline) and trial signals. Yang et al. [29] proposed that baseline removal can improve the accuracy of EEG emotion recognition. Tao et al. [21] also used this pre-processing method and achieved better results. As illustrated in the first step of Fig. 2, baseline removal proceeds as follows. Firstly, the pre-trial signals X_p and trial signals X_t are sliced into N and M segments of the same length, respectively:

X_p = {x_1^p, ..., x_N^p}, X_t = {x_1^t, ..., x_M^t}, x_i^p, x_j^t ∈ R^(C×L),

where C and L denote the number of channels and the length of the segments, respectively. Secondly, the mean value X̄_p of the N pre-trial signal segments, which denotes the mean baseline, is calculated:

X̄_p = (1/N) Σ_{i=1}^{N} x_i^p.

Finally, X̄_p is subtracted from each trial segment:

x̃_j^t = x_j^t − X̄_p, j = 1, 2, ..., M.

In this paper, L is set to the number of sampling points in 1 second, while C varies with the EEG channels provided by the datasets.
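As a concrete illustration, the baseline-removal steps above can be sketched in NumPy (a hypothetical helper written for this paper's notation, not the authors' code; the function and argument names are our own):

```python
import numpy as np

def remove_baseline(pre_trial, trial, seg_len):
    """Slice the pre-trial and trial signals (each of shape (C, T)) into
    segments of length seg_len, average the N pre-trial segments into the
    mean baseline, and subtract it from every trial segment."""
    c = trial.shape[0]
    xp = pre_trial[:, : pre_trial.shape[1] // seg_len * seg_len]
    xt = trial[:, : trial.shape[1] // seg_len * seg_len]
    xp = xp.reshape(c, -1, seg_len)            # (C, N, L) pre-trial segments
    xt = xt.reshape(c, -1, seg_len)            # (C, M, L) trial segments
    baseline = xp.mean(axis=1, keepdims=True)  # mean baseline over N segments
    return xt - baseline                       # (C, M, L) cleaned segments
```

For a 128 Hz recording, seg_len would be 128 so that each segment covers 1 second, matching the choice of L above.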
2) Continuous Wavelet Transform: CWT is able to capture signal frequency along the time axis; it can fully decompose a continuous time function and generate the "Time-Frequency window" [40]. Different from the Fourier Transform [41], it constructs a Time-Frequency relationship with subtle resolution in both domains. The CWT of an integrable function f(t) can be expressed as:

W_f(a, b) = (1/√a) ∫ f(t) ψ*((t − b)/a) dt,

where ψ(t) is the wavelet function, and a and b are the scale factor and translation value, respectively. In this work, CWT is conducted separately on each channel's signal, where the frequency of the wavelet ranges from 1 to F Hz. Therefore, each segment is converted into 3D data I_j ∈ R^(C×F×L), as illustrated in the second step of Fig. 2. These data embody richer and more obvious spatial, spectral, and temporal information with stronger connections, and are taken as the input of our proposed model. In this work, F is set to 50. Different wavelets were compared during our experiments, and the Complex Gaussian Derivative Wavelet with the best performance was chosen.
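A minimal NumPy sketch of per-channel CWT follows. It is illustrative only: a simple Morlet-style wavelet stands in for the Complex Gaussian Derivative Wavelet used in the paper, and the scale-to-frequency mapping is the crude s = 1/f approximation:

```python
import numpy as np

def cwt_morlet(x, fs, freqs):
    """Toy CWT of a 1-D signal x sampled at fs Hz, evaluated at the given
    analysis frequencies. Returns |coefficients| of shape (len(freqs), len(x)),
    i.e. one row of the C x F x L tensor described above."""
    t = np.arange(len(x)) / fs
    out = np.empty((len(freqs), len(x)))
    for i, f in enumerate(freqs):
        s = 1.0 / f                                  # scale ~ 1 / frequency
        tau = (t - t.mean()) / s
        # complex oscillation under a Gaussian envelope, L2-ish normalized
        wavelet = np.exp(1j * 5.0 * tau) * np.exp(-tau**2 / 2) / np.sqrt(s)
        out[i] = np.abs(np.convolve(x, np.conj(wavelet)[::-1], mode="same"))
    return out
```

Applying it to every channel of a 1-second segment with freqs = 1..50 Hz yields the I_j ∈ R^(C×F×L) input described above.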
Previous works showed that slicing EEG signals into samples between 1 second and 10 seconds helped achieve good results in emotion recognition [9], [23]. In this work, the 1-second segments I_j (j = 1, 2, ..., M) after baseline removal and CWT are directly taken as the dataset samples. Thus, our dataset is made up of 1-second non-overlapping samples, which contains more sliced data and avoids repeated training and testing.
3) Data Augmentation: Considering the limited amount of public EEG data and the difficulty of training Transformer, we apply data augmentation on the training set. Translation, grayscale, or generative models are not adopted because they would destroy distinct EEG features and affect the objective experimental results. Therefore, Gaussian noise with zero mean and a standard deviation of 0.01 is added to enrich the training set and avoid overfitting. This data augmentation technique is similar to the method in [9].
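The augmentation step above amounts to one line of NumPy (a hypothetical helper; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def augment(samples, sigma=0.01):
    """Enrich the training set by adding zero-mean Gaussian noise with
    standard deviation sigma (0.01, as described above)."""
    noise = rng.normal(0.0, sigma, size=samples.shape)
    return samples + noise
```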

C. Proposed Model
As shown in Fig. 3, our proposed JDAT mainly consists of the SSP Block and the MTE. The SSP Block compresses and projects the input data to facilitate the operation of the MTE. The MTE serves as the main frame, composed of several stages built from unified attention modules based on squeezed MSA. Every stage except the last consists of one MTE layer and a process of temporal compression. These MTE layers have different scales to adapt the model to gradually compressed feature maps. The MTE is followed by fully connected layers to classify different emotional states. The specific structures of the modules in JDAT are described in detail as follows.

1) SSP Block: In detail, a Spatial-Spectral convolutional layer with D kernels of size (F, 1) is applied on I, which is illustrated in Fig. 4. These kernels extract global information of space and frequency along the time axis, projecting the spatial and spectral dimensions into a mixed dimension. Thereafter, a Rectified Linear Unit (ReLU) activation and a Fully Connected (FC) layer ∈ R^(D×D) follow to increase the non-linearity:
Z = FC(φ(SSconv(I))),

where SSconv denotes the Spatial-Spectral convolutional layer, and φ(·) represents the ReLU activation function. The parameter D is set to 128, optimized by the experiments. The reason for fusing the spatial and spectral information is that, in the human brain, there exist complicated couplings and resonance between signals of different frequency bands and different lobes [42]. Joint attention on spatial and spectral features helps discover these complex relationships. In addition, the convolutional kernels automatically assign the weights of different channels and frequency bands. As shown in Fig. 5, we visualize four convolutional kernels from the trained model.

2) Positional Embedding: The Positional Embedding (PE), a set of 2D learnable parameters, is added as in most Transformer-based models:
Z_0 = Z + PE,

where Z_0 ∈ R^(L×D). Therefore, the matrix Z_0, containing all compressed tokens with positional information, is obtained and serves as the input of the MTE, whose dimensions correspond to time and the fused channel-frequency.
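The SSP Block and positional embedding can be sketched in PyTorch as follows. This is a hypothetical re-implementation under the paper's description (D kernels of size (F, 1), ReLU, an FC layer, and a learnable L×D positional embedding); the class and argument names are our own:

```python
import torch
import torch.nn as nn

class SSPBlock(nn.Module):
    """Sketch of the Spatial-Spectral-Projection Block plus PE: the (F, 1)
    kernels fuse the C channels and F frequency bands into one mixed
    dimension of size D (128 in the paper), producing one token per
    time step."""
    def __init__(self, channels, freqs, seq_len, dim=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, dim, kernel_size=(freqs, 1))
        self.fc = nn.Linear(dim, dim)
        self.pos = nn.Parameter(torch.zeros(seq_len, dim))  # PE in R^(L x D)

    def forward(self, x):                  # x: (B, C, F, L)
        z = torch.relu(self.conv(x))       # (B, D, 1, L)
        z = z.squeeze(2).transpose(1, 2)   # (B, L, D): one token per time step
        return self.fc(z) + self.pos       # Z_0, the input of the MTE
```

With 32 channels, 50 frequencies, and 128 time points per sample, the output is a (B, 128, 128) token matrix.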
3) MTE Layer: Typically, a Transformer layer contains one MSA layer and one feedforward Multi-Layer Perceptron (MLP) [43], where the MSA is usually applied on the sequential dimension. The MSA mechanism is sensitive to the internal features of long-term series data due to its strong global attention. The multiple attention heads allow it to learn relevant information in different representation subspaces and reduce the computational cost. The scaled dot-product attention is computed from queries, keys, and values (Q, K, V), all of which are linearly projected from the sequential data. Every MLP contains two Fully Connected (FC) layers with a Gaussian Error Linear Unit (GELU) activation between them, where the middle dimension is usually expanded to better express rich spatial characteristics.
To better adapt the model to our task, we build JDAT to focus on multiplex EEG information jointly, utilizing unified attention modules. As shown in Fig. 6, three sublayers are built in our adjusted MTE layer, namely the Spatial-Spectral MSA layer, the Temporal MSA layer, and the MLP. A residual connection and Layer Normalization (LN) follow every sublayer, which alleviates problems such as gradient dissipation and explosion. The workflow of the ℓ-th MTE layer can be formulated as follows:

Z'_ℓ = LN(SS-MSA(Z_{ℓ−1}) + Z_{ℓ−1}),
Z''_ℓ = LN(T-MSA(Z'_ℓ) + Z'_ℓ),
Z_ℓ = LN(MLP(Z''_ℓ) + Z''_ℓ),

where SS-MSA and T-MSA denote the Spatial-Spectral MSA layer and the Temporal MSA layer, respectively. Besides, the number of attention heads in every Spatial-Spectral MSA layer is optimized to 8, while that in the Temporal MSA layers is 8, 4, 2, and 1 from Stage 1 to Stage 4, respectively, to fit the gradually compressed structure. The expansion ratio of the middle layer in the feedforward MLP is set to 4. As illustrated in Fig. 6, the internal structure of our MSA layer is revised into a squeezed structure to improve the generalization ability of the model. The conventional MSA layer applies global attention between all specific tokens, while our squeezed MSA is applied between a specific feature point and a bunch of adjacent points, as illustrated in Fig. 7. Specifically, we take the squeezed Temporal MSA as an example to illustrate the mechanism. The input sequence Z ∈ R^(L_Z×D) is squeezed by a 1D Average Pooling (1D-AP) layer with a reduction ratio R, generating Z̄ with a shorter temporal length. This is equivalent to merging every R adjacent tokens into a new token of the same dimension:
Z̄ = 1D-AP(Z),

where Z̄ ∈ R^((L_Z/R)×D). Then, Q is computed from the input Z, while K and V are computed from the squeezed data Z̄:

Q = Z W_Q, K = Z̄ W_K, V = Z̄ W_V.
The linear projections W_Q, W_K, and W_V ∈ R^(D×d) transform the input sequence into the corresponding heads of the same dimension d; therefore, the numbers of tokens in K and V are also squeezed by the same ratio. The squeeze ratio R is optimized to 2 in our model, so the squeeze process is equivalent to merging pairs of adjacent feature points. The dimension of the attention heads d equals D divided by the number of heads. The computation of squeezed MSA from Q, K, and V can be formulated as follows:

head = softmax(Q K^T / √d) V.

The multiple attention heads are concatenated together and projected back into the original dimension D. Therefore, the squeezed MSA computed from Q, K, and V holds the same shape as the conventional MSA. On one hand, our novel squeezed MSA layer reduces the computation time incurred by the long-term sequence. On the other hand, self-attention between a specific feature point and merged points reduces the risk of overfitting.
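The squeezed Temporal MSA described above can be sketched as follows. This is a hypothetical PyTorch implementation of the mechanism (queries from the full sequence, keys/values from the pooled sequence); it is not the authors' code, and the names are our own:

```python
import torch
import torch.nn as nn

class SqueezedMSA(nn.Module):
    """Sketch of squeezed multi-head self-attention with reduction ratio R:
    Q comes from all L_Z tokens, while K and V come from L_Z / R tokens
    obtained by averaging R adjacent tokens."""
    def __init__(self, dim, heads, ratio=2):
        super().__init__()
        self.heads = heads
        self.pool = nn.AvgPool1d(kernel_size=ratio, stride=ratio)  # 1D-AP
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, z):                   # z: (B, L, D)
        b, l, d = z.shape
        q = self.q(z)                       # queries from the full sequence
        zs = self.pool(z.transpose(1, 2)).transpose(1, 2)  # (B, L/R, D)
        k, v = self.kv(zs).chunk(2, dim=-1)  # keys/values from merged tokens

        def split(t):  # reshape (B, n, D) -> (B, heads, n, d)
            return t.view(b, -1, self.heads, d // self.heads).transpose(1, 2)

        q, k, v = map(split, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * (d // self.heads) ** -0.5
        out = attn.softmax(dim=-1) @ v                    # (B, heads, L, d)
        out = out.transpose(1, 2).reshape(b, l, d)        # concat heads
        return self.proj(out)               # same shape as conventional MSA
```

With R = 2, the attention matrix is (L × L/2) instead of (L × L), halving the token-pairing cost while keeping the output shape unchanged.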
In our MTE layer, the Spatial-Spectral MSA layer adaptively seeks the essential information across different electrode channels and frequency bands, focusing on complex features in integrated dimensions, such as cross-lobe coupling and frequency-intensity resonance. Next, the Temporal MSA layer handles the long-time series of EEG data, which makes the model sensitive to the degree of signal activation and waveform changes. Meanwhile, every Temporal MSA layer utilizes the fused features discovered by the previous Spatial-Spectral MSA layer, whose output is processed globally by the feedforward MLP and sent into the next stage. This enables the model to repeatedly use intersecting features to reveal complicated brain activities. Our experiments show that the best performance is achieved by combining Spatial-Spectral MSA and Temporal MSA with the squeezed structures to avoid overfitting. Consequently, the MTE is able to focus its deep attention on multidimensional EEG information comprehensively.
4) Temporal Compression: Four stages with MTE layers are built in JDAT, where each stage except the last one is followed by a Max Pooling (MAP) layer with a kernel size of 3 and a stride of 2, which acts as the temporal compression.
In stages 1, 2, and 3, the temporal compression gradually compresses the sequential dimension of the output feature maps. We calculate the computational operations of stage n in JDAT and in JDAT without the temporal compression, which are listed in Table I. The shape of the input of the MTE is (L, D), and we have also considered the specific numbers of attention heads in the MSA layers. In stage 1, the two models hold the same module structures, while JDAT efficiently reduces the computational operations in higher stages. Compared with the model without temporal compression, it saves nearly 40% of the total operations at the same depth, which dramatically boosts the inference speed. The experiments also show that the performance is maintained with the temporal compression.

5) Classification: Different from popular Transformer models like BERT [30] and ViT [31], our JDAT discards the class token, a learnable embedding prepended to the sequential tokens. Instead, the last stage of the MTE is extended with a Global Average Pooling (GAP) layer, transforming the matrix Z_4 into a vector of dimension D. Finally, two dense FC layers are implemented at the head of the model to make the classification.
This modification is more suitable for our JDAT with a gradually compressed structure, and the experiments also show that using GAP performs better than prepending the class token.
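The temporal compression and GAP head above can be sketched in a few lines of PyTorch. Shapes are illustrative (L = 128, D = 128), and only one of the paper's two dense FC layers is shown:

```python
import torch
import torch.nn as nn

# Temporal compression: MaxPool over the time axis with kernel 3, stride 2.
pool = nn.MaxPool1d(kernel_size=3, stride=2)
z = torch.randn(2, 128, 128)                   # (B, L, D) tokens of one stage
z = pool(z.transpose(1, 2)).transpose(1, 2)    # time axis: 128 -> 63 tokens

# GAP head in place of a class token: average over time, then classify.
gap = z.mean(dim=1)                            # (B, D) pooled representation
logits = nn.Linear(128, 3)(gap)                # (B, classes); 3 for SEED
```

Each pooling step maps a length L to floor((L - 3) / 2) + 1, which is how the sequence shrinks from stage to stage.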
IV. EXPERIMENTS

A. Datasets

In DEAP and DREAMER, we regard valence and arousal as our classification tasks and set thresholds to divide the labels into two emotional classes. The videos with rating values no larger than the threshold are labeled as negative, while those larger than the threshold are labeled as positive. In SEED, there is no baseline removal or label division. Only one label is provided for the videos, with three emotional classes (positive, neutral, and negative), and we treat it as the valence task. The EEG signals in SEED are down-sampled to 128 Hz to fit our model. Other details of the public datasets, such as participant ages, sex distribution, and signal filters, are clear in their public documentation and are not described here.
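The label-division rule above is a simple threshold (illustrated here with 5.0, a midpoint commonly used for DEAP's 1-9 scale; the actual threshold depends on the dataset's rating range):

```python
import numpy as np

def binarize_labels(ratings, threshold=5.0):
    """Rating <= threshold -> negative (0); rating > threshold -> positive (1),
    as in the valence/arousal tasks on DEAP and DREAMER."""
    ratings = np.asarray(ratings, dtype=float)
    return (ratings > threshold).astype(int)
```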

B. Settings
We trained and tested JDAT on an NVIDIA GTX 1080 Ti GPU, implemented with the PyTorch framework [44]. During training, the Adam optimizer [45] was used for backpropagation, with the learning rate set to 0.001. For the experiments on each subject's samples, we used 10-fold cross-validation to evaluate our model. To prevent overfitting, dropout was applied to the SSP Block, the MTE, and the classification layers, with dropout rates of 0.1, 0.1, and 0.5, respectively.
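These settings translate directly into PyTorch; the snippet below is a hypothetical configuration sketch (the tiny Sequential model is only a stand-in for JDAT):

```python
import torch
import torch.nn as nn

# Stand-in model illustrating the stated dropout rates:
# 0.1 in the SSP Block / MTE, 0.5 in the classification head.
model = nn.Sequential(
    nn.Linear(128, 128), nn.Dropout(0.1),
    nn.Linear(128, 64), nn.Dropout(0.5),
    nn.Linear(64, 2),
)

# Adam with learning rate 0.001, as used for backpropagation above.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```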

C. Results
We compare the performance of our JDAT with the SOTA baseline models listed below on the DEAP, DREAMER, and SEED datasets. These works all conduct subject-dependent experiments. They use the same methods to divide labels in the DEAP and DREAMER datasets, and most of them use the same cross-validation method.
• gcForest: a deep forest model named multi-Grained Cascade Forest [47].
• MLF-CapsNet: a multi-level-features guided capsule network [48].
• EeT: an EEG emotion Transformer which applies attention on temporal and spatial dimensions [37].

The 4D-CRNN and 4D-aNN both combine CNN and LSTM layers with attention modules. They adopt 4D data as the input, transformed from the EEG signals or extracted features. SST-EmotionNet focuses on spatial-spectral and spatial-temporal features using parallel blocks, whose attention is applied separately in different combined dimensions. ACRNN adopts the preprocessed EEG signals as the input instead of extracted features, containing channel-wise attention, CNN layers, LSTM layers, and self-attention. gcForest is based on the deep forest algorithm, whose critical scanning module mines the spatial and temporal information of EEG signals to classify emotions. MLF-CapsNet is an end-to-end capsule network, combining multi-level features extracted from different convolutional layers to form primary capsules. EeT is a Transformer-based model which focuses on spatial-temporal EEG information jointly but ignores the spectral information. Most of the above baseline models are evaluated only on one or two datasets, because their structures are usually not applicable to other datasets.
The accuracies and standard deviations of these baseline models and our proposed JDAT on the three datasets are presented in Table III. Overall, deep learning methods combined with the attention mechanism exhibit better performance and flexibility. The traditional method, gcForest, achieves a good result on the DEAP dataset but performs worse on the DREAMER dataset. The standard deviation varies across models, but on the whole it decreases as the accuracy increases.

Our JDAT achieves the highest accuracies on all datasets. On the SEED dataset, JDAT improves the accuracy to 97.30%, outperforming other attention-based models including EeT by more than 1%. On the DEAP dataset, it obtains accuracies above 98.5% in both valence and arousal tasks, performing much better than the networks with combined attention modules, namely ACRNN and 4D-CRNN.

On the DREAMER dataset, it also performs slightly better than ACRNN. According to our analysis, the SEED dataset stores the largest amount of data for each subject, which facilitates the training of our Transformer-based model. Meanwhile, the SEED and DEAP datasets contain EEG signals of 62 and 32 channels respectively, while only 14 channels are included in the DREAMER dataset. Therefore, the performance improvement on SEED and DEAP is relatively significant, which indicates that JDAT is more suitable for learning EEG data with richer information. In short, JDAT demonstrates strong and flexible learning ability, since it operates well on various EEG datasets with different experimental settings and different information.

D. Ablation Studies
To illustrate the distinct advantages of our JDAT, we implement the following variants for comparison:

• JDAT w/o TC, which removes the temporal compression between stages.
• JDAT w/o LS, which removes the Length Squeeze in every MSA layer and adopts conventional MSA layers.
• JDAT w/o SSP, which removes the SSP Block.
• JDAT w/o T-MSA, which removes the Temporal MSA layer in every MTE layer.
• JDAT w/o SS-MSA, which removes the Spatial-Spectral MSA layer in every MTE layer.

Among these variants, JDAT without the SSP Block also removes the Continuous Wavelet Transform (CWT) in the pre-processing. Instead, the EEG signals after baseline removal are directly taken as the training data, and an additional FC layer is added to adapt the data dimensions to the MTE. The other variants maintain the SSP Block and the same pre-processing method. All variants hold the same classification layers as the complete JDAT.
Firstly, we conduct experiments on the DEAP, DREAMER, and SEED datasets to compare the performance of these variants, where the accuracies and F1-scores are tested. The F1-score embodies the quality of the model, as it reflects the generalization ability when the data samples are unevenly distributed. The experimental results of JDAT and its variants are shown in Fig. 8. For DEAP and DREAMER, the results are shown as the mean values over the valence and arousal tasks.

The ablation study shows that our proposed JDAT achieves better performance than the variant without temporal compression on the SEED dataset, and comparable results on the other two datasets. The temporal compression greatly reduces the computational operations while enabling the model to maintain its overall performance. In addition, without the length squeeze in the MSA layers, the accuracies and F1-scores drop further by about 2.5%. This demonstrates that self-attention between all specific feature points is unnecessary, and that our squeezed MSA can decrease the risk of overfitting. Furthermore, the lack of the SSP Block has a significant negative impact on the model, especially on the SEED dataset, where it reduces the accuracy and F1-score by around 3% and 4%, respectively. Therefore, the SSP Block exhibits its strong ability to extract critical spatial and spectral information globally and to facilitate the operation of the MTE. It is worth noting that JDAT without Spatial-Spectral MSA layers or Temporal MSA layers reduces the accuracy by around 3% to 4.5%. The lack of MSA layers shows the largest negative influence on performance, which demonstrates the advantages of joint attention on spatial, spectral, and temporal EEG features. The Spatial-Spectral MSA appears to be more significant, since the attention on space and frequency plays an important role. In summary, all of these modules are essential in our JDAT, and the best performance is achieved by their combination.

Meanwhile, to demonstrate the flexibility and practicality of our designed model, we also measure the parameters, Giga FLOating Point operations (GFLOPs), and inference time during the ablation study. We choose the variant without temporal compression and the variant without length squeeze for comparison, as they hold the same module structures and model depth as the complete JDAT. The results are listed in Table IV, where the GFLOPs and inference time are measured per data sample on the SEED dataset. The temporal compression and length squeeze both reduce the parameters and GFLOPs and speed up the inference. The temporal compression in stages 1, 2, and 3 greatly reduces the total GFLOPs by around 40% and saves 0.12 ms in the inference of one data sample. It compresses the sequential data size step by step, instead of keeping the same shape during forward propagation. The above experiments have demonstrated that the performance is maintained with the temporal compression. Meanwhile, the length squeeze in the MSA layer further reduces the parameters and GFLOPs while still improving the overall performance. With the combination of squeezed MSA and temporal compression, the lightweight and fast JDAT is more suitable for mobile hardware devices.

V. CONCLUSION AND DISCUSSION
In this paper, we have proposed JDAT, a Transformer-based model for EEG emotion recognition. It adopts a unified architecture based on the adaptive self-attention mechanism, solving the problem that attention was previously applied only to scattered dimensional features. The main structure of JDAT consists of the SSP Block and the MTE. The SSP Block captures and fuses spatial and spectral information, which facilitates the operation of the MTE. The MTE is built with multidimensional squeezed MSA layers, which are sensitive to the complicated couplings and resonance in EEG signals. The squeezed MSA helps improve the generalization ability of the model, reducing the risk of overfitting. Temporal compression in each stage greatly reduces the parameters and computational operations, while the model maintains its overall performance. JDAT exhibits its distinct advantage of automatic and comprehensive joint attention on features in multiple dimensions, without selecting channels or extracting spectral features. The SOTA results on three datasets verify its strength and flexibility, and the ablation studies demonstrate the necessity of the essential modules of JDAT. In the future, we will apply JDAT to other EEG classification tasks, such as motor imagination and seizure detection.

Fig. 1. Illustration of the models utilizing attention mechanisms on different dimensions. Attn denotes Attention, and MSA denotes Multi-head Self-Attention. The models include (a): attention-based models for EEG emotion recognition; (b): Transformer models for EEG classification, in tasks of seizure detection, motor imagination, and emotion recognition; (c): our proposed JDAT, which focuses on spatial, spectral, and temporal features jointly for emotion recognition.

Fig. 2. Illustration of the pre-processing workflow of baseline removal and Continuous Wavelet Transform (CWT). The schematic diagram of recording EEG signals from a person's brain is taken from [39].

Fig. 3. Overall structure diagram of JDAT. The tuples above the model represent the data dimensions at the current stages. The result is illustrated for three emotional states, while the neutral class is eliminated when there are only two emotional states.

Fig. 4.

Fig. 5. Visualization of kernels in the spatial-spectral convolutional layer. The four kernels are taken as examples from our trained model.

Fig. 6. Illustration of the structure of one MTE layer, which consists of Spatial-Spectral MSA, Temporal MSA, and MLP. The internal structures of Spatial-Spectral MSA and Temporal MSA are illustrated below the MTE layer.

Fig. 7. Illustration of the comparison of the self-attention mechanism between conventional MSA and our squeezed MSA. Here n denotes the number of feature points. (a) In the conventional MSA, the correlations between all specific feature points are computed. (b) In our squeezed MSA, the correlations between feature points and pairs of merged points are computed.

Fig. 8. Ablation study on JDAT. (a): DEAP dataset; (b): DREAMER dataset; (c): SEED dataset. In the experiments, the accuracies and F1-scores are tested. For DEAP and DREAMER, the results are shown as the mean values over the valence and arousal tasks.

TABLE I COMPUTATIONAL OPERATIONS OF JDAT AND JDAT WITHOUT TEMPORAL COMPRESSION IN STAGE N.

TABLE II DETAILS OF DEAP, DREAMER, AND SEED DATASETS.

TABLE III PERFORMANCE COMPARISON BETWEEN BASELINE MODELS AND JDAT ON DEAP, DREAMER, AND SEED DATASETS.

TABLE IV PARAMETERS, GFLOPS, AND INFERENCE TIME OF JDAT AND ITS TWO VARIANTS ON SEED DATASET.