Deep Learning Framework for Modeling Cognitive Load From Small and Noisy EEG Data

Abstract—Modern systems (e.g., assistive technology and self-driving) can place significant demands on the user's working memory (WM), which can adversely impact performance (i.e., elevated risk of errors) and increase cognitive load (CL). Robust prediction of CL from electroencephalography (EEG) remains a challenge due to the small sample problem, noisy recordings, ineffective data representation, and lack of robust models. This article presents a holistic approach to developing a reliable prediction of CL. We used EEG data recorded following a modified Sternberg WM task in which four levels of CL were defined based on the encoding of 2, 4, 6, and 8 English characters. First, we address the problem of noise and the "small sample" by generating large low-noise data using eigenspace-based bootstrap sampling and a generative adversarial network (GAN). Second, we transform EEG recordings into spatial-spectral images to capture spatial information. Third, we built parameter-optimized convolutional neural network (CNN) models to predict four levels of CL using single-frequency bands (i.e., θ, α, and β) and stacked (i.e., all three bands) representations. In our quest to provide interpretable models, we applied gradient-weighted class activation mapping (Grad-CAM) to our models to localize the brain regions responsible for the prediction of CL. Empirical analysis of models trained using θ, α, β, and stacked representations shows accuracies of 90%, 89%, 91%, and 94%, respectively. Grad-CAM visualizations showed that the prefrontal, cerebellum, frontal, and parietal areas have the highest contribution to the prediction of CL.


I. INTRODUCTION
Working memory (WM) stores and processes information temporarily to perform cognitive tasks. Due to the limited capacity and duration of WM, excessive use of WM may cause cognitive overload, resulting in poor performance in certain mental tasks, such as learning, comprehension, and problem-solving [1], [2]. Cognitive load (CL) refers to the amount of WM resources expended in processing (e.g., encoding, maintaining, and recalling) information in the WM. A reliable CL prediction will help us in designing real-world systems, such as education and instructional design [3], brain-computer interfaces [4], and human-computer interaction [5].
CL is classified into three types: 1) intrinsic; 2) extraneous; and 3) germane [6]. Intrinsic CL is related to the inherent complexity of the task. The level of intrinsic CL depends on the number of items or elements that need to be processed simultaneously in the WM. Another factor that influences intrinsic CL is element interactivity, which refers to the degree to which items can or cannot be learned independently. Elements with low interactivity do not cause high CL [7]. For example, memorizing numbers from one to 100 requires effort but does not impose excessive CL since each number can be memorized individually without remembering the rest. On the other hand, learning a long English sentence has high element interactivity and imposes a high mental load since learning one word requires understanding its relationship with other words in the sentence. Extraneous CL is caused by the manner in which information is presented to a learner, and it is under the control of the instructional designer. Therefore, optimal learning is achieved when instructional materials are designed considering WM limitations to decrease the extraneous CL that impedes learning. Germane CL refers to the effort expended in constructing permanent knowledge or schema. The above types or sources of CL may contribute to the overall CL experienced by a learner during WM activities. Therefore, there is a need for a robust technique to measure or predict the level of CL imposed by a certain WM task (e.g., a memorization task) on the learner's cognitive system.
In this article, we present a deep learning-based approach to predict four levels of CL from electroencephalography (EEG) recordings obtained using a modified Sternberg memory task. EEG has received more attention than other neuroimaging techniques, including functional magnetic resonance imaging (fMRI) and positron emission tomography (PET), due to its noninvasive and nonintrusive nature and excellent temporal resolution.
A plethora of reported studies applied various machine learning (ML) techniques [such as support vector machines (SVMs) [8], XGBoost [9], and K-nearest neighbors [10]] in predicting CL from EEG recordings. In [11], ML classifiers [SVM and artificial neural networks (ANNs)] were used to predict mental states (i.e., high and low mental effort) using EEG data recorded while participants were learning online materials. Gao et al. [12] utilized a recurrent network (RN) and a convolutional neural network (CNN) for fatigue detection (i.e., alert and fatigue states) from EEG signals recorded from participants involved in a simulated driving experience. Recently, self-supervised deep learning models (CNN and LSTM) were used in [13] to model CL (mental load) using EEG data recorded from 32 participants while watching excerpts of music videos. In [14], two ML classifiers, namely, decision tree and SVM, were used to learn cognitive patterns from EEG and ECG data to perform two binary classification tasks (i.e., CL versus baseline and CL mismatch state versus CL matching state). Furthermore, Quatieri et al. [15] used a Gaussian classifier and various EEG features, such as power, coherence, and covariance-based features, to discriminate CL (i.e., low, medium, and high) as well as cognitive performance. Boring et al. [16] used a sliding-window SVM to detect the periods in the EEG signal that have low and high CL using data recorded while participants performed three types of mental tasks: 1) n-back test; 2) mental arithmetic; and 3) object tracking.
In general, most of these models rely on feature engineering, and their performance (generalizability and reliability) is limited by the noise in the data, the curse of dimensionality, and the lack of proper data representation. We address learning representations from very small and noisy samples, parameter optimization, and interpretability.
The proposed approach uses both generative and discriminative deep learning frameworks for CL prediction from EEG data. In [17] and [18], we used eigenspace-based bootstrap sampling to address the noise in EEG recordings and the small sample size. This article extends the previous work by combining eigenspace-based bootstrap sampling and a generative adversarial network (GAN) to generate more representative samples. The proposed method first generates large and less noisy EEG samples, then transforms the time-series data into spatial-spectral representations (i.e., topomaps), and maps the representations to four levels of CL using a CNN trained on topomaps from three frequency bands (θ, α, and β) and a stacked representation. For interpretability, we applied CNN visualization (e.g., Grad-CAM [19]) to our trained models to highlight the functional areas of the brain most associated with encoding WM information at different levels of CL.

II. LITERATURE REVIEW
The predominant neuroimaging techniques, such as EEG, fMRI, and PET, are widely used for recording neural activities [21], [22]. EEG is one of the most popular noninvasive and nonintrusive modalities for studying neuroelectric activity, thanks to its high temporal resolution and low cost.
ML has gained traction in the field of cognitive engineering, facilitating research in becoming more data-driven and significantly reducing the amount of time spent on analysis. Furthermore, since ML requires fewer assumptions and less domain knowledge than traditional statistical analysis, it has encouraged interdisciplinary research in neuroscience that incorporates cutting-edge techniques from various fields of science and engineering [23], [24], [25], [26]. ML techniques have been widely used to model the relationship between EEG features and levels of CL. The most commonly reported EEG features for CL classification are entropy, connectivity, and power spectral density (PSD) [5], [9], [27], [28].
Various works in the literature demonstrated that spectral (e.g., PSD) features are the most widely used EEG features, owing primarily to the relationship between EEG frequency bands and mental states during WM tasks [5], [29]. The most common classical ML models used for predictive analysis based on EEG features are SVM, K-nearest neighbor (KNN), and random forest [23]. However, due to excessive noise in EEG recordings, using these models requires a time-consuming and laborious feature extraction process. Furthermore, due to their inability to handle high dimensionality in data, classical ML models are highly prone to performing classification on a suboptimal set of features. Therefore, to achieve reliable CL classification performance, there is a need for a more robust method to address the issues related to noise, small samples, and poor data representation in EEG analysis. One such method is deep learning.
A handful of works have adopted deep learning for WM analysis and CL classification. In [30], 1-D CNN and 2-D CNN models were used to classify four levels of CL, achieving 93% and 91% accuracy, respectively. The authors reported that the 1-D CNN outperformed the 2-D CNN, possibly due to the lack of enough data to train the 2-D CNN efficiently. A stacked denoising autoencoder (SDAE) and a multilayer perceptron (MLP) were used in [31] to classify three levels of CL from EEG data from four subjects and two trials.
An attempt to preserve the EEG signal's spatial-spectral-temporal structure was made in [32]. The authors transformed EEG signals into 2-D spatial-spectral images from multiple frames of time-series EEG recordings and used a deep recurrent CNN to classify four levels of CL. The work was able to maintain the good structure of EEG recordings by capturing spatial, spectral, and temporal representations of the data. However, the work does not provide a framework to reduce the noise and increase the data size, which could allow the application of deeper and more diverse models on the data set.
In this work, we propose a data-driven and interpretable CNN-based approach capable of generating a sufficient amount of low-noise EEG data and efficiently learning from spatial-spectral representations of the data. We accomplish these objectives by 1) using eigenspace-based bootstrap sampling and a deep convolutional GAN to reduce noise and augment data samples; 2) transforming EEG signals into spectral topomaps to preserve the spatial structure of EEG data; and 3) creating parameter-optimized CNN models to learn and map representations to four levels of CL. To add interpretation to our models and demonstrate the relevance of the proposed approach to CL theory, we used Grad-CAM [19] on trained models to automatically localize the regions in images (i.e., functional areas of the brain) that contribute the most to the correct CL classification. These class activation maps can help in explaining how activation levels of different brain regions change during the transition from low to high CL.
To the best of our knowledge, our proposed method is the first interpretable deep learning approach to address the challenges of noise in EEG data and scarcity of data samples while preserving the EEG signal's spatial-spectral structure and achieving high and reliable CL prediction performance.

III. METHODOLOGY

A. Data Set Description
The analysis in this article is based on EEG auditory WM data originally collected at the University of Memphis [20]. In the experiment, 15 participants (eight females) performed a modified Sternberg auditory task. The continuous EEG signals were recorded using 64 electrodes at standard 10-10 locations. To reduce the effects of the environment during recording, participants were seated in an electro-acoustically shielded booth and instructed to avoid body movements. During the experiment, the subjects listened to a series of English characters (SETs) of varying sizes (i.e., 2, 4, 6, and 8). The size of the SET reflects different levels of CL. Each trial starts with a 2000-ms period prior to the presentation of the first SET character (i.e., the prestimulus period), followed by playing a sequence of SET characters based on the SET size (i.e., 2, 4, 6, and 8 characters) with a 700-ms gap between them. After listening to the SET characters, there was a pause of 3 s, after which the participants listened to a TEST character and were asked to respond YES if the character was from the SET and NO otherwise. Fig. 1 summarizes the above description.
Each experimental trial is broken down into three main temporal WM stages: 1) encoding; 2) maintenance; and 3) recall. The encoding stage covers a period of 700 ms between SET characters when participants are encoding the stimulus into memory, followed by the maintenance stage, a pause of 3 s between the end of the last SET character and the TEST character presentation in which subjects internalize the characters. The maintenance stage is followed by the recall stage, which covers the subjects' response period. The recordings were sampled at a sampling rate of 250 Hz and band-pass filtered from 1 to 45 Hz using a zero-phase (two-pass) FIR filter of order 500. The data were baseline corrected by subtracting the mean of the prestimulus period from each signal. Furthermore, the EEG's ocular artifacts (saccades and blink artifacts) were corrected using principal component analysis (PCA). For a more detailed description of data collection and behavioral results, refer to the original publication [20]. The analysis in this work is based on the encoding stage of the WM. Throughout the article, the SET sizes 2, 4, 6, and 8 will be labeled "CL-1," "CL-2," "CL-3," and "CL-4." CL-1 through CL-4 refer to the CL level associated with the encoding of two, four, six, and eight characters in memory. In other words, CL-1 through CL-4 reflect the increasing level of complexity of the WM task. We utilized EEG data from 11 subjects who successfully completed all experiment trials.
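To make the stated preprocessing concrete, the following is a minimal sketch of the band-pass filtering and baseline correction using MNE's array-level API. The trial length, the placeholder data, and the transition bandwidths are assumptions; only the parameters stated above (250 Hz sampling, 1-45 Hz zero-phase two-pass FIR of order 500, 2000-ms prestimulus baseline) come from the text.

```python
import numpy as np
import mne

sfreq = 250.0                                # sampling rate stated above (Hz)
eeg = np.random.randn(64, int(5 * sfreq))    # one 5-s trial, 64 channels (placeholder data)

# Zero-phase (two-pass) band-pass FIR filter from 1 to 45 Hz; an order-500
# filter corresponds to filter_length=501 samples. Transition bandwidths are
# assumptions chosen so that 501 taps suffice at 250 Hz.
filtered = mne.filter.filter_data(eeg, sfreq, l_freq=1.0, h_freq=45.0,
                                  method="fir", phase="zero-double",
                                  filter_length=501,
                                  l_trans_bandwidth=2.0, h_trans_bandwidth=5.0)

# Baseline correction: subtract the mean of the 2000-ms prestimulus period.
n_prestim = int(2.0 * sfreq)
filtered -= filtered[:, :n_prestim].mean(axis=1, keepdims=True)
```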

B. Data Augmentation and Denoising With Eigenspace-Based Bootstrap Sampling
Despite the advantages EEG has over other brain signal acquisition methods, raw EEG signals are inherently noisy due to numerous and often unavoidable noise sources during the recording process, such as eye blinks, muscle movements, and electromagnetic (EM) noise [33]. Signal averaging has proven to be a simple yet powerful way to eliminate noise from event-locked signals through a simple average of event trials. A signal resulting from averaging brain activities from multiple trials of a given event is known as an event-related potential (ERP) [34]. However, as deep learning requires a lot of data for training, and due to the limited sample size in EEG experiments (i.e., a small number of participants), ERP calculation with all event trials would further reduce the number of available signals for our modeling.
To address these challenges, in our previous work [17] we developed an eigenspace-based bootstrap sampling technique to generate large and representative ERP data. In summary, the bootstrap sampling is computed by first generating a large number of ERP signals (i.e., 2000 per subject per CL level) by averaging 20 randomly selected (with replacement) single trials. Second, we created an eigenspace formed by 75 principal components (PCs) obtained by fitting PCA to the single-trial data; we used the cumulative explained variance to choose the number of PCs. We then projected the bootstrap ERP samples onto the eigenspace and reconstructed the signals. Finally, we used the distribution of reconstruction errors (see Fig. 2) to select samples. The above procedure discards samples with very low reconstruction error (left tail) as noisy because they are close to the original noisy data. Similarly, we discard samples with the highest reconstruction error (right tail) as redundant because they are far from the direction of maximum variance in the original data. After ERP calculation and sample selection using the eigenspace, the number of samples is reduced from 2000 to 1000 per subject per CL level. Therefore, for 11 subjects and four CL levels, the total size of the data set is 44 000 ERP samples.
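A minimal NumPy/scikit-learn sketch of this procedure is given below. The array shapes, the placeholder data, and the 25th/75th percentile cutoffs (which retain the middle 1000 of the 2000 bootstrap ERPs) are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# trials: (n_trials, n_channels, n_samples) single-trial EEG for one subject
# and one CL level (placeholder data)
trials = rng.standard_normal((200, 64, 500))
n_trials = trials.shape[0]

# 1) Bootstrap ERPs: average 20 trials drawn with replacement, 2000 times.
boot_erps = np.stack([trials[rng.integers(0, n_trials, 20)].mean(axis=0)
                      for _ in range(2000)])

# 2) Eigenspace of 75 PCs fit on the flattened single-trial data (the number
#    of PCs would be chosen from the cumulative explained variance).
pca = PCA(n_components=75).fit(trials.reshape(n_trials, -1))

# 3) Project bootstrap ERPs onto the eigenspace, reconstruct, and compute
#    per-sample reconstruction errors.
flat = boot_erps.reshape(len(boot_erps), -1)
recon = pca.inverse_transform(pca.transform(flat))
errors = np.linalg.norm(flat - recon, axis=1)

# 4) Discard the left tail (noisy) and the right tail (redundant) of the
#    error distribution, keeping the middle half.
lo, hi = np.percentile(errors, [25, 75])
selected = boot_erps[(errors > lo) & (errors < hi)]
```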
C. Spatial-Spectral Representation of EEG Data

The analysis in this work is based on a spatial-spectral representation of EEG data known as a topographic map (or topomap) that incorporates both spectral features and spatial information. Even though EEG signals have a low spatial resolution, having a data representation with spatial information could allow us to learn how distinctive neural areas are related to different aspects of WM, such as CL.
To obtain the spatial-spectral representation, we first used the fast Fourier transform (FFT) to compute the mean PSD of our EEG data and then projected and interpolated the computed mean PSD values over a 2-D standard 10-10 system EEG montage. This analysis was done using MNE [37], an open-source Python API for EEG analysis. We created the representations for the θ, α, and β bands and stacked the three topomaps along the z-axis to create a stacked representation. Fig. 3 shows the process of generating the spatial-spectral representation from a time-series EEG signal.
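A hedged MNE sketch of this pipeline is shown below. The channel subset, the band edges (θ 4-8 Hz, α 8-13 Hz, β 13-30 Hz), the Welch PSD parameters (standing in for a plain FFT), and the way the interpolated image is pulled out of the plot are all assumptions not stated in this excerpt.

```python
import numpy as np
import mne
from mne.time_frequency import psd_array_welch

sfreq = 250.0
montage = mne.channels.make_standard_montage("standard_1005")  # contains 10-10 positions
ch_names = montage.ch_names[:64]            # placeholder 64-channel subset
erp = np.random.randn(64, 175)              # one 700-ms ERP sample (placeholder data)

info = mne.create_info(ch_names, sfreq, ch_types="eeg")
info.set_montage(montage)

bands = {"theta": (4.0, 8.0), "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}
band_images = []
for name, (fmin, fmax) in bands.items():
    # Mean PSD per channel within the band.
    psd, _ = psd_array_welch(erp, sfreq, fmin=fmin, fmax=fmax, n_fft=128)
    band_power = psd.mean(axis=1)
    # Project and interpolate the 64 channel values over the 2-D montage,
    # then grab the interpolated image array from the plot.
    im, _ = mne.viz.plot_topomap(band_power, info, show=False)
    band_images.append(np.asarray(im.get_array()))

# Stack the three single-band topomaps along the z-axis (channel dimension).
stacked = np.stack(band_images, axis=-1)
```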

D. Spatial-Spectral Data Augmentation With GAN
Generative models, such as GANs and variational autoencoders (VAEs) [38], have achieved a lot of success in many applications, such as data generation, data encoding, information retrieval, and image super-resolution. This article uses a deep convolutional GAN (DCGAN) to generate new spatial-spectral images to supplement our training data set. We added GAN-generated data to our training set to effectively reduce the redundancy caused by bootstrap sampling and further improve the model's generalization power.
Traditional GAN models consist of two deep networks: 1) a generator and 2) a discriminator. The generator network of a DCGAN samples a latent vector from a given distribution (e.g., uniform) and uses a series of upsampling convolutions to produce an image with the same dimensions as the real images. The goal of the discriminator network is to correctly classify real and fake data. In other words, during training, the discriminator learns to precisely distinguish between real and fake samples, whilst the generator tries to trick the discriminator by generating data that are as close as possible to the real samples. The training continues until the discriminator can no longer distinguish between fake and real samples. After training, we use the generator to produce new images.
Because of the limited number of training images, we designed the shallow GAN model shown in Fig. 4. The generator takes a 100-dimensional latent vector as input, followed by one dense layer with 100 352 nodes, three successive transposed convolution layers, and one convolution layer. The filter size is 7 × 7 for the transposed convolution layers and 3 × 3 for the final convolution layer. The discriminator network consists of two successive convolution and dropout layers, followed by a fully connected layer, one dense layer, and a sigmoid layer.
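Below is a minimal Keras sketch consistent with this description. The 112 × 112 × 3 image resolution, the 14 × 14 × 512 reshape of the 100 352-node dense layer, the filter counts, and the LeakyReLU activations are assumptions chosen so that the stated layer counts fit together; only the layer types and kernel sizes come from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential

def build_generator(latent_dim=100):
    return Sequential([
        tf.keras.Input(shape=(latent_dim,)),
        layers.Dense(14 * 14 * 512),                                 # 100 352 nodes, as stated
        layers.LeakyReLU(0.2),
        layers.Reshape((14, 14, 512)),
        layers.Conv2DTranspose(256, 7, strides=2, padding="same"),   # 28 x 28
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(128, 7, strides=2, padding="same"),   # 56 x 56
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(64, 7, strides=2, padding="same"),    # 112 x 112
        layers.LeakyReLU(0.2),
        layers.Conv2D(3, 3, padding="same", activation="tanh"),      # stacked topomap
    ])

def build_discriminator(input_shape=(112, 112, 3)):
    return Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(64, 3, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.3),
        layers.Conv2D(128, 3, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.3),
        layers.Flatten(),                                            # fully connected stage
        layers.Dense(128),
        layers.LeakyReLU(0.2),
        layers.Dense(1, activation="sigmoid"),                       # real vs. fake
    ])
```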
The proposed GAN architecture was trained with a batch size of 32 for 100 epochs. Fig. 5 shows the training loss curves of the discriminator and generator for stacked topomaps. As expected, the figure illustrates the overall gradual decrease and increase in the discriminator and generator losses, respectively, until both losses become close to each other. Fig. 6 shows examples of real and GAN-generated samples. As shown in the figure, the GAN has successfully learned to generate images that resemble the real images.

E. CL Classification With Convolutional Neural Networks
This section describes our approach to classifying CL from spatial-spectral images using CNNs. We created CNN architectures for the θ, α, β, and stacked representations using Keras, a TensorFlow API for deep learning. To achieve the best model performance, we used Bayesian hyperparameter optimization with Hyperopt to select the best architecture and hyperparameters for each model. Hyperopt is an open-source Python library for Bayesian hyperparameter optimization using the tree-structured Parzen estimator (TPE) algorithm [39]. TPE uses the previous trials' search history to suggest the best hyperparameters for the subsequent trial. Due to TPE's nonexhaustive nature, it is fast and capable of handling hyperparameters with continuous ranges. The key hyperparameters in our search space include the activation function type, number of convolution layers, dropout rate, batch size, convolution kernel size, optimizer, pooling type (max or average), learning rate, batch normalization, regularizers, and residual blocks.
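A minimal sketch of such a TPE search with Hyperopt over a few of the listed hyperparameters follows. The value ranges, the placeholder data, and the small model builder are illustrative assumptions, not the authors' exact search space or architecture.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from hyperopt import fmin, tpe, hp, Trials

# Placeholder topomaps with an assumed 112x112x3 shape and one-hot labels
# for the four CL levels; substitute the real spatial-spectral images.
x_train = np.random.rand(256, 112, 112, 3).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 4, 256), 4)
x_val = np.random.rand(64, 112, 112, 3).astype("float32")
y_val = tf.keras.utils.to_categorical(np.random.randint(0, 4, 64), 4)

# A subset of the hyperparameters listed above; ranges are assumptions.
space = {
    "n_conv_layers": hp.choice("n_conv_layers", [2, 3, 4]),
    "kernel_size": hp.choice("kernel_size", [3, 5]),
    "dropout": hp.uniform("dropout", 0.1, 0.5),
    "lr": hp.loguniform("lr", np.log(1e-4), np.log(1e-1)),
    "pooling": hp.choice("pooling", ["max", "avg"]),
    "batch_size": hp.choice("batch_size", [16, 32, 64]),
}

def build_cnn(p):
    model = tf.keras.Sequential([tf.keras.Input(shape=(112, 112, 3))])
    for _ in range(p["n_conv_layers"]):
        model.add(layers.Conv2D(32, p["kernel_size"], padding="same", activation="relu"))
        model.add(layers.MaxPooling2D() if p["pooling"] == "max"
                  else layers.AveragePooling2D())
    model.add(layers.Flatten())
    model.add(layers.Dropout(p["dropout"]))
    model.add(layers.Dense(4, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(p["lr"]),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def objective(p):
    model = build_cnn(p)
    hist = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     batch_size=p["batch_size"], epochs=5, verbose=0)
    return min(hist.history["val_loss"])   # TPE minimizes this value

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=25, trials=Trials())
```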
The CNN architecture for the stacked representation data set generated by eigenspace-based bootstrap sampling is shown in Fig. 7. As illustrated, the model comprises successive convolution layers, batch normalization, average pooling, residual blocks, and dropout layers, followed by a fully connected layer and a softmax layer. Furthermore, we used ReLU as the activation function and Adagrad as the optimizer. We mitigated overfitting by applying L2 weight regularization and dropout.
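The following is a hedged Keras sketch in the spirit of this architecture (ReLU, average pooling, residual blocks, batch normalization, dropout, L2 regularization, Adagrad, four-way softmax). The filter counts, the 112 × 112 × 3 input shape, and the exact layer ordering are assumptions; it is not the optimized architecture of Fig. 7.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

def residual_block(x, filters):
    # Two L2-regularized conv layers with batch normalization and a skip connection.
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu",
                      kernel_regularizer=regularizers.l2(1e-4))(x)
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(filters, 3, padding="same",
                      kernel_regularizer=regularizers.l2(1e-4))(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

inputs = tf.keras.Input(shape=(112, 112, 3))      # assumed stacked-topomap size
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.BatchNormalization()(x)
x = layers.AveragePooling2D()(x)
x = residual_block(x, 32)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.AveragePooling2D()(x)
x = residual_block(x, 64)
x = layers.Dropout(0.3)(x)
x = layers.Flatten()(x)                           # fully connected stage
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(4, activation="softmax")(x)  # four CL levels

model = Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
```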

IV. RESULTS
We discuss the performance of our deep CNN models trained on data from individual frequency bands and the stacked representation. To determine our models' predictive performance, we used various evaluation metrics, including accuracy, precision, recall, and F1 score. Our findings are presented in the following sections.

A. CL Classification Performance on Bootstrap-Based Data Set
We present the CL classification performance of models trained on data that has been cleaned and generated using the bootstrap sampling process described in Section III (data set-1). Data set-1 includes spatial-spectral images from the θ, α, and β bands, and the stacked representation. For each representation, we have 44 000 spatial-spectral images, with 11 000 images coming from each of the four levels of CL.
To train our models, we randomly split each data set into train, test, and validation sets with a split ratio of 60%, 25%, and 15%, respectively. We trained the CNN models on the training set for 100 epochs.
Training and validation loss and accuracy curves (learning curves) of the model trained on the stacked representation are shown in Fig. 8. Our learning curves show a good model fit to the data because the loss and accuracy curves decrease and increase gradually and smoothly, reaching a stable point with a small gap (generalization gap) between the training and validation curves.
Fig. 9 (left) depicts the models' performance in predicting CL on the test and validation sets in terms of accuracy, precision, recall, and F1 scores. The figure demonstrates that β is the best single-frequency-band predictor of CL, with 87% accuracy. Additionally, the findings show a significant performance improvement when the three individual bands are combined to form the stacked representation. We achieved 90% accuracy with the model trained on the stacked representation, which is 3% better than the β band performance.
In Fig. 10, we summarize the performance results for each band using a confusion matrix. Our confusion matrices indicate that the highest misclassifications occur at the upper intermediate CL level (CL-3). The high misclassification at CL-3 could be primarily attributed to high signal fluctuations linked to the transition from low to high CL.

B. CL Classification Performance on Bootstrap Sampling and GAN Data Set
This section describes the classification outcomes of models developed using the data set produced by combining bootstrap and GAN samples (data set-2). The data set utilized in this section comprises the GAN-generated data and the half of the bootstrap-generated data that was not used to train the GAN. We trained the CNN models for 100 epochs with a batch size of 32 images, as in the previous section.
Fig. 9 (right) illustrates the classification performance of the CNN models trained on data set-2. As expected, the figure shows an overall CL prediction performance improvement compared to the models trained on data set-1 (Fig. 9, left). The accuracy scores for the θ, α, β, and stacked representations increased to 90%, 89%, 91%, and 94%, respectively. Similar to data set-1, the β band outperformed the α and θ bands in predicting CL, and the stacked representation had the best overall performance. These findings demonstrate the GAN's ability to generate reliable spatial-spectral representations of EEG for CL classification.

C. Model Interpretability With Grad-CAM
In this work, we applied Grad-CAM [19] to the test set of the stacked representation in data set-2 to find the functional areas of the brain responsible for the correct classification of the four CL levels (i.e., CL-1, CL-2, CL-3, and CL-4). Grad-CAM belongs to a family of techniques used to give a visual interpretation of CNN models, which also includes the class activation map [40] and Eigen-CAM [41].
Grad-CAM uses gradient-based localization to provide a visual explanation of what a deep CNN model has learned about a specific class from an input image. Therefore, Grad-CAM can help us highlight the regions in the input image that have the highest contribution to the prediction of CL.
We used Grad-CAM to extract activation values from 12 functional areas of the brain, namely, the left and right prefrontal cortex (LPFC and RPFC), left and right frontal cortex (LFC and RFC), left and right temporal cortex (LTC and RTC), left and right parietal cortex (LPC and RPC), left and right occipital cortex (LOC and ROC), and left and right cerebellum (LC and RC). The localization is done in two main steps.

1) Activation Map Computation: First, we applied Grad-CAM on test images to obtain the class activation maps for the entire brain for the four CL levels. The visualization was computed on the stacked representation since it contains information from the three frequency bands and showed the best classification performance. Fig. 11 shows examples of Grad-CAM results for the four CL levels. The regions in the figure with the greatest contribution to the prediction of CL are highlighted in red, while those with the smallest or no contributions are highlighted in blue. Grad-CAM visualizations clearly show the regions with a substantial contribution relative to others, even though it can be challenging to separate the regions with a medium contribution from the figure.

2) Mapping Class Activation Onto Functional Areas of the Brain: We mapped the class activation to functional areas of the brain by calculating the mean of the activation values around each electrode. EEG electrodes are named and placed around the scalp based on functional areas. To extract activation values for individual electrodes from our Grad-CAM heatmaps, we applied an 11 × 11 mask kernel around the electrode location and computed the mean of the extracted activation values. To obtain the contribution of a given brain region, we used clusters of channels likely to record activities from that region and computed the mean of their activation values. Fig. 12 shows the electrodes used to calculate activation values for each brain region.

Fig. 13 shows the bar plot for mean activations extracted from the class activation maps generated using Grad-CAM. The figure depicts how much each region contributes to the correct classification of a given CL. We can see that some regions contribute more than others for a given level, but the contribution level varies across CL levels, reflecting the spatial dynamics of CL. For example, the left frontal cortex shows the highest mean activation value for CL-1, the right cerebellum contributes the most for CL-2 and CL-3, and the left prefrontal cortex (LPFC) has the highest activation for CL-4. In general, for lower CL (i.e., CL-1 and CL-2), the frontal (LFC and RFC), prefrontal (LPFC and RPFC), right cerebellum, and right occipital regions show the highest mean activation values. As CL increases from low to high CL (i.e., CL-3 and CL-4), the cerebellum (RC and LC), temporal cortex (LTC and RTC), left prefrontal, and occipital regions dominate other regions during the classification of CL.
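A hedged sketch of the two localization steps is given below: a standard Grad-CAM computation in TensorFlow followed by the 11 × 11 mean-activation extraction around an electrode. The target convolutional layer name and the electrode pixel coordinates are assumptions that depend on the trained model and montage projection.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_idx, conv_layer="last_conv"):
    # Map the input to the target conv layer's feature maps and the predictions.
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        loss = preds[:, class_idx]
    grads = tape.gradient(loss, conv_out)
    # Channel importance weights: global-average-pooled gradients.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    # Weighted sum of feature maps, rectified and normalized to [0, 1].
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1)).numpy()
    cam = cam / (cam.max() + 1e-8)
    # Upsample the coarse map to the input resolution.
    return tf.image.resize(cam[..., None], image.shape[:2]).numpy()[..., 0]

def electrode_activation(cam, xy, k=11):
    # Mean activation in a k x k window centered on electrode pixel (x, y);
    # a region's score would average this over its cluster of electrodes.
    x, y = xy
    h = k // 2
    return cam[max(y - h, 0):y + h + 1, max(x - h, 0):x + h + 1].mean()
```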
These findings agree with what has been reported in the literature regarding the role of different functional areas in processing WM information. For example, using cross-correlation on neural recordings from monkeys during a WM task, Constantinidis et al. [42] revealed strong interaction among cells in the prefrontal cortex when they are in close proximity. In addition to the prefrontal cortex, our Grad-CAM localization shows a strong contribution of the left and right cerebellum to the prediction of intermediate CL (i.e., CL-2 and CL-3). Beyond the motor control function of the cerebellum, there is a large body of fMRI studies that link the cerebellum to different aspects of cognition and WM [43]. For example, in [44], fMRI studies on two versions of the Sternberg WM task revealed that cerebro-cerebellar activity increased with the executive load. Also, using a delayed serial recall task and fMRI imaging, Durisko and Fiez [45] found evidence, based on activations in foci within the cerebellum, supporting the cerebellum's contribution to speech and verbal WM. Recently, another fMRI study based on the Sternberg WM task of repeating and novel letter sequences (with similar and dissimilar phonology) found that prediction based on sequence learning is a cerebellar function [46]. It is challenging to capture high brain activity from the cerebellum using EEG, as this area is much deeper in the brain and its signals can easily be affected by nearby regions, so most of the studies on this area are based on fMRI due to its high spatial resolution. Therefore, there is a need for more extensive work focused on the cerebellum region, beyond Grad-CAM visualization, to investigate the ability of ML models to learn useful patterns from EEG data. Other regions that have previously been reported to be associated with CL and WM include the temporal, frontal, and parietal lobes [47], and the occipital-parietal and occipital-temporal regions [48]. It is important to note that heatmaps generated from Grad-CAM are just model gradients with respect to the target class overlaid on the input image and should not be used as the sole measure of a brain region's response to the stimuli. However, Grad-CAM is a helpful tool in interpreting vision models as it shows whether the model is learning from the right region of the image.

V. CONCLUSION
This work addresses a number of challenges in the robust and reproducible modeling of cognitive events from EEG recordings. In particular, we addressed issues with noise in the EEG data, small sample size, and inefficient data representation in building parameter-optimized and interpretable models for CL prediction.
We developed a holistic approach using both generative and discriminative deep neural networks to build parameter-optimized and interpretable models that learn the spatial-spectral representation of EEG signals for CL prediction. The topomap preserves both spatial and spectral features of EEG data. We achieved an accuracy of 94% in classifying CL with the CNN model trained and tested on the stacked representation of EEG signals recorded following a modified Sternberg auditory WM task. The performance on single frequency bands (θ, α, and β) shows that the β band has more predictive power than θ and α in classifying CL, with an accuracy of 91%. The significant predictive power of the β band corroborates previously reported findings about the involvement of beta oscillations in WM and cognitive processes. For example, a substantial body of research has discovered that the β band is linked to high CL levels during mental tasks, concentration, regulating WM activity, and clearing out WM content [49], [50]. To achieve model interpretability, we used Grad-CAM visualization to create class activation maps for images in the test set of the stacked representation. The visual representation of the CNN results helped us better understand how different functional areas of the brain respond to increased mental load or contribute to CL classification. The Grad-CAM results show that our CNN models use contributions from most functional areas, but the contribution level varies across different CL levels.
Our framework differs from other approaches described in the literature by being the first deep learning framework capable of successfully reducing noise in EEG signals, addressing the issue of small sample sizes, maintaining good data representation, and creating robust, reliable, parameter-optimized, and interpretable models. Furthermore, the results in this analysis are easily reproducible, as the EEG signals were recorded using the standard 10-10 EEG system and can easily be processed with most of the publicly available EEG signal processing tools, such as MNE and EEGLAB. To further facilitate the results' reproducibility, we used open-source Python packages for modeling (i.e., TensorFlow, Keras, and Hyperopt) and utilized simple yet intuitive techniques, such as bootstrap sampling and PCA. In addition, our framework can easily be applied to other EEG-based applications, such as disease diagnosis (e.g., epilepsy detection), speech analysis (e.g., categorical perception), and brain-computer interfaces.

Fig. 1. WM Task (Auditory Stimuli): For each experiment trial, participants listened to a series of 300-ms English characters (SET) with 700 ms between them. Following the SET characters, the subjects took a 300-ms break and then listened to a TEST character. After the TEST character, participants were asked to press a button to indicate whether the TEST character was among the SET. Source: [20].

Fig. 2. Reconstruction Error From Eigenspace: Histogram of reconstruction errors (log scale) obtained by subtracting the signal reconstructed from the eigenspace projection from the original signal.

Fig. 3. Generation of the spatial-spectral representation from an ERP signal: (A) the ERP signal is computed from raw EEG signals, (B) the mean PSD from three frequency bands is obtained by applying the FFT to the ERP signal, (C) the mean PSDs are projected onto a 2-D montage to obtain topomaps for the individual frequency bands, and (D) the stacked topomap is obtained by stacking the grayscale topomaps from the single-frequency bands.

Fig. 5. GAN Losses: The discriminator and generator were trained simultaneously for 100 epochs. The figure shows the discriminator's and generator's loss curves for the GAN model trained on the stacked representation from CL-3.

Fig. 7. CNN architecture for the combined bootstrap and GAN data set for the stacked representation. The model is made up of eight convolution, seven batch normalization, three average pooling, two residual block, and three dropout layers for the feature extraction network, and one fully connected layer followed by two dense layers (with one dropout layer after each dense layer) and a softmax layer with four nodes for the classification network.

Fig. 8. Learning curves (i.e., training and validation) for the CNN model trained on β band images generated through the bootstrap sampling process. The model was trained for 100 epochs with a batch size of 32 images. (a) Loss curves. (b) Accuracy curves.
Fig. 12. 12 brain regions and their associated EEG electrodes.

Fig. 11. Examples of class activation maps using Grad-CAM for four levels of CL using spatial-spectral images from the composite topomap test data set.

Fig. 13. Mean Grad-CAM activation values for functional areas of the brain for four levels of CL.
Felix Havugimana, Graduate Student Member, IEEE, Kazi Ashraf Moinudin, and Mohammed Yeasin