Cross-Task and Cross-Participant Classification of Cognitive Load in an Emergency Simulation Game

Assessment of cognitive load is a major step towards adaptive interfaces. However, non-invasive assessment is rather subjective as well as task-specific and generalizes poorly, mainly due to methodological limitations. Additionally, it heavily relies on performance data like game scores or test results. In this study, we present an eye-tracking approach that circumvents these shortcomings and allows for effective generalization across participants and tasks. First, we established classifiers for predicting cognitive load individually for a typical working memory task (n-back), which we then applied to an emergency simulation game by selecting similar participants and weighting their predictions. Standardization steps helped achieve high levels of cross-task and cross-participant classification accuracy between 63.78 and 67.25 percent for the distinction between easy and hard levels of the emergency simulation game. These very promising results could pave the way for novel adaptive human-computer interaction across domains, particularly for gaming and learning environments.


INTRODUCTION
In many digital environments designed for learning, working, or even entertainment purposes, there is a close link between users' current cognitive load and their affective experiences. The important impact of users' cognitive load on their affective states has been demonstrated repeatedly (e.g., [1], [2], [3]). For instance, in a learning context users' subjective experiences rely on instructions provided in a learning environment that is neither over-straining (and thereby frustrating) nor under-challenging (and thus boring) due to a limited workload [4]. However, imposing an optimal level of cognitive load onto learners that keeps them engaged and satisfied as well as in their zone of proximal development [5] may be a highly learner-specific issue, depending strongly on an individual learner's prerequisites in terms of prior knowledge and abilities. Similar to the zone of proximal development, the Yerkes-Dodson law suggests that managing arousal can result in optimal performance. Since cognitive load can be interpreted as a form of arousal (as illustrated by its impact on pupil diameter), its management would be beneficial in many situations. For calibrating learning experiences with regard to their cognitive demands and affective connotations, a technical support system might be helpful. Such a system could optimize the resulting learning outcomes by performing real-time adjustments of the cognitive-load level imposed by a learning environment.
Systems able to detect and properly react to a user's cognitive load in order to calibrate the user's affective and cognitive experiences would, of course, also offer considerable benefits for numerous other applications beyond learning environments, ranging from the workplace to the digital playground. For instance, in potentially stressful digital working environments (such as systems for surgical assistance, engine control or emergency management), in which errors might have serious and life-threatening consequences, individuals might also experience strong and fluctuating affective reactions related to their current level of cognitive (over-)load. Monitoring cognitive load in these contexts and providing respective feedback and support to users might not only help to avoid errors related to high cognitive load, but also improve the overall affective experience. For example, (truck) drivers or other workers controlling complex engines may be prompted to take breaks or provided with individualized training when detected to be over-strained in specific situations (e.g., [6] or [7]). Other examples may be conceivable in the medical domain where surgeons could be relieved when necessary, or in aviation scenarios where pilots may be provided with support from their copilots or from assistance systems depending on their cognitive-load levels [8].
Beyond scenarios related to learning or working, gaming also seems to be a prime area for applying adaptive procedures based on cognitive load measurement in order to optimize affective user experiences. For instance, in cases where an obstacle in a game is too difficult for a player to overcome, frustration may set in and the gaming experience may suffer. Conversely, when a game is too easy relative to the user's current abilities, gaming may not be as enjoyable for that user. Both cases can be circumvented by adapting the degree of difficulty based on the player's current level of cognitive load. Thus, with accurate estimations of cognitive load during gaming, automatic adaptations could be enabled that prevent negative affective states (such as boredom, frustration, and stress) and enhance positive affect (such as engagement, joy, and satisfaction). A prime example of a desirable affective state in gaming and many other scenarios of human-computer interaction is flow [9]. Flow is considered a positive affective state of optimal experience [10] that creates pleasure by balancing the challenge of the task at hand and the available capabilities of the user. Measuring a person's cognitive load online could help adapt levels of difficulty to a degree that still constitutes an enjoyable challenge without over-straining the user.
Usually, cognitive load is measured based on self-reports, such as the NASA task-load index (TLX) [11] or by obtaining performance metrics. These traditional approaches, however, have some drawbacks that render them impractical for systems aiming for real-time adaptations. In particular, filling out a questionnaire about the level of cognitive load currently experienced might strongly interfere with task performance and immersion and is, therefore, unsuitable for most applications. Moreover, questionnaire data are rather subjective and may be influenced by many factors, cognitive load only being one [12]. Thus, to adapt cognitive load levels in real time, a less obtrusive method is required that does not interrupt the current task, as in the case of questionnaires, but offers a reliable, objective and continuous indirect online estimation of a user's cognitive load.
Performance metrics such as test scores and task completion times are indirect and thus less obtrusive as compared to questionnaires. However, they are usually only available at specific points in time and cannot be measured continuously as would be required for real-time adaptations to cognitive load levels. For instance, in the case of digital learning environments, the goal would be to measure cognitive load levels during, not just after, the learning process in order to adapt the level of difficulty of the learning materials. Thus, the continuous and unobtrusive cognitive-load monitoring that is required usually cannot be provided by performance metrics or questionnaire data [13].
An alternative approach for assessing cognitive load is based on physiological measures. Cognitive load causes physiological reactions that can be measured by sensors [13], [14], [15]. The most reliable indicators are changes that occur in the brain, but measuring these changes is intrusive, hard to set up, and not feasible in broad real-world settings. In this context, methods like electroencephalography (EEG) or near-infrared spectroscopy (NIRS) require very specialized hardware and expertise to operate. A thorough overview of EEG measures, including their advantages and drawbacks (such as the number of required trials per experiment) that illustrate why these measures are rather unsuitable in the context of most real-life adaptive systems, is provided by [16]. Less intrusive sensors include heart rate monitors and devices for measuring skin conductance, but these seem to lack accuracy and/or validity for measuring cognitive load [17]. Finally, eye tracking measures such as eye-fixation features offer a good alternative to the aforementioned physiological signals. They do not require physical contact with participants, can be obtained in real-time, and have been comprehensively demonstrated to be associated with cognitive load [18]. When obtained by means of webcams, eye-tracking measures have the potential to become available to a broad audience across various application domains. Moreover, with the increasing integration of eye-tracking technology in VR, AR, and smart glasses [19], this physiological signal can also be measured in high quality in a variety of applications in the future [20].
One of the major limitations of physiological indicators such as eye-tracking data for measuring cognitive load is the difficulty of generalizing measures across tasks and participants [21], rendering cross-task and cross-participant predictions or real-time assessments virtually impossible. This drawback, however, is not limited to eye-tracking data or physiological measures in general, but applies to many algorithms for real-time workload assessment, as Heard et al. conclude in their meta-review [22]. Systems designed for real-time assessments usually need to be adapted to individual participants and/or specific tasks in order to yield reliable predictions. This usually requires data collection for lengthy calibration procedures for each participant, rendering these systems time-consuming and inconvenient for users. Even with individual calibration, generalizations to different tasks or applications are usually poor, resulting in the necessity of repeated calibrations for different tasks and/or applications. Currently, there is no satisfying general-purpose classifier for cognitive load available.
Many researchers have worked, so far with limited success, on the problem of either cross-task or cross-participant estimations of cognitive load (see below for a more detailed discussion). Although intra-participant results are usually good, generalization results are limited or do not even exceed chance level (e.g., [23], [24], [25]).
In this article, we present a novel and intuitive approach to remedy these methodological shortcomings. We show how a machine learning approach might be used for cognitive load detection based on eye-tracking data to allow for successful generalization across participants and tasks. We employ a schema of weighted votes that combines participant-specific classifiers into a composite classifier with a broader scope, offering generalization ability across participants and tasks. Thus, our method has the capacity to pave the way for out-of-the-box solutions for adaptive human-computer interaction based on a reliable assessment and classification of users' cognitive load, independent of the user and the task at hand. As a result, users' affective experiences during human-computer interaction in contexts such as learning, working, or gaming may benefit significantly in terms of avoiding frustration, boredom, and stress and in terms of enhancing engagement, joy, and satisfaction. In line with this assumption, we show that our cognitive-load classification correlates notably with negative emotions such as stress and frustration.

Adaptations Based on Cognitive Load Estimation
Cognitive load estimation is usually performed in a task- and participant-specific way. In this context, it has also been demonstrated to be useful for workload adaptations in learning environments and vehicle control tasks. Yuksel and colleagues created a brain-computer interface that adapted the difficulty of a musical learning task [26]. They measured cognitive load using fNIRS to decide when to increase difficulty. Their approach managed to significantly increase learning gains during piano lessons compared to a control group. However, the classifiers they used were participant-specific and trained over a long training period consisting of 30 songs per participant. An aviation simulation was used by Wilson and Russell to provide real-time adaptive feedback [8]. They used a combination of EEG, respiration, and heart rate, but also eye-fixation behavior, to realize adaptations during an uninhabited air vehicle task. Participant-specific artificial neural networks were trained to detect high cognitive load and adapt the task by slowing down simulated time when cognitive load was too high. In contrast to our approach, real-time adaptation was successfully realized only with participant- and task-specific classifiers.
Lastly, Kelleher et al. developed a method that does not rely on EEG data, but rather on users' behavioral performance [27]. Their approach was able to distinguish between a difficult puzzle and an easy puzzle based on the users' performance completing previous puzzles with an accuracy of 71 to 79 percent. A wide array of features derived from performance, user input, and user ratings was used to train random forests, and predictions were made based on the last three puzzles the user was attempting to solve. While the results are promising, their method is still specific to their task and to individual participants.

Cross-Participant and Cross-Task Approaches
While cognitive load estimation is usually performed in a task- and participant-specific way, there are several studies that successfully implemented either cross-task or cross-participant approaches (but not both). In contrast to the method that we present in this article, most of them rely, at least partly, on EEG.
A very detailed assessment of mental workload is provided by Popovic et al. [28]. They classified different kinds of load (i.e., speech, fine motor, gross motor, auditory, visual and cognitive) using EEG and ECG. Their cross-participant classifier achieved 72.5 percent accuracy for cognitive load in a leave-one-participant-out cross-validation.
Another interesting approach was presented by Ke and colleagues [29]. They generalized from individual regression models to more general regression models by applying a feature selection algorithm to EEG data recorded from a working memory task and a complex simulated multi-attribute task designed to evaluate operator performance and workload (see [30]). In a first step, they used two thirds of their data to systematically eliminate features with low cross-task correlations and then evaluated their feature set on the remaining validation set. They found a significant increase in performance of their regression model. Again, these results show cross-task, but not cross-participant, generalization. This makes them applicable in some situations, but still not general enough to meet the demands of many applications.
Finally, in previous work, our group successfully developed a machine learning approach for cross-participant classification of cognitive load [31] using eye-tracking data. For a working memory task we achieved an accuracy of 76.8 percent for offline classification and 70.4 percent for real-time online classification. A reworked version of this approach was used in an emergency simulation and showed promising results under noisy conditions similar to actual applications [32]. This updated version worked across different versions of the simulation, showing potential for a cross-participant and cross-task solution. The article presented here expands on this work by refining the set of eye-tracking features used and adding a further weighting step for cross-task application that makes use of the accuracy scores. The previous two articles were limited to one task or variations of one task, but in this work we perform actual cross-task and cross-participant classification, working towards a truly general algorithm.

EXPERIMENTAL SETUP
We collected eye-tracking data from two different tasks: (1) an N-back task (a standardized working memory task inducing a controlled level of cognitive load), and (2) a computer simulation that represented a real-life application. Pursuing cross-task classification, we aimed to use data from the former to estimate cognitive load in the latter, thereby strictly separating the training task from the validation task. This separation was crucial in order to keep the approach as general as possible. We further ensured that there was no overlap between the two groups of participants to guarantee that the method was also strictly cross-participant. Both are crucial aspects of this work.

N-Back Task
The N-back task [33] is commonly used to induce cognitive load and to measure working memory capacity. Participants are presented with a randomly generated sequence of letters and have to press one of two buttons to indicate whether the currently presented letter is the same as the one N letters before. N modulates the difficulty of the task because a larger N means that more letters have to be memorized and compared to the letter at hand. With regard to working-memory demands, the N-back task requires participants to keep a string of N letters active in memory, compare the first letter of the string to the current trial, decide on the correct button, and update the memorized string by deleting the first letter of the string and adding the letter of the current trial. 0-back can be used as a control condition where participants have to compare the current stimulus with a constant that was presented at the very beginning of the task. Letters are randomly chosen from the set L = {C, F, H, S} and are presented for 0.5 seconds, followed by a black screen shown for 1.5 seconds. A schematic overview is provided in Fig. 1 and descriptive statistics can be found in Table 2. A one-way ANOVA confirmed the manipulation of difficulty with regard to participants' mean accuracy (F = 105.41, p < 1.8 × 10^-19).
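As an illustration, the trial logic described above can be sketched in a few lines of Python; treating the first letter of the generated sequence as the constant 0-back reference is an assumption for illustration, since the reference could also be shown separately before the block:

```python
import random

def generate_nback_block(n, n_trials, letters=("C", "F", "H", "S"), seed=None):
    """Generate a random letter sequence plus the correct yes/no response
    for each trial of an N-back block.

    For n >= 1, trial i is a target when the letter repeats the one shown
    n trials earlier; the first n trials can never be targets. For n == 0,
    the first sequence letter serves as the constant reference (an
    illustrative assumption, see lead-in).
    """
    rng = random.Random(seed)
    seq = [rng.choice(letters) for _ in range(n_trials)]
    targets = []
    for i, letter in enumerate(seq):
        if n == 0:
            targets.append(letter == seq[0])
        else:
            targets.append(i >= n and letter == seq[i - n])
    return seq, targets
```

Comparing a participant's button presses against `targets` then yields the per-condition accuracies analyzed above.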
Participants first received instructions for the task and had to perform a short training until they achieved an accuracy of 60 percent. They then completed two critical blocks, each comprising three difficulty levels: 0-back, 1-back, and 2-back. Each level of a block consisted of 154 trials; the order of levels in a block was randomized.
We used the "N" of the N-back task as an experimental manipulation of cognitive load (within participants design) and focused on the difference between 0-back and 2-back conditions.

Participants
28 students (mean age = 24.71, SD = 4.12, 14 females) from the University of Tübingen were recruited for the N-back task. Data from one participant were discarded due to problems with the eye-tracking recordings, resulting in too little usable data.
The experiment was approved by the local ethics committee and all participants gave written informed consent at the beginning of the experiment. Participants received monetary compensation at the end of the experiment. All were right-handed and German native speakers.

Apparatus
We used a RED250 eye tracker from SensoMotoric Instruments (SMI) in combination with the SMI Experiment Center software (version 2.7.13) for the recording of eye movements and pupil-related features. Calibration was performed with SMI's built-in 9-point calibration. All eye-tracking data were recorded at 250 Hz in a laboratory setting with illumination held constant in individual sessions.
During the task, a chin-rest was used to ensure stable head position and constant viewing distance. Stimuli were presented on a 22-inch monitor with a resolution of 1,680 × 1,050 px using Arial font with a size of 25. All letters were presented in gray on a black background.

Emergency Simulation Task
The Emergency simulation task was based on the commercially available simulation Emergency by Promotion Software GmbH [34]. Participants had to coordinate emergency personnel consisting of firefighters, paramedics, and ambulances responding to different scenarios (e.g., a car crash or burning buildings). The simulation can be used as a training tool for emergency management tasks as well as for entertainment purposes and thus covers aspects of digital environments for learning, working, and gaming. For illustration, a typical scene is presented in Fig. 2.
The simulation started with a tutorial that introduced participants to how the simulation works. After the tutorial was completed successfully, three scenarios were presented: a car crash, burning buildings, and a train crash. Each scenario had three levels of increasing difficulty (easy, medium, and hard). All participants completed the scenarios in the same fixed order, progressing from easy to hard levels within each scenario and from the first to the last scenario (i.e., from car crash to train crash). Scenarios and difficulty levels differed in the number of sub-tasks to be completed as well as in the available numbers of emergency personnel and their composition. These manipulations were calibrated by Promotion Software GmbH for the purposes of this study in order to optimally manipulate the levels of cognitive load imposed onto participants. In order for a scenario to be completed successfully, all sub-tasks had to be performed, meaning all trapped victims had to be freed, every injured person had to be treated and transported to a hospital, and every fire had to be extinguished. Descriptive statistics about the scenarios' parameters, completion rates, and subjective ratings can be found in Table 1.

The fixed order of scenarios and difficulty levels was chosen deliberately for this task. While a randomized order would be ideal to avoid confounds, it is hard to implement in a task that involves learning and skill acquisition. Participants who first complete easy parts of the simulation gain proficiency quickly, so scenarios that may have been more difficult in the beginning become easier. On the other hand, when participants are confronted with very difficult scenarios first, they may become overwhelmed, which hinders or even prevents learning. This change in expertise over time and its dependency on the order of task presentation necessitated a fixed order. Gerjets et al. also recommend a fixed order from simple to complex for learning tasks for these exact reasons [4].
Moreover, an ascending order of difficulty is in line with the way learning materials and games are commonly structured, making it a better showcase for application of our method. Another important aspect is that we aim to use data from the N-back task to estimate cognitive load in the Emergency simulation so that the training data are not affected by a confounding of difficulty level and time.
In the simulation, certain sub-tasks could only be performed by specific emergency personnel and with varying degrees of efficiency. Therefore, especially in scenarios that involved fire, planning activities were essential for the successful completion of a mission. For instance, as fire can spread to nearby buildings and also hurt emergency personnel, prioritization of which fires to put out first was crucial. Putting out fires could be done by firetrucks or by firefighters alone, but firefighters were considerably slower at performing the task. Moreover, firefighters might be required to free trapped victims. In general, more difficult levels involved more coordination of emergency personnel units and more sub-tasks, posing higher demands on planning, prioritization, monitoring, and information updating.
After each level, participants were asked to indicate their subjective cognitive load and their affective experiences based on a modified version of the NASA-TLX questionnaire [11]. The questionnaire contained scales for positive and negative emotions, mental and temporal demand, effort, frustration, and stress, as well as a measure of seriousness, all rated on a scale of 0 to 100. Participants' responses were used to evaluate the validity of our approach and the relation of our envisioned cognitive-load classification to affective experiences.
For this task, we used the difficulty levels within each scenario as manipulation of cognitive load (within participants design) and focused on the difference between the easy and hard version of each scenario.

Participants
The Emergency simulation was completed by 47 participants (mean age = 24.6, SD = 6.3, 33 females). There was no overlap between the participants of the Emergency simulation and the participants of the N-back task. Seven participants had to be excluded due to problems with eye-tracking recordings. Another two were excluded because they reported that they did not take the experiment seriously. Finally, two participants had so many missing values that they did not provide enough usable data for all scenarios. The data of the remaining 36 participants were included in further analyses.
We deliberately included participants with noisy data or poor tracking ratios (i.e., time spans with invalid data caused by the pupil not being detected reliably). This renders the data more realistic with closer resemblance to data one would expect in an online-scenario of a real-world application. The experiment was approved by the local ethics committee and all participants gave written informed consent at the beginning of the experiment. All participants were right-handed, German native speakers and received monetary compensation at the end of the experiment.

Apparatus
The eye-tracking setup was the same as for the N-back task, featuring a RED250 eye tracker from SensoMotoric Instruments (SMI) in combination with the SMI Experiment Center software (version 3.7.60). Calibration was performed with SMI's built-in 9-point calibration and the recording frequency was set to 250 Hz. Data recording was performed in a laboratory setting with illumination held constant in individual sessions.
For the Emergency simulation, a laptop with a 16-inch screen driven at 1,920 × 1,080 px resolution was used. No chin-rest was used for this task in order to more closely mimic a real-world learning or gaming situation.

FEATURES USED FOR CLASSIFICATION
Eye-fixation behavior is strongly influenced by presented stimuli, as their structure and appearance guide the user's attention (see Rayner [35] for a review; see also [36]). Therefore, our approach relies on eye-related features that were chosen because they are either independent of the stimulus structure or only marginally dependent on it. More specifically, we did not rely on saccades, areas of interest, or the coordinates of fixations.
The feature extraction process for a chosen share of data always followed the same procedure. First, we extracted 7 features, to be described later in this section, and then normalized them using a participant-specific baseline to allow for cross-participant comparisons. Baseline in this context refers to the features of a specific part of the data. For the N-back task, this baseline was taken from the instruction phase, while for the Emergency simulation we used the tutorial phase as baseline.
Normalization was performed at the participant level and involved subtracting the baseline from the segment's features and then dividing by the baseline. As a consequence, all features used reflected relative changes from the individual participant's baseline.
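Concretely, this normalization computes a relative change per feature; a minimal sketch (with hypothetical feature names) might read:

```python
def baseline_normalize(features, baseline):
    """Express each feature as a relative change from the participant's
    baseline value: (value - baseline) / baseline."""
    return {name: (value - baseline[name]) / baseline[name]
            for name, value in features.items()}
```

A pupil median of 3.3 mm against a 3.0 mm baseline thus becomes 0.1, i.e., a 10 percent increase, regardless of the participant's absolute pupil size.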
All eye event detection used SMI's built-in methods. For fixations, this is a dispersion-based algorithm with a maximum dispersion of 2 to 3 (depending on the distance between screen and user) and a minimum fixation duration of 80 ms. Blinks are defined via the gaze and pupil signal: a blink is registered when the gaze coordinates are (0, 0) or when the pupil size is zero or falls outside a dynamically computed validity range. Blinks of less than 70 ms are discarded. SMI's default algorithm interprets anything between two fixations or between a blink and a fixation as a saccade. Even though we did not use saccades for our approach, we included their detection for the sake of completeness.

Pupil-Related Features
Pupil diameter has been used to measure cognitive load for several decades. An increase in cognitive load leads to decreased parasympathetic activity in the peripheral nervous system, which, in turn, leads to an increase in pupil diameter [37]. This effect was observed consistently within a task, between tasks, and between individuals [38]. Various studies have successfully replicated this relationship within a wide range of settings, including short-term memory, language processing, reasoning, perception, as well as sustained and selective attention [18]. Pupil diameter has also successfully been used to detect cognitive load in a variety of scenarios, including driving [39], during low visual load tasks [40], route planning with maps [41], and simultaneous interpreting [42]. Furthermore, it was successfully used to differentiate expertise closely related to cognitive load [43].

We applied preprocessing steps to improve data quality of the pupil signal. First, we removed periods that were marked as blinks, as well as the 100 ms right before and after a blink. During these phases, the pupil could not be detected reliably and, as a consequence, measurements of pupil diameter suffered from reduced accuracy. We furthermore removed implausible pupil values (e.g., values of 0 mm or less, as well as values greater than 10 mm). Finally, we linearly interpolated small gaps of less than 50 ms (12 data points at a sampling rate of 250 Hz) and applied a median filter to reduce noise.
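A minimal, stdlib-only sketch of this preprocessing pipeline is shown below. The 4 ms sample spacing follows from the 250 Hz recording; the width of the median-filter window is an illustrative assumption, as it is not specified here:

```python
from statistics import median

SAMPLE_MS = 4  # one sample every 4 ms at 250 Hz

def preprocess_pupil(diam, blink_mask, pad_ms=100, gap_ms=50, k=5):
    """Clean a pupil-diameter trace (in mm): invalidate blinks plus a
    pad_ms margin, drop implausible values, linearly interpolate gaps
    shorter than gap_ms, and apply a running median filter of width k."""
    pad = pad_ms // SAMPLE_MS
    x = list(diam)
    # 1) invalidate blink samples plus the margin before and after
    for i, is_blink in enumerate(blink_mask):
        if is_blink:
            for j in range(max(0, i - pad), min(len(x), i + pad + 1)):
                x[j] = None
    # 2) invalidate implausible diameters (<= 0 mm or > 10 mm)
    x = [v if v is not None and 0 < v <= 10 else None for v in x]
    # 3) linearly interpolate gaps shorter than gap_ms
    max_gap = gap_ms // SAMPLE_MS
    i = 0
    while i < len(x):
        if x[i] is None:
            j = i
            while j < len(x) and x[j] is None:
                j += 1
            if 0 < i and j < len(x) and (j - i) <= max_gap:
                step = (x[j] - x[i - 1]) / (j - i + 1)
                for t in range(i, j):
                    x[t] = x[i - 1] + step * (t - i + 1)
            i = j
        else:
            i += 1
    # 4) running median over the valid neighborhood of each sample
    half = k // 2
    out = []
    for i, v in enumerate(x):
        if v is None:
            out.append(None)
            continue
        window = [u for u in x[max(0, i - half):i + half + 1] if u is not None]
        out.append(median(window))
    return out
```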
We selected the median of the pupil diameter as the main pupil feature because it is more robust to outliers, particularly for short sampling periods. Moreover, we utilized the maximum pupil diameter as a feature to capture spikes in the pupil signal. We expected to see an increase in both median and maximum pupil diameter with increasing cognitive load.
Moreover, we employed the Index of Cognitive Activity (ICA) as proposed by Marshall [44], [45]. It uses wavelet decomposition to detrend a pupil signal and reduce it to only short-term fluctuations where rapid pupil spikes that exceed a certain threshold can be detected. These spikes are supposedly caused by cognitive activity. For this part of the analysis, we did not apply any interpolation or filter preprocessing as it may remove the small-scale fluctuations needed for the ICA. A higher degree of cognitive load was supposed to result in an increased ICA.
As an additional, more exploratory feature, we included the standard deviation of the pupil diameter. According to the ICA, cognitive load can cause fluctuations and rapid spikes in pupil diameter. Based on this assumption, we expected a higher standard deviation of pupil diameter for higher cognitive load.

Blinks
Cognitive load influences the frequency and duration of blinks [46], [47]. Thus, increasing task difficulty was expected to increase the frequency of blinks, while increasing visual demands should lower the number of blinks [48]. We used blink frequency as a feature and expected it to increase in alignment with cognitive load in both the N-back task and the Emergency simulation.

Fixations
Fixations describe a stable gaze on the same location usually lasting between 200 ms and 350 ms [35]. Frequency of fixations is influenced by several factors. Time pressure induced by high task demands tends to increase the number of fixations while reducing their duration [49]. We expected to observe the same pattern for higher levels of cognitive load in our study. Consequently, we used the number of fixations per second as a feature.

Microsaccades
Microsaccades are small involuntary eye movements that may occur during a fixation and are associated with cognitive load. Studies reported an increase in microsaccade frequency in visually demanding tasks [50], whereas nonvisual tasks (e.g., auditory tasks or mental arithmetic) seemed to reduce their frequency [51], [52], [53].
We used the method suggested by Krejtz and colleagues [53] to detect microsaccades, which relies on thresholds to find small ballistic sequences in an otherwise fixed gaze. Instead of focusing on amplitude or velocity, we used microsaccade frequency as a feature, because the aforementioned measurements would require a sampling rate higher than 250 Hz in order to be reliable. Since both tasks involved visual presentation, we expected an increase in microsaccade frequency alongside rising cognitive load.
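To illustrate such a threshold-based detector, the sketch below follows a common velocity-threshold scheme: smooth the gaze velocity, derive a median-based noise threshold per axis, and count supra-threshold runs of a minimum length. The smoothing window, the multiplier `lam`, and the minimum run length are illustrative assumptions, not necessarily the parameters of the method actually used:

```python
import math
from statistics import median

def detect_microsaccades(gx, gy, rate_hz=250.0, lam=6.0, min_samples=3):
    """Count microsaccade candidates in a fixation's gaze trace as runs of
    samples whose smoothed velocity exceeds a median-based noise threshold
    (velocity-threshold scheme; parameters are illustrative)."""
    dt = 1.0 / rate_hz
    n = len(gx)
    vx, vy = [0.0] * n, [0.0] * n
    # 5-point moving-window velocity estimate to suppress sample noise
    for i in range(2, n - 2):
        vx[i] = (gx[i + 2] + gx[i + 1] - gx[i - 1] - gx[i - 2]) / (6 * dt)
        vy[i] = (gy[i + 2] + gy[i + 1] - gy[i - 1] - gy[i - 2]) / (6 * dt)

    def msd(v):
        # robust, median-based estimate of the velocity noise level
        med = median(v)
        return math.sqrt(median([(u - med) ** 2 for u in v])) or 1e-9

    tx, ty = lam * msd(vx), lam * msd(vy)
    count, run = 0, 0
    for i in range(n):
        if (vx[i] / tx) ** 2 + (vy[i] / ty) ** 2 > 1:  # outside threshold ellipse
            run += 1
        else:
            if run >= min_samples:
                count += 1
            run = 0
    if run >= min_samples:
        count += 1
    return count
```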

COGNITIVE LOAD DETECTION METHOD
The core of our approach was strongly inspired by Appel et al. [31], [32]. The fundamental idea was to train participant-specific classifiers for low and high cognitive load based on data from an N-back task and use their weighted predictions on the Emergency simulation. Participants that were similar during baseline periods were weighted more strongly as we expected their physiology to change under cognitive load in a similar way.

Within-Task and Within-Participant Classification
Participant-specific classifiers were trained on N-back data. As the N-back is a standard working-memory updating task that is recorded under laboratory conditions, we expected it to reflect characteristic physiological changes caused by cognitive load and to allow for generalization from this task to the Emergency simulation as described in Section 5.2.
To train a classifier that can differentiate between high and low cognitive load, we needed data from periods of high cognitive load and periods of low cognitive load during the training phase. We used the N-back task as the foundation for single-participant classifiers and considered the 0-back condition to reflect low cognitive load and the 2-back condition to represent high cognitive load. 25 non-overlapping samples with a length of 4 s each were randomly selected from both conditions, yielding 50 samples per participant that were used for training the individual classifier. We rejected samples with more than 50 percent missing values in the pupil signal and resampled to ensure that each sample contained enough information to be useful for training. These numbers represented a balanced compromise between sample size and sample length.
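The sampling step above can be sketched as follows (a minimal illustration; the function name and the retry limit are ours, and rejection here uses the 50-percent missing-value criterion from the text):

```python
import numpy as np

def draw_samples(signal, fs=250, n_samples=25, length_s=4.0,
                 max_missing=0.5, rng=None, max_tries=10000):
    """Randomly draw non-overlapping windows from a recording, rejecting
    windows whose share of missing (NaN) pupil values is too high.
    Returns a list of (start, end) index pairs."""
    if rng is None:
        rng = np.random.default_rng()
    win = int(length_s * fs)
    taken, tries = [], 0
    while len(taken) < n_samples and tries < max_tries:
        tries += 1
        start = int(rng.integers(0, len(signal) - win))
        # enforce non-overlap with previously accepted windows
        if any(start < e and start + win > s for s, e in taken):
            continue
        window = signal[start:start + win]
        if np.mean(np.isnan(window)) > max_missing:
            continue  # too much signal loss in this window, resample
        taken.append((start, start + win))
    return taken
```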
For each of the samples, we extracted the features described in Section 4. All features were then z-transformed using individual means and SDs for standardization. This scaling improved inter-participant comparability considerably and should thus help apply classifiers across participants.
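The participant-wise standardization can be sketched as (the function name is ours; the population standard deviation and the guard against constant features are implementation assumptions):

```python
import numpy as np

def z_standardize_per_participant(features, participant_ids):
    """Z-transform each feature using that participant's own mean and SD,
    so that feature scales become comparable across participants."""
    features = np.asarray(features, dtype=float)
    participant_ids = np.asarray(participant_ids)
    out = np.empty_like(features)
    for p in np.unique(participant_ids):
        mask = participant_ids == p
        mu = features[mask].mean(axis=0)
        sd = features[mask].std(axis=0)
        sd[sd == 0] = 1.0  # guard against constant features
        out[mask] = (features[mask] - mu) / sd
    return out
```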
Finally, we trained a forest of 1,000 extremely randomized trees (Extra-Trees) [54] per participant to distinguish between high and low cognitive load based on that individual participant's samples. Extra-Trees had the advantage of providing not just a decision into classes, but also class probabilities between 0 and 1. This enabled us to form a continuous scale instead of a dichotomous decision, adding further information. The output was a number between 0, when the classifier was absolutely certain that a sample was collected under low cognitive load, and 1, in case of high cognitive load. In addition, Extra-Trees tended not to overfit as quickly as other classification methods, allowing more features in conjunction with fewer samples. Moreover, Extra-Trees seemed appropriate for the goal of real-time classification of cognitive-load levels as they can be trained and evaluated rapidly. The use of 1,000 trees was empirically determined. More trees make the classifier more robust and more accurate, but require more training time. On the one hand, adding trees beyond 1,000 did not increase accuracy, neither within-participant nor cross-participant; on the other hand, reducing the number did not improve training or evaluation time in a meaningful way. In case computation time is an issue, fewer trees may be chosen to ensure acceptable execution times.
We used the Extra-Tree implementation provided by the Python toolbox scikit-learn [55].
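With scikit-learn, the per-participant training step can be sketched as follows (the paper specifies 1,000 trees; all other hyperparameters here are library defaults and the function name is ours):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def train_participant_classifier(X, y, n_trees=1000, seed=0):
    """Train one participant-specific Extra-Trees classifier on that
    participant's N-back samples (y: 0 = low load, 1 = high load)."""
    clf = ExtraTreesClassifier(n_estimators=n_trees, random_state=seed)
    clf.fit(X, y)
    return clf

# A continuous cognitive-load estimate for new samples is then the
# probability of the high-load class:
#   load = clf.predict_proba(X_new)[:, 1]
```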

Cross-Participant and Cross-Task Approach
In a next step, we combined the single-participant classifiers trained with data from the N-back task to form a composite classifier that can be applied across participants and tasks. The fundamental idea of our approach was not only to apply the classifiers trained on N-back data to the Emergency simulation, but also to weight their contribution to the final prediction according to how similar their baselines were. In this way, participants from the N-back task whose physiological features and behavioral parameters were similar to those of participants in the Emergency task were given higher weights in the final prediction. Adding this weighting substantially increased the accuracy of the combined classifier.
To verify cross-task capability, sample data from the Emergency task was needed. Therefore, we randomly sampled 25 segments of length 4s from the easy and hard version of each scenario in Emergency, resulting in 50 samples per scenario and participant. Again, these numbers represented a compromise between sample length and sample number. From these samples, we extracted the features in the same way as the N-back data including the normalization using the baseline and z-transformation. Segments extracted from the easy version of a scenario represented low cognitive load, while segments from the hard version represented high cognitive load.
For baseline comparison, features were normalized across all participants to have a mean of 0 and standard deviation of 1 so as to not inflate the importance of features on a larger scale. There were, for instance, a lot fewer blinks within one second than there were fixations and the pupil diameter in millimeters was much larger compared to the number of microsaccades per second.
The procedure can be described as follows (see Table 3 for an overview of all variables): Let p be a participant of Emergency whose cognitive load we want to classify, base_p(i) the ith baseline feature of p, sample_p a sample of p characterized by a set of features, and P the set of participants from the N-back task. Every q ∈ P has a classifier c_q that predicts a value between 0 and 1 for sample_p. We combined these predictions according to the following equations:

sim(p, q) = 1 / ( Σ_i acc_{c_q} · w_{c_q}(i) · |base_p(i) − base_q(i)| ),

pred(sample_p) = ( Σ_{q ∈ P_n} sim(p, q) · pred_{c_q}(sample_p) ) / ( Σ_{q ∈ P_n} sim(p, q) ).

Here, sim(p, q) refers to the baseline similarity between participants p and q, w_{c_q}(i) to the ith normalized feature weight of classifier c_q, acc_{c_q} to the cross-validated accuracy that classifier c_q achieved on participant q, and pred_{c_q} to the prediction of classifier c_q. This means that we drew a prediction for cognitive load from each classifier c_q and weighted these predictions according to how similar the baselines of participants p and q were. Additionally, we factored the feature weights of classifier c_q into the similarity, giving a higher weight to more important features. Feature weights can be obtained by examining the trees of a random forest and how the addition of a specific feature reduces impurity (see [56] for more details). We further factored in how well the classifier performed on its specific participant by multiplying by its accuracy score. This way, participants with more exemplary data (and consequently a good participant-specific accuracy) were weighted higher.
Dividing by the sum of all similarities normalized them and ensured that the prediction's final result was within the interval [0, 1]. P_n refers to a subset of P that is restricted to the n participants with the highest similarity. The choice of a smaller n can help to reduce computational costs in case there are many available classifiers from the N-back task. We employed n = 5 to highlight that it does not take a lot of participants to achieve accurate results.
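The weighted combination can be sketched as follows (a minimal illustration; the function and variable names and the zero-distance guard are ours, and the classifier objects are assumed to expose scikit-learn's predict_proba interface):

```python
import numpy as np

def combined_prediction(sample, target_baseline, baselines, classifiers,
                        accuracies, feature_weights, n=5):
    """Similarity-weighted combination of participant-specific classifiers.
    Similarity is the reciprocal of an accuracy- and feature-weighted
    absolute baseline distance; only the n most similar sources are used."""
    sims = []
    for base_q, w_q, acc_q in zip(baselines, feature_weights, accuracies):
        dist = acc_q * np.sum(w_q * np.abs(target_baseline - base_q))
        sims.append(1.0 / max(dist, 1e-9))       # guard against zero distance
    sims = np.asarray(sims)
    top = np.argsort(sims)[-n:]                  # n most similar participants
    preds = np.array([classifiers[i].predict_proba(sample[None, :])[0, 1]
                      for i in top])
    # normalizing by the summed similarities keeps the output in [0, 1]
    return float(np.sum(sims[top] * preds) / np.sum(sims[top]))
```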

Algorithm 1. Pseudocode Outlining Our Method for Cognitive Load Detection
P         set of all participants in the N-back task
d_{p,i}   data of participant p taken from the ith scenario of Emergency
base_p    normalized baseline of participant p
c_p       N-back-trained classifier of participant p
n         number of neighbours to consider

for q in P do                          // calculate distances between participants' baselines and make predictions
    acc ← accuracy(c_q)
    w ← featureweights(q)
    w ← w / Σ w
    dist(p, q) ← Σ acc · w · |base_p − base_q|
    for i in {1, 2, 3} do
        prediction_{q,p,i} ← c_q.predict(d_{p,i})    // prediction for each sample of d_{p,i}
    end for
end for
dist ← dist / Σ dist                   // normalize distances to sum to 1
sim ← 1 / dist                         // get similarity from the distance
P_n ← {y ∈ P | y amongst the n most similar}    // n participants with highest similarity to p
for i in {1, 2, 3} do
    for sample in d_{p,i} do
        out_p[i, sample] ← ( Σ_{q ∈ P_n} sim(p, q) · pred_{c_q}(sample) ) / ( Σ_{q ∈ P_n} sim(p, q) )
    end for
end for

Fig. 3 shows a schematic overview of our method and Algorithm 1 presents pseudo-code of our cross-participant and cross-task classification. Both serve to illustrate our approach. Table 3 clarifies the naming of variables that are part of the pseudo-code and several formulas.

Within-Task
As a frame of reference, we did not only analyze results for cross-task and cross-participant classification, but also for within-task and within-participant classification. To this end, we performed the method described in Section 5.1 for within-participant results and the approach described in Section 5.2 for cross-participant within-task results, but limited each to the participants of one task. All results reported for within-participant classification were obtained based on a 10-fold cross-validation to avoid overfitting a classifier to a specific participant and thereby artificially inflating classification accuracy.
Within-task accuracy for the Emergency task is reported for each scenario individually and is based on random samples with a length of 4 seconds that were extracted from the easy and hard version, respectively. Feature extraction was performed in the same way as described in detail for the N-back samples.
In the case that a participant reported the easy version of a scenario to be more difficult than the hard version, that scenario was excluded from the participant's results. "More difficult" refers to the average rating of cognitive demands, temporal demands, and effort as reported in the NASA TLX. For this reason, 10 participants from the first scenario and 2 from the second scenario had to be excluded. In the easy version of the first scenario, participants had their first real interaction with the simulation. It is therefore likely that the easy version felt more difficult to them than the hard version, since they were already familiar with the simulation by the time they played the hard version. Table 4 shows the detailed accuracy scores for the N-back task and Emergency task, respectively.
It is notable that the results for the N-back task were slightly better than those obtained for the Emergency task. This may be partly because of the experimental setup. The N-back task was recorded with participants using a chinrest, which helped improve quality of the eye-tracking data in general and the reliability of the pupil measurements in particular. Furthermore, difficulty remained constant over the course of one level, whereas situational difficulty varied during levels of the emergency simulation. The fact that we took random samples from easy and difficult versions of the simulation may have, thus, led to samples not reflecting the same exact degree of cognitive load, even within one participant. This, in turn, added variance to the features and made the labels "easy" and "difficult" less distinct for Emergency than for the N-back.
Comparing the drop in accuracy caused by the shift from within-participant to cross-participant classification, one can see that the drop was more pronounced for the N-back task. This was likely due to the fact that we had a larger number of participants for the Emergency simulation, meaning that it was more likely to find good matches during the baseline comparison.

Cross-Task
Cross-task and cross-participant results were obtained by applying classifiers trained on N-back data to samples from the Emergency simulation following the approach described in Section 5.2. Classification accuracy is summarized in Table 5 and ranged between 63.78 and 69.25 percent.
As expected, applying N-back classifiers to Emergency data led to a slight drop in classification accuracy as it represented a classification across different participants and tasks. Using classifiers trained on one participant and applying them to a different one introduced a certain error as the classifier did not match the participant. The same holds true for the application across tasks. Moreover, as Fig. 4 shows, feature weights also differed between the two tasks, introducing yet another source of error.
The main difference in feature importance was observed for ICA and microsaccades. Both carried considerably more importance in the Emergency simulation than they did in the N-back task. This is in line with the results of Fairclough and colleagues who found the ICA not significantly sensitive to isolated working memory tasks like the N-back task [57]. A possible explanation for the difference in microsaccade importance may be the different nature of the task. The Emergency task was a lot more visually demanding and required more widely distributed attention. This fits with findings from Duchowski and colleagues [58] that ambient visual search increases the number of microsaccades. The importance of the remaining features was slightly higher for the N-back task because ICA and microsaccades were less important and the importance of all features added up to 1.

TABLE 3
Naming of Variables Used in the Formulas and Pseudocode

c_q             classifier trained on data of participant q
acc_{c_q}       accuracy of c_q
w_{c_q}         feature weights of c_q
sample_p        a sample of participant p
pred_{c_q}(x)   prediction of c_q for a sample x
base_p          baseline features of participant p
base_q          baseline features of participant q
sim(p, q)       baseline similarity between participants p and q

For scenario 1, classification accuracy decreased from 71.91 percent for within-participant, within-task application to 69.25 percent in cross-participant, cross-task classification. A possible explanation for this good performance may be that participants did not yet have experience with the task (i.e., they all started at the same point). This "neutral" condition with regard to the experience and skills acquired may be similar to the N-back task, thus leading to a decrease in loss caused by cross-task application of classifiers. Scenario 2 showed the most pronounced decrease, from 71.98 to 63.78 percent. The major error source was the cross-task application, as cross-participant results differed only slightly from the within-participant ones for Emergency.
It seems likely that the structure of this specific scenario led to a feature distribution that differed the most from the N-back task's features, resulting in decreased accuracy.
An accuracy loss of less than 5 percentage points (from 68.91 to 64.02 percent) could be observed for scenario 3. These results were very similar to the second scenario and likely have a similar cause: cross-task application.

Fig. 5 depicts the performance of our approach depending on the number of neighbours considered. As expected, there was an increase in accuracy when increasing the number of considered neighbours, but beyond 5 the benefit was negligible, only increasing computation time. Even when just choosing the closest match, performance was only a few percentage points lower compared to a larger set of 5 or 10.
To validate our predictions not solely on a binary level, we examined the correlation between participants' questionnaire data and our continuous cognitive-load predictions. Table 6 shows Pearson correlations between cognitive load predicted by our algorithm and self-report scores. Self-reports were normalized on participant level to account for individual differences in scale use. As a frame of reference, we also included correlations between the questionnaire's different sub-scales.
Predictions made by our algorithm showed a significant correlation with self-reports. They correlated at 0.399 with self-reported cognitive demands, at 0.459 with reported temporal demands, and at 0.484 with the effort subjectively experienced by participants. The high correlation with perceived effort is a strong indicator for the validity of our predictions.

DISCUSSION
We applied a machine learning approach to the classification of cognitive load based on eye-tracking data and investigated how this approach generalizes across participants and tasks. Our results indicate a robust approach that yields a good classification accuracy of 63.78 to 69.25 percent across participants and tasks. This is above chance level and is comparable to eye-tracking-based classification results for cognitive load in other scenarios. For instance, Hogervorst et al. [59] reported roughly 68 percent accuracy in the distinction between level 0 and level 2 of the N-back task purely based on eye-related features. However, their classification algorithm was trained for each participant individually within a specific task and used intervals that were 50 s long, all of which are limitations that our approach does not have. Additionally, cognitive load predictions yielded by our method correlate at r = 0.484 with participants' self-reported invested effort, providing a second indicator of validity.

When taking a closer look at our data set, the robustness of our approach seems noteworthy. We considered noisy data in our analyses and, in the case of the Emergency game, used the tutorial as an active baseline instead of a neutral fixation cross. Furthermore, Emergency is not a well-controlled laboratory task, but a complex emergency simulation game that requires participants to identify actions for the right emergency personnel under time constraints in an environment that adaptively reacts to players' actions (e.g., spreading fires when not extinguished by fire fighters). As such, the present results seem promising as they indicate the validity of our approach even when applied to a real-world scenario with limited baseline options and complex interactions.
Moreover, due to the dynamic nature of the Emergency simulation, cognitive load was not constant over the course of one level. Closer inspection of the predictions generated by our algorithm revealed that participants seemed to start each level with a rather high predicted load that quickly dropped after the first orientation phase of about 20-30 s. Towards the end, there was also a clear difference between participants who successfully finished a level and those who did not. When participants realized they could finish on time, predicted cognitive load dropped considerably, whereas it rose for participants concerned about not finishing on time. This uneven distribution of cognitive load added to the error rates that we report. Therefore, our predictions may actually be even more accurate than what is reported because we had to rely on the overall task difficulty of a specific level as an indicator of cognitive load instead of a more direct measure (e.g., derived from interaction metrics). Generalizing a difficulty level by labeling all samples from this level as "high cognitive load" possibly introduced a kind of artificial error given the fact that there were most likely periods of lower cognitive load within the same time frame.
Additionally, the nature of our features and method is very versatile. All the features we used were either aggregated over the entire length of the segment or calculated per second. As a result, the length of segments can be adjusted at will. Longer segments are less noisy, but shorter segments better capture the cognitive load at a certain point in time. Pre-trained classifiers may be applied independent of segment length, making our approach more flexible. The same holds true for the number of classifiers that are used. When computation time is a constraint, fewer classifiers may be used for prediction, as n -the number of closest classifiers during baseline comparison -can be adjusted at will.

Limitations
There are, however, also some limitations to our approach. The biggest limitation arises from the z-transformation of features on participant level, which is required for making data of participants and tasks comparable and on a similar scale. This means that we can only reliably analyze data in hindsight and only when there are periods of both low and high cognitive load. This scaling also limits the scope with which to interpret cognitive load during different tasks. Only if two tasks share a similar difference in cognitive load across their experimental manipulations can the predictions be considered reliable. A truly objective classifier for cognitive load would need features that require no normalization to be on the same scale for all participants and circumstances. Our future goal is to compensate for environmental and individual factors by training a classifier that can objectively estimate the difficulty of a task or activity.
Scaling renders real-time application extremely difficult, too, as it requires a complete dataset to be useful. Nevertheless, one may use the presented approach in real-time scenarios, but should be cognizant that workload predictions may not be reliable in the very beginning. However, predictions will improve over time as more data becomes available and more variation in cognitive load is observed.
Moreover, one of the reasons our approach works successfully may also be considered a drawback, namely the baseline comparisons. Problems might arise when the baselines for two tasks are recorded under different conditions. For instance, cognitive load could be different in a baseline obtained while looking at a fixation cross as compared to a baseline extracted from completing a tutorial. Using the suggested process of matching participants for cross-participant and cross-task classification, this can lead to a sub-optimal distance metric and consequently an inappropriate weighting of predictions. Ideally, all baselines should evoke the same degree of cognitive load for the baseline distances to work best.
Additionally, as our analysis of the feature weights showed, a complex simulation game such as Emergency does not evoke the exact same physiological responses as a laboratory working-memory task like the N-back task. Our results indicate that, in particular, the importance of the ICA and microsaccades seems dependent on the task at hand. This hints that, although successful cross-task classification is possible, there may not be a classifier that works optimally for all tasks. In part, this may also be a result of the inadequate use of task difficulty as a proxy for cognitive load. Research by Howard et al. suggests that caution has to be exercised when comparing the difficulty of single-task paradigms with that of multi-tasking ones [60]. Training classifiers on data from different tasks and adding these tasks to the baseline comparison could be a potential solution to this problem. This way, classifiers trained on tasks that are similar in nature would be preferred.
Our study relied on visual stimuli and did not include other modalities such as auditory tasks. However, at least for working memory tasks, important features seem to react alike and independently of presentation modality. For instance, Kahneman demonstrated extensively and with many different stimuli, tasks, and modalities that working memory load impacts pupil diameter [38]. Recent research yielded similar findings for microsaccades [51] -at least for auditory and visual stimuli.

Future Research
Based on this manuscript, many avenues for potential future research are possible. A more elaborate study design that uses the same group of participants across a number of different tasks may be insightful as to what good cognitive load estimators for individual participants entail. It may also help evaluate what features are useful across a wide array of tasks and what features are specific to certain types of tasks.
Another way to further this line of research is to find ways to compensate for environmental factors, especially light. Provided that there is an adequate way to do this, we could repeat this experiment without the need to scale features, thereby estimating objective cognitive load instead of cognitive load that is relative to participant and task.
Finally, to make cognitive load estimation possible with systems that are even less intrusive than remote eye trackers and widely available, a webcam-based solution would be ideal. This, however, necessitates webcams that provide good enough resolution to achieve accurate pupil measurements. Testing with different quality levels of hardware is needed to judge the requirements and feasibility of such an approach. The final goal would be a system that can be employed in everyday life and is able to impact a broad audience, allowing for applications in a real-life situation instead of a lab environment.

CONCLUSION
In summary, we evaluated a cross-participant as well as cross-task classification algorithm that yields good accuracy. Combined with the robustness of our method and its non-invasive nature, this article, despite its limitations, provides a promising step towards out-of-the-box solutions for adaptive human-computer interaction based on the assessment and classification of users' cognitive load by means of eye-tracking data.
Tobias Appel (Student Member, IEEE) received the bachelor's and master's degrees in computer science from the University of Tübingen, in 2014 and 2017, respectively. In 2021, he completed the PhD degree in computer science at the LEAD Graduate School and Research Network and is now working at the Hector Research Institute of Education Sciences and Psychology. His research focuses on the evaluation of cognitive load based on eye tracking and other physiological sensors. In his research, he relies on machine learning to realize cross-participant and cross-task solutions.
Peter Gerjets received the Diploma in psychology from the University of Göttingen in 1991 and the PhD degree from the University of Göttingen in 1994. From 1991 to 1995, he was a research associate with the University of Göttingen. He was an assistant professor in Saarbrücken until he finished his habilitation in 2002 and moved to the Knowledge Media Research Center (Tübingen). Since 2002, he has been a principal research scientist with the Knowledge Media Research Center and a full professor for research on learning and instruction with the University of Tübingen. His research interests include multimodal and embodied interaction with digital media and learning from multimedia, hypermedia, and the web. He was honoured with the Young Scientist Award of the German Cognitive Science Society in 1999. He was on the editorial boards of several major journals. He is a member of the DGPs, APS, and EARLI, and was a coordinator of the EARLI Special Interest Group 6: Instructional Design.
Stefan Hoffmann has been developing video games for more than 30 years. He works for Serious Games Solutions, a division of Promotion Software GmbH, the software developer behind the "Emergency Lernspiel" (learning game). He develops games and gamified apps for mobile platforms, browsers, and HoloLens. He focuses on game design and project management.
Korbinian Moeller received the PhD degree in psychology from the University of Tübingen, Germany, in 2010. He is currently a professor of mathematical cognition with Loughborough University, U.K. His research interests include neurocognitive foundations of mathematical cognition and developmental psychology in the context of educational games and learning analytics.
Manuel Ninaus received the PhD degree in psychology from the University of Graz, Austria, in 2015. He is currently a postdoc with the University of Innsbruck, Austria. His research interests include neuroscience and educational psychology in the context of educational games and learning analytics. He is an elected board member of the Serious Games Society.
Christian Scharinger received the PhD degree in cognitive science from the University of Tuebingen in 2015. He is currently with the Multimodal Interaction Lab, Knowledge Media Research Center Tuebingen. He has a profound expertise in (neuro-) physiological measures like eye-tracking and EEG. In his research, he tries to combine basic and applied research areas. His research interests include memory, executive functions, learning, hypertext reading, web searching, and multimedia, with a focus on physiological measures of cognitive load.
Natalia Sevcenko received the bachelor's and master's degrees in psychology from the Eberhard Karls University of Tübingen, in 2015 and 2017, respectively. She is currently working toward the doctorate degree in psychology with the LEAD Graduate School and Research Network, University of Tübingen. Her research interests include human-machine interaction, with a focus on behavior and sensor-based measurement of cognitive states of operators and their relation to learning outcomes and personal predisposition.
Franz Wortha received the bachelor's degree in industrial engineering from the University of Applied Sciences in Dresden in 2013 and the master's degree in psychology from Technische Universität Dresden in 2016. He is currently working toward the PhD degree in psychology with the University of Greifswald. His research interests include self-regulated learning, with a focus on metacognitive and emotional processes and their relation to learning outcomes and personal predisposition.
Enkelejda Kasneci received the MSc degree in computer science from the University of Stuttgart in 2007 and the PhD degree in computer science, as a BOSCH scholar, from the University of Tübingen in 2013. She is currently a professor of computer science with the University of Tübingen, Germany, where she leads the Human-Computer Interaction Lab. From 2013 to 2015, she was a postdoctoral researcher and a Margarete-von-Wrangell fellow with the University of Tübingen. Her research interests include application of machine learning for intelligent and perceptual human-computer interaction. She is an academic editor of PlosOne and a TPC member and reviewer for several major conferences and journals. She was the recipient of the Research Prize of the Federation Südwestmetall in 2014, for her PhD research.
" For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.