The Continuous Matching Task (CMT) – real-time procedural stimulus generation for adaptive testing of attention

Abstract The Continuous Matching Task (CMT) is a novel paradigm designed to measure sustained attention and alertness. It is a special type of Continuous Performance Task (CPT) that utilizes truly continuous stimulus material. Stimuli are generated in real time by a procedural algorithm, which also enables adaptive testing. The task is highly flexible and can be used in single- or dual-task configurations that also allow for task mixing. The functionality of the algorithm and its applications are presented. The viability of the CMT is tested and results are compared with similar tasks, i.e., the Stroop task and Conners' CPT (CCPT), as well as self-reports of ADHD in adults, in a Multi-Trait-Multi-Method approach in a sample of N = 122 participants. Self-reports and measurements of heart rate variability during testing are analyzed to infer and compare mental workload during tasks. Overall, variants of the CMT induce a higher mental workload than the other tasks, and employing the dual-task CMT with adaptive difficulty resulted in the highest reliability and validity. Results indicate that the CMT is primarily a measure of alertness and processing speed and that it benefits from adaptive testing.

An integral part of perspectives on attention is that attention is not a solitary action but is performed continuously. Thus, a common procedure in tests of attention is to present a stream of stimuli on which relatively easy tasks must be performed. Continuous Performance Tasks (CPTs) are common paradigms that operationalize the continuity of attention by repeatedly presenting stimuli. However, this mode of measurement is not continuous but discrete: it only approximates a continuous demand by presenting a stream of discrete items. Notably, the type of presentation impacts task performance, with continuous stimuli being more demanding than discrete stimuli (Frank & Macnamara, 2021). Continuous stimulus material thus seems advantageous in operationalizing attention. Although computerized testing offers potential for real-time and continuous tasks, computerized tests of attention have largely utilized streams of discrete items. We surmise that aligning the task with the measured ability and testing attention with a continuous paradigm offers an opportunity to increase validity and utility in psychological assessment. This article introduces a novel paradigm for measuring sustained attention and alertness within a computerized task that is truly continuous: the Continuous Matching Task (CMT).

Continuous measurement of attention
Among the many perspectives on attention as a mental capacity, Alertness, Sustained Attention, Vigilance, Spatial Attention, Selective Attention, Focused Attention, Divided Attention, and Executive Attention are distinguished as dimensions of attention (Cohen, 2014; Goldhammer et al., 2007; Mahoney et al., 2010; Moosbrugger et al., 2006; Posner & DiGirolamo, 2000; Sturm, 2009). Besides orienting attention and selecting relevant stimuli, Posner and Petersen (1990) specifically point out the need to activate and maintain attention as a distinct performance, called Alerting. Alertness is viewed as a separate dimension of attention that is also necessary for all other dimensions.
In most paradigms measuring various types of attention, a participant is presented with stimuli and has to perform rule-based reactions; for an overview of common measurement paradigms, see Towey et al. (2019). Generally, participants are instructed to work as fast and as accurately as possible. Performance is assessed via reaction times (RT) and the accuracy of the responses: stimuli are presented, the responses are recorded, and after an inter-stimulus interval this procedure repeats. The primary workload comes from managing one's attention, while processing speed and visual perception make up a large portion of the overall demands of such tasks. A classic example of this paradigm is the Conners Continuous Performance Task (CCPT; Conners & Sitarenios, 2011). It is a member of the CPT family of attention tasks, in which participants must react to a continuous stream of items, usually in a computerized forced-pace format. Generally, attention must be maintained continually to correctly respond to relatively rare target stimuli. The task finds regular use in neuropsychological applications and has a tradition as a predictor of attention-deficit-hyperactivity disorder (ADHD), especially in children (Epstein et al., 2003). However, contradictory results exist; for example, Baggio et al. (2020) recently found no link between ADHD and CCPT performance. In a meta-analysis focused on CPTs, Huang-Pollock et al. (2012) caution against the deployment of the CCPT to measure ADHD vigilance deficits.
Another paradigm for measuring perception and attention, related to but not necessarily included in the CPT family, is the Stroop Color-Word Interference Test (Stroop, 1935). This classical paradigm is usually deployed using word-reading and color-naming conditions. It is assumed that the competing information generated by automatic and involuntary reading must be suppressed to generate a correct response, resulting in longer reaction times, a phenomenon called Stroop interference. The capacity to resolve interference is associated with selective attention and closely related to the concept of inhibition, which is central to ADHD. Despite this theoretical link, the practical association between ADHD scores and Stroop performance was shown to be low (Lansbergen et al., 2007; van Mourik et al., 2005).
In summary, traditional paradigms of measuring attention are rarely truly continuous despite aiming to measure forms of attention that constitute a continuous performance. Thus far, computerized testing has not been leveraged successfully to resolve this discrepancy by operationalizing attention using a truly continuous paradigm.

The continuous matching task (CMT)
The CMT was designed to measure the continuous performance of attention by employing many of the principles of CPTs. It differs substantially in its operationalization of continuous performance compared to other measures. It employs a continuous measurement paradigm with a heavy emphasis on computerized testing. Within the CMT, participants must perform the very basic action of imitating a target motion. A colored indicator (the target) moves up and down along one axis on the screen. A second indicator moving along the same axis (the input) is controlled by the participant with a control slider (an analog input method commonly found in control panels). The slider controls the position of the input indicator, and the participant is tasked with minimizing the distance between the target and input indicators. The target indicator is continuously moving, so constant adjustments to the input are necessary (Figure 1). Performance metrics are derived from the degree to which this matching is achieved: the closer the participant's input matches the target, the higher the performance. The CMT is trivially easy, and no rules or conditions besides the task of matching the target must be followed. Yet inattention or a lack of action will immediately result in an increased distance between the input and the target. This is in stark contrast to conventional paradigms in which distractor and target stimuli are used. In the CMT, every moment of testing is an informative target stimulus and there are no discrete items. We surmise that the CMT is potentially a measure of alertness and sustained attention that is demanding despite its minimal rules.
A noteworthy feature of the CMT is the use of analog control sliders, displayed in Figure 1. Our rationale is that for truly continuous measurement, analog input is necessary. The slider offers continuous control with only one degree of freedom. Additionally, the concept of moving something up and down in a straight line is intuitively executable for most participants. Furthermore, the motor requirements, although greater than when pushing a button, are substantially lower than manipulating a pen or a computer mouse. Because the task itself is simple, we expect that the mental workload exerted by the CMT should almost exclusively be on alertness and attention.
We assume that in the CMT, performance is primarily influenced by the underlying ability to adjust the input according to the stimulus. Additionally, motoric abilities can be assumed to be relevant, as well as a degree of motor learning and tiredness. Learning and tiredness occur over time and impact performance: according to Langner et al. (2010), after 50 minutes of simple reaction-time tasks, fatigue increases mean RT by 8%. Whitley (1969) found that after performing the Foot-Twist Tracking Task (FTTT) for 30 minutes, motor learning increased the accuracy of responses by roughly 25%. Both results indicate that fatigue and learning can be expected to be relevant for long periods of testing.
The CMT paradigm can be executed in a range of configurations: In its simplest form, a fixed difficulty is used to generate stimuli. Alternatively, changing this difficulty based on participants' performance enables the implementation of adaptive testing, which is generally associated with increased quality and efficiency of measurement (Masoner & ElBassiouny, 2020). Furthermore, the CMT allows for dual-task paradigms, either by mixing the CMT with a different task or by two CMTs performed simultaneously. This should increase workload substantially and enable the measurement of split attention and parallel processing. When two tasks are performed simultaneously, parallel processing and selective attention are required, which are associated with characteristic switching costs. When dissimilar tasks are performed simultaneously, additional mixing costs are assumed to impact performance (Gilbert & Shallice, 2002). Depending on the (dis-)similarity of tasks, mental workload varies, with similar tasks being easier to perform in parallel than different ones. Generally, mixing costs are assumed to negatively impact performance when two tasks are performed simultaneously, and the magnitude of the decrement has been linked to clinical conditions like schizophrenia (Lin et al., 2015). We assume that the differences in performance between single and dual CMT can function as an indicator of an individual's capacity for parallel processing. Further details on the different modes of measurement are elaborated in the methods section.

Mental workload (MWL)
Given the presented variants of the CMT, the mental workload (MWL) is of relevance. MWL is a general concept used to describe different types of cognitive effort required by an activity. In the case of attention tasks, the demand for attention can be inferred from the overall workload, provided the task does not require substantial higher mental processing like reasoning. High MWL in an otherwise simple task can therefore indicate a demand for attention. MWL can be measured in different ways; usually, self-reports are recorded right after the task was performed. A scale commonly used in this context is the NASA Task-Load-Index (TLX) (Hart, 2006).

Heart rate variability (HRV)
Mental workload also manifests in psychophysiological parameters, which reflect the organism's autonomic response to stress, relaxation, and heightened cognitive demand. It is assumed that increased MWL leads to a corresponding increase in alertness, which is associated with characteristic responses of the autonomic nervous system; for an overview of psychophysiological parameters of MWL, see Megaw (2005).
A physiological marker that has frequently been researched is cardiac activity in the form of heart rate variability (HRV). HRV is a by-product of the dynamic autonomic relationship between the sympathetic (SNS) and parasympathetic nervous systems (PNS). The interplay between PNS and SNS constitutes the organism's way of adapting to changing circumstances. HRV emerges from this interaction; higher variability is generally associated with a relaxed state of the underlying systems. HRV seems particularly sensitive to MWL induced by sustained attention. However, this relationship depends on how task difficulty and motivation interact in a given scenario (Veltman & Gaillard, 1993).

Aims and hypotheses
In the present study, we aimed to investigate the CMT with regard to its usability as a computerized and truly continuous test of sustained attention and alertness. Furthermore, the potential of real-time stimulus generation and adaptive testing in this setting was explored. A central point of interest is therefore determining whether the CMT is a measure of attention. The analysis primarily focuses on the MWL required and on the convergence and divergence with other instruments measuring attention. Finally, an analysis of the instruments' reliability is reported.
Hypotheses concerning validity and reliability: The first hypothesis (H1) assumes that the CMT shows moderate to high convergence (|r| > 0.5) with other sustained attention tasks (CCPT) as well as with self-reports of ADHD. For tasks that require selective attention (Stroop task), divergence marked by a low to moderate association (|r| < 0.5) was expected. For the second hypothesis (H2), we expected split-half reliability to be at least acceptable (|r| > 0.7).
Hypotheses concerning mental workload: The third hypothesis (H3) was that during CMT administration, MWL would be at least as high as during the other attention tasks. The fourth hypothesis (H4) assumed that dual tasks would induce a higher workload than single tasks, including configurations in which two different tasks are performed simultaneously. Based on previous studies, for the fifth hypothesis (H5) we expected to replicate the correlation between TLX scores and HRV indicators of MWL.

Sample
For the present study, a sample of N = 122 participants (53 female (44%), 1 diverse), aged between 19 and 64 years (M = 27, Mdn = 24, SD = 8.66), was tested. Most (n = 94) were German social science university students who received course credit for their participation. The remainder of the participants were recruited via social media and received a small gift bag including a USB drive as compensation. Initially, a sample size of N > 200 was planned; however, under the conditions of the ongoing COVID-19 pandemic, a sample size of N > 100 was deemed reasonable and sufficient for the present analysis. Informed consent was obtained from all individual participants included in the study. Testing sessions took M = 67 minutes on average (range = [58, 87], SD = 5), including the two resting periods. Table 1 summarizes the order of trials and measures within the session. The preregistration of hypotheses and methods, as well as the datasets for the present analysis, are available online at https://osf.io/97dgv/.

Apparatus and materials
Responses were collected using specially designed response boxes that feature a touch screen, input sliders, twist knobs, as well as illuminated colored buttons (illustrated in Figure 1). The device interfaced with in-house developed testing software programmed in Python, designed for running highly flexible psychological experiments that employ real-time stimuli and adaptive testing. The response box processed digital and analog inputs through a low-latency USB controller commonly used for building joysticks. As a result, there is a small dead zone in the middle of the slider travel that creates a slight step at the center of the slider motion. We concluded that this artifact (visible in Figure 2) does not meaningfully impact measurement quality.
Two screens were used in measurement, the computer's main screen and the secondary (touch) screen in the response box. Stimuli were presented on the main screen, while the touch screen was used to display and record rating scales for questionnaires. To minimize the effects of fatigue and learning as well as maintaining participant compliance and motivation, trials were limited in length, with the longest being seven minutes. For a synopsis of the testing session see Table 1.
After collection of personal information (age, gender, handedness, education, occupation, medication, level of physical activity, vision impairment, and cigarette consumption), the testing session began with a 5-minute resting period in which participants were instructed to wait and relax in a seated position without talking or distracting themselves with devices like smartphones. A second resting period was set at the end of the session.

CMT details and implementation
The CMT is driven by a custom procedural algorithm that continuously and smoothly changes the position of the target indicator between its minimum and maximum position. To avoid automaticity and create a constant requirement for attention, this motion must be unpredictable and continuously changing. Algorithmically moving the target stimulus in a manner that participants can match using an input slider can efficiently be achieved by a random walk algorithm. The algorithm is controlled by a central difficulty parameter b and produces usable output for the range b = [0, 1]. The motion of the target could be defined beforehand and thus be standardized. Although entirely standardized stimuli seem preferable from a psychometric perspective, after some initial testing we rejected this approach and decided to instead pursue the promise of innovation. This decision may seem irrational at first; however, it not only enabled us to tread new ground but also presented further methodical opportunities, for instance concerning standardization in procedural stimulus generation: After sufficient iterations, the average of a random event approaches its expected value, as does the overall output of a random walk algorithm. It follows that a stimulus generation algorithm based on a random walk will generate similar output over time. The output will be uniquely generated, but the character of motion across several testing sessions of sufficient length will be virtually identical. Provided the core behavior of the stimulus generation algorithm is determined by modulating probabilistic parameters, as opposed to deterministic rules, we surmise that a state of latent standardization can be reached. This should enable interindividual comparisons even though unique testing sequences were presented (results from a preliminary spectral analysis of CMT stimuli that support this assumption can be found in the supplemental material).
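The exact implementation is provided as commented Python code in the supplementary materials; the following is only a minimal illustrative sketch of how a difficulty-controlled random walk of this kind could be written. The function name, update rate, and smoothing constants are our assumptions and not the actual parameters of the CMT.

import random

def next_target_position(pos, velocity, b, dt=1 / 60):
    """One step of a hypothetical difficulty-controlled random walk.

    pos      -- current target position, kept within [0, 1]
    velocity -- current (smoothed) velocity of the target
    b        -- difficulty parameter in [0, 1]; higher b yields faster,
                more erratic motion
    dt       -- update interval in seconds (e.g., one display frame)
    """
    # Random acceleration whose spread grows with difficulty.
    accel = random.gauss(0.0, 0.5 + 2.0 * b)
    # Low-pass filter the velocity so the motion stays smooth and matchable.
    velocity = 0.95 * velocity + accel * dt * (0.2 + 0.8 * b)
    pos += velocity * dt
    # Reflect at the limits so the target keeps moving between its
    # minimum and maximum position instead of sticking to an edge.
    if pos < 0.0:
        pos, velocity = -pos, -velocity
    elif pos > 1.0:
        pos, velocity = 2.0 - pos, -velocity
    return pos, velocity

Because the step behavior is governed by probabilistic parameters rather than a fixed sequence, sufficiently long trials generated in this manner should converge toward the same average character of motion, which is the latent standardization argued for above.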
Another innovation made possible by real-time stimulus generation is the application of adaptive testing in this context. The core principle of adaptive testing is the modulation of difficulty based on performance, which can easily be applied in this application. Adaptive or tailored testing is usually implemented in computerized form as Computerized Adaptive Testing (CAT), which is associated with characteristic benefits such as higher power of measurement and a lower number of items necessary (Masoner & ElBassiouny, 2020). However, conventional CAT requires a sizeable pool of calibrated items across the entire spectrum of difficulty. Creating, optimizing, and calibrating an item pool is often associated with considerable effort and is based on item-response theory (IRT), for example in the form of the Rasch model (Betz & Turner, 2011; Kubinger & Draxler, 2007).
In our CAT approach, adjustments to difficulty b are continuously made based on real-time performance. Therefore, b becomes an estimate of individual performance: the better the performance, the higher the difficulty, until an equilibrium is reached. The algorithm was designed to make it practically impossible for a participant to reach the maximum difficulty value of 1, which alleviates the possibility of ceiling effects. High levels of difficulty can thus only be reached by consistently high performance. Our approach offers a unique opportunity to sidestep the requirements of conventional adaptive testing and enables CAT without the need for an item pool. Given that the assumption of latent standardization holds, it should be sufficient to calibrate the underlying algorithm instead of the stimuli themselves. Further explanations, details, commented Python code for this algorithm, and an executable demonstrating the described behavior can be found in the supplementary materials and the accompanying OSF repository.
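To illustrate the adaptive principle, a minimal sketch of a performance-coupled update of b could look as follows; the threshold and gain values are hypothetical and not taken from the actual algorithm.

def update_difficulty(b, distance, threshold=0.05, gain=0.01):
    """Illustrative adaptive update of the difficulty parameter b.

    distance  -- current absolute distance between target and input
    threshold -- distance below which tracking counts as 'good'
    gain      -- step size of the adaptation
    """
    if distance < threshold:
        # Increase by a fraction of the remaining headroom (1 - b), so the
        # maximum difficulty of 1 can be approached but never reached.
        b += gain * (1.0 - b)
    else:
        # Decrease proportionally to b, keeping the value above 0.
        b -= gain * b
    return min(max(b, 0.0), 1.0)

The multiplicative headroom term (1 - b) is one simple way to realize the property that high difficulty values can only be reached through consistently high performance.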

Forms of CMT testing and scoring
The CMT can be deployed in a single- or dual-task configuration, where the dual task can take the form of two CMTs. In this application, the two tasks run completely independently and performance can be assessed separately for both hands. Additionally, a combined performance metric for both hands can be computed. One CMT can also be deployed in combination with a different task to introduce task mixing. In all these configurations, difficulty can either be fixed or adaptive. For a period of fixed-difficulty testing, performance is expressed by the average distance between the target (t) and the input (i). Alternatively, the correlation coefficient between the two variables, r_ti, can represent overall performance in a standardized and intuitive way. For the adaptive CMT, we found this correlation coefficient to be an insufficient estimator of performance: in exploratory analyses of adaptive CMT, the correlation coefficient and the mean difference between input and target did not converge with performance. Although surprising at first glance, this can easily be explained: if an equilibrium between performance and difficulty is reached, a high correlation between target and input will be achieved regardless of where on the difficulty scale this equilibrium occurred. Consequently, the target-input correlation is a viable estimator of performance only at fixed difficulty. For adaptive testing, another metric was needed: as difficulty b estimates momentary performance, the average of b, or the integral of b over the duration of testing, becomes a reasonable estimator of overall performance.
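A minimal sketch of the two scoring routes, assuming the recorded target and input positions (and, for adaptive trials, the difficulty trace b) are available as equally long numeric sequences; NumPy is used for convenience and the names are illustrative.

import numpy as np

def score_fixed(target, inp):
    """Fixed-difficulty scoring: target-input correlation r_ti,
    plus the mean absolute distance as an alternative metric."""
    r_ti = np.corrcoef(target, inp)[0, 1]
    mean_dist = np.mean(np.abs(np.asarray(target) - np.asarray(inp)))
    return r_ti, mean_dist

def score_adaptive(b_trace, dt):
    """Adaptive scoring: integral of the difficulty curve b over the
    trial (trapezoidal rule); the mean of b would serve equally well."""
    return np.trapz(b_trace, dx=dt)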

Application in this study
Measurement began with the CMT utilizing the input sliders in the response boxes. Trials were administered in three variants: a single task (S) performed with one hand, a dual task (D) with both hands, and a mixed task (MIX) combining CMT and CCPT. In these trials, difficulty was either fixed (F) or adaptive (A). First, participants could familiarize themselves with the sliders and corresponding on-screen indicators. Then, single-hand testing with fixed difficulty (CMT-SF, b = 0.5) began, starting with the participant's main hand (M) followed by their offhand (O); both trials lasted 200 seconds. Next, the dual task with fixed difficulty (CMT-DF, b = 0.5) was administered for 420 seconds, followed by an adaptive dual-task trial of 360 seconds (CMT-DA). Later, in the mixed condition, a single CMT with fixed difficulty (b = 0.5) and the CCPT were administered simultaneously.
For fixed-difficulty testing, target-input correlations for the main hand (r_ti_M) and the offhand (r_ti_O) were calculated. Performance values for both hands were combined into an index of parallel performance, R_ti, by calculating the parallel sum R_ti = (r_ti_M × r_ti_O) / (r_ti_M + r_ti_O). The result expresses the overall performance in dual tasks; this measure was also determined for the two singular tasks. For adaptive testing, the integral of the difficulty curve throughout the trial was determined as an indicator of performance.
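In code, the parallel-sum index could be computed as follows (illustrative only; the example values in the comment are made up).

def parallel_sum(r_ti_m, r_ti_o):
    """Combined dual-task index R_ti = (r_M * r_O) / (r_M + r_O),
    the parallel sum of the two hands' target-input correlations."""
    return (r_ti_m * r_ti_o) / (r_ti_m + r_ti_o)

# Example: parallel_sum(0.7, 0.6) = 0.42 / 1.3 ≈ 0.32, i.e., the combined
# index is lower than either single-hand correlation.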

CCPT
A custom adaptation of the CCPT was deployed with the "ignore-X" paradigm in which 168 letter stimuli were presented (presentation: 250 ms, interstimulus interval: 1500 ms). Participants were asked to press a button with their offhand every time a letter appeared that was not an "X" (18 occurrences). Afterward, a second trial was deployed in which the CCPT had to be performed with the offhand and the CMT with the main hand.
Performance in the CCPT can be measured by the sensitivity of responses, expressed as detectability d′ (d prime). This signal detection theory-based measure is calculated from the number of correct and incorrect responses to targets (letters) and non-targets ("X") (Stanislaw & Todorov, 1999). Additionally, the number and mean RT of correct responses were recorded.
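A sketch of the detectability computation following the standard signal-detection formula d′ = z(hit rate) − z(false-alarm rate); the 1/(2N) correction for extreme rates is one common option discussed by Stanislaw and Todorov (1999) and is an assumption here, not necessarily the correction used in this study.

from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Detectability d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf

    def clamp(rate, n):
        # Keep rates away from 0 and 1 so the inverse normal stays finite.
        return min(max(rate, 1 / (2 * n)), 1 - 1 / (2 * n))

    n_targets = hits + misses                          # non-"X" letters
    n_nontargets = false_alarms + correct_rejections   # "X" letters
    hit_rate = clamp(hits / n_targets, n_targets)
    fa_rate = clamp(false_alarms / n_nontargets, n_nontargets)
    return z(hit_rate) - z(fa_rate)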

Stroop color-word interference task
The Stroop task was deployed in two conditions using a custom computerized adaptation. Stimuli were displayed on the main screen, and the colored buttons on the response box were used to record responses and reaction times (RT). Although there are indications that computerized and card-based Stroop tests differ in measurement (Penner et al., 2012), we found that the reported advantages of computerized testing, primarily temporal precision and the opportunity to present neutral stimuli alongside regular stimuli, outweighed these concerns (Lansbergen et al., 2007; Pilli et al., 2013). First, the word-reading condition was deployed; alongside congruent and incongruent stimuli, neutral stimuli (black words) were presented. In the color-naming condition, black squares were used as neutral stimuli. Both conditions were preceded by short trials (30 stimuli each) in which understanding of the rules was tested and feedback was given when necessary. Then the main trials consisting of 120 stimuli each were administered. In both conditions, 48 congruent, 48 incongruent, and 24 neutral stimuli were presented in a standardized order. Responses and reaction times were recorded and used to calculate interference scores.
The optimal way of scoring Stroop performance has been debated in the past (Chafetz & Matthews, 2004; Scarpina & Tagini, 2017; van Mourik et al., 2005). In the present analysis, we scored results based on recommendations by Lansbergen et al. (2007). In this method, the interference score I_R is calculated as I_R = C/CW, where C is the mean RT for responses to neutral color-naming stimuli and CW is the mean RT for responses to incongruent stimuli from the color-naming condition.
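For illustration with hypothetical reaction times: if the mean RT to neutral color-naming stimuli were C = 640 ms and the mean RT to incongruent stimuli were CW = 800 ms, then I_R = 640/800 = 0.80; stronger interference lengthens CW and pushes the ratio further below 1.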

NASA task-load-index (NASA-TLX)
The NASA Task-Load-Index (NASA-TLX) is a general instrument that is commonly used for gathering self-reports on MWL (Hart, 2006). Six items with a 20-point scale are used to collect judgments concerning the task demands in the domains mental, physical, temporal, performance, effort, and frustration. We employed a German version with the commonly used simple scoring method, in which the responses are directly interpreted as scores of the dimensions. The TLX was deployed five times in total, after the trials whose ratings we deemed most informative: fixed- and adaptive-difficulty CMT, CCPT, and the MIX task, as well as after both Stroop task conditions (Table 1).
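TLX-Mn, the aggregate used in the workload analyses below, is read here as the unweighted mean of the six dimension ratings under this simple scoring scheme; this interpretation and the example ratings are our assumptions.

def tlx_scores(ratings):
    """Simple ('raw') TLX scoring: the six 20-point ratings are taken
    directly as dimension scores; TLX-Mn is their unweighted mean."""
    tlx_mn = sum(ratings.values()) / len(ratings)
    return ratings, tlx_mn

# Example with made-up ratings:
# tlx_scores({"mental": 15, "physical": 6, "temporal": 12,
#             "performance": 9, "effort": 14, "frustration": 8})
# yields TLX-Mn = 64 / 6 ≈ 10.7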

Heart rate variability analysis
During the entire testing phase, cardiac activity was recorded using Bittium Faros™ 180 devices. Automatic QRS detection was used to determine the interbeat intervals (RR intervals) from the recorded ECG. Analysis of these intervals was performed in Kubios Premium 3.4.3 (Tarvainen et al., 2014) and resulted in a wide range of HRV and cardiac parameters. Time-domain, frequency-domain, and non-linear indicators can be used to interpret HRV dynamics (Shaffer & Ginsberg, 2017). Some indices have been shown to reflect cognitive activity, stress, and mental workload (Kim et al., 2018). In past studies, relations between subjective workload ratings (TLX) and HRV parameters were analyzed to identify relevant associations (Luque-Casado et al., 2016). Delliaux et al. (2019) found a significant association (r = −0.61) between TLX scores and the correlation dimension (D2) of an HRV time series. D2 estimates the underlying complexity of the system generating a time series and is lowered under load. Thus, we selected D2 (correlation dimension) as the primary physiological measure of MWL. Additionally, the root mean square of successive RR interval differences (RMSSD), a common measure of HRV (Shaffer & Ginsberg, 2017), was determined. For the trials, recommendations by the Task Force of the European Society of Cardiology (1996) were followed.
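While the D2 estimation was left to Kubios, RMSSD follows a simple standard definition; a minimal sketch, assuming the RR intervals are available in milliseconds:

import numpy as np

def rmssd(rr_intervals_ms):
    """Root mean square of successive differences of RR intervals (ms)."""
    diffs = np.diff(np.asarray(rr_intervals_ms, dtype=float))
    return np.sqrt(np.mean(diffs ** 2))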

Statistical analyses
Correlative multi-trait-multi-method analysis (MTMM)
To assess construct validity as laid out for hypothesis 1, scores from the CMT, CCPT, Stroop task, and the Conners' Adult ADHD Rating Scales (CAARS) were entered into a Multi-Trait-Multi-Method analysis (Campbell & Fiske, 1959). Different methods of testing were used to gather data on theoretically converging and diverging constructs. Corresponding correlations between the measurements can thus indicate convergent and discriminant validity. In the present analysis, CMT scores were assumed to be measures of alertness and sustained attention; both constructs should overlap with CCPT scores but diverge from Stroop scores, which are reported to measure selective attention. Furthermore, convergence was assumed between CMT and CAARS scores. Hypothesis 2 was tested by determining the split-half reliability of the CMT, correlating performance metrics from the first half of the trials with those from the second half.
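A minimal sketch of this split-half procedure, assuming each participant's trial is available as paired target and input arrays and that a scoring function returning a single performance value is supplied (the data structure is an illustrative assumption):

import numpy as np

def split_half_reliability(trials, score):
    """Correlate, across participants, the performance metric computed
    on the first half of a trial with that of the second half.

    trials -- list of per-participant dicts with equally long
              'target' and 'input' arrays (illustrative structure)
    score  -- function mapping (target, input) to one performance value
    """
    firsts, seconds = [], []
    for trial in trials:
        n = len(trial["target"]) // 2
        firsts.append(score(trial["target"][:n], trial["input"][:n]))
        seconds.append(score(trial["target"][n:], trial["input"][n:]))
    return np.corrcoef(firsts, seconds)[0, 1]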

Repeated measurements analysis of variance
For hypothesis 3, the dynamics of MWL during testing were assumed to depend on the type of task. Using the TLX-Mn scores and the HRV index D2, two one-way repeated-measures ANOVAs were conducted to evaluate the effect of task type on indicators of MWL. As noted by Luque-Casado et al. (2013), physiological relaxation and habituation to the testing situation have to be expected over the course of testing. Thus, the difference in D2 between the two resting periods was entered as a covariate in the analysis of D2 scores. Post-hoc mean comparisons of the MWL indicators were then used to test hypothesis 4. Additionally, the correlation of TLX-Mn and D2 across trials was calculated to test hypothesis 5.
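A sketch of how such an analysis could be run with the pingouin package; the column names are illustrative, and the covariate adjustment for D2 (the difference between the two resting periods) would require an additional modeling step that is not shown here.

import pingouin as pg

def mwl_anova(long_df, dv):
    """One-way repeated-measures ANOVA of a workload indicator across
    task types, plus Bonferroni-corrected pairwise comparisons.

    long_df -- long-format DataFrame with one row per participant and
               trial and columns 'participant', 'task', and the dv
    dv      -- e.g., 'TLX_Mn' or 'D2' (assumed column names)
    """
    aov = pg.rm_anova(data=long_df, dv=dv,
                      within="task", subject="participant")
    # pairwise_tests is called pairwise_ttests in older pingouin versions.
    posthoc = pg.pairwise_tests(data=long_df, dv=dv,
                                within="task", subject="participant",
                                padjust="bonf")
    return aov, posthoc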

Exploratory analyses
In addition to hypothesis testing, multiple mean comparisons (t-tests) of CMT and CCPT scores in the various trials were performed. Performance metrics in dual tasks were compared to those in single tasks. Additional comparisons were also undertaken to further explore the differences in performance under various task conditions.
Beyond the measures presented thus far, a range of instruments not essential to the present inquiry was also deployed. These measures were self-report questionnaires on sensory-processing sensitivity, personality, and adverse childhood experiences. They were administered throughout the session to allow for some recovery after performance tasks and results were gathered for a separate research project.
Alongside laboratory testing, participants were invited to take part in a mobile variant of this study in which similar measures were deployed using their smartphones; results will be published separately.

CMT
Mean comparisons indicate significant differences between main hand and offhand performance in all CMT trials. Furthermore, significant differences were observed between the combined performance metrics of single and dual-task CMT. In CCPT trials, d′ differed significantly between the single-task and mixed trials (see Table 2).
Comparing CMT performance metrics between the single- and dual-task applications also shows significant differences. Average main- and offhand performance differed during all trials, indicating that performance is not distributed evenly between hands. The performance difference between hands was relatively small in the single trials and much more pronounced in dual tasks. Furthermore, combined performance was significantly lower in dual-task applications than in single-task trials. This observation was, however, not repeated in the mixed-task trial: CMT performance in the mixed trial was higher than in the dual-task trial but lower than in the single-task trial.
Adaptive testing did not correspond to higher MWL compared to fixed-difficulty testing, and performances could not be directly compared between the two modes as they employ different indicators of performance (for adaptive CMT, the integral of the difficulty curve was used to score performance).

CCPT
CCPT performance was primarily evaluated using d′ for the single CCPT (M = 3.03, SD = 0.75) and the MIX trial (M = 3.31, SD = 0.66). Additionally, the number of correct responses and mean reaction times for correct responses in milliseconds were evaluated. Table 3 shows performance metrics for all CMT and CCPT trials. Of note, average CCPT performance was higher in the mixed trial compared to the single trial.

Validity analysis
A Multi-Trait-Multi-Method analysis was conducted to assess construct validity using performance data from CMT, CCPT, and Stroop trials, as well as CAARS scales. Correlations are presented in Table 4, with split-half reliabilities on the diagonal where applicable; values lay in the range r_sh = [0.14, 0.91].
The convergence between CMT and CCPT assumed in hypothesis 1 was not observed. Associations between CMT performance and CCPT detectability were close to zero in most cases; only adaptive CMT scores showed small relations. Associations between CMT performance and Stroop interference were similarly low, which, however, was in line with our assumptions. While CMT performance did not correspond to detectability and interference, it showed associations with reaction times in these tasks.
With regard to ADHD symptoms, the assumed convergence was also largely not observed: some convergence was indicated for adaptive-difficulty CMT, but it did not reach the expected magnitude. Interestingly, single-task CMT performance showed small effects in the assumed direction. The validity as defined by the hypotheses could thus not, or only partially, be confirmed.
Internal consistency was evaluated using split-half reliability. In single-task CMT, performance in the first half of the trial was not strongly correlated with performance in the second half. In dual-task applications, and especially in the adaptive tasks, correlations were substantial and support hypothesis 2.

Analysis of mental workload
TLX scores for the dimensions mental, physical, temporal, performance, effort, and frustration were gathered from the five measurement points. HRV was assessed throughout the trials by determining D2 and RMSSD. Means and standard deviations of TLX scores and HRV indices are summarized in Table 5. Scores were analyzed using two repeated measurement analyses of variance.
Estimated marginal means and their Bonferroni-corrected pairwise comparisons of TLX-Mn and D2 were calculated and are displayed in Figure 3; details for all pairwise comparisons are reported in the supplemental material. Concerning the research question, the following differences (annotated in Figure 3) are of note: For D2, pairwise comparisons were not significant between the single-task CMT trials (CMT-SFM, CMT-SFO; d = [−0.19, −0.06]), all of which differed significantly from the pre-rest measurement (all p < .001, d = [0.43, 0.66]). During CCPT testing, D2 rose significantly compared to the preceding and subsequent measurements (d = [−0.46, 0.31]). During all trials containing CMT variants, D2 was significantly lower than in the Stroop trials (all p < .001, d = [−0.68, −0.20]), with the only exception being MIX and Stroop-W, for which no difference was observed (p = .123, d = −0.20). The negative association of physiological and self-reported indicators of MWL reported in the literature was also observed in the substantial correlation of means across trials, r_tlx_d2 = −0.72.
Indicators of MWL across the tasks in the testing session suggest that CMT trials were generally more demanding than the other tasks (Stroop and single CCPT). Across all trials, self-reports of MWL as well as physiological measures indicated significantly higher MWL during CMT trials. Trials in which dual tasks were performed showed the highest MWL. This was the case for dual CMT trials as well as the mixed trial (CCPT and CMT), with the former being more demanding than the latter. Our results indicate that the CMT successfully induces greater MWL compared to the other attention tasks and can be used in dual-task applications (H3 and H4).

Additional observations
Upon completion, some participants, without being explicitly prompted, reported that they found the CMT to be more demanding and frustrating than the other tasks. Furthermore, some participants reported that they felt unevenness, delay, or roughness in the motion of specifically the right slider; some participants even claimed that the slider was physically damaged or harder to move. Response boxes and software were thus thoroughly and repeatedly tested for irregularities in performance; however, none were found. Input and screen latency, slider motion, and software performance were as expected. We conclude that some participants recognized the irregularity caused by the sliders' dead zone but could not explicitly call it out. As most participants were right-handed, they were likely more aware of this side, which led to more reports of this slider being problematic although left and right sliders were identical.

Discussion
In the present article, the newly developed Continuous Matching Task (CMT) was presented and evaluated in an initial application. The task was designed to measure sustained attention and alertness with a computerized continuous paradigm. The unique characteristics of the CMT allow for real-time stimulus generation and adaptive testing.
Based on the correlations in the Multi-Trait-Multi-Method analysis, we conclude that the Stroop test and CCPT measure different types of performance than the CMT. Neither selective attention nor sustained attention seems to align with CMT performance. Given that the task induces a substantial workload and participants report it to be demanding, the type of performance measured was at first unclear. However, inspecting the reaction times of the CCPT and Stroop tasks provides further insight in this regard. While performance metrics did not converge, CMT scores showed substantial associations with the reaction times in CCPT and Stroop trials. Reaction times are considered to be an indicator of alertness (Appelle & Oswald, 1974). Thus, we believe that, concerning the components of attention laid out by Posner and Petersen (1990) (orienting, selecting, alerting), CMT performance seems to depend primarily on alerting. Consequently, the CMT appears not to be a test of sustained attention, as we initially expected, but a measure of continuous alertness.
The convergence between CMT scores and ADHD symptoms did not emerge in the magnitude expected in hypothesis 1. Nevertheless, associations in line with the assumption were observed, and the link between CMT performance and ADHD was the strongest of all the tasks. These observations contradict the assumption that the CCPT is a predictor of ADHD symptoms (Epstein et al., 2003) and align with findings to the contrary (Baggio et al., 2020): neither CCPT performance scores nor CCPT reaction times showed meaningful associations with ADHD symptoms, whereas CMT performance (particularly in the single task) showed small correlations. We therefore conclude that our initial hypothesis was overambitious in this regard. The associations at least indicate a promising link between CMT performance and ADHD, and future studies should further investigate this matter with appropriate hypotheses.
Interestingly, average performance in the CCPT was higher when the CMT was performed at the same time (mixed trial), while CMT performance decreased compared to the single-task condition. Simultaneously, MWL was increased in the mixed trial. These observations are somewhat contradictory to the concept of mixing costs, which points toward a decrease in performance in both tasks (Gilbert & Shallice, 2002); such effects were observed in the performance difference between single and dual-task CMT. We suspect that, in the mixed trial, participants directed more attentional resources toward the CCPT than the CMT, thus increasing performance in the former and decreasing it in the latter. The increase in CCPT performance in the mixed trial as compared to the single trial might be explained by participants learning the task. Alternatively, performing the CMT simultaneously may have increased intrinsic alertness, which was beneficial for CCPT performance. This assumption is supported by results from Matthias et al. (2010), who found that the level of intrinsic alertness positively influenced attentional weighting and performance. This observation requires further investigation but also supports the notion that the CMT is an alertness task.
For CMT performances of the main and offhand, slight differences could be observed in the single trials that were more pronounced during dual trials. We thus suspect that there is a small innate performance difference between hands that gets accentuated by an uneven distribution of processing resources during dual tasks.
A key aspect of our analysis was the development of MWL throughout the trials. The other tasks we deployed within this study rely on rather complex rules that participants must follow, whereas the CMT employs only one simple instruction. Yet it induces higher MWL, which we believe to be a result of the continuous nature of the task and which aligns with previous research (Frank & Macnamara, 2021). For the duration of testing, every moment is relevant to the task and the measurement. We presume this leads to increased MWL and constitutes one of the key contributions of this work. As combined CMT performance was lower in dual-task than in single-task trials while MWL was increased, we assume that this difference in performance is a direct manifestation of switching costs during dual tasks. Consequently, the CMT may be a viable instrument in clinical assessment, specifically of schizophrenia (Lin et al., 2015).
From a psychometric perspective, adaptive testing of attention could successfully be performed using the CMT. Furthermore, we surmise that the procedural generation of stimuli was successful and beneficial. Besides its novelty in the field of attention assessment, it enables not only arbitrary test lengths but also adaptive testing. In this study, two advantages of adaptive testing became apparent: First, internal consistency increased drastically, indicating more efficient and accurate measurement. Second, adaptive-testing performance showed more convergence with ADHD symptoms than fixed-difficulty testing, indicating higher validity. These observations are in line with the efficiency gains associated with adaptive testing and indicate that they apply to the CMT as well.

Contribution
The CMT paradigm offers a novel method of measuring alertness and reaction speed in a truly continuous form. This mode of testing induces high degrees of MWL with a minimal ruleset and high flexibility. The real-time procedural generation of stimulus material is a novel approach to this area of assessment which allows for arbitrary test durations without repetition. Another major contribution is the introduction of real-time adaptive testing to this field which increases reliability and validity.

Limitations
To reduce session length and increase participant compliance, TLX ratings were not gathered for all trials, and trials were kept relatively short. TLX reports for all trials would have been informative and longer task durations might have allowed for a more detailed perspective on sustained attention and alertness. In the scope of an initial application, however, we considered the selected task durations to be sufficient. Initially, moderate associations with ADHD symptoms were assumed which, in hindsight, resulted in overambitious and therefore relatively conservative hypotheses. For an initial application, less restrictive assumptions or even a broader instrument to compare with may have been preferable. A primary shortcoming of our design was the fixed order of trials. Over the time of the testing session, fatigue and relaxation likely influenced physiological parameters. For HRV analysis, we accounted for changes between resting periods which appeared to be the most viable course. Ideally, a randomized order of trials could be used in mixed model analysis to account for the effect of time-on-task. In our current configuration, this approach was not feasible because the fixed order of trials caused substantial collinearity between time-on-task and task type. Nevertheless, we believe that our results are clear enough to identify MWL differences in tasks.

Outlook
Future research into the CMT and its application should include longer task durations. We suspect that over 30 minutes, detailed progressions of alertness should be measurable. As was indicated in previous research, longer time on task likely causes both fatigue and learning in participants, which should be observable in longer trials. Consequently, the order of trials should be altered between subjects. Additionally, we see potential in alternating the conditions in which the CMT is applied as the performance was shown to be sensitive to changes in task conditions. In the present study, singular, parallel, and mixed tasks were employed. As an extension, other critical parameters of attention tasks, like reorienting attention during the performance, filtering distractions, or mixing sensory modes, seem worth investigating. The observed increase in performance in tasks that are performed simultaneously warrants further investigation to assess its precise origin and circumstances.
Regarding clinical applications, the CMT showed some potential in the diagnosis of ADHD. Furthermore, performance metrics from the single and dual-task paradigm could be informative regarding clinical conditions such as schizophrenia. Leveraging the strengths of computerized adaptive testing, multimedia stimulus material, and the CMT paradigm could be especially fruitful. Real-time performance feedback could also easily be implemented and might have interesting motivational implications.

Conclusion
In summary, the CMT appears to be a test of alertness and reaction speed that functions well without discrete items by employing truly continuous measurement. The task induces high MWL with a minimal ruleset and can be deployed in various configurations. In the present study, dual-task configurations were deployed and numerous modifications are possible. The task benefits from the increased reliability provided by the innovative approach to adaptive testing and procedural stimulus generation. We believe the CMT successfully delivers on many of its initial goals. The task effectively induces MWL and even surpasses similar tasks. The real-time generation of stimulus material is not only innovative for the assessment of attention and processing speed but also offers advantages in utility. It enables arbitrary timeframes of testing without repetition of stimulus material. Furthermore, this approach enables adaptive testing in this field, which is innovative, and was shown to increase reliability. These benefits have thus far only manifested in relatively short-term testing. We believe that these advantages transition to longer testing sessions and might even be accentuated in such applications.

Ethical approval
This study was performed in line with the principles of the Declaration of Helsinki. The study was approved by the ethics committee of the [Helmut-Schmidt-University/University of the Federal Armed Forces Hamburg]. Informed consent was obtained from all individual participants included in the study.

Disclosure statement
No potential conflict of interest was reported by the author(s).