Hybrid statistical and machine learning modeling of cognitive neuroscience data

The nested data structure is prevalent in cognitive measurement experiments because observations are taken repeatedly from different brain locations within subjects. Analysis methods for this data type should account for the dependency structure among the repeated measurements. However, this dependency is largely ignored in the cognitive neuroscience data analysis literature. We consider both statistical and machine learning methods extended to repeated-measures data and compare distinct algorithms in terms of their advantages and disadvantages. Unlike basic algorithm comparison studies, this article analyzes novel neuroscience data while accounting for the dependency structure, for the first time, with several statistical and machine learning methods and their hybrid forms. In addition, the fitting performances of the different algorithms are compared using contaminated data sets and a cross-validation approach. One of our findings suggests that the GLMM tree, with random term indices indicating the location of functional near-infrared spectroscopy optodes nested within experimental units, shows the best predictive performance, with the lowest MSE, RMSE, and MAE model performance metrics. However, there is a trade-off between accuracy and speed, since this algorithm requires the highest computational time.


Introduction
There are approximately 100 billion neurons in the human brain [43], and every neuron may pass through almost 10^15 synaptic connections to communicate with another one [26]. As a non-stop process, the brain functions as an extensive data system. Considering the brain's complexity, the data from neuroscience studies require appropriate analysis to extract features representing brain functionality. Many experimental studies focus on explaining how the brain functions and what factors are related to specific changes in the brain [13,15,16,21,22]. As many experimental data sets about the brain include repeated measurements from the same experimental units, this data structure is regarded as longitudinal or clustered data. Before describing the data and the methods used in this paper, we review studies that apply such methods to data from cognitive science.
In exploring naturalistic interactions in parent-child dyads, Nguyen et al. [28] analyzed functional near-infrared spectroscopy (fNIRS) hyperscanning data. They constructed a generalized linear mixed-effects model (GLMM) with Wavelet Transform Coherence (WTC), a measure of interpersonal neural synchrony, as the response variable. The fixed effects entered in the model were condition, with levels of cooperation; region of interest, with four levels, namely left/right dorsolateral prefrontal cortex (dlPFC) and left/right temporoparietal junction (TPJ); and pairing, indicating true or random pairs. Their proposed GLMM was composed of both the main effects of these fixed effects and the interactions among these variables. The contributions of the interaction between the condition and pairing variables, and of region of interest as a fixed-effect term, were significant. In terms of random effects, their model included a random slope for the condition variable and a random intercept for dyads. In their study of the Leiden 85-plus cohort, Spagnoli et al. [41] utilized a mixed-effects logit model on the Mini-Mental State Examination (MMSE) index to assess the cognitive functioning of older adults. One of their findings indicates that subjects with higher education levels have a lower probability of cognitive impairment, while cognitive functioning tends to decrease with age.
In addition to statistical methods, machine learning (ML) algorithms are widely used for classification and regression problems. However, we have found only a few studies implementing ML techniques on neural data in the literature, and these mainly concentrate on classification problems. In this part, we review studies on electroencephalography (EEG) and fNIRS, since similar ML methods are implemented to analyze these two neuroscience modalities.
There are few studies in the literature on the analysis of fNIRS data with ML algorithms [11,12,17,40]. For instance, Sitaram et al. [40] used a support vector machine (SVM) to differentiate left-hand and right-hand imagery from localized NIRS signals. According to their findings, the average accuracy of the SVM classifier was 73% across all volunteers. Additionally, Girouard et al. [12] implemented k-nearest neighbors (kNN) with k = 3 on fNIRS data to classify two difficulty levels of the Pacman game and the resting state of participants. The average classification accuracy of kNN for distinguishing these three conditions was 76.7%. In their experimental study on estimating working memory load in a flight simulator using fNIRS, Gateau et al. [11] utilized an SVM-based classifier to differentiate between task difficulty levels, namely low working memory load and high working memory load. With their estimator tested on 19 pilots, they reached a classification accuracy of 80%.
In addition to the fNIRS studies, we also review EEG data analysis cases in which ML algorithms are used. For instance, Plotnikov et al. [32] used an SVM to analyze subjects' boredom/flow conditions from EEG data while they were playing Tetris games at different levels. Considering the average accuracy over all participants, their 4-class SVM provides an accuracy of 57%. Papakostas et al. [29] investigated EEG signal data to predict the outcome of a sequence learning task using SVM, Random Forest (RF), Extra Tree (ET), and Gradient Boosting (GB) algorithms. They obtained the highest average accuracy, 74%, from the GB classifier.
Although the predictive capabilities of the previously mentioned methods are sufficient to capture information in the data, their results are questionable because of the assumptions required to implement these algorithms. The methods used in the studies above depend on independence among measurement points. However, this assumption may not hold for repeated measurements, and violating it may cause undetectable bias in predictions. To overcome this limitation, we suggest methods that consider the dependency among repeated measures in neuroscience data analyses. In the current study, we focus on the linear mixed model (LMM), in which the dependency structure among repeated measures is taken into account while also adjusting for fixed effects and their interactions. However, the LMM depends on assumptions such as normality and linearity. Because of these limitations, we relax the assumptions of the LMM by using hybrid ML methods that merge the LMM with ML algorithms to analyze repeated data. The hybrid ML methods chosen in this study are all alternatives to the LMM, and they can handle nested structures in addition to repeated measurements. To the best of our knowledge, this is the first attempt in the literature to consider hybrid ML methods for repeated data in cognitive studies.
Further, we compare the prediction performances and computational speeds of the LMM and the hybrid ML algorithms, and we discuss the conceptual differences between these algorithms and the statistical methods used here. The algorithms used in this study are the Generalized Linear Mixed-Effects Model Tree (GLMM tree), the Random Effects Expectation-Maximization Tree (RE-EM tree), the unbiased RE-EM tree, the Longitudinal Classification and Regression Tree (LongCART), and Gaussian Process Boosting (GPBoost). Furthermore, the data were perturbed to simulate the behavior of the methods under contamination, and the relative changes in RMSE and MAE of all methods were compared. Also, cross-validation is implemented to compare the predictive performances of the LMM and hybrid methods on an unseen data set.
In the literature, there are several other available methods, such as Mixed Effect Machine Learning (MEML) [27], Mixed-Effects Random Forest (MERF) [14], Repeated Measures Random Forests (RMRF) [4], and Mixed Effects Neural Networks (MENETS) [45]. However, we were unable to use them for the following reasons. The values (i.e. responses) are collected over indices nested within subjects. Since the data points appear to be scattered along an invisible line in our data plots, which are located in the text and Appendix (Figures 4(b), 6-9(b), and 10), the nested structure should be added to the modeling. A Likelihood Ratio Test (LRT) comparing models with and without the nested structure supported it as well. Because RMRF [4] and MEML [27] do not allow us to include a nested structure in the modeling, they were not added to the methods sections. For MERF, the LongituRF package requires a time variable [14], and it is not possible to define a time variable for our data set. Lastly, the MENETS [45] method is written in MATLAB. We used the R programming language for the analysis part to be consistent in method comparisons.
The rest of the study is organized as follows. The following section describes the cognitive data design and descriptive statistics. The methodology, including the LMM and the hybrid ML algorithms, is detailed in the third section. The results and the comparison of the algorithms under contamination and cross-validation are given in the fourth and fifth sections. Finally, the sixth section provides the conclusion.

Cognitive data
This section describes the data investigated in this study together with summary statistics of the corresponding variables. We utilized open-access data in vendor-agnostic and vendor-specific formats for the n-back task from a cognitive science study [38]. The raw data set was prepared for analysis with the data processing methods described in the following paragraphs of this section.
Twenty-six healthy participants (9 males, 17 females) took part in this study; their ages ranged from 17 to 33 years. The variable names and their descriptions are listed in Table 1. During the n-back experiment, a series of numbers is shown to the participants as visual stimuli. The principle of the task is that they should remember the number seen n steps before. There were three sessions in the experiment, and in each session participants were exposed to nine equally distributed 0-, 2-, and 3-back tasks. If the number on the screen matches the one seen 2 or 3 steps before, the participants are supposed to press the target button (i.e. 7), while they should press 8 otherwise. To illustrate, in a 2-back task participants should press button 7 when the number is the same as the one seen 2 steps before. As the 0-back condition is a control used to identify whether the subject is still performing the task, we conducted our analyses on the data from the 2- and 3-back conditions. During the task period, participants press 7 or 8 with their right index and right middle fingers.
Each session consists of a 20 s instruction and a 40 s task period. A short beep (250 ms) indicates the beginning and end of the task, and an additional STOP word (1 s) is displayed on the screen at the end of the task. There is also a rest period in which a fixation cross is shown on the screen, provided so that the participants' brain state can return to baseline. The experiment was designed and conducted by [38], and the open raw data can be found at http://doc.ml.tu-berlin.de/simultaneous_EEG_NIRS/. After gathering the raw data from the source, the Modified Beer-Lambert Law (MBLL) is applied to convert optical density measures to oxyhemoglobin (HbO) and deoxyhemoglobin (HbR) concentrations. Biologically, blood flow increases to an active brain region [5]. This results in decreasing HbR and increasing HbO concentrations [30], which are indirect indicators of cortical activity. The statistical analyses are conducted using only HbO concentrations, since Hoshi [18] stated that these are the most reliable measures of regional cerebral blood flow alterations.
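Per channel, the MBLL conversion amounts to solving a small linear system relating optical-density changes at two wavelengths to HbO/HbR concentration changes. A minimal sketch follows; the extinction-coefficient matrix, source-detector distance, and differential pathlength factor below are illustrative placeholders, not the values used in the actual pipeline:

```python
import numpy as np

def mbll(delta_od, ext_coef, distance, dpf):
    """Modified Beer-Lambert Law: delta_OD = E @ delta_c * d * DPF,
    so delta_c = inv(E) @ delta_OD / (d * DPF).
    delta_od : (2,) OD changes at the two wavelengths
    ext_coef : (2, 2) extinction coefficients, rows = wavelengths, cols = (HbO, HbR)
    distance : source-detector separation; dpf : differential pathlength factor
    """
    return np.linalg.solve(ext_coef, delta_od) / (distance * dpf)

# Illustrative (not physiological) coefficients:
E = np.array([[1.0, 3.0],
              [2.5, 0.8]])
hbo, hbr = mbll(np.array([0.02, 0.015]), E, distance=3.0, dpf=6.0)
```

In practice the extinction coefficients are tabulated per wavelength, and the conversion is applied sample by sample to each optode channel.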
After obtaining these measures, a Butterworth low-pass filter is applied through the butter() function in MATLAB. Then, observations were separated into blocks (i.e. 0-, 2-, and 3-back) with the help of the proc_segmentation() command. The rest of the analyses, the LMM and the hybrid ML methods, are implemented in the R programming language [33].
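For readers working outside MATLAB, the low-pass step can be sketched with SciPy's analogue of butter(); the cutoff frequency, filter order, and sampling rate below are assumptions for illustration, not the settings of the original preprocessing:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(signal, fs, cutoff=0.5, order=4):
    """Zero-phase Butterworth low-pass filter (SciPy analogue of MATLAB's
    butter + filtfilt). cutoff (Hz) and order are illustrative choices."""
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, signal)

fs = 10.0                                  # assumed fNIRS sampling rate (Hz)
t = np.arange(0, 60, 1 / fs)
# Slow hemodynamic-like component plus fast noise:
raw = np.sin(2 * np.pi * 0.1 * t) + 0.3 * np.sin(2 * np.pi * 2.0 * t)
hbo_filtered = lowpass(raw, fs)            # fast component is attenuated
```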
The descriptive statistics for the quantitative variables, covering both age and the experiment-related variables, are summarized in Table 2. The distribution of MeanRT according to the n-back conditions across experiment sessions can be seen in Figure 1 for a randomly selected subject. Error bars are added to each bar to indicate the variability of the obtained measures. According to the bar plot in Figure 1, the highest MeanRT occurred in the 3-back condition in all sessions. Specifically, the highest MeanRT is 0.96 ± 0.24, which belongs to the 3-back condition in the first session. Also, the 2- and 3-back conditions have higher MeanRT than the 0-back condition, in which participants do not have to remember previously seen numbers. The MeanRT plots of all other subjects can be examined at https://github.com/statserenay/Subjects_MeanRT_Plots.
The distribution of the response variable indicating mean HbO measures (named values) is obtained for each subject in the data, and the resulting plot is shown in Figure 2. According to this plot, the distribution of the response variable differs among subjects. To illustrate, while the distribution is symmetric for subject 7, it is right-skewed for subject 3.
In Figure 1 in Appendix 1.3, a 3-dimensional scatter plot examines how the response variable is associated with MeanRT and Accuracy jointly. This graph indicates that the mean HbO values differ for each subject and fluctuate between observations. The measurements taken from different locations (i.e. optode locations) within subjects are given in Figures 2 and 3 in Appendix 1.3, and different characteristics are observed in these plots. The first insight from the graphs is that the nesting of locations within individuals should be added to the model, so that the intra-observational dependency structure is included in the analysis. In addition, the apparent differences observed in the graphs according to the n-back labels indicate that this variable may also have significant effects on the analysis results.

Methods
In this study, two distinct models are considered for the analyses: one includes only a random effect for the subject variable, while the other takes into account the nested structure of indices within subjects. After implementing the algorithms on the data, we examine three goodness-of-fit measures: the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), and the Mean Absolute Error (MAE). These results are first obtained for the full data set. Then we contaminate the data to check the performance differences of the methods against unusual observations. Lastly, model validation is performed through a train and test data set approach, and the results are summarized to compare the fitting performances of the different methods. Before moving on to the data analysis results, the methods are explained in the following subsections.
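The three goodness-of-fit measures compared throughout the paper can be computed in a few lines; a minimal sketch:

```python
import numpy as np

def fit_metrics(y_true, y_pred):
    """Return MSE, RMSE and MAE for a vector of predictions."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    mse = np.mean(err ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "MAE": np.mean(np.abs(err))}

m = fit_metrics([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
# errors are (-0.5, 0, 0.5): MSE = 1/6, RMSE = sqrt(1/6), MAE = 1/3
```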

Linear mixed model (LMM)
The LMM is a powerful statistical technique used to model data with repeated measurements, one of the most common data structures in neuroscience (cognitive) studies. Its advantage over the classical linear regression model is that random effect terms can be incorporated into the model to express differences among observational units.
The general form of the LMM for m subjects is given in Equation (1) [25]:

y_i = X_i β + Z_i u_i + ε_i,  i = 1, ..., m,  (1)

where y_i is an (n_i × 1) vector of observations on the response for the i-th individual, X_i is an (n_i × p) design matrix of p known regressors, β is a (p × 1) vector of fixed-effect coefficients, Z_i is an (n_i × q) known random-effects matrix, u_i is a (q × 1) vector of random effects, and ε_i is an (n_i × 1) vector of error terms.
As in many statistical models, some assumptions are made about the LMM components. For instance, one basic assumption for the model in Equation (1) is that the errors are independently normally distributed with mean 0 and covariance matrix R_i. Like the errors, the random-effect coefficients are assumed to follow a normal distribution with mean 0 and a (q × q) variance-covariance matrix D. Since the random-effect coefficients and random errors are independently normally distributed and enter the LMM linearly, the response variable also follows a normal distribution with mean X_i β and covariance matrix V_i = Z_i D Z_i' + R_i. The probability density of the response for the i-th subject can then be written as

f(y_i) = (2π)^(-n_i/2) |V_i|^(-1/2) exp{-(1/2)(y_i - X_i β)' V_i^(-1) (y_i - X_i β)},  (2)

where n_i is the number of observations for the i-th subject. Hence, the log-likelihood function is

ℓ(β, θ) = c - (1/2) Σ_{i=1}^{m} [log|V_i| + (y_i - X_i β)' V_i^(-1) (y_i - X_i β)],  (3)

where θ is the vector of all variance-covariance components and c is a constant term. To find the Maximum Likelihood Estimators (MLEs) of the model parameters, we differentiate the log-likelihood function with respect to the corresponding parameters. However, maximizing the log-likelihood does not yield closed-form parameter estimates, so optimization algorithms such as Expectation-Maximization (EM) are used. Details on the EM algorithm for the LMM can be found in [44].
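The per-subject contribution to this log-likelihood can be evaluated directly from the quantities defined above. A small numerical sketch, with R_i simplified to σ²I for brevity:

```python
import numpy as np

def subject_loglik(y_i, X_i, Z_i, beta, D, sigma2):
    """Log-density of y_i ~ N(X_i beta, V_i) with V_i = Z_i D Z_i' + sigma2*I,
    i.e. one subject's contribution to the LMM log-likelihood (errors taken
    as i.i.d. here, R_i = sigma2*I, for simplicity)."""
    n_i = len(y_i)
    V = Z_i @ D @ Z_i.T + sigma2 * np.eye(n_i)
    resid = y_i - X_i @ beta
    _, logdet = np.linalg.slogdet(V)              # stable log-determinant
    quad = resid @ np.linalg.solve(V, resid)      # quadratic form via solve
    return -0.5 * (n_i * np.log(2 * np.pi) + logdet + quad)
```

Summing this quantity over subjects gives the log-likelihood in Equation (3), up to the constant c.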

Generalized linear mixed-effects model trees (GLMM trees)
GLMM trees were first proposed by Fokkema et al. [9] in 2018 to detect interactions among subgroups of treatments in clustered data. These models are composed of two parts, a global part and a local part. The global part of the GLMM tree includes the random effect terms, estimated using all observations. The local part constitutes the fixed effect terms, estimated locally: the algorithm partitions the observations in the data set based on additional variables known as partitioning variables.
Then, it estimates a fixed-effect model in each of the separated portions of the data [7].
The GLMM tree algorithm has the flexibility to detect subgroups because it builds on model-based recursive partitioning (MOB), introduced by Zeileis et al. [47] in 2008. MOB assumes that a single global model, such as a generalized linear model (GLM), may not fully describe the data. Hence, it looks for possible data partitions with respect to additional regressors to detect whether there is a better fit in each partition cell. To illustrate the working principle of MOB, we first explain the procedure for a model with only fixed-effect terms, and then explain the method for clustered or longitudinal data structures.

Model-based recursive partitioning
The logic of MOB is that in some cases the data may not be described well by a single global model such as a GLM, and the data can be partitioned with respect to additional covariates. In such cases, better fits may be found in each partition. For instance, one can fit a global GLM to examine the treatment effect on the response, but this effect will be the same for all observations, as the fitted model has a single coefficient for the treatment indicator. If distinct groups exhibit different treatment effects, partitioning the data with respect to other regressors is needed to obtain separate models over different sets of observations. A single global GLM modeling the expected response y_i given covariates x_i through a link function can be written as

g(μ_i) = x_i' β,  (4)

where x_i is the vector of predictor values for the i-th observation and β is the vector of fixed-effect regression coefficients. In this formulation, μ_i, the expected response given the predictors x_i, is modeled linearly through the link function g(). Such a single model may not provide the best fit when partitioning variables exist, as it assumes the same coefficients for all data points. The MOB algorithm takes the other regressors into account while finding data partitions and can fit better local models; to achieve this, it applies parameter stability tests over a set of partitioning variables.
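To make the intuition concrete, the following toy sketch contrasts one global least-squares fit with two local fits on a candidate partition. Note that this is only an illustration of why local models can help: actual MOB uses score-based parameter-stability tests, not this direct error comparison.

```python
import numpy as np

def split_gain(y, x, z, threshold):
    """Reduction in sum of squared errors when a single global fit of y on x
    is replaced by two local fits on the partition z <= threshold."""
    def sse(yy, xx):
        X = np.column_stack([np.ones_like(xx), xx])
        coef, *_ = np.linalg.lstsq(X, yy, rcond=None)
        return np.sum((yy - X @ coef) ** 2)
    left = z <= threshold
    return sse(y, x) - (sse(y[left], x[left]) + sse(y[~left], x[~left]))

# Data with a genuine subgroup difference in the treatment effect:
rng = np.random.default_rng(1)
z = rng.uniform(size=200)
x = rng.normal(size=200)
slope = np.where(z <= 0.5, 2.0, -2.0)      # the effect flips between partitions
y = slope * x + rng.normal(scale=0.1, size=200)
gain = split_gain(y, x, z, 0.5)            # local models fit far better here
```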

Including random effects
As mentioned in the previous sections, an LMM is more appropriate for analyzing clustered or repeated data structures. The GLM in (4) extended for the analysis of such data sets can be written as

g(μ_i) = x_i' β + z_i' b.  (5)

If the model contains only a random intercept, z_i is an M-component indicator vector whose m-th entry (m = 1, ..., M) is 1, while the other entries are 0, where m is the cluster of the i-th observation. Furthermore, b is a random vector of length M whose m-th element is the random intercept for the m-th cluster.
Although the random part of Equation (5) can account for the clustered structure of the data, the global fixed-effects part x_i' β may still not fit the data well. Considering this limitation of the mixed-effects model, Fokkema et al. [9] proposed the GLMM tree method, in which the fixed-effect coefficients can also partition the data.
The resulting model is

g(μ_ij) = x_i' β_j + z_i' b,  (6)

where the fixed-effect coefficients β_j constitute the local part, their values depending on the terminal node j, while the random effects b correspond to the global part of the algorithm. The steps for estimating Equation (6) with the GLMM tree algorithm are listed below.
Step 0: Set r = 0 and initialize all random-effect predictions b̂(0) to 0.
Step 1: Set r = r + 1. Fit a GLM tree using z_i' b̂(r−1) as an offset.
Step 2: Estimate the mixed-effects model g(μ_ij) = x_i' β_j + z_i' b with the terminal nodes j(r) from the GLM tree in Step 1, and obtain posterior predictions b̂(r) from the fitted model.
Step 3: Repeat Step 1 and Step 2 until convergence.
As the random effects are initially unknown, the algorithm starts by setting them to 0. In each iteration, the algorithm recursively fits the GLM tree in Step 1 so that the fixed and random effects can be re-estimated in Step 2. There is no partitioning of the random effects; these are estimated globally. The fixed effects, on the other hand, are estimated locally in each partition cell.
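The alternating scheme above can be sketched in a few lines. This simplified version replaces the Step-2 mixed-model fit with plain cluster means of the tree residuals, so it illustrates only the structure of the iteration, not glmertree itself (which refits a full lme4 model per iteration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def glmm_tree_sketch(y, X, cluster, n_iter=20):
    """Alternate between (1) fitting a tree on y minus the current
    random-intercept predictions and (2) re-estimating the intercepts
    from the tree residuals. `cluster` holds integer cluster labels."""
    clusters = np.unique(cluster)
    b = {c: 0.0 for c in clusters}                     # Step 0: b = 0
    for _ in range(n_iter):
        offset = np.array([b[c] for c in cluster])
        tree = DecisionTreeRegressor(max_depth=2, random_state=0)
        tree.fit(X, y - offset)                        # Step 1: tree on adjusted y
        resid = y - tree.predict(X)
        for c in clusters:                             # Step 2: update intercepts
            b[c] = resid[cluster == c].mean()
    return tree, b
```

On simulated data with known cluster intercepts, the recovered intercepts match the true ones up to a common shift (the tree absorbs the overall level).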
GLMM trees can be fitted using the lmertree function from the glmertree package in R [8]. The glmertree package utilizes the partykit package [19] to obtain the partitions and the lme4 package [1] to fit the mixed-effects model.

Random effects expectation maximization tree (RE-EM tree)
The RE-EM tree was first introduced by Sela and Simonoff [37]. The most general LMM form can be specified with a functional form for the relation between the fixed effect terms and the response. The underlying model, with the distributional assumptions on the error term and the random effects, can be expressed as

y_it = f(x_it) + Z_it b_i + ε_it,  b_i ∼ N(0, D),  ε_i = (ε_i1, ..., ε_iT_i)' ∼ N(0, R_i),  (7)

where i indexes the longitudinal data objects (i = 1, ..., n) and t (t = 1, ..., T_i) indexes the repeated measures, which are typically taken through time. That is, objects in the data have multiple observations, each represented by a vector of K variables x_it = (x_it1, ..., x_itK). The function f is introduced to allow different forms for the relationship between the variables included as fixed effects and the numerical response. Also, the known design matrix Z_it corresponds to the random effect terms, and b_i is the unknown random effect coefficient representing the object-specific effects.
To fit the model in Equation (7), one can use the MLE technique with numerical algorithms such as EM. Depending on the structure and characteristics of the data set to be analyzed, these traditional methods may suffer from some problems. For instance, the LMM assumes a parametric form for the function f (where f = Xβ). This assumption might be too restrictive because f is generally unknown, and fitting a linear model on the data may not be the best choice: the linear model may not be flexible enough to approximate the real relation between the response and the explanatory variables, which may bias the estimate of f. Sela and Simonoff [37] proposed a method named the Random Effects/EM tree, or RE-EM tree. Their method for longitudinal or clustered data is reminiscent of the EM algorithm, and it estimates f while incorporating the random effects b_i. In this method, distinct observations from the same subject may be included in different nodes. The advantage of the RE-EM tree is its flexible structure for the fixed effects compared with the traditional LMM approach. We now explain the estimation methodology of the RE-EM tree.
The steps for estimating the RE-EM tree are given as follows: Step 0: Initialize by setting the estimated random effects, b̂_i, to zero.
Step 1: Iterate the following two steps until the estimated random effects converge, i.e. until the change in the likelihood or restricted likelihood function is less than a specified tolerance:
a. Fit a regression tree approximating the function f, with target variable y_it − Z_it b̂_i and attributes x_it = (x_it1, ..., x_itK), for i = 1, ..., n and t = 1, ..., T_i. Using this tree, extract a set of indicator variables I(x_it ∈ g_p), where g_p ranges over all terminal nodes.
b. Estimate the mixed-effects model y_it = Z_it b_i + I(x_it ∈ g_p) μ_p + ε_it and obtain b̂_i from the fitted model.
Step 2: Replace the predicted value at each terminal node of the fitted tree with the estimated mean μ̂_p obtained from the mixed-effects model in Step 1b.
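The loop above can be sketched compactly. The Step-1b mixed model is approximated here by simple alternating averages over terminal nodes and subjects, which keeps the sketch self-contained; the R REEMtree function instead fits that model with lme():

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def reem_sketch(y, X, subject, tol=1e-6, max_iter=50):
    """RE-EM-style loop with a random intercept per subject.
    `subject` holds integer labels 0..S-1."""
    b = np.zeros(len(np.unique(subject)))              # Step 0
    for _ in range(max_iter):
        tree = DecisionTreeRegressor(max_depth=2, random_state=0)
        tree.fit(X, y - b[subject])                    # Step 1a: target y - Z b
        leaf = tree.apply(X)                           # terminal-node membership
        mu = {g: (y - b[subject])[leaf == g].mean() for g in np.unique(leaf)}
        fitted = np.array([mu[g] for g in leaf])
        b_new = np.array([(y - fitted)[subject == s].mean()
                          for s in np.unique(subject)])  # Step 1b (approx.)
        if np.max(np.abs(b_new - b)) < tol:            # convergence check on b
            b = b_new
            break
        b = b_new
    return tree, mu, b                                 # Step 2: node means mu
```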
The REEMtree function from the R package of the same name can be used to fit the RE-EM tree. This function uses rpart() from the rpart package [42] to construct the regression tree and lme() from the nlme package [31] to fit the LMM part of the RE-EM tree.

Unbiased regression trees for longitudinal and clustered data
The RE-EM tree explained in the previous section incorporates a mixed-effects regression model into a tree-based algorithm. One advantage of this technique is that the RE-EM tree can assume a general structure for the relation between the fixed effects and the response, making it more robust to parametric assumptions than models that handle random effects and regression trees with only fixed effects. The RE-EM tree's drawback is that it uses CART as the tree-based algorithm for splitting the data into nodes. CART, introduced by Breiman et al. [3], is one of the most widely used algorithms for constructing regression tree models for longitudinal and multivariate data. However, many researchers argue that CART has limitations such as overfitting and selection bias. Although pruning can avoid the overfitting problem, unsatisfactory results can still be obtained due to CART's variable selection (splitting) bias: CART tends to select variables with a large number of potential split points [10].
Fu and Simonoff [10] revised the RE-EM tree algorithm, suggesting the conditional inference tree (ctree) of Hothorn et al. [20] in place of CART in Step 1a to construct an unbiased regression tree for longitudinal and clustered data, thereby overcoming the bias problem.
The unbiased RE-EM tree can be implemented through the REEMctree function from the party package in R.

Gaussian process boosting
Boosting [36] is a general technique for improving the predictive capability of an ML algorithm. Besides increasing predictive performance, boosting with trees as base learners can deal with multicollinearity, non-linearities, discontinuities, and high-order interactions. It also makes the method robust against potential outliers, and it can be used on data sets with missing values in the explanatory variables without losing information [39]. On the other hand, boosting has limitations: the observations are assumed to be independent, so the algorithm cannot work accurately when the residuals are correlated. Besides the correlation structure, boosting has difficulty with categorical variables with many levels.
Williams and Rasmussen [34] define a Gaussian process as 'a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions'. These processes allow flexible non-parametric models that provide superior predictive accuracy and make probabilistic predictions. As mentioned in the previous sections, mixed-effects models are widely used techniques for panel or longitudinal data, where observations may be correlated due to the grouping structure. These models can also handle categorical variables with high cardinality.
In both Gaussian process and mixed-effects regression models, the mean is assumed to be either zero or a linear function of the given independent variables, while the structured residual variation is modeled by a zero-mean Gaussian process or by random effects. However, assuming a zero mean or linearity is generally not realistic, and it may be necessary to relax these assumptions to obtain high predictive performance [39].
Considering the difficulties above, Sigrist [39] proposed a technique combining tree boosting with Gaussian process and mixed-effects models. Further details on this algorithm are provided in Appendix 1.1.
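The core idea, boosting the fixed part of the model while re-estimating a grouped random effect, can be caricatured as follows. The fixed shrinkage factor below is an illustrative stand-in for the likelihood-based covariance-parameter estimation of the real GPBoost algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gpboost_sketch(y, X, group, n_rounds=50, lr=0.1, shrink=0.9):
    """Toy combination of tree boosting (fixed part) with a shrunken
    random intercept per group. `group` holds integer labels 0..G-1."""
    F = np.zeros_like(y)                       # boosted fixed-effects part
    groups = np.unique(group)
    b = np.zeros(len(groups))                  # random intercepts
    trees = []
    for _ in range(n_rounds):
        resid = y - F - b[group]
        t = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, resid)
        F += lr * t.predict(X)                 # gradient-boosting step
        trees.append(t)
        resid = y - F
        b = shrink * np.array([resid[group == g].mean() for g in groups])
    return trees, b, F
```

Because the trees never see the group labels, the group-level variation that boosting cannot absorb is picked up by the intercepts, which is the division of labor GPBoost formalizes.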
The gpboost package in R can be used to implement this method.

Methodology and results
In this section, we describe the methodology used for analyzing the cognitive data, together with the analysis results. The steps of the analyses can be seen in Figure 3. First, the LMM is constructed using the lme function from the nlme package in R. The modeling scheme uses LRTs to identify significant variables at the 5% significance level, following a strategy similar to backward elimination. We initialize the model with all variables and remove the terms with p-values higher than 0.05. In each step of the variable selection process, we check Variance Inflation Factors (VIFs) to identify whether the model exhibits a multicollinearity problem. The significance of the interaction terms is also checked. The parsimonious model contains the n-back and MeanRT variables.
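Each elimination step boils down to a likelihood ratio test between nested models; a sketch, with hypothetical log-likelihood values standing in for fitted models:

```python
from scipy.stats import chi2

def lr_test(loglik_full, loglik_reduced, df_diff):
    """Likelihood Ratio Test between nested models: the reduced model drops
    df_diff parameter(s) from the full model. Returns the test statistic
    and its asymptotic chi-square p-value."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2.sf(stat, df_diff)

# Hypothetical log-likelihoods for models with and without one term:
stat, p = lr_test(loglik_full=-512.3, loglik_reduced=-518.9, df_diff=1)
keep_term = p < 0.05        # the dropped term is retained when p < 0.05
```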
Furthermore, four different covariance structures, namely the autoregressive process of order 1 (AR(1)), continuous AR(1), a general positive-definite matrix, and a compound-symmetric matrix for the random effects, are employed, and these models are compared using the Akaike Information Criterion (AIC). The results are shown in Table 3, and the final decision is based on the model with the lowest AIC. Second, the GLMM tree algorithm proposed by Fokkema et al. [9] is carried out using the lmertree function from the glmertree package in R [8]. We implemented this method both with subjects as the only random effect and with indices nested within subjects. The resulting regression trees are shown in Figures 4 and 5 in Appendix 1.3, respectively.
We first describe the general tree terminology and then use it to comment on the plotted GLMM trees for the models without and with the nested structure. In a tree, the variables are represented as nodes, and these variables have a significant association with the response, the mean HbO values. The variable at the origin of the splitting is called the root node, while the nodes at the end of the structure are known as terminal nodes; each line connecting two nodes is called an edge. We use this terminology to interpret the part of the GLMM tree shown in Figure 4 in the Appendix. In this representation, Accuracy is the root node, as this variable initializes the process. In other words, the most important variable in splitting the data is Accuracy, which corresponds to the ratio of correct answers to the total responses given by the participants. The algorithm first splits the tree on Accuracy at 98.147, forming two heterogeneous groups: the data are partitioned according to whether Accuracy is less than or greater than the value shown on the edge, 98.147. Since the experimental design requires remembering the numbers only over a short time, it does not demand tremendous cognitive effort, so accuracy is generally high.
The second split assigns observations in Sessions 1 and 2 to the left branch, which is further subdivided by MeanRT, while observations in Session 3 are assigned to the right branch. This can be explained by the fact that there is no significant difference in mean HbO values between the first and second levels of the session variable, while Session 3 differs from these two levels.
Following a similar strategy with the other variables (Age, MeanRT, Gender, and n-back), the tree is finalized by the terminal nodes, which can be seen in the last row of the tree in Figure 4 in Appendix. Overall, the tree stratifies the observations into 19 regions of the predictor space. In this part of the tree, the distribution of the HbO values is shown within each terminal node. For instance, the distribution of HbO values in Node 2 is right-skewed, while the observations in Node 7 exhibit a more symmetric distribution. Similar conclusions can be drawn for the other partitions of the data observed in the remaining nodes. We now compare the GLMM trees with and without the nested data structure. As in the case without the nested structure in Figure 4 in Appendix, Accuracy is the most crucial variable in determining HbO values in Figure 5 in Appendix, which presents the nested-structure results, and the Session variable follows it. In both plots, Sessions 1 and 2 are on one branch while Session 3 is on another. Unlike the pattern without the nested data structure, the fourth split occurs on the Gender variable rather than MeanRT. Following a similar interpretation strategy, the splitting variables and edge values of the rest of the GLMM tree in Figure 4 can be compared with the corresponding parts of Figure 5 in Appendix.
As a third method, the RE-EM tree, first introduced by Sela and Simonoff [37], is constructed with the help of the REEMtree function from the REEMtree package in R. After estimating the RE-EM tree, the unbiased RE-EM tree developed by Fu and Simonoff [10] is constructed through the REEMctree function from the party package in R. The functions constructing the RE-EM tree and its unbiased version allow the methods to be implemented with different correlation structures. The results for the RE-EM tree are summarized in Tables 1 and 2 in Appendix 1.4. When AIC values are compared for the trees with different correlation structures, AR(1) is selected for the model without a nested structure, while compound symmetry is preferred for the nested model. After deciding on the models with the lowest AIC values, performance measures are summarized in Table 5 in Appendix to compare the models with the default correlation structure in the REEMtree function in R. The same procedure is repeated for the unbiased RE-EM tree, but log-likelihood values are compared since AIC is not available in the R function. The results are displayed in Tables 3 and 4 in Appendix 1.4. According to these tables, the log-likelihood is highest when compound symmetry is assumed for the model without the nested structure, while AR(1) provides better results for the model with the nested structure. Hence, the performance metrics are obtained for the unbiased RE-EM tree with these correlation structures, and the results are presented in Table 6 in Appendix 1.4. As seen in Tables 5 and 6, there are no large differences among the performance measures of the methods with different correlation structures for the RE-EM tree and its unbiased version compared to the cases using the default correlation structure. Hence, the results for the hybrid methods are provided with the default correlation structure in the next section. Additionally, we have implemented the longitudinal classification and regression tree (LongCART) algorithm developed by Kundu and Harezlak [23]. This algorithm can be utilized to analyze
longitudinal data containing several distinct subgroups, as the true parameter values of the mixed-effects model can differ among these subgroups. For instance, Raudenbush [35] argued that although it may be reasonable to assume that all subjects in a study exhibit growth following a common function, the magnitude of this change might differ among observation units. Apart from differences in magnitude, in some studies the direction of change may vary over time. To illustrate such a situation, Raudenbush [35] gave the example of an analysis of depression: while some patients might experience a decrease in their depression level, others might have increasing depression levels. In such cases, a common parametric-form assumption cannot capture the subgroup differences in the changes. As Kundu and Harezlak [23] argued, their proposed LongCART method can overcome such limitations by taking a two-step approach. However, the function in R requires partitioning variables that do not change over time, and only the age and gender variables satisfy this condition in our data. After specifying all possible combinations of these two variables, the algorithm converged only when gender alone was used as a partitioning variable. However, the resulting model indicated no split, so the performance measures of LongCART are not calculated and are not reported in the following table. The implementation of this algorithm can be achieved using the LongCART function from the LongCART package in R [24]. The working mechanism of LongCART is not explained in the third section, as model convergence was not achieved; instead, the algorithm is detailed in Appendix 1.2.
We have finalized the analyses with the GPBoost algorithm proposed by Sigrist [39], using the gpboost package in R. The performance measures of the methods are presented in Table 4. According to Table 4, the GLMM tree with the nested structure outperforms the other cases in terms of all three model performance metrics, as it leads to the lowest RMSE, MSE, and MAE values. On the other hand, the time required for convergence of the GLMM tree algorithm is the highest among the methods.
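The three performance metrics used throughout can be computed directly from the fitted and actual values. A minimal sketch (the numbers below are illustrative, not values from the study):

```python
import math

def mse(actual, fitted):
    # Mean squared error: average squared residual
    return sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual)

def rmse(actual, fitted):
    # Root mean squared error: square root of the MSE
    return math.sqrt(mse(actual, fitted))

def mae(actual, fitted):
    # Mean absolute error: average absolute residual
    return sum(abs(a - f) for a, f in zip(actual, fitted)) / len(actual)

actual = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]
print(mse(actual, fitted), rmse(actual, fitted), mae(actual, fitted))
```

Because RMSE and MSE are monotone transformations of each other, a model ranking by one implies the same ranking by the other; MAE can reorder models when large residuals are rare.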
It is also instructive to examine how the fitted and actual values are related. To this end, fitted versus actual values plots are constructed for each method, with and without the nested data structure, to check how close these values are. The plots for the GLMM tree, which has the best predictive performance, with and without the nested structure, are illustrated in Figure 4. The scatter plot on the left of Figure 4 is the output of the model with the indices nested within subjects, while the one on the right is the output of the model without the nested structure. As expected from an accurate model, the actual and predicted values are spread around a straight line on the left. We conclude that the performance of the GLMM tree is better when we include indices nested within the levels of the subject variable.
The fitted versus actual values plots for the other methods are included in Figures 6-9 in Appendix 1.3. The plot of the RE-EM tree with the nested structure (part (a) of Figure 7) is also promising in terms of prediction accuracy. This model's performance criteria are also close to those of the chosen model (the GLMM tree with the nested structure), with a much lower computation time.
In addition to the plots of fitted versus actual values across all subjects, we partition the plots by subject for the case without the nested data structure (Figure 10 in Appendix 1.3). Considering the nested data structure, we randomly select one subject and obtain the plot of fitted versus actual response values for each index (i.e. the location of an optode) (Figure 5). In Appendix, we include only the GLMM tree plot to illustrate the model performance for each subject in Figure 10. The performance of this method for Subject 7 across different indices is shown in Figure 5. When Figure 10 is examined to gauge the predictive performance of the GLMM tree, the observations for each subject mostly deviate from the line along which predictions equal the actual response values. On the other hand, according to Figure 5, most points scatter closer to the line for the randomly selected subject. Compared to the model without the nested structure, we conclude that the model makes predictions closest to the observed response values when we introduce a term for indices nested within the levels of the subject variable. The rest of the lattice plots for the first five subjects can be found at https://github.com/statserenay. The GPBoost algorithm is also constructed with different optimizers: Gradient Descent (GD), Weighted Least Squares (WLS), Nelder-Mead (NM), and Broyden-Fletcher-Goldfarb-Shanno (BFGS). The performance metrics calculated from the GPBoost algorithm with the different optimizer functions are presented in Table 5. According to these results, although each technique provides close results, the one with the lowest RMSE is WLS, the default optimizer for Gaussian data.
As a last step of this section, the data are contaminated to examine how the methods perform when high values are included in the response. The range of the response is examined, and the previously mentioned models are refitted on data sets obtained by adding values corresponding to 10 and 15 standard deviations from the mean response to a randomly selected 10% of the response values, respectively. The relative changes in the model performance metrics for these two contamination levels are obtained, and the results are presented in Table 6. The results indicate that the GPBoost and LMM methods tend to be less affected than the other algorithms, as they exhibit smaller changes in the model performance metrics. In addition to assessing fitting performance on the contaminated data sets, we also investigate how the models perform on unseen data. To this end, the original data set is partitioned into train and test data: 10% of the observations are randomly selected to construct the test data set. When obtaining predictions for test observations from the nested RE-EM tree fitted on the train data, only the subject level can be used [37]; that is to say, the nested structure is not taken into account in the calculation of the model performance metrics for this approach. The model validation results are shown in Table 7. In addition, since prediction on unseen data is not supported for the unbiased RE-EM tree, we did not include this method in the table. According to the results in Table 7, the predictions of the GLMM tree with the nested structure are better than those of the other methods under cross-validation.
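The contamination scheme is straightforward to reproduce in outline: shift a random 10% of the response values to a point 10 or 15 standard deviations from the mean, and track the relative change in a performance metric. A schematic sketch on synthetic data (not the fNIRS responses; function names are our own):

```python
import random
import statistics

def contaminate(y, k_sd, frac=0.10, seed=42):
    # Shift a random `frac` of the responses to mean + k_sd standard deviations
    rng = random.Random(seed)
    mu, sd = statistics.mean(y), statistics.stdev(y)
    out = list(y)
    for i in rng.sample(range(len(y)), int(len(y) * frac)):
        out[i] = mu + k_sd * sd
    return out

def relative_change(metric_contaminated, metric_original):
    # Relative change (%) as reported in Table 6: 100 * (M2 - M1) / M1
    return 100.0 * (metric_contaminated - metric_original) / metric_original

rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(100)]  # synthetic responses
y10 = contaminate(y, k_sd=10)  # 10-SD contamination
y15 = contaminate(y, k_sd=15)  # 15-SD contamination
```

Each model is then refitted on `y10` and `y15`, and the resulting metrics are compared to those from the clean data via `relative_change`.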

Comparison of different algorithms
In this section, we explain the differences among the implemented methods, focusing on the pros and cons of pure prediction (ML) algorithms and LMM. The aim of studies comparing the performance of different ML algorithms is to determine under which conditions a specific algorithm significantly outperforms the others on a given problem, rather than to decide whether one algorithm is best overall [2]. In this study, our aim is likewise not to determine which algorithm is superior, as the results may differ under different circumstances. Instead, we focus on the differences between LMM and the pure prediction algorithms and examine their performance on data from cognitive research. As in many studies comparing the performance metrics and theoretical settings of different ML algorithms, we also explain the potential advantages and disadvantages of the applied methods for this data set. First, we present the main differences between LMM and the ML algorithms, and then compare the algorithms built from a similar theoretical framework.
The cognitive data set was not merely used as a tool for algorithm comparison in this study; on the contrary, it was intentionally chosen because it had not been analyzed with appropriate methods in the literature. Such data structures, which can reveal many unknowns in neuroscience, have been analyzed with overly simplistic approaches; for example, the dependency structure between repeated observations has been ignored.
Efron [6] explained the differences between traditional regression models and pure prediction algorithms in terms of the required assumptions, the scientific philosophy behind the methods, and their objectives. As Efron [6] expressed, the first difference arises from the assumptions that must be satisfied to implement the methods. ML algorithms are non-parametric approaches to data analysis; they are free of the assumptions required to implement LMM, a parametric approach. For instance, an adequate LMM should exhibit normally distributed residuals, while there is no such requirement for applying an ML algorithm.
The second difference arises because the researcher pursues model parsimony in traditional regression models, as variable selection can be conducted through significance testing procedures. The pure prediction algorithms, on the other hand, decide on the predictors without considering scientific knowledge, resulting in uncontrolled variable selection. To illustrate, we obtain the parsimonious LMM after implementing LRTs, whereas we include all the variables in the data when running the ML algorithms.
Third, generalization of the results is possible only for LMM, provided the sample represents the population. The form of the pure prediction algorithms, in contrast, can change with the data set in a particular study, as these methods learn from the training data.
So far we have discussed the differences between LMM and ML algorithms from the perspective of their advantages and disadvantages. We now focus on the similarities and dissimilarities of these methods based on their theoretical background. As mentioned, LMM is the most widely used method for analyzing data with a clustered structure, which the random part of LMM can account for. However, the fixed-effects part may not fit the data well in some cases; hence, it may also be necessary to partition the data with respect to the fixed-effect terms. The GLMM tree, first proposed by Fokkema et al. [9], depends on such a strategy, as it utilizes MOB. That is to say, while LMM fits a parametric model to the data, the MOB algorithm used to construct a GLMM tree first fits a parametric model and then partitions the data with respect to splitting variables whose parameter stability tests yield p-values lower than the significance level α.
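To make the "fit, test, split" idea concrete, the sketch below is a deliberately simplified caricature of model-based partitioning: it fits a one-predictor least-squares line per node and accepts a median split on a candidate partitioning variable only when the two node-wise fits reduce the residual sum of squares enough. The real MOB/GLMM tree algorithm instead uses score-based parameter stability tests with Bonferroni-corrected p-values and searches over cut points; none of that appears here, and every name below is our own.

```python
def ols(x, y):
    # Closed-form simple linear regression; returns (intercept, slope)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def sse(x, y, a, b):
    # Residual sum of squares of the fitted line y = a + b*x
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def try_split(x, y, z, min_gain=0.1):
    # Accept a median split on partitioning variable z only if fitting
    # separate lines in the two child nodes cuts the SSE by `min_gain`.
    a, b = ols(x, y)
    total = sse(x, y, a, b)
    cut = sorted(z)[len(z) // 2]
    left = [i for i in range(len(z)) if z[i] <= cut]
    right = [i for i in range(len(z)) if z[i] > cut]
    if len(left) < 3 or len(right) < 3:
        return None
    child_sse = 0.0
    for idx in (left, right):
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        aa, bb = ols(xs, ys)
        child_sse += sse(xs, ys, aa, bb)
    return cut if child_sse < (1 - min_gain) * total else None

# Two subgroups with opposite slopes: one pooled line fits poorly,
# so the split on z is accepted; recursing on the children grows a tree.
x = [float(i % 10) for i in range(20)]
z = [float(i) for i in range(20)]
y = [x[i] if z[i] < 10 else -x[i] for i in range(20)]
```

When the same line holds in both subgroups, `try_split` returns `None` and the node becomes terminal, mirroring how a GLMM tree stops splitting once no parameter instability is detected.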
Because the assumption of a parametric form for the relation between the fixed-effect terms and the response may not be the best choice, Sela and Simonoff [37] suggested a method called the RE-EM tree, which can handle other types of functions for expressing such relations. In addition to handling variables changing over time, observations from the same unit can be partitioned into different terminal nodes in the RE-EM tree approach [46]. Unlike the GLMM tree approach, the RE-EM tree utilizes a mixed-effects model to handle the random effects while modeling the fixed effects with a CART approach. Later, Fu and Simonoff [10] proposed an unbiased version of the RE-EM tree by replacing the tree algorithm within it. The motivation of their method is to overcome problems of the CART algorithm such as overfitting and variable selection bias by using the conditional inference tree instead of CART for the partition of the fixed effects. According to our findings, the RE-EM tree and the unbiased RE-EM tree provide close results both with and without the nested structure. The reason might be that the categorical variables in our data, n-back, gender, and session, have similar numbers of levels, so adopting conditional inference instead of CART in variable selection does not change the model performance metrics for our data set.
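The alternation at the heart of the RE-EM tree can be caricatured in a few lines: given current random-effect estimates, fit the fixed-effects learner to the adjusted response; given the current learner, re-estimate each subject's random intercept from its residuals; repeat until the estimates stabilize. In the sketch below the "tree" is stubbed out as a per-group mean predictor, and the random intercepts are simple subject-wise residual means rather than BLUPs, so this mirrors only the structure of the algorithm, not its estimation details (all names are our own).

```python
def reem_sketch(y, subject, x_bin, n_iter=20):
    # y: responses; subject: subject id per observation;
    # x_bin: a discrete fixed-effect feature (stands in for tree leaves)
    subjects = sorted(set(subject))
    b = {s: 0.0 for s in subjects}  # random intercepts, start at zero
    leaf = {}
    for _ in range(n_iter):
        # Step 1: fit the "tree" (here: leaf means) to y minus b_subject
        adj = [y[i] - b[subject[i]] for i in range(len(y))]
        leaf = {}
        for v in set(x_bin):
            vals = [adj[i] for i in range(len(y)) if x_bin[i] == v]
            leaf[v] = sum(vals) / len(vals)
        # Step 2: re-estimate each subject's intercept from its residuals
        for s in subjects:
            res = [y[i] - leaf[x_bin[i]] for i in range(len(y))
                   if subject[i] == s]
            b[s] = sum(res) / len(res)
    return leaf, b

# Toy data: subject A sits 1 unit above, B 1 unit below the leaf means
subject = ["A", "A", "B", "B"]
x_bin = [0, 1, 0, 1]
y = [1.0, 3.0, -1.0, 1.0]
leaf, b = reem_sketch(y, subject, x_bin)
```

On this balanced toy example the alternation recovers the leaf means (0 and 2) and the subject offsets (+1 and -1) exactly; replacing the leaf-mean stub with a CART or conditional inference tree gives, in outline, the RE-EM tree and its unbiased variant.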
As another ML approach that can be used to analyze repeated data, LongCART is considered. However, the function in R implementing this algorithm requires baseline regressors, i.e. partitioning variables that do not change over time. When the algorithm is constructed accordingly, we cannot reach a solution for the complete set of variables. This may result from the restriction on selecting partitioning variables: only age and gender, which may not provide any further split, are appropriate for constructing this algorithm.
As one of the latest approaches for analyzing clustered or longitudinal data, GPBoost, which combines Gaussian process and mixed-effects regression models with boosting, is also considered. Similar to the RE-EM tree approach, any functional form can be assumed for the relation between the fixed-effect terms and the response. Further, unlike LMM, this method does not require any assumptions regarding model adequacy.
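In symbols, this combination can be summarized by the generic mixed-effects boosting form below, where F is the ensemble of trees learned by boosting and Zb collects the grouped random effects; this is offered as a reader's aid rather than a transcription of the exact formulation in Sigrist [39]:

```latex
y = F(X) + Zb + \epsilon, \qquad
b \sim \mathcal{N}(0, \Sigma), \qquad
\epsilon \sim \mathcal{N}(0, \sigma^{2} I)
```

Replacing F with a linear predictor Xβ recovers the LMM, which is why GPBoost can be viewed as relaxing only the fixed-effects part while keeping the random-effects machinery.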
In terms of the goodness-of-fit measures used to evaluate the predictive capability of the algorithms, the GLMM tree including indices nested within the subject variable outperforms all the other methods, as it has the lowest RMSE, MSE, and MAE for the data set used in this study. The interpretation of the resulting tree is also not difficult. As a result, considering the simplicity of interpretation and the model performance metrics, we recommend using the GLMM tree to analyze such a cognitive data setting. On the other hand, we recommend using LMM if the aim is to make inferences on the model parameters, since the GLMM tree approach does not provide such inferences while LMM does.

Conclusion
The data coming from neuroscience studies require careful research, development, and analysis. Hence, scientists must consider every aspect of the data and implement methods appropriate for the data structure. This study used statistical and hybrid ML methods to analyze repeatedly measured data, including observations taken on the same subjects, as commonly encountered in cognitive studies. The most critical finding of this study is that if fNIRS measurements are collected at different locations for each subject, the nested structure should be included in the model. In the data set used here, the HbO value, the response variable, has a statistically significant relationship with mean response time and the n-back condition level. While this statistically significant relationship is essential for LMM, the hybrid ML methods include all variables for prediction instead of performing variable selection. Therefore, researchers in this field can choose the method that suits their purposes.
After implementing the methods covered in the study, we compare their performance metrics and their advantages and disadvantages. Our findings on the performance metrics of the different algorithms indicate that the one with the lowest resulting RMSE, MSE, and MAE is the GLMM tree including random terms for indices nested within the subject variable for the cognitive data. On the other hand, if the researcher aims to make inferences on the model parameters for fNIRS data, we recommend using LMM over the other mentioned methods.
Another strength of our study is that it provides comparisons of the different algorithms through simulation and model validation approaches. The LMM and GPBoost algorithms can be preferred as they are less affected by data contamination under the settings of this study. On the other hand, the GLMM tree can be preferred if one aims at predicting unseen observations, according to the results obtained from cross-validation.

Figure 2 .
Figure 2. Distribution of mean HbO values for each subject.

Figure 3 .
Figure 3. The flowchart for the data analysis process.

Figure 4 .
Figure 4. Fitted vs. actual values for the GLMM tree. (a) GLMM tree with nested data structure and (b) GLMM tree without nested data structure.

Figure 5 .
Figure 5. Fitted versus actual values plot of the GLMM tree for indices nested within Subject 7 (randomly selected).

Table 2 .
Summary statistics of the variables.

Table 3 .
AIC values with different correlation structures.

Table 4 .
Results for LMM and hybrid methods.

Table 6 .
Relative change (%) in performance metrics for the contaminated data sets, where RMSE_2 represents the RMSE value obtained from the contaminated data and RMSE_1 corresponds to the RMSE calculated from the original data set.

Table 7 .
Results for LMM and hybrid methods on the validated data.