A sequential modeling approach for predicting clinical outcomes with repeated measures

Abstract The increased availability of healthcare data has made predictive modeling popular in clinical settings. If an expected patient-specific outcome can be estimated prior to a medical intervention, healthcare costs can be reduced for both patient and provider. The data used to train such predictive models are frequently longitudinal, as interventions with convalescence times or chronic conditions yield outcome measures at intermediate follow-up points. Here we outline a predictive modeling approach that exploits the longitudinal structure of the data by sequentially predicting the outcomes at intermediate time points and including them as predictors in models for later time points. This is done for both continuous and threshold-dichotomized outcomes. The proposed method improves predictive accuracy by using the correlation in follow-up measures to distribute the estimation of coefficient effects over several models, making it especially useful for smaller datasets. The formulation also allows for effective screening of first-order interaction effects. The improved performance is illustrated using a simulation study and an applied example of predicting outcomes following surgery. The proposed approach is shown to be consistent for prediction, effective in modeling interactions, and robust to the presence of noise variables.


Introduction
The prediction of outcomes for medical interventions has gained popularity with the increased accessibility of medical data. The focus in recent years has been on identifying patients who are most likely to benefit from a given intervention and achieve a positive outcome (Cooper et al. 2010; Khor et al. 2018; Lazzarini et al. 2017). If the outcome can be predicted accurately prior to treatment, this lowers the medical burden for the patient and is more cost effective for the healthcare provider (Goltz et al. 2018; Schwab et al. 2008). However, heterogeneous patient populations from multiple study sites or hospitals make fitting aggregate models difficult, while modeling a hospital-specific population results in limited sample sizes. In this paper we propose training a sequential model that incorporates intermediate outcome measures to more accurately predict the outcome of an intervention at a final time point using baseline data, and we show how this formulation is effective in screening for first-order interactions. As clinical outcomes are often continuous measures dichotomized based on a minimal clinically important difference (MCID) threshold, the method is extended to predict threshold-derived outcomes.

Problem formulation
The goal is to predict an outcome at a final time point using baseline measures available at an initial time point, with sample sizes expected to be in the hundreds, reflecting the number of patients in a medium-sized clinical study. This gives a more homogeneous set of samples and avoids adjusting for structural differences in diverse patient populations. The measured outcome variable is continuous or dichotomized based on a threshold, such as a fixed improvement or a category-based cutoff (Fairbank 2014; Klepstad et al. 2002). Longitudinal measures of the outcome are assumed to be available at standard follow-up time points.
The proposed approach builds a model by iteratively predicting the continuous outcome for each time point, with the prediction used as a predictor for subsequent time points. We estimate the variance of the prediction at the time point of interest by conditioning on the variances of the predicted variables and the population unexplained variance to get a probability distribution for the predicted value, which is used to estimate probabilities for threshold-derived outcomes. This formulation is also effective when screening for first-order interactions, if restricted to main effects interacted with the predicted outcomes at each time point, since each predicted value is a function of the baseline predictors.
We outline the competing approaches to modeling continuous and threshold-derived outcomes and the common methods used for variable selection and optimization of fit for predictive modeling, then present the model formulation and give derivations to estimate the mean and variance of the predicted value as well as formulate a Bayesian model to measure the accuracy of the variance estimate. The different methods are compared in a simulation study and an applied example. The results demonstrate that performance remains robust across datasets and various continuous and threshold outcomes, achieving similar or better performance relative to the comparison methods, and that the variance estimates are accurate. The sequential model is also shown to be the most robust to noise variables and effective in identifying first-order interactions. To conclude, we outline the areas of applicability and give a brief summary of the results.

Existing methods
The standard methods for predictive modeling of continuous-outcome data in this context are regularized linear regression on the endpoint of interest or on all observed time points, random forest and gradient boosting. Threshold-derived outcomes can be modeled with logistic regression, random forest or gradient boosting directly, or derived from the continuous prediction of a regression model. For modeling binary outcomes directly, the loss of information from dichotomization can be problematic (Altman and Royston 2006; Deyi, Kosinski, and Snapinn 1998; Senn 2003). Additionally, multiple models are required if there are multiple cutoffs. The linear regression alternatives model the continuous outcome directly, obtain the expectation and variance of the predicted value, and then estimate the probability of achieving a threshold using a specified distribution. When the outcome of interest has a known distribution this is the preferred approach (Ragland 1992; Suissa 1991; Suissa and Blais 1995; Wason and Jenkins 2016), provided the variance estimates are robust. The linear model with all measures assumes a uniform effect of the predictors at each time point; the increased number of observations is offset by the potential for biased coefficient estimates, and variance estimates may need to account for repeated measures. In all cases, screening for first-order interactions still relies on specifying every pair of predictors. The machine learning approaches, random forest (Breiman 2001) and gradient boosting (Friedman 2001), model each continuous and binary outcome directly. Both use trees as weak learners, which are able to partially identify interaction effects when present (Chen and Liu 2005; Friedman 2001; McKinney et al. 2006).
Variable selection for main effects and interactions, as well as optimizing fit and reducing coefficient bias for the linear models, is done using $L_1$ regularization (Tibshirani 1996; Zou and Hastie 2005). In the Bayesian model, restrictive coefficient priors are specified. For random forest and gradient boosting, standard tuning parameters are used: the number of trees and the number of predictors to sample for random forest, and tree depth, number of trees and shrinkage for gradient boosting.

Methods
In this section we outline the approach to sequentially building a predictive model when repeated outcome measures are available. The model is trained using information at baseline and incorporates intermediate outcome measures to predict an outcome at a final time point as a function of the baseline predictors. Starting from the first time point, each intermediate outcome is modeled as a function of the baseline covariates, and then a predicted outcome value is calculated and iteratively included in the subsequent models, which are trained conditional on predictions from the previous models. This takes advantage of correlation in the outcomes to estimate coefficient effects across multiple time points. Variance estimates are also calculated to predict threshold-derived outcomes.

Sequential model formulation
Consider a dataset with repeated measures of a continuous outcome $y_{i,t}$ for sample $i = 1, \dots, N$ at $t = 1, \dots, T$. A model is trained to sequentially predict the outcome at each $t$ as a function of baseline predictors $x_i = (1, x_{i1}, \dots, x_{iK})^T$ and a set of coefficients $\beta_t = (\beta_{0,t}, \beta_{1,t}, \dots, \beta_{K,t})^T$ for modeling the outcome at time $t$. The model for the first time point is $y_{i,1} = x_i^T \beta_1 + \epsilon_{i,1}$, where $\epsilon_{i,1} \sim N(0, \sigma_1^2)$ is the error term for the outcome at $t = 1$ and $\sigma_1^2$ is the error variance. The subsequent trajectory of the outcome over time can be modeled as
$$y_{i,t+1} = x_i^T \beta_{t+1} + y_{i,t}^T \gamma_{t+1} + \epsilon_{i,t+1},$$
where the outcomes at prior time points, $y_{i,t} = (y_{i,1}, \dots, y_{i,t})^T$, become predictors for the model at $t+1$ with corresponding coefficients $\gamma_{t+1} = (\gamma_{1,t+1}, \dots, \gamma_{t,t+1})^T$, with $\gamma_{j,t+1}$ modeling the effect of the outcome at time $j < t+1$ on the $t+1$ time point. Future outcomes can only be predicted using the baseline covariates, $x_i$, as the intermediate $y_{i,t}$ will not be available, so the $y_{i,t}$ are predicted iteratively using $x_i$. The model at $t+1$ becomes
$$y_{i,t+1} = x_i^T \beta_{t+1} + \hat{y}_{i,t}^T \gamma_{t+1} + \epsilon_{i,t+1},$$
where $\hat{y}_{i,t}$ is predicted from a model trained with $y_{i,t}$ as the outcome and $\hat{y}_{i,t-1}$ and $x_i$ as the predictors, and $\epsilon_{i,t+1} \sim N(0, \sigma_{t+1}^2)$ is the residual error that includes the prediction error of $\hat{y}_{i,t-1}$ when $t > 1$. The set of models is trained sequentially, starting at the first time point and using the fitted values as predictors in the subsequent models, so that each model is conditional on the expected outcome of the previous models. The final model is a set of models for each of the $T$ time points, similar to the use of instrumental variables (Chamberlain and Imbens 1996). For prediction, the point estimates at each time point are $\hat{y}_{i,1} = x_i^T \hat\beta_1$ and $\hat{y}_{i,t+1} = x_i^T \hat\beta_{t+1} + \hat{y}_{i,t}^T \hat\gamma_{t+1}$, respectively. The distribution of the outcome prediction is $\hat{y}_{i,t} \sim N\{E(\hat{y}_{i,t}), V(\hat{y}_{i,t})\}$, with the variance being required to calculate probabilities for the dichotomized outcomes.
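The sequential training loop described above can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: the data are simulated, plain least squares stands in for the regularized fits, and all names and constants (sample size, coefficient values, error scale) are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples, K baseline predictors, T = 3 time points with
# correlated outcomes (all values here are illustrative).
N, K, T = 200, 5, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])  # design with intercept
beta = rng.normal(size=(K + 1, T))
y = np.empty((N, T))
y[:, 0] = X @ beta[:, 0] + rng.normal(scale=0.5, size=N)
for t in range(1, T):
    # each outcome depends on the baseline covariates and the previous outcome
    y[:, t] = X @ beta[:, t] + 0.6 * y[:, t - 1] + rng.normal(scale=0.5, size=N)

# Sequential fit: regress y_t on Z_t = [X, yhat_1, ..., yhat_{t-1}],
# where the yhat's are fitted values from the earlier models.
models, yhat = [], []
for t in range(T):
    Z = np.column_stack([X] + yhat)
    coef, *_ = np.linalg.lstsq(Z, y[:, t], rcond=None)
    models.append(coef)
    yhat.append(Z @ coef)  # the prediction feeds the next model

yhat_final = yhat[-1]  # prediction at the final time point, a function of X only
r2 = 1 - np.sum((y[:, -1] - yhat_final) ** 2) / np.sum((y[:, -1] - y[:, -1].mean()) ** 2)
print("final-time R^2:", r2)
```

Because each $\hat{y}_t$ is itself a function of the baseline predictors, the final prediction uses only information available at baseline, as required.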

Estimation
The mean and variance of the estimate at the final time point are calculated as follows. For shorthand notation, define the design matrix for the initial set of predictors, $X = [1, x_1, \dots, x_K]$, an $N \times (K+1)$ matrix, where $N$ is the sample size and $K$ is the number of predictors. Also define a matrix of outcomes to be used for predicting the outcome at $t+1$ as an $N \times t$ matrix $Y_t = [y_1, \dots, y_t]$, where $y_s = (y_{1,s}, \dots, y_{N,s})^T$ for $s = 1, \dots, t$.
Note that since the $y_s$ are unavailable as predictors, we use $\hat{Y}_t = [\hat{y}_1, \dots, \hat{y}_t]$, the predicted values from the sequential models. The combined data matrix for the model at $t+1$ is then $Z_{t+1} = [X, \hat{Y}_t]$. At $t = 1$, the mean and variance estimates are $E(\hat{y}_1) = X\beta_1$ and $V(\hat{y}_1) = X(X^TX)^{-1}X^T\sigma_1^2$, respectively. For $t = 2, \dots, T$, the estimates for the sequential model are:
$$E(\hat{y}_{t+1}) = X\beta_{t+1} + \hat{Y}_t\gamma_{t+1}, \quad (1)$$
$$V(\hat{y}_{t+1}) = Z_{t+1}(Z_{t+1}^TZ_{t+1})^{-1}Z_{t+1}^T\sigma_{t+1}^2 + \gamma_{t+1}^TV(\hat{Y}_t)\gamma_{t+1}. \quad (2)$$
The estimates $\hat{E}(\hat{y}_{t+1})$ and $\hat{V}(\hat{y}_{t+1})$ are obtained by substituting the sample estimates for $\beta_{t+1}$, $\gamma_{t+1}$ and $\hat{Y}_t$. A simple estimate of $V(\hat{Y}_t)$ is the variance-covariance matrix of the predictions at time $t$ from the training set. Alternatively, the variance is calculated iteratively based on the predictors and estimated coefficients as:
$$\mathrm{cov}(\hat{y}_{t+k}, \hat{y}_t) = \mathrm{cov}(x^T\beta_{t+k}, \hat{y}_t) + V(\hat{Y}_{t+k-1})_{[t]}\,\gamma_{t+k},$$
where the subscript $[t]$ denotes the $t$'th row of the covariance matrix of the predicted values at $t+k-1$. The covariances with $\hat{y}_1$ are simplified, as that model includes only the baseline covariates: $\mathrm{cov}(x^T\beta_{t+k}, \hat{y}_1) = \beta_{t+k}^TV(X)\beta_1$ and $\mathrm{cov}(\hat{y}_2, \hat{y}_1) = \beta_2^TV(X)\beta_1 + \gamma_2V(\hat{y}_1)$, where $\gamma_2 = (\gamma_{1,2})$, the coefficient of $\hat{y}_1$ in predicting $y_2$.
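For two time points, the variance propagation can be sketched numerically as follows. This is a minimal numpy illustration under assumed simulated data, not the paper's code; `pinv` is used because the fitted values yhat_1 are an exact linear combination of the columns of X, making the combined design rank-deficient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-time-point illustration (all sizes and coefficients are illustrative).
N, K = 150, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])
y1 = X @ rng.normal(size=K + 1) + rng.normal(scale=0.7, size=N)
b1, *_ = np.linalg.lstsq(X, y1, rcond=None)
yhat1 = X @ b1
y2 = X @ rng.normal(size=K + 1) + 0.5 * y1 + rng.normal(scale=0.7, size=N)

Z2 = np.column_stack([X, yhat1])               # Z_2 = [X, yhat_1]
coef2, *_ = np.linalg.lstsq(Z2, y2, rcond=None)
gamma2 = coef2[-1]                              # coefficient on yhat_1

# residual error variances at each stage
s1 = np.sum((y1 - yhat1) ** 2) / (N - (K + 1))
s2 = np.sum((y2 - Z2 @ coef2) ** 2) / (N - (K + 2))

XtX_inv = np.linalg.inv(X.T @ X)
ZtZ_pinv = np.linalg.pinv(Z2.T @ Z2)            # pinv guards against collinearity

# per-sample variance: design-based term plus variance propagated from yhat_1,
# mirroring V(yhat_2) = z'(Z'Z)^{-1} z * s2 + gamma^2 * V(yhat_1)
V_y1 = np.einsum('ij,jk,ik->i', X, XtX_inv, X) * s1
V_y2 = np.einsum('ij,jk,ik->i', Z2, ZtZ_pinv, Z2) * s2 + gamma2 ** 2 * V_y1
print(V_y2[:3])
```

The second term grows with the magnitude of the carried-forward coefficient, which is how prediction uncertainty accumulates across time points.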

First-order interactions
Interactions are included by multiplying each of the predicted outcomes, $\hat{y}_t$, with the baseline predictors. The model is defined as:
$$y_{i,t+1} = x_i^T\beta_{t+1} + \sum_{s=1}^{t}\hat{y}_{i,s}\left(\gamma_{s0,t+1} + \gamma_{s1,t+1}x_{i1} + \dots + \gamma_{sK,t+1}x_{iK}\right) + \epsilon_{i,t+1}.$$
Since each $\hat{y}_t$ was predicted using the same set of predictors, $x$, the set of first-order interactions is implicitly included in the model. The value of $\gamma_{tk,t+1}$ assesses the effect of
$$\hat{y}_tx_k = \hat\beta_{0,t}x_k + \hat\beta_{1,t}x_1x_k + \dots + \hat\beta_{K,t}x_Kx_k + \hat{y}_{t-1}^T\hat\gamma_tx_k,$$
that is, the average across the scaled main effect and set of interactions. If the effects of some of the $\hat{y}_tx_k$ constituents are non-zero, they will be accounted for in the model by the aggregate coefficient, but diluted by the noise from the $x_jx_k$ pairs with no effect. With the predictors $(\hat{y}_1, \dots, \hat{y}_T)$ in the model, interactions for $x_k$ are modeled by $(\gamma_{1k,t+1}, \dots, \gamma_{Tk,t+1})$, which allows the model to reflect the heterogeneity of interaction effects for different $x_jx_k$ pairs (Supplementary Materials Appendix A). The advantage of this formulation is that the number of coefficients to estimate is $KT$ instead of $K(K-1)/2$, so the model is simplified whenever $T < (K-1)/2$, which is frequently the case in practice. Define an extended vector of coefficients $\tilde\gamma_{t+1} = \{(\gamma_{10,t+1} + \gamma_{11,t+1}x_{i1} + \dots + \gamma_{1K,t+1}x_{iK}), \dots, (\gamma_{t0,t+1} + \gamma_{t1,t+1}x_{i1} + \dots + \gamma_{tK,t+1}x_{iK})\}$. The estimates of $\hat{E}(\hat{y}_{t+1})$ and $\hat{V}(\hat{y}_{t+1})$ are obtained from Equations (1) and (2) by substituting $\tilde\gamma_{t+1}$ for $\gamma_{t+1}$.
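The reduction in the number of interaction columns can be seen directly by constructing the two candidate sets. The sketch below uses random stand-ins for the predicted outcomes and illustrative dimensions matching the simulation study (K = 40, T = 3).

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative dimensions: K baseline predictors, T time points.
N, K, T = 100, 40, 3
x = rng.normal(size=(N, K))        # baseline predictors
yhat = rng.normal(size=(N, T))     # stand-ins for the predicted outcomes yhat_1..yhat_T

# Proposed formulation: interact each yhat_t with each x_k -> K*T columns.
seq_interactions = np.column_stack(
    [yhat[:, t] * x[:, k] for t in range(T) for k in range(K)])

# Exhaustive pairwise screening would instead need K*(K-1)/2 columns.
n_pairwise = K * (K - 1) // 2
print(seq_interactions.shape[1], "vs", n_pairwise)
```

With these dimensions the sequential formulation screens 120 candidate columns rather than 780, which is the source of the efficiency gain whenever $T < (K-1)/2$.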

Prediction
Assuming normality of the error term at time point $t+1$, the conditional distribution of the predicted value, $y_{p,t+1}$, is given in Equation (5). Further discussion of prediction distributions can be found in Lawless and Fredette (2005) and Molinaro, Simon, and Pfeiffer (2005).
Outcomes that were dichotomized using a threshold are estimated from the distribution of $y_p$. Define the threshold $s_0$; then the probability of success, $\hat{p}$, is $P(y_{pred} \le s_0)$ for a decrease and $P(y_{pred} > s_0)$ for an increase.
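Under the normal prediction distribution, the threshold probability is a normal CDF evaluation. A minimal stdlib sketch, using illustrative numbers (an LVMI-like mean of 120, prediction variance 15², cutoff 110 — none of these are from the paper):

```python
from math import erf, sqrt

def prob_below(mean: float, var: float, threshold: float) -> float:
    """P(y_pred <= threshold) under a normal prediction distribution,
    via the error function: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    z = (threshold - mean) / sqrt(var)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Illustrative example: predicted mean 120, variance 225, cutoff 110.
p_decrease = prob_below(120.0, 225.0, 110.0)   # P(y_pred <= 110)
p_increase = 1.0 - p_decrease                  # P(y_pred > 110)
print(round(p_decrease, 3), round(p_increase, 3))
```

A single fitted continuous model thus yields probabilities for any number of cutoffs, avoiding the separate per-cutoff models required by direct binary modeling.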

Regularization and variable selection
In a model with regularization, $V(\hat\beta_t)$ can be estimated using the bootstrap (Tibshirani 1996). The estimate is calculated by sampling with replacement and fitting a model on each resampled dataset. For each iterate $b = 1, \dots, B$, there is an estimated set of coefficients $\hat\beta_{t,b}$, and $\hat{V}(\hat\beta_t) = \mathrm{Var}(\hat\beta_{t,1}, \dots, \hat\beta_{t,B})$, the empirical covariance. A bootstrap estimate of $\sigma_t^2$ is obtained analogously. Further discussion of variance estimation in regularized models can be found in Chatterjee and Lahiri (2011), Kyung et al. (2010), and Reid, Tibshirani, and Friedman (2016). Variable selection can be done using Lasso regularization, where some coefficient estimates will be zero and the corresponding variables are excluded from the model.
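The bootstrap covariance estimate can be sketched as below. For a self-contained example, a closed-form ridge fit stands in for the regularized model (an assumption of this sketch; in practice the lasso fit itself would be refit on each bootstrap sample), and all data and tuning values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data: N samples, K predictors, B bootstrap replicates.
N, K, B, lam = 120, 3, 200, 1.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=1.0, size=N)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate; a stand-in for the regularized fit."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

boot_coefs = np.empty((B, K + 1))
for b in range(B):
    idx = rng.integers(0, N, size=N)       # resample rows with replacement
    boot_coefs[b] = ridge_fit(X[idx], y[idx], lam)

# empirical covariance of the bootstrap coefficient estimates
V_beta = np.cov(boot_coefs, rowvar=False)
print(np.sqrt(np.diag(V_beta)))            # bootstrap standard errors
```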

Bayesian model
The formulation for the Bayesian model is the same as the sequential model. The empirical distribution of the posterior of the predicted value, y p , is obtained by using Markov Chain Monte Carlo (MCMC) and it is not necessary to calculate the variance. The full model formulation is presented in Supplementary Materials Appendix B for reference.

Distributional Assumptions
Although we are fitting a predictive model, a distributional assumption and a corresponding variance estimate for the predicted value are necessary to calculate probabilities for the threshold-derived outcomes. Here we assume the prediction distribution is normal, which is implied by the assumed normality of the error term. However, neither distributional assumption is strictly required for the model, and different distributions can be specified to reflect the problem at hand in order to avoid biased variance and probability estimates. If different distributions are specified, the variance estimates for the prediction can be verified against results from the corresponding Bayesian model. If it is still not feasible to specify a distribution, an empirical distribution can be estimated for the error, the prediction, or both via resampling; however, this approach can become computationally expensive with many time points.
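The resampling fallback can be sketched for a single time point as follows. This is an illustrative residual-resampling example with simulated data and a hypothetical new-patient covariate vector; with several time points the resampling would compound stage by stage.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative single-stage model fit.
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])
y = X @ np.array([0.5, 1.0, -1.0, 0.3]) + rng.normal(scale=0.8, size=N)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Empirical prediction distribution: point estimate plus resampled residuals.
x_new = np.array([1.0, 0.2, -0.1, 0.5])   # hypothetical new patient
draws = x_new @ beta + rng.choice(resid, size=5000, replace=True)

# Threshold probability without a parametric assumption.
tau = 1.0
p_below = np.mean(draws <= tau)
print(p_below)
```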

Simulation study
The performance of the sequential model is evaluated using a simulation study and compared against linear-endpoint, linear-all measures and logistic regularized regressions, as well as random forest, gradient boosting and a Bayesian model. We simulate a continuous outcome and dichotomize it based on set thresholds. The simulated datasets vary in size and number of noise variables. Model performance with main effects and first-order interactions is evaluated using the goodness-of-fit metrics described under Performance Measures in Section 5.2. The bias and MSE of the prediction variance estimates are also measured. The simulations, comparison methods and implementation are described in Supplementary Materials Appendices C, D and E, respectively.

Simulated data
We simulate datasets of size N = 125, 250, and 500 from a model with 40 predictors across three time points and include 0, 50, 100, or 200 noise variables for each sample size to assess the impact on performance. Training and test sets are assigned using a random 80/20 split and measures of predictive accuracy are averaged over 3000 replicates.

Performance measures
The primary measure of performance we use is the mean squared error (MSE) of a full model (all covariates), normalized to the MSE of the null model (intercept only), referred to as the Residual Prediction Error (RE). We use the subscripts full and null to denote the full and null models, respectively, and subscripts train and test to refer to the training and test sets, respectively. All models are fit on the training set and evaluated on the test set.

Residual Prediction Error (RE)
$$RE = \frac{\sum_{k=1}^{N_{test}}\left(y_{k,test} - \hat{y}_{k,test,full}\right)^2}{\sum_{k=1}^{N_{test}}\left(y_{k,test} - \hat{y}_{k,test,null}\right)^2}$$
Here $N_{test}$ is the number of observations in the test set, $y_{k,test}$ are the test set observations, $\hat{y}_{k,test,full}$ are the test set predictions from the full model, and $\hat{y}_{k,test,null}$ are the test set predictions from the null model, for $k = 1, \dots, N_{test}$. The predictions from the null model are $\hat{y}_{k,test,null} = \bar{y}_{train}$, the mean of the training set outcomes.
The measure gives the ratio of the test set MSEs for the two models. The advantage of using RE is that it gives a uniform measure of performance for each replicate and simulation scenario, since the measure is normalized to the prediction error from a baseline model. It also gives a sense of scale, since values less than 1.0 give the improvement in MSE when covariates are included; e.g., a value of 0.8 means that about 20% of the variance was explained by the covariates compared to the null model and that the covariates contain some information. Values above 1.0 indicate the model was overfit and that including covariates degraded performance, which is used to evaluate the effectiveness of the proposed variable selection technique. The measure is akin to $1 - R^2$ but within a predictive modeling framework and is based on the test set MSEs, giving the improvement when including predictors. The performance for the dichotomized outcomes is measured in an analogous manner, except that instead of MSE we use the test set binomial deviances from the full and null models. For the continuous model, we evaluate the bias of the point estimate for each replicate as $\frac{1}{N_{test}}\sum_k (y_{k,test} - \hat{y}_{k,full})/SD(y_{test})$, normalizing by the standard deviation of the test set outcomes within the replicate, and average across all replicates. The accuracy of the variance estimate from the sequential model is assessed using bias and MSE relative to the Bayesian estimate.
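The RE computation itself is a one-line ratio of MSEs. A small sketch with made-up predictions (the linear stand-in predictor below is purely illustrative, chosen so the expected RE is known):

```python
import numpy as np

def residual_prediction_error(y_test, pred_full, y_train_mean):
    """RE: test-set MSE of the full model over MSE of the intercept-only model."""
    mse_full = np.mean((y_test - pred_full) ** 2)
    mse_null = np.mean((y_test - y_train_mean) ** 2)
    return mse_full / mse_null

# Toy check: a stand-in "full model" whose errors are exactly half the null
# model's errors gives RE = 0.25 regardless of the draw.
rng = np.random.default_rng(5)
y_test = rng.normal(loc=2.0, size=500)
pred_full = 0.5 * y_test + 1.0       # residuals are 0.5 * (y_test - 2)
re = residual_prediction_error(y_test, pred_full, y_train_mean=2.0)
print(re)                            # -> 0.25
```

Values below 1.0 indicate the covariates improved on the training-mean baseline; values above 1.0 flag overfitting, exactly as described above.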

Simulation results
The residual prediction errors and bias estimates for the continuous outcome are given in Figures 1 and 2, respectively. The residual prediction deviance errors for dichotomized outcomes are given in Figure 3. Results for main effects models are solid lines and interaction models are dashed lines. The results are presented for the sequential (blue circle), Bayesian (orange triangle), linear-endpoint (red plus), linear-all measures (green cross), logistic (magenta diamond), random forest (cyan inverted-triangle), and gradient boosting (beige square) models. Lower values indicate better model performance. Comparisons of variance estimates, measured by bias and MSE, between the sequential and Bayesian models within replicates are given in Table 1.

Applied example: heart valve replacement surgery
This dataset measures left ventricular mass following aortic valve replacement in a cohort of 300 patients. The dataset was collected as part of a long-term follow-up study (Lim et al. 2008) that measured left ventricular mass regeneration over ten years following surgery. The outcome measure was the left ventricular mass index (LVMI) measured by echocardiography over a number of follow-up visits. Baseline variables included demographic and clinical factors and heart function measures. The outcome was the log-LVMI at five years with an intermediate measure at two years. The predictors were the baseline covariates and the first measure of log-LVMI prior to one year. Three threshold outcomes at five years were also estimated: LVMI less than 110, 150, and 180, with the cutoffs corresponding to the quartiles of the outcome.

Results
The results for the different methods and outcomes are presented in Figure 4a. The plot shows the residual prediction error for the same measures as the simulation study. The results are given for the sequential (circle), Bayesian (triangle), linear-endpoint (plus), linear-all measures (cross), logistic (diamond), random forest (inverted-triangle) and gradient boosting (square) models, with black indicating main effects models, red corresponding to interaction models and light blue to machine learning methods. The effects of including first-order interactions are given in Figure 4b. The plot shows the change in residual prediction error relative to main effects models, with positive values indicating an improvement. The results for the sequential (blue circle), Bayesian (orange triangle), linear-endpoint (red plus), linear-all measures (green cross) and logistic (purple diamond) models are presented for the same outcomes as in Figure 4a. Comparisons of sequential and Bayesian variance estimates in this data were very close; bias was −0.0062 and −0.0064, while MSE was 0.0005 and 0.0006 for the main effects and interaction models, respectively.

Discussion
In this section we compare and contrast the performance of different methods for a variety of sample sizes, under main effects and first-order interaction models and with the inclusion of varying numbers of noise variables.

Main Effects models
In the case of fitting a simple main effects model, the sequential model performed best across sample sizes and numbers of noise variables. The biggest improvements compared to the other methods were for the smaller sample sizes and higher numbers of noise variables. This is consistent with the goal of the model formulation, which aimed to distribute the estimation of coefficients across the intermediate time points by fitting multiple models. This resulted in less biased coefficient estimates and better predictive accuracy. Additionally, the formulation complements the Lasso for selecting predictors, as the model was more robust to noise variables than the other methods.
There were some discrepancies in performance between the dichotomized outcomes. For the fixed change and <40 outcomes, performance was more robust and prediction error stabilized faster with increased sample size. In the other two, differences between methods were more pronounced, stabilization was slower and overfitting was more problematic. This is the result of modeling different parts of the continuous outcome by applying different thresholds, where some ranges will have a more stable signal than others.

Interaction effects
The sequential model was formulated to make screening for interaction effects more efficient, and the second set of models was trained to assess performance with first-order interactions. Performance remained consistent across different scenarios and the model was generally able to identify first-order interaction effects. The improvements relative to the main effects model diminished as noise variables were added, but there was still an improvement at larger sample sizes. In each case, the fit remained robust, as predictive power converged to that of the main effects model instead of overfitting. This illustrates that the approach was effective at identifying underlying interactions through the proxy of the predicted intermediate measures, while remaining resistant to overfitting by limiting the number of interaction variables to filter.

Model robustness
The sequential model, in combination with $L_1$ regularization for variable selection, is formulated so that it reverts to a simple baseline model if the intermediate outcomes $\hat{y}_t$ or interactions do not improve prediction of the final outcome. The aim of the model is to reflect complex structures in a dataset, but if these are not present or the signal is too weak to be reliable, the variables representing these structures are removed from the model by the Lasso at any of the time points. Furthermore, since the number of added variables in each model is relatively small ($T$ for the intermediate outcomes and $KT$ for the interactions), the chances of overfitting are further reduced, making the model more robust.

Threshold-derived outcomes
The methods modeling the continuous outcome distribution had better performance for the threshold-derived outcomes compared to those using the binary outcome directly, which is consistent with the results in the literature. The extra requirement, however, was specifying a distribution for the prediction and estimating its variance. In this case, we assumed normality, which is reasonable for most applications, as the distribution can be made approximately normal using transformations. However, if it is non-normal, other distributions can be substituted without changing the model formulation, which is worthwhile as the nonparametric approaches, random forest and gradient boosting, performed poorly in most cases.

Estimates of variance
The bias and MSE between variance estimates from the sequential and Bayesian models were low, with MSE ranging from 1-5% and bias ranging from −1% to −5% of the estimated Bayesian model value, indicating the analytical estimate is accurate but tends to slightly underestimate the variance. Differences may originate from convergence issues in the Bayesian model or gradually accumulating errors in the sequential estimation of the variance-covariance matrix of the $\hat{y}_t$. However, neither issue materially affected performance.

Applied example
The results of the applied example were largely consistent with the simulation study. In this case the sample size was larger and there were fewer predictors, so some of the advantages highlighted previously were not as apparent. The sequential model still produced the best and most consistent results among the methods. Its performance in main effects models was on par or better and it was able to identify some first-order interactions.

Conclusions
In this paper we presented an approach for predictive modeling of longitudinal data using a sequentially trained model. The goal was to develop a method that can be used with a dataset of limited size to predict either continuous or threshold-derived outcomes. The approach addressed a number of features that could be encountered, including weak signals, interaction effects, and the presence of noise variables. We illustrated the performance of the method using a simulation study and an applied example of predicting surgical outcomes, and compared it to a Bayesian model, linear-endpoint, linear-all measures and logistic regressions, as well as random forest and gradient boosting algorithms. The proposed sequential model had either comparable or superior predictive power, with notable improvements for smaller sample sizes. In combination with $L_1$ regularization and the Lasso for variable selection, it was robust to overfitting, effective at identifying interactions and filtered out noise variables. The analytically computed variance estimate was also accurate. Overall, we showed that the sequential model gave the best and most consistent performance among the methods, making it an advantageous approach for predictive modeling of data with repeated outcome measures.

Data availability
Data used in the example is available from the joineR package in R.