A new approach to modeling positive random variables with repeated measures

In many situations, it is common to have more than one observation per experimental unit, thus generating the experiments with repeated measures. In the modeling of such experiments, it is necessary to consider and model the intra-unit dependency structure. In the literature, there are several proposals to model positive continuous data with repeated measures. In this paper, we propose one more with the generalization of the beta prime regression model. We consider the possibility of dependence between observations of the same unit. Residuals and diagnostic tools also are discussed. To evaluate the finite-sample performance of the estimators, using different correlation matrices and distributions, we conducted a Monte Carlo simulation study. The methodology proposed is illustrated with an analysis of a real data set. Finally, we create an package for easy access to publicly available the methodology described in this paper.


Introduction
Repeated measures study, in special, longitudinal data studies are very common in several areas of knowledge, such as medicine, economy, biology, and so on. In the modeling of such experiments is necessary to consider and model the structure of dependence in experimental units [10]. The first proposal to modeling such experiments was based under the normality assumption [39], as multivariate models, linear mixed models [15,20] and non-linear mixed models [22]. But they do not always present a good alternative. An alternative of flexibilization is to make use of generalized estimating equations (GEE) [21] based on generalized linear models [24]. Outside the exponential dispersion family, there are also some approaches in the literature such as the GEE Simplex model [40,41], GEE Birnbaum-Saunders [44] and the GEE Beta Prime model developed in this paper. A good review of the various approaches can be seen in [39], for example.
In particular, for modeling positive continuous data, few alternatives are available in the literature, generally restricting to the use of the gamma (GA), inverse gaussian (IG), log-normal (LN) and Birnbaum-Saunders (BS) distributions (see [11,12,38,51] and [32], for example). Recently, [2] proposed a new parameterization in terms of mean and precision parameter of the beta prime (BP) distribution [19]. Under this parameterization, the probability density function (pdf) is given by where μ ∈ R + is the location parameter (mean), φ ∈ R + is a precision parameter, B(a, b) = (a) (b)/ (a + b) is the beta function and (a) = ∞ 0 w a−1 e −w dw is the gamma function. It follows from (1) that: where V(μ) acts as a 'variance function'. This function assumes a quadratic form of the mean and it is convenient for modeling data with right asymmetry, which can be seen as an alternative to generalized linear models when data present skewness.
In [2], they emphasize that the BP distribution has common properties with GA and reparameterized BS [34] such as unimodality in the density function and quadratic variance function. It follows from (1) that the log-likelihood function based on a single observation has the form: In the literature, several proposals are for modeling positive longitudinal data, such as generalized linear mixed models [3], LN model [38] and [4] GEE's for GLM's and more recently GEE using BS [44]. However, the BP distribution has some advantages over the popular GA and IG models. For example, the BP hazard rate function can be upsidedown bathtub or increasing depending on the parameters values. While most classical two-parameter distributions such as Gamma distribution have monotone hazard rate functions. Besides that, the skewness and kurtosis of the BP distribution can be much larger than those of the GA and IG distributions, which maybe more appropriate in certain practical situations (see Table 1). The BS distribution has been widely used today to model positive continuous data, mainly in reliability applications to model failure times. (This distribution was developed initially to model the fatigue life of a material, subject to cyclic stress). Also, in the context of GEEs, which are approximations, this distribution was proposed for median and shape modeling, which is a different approach to BP that models mean and precision. Therefore, the main objective of this paper is to extend the proposal of [2] for modeling longitudinal data, i.e. propose a regression model for positive continuous data with repeated measurements, allowing skewness, based on the BP distribution and GEE approach. This paper is organized as follows: in Section 2, we introduce a new regression model for positive continuous data with repeated measurements based on the BP distribution.
In Section 3, we discuss Wald and score testing of coefficients and information criteria. Diagnostic measures are discussed in Section 4. In Section 5, some numerical results of the estimators are presented. In Section 6, we provide application to real data set. Concluding remarks are given in the final section.

GEE for BP regression models with repeated measures
Consider that the longitudinal data consists of n subjects, such that for subject where μ ij is the corresponding jth mean. Note that we are assuming, the balanced case (for unbalanced data the algebra is essentially the same). Assume that the marginal distribution of y ij follows a BP distribution, i.e. y ij ∼ BP(μ ij , φ), where φ is equal for the N observations, where N = nt. Assume a model for the mean μ ij in the form: where β = (β 1 , . . . , β p ) is a vector of parameters (p < n) and g(·) is called the link function which is assumed to be monotone and at least doubly differentiable, x ij = (x ij1 , . . . , x ijp ) is a fixed and known covariate vector on the jth observation of the ith subject. For the construction of GEE, consider the vectors u i = u i (y i , β) = (u i1 , . . . , u it ) , i = 1, . . . , n, based on the score function for μ ij (see supplemental material of this paper), with zero mean, mutually independent and satisfying the properties of regular estimating functions [7]. Notice that we are working with the case of constant precision, but we can extend this model by placing a regression structure on φ.
In the context of BP regression models with repeated measures, we use where ψ(·) is denoted as digamma function, with i = 1, . . . , n and j = 1, . . . , t.
For the subject i, we have is a (t × t) symmetric matrix satisfying the conditions for a working correlation matrix [21], which depends on a parameter vector α that is the same for all experiment units. For some definitions for α depending on the chosen structure of correlation matrix, see in [13], for example.
The generalized estimating equations for repeated measures is, in general, given by where X = (X 1 , . . . , X n ) is an (N × p) full rank specification matrix, = diag( 1 , . . . , n ) , = diag( 1 , . . . , n ) and W = −1 are (N × N) block-diagonal matrices, and u = (u 1 , . . . , u n ) is an N-dimensional vector. A solution of (4) can be obtained by combining the Newton's modified method (see [17]) to estimate β and method of moments to estimate α and φ, for example. The estimation of β is solved with iteratively reweighed least squares by regressing the modified response vector z = η + −1 u on X with block-diagonal weight matrix W. A current estimate of β is updated by the approx- is an initial estimate, and the right side of the equation is evaluated according to the estimate of β in the mth iteration. [16] showed that under certain regularity conditions (see [36] for example): where The sandwich variance estimator J −1 of β (this is referred to as robust estimator in the literature) can be consistently estimated [21] by where which is known as the sensitivity matrix.

Hypothesis test and model selection
Let υ be a (q × 1) vector of the q components of interest of β. Then the hypothesis tests of interest can be expressed as follows: H 0 : υ = υ 0 versus H a : υ = υ 0 . An adaptation of the Wald statistic [48] to GEE (see [13] for example) is given by where J −1 υ is the (q × q) submatrix of the robust estimatorĴ −1 given in (6). Considering the convergence given in (5), when n → ∞, we have, under the null hypothesis, that (7) has asymptotic distribution χ 2 q . [13] also present Rao score statistics [31] for GEE models. Several proposals appear in the literature as measures to select the best model in the context of GEE's, among them we mention the classic ones Quasi-likelihood Information Criterion (QIC) proposed by [26] based on a modification in the Akaike Information Criterion (AIC) [1], the Rotnitzky-Jewell criterion (RJ) proposed by [33] and the generalized error sum of squares (ESS) [37]. [49] presents the measures cited and some extensions based on them, in addition, it presents simulations to check its performance and notes that the RJ criterion does not perform well for the scenarios with either continuous or binary responses. The criterion based on the ESS does not perform well for longitudinal data with a normal response. However, the performance improvement with binary response. The QIC presents problems when is used a correlation structure independent, but presents advantageous performance when the correlation structure is exchangeable or AR-1. Having said that, we will now talk about the construction of the QIC and two measures based on it.
The QIC can be used both for the selection of covariates and for the choice of the working correlation matrix(see [8] and [9] for example). Another attractive recently proposed model selection measure is the Extended Quasi-likelihood Information Criterion (EQIC) proposed by [50]. These authors using the extended quasi-likelihood [23], based on the deviance function under the independence. EQIC can be used for both covariate and variance function selection [49,50], so it is a good propose for comparing models of different distributions. The structure (model) that obtains the smaller QIC (EQIC) is better according to this criterion.

Diagnostic analysis
Through diagnostic analysis we can verify possible deviations from the initially assumed assumptions for a model, as well as find possible extreme observations that can influence the results of the model fit, such as hypothesis tests and estimates for regression parameters.
In the context of GEEs, the misspecification of the work correlation matrix still gives us consistent estimates for β, but can impact the estimation efficiency. That said, the diagnostic analysis together with the measures of goodness of fit provide us with tools to select the best working correlation matrix improving the efficiency of the estimators for β.
For GEE, [46], based on the GLM's measures, extend the concepts of leverage, case deletion and residual analysis based on the proposals of [29] and [42]. [47] presents an extension of the idea of local influence [6] using the generalized local influence [4]. Among other studies cited in the literature dealing with diagnostic techniques for GEE we have [5,18,26] and [25], for example.

Leverage analysis
The hat matrix, based on propose of [27], is given by

Residual analysis
The vector of ordinary residuals is given by [46] Venezuela et al. [46] showed that Cov(r Since the elements of r * i can have different variances, we can use the standard residuals which are given by where r * ij is the ordinary residual based on the jth response of the ith subject. We can use r ij to highlight possible influence observations or evaluate the linearity of effects.

Case-deletion analysis
The assessment of the influence of a cluster of observations on the overall model fit may be carried out with a measure of the standardized influence of the subset G of observations on the linear predictor [28,29]. Based on the idea of [27], of 1-step approximation, one can approximate Cooks Distance as follows: which has similar expression for linear models. We can use the normalized Cook's distance: DC ij / i,j DC ij , in order to obtain values in scale (0, 1) to improve the comparison between points, specially in difference class of models. The diagnostic measures presented above are based on an approximation to β and the quality of these measurements depends on the accuracy of the proposed estimator.

Local influence
The local influence method proposed by [6] consists in evaluate, through an appropriate measure of influence, the robustness of estimates provided by the model thought the effect of minor perturbations in the model itself or in the data. The initial propose is to work with likelihood displacement, but the models based on GEE are not likelihood models. [4] generalized the likelihood displacement to any fit function, which is given by where F is a fit function, assumed doubly differentiable and whose estimator is β β β, denoted by β β β, the solution of Here, β β β and β β β ω ω ω are the estimated value for the postulated original and perturbed model, respectively, with FD(ω ω ω) ≥ 0 and ω ω ω = (ω 1 , . . . , ω m ) is a perturbation vector, where the size m depends on the proposed perturbation scheme and ω ω ω ∈ ⊂ R M . Based on [6], the idea is to study the local behavior of FD(ω ω ω) for any value of ω ω ω in a neighborhood of ω ω ω 0 , which represents the null perturbation vector, such that Venezuela et al. [47] extend the proposal of [4] to GEE, replacing likelihood equations with estimation functions, in such a way that given a perturbed estimating equation (β ω |ω) = 0, there is a null perturbation vector so that (β ω |ω 0 ) = (β). Venezuela et al. [47] proposing a local influence measure for GEE given by the eigenvector d d d max corresponding to the largest eigenvalue of the matrix: which are evaluated at β = β and ω = ω 0 . In this paper, we calculate the matrix for six different perturbation schemes based on the proposals of [47] namely: case-weight perturbation, response perturbation, single-covariate perturbation, precision parameter perturbation and working correlation matrix perturbation. These results are presented with details in supplemental material.
In general, highlighted observations in case-weight can be interpreted as a perturbation in variance of each experimental unit, in particular for Normal linear models [43], as cited by [47]. Perturbation in the response variable can be seen as an alternative way of identifying outliers [35]. The single-covariate perturbation scheme helps to evaluate the influence of each continuous covariate in the estimating process. Perturbation in the precision parameter indicates how sensitive the model is in relation to the homoscedasticity assumption. Working correlation matrix perturbation can indicate, for example, whether we need to use another correlation structure.

Case-weight perturbation
First, we consider an arbitrary attribution of weights for the estimating equation. For this perturbation scheme, we find: Here, the nonperturbation vector, ω 0 , assumes ω ij = 1, with i = 1, . . . , n and j = 1, . . . , t. For this scheme, the matrix B G defined in (8) is given by diag(u) −1 W X(X WX) −1 X W −1 diag(u), which is evaluated at β.

Response perturbation
Consider the additive perturbation scheme in y ij given by where the nonperturbation vector assumes ω ij = 0, i.e. ω 0 = 0. Analyzing the GEE defined in (4), the only part that depends on y ij is the vector u. Therefore, considering u ω the vector u perturbed in response variable. The perturbed estimating equation is given by where u ω = (u ω1 , . . . , u ωn ) , with u ωi = (u ωi1 , . . . , u ωit ) , i = 1, . . . , n. In this case, the matrix is given by X W −1 B, where B = ∂u ω /∂ω = diag{∂u ω11 /∂∂ω 11 , . . . , ∂u ωnt /∂∂ω nt }. The matrixB G under response perturbation is given by which is evaluated at β and ω 0 . For GEE using BP, we have to (see supplemental material): where s ij = V(μ ij )/φ and V(y ωij ) = y ωij (1 + y ωij ), thereby we have to Thomas and Cook [43] proposed an additive scheme of perturbation in the kth column of X, let is name this column of x k = (x 11k , x 12k , . . . , x ntk ) , with k = 1, . . . , p. Therefore the elements of perturbation vector x k is given by
Theoretical results of this paper have been implemented in the R software [30] through the creation of a package called geeBP. The geeBP package provides functions for fitting and analyzes GEE with Beta prime models. The current version can be downloaded from GitHub using the command in R: devtools::install_github("jvbfreitas/geeBP") The function geeBP() fits a model and creates an object of class geeBP which can be applied in the functions diag.geeBP() (Cook's distance and H matrix), LI.BP() (local influence), summary() and residuals() (Standard and Pearson residuals).

Parameter recovery
In this section, we present an simulation study to analyze the consistency of the estimates using: Relative bias (RB), standard deviation (SD) and root-mean-square error (RMSE). The Monte Carlo experiments were carried out using where D is the marginal distribution considered (BP), the covariate x ij was generated from a standard uniform and w ij from a Gamma(1,2), with i = 1, . . . , n, j = 1, . . . , t and the working correlation matrix is AR-1. The simulated values were obtained via Gaussian copulas, due to the dependency between the observations of the same experimental units. The simulated values for each explanatory variable were kept constant during the simulations. The number of Monte Carlo iterations per scenario (or case) was set equal to 2000. This study aimed to evaluate the impact of the sample size, precision, and correlation in the parameters estimates. All simulations were performed using the R software [30] and regression coefficients were estimated for each sample using the function geeBP(). The results of the simulation can easily be obtained from https://santosneto.shinyapps.io/GEEBP/.

Sample size effect
In this scenario, we consider φ = 10, α = 0.5, n = 30, 50, 100 and t = 4, 6, 10. Table 2 allows you to see what happens to RB, SD and RMSE when n and t grow. Note that, in general, the RB decreases with the growth of n and t (mainly for n). Now for SD and RMSE we note that these quantities decreasing with the increase of n and t. In this way we show that the estimators are consistent.

Precision effect
What effect does increasing the precision have on estimates? To answer, we consider n = 50, t = 4 and α = 0.5 fixed and φ = 2, 10 and 100. In Table 3, note that the performance of the estimators improves with increasing precision. For example, for β 0 , the RB suffers a 98% cut in its value.

Correlation effect
The correlation effect on estimates of parameters was measured considering a low (α = 0.2), moderate (α = 0.5) and high (α = 0.5) correlation. The values of n, t and φ  Table 3. Estimates of the RB, SD and RMSE under different values of φ.  considered were 50, 10 and 10, respectively. We conclude that the correlation effect is stronger on estimates of SD and RMSE. For these quantities, you can see a reduction in its estimates with the growth of the degree of correlation. For example, for β 2 , there was a 33.5% reduction in the value of the RMSE (see Table 4).

Performance with beta prime data
In order to verify the performance of the GEE BP model and show the poor performance of GEE GA and IG models when fitting data generated from a BP we generate 2000 replicas of the model: ln(μ ij ) = 4 − 0.6x ij , where x ij = j, with n = 40, j = 1, . . . , 8, α = 0.5, AR-1 working correlation matrix and φ = 1. We note from Table 5 that both GA and BP estimated β 0 with low bias (< 0.04) and β 1 with low bias as well (< 0.01), although we note that in particular the BP model has the lowest standard deviation indicating that this model estimates more precisely, which is evidenced through the RMSE.

Real-world data analysis
In this section, we apply the methodology considered in previous sections to one real dataset. The required numerical evaluations for the data analysis were carried out by using the R software. Consider an experiment, analyzed by [46], carried out in School of Physical Education and Sport of the University of São Paulo (USP), whose data were obtained from Centre for Applied Statistics, part of the Institute of Mathematics and Statistics, at the USP. The purpose of the study was to evaluate the process of learning synchronizing tasks developed by volunteers, analyzing the variable absolute error. For this, each one of the 40 volunteers performed the same task 80 times, divided in 8 blocks of 10 attempts. In each attempt, was observed the time difference between the moment the individual received the stimulus and the moment of their motor reaction (in ms). The response variable was defined, for each block, as the mean of the absolute values of observed time differences over the 10 attempts. This database was previously analyzed by [46] using a Normal GEE model, but this distribution is initially proposed to model real data and a distribution with positive support is more appropriate to model such data. Also, we noticed through their analysis of Cook's distance that there are points which influence the modeling and that indicate the need for a more robust model. The data can be found in Appendix A of [45]. It should be noticed that the previous analyses made for this experiment were based on normal models. The purpose of this data analysis is to remodel this data set with more adequate distributions (continuous positive), among them BP, and show that it presents more reliable results according to the diagnostic analysis compared to the classic Gamma and IG, for example.   Through the profile plot (Figure 1), we observed a possible linear relation between the logarithm of absolute difference and the blocks. In the sample variogram plot (Figure 1), we have indications that the correlation between blocks decreases with respect to lag of the time, because the spline approaches the dotted line referring to the sample variance σ 2 s related to the standardized response variable. The analysis of the sample variogram consists of verifying whether its values in each lag are close to σ 2 s or not, the closer they are, the lower the correlation (due to the definition of the sample variogram), in this case the use of the spline serves as tool to study the behavior of the sample variogram around this variance, for more details and explanation, see [10] for example. Therefore, there is evidence that a model with AR-1 structure for intra-class dependence is reasonable, but unstructured structure will be considered as well.
We then initially fit six models to this data. The GEE models with BP, GA and IG distributions with AR-1 and unstructured correlation matrix and the following structure: where block j is the jth block, with i = 1, . . . , 40 and j = 1, . . . , 8. Note that only in order to have another candidate model, we also considered the model based on the IG distribution in this application, although it was not considered in the simulation. The results of half-normal plots with simulated envelope of 95% confidence bands are shown in Figure 2. We still have that the GEE BP model presented the best fit analyzing the half-normal plots in comparison to GA and IG models.
In Table 6, we observe that through the EQIC we have an indication that the BP model presents a better fit, because considering either of the two working correlation matrices it has the lowest EQIC in all cases, based on this also, for the AR-1 type correlation the Gamma model is better than the IG and the inverse is observed for the unstructured correlation matrix. Analyzing the QIC, across the working correlation matrices for each model, we have an indication that the AR-1 structured is the most appropriate for BP and Inverse Gaussian models, respectively, already for GA the best structure is the unstructured.
In Figure 3, we plot the normalized Cook's distance versus Subjects observations and the highlighted points were named by (Subject, Block). Note that the observations highlighted in the GA and IG models have less influence on the BP model. In Figure 4, we plot the normalized |d max |, i.e. |d max | ij / i,j |d max | ij for case-weight perturbation and we have the same comment made for the normalized Cook's distance, showing even more the good fit of GEE BP against GEE Gamma and GEE IG models.
The results of the fitted models follow in Table 7. Note that the parameters for the models presented p-value, using Wald statistic, less than 0.01, but BP and IG model have much smaller standard errors than GA.  That said, through the goodness-of-fit measures, the Cook's distance and case-weight perturbation, we selected the GEE BP model as the best. In Figure 5, we plot standard residuals and the normalized |d max | for the response, precision parameter and working correlation matrix perturbations, and highlighted observations associated to subjects 5 and 17. We decide to make a confirmatory analysis by withdraw subjects 5 and 17, and we observed that the inferential results did not undergo major changes, but in the working correlation matrix perturbation new observations are highlighted. Analyzing the dataset, in the second block the experimental units 17 and 5 were the second and fourth lowest values, respectively. It is also worth noting that in relation to other local influence schemes, the IG and Gamma models have points considered influential in the response variable perturbation.

Discussion
In this paper, we have developed a new regression model used to modeling asymmetric positive continuous data to take account of possible serial dependence between repeated observations. We developed the modeling only mean. The proposed model may serve as a good alternative to the GA and IG regression models for modeling asymmetric positive continuous data with repeated observations. Diagnostic tools have been obtained to detect locally influential data. Simulations show the good performance of this new model.
Through the analysis of synchronizing tasks data, we found that the mean can be a function of blocks and show that the BP models have a good performance in relation to the its competitors: Gamma and IG. The diagnostic analysis developed in this paper shows that the GEE BP model proves to fit well to the data, which is mainly evidenced by the simulated envelopes, also, the confirmatory analysis revealed its stability.
As limitations of the proposed model we mention that it does not consider the heterogeneity of the precision parameter, a GEE BP model for joint modeling of mean and precision is being developed by the authors as an extension to the model proposed in this paper. Another study of interest would be to consider semi-parametric predictors for the mean of the response variable.