Unit gamma mixed regression models for continuous bounded data

We propose the unit gamma mixed regression model to deal with continuous bounded variables in the context of repeated measures and clustered data. The proposed model is based on the class of generalized linear mixed models and parameter estimates are obtained based on the maximum likelihood method. The computational implementation combines automatic differentiation and the Laplace approximation (via Template Model Builder/C++) to compute the derivatives of the log-likelihood function with respect to fixed and random effects parameters. We carry out extensive simulations to check the computational implementation and to verify the properties of the maximum likelihood estimators. Our results suggest that the proposed maximum likelihood approach provides unbiased and consistent estimators for all model parameters. The proposed model was motivated by two data sets. The first concerns the body fat percentage, where the goal was to investigate the effect of covariates which were taken in the same subject. The second data set refers to a water quality index data, where the main interest was to evaluate the effect of dams on the water quality measured on power plant reservoirs. The data sets and R code are provided as supplementary material.

The analysis of bounded variables is generally performed by the beta [2] and simplex [3] regression models. Besides that, other regression models were proposed to analyse continuous bounded variables on the interval (0, 1). Some examples are the unit-Weibull [4], Johnson S B [5], Kumaraswamy [6] and unit gamma [7] regression models. Additionally, using second-moment assumptions [8] developed a flexible class of regression models to deal with continuous bounded variables on the interval [0, 1].
Although these models are useful in many applications, they are usually limited to analyse independent data. In the case of longitudinal data, it is essential that the regression model takes into account the longitudinal and/or grouped data structure. According to Diggle et al. [9], longitudinal data are repeated measures evaluated on the same subjects over time, that are potentially correlated. Dependent data can also arise in studies with block designs, spatial and multilevel data [10,11]. For the analysis of such data, several methods have been proposed over the last four decades.
Laird and Ware [12] proposed the random effects regression models for longitudinal data analysis. Breslow and Clayton [13] presented the generalized linear mixed models (GLMMs) for the analysis of non-Gaussian outcomes. The authors [14,15] extended the GLMs for the analysis of longitudinal data using a generalized estimating equation (GEE) approach. Masarotto and Varin [16] developed a class of marginal models for modelling dependence structures in the analysis of longitudinal data, time series, and spatial based on Gaussian copula models.
Based on the aforementioned approaches, some regression models have been proposed to deal with longitudinal continuous bounded outcomes. GLMMs based on beta distribution were employed in medical research [17], social sciences [18,19] and behavioural studies [20]. Other regression models based on the simplex distribution were proposed for modelling longitudinal data [21][22][23]. Under the likelihood paradigm, the simplex mixed models with applications are discussed in [24].
The main goal of this study is to propose the unit gamma mixed regression model to deal with repeated measures of continuous bounded outcomes. The unit gamma distribution is new in the literature and has been explored in other contexts, like control charts [25], comparison between different methods for parameter estimation [26] and likelihood ratio tests [27]. In this paper, we shall investigate the unit gamma distribution as an alternative to beta distribution for the analysis of dependent continuous bounded data. We consider this distribution into the GLMM framework in order to fit regression models with random effects. We employ automatic differentiation [28] and Laplace approximation [29] for fast parameter estimation through the R [30] package TMB [31].
The main contributions of this article are: (i) introducing the unit gamma distribution into the GLMMs framework; (ii) performing an extensive simulation study to check the properties of the maximum likelihood estimator in the context of longitudinal data; (iii) applying the proposed model to two data sets from different research fields; (iv) providing R code and C++ implementation for the unit gamma mixed regression model.
The article is organized as follows. Section 2 presents the unit gamma mixed regression model. Section 3 describes the method proposed for parameter estimation and inference. The results of simulation studies are reported in Section 4. Section 5 illustrates the fit of the model to two data sets. Finally, the main contributions of the article are discussed in Section 6.

Unit gamma mixed regression models
In this section, we shall present the unit gamma distribution and its associated mixed regression models for continuous bounded data.
Let X ∼ Gamma(β, φ) denote a gamma distributed random variable and consider the transformation Y = e −X . Thus, Y follows the unit gamma distribution [32] with probability density function (pdf) where y ∈ (0, 1) and β, φ > 0. According to Grassia [32], the expectation and variance are given by Recently, Mousa et al. [7] proposed a reparametrization for the unit gamma dis- denoting a unit gamma random variable with pdf given by where y, μ ∈ (0, 1) and φ > 0. Under this parametrization, μ is the mean and φ is a precision parameter. Here, the expectation and variance are given by Note that, when φ → 0 the variance of Y tends to μ(1 − μ), which corresponds to the mean-variance relationship of the Bernoulli distribution. In order to illustrate the pdf of the unit gamma distribution given in Equation (2), Figure 1 present the distribution shape for five levels of variance, when μ = 0.5, Var(Y) = 0.15, 0.10, 0.05, 0.025 and 0.01, such that the corresponding precision parameters are φ = 0.36185, 0.76118, 1.95705, 4.3557, 11.56, respectively. In Figure 1, we combine different values of μ and φ to show the flexibility of unit gamma distribution to deal with symmetric and skewed continuous bounded data. Based on Figure 1, we may easily see that the unit gamma pdf can assume different shapes according to the selected values of the parameters μ and φ. It can have J shapes, inverted J shapes, symmetric shapes (when μ = 0.5 and φ → ∞) and, U shaped.
Let y ij be the repeated measure j within the subject i, for i = 1, . . . , n and j = 1, . . . , n i observations of the random variable Y ij . It is assumed that Y ij given the random effects is independently distributed with probability density function denoted by Y ij |u i1 , u i2 In this paper, we use the framework of the double generalized linear models [33,34] for jointly modelling mean and precision parameter as functions of covariates. Thus, the unit gamma mixed regression model is specified by the following couple of equations where g(·) and h(·) are suitable link functions. In this paper, we adopted the logit(μ) = log(μ/(1 − μ)) and the logarithm log(φ) link functions for g(·) and h(·), respectively. In this notation, x ij1 and β are p-dimensional vectors of known covariates and unknown regression coefficients associated to the expectation of the continuous bounded response variable. Similarly, x ij2 and γ are s-dimensional vectors of known covariates and unknown precision coefficients associated to the precision structure. Furthermore, z ij1 and z ij2 are q-dimensional and r-dimensional vectors of known covariates associated to the random effects. Finally, we have assumed that u i1 ∼ N (0, 1 ) and u i2 ∼ N (0, 2 ), where 1 and 2 are the covariance matrices of the random effects as usual in linear regression models.

Estimation and inference
In this section, we shall present the maximum likelihood approach adopted for parameter estimation and inference. Let θ = (β, γ , 1 , 2 ) denote the vector of parameters in the unit gamma mixed regression model. In this notation, 1 and 2 denote the parameters composing such matrices. Let y i = (y i1 , . . . , y in i ) denote the n i × 1 vector of repeated measures for the i th subject. For each independent subject, the marginal likelihood is given by Assuming independence between the n subjects, the log-likelihood function is given by It is important to note that, the evaluation of Equation (5) requires solving a potentially high dimensional integral n times. Hence, it requires the use of numerical integration methods. We used the Laplace approximation method [29] combined with automatic differentiation [28,35]. The Laplace approximation method is commonly used to approximate integrals that cannot be resolved analytically and can be applied to obtain the marginal likelihood. Let u = (u i1 , u i2 ) denote the stacked vector of random effects of dimension q = s + r and its covariance matrix. Let (θ, u) denote the complete log-likelihood function. In order to obtain the marginal likelihood, we have to integrate out the random effects from the complete likelihood. In this case, the Laplace approximation can be written as follows whereû is the value of u for which (θ , u) is maximized, i.e.
and H(θ) denote the Hessian of (θ , u) given by Finally, the Laplace approximation for the marginal log-likelihood is given by It is important to highlight that we have used the notation θ (û) to emphasize that θ is a function ofû. Consequently, we have two maximization problems to be solved. The internal maximization problem consists of findingû for a fixed vector of model parameters θ. The second or external maximization problem consists to maximize the marginal loglikelihood function with relation to θ . For the internal maximization problem, we adopted the Newton-Raphson method, which in turn requires to compute the first and second derivatives of the complete log-likelihood with relation to u. For such computations, we opted to use automatic differentiation [28,35]. For the external maximization problem, we use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, implemented by the function optim() in R. The fitting algorithm aforementioned can be conveniently implemented in R using Template Model Builder (TMB).
The R package TMB [31] provides computational implementation for fitting complex regression models with random effects. TMB offers a collection of tools for implementing random effects models using C++ templates. These tools refer to C++ libraries as CppAD [36] and Eigen [37] used for automatic differentiation and matrix linear algebra, respectively. TMB also works with other high performing software tools for quickly estimation. For details, see [31].

Simulation studies
In this section, we report the main results of the three simulation studies conducted to check the properties of the maximum likelihood estimators to deal with continuous bounded data in studies with repeated measures. Firstly, in Section 4.1, we consider a unit gamma mixed model with fixed precision. Thereafter, in Section 4.2, we adopt a unit gamma mixed model with random slopes. Finally, Section 4.3 presents the unit gamma mixed model with varying precision.

Unit gamma mixed model with fixed precision
In this section, we designed a simulation study with 15 scenarios taking into account different values for the precision parameter and variance of the random effects.
where β = (2, 1, −1.5) and u i ∼ N (0, τ 2 ). The linear predictor (2) is composed of a continuous and a categorical covariates. The continuous covariate x 1 was generated from a Gaussian distribution (mean zero and variance 0.3 2 ), while the categorical covariate x 2 was generated from a Bernoulli distribution (p = 0.5). Figure 2 shows the average bias plus and minus the average standard error for the regression, precision, and variance parameter estimates for each simulated scenario. We highlight that the scales in Figure 2 were standardized for each parameter by dividing the average bias and the limits of the confidence intervals by the standard error obtained for the sample of size 200. The results presented in Figure 2 suggest that the regression and precision parameters are unbiased under all simulation scenarios and sample sizes; however, the parameter τ 2 is slightly biased for small samples. Thus, the results in Figure 2 suggest that in all simulation scenarios both average bias and standard error tend to zero as the sample size increases. Figure 3 presents the empirical coverage rate for all model parameters, sample sizes, and simulation scenarios with τ 2 fixed at 0.5. The results of the coverage rate with τ 2 fixed at 0.25 and 0.75 are found in the Web Appendix A.
According to the results presented in Figure 3, the coverage rate for each parameter (β 0 , β 1 , β 2 , τ 2 , φ) were around the nominal level of 95% in all simulation scenarios and sample sizes. Overall, the results presented in this section indicate that the maximum likelihood approach provides unbiased and consistent estimators for all model parameters, at least for large samples.

Unit gamma mixed model with random slope
We carried out a simulation study to check the properties of the maximum likelihood approach to evaluate the performance of the unit gamma mixed models with random slopes. The model includes an intercept and a random slope, where we assume that the random effects are correlated. We designed 15 simulation scenarios by considering different values for the precision parameter and correlation levels.
The simulation scenarios were designed by combining five values of the unit gamma precision parameter φ = (0.36185, 0.76118, 1.95705, 4.3557, 11.56) with three levels of correlation between the random effects ρ = (0.3, 0.5, 0.7). Similar to the simulation study presented in the previous section, for each simulation scenarios we drawn 1000 data sets with five repeated measures taken at an increasing number of subjects N = 40, 80, 150 and 300 resulting in samples of size 200, 400, 750 and 1500, respectively. In the linear predictor we included a continuous covariate, that was simulated from Gaussian distribution with mean zero and variance 0.5 2 . The regression coefficients were fixed at β 0 = 2 and β 1 = 1. The unit gamma mixed model with random slope is specified as follows where τ 2 1 = 0.9 and τ 2 2 = 0.6. Figure 4 shows the average bias plus and minus the average standard error for all model parameters under each simulation scenario. The results in Figure 4 suggest that the proposed estimators for the regression parameters are unbiased and consistent for all sample sizes and simulation scenarios. However, as expected, the variance, correlation, and precision parameters are slightly biased for small samples in all simulation scenarios. Figure 5 presents the confidence intervals coverage rate for each parameter by sample size and simulation scenario with ρ fixed at 0.5. The additional results for ρ = (0.3 and 0.7) can be found in the Web Appendix A. The results in Figure 5 show that the coverage rate for the regression and precision parameters was around the nominal level of 95% for all sample sizes and simulation scenarios. However, for the τ 2 2 parameter the empirical coverage rates are close to 88% and 90%, for scenarios with φ = 0.36185.

Unit gamma mixed model with varying precision
Following the principles of the double generalized linear models, we investigated the performance of the maximum likelihood estimators to evaluate the unit gamma mixed models for jointly modelling the mean and the precision structures (Equation (3)). We designed a simulation study with three scenarios taking into account different variance levels of the Gaussian random effects. Moreover, we considered that the random effects in the mean and precision models were the same for each simulation scenario, i.e. τ 2 1 = τ 2 2 . Thus, the variance levels of the Gaussian random effects were fixed at τ 2 = (0.25, 0.5, 0.75). For specification of the unit gamma mixed model with varying precision, consider the following linear predictors for the mean and the precision structures where u i1 ∼ N (0, τ 2 1 ) and u i2 ∼ N (0, τ 2 2 ). The values of the regression coefficients were fixed at β = (2, 1, −1.5) . The covariates x 1 and x 2 were generated from Gaussian (mean zero and variance 0.4 2 ) and Bernoulli (p = 0.7) distributions, respectively. For precision structure we set γ = (1, 0.5, 1.2) . The covariates z 1 and z 2 were simulated from Gaussian (mean zero and variance 0.3 2 ) and Bernoulli (p = 0.5) distributions, respectively. Therefore, for each of the three simulation scenarios, we generated 1000 data sets with five repeated measures taken at an increasing number of subjects N = 40, 80, 150 and 300 resulting in samples of size 200, 400, 750 and 1500, respectively. Figure 6 shows the average bias plus and minus the average standard error for each parameter of the unit gamma mixed model with varying precision. The results in Figure 6 suggest that the regression and precision parameters are unbiased and consistent under all simulation scenarios and sample sizes. However, the variance parameters of the Gaussian random effects were biased, as expected. The empirical coverage rate for all model parameters, sample sizes, and simulation scenarios is presented in Figure 7.
The results in Figure 7 show that for the β 1 and β 2 parameters the empirical coverage rates are close to the nominal level of 95% for all sample sizes and simulation scenarios. For the intercepts in both mean and precision structures the empirical coverage rates are close to 90% and 94% for sample sizes larger than 100. In the supplementary material online (Web Appendix B), we repeat the simulation study presented in this section with 15 repeated measures taken at an increasing number of subjects N = 10, 20, 40 and 80 resulting in samples of size 150, 300, 600 and 1200, respectively. The results are pretty similar to the ones reported in the article.

Applications
In this section, we shall illustrate the application of the unit gamma mixed model to two data sets. The first corresponds to a body fat percentage of subjects evaluated in a public hospital in Brazil. The second data set refers to a water quality index measurements taken at hydroelectric power plant reservoirs. The data sets, R and C++ codes are available as supplementary material on http://www.leg.ufpr.br/doku.php/publications:papercompanions: unigamma.

Body fat percentage data set
The first data set refers to a study conducted at the Clinic's Hospital of the Paraná Federal University, Curitiba, Brazil. The response variable is the body fat percentage (BFP) Figure 6. Average bias and confidence intervals on a standardized scale by sample size and simulation scenarios -simulation study 3. measured at five different regions of the body (arms, legs, trunk, android, and gynoid). The data set has 1490 observations (298 subjects × 5 repeated measures) and was previously analysed by [38,39], based on multivariate regression models for continuous bounded data. We shall analyse the BFP considering the five repeated measures per subject, and the following covariates: age -patient age in years; BMI -body mass index (kg/m 2 ); gender -gender of individuals (0: Female; 1: Male) and IPAQ -level of physical activity practiced routinely (0: sedentary; 1: insufficiently active; 2: active). The main goal of this data analysis is to investigate the effects of the aforementioned covariates in the BFP considering the repeated measures on the same subjects. Figure 8 displays some exploratory plots which help us evaluate the body fat distribution and its association with the considered covariates.
The results in Figure 8(A) show that the distribution of the total percentage of body fat is slightly asymmetric. Figure 8(B,C) shows that the total percentage of body fat increases positively with age and BMI, respectively. Moreover, female (Figure 8(D)) and sedentary (Figure 8(E)) individuals present higher total percentage of body fat. Finally, Figure 8(F) indicates that the total percentage of body fat differs between the different body regions.
For selecting the components of the linear predictors for both μ and φ, we used a step-wise procedure. First, we selected the covariates for the mean model assuming constant precision. Then, the components of precision structures were selected, taking into account the covariates selected for the mean model in the first step. Finally, we remove non-significant covariates (if any) in both mean and precision linear predictors.  Let Y ijk |u i1 , u i2 ∼ UG(μ ijk , φ ij ) the percentage of body fat for i = 1, . . . , 298 subjects and j = 1, . . . , 5 regions. We assume that u i1 ∼ N (0, τ 2 1 ) and u i2 ∼ N (0, τ 2 2 ). Thus, the linear predictors for the mean and precision are given, respectively, by logit(μ ijk ) = β 0 + β 1 age i + β 2 BMI i + β 3 gender i + β 4k IPAQ ki + β 5j regions ji + u i1 , and log(φ ij ) = γ 0 + γ 1 BMI i + γ 2j regions ji + u i2 .   The effect of the continuous covariates age (years) and BMI (kg/m 2 ) are accessed by β 1 and β 2 , respectively. For the categorical covariates, the parameter β 3 evaluates the change from female to male, while β 4k , for k = 2, 3, measure the changes from the IPAQ sedentary to insufficiently active and active, respectively. Similarly, the parameter β 5j , quantifies the changes from the regions from the arms to legs, trunk, android, and gynoid, for j = 2, . . . , 5. The linear predictor for the precision structure is modelled by the logarithm link function, and this structure is composed of an intercept and the covariate effects of BMI (kg/m 2 ) and regions. We fitted the unit gamma mixed regression models with fixed and varying precision, and we compared it with the beta mixed models. The computational implementation of the beta mixed regression models is available in the R package glmmTMB [40]. Although both TMB and glmmTMB packages provide the same computational implementation, we opted for fitting all regression models using TMB. The main advantage of TMB is that we can easily apply the delta method to calculate the standard errors for parameter estimates when required. Table 1 shows the parameter estimates, standard errors, maximized log-likelihood and Akaike and Bayesian information criterion obtained by the unit gamma and beta random effects models. Moreover, we used the delta method to calculate the standard errors for the parameters φ, τ 1 , and τ 2 .
According to the results presented in Table 1, the unit gamma mixed regression model with random intercept (LogLik = 2208.64) and varying precision (LogLik = 2305.82) showed the best fit to the data in relation to the beta counterparts. Furthermore, all parameter estimates by the unit gamma and beta random intercept models are significantly different from zero. Thus, the covariates age, BMI and regions of legs, trunk, android and gynoid showed a positive relation with the response variable.
Regarding the unit gamma mixed regression model with varying precision, only the coefficients β 1 , β 42 and τ 2 did not present statistical significance (p > 0.05). Such coefficients are associated with the covariates age, IPAQ (insufficiently active) and standard deviation of the random effect on the precision structure, respectively. Based on the results obtained by the unit gamma mixed model with varying precision, it is easy to see that precision is positively associated with the covariates BMI and body regions.

Water quality index data set
The second data set corresponds to a water quality index of hydroelectric power plants operated by the energy company COPEL in the State of Paran, Brazil. The response variable is the IQA (acronym in Portuguese) which was measured 12 times (3 locations ×4 quarters) in each of the 16 power plants. Thus, this dataset is composed of 190 observations (excluding two missing observations). We assume that this data set is structured in two grouping levels: power plants and quarters within the power plant. This data set was previously analysed by other authors, using different regression models [8,18,24,41] and, in this paper, we will use to illustrate the proposed model through a nested model under the generalized linear mixed models framework. Figure 9 presents a descriptive analysis of the data through histograms and boxplots. The results presented in Figure 9(A) show that the distribution of the response variable is asymmetric. Thus, we need a suitable probability distribution for asymmetric data in the interval (0, 1).
For specification of the unit gamma mixed regression models, we assume that IQA was measured at the k th location on the j th quarter within the power plant i, for i = 1, . . . 16. Therefore, we considered the following structure where β 1j evaluates the differences from quarter 1 to up to 4, for j = 2, 3, 4. The parameter β 2k measure the changes from the upstream to the reservoir and to the downstream, for k = 2, 3. We highlight that, both random effects u i and u i,j are assumed to be independent for different i and j. Table 2 shows the parameter estimates, standard errors, maximized log-likelihoods (LogLik) and Akaike (AIC) and Bayesian (BIC) information criterion obtained by the unit gamma and beta mixed models.
The results presented in Table 2 show that the unit gamma mixed model provide the best fit. The difference in terms of log-likelihood value from the unit gamma mixed model with random intercept and nested to the corresponding beta mixed models was 2.18 and 5.37, respectively. The results of the unit gamma nested model show that in quarter 3 (β 13 = 0.327) and the reservoir (β 22 = 0.259) the IQA were higher than the values obtained from quarter 1 and upstream, respectively. Such results agree with the descriptive analysis presented in Figure 9.  Note: * Indicate p-value < .05. a random intercept models. b nested models.

Discussion
In this paper, we proposed the unit gamma mixed regression models to deal with continuous bounded variables in studies with repeated measures and/or grouped data. The model was introduced in the GLMM framework and we used the maximum likelihood approach for parameter estimation and inference. Our model was implemented using Template Model Builder (TMB), which allows us to easily apply the Laplace approximation method combined with automatic differentiation. We carried out three extensive simulation studies to verify the computational implementation and check the properties of the maximum likelihood estimators to deal with longitudinal continuous bounded variables. We conducted three simulation studies, taking into account the unit gamma mixed models with (i) fixed precision; (ii) random slope (correlated random effects), and (iii) varying precision. Furthermore, we simulate data using different numbers of subjects and repeated measures (see supplementary material online). The results suggest that in all simulation scenarios the maximum likelihood approach provides unbiased and consistent estimators for the regression, precision, correlation, and variance parameters for large samples.
We applied the proposed model in two real data sets. In both data analyses, all measures of goodness-of-fit (LogLik, AIC and BIC) indicated that the unit gamma mixed models provide a better fit than the beta mixed models.
As future work, it is suggested to extend the unit gamma mixed models to deal with longitudinal continuous bounded variables with exact zeroes and/or ones. Furthermore, we suggest to propose a multivariate regression model based on the unit gamma distribution to handle multiple continuous bounded variables, as the multivariate regression models proposed by Souza and Moura [42] and Lemonte and Moreno-Arenas [43].