Estimation of bias-corrected intraclass correlation coefficient for unbalanced clustered studies with continuous outcomes

Abstract Intraclass correlation coefficient for data in a clustered study is traditionally estimated from a one-way random-effects model. This model assumes normality for the random cluster effect and the residual effect. When the normality assumption is questionable, we find that the estimated correlation could be much below the nominal level when data are highly skewed or data have low kurtosis. We develop a bias-corrected estimator based on the approach by Thomas and Hultquist for a study with unbalanced cluster sizes. For multivariate normal data or non-normal data with moderate skewness, we compare the performance of the new bias-corrected estimator with two existing estimators with regards to accuracy and precision. When correlation is small, the existing ANOVA estimator works well. When correlation is medium to large, the proposed new estimator has the correlation close to the nominal level, and its mean squared error is smaller than others.


Introduction
Clustered data are found in many studies: such as repeated-measure studies, healthcare studies, inter-rater reliability studies, and household surveys. Dependence of outcomes from the same cluster is known as the intraclass correlation coefficient (ICC), which is traditionally estimated from a one-way random-effects model (Demetrashvili, Wit, and van den Heuvel 2016;Donner 1986;Lai 2020;Shan and Ma 2014;Ukoumunne 2002). This estimator q can be expressed as a function of the sum of squares from the analysis of variance (ANOVA) table. This estimator is referred to be as the ANOVA estimator which can be applied to studies with balanced or unbalanced cluster sizes. Thomas and Hultquist (1978) developed an asymptotic estimator for q when a study has unbalanced cluster sizes (named as the TH estimator), and they showed that the TH estimator works very well. These two estimators were developed under the normality assumption. When this assumption is not met, a non-parametric measure proposed by Rothery (1979) may be considered. However, this rank-based estimator tends to be over-estimated in many configurations.
In one of our on-going projects to study the impact of cognitive impairment on patients with Parkinson's Disease (PD), skewness of some biomarkers could be very small (e.g. À4), which is considered to be highly skewed. A large number of existing studies showed that the ANOVA estimator works not too bad for skewed data when kurtosis is very large (e.g. over 5) (Ukoumunne 2002). When kurtosis is large, data have a huge spike and outcomes are close to each other. It is not clear whether the estimated q is close to the nominal level or not when kurtosis is small or even negative. For this reason, we conduct extensive simulation studies to investigate the robustness of q estimators to non-normality with a wide range of skewness and kurtosis. These findings provide guidance on the q estimation for other researchers.
Recently, Atenafu et al. (2012) developed a bias-corrected estimator for q by using the secondorder Taylor series expansion for a balanced study. Following their approach, we develop a biascorrected estimator for a study with unbalanced cluster sizes. In a balanced study, the ratio of sum of square among groups and its variance follows a chi-squared distribution (Donner and Wells 1986). This important property is generally not satisfied in an unbalanced study. For this reason, Thomas and Hultquist (1978) proposed an asymptotic version of that ratio: the variance of cluster means is used to approximate the sum of square among groups. Similarly, we use the second-order Taylor expansion of the log transformation to derive the bias of the TH estimator. We then compare the new bias-corrected estimator with the existing ANOVA estimator and the TH estimator with regards to accuracy and precision.
We organize this article as follows. In Sec. 2, we first introduce the two commonly used ICC estimators, and then derive the biased-corrected estimator. In Sec. 3, we conduct simulation studies to compare the three estimators with regards to accuracy and precision under a wide range of configurations. An example from the Parkinson's Progression Markers Initiative (PPMI) is used to illustrate the application of the new estimator and the existing estimators. Lastly, we provide some comments and suggestions in Sec. 4.

Methods
Suppose a clustered study has k clusters, and the i-th cluster has n i subjects. When n 1 ¼ n 2 ¼ Á Á Á ¼ n k , it becomes a balanced study. Atenafu et al. (2012) developed a bias-corrected estimator for q in a balanced study from a one-way random-effects model. We extend their approach to a study with unbalanced cluster sizes which often occur in practice (e.g. different length of follow up, or unequal numbers of raters for each participant). The one-way random-effects model is expressed as where i ¼ 1, 2, :::, k, j ¼ 1, 2, :::, n i , Y ij is the outcome for the j-th subject in the i-th cluster. The random cluster effect a i is assumed to follow Nð0, r 2 a Þ, and the residual effect e ij follows Nð0, r 2 e Þ: These two effects are assumed to be independent from each other. It is obvious that this model is under normal assumptions of the random effect and the residual effect. The parameter of interest q is defined as The ICC can be estimated by using the sums of squares from the ANOVA table where m ¼ 1 kÀ1 ½N À P k i¼1 n 2 i N is the harmonic mean of the cluster sizes, SSA ¼ P k i¼1 n i ð Y i: À Y :: Þ 2 is the sum of square among clusters, SSE ¼ P k i¼1 P n i j¼1 ðY ij À Y i: Þ 2 is the sum of square within clusters, Y i: is the mean of outcomes from the i-th cluster, and Y :: is the grand mean of all outcomes.
The ICC in Equation (1) can be reorganized as It is well known that SSA=ðmr 2 a þ r 2 e Þ follows a chi-squared distribution in an unbalanced study if and only if r 2 a ¼ 0 (Donner and Wells 1986;Thomas and Hultquist 1978). Thus, the estimator for F in a balanced study by Atenafu et al. (2012) can be not directly extended to an unbalanced study.
For an unbalanced study, Thomas and Hultquist (1978) developed a new ratio test by using the sample variance of the cluster means (S 2 c ) and SSE, where They showed that mðk À 1ÞS 2 c =ðmr 2 a þ r 2 e Þ asymptotically follows a chi-squared distribution with the degree freedom of ðk À 1Þ : v 2 ðk À 1Þ: Let g SSA ¼ mðk À 1ÞS 2 c : It should be noted that g SSA is SSA in a balanced study. We then have the following theorem for an asymptotically unbiased estimator for F in a study with unbalanced cluster sizes.
Theorem 2.1. An asymptotically unbiased estimator for F is Proof. From the results by Thomas and Hultquist (1978), we have g SSA mr 2 a þ r 2 e $ asy v 2 ðk À 1Þ, and SSE r 2 e $ v 2 ðN À kÞ: Then, an asymptotically unbiased estimator for F is presented as : This estimator was originally proposed by Thomas and Hultquist (1978) who extended the results from a balanced study by Searle (1971). When b F is derived from the aforementioned theorem, an estimator for q is given as A log transformation is applied on q to compute the bias: log q ¼ log F À log ðF þ 1Þ: Then, the second-order Taylor expansion of the log function can be used as an approximation of the log function: Thus, the estimated bias of d log ðb q TH Þ is 1 Since the bias is always positive, the bias-corrected estimator b q TH bc is always larger than b q TH : When the estimated F is small, the quantity of 1 b F 2 À 1 ð b F þ1Þ 2 in the bias-corrected estimator would be too large. For this reason, the second order Taylor expansion of log ð1 À qÞ is used to compute the bias, and the bias-corrected estimator is We use 0.1 as the threshold for b F on whether the estimator in Equation (4) or the one in Equation (5) is used as the bias-corrected estimator.

Results
We compare the performance of the three ICC estimators based on the ANOVA method, the TH method, and the bias-corrected TH method (b q ANOVA , b q TH , and b q TH bc ) with regards to accuracy and precision when data are from multivariate normal distributions and non-normal distributions. The clustered data are simulated by using the R function mvrnonnorm from the semTools package, with the mean vector of zero (R codes are presented in the Supplementary material). The algorithm for simulating multivariate normal and non-normal data in the R function was developed by Vale and Maurelli (1983). Skewness and kurtosis can be specified in this R function for non-normal data. When skewness is below À1 or above 1, such data are considered as highly skewed. We first investigate q estimators with data normally distributed or moderately skewed: skewness ¼ -0.5, 0, and 0.5. The considered values of kurtosis are: À0.8, 0, and 0.8. When both skewness and kurtosis are zero, data are normally distributed. The pre-specified ICC q values are set as: 0.1, 0.3, 0.5, 0.7, and 0.9. The variance r 2 ¼ r 2 a þ r 2 e is assumed to be 1. Suppose n is the maximum of clustere sizes: n ¼ maxfn i : i ¼ 1, 2 Á Á Á , kg: The size of each cluster is uniformly distributed between 1 and n. For each given configuration (k, n, r 2 , skewness, kurtosis), we generate B ¼10,000 data sets where each data set has k clusters. Figure 1 shows the average of q estimators and the mean squared error (MSE) for the multivariate normal data (top) when k ¼ 40 and n ¼ 30, and the multivariate non-normal data (bottom) when k ¼ 40, n ¼ 100, skewness ¼ 0.5, and kurtosis ¼ 0.8. The MSE is defined as where b q b is the estimated q from the b-th simulation and q is the pre-specified ICC. It can be seen from the figure that the three methods have similar average correlations being very close to the nominal level. For the multivariate normal case, the new b q TH bc is slightly above the nominal level while the other two methods are slightly below the nominal level. When q is small (e.g. 0.1 or 0.3), the traditionally used estimator b q ANOVA has the smallest MSE, and b q TH bc is the best estimator when q is medium or large. Similar results are observed when data are non-normal. We present the relationship between the estimated ICC and variance (r 2 ) from 0.3 to 1,000 in Figure 2 for multivariate normal data when the pre-specified q is 0.1 (top) and 0.7 (bottom). The other parameters are: k ¼ 30 and n ¼ 40. The average of the estimated q values for each method is consistent as r 2 changes. The MSE also seems independent of r 2 : These plots show a detailed comparison between the three ICC estimators, having the results similar to the findings in Figure 1. When q ¼ 0:1, the ANOVA method has the estimate close to the nominal level, and the new biascorrected estimator becomes the best when q ¼ 0:7: In the existing literature for non-normal data, skewness is often small or skewness is above 1 but kurtosis is large. We conduct simulation studies with skewness from À2 to 2, and kurtosis from À2 to 3 (À2, 0, 0.5, and 3). Figure 3 shows the average of the b q ANOVA when q ¼ 0.1 (top), 0.5 (middle), and 0.9 (bottom). The results from three plots are similar to each other. It can be seen that the average of b q ANOVA is close to the nominal level when skewness between À1 and 1 and kurtosis is above 0.5. When kurtotsis is small or even negative, the average of b q ANOVA could be much below the nominal level, especially in the cases of kurtosis ¼ À2. When kurtosis is near 0 and skewness is between 0.5 and 1 (or À1 to À0.5), the average of b q ANOVA is about 10% below the nominal level. This simulation shows that the larger kurtosis is, the higher the estimated ICC is.

Example
The PPMI study is an international study having 33 clinical sites with the goal to develop new treatments for PD by discovering new and promising biomarkers. The test battery of Hopkins Verbal Learning Test-Revised (HVLT-R) is one of the tests in the PPMI study to measure verbal learning and memory. For PD patients in the PPMI study, the average HVLT-R test score is 26.1, and the range of visits is from 1 to 6. The estimated ICC based on the three considered methods are: b q ANOVA ¼ 0:170, b q TH ¼ 0:236, and b q TH bc ¼ 0:244: The TH estimator and the new estiamtor are higher than the traditionally used ANOVA estimator. The average of skewness values across the visits is À0.487, which is considered as moderately skewed. Kurtosis is close to zero as 0.024.
The Montreal Cognitive Assessment (MoCA) is another measure that is designed as a screening instrument for cognitive impairment, with the score range between 0 and 30. A high score is considered to be not impaired. In the PPMI study, the average score of MoCA for PD is 27.5, which is similar to results from other reported studies Cummings and Fulkerson 2018;. The ICC is estimated as: b q ANOVA ¼ 0:508, b q TH ¼ 0:423, and b q TH bc ¼ 0:428: Skewness has the range from À0.22 to À4.58. The skewness seems a decreasing function of visit. Meanwhile, the estimated kurtosis is an increasing function of visit and it is as high as 32 at the last visit.

Conclusion
In this article, we develop an asymptotic bias-corrected estimator for the ratio of r 2 a and r 2 e in the ICC calculation. We show that this new estimator performs better than the existing TH estimator and the traditionally used ANOVA estimator with regards to accuracy and precision for medium to large ICC when data are not highly skewed (Shan 2015(Shan , 2018Shan and Zhang 2019). The bias in the F estimate is computed from a second-order Taylor expansion of a log transformation. We explore the third-order Taylor expansion, and the results are similar to the proposed estimator (Shan 2020a(Shan , 2020bShan and Chen 2018;Shan et al. 2016). The impact of the estimated q from the third moment is relatively small as compared to that from the second moment.

Discussion
We conduct extensive simulation studies to show that ICC could be significantly under-estimated when data are highly skewed or kurtosis is near zero or negative (Shan et al. 2010. Kurtosis seems to have a positive relationship with the ICC. For this reason, simulations from existing literature showed good accuracy of ICC methods as kurtosis is often large. Based on this finding, we would recommend researchers to use the ICC methods with caution when kurtosis is small or negative and skewness is below À1 or above 1. In such cases, non-parametric methods for ICC may be considered. The rank based measure by Rothery (1979) does not have estimates close to the nominal level. Thus, it would be interesting to develop new non-parametric measures to improve the accuracy (Shan 2014;Shan, Young, and Kang 2014;Zhang and Guogen 2020). Alternatively, the approaches to calculate Pearson correlation when normality is not met, could be possibly extended for clustered data (Pearson 1947).
In addition to ICC estimate, confidence interval of ICC is also another important research topic. Hartung and Knapp (2000) proposed a Wald type confidence interval for ICC in the oneway random effects model. Liu, Zhao, and Li (2016) utilized the confidence interval distribution approach by combining asymptotic confidence distributions in the one-way random effects model. Recently, Al-Sarraj, von Br€ omssen, and Forkman (2019) proposed a new generalized prediction interval based on the approximate methods of Satterthwaite and Kenward and Roger for random effects in linear random-effects models, and it was shown to have the coverage probabilities being closer to the nominal level than the traditional intervals based on the restricted maximum likelihood method.
Methods for constructing confidence intervals for variance component ratios in general unbalanced mixed models are developed. The methods are based on inverting the distribution of the signed root of the log-likelihood ratio statistic constructed from either the restricted maximum likelihood or the full likelihood (Jiang, Cao, and Shan 2020). As this distribution is intractable, the inversion is rather based on using a saddlepoint approximation to its distribution. Apart from Wald's exact method, the resulting intervals are unrivaled in terms of achieving accuracy in overall coverage, underage, and overage. Issues related to the proper "reference set" with which to judge the coverage as well as issues connected to variance ratios being nonnegative with lower bound 0 are addressed. Applications include an unbalanced nested design and an unbalanced crossed design.
It can be seen from the PPMI example, the estimated skewness and kurtosis are not small for the MoCA measure. One possible reason is that such measures always have a boundary, and a certain population (e.g. PD without MCI) has the outcome close to the upper boundary due to learning effect. Thus, data are likely to be skewed. In addition, as disease status progresses in the disease groups, outcomes are skewed to the worse direction. Meanwhile, kurtosis could change as time goes. Generally, we assume a constant skewness and kurtosis in the ICC calculation which can be violated in practice. When this occurs, homogeneity of skewness and kurtosis should be tested.