Modeling and analysis of functional method comparison data

Abstract We consider modeling and analysis of functional data arising in method comparison studies. The observed data consist of repeated measurements of a continuous variable obtained using multiple methods of measurement on a sample of subjects. The data are treated as multivariate functional data that are observed with noise at a common set of discrete time points which may vary from subject to subject. The proposed methodology uses functional principal components analysis within the framework of a mixed-effects model to represent the observations in terms of a small number of method-specific principal components. Two approaches for estimating the unknowns in the model, both adaptations of general techniques developed for multivariate functional principal components analysis, are presented. Bootstrapping is employed to obtain estimates of the bias and covariance matrix of the model parameter estimates. These in turn are used to compute confidence intervals for parameters and functions thereof, such as the measures of similarity and agreement between the measurement methods, that are needed for data analysis. The estimation approaches are evaluated using simulation. The methodology is illustrated by analyzing two datasets.


Introduction
Multivariate functional data arise when repeated measurements of $J\,(\geq 2)$ variables are taken over time on every subject (Ramsay and Silverman 2005; Berrendero, Justel, and Svarc 2011; Chiou, Chen, and Yang 2014; Jacques and Preda 2014; Happ and Greven 2018). The measurements of each variable on a subject are assumed to be values of an underlying smooth random function that is observed with noise at discrete time points. For each subject, all $J$ variables are recorded at every observation time. Thus, these data consist of $J$ curves per subject, observed at a common set of discrete observation times. This set of times, however, may vary from subject to subject. There is dependence among the $J$ curves as they come from the same subject.
We are specifically interested in the special case of multivariate functional data arising in method comparison studies (Choudhary and Nagaraja 2017). They involve measuring a continuous variable on every subject using multiple methods of measurement in a common unit. All methods measure the variable with error. The primary goal in these studies is to evaluate whether the methods agree sufficiently well to be used interchangeably. It is evident from more than 25,000 citations of Bland and Altman (1986), which proposed the popular limits of agreement approach for agreement evaluation with scalar observations, that such studies are common in biomedical sciences. The measurements from the multiple methods are the dependent functional variables here.
For example, consider two method comparison datasets, both with $J = 2$, that motivated this work: the body fat data of Chinchilli et al. (1996) and the body temperature data of Li and Chow (2005). In the first, we have measurements of percentage body fat made using skinfold calipers and dual energy x-ray absorptiometry (DEXA) in a cohort of adolescent girls over a period of about 4 years. These longitudinal data are an example of sparse bivariate functional data. In the second, we have core body temperature (the temperature of tissues deep within the body), measured every minute over a period of 90 minutes at two locations in the body, the esophagus and the rectum. These data are an example of dense bivariate functional data. Our interest is in evaluating agreement between measurements from the caliper and DEXA methods in the first case and between measurements taken at the two body locations in the second case.
There is a growing body of literature on the analysis of method comparison data. See Barnhart, Haber, and Lin (2007) and Choudhary and Nagaraja (2017) for an introduction. Nevertheless, almost all of this literature assumes that the observations are scalar. For scalar data, evaluation of agreement between two methods involves quantifying how far the methods are from perfect agreement, in which case the joint distribution of the methods is concentrated on the line of equality. In other words, two methods in perfect agreement have equal means, equal variances, and a correlation of one; or equivalently, their differences are zero with probability one. In the statistical literature, agreement is commonly evaluated by performing inference on measures of agreement such as the concordance correlation coefficient (CCC) of Lin (1989) and the total deviation index (TDI). However, in the biomedical literature, the limits of agreement approach of Bland and Altman (1986) is the most popular. These measures are defined in Sec. 4. A reader interested in their comparison may consult Barnhart, Haber, and Lin (2007). In addition to evaluation of agreement, a secondary goal of a method comparison study is to evaluate similarity of methods by comparing their marginal characteristics such as means and precisions. This is typically done by performing inference on measures of similarity such as the mean difference and the precision ratio (Dunn 2007). Evaluation of similarity is a necessary supplement to evaluation of agreement as it provides information about the sources of disagreement between the methods (Choudhary and Nagaraja 2017, Chapter 1).
In the method comparison literature, we are only aware of Li and Chow (2005) as dealing with functional observations. It extends the ideas of Lin (1989) to develop a CCC for functional data from $J = 2$ methods. But this approach has drawbacks that limit its usefulness. First, it produces a single overall index of agreement over the entire time interval. However, given the functional nature of the data, an index that changes smoothly over time may be preferable to the overall scalar index because the former allows insight into how the extent of agreement changes over time. Second, the approach is specifically designed for CCC, a function of first- and second-order moments of the measurements. It is unclear how the approach can be adapted for other measures of agreement such as TDI, which is a percentile (see Sec. 4). This is an issue because CCC is often criticized for being unduly influenced by the between-subject variation in the data, which may lead to misleading conclusions (see, e.g., Barnhart, Haber, and Lin 2007). Third, the approach assumes that all curves are observed at the same time points. This assumption is unnecessarily restrictive. For example, it does not hold for the body fat data, although it holds for the body temperature data. Fourth, the approach in its present form cannot deal with $J > 2$ methods. These drawbacks may be overcome by a model-based approach for analyzing functional method comparison data. The model parameters can be used to obtain functional analogs of any measure of similarity and agreement for scalar observations. The model would allow the observation times to differ between the subjects. It can also accommodate more than two methods. This is the approach we take in this article. Functional data analysis is currently an active area of research; see Ramsay and Silverman (2005) for an introduction.
A common analytical approach involves performing a functional principal components analysis (FPCA) to obtain a parsimonious representation of the data (Ramsay and Silverman 2005, Chapter 8). The PACE (principal components analysis through conditional expectation) methodology of Yao, Müller, and Wang (2005) is a popular approach for FPCA of data that are observed with measurement error. It involves decomposing the functional observations via a Karhunen–Loève expansion, using the framework of a mixed-effects model to estimate the coefficients in the expansion as best linear unbiased predictors of random effects, and estimating the error variance by smoothing the covariance function. This approach and its refinement due to Goldsmith, Greven, and Crainiceanu (2013) are implemented in the refund (Goldsmith et al. 2016) and MFPCA (Happ 2018) packages for the statistical software system R (R Core Team 2018).
Methodologies for FPCA of multivariate functional data have also been developed; see, e.g., Ramsay and Silverman (2005, Chapter 8), Berrendero, Justel, and Svarc (2011), Jacques and Preda (2014), Chiou, Chen, and Yang (2014), and Happ and Greven (2018). Among these, the approaches of Chiou, Chen, and Yang (2014) and Happ and Greven (2018) are of specific interest in this article as they can be used for data observed with measurement error, which is the case for our method comparison data. Although Chiou, Chen, and Yang (2014) and Happ and Greven (2018) differ in their basic premise regarding the univariate components of the multivariate observation (in particular, the components may have different units in Chiou, Chen, and Yang (2014) and may be observed on different (dimensional) domains in Happ and Greven (2018)), both first obtain a Karhunen–Loève expansion of the multivariate observations. Thereafter, Chiou, Chen, and Yang (2014) estimate the unknowns by a generalization of the PACE methodology. They also employ normalization to deal with the different units. On the other hand, Happ and Greven (2018) establish a relation between the univariate and multivariate FPC decompositions and employ it to obtain estimates of the unknowns in the multivariate model from their estimates in the univariate models. The univariate estimates may be obtained, e.g., using the PACE approach of Yao, Müller, and Wang (2005). This methodology is implemented in the R package MFPCA (Happ 2018).
This brings us to our approach for analysis of functional method comparison data. In Sec. 2, we begin by writing a subject's observed curve from a measurement method as a sum of an unobservable true smooth curve and a random measurement error. Each measurement method has its own mean and covariance functions and error variance. Next, the method-specific true curves are represented via a multivariate Karhunen–Loève expansion. In Sec. 3, we consider two approaches for estimating the unknowns in the model. The first approach, termed MPACE, directly adapts the PACE methodology to deal with multivariate data along the lines of Chiou, Chen, and Yang (2014). The second approach, termed UPACE, adapts the methodology of Happ and Greven (2018). The bootstrap is used to construct the relevant confidence intervals and bands. In Sec. 4, we discuss evaluation of similarity and agreement under the assumed model. Sec. 5 presents a simulation study to evaluate properties of the two estimation approaches. The body fat data are analyzed in Sec. 6. Sec. 7 concludes with a discussion. Appendix A contains some technical details. An analysis of the body temperature data and additional simulation results are presented in the online Supplemental Material, which can be accessed from the journal website.

Modeling of data
Let the random function $X_j$ denote the true unobservable curve measured using method $j = 1, \ldots, J\,(\geq 2)$ for a randomly selected subject from the population of interest. The curves are defined on a common domain $\mathcal{T} = [a, b]$, $a < b \in \mathbb{R}$. Let the mean and covariance functions of the random functions be denoted by $\mu_j(t) = E(X_j(t))$ and $G_{jl}(s, t) = \mathrm{cov}(X_j(s), X_l(t))$, $j, l = 1, \ldots, J$; $s, t \in \mathcal{T}$. Let $X = (X_1, \ldots, X_J)^T$ denote the $J \times 1$ vector of the curves and $\mu(t) = (\mu_1(t), \ldots, \mu_J(t))^T$ be the $J \times 1$ vector of its mean.

Model for population curves
Under certain conditions (Chiou, Chen, and Yang 2014; Happ and Greven 2018), the multivariate Karhunen–Loève theorem provides a stochastic representation of $X$ as
$$X(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_k \phi_k(t), \quad t \in \mathcal{T}. \qquad (1)$$
Here, $\phi_k(t) = (\phi_{k1}(t), \ldots, \phi_{kJ}(t))^T$ are orthonormal eigenfunctions, satisfying the property that the inner product of $\phi_k$ and $\phi_l$, given as $\sum_{j=1}^{J} \int_{\mathcal{T}} \phi_{kj}(t)\,\phi_{lj}(t)\,dt$, equals zero if $k \neq l$ and one if $k = l$; and the $\xi_k$, called "scores," are uncorrelated random variables with mean zero and variance $\lambda_k$. The variances $\lambda_k$ are the eigenvalues associated with the eigenfunctions $\phi_k$ and are non-increasing, i.e., $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$. We can write (1) as
$$X_j(t) = \mu_j(t) + \sum_{k=1}^{\infty} \xi_k \phi_{kj}(t), \quad j = 1, \ldots, J; \ t \in \mathcal{T}. \qquad (2)$$
Thus, the Karhunen–Loève representation provides a basis expansion of the curve $X_j$ in terms of the basis functions $\phi_{1j}, \phi_{2j}, \ldots$ that depend on method $j$, whereas the random coefficients $\xi_1, \xi_2, \ldots$ are common to all methods. It is these coefficients that induce dependence within and between the curves. In particular, under (2), the covariance functions can be written as
$$G_{jl}(s, t) = \sum_{k=1}^{\infty} \lambda_k \phi_{kj}(s)\,\phi_{kl}(t), \quad j, l = 1, \ldots, J; \ s, t \in \mathcal{T}. \qquad (3)$$
The true curves $X_j(t)$ are observed with error as $Y_j(t) = X_j(t) + \epsilon_j(t)$, where the errors $\epsilon_j(t)$ are independent random variables with mean zero and variance $\tau_j^2$, $j = 1, \ldots, J$, and are independent of the true values. Using (2), we can write this model as
$$Y_j(t) = \mu_j(t) + \sum_{k=1}^{\infty} \xi_k \phi_{kj}(t) + \epsilon_j(t), \quad j = 1, \ldots, J; \ t \in \mathcal{T}. \qquad (4)$$
Thus, the mean and autocovariance functions of the observed curves are $E(Y_j(t)) = \mu_j(t)$ and $\mathrm{cov}(Y_j(s), Y_j(t)) = G_{jj}(s, t) + \tau_j^2 I(s = t)$, $j = 1, \ldots, J$, and their cross-covariance function is $\mathrm{cov}(Y_j(s), Y_l(t)) = G_{jl}(s, t)$, $j \neq l = 1, \ldots, J$. Here $I$ is the indicator function.
It follows that, for each $t \in \mathcal{T}$, the vector $(Y_1(t), \ldots, Y_J(t))$ has a $J$-variate distribution with means $(\mu_1(t), \ldots, \mu_J(t))$, variances $(\sigma_1^2(t), \ldots, \sigma_J^2(t))$, and correlations $\rho_{jl}(t)$, where
$$\sigma_j^2(t) = G_{jj}(t, t) + \tau_j^2, \qquad (5)$$
$$\rho_{jl}(t) = \frac{G_{jl}(t, t)}{\sigma_j(t)\,\sigma_l(t)}, \quad j \neq l = 1, \ldots, J. \qquad (6)$$
Further, for $j \neq l$, the difference $D_{jl}(t) = Y_j(t) - Y_l(t)$ has a distribution with mean $\delta_{jl}(t)$ and variance $\eta_{jl}^2(t)$, where
$$\delta_{jl}(t) = \mu_j(t) - \mu_l(t), \quad \eta_{jl}^2(t) = \sigma_j^2(t) + \sigma_l^2(t) - 2 G_{jl}(t, t). \qquad (7)$$
These distributions are used in Sec. 4 to get functional analogs of measures of similarity and agreement.
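As a concrete illustration, the following sketch (not the authors' code) computes these induced pointwise quantities from a truncated Karhunen-Loève specification with $J = 2$ methods; the eigenvalues, eigenfunctions, mean functions, and error variances are assumed values chosen purely for illustration.

```python
import numpy as np

# Illustrative sketch of the pointwise quantities in Eqs. (5)-(7), computed
# from a truncated Karhunen-Loeve specification with J = 2 methods and K = 2
# components. All numerical inputs below are assumed for illustration only.
t = np.linspace(0.0, 1.0, 50)                 # grid on T = [0, 1]
lam = np.array([4.0, 1.0])                    # eigenvalues lambda_k
# phi[k, j, :] holds phi_kj evaluated on the grid
phi = np.stack([
    np.stack([np.ones_like(t), np.ones_like(t)]),
    np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)]),
])
mu = np.stack([24 + t, 23 + 2 * t])           # mean functions mu_j(t)
tau2 = np.array([2.0, 2.0])                   # error variances tau_j^2

# Covariance functions on the diagonal: G_jl(t, t) = sum_k lam_k phi_kj(t) phi_kl(t)
G = np.einsum('k,kit,kjt->ijt', lam, phi, phi)
sigma2 = G[[0, 1], [0, 1], :] + tau2[:, None]       # sigma_j^2(t), Eq. (5)
rho = G[0, 1, :] / np.sqrt(sigma2[0] * sigma2[1])   # rho_12(t),  Eq. (6)
delta = mu[0] - mu[1]                               # delta_12(t), Eq. (7)
eta2 = sigma2[0] + sigma2[1] - 2 * G[0, 1, :]       # eta_12^2(t), Eq. (7)
```

Because the error variances are positive, the induced correlation $\rho_{12}(t)$ is strictly below one even when the true curves are perfectly correlated.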

Model for observed data
Suppose there are $n$ subjects in the study, indexed as $i = 1, \ldots, n$. The observed data consist of $J$ curves per subject, one from each method, observed at discrete observation times. Specifically, let $Y_{ij}(t_{im})$ denote the observation from method $j$ on subject $i$ taken at time $t_{im}$, $m = 1, \ldots, N_i$; $j = 1, \ldots, J$; $i = 1, \ldots, n$. The $J$ curves for a subject are linked in that they are observed at the common observation times $t_{im}$, $m = 1, \ldots, N_i$. Thus, subject $i$ contributes $J N_i$ observations. The number of observations and the observation times need not be the same for each subject. The design is balanced if the observation times are common to all subjects and the linked observations are available at each observation time from every subject. Otherwise, the design is unbalanced. The functional data are usually said to be dense when the design is balanced and the common $N_i$ is large, and they are said to be sparse when the design is unbalanced and $N_i$ is small. To obtain a model for the observed data, let $X_{ij}(t)$, $\epsilon_{ij}(t)$, $Y_{ij}(t)$, and $\xi_{ik}$ denote the respective counterparts for subject $i$ of the population quantities $X_j(t)$, $\epsilon_j(t)$, $Y_j(t)$, and $\xi_k$, given by (2) and (4). The quantities for subject $i$ are assumed to be independent copies of the corresponding population quantities. Thus, the model for the data can be written as
$$Y_{ij}(t_{im}) = \mu_j(t_{im}) + \sum_{k=1}^{\infty} \xi_{ik} \phi_{kj}(t_{im}) + \epsilon_{ij}(t_{im}), \quad m = 1, \ldots, N_i; \ j = 1, \ldots, J; \ i = 1, \ldots, n, \qquad (8)$$
where the errors $\epsilon_{ij}(t_{im})$ are independent random variables with mean zero and variance $\tau_j^2$. The model postulates that a subject's true curve from a method is an infinite linear combination of method-specific basis functions that are common to all subjects, with subject-specific coefficients that are common to all methods. The eigenfunctions $\phi_{kj}(t)$ serve as the basis functions and the scores $\xi_{ik}$ serve as the coefficients.
To analyze these data, we first perform a dimension reduction by truncating the infinite sum in (8) to $K$ terms, where $K$ is the number of functional principal components (FPCs) to be selected. This leads to
$$Y_{ij}(t_{im}) = \mu_j(t_{im}) + \sum_{k=1}^{K} \xi_{ik} \phi_{kj}(t_{im}) + \epsilon_{ij}(t_{im}), \quad m = 1, \ldots, N_i; \ j = 1, \ldots, J; \ i = 1, \ldots, n \qquad (9)$$
as the approximate model. It has the structure of a mixed-effects model. The true model (4) is used to define the parameters and their functions that are the targets of inference. But they are estimated by fitting this approximate model to the data. The number of components $K$ is treated as an unknown component of the model. The issue of estimation of the unknowns is taken up in Sec. 3. To write (9) in matrix notation, define the $N_i \times 1$ vectors $t_i = (t_{i1}, \ldots, t_{iN_i})^T$, $Y_{ij}(t_i) = (Y_{ij}(t_{i1}), \ldots, Y_{ij}(t_{iN_i}))^T$, $\mu_j(t_i) = (\mu_j(t_{i1}), \ldots, \mu_j(t_{iN_i}))^T$, and $\epsilon_{ij}(t_i) = (\epsilon_{ij}(t_{i1}), \ldots, \epsilon_{ij}(t_{iN_i}))^T$. These respectively represent the vectors of the observation times for subject $i$, the corresponding observations from method $j$, their means, and the associated random errors. Next, define the $J N_i \times 1$ vectors
$$Y_i(t_i) = (Y_{i1}^T(t_i), \ldots, Y_{iJ}^T(t_i))^T, \quad \mu(t_i) = (\mu_1^T(t_i), \ldots, \mu_J^T(t_i))^T, \quad \epsilon_i(t_i) = (\epsilon_{i1}^T(t_i), \ldots, \epsilon_{iJ}^T(t_i))^T, \qquad (10)$$
and take the $J N_i \times J N_i$ diagonal matrix $\Sigma_i = \mathrm{diag}\{\tau_1^2, \ldots, \tau_1^2, \ldots, \tau_J^2, \ldots, \tau_J^2\}$, where $\tau_j^2$ is repeated $N_i$ times for each $j$, as the covariance matrix of $\epsilon_i(t_i)$. Further, define $\xi_i = (\xi_{i1}, \ldots, \xi_{iK})^T$ as the $K \times 1$ vector of scores and $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_K\}$ as its $K \times K$ diagonal covariance matrix; $\phi_{kj}(t_i) = (\phi_{kj}(t_{i1}), \ldots, \phi_{kj}(t_{iN_i}))^T$ as the $N_i \times 1$ vector of values of the $k$th eigenfunction $\phi_{kj}$ associated with method $j$; $\phi_k(t_i) = (\phi_{k1}^T(t_i), \ldots, \phi_{kJ}^T(t_i))^T$ as the $J N_i \times 1$ vector obtained by stacking the values for all the methods; and $\Phi(t_i) = (\phi_1(t_i), \ldots, \phi_K(t_i))$ as their $J N_i \times K$ matrix.
With this notation, the model (9) can be written as
$$Y_i(t_i) = \mu(t_i) + \Phi(t_i)\,\xi_i + \epsilon_i(t_i), \quad i = 1, \ldots, n. \qquad (11)$$
Here the $\xi_i$ follow independent distributions with mean $0$ and covariance matrix $\Lambda$, the $\epsilon_i(t_i)$ follow independent distributions with mean $0$ and covariance matrix $\Sigma_i$, and the two vectors are mutually independent. It follows that
$$E(Y_i(t_i)) = \mu(t_i), \quad \mathrm{var}(Y_i(t_i)) = \Phi(t_i)\,\Lambda\,\Phi^T(t_i) + \Sigma_i. \qquad (12)$$
The elements of the first term of $\mathrm{var}(Y_i(t_i))$ consist of values of the covariance functions given by (3), but with the infinite sums therein truncated to $K$ terms. The unknowns in the model are $\theta = \{\mu_1, \ldots, \mu_J, K, \lambda_1, \ldots, \lambda_K, \phi_{11}, \ldots, \phi_{J1}, \ldots, \phi_{1K}, \ldots, \phi_{JK}, \tau_1^2, \ldots, \tau_J^2\}$. The mean functions and eigenfunctions in $\theta$ depend on $t \in \mathcal{T}$ as well, but this dependence is suppressed for convenience. Next, we discuss estimation of $\theta$ to get the plug-in estimator $\hat\theta$.
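The stacked form can be sketched as follows for a single subject, with assumed illustrative choices of $K$, the eigenfunctions, and the variances (not the authors' implementation): one draw from the model $Y_i = \mu + \Phi \xi_i + \epsilon_i$ and its implied covariance matrix $\Phi \Lambda \Phi^T + \Sigma_i$.

```python
import numpy as np

# Sketch of the stacked mixed-effects form for one subject, under assumed
# illustrative choices of K, the eigenfunctions, and the variances.
rng = np.random.default_rng(1)
J, K, Ni = 2, 2, 10
ti = np.linspace(0.0, 1.0, Ni)
lam = np.array([4.0, 1.0])                      # eigenvalues lambda_1, lambda_2
tau2 = np.array([2.0, 3.0])                     # error variances tau_1^2, tau_2^2

# Phi(t_i): J*Ni x K matrix; column k stacks phi_k1(t_i) on top of phi_k2(t_i)
Phi = np.column_stack([
    np.concatenate([np.ones(Ni), np.ones(Ni)]),
    np.concatenate([np.sin(2 * np.pi * ti), np.cos(2 * np.pi * ti)]),
])
Lam = np.diag(lam)                              # covariance matrix of the scores
Sig = np.diag(np.repeat(tau2, Ni))              # covariance matrix Sigma_i of errors
mu = np.concatenate([24 + ti, 23 + 2 * ti])     # stacked mean vector mu(t_i)

xi = rng.normal(0.0, np.sqrt(lam))              # scores xi_i, common to both methods
eps = rng.normal(0.0, np.sqrt(np.repeat(tau2, Ni)))
Yi = mu + Phi @ xi + eps                        # one draw from the model

V = Phi @ Lam @ Phi.T + Sig                     # implied var(Y_i(t_i))
```

The rank-$K$ first term of `V` induces the within- and between-method dependence, while the diagonal second term adds the method-specific measurement error.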

Parameter estimation
Let $N_0$ be the number of unique observation times in the data and $t_0 = (t_{01}, \ldots, t_{0N_0})^T$ be the $N_0 \times 1$ vector of these times in increasing order. The elements of $t_0$ form a grid in the domain $\mathcal{T}$. By definition, there is at least one observation from all measurement methods at each time in $t_0$. For estimation, we begin by pooling the observations from each method on all subjects and smoothing them, ignoring the within-subject dependence. Separate smoothing is performed for each method. This results in a smooth estimate $\hat\mu_j(t)$ of $\mu_j(t)$, $j = 1, \ldots, J$. Then, each observation in the data is centered by subtracting off the corresponding estimated mean as $\tilde Y_{ij}(t_{im}) = Y_{ij}(t_{im}) - \hat\mu_j(t_{im})$. These centered observations are used to form $J N_0 \times 1$ vectors $\tilde Y_i(t_0)$ in the same way as the $Y_i(t_i)$ are formed in (10). If subject $i$ does not have an observation at some $t \in t_0$, that observation is set to missing in $\tilde Y_i(t_0)$. In Appendix A, we describe the two approaches, MPACE and UPACE, for estimating the remaining unknown components of $\theta$ and the multivariate scores $\xi$ in the model (11). Both use the centered data as inputs and involve the PACE methodology of Yao, Müller, and Wang (2005) for univariate functional data. MPACE directly adapts the PACE methodology to deal with multivariate data along the lines of Chiou, Chen, and Yang (2014), whereas UPACE adapts the approach of Happ and Greven (2018). UPACE is the computationally simpler of the two as it involves first applying the univariate PACE methodology separately to each component of the multivariate data and then processing the results. However, this may result in a loss of efficiency in the estimates, especially of the error variances $\tau_j^2$, because they too come from univariate analyses rather than a multivariate analysis as in MPACE.
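The centering and grid-alignment step can be sketched as follows for one subject and one method; `muhat` is a hypothetical stand-in for the smooth mean estimate, and all values are illustrative.

```python
import numpy as np

# Sketch of the centering step: subtract the estimated mean and align the
# observations on the pooled grid t_0, with NaN marking grid times at which
# the subject was not observed. muhat is a stand-in for the smooth mean fit.
t0 = np.array([0.0, 0.25, 0.5, 0.75, 1.0])      # pooled unique observation times
muhat = lambda t: 24 + t                         # assumed smooth mean estimate

ti = np.array([0.0, 0.5, 1.0])                   # subject i's observation times
yi = np.array([24.3, 24.9, 25.4])                # subject i's observations

ytilde = np.full(t0.shape, np.nan)               # centered vector on the grid
ytilde[np.searchsorted(t0, ti)] = yi - muhat(ti)
```

The NaN entries play the role of the "missing" positions in $\tilde Y_i(t_0)$ for an unbalanced design.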
Although the smoothing needed in these approaches, and also for estimating the mean functions, is performed here using the gam function in the R package mgcv (Wood 2017), any other smoothing technique (e.g., local linear regression as in Yao, Müller, and Wang (2005) and Chiou, Chen, and Yang (2014)) can also be used without affecting the general methodology. Upon model fitting, the fitted curves are $\hat Y_i(t_i) = \hat\mu(t_i) + \hat\Phi(t_i)\,\hat\xi_i$, $i = 1, \ldots, n$.

Confidence intervals and bands
Suppose $\psi \equiv \psi(\theta)$ is a function of the model parameters that is of interest. Examples of $\psi$ include the precision ratio $\tau_1^2/\tau_2^2$. Often, the parameter function depends on $t$, i.e., it has the form $\psi(t) \equiv \psi(t, \theta)$, $t \in \mathcal{T}$. Examples of $\psi(t)$ include the mean difference $\delta_{jl}(t)$ and the agreement measures defined in the next section. Since $\psi$ can be considered a special case of $\psi(t)$, we focus on constructing one- and two-sided confidence bands for $\psi(t)$. In effect, we construct pointwise and simultaneous intervals on a relatively fine grid $t$ of $L$ points in $\mathcal{T}$, say, $t_1, \ldots, t_L$. This grid may be the same as the grid $t_0$ formed by the observed time points, used for estimation in Sec. 3, or it may consist of a subset of these time points. In practice, $L \in [25, 50]$ is often adequate.
Let $\hat\psi(t) \equiv \psi(t, \hat\theta)$ be the plug-in estimator of $\psi(t)$. Also, let $\hat\psi(t)$ and $\psi(t)$ be the $L \times 1$ vectors representing the values of the two functions evaluated at the elements of $t$. When $n$ is large, the joint distribution of $\hat\psi(t) - \psi(t)$ can be approximated by an $N_L(b, S)$ distribution, possibly after applying a normalizing transformation, where the $L \times 1$ vector $b = (b_1, \ldots, b_L)^T$ and the $L \times L$ matrix $S = (s_{jk})_{j,k = 1, \ldots, L}$ respectively represent the estimated bias vector and covariance matrix of the estimators. Once $b$ and $S$ are available, an approximate $100(1-\alpha)\%$ one- or two-sided pointwise confidence band for $\psi(t)$, $t \in \mathcal{T}$, can be computed as
$$\hat\psi(t_l) - b_l - z_{1-\alpha}\, s_{ll}^{1/2}, \quad \hat\psi(t_l) - b_l + z_{1-\alpha}\, s_{ll}^{1/2}, \quad \hat\psi(t_l) - b_l \pm z_{1-\alpha/2}\, s_{ll}^{1/2}, \quad l = 1, \ldots, L, \qquad (13)$$
where $z_\alpha$ is the $100\alpha$th percentile of an $N_1(0, 1)$ distribution. A simultaneous band can be constructed by replacing the standard normal percentile in (13) by an appropriate percentile (Choudhary and Nagaraja 2017, Chapter 3) that can be computed using the multcomp package of Hothorn, Bretz, and Westfall (2008) in R or via simulation, as we do here. We now present a bootstrap methodology to compute $b$ and $S$. It has the following steps:
1. Sample $n$ indices with replacement from the integers $1, \ldots, n$. Take the observed curves associated with the sampled subject indices as a resample of the original data.
2. Apply the estimation and FPCA approach described in Appendix A to the resampled data to get $\hat\theta^*$.
3. Use $\hat\theta^*$ to estimate $\psi(t)$ as $\hat\psi^*(t)$. This $\hat\psi^*(t)$ is a resample of $\hat\psi(t)$.
4. Repeat the previous steps $Q$ times to get the resamples $\hat\psi_q^*(t)$, $q = 1, \ldots, Q$. Compute the bias vector $b$ as $\sum_{q=1}^{Q} \hat\psi_q^*(t)/Q - \hat\psi(t)$, and the covariance matrix $S$ as the sample covariance matrix of the resamples.
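The bootstrap steps above can be sketched as follows; `estimate_psi` is a hypothetical placeholder for the full model refit of Appendix A (here it simply returns the pointwise mean difference between two methods' curves), and the toy data are for illustration only.

```python
import numpy as np

# Sketch of the subject-level bootstrap for the bias vector b and covariance
# matrix S of a functional estimand psi(t). estimate_psi is a placeholder for
# re-fitting the model to a resample of subjects.
rng = np.random.default_rng(0)

def estimate_psi(curves1, curves2):
    return curves1.mean(axis=0) - curves2.mean(axis=0)

n, L = 40, 25
grid = np.linspace(0.0, 1.0, L)
curves1 = 24 + grid + rng.normal(0.0, 1.0, (n, L))      # toy data, method 1
curves2 = 23 + 2 * grid + rng.normal(0.0, 1.0, (n, L))  # toy data, method 2

psi_hat = estimate_psi(curves1, curves2)
Q = 500
psi_star = np.empty((Q, L))
for q in range(Q):
    idx = rng.integers(0, n, size=n)                    # step 1: resample subjects
    psi_star[q] = estimate_psi(curves1[idx], curves2[idx])  # steps 2-3

# step 4: bias vector and covariance matrix from the resamples
b = psi_star.mean(axis=0) - psi_hat
S = np.cov(psi_star, rowvar=False)

# pointwise 95% two-sided band
half = 1.96 * np.sqrt(np.diag(S))
lower, upper = psi_hat - b - half, psi_hat - b + half
```

Resampling whole subjects (rather than individual observations) preserves the within-subject dependence across both methods and all time points.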
In practice, $Q = 500$ is often enough to estimate $b$ and $S$. If there is evidence that a bias correction is not needed, then the term $b_l$ in (13) can be dropped (see Sec. 5 for an example). Note that a separate FPCA is performed in each bootstrap repetition. Therefore, the resulting confidence intervals also account for the uncertainty due to the FPC decomposition in addition to the usual uncertainty due to sampling (Goldsmith, Greven, and Crainiceanu 2013). The procedure of this subsection can be easily adapted to construct a confidence interval for a parameter function $\psi$ that does not depend on $t$.

Evaluation of similarity and agreement
We now focus on how to evaluate similarity and agreement of a pair of measurement methods $j$ and $l$, $j \neq l = 1, \ldots, J$. This evaluation can be repeated for all such pairs of interest. For similarity evaluation, inference is performed on two measures of similarity: the difference in means of the methods and the ratio of their precisions (Dunn 2007). Under the true model (4), $\delta_{jl}(t)$ given by (7) is the mean difference and $\tau_j^2/\tau_l^2$ is the precision ratio. For agreement evaluation, inference is performed on functional analogs of agreement measures originally developed for scalar data. These are obtained by using the definitions of the measures under the bivariate distribution of $(Y_j(t), Y_l(t))$ induced by the true model (4) for each $t \in \mathcal{T}$. We specifically consider two agreement measures. One is the concordance correlation coefficient (CCC) due to Lin (1989). It is defined in terms of the first- and second-order moments of the paired observations. Using (5) and (6), the functional CCC can be expressed as
$$\mathrm{CCC}_{jl}(t) = \frac{2\, G_{jl}(t, t)}{\{\mu_j(t) - \mu_l(t)\}^2 + \sigma_j^2(t) + \sigma_l^2(t)}, \quad t \in \mathcal{T}.$$
See Lin (1989) for properties of a CCC. Here we just note that $|\mathrm{CCC}_{jl}(t)| \leq |\rho_{jl}(t)| \leq 1$ and $\mathrm{CCC}_{jl}(t) = \rho_{jl}(t)$ if $\mu_j(t) = \mu_l(t)$ and $\sigma_j^2(t) = \sigma_l^2(t)$. A large positive value of CCC implies good agreement. The methods $j$ and $l$ have perfect agreement when $\mathrm{CCC}_{jl}(t) = 1$ for all $t$.
The other measure is the total deviation index (TDI). For a given large probability $p_0$, it is defined as the $p_0$th percentile of the absolute difference in the paired observations. For inference on TDI, we additionally assume that the scores and the errors in the models (4) and (9) follow normal distributions. Under this assumption, for each $t \in \mathcal{T}$, the difference $D_{jl}(t)$ follows a normal distribution with mean $\delta_{jl}(t)$ and variance $\eta_{jl}^2(t)$, given by (7). This implies that the functional TDI can be expressed as
$$\mathrm{TDI}_{jl}(p_0, t) = \eta_{jl}(t)\left\{\chi^2_{1, p_0}\big(\delta_{jl}^2(t)/\eta_{jl}^2(t)\big)\right\}^{1/2}, \quad t \in \mathcal{T},$$
where $\chi^2_{1, p_0}(\Delta)$ is the $100 p_0$th percentile of a noncentral $\chi^2$ distribution with one degree of freedom and noncentrality parameter $\Delta$. A TDI is non-negative, and a small value implies good agreement. Agreement between methods $j$ and $l$ is perfect when $\mathrm{TDI}_{jl}(p_0, t) = 0$ for all $t$.
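The two agreement measures can be computed at a single time $t$ as in the following sketch; all numerical inputs are illustrative values, not estimates from the datasets analyzed in this article.

```python
import numpy as np
from scipy.stats import ncx2

# Sketch of the plug-in CCC and TDI at a single time t, computed from the
# induced moments of Sec. 2; the numerical values below are illustrative.
mu_j, mu_l = 24.5, 24.0            # mu_j(t), mu_l(t)
G_jj, G_ll, G_jl = 4.0, 5.0, 4.0   # covariance function values at (t, t)
tau2_j, tau2_l = 2.0, 2.0          # error variances tau_j^2, tau_l^2

sigma2_j = G_jj + tau2_j           # sigma_j^2(t)
sigma2_l = G_ll + tau2_l           # sigma_l^2(t)

# Lin's concordance correlation coefficient at time t
ccc = 2 * G_jl / ((mu_j - mu_l) ** 2 + sigma2_j + sigma2_l)

# TDI under normality: eta times the square root of the p0-quantile of a
# noncentral chi-square with 1 df and noncentrality delta^2 / eta^2
delta = mu_j - mu_l
eta2 = sigma2_j + sigma2_l - 2 * G_jl
p0 = 0.90
tdi = np.sqrt(eta2 * ncx2.ppf(p0, df=1, nc=delta ** 2 / eta2))
```

In a functional analysis these two scalars would be evaluated on the whole grid $t_1, \ldots, t_L$, giving the curves $\mathrm{CCC}_{jl}(t)$ and $\mathrm{TDI}_{jl}(p_0, t)$.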
The measures of similarity and agreement are estimated by plug-in. Similarity of the methods is evaluated by examining a two-sided confidence band for $\delta_{jl}(t)$ and a two-sided confidence interval for $\tau_j^2/\tau_l^2$. Agreement between the methods is evaluated by examining appropriate one-sided confidence bands for the agreement measures. Since a large value of CCC and a small value of TDI imply good agreement, a lower confidence band for CCC and an upper confidence band for TDI are appropriate. The construction of confidence intervals and bands was discussed in the previous subsection. To improve accuracy, the intervals for the precision ratio and TDI are obtained by first applying a log transformation, and those for CCC are obtained by first applying Fisher's z-transformation. The results are then transformed back to the original scale.
As mentioned in Sec. 1, the limits of agreement approach of Bland and Altman (1986) is quite popular in the biomedical literature for agreement evaluation. It involves, under the normality assumption for the differences, computing the estimated mean ± 1.96 times the estimated standard deviation of the differences, and examining whether the limits contain any unacceptably large differences. Using (7), the functional limits of agreement are $\hat\delta_{jl}(t) \pm 1.96\, \hat\eta_{jl}(t)$, $t \in \mathcal{T}$.

Simulation study
In this section, we use Monte Carlo simulation to evaluate the performance of point and interval estimators of key parameters and parameter functions, including the measures of similarity and agreement, provided by the MPACE and UPACE approaches. This investigation focuses on $J = 2$ measurement methods and takes the mean squared error (MSE) of a point estimator and the coverage probability of a confidence interval as the measures of accuracy. The data are simulated from the true model (4) along the lines of our real data examples by taking the domain as $\mathcal{T} = [0, 1]$; assuming normality for the scores and errors; taking the mean functions of the two methods as $\mu_1(t) = 24 + t$ and $\mu_2(t) = 23 + 2t$; and setting the eigenvalues as $\lambda_k = 100\, e^{-(k-1)/2}$ for $k \leq 6$ and zero for $k > 6$. The eigenfunctions corresponding to the non-zero eigenvalues are taken as the eigenfunctions estimated from the body temperature data, restricted to the selected domain $\mathcal{T}$. The grid $t_{\mathrm{grid}} = \{u : u = 0, 1/49, \ldots, 1\}$ of 50 equally spaced points between 0 and 1 is used for simulating data as well as for point and interval estimation.
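This data-generating setup can be sketched as follows. The eigenfunctions here are an assumed illustrative family, not the ones estimated from the body temperature data, and the sampling of observation times mimics the sparse designs considered in this section.

```python
import numpy as np

# Sketch of the simulation setup: J = 2 methods on T = [0, 1], normal scores
# and errors, mu_1(t) = 24 + t, mu_2(t) = 23 + 2t, and eigenvalues
# lambda_k = 100 exp(-(k - 1)/2), k <= 6. The eigenfunctions below are an
# assumed illustrative choice, not those from the body temperature data.
rng = np.random.default_rng(123)
t_grid = np.arange(50) / 49.0                  # {0, 1/49, ..., 1}
lam = 100 * np.exp(-np.arange(6) / 2.0)        # lambda_1, ..., lambda_6
tau2 = np.array([2.0, 2.0])                    # (tau_1^2, tau_2^2) = (2, 2)

def simulate_subject(Ni, rng):
    # sparse design: Ni paired observation times drawn from the grid
    times = np.sort(rng.choice(t_grid, size=Ni, replace=False))
    xi = rng.normal(0.0, np.sqrt(lam))         # scores, common to both methods
    Y = np.empty((2, Ni))
    for j in range(2):
        # assumed method-specific eigenfunctions phi_kj
        phi = np.stack([np.sqrt(2.0) * np.cos((k + 1) * np.pi * (times + 0.1 * j))
                        for k in range(6)])
        mu_j = 24 + times if j == 0 else 23 + 2 * times
        Y[j] = mu_j + xi @ phi + rng.normal(0.0, np.sqrt(tau2[j]), size=Ni)
    return times, Y

# e.g., a balanced sparse design with Ni = 20 and n = 50 subjects
data = [simulate_subject(20, rng) for _ in range(50)]
```

Sharing the scores `xi` across the two methods, while drawing separate errors, is what makes the two simulated curves of a subject dependent, exactly as in model (4).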
We consider a total of four dense and sparse designs. In the dense case, a balanced design with $N_i = 50$ is considered. The observation times in this case are all points on $t_{\mathrm{grid}}$, and all subjects have the same observation times. In the sparse case, three scenarios with increasing sparsity are considered. Two are balanced designs with $N_i = 30$ and $N_i = 20$, and the third is an unbalanced design with $N_i$ distributed as a Poisson random variable with mean 20. We refer to these four designs as (a), (b), (c), and (d), respectively. The observation times in the sparse cases are drawn from a uniform distribution on $t_{\mathrm{grid}}$ separately for each subject. Consequently, in the sparse case, the subjects may not have the same observation times. In all four designs, observations from both measurement methods are simulated at each observation time, ensuring paired data. The observations for different subjects are independent. Three combinations of values are chosen for the error variances of the methods, namely, $(\tau_1^2, \tau_2^2) = (2, 2)$, $(2, 4)$, and $(4, 4)$, to allow a range of practical scenarios. Three values are chosen for the number of subjects, $n \in \{50, 100, 200\}$. Further, as is common in practice, $p_0 = 0.90$ is taken for TDI and $1 - \alpha = 0.95$ is taken for the confidence intervals and bands. Thus, we consider a total of $4 \times 3 \times 3 = 36$ settings. For each setting, we simulate a dataset, perform parameter estimation as described in Sec. 3.1 and Appendix A, and construct 95% confidence intervals and bands as described in Secs. 3.2 and 4.

Table 1. MSEs of estimators of quantities that are free of $t$ and average MSEs of estimators of quantities that depend on $t$, computed using the MPACE (marked M) and UPACE (marked U) approaches, and the ratio of the MSEs (marked U/M), in case of four designs: (a) $N_i = 50$ (dense data), (b) $N_i = 30$ (sparse data), (c) $N_i = 20$ (sparse data), and (d) an unbalanced design with mean $N_i = 20$ (sparse data), each with $(\tau_1^2, \tau_2^2) = (2, 2)$.
The proportion of variation explained required for FPCA is taken to be 0.99 for both MPACE and UPACE. For the smoothing involved in point estimation, the gam function in the mgcv package of Wood (2017) is used with default settings. For interval estimation, $Q = 250$ bootstrap resamples are used. The entire process from data simulation to interval estimation is repeated 300 times. The results are used to compute estimated MSEs of the point estimators of $\log(\tau_1^2)$, $\log(\tau_2^2)$, $\log(\tau_2^2/\tau_1^2)$, $\log\{\mathrm{TDI}(p_0, t)\}$, and $z\{\mathrm{CCC}(t)\}$, with $z(\cdot)$ denoting Fisher's z-transformation, and estimated coverage probabilities for confidence intervals of these quantities. The coverage probabilities are also computed for $\mu_1(t)$, $\mu_2(t)$, and $\delta(t)$, but these quantities are excluded from the MSE calculation as both MPACE and UPACE use the same point estimators for them. We additionally compute the estimated MSE of $\hat K$ and overall measures of accuracy in the prediction of the scores and the individual curves. For convenience, these measures are also referred to as MSEs. The efficiency of MPACE relative to UPACE is measured by dividing the MSE in case of UPACE by its MPACE counterpart. From a practical viewpoint, if a relative efficiency falls between 0.9 and 1.1, we may consider the two approaches to be equally accurate for estimating that quantity. Now, a note about interval estimation of $\log\{\mathrm{TDI}(p_0, t)\}$ is in order. Our initial simulation studies showed that its confidence band tended to be more accurate without the bias correction. Therefore, we drop the bias term from (13) when computing the confidence band for this measure in the remainder of this article. Table 1 presents the MSEs for the two approaches and their relative efficiencies for $(\tau_1^2, \tau_2^2) = (2, 2)$. We see that, with a few exceptions, the efficiency tends to decrease with the sparsity of the design.
Further, the efficiencies for the curves and scores in all cases are between 0.96 and 1.05, implying that the two approaches may be considered equally accurate for estimating them. Also, the efficiencies for $K$ are between 0.26 and 0.83 in all cases but one. This suggests that UPACE is more accurate than MPACE for estimation of $K$. Additional investigation shows that MPACE tends to overestimate $K$. All the efficiencies for $z(\mathrm{CCC})$ are greater than one, implying superiority of MPACE over UPACE. For the remaining quantities, the efficiencies depend on $n$ and the sparsity of the design. In particular, for dense data (design (a)), the efficiencies range between 0.96 and 1.20, indicating superiority of MPACE. However, as the level of sparsity increases, MPACE begins to lose its efficiency advantage to UPACE, especially when $n = 50$. But the advantage of UPACE also shrinks as $n$ increases. For example, for design (d), the efficiencies range between 0.84 and 0.96 when $n = 50$, clearly indicating superiority of UPACE, but the range becomes 0.98 to 1.03 when $n = 200$, indicating nearly the same efficiency for the two approaches. Qualitatively similar conclusions hold in case of $(\tau_1^2, \tau_2^2) = (4, 4)$ (see Table 2) and also $(2, 4)$, the results for which are omitted. On the whole, these findings indicate that MPACE may be considered slightly more efficient than UPACE for dense data, but the converse is true for sparse data with small $n$. In the other cases, the two may be considered more or less equally efficient. These conclusions remain unaffected by the error variances.
Next, we examine the estimated coverage probabilities of the confidence intervals. Table 3 presents the coverage probabilities for confidence intervals of the error variances and their ratio, which are free of t. With a few exceptions, the entries are 1-2% higher than the nominal level of 95%, suggesting that the intervals are slightly conservative. MPACE and UPACE appear equally accurate, and there is little impact of n or the error variances.
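The coverage probabilities in Tables 3-5 are Monte Carlo estimates: the proportion of the 300 simulation replications in which the interval covers the true value. A toy sketch of that computation, using normal-theory intervals for a scalar mean rather than the article's bootstrap intervals:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimated_coverage(lower, upper, truth):
    # Fraction of simulated intervals [lower, upper] that contain truth.
    lower, upper = np.asarray(lower), np.asarray(upper)
    return float(np.mean((lower <= truth) & (truth <= upper)))

# 300 replications of a nominal 95% interval for a N(0, 1) mean, n = 50.
n_rep, n = 300, 50
x = rng.normal(0.0, 1.0, size=(n_rep, n))
mean = x.mean(axis=1)
half = 1.96 * x.std(axis=1, ddof=1) / np.sqrt(n)
coverage = estimated_coverage(mean - half, mean + half, truth=0.0)
```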
For parameter functions that depend on t, Table 4 presents averages of the estimated pointwise coverage probabilities of the confidence bands.

Table 3. Estimated coverage probabilities (in %) of 95% confidence intervals for error variances and their ratio in case of four designs: (a) N_i = 50 (dense data), (b) N_i = 30 (sparse data), (c) N_i = 20 (sparse data), and (d) an unbalanced design with mean N_i = 20 (sparse data).

There is no difference between MPACE and UPACE in the entries for μ₁, μ₂, and δ because both use the same estimates for them. In general, these entries are about 1% higher than 95%. For CCC, the entries are close to 95% for MPACE but about 96-98% for UPACE. For TDI, the entries are close to 95% for MPACE; this is also true for UPACE when n ≥ 100. These conclusions hold regardless of the values of the error variances and whether the design is dense or sparse. Table 5 presents estimated simultaneous coverage probabilities of the confidence bands. With the exception of TDI, for which the entries are below 95%, the entries may be considered close to 95%, especially when n ≥ 100. In the case of TDI, the accuracy of MPACE improves with n and may be considered acceptable for n = 200.
Although the accuracy of UPACE also improves with n, it remains quite liberal even when n = 200. Taken together, our key findings based on the settings considered, and their practical implications, may be summarized as follows. First, the sparsity of the design affects the relative performance of the two approaches in point estimation but not so much in interval estimation; the error variances do not seem to have much impact on performance. Second, for both point and interval estimation, MPACE may be considered to have an edge over UPACE. Finally, we also evaluated the two variants of the MPACE and UPACE algorithms mentioned in Appendix A but did not find any noticeable difference from the results presented here; these results are therefore omitted. The results of an additional simulation study evaluating the impact of non-normality are presented in the online Supplemental Material.

Analysis of body fat data
These data from Chinchilli et al. (1996) consist of percentage body fat measurements taken over time on a cohort of 112 adolescent girls using skinfold calipers (method 1) and DEXA (method 2). Age at visit is the time variable t here. See Chinchilli et al. (1996), King, Chinchilli, and Carrasco (2007), Hiriote and Chinchilli (2011), and Rathnayake and Choudhary (2017) for more details about the dataset. After pre-processing the data, which includes retaining only the observation times for which paired observations are available from both methods, we get a total of 2 × 654 = 1308 observations from n = 91 girls. The observations range between 12.7 and 37.4. There are 56 distinct observation times on the domain T = [11.2, 16.8] years, and their numbers per subject range between 4 and 8 with an average of 7.2. Figure 1 presents the individual longitudinal profiles from the two methods, superimposed with their estimated mean functions (see below). The caliper mean ranges from 23.6 to 24.7, whereas the DEXA mean ranges from 21.4 to 24.3. They also behave differently over the domain. For example, the caliper mean remains essentially flat until age 14, decreases slightly until about age 15.5, and begins to increase thereafter. In contrast, the DEXA mean decreases in the beginning, bottoms out around age 13, and increases thereafter with some flattening near the end. Figure 2 shows the age-specific scatterplots for ages 12 through 16. (To draw these plots, the ages have been rounded to the nearest integer; otherwise, there would be relatively few points in each plot, making it hard to discern any pattern.)

Table 5. Estimated simultaneous coverage probabilities (in %) of 95% simultaneous confidence bands in case of four designs: (a) N_i = 50 (dense data), (b) N_i = 30 (sparse data), (c) N_i = 20 (sparse data), and (d) an unbalanced design with mean N_i = 20 (sparse data).

The methods appear moderately correlated at these ages, with associated sample correlations 0.80, 0.73, 0.66, 0.67, and 0.73, respectively. Also, the points do not cluster tightly around the 45° line for any age, implying that the methods do not agree well. Our next task is to perform an FPCA of these data by fitting model (9) using both the MPACE and UPACE approaches. The smoothing is performed using the gam function in the mgcv package of R, as described in the simulation section. The resulting mean functions are displayed in Figures 1 and 3. The FPCA yields estimates of the number of FPCs needed to explain at least 99% of the variability, the eigenvalues, and the error variances. MPACE requires one fewer FPC than UPACE, but both yield similar estimates for the error variances. Figure 4 presents the estimated eigenfunctions. Ignoring a sign flip, as the FPCs are unique only up to a sign change, we see that the first three eigenfunctions for caliper and the first two eigenfunctions for DEXA from the two approaches are quite similar. The resulting estimates of the standard deviation functions and the correlation function for caliper and DEXA are displayed in Figure 3. The standard deviation functions estimated by MPACE and UPACE are somewhat similar, with the function exhibiting a decreasing trend for caliper and an increasing trend for DEXA. However, the two correlation function estimates are quite different. In particular, the UPACE estimate shows a decreasing trend throughout, whereas the MPACE estimate decreases initially, reaches a minimum at age 14, and then increases. The latter pattern is consistent with the trend of the correlations associated with the scatterplots in Figure 2. Therefore, we use MPACE for the rest of the analysis here.

Figure 3. The estimated mean, standard deviation, correlation, and mean difference functions for the two methods for the percentage body fat data.
The bottom right panel of Figure 3 also shows a 95% simultaneous confidence band for the mean difference function. We now proceed as described in Sec. 4 to compute interval estimates for measures of similarity and agreement using Q = 500 bootstrap resamples. The estimate and 95% simultaneous confidence band for the mean difference (caliper minus DEXA) are displayed in Figure 3. The estimate increases from 1 around age 11 to about 3 around age 13, then decreases and falls slightly below zero around age 15, and then increases to about 0.5 around age 17. The band lies above zero over the age interval from 11.5 to 14.5. The estimate and 95% confidence interval for the precision ratio (caliper over DEXA) are 1.14 and (0.60, 2.57), respectively. Taken together, these findings indicate that the methods have the same precision but not the same means. Hence, the methods cannot be regarded as similar.
For agreement evaluation, Figure 5 presents estimates and 95% one-sided simultaneous confidence bands for the CCC and TDI (with p₀ = 0.90) functions: a lower band for CCC and an upper band for TDI. The pattern of increase and decrease in the TDI estimate broadly resembles that of the mean difference function in Figure 3. This indicates that the agreement between the methods is best in the beginning. It then becomes progressively worse as age increases to about 13.5, improves until about age 14.5, and worsens progressively thereafter. The same conclusion can be reached on the basis of CCC. The TDI upper bound ranges between 6.78 and 9.64, and the CCC lower bound ranges between 0.22 and 0.60. Based on these values, the methods cannot be considered to agree well. It is also clear from the similarity evaluation that this lack of agreement is primarily due to a difference in the means of the two methods. These conclusions are consistent with other analyses of these data reported in Chinchilli et al. (1996), King, Chinchilli, and Carrasco (2007), and Rathnayake and Choudhary (2017).
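For intuition about the two agreement measures, the sketch below computes normal-theory versions at a fixed age t: CCC as a scaled covariance penalized by the squared mean difference, and TDI(p₀) as the p₀-quantile of the absolute difference, assumed normal as in the article's TDI inference. The moment arguments are illustrative placeholders, not values from the body fat data:

```python
import math

def ccc(mu1, mu2, v1, v2, cov12):
    # Concordance correlation coefficient between two methods at a fixed t,
    # computed from their means, variances, and covariance.
    return 2.0 * cov12 / (v1 + v2 + (mu1 - mu2) ** 2)

def tdi(p0, delta, sd):
    # Total deviation index: the p0-quantile of |D|, where the difference
    # D between the methods is assumed N(delta, sd^2). Found by bisection
    # on P(|D| <= q) = Phi((q - delta)/sd) - Phi((-q - delta)/sd).
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    prob = lambda q: Phi((q - delta) / sd) - Phi((-q - delta) / sd)
    lo, hi = 0.0, abs(delta) + 10.0 * sd
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if prob(mid) < p0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

q = tdi(0.90, delta=2.0, sd=2.5)   # e.g., mean difference 2, SD of difference 2.5
```

A larger mean difference or variance inflates TDI and deflates CCC, which is why both measures point to the mean difference as the main source of disagreement here.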

Summary and discussion
To summarize, this article discusses modeling and analysis of functional data arising in a method comparison study. The methodology involves representing the data using a truncated Karhunen-Loève expansion. The unknowns in the model are estimated using two approaches, MPACE and UPACE, both adaptations of existing methods for FPCA of multivariate functional data observed with noise. Confidence intervals for measures of similarity and agreement, obtained by bootstrapping, are used to evaluate similarity and agreement of the measurement methods. A separate FPC decomposition is obtained for each bootstrap resample; therefore, the variability due to the FPC decomposition is also accounted for in the confidence intervals. Although MPACE and UPACE often have comparable performance, there is evidence in both the simulation studies and the real data analysis that MPACE sometimes works better than UPACE. Here we use splines for the smoothing involved in estimation, but any other smoothing method can be used without affecting the general methodology. No parametric assumption is required unless inference on TDI is needed, in which case normality is assumed for the scores and errors.
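The bootstrap construction can be sketched generically as follows. This toy version resamples subjects and forms a normal-approximation interval with an optional bootstrap bias correction; the article instead refits the entire FPC decomposition on each resample and works with functionals of the fit. The names and the example statistic are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_ci(data, statistic, Q=250, level_z=1.959963984540054,
                 bias_correct=True):
    # Resample subjects (rows) with replacement and recompute the statistic
    # on each resample; return a normal-approximation interval with an
    # optional bootstrap bias correction. level_z is the 97.5% standard
    # normal quantile, giving a nominal 95% interval.
    est = statistic(data)
    boot = np.array([statistic(data[rng.integers(0, len(data), len(data))])
                     for _ in range(Q)])
    bias = boot.mean() - est if bias_correct else 0.0
    se = boot.std(ddof=1)
    return est - bias - level_z * se, est - bias + level_z * se

x = rng.normal(5.0, 2.0, size=200)      # toy "subject-level" data
lo, hi = bootstrap_ci(x, np.mean)
```

Dropping the bias correction, as the article does for the TDI band, amounts to setting bias_correct=False.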
This article takes a multivariate FPCA approach to model the data. Given that mixed-effects models are common for modeling univariate method comparison data (Choudhary and Nagaraja 2017, Chapter 3), an alternative would be a functional mixed-effects model approach. For example, Zhou, Huang, and Carroll (2008) use such a model for dependence in paired functional variables. However, that methodology is difficult to implement, especially since no computer program is publicly available to fit their model. Although our methodology works for both dense and sparse functional data, it assumes that observations from all methods are available at each observation time. This assumption is restrictive; for example, it does not hold for the body fat data. It may, however, be possible to relax it. Further research is needed to explore these directions. Software in the form of R code, together with an illustration and documentation, is available at http://www.utdallas.edu/~pankaj/.

Appendix A: Two approaches for parameter estimation
This appendix describes two approaches based on FPCA for estimating the components of θ other than the mean functions, whose estimation was discussed in Sec. 3.1. Both use the centered data Ỹ_i(t₀), i = 1, …, n, as input.

A.1. Approach 1 (MPACE)
This approach is an adaptation of the PACE methodology for univariate functional data (Yao, Müller, and Wang 2005; Goldsmith, Greven, and Crainiceanu 2013) to multivariate functional data. A similar approach has been used by Chiou, Chen, and Yang (2014) for normalized functional data. It involves the following steps.
1. Compute the sum of products Σ_{i=1}^{n} Ỹ_i(t₀)Ỹ_iᵀ(t₀) using only the non-missing observations in Ỹ_i(t₀). Divide each element of this JN₀ × JN₀ matrix by the corresponding number of non-missing terms contributing to the sum. This divisor is n for a balanced design. If at least two observations are available at each t ∈ t₀, we may subtract 1 from the number of non-missing terms contributing to the sum and use that as the divisor. Denote the resulting matrix as V. It has a block structure in which each submatrix is an N₀ × N₀ matrix and V_jl = V_ljᵀ for j ≠ l. By construction, there is no missing entry in this matrix. The elements of V_jj provide a raw estimate of the autocovariance function G_jj(s, t) + σ_j² I(s = t) of Y_j(t) for s, t ∈ t₀; see (5). Likewise, the elements of V_jl provide a raw estimate of the cross-covariance function G_jl(s, t), given by (3), for s, t ∈ t₀.
2. Perform bivariate smoothing of the off-diagonal elements of V_jj, separately for each j, to obtain preliminary smooth estimates of the functions G_jj(s, t). Evaluate the estimated functions at s, t ∈ t₀ to get N₀ × N₀ matrices Ṽ_jj, j = 1, …, J.
3. Perform bivariate smoothing of the elements of V_jl, separately for each (j, l) pair with l > j, to obtain preliminary smooth estimates of the functions G_jl(s, t). Evaluate the estimated functions at s, t ∈ t₀ to get N₀ × N₀ matrices Ṽ_jl, l > j = 1, …, J.
4. Compute the JN₀ × JN₀ matrix Ṽ, an analog of V, by replacing V_jj on the diagonal, V_jl above the diagonal, and V_lj below the diagonal of V with Ṽ_jj, Ṽ_jl, and Ṽ_jlᵀ, respectively.
5. Use a quadrature rule (e.g., the trapezoidal rule) that approximates an integral ∫_T f(t) dt as Σ_{q=1}^{N₀} w_q f(t_{0q}), where the quadrature points t_{01}, …, t_{0N₀} are the elements of t₀, to get the associated weights w_1, …, w_{N₀}. Form a JN₀ × JN₀ diagonal matrix W with the entire set of weights (w_1, …, w_{N₀}) repeated J times as the diagonal elements. Compute the JN₀ × JN₀ matrix U = W^{1/2} Ṽ W^{1/2}.
6. Perform a spectral decomposition of U to get the eigenvalues λ̂_k and the corresponding JN₀ × 1 orthogonal eigenvectors u_k, k = 1, …, JN₀. Replace any negative eigenvalues, which may possibly be nearly zero, with zero. Choose K̂ as the smallest number of eigenvalues for which (Σ_{k=1}^{K̂} λ̂_k / Σ_{k=1}^{JN₀} λ̂_k) ≥ p, where p is a specified lower bound on the proportion of the total variability in the observed curves to be explained by the principal components. Compute the vectors φ̂_k(t₀) = W^{-1/2} u_k for k = 1, …, K̂. The eigenvalues λ̂_k provide the estimated score variances, and the corresponding vectors φ̂_k(t₀) = (φ̂_{1k}ᵀ(t₀), …, φ̂_{Jk}ᵀ(t₀))ᵀ provide the estimated eigenfunctions evaluated at t₀, with φ̂_{jk}(t₀) the N₀ × 1 subvector corresponding to method j.
7. Estimate G_jj(t, t) for t ∈ t₀ by Ĝ_jj(t, t) = Σ_{k=1}^{K̂} λ̂_k φ̂²_{jk}(t), j = 1, …, J.
8. For j = 1, …, J, compute σ̂_j² by subtracting Ĝ_jj(t, t) given above from the diagonal elements of V_jj (which are given by (A.1) and estimate G_jj(t, t) + σ_j²) for t ∈ t₀ and combining the differences into a single number. One way to accomplish this is to proceed along the lines of the implementations of the PACE methodology in the R packages MFPCA and refund (Goldsmith et al. 2016; Happ 2018). For this, define an interval T* ⊂ T as T* = [min{t ∈ t₀ : t ≥ t_{01} + (t_{0N₀} - t_{01})/4}, max{t ∈ t₀ : t ≤ t_{0N₀} - (t_{0N₀} - t_{01})/4}], and let |T*| be its length. Also let t*_{01}, …, t*_{0Q} be those elements of t₀ that also lie in T*. Corresponding to each t*_{0q}, there is a diagonal element of the matrix V_jj in (A.1), say v_jj(t*_{0q}). Then, as in Step 5, let w*_1, …, w*_Q be the weights associated with a quadrature rule that takes t*_{01}, …, t*_{0Q} as the quadrature points.
Finally, take σ̂_j² = |T*|^{-1} Σ_{q=1}^{Q} w*_q {v_jj(t*_{0q}) - Ĝ_jj(t*_{0q}, t*_{0q})}, provided it is positive; otherwise take it to be zero. This is the estimate we use here. An alternative is to proceed as in Goldsmith, Greven, and Crainiceanu (2013) and use the average difference between v_jj(t) and Ĝ_jj(t, t) computed over the middle 60% of the grid t₀. In either case, some observation times from the two ends of t₀ are discarded to improve the stability of the estimate.
9. Estimate the score vector ξ_i by its estimated best linear unbiased predictor under model (11), in which the estimated matrices serve as plug-in estimates of their population counterparts.
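Steps 5 and 6 amount to a few lines of linear algebra. The sketch below is an illustrative Python/numpy version (the article's implementation is in R); it takes a smoothed covariance matrix Ṽ evaluated on the grid and returns the truncated eigenvalues and eigenfunction values:

```python
import numpy as np

def trapezoid_weights(t):
    # Quadrature weights for the trapezoidal rule on the grid t (Step 5).
    w = np.empty(len(t))
    w[0] = (t[1] - t[0]) / 2.0
    w[-1] = (t[-1] - t[-2]) / 2.0
    w[1:-1] = (t[2:] - t[:-2]) / 2.0
    return w

def mpace_eigen(V_tilde, t, J, p=0.99):
    # Steps 5-6: weight the smoothed covariance matrix, eigendecompose,
    # truncate by the proportion of variance explained, and back-transform
    # the eigenvectors to eigenfunction values on the grid.
    w = np.tile(trapezoid_weights(t), J)            # weights repeated J times
    U = np.sqrt(w)[:, None] * V_tilde * np.sqrt(w)[None, :]
    lam, u = np.linalg.eigh(U)
    lam, u = lam[::-1], u[:, ::-1]                  # descending order
    lam = np.clip(lam, 0.0, None)                   # zero out negatives
    K = int(np.searchsorted(np.cumsum(lam) / lam.sum(), p) + 1)
    phi = u[:, :K] / np.sqrt(w)[:, None]            # phi_k(t0) = W^{-1/2} u_k
    return lam[:K], phi, K
```

Because the u_k are orthonormal, the returned eigenfunction values satisfy the weighted orthonormality φ̂_kᵀ W φ̂_l = I(k = l).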

The matrix Ṽ in Step 4 is not guaranteed to be positive definite. Therefore, we may replace it by its nearest positive definite approximation, computed using the nearPD function in the R package Matrix (Bates and Maechler 2017), which implements the algorithm of Higham (2002), before continuing with the rest of the steps. This variant of the algorithm is evaluated using simulation in Sec. 5.
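In Python one could substitute a one-shot spectral projection for nearPD. This simplified version (ours, not the paper's) returns the nearest positive semidefinite matrix in the Frobenius norm, but it omits the unit-diagonal correlation constraint that Higham's (2002) alternating-projections algorithm and nearPD can enforce:

```python
import numpy as np

def near_psd(V, eps=0.0):
    # Simplified stand-in for R's nearPD: symmetrize, clip negative
    # eigenvalues at eps, and reconstruct. For a symmetric input this is
    # the nearest PSD matrix in Frobenius norm (one projection), whereas
    # Higham (2002) iterates projections for the correlation-matrix case.
    S = (V + V.T) / 2.0
    lam, Q = np.linalg.eigh(S)
    return (Q * np.clip(lam, eps, None)) @ Q.T
```

Setting eps to a small positive value (e.g., 1e-8) yields a strictly positive definite result when one is required downstream.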

A.2. Approach 2 (UPACE)
This approach is a special case of a general approach for multivariate FPCA proposed by Happ and Greven (2018). In our adaptation, it begins with the centered data Ỹ_i(t₀), i = 1, …, n, and involves the following steps.
1. Use the PACE methodology (Yao, Müller, and Wang 2005) to perform a univariate FPCA of the data from each measurement method separately. This effectively amounts to considering the data from one measurement method at a time, assuming a model for it similar to (9) that is based on a univariate Karhunen-Loève expansion, and fitting the model by suitably modifying the algorithm of the previous section. Suppose that for the data from measurement method j = 1, …, J, this results in K̂^(j) as the smallest number of principal components explaining at least a specified proportion p^(j) of variability; φ̂_k^(j)(t₀) as the N₀ × 1 vector of values of the kth estimated eigenfunction for t ∈ t₀, k = 1, …, K̂^(j); σ̂²^(j) as the estimated error variance; and ξ̂_i^(j) as the K̂^(j) × 1 vector of estimated scores for the ith subject. Note that the corresponding true scores have expectation zero and that we have a total of K̂_+ = K̂^(1) + … + K̂^(J) estimated univariate scores for each subject. The σ̂²^(j) resulting from this univariate FPCA also estimate the error variances in θ, i.e., σ̂_j² = σ̂²^(j), j = 1, …, J.
2. Arrange the univariate scores as an n × K̂_+ matrix N̂ whose ith row is (ξ̂_i^(1)ᵀ, …, ξ̂_i^(J)ᵀ), and form their K̂_+ × K̂_+ covariance matrix Z = N̂ᵀN̂/(n - 1).
3. Perform a spectral decomposition of Z to get the eigenvalues λ̂_k and the corresponding K̂_+ × 1 orthogonal eigenvectors z_k, k = 1, …, K̂_+. Replace any negative eigenvalues, which may possibly be nearly zero, with zero.
Choose K̂ as the smallest number of eigenvalues for which (Σ_{k=1}^{K̂} λ̂_k / Σ_{k=1}^{K̂_+} λ̂_k) ≥ p, where p is a specified lower bound on the proportion of variability explained. By construction, K̂ ≤ K̂_+. These K̂ and λ̂_k estimate the corresponding components K and λ_k of θ.
4. For k = 1, …, K̂, partition the K̂_+ × 1 eigenvector z_k as z_k = (z_k^(1)ᵀ, …, z_k^(J)ᵀ)ᵀ, where z_k^(j) is K̂^(j) × 1. The estimated multivariate eigenfunctions are obtained by combining the univariate eigenfunctions φ̂_m^(j)(t₀) with the elements of z_k^(j) as weights, and the estimated multivariate scores under model (11) are obtained as the rows of the n × K̂ matrix N̂(z_1, …, z_K̂), the ith row giving the score vector for subject i.

To conclude, both the MPACE and UPACE approaches provide estimates of all the components of θ except the mean functions, whose estimation was discussed in Sec. 3.1. They also provide estimates of the multivariate scores in model (11). In applications, MPACE involves choosing only one proportion of variation explained, p, for the multivariate data. On the other hand, UPACE involves choosing J + 1 such proportions: p^(1), …, p^(J) for the univariate data and p for the multivariate data. In practice, these proportions are taken to be large, e.g., between 0.95 and 0.99. The specific choice depends on the application and can be guided by a scree plot (Ramsay and Silverman 2005, Chapter 8). We have chosen 0.99 in this work. A variant of the UPACE algorithm with p^(j) = 1 for j = 1, …, J is evaluated using simulation in Sec. 5.
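Steps 2 and 3, together with the score construction in Step 4, amount to a principal components analysis of the stacked univariate scores. A compact numpy sketch (illustrative; the article's computations are in R):

```python
import numpy as np

def upace_combine(score_mats, p=0.99):
    # Steps 2-3 of UPACE: stack the univariate score matrices (each n x K^(j))
    # column-wise, form their covariance matrix, eigendecompose, and truncate
    # by the proportion of variance explained. The multivariate scores are
    # the projections of the stacked scores onto the retained eigenvectors.
    N = np.hstack(score_mats)                       # n x K_plus
    n = N.shape[0]
    Z = N.T @ N / (n - 1)
    lam, z = np.linalg.eigh(Z)
    lam, z = lam[::-1], z[:, ::-1]                  # descending order
    lam = np.clip(lam, 0.0, None)                   # zero out negatives
    K = int(np.searchsorted(np.cumsum(lam) / lam.sum(), p) + 1)
    scores = N @ z[:, :K]                           # rows of N(z_1, ..., z_K)
    return lam[:K], z[:, :K], scores
```

By construction, the sample covariance matrix of the returned multivariate scores is the diagonal matrix of the retained eigenvalues.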