Robust hypothesis testing in functional linear models

We extend three robust tests, the Wald-type, likelihood-ratio-type and F-type tests, to functional linear models with a scalar response and a functional covariate. Based on the percentage of variance explained criterion, we apply functional principal component analysis to re-express the functional linear model as a finite-dimensional regression. We investigate the theoretical properties of these robust testing procedures and assess their finite-sample properties through numerical simulation. In our experiments, the power performance and Type I error rates are studied separately for sparsely and densely observed functional linear models. The simulation results show that the robust testing procedures are more stable and less sensitive to heavy-tailed error distributions than the classical ones. Two real datasets are analysed to compare the classical and robust testing procedures.


Introduction
Rapid development in modern technology has enabled more and more data to be recorded densely over time, or even continuously over a time interval, such as functional magnetic resonance imaging (fMRI) data, spectrometric data [1] and economics data [2,3]. Such data sets are termed functional data, as they take the form of curves, images or shapes. Functional data analysis (FDA, hereinafter) deals with the statistical methodology, theory and applications of such data. There is a rich literature on FDA; influential books and overviews include [4][5][6][7][8][9]. In studies of functional data, a common problem is to explore the relationship between functional covariates and a functional or scalar response, which is an active area of research in FDA. A popular approach to this problem is functional linear model (FLM, hereinafter) analysis, first introduced by Ramsay and Dalzell [10]. The basic idea of the FLM is to extend the classical linear model to functional data; see the book [4] for details.
Due to the infinite dimensionality of the functional covariates, testing for association between the covariates and the response is of great interest in FLM research. The popular strategy for reducing the dimension is to represent the functional covariates and the coefficient function as linear combinations of a set of pre-determined basis functions, such as B-splines, Fourier or wavelet bases, or the eigenbasis from functional principal component analysis. After such pre-treatment, the testing problem for functional data becomes a testing problem under the classical linear model. Numerous works in the literature have used this strategy. For example, Cardot et al. [11] studied hypothesis testing in the FLM with a scalar response. They introduced two test statistics based on the norm of the empirical cross-covariance operators and used a penalized B-spline estimator of the coefficient function in their simulation studies. Kong et al. [12] re-expressed the FLM as a standard linear model through the functional principal component analysis (FPCA) method and considered classical tests such as the Wald, score, likelihood ratio and F tests. Su et al. [13] proposed a new method of selecting PCs for constructing Wald-type test statistics, to avoid the sensitivity of the power of the classical test statistics to varying thresholds.
This paper mainly studies robust hypothesis testing procedures in the FLM. Our motivation is to use M-estimators in the testing procedure, which have been studied in classical linear regression [14] but not in FDA. Our objective is to show that robust versions of the Wald-type, likelihood-ratio-type and F-type tests can be used in FDA. The proposed testing procedures are scale equivariant, since we employ a residual scale estimator, so that they offer protection against high-leverage outliers. Moreover, our simulation results and real data studies show that the robust testing procedures are more stable on contaminated datasets and achieve higher estimated power than the classical testing procedures.
The present paper is organized as follows. Section 2 describes the FLM and introduces the robust testing procedures for testing whether the coefficient function is zero. The asymptotic properties of the proposed procedures are also established in Section 2. Simulation studies are performed in Section 3. Real data examples and conclusions are given in Sections 4-5, respectively. All proofs are shown in the Appendix and some additional numerical simulation results are provided in the supplementary material.
Throughout the rest of this paper, ⟨·, ·⟩ and ‖·‖ denote the L²(T) inner product and norm, respectively, 1(·) is an indicator function, and →d denotes convergence in distribution.

Model specification
We consider the FLM in which the response of interest is scalar while the covariate is a function. Let y be a real-valued random variable defined on a probability space (Ω, B, P), and let X(t) ∈ L²(T) be a functional covariate defined on a compact support T. Suppose that the mean and covariance functions of X(t) are μ(t) = E(X(t)) and Γ(s, t) = Cov(X(s) − μ(s), X(t) − μ(t)), respectively. Then our FLM with scalar response is expressed as follows:

  y = α_0 + ∫_T β(t) X(t) dt + ε_0,   (1)

where β(t) is a smooth, square-integrable coefficient function on T, α_0 is an unknown intercept and ε_0 is a random error with zero mean and variance σ²_{ε_0}. Denoting the centred functional covariate X̃(t) = X(t) − μ(t) and α = α_0 + ∫_T β(t) μ(t) dt, the equivalent model is

  y = α + ∫_T β(t) X̃(t) dt + ε_0.   (2)

One of the most popular techniques for dimension reduction in FDA is the FPCA. Suppose that the spectral decomposition of the covariance function is Γ(s, t) = Σ_{k=1}^∞ λ_k φ_k(s) φ_k(t), where {λ_k, k ≥ 1} is a set of eigenvalues in non-increasing order λ_1 ≥ λ_2 ≥ ··· ≥ 0 with Σ_{k=1}^∞ λ_k < ∞, and {φ_k(·), k ≥ 1} are the corresponding orthonormal eigenfunctions. Using the Karhunen–Loève representation, we have X̃(t) = Σ_{k=1}^∞ ξ_k φ_k(t), where the scores ξ_k = ⟨X̃, φ_k⟩ have zero mean and variance λ_k. Using the same basis functions, the coefficient function can be expanded as β(t) = Σ_{k=1}^∞ β_k φ_k(t), so that (2) can be equivalently written as

  y = α + Σ_{k=1}^∞ ξ_k β_k + ε_0.   (3)

This model is still impractical because of its infinite sum. In many existing methods, the main technique is to approximate the model by truncating Σ_{k=1}^∞ ξ_k β_k to Σ_{k=1}^{k_n} ξ_k β_k, where k_n is a finite number that increases with the sample size n. The truncation value k_n can be seen as a 'cut-off' level. In practice, there are several empirical choices of the 'cut-off' level, such as the percentage of variance explained (PVE) method, equivalently the leading-PCs method [11,15,16], the cross-validation (CV, hereinafter) criterion [17] and information (AIC or BIC) criteria [18]. Thus, a truncated model using a finite number k_n can be written as

  y = α + Σ_{k=1}^{k_n} ξ_k β_k + ε,   (4)

where ε = Σ_{k=k_n+1}^∞ β_k ξ_k + ε_0 has zero mean and enlarged variance σ²_ε = Σ_{k=k_n+1}^∞ λ_k β²_k + σ²_{ε_0} [13].
Our robust testing procedures in the next section are based on the above truncated model. In the later simulation studies, we report the results based on the different PVE threshold levels.
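The PVE cut-off rule for choosing k_n can be sketched as follows, assuming curves observed on a common grid. This is an illustrative Python stand-in for the FPCA step, not the authors' implementation (which relies on R tooling); the function name and interface are our own.

```python
import numpy as np

def pve_cutoff(X, pve=0.95):
    """Select the truncation level k_n by percentage of variance explained.

    X : (n, m) array of curves observed on a common grid of m points.
    Returns the eigenvalues (non-increasing), the discretized
    eigenfunctions, and the smallest k_n whose leading eigenvalues
    explain at least `pve` of the total variance.
    """
    Xc = X - X.mean(axis=0)                  # centre the curves
    cov = Xc.T @ Xc / X.shape[0]             # empirical covariance on the grid
    vals, vecs = np.linalg.eigh(cov)         # eigh returns ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]   # reorder to non-increasing
    vals = np.clip(vals, 0.0, None)          # clip tiny negative round-off
    ratio = np.cumsum(vals) / vals.sum()
    k_n = int(np.searchsorted(ratio, pve) + 1)
    return vals, vecs, k_n
```

With curves dominated by two principal directions, a 95% threshold selects a small k_n, mirroring the behaviour of the PVE criterion described above.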

Robust testing procedures
The statistical problem of interest is to test whether there is a relationship between the covariate function X(t) and the scalar response y. The null hypothesis is that the coefficient function β(t) is equal to zero; under the truncated model, this becomes H_0 : β_1 = β_2 = ··· = β_{k_n} = 0 vs. H_a : β_k ≠ 0 for at least one k ∈ {1, …, k_n}.
Let G = {(X_i(t), y_i) : i = 1, 2, …, n} be independent realizations of (X(t), y) generated by model (1), where the errors ε_i are independent and identically distributed with zero mean and finite variance. The empirical version of Γ(s, t) is Γ̂(s, t) = n⁻¹ Σ_{i=1}^n (X_i(s) − X̄(s))(X_i(t) − X̄(t)), and {(λ̂_k, φ̂_k), λ̂_1 ≥ λ̂_2 ≥ ··· ≥ 0, k ≥ 1} are the eigenvalue–eigenfunction pairs of Γ̂(s, t). Denote the estimated scores by ξ̂_ik = ⟨X_i − X̄, φ̂_k⟩. Based on the FPCA method, the approximate estimator β̂(t) is Σ_{k=1}^{k_n} β̂_k φ̂_k(t), where α̂ and the β̂_k are obtained by solving the following minimization problem:

  min_{α, β_1, …, β_{k_n}} Σ_{i=1}^n ρ( (y_i − α − Σ_{k=1}^{k_n} ξ̂_ik β_k) / σ̂_ε ),

with ρ(·) being the loss function and σ̂_ε a robust residual scale estimator.

Remark 2.2:
The loss function ρ(·) is a suitable function, typically symmetric about 0 and either convex or bounded. In our simulation studies, we choose Tukey's bisquare function, ρ_c(x) = 1 − (1 − (x/c)²)³ for |x| ≤ c and ρ_c(x) = 1 otherwise, with tuning parameter c = 4.685. Even though this loss function is not convex, it works well in dealing with extreme outliers. Additional simulation results based on Huber's loss function are given in the supplementary material.
Kong et al. [12] pointed out that the testing problems considered in the truncated FLM are analogous to the multivariate-covariate situation, but with a few differences: (1) the number of PCs selected (i.e. the truncation number) k_n, which diverges with the sample size n, is not known and needs to be pre-estimated based on some criterion; (2) the covariates {ξ_ik, i = 1, 2, …, n, k = 1, 2, …, k_n} (i.e. the principal component scores) are estimated through the empirical covariance function and cannot be observed directly; (3) the scale estimator σ̂_ε depends on the truncation number k_n.

Asymptotic distributions under the null and alternatives
As discussed in Section 2.2, our robust testing procedures are based on the truncated model (4), and the selected truncation level k_n may diverge as n → ∞. In the following, we first state the assumptions required for our theoretical developments and then present the asymptotic distributions of the test statistics under both the null and alternative hypotheses. Note that we use a generic constant C whose value may change from line to line throughout these assumptions.
For the Fourier coefficients β_k, there exist a constant C and b > a/2 + 1 such that |β_k| ≤ C k^{−b}. (A3) The number of PCs selected, k_n, satisfies k_n → ∞ as n → ∞ and k_n ≍ n^{1/(a+2b)}, which means that the ratio k_n / n^{1/(a+2b)} is bounded away from both zero and infinity. (A4) The function ρ(·) is convex and non-monotone.
Assumption (A1) is a common moment condition imposed for the estimation of the functional predictor [18], which ensures that X(t) has light tails so that the empirical covariance is √n-consistent. Assumptions (A2)–(A3) are adapted from [22]; they imply that all λ_k's are positive and guarantee the identification of the coefficient function. The second part of Assumption (A2) requires that the spacing between adjacent eigenvalues not be too small, which ensures identifiability of the eigenfunctions. Assumption (A3) allows k_n to diverge and gives the order of the truncation number k_n. Combined with Assumption (A2), (A3) implies that the rate at which the number of functional PCs diverges with n depends on the spacing between adjacent eigenvalues [12]. Assumptions (A4)–(A7) are regularity conditions on the score function ψ(·) and the robust estimates, adapted from [20,23].

Theorem 2.2:
Assume that X_i(t) ∈ L²(T) for i = 1, 2, …, n, let λ = E ψ²(ε/σ_ε) / [2 E ψ′(ε/σ_ε)], and suppose that Assumptions (A1)–(A7) hold. Then, under the alternative hypothesis H_a : β(t) = β_a(t) ≠ 0 for some unknown real-valued function β_a(t) defined on T, we have that:
Under the null hypothesis and the truncated FLM, the distributions of these test statistics are similar to their counterparts in multiple regression. In particular, if the truncation number k_n is fixed, the null distributions of the Wald-type and likelihood-ratio-type tests behave like χ²_{k_n}, and the null distribution of the F-type test behaves like the F distribution F_{k_n, n−(k_n+1)}. The proofs of both theorems are given in the Appendix.

Simulation studies
In this section, we study the performance of the robust versions of the Wald-type, likelihood-ratio-type and F-type tests in terms of Type I error and power in the FLM. According to (1), we simulate data in three different settings based on three different models. Meanwhile, we add a baseline covariate Z_i ∼ U[0, 5] to the model as a potential confounder. For simplicity of presentation, we consider a scenario in which the functional covariates X_i(·), i = 1, 2, …, n, are generated by five Fourier basis functions.
The scalar response variable y_i is generated by the following model: where Z_i ∼ U[0, 5], and the coefficient function β(t) is formed from the same basis functions as X_i(t). Here, β(t) = a Σ_{k=1}^{5} β_k φ_k(t), with a = 0, 0.03, 0.06, 0.12 controlling the magnitude of β(·) across the support. In our simulation studies, we consider the true coefficient parameter vector β = (β_1, β_2, β_3, β_4, β_5)′ in three different settings. The details of these settings are as follows:
◆ Setting 1: β = (1, 1, 1, 1, 1)′, which means that the functional effects are the same across the five directions.
◆ Setting 2: β = (1, 0.7, 0.5, 0.3, 0)′, which means that the direction with the smallest variation has a null effect on the response.
◆ Setting 3: β = (0, 0.7, 0.5, 0.3, 1)′, which means that the direction with the smallest variation has a strong effect on the response.
A total of 2000 datasets, each with n = 1000 observations, are generated under these three settings. We study the power performance under both classical and robust testing procedures for each setting. The robust testing procedures are mainly implemented via the 'RobStatTM' R package [14]. For the above three settings, we consider two types of sampling designs for the functional covariates: a dense model and a sparse model. Based on our numerous simulations, we find that the weight of the random errors has an effect on the dense model. Therefore, we increase the weight of the error term in the dense model and design the following three models. To study the proposed robust hypothesis testing procedures, the errors ε_i can follow a heavy-tailed distribution or a contaminated normal distribution. For simplicity of presentation, we only report the simulation results of the three models with Cauchy(0, 1)-distributed errors. The simulation results are displayed in Tables 1-3 and Figures 1-3. For concise presentation, the results corresponding to five selected threshold levels, i.e. 0.65, 0.75, 0.85, 0.95 and 0.99, are given in the tables, while the estimated Type I error rates and power curves based on the robust Wald-type test are shown in Figures 1-3.
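The Monte Carlo design above can be illustrated in miniature: the sketch below estimates the rejection rate of a classical least-squares slope test at level 0.05 under either normal or Cauchy(0, 1) errors. This is a deliberately simplified stand-in for the reported simulations (a single scalar covariate instead of the functional model), with all names and settings our own.

```python
import numpy as np

def rejection_rate(n=100, n_rep=500, slope=0.0, cauchy=False, seed=1):
    """Monte Carlo rejection rate of the classical slope t-test at level 0.05.

    slope=0 gives the empirical Type I error; slope>0 gives power.
    cauchy=True draws heavy-tailed Cauchy(0, 1) errors instead of N(0, 1).
    """
    rng = np.random.default_rng(seed)
    reject = 0
    for _ in range(n_rep):
        x = rng.normal(size=n)
        err = rng.standard_cauchy(n) if cauchy else rng.normal(size=n)
        y = slope * x + err
        # least-squares slope and its standard error
        xc = x - x.mean()
        b = xc @ y / (xc @ xc)
        resid = y - y.mean() - b * xc
        se = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))
        reject += abs(b / se) > 1.984          # t quantile, 0.975, 98 d.f.
    return reject / n_rep
```

Running this with `slope=0.5` under normal versus Cauchy errors already shows the pattern reported in the tables: heavy-tailed errors sharply degrade the power of the classical least-squares test.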
The simulation results shown in Table 1, together with Figure 1, are for Model I. They show that the estimated Type I error rates in most settings are close to the nominal significance level of 0.05. The power performance of the classical method is worse than that of the robust method; this is particularly obvious in the third setting. The simulation results shown in Table 2 (and Figure 2) and Table 3 (and Figure 3) are for Models II and III, respectively. As with Model I, the Type I error rates are close to 0.05, except that we observe inflated Type I error rates for the sparse model (Model III). The power values based on the classical method are clearly lower than those based on the robust testing procedures. All these results indicate that the robust procedures are more stable and consistent than the classical ones for this hypothesis testing problem. The results also show that a high PVE threshold does not guarantee better power for either the robust or the classical testing procedures. For example, in Table 2, under Setting 1 with a = 0.12, the power decreases from 0.855 to 0.844 (classical Wald test) as γ increases from 0.65 to 0.99. In Table 3, under Setting 1 with a = 0.03, the power drops from 0.667 to 0.612 (robust Wald-type test) as the threshold γ increases from 0.95 to 0.99. This reflects a trade-off between the degrees of freedom and the information contained in the selected PCs. Based on our extensive experiments, γ = 0.95 is the most suitable threshold when implementing both the classical and robust testing procedures.

Table 1. Simulation results based on the classical (T_c) and robust (T_r) methods for Model I (dense model 1) with random errors ε_i ∼ Cauchy(0, 1).

Diffusion tensor imaging data
The first real example we consider is the diffusion tensor imaging (DTI) study data. This dataset has been considered in the literature, including [13,15,24], and is available in the R package 'refund'. DTI tractography is a magnetic resonance imaging technique that measures the restricted diffusion of water in tissue to produce neural tract images. It allows the study of white-matter tracts by measuring the diffusivity of water in the brain. In white-matter tracts, water diffuses anisotropically in the direction of the tract, while elsewhere water diffuses isotropically. The diffusion of water molecules at each voxel is measured by fractional anisotropy (FA) profiles along the corpus callosum (CC) and right corticospinal tract (RCST). Using measurements of diffusivity along several gradients, DTI can provide relatively detailed images of white-matter anatomy in the brain [25]. In this study, 100 multiple sclerosis (MS) patients and 42 healthy controls were observed at multiple visits. For each subject and each visit, FA along the CC and the RCST was recorded, at 93 locations along the CC and 55 along the RCST [24,25]. Since the paced auditory serial addition test (PASAT) was recorded only for MS patients and the missingness along the RCST is substantial, our interest here is to test the association between PASAT scores and FA along the CC in the MS group. Although some MS patients visited several times, we use only the data obtained at the baseline visit. The modelling framework we assume is the scalar-on-function FLM of Section 2. We first explore the relationship between FA profiles and PASAT scores. Five randomly selected samples (left panel), the estimated mean FA profile along the CC (middle panel) and the PASAT scores for all subjects (right panel) are presented in Figure 4.
It is difficult to conclude intuitively that the FA profiles along the CC have a significant impact on PASAT scores, so a scalar-on-function linear model is applied to study the relationship between PASAT scores and the FA profiles along the CC. We use both classical and robust testing procedures with different PVE threshold levels: 65%, 75%, 85% and 95%. All results are presented in Table 4. We also report the estimated coefficient function β̂(t) based on both the classical and robust methods in Figure 5.
From Table 4, we can see that the classical hypothesis testing procedures give inconsistent conclusions for different choices of the threshold. If we choose significance level α = 0.05, we conclude that there is a significant association between the FA profiles along the CC and the PASAT scores when the threshold is γ = 0.65, 0.75 or 0.85. But if γ = 0.95, we cannot draw such a conclusion because the p-value is larger than 0.05; that is, different thresholds lead to opposite conclusions under the classical testing methods. On the other hand, the robust testing procedures provide a consistent conclusion across threshold levels, i.e. the FA profiles along the CC are significantly associated with PASAT scores.

Table 4. Testing results based on both classical (T_c) and robust (T_r) methods for the DTI data with FA profiles in CC.

Furthermore, we mimic the DTI data in a simulation study to explain the inconsistency of the classical testing procedures. First, we estimated the coefficient function from the functional regression along the CC based on the classical and robust methods. Then we conducted a simulation study, setting the true parameters to those estimated from the DTI data. To better illustrate the advantage of the robust testing procedure, we contaminate the mimicked data by adding an error term ε_i ∼ Cauchy(0, 1) to produce 'vertical outliers'. For simplicity of presentation, we only display the estimated power curves based on the classical and robust Wald tests with different thresholds in Figure 5 (right panel). The power performance of the robust testing method is better than that of the classical testing procedure, and its power remains relatively stable across the different threshold levels.

Fat content spectrometric data
The fat content spectrometric (FCS) data consist of 215 near-infrared absorbance spectra of meat samples, recorded on a Tecator Infratec Food and Feed Analyzer. This dataset is part of the Tecator dataset, which is available in the R package 'fds' and can also be found at the website (http://lib.stat.cmu.edu/datasets). Each sample consists of a 100-channel absorbance spectrum over the 850–1050 nm wavelength range. Each spectrum in the database is associated with a content description of the meat sample, obtained by analytic chemistry: the percentages of fat, water and protein [26]. Our interest is to study the association between the fat percentage and the recorded 100-channel absorbance spectrum.
To investigate the robustness of the proposed testing procedures, we contaminate the fat percentage values with collective outliers. We notice that the fat percentage values in one segment are contextually small; there are 20 samples (sample 50 to sample 70) in this segment. We reset the values in this segment to two, five, ten and twenty times the original ones so that the data are contaminated. We compare the classical and robust hypothesis testing procedures in Table 5, using 0.95 as the PVE threshold for selecting the number of PCs. The 100-channel absorbance spectra over the 850–1050 nm wavelength range and the fat content percentages are shown in Figure 6. From Table 5, it is obvious that the robust testing procedure is more effective than the classical method, especially for contaminated data. If we choose 0.05 as our significance level, the p-values based on the two testing procedures are smaller than 0.05, which means that there is a significant association between the fat percentage and the 100-channel absorbance spectrum. On the other hand, when we reset the values to ten or twenty times the original ones, the p-value based on the classical procedure is larger than 0.05, suggesting no significant association between the fat percentage and the 100-channel absorbance spectrum. This conclusion is unappealing in practice, as contradictory conclusions can be drawn when a dataset is contaminated.
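The contamination scheme can be expressed in a few lines. The sketch below multiplies a contiguous segment of responses by a factor; the 0-based indices and the function name are illustrative stand-ins for samples 50-70 in the FCS data.

```python
import numpy as np

def contaminate(y, start=49, stop=70, factor=10):
    """Multiply a contiguous segment of responses by `factor`,
    producing collective outliers for a sensitivity check.
    `start` is included, `stop` is excluded (NumPy slice convention)."""
    y = np.asarray(y, dtype=float).copy()   # leave the original data intact
    y[start:stop] *= factor
    return y
```

Running the testing procedures on `contaminate(fat, factor=f)` for f = 2, 5, 10, 20 reproduces the kind of sensitivity comparison reported in Table 5.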

Conclusions
The FLM has been a prevalent tool for describing the effects of infinite-dimensional functional covariates on scalar/functional responses. There is a rich literature on the performance of classical tests based on the FPCA method under the FLM. However, these testing procedures are not robust to outliers. In this paper, we propose three robust testing procedures (Wald-type, F-type and likelihood-ratio-type) and investigate their asymptotic properties. The number of PCs is determined by the PVE threshold. Three simulation models, each with three different settings, are designed to study testing power. Under the FLM, our numerical studies show that the power of the classical testing procedures is lower than that of the robust procedures when a dataset contains outliers. In addition, the sparse FLM is more sensitive to outliers than the dense one. Furthermore, the robust testing approach achieves higher power and is more stable on contaminated datasets. The two real data examples also demonstrate the good performance of the robust testing procedures.
There are several directions for future research. In this work, we only discuss the scalar-on-function FLM; extending the method to function-on-function linear models or generalized FLMs would be interesting. On the other hand, establishing theoretical properties for robust global hypothesis testing in the FLM remains challenging. That will be a main focus of our future work.