S-Estimators for Functional Principal Component Analysis

Principal component analysis is a widely used technique that provides an optimal lower-dimensional approximation to multivariate or functional datasets. These approximations can be very useful in identifying potential outliers among high-dimensional or functional observations. In this article, we propose a new class of estimators for principal components based on robust scale estimators. For a fixed dimension q, we robustly estimate the q-dimensional linear space that provides the best prediction for the data, in the sense of minimizing the sum of robust scale estimators of the coordinates of the residuals. We also study an extension to the infinite-dimensional case. Our method is consistent for elliptical random vectors, and is Fisher consistent for elliptically distributed random elements on arbitrary Hilbert spaces. Numerical experiments show that our proposal is highly competitive when compared with other methods. We illustrate our approach on a real dataset, where the robust estimator discovers atypical observations that would have been missed otherwise. Supplementary materials for this article are available online.


INTRODUCTION
Principal component analysis (PCA) is a widely used method to obtain a lower-dimensional approximation to multivariate data. This approximation is optimal in the sense of minimizing the mean squared loss between the original observations and the resulting approximations. Estimated principal components can be a valuable tool to explore the data visually, and are also useful to describe some characteristics of the data (e.g., directions of high variability). Thanks to the ever-decreasing cost of collecting data, many datasets in current applications are both large and complex, sometimes with a very high number of variables. The chance of having outliers or other types of imperfections in the data increases with both the number of observations and their dimension. Thus, detecting these outlying observations is an important step, even when robust estimates are used, either as a preprocessing step or because there is some specific interest in finding anomalous observations. As a motivation, consider the problem of identifying days with an atypical concentration of ground-level ozone (O_3) in the air. Ground-level ozone forms as a result of the reaction between sunlight, nitrogen oxides (NOx), and volatile organic compounds (VOC). We obtained hourly average concentrations of ground-level ozone at a monitoring station in Richmond, BC (a few kilometers south of the city of Vancouver, BC). The data come from the Ministry of Environment of the province of British Columbia, and are available online at http://envistaweb.env.gov.bc.ca. We focus on the month of August for the years 2004 to 2012. Figure 1 displays the data.
Each line corresponds to the evolution of the hourly average concentration (in ppb) of ground-level ozone for one day. A few days exceed the maximum desired level of 50 ppb set by the Canadian National Ambient Air Quality Objectives, but there may also be days exhibiting an hourly pattern different from that of the majority of the curves.
In this article, we study robust low-dimensional approximations for high- (or infinite-) dimensional data that can be used to identify poorly fitted observations as potential outliers. The earliest and probably most immediate approach to obtaining robust estimates of the principal components consists of using the eigenvalues and eigenvectors of a robust scatter estimator (Campbell 1980; Devlin, Gnanadesikan, and Kettenring 1981; Boente 1987; Naga and Antille 1990; Croux and Haesbroeck 2000). A different approach was proposed by Locantore et al. (1999), based on the covariance matrix of the data projected onto the unit sphere. Since principal component directions are also those that provide projections with the largest variability, robust PCA estimators can alternatively be obtained as the directions that maximize a robust estimator of scale of the projected data. This approach is called "projection pursuit"; see Li and Chen (1985), Croux and Ruiz-Gazen (1996, 2005), Hubert, Rousseeuw, and Verboven (2002), and Hubert, Rousseeuw, and Vanden Branden (2005).
It is well known that, for finite-dimensional observations with finite second moments, when using mean squared errors the best lower-dimensional approximation is given by the projections onto the linear space spanned by the eigenvectors of the covariance matrix corresponding to its largest eigenvalues. Several robust proposals in the literature exploit this characterization of PCA. They amount to replacing the squared residuals with a different loss function. Liu et al. (2003) used the absolute value of the residuals, and McCoy and Tropp (2011) proposed a randomized algorithm to find an approximate solution to this L_1 minimization problem. Croux et al. (2003) proposed a weighted version of this procedure that reduces the effect of high-leverage points. Verboon and Heiser (1994) and De la Torre and Black (2001) used a bounded loss function applied to column-wise standardized residuals. Later, Maronna and Yohai (2008) proposed a similar loss function, but modified in such a way that the method reduces to classical PCA when one uses a squared loss function. Maronna (2005) also considered estimating best lower-dimensional subspaces directly, but his approach cannot easily be extended to infinite-dimensional settings because there may be infinitely many minimal eigenvalues.
A similar problem has recently received attention in the engineering and computer science literature. The main assumption in that approach is that a proportion of the observations lies on a proper lower-dimensional subspace, and that a sparse amount of arbitrary additive and diffuse "noise" may be present. The objective is to fully recover the low-rank part of the data. Chandrasekaran et al. (2011), Candès et al. (2011), McCoy and Tropp (2011), and Xu, Caramanis, and Sanghavi (2012) studied different convex relaxations of the problem of finding an exact representation of the data matrix as the sum of a low-rank matrix and a sparse one. Lerman et al. (2014) and Zhang and Lerman (2014) also considered convex relaxations of this problem. The focus of these proposals is on obtaining fast algorithms, and they derive sufficient conditions to guarantee that the solution to the surrogate convex optimization problem is the lower-dimensional subspace that properly contains the "nonoutlying" points.
Our approach relies on a probabilistic model and assumes that our observations follow an elliptical distribution. We are interested in studying best lower-dimensional approximations, in the sense of minimizing the expected prediction error over the distribution of the random vector. These approximations need not fit any subset of the data exactly. Moreover, our goal is to obtain robust alternatives for estimating principal spaces in infinite-dimensional settings. We use finite- (or high-) dimensional estimators as a step toward achieving that purpose. Nevertheless, our proposal provides consistent estimators of the best lower-dimensional subspace when applied to multivariate data that follow an elliptical distribution, even if second moments do not exist. Furthermore, our approach is Fisher consistent for the case of infinite-dimensional observations. Few robust principal components estimators for functional data (FPCA) have been proposed in the literature. Gervini (2008) studied spherical principal components, and Hyndman and Ullah (2007) discussed a projection-pursuit approach using smoothed trajectories, but without studying their properties in detail. More recently, Sawant, Billor, and Shin (2012) adapted the BACON-PCA method to detect outliers and to provide robust estimators of the functional components, while Bali et al. (2011) proposed robust projection-pursuit FPCA estimators and showed that they are consistent for the eigenfunctions and eigenvalues of the underlying process.
The rest of the article is organized as follows. Section 2 tackles the problem of providing robust estimators for a q-dimensional approximation for Euclidean data. Section 3 discusses extending this methodology to accommodate functional data, and its use to detect outliers is described in Section 4. In Section 5 we report the results of a simulation study conducted to assess the performance of the proposed procedure for functional data. The Richmond ozone dataset is analyzed in Section 6, where the advantage of the proposed procedure for detecting possibly influential observations is illustrated. Finally, Section 7 provides some further discussion and recommendations. Proofs are relegated to the online supplementary materials, where we also analyze the French mortality data.

S-ESTIMATORS OF THE PRINCIPAL COMPONENTS IN R^p
Consider the problem of finding a lower-dimensional approximation to a set of observations x_i, 1 ≤ i ≤ n, in R^p. Specifically, we search for q < p vectors b^(l) ∈ R^p, 1 ≤ l ≤ q, whose spanned linear subspace provides a good approximation to the data. From now on, B ∈ R^{p×q} stands for the matrix B = (b^(1), ..., b^(q)), b_j^T denotes the jth row of B, and the subspace spanned by its columns is L_B. For a given μ ∈ R^p, the corresponding "fitted values" are
$$\hat{x}_i = \mu + B \, a_i, \qquad a_i \in \mathbb{R}^q.$$
The principal components are defined as the minimizers, over matrices A ∈ R^{n×q}, B ∈ R^{p×q}, and vectors μ ∈ R^p, of
$$L_2(A, B, \mu) = \sum_{i=1}^n \|x_i - \hat{x}_i\|_{\mathbb{R}^p}^2 = \sum_{i=1}^n \sum_{j=1}^p r_{ij}^2, \qquad (1)$$
where the ith row of the matrix A ∈ R^{n×q} is a_i, r_ij = x_ij − x̂_ij, and ‖·‖_{R^p} denotes the usual Euclidean norm in R^p. Furthermore, this optimization problem can be solved using alternating regression iterations. Note that if we restrict B to satisfy B^T B = I_q, then the vectors a_i, 1 ≤ i ≤ n, correspond to the scores of the sample on this basis. Our approach is based on noting that L_2(A, B, μ) in (1) is proportional to Σ_{j=1}^p s_j², where s_j² is the sample variance of the residuals' jth coordinate: r_1j, r_2j, ..., r_nj. To reduce the influence of atypical observations, we propose to use robust scale estimates instead of sample variances. Our robustly estimated q-dimensional subspace best approximating the data is defined as the linear space L_B where (A, B, μ) minimizes
$$L_S(A, B, \mu) = \sum_{j=1}^p \hat{\sigma}_j^2, \qquad (2)$$
and σ̂_j denotes a robust scale estimator of the residuals r_ij = x_ij − x̂_ij, 1 ≤ i ≤ n. Note that if we use the sample variance s_j² instead of σ̂_j², then the objective function in (2) reduces to the classical one in (1).
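The following R sketch makes the two criteria concrete: it evaluates the residual matrix for a candidate (μ, B, A) and returns the sum of squared scales of its columns. The function name pca.objective is ours; scale.fn = sd recovers (a quantity proportional to) the classical criterion (1), while a robust scale such as mad, or the M-scale defined next, gives (2).

```r
## Sketch of the objectives (1) and (2): residual matrix for a
## candidate (mu, B, A), then the sum of squared (robust) scales of
## its columns.  'scale.fn' is any scale estimator.
pca.objective <- function(X, mu, B, A, scale.fn = mad) {
  R <- X - tcrossprod(rep(1, nrow(X)), mu) - tcrossprod(A, B)  # r_ij
  sum(apply(R, 2, scale.fn)^2)
}
```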
Scale estimators measure the spread of a sample and are invariant under translations and equivariant under scale transformations (see, e.g., Maronna, Martin, and Yohai 2006). Although any robust scale estimator can be used in (2), to fix ideas we focus our presentation on M-estimators of scale (see Huber and Ronchetti 2009). As in Maronna, Martin, and Yohai (2006), let ρ : R → R_+ be a ρ-function, that is, an even function, nondecreasing in |x|, increasing for x > 0 when ρ(x) < lim_{t→+∞} ρ(t), and such that ρ(0) = 0. Given residuals r_1j, ..., r_nj, the M-estimator of scale σ̂_j is defined as the solution of
$$\frac{1}{n} \sum_{i=1}^n \rho_c\!\left(\frac{r_{ij}}{\hat{\sigma}_j}\right) = b, \qquad (3)$$
where ρ_c(u) = ρ(u/c), and c > 0 is a user-chosen tuning constant. When ρ(y) = min(3y² − 3y⁴ + y⁶, 1) (Tukey's biweight function) with c = 1.54764 and b = 1/2, the estimator is Fisher consistent at the normal distribution and has breakdown point 50%. In general, if ‖ρ‖_∞ = 1, then the breakdown point of the M-scale estimator is min(b, 1 − b).

We can write our estimator in a slightly more general way as follows. Let π(y, L_B) denote the orthogonal projection of y onto L_B. To simplify the presentation, assume that μ is known. For each observation x_i, let r_i(L_B) = x_i − π(x_i, L_B) = (r_i1(L_B), ..., r_ip(L_B))^T denote the corresponding vector of residuals and σ̂_{j,L_B} = σ̂(r_1j(L_B), ..., r_nj(L_B)) the scale estimator of the jth coordinate of the residuals. Let
$$\hat{\Delta}_n(L_B) = \sum_{j=1}^p \hat{\sigma}_{j, L_B}^2.$$
The S-estimator of the q-dimensional principal subspace is the linear space L̂ = L_{B̂} that solves
$$\hat{\Delta}_n(\hat{L}) = \min_{\dim(L_B) = q} \hat{\Delta}_n(L_B). \qquad (4)$$
To study the asymptotic properties of robust estimators, it is convenient to think of them as functionals of the empirical distribution of the sample (Huber and Ronchetti 2009); we write σ_R : D → R_+ for the corresponding scale functional, where D is a subset of all the univariate distributions that contains all the empirical ones. In what follows, we will assume that x_i ∈ R^p, 1 ≤ i ≤ n, are independent and identically distributed random vectors with distribution P. The independence condition may be relaxed, for instance requiring stationarity and a mixing condition, or just ergodicity, since we only need the strong law of large numbers to hold to guarantee the consistency results given below. For a random vector x with distribution P, denote by F_j(L_B) the distribution of the jth coordinate of r(L_B) = x − π(x, L_B) and let
$$\Delta(L) = \sum_{j=1}^p \sigma_{j,L}^2,$$
where σ_{j,L} = σ_R(F_j(L_B)). The functional L(P) corresponding to the S-estimators defined in (4) is the linear space of dimension q that satisfies
$$\Delta(L(P)) = \min_{\dim(L) = q} \Delta(L). \qquad (5)$$
Recall that a random vector is said to have a spherical distribution if its distribution is invariant under orthogonal transformations. In particular, the characteristic function of a spherically distributed x ∈ R^p is of the form φ_x(t) = φ(t^T t) for t ∈ R^p, where φ : R → R is the generator of the characteristic function.
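As an illustration, the M-scale in (3) can be computed with simple fixed-point iterations, since ρ(u)/u² is nonincreasing for Tukey's biweight. This is a minimal sketch with our own function names, not the authors' implementation; the starting value is a normalized MAD.

```r
## M-estimator of scale (3) with Tukey's biweight, via fixed-point
## iterations s^2 <- s^2 * mean(rho(r/(c*s))) / b.
rho.biweight <- function(y) pmin(3 * y^2 - 3 * y^4 + y^6, 1)

mscale <- function(r, cc = 1.54764, b = 0.5, tol = 1e-8, max.it = 200) {
  s <- median(abs(r)) / 0.6745            # MAD-type starting value
  if (s == 0) return(0)                   # degenerate sample
  for (it in 1:max.it) {
    s.new <- s * sqrt(mean(rho.biweight(r / (s * cc))) / b)
    if (abs(s.new / s - 1) < tol) break
    s <- s.new
  }
  s.new
}
```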
We write x ∼ S_p(φ). For a p × p matrix B and a vector μ ∈ R^p, the distribution of z = μ + B x is said to be elliptical, denoted z ∼ E_p(μ, Σ, φ), with Σ = B B^T. The following proposition, whose proof is relegated to the online supplementary materials, shows that the solution to (5) is the desired linear space.

Proposition 2.1. Let x ∼ E_p(μ, Σ, φ), where Σ has eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p with associated eigenvectors β^(1), ..., β^(p). Assume that λ_q > λ_{q+1}. Then, the linear space L_q spanned by β^(1), ..., β^(q) is the unique solution of (5).
As mentioned before, this approach can also be used with any robust scale estimator. For example, we can define τ-estimators by considering the τ-best lower-dimensional approximations, obtained by replacing the squared M-scales in (4) with squared τ-scales, which combine the M-scale in (3) with a second ρ-function ρ_2 such that ρ_2 ≤ ρ_1. Note that if an iterative procedure is used to solve (4), the scale estimators σ̂_j need to be updated at each step of the algorithm.
Consistency of projection-pursuit principal component estimators for random vectors was derived in Cui, He, and Ng (2003), requiring uniform convergence of the scale estimators of the projected data to the scale functional over the unit ball. This condition was generalized in Bali et al. (2011) to the functional case. A natural extension for q > 1 is
$$\sup_{B \in \mathcal{V}_{p \times q}} \big| \hat{\Delta}_n(L_B) - \Delta(L_B) \big| \xrightarrow{\;a.s.\;} 0. \qquad (6)$$
Note that this condition is easily verified when using a robust scale functional with finite-dimensional random vectors, since the Stiefel manifold V_{p×q} = {B ∈ R^{p×q} : B^T B = I_q} is a compact set. Furthermore, the following proposition shows that this condition is sufficient to obtain consistency of the S-estimators in (4).
Proposition 2.2. Assume that L(P) is unique and that (6) holds. Then, the estimator L̂ = L_{B̂}, obtained by minimizing Δ̂_n(L) in (4) over linear spaces L of dimension q, is consistent for the linear space L(P) defined in (5). In other words, with probability one, π(x, L̂) converges to π(x, L(P)) for almost all x.

Algorithm for S-Estimators
The optimization problem defining our estimator is generally nonconvex, and typically difficult to solve. In this section, we show that first-order conditions for a critical point of the objective function in (4) naturally suggest an iterative reweighted least-squares algorithm. Once such iterations are available, a standard strategy used in the statistical literature to compute this type of estimator (e.g., Rousseeuw and van Driessen 1999; Maronna 2005; Salibian-Barrera and Yohai 2006) is to use a large number of random initial points, and select the best visited local minimum as the estimator.
Note that although S-scale estimators are only defined implicitly, explicit first-order conditions can be obtained by differentiating both sides of (3). More specifically, let σ̂_j, j = 1, ..., p, be an M-estimator of scale of the residuals x_ij − x̂_ij, i = 1, ..., n. In other words,
$$\frac{1}{n} \sum_{i=1}^n \rho\!\left( \frac{x_{ij} - \hat{x}_{ij}}{\hat{\sigma}_j} \right) = b,$$
where we have absorbed the constant c into the loss function ρ. Differentiating implicitly, the derivatives with respect to a_i, i = 1, ..., n, are given by
$$\frac{\partial}{\partial a_i} \sum_{j=1}^p \hat{\sigma}_j^2 \;=\; -2 \sum_{j=1}^p \frac{\hat{\sigma}_j^2 \, \rho'(r_{ij}/\hat{\sigma}_j)}{\sum_{k=1}^n \rho'(r_{kj}/\hat{\sigma}_j) \, r_{kj}} \; b_j.$$
Setting these to zero, we obtain a system of equations that can be reexpressed as weighted least-squares problems as follows:
$$a_i = \Big( \sum_{j=1}^p w_{ij} \, b_j b_j^T \Big)^{-1} \sum_{j=1}^p w_{ij} \, (x_{ij} - \mu_j) \, b_j, \qquad w_{ij} = \frac{\hat{\sigma}_j^2 \, \rho'(r_{ij}/\hat{\sigma}_j)}{r_{ij} \sum_{k=1}^n \rho'(r_{kj}/\hat{\sigma}_j) \, r_{kj}},$$
with analogous equations for b_j and μ. This formulation suggests the usual iterative reweighted least-squares (IRWLS) algorithm. Given initial estimates b_j^(0), 1 ≤ j ≤ p, and μ^(0), compute the scores a_i^(0), i = 1, ..., n, and the weights w_ij^(0), and obtain updated values a_i^(1), b_j^(1), 1 ≤ i ≤ n, 1 ≤ j ≤ p, and μ^(1). We repeat these steps until the objective function changes by less than a chosen tolerance value. The best q-dimensional approximating linear space is spanned by {b̂^(1), ..., b̂^(q)}, the final values obtained above. For interpretation purposes, we orthogonalize the set {b̂^(1), ..., b̂^(q)} and compute the scores â_i as the corresponding orthogonal projections.
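A minimal sketch of one IRWLS sweep follows, reusing mscale and Tukey's biweight from above. The weights are the w_ij displayed above, written in terms of the weight function W(y) = ρ'(y)/y (for the biweight, W(y) = 6(1 − y²)² on |y| ≤ 1 and 0 outside); they coincide with the display up to a constant factor. The function name irwls.step and the small ridges guarding against zero denominators are ours.

```r
## One IRWLS sweep: residuals -> column M-scales -> weights -> WLS
## updates of the scores a_i, the basis rows b_j, and the center mu.
w.biweight <- function(y) ifelse(abs(y) <= 1, 6 * (1 - y^2)^2, 0)

irwls.step <- function(X, mu, B, A, cc = 1.54764, b = 0.5) {
  n <- nrow(X); p <- ncol(X)
  R <- X - tcrossprod(rep(1, n), mu) - tcrossprod(A, B)    # r_ij
  sig <- pmax(apply(R, 2, mscale, cc = cc, b = b), 1e-10)  # sigma_j via (3)
  U <- sweep(R, 2, cc * sig, "/")                          # r_ij/(c sigma_j)
  S <- colSums(w.biweight(U) * U^2) + 1e-10
  W <- sweep(w.biweight(U), 2, S, "/")                     # w_ij (up to a constant)
  Xc <- sweep(X, 2, mu)
  for (i in 1:n)                                           # weighted LS for a_i
    A[i, ] <- solve(crossprod(B * sqrt(W[i, ])),
                    crossprod(B, W[i, ] * Xc[i, ]))
  for (j in 1:p)                                           # weighted LS for b_j
    B[j, ] <- solve(crossprod(A * sqrt(W[, j])),
                    crossprod(A, W[, j] * Xc[, j]))
  mu <- colSums(W * (X - tcrossprod(A, B))) / colSums(W)   # weighted mean
  list(mu = mu, B = B, A = A)
}
```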
For the initial location vector μ^(0), we use the L_1-median, and adapt the strategy of Rousseeuw and van Driessen (1999) to select initial values for B and A. More specifically, we generate N_1 random starts for the matrix B, which are orthogonalized, each of them leading to an initial matrix B^(0). The matrix A collects the scores of each observation on the basis given by the q columns of B^(0). For each of these initial values, we run N_2 IRWLS iterations, or stop earlier if a tolerance level is achieved. The initial values giving the best objective function after N_2 iterations are then iterated until convergence. This algorithm depends on the number of random starts N_1, the desired tolerance for the sequential change in the objective function, and the number of iterations N_2 applied to each random candidate. In our experiments, we used a tolerance of 10^{-6} and found that using N_1 = 50 random starts and N_2 = 50 partial IRWLS iterations for each of them was typically sufficient to find a good solution to (4), which is in line with the results of Maronna (2005).
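The multi-start strategy can be sketched as follows; the L_1-median is computed with plain Weiszfeld iterations. This is an illustration under our own function names (l1.median, s.pca), not the authors' publicly available code, and for brevity the final refinement of the best candidate is only indicated in a comment.

```r
## L1-median by Weiszfeld iterations, started at the coordinate-wise median.
l1.median <- function(X, max.it = 100, tol = 1e-8) {
  m <- apply(X, 2, median)
  for (it in 1:max.it) {
    d <- pmax(sqrt(rowSums(sweep(X, 2, m)^2)), 1e-10)
    m.new <- colSums(X / d) / sum(1 / d)
    if (sqrt(sum((m.new - m)^2)) < tol) break
    m <- m.new
  }
  m
}

## N1 random orthonormal starts, N2 partial IRWLS sweeps each.
s.pca <- function(X, q, N1 = 50, N2 = 50) {
  p <- ncol(X); mu <- l1.median(X)
  best <- NULL; best.obj <- Inf
  for (k in 1:N1) {
    B <- qr.Q(qr(matrix(rnorm(p * q), p, q)))   # random orthonormal basis
    fit <- list(mu = mu, B = B, A = sweep(X, 2, mu) %*% B)
    for (it in 1:N2) fit <- irwls.step(X, fit$mu, fit$B, fit$A)
    obj <- pca.objective(X, fit$mu, fit$B, fit$A, scale.fn = mscale)
    if (obj < best.obj) { best <- fit; best.obj <- obj }
  }
  best  # in practice, iterate this candidate further until convergence
}
```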
An implementation of this algorithm in R is publicly available online from http://www.stat.ubc.ca/~matias/soft.html. Although a formal computational complexity analysis of this algorithm is beyond the scope of this article, our numerical experiments reported in Section 5 show that the algorithm works very well. We tested the speed of our R code using these settings on an Intel i7 CPU (3.5GHz) machine running Windows 7. In Table 1, we report the average time in CPU minutes over 10 random samples for different combinations of the sample size (n), the number of variables (p), and the dimension of the subspace (q). Note that these times could be improved notably if the algorithm were implemented in C or another language with faster linear algebra operations.

Choosing the Dimension of the Approximating Subspace
In some cases, the desired dimension of the linear subspace providing an approximation to the data is either known or chosen in advance (e.g., for visualization purposes). In many applications, however, this dimension is selected based on the resulting "proportion of unexplained variability." Proposition 2.1 shows that for x ∼ E_p(0, Σ, φ), the functional Δ(L) is minimized when L = L_q, the subspace spanned by the first q eigenvectors of the scatter matrix, and that the minimum value Δ(L_q) is determined by the p − q smallest eigenvalues of Σ. Thus, the proportion of unexplained variability can be defined as u_q = Δ(L_q)/Δ(L_0), and an estimator of u_q is given by û_q = Δ̂_n(L̂_q)/Δ̂_n(L̂_0), where L̂_q is defined in (4) and L̂_0 corresponds to minimizing Δ̂_n(L_0) = Σ_{j=1}^p σ̂²_{j,L_0}, with σ̂_{j,L_0} = σ̂(r_1j(μ), ..., r_nj(μ)) the scale estimator of the jth coordinate of the residuals r_i(μ) = x_i − μ. Proposition 2.2 can be used to show the consistency of û_q to u_q.
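A sketch of the estimated proportion of unexplained variability, reusing the functions above; as a simplification of this illustration we center with the L_1-median rather than minimizing Δ̂_n(L_0) over μ.

```r
## u.hat_q = Delta_n(L.hat_q) / Delta_n(L.hat_0): ratio of the robust
## objective at the fitted q-dimensional subspace to the zero-
## dimensional one (scales of the residuals around the center alone).
unexplained <- function(X, fit) {
  num <- pca.objective(X, fit$mu, fit$B, fit$A, scale.fn = mscale)
  R0 <- sweep(X, 2, l1.median(X))        # residuals x_i - mu (proxy for L_0)
  num / sum(apply(R0, 2, mscale)^2)
}
```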
To avoid the high computational cost of solving (4) for different values of q, we adapt the strategy of Maronna (2005). Let u_max be the maximum allowed proportion of unexplained variability, and q_max a maximum dimension for the approximating subspace. We look for the smallest q_0 such that q_0 ≤ q_max and û_{q_0} ≤ u_max. We first verify that û_{q_max} ≤ u_max; otherwise the problem cannot be solved and we need to modify our goals. The procedure starts with q_1 = 1. If û_1 ≤ u_max, we are done. Otherwise, assume that after j steps we have û_{q_j} ≥ u_max, where q_j = j is the dimension used in step j. Let μ̂^{(q_j)} be the estimated center and B̂_{q_j} ∈ R^{p×q_j} the orthonormal basis of the best q_j-dimensional subspace L̂_{q_j}. As before, let Â_{q_j} ∈ R^{n×q_j} be the matrix of scores. Let q_{j+1} = q_j + 1, define the matrices B = (B̂_{q_j}, β) ∈ R^{p×q_{j+1}} and A = (Â_{q_j}, α) ∈ R^{n×q_{j+1}}, and let b^(1), ..., b^(q_{j+1}) and a_1, ..., a_n denote the columns of B and the rows of A, respectively. We construct our predictions as x̂_i = μ̂^{(q_j)} + B a_i and, keeping B̂_{q_j} and Â_{q_j} fixed, minimize over β and α the criterion L_S(A, B, μ) given in (2). A system of equations analogous to that described in Section 2.1 can be derived to formulate an iterative reweighted least-squares algorithm. Once the optimal β̂ and α̂ are found, we optimize L_S(A, B, μ) over μ to obtain μ̂^{(q_{j+1})}. This approach is much faster than solving (4) for q = q_{j+1}. Note that the resulting ũ_{q_{j+1}} = Δ̃_n(L̃_{q_{j+1}})/Δ̂_n(L̂_0) is typically larger than û_{q_{j+1}}, so that if ũ_{q_{j+1}} ≤ u_max, we select q = q_{j+1}; otherwise we increase j and continue.
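The incremental step can be sketched by applying the rank-one case of the estimator to the residual matrix of the current fit, which keeps B̂_{q_j} and Â_{q_j} fixed as described above; this is an approximation under our own function names (the final center update over μ is omitted).

```r
## Add one direction: robust rank-one fit to the residuals of the
## current q_j-dimensional fit, then append it to the basis and scores.
add.direction <- function(X, fit, N1 = 10, N2 = 50) {
  R <- X - tcrossprod(rep(1, nrow(X)), fit$mu) - tcrossprod(fit$A, fit$B)
  extra <- s.pca(R, q = 1, N1 = N1, N2 = N2)
  list(mu = fit$mu, B = cbind(fit$B, extra$B), A = cbind(fit$A, extra$A))
}
```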

S-ESTIMATORS IN THE FUNCTIONAL SETTING
In this section, we discuss extensions of the estimators defined in Section 2 to accommodate functional data. The most common situation is that in which the observations are realizations of a stochastic process X ∈ L²(I), with I an interval of the real line that can be assumed to be I = [0, 1]. A more general setup, which can accommodate applications where observations are images, for example, is to consider realizations of a random element on a separable Hilbert space H with inner product ⟨·, ·⟩_H and norm ‖·‖_H. Note that principal components for functional data (defined via the Karhunen-Loève decomposition of the covariance function of the process X) also have the property of providing best lower-dimensional approximations, in the L² sense. Recently, a stochastic best lower-dimensional approximation for elliptically distributed random elements on separable Hilbert spaces, such as those considered when dealing with multivariate data, was obtained by Boente, Salibian-Barrera, and Tyler (2014). This optimality property does not require second moment conditions. However, even in the simplest situation, when X ∈ L²([0, 1]), one rarely observes entire curves. The functional datum for replication i usually corresponds to a finite set of discrete values x_ij = X_i(t_ij), 1 ≤ j ≤ m_i. Depending on the characteristics of the grid of points t_ij where observations were obtained, one can employ different strategies to analyze these data.
The easiest situation is when observations were made at common design points. In this case, we have p = m_1 = ... = m_n and t_ij = τ_j for all 1 ≤ i ≤ n and 1 ≤ j ≤ p. Defining x_i = (x_i1, ..., x_ip)^T, a purely multivariate approach can be used as in Section 2 to obtain a q-dimensional linear space L̂ spanned by orthonormal vectors b̂^(1), ..., b̂^(q). An associated basis in L²([0, 1]) can be defined as φ̂_ℓ(τ_j) = a_ℓ b̂_{ℓj}, for 1 ≤ ℓ ≤ q, 1 ≤ j ≤ p, where a_ℓ is a constant ensuring that ‖φ̂_ℓ‖_{L²} = 1 and b̂^(ℓ) = (b̂_{ℓ1}, ..., b̂_{ℓp})^T. Smoothing over the observed data points, one can recover the complete trajectory. This approach provides a consistent estimator for the best approximating linear space and the corresponding "fitted trajectories" π(X_i, L̂).

In many cases, however, trajectories are observed at different design points t_ij, 1 ≤ j ≤ m_i, 1 ≤ i ≤ n. In what follows, we will assume that as the sample size n increases, so does the number of points where each trajectory is observed and that, in the limit, these points cover the interval [0, 1]. Our approach consists of using a sequence of finite-dimensional functional spaces that increases with the sample size. The basic idea is to identify each observed point in H with the vector formed by its coordinates on a finite-dimensional basis that increases with the sample size. The procedure of Section 2 can be applied to these vectors to obtain a q-dimensional approximating subspace, which can then be mapped back onto H. More specifically, let {δ_i}_{i≥1} be an orthonormal basis of H and, for each n ≥ 1, let H_{p_n} be the linear space spanned by δ_1, ..., δ_{p_n}. To simplify the notation, we write p = p_n. Let x_ij = ⟨X_i, δ_j⟩_H be the coefficient of the ith trajectory on the jth element of the basis, 1 ≤ j ≤ p, and form the p-dimensional vector x_i = (x_i1, ..., x_ip)^T. When H = L²([0, 1]), the inner products ⟨X_i, δ_j⟩_H can be numerically computed using a Riemann sum over the design points for the ith trajectory, {t_ij}_{1≤j≤m_i}. We apply the procedure described in Section 2 to the multivariate observations x_1, ..., x_n ∈ R^p to obtain a q-dimensional linear space L̂ spanned by orthonormal vectors b̂^(1), ..., b̂^(q) and the corresponding "predicted values" x̂_i = μ̂ + Σ_{ℓ=1}^q â_{iℓ} b̂^(ℓ). It is now easy to find the corresponding approximation in the original space H: the location parameter is μ̂_H = Σ_{j=1}^p μ̂_j δ_j, and the fitted trajectories are X̂_i = μ̂_H + Σ_{ℓ=1}^q â_{iℓ} φ̂^(ℓ), where φ̂^(ℓ) = Σ_{j=1}^p b̂_j^(ℓ) δ_j. As in the finite-dimensional case, we can also use squared residual norms to detect atypical observations. For a proof of the Fisher consistency of this approach, we refer the reader to Section S.1 of the online supplementary materials.

Algorithm for Functional Data
In this section, we give details on how to compute our S-estimators for functional principal components. The basic idea consists of applying the algorithm of Section 2.1 to the coordinates of the observed data on a sufficiently rich orthonormal basis of the Hilbert space, and then transforming the result back to the original variables.
To fix ideas, consider the case where the data consist of functions X_i, 1 ≤ i ≤ n, observed at common points t_1, ..., t_m. We approximate the L² inner product with a Riemann sum over the grid of points:
$$\langle f, g \rangle \approx \sum_{s=1}^m f(t_s) \, g(t_s) \, (t_s - t_{s-1}).$$
Let ν_1, ..., ν_p be a B-spline basis. We orthonormalize ν_1, ..., ν_p using the approximated inner product to obtain orthonormal elements δ_1, ..., δ_p. Let Φ ∈ R^{m×p} be the matrix of the functions δ_j evaluated at the points t_s: Φ = (δ_1, δ_2, ..., δ_p), where δ_j = (δ_j(t_1), δ_j(t_2), ..., δ_j(t_m))^T. Then, if X ∈ R^{n×m} is the matrix of observed trajectories (one in each row), the matrix X̃ ∈ R^{n×p} of the coordinates of each X_i on the elements δ_j of the spline basis is obtained by computing the approximated inner products ⟨X_i, δ_j⟩ above. We now apply the algorithm given in Section 2.1 to the "data" matrix X̃ of the coordinates of our observations on the B-spline basis. We obtain the center vector μ̂ ∈ R^p, an orthonormal basis B̂ ∈ R^{p×q} of the best q-dimensional subspace, and the matrix of scores Â ∈ R^{n×q}. The matrix X̂ = 1_n μ̂^T + Â B̂^T ∈ R^{n×p}, where 1_n ∈ R^n is the vector of ones, provides the q-dimensional approximation to our functional data written in the B-spline basis. Finally, we express our solution on the original grid as X̂ Φ^T ∈ R^{n×m}; note that X̂ Φ^T = 1_n (Φ μ̂)^T + Â (Φ B̂)^T. In other words, Φ μ̂ ∈ R^m is the vector of the center function μ̂_H evaluated at the points t_1, ..., t_m, and Φ B̂ ∈ R^{m×q} is the matrix of the q orthonormal functions φ̂^(ℓ) spanning the best lower-dimensional approximating space in H, evaluated at the same points.
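The whole pipeline can be sketched as follows, reusing s.pca from Section 2.1; the basis is built with splines::bs() and orthonormalized under the Riemann-sum inner product via a Cholesky factor of its Gram matrix. Function names and the choice of Riemann weights are ours.

```r
library(splines)

## B-spline basis on the grid -> orthonormalize -> coordinates ->
## multivariate S-PCA -> map the fit back to the grid.
functional.s.pca <- function(X, tt, p = 50, q = 1) {   # X is n x m
  dt <- c(diff(tt), mean(diff(tt)))                    # Riemann weights
  V <- bs(tt, df = p, intercept = TRUE)                # m x p spline basis
  G <- t(V) %*% (V * dt)                               # Gram matrix of the nu_j
  Phi <- V %*% solve(chol(G))                          # orthonormal columns delta_j
  Xtil <- X %*% (Phi * dt)                             # n x p coordinates <X_i, delta_j>
  fit <- s.pca(Xtil, q = q)                            # Section 2.1 estimator
  Xhat <- tcrossprod(rep(1, nrow(X)), c(Phi %*% fit$mu)) +
    tcrossprod(fit$A, Phi %*% fit$B)                   # fitted curves on the grid
  list(fit = fit, Xhat = Xhat, basis = Phi)
}
```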

OUTLIER DETECTION
An important use of robust estimators for multivariate data is the detection of potential outliers; see, for example, Rousseeuw and Van Zomeren (1990), Becker and Gather (2001), Pison and van Aelst (2004), and Hardin and Rocke (2005). Unfortunately, these approaches do not extend naturally to the functional case.
Alternatively, one can consider the PCA residuals as indicators of outlyingness. Given a sample x_1, ..., x_n in R^p and the estimated subspace L̂ = L_{B̂} in (4), one can construct the corresponding "best q-dimensional" approximations x̂_i = μ̂ + B̂ â_i. We expect outlying or otherwise atypical observations to be poorly fitted and thus to have a relatively large residual R_i = ‖x_i − x̂_i‖. Exploring the norm of these residuals sometimes provides sufficient information to detect abnormal points in the data. It is worth noticing that the distribution of the squared residual norms R_i² is unknown, but typically skewed to the right because they are bounded below by 0. Following the approach of Hubert and Vandervieren (2008), we propose to flag an observation as atypical if its squared residual norm exceeds the upper whisker of the adjusted boxplot for skewed distributions.
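The flagging rule can be sketched as follows, using the upper whisker of the Hubert and Vandervieren (2008) adjusted boxplot, built from the medcouple skewness measure (available as mc() in the robustbase package). The function name flag.outliers is ours; in the finite-dimensional case, R2 <- rowSums((X - Xhat)^2) gives the squared residual norms.

```r
library(robustbase)

## Flag observations whose squared residual norm exceeds the upper
## whisker of the adjusted boxplot: Q3 + 1.5 * exp(3 * MC) * IQR when
## the medcouple MC >= 0, and Q3 + 1.5 * exp(4 * MC) * IQR otherwise.
flag.outliers <- function(R2) {
  q1 <- quantile(R2, 0.25); q3 <- quantile(R2, 0.75)
  mcv <- mc(R2)                                  # medcouple (skewness)
  upper <- q3 + 1.5 * exp(ifelse(mcv >= 0, 3, 4) * mcv) * (q3 - q1)
  which(R2 > upper)
}
```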
Another way to use principal components to look for potential outliers considers the scores of each point on the estimated principal eigenvectors. The solution to (4) provides an estimated basis b̂^(j), 1 ≤ j ≤ q (the columns of B̂), for the optimal q-dimensional linear space spanned by the first q eigenvectors, but the b̂^(j)'s themselves need not be estimates of the principal directions. However, we can use an approach similar to "projection pursuit" to sequentially search for vectors in L_{B̂} that maximize a robust scale estimate of the corresponding projections of the data. Specifically, for each γ ∈ L_{B̂}, let F_n[γ] be the empirical distribution of the projected observations γ^T x_1, ..., γ^T x_n, and σ_R(F_n[γ]) the corresponding scale estimator. The estimated first principal direction is obtained by maximizing σ_R(F_n[γ]) over unit vectors in L_{B̂}. Subsequent principal directions are computed similarly, with the additional condition of being orthogonal to the previous ones. The scores of each observation on the estimated principal directions can be used to screen for atypical data points.
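Since every unit vector in L_{B̂} can be written as γ = B̂u with u on the unit sphere of R^q, the search reduces to a q-dimensional optimization over the scores. A crude sketch with a general-purpose optimizer follows (function names ours; a more careful search would use many random restarts).

```r
## Sequentially maximize the robust scale of the projected scores,
## deflating (orthogonalizing) against the directions already found.
pp.directions <- function(scores, B) {        # scores = A.hat (n x q)
  q <- ncol(B)
  if (q == 1) return(B)                       # only one direction available
  U <- NULL
  for (l in 1:q) {
    deflate <- function(u) {                  # project out found directions
      if (!is.null(U)) u <- u - U %*% crossprod(U, u)
      c(u) / sqrt(sum(u^2))
    }
    obj <- function(u) -mscale(scores %*% deflate(u))
    u <- deflate(optim(rnorm(q), obj)$par)
    U <- cbind(U, u)
  }
  B %*% U                                     # directions gamma in R^p
}
```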
Both of these approaches have natural counterparts for functional data and can be used with the estimators defined in Section 3. Hyndman and Shang (2010) defined two detection rules based on the scores of a robust two-dimensional fit and compared them with a residuals-based PCA procedure introduced by Hyndman and Ullah (2007). Our simulation study in Section 5 includes these methods as well as those based on functional depth proposed by Febrero, Galeano, and Gonzalez-Manteiga (2007, 2008).
As in the finite-dimensional case, to find potential outliers one may look for curves X_i that are poorly predicted by the S-estimator, using the squared prediction errors R²_{i,H} = ‖X_i − X̂_i‖²_H. The distribution of these prediction residuals is likewise unknown and difficult to estimate. Hyndman and Ullah (2007) proposed to use a normal approximation to the residual squared norm, which they called the integrated squared error, to define a threshold. Our approach is more data-analytic and does not depend on the underlying distribution of the process, even though we always have in mind that the uncontaminated process has an elliptical distribution. For that reason, we mimic the proposal given in the finite-dimensional case: to decide whether an observation may be flagged as a potential outlier, we use the adjusted boxplot of Hubert and Vandervieren (2008) on the residuals R²_{i,H}, identifying as atypical any observation whose value exceeds the upper whisker of the adjusted boxplot. We use this approach in the example and in the simulation study discussed below.

SIMULATION
In this section, we present the results of a simulation study performed to investigate the finite-sample properties of our robust sieve proposal. In all cases, we generated 500 samples of size n = 70, where each trajectory was observed at m = 100 equidistant points in the interval [0, 1]. We used a cubic B-spline basis of dimension p = 50, which is sufficiently rich to represent the data well. This choice represents a realistic situation where the sample size is similar to the dimension of the problem. Other reasonable choices for the dimension of the spline basis (even with n < p) yielded very similar results and led to the same conclusions in our numerical experiments.

Simulation Settings
The following three different models, constructed from finite- and infinite-range processes, were used to generate the data. In two of them we included a relatively small proportion of measurement errors, as is usual in many applications.
Model 1. We also generated contaminated trajectories X_i^(c) as realizations of the process X^(c) defined by X^(c)(t_s) = X(t_s) + V Y(t_s), s = 1, ..., 100, where V ∼ Bi(1, ε_1) is independent of X and Y, Y(t_s) = W_s z_s with W_s ∼ Bi(1, ε_2), z_s ∼ N(μ^(c), 0.01), and the W_s and z_s are all independent. In other words, a trajectory is contaminated with probability ε_1, and at any point t_s the contaminated function is shifted with probability ε_2. The shift is random but tightly distributed around the constant μ^(c) = 30. Samples without outliers correspond to ε_1 = 0. To investigate the influence of different outlier configurations on our estimator, we considered the settings ε_1 = 0.10 and ε_1 = 0.20, with ε_2 = 0.30 in both cases.
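The contamination mechanism is easy to reproduce; the sketch below takes a matrix of clean trajectories (whose generation is not shown here) and applies the per-curve/per-point Bernoulli scheme just described. The function name is ours.

```r
## Model 1 contamination: each curve is contaminated with probability
## eps1; within a contaminated curve, each grid point is shifted by a
## N(30, 0.01) value with probability eps2.
contaminate <- function(X, eps1 = 0.10, eps2 = 0.30, mu.c = 30) {
  n <- nrow(X); m <- ncol(X)
  V <- rbinom(n, 1, eps1)                       # contaminated curves
  W <- matrix(rbinom(n * m, 1, eps2), n, m)     # shifted grid points
  Z <- matrix(rnorm(n * m, mu.c, sqrt(0.01)), n, m)
  list(X = X + V * W * Z, outlier = V == 1)
}
```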
Model 2. This situation corresponds to a similar case as in Model 1, but with some curves starting on a different trajectory that joins smoothly with the one that most curves follow. In this case, noncontaminated observations X_i ∼ X were generated as X(t_s) = 150 − 2μ(t_s) + ξ_1 φ_1(t_s) + ξ_2 φ_2(t_s) + z_s, s = 1, ..., 100, where z_s, ξ_1, ξ_2, μ, φ_1, and φ_2 are as in the previous model. However, contaminated trajectories are only perturbed in a specific part of their range. The atypical observations satisfy X_i^(c) ∼ X^(c), where X^(c)(t_s) = X(t_s) + V Y(t_s) for t_s < 0.4 and X^(c)(t_s) = X(t_s) for t_s ≥ 0.4, where V ∼ Bi(1, ε_1) is independent of X and Y, Y(t_s) = W_s z_s with W_s ∼ Bi(1, ε_2), z_s ∼ N(μ^(c)(t_s), 0.01), with μ^(c)(t_s) = −5 − 2μ(t_s), and the W_s and z_s are all independent. In this model, we used ε_1 = 0.10 and ε_1 = 0.20, and in both cases we set ε_2 = 0.90.

The Estimators
We computed the classical principal components estimator (LS) as well as the robust one defined in (2), using an M-scale estimator with the function ρ_c in Tukey's bisquare family and tuning constants c = 1.54764 and b = 0.50. We also considered the choice c = 3.0 and b = 0.2426, which we expect to yield higher efficiency. The robust estimators are labeled S (1.5) and S (3) in the tables. As mentioned in Section 2.1, after obtaining the robust q-dimensional linear space, we orthonormalize its basis and compute the scores â_i as the corresponding orthogonal projections. We also computed the sieve projection-pursuit approach proposed in Bali et al. (2011), which is called "PP" in our tables. For comparison purposes, we have also calculated the mean squared prediction errors obtained with the true best q-dimensional linear space for uncontaminated data. This benchmark is indicated as "True" in all tables. Since trajectories following Models 1 and 2 were generated using a two-dimensional scatter operator (i.e., the underlying process had only two nonzero eigenvalues) plus measurement errors, we used q = 1 with our estimator. For Model 3, we used q = 4, which corresponds to 95% of explained variance.

Simulation Results
To summarize the results of our simulation study, for each replication we consider mean squared prediction errors in the original space, that is, based on ‖X_i − X̂_i‖²_H. The conclusions that can be reached using the finite-dimensional squared prediction errors ‖x_i − x̂_i‖²_{R^p} are the same as those discussed below, and hence are not reported here. Tables 2 to 4 report the average mean squared error for outlying and nonoutlying trajectories separately, as a way to quantify how the procedures fit the bulk of the data. More specifically, let γ_i = 1 when X_i is an outlier and γ_i = 0 otherwise, and define
$$\mathrm{PE}_{H,\mathrm{OUT}} = \sum_{i : \gamma_i = 1} \| X_i - \hat{X}_i \|_H^2, \qquad \mathrm{PE}_{H,\mathrm{CLEAN}} = \sum_{i : \gamma_i = 0} \| X_i - \hat{X}_i \|_H^2. \qquad (7)$$
Note that the total prediction error equals PE_{H,OUT} + PE_{H,CLEAN}. We also report the mean prediction error over contaminated and clean trajectories separately:
$$\overline{\mathrm{PE}}_{H,\mathrm{OUT}} = \frac{\mathrm{PE}_{H,\mathrm{OUT}}}{\sum_{i=1}^n \gamma_i}, \qquad \overline{\mathrm{PE}}_{H,\mathrm{CLEAN}} = \frac{\mathrm{PE}_{H,\mathrm{CLEAN}}}{\sum_{i=1}^n (1 - \gamma_i)}.$$
We also compute the squared prediction errors of the actual best lower-dimensional predictions X_i^0, obtained with the first q true eigenfunctions (recall that we used q = 1 in Models 1 and 2, and q = 4 in Model 3). The results for this "estimator" are tabulated in the row labeled "True." The averages over the 500 replications of PE_{H,OUT}, PE_{H,CLEAN}, $\overline{\mathrm{PE}}_{H,\mathrm{OUT}}$, and $\overline{\mathrm{PE}}_{H,\mathrm{CLEAN}}$ are labeled "Out," "Clean," "$\overline{\mathrm{Out}}$," and "$\overline{\mathrm{Clean}}$," respectively.
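These summaries are straightforward to compute; a small sketch follows (function name ours), with dt the Riemann weights of the observation grid.

```r
## Total and mean squared prediction errors, split by the outlier
## indicator gamma (1 = contaminated curve, 0 = clean curve).
pred.errors <- function(X, Xhat, gamma, dt = rep(1/ncol(X), ncol(X))) {
  pe <- rowSums(sweep((X - Xhat)^2, 2, dt, "*"))   # ||X_i - Xhat_i||_H^2
  c(out       = sum(pe[gamma == 1]),  clean     = sum(pe[gamma == 0]),
    out.bar   = mean(pe[gamma == 1]), clean.bar = mean(pe[gamma == 0]))
}
```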
As expected, when no outliers are present all procedures are comparable, with a small loss for the robust procedures. The S-estimator with c = 3 had the second smallest mean squared prediction error, after LS. When samples were contaminated, the classical procedure based on least squares tries to compromise between outlying and nonoutlying trajectories, and this is reflected in the values of PE_{H,OUT} and PE_{H,CLEAN} in (7), and also in the average prediction errors of the contaminated and noncontaminated trajectories, $\overline{\mathrm{PE}}_{H,\mathrm{OUT}}$ and $\overline{\mathrm{PE}}_{H,\mathrm{CLEAN}}$. With contaminated samples, the S-estimator had the best performance overall. Its mean squared prediction error was closest to the "True" one, and it also provided better fits to the noncontaminated samples (and worse predictions for the contaminated trajectories). This last observation can be seen by comparing the columns labeled "Out" and "Clean." The only case in which the sieve projection-pursuit estimator performed slightly better than the S-estimator is Model 2 with ε_1 = 0.20 and ε_2 = 0.90 (see Table 3). The advantage of the S-estimator was more notable in all the other cases of Models 1, 2, and 3.
We also compared the performance of different outlier detection methods for functional data. As described in Section 4, we used the squared prediction errors R²_{i,H} = ‖X_i − X̂_i‖²_H, i = 1, ..., n, to find curves X_i that are poorly predicted. Those with squared prediction errors exceeding the upper whisker of the adjusted boxplot are flagged as outliers. We used the same approach with the predictions X̂_i obtained using the other estimators mentioned before.
In addition, we included other outlier-detection methods for functional data that have appeared in the literature. We considered the functional high-density region and the functional bagplots of Hyndman and Shang (2010) with 99% coverage, denoted HDR and BAG, respectively, as well as the integrated squared error method defined in Hyndman and Ullah (2007), denoted HU. The first two methods are based on the scores of a two-dimensional robust projection-pursuit fit. To keep the comparison fair, for HU we chose a q-dimensional robust fit with q = 1 under Models 1 and 2 and q = 4 under Model 3. Furthermore, we also compared our detection rule with the proposals based on a likelihood-ratio-type statistic given in Febrero, Galeano, and Gonzalez-Manteiga (2007) and on the modal depth, using both trimmed and weighted bootstrap estimates for the threshold, as proposed in Febrero, Galeano, and Gonzalez-Manteiga (2008). These methods are denoted LRT, DTR, and DWE, respectively. These detection rules are implemented in the R package rainbow.
For each model and each outlier detection method, in Tables  5 to 7 we report the average sensitivity and specificity over the 500 samples. Sensitivity is the proportion of actual outliers that are correctly flagged as such, while specificity is the proportion of nonoutlying curves correctly identified as not atypical. An ideal method will simultaneously maintain high sensitivity and specificity.
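For completeness, a sketch of how these two summaries are computed for each replication (function name ours):

```r
## Sensitivity: proportion of true outliers flagged.  Specificity:
## proportion of clean curves not flagged.  Both inputs are logical
## vectors of length n.
sens.spec <- function(flagged, outlier) {
  c(sensitivity = mean(flagged[outlier]),
    specificity = mean(!flagged[!outlier]))
}
```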
For Model 1, we note that DTR, DWE, and HU identify too many curves as outliers (resulting in high sensitivity but low specificity). On the other hand, LRT, HDR, and BAG consistently miss most of the outliers (low sensitivity), as does LS when the proportion of outliers is 20%. Using prediction residuals based on the S- and projection-pursuit estimators offers the best overall performance. When the data follow Model 2, LS, HDR, LRT, DTR, and DWE fail to detect most of the outliers, as does BAG for ε_1 = 0.20. Again, HU flags too many curves as outlying. The relatively low specificity of DTR and DWE (and to some extent BAG) seems to indicate that the few observations flagged as outliers are not the truly atypical ones. Once again, the approach based on the S- and projection-pursuit estimators works best. Note that although S (1.5) appears to miss around half of the outliers for ε_1 = 0.20, those flagged as atypical are correctly identified. The results for Model 3 are very similar to those for Model 1. Overall, for the three scenarios considered here, the clearly best method to detect functional outliers is to use the squared prediction residuals based on a robust principal components estimator.

EXAMPLE: GROUND LEVEL OZONE CONCENTRATIONS
These data contain hourly average measurements of ground-level ozone (O_3) concentration from a monitoring station in Richmond, BC, Canada. Ozone at ground level is a serious air pollutant, and its presence typically peaks in the summer months. We focus on the month of August, and obtained data for the years 2004 to 2012, giving 176 days with hourly average O_3 measurements. Our purpose is to identify days in which the temporal pattern of O_3 concentration appears different from the others. Based on the strong pattern observed in the data, we consider one-dimensional approximations. We use an S-estimator with tuning constant c = 3, applying the approach described in Section 3 with a cubic B-spline basis of dimension p = 10. To find potentially outlying curves, we use as threshold the upper whisker of the adjusted boxplot of Hubert and Vandervieren (2008) applied to the squared prediction errors from the LS and S-estimators. Figure 2 contains the estimated density of the L² norm of the residuals for each of the 176 curves when we compute predictions using our S-estimator (panel (a)) and the classical LS one (panel (b)). The dashed line in Figure 2 corresponds to the threshold suggested by the adjusted boxplot. While there are a few extreme outliers at the right tail of each plot, both plots also show a relatively heavy tail that suggests the presence of moderate outliers. The solid line indicates approximately the beginning of this heavy tail, and is the cut-off used in our analysis.
To make the visualization of the results easier, each panel in Figure 3 shows the observations detected as outliers in 1 year, both by the robust estimator (solid lines) and the classical approach (dashed lines). The thin gray lines in the background show all the available observations, and are included as a visual reference, while the light dashed horizontal line at 50 ppb is the current maximum recommended level. We see that the robust fit identifies as outliers all of the days with relatively high peaks of O 3 concentration, but also some days that exhibit a "flat" profile.
Since ground level ozone is produced by the reaction between sunlight and other compounds in the air, we use temperature data to verify whether the potential outliers identified above correspond to atypical days. Figure 4 shows maximum daily temperature for the months of August between 2004 and 2012 together with the daily amount of rain. Days for which O 3 data are not available are indicated with white circles. A day identified as having an atypical O 3 profile by the robust fit is marked with a large solid circle. Potential outliers identified by the classical approach are indicated with a solid triangle. We see that the outliers identified by the robust fit correspond to days with either a very high or low temperature. Furthermore, outlying days with a "flat" O 3 profile are those with a low maximum temperature, while days with a sharp O 3 peak correspond to particularly hot days. On the other hand, days flagged as possible outliers by LS generally do not show any pattern with respect to temperature. This analysis shows that the robust method is able to identify potential outliers that correspond to extreme values of an unobserved but closely associated meteorological variable (temperature). In other words, the robust method is able to uncover outliers that correspond to actual atypical days.

CONCLUDING REMARKS
In this article, we propose a robust estimator for the subspace spanned by the first q principal components. We show that our method is consistent and can be used in general settings, including functional data applications. In this case, our method works well when the observations can be well represented in a sufficiently rich but arbitrary basis. Moreover, the resulting robust predictions can be used to detect atypical observations in the data. This is confirmed in our simulation study, where this outlier detection method compares very favorably with other proposals in the literature. Our estimators are defined via a nonconvex optimization problem, which is difficult to solve. As is done for similar problems arising in other contexts (e.g., robust linear regression and multivariate location and scatter estimation), we use first-order conditions to derive an iterative reweighted least-squares-type algorithm. Extensive numerical experiments show that this algorithm provides estimators with good statistical properties. It would be interesting, but beyond the scope of this work, to study whether a convex relaxation of the optimization problem (2) can provide a more scalable algorithm with comparable robustness and statistical properties.

SUPPLEMENTARY MATERIALS
The supplementary material has three sections. Section S.1 contains a discussion of the Fisher consistency of the sieves approach for S-estimators of functional principal components. Section S.2 includes the analysis of the French mortality dataset. Finally, in Section S.3 the proofs of Propositions 2.1 and 2.2 and of the Fisher consistency are given.