Outlier-Robust Kernel Hierarchical-Optimization RLS on a Budget with Affine Constraints

This paper introduces a non-parametric learning framework to combat outliers in online, multi-output, and nonlinear regression tasks. A hierarchical-optimization problem underpins the learning task: search in a reproducing kernel Hilbert space (RKHS) for a function that minimizes a sample-average ℓp-norm (1 ≤ p ≤ 2) error loss defined on data contaminated by noise and outliers, under affine constraints defined as the set of minimizers of a quadratic loss on a finite number of faithful data devoid of noise and outliers (side information). To surmount the computational obstacles inflicted by the choice of loss and the potentially infinite-dimensional RKHS, approximations of the ℓp-norm loss, as well as a novel twist of the criterion of approximate linear dependency, are devised to keep the computational-complexity footprint of the proposed algorithm bounded over time. Numerical tests on datasets showcase the robust behavior of the advocated framework against different types of outliers, under a low computational load, while satisfying at the same time the affine constraints, in contrast to state-of-the-art methods, which are constraint agnostic.

Recent KAF efforts revolve around cleansing data from outliers to address their deteriorating effects in a wide variety of learning tasks [8], where outliers are defined as (sparsely appearing) contaminating data that do not adhere to a nominal data-generation model, and are often modeled as random variables (RVs) with non-Gaussian, heavy-tailed distributions, e.g., α-stable ones [9,10]. Refraining from using the quadratic error loss, which is notoriously sensitive to non-Gaussian outliers [10,11], robust KAF approaches include methods which are based mainly on (i) the ℓp-norm error loss [12,13], motivated by its beneficial role [14,15] in classical adaptive filtering [16-18]; and (ii) the correntropic loss [19-21], based on the concept of correntropy [22]. Alternative loss functions, designed in a similar way to that of the correntropic loss, can also be found in [23,24]. Among the previous methods, approaches that employ KRLS-type iterations can be found in [12,20,21].

Figure 1: With the data (x_n, y_n) becoming available to the user at time n, the goal is to devise an online non-parametric algorithm that learns the unknown non-linear and multi-output system via a reproducing kernel Hilbert space ℋ, where the M×1 vector y_n may also carry noise and outliers. Among all data, (x_{n_k}, y_{n_k})_{k=1}^{K} are "clean," with y_{n_k} being devoid of outliers and noise. Learning the system while rejecting noise and outliers is achieved via the minimization of an error-loss function [see (1)] subject to constraints formed by the faithful data (x_{n_k}, y_{n_k})_{k=1}^{K}.
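As a concrete, deliberately simplified illustration of this setting (not the paper's exact data model), the following Julia sketch generates multi-output data whose outputs carry Gaussian noise and sparse, large-amplitude spikes standing in for heavy-tailed (e.g., α-stable) outliers, while a single pair is kept clean to serve as side information; the system g, the dimensions, and the contamination parameters are all illustrative assumptions.

```julia
using Random

Random.seed!(0)
D, M, N = 3, 2, 500                          # input dim, output dim, sample count (hypothetical)
g(x) = [sin(sum(x)); tanh(x[1] - x[2])]      # unknown non-linear, multi-output system (illustrative)

X = [randn(D) for _ in 1:N]
Y = Vector{Vector{Float64}}(undef, N)
for n in 1:N
    noise   = 0.05 .* randn(M)                          # nominal (Gaussian) noise
    outlier = rand() < 0.1 ? 10 .* randn(M) : zeros(M)  # sparse spikes: stand-in for α-stable outliers
    Y[n] = g(X[n]) .+ noise .+ outlier
end

clean_idx = [1]                              # indices of the faithful ("clean") pairs (K = 1 here)
Y[clean_idx[1]] = g(X[clean_idx[1]])         # clean pair: devoid of noise and outliers
```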
This paper introduces a KRLS-type framework for outlier rejection, with novel contributions relative to state-of-the-art methodologies highlighted as follows. A non-linear, multi-output (a.k.a. multi-target or multi-response) regression task [25] is considered (Figure 1). Rather than adapting the classical RLS iterations [2] into KAF, as in [7,12,20,21], the proposed Algorithm 1 stems from the stochastic-approximation framework [26]. Similarly to [12], a sample-average ℓp-norm error loss is used also here to define the objective function in the optimization problem of the learning task. Nevertheless, apart from the outlier-contaminated data and in contrast to [12,20,21], the present framework allows also the use of faithful data, devoid of noise and outliers, as side information. Compliance with the faithful data is achieved via an affine constraint set, defined by the set of minimizers of a quadratic loss on the clean data. The set of minimizers of a quadratic loss offers flexibility in the way that the constraint set is incorporated in the proposed Algorithm 1 (Section 3.2). Affinely constrained RLS schemes have already appeared in adaptive filtering [27,28], but it seems that this is the first time that affine constraints are considered in KRLS-type methods to accommodate side information. The adoption of the ℓp-norm error loss and the potential infinite dimensionality of the RKHS yield computational bottlenecks. To surmount those bottlenecks, approximations of the ℓp-norm error loss are proposed, and, in contrast to [12,20], where all of the incoming data are incorporated in computations and the computational complexity grows unbounded with time, this study proposes also a novel twist of the approximate-linear-dependency criterion [7] that keeps the computational-complexity footprint of Algorithm 1 bounded over time (Section 3.1). Due to space limitations, the proofs of the subsequent propositions, results on the convergence of the estimates of Algorithm 1 to a solution of the ensemble-average version of the optimization problem, formulae for the recursive updates in Algorithm 1, and more numerical tests than the ones presented here will be included in the journal version of the paper.

THE LEARNING TASK, COMPUTATIONAL BOTTLENECKS, AND APPROXIMATIONS
Available to the user are the real-valued input-output data (x_n, y_n), n ∈ ℤ_{>0}, where n denotes discrete time, the D×1 vector x_n stands for the input of an unknown non-linear system (see Figure 1), and the M×1 vector y_n for its (possibly noise- and outlier-contaminated) output. Learning the unknown non-linear system is viewed here as a non-parametric regression problem [5], where a vector-valued function f ≔ (f^(1), …, f^(M)) is sought such that (s.t.) y_n^(m) ≈ f^(m)(x_n), and where each f^(m)(⋅), ∀m ∈ {1, …, M}, is taken from a user-defined RKHS ℋ with inner product ⟨⋅ | ⋅⟩_ℋ, norm ‖⋅‖_ℋ, and reproducing kernel κ(⋅, ⋅): ℝ^D × ℝ^D → ℝ [3]. Space ℋ may be infinite dimensional, e.g., when κ is a Gaussian kernel [4]. Hence, it is desirable that the learning algorithm operates on a computational budget to address the computational-complexity implications which may be inflicted by the potentially infinite dimensionality of ℋ. Motivated by the sample average of quadratic error losses in RLS [2], the hierarchical-optimization problem (1) which underpins the learning task minimizes, subject to the constraints formed by the faithful data, a λ-weighted sample average of |⋅|^p error losses plus a regularizer; the |⋅|^p loss, with 1 ≤ p ≤ 2, is adopted here to combat noise and outliers, since values of p < 2 give less importance to large errors y_n^(m) − f^(m)(x_n) than the classical case of p = 2. To promote compact notations in the following discussion, let φ_ν ≔ κ(x_ν, ⋅) ∈ ℋ. The regularizer (ρ_ℋ/2)‖⋅‖²_ℋ, with ρ_ℋ ≥ 0, is a classical way to avoid overfitting and to constrain f^(m) into the linear subspace V_n ≔ span Φ_n of ℋ, spanned by the columns of the dim ℋ × n "matrix" (better, linear operator) Φ_n ≔ [φ_1, …, φ_n] [29]. Moreover, λ ∈ (0, 1) is a forgetting factor that is often used in adaptive filtering [2] to penalize the errors of the currently and recently received data more heavily than those of the remote past, and Γ_n ≔ ∑_{ν=1}^{n} λ^(n−ν).
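For concreteness, one plausible instantiation of such a hierarchical problem, consistent with the description above and offered only as an indicative sketch (the exact weighting and constraint parametrization are those of (1)), reads for every output index m:

\[
\min_{f^{(m)} \in \mathcal{H}} \ \frac{1}{\Gamma_n} \sum_{\nu=1}^{n} \lambda^{\,n-\nu} \bigl| y_\nu^{(m)} - f^{(m)}(x_\nu) \bigr|^{p} + \frac{\rho_{\mathcal{H}}}{2} \bigl\| f^{(m)} \bigr\|_{\mathcal{H}}^{2}
\quad \text{s.t.} \quad f^{(m)} \in \operatorname*{arg\,min}_{h \in \mathcal{H}} \ \sum_{k=1}^{K} \bigl| y_{n_k}^{(m)} - h(x_{n_k}) \bigr|^{2} .
\]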
Since f^(m) ∈ ℋ, the reproducing property [3] yields f^(m)(x_ν) = ⟨f^(m) | φ_ν⟩_ℋ, and (1) is thus equivalent to problems of the form (2), where the superscript (m) is omitted to avoid clutter in notations. With (x_n, y_n) becoming available to the user at every time instance n, the dimension dim V_n of the linear subspace V_n may become unbounded as time advances due to the potential infinite dimensionality of ℋ. Proposition 1 below demonstrates that this "curse of dimensionality" inflicts major burdens upon (2), as in the computation of the popular proximal mapping of ℒ [30], whose characterization in (3) involves the dim V_n × dim V_n "kernel matrix" K_n ≔ Φ_n^⊺ Φ_n, matrices that express the involved elements of ℋ in terms of the columns of Φ_n, and a diagonal weight matrix. It is evident from (3) that the computation of the dim V_n × dim V_n kernel matrix poses serious problems in terms of complexity and memory requirements, since dim V_n may become unbounded as time advances. Moreover, since Prox_ℒ(h) appears on both sides of (3), solving (3) may become an arduous large-scale computational task.
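To make the bottleneck tangible, the following Julia sketch forms the kernel matrix K_n = Φ_n^⊺ Φ_n for a Gaussian kernel (kernel width and data are illustrative): its size grows as n × n with the number of observed inputs, which is precisely what a budgeted dictionary is meant to avoid.

```julia
using LinearAlgebra

# Gaussian (RBF) kernel; by the reproducing property, ⟨φ_i | φ_j⟩_ℋ = κ(x_i, x_j).
κ(x, x′; σ² = 0.1) = exp(-norm(x - x′)^2 / (2σ²))

# Kernel (Gram) matrix K_n = Φ_nᵀ Φ_n for the inputs observed up to time n.
gram(X; σ² = 0.1) = [κ(X[i], X[j]; σ² = σ²) for i in eachindex(X), j in eachindex(X)]

X = [randn(3) for _ in 1:1000]
K = gram(X)          # 1000 × 1000, and growing with every new sample:
@show size(K)        # memory and per-step cost scale at least quadratically in n
```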
To surmount the bottleneck of solving (3), this work uses a linear subspace Ṽ_n ⊂ V_n to serve as an approximation of V_n, under the requirement that the dimension of Ṽ_n stays bounded over time: dim Ṽ_n ≤ B, ∀n, for a user-defined buffer length B ∈ ℤ_{>0}. To this end, the vector φ̃(x_n) ∈ Ṽ_n is introduced to serve as an approximation of φ_n (the precise definition is given in Section 3.1). Let also the dim ℋ × n matrix Φ̃_n ≔ [φ̃(x_1), …, φ̃(x_n)]. The following proposition introduces two approximations of ℒ: the weighted quadratic one in (4a), and the one in (4b), which capitalizes on the first-order information of (4a).
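Although the precise surrogates are those in (4a) and (4b), one standard way to obtain a weighted-quadratic surrogate of a |⋅|^p loss with 1 ≤ p ≤ 2, given here only as an indicative construction that may or may not coincide with (4a), is the reweighted majorization around a current (nonzero) error value e_0:

\[
|e|^{p} \;\le\; \frac{p}{2}\, |e_0|^{\,p-2}\, e^{2} \;+\; \Bigl(1 - \frac{p}{2}\Bigr) |e_0|^{\,p},
\]

with equality at e = ±e_0; a first-order variant is then obtained by retaining only the gradient information of this quadratic at the current estimate, in the spirit of (4b).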

THE ALGORITHM
Algorithm 1 stems from [26]. It is also equipped with the novel module of Section 3.1 to ensure that its computational complexity stays bounded over time in the context of the potentially infinite dimensional ℋ.

Approximate Linear Dependency on a Budget
To define the subspace Ṽ_n, the approximate-linear-dependency (ALD) criterion of [7] is considered, but with a novel twist which ensures the boundedness of the dimension of Ṽ_n: dim Ṽ_n ≤ B, ∀n. This condition is satisfied by monitoring the amplitudes of the coefficients of the estimates f_n. This feature was not available in the original ALD [7], where the user cannot fix upper bounds on dim Ṽ_n. The definition of Ṽ_n is provided next via induction. To this end, it is assumed that knowledge of Ṽ_{n−1} is available to the user at time n.
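The following Julia sketch conveys the flavor of an ALD test combined with a hard budget. The eviction rule, dropping the dictionary atom whose current expansion coefficient has the smallest amplitude, is one natural reading of "monitoring the amplitudes of the coefficients" and not necessarily the paper's exact rule; the ALD threshold δ, the budget B, and all other details are illustrative assumptions.

```julia
using LinearAlgebra

# Budgeted ALD sketch: decide whether κ(x_new, ⋅) is approximately linearly dependent
# on the span of the current dictionary; if not, admit it, evicting the non-protected
# atom with the smallest coefficient amplitude whenever the budget B would be exceeded.
# Protected atoms (the clean-data ones) are assumed to occupy the first positions and
# to number fewer than B, consistent with the initialization Ṽ_0 = span(Φ_c) of Section 3.2.
function ald_update!(dict, coeffs, x_new, κ; δ = 1e-2, B = 50, protected = Int[])
    if isempty(dict)
        push!(dict, x_new); push!(coeffs, 0.0); return dict, coeffs
    end
    Kd = [κ(u, v) for u in dict, v in dict]     # dictionary kernel matrix
    kd = [κ(u, x_new) for u in dict]            # cross-kernel vector
    a  = Kd \ kd                                # best approximation coefficients in span(dict)
    err = κ(x_new, x_new) - dot(kd, a)          # ALD residual of κ(x_new, ⋅)
    if err > δ                                  # poorly represented: admit x_new
        if length(dict) ≥ B                     # budget reached: evict one atom
            removable = setdiff(eachindex(coeffs), protected)
            j = removable[argmin(abs.(coeffs[removable]))]
            deleteat!(dict, j); deleteat!(coeffs, j)
        end
        push!(dict, x_new); push!(coeffs, 0.0)  # new atom starts with a zero coefficient
    end
    return dict, coeffs
end
```

In an online run, coeffs would hold the current expansion coefficients of the estimate f_n over the dictionary, updated by the learning recursion between successive calls to ald_update!.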
In the present context, the "clean" input data are gathered in Φ_c ≔ [φ_{n_1}, …, φ_{n_K}], and the affine constraint set, being the set of minimizers of a quadratic loss on the clean data, can be written as the fixed-point set of affine nonexpansive mappings. Consequently, by virtue of the flexibility that affine nonexpansive mappings offer, Proposition 3 below shows several ways via which the constraint set C^(m) can be accommodated.
Then, any of the following mappings T^(m) is affine nonexpansive with Fix T^(m) = C^(m), and can thus be used in Algorithm 1.
In Algorithm 1, T_α^(m) ≔ α T^(m) + (1 − α) Id, for α ∈ [0.5, 1). It is also assumed that span Φ_c ⊂ Ṽ_n, ∀n. Since the time instances (n_k)_{k=1}^{K}, at which the clean data are observed, are known to the user, the previous condition can be guaranteed, without any loss of generality, by initializing Ṽ_0 ≔ span(Φ̃_0 ≔ Φ_c) (Line 2 in Algorithm 1), and by refraining from removing any of the columns of Φ_c via the column-removal index of Section 3.1.
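A minimal finite-dimensional sketch of such a constraint-enforcing step, after the estimate has been represented by its coefficient vector θ over the budgeted dictionary, is given below in Julia. It assumes that T is taken to be the metric projection onto the set of minimizers of the quadratic clean-data loss ‖Aθ − b‖², with A collecting kernel evaluations at the clean inputs; this is only one of the choices that Proposition 3 allows, and not necessarily the variant (Proposition 3(ii)) used in the reported tests.

```julia
using LinearAlgebra

# Projection onto the affine set argmin_θ ‖Aθ - b‖², i.e., {θ : AᵀAθ = Aᵀb}.
project_minimizers(θ, A, b) = θ - pinv(A) * (A * θ - b)

# Relaxed mapping T_α = α T + (1 - α) Id; it is affine nonexpansive and keeps Fix T_α = Fix T.
Tα(θ, A, b; α = 0.7) = α .* project_minimizers(θ, A, b) .+ (1 - α) .* θ

# One constraint-enforcing step, interleaved with the (unconstrained) learning update:
A = randn(1, 5); b = randn(1)    # K = 1 clean datum, 5 dictionary atoms (illustrative sizes)
θ = randn(5)
θ = Tα(θ, A, b)                  # θ is pulled toward the clean-data-consistent set
@show norm(A * θ - b)
```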

NUMERICAL TESTS
Algorithm 1 is compared with KRLS [7], the state-of-the-art KRLP [12] and KRMMC [20], as well as [32-34], whose kernel versions with the criterion of Section 3.1 were crafted specifically for these tests. In [32-34], outlier vectors become parameters to be estimated. The software code was written in Julia [35], while α-stable noise was generated by [36]. The number of clean data is set equal to 1, and the version of Proposition 3(ii) is chosen for T^(m) in Algorithm 1. Performance is measured by the root mean square error (RMSE) on N_t = 500 noise- and outlier-free test data (x_{t,j}, y_{t,j})_{j=1}^{N_t}, as well as by the computational time per iteration.

Figure 2a shows that Algorithm 1 with (5a) achieves performance identical to that of KRLP [12] in a much smaller time per iteration. By virtue of Section 3.1, the computational time per iteration stays fixed for Algorithm 1, in contrast to KRLP and KRMMC, where time grows unbounded as time advances. Using only the first-order information in (5b) does not seem to be an effective strategy in the case of large-variance outliers.
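For reference, the RMSE metric above can be computed as in the following short sketch; averaging over both test samples and output entries is an assumption on the normalization convention.

```julia
# RMSE of a learned map f_hat over noise- and outlier-free test pairs (X_t, Y_t).
rmse(f_hat, X_t, Y_t) =
    sqrt(sum(sum(abs2, f_hat(x) .- y) for (x, y) in zip(X_t, Y_t)) /
         (length(X_t) * length(first(Y_t))))
```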
Data [37] refer to an inverse dynamics problem for a 7-degrees-of-freedom SARCOS anthropomorphic robot arm. The task is to map (D = 21)-dimensional vectors (7 positions, 7 velocities, and 7 accelerations) to their corresponding M = 7 joint torques. Outliers are synthetically inserted into the data by adding to each output entry values of an RV with an α-stable distribution [9] and the following parameters: index/stability: 1.5, skewness: 0.5, location: 0, and scale: 10⁻² [36]. A Gaussian kernel with variance 0.1 was used, and B = 150. Similarly to Figure 2, Figure 3 shows that the RMSE performance of Algorithm 1 with (5a) reaches and even exceeds that of KRLP, with smaller computational-complexity and memory footprints.
To conclude, KRMMC [20] performed poorly in cases of large-variance outliers, but performed similarly to KRLP and to Algorithm 1 with (5a) under small/moderate-variance outliers (due to space limitations, these tests will be detailed elsewhere). Extensive tests have shown that Algorithm 1 with (5b) achieved competitive performance in cases of small-variance outliers, but performed poorly otherwise. In all tests, regardless of the variance of the outliers, Algorithm 1 with (5a) achieved the best RMSE performance among all competing methods, matching, and in some cases even exceeding, the RMSE performance of KRLP [12], with a computational complexity that stays bounded over time, while complying with the side information, in contrast to the constraint-agnostic methods [7,12,20,32-34].