Minimax estimation in multi-task regression under low-rank structures

This study investigates the minimax optimality of a multi-task nonparametric regression problem. We formulate a simultaneous function estimation problem based on information pooling across multiple experiments under a low-dimensional structure. A nonparametric reduced rank regression estimator based on the nuclear norm penalisation scheme is proposed to incorporate the low-dimensional structure in the estimation process. The rank of a set of functions is defined in terms of their Fourier coefficients to formally characterise the dependence structure among functions. Minimax upper and lower bounds are established under various asymptotic scenarios to examine the role of the low-rank structure in determining optimal rates of convergence. The results confirm that exploiting the low-rank structure can significantly improve the convergence rate for the simultaneous estimation of multiple functions. The results also imply that the proposed estimator is rate optimal in the minimax sense for the rank-constrained Sobolev class of vector-valued functions.


Introduction
Multi-task learning deals with a problem in which multiple estimation tasks are solved simultaneously while exploiting similarities across the tasks; see, for example, Thrun (1996) and Caruana (1997). By leveraging data from multiple domains, it can improve estimation efficiency when compared to separate learning methods. A successful multi-task learning method involves an appropriate definition of the similarity across tasks and an efficient way to exploit such similarity structure.
The main idea of multi-task learning can be applied to the function estimation problem in many applications. If we know a priori that multiple functions or experiments share some common structure, we can exploit that structure to improve the prediction accuracy of estimation methods. Examples of such structures include common sparsity structures, low-rank structures, geometric structures, common smoothness structures, and so forth. Several studies have been conducted on information pooling across multiple experiments in function estimation problems. Yuan et al. (2007) studied the reduced rank regression estimator using a general Ky Fan norm penalisation approach. This approach proves useful when strong dependence structures among experiments are present. Lounici et al. (2011) examined a parametric multi-task regression problem in which the underlying functions share joint sparsity structures. Wang et al. (2016) provided a stability analysis for multi-task learning methods. They also established $L_2$ risk upper bounds for the transfer learning problem, in which a densely sampled function is used to improve the recovery of a sparsely sampled function, provided that the two functions are related via a smooth transfer function. Liu et al. (2008) considered nonparametric regression and classification methods that enforce joint sparsity patterns across component functions in an additive model. In a similar additive model structure, Foygel et al. (2012) considered a reduced rank regression estimator for application to gene expression data and biochemical data, using the fact that the component functions are highly correlated. Koudstaal and Yao (2018) related the multiple Gaussian sequence model to the standard observation scheme in functional data analysis. They showed that a Stein estimator based on the information pooling process could improve estimation significantly because the functional observations possess inherent low-dimensional structures.
In this study, we examine a multi-task nonparametric regression problem in which the underlying functions possess a low-rank structure. To incorporate the low-dimensional structure in the recovery of multiple functions, we propose a nonparametric function estimation method based on the nuclear norm penalisation (NNP) approach. This leads to a nonparametric version of the reduced rank regression estimator under the multi-task learning framework. The proposed method can be used in various practical applications where multiple functions of interest share similar structures and behaviours. The majority of studies on multi-task learning focus on low-dimensional structures based on joint sparsity. However, there are many examples in which a similarity structure is present among functions without notable sparsity patterns. As an illustrative example, we apply the proposed method to the Canadian weather data (Ramsay and Silverman 2005). The data consist of the daily temperature at weather stations across Canada. Despite the regional differences, the observations are highly correlated in a function space due to shared structure, such as the seasonal trend. The analysis shows that the proposed method identifies this low-dimensional structure and improves estimation. In addition, we provide numerical studies to illustrate the efficiency of the proposed method. The results show that information pooling across multiple experiments based on the low-rank structure can significantly outperform the separate estimation method. Regarding the theoretical aspect, we first obtain a non-asymptotic oracle-type inequality to study the properties of the reduced rank estimator. Using the inequality, we prove the minimax optimality and rank identification consistency of the proposed method.
Our main contribution lies in the development of the minimax theory for the multi-task regression problem. Assuming a dependence structure across underlying functions, one can expect that information pooling across the experiments significantly improves estimation efficiency. A question that arises is how much we can improve estimation. To answer this question, we provide a minimax analysis for the related problems, thereby identifying the contribution of the information pooling process. A concept of rank among functions is defined using Fourier calculus to characterise the dependence structure. With this definition, we introduce a rank-constrained Sobolev class of vector-valued functions to analyse the minimax optimality. Using the oracle inequality, we obtain minimax upper bounds for the risk of the proposed estimator when the underlying function belongs to the rank-constrained Sobolev class of functions. For a complete understanding of the dimensionality reduction effect due to the low-rank structure, we obtain the corresponding minimax lower bounds in various asymptotic scenarios. The results indicate that exploiting the low-rank structure can significantly improve the convergence rate of the recovery of multiple functions when compared with the univariate function estimation case, and that the proposed estimator is rate optimal in the minimax sense under mild conditions. In addition, we provide conditions and a non-asymptotic probability bound for the rank identification problem.
The remainder of this paper is organised as follows. Section 2.1 introduces the nonparametric reduced rank regression estimator of the underlying functions. We obtain a closed-form estimator by adopting an orthonormal basis, as described in Section 2.2. Theoretical results are collected in Section 3. Section 3.1 defines the rank of vector-valued functions in terms of their Fourier coefficients. We present the oracle inequality in Section 3.2 and the minimax theory in Section 3.3. Section 3.4 details the rank identification problem. Simulation studies are conducted in Section 4.1 to complement the theoretical results. An analysis of the Canadian weather data based on the proposed method is presented in Section 4.2. Section 5 provides a summary and discussion of the findings of the study. The proofs of the theoretical results are deferred to the appendix and supplementary material.

Model and estimator
We consider the standard observation scheme of multi-task regression, $y_{ij} = f_i(t_{ij}) + z_{ij}$ for $i = 1, \dots, m$ and $j = 1, \dots, n$, where the $f_i$ are unknown fixed functions and the $z_{ij}$ are random samples from the $N(0, \sigma^2)$ distribution. We denote $y = [y_{ij}], z = [z_{ij}] \in \mathbb{R}^{m \times n}$. In what follows, we assume that $t_{ij} = t_j$, $i = 1, \dots, m$, for simplicity. The objective is to estimate the vector-valued function $f = (f_1, \dots, f_m) : I \to \mathbb{R}^m$.
Each component function $f_i$ is modelled by a set of basis functions $\{\phi_k\}_{k=1}^{p}$ as $f_{\eta_i} = \sum_{k=1}^{p} \eta_{ik} \phi_k$, where $\eta_i$ denotes the $i$th row of $\eta \in \mathbb{R}^{m \times p}$. To encourage proximity to each component function, we consider the empirical risk $\frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \{y_{ij} - f_{\eta_i}(t_j)\}^2$. We adopt the NNP approach to incorporate low-rank information in the estimation process. The nuclear norm of $\eta$ is defined as $|\eta|_* = \sum_{s=1}^{m \wedge p} d_s(\eta)$, where $\wedge$ denotes the minimum and $d_s(\eta)$ is the $s$th largest singular value of $\eta$. The nuclear norm penalty induces sparsity in the singular values, resulting in a rank reduction effect. The objective function to minimise is $\ell_\lambda(\eta) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \{y_{ij} - f_{\eta_i}(t_j)\}^2 + \lambda |\eta|_*$, where $\lambda > 0$ is a complexity parameter. Define $\hat{\theta}_\lambda = \operatorname{argmin}_{\eta} \ell_\lambda(\eta)$. In what follows, we suppress the dependency on $\lambda$ to keep the notation uncluttered when it is clear from the context. The estimator of $f$ is given by $\hat{f} = f_{\hat{\theta}}$.

Explicit form of the estimator
For a vector-valued function $h = (h_1, \dots, h_\nu)$ defined on $I$, we denote by its capital letter the matrix of the function values evaluated at the design points: $H = [h_l(t_j)] \in \mathbb{R}^{\nu \times n}$. With this notation, we have $\ell_\lambda(\eta) = \frac{1}{mn} |y - F_\eta|^2 + \lambda |\eta|_*$, where $|\cdot|$ denotes the Frobenius norm of a matrix. In general, the optimisation problem for $\ell_\lambda$ does not have a closed-form solution (Yuan et al. 2007). To resolve this issue, we assume that the $\phi_k$ are orthonormal with respect to the empirical $L_2$ inner product so that $\frac{1}{n} \Phi \Phi^\top = I_p$, where $\Phi = [\phi_k(t_j)] \in \mathbb{R}^{p \times n}$ and $I_p$ is the identity matrix of size $p$. We then have $|F_\eta|_* = \sqrt{n}\, |\eta|_*$. Thus, the optimisation problem reduces to the NNP problem under the orthonormal design assumption, which admits a closed-form solution.
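The orthonormal design assumption can be checked concretely: for the trigonometric basis evaluated at equally spaced design points, the empirical Gram matrix is exactly the identity. A minimal sketch in Python (the particular basis ordering is a common convention, an assumption rather than the paper's):

```python
import numpy as np

def trig_basis(t, p):
    """Trigonometric basis 1, sqrt(2)cos(2*pi*k*t), sqrt(2)sin(2*pi*k*t), ...
    evaluated at the points t; returns a (p, n) matrix."""
    cols = [np.ones_like(t)]
    k = 1
    while len(cols) < p:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * t))
        if len(cols) < p:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * t))
        k += 1
    return np.stack(cols)

n, p = 100, 15
t = np.arange(n) / n            # equally spaced design points in [0, 1)
Phi = trig_basis(t, p)
gram = Phi @ Phi.T / n          # empirical L2 Gram matrix
# Discrete orthogonality of the trigonometric system gives exactly I_p
assert np.allclose(gram, np.eye(p), atol=1e-8)
```

The exact identity relies on the frequencies being small relative to $n$; for $p = 15$ and $n = 100$ this is comfortably satisfied.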
The projection estimator $\tilde{\theta} = \frac{1}{n} y \Phi^\top$ serves as a pilot estimator of $\theta^*$. For $a \in \mathbb{R}^{m \times p}$, the singular value decomposition is written as $a = \sum_{s=1}^{m \wedge p} d_s(a)\, u_s(a) v_s(a)^\top$, where $u_s(a)$ and $v_s(a)$ denote the left and right singular vectors of $a$ corresponding to $d_s(a)$, respectively. Arguments from Lemma 1 of Yuan et al. (2007) provide a solution to the aforementioned optimisation problem with minor modification. In Section 2 of the supplementary material, we present an alternative proof of Theorem 2.1 using subdifferential calculus, which can be useful in various matrix optimisation problems.

Theorem 2.1: A global minimiser $\hat{\theta}$ of $\ell_\lambda$ is given by $\hat{\theta} = \sum_{s=1}^{m \wedge p} \left( d_s(\tilde{\theta}) - \frac{m\lambda}{2} \right)_+ u_s(\tilde{\theta})\, v_s(\tilde{\theta})^\top$.
If, in addition, all the positive singular values of $\hat{\theta}$ are distinct, then $\hat{\theta}$ is the unique minimiser.
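Computationally, the closed form amounts to soft-thresholding the singular values of the projection coefficients. A minimal sketch, assuming the orthonormal design; the mapping between the threshold `lam` and the complexity parameter $\lambda$ follows the normalisation of Theorem 2.1 and is left implicit here:

```python
import numpy as np

def reduced_rank_estimate(theta_proj, lam):
    """Singular value soft-thresholding of the projection coefficients.

    theta_proj : (m, p) matrix of projection (least squares) coefficients.
    lam        : threshold level; its exact relation to the complexity
                 parameter depends on the paper's normalisation.
    """
    u, d, vt = np.linalg.svd(theta_proj, full_matrices=False)
    d_shrunk = np.maximum(d - lam, 0.0)        # shrink and truncate
    theta_hat = (u * d_shrunk) @ vt            # rebuild the matrix
    return theta_hat, int(np.sum(d_shrunk > 0))

# A rank 1 coefficient matrix is shrunk but its rank is preserved.
theta = np.outer([1.0, 2.0], [1.0, 0.0, 0.0])  # singular values: sqrt(5), 0
theta_hat, rank_hat = reduced_rank_estimate(theta, lam=1.0)
```

Soft-thresholding shrinks every retained singular value by `lam`, so the fitted matrix here equals $(1 - 1/\sqrt{5})\,\theta$ and has rank 1.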

Rank of vector-valued functions
To investigate the dimension reduction effect of low-rank structures, we first define the rank of a vector-valued function. To this end, we define the rank of $f$ as the dimension of the space spanned by $f_1, \dots, f_m$. That is, the rank of $f$ is $r$ if there exists a set of linearly independent functions $g_1, \dots, g_r$ such that $\mathrm{span}\{f_1, \dots, f_m\} = \mathrm{span}\{g_1, \dots, g_r\}$.
Without loss of generality, we may assume that $g_1, \dots, g_r$ are orthonormal functions. This leads to the following definition of rank.

Definition 3.1:
The rank of $f$ is $r$ in a strong sense if there exist a rank $r$ matrix $\alpha = [\alpha_{il}] \in \mathbb{R}^{m \times r}$ and a set of orthonormal functions $g_1, \dots, g_r$ such that $f_i = \sum_{l=1}^{r} \alpha_{il} g_l$ for $i = 1, \dots, m$. It may, however, be too restrictive to impose this pointwise structure when we consider dependency among functions defined over $I$. To relax the strict condition, we propose an alternative definition in terms of the Fourier coefficients.
For a positive integer $\nu$, let $\ell_2^{\nu}$ denote the $\nu$-fold product of the space of square-summable sequences; see Section 1 of the supplementary material for details on the $\ell_2^{\nu}$ space. We assume that each $f_i$ admits a series expansion in the $L_2$ sense in terms of an orthonormal basis $\{\phi_k\}_{k \geq 1}$, that is, $f_i = \sum_{k=1}^{\infty} \theta_{ik} \phi_k$. Let $\delta$ denote the Kronecker delta function.

Definition 3.2:
The rank of $f$ is $r$ in a weak sense if there exist a rank $r$ matrix $\alpha = [\alpha_{il}] \in \mathbb{R}^{m \times r}$ and coefficients $\beta = [\beta_{lk}]$ with orthonormal rows, $\sum_{k=1}^{\infty} \beta_{lk} \beta_{l'k} = \delta_{ll'}$, such that $\theta_{ik} = \sum_{l=1}^{r} \alpha_{il} \beta_{lk}$ for all $i$ and $k$. Then, we write $\theta = \alpha\beta$ and $\mathrm{rank}(f) = r$.
If $\mathrm{rank}(f) = r$, then each $f_i$ can be represented as a linear combination of the orthonormal functions $g_1, \dots, g_r$ in the $L_2$ sense, where $g_l = \sum_{k=1}^{\infty} \beta_{lk} \phi_k$. Indeed, we have $f_i = \sum_{k=1}^{\infty} \theta_{ik} \phi_k = \sum_{k=1}^{\infty} \sum_{l=1}^{r} \alpha_{il} \beta_{lk} \phi_k = \sum_{l=1}^{r} \alpha_{il} g_l$.
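Numerically, the weak rank is simply the matrix rank of the coefficient array $\theta = \alpha\beta$ (truncated to $p$ basis functions). A small illustration with hypothetical $\alpha$ and $\beta$, where $\beta$ is given orthonormal rows as in Definition 3.2:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, r = 30, 15, 3
alpha = rng.normal(size=(m, r))              # rank r loading matrix
q, _ = np.linalg.qr(rng.normal(size=(p, r)))
beta = q.T                                   # (r, p) with orthonormal rows
theta = alpha @ beta                         # truncated Fourier coefficients of f

# rank(f) in the weak sense equals the rank of the coefficient matrix
d = np.linalg.svd(theta, compute_uv=False)
rank_f = int(np.sum(d > 1e-8 * d[0]))
```

Here `rank_f` recovers $r = 3$: only three singular values of $\theta$ are nonzero up to numerical tolerance.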

Oracle inequality
To define the estimator, the complexity parameter must be determined. We consider the following choice, where $\psi > \sigma$ is a user-specified constant.
A goodness-of-fit of an estimator is measured by an average $L_2$ norm. For a vector-valued function $h = (h_1, \dots, h_m)$, define $\|h\|^2 = \frac{1}{m} \sum_{i=1}^{m} \|h_i\|_n^2$, where $\|h_i\|_n^2 = \frac{1}{n} \sum_{j=1}^{n} h_i(t_j)^2$ is the empirical $L_2$ norm of the component function. Theorem 3.1 states a non-asymptotic oracle-type inequality that provides a risk upper bound in the average $L_2$ norm. Since the closed-form representation is not needed to establish an oracle inequality, we relax the orthonormality assumption and assume the following instead.
Remark 3.1: Theorem 3.1 illustrates that information pooling across multiple experiments based on the low-rank structure can improve estimation as measured by the average $L_2$ risk. We will see that the $\mathrm{rank}(\hat{\theta} - \theta^*)$ term can be bounded by $2\,\mathrm{rank}(\theta^*)$ with high probability when we consider the $L_2$ approximation setup. If each component function is separately estimated by the projection estimator, the estimation error is of order $p/n$, which is of larger order than the corresponding term in Theorem 3.1 because $\mathrm{rank}(\theta^*) \leq m \wedge p$.

Minimax analysis
We consider a function class $\mathcal{F}_{m,\gamma,\mu}(Q)$, where $r = \mathrm{rank}(f)$, $\gamma > 0$ is a smoothness parameter, and $0 < \mu < 1$ is a rank-constraint parameter. That is, the class $\mathcal{F}_{m,\gamma,\mu}(Q)$ consists of smooth functions of rank $r$ in the weak sense, where the smoothness is characterised by Sobolev-type ellipsoids.
Here and throughout, let $M_1, M_2, \dots$ be positive constants independent of $n$, $m$, $p$, and $r$. We assume the following to establish a risk upper bound.
Once we specify the basis that satisfies the desired approximation order, (B2) can be directly verified with additional design assumptions. One such approach is detailed in the proof of Theorem 3.5. We leave (O1) and (B2) as assumptions to make the results independent of the choice of basis.
Theorem 3.2 provides a risk upper bound for the proposed estimator when the underlying function belongs to $\mathcal{F}_{m,\gamma,\mu}(Q)$.
For two positive real sequences $\{a_n\}$ and $\{b_n\}$, $a_n \asymp b_n$ will mean that $a_n/b_n$ is bounded away from zero and infinity as $n \to \infty$. Theorem 3.3 summarises the upper bounds for the risk when $m$ grows with $n$ at various rates.
Remark 3.2: As $0 < \mu < 1$, for $0 < \zeta < \frac{1}{\mu(2\gamma+1)}$, the convergence rate is faster than the standard univariate convergence rate. In particular, we have a convergence rate of order $n^{-\frac{2\gamma}{2\gamma+\mu}}$ at the critical point $\zeta = \frac{1}{2\gamma+\mu}$, in which case we directly observe that $\mu$ plays the role of the effective dimension in the estimation problem.
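As a quick sanity check, using the bound of order $n^{-(1+\zeta(1-\mu))\frac{2\gamma}{2\gamma+1}}$ that appears in the proof of Theorem 3.3, the exponent at the critical point $\zeta = \frac{1}{2\gamma+\mu}$ simplifies as follows:

```latex
% Rate exponent at \zeta = 1/(2\gamma+\mu):
\left(1 + \frac{1-\mu}{2\gamma+\mu}\right) \frac{2\gamma}{2\gamma+1}
  = \frac{(2\gamma+\mu) + (1-\mu)}{2\gamma+\mu} \cdot \frac{2\gamma}{2\gamma+1}
  = \frac{2\gamma+1}{2\gamma+\mu} \cdot \frac{2\gamma}{2\gamma+1}
  = \frac{2\gamma}{2\gamma+\mu},
```

which recovers the stated rate $n^{-2\gamma/(2\gamma+\mu)}$.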

Remark 3.3:
If ζ = 0 so that m is fixed, the upper bound of the estimation error is of order p/n as n → ∞ in Theorem 3.2, leading to the univariate convergence rate as expected.

Remark 3.4:
The convergence rate coincides with the univariate convergence rate for $\zeta \geq \frac{1}{\mu(2\gamma+1)}$. This is because a sufficiently large number of observation points is needed to capture the smoothness of each function and the low-rank structure simultaneously.
For a complete understanding of the dimensionality reduction effect due to the low-rank structure, we obtain the corresponding minimax lower bounds in Theorem 3.4 using the Fano method (Khas'minskii 1979; Yu 1997).

Theorem 3.4: Suppose that (O1) holds and assume that $m \asymp n^{\zeta}$. Then
where the infimum is taken over all estimators of $f$.

Remark 3.5: Theorems 3.3 and 3.4 imply that the low-rank structure reduces the difficulty of the recovery of multiple functions for $0 \leq \zeta < \frac{1}{\mu(2\gamma+1)}$. This effect is characterised by the smoothness parameter $\gamma$, the rank-constraint parameter $\mu$, and the growth rate $\zeta$ of the number of experiments. The results also imply that the proposed estimator is rate optimal in the minimax sense under the regularity assumptions.

Rank identification
Suppose that $\mathrm{rank}(f) = r$ so that $\theta = \alpha\beta$. To relate the rank identification problem to perturbation theory, we view $\theta$ as a Hilbert–Schmidt operator from $\ell_2$ into $\mathbb{R}^m$, with the adjoint operator denoted by $\theta^\top$. Let $\rho_s(\cdot)$ denote the $s$th largest eigenvalue of a matrix.
The singular values of $\theta$ are defined by $d_s(\theta) = \sqrt{\rho_s(\theta\theta^\top)} = \sqrt{\rho_s(\alpha\alpha^\top)}$, so that $\mathrm{rank}(f) = r$ implies $d_s(\theta) = 0$ for $s \geq r+1$. Section 1 of the supplementary material formally introduces the operator concepts and summarises the inequalities used in the proofs of the rank identification results.
We make the following additional assumptions to achieve rank identification. The assumption $\gamma > 1/2$ in (R1) is needed to guarantee the pointwise convergence of the series expansion. Assumptions (R3) and (R4) are mild conditions that can be easily verified for most standard orthonormal bases. The regular design assumption (R2) is a simplifying assumption used to calculate the discrepancy between the empirical and theoretical $L_2$ projections.
The following result specifies the proximity between the singular values of θ and θ * .
Remark 3.6: Theorem 3.5 implies that the singular values of $\theta^*$ converge to their counterparts of $\theta$ as $n \to \infty$, provided that the discrepancy term in Theorem 3.5 tends to $0$. The singular values of $\theta^*$, in turn, can be estimated by those of $\hat{\theta}$, thereby leading to rank identification consistency.
Theorem 3.6 provides a nonasymptotic probability bound for the rank identification problem.

Simulation
We conduct simulation studies to illustrate the finite sample performance and properties of the proposed estimator. The prediction performance is compared with that of the projection estimator, which does not take the low-rank structure into account. We consider four example functions defined for $t \in [0, 1]$. In what follows, we consider linear combinations of these functions to construct functional signals with low-rank structures.

Rank 1 signals
We first consider rank 1 functional signals; the underlying signal $g$ is constructed as a linear combination of the example functions. We generate associated discrete observations for $m = 30, 60$ and $n = 100, 200$, where the $t_j$ are equally spaced observation points in $[0, 1]$ and the $z_{ij}$ are generated from the $N(0, 0.3^2)$ distribution. Plots (a) and (b) of Figure 1 illustrate the underlying functional signal $g$ and the associated observations. We adopt the trigonometric basis with $p = 15$ to compute the reduced rank estimator and the projection estimator.
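The generating scheme can be sketched as follows. The paper's actual linear combination for $g$ is not reproduced here, so a placeholder rank 1 signal (hypothetical loadings $a_i$ times a single smooth function) stands in for it:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma = 30, 100, 0.3
t = np.arange(1, n + 1) / n                   # equally spaced points in (0, 1]

# Placeholder rank 1 signal: every f_i is a multiple of one smooth function g.
g = np.sin(2 * np.pi * t)
a = rng.uniform(0.5, 1.5, size=m)             # hypothetical loadings
f = np.outer(a, g)                            # (m, n) signal matrix of rank 1
y = f + rng.normal(scale=sigma, size=(m, n))  # y_ij = f_i(t_j) + z_ij
```

The signal matrix `f` has rank 1 by construction, while the noisy matrix `y` is full rank, which is exactly the situation the reduced rank estimator is designed to handle.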
For the numerical studies, the optimal complexity parameter is chosen by a version of the Bayesian information criterion (BIC; Schwarz 1978). The reduced rank estimator is computed for a sequence of complexity parameters $\{\lambda_1, \dots, \lambda_L\}$ with $L = 100$. The BIC for $\lambda_l$ is defined as $\mathrm{BIC}_l = mn \log\left(|y - F_{\hat{\theta}_{\lambda_l}}|^2 / (mn)\right) + \mathrm{rank}(\hat{\theta}_{\lambda_l})\, p \log(mn)$.
The optimal value of $\lambda$ is chosen as $\lambda_{\mathrm{opt}} = \lambda_{\hat{l}}$, where $\hat{l} = \operatorname{argmin}_{1 \leq l \leq L} \mathrm{BIC}_l$. Through 100 replications, we compute the mean squared error (MSE) and mean absolute deviation (MAD) as discrepancy measures between the underlying functional signals $f = (f_1, \dots, f_m)$ defined over $I$ and each of the estimators. Additionally, we record the average rank (AverRank) over the 100 replications for the proposed reduced rank estimator. Table 1 summarises the simulation results. It is seen that the reduced rank estimator significantly outperforms the projection estimator in all scenarios. In particular, we notice that the performance of the reduced rank estimator improves when $m$ increases, whereas the same cannot be said of the projection estimator. This stems from the fact that the reduced rank estimator identifies the low-rank structure, so that the effective sample size increases in this case. This phenomenon illustrates an advantage of the information pooling of the proposed estimation method, especially for a large set of functional data. Moreover, the proposed method correctly identifies the rank of the signals in every replication of the simulation study. These results verify that the findings of Section 3 are valid in the finite sample case.
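A minimal sketch of the BIC selection step, assuming the orthonormal design so that each reduced rank fit is singular value soft-thresholding of the projection coefficients (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def select_lambda(y, Phi, lambdas):
    """Pick the complexity parameter minimising the BIC of the text.

    y       : (m, n) observation matrix.
    Phi     : (p, n) basis matrix, assumed empirically orthonormal.
    lambdas : candidate threshold levels (stand-ins for lambda_1..lambda_L).
    """
    m, n = y.shape
    p = Phi.shape[0]
    theta_proj = y @ Phi.T / n                 # projection coefficients
    u, d, vt = np.linalg.svd(theta_proj, full_matrices=False)
    best_bic, best_lam = np.inf, None
    for lam in lambdas:
        d_l = np.maximum(d - lam, 0.0)         # soft-threshold singular values
        theta_l = (u * d_l) @ vt
        rank_l = int(np.sum(d_l > 0))
        rss = np.sum((y - theta_l @ Phi) ** 2)
        bic = m * n * np.log(rss / (m * n)) + rank_l * p * np.log(m * n)
        if bic < best_bic:
            best_bic, best_lam = bic, lam
    return best_lam
```

In practice the grid $\{\lambda_1, \dots, \lambda_L\}$ with $L = 100$ from the text would be passed as `lambdas`.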
Plots (c) and (d) of Figure 1, respectively, present a typical example of the reduced rank estimator and the projection estimator for the case in which $m = 30$ and $n = 100$. From Plot (c), we can see that the proposed method effectively exploits the low-rank structure by pooling information across the functional signals. This results in a rank 1 reduced rank estimator that closely approximates the underlying rank 1 signal in Plot (a). On the other hand, Plot (d) shows that the projection estimator is much more susceptible to the random noise.

Rank 3 signals
This section presents a simulation study for rank 3 functional signals. The underlying signals $g_1, g_2, g_3$ are constructed as linear combinations of the example functions. We generate associated discrete observations for $m = 30, 60$ and $n = 100, 200$ as $y_{ij} = g_s(t_j) + z_{ij}$ for $s = 1, 2, 3$, $i = 20(s-1)+1, \dots, 20s$ and $j = 1, \dots, n$, where the $t_j$ are equally spaced observation points in $[0, 1]$ and the $z_{ij}$ are generated from the $N(0, 0.3^2)$ distribution. Plots (a) and (b) of Figure 2 illustrate the underlying functional signals $g_1, g_2, g_3$ and the associated observations.
Through 100 replications, we record the performance measures and summarise the results in Table 2. Although there are some overlaps among the signals due to the random noise, the proposed method identifies the rank of the underlying signals with very few exceptions. The average MSE and MAD illustrate that the reduced rank estimator significantly outperforms the projection estimator, and the proposed method increases the effective sample size as m increases. Plots (c) and (d) of Figure 2, respectively, present a typical example of the reduced rank estimator and projection estimator for the case in which m = 30 and n = 100. As in the rank 1 case, the results illustrate the advantage of the information pooling of the proposed method when compared with the projection estimator.

Analysis of Canadian weather data
We apply the proposed method to the Canadian weather data. The data illustrates the change in temperature over the course of a year in different regions of Canada. It consists of the daily average temperature for $n = 365$ consecutive days at each of $m = 35$ weather stations across Canada; see Figure 3. One may refer to Ramsay and Silverman (2005) for further information about the data. In many applications, the discrete observations need smoothing before being used in a subsequent analysis based on functional signals. A plot of the data in Figure 3 suggests that the weather data possesses low-dimensional structures due to factors such as the seasonal trend. To exploit the inherent structures, we use the proposed reduced rank estimator to recover the underlying functional signals. Since the temperature can be assumed to be periodic, we adopt the trigonometric basis with $p = 15$ to compute the reduced rank estimator. The rank of the reduced rank estimator chosen by the BIC is 2.
We compare the fit of the resulting estimator with those of the projection estimator and the rank 1 estimator. Figure 3 shows the plots of (a) daily temperature data connected with lines, (b) the projection estimator, (c) the optimal rank 2 estimator, and (d) the rank 1 estimator. We can observe a clear seasonal trend: temperatures rise in the summer and fall in the winter. However, as can be seen in Plot (d), the seasonal trend alone cannot explain the whole data due to the different characteristics of distinct regions.
On the other hand, Plot (b) shows that the projection estimator tends to overfit local fluctuations of the data. Plot (c) illustrates that the proposed method yields a desirable estimate of functions by the information pooling process.

Concluding remarks
We considered the nonparametric reduced rank regression estimator in a multi-task learning problem and derived the oracle inequality for the average $L_2$ risk. We obtained minimax upper and lower bounds in various asymptotic scenarios for the rank-constrained Sobolev class of vector-valued functions. Thus, we identified the dimensionality reduction effect of the low-rank structure and established the minimax optimality of the proposed estimator. The dimensionality reduction effect is characterised by the smoothness parameter $\gamma$, the rank-constraint parameter $\mu$, and the growth rate $\zeta$ of the number of experiments. The results indicate that information pooling across multiple experiments based on the low-rank structure among underlying functions can significantly improve estimation as measured by the average $L_2$ risk in many cases.

References

Liu, H., Lafferty, J., and Wasserman, L. (2008), 'Nonparametric Regression and Classification with Joint Sparsity Constraints'.
Lounici, K., Pontil, M., van de Geer, S., and Tsybakov, A.B. (2011), 'Oracle Inequalities and Optimal Inference Under Group Sparsity', The Annals of Statistics, 39, 2164–2204.
Ramsay, J.O., and Silverman, B.W. (2005), Functional Data Analysis (2nd ed.), New York: Springer.
Schwarz, G. (1978), 'Estimating the Dimension of a Model', The Annals of Statistics, 6(2), 461–464.
Thrun, S. (1996), 'Is Learning the n-th Thing Any Easier Than Learning the First?', Advances in Neural Information Processing Systems, 8, 640–646.
Tsybakov, A.B. (2009), Introduction to Nonparametric Estimation, New York: Springer.
Wang, X., Oliva, J.B., Schneider, J.G., and Póczos, B. (2016), 'Nonparametric Risk and Stability Analysis for Multi-Task Learning Problems', in IJCAI, New York: AAAI Press, pp. 2146–2152.
Yu, B. (1997), 'Assouad, Fano, and Le Cam', in Festschrift for Lucien Le Cam, New York: Springer, pp. 423–435.
Yuan, M., Ekici, A., Lu, Z., and Monteiro, R. (2007), 'Dimension Reduction and Coefficient Estimation in Multivariate Linear Regression', Journal of the Royal Statistical Society: Series B, 69(3), 329–346.

A.1 Proof of Theorem 3.1
Define a random event $A$. Proof: Using Lemma 3 of Bunea et al. (2011) and the fact that $\mathrm{rank}(\Phi) = p$, we obtain the desired bound. The inner product between two matrices $a$ and $b$ is defined as $\langle a, b \rangle = \mathrm{tr}(a b^\top)$, where $\mathrm{tr}[\cdot]$ denotes the trace of a matrix.

Lemma A.2: For all
It follows that Combining these results, we obtain It follows that, on the event A, where we have used the triangle inequality for the nuclear norm.

A.1.1 Proof of Theorem 3.1
Using Lemma A.2, we obtain where the last inequality follows from Therefore, we obtain

A.3 Proof of Theorem 3.3
To keep the notation uncluttered, we adopt the standard Vinogradov notation $a_n \lesssim b_n$ as $n \to \infty$. Using Theorem 3.2 and ignoring the constants, we obtain an upper bound for $\|f - \hat{f}\|^2$, and we choose $p \asymp n^{\frac{1}{2\gamma+1}}$ in what follows. We first consider the case in which $0 \leq \zeta < \frac{1}{2\gamma+\mu}$. Then, $r \lesssim p$ and $q = r$ in the asymptotic sense. Minimising the bound over $p$, we have $\frac{r}{n} \asymp n^{\zeta\mu - 1} \leq n^{-(1+\zeta(1-\mu))\frac{2\gamma}{2\gamma+1}}$.
Therefore, the term $r/n$ is negligible, and the convergence rate follows. Suppose now that $\frac{1}{2\gamma+\mu} \leq \zeta < \frac{1}{\mu(2\gamma+1)}$. As in the previous case, we have $q = r$ in the asymptotic sense. In this range, the term $r/n$ determines the convergence rate, which is given by $\frac{r}{n} \asymp n^{\zeta\mu - 1}$.
Finally, we consider the case in which $\zeta \geq \frac{1}{\mu(2\gamma+1)}$. Then, we have $n^{\frac{1}{2\gamma+1}} \lesssim n^{\zeta\mu}$, so that we bound $q$ by $p$. Since $\frac{p}{m} \lesssim n^{\zeta(\mu-1)} \leq 1$, the risk upper bound is of order $p^{-2\gamma} + \frac{p}{n}$.
Choosing $p \asymp n^{\frac{1}{2\gamma+1}}$, we obtain a convergence rate of order $n^{-\frac{2\gamma}{2\gamma+1}}$. Combining the above relations with the large deviation inequality, we obtain the desired result.

A.4 Proof of Theorem 3.4
To obtain minimax lower bounds using the Fano method, we construct subclasses of $\mathcal{F}_{m,\gamma,\mu}(Q)$ that consist of low-rank vector-valued functions. In doing so, we need different schemes depending on the range of the growth rate $\zeta$. Specifically, we use a row-wise construction scheme for $0 \leq \zeta < \frac{1}{2\gamma+\mu}$, whereas we need a column-wise construction for $\frac{1}{2\gamma+\mu} \leq \zeta < \frac{1}{\mu(2\gamma+1)}$. The case in which $\zeta \geq \frac{1}{\mu(2\gamma+1)}$ will be briefly outlined below. We first suppose that $0 \leq \zeta < \frac{1}{2\gamma+\mu}$. Define $r = m^{\mu}$. For simplicity, we assume that $s = m/r$ is a positive integer; otherwise, one can easily verify the ensuing arguments for $n \geq e^{\frac{\log 2}{\zeta(1-\mu)}}$. Let $\omega$ be an $r \times p$ matrix with entries $\omega_{ik} \in \{0, \sqrt{M_{12}}\, p^{-\gamma-1/2}\}$. We define an $m \times p$ matrix $\tau$ with $\tau_{(t-1)r+i} = \omega_i$ for $1 \leq i \leq r$ and $1 \leq t \leq s$, where $\omega_i$ denotes the $i$th row of $\omega$. Observe that the matrix $\tau$ is a block matrix whose blocks are identical to $\omega$. Define a class of vector-valued functions $\mathcal{F}_{m,n}$ whose elements have the form $f_\tau$. For $f_\tau, f_{\tau'} \in \mathcal{F}_{m,n}$ with $f_\tau \neq f_{\tau'}$, we have $\sum_{i=1}^{m} \sum_{k=p+1}^{\infty} \theta_{ik}^2 \leq M_{10}\, m p^{-2\gamma}$.