Sequential nonparametric estimation of controlled multivariate regression

Abstract. The article considers adaptive sequential nonparametric estimation of a multivariate regression with an assigned mean integrated squared error (MISE) and minimax mean stopping time, where the estimator matches the performance of an oracle that knows all nuisance parameters and functions. It is known that the problem has no solution if the regression belongs to a Sobolev class of differentiable functions. What if the underlying regression is smoother, say analytic? It is shown that in this case it is possible to match the performance of the oracle. Furthermore, similar to the classical Stein solution for parameter estimation, a two-stage sequential procedure solves the problem. The proposed first-stage regression estimator, based on a sample with fixed sample size, is of interest in its own right, and a thought-provoking environmental example of reducing potent greenhouse gas emissions by an anaerobic digestion system is used to discuss a number of important topics for small samples.


INTRODUCTION
Nonparametric curve estimation is devoted to estimation of functions whose shape is unknown. A classical statistical setting is as follows: a sample of size n is available, the problem is to propose a feasible estimator with minimal mean integrated squared error (MISE), and an oracle approach is used to find a benchmark for an adaptive estimator. The oracle knows some information about the underlying estimated function, including its smoothness, and everything about the nuisance functions. Then a sharp lower bound for the MISE of oracle estimators is established; the notion "sharp" means that both the constant and the rate of the MISE convergence are established. A good data-driven estimator should then match the performance of the oracle, and if the latter is possible, the estimator is called adaptive because it adapts to the smoothness of the underlying estimated function and to all nuisance functions. It is well known that adaptive nonparametric estimation is possible for a wide variety of statistical models and function classes of interest; see discussions in Efromovich (1999, 2018) and Wasserman (2006). This is good news for nonparametric estimation with a deterministic sample size.
The situation changes rather dramatically if we are interested in the Wald problem of sequential estimation with an assigned value of the risk and a minimal mean stopping time. No adaptive sequential estimator matching the performance of the oracle exists for the case of differentiable functions. More about the Wald problem, sequential estimation, and the lack of adaptation can be found in Wald (1947), Stein and Wald (1947), Anscombe (1949, 1953), B. K. Ghosh and Sen (1991), M. Ghosh, Mukhopadhyay, and Sen (1997), Mukhopadhyay (1997), and Efromovich (1995, 2007, 2018). Though there is no way to change this outcome for estimation of differentiable functions, the article shows that a change is possible for smoother functions such as analytic ones. Further, sequential estimation can use the simplest two-stage strategy, whose roots go back to Stein (1945) and Wald (1947); see also the interesting discussions in Aoshima and Yata (2011), Mukhopadhyay and Zacks (2018), and Mukhopadhyay (2019).
Let us describe the considered regression model and review relevant known results, beginning with the case of a fixed sample size. We observe a sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of size $n$ from $(X, Y)$, where $X := (X^1, \ldots, X^k)$ is a vector of continuous covariates (predictors) and $Y$ is the response. The regression is controlled, implying that the distribution of $X$ is known, and in what follows it is supposed that the joint density $f^X$ of the vector predictor is supported and positive on the $k$-dimensional cube $R := [0,1]^k$. The underlying regression model is
$$ Y = m(X) + \sigma(X)\xi, \quad (1.1) $$
where $m(x) := E\{Y \mid X = x\}$ is the regression function of interest, $\xi$ is a zero-mean regression error independent of $X$, and the positive function $\sigma(x)$ is called a scale function. Let us formulate one of the main known theoretical results, due to Hoffmann and Lepski (2002). Consider the cosine tensor-product basis $\varphi_i(x) := \prod_{r=1}^k \varphi_{i_r}(x_r)$ on $R$, where $\varphi_0(x) = 1$, $\varphi_i(x) = 2^{1/2}\cos(\pi i x)$, $i = 1, 2, \ldots$, and $i := (i_1, \ldots, i_k)$; set $\theta_i := \int_R m(x)\varphi_i(x)\,dx$ for the Fourier coefficients of $m(x)$ and introduce an anisotropic Sobolev class $S(\vec\alpha, Q) := \{m(x): m(x) = \sum_{i=0}^{\infty} \theta_i \varphi_i(x), \; x \in R\}$ of differentiable functions whose weighted sum of squared Fourier coefficients is bounded by $Q$. Note that we use the notation $\sum_{i=0}^{\infty} := \sum_{i_1, \ldots, i_k=0}^{\infty}$. It is established that the optimal (oracle's) minimax rate of the MISE convergence is $n^{-2\alpha/(2\alpha+1)}$, where $\alpha := [\sum_{r=1}^k \alpha_r^{-1}]^{-1}$ is the effective smoothness. For the univariate case $k = 1$, not only the rate but also a sharp constant is known, and it is attained by a data-driven estimator that matches the performance of the oracle knowing the parameters of the Sobolev class and the nuisance functions $f^X(x)$ and $\sigma(x)$; see Efromovich (1999). Oracle lower bounds, used as benchmarks for data-driven estimators, are discussed in Barron, Birgé, and Massart (1999), Galtchouk and Pergamenshchikov (2009a, 2009b), and Efromovich (2018), where further references may be found.
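The cosine tensor-product basis and the Fourier coefficients $\theta_i$ above are easy to compute numerically. Below is a minimal sketch for $k = 2$; the function names and the midpoint-rule quadrature are our own illustrative choices, not part of the article's estimators.

```python
import numpy as np

def cosine_basis_1d(i, x):
    """Univariate cosine basis: phi_0(x) = 1, phi_i(x) = sqrt(2) cos(pi i x)."""
    if i == 0:
        return np.ones_like(x)
    return np.sqrt(2.0) * np.cos(np.pi * i * x)

def tensor_basis(multi_index, points):
    """Tensor-product basis phi_i(x) = prod_r phi_{i_r}(x_r) on [0, 1]^k;
    points has shape (n, k)."""
    vals = np.ones(points.shape[0])
    for r, i_r in enumerate(multi_index):
        vals = vals * cosine_basis_1d(i_r, points[:, r])
    return vals

def fourier_coefficient(m, multi_index, grid_size=200):
    """theta_i = int_R m(x) phi_i(x) dx, approximated by the midpoint rule
    on a k = 2 grid."""
    t = (np.arange(grid_size) + 0.5) / grid_size
    x1, x2 = np.meshgrid(t, t, indexing="ij")
    pts = np.column_stack([x1.ravel(), x2.ravel()])
    return np.mean(m(pts) * tensor_basis(multi_index, pts))

# Example regression function on [0, 1]^2
m = lambda p: np.cos(np.pi * p[:, 0]) * np.cos(2 * np.pi * p[:, 1])
theta_12 = fourier_coefficient(m, (1, 2))   # = 1/2 by direct integration
theta_10 = fourier_coefficient(m, (1, 0))   # = 0: the second factor integrates out
```

Because the cosine products are periodic over the midpoint grid, the quadrature here is exact up to floating-point error for these low-frequency coefficients.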
In short, the theory and methodology of regression estimation for $k = 1$ and a fixed sample size are well developed. For sequential estimation it is known that neither the constant nor the rate can be improved by a sequential plan with a stopping time $T$ satisfying $E\{T\} \le n$. Further, if we restrict our attention to sequential estimators with an assigned MISE and minimal expected stopping time (the Wald problem), no data-driven estimator can match the performance of the oracle; see Efromovich (2007, 2018).
As we will see later in the article, the negative outcome for the Wald problem changes if we consider an analytic class $\mathcal{A} := \mathcal{A}(b, c, Q)$ (1.2) of regression functions on $R$ with faster decreasing Fourier coefficients. Here $\theta_i := \int_R m(x)\varphi_i(x)\,dx$, $b := (b_1, \ldots, b_k)$ and $c := (c_1, \ldots, c_k)$ are vectors of constants, and $\min(c_1, \ldots, c_k) > 0$. Analytic function classes are familiar in the statistical literature and well suited for many practical applications; see discussions in Ibragimov (2001) and Efromovich (1999, 2018). In what follows, the parameters $(b, c, Q)$ of the class $\mathcal{A}$ are known to the oracle and unknown to the statistician. Now let us formulate our main aim. We are interested in estimation of the regression function $m(x)$ in model (1.1) by a sequential estimator $\mathcal{E} := \mathcal{E}(\{\breve m_r(x, Z_1^r), r = 1, 2, \ldots\}, T)$. Here $\breve m_r(x, Z_1^r)$ is a regression estimate based on a sample $Z_1^r := \{(X_1, Y_1), \ldots, (X_r, Y_r)\}$ with fixed sample size $r$, and $T$ is a stopping time. Accordingly, when the stopping time is defined, the regression estimate is $\breve m_T(x, Z_1^T)$. The stopping time is a positive integer-valued random variable such that after observing $Z_1^r$ we make a decision as to whether or not $T = r$. If the decision is $T = r$, we stop observations and use the regression estimate $\breve m_r(x, Z_1^r)$; otherwise, we continue the sampling. More rigorously, let $(\Omega, \mathcal{F}, P)$ be an underlying probability space, let $\{Z_1^r, r = 1, 2, \ldots\}$ be a sequence of multivariate random variables on $(\Omega, \mathcal{F}, P)$, and let $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$ be an increasing sequence of sub-sigma-fields of $\mathcal{F}$ such that $Z_1^r$ is $\mathcal{F}_r$-measurable; then the stopping time is a map $T: \Omega \to \{1, 2, \ldots\}$ such that $\{T \le r\} \in \mathcal{F}_r$. Suppose that the oracle knows the underlying class (1.2) of regression functions, the design density $f^X$, and the scale function $\sigma$.
Then we consider two classical sequential regression problems in which a data-driven estimator tries to match the performance of the oracle: (1) the value of the mean stopping time is assigned, and then the MISE of a sequential estimator should match the oracle's MISE; and (2) the value of the MISE is assigned, and then the mean stopping time should match the oracle's mean stopping time (the Wald problem). As we will see later in the article, for the former problem sequential estimation does not dominate estimation based on a fixed sample size, but for the Wald problem sequential estimation is superior and can match the oracle.
The remainder of the article is organized as follows. Asymptotic theory for oracle estimators is presented in Section 2. The oracle's lower bounds serve as a benchmark for an estimator, and oracle estimators inspire data-driven estimators. Sequential estimation with assigned MISE (the Wald problem) is considered in Section 3, where a sharp minimax two-stage estimator matching the performance of the oracle is introduced. The methodology mimics the classical Stein approach proposed for parametric models. Proofs are deferred to Section 4. Conclusions and topics for future research are presented in Section 5. The online supplementary materials contain an important environmental example devoted to a new civil engineering technology for reducing greenhouse gas emissions. This is a thought-provoking controlled regression with five covariates and only $n = 86$ observations. The example and its discussion shed new light on the first stage of the proposed estimation.
The following notations are used in this article. For the first above-formulated problem of estimation with minimal MISE, given that the mean stopping time is bounded by a positive integer $n$, we are interested in the asymptotics as $n \to \infty$. Set $q := q_n := \lceil 2 + \ln(n+1) \rceil$ and $q' := q'_n := \lceil 2 + \ln(\ln(n+3)) \rceil$, where $\lceil c \rceil$ denotes the smallest integer larger than or equal to $c$. It is assumed that $\sum_{l=n+1}^{n} := 0$, $\sum_{i=0}^{J} := \sum_{i_1, \ldots, i_k=0}^{J}$, $0/0 := 0$, $\vee i := \max(i_1, \ldots, i_k)$, $\wedge i := \min(i_1, \ldots, i_k)$, $ic := (i_1 c_1, \ldots, i_k c_k)$, $\sup$ is the supremum over the considered function classes, and the $o_n(1)$'s are generic sequences vanishing in $n$. $I(\cdot)$ is the indicator, and the $w$'s are generic positive constants. $Z_{r+1}^{r+j} := \{(X_{r+1}, Y_{r+1}), (X_{r+2}, Y_{r+2}), \ldots, (X_{r+j}, Y_{r+j})\}$ denotes a sequence of independent and identically distributed observations (a sample of size $j$) from $(X, Y)$. For the Wald problem, $\varepsilon$ denotes a given positive real number that bounds the MISE, and then we are interested in the asymptotics as $\varepsilon \to 0$. Because the notation $n$ for a given sample size is no longer used, with some obvious abuse of notation it is convenient to consider $q := q(\varepsilon) := \lceil 2 + \ln(\varepsilon^{-1}+1) \rceil$ and $q' := q'(\varepsilon) := \lceil 2 + \ln(\ln(\varepsilon^{-1}+3)) \rceil$, and $o_\varepsilon(1)$ denotes a generic function that vanishes as $\varepsilon \to 0$.
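The logarithmic and double-logarithmic growth of the sequences $q_n$ and $q'_n$ can be illustrated directly; a small sketch with hypothetical helper names:

```python
import math

def q_seq(n):
    """q_n = ceil(2 + ln(n + 1)): grows logarithmically in n."""
    return math.ceil(2 + math.log(n + 1))

def q_prime_seq(n):
    """q'_n = ceil(2 + ln(ln(n + 3))): grows double-logarithmically in n."""
    return math.ceil(2 + math.log(math.log(n + 3)))

sizes = [10**2, 10**4, 10**6]
qs = [q_seq(n) for n in sizes]          # [7, 12, 16]
qps = [q_prime_seq(n) for n in sizes]   # [4, 5, 5]: q' is essentially flat
```

Even over four orders of magnitude in $n$, the sequence $q'_n$ barely moves, which is why powers of $q'$ act as negligible correction factors in the rates below.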

ASYMPTOTIC THEORY OF ORACLE ESTIMATORS
Consider regression model (1.1) and an oracle who knows the model, the design density $f^X$ and the scale function $\sigma$ (the two so-called nuisance functions), and the parameters of the function class (1.2). The problem is to understand how well the oracle can solve the two sequential problems formulated in Section 1, namely, minimization of the MISE given an assigned expected stopping time, and the Wald problem of minimizing the expected stopping time given an assigned value of the MISE. Though the oracle knows the design density and the scale function, it is of interest to understand how the oracle can deal with a class of these nuisance functions. Introduce a class of $k$-variate functions that are positive and differentiable on $R := [0,1]^k$,
$$ \mathcal{N} := \mathcal{N}(w_1, w_2, w_3) := \Big\{ g: \; w_1 \le g(x) \le w_2, \; \Big| \frac{\partial^k g(x)}{\partial x_1 \cdots \partial x_k} \Big| \le w_3, \; x \in R \Big\}, \quad (2.1) $$
where $w_1$, $w_2$, and $w_3$ are positive constants whose specific values play no role in the presented theoretical results. Accordingly, to simplify formulas, we may write that both $f^X$ and $\sigma$ belong to $\mathcal{N}$, keeping in mind that the constants may differ.
Assumption 2.1. The design density $f^X$ and the scale function $\sigma$ are from class (2.1), with possibly different constants $(w_1, w_2, w_3)$, and $\int_R f^X(x)\,dx = 1$. The regression error $\xi$ in (1.1) is independent of $X$, and its distribution $F^\xi$ belongs to a class $N$ of distributions with zero mean and variance bounded by one.
The assumption is mild; let us note that even in classical univariate regression theory it is traditionally assumed that the nuisance functions are positive and differentiable; see Wasserman (2006) and Efromovich (2018).
The following theorem considers the two above-presented sequential problems and presents sharp lower bounds for oracle estimators. Recall the notation $\mathcal{E}(\{\breve m_r(x, Z_1^r), r = 1, 2, \ldots\}, T)$ for a sequential estimator introduced in Section 1. In the considered theorem, the sequential regression estimates $\breve m_r$ and stopping times $T$ are constructed by the oracle, and accordingly the estimators are called "oracle estimators." To highlight that a statistic is constructed by the oracle, we use an asterisk; for instance, $\breve m_r^*$, $T^*$, and $\mathcal{E}^*$.

Theorem 2.1 (Lower bounds for the oracle). Let Assumption 2.1 hold. The two sequential problems for oracle estimators, with assigned mean stopping time and assigned MISE, are explored in turn. (i) Introduce a class $\mathcal{S}(n)$ of sequential oracle estimators $\mathcal{E}^*(\{\breve m_r^*(x, Z_1^r), r = 1, 2, \ldots\}, T^*)$ with stopping time $T^*$ satisfying the restriction (2.2) on the mean stopping time. Then the lower bound (2.3) for the minimax MISE of oracle estimators holds. (ii) Consider the Wald problem of minimizing the expected stopping time given an assigned value of the MISE. Introduce a class $\mathcal{S}'(\varepsilon)$ of sequential oracle estimators $\mathcal{E}'(\{\breve m_r^*(x, Z_1^r), r = 1, 2, \ldots\}, T^*)$ whose MISE is bounded by $\varepsilon$, and denote by $n^*(\varepsilon)$ the minimal integer $n$ for which the corresponding inequality in $n^{-1}[\ln(n)]^k$ holds; then the mean stopping time of any oracle estimator from $\mathcal{S}'(\varepsilon)$ is bounded from below by $n^*(\varepsilon)(1 + o_\varepsilon(1))$.

Let us comment on the lower bounds and then proceed to presenting oracle estimators that establish sharpness (attainability) of the lower bounds. We begin with the first problem and lower bound (2.3). Note that the MISE is proportional to the integral $\int_R [\sigma^2(x)/f^X(x)]\,dx$, which shows how the two nuisance functions affect the MISE. This dependence is known for nonsequential regression estimators. Next, let us write the right side of (2.3) as $n^{-1}P^*$; the constant $P^*$ is traditionally referred to as the "Pinsker constant," to honor the pioneering result of Pinsker (1980) devoted to filtering signals from white Gaussian noise. The Pinsker constant describes the effect of the underlying function class $\mathcal{A}$ on the MISE convergence.
If we return to definition (1.2) of the class $\mathcal{A} := \mathcal{A}(b, c, Q)$, we can conclude that only the vector $c$ affects the first order of the MISE convergence, whereas $b$ and $Q$ do not. This is an interesting specific of analytic regression functions, because for the Sobolev function classes considered in Pinsker (1980) all constants defining the class affect the Pinsker constant. Now let us look at the lower bound for the Wald problem. It points to a conjecture that the oracle may use an estimator with an a priori chosen fixed sample size $n^*(\varepsilon)$ to solve the Wald problem. At first glance this conjecture looks strange, but let us stress that we are dealing with oracle estimators that know the nuisance functions and the underlying class of regression functions. Because the statistician does not have that information, sequential estimation is the only option for solving the classical Wald problem. Several more remarks about Theorem 2.1 are as follows. The class $N$ of distributions of the regression errors is large and includes both continuous and discrete random variables. The proof of the lower bound (2.3) uses a standard Gaussian regression error. It is well known in point estimation theory that the Gaussian distribution is the least favorable for estimation of the mean, and here we have a similar property for the multivariate regression. Recall that for a Gaussian variable the Fisher information is the reciprocal of the variance, and this sheds light on the factor $\sigma^2(x)$ in the integral on the left side of (2.3). It is reasonable to conjecture that for a fixed distribution of $\xi$ we would see a corresponding Fisher information in the integral. Another thought-provoking comment is as follows. Suppose that the regression is additive, that is, $m(x) = \sum_{r=1}^k m_r(x_r)$, and we are interested in estimation of the univariate function $m_1(x_1)$. Additive models are often recommended to remedy the curse of multidimensionality; see Wasserman (2006).
Then sharp minimax estimation of a component in an additive model becomes dramatically more complicated, according to Efromovich (2013, 2018). In other words, minimax lower bounds for, and adaptive minimax estimators of, a multivariate regression and components in an additive regression are different, and this is an interesting specific of a multivariate regression. Now we are in a position to introduce oracle estimators that attain the lower bounds of Theorem 2.1. As a result, we will be able to conclude that the lower bounds are sharp minimax and can be used as benchmarks for data-driven estimators. We begin with part (i) of Theorem 2.1 and introduce a minimax oracle estimator based on a sample $Z_1^n$. Below we define more general statistics and sequences than needed here because they will also be used in Section 3 for the construction of sharp minimax sequential estimators. In addition, we use the notations and sequences introduced at the end of Section 1.
Consider a sample $Z_1^{n_1}$ with a deterministic $n_1 < n$ defined shortly, and introduce a low-frequency regression estimate $\tilde m'(x, n_1)$ built from Fourier estimates $\tilde\theta_i(n_1)$; see (2.6). This estimate is then used to construct a Fourier estimate $\hat\theta_i(n_1)$ in (2.7), and the proposed regression estimator $\hat m^*(x)$ is given in (2.8). There, the sum complements the low-frequency Fourier components of $\tilde m'$ with high-frequency components with indices $i_r \le q(1 + 1/q')/c_r$, $r = 1, 2, \ldots, k$. Note that $\tilde m'(x, n_1)$ is based on $Z_1^{n_1}$, while $\hat m^*(x)$ is based on $Z_{n_1+1}^n$ and $\tilde m'(x, n_1)$. Further, the two terms on the right side of (2.8) are the low- and high-frequency components of the regression estimate, respectively. Finally, note that the estimator is data driven and based solely on data. Also recall that for the Wald problem the oracle's sample size $n^*(\varepsilon)$ was defined in Theorem 2.1.
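For intuition, here is a simplified univariate ($k = 1$) sketch of a projection (series) regression estimator with known design density. It is a toy analogue only, not the actual estimator (2.8), which additionally splits the sample and adds a high-frequency block; all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(i, x):
    """Univariate cosine basis on [0, 1]."""
    return np.ones_like(x) if i == 0 else np.sqrt(2.0) * np.cos(np.pi * i * x)

def fourier_estimates(X, Y, f_X, indices):
    """Plug-in Fourier estimates theta_i ~ mean(Y phi_i(X) / f_X(X)); unbiased
    for controlled regression, where the design density f_X is known."""
    return {i: np.mean(Y * phi(i, X) / f_X(X)) for i in indices}

def projection_estimator(X, Y, f_X, cutoff, grid):
    """Series estimate: sum of estimated Fourier components up to `cutoff`."""
    coefs = fourier_estimates(X, Y, f_X, range(cutoff + 1))
    return sum(c * phi(i, grid) for i, c in coefs.items())

# Simulated controlled regression with uniform design and Gaussian error
n = 2000
X = rng.uniform(size=n)
m = lambda x: np.exp(-x) * np.cos(2 * np.pi * x)
Y = m(X) + 0.3 * rng.standard_normal(n)

grid = np.linspace(0.0, 1.0, 101)
fit = projection_estimator(X, Y, lambda x: np.ones_like(x), cutoff=8, grid=grid)
mise_proxy = np.mean((fit - m(grid)) ** 2)   # small: variance O(cutoff/n) plus a tiny bias
```

For a smooth regression function the truncation bias beyond a moderate cutoff is negligible, so the discretized MISE is driven by the $O(\mathrm{cutoff}/n)$ variance of the estimated coefficients.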
Theorem 2.2 (Oracle estimator). Let Assumption 2.1 hold and set $n_1 := n_1(n) := \lceil n/(q')^{k+2} \rceil$. (i) Consider the deterministic stopping time $T = n$ and the corresponding data-driven estimate $\hat m(x, Z_1^n) := \hat m^*(x, Z_1^n)$. The MISE of this estimate attains the lower bound (2.3), so that (2.9) holds. (ii) Consider the estimate $\hat m^*(x, Z_1^{n^*(\varepsilon)})$ defined in (2.8) and based on a sample with the oracle's deterministic sample size $n^*(\varepsilon)$; then (2.10) holds.

Theorem 2.2 implies two important conclusions. First, the lower bounds of Theorem 2.1 are sharp. Second, the oracle can solve the two classical sequential problems without invoking stochastic stopping times. These outcomes shed an interesting light on the two problems. For the first one (bounded mean stopping time and minimal MISE), the oracle suggests using a data-driven estimator. For the Wald problem, the oracle suggests the same regression estimator, only with a deterministic sample size defined by the nuisance functions and the underlying function class. This is an interesting conclusion for the theory of sequential nonparametric regression. Accordingly, the oracle tells the statistician that only sequential estimation can solve the Wald problem, and then the simplicity of a proposed solution becomes paramount. As we will see in Section 3, a two-stage Stein methodology allows us to solve the problem. Finally, let us make several technical comments. As we will see in the proof of Theorem 2.2, there is a large choice of sequences $n_1(n) = o_n(1)n$ for which (2.9) holds. Also, note that whereas $\hat m^*(x, Z_1^{n^*(\varepsilon)})$ is an oracle estimator, the statistics $\tilde\theta_i$, $\hat\theta_i$, and $\tilde m'(x, n_1)$ are based solely on data. We may conclude that mimicking $n^*(\varepsilon)$ by a data-driven stopping time is the main issue in the next section, devoted to solving the Wald problem.

TWO-STAGE SEQUENTIAL ESTIMATION WITH ASSIGNED MISE
The aim is to solve the Wald problem and suggest a data-driven sequential estimator that matches the performance of the sharp minimax oracle estimator of Theorem 2.2. As we will see, the renowned Stein methodology of two-stage sequential estimation is applicable to the considered multivariate heteroscedastic regression.
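Before turning to the regression setting, it may help to recall the parametric prototype: Stein's two-stage plan for estimating a mean with an assigned mean squared error. A minimal sketch (the function names are ours) uses a pilot sample to estimate the unknown variance and then chooses the total sample size:

```python
import numpy as np

rng = np.random.default_rng(1)

def stein_two_stage_mean(sample_fn, eps, n0=30):
    """Stein-type two-stage plan for estimating a mean with assigned mean
    squared error eps. Stage 1: a pilot sample of size n0 estimates the
    unknown variance. Stage 2: the total sample size T is chosen so that
    var_hat / T <= eps, and the extra T - n0 observations are collected."""
    pilot = sample_fn(n0)
    var_hat = pilot.var(ddof=1)
    T = max(n0, int(np.ceil(var_hat / eps)))          # stopping time
    extra = sample_fn(T - n0) if T > n0 else np.empty(0)
    return np.concatenate([pilot, extra]).mean(), T

# Normal observations with mean 5 and unknown variance 4; assigned MSE 0.01
sample_fn = lambda n: 5.0 + 2.0 * rng.standard_normal(n)
est, T = stein_two_stage_mean(sample_fn, eps=0.01)
# T concentrates near the oracle sample size sigma^2 / eps = 400
```

The regression procedure below follows the same template, with the pilot variance replaced by first-stage estimates of the nuisance functionals that determine the oracle's sample size $n^*(\varepsilon)$.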
We continue to use the notations and statistics introduced in the previous sections, and let us make a specific remark about the regression estimate $\tilde m'(x, n_1)$ and the Fourier estimates $\tilde\theta_i(n_1)$ and $\hat\theta_i(n_1)$ defined in (2.6) and (2.7). In the proposed two-stage sequential regression estimator these statistics are used twice, by both stages, but with different observations collected by the corresponding stages. It is convenient to utilize the same notation for these statistics and keep in mind that they are based on different observations. Now let us describe the two stages of sequential estimation. The first stage is based on $n_0$ observations $Z_1^{n_0}$ from $(X, Y)$, where $n_0 := n_0(\varepsilon, k)$ is the smallest integer such that $n_0 > \varepsilon^{-1} q^k/(q')^{k+1}$, and $q := q(\varepsilon)$, $q' := q'(\varepsilon)$ are defined at the end of Section 1. Note that $\sup n_0/n^*(\varepsilon) = o_\varepsilon(1)$, where the supremum is over the same classes as in (2.10) and $n^*(\varepsilon)$ is the oracle's benchmark for the mean stopping time; see part (ii) of Theorem 2.2. Observations of the first stage are used to calculate the size $\tilde n$ of an extra sample for the second stage; see (3.1). Here
$$ \hat J_r := \min\big\{ J: \; \hat F_r(J) \le \varepsilon/q', \; J \in \{\lceil q/q' \rceil, \lceil q/q' \rceil + 1, \ldots\} \big\}, \quad (3.2) $$
where $\hat F_r(J)$ is the U-statistic, defined in (3.4), used to estimate the Sobolev functional $F_r(J) := \sum_{i_{-r}=0}^{qq'} \sum_{i_r=J+1}^{qq'} \theta_i^2$, and the $(k-1)$-dimensional vector $i_{-r}$ is obtained from the vector $i$ by removing its $r$th element; for instance, $i_{-2} := (i_1, i_3, \ldots, i_k)$. Let us also stress that the statistics $\tilde\theta_i(n_1)$, $\hat\theta_i(n_1)$, and $\tilde m'(x, n_1)$ are based on the sample $Z_1^{n_0}$ with $n_1 = n_1(n_0) = \lceil n_0/(q')^{k+2} \rceil$, $q = q(\varepsilon)$, $q' = q'(\varepsilon)$, and $\varepsilon$ is as small as desired. The second stage is defined as follows.
The stopping time is $T := n_0 + \tilde n$, where $\tilde n$ is defined in (3.1), and accordingly we get an extra sample $Z_{n_0+1}^T$ from $(X, Y)$. In what follows, to use the notations of Section 2, we formally set $n := \tilde n$ and $n_1 := n_1(\varepsilon) := \lceil \varepsilon^{-1} q^k/(q')^{k+2} \rceil$ (note that $n_1$ is not random), and use the extra sample $Z_{n_0+1}^T$ and formulas (2.6) and (2.7) to calculate the Fourier estimates $\tilde\theta_i(n_1)$, $\hat\theta_i(n_1)$ and the low-frequency regression estimate
$$ \tilde m'(x, n_1) := \sum_{\vee i \le q/(q')^4} \tilde\theta_i(n_1)\varphi_i(x). \quad (3.5) $$
The proposed sequential regression estimator $\hat m_T(x, Z_1^T)$, mimicking (2.8), is given in (3.6), where $\hat J := (\hat J_1, \ldots, \hat J_k)$ and the cutoffs $\hat J_r$ are defined in (3.2). Note that the regression estimator (3.6) uses the observations $Z_1^{n_0}$ to calculate $\tilde n$ and $\hat J$, whereas all other statistics are calculated using the extra observations $Z_{n_0+1}^T$.

Theorem 3.1 (Sequential estimator for the Wald problem). Let Assumption 2.1 hold and $\sup_{F^\xi \in N} E\{\xi^4\} < \infty$. Then the two-stage sequential regression estimator $\hat m_T(x, Z_1^T)$, defined in (3.6), is sharp minimax. Namely, its MISE satisfies

$$ \sup E\Big\{ \int_R [\hat m_T(x, Z_1^T) - m(x)]^2\,dx \Big\} \le \varepsilon(1 + o_\varepsilon(1)), \quad (3.7) $$
and its mean stopping time matches the oracle's stopping time,
$$ \sup E\{T\} \le n^*(\varepsilon)(1 + o_\varepsilon(1)). \quad (3.8) $$
This is an interesting theoretical outcome for a multivariate heteroscedastic regression: it states that it is possible to suggest an adaptive sequential estimator that solves the Wald problem and matches the performance of the oracle. Moreover, the two-stage sequential approach is motivated by and resembles the classical pioneering methods of Stein (1945), Wald (1947), and Anscombe (1949, 1953) for sequential estimation of parameters. Several possible extensions of the result are discussed in Section 5.
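The data-driven cutoffs $\hat J_r$ of the first stage are chosen as the smallest index whose estimated tail energy drops below a threshold. A simplified univariate sketch of such a rule (the helper name and the toy coefficients are hypothetical, and the real procedure thresholds the U-statistic $\hat F_r$ at $\varepsilon/q'$):

```python
import numpy as np

def tail_cutoff(theta_hat, threshold, j_min=1):
    """Sketch of a cutoff rule: the smallest J >= j_min such that the
    estimated tail energy sum_{i > J} theta_hat[i]^2 is <= threshold."""
    tail = np.cumsum(theta_hat[::-1] ** 2)[::-1]   # tail[j] = sum_{i >= j}
    for J in range(j_min, len(theta_hat)):
        if J + 1 >= len(theta_hat) or tail[J + 1] <= threshold:
            return J
    return len(theta_hat) - 1

# Toy estimated coefficients with fast (analytic-like) decay
theta = np.array([1.0, 0.5, 0.2, 0.05, 0.01, 0.002])
J_hat = tail_cutoff(theta, threshold=0.01)
# tail energy beyond J = 2 is 0.05**2 + 0.01**2 + 0.002**2 ~ 0.0026 <= 0.01
```

Because analytic functions have exponentially decaying coefficients, the tail energy collapses quickly and the selected cutoff stays close to the oracle's minimax cutoff $J_r^*$.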

PROOFS
Recall that the main notations were introduced at the end of Section 1. In the proofs we are interested in the asymptotics as $n \to \infty$ for Theorems 2.1 and 2.2, and as $\varepsilon \to 0$ for Theorem 3.1.
Proof of Theorem 2.1. We begin with proving the oracle's lower bound (2.3) given the restriction (2.2) on the mean stopping time. To make the proof shorter, we will use results of Efromovich (1989, 2000). The left side of (2.3) does not increase if we consider (i) specific nuisance functions $f^X(x)$ and $\sigma(x)$ from the class $\mathcal{N}$, (ii) a specific distribution of the regression error $\xi$ from the class $N$, and (iii) a subclass of the considered regression functions. Let us consider these suggestions in turn and explain the motivation. (i) We choose $f^X(x) = I(x \in R)$ and $\sigma^2(x) = d^*$. These functions belong to $\mathcal{N}$, and they are the only constant functions on $R$ that maximize the integral in (2.3). (ii) We choose a standard normal distribution for the regression error $\xi$. Recall that the Fisher information for the mean of a Gaussian distribution is the reciprocal of the variance, and that among all distributions with bounded variance a Gaussian distribution is the least favorable for estimating the mean.
(iii) A subclass of regression functions is chosen in such a way that each Fourier coefficient can be treated independently; accordingly, the subclass should be a parallelepiped in place of the ellipsoid $\mathcal{A}$. To define the parallelepiped, set $J_{*r} := \lceil (1/c_r) q(1 - 1/q') \rceil$, $J_* := (J_{*1}, \ldots, J_{*k})$, and note that for any $b > 0$,
$$ q^{-b} e^{-q(1-1/q')} = n^{-1} q^{-b} n^{1/q'} (1 + o_n(1)) \quad \text{and} \quad q^b = o_n(1) n^{1/q'}. \quad (4.1) $$
These relations allow us to introduce the parallelepiped (4.2), and according to (4.1), for all sufficiently large $n$ we have $R_n \subset \mathcal{A}$. Using the above-described steps (i)-(iii) and the Bessel inequality, we obtain (4.3) for all sufficiently large $n$. Here the expectation $E^*$ stresses that the distribution of the regression error $\xi$ is standard normal, $f^X(x) = I(x \in R)$, and $\sigma^2(x) = d^*$. In other words, on the right side of (4.3) the underlying model is $Y = m(X) + \sqrt{d^*}\,\xi'$, where $\xi'$ is standard normal and independent of the predictor $X$, which is uniformly distributed on $R$, and $m(x)$ belongs to the parallelepiped (4.2). We have converted the setting into the one considered in Efromovich (1989, 2000). Recall that the parametric Fisher information for $Y' := \theta + \sqrt{d^*}\,\xi'$ is $1/d^*$, and then the validity of the lower bound (2.3) is established.
Let us present a technical lemma that will be used in the proofs of Theorems 2.2 and 3.1.
Lemma 4.1. (i) Relation (4.4) holds for the function class $\mathcal{A} = \mathcal{A}(b, c, Q)$ defined in (1.2). (ii) Let the function $g(x)$ be square integrable on $R$ with $\int_R [\partial^k g(x)/\partial x_1 \cdots \partial x_k]^2\,dx < \infty$. Then relation (4.5) is valid for the Fourier coefficients of $g(x)$, with the right side equal to $\int_R [\partial^k g(x)/\partial x_1 \cdots \partial x_k]^2\,dx/\pi^{2k}$.

Remark 4.1. There are two useful corollaries of Lemma 4.1. The first one is that for functions from $\mathcal{A}$ the Fourier coefficients are absolutely summable and the functions are uniformly bounded. The second one is that the same can be said about the ratio $\sigma^2(x)/f^X(x)$ for nuisance functions from the class $\mathcal{N}$. To see that the Fourier coefficients are absolutely summable, note that for any set $K \subseteq \{0, 1, \ldots\}^k$, equality (4.5) and the Cauchy-Schwarz inequality imply the bound (4.6).

Proof of Lemma 4.1. Inequality (4.4) follows from the classical inequality between geometric and arithmetic means. Verification of (4.5) is more involved. To simplify formulas, set $g(x) := \sigma^2(x)/f^X(x)$ and note that $g$ is square-integrable on $R$ and the derivative $g'(x) := \partial^k g(x)/\partial x_1 \cdots \partial x_k$ exists and is square-integrable on $R$. Then, in place of the cosine tensor product we use the sine tensor product and, using Parseval's identity and integration by parts in $x_k$, write
$$ \int_R \frac{\partial^k g(x)}{\partial x_1 \cdots \partial x_k} \prod_{r=1}^{k} \sin(\pi i_r x_r)\,dx = \int \bigg[ \frac{\partial^{k-1} g(x)}{\partial x_1 \cdots \partial x_{k-1}} \sin(\pi i_k x_k)\Big|_{x_k=0}^{1} - \int_0^1 (\pi i_k)\cos(\pi i_k x_k)\,\frac{\partial^{k-1} g(x)}{\partial x_1 \cdots \partial x_{k-1}}\,dx_k \bigg] \prod_{r=1}^{k-1}\sin(\pi i_r x_r)\,dx_{-k}. $$
Using $\sin(0) = \sin(\pi i_k) = 0$ and then repeating the above step for $x_{k-1}, \ldots, x_1$, we conclude the argument. Lemma 4.1 is proved.
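The $k = 1$ case of part (ii) can be checked numerically: for $g(x) = e^x$ on $[0,1]$, integration by parts against the sine system gives $\sum_{i \ge 1} (\pi i)^2 \theta_i^2 = \int_0^1 g'(x)^2\,dx$, and partial sums converge to this value from below. The closed-form coefficients below are standard calculus, not taken from the article.

```python
import numpy as np

def cosine_coef_exp(i):
    """Exact cosine-basis Fourier coefficients of g(x) = exp(x) on [0, 1]:
    theta_i = sqrt(2) (e (-1)^i - 1) / (1 + (pi i)^2) for i >= 1."""
    return np.sqrt(2.0) * (np.e * (-1.0) ** i - 1.0) / (1.0 + (np.pi * i) ** 2)

i = np.arange(1, 200_001)
lhs = np.sum((np.pi * i) ** 2 * cosine_coef_exp(i) ** 2)  # sum (pi i)^2 theta_i^2
rhs = (np.e ** 2 - 1.0) / 2.0                             # int_0^1 g'(x)^2 dx
# lhs increases to rhs as more terms are added: the univariate case of (4.5)
```

The terms decay only like $i^{-2}$ here, so a couple of hundred thousand terms are needed for three-digit agreement; for the analytic class $\mathcal{A}$ the analogous sums converge exponentially fast.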
Proof of Theorem 2.2. Some parts of the proof will be used in the proof of Theorem 3.1; this explains why several relations more general than needed here are presented.
First of all, let us check that the Fourier estimates introduced in (2.6) and (2.7) are unbiased. Using Assumption 2.1 we can write (4.8). Now consider the Fourier estimate $\hat\theta_i(n_1)$. Using Assumption 2.1 and the fact that $Z_{n_1+1}^n$ and $\tilde m'(x, n_1)$ are independent, we can write, for $\vee i > q/(q')^4$, the relation (4.9); in the last equality we used (2.6) and $\int_R \varphi_j(x)\varphi_i(x)\,dx = I(j = i)$. Unbiasedness of the two Fourier estimates is established, and now we explore their variances (mean squared errors). Using Remark 4.1, it is plain to conclude (4.10). Here and in what follows, the supremum is over the same function classes as in (2.9), and recall that the $w$'s are generic positive constants. Using this result, the Parseval identity, and the definition of $n_1$, we obtain (4.11). For $\hat\theta_i(n_1)$ we need to establish a more accurate upper bound than (4.10); namely, we need to get $\int_R [\sigma^2(x)/f^X(x)]\,dx$ in place of a generic constant $w$. Using (4.9), the Cauchy inequality, and a constant $c \in (0, 1)$, write (4.12). To analyze $A_1$ we note that $\varphi_i^2(x) = 1 + 2^{-1/2}\varphi_{2i}(x)$, and this together with Lemma 4.1 and Remark 4.1 allows us to conclude (4.13). For the term $A_2$ on the right side of (4.12) we can write, for any considered $i$ satisfying $\vee i > q/(q')^4$, the relation (4.14); in the last equality we used $\int_R [m(x) - \tilde m'(x, n_1)]\varphi_i(x)\,dx = \theta_i$. With the help of (4.11), we conclude that $\sup A_{2i} = o_n(1) n q^k$. Combining the results, we get (4.15), and using (4.15) we obtain the corresponding bound for the high-frequency part of the estimate. This relation, together with $\sup \sum_{\vee(ic) > q(1+1/q')} \theta_i^2 = o_n(1) n^{-1} q^k$, (4.11), (4.15), and the Parseval identity, verifies Theorem 2.2.
Proof of Theorem 3.1. In what follows, we consider functions and distributions from the classes appearing in the supremum of (3.7), and $\sup$ means the supremum over those classes. Recall that the general notations and specific sequences were introduced at the end of Section 1; in particular, $q = q(\varepsilon)$, $q' = q'(\varepsilon)$, and $o_\varepsilon(1) \to 0$ as $\varepsilon \to 0$. Also, to simplify formulas, it is convenient to introduce a generic function $o_\varepsilon^*(1)$ such that $\sup |o_\varepsilon^*(1)| = o_\varepsilon(1)$. We begin with verification of the following upper bound on the mean sample size of the second stage:
$$ \sup E\{\tilde n\} \le n^*(\varepsilon)(1 + o_\varepsilon(1)). \quad (4.16) $$
Set $J_r^* := \lceil (1/c_r) q(1 + 1/q') \rceil$, $r = 1, \ldots, k$, and recall from Section 2 that these are the optimal cutoffs of the sharp minimax oracle estimator (2.8). Write the decomposition (4.17). To continue, we recall a familiar inequality for moments of U-statistics, due to lemma 4.1 and remark 4.1 in Efromovich (2000), and note that this is the place where the extra assumption of a bounded fourth moment of the regression error $\xi$ is used. The inequality is
$$ \sup E\big\{ (\hat F - F)^4/(F + L n_0^{-1})^2 \big\} \le w n_0^{-2}, \quad w < \infty, \quad (4.18) $$
where $L$ is the cardinality of the set $K$. Further, note that the statistic $\hat\theta_{j,0,\ldots,0}(n_1)$ used in (3.4) estimates the univariate Fourier coefficient $\int_0^1 [\int_{[0,1]^{k-1}} m(x)\,dx_2 \cdots dx_k]\varphi_j(x_1)\,dx_1$. Then, using lemma A.1 in Efromovich (2013), we obtain (4.19). Now we can return to the terms $A_1$ and $A_2$ on the right side of (4.17). For $A_1$ we write, using (4.19), a bound that yields (4.20). For the term $A_2$ we can write, using $\hat d \le q'$ and $\hat J_r \le qq'$, the bound (4.21). Recall that $F_r(J) = \sum_{i_{-r}=0}^{qq'} \sum_{i_r=J+1}^{qq'} \theta_i^2$; if $J > J_r^*$, then $F_r(J) \le w(1+J)^{|b_r|} e^{-c_r q(1+1/q')/c_r} \le w/(q')^2$, while $\hat F_r(\hat J_r - 1) > \varepsilon/q'$. Accordingly, for all sufficiently small $\varepsilon$, $A_2$ is bounded by sums of terms $(J+1) P(\hat F_r(J) - F_r(J) > \varepsilon/(2q'))$, uniformly over the considered function classes.
Using (4.18) and the Chebyshev inequality, we continue and verify that these probability terms are negligible, which yields (4.16). Now we verify that the MISE of the proposed sequential regression estimator is at most $\varepsilon(1 + o_\varepsilon^*(1))$. For an underlying regression function $m := m(x)$, introduce the specific vector $J_m := (J_{m1}, \ldots, J_{mk})$ of the oracle's cutoffs implying sharp minimax estimation:
$$ J_{mr} := \min\Big\{ J: \; \sum_{i_{-r}=0}^{qq'} \sum_{i_r=J+1}^{qq'} \theta_i^2 \le 2\varepsilon/q', \; J \in \{0, 1, \ldots, J_r^*\} \Big\}, \quad r = 1, \ldots, k. \quad (4.23) $$
The subscript $m$ in $J_{mr}$ emphasizes that this is a special oracle cutoff based on the underlying regression function $m(x)$ from the class $\mathcal{A}$; the previously introduced cutoffs $J_r^*$ are minimax and use only information about the function class $\mathcal{A}$. The latter explains the upper bound $J_r^*$ in (4.23). We analyze the MISE using the two vectors of oracle cutoffs $(J_{m1}, \ldots, J_{mk})$ and $(J_1^*, \ldots, J_k^*)$. For an underlying regression function $m \in \mathcal{A}$, introduce the set of indexes $D_m := \cap_{r=1}^k \{J_r: J_{mr} \le J_r \le J_r^*\}$ and note its complementary set $D_m^c$. We also use the notation $\hat J := (\hat J_1, \ldots, \hat J_k)$ for the vector of estimated cutoffs. Using these notations, we can write the decomposition (4.24). The term $U_1$ is the oracle's MISE, and $\sup U_1 \le \varepsilon(1 + o_\varepsilon(1))$. Accordingly, we need to show that
$$ \sup U_2 = \varepsilon\, o_\varepsilon(1). \quad (4.25) $$
Using the above-presented formula for $D_m^c$ we get (4.26). Note that terms in the first sum use cutoffs smaller than recommended by the oracle for the underlying regression function $m$, and this may lead to a larger bias; terms in the second sum use cutoffs larger than suggested by the minimax oracle, and this may lead to a larger variance. We consider these two cases in turn. Using the Parseval identity, write the decomposition (4.27) with $\bar\theta_i := \tilde\theta_i I(\vee i \le q/(q')^4) + \hat\theta_i I(\vee i > q/(q')^4)$. For $m \in \mathcal{A}$ we have $\sup V_3 = o_\varepsilon(1)\varepsilon$. Accordingly, $V_3$ is sufficiently small, and we evaluate $V_1$ and $V_2$ in turn.
For $V_1$ we use the Cauchy–Schwarz inequality and get (4.28). For the probability term we note that if $\tilde J_r < J_{mr}$, then $\hat F_r(\tilde J_r) \le \varepsilon/q'$ and $F_r(\tilde J_r) > 2\varepsilon/q'$. This remark, the Chebyshev inequality, $\tilde J_r \ge \lceil q/q' \rceil$ almost surely, and (4.18) yield

$$\sum_{J=\lceil q/q' \rceil}^{J_{mr}-1} P\big(F_r(J) - \hat F_r(J) > (1/2) F_r(J),\; F_r(J) > 2\varepsilon/q'\big) \le w \sum_{J=\lceil q/q' \rceil}^{J_{mr}-1} \Big\{ \frac{n_0^{-2}\big(F_r(J) + (qq')^k n_0^{-1}\big)^2}{F_r^4(J)} \Big\}\, I\big(F_r(J) > 2\varepsilon/q'\big) = o^*_\varepsilon(1)\, q^{-2k} (q')^{w}. \qquad (4.29)$$

The next step is to evaluate the expectation on the right side of (4.28). Recall that the Fourier estimate $\tilde h_i(n_1)$ is not sequential and is based on a sample of size $n_1$; this yields a bound of order $q^{-2k}(q')^{2k+5}$. To evaluate moments of $\hat h_i$, we note that this Fourier estimator is based on a sample of random size $\tilde n - n_1$ defined by the first stage. Accordingly, set $n_* := \lceil \varepsilon^{-1} q^k / (q')^{k+1} \rceil$ and $n^* := \lceil \varepsilon^{-1} q^k (q')^{k+1} \rceil$, note that for all sufficiently small $\varepsilon$ we have $n_* \le \tilde n \le n^*$ almost surely, and recall that we consider the asymptotics as $\varepsilon \to 0$; hence $n_1 = o^*_\varepsilon(1) n_*$ allows us to assume that $n_1 < n_*$. Using these remarks, we can write (4.30). For the expectation on the right side of (4.30), we can use inequality (1.3.50) in Efromovich (2018) and get, for $\sum_s i_s > q/(q')^4$ (also recall a similar calculation in the proof of Theorem 2.2), inequality (4.31) for

$$E\Big\{\Big[(s - n_1)^{-1} \sum_{l=n_0+n_1+1}^{n_0+s} \frac{(Y_l - \hat m_0(X_l, n_1))\, u_i(X_l)}{\hat f_X(X_l)} - h_i\Big]^2\Big\}.$$

Here we used the abovementioned $n_1 = o^*_\varepsilon(1) n_*$ and the fact that the statistic $\hat m_0(x, n_1)$ in (4.31) is independent of $(X_l, Y_l)$. Using this inequality in (4.30), we conclude that the bound is of order $q^{-2k}(q')^{2k+3}$ (4.32). Using (4.32) in (4.28), we conclude that $V_1 = o^*_\varepsilon(1)\varepsilon$. To evaluate $V_2$ on the right side of (4.27), we note that this is the main term when $\tilde J_r < J_{mr}$, because a smaller cutoff increases the squared bias and decreases the variance of a regression estimate.
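The probability bound in (4.29) follows the usual fourth-moment Chebyshev pattern. Schematically (with $t = F_r(J)/2$; this is the generic step, not the paper's exact display):

```latex
% Fourth-moment Chebyshev (Markov) step used together with (4.18):
% for any t > 0,
\[
  P\big( F_r(J) - \hat F_r(J) > t \big)
  \;\le\; \frac{E\big(\hat F_r(J) - F_r(J)\big)^4}{t^4},
\]
% and (4.18) supplies a numerator bound of order n_0^{-2}
% (times the factor (F_r(J) + (qq')^k n_0^{-1})^2).
```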
To evaluate the increased squared bias and to simplify formulas, set $Z_r := \sum_{i_{-r}=0} \sum_{i_r \ge \tilde J_r} h_i^2 / (\varepsilon/q')$. Note that $Z_r$ is a random variable (a function of $\tilde J_r$), and according to the definition of $\tilde J_r$ we have $\hat F_r(\tilde J_r) \le \varepsilon/q'$; then for $m \in \mathcal A$ and any positive constant $c^*$ we have (4.33). Using this inequality and the classical inequality $E\{g\, I(g \ge 2)\} \le \sum_{r=1}^{\infty} P(g \ge r)$, we can finish the evaluation of $V_2$ in (4.34). Using the already evaluated terms $V_1 + V_3 = o^*_\varepsilon(1)\varepsilon$ together with (4.34) in (4.27), we conclude that $U_{21r} = o^*_\varepsilon(1)\varepsilon$. Now we evaluate the term $U_{22r}$ in (4.26). Recall that $J^*_r$ is the oracle's minimax cutoff; accordingly, the case $\tilde J_r > J^*_r$ increases the variance part of the MISE while its squared-bias part remains sufficiently small. To see this, we may write, using the Parseval identity (compare with (4.27)), a decomposition into terms $V'_1$, $V'_2$, and $V'_3$. Here, similarly to (4.27), we use the notation $\bar h_i := \tilde h_i I(\sum_s i_s \le q/(q')^4) + \hat h_i I(\sum_s i_s > q/(q')^4)$. The definition of $J^*_r$ implies that for $m \in \mathcal A$ we have $V'_2 + V'_3 = o^*_\varepsilon(1)\varepsilon$. The term $V'_1$ is evaluated similarly to the term $V_1$ in (4.27), and we get $V'_1 = o^*_\varepsilon(1)\varepsilon$. We have shown that all considered terms are $o^*_\varepsilon(1)\varepsilon$, and there is only a finite number of these terms. Theorem 3.1 is proved.
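The classical inequality invoked above is a layer-cake (tail-sum) bound. For instance, for a nonnegative integer-valued random variable it admits a one-line derivation:

```latex
% Layer-cake identity for a nonnegative integer-valued random
% variable g:
%   E g = \sum_{r=1}^{\infty} P(g \ge r),
% and dropping the indicator can only decrease the left side, so
\[
  E\{\, g\, I(g \ge 2) \,\} \;\le\; E g \;=\; \sum_{r=1}^{\infty} P(g \ge r).
\]
```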

CONCLUSION
The developed theory shows that, asymptotically, sequential estimation of analytic multivariate regression functions with an assigned MISE and minimax mean stopping time is possible. The proposed data-driven sequential estimator matches the performance of the oracle that knows the smoothness of the estimated multivariate regression and all nuisance functions, and accordingly the estimator can be referred to as adaptive. The asymptotic theory sheds new light on the potential of sequential estimation, because the only theoretical result known so far has been that no minimax adaptive sequential estimation is possible for differentiable regression functions. Another important result is that, similarly to classical parametric models, a two-stage sequential methodology solves the problem.
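The two-stage idea parallels Stein's classical procedure for estimating a normal mean with an assigned risk: a pilot sample estimates the nuisance parameter (here, the variance), which then fixes the size of the second stage. The following minimal sketch illustrates only that classical parametric analogue; the function name, pilot size, and Gaussian setup are assumptions for illustration and are not the paper's nonparametric procedure:

```python
import math
import random

def stein_two_stage(sample_first, draw, eps):
    """Stein-style two-stage procedure (illustrative sketch).

    sample_first : list of first-stage (pilot) observations
    draw         : callable returning one new observation
    eps          : assigned squared-error risk for the sample mean
    """
    n1 = len(sample_first)
    mean1 = sum(sample_first) / n1
    # Unbiased first-stage variance estimate (the nuisance parameter).
    s2 = sum((x - mean1) ** 2 for x in sample_first) / (n1 - 1)
    # Total sample size so that Var(mean) ~ s2 / n is at most eps.
    n_total = max(n1, math.ceil(s2 / eps))
    # Second stage: collect the remaining observations, then average all.
    data = list(sample_first) + [draw() for _ in range(n_total - n1)]
    return sum(data) / len(data), n_total

random.seed(0)
draw = lambda: random.gauss(10.0, 2.0)  # true mean 10, true variance 4
pilot = [draw() for _ in range(20)]
est, n_tot = stein_two_stage(pilot, draw, eps=0.01)
```

With variance near 4 and eps = 0.01, the procedure typically collects a few hundred observations in total, and the final mean estimate is close to 10.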
Let us mention several interesting open problems for future research. First, it is of interest to understand the effect of missing data on sequential estimation. Missing data are typical in regression problems. Some theoretical results are known for fixed sample sizes; see Efromovich (2018). In particular, it is known that different remedies should be used for missing predictors and missing responses. It is understood that for sequential estimation an underlying missing mechanism must be evaluated and taken into account, and the latter is a challenging problem on its own. Second, it is of interest to apply the developed sequential methodology to other familiar sequential problems, such as change-point detection, discussed in Baron (2001) and Schmegner and Baron (2004), and confidence bands and hypothesis testing, discussed in Wald (1947). Third, sequential estimation of the scale function $\sigma(x)$ and of the distribution of the regression error $\xi$ is another practically important and theoretically challenging problem. Fourth, as follows from the discussion in Section 2, it is of interest to develop a theory of sequential estimation for additive regression models. Fifth, in Pergamenshchikov (2009a, 2009b), a practically important type of heteroscedastic regression $Y = m(x) + \sigma(x, m)\xi$ is considered, where the scale may depend on both $x$ and the regression function $m(x)$. It will be of interest to consider a multivariate regression of this type and then explore sequential estimation of the regression and the scale.
Finally, a challenging and urgent open problem is the practically important case of a small sample for the first stage. The first stage is based on an a priori chosen sample size, and this creates the possibility of no feasible estimation due to the curse of multidimensionality and a sample size that may be too small for an underlying regression (of course, using multiple stages is a possible remedy, but the simplicity of two stages is appealing). The supplementary materials shed light on the challenges of the first stage and point to possible modifications of the proposed asymptotically optimal estimates by exploring a thought-provoking environmental example of a multivariate regression with five covariates and sample size n = 86. The practical example also highlights a connection between the studied nonparametric multivariate regression and nonparametric functional regression, where the prediction is a process. Accordingly, sequential functional regression is another important topic for future research.

ACKNOWLEDGMENT
A discussion with Nitis Mukhopadhyay and valuable comments from the Associate Editor and reviewers are gratefully acknowledged.

DISCLOSURE
The author has no conflicts of interest to report.

FUNDING
This research was supported in part by NSF Grant DMS-1915845 and grants from CAS and BIFAR.