A practical guide to compact infinite dimensional parameter spaces

Abstract

Compactness is a widely used assumption in econometrics. In this article, we gather and review general compactness results for many commonly used parameter spaces in nonparametric estimation, and we provide several new results. We consider three kinds of functions: (1) functions with bounded domains which satisfy standard norm bounds, (2) functions with bounded domains which do not satisfy standard norm bounds, and (3) functions with unbounded domains. In all three cases, we provide two kinds of results, compact embedding and closedness, which together allow one to show that parameter spaces defined by a $\|\cdot\|_s$-norm bound are compact under a norm $\|\cdot\|_c$. We illustrate how the choice of norms affects the parameter space, the strength of the conclusions, as well as other regularity conditions in two common settings: nonparametric mean regression and nonparametric instrumental variables estimation.


Introduction
Compactness is a widely used assumption in econometrics, for both finite and infinite dimensional parameter spaces. It can ensure the existence of extremum estimators and is an important step in many consistency proofs (e.g. Wald, 1949). Even for noncompact parameter spaces, compactness results are still often used en route to proving consistency. For finite dimensional parameter spaces, the Heine-Borel theorem provides a simple characterization of which sets are compact. For infinite dimensional parameter spaces the situation is more delicate. In finite dimensional spaces, all norms are equivalent: convergence in any norm implies convergence in all norms. In infinite dimensional spaces this equivalence fails, so the choice of norm matters. One can, however, define the parameter space as a ball with the norm $\|\cdot\|_s$ and obtain compactness under a norm $\|\cdot\|_c$. This result can then be used to prove consistency of a function estimator in the norm $\|\cdot\|_c$.
In the present article, we make two main contributions. First, we gather and review many of these compactness results. Unlike much of the previous mathematics literature, we focus on norms and parameter spaces most useful in econometrics to provide a unified and easily accessible treatment. The results are particularly complicated for functions defined on an unbounded Euclidean domain, where various commonly used choices of norms imply very different parameter spaces (for a specific example, see our discussion in Section 6). Second, we provide several new compactness results, which relax important restrictions on parameter spaces in the previous literature. For example, for functions on unbounded Euclidean domains our results allow for parameter spaces which include polynomials of arbitrary degree. These results can be used to verify high level assumptions in the sieve estimation literature, including assumption 3.5(i) in Ai and Chen (2003), assumption 4 in Newey and Powell (2003), condition (ii) of theorem 4.1 in Chen and Pouzo (2012), and assumption 3.2(iii) in Chen and Pouzo (2015).[1]
We organize the results into two main parts, depending on the domain of the function of interest: bounded or unbounded. We first consider functions on bounded Euclidean domains which satisfy a norm bound, such as having a bounded Sobolev integral or sup-norm. Second, we consider functions defined on an unbounded Euclidean domain, where we build on and extend the important work of Gallant and Nychka (1987). Finally, we return to functions on a bounded Euclidean domain, but now suppose they do not directly satisfy a norm bound. One example is the quantile function $Q_X : (0,1) \to \mathbb{R}$ for a random variable $X$ with full support. Since $Q_X(s)$ asymptotes to $\pm\infty$ as $s$ approaches 0 or 1, the derivatives of $Q_X$ are unbounded. Nonetheless, we show that compactness results may apply if we replace unweighted norms with weighted norms.
In general, there are two steps to showing that a parameter space defined as a ball under $\|\cdot\|_s$ is compact under $\|\cdot\|_c$. First we prove a compact embedding result, which means that the $\|\cdot\|_c$-closure of the parameter space is $\|\cdot\|_c$-compact. Second, we show that the parameter space is actually $\|\cdot\|_c$-closed, and hence equals its closure and is therefore compact. We show that some choices of the pair $\|\cdot\|_s$ and $\|\cdot\|_c$ satisfy the first step, but not the closedness step.
For functions on unbounded Euclidean domains, we follow the approach of Gallant and Nychka (1987) and introduce weighted norms. Gallant and Nychka (1987) showed how to extend compact embedding proofs for bounded domains to unbounded domains. We review and extend their result and show how it applies to a general class of weighting functions, as well as many choices of $\|\cdot\|_s$ and $\|\cdot\|_c$, such as Sobolev $L_2$ norms, Sobolev sup-norms, and Hölder norms. In particular, unlike existing results, our result allows for many kinds of exponential weight functions. This allows, for example, parameter spaces for regression functions which include polynomials of arbitrary degree. We also discuss additional commonly used weighting functions, such as polynomial upweighting and polynomial downweighting.
In applications, the choice of norms has several important implications. The norm $\|\cdot\|_s$ typically places restrictions on the class of functions. It then affects the estimator either through the parameter space or a penalty function. The norm $\|\cdot\|_c$, on the other hand, determines the strength of the conclusions. Furthermore, the norms affect how strong other regularity conditions are. We illustrate these considerations with two simple applications. First, we consider estimation of mean regression functions with full support regressors. We give low level conditions for consistency of a sieve least squares estimator, and discuss how the choice of norms is used in this result. We also show that weighted norms can be interpreted as a generalization of trimming and we consider a penalized estimator. Second, we discuss the nonparametric instrumental variables model and the role of the norms in this setting. While we focus on relatively simple examples to illustrate the different roles of the norms, the general considerations carry over to more complicated settings.
[1] Our notation is similar to Santos (2012), for example, but it differs from the notation in Chen and Pouzo (2012) and Chen and Pouzo (2015), who use $\|\cdot\|_s$ for the consistency norm.
The rest of this article is organized as follows. We first conclude this section by placing our results in the context of the related literature. Then in Section 2 we review the definitions of the norms and function spaces used throughout the article. Our main results are in Sections 3, 4, and 5, where we consider each of the three cases discussed above. In Section 6 we discuss our applications. Section 7 concludes. Some formal definitions, additional lemmas, and proofs of all results are all given in a Supplementary Appendix.

Related literature
All of our compact embedding results for unweighted function spaces are well known in the mathematics literature (see, e.g., Adams and Fournier 2003). For weighted Sobolev spaces, Kufner (1980) was one of the earliest studies. He focuses on functions with bounded domains, and proves several general embedding theorems for a large class of weight functions. These are not, however, compact embedding results. Schmeisser and Triebel (1987) also study weighted function spaces, but do not prove compact embedding results. As discussed above, Gallant and Nychka (1987) prove an important compact embedding result for functions with unbounded domains. Haroske and Triebel (1994a) prove a general compact embedding result for a large class of weighted spaces. This result, as well as the followup work by Triebel and coauthors, such as Haroske and Triebel (1994b) and Edmunds and Triebel (1996), relies on assumptions which hold for polynomial weights, but not for exponential weights (see Sections 4.1 and 4.2 for details). Moreover, as we show, these results also do not apply to functions with bounded domain. Hence, except in one particular case (see our discussion of Brown and Opic, 1992 below), our compact embedding results using weighted norms for functions on bounded domains are the first that we are aware of. Likewise, except in one particular case (again see our Brown and Opic, 1992 discussion below), our compact embedding results for functions on unbounded domains allow for a much larger class of weight functions than previously allowed. In particular, we allow for exponential weight functions. Note, however, that the results by Triebel and coauthors allow for more general function spaces, including Besov spaces and many others. We focus on Sobolev spaces, Hölder spaces, and spaces of continuously differentiable bounded functions because these are by far the most commonly used function spaces in econometrics.
Brown and Opic (1992) give high level conditions on the weight functions for a compact embedding result similar to that in Gallant and Nychka (1987), for both bounded and unbounded domains. Similar to Gallant and Nychka (1987), this result is only for compact embeddings of a Sobolev $L_p$ space into a space of bounded continuous functions. This result allows for many kinds of exponential weights. In these cases, our results provide simpler lower level conditions on the weight functions, although these conditions are less general. Importantly, we also provide seven further compact embedding results that they do not consider. See Sections 4.2 and 5.2 for more details.
Just seven years after Wald's (1949) consistency proof, Kiefer and Wolfowitz (1956) extended his ideas to apply to nonparametric maximum likelihood estimators.[2] Their results rely on the well-known fact that the space of cdfs is compact under the weak convergence topology. In econometrics, their results have been applied by Cosslett (1983), Heckman and Singer (1984), and Matzkin (1992). More recently, Fox and Gandhi (2016) and  have used similar ideas, relying on this particular compactness result. This compactness result is certainly powerful when the cdf is our object of interest. We are often interested in other functions, however, like pdfs or regression functions. The results in this article can be applied in these cases. Wong and Severini (1991) extended the analysis of nonparametric MLE even further. They still make a compact parameter space assumption, but do not restrict attention to cdfs.
[2] Wald (1949) did attempt to generalize his results to the infinite dimensional case in his final section. His approach, however, is to assume that closed balls are compact (his assumption 9(iv)). As we've discussed, this implies the parameter space is actually finite dimensional.
One approach in the sieve literature uses penalization methods to allow for noncompact parameter spaces. In this case, it is often assumed that the penalty function is lower semicompact. For example, see theorem 3.2 and assumption (ii) of theorem 4.1 in Chen and Pouzo (2012) and assumption 3.2(iii) of Chen and Pouzo (2015). The results in our article can be used to verify this lower semicompactness property. Specifically, for the penalty function $\mathrm{pen}(\cdot) = \|\cdot\|_s$ and consistency norm $\|\cdot\|_c$, lower semicompactness of $\mathrm{pen}(\cdot)$ by definition means that $\|\cdot\|_s$-balls are $\|\cdot\|_c$-compact. Such $\|\cdot\|_c$-compactness of $\|\cdot\|_s$-balls is shown by our combined compact embedding and closedness results. Hence our compact embedding and closedness results together deliver pairs of norms which ensure lower semicompactness of the penalty function. Therefore our results are useful even if one does not want to assume the parameter space itself is compact.
Even when neither compactness nor penalization is necessary for consistency, such as in theorem 3.1 of Chen (2007), an "identifiable uniqueness" or "well separated" point of maximum assumption is needed. Also see van der Vaart (2000) theorem 5.7, van der Vaart and Wellner (1996) lemma 3.2.1, and the discussion in section 2.6 of Newey and McFadden (1994). Compactness combined with continuity of the population objective function provides simple sufficient conditions for this assumption, as Chen (2007) discusses via her condition 3.1''. Similarly, compactness may be used to verify uniform convergence assumptions, such as condition 3.5 of theorem 3.1 in Chen (2007); we discuss this further in Section 6.

Norms for functions
Let $(F, \|\cdot\|_s)$ and $(G, \|\cdot\|_c)$ be Banach spaces with $F \subseteq G$. These could be any of the spaces mentioned below. Our main goal is to understand when the space
$$\Theta = \{ f \in F : \|f\|_s \le B \} \tag{1}$$
is $\|\cdot\|_c$-compact, for various choices of the two norms, where $B > 0$ is a finite constant. $\|\cdot\|_s$ is called the strong norm, since it will be stronger than $\|\cdot\|_c$ in the sense that $\|\cdot\|_c \le M \|\cdot\|_s$ for a finite constant $M$. Because we cannot obtain compactness of $\Theta$ in the strong norm without reducing it to a finite dimensional set, we instead obtain compactness under $\|\cdot\|_c$, which is called the consistency or compactness norm. In econometrics applications, we obtain consistency of our function estimators in this latter norm (see Section 6).
In many applications, the parameter $\theta_0 \in \Theta$ is defined as a maximizer of a population objective function $Q(\cdot)$. That is,
$$\theta_0 = \arg\max_{\theta \in \Theta} Q(\theta).$$
Here $\Theta$ is a finite or infinite dimensional parameter space. The estimator $\hat{\theta}_n$ is then typically obtained by maximizing a sample analog of $Q(\theta)$, denoted by $\hat{Q}(\theta)$. In series estimation $\Theta$ is commonly replaced by a finite dimensional approximation $\Theta_{k_n}$ and the sample objective function may include a penalty function. In particular,
$$\hat{\theta}_n \in \arg\max_{\theta \in \Theta_{k_n}} \big( \hat{Q}(\theta) - \lambda_n \, \mathrm{pen}(\theta) \big),$$
where $\mathrm{pen}(\cdot)$ is the penalty function and $\lambda_n \to 0$ as $n \to \infty$. Specific examples are series regression and NPIV estimation, which we discuss in Section 6. Many general results in the literature either assume that $\Theta$ is compact under $\|\cdot\|_c$ (e.g., assumption 3.5(i) in Ai and Chen, 2003 or assumption 4 in Newey and Powell, 2003) or that $\mathrm{pen}(\cdot)$ is lower semicompact under $\|\cdot\|_c$ (e.g., theorems 3.2 and 4.1 of Chen and Pouzo, 2012 or assumption 3.2(iii) in Chen and Pouzo, 2015). For the penalty function $\mathrm{pen}(\cdot) = \|\cdot\|_s$ and consistency norm $\|\cdot\|_c$, lower semicompactness of $\mathrm{pen}(\cdot)$ means that $\|\cdot\|_s$-balls are $\|\cdot\|_c$-compact. Our results provide low level conditions for these types of assumptions.
Note that in specific cases, compactness assumptions are not necessarily required; for example, see Newey (1997), Hall and Horowitz (2005), or Darolles et al. (2011).
Since the choice of norm for infinite dimensional function spaces matters, we begin with a brief survey of the three kinds of norms most commonly used in econometrics: Sobolev sup-norms, Sobolev integral norms, and Hölder norms. These norms are defined for functions $f : D \to \mathbb{R}$ where the domain $D$ is an open subset of $\mathbb{R}^{d_x}$, possibly the entire space $\mathbb{R}^{d_x}$, for an integer $d_x \ge 1$. For these functions, denote the differential operator by
$$\nabla^\lambda = \frac{\partial^{|\lambda|}}{\partial x_1^{\lambda_1} \cdots \partial x_{d_x}^{\lambda_{d_x}}},$$
where $\lambda = (\lambda_1, \ldots, \lambda_{d_x})$ is a multi-index, a $d_x$-tuple of nonnegative integers, and $|\lambda| = \lambda_1 + \cdots + \lambda_{d_x}$.
The first space we consider consists of continuously differentiable functions whose derivatives are uniformly bounded. Let $m$ be a nonnegative integer. For an $m$-times differentiable function $f : D \to \mathbb{R}$, define the weighted Sobolev sup-norm of $f$ as
$$\|f\|_{m,\infty,\mu} = \max_{0 \le |\lambda| \le m} \sup_{x \in D} |\nabla^\lambda f(x)| \, \mu(x).$$
Here $\mu : D \to \mathbb{R}_+$ is a continuous nonnegative weight function. Let $\|\cdot\|_{m,\infty}$ denote the unweighted Sobolev sup-norm; that is, the weighted Sobolev sup-norm with the identity weight $\mu(x) \equiv 1$. For the identity weight and $m = 0$, $\|\cdot\|_{m,\infty,\mu}$ is just the usual sup-norm. Relatedly, notice that
$$\|f\|_{m,\infty,\mu} = \max_{0 \le |\lambda| \le m} \|\nabla^\lambda f\|_{0,\infty,\mu}.$$
Let $C_m(D)$ denote the space of $m$-times continuously differentiable functions $f : D \to \mathbb{R}$. Let
$$C_{m,\infty,\mu}(D) = \{ f \in C_m(D) : \|f\|_{m,\infty,\mu} < \infty \}.$$
The normed vector space $(C_{m,\infty,\mu}(D), \|\cdot\|_{m,\infty,\mu})$ is $\|\cdot\|_{m,\infty,\mu}$-complete, and hence it is a $\|\cdot\|_{m,\infty,\mu}$-Banach space. When $\mu(x) \equiv 1$, define $C_{m,\infty}(D) = C_{m,\infty,1}(D)$.
The next space we consider replaces the sup-norm with an $L_p$ norm. Let $p$ satisfy $1 \le p < \infty$. To ensure completeness of this space, we must allow for functions which are only weakly differentiable, rather than just classically differentiable functions. When both the classical and weak derivatives exist, they are equal. We denote both kinds of derivatives by $\nabla^\lambda f$. Adams and Fournier (2003) formally define the weak derivative on page 22, and we briefly discuss the completeness issue in section J of the Supplementary Appendix.
Let $W_m(D)$ denote the set of all functions $f : D \to \mathbb{R}$ which have weak derivatives of all orders $0 \le |\lambda| \le m$. For $f \in W_m(D)$, define the weighted Sobolev $L_p$ norm of $f$ as
$$\|f\|_{m,p,\mu} = \Big( \sum_{0 \le |\lambda| \le m} \int_D |\nabla^\lambda f(x)|^p \, \mu(x) \, dx \Big)^{1/p},$$
where $\mu$ is a weight function as above. We also call this a Sobolev integral norm. Let $\|\cdot\|_{m,p}$ denote the unweighted Sobolev $L_p$ norm. For the identity weight and $m = 0$, $\|\cdot\|_{m,p,\mu}$ is just the usual $L_p$ norm. Relatedly, notice that
$$\|f\|_{m,p,\mu}^p = \sum_{0 \le |\lambda| \le m} \|\nabla^\lambda f\|_{0,p,\mu}^p.$$
Define the weighted Sobolev space by
$$W_{m,p,\mu}(D) = \{ f \in W_m(D) : \|f\|_{m,p,\mu} < \infty \}.$$
The normed vector space $(W_{m,p,\mu}(D), \|\cdot\|_{m,p,\mu})$ is $\|\cdot\|_{m,p,\mu}$-complete,[5] and hence it is a $\|\cdot\|_{m,p,\mu}$-Banach space. When $\mu(x) \equiv 1$, define $W_{m,p}(D) = W_{m,p,1}(D)$. For both of the weighted Sobolev norms, there is a less common alternative approach to incorporating the weighting function, which we discuss in Section 4.3.
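The final space of functions we consider is similar to the space of functions with bounded unweighted Sobolev sup-norms. Define the Hölder coefficient of a function $f : D \to \mathbb{R}$ by
$$[f]_\nu = \sup_{x, y \in D, \, x \ne y} \frac{|f(x) - f(y)|}{\|x - y\|_e^\nu}$$
for an exponent $\nu \in (0, 1]$, and define the Hölder norm by
$$\|f\|_{m,\infty,1,\nu} = \|f\|_{m,\infty} + \max_{|\lambda| = m} [\nabla^\lambda f]_\nu,$$
where recall that $\|\cdot\|_{m,\infty}$ is the unweighted Sobolev sup-norm. The Hölder coefficient generalizes the supremum over the derivative; for differentiable functions $f$, $[f]_1$ coincides with the supremum of the magnitude of the derivative (for instance, $[f]_1 = \sup_{x \in D} |f'(x)|$ when $d_x = 1$ and $D$ is an interval). The Hölder coefficient $[f]_1$, however, is also defined for nondifferentiable functions $f$. Define the Hölder space with exponent $\nu$ by
$$C_{m,\infty,1,\nu}(D) = \{ f \in C_m(D) : \|f\|_{m,\infty,1,\nu} < \infty \}.$$
The normed vector space $(C_{m,\infty,1,\nu}(D), \|\cdot\|_{m,\infty,1,\nu})$ is $\|\cdot\|_{m,\infty,1,\nu}$-complete. We discuss weighted Hölder spaces, along with an alternative approach to weighted Sobolev spaces, in Section 4.3. For all of these function spaces, we omit the domain $D$ from the notation when it is understood.
[5] Under Assumption 6 below. Again see theorem 5.1 of Rodríguez et al. (2004).
[6] $\nu > 1$ is excluded since $[f]_\nu < \infty$ for $\nu > 1$ implies that $f$ is constant.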
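As a rough numerical illustration of the Hölder coefficient, the following Python sketch approximates $[f]_\nu$ on a finite grid, assuming the standard definition $\sup_{x \ne y} |f(x) - f(y)| / |x - y|^\nu$ with $d_x = 1$. The function name `holder_coefficient`, the grid on $[-1, 1]$, and the two test functions are illustrative choices rather than part of the original analysis. A grid maximum only bounds the true supremum from below.

```python
import math

def holder_coefficient(f, grid, nu):
    """Grid approximation of sup_{x != y} |f(x) - f(y)| / |x - y|**nu."""
    best = 0.0
    for i, x in enumerate(grid):
        for y in grid[i + 1:]:
            best = max(best, abs(f(x) - f(y)) / abs(x - y) ** nu)
    return best

# 201 equally spaced points in [-1, 1]; the point 0.0 is on the grid.
grid = [-1.0 + 2.0 * i / 200 for i in range(201)]

def sqrt_abs(x):
    return math.sqrt(abs(x))

# |x| is not differentiable at 0 but has a finite coefficient for nu = 1.
lipschitz_abs = holder_coefficient(abs, grid, 1.0)
# sqrt(|x|) has a finite coefficient for nu = 1/2 ...
holder_sqrt = holder_coefficient(sqrt_abs, grid, 0.5)
# ... but its nu = 1 coefficient is infinite; on a grid it just grows as the mesh shrinks.
lipschitz_sqrt = holder_coefficient(sqrt_abs, grid, 1.0)
```

On this grid the first two quantities are approximately 1, while the third is approximately 10 and would keep increasing on finer grids, which is the sense in which $\sqrt{|x|}$ lies in a Hölder space with $\nu = 1/2$ but not with $\nu = 1$.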

Functions on bounded domains
The general approach to obtaining $\|\cdot\|_c$-compactness of $\Theta$ has two steps. First, we prove that $\Theta$ is relatively $\|\cdot\|_c$-compact, meaning that the $\|\cdot\|_c$-closure of $\Theta$ is $\|\cdot\|_c$-compact. This is essentially what it means for the space $(F, \|\cdot\|_s)$ to be compactly embedded in the space $(G, \|\cdot\|_c)$, which is denoted by $F \hookrightarrow G$. See Supplementary Appendix A for a precise definition. Next, we show that $\Theta$ is actually $\|\cdot\|_c$-closed, and hence its $\|\cdot\|_c$-closure is just $\Theta$ itself. Consequently, $\Theta$ itself is $\|\cdot\|_c$-compact.
Thus our first result concerns compact embeddings.
The cone condition, which is a geometric regularity condition on the shape of $D$, is formally defined in Supplementary Appendix A. When $d_x = 1$, a sufficient condition for the cone condition is that $D$ is a finite union of open, possibly unbounded intervals. When $d_x > 1$, a sufficient condition is that $D$ is the product of such finite unions. As we cite in the proof, all of the results in Theorem 1 are well known in mathematics. Result 5 shows that sets bounded under the Hölder norm are relatively compact under the Sobolev sup-norm, even with the same number of derivatives; the extra Hölder coefficient piece is sufficient to yield relative compactness. Result 3 shows that sets bounded under Sobolev sup-norms are relatively compact under Sobolev sup-norms using fewer derivatives. Result 2 shows that sets bounded under Sobolev $L_2$ norms are relatively compact under Sobolev $L_2$ norms with fewer derivatives, where the number of derivatives we have to drop depends on the dimension $d_x$ of the domain. Finally, results 1 and 4 show the relationship between the Sobolev sup-norm and the Sobolev $L_2$ norm. Sets bounded under one are relatively compact under the other with fewer derivatives, where again the number of derivatives we must drop depends on $d_x$.
By combining cases 4 and 5 and applying Lemma 4 in the Supplementary Appendix, we also obtain compact embedding of Hölder spaces into Sobolev $L_2$ spaces. Here and throughout the article, however, we focus only on the function space combinations which are most commonly used in econometrics.
Theorem 1 only shows that sets bounded under the norm $\|\cdot\|_s$ on the left hand side of the $\hookrightarrow$ are relatively compact under the norm $\|\cdot\|_c$ on the right hand side of the $\hookrightarrow$. As mentioned earlier, this means that their $\|\cdot\|_c$-closure is $\|\cdot\|_c$-compact. The following theorem shows that in some of these cases, $\|\cdot\|_s$-closed balls are $\|\cdot\|_c$-closed as well.
Theorem 2. (Closedness) Let $D \subseteq \mathbb{R}^{d_x}$ be a bounded open set, where $d_x \ge 1$ is some integer. Let $m, m_0 \ge 0$ be integers. Let $\nu \in (0, 1]$. Let $(F, \|\cdot\|_s)$ and $(G, \|\cdot\|_c)$ be Banach spaces with $F \subseteq G$, where $\|f\|_s < \infty$ for all $f \in F$ and $\|f\|_c < \infty$ for all $f \in G$. Define $\Theta$ as in equation (1). Then the results in Table 1 hold under the following additional assumptions: For cases (1) and (2) we also assume $m_0 > d_x / 2$ and $D$ satisfies the cone condition. For cases (3) and (4) we also assume $m_0 \ge 1$. For case (5) we also assume $D$ satisfies the cone condition.
Results 1, 2, and 5 of Theorem 2 combined with results 1, 2, and 5 of Theorem 1 give pairs of strong and consistency norms such that the $\|\cdot\|_s$-ball $\Theta$ defined in Eq. (1) is $\|\cdot\|_c$-compact. We illustrate how to apply these results in Section 6. We also discuss additional implications of the choice of norms in that section.
For results 3 and 4, however, we see that $\Theta$ is not $\|\cdot\|_c$-closed. We could nonetheless proceed by simply agreeing to work with the $\|\cdot\|_c$-closure $\bar{\Theta}$ of $\Theta$ instead. Theorem 1 then ensures that this $\|\cdot\|_c$-closure is $\|\cdot\|_c$-compact. Moreover, by the very definition of the closure, every element in the closure can be approximated arbitrarily well by an element in the original set. Hence, as is needed in econometrics applications, we can construct sequences of approximations that still satisfy any necessary rate conditions. In sieve estimation, the choice of sieve space in practice also will not be affected by whether we use the closure or not.[7] Working with the closure is precisely what Gallant and Nychka (1987) did, until Santos' (2012) Lemma A.1 showed that their parameter space was actually closed, thus proving result 2 in Theorem 2 above.
[7] This extension to the closure may, however, affect other assumptions in one's analysis. For example, an estimation criterion function will now be defined over the closure and hence any assumptions on it typically must be satisfied over this extended domain.
Nonetheless, as with Santos' (2012) result, it is informative to know when the closure can be characterized. In case 3, a simple characterization is possible. Here the strong norm is the Sobolev sup-norm. It turns out that the $\|\cdot\|_c$-closure is precisely a Hölder space with exponent $\nu = 1$, as we show in Supplementary Appendix L. Hence, there is no difference between working with the $\|\cdot\|_c$-closure in case 3 or just using case 5 with $\nu = 1$ and one fewer derivative (the closure in case 3 will contain functions whose $(m + m_0)$'th derivatives do not exist). This is one reason why we sometimes use the Hölder norm rather than the conceptually simpler Sobolev sup-norm. We are unaware of any simple characterizations of the closure in case 4.

Functions on unbounded domains
Gallant and Nychka (1987) extended the first compact embedding result from Theorem 1 to spaces of functions on $D = \mathbb{R}^{d_x}$. In this section, we show how to further extend their result in several ways. In particular, our results allow for exponential weighting functions, as well as the standard polynomial weighting functions used by Gallant and Nychka and subsequent authors. We also extend results 2-4 of Theorem 1 as well as the closedness results of Theorem 2 to $D = \mathbb{R}^{d_x}$. All of these results use weighted norms, as introduced in Section 2.
There are at least two reasons to use weighted norms for functions on $\mathbb{R}^{d_x}$. The first is that many functions do not satisfy unweighted norm bounds. For example, the linear function $f(x) = x$ on $\mathbb{R}$ has $\|f\|_{0,\infty} = \infty$. By sufficiently downweighting the tails of $f$, however, the linear function can have a finite weighted sup-norm. The second reason is that even when a function $f$ satisfies an unweighted norm bound, we can upweight the tails of $f$, which yields a stronger norm than the unweighted norm. This makes our concept of convergence finer. As in Gallant and Nychka's application, this is often the case with probability density functions, since they must converge to zero in their tails.
A further subtlety is that we actually need to use two different weighting functions: one for the strong norm $\|\cdot\|_s$, denoted by $\mu_s$, and another for the consistency norm $\|\cdot\|_c$, denoted by $\mu_c$. The reason comes from the main step in Gallant and Nychka's compact embedding argument. Their idea was to truncate the domain $D = \mathbb{R}^{d_x}$ by considering a ball centered at the origin and its complement. Inside the ball, we can apply one of the results from Theorem 1. The piece outside the ball, which depends on tail values of the functions and their weights, is made small by swapping out one weight function for another, and then using the properties of these two weight functions.
In the following Section 4.1, we discuss the various classes of weight functions we will use. In many cases, these weight functions are more general than those considered in Gallant and Nychka (1987) and elsewhere in the literature. In Section 4.2 we give the main compact embedding and closedness results for functions on $D = \mathbb{R}^{d_x}$.

Weight functions
Throughout this section we let $\mu, \mu_c, \mu_s : D \to \mathbb{R}_+$ be nonnegative functions and $m, m_0 \ge 0$ be integers. We first discuss some general properties of weight functions. We then examine several specific examples. We conclude by discussing general assumptions on the classes of weight functions we use in our main compact embedding and closedness results, and show that these hold for specific examples.
Our first result is simple, but important.
Proposition 1. Suppose there are constants $M_1$ and $M_2$ such that
$$0 < M_1 \le \mu(x) \le M_2 < \infty$$
for all $x \in D$. Then
1. $\|\cdot\|_{m,\infty,\mu}$ and $\|\cdot\|_{m,\infty}$ are equivalent norms.
2. $\|\cdot\|_{m,2,\mu}$ and $\|\cdot\|_{m,2}$ are equivalent norms.
Proposition 1 says that weight functions which are bounded away from zero and infinity are trivial in the sense that they do not actually generate a new topology. Consequently, any nontrivial weight function must either diverge to infinity (upweighting) or converge to zero (downweighting) for some sequence of points in D. These are the only two kinds of weight functions we must consider.
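The equivalence in Proposition 1 can be checked numerically in the simplest case. The sketch below is a minimal illustration, assuming $m = 0$, $d_x = 1$, and the weighted sup-norm $\sup_x |f(x)| \mu(x)$; the weight $\mu(x) = 2 + \sin(x)$, the grid, and the test functions are hypothetical choices made only so that $1 \le \mu(x) \le 3$. It verifies the two-sided bound $M_1 \|f\|_{0,\infty} \le \|f\|_{0,\infty,\mu} \le M_2 \|f\|_{0,\infty}$ on a grid.

```python
import math

def sup_norm(f, grid, weight=lambda x: 1.0):
    """Grid approximation of sup_x |f(x)| * weight(x) (the case m = 0)."""
    return max(abs(f(x)) * weight(x) for x in grid)

# Hypothetical weight, bounded away from zero and infinity: 1 <= mu(x) <= 3.
M1, M2 = 1.0, 3.0

def mu(x):
    return 2.0 + math.sin(x)

grid = [-10.0 + 0.01 * i for i in range(2001)]
test_functions = [math.cos, math.tanh, lambda x: x / (1.0 + x * x)]

# The weighted norm is sandwiched between M1 and M2 times the unweighted norm.
sandwich_holds = all(
    M1 * sup_norm(f, grid) <= sup_norm(f, grid, mu) <= M2 * sup_norm(f, grid)
    for f in test_functions
)
```

Because the same two constants work for every function, convergence in one norm is equivalent to convergence in the other, which is why such weights do not generate a new topology.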
The next result shows that upweighting only allows for functions which go to zero in their tails. Recall that $\|\cdot\|_e$ denotes the Euclidean norm.
This result implies that derivatives of $f$ must go to zero in the tails when $f$ is bounded in one of the upweighted Sobolev norms $\|\cdot\|_{m,\infty,\mu}$ or $\|\cdot\|_{m,2,\mu}$ with $m > 0$. Proposition 2 implies that the choice between upweighting and downweighting will primarily depend on whether we want to study spaces with functions $f$ that do not go to zero at infinity. For example, for spaces of regression functions, we typically will choose downweighting. For spaces of probability density functions, we typically will choose upweighting as in Gallant and Nychka (1987).

Polynomial weights
The most common weight function used in econometrics is the polynomial weighting function
$$\mu(x) = (1 + x'x)^\delta,$$
where $\delta \in \mathbb{R}$ is a constant. If $\delta > 0$ then this function upweights for large values of $x$, while if $\delta < 0$ then this function downweights for large values of $x$.
One reason that polynomial weights are ubiquitous is that the well-known compact embedding result of Gallant and Nychka (1987) applies specifically to polynomial weights. In our Theorem 3 below, we restate this result and show how to generalize it to allow for additional classes of weight functions. There, as in Section 3, we want to understand when spaces of functions $\Theta = \{ f \in F : \|f\|_s \le B \}$ are $\|\cdot\|_c$-compact, where $(F, \|\cdot\|_s)$ is a Banach space and $B < \infty$ is a constant. To allow for the space $F$ to contain functions with domain $D = \mathbb{R}^{d_x}$, we will choose $\|\cdot\|_s$ and $\|\cdot\|_c$ to be weighted norms, with corresponding weights $\mu_s$ and $\mu_c$, respectively.
To understand what it means for a function to have a bounded weighted norm, consider the Sobolev sup-norm case where $\|\cdot\|_s = \|\cdot\|_{m+m_0,\infty,\mu_s}$ with polynomial weights $\mu_s(x) = (1 + x'x)^{\delta_s}$. Then $f \in \Theta$ implies that
$$|\nabla^\lambda f(x)| \le B (1 + x'x)^{-\delta_s}$$
for every $0 \le |\lambda| \le m + m_0$. Consider the upweighting case, $\delta_s > 0$. We have already pointed out that upweighting implies the levels of $f$ and its derivatives must go to zero in their tails. But here, with the specific polynomial form on the weight function, we know the precise rate at which the tails must go to zero:
$$|\nabla^\lambda f(x)| = O\big( \|x\|_e^{-2\delta_s} \big) \tag{2}$$
as $\|x\|_e \to \infty$, for each $0 \le |\lambda| \le m + m_0$. For example, with $d_x = 1$ and $\delta_s = 1$, $|f(x)|$ can go to zero at the same rate as $\mu_s(x)^{-1} = 1/(1 + x^2) = O(x^{-2})$. Recall that the t-distribution with one degree of freedom has pdf $C/(1 + x^2)$ where $C$ is a normalizing constant. So the fattest tails $|f(x)|$ can have are these t-like tails.
Next consider the downweighting case, $\delta_s < 0$. Then $|f(x)|$ no longer has to converge to zero in the tails. But it also cannot diverge too quickly. The norm bound tells us exactly how fast it can diverge, which is given exactly as in Eq. (2). With polynomial weights, the choice of $\delta_s$ determines the maximum order polynomial that is in $\Theta$. In general, for $\delta_s = -n$ where $n$ is a natural number, $\mu_s(x)^{-1} = O(x^{2n})$ is the highest order polynomial allowed. A similar analysis applies for the Sobolev $L_2$ norm, for both downweighting and upweighting.
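The degree restriction under polynomial downweighting can be seen numerically. The following sketch is only an illustration, assuming $d_x = 1$, the weight $(1 + x^2)^{\delta}$ with $\delta = -n$, and the weighted sup-norm of the level of the function only (derivatives are ignored); the helper `weighted_sup_of_monomial` and the chosen radii are hypothetical. It evaluates $\sup_{|x| \le R} |x|^k (1 + x^2)^{\delta}$ for growing $R$.

```python
def weighted_sup_of_monomial(k, delta, radius, step=0.5):
    """Grid sup over [-radius, radius] of |x|**k * (1 + x*x)**delta."""
    n_points = int(2 * radius / step) + 1
    grid = [-radius + step * i for i in range(n_points)]
    return max(abs(x) ** k * (1.0 + x * x) ** delta for x in grid)

n = 2                                  # polynomial downweighting with delta_s = -n
radii = [10.0, 100.0, 1000.0]

# Degree 2n: the weighted sup stays below 1 no matter how far out we look.
allowed = [weighted_sup_of_monomial(2 * n, -n, r) for r in radii]
# Degree 2n + 1: the weighted sup grows roughly linearly in the radius.
excluded = [weighted_sup_of_monomial(2 * n + 1, -n, r) for r in radii]
# The linear function f(x) = x with delta_s = -1: finite weighted sup-norm.
linear = [weighted_sup_of_monomial(1, -1, r) for r in radii]
```

The values in `allowed` remain bounded, those in `excluded` keep growing with the radius, and `linear` stabilizes at $1/2$ (attained at $x = 1$), even though the unweighted sup-norm of $f(x) = x$ on $\mathbb{R}$ is infinite.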

Exponential weights
An alternative to polynomial weighting is the class of exponential weights
$$\mu(x) = \exp(\delta \, x'x).$$
Again, $\delta \in \mathbb{R}$, where $\delta > 0$ corresponds to upweighting and $\delta < 0$ corresponds to downweighting.
As with polynomial weights, we want to understand what it means for a function to be in the $\|\cdot\|_s$-ball $\Theta$, where $\|\cdot\|_s$ is a weighted norm. Consider the Sobolev sup-norm case $\|\cdot\|_s = \|\cdot\|_{m+m_0,\infty,\mu_s}$ with $\mu_s(x) = \exp[\delta_s (x'x)]$. Then $f \in \Theta$ implies that
$$|\nabla^\lambda f(x)| \le B \exp[-\delta_s (x'x)]$$
for every $0 \le |\lambda| \le m + m_0$. Hence
$$|\nabla^\lambda f(x)| = O\big( \exp[-\delta_s (x'x)] \big)$$
as $\|x\|_e \to \infty$, for each $0 \le |\lambda| \le m + m_0$. Consider the downweighting case $\delta_s < 0$. Then we see that by using exponential weights we can allow for $|\nabla^\lambda f(x)|$ to diverge to infinity at an exponential rate. In particular, $|\nabla^\lambda f(x)|$ can diverge at any polynomial rate. More precisely, $|\nabla^\lambda f(x)|$ proportional to $x^n$ for any natural number $n > 0$ will satisfy the specified rate, regardless of the value of $\delta_s < 0$. In contrast, using polynomial downweighting for $\mu_s$ requires specifying a maximum order of polynomial allowed.[8]
Consider the upweighting case, $\delta_s > 0$. We have already pointed out that upweighting implies the levels of $f$ and its derivatives must go to zero in their tails. But here, with the specific exponential form on the weight function, we know the precise rate at which the tails must go to zero: $O(\exp[-\delta_s (x'x)])$. In applications, this is likely to be very restrictive. For example, it rules out t-distribution like tails. For this reason, we do not discuss exponential upweighting any further. A similar analysis applies for the Sobolev $L_2$ norm, for both downweighting and upweighting.
While we focus on the weights $\mu(x) = \exp(\delta \|x\|_e^2)$ throughout this article, one could consider a wide variety of exponential weight functions, such as $\exp(\delta \|x\|_e^\kappa)$ where $\kappa \in \mathbb{R}$ is an additional weight function parameter. Another possibility is to use a different finite dimensional norm, like the $\ell_1$-norm $\|x\|_1 = \sum_{k=1}^{d_x} |x_k|$. This yields the weight function $\exp(\delta \|x\|_1^\kappa)$.
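To illustrate why exponential downweighting admits polynomials of every degree, the sketch below computes $\sup_{x \ge 0} x^n \exp(\delta x^2)$ for several degrees $n$ and compares a grid approximation with the closed form obtained from the first-order condition, $(-n / (2\delta))^{n/2} e^{-n/2}$, attained at $x = \sqrt{-n/(2\delta)}$. It assumes $d_x = 1$ and considers only the level of the function; the function names, grid, and degrees are illustrative choices.

```python
import math

def weighted_sup_exponential(n, delta, radius=20.0, step=0.001):
    """Grid sup over [0, radius] of x**n * exp(delta * x*x), with delta < 0."""
    n_points = int(radius / step) + 1
    return max(
        (step * i) ** n * math.exp(delta * (step * i) ** 2)
        for i in range(n_points)
    )

def closed_form_sup(n, delta):
    """Exact sup of x**n * exp(delta * x*x) over x >= 0 for delta < 0."""
    return (-n / (2.0 * delta)) ** (n / 2.0) * math.exp(-n / 2.0)

delta_s = -1.0                 # exponential downweighting
degrees = [1, 4, 10]
numeric = [weighted_sup_exponential(n, delta_s) for n in degrees]
exact = [closed_form_sup(n, delta_s) for n in degrees]
```

For every degree the supremum is finite, so each monomial has a finite exponentially weighted sup-norm for the same $\delta_s$, in contrast to polynomial downweighting, where the admissible degree is capped by $\delta_s$.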

Assumptions on weight functions
With these two main classes of weight functions in mind, we state our main results for the two general weight functions $\mu_s$ and $\mu_c$ used in defining the strong and consistency norms. We will, however, make several assumptions on these weight functions. We then verify that these assumptions hold for either polynomial or exponential weights, or both. The first assumption states that the consistency norm weight goes to zero faster than the strong norm weight as we go further out in the tails.
Here $\mathrm{dist}(x, \mathrm{Bd}(D)) = \min_{y \in \mathrm{Bd}(D)} \|x - y\|_e$, where $\mathrm{Bd}(D)$ denotes the boundary of the closure of $D$. As discussed earlier, the key idea to prove compact embedding is to truncate the domain $\mathbb{R}^{d_x}$, and then ensure that the norm outside the truncated region is small. Assumption 1 is one part of ensuring that this step works. Both polynomial and exponential weights have the form $q(x)^\delta$ where $q(x) \to \infty$ as $\|x\|_e \to \infty$. Hence for both kinds of weights the ratio is $\mu_c(x)/\mu_s(x) = q(x)^{\delta_c - \delta_s}$, and so Assumption 1 holds by choosing $\delta_c < \delta_s$.
Footnote: In applications, we may obtain bounds of the form $\|f\|_{m,p,\mu} \le B$ from moment conditions and lower bounds on the density $f_X$ of a random variable $X$. For example, suppose $E(|\nabla^\lambda f(X)|) \le C$ for all $0 \le |\lambda| \le m$ and $\mu(x) \le f_X(x)$ for all $x \in D$.
The following assumption, which bounds the ratio for all x, not just x's in the limit, plays a similar role in the proof.
Assumption 2. There is a finite constant $M_5 > 0$ such that $\mu_c(x)/\mu_s(x) \le M_5$ for all $x \in D$.
As above, Assumption 2 holds for both polynomial and exponential weights with $\delta_c < \delta_s$. The next assumption bounds the derivatives of the (square root) strong norm weight function by its (square root) levels. This assumption is precisely what Gallant and Nychka (1987) used in their analysis. It was also used by Schmeisser and Triebel (1987, page 246, Equation 2), and in follow-up work including Haroske and Triebel (1994a, 1994b) and Edmunds and Triebel (1996). Gallant and Nychka's Lemma A.2 proves the following result. Assumption 3 also holds for certain kinds of exponential weights. For example, for $d_x = 1$ and $\delta_s = -1$ consider $\mu_s(x) = \exp(-|x|)$. Then the weak derivative of $\sqrt{\mu_s(x)}$ with respect to $x$ is $-\tfrac{1}{2}\sqrt{\mu_s(x)}\,\mathrm{sign}(x)$, and hence $\left|\tfrac{\partial}{\partial x}\sqrt{\mu_s(x)}\right| / \sqrt{\mu_s(x)} = \left|-\tfrac{1}{2}\mathrm{sign}(x)\right| \le 1$. Assumption 3 does not allow for many other kinds of exponential weights, however. For example, consider $d_x = 1$ and $\delta_s = -1$ again but now using the Euclidean norm for $x$: $\mu_s(x) = \exp(-x^2)$, so that $\left|\tfrac{\partial}{\partial x}\sqrt{\mu_s(x)}\right| / \sqrt{\mu_s(x)} = |x|$. The function $|x|$ is unbounded on $\mathbb{R}$ and so Assumption 3 fails. The function $|x|$ is, however, bounded on any compact subset of $\mathbb{R}$. This motivates the following weaker version of Assumption 3.
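The two exponential examples above can be compared numerically. The following sketch is an illustration, not part of the original analysis: it approximates the ratio of the derivative of the square root weight to its level by a central difference, and the helper name `derivative_to_level_ratio`, the step size, and the evaluation points are assumptions chosen for the example.

```python
import math

def derivative_to_level_ratio(sqrt_weight, x, h=1e-6):
    """|d/dx sqrt_weight(x)| / sqrt_weight(x), using a central difference."""
    deriv = (sqrt_weight(x + h) - sqrt_weight(x - h)) / (2.0 * h)
    return abs(deriv) / sqrt_weight(x)

# Square roots of the two weights discussed in the text (d_x = 1, delta_s = -1).
sqrt_abs = lambda x: math.exp(-abs(x) / 2.0)   # sqrt of exp(-|x|)
sqrt_sq = lambda x: math.exp(-x * x / 2.0)     # sqrt of exp(-x^2)

points = [0.5, 1.0, 5.0, 10.0, 20.0]           # evaluation points away from the kink at 0
ratios_abs = [derivative_to_level_ratio(sqrt_abs, x) for x in points]
ratios_sq = [derivative_to_level_ratio(sqrt_sq, x) for x in points]

for x, r_abs, r_sq in zip(points, ratios_abs, ratios_sq):
    # r_abs stays constant (bounded by 1); r_sq grows like |x|.
    print(x, r_abs, r_sq)
```

The first ratio stays constant in $x$, so a single bound works on all of $\mathbb{R}$; the second grows linearly, so a bound exists only on compact sets, which is the distinction between Assumptions 3 and 4.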
Assumption 4. For every compact subset $C \subseteq D$, there is a constant $K_C < \infty$ such that $|\nabla^\lambda \mu_s^{1/2}(x)| \le K_C\, \mu_s^{1/2}(x)$ for all $|\lambda| \le m + m_0$ and for all $x \in C$. This relaxation of Assumption 3 will also be important in Section 5 when we consider weighted norms for functions with bounded domains. The following proposition shows that exponential weights using the Euclidean norm satisfy Assumption 4. Also note that polynomial weights immediately satisfy it since they satisfy the stronger Assumption 3. Finally, for one of our results we use the following assumption.
Assumption 5. There is a function g(x) such that the following hold.
1. $g(x) \to \infty$ as $\|x\|_e \to \infty$ (for $D = \mathbb{R}^{d_x}$) or as $\mathrm{dist}(x, \mathrm{Bd}(D)) \to 0$ (for bounded $D$).
In the Supplementary Appendix K we give some intuitive discussion of Assumption 5. The main purpose of considering Assumption 5 is similar to our motivation for Assumption 4: it allows for cases where Assumption 3 does not hold. In particular, in the following proposition we show that Assumption 5 holds for exponential weights.
Proposition 5. Let $\mu_c(x) = \exp[\delta_c(x'x)]$, $\mu_s(x) = \exp[\delta_s(x'x)]$, and $D = \mathbb{R}^{d_x}$. Then Assumption 5 holds for any $\delta_s, \delta_c \in \mathbb{R}$ such that $\delta_c < \delta_s$.
Our final assumption on the weight functions ensures that the weighted spaces are complete. See Kufner and Opic (1984) and more recently Rodríguez et al. (2004) for more details. This assumption is a minor modification of the first part of assumption H in Brown and Opic (1992).$^{9}$ For $D = \mathbb{R}^{d_x}$, Assumption 6 rules out weights like $\mu_c(x) = (x'x)^2$ since then $\mu_c(x)$ is not bounded away from zero on $(0,1)$, for example. This assumption is satisfied by $\mu_c(x) = (1 + x'x)^2$, however, and more generally by $\mu_c(x) = (1 + x'x)^{\delta_c}$, $\delta_c \in \mathbb{R}$. It is also satisfied by the exponential weights $\mu_c(x) = \exp[\delta_c(x'x)]$. This assumption is also satisfied by indicator weight functions like $\mu_c(x) = 1(\|x\|_e \le M)$ for some constant $M$.

Compact embeddings and closedness results
As in the bounded domain case, we begin with a compact embedding result.
Theorem 3. (Compact Embedding) Let $D = \mathbb{R}^{d_x}$ for some integer $d_x \ge 1$. Let $\mu_c, \mu_s : D \to \mathbb{R}_+$ be nonnegative, $m + m_0$ times continuously differentiable functions. Let $m, m_0 \ge 0$ be integers. Suppose assumptions 1, 2, 4, and 6 hold. Then the following embeddings are compact:
9 As discussed in the proof of Theorem 3, Assumption 6 could be weakened slightly to a local integrability assumption.
Using the stronger Assumption 3, Gallant and Nychka (1987) showed case (1) in their Lemma A.4. Case (1) with polynomial weights was used, for example, by Newey and Powell (2003) and Santos (2012).$^{10}$ Under the stronger Assumption 3, Haroske and Triebel (1994a) show cases (1)-(4) as special cases of their theorem on page 136. Haroske and Triebel furthermore assume via their definition 1(ii) on page 133 that the weight functions have at most polynomial growth. Their results therefore do not allow for any exponential weights. For example, for $d_x = 1$, they do not allow for either $\mu(x) = \exp(\delta|x|)$ or $\mu(x) = \exp(\delta x^2)$. Brown and Opic (1992) give high level conditions for a compact embedding result similar to case (1), with $m_0 = 1$ and $m = 0$. They do not study the other cases we consider. They do, however, allow for a large class of weight functions, which includes the exponential weight functions we discussed earlier (e.g., see their Example 5.5 plus Remark 5.2).
To our best knowledge, cases (2)-(4) with any kind of exponential weight function have not been shown in the literature. The proof for these cases is similar to that for case (1), which is a modification of Gallant and Nychka's original proof. Our result for case (1) gives lower level conditions on the weight functions compared to Brown and Opic (1992), although these conditions are less general. Finally, note that the results by Triebel and coauthors allow for more general function spaces, including Besov spaces and many others, although again, they restrict attention to weight functions with at most polynomial growth.
Theorem 4. (Closedness) Let $D = \mathbb{R}^{d_x}$ where $d_x \ge 1$ is some integer. Let $m, m_0 \ge 0$ be integers. Let $(F, \|\cdot\|_s)$ and $(G, \|\cdot\|_c)$ be Banach spaces with $F \subseteq G$, where $\|f\|_s < \infty$ for all $f \in F$ and $\|f\|_c < \infty$ for all $f \in G$. Define $\Theta$ as in Eq. (1). Suppose assumptions 1, 2, and 4 hold. Then the results of Table 2 hold under the following additional assumptions: For cases (1) and (2) we also assume $m_0 > d_x/2$ and that assumption 6 holds, and in case (1) also that assumption 5 holds. For cases (3) and (4) we also assume $m_0 \ge 1$.
Just as in Section 3, Theorems 3 and 4 can be combined to show that the $\|\cdot\|_s$-ball $\Theta$ is $\|\cdot\|_c$-compact by choosing various combinations of strong and consistency norms given in Table 2. All of our remarks in that section apply here as well. The only new point is that in addition to the choice of norm, one must also choose the weight functions $\mu_s$ and $\mu_c$.

Table 2.
Case | $\|\cdot\|_s$ | $\|\cdot\|_c$ | $\Theta$ is $\|\cdot\|_c$-closed?
(1) | $\|\cdot\|_{m+m_0,2,\mu_s}$ | $\|\cdot\|_{m,\infty,\mu_c^{1/2}}$ | Yes
(2) | $\|\cdot\|_{m+m_0,2,\mu_s}$ | $\|\cdot\|_{m,2,\mu_c}$ | Yes
(3) | $\|\cdot\|_{m+m_0,\infty,\mu_s}$ | $\|\cdot\|_{m,\infty,\mu_c}$ | No
(4) | $\|\cdot\|_{m+m_0,\infty,\mu_s}$ | $\|\cdot\|_{m,2,\mu_c}$ | No

10 Santos (2012) allowed for a general unbounded domain $D$ rather than $D = \mathbb{R}^{d_x}$ specifically. We restrict attention to functions with full support merely for simplicity. This also allows us to remove the cone condition as an explicit assumption on $D$, since $\mathbb{R}^{d_x}$ satisfies the cone condition.

Alternative approaches to defining weighted norms
Thus far we have defined weighted Sobolev and Hölder norms by weighting each derivative piece equally. For example, with $m = 1$ and $d_x = 1$, the weighted Sobolev sup-norm applies the same weight $\mu(x)$ to both $|f(x)|$ and $|f'(x)|$. The Sobolev integral norms were defined similarly, with each derivative using the same weight function. Call this the equal weighting approach. While this is the most common approach to defining weighted norms in econometrics, it is not the only reasonable way to define weighted norms. The next most common alternative is to convert any unweighted norm $\|\cdot\|$ into a weighted norm $\|\cdot\|_\mu$ by first weighting the function and then applying the unweighted norm: $\|f\|_\mu = \|\mu f\|$. Call this the product weighting approach. For example, suppose we start with the unweighted Sobolev sup-norm, with $m = 1$ and $d_x = 1$. Assume $\mu$ is differentiable. Then the derivative piece of the norm involves $(\mu f)'(x) = \mu'(x) f(x) + \mu(x) f'(x)$. Here we see that, compared to equal weighting, product weighting picks up an extra term involving the derivative of the weight function $\mu'(x)$. Notice that when $m = 0$, the product and equal weighting approaches to defining weighted Sobolev integral and sup-norms are equivalent.
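To make the difference between the two approaches concrete, the sketch below compares only the first-derivative pieces under equal and product weighting. Everything specific here is an assumption made for illustration: the weight $\mu(x) = \exp(-x^2)$, the function $f(x) = x^2$, the grid, and the helper names. It does not reproduce the full norms, only the terms that differ.

```python
import math

# Hypothetical weight and function for illustration (d_x = 1, m = 1).
mu = lambda x: math.exp(-x * x)
dmu = lambda x: -2.0 * x * math.exp(-x * x)   # mu'(x)
f = lambda x: x * x
df = lambda x: 2.0 * x                        # f'(x)

def equal_weighting_piece(x):
    """Derivative piece under equal weighting: |f'(x)| * mu(x)."""
    return abs(df(x)) * mu(x)

def product_weighting_piece(x):
    """Derivative piece under product weighting: |(mu f)'(x)| via the product rule."""
    return abs(dmu(x) * f(x) + mu(x) * df(x))

grid = [i / 1000.0 for i in range(-5000, 5001)]
sup_equal = max(equal_weighting_piece(x) for x in grid)
sup_product = max(product_weighting_piece(x) for x in grid)
print(sup_equal, sup_product)
```

For this particular pair the extra term $\mu'(x) f(x)$ partly offsets $\mu(x) f'(x)$, so the two pieces have different suprema. Whether the resulting norms are equivalent depends on how $\mu'$ compares with $\mu$, which is what Assumption 3 controls.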
The following result shows that, for a class of weight functions including polynomial weighting, these two approaches to defining Sobolev norms are equivalent. Consequently, it is irrelevant which one we use.
As discussed earlier, Assumption 3 does not hold for all feasible weight functions. So these two approaches to defining weighted norms are not necessarily equivalent for any given choice of weight function. The theorem in section 5.1.4 of Schmeisser and Triebel (1987) gives a result related to Proposition 6 for a large class of weighted function spaces.$^{11}$ One reason to consider product weighting is that it easily applies when it is not clear how to define an equally weighted norm. In particular, it allows us to define the weighted Hölder norm by $\|\cdot\|_{m,\infty,\mu,\nu} = \|\mu\,\cdot\|_{m,\infty,1,\nu}$ for $\nu \in (0,1]$. Let $C_{m,\infty,\mu,\nu}(D) = \{f \in C^m(D) : \|f\|_{m,\infty,\mu,\nu} < \infty\}$ denote the weighted Hölder space with exponent $\nu$. The difficulty in defining an equally weighted Hölder norm comes from the Hölder coefficient piece, which is a supremum over two different points in the domain, unlike the sup-norm part.$^{12}$ The product weighted Hölder norm is commonly used in econometrics, as in Ai and Chen (2003) example 2.1,$^{13}$ Chen et al. (2005), Hu and Schennach (2008), and Khan (2013).
11 This result is cited and applied in much of the follow-up work by Triebel and coauthors. In particular, as Haroske and Triebel (1994a) show in the proof of their theorem 2.3 (page 145 step 1), this equivalence result can be used to prove compact embedding results. This proof strategy does not apply when the norms are not equivalent, which is why we rely on the more primitive approach of Gallant and Nychka (1987).
If D is bounded, then compact embedding and closedness results for product weighted norms follow immediately from our results on bounded D with unweighted norms. For unbounded D, we provide the following two results. Under the stronger Assumption 3, the product and equal weighted norms are equivalent, by Proposition 6. Schmeisser and Triebel (1987) showed this equivalence and Haroske and Triebel (1994a) used it to prove cases (1)-(4) of Theorem 5 under Assumption 3 and the further assumption that the weight functions have at most polynomial growth (definition 1(ii) on page 133 of Haroske and Triebel, 1994a). Our result relaxes Assumption 3 and does not impose a polynomial growth bound on the weight functions. Our cases (1)-(4) of Theorem 5 are therefore the first we are aware of to allow for exponential weight functions when using product weighted norms. We use our previous results in Theorem 3 to prove cases (1)-(3). We adapt the proof of Theorem 3 to prove case (4).
Theorem 6. (Closedness) Let $D = \mathbb{R}^{d_x}$ where $d_x \ge 1$ is some integer. Let $m, m_0 \ge 0$ be integers. Let $\nu \in (0,1]$. Let $(F, \|\cdot\|_s)$ and $(G, \|\cdot\|_c)$ be Banach spaces with $F \subseteq G$, where $\|f\|_s < \infty$ for all $f \in F$ and $\|f\|_c < \infty$ for all $f \in G$. Define $\Theta$ as in equation (1). Define $\tilde\mu(x) = (1 + x'x)^{-\delta}$ for some $\delta > 0$ and assume that $\mu_c(x) = \mu_s(x)\tilde\mu(x)$. Then the results of Table 3 hold under the following additional assumption: For cases (1) and (2) we also assume $m_0 > d_x/2$.
As mentioned above, we do not impose Assumption 3 on the strong norm in either Theorem 5 or Theorem 6. We also do not impose the weaker Assumption 4. We do, however, strengthen assumptions 1 and 2 by assuming a particular rate of convergence on the ratio $\mu_c/\mu_s$, namely, that it is polynomial: $\mu_c(x)/\mu_s(x) = (1 + x'x)^{-\delta}$. This assumption is satisfied when both $\mu_c$ and $\mu_s$ are polynomial weight functions themselves. This case has been used in the previous literature which chooses the weighted Hölder norm, such as in Chen et al. (2005). This assumption is also, however, satisfied by the choice $\mu_s(x) = \exp[\delta_s(x'x)]$ and $\mu_c(x) = \exp[\delta_s(x'x)] / (1 + x'x)^{\delta}$ for $\delta > 0$ and $\delta_s < 0$. Hence Theorems 5 and 6 can still be applied if we want our parameter space $\Theta$ to contain polynomial functions of all orders, as discussed earlier. Finally, note that a compact embedding result under the norm weighted by $\mu_c$ yields a compact embedding result under any weaker norm, by Lemma 4. For example, with $m = 0$, $\mu_c$ defined as the ratio of an exponential and polynomial as above, and $\tilde\mu_c = \exp(\delta_c \|x\|_e^2)$ for $\delta_c < \delta_s$, $\|\cdot\|_{0,\infty,\tilde\mu_c}$ is weaker than $\|\cdot\|_{0,\infty,\mu_c}$. Theorem 5 part 4 then implies that $C_{0,\infty,\mu_s,\nu}$ is compactly embedded in $C_{0,\infty,\tilde\mu_c}$.
12 See, however, Brown and Opic (1992) Equations (2.8) and (2.9), who suggest one way to define equally weighted Hölder norms.
13 In this example the parameter space is an unweighted Hölder space for functions with unbounded domain, but the consistency norm is a downweighted sup-norm. Hence this is an example of case 4 in theorems 5 and 6. Also, as we discuss in section 6, this kind of unweighted parameter space assumption rules out linear functions. Note that in other examples using an unweighted Hölder space on $\mathbb{R}^{d_x}$ is less restrictive, since the functions of interest are naturally bounded. For example, Chen et al. (2009b) and Carroll et al. (2010) consider spaces of pdfs while Blundell et al. (2007) (assumption 2(i)) consider spaces of Engel curves.

Weighted norms for bounded domains
In section 3 we showed that when the domain $D$ is bounded, sets of functions $f$ that satisfy a norm bound $\|f\|_s \le B$ are $\|\cdot\|_c$-compact for three possible choices of norm pairs (see Table 1).
In this section we consider functions with a bounded domain, but which do not satisfy a norm bound $\|\cdot\|_s \le B$ for any of the choices in Table 1.
Example 1. (Quantile function) Let $X$ be a scalar random variable with full support on $\mathbb{R}$ and absolutely continuous distribution with respect to the Lebesgue measure. Let $Q_X : (0,1) \to \mathbb{R}$ denote its quantile function. Since the derivative of $Q_X$ at $s$ asymptotes to $\pm\infty$ as $s \to 0$ or $1$, $\|Q_X\|_{0,\infty} = \infty$. Hence, although the domain $D = (0,1)$ is bounded, $Q_X$ is not in any Sobolev sup-norm space or Hölder space. Indeed, Csörgő (1983, page 5) notes that $\|\hat Q_X - Q_X\|_{0,\infty} \to \infty$ a.s., where $\hat Q_X$ is the sample quantile function for an iid sample $\{x_1, \ldots, x_n\}$, and $\hat F_X$ is the empirical cdf. Also see page 322 of van der Vaart (2000).
On the other hand, it is certainly possible for such a quantile function Q X to be bounded in a weighted Sobolev sup-norm space or a weighted H€ older space. In fact, by examining the Bahadur representation of b Q X it can be shown that b Q X converges in the weighted sup-norm over s 2 ð0; 1Þ with weight function Note that this weight function depends on how fast the quantile function diverges as s ! 0 or s ! 1.
More generally, we may want to estimate quantile functions in settings more complicated than simply taking a sample quantile. In such settings, the compact embedding and closedness results developed in this section can be useful.

Table 3.
Case | $\|\cdot\|_s$ | $\|\cdot\|_c$ | $\Theta$ is $\|\cdot\|_c$-closed?
(1) | $\|\cdot\,\mu_s\|_{m+m_0,2}$ | $\|\cdot\,\mu_c\|_{m,\infty}$ | Yes
(2) | $\|\cdot\,\mu_s\|_{m+m_0,2}$ | $\|\cdot\,\mu_c\|_{m,2}$ | Yes
(3) | $\|\cdot\,\mu_s\|_{m+m_0,\infty}$ | $\|\cdot\,\mu_c\|_{m,\infty}$ | No
(4) | $\|\cdot\,\mu_s\|_{m+m_0,\infty,1,\nu}$ | $\|\cdot\,\mu_c\|_{m,\infty}$ | Yes

Example 2. (Transformation models) Consider the model where $Y$, $X$, and $U$ are continuously distributed scalar random variables. $T$ is an unknown strictly increasing transformation function. Let $F_U$ and $f_U$ be the (unknown) cdf and pdf of $U$, respectively. Suppose $Y$ has compact support $\mathrm{supp}(Y) = [y_L, y_U]$. If we allow distributions of $U$ to have full support, like $N(0,1)$, then the transformation function $T(y)$ must diverge to infinity as $y \to y_U$ or to negative infinity as $y \to y_L$. We are again in a situation like the quantile function above, where because the derivatives of $T$ diverge, it is not in any unweighted Sobolev sup-norm or Hölder space. Horowitz (1996) constructs an estimator $\hat T(y)$ of $T(y)$ and shows, among other results, that where $a$ and $b$ are such that $T(y)$ and $T'(y)$ are bounded on $[a, b]$. These bounds on $T$ and $T'$ imply that $[a, b]$ is a strict subset of $\mathrm{supp}(Y)$ when $\mathrm{supp}(Y)$ is compact and $U$ has full support. Chiappori et al. (2015) extend the arguments in Horowitz (1996) to allow for a nonparametric regression function and endogenous regressors. Also see Chen and Shen (1998), who study a transformation model assuming $Y$ has bounded support in their example 3, and example 3 on page 618 of Wong and Severini (1991).
As with the quantile function, the compact embedding and closedness results developed in this section may be useful for proving consistency of estimators of T in weighted norms uniformly over its entire domain.
These examples show that sometimes our functions of interest do not satisfy standard unweighted norm bounds. Hence the compactness and closedness results in Theorems 1 and 2 do not apply. In this section, we show that we can, however, recover compactness by using weighted norms. As in Section 4, we focus on equal weighting norms.$^{14}$

Weight functions
Proposition 1 applies for bounded domains, and hence again we see that only weight functions that go to zero or infinity at the boundary are nontrivial. Since our main motivation for considering weighted norms is to expand the set of functions which can have a bounded norm, we will restrict attention to downweighting. For simplicity we will also focus on the one dimensional case $d_x = 1$ with $D = (0,1)$, as in examples 1 and 2. As before, there are two natural classes of weight functions. First, we consider polynomial weights for $\alpha, \beta \ge 0$ and $\delta \in \mathbb{R}$. $\alpha > 1$, $\beta > 1$, and $\delta > 0$ ensure that $\mu(x) \to 0$ as $x \to 0$ or $x \to 1$. Next, we consider exponential weights,
If we had $\alpha > 0$ and $\beta < 0$ then this allows for asymmetric weights where the tail goes to zero at one boundary point but not the other.
14 Compactness and closedness results for product weighting norms with bounded domains follow immediately from Theorems 1 and 2.
The interpretation of $\|f\|_s \le B$ for a weighted norm $\|\cdot\|_s$ with $D$ bounded is similar to the interpretation when $D = \mathbb{R}^{d_x}$ discussed in Section 4.1. This norm bound places restrictions on the tail behavior of $f(x)$ as $x$ approaches the boundary of $D$. For example, let $D = (0,1)$ and consider the Sobolev sup-norm $\|\cdot\|_s = \|\cdot\|_{m+m_0,\infty,\mu_s}$ with polynomial weights $\mu_s(x) = [x(1-x)]^{\delta_s}$, $\delta_s > 0$. Then $f \in \Theta = \{f \in F : \|f\|_s \le B\}$ implies that $|\nabla^\lambda f(x)| \le B\,[x(1-x)]^{-\delta_s}$ for every $0 \le |\lambda| \le m + m_0$. For example, $|f(x)| = O(x^{-\delta_s})$ as $x \to 0$ means that $|f(x)|$ can diverge to $\infty$ as $x \to 0$, but it cannot do so faster than the polynomial $1/x^{\delta_s}$ diverges to $\infty$ as $x \to 0$. A similar tail constraint holds as $x \to 1$, and also for the derivatives of $f$ up to order $m + m_0$.
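As a concrete case of these tail constraints, consider a quantile function as in Example 1. The sketch below uses the logistic quantile function $Q(s) = \log(s/(1-s))$, which is a choice made here for illustration and does not appear in the text, together with the polynomial weight $[s(1-s)]^{\delta}$. It approximates weighted suprema on grids that approach the boundary of $(0,1)$; the helper names and grid sizes are assumptions.

```python
import math

def Q(s):
    """Logistic quantile function (hypothetical example)."""
    return math.log(s / (1.0 - s))

def dQ(s):
    """Its derivative, 1 / (s (1 - s))."""
    return 1.0 / (s * (1.0 - s))

def weight(s, delta):
    """Polynomial weight [s (1 - s)]^delta on (0, 1)."""
    return (s * (1.0 - s)) ** delta

def weighted_sup(g, delta, eps, n=20000):
    """Approximate sup of |g(s)| * weight(s) over a grid on [eps, 1 - eps]."""
    grid = [eps + (1.0 - 2.0 * eps) * i / n for i in range(n + 1)]
    return max(abs(g(s)) * weight(s, delta) for s in grid)

for eps in (1e-2, 1e-4, 1e-6):
    level_delta1 = weighted_sup(Q, 1.0, eps)    # stabilizes as eps shrinks
    deriv_delta1 = weighted_sup(dQ, 1.0, eps)   # stays at 1
    deriv_half = weighted_sup(dQ, 0.5, eps)     # grows like eps^(-1/2)
    print(eps, level_delta1, deriv_delta1, deriv_half)
```

With $\delta = 1$ both the level and the first derivative have bounded weighted suprema, whereas $\delta = 1/2$ is too weak to control the derivative. This mirrors the point that the weight must decay at least as fast as the function and its derivatives diverge.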
The analysis now proceeds similarly as in the unbounded domain case. One important difference is that Assumption 3 cannot hold for nontrivial weight functions on bounded domains, as the following proposition shows.
The weaker assumption 4, however, can still hold. The following proposition verifies this for both polynomial and exponential weights.
Proposition 8. Assumption 4 holds for both $\mu_s(x) = [x(1-x)]^{\delta_s}$ and $\mu_s(x) = \exp[\delta_s x^{-1}(1-x)^{-1}]$, for any $\delta_s \in \mathbb{R}$.
The following result illustrates that Assumption 5 can also hold for exponential weights. It can be generalized to $d_x > 1$, $\alpha, \beta \ne -1$, and arbitrary bounded $D$.
Proposition 9. Let $\mu_c(x) = \exp[\delta_c x^{-1}(1-x)^{-1}]$, $\mu_s(x) = \exp[\delta_s x^{-1}(1-x)^{-1}]$, and $D = (0,1)$. Then Assumption 5 holds for any $\delta_s, \delta_c \in \mathbb{R}$ such that $\delta_c < \delta_s$.
It can be shown that such exponential weight functions also satisfy the other weight function assumptions discussed in Section 4, for appropriate choices of d c and d s .

Compact embeddings and closedness results
As in the previous cases, we begin with a compact embedding result. Because of Proposition 7, none of the results from Schmeisser and Triebel (1987) or the followup work by Triebel and coauthors applies to weighted norms on bounded domains. As in the unbounded domain case, however, Brown and Opic (1992) give high level conditions for a compact embedding result similar to case (1) of Theorem 7, with m 0 ¼ 1 and m ¼ 0. Again, they do not study the other cases we consider, and they allow for a large class of weight functions which includes exponential weights. Hence, to our best knowledge, cases (2)-(4) of Theorem 7 are new. The proof is similar to the proof of Theorem 3, which in turn is a generalization of the proof of lemma A.4 in Gallant and Nychka (1987). We end this section with a corresponding closedness result.  Table 4 hold under the following additional assumptions: For cases (1) and (2) we also assume m 0 > d x =2, that D satisfies the cone condition, and that assumption 6 holds, and in case (1) also that Assumption 5 holds. For cases (3) and (4) we also assume m 0 ! 1.

Applications
In this section we illustrate how the compact embedding and closedness results discussed in this article are applied to nonparametric estimation problems in econometrics. We discuss how the choice of norms affects the parameter space, the strength of the conclusions one obtains, and how other assumptions like moment conditions depend on this choice. In the first example, we consider mean regression functions for full support regressors. We discuss penalized and unpenalized estimators. We also show that weighted norms can be interpreted as a generalization of trimming. In the second example, we discuss nonparametric instrumental variable estimation. In each example we focus on consistency of a sieve estimator of a function of interest, but similar considerations arise for inference. While it may be possible to prove some of these consistency results without using compact embeddings, our goal is merely to provide simple illustrations of the general ideas which are widely used in the sieve estimation literature and carry over to more complicated settings.
We show consistency by verifying the conditions of a general consistency result stated below. Denote the data by $\{Z_i\}_{i=1}^n$ where $Z_i \in \mathbb{R}^{d_Z}$. Throughout this section we assume the data are independent and identically distributed. The parameter of interest is $\theta_0 \in \Theta$, where $\Theta$ is the parameter space. $\Theta$ may be finite or infinite dimensional. Let $Q(\theta)$ be a population objective function such that $\theta_0 \in \arg\max_{\theta \in \Theta} Q(\theta)$. Let $\Theta_{k_n}$ be a sieve space as described in the examples below. Without penalization, a sieve extremum estimator $\hat\theta_n$ of $\theta_0$ is defined by $\hat\theta_n \in \arg\max_{\theta \in \Theta_{k_n}} \hat Q_n(\theta)$, where $\hat Q_n$ is the sample objective function, which depends on the data. Our assumptions ensure that $\theta_0$ and $\hat\theta_n$ are well defined.$^{15}$ Let $d(\cdot,\cdot)$ be a pseudo-metric on $\Theta$. Typically $d(\theta_1, \theta_2) = \|\theta_1 - \theta_2\|_c$ for some norm $\|\cdot\|_c$ on $\Theta$. We now have the following result.
Proposition 10. (Consistency of sieve extremum estimators) Suppose the following assumptions hold.
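1. $\Theta$ and $\Theta_{k_n}$ are compact under $d(\cdot,\cdot)$.
2. $Q(\theta)$ and $\hat Q_n(\theta)$ are continuous under $d(\cdot,\cdot)$ on $\Theta$ and $\Theta_{k_n}$, respectively.
4. There exists a sequence $\pi_k\theta_0 \in \Theta_k$ such that $d(\pi_k\theta_0, \theta_0) \to 0$ as $k \to \infty$.
5. $k_n \to \infty$ as $n \to \infty$ and $\sup_{\theta \in \Theta_{k_n}} |\hat Q_n(\theta) - Q(\theta)| \to_p 0$.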
Proposition 10 is a slight modification of lemma A1 in Newey and Powell (2003). The assumptions require a compact parameter space, which we can obtain by choosing a strong norm $\|\cdot\|_s$ and a consistency norm $\|\cdot\|_c$, letting $d(\theta_1, \theta_2) = \|\theta_1 - \theta_2\|_c$, and constructing the parameter space as explained in Sections 3, 4, and 5. The strong norm should be chosen such that the parameter space is large enough to contain $\theta_0$. The consistency norm not only needs to be selected carefully to ensure compactness, but it will also affect the remaining assumptions, such as conditions needed for continuity of $Q$ and $\hat Q_n$ (Assumption 2). Similarly, a larger parameter space usually requires stronger assumptions to ensure uniform convergence of the sample objective function (Assumption 5). Assumption 3 is an identification condition, which allows $Q(\theta) = Q(\theta_0)$ for $\theta \ne \theta_0$ as long as $d(\theta, \theta_0) = 0$. Assumption 4 is a standard approximation condition on the sieve space.

Mean regression functions and trimming
Let $Y$ and $X$ be scalar random variables and define $g_0(x) \equiv E(Y \mid X = x)$. Suppose $g_0 \in \Theta$, where $\Theta$ is the parameter space defined below. Suppose $X$ is continuously distributed with density $f_X(x) > 0$ for all $x \in \mathbb{R}$. Hence $\mathrm{supp}(X) = \mathbb{R}$. Our estimator is the sieve least squares estimator $\hat g = \arg\max_{g \in \Theta_{k_n}} -\frac{1}{n}\sum_{i=1}^n (Y_i - g(X_i))^2$, where $\Theta_{k_n}$ is a sieve space for $\Theta$. For example, let $p_j : \mathbb{R} \to \mathbb{R}$ be a sequence of basis functions for $\Theta$. Then we could choose the linear sieve space $\Theta_{k_n} = \{g \in \Theta : g(x) = \sum_{j=1}^{k_n} b_j p_j(x) \text{ for some } b_1, \ldots, b_{k_n} \in \mathbb{R}\}$. Let $\|\cdot\|_c$ denote the consistency norm and let $\|\cdot\|_s$ be a strong norm. The parameter space $\Theta$ is a $\|\cdot\|_s$-ball as explained in Sections 3, 4, and 5. Intuitively, the unweighted $L_2$ or sup-norms on $\mathbb{R}$ are too strong to be a consistency norm because the data provide no information about $g_0(x)$ for $x$ larger than the largest observation. In fact, to apply any of the compactness results with such a choice of $\|\cdot\|_c$, we would have to use a strong norm with upweighting. By Proposition 2, this implies that we would have to assume that $g(x) \to 0$ as $|x| \to \infty$. Since this assumption would rule out the linear regression model, we instead use the downweighted sup-norm $\|g\|_c = \sup_{x \in \mathbb{R}} |g(x)|\mu_c(x)$, where $\mu_c(x)$ is nonnegative and $\mu_c(x) \to 0$ as $|x| \to \infty$. As a parameter space we can then either use a weighted Hölder space (by Theorems 5 and 6) or a weighted Sobolev space (by Theorems 3 and 4). As an example, we choose a weighted Sobolev $L_2$ parameter space, and give low level conditions under which $\|\hat g - g_0\|_c \to_p 0$ in the following proposition.
15 Alternatively, we can define $\hat\theta_n$ as any estimator that satisfies $\hat Q_n(\hat\theta_n) = \sup_{\theta \in \Theta_{k_n}} \hat Q_n(\theta) + o_p(1)$. Assuming $\hat\theta_n$ exists, we would then not have to assume that $\hat Q_n$ is continuous or that $\Theta_{k_n}$ is compact. We use the more restrictive definition because in our examples below these assumptions are satisfied.
Proposition 11. (Consistency of sieve least squares) Suppose the following assumptions hold.
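3. $\Theta_k$ is $\|\cdot\|_c$-closed for all $k$. $\Theta_k \subseteq \Theta_{k+1} \subseteq \cdots \subseteq \Theta$ for all $k \ge 1$. For any $M > 0$, there exists $g_k \in \Theta_k$ such that $\sup_{x : |x| \le M} |g_k(x) - g_0(x)| \to 0$ as $k \to \infty$.
4. $k_n \to \infty$ as $n \to \infty$.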
The Proof of Proposition 11 follows by verifying the conditions of Proposition 10. Assumption 1 of Proposition 11 ensures that H is jj Á jj c -compact by our compact embedding and closedness results, parts 1 of Theorems 3 and 4. Together with the moment assumptions of Proposition 11, this compactness provides a simple and primitive sufficient condition for the uniform convergence Assumption 5 of Proposition 10.
As mentioned earlier, we must use downweighting ($\mu_s(x) \to 0$ as $|x| \to \infty$) in the strong norm to allow $g_0$ to be linear. The faster $\mu_s$ converges to 0, the larger is the parameter space. However, allowing for a larger parameter space has several consequences. First, by our assumptions on the relationship between $\mu_s$ and $\mu_c$, faster convergence of $\mu_s$ to zero implies faster convergence of $\mu_c$ to zero. This weakens the consistency norm. Consequently, both continuity and uniform convergence are harder to verify. In Proposition 11 we ensure these two assumptions hold by requiring $E(\mu_c(X)^{-2}) < \infty$. But here we see that the faster $\mu_c$ converges to 0, the more moments of $X$ we assume exist. For example, suppose $\mu_s(x) = (1 + x^2)^{-\delta_s}$ and $\mu_c(x) = (1 + x^2)^{-\delta_c}$ with $\delta_s > 0$. The conditions on the weight functions require that $\delta_s < 2\delta_c$ and the moment condition is $E((1 + X^2)^{2\delta_c}) < \infty$. Thus larger values of $\delta_s$ yield larger parameter spaces, but imply $\delta_c$ must also be larger, and hence we need more moments of $X$. Next suppose $\mu_s(x) = \exp(-\delta_s x^2)$ and $\mu_c(x) = \exp(-\delta_c x^2)$ with $0 < \delta_s < 2\delta_c$. Then the moment condition $E(\mu_c(X)^{-2}) < \infty$ becomes $E[\exp(2\delta_c X^2)] < \infty$. This is equivalent to requiring that the tails of $X$ are sub-Gaussian, $P(|X| > t) \le C \exp(-ct^2)$ for constants $C$ and $c$, which in turn implies that all moments of $X$ are finite.
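The moment condition $E(\mu_c(X)^{-2}) < \infty$ with exponential weights can be examined numerically for a specific distribution. The sketch below assumes $X \sim N(0,1)$, which is an illustrative choice not made in the text, so that $\mu_c(X)^{-2} = \exp(2\delta_c X^2)$ and the expectation equals $(1 - 4\delta_c)^{-1/2}$ when $\delta_c < 1/4$ and is infinite otherwise. The function names and truncation points are hypothetical.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def truncated_moment(delta_c, T, n=20000):
    """Trapezoid approximation of E[exp(2 delta_c X^2) 1(|X| <= T)] for X ~ N(0, 1)."""
    h = 2.0 * T / n
    total = 0.0
    for i in range(n + 1):
        x = -T + i * h
        end_weight = 0.5 if i in (0, n) else 1.0
        total += end_weight * math.exp(2.0 * delta_c * x * x) * normal_pdf(x)
    return total * h

# delta_c = 0.1 < 1/4: truncated moments converge as T grows.
small = [truncated_moment(0.1, T) for T in (10.0, 20.0)]
closed_form = (1.0 - 4.0 * 0.1) ** -0.5

# delta_c = 0.3 > 1/4: truncated moments explode as T grows.
large = [truncated_moment(0.3, T) for T in (10.0, 20.0)]

print(small, closed_form)
print(large)
```

Even with Gaussian regressors, the exponential consistency weight is therefore only admissible for sufficiently small $\delta_c$ in this illustration, whereas the polynomial moment condition $E((1+X^2)^{2\delta_c}) < \infty$ holds for every $\delta_c$ when $X$ is normal.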
The only remaining assumption is the condition on the sieve spaces. There are many choices of sieve spaces which satisfy this last condition because it only requires that g 0 can be approximated on any compact subset of R. See Chen (2007) for examples. Also see Mhaskar (1986) and Gallant and Nychka (1987) for analyses of weighted polynomial sieve spaces. Section 2.3.6 of Chen (2007) discusses general considerations for choosing sieve spaces.
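A minimal sketch of a sieve least squares fit and its error in a downweighted sup-norm follows. It is not the estimator analyzed in the propositions: it ignores the norm constraint defining the parameter space, uses a plain power-series basis, and the data generating process, sample size, sieve dimension, weight exponent, and helper names are all assumptions made for illustration.

```python
import math
import random

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def sieve_least_squares(xs, ys, k):
    """Unconstrained least squares on span{1, x, ..., x^(k-1)} via normal equations."""
    basis = [[x ** j for j in range(k)] for x in xs]
    gram = [[sum(row[a] * row[c] for row in basis) for c in range(k)] for a in range(k)]
    rhs = [sum(row[a] * y for row, y in zip(basis, ys)) for a in range(k)]
    coef = solve_linear_system(gram, rhs)
    return lambda x: sum(c * x ** j for j, c in enumerate(coef))

def weighted_sup_error(g_hat, g0, delta_c, radius=10.0, n=2000):
    """Grid approximation of sup |g_hat - g0| * (1 + x^2)^(-delta_c) on [-radius, radius]."""
    grid = [-radius + 2.0 * radius * i / n for i in range(n + 1)]
    return max(abs(g_hat(x) - g0(x)) * (1.0 + x * x) ** (-delta_c) for x in grid)

random.seed(0)
g0 = lambda x: 1.0 + 2.0 * x                      # hypothetical linear regression function
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
ys = [g0(x) + random.gauss(0.0, 1.0) for x in xs]
g_hat = sieve_least_squares(xs, ys, k=4)
err = weighted_sup_error(g_hat, g0, delta_c=2.0)
print(err)
```

The downweighting is what keeps the reported error small: the fitted cubic can drift far from $g_0$ beyond the range of the data, but those regions receive little weight in the consistency norm.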
$\lambda_n$ is a penalty parameter that converges to zero as the sample size grows. The following proposition uses arguments from Chen and Pouzo (2012) to show that $\tilde g_w$ is consistent for $g_0$.
Proposition 13. (Consistency of penalized sieve least squares) Suppose the following assumptions hold.
Proposition 13 allows for a noncompact parameter space. The additional assumption needed is Assumption 5, which imposes an upper bound on the rate of convergence of $\lambda_n$. Assumption 3 implies that $\|g_{k_n} - g_0\|_c$ converges to 0 and Assumption 5 then imposes that $\lambda_n$ cannot converge at a faster rate.
In Propositions 11 and 12, we used the compact embedding and closedness results of Sections 3, 4, and 5 directly to pick norms such that the compact parameter space assumption holds. In Proposition 13 this is no longer an issue because we do not need a compact parameter space. However, the results of Sections 3, 4, and 5 are still used in the proof, and hence the choice of norm here is still important, as discussed in section 3.2.1 of Chen and Pouzo (2012). Essentially, our proof of Proposition 13 first uses lemma A.3 in Chen and Pouzo (2012) to show that for some finite $M_0 > 0$, $\tilde g_w \in \{g \in W_{1,2,\mu_s} : \|g\|_{1,2,\mu_s} \le M_0\}$ with probability arbitrarily close to 1 for all large $n$. We then use the arguments from the proof of Proposition 12 to prove that $\|\tilde g_w - g_0\|_c \to_p 0$. It is at this step that the compact embedding and closedness results help.
An alternative proof can be obtained by showing that our low level sufficient conditions imply the assumptions of theorem 3.2 in Chen and Pouzo (2012), which is a general consistency theorem, applies when X has compact support, and allows for both nonsmooth residuals and a noncompact parameter space. One of the assumptions of theorem 3.2 is that the penalty function is lower semicompact, which here means that jj Á jj s -balls are jj Á jj c -compact. This is precisely the kind of result we have discussed throughout this article.
Finally, we note that while both of these approaches (assuming a compact parameter space, or using a penalty function) lead to easy-to-interpret sufficient conditions, one could also use theorem 3.1 in Chen (2007), which may avoid both compactness and penalty functions.
16 The constraint $\|g\|_s \le B_n$ ensures that the sieve spaces are closed, as in assumption 3.1(iii) of Chen and Pouzo (2012).

Nonparametric instrumental variables estimation
In this section we apply our results to the nonparametric instrumental variable model $Y = g_0(X) + U$, $E(U \mid Z) = 0$, where $Y$, $X$, and $Z$ are continuously distributed scalar random variables and $f_X(x) > 0$ for all $x \in \mathbb{R}$. Assume $g_0 \in \Theta$, where $\Theta$ is the parameter space defined below. Since $E(E(Y - g_0(X) \mid Z)^2) = 0$, Newey and Powell (2003) suggest estimating $g_0$ in two steps. First, for any $g \in \Theta$ estimate $\rho(z, g) \equiv E(Y - g(X) \mid Z = z)$ using a series estimator. Call this estimator $\hat\rho(z, g)$. Then let $\hat g = \arg\max_{g \in \Theta_{k_n}} -\frac{1}{n}\sum_{i=1}^n \hat\rho(Z_i, g)^2$, where $\Theta_{k_n}$ is a sieve space for $\Theta$, as before. See Newey and Powell (2003) for more estimation details. Define $\tilde\Theta = \{g \in W_{m+m_0,2,\mu_s} : \|g\|_{m+m_0,2,\mu_s} \le \tilde B\}$, where $\mu_s(x) = (1 + x^2)^{\delta_s}$, $\delta_s > 0$, and $m, m_0 \ge 0$. Let $a(x) \in \mathbb{R}^{d_a}$ be a vector of known functions of $x$. Newey and Powell (2003) define the parameter space by Proposition 2 implies that for any $g_1 \in \tilde\Theta$, it holds that $|g_1(x)| \to 0$ as $|x| \to \infty$. The term $a(x)'\beta$ ensures that the tails of $g_0$ are not required to converge to 0, but it requires the tails of $g_0$ to be modeled parametrically. As a consistency norm Newey and Powell (2003) use $\|\cdot\|_{m,\infty,\mu_c}$, where $\mu_c$ upweights the tails of the functions as well. Also see Santos (2012) for a similar parameter space.
In this section, we modify the arguments of Newey and Powell (2003) to allow for nonparametric tails of the function $g_0$. In particular, we let $\mu_s(x) \to 0$ as $|x| \to \infty$. Consequently we allow for a larger parameter space. The main cost of allowing for a larger parameter space is that we obtain consistency in a weaker norm.
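To see what downweighting buys, consider a worked example. It assumes, as a convention that may differ in detail from the definition used earlier in the article, that the weighted norm with $m = 0$ is $\|g\|_{0,2,\mu_s}^2 = \int_{\mathbb{R}} g(x)^2 \mu_s(x)\,dx$ with $\mu_s(x) = (1 + x^2)^{\delta_s}$, and it takes the linear function $g(x) = x$ as a hypothetical element of interest.

```latex
\[
\|g\|_{0,2,\mu_s}^2 = \int_{\mathbb{R}} x^2 (1 + x^2)^{\delta_s}\,dx,
\qquad
x^2 (1 + x^2)^{\delta_s} \sim |x|^{2 + 2\delta_s} \quad \text{as } |x| \to \infty,
\]
\[
\text{so} \quad \|g\|_{0,2,\mu_s} < \infty
\iff 2 + 2\delta_s < -1
\iff \delta_s < -\tfrac{3}{2}.
\]
```

Under this convention a linear function has an infinite norm for every $\delta_s > 0$, but a finite norm once the weight decays fast enough. This is the sense in which letting $\mu_s(x) \to 0$ enlarges the parameter space.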
The population objective function is

$$Q(g) = -E\left( E(Y - g(X) \mid Z)^2 \right).$$

The generalization of trimming used in the previous section is generally not possible here because, although $E(Y - g_0(X) \mid Z = z) = 0$ for all $z$, usually $E((Y - g_0(X))\mu_c(X) \mid Z = z) \neq 0$ for some $z$. Instead we follow the approach of Proposition 11. The following proposition provides low level conditions under which $\|\hat g - g_0\|_c \to_p 0$. As in the previous subsection, $\|\cdot\|_c$ is a weighted sup-norm and the parameter space is a weighted Sobolev $L_2$ space. The arguments can easily be adapted to allow for higher order derivatives in the consistency norm or a weighted Hölder space as the parameter space.
Proposition 14. (Consistency of sieve NPIV estimator) Suppose the following assumptions hold.

5. $H_k$ is $\|\cdot\|_c$-closed for all $k$. $H_k \subseteq H_{k+1} \subseteq \cdots \subseteq H$ for all $k \ge 1$. For any $M > 0$, there exists $g_k \in H_k$ such that $\sup_{x : |x| \le M} |g_k(x) - g_0(x)| \to 0$ as $k \to \infty$.
6. $k_n \to \infty$ as $n \to \infty$ such that $k_n / n \to 0$.
Assumption 2 is the identification condition known as completeness. Compared to the regression model in Proposition 11, the additional assumptions beyond completeness are Assumption 4 and the last part of Assumption 3. These two conditions ensure that the first stage regression is sufficiently accurate, and they are implied by assumption 3 of Newey and Powell (2003). We use the same sieve space to approximate $g_0(x)$ and $b(z)$, but the arguments can easily be generalized at the expense of additional notation. The last part of Assumption 3 holds, for example, if either $E(Y^4) < \infty$ and $E(\mu_c(X)^{-4}) < \infty$, or $\mathrm{var}(Y - g(X) \mid Z) \le M$ for some $M > 0$ and all $g \in H$.
Chen and Pouzo (2012) discuss convergence in a weighted sup-norm of a penalized estimator in the NPIV model as an example of their general consistency theorem. Chen and Christensen (2015a, 2018) derive many new and important results for the NPIV model. Among other things, they derive minimax optimal sup-norm convergence rates and describe an estimator which achieves those rates. Their results apply when $X$ and $Z$ have compact support.

Conclusion
In this article, we have gathered and reviewed many previously known compact embedding results for convenient reference. Furthermore, we have proved several new compact embedding results which generalize the existing results and were not previously known. Unlike most previous results, our results allow for exponential weight functions. Our new results also allow for weighted norms on bounded domains, for which only one prior result existed, even for polynomial weights. We additionally gave closedness results, some of which were known and some of which are apparently new to the econometrics literature. Finally, we discussed the practical relevance of these results. We explained how the choice of norm and weight function affects the functions allowed in the parameter space. We also showed how to apply these results in two examples: nonparametric mean regression and nonparametric instrumental variables estimation.
After showing consistency of an estimator, the next step is to consider rates of convergence and inference. For these results, it is often helpful to have results on entropy numbers for the function space of interest. For functions with bounded domain satisfying standard norm bounds, many well-known results exist. For example, van der Vaart and Wellner (1996) theorem 2.7.1 gives covering number rates for Hölder balls with the sup-norm as the consistency norm. Such results are refinements of compact embedding results, since totally bounded parameter spaces are relatively compact. For functions with full support, fewer entropy number results exist. For example, lemma A.3 of Santos (2012) generalizes van der Vaart and Wellner (1996) theorem 2.7.1 to the case where $H$ is a polynomial-upweighted Sobolev $L_2$ ball and $\|\cdot\|_c$ is the Sobolev sup-norm. Note that a compact embedding result is used as the first step in his proof. Haroske and Triebel (1994a, 1994b) and Haroske (1995) also provide similar results for a large class of weighted spaces, again restricting to a class of weight functions satisfying Assumption 3 and which have at most polynomial growth. Since our results allow for more general weight functions, it would be useful to know whether these entropy number results generalize as well.
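To make the link between entropy bounds and compactness concrete, the display below restates the general form of the covering number bound in van der Vaart and Wellner (1996) theorem 2.7.1, together with the implication it carries. The notation is ours and is an assumption relative to the text: $\mathcal{X} \subset \mathbb{R}^d$ is a bounded convex set with nonempty interior, $C^{\alpha}_{1}(\mathcal{X})$ is the unit Hölder ball of smoothness $\alpha$, $N(\varepsilon, \cdot, \|\cdot\|)$ is the covering number, $K$ is a constant that does not depend on $\varepsilon$, and $\Theta$ is a generic parameter space.

```latex
\[
\log N\bigl(\varepsilon, C^{\alpha}_{1}(\mathcal{X}), \|\cdot\|_{\infty}\bigr)
\le K \left( \frac{1}{\varepsilon} \right)^{d/\alpha}
\quad \text{for all } \varepsilon > 0 .
\]
% Finite covering numbers at every scale mean the set is totally bounded;
% in a complete space, a totally bounded set has compact closure.
\[
N(\varepsilon, \Theta, \|\cdot\|_c) < \infty \ \ \text{for all } \varepsilon > 0
\;\Longrightarrow\;
\Theta \text{ is totally bounded}
\;\Longrightarrow\;
\overline{\Theta} \text{ is } \|\cdot\|_c\text{-compact (in a complete space)}.
\]
```

An entropy bound therefore delivers relative compactness and, in addition, a rate at which the covering numbers grow as $\varepsilon \to 0$. It is this rate that matters for convergence rates and inference.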
Finally, applying a result on sieve approximation rates is one step when deriving convergence rates of sieve estimators. For example, see theorem 3.2 of Chen (2007) and the subsequent discussion. Many approximation results for functions on the real line, such as those discussed in Mhaskar (1986), are for exponentially weighted sup-norms. Therefore, our extension of the compact embedding results to exponential weights should be useful when combined with these approximation results to derive sieve estimator convergence rates.