Cross-validation: what does it estimate and how well does it do it?

Cross-validation is a widely used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather, it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallows' Cp. Turning to confidence intervals, the standard intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies in each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and we show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail.


Introduction
When deploying a predictive model, it is important to understand its prediction accuracy on future test points, so both good point estimates and accurate confidence intervals for prediction error are essential. Cross-validation (CV) is a widely used approach for these two tasks, but in spite of its seeming simplicity, its operating properties remain opaque. Considering first estimation, it turns out to be challenging to precisely state the estimand corresponding to the cross-validation point estimate. In this work, we show that the estimand of CV is not the accuracy of the model fit on the data at hand, but is instead the average accuracy over many hypothetical data sets. Specifically, for the special case of linear regression, we show that the CV estimate of error has larger mean squared error (MSE) when estimating the prediction error of the final model than when estimating the average prediction error of models fit across many unseen data sets. Turning to confidence intervals for prediction error, we show that naïve intervals based on CV can fail badly, giving coverage far below the nominal level; we provide a simple example in Section 1.1. The source of this behavior is the estimation of the variance used to compute the width of the interval: it does not account for the correlation between the error estimates in different folds, which arises because each data point is used for both training and testing. As a result, the estimate of variance is too small and the intervals are too narrow. To address this issue, we develop a modification of cross-validation, nested cross-validation (NCV), that achieves coverage near the nominal level, even in challenging cases where the usual cross-validation intervals have miscoverage rates two to three times larger than the nominal rate.

A simple illustration
As a motivating example where naïve cross-validation confidence intervals fail, we consider a sparse logistic regression model

Figure 1: A plot of the true error of a model versus the CV estimates for 1000 replicates of the model from Section 1.1. The blue curve shows the average midpoint of the naïve CV confidence intervals. The green bands show the average 90% confidence interval for prediction error given by naïve CV. The red curves show the 5% and 95% quantiles from a quantile regression fit. To achieve nominal coverage, the green curves should approximate the red curves, but they are too narrow in this case.
with n = 90 observations of p = 1000 features, and a coefficient vector θ = c · (1, 1, 1, 1, 0, 0, . . .) ∈ R^p with four nonzero entries of equal strength. The feature matrix X ∈ R^{n×p} is comprised of independent and identically distributed (i.i.d.) standard normal variables, and we chose the signal strength c so that the Bayes misclassification rate is 20%. We estimate the parameters using ℓ1-penalized logistic regression with a fixed penalty level. In this case, naïve confidence intervals for prediction error are far too small: intervals with desired miscoverage of 10% give 31% miscoverage in our simulation. We visualize this in Figure 1. The intervals need to be made larger by a factor of about 1.6 to obtain coverage at the desired level in this case.
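To make the setup concrete, the following Python sketch reproduces the flavor of this experiment under stated assumptions: the signal strength c, the penalty level (C=0.1), the number of replicates, and the use of scikit-learn's liblinear solver are illustrative placeholders rather than the exact configuration behind Figure 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p, k, K = 90, 1000, 4, 10
c = 1.0                      # placeholder; the paper calibrates c to a 20% Bayes error
theta = np.zeros(p)
theta[:k] = c

def one_replicate():
    """Run naive 10-fold CV once and check whether its interval covers Err_XY."""
    X = rng.standard_normal((n, p))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta)))
    model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")  # fixed l1 penalty
    folds = np.array_split(rng.permutation(n), K)
    e = np.empty(n)
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        model.fit(X[train], y[train])
        e[fold] = (model.predict(X[fold]) != y[fold]).astype(float)
    err_cv, se = e.mean(), e.std(ddof=1) / np.sqrt(n)
    lo, hi = err_cv - 1.645 * se, err_cv + 1.645 * se     # naive 90% interval
    # approximate Err_XY (error of the model fit on all n points) on a large test set
    model.fit(X, y)
    X_test = rng.standard_normal((20000, p))
    y_test = rng.binomial(1, 1.0 / (1.0 + np.exp(-X_test @ theta)))
    err_xy = (model.predict(X_test) != y_test).mean()
    return not (lo <= err_xy <= hi)

misses = sum(one_replicate() for _ in range(200))
print("empirical miscoverage:", misses / 200)   # well above the nominal 10% in settings like this one
```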

Related work
Cross-validation is used ubiquitously to estimate the prediction error of a model (Allen, 1974; Geisser, 1975; Stone, 1977). The enduring popularity of CV is due to the fact that it is a conceptually simple improvement over a one-time train-test split (Blum et al., 1999). CV is part of a broader landscape of resampling techniques to estimate prediction error, with bootstrap-based techniques as the most common alternative (Efron, 1983, 1986; Efron and Tibshirani, 1993, 1997). The other main category of prediction error estimates is based on analytic adjustments such as Mallows' Cp (Mallows, 1973), AIC (Akaike, 1974), BIC (Schwarz, 1978), and general covariance penalties (Stein, 1981; Efron, 2004). The present work is primarily concerned with CV, but also addresses the properties of bootstrap, data splitting, and covariance penalty methods.
In spite of CV's apparent simplicity, the formal properties of this procedure are subtle; the seemingly basic question "what is cross-validation estimating?" has engendered considerable debate. Although the predictive accuracy of the model fit on the observed training data may seem like a natural estimand, it has been observed that the CV estimator tracks this quantity only weakly, suggesting that CV should instead be treated as an estimator of the average prediction error across training sets (Zhang, 1995; Hastie et al., 2009; Yousef, 2020). See also Rosset and Tibshirani (2020) and Wager (2020) for a discussion about different potential estimands. In this work, we discuss this phenomenon in detail for the case of the linear model. Our main result uses a conditional independence argument to explain the aforementioned weak relationship between CV and the instance-specific error.
Turning to the question of inference, one important use of CV is to deliver confidence intervals for the prediction error (or, similarly, an estimate of the standard error) to accompany a point estimate. The second primary goal in this work is to provide such confidence intervals, which cannot be reliably created with naïve methods, as shown in our example in Figure 1. A fundamental prior result shows that there is no unbiased estimator of the standard error of the CV point estimate based on one instance of CV (Bengio and Grandvalet, 2004). As a result, to obtain standard error estimates, one would either need to modify the CV procedure or make additional assumptions. Pursuing the former, Dietterich (1998) and Nadeau and Bengio (2003) propose sampling schemes where the data is split in half and CV is carried out within each half separately. This yields an estimate of standard error, but it will typically be much too conservative, since the internal CV model fits each use a sample size that is less than half of the full sample. A related proposal due to Austern and Zhou (2020) involves repeatedly performing leave-one-out CV with data sets of half the original size, but this proposed estimator is not computationally feasible for most learning algorithms.
In a different direction, Nadeau and Bengio (2003) and Markatou et al. (2005) propose alternative estimates of standard error, but these are based only on the sample size and higher moments of the errors and so do not address the source of the problem: a covariance term that we describe in Section 4.1. For bootstrap estimators, there are proposals to estimate the standard error of the (bootstrap) point estimates of prediction error with methods based on influence functions (Efron, 1983; Efron and Tibshirani, 1997), and this approach has been partially extended to CV for the special case of the area under the curve (AUC) measure of performance (Yousef, 2019). The CV proposal of Austern and Zhou (2020) similarly involves leave-one-out resampling, which can be interpreted as an empirical estimate of the influence functions.
Accompanying these algorithmic proposals, there is some theoretical understanding of the asymptotic behavior of CV. Dudoit and van der Laan (2005) proves a central limit theorem (CLT) for a cross-validation estimator, although it does not come with an estimate of the standard error. LeDell et al. (2015) provides a consistent estimator for the standard error in the special case of estimating the AUC, and Benkeser et al. (2020) conducts a higher-order asymptotic analysis for AUC and other metrics, yielding a more efficient estimator for accuracy with a consistent standard error estimate. Further theoretical results establish the asymptotic normality of the CV estimate in more general cases (Austern and Zhou, 2020; Bayle et al., 2020). The former considers the average prediction error across training sets (similar to our goal herein) and introduces an asymptotically valid estimate of the standard error; see Appendix F.9 for an experiment with this estimator. The latter estimates a different estimand, the average prediction error of the models fit on the subsamples, and introduces a valid estimate of standard error for this quantity. We explain this estimand and the proposed standard error estimator in more detail in Appendix I. Both use arguments relying on notions of algorithmic stability (Kumar et al., 2013; Kale et al., 2011; Celisse and Guedj, 2016). At present, it is not clear how the large-sample regime considered in these works relates to the behavior we see in small samples such as in the experiment in Section 1.1. In particular, algorithmic stability may not be satisfied in high-dimensional settings or with small sample sizes; see Appendix I and Bayle et al. (2020) for more discussion.
Lastly, we note that CV is often used to compare predictive models, such as when selecting a model or a good value of a learning algorithm's hyperparameters (e.g., Stoica et al., 1986; Shao, 1993; Zhang, 1993; Dietterich, 1998; Xu and Liang, 2001; Yang, 2007; Arlot and Celisse, 2010; Fong and Holmes, 2020; Sivula et al., 2020; Riley et al., 2021). To this end, Varma and Simon (2006) and Tibshirani and Tibshirani (2009) suggest a bias-correction for the model selected by cross-validation, Lei (2019) shows how to return a confidence set for the best model parameters, and Yang (2007) and Wager (2020) show that for CV, comparing two models is, in some sense, a statistically easier task than estimating the prediction error. While we expect that our proposed estimator would be of use for hyperparameter selection because it yields more accurate confidence intervals for prediction error, we do not pursue this problem further in the present work.

Our contribution
This work has two main thrusts. First, we study the choice of estimand for CV, giving results for the special case of the linear model. We prove a finite-sample conditional independence result (Theorem 1) with a supporting asymptotic result (Theorem 2) that together show that CV does not estimate the error of the specific model fit on the observed training set, but is instead estimating the average error over many training sets (Corollary 2 and Corollary 3). We also show that this holds for the other common estimates of prediction error: data splitting (Section 3.4), Mallows' Cp (Section 3.5), and the bootstrap (Appendix A). Second, we introduce a modified cross-validation scheme to give accurate confidence intervals for prediction error. We prove that our estimate of the MSE of the CV point estimate is unbiased (Theorem 3). Moreover, we validate our method with extensive numerical experiments, confirming that the coverage is consistently better than that of standard cross-validation (Section 5).

Setting and notation
We consider the supervised learning setting where we have features X = (X_1, . . ., X_n) ∈ X^n and response Y = (Y_1, . . ., Y_n) ∈ Y^n, and we assume that the data points (X_i, Y_i) for i = 1, . . ., n are i.i.d. from some distribution P. We wish to understand how well fitted models generalize to unseen data points, which we formalize with a loss function ℓ such that ℓ(y, y) = 0 for all y. For example, ℓ could be squared-error loss, misclassification error, or deviance (cross-entropy). Now consider a class of models parameterized by θ, which takes values in some space Θ. Let f(x, θ) be the function that predicts y from x ∈ R^p using the model with parameters θ. Let A be a model-fitting algorithm that takes any number of data points and returns a parameter vector in Θ. Let θ̂ = A(X, Y) be the fitted value of the parameter based on the observed data X and Y. We are interested in the out-of-sample error with this choice of parameters:

Err_XY := E[ℓ(f(X_{n+1}, θ̂), Y_{n+1}) | X, Y],

where (X_{n+1}, Y_{n+1}) ∼ P is an independent test point from the same distribution. Notice Err_XY is a random quantity, depending on the training data. We denote the expectation of this quantity across possible training sets as Err := E[Err_XY].
We will discuss the relationship between these two quantities further in Section 3. We note that out-of-sample error is materially different from in-sample error, which is the focus of methods like the Cp and AIC statistics and covariance penalties. These are discussed in Section 3.5.
In cross-validation, we partition the observations I = {1, . . ., n} into K disjoint subsets (folds) I_1, . . ., I_K of size m = n/K at random. Throughout this work, we will assume K divides n for convenience, and we will choose K = 10 in all of our numerical results. Consider the first fold, and let θ̂^(−1) = A((X_j, Y_j)_{j∈I\I_1}) be the model fit to only those points that are not in fold one. Then, let e_i = ℓ(f(x_i, θ̂^(−1)), y_i) for each i ∈ I_1. The errors e_i for points in other folds are defined analogously. We let Êrr^(CV) := ē = (1/n) Σ_{i=1}^n e_i be the average error, which is the usual CV estimate of prediction error. If one desires a confidence interval for the prediction error, a straightforward approach is to compute the empirical standard deviation of the e_i divided by √n to get an estimate of the standard error:

se := sd(e_1, . . ., e_n) / √n.

From here, we can create a confidence interval as

(Êrr^(CV) − z_{1−α/2} · se, Êrr^(CV) + z_{1−α/2} · se),

where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution. We call these the naïve cross-validation intervals and they serve as our baseline approach. Importantly, we find that these naïve CV intervals are on average too small because the true standard deviation of ē is larger than the naïve estimate se would suggest, so a better estimate of the standard error is needed.
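As a reference point for later comparisons, here is a small Python sketch of the naïve CV point estimate and interval just described; the `fit` and `loss` callables are generic placeholders rather than notation from the paper.

```python
import numpy as np
from scipy.stats import norm

def naive_cv_interval(X, y, fit, loss, K=10, alpha=0.10, seed=None):
    """Naive K-fold CV estimate of prediction error with a normal-theory interval.

    fit(X, y) must return a prediction function; loss(pred, y) returns
    per-observation losses. Both signatures are illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)
    e = np.empty(n)
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        predict = fit(X[train], y[train])           # model fit without this fold
        e[fold] = loss(predict(X[fold]), y[fold])   # errors e_i for i in the fold
    err_cv = e.mean()                               # the usual CV point estimate
    se = e.std(ddof=1) / np.sqrt(n)                 # naive standard error
    z = norm.ppf(1 - alpha / 2)
    return err_cv, (err_cv - z * se, err_cv + z * se)
```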
As a final piece of notation for our asymptotic statements, for a reference sequence a_1, a_2, . . . we write b_n = Θ(a_n) to mean that b_n/a_n is bounded above and below by positive constants for all sufficiently large n.

What prediction error are we estimating?
We next discuss targets of inference when assessing prediction accuracy. We discuss both Err and Err_XY, and also introduce an intermediate quantity Err_X that explains the connection between these two. While cross-validation is our focus, our results hold identically for other estimates of prediction error: covariance penalties (Section 3.5), data splitting (Section 3.4), and the bootstrap (Appendix A).

Err_X: a different target of inference
The two most natural estimands of interest to the analyst are Err_XY, the error of the model that was fit on our actual training set, and Err, the average error of the fitting algorithm run on same-sized data sets drawn from the underlying distribution P. The former quantity is of the most interest to a practitioner deploying a specific model, whereas the latter may be of interest to a researcher comparing different fitting algorithms. While it may initially appear that the quantity Err_XY is easier to estimate, since it concerns the model at hand, it has been observed that the cross-validation estimate provides little information about Err_XY (Zhang, 1995; Hastie et al., 2009; Yousef, 2020), a phenomenon sometimes called the weak correlation issue.
We now prove that CV has lower MSE for estimating Err than it does for Err_XY, for the special case of the linear model. In this sense, CV should be viewed as an estimate of Err rather than of Err_XY. In order to state this formally, for this section only, assume the homoskedastic Gaussian linear model holds:

Y_i = θ^⊤X_i + ε_i,   i = 1, . . ., n,    (3)

with ε = (ε_1, . . ., ε_n) i.i.d. N(0, σ²) and independent of X. In this setting, a key quantity in our analysis is

Err_X := E[Err_XY | X],

which falls between Err and Err_XY; see Figure 2 for a visualization. This quantity is also considered by Hastie et al. (2019) in a high-dimensional regression setting, but to the best of our knowledge has not been considered in the literature on estimation of prediction error.
While our current focus is on cross-validation, the conclusions hold for a broad class of estimates of prediction error. In particular, we consider estimators of prediction error that satisfy the following property:

Definition 1 (Linearly invariant estimator). We say that an estimator of prediction error Êrr is linearly invariant if

Êrr((x_1, y_1), . . ., (x_n, y_n), U) = Êrr((x_1, y_1 + x_1^⊤κ), . . ., (x_n, y_n + x_n^⊤κ), U)

for every data set and every κ. Here κ is any p-vector, and the random variable U is included to allow for randomized procedures like cross-validation; without loss of generality it is taken to be unif[0, 1] and independent of (X, Y).
With ordinary least squares (OLS) fitting, cross-validation satisfies this property: Lemma 1. When using OLS as the fitting algorithm and squared-error loss, the cross-validation estimate of prediction error, Êrr^(CV), is linearly invariant.
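A quick numerical check of Lemma 1, with small placeholder dimensions: shifting the responses by Xκ leaves every CV residual (and hence the CV error estimate) unchanged when OLS is the fitting algorithm and the fold assignment is held fixed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 60, 5, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
kappa = rng.standard_normal(p)        # arbitrary p-vector
y_shift = y + X @ kappa               # shifted responses y_i + x_i^T kappa

def cv_squared_errors(X, y, folds):
    e = np.empty(len(y))
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # OLS without the fold
        e[fold] = (y[fold] - X[fold] @ beta) ** 2
    return e

folds = np.array_split(rng.permutation(n), K)   # same folds (same U) for both data sets
print(np.allclose(cv_squared_errors(X, y, folds),
                  cv_squared_errors(X, y_shift, folds)))   # True: the CV errors coincide
```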
Note that linear invariance is a deterministic property of an estimator and does not rely on any distributional assumptions. Recall from classical linear regression theory that when using ordinary least squares (OLS), the estimated coefficient vector is independent of the residual sum of squares. This implies that the sum of squared residuals is independent of the true predictive error. It turns out that, going further, the CV estimate of error (and every linearly invariant estimate of error) is independent of the true error, conditional on the feature matrix X.
Theorem 1. Assume the homoskedastic Gaussian linear model (3) holds and that we use squared-error loss. Let Êrr be a linearly invariant estimate of prediction error (such as Êrr^(CV) using OLS as the fitting algorithm). Then,

Êrr ⊥⊥ Err_XY | X.    (5)

The proof of this theorem rests primarily on the fact that the OLS residuals are independent of the fitted coefficient vector in the linear model, together with the observation that linearly invariant estimators are a function only of the residuals of an OLS model fit. Due to its simplicity, we give the proof explicitly here; all other proofs are given in Appendix D.
Proof of Theorem 1. The true predictive error Err_XY is a function only of θ̂, the OLS estimate of θ based on the full sample (x_1, y_1), . . ., (x_n, y_n). On the other hand, any linearly invariant Êrr is a function only of the residuals Y − Xθ̂ = (I − X(X^⊤X)^{-1}X^⊤)Y, by the invariance property (see Lemma 6). Since θ̂ and the residuals are independent conditional on X by classical linear model results, the proof is complete.
As a result, any linearly invariant estimator (such as cross-validation) has lower MSE as an estimate of Err_X than as an estimate of Err_XY:

Corollary 1. Under the conditions of Theorem 1,

E[(Êrr − Err_X)²] ≤ E[(Êrr − Err_XY)²].

We demonstrate this in an experiment in a simple linear model with n = 100 observations and p = 20 features, where the features are i.i.d. standard normal variables; see Figure 3. As predicted by Corollary 1, we see that the CV point estimate has lower MSE for Err_X than for Err_XY. Similarly, the naïve CV intervals cover Err_X more often than they cover Err_XY.
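The experiment behind Figure 3 can be sketched as follows; the replication count is a placeholder, and the closed-form expressions for Err_XY and Err_X in the comments are valid here because the features are standard normal (so Σ = I), an assumption of this sketch rather than a statement about the figure's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K, sigma2, reps = 100, 20, 10, 1.0, 500
theta = np.zeros(p)                      # WLOG by linear invariance
sq_err_X, sq_err_XY = [], []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = X @ theta + np.sqrt(sigma2) * rng.standard_normal(n)
    folds = np.array_split(rng.permutation(n), K)
    e = np.empty(n)
    for fold in folds:                   # 10-fold CV with OLS and squared-error loss
        train = np.setdiff1d(np.arange(n), fold)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        e[fold] = (y[fold] - X[fold] @ beta) ** 2
    err_cv = e.mean()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)                 # full-sample OLS fit
    err_xy = sigma2 + np.sum((beta - theta) ** 2)                # Err_XY when Sigma = I
    err_x = sigma2 + sigma2 * np.trace(np.linalg.inv(X.T @ X))   # Err_X = E[Err_XY | X]
    sq_err_X.append((err_cv - err_x) ** 2)
    sq_err_XY.append((err_cv - err_xy) ** 2)
print(np.mean(sq_err_X), np.mean(sq_err_XY))   # the first (MSE for Err_X) is smaller
```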
Remark 1. Theorem 1 can be restated in an evocative way. Suppose we have two data sets (X, Y) and (X, Y') that share the same feature matrix X. Let Err_XY and Err_XY' be the true errors of the models fit with these two data sets, respectively. Next, suppose we perform cross-validation on (X, Y) to get an estimate Êrr_XY and do the same on (X, Y') to get an estimate Êrr_XY'. Then, (5) is equivalent to

E[(Êrr_XY − Err_XY)² | X] = E[(Êrr_XY' − Err_XY)² | X].

This means that for the purpose of estimating Err_XY, we have no reason to prefer the cross-validation estimate computed from (X, Y) over the cross-validation estimate computed from a different data set (X, Y'), even though we wish to estimate the error of the model fit on (X, Y).

Relationship with average error
The results of the previous section suggest that Err_X is a more natural target of inference than Err_XY. Next, we examine the relationship between Err and Err_X, showing that Err_X is close to Err, in that the variance of Err_X (which has mean Err) is small compared with the variance of Err_XY (which also has mean Err).
Combined with the results of the previous section, this gives a formal statement that cross-validation is a better estimator for Err than for Err_XY.
To make this precise, consider the conditional variance decomposition of the variance of Err_XY:

var(Err_XY) = E[var(Err_XY | X)] + var(E[Err_XY | X]) = E[var(Err_XY | X)] + var(Err_X).    (6)

To quantify the relative contribution of the two terms on the right-hand side of (6), we will use a proportional asymptotic limit, where

n, p → ∞ with p/n → γ for some γ ∈ (0, 1).    (7)

We use the proportional asymptotic limit rather than traditional p fixed, n → ∞ asymptotics because, in the latter asymptotic regime, the differences between Err, Err_X, and Err_XY are of asymptotically lower order than 1/√n, so one always estimates these three targets with equal precision, and the analysis is less informative.
See Appendix H for a complementary analysis in the traditional p fixed, n → ∞ asymptotic regime, and see Yang (2007) and Wager (2020) for a related discussion. By contrast, in the proportional asymptotic limit we will see that the CV estimate is closer to Err and Err_X than to Err_XY.
Theorem 2. Suppose the homoskedastic Gaussian linear model in (3) holds and that we use squared-error loss. In addition, assume that the feature vectors satisfy X_i ∼ N(0, Σ_p) for any full-rank Σ_p. Then, in the proportional asymptotic limit in (7), we have

E[var(Err_XY | X)] = Θ(1/n)   and   var(Err_X) = Θ(1/n²)

as n, p → ∞.
We summarize the asymptotic relationship among the various estimands in Figure 4. We see that the randomness in Err_XY due to Y given X is of a larger order than that due to the randomness in X. This explains why, in Figure 3, the coverage and MSE of cross-validation are similar when estimating either Err or Err_X, but significantly different when estimating Err_XY. As a result, Err_X and Err_XY are asymptotically uncorrelated; moreover, combining this with Theorem 1 shows that Êrr is asymptotically uncorrelated with Err_XY, as stated next.
Corollary 2. In the setting of Theorem 2, cor(Err_X, Err_XY) → 0. Moreover, for any linearly invariant estimator Êrr (such as Êrr^(CV) using OLS as the fitting algorithm), cor(Êrr, Err_XY) → 0.

Figure 4: The relationship among various notions of prediction error in the proportional asymptotic limit (7). Recall that σ² is the Bayes error: the error rate of the best possible model. See Figure 5 for a simulation experiment demonstrating these rates. *The variance of Êrr scales as 1/√n; see Section 3.3 for details about the bias.

Notice that this is a marginal result, whereas the similar Theorem 1 is conditional on X. With respect to Figure 4, this result means that the fluctuations of Err_XY around Err are asymptotically uncorrelated with the fluctuations of Êrr around Err. Combining Theorem 2 with Theorem 1, we conclude that CV has larger error for estimating Err_XY than for Err or Err_X:

Corollary 3. In the setting of Theorem 2, let Êrr be any linearly invariant estimator (such as Êrr^(CV) using OLS as the fitting algorithm). Suppose in addition that var(Êrr) → 0 (an extremely weak condition satisfied by any reasonable estimator). Then, asymptotically, Êrr has larger mean squared error as an estimate of Err_XY than as an estimate of Err or Err_X.

The asymptotic theory perfectly predicts the experimental results presented in Figure 5; we see that even for moderate sample sizes, the scalings are exactly as anticipated. The main conclusion is that for a linearly invariant estimate of prediction error that has precision 1/√n, our results show that asymptotically one has lower estimation error when estimating Err than when estimating Err_XY. Similarly, the correlation between a linearly invariant estimate and Err_XY goes to zero. These theoretical predictions are also corroborated by the experimental results presented in Figure F.6. Thus, cross-validation is estimating the average error Err more so than the specific error Err_XY.
Remark 2. Note that the results in this section apply both to K-fold cross-validation with fixed K and to leave-one-out cross-validation, where K = n. Formally, the results require only that one is using some sequence of linearly invariant estimators.

The bias of cross-validation
Up until this point, we have not explicitly mentioned the bias in the CV point estimate Êrr^(CV) that comes from the difference in sample size. That is, Êrr^(CV) uses models fit on samples of size n(K − 1)/K, whereas Err and Err_XY are defined for models fit on data of size n, so E[Êrr^(CV)] is typically larger than Err = E[Err_XY]. We now pause for a few remarks about this bias. First, notice that Corollary 3 sidesteps the bias issue by considering differences between two mean squared error quantities. The bias is important, however, if we wish to understand absolute quantities such as E[(Êrr^(CV) − Err)²]. To this end, the bias exhibits different behavior in different regimes:
• The parametric regime. Suppose p is fixed, n → ∞, and the model class has fixed dimension. Here, the bias will typically be of order 1/n, which means that it is negligible compared to the variance. (In fact, the dimension of the model class can grow, provided the rate is slow enough; see Wager (2020) for discussion.)
• The proportional, dense regime. Consider the setting above in (7), fitting a dense model. If the number of folds is fixed, the bias of Êrr^(CV) will converge to a nonzero constant as n and p grow (e.g., Liu and Dobriban, 2020). What this means is that in Figure 5, the |Êrr^(CV) − Err| curve will eventually cease to decay at a 1/√n rate, bottoming out due to the constant bias. We do not see this behavior in the plot, because the bias is still much smaller than the variance at the sample sizes we consider; Figure F.1 reports the bias and variance in this setting.
• The proportional, sparse regime. This setting is the most delicate. Here, sparse regression algorithms may behave very differently on samples of size n(K − 1)/K than on samples of size n (e.g., Reeves et al., 2019). Thus, the bias here may be appreciable.
In all cases, the bias can be mitigated by taking a larger number of folds as n grows.
Incorporating bias alongside our results from Section 3.2 leads to an interesting bias-variance-variance decomposition of E[(Êrr − Err_XY)²]. Since both Êrr and Err_XY are random quantities, we cannot use the usual bias-variance decomposition. However, by Corollary 2, these two quantities are asymptotically uncorrelated, yielding the following:

E[(Êrr − Err_XY)²] ≈ (E[Êrr] − Err)² + var(Êrr) + var(Err_XY).

The first two terms on the right-hand side are the bias-variance decomposition for Êrr as an estimate of Err. Thus, because the additional third term is positive, we again see that Êrr is a more precise estimate of Err than of Err_XY.

Data splitting
Perhaps the simplest way to estimate prediction error is to split the data into two disjoint sets, one for training and one for estimating the prediction accuracy. The previous results also shed light on the properties of data splitting. In particular, we will show that when estimating prediction error with data splitting, refitting the model on the full data incurs additional variance that can make the confidence intervals slightly too small, even asymptotically. This is not a cause for practical concern, but it is another manifestation of the fact that linearly invariant estimators are estimating average prediction error, and Err_XY contains additional, independent variation. We report on the details in Appendix B.

Connection with covariance penalties
For parametric models, there is an alternative theory of the estimation of prediction accuracy based on covariance penalties; see Stein (1981), Efron (2004), and Rosset and Tibshirani (2020) for overviews of this approach. For the linear model with OLS and squared-error loss, this approach specializes to the well-known Mallows' Cp (Mallows, 1973; Akaike, 1974) estimate of prediction error:

Êrr^(Cp) := (1/n) ‖Y − Xθ̂‖² + 2σ²p/n.

The classical covariance penalty approach is focused on estimating in-sample error, the error for a fresh sample with the same features X:

Err_in := (1/n) Σ_{i=1}^n E[ℓ(f(x_i, θ̂), Y_i')],

where the expectation is only over Y_i and Y_i' for i = 1, . . ., n, and Y_i' is an independent draw from the distribution of Y_i | X_i = x_i. See Figure 6 for a visualization of how this relates to our other notions of prediction error. To extend this to the setting where the feature vector of future test points is also random (the setting of the present work), Rosset and Tibshirani (2020) introduce RCp, which is a similar but slightly larger estimate of prediction error that accounts for the variability in the features of future test points.
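For concreteness, the Cp estimate above (for OLS with squared-error loss) can be coded directly; treating the noise variance σ² as known is a simplifying assumption made here for illustration.

```python
import numpy as np

def mallows_cp(X, y, sigma2):
    """Mallows' Cp estimate of prediction error for OLS with squared-error loss.

    sigma2 is assumed known in this sketch; in practice it is typically estimated
    from the residuals of a low-bias model. Not code from the paper.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return rss / n + 2.0 * sigma2 * p / n   # training error plus covariance penalty
```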

Estimand of Cp
We first discuss the estimation of prediction error with Mallows' Cp, showing that (like cross-validation) it is a worse estimator for Err_XY than for Err. The results from Section 3.1 and Section 3.2 continue to hold for Êrr^(Cp). In particular, Êrr^(Cp) has lower error for estimating Err and Err_X than for estimating Err_XY, and Êrr^(Cp) is asymptotically uncorrelated with Err_XY. In summary, as before with cross-validation, Mallows' Cp is not able to estimate Err_XY, but is rather an estimate of Err_in, Err, or Err_X (the latter two are close for large samples).

A decomposition of Err_X
Next, we develop a decomposition for Err_X. The results that we present are implicit in Rosset and Tibshirani (2020), but in that work the results are averaged over X to instead obtain estimates for Err. From the definitions of Err_X and Err_in, both are functions of X alone, and they differ only in whether the fitted model is evaluated at a fresh feature vector drawn from P or at the observed x_i. When the linear model holds, this difference simplifies as stated next.
Proposition 1. For the linear model with the OLS fitting algorithm and squared-error loss, assume in addition that the distribution of X_i has mean zero and covariance Σ of full rank. Then

Err_X = Err_in + (σ²/n) (tr(Σ̂^{-1}Σ) − p),

where Σ̂ = X^⊤X/n. The second term in the sum can be either positive or negative. Roughly speaking, this term is smaller (more negative) if X is a good design that yields precise estimates of the regression coefficients, whereas this term is larger (more positive) if X yields less precise estimates. Note that we do not typically know the population covariance Σ, so this cannot be used as an estimator for prediction error. Instead, it is an expository decomposition relating Err_X to existing work about the estimation of prediction error. From this expression, we can read off the following results from Rosset and Tibshirani (2020).
Corollary 4 (Random-X covariance penalties (Rosset and Tibshirani, 2020)). In the setting of Proposition 1, taking expectations over X gives

E[Err_X] = E[Err_in] + (σ²/n) (E[tr(Σ̂^{-1}Σ)] − p).

Moreover, if the feature vector follows a multivariate Gaussian distribution, then

E[Err_X] = E[Err_in] + σ² p(p + 1) / (n(n − p − 1)).

The latter expression is the motivation for the RCp penalty.

Bootstrap estimates of prediction error
Bootstrap estimates of prediction error are also linearly invariant, and so they are also estimates of the average prediction error. For brevity, we present these results in Appendix A.

Confidence intervals with nested cross-validation
In this section, we develop an estimator for the MSE of the cross-validation point estimate. Our ultimate goal is then to use the estimated MSE to give confidence intervals for prediction error with approximately valid coverage.

Dependence structure of CV errors
Before developing our estimator for the cross-validation MSE, we pause here to build up intuition for why the naïve CV confidence intervals for prediction error can fail, as seen previously in our example in Section 1.1.
The naïve CV intervals are too small, on average, because the true sampling variance of Êrr^(CV) is larger than the naïve estimate se would suggest. In particular, this estimate of the variance of the CV point estimate assumes that the observed errors e_1, . . ., e_n are independent. This is not true: the observed errors have less information than an independent sample, since each point is used for both training and testing, which induces dependence among these terms.
In particular, the covariance matrix of the errors e_1, . . ., e_n has the block structure shown in Figure 7. As such, it is parameterized by only three numbers: a_1 := var(e_1), a_2 := cov(e_i, e_j) for i, j in the same fold, and a_3 := cov(e_i, e_j) for i, j in different folds (Bengio and Grandvalet, 2004). It is easy to see that the variance of ē is

var(ē) = (1/n) a_1 + ((m − 1)/n) a_2 + ((n − m)/n) a_3;

see Bengio and Grandvalet (2004). The constants a_2 and a_3 will typically be positive, in which case var(ē) > a_1/n, and so estimating the variance of ē as se² results in an estimate that is too small. For example, in the setting from Section 1.1, the estimated variance is too small by approximately a factor of 2.65, so the naïve confidence intervals are too small by a factor of √2.65 ≈ 1.6. In order to correctly estimate the variance of ē, it would suffice to estimate a_1, a_2, and a_3, but Bengio and Grandvalet (2004) proves the surprising fact that there is no unbiased estimator of var(ē) based on a single run of cross-validation. Thus, estimating a_1, a_2, and a_3 cannot be done from a single run of cross-validation. Although this result suggests that something beyond the usual cross-validation will be required to give good estimates of the standard error of Êrr^(CV), it does not imply that it is impossible to get such estimates with other approaches.
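The covariance structure above can be estimated empirically by repeating CV over many independent data sets, something unavailable in practice but easy in simulation. The sketch below, with placeholder dimensions and a null data-generating model, checks that the resulting var(ē) exceeds the naïve value a_1/n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, K, reps = 100, 20, 10, 2000
m = n // K
E = np.empty((reps, n))                 # row r holds (e_1, ..., e_n) for replicate r
for r in range(reps):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)          # null model; fold k is indices k*m, ..., (k+1)*m - 1
    for k in range(K):
        fold = np.arange(k * m, (k + 1) * m)
        train = np.setdiff1d(np.arange(n), fold)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        E[r, fold] = (y[fold] - X[fold] @ beta) ** 2
C = np.cov(E, rowvar=False)             # empirical covariance matrix of the errors
a1 = np.mean(np.diag(C))
same_fold = [C[i, j] for k in range(K) for i in range(k * m, (k + 1) * m)
             for j in range(k * m, (k + 1) * m) if i != j]
a2 = np.mean(same_fold)                                             # within-fold covariance
a3 = (C.sum() - np.trace(C) - np.sum(same_fold)) / (n * (n - m))    # cross-fold covariance
var_ebar = a1 / n + (m - 1) / n * a2 + (n - m) / n * a3
print(var_ebar, a1 / n)                 # var(e_bar) exceeds a1/n when a2, a3 > 0
```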
To recap, the primary issue with the naïve cross-validation confidence intervals is that they rely on an independence assumption that is violated: they implicitly assume that a_2 = 0 and a_3 = 0. Thus, the usual estimate of the variance of Êrr^(CV) is too small, resulting in poor coverage. To remedy this issue, we develop an estimator that empirically estimates the variance of Êrr^(CV) across many subsamples. Avoiding the faulty independence approximation leads to intervals with superior coverage. We turn to details of our proposed procedure next.

Our target of inference
In this section, our primary goal will be to give confidence intervals for test accuracy by estimating the mean squared error (MSE) of cross-validation:

Definition 2. For a sample of size n split into K folds, the cross-validation MSE is

MSE_{K,n} := E[(Êrr^(CV) − Err_XY)²].    (9)

In particular, we define the MSE with respect to Err_XY and thus will calibrate our intervals to cover the quantity Err_XY. At this point, the reader may wonder why we define the MSE with respect to Err_XY in view of the results from Section 3, which show that Êrr^(CV) is a more precise estimate of Err than of Err_XY; we next discuss this issue carefully.
To be clear, first, we choose to pursue confidence intervals for Err_XY because we are able to do so; the MSE quantity above can be estimated in a convenient way due to an upcoming decomposition (Lemma 3). At present, we do not know how to obtain a similar MSE estimate with respect to Err. Second, we emphasize that our results from Section 3 do not mean that giving confidence intervals for Err_XY is impossible. Rather, our results say that in the linear model, confidence intervals for Err_XY will be larger than confidence intervals for Err. Still, confidence intervals for either Err or Err_XY would be of interest to the analyst. We are able to derive an estimator for the MSE with respect to Err_XY, and we turn to the details next.
The MSE in (9) contains both a bias term (due to the reduced sample size used by Êrr^(CV)) and a variance term. See Section 3.3 for a discussion of the bias. Thus, we can view the MSE as a slightly conservative version of the variance of the cross-validation estimator. In any case, the MSE is the relevant quantity for creating confidence intervals around a possibly biased point estimate, since it accounts for both bias and variance. With this in mind, we will use an estimate of the MSE to construct confidence intervals for Err_XY. Previewing the remainder of this section, we will produce confidence intervals for Err_XY of the form

(Êrr^(NCV) − b̂ias − z_{1−α/2} · √M̂SE, Êrr^(NCV) − b̂ias + z_{1−α/2} · √M̂SE).    (10)

Above, Êrr^(NCV) is similar to the CV estimate of error except computed across many random splits, and M̂SE is our estimator for the MSE, the heart of this section. In addition, we allow for the possibility of correcting for the sample-size bias with an estimator b̂ias; we use one that arises naturally from the computations already carried out to estimate the MSE (Section 4.3.3).

A holdout MSE identity
We now give a generic decomposition of the mean-squared error of an estimate of prediction error, which will enable us to estimate MSE_{K,n}. Consider a single split of the data into a training set and a holdout set; that is, we partition I = {1, . . ., n} into I^(train) and I^(out), calling the training set (X̃, Ỹ). Using only (X̃, Ỹ), we use our fitting procedure to obtain estimated parameters θ̂^(train) = A(X̃, Ỹ), and further assume we have some estimate Êrr_X̃Ỹ of the prediction error Err_X̃Ỹ defined in (13). Here, Êrr_X̃Ỹ is any estimator of Err_X̃Ỹ based only on (X̃, Ỹ), such as cross-validation using only (X̃, Ỹ). Let {e_i^(out)}_{i∈I^(out)} be the losses of the fitted model f(·, θ̂^(train)) on the holdout set, and let ē^(out) be their average. The MSE of Êrr_X̃Ỹ can be written as follows:

Lemma 3 (Holdout MSE identity). In the setting above,

E[(Êrr_X̃Ỹ − Err_X̃Ỹ)²] = E[(Êrr_X̃Ỹ − ē^(out))²] − E[(ē^(out) − Err_X̃Ỹ)²],    (11)

where we refer to the first term on the right-hand side as (a) and the second as (b). The expectations above are over the complete data (X, Y). The lemma follows from adding and subtracting Err_X̃Ỹ within term (a) and then showing that the cross-term is zero with a nested conditional expectation argument.
This identity is of interest, since both (a) and (b) can be estimated from the data, which leads to an estimate of the MSE term. Specifically, we propose the following estimation strategy:
1. Repeatedly split the data into I^(train) and I^(out), and for each split, do the following:
(i) Apply cross-validation to I^(train) to obtain Êrr_X̃Ỹ and use I^(out) to obtain ē^(out), and then estimate (a) with (Êrr_X̃Ỹ − ē^(out))².
(ii) Estimate (b) with the empirical variance of {e_i^(out)}_{i∈I^(out)} divided by the size of I^(out).
2. Average together the estimates of (a) and (b) across all random splits and take their difference as in (11) to obtain an estimate of the MSE.
Note that the estimates for both (a) and (b) are unbiased, so the resulting MSE estimate is unbiased for the MSE term in (11). In the next section, we will pursue this strategy for the particular case where Êrr_X̃Ỹ is itself a cross-validation estimate based only on (X̃, Ỹ).
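Before moving on, here is a minimal Python sketch of one such split; `fit` and `loss` are the same generic placeholders as in the earlier CV sketch, and the full procedure simply averages the returned (a) and (b) estimates over many random splits before taking their difference.

```python
import numpy as np

def split_ab_estimates(X, y, fit, loss, frac_out=0.1, K_inner=9, seed=None):
    """One random train/holdout split: plug-in estimates of terms (a) and (b) in (11)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    perm = rng.permutation(n)
    n_out = int(frac_out * n)
    out, train = perm[:n_out], perm[n_out:]
    # inner cross-validation on the training part only, giving the estimate for (X~, Y~)
    e_in = np.empty(len(train))
    inner_folds = np.array_split(np.arange(len(train)), K_inner)
    for fold in inner_folds:
        tr = np.setdiff1d(np.arange(len(train)), fold)
        predict = fit(X[train][tr], y[train][tr])
        e_in[fold] = loss(predict(X[train][fold]), y[train][fold])
    err_hat = e_in.mean()
    # held-out losses of the model fit on the whole training part
    e_out = loss(fit(X[train], y[train])(X[out]), y[out])
    a_hat = (err_hat - e_out.mean()) ** 2        # unbiased estimate of term (a)
    b_hat = e_out.var(ddof=1) / len(out)         # unbiased estimate of term (b)
    return a_hat, b_hat
```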

The MSE estimator
Building from Lemma 3, we now turn to our proposed estimate of the MSE, the heart of this section. We follow the estimation strategy described above, using (K − 1)-fold CV as the estimator Êrr_X̃Ỹ. This gives an estimate of (a) and (b), and hence an estimate of the MSE, as described above. We also get a point estimate of error by taking the empirical mean of Êrr_X̃Ỹ across the many splits. See Figure 8 for a visualization of the nested CV sample splitting, and see Algorithm 1 for a detailed description. We denote the resulting estimate of mean squared error by MSE^(NCV) and the point estimate of prediction error by Êrr^(NCV).
In view of Lemma 3, we see that the estimator MSE^(NCV) is targeting the MSE of Êrr^(CV) as an estimate of Err_XY, as we record formally next.
Theorem 3 (Estimand of nested CV). For nested CV with a sample of size n,

E[MSE^(NCV)] = MSE_{K−1, ñ},

where ñ = n(K − 1)/K.

This result shows that MSE^(NCV) obtained by nested CV is estimating the MSE of (K − 1)-fold cross-validation on a sample of size n(K − 1)/K. Since nested CV uses an inner loop with samples of size n(K − 1)/K, we recommend re-scaling to obtain an estimate for a sample of size n by instead taking

M̂SE = ((K − 1)/K) · MSE^(NCV)

(although this re-scaled version is not guaranteed to be exactly unbiased for MSE_{K,n}). As a minor detail, in practice we also restrict √M̂SE to fall between se (the estimated standard error if one had n independent points) and √K · se (the estimated standard error if one had only n/K independent points). This minor implementation detail prevents implausible values of M̂SE from arising. After adjusting the point estimate Êrr^(NCV) with the bias correction discussed next, we form our final confidence intervals as in (10).
Remark 3 (An interpretation of nested CV). A simpler, perhaps more natural approach to this problem would be to borrow ideas from bootstrap calibration (e.g., Efron and Tibshirani, 1993, p. 263). Specifically, one could use a nested CV scheme and compute the standard normal confidence interval for error from the inner folds. Then one could check how well this interval covers ē^(out), and adjust the interval to achieve the desired coverage. The problem with this approach is that the left-out fold is finite, so the interval is a prediction interval for ē^(out) rather than a confidence interval for the true underlying prediction error.
NCV is similar in spirit to this: over repeated subsamples, we estimate the quantity (a) in (11), which leads to an empirical estimate of how much an interval around Êrr_X̃Ỹ must be widened in order to cover ē^(out). The latter is a random quantity, however, so this should be thought of as a calibrated prediction interval. In truth, we wish to cover not ē^(out), but its mean Err_X̃Ỹ. We convert from a prediction interval for ē^(out) to a confidence interval for Err_X̃Ỹ by subtracting out the term (b).

Algorithm 1 (nested cross-validation, in outline): randomly assign points to folds I_1, . . ., I_K; in the outer CV loop over k ∈ {1, . . ., K}, run (K − 1)-fold CV on the folds other than I_k (the "inner crossval" subroutine) to collect the inner errors e^(in), evaluate the model θ̂ fit on those K − 1 folds on the held-out fold I_k, and form the plug-in estimates of (a) and (b) based on (11); the output is the prediction error estimate Êrr^(NCV) and the MSE estimate MSE^(NCV).

Remark 4 (The sample size difference and the target of inference). Note that the estimator Êrr^(NCV) uses models fit on n(K − 2)/K data points, whereas the target of inference in Theorem 3, MSE_{K−1,ñ}, is defined with respect to the prediction accuracy of a model fit with n(K − 1)/K points. How can the former be used to estimate the latter? The answer is that the nested CV procedure relies also on fits of size n(K − 1)/K; see the definition of θ̂ in the "nested crossval" subroutine of Algorithm 1. Nested CV compares the predicted accuracy of the models fit on n(K − 2)/K data points to the estimated accuracy of the models fit on n(K − 1)/K data points, using the extra holdout data to assess this accuracy.
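The following Python sketch follows the structure of Algorithm 1 under simplifying assumptions: generic `fit`/`loss` placeholders as before, and with the bias correction and the MSE rescaling/clipping described in the text omitted for brevity.

```python
import numpy as np
from scipy.stats import norm

def nested_cv(X, y, fit, loss, K=10, n_reps=20, alpha=0.10, seed=None):
    """Nested CV point estimate, MSE estimate, and confidence interval (a sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    a_terms, b_terms, inner_estimates = [], [], []
    for _ in range(n_reps):                        # repeated random fold assignments
        folds = np.array_split(rng.permutation(n), K)
        for k in range(K):                         # outer CV loop
            out = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            # inner (K-1)-fold CV on the K-1 training folds
            e_in = np.empty(len(train))
            inner_folds = np.array_split(np.arange(len(train)), K - 1)
            for f in inner_folds:
                tr = np.setdiff1d(np.arange(len(train)), f)
                predict = fit(X[train][tr], y[train][tr])
                e_in[f] = loss(predict(X[train][f]), y[train][f])
            err_in = e_in.mean()
            # held-out errors of the model fit on all K-1 training folds
            e_out = loss(fit(X[train], y[train])(X[out]), y[out])
            a_terms.append((err_in - e_out.mean()) ** 2)
            b_terms.append(e_out.var(ddof=1) / len(out))
            inner_estimates.append(err_in)
    mse_hat = max(np.mean(a_terms) - np.mean(b_terms), 0.0)   # estimate of (a) minus (b)
    err_ncv = np.mean(inner_estimates)                         # point estimate of error
    half = norm.ppf(1 - alpha / 2) * np.sqrt(mse_hat)
    return err_ncv, mse_hat, (err_ncv - half, err_ncv + half)
```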

Estimation of bias
The nested CV computations also yield a convenient estimate of the bias of the NCV point estimate of error, Êrr^(NCV). The key idea is that nested CV considers both models fit with n(K − 2)/K data points and models fit with n(K − 1)/K data points, and comparing these models gives an estimate of the bias; see Appendix C. This aspect of nested CV is not critical: the MSE estimation above is the core of our proposal.

Simulation experiments
We now explore the coverage of nested CV in a variety of settings. In each case, we will report the coverage of naïve CV (CV), nested CV (NCV), and data splitting with refitting (DS), where the nominal miscoverage rate is 10% (5% miscoverage in each tail). We also report on the width of the intervals, expressed relative to the width of the standard CV intervals. (We wish to produce intervals that are as narrow as possible while maintaining correct coverage.) We use 10-fold CV (the number of folds has little impact; see Appendix F.4) and NCV, with 200 random splits for the latter; see Appendix F.2 for the runtime of each experiment. For classification examples we use binary loss, and we form confidence intervals for CV, NCV, and data splitting after taking the binomial variance-stabilizing transformation, described in detail in Appendix G. For regression examples, we use squared-error loss.
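Appendix G spells out the exact transformation used in the paper; as an illustration only, the classical arcsine-square-root variance-stabilizing transform for a binomial error rate can be applied to an estimate and its standard error as follows.

```python
import numpy as np
from scipy.stats import norm

def binomial_vst_interval(err_hat, se, alpha=0.10):
    """CI for a misclassification rate formed on the arcsine-sqrt scale, mapped back.

    This is the textbook binomial variance-stabilizing transform, shown only as an
    illustration; it is not necessarily the exact construction of Appendix G.
    """
    z = norm.ppf(1 - alpha / 2)
    # delta method: d/dp arcsin(sqrt(p)) = 1 / (2 sqrt(p (1 - p)))
    se_t = se / (2.0 * np.sqrt(err_hat * (1.0 - err_hat)))
    center = np.arcsin(np.sqrt(err_hat))
    lo_t, hi_t = center - z * se_t, center + z * se_t
    back = lambda t: np.sin(np.clip(t, 0.0, np.pi / 2.0)) ** 2
    return back(lo_t), back(hi_t)
```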
For data splitting, we use 80% of the samples for training and 20% for estimating prediction error. Note that the data splitting without refitting intervals are the same as the data splitting with refitting intervals; the difference is that they are intended to cover different quantities. To make this comparable to CV and nested CV, we report on the coverage of Err and Err_XY here, which corresponds to data splitting with refitting. Data splitting without refitting (which seeks to cover the quantity in (13)) will typically have better coverage; we observed relatively accurate coverage in the classification examples and worse coverage in the regression examples, but we do not explicitly report these results herein.

Low-dimensional logistic regression
We consider the logistic regression data generating model (1) with n = 100 observations and p = 20 features, sampled as i.i.d. standard Gaussian variables. Due to the rotational symmetry of the features, the only parameter that affects behavior is the signal strength, and we explore models with Bayes error of either 33% or 23%. Here, we use (un-regularized) logistic regression as our fitting algorithm. We report the results in Table 1, finding that nested CV gives coverage much closer to the nominal level. Moreover, the point estimates have slightly less bias. We report the size of the NCV intervals relative to their CV counterparts per instance in Figure F.2.

Table 1: Performance of cross-validation (CV), nested cross-validation (NCV), and data splitting with refitting (DS) in the low-dimensional logistic regression model from Section 5.1.1. Each row is a setting with a different signal strength, indexed by the Bayes error: the error of the true model. The nominal total error rate is 10%, i.e., 5% above and below. A "Hi" miscoverage is one where the confidence interval is too high and the target falls below the interval; conversely for a "Lo" miscoverage. The standard error of each coverage estimate reported is about 0.5%. The "Target" column indicates the target of coverage: the intervals are always generated identically, but we report the coverage of both Err and Err_XY.
Next, we return to the question of estimands, as in Section 3. Since we do not have analytical results for the logistic regression model, we explore this in simulation. Here, we consider problems where the Bayes error rate is 22.5%, and we vary n and p. We investigate two quantities. First, we investigate the correlation of Êrr^(CV) and Err_XY, and we find that it is small but larger than in the OLS case. See Figure 9. Next, we check whether Êrr^(CV) has higher precision for Err than for Err_XY. To this end, we compute the expected value of |Êrr^(CV) − Err| and of |Êrr^(CV) − Err_XY|, and plot their relative difference in the right panel of Figure 9.
We find that the CV point estimate is again slightly more precise as an estimate of Err than of Err_XY in this setting.

High-dimensional sparse logistic regression
We return to the high-dimensional logistic regression model introduced in Section 1.1, generalizing it slightly. We consider n ∈ {90, 200} with p = 1000 features. The feature matrix has standard normal entries with an autoregressive covariance pattern such that adjacent columns have covariance ρ. In each case, we take k = 4 nonzero coefficients and use sparse logistic regression. We report on the results in Table 2 and give the interval widths in Figure 10. Again, NCV gives intervals with coverage much closer to the nominal level.

Low-dimensional linear model
We next consider an OLS example. We take X ∈ R^{n×p} with p = 20, comprised of i.i.d. N(0, 1) entries. Further, we generate a response from the standard linear model

Y_i = θ^⊤X_i + ε_i,

where ε is likewise i.i.d. N(0, 1). We use OLS to estimate θ. Note that by Lemma 1, the choice of θ does not affect the coverage rate of CV. The same argument shows that the choice of θ will not affect the coverage rate of nested CV, so we can take θ to be 0 without loss of generality. Similarly, both CV and NCV are unchanged when X is transformed by a full-rank linear operator, so the results in this section would remain unchanged for Gaussian features with any full-rank correlation structure. We report the coverage of nested CV in Figure 11. We find that this scheme works well and has good coverage for any n, overcovering somewhat for very small n. By contrast, naïve CV has poor coverage until n is 400. In Figure F.3 we report on the width of the NCV intervals relative to their CV counterparts; the ratio is typically not that large for sample sizes of n = 100 or greater.

A high-dimensional sparse linear model
We continue as in the previous experiment, but with n ∈ {50, 100} and p = 500. We choose θ to have 4 nonzero entries of equal strength such that var(Xθ)/var(ε) = 4.
Since p > n, we take the lasso estimator with a fixed penalty parameter. The penalty parameter is chosen by minimizing the cross-validation estimate of prediction error on a single independent run, and this value is then fixed for the experimental replicates. We report on the results in Table 3.

Real data examples
Lastly, we evaluate the nested CV procedure on real data sets from the UCI repository (Dua and Graff, 2017). In each case, we repeatedly subsample a small number of observations, perform nested CV on the subsample, and then use the many remaining observations to determine the accuracy of the fitted model. We consider the following data sets:
• Communities and crimes (CC). This data set is comprised of measurements of 1994 communities in the US. We predict the crime rate of each community, a real number normalized to be between 0 and 1, based on 99 demographic features of the community.
• Crop mapping (crp). This data set is comprised of optical radar measurements of cropland in Manitoba in 2012. We filter the data set to contain two classes, corn and oats, and then do binary classification based on 174 features. Here, we add a small amount of label noise so that the best possible classifier has a misclassification rate of about 5%.
We again use sparse linear or logistic regression as our fitting algorithm. The results are reported in Table 4. We find that nested CV generally has coverage that is much closer to the nominal rate than naïve CV. Data splitting has poor coverage in this case due to the small sample size, but is significantly better with n = 100 samples than with n = 50 samples.

Discussion
Our investigation had two main components. First, we discussed point estimates of prediction error via subsampling techniques. Our primary result is that common estimates of prediction error (cross-validation, bootstrap, data splitting, and covariance penalties) should be viewed as estimates of the average prediction error, averaged across other hypothetical data sets from the same distribution. The formal results here were all for the special case of the linear model using unregularized OLS for model fitting, although we also saw similar behavior in simulation for logistic regression; see Figure 9. A further important question is how regularization affects this behavior. In an additional experiment, we find that Êrr^(CV) does track Err_XY, albeit weakly, when there is regularization; see Appendix F.10. We look forward to future work explaining the behavior of cross-validation and other estimates of prediction error in these settings. Second, we discussed inference for cross-validation, deriving an estimator for the MSE of the CV point estimate: nested CV. The nested CV scheme has consistently superior coverage compared to naïve cross-validation confidence intervals, which makes it an appealing choice for providing confidence intervals for prediction error. Nonetheless, we wish to be clear that nested CV is more computationally intensive than
standard CV: we use about 1000 times more model fits per example because of the repeated splitting. For example, in the logistic regression example from Section 1.1, nested CV takes about 10 seconds on a personal computer.
A fundamental open question is to understand under what conditions the standard CV intervals will be badly behaved, making the nested CV computations necessary. Roughly speaking, we expect the standard CV intervals to perform better when n/p is larger and when more regularization is used. In our experiments, we saw that even in the mundane linear model with n/p = 10, the miscoverage rate of standard CV was about 50% larger than the nominal rate. As n increases, however, the violation decreases. Moreover, the asymptotic results in Austern and Zhou (2020) and Bayle et al. (2020) show that the coverage is correct in the p fixed, n → ∞ limiting regime. The stability conditions therein may also be able to shed light on settings with small samples or high dimensions. We look forward to future work in this direction.
To conclude, we point out several additional future directions. First, one could adapt nested CV to cases with dependent data, as is done for standard CV (Rabinowicz and Rosset, 2020). Second, we note that the general leave-out style strategy of cross-validation can also be used to "fill in" data for downstream use. Examples include pre-validation (Tibshirani and Efron, 2002; Höfling and Tibshirani, 2008) and cross-fitting (e.g., Newey and Robins, 2018). Further, confidence intervals for prediction accuracy are used to evaluate variable importance (Williamson et al., 2021; Zhang and Janson, 2020). We suspect that our nested cross-validation proposal could be adapted to improve the accuracy of these and related approaches. Lastly, cross-validation is often used to compare regression procedures (such as when selecting the value of a tuning parameter); see Section 1.2. We anticipate that nested CV can be extended to give valid confidence intervals for the difference in prediction error between two models.
An R package implementing nested CV is available at https://github.com/stephenbates19/nestedcv.

This means that the standard error estimate derived from data splitting is too small, and confidence intervals based on this number will have coverage that is too low. Importantly, this behavior is not only due to the difference in sample size used to fit θ̂^(split) and θ̂. The final term in (14) is the result of the difference in sample size, but even without this term the estimate is too small asymptotically (due to the middle term on the right-hand side of (14)). The standard error estimate se^(split) is similarly too small if one wishes to estimate (Êrr^(split) − Err_XY)²; see Proposition 3 in Appendix E. The fact that data splitting with refitting produces intervals that are slightly too small is an interesting theoretical occurrence but, based on our simulations, the magnitude of the bias is small enough that it is not a major worry in practice. In summary, data splitting with refitting has partially similar behavior to cross-validation: the point estimate has higher precision as an estimate of average prediction error, and the standard error estimate is slightly too small in the proportional asymptotic limit.

C Bias estimation
The NCV estimate of prediction error is unbiased for Err for the procedure with a reduced sample size of n(K − 2)/K, but it will typically be slightly biased upwards for Err with the full sample size. This discrepancy can be estimated by running both the usual K-fold CV and nested CV. An unbiased estimator of the difference in Err between a sample of size n(K − 2)/K (the sample size used in nested CV) and a sample of size n(K − 1)/K (the sample size used in standard CV) is Êrr^(NCV) − Êrr^(CV). Now, since we expect that prediction error scales as a + b/n in n for some unknown constants a and b (the parametric rate), an estimator for the difference between Err at a sample of size n(K − 2)/K (the sample size of each model used in nested CV) and Err at a sample of size n is then

b̂ias := (1 + (K − 2)/K) · (Êrr^(NCV) − Êrr^(CV)).

The left term in the sum, "1", accounts for the bias when going from size n(K − 2)/K to size n(K − 1)/K, and the right term in the sum, (K − 2)/K, accounts for the bias when going from size n(K − 1)/K to size n; this scaling is due to the form a + b/n. Combining this with the estimate of the MSE in the previous section leads to the confidence intervals in (10).
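Under the a + b/n scaling just described, the bias estimate reduces to a one-line computation; the sketch below assumes the two point estimates have already been computed, for example by the earlier nested CV and naive CV sketches.

```python
def ncv_bias_estimate(err_ncv, err_cv, K=10):
    """Bias estimate from Appendix C under the a + b/n scaling of prediction error.

    err_ncv averages fits on n(K-2)/K points and err_cv fits on n(K-1)/K points, so
    their difference estimates the first gap; the factor (1 + (K-2)/K) extrapolates
    the remaining gap up to a fit on all n points.
    """
    return (1.0 + (K - 2.0) / K) * (err_ncv - err_cv)
```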

D Proofs
Lemma 6. A linearly invariant estimator Êrr of prediction error is a function of the residuals after running OLS fitting. That is, with r = Y − Xθ̂ ∈ R^n, where θ̂ is the OLS estimate, we have that Êrr((x_1, y_1), . . ., (x_n, y_n)) = g(r) for some function g : R^n → R.
Proof of Lemma 6. Let θ̂ be the OLS estimate. By the invariance property (taking κ = −θ̂), we see that Êrr((x_1, y_1), . . ., (x_n, y_n)) = Êrr((x_1, y_1 − θ̂^⊤x_1), . . ., (x_n, y_n − θ̂^⊤x_n)), and the right-hand side of the above is a function only of r.
Proof of Lemma 1. Without loss of generality, consider point 1. We will show that the residual from CV for point 1 is the same when using either the data (x_1, y_1), . . ., (x_n, y_n) or the data (x_1, y_1'), . . ., (x_n, y_n'), where y_i' = y_i + x_i^⊤κ.
Let I 1 ⊂ {1, . . ., n} be the indices in the same fold as observation 1.Let θ(−1) be the OLS estimate of θ based on only the points (x i , y i ) i / ∈I1 -all the points not in the same fold as point 1-and let θ (−1) be the OLS estimate of θ based on only the points (x i , y i ) i / ∈I1 .Then θ(−1) = argmin Where the third equality is a result of changing variables from θ to θ + κ.As a result, The CV estimate of prediction error is the mean of the squared residuals, so since by the preceding display the residuals are the same for either the original or shifted data, the CV estimate of prediction error is the same in each case.
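To make the invariance concrete, the following small numerical check (not from the paper; the fold construction and sizes are illustrative) verifies that the CV residuals with OLS are unchanged when the responses are shifted by Xκ.

import numpy as np

def cv_residuals(X, y, n_folds=5, seed=0):
    """Leave-fold-out OLS residuals: each point is predicted from the model
    fit on the folds that do not contain it."""
    n = X.shape[0]
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    resid = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid[test] = y[test] - X[test] @ theta
    return resid

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
kappa = rng.standard_normal(p)

r_original = cv_residuals(X, y)
r_shifted = cv_residuals(X, y + X @ kappa)   # responses shifted by X kappa
print(np.allclose(r_original, r_shifted))    # True, as Lemma 1 predicts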
Proof of Corollary 1. We expand the mean squared error for estimating Err_XY:
$$E\big[(\widehat{\mathrm{Err}} - \mathrm{Err}_{XY})^2\big] \;=\; E\big[(\widehat{\mathrm{Err}} - \mathrm{Err}_X)^2\big] + E\big[(\mathrm{Err}_X - \mathrm{Err}_{XY})^2\big] + 2\,E\big[(\widehat{\mathrm{Err}} - \mathrm{Err}_X)(\mathrm{Err}_X - \mathrm{Err}_{XY})\big] \;=\; E\big[(\widehat{\mathrm{Err}} - \mathrm{Err}_X)^2\big] + E\big[(\mathrm{Err}_X - \mathrm{Err}_{XY})^2\big],$$
where the last equality follows from Theorem 1. The result follows by noting that the second term in the final line is E[var(Err_XY | X)], since Err_X = E[Err_XY | X].
Proof of Theorem 2. Note that the residuals from OLS remain unchanged when we transform the features by a full-rank linear transformation. Thus, we can transform the features by Σ^{−1/2} without changing the behavior of invariant estimators, and so without loss of generality we can take Σ to be the identity matrix. Nonetheless, we leave Σ generic throughout the proofs. We first give an expression for Err_XY:
$$\mathrm{Err}_{XY} \;=\; \sigma^2 + (\hat\theta - \theta)^\top \Sigma\, (\hat\theta - \theta) \;=\; \sigma^2 + \sigma^2\, Z^\top (X^\top X)^{-1/2}\, \Sigma\, (X^\top X)^{-1/2} Z,$$
where Z ∼ N(0, I) is, conditional on X, a function only of the noise ε. Next, we decompose the variance with the conditional variance formula, conditioning on X:
$$\mathrm{var}(\mathrm{Err}_{XY}) \;=\; E\big[\mathrm{var}(\mathrm{Err}_{XY} \mid X)\big] + \mathrm{var}(\mathrm{Err}_X),$$
which was previously stated in (6) of the main text. We next derive asymptotic rates for the two components of this sum.
Proof of first claim. We begin with the first term, which corresponds to the variance caused by the randomness in Y | X.
where $\lambda_1^2, \ldots, \lambda_p^2$ are the eigenvalues of $\Sigma \widehat{\Sigma}^{-1}$, a function of X. Thus, the proof is complete once we show that the expectation term in the final line is Θ(n), which we turn to next.
Notice that where This completes the proof of the first claim.
Proof of second claim. Now, we turn our attention to the second term, which corresponds to the variance caused by the randomness in X. The quantity var(Err_X) is proportional to $\mathrm{var}\big(\mathrm{tr}(\Sigma \widehat{\Sigma}^{-1})\big)$, which is controlled using the second moments of the inverse Wishart distribution (von Rosen, 1988); a conditional covariance decomposition, conditioning on X, is also used along the way.
For the second claim, note that Err = g(Err_X, U) for some function g and an independent U ∼ Unif[0, 1] (by Theorem 1); that is, Err is a random function of Err_X alone. As a result, cor(Err_XY, Err) = cor(Err_XY, g(Err_X, U)), and the remaining steps use the conditional covariance formula, conditioning on X.
Proof of Corollary 3. For the first claim, note that both the out-of-bag and .632 bootstrap estimators are functions only of the quantities $y_i - x_i^\top \theta$ across many different bootstrap samples, so the proof is complete.

E Additional technical results
Lemma 7. In the setting of Proposition 2, with notation as in Section 3.4:
Proof of Lemma 7. Using the notation from the proof of Theorem 2, the result follows.
Proposition 3. In the setting of Theorem 2, with notation as in Section 3.4, suppose also that |I^(train)| > p and |I^(out)|/n → c ∈ (0, 1). Then
(16) gives an expression for E[ se^(split) ]. In the proportional asymptotic limit (7), all terms on both sides of (16) are of order 1/n, except the O(1/n²) term on the right-hand side.
Proof of Proposition 3. From Proposition 2, we have an expression for the left-hand side; we then expand the first term. It remains only to show that cov(Err^(split), Err_XY) = cov(Err^(split), Err_X), which follows from the conditional covariance decomposition (conditioning on X) together with Theorem 1. Applying Cauchy–Schwarz, |cov(Err^(split), Err_X)| ≤ √(var(Err^(split)) · var(Err_X)), and Theorem 2 then gives the rate of var(Err_X).

F Further simulation results
F.1 The bias of CV in the proportional asymptotic regime.
In the setting of Figure 5, we report on the variance and bias squared of cross-validation in Figure F.1. We find that for the values of n we consider, the variance is much larger than the bias. We see, however, that for larger n, the constant bias would overtake the variance as the leading issue. (One way to address this issue is to have the number of folds grow appropriately.)

F.2 Compute times
In Table F.5, we report on the runtime of CV and NCV for our experiments.

F.3 Additional details on experiments from Section 5
This section reports additional results from the set of experiments in Section 5.

F.4 Number of folds
We next investigate how the number of folds affects the CV inflation: the ratio of the true standard error of the point estimate to the CV estimate of the standard error. We consider a linear model with p = 20 features that are sampled as i.i.d. standard Gaussians. The number of observations ranges from 50 to 400. We use OLS as the fitting algorithm, so as a result of Lemma 1, the results do not depend on the true coefficients θ. We report the results in Figure F.5, where we find that the number of folds has minimal impact on the inflation, although more folds give moderately better coverage for small n. We also find that even when n/p is as large as 20, there is appreciable CV inflation, and naïve cross-validation leads to intervals with poor coverage.
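For concreteness, a Monte Carlo sketch of this inflation computation is below. This is our own illustrative code (not the authors'), and the naïve standard error shown is the usual pooled sd/√n formula.

import numpy as np

def kfold_errors(X, y, K=10, seed=0):
    """Held-out squared errors from K-fold CV with OLS."""
    n = X.shape[0]
    folds = np.random.default_rng(seed).permutation(n) % K
    errs = np.empty(n)
    for k in range(K):
        tr, te = folds != k, folds == k
        theta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        errs[te] = (y[te] - X[te] @ theta) ** 2
    return errs

def cv_inflation(n=100, p=20, K=10, n_reps=2000, seed=0):
    """Ratio of the true standard error of the CV point estimate (across
    replicates) to the average naive CV standard error estimate."""
    rng = np.random.default_rng(seed)
    points, naive_ses = [], []
    for _ in range(n_reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)      # theta = 0 WLOG, by Lemma 1
        e = kfold_errors(X, y, K=K)
        points.append(e.mean())
        naive_ses.append(e.std(ddof=1) / np.sqrt(n))
    return np.std(points, ddof=1) / np.mean(naive_ses)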

F.5 CV in the proportional region
We next show some experimental results further exploring the regime from Section 3.2. In that regime, the CV point estimate is a more accurate estimate of Err than of Err_XY, with higher accuracy for the former by a constant fraction as n, p → ∞. Next, we plot the coverage of CV in Figure F.7 and see that it is far from the nominal level, even as n and p grow and the intervals are given oracle debiasing so that they are centered around the correct value. In Figure F.8, we show that these intervals have higher than the nominal miscoverage rate in both tails, confirming that the miscoverage is not due to bias; it is due to the intervals being too narrow.

F.6 Coverage of data splitting in proportional regime
Here, we record further results about data splitting in the setting from Section 3.4. In Figure F.9, we show that data splitting does not approach the nominal coverage rate, even when the intervals are debiased by an oracle to be centered at the correct value.

F.8 Number of Repeated Splits
In our experiments, we found that a large number (e.g., 200) of random splits of nested CV were needed to obtain stable estimates of the standard error. In Figure F.10, we show how the estimate of standard error relative to the naïve CV estimate (the inflation) converges as the number of repetitions increases, for one example of the logistic regression model in Section 1.1.

F.9 The Austern and Zhou estimate of standard error
Next, we investigate the standard error estimator of Austern and Zhou (2020), which we will call the AZ estimator. Under certain stability conditions, that work proves a triangular array central limit theorem for the CV point estimate around Err, and the AZ estimator is shown to be asymptotically correct. We investigate this estimator in the OLS setup from Section 5.2.1.

F.10 The effect of regularization
In this simulation, we check the results of Section 3 in the presence of regularization. In particular, we return to the linear model simulation setup of Section 5.2.1 and check the correlation between the CV point estimate and Err_XY with the ridge regression estimator, using varying regularization strength. Section 3 shows that this correlation is nearly zero with no regularization, so we are interested in how this changes for ridge regression. We report the results in Figure F.12. We find that for low levels of regularization, the correlation remains small. For larger values of regularization, the correlation can be as large as 15% in this example. Our intuition for this behavior is that with regularization there is less overfitting, so the held-out data are fresher, so to speak, and the CV estimator is able to partially track Err_XY. We look forward to future work understanding this behavior. To report the level of regularization in an interpretable way, the x-axis of Figure F.12 shows the fractional reduction in the norm of the fitted coefficient vector of the ridge regression fit compared to the unregularized fit.
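A minimal sketch of this experiment (ours, not the authors' code) is below. The penalty value lam, the number of replicates, and the large fresh test set used to approximate Err_XY are illustrative choices, and the norm-reduction quantity on the x-axis of Figure F.12 is not computed here.

import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression coefficients with penalty lam (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cor_cv_errxy_ridge(n=100, p=20, lam=1.0, K=10, n_reps=500, n_test=10000, seed=0):
    """Correlation between the K-fold CV estimate and Err_XY for ridge regression,
    with Err_XY approximated by the error on a large fresh test set."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(p) / np.sqrt(p)
    cv_est, err_xy = [], []
    for _ in range(n_reps):
        X = rng.standard_normal((n, p))
        y = X @ theta + rng.standard_normal(n)
        folds = rng.permutation(n) % K
        errs = np.empty(n)
        for k in range(K):
            tr, te = folds != k, folds == k
            b = ridge_fit(X[tr], y[tr], lam)
            errs[te] = (y[te] - X[te] @ b) ** 2
        cv_est.append(errs.mean())
        b_full = ridge_fit(X, y, lam)                  # model fit on all the data
        Xt = rng.standard_normal((n_test, p))
        yt = Xt @ theta + rng.standard_normal(n_test)
        err_xy.append(np.mean((yt - Xt @ b_full) ** 2))
    return np.corrcoef(cv_est, err_xy)[0, 1]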
Moreover, for any linearly invariant estimator Err (such as Err^(CV) using OLS as the fitting algorithm), cor(Err_XY, Err) → 0 as n → ∞. This means that CV does not track Err_XY. The proof is as in the proof of Corollary 2, applying Theorem 4 in place of Theorem 2.
Proof of Theorem 4. The proof follows as in the proof of Theorem 2, noting that in this case $\mathrm{var}\big(\mathrm{tr}(\Sigma \widehat{\Sigma}^{-1})\big) = \Theta(1/n)$.

I Connection with results about the k-fold test error
Another estimand considered in the cross-validation literature is the k-fold test error: the average accuracy of the K submodels fit during cross-validation (e.g., Bayle et al., 2020). Formally, writing $\hat f^{(-k)}$ for the model fit with the kth fold held out and ℓ for the loss, this quantity is defined as
$$\mathrm{Err}_{k\text{-fold}} \;=\; \frac{1}{K} \sum_{k=1}^{K} E\Big[\ell\big(Y_{n+1},\, \hat f^{(-k)}(X_{n+1})\big) \;\Big|\; \hat f^{(-k)}\Big],$$
where the expectation is over a fresh test point (X_{n+1}, Y_{n+1}) ∼ P with the training data held fixed. This quantity has earned recognition because of its technical properties. For example, Bayle et al. (2020) prove a far-reaching central limit theorem for (Err^(CV) − Err_k-fold). (Note that the expected value of this quantity is zero. One might say that Err^(CV) is unbiased for Err_k-fold, with the tacit understanding that the latter quantity is random.) How does Err_k-fold relate to Err and Err_XY? We offer some initial observations here. First, we carry out a numerical experiment, returning to the OLS setting of Figure 3. Here, we find that cross-validation is a better estimate of Err than of Err_k-fold, and a worse estimate of Err_k-fold than of Err_XY. See Figure I.14.
In this setting, the correlation between Err^(CV) and Err_XY is less than 1%, whereas the correlation between Err^(CV) and Err_k-fold is 8.5%. That is, the CV point estimate tracks the variation in Err_k-fold in a non-negligible way; see Figure I.15. In view of the definition of the k-fold test error, we expect that this correlation is nontrivial asymptotically, but we leave a full investigation of this asymptotic correlation to future work.
Next, we consider an estimator for the variance of (Err^(CV) − Err_k-fold) from Bayle et al. (2020): the all-pairs variance estimator. Notice that this results in (very slightly) smaller confidence intervals than the naïve intervals we consider in this work. Thus, the performance of the so-called naïve estimator in the simulations presented in this work is essentially the coverage of the CIs created using the central limit theorem of Bayle et al. (2020) together with this variance estimator (the coverage of the naïve estimator in this work is slightly larger). Since these CLT intervals are shown to be asymptotically exact for the quantity Err_k-fold, the poor coverage that we observe in our simulations is due either to (i) the limiting behavior derived in Bayle et al. (2020) not being a good approximation at the modest sample sizes we consider, or (ii) the different precision attained when estimating Err_XY compared to Err_k-fold. Regarding the first point, that work introduced algorithmic stability conditions that may not be satisfied in the high-dimensional or low sample size settings we consider.
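For reference, the naïve interval construction referred to throughout this work can be sketched as follows (our own illustrative code): it pools all held-out errors and uses an i.i.d.-style standard error, which is exactly the step that ignores the between-fold correlations.

import numpy as np
from scipy import stats

def naive_cv_interval(holdout_errors, alpha=0.10):
    """Naive CV confidence interval for prediction error: mean of the pooled
    held-out errors plus/minus a normal quantile times sd/sqrt(n). The
    correlation between errors in different folds is ignored, which is why
    the resulting intervals can be too narrow."""
    e = np.asarray(holdout_errors, dtype=float)
    n = e.size
    point = e.mean()
    se = e.std(ddof=1) / np.sqrt(n)
    z = stats.norm.ppf(1 - alpha / 2)
    return point - z * se, point + z * se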

Figure 2: Possible targets of inference for cross-validation. Here, (X, Y) is the training data and Err_XY is the average error of the model fit on (X, Y) on a test data set of infinite size. From left to right, the random variables above are a constant, a function of X only, and a function of (X, Y).

Figure 3: Left: mean squared error of the CV point estimate of prediction error relative to three different estimands: Err, Err_X, and Err_XY. Center: coverage of Err, Err_X, and Err_XY by the naïve cross-validation intervals in a homoskedastic Gaussian linear model. The nominal miscoverage rate is 10%. Each pair of points connected by a line represents 2000 replicates with the same feature matrix X. Right: 2000 replicates with the same feature matrix and the line of best fit (blue).

Figure 5: Simulation results demonstrating the asymptotic scaling presented in Figure 4. The fitted slopes of the lines (after log-transforming both axes) are 0.00, −0.46, −0.50, −1.01, from top to bottom. See Section 3.3 for details about the rate of Err.

Err^(Cp) is a function of the residuals of the OLS fit. Thus, the conclusions of Theorem 1, Corollary 1, Corollary 2, and Corollary 3 hold for Err^(Cp).

Figure 6: A representation of the notions of error considered in this work. See Rosset and Tibshirani (2020) for a definition and discussion of Err_S.

Figure 7: Covariance structure of the CV errors. Red entries correspond to the covariance between points in the same fold, and blue entries correspond to the covariance between points in different folds.

Figure 8: Visualization of nested CV. Using only the folds to the left of the vertical line, we perform the usual cross-validation by holding out one fold at a time (the dark grey fold) and then fitting on the remaining folds (the light grey folds). The fresh holdout points (the blue fold) are never used in the inner CV step.

Figure 9: Behavior of cross-validation with a logistic regression model. Left: the correlation between the point estimate and the instance-specific error. Right: fractional change in the mean absolute deviation of the point estimate with respect to Err_XY versus Err.

Figure 10: Size of the nested CV intervals relative to the size of the naïve CV intervals in the high-dimensional sparse logistic regression experiment from Section 5.1.2.

Figure 11: Coverage of CV, data splitting, and nested CV in the OLS case.
In Figure F.9, we verify with an experiment that the data splitting intervals do not have coverage approaching the nominal level, even as n and p grow. (Naïve cross-validation has a similar miscoverage problem; see Figure F.7 and Figure F.8.) We emphasize, however, that the data splitting (without replacement) intervals have coverage much closer to the nominal level than cross-validation because they do not suffer from the data re-use (i.e., correlation across folds) problem; see again Figure F.9 and Figure F.7.
Figure F.1: The bias and variance of the CV estimate of Err. The simulation uses the same setup as Figure 5; we use OLS as the fitting algorithm, the data are generated from a linear model with Gaussian errors, and n/p = 5. We use K = 10 folds.

Figure F.2: Size of the nested CV intervals relative to the size of the naïve CV intervals in the low-dimensional logistic regression example from Section 5.1.1.
Figure F.5: The CV inflation and coverage of the naïve interval in the low-dimensional linear model with 20 features. The dark grey horizontal line in the middle panel gives the nominal miscoverage level. The right panel gives the average width of the confidence interval.
Figure F.10: The nested CV inflation estimate after a large number of repeated splits.
In particular, we compare the average size of the confidence intervals based on the AZ estimate of standard error to those from naïve CV, reporting the results in Figure F.11. Comparing the p = 20 curve (teal) to that from nested CV in Figure F.3, we see that the AZ estimator produces significantly larger intervals than nested CV; they are a factor of two larger for n = 200. (Nested CV has correct coverage in this experiment; see Figure 11.) Still, the AZ estimator is not too conservative once n = 800 for p = 20 or n = 200 when p = 5, in line with the asymptotic results proved in Austern and Zhou (2020).
Figure F.11: The average size of the confidence intervals of the AZ estimator in the linear model.
Figure F.12: The correlation of the CV point estimate with Err_XY under ridge regression. The regularization strength varies from small (left) to large (right). To report the level of regularization in an interpretable way, the x-axis shows the fractional reduction in the norm of the fitted coefficient vector of the ridge regression fit compared to the unregularized fit.
Figure I.14: MSE of the cross-validation point estimate for estimating four targets of inference.

Table 2: Performance of cross-validation (CV), nested cross-validation (NCV), and data splitting (DS) in the high-d logistic regression model from Section 5.1.2. The nominal (target) error rate is 10%, i.e., 5% above and below. Other details as in Table 1.

Results for the high-d linear regression model are reported in Table 3, again finding that NCV has better coverage than CV, although both struggle when n = 50. The ratio of the nested CV interval width to the naïve CV interval width is relatively stable across observations; see Figure F.4.

Table 3: Performance of cross-validation (CV), nested cross-validation (NCV), and data splitting (DS) in the high-d linear regression model. The nominal (target) error rate is 10%, i.e., 5% above and below. Other details as in Table 1.

Table 4: Performance of cross-validation (CV), nested cross-validation (NCV), and data splitting (DS) with the real data sets. The nominal (target) error rate is 10%, i.e., 5% above and below. Other details as in Table 1.

Proof of Corollary 2. For the first claim,
$$\mathrm{cor}(\mathrm{Err}_X, \mathrm{Err}_{XY}) \;=\; \frac{\mathrm{cov}(\mathrm{Err}_X, \mathrm{Err}_{XY})}{\sqrt{\mathrm{var}(\mathrm{Err}_X)\,\mathrm{var}(\mathrm{Err}_{XY})}} \;=\; \frac{\mathrm{cov}\big(\mathrm{Err}_X,\, E[\mathrm{Err}_{XY} \mid X]\big)}{\sqrt{\mathrm{var}(\mathrm{Err}_X)\,\mathrm{var}(\mathrm{Err}_{XY})}},$$
where the second equality follows by conditioning on X.

Table F.5: Approximate computation times for one run of CV and NCV for each of the experimental settings.