Two-Stage Plans for Estimating the Inverse of a Monotone Function

This study investigates two-stage plans based on nonparametric procedures for estimating an inverse regression function at a given point. Specifically, isotonic regression is used at stage one to obtain an initial estimate followed by another round of isotonic regression in the vicinity of this estimate at stage two. It is shown that such two-stage plans accelerate the convergence rate of one-stage procedures and are superior to existing two-stage procedures that use local parametric approximations at stage two when the available budget is moderate and/or the regression function is “ill-behaved.” Both Wald- and likelihood ratio-type confidence intervals for the threshold value of interest are investigated and the latter are recommended in applications due to their simplicity and robustness. The developed plans are illustrated through a comprehensive simulation study and an application to car fuel efficiency data.


INTRODUCTION
Threshold estimation is a canonical statistical estimation problem with numerous applications in science and engineering. Here is an interesting motivating example. In recent years, an important consideration for both car manufacturers and potential buyers in the United States is the fuel efficiency (FE) of the vehicle, expressed in miles per gallon (MPG). The National Highway Traffic Safety Administration (NHTSA) regulates the Corporate Average Fuel Economy (CAFE) standards to encourage automobile manufacturers to improve the average fuel efficiency of their fleets of vehicles. The CAFE standard for 2012 models is 29.8 MPG, set to increase to 34.3 MPG in 2016 according to the Environmental Protection Agency rule that came into effect in August 2012. To encourage higher FE, manufacturers are subject to a penalty if the average FE of their fleets falls below the CAFE standard. Moreover, a so-called gas guzzler tax is imposed on cars with low FE in accordance with the U.S. Energy Tax Act of 1978.
The data on fuel efficiency as a function of the vehicle's horsepower, a key determinant of fuel efficiency, for the 2012 models are shown in Figure 1 (for a detailed discussion of the data and CAFE standards, see Section 4). The expected decreasing relationship is observed, and it is of interest to identify the horsepower threshold at which the fuel efficiency meets the current CAFE standard, as well as the 2016 standard.
The data plot indicates that fitting a precise parametric model may be challenging, whereas it is rather straightforward to fit a monotonically decreasing nonparametric one and obtain the horsepower thresholds for the target fuel efficiency values of around 30 and 34 MPG. Further, it is desirable to attach a confidence interval to the estimate, and an interesting question addressed in this article is whether an adaptive procedure can lead to improved precision for such threshold estimates.
The topic of using adaptive procedures in a design setting for threshold estimation models has recently been studied in the literature (Lan, Banerjee, and Michailidis 2009; Tang, Banerjee, and Michailidis 2011). Given our motivating data application, the basic model considered in this article is the same as that in Tang, Banerjee, and Michailidis (2011): Y = m(X) + ε, where the design point X takes values in [a, b], the regression function m is monotone (henceforth assumed nondecreasing for ease of presentation), and the random error ε has mean 0 and finite variance σ². The quantity to be estimated is the threshold d_0 = m^{-1}(θ_0) for some prespecified θ_0: this is the value beyond which the mean response exceeds the level θ_0.
The two-stage adaptive procedure employed in Tang, Banerjee, and Michailidis (2011) works as follows: (i) in the first stage, a portion p of the design budget is used to obtain an initial estimate of d_0; (ii) in the second stage, the remaining portion (1 − p) of the budget is used to obtain more sample points in a small neighborhood of that estimate; (iii) finally, an improved estimate based on the second-stage data is constructed. This "zoom-in" sampling leads to accelerated convergence rates of the second-stage estimators of d_0 compared to the standard ones that use all the data in one shot. Specifically, for the inverse regression problem, the rate of convergence can be accelerated from n^{1/3} to n^{1/2} by employing a local linear approximation (Tang, Banerjee, and Michailidis 2011), where n denotes the total available budget. Hence, tighter confidence intervals with the correct nominal coverage can be constructed with the same budget as standard one-stage procedures; alternatively, one can reduce the design budget and still obtain good-quality confidence intervals.
Our problem is closely related to dose-response (Rosenberger and Haines 2002; Dette, Neumeyer, and Pilz 2005; Bhattacharya and Lin 2010) and statistical calibration studies (Osborne 1991); for additional references, see Tang, Banerjee, and Michailidis (2011). While the above strategy (a first-stage estimate using isotonic regression, followed by a local linear approximation) gives a √n-consistent second-stage estimator, its success hinges heavily upon the (approximate) linearity of the regression function m(·) in the vicinity of d_0. Small departures from linearity do not adversely affect the results (especially when the budget is large enough to allow for significant "zooming-in" at the second stage), but severe departures are a totally different matter, as illustrated next.
Consider a monotone regression function exhibiting strong nonlinearity around d_0 = 0.5, for example, m(x) = (1/40) sin(6πx) + 1/4 + (1/2)x + (1/4)x² with x ∈ [0, 1] (see the plot labeled IsoSin in the left panel of Figure 4). The coverage rates and average lengths of the confidence intervals obtained from the two-stage adaptive strategy based on isotonic regression and a local linear approximation for selected total budget sizes (n = 100, 300, 500), varying portions p allocated to the first stage, and different noise levels (σ = 0.1, 0.3, 0.5) are depicted in Figure 2.
It can be seen that for the majority of portions p, the confidence intervals constructed from this adaptive strategy fail miserably in terms of coverage rates and exhibit relatively large lengths. Of interest is the fact that for large p, the coverage rates indeed approach the nominal level. In practice, however, it is not possible to choose an appropriate p without prior information on m. Further, even for large p's, the confidence intervals are excessively wide, especially for large noise levels.
In contrast, a two-stage adaptive strategy based on employing isotonic regression at both stages, which is fully developed in this article, overcomes these difficulties. Such a strategy is also desirable for the motivating data application, due to the high variability in the vicinity of the CAFE thresholds seen in the scatterplots of Figure 1. In Figure 3, the coverage rates and average lengths using our new strategy are shown for the same settings as above, but with p = 1/4 (for more on this universal choice of p, see Section 2). It can be seen that this wholly nonparametric strategy overcomes the previous difficulties, proves robust to the level of local nonlinearity of the regression function m, and, as argued in Section 2, is easy to implement.
The remainder of the article is organized as follows: in Section 2, the adaptive procedure is introduced and the main results are presented. Section 3 presents extensive simulation results, while an interesting application of the methodology to fuel efficiency data is given in Section 4. Section 5 concludes. Proofs are sketched in the Appendix.

An Overview of the Isotonic Regression Procedure
We provide a brief description of the one-stage isotonic regression procedure (OSIRP). Specifically, given n design points X_1 ≤ X_2 ≤ ⋯ ≤ X_n sampled from a continuous design density g and the corresponding responses Y_1, …, Y_n obtained from the proposed model (the design points are assumed to have already been ordered), the isotonic regression estimate of m(·) is

m_I = argmin_{φ nondecreasing} Σ_{i=1}^{n} (Y_i − φ(X_i))².

This minimizer exists uniquely, has a nice geometric characterization as the slope of the greatest convex minorant of a stochastic process, and is readily computable using the pool adjacent violators algorithm (PAVA) (see, e.g., Robertson, Wright, and Dykstra 1988). Then, for a prespecified value θ_0 ∈ (m(a), m(b)), the one-stage isotonic regression estimator of d_0 is

d_I = inf{x ∈ [a, b] : m_I(x) ≥ θ_0}.

There is a mild assumption on the regression function and the design density, namely, Assumption A: m is once continuously differentiable in a neighborhood of d_0 with positive derivative m′(d_0), and g is positive and continuous at d_0.
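To fix ideas, here is a minimal sketch of the OSIRP in Python: a textbook PAVA implementation together with the plug-in inverse estimator d_I. The function names and the simulated model below are our own illustration, not code from the article.

```python
import numpy as np

def pava(y):
    """Pool adjacent violators: least-squares nondecreasing fit to y."""
    vals, sizes = [], []
    for v in np.asarray(y, dtype=float):
        vals.append(v); sizes.append(1)
        # merge adjacent blocks while the monotonicity constraint is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v2, s2 = vals.pop(), sizes.pop()
            v1, s1 = vals.pop(), sizes.pop()
            vals.append((s1 * v1 + s2 * v2) / (s1 + s2))
            sizes.append(s1 + s2)
    return np.repeat(vals, sizes)

def osirp(x, y, theta0):
    """One-stage estimator d_I: smallest design point with fitted mean >= theta0."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    fit = pava(y)
    i = np.searchsorted(fit, theta0)  # fit is nondecreasing
    return x[min(i, len(x) - 1)], fit
```

For example, with m(x) = x, σ = 0.1, and θ_0 = 0.5, the estimate concentrates around d_0 = 0.5 at the n^{1/3} rate described below.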
Under this assumption, the asymptotic distribution of d_I is given by (see Tang, Banerjee, and Michailidis 2011)

n^{1/3}(d_I − d_0) →_d C_{d_I} Z,

where C_{d_I} = {4σ²/(g(d_0) m′(d_0)²)}^{1/3} and Z follows the standard Chernoff distribution (Groeneboom and Wellner 2001). This result can be used to construct a 1 − α Wald-type confidence interval for d_0:

[d_I − n^{-1/3} Ĉ_{d_I} q(Z, 1 − α/2), d_I + n^{-1/3} Ĉ_{d_I} q(Z, 1 − α/2)],

where the hat denotes a consistent estimate and q(ξ, τ) is the lower τth quantile of a random variable ξ. An alternative is to construct confidence intervals through likelihood ratio (LR) testing. Specifically, for a candidate value t, the hypotheses of interest are

H_0: d_0 = t versus H_1: d_0 ≠ t, (2.4)

or equivalently, H_0: m(t) = θ_0. Then, the LR test statistic is

2 log λ_I(t) = 2{l_n(m_I, σ̂) − l_n(m_I^c, σ̂)}, (2.5)

where l_n(m, σ) = −(2σ²)^{−1} Σ_{i=1}^{n} (Y_i − m(X_i))², m_I^c is the constrained isotonic regression of m under H_0, and σ̂ is a consistent estimate of σ. It is known that m_I^c uniquely exists (see Banerjee 2000). The asymptotic distribution of 2 log λ_I under H_0 can be obtained from Banerjee (2009) (see Theorem 1.1 and also the first paragraph of Section 1.4, the concluding discussion): 2 log λ_I →_d D, where D is a "universal" random variable not depending on the parameters of the model (Banerjee and Wellner 2001). This result allows us to construct a 1 − α LR-type confidence region for d_0:

{t ∈ [a, b] : 2 log λ_I(t) ≤ q(D, 1 − α)}. (2.6)

[Figure 3. The left panel shows the coverage rates of the 95% confidence intervals using a two-stage procedure for d_0 = 0.5 with different sample sizes and noise levels. The right panel shows the corresponding average lengths of the intervals.]
The LR-type confidence region can be shown to be an interval and is typically asymmetric around d_I, unlike the Wald-type one. Its main advantage is that only σ needs to be estimated for its construction, whereas for the Wald confidence interval, estimation of m′(d_0) is also needed, a significantly more involved task.
Remark 2.1. The use of the term LR statistic in connection with (2.5) needs to be clarified. Under a normality assumption on the errors, 2 log λ I is, indeed, a proper likelihood ratio statistic; otherwise, it is more accurately a residual sum of squares statistic which can be interpreted as a "working likelihood ratio statistic" where the normal likelihood is used as a working likelihood. In this article, we do not assume normality of errors but continue to use the term LR statistic for 2 log λ I in the above sense.
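To illustrate, the working LR statistic and its inversion over a grid of candidate thresholds can be sketched as follows. The constrained fit uses the well-known characterization from Banerjee (2000): the isotonic fit to the data left of the hypothesized threshold, capped above at θ_0, joined with the isotonic fit to the right, floored below at θ_0. The function names and the plug-in treatment of σ̂² are our own illustrative choices.

```python
import numpy as np

def pava(y):
    """Pool adjacent violators: least-squares nondecreasing fit."""
    vals, sizes = [], []
    for v in np.asarray(y, dtype=float):
        vals.append(v); sizes.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v2, s2 = vals.pop(), sizes.pop()
            v1, s1 = vals.pop(), sizes.pop()
            vals.append((s1 * v1 + s2 * v2) / (s1 + s2))
            sizes.append(s1 + s2)
    return np.repeat(vals, sizes)

def lr_stat(x, y, t, theta0, sigma2):
    """Working LR statistic 2*log(lambda) for H0: d0 = t (x assumed sorted)."""
    left = x <= t
    fit = pava(y)                                      # unconstrained isotonic fit
    fit_c = np.empty_like(fit)                         # constrained fit under H0:
    fit_c[left] = np.minimum(pava(y[left]), theta0)    # capped at theta0 left of t
    fit_c[~left] = np.maximum(pava(y[~left]), theta0)  # floored at theta0 right of t
    return (np.sum((y - fit_c) ** 2) - np.sum((y - fit) ** 2)) / sigma2

def lr_interval(x, y, theta0, sigma2, q):
    """All candidate thresholds t (on the design grid) not rejected at cutoff q."""
    keep = [t for t in x if lr_stat(x, y, t, theta0, sigma2) <= q]
    return (min(keep), max(keep)) if keep else None
```

With q set to q(D, 1 − α), `lr_interval` returns the endpoints of the LR-type confidence region (2.6).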

Adaptive Two-Stage Procedures
As noted in the Introduction, adaptive two-stage procedures can lead to accelerated convergence rates and hence to sharper confidence intervals for d_0. The main steps of such a two-stage, fully nonparametric procedure are outlined next:
1. Denote by p ∈ (0, 1) the sample proportion to be allocated to the first stage and by n_1 = np and n_2 = n − n_1 the corresponding first- and second-stage sample sizes, respectively.
2. Compute a first-stage monotone nonparametric estimator m̂_1 of m from the first-stage data and obtain the corresponding first-stage estimator d_{1,I} = m̂_1^{-1}(θ_0) of d_0 for a prespecified value θ_0.
3. Specify the second-stage sampling interval [L_1, U_1] = [d_{1,I} − C_1 n_1^{-γ_1}, d_{1,I} + C_1 n_1^{-γ_1}], where C_1 > 0 and 0 < γ_1 < 1/3; recall that n^{1/3} is the convergence rate of d_{1,I}.
4. Obtain the second-stage data {(X_{2,i}, Y_{2,i})}_{i=1}^{n_2} with a design density g_2 on [L_1, U_1]. Employ these data and a nonparametric procedure (which could be different from the one used previously) to compute a monotone second-stage estimator m_{2,I} and, as in the first stage, the corresponding estimator d_{2,I}.
5. Construct confidence intervals for d_0 using the asymptotic distribution of d_{2,I}.
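The steps above can be sketched in code as follows, using isotonic regression at both stages. The sampler interface, the clipping of [L_1, U_1] to the design space, and the specific constants C_1 and γ_1 are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def pava(y):
    """Pool adjacent violators: least-squares nondecreasing fit."""
    vals, sizes = [], []
    for v in np.asarray(y, dtype=float):
        vals.append(v); sizes.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v2, s2 = vals.pop(), sizes.pop()
            v1, s1 = vals.pop(), sizes.pop()
            vals.append((s1 * v1 + s2 * v2) / (s1 + s2))
            sizes.append(s1 + s2)
    return np.repeat(vals, sizes)

def inverse_estimate(x, y, theta0):
    """Smallest design point whose isotonic fitted value reaches theta0."""
    fit = pava(y)
    i = np.searchsorted(fit, theta0)
    return x[min(i, len(x) - 1)]

def two_stage_ir(sample, n, theta0, rng, p=0.25, C1=0.6, gamma1=0.3, a=0.0, b=1.0):
    """IR+IR: stage-1 isotonic estimate, zoom in, stage-2 isotonic estimate."""
    n1 = int(round(p * n)); n2 = n - n1
    x1 = np.sort(rng.uniform(a, b, n1)); y1 = sample(x1, rng)
    d1 = inverse_estimate(x1, y1, theta0)
    half = C1 * n1 ** (-gamma1)          # [L1, U1] = [d1 +/- C1 * n1^(-gamma1)]
    L1, U1 = max(a, d1 - half), min(b, d1 + half)
    x2 = np.sort(rng.uniform(L1, U1, n2)); y2 = sample(x2, rng)
    d2 = inverse_estimate(x2, y2, theta0)
    return d1, d2, (L1, U1)
```

For instance, with m(x) = x² and θ_0 = 0.25 (so d_0 = 0.5), the zoom-in interval traps d_0 with high probability and the second-stage estimate is markedly more precise than the first.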
Remark 2.2. Choosing γ 1 < 1/3 ensures that the stage-two sampling interval contains d 0 with probability going to 1.

Remark 2.3. The key intuition behind the proposed two-stage procedure is that the resolution of the sampling design increases. Specifically, if one considers sampling based on a uniform grid, at the second stage the sampling interval has length 2C_1 n_1^{-γ_1}. If the second-stage design points are equispaced over this interval, the resolution of the resulting grid at which responses are measured is O(n_1^{-γ_1}/n_2) = O(n^{-(1+γ_1)}), and this is the driving factor behind the realized rate of convergence of the two-stage estimator, as formally established in Proposition 1, in contrast to the rate obtained by the OSIRP (n^{1/3}), since in the latter procedure the resolution of the sampling design is simply O(n^{-1}).
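As a quick numeric illustration of this resolution gain, consider the hypothetical values n = 300, p = 1/4, γ_1 = 1/3, and C_1 = 1:

```python
n, p, gamma1, C1 = 300, 0.25, 1 / 3, 1.0   # hypothetical budget, allocation, constants
n1 = int(n * p)                            # 75 first-stage points
n2 = n - n1                                # 225 second-stage points
one_stage_resolution = 1.0 / n                         # equispaced grid on [0, 1]
two_stage_resolution = 2 * C1 * n1 ** (-gamma1) / n2   # equispaced grid on [L1, U1]
assert two_stage_resolution < one_stage_resolution     # roughly 0.0021 vs 0.0033
```

Even at this modest budget, the second-stage grid near d_0 is already noticeably finer than the one-stage grid, and the gap widens as n grows.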

Asymptotic Properties of Two-Stage Estimators
We discuss the properties of the two-stage procedure, where isotonic regression is employed in both stages (henceforth, IR+IR).
Proposition 1. Consider the IR + IR procedure. Let the design density at the second stage be given by

g_2(x) = (C_1 n_1^{-γ_1})^{-1} ψ((x − d_{1,I})/(C_1 n_1^{-γ_1})), x ∈ [L_1, U_1],

where ψ is a Lebesgue density on [−1, 1] that is positive at 0 and continuous in a neighborhood of 0; thus, g_2 is simply ψ renormalized to the sampling interval at stage two. Assume that m′, the derivative of m, exists and is continuous in a neighborhood of d_0 and that m′(d_0) > 0. Let d_{2,I} = m_{2,I}^{-1}(θ_0), where m_{2,I} is the isotonic estimator of m constructed from the second-stage data. Then,

n^{(1+γ_1)/3}(d_{2,I} − d_0) →_d C_{d_{2,I}} Z, where C_{d_{2,I}} = {4σ²C_1/((1 − p) p^{γ_1} ψ(0) m′(d_0)²)}^{1/3}.

From Proposition 1, a Wald-type 1 − α asymptotic confidence interval for d_0 is given by

[d_{2,I} − n^{-(1+γ_1)/3} Ĉ_{d_{2,I}} q(Z, 1 − α/2), d_{2,I} + n^{-(1+γ_1)/3} Ĉ_{d_{2,I}} q(Z, 1 − α/2)]. (2.7)

Remark 2.4. A consequence of the accelerated rate of convergence obtained with the IR+IR strategy is that the asymptotic relative efficiency (ARE) of the two-stage estimator d_{2,I} with respect to the one-stage estimator d_I grows as a power of n (the ratio of convergence rates being n^{γ_1/3}) and hence diverges to infinity.
Note that, in the generic description of the two-stage procedure above, we use a confidence interval for d 0 that relies on the asymptotic distribution of a point estimate computed at stage two. However, this is not the only way to proceed at stage two. Having collected the second-stage data at the beginning of Step 4, we can bypass point estimation altogether and construct a confidence interval using likelihood ratio inversion. This alternative possibility is discussed below. Also, as will be explained in the practical implementation, the construction of [L 1 , U 1 ] is achieved in practice by constructing a high probability confidence interval for d 0 from the stage one data. This also opens up the possibility of bypassing point estimates at stage one in favor of a likelihood ratio inversion based confidence interval, a point that we come to later.
An alternative LR-type CI can be constructed as follows. The LR-type test statistic at stage two for testing H_0: d_0 = t is

2 log λ_{2,I}(t) = 2{l_{n_2}(m_{2,I}, σ̂) − l_{n_2}(m_{2,I}^c, σ̂)}, (2.8)

where l_{n_2}(m, σ) = −(2σ²)^{−1} Σ_{i=1}^{n_2} (Y_{2,i} − m(X_{2,i}))², m_{2,I}^c is the constrained estimator of m under the null hypothesis H_0, and σ̂ is a consistent estimate of σ.
Proposition 2. Under the assumptions of Proposition 1 and under the null hypothesis, 2 log λ_{2,I} →_d D.

Finally, from Proposition 2, an LR-type (1 − α) asymptotic confidence interval for d_0 is given by

{t ∈ [L_1, U_1] : 2 log λ_{2,I}(t) ≤ q(D, 1 − α)}. (2.9)

For the theoretical derivations in connection with Propositions 1 and 2, see the Appendix.
Remark 2.5. We have focused on the case of two-stage adaptive designs and the acceleration of the convergence rate by an IR+IR strategy. Obviously, one can extend it to multiple stages and continue using isotonic regression. As outlined in Section S1 in the supplement, it can be established that the convergence rate of such a procedure would come arbitrarily close to the √n parametric rate, if enough stages are employed, but would not achieve it.

Implementation Issues
We discuss next the main steps for implementing the IR+IR strategy in practice. Specifically, we address: (i) estimation of σ², (ii) estimation of m′(d_0), (iii) determination of the second-stage sampling interval [L_1, U_1], and (iv) the choice of the first-stage sampling proportion p.
Implementation of IR + IR: For the estimation of σ² at Stage 1, we employ the nonparametric estimator proposed by Gasser, Sroka, and Jennen-Steinmetz (1986), and for the estimation of m′(d_0), the local quadratic regression procedure of Fan and Gijbels (1996); some details are provided in Section 3. Next comes the determination of the second-stage sampling interval. Recall that the theoretical formula for the interval is [d_{1,I} ± C_1 n_1^{-γ_1}], with C_1 > 0 and γ_1 ∈ (0, 1/3). While any such interval will contain d_0 with probability going to 1 in the long run, in practice we would like to ensure that the prescribed sampling interval [L_1, U_1] traps d_0 with high probability. The practical determination of [L_1, U_1] is, therefore, achieved through a high-probability confidence interval for d_0 from the Stage 1 data. Consider the following 1 − β Wald-type confidence interval,

[d_{1,I} ± n_1^{-1/3} Ĉ_{d_I} q(Z, 1 − β/2)],

where the computation of Ĉ_{d_I} involves estimating both σ² and m′(d_0), and where β is a small positive number such as 0.01. Using this interval as [L_1, U_1] in practice amounts to choosing C_1 = Ĉ_{d_I} q(Z, 1 − β/2) and γ_1 = 1/3. Although γ_1 = 1/3 is not in (0, 1/3) as required by our theoretical results, it nevertheless provides a good approximation in practice, since it lies at the boundary of that interval.
As far as the first-stage sampling proportion is concerned, one would like to choose it so as to increase the precision of the second-stage isotonic estimator. Proposition 1 shows that the second-stage estimator is asymptotically unbiased and that its standard deviation is proportional to {(1 − p) p^{γ_1}}^{−1/3}. For fixed γ_1, this is minimized when log(1 − p) + γ_1 log p is maximized, which happens at p = p_opt = γ_1/(1 + γ_1). As γ_1 approaches 1/3, p_opt approaches 1/4. Thus, the practical optimal allocation of budget to Stage 1 is 25%.
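The optimization behind p_opt is elementary. Writing f(p) = log(1 − p) + γ_1 log p,

```latex
f'(p) = -\frac{1}{1-p} + \frac{\gamma_1}{p} = 0
\;\Longrightarrow\; \gamma_1 (1-p) = p
\;\Longrightarrow\; p_{\mathrm{opt}} = \frac{\gamma_1}{1+\gamma_1},
\qquad \text{so } \gamma_1 = \tfrac{1}{3} \text{ gives } p_{\mathrm{opt}} = \tfrac{1}{4}.
```

Since f''(p) = −(1 − p)^{−2} − γ_1 p^{−2} < 0, f is strictly concave on (0, 1), so this stationary point is indeed the maximizer.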
From the Stage 2 data, we can construct a confidence interval (CI) for d_0 based on d_{2,I} following Proposition 1, in which case both m′(d_0) and σ² need to be updated. Alternatively, we can use likelihood ratio inversion to obtain a CI of the desired coverage for d_0 following Proposition 2, in which case only the estimate of σ² needs to be updated. Finally, a third option is to bypass the estimation of m′(d_0) altogether by prescribing a high-probability LR-based confidence interval for d_0 as [L_1, U_1] at Stage 1 (see (2.6)) and then using LR inversion at Stage 2 as well. While this procedure does not quite fall within the purview of our theoretical results, it is a natural methodological choice that avoids estimating m′(d_0); furthermore, comparisons among the three approaches based on elaborate simulation studies demonstrate that it is superior to the other two in practice.

PERFORMANCE EVALUATION OF THE ADAPTIVE PROCEDURES
In this study, the following procedures are compared: (i) POSIRP: the practical one-stage procedure based on isotonic regression with Wald and LR CIs; (ii) PTSIRP-Wald: the practical two-stage procedure based on isotonic regression at both stages, using Wald CIs both for selecting (L_1, U_1) and for constructing the final CI; (iii) PTSIRP-LR: the practical two-stage procedure similar to PTSIRP-Wald but employing LR CIs at both stages; and (iv) PABLTSP: the procedure from Tang, Banerjee, and Michailidis (2011) that uses isotonic regression followed by a local linear approximation and bootstrapping for constructing CIs for d_0. The qualifier "practical" emphasizes that these procedures involve estimates of nuisance parameters, as explained below.
The simulation settings are as follows. The design space is the [0, 1] interval, and the regression functions considered are: the sigmoid function m(x) = exp(4(x − 0.5))/[1 + exp(4(x − 0.5))], the quadratic function m(x) = x², and the isotonic sine function m(x) = (1/40) sin(6πx) + 1/4 + (1/2)x + (1/4)x². The target point d_0 is 0.4, 0.5, or 0.6, while the random error follows a N(0, σ²) distribution, with σ taking values 0.1 and 0.3. The total sample size n ranges from 100 to 500 in increments of 100. All design densities g, g_1, and g_2 are uniform, and the confidence level for all CIs is set to 0.95. The results presented are based on 1000 replicates. For PABLTSP, we set the first-stage sample proportion to p = 0.7 in order to obtain accurate coverage rates for all functions. (Yet, as shown in Figure 2, in some cases good coverage rates are achieved at the cost of large average lengths.) For all other two-stage procedures, we set p to the asymptotically optimal proportion of 0.25. The quantiles of D and Z used for constructing the second-stage sampling intervals for PTSIRP are set to 4 and 2, respectively, corresponding to β = 0.01.
When estimating σ and m′(d_0) at the second stage, only Stage 2 points are used, to stay strictly within the scope of the methods used for these purposes. With smaller budgets, as in the real data example, we follow the natural practice of combining the samples from both stages when updating the estimates of σ and m′(d_0), which makes the second-stage results more reliable.
To gain insight into the simulation results, we depict the plots of the functions under consideration together with their derivatives (see Figure 4). The coverage rates and average lengths of the 95% confidence intervals for d 0 are shown in Figures 5 and 6.
It can be seen that for the quadratic and sigmoid functions, the proposed two-stage procedures perform well with the coverage being about the nominal level 95% for all d 0 's, sample sizes, and noise levels considered. Further, their average lengths are fairly comparable. In contrast, PABLTSP shows inferior performance for larger noise and smaller sample sizes for the quadratic function.
The isotonic sine function proves the most challenging. The left panel of Figure 6 shows the coverage rates of the practical procedures. As discussed in the Introduction and seen in the figure, this function exhibits strong nonlinearity, causing PABLTSP, which relies on a local linear approximation, to exhibit very poor coverage rates. Also note that, for the case d_0 = 0.5, the coverage rates of the confidence intervals from POSIRP-Wald and PTSIRP-Wald are consistently lower than 95%. This behavior is caused by inaccurate estimation of m′(d_0), as illustrated in the online supplementary material (Section 2, Figure 1). The true value of m′(d_0) is around 0.279, whereas the corresponding kernel estimates of m′(d_0) are usually around 0.75, significantly larger than the true value. This makes the confidence intervals far too short to cover d_0 and, consequently, the coverage rates behave erratically. Full details on the estimation of m′(d_0), which uses a local quadratic regression procedure, are available in Section 4 of Tang, Banerjee, and Michailidis (2011). An asymptotically optimal bandwidth, given in equation (3.20) on page 67 of Fan and Gijbels (1996), is employed for this purpose. This local bandwidth minimizes the asymptotic MSE, and indeed, with large sample sizes, we find that m′(d_0) is estimated accurately and the coverage rates approach the nominal level. Further emphasizing the importance of the derivative estimate, and as illustrated in Figure 2 of the online supplementary material, if we repeat the procedures with perfect knowledge of m′(d_0), then coverage rates are about the nominal level of 95% for the sample sizes considered.
Fortunately, for this wiggly isotonic sine function, POSIRP-LR and PTSIRP-LR have good coverage rates in all simulation cases. This indicates that LR-type confidence intervals are usually robust across different regression functions. The average lengths of the confidence intervals are shown in the right panel of Figure 6. Unsurprisingly, PTSIRP-LR, being a two-stage procedure, achieves shorter average lengths.
In summary, we find that when the underlying regression function is well-behaved, the more aggressive PABLTSP performs well. However, the conservative but stable PTSIRP-LR offers a robust procedure that performs well, even when the underlying function exhibits strong nonlinearities.

AN APPLICATION TO FUEL EFFICIENCY STANDARDS
As discussed in the Introduction, car fuel efficiency (FE) is an important issue for both manufacturers and consumers, due to the new CAFE standards. Note that while the CAFE standards are regulated by the NHTSA, vehicle FE is assessed by the Environmental Protection Agency (EPA). From 2008 onwards, the EPA measures the fuel efficiency of a vehicle in two testing modes, city and highway, taking into consideration different speeds and accelerations, as well as air conditioning usage and colder outside temperatures, in an effort to better approximate real-world fuel efficiency. From the unadjusted city and highway fuel efficiency, the unadjusted combined fuel efficiency is calculated as follows (see www.epa.gov):

Combined FE = 1/(0.495/City FE + 0.351/Highway FE) + 0.15.
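In code, the combined FE calculation reads as below. The grouping of the 0.15 term as an additive MPG offset is our reading of the formula above, whose rendering in the source is ambiguous; the 0.495 and 0.351 weights are taken directly from it.

```python
def combined_fe(city_fe, highway_fe):
    """Unadjusted combined fuel efficiency (MPG) from city and highway FE.

    The 0.495/0.351 weights come from the formula in the text; treating 0.15
    as an additive offset is our reading of the source's layout.
    """
    return 1.0 / (0.495 / city_fe + 0.351 / highway_fe) + 0.15
```

For instance, a city FE of 30 MPG and a highway FE of 40 MPG combine to roughly 39.7 MPG under this reading.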
The data for this study were extracted from the government website www.fueleconomy.gov, which includes FE data for all 2012 car models available to U.S. consumers. This dataset contains the unadjusted city, highway, and combined fuel efficiency for 3979 models, together with their horsepower. We retained the 1477 nonhybrid vehicles with automatic transmission gearboxes and naturally aspirated engines (i.e., excluding turbo engines and plug-in hybrid vehicles), to obtain a relatively homogeneous dataset.
The objective of our analysis is to estimate the model FE = m(HP) (equivalently, HP = m^{-1}(FE)) and then identify the horsepower at which the combined FE equals 30 MPG, around the 2011 CAFE standard. Hence, we are interested in estimating d_0 = m^{-1}(30).
The scatterplot in the left panel of Figure 1 shows the combined FE of these 1477 vehicles as a function of their horsepower and indicates a decreasing relationship. Notice that there are multiple vehicle models with the same horsepower but different FE. To simplify the analysis, we add a small jitter to the original horsepower values so that every FE observation has a unique horsepower; the resulting scatterplot is given in the right panel of Figure 1. The jitter is between −1 and +1, ensuring that the ordering of the samples by horsepower remains unchanged.
Given that this is an "observed" dataset, we emulate the design setting (for a similar strategy, see also Lan, Banerjee, and Michailidis 2009) with a budget of size 80. Both one-stage and two-stage procedures are examined. For one-stage procedures, 80 equally spaced horsepower values are first selected, and the closest ones in the data constitute the final covariate values, together with the corresponding responses. For two-stage procedures, we select a portion p = 0.5 for the first stage and, hence, select 40 horsepower values in the first stage as previously described. After obtaining the second-stage sampling interval (L_1, U_1), we choose the remaining 40 points with an analogous strategy. (Given the relatively modest budget, we chose not to use the asymptotically optimal 25%/75% allocation.) Finally, the "true" value of d_0 is obtained by applying isotonic regression to the entire sample of 1477 observations and is estimated to be around 187.
The five procedures considered are POSIRP-Wald, POSIRP-LR, PTSIRP-Wald, PTSIRP-LR, and PABLTSP. The fitted models are shown in Figure 7, and the confidence intervals obtained are summarized in Table 1. It is interesting to note that only the LR-based procedures produce confidence intervals that cover the "true" value, with the two-stage procedure PTSIRP-LR achieving a much shorter interval length. The PTSIRP-Wald CIs are too short, causing them to miss the "true" value. Additional results using different allocations in the two stages are provided in Figure 3 of the online supplementary material, where we see that in almost all cases PTSIRP-LR tends to cover the "true" value of 187 with better point estimates. POSIRP-Wald and PTSIRP-Wald continue to struggle due to the difficulty of estimating m′(d_0).
Next, we perform another experiment to assess the reliability of the procedures on the FE dataset. We treat the data from the 1477 vehicles as the population and sample from it with overall budgets of size 20, 30, 40, 50, 60, 70, 80, 90, and 100. Given the modest budgets, we combine the samples from both stages for the estimation of auxiliary parameters and the inversion of the likelihood ratio. The results, averaged over 500 repetitions for each budget size, are depicted in Figure 8. Note that POSIRP-Wald and PTSIRP-Wald still struggle to maintain coverage due to the difficulty of auxiliary parameter estimation. The local linear approximation in PABLTSP also faces difficulties with such small budgets. On the other hand, POSIRP-LR and PTSIRP-LR maintain good coverage for all budgets. As noted before, PTSIRP-LR outperforms its one-stage counterpart with narrower intervals. Considering the overall budgets investigated in this experiment, PTSIRP-LR performs well with an extremely small fraction of the overall data, illustrating its utility in the context of very large datasets, a topic we return to in the Discussion section.
[Figure 8. Results obtained after combining both stage samples. The top panel shows the coverage rate and average length of the confidence intervals generated by the five different procedures. The bottom panel shows the distance of the point estimate to the "true" value, and the distance of the derivative estimate to its "true" value.]
Finally, we return to the task, discussed in the introductory section, of estimating the horsepower at which the combined FE equals the upcoming 2016 CAFE standard, that is, d_0 = m^{-1}(34). Employing isotonic regression on the entire sample yields a "true" value of d_0 of around 155, with corresponding 95% confidence interval [143.360, 166.052]. Following the same procedure and budget allocations as above, PTSIRP-LR yields d̂_0 = 166.204 with corresponding 95% confidence interval [145.599, 175.450], as shown in Figure 9. PTSIRP-LR covers the "true" value with a reasonably sized interval, while using a small fraction of the overall budget.
The upshot of the analysis is that the two-stage LR-based procedure offers superior performance to its competitors even with smaller budgets and when the underlying function exhibits nonlinearities.

DISCUSSION AND CONCLUDING REMARKS
In this article, we considered the estimation of the inverse of a monotone regression function at a given point in a design setting. The results strongly suggest that a two-stage procedure using isotonic regression in both stages coupled with calculation of likelihood-ratio based confidence intervals is agnostic to the local structure in the vicinity of the parameter of interest, requires minimal tuning and exhibits superior performance.
The reader may wonder whether an alternative nonparametric procedure at stage one, with a faster convergence rate than the n^{1/3} rate of isotonic regression, might offer advantages over the proposed strategy. We have investigated smoothed isotonic regression (Tang 2011), which, in a single stage, exhibits a convergence rate of n^{2/5} and, when repeated at the second stage, exhibits the same acceleration pattern as isotonic regression provided the bandwidth is appropriately chosen (for a detailed discussion of this subtle issue, see Tang 2011). However, extensive numerical work shows that no significant performance gains are realized compared to using isotonic regression at both stages, while a bandwidth parameter needs to be carefully specified. Indeed, a strategy based on isotonic regression in the first stage, followed by smoothed isotonic regression in the second stage, struggles with the estimation of the iso-sine function presented in Figure 2.
Another point that needs to be highlighted is that the methods of this article extend immediately to discrete response models. In many dose-response problems, the response to a dose is not a continuous random variable; rather, it is the number of individuals who showed a reaction to the dose. Consider, for example, a Bernoulli dose-response model, where dose x is administered to an individual and the probability of this individual reacting to the dose, say m(x), is a monotone increasing function of x: higher doses make it more likely for an individual to experience a reaction. Consider a budget of size n. At Stage 1, n_1 iid dose-response samples are collected. The MLE of m under the monotonicity constraint, say m̂_1, can be computed via an isotonic least squares problem that arises from the joint (Bernoulli) likelihood. As this is a Bernoulli monotone response model (see Banerjee 2007, p. 933), the pointwise distribution of m̂_1 is described by the Chernoff random variable Z, and, letting d_{1,I} := m̂_1^{-1}(a), where a = m(d_0) is the target level, it can be shown that n_1^{1/3}(d_{1,I} − d_0) converges to a multiple of Z, as in this article. If the Stage-2 sample is collected as in this article, by restricting the covariate domain to [d_{1,I} − K n_1^{−γ_1}, d_{1,I} + K n_1^{−γ_1}], and an updated estimate d_{2,I} is constructed from the second-stage data, then, similar to what we have observed, n_2^{(1+γ_1)/3}(d_{2,I} − d_0) converges (at an accelerated rate) to a multiple of Z. The likelihood-ratio-based procedures extend similarly, and the optimal allocation continues to be 25% of the budget at Stage 1.
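As an illustration of the Stage-1 computation in this Bernoulli setting, the following sketch computes the isotonic MLE via the pool-adjacent-violators algorithm and inverts it at a target level. The logistic dose-response curve, the sample size, and the function names are our own illustrative choices, not from the article.

```python
import numpy as np

def pava(y, w):
    """Weighted pool-adjacent-violators: nondecreasing least-squares fit."""
    vals, wts, cnt = [], [], []
    for yi, wi in zip(y, w):
        vals.append(float(yi)); wts.append(float(wi)); cnt.append(1)
        # merge adjacent blocks while the monotonicity constraint is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w_new = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w_new
            wts[-2] = w_new
            cnt[-2] += cnt[-1]
            vals.pop(); wts.pop(); cnt.pop()
    return np.repeat(vals, cnt)

def invert(x_sorted, fit, level):
    """Smallest design point at which the fitted curve reaches `level`."""
    hit = np.nonzero(fit >= level)[0]
    return x_sorted[hit[0]] if hit.size else x_sorted[-1]

rng = np.random.default_rng(0)
n1 = 200
x = np.sort(rng.uniform(0.0, 1.0, n1))                # Stage-1 doses
m = lambda t: 1.0 / (1.0 + np.exp(-8.0 * (t - 0.5)))  # illustrative monotone response probability
y = rng.binomial(1, m(x)).astype(float)               # Bernoulli reactions
m1 = pava(y, np.ones(n1))                             # isotonic MLE of m
d1 = invert(x, m1, 0.5)                               # estimate of m^{-1}(0.5); true value 0.5
```

For binary responses with equal weights, the isotonic least-squares solution coincides with the constrained MLE, which is why plain PAVA suffices here.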
Finally, we should note that although the developed methodology applies to design settings (where the investigator can select the desired covariate and the corresponding response variable values), it can also prove useful in the context of very large datasets. Suppose that one is interested in estimating a threshold of a monotone function from a very large dataset that cannot be stored in its entirety in computer memory. In that case, one-stage estimation based on the entire dataset is computationally challenging, since it requires appropriate modification of the standard algorithms. However, by adopting the proposed adaptive design framework, one can overcome such computational difficulties, while still obtaining a high degree of accuracy. It is our belief that by going to multiple stages, if necessary, with judiciously chosen parameters, one can match the performance of the estimator based on all data that could be stored in computer memory, thus providing a computationally efficient procedure that avoids major modifications of existing algorithms. The latter claim is supported by the results of the experiment shown in Figure 8, and in Figure 4 of the online supplementary material, which indicate that PTSIRP-LR reduces computing time by substantial amounts at larger budgets compared to its one-stage counterpart.
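To make the large-dataset scenario concrete, here is a schematic sketch, under illustrative choices of the monotone function, noise level, and window constants, of a two-stage pass in which Stage 1 fits isotonic regression to a modest subsample and Stage 2 streams over the data in chunks, retaining only observations in a shrinking window around the Stage-1 estimate.

```python
import numpy as np

def pava(y):
    """Unweighted pool-adjacent-violators (nondecreasing fit)."""
    vals, cnt = [], []
    for yi in y:
        vals.append(float(yi)); cnt.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            c = cnt[-2] + cnt[-1]
            vals[-2] = (cnt[-2] * vals[-2] + cnt[-1] * vals[-1]) / c
            cnt[-2] = c
            vals.pop(); cnt.pop()
    return np.repeat(vals, cnt)

def threshold(x, y, level):
    """Isotonic point estimate of m^{-1}(level) from paired data."""
    order = np.argsort(x)
    fit = pava(y[order])
    hit = np.nonzero(fit >= level)[0]
    return x[order][hit[0]] if hit.size else x[order][-1]

rng = np.random.default_rng(1)
N, level, K, gamma = 100_000, 0.7, 1.0, 0.25   # "large" dataset and window constants
m = lambda t: np.sqrt(t)                        # illustrative monotone function; m^{-1}(0.7) = 0.49

def stream(chunk=10_000):
    """Stand-in for reading a large dataset from disk in chunks."""
    for _ in range(N // chunk):
        x = rng.uniform(0.0, 1.0, chunk)
        yield x, m(x) + rng.normal(0.0, 0.1, chunk)

# Stage 1: isotonic fit on a modest random subsample.
n1 = 2_000
x1, y1 = next(stream(n1))
d1 = threshold(x1, y1, level)

# Stage 2: one pass over the stream, keeping only the window around d1.
half = K * n1 ** (-gamma)
xs, ys = [], []
for xc, yc in stream():
    keep = np.abs(xc - d1) <= half
    xs.append(xc[keep]); ys.append(yc[keep])
d2 = threshold(np.concatenate(xs), np.concatenate(ys), level)
```

Only the Stage-1 subsample and the (much smaller) windowed Stage-2 subset ever need to reside in memory; the full dataset is touched once, in chunks.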

APPENDIX A: PROOFS
We discuss Propositions 1 and 2 of the article.

A.1 Appendix of TSIRP
We introduce the following idealized two-stage isotonic regression procedure (ITSIRP):

1. Set the first-stage sample proportion p ∈ (0, 1) and let the first- and second-stage sample sizes be n_1 = np and n_2 = n − n_1, respectively, where n is the total sample size.
Remark. Note that ITSIRP is similar to TSIRP except that in ITSIRP the second-stage sampling interval is centered at d_0 instead of at d_{1,I}; therefore, the sampling density at Stage 2 in ITSIRP is ψ renormalized to [L_{1i}, U_{1i}], just as g_2 is ψ renormalized to [L_1, U_1] (see Proposition 1) in TSIRP. Since d_{1,I} converges to d_0 at rate n^{1/3}, which is faster than the rate at which [L_1, U_1] shrinks around d_{1,I} (since γ < 1/3), [L_1, U_1] is essentially indistinguishable from its idealized counterpart [L_{1i}, U_{1i}], and the asymptotic behavior of d_{oI} will be identical to that of d_{2,I}. Similarly, the asymptotic distribution of the idealized LRT, 2 log λ_{oI}, will be the same as that of 2 log λ_{2,I}.
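The idealized Stage-2 design can be sketched as follows. The design density ψ, the constants K and γ, and the sample sizes are illustrative placeholders; ψ is renormalized to the interval by rejection sampling.

```python
import numpy as np

rng = np.random.default_rng(2)
d0, K, gamma = 0.5, 1.0, 0.25        # target point and window constants (gamma < 1/3)
n1, n2 = 500, 1500                   # stage sample sizes

# Idealized Stage-2 interval: centered at d0 (ITSIRP); TSIRP centers it at d_{1,I}.
L1i = d0 - K * n1 ** (-gamma)
U1i = d0 + K * n1 ** (-gamma)

def psi(t):
    """Illustrative design density on [0, 1]: psi(t) = 2t."""
    return 2.0 * t

def sample_renormalized(lo, hi, size):
    """Draw from psi renormalized to [lo, hi] via rejection sampling."""
    out = np.empty(0)
    bound = max(psi(lo), psi(hi))    # psi is monotone, so its max on [lo, hi] is at an endpoint
    while out.size < size:
        cand = rng.uniform(lo, hi, size)
        accept = rng.uniform(0.0, bound, size) < psi(cand)
        out = np.concatenate([out, cand[accept]])
    return out[:size]

x2 = sample_renormalized(L1i, U1i, n2)   # Stage-2 design points
```

With a uniform ψ the renormalized density reduces to the uniform distribution on [L_{1i}, U_{1i}], and the rejection step is unnecessary.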
A rigorous proof of Proposition 1, formalizing the intuition above, relies on some recently developed theory for M-estimation via multistage sampling in Mallik (2013), ch. 6, where a number of two-stage estimation procedures (and, in particular, the two-stage isotonic regression problem) are investigated, and we direct the technically oriented reader to Section 6.3 of this work. Proposition 2 can be derived along similar lines. In this article, we provide a sketch of the derivations of the limiting distributions of the "idealized" (surrogate) quantities d_{oI} and 2 log λ_{oI}. We first introduce the following quantities.
Proof sketch of Theorem 3: The following equivalence, called the "switching relationship" and derived originally by Groeneboom, can be established by arguments similar to those leading to the last display on p. 298 of van der Vaart and Wellner (1996); here T(t) is the largest X_{2,i} less than or equal to t.
Since the process X_{c,d}(h) is almost surely continuous, diverges to ∞ as h → ±∞, and has a unique minimum, and ĥ_n is O_p(1), as can be shown by an application of Theorem 3.2.5 of van der Vaart and Wellner (1996), we conclude by Theorem 2.7 of Kim and Pollard (1990) that ĥ_n converges in distribution to argmin_h X_{c,d}(h). Now, from Problem 5 on p. 308 of van der Vaart and Wellner (1996), we have argmin_h X_{c,d}(h) =_d (c/d)^{2/3} argmin_t X_{1,1}(t), whence n_2^{(1+γ_1)/3}(d_{oI} − d_0) →_d C_{d_{oI}} Z, the result in Theorem A.1.
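The limiting variable can be illustrated numerically. Assuming, as is standard for Chernoff-type limits, that X_{1,1}(t) = W(t) + t² for a two-sided standard Brownian motion W, one draw of Z = argmin_t X_{1,1}(t) can be approximated on a discrete grid; the grid range and step below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def chernoff_draw(half_width=3.0, step=1e-3):
    """One approximate draw of Z = argmin_t {W(t) + t^2}, W two-sided standard BM."""
    t = np.arange(-half_width, half_width + step, step)
    n_neg = int((t < 0).sum())
    n_pos = t.size - n_neg
    # build W by cumulating independent Gaussian increments outward from t = 0
    pos = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(step), n_pos - 1))])
    neg = np.cumsum(rng.normal(0.0, np.sqrt(step), n_neg))[::-1]
    w = np.concatenate([neg, pos])
    return t[np.argmin(w + t ** 2)]

draws = np.array([chernoff_draw() for _ in range(1000)])
# Z is symmetric about 0; its standard deviation is roughly 0.52 (Groeneboom and Wellner 2001).
```

Such draws can be used to check simulated coverage of Wald-type intervals built from the limit distribution, although in practice the quantiles of Z are tabulated.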