Ecological Archives E083-042-A1

Stephen P. Ellner, Yodit Seifu, and Robert H. Smith. 2002. Fitting population dynamic models to time-series data by gradient matching. Ecology 83:2256-2270.

Appendix A. Details of methods.

Bandwidth selection for estimating the gradient.

            The choice of bandwidth h for local polynomial regression is usually a compromise between the bias and the variance of the fitted curve. At a given point x the mean squared error (MSE) between the estimate \(\hat{f}(x)\) and the true value \(f(x)\) can be decomposed as MSE = bias\(^2\) + variance, where bias \(= E[\hat{f}(x)] - f(x)\) and variance \(= E\{(\hat{f}(x) - E[\hat{f}(x)])^2\}\). For large h the fitted value at any point is based on a large fraction of the total data set, hence the variance is low. However, a low-order polynomial is usually a poor approximation over a long span of time, so the bias is high. A smaller bandwidth reduces the bias but increases the variance. The optimal bandwidth for minimizing MSE is therefore an intermediate value at which neither the bias nor the variance is too large.

            For our purposes this reasoning about bandwidths does not apply because estimating dx/dt is just an intermediate step towards estimating rate equations by regressing dx/dt on the state variables. In that latter step, the estimates of dx/dt appear as the dependent variable. As sample size n (length of the time series) increases, the variance of estimated regression parameters due to variance in estimates of dx/dt decreases at rate 1/n. But the bias in regression parameters due to bias in estimates of dx/dt does not decrease at all, because one is just adding more and more data points with the same systematic distortions. An "optimal" bandwidth in this situation is therefore far smaller than the usual optimality criteria would suggest. As a rule of thumb we suggest that h should be roughly 1-2 times the time interval between population counts, with higher measurement error variance favoring larger h. In a specific application the value of h could be fine-tuned by using simulated data from the fitted model to produce a plot like our Figure 1 that reveals the amount of bias in the gradient estimates. The use of small h is a significant difference between this paper and Ellner et al. (1997). Small h drastically reduces the fraction of pairwise-correlated gradient residuals. This allows fitting and model selection to proceed as if gradient residuals were uncorrelated (as explained below: Generalized or ordinary least squares?), eliminating the need for the ad hoc methods used by Ellner et al. (1997).
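As an illustration of this step, the following R sketch estimates dx/dt by local quadratic regression with a small bandwidth, using locpoly from the KernSmooth package; the function and variable names are illustrative and are not taken from the supplied code.

library(KernSmooth)

estimate.gradient <- function(t.obs, x.obs, h) {
  ## local quadratic regression; drv = 1 returns the estimated first derivative
  fit <- locpoly(t.obs, x.obs, drv = 1, degree = 2, bandwidth = h,
                 gridsize = 1000, range.x = range(t.obs))
  ## interpolate the gridded derivative estimate back onto the sampling times
  approx(fit$x, fit$y, xout = t.obs)$y
}

## illustrative use: noisy logistic growth sampled at unit intervals,
## with h = 1.5 sampling intervals as suggested by the rule of thumb above
t.obs <- 0:60
x.obs <- 10 / (1 + 9 * exp(-0.2 * t.obs)) * exp(rnorm(61, sd = 0.05))
dxdt.hat <- estimate.gradient(t.obs, x.obs, h = 1.5)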

            Gradient estimates can be improved by estimating and correcting for the bias at extreme gradient values. By passing an interpolating cubic spline through the time series, we obtain a series very similar to the original data, for which the exact value and gradient are known at all times. We can therefore simulate the process of sampling the interpolated series (at the same frequency as the original time series) and estimating the gradient at sampling times. The relationship between the estimated and true gradients for the interpolated series is then fitted by a regression spline constrained to be monotonic, and the fitted spline is applied to the original gradient estimates. The net result is essentially a smoothed version of estimating the gradient by interpolation (h = 0). Interpolation minimizes the bias but it is unstable (especially with finely sampled noisy data) and prone to occasional spectacular errors. Using interpolation indirectly, to estimate and remove the bias of a stable estimate, eliminates the instability of interpolation without greatly increasing the bias.
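A sketch of this bias-correction step, reusing estimate.gradient from above, is given below; ordinary isotonic regression (isoreg) is used here as a simple stand-in for the monotonic regression spline, and the fractional offsets used to resample the interpolated series are an illustrative choice.

correct.gradient.bias <- function(t.obs, x.obs, h) {
  g.hat <- estimate.gradient(t.obs, x.obs, h)       # possibly biased estimates

  ## interpolating cubic spline: values and gradients known exactly at all times
  f <- splinefun(t.obs, x.obs, method = "natural")

  ## simulate sampling the interpolated series at the original frequency,
  ## at several fractional offsets, and re-estimate the gradient each time
  dt <- mean(diff(t.obs))
  est <- tru <- NULL
  for (off in c(0, 0.25, 0.5, 0.75) * dt) {
    ts  <- seq(min(t.obs), max(t.obs) - dt, by = dt) + off
    est <- c(est, estimate.gradient(ts, f(ts), h))
    tru <- c(tru, f(ts, deriv = 1))
  }

  ## monotone fit of true vs. estimated gradient, applied to the original estimates
  ord <- order(est)
  iso <- isoreg(est[ord], tru[ord])
  approx(iso$x, iso$yf, xout = g.hat, rule = 2, ties = mean)$y
}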

Penalized regression splines

            The additive model (Eq. 6) was fitted by linear regression onto the set of regression spline basis functions, each of which is a function of X alone, or of Y alone (Ruppert and Carroll 1997). The basis functions for a variable X on an interval [a,b] are defined by a set of knots \(\kappa_1 < \kappa_2 < \cdots < \kappa_m\) in [a,b], and the regression spline estimate of a function f(X) is

\[
f(X) \;=\; \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \sum_{j=1}^{m} b_j (X - \kappa_j)_+^3,
\qquad (u)_+ = \max(u, 0).
\]
(A.1)

Equation A.1 is the cubic regression spline, and quadratic or linear regression splines are defined similarly. To fit the additive model (6)  we simply add a second "copy" to the right-hand side of Eq. A.1, with X replaced by Y. The resulting model has the form

\[
Z \;=\; \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \sum_{j=1}^{m} b_j (X - \kappa_j)_+^3
\;+\; \gamma_1 Y + \gamma_2 Y^2 + \gamma_3 Y^3 + \sum_{j=1}^{m} d_j (Y - \eta_j)_+^3,
\]
(A.2)

where the \(\eta_j\) are the knots for Y. Note that in Eq. A.2 the intercept term has been omitted, reflecting the fact that no births or deaths occur at zero population density. In all analyses reported here we used m = 20 knots for both B and D.
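For concreteness, the following sketch builds the design matrix implied by Eq. A.2 (cubic truncated power bases in X and Y, with no intercept column); placing the knots at equally spaced sample quantiles is an assumption, following common practice for penalized regression splines, and the names are illustrative.

tpower.basis <- function(x, knots) {
  ## columns: x, x^2, x^3, and (x - knot)_+^3 for each knot
  cbind(x, x^2, x^3, outer(x, knots, function(u, k) pmax(u - k, 0)^3))
}

make.design <- function(X, Y, m = 20) {
  kappa <- quantile(X, probs = (1:m) / (m + 1))   # knots for X
  eta   <- quantile(Y, probs = (1:m) / (m + 1))   # knots for Y
  list(M = cbind(tpower.basis(X, kappa), tpower.basis(Y, eta)),
       kappa = kappa, eta = eta)
}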

            The fitting criterion is the sum of squared errors (SSE) penalized by the sum of squared spline coefficients,

\[
\sum_{i} \bigl( Z_i - \hat{Z}_i \bigr)^2 \;+\; \lambda_X \sum_{j=1}^{m} b_j^2 \;+\; \lambda_Y \sum_{j=1}^{m} d_j^2 ,
\]
(A.3)

with the "smoothing parameters" \((\lambda_X, \lambda_Y)\) determining the complexity of the fitted curve as described in the text. The SSE in Eq. A.3 is an ordinary least squares criterion, which ignores the correlations in the errors (in Z) that result from the gradient estimation step. However, as we discuss below, trying to take account of the correlations by using a generalized least squares criterion is actually counterproductive.
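For fixed smoothing parameters, minimizing (A.3) is a ridge-type least squares problem in which only the truncated-power coefficients are penalized. A minimal sketch, using the design matrix returned by make.design above (names illustrative):

fit.penalized <- function(M, Z, m, lambda.X, lambda.Y) {
  ## zero penalty on the three polynomial terms of each variable,
  ## lambda on the m truncated-power terms
  pen <- diag(c(rep(0, 3), rep(lambda.X, m), rep(0, 3), rep(lambda.Y, m)))
  drop(solve(crossprod(M) + pen, crossprod(M, Z)))
}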

            When the true death rate functions are polynomials, as in the simulation models used to test the methods, it might appear that the spline terms in the fitted model (the \((Y - \eta_j)_+^3\) terms in Eq. A.2) are superfluous. However, fits with the spline terms omitted were much more strongly biased. The spline terms allow the fitted function to err in one spot (for example, the occasional spurious curvature when the death rate is actually linear) without the error propagating to other parts of the fitted curve. Even a reduction to 5 knots (instead of 20) caused a visible loss of accuracy in the fits.

Fitting a penalized regression spline with constraints

            Our procedures require fitting penalized regression splines to data \((x_i, y_i)\) subject to the constraint that the fitted spline be positive, monotonic increasing in x, or both. Following Wood (1997), our numerical methods applied the looser (but in practice equivalent) constraint that the function values at a set of points \(z_1 < z_2 < \cdots < z_M\) satisfy the constraint(s); we refer to these as the constraining grid points. For positivity constraints the default in our code (and throughout the paper) is that the \(z_k\) consist of the knots supplemented by the endpoints of the interval on which the spline is defined, while for monotonicity constraints a finer grid (generally 50 points equally spaced between the endpoints) was required.

            Let X be the \(N \times (m + 4)\) matrix whose columns are the spline basis functions \(1, x, x^2, x^3, (x - \kappa_1)_+^3, \ldots, (x - \kappa_m)_+^3\) evaluated at the N data points \(x_i\), Y the vector of corresponding \(y_i\) values, and \(b\) the vector of coefficients. We can assume that the x's are sorted in increasing order. Given the smoothing parameter \(\alpha\), let D be the square diagonal matrix of linear size (m + 4) whose first four diagonal entries are 0 and whose remaining m diagonal entries are \(\alpha\), so that only the truncated-power coefficients are penalized. The unconstrained fitting problem is to minimize

\[
F(b) \;=\; (Y - Xb)^{T}(Y - Xb) \;+\; b^{T} D\, b .
\]

To apply the constraints, let Z be the \(M \times (m + 4)\) matrix whose columns are the basis functions evaluated at the M constraining grid points, and let C be the matrix whose jth row is the difference between the (j + 1)st and jth rows of Z. Thus \(Zb\) is the vector of regression spline values at the constraining grid points, and \(Cb\) is the vector of differences between the values at successive grid points. The constrained fitting problem is then to minimize \(F(b)\) subject to \(Zb \ge 0\) (positivity), \(Cb \ge 0\) (monotonicity), or both. As Wood (1997) observed, these are standard quadratic programming problems; we obtained numerical solutions using the quadprog library in R (www.R-project.org). The function in the supplied code also returns the GCV score for the fitted model, so the value of \(\alpha\) can then be selected by minimizing GCV(\(\alpha\)).
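A minimal sketch of the constrained fit as a quadratic program with solve.QP from the quadprog package is shown below; here X, Y, and D are as defined above, Zgrid and C stand for the constraint matrices Z and C, and the function name is illustrative. solve.QP minimizes (1/2) b'Qb - d'b subject to t(Amat) %*% b >= bvec, which matches F(b) when Q = 2(X'X + D) and d = 2X'Y.

library(quadprog)

fit.constrained <- function(X, Y, D, Zgrid, C, positive = TRUE, monotone = FALSE) {
  Qmat <- 2 * (crossprod(X) + D)          # quadratic part of F(b)
  dvec <- drop(2 * crossprod(X, Y))       # linear part of F(b)
  ## build the constraint matrix; at least one constraint is assumed
  Amat <- matrix(0, nrow = ncol(X), ncol = 0)
  if (positive) Amat <- cbind(Amat, t(Zgrid))   # Zgrid %*% b >= 0
  if (monotone) Amat <- cbind(Amat, t(C))       # C %*% b >= 0
  solve.QP(Qmat, dvec, Amat, bvec = rep(0, ncol(Amat)))$solution
}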

Generalized or ordinary least squares? 

            The differences between gradient values estimated by smoothing the time series and the expected gradients from the model (Eq. 2) will generally be autocorrelated, because the jth data value \(x(t_j)\) affects the gradient estimates at all sampling times near \(t_j\), out to several multiples of the bandwidth h. This suggests that it might be preferable to fit B and D using a generalized least squares (GLS) fitting criterion that takes account of correlated errors, rather than the ordinary least squares (OLS) criterion (A.3). OLS estimates are still unbiased when errors are correlated, but GLS estimates can have higher efficiency (i.e., lower variance) if the variance-covariance matrix of the errors is known or estimated (Draper and Smith 1981). However, we recommend using OLS, for the following two reasons.

            First, the error correlation is weak enough that OLS is nearly as efficient as GLS, due to the "whitening by windowing" principle (Hart 1996) for nonparametric time series models with autocorrelated errors. The principle derives from the fact that for large sample size, temporally adjacent points in the original time series usually are not nearest neighbors in the space of the model's independent variables – in our case, (X, Y). Rather, most points in any small neighborhood in (X, Y) space come from widely separated times, so the associated errors are nearly uncorrelated (i.e., the errors within a spatial window are nearly "white noise"). For the model in Fig. 1 the temporal autocorrelation of errors at lag 1 is significantly positive, about +0.5 (depending on the particular realization of the model and measurement errors), but the autocorrelation is below 0.2 in absolute value at all higher lags because of the small bandwidth used for the gradient estimates. Consequently the spatial autocorrelation of residuals (Moran's I statistic) in (X, Y) space is significantly positive but small. In 25 replicate realizations of the model, the maximum spatial autocorrelation (for neighborhood areas between 5% and 50% of the smallest square covering the data) never exceeded +0.20, and only one I value exceeded +0.15.
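As an illustration of this check, the following sketch computes Moran's I for the fitted-model residuals in (X, Y) space using Moran.I from the ape package; the binary within-radius neighborhood weights are an illustrative choice, not the neighborhood-area definition used above.

library(ape)

residual.moran <- function(X, Y, resid, r) {
  d <- as.matrix(dist(cbind(X, Y)))   # pairwise distances in (X, Y) space
  w <- (d > 0 & d <= r) * 1           # neighbors: points within radius r
  Moran.I(resid, w)                   # observed I, expectation, sd, p-value
}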

            The second reason is that, in the presence of measurement errors, the cure is often worse than the disease. Measurement error in population density can cause appreciable bias in both OLS and GLS regression coefficients, and (as discussed in the main text) this bias is the most problematic source of estimation error in our procedures. Analytic large-sample approximations and simulations (Y. Seifu, unpublished) show that the bias is larger for GLS than for OLS. The higher bias of GLS estimates often outweighs their slightly smaller variance, so that GLS estimates have a higher mean squared error than OLS estimates.

Poisson sampling model

            Measurement errors for simulated data were generated by a mechanistic model for the sampling process. We assume that the population is sampled such that each individual has some small probability p of being counted, and we refer to p as the capture probability. When there are N individuals in the population, the sample count is therefore distributed as Binomial(N, p), which we approximated as Poisson(Np). For plotting and model fitting, these values were rescaled to estimate the actual population size. Thus a "data with measurement error" value for a simulated data point with N individuals is distributed as \((1/p) \times \mathrm{Poisson}(Np)\), which has mean N and standard deviation \(\sqrt{N/p}\).
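In R this sampling model is a one-line function (illustrative, not from the supplied code):

## observed density: Poisson count with mean N*p, rescaled by the capture probability
sample.density <- function(N, p) rpois(length(N), lambda = N * p) / p

## e.g., sample.density(rep(1000, 5), p = 0.1) has mean ~1000 and sd ~sqrt(1000/0.1) = 100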

SIMEX bias reduction

            Given a model for the measurement or sampling errors, the bias in regression estimates can be reduced by the SIMEX procedure (Cook and Stefanski 1994, Stefanski and Cook 1995). For the case of nonparametric regression (Carroll et al. 1995, 1999), the general procedure is to

·        Repeatedly re-fit the model with simulated measurement errors added to the independent variables;

·        Compute the average estimates as a function of the level of added measurement error;

·        Extrapolate back to zero measurement error; Carroll et al. (1995, 1999) recommend quadratic extrapolation.

In our simulation models we have assumed the Poisson sampling model described above. We can therefore add a second "dose" of sampling error by simulating the process of sampling from a population of the observed size, and sample from that sample to add a third "dose", and so on. Because the model (A.2) is linear in the coefficients, the average function estimates correspond to average parameter estimates. Let \(\hat{\beta}_i\) denote the mean parameter vector with i doses of measurement error, i = 1, 2, 3, with \(\hat{\beta}_1\) being the estimate from the data. Quadratic extrapolation through these values back to i = 0 yields the SIMEX estimator \(\hat{\beta}_{SIMEX} = 3\hat{\beta}_1 - 3\hat{\beta}_2 + \hat{\beta}_3\). Because of the large coefficients multiplying the \(\hat{\beta}_i\) we used 500 refittings at each level of measurement noise. In the SIMEX example in the code that we provide (a univariate regression with 200 data points), the 500 refittings take about 40 seconds on an 800 MHz PC. The smoothing parameters were not re-estimated for the refittings, because the increase in measurement error variance had very small effects on the values of the GCV2-optimal smoothing parameters. Our simulation results indicate that this shortcut is safe for the data analyzed here.
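The following sketch outlines this SIMEX calculation under the Poisson sampling model; fit.model is assumed to be a function that takes the rescaled observed series and returns the coefficient vector of Eq. A.2, and all names are illustrative.

simex.fit <- function(x.obs, p, fit.model, n.sim = 500) {
  beta1 <- fit.model(x.obs)              # one "dose" of error: the data themselves
  beta2 <- beta3 <- 0
  for (s in 1:n.sim) {
    x2 <- rpois(length(x.obs), lambda = x.obs * p) / p   # second dose: resample the sample
    x3 <- rpois(length(x2), lambda = x2 * p) / p         # third dose: resample again
    beta2 <- beta2 + fit.model(x2) / n.sim
    beta3 <- beta3 + fit.model(x3) / n.sim
  }
  3 * beta1 - 3 * beta2 + beta3          # quadratic extrapolation back to zero error
}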

            In applications, the capture or sampling probability p will typically have to be estimated. More generally, in order to use SIMEX one needs a parametric model of the measurement error process. This could be obtained theoretically, based on a model of the sampling process (as in Turchin and Ellner 2000), or empirically, based on a small-scale set of replicate measurements at a range of population densities. It is tempting, but unfortunately not feasible, to parameterize a sampling error model by smoothing the time series and using the residuals as estimated errors, because the residuals are too dependent on the smoothing method. Two reasonable (and visually very similar) choices of a smooth gave maximum likelihood estimates of p that differed by about a factor of 5.

More on smoothing parameter selection

     In addition to GCV2 we also considered a more conservative criterion based on the spatial autocorrelation of residuals from a fitted model (Ellner and Seifu 2001). Residuals from an insufficiently complex model exhibit positive autocorrelation in the space of independent variables, due to the unfitted structure in the data. The selection criterion is to find the simplest model (as measured by the total effective degrees of freedom for both rate functions) such that the residual spatial autocorrelation (RSA), as measured by Moran's I statistic, is 0. Relative to GCV2 the RSA criterion almost invariably selects slightly simpler models, having slightly higher bias but fewer spurious wiggles on average (Ellner and Seifu, 2001). In the models considered here, fitted rate functions from the RSA and GCV2 criteria were visually almost indistinguishable. We therefore report results only for the GCV2 criterion, which has been tested more thoroughly (Nychka et al. 1992, Ellner and Turchin 1995) and is much quicker to compute.
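A rough sketch of the RSA criterion is given below, reusing residual.moran from above; the single smoothing parameter shared by X and Y, the use of the hat-matrix trace as the effective degrees of freedom, and the reading of "Moran's I is 0" as "I is not positive" are all simplifying assumptions for illustration.

rsa.select <- function(M, Z, m, X, Y, r, lambdas) {
  stats <- sapply(lambdas, function(lam) {
    pen <- diag(c(rep(0, 3), rep(lam, m), rep(0, 3), rep(lam, m)))
    H   <- M %*% solve(crossprod(M) + pen, t(M))     # hat matrix for this lambda
    I   <- residual.moran(X, Y, drop(Z - H %*% Z), r)$observed
    c(edf = sum(diag(H)), I = I)
  })
  ok <- stats["I", ] <= 0                 # fits with no positive residual autocorrelation
  lambdas[ok][which.min(stats["edf", ok])]   # simplest (fewest effective df) such fit
}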

References

Carroll, R. J., D. Ruppert, and L. W. Stefanski. 1995. Measurement error in nonlinear models. Chapman and Hall, New York, New York, USA.

Carroll, R. J., J. D. Maca, and D. Ruppert. 1999. Nonparametric estimation in the presence of measurement errors. Biometrika 86:541-554.

Cook, J. R., and L. A. Stefanski. 1994. Simulation-extrapolation estimation in parametric measurement error models. Journal of the American Statistical Association 89:1314-1328.

Draper, N. R., and H. Smith. 1981. Applied regression analysis, Second edition. Wiley, New York, New York, USA.

Ellner, S. P., B. E. Kendall, S. N. Wood, E. McCauley, and C. J. Briggs. 1997. Inferring mechanism from time-series data: delay-differential equations. Physica D 100:182-194.

Ellner, S. P., and Y. Seifu. 2001. Using spatial statistics to select model complexity. Journal of Computational and Graphical Statistics, in press.  

Ellner, S. P., and P. Turchin. 1995. Chaos in a noisy world: new methods and evidence from time series analysis. American Naturalist 145:343-375.

Hart, J. D. 1996. Some automated methods of smoothing time-dependent data. Journal of Nonparametric Statistics 6:115-142.

Nychka, D. W., S. Ellner, A. R. Gallant, and D. McCaffrey. 1992. Finding chaos in noisy systems (with discussion). Journal of the Royal Statistical Society Series B 54:399-426.

Ruppert, D., and R. J. Carroll. 1997. Penalized regression splines. Technical Report TR1249, Department of Operations Research and Industrial Engineering, Cornell University, Ithaca, New York, USA.

Ruppert, D., and R. J. Carroll. 2000. Spatially adaptive penalties for spline fitting. Australian and New Zealand Journal of Statistics 42:205-223.

Stefanski, L. A., and J. R. Cook. 1995. Simulation-extrapolation: the measurement error jackknife. Journal of the American Statistical Association 90:1247-1256.

Turchin, P., and S. P. Ellner. 2000. Living on the edge of chaos: population dynamics of Fennoscandian voles. Ecology 81:3099-3116.

Wood, S. N. 1997. Inverse problems and structured-population dynamics. Pages 555-586 in S. Tuljapurkar and H. Caswell, editors. Structured-population models in marine, terrestrial, and freshwater systems. Chapman and Hall, New York, New York, USA.


