An evaluation of likelihood-based bandwidth selectors for spatial and spatiotemporal kernel estimates

ABSTRACT Spatial point pattern data sets are commonplace in a variety of different research disciplines. The use of kernel methods to smooth such data is a flexible way to explore spatial trends and make inference about underlying processes without, or perhaps prior to, the design and fitting of more intricate semiparametric or parametric models to quantify specific effects. The long-standing issue of ‘optimal’ data-driven bandwidth selection is complicated in these settings by issues such as high heterogeneity in observed patterns and the need to consider edge correction factors. We scrutinize bandwidth selectors built on leave-one-out cross-validation approximation to likelihood functions. A key outcome relates to previously unconsidered adaptive smoothing regimens for spatiotemporal density and multitype conditional probability surface estimation, whereby we propose a novel simultaneous pilot-global selection strategy. Motivated by applications in epidemiology, the results of both simulated and real-world analyses suggest this strategy to be largely preferable to classical fixed-bandwidth estimation for such data.


Introduction
The analysis of spatial point process data usually involves estimation of the continuous density or intensity function deemed to have given rise to the observed pattern. In many research disciplines it is common to encounter high heterogeneity in observed patterns, with points clustered densely in some areas and dispersed sparsely elsewhere. Studies in geographical epidemiology, for example, might note tight clusters of disease occurrences in populated urban centres, with comparatively far fewer observations made in rural regions.
Appealing flexibility is therefore found in nonparametric techniques, multivariate kernel smoothing methods in particular [1,2], for spatial density/intensity estimation. While in isolation nonparametric methods are not equipped to quantify, say, specific covariate effects on the process of interest, they are a powerful tool in initial data exploration and visualization; in making comparative inferences between different types of points [3-6]; and as a building block for more complicated semiparametric models [7-9].
The issue of appropriate bandwidth selection is of paramount importance in applications of kernel methods, impacting greatly on the resulting spatial and spatiotemporal surfaces. Data-driven selection methods are complicated in this setting by the need to take into account corrective factors to combat boundary bias. In recent years, spatially adaptive kernel estimation [10] has been shown to be particularly useful in spatial applications, with the potential to substantially outperform the classical fixed-bandwidth estimator in certain settings [7,11,12]. Bandwidth selection is even more difficult for such estimators, however, with the need to specify both a pilot and a global bandwidth to compute the variable bandwidth factors.
In this work we address the issue of bandwidth selection for kernel estimators of spatial and spatiotemporal point patterns, motivated by smoothing problems in epidemiology. Specifically, we focus on data-driven selection methods built on approximations to likelihood functions, the fundamental ideas being readily interpretable and applicable to several variants of kernel estimators. In so doing, we introduce adaptive kernel estimators for constructing conditional probability surfaces of multitype patterns, and for spatiotemporal densities. We also propose using a common pilot-global bandwidth selection strategy for all adaptive estimators considered. The methodological exposition is interspersed with a comprehensive collection of comparative simulations.
The article is structured as follows. Standalone spatial density estimation is examined in Section 2, including fixed and adaptive estimators, and the concept of likelihood-based bandwidth selection is introduced. Simulations for these estimators are designed and executed in Section 3. In Section 4 we consider spatial relative risk via conditional probability surface estimation, with the accompanying simulations following in Section 5. Spatiotemporal density estimators are discussed in Section 6 and numerical performance is compared in Section 7. We apply the common pilot-global selection strategies to a pair of real-data examples in Section 8. Concluding remarks and suggestions for future research appear in Section 9.

Spatial density estimation
In what follows, suppose we observe a point pattern X = {x 1 , . . . , x n } falling within some bounded region W ⊂ R 2 . The goal is estimation of f, the density function on W assumed to have given rise to X. A common assumption is that X is a realization of a Poisson point process on W, in which case the kernel density estimator is directly proportional to the same estimator of the process intensity, which yields the expected number of observations per unit area.

Fixed bandwidth
The classical 2D kernel estimator [1,2] of the unknown density is given as

f̂(x | h) = {n q_W(x; h)}^{-1} Σ_{i=1}^{n} h^{-2} K{(x − x_i)/h};  x ∈ W.  (1)

In (1), h > 0 is the bandwidth or smoothing parameter, and K(·) is the kernel function, taken as a zero-mean probability density function (pdf). We shall assume a Gaussian K throughout, as is often used in spatial applications. Furthermore,

q_W(x; h) = ∫_W h^{-2} K{(x − u)/h} du  (2)

is an edge-correction factor in place to mitigate the potentially serious negative bias that can occur in f̂ at the boundaries of W. Edge-correction is a non-trivial issue, particularly so for spatial applications where it is common for W to form one (or even several) irregular polygonal shapes. While there are different forms of such correction factors [13,14], we shall for simplicity focus on the form of q_W as given in (1) and (2), where the factor is found as the proportion of mass of the kernel K centred on the query location x (and with bandwidth h) that falls within W. In any case, it is important to note that any such solutions are approximative in nature and merely mitigate, as opposed to completely annihilate, the adverse effects of estimation at region boundaries.
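For a rectangular W and a Gaussian K, the integral in (2) factorizes into one-dimensional Gaussian masses and can be evaluated in closed form. The sketch below (plain Python; W is taken as the unit square purely for illustration) computes q_W(x; h) as the proportion of kernel mass falling inside W. For an irregular polygonal W this integral would instead be approximated numerically, e.g. over a fine pixel grid.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def q_W(x, h, xlim=(0.0, 1.0), ylim=(0.0, 1.0)):
    """Edge-correction factor (2): mass of an isotropic Gaussian kernel
    with bandwidth h, centred at x = (x1, x2), falling inside the
    rectangle xlim x ylim. The bivariate integral factorizes across axes."""
    x1, x2 = x
    mass_x = Phi((xlim[1] - x1) / h) - Phi((xlim[0] - x1) / h)
    mass_y = Phi((ylim[1] - x2) / h) - Phi((ylim[0] - x2) / h)
    return mass_x * mass_y
```

Dividing the kernel sum in (1) by q_W(x; h) inflates the estimate near the boundary, where part of each kernel's mass spills outside W; note q_W → 1 in the interior, ≈ 1/2 at an edge and ≈ 1/4 at a corner when h is small.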
In papers that deal more generally with multivariate density estimation, h above is often replaced by an appropriately dimensioned, symmetric positive definite matrix. For data in R 2 , this would entail specification of a 2 × 2 matrix with up to three unique components allowing differing degrees of smoothing along each axis, as well as altering the orientation of the kernel. In this work, motivated by applications in spatial epidemiology, we restrict attention to isotropic smoothing via a scalar bandwidth. Imparting equal amounts of smoothing in all directions in such analyses is generally preferable in the absence of external information defining the disease under scrutiny or significant prevailing geographic features. Computational complexity is also a factor that cannot be ignored, especially for adaptive smoothing (Section 2.2), and use of a scalar bandwidth helps minimize the cost of relevant operations.
It is universally agreed that a 'sensible' choice of bandwidth, i.e. one that is neither excessively large nor impractically small, is crucial in obtaining a reliable estimate of the unknown density. As such, the development of data-driven bandwidth selectors which aim to optimize this balance with respect to some appropriate criterion has been a key research focus for many years. While the bulk of early efforts focused on selectors for univariate data, for which there exist several suitable methods [2,15,16], not all are straightforwardly adaptable to the multivariate setting [17]. This is particularly true when we consider the aforementioned issue of boundary bias; approximative edge-correction techniques such as (2) are frequently absent from papers that deal with multivariate kernel density estimation in a more general sense. As a result, most bandwidth selection methods are not developed with edge-correction in mind, and the corresponding practical and empirical comparisons do not concern themselves with the bounded region upon which the data are generated.
A popular approach to bandwidth selection that can be applied in the spatial and spatiotemporal settings is likelihood cross-validation. This technique is based on a leave-one-out approximation derived from D(·, ·), the Kullback-Leibler (KL) information, a measure of disparity between two densities. With respect to the density estimate f̂ as per (1) and the true target density f, we have

D(f, f̂) = ∫_W f(x) log{f(x)/f̂(x | h)} dx.  (3)

Minimization of (3) with respect to h is therefore equivalent to maximization of

E[log f̂(x | h)] = ∫_W f(x) log f̂(x | h) dx,  (4)

which can be approximated via leave-one-out averaging as

FIXLIK(h) = n^{-1} Σ_{i=1}^{n} log f̂_{-i}(x_i | h),  (5)

where f̂_{-i}(x_i | h) is the edge-corrected density estimate evaluated at the location of the ith observation assuming its omission from X. Thus, finding argmax_h [FIXLIK(h)] is a direct data-driven method by which the smoothing parameter can be optimized for a fixed-bandwidth kernel density estimate. The result in (5), barring the presence of the corrective factor q_W, is well established in the literature [1]. However, the role such likelihood-based bandwidth selection can play in more intricate smoothing applications, such as in variable-bandwidth and spatial relative risk estimation, has yet to receive attention.
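As a concrete illustration, FIXLIK in (5) can be computed by brute force: for each observation, evaluate the edge-corrected fixed-bandwidth estimate at x_i with x_i removed, then average the logs. The sketch below (numpy; Gaussian kernel; the edge-correction factor is supplied as a user function and defaults to 1, i.e. no correction) is a minimal version and costs O(n²) per candidate h.

```python
import numpy as np

def fixlik(h, pts, q=lambda x, h: 1.0):
    """Leave-one-out log-likelihood (5) for the fixed-bandwidth,
    edge-corrected Gaussian kernel density estimate (1).
    pts: (n, 2) array of observations; q: edge-correction q_W(x; h)."""
    n = len(pts)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2.0 * h * h)) / (2.0 * np.pi * h * h)
    np.fill_diagonal(K, 0.0)                 # omit x_i from its own estimate
    qs = np.array([q(x, h) for x in pts])
    f_loo = K.sum(axis=1) / ((n - 1) * qs)   # edge-corrected f-hat_{-i}(x_i | h)
    return np.log(np.maximum(f_loo, 1e-300)).mean()
```

argmax_h FIXLIK(h) can then be located with any one-dimensional optimizer over a sensible bracket of candidate bandwidths.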

Adaptive bandwidth
Allowing the amount of smoothing to vary leads to an adaptive or variable-bandwidth estimator, certain forms of which have been shown to be superior to fixed-bandwidth estimators in various scenarios. For highly heterogeneous spatial point patterns, such as those often encountered in geographical epidemiology, 'Abramson' observation-specific adaptation [10,18] performs particularly well. Under this smoothing regimen, (1) becomes

f̂(x | h_0) = n^{-1} Σ_{i=1}^{n} h(x_i)^{-2} K{(x − x_i)/h(x_i)} / q_W(x; h(x)),  (6)

where the isotropic smoothing parameter varies according to the following expression:

h(u) = h_0 f̃_X(u | h_p)^{-1/2} γ^{-1};  u ∈ W.  (7)

Here, note that the bandwidth at any location u ∈ W is inversely proportional to the square root of a fixed-bandwidth density estimate f̃_X of the observed data X. This estimate is found using a pilot bandwidth h_p; the pilot density must also be clipped at some nominal value > 0 to prevent excessively large variable bandwidths adversely affecting the final smooth [18]. The required global bandwidth h_0 sets an overall level of smoothing for the variable bandwidths. Finally, γ = exp{n^{-1} Σ_{u∈X} log f̃_X(u | h_p)^{-0.5}} is the geometric mean of the inverse square root pilot density values at each observation. Its inclusion in (7) is optional, but is useful in practice as it allows the global bandwidth h_0 to be considered on the same scale as the pilot bandwidth h_p [1].
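Given pilot density values at the observations, the variable bandwidths in (7) are cheap to form. A small numpy sketch (the lower clipping constant is an illustrative choice; [18] motivates the need for some such truncation):

```python
import numpy as np

def abramson_bandwidths(h0, pilot_vals, clip=1e-4):
    """Variable bandwidths (7): h(u) = h0 * pilot(u)^(-1/2) / gamma,
    with gamma the geometric mean of the pilot^(-1/2) terms.
    pilot_vals: pilot density values f~(u | hp) at each observation."""
    f = np.maximum(np.asarray(pilot_vals, dtype=float), clip)  # clip to bound bandwidths
    inv_root = f ** -0.5
    gamma = np.exp(np.mean(np.log(inv_root)))   # geometric-mean term in (7)
    return h0 * inv_root / gamma
```

Because of the γ term, the geometric mean of the returned bandwidths is exactly h_0, so h_0 retains its interpretation as the overall level of smoothing.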
Additionally noteworthy in (6) is the modification to the edge-correction factor, which now depends on the variable bandwidth function (7) [19]. While the kernel estimator itself requires a bandwidth at each observation u ∈ X, the edge-correction factors require the 'hypothetical' adaptive bandwidths at all desired evaluation coordinates x ∈ W, conditional upon the observed data X.
Abramson adaptation decreases the relative amount of smoothing in areas of high pilot density, where points are more densely grouped, and increases smoothing for observations that lie in relative isolation where the pilot density is small. In the final density estimate f̂, this has the effect of preserving more detail in areas where we have more data, while mitigating the potential for undersmoothing in sparsely populated regions. Compelling theoretical and practical benefits have been noted in comparison to standalone fixed-bandwidth estimation [12].
Potential advantages of adaptive smoothing over fixed smoothing notwithstanding, we are still faced with a bandwidth selection problem in (6). Indeed, this problem is now more complex than earlier -we require some optimal selection of (h p , h 0 ), preferably jointly so. Furthermore, recent work has shown certain approximations based on asymptotic theory are not practically useful for this task [20]. As such, applications to date have largely seen the pilot and global bandwidths set subjectively or in an overtly ad-hoc manner [7,11,21]. Finding a data-driven method to choose these values simultaneously and objectively is desirable.
With this in mind, we may consider an appropriate alteration to (4), and identify a naive, joint pilot-global bandwidth selector for the adaptive kernel estimator based on

ADALIK_2(h_0, h_p) = n^{-1} Σ_{i=1}^{n} log f̂_{-i}(x_i | h_0, h_p),  (8)

where we choose the values of (h_p, h_0) that satisfy argmax_{h_p,h_0} [ADALIK_2(h_0, h_p)]; the subscript '2' is utilized to emphasize bivariate optimization over the two scalar smoothing parameters. Note that, similar to (5), we define

f̂_{-i}(x_i | h_0, h_p) = (n − 1)^{-1} Σ_{j≠i} h_{-i}(x_j)^{-2} K{(x_i − x_j)/h_{-i}(x_j)} / q_W(x_i; h_{-i}(x_i)),  (9)

where

h_{-i}(u) = h_0 f̃(u | h_p)^{-1/2} γ_{-i}^{-1}  (10)

is the bandwidth function (7) recomputed with x_i deleted from the pilot density estimate and from the corresponding geometric-mean term γ_{-i}. Crucially, (9) and (10) reveal that the leave-one-out operation affects both the main kernel sum and the bandwidth function, due to the dependence of the latter on the pilot density estimate. This presents a computational challenge for even modest sample sizes. We refer to ADALIK_2 as 'naive' because of the complex relationship between the pilot and global bandwidths in the Abramson estimator. It is reasonable to anticipate identifiability issues in such unconstrained optimization: the effect of a small pilot bandwidth could be tempered by a large global bandwidth, and vice versa. Just how entangled the two bandwidths are with respect to the optimization criterion is difficult to formally ascertain, and we turn to the simulations in the latter sections of the paper to shed light on this phenomenon.
In anticipation of this potential problem, we also propose a constrained version of (8), where we force h_c ≡ h_0 = h_p, and define

ADALIK_1(h_c) = ADALIK_2(h_c, h_c),  (11)

which uses (9) and (10) with the shared h_c. While setting the pilot and global bandwidths to be equal reduces the flexibility of the adaptive smoother in general, it is of interest to determine if the optimal choice, argmax_{h_c} [ADALIK_1(h_c)], still provides a satisfactory final estimate, particularly in comparison to a fixed-bandwidth estimate.
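To make the constrained selector concrete, the sketch below evaluates ADALIK_1(h_c) by brute force, recomputing both the pilot density and the bandwidth function with each observation deleted, and then maximizes over a coarse grid of candidate values (any 1D optimizer could substitute). Edge correction is omitted for brevity (it would enter exactly as in (9)), and the clipping constant is illustrative.

```python
import numpy as np

def adalik1(hc, pts, clip=1e-4):
    """Constrained leave-one-out likelihood (11), with h0 = hp = hc.
    Both the pilot estimate and the Abramson bandwidths are rebuilt
    without x_i before evaluating the adaptive density at x_i."""
    pts = np.asarray(pts, float)
    n = len(pts)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        sub = d2[np.ix_(keep, keep)]
        # fixed-bandwidth pilot density at the retained points (x_i deleted)
        pilot = (np.exp(-sub / (2 * hc * hc)) / (2 * np.pi * hc * hc)).mean(axis=1)
        pilot = np.maximum(pilot, clip)
        inv_root = pilot ** -0.5
        bw = hc * inv_root / np.exp(np.mean(np.log(inv_root)))  # (10) with h0 = hp = hc
        # adaptive leave-one-out density at x_i, cf. (9) without edge correction
        d2i = d2[i, keep]
        f_i = np.mean(np.exp(-d2i / (2 * bw * bw)) / (2 * np.pi * bw * bw))
        total += np.log(max(f_i, 1e-300))
    return total / n

def select_hc(pts, candidates):
    """argmax of ADALIK_1 over a supplied grid of candidate bandwidths."""
    vals = [adalik1(h, pts) for h in candidates]
    return candidates[int(np.argmax(vals))]
```

For the unconstrained analogue ADALIK_2 one would instead search over (h_0, h_p) jointly, with the identifiability caveats discussed above.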

Simulation study 1: spatial densities
Our first round of simulations will examine standalone density estimation using the likelihood bandwidth selectors for the estimators given in the previous section. We define four scenarios, S1 through S4. They are plotted in the top row of Figure 1; formal details and corresponding plots of example data sets appear in the supplement. From S1 to S3, which are defined as Gaussian mixtures, we move from a relatively smooth setting with mild spatial fluctuations to a more heterogeneous scenario. S4 is based on a single realization of a log-Gaussian Cox process [22], which is able to provide a 'less smooth' true scenario than Gaussian mixtures naturally permit. It is in the latter scenarios that we expect adaptive estimation to perform better than fixed, owing to the tendency of a constant bandwidth to simultaneously undersmooth and oversmooth different areas of the spatial region depending on local point concentrations.
Repeated over 100 iterations we simulate a Poisson process realization of 500 points for each scenario. The FIXLIK, ADALIK_2 and ADALIK_1 selectors are engaged to find optimal bandwidths for the relevant fixed or adaptive density estimate. In addition to these bandwidths, and to provide a 'baseline' performance upon which to improve, we also compute Terrell's oversmoothing bandwidth [23], a rule-of-thumb measure that provides an estimate of the upper bound on the amount of fixed-bandwidth smoothing compatible with the estimated scalar standard deviation of the data σ̂; h_OS = σ̂{625/(384n)}^{1/6}. We reference these results as 'FIXOS'.
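Terrell's oversmoothing bandwidth needs only the sample size and a scalar spread estimate. In the sketch below σ̂ is taken as the square root of the mean of the two marginal sample variances, one common convention (an assumption on our part; other pooled-spread choices appear in practice).

```python
import numpy as np

def h_os(pts):
    """Terrell's oversmoothing bandwidth h_OS = sigma * {625/(384 n)}^(1/6)
    for bivariate data, with sigma a scalar pooled standard deviation."""
    pts = np.asarray(pts, float)
    n = len(pts)
    sigma = np.sqrt(pts.var(axis=0, ddof=1).mean())   # pooled marginal variances
    return sigma * (625.0 / (384.0 * n)) ** (1.0 / 6.0)
```

As an approximate upper bound on plausible smoothing, h_OS is a useful baseline and a natural bracket endpoint when searching for the likelihood-optimal bandwidths.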
Performance itself is measured using the integrated squared error between a given density estimate g and the true density f according to the corresponding scenario, ISE = ∫_W (g − f)^2, as well as the KL distance (3). The bottom row of Figure 1 shows boxplots of the ISE measures for each of S1 through S4; the KL results tell the same story and are deferred to the supplement.
Adaptive estimation with a constrained pilot-global bandwidth h_c chosen via ADALIK_1 is generally the better performer for all scenarios. Its improvement over either FIXOS or FIXLIK is more obvious in the latter three scenarios, where point spreads and overall heterogeneity become more variable; one could argue that fixed-bandwidth smoothing is comparatively acceptable in situations like S1. Additionally we note the poor performance of ADALIK_2, due at least in part to the anticipated inability of the unconstrained adaptive selector to tease apart the effects of h_0 and h_p. This can be seen in Figure 2, which shows the distributions of selected bandwidths. While the curves for FIXOS, FIXLIK, and ADALIK_1 provide guides as to good ranges of bandwidth values, there is high variability in the ADALIK_2 selections. This is extreme for S1, for example, where near-zero pilot bandwidths are compensated for by massive global bandwidths. Even for S4, which shows the best ADALIK_2 performance of the four scenarios (Figure 1), the high variability of the chosen pilot and global bandwidths renders its overall performance far less stable than in the constrained setting.

Spatial relative risk and conditional probability surfaces
A common extension to standalone density estimation is to compare the estimated spatial densities of different types of observations collected in the same region. For example, researchers are often interested in estimating spatial disease risk after adjusting for the natural background heterogeneity of the population under scrutiny. The density estimates are based on an observed pattern of confirmed 'case' points, and another of at-risk 'control' points [3,4,24,25]. This concept can be further extended to situations in which there are more than two different categories of points, such as the comparison of spatial patterning of several different subtypes of a certain disease [26,27].

Notation and assumptions
Fully nonparametric assessment of any such spatial segregation (or lack thereof) exhibited by the distributions of different types of observations is typically performed under certain assumptions. The conditions that follow mirror those laid out by Diggle et al. [27].
Suppose we are interested in comparing the spatial distributions of m different types of disease, all affecting an at-risk population taken to follow a Poisson point process with intensity function η 0 (x); x ∈ W ⊂ R 2 . Furthermore, let η k (x); k = 1, . . . , m represent the intensity functions of m independent Poisson point processes on W, corresponding to each of the m different types of observations, whereby η k (x) = ρ k (x)η 0 (x). We interpret the functions ρ k (x) as the probability a point at x will yield an observation of type k.
Of interest is estimation of the m surfaces p_k(x), which are interpreted as the conditional probability that an observation is of type k given a point has occurred at location x:

p_k(x) = η_k(x) / Σ_{l=1}^{m} η_l(x) = ρ_k(x) / Σ_{l=1}^{m} ρ_l(x).  (12)

Note that the assumption of a common spatial intensity η_0(x) in the expression for η_k(x) above is important because it cancels out in the ratio (12). This subsequently permits us to estimate the type-specific probabilities p_k(x) using kernel estimation of the type-specific data sets only; in other words, without specifying any particular spatial form of the at-risk population in W.
In case-control settings we have m = 2 and may use (12) to obtain the conditional probability of a case, given an observation at x. A reformulation of this approach in such a 'two-type' scenario sees the relative risk of disease directly estimated as the ratio of the case density to the control density [4,24,28]. Such density-ratio estimators have been used in a variety of applications [21,29-32] though are less capable of handling the multitype scenario when m > 2, and do not naturally lend themselves to likelihood-based bandwidth selection. We therefore focus on estimators for the conditional probability surfaces p_k in the remainder of this section.

Fixed estimator
Let X_k; k = 1, . . . , m represent realizations of the m Poisson point processes on W ⊂ R 2, with corresponding sample sizes n_1, . . . , n_m. We collect all n = Σ_k n_k points into a single data set X = {x_1, . . . , x_n}, and define a corresponding vector Y = {y_1, . . . , y_n} that identifies the type of each point, i.e. y_i = k if the ith observation is of type k.
The fixed-bandwidth estimator of (12) can be found as

p̂_k(x | h, X) = n_k f̂_k(x | h) / {n f̂(x | h)} = Σ_{i=1}^{n} 1{y_i = k} K{(x − x_i)/h} / Σ_{i=1}^{n} K{(x − x_i)/h},  (13)

where f̂(·) is the fixed-bandwidth density estimator (1), applied in f̂_k to the points of type k only and in the denominator to the pooled data set of all spatial observations. While choosing optimal bandwidths for each separate density estimate in (13) is possible, use of a common bandwidth h throughout is preferable [27]: a common bandwidth ensures all type-specific probabilities sum to unity at any given location, i.e. Σ_k p̂_k(x | h, X) = 1. As can be seen in the rightmost term in (13), a common bandwidth also obviates the need to calculate edge-correction factors q_W per (2) as seen in (1); their equivalence in numerator and denominator leads to cancellation [33,34].
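Since the common-bandwidth ratio in (13) cancels both the kernel normalizing constants and the edge-correction factors, the estimator reduces to a ratio of raw Gaussian kernel sums. A minimal numpy sketch:

```python
import numpy as np

def p_hat(x, k, pts, y, h):
    """Fixed-bandwidth conditional probability estimator (13):
    kernel weight on type-k points divided by total kernel weight.
    Constants and edge-correction factors cancel in the ratio."""
    x = np.asarray(x, float)
    d2 = ((np.asarray(pts, float) - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2.0 * h * h))
    return w[np.asarray(y) == k].sum() / w.sum()
```

By construction the estimated probabilities over the m types sum to one at any x, the property that motivates using a common h.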
A bandwidth selector can be identified by recognizing the vector Y as a realization of n multinomial outcomes. These are associated with their observation locations X, which in turn reference the probabilities of success over the m categories through the estimated p̂_k(x_i | h, X). The leave-one-out criterion for choosing an optimal h based on the log-likelihood of the multinomial distribution is therefore

FIXMULT(h) = n^{-1} Σ_{i=1}^{n} log p̂_{y_i,-i}(x_i | h, X),  (14)

where p̂_{y_i,-i} sees (13) evaluated at x_i given its deletion from X.
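The multinomial leave-one-out criterion (14) then scores a candidate h by the log-probability each held-out point assigns to its own observed type. A brute-force numpy sketch, again exploiting the cancellation of constants in the ratio:

```python
import numpy as np

def fixmult(h, pts, y):
    """Leave-one-out multinomial log-likelihood (14) for the
    fixed-bandwidth conditional probability estimator (13)."""
    pts = np.asarray(pts, float)
    y = np.asarray(y)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2.0 * h * h))
    np.fill_diagonal(K, 0.0)                        # delete x_i from its own estimate
    same = (y[:, None] == y[None, :])
    p_own = (K * same).sum(axis=1) / K.sum(axis=1)  # p-hat_{y_i,-i}(x_i | h)
    return np.log(np.maximum(p_own, 1e-300)).mean()
```

When the types are spatially segregated, small-to-moderate bandwidths keep p̂_{y_i,-i} close to one and FIXMULT near zero; gross oversmoothing pulls every probability toward the overall type proportions n_k/n.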

Adaptive estimator
Spatially adaptive estimation for nonparametric comparisons of spatial distributions is appealing for many of the same reasons noted in Section 2.2. Abramson adaptation has been suggested for the m = 2 density-ratio method [11], but has not previously been considered for the kernel regression estimator of conditional probability surfaces. Choosing a universally common global bandwidth h_0 and pilot bandwidth h_p, direct substitution of adaptive kernel density estimates (6) in place of their fixed counterparts in (13) leads to the following:

p̂*_k(x | h_0, h_p) = n_k f̂*_k(x | h_0, h_p) / {n f̂*(x | h_0, h_p)},

where f̂*_k and f̂* denote adaptive estimates based on the type-k and pooled data respectively, each with its own pilot density.
The above shows that adaptive smoothing of such surfaces requires additional care with respect to bandwidth assignments. Even if common global and pilot bandwidths are chosen for all individual density estimates, this is not enough to yield the favoured behaviour of the m probabilities summing to 1 at a given spatial site, nor does it circumvent the need to edge-correct. This is owed to the reliance of the adaptive density estimate on the pilot density, which, being based on different sets of spatial data in numerator and denominator positions (X_k and X respectively) of p̂*_k, produces different bandwidth factors on either side of the fraction; cf. (7).
We can impose the desired behaviour by basing the bandwidth function on a common data set, meaning the bandwidth factors at the pilot estimation stage are identical in numerator and denominator at any given location x. This is akin to symmetric adaptive smoothing regimens discussed for density-ratio estimators of relative risk [35]. First, extend slightly the notation of the adaptive estimator (6) to highlight this modification:

f̂_{A,B}(x | h_0, h_p) = n_B^{-1} Σ_{u∈B} h_A(u)^{-2} K{(x − u)/h_A(u)} / q_W(x; h_A(x)),  (15)

where A, B ⊂ W are data sets of n_A and n_B spatial observations respectively, and h_A(·) is the bandwidth function (7) built from a pilot density estimate of A. The above shows that the density estimate is constructed by variable-bandwidth kernels situated at the observations of B, but that the variable bandwidths at each u ∈ B are found using pilot estimation referencing A. Note that either A ⊂ B or A ⊃ B is permitted, and if A = B then (15) simplifies to (6). While (15) itself is no longer optimal in the sense of standalone density estimation via Abramson adaptation, it has been shown to possess excellent practical behaviour in ratio-based density estimators, particularly when A is chosen 'sensibly', for example, as a superset of B [35]. Similar benefits might be reasonably anticipated for the conditional probability estimator, though it remains to be seen if a high degree of spatial segregation adversely affects performance.
We further take into account the highly favourable simulation results of bandwidth selection based on ADALIK_1 shown in Section 3, and hold pilot and global bandwidths equal at a single value h_c. Our proposed symmetric constrained adaptive conditional probability estimator is therefore given as

p̂_k(x | h_c, X) = Σ_{u∈X_k} h_X(u)^{-2} K{(x − u)/h_X(u)} / Σ_{u∈X} h_X(u)^{-2} K{(x − u)/h_X(u)},  (16)

where equivalent edge-correction factors have cancelled out, and the imposed symmetry ensures Σ_{k=1}^{m} p̂_k(x | h_c, X) = 1. Note in our implementation the shared bandwidth factors are taken to be based on the pooled observations X regardless of corresponding type in Y.
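The symmetry in (16) amounts to computing one set of Abramson bandwidth factors from the pooled pattern and reusing it in numerator and denominator. A sketch, taking the per-point bandwidths as given (e.g. from (7) applied to the pooled data with h_0 = h_p = h_c):

```python
import numpy as np

def p_hat_adaptive(x, k, pts, y, bw):
    """Symmetric constrained adaptive estimator (16). bw holds one
    Abramson bandwidth per pooled observation, shared by all types,
    so equivalent edge-correction factors cancel in the ratio."""
    d2 = ((np.asarray(pts, float) - np.asarray(x, float)) ** 2).sum(axis=1)
    bw = np.asarray(bw, float)
    w = np.exp(-d2 / (2.0 * bw * bw)) / (bw * bw)   # h(u)^{-2} K{(x-u)/h(u)}; 2*pi cancels
    return w[np.asarray(y) == k].sum() / w.sum()
```

Had each type instead used bandwidths from its own pilot density, the denominators would differ across types and neither the sum-to-one property nor the edge-correction cancellation would survive.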
Optimization of h_c via the multinomial likelihood may thus proceed in a manner identical to (14). We seek argmax_{h_c} [ADAMULT(h_c)], where

ADAMULT(h_c) = n^{-1} Σ_{i=1}^{n} log p̂_{y_i,-i}(x_i | h_c, X),  (17)

with p̂_{y_i,-i} denoting (16) evaluated at x_i upon its deletion from X.

Simulation study 2: spatial conditional probabilities
The numeric evaluation of the fixed and symmetric constrained adaptive conditional probability estimators of the previous section proceeds as follows. We base four scenarios on two themes, mild segregation and strong segregation. For each theme, we specify two multitype problems; one with m = 2, the other with m = 3. Figure 3 provides plots of these designs, where the m = 2 scenarios are made up of densities A and B, and the m = 3 scenarios are made up of A, B and C. We label the four multitype scenarios MS1 through MS4, respectively referencing the m = 2 mild, m = 3 mild, m = 2 strong, and m = 3 strong themes. The densities A, B and C are used to define the corresponding type-specific probability surfaces p_A and p_B for MS1 and MS3, and p_A, p_B and p_C for MS2 and MS4, as per (12). Figure 4 displays the surfaces of p_A for the m = 2 scenarios (the p_B surface is simply its pixel-wise complement) as well as all three type-specific probability surfaces p_A, p_B, and p_C for the m = 3 scenarios. These are computed assuming sample sizes of 500, 1000, and 650 for point types A, B and C respectively.
Results are reported for 100 iterations in Figure 5. The top row shows the ISEs between the estimated type-specific probability surfaces and the true surfaces; note that for the m = 2 scenarios, MS1 and MS3, identical errors arise for both p_A and p_B, so only one set for each bandwidth selector/estimator is reported. Across all scenarios, the symmetric constrained adaptive estimator clearly holds superior performance. This is particularly pleasing given the h_0 = h_p restriction in place for all contributing densities in the ADAMULT selector and the imposed symmetry on the corresponding estimator (16). The improvement over the FIXOS or FIXMULT versions of the estimator is especially obvious in the high-segregation scenarios MS3 and MS4; the adaptive estimator appears far better equipped to handle locally abrupt changes in probability in the type-specific surfaces.
In the bottom row of Figure 5, we see the distribution of the common bandwidths h c chosen for ADAMULT are somewhat similar to those of FIXMULT for the mild segregation scenarios MS1 and MS2. This is visibly different to those of MS3 and MS4, where the chosen ADAMULT bandwidths are notably smaller than FIXMULT. In general, reducing the pilot bandwidth will increase the range of variable bandwidths by decreasing the smoothness of the pilot density in (7). This tends to be desirable for surfaces of high fluctuation such as those making up MS3 and MS4 -a wider gamut of point-specific bandwidths allows for more accurate capturing of local peaks and troughs.

Spatiotemporal density estimation
Our final methodological investigation concerns kernel density estimation of spatiotemporal data where, in addition to n spatial coordinates X = {x_1, . . . , x_n} with each x_i ∈ W, we are provided corresponding timestamps T = {t_1, . . . , t_n} with each t_i ∈ D ⊂ R. It is assumed D is a single closed interval on the real line.
We follow Fernando and Hazelton [36] in expressing a fixed-bandwidth trivariate kernel density estimate as the sum of n products of a 2D and a 1D kernel function, K and L respectively; each is assumed to be a zero-mean symmetric pdf with finite variance on the respective domain. The estimator is

f̂_{X,T}(x, t | h, λ) = n^{-1} Σ_{i=1}^{n} h^{-2} K{(x − x_i)/h} λ^{-1} L{(t − t_i)/λ} / {q_W(x; h) q_D(t; λ)}  (18)

for x ∈ W and t ∈ D, where h and λ are the spatial and temporal bandwidths respectively. Edge-correction factors are in general necessary on both spatial and temporal domains; q_W is given in (2), and a similar definition serves q_D(t; λ) = λ^{-1} ∫_D L{(t − u)/λ} du. The same principle behind derivation of (4) applies in this setting, and the likelihood-optimal bandwidths are found via bivariate optimization of

FIXST(h, λ) = n^{-1} Σ_{i=1}^{n} log f̂_{X\x_i, T\t_i}(x_i, t_i | h, λ),  (19)

where A\a refers to the set A without member a. The leave-one-out operation is based on deletion of (and kernel density evaluation at) the ith observation in space-time, denoted f̂_{X\x_i, T\t_i}(x_i, t_i | h, λ).

Let H = {h_0, h_p, λ_0, λ_p} be a set of four scalar bandwidth values; a global and pilot bandwidth for the spatial and temporal margins respectively. A spatiotemporally adaptive version of (18) per Abramson's regimen can be written as

f̌_{X,T}(x, t | H) = n^{-1} Σ_{i=1}^{n} h(x_i, t_i)^{-2} K{(x − x_i)/h(x_i, t_i)} λ(x_i, t_i)^{-1} L{(t − t_i)/λ(x_i, t_i)} / {q_W(x; h(x, t)) q_D(t; λ(x, t))}.  (20)

The bandwidth functions are

h_{X,T}(x, t; h_0, h_p, λ_p) = h_0 f̃_{X,T}(x, t | h_p, λ_p)^{-1/2} γ̆^{-1}  (21)

and

λ_{X,T}(x, t; λ_0, h_p, λ_p) = λ_0 f̃_{X,T}(x, t | h_p, λ_p)^{-1/2} γ̆^{-1},  (22)

where f̃_{X,T}(x, t | h_p, λ_p) is the fixed-bandwidth pilot density estimate obtained via (18), and γ̆ = exp{n^{-1} Σ_i log f̃_{X,T}(x_i, t_i | h_p, λ_p)^{-0.5}}. As with purely spatial estimates, we clip the pilot density at some small nominal value to prevent excessive variable bandwidths arising from either (21) or (22). Completely unconstrained bandwidth selection for the new estimator (20) would involve quadvariate optimization over H. As noted in Section 2.2 and revealed by the simulations in Section 3, this is inadvisable. We therefore constrain the pilot and global bandwidths in a manner similar to the purely spatial estimator; this time, we are required to retain separate scales in the spatial and temporal margins. Setting h_c ≡ h_0 = h_p and λ_c ≡ λ_0 = λ_p, the likelihood selector is given by

ADAST(h_c, λ_c) = n^{-1} Σ_{i=1}^{n} log f̌_{X\x_i, T\t_i}(x_i, t_i | h_c, λ_c).  (23)
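For a rectangular W and interval D with Gaussian K and L, both correction factors in (18) have closed forms, and the trivariate estimate is an average of per-point spatial-by-temporal kernel products. A compact sketch (the rectangular W, interval D and Gaussian L are illustrative simplifications):

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def f_st(x, t, pts, ts, h, lam, W=((0, 1), (0, 1)), D=(0, 1)):
    """Fixed-bandwidth spatiotemporal estimator (18) with Gaussian
    spatial kernel K and Gaussian temporal kernel L, including the
    edge-correction factors q_W(x; h) and q_D(t; lam)."""
    pts, ts = np.asarray(pts, float), np.asarray(ts, float)
    d2 = ((pts - np.asarray(x, float)) ** 2).sum(axis=1)
    Ks = np.exp(-d2 / (2 * h * h)) / (2 * np.pi * h * h)
    Lt = np.exp(-((t - ts) ** 2) / (2 * lam * lam)) / (sqrt(2 * np.pi) * lam)
    qW = ((Phi((W[0][1] - x[0]) / h) - Phi((W[0][0] - x[0]) / h))
          * (Phi((W[1][1] - x[1]) / h) - Phi((W[1][0] - x[1]) / h)))
    qD = Phi((D[1] - t) / lam) - Phi((D[0] - t) / lam)
    return (Ks * Lt).mean() / (qW * qD)
```

Far from the boundaries of a large domain the corrections are negligible and the estimate reduces to an average of products of bivariate and univariate Gaussian densities.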

Simulation study 3: spatiotemporal densities
The computational cost of brute-force leave-one-out operations for adaptive spatiotemporal kernel estimates is high. There is a substantial bottleneck induced by the variable bandwidth and edge-correction factors, necessitating repeated recalculation of the adaptive density over W × D for each deleted point. We mitigate this expense to a certain extent by implementing (20) using a data-partitioning approach whereby the adaptive estimate is approximated by a sum of fixed-bandwidth estimates operating on subsets of the observed data. Appropriate subsets are identified based on a preselected sequence of unique quantiles of the variable bandwidth factors [12]. Such an approximation yields final estimates of lower precision when compared to direct evaluation of (20), though is arbitrarily improved by using a finer sequence of bandwidth quantiles. Further reductions in total running time can be achieved by parallelizing the leave-one-out steps of (23) over multiple CPU cores. Even with such measures in place and constraining optimization to the two values (h_c, λ_c), complete calculation of the ADAST selector for a single data set is costly (roughly 150 minutes for direct evaluation on one of our modern desktop machines, based on a sample size of n = 500; an evaluation lattice of size 64^3; and setting a modest relative tolerance for the stopping rule defined using successive steps of the optimization routine). We therefore limit ourselves to two scenarios for the final simulation experiment and label these STS1 and STS2.
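The partitioning device is easiest to see in the purely spatial case: points are grouped into quantile bins of their variable bandwidths, each group is smoothed with a single representative bandwidth (here the bin mean, an illustrative choice), and the partial kernel sums are pooled. This is a simplified sketch of the idea in [12], without edge correction:

```python
import numpy as np

def adaptive_exact(x, pts, bw):
    """Direct evaluation: one Gaussian kernel per point, each with its
    own bandwidth (edge correction omitted for brevity)."""
    d2 = ((np.asarray(pts, float) - np.asarray(x, float)) ** 2).sum(axis=1)
    return np.mean(np.exp(-d2 / (2 * bw * bw)) / (2 * np.pi * bw * bw))

def adaptive_partitioned(x, pts, bw, n_bins=8):
    """Approximate the adaptive estimate by a sum of fixed-bandwidth
    estimates over bandwidth-quantile subsets of the data."""
    pts, bw = np.asarray(pts, float), np.asarray(bw, float)
    edges = np.quantile(bw, np.linspace(0, 1, n_bins + 1))[1:-1]
    bins = np.searchsorted(edges, bw, side="left")
    d2 = ((pts - np.asarray(x, float)) ** 2).sum(axis=1)
    total = 0.0
    for b in np.unique(bins):
        m = bins == b
        hb = bw[m].mean()      # representative fixed bandwidth for this bin
        total += np.exp(-d2[m] / (2 * hb * hb)).sum() / (2 * np.pi * hb * hb)
    return total / len(pts)
```

A finer quantile sequence shrinks the approximation error at proportionally greater cost; with constant bandwidths the two evaluations coincide exactly.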
The scenarios are designed by first defining several pixel images over W to pose as 'keyframes'. We let D = [1, 10], and associate the endpoints of this interval with two chosen keyframes; remaining keyframes are assigned unique times within D. This sequence of spatial images is linearly interpolated in a pixel-wise fashion so that the spatial appearance of the density is available for any t ∈ D. We define a smooth univariate function over D to describe relative point intensity at each t. The resulting spatiotemporal surface, when evaluated over a discretized 3D array spanning W × D, is rescaled to ensure integration to 1. A 3D rejection sampling algorithm subsequently permits generation of independent and identically distributed observations. Figures 6 and 7 plot aspects of the design of STS1 and STS2 respectively. The first scenario is composed of two main spatial centres of observations, the northernmost manifesting itself in the middle of the time period; both dissipate somewhat as time progresses. There are ongoing sporadic observations throughout, with relative point counts peaking around t = 5. STS2 is more complex, with single tight groupings of points both starting and disappearing at staggered stages during D. Multiple observation clusters of different sizes develop between t = 6 and t = 7, with a single prominent grouping remaining at the end of the study period. The marginal point distribution over time oscillates between low and high density, with increased frequency for t > 8. There are larger space-time pockets devoid of observations than in STS1.
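The scenario-construction recipe above (pixel-wise keyframe interpolation followed by 3D rejection sampling) is straightforward to sketch. Frame resolution, keyframe values and the proposal bound below are all illustrative choices:

```python
import numpy as np

def frame_at(t, keyframes, keytimes):
    """Pixel-wise linear interpolation between successive keyframes."""
    keytimes = np.asarray(keytimes, float)
    j = np.clip(np.searchsorted(keytimes, t) - 1, 0, len(keytimes) - 2)
    w = (t - keytimes[j]) / (keytimes[j + 1] - keytimes[j])
    return (1 - w) * keyframes[j] + w * keyframes[j + 1]

def rejection_sample(n, density, bound, W=((0, 1), (0, 1)), D=(1.0, 10.0), seed=0):
    """Draw n iid points from a spatiotemporal density f(x, t) <= bound
    on W x D by rejection from the uniform proposal."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x1 = rng.uniform(*W[0]); x2 = rng.uniform(*W[1]); t = rng.uniform(*D)
        if rng.uniform(0, bound) < density(np.array([x1, x2]), t):
            out.append((x1, x2, t))
    return np.array(out)
```

In practice the interpolated frames would be rescaled so the discretized surface integrates to 1 before sampling, exactly as described above.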
For 100 iterations, datasets of size 500 points are generated for each scenario. Densities are estimated via the constrained adaptive estimator (20) using ADAST (23), and via the fixed bandwidth estimator using FIXST (19). Again, a fixed bandwidth estimate is also computed based on Terrell's oversmoothing rule-of-thumb as FIXOS; the two bandwidth values are computed separately using the marginal spatial and temporal data values. For this we have h_OS as defined in Section 3, and λ_OS = σ̂_T {2187/(96 n Γ(4.5))}^{1/5}, where σ̂_T is the standard deviation of the observation times and Γ(·) is the gamma function.
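The temporal oversmoothing bandwidth follows directly from the rule above; a minimal sketch (the function name is ours, and only the temporal margin as quoted is implemented):

```python
import math
import statistics

def lambda_oversmooth(times):
    """Oversmoothing rule-of-thumb bandwidth for the temporal margin,
    as quoted in the text:
        lambda_OS = sigma_T * (2187 / (96 * n * Gamma(4.5)))**(1/5),
    with sigma_T the sample standard deviation of the observation times.
    """
    n = len(times)
    sigma_t = statistics.stdev(times)
    return sigma_t * (2187.0 / (96.0 * n * math.gamma(4.5))) ** 0.2
```

Note the characteristic n^{-1/5} decay: quadrupling the sample size shrinks the bandwidth by roughly 24%.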
Results of the simulations are given in Figure 8, and clearly favor the adaptive approach in both scenarios on average. Noticeable is the increased variability in the ISE measures for the ADAST densities, owing largely to a flatter distribution of selected spatial-margin bandwidths. As noted above, this variability can be reduced, at a computational price, by increasing the resolution of the partitioning approximation to the spatiotemporal density estimator. Importantly, however, the variability shown by our current implementation does not hinder the adaptive estimator's potential to significantly outperform its fixed bandwidth counterpart.

Applications
We conclude our investigation with a pair of applications drawn from human and veterinary epidemiology. Unlike in the simulations, the true function of interest is unknown, precluding an exact error comparison; it is nevertheless of interest to contrast the fixed and the proposed constrained adaptive estimates.

Bovine tuberculosis
The first example deals with the spatial locations of 873 farms affected by bovine tuberculosis in Cornwall, UK, recorded between 1989 and 2002. Of interest in the study that originally presented and analysed these data was estimation and visualization of the degree of spatial segregation, via (13), in the four most common spoligotypes of the disease, namely 9 (n_9 = 494), 12 (n_12 = 109), 15 (n_15 = 166), and 20 (n_20 = 104) [27]. The data are displayed in the left-hand panel of Figure 9: visual evidence of segregation is apparent, with spoligotypes 9 and 15 situated primarily in the northern portion of the region and the other types aggregated mainly in the southern area.

Figure 9. Bovine tuberculosis data showing the spatial patterning of the four most common spoligotypes (left), and objective functions of the FIXMULT and ADAMULT selectors (right) evaluated over a fixed domain [2,7] and normalized to the range [0, 1] for comparison. Vertical lines in the latter mark off the selected bandwidths as those which maximize the respective objective function.
We estimate the m = 4 type-specific surfaces p_9, p_12, p_15 and p_20 using both the fixed and constrained adaptive estimators, as well as a second fixed bandwidth estimate based on the oversmoothing factor. Bandwidth selection proceeds exactly as outlined for FIXMULT (14) and ADAMULT (17), yielding optimal bandwidths of h = 4.8264 km and h_c = 3.7722 km respectively. The objective curves of FIXMULT and ADAMULT are shown on the right of Figure 9; the common FIXOS bandwidth is found as h_OS = 12.2312 km. Figure 10 shows the three estimated type-specific probability surfaces for spoligotype 20, with the full collection of images for all four types available in the supplementary materials. Dashed tolerance contours mark off statistically significant departures from the uniformity that would characterize completely unsegregated patterns (i.e. under a null hypothesis of constant probability of occurrence over W for all observation types). These are based on p-value surfaces computed via Monte Carlo permutation as

ω(z) = (1 + Σ_{i=1}^{M} 1[p̂^(i)(z) ≥ p̂(z)]) / (M + 1),

where p̂ is an estimate of a given type-specific probability surface from the observed data, and p̂^(i) is the ith estimate found following random relabelling of the point types; this is repeated for M = 999 iterates [4,6]. The contours are displayed at a significance level of 1%.
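The permutation scheme for the pointwise p-value surface can be sketched as follows, assuming a user-supplied kernel estimator of the type-specific probability surface; the signature of estimate_surface is hypothetical, and any routine returning a pixel grid would serve.

```python
import numpy as np

def permutation_pvalue_surface(coords, types, target, estimate_surface, M=999, rng=None):
    """Pointwise Monte Carlo p-value surface for departure from uniform
    type probability, obtained by random relabelling of the point types.

    estimate_surface(coords, types, target) -> 2D array: a user-supplied
    kernel estimator of the type-specific probability surface on a pixel grid.
    """
    rng = np.random.default_rng(rng)
    p_obs = estimate_surface(coords, types, target)
    exceed = np.zeros_like(p_obs)
    for _ in range(M):
        perm = rng.permutation(types)  # relabel types at random
        exceed += estimate_surface(coords, perm, target) >= p_obs
    # Rank-based p-values: small where the observed surface is unusually high.
    return (1.0 + exceed) / (M + 1.0)
```

Thresholding the returned surface at 0.01 and tracing the resulting boundary yields tolerance contours of the kind drawn in Figure 10.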
The FIXOS surface is suggestive of oversmoothing: almost no detail is present in the raised area of probability to the south. We capture additional detail through the reduced bandwidth of FIXMULT, though a closer inspection of the contours in the FIXMULT and ADAMULT surfaces reveals that the latter preserves slightly more detail than the former. Crucially, the additional detail in the ADAMULT surface does not appear to have come at the cost of undersmoothing elsewhere in the region, a clear benefit of Abramson adaptation.

Burkitt's lymphoma
Our second example concerns the 188 cases of Burkitt's lymphoma recorded mostly in children in the Western Nile district of Uganda over an approximate 15-year period from 1960 to 1975 [37,38]. In addition to spatial coordinate data, we have the time of onset recorded as the number of days from January 1, 1960. The spatial and temporal margins of the data are plotted in Figure 11. A relatively diffuse yet large aggregation of observations lies at the western border, and over time the number of reported cases appears to increase steadily within the first decade.
To assess structure in the data, we seek kernel density estimates of these spatiotemporal observations as outlined in Section 6, using (18) and (20). The required spatial- and temporal-axis bandwidths are selected using FIXST (19) and ADAST (23) for the fixed and constrained adaptive estimates respectively. We obtain h = 6.641 km and λ = 2.556 years for the former, and h_c = 7.594 km and λ_c = 2.737 years for the latter. The two density estimates are displayed as 3D plots in Figure 12; see the supplementary material for an interactive version of the graphic that can be rotated in a web browser to gain different viewing perspectives. Superimposed upon the observations themselves are 3D contours ('isosurfaces') drawn at α × max[f̂_X,T] for α ∈ {0.5, 0.75, 0.95} in darkening colors on both plots, where f̂_X,T is the fixed bandwidth estimate.
Comparing the two estimates, we see that both depict a bimodal structure via the drawn isosurfaces. The northernmost mode, near the village of Omugo, commences roughly at the fifth year and persists for the remainder of the period with minimal spatial variation, reflected in the 'tubular' appearance of the 3D contour. A higher-density mode situated further south, around the Arua township, manifests after the first decade. In contrast to the fixed estimate, the adaptive estimate yields larger pockets of higher density within these modes, again illustrating the ability of Abramson adaptation to capture more detail in observation-rich subregions of W × D without compromising by undersmoothing elsewhere.

Concluding remarks
Kernel smoothing is a highly flexible tool for the nonparametric analysis of spatial and spatiotemporal point patterns. Its role can range from basic data exploration; to estimation and testing of relative risk and conditional probability surfaces; to capturing deterministic heterogeneity in semiparametric models.
Classical fixed bandwidth estimation, requiring only that a single level of smoothing be set and therefore being relatively easy to implement, suffers somewhat when smoothing highly heterogeneous patterns. Adaptive smoothing is intuitively appealing in many applications, though the need to initialize pilot and global smoothing levels renders the bandwidth selection problem more complex. Focusing on likelihood cross-validation, we have demonstrated the promising performance of Abramson adaptation for various nonparametric smoothing problems. By constraining the pilot and global bandwidths to be equal, we circumvent identifiability problems in automated selection, providing a reliable means by which the two smoothing parameters can be chosen simultaneously in both spatial and spatiotemporal settings. These ideas extend to our proposed adaptive estimation of conditional probability surfaces for multitype patterns, which additionally benefits from symmetrization of the bandwidth factors. Results of both synthetic and real-data analyses show the proposed approach has the capacity to perform rather well when compared to fixed bandwidth estimation.
Likelihood-based bandwidth selectors were subject to scrutiny here due to their ability to handle both standalone density and conditional probability surface estimation. That said, there is no reason to suggest the strategy of constraining pilot and global bandwidths could not also be applied to other kinds of bandwidth selectors for adaptive density estimation; future research could investigate comparative performance in this context. We also anticipate that symmetric constrained adaptive spatiotemporal estimators of conditional probability will yield benefits similar to those demonstrated here in the purely spatial case, though computational expense is a significant hindrance, and this could warrant additional attention. Finally, we consider the further development of Bayesian selection methods, where bandwidths are treated as parameters with priors placed upon them [39,40], to be a worthwhile avenue of pursuit: this could benefit the construction and estimation of inhomogeneous Cox processes for data assumed to reflect both deterministic and stochastic influences.