Developing physical surrogates for benthic biodiversity using co-located samples and regression tree models: a conceptual synthesis for a sandy temperate embayment

Marine physical and geochemical data can be valuable surrogates for predicting the distributions and assemblages of marine species. This study investigated the bio-environment (surrogacy) relationships in Jervis Bay, a sandy marine embayment in south-eastern Australia. A wide range of co-located physical data were analysed together with biological data, including multibeam bathymetry and backscatter surfaces and derivatives, parameters that describe seabed sediment and water column physical/geochemical characteristics and seabed exposure. Three decision tree models and a robust model selection process were applied. The models for three diversity indices and seven out of eight infaunal species explained 32–79% of the variance. A diverse range of physical surrogates for biodiversity were identified. The surrogates are presented in a conceptual model that identifies the mechanisms that potentially link to biodiversity patterns. While some surrogates may exert direct influence over organisms to exposure and chlorophyll-a, for example, most pointed to complex relationships between multiple biological and physical factors occurring in different process domains/zones. The reliable bio-environment relationships identified from co-located samples and conceptual models enabled a mechanistic understanding of benthic biodiversity patterns in a sandy coastal embayment that may have implications for marine environmental management.


Introduction
The management of marine ecosystems requires accurate mapping of biodiversity patterns over large areas. The paucity of biological data and the prohibitive cost of directly sampling biota over large areas have led to the development of a mapping strategy based on physical surrogates (Zacharias and Roff 2000). In the marine zone, physical (geochemical) data are generally more readily measured than biological data. These physical variables can act as surrogate measures of biodiversity and be valuable in predicting the potential distributions and assemblages of marine species (McArthur et al. 2010). Importantly, modelling bio-environment relationships can help identify and verify ecological processes that link physical environmental attributes to the distribution of seabed biota (Austin 2002).
Many physical variables have been used as surrogates for marine biodiversity (Zacharias and Roff 2000, McArthur et al. 2010, Huang et al. 2011a) and can be grouped into three types: resource, direct and indirect variables (e.g. Austin and Smith 1989, Guisan and Zimmermann 2000, Austin 2002. Resource variables are consumed by biota. Direct variables exert a direct physical/physiological influence on biota. Indirect variables do not have physiological connections to biota but can correlate strongly with resource and direct variables that do influence biota. The resource and direct variables are preferred inputs for the modelling of bio-environment relationships, because they represent causal links (Austin 2002). The indirect variables, which are often easier to obtain, can also be useful because they may help to identify processes behind biological patterns.
There are many challenges to successfully model bio-environment relationships in a marine context. Data issues such as sample bias, inadequate sample size, missing important covariates and location and attribute errors can have significant impacts on the modelling results (e.g. Barry and Elith 2006, Hernandez et al. 2006, Pearce and Boyce 2006, Philips et al. 2006). In addition, selection of the appropriate spatial/statistical model(s) from the many that are available, optimisation of model performance and identification of potential surrogates from a model remain challenging tasks, despite the rapid development of modelling techniques.
The objectives of this study were to (i) use a modelling approach to identify the physical surrogates that can be used to map benthic biodiversity in a sandy marine embayment and (ii) develop a conceptual model that synthesises and conveys the causal mechanisms behind the surrogacy relationships in a process-orientated framework. In this study, the physical variables identified by the models as important were regarded as surrogates; and the inferred bio-environment relationships were called 'surrogacy relationships'.
A unique feature of this study is that the models were developed using accurately colocated samples so that the derived surrogacy relationships would have greater reliability. Most previous marine ecological studies have relied on interpolated physical variables (e.g. Pitcher et al. 2007, Gogina et al. 2010, Mellin et al. 2010, which introduce some uncertainty into the modelling process/results. We also addressed several of the issues mentioned above through robust data collection and modelling strategies. First, modern marine survey technologies were used for data collection to ensure the highest position and attribute accuracy. Second, a large number of physical/environmental variables were collected and derived to reduce the chance of missing important covariates. Finally, machine learning models and a model selection process were employed to provide an appropriate and robust analysis framework.

Study area
Jervis Bay is situated on the south coast of New South Wales (Figure 1a). The bay entrance is 3.5 km wide, up to 30 m deep and is permanently open to ocean swell from the east to south-east. Jervis Bay is a typical example of a sandy marine embayment common to south-eastern Australia. It has significant ecological, social and economic values and forms part of the Jervis Bay Marine Park (NSW Marine Parks Authority 2010). This study concerns a 3 × 5 km 2 area of seabed situated adjacent to the mouth of the bay, here termed the Darling Road Grid (DRG) (Figure 1; Anderson et al. 2009 CSIRO (1993) and Jacoby et al. (1995). Note: DRG, Darling Road Grid. types identified in the bay (e.g. drift algae, polychaete hummocks, rippled and bioturbated sands, bivalve clumps and heart urchin areas) and supports a diverse fauna and flora (CSIRO 1993, Jacoby et al. 1995Figure 1c). A previous study had used habitat categories and species assemblages as biological surrogates for evaluating marine biodiversity in Jervis Bay (Ward et al. 1999). However, physical surrogates remain unexplored in the area.

Field methods
The DRG seabed was mapped in December 2007 and May 2008 using a Simrad EM3002 300 kHz multibeam sonar system (Kongsberg Maritime, Kongsberg, Norway), which provided bathymetry and seabed backscatter data (Anderson et al. 2009). Two sampling surveys were undertaken (24-28 June and 27-29 August 2008) to collect accurately colocated sediment samples for physical/geochemical and infaunal analyses and to make water column observations (water temperature (WT), photosynthetically available radiation (PAR) and dissolved oxygen (DO) concentrations). A Shipek-style grab was used to collect sediment samples designated for physical/geochemical analyses, and a Van Veen grab was used to collect the biological samples. Stations, with 60 samples, were allocated using a stratified-random spatial sampling design based on the a priori habitat designations (CSIRO 1993; Figure 1b).

Data processing and analytical methods
Laboratory analyses and data post-processing yielded a large number of physical/ environmental variables for use in the machine learning models. These variables (detailed below) and the broad-scale benthic habitat types (CSIRO 1993) constitute the covariates (or predictors) in this study (Table 1).

Multibeam acoustic data and its derivatives
Processing of multibeam bathymetry data was undertaken using Caris Hips and Sips V6.1 software (developed by CARIS, New Brunswick, Canada; http://www. caris.com/products/hips-sips/) to correct for tidal variations and to remove data artefacts. The multibeam backscatter data were processed using CMST-GA MB Process v8.11.02.1 software, a multibeam backscatter processing toolbox co-developed by Geoscience Australia and the Centre for Marine Science and Technology at Curtin University of Technology, Australia. The backscatter processing included correction for transmission loss and ensonification area and removal of the system-implemented model and the angular dependence (Gavrilov et al. 2005). The angularly equalised backscatter strengths were normalised to the backscatter strength at an angle of 25 • (Heap et al. 2009). The bathymetry and backscatter data were gridded at 1 m resolution. Bathymetry (depth) is a potentially powerful physical surrogate for benthic biodiversity (e.g. Gogina et al. 2010, McArthur et al. 2010 due to its relationship with light availability, food availability and WT. The following seabed morphological variables were derived from the bathymetry data: slope, topographic position index (TPI) (Weiss 2001), surface area (SA) (Jenness 2004), topographic relief (TR), local Moran's index (Moran1) (Anselin 1995) and aspect. With the exception of aspect, these variables describe seabed rugosity and heterogeneity that can affect species diversity and abundance (e.g. Kostylev et al. 2005, Holmes et al. 2008, McArthur et al. 2010. For example, Moran1 measures heterogeneity (spatial autocorrelation) of seabed bathymetry values within a local neighbourhood. Backscatter has been associated with substrate composition (Goff et al. 2000, Kloser et al. 2001, Siwabessy et al. 2006) and is also a potential physical surrogate of benthic biodiversity (McArthur et al. 2010). In this study, local Moran's I was also derived from the backscatter data as a texture measure (Moran2). All the morphological variables and texture measures were calculated for four window sizes (3 × 3, 7 × 7, 11 × 11 and 19 × 19 m) to capture the effect of spatial scale (Holmes et al. 2008). Aspect was calculated from a 3 × 3 window only. Aspect can characterise potential exposure when used in conjunction with outputs from an oceanographic model and thus is also a potential surrogate for patterns of benthic biodiversity (Holmes et al. 2008).

Sediment and water column data
A range of sediment properties were derived from laboratory analysis and subsequent post-processing, including mean grain size (MGS), %mud (Mud), %sand (Sand), %gravel (Gravel), %carbonate (CaCO 3 ), sorting, skewness, kurtosis, sediment surface area (SSA), chlorophyll-a (chl-a) concentrations, total sediment metabolism (TSM), total carbon dioxide pools (pTCO 2 ) and total sulphur (TS) and Fe concentrations. These sediment variables can provide information on the nature of the interstitial environment, food availability and redox characteristics (e.g. McArthur et al. 2010, Radke et al. 2011a, 2011b. The analytical methods used to determine sediment size fractions, SSA and the geochemical concentrations are described in Radke et al. (2011a). Summary statistics for sediments were derived from laser grain-size data using the computer program GRADISTAT (Kenneth Pye Associates Ltd, Crowthorne, UK) (Blott and Pye 2001).
The light attenuation coefficient (K d ) between the surface and benthic PAR measurements was calculated from a standard equation (K d = ln(lo/lz)/z), in which lo and lz refer to PAR at the surface and at depth, and z is the water column depth. This water column property and WT and DO concentration were also used for this study as these variables have direct influence over benthic organism distributions (Snelgrove 2001).

Seabed exposure data
Seabed exposure to wave and current energy can play an important role in shaping local biodiversity patterns. According to Connell (1978), the highest biodiversity is likely to occur in areas with intermediate levels of disturbance. A fine-scale hydrodynamic model, the Simulating Waves Nearshore (SWAN) model (Booij et al. 1999, was developed to estimate seabed disturbance for Jervis Bay. The SWAN model estimates wave propagation using the wave action equation (Hasselmann et al. 1973). The model takes into account the refraction of swell waves, plus shoaling, diffraction, dissipation and random waves, but excludes local wave generation by wind. Apart from bathymetry data, the model also required three parameters: significant wave height (H m0 ), peak wave period (T p ) and peak wave direction (θ p ). Twelve combination scenarios of these three parameters were simulated, as summarised in Table 2. The model output of maximum orbital velocity (m s -1 ) was considered as the best estimate of seabed exposure. Their values were combined to generate three seabed exposure variables: averaging all 12 scenarios (Dist_all), averaging eight scenarios with θ p = 160 (Dist_160; from south-east) and averaging four scenarios with θ p = 90 (Dist_90; from east).

Response variables -infaunal data
Biological material retained on a 500 µm sieve (after elutriation in the field) was examined under a dissecting microscope. All animal materials were removed, identified and counted. Three biodiversity indices (total abundance, species richness and Shannon index) were calculated using PRIMER v.6 PRIMER-E Ltd, Ivybridge, UK (Clarke and Warwick 2001) and were used in our analyses. To provide a generality in species responses to benthic habitats, eight infaunal species were selected to represent taxa that were abundant (e.g. >100 individuals) but that differed by phyla and in their feeding mechanisms and life histories (Table 3).

Numerical analyses/modelling 3.3.1. Selection of covariates
Having too many covariates in a model can reduce prediction performance (Huang et al. 2011c). Therefore, we undertook exploratory analyses to reduce the number of covariates in the individual models. First, scatter plots illustrating the relationships between covariates and response variables were generated and fitted with the curve that provided the best explanatory power (indicated by R 2 ). The following five functions were examined, where possible, for least squares fits: linear, polynomial, logarithmic, exponential and power. Covariates with R 2 values greater than 0.05 in the exploratory analysis were deemed suitable for the modelling and were included in the selection. Second, the correlation coefficient (ρ) between a pair of covariates was calculated and ranked. Lower-ranked variables (i.e. with smaller R 2 values) in strongly correlated pairs (|ρ| > 0.75) were removed from the selection. In the final step, the variables habitat and aspect were added to the selection to obtain a set of covariates for the modelling (denoted Set 1 ). It should be noted that Set 1 was different for individual response variables. The results of the exploratory analyses are reported in the supplementary materials available online.

The statistical modelling process
Decision tree models were used for the regression problem because they have (1) the flexibility to handle both continuous and categorical variables; (2) the capability to handle high-dimension (often correlated) explanatory variables; (3) the ability to handle missing data in explanatory variables; (4) the ability to simulate complex non-linear relationships between explanatory and response variables; (5) the ability to fit interactions between explanatory variables; (6) the ability to identify and rank explanatory variables; (7) the ease of interpretation and thus the ability to discover relationships between explanatory and response variables; and (8) robust performance (e.g. De'ath and Fabricius 2000, Shan et al. 2006, Hochachka et al. 2007, Pitcher et al. 2007, Holmes et al. 2008, Pittman et al. 2009). We implemented three decision tree models (single decision tree (DT) (Breiman et al. 1984), boosted decision tree (BDT) (Friedman 1999) and random forest decision tree (RFDT) (Breiman 2001)) using DTREG software (http://www.dtreg. com). Although not shown in this article, these models can also be used to generate prediction maps. The selection of model parameters and covariates both can influence prediction performance. This study used a series of steps to find a 'best' performing model for each response variable. The following were the three main steps in the modelling procedure: (1) Search model parameters. This step used all the covariates in Set 1 . An intensive experimental process was undertaken to search for the combination of model parameters that would obtain the highest prediction D 2 (percent variance explained, see Section 3.3.3). For BDT, we varied four model parameters. The parameter of 'maximum number of trees in series' was varied between 400 and 1000 with an increment of 100; the parameter of 'depth of individual trees' was varied between 4 and 6 with an increment of 1; the parameter of 'minimum size node to split' was varied as either 2 or 3; and the number of cross-validation folds was varied between 2 and 10 with an increment of 1. All other model parameters were set at default. For RFDT, we varied three model parameters. The parameters of 'depth of individual trees' and 'minimum size node to split' were varied as BDT. The 'number of trees' was varied between 100 and 1000 with an increment of 100. DT does not have the 'number of trees' parameter and the three model parameters were varied the same as BDT. In addition, for DT and BDT, pruning (to minimise cross-validated error) was implemented to improve performance and generalisation.
(2) Search explanatory variables. This step used the 'best' combination of model parameters provided by Step 1. In the first iteration, only one covariate was added to the model. This was repeated for all covariates. The variable which provided the highest prediction D 2 was retained for the next iteration; and the D 2 value was recorded. Each of the subsequent iterations selected a different covariate until all covariates had been added to the model. The combination of covariate(s) which achieved the highest prediction D 2 value was selected as the 'best' set of covariates (denoted Set 2 ). (3) Refine the model. The covariates in Set 2 were removed from the model one by one. The variable which, upon removal, gave the highest prediction D 2 was not included in the model. The model was then refined by tuning the model parameters like in Step 1; and the D 2 value was recorded. This procedure was repeated until all variables were removed from the model. The combination of the variables with the highest prediction D 2 was selected as the final combination of covariates (denoted Set 3 ).
The ability to identify and rank the importance of covariates is critical for identifying physical surrogates for biodiversity. The three decision tree models were able to do this by summing the improvement in accuracy gained by each split that used the covariate. The importance score for the most important covariate in Set 3 is set to a value of 100. Other covariates will have lower scores scaled accordingly (Sherrod 2008). In this study, the covariate with an importance score greater than 80 was regarded as a (physical) surrogate.

Assessment of model performance
The statistical performance of the models was evaluated using either a cross-validation procedure (e.g. DT and BDT) or a bagging process (e.g. RFDT) with two statistics: D 2 and mean absolute error (MAE). D 2 , also called R 2 , measures the percent variance explained by the model. MAE is an average of the absolute errors between the prediction and the true values. D 2 values in marine ecological applications generally range from ∼0.3 to 0.7 (e.g. Dzeroski and Drumm 2003, Leathwick et al. 2006, Pitcher et al. 2007, Pittman et al. 2009, Hill et al. 2010. In this study, D 2 values in the ranges 0.3-0.5, 0.5-0.7, 0.7-0.9 and >0.9 are regarded as 'acceptable', 'moderate', 'good' and 'excellent', respectively.

The prediction of response curves
Response curves were constructed to help interpret the bio-environment (surrogacy) relationships identified by the modelling process. We constructed an artificial data set for each identified surrogate. If the surrogate was a continuous variable (e.g. bathymetry), the artificial data set would have 30 rows (data records) and the number of columns (data attributes) equal to the number of covariates. Using the bathymetry variable as an example, the column for the bathymetry attribute has values ranging from its minimum to its maximum sample values with equal increments. The value of any other columns was kept as a constant which was either the mean value (in cases of continuous variables) or the most frequent value (in cases of categorical variables) of the respective covariate. If the surrogate was a categorical variable (e.g. habitat), the artificial data set was given the same number of rows as there were categories (e.g. types of habitat). The values for all other columns were similarly kept as constants. The artificial data set was then passed to the selected model to predict the values for the selected response variable. The values for the surrogate (between minimum and maximum of the sampled values, as the x-axis) were then plotted against the predicted values of the response variable (as the y-axis) to construct a predicted response curve.

Results
The statistical results of the decision tree models are summarised in Table 4. There was no single best performing model amongst the decision trees according to the D 2 and MAE results. However, DT was the best overall model for individual species and BDT was the best performing model for the diversity indices. The statistical modelling performance of Shannon index and species richness was moderate (D 2 between 0.5 and 0.7), while the statistical performance for total abundance was acceptable (D 2 between 0.3 and 0.5). In terms of the eight selected species, the model for Diopatra dentata had good statistical performance (D 2 between 0.7 and 0.9); the models for Rhinocetes robustus and Owenia australis had moderate statistical performance; the models for Nephtys inornata, Mesochaetopterus minutus, Echinocardium cordatum and Unciolidae had acceptable statistical performance; and the model for V. galactites had poor statistical performance (D 2 < 0.3). In general, the predicted response curves (Figures 2 and 3) correspond well with the observed bio-environment relationships ( Figures S1 and S2, supplementary materials). The broad benthic habitat types were found to be the best surrogate for the total abundance of infaunal organisms. K d and DO and TS concentrations may also be useful surrogates (Table 5; Table S2 available online). The predicted response curves (Figure 2a-c) show that higher total infaunal abundance occurs in the bioturbated sand, bivalve clump and polychaete hummock habitats, and locations with higher light availability and low to intermediate total sulphur concentrations. A range of surrogates were found for species richness, including SA_11, Mud, chl-a, Fe, habitat and TSM (Table 5; Table S3 available online). Species richness was also predicted to be highest in bivalve clumps and lowest in drift algae and ripple sands. Interestingly, the results for the food indicators were different. There was a general increase of species richness with chl-a concentrations and a general decrease with total sediment metabolism. For the Shannon index, Moran1_11 was the strongest surrogate followed by Gravel (Table 5, Table S4). Higher Shannon diversity was found in relatively heterogeneous seabed (in terms of bathymetry) with low gravel concentrations. The gravel size fraction in these samples was due to the polychaete tubes and bound debris.
Several physical variables were identified as surrogates for individual species (Table 5,  Tables S5-S12). Here, we only discuss the findings on the four species with the best modelling performance (Table 4). For D. dentata, Dist_160 was consistently identified as the most important physical control (Table 5). The predicted response curve (Figure 3a) shows that D. dentata occurs in areas of intermediate to high seabed exposure. CaCO 3 was consistently the best surrogate of the crustacean R. robustus (Table 5). Another identified surrogate was Dist_160. The predicted response curves (Figure 3b) show that higher numbers of R. robustus are found in sediments with low bulk carbonate content (<10%). Higher abundances of R. robustus also occur in areas of high seabed disturbance (Figure 3c). For N. inornata, an errant predator polychaete, habitat was most important (Table 5). Nephtys inornata occurs in high numbers in the bivalve clump and polychaete hummock habitats (Figure 3e). SA_11 was identified as the most important factor governing the distribution of the polychaete O. australis (Table 5)    response curve (Figure 3d) shows that O. australis is found in high numbers on extremely flat seabed in the south-western part of the bay. The response curves ( Figure 2) are interpreted with reference to a conceptual model that presents the surrogates in a process-based framework (Figure 4). The conceptual model incorporates the broad habitats into three zones. From west to east these include (i) a resuspension-dominated zone; (ii) a bioturbation zone with a sub-zone noted for high M. minutus abundance (Figure 5a); and (iii) a wave-dominated zone. The model was informed by the results of this study (Figure 2), the raw data (Table 6) and a companion study (Radke et al. 2011a), which provided spatial context for a subset of the variables. The approximate spatial distribution of the zones is shown in Figure 5 and was determined by the magnitude of the local Moran's I parameter in relation to our a priori understanding of the broad-scale energetics of the DRG and its bearing on the distribution of benthic habitats and processes (Radke et al. 2011a). Overall, the benthic biodiversity shows strong zonation differences across the study area (Table 6). For example, the resuspension-dominated zone has the lowest species richness, whereas the bioturbation zone has the highest species richness (Figure 5b; Table 6).  Figure 4. Conceptual model of bio-environment relationships and supporting contextual information. The conceptual model has three zones and one sub-zone, which incorporate the broad habitats. From west to east these include (i) a resuspension-dominated zone; (ii) a bioturbation zone with a sub-zone noted for active mound-building by the polychaete Mesochaetopterus minutus; and (iii) a wave-dominated zone.  0.11 ± 0.04 0.14 ± 0.02 0.12 ± 0.02 0.16 ± 0.04

Model performance
The three decision tree models used in this study were well suited for investigating bio-environment relationships. Model performances were at least acceptable for all three diversity indices and for seven out of eight selected infaunal species. The values of D 2 > 0.5 (in 5 of 11 cases; Table 4) are indeed promising for marine biodiversity prediction (e.g. Dzeroski and Drumm 2003, Leathwick et al. 2006, Pitcher et al. 2007, Pittman et al. 2009, Hill et al. 2010. Using co-located physical and biological samples instead of interpolated data minimised positional and attribute errors and ensured the reliability of derived bio-environment relationships. However, the three decision tree models were not consistent in their performance (Table 4). There was no single best performing model, which supports our strategy of using multiple models, as has been shown elsewhere (e.g. Elith et al. 2006, Huang et al. 2011b. The predicted response curves (Figures 2 and 3) had various shapes highlighting that machine learning models are appropriate for simulating complex bio-environment relationships. However, given potential limitations with data and statistical models, the relationships identified by the models are not necessarily true and/or causal. Rather, the success of an ecological modelling process should be judged not only by model statistical performance but also by the biophysical knowledge that the model imparts (Austin 2002) (detailed in Section 5.2). It should be noted that several factors can affect the effectiveness of biodiversity surrogates and that there are other ways of evaluating their effectiveness (e.g. Grantham et al. 2010) than those used in this study.

Conceptual synthesis of surrogacy relationships and causal links
The 'productivity-richness' hypothesis (reviewed in Tittensor et al. (2010)) predicts a positive relationship between species richness and productivity, which is consistent with most of our observations in Jervis Bay. The broad sweep of habitats in the central DRG (e.g. bioturbated sands, bivalve clumps and polychaete hummocks), comprising the bioturbation-dominated zone in our conceptualisation ( Figure 5), had the highest average values for sediment chl-a concentrations, and likewise, the largest predicted values for species richness and total abundance (Figure 2a and h). Indeed, our results are concordant with the findings of Jacoby et al. (1995) in that the predicted species richness was highest in the bivalve clumps and lowest in drift algae and rippled sands (Figure 2h). Species richness increased with chl-a concentrations and decreased with total sediment metabolism (Figure 2f and i), which may imply that richness was supported more by freshly deposited phytoplankton and/or benthic micro-algae than by reactive bulk organic matter which may include fresh phytodetritus as well as other labile factors (e.g. dissolved organic carbon and carbohydrates, lipids, proteins and nucleic acids of plant or animal origin) that are measured as total sediment metabolism. However, highest total sediment metabolism occurred in association with a M. minutus-dominated sub-zone, because polychaetes secrete labile mucoid substances as burrow linings (Kristensen 2000). The M. minutus-dominated sub-zone had relatively low Shannon diversity index and the largest number of organisms of the zones ( Table 6). The low diversity of the M. minutus-dominated sub-zone was reflected in the modelling because the Shannon index was predicted to be low in sediments with high gravel percentages and low local Moran's I values (Figure 2j and k), both of which occurred here. A potential explanation for the modelled decrease in species richness with total sediment metabolism (and %gravel and local Moran's I) is thus that some species were excluded from the polychaete mounds by either direct competition with M. minutus or as an indirect consequence of habitat modification by this species.
Ecosystem engineering (including bioturbation) involves the construction, maintenance and modification of habitats by organisms (Gutiérrez et al. 2002) and includes the modulation of abiotic materials and their fluxes (Berkenbusch and Rowden 2003). While mounds built by the polychaete M. minutus are the most overt features of ecosystem engineering in the DRG, the seabed of the bioturbation zone (and especially the M. minutus sub-zone) is also recognised for its high level of biogeochemical heterogeneity (Radke et al. 2011a). It is likely that the lateral extent of bioturbation activity and/or the preservation of bioturbation features is constrained by wave-dominated processes occurring to the west and east of the bioturbation zone because these seabed environments have less heterogeneity in bathymetry (i.e. lower values for local Moran's I, Figure 5), thus giving definition to the resuspension-and wave-dominated zones (see also Radke et al. 2011a).
It is widely observed that energetic environments support specialised species and lowdiversity communities (Gray 2002), and this is consistent with our results. The rippled sand habitat (comprising the wave-dominated zone in our conceptualisation) is subject to high wave energy from the open ocean (CSIRO 1993) and is predicted to support a fauna that is low in both richness and abundance ( Figure 2a). Likewise, the seafloor of the driftalgae habitat can experience significant reworking by waves under modal wave conditions (resuspension-dominated zone) (Radke et al. 2011a) and is also predicted to have a fauna that is low in richness and abundance. Moreover, the modelled abundances of some species (e.g. D. dentata and R. robustus) were highest in the wave-dominated zone as evidenced by preferences for intermediate to high exposure to wave energy from the south-east direction (Figure 3a and c). The robust mucous/sediment and silk/sediment tubes that these taxa, respectively, build could explain their ability to recruit well to mechanically challenging habitats. Likewise, a predilection for filter feeding (as with R. robustus; Table 3) would be advantageous in areas where strong currents reduce the settling of organic matter.
Our results point to a potential levelling influence of K d and %mud on species abundances and richness (Figure 2b and e), and the response curve for mud (and likewise gravel) concentrations is contrary to the widely held view that biodiversity is higher in sediments with diverse grain sizes (Whitlatch 1977), for example, higher mud (or gravel) concentrations in a sandy matrix. The highest values of these surrogates occur in the resuspension-dominated zone (Table 6), and their potential influences on biodiversity have to be reconciled with consideration of more likely controls such as wave-induced disturbance (resuspension) and sediment anoxia (Radke et al. 2011a). Modelled organism abundances were found to decrease with increasing total sulphur concentrations ( Figure 2c). These concentrations, which were highest in the resuspension-dominated zone (Table 6), reflect the occurrence of an anoxic process called sulphate reduction which can produce pyrite and iron monosulphides (FeS 2 and FeS) from reactive iron. The anoxia was patchy and was caused by degradation of the drift algae and its propensity, when abundant, to inhibit sediment resuspension and the turbulent transfer of oxygen to the seabed (Radke et al. 2011a).

Concluding synthesis
The results of this study are promising, with acceptable to good model statistical performance for the three diversity indices and seven out of eight infaunal species. Using co-located physical and biological samples ensured the reliability of derived bioenvironment relationships. Patterns of abundance and diversity were statistically related to a wide range of different variable types, including physical sediment (e.g. mud, CaCO 3, gravel) and geochemical properties (e.g. Fe, chl-a, TSM), seabed morphometric characteristics (e.g. local Moran's I of bathymetry, rugosity), seabed exposure regime (e.g. Dist_160) and water column characteristics (K d ). Of these, habitat, K d , TS, chl-a. Fe, TSM, SA_11, Moran1_11, mud and gravel could be considered surrogates of biodiversity and/or species patterns.
Past studies suggest that the influences of the resource/direct variables (e.g. the exposure variables and chlorophyll) on biodiversity may be causal (reviewed by McArthur et al. (2010)). However, most surrogacy relationships identified in this study were indirect (e.g. total sediment metabolism, total sulphur, seabed rugosity/heterogeneity, gravel, mud, habitats) and pointed to complex interactions between multiple physical and biological factors. These complex interactions gave definition to different process domains/zones and in some cases produced modelled results that were contradictory to expected (e.g. species richness response to mud content, TSM, Shannon diversity response to gravel content).
While the Shannon diversity and abundance models strongly reflected conditions in the M. minutus hummocks and drift-algae habitats, respectively, the species richness models provided information across study area. The species richness models likely captured various dimensions of beta-diversity whereby species composition changed across space, at various scales, in accordance with the environmental preferences/tolerances of individual species to factors, such as productivity (represented by chl-a concentrations), anoxia (indirectly related to Mud), seabed exposure (indirectly related to habitat type) and competitive exclusion (indirectly related to total sediment metabolism and rugosity).
In conclusion, the modelling procedure used in this study in combination with a comprehensive data set comprising co-located samples shows great potential for both identifying biodiversity surrogates and ecological process mapping. Here we developed a mechanistic understanding of benthic biodiversity patterns in a sandy coastal embayment that may have implications for marine environmental management. In addition, the decision tree models developed can be used to generate the prediction maps of benthic biodiversity and individual species in the study area. These prediction maps can provide important information for the management and monitoring of the Jervis Bay Marine Park.