Analyses_RF_GAMM_comparison.R (6.37 kB)

Global hotspots for soil nature conservation - survey dataset + supporting code

Version 3 2022-07-04, 07:57
Version 2 2022-07-04, 06:45
Version 1 2022-07-04, 06:30
posted on 2022-07-04, 07:57 authored by Carlos GuerraCarlos Guerra

We used composite topsoil samples from global field surveys which were conducted between 2016-2019 following standardized field protocols. This global field survey includes 151 locations from all continents and 23 countries, from which 615 composite topsoil samples were collected, providing a large representation of all climatic and vegetation biomes in the planet (Supplementary Fig. 1). Between three and five composite soil (top ~0-10cm) samples (from 5-10 soil cores) were collected in these locations (ranging between 0.09-0.25 ha) following the protocol described in Maestre et al. (2012).  Environmental predictors were collected from publicly available datasets. Elevation and climatic information for each location was obtained from WorldClim v2 (1 km2 resolution;, including information on climatologies and on the seasonality of temperature and precipitation. Soil pH was determined with a soil pH-meter from a soil-water mix 71. Texture was determined as in Maestre et al. 71 and, in the case of missing information, this was filled using Soilgrid v2 (; as in 16). Information on dominant vegetation (forest, shrublands or grasslands) was obtained as part of the field survey.

The alpha diversity (corresponding to the number of phylotypes) and community dissimilarity (averaged Jaccard distance across samples from presence/absence matrices to account for dissimilarity in phylotypes, measured as ASVs, rather than in their proportions) of archaea, bacteria, fungi, protists and invertebrates was determined using amplicon sequencing technology (Illumina Miseq platform) following the protocol in Delgado-Baquerizo et al. (2019). Soil DNA was extracted using the Powersoil® DNA Isolation Kit (MoBio Laboratories, Carlsbad, CA, USA) according to the manufacturer’s instructions. A portion of the bacterial/archaeal 16S and eukaryotic 18S rRNA genes were sequenced using the 515F/806R and Euk1391f/EukBr primer sets 48–50, respectively. Bioinformatics processing was performed using a combination of QIIME 51, USEARCH 52 and UNOISE3 53,54. Phylotypes (i.e. Amplicon sequence variant; ASVs) were identified at the 100% identity level. The ASV abundance tables were rarefied at 5000 (bacteria via 16S rRNA gene), 100 (archaea via 16S rRNA gene), 2000 (fungi via 18S rRNA gene), 1000 (protists via 18S rRNA gene), and 250 (invertebrates via 18S rRNA gene) sequences/sample, respectively, to ensure even sampling depth within each belowground group of organisms. Protists are defined as all eukaryotic taxa, except fungi, invertebrates (Metazoa) and vascular plants (Streptophyta). Finally, to provide further evidence that 18s rRNA Miseq Sequencing can in this case provide a solid representation of the global patterns in soil-borne mycorrhizal fungi and fungal potential plant pathogens, we compare the global patterns (see mapping method below) in the proportion of soil-borne mycorrhizal fungi and fungal potential plant pathogens determined using 18s rRNA Miseq Sequencing  with the subset of data including ITS PacBio sequencing. To investigate the environmental factors associated with soil biodiversity and services, we first used machine learning Random Forest modeling. We used the R package “rfpermute” to conduct these analyses. We also conduct Spearman correlations to better describe the direction of the relationship between environmental factors and soil biodiversity and services. Next, we used spatially explicit random forest models to predict the distribution of each soil biodiversity and ecosystem service variable. To map each soil biodiversity and ecosystem service variable we used spatially explicit random forest models. For that, we used ArcGIS Pro that estimates random forest models by using an adaptation of the random forest algorithm (a supervised machine learning regression approach) proposed by Breiman et al.. Forest-based regressions were trained based on 90% of the dataset, the remaining 10% of the dataset were used for validation purposes. All models were fitted using 200 decision trees and 1000 runs for validation and fitting. Prior to prediction all variables included in the dataset and the predictors were scaled and predictions were made using a 0.25x0.25 deg. pixel size. Global hotspots were then calculated using a Getis-Ord Gi* spatial clustering method. The Getis-Ord Gi* statistic was calculated for each location (0.25x0.25 deg. pixel) in the dataset. The resulting z-scores were used to estimate if a given location has statistically high or low values and if these values are spatially clustered. Finally, we have implemented a two stage approach to tackle both the assessment of the environmental representation of the soil biodiversity and ecosystem services dataset used and the uncertainty related to the estimation of each variable or group of variables. For this, we calculated the mahalanobis distance in multidimensional space to assess environmental coverage and a second measure to assess the spatial uncertainty of the estimated values for each model. In order to do this analysis, for each soil biodiversity and ecosystem service variable, we calculated 1000 random iterations of each random forest model and estimated the upper and lower 25% quantile of the distribution of values. For the estimation of the spatial uncertainties related to our projections we calculated 1000 random iterations of each random forest model and estimated the upper and lower 25% quantile of the distribution of values. Similarly, in each model run we used 10% of the data (selected randomly) for validation. 


Usage metrics



    Ref. manager