Estimating monthly air temperature using remote sensing with highly variable topography and scarce monitoring in the southern Ecuadorian Andes
Instituto de Regimen Seccional del Ecuador - Universidad del Azuay, Cuenca (Ecuador).
- Lorena Orellana mlorellana@uazuay.edu.ec
- Daniela Ballari dballari@uazuay.edu.ec
Universidad del Azuay. Facultad de Ciencia y Tecnología.
- Pablo Guzman pguzman@uazuay.edu.ec
Universidad Nacional de Colombia.
- Jesus Efren Ospina jeospinan@unal.edu.co
Objective
This script estimates the spatial distribution of monthly air temperature in the Paute river basin, located in the southern Ecuadorian Andes, using three regression approaches: linear regression, random forest regression, and regression kriging. It evaluates altitude and other auxiliary variables (land surface temperature, latitude and longitude) as predictors.
Models
We use five models to estimate the spatial distribution of the air temperature. The models are:
Based on altitude:
LR-altitude. Simple linear regression model, using in-situ Tair data and altitude.
RF-altitude. Regression random forest model, using in-situ Tair data and altitude.
Based on auxiliary variables (Altitude, LST, latitude, longitude):
LR-aux.variables. Multivariate linear regression, using in-situ Tair data and aux.variables.
RF-aux.variables. Regression random forests, using in-situ Tair data and aux.variables.
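Based on the spatially autocorrelated residuals:
Regression kriging. Ordinary kriging of the regression residuals, added to the regression estimate (reported in the tables as LR-Kriging and RF-Kriging).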
Topics
0- Study area and air temperature monitoring stations
1- Exploratory analysis
2- Selection of auxiliary variables for each month
3- Cross-validation statistics results
4- Spatial distribution of the monthly air temperature
5- Regression Kriging
About the data
The script was originally developed with air temperature data from meteorological stations belonging to the following Ecuadorian institutions: CELEC-hidropaute, INER, INAMHI and ETAPA. Because of data publishing restrictions, we cannot distribute those data with the script. Instead, we generated a random data set of 28 points extracted from the air temperature maps (RF-aux.variables) obtained in our study. The script and the provided data are therefore only illustrative of the applied procedures and do not replicate the results shown in the work "Estimating monthly air temperature using remote sensing with highly variable topography and scarce monitoring in the southern Ecuadorian Andes". Even so, we expect that they can be of further use to the meteorology and cartography community.
This script works with 4 data files.
A vector layer that contains the study area, in this case the Paute river basin (shapefile format).
A data matrix with the following structure:
Dependent variable (first column). In this example, we use a sample of monthly air temperature data in the Paute river basin. A total of 28 points were used for the analysis.
Auxiliary variables (second, third, fourth, and fifth columns). We include land surface temperature (MODIS LST products), elevation (ASTER GDEM), latitude and longitude of the point locations.
Name or code of the points (sixth column).
A rasterStack that contains the auxiliary variables. In this example, the raster stack has a spatial resolution of 500 m (tiff format).
A vector layer containing the location of the points (shapefile format).
Analyzed period
The script was originally tested with monthly average air temperature estimations spanning January 2014 to December 2017. The random data set provided with this script, however, spans only 2017-08-01 to 2017-12-01.
Note
The script stores the data in list structures. Each list position corresponds to one date: the first position holds the first analyzed date, and so on. E.g.: [[1]]=2014-01-01, [[2]]=2014-02-01, ..., [[48]]=2017-12-01.
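The list-by-date layout described above can be sketched as follows. This is a minimal illustration in base R, not the script's own code: the object names `fechas` and `datos_por_fecha` and the placeholder values are hypothetical.

```r
# Hypothetical monthly dates for the illustrative data set (2017-08 to 2017-12).
fechas <- seq(as.Date("2017-08-01"), as.Date("2017-12-01"), by = "month")

# One list element per date. A tiny placeholder data frame stands in for
# each month's real data matrix (values are made up).
datos_por_fecha <- lapply(seq_along(fechas), function(i) {
  data.frame(temperatura = c(12.1, 9.8), altitud = c(2500, 3100))
})

# Naming the elements by date makes the position-to-date mapping explicit.
names(datos_por_fecha) <- format(fechas)

length(datos_por_fecha)     # number of analyzed months
names(datos_por_fecha)[1]   # date stored in position [[1]]
```

Naming the elements is optional, since the script relies on positions, but it makes it harder to confuse an index with the wrong month.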
The study area is the Paute river basin in the southern Ecuadorian Andes. The following map shows the study area and the random air temperature points used.
This section answers the following question: What is the correlation between the auxiliary variables and air temperature?
It evaluates the relationship between Tair and the auxiliary variables with Pearson correlation.
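As a rough sketch of this step, the Pearson correlation between Tair and each auxiliary variable can be computed with base R's `cor()`. The data below are synthetic (generated with an assumed linear dependence on altitude), and the data frame name `datos` is hypothetical; only the column names follow the script.

```r
# Synthetic stand-in for one month's data matrix (all values are made up).
set.seed(1)
n <- 28
altitud <- runif(n, 2000, 4000)                    # elevation in m
lst <- 30 - 0.005 * altitud + rnorm(n, sd = 1)     # land surface temperature
longitud <- runif(n, -79.3, -78.5)
latitud <- runif(n, -3.2, -2.4)
temperatura <- 25 - 0.006 * altitud + rnorm(n, sd = 0.5)
datos <- data.frame(temperatura, altitud, lst, longitud, latitud)

# Pearson correlation between Tair and each auxiliary variable.
correlaciones <- sapply(datos[, -1], function(x) {
  cor(datos$temperatura, x, method = "pearson")
})
round(correlaciones, 2)
```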
This section answers the following question: What are the most effective variables for each regression model?
We evaluated and selected, for each month, the most effective auxiliary variables. In multiple linear regression, we used stepwise regression, and multicollinearity was evaluated through variance inflation factor (VIF).
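A minimal sketch of stepwise selection and a VIF check for one month, using only base R. It assumes AIC-based selection with `step()` in both directions, which the text does not specify, and it computes the VIF by hand from its definition; the helper `vif_manual` and the synthetic data are hypothetical.

```r
# Synthetic stand-in for one month's data matrix (all values are made up).
set.seed(42)
n <- 28
altitud <- runif(n, 2000, 4000)
lst <- 30 - 0.005 * altitud + rnorm(n, sd = 1.5)
longitud <- runif(n, -79.3, -78.5)
latitud <- runif(n, -3.2, -2.4)
temperatura <- 25 - 0.006 * altitud + rnorm(n, sd = 0.5)
datos <- data.frame(temperatura, altitud, lst, longitud, latitud)

# Stepwise selection starting from the full model (assumption: AIC criterion).
modelo_completo <- lm(temperatura ~ altitud + lst + longitud + latitud, data = datos)
modelo_step <- step(modelo_completo, direction = "both", trace = 0)
formula(modelo_step)

# VIF of predictor j = 1 / (1 - R2_j), where R2_j is the R-squared obtained
# by regressing predictor j on all the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    otros <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(otros, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}
vif <- vif_manual(datos, c("altitud", "lst", "longitud", "latitud"))
round(vif, 2)
```

A VIF close to 1 indicates little collinearity. Which threshold counts as too high is a modelling choice that the text does not state.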
To determine the most important auxiliary variables in the random forest regression, we ranked the variables by %IncMSE and used the first three variables that contributed positively to each model.
Auxiliary variables | Number of months |
---|---|
altitud longitud | 2 |
altitud lst longitud | 2 |
altitud lst | 1 |
# Packages used in this chunk.
library(randomForest)  # randomForest(), importance()
library(stringr)       # str_sub()
library(knitr)         # kable()
library(kableExtra)    # kable_styling()

# For each month, fit a random forest and extract the three most important
# auxiliary variables, ranked by %IncMSE.
variables_rf <- lapply(seq_along(matriz_datos), function(i) {
  set.seed(2210)  # fixed seed so the ranking is reproducible
  rf <- randomForest(temperatura ~ altitud + lst + longitud + latitud,
                     data = matriz_datos[[i]], importance = TRUE, mtry = 3)
  # Variables sorted by decreasing %IncMSE.
  ranking <- sort(importance(rf)[, "%IncMSE"], decreasing = TRUE)
  top_vars <- names(ranking)[1:3]  # use 1:4 to include a fourth auxiliary variable
  # Build the model formula, e.g. temperatura ~ altitud + lst + longitud.
  as.formula(paste("temperatura ~", paste(top_vars, collapse = " + ")))
})
# Count in how many months each combination of variables was selected.
var_rf <- as.data.frame(table(sapply(variables_rf, paste, collapse = " ")))
names(var_rf) <- c("Auxiliary variables", "Number of months")
var_rf$`Auxiliary variables` <- str_sub(var_rf$`Auxiliary variables`, 15)  # drop the leading "~ temperatura "
var_rf <- var_rf[order(-var_rf$`Number of months`), ]
kable(var_rf) %>%
  kable_styling(bootstrap_options = "striped")
Auxiliary variables | Number of months |
---|---|
altitud + lst + longitud | 5 |
This section answers the following question: Based on the cross-validation results, which model best estimates monthly air temperature in the Paute basin?
As a first step, we compared only the models based on altitude and on auxiliary variables. We evaluated model performance with leave-one-out cross-validation (LOOCV) and used the following statistics for the selection: root mean square error (RMSE), percentage bias (P-bias) and the Pearson correlation coefficient (r).
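The LOOCV procedure and the three statistics can be sketched in base R for the LR-altitude model. The data are synthetic. The P-bias formula used here, 100 * sum(estimated - observed) / sum(observed), is an assumption: it is a common definition and agrees with the sign convention in this report (positive means overestimation), but the script's exact formula is not shown.

```r
# Synthetic stand-in for one month's data (all values are made up).
set.seed(7)
n <- 28
altitud <- runif(n, 2000, 4000)
temperatura <- 25 - 0.006 * altitud + rnorm(n, sd = 0.5)
datos <- data.frame(temperatura, altitud)

# Leave-one-out: fit on n - 1 points, then predict the held-out point.
estimado <- sapply(seq_len(nrow(datos)), function(i) {
  ajuste <- lm(temperatura ~ altitud, data = datos[-i, ])
  predict(ajuste, newdata = datos[i, , drop = FALSE])
})
observado <- datos$temperatura

# Cross-validation statistics.
rmse <- sqrt(mean((estimado - observado)^2))
pbias <- 100 * sum(estimado - observado) / sum(observado)  # assumed definition
r <- cor(estimado, observado)
round(c(RMSE = rmse, PBIAS = pbias, r = r), 2)
```

Replacing the `lm()` call with another model fit (for instance a random forest) leaves the rest of the loop unchanged.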
The table shows the RMSE values for each model and month evaluated. The RMSE measures the magnitude of the estimation error and allows different models to be compared. The perfect score for this statistic is 0 (scores closer to zero represent better estimation performance).
Date | LR-altitude | LR-aux.variables | RF-altitude | RF-aux.variables |
---|---|---|---|---|
2017-08-01 | 0.94 | 0.97 | 0.30 | 0.27 |
2017-09-01 | 0.96 | 1.00 | 0.33 | 0.25 |
2017-10-01 | 0.87 | 0.90 | 0.32 | 0.29 |
2017-11-01 | 0.94 | 0.90 | 0.23 | 0.26 |
2017-12-01 | 0.90 | 0.90 | 0.32 | 0.29 |
The table shows the P-bias values for each model and month evaluated. Positive and negative values represent overestimation and underestimation bias, respectively. The perfect score for this statistic is 0 (scores with absolute values closer to zero represent better estimation performance).
Date | LR-altitude | LR-aux.variables | RF-altitude | RF-aux.variables |
---|---|---|---|---|
2017-08-01 | 0.3 | 0.3 | -0.7 | -0.7 |
2017-09-01 | 0.3 | 0.3 | -0.6 | -0.5 |
2017-10-01 | 0.2 | 0.2 | -0.7 | -0.5 |
2017-11-01 | 0.2 | 0.2 | -0.4 | -0.4 |
2017-12-01 | 0.3 | 0.3 | -0.6 | -0.7 |
The table shows the r values for each model and month evaluated. Here the statistic measures the strength of the linear relationship between the model estimations and the air temperature observations. Scores range from -1 to 1. The optimal value is 1.
Date | LR-altitude | LR-aux.variables | RF-altitude | RF-aux.variables |
---|---|---|---|---|
2017-08-01 | 0.81 | 0.81 | 0.98 | 0.99 |
2017-09-01 | 0.81 | 0.80 | 0.98 | 0.99 |
2017-10-01 | 0.83 | 0.83 | 0.98 | 0.99 |
2017-11-01 | 0.82 | 0.84 | 0.99 | 0.99 |
2017-12-01 | 0.84 | 0.85 | 0.99 | 0.99 |
The table shows the median and the interquartile range (IQR) of each statistic over the 5 estimated months, for each model.
Variables | Models | RMSE median | RMSE IQR | P-bias median | P-bias IQR | r median | r IQR |
---|---|---|---|---|---|---|---|
Altitude | LR | 0.94 | 0.04 | 0.3 | 0.1 | 0.82 | 0.02 |
Altitude | RF | 0.32 | 0.02 | -0.6 | 0.1 | 0.98 | 0.01 |
Aux.variables | LR | 0.90 | 0.07 | 0.3 | 0.1 | 0.83 | 0.03 |
Aux.variables | RF | 0.27 | 0.03 | -0.5 | 0.2 | 0.99 | 0.00 |
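As a check on how this summary is built, the RMSE medians and IQRs can be recomputed from the monthly RMSE table above with base R. The data frame name `rmse` and the column names are hypothetical; the values are copied from the RMSE table.

```r
# Monthly RMSE values copied from the RMSE table (2017-08 to 2017-12).
rmse <- data.frame(
  LR_altitude = c(0.94, 0.96, 0.87, 0.94, 0.90),
  LR_aux      = c(0.97, 1.00, 0.90, 0.90, 0.90),
  RF_altitude = c(0.30, 0.33, 0.32, 0.23, 0.32),
  RF_aux      = c(0.27, 0.25, 0.29, 0.26, 0.29),
  row.names = c("2017-08-01", "2017-09-01", "2017-10-01", "2017-11-01", "2017-12-01")
)

# Median and interquartile range of the 5 months, per model.
resumen <- data.frame(Median = sapply(rmse, median), IQR = sapply(rmse, IQR))
round(resumen, 2)
```

With R's default quantile method, this reproduces the RMSE columns of the summary table, for example a median of 0.94 and an IQR of 0.04 for LR-altitude.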
The best model for estimating air temperature for each month was determined based on the lowest RMSE.
Variables | Models | Number of months |
---|---|---|
Altitude | LR | 0 |
Altitude | RF | 5 |
Aux.variables | LR | 0 |
Aux.variables | RF | 5 |
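The counts in this table are consistent with comparing LR against RF within each group of predictors. That reading is an assumption inferred from the numbers, not something the text states. The sketch below counts the lowest-RMSE model per month under that assumption, and also across all four models, using the values from the RMSE table; the names `rmse`, `mejor_por_grupo` and `mejor_global` are hypothetical.

```r
# Monthly RMSE values copied from the RMSE table (2017-08 to 2017-12).
rmse <- data.frame(
  LR_altitude = c(0.94, 0.96, 0.87, 0.94, 0.90),
  LR_aux      = c(0.97, 1.00, 0.90, 0.90, 0.90),
  RF_altitude = c(0.30, 0.33, 0.32, 0.23, 0.32),
  RF_aux      = c(0.27, 0.25, 0.29, 0.26, 0.29),
  row.names = c("2017-08-01", "2017-09-01", "2017-10-01", "2017-11-01", "2017-12-01")
)

# Number of months in which each model of a group has the lowest RMSE.
mejor_por_grupo <- function(cols) {
  ganador <- cols[apply(rmse[, cols], 1, which.min)]
  table(factor(ganador, levels = cols))
}
grupo_altitud <- mejor_por_grupo(c("LR_altitude", "RF_altitude"))
grupo_aux <- mejor_por_grupo(c("LR_aux", "RF_aux"))

# Same count, but comparing all four models at once.
mejor_global <- table(factor(names(rmse)[apply(rmse, 1, which.min)],
                             levels = names(rmse)))
grupo_altitud
grupo_aux
mejor_global
```

Within each predictor group, RF has the lowest RMSE in all 5 months. Across all four models, RF-aux.variables is lowest in 4 months and RF-altitude in 1 (2017-11-01).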
This section answers the following question: What is the spatial distribution of the cross validation statistics?
These maps of the cross-validation statistics show in which areas the models perform with higher or lower accuracy.
This section answers the following questions: What is the spatial distribution of air temperature in the Paute river basin? Are there differences in the spatial distribution of air temperature depending on the regression method used?
As an example, here we show the spatial distribution for August 2017.
This section answers the following questions: Is there autocorrelation in the residuals of the regression models? In how many months is regression kriging applicable? In the months with spatial autocorrelation, do the estimates improve using regression kriging?
Regression kriging was used as a method for correcting the regression residuals. The presence of spatial autocorrelation in the residuals was evaluated for each month by visual inspection of the experimental variogram.
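To illustrate what the experimental variogram measures, here is a minimal base R sketch that computes the semivariance of residuals by distance class. It is not the script's implementation, whose variogram objects appear to come from a geostatistics package. The function `variograma_experimental`, the coordinates and the residuals are all hypothetical.

```r
# Synthetic residuals at 28 hypothetical locations (coordinates in km).
set.seed(3)
n <- 28
x <- runif(n, 0, 100)
y <- runif(n, 0, 100)
residuo <- rnorm(n, sd = 0.5)

# Experimental semivariogram: for each distance class, the mean of
# 0.5 * (z_i - z_j)^2 over all point pairs whose separation falls in the class.
variograma_experimental <- function(x, y, z, ancho, n_clases) {
  d <- as.matrix(dist(cbind(x, y)))
  pares <- which(upper.tri(d), arr.ind = TRUE)   # each pair counted once
  h <- d[pares]
  semidif <- 0.5 * (z[pares[, 1]] - z[pares[, 2]])^2
  clase <- cut(h, breaks = seq(0, ancho * n_clases, by = ancho),
               include.lowest = TRUE)            # pairs beyond the last class are dropped
  data.frame(
    dist  = as.vector(tapply(h, clase, mean)),
    gamma = as.vector(tapply(semidif, clase, mean)),
    np    = as.vector(table(clase))
  )
}

vg <- variograma_experimental(x, y, residuo, ancho = 10, n_clases = 6)
vg
```

Semivariance that rises with distance before levelling off suggests spatial autocorrelation. A roughly flat cloud, as expected for the independent residuals simulated here, suggests none.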
Here we show two examples. The first when there is no spatial autocorrelation in the residuals and regression kriging is not applied, and the second when there is positive spatial autocorrelation in the residuals and regression kriging is applied.
Finally, we evaluate whether regression kriging improves the performance of the estimates in the months with autocorrelation in the regression residuals.
Example: November 2017. Using random forest with auxiliary variables
plot(variograms_mrf[[3]])
Example: November 2017. Using linear regression with altitude
plot(variograms_lr[[4]])
Variables | Models | Number of months |
---|---|---|
Altitude | LR | 5 |
Aux.variables | RF | 4 |
Altitude | LR | 5 |
Aux.variables | RF | 1 |
Variables | Models | RMSE median | RMSE IQR | P-bias median | P-bias IQR | r median | r IQR |
---|---|---|---|---|---|---|---|
Altitude | LR | 0.94 | 0.04 | 0.3 | 0.1 | 0.82 | 0.02 |
Altitude | LR-Kriging | 0.9 | 0.03 | 0.4 | 0.1 | 0.86 | 0.02 |
Altitude | RF | 0.32 | 0.01 | -0.65 | 0.1 | 0.98 | 0 |
Altitude | RF-Kriging | 0.275 | 0.02 | 0.05 | 0.12 | 0.985 | 0.01 |
Aux.variables | LR | 0.9 | 0.07 | 0.3 | 0.1 | 0.83 | 0.03 |
Aux.variables | LR-Kriging | 0.86 | 0.07 | 0.3 | 0.1 | 0.86 | 0.03 |
Aux.variables | RF | 0.26 | 0 | -0.4 | 0 | 0.99 | 0 |
Aux.variables | RF-Kriging | 0.2 | 0 | 0.5 | 0 | 0.99 | 0 |
The following maps show the steps of regression kriging for February 2015, using linear regression with auxiliary variables.
The air temperature map from linear regression, plus the ordinary kriging interpolation map of the regression residuals, equals the air temperature map from regression kriging.
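The sum described above can be sketched for a single target location in base R. This is an illustration under stated assumptions, not the script's code: the data are synthetic, the regression uses altitude only, and the exponential semivariogram parameters in `gamma_exp` are chosen by hand rather than fitted. The function `kriging_ordinario` is a hypothetical helper that solves the standard ordinary kriging system.

```r
# Synthetic stations (coordinates in km; all values are made up).
set.seed(11)
n <- 28
est <- data.frame(x = runif(n, 0, 100), y = runif(n, 0, 100),
                  altitud = runif(n, 2000, 4000))
est$temperatura <- 25 - 0.006 * est$altitud + 0.01 * est$x + rnorm(n, sd = 0.3)

# Step 1: regression trend and its residuals.
ajuste <- lm(temperatura ~ altitud, data = est)
est$residuo <- residuals(ajuste)

# Assumed exponential semivariogram (hand-picked parameters, not fitted).
gamma_exp <- function(h, nugget = 0.02, psill = 0.1, rango = 30) {
  ifelse(h == 0, 0, nugget + psill * (1 - exp(-h / rango)))
}

# Step 2: ordinary kriging of the residuals at one location (x0, y0).
# Solves [Gamma 1; 1' 0] [w; mu] = [gamma0; 1] for the weights w.
kriging_ordinario <- function(x0, y0, datos) {
  m <- nrow(datos)
  D <- as.matrix(dist(datos[, c("x", "y")]))
  A <- rbind(cbind(gamma_exp(D), 1), c(rep(1, m), 0))
  d0 <- sqrt((datos$x - x0)^2 + (datos$y - y0)^2)
  pesos <- solve(A, c(gamma_exp(d0), 1))[1:m]
  sum(pesos * datos$residuo)
}

# Step 3: regression kriging estimate = trend + kriged residual.
nuevo <- data.frame(x = 50, y = 50, altitud = 3000)
tendencia <- unname(predict(ajuste, newdata = nuevo))
residuo_ok <- kriging_ordinario(nuevo$x, nuevo$y, est)
temperatura_rk <- tendencia + residuo_ok
c(trend = tendencia, residual = residuo_ok, rk = temperatura_rk)
```

Applying steps 2 and 3 to every raster cell instead of one point would give the three maps described above.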