Long-term precipitation prediction in different climate divisions of California using remotely sensed data and machine learning

ABSTRACT This study presents a novel paradigm for forecasting 12-step-ahead monthly precipitation at 126 California gauge stations. First, satellite-based precipitation time series from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS), TerraClimate, ECMWF Reanalysis V5 (ERA5), and Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks-Climate Data Record (PERSIANN-CDR) products were bias-corrected using historical precipitation data. Four methods were tested, and quantile mapping (QM) performed best. After pre-processing the data, 19 machine-learning models were developed; random forest, extreme gradient boosting (XGBoost), support vector machine, multi-layer perceptron, and K-nearest-neighbours were among the models chosen as the best based on the Complex Proportional Assessment (COPRAS) measure. After hyperparameter adjustment, the Bayesian regularization back-propagation algorithm fused the results: the superior models' predictions served as inputs, with the first step of the target as the label. The next 11 steps at each station followed the same approach, and the fusion models accurately predicted all steps. The 12th step's average Nash-Sutcliffe efficiency (NSE), mean square error (MSE), coefficient of determination (R²), and correlation coefficient (R) were 0.937, 52.136, 0.880, and 0.869, respectively, demonstrating the framework's effectiveness at long forecasting horizons in helping policymakers manage water resources.


Introduction
Precipitation is an essential input to the hydrological models that play a prominent role in water resource management studies (Jiang et al. 2012, Chivers et al. 2020, Moazami and Najafi 2021, Meydani et al. 2022). In this regard, ground-based (gauge) measurements are one of the most commonly utilized sources of precipitation data (Le et al. 2020). However, despite the high accuracy of raingauge measurements, these data suffer from shortcomings such as missing values, wind effects, and limited temporal and spatial coverage (Moges et al. 2022, Fooladi et al. 2023). Moreover, installing suitable raingauges is often impractical in difficult-to-access locations such as mountainous regions. Precipitation radar estimations can be used effectively to overcome these issues, but some of the gauges' bottlenecks, such as limited temporal and spatial coverage and high cost, remain (Gilewski and Nawalany 2018, Lazri et al. 2020, Ghorbanpour et al. 2021).
In recent decades, advances in satellite technology and climate models have made it possible to acquire continuous, high-spatial-resolution precipitation estimations in any region (Pfeifroth et al. 2013, Su et al. 2021, Yu et al. 2021). As a result, various satellite and re-analysis precipitation products have been generated, which can be exploited as a replacement or supplement for raingauge data. Products frequently used in water studies include global precipitation measurement (GPM), integrated multi-satellite retrievals for GPM (IMERG), precipitation estimation from remotely sensed information using artificial neural networks (PERSIANN), the PERSIANN climate data record (PERSIANN-CDR), the PERSIANN cloud classification system (PERSIANN-CCS), TerraClimate, global satellite mapping of precipitation (GSMaP), Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS), the Climate Prediction Center morphing technique with corrected bias (CMORPH), and the European Centre for Medium-Range Weather Forecasts (ECMWF)'s fifth-generation re-analysis product (ERA5) (Alizadeh and Nikoo 2018, Shen et al. 2019, Belayneh et al. 2020, Yeditha et al. 2020, Kumar et al. 2021, Prodhan et al. 2021, Sadeghi et al. 2021, Yu et al. 2021, Fooladi et al. 2023).
Nowadays, climate change has caused widespread changes in the pattern and amount of precipitation and has increased extreme events and the consequent damage they cause (Tabari 2020). Accordingly, precipitation prediction benefits water resource management by enabling flood and drought forecasting while meeting water demand requirements (Ni et al. 2020). Precipitation prediction can be carried out at different forecast lead times, from a few hours to seasonal and long-range forecasting; the World Meteorological Organization (WMO) defines long-range as predicting one month up to two years in advance. While accurate short-term precipitation prediction can help reduce flood risk, long-term precipitation prediction is extremely valuable for applications including long-term water resource planning, climate adaptation, and drought management (Crochemore et al. 2016). Therefore, long-term precipitation prediction is vital in areas experiencing simultaneous and prolonged droughts.
Recent droughts pose significant challenges to California's economy and agricultural production in the US (Mann and Gleick 2015, Gibson et al. 2021). Climate change is expected to exacerbate these conditions, leading to uncertainty in managing water resources (Pathak et al. 2018, Scheuerer et al. 2020, Dong et al. 2021). This is exemplified by the State Water Project making decisions on next year's water allocation to California residents, with the final verdict usually announced at the end of each year (Choubin et al. 2016, He and Gautam 2016). Hence, accurate long-term precipitation forecasts are essential in California to prepare for future crises and make informed decisions about water resources.
Several physical, statistical, and data-driven models are used for long-term precipitation prediction (Farajzadeh and Alizadeh 2018, Xiang et al. 2018, Dikshit and Pradhan 2021). Among these, data-driven techniques have minor computational limitations and correspondingly lower requirements for background knowledge and formulation of the process; therefore, they are gaining popularity for hydrological modelling. Various machine learning and deep learning models have been used in the literature to forecast precipitation (Hung et al. 2009, Nastos et al. 2014, Darji et al. 2015, Kumar et al. 2019, 2020, Pham et al. 2020, Hammad et al. 2021, Han and Morrison 2021). Some of these algorithms are: the Levenberg-Marquardt algorithm-based artificial neural network (LM-ANN) (Mishra et al. 2018), convolutional neural network (CNN), multi-layer perceptron (MLP) (Khan and Maity 2020), decision tree (DT), K-nearest-neighbours (KNN), naive Bayes, support vector machine (SVM) (Rahman et al. 2022), and long short-term memory (LSTM) (Kumar et al. 2020). However, to our knowledge, many robust machine learning (ML) models have not yet been utilized for long-term precipitation prediction in California (Sorjamaa et al. 2007, Nourani et al. 2019, Sulugodu and Deka 2019, Narejo et al. 2021). Filling these knowledge gaps can pave the road for water and environmental engineers in their current practice.
This study proposes a novel framework for long-term precipitation prediction based on comparing the performances of different ML models developed using bias-corrected precipitation data for each station in the different climate divisions of California, USA, together with a fusion model for improving the individual ML models' prediction accuracy. The present study's significant contributions can be summarized as follows: (i) evaluating the performance of four bias-correction techniques for satellite and re-analysis precipitation data in the different climate divisions of California, USA; (ii) analysing the results of 19 ML models using data pre-processed with the principal component analysis (PCA) method for long-term precipitation prediction; (iii) using the Complex Proportional Assessment (COPRAS) model to determine the models with superior performance across all stations; (iv) assessing the sensitivity of the models to variations in hyperparameters compared to the default state; (v) proposing an innovative fusion-based model to obtain an accurate model for long-term precipitation prediction.

Methodology
The proposed framework for long-term precipitation prediction consisted of three main stages: (1) preparation and pre-processing of monthly precipitation data; (2) development of ML models for one-step-ahead (one month) to 12-step-ahead (one year) precipitation prediction; (3) selection of the top six models using the COPRAS method and presentation of the fusion model for each step.
In the first stage, precipitation time series for 126 raingauges in the different climate divisions of California were obtained from the Global Historical Climatology Network-Daily (GHCN-D) dataset and the ERA5, TerraClimate, PERSIANN-CDR, and CHIRPS products. The precipitation products were then bias-corrected using four bias-correction techniques to find the best method. Finally, in this stage, the data were prepared using normalization and PCA pre-processing methods before being used as input to the ML models. In the second stage, 19 ML regressors were developed to predict precipitation 1 to 12 months ahead, and various statistical error indices were applied to assess the models' predictive performance. In the third stage, the COPRAS model was applied to determine the six models with the best performance across all stations. Finally, the results of each step were fused using an artificial intelligence (AI) model to achieve better prediction results. Figure 1 shows the flowchart of the applied datasets and the proposed methodology.

Bias correction technique
Due to their ability to cover all regions, satellite and re-analysis data can provide global estimations of precipitation. However, despite acceptable spatial and temporal coverage, they suffer from shortcomings such as sampling uncertainties, intrinsic errors of non-direct measurement, and retrieval errors. Therefore, bias correction of these data using precipitation measurements from ground stations or radars is essential before their application in hydrological models. For this purpose, four methods to reduce the bias of the precipitation products were implemented in this study: the probability matching method (PMM) (Calheiros and Zawadzki 1987, Gado et al. 2017, Singh et al. 2020), quantile mapping (QM), scaled normal distribution mapping (SDM-normal), and scaled gamma distribution mapping (SDM-gamma) (Jakob Themeßl et al. 2011, Switanek et al. 2017, Ayugi et al. 2020). Further clarification is provided in the Supplementary material, Section S1.
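The core idea of quantile mapping can be sketched as follows: map each product value to the reference value occupying the same empirical quantile. This is a minimal, hypothetical sketch of empirical QM; the exact variant used in the study (and in the cited references) may differ in its treatment of tails and parametric fits.

```python
import numpy as np

def quantile_mapping(product, reference, n_quantiles=100):
    """Empirical quantile mapping: map each product value to the
    reference value at the same quantile (simplified sketch)."""
    quantiles = np.linspace(0, 100, n_quantiles)
    prod_q = np.percentile(product, quantiles)   # product CDF support
    ref_q = np.percentile(reference, quantiles)  # reference CDF support
    # interpolate each product value onto the reference distribution
    return np.interp(product, prod_q, ref_q)

# toy example: a product that overestimates the gauge by a constant factor
rng = np.random.default_rng(0)
reference = rng.gamma(2.0, 30.0, size=500)  # synthetic gauge series (mm/month)
product = 1.5 * reference                   # biased satellite estimate
corrected = quantile_mapping(product, reference)
```

For this purely multiplicative bias, the mapping recovers the reference series almost exactly; with real data the correction only matches the distributions, not individual values.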

Data pre-processing
Data pre-processing is the first step that must be followed, in which raw data are transformed into a format more suitable for an ML model. This can substantially affect the performance and generalization of ML models. Data pre-processing for machine learning comprises several steps for cleaning, structuring, and organizing raw data for input to the models (Kotsiantis et al. 2006). In this paper, precipitation time series from the gauge stations and the four satellite and re-analysis products were analysed, and the pre-processing procedure involved five steps: (i) data cleaning; (ii) data integration; (iii) data transformation and feature scaling; (iv) data reduction; (v) splitting the data into training and testing sets.
In data pre-processing, the initial phase frequently involves cleaning the data to eliminate duplicate entries, outliers, inconsistencies in values, inconsistent units of measurement, and missing values.
The specific steps for data cleaning may vary based on the type of data and the intended analysis. Nevertheless, it is crucial to conduct data cleaning, as it guarantees that the model operates on a high-quality dataset. In dealing with real-world data sources such as precipitation time series, incomplete data is an inevitable problem, necessitating data cleaning. To ensure complete time series for all stations, stations with high numbers of missing values during the target period were eliminated while preserving as many data records as possible. The remaining missing values can be handled with various methods, such as hot-deck imputation, filling in with the most common values, or ignoring null values (Kotsiantis et al. 2006); in this study, missing values were replaced by the average of the corresponding time series. Furthermore, duplicates were removed from all datasets, and the time series were checked for unit and value consistency. Data integration combines multiple datasets from different sources into a single, unified dataset for analysis; here, the goal was to create an integrated matrix for each station that includes the reference data from the GHCN-D dataset and the data from the four precipitation products. The next step is normalization, a pivotal pre-processing technique that standardizes the scale of the data and removes disparities in range and units. Its significance lies in ensuring that all data are treated equitably within machine learning models, which can enhance prediction accuracy, particularly for hydrological time series spanning various scales. Commonly employed normalization techniques include min-max scaling, Z-score scaling, and log transformation (Alawsi et al. 2022). This study applied Z-score normalization (Equation 1) to normalize the monthly precipitation time series (Fu et al. 2021).
y_i^n = (y_i − y_mean) / std(y_i)      (1)

where y_i^n is the normalized monthly precipitation, y_i is the monthly precipitation, y_mean is the mean of the monthly precipitation series, and std(y_i) is its standard deviation.
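Z-score normalization as described above can be expressed in a few lines; this is a generic sketch, not the study's exact implementation:

```python
import numpy as np

def zscore(y):
    """Z-score normalization: subtract the mean, divide by the
    standard deviation, yielding a zero-mean, unit-variance series."""
    return (y - y.mean()) / y.std()

# illustrative monthly precipitation totals (mm)
precip = np.array([12.0, 0.0, 45.5, 88.2, 3.1])
normalized = zscore(precip)
```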
Furthermore, data dimensionality reduction was carried out using PCA, and a comprehensive grid search was performed to find the number of components that leads to better results for the ML models. Finally, the data were split into training and testing sets in order to evaluate the models' generalizability to unobserved data. The training set tunes the models' parameters, while the test set assesses the models' precision on new data. A 70/30 ratio is commonly used for this purpose; therefore, 70% of each station's data was allocated to the training step and 30% to the testing step.

Principal component analysis
PCA is a statistical method that has gained widespread popularity due to its ability to reduce the dimensionality of large datasets effectively while retaining maximum information from the original data. PCA was first proposed by Pearson (1901) and subsequently developed independently by Hotelling (1933). The transformation achieved by PCA is orthogonal, resulting in a set of uncorrelated (and, provided the original variables are normally distributed, independent) variables. The PCA method finds linear combinations of the original variables that explain the greatest amount of variation in the data, with the first few principal components capturing most of the variability of the original dataset (Equation 2). Even in low-dimensional input spaces, PCA has proved to be a valuable technique (Wu et al. 2010, Sumi et al. 2012).
P_i = a_i1 x_1 + a_i2 x_2 + … + a_ip x_p      (2)

where P_i is the i-th principal component, a_ij is the coefficient of the j-th variable in the i-th principal component, and x_j is the j-th variable.
To perform PCA, the data were first centred by subtracting the mean from each dimension. Next, the covariance matrix of the centred data was calculated. The eigenvectors of this covariance matrix correspond to the principal components of the data, and the eigenvalues correspond to the amount of variance explained by each principal component. The centred data could then be projected onto the principal components to obtain a lower-dimensional representation. Equations (3) and (4) describe the mathematical constraints required for an orthogonal PCA transformation resulting in uncorrelated variables; these constraints involve the coefficients used in constructing the new variables as linear combinations of the initial variables (Hu et al. 2007).
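The centre-covariance-eigendecompose-project steps described above can be sketched directly with NumPy (a generic illustration of the standard algorithm, not the study's code):

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigen-decomposition of the covariance matrix:
    centre the data, compute the covariance, take the leading
    eigenvectors, and project the centred data onto them."""
    X_centred = X - X.mean(axis=0)
    cov = np.cov(X_centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort by explained variance
    components = eigvecs[:, order[:n_components]]
    return X_centred @ components, eigvals[order]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                # 200 samples, 8 variables
scores, variances = pca(X, n_components=3)
```

The resulting score columns are mutually uncorrelated, as the orthogonality constraints (Equations 3 and 4) require.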
The objective of this study was to develop a framework for forecasting monthly precipitation up to 12 months ahead. To achieve this, the precipitation data from the preceding n months were used as input to predict the precipitation for a specific 12-month period. Since the input data were derived from four remotely sensed precipitation products, the model input comprised four arrays, each containing n components representing the precipitation values of the preceding months. These four arrays, each consisting of n components, were merged into a single vector with 4 × n components, and this concatenated vector was employed as the input to the PCA.
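The input layout described above can be sketched as follows. This is a hypothetical construction consistent with the description (four product series, n lagged months each, concatenated into one 4 × n vector per sample); the function name and exact windowing are assumptions:

```python
import numpy as np

def build_inputs(products, n_lags, horizon=12):
    """Stack the last n_lags months from each product row into one
    4*n_lags input vector per sample, leaving `horizon` months of
    targets after each window."""
    n_products, T = products.shape
    X = []
    for t in range(n_lags, T - horizon + 1):
        # concatenate the four n_lags-month windows into one vector
        X.append(products[:, t - n_lags:t].ravel())
    return np.array(X)

# 4 products x 60 months of synthetic data
products = np.arange(4 * 60, dtype=float).reshape(4, 60)
X = build_inputs(products, n_lags=6)
```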
This study utilized the grid search approach in the pre-processing step to decide the number of principal components for the input parameters, leading to better precipitation prediction results at each station. For this purpose, the 19 ML models were trained with the XGBoost and Scikit-learn libraries using the default parameters. To obtain the optimal number of input months and the optimal numbers of principal components for the input and output data, a search was performed over this space for the 126 stations and 19 models; in total, 299 250 states were evaluated to find the most optimal solution for each station.
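For one station and one model, the component search reduces to a small loop; this sketch uses a random forest with default parameters and a held-out split, as assumptions standing in for the paper's full 126-station, 19-model search:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def search_components(X, y, component_grid, train_frac=0.7):
    """Grid search over the number of principal components, scoring a
    default-parameter model on a held-out 30% split."""
    split = int(train_frac * len(X))
    best_k, best_mse = None, np.inf
    for k in component_grid:
        pca = PCA(n_components=k).fit(X[:split])
        model = RandomForestRegressor(random_state=0)
        model.fit(pca.transform(X[:split]), y[:split])
        mse = mean_squared_error(y[split:],
                                 model.predict(pca.transform(X[split:])))
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k, best_mse

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 16))
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=120)
k_best, mse_best = search_components(X, y, component_grid=[2, 4, 8])
```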

Developing ML models
After setting the principal components, 19 models for each station were trained using the station's data matrix and the default hyperparameters, with approximately 70% of the data used for training and 30% for testing. Finally, a result matrix containing the models' Nash-Sutcliffe efficiency (NSE), mean absolute error (MAE), and mean square error (MSE) metrics was built. The models were then ranked using the COPRAS method, a multi-criteria decision-making technique proposed by Zavadskas et al. (1994). By assigning negative and positive weights to the evaluation metrics, COPRAS can score different solutions and rank them from ideal to worst (Mardani et al. 2015, Krishankumar et al. 2021, Nematollahi et al. 2022b). In selecting the most suitable machine learning model for precipitation prediction, the COPRAS method is more accurate than other methods since it comprehensively assesses model performance over multiple evaluation criteria, including both test and training metrics. Along with a more objective evaluation process, this approach still allows some degree of subjective assessment based on the researcher's intuition, and it avoids the limitations of traditional single evaluation metrics (Mishra et al. 2020). Furthermore, the COPRAS method helps prevent overfitting by assigning higher weights to test metrics. Proper application of this method results in a smaller set of superior machine learning models for subsequent hyperparameter tuning, reducing computational costs. This study implemented the COPRAS method to rank the ML models and select the six with the highest rankings.
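A standard COPRAS formulation can be sketched as below: sum-normalize the decision matrix, split the weighted criteria into benefit (e.g. NSE) and cost (e.g. MSE, MAE) contributions, and compute each alternative's relative significance. The weights and toy metric values here are illustrative assumptions; the paper's exact weighting is not reproduced.

```python
import numpy as np

def copras(matrix, weights, benefit):
    """COPRAS ranking (standard formulation). `matrix` is
    alternatives x criteria; `benefit[j]` is True for
    higher-is-better criteria, False for cost criteria."""
    norm = matrix / matrix.sum(axis=0)            # sum normalization
    weighted = norm * weights
    s_plus = weighted[:, benefit].sum(axis=1)     # benefit contribution
    s_minus = weighted[:, ~benefit].sum(axis=1)   # cost contribution
    # relative significance: reward large S+, penalize large S-
    q = s_plus + s_minus.sum() / (s_minus * (1.0 / s_minus).sum())
    return q / q.max()                            # utility degree, best = 1

# toy matrix: 3 models x [NSE (benefit), MSE (cost), MAE (cost)]
matrix = np.array([[0.90, 40.0, 4.0],
                   [0.85, 55.0, 5.5],
                   [0.70, 80.0, 7.0]])
weights = np.array([0.5, 0.3, 0.2])
benefit = np.array([True, False, False])
scores = copras(matrix, weights, benefit)
```

Here the first model dominates on every criterion, so its utility degree is 1.0.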

Hyperparameter tuning
Each ML model has two types of parameters: parameters and hyperparameters. Parameters are set in the training stage, describing the relation between the input and output data, while hyperparameters should be set manually before training to specify the model structure. Since hyperparameters affect the overall model accuracy and convergence, they should be selected wisely (Markovics and Mayer 2022). Finding the hyperparameter values that yield the model's best performance is called hyperparameter tuning. However, the search space of hyperparameters that could give the best fit can be vast; therefore, hyperparameter tuning should be carried out in a way that gives optimal results in a reasonable amount of time.
There are mainly two approaches to hyperparameter tuning: manual search and automatic search. In the manual approach, various hyperparameters are examined by hand to obtain the best state, which may be time-consuming, and its success depends on the intuition and experience of engineering experts. The automatic approach overcomes most of these problems. The grid search technique, an automatic method, selects the best hyperparameters by organizing a comprehensive search: the ML model is trained with every combination of the given hyperparameter states on the training dataset, and each combination is evaluated on the testing dataset (Wu et al. 2019).
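With scikit-learn, an exhaustive grid search can be written in a few lines. Note one assumption: `GridSearchCV` scores combinations by cross-validation on the training data, whereas the study evaluated each combination on a held-out testing set; the grid values below are also illustrative only.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# synthetic regression data standing in for a station's feature matrix
X, y = make_regression(n_samples=200, n_features=10, noise=5.0,
                       random_state=0)

# every combination of these states is trained and scored
grid = {"n_estimators": [100, 300], "max_depth": [5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                      scoring="neg_mean_squared_error", cv=3)
search.fit(X, y)
best_params = search.best_params_
```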

Fusion model
Model fusion is a subset of data merging based on the rationale that combining data from various sources yields a more precise result than any single source (Taravatrooy et al. 2018). The technique can be implemented in various forms, such as parallel, series, and mixed. Parallel approaches combine the results of different single models using a weighting or bootstrapping approach. In the series form, the processed results of one model are provided as input to another model. Finally, the mixed approach adopted in this study combines the parallel and series methods to obtain better results (Modaresi et al. 2018). The mixed approach was applied to build the fusion model so as to benefit from each model's advantages and increase the accuracy of precipitation prediction at each step (Munir et al. 2019).
Several methods can be used to build the fusion model, including advanced weighting methods (ORNESS/ORLIKE ordered weighted averaging (OWA) with a genetic algorithm (GA)), the Non-dominated Sorting Genetic Algorithm (NSGA)-II, NSGA-III, and AI-based fusion models. Machine learning models have proved efficient for this purpose (Fooladi et al. 2021, Jafari et al. 2023), so this study utilized an ANN with three different training algorithms - Levenberg-Marquardt, scaled conjugate gradient, and Bayesian regularization - for fusing the superior models. The ANN, an intelligent machine learning algorithm, can learn nonlinearity in data by mimicking the function of a biological neural network. The neural network structure has three neuron layers: input, hidden, and output (Qasem et al. 2019).
A back-propagation neural network (BPNN) is trained by the back-propagation (BP) algorithm, an efficient way of computing gradients through the chain rule, layer by layer, that back-propagates the errors from the output nodes to the input nodes and updates the network parameters (Li et al. 2012). One common problem in training BP neural networks is overfitting, which causes the model to fail to generalize and give accurate predictions. Different approaches can be used to tackle this issue; two general policies are early stopping and regularization (Wei et al. 2013). The Bayesian regularization back-propagation neural network (BRBPNN) combines a standard BP neural network with Bayes' theorem: an extra penalty term is added to the performance function, modifying the sum of squared errors (SSE), which enables better generalization and avoids overfitting, eliminating the need for the validation dataset required by other approaches such as early stopping.
The regularization technique finds a set of weights and biases with smaller values, while the Bayesian method automates the selection of the optimal regularization parameters, all of which makes the BRBPNN superior to conventional neural networks in this regard (Ekong et al. 2019, Gouravaraju et al. 2023). In this study, the Bayesian regularization back-propagation algorithm showed potential as the fusion algorithm in the preliminary assessment. Therefore, efforts were made to find the structure with the best performance: the results of each prediction step from the superior models were fed as inputs to the fusion model, and the first step of the target was selected as the label; this was repeated for the next 11 steps and was done separately for each station.
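The fusion step above amounts to stacking: the superior models' predictions for one forecast step form the input matrix, and the observed precipitation for that step is the label. The sketch below uses scikit-learn's `MLPRegressor` with an L2 penalty (`alpha`) as a stand-in, since Bayesian-regularized training in the trainbr sense is not available in scikit-learn; the data are synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
# synthetic observed precipitation for one forecast step (300 months)
truth = rng.gamma(2.0, 25.0, size=300)
# six "superior models": noisy versions of the truth, one column each
preds = truth[:, None] + rng.normal(0.0, 8.0, size=(300, 6))

# fusion network: predictions in, observation out (70%/30% split)
fusion = MLPRegressor(hidden_layer_sizes=(16,), alpha=1e-2,
                      max_iter=2000, random_state=0)
fusion.fit(preds[:210], truth[:210])
fused = fusion.predict(preds[210:])
```

In the study this fit was repeated for each of the 12 steps and for every station separately.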

Performance analysis
Several metrics were used to evaluate the results and forecasting capability of the models. Bias and mean square error (MSE) indices were utilized for evaluating the bias-correction results, while NSE, correlation coefficient (R), mean absolute error (MAE), coefficient of determination (R²), and MSE were utilized for assessing the results of the prediction and fusion models. Table 1 shows the equations for the evaluated error metrics applied in this study and their related parameter definitions.
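The metrics in Table 1 follow standard definitions, which can be sketched directly (the observed/simulated values below are illustrative):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the model
    error sum of squares to the variance of the observations."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def mse(obs, sim):
    """Mean square error."""
    return np.mean((obs - sim) ** 2)

def mae(obs, sim):
    """Mean absolute error."""
    return np.mean(np.abs(obs - sim))

def r(obs, sim):
    """Pearson correlation coefficient (R); R^2 is its square for a
    least-squares linear fit."""
    return np.corrcoef(obs, sim)[0, 1]

obs = np.array([10.0, 20.0, 30.0, 40.0])
sim = np.array([12.0, 18.0, 33.0, 39.0])
```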

Study region
Located in the western US between approximately 114°13′ and 124°65′W longitude and between 32°51′ and 42°01′N latitude, the State of California was chosen as the study area. California covers an area of 423,970 km², encompassing a variety of topographies (mountainous areas, deserts), land cover types/uses, population characteristics, and meteorological processes (Li et al. 2020, Wei et al. 2021). California's climate varies widely depending on elevation, latitude, and distance from the coast. The area consists of seven climate divisions, in which the duration and intensity of precipitation differ substantially; however, all share the same yearly precipitation cycle, with a wet season and a long dry one (Silverman and Dracup 2000). Figure 2 illustrates California's location and provides a map of its climate divisions.

Raingauge data
In this study, the Global Historical Climatology Network (GHCN-D) was the reference for the precipitation records.
The data were a subset of the monthly precipitation time series available for ground stations in the study area (Menne et al. 2012a, 2012b). This dataset includes daily climate summaries, with precipitation measurements acquired by integrating multiple sources. The records are subjected to quality-assurance checks and updated regularly, making this dataset a valid and reliable source for the comparison and validation of other data, including satellite-based estimations. The dataset was obtained from the National Centers for Environmental Information via the https://www.ncei.noaa.gov website. From this dataset, 126 stations in California were selected to meet three conditions, including that the spatial distribution of the selected stations should cover all seven climate divisions, as shown in Fig. 2.

Satellite and re-analysis data
PERSIANN-CDR, a satellite-based precipitation product, was developed by the Center for Hydrometeorology and Remote Sensing at the University of California, Irvine, CA, USA. This product provides precipitation estimations covering the latitude band 60°N-60°S, with a spatial resolution of 0.25°, from January 1983 (Ashouri et al. 2015). In offering precipitation time series of acceptable length with only minor inconsistencies, PERSIANN-CDR shows excellent potential for use in hydrological and climatological studies (Arvor et al. 2017). Precipitation estimations are obtained using the PERSIANN algorithm and infrared imagery from various Geosynchronous Earth Orbit (GEO) satellites. PERSIANN-CDR data are publicly available at the CHRS Data Portal in NetCDF format (Ashouri et al. 2015).

TerraClimate offers monthly time series of meteorological and water balance variables, including precipitation measurements dating from 1958 (Abatzoglou et al. 2018). This product can serve global-scale ecological and hydrological studies that demand high spatial resolution and sufficient data length. TerraClimate provides fine-resolution (4 km) precipitation estimations that can be downloaded in NetCDF format (Abatzoglou et al. 2018).

CHIRPS is a precipitation product based on ground and satellite observations. It provides precipitation time series spanning 50°N-50°S with 0.05° spatial resolution. Records are available from 1981 at daily to yearly temporal resolutions (Funk et al. 2015). Version 2 of the CHIRPS product was downloaded for this study in NetCDF format (Funk et al. 2015).

Finally, ERA5 is the fifth-generation ECMWF atmospheric re-analysis of the global climate, providing hourly estimations of various atmospheric, land, and oceanic climate variables at a resolution of 0.25° from January 1950 to the present (Hersbach et al. 2018). ERA5, with its superior temporal and spatial resolution compared to previous versions, is available for download in NetCDF format (Hersbach et al. 2018).
In this study, a comprehensive dataset spanning January 1983 to December 2020 was employed. The data were strategically divided into two periods to facilitate the different stages of analysis. Specifically, data from 1983 to 2010 were utilized during the calibration phase of bias correction and the training stage of developing the predictive models. The validation and testing periods exclusively used the more recent data, spanning 2011 to 2020, enabling a robust assessment of the methodology's performance on recent data and ensuring its applicability under current conditions.

Bias correction results
This study evaluated the precipitation products before and after bias correction based on two criteria: bias and MSE. Tables 2 and 3 summarize the results of the bias-correction procedure for the calibration and validation stages. As shown in these tables, although both the QM and SDM-normal techniques performed acceptably in decreasing the precipitation products' errors in both stages, the QM technique performed better in the validation step, so QM was selected as the bias-correction technique for this study. In more detail, in the validation step, the QM method decreased bias by 98.2% and 45.3% and MSE by 52.5% and 27.3% for the ERA5 and CHIRPS products, respectively, bringing the products' precipitation estimations closer to the raingauge measurements. In addition, this method decreased bias and MSE for the TerraClimate product by 81% and 8.3%, respectively. Despite being unable to decrease the bias of the PERSIANN-CDR product, the QM method decreased its MSE by up to 65.8%, resulting in a compelling bias-correction performance for this product. The cumulative distribution functions (CDFs) of the raw and bias-corrected products can be compared with that of the reference dataset to evaluate the effectiveness of bias correction. Figures 3 and 4 present the cumulative distribution functions of the raingauge data and the precipitation products before and after bias correction by the QM technique for all four products in the calibration and validation periods, respectively. As illustrated for the validation step, the cumulative distribution functions of the bias-corrected precipitation products matched well with the associated functions of the raingauges and were more similar in shape, with the reference data being the Global Historical Climatology Network (GHCN) data.
Moreover, as displayed in these figures, the various products performed differently in reporting precipitation, with most overestimating the actual precipitation; for example, in the validation stage, all of the products at the seven stations overestimated precipitation, except the TerraClimate and CHIRPS products at station 7. This was evident as the CDF plots of the raw products, especially PERSIANN-CDR, were shifted to the right, showing a higher amount of precipitation than the reference CDF for a given y value. In addition, there was an evident shift to the left after bias correction, reflecting the effectiveness of the bias adjustment. If a precipitation product underestimated precipitation and the bias-correction method succeeded in correcting this underestimation, the QM method would adjust the lower precipitation values upwards to better match the reference dataset, as was the case in the CDFs of the TerraClimate and CHIRPS products at station 7.
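The rightward shift of an overestimating product's CDF can be verified numerically with an empirical CDF; the synthetic gauge and product series below are illustrative assumptions:

```python
import numpy as np

def ecdf(x):
    """Empirical CDF: sorted values and their cumulative probabilities."""
    xs = np.sort(x)
    return xs, np.arange(1, len(xs) + 1) / len(xs)

rng = np.random.default_rng(4)
gauge = rng.gamma(2.0, 30.0, size=400)  # synthetic reference series
raw = 1.4 * gauge                       # overestimating raw product

x_gauge, p = ecdf(gauge)
x_raw, _ = ecdf(raw)
# for an overestimating product, every quantile sits at or to the
# right of the corresponding gauge quantile
right_shift = bool(np.all(x_raw >= x_gauge))
```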
In addition to examining the shape of the CDF curve, it was essential to evaluate other statistical measures, such as the mean, variance, and error indices, before and after bias correction. This helps to ensure that the bias-correction method effectively reduced any systematic biases or errors in the precipitation data and improved the data's accuracy. Table 4 summarizes the average of the mean precipitation across all stations before and after bias correction, showing that the QM method performed acceptably in matching the statistical characteristics of the products' mean annual precipitation, such as average, minimum, and maximum values, to those of the raingauges.

Spatial evaluation
To understand the spatial distribution of precipitation within the study area, average annual precipitation maps were created for both the calibration and validation periods, as shown in Fig. 5. The maps show that California's mean annual precipitation was highly variable, with most precipitation occurring in the northern parts of the state. In addition, minor differences in average annual precipitation were found between the calibration period and the validation period (2011-2020), as marked in Fig. 5.
Furthermore, since improving the spatial precipitation patterns was one of the purposes of bias-correcting the satellite data, mean annual and monthly maps were prepared for all products before and after bias correction to evaluate its effectiveness in adjusting the spatial distribution of precipitation. Figure 6 presents a sample plot of average annual precipitation for the PERSIANN-CDR product, indicating that the bias-correction process effectively improved the spatial distribution of precipitation. In addition, mean monthly precipitation maps for the winter season of the validation period are provided in Fig. 7, demonstrating the efficacy of QM in correcting biases in the remotely sensed products and improving the spatial distribution of precipitation. Additional plots can be found in the Supplementary material, Section S4 (see Figs S2, S4-S11).
To quantify this improvement, Fig. 8 shows the average improvement (%) in the MSE metric after bias correction of the monthly time series for each station, using coloured circles. The colour and size of each circle represent the QM method's performance in improving the MSE and its value, respectively. Each map in the figure illustrates the performance of the QM method in bias correction for one of the precipitation products. The figure shows that the QM method performed differently across products and climate divisions; for example, it showed far more potential to improve the spatial pattern for the ERA5 and PERSIANN-CDR products than for TerraClimate and CHIRPS.
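The per-station improvement plotted in Fig. 8 amounts to a relative MSE reduction. A small, self-explanatory helper (ours, not the paper's code) makes the quantity concrete:

```python
import numpy as np

def mse_improvement_pct(obs, raw, corrected):
    """Relative reduction (%) in MSE achieved by bias correction at one
    station; positive values mean the correction helped."""
    mse_raw = np.mean((obs - raw) ** 2)
    mse_cor = np.mean((obs - corrected) ** 2)
    return 100.0 * (mse_raw - mse_cor) / mse_raw
```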

Temporal evaluation
Figure 9 depicts the monthly precipitation time series of the products before and after bias correction, along with the ground observations for one station in climate division 1. As shown in the figure, the error variations of the products' time series (orange) relative to the ground station's time series (blue) decreased after bias correction (green). The same plots for one station in each of the other six climate divisions are presented in the Supplementary material, Section S5 (see Figs S12-S17). By producing time series closer to the ground-based measurements, QM proved an applicable and effective method for improving the satellite and re-analysis time series.

The machine learning results
The extensive grid search conducted in this study revealed that using precipitation data from the previous 96 months as input produced the best results for predicting a 12-month window of precipitation across all stations. Thus, to predict the precipitation for a specific 12-month window, the models were trained using precipitation data from the preceding 96 months. As the input data were sourced from four remotely sensed precipitation products, the models were fed four arrays, each consisting of 96 components representing the corresponding monthly precipitation values. To reduce the dimensionality of the input data and improve computational efficiency, PCA was applied, with the objective of extracting a smaller set of components that captures the most significant variance in the data, leading to improved computational efficiency and more accurate predictions. The four arrays of 96 components were concatenated into a single vector of 384 components, which served as the input for PCA. Rather than setting a fixed variance threshold to determine the number of components to retain, this study employed a grid search to identify the optimal number of components: different numbers of components were tested systematically, and the performance of the predictive models was evaluated accordingly.
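The lag construction and component search described above could be organized roughly as follows. This sketch assumes scikit-learn (the paper does not name its toolkit), uses a synthetic 384-feature matrix in place of the real lagged product data, and takes KNN as an arbitrary stand-in downstream model:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 384))  # stand-in for 4 products x 96 lagged months
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=300)  # synthetic target

# Grid-search the number of retained components jointly with the model fit,
# instead of fixing an explained-variance threshold
pipe = Pipeline([("pca", PCA()), ("knn", KNeighborsRegressor())])
search = GridSearchCV(pipe, {"pca__n_components": [10, 30, 60, 120]},
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)
best_n = search.best_params_["pca__n_components"]
```

Putting PCA inside the pipeline ensures the components are refit on each cross-validation fold, so the component count is selected on genuinely held-out data.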
After setting the optimal number of principal components, the models produced better results, confirming the positive impact of the PCA method on ML model performance.
Since there were 19 machine learning models, Taylor plots were used to compare their performance in the testing period (Fig. 10). Taylor plots provide a visual means of comparing multiple models' outputs against a common set of observations, allowing a quantitative assessment of how well each model reproduces the pattern and variability of the observed data, and they help identify which models performed best and which had room for improvement. A higher correlation indicates a higher level of agreement between the observed and simulated data; the correlation decreases as a point moves towards higher sectors of the graph. The standard deviation in Taylor plots represents the spread, or variability, of each model's predicted values: higher values indicate more variability in the model's precipitation predictions, lower values less. Overall, Taylor plots provide insight into how well the models capture the variability of the observed data when predicting the nth step. The position of each model relative to the observed data gives a sense of how well it could predict them: models closer to the reference point combine high correlation with a standard deviation close to that of the observations, indicating they are more accurate. Also, an outlying position signals that the model in question differs from the others in correlation or standard deviation, which may indicate a problem with the model. In these plots, the DT model was positioned separately, far from the other points; this was emphasized more in the plots for the testing period, where the model's position was close to the reference point, but this was not true in the training period (see Supplementary material, Fig. S18), raising concerns about possible overfitting.
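The three quantities a Taylor diagram encodes are linked by a law-of-cosines relation; a small helper (our own, for illustration) computes them from paired observed/simulated series:

```python
import numpy as np

def taylor_stats(obs, sim):
    """Statistics underlying a Taylor diagram: Pearson correlation,
    simulated standard deviation, and centred RMS difference, where
    crmsd**2 = sd_o**2 + sd_s**2 - 2*sd_o*sd_s*r (law of cosines)."""
    r = np.corrcoef(obs, sim)[0, 1]
    sd_o, sd_s = obs.std(), sim.std()
    crmsd = np.sqrt(max(sd_o**2 + sd_s**2 - 2.0 * sd_o * sd_s * r, 0.0))
    return r, sd_s, crmsd
```

A model whose point sits on the reference arc (sd_s = sd_o) with r near 1 has a centred RMS difference near zero.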

Results for the COPRAS model
In this study, the NSE, MSE, and MAE results on the testing and training sets for all 19 models, trained with default hyperparameters, were used to build the decision matrix for the COPRAS model. The weights assigned to the testing- and training-period criteria were 0.001 and 0.0017, respectively. Table 5 displays the results for this technique, showing that the top six models, selected on the basis of their larger weights, were KNN, MLP, SVR, XGB, GBOOST, and RF.
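The COPRAS ranking step can be sketched as follows. This is our own implementation of the standard COPRAS formulation (sum-normalize each criterion, split weighted scores into beneficial and cost parts, then compute the relative significance); the paper's exact decision matrix and weights are not reproduced:

```python
import numpy as np

def copras_rank(matrix, weights, beneficial):
    """COPRAS: rank alternatives (rows) over criteria (columns).
    beneficial[j] is True for higher-is-better criteria (e.g. NSE),
    False for lower-is-better ones (e.g. MSE, MAE)."""
    X = matrix / matrix.sum(axis=0)          # sum-normalize each criterion
    X = X * weights                          # apply criterion weights
    s_plus = X[:, beneficial].sum(axis=1)    # weighted benefit score
    s_minus = X[:, ~beneficial].sum(axis=1)  # weighted cost score
    # Relative significance: benefit score plus a term rewarding low costs
    q = s_plus + s_minus.sum() / (s_minus * (1.0 / s_minus).sum())
    return q / q.max()                       # utility degree, best = 1.0
```

In a toy matrix of three models scored on NSE (beneficial) and MSE (cost), the model with the highest NSE and lowest MSE receives utility 1.0.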

Hyperparameter results
Table 6 displays the results of hyperparameter tuning for the top six models, showing each model's hyperparameters, the default values, and the best selected values. With their default hyperparameters, the SVR, KNN, and MLP models had the best results in this study. In addition, after hyperparameter tuning, the MSE error index in the testing stage improved by 4.9%, 2%, and 5.5% for the GBOOST, RF, and XGB models, respectively. Finally, Tables S1 and S2 in the Supplementary material (Section S7) present the MSE, R, and NSE results of each model in predicting 12 steps ahead after setting the hyperparameters. While the models' performance varied little across forecasting steps, significant variation was observed across stations, with the models performing poorly at a few of them. For instance, the RF model's performance across stations ranged from 0.564 to 0.919 for R2, 1641 to 5.828 for MSE, 0.743 to 0.959 for R, and 0.531 to 0.915 for NSE.
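The per-station scores in these tables rest on standard metric definitions, which for reference read as follows (straightforward NumPy versions, ours for illustration):

```python
import numpy as np

def mse(obs, sim):
    """Mean square error."""
    return float(np.mean((obs - sim) ** 2))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus residual sum of squares over the
    sum of squared deviations of the observations from their mean."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def r_and_r2(obs, sim):
    """Pearson correlation R and coefficient of determination R2 = R**2."""
    r = np.corrcoef(obs, sim)[0, 1]
    return r, r ** 2
```

A perfect prediction yields MSE = 0 and NSE = R2 = 1, while predicting the observed mean everywhere yields NSE = 0, which is the usual skill baseline.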

Fusion model results
This study used a Bayesian regularization model with 12 neurons in its hidden layer to fuse the results. By combining the outputs of the six models, the fusion model for each step could improve the precipitation prediction results. Tables 7 and 8 present the average R, MSE, and NSE values of the fusion model for precipitation prediction at all stations, 1 to 12 steps ahead, during the testing period. In addition, Fig. 11 shows these criteria for the individual ML models and the proposed fusion approach in 12 separate steps during the testing period. Comparing the error indices of the fusion model with those of the individual models confirmed that the fusion model is an effective tool for improving precipitation prediction, particularly with respect to MSE in all steps. The fusion model's results indicate that, compared to the best individual models in the testing period, the average R increased from 0.908 to 0.916, the average MSE decreased from 134.76 to 120.04, the average R2 increased from 0.827 to 0.842, and the average NSE increased from 0.81 to 0.832. Averaged over all stations, the fusion approach decreased the average MSE for 12-step-ahead prediction by approximately 22% during training and 11% during testing relative to the best individual model. The results presented here are for the test period; the equivalent results for the training phase are given in the Supplementary material, Section S8 (see Tables S3 and S4 and Fig. S19).
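The fusion step is a stacking scheme: a small neural network learns to combine the six base models' predictions. As a sketch only, scikit-learn's MLPRegressor with an L2 penalty (alpha) stands in for Bayesian regularization back-propagation, which scikit-learn does not implement, and the base-model outputs are simulated as noisy copies of a synthetic target:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
y = rng.normal(size=400)
# Simulated outputs of six base models: the target plus model-specific noise
base_preds = np.column_stack(
    [y + rng.normal(scale=s, size=400) for s in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8)]
)

# One hidden layer of 12 neurons, as in the paper; the L2 penalty (alpha)
# is our stand-in for Bayesian regularization
fusion = MLPRegressor(hidden_layer_sizes=(12,), alpha=1e-2,
                      solver="lbfgs", max_iter=2000, random_state=0)
fusion.fit(base_preds[:300], y[:300])
pred = fusion.predict(base_preds[300:])

mse_fused = float(np.mean((y[300:] - pred) ** 2))
mse_best = min(float(np.mean((y[300:] - base_preds[300:, j]) ** 2))
               for j in range(6))
```

Because the base models' errors are partly independent, the learned combination can cut the error below that of the best single model, which is the effect reported in Tables 7 and 8.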
In addition to the evaluation metrics presented, we further illustrate the performance of the proposed framework with time series plots for a subset of gauge stations. For clarity and conciseness, we specifically showcase successful (Fig. 12) and challenging (Fig. 13) cases, highlighting stations where the methodology yielded favourable predictions as well as stations where model performance was less satisfactory. Given that precipitation prediction was performed across 126 stations, box plots are provided in addition to the average evaluation metrics. The elements of these whisker plots can be examined to understand the distribution of the metrics over all stations: they reveal the central tendency and variability of the data, as well as potential outliers and trends. Figure 14 illustrates the distribution of the evaluation metrics over all stations for all time steps, with each element of the box plots representing specific information. As expected, the models' performance was lower at a few stations, indicated by dark circles beyond the whiskers' length, a sign of variability in performance between stations. Naturally, a model might not perform equally well over all stations.
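The outlier stations in Fig. 14 are those beyond the standard box-plot whiskers, i.e. outside the Tukey fences at 1.5 times the interquartile range. A small helper (ours, for illustration) flags them:

```python
import numpy as np

def whisker_outlier_stations(scores, station_ids):
    """Flag stations whose metric falls outside the Tukey fences
    (Q1 - 1.5*IQR, Q3 + 1.5*IQR) used by standard box-plot whiskers."""
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [s for s, v in zip(station_ids, scores) if v < lo or v > hi]
```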
Nevertheless, these stations can be examined in more detail for possible reasons for the methodology's poorer performance. The first step is identifying the individual stations and looking for similarities; the locations of the stations whose results fell outside the whiskers' length are plotted in Fig. 15. Furthermore, by looking at the range of the metric boxes across all time steps, the box plots give an acceptable view of both the central tendency and the variability.

Bias correction analysis
In hydrological modelling, the choice of input data significantly impacts the model outcome, so high-quality input data are essential for good results. This is not restricted to data-driven models; even in physics-based rainfall-runoff modelling, for the results to support applicable conclusions, it must be ensured that the input rainfall data reproduce the patterns in the observed precipitation (Tramblay et al. 2016). Therefore, the first part of this study attempted to derive precipitation data of satisfactory quality from remotely sensed data for use as model input. Knowledge of the study area's spatial and temporal precipitation patterns was essential in assessing bias-correction performance. California's climate varies from region to region owing to the state's vast size and varied topography, and the state is divided into seven climate divisions, ranging from the hot, arid deserts in the southeast to the cool, moist coastlines in the northwest. Based on the average annual precipitation maps, most of the state's precipitation fell at a few spots, concentrated mainly in the northern parts. The mean monthly maps created for the separate seasons indicate strong seasonality in precipitation: the winter months, followed by the fall, receive the most precipitation in California.
The results suggest that most methods could correct the mean values to some extent, although QM outperformed the other methods during the testing period. Its superiority can partly be attributed to improving the error in the shape of the distribution as well as correcting the mean and variance. After applying QM, the remotely sensed precipitation products used in this study could reproduce the original precipitation time series, capturing the inter-annual variability, seasonality, and spatial patterns of precipitation, and thus provide an alternative to the gauge observations. This was confirmed by the visual spatial plots and statistical evaluation indices, focusing mainly on the evaluation period. The QM method matched the minimum, maximum, and average values of the products' annual time series to the corresponding values at the gauge stations, indicating its efficiency in reducing systematic bias on monthly and annual scales for stations in the seven climate divisions. Therefore, this method can provide an alternative source of data for water resource studies when data with adequate spatio-temporal resolution are needed or when ground-based raingauge data are unavailable. It should be mentioned that although this method proved efficient, its power in reducing the error indices varies across both stations and precipitation products. In more detail, the QM method performed better for two precipitation products, ERA5 and PERSIANN-CDR, perhaps partly because these products carry a higher amount of systematic error due to their coarser resolution compared to TerraClimate and CHIRPS.

Machine learning analysis
Nineteen ML models were developed for precipitation prediction 12 months ahead at 126 stations. Plotting Taylor diagrams helped compare the models visually, but overlapping points can make quantitative conclusions difficult. To address this, quantitative statistical evaluation metrics and a multicriteria model were used to compare and rank the models, allowing a more accurate comparison. Ranking the models helped select a few with better average performance over all stations and decreased the computational demand of hyperparameter tuning in the next step. Within the hyperparameter tuning of the superior ML models, the grid search approach improved the error indices for precipitation prediction 12 months ahead. SVR, MLP, and KNN had the best results using default hyperparameters; moreover, the remaining three models (XGB, GBOOST, and RF) showed only minor variations when their hyperparameters were changed. Therefore, the models developed in this study were not sensitive to variations in the hyperparameters relative to their default values. Furthermore, the study found that RF was superior to the other models, as indicated by the evaluation metrics. Zhou et al. (2021) found a similar result in their research on long-term monthly rainfall forecasting in China's Yangtze River Delta region, where RF outperformed other machine learning models.
In this study, the models were initially developed without undergoing hyperparameter tuning and were subsequently ranked.This decision was primarily influenced by the limitations of our computational resources, time constraints, and our preference to explore more models.Despite this constraint, our method enabled us to investigate and compare the performance of various algorithms across a diverse set of models, providing valuable insights.We acknowledge the importance of hyperparameter optimization in maximizing the potential of these models and recognize that future studies, with greater computational resources at their disposal, can more thoroughly delve into this aspect.By conducting extensive hyperparameter tuning, future studies can further refine and enhance the performance of the models, potentially producing even better results.The inclusion of hyperparameter optimization can provide more precise and tailored configurations, leading to improved predictive accuracy and robustness in forecasting precipitation patterns.
Within this study, it was shown that reducing the dimensions of the input data with the PCA technique and normalizing the data could positively impact the performance of the ML models, in line with the results of the long-term precipitation study by Diez-Sierra and Del Jesus (2020), who used PCA for dimension reduction of the input data. This was due to the discriminative nature of the models: the six models with better average performance over all stations were KNN, MLP, SVR, XGB, GBOOST, and RF, and since most of them are discriminative, they are less concerned with the precipitation distributions and instead seek patterns in the data. To obtain the best precipitation prediction result, PCA was used to identify the components that explain the most significant amount of variance; as a result, it captured the signal in the precipitation data and omitted the noise. The PCA method could therefore help achieve the most suitable separation by reducing the input dimensions from the 96 previous months to the input months with the highest dispersion.
Furthermore, given that the input variables were precipitation data acquired from various satellite products, it was likely that the products measured similar precipitation aspects, and some of the information from each product might be redundant.Therefore, using PCA could aid in extracting the essential information from each product and facilitate the models to focus on the vital aspects of precipitation while ignoring the redundant information.Moreover, PCA could assist in reducing the impact of diverse noise levels or variability in various products, leading to increased stability and reliability of the models.

HYDROLOGICAL SCIENCES JOURNAL
The models in this study were developed using the direct multi-step forecasting approach, in which each step is predicted directly from the inputs rather than recursively, so predictions at later steps do not suffer from the cumulative effect of error. Moreover, the direct multi-step approach can deal better with non-linear patterns in precipitation data. As a result, the models' performance was not negatively affected by increasing the forecasting horizon, contrary to many earlier studies (Narejo et al. 2021, Salaeh et al. 2022).
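The direct strategy amounts to training one independent regressor per horizon step on the same lag features. A minimal sketch, with Ridge regression as an arbitrary stand-in for the paper's learners and our own function names:

```python
import numpy as np
from sklearn.linear_model import Ridge

def direct_multistep_fit(series, n_lags=24, horizon=12):
    """Direct strategy: one independent model per forecast step, so errors
    do not accumulate recursively across the horizon."""
    X = np.array([series[t - n_lags:t]
                  for t in range(n_lags, len(series) - horizon)])
    models = []
    for h in range(1, horizon + 1):
        y = np.array([series[t + h - 1]
                      for t in range(n_lags, len(series) - horizon)])
        models.append(Ridge().fit(X, y))
    return models

def direct_multistep_predict(models, last_window):
    """Forecast all horizon steps from the most recent lag window."""
    x = np.asarray(last_window).reshape(1, -1)
    return np.array([m.predict(x)[0] for m in models])
```

On a purely seasonal series (period 12), each step-h model simply learns the matching lag, so the whole 12-step forecast is recovered without recursive error growth.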

Fusion discussion
In this study, the fusion models for each step produced better predictions than the individual superior models. Each model predicts future precipitation using a specific technique: for instance, the support vector machine uses the separating hyperplanes it constructs, the KNN model uses the relation of a data point to its neighbours, and the RF method branches the data hierarchically using several decision trees. Since each model has its own capabilities in precipitation prediction, applying any single model could not achieve the best performance; this study therefore obtained better precipitation predictions by combining the models. Overall, combining the outputs of the individual machine learning models through a BBRNN can improve the accuracy and robustness of precipitation prediction by leveraging the strengths of each model and mitigating their weaknesses. Gu et al. (2022) presented similar results: combining base learners for monthly rainfall prediction proved superior to the single machine learning model approach.
The overall performance of the methodology was satisfactory, but the fusion model performed worse at some stations. This is expected when applying machine learning models for prediction, for multiple reasons, mainly data quality or specific local conditions; visualizing the data, however, allows a search for possible causes. In this study, by generating box plots of the results, these stations were identified as those lying outside the whiskers' length, i.e. potential outliers. Simply plotting them showed that more than two thirds were located in the southern parts of the state, while the few others were concentrated in another specific location, the vast Sierra Nevada mountain chain in the northern to mid-eastern part of the state. The Sierra Nevada region is essential for California's water resources, providing a significant share of the state's water supply through snowpack and runoff. Southern California, in turn, is semi-arid, with relatively low precipitation totals and high spatial variability. The specific situations of both regions might account for the observed poorer performance, stemming from the quality of the satellite data over these regions or the lower ability of the machine learning models to distinguish signal from noise in these areas.
In sum, the proposed methodology offered acceptable accuracy for bias correction of four different remotely sensed data sources, which can be used in hydrological studies at monthly and annual scales as an alternative to unavailable observed data. Using high-quality remote sensing and satellite-based measurements to monitor the study area could improve understanding of the region's hydrology and lead to more accurate hydrological studies. Moreover, long-term precipitation forecasts on a monthly scale could provide insights for decision makers in formulating plans to mitigate the adverse effects of the ever-increasing variations in precipitation trends, as well as prolonged droughts in the area. The time scale of the prediction also makes the results applicable to the region's agricultural industry, and reliable prediction of future precipitation amounts is an asset in combating climate change-induced ecosystem vulnerability.

Summary and conclusion
This study proposed a novel AI-based framework for long-term monthly precipitation prediction using time series precipitation data from 126 ground stations and the corresponding data from four remote sensing-based precipitation products. For this purpose, four models were employed for bias correction of the precipitation products; then, to increase the models' predictive capability, the input data were pre-processed and the hyperparameters were tuned using the grid search method. Finally, individual fusion models were proposed for each prediction step. The main findings of the present research are the following: (1) The proposed combination framework could effectively predict precipitation series multiple steps ahead, even at high forecasting horizons. (2) The MCDM selection framework could help choose among numerous machine learning models to optimize the process. (3) The proposed methodology proved effective in predicting long-term precipitation; however, we acknowledge there is room for improvement, especially in predicting extreme values, so this study's findings could serve as a basis for advanced precipitation forecasting. (4) In regions with noticeable climate variability, having more time to develop elaborate strategic adaptations in advance might enhance the quality of decisions that benefit various sectors, including agriculture. To this end, combining the ML models with deep learning algorithms and using different sources of remote sensing-based precipitation products is worth exploring.

Disclosure statement
No potential conflict of interest was reported by the authors.

Figure 1 .
Figure 1. Schematic view of the flowchart of the proposed methodology for long-term precipitation prediction.
(1) Stations should have a precipitation record starting from January 1983. (2) Stations should have the least number of missing values.

Figure 2 .
Figure 2. Location of the study area.

Figure 3 .
Figure 3. The cumulative distribution function (CDF) of the monthly precipitation of the gauge and satellite product estimations for seven stations in the calibration period.

Figure 4 .
Figure 4. CDF of the monthly precipitation of the gauge and satellite product estimations for seven stations in the validation period.

Figure 5 .
Figure 5. Mean annual total precipitation for the reference data (GHCN).

Figure 8 .
Figure 8. Spatial improvement rates regarding averaged MSE of bias-corrected monthly precipitation products for the validation period.

Figure 9 .
Figure 9. The monthly precipitation time series of a station located in climate division 1 for the calibration and validation periods.

Figure 11 .
Figure 11. Evaluation metrics regarding the average for each step: R, MSE, and NSE of the fusion and six single models (test period).

Figure 12 .
Figure 12. Observed vs. predicted precipitation: time series plot of final results.

Figure 13 .
Figure 13. Observed vs. predicted precipitation: time series plot of final results.

Figure 14 .
Figure 14. Box plots of evaluation metrics of the fusion models (test period).

Figure 15 .
Figure 15. Location of the stations that fell outside of the whiskers' length.

Table 1 .
The equations of evaluated error metrics and their related parameters definition.

Table 2 .
The evaluation result of monthly precipitation products before and after bias correction for the calibration period.

Table 3 .
The evaluation result of monthly precipitation products before and after bias correction for the validation period.

Table 4 .
Comparison of averages of the mean, maximum, and minimum annual precipitations of all stations for gauge data and precipitation products before and after bias correction.

Table 5 .
The output ranking of the COPRAS method for machine learning models.

Table 7 .
Evaluation metric results of fusion models for precipitation prediction 1-6 steps ahead (test period).

Table 8 .
The result of evaluation metrics of fusion models for precipitation prediction 7-12 steps ahead (test period).