Physics-guided spatio–temporal neural network for predicting dissolved oxygen concentration in rivers

Abstract The prediction of river water quality is key to water resource management. Data-driven machine learning models have been widely used for predicting river water quality. However, these models seldom consider the physical mechanisms of water quality variation, which degrades the accuracy and stability of the prediction results. Hence, we develop a physics-guided spatio-temporal neural network (PGSTNN) model to predict a critical parameter for water quality assessment, i.e. dissolved oxygen. Physical information regarding spatio-temporal interactions in a hydrological network is explicitly considered in constructing the architecture of PGSTNN. Two physical rules of dissolved oxygen variation (i.e. Henry's law and the power-scaling law) are built into the loss function of PGSTNN to guarantee the physical consistency of the prediction results. Experiments on the 2020-2021 water quality dataset from Atlanta, USA show that PGSTNN outperforms seven baseline neural network models in terms of prediction accuracy and stability, typically reducing the prediction errors (e.g. root mean square error and mean absolute error) by at least 10% relative to the comparison methods. The proposed PGSTNN may not only improve the emergency response capability of water resource management but also provide useful ideas for integrating scientific knowledge with machine learning.


Introduction
Rivers are fundamental sources of freshwater for socioeconomic development and ecosystems (Vörösmarty et al. 2010). River water quality refers to the physical, chemical and biological properties of river water. Typical water quality indicators include dissolved oxygen (DO), temperature, salinity, pH and turbidity (Rice et al. 2012). Owing to the acceleration of urbanization and industrialization, river water pollution has become increasingly severe. The accurate prediction of river water quality can provide a scientific decision-making basis for effective water resource management (Ahmed et al. 2019). For example, predicting water quality indicators is conducive to the early warning of water pollution and guarantees emergency response (Jin et al. 2019). In urban areas, water quality can be predicted to assess drinking water quality such that evidence can be furnished for water supply pre-adjustment (Li et al. 2021). In this study, we aim to predict DO content, which is a critical parameter for water quality assessment (Kannel et al. 2007). DO significantly affects the decomposition of organic matter and the survival of aquatic organisms. Changes in various physical and chemical processes (e.g. pressure changes, photosynthesis and chemical oxidation) and water quality parameters (e.g. biochemical oxygen demand, nitrogen and phosphorus) in the water body affect DO content (Cox 2003). Accurate modeling of changes in DO content can provide government agencies with critical information pertaining to river health and water quality (Kim et al. 2021).
The existing river water quality prediction models can be broadly classified into two categories: physics-based and data-driven models. Physics-based models are typically developed based on known physical laws (e.g. energy and mass conservation) that quantify the relationships between input and output water quality indicators. Several physics-based models are currently available, e.g. the Streeter-Phelps model (Streeter and Phelps 1958), the stream quality model (QUAL) (Roesner et al. 1977), the MIKE flood model (Warren and Bach 1992) and the environmental fluid dynamics code (EFDC) (Hamrick 1996). Although these physics-based models are widely used for predicting river water quality, they present two limitations: (1) prior knowledge regarding certain physical processes is typically incomplete (Clark et al. 2016, Willard et al. 2022) and (2) numerous unknown parameters cannot be estimated easily from the observed data. These limitations typically introduce biases and degrade the accuracy of the prediction results (Lall 2014, Allen and Hoekstra 2015). To overcome these limitations, data-driven models can be used as effective alternatives for predicting river water quality.
Data-driven models directly construct relationships between inputs and outputs based on historical observational data. Machine learning (ML) models have become the most commonly used data-driven models because of their strong nonlinear relationship-modeling capabilities. Classical ML models, such as support vector machines (Li et al. 2020), decision tree-based regression models (Ho et al. 2019), gene expression programming (Ferreira 2001) and simple artificial neural networks (e.g. multilayer perceptrons, feedforward neural networks, radial basis neural networks and online sequential extreme learning machines), have been successfully applied to predict river water quality (Rankovic et al. 2010, Keshtegar and Heddam 2018, Malik et al. 2019, Cao et al. 2021, Zounemat-Kermani et al. 2021). For predicting DO content, existing work has found that gene expression programming usually performs better than simple artificial neural networks (Kisi 2013, Martí et al. 2013). However, classical ML models lack the ability to learn complex features and high-order relationships from high-dimensional and large datasets (Varadharajan et al. 2022). Owing to breakthroughs in artificial intelligence in recent years, deep learning models make it possible to learn complex patterns from massive data and have been introduced for water quality prediction (Wai et al. 2022). Long short-term memory (LSTM) networks and gated recurrent units (GRUs) are widely used to model temporal correlations in water quality time series (Dheda et al. 2022). Graph neural networks (GNNs) and attention mechanisms have been used to model spatial dependence among water quality monitoring stations, and some scholars have combined them with recurrent neural networks (RNNs) to extract spatio-temporal dependence from historical observation data (Liang et al. 2018, Chen et al. 2021, Ni et al. 2023). Although data-driven models, particularly deep learning models, have demonstrated significant potential for water quality prediction, their limitations should not be disregarded. These black-box models only identify the statistical relationships between inputs and outputs without considering the physical mechanism of water quality variation. Therefore, they may yield results that are inconsistent with physical laws (Lazer et al. 2014). Moreover, scientific problems usually suffer from a lack of representative training data, resulting in unsatisfactory generalization ability of data-driven models (Karpatne et al. 2017). In fact, these limitations of purely data-driven models degrade the accuracy and stability of the prediction results.
Physics-guided ML models that embed physical laws have received increasing attention in recent years because they can alleviate the limitations of black-box models (Reichstein et al. 2019, Muther et al. 2022). Two main strategies can be adopted to integrate physical laws into ML models: (1) a physics-guided loss function and (2) a physics-guided design of the architecture. The first strategy models physical laws or partial differential equations associated with the target variables as additional terms in the loss function of the ML model (Jia et al. 2019, Hanson et al. 2020). Although physics-guided loss functions allow ML models to yield more physically consistent predictions and improve their generalization ability, the black-box nature of the models is not fundamentally changed (Karniadakis et al. 2021). The second strategy embeds physical laws into the architecture of ML models by specifying the weights and connections of certain nodes, declaring physical variables and adding intermediate physical layers (Ba et al. 2019, Daw et al. 2020, Muralidhar et al. 2020). Although physics-guided architectures (PGAs) can render black-box ML models more interpretable, designing them remains challenging (Varadharajan et al. 2022).
In this study, we aim to alleviate two main limitations of the existing black-box ML models: (1) prediction results are inconsistent with physical laws; and (2) the generalization ability of the models is unsatisfactory. Motivated by the emerging paradigm of scientific knowledge-guided data science, we develop a new physics-guided spatio-temporal neural network (PGSTNN) for predicting DO content in rivers. A physics-guided loss function and a PGA are integrated to construct PGSTNN. The three main contributions of this study are as follows:

i. Physical information regarding spatio-temporal interactions in a hydrological network is explicitly considered to construct the architecture of PGSTNN, thus improving the accuracy and interpretability of the predicted results.

ii. According to Henry's law (Sander 2015) and the power-scaling law (Kara et al. 2012), two physical rules of DO variation are established for the loss function of PGSTNN to guarantee the physical consistency of the prediction results. This improves the generalizability of the proposed prediction model.

iii. Experimental results on the 2020-2021 water quality dataset from Atlanta, USA show that PGSTNN performs better than seven state-of-the-art neural network models in terms of prediction accuracy and stability. PGSTNN improves the emergency response capability of water resource management.

Study area and river monitoring dataset
The study area is located in Atlanta, Georgia, in the southeast of the United States (Figure 1), and comprises 24 monitoring stations. The DO content in a river is usually affected by two kinds of factors, i.e. natural factors (e.g. water temperature, salinity and river discharge) and biotic factors (e.g. photosynthesis) (Cox 2003). In this study, water temperature was recorded at each station. Although salinity and river discharge were not recorded at each station, we can use conductivity and gauge height as substitute auxiliary variables. Conductivity is a good measure of salinity in water because dissolved salts conduct electrical current; conductivity increases as salinity increases (American Public Health Association 2005). Gauge height and river discharge are also highly correlated (Guven and Aytek 2009). Existing work has found that pH and DO are both affected by photosynthesis (Zang et al. 2011); therefore, we selected pH as an auxiliary variable. The DO content was used as the prediction target variable, and the other four indicators were used as auxiliary variables. We used the five indicators on an hourly scale from 1 January 2020 to 31 December 2021. The river monitoring dataset was provided by the National Water Information System of the U.S. Geological Survey, and the river network dataset was obtained from The National Map Download Client. We extracted information regarding the river flow direction from the river network dataset. The lengths of the river segments between two reachable stations were calculated based on the river network. The values of the variables were normalized to [0, 1].
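The [0, 1] normalization applied to the variables is standard min-max scaling; a minimal sketch follows (the `min_max_normalize` helper is illustrative, not part of the original processing code):

```python
def min_max_normalize(series):
    """Rescale a sequence of observations to [0, 1] (min-max scaling)."""
    lo, hi = min(series), max(series)
    if hi == lo:  # constant series: map everything to 0.0
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]
```

In practice the scaling bounds would be estimated on the training period only and reused on the test period to avoid information leakage.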

Problem formulation
Each station i has a multivariate time series X_i = {x_{i,1}, x_{i,2}, …, x_{i,t}, …, x_{i,T}}, where x_{i,t} is a one-dimensional vector representing the observed values of the five variables at station i at time step t, and T is the length of the input sequence. In this study, we aim to make an N-step prediction of DO content by developing a spatio-temporal deep learning network. We set T to 12 h and predicted the DO content within the next 2, 4, 6 and 8 h. Both the spatial and temporal dependence of the monitoring data are integrated to predict DO content. In contrast, most existing deep learning-based hydrological prediction models only consider temporal dependence; LSTM and GRU are the most widely used network architectures (Varadharajan et al. 2022). In recent years, convolutional neural networks and GNNs have been combined with LSTM or GRU to consider spatio-temporal dependence (Baek et al. 2020, Moshe et al. 2020, Chen et al. 2021, Ni et al. 2023). These network architectures use an end-to-end strategy to implicitly learn the spatial interactions among different stations, and they require a large number of high-quality training samples that are hard to obtain in hydrological prediction. Therefore, these models may be less generalizable when the observation data are sparse and less representative (Jia et al. 2021). Most existing deep learning-based hydrological prediction models use a supervised loss (e.g. the mean square error) to guide the training process. Some recent works have constructed physics-guided loss functions for predicting water temperature and phosphorus dynamics (Jia et al. 2019, Hanson et al. 2020). However, these physics-guided loss functions cannot be used for DO content prediction. Moreover, integrating a PGA and physics-guided loss functions in a deep learning network remains challenging. To overcome the above limitations, we construct an encoder-decoder architecture to couple spatial and temporal scales based on the attention mechanism, and integrate prior knowledge of the river network structure and DO variation to define the connections and loss function of the neural network.
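The input-output pairing described above (a T = 12-step multivariate window predicting the next N DO values) can be sketched as follows; `make_windows` and the convention that DO is the first feature of each vector are illustrative assumptions, not the authors' code:

```python
def make_windows(X, T=12, N=4, step=1):
    """Slice a station's multivariate series X (one feature vector per hour)
    into (input window, N-step DO target) training pairs.
    Assumes the DO value is stored at index 0 of each feature vector."""
    samples = []
    for start in range(0, len(X) - T - N + 1, step):
        inp = X[start:start + T]                                  # T past steps, all variables
        tgt = [X[t][0] for t in range(start + T, start + T + N)]  # next N DO values
        samples.append((inp, tgt))
    return samples
```

With an hourly step size of 1, consecutive samples overlap by T + N − 1 hours, matching the continuous sliding window used later for dataset construction.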

Method
The framework of the proposed model is illustrated in Figure 2. First, PGSTNN captures the temporal relationship of the monitoring data at a single station based on an encoder-decoder architecture (Sutskever et al. 2014). Second, prior knowledge of the river network structure (Genuchten et al. 2013, Tu et al. 2020) is used to design the physics-guided architecture of PGSTNN. Finally, Henry's law and the power-scaling law of DO variation are used to construct a physics-guided loss function for PGSTNN.

Modeling temporal relationship based on an encoder-decoder architecture
To obtain the essential features and model the sequence dependence of the time series, PGSTNN uses two separate GRUs as the encoder and decoder to extract low-dimensional vectors from the original time series. The GRU encoder embeds the input multivariate time series into hidden states, and the GRU decoder uses the hidden states extracted by the encoder at the final time step to predict the subsequent time series.
For each station i, the GRU unit updates the hidden state h_{i,t} at input time step t based on the hidden state h_{i,t−1} at time step t − 1 and the input x_{i,t}:

h_{i,t} = f_G(h_{i,t−1}, x_{i,t})    (1)

where f_G(·) represents a GRU unit. A GRU contains two gates: a reset gate r_{i,t} and an update gate z_{i,t}. r_{i,t} integrates x_{i,t} and h_{i,t−1} to obtain the candidate hidden state h′_{i,t} of the current time step, and z_{i,t} controls forgetting and remembering by weighting h′_{i,t} and h_{i,t−1}. The forward propagation of a GRU unit can be described as follows:

z_{i,t} = σ(W_z · [h_{i,t−1}, x_{i,t}] + b_z)    (2)
r_{i,t} = σ(W_r · [h_{i,t−1}, x_{i,t}] + b_r)    (3)
h′_{i,t} = tanh(W_c · [r_{i,t} ⊙ h_{i,t−1}, x_{i,t}] + b_c)    (4)
h_{i,t} = (1 − z_{i,t}) ⊙ h_{i,t−1} + z_{i,t} ⊙ h′_{i,t}    (5)

where W_z, W_r and W_c are the weight parameters; b_z, b_r and b_c are the biases; σ(·) and tanh(·) are the activation functions; [·, ·] denotes concatenation; and ⊙ represents the Hadamard product. From Eq. (4), h′_{i,t} contains two types of information: (1) the latent representation of the multiple variables at time step t (through W_c acting on x_{i,t}) and (2) the latent representation carried over from time step t − 1 (through r_{i,t} ⊙ h_{i,t−1}). Hence, h_{i,t} contains both the latent representation of the multiple variables at the current time step and the latent representation of the multiple variables from the previous time steps.
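A minimal NumPy sketch of one GRU forward step, following the standard Cho-style update described above; the weight shapes, concatenation convention and random initialization are illustrative assumptions, not the trained model:

```python
import numpy as np

def gru_cell(x_t, h_prev, W_z, W_r, W_c, b_z, b_r, b_c):
    """One GRU step: gates act on the concatenation [h_prev; x_t].
    Shapes: x_t (d_in,), h_prev (d_h,), W_* (d_h, d_h + d_in), b_* (d_h,)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx + b_z)                                      # update gate
    r = sigmoid(W_r @ hx + b_r)                                      # reset gate
    h_cand = np.tanh(W_c @ np.concatenate([r * h_prev, x_t]) + b_c)  # candidate state
    return (1.0 - z) * h_prev + z * h_cand                           # new hidden state

# Encode one T = 12 window with random weights (d_in = 5 indicators, d_h = 8 hidden units).
rng = np.random.default_rng(0)
d_in, d_h = 5, 8
W = {k: 0.1 * rng.standard_normal((d_h, d_h + d_in)) for k in "zrc"}
b = {k: np.zeros(d_h) for k in "zrc"}
h = np.zeros(d_h)
for x in rng.standard_normal((12, d_in)):
    h = gru_cell(x, h, W["z"], W["r"], W["c"], b["z"], b["r"], b["c"])
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, every component of the hidden state stays inside (−1, 1).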
A GRU encoder is employed to encode the input multivariate time series of a station i into hidden states H_i = {h_{i,1}, h_{i,2}, …, h_{i,T}}, where T is the length of the time series. A GRU decoder containing N GRU units is used for the N-step prediction at station i. The first GRU unit receives x_{i,T} as its input and h_{i,T} as the hidden state of the previous time step. The physics-guided hidden states of the GRU decoder, Ĥ^d_i = {ĥ_{i,T+1}, ĥ_{i,T+2}, …, ĥ_{i,T+s}, …, ĥ_{i,T+N}}, can be obtained by combining the hidden states {h_{i,T+1}, h_{i,T+2}, …, h_{i,T+s}, …, h_{i,T+N}} calculated using Eq. (1) with the physical variables reflecting the impact of the neighboring stations (Section 3.2), where s represents the look-ahead horizon. By passing Ĥ^d_i through N linear layers, the predicted DO sequence at station i, Ŷ_i = {ŷ_{i,T+1}, ŷ_{i,T+2}, …, ŷ_{i,T+s}, …, ŷ_{i,T+N}}, can be calculated as follows:

ŷ_{i,T+s} = W ĥ_{i,T+s} + b    (6)

where W is the weight matrix and b is the bias. We used the scheduled sampling method (Bengio et al. 2015) to train the decoder.

PGA of spatio-temporal neural network
The DO content of station S i at time step t is affected by the historical water quality and hydrological information of S i and the upstream stations of S i (Maier and Dandy 1996, Genuchten et al. 2013, Chakraborti 2021).In this study, we explicitly consider the spatio-temporal interactions between upstream and downstream stations to design the architecture of PGSTNN.
First, we construct a directed graph G to describe the flow direction of the river, where the nodes represent the monitoring stations and the edges represent the interplay between stations. Prior knowledge of the river flow velocity v is used to estimate the maximum distance d_max of solute transport between two stations within the input window: d_max = v × T, where T is the length of the input time series. The element A_ij of the graph adjacency matrix A ∈ R^{N×N} is defined as follows:

A_ij = 1 if d_ij ≤ d_max; A_ij = 0 otherwise,    (7)

where d_ij is the shortest path distance between stations i and j. To consider the effect of a station's own historical water quality status, a self-connection is added to each node of G, where G represents the physical relationships among the different stations. Subsequently, a GRU encoder is employed to encode the input sequences of the stations into hidden states (Section 3.1). For the GRU decoder, we developed a physics-guided spatio-temporal graph to quantify the spatio-temporal interactions among different stations; taking Figure 3 as an example, we identify the spatio-temporal neighbors of each station from G. To obtain the predicted value of station i at time step T + t, the spatio-temporal graph attention between all the historical hidden states of the neighboring station(s) and the current hidden state h_{i,T+t} (calculated using Eq. (1)) is computed. For t ∈ [1, T], the query item, key item and value item of station i at time step t are all the hidden state h_{i,t}. For t ∈ [T + 1, T + N], the query item of station i at time step t is h_{i,t}, while the key item and value item are the physics-guided hidden state ĥ_{i,t}. For the first GRU unit in the decoder, the similarity coefficients {e^1_{ij,T+1}, e^2_{ij,T+1}, …, e^T_{ij,T+1}} between the current hidden state h_{i,T+1} and all the historical hidden states {h_{j,1}, h_{j,2}, …, h_{j,T}} of a neighboring station j can be computed as follows:

e^m_{ij,T+1} = cos(h_{i,T+1}, h_{j,m}), m = 1, 2, …, T,    (8)

where cos(·) represents the cosine similarity.
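The adjacency construction above can be sketched as follows, assuming a precomputed matrix of shortest-path river distances in kilometres, with `None` marking station pairs that are not connected along the flow direction (an illustrative convention):

```python
def build_adjacency(dist, v=2.0, T=12):
    """A_ij = 1 when station j can reach station i within the input window,
    i.e. the along-river shortest-path distance d_ij <= d_max = v * T;
    self-connections keep each station's own history."""
    d_max = v * T  # e.g. 2 km/h * 12 h = 24 km, as in the study area
    n = len(dist)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or (dist[i][j] is not None and dist[i][j] <= d_max):
                A[i][j] = 1
    return A
```

Because the distance matrix follows the flow direction, the resulting adjacency is asymmetric: downstream stations attend to upstream ones but not vice versa.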
The physics-guided spatio-temporal attention weight a^m_{ij,T+1} between h_{i,T+1} and h_{j,m} can be calculated as

a^m_{ij,T+1} = exp(e^m_{ij,T+1}) / Σ_{j∈U_i} Σ_{n=1}^{T} exp(e^n_{ij,T+1}),    (9)

where U_i denotes the stations that neighbor station i in a spatio-temporal manner.
We aggregate the hidden states from the spatio-temporal neighboring stations of station i into a physical intermediate variable q_{i,T+1}. Based on the graph G, only the hidden states from the upstream stations within the maximum reachable distance can be aggregated; therefore, q_{i,T+1} represents the effect of the upstream stations of station i and the historical status of station i. q_{i,T+1} can be calculated as follows:

q_{i,T+1} = Σ_{j∈U_i} Σ_{m=1}^{T} a^m_{ij,T+1} h_{j,m}    (10)

The physics-guided hidden state ĥ_{i,T+1} is then obtained by combining h_{i,T+1} with q_{i,T+1}; that is, h_{i,T+1} is updated to ĥ_{i,T+1}. The predicted value of station i at time step T + 1, ŷ_{i,T+1}, can be obtained using Eq. (6), and ĥ_{i,T+1} is input as the new hidden state into the next GRU unit for multi-step prediction.
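The cosine-similarity attention and weighted aggregation described above can be sketched as follows; flattening all neighbors' historical hidden states into a single softmax is our reading of the normalization, not the authors' released implementation:

```python
import numpy as np

def st_attention(h_query, neighbor_hiddens):
    """Spatio-temporal attention sketch: cosine similarity between the current
    hidden state and every historical hidden state of the spatio-temporal
    neighbors, softmax-normalized, then a weighted sum giving the physical
    intermediate variable q."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    keys = [h for history in neighbor_hiddens for h in history]  # stations x time, flattened
    e = np.array([cos(h_query, k) for k in keys])                # similarity coefficients
    a = np.exp(e) / np.exp(e).sum()                              # attention weights
    q = sum(w * k for w, k in zip(a, keys))                      # aggregated neighbor effect
    return q, a
```

The weights sum to one, so q is a convex combination of the neighbors' hidden states and stays on the same scale as the query state it is combined with.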

Physics-guided loss functions of spatio-temporal neural network
In this study, we integrate the physical law relating the auxiliary variables (i.e. temperature, conductivity, gauge height and pH) to DO content and the physical law of DO variation into the spatio-temporal neural network to obtain more physically consistent predictions. Henry's law is a fundamental law of physical chemistry that describes the relationship among gas solubility, temperature and pressure (Sander 2015); we use it to constrain the physical relationship between the auxiliary variables and DO content. The power-scaling law is a general ecological principle based on data derived from natural systems (Hanson et al. 2020); it indicates that the power at lower frequencies exceeds that at higher frequencies (Kara et al. 2012). In this study, we use the power-scaling law to model the physical law of DO variation. We define two physics-guided loss functions based on Henry's law and the power-scaling law.

Physics-guided loss functions based on Henry's law
According to Henry's law, at a constant atmospheric pressure, the DO solubility can be expressed as a function of water temperature and salinity as follows (Truesdale et al. 1955, Cox 2003):

C_s = f(T, S),    (12)

where C_s is the DO solubility, T is the temperature and S is the salinity. Based on Eq. (12), the DO content decreases with increasing temperature and salinity. Salinity and conductivity are positively correlated with each other (Poisson 1980). Owing to the absence of salinity observations in the dataset, we use conductivity as a substitute for salinity. Based on the above analysis, we define two monotonic relationships as follows:

1. Monotonic relationship between DO content and temperature: if Tp^{t+1}_i − Tp^t_i > 0, then ŷ^{t+1}_i − y^t_i < 0.
2. Monotonic relationship between DO content and conductivity: if C^{t+1}_i − C^t_i > 0, then ŷ^{t+1}_i − y^t_i < 0.

Here, y^t_i is the observed DO content at station i at time step t; ŷ^{t+1}_i is the predicted DO content at time step t + 1; Tp^t_i and Tp^{t+1}_i represent the observed temperature at station i at time steps t and t + 1, respectively; and C^t_i and C^{t+1}_i represent the observed conductivity at station i at time steps t and t + 1, respectively. Based on the two monotonic relationships above, we define two physics-guided loss functions as follows:

Loss_T = (1 / (N_S N_T)) Σ_{i=1}^{N_S} Σ_{t=1}^{N_T} ReLU((ŷ^{t+1}_i − y^t_i)(Tp^{t+1}_i − Tp^t_i))    (13)
Loss_C = (1 / (N_S N_T)) Σ_{i=1}^{N_S} Σ_{t=1}^{N_T} ReLU((ŷ^{t+1}_i − y^t_i)(C^{t+1}_i − C^t_i))    (14)

where ReLU(·) is the linear rectification activation function, N_S is the number of stations and N_T is the number of samples obtained from a station in the training set.
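The two monotonicity penalties (Eqs. (13) and (14)) can be sketched as follows, assuming each penalty is a ReLU on the product of the predicted DO change and the observed driver change, a common physics-guided formulation; the paper's exact discretization may differ:

```python
def henry_losses(y_obs, y_pred, temp, cond):
    """Penalize predictions whose DO change has the same sign as the
    temperature / conductivity change (Henry's law: DO falls as either rises).
    y_obs[t]: observed DO at step t; y_pred[t]: prediction for step t + 1."""
    relu = lambda v: max(v, 0.0)
    n = len(y_obs) - 1
    loss_T = sum(relu((y_pred[t] - y_obs[t]) * (temp[t + 1] - temp[t])) for t in range(n)) / n
    loss_C = sum(relu((y_pred[t] - y_obs[t]) * (cond[t + 1] - cond[t])) for t in range(n)) / n
    return loss_T, loss_C
```

The product is positive only when the prediction violates the monotonic relationship, so physically consistent predictions contribute zero loss.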

Physics-guided loss function based on power-scaling law
According to the power-scaling law, changes in DO content on a short timescale (e.g. hourly) tend to be less significant than those on a long timescale (e.g. daily). Therefore, short-term changes in the water quality indicators should be small. Based on this prior knowledge, and to prevent the model from generating abnormally high or low predicted values, we define another physics-guided loss function as follows:

Loss_G = (1 / (N_S N_T)) Σ_{i=1}^{N_S} Σ_{t=1}^{N_T} [ReLU(ŷ^{t+1}_i − ŷ^t_i − D_up) + ReLU(D_low − (ŷ^{t+1}_i − ŷ^t_i))]    (15)

We calculate the first differences of the DO observations in the training set and then estimate the upper bound D_up and lower bound D_low of the confidence interval of the mean at a given confidence level.
The loss function for training PGSTNN can be defined by combining the supervised loss with the physics-guided loss functions (Eqs. (13)-(15)) as follows:

Loss = Loss_MSE + λ_M (Loss_T + Loss_C) + λ_G Loss_G

where Loss_MSE is the supervised mean square error of the model, and λ_M and λ_G are the weight parameters of the physics-guided loss functions based on Henry's law and the power-scaling law, respectively. The two monotonic relationships between DO content and temperature and conductivity are both derived from Henry's law (Eq. (12)); therefore, we use the same λ_M for both conductivity and temperature. In this study, we use grid search on the training data to find the optimal values of λ_M and λ_G (λ_M ∈ (0, 1], λ_G ∈ (0, 1]).
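The bound penalty (Eq. (15)) and the combined objective can be sketched as follows; the exact bound form and the default weights `lam_M = lam_G = 0.5` are illustrative assumptions within the stated (0, 1] search range:

```python
def power_scaling_loss(y_pred, d_low, d_up):
    """Penalize hourly first differences of the predicted DO series that fall
    outside the confidence interval [d_low, d_up] estimated on training data."""
    relu = lambda v: max(v, 0.0)
    diffs = [y_pred[t + 1] - y_pred[t] for t in range(len(y_pred) - 1)]
    return sum(relu(d - d_up) + relu(d_low - d) for d in diffs) / len(diffs)

def total_loss(mse, loss_T, loss_C, loss_G, lam_M=0.5, lam_G=0.5):
    """Supervised MSE plus the weighted physics-guided terms."""
    return mse + lam_M * (loss_T + loss_C) + lam_G * loss_G
```

Hourly differences inside the interval incur no penalty, so the term only activates on abnormally sharp jumps or drops in the predicted series.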

Experiment
PGSTNN was compared with the following seven baselines:

1. GRU (Chung et al. 2014): a variant of the RNN that has been widely used for water quality prediction.
2. Seq2Seq (Sutskever et al. 2014): an encoder-decoder architecture for sequence-to-sequence prediction.
3. GCN-Seq2Seq: a Seq2Seq model combined with a graph convolutional network to capture the spatial dependence among stations.
4. Attention-based spatio-temporal graph neural network (ASTGNN).
5. Dynamic spatial-temporal aware graph neural network (DSTAGNN) (Lan et al. 2022): a GNN for complex dynamic spatial-temporal dependencies within a dataset.
6. Diffusion convolutional recurrent neural network (DCRNN) (Li et al. 2018): a model for traffic forecasting that uses bidirectional random walks on the graph to model spatial dependency and a GRU with diffusion convolution to capture temporal dependency.
7. Graph multi-attention network (GMAN) (Zheng et al. 2020): a model that uses spatial and temporal attention mechanisms to capture the dynamic, nonlinear spatio-temporal correlations for traffic prediction.
We also compared PGSTNN with the spatio-temporal neural network constructed using only the PGA introduced in Sections 3.1 and 3.2. This variant, denoted PGA, does not use the physics-guided loss functions introduced in Section 3.3 and can be regarded as a simplified version of PGSTNN.

Parameter setting and evaluation metrics
The training and test datasets were constructed using a continuous sliding window, whose size is the sum of the lengths of the input and output time series. In this study, the length of the input sequence of DO observations was 12 (12 h), and the sliding step size was set to 1 (1 h). We segmented the dataset into non-overlapping training and testing data at a 4:1 ratio. Specifically, the observations collected from 1 January 2020 to 1 July 2021 (548 days) were used as training data, and the observations collected from 2 July 2021 to 31 December 2021 (183 days) were used as testing data. In the study area, the average river flow velocity is 2 km/h (Jobson and Keefer 1979); therefore, the maximum reachable distance (d_max) of water movement within 12 h was 24 km.
The hyperparameters of the proposed PGSTNN include the learning rate, batch size, training epochs, number of hidden units and weight parameters of the physics-guided loss functions. The parameter settings for each model are listed in Table 1. We used the grid-search strategy to find the optimal hyperparameters for each model, and the Adam optimizer was used during all the model training processes. The root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and coefficient of determination (R²) were used to evaluate the performance of the models.

RMSE = √[(1 / (N_S N_E)) Σ_{s=1}^{N_S} Σ_{t=1}^{N_E} (y^t_s − ŷ^t_s)²]

MAE = (1 / (N_S N_E)) Σ_{s=1}^{N_S} Σ_{t=1}^{N_E} |y^t_s − ŷ^t_s|

MAPE = (100% / (N_S N_E)) Σ_{s=1}^{N_S} Σ_{t=1}^{N_E} |y^t_s − ŷ^t_s| / y^t_s

R² = 1 − [Σ_{s=1}^{N_S} Σ_{t=1}^{N_E} (y^t_s − ŷ^t_s)²] / [Σ_{s=1}^{N_S} Σ_{t=1}^{N_E} (y^t_s − Ȳ)²]

Here, y^t_s and ŷ^t_s represent the observed and predicted values of the DO content at station s at time step t, N_E is the total number of samples from one station in the test set, N_S is the total number of water quality monitoring stations and Ȳ is the overall mean of the observations.
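Computed directly from their standard definitions, the four metrics can be sketched as follows (the `metrics` helper is illustrative; here it scores one pooled series of observation-prediction pairs):

```python
import math

def metrics(y_true, y_pred):
    """Return (RMSE, MAE, MAPE in %, R^2) for paired observations/predictions."""
    n = len(y_true)
    err = [yp - yt for yt, yp in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in err) / n)
    mae = sum(abs(e) for e in err) / n
    mape = 100.0 * sum(abs(e) / abs(yt) for e, yt in zip(err, y_true)) / n
    mean = sum(y_true) / n
    ss_res = sum(e * e for e in err)              # residual sum of squares
    ss_tot = sum((yt - mean) ** 2 for yt in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, mape, r2
```

Note that MAPE divides by the observed value, so stations with low average DO content tend to show inflated MAPE for the same absolute error.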

Comparison of overall prediction accuracy
Table 2 shows the overall prediction accuracy of PGSTNN and the baseline methods in predicting DO content within the next 2, 4, 6 and 8 h. PGSTNN exhibits the best prediction performance among the nine models and also performs better than PGA, indicating that the physics-guided loss functions can further improve the performance of the PGA. The experimental results indicate that the prediction accuracy of DO content can be improved by considering the physical mechanism of water quality variation.
Compared with GRU, the RMSE of PGSTNN decreases by 25.1% (+8 h) at the least and 45.0% (+2 h) at the most, and the MAE decreases by 38.0% (+8 h) at the least and 57.2% (+2 h) at the most. Compared with Seq2Seq, the RMSE of PGSTNN decreases by 16.4% (+6 h) at the least and 25.9% (+2 h) at the most, and the MAE decreases by 27.6% (+6 h) at the least and 36.5% (+2 h) at the most. Compared with GCN-Seq2Seq, the RMSE of PGSTNN decreases by 11.0% (+6 h) at the least and 25.9% (+2 h) at the most, and the MAE decreases by 19.2% (+8 h) at the least and 30.3% (+2 h) at the most. The GRU model performs the worst among the above four models, indicating that the ability of GRU to model the temporal correlation is relatively weak compared with that of Seq2Seq. Compared with ASTGNN, the RMSE of PGSTNN decreases by 8% (+6 h) at the least and 29.4% (+2 h) at the most, and the MAE decreases by 15.2% (+6 h) at the least and 31.3% (+2 h) at the most. Compared with DSTAGNN, the MAE of PGSTNN decreases by 13.0% (+8 h) at the least and 38.0% (+2 h) at the most. Compared with DCRNN, the RMSE of PGSTNN decreases by 15.8% (+8 h) at the least and 28.8% (+2 h) at the most, and the MAE decreases by 24.4% (+8 h) at the least and 36.9% (+2 h) at the most. Compared with GMAN, the RMSE of PGSTNN decreases by 3% (+6 h) at the least and 25.2% (+2 h) at the most, and the MAE decreases by 14.6% (+6 h) at the least and 34.4% (+2 h) at the most.
In short-term prediction (+2 h), Seq2Seq performs better than ASTGNN, DSTAGNN and DCRNN. For predicting DO content within the next 4, 6 and 8 h, the spatio-temporal GNN-based methods (e.g. GCN-Seq2Seq, ASTGNN, DSTAGNN, DCRNN and GMAN) all perform better than GRU and Seq2Seq because they consider the spatio-temporal correlations among different stations. Among the spatio-temporal GNN-based methods, GCN-Seq2Seq performs the best in short-term prediction (+2 h). For predicting DO content within the next 4 and 6 h, GMAN performs better than GCN-Seq2Seq, ASTGNN, DSTAGNN and DCRNN. In long-term prediction (+8 h), DSTAGNN performs the best among the comparison models.
To allow the prediction results of the different models to be compared more intuitively, the predictions of the average time series of all stations yielded by the eight methods are shown in Figures 4 and 5. The results show that the predictions yielded by PGSTNN have the highest consistency with the actual changes in DO content. Owing to space limitations, the results for +4 and +6 h are provided in the supplementary file.
We further performed a two-sample t-test to evaluate whether the prediction accuracy of PGSTNN is statistically significantly better than that of the comparison models. For each model M_i, we obtained the error series {|y^t_A − ŷ^t_A|}, where y^t_A and ŷ^t_A represent the average observed and predicted values of the DO content over all stations at time step t. For two models M_i and M_j, the null and alternative hypotheses are defined as follows: H0, there is no significant difference in the prediction accuracy of the two models; H1, the prediction accuracy of model M_i is statistically significantly better than that of model M_j. The significance level was set to 0.01. The statistical comparisons of the prediction accuracy of the different models are displayed in Tables 3 and 4 (the results for +4 and +6 h are provided in the supplementary file). If cell C_ij is labeled 'Yes', the prediction accuracy of model M_i is statistically significantly better than that of model M_j; otherwise, there is no significant difference between the two models. One can find that, in most cases, the prediction accuracy of PGSTNN is statistically significantly better than that of the comparison models.
Figure 6 shows the MAPE and R² values for each station obtained using the different models (we only show the results of the models with higher prediction accuracy). Owing to space limitations, only the results for the short-term (+2 h) and long-term (+8 h) predictions are provided herein. As shown, PGSTNN usually performs better than the compared methods at each station. GCN-Seq2Seq, which considers spatio-temporal correlations, typically performs better than GRU and Seq2Seq, which only consider temporal correlations. However, in short-term prediction (Figure 6(a)), the prediction accuracy of GCN-Seq2Seq is lower than that of Seq2Seq in Region A. In Region A, the downstream station is not related to the upstream station because the shortest path between these two stations is greater than d_max. For GCN-Seq2Seq, spatial neighbors are identified based on topological relationships, so the information of the upstream station is still used to obtain the predicted values for the downstream station. GCN-Seq2Seq therefore introduces irrelevant information into the prediction, which might have contributed to its low prediction accuracy.
In Figure 6, all models show lower prediction accuracy at stations 2336526 and 2203700. We found that the average DO content of these two stations is relatively low among all stations; because MAPE divides the error by the observed value, this may be an important reason for the high MAPE values at these two stations. In addition, these two stations have no spatial neighbors, and the lack of spatial correlation information may also contribute to the low prediction accuracy.

Comparison of prediction results at peak moments
We further compared the performance of the different models at peak moments. The sliding window method (with the window size set to 10) was used to identify the extreme points of the DO content: for each time series, the maximum and minimum value points within a sliding window were identified as extreme points. Table 5 shows that PGSTNN also performed better than the comparison methods at peak moments. Figure 7 shows the extreme points identified for the average time series of all stations.
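The sliding-window extreme-point detection above can be sketched as follows. This is a minimal interpretation of the procedure with a synthetic DO series; the exact tie-breaking and window handling in the paper may differ.

```python
# Minimal sketch of sliding-window extreme-point detection (window = 10);
# the diurnal-like DO series below is synthetic.
import math

def find_extreme_points(series, window=10):
    """Mark indices that are the max or min within any full sliding window."""
    extremes = set()
    for start in range(len(series) - window + 1):
        chunk = series[start:start + window]
        extremes.add(start + chunk.index(max(chunk)))  # local maximum
        extremes.add(start + chunk.index(min(chunk)))  # local minimum
    return sorted(extremes)

# Synthetic hourly DO signal (mg/L) with a 24 h cycle.
do_series = [8.0 + 1.5 * math.sin(2 * math.pi * t / 24) for t in range(72)]
peaks = find_extreme_points(do_series, window=10)
```

For this synthetic series, the detected indices include the diurnal maxima (around t = 6, 30, 54) and minima (around t = 18, 42, 66).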
We also performed a two-sample t-test to determine whether the prediction accuracy of PGSTNN is statistically significantly better than that of the comparison models at peak moments; the method is the same as that in Section 4.2.1. Tables 6 and 7 show that, compared with the seven baselines, PGSTNN achieves statistically significant improvements in most cases. Owing to space limitations, the results for +4 and +6 h are provided in the supplementary file.

Comparison of different training data sizes
The effect of the training data size on the prediction accuracy and stability of the different models was analyzed. We kept the testing set unchanged and used 40%-90% of the original training set to train the eight models. As shown in Figure 8, PGSTNN performs better than the seven compared models (the results for +4 and +6 h are provided in the supplementary file). PGSTNN performs stably under different training sample sizes, whereas the prediction accuracy of GRU is affected significantly by the size of the training data. GCN-Seq2Seq demonstrated higher prediction accuracy and stability than GRU and Seq2Seq. The prediction accuracy of PGSTNN using a small number of training samples (e.g. 50%) can exceed that of the comparison models using all training samples. This indicates that introducing the physical mechanism of water quality changes into the neural network improves the prediction accuracy and stability while reducing the dependence on the number of training samples.
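The experimental protocol above can be sketched as follows. The `train_model` and `evaluate_rmse` callables are placeholders for any model pipeline, and taking the most recent fraction of a chronologically ordered training set is one plausible reading of the setup, not a detail stated in the paper.

```python
# Hedged sketch of the training-size experiment: the test set stays fixed
# while the model is trained on growing fractions of the training set.
# `train_model` and `evaluate_rmse` are hypothetical placeholders.

def training_size_experiment(train_data, test_data, train_model, evaluate_rmse):
    """Train on 40%-90% of the (chronologically ordered) training set."""
    results = {}
    for fraction in [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
        n = int(len(train_data) * fraction)
        subset = train_data[-n:]  # keep the most recent n samples
        model = train_model(subset)
        results[fraction] = evaluate_rmse(model, test_data)
    return results
```

Plotting the resulting RMSE against the fraction, for each model, reproduces the kind of comparison shown in Figure 8.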

Discussion
The experimental results show that, in the study area, PGSTNN performs better than GRU, Seq2Seq, GCN-Seq2Seq, ASTGNN, DSTAGNN, DCRNN and GMAN in predicting DO content. This is attributable to the following two factors. (1) Although the spatio-temporal relationships among the different stations are vital for the prediction, learning these relationships from limited training data via purely data-driven models is difficult; only 24 stations were included in the dataset. Therefore, GRU, Seq2Seq, GCN-Seq2Seq, ASTGNN, DSTAGNN, DCRNN and GMAN may not be able to effectively learn the temporal or spatio-temporal correlations between stations. In this study, we first explicitly specified the connections between GRU cells at different stations based on prior knowledge of the river network structure; subsequently, we used neural networks to model the nonlinear spatio-temporal relationships among the stations. This prior knowledge helps the model learn the spatio-temporal correlations between stations; therefore, the hybrid strategy alleviates the deficiencies caused by insufficient training data (Figure 8). (2) Training data are typically noisy and incomplete; therefore, purely data-driven models (e.g. GRU, Seq2Seq, GCN-Seq2Seq, ASTGNN, DSTAGNN, DCRNN and GMAN) cannot easily learn the general laws of water quality changes. For example, noise in the data directly affects the optimization of the loss function of a neural network. In this study, we use two physical laws (Henry's law and the power-scaling law) to guide the training of the spatio-temporal neural network. This strategy yields more physically consistent predictions; correspondingly, the stability and generalization ability of PGSTNN are superior to those of the compared methods.
Although PGSTNN can effectively predict DO content, its applicability should be noted. PGSTNN is designed specifically for predicting DO content and therefore cannot be directly applied to predict other water quality indicators (e.g. pH, conductivity, and nitrogen and phosphorus concentrations). For example, when predicting pH, a different physical constraint applies: according to Le Chatelier's principle, the pH of water decreases as the temperature increases. Therefore, different physics-guided deep learning models are usually constructed for the prediction of different water quality parameters. We believe that the PGSTNN proposed herein can serve as useful guidance for integrating scientific knowledge with ML-based prediction models.

Conclusion
In this study, a PGSTNN was developed for predicting DO content in river water. PGSTNN integrates prior knowledge of the river network structure with a GRU to capture the spatio-temporal relationships among different stations. This hybrid strategy allows nonlinear relationships to be modeled using limited training data. PGSTNN applies two physical rules of DO variation (i.e. Henry's law and the power-scaling law) to guide the training of the spatio-temporal neural network. The physics-guided loss functions guarantee the physical consistency of the prediction results and improve the generalization ability. PGSTNN successfully integrates data- and knowledge-driven paradigms to capture spatio-temporal correlations, thus providing a new idea for constructing water quality prediction models. The experimental results show that PGSTNN performs better than seven state-of-the-art data-driven models in terms of prediction accuracy and stability. Compared with GRU, Seq2Seq, GCN-Seq2Seq, ASTGNN, DSTAGNN, DCRNN and GMAN, PGSTNN requires fewer training samples to achieve higher prediction accuracy.
Future studies should focus on the following two aspects. First, different PGSTNNs should be developed for other water quality factors, e.g. pH, conductivity, and nitrogen and phosphorus concentrations. Second, human activity factors should be considered in the prediction models.

Figure 1. Study area and locations of monitoring stations.

Figure 4. Short-term (+2 h) prediction results of the average time series of all stations obtained by PGSTNN and baseline models.

Figure 5. Long-term (+8 h) prediction results of the average time series of all stations obtained by PGSTNN and baseline models.

Figure 6. Comparison of performances of PGSTNN and baseline models at each station.

Figure 7. Extreme points in the average time series of all stations.

Figure 8. Effect of training data size on prediction accuracy and stability of different models.

Table 1. Parameters of PGSTNN and baseline models.

Table 2. Prediction accuracy of PGSTNN and baseline models.

Table 3. Statistical comparison of prediction accuracy of different models (+2 h).

Table 4. Statistical comparison of prediction accuracy of different models (+8 h).

Table 5. Prediction accuracy of PGSTNN and baseline models at peak moments.

Table 6. Statistical comparison of prediction accuracy of different models at peak moments (+2 h).