Privacy-Preserving Probabilistic Voltage Forecasting in Local Energy Communities

This paper presents a new privacy-preserving framework for the short-term (multi-horizon) probabilistic forecasting of nodal voltages in local energy communities. This task is becoming increasingly important for cost-effectively managing network constraints in the context of the massive integration of distributed energy resources. However, forecasting is traditionally carried out centrally, by gathering the raw data of end-users in a single database, which exposes their private information. To avoid such privacy issues, this work relies on a distributed learning scheme, known as federated learning, wherein individuals' data are kept decentralized. The learning procedure is then augmented with differential privacy, which offers formal guarantees that the trained model cannot be reverse-engineered to infer sensitive local information. Moreover, the problem is framed using cross-series learning, which makes it possible to smoothly integrate any new client joining the community (i.e., cold-start forecasting) without being plagued by data scarcity. Outcomes show that the proposed approach achieves improved performance compared to non-collaborative (locally trained) models, and is able to reach a trade-off between privacy and performance for different architectures of deep learning networks.

I. INTRODUCTION

Distribution networks are facing a massive integration of distributed energy resources, and are thus increasingly subject to stressed operating conditions [1]. To comply with security constraints while avoiding costly infrastructure investments, one solution is to proactively manage local energy exchanges [2]. This can be achieved through local energy communities (LECs), which gather end-users into organized entities wherein energy resources are pooled and allocated to reach common (e.g., economic or environmental) objectives [3]. However, to ensure optimal coordination between resources, LECs need to be informed with accurate predictions of the future system state [4].
Our objective is thus to develop a new framework for the short-term probabilistic forecasting of nodal voltage magnitudes in LECs, exploiting the information from smart metering devices. The voltage information is indeed essential to support an optimized operation of the system, e.g., by providing insights on how much flexibility (such as curtailment of renewable generation, and load shifting) needs to be gathered from the local end-users to prevent voltage violations [5].
In contrast with state estimation, which is used to identify current voltage values in support of real-time grid management [6], [7], state forecasting estimates future voltage values to enable proactive network scheduling (avoiding costly and suboptimal redispatch actions).
Although load and renewable energy forecasting tasks have been thoroughly studied [8], the literature on voltage forecasting is still very sparse. In [9]–[11], nodal generation and consumption are first predicted, and then embedded into a network model to calculate nodal voltages. However, these methods rely on perfect knowledge of the network parameters, which are usually uncertain (e.g., the phases on which each client is connected are often unknown). To bypass this limitation, data-driven approaches have been developed. The voltage levels are represented by vector autoregressive (VAR) processes in [12]. To move beyond such linear models, a deep learning model is proposed in [13], where it is shown that high accuracy can be achieved by monitoring only a few strategic buses. In [14], an ensemble approach (combining different regression models) is introduced for deterministic voltage prediction. This work has been extended to a probabilistic setting in [15] and [16].
All these algorithms assume that private data can be freely accessed from a centralized location. In a competitive environment, the nodal (e.g., smart meter) data are owned by end-users, who may be reluctant to share this information (as it may reveal private aspects such as home occupancy, routines, and usage of specific appliances). To address privacy concerns, an efficient solution is to rely on distributed learning. In [17], the alternating direction method of multipliers (ADMM) has been used in the context of renewable energy forecasting to train a ridge linear quantile regression model. Since ADMM does not guarantee privacy (as adversaries can recover the data if they have access to the intermediate calculations [18]), the procedure is enriched with data encryption in [19] for training autoregressive models. As an alternative to such ADMM-based techniques (which are limited to convex models [20]), federated learning (FL) has recently emerged [21] to build complex (non-linear) forecasters, such as tree-based [22] or deep learning [23] models. To that end, FL relies on a distributed setting where each client locally computes an update to the model (based on its own data). A central server then aggregates all client-side updates to compute a new global model, such that no raw data are communicated.
However, although FL complicates data inference since no centralized server holds all the information, it is not sufficient to ensure data privacy. Indeed, it has been shown that trained models may be reverse-engineered to extract detailed input information from the end-users involved in the training phase [24], [25]. It is thus essential to provide guarantees that the trained forecasting model protects the privacy of individual databases [26]. Cryptography-based methods have been proposed to hedge against confidentiality breaches, but the high computational burden of current schemes is a barrier to their real-life applicability and integration in an energy-efficient society [27]. Indeed, encrypting data not only significantly increases their size (in the order of 100x-10,000x), but also leads to higher computation times (in the order of 100x) [28].
In this paper, we therefore augment FL (which lacks rigorous privacy guarantees against inference attacks) with differential privacy (DP). The principle of DP is to inject noise into the training procedure, which is calibrated in such a way that the privacy leakage of any sensitive information can be bounded and quantified [29]. In particular, the methodology is tailored to achieve user-level privacy, i.e., enforcing that the dataset of any client has a limited impact on the learned model, thus preventing inference of local raw information.
Another important aspect for the LEC is to smoothly accommodate end-users with different histories, such as households with newly installed smart meters. Hence, traditional learning strategies wherein local features (e.g., past load and PV generation) from all nodes are aggregated into the same input vector should be avoided. Indeed, the sites with few historical measurements inherently limit the number of samples available to train the model (i.e., the longer history of other end-users cannot be used). In this paper, we therefore develop a cross-learning approach [30], which mitigates the problem of (local) data scarcity by treating each end-user as a different sample to train a single, generic model. By generalizing across all samples from all clients of the community, the model endogenously learns common patterns from neighboring nodes (thus capturing space dependencies). Moreover, this transfer of learning enables cold-start forecasting for end-users with no historical data.

Overall, the contributions of the paper are three-fold.
1) We leverage distributed learning (to avoid data exchanges during training), differential privacy (to prevent inference of private data from the trained model), and cross-series learning (to accommodate end-users with diverse measurement histories) in an innovative framework dedicated to the private probabilistic forecasting of voltage levels in LECs. The approach is developed for different deep learning models, due to their ability to capture complex high-dimensional dependencies.
2) We bridge the gap between data utilization and protection by keeping track of the privacy loss accumulated over the course of learning. This is achieved by ensuring that the training procedure satisfies Rényi differential privacy, which offers a tight analysis of the privacy consumed. To that end, we formulate the model training as a sampled Gaussian mechanism, which consists in drawing a random subset of end-users at each training stage, followed by the addition of Gaussian noise.
3) We provide insights for optimally tuning different architectures of deep learning models trained with user-level differential privacy. In addition, we explore the trade-off between improving model performance and maintaining privacy of raw data. We show how this trade-off can be adapted based on the LEC privacy preferences, which is key to enhancing the engagement and satisfaction of all end-users.

Outcomes show the added value of the proposed privacy-preserving framework in comparison with fully private forecasting (where each end-user trains its own model on its own data), thus highlighting the interest for end-users to collaboratively train a joint community-wide model. Overall, the generic nature of the method paves the way towards the integration of privacy-enhancing techniques in smart grids.
In the rest of the paper, the different building blocks to construct the private voltage forecaster are introduced in Section II. These concepts are combined and enriched in Section III to propose a new probabilistic model with provable privacy guarantees. Section IV defines the different models used as a benchmark. These models are then tested on a radial low-voltage network, and the resulting outcomes are analyzed in Section V. Finally, the main conclusions are summarized in Section VI.

II. PROBLEM FORMULATION AND BACKGROUND
In this section, we first formulate the probabilistic voltage forecasting problem using cross-series learning. Then, we introduce the underlying deep learning-based model. Finally, we elaborate on the concepts of federated learning and differential privacy, which are used (in Section III) to enrich the forecasting model with strong privacy guarantees.

A. Model Description
The objective is to generate privacy-preserving probabilistic forecasts of nodal voltages within low-voltage LECs. Nodal voltages are governed by intricate time correlations, as well as space dependencies arising from network constraints (i.e., neighboring buses are likely to exhibit similar voltage patterns). Capturing such space-time dependencies is a challenging task, since the size of the LEC may evolve (e.g., with new homes) and the LEC may thus be composed of end-users with different histories, including some with very few measurements.
Hence, as shown in the case study (Section V-A), the traditional solution of jointly predicting all nodal voltages in a single instance of the forecasting model may face data scarcity (and thus poor performance), since the total number of training samples is limited by the end-user with the smallest database (so that many relevant data are lost). Moreover, such an approach is not scalable, since increasing the number of end-users leads to a high-dimensional output space, and thus to high training complexity.
Here, we tackle this issue by using cross-series learning [30]. In this setting, we learn a single forecasting model f θ (with parameters θ ) which is shared by all end-users. Each user (using only its own private data along with publicly available information) predicts the voltage level corresponding to its node. In this way, clients with limited history will simply have fewer samples for training the joint model.
Overall, by learning from correlated voltage patterns from all individuals, the model f θ acquires improved generalization abilities and is thus more robust to input noise (thus reducing overfitting risks), while accommodating clients with different histories (thus enabling cold-start forecasts for new clients joining the collaboration scheme). This strategy bypasses the need to retrain the model from scratch [31].
The model f_θ is used by each individual end-user n ∈ N for predicting, at the forecast creation time t_0, the conditional distribution of the voltage levels y_{t1:tT,n} = (y_{t1,n}, ..., y_{tT,n}) at node n over the horizon [t_1, t_T], using exclusively the information x^all_{t0,n} available at node n before t_0:

$$p\big(\mathbf{y}_{t_1:t_T,n}\,\big|\,\mathbf{x}^{\mathrm{all}}_{t_0,n}\big),\qquad \mathbf{x}^{\mathrm{all}}_{t_0,n}=\big(\mathbf{y}_{:t_0,n},\,\mathbf{x}_{:t_0,n},\,\mathbf{x}^{(f)}_{t_1:t_T,n},\,\mathbf{x}^{(s)}_{n}\big) \tag{1}$$

where y_{:t0,n} are the past nodal voltages (measured before t_0), x_{:t0,n} are the past observed covariates, x^(f)_{t1:tT,n} are the covariates known over the forecast horizon, and x^(s)_n are the time-invariant (static) features. Note that the bold notation is used for vectors spanning multiple time periods.
These covariates are summarized in Table I. It should be noted that the past net imports-exports of the community are publicly available, and provided by the central authority (e.g., the distribution system operator). The import predictions are obtained with an autoregressive model (using only past values). Moreover, we encode the position of each node in the grid, which enables capturing spatial discrepancies (from grid technical constraints) during the training phase. To that end, we use a generic approach, wherein we specify the feeder of the network to which the end-user is connected, along with the distance between the end-user and the feeder's root node. Overall, by feeding the model with both global (e.g., estimated imports-exports of the community and global PV conditions) and local (e.g., past consumption levels, position in the grid) features, the model is able to generalize to all end-users, while leveraging local information to adapt the predictions to node-specific conditions (such as solar shading by trees).
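To make the grid-position encoding concrete, the following minimal sketch (our illustration, not the paper's implementation; the function name and the normalization by a maximum distance are assumptions) builds the static feature vector from the feeder index and the distance to the feeder's root node:

```python
import numpy as np

def encode_grid_position(feeder_id: int, n_feeders: int,
                         dist_to_root: float, max_dist: float) -> np.ndarray:
    """Static grid-position features: one-hot feeder index plus the
    normalized distance between the end-user and the feeder's root node."""
    one_hot = np.zeros(n_feeders)
    one_hot[feeder_id] = 1.0
    return np.append(one_hot, dist_to_root / max_dist)

# e.g., a client on feeder III (index 2) of 6 feeders, 240 m from the root node
x_static = encode_grid_position(feeder_id=2, n_feeders=6,
                                dist_to_root=240.0, max_dist=600.0)
```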

B. Deep Learning Model Structure and Training
The multi-horizon time-series forecasting problem can be naturally treated as a sequence-to-sequence task, wherein the goal is to convert a sequence of k past observations into a sequence containing the T predictions of interest. In this work, this task is solved using an encoder-decoder model, an advanced deep learning architecture that has shown high performance in various studies [32].
As shown in Fig. 1, the encoder processes past data over a look-back window of k time steps [t_{-k}, t_{-1}], with the goal of extracting the relevant dynamics into a context vector c_enc. Then, at time t_0, the decoder leverages this vector c_enc, along with the known future data x^(f)_{t1:tT,n} and the static features x^(s)_n, to generate the multi-horizon predictions ŷ_{t1:tT,n}. Both encoder and decoder blocks are modeled by recurrent neural networks due to their ability to represent inter-temporal dependencies. Indeed, these models have an internal memory h_t that provides an internal representation of past events, which is used to propagate relevant information through time.
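For illustration, a minimal PyTorch sketch of such an encoder-decoder forecaster is given below (our code, not the authors' exact configuration; the layer sizes and feature dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    """Minimal LSTM encoder-decoder sketch (layer sizes are illustrative)."""
    def __init__(self, n_past_feat, n_fut_feat, n_static, hidden=50, n_quantiles=7):
        super().__init__()
        self.encoder = nn.LSTM(n_past_feat, hidden, batch_first=True)
        self.decoder = nn.LSTM(n_fut_feat + n_static, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_quantiles)   # one output per quantile level

    def forward(self, x_past, x_future, x_static):
        # x_past: (B, k, n_past_feat); x_future: (B, T, n_fut_feat); x_static: (B, n_static)
        _, (h, c) = self.encoder(x_past)              # context vector c_enc = (h, c)
        T = x_future.size(1)
        x_dec = torch.cat([x_future, x_static.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        out, _ = self.decoder(x_dec, (h, c))          # decoder conditioned on the context
        return self.head(out)                         # (B, T, n_quantiles)
```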
In this work, we are interested in probabilistic forecasts that return the full conditional distribution p(y_{t1:tT,n} | x^all_{t0,n}). This target distribution is here approximated with Quantile Regression (QR) [33], which provides the conditional quantiles ŷ^(q)_{t,n}, defined by Pr(y_{t,n} ≤ ŷ^(q)_{t,n} | x^all_{t0,n}) = q, for different relevant probability levels q ∈ Q [34]. For each node n, the QR model f_θ thus yields:

$$\hat{\mathbf{y}}^{(q)}_{t_1:t_T,n}=f_{\theta}\big(\mathbf{x}^{\mathrm{all}}_{t_0,n}\big),\qquad \forall q\in\mathcal{Q} \tag{2}$$

where the output corresponding to each quantile level q spans the whole prediction horizon. Practically, the model f_θ is trained on historical observations to learn the (unknown) relationship between the inputs x^all_{t0,n} and the outputs of interest ŷ^(q)_{t1:tT,n}. This is achieved by adapting the model parameters θ so as to minimize a user-defined loss function L(θ), which penalizes discrepancies between predictions and actual observations (in the training data). Here, we use the pinball loss, since minimizing this metric yields the optimal quantiles of the forecast uncertainty [35]. This learning phase is carried out with a stochastic gradient descent (SGD) algorithm. In this iterative optimization process, one forms (at each round) a batch b ∈ B of samples z_i = (x^all_{t0,n}, y_{t1:tT,n})_i from the historical database, and estimates (over these z_i samples) the gradient of the loss function L with respect to the θ-parameters as follows:

$$\nabla_{\theta}L(\theta,b)=\frac{1}{|b|}\sum_{z_i\in b}\nabla_{\theta}L(\theta,z_i) \tag{3}$$

Then, the weight vector θ is updated following the direction of the batch-averaged gradient −∇_θ L(θ, b) towards a local minimum, i.e.,

$$\theta\leftarrow\theta-\eta\,\nabla_{\theta}L(\theta,b) \tag{4}$$

where η is the learning rate. It should be noted that SGD variants have been developed to reach better performance. One popular example is the Adam algorithm [36], which relies on adaptive learning rates to improve convergence properties.
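As an illustrative sketch of this training step (our code; the batch tensors and dimensions are placeholder assumptions), the pinball loss and one Adam-based update of Eqs. (3)-(4) can be written as follows, reusing the Seq2SeqForecaster sketched above:

```python
def pinball_loss(y_hat, y, quantiles):
    """Average pinball (quantile) loss; y_hat: (B, T, |Q|), y: (B, T)."""
    loss = 0.0
    for j, q in enumerate(quantiles):
        err = y - y_hat[..., j]                       # positive if the forecast is too low
        loss = loss + (torch.clamp(q * err, min=0)
                       + torch.clamp((q - 1) * err, min=0)).mean()
    return loss / len(quantiles)

quantiles = [0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99]
model = Seq2SeqForecaster(n_past_feat=5, n_fut_feat=4, n_static=7)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam variant of Eq. (4)

# One illustrative update on a random batch (stand-ins for historical samples)
x_past, x_future = torch.randn(32, 12, 5), torch.randn(32, 8, 4)
x_static, y = torch.randn(32, 7), torch.randn(32, 8)
loss = pinball_loss(model(x_past, x_future, x_static), y, quantiles)
opt.zero_grad(); loss.backward(); opt.step()          # Eqs. (3)-(4)
```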

C. Federated Learning
Federated learning (FL) is a distributed approach where a federation of clients c ∈ C is coordinated by a central server to learn a global model f θ , without sharing any raw client data.
In [21], an innovative algorithm, called Federated Averaging, is developed. In this setting, the server initializes the parameters θ_0. At each round r ∈ R of the federated training, a sample of clients C_r ⊆ C is selected, to whom the server broadcasts the global model θ_{r−1}. Then, each selected client c ∈ C_r performs local computations (e.g., SGD or Adam) on its private dataset, and computes the difference Δ_{c,r} between the new (locally) optimized model θ_{c,r} and the global model θ_{r−1}, i.e., Δ_{c,r} = θ_{c,r} − θ_{r−1}. The local updates Δ_{c,r}, ∀c ∈ C_r, are then uploaded to the server, which calculates the global average:

$$\Delta_r=\frac{1}{|C_r|}\sum_{c\in C_r}\Delta_{c,r} \tag{5}$$

where |C_r| is the number of clients in training round r. The new global model is then computed as

$$\theta_r=\theta_{r-1}+\eta_s\,\Delta_r \tag{6}$$

with η_s the learning rate at the server. It should be noted that the Federated Averaging algorithm makes it possible to perform multiple local updates, e.g., multiple steps of the SGD (3)-(4), before the averaging step (5). This reduces the number of communication iterations between the server and the clients. As further discussed in Section III-C, by limiting these interactions (during which an adversary can potentially access the parameters θ_r), stronger privacy guarantees can be achieved by the final model.
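A minimal sketch of one such round is given below (our illustration; each client is assumed to expose a hypothetical local_train(model) method that runs SGD/Adam on its private data):

```python
import copy
import torch

def fedavg_round(server_model, clients, eta_s=1.0):
    """One Federated Averaging round, Eqs. (5)-(6)."""
    theta = {k: v.clone() for k, v in server_model.state_dict().items()}
    deltas = []
    for c in clients:                      # the selected clients C_r
        local = copy.deepcopy(server_model)
        c.local_train(local)               # local optimization starting from theta_{r-1}
        deltas.append({k: local.state_dict()[k] - theta[k] for k in theta})
    # Eq. (5): average of the client-side updates Delta_{c,r}
    avg = {k: sum(d[k] for d in deltas) / len(deltas) for k in theta}
    # Eq. (6): server step with learning rate eta_s
    server_model.load_state_dict({k: theta[k] + eta_s * avg[k] for k in theta})
```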

D. Differential Privacy
During the federated learning, the model can be accessed by adversaries to infer raw local information. Encrypted computations can be used to protect the training procedure, but there are still open questions about the possibility to break current cryptographic functions. Moreover, encryption schemes are computationally expensive, and consume large amounts of energy, which conflicts with the goal of energy efficiency.
An emerging alternative is provided by differential privacy (DP), which makes it possible to bound and quantify the privacy leakage of sensitive information when performing learning tasks [29]. DP offers provable guarantees of protection against adversaries that have full knowledge of the training procedure, along with access to the model's parameters. DP is based on the notion of adjacent databases D and D′, which differ by the addition or removal of a single element. In this work, we consider user-level DP, which focuses on the largest possible difference that one client can have on the trained model [37].
In the context of deep learning, M refers to the learning algorithm (e.g., gradient descent optimization), and the application of M on a database D yields the model weights θ = M(D). Since M is inherently stochastic (e.g., neural networks are initialized with random weights, and rely on a random selection of samples at each weight update), there is uncertainty on the final weights θ of the deep learning model. Formally, a randomized learning algorithm M is said to be (ε, δ)-differentially private if, for any adjacent datasets D and D′, and any subset S of the weights distribution, we have:

$$\Pr[M(D)\in S]\le e^{\epsilon}\,\Pr[M(D')\in S]+\delta \tag{7}$$

where ε ≥ 0 is the privacy loss, yielding an upper bound on how much the probability of converging to a particular set of weights θ is affected by including (or removing) a single client during the training M. A low ε-value is preferable, since it means that removing any end-user from the training database does not significantly modify the final model. Such a property makes it very difficult to infer the raw data of clients. Then, we see that (ε, δ)-DP allows for potentially large privacy losses (no bound on ε) with probability δ. In this way, δ ∈ [0, 1] is the failure probability which caps any long tail of the M(·)-distribution where pure ε-DP guarantees do not hold. In the worst-case scenario wherein this δ-fraction exclusively relates to a single client, this may prove detrimental. Hence, to ensure privacy for each end-user n ∈ N, a solution is to have δ < 1/|N|.
Overall, a training process M is differentially private if the distributions of θ_1 = M(D) and θ_2 = M(D′) are close for every choice of adjacent datasets D and D′, i.e., the data of any client do not significantly affect the weights distribution of the algorithm. In this work, the goal is to convert the federated learning of neural networks into a differentially private distributed training M that is associated with formal (i.e., provable) privacy guarantees (by bounding the ε and δ values).

III. DIFFERENTIALLY PRIVATE FEDERATED FORECASTER
In this section, we present how to train deep learning models with user-level DP. The methodology is summarized in Fig. 2. First, the forecasting model is initialized (i.e., training round r = 1) at the server side, and sent to a random subset C_{r=1} of end-users. Each client c ∈ C_{r=1} locally trains the model (using its own private data and public information). Then, the local updates Δ_{c,r} are sent back to the server, and are aggregated with the addition of noise that is calibrated using DP, in such a way that the privacy leakage can be bounded and quantified. This procedure is iterated (over training rounds r ∈ R) until convergence, i.e., when the model accuracy cannot be further improved or when the privacy budget is consumed.
The core of the methodology is explained in Section III-A. It is then enriched in Section III-B with the introduction of a new privacy-compliant procedure that internalizes the data normalization (of individual end-users) into the prediction model, which improves consistency among the different local updates Δ_{c,r}. Finally, we compute in Section III-C the total privacy cost of the trained model.

A. Differentially Private Deep Federated Learning
Different approaches have recently been proposed for the differentially private training of learning models [38], [39]. As discussed in [29], an effective solution to ensure that a learning function f achieves (ε, δ)-DP is to add noise proportional to the sensitivity S_f of that function:

$$S_f=\max_{D,D'}\big\|f(D)-f(D')\big\|_2 \tag{8}$$

i.e., the maximum l_2-distance between the outputs of f for any adjacent input datasets D and D′. In particular, it is shown in [40] that a function f with sensitivity S_f can achieve (ε, δ)-DP by adding Gaussian noise N(0, S_f² σ²), with ε ≤ 1 and δ ≥ 0.8 · exp(−(σε)²/2). The resulting learning process is usually referred to as the Gaussian mechanism, i.e., M(D) ≜ f(D) + N(0, S_f² σ²), where σ is the noise multiplier controlling the trade-off between privacy and model performance.
In the context of (centrally trained) neural networks, DP can thus be achieved by applying the Gaussian mechanism to the weight update function (4) of the stochastic gradient descent (SGD) algorithm [41], i.e.,

$$\theta\leftarrow\theta-\eta\left(\nabla_{\theta}L(\theta,b)+\frac{1}{|b|}\,\mathcal{N}\big(0,S_f^2\sigma^2 I_{|\theta|}\big)\right) \tag{9}$$

where |b| is the number of samples z_i in the SGD mini-batch b ∈ B, and I_{|θ|} is an identity matrix of size |θ|, such that N(0, S_f² σ² I_{|θ|}) is a |θ|-dimensional Gaussian noise. This principle can be further extended to FL [37], by using a DP version of the global average update (5), where per-client updates Δ_{c,r} are aggregated (on the server) at each round r of the training:

$$\Delta_r=\frac{1}{|C_r|}\left(\sum_{c\in C_r}\Delta_{c,r}+\mathcal{N}\big(0,S_f^2\sigma^2 I_{|\theta|}\big)\right) \tag{10}$$

However, in traditional SGD algorithms, the sensitivity S_f of the update function f is a priori unknown. A solution is proposed in [41], wherein the gradient of each sample is clipped at a threshold S before the update step, such that the maximum influence (i.e., sensitivity) of a sample on the final average is bounded, i.e., S_f ≤ S.
However, this technique only offers privacy for single samples. In order to extend sample-level DP to user-level DP, we need to bound the sensitivity of the training function (10) with respect to the addition or removal of any client from the dataset. This is accomplished by clipping the local weight updates Δ_{c,r} (at the end of the local training round r of each end-user c). In particular, for each user, we bound the l_2-norm of the local updates by S, i.e., ‖Δ_{c,r}‖_2 ≤ S, as follows:

$$\Delta_{c,r}\leftarrow\Delta_{c,r}\cdot\min\!\left(1,\frac{S}{\|\Delta_{c,r}\|_2}\right) \tag{11}$$

such that if ‖Δ_{c,r}‖_2 > S, the per-user updates Δ_{c,r} are rescaled to norm S, and are preserved otherwise. It should be noted that bounding the influence of any user is also beneficial for training stability, since it prevents the model from overfitting to a particular subset of data.
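Eq. (11) can be sketched in a few lines (our illustration; the update is represented as a dict of tensors, matching the federated sketches above):

```python
import torch

def clip_update(delta, S):
    """Eq. (11): rescale a per-user update so that its l2-norm is at most S."""
    norm = torch.sqrt(sum((v ** 2).sum() for v in delta.values()))
    scale = min(1.0, S / (norm.item() + 1e-12))
    return {k: v * scale for k, v in delta.items()}
```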
When considering privacy in the federated learning of deep networks, it is important to notice that privacy breaches may occur at each round r between clients and the server, through the information contained in the weight updates. To mitigate this issue, we exploit the randomness associated with subsampling. Indeed, if the training M is (ε, δ)-DP, then drawing a random subset of end-users (from all |C| clients) before applying M follows (O(γε), γδ)-DP, with γ < 1 (which depends on the sampling strategy) [42].

Algorithm 1: Differentially Private Federated Deep Learning (DP-FDL)
  Initialize model θ_0
  for each round r ∈ R do
    C_r ← (users sampled with probability q)
    for each client c ∈ C_r do
      Δ_{c,r} ← LOCALTRAIN(c, θ_{r−1}, η_c, S)
    end for
    update the global model with the noisy average of Eq. (12) and server momentum β
  end for
  function LOCALTRAIN(c, θ, η_c, S)
    θ_c ← local optimization (SGD/Adam) of θ on the private data of client c
    Δ ← θ_c − θ
    return Δ · min(1, S/‖Δ‖_2)
  end function
Here, end-users are randomly and independently sampled with probability q ∈ (0, 1] at each round r ∈ R of the training mechanism M. Hence, the number |C_r| of end-users at each round r is variable and unknown. It is thus replaced by its expected value E[|C_r|] = q|C|, such that the global model update (on the server) is:

$$\theta_r=\theta_{r-1}+\eta_s\left(\frac{1}{q|C|}\sum_{c\in C_r}\Delta_{c,r}+\mathcal{N}\big(0,S_f^2\sigma^2 I_{|\theta|}\big)\right) \tag{12}$$

where the sensitivity S_f of the update function (12) is bounded by S_f ≤ S/(q|C|), and σ is the noise multiplier for training the neural network.
The resulting differentially private federated learning algorithm for deep networks (DP-FDL) is given in Algorithm 1. By combining subsampling and additive Gaussian noise in the update function, the training procedure M follows a sampled Gaussian mechanism (SGM), which is sufficient for achieving central differential privacy (Section III-C).
As shown in Algorithm 1, federated averaging is augmented with server momentum β, due to its ability to dampen oscillations in the learning, and thus to mitigate the divergence of local models from the globally optimal solution.
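A minimal sketch of one DP-FDL round is given below (our code, under the same assumptions as the earlier federated sketches: a hypothetical client local_train API, a float-only state_dict, and a momentum buffer momentum_buf initialized to zero tensors):

```python
import copy
import random
import torch

def dp_fdl_round(server_model, all_clients, q, S, sigma, eta_s, beta, momentum_buf):
    """One DP-FDL round (Algorithm 1 with the noisy update of Eq. (12)):
    Poisson-sample clients, clip each local update at S, add Gaussian noise
    calibrated to the sensitivity S/(q|C|), and apply server momentum beta."""
    theta = {k: v.clone() for k, v in server_model.state_dict().items()}
    sampled = [c for c in all_clients if random.random() < q]   # each client w.p. q
    expected_n = q * len(all_clients)                           # E[|C_r|] = q|C|
    agg = {k: torch.zeros_like(v) for k, v in theta.items()}
    for c in sampled:
        local = copy.deepcopy(server_model)
        c.local_train(local)   # hypothetical client API: SGD/Adam on private data
        delta = clip_update({k: local.state_dict()[k] - theta[k] for k in theta}, S)
        for k in agg:
            agg[k] += delta[k]
    for k in theta:
        # Eq. (12): noise std sigma*S on the sum equals sigma*S/(q|C|) after scaling
        noisy = (agg[k] + sigma * S * torch.randn_like(agg[k])) / expected_n
        momentum_buf[k] = beta * momentum_buf[k] + noisy        # server momentum
        theta[k] = theta[k] + eta_s * momentum_buf[k]
    server_model.load_state_dict(theta)
```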

B. Features Normalization
In general, end-users have different distributions of data. In a centralized learning, this poses no problem as a common normalization of features is applied over the entire database.
However, when local data are private, such a shared computation cannot be performed. This prevents the use of, e.g., traditional batch normalization [43], wherein local statistics (i.e., mean and variance values) are aggregated over the whole training data (including all clients) to generate the predictions.
To address this issue, we use layer normalization (LN) [44]. In contrast to traditional approaches, LN does not rely on shared statistics among end-users (which would break data privacy), and rather internalizes the normalization into the first layer of the model. This generic procedure can thus be applied to any neural network architecture. For the inputs x_i of a neural layer, the LN inputs x^nrm_i are given by:

$$\mathbf{x}^{\mathrm{nrm}}_i=\mathbf{g}\odot\frac{\mathbf{x}_i-\mu_i}{\sigma_i}+\mathbf{b} \tag{13}$$

where ⊙ is the element-wise multiplication of vectors. All units in the layer thus share the same normalization terms μ_i and σ_i, i.e., the mean and standard deviation of the elements in x_i, which are sample-dependent and computed locally. The bias b and gain g vectors are internal parameters of the model that need to be learned during training. These vectors b and g are thus common to all clients, such that the normalization procedure (involving all clients) is directly internalized into the privacy-preserving learning.
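In PyTorch, this amounts to placing a LayerNorm module (whose affine parameters are the learned g and b of Eq. (13)) as the first layer, as in the following sketch (layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LNForecaster(nn.Module):
    """Sketch: normalization internalized as a LayerNorm first layer, so no
    cross-client statistics are ever shared (gain g and bias b are learned)."""
    def __init__(self, n_inputs, hidden=50):
        super().__init__()
        self.ln = nn.LayerNorm(n_inputs)   # Eq. (13): per-sample mean/std, learned g and b
        self.body = nn.Sequential(nn.Linear(n_inputs, hidden), nn.ReLU())

    def forward(self, x):
        return self.body(self.ln(x))       # statistics computed locally, per sample
```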

C. Privacy Budget
The DP-FDL algorithm consists of |R| successive queries of the learning function M in Eq. (12). An adversary with access to the intermediate models θ_r may leverage this additional information to infer raw (private) data, and it is thus necessary to keep track of the total privacy loss, which accumulates over the training (M_1, ..., M_|R|).
A solution is offered by advanced composition theorems [45], which give an upper bound on the accumulated privacy loss (by assuming the worst-case scenario wherein the same amount of leakage occurs at each query to the data). For instance, applying the same (ε, δ)-DP algorithm |R| consecutive times gives an (O(ε√(|R| log(1/δ))), |R|δ) guarantee [46]. However, these compositions are generic (they do not account for the specific noise distribution used in the learning), such that they tend to strongly exaggerate privacy losses.
This has motivated the development of another approach, called the moments accountant [41], which, instead of directly dealing with (ε, δ)-DP, relies on the notion of Rényi-DP (RDP) [47]. It is a natural relaxation of DP based on the Rényi divergence of order α > 1 between two distributions P and Q (defined over the same probability space):

$$D_{\alpha}(P\,\|\,Q)=\frac{1}{\alpha-1}\log\,\mathbb{E}_{z\sim Q}\!\left[\left(\frac{P(z)}{Q(z)}\right)^{\!\alpha}\right] \tag{14}$$

where log is the natural logarithm, and P(z) is the density of P at z. By definition, M satisfies (α, γ)-RDP if, for any adjacent inputs D, D′, it holds that D_α(M(D) ‖ M(D′)) ≤ γ. The main advantage of RDP is its simple linear composition form, i.e., if M obeys (α, γ)-RDP, then the composition M^|R| is (α, γ|R|)-RDP [47]. In particular, a numerically stable computational procedure for estimating the Rényi parameters of a sampled Gaussian mechanism is presented in [48], and has been shown to provide strengthened privacy bounds.
However, in contrast with (ε, δ)-DP, Rényi parameters are more difficult to interpret, and we are thus interested in converting the privacy budget expressed in terms of (α, γ) into the more interpretable notion of (ε, δ)-DP. In [47, Proposition 3], it is shown that an (α, γ)-RDP algorithm provides (γ + log(1/δ)/(α−1), δ)-DP guarantees. However, follow-up works have improved this conversion, and the bounds provided by [49] are therefore used in our experiments (in Section V-B) to tighten the privacy guarantees of the learning procedure.
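As a simplified, deliberately conservative sketch of this accounting (our code, not the paper's accountant: it uses the closed-form RDP of the plain Gaussian mechanism, γ(α) = α/(2σ²) per round for a sensitivity-normalized query, ignores the subsampling amplification analyzed numerically in [48], and applies the basic conversion of [47, Proposition 3] rather than the tighter bounds of [49], so it overestimates ε):

```python
import math

def epsilon_from_rdp(sigma, n_rounds, delta, alphas=range(2, 256)):
    """Conservative privacy accountant sketch (no subsampling gains):
    Gaussian-mechanism RDP gamma(alpha) = alpha / (2 sigma^2) per round,
    linear composition over rounds, then (alpha,gamma)-RDP -> (eps,delta)-DP."""
    best = float("inf")
    for alpha in alphas:
        gamma = n_rounds * alpha / (2 * sigma ** 2)          # composed RDP budget
        eps = gamma + math.log(1.0 / delta) / (alpha - 1)    # conversion of [47, Prop. 3]
        best = min(best, eps)
    return best

# e.g., sigma = 0.75, 60 rounds, delta = 1e-2 (a loose upper bound on epsilon)
print(epsilon_from_rdp(sigma=0.75, n_rounds=60, delta=1e-2))
```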
It should be noted that few studies focus on how to select an adequate privacy budget ε, although it is well-established that higher privacy budgets can be allocated to more complex learning tasks [39]. In that regard, it has been shown that DP-enhanced neural networks are resistant to inference attacks (with state-of-the-art attack frameworks) for ε up to 100 [50]. In complement, studies on memorization attacks (which exploit the fact that highly parametrized models may memorize some patterns in training data) demonstrated that attacks are blocked by DP mechanisms even with poor worst-case privacy guarantees (high ε-values up to 10⁹) [24]. In Section V-B, the trade-off between data privacy and model performance is fully discussed.

IV. BASELINES
In Section IV-A, we introduce baseline models (not trained in a privacy-aware setting) to evaluate the forecasting accuracy of state-of-the-art techniques in traditional (non-private) conditions. Then, in Section IV-B, we present the deep learning models that are trained both in (non-private) conditions and with the proposed DP-enhanced federated learning. Finally, we present the performance metric used to assess the prediction performance of the different models in Section IV-C.

A. Baseline Models
We implement naive and state-of-the-art techniques for the multi-horizon forecasting task.
• a naive probabilistic averaging model (Prob-Avg), where the voltage distribution at each time step is computed from the aggregation of all past observations corresponding to this specific period.
• a naive probabilistic persistence (Prob-Persistence), where the last nodal voltage value y_{t−1,n} is propagated over the prediction horizon [t_1, t_T] as the mean value, while the variance is computed on the look-back window.
• a quantile regression forest (QRF), i.e., a tree-based ensemble model, in which the outputs of independent regression trees are merged to estimate the conditional distribution. The forest is set to 500 trees.
• a gradient boosting regression tree (QGBRT) trained with the quantile loss. In this ensemble model, new regression trees are sequentially generated to forecast the residuals of the previous models. The number of boosting stages is set to 100, with an early stopping criterion.

B. Deep Learning Models
The sequence-to-sequence model (of Section II-B) is tested with Long Short Term Memory (LSTM) recurrent networks for both encoder and decoder blocks. A second variant (BLSTM) relies on bidirectional LSTM networks to improve the representation of time dependencies [34].
We also use a traditional deep feedforward neural network (DFFNN), wherein hidden layers are composed of neurons using rectifier linear units (ReLUs) as activation function.
In complement, we build a BLSTM model wherein the local (private) features of all end-users are aggregated into a single input vector to jointly predict the voltage levels at all nodes of the grid. By constructing a shared input space, the total number of training samples is limited by the client with the smallest database. This model (A-BLSTM) serves thus as a benchmark to quantify the value of the cross-learning strategy (wherein all historical data are considered by treating each client as a different sample).

C. Hyper-Parameters Optimization and Performance Metrics
The input features of all machine learning (tree-based and deep learning) models are the same (as given in Table I). This strategy is used to ensure that differences in prediction performance only arise from privacy constraints. For each forecasting model (except for the parameter-free naive methods), a hyper-parameter optimization is carried out through an extensive random search to identify the optimal model complexity. The same number of iterations is used across all benchmarks.
When assessing the performance of a probabilistic forecast, two complementary aspects need to be jointly analyzed, i.e., reliability and sharpness.
Reliability measures how closely the predicted intervals correspond to the actual data frequencies, while sharpness measures the width of the prediction intervals. To evaluate the trade-off between both concepts, we use the quantile loss. It has indeed been shown that the quantile loss yields consistent outcomes with other metrics, such as the Winkler score and the continuous ranked probability score (CRPS) [51].
The quantile loss QL_{τ,n} for node n at time step τ of the test set is given by:

$$QL_{\tau,n}=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\Big[q\cdot\max\big(y_{\tau,n}-\hat{y}^{(q)}_{\tau,n},\,0\big)+(1-q)\cdot\max\big(\hat{y}^{(q)}_{\tau,n}-y_{\tau,n},\,0\big)\Big] \tag{15}$$

where ŷ^(q)_{τ,n} are the quantiles predicted by the forecaster, while y_{τ,n} are the actual voltage observations. In this paper, we compute the quantiles for q = 1, 10, 25, 50, 75, 90 and 99%.
Practically, we compute the total pinball loss QL tot , i.e., the average value of all pinball losses (15) over all points (t, n) of the space-time domain of the test set. Smaller values of QL tot correspond to better forecasting outcomes.
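For completeness, QL_tot can be computed with a short NumPy routine mirroring the training loss sketched in Section II-B (the array shapes are our assumption):

```python
import numpy as np

def total_quantile_loss(y_true, y_pred, quantiles):
    """QL_tot: the pinball loss of Eq. (15) averaged over every time step,
    node, and quantile level of the test set.
    y_true: (T, N) observed voltages; y_pred: (T, N, |Q|) predicted quantiles."""
    losses = []
    for j, q in enumerate(quantiles):
        err = y_true - y_pred[..., j]
        losses.append(q * np.maximum(err, 0) + (1 - q) * np.maximum(-err, 0))
    return float(np.mean(losses))
```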

V. CASE STUDY
The proposed privacy-preserving voltage forecasting strategy is tested on the IEEE European Low Voltage Test Feeder, shown in Fig. 3, which has a radial structure with 6 line ramifications (I to VI) used to encode the spatial information of each node. The studied energy community feeds 57 clients, i.e., 15 single-phase inflexible loads, 40 single-phase prosumers (with inflexible load, PV source, and battery system), and 2 three-phase community-scale PV plants. The nodal voltages are predicted over a multi-horizon of T = 8 intervals of 30 minutes (i.e., 4 hours ahead) for the N = 57 nodes. A look-back window of k = 12 intervals (i.e., 6 hours) is selected to capture past dynamics. Each client has a different history of recordings, ranging from 120 to 478 days. To have comparable performance indices among end-users, the test set is composed of the same days for all clients (corresponding to the last 96 days of the horizon). This test set is used to compute performance metrics that are independent from the data used during model training. From the remaining data (not used in the test set), 85% are employed for training the model, while the remaining 15% are used for tuning the hyper-parameters. Overall, the aggregated test set is thus composed of 96 × N = 96 × 57 = 5,472 daily voltage profiles, while the training and validation sets respectively include 11,400 and 1,995 daily patterns.
In this case study, a significant dependency between voltage levels and local energy exchanges is observed. In particular, using the Pearson coefficient [52], the nodal voltages have a correlation of −0.51 with the consumption profile, and of 0.64 with the PV generation. The voltage levels thus provide an indirect representation of local consumption and generation data (which reveal critical and private information), and it is therefore essential to protect the privacy of the voltage forecaster.

A. Comparison Between Centralized and Local Models
To quantify the practical interest for end-users to participate in collaborative learning strategies, we compare the forecasting performance of centralized and fully private models.
First, we study the prediction performance of (non-private) centralized models relying on complete information, i.e., an ideal (but unrealistic) case where each end-user shares its private data to form a single database, from which the community manager trains a single model. Second, we implement the fully private counterpart, where a different model is trained by each end-user based only on its own private data. In the deep learning-based models, the weights are initialized using a Glorot uniform distribution, while the batch size is set to 96 (as a good trade-off between training time and final performance). The optimization is performed with the Adam algorithm, with a learning rate of 0.001. Then, in all tree-based and deep learning models, early stopping is implemented to avoid overfitting in the learning procedure. Table II presents the performance and training time of the different probabilistic forecasters. For deep learning models, the number of epochs is given in parentheses. It should be noted that, in the private setting, the performance of local models is averaged over all end-users of the community.
The experiments show that machine learning models (deep learning and ensemble methods) significantly outperform naïve benchmarks (in both centralized and local settings), which stresses the interest of using advanced models to properly represent the dynamics of nodal voltage levels. In particular, centralized BLSTM-based networks (with 1 hidden layer composed of 50 neurons for both encoder and decoder blocks) achieve the best accuracy with QL tot = 0.0032 pu, due to their ability to capture the non-linear inter-temporal dependencies within data. The convergence is obtained in 16 rounds, with a per-round training time of around 5.5 minutes.
Moreover, we observe large differences in performance between centralized and local models. Specifically, the local (private) probabilistic averaging (Prob-Avg), wherein the voltage distributions are differentiated between nodes, reduces the total quantile loss from 0.319 pu (for the centralized version) down to 0.192 pu. This difference reflects the importance of capturing nodal dependencies between voltage levels.
Surprisingly, when training fully private models (using only local data), sequence-to-sequence networks exhibit poor convergence, and are even surpassed by tree-based models and feedforward networks. This arises from the difficulty of training recurrent neural networks from scratch with a limited amount of data. In this way, the hyper-parameter optimization has led to very compact local LSTM (and BLSTM) models (with 1 hidden layer of 8 neurons). Also, the A-BLSTM (where all nodal voltage levels are jointly predicted) only yields QL_tot = 0.192 pu, which can be explained by data paucity (i.e., the number of samples is limited by the client with only 120 days of measurements). In contrast, by enlarging the dataset (with the aggregation of samples from all end-users n ∈ N), the cross-learning approach reduces the uncertainty space of the predictions, leading to improved sharpness. This materializes in a total quantile loss (for the centralized BLSTM) of 0.0032 pu over the test set.
In contrast, the best local model (DFFNN) performs no better than QL tot = 0.0144 pu. This performance gap between centralized and local models is a clear indication of the added value for end-users to collaboratively train a community-wide probabilistic forecaster. This need is exacerbated for end-users with data paucity, as they are particularly exposed to poor convergence and weak accuracy.

B. Utility -Privacy Trade-off
There is an inherent trade-off between the performance of the prediction model and the privacy, i.e., the (ε, δ)-values, of the local data. This trade-off is affected by three main parameters. First, privacy breaches may occur at each round r of the training, i.e., stronger privacy guarantees are obtained when the training phase converges quickly. Second, the privacy of raw data can be improved by reducing the expected number of end-users q|C| at each round. However, in accordance with Eq. (12), reducing q|C| increases the sensitivity (S/(q|C|)) of the update function. In turn, this augments the variance of the noise added during the server optimization, which is detrimental for prediction accuracy. Third, the noise multiplier σ controls the variance of the noise injected during the weight update of the learning phase. Higher σ-values improve privacy guarantees, but at the expense of accuracy.
Here, we quantitatively analyze the utility-privacy trade-off. We study three different deep learning models, along with the BLSTM without layer normalization (called I-BLSTM), wherein the inputs of each end-user are normalized independently (based only on local statistics). The models are trained over |R| = 60 rounds. As explained in Section II-D, to ensure δ < 1/|N|, the ε-values are reported for δ = 10⁻² in the remainder of the paper.
In Table III, we give the optimal number of rounds r_opt, i.e., the number of interactions between end-users and the server which leads to the best accuracy on the validation set. Then, we provide the corresponding privacy loss ε_opt (using [47, Proposition 3], as explained in Section III-C), along with the total quantile loss (over the test set) for two noise levels (σ = 0.25 and 0.75) with q|C| = 10 users per round. The Adam algorithm with η_c = 0.001 is used for training the local models, while SGD with a learning rate η_s = 1 is used at the server side. The momentum β is set to 0.6 in Algorithm 1.
From Table III, we see that the (LSTM and BLSTM) sequence-to-sequence models are more robust to the added noise, while the DFFNN struggles to achieve decent outcomes in the DP framework. In this way, while the BLSTM is still able to reach QL_tot = 0.112 pu for σ = 0.75, the DFFNN only reaches 0.140 pu (which is roughly equivalent to the fully local models). Also, the outcomes stress the added value of using layer normalization in the federated learning, since the I-BLSTM model is consistently outperformed by the reference BLSTM.
Overall, raising σ from 0.25 to 0.75 significantly reduces the worst-case privacy loss ε, but this comes at the expense of forecasting performance (for all models). It should be noted that using a noise multiplier higher than 0.8 even leads to strong model divergence. To better understand the effect of the noise level σ and the expected number of clients per round q|C| on the utility-privacy trade-off, Table IV shows the outcomes of a sensitivity analysis on the performance of the BLSTM model in different privacy settings.
In Table IV, we see that models trained with traditional federated learning (which is not augmented with differential privacy, i.e., σ = 0) cannot reach the performance of centralized models, but they are still significantly better than their private counterpart. In particular, the federated BLSTM with q|C| = 10 end-users per round converges towards QL tot = 0.054 pu, while the centralized and private equivalents respectively achieve 0.032 pu and 0.172 pu.
Interestingly, the subsampling strategy makes it possible to achieve more stringent privacy guarantees, i.e., lower ε-values are obtained for q|C| = 5 than for q|C| = 10 for the same number of training rounds. However, as discussed above, decreasing the expected number of clients per round q|C| increases the sensitivity (S/(q|C|)) of the update function in Eq. (12), and thus the variance of the noise added during the server optimization, which is detrimental for prediction accuracy. This effect is clearly observed in the performance gaps between q|C| = 5 and 10 (e.g., QL_tot increases from 0.112 pu up to 0.132 pu for σ = 0.75). In our case, further increasing q|C| to 12 or 15 end-users does not improve the results, and we conclude that the best solution is to use q|C| = 10. To illustrate the quality of the results obtained with the BLSTM network for q|C| = 10 and σ = 0.25, the probabilistic voltage forecasts at 4 nodes (A, B, C and D in Fig. 3) during a summer day are shown in Fig. 4. The gray areas represent the forecasted quantiles, while the red line stands for the actual voltage time series. Fig. 4 shows that the predicted intervals properly encapsulate the actual voltage realizations, i.e., the volatility of nodal voltages is well captured in tight intervals. To gain better insight into the convergence of the DFFNN, BLSTM and I-BLSTM models, their training performances are illustrated in Fig. 5. We depict (on a logarithmic scale) the evolution of the total quantile loss QL_tot over the course of training for three levels of the noise multiplier (σ = 0, 0.25, and 0.75) with q|C| = 10. We also report the corresponding evolution (at each round r ∈ R) of the privacy loss ε (with δ = 10⁻²).
In the absence of noise, the DFFNN converges faster than the sequence-to-sequence BLSTM model, which arises from its simpler architecture that is easier to optimize. In contrast, the BLSTM model exhibits a more gradual training, but is ultimately able to converge towards better solutions, i.e., the BLSTM outperforms the DFFNN for all noise levels σ. Also, we see that the DFFNN is more sensitive to noise than the BLSTM model, i.e., even a small noise value σ = 0.25 has a substantial effect on the convergence abilities of the DFFNN. The BLSTM is thus better suited for use in combination with DP. Finally, the benefit of the layer normalization materializes at the end of the training (when the search for the optimal solution becomes finer), by allowing a more efficient transfer of learning between end-users.

C. Sensitivity Analysis on Hyper-Parameters
In addition to the privacy-specific parameters, the prediction accuracy of the forecasters is also affected by other hyper-parameters, which need to be carefully tuned. In Table V, we evaluate the influence of the number of hidden units (# neurons), the batch size (|b|), and the clipping threshold (S) on the BLSTM test performance and per-round training time for σ = 0.25 and q|C| = 10, thus deriving valuable insights for applying differentially private federated learning in smart grid forecasting applications. In contrast with centralized training, wherein more complex models have a higher modeling power, using more hidden units in the federated DP framework strongly decreases the model quality. In practice, it is thus preferable to use more compact models, which are less sensitive to the noise added during the learning procedure. This also has a positive effect on the communication burden, since it requires transferring a smaller weight vector (and thus less bandwidth) between the server and the end-users.
Then, the batch size has a very small effect on the final performance but it significantly influences the computation time. In this way, by decreasing the batch size from 96 to 10, the per-round training time increases from 41 to 123 seconds.
Finally, identifying the optimal value of the clipping threshold S is not an easy task. Indeed, low values (S = 0.1) may discard valuable information from the magnitude of the local updates, thus jeopardizing the gradient-descent search for optimal parameters. Conversely, when S = 0.3, more noise is added to the global model update (12), since the standard deviation of the injected noise scales with S, which complicates the training. Here, we therefore use S = 0.175.

D. Discussion
The main outcomes of the work are summarized in Table VI. In particular, using the BLSTM as a reference model, we sum up both accuracy and privacy metrics in different relevant settings, i.e., (i) fully local models (one model per client) trained using Eqs. (3)-(4); (ii) a fully centralized model (with complete information) trained using Eqs. (3)-(4); (iii) a global model built with federated learning without DP, trained using Eqs. (5)-(6); and (iv) a global model built using DP-enhanced federated deep learning (DP-FDL) with Eq. (12). Models (ii), (iii) and (iv) are all trained using cross-series learning to enable cold-start forecasts for new end-users joining the coordination problem.
The complexity of the centralized model differs from the local models, i.e., the centralized model can rely on a larger architecture, and thus on higher computational capabilities since it has access to a larger database. Similarly, collaborative models in different privacy settings (fully centralized, federated learning, and DP-enhanced federated learning) are also characterized by different optimal architectures.
We clearly observe the interest of collaborative learning, since the local models achieve the worst accuracy of QL_tot = 0.144 pu. However, the improved performance from collaboration comes at the expense of privacy, by revealing sensitive input features such as raw smart meter data (exposing the periods of presence at home). Federated learning helps mitigate this issue by keeping the raw data local, while maintaining a good accuracy of QL_tot = 0.054 pu. However, this FL approach does not provide guarantees that the trained model cannot be reverse-engineered. Hence, FL is enriched with DP by introducing noise in the training. In this way, by increasing the noise multiplier σ from 0.25 to 0.75, the (worst-case) privacy breaches are significantly decreased from ε_opt = 167.7 to 7.3, for a limited loss in prediction performance from 0.090 to 0.112 pu.

VI. CONCLUSION
This paper has presented a new privacy-preserving framework, applied to the probabilistic forecasting of nodal voltage magnitudes. The proposed framework makes it possible to distribute the computations among the parties, and to reach a trade-off between utility and privacy by embedding the learning procedure into a differentially private mechanism. Outcomes show that compact recurrent models are inherently more robust to noise, which makes them natural candidates for the development of privacy-enhancing techniques in renewable-dominated smart grids.
As a perspective, one may be interested in tracking the privacy spent by each client, which is highly challenging since the set of clients participating in each round is private. Also, it may be useful to develop state-of-the-art attack frameworks to have an empirical evaluation of how much information an adversary can actually infer from trained models.