Wireless Distributed Learning: A New Hybrid Split and Federated Learning Approach

Cellular-connected unmanned aerial vehicle (UAV) with flexible deployment is foreseen to be a major part of the sixth generation (6G) networks. The UAVs connected to the base station (BS), as aerial users (UEs), could exploit machine learning (ML) algorithms to provide a wide range of advanced applications, like object detection and video tracking. Conventionally, the ML model training is performed at the BS, known as centralized learning (CL), which causes high communication overhead due to the transmission of large datasets, and potential concerns about UE privacy. To address this, distributed learning algorithms, including federated learning (FL) and split learning (SL), were proposed to train the ML models in a distributed manner via only sharing model parameters. FL requires higher computational resources on the UE side than SL, while SL incurs larger communication overhead when the local dataset is large. To effectively train an ML model considering the diversity of UEs with different computational capabilities and channel conditions, we first propose a novel distributed learning architecture, a hybrid split and federated learning (HSFL) algorithm, by reaping the parallel model training mechanism of FL and the model splitting structure of SL. We then provide its convergence analysis under non-independent and identically distributed (non-IID) data with a random UE selection scheme. By conducting experiments on training two ML models, Net and AlexNet, in wireless UAV networks, our results demonstrate that the HSFL algorithm achieves higher learning accuracy than FL and less communication overhead than SL under IID and non-IID data, and that the learning accuracy of the HSFL algorithm increases with the number of split training UEs.
We further propose a Multi-Arm Bandit (MAB) based best channel (BC) and best 2-norm (BN2) (MAB-BC-BN2) UE selection scheme to select the UEs with better wireless channel quality and larger local model updates for model training in each round. Numerical results demonstrate that it achieves higher learning accuracy than the BC, MAB-BC and MAB-BN2 UE selection schemes under non-IID, Dirichlet-nonIID and Dirichlet-Imbalanced data.


I. INTRODUCTION
CELLULAR-CONNECTED unmanned aerial vehicle (UAV) networks are becoming an integral component of the beyond fifth generation (5G) and upcoming sixth generation (6G) networks [1], [2], providing a variety of advanced applications ranging from real-time video streaming to surveillance. In such networks, the aerial users (UEs), i.e., UAVs, fly over the target area under the control of the base stations (BSs) to collect data (e.g., images and videos), and then collaborate with the BSs to perform data processing for supporting those applications. Recently, machine learning (ML) algorithms, such as deep neural networks (DNNs) and convolutional neural networks (CNNs), have been used effectively to provide efficient data processing for those applications by extracting features and insights from large datasets. However, each UE is only able to collect a sub-dataset that contains partial information of the target area. The conventional approach is to gather the sub-datasets from all the UEs at the BSs for centralized ML model training, known as centralized learning (CL). In this case, the UEs require wide bandwidth and a large amount of energy to transmit their sub-datasets to the BSs, and may potentially reveal private information through the transmission process [3]. In practice, the transmission from UAVs to the BS suffers from limited bandwidth and dynamic wireless channels, and the UAVs are powered by energy-limited batteries; hence, transmitting raw data to the BS is challenging. Due to the growing computational capability of computing engines, such as CPUs, GPUs and DSPs (e.g., the Qualcomm Hexagon Vector eXtensions on the Snapdragon 835 [4]), and the possibility of equipping GPUs on UAVs, the UAVs are able to perform ML model training locally using their own sub-datasets and then share only model parameters, instead of raw data, with the BSs.
Therefore, distributed learning algorithms have emerged to provide ML model training in a distributed manner, and have become an attractive solution for supporting advanced applications in cellular-connected UAV networks.
The two state-of-the-art distributed learning algorithms, federated learning (FL) and split learning (SL), have different learning architectures and are therefore suitable for different application scenarios. In FL, all the UEs collaboratively train an entire ML model (e.g., a DNN) with the help of a central parameter server that collects the local model updates received from the UEs and performs model aggregation [5]. FL architectures rely on the assumption that all of the UEs are capable of performing gradient descent and have powerful computational capabilities. Different from FL, SL was recently proposed in [6] and [7]: the ML model (e.g., a DNN) is split into several sub-models (e.g., a few layers of the entire DNN) at the cut layer, which are distributed to different entities (e.g., the UE-side model at the UEs and the server-side model at the server), facilitating distributed learning via sharing the smashed data of the cut layer. In this case, SL limits the UE-side model to a few layers and thus reduces the computational overhead of the UEs compared to FL. Interestingly, the studies in [7] and [8] have shown that FL is more communication and computation efficient with small model sizes and large dataset sizes, whereas SL is more efficient with an increasing number of UEs or a growing model size. However, in practical UAV networks, the UAVs have diverse computational capabilities, own different datasets (e.g., with imbalanced and non-independent and identically distributed (non-IID) data distributions over them), and have heterogeneous communication and energy resources, so deploying either FL or SL alone may not be efficient. Motivated by this, splitfed learning (SFL) has been proposed in [9], which exploits the parallel model training mechanism of FL and the model splitting structure of SL. By doing so, SFL shortens the training time of SL and becomes more communication efficient than FL when the number of UEs is large.
However, the SFL algorithm still exhibits high communication overhead, similar to SL, when the number of UEs is small and the dataset over the UEs is highly imbalanced. To address this, there is a need for a hybrid solution that leverages the advantages of both FL and SL even for a small number of UEs and highly imbalanced datasets.
When deploying distributed learning algorithms in wireless networks, not all the UEs can access the BS in each communication round due to unreliable and randomly fading wireless channels from the UEs to the BSs, so it is essential to develop efficient UE selection schemes that select reliable and informative UEs to participate in distributed learning in each round. In FL, UE selection schemes have been widely studied [10], [11], [12], [13], [14], [15], where the parameter server determines which UEs should participate in FL according to their channel conditions and resource information (e.g., throughput, computational resources). Generally, UE selection in FL has been studied based either on channel qualities [10], [11], [12] or on the importance of local model updates [13], [14], [15]. In [10], proportional fair UE selection policies based on the instantaneous channel qualities were developed. In [11], a joint learning, wireless resource allocation and UE selection problem was formulated and optimized to minimize the FL loss function. The authors in [12] studied a UE selection scheme that maximizes the number of selected UEs in each round based on their wireless and computational resource conditions. The authors in [13] proposed a reliable UE selection scheme by considering the reliability of the dataset owned by the UEs. Since the reliability of the dataset has a great impact on the importance of local model updates when training an ML model, user selection policies taking into account both channel conditions and the importance of local model updates at the UEs were proposed in [14] and [15].
Nevertheless, the above studies [10], [11], [12], [13], [14], [15] assumed that the UE information, including channel conditions and the importance of local model updates, is known in advance. In practice, it is difficult to obtain accurate UE information before the execution of the learning procedure, and it consumes extra computation and communication resources to estimate each UE's local model updates before UE selection. To address this, dynamic UE selection schemes for FL based on the Multi-Arm Bandit (MAB) framework have been proposed [16], [17], [18], in which the parameter server selects the UEs through exploration and exploitation according to the estimated local model updates of the UEs [16] or their estimated channel qualities [17], [18]. Moreover, as shown in [10], [11], [12], [13], [14], and [15], when deploying FL in wireless networks, both channel qualities and the importance of local model updates are significant for selecting UEs for global model aggregation in each round. However, as far as we know, few existing works have considered both factors together when designing UE selection schemes using the MAB algorithm. Different from considering either factor alone, the MAB algorithm then needs to maximize a weighted sum of both channel qualities and the importance of local model updates, which raises the challenge of finding a trade-off between these two metrics when selecting UEs.
Motivated by the above, we study distributed learning architectures to train ML models for supporting advanced applications, like fire tracking and flood monitoring, in wireless UAV networks. We consider a group of aerial UEs flying over a target area under the control of the BS to collect image data with their equipped cameras. Here, each UAV carries a powerful processing unit (e.g., an NVIDIA Jetson) [19] that may have a different computational capability, and it can only capture a sub-dataset that observes partial information of the target area; thus, the whole dataset collected by all the UAVs may be imbalanced and non-IID. By transmitting the sub-datasets to the BS, immediate data aggregation could give access to the complete environment information captured by all the UEs. However, the transmission of raw data is expensive in terms of energy and bandwidth, and possibly infringes on UE privacy. To address these challenges, we first propose a novel distributed learning architecture, namely the hybrid split and federated learning (HSFL) algorithm, which encompasses the parallel model training mechanism of FL and the model splitting structure of SL. We conduct experiments on an image recognition task using the MNIST dataset and train two different ML models, Net and AlexNet, with the goal of improving learning accuracy and communication efficiency in wireless UAV networks using the proposed distributed learning algorithms.
The main contributions of this work are summarized as follows. The communication efficiency of HSFL is better than that of FL and improves with the increasing number of UEs. We also show that our proposed MAB-BC-BN2 UE selection scheme achieves better learning accuracy than the BC, MAB-BC and MAB-BN2 UE selection schemes under non-IID, Dirichlet-nonIID and Dirichlet-Imbalanced data. The organization of this paper is as follows.
In Section II, we present the system model and learning model, as well as the learning problem formulation in wireless networks. Section III introduces the proposed HSFL algorithm, including its learning procedure and convergence analysis in wireless networks. The UE selection schemes are then illustrated in Section IV. The experiments and simulation results are presented in Section V. Finally, the conclusions are drawn in Section VI.

II. SYSTEM MODEL AND PROBLEM FORMULATION
As illustrated in Fig. 1, we consider a single-cell wireless UAV network consisting of a BS located at the center of the cell and a set of aerial UEs $\mathcal{N} = \{u_1, \ldots, u_N\}$ distributed in the BS coverage area along predefined flight paths, remaining spatially static during the process of ML model training. In this network, the total system bandwidth is equally divided into M radio access channels, where M < N. The BS S is assumed to have a single antenna, to be equipped with high computational capability, and to be located at the origin of the 3D coordinate system with the antenna installed at altitude $h_s$ above the ground. Each UE is also equipped with a single antenna and a lightweight GPU. We assume that all the UEs transmit data with a constant power $P_n = P$. The location of UE $u_n$ is denoted as $(x_n, y_n, h_n)$. Each UE is assumed to fly at the fixed altitude $h_n$ above the ground while its horizontal coordinates $(x_n, y_n)$ vary over time.

A. Channel Propagation Model
From Fig. 1, in the considered cellular-connected UAV networks, only the information including the BS's and UAVs' locations and the type of environment (e.g., rural, suburban, urban, highrise urban, etc.) is available. Note that, in such practical scenarios, one may not have any additional information about the exact locations, heights, and number of obstacles. Therefore, to capture the probability of occurrence of an LoS link as affected by the environment, we adopt the channel model for air-to-ground (ATG) communication in the urban environment presented in [20] and [21]. Here, we model the randomness of LoS communication links using an LoS probability $P^{LoS}_{n,s}$, which depends on the environment, the locations of the UE and BS, as well as the elevation angle. Thus, the LoS probability is given as
$$P^{LoS}_{n,s} = \frac{1}{1 + a\exp\left(-b\left(\theta_{n,s} - a\right)\right)}, \quad (1)$$
where a and b are the environmental parameters indicating the type of environment, like rural, urban or dense urban, and $\theta_{n,s}$ is the elevation angle of the UE-BS communication link. In (1), $\theta_{n,s} = \frac{180}{\pi} \sin^{-1}\left(\frac{h_n - h_s}{dist_{n,s}}\right)$, where $dist_{n,s}$ is the Euclidean distance between UE $u_n$ and BS S, calculated by $dist_{n,s} = \sqrt{x_n^2 + y_n^2 + (h_n - h_s)^2}$. The LoS probability increases with the elevation angle and the UE's altitude.
As stated in [22], the communication paths of ATG channels depend on both LoS and NLoS propagation, and it is impossible to determine the exact LoS/NLoS status of the UE-BS link. Thus, we take the spatial expectation of the pathloss over the LoS and NLoS groups as the pathloss model describing the UE-BS communication channel, which is given by
$$\bar{PL}_{n,s} = P^{LoS}_{n,s}\,\varphi_l \left(\frac{4\pi f\, dist_{n,s}}{c}\right)^{\alpha} + P^{NLoS}_{n,s}\,\varphi_n \left(\frac{4\pi f\, dist_{n,s}}{c}\right)^{\alpha}, \quad (2)$$
where $P^{NLoS}_{n,s} = 1 - P^{LoS}_{n,s}$ is the NLoS probability, f is the system carrier frequency, c is the speed of light, α denotes the path loss exponent, and $\varphi_l$ and $\varphi_n$ are the additional path loss coefficients of LoS and NLoS, respectively.
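To make the channel model concrete, the LoS probability and the expected pathloss above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the paper; the urban parameters a = 9.61, b = 0.16 and the pathloss coefficient values used below are assumptions for demonstration only:

```python
import math

def los_probability(a, b, x_n, y_n, h_n, h_s):
    """LoS probability of the UE-BS link as a function of the elevation angle."""
    dist = math.sqrt(x_n**2 + y_n**2 + (h_n - h_s)**2)
    theta = math.degrees(math.asin((h_n - h_s) / dist))  # elevation angle (deg)
    return 1.0 / (1.0 + a * math.exp(-b * (theta - a)))

def expected_pathloss(a, b, x_n, y_n, h_n, h_s, f, alpha, phi_l, phi_n):
    """Spatial expectation of the pathloss over LoS/NLoS states (linear scale)."""
    c = 3.0e8  # speed of light
    dist = math.sqrt(x_n**2 + y_n**2 + (h_n - h_s)**2)
    p_los = los_probability(a, b, x_n, y_n, h_n, h_s)
    fs = (4.0 * math.pi * f * dist / c) ** alpha  # distance-dependent term
    return p_los * phi_l * fs + (1.0 - p_los) * phi_n * fs

# Illustrative urban parameters; a UE at (100, 100, 100) m, BS antenna at 25 m
p = los_probability(a=9.61, b=0.16, x_n=100.0, y_n=100.0, h_n=100.0, h_s=25.0)
```

As the text notes, raising the UE altitude (and hence the elevation angle) drives the LoS probability toward one.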

B. Problem Formulation
At the BS, the goal is to learn a statistical model over the dataset distributed among the N UEs; that is, the BS aims to obtain an optimal vector ω minimizing an empirical loss function L(ω) (e.g., $L(x^T\omega) = \frac{1}{2}\|y - \varphi(\omega^T x)\|^2$) using the dataset distributed over all the UEs under its service. The local loss function of $u_n$, which measures the prediction error on its local dataset $D_n$, with $d_n = |D_n|$ denoting the data size, can be defined as
$$L_n(\omega) = \frac{1}{d_n} \sum_{x_n^i \in D_n} l(\omega, x_n^i), \quad (3)$$
where $l(\omega, x_n^i)$ is an empirical loss function defined by the learning task, which quantifies the loss of the ML model at sample $x_n^i$. The objective of the considered learning task is to find the optimal model weights $\omega^*$ that minimize the global loss function L(ω) [23] as
$$OP_1: \quad \omega^* = \arg\min_{\omega} L(\omega). \quad (4)$$
To solve the optimization problem $OP_1$, two distributed learning approaches, FL and SL, can be used to train the ML model by exploiting the computational capabilities of the UEs in a distributed manner. However, FL places higher requirements on the computational resources of the UEs, and SL incurs higher communication overhead when the local dataset at a UE is large.
To efficiently obtain the solution to OP 1 with the dataset distributed over the heterogeneous UEs, we propose a novel distributed learning architecture, namely the HSFL algorithm, which keeps the parallel model training mechanism of FL and the model splitting structure of SL.

C. The FL and the SL Preliminaries
In this section, we present the learning procedures of using FL or SL to solve the optimization problem OP 1 .
1) FL Algorithm: To solve the optimization problem $OP_1$ using FL, we can convert $OP_1$ to
$$OP_2: \quad \min_{\omega} \sum_{n=1}^{N} \frac{d_n}{d} L_n(\omega),$$
where $d = \sum_{n=1}^{N} d_n$ is the size of the whole dataset. The Federated Averaging algorithm proposed in [5] can be applied to solve $OP_2$; its general learning procedure is illustrated in Fig. 2(a). Each UE receives the global model $\omega_t$ from the BS and trains it on its local dataset by minimizing the local loss function (3) via gradient descent, so that the global model $\omega_t$ is updated at UE $u_n$ to $\omega_{t+1}^n$; the local model update can thus be defined as
$$\Delta\omega_t^n = \omega_t - \omega_{t+1}^n.$$
The BS periodically collects the local model updates from the UEs, performs model aggregation to generate the improved global model, and sends it back to the UEs. The whole process, defined as one communication round, repeats for a sufficient number of rounds until the objective function converges to the global optimum.

2) SL Algorithm: The SL algorithm is another state-of-the-art distributed learning technique that does not require direct access to the raw data. Unlike FL, where each UE trains the entire ML model, SL divides the ML model into at least two sub-models and trains them separately at the UE and the BS. Fig. 2(b) illustrates the SL framework with multiple UEs in centralized mode [6], where each UE holds a fraction of the dataset, $D_n$, and participates sequentially in training the model to minimize the global loss function L(ω). From Fig. 2(b), the ML model is divided into two sub-models at the cut layer C: the first sub-model is trained at the UEs, termed the UE-side model $\omega_t^l$, whereas the second sub-model is trained at the BS, termed the BS-side model $\omega_t^e$. As such, each UE only needs to train a sub-model consisting of a few layers while the rest of the layers reside at the BS, which reduces the computational load of each UE.
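Before detailing the SL procedure, the Federated Averaging mechanism of 1) can be illustrated with a toy NumPy sketch. This is not the paper's experimental code: the least-squares local loss, dataset sizes, learning rate and number of local epochs are hypothetical stand-ins chosen only to show the update-and-aggregate cycle:

```python
import numpy as np

def local_update(w_global, X, y, lr=0.1, epochs=5):
    """One federated training UE: gradient descent on a local least-squares loss."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w - w_global  # local model update sent to the BS

def fedavg_round(w_global, datasets):
    """BS: aggregate local updates weighted by the dataset size ratios d_n / d."""
    d = sum(len(y) for _, y in datasets)
    delta = sum(len(y) / d * local_update(w_global, X, y) for X, y in datasets)
    return w_global + delta

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
datasets = []
for _ in range(4):                      # four UEs, each with a local sub-dataset
    X = rng.normal(size=(50, 2))
    datasets.append((X, X @ w_true))

w = np.zeros(2)
for _ in range(100):                    # communication rounds
    w = fedavg_round(w, datasets)
```

On this noiseless toy problem the aggregated model converges to the data-generating weights after a sufficient number of rounds.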
To solve the optimization problem $OP_1$ with SL, we can convert it to
$$OP_3: \quad \min_{\omega} \sum_{n=1}^{N} \frac{d_n}{d} L_n(\omega), \quad \text{with } \omega_t = [\omega_t^l; \omega_t^e],$$
where the full model ω consists of the two sub-models $\omega_t^l$ and $\omega_t^e$. If we use the classical sequential SL mechanism to solve $OP_3$, the learning procedure is as follows: 1) the BS initializes the global BS-side model $\omega_t^e$ and the global UE-side model $\omega_t^l$, then sends $\omega_t^l$ to UE $u_1$; 2) UE $u_1$ trains $\omega_t^l$ over its local dataset $D_1$ and then sends the output of the cut layer C, $a_t^1$, to the BS; 3) the BS feeds $a_t^1$ forward through the BS-side model $\omega_t^e$, calculates the loss, and back-propagates it to the cut layer C, where its gradients $g_t^1$ are computed; 4) the BS sends $g_t^1$ back to UE $u_1$ for back propagation; UE $u_1$ then updates the UE-side model $\omega_t^{1,l}$ and sends it back to the BS; and 5) UE $u_2$ receives the UE-side model $\omega_t^{1,l}$ from the BS and then starts training on its local dataset. This repeats until the training of the last UE is finished, which completes one communication round.
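The sequential split training steps above can be sketched with a two-layer linear model, where the first layer plays the role of the UE-side model and the second the BS-side model. This is an illustrative NumPy sketch under assumed dimensions and learning rate, showing the exchange of cut-layer activations and gradients; a single UE is shown for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)

# UE-side and BS-side sub-models split at the cut layer (a two-layer linear net)
w_ue = rng.normal(scale=0.5, size=(4, 8))   # UE-side model
w_bs = rng.normal(scale=0.5, size=(8, 1))   # BS-side model

def split_round(w_ue, w_bs, X, y, lr=0.05):
    """One split-training step between a UE and the BS."""
    a = X @ w_ue                      # step 2: UE forward pass; smashed data a_t
    pred = a @ w_bs                   # step 3: BS completes the forward pass
    err = (pred - y) / len(y)
    loss = float(np.mean((pred - y) ** 2))
    grad_bs = a.T @ err               # BS-side gradient
    g_cut = err @ w_bs.T              # step 3: gradients g_t at the cut layer
    grad_ue = X.T @ g_cut             # step 4: UE back-propagates g_t
    return w_ue - lr * grad_ue, w_bs - lr * grad_bs, loss

X = rng.normal(size=(32, 4))
y = X @ rng.normal(size=(4, 1))
losses = []
for _ in range(200):                  # one UE shown; in SL the UEs take turns
    w_ue, w_bs, loss = split_round(w_ue, w_bs, X, y)
    losses.append(loss)
```

Note that only the cut-layer activations and gradients cross the UE-BS link in each step, never the raw samples.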

D. UE Selection
When applying the FL algorithm in wireless networks, the limited bandwidth and dynamic communication channels make the BS unable to access all the UEs in each round. Additionally, different local model updates are of dissimilar importance to model convergence [14], [15]. Therefore, it is essential to develop efficient UE selection schemes that select a subgroup of UEs providing the most useful information in each round. Due to the parallel model training mechanism inherited from FL, applying the proposed HSFL algorithm in wireless networks also requires an efficient UE selection scheme. Inspired by [14], we propose a MAB-BC-BN2 UE selection scheme that jointly takes into account both channel qualities and the importance of local model updates. For comparison, we also implement the BC and BN2 UE selection schemes proposed in [14] as benchmarks, and further propose the MAB-BC and MAB-BN2 UE selection schemes.

III. A NOVEL DISTRIBUTED LEARNING ARCHITECTURE: HSFL ALGORITHM
In this section, we present our proposed novel distributed learning architecture, the HSFL algorithm, which exploits the advantageous learning mechanisms of both FL and SL. In the following, we first introduce its learning procedure, and then propose a wireless HSFL algorithm with its convergence analysis.

A. HSFL Learning Procedure
Inspired by [9] and [8], we propose a novel HSFL algorithm with the detailed learning procedure illustrated in Fig. 3. Consider the UE set $\mathcal{U} = \{u_1, \ldots, u_n, u_{n+1}, \ldots, u_N\}$ with diverse computational capabilities, channel qualities and energy resources, where the dataset owned by the UEs is imbalanced and non-IID. In our proposed HSFL algorithm, the UEs with lower computational capability implement the split training method, while the remaining selected UEs, for which federated training incurs less communication overhead when the local dataset is large, use the federated training method. The UEs then perform local model training in parallel and send their local model updates to the BS, which performs model aggregation and generates the new global models.
The detailed steps are given in Fig. 3. Here, if $u_1$ and $u_N$ are scheduled for federated training, they receive the global model from the BS and train it locally in parallel, as in FL. For the split training UEs, this process continues to update the BS-side model and the UE-side models with the activations received from all the split training UEs in sequence. Since each update of the BS-side model builds on the previously updated BS-side model, the UEs starting from $u_2$ obtain more local model updates.
Therefore, the local model updates of the split training UE $u_n$ are given by
$$\Delta\omega_t^n = \omega_t^n - \omega_{t+1}^n,$$
where $\omega_t^1 = \omega_t$, and the gradients of each UE $u_n$, $\forall n \in \mathcal{N}_{S_t}$, are calculated by $g_t^n = \nabla L_n(\omega_t^n)$.

3) Model Aggregation of HSFL:
Accordingly, the new global model is updated at the BS by performing model aggregation over all the local model updates obtained from both the federated training UEs and the split training UEs. The average local model updates of the federated training UEs $u_n$, $n \in \mathcal{N}_{F_t}$, are given by
$$\Delta\bar{\omega}_t^F = \sum_{n \in \mathcal{N}_{F_t}} \frac{d_n}{d_F}\,\Delta\omega_t^n,$$
and the average local model updates of the split training UEs $u_n$, $n \in \mathcal{N}_{S_t}$, are calculated by
$$\Delta\bar{\omega}_t^S = \sum_{n \in \mathcal{N}_{S_t}} \frac{d_n}{d_S}\,\Delta\omega_t^n,$$
where $d_F$ and $d_S$ denote the total dataset sizes of the federated and split training UEs, respectively. Since the BS-side models can be trained sequentially at the BS with the received activations from the split training UEs, the UEs $u_n$, $n \in \{2, \ldots, N_S\}$, receive more local model updates than if they were trained in parallel. This means that if the same number of UEs are trained with federated training or with split training, the latter provides more local model updates in each round. Therefore, the model aggregation in the HSFL algorithm is derived as
$$\omega_{t+1} = \omega_t - \sum_{n \in \mathcal{N}_{F_t}} p_n\,\Delta\omega_t^n - \sum_{n \in \mathcal{N}_{S_t}} p_n\,\Delta\omega_t^n, \quad (11)$$
where $p_n = d_n/d$. The corresponding steps of Algorithm 1 read: 7: if $n \in \mathcal{K}_F$ then UE $u_n$ computes $\Delta\omega_t^n$ as in (9); 8: else if $n \in \mathcal{K}_S$ then 9: UE $u_n$, collaborating with the BS, computes $\Delta\omega_t^n$ as in (10), transmitting the activations and gradients of the cut layer in the uplink and downlink; 10: end if; 11: end for; 12: the BS computes the new global model as in (11); 13: set t = t + 1; 14: until the desired convergence performance is achieved or the final iteration arrives.
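The HSFL model aggregation can be illustrated with a short sketch: the BS combines the local model updates of the federated and split training UEs, weighted by their dataset size ratios p_n = d_n/d. The update sign convention and any additional scaling (e.g., under partial UE participation) follow the paper's equations; the UE names and update values below are purely illustrative:

```python
import numpy as np

def hsfl_aggregate(w_global, fed_updates, split_updates, data_sizes):
    """BS aggregation over the local updates of the federated training UEs
    and the split training UEs, weighted by p_n = d_n / d."""
    d = sum(data_sizes.values())
    delta = np.zeros_like(w_global)
    for n, dw in {**fed_updates, **split_updates}.items():
        delta += (data_sizes[n] / d) * dw
    return w_global + delta

w = np.zeros(3)
fed = {"u1": np.array([1.0, 0.0, 0.0]), "u2": np.array([0.0, 1.0, 0.0])}
spl = {"u3": np.array([0.0, 0.0, 2.0])}
sizes = {"u1": 10, "u2": 10, "u3": 20}
w_new = hsfl_aggregate(w, fed, spl, sizes)
```

Here the split training UE u3 owns half the data, so its update carries weight 0.5 in the aggregate.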

B. Wireless HSFL Algorithm
In the considered wireless UAV networks, the UEs and BS collaboratively train the ML model for accomplishing the object recognition task based on the transmission of model parameters with dynamic and randomly fading wireless channels. Considering the diversity of the UEs with different computational capabilities, dataset distribution and channel conditions, the HSFL algorithm is needed. The general procedure of the wireless HSFL algorithm is summarized in Algorithm 1.

C. Convergence Analysis
In this section, we present a fundamental convergence analysis for our proposed wireless HSFL algorithm, where only a subset $\mathcal{K}$ of UEs is selected to participate in the global model training in each communication round due to the limited bandwidth and unreliable wireless communication links. We analyze the convergence performance of the proposed HSFL algorithm under non-IID data [24] with the random UE selection scheme. We first present the preliminaries and assumptions, and then obtain the convergence result.

1) Preliminaries:
The optimal solution of the global loss function L(ω) in (4) is defined as
$$\omega^* = \arg\min_{\omega} L(\omega),$$
so the minimum loss is $L^* \triangleq L(\omega^*)$. Similarly, the minimum loss of UE $u_n$ is denoted by $L_n^* = \min_{\omega} L_n(\omega)$. Then the local-global objective gap is defined as
$$\Phi = L^* - \sum_{n=1}^{N} p_n L_n^*,$$
where Φ is nonzero and quantifies the degree of non-IID data; its magnitude reflects the heterogeneity of the data distribution, that is, a larger Φ implies higher data heterogeneity over the UEs. If the data is IID, then Φ goes to zero as the number of samples grows.

2) Assumptions:
We make the following assumptions on the loss functions and the stochastic gradients.
Assumption 1: The local loss functions $L_1, \ldots, L_N$ are all L-smooth.
Assumption 2: The local loss functions $L_1, \ldots, L_N$ are all μ-strongly convex.
Assumption 3: Let $\xi_t^n$ denote the random data sample from UE $u_n$. The variance of the stochastic gradients at each UE is bounded: $\mathbb{E}\left\|\nabla L_n(\omega_t^n, \xi_t^n) - \nabla L_n(\omega_t^n)\right\|^2 \le \delta_n^2$, for $n = 1, \ldots, N$.
Assumption 4: The expected squared norm of the stochastic gradients is uniformly bounded, i.e., $\mathbb{E}\left\|g_n(\omega_n, \xi_n)\right\|^2 \le G^2$, for $n = 1, \ldots, N$.
3) Convergence Result: As discussed before, only a subset of UEs $\mathcal{K}_t$ is selected to join the global model training in each communication round t. To establish the convergence bound, we first make an assumption on the selected UEs.
Assumption 5: $\mathcal{K}_t$ is a subset of K UEs randomly sampled without replacement from the available UE set $\mathcal{N}_t$ of N UEs, so that the probability of each UE being selected to contribute to global training is $P = \frac{K}{N}$. The dataset is assumed to be non-IID and balanced in the sense that $p_1 = \cdots = p_N = \frac{1}{N}$; thus the model aggregation at the BS is performed as
$$\omega_{t+1} = \omega_t - \frac{N}{K}\left(\sum_{n \in \mathcal{K}_{F_t}} p_n\,\Delta\omega_t^n + \sum_{n \in \mathcal{K}_{S_t}} p_n\,\Delta\omega_t^n\right).$$
Theorem 1: Let Assumptions 1, 2, 3, 4 and 5 hold, assume a decaying learning rate $\eta_t = \frac{2}{\mu(t + \iota)}$ with $\iota = \frac{4L}{\mu}$, and let $\kappa = \frac{L}{\mu}$; then the proposed HSFL algorithm with K UEs selected for participation satisfies a convergence bound that decays on the order of $1/t$, with constant terms determined by the gradient variances $\delta_n^2$, the gradient bound $G^2$, and the non-IID gap Φ. Proof: The proof is presented in the Appendix. From Theorem 1, we conclude that the proposed HSFL algorithm converges as the total number of communication rounds increases. Moreover, the convergence performance depends only weakly on the number of selected UEs K, but the convergence speed increases with the number of split training UEs $N_S$.

IV. USER SELECTION
In this section, we present the details of our proposed UE selection schemes. When training the ML model with the dataset distributed over the diverse UEs in wireless UAV networks, the limited bandwidth and dynamic communication channels prevent the BS from accessing all the UEs in each round. Additionally, different local model updates are of dissimilar importance to model convergence [14], [15]. Therefore, it is essential to develop efficient UE selection schemes that select a subgroup of UEs providing the most useful information in each round.

A. UE Selection Scheme
The channel quality and the importance of local model updates are two key concerns when developing UE selection schemes; based on these two metrics, the authors in [14] developed the BC and BN2 UE selection schemes.

1) BC UE Selection Scheme:
In this scheme, the BS does not need any information about the local model updates of the UEs, and simply selects the $K \le N$ UEs with the best channel qualities from the available UE set $\mathcal{N}$ in round t:
$$\mathcal{K}_t = \underset{\mathcal{K} \subseteq \mathcal{N},\, |\mathcal{K}| = K}{\arg\max} \sum_{n \in \mathcal{K}} \gamma_t^n,$$
where $\gamma_t^n$ denotes the SNR of UE $u_n$ in round t.

2) BN2 UE Selection Scheme:
This scheme requires an extra estimation phase: the BS requires all the UEs to compute their local model updates $\Delta\omega_t^n$ and send back $\|\Delta\omega_t^n\|^2$, representing the importance of the local model updates. Then the BS selects the K UEs with the largest $\|\Delta\omega_t^n\|^2$ in round t:
$$\mathcal{K}_t = \underset{\mathcal{K} \subseteq \mathcal{N},\, |\mathcal{K}| = K}{\arg\max} \sum_{n \in \mathcal{K}} \|\Delta\omega_t^n\|^2.$$
The authors in [14] then proposed a UE selection scheme that jointly considers channel qualities and the importance of local model updates, which provides better long-term performance than scheduling policies based on either metric alone. However, in practice, it is difficult to obtain accurate channel conditions and local model update information before the learning procedure is conducted; it also consumes extra computation and communication resources to estimate each UE's local model updates in the estimation phase. Fortunately, the dynamic MAB-based UE selection scheme [16] can address this problem by selecting UEs according to estimated information using a trial-and-error rule, removing the need for a pre-estimation step in each training round. Hence, in this section, we exploit the MAB algorithm to solve the dynamic UE selection problem by jointly considering channel qualities and the importance of local model updates.
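The BC and BN2 rules reduce to top-K selection on a single metric, as the following sketch shows (the SNR and update-norm values are illustrative; in BN2 the norms would come from the extra estimation phase):

```python
import numpy as np

def select_bc(snr, K):
    """BC: pick the K UEs with the best instantaneous channel quality."""
    return set(np.argsort(snr)[-K:])

def select_bn2(update_norms, K):
    """BN2: pick the K UEs with the largest squared local update norms."""
    return set(np.argsort(update_norms)[-K:])

snr = np.array([3.1, 0.4, 7.9, 2.2, 5.6])       # per-UE SNRs in this round
norms = np.array([0.2, 1.5, 0.1, 0.9, 0.3])     # per-UE ||delta w||^2 values
bc_choice = select_bc(snr, K=2)                  # UEs with indices {2, 4}
bn2_choice = select_bn2(norms, K=2)              # UEs with indices {1, 3}
```

The two rules can disagree sharply, as here, which motivates combining both metrics.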

B. MAB-Based UE Selection Scheme
In this section, we present our proposed MAB-based UE selection scheme, which formulates the UE selection in the wireless HSFL algorithm as a MAB problem, and uses the discounted UCB policy to estimate the UEs with expected larger local model updates and better channel quality. This scheme
provides an exploitation-exploration trade-off to select UEs with both larger local model update and better channel quality (i.e., exploitation) as that leads to faster convergence [25], and also to ensure UE diversity (i.e., exploration) [16].
Knowing that the importance of local model updates $\|\Delta\omega_t^n\|^2$ and the channel quality $\gamma_t^n$ of UE $u_n$ are non-stationary across communication rounds, we apply the discounted MAB algorithm [26]. The discounted UCB algorithm was adapted for UE selection in [16] by measuring the local loss values of the UEs, and achieved good performance. Therefore, we first propose the MAB-BC and MAB-BN2 UE selection schemes by modifying the discounted UCB algorithm to take into account the channel qualities and the local model updates, respectively. To jointly consider both of them, we then propose a novel MAB-BC-BN2 UE selection scheme.
Our proposed MAB-BC-BN2 UE selection scheme is based on the UCB policy, which makes decisions depending on the UCB score. It performs exploration by selecting UEs that have been selected less often, and exploitation by selecting the UEs with the largest reward. We view the UEs as the arms in the MAB problem and separately compute the discounted cumulative values of $\|\Delta\omega_t^n\|^2$ and of the SNRs, i.e., $\Omega_t^n(\lambda)$ and $\Gamma_t^n(\lambda)$, as the cumulative rewards, together with a discounted count of the number of times each UE has been selected, $M_t^n(\lambda)$, up to communication round t.
Thus, the discounted UCB score for each UE $u_n$ in communication round t is defined as
$$UCB_t^n = p_n\, f(K, n) + \sqrt{\frac{2\sigma_t^2 \log(T_t(\lambda_i))}{M_t^n(\lambda_i)}},$$
where $p_n$ is the dataset size ratio of UE $u_n$, λ denotes the discount factor, and $f(K, n)$ is the UCB index function.

1) MAB-BC UE Selection Scheme:
If we only consider the UE's channel quality as the reward in the considered MAB problem, the UCB index function $f(K, n)$ is given by
$$f(K, n) = \frac{\Gamma_t^n(\lambda_c)}{M_t^n(\lambda_c)}.$$
2) MAB-BN2 UE Selection Scheme: If we only consider the importance of the UE's local model updates as the reward in the considered MAB problem, the UCB index function $f(K, n)$ is given by
$$f(K, n) = \frac{\Omega_t^n(\lambda_l)}{M_t^n(\lambda_l)}.$$
3) MAB-BC-BN2 UE Selection Scheme: In this scheme, by jointly considering channel conditions and the importance of local model updates as the reward, the UCB index function $f(K, n)$ is defined as
$$f(K, n) = \beta\,\frac{\Omega_t^n(\lambda_l)}{M_t^n(\lambda_l)} + (1 - \beta)\,\frac{\Gamma_t^n(\lambda_c)}{M_t^n(\lambda_c)}. \quad (23)$$
In (23), the two terms represent the importance of local model updates and the channel quality, respectively, and β is the balance factor between them.
Here, the discount factor 0 ≤ λ_i ≤ 1 indicates the significance of stale values: λ_i = 1 means all past rewards contribute equally to the calculation of Ω_n^t(λ_l) and Γ_n^t(λ_c), while λ_i = 0 means only the latest reward is used to estimate the value. Thus, 0 < λ_i < 1 puts less weight on stale rewards when calculating Ω_n^t(λ_l) and Γ_n^t(λ_c), which mitigates both the noise in the latest evaluation and the influence of outdated rewards on the estimated values. In practice, the discount factors in the two terms Ω_n^t(λ_l) and Γ_n^t(λ_c) can be set to different values, since past rewards may have different impacts on the channel qualities and on the significance of the local model updates: λ_c can be set based on the empirical fluctuation of the wireless channels, while λ_l can be set based on the dataset distributions over the UEs. In the exploration term √(2σ_t² log T_t(λ_i) / M_n^t(λ_i)), σ_t is a hyper-parameter controlling the degree of exploration, defined as the maximum standard deviation of the rewards computed over the latest updates of the UEs. If a UE has not been selected very often, or not at all, then M_n^t(λ_i) will be small, so the exploration term will be large, making this UE more likely to be selected. As time progresses, the exploration term gradually decreases (since (log n)/n → 0 as n → ∞) until eventually UEs are selected based only on the exploitation term. We therefore propose the MAB-BC-BN2 UE selection algorithm to realize this MAB-based UE selection scheme, with the detailed process provided in Algorithm 2.
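To make the scoring concrete, the following sketch computes the discounted UCB score of each UE from the quantities defined above. The function name, the guard against a zero selection count, and the default values of β and σ_t are our illustrative assumptions, not the paper's implementation.

```python
import math

def mab_bc_bn2_scores(p, omega, gamma, m_count, t_total,
                      beta=0.5, sigma_t=1.0):
    """Discounted UCB score per UE: exploitation p_n * f(K, n) plus
    exploration sqrt(2 sigma_t^2 log T_t(lambda) / M_n^t(lambda)).

    p[n]      : dataset size ratio of UE n
    omega[n]  : discounted cumulative reward from ||dw||_2 (Omega)
    gamma[n]  : discounted cumulative reward from SNR (Gamma)
    m_count[n]: discounted count of times UE n was selected (M)
    t_total   : discounted total number of selections T_t(lambda)
    """
    scores = []
    for n in range(len(p)):
        m = max(m_count[n], 1e-8)  # rarely selected UEs get a huge bonus
        # exploitation: balance update importance and channel quality
        f = beta * omega[n] / m + (1.0 - beta) * gamma[n] / m
        explore = math.sqrt(2.0 * sigma_t ** 2 * math.log(t_total) / m)
        scores.append(p[n] * f + explore)
    return scores
```

The BS would then pick the K UEs with the largest scores, e.g., `sorted(range(N), key=scores.__getitem__, reverse=True)[:K]`.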

V. EXPERIMENTS AND NUMERICAL RESULTS
In this section, we evaluate the learning performance of our proposed HSFL algorithm and compare it with the CL algorithm and state-of-the-art distributed learning algorithms, including the FL, SL and SFL algorithms, by simulating an image recognition task in wireless UAV networks using the classical MNIST dataset [27]. This image classification task, which relies on aerial UEs, i.e., UAVs, to collect the dataset, has been investigated in many practical scenarios, such as mapping applications [28] and damage assessment for post-disaster analysis [29]. We then compare the performance of different UE selection schemes for selecting UEs to join model training with the wireless HSFL algorithm under IID, non-IID, Dirichlet-nonIID and Dirichlet-ImD data.

A. Experiment Environment
The experiments are conducted on a laptop with one NVIDIA RTX 2070 GPU and an Intel i7-10750H CPU, where the BS's program runs on the GPU and the UEs' programs run on the CPU. We consider training two different DNN models, Net and AlexNet, on the MNIST dataset; their architectures are shown in Table I. For all the experiments using the SL, SFL and HSFL algorithms, the DNN is split at the second layer, i.e., after the first conv1 layer. To verify the learning performance of the proposed HSFL framework, we simulate a wireless UAV network with one BS located at the origin of the cell and multiple UAVs uniformly distributed within the cell. The cell radius is 500 m, the height of the BS antenna is 20 m, and the UAVs' flying heights are in the range of 20-80 m. The detailed simulation parameters of the UAV network are provided in Table II.

Algorithm 2 MAB-BC-BN2 UE Selection Algorithm
Require: K, K_S, K_F, β, λ, p_n for n ∈ N
Initialization: Randomly select K_0, K_{S0} and K_{F0}; a list A of length N; t = 1
Learning:
1: for t ≤ T do
2:   for n ∈ K do
3:     The BS distributes ω_t to u_n, n ∈ K_{F,t−1}, and ω_{t,l} to u_n, n ∈ K_{S,t−1}.
4:     The UEs train the global model on their local datasets.
5:     The UEs compute the l2-norm ‖Δω_n^t‖_2 of the local model update as in (9) and (10), and upload it to the BS.
6:   end for
7:   The BS receives the local model updates and measures the received SNR γ_n^t of each UE.
8:   The BS computes the discounted UCB score of each UE and selects the K UEs with the highest scores for the next round.
9: end for
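Step 5 of Algorithm 2 requires each UE to compute the l2-norm of its local model update before uploading it. A minimal sketch, assuming the model weights are kept as a dict mapping layer names to NumPy arrays (a stand-in for a framework state dict; the function and layer names are ours):

```python
import math
import numpy as np

def update_l2_norm(w_new, w_old):
    """||dw||_2 over all layers of a model stored as
    {layer_name: weight_array}; layer names are illustrative."""
    sq = 0.0
    for name, w in w_new.items():
        diff = np.asarray(w) - np.asarray(w_old[name])
        sq += float(np.sum(diff * diff))  # accumulate squared entries
    return math.sqrt(sq)
```

With a real framework one would iterate over the model's state dict in the same way, flattening each layer's update into the single scalar norm.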

B. Learning Performance Comparisons
In this section, the learning performance of our proposed wireless HSFL algorithm is studied in terms of test accuracy, training time and communication overhead. We adopt the BC UE selection scheme to select K = 10 UEs from N = 100 UEs for training in each round, and set K_S = K_F = 5 for the HSFL algorithm. The IID and non-IID data follow the same settings as in [5]. We set the number of local training rounds τ = 5 and the batch size b = 10. The local learning (LL) and CL algorithms are also simulated as benchmarks. In the LL algorithm, each UE trains the DNN model locally without sharing any raw data or model parameters with the BS. In the CL algorithm, all the UEs send their raw data to the BS for centralized training.
1) Learning Accuracy Performance: Figs. 4 and 5 present the learning accuracy of Net and AlexNet, respectively. CL achieves the highest learning accuracy when training both ML models. From Fig. 4, we can observe that the HSFL algorithm provides test accuracy similar to SL (whose sequential training can be viewed as close to centralized learning) and better than FL and SFL under both IID and non-IID data. This is because in the HSFL algorithm half of the UEs perform split training, which brings the superiority in test accuracy. Fig. 5 shows that the learning accuracy of the different algorithms when training AlexNet follows a trend similar to that for Net in Fig. 4, while requiring fewer communication rounds to converge. From Fig. 5, the HSFL algorithm again achieves better learning accuracy than FL and SFL, and it converges faster.

2) Training Time and Communication Overhead:
The training time consists of two parts, computation time and communication time. The computation time is measured using the time module in Python, where the UEs' local training runs on the CPU and the BS's model aggregation and training run on the GPU of the laptop described above. The communication time is calculated by simulating the transmission of model parameters through the wireless channels of the UAV network; the simulation parameters of the wireless links are given in Table II.
The communication overhead mainly consists of two parts: the model parameters, and the smashed data, i.e., the activations and gradients of the cut layer in split training. The model size is calculated from its number of parameters, each represented by a standard 32-bit floating-point value. The size of the activations and gradients is computed from the output size of the cut layer.
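A sketch of this accounting follows; the function names and the example cut-layer shape are illustrative assumptions, not values from the paper.

```python
def model_bits(num_params, bits_per_param=32):
    """Overhead of uploading full model parameters (FL-style),
    with each parameter as a standard 32-bit float."""
    return num_params * bits_per_param

def smashed_bits(num_samples, cut_shape, bits_per_value=32):
    """Overhead of the cut-layer smashed data (SL-style): forward
    activations plus backward gradients of the same shape."""
    values = num_samples
    for dim in cut_shape:        # e.g. (channels, height, width)
        values *= dim
    return 2 * values * bits_per_value
```

This makes the trade-off visible: the FL-style cost scales with the model size, while the SL-style cost scales with the local dataset size passing through the cut layer.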
We consider four scenarios with N = 10, 50, 100, 200 UEs, from which K = 1, 5, 10, 20 UEs, respectively, are selected in each round for model aggregation on IID and non-IID data. In HSFL, the number of split-training UEs in the four scenarios is set to K_S = 0/1, 2, 5, 10; note that when K = 1, the single UE performs either split training (K_S = 1) or federated training (K_S = 0). We set B_w = 1 MHz, which is shared by the UEs selected in each round, with each UE allocated the same bandwidth. SL adopts sequential training, where only one UE occupies the whole bandwidth at a time in all the considered scenarios. In contrast, FL, SFL and HSFL train in parallel, so all the UEs selected in each round share the whole bandwidth.
Fig. 6(a) shows the total training time for the four scenarios N = 10, 50, 100, 200 when training the Net model. CL has the highest training time in all scenarios because all the UEs have to upload their raw datasets to the BS. The training time of FL increases with the number of UEs because the bandwidth allocated to each UE decreases. SL and SFL exhibit similar training time since the total bandwidth is fixed and the training time mainly depends on the communication latency. Compared to SL and SFL, HSFL spends less training time because only half of the selected UEs share the total bandwidth while performing split training, which reduces the communication latency in each communication round. Likewise, Fig. 7(a) shows the total training time when training AlexNet.
In this case, the total training time of FL increases significantly with the number of UEs. Since the model size of AlexNet is larger than that of Net, the communication overhead grows considerably as more UEs send their model parameters to the BS. Nevertheless, the proposed HSFL algorithm has the shortest training time among the distributed algorithms, except local learning, for the same reason as when training Net. Moreover, we can observe that the training time of FL exceeds that of SL, SFL and even CL when the dataset is distributed over 200 UEs.
Fig. 6(b) plots the total communication overhead per round when training Net. CL has the highest communication overhead of all the algorithms in this case. However, when training AlexNet, the communication overhead of CL becomes less than that of most distributed learning algorithms, because the larger model size of AlexNet causes large communication overhead for the distributed algorithms that transmit model parameters. The communication overhead of FL increases over N = 10, 50, 100, 200 UEs because more UEs need to transmit their model parameters. The communication overheads of SL and SFL are almost the same and remain unchanged as the number of UEs increases, because their communication overhead, i.e., the activations and gradients of the cut layer, is determined by the size of the local dataset at each UE. Notably, HSFL has almost half the communication overhead of SL and SFL in each scenario, since only half of the selected UEs perform split training while the other half perform federated training. Fig. 7(b) shows the total communication overhead per round for the different distributed learning algorithms when training AlexNet. In this case, we can see that FL becomes less communication-efficient than SL and SFL once the number of UEs reaches 100, whereas it is more communication-efficient than SL and SFL when training Net, as shown in Fig. 6(b).
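The communication latency underlying these comparisons can be sketched with a Shannon-capacity rate model. The payload size and SNR below are illustrative placeholders, not values from Table II.

```python
import math

def comm_time(payload_bits, bandwidth_hz, snr_linear):
    """Transmission time at the Shannon rate r = B log2(1 + SNR)."""
    return payload_bits / (bandwidth_hz * math.log2(1.0 + snr_linear))

B_w, K, snr = 1e6, 10, 100.0     # 1 MHz shared by K selected UEs
payload = 8e6                     # illustrative per-UE payload in bits
# FL/SFL/HSFL style: parallel uploads, each UE gets B_w / K
t_parallel = comm_time(payload, B_w / K, snr)
# SL style: sequential, each UE uses the whole band in turn
t_sequential = K * comm_time(payload, B_w, snr)
```

Under equal payloads and SNRs the two round times coincide; the differences seen in Figs. 6 and 7 therefore come from the different payloads (full model parameters vs. cut-layer smashed data) and from HSFL letting only the split-training half of the UEs share the band.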

C. Performance Comparisons of UE Selection Schemes
In this section, the learning accuracy of our proposed MAB-BC-BN2 UE selection scheme is evaluated over non-IID and imbalanced data with the wireless HSFL and wireless FL algorithms. We set N = 30, τ = 5 and b = 64. We consider three non-IID data distribution settings: 1) For non-IID, following settings similar to [5], the dataset is first sorted by digit label and divided into 60 shards of size 1000, and each of the 30 UEs is assigned 2 shards. 2) For Dirichlet-nonIID, the whole dataset is partitioned among the 30 UEs following the Dirichlet distribution Dir(α_d) [30], where a smaller α_d indicates larger data heterogeneity across UEs; we set α_d = 0.01. 3) For Dirichlet-ImD, we construct an imbalanced data partition among the 30 UEs using the Dirichlet distribution Dir(α_d, α_imd), where a smaller α_imd indicates a more imbalanced dataset size across UEs; we set α_d = 0.1 and α_imd = 2. Fig. 8 plots the test accuracy of different UE selection schemes in the wireless FL and HSFL algorithms. In both FL and HSFL, we can observe that the BN2 UE selection scheme achieves the best test accuracy because it takes an extra round to estimate the importance of the local model updates of all the UEs, so that the BS can select the subset of UEs with the largest local model updates to participate in training in each round. In contrast, the BC UE selection scheme has the worst learning performance, since it always selects the UEs with the best channel qualities and neglects the importance of their local model updates. The MAB-BC-BN2 scheme, which jointly considers both the channel conditions and the importance of the local model updates, achieves test accuracy similar to the BN2 scheme. The MAB-BN2 and MAB-BC UE selection schemes show lower test accuracy than the MAB-BC-BN2 scheme. This is because in MAB-BN2, the selected UEs with large local model updates may fail to upload them due to bad channel conditions.
Conversely, in the MAB-BC scheme, the selected UEs with good channel conditions may have small local model updates. Note that MAB-BC outperforms the BC scheme because MAB-BC adopts the exploitation-exploration rule, which enables it to explore UEs with less favorable channel conditions and thus increases the chance of including UEs with larger local model updates in training.
In Fig. 9, we compare the test accuracy of different UE selection schemes in the wireless HSFL algorithm using Dirichlet-nonIID and Dirichlet-ImD data. The comparison of the UE selection schemes follows a trend similar to Fig. 8(b) with non-IID data. In Fig. 9(a), all the UE selection schemes achieve lower test accuracy than in Fig. 8(b) due to the larger heterogeneity of the dataset across the UEs. However, the UE selection schemes in Fig. 9(b) show better performance because the dataset across UEs is less heterogeneous, even though it is imbalanced. Fig. 10 plots the test accuracy of the wireless HSFL algorithm with the MAB-BC-BN2 UE selection scheme using non-IID data for various numbers of split-training UEs K_S. Compared to FL and SFL, the HSFL algorithm achieves better test accuracy, and its superiority grows with the number of split-training UEs K_S. In Fig. 11, we examine the impact of the balance factor β in the MAB-BC-BN2 scheme on the test accuracy of HSFL. We can see that β = 0.5 yields the best learning performance in our simulated UAV networks, which reveals that selecting UEs that satisfy the lowest SNR requirements and have larger local model updates facilitates the improvement of test accuracy. In practice, if the data over the UEs are IID, each UE would have similar local model updates, so more weight could be put on the channel qualities, i.e., β can be set to a smaller value. On the other hand, if the data over the UEs are non-IID, more weight could be put on the significance of the local model updates to obtain better convergence performance.
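The Dirichlet-based partitions used above can be sketched as follows. This implements the common Dir(α_d) label-skew partition; the imbalanced Dir(α_d, α_imd) variant of [30] would additionally draw per-UE dataset sizes. The function name and the fixed seed are our assumptions.

```python
import numpy as np

def dirichlet_partition(labels, n_ues, alpha_d, seed=0):
    """Split sample indices across UEs: for each class, draw UE
    proportions from Dir(alpha_d); smaller alpha_d -> more skew."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_ues)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # proportions of this class assigned to each UE
        props = rng.dirichlet(alpha_d * np.ones(n_ues))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for ue, chunk in enumerate(np.split(idx, cuts)):
            parts[ue].extend(chunk.tolist())
    return parts
```

With α_d = 0.01, most UEs end up holding samples from only one or two classes, matching the high-heterogeneity setting used for Fig. 9(a).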

VI. CONCLUSION
In this paper, we proposed a novel distributed learning architecture, the hybrid split and federated learning (HSFL) algorithm, which adopts the parallel model training mechanism of federated learning (FL) and the model splitting structure of split learning (SL). By applying the HSFL algorithm in wireless UAV networks, our results demonstrated that it achieves higher learning accuracy than FL and less communication overhead than SL under both independent and identically distributed (IID) and non-IID data. Our results also revealed that the learning accuracy of the HSFL algorithm improves as the number of split-training UEs increases. We further provided a convergence analysis of the wireless HSFL algorithm under non-IID data with a random UE selection scheme. To improve the learning performance of the proposed HSFL algorithm in wireless networks with limited bandwidth and dynamic channel conditions, we developed a Multi-Arm Bandit (MAB) based best channel (BC) and best 2-norm (BN2) (MAB-BC-BN2) UE selection scheme based on the discounted MAB algorithm, which selects the UEs with larger local model updates and better channel qualities in each round. Our results showed that the MAB-BC-BN2 UE selection scheme achieves better learning accuracy than the BC, MAB-BC and MAB-BN2 schemes under non-IID, Dirichlet-nonIID and Dirichlet-Imbalanced data.

APPENDIX
In this appendix, we analyze the proposed HSFL scheme in the setting of partial UE participation on non-IID data. In this scenario, the BS selects a subset of K UEs according to the sampling scheme (e.g., the BC, BN2, or MAB-based UE selection scheme). We define g_t = Σ_{n=1}^N p_n g_n^t(ω_n^t, ξ_n^t) and ḡ_t = Σ_{n=1}^N p_n ḡ_n^t(ω_n^t); thus, E[g_t] = ḡ_t. First, from (26), we bound the averages of the terms A_1, A_2 and A_3, each in one of the following three lemmas, whose proofs are included.
Lemma 1: To bound A_3, let E_{N_t} denote the expectation over the randomness of the UE selection at round t. We have (27), from which the bound on A_3 follows.
Proof: This follows from the randomness of the UE selection policy.
Lemma 2: To bound A_1, we have (31).
Proof: The bound in (31) follows by using the conditions P(n ∈ N_{t+1}) = K/N and P(n, j ∈ N_{t+1}) = K(K−1)/(N(N−1)).
Lemma 3: To bound A_2, from (33) we bound the three terms B_1, B_2 and B_3.
Proof: For B_1, we follow steps similar to those in [24] and [26]. For B_2, E‖g_t − ḡ_t‖² is the variance of the stochastic gradients at UE u_n, which is bounded by δ_n², so B_2 is bounded by following the steps in (35). Further, F can be bounded as in (37).
Therefore, E[A_2] is finally bounded by (38).