BFLMeta: Blockchain-Empowered Metaverse with Byzantine-Robust Federated Learning

The emerging metaverse is envisioned as a virtual mapping of the real world, thus it would inevitably employ numerous Machine Learning (ML) frameworks to analyze and process massive data for the virtual-physical synchronization process. As a distributed ML paradigm, Federated Learning (FL) can naturally take advantage of numerous IoT devices, wearable devices, and edge and cloud servers under the metaverse infrastructure to train ML models with privacy guarantees. However, the large-scale and decentralized nature of the metaverse can pose significant challenges to traditional FL schemes, where a centralized server aggregates the local models received from local devices. Such a design is not only vulnerable to Single Point of Failure (SPoF), but also lacks incentive mechanisms encouraging metaverse users to contribute their resources and data. In this paper, we propose BFLMeta, a blockchain-based FL scheme for the metaverse in which the aggregation process is performed in a decentralized manner, while the framework can estimate the non-IID degree of data to flexibly adjust the blockchain committee size, thereby mitigating the impact of malicious aggregators. Security analysis shows that BFLMeta can resist SPoF, poisoning attacks, privacy leakage, and Sybil attacks. Besides, our evaluation of computation, communication, and performance illustrates the efficiency of BFLMeta. Notably, BFLMeta can converge even with more than 50% poisoning nodes.


I. INTRODUCTION
Envisioned as the Internet's successor, the metaverse, on the one hand, is the combination of various advanced technologies such as Augmented/Virtual Reality (AR/VR), Internet of Things (IoT), Artificial Intelligence (AI), Digital Twin (DT), networking (5G/6G), and computation techniques (e.g., edge and cloud computing) [1]. On the other hand, it also has the potential to revolutionize these technologies by offering a large-scale distributed environment with a massive number of wearable, edge, and IoT devices. As a distributed Machine Learning (ML) paradigm, Federated Learning (FL) can naturally take advantage of metaverse devices to collaboratively train ML models with a privacy-preserving guarantee.
However, traditional FL systems with a centralized server acting as the aggregator may not be suitable for the decentralized metaverse, which is not under the control of any particular organization. The centralized system is vulnerable to Single Point of Failure (SPoF), while it also lacks incentive mechanisms motivating crowdsourcing from metaverse users. To this end, blockchain with committee-based consensus mechanisms can decentralize the aggregation process by allowing the local models to be aggregated by multiple aggregators instead of only one centralized server, thereby preventing SPoF. Moreover, blockchain-based metaverse tokens and reputation can be an efficient method to encourage users to contribute their underused resources to train the FL model, thus receiving rewards as a passive source of income.

T. V. Truong, D. N. M. Hoang, and L. B. Le are with INRS, University of Québec, Montréal, QC H5A 1K6, Canada (email: tuan.vu.truong@inrs.ca, duc.hoang@inrs.ca, long.le@inrs.ca).
Currently, several works have realized the integration of FL in the metaverse. For instance, the authors in [2] aim to solve the problem of data heterogeneity in FL within the industrial metaverse context. However, this is still a centralized scheme which is prone to SPoF and the other mentioned issues. The authors in [3] investigate blockchain to enable an incentive scheme for the industrial metaverse with enhanced privacy, but there is no mechanism to prevent malicious local devices from poisoning the model by sending harmful gradient updates.
In terms of blockchain-enabled FL with decentralized aggregation, the authors in [4] use Algorand's consensus protocol [5] to select a blockchain committee with multiple aggregators for the aggregation process, while Multi-Krum [6] is used to filter out poisoning updates. However, this approach trains the model continuously without monitoring its performance on any validation data, making it prone to backdoor attacks and overfitting. Besides, poisoning attackers can collude to organize specific attacks that target the weaknesses of the corresponding aggregation algorithm [7]. On the other hand, the frameworks presented in [8] and [9] allow the blockchain committee to verify the performance (loss or accuracy) of local models to select only high-quality ones. Nonetheless, the heterogeneity of data can make the local models perform differently among committee members, causing inconsistency in the consensus protocol. Moreover, the impact of malicious committee members, which is unavoidable in any decentralized system, has not been analyzed.
In this paper, we aim to mitigate the presented open issues of existing works and propose BFLMeta, a blockchain-based FL framework for the metaverse with the following contributions.
• BFLMeta can resist poisoning attacks efficiently without relying on a fixed Byzantine-robust aggregation algorithm, which is prone to targeted attacks. Notably, it can converge even when the poisoners make up the majority.
• We analyze the effect of both non-IID data and malicious aggregators on the consensus process. Then, we propose a novel method to quantify the degree of heterogeneity in decentralized aggregation settings, thereby allowing the blockchain to adjust the committee size flexibly to adapt to the corresponding heterogeneous condition.
• An incentive mechanism is designed to reward contributors with metaverse tokens and reputation in the platform.
The rest of this paper is organized as follows: Section II proposes our blockchain-based FL framework for the metaverse. Section III analyzes the security and performance of the system. Section IV concludes the paper.

II. OUR PROPOSED BLOCKCHAIN-BASED FEDERATED LEARNING FRAMEWORK

A. System Overview
To construct a full-fledged digital world which reflects the physical world in real time, the metaverse infrastructure must be a combination of various types of devices: (i) IoT devices and sensors that continuously collect real-world data, (ii) edge and cloud servers that provide computational services for heavy tasks, and (iii) wearable devices such as AR/VR and haptic devices to enable immersive user experiences. BFLMeta aims to take advantage of these devices during their idle period to collaboratively train ML models. In BFLMeta, the models are trained based on Stochastic Gradient Descent (SGD), while each blockchain round corresponds to a certain number of SGD epochs. BFLMeta includes the following entities:
Requester: The requester bootstraps the blockchain system to request FL training for a specific task. The model's architecture must be pre-defined and stored in the genesis block.
Trainers: Trainers are metaverse devices with underused resources that want to earn metaverse tokens by training the requested FL model.
Aggregators: They aggregate the local models and compete with each other to produce a global aggregated model that aims to achieve the highest generalized performance.
Round Leader: The leader regulates the consensus process and encapsulates the aggregated model and all transactions into a block, which will be added to the blockchain once the blockchain consensus is reached.
The operation of BFLMeta is illustrated in Fig. 1 with 6 steps. In each blockchain round, trainers collect data and locally train the ML model without revealing their data (step 1), then submit their local models through blockchain transactions (step 2). Based on reputation scores, a blockchain committee consisting of N aggregators is elected to aggregate the local models received from the trainers (step 3). Thus, they propose N corresponding aggregated models. Since the aggregators' local data were not used for training, they utilize these data as validation datasets. Each aggregator validates the N − 1 aggregated models of the others to vote for the one model that they determine to be of the highest performance (step 4). Once consensus is reached among the committee (step 5), the highest-voted model is added to a new block of the blockchain. The trainers download that global model and repeat training (step 6). A metaverse-token reward is distributed to honest aggregators, while trainers who contributed to the highest-voted aggregated model are rewarded with reputation scores.
However, there can be three possible issues in a real-world scenario: (i) some trainers send harmful local models to poison the global model, (ii) there are certain malicious aggregators in the committee that propose harmful aggregated models, or they may vote dishonestly to maximize their benefit, and (iii) the validation data among aggregators are non-IID, causing inconsistency in the consensus protocol. While a solution to the first issue is presented in Section II-C, the effects of the second and the third issues are analyzed in Section II-E and mitigated in Section II-F.

B. Local Training
Once the requester bootstraps the blockchain network, all trainers can obtain the model from the genesis block. In each round, they train the model using their collected local data, then submit the local model to the committee for aggregation. To prevent the aggregators from inferring sensitive information from the local model (i.e., inference attack), every trainer also adds (ϵ, δ)-Differential Privacy (DP) noise [10] to their local model before submitting it through a blockchain transaction:

tx_model = ⟨addr_src, nonce, w_local, Sig_sk(tx_model)⟩,

where nonce is a counter keeping track of the number of transactions sent by the trainer, which helps avoid double submission of a local model; w_local is the local model with the DP noise added; Sig_sk(tx_model) is the trainer's digital signature; and addr_src is the trainer's address.
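As an illustrative sketch of this step (the helper name, clipping bound, and noise calibration below are assumptions; the paper does not specify them), a trainer could perturb its model with the standard Gaussian mechanism before building tx_model:

```python
import numpy as np

def add_dp_noise(w_local, clip_norm=1.0, epsilon=1.0, delta=1e-5, rng=None):
    """Sketch: clip the local model update, then add Gaussian (eps, delta)-DP noise.

    The noise scale follows the classic Gaussian-mechanism calibration
    sigma = clip_norm * sqrt(2 ln(1.25/delta)) / epsilon (an assumption here;
    BFLMeta's exact calibration is not given in the text).
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(w_local)
    # Bound the L2 sensitivity of the submitted update.
    w_clipped = w_local * min(1.0, clip_norm / (norm + 1e-12))
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return w_clipped + rng.normal(0.0, sigma, size=w_local.shape)
```

The noised vector is what would be placed in the transaction field w_local above; the raw parameters never leave the trainer.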

C. Decentralized Aggregation
In each round, each aggregator collects a local validation dataset ξ and receives a set of local models W = {w_local_1, w_local_2, ..., w_local_m}. Then, the aggregator aggregates W into an aggregated model w_ag using Algorithm 1.
Specifically, each aggregator validates the received local models using their own validation data, then selects the top µ% local models with the lowest losses to aggregate into a single model. Since each aggregator would collect a different dataset ξ and receive a different set of local models W, the aggregated models can differ between aggregators. The next section presents a protocol that helps the committee select the one model that is considered the best among the different aggregated models.
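A minimal Python sketch of Algorithm 1, treating models as parameter vectors (the `loss_fn` interface and vector representation are illustrative assumptions, not BFLMeta's actual implementation):

```python
import numpy as np

def aggregate(local_models, loss_fn, validation_data, mu=50):
    """Algorithm 1 sketch: keep the top mu% lowest-loss local models, average them.

    local_models: list of parameter vectors received from trainers.
    loss_fn(w, data): hypothetical evaluator returning the loss of model w.
    """
    losses = [loss_fn(w, validation_data) for w in local_models]
    keep = max(1, int(len(local_models) * mu / 100))
    best_idx = np.argsort(losses)[:keep]          # indices of lowest-loss models
    selected = [local_models[i] for i in best_idx]
    return np.mean(selected, axis=0)              # w_ag = (1/|Ws|) * sum over Ws
```

Because selection is loss-based rather than count-based, a poisoned model with a large validation loss is simply never averaged in, regardless of how many poisoners submit.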
D. Distributed Cross-Validation and Consensus

1) Peer Cross-Validation and Voting: Specifically, once entering the consensus process, the aggregator with the highest reputation score among the N aggregators is selected to be the leader. Aggregators broadcast their aggregated models to the other committee members for peer cross-validation. Thus, each aggregator validates the models of the remaining N − 1 aggregators using their own collected validation data. Then, each of them must broadcast a vote for the one model that they determine to obtain the highest performance (lowest loss) on their local validation dataset. The leader collects all the votes and, based on them, selects the aggregated model that received the most votes, which is denoted by w_ag_voted. The selected model w_ag_voted and the voting information V are packed into a new block B, which will be added to the blockchain if the consensus in II-D2 is achieved. This is similar to a cross-validation process, but in a distributed setting where each aggregator possesses their own validation dataset and does not share it with the others. As a result, the highest-voted model w_ag_voted can be considered to achieve the highest performance across the aggregators' validation datasets, under the reasonable assumption that the data size does not differ significantly among aggregators.
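The voting logic just described can be sketched as a local simulation (illustrative interfaces only; real aggregators exchange models and votes over the blockchain network rather than through in-memory lists):

```python
import numpy as np

def vote_and_select(agg_models, val_datasets, loss_fn):
    """Sketch of peer cross-validation: each aggregator A_i votes for the
    lowest-loss model among the other N-1 aggregated models, evaluated on
    its own validation dataset xi_i; the leader tallies the votes."""
    n = len(agg_models)
    votes = []
    for i in range(n):
        # An aggregator cannot vote for its own model (j != i).
        candidates = [(j, loss_fn(agg_models[j], val_datasets[i]))
                      for j in range(n) if j != i]
        votes.append(min(candidates, key=lambda c: c[1])[0])
    tally = np.bincount(votes, minlength=n)
    winner = int(np.argmax(tally))                # w_ag_voted
    return winner, votes
```

With IID validation data, every honest aggregator ranks the candidates the same way, so the genuinely best model collects nearly all votes; heterogeneity (analyzed in Section II-E) is what spreads the votes out.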
However, some aggregators may vote dishonestly to maximize their own benefit, while the heterogeneity of data among the committee can cause inconsistency to even the honest aggregators.Section II-F will propose a mechanism to ensure that the highest-performance model can still be selected under the heterogeneous and malicious conditions.
2) Block Proposal and Consensus: After the voting process, the leader generates a block B as follows:

B = ⟨w_ag_voted, V, TX_T, TX_R, C(w_ag_voted), A_{n+1}⟩,

where TX_T and TX_R are the lists of token transactions and reputation transactions, respectively; C(w_ag_voted) is the list of trainers who contributed to the selected highest-voted model w_ag_voted; and A_{n+1} is the list of aggregators for the next round's committee, presented in Section II-F.
All information in the new block is verifiable by every aggregator in the current round. Therefore, the leader can execute the practical Byzantine Fault Tolerance (pBFT) consensus protocol [11] to reach agreement on the proposed block.
3) Incentive Mechanism: Once a block is accepted, the following rewards are distributed to the participants who contributed to the system:
Aggregation Reward. AR metaverse tokens are distributed to the aggregators, proportional to the number of votes their models received. This motivates the aggregators to aggregate a model that is as robust as possible, while discouraging them from producing poisoned models.
Validation Reward. The aggregators who voted for the highest-voted model w_ag_voted are rewarded V reputation scores. Those who did not vote for w_ag_voted are slashed V reputation scores. This encourages the aggregators to vote honestly.
Training Reward. All trainers listed in C(w_ag_voted) are rewarded T reputation scores. Therefore, trainers would try to submit high-quality local models to increase the chance of being selected as aggregators and earning token rewards.
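A toy sketch of the reward bookkeeping (the symbols AR, V, and T follow the text, but their values and the dictionary-based ledgers are purely illustrative assumptions):

```python
def distribute_rewards(votes, winner, contributors, AR=100.0, V=5.0, T=1.0):
    """Sketch of the three rewards. votes[i] is the index aggregator i voted for;
    winner is the index of w_ag_voted; contributors is C(w_ag_voted)."""
    n = len(votes)
    # Aggregation Reward: tokens proportional to votes each model received.
    tokens = {i: AR * votes.count(i) / n for i in range(n)}
    # Validation Reward: +V reputation for voting with the majority, -V otherwise.
    reputation = {i: (V if votes[i] == winner else -V) for i in range(n)}
    # Training Reward: +T reputation for each contributing trainer.
    trainer_rep = {t: T for t in contributors}
    return tokens, reputation, trainer_rep
```

Note that one edge case is left open here, as in the text: the proposer of the winning model cannot vote for itself, so a full design would need a rule exempting it from the slash.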

E. Resistance to Malicious Aggregators and Non-IID Data
In a real-world situation, there can be two scenarios in which an aggregator acts maliciously: (i) the aggregator proposes a harmful aggregated model, or (ii) he intentionally avoids voting for the model which performs the best on his validation data, to maximize the probability that his own model is selected. The first scenario poses no risk if the majority of aggregators are honest, since the harmful model would not be able to collect the most votes, thanks to Algorithm 2. Therefore, we focus on the second scenario, where selfish aggregators do not vote for the best model. Instead, they randomly vote for one of the remaining N − 2 models.
In a round, assume that there exists a model w_ag_best ∈ W_ag that would receive the highest number of votes if all aggregators were honest and the validation data were completely IID among aggregators (i.e., the ideal case). Our aim is to ensure that, even if an α proportion of aggregators vote dishonestly and the validation data are non-IID among aggregators (i.e., the real case), the model w_ag_best still receives the most votes:

w_ag_best ≡ w_ag_voted.    (1)

Proposition 1. The condition w_ag_best ≡ w_ag_voted can be satisfied with high probability if the following condition is met:

α ≤ [(N − 1)(1 − β) − N/2] / [N((1 − β) − β/(N − 2))],    (2)

where α is the proportion of malicious aggregators, N is the total number of aggregators, and β ∈ [0, 1] is the degree of heterogeneity of local data. Specifically, β is the probability that an aggregator A_i incorrectly determines the model w_ag_best using their validation data ξ_i. For example, β = 0 means that the validation data are completely identically distributed among aggregators, so the performance-based ranking of models should be the same for every aggregator. In other words, every aggregator can determine exactly which model is w_ag_best out of {w_ag_1, w_ag_2, ..., w_ag_N}. Intuitively, the lower the heterogeneity, the higher the consistency of the evaluated performance.
Proof. To prove the above Proposition, we first compute the expected number of votes for w_ag_best:

E[v(w_ag_best)] = E[v_hon(w_ag_best)] + E[v_mal(w_ag_best)],    (3)

where E[v_hon(w_ag_best)] and E[v_mal(w_ag_best)] are the expected numbers of votes for the model w_ag_best from honest aggregators and from malicious aggregators, respectively.
Since aggregators cannot vote for themselves, the number of honest aggregators that can vote for w_ag_best is (N − Nα − 1). Therefore, we obtain:

E[v_hon(w_ag_best)] = (N − Nα − 1)(1 − β).    (4)

Although a malicious aggregator would avoid voting for w_ag_best, he still has a β/(N − 2) probability of unintentionally doing so. With N·α malicious aggregators, we obtain:

E[v_mal(w_ag_best)] = Nα · β/(N − 2).    (5)

Besides, it is obvious that a model will be the highest-voted model if it receives at least 50% of the votes (i.e., ≥ N/2 votes), even in the worst scenario in which one of the remaining models collects all of the remaining votes. Therefore, the condition w_ag_best ≡ w_ag_voted (1) can be satisfied with high probability if:

E[v(w_ag_best)] ≥ N/2.    (6)

From (3), (4), and (5), the condition (6) can be written as:

(N − Nα − 1)(1 − β) + Nα · β/(N − 2) ≥ N/2.    (7)

Finally, the condition (2) can be derived from (7). This is illustrated in Fig. 2.
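As a numerical sanity check, assuming condition (2) is the closed form obtained by rearranging the vote-count inequality (7) for α, the maximum tolerable proportion of malicious aggregators can be computed as follows:

```python
def max_fault_tolerance(N, beta):
    """Largest alpha satisfying
    (N - N*alpha - 1)(1 - beta) + N*alpha*beta/(N - 2) >= N/2,
    i.e., the right-hand side of the reconstructed condition (2)."""
    numerator = (N - 1) * (1 - beta) - N / 2
    denominator = N * ((1 - beta) - beta / (N - 2))
    return max(0.0, numerator / denominator)
```

For instance, with fully IID validation data (β = 0) a committee of N = 10 tolerates up to α = 0.4, and the bound approaches 1/2 as N grows; the tolerance shrinks as β rises, matching the trend in Fig. 2.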

F. Dynamic Committee Election
Fig. 2 shows that BFLMeta can resist a larger proportion of malicious aggregators (α) if the committee size (N) is higher and the degree of heterogeneity (β) is lower. While β is naturally uncontrollable, we can increase N to tolerate a higher α. However, both computation and communication costs grow as the committee size increases. Therefore, to achieve an optimized trade-off between performance and complexity, we first propose a method (8) to estimate the degree of heterogeneity β. Then, N can be adjusted to satisfy a predefined desirable fault tolerance α (e.g., α = 30%).
Definition 1 (Degree of Heterogeneity). In BFLMeta, the degree of heterogeneity among the aggregators' local data in a round can be quantified as follows:

β = 1 − E_k[P(w_ag_voted)],    (8)

where P(w_ag_voted) ∈ (0, 1] is the proportion of votes that the highest-voted model w_ag_voted received in a particular round, and E_k denotes the average of P over the last k rounds. For instance, β = 0.1 means that in the last k rounds, the highest-voted model of each round received approximately 90% of the votes of that round. Intuitively, the more consistent the votes, the lower the estimated degree of heterogeneity.
In a round, the leader follows Algorithm 3 to decide the list of aggregators A_{n+1} for the next round's committee. First, the leader uses the voting information from the last k blocks to quantify the degree of heterogeneity β based on (8). With the obtained β and a predefined α, the leader computes the committee size N, which must satisfy the condition in (2). Then, the leader generates a seed, which is the hash of the previous block, to feed into a Verifiable Random Function (VRF) [12]. As a result, he obtains a random number ϕ_1 and a proof π. Using the proof π and the seed, everyone can verify that ϕ_1 was truly randomized, so the leader cannot manipulate the random process. Next, the leader hashes this random number N − 1 times to obtain a total of N random numbers (including the first one generated by the VRF). Each random number ϕ_i ∈ {ϕ_1, ϕ_2, ..., ϕ_N} is used to randomly select one aggregator for the next committee. Specifically, all reputation scores of all participants are represented as K indexed reputation units {r_1, r_2, ..., r_K}. For each ϕ_i, the trainer who owns the reputation unit at index ⌈K · ϕ_i⌉ is selected as the aggregator A_i, so the probability of being selected is proportional to one's reputation.

Algorithm 3: Reputation-Based Committee Election
Input: The desirable fault tolerance α; the voting information of the last k rounds {V_1, V_2, ..., V_k}; the previous block B_{n−1}
Output: The list of aggregators A_{n+1} for the next round; a proof of randomness π from the VRF
1: Quantify the degree of heterogeneity β based on the voting information of the last k rounds, using equation (8).
2: Compute the committee size N based on the input α and the estimated β, using the condition in (2).
3: Generate a seed: s = hash(B_{n−1}).
4: Generate a verifiable random number ϕ_1 and the corresponding proof π: ⟨ϕ_1, π⟩ ← VRF_sk(s).
5: Generate N − 1 more random numbers by hashing the random number ϕ_1 N − 1 times, obtaining a total of N random numbers: Φ = {ϕ_1, ϕ_2, ..., ϕ_N}.
6: for ϕ_i ∈ Φ do
7:   Use ϕ_i to randomly select aggregator A_i from the M trainers.
8:   Add A_i to the list of aggregators A_{n+1}.
9: end for
10: Add to B_{n+1}: the list of aggregators A_{n+1}; the proof π.
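A compact sketch of this election procedure (hash-chained randomness stands in for the real VRF, whose proof generation and verification are omitted; all names and interfaces are illustrative):

```python
import hashlib

def estimate_beta(vote_shares):
    """Eq. (8) sketch: beta = 1 - average winning-vote share over the last k rounds."""
    return 1.0 - sum(vote_shares) / len(vote_shares)

def elect_committee(seed, reputations, N):
    """Draw N reputation-weighted committee members from a hash chain seeded by
    the previous block's hash. A real deployment would use a VRF so the leader
    can prove the draw was honest."""
    total = sum(reputations.values())
    owners = sorted(reputations.items())              # stable order of (name, rep)
    committee = []
    digest = hashlib.sha256(seed).digest()
    for _ in range(N):
        phi = int.from_bytes(digest[:8], "big") / 2**64   # phi_i in [0, 1)
        point, acc = phi * total, 0.0
        for name, rep in owners:                      # walk the reputation units
            acc += rep
            if point < acc:
                committee.append(name)
                break
        digest = hashlib.sha256(digest).digest()      # phi_{i+1} = hash(phi_i)
    return committee
```

Because every ϕ_i is derived deterministically from the seed, any node can recompute the draw and check the published list A_{n+1} against it.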

III. SECURITY AND PERFORMANCE EVALUATION
A. Security Analysis
1) Single Point of Failure: BFLMeta deploys model aggregation in a decentralized manner instead of relying on a single centralized server, allowing it to resist SPoF and operate properly even when certain nodes are under attack.
2) Poisoning and Majority Attack: Our two-stage validation process for both local models and aggregated models ensures that poisoned local models can be filtered out even when the majority of trainers are malicious, while the highest-performance aggregated model should be selected every round.
3) Privacy Leakage: While trainers and aggregators do not have to disclose their local data for training the FL model or for the cross-validation process, the added DP noise can prevent inference attacks efficiently.
4) Sybil Attack: BFLMeta selects the committee members randomly based on the reputation of trainers, while reputation scores are not transferable. Besides, the leader cannot manipulate the committee election process thanks to the VRF mechanism. Therefore, it is infeasible to dominate BFLMeta by a Sybil attack using multiple fake identities.

C. Trade-off between Fault Tolerance and Complexity

Fig. 3 illustrates the fault tolerance of BFLMeta (the blue curves) under different degrees of heterogeneity and committee sizes. A larger committee size can better ensure that the selected global model is of the highest performance among the proposed aggregated models, but leads to a significant rise in communication cost (the green curve). We set up a practical setting (Table II) to monitor the number of rounds in which the malicious aggregators successfully distort the voting result after each interval of 20 rounds. In this experiment, we implement an MLP model for a classification task on the MNIST dataset, which consists of 60,000 handwritten-digit images. The reason for this dataset is that we can proactively adjust the degree of heterogeneity of the data by constraining each FL participant to obtain data whose labels are within only k out of 10 classes. For example, if each participant can collect the data of only one class, a trainer who only has the digit 0 will train the model in a very different direction than another trainer who collects the data of digit 4 (i.e., high heterogeneity). We deploy three different committee sizes (10, 20, and 40) and fix the fault tolerance α.

D. Convergence Analysis under Poisoning Attacks
In terms of resistance against poisoning attacks, we compare BFLMeta to FedAvg and other up-to-date Byzantine-robust aggregation algorithms, including Krum, Multi-Krum, Mean, and Trimmed-Mean [13]. We deploy two types of poisoning attacks: (i) the Back-Gradient attack, in which the poisoners poison the local models by training them in a reversed direction; and (ii) the Mislabel attack, in which the poisoners poison the local data by mislabeling it before training. The training process is divided into 50 blockchain rounds, while each round consists of 2 epochs. The number of trainers is set to 40, with 30% and 55% poisoners in two different test cases. In each round, 30% of aggregators are malicious (i.e., α = 0.3). As shown in Fig. 4, BFLMeta converges faster than the other aggregation mechanisms when 30% of trainers are poisoners. Notably, in an extreme case in which the proportion of poisoners is even greater than that of honest trainers (55% poisoning), BFLMeta still converges efficiently, while the others cannot. The reason is that BFLMeta enables validation of every received local model based on the validation data collected by the aggregators. Thus, it is not affected by the quantity factor.
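For reference, the Mislabel attack used in this comparison amounts to corrupting labels before local training; a minimal sketch (the exact label corruption used in the experiments is not specified, so a random class shift is assumed here) is:

```python
import numpy as np

def mislabel(labels, num_classes=10, rng=None):
    """Mislabel attack sketch: shift every label by a nonzero random offset,
    guaranteeing that each example is assigned a wrong class."""
    rng = rng or np.random.default_rng()
    shift = int(rng.integers(1, num_classes))   # nonzero, so every label changes
    return (labels + shift) % num_classes
```

A poisoner would train on these corrupted labels and submit the resulting model; in BFLMeta such a model shows a high loss on the aggregators' clean validation data and is filtered out by Algorithm 1.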

IV. CONCLUSION
In this paper, we have presented BFLMeta, a blockchain-enabled FL framework for the metaverse with a distributed aggregation mechanism. The framework can filter out poisoned models efficiently even in the case of a majority attack, in which the proportion of poisoners is greater than that of honest trainers. Importantly, the effects of non-IID data and malicious aggregators, which are unavoidable in any distributed setting, are analyzed carefully and resolved by a mechanism for dynamic committee selection. Besides, the framework can adjust the committee size flexibly to achieve an optimized performance-complexity trade-off. An incentive mechanism based on token rewards is also proposed to encourage metaverse users to contribute their underused resources to the FL tasks.
However, there are still several open challenges that could be tackled in our future work. First, the complexity of the validation process is significantly higher than that of the traditional FedAvg method. Second, BFLMeta reveals the next round's committee members, which can make it vulnerable to attacks that target the aggregators of the next committee.

Fig. 2. Fault tolerance (α) of BFLMeta according to the committee size and degree of heterogeneity.
Table I summarizes the computation and communication complexity of each process of BFLMeta, where |D| is the average data size of each trainer, M is the number of trainers, N is the number of aggregators, and |w| is the model's size. A significant proportion of the cost lies in the distributed cross-validation process, which helps select the highest-quality aggregated model for the next round. Since aggregators must broadcast their aggregated models to each other, the total communication cost of this operation reaches O(N²·|w|). However, they perform it in parallel, so the actual individual cost is only O(N·|w|).

Fig. 3. Trade-off between fault tolerance and communication cost according to different committee sizes. The green curve is communication complexity, while the blue curves are fault tolerance proportions.

Fig. 4. Convergence of BFLMeta under poisoning attack compared to other centralized byzantine-robust aggregation algorithms.
Algorithm 1: Local Model Validation and Aggregation
Input: Validation data ξ; local models W = {w_local_1, w_local_2, ..., w_local_m}
Output: The aggregated model w_ag
1: for i = 1 to m do
2:   Compute the loss of each local model: l_i = f(w_local_i, ξ).
3: end for
4: Select the top µ% local models with the lowest losses: W_s.
5: Aggregate the selected models: w_ag = (1/|W_s|) Σ_{w ∈ W_s} w.

Algorithm 2: Distributed Cross-Validation and Voting
Input: The set of N aggregators A = {A_1, A_2, ..., A_N}; the set of N aggregated models W_ag = {w_ag_1, w_ag_2, ..., w_ag_N}
Output: The model w_ag_voted; a record of N votes V = {v_1, v_2, ..., v_N}
1: Select the highest-reputation aggregator to be the leader A_leader.
2: for A_i ∈ A do
3:   A_i broadcasts his model w_ag_i within the committee.
4: end for
5: for A_i ∈ A do
6:   A_i evaluates the performance (loss) of every received model w_ag_j (with j ≠ i) on his validation dataset ξ_i.
7:   A_i broadcasts a vote v_i for the model that achieves the highest performance (lowest loss) on his validation dataset ξ_i.
8: end for
9: A_leader selects the highest-voted model to be w_ag_voted, then adds w_ag_voted and the voting information V to the new block B.
10: The committee executes the consensus mechanism (II-D2) to reach an agreement on adding B to the blockchain.

TABLE II: The number of rounds in which malicious aggregators successfully distorted the voting result (σ).

The blue cells in Table II indicate that condition (2) is not satisfied for the corresponding β and N. As a result, the committee starts selecting incorrectly, i.e., a model other than the best one receives the highest votes, which is denoted by σ. This becomes more frequent when the degree of heterogeneity increases. When condition (2) is met, we observed no similar issue.