MetaCrowd: Blockchain-Empowered Metaverse via Decentralized Machine Learning Crowdsourcing

The metaverse provides a 3D virtual mapping of the physical world onto the digital world, in which users interact with each other through digital avatars across a wide range of virtual activities. To realize this, the metaverse will inevitably employ numerous machine learning (ML) systems to enable the virtual-physical mapping process and offer intelligent virtual services to metaverse users (MUs). However, metaverse service providers (MSPs), who need ML models for their services (e.g., virtual events and healthcare services), may not have the expertise or resources required to build these underlying ML models. In addition, although ML models can be offered by a crowd of experienced ML workers (MLWs), the MLWs might not be able to collect the data needed to train their models because of privacy issues and the large-scale, distributed nature of the metaverse. In this paper, we propose MetaCrowd, a blockchain-based ML crowdsourcing framework that aims to overcome these issues and make ML accessible to a wide range of MUs and MSPs. Unlike traditional crowdsourcing systems, which rely on central authorities, MetaCrowd is decentralized and automated by means of blockchain and smart contracts, thereby mitigating the single point of failure and trust issues. Experimental results illustrate the efficiency of MetaCrowd in both performance and cost. In addition, a decentralized application has been implemented and published to show its feasibility in practice.


I. INTRODUCTION
As the successor of the Internet, the metaverse is being realized thanks to various advanced technologies, including digital twins (DTs), augmented/virtual reality (AR/VR), Internet of Things (IoT), 5G/6G wireless networks, edge/cloud computing, blockchain, and artificial intelligence (AI) [1]. The importance of AI techniques is undeniable in constructing the metaverse infrastructure and its virtual environment. For instance, computer vision can be utilized to reflect real-world user behaviors into avatars' actions in the digital world via wearable devices. Similarly, advanced machine learning (ML) models can analyze massive data collected by IoT devices, sensors, and unmanned aerial vehicles (UAVs) from the physical world, thereby facilitating virtual-physical synchronization. While these tasks are often implemented by metaverse publishers (i.e., the organizations publishing the platform), metaverse service providers (MSPs) may also need ML techniques to enable their virtual services. For example, MSPs can use ML models to produce and sell AI-generated content on metaverse marketplaces. For metaverse virtual events, ML models can be used by event organizers to detect malicious users accessing the virtual space where the events are held. In this case, the ML models could analyze the participant's features (e.g., reputation profile, location, and virtual relationships) to make their prediction. Together, these examples show that ML systems are indispensable to the metaverse.

H. D. Le, V. T. Truong, D. N. M. Hoang, T. V. Nguyen, and L. B. Le are with INRS, University of Québec, Montréal, QC H5A 1K6, Canada. Emails: {duy.hung.le, tuan.vu.truong, duc.hoang, thai.vu.nguyen, long.le}@inrs.ca.
Despite this urgent need for ML-based intelligent services in the metaverse, MSPs may lack the necessary expertise in AI/ML and the computational resources to construct and train the desired ML models. On the other hand, even if there are ML professionals capable of designing and training ML models, they may face challenges in collecting the required metaverse data due to privacy concerns and the large-scale, heterogeneous nature of the metaverse. Therefore, in this paper, we propose MetaCrowd, a decentralized crowdsourcing platform that allows MUs and MSPs to crowdsource both metaverse data and ML models in a trustless distributed environment with low cost, high performance, and privacy preservation. From our analysis of related work, we find that the foremost open challenge in crowdsourcing systems is to correctly evaluate the contribution of every participant and thereby distribute the crowdsourcing reward fairly. Most existing crowdsourcing frameworks assume that the task requesters provide an evaluation function that can, in some way, assess the contribution of the workers correctly. This is impractical in most cases, as the requesters often lack the necessary expertise to do so. Hence, in MetaCrowd, we aim to eliminate the assumption of such an "omnipotent" function in crowdsourcing systems while also fulfilling the following crucial requirements: (i) the framework can operate in trustless environments, where both task requesters and workers can be malicious; (ii) the framework must be decentralized to prevent the single point of failure (SPoF) and third-party issues; and (iii) the crowdsourced results must not lead to privacy leakage.
Traditional trading and crowdsourcing systems offered by third-party authorities are often vulnerable to SPoF and trust issues among participants, in addition to the lack of transparency, privacy, and incentive mechanisms. These limitations should be eliminated to fit with the decentralized metaverse, where scams and frauds are exacerbated by various novel social engineering attacks. To this end, blockchain is a potential solution for trust management in metaverse trading systems with immutable, transparent, and auditable properties. Specifically, blockchain provides a reliable environment to record trading transactions, while smart contracts can replace third-party authorities in decision-making, policy enforcement, reputation, and incentive management.
Prior to MetaCrowd, several frameworks have integrated blockchain into the metaverse for different purposes. For instance, the authors in [2] developed a campus prototype for the metaverse, in which blockchain is used to enable the economic system with token trading. Additionally, the authors in [3] proposed a blockchain-based incentive mechanism to facilitate interactions between MSPs and MUs, thus optimizing resource usage. However, none of them investigate the use of blockchain for metaverse ML crowdsourcing systems.
Beyond the specific context of the metaverse, various attempts have been made to incorporate blockchain into data trading and crowdsourcing systems. In [4], a blockchain-based IoT data trading market includes three trading modes: general trading, selling on demand, and buying on demand. Managed by smart contracts, it eliminates SPoF and the need for a third-party middleman, but fails to address dishonest sellers and potential data leaks. The CrowdBC framework [5] uses a 3-layer blockchain structure to defend against SPoF and DoS attacks and to mitigate trust issues in crowdsourcing. Yet, the requirement for requesters to create an evaluation function is impractical for those lacking expertise, and the assessment of contributions by miners poses a data leakage risk. Other researchers, such as those in [6]-[8], focused on addressing privacy concerns in crowdsourcing systems. ZebraLancer [6] establishes a private, anonymous crowdsourcing platform, and the research in [7] employs a public chain with multiple private subchains for privacy management, but neither considers a practical approach for evaluating contributions and distributing rewards. Furthermore, both [6] and [7] introduce a registration authority that is a potential single point of failure. Meanwhile, [8] introduces a blockchain-powered framework for mobile crowdsourcing but does not investigate data quality evaluation.
To address these issues, we propose MetaCrowd, an ML crowdsourcing framework for the metaverse based on blockchain and smart contracts, with the following main contributions. First, MetaCrowd offers two complete procedures: (i) MSPs can request ML models from MLWs with pre-defined task descriptions and requirements, and (ii) MLWs can crowdsource the necessary data from data workers (DWs), who own metaverse devices (e.g., IoT sensors, cameras, UAVs, wearable devices) and would like to sell the data collected by their devices to earn metaverse tokens. These two processes are both crucial and complementary in facilitating intelligent services in the metaverse. Second, using blockchain, a reputation-based incentive mechanism is designed to encourage participants to act honestly while eliminating malicious actors. MetaCrowd can resist false-reporting attacks (i.e., dishonest requesters try to avoid payment) and free-riding attacks (i.e., dishonest workers earn rewards without making a real effort). Moreover, smart contracts automate most of the processes in MetaCrowd, protecting it against single points of failure, manipulation, and trust issues. Table I compares MetaCrowd with existing related works with respect to the mentioned system requirements, showing that our proposed design is more comprehensive than the existing frameworks.
The rest of this paper is organized as follows. Section II presents the architecture and operation of MetaCrowd. Section III analyzes its security and performance, followed by the concluding remarks in Section IV.

II. METACROWD: BLOCKCHAIN-BASED MACHINE LEARNING CROWDSOURCING FOR METAVERSE
This section presents the architecture of MetaCrowd, starting from an overall system overview to the detailed design of each element constituting the framework.

A. System Overview
The overall architecture of MetaCrowd is illustrated in Fig. 1, including the following entities.
• Metaverse Service Providers: MSPs are businesses or users who need ML models to enable their virtual services. Accordingly, they might want to crowdsource ML models from other participants.
• Machine Learning Workers: MLWs possess expertise in AI/ML and also have computational capacity. However, they may not own the data necessary to train ML models.
• Data Workers: DWs own metaverse devices and can use them to collect data from the real world or the virtual world. They might want to collect and sell their data to earn metaverse tokens as a source of income.

MetaCrowd operates on top of a consortium blockchain using a Raft-based consensus algorithm. In this paper, we use the term "blockchain committee" to refer to a group of selected blockchain consensus nodes.

B. Machine Learning Crowdsourcing Management
The ML crowdsourcing process is described in Algorithm 1, in which a requester MSP_i requests an ML service from the MetaCrowd system. To do so, MSP_i first submits the following transaction to trigger the smart contract MLC:

Tx = ⟨Θ, d_task, t_deadline, R_min, H_test, Sig_sk(Tx)⟩, (1)

where Θ is the number of metaverse tokens deposited to reward the MLWs who contribute to the task, d_task is the task description, t_deadline is the task deadline, R_min is the minimum reputation score that an MLW must possess to join the task, H_test is the hash of the test data, and Sig_sk(Tx) is the digital signature of the transaction signed with the secret key of MSP_i. MLWs who are interested in the published task can download the task description and train an ML model that satisfies the corresponding requirements. To submit their solutions, the MLWs upload their trained models to IPFS [10], then send the IPFS link (i.e., URL) to the smart contract MLC. Once the task deadline is reached, the requester MSP_i must publish the test data D_test onto IPFS. If the requester fails to do so, the smart contract MLC distributes all the deposited tokens Θ among the MLWs who contributed to the task, regardless of the quality of their ML models. In the normal case, where the requester MSP_i has uploaded the test data, the blockchain committee downloads the test data and all ML models submitted by the MLWs for model evaluation. Based on the evaluation results, the majority of the reward Θ is given to the MLW whose model achieves the highest performance (e.g., lowest loss or highest accuracy), while the remaining tokens are distributed to those whose model's performance exceeds a predefined threshold (e.g., prediction accuracy higher than 90%). Finally, the requester MSP_i obtains the highest-performance model from the MLC and finishes the process.
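The reward rule above can be illustrated with a minimal sketch. The paper does not state how large the winner's "majority" share is or how the remainder is divided, so the 60% winner share, the equal split among runners-up, and the equal split when the test data is withheld are assumptions made for illustration; the function and parameter names are hypothetical and are not part of the MLC contract.

```python
from typing import Dict


def distribute_ml_reward(
    scores: Dict[str, float],
    total_reward: int,
    test_data_published: bool = True,
    threshold: float = 0.90,   # example threshold quoted in the text
    winner_percent: int = 60,  # assumption: the text only says "the majority"
) -> Dict[str, int]:
    """Return the token payout of each MLW for one finished task."""
    payouts = {worker: 0 for worker in scores}
    if not scores:
        return payouts

    if not test_data_published:
        # Requester withheld D_test: every contributor is paid,
        # regardless of model quality (equal split is an assumption).
        share = total_reward // len(scores)
        return {worker: share for worker in scores}

    best = max(scores, key=scores.get)
    runners_up = [w for w, s in scores.items() if w != best and s > threshold]
    if not runners_up:
        payouts[best] = total_reward
        return payouts

    runner_pool = total_reward * (100 - winner_percent) // 100
    share = runner_pool // len(runners_up)
    for worker in runners_up:
        payouts[worker] = share
    # The winner receives the rest, so no rounding dust is lost.
    payouts[best] = total_reward - share * len(runners_up)
    return payouts
```

With Θ = 1000 and accuracies of 0.95, 0.92, and 0.80, this sketch pays 600 tokens to the best model, 400 to the runner-up above the threshold, and nothing to the third worker.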

C. Data Crowdsourcing Management
Compared to the ML crowdsourcing procedure, data crowdsourcing has several fundamental differences: (i) the data should not be made widely public, both to maintain its value and to protect privacy, and (ii) evaluating data quality is more challenging than validating ML models due to the lack of a common metric (e.g., accuracy or loss). To solve the first problem, MetaCrowd utilizes encryption techniques to restrict access to the purchased data. For the second issue, MetaCrowd allows the data requester to evaluate the submitted datasets and decide the rewards for the data workers itself instead of relying on the blockchain committee. To discourage false-reporting attacks, the total reward must be deposited into the DC at the beginning. Therefore, the data requester pays the same amount regardless of the outcome of the data evaluation process. Furthermore, a blockchain oracle is integrated into MetaCrowd to facilitate dispute resolution between the data requester and unsatisfied data workers.
Specifically, a data requester MLW_i first submits a transaction, referred to as (2), to the smart contract DC to request data crowdsourcing from the DWs. In this transaction, α is the compensation stake, denominated in metaverse tokens, f_data is the list of data features that MLW_i wants to obtain for its datasets, and the remaining parts are the same as presented in (1).
The interested DWs collect data with the features required in f_data. Once the deadline is reached, they encrypt their collected data with the public key of MLW_i and upload the encrypted data to IPFS. The IPFS URLs are submitted to the DC as proof of contribution. While the encrypted datasets can be downloaded by anyone since their URLs are transparent on-chain, only the data requester MLW_i can decrypt the data by using its secret key, thereby mitigating privacy issues and maintaining the data value. MLW_i evaluates all obtained datasets and sends the evaluation results to the DC for reward distribution. In the evaluation results, each dataset is marked only as either honest or dishonest. As a result, the reward Θ is equally distributed to all DWs whose dataset is determined to be honest. Because Θ has already been deposited into the DC at the beginning, MLW_i has no motivation to conduct false-reporting attacks.
Furthermore, if any DWs are not satisfied with the evaluation result, they can submit a disputation request to the blockchain oracle. The oracle, which consists of multiple professional members, decides on the disputation by analyzing the disputed dataset. If some DWs win the disputation, they together share the compensation stake α mentioned in (2). Otherwise, each dishonest DW loses a stake of β tokens, which is then given to MLW_i, thereby mitigating potential DoS attacks. In the case where no disputation request has been submitted, or no DW has won its disputation, the compensation stake α is sent back to MLW_i.
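The settlement logic of the data crowdsourcing procedure can be sketched as follows. This is an illustrative reading of the rules above and not the DC contract itself: the function name, the argument layout, and the use of integer division for equal shares are assumptions, and the sketch assumes that a DW who wins a disputation shares α but does not additionally receive part of Θ, since the text does not say otherwise.

```python
from typing import Dict, Tuple


def settle_data_task(
    verdicts: Dict[str, bool],      # DW -> judged honest by the requester
    dispute_wins: Dict[str, bool],  # disputing DW -> oracle sided with the DW
    reward: int,                    # Theta, deposited up front
    alpha: int,                     # compensation stake of the requester
    beta: int,                      # stake lost by a DW whose dispute fails
) -> Tuple[Dict[str, int], int]:
    """Return (net token change per DW, tokens returned to the requester)."""
    payouts = {dw: 0 for dw in verdicts}

    # Theta is split equally among DWs whose dataset was judged honest.
    honest = [dw for dw, ok in verdicts.items() if ok]
    if honest:
        share = reward // len(honest)
        for dw in honest:
            payouts[dw] += share

    winners = [dw for dw, won in dispute_wins.items() if won]
    losers = [dw for dw, won in dispute_wins.items() if not won]

    requester_receives = 0
    if winners:
        # Successful disputers share the compensation stake alpha.
        share = alpha // len(winners)
        for dw in winners:
            payouts[dw] = payouts.get(dw, 0) + share
    else:
        # No dispute, or none succeeded: alpha goes back to the requester.
        requester_receives += alpha

    # Each failed dispute costs the DW beta tokens, paid to the requester.
    for dw in losers:
        payouts[dw] = payouts.get(dw, 0) - beta
        requester_receives += beta

    return payouts, requester_receives
```

Note that Θ never flows back to the requester in this sketch, which mirrors the argument that MLW_i gains nothing from false reporting.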

D. Incentive Mechanism
In addition to the token rewards mentioned above, participants are also rewarded with reputation scores, which increase their chance of being selected for the committee.
ML Crowdsourcing. The MLWs whose model's performance is greater than the pre-defined threshold are rewarded 1 reputation score, while the remaining MLWs are slashed by 1 score. In addition, the owner of the selected highest-performance model is rewarded an additional 1 score.
Data Crowdsourcing. The DWs whose dataset is selected by the data requester, or who won a disputation, are rewarded 1 reputation score, while the remaining DWs are slashed by 1 score. In addition, a data requester MLW is also slashed by 1 score for each successful disputation claimed by DWs.
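The two reputation rules above amount to simple per-task score deltas. The sketch below is one possible reading; the function names and input shapes are hypothetical, and in MetaCrowd the actual bookkeeping is performed by the reputation contract RC.

```python
from typing import Dict


def ml_reputation_deltas(scores: Dict[str, float], threshold: float) -> Dict[str, int]:
    """+1 above the threshold, -1 otherwise, and an extra +1 for the best model."""
    if not scores:
        return {}
    deltas = {w: (1 if s > threshold else -1) for w, s in scores.items()}
    best = max(scores, key=scores.get)
    deltas[best] += 1
    return deltas


def data_reputation_deltas(
    verdicts: Dict[str, bool],      # DW -> dataset selected by the requester
    dispute_wins: Dict[str, bool],  # disputing DW -> disputation won
    requester: str,
) -> Dict[str, int]:
    """+1 for selected or vindicated DWs, -1 for the rest; requester loses 1 per lost dispute."""
    deltas: Dict[str, int] = {}
    successful_disputes = 0
    for dw, selected in verdicts.items():
        won = dispute_wins.get(dw, False)
        deltas[dw] = 1 if (selected or won) else -1
        successful_disputes += int(won)
    deltas[requester] = -successful_disputes
    return deltas
```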
Blockchain Oracle. The oracle members whose decision matches the decision made by the majority of members are rewarded a certain number of metaverse tokens. In contrast, the remaining members lose the same amount of stake for each of their "wrong" decisions, i.e., decisions that did not follow the majority.
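As a rough illustration of the oracle rule, the sketch below tallies boolean votes on a single disputation and computes each member's token change. The tie-breaking rule (a tie rejects the disputation) and the names used are assumptions, since the text does not specify them.

```python
from typing import Dict, Tuple


def settle_oracle_votes(votes: Dict[str, bool], amount: int) -> Tuple[bool, Dict[str, int]]:
    """Return (majority decision, token change per oracle member).

    votes maps an oracle member to True (the DW wins) or False.
    Assumption: a tie is treated as rejecting the disputation.
    """
    in_favor = sum(1 for v in votes.values() if v)
    against = len(votes) - in_favor
    decision = in_favor > against
    # Members who voted with the majority gain `amount`; the rest lose it.
    deltas = {m: (amount if v == decision else -amount) for m, v in votes.items()}
    return decision, deltas
```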
Committee Operation. The committee members receive a certain number of metaverse tokens for each new block added to the blockchain via its consensus protocol. These token rewards are mostly derived from the transaction fees paid by other participants.
All changes in user reputation are managed by RC. With this incentive mechanism, MetaCrowd ensures that participants have no motivation to act dishonestly, while intentionally malicious actors are eventually eliminated from the system since their account balance is drained quickly.

E. Reputation-Based Raft Consensus
In MetaCrowd, there is a blockchain committee consisting of multiple consensus nodes, which take responsibility for transaction verification, block proposal, and maintaining a consistent ledger. In the committee, 50% of slots are fixed for the metaverse organizations, which collaboratively maintain the infrastructure and operation of the virtual world. This is called the authorized committee partition. The remaining slots are open to normal nodes (i.e., MSPs, MLWs, and DWs) via a reputation-based election mechanism, thereby forming a dynamic committee partition. As shown in Fig. 2, the dynamic partition is newly elected every k rounds, while the authorized partition remains unchanged.
Consensus Process. The state transition of a consensus node in the Raft-based model is illustrated in Fig. 2 with three states, namely follower, candidate, and leader [11]. Specifically, the leader takes responsibility for log replication to the followers, thus maintaining a consistent log record (i.e., the record of verified transactions) among distributed consensus nodes. Periodically, the leader must send a heartbeat message to the followers to prove its existence. If a follower does not receive the leader's heartbeat within a pre-defined timeout, the follower changes its status to candidate and starts a leader election process, in which it is elected as the new leader upon receiving the majority of votes. Through this process, consensus can be reached among distributed nodes as long as a majority of the nodes remain operational, i.e., fewer than 50% of the nodes fail [11].
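The majority requirement of Raft can be made concrete with a short calculation. The sketch below uses the standard Raft rule that a leader needs votes from a strict majority of the cluster; the helper names are illustrative only.

```python
def raft_quorum(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Largest number of crashed nodes that still leaves a quorum alive."""
    return n - raft_quorum(n)


def wins_election(votes_received: int, n: int) -> bool:
    """A candidate becomes leader once it holds a strict majority of votes."""
    return votes_received >= raft_quorum(n)
```

For a committee of 40 consensus nodes, `raft_quorum(40)` is 21 and `tolerated_failures(40)` is 19, so exactly half of the nodes failing would already stall the committee.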
Reputation-Based Committee Election. Every k rounds, the current leader executes Algorithm 2 to elect a new dynamic committee partition for the next period, where M is the number of nodes to be selected to the dynamic committee partition.
First, the leader computes a pair ⟨ϕ_1, π⟩ consisting of a random number and its proof, based on a verifiable random function (VRF) algorithm [12]. The seed of the VRF is set to the hash value of the previous block, making it resistant to manipulation. Next, the leader hashes this random number M − 1 times to obtain a set of M random numbers (including the first one generated by the VRF), denoted Φ = {ϕ_1, ..., ϕ_M}. Then, each random number ϕ_i ∈ Φ is used to select one node for the next committee, such that the probability that a node is selected is proportional to its current reputation score. Intuitively, assuming that the reputation scores of all nodes are represented as K indexed reputation units R_spread = {r_1, ..., r_K}, the leader randomly selects one reputation unit r_Idx ∈ R_spread, then elects the node that owns r_Idx into the committee. Thus, the more reputation units a node owns, the higher the probability that it is elected.
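The election step can be sketched as a hash chain followed by reputation-weighted sampling. This is a minimal illustration and not Algorithm 2 itself: the VRF is out of scope here, so ϕ_1 is simply passed in as bytes (in the usage note it is a SHA-256 digest, which is only a stand-in and provides no VRF proof π), mapping a random number to a reputation unit by taking it modulo K is an assumption, and the sketch does not prevent the same node from being drawn more than once.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate
from typing import Dict, List


def hash_chain(phi_1: bytes, m: int) -> List[bytes]:
    """Phi = {phi_1, ..., phi_M}: hash phi_1 repeatedly, M - 1 times."""
    chain = [phi_1]
    for _ in range(m - 1):
        chain.append(hashlib.sha256(chain[-1]).digest())
    return chain


def elect_committee(reputation: Dict[str, int], phi_1: bytes, m: int) -> List[str]:
    """Pick m nodes with probability proportional to their reputation."""
    nodes = sorted(reputation)  # fixed order so every verifier agrees
    cumulative = list(accumulate(reputation[n] for n in nodes))
    total_units = cumulative[-1] if cumulative else 0  # K reputation units
    if total_units <= 0:
        raise ValueError("at least one node needs a positive reputation")

    elected = []
    for phi in hash_chain(phi_1, m):
        # Map the random number to one reputation unit index in [0, K).
        idx = int.from_bytes(phi, "big") % total_units
        # The owner of that unit is the first node whose cumulative
        # reputation exceeds the index.
        elected.append(nodes[bisect_right(cumulative, idx)])
    return elected
```

For example, `elect_committee({"msp1": 5, "mlw1": 3, "dw1": 2}, hashlib.sha256(b"previous-block-hash").digest(), 3)` returns three node names, and any node can recompute the same result from the same inputs.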

III. PERFORMANCE AND SECURITY ANALYSIS
A. Security Analysis
1) Free-Riding Attack: In ML crowdsourcing, models are validated using the requester's test set, while in data crowdsourcing, the requester evaluates submitted datasets. This process ensures that MLWs and DWs cannot earn rewards without real contributions by discarding poor submissions and penalizing free-riders with reduced tokens and reputation scores.
2) False-Reporting Attack: The smart contract in ML/data crowdsourcing thwarts false-reporting attacks using automated evaluation and dispute resolution. For ML tasks, the contract's direct evaluation prevents requesters from wrongly discrediting work to repudiate payment. In data tasks, DWs can contest unfair evaluations, ensuring requesters face penalties for misreporting.
3) SPoF and Manipulation: The use of blockchain eliminates centralized authorities, reducing the risk of SPoF and manipulation. With a transparent committee election process based on VRF, distortion of election results is prevented.
4) Sybil Attack: MetaCrowd's reliance on reputation means that a Sybil attacker creating many avatars/identities cannot manipulate the system. Numerous low-reputation avatars will have no negative impact on the system's operation.

5) Privacy Leakage: In MetaCrowd, data submitted by workers is encrypted to prevent the leakage of sensitive information to the public during its operation.

B. Performance Analysis
1) Experiment Setup:
A prototype of MetaCrowd is implemented on top of a consortium blockchain built with Hyperledger Fabric [13], an open-source blockchain development platform. The source code of our implementation is published on GitHub. In the simulation, a cluster consists of 60 blockchain nodes, with 40 of them being the orderers (i.e., consensus nodes in the committee) and the remaining 20 being normal peer nodes. Each node runs in an independent Docker container to join the system.
The experiment uses a computer with an Intel Core i9-13900K CPU (3.2 GHz), 64 GB of RAM, and a local IPFS network of 50 peer-to-peer nodes to store data and ML models. Hyperledger Caliper is used to generate and monitor transaction workloads, allowing for the assessment of MetaCrowd's performance across various network conditions, ranging from low to high and even extreme workloads.
2) Evaluation Results: In the experiment depicted in Fig. 3, 100 clients were configured to simultaneously submit transactions that invoke smart contract functions. The transaction workload ranged from 100 transactions per second (TPS) to 2000 TPS. The results demonstrate that when the workload is below 1000 TPS, the system can handle the majority of submitted transactions, maintaining a transaction processing rate of over 92%. However, as the workload surpasses 1200 TPS, the transaction processing rate declines sharply. At 2000 TPS, only 50% of transactions are processed in each round, indicating that the system's processing capacity is restricted to approximately 1000-1200 TPS. Similarly, the average network latency remains negligible until reaching the identified saturation point of 1000 TPS.
Next, Table II shows the upload and download time of the data crowdsourcing process on several benchmark ML datasets of different sizes. Both the upload and download times are proportional to the size of the dataset. However, the download time is significantly lower than the upload time, since uploading files to IPFS requires a certain time to divide the files into smaller chunks. The upload and download times of the ML crowdsourcing process are reported in Table III, in which different well-known ML models are used for the experiment. We observe a trend similar to that of the data crowdsourcing process.
In a consensus performance comparison between MetaCrowd and a baseline system using BFT-Smart consensus [14], 600 transactions per second were submitted. As shown in Fig. 4, MetaCrowd processed about 500 transactions per second with a steady 200 ms latency, regardless of the number of consensus nodes. In contrast, the baseline's throughput fell from 600 to 100 TPS as the number of nodes increased from 1 to 50, and its latency rose significantly with more consensus nodes.

IV. CONCLUSION
In this paper, we presented MetaCrowd, a blockchain-based decentralized framework for ML and data crowdsourcing in the metaverse. MetaCrowd allows the MSPs to crowdsource ML services from MLWs, while also offering data crowdsourcing to provide the MLWs with the necessary data for training ML models. The framework fits with the distributed and large-scale nature of the metaverse by replacing the central authorities with a decentralized system based on blockchain and smart contracts. In addition, proper incentive and punishment mechanisms are proposed in MetaCrowd to ensure that participants must follow the defined rules to maintain their benefits. As a result, MetaCrowd can resist SPoF, manipulation, free-riding, false-reporting, and Sybil attacks, while guaranteeing performance and cost efficiency. Notably, our framework does not require providing an evaluation function to estimate each worker's contribution.

Fig. 2: State transition of a node in the dynamic committee partition.

Fig. 4: Processing speed and latency of MetaCrowd compared to BFT-Smart when the number of consensus nodes varies.

TABLE I: Comparison between MetaCrowd and existing crowdsourcing frameworks.
• Machine Learning Contract (MLC): MLC regulates the ML crowdsourcing procedure between MSPs and MLWs.
• Data Smart Contract (DC): DC manages the process of data crowdsourcing between MLWs and DWs.
• Reputation Smart Contract (RC): This contract manages the reputation profile of all participants according to their honest or malicious actions within the system.

Algorithm 1 (final steps): Evaluate each Model_j on D_test and submit results to MLC; distribute reward Θ to MLWs based on the evaluation results; return Model_best.

TABLE II: MetaCrowd upload and download time for the data crowdsourcing process with different datasets.

TABLE III: MetaCrowd upload and download time for the ML crowdsourcing process with different ML models.