Efficient Bit Loading Algorithm for OFDM-NOMA Systems with BER Constraints

This paper proposes a novel bit loading algorithm for multicarrier non-orthogonal multiple access (NOMA) systems to maximize the total system throughput while satisfying the users' individual quality of service (QoS) constraints. Although bit loading is generally a non-deterministic polynomial-time (NP)-hard problem, even for orthogonal multiple access (OMA), the mutual interference between the users and the dependence between the power coefficients and modulation orders are additional challenges that add substantial complexity to the optimization problem. Therefore, in this paper, we propose an efficient bit loading algorithm for multicarrier NOMA systems and compare its complexity and throughput with those of a corresponding OMA system. The obtained results show that for the case of NOMA, bit loading provides an additional degree of freedom that allows non-uniform spectrum sharing among the users. In particular scenarios, the system may experience hybrid modes of operation, where the system switches between NOMA and OMA. The obtained numerical results show that NOMA can outperform OMA by 100% in terms of spectral efficiency for a two-user scenario. The complexity of the loading process for NOMA is noticeably higher than that of OMA, which is due to the high complexity associated with the bit error rate (BER) computation in the case of NOMA.


I. INTRODUCTION
The expeditious development of the mobile Internet and Internet of things (IoT) has introduced various capacity, data rate, and power consumption challenges for the fifth generation (5G) wireless communications systems [1]. Consequently, extensive research has focused in the last few years on developing technologies that can overcome such challenges, such as energy-aware communication and energy harvesting [2], [3], cloud-based radio access networks [4], wireless network virtualization [5], advanced antenna systems [6], [7], and efficient error correction coding [8]. Furthermore, significant advancements have been achieved in terms of multiple access technologies such as non-orthogonal multiple access (NOMA), which may enhance the spectral efficiency with respect to orthogonal multiple access (OMA) by allowing multiple users to share the transmission resources simultaneously. Generally speaking, NOMA schemes can be classified into two main categories: code-domain NOMA [9], [10] and power-domain NOMA [11], which is the focus of this paper.
Towards further increasing the spectrum efficiency of NOMA, efficient modulation schemes such as orthogonal frequency-division multiplexing (OFDM) have been proposed in the literature [12], [13]. Despite the several advantages of OFDM, its bit error rate (BER) performance in fading channels is similar to that of single-carrier systems. Therefore, adopting adaptive modulation, commonly known as bit loading, can enhance the performance of OFDM-NOMA systems in terms of BER and consequently spectrum efficiency. Both BER and throughput in OFDM-NOMA depend on the modulation order and operating environment. In the low signal-to-noise ratio (SNR) regime, increasing the modulation order does not contribute to the total throughput due to high data detection errors. In contrast, in the high-SNR regime, increasing the modulation order may significantly improve the total throughput. However, applying bit loading, or adaptive modulation, for OMA [14]-[17] is drastically different from NOMA, because the loading processes for all OMA users are generally independent, i.e., changing the modulation order for any user does not affect the BER of other users. For NOMA, the mutual inter-user interference (IUI) between all users creates BER dependency, i.e., the BER of any user depends on the modulation order of all other users. Moreover, changing the modulation order for a particular subcarrier makes it necessary in certain scenarios to change the power assignment for that subcarrier, which may affect the power assignment for other users as well. Consequently, the bit loading process for NOMA is computationally more demanding than for OMA; nevertheless, the throughput gain can be significant.

A. Related Work
Although NOMA has been considered widely in the literature, based on extensive literature search, and to the best of the authors' knowledge, very little work has considered the throughput maximization problem for NOMA systems using bit loading. For example, user pairing and adaptive turbo trellis coded modulation scheme for cognitive radio (CR) in downlink NOMA are proposed in [18]. The adaptive modulation is implemented by switching between multiple modulation orders based on a predefined set of SNR thresholds, and thus, the throughput is not maximized. In [19], a hybrid OMA/NOMA downlink system is considered, where the authors proposed a multi-stage algorithm for user grouping, subcarrier allocation, and bit allocation schemes while considering data-rate and error-rate constraints. However, the number of bits allocated to each OFDM subcarrier is predefined and the modulation order assigned for each user depends on the associated SNR. Therefore, no throughput maximization process is performed. Moreover, the study focuses on power reduction with no effort to maximize the throughput. Luo and Teh [20] considered maximizing the throughput of a cooperative NOMA system by adaptively switching between OMA and NOMA at the relay node. However, no error rate is guaranteed and the switching is performed by selecting the mode that maximizes the throughput based on channel gains.

B. Motivation and Contribution
Although adaptive modulation, or bit loading, has been widely considered for OMA as reported in [14]-[17], [21] and the references listed therein, it is not the case for NOMA, where only very limited research work has considered the problem, and with several simplifying assumptions. Therefore, this paper considers the bit loading problem in OFDM-NOMA systems and proposes an efficient bit loading algorithm for a two-user scenario, where the objective is to maximize the total system throughput while satisfying the users' individual BER constraints. The considered bit loading problem is a non-convex integer non-linear optimization problem, and thus it is NP-hard. Moreover, the mutual interference between the users and the dependence on the power coefficients and modulation orders are critical challenges that introduce additional complexity to the optimization process. For example, given that the number of possible modulation orders is $M$ and the number of subcarriers is $K$ per OFDM symbol, the total number of combinations to find the optimal modulation orders for $N$ OMA users is $N \times M^{K}$ because the subcarriers of each user are loaded independently from other users. In the case of NOMA, the subcarriers of all users have to be loaded simultaneously, and thus, the total number of combinations is $M^{KN}$, which is substantially larger than the OMA case.
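To make the gap between the two search spaces concrete, the following minimal sketch evaluates both counts. The values N = 2, M = 3 and K = 64 are illustrative assumptions (three choices per subcarrier, e.g., off, QPSK and 16-QAM), and the function names are hypothetical.

```python
def oma_search_space(num_users: int, num_orders: int, num_subcarriers: int) -> int:
    """Combinations when each user is loaded independently: N * M^K."""
    return num_users * num_orders ** num_subcarriers


def noma_search_space(num_users: int, num_orders: int, num_subcarriers: int) -> int:
    """Combinations when all users are loaded jointly: M^(K*N)."""
    return num_orders ** (num_subcarriers * num_users)


# Illustrative values only (assumed, not taken from a specific result).
N, M, K = 2, 3, 64
oma = oma_search_space(N, M, K)
noma = noma_search_space(N, M, K)

# Python integers are arbitrary precision, so the counts are exact.
print(f"OMA : about 10^{len(str(oma)) - 1} combinations")
print(f"NOMA: about 10^{len(str(noma)) - 1} combinations")
```

For two users, the joint count is the square of the per-user count, which is why an exhaustive search becomes impractical much earlier for NOMA than for OMA.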
To resolve the complexity problem, we propose an iterative greedy algorithm that loads the bits of each user separately. The proposed solution is based on the incremental allocation algorithm (IAA) for OFDM-OMA with BER constraints [22]. However, the IAA is suitable only for single-user applications, and hence, it should be modified to handle multi-user scenarios. Specifically, the bit loading is performed based on the sorted channel coefficients rather than the BER at every subcarrier, where different starting points and loading directions, incremental or decremental, are considered. Therefore, the algorithm is greedy within each iteration, but may change directions in different iterations. The impact of the initial loading is also considered to reduce the number of iterations required to provide reliable results. The performance is presented in terms of the throughput per user, system total throughput, and computational complexity, and compared to OMA. The obtained results show that NOMA significantly outperforms OMA in terms of throughput due to the cooperation between the two users. This behavior can be considered as virtual cognitive radio in the sense that the spectrum left unused by one user is utilized by the other user. In such scenarios, the resources of the user with poor channel conditions can be utilized by the other NOMA user, while they are wasted for OMA. Moreover, the bit loading naturally imposes a hybrid OMA/NOMA mode of operation because only one user might be allocated to a given subcarrier. The complexity of the loading process for NOMA is noticeably higher than that of OMA, which is due to the high computational complexity required to compute the BER for NOMA systems. Moreover, some additional complexity is introduced because of the iterative process. To reduce the complexity, the bit loading process is evaluated using two different initial starting points, namely the fully loaded (FL) and partially loaded (PL) vectors.
The obtained complexity results show that the PL approach can reduce the complexity by about 50% for several cases of interest.

C. Notations
The notations used throughout the paper are as follows. Boldface uppercase and lowercase symbols, such as $\mathbf{X}$ and $\mathbf{x}$, denote matrices and row/column vectors, respectively. The transpose is denoted by $(\cdot)^{T}$, $\mathbb{E}[\cdot]$ is the statistical expectation, and a complex Gaussian random variable with zero mean and variance $\sigma^{2}$ is denoted as $\mathcal{CN}(0, \sigma^{2})$.

D. Paper Organization
The rest of the paper is organized as follows. In Sec. II, the system and channel models are presented. The proposed bit loading algorithm is detailed in Sec. III. The computational complexity of the bit loading process for OMA and NOMA systems is derived in Sec. IV. Numerical and simulation results are given in Sec. V, and the work is concluded in Sec. VI. The lists of acronyms and symbols are given in Appendices I and II, respectively.

II. SYSTEM AND CHANNEL MODELS
This work considers a downlink power-domain NOMA system serving two users, $U_1$ and $U_2$, where the user equipment (UE) and the base station (BS) are each equipped with a single antenna. The users share the $K$ subcarriers of an OFDM symbol through superposition modulation (SM). The multiplexed baseband NOMA symbol can be described as

$$\mathbf{y} = \sum_{n=1}^{2} \sqrt{\beta_n P_T}\,\mathbf{x}^{(n)} \tag{1}$$

where $\mathbf{y} = [y_0, y_1, \ldots, y_{K-1}]$, $\mathbf{x}^{(n)}$ denotes the modulated data sequence of the $n$th user, $P_T$ is the total transmit power at the BS, and $\beta_n \in [0, 1]$ is the power allocation coefficient for the $n$th user. For the rest of the paper, the transmit power $P_T$ is normalized to unity and $\sum_{n=1}^{2} \beta_n = 1$.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
For proper detection of the NOMA symbols, the power allocation coefficients must satisfy the constraint in (2) [23]. The NOMA stream $\mathbf{y}$ is converted to an OFDM symbol using the inverse discrete Fourier transform (IDFT), and a cyclic prefix (CP) is added to prevent inter-symbol interference (ISI). At the receiver, the CP is removed and the discrete Fourier transform (DFT) is employed for demodulation. For the $n$th user, assuming that the channel remains fixed during one OFDM symbol period and the CP is larger than the maximum delay spread of the channel, the DFT output consists of the stream $\mathbf{y}$ weighted by the channel frequency response matrix of the $n$th user, $\mathbf{G}^{(n)} \in \mathbb{C}^{K \times K}$, plus the additive white Gaussian noise (AWGN) vector $\mathbf{w}^{(n)}$, whose elements are $w^{(n)}_k \sim \mathcal{CN}(0, \sigma_w^2)$. The matrix $\mathbf{G}^{(n)}$ is determined by the distance between the $n$th user and the BS, denoted as $d_n$, the path loss exponent $\xi$, and the multipath component gains $g^{(n)}_i \sim \mathcal{CN}(0, \sigma^2_{g_i})$, where $g^{(n)}_i$ denotes the $i$th multipath component gain and $Q^{(n)} + 1$ represents the number of multipath components. In the NOMA system considered in this work, we assume that the first user has the lower average channel gain, i.e., the average channel gain of $U_1$ is smaller than that of $U_2$. Thus, the power coefficients should be allocated in the opposite order of the channel amplitudes, i.e., $\beta_1 > \beta_2$. It should be noted that the coherence time is assumed to be larger than the OFDM symbol duration. After the DFT, the information detection can be performed using joint multi-user detection (JMuD) [24] or successive interference cancellation (SIC) [25]. It is worth noting that for the considered system model, both the SIC and JMuD detection schemes have equal BER performance, while their computational and delay properties are different [24].
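The transmit chain described above can be sketched as follows. This is a minimal illustration under stated assumptions: both users send QPSK, the channel is ideal and noiseless, the symbols are superposed with the weights $\sqrt{\beta_n P_T}$, and a naive $O(K^2)$ DFT is used to stay within the standard library. All function names and the values K = 16, CP = 4 and $\beta_1 = 0.92$ are chosen for illustration only.

```python
import cmath
import math
import random


def idft(freq):
    """Naive inverse DFT (frequency domain -> time domain)."""
    n = len(freq)
    return [sum(freq[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def dft(time):
    """Naive forward DFT (time domain -> frequency domain)."""
    n = len(time)
    return [sum(time[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def qpsk_symbols(rng, count):
    """Unit-energy QPSK symbols drawn at random."""
    s = 1.0 / math.sqrt(2.0)
    return [complex(rng.choice((-s, s)), rng.choice((-s, s))) for _ in range(count)]


def superpose(x1, x2, beta1, p_total=1.0):
    """Power-domain superposition with beta1 + beta2 = 1."""
    beta2 = 1.0 - beta1
    return [math.sqrt(beta1 * p_total) * a + math.sqrt(beta2 * p_total) * b
            for a, b in zip(x1, x2)]


def add_cp(symbol, cp_len):
    """Prepend the last cp_len time samples as a cyclic prefix."""
    return symbol[-cp_len:] + symbol


def remove_cp(symbol, cp_len):
    """Drop the cyclic prefix at the receiver."""
    return symbol[cp_len:]


rng = random.Random(0)
K, CP = 16, 4                      # illustrative sizes
x1, x2 = qpsk_symbols(rng, K), qpsk_symbols(rng, K)
y = superpose(x1, x2, beta1=0.92)  # far user U1 gets the larger share
tx = add_cp(idft(y), CP)           # OFDM modulation
rx = dft(remove_cp(tx, CP))        # OFDM demodulation over an ideal channel
max_err = max(abs(a - b) for a, b in zip(rx, y))
print(f"max reconstruction error: {max_err:.2e}")
```

With an ideal channel the demodulated stream equals the superposed stream up to rounding error; a fading channel and noise would be applied between `add_cp` and `remove_cp`.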

III. BIT LOADING FOR NOMA
Generally speaking, bit loading for OFDM-NOMA is substantially more complicated than OFDM-OMA, due to the mutual IUI, complexity of computing the BER, and the relation between the modulation order and power coefficients allocated for each user. Therefore, we propose in this work a low complexity spectrally-efficient algorithm to enable maximizing the total OFDM-NOMA throughput while satisfying the BER constraints for each user. The proposed algorithm is based on the IAA [14], [15], [22], which is modified to suit the special structure of OFDM-NOMA. In particular, the proposed algorithm maximizes the system total throughput by assigning each user a certain number of bits per subcarrier based on the BER constraints of both users, since the number of bits allocated for one of the users affects the BER of the other due to the mutual IUI. The algorithm allows allocating one or more subcarriers exclusively for one user if the fading is severe for the other user and its BER constraint cannot be maintained. On the other hand, if the fading is not severe, then both users can share that subcarrier, but not necessarily with the same number of bits per user. In this section, the throughput maximization problem is formulated and the proposed solution is presented. However, an overview of the conventional IAA is shown for the sake of completeness.

A. Overview of IAA for OMA
The bit loading problem for OFDM with OMA and squared quadrature amplitude modulation (QAM) that aims at maximizing the system throughput can be formulated as

$$\max_{\mathbf{b}} \; T_T = \frac{1}{K} \sum_{k=0}^{K-1} b_k \tag{6}$$

subject to:

$$C_1: \; b_k \in \{0, 2, 4, \ldots, b_{\max}\} \;\; \forall k \tag{7a}$$

$$C_2: \; \bar{P} \le \bar{P}_{Th} \tag{7b}$$

where $T_T$ is the total throughput normalized by the total number of subcarriers, constraint $C_1$ in (7a) is used to force the modulation to be a squared QAM, while $C_2$ in (7b) is used to ensure that the average BER per OFDM symbol, $\bar{P}$, is less than a predefined threshold $\bar{P}_{Th}$. Moreover, $b_k$ represents the number of bits for the $k$th subcarrier, $\mathbf{b} = [b_0, b_1, \ldots, b_{K-1}]$, and $b_{\max}$ is the maximum possible number of bits that can be allocated to any subcarrier. The conditional BER per subcarrier, $P_k$, for squared QAM can be closely approximated by the expression in (8) [22], where $Q(\cdot)$ is the complementary cumulative distribution function of the standard normal distribution. The average SNR is $\bar{\gamma} = 1/\sigma_w^2$ and $\alpha_k^2$ is the channel gain of the $k$th subcarrier. In the context of OMA with fixed power allocation, the power factor is normalized to unity, i.e., $\beta = 1$.
The IAA can be used to efficiently solve the optimization problem in (6), where the process starts by allocating $b_{\max}$ bits to all subcarriers. Given that the channel state information (CSI), i.e., $\bar{\gamma}$ and $\alpha_k \; \forall k$, is available at the transmitter, it can be used to compute $\bar{P}$. If $\bar{P} \le \bar{P}_{Th}$, the algorithm stops; otherwise, we search for the subcarrier with the maximum $P_k$ and reduce its modulation order. Then $\bar{P}$ is computed again and compared to $\bar{P}_{Th}$. If the condition is satisfied, the algorithm stops; otherwise, the algorithm goes into a new iteration by reducing the modulation order of the subcarrier with the maximum $P_k$. The algorithm stops when $\bar{P} \le \bar{P}_{Th}$ or when $\mathbf{b} = [0, 0, \ldots, 0]$. The IAA can be explained for OMA using Algorithm 1, where the channel envelope vector is $\boldsymbol{\alpha} = [\alpha_0, \alpha_1, \ldots, \alpha_{K-1}]$. It is worth noting that the operation $l = \arg\max_{l \in \{0, 1, \ldots, K-1\}} P_l$ in Algorithm 1 corresponds to finding the subcarrier with the maximum conditional BER. Moreover, it can be noted that the number of bits per subcarrier is decremented by two bits in each step to limit the modulation scheme to squared QAM. The complexity of the IAA is mostly dominated by two main operations. The first is that the algorithm initially calculates the BER for all subcarriers, and then the BER is recomputed for each subcarrier whose allocation changes. The second is the search process for the subcarrier with the maximum probability of error. The overall complexity depends on the number of times the two operations are repeated, which is a random variable that depends on the channel conditions.
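The IAA loop described above can be sketched as follows. This is a minimal illustration, not the listing of Algorithm 1: the per-subcarrier BER uses the widely used nearest-neighbour approximation for Gray-coded square QAM as a stand-in for (8), and the average BER is assumed to be the bit-weighted mean over the loaded subcarriers. Both choices, as well as all function names, are assumptions.

```python
import math


def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def ber_sqam(bits, gain_sq, snr):
    """Approximate BER of square 2^bits-QAM (assumed stand-in for (8))."""
    if bits == 0:
        return 0.0  # an unloaded subcarrier carries no bits and no errors
    m = 2 ** bits
    arg = math.sqrt(3.0 * gain_sq * snr / (m - 1))
    return (4.0 / bits) * (1.0 - 1.0 / math.sqrt(m)) * q_func(arg)


def average_ber(b, p):
    """Bit-weighted average BER over the OFDM symbol (assumption)."""
    total = sum(b)
    return sum(bk * pk for bk, pk in zip(b, p)) / total if total else 0.0


def iaa(alpha, snr, ber_th, b_max=4):
    """Start fully loaded, then repeatedly unload the worst subcarrier."""
    b = [b_max] * len(alpha)
    p = [ber_sqam(bk, a * a, snr) for bk, a in zip(b, alpha)]
    while sum(b) > 0 and average_ber(b, p) > ber_th:
        worst = max(range(len(b)), key=lambda k: p[k])  # arg max of P_k
        b[worst] -= 2                                   # keep square QAM
        p[worst] = ber_sqam(b[worst], alpha[worst] ** 2, snr)
    return b


# Illustrative channel envelopes and a 15 dB average SNR (assumed values).
alpha_demo = [0.1, 0.5, 1.0, 1.5]
loading = iaa(alpha_demo, snr=10 ** 1.5, ber_th=1e-3)
print("bits per subcarrier:", loading)
```

Only the BER of the modified subcarrier is recomputed in each pass, which mirrors the two dominant operations named in the text: the BER update and the maximum search.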

B. Bit Loading for NOMA: Problem Formulation
Similar to the single user (SU) case, the adaptive two-user bit loading problem that aims at maximizing the total throughput while satisfying the users' BER constraints can be formulated as

$$\max_{\mathbf{b}^{(1)}, \mathbf{b}^{(2)}} \; T_T = \frac{1}{K} \sum_{n=1}^{2} \sum_{k=0}^{K-1} b^{(n)}_k \tag{9}$$

subject to:

$$C_1: \; b^{(n)}_k \in \{0, 2, 4, \ldots, b_{\max}\} \;\; \forall \{k, n\} \tag{10a}$$

$$C_2: \; \bar{P}^{(n)} \le \bar{P}^{(n)}_{Th} \;\; \forall n \tag{10b}$$

The objective function in (9) aims at maximizing the total throughput over the $K$ subcarriers for the two users. Constraint $C_1$ in (10a) specifies the possible number of bits per subcarrier, which is restricted to squared QAM with modulation order $M^{(n)}_k$. Constraint $C_2$ in (10b) bounds the average BER of each user, $\bar{P}^{(n)}$, by the threshold $\bar{P}^{(n)}_{Th}$, which is set based on the quality of service (QoS) requirements of each user. The bit loading problem in (9) allows for three modes of transmission: NOMA, OMA, or OFF. In the NOMA mode both users are active, in the OMA mode only one of the two users is active, while in the OFF mode both users must switch off, i.e., $b^{(1)}_k = b^{(2)}_k = 0$. The conditional BER of the $n$th user at the $k$th subcarrier, $P^{(n)}_k$, is written in terms of $\Gamma_k \in \{0, 1\}$, a binary variable that allows computing $P^{(n)}_k$ regardless of the number of bits per user in each subcarrier. More specifically, $P^{(n)}_k \,|\, \Gamma_k$ can be expressed as in (12), (13) and (14) [26], [27], where all the symbols in (12), (13) and (14) are defined in Appendix II. It is worth noting that $P^{(n)}_k \,|\, \Gamma_k = 0$ corresponds to the OMA case and (12) is its exact conditional BER expression, which is used in order to be consistent with the NOMA case where the exact BER is used as well. As can be noted from (9), the bit loading problem is a combinatorial non-linear non-deterministic polynomial-time (NP)-hard optimization problem. Therefore, in this paper, an efficient bit loading algorithm is proposed to solve the problem.
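The three transmission modes and the normalized throughput can be illustrated with a short sketch. The helper names and the example loading vectors below are hypothetical; the mode labels follow the definitions given above.

```python
from collections import Counter


def transmission_mode(b1_k, b2_k):
    """Classify one subcarrier from the bits loaded by U1 and U2."""
    if b1_k > 0 and b2_k > 0:
        return "NOMA"  # both users active
    if b1_k > 0 or b2_k > 0:
        return "OMA"   # exactly one user active
    return "OFF"       # both users switched off


def total_throughput(b1, b2):
    """Total bits of both users normalized by the number of subcarriers."""
    return (sum(b1) + sum(b2)) / len(b1)


def mode_shares(b1, b2):
    """Fraction of subcarriers operating in each mode."""
    counts = Counter(transmission_mode(x, y) for x, y in zip(b1, b2))
    return {mode: counts[mode] / len(b1) for mode in ("NOMA", "OMA", "OFF")}


# Hypothetical loading vectors for K = 4 subcarriers.
b1_demo = [4, 2, 0, 0]
b2_demo = [2, 0, 2, 0]
print(total_throughput(b1_demo, b2_demo))
print(mode_shares(b1_demo, b2_demo))
```

Tracking the mode shares alongside the throughput makes the hybrid OMA/NOMA behavior of a given loading solution directly visible.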

C. Proposed Bit Loading Algorithm for NOMA
The proposed bit loading algorithm is based on the principle of the IAA for OFDM-OMA with BER constraints [14].
However, the IAA is suitable only for single-user applications, and hence, it should be modified to handle multi-user scenarios. The main points that should be considered when applying the IAA to the two-user OFDM-NOMA are:
1) The allocation cannot be performed jointly for the two users because the channels for the two users are different.
2) The order of allocation has a major impact on the user fairness. For example, if the allocation starts with $U_1$, the interference from $U_2$ will be maximum because all subcarriers are loaded with $b_{\max}$. The same scenario is also applicable if the loading process starts with $U_2$. In such scenarios, the user that is loaded first might have less throughput when compared with the user who is loaded second.
3) After completing the loading for both users, most likely there will be a possibility of adding extra bits for the user that was loaded first, due to the reduced interference that results from reducing the total bits of the user that was loaded second.
To take these points into consideration, we propose a modified IAA (MIAA). Unlike the IAA, the MIAA can be applied using an arbitrary initial loading vector for each user, denoted as $\mathbf{b}^{(1)}$ and $\mathbf{b}^{(2)}$. The MIAA is bidirectional in the sense that it can increase or decrease the number of bits per subcarrier. Moreover, the MIAA does not require the search for the subcarrier with the maximum error probability after each subcarrier allocation operation, which can reduce the time and computational complexity significantly. It is worth noting that the IAA outperforms the MIAA in terms of throughput when both algorithms are used with a single iteration. However, they produce roughly the same throughput when the number of iterations is more than one. The justification for such behavior is that the IAA generally produces better throughput because it is directly based on the BER, while the MIAA will reduce the total number of bits per OFDM symbol because it is based on the SNR. In the subsequent iterations, the MIAA will be able to add some of the missing bits, while the IAA will have less improvement because its outcome in the first iteration is close to the final loading solution. Consequently, both techniques will converge roughly to the same solution. Moreover, the loading process for a particular user should also consider the impact of the loading process on the other user, because increasing the bits for a user might increase the probability of error for the other user. In the MIAA, the number of bits is decreased by 2 in each iteration because (13) and (14) are valid only for squared QAM. Extending the algorithm to other modulation schemes or constellation shapes is straightforward. To simplify the presentation of the MIAA, it can be divided into two small algorithms, MIAA-A and MIAA-D.
1) Algorithm 2 (MIAA-A): The MIAA-A is generally similar to the conventional IAA in that it starts from a certain initial loading vector $\mathbf{b}^{(n)}$ and then decreases the number of bits per subcarrier iteratively. To avoid the repetitive search for the subcarrier with the maximum BER, the subcarriers are sorted based on their fading coefficients vector $\boldsymbol{\alpha}^{(n)}$ in ascending order, which implies that only one sorting process is needed. Therefore, the first subcarrier will be the one with the worst fading conditions. Then, the allocation process is performed sequentially by reducing the number of bits per subcarrier until $\bar{P}^{(n)} \le \bar{P}^{(n)}_{Th}$.
2) Algorithm 3 (MIAA-D): The MIAA-D operates in the opposite direction, i.e., it increases the number of bits per subcarrier. Because of the mutual IUI, increasing the number of bits for one user should not cause a BER constraint violation for the other user. The bit loading process starts by sorting the subcarriers based on their channel gains vector $\boldsymbol{\alpha}^{(n)}$ in descending order. Thus, the first subcarrier is the one with the best fading conditions.
3) Algorithm 4 (I-MIAA): Based on Algorithms 2 and 3, we propose the iterative MIAA (I-MIAA) in Algorithm 4. The I-MIAA has four steps that start by applying the MIAA-A for both users to obtain initial loading vectors $\mathbf{b}^{(1)}$ and $\mathbf{b}^{(2)}$ that are generated such that $\bar{P}^{(1)} \le \bar{P}^{(1)}_{Th}$ and $\bar{P}^{(2)} \le \bar{P}^{(2)}_{Th}$. However, the total bits in $\mathbf{b}^{(1)}$ and $\mathbf{b}^{(2)}$ can be increased because the MIAA-A is strongly affected by the mutual interference between the users, which initially provides pessimistic results. In particular, the user that was loaded first would suffer from strong interference from the other user. Therefore, the MIAA-D is applied to the user that was loaded first to overcome the severe bit reduction that was obtained in the initial MIAA-A. Because the MIAA-A is not as efficient as the IAA, the MIAA-D should be applied to the user that was loaded second as well.
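One pass of the four I-MIAA steps can be sketched as follows. This is an illustration under several loudly stated assumptions rather than the listings of Algorithms 2 to 4: the exact BER expressions (12) to (14) are replaced by a crude surrogate in which $U_1$ treats the signal of $U_2$ as Gaussian noise and $U_2$ cancels $U_1$ perfectly, the average BER is bit-weighted, MIAA-A sweeps the sorted subcarriers removing two bits at a time, MIAA-D stops at the first violating increment, and $U_1$ is loaded first. All names are hypothetical.

```python
import math


def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def ber_sqam(bits, sinr):
    """Surrogate square-QAM BER (assumption; not the exact (12)-(14))."""
    if bits == 0:
        return 0.0
    m = 2 ** bits
    return (4.0 / bits) * (1.0 - 1.0 / math.sqrt(m)) * q_func(math.sqrt(3.0 * sinr / (m - 1)))


def avg_ber(b_own, b_other, alpha, snr, beta_own, sees_iui):
    """Bit-weighted average BER of one user under the surrogate model."""
    num = den = 0.0
    for bk, bo, a in zip(b_own, b_other, alpha):
        if bk == 0:
            continue
        shared = bo > 0
        power = beta_own if shared else 1.0  # full power when alone
        iui = (1.0 - beta_own) * a * a * snr if (shared and sees_iui) else 0.0
        num += bk * ber_sqam(bk, power * a * a * snr / (1.0 + iui))
        den += bk
    return num / den if den else 0.0


def miaa_a(b, alpha, own_ok):
    """Decrease bits, weakest subcarrier first, until own_ok(b) holds."""
    b = list(b)
    order = sorted(range(len(b)), key=lambda k: alpha[k])  # single sort
    while sum(b) > 0:
        for k in order:
            if own_ok(b):
                return b
            if b[k] >= 2:
                b[k] -= 2
    return b


def miaa_d(b, alpha, both_ok, b_max=4):
    """Increase bits, strongest subcarrier first, until a constraint breaks."""
    b = list(b)
    order = sorted(range(len(b)), key=lambda k: alpha[k], reverse=True)
    added = True
    while added:
        added = False
        for k in order:
            if b[k] >= b_max:
                continue
            b[k] += 2
            if not both_ok(b):
                b[k] -= 2  # undo the violating increment and stop
                return b
            added = True
    return b


def i_miaa(alpha1, alpha2, snr, beta1, th1, th2, b_max=4):
    """MIAA-A for U1 then U2 from FL vectors, followed by MIAA-D for each."""
    b1, b2 = [b_max] * len(alpha1), [b_max] * len(alpha2)
    ber1 = lambda x1, x2: avg_ber(x1, x2, alpha1, snr, beta1, True)
    ber2 = lambda x1, x2: avg_ber(x2, x1, alpha2, snr, 1.0 - beta1, False)
    ok = lambda x1, x2: ber1(x1, x2) <= th1 and ber2(x1, x2) <= th2
    b1 = miaa_a(b1, alpha1, lambda x: ber1(x, b2) <= th1)
    b2 = miaa_a(b2, alpha2, lambda x: ber2(b1, x) <= th2)
    b1 = miaa_d(b1, alpha1, lambda x: ok(x, b2), b_max)
    b2 = miaa_d(b2, alpha2, lambda x: ok(b1, x), b_max)
    return b1, b2


# Illustrative envelopes for K = 8 subcarriers at 30 dB (assumed values).
a1 = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 0.3, 0.9]
a2 = [1.3, 0.5, 1.1, 0.7, 1.6, 0.9, 1.2, 1.0]
b1_out, b2_out = i_miaa(a1, a2, snr=1000.0, beta1=0.92, th1=1e-2, th2=1e-2)
print("U1:", b1_out)
print("U2:", b2_out)
```

The surrogate BER is pessimistic for $U_1$ because it ignores the constellation structure of the interference, so the resulting loadings are not comparable with results based on the exact expressions; the sketch is only meant to show the order of the steps and the role of the two-user constraint check in MIAA-D.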

D. Using Partially Loaded Initial Bit Loading Vectors
As can be noted from the MIAA-A in Algorithm 2, the initial solution for the bit loading process is $\mathbf{b}^{(n)} = [b_{\max}, b_{\max}, \ldots, b_{\max}]$, and thus, this vector is called an FL vector. In such a scenario, converging to the final solution for $\mathbf{b}^{(n)}$ might take several iterations and hence increase the system complexity. To expedite the convergence to the final solution, the initial loading vectors for both users should be closer to the final solution as compared to the FL case where $b^{(n)}_k = b_{\max} \; \forall k$. Towards this goal, we propose using an initial solution for the MIAA-A that may have a small difference as compared to the final bit allocation. By noting that computing $P_k$ incurs significant computational complexity as a result of using (13) and (14), the proposed solution for the $n$th user can be obtained by successively applying the MIAA-A and MIAA-D algorithms while considering that $\beta_{3-n} = 0$, i.e., there is no mutual interference between the users and the low complexity expression (12) is used to compute $P_k$. The output bit loading vectors of this preliminary operation are considered as the initial loading vectors, instead of $b^{(n)}_k = b_{\max} \; \forall k$, for the actual MIAA-A and MIAA-D. Because the loading vectors in this case will mostly have subcarriers with $b_k < b_{\max}$, the vectors are denoted as PL vectors.

IV. COMPUTATIONAL COMPLEXITY
In this section, the computational complexity of the proposed bit loading algorithm is provided. The computational complexity of bit loading algorithms with BER constraints is mostly dominated by the extensive BER calculations performed during the iterative optimization process [22]. In particular, the complexity is computed in terms of the number of real multiplication ($\times$) and addition/subtraction ($\pm$) operations, and the number of times the $Q(\cdot)$ and $\sqrt{\cdot}$ functions are evaluated in the BER expressions (12), (13) and (14). For a more informative comparison, the overall equivalent complexity is also presented [28]. Moreover, the computational complexity of the exponent term $2^{v}$, where $v$ is an integer, in the BER expressions is neglected as it can be implemented using simple shift registers. Table I presents the computational complexity for the scenario where a single user is transmitting at subcarrier $k$. Tables II and III show the computational complexity of the BER calculations for $U_1$ and $U_2$, respectively, where both are transmitting per subcarrier. It can be noticed that the computational complexity is a function of the modulation order $M^{(n)}_k$ and the user's index. Clearly, the computational complexities of $P^{(1)}_k \,|\, \Gamma_k = 1$ and $P^{(2)}_k \,|\, \Gamma_k = 1$ are substantially larger than that of $P^{(n)}_k \,|\, \Gamma_k = 0$, which is due to the IUI introduced in NOMA systems. In order to quantify the number of operations required, Tables IV, V and VI present numerical computational complexity results for various $M^{(n)}_k$ values. The difference in the number of operations between $P^{(n)}_k \,|\, \Gamma_k = 0$ and $P^{(n)}_k \,|\, \Gamma_k = 1 \; \forall n$ is significant. For example, the overall computational complexities of $P^{(1)}_k \,|\, \Gamma_k = 1$ and $P^{(2)}_k \,|\, \Gamma_k = 1$ are, respectively, 2 and 6 times larger than that of $P^{(n)}_k \,|\, \Gamma_k = 0$. In order to fairly compare the computational complexity of the IAA and the proposed MIAA-A/D, Algorithm 4 is implemented using two different approaches: the first is based on the MIAA-A/D, and the other uses the original IAA.
In both cases FL initial vectors are considered for the loading process. The implementation of Algorithm 4 using the IAA initially requires computing $P^{(n)}_k \; \forall k$ and $\bar{P}^{(n)}$. Then, if $C_2$ is not satisfied, we search for the subcarrier with the maximum $P^{(n)}_k$ and reduce its allocated bits by 2. After that, the BER for the $k$th subcarrier and $\bar{P}^{(n)}$ are recomputed to validate $C_2$, and the process continues until it is satisfied. The same procedure is applied for the other user. Due to the mutual interference in NOMA systems, the $n$th user final output of the aforementioned steps can be further increased by first searching for the subcarrier with the minimum BER, then increasing its bits by 2, and computing $P^{(n)}_k$ and $\bar{P}^{(n)}$ to check $C_2$. This process continues until $C_2$ is violated for both users. As can be noted, the complexity of implementing Algorithm 4 using the IAA is mostly determined by two main operations. The first is that the algorithm initially calculates the BER for all subcarriers, and then the BER is recomputed for each subcarrier whose allocation changes. The second is the search process for the subcarrier with the maximum/minimum probability of error. The overall complexity depends on the number of times the two operations are repeated, which is random. In the worst case scenario, the BER and maximum search processes might be repeated 4 × K M times, which is expected to occur at low SNRs. It should be noted that the complexity of the ascending or descending sorting processes can be represented by the number of maximum/minimum searches. For the proposed Algorithm 4 using MIAA-A/D, the number of BER computations and search processes depends on the number of required iterations to obtain the final bit loading.

V. NUMERICAL RESULTS
This section presents the performance of the proposed algorithm in terms of throughput and complexity. The simulation results are obtained using a computing machine with an Intel Xeon E5-2640 processor, a clock frequency of 2.5 GHz, 16 GB RAM, and a 64-bit operating system. The considered OFDM-NOMA system has a total of $K = 64$ subcarriers for $U_1$ and $U_2$. In order to obtain a reliable BER performance, the possible modulation orders for each user are set to either QPSK or 16-QAM. Moreover, a subcarrier can be turned off for both users. Therefore, if both users are transmitting at the $k$th subcarrier, the superposed signal modulation order is going to be either 16-QAM or 256-QAM. The BER thresholds are $\bar{P}^{(n)}_{Th} \in \{10^{-2}, 10^{-3}\}$. For the case where both users are using a certain subcarrier, $\beta_1$ is chosen based on (2), $\beta_1 \in \{0.92, 0.95, 0.99\}$. These values are selected because they are valid for all the modulation orders that will be used in this work, where the power should be selected such that $\beta_1 \ge 0.9$ because $b_{\max} = 4$ for both users, as shown in (2), or in [29] for more general scenarios. If the subcarrier is allocated to only one user, then $\beta_n = 1$ for that user. The numerical results are presented for the two possible initial solutions, i.e., FL and PL. The $n$th channel is modeled as a quasi-static Rayleigh frequency-selective fading channel, where the channel remains fixed during a given OFDM symbol, but changes randomly between adjacent symbols. More specifically, the channel is modeled as a 15-tap multipath channel with normalized delays of [0, 1, 4, 5, 12, 14] and average gains of [0.4584, 0, 0, 0, 0.147, 0.0928, 0, 0, 0, 0, 0, 0, 0.1851, 0, 0.1167] [22]. The distance ratio between the two users and the BS is $d_1/d_2 = 1.3$, and the path loss exponent is assumed to be $\xi = 2$. The CSI is assumed to be known perfectly at the transmitter side, unless mentioned otherwise. The results are obtained by running and averaging Monte Carlo simulations for $10^{4}$ channel realizations.
For each channel realization, Algorithm 4 is run at 26 different average SNR values, $\bar{\gamma}$, ranging from 0 to 50 dB. Figures 1 and 2 represent the throughput of $U_1$ and $U_2$, respectively, for various power coefficients where $\bar{P}^{(1)}_{Th} = \bar{P}^{(2)}_{Th} = 10^{-2}$. In the presented results, the starting user and initial solution type are varied and their impacts are investigated. The initial solution type and starting user are denoted as [FL, $U_n$] and [PL, $U_n$]. The SU bit loading solution for the $n$th user is obtained by running Algorithm 4 using $\beta_n = 1$ and $\beta_{3-n} = 0$. This setup can provide a near-optimal solution because it is equivalent to OMA [22], and thus can be considered as an upper bound for Algorithm 4 due to the absence of the IUI. For fair comparison with the proposed algorithm, another SU solution is provided for $U_1$ where $\beta_1 \in \{0.92, 0.95, 0.99\}$ and $\beta_2 = 0$, as shown in Fig. 1. Similarly, Fig. 2 shows the SU solution for $U_2$ using $\beta_2 \in \{0.08, 0.05, 0.01\}$ while $\beta_1 = 0$.
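One channel realization of the setup above can be generated as in the following sketch. The tap variances are the average gains quoted in the text; the amplitude scaling $d_n^{-\xi/2}$, the choice $d_2 = 1$ (so that $d_1 = 1.3$), and all function names are assumptions made for illustration.

```python
import cmath
import math
import random

# Average tap gains quoted in the text (15 taps, five of them nonzero).
PDP = [0.4584, 0, 0, 0, 0.147, 0.0928, 0, 0, 0, 0, 0, 0, 0.1851, 0, 0.1167]


def channel_envelope(rng, distance, path_loss_exp=2.0, num_subcarriers=64):
    """Return the per-subcarrier envelopes alpha_k of one Rayleigh realization."""
    taps = []
    for variance in PDP:
        sigma = math.sqrt(variance / 2.0)  # per real/imaginary component
        taps.append(complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)))
    scale = distance ** (-path_loss_exp / 2.0)  # assumed path loss model
    envelope = []
    for k in range(num_subcarriers):
        h_k = sum(g * cmath.exp(-2j * math.pi * i * k / num_subcarriers)
                  for i, g in enumerate(taps))
        envelope.append(scale * abs(h_k))
    return envelope


def snr_grid_db(start=0.0, stop=50.0, points=26):
    """Evenly spaced average SNR values in dB."""
    step = (stop - start) / (points - 1)
    return [start + step * i for i in range(points)]


rng = random.Random(2024)
alpha_u1 = channel_envelope(rng, distance=1.3)  # far user
alpha_u2 = channel_envelope(rng, distance=1.0)  # near user
print(len(alpha_u1), len(alpha_u2), snr_grid_db()[:3])
```

A Monte Carlo run would draw a fresh pair of envelopes per realization and evaluate the loading algorithm on every value of the SNR grid.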
As can be noted from Figures 1 and 2, the throughput per user is not monotonically increasing versus SNR as in the case of OMA [22]. The reason for such behavior is that both users are competing for the same frequency bands, and these bands will be allocated to the user that can contribute more to the total throughput. Therefore, increasing the SNR can be beneficial for one user more than the other, which allows that user to be allocated more bits and hence increases the total throughput. For example, for the case of [FL, 1] in both figures in the range of $\bar{\gamma} \in [14, 18]$ dB, $U_1$ loses about 0.378 bits while $U_2$ gains 0.671, and hence, the total throughput increases by 0.29 bits. The obtained trend also depends on $\beta_1$.
For the SU performance, the two SU solutions do not upper bound the proposed algorithm solutions due to the reduced power coefficients, specifically over the low SNR region. For the SU in Fig. 1, the main reason is that the near user, $U_2$, will generally have good channel conditions, which implies that most of its subcarriers will not be switched off. Therefore, the power allocated to the SU is similar to the one used for the proposed algorithm. In some rare cases at low SNRs, a few subcarriers for $U_2$ will be switched off, and thus all the power will be allocated to $U_1$. Therefore, the proposed algorithm will slightly outperform the SU case. For the SU in Fig. 2, the same justification is also applicable. However, the subcarriers of $U_1$ will be switched off more frequently, which gives $U_2$ the chance to utilize the full power, i.e., $\beta_2 = 1$. Therefore, the throughput of $U_2$ using the proposed algorithm in this case will outperform the SU. For the high-SNR region, the impact of the reduced power allocation on the BER becomes less significant; therefore, both SU solutions converge and become upper bounds for Algorithm 4. Moreover, it can be noted that increasing $\beta_1$ for $U_1$ increases the throughput for all solutions and reduces the gap with the corresponding SU solution. This is due to the fact that increasing $\beta_1$ reduces the impact of the IUI [26]. Therefore, by excessively increasing $\beta_1$, the throughput of $U_1$ becomes almost independent of the other user, as shown in Fig. 1, particularly for $\beta_1 = 0.99$. On the contrary, increasing $\beta_2$ does not necessarily guarantee enhanced throughput performance, as can be noted for $\beta_2 = 0.05$ and $0.08$ in Fig. 2. This is due to the error performance of the two stages of SIC and the associated error propagation. Therefore, it is important to properly allocate the power coefficients to provide optimal BER performance at both stages to enhance the corresponding throughput.
Because the BER in NOMA depends on the IUI and $\beta_n$, four different solutions are provided for both users, where the initial solution type and the starting user are varied as [FL, 1], [PL, 1], [FL, 2], and [PL, 2]. In Fig. 1 for $\beta_1 = 0.92$, both [FL, 2] and [PL, 2] outperform the solutions that start with $U_1$ in the range $\gamma = 0$ to $34$ dB. In fact, starting the bit loading with $U_2$ reduces the IUI for $U_1$, which results in a throughput similar to that of the near-optimal SU solution with unity power. On the other hand, the [FL, 1] and [PL, 1] solutions are obtained in the presence of high IUI from $U_2$, which reduces the chance of satisfying $P_{Th}^{(1)}$ and hence decreases the throughput. In the high-SNR range, $\gamma = 35$ to $50$ dB, the throughput for all cases becomes roughly identical because the BER is reduced substantially and the BER threshold can be satisfied regardless of the initial solution or starting user. Similar observations can be made when increasing $\beta_1$. The opposite behavior can be noticed in Fig. 2, where higher throughput is achieved when the algorithm starts with $U_1$, which reduces the IUI for $U_2$. Moreover, both the FL and PL solutions result in similar throughput.
[...] $P_{Th} = 10^{-3}$ in Fig. 4. It can be noted that both the FL and PL initial solutions perform almost identically over the full SNR range. Due to the conflicting BER requirements of the individual users, increasing $\beta_1$ does not necessarily increase the total throughput over the full SNR range. It can be noted that using $\beta_1 = 0.95$ provides the best throughput. Therefore, proper power allocation is crucial for OFDM-NOMA systems to maximize the total system throughput. It is also interesting that OFDM-NOMA significantly outperforms OFDM-OMA in terms of throughput over the entire range of considered SNR values and in all considered scenarios. The difference between the two solutions increases with the SNR because both NOMA users are then able to have fully loaded subcarriers.
As can be noted from both figures, NOMA with bit loading can offer a throughput improvement that may reach 100% for a wide range of SNRs and operating scenarios.
[...] $P_{Th} = 10^{-3}$. The modulation order when single-user transmission is selected is denoted as $M_k^{(n)+}$. As stated earlier, a subcarrier $k$ can be turned off for both users, especially in the low-SNR region, which is referred to as "OFF". For low SNR values, single-user transmission is more likely to occur because it can provide a lower BER, as can be seen in Fig. 5. Moreover, because $U_2$ has much lower power than $U_1$, it is expected to have lower throughput and lower modulation orders. For the case where $\gamma = 30$ dB, the OFF mode percentage is almost zero because $C_2$ can be satisfied for both users. Also, at such high SNRs, simultaneous transmission from both users is highly probable, with superposed modulation orders of 16-QAM or 256-QAM. Due to the low power assigned to $U_2$, lower modulation orders are frequently assigned to it. The results show that as $P_{Th}^{(n)}$, $\forall n$, increases, both the modulation orders and the likelihood of simultaneous transmission increase.
Computational efficiency is a critical requirement in several delay-sensitive applications. Hence, we provide the average computational time required for several cases of interest in Table VII. The time required to obtain the throughput results varies depending on the initial solution type and the starting user. It can be noted that starting with a PL solution reduces the computational time as compared to starting with an FL solution.
To quantify the reduction, the ratio percentage is calculated as $R_1 = \frac{T_{PL}}{T_{FL}} \times 100\%$, where $T_{PL}$ and $T_{FL}$ are the times needed to compute Algorithm 4 when starting with the PL and FL initial solutions, respectively. The computational time decreases as $\gamma$ increases because it becomes easier to satisfy the BER constraints for both users. Moreover, as $P_{Th}^{(n)}$, $\forall n$, decreases, the computational time increases because $U_n$ requires more time to satisfy $C_2$. As can be concluded from the table, it is more desirable to start with $U_2$ and a PL initial solution for all considered scenarios. The table also shows that the simulation time for the bit loading process is generally large as compared to the single-user IAA [22]. This is because computing the BER for NOMA is time demanding. To reduce the overall time requirements of the bit allocation problem, the exact NOMA BER formulas can be replaced by more efficient approximations or lookup tables. Moreover, further complexity reduction techniques such as subcarrier grouping [22] can be adopted.
Fig. 9 shows the individual and total throughput using Algorithm 4 based on the IAA and the MIAA-A/D for $\beta_1 = 0.95$ and $P_{Th}^{(1)} = P_{Th}^{(2)} = 10^{-2}$. In other words, Algorithm 4 is implemented using two different approaches: the first is based on the proposed MIAA, and the other uses the original IAA. In both cases, we use FL initial vectors for the loading process. As can be noted from the figure, using the IAA provides higher total throughput when compared to the proposed MIAA-A, as detailed in Sec. III-C. However, applying the second iteration to the IAA, which aims at increasing the bit allocation, does not necessarily enhance the throughput as it does in the case of the proposed MIAA-D. For $U_1$, the throughput using the proposed algorithm outperforms the other algorithm over the low and moderate SNR ranges. However, for $U_2$, the result obtained using Algorithm 4 based on the IAA is better than that of the other algorithm over the range $\gamma \in [4, 22]$ dB. As for the total throughput, the proposed algorithm slightly outperforms the IAA over the SNR range, except for $\gamma \in [8, 18]$ dB. It can be noted that the throughput of the proposed algorithm using the MIAA is generally equivalent to that using the IAA, but at lower complexity.
Fig. 10 shows the impact of imperfect CSI on the proposed bit loading scheme. The CSI imperfection corresponds to the SNR estimates, where the noise power is assumed to be biased by an error factor of $\sigma_E^2$. As can be noted from the figure, the transmitter underestimates the noise effect, which causes throughput inflation in certain regions of the SNR axis. Therefore, the communications link will actually experience a higher BER than anticipated, which can cause a QoS constraint violation. This problem is less severe in SNR regions where the throughput plateaus. On the contrary, if $\sigma_E^2$ is negative, the system will be underloaded, which may cause throughput degradation, but the QoS constraint will still be satisfied. The same approach can be followed to evaluate the system performance when the bias is random, or when the channel gains have estimation errors.
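Returning to the complexity discussion, the suggestion to replace the exact BER formulas with lookup tables can be illustrated with a minimal sketch. The exact NOMA BER expressions are not reproduced here, so the sketch substitutes the textbook approximate BER of Gray-coded square M-QAM over AWGN as a single-user stand-in. The names `mqam_ber_approx` and `BerLookupTable`, the 0 to 50 dB grid, and the 0.5 dB step are all assumptions made for illustration and are not taken from the proposed algorithm.

```python
import bisect
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x), written in terms of erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mqam_ber_approx(snr_db: float, order: int) -> float:
    """Textbook approximate BER of Gray-coded square M-QAM over AWGN.

    Illustrative stand-in only: this is NOT the exact NOMA BER formula.
    snr_db is the per-symbol SNR in dB; order is M (4, 16, 64, ...).
    """
    snr = 10.0 ** (snr_db / 10.0)
    scale = 4.0 / math.log2(order) * (1.0 - 1.0 / math.sqrt(order))
    return scale * q_function(math.sqrt(3.0 * snr / (order - 1)))


class BerLookupTable:
    """log10(BER) precomputed on a uniform SNR grid, linearly interpolated."""

    def __init__(self, order, snr_min_db=0.0, snr_max_db=50.0, step_db=0.5):
        count = int(round((snr_max_db - snr_min_db) / step_db)) + 1
        self.grid = [snr_min_db + i * step_db for i in range(count)]
        # Floor the BER so that log10 stays finite when erfc underflows to 0.
        self.log_ber = [
            math.log10(max(mqam_ber_approx(g, order), 1e-300))
            for g in self.grid
        ]

    def ber(self, snr_db: float) -> float:
        # Clamp queries to the tabulated range, then interpolate in log10.
        s = min(max(snr_db, self.grid[0]), self.grid[-1])
        hi = min(max(bisect.bisect_right(self.grid, s), 1), len(self.grid) - 1)
        lo = hi - 1
        w = (s - self.grid[lo]) / (self.grid[hi] - self.grid[lo])
        return 10.0 ** ((1.0 - w) * self.log_ber[lo] + w * self.log_ber[hi])
```

In a loading loop, one table per modulation order would be built once, after which each BER check costs a binary search and one interpolation instead of repeated `erfc` evaluations. Interpolating in the log domain is a design choice that keeps the relative error small where the BER falls steeply with SNR. Whether the same accuracy holds for the exact NOMA expressions would have to be verified separately.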

VI. CONCLUSION AND FUTURE WORK
This work considered the adaptive bit loading problem for a two-user OFDM-NOMA system, where the objective is to maximize the total throughput while satisfying the users' individual BER constraints. Because the complexity of bit loading for OFDM-NOMA is significantly higher than for OFDM-OMA, we proposed an efficient heuristic algorithm that converts the two-dimensional optimization process into two one-dimensional processes, and thus significantly reduces the complexity of the bit loading process. The proposed algorithm is based on the incremental allocation concept; however, several fundamental modifications are performed to accommodate the differences between OMA and NOMA. For example, the need for a frequent search for the maximum BER was eliminated by performing the allocation based on the channel gain instead of the BER. Moreover, different initial starting vectors and loading directions were applied to handle the multi-user interference. The obtained numerical results show that bit loading for NOMA can significantly improve the total throughput because NOMA, in this case, has an inherent spectrum pooling feature that enables maximizing the utilization of the available spectrum. More specifically, it has been shown that bit loading may provide a spectral efficiency advantage of more than 300% for NOMA over OMA at low and moderate SNRs, and about 100% at high SNRs. The computational complexity of the proposed algorithm was evaluated for various operating conditions, and the obtained results showed that the proposed PL approach can reduce the complexity by about 50% for several cases of interest as compared to the typically used FL solution. However, the bit loading complexity for OFDM-NOMA is generally higher than for OFDM-OMA due to the heavy computations of the BER.
It is worth noting that adaptive power allocation for multicarrier systems with QoS constraints is another interesting problem that should be tackled. The highly nonlinear BER formulas for NOMA complicate the power computation. Therefore, an efficient design is needed to solve this problem.
Although this work considered bit loading for NOMA, the number of users is limited to two. Consequently, extending the proposed algorithms to three or more users is crucial. Generally speaking, the proposed algorithm can be applied to any number of users by following the same approach used in this work. Nevertheless, the computational complexity may increase nonlinearly with the number of users. More specifically, computing the BER for more than two users with high modulation orders requires extensive computational power [30]. Moreover, the power assignment is another challenge because using fixed power for all modulation schemes may degrade the performance for certain users. For example, for the three-user case with 16-QAM, the maximum power of the third user is about $6.5 \times 10^{-3}$, while if the modulation is switched to 4-QAM, that user can be allocated up to 0.18 [29]. Therefore, extending this work to more than two users is an interesting research problem that we will consider in our future work.
In the following list of symbols, the indices $k$ and $n$ correspond to the subcarrier and user index, respectively. Both indices will be dropped unless it is necessary to state one or both of them explicitly.