Deep Learning-Based Activity Detection for Grant-Free Random Access

The cellular Internet of Things wireless network is a promising solution to provide massive connectivity for machine-type devices. However, designing grant-free random access (GF-RA) protocols to manage such connections is challenging, since they must operate in interference-aware scenarios with sporadic device activation patterns and a shortage of mutually orthogonal resources. Supervised machine learning models have provided efficient solutions for activity detection, noncoherent data detection, and nonorthogonal preamble design in scenarios with massive connectivity. In this article, we develop two deep learning (DL) sparse support recovery algorithms to detect active devices in massive machine-type communication random access. The DL algorithms, developed to deploy GF-RA protocols, are based on the deep multilayer perceptron and the convolutional neural network models. Unlike previous works, we investigate the impact of the preamble sequence type on the activity detection accuracy. Our results reveal that preambles based on the Zadoff-Chu sequences, which present good correlation properties, achieve better activity detection accuracy with the proposed algorithms than random sequences. In addition, we demonstrate that our DL algorithms achieve activity detection accuracy comparable to state-of-the-art techniques with extremely low computational complexity.


I. INTRODUCTION
The massive machine-type communication (mMTC) service will provide full connectivity for Internet of Things (IoT) applications in cellular wireless networks. Machine-type devices have sporadic activation patterns, transmit small payloads with low-complexity network hardware, and are powered by short-lifetime batteries. At the same time, they form increasingly massive populations distributed over wide areas. Combined, these characteristics impose challenges for providing connectivity, especially with regard to network access. The limited channel coherence time and the high number of devices prohibit the design of mutually orthogonal preamble sequences. Operating with fewer preambles than devices results in resource contention and network collisions, which increase the access delay due to retransmissions. On the other hand, nonorthogonal preamble sequences reduce the number of collisions at the cost of introducing interference among the active devices and degrading the system performance [1]. The random access (RA) protocols supporting the mMTC service aim to provide connectivity under the premise of unpredictable and sporadic traffic generated by the machine-type devices. It is worth mentioning that the random aspect treated herein corresponds to the random activation pattern of the devices. In this sense, grant-free random access (GF-RA) is an alternative to provide connectivity, dealing efficiently with the interference generated by the nonorthogonal preambles while keeping a low network access delay [2].
Methods for sparse signal processing are broadly employed to cope with the interference and enable GF-RA procedures. These techniques are suitable for the typical mMTC scenario due to the sparse nature of the signals generated by machine-type devices. Specifically, we are interested in the sparse support and signal recovery methods. Sparse support recovery aims to estimate the indices of the active devices during a random access slot transmission. In contrast, sparse signal recovery is employed for channel estimation or data detection. Approaches to perform sparse support and signal recovery are classified in compressed sensing (CS)-based, covariance (CV)-based [2] and, more recently, machine learning (ML)-based.
The sparse support and signal recovery methods based on ML have presented results that substantially outperform the CS-based and CV-based techniques, both in performance and in computational complexity. These methods have extremely low running time at the cost of an intense offline training phase to learn the neural network parameters. Furthermore, according to the universal approximation theorem, any Borel measurable function can be approximated with any desired nonzero amount of error by a properly designed neural network [3].
We can classify the ML-based approaches into the data-driven and model-driven categories. Data-driven approaches [4]-[7] use model-agnostic schemes that learn the mapping to the outputs purely from data. These approaches are convenient for situations involving intractable, incomplete, or poorly understood mathematical models, when it is hard to obtain accurate parameter estimates, or when the conventional approaches result in computationally inefficient algorithms. Besides, the data-driven approaches are able to extract implicit features from data, helping to understand the problem at hand [8]. On the other hand, model-driven approaches [9]-[15] are based on domain knowledge and structures in the data. One common strategy is to craft application-specific neural networks from traditional algorithms using the deep unfolding technique originally proposed in [16]. The model-driven approaches join the specialized knowledge of classical signal processing with the flexibility and generalizability of the ML tools. However, their application needs to be carefully assessed. Highly model-dependent algorithms may result in poor inference or a waste of computational resources, since they ignore the fact that system parameters are dynamic and may be unknown. Additionally, algorithms which incorporate simplistic models may not capture the main properties and dynamics of the data.

A. Literature Review
We provide an outline of the literature on sparse support and signal recovery applicable to GF-RA protocols. The approximate message passing (AMP) algorithm from the CS literature is present in many recent contributions for device activity detection and channel estimation [17], [18]. Specifically, the authors in [18] proposed a noncoherent scheme that embeds information bits into the devices' preamble sequences, allowing joint activity and data detection. Despite the promising results of the AMP algorithm, its dependence on statistical channel information and poor stability represent technical difficulties for practical implementations.
Iterative methods using the empirical covariance matrix of the received signal are used for device activity detection in [19]. Two low-complexity algorithms to detect the set of active devices in massive MIMO unsourced RA are studied. Although the techniques achieve performance comparable to the AMP with multiple-measurement vectors and better numerical stability, the computational complexity is still high owing to the iterative structure of the estimation algorithms. Adopting a different strategy, the authors in [20] proposed an unsourced RA access scheme for millimeter-wave (mmWave) massive MIMO systems based on beam-space tree decoding. The scheme exploits the intrinsic beam division property of the mmWave channel to serve more active users and improve the system performance, outperforming the method proposed in [19].
The problem of mMTC GF-RA has been addressed in a variety of scenarios supported by different technologies. For instance, in [21], the problem of joint activity detection and channel estimation in cell-free massive MIMO was studied. The authors developed an asymptotic analysis of the MMSE estimation of sparse signal vectors, obtaining activity detection rules both at only one access point and at all access points considering cooperation between them. In another context, the same problem is tackled for satellite-enabled IoT applications in [22]. The authors proposed a Bernoulli-Rician message passing with expectation-maximization algorithm that is robust to unknown channel impairments.
On the matter of the ML-based approaches, neural networks obtained via deep unfolding of existing algorithms are presented in [9]-[12] and [15]. The networks of [9] and [10] are based on the AMP and, differently from the original algorithm, do not require prior information about the system parameters and channel statistics due to the transformation of the AMP parameters into trainable weights. The recursive neural network obtained from the iterative shrinkage thresholding algorithm for complex group row-sparse matrix signals in [11] converges faster while achieving higher prediction accuracy than the benchmark algorithms and robustness against ill-conditioned measurement matrices. The authors in [12] provided a framework to develop neural networks with adaptive depth, improving the efficiency of fixed-depth networks obtained via deep unfolding.
End-to-end neural networks that mimic the noisy measurements and the sparse recovery to jointly design the preamble matrix and the receiver are proposed in [4], [6], and [13]. Different deep auto-encoder algorithms, derived from CS-based and CV-based algorithms, that achieve high channel estimation and activity detection accuracy are proposed in [13]. These end-to-end network designs demonstrate the benefits of correctly choosing the preamble sequences to attain reasonable sparse recovery accuracy. As demonstrated by the aforementioned works, designing preamble matrices matched with the receiver is a promising path to increase the sparse recovery accuracy. However, we remark that crafting preamble sequences from scratch results in additional overhead to transmit the generated sequences to the machine-type devices. This problem is mitigated with the adoption of easy-to-generate sequences, which can be obtained systematically at the device side. In this context, in [15], three types of easy-to-generate preamble sequences, one obtained from a Gaussian random distribution, another obtained from a binary random distribution, and the last one obtained from the deterministic Zadoff-Chu (ZC) sequences, are evaluated and compared in joint device activity detection and channel estimation schemes. Numerical results reveal that accurate activity detection can be achieved with non-Gaussian preamble sequences taken from known sequence sets with good correlation properties. In this sense, preamble sequences developed by concatenating multiple short ZC sequences were proposed for mobile satellite communications in [23]. The resulting sequences are robust to the satellite channel environment and require low-complexity detection while benefiting the multiuser access performance.
A new algorithm for device activity detection based on a convolutional neural network (CNN) was presented in [7]. The CNN performs feature extraction from the received symbols to feed a densely connected layer that outputs the activity indicators. Simulation results reveal that the CNN outperforms a conventional neural network both in activity detection accuracy and computational complexity. Algorithms based on CNNs have achieved promising results in applications such as image and video processing, natural language processing, and anomaly detection. Compared with the conventional densely connected layers, the convolutional layers of a CNN have sparse connections between the neurons and parameter sharing. Additionally, their computations exploit the correlation between the network inputs. In [24], a framework to recover sparse signals from noisy measurements in imaging applications using a CNN was developed. The scheme learns both the representation for signals and the inverse mapping function from measurement vectors in an end-to-end network. The proposed network approximates the solution of state-of-the-art algorithms in a much lower running time, achieving a remarkable tradeoff between reconstruction time and probability of successful signal recovery.

B. Contributions
The contribution of our work is threefold.
1) We propose two robust, low-running-time deep learning (DL) algorithms for sparse support recovery, used for efficient device activity detection in mMTC GF-RA protocols.
2) We extend the neural network design workflow by adding a feature selection step based on choosing sets of preamble sequences with good correlation properties.
3) We carry out a comprehensive numerical characterization of the activity detection accuracy and the computational complexity achieved by the proposed algorithms under a variety of scenarios and different settings of preamble sequence sets.
Elaborating further, we develop two DL algorithms for sparse support recovery used for efficient device activity detection in mMTC GF-RA protocols. The proposed data-driven neural networks do not require any statistical information about the channel or the devices' activity, have extremely low running time, and are robust against variations in the system parameters. The first network is a deep multilayer perceptron (DMLP) composed of densely connected layers, whose inputs are the received symbols during the random access slot. The second network is a CNN composed of stacked convolutional layers, whose inputs are the received symbols correlated with the preamble sequences. In particular, the CNN performs the sparse support recovery by exploiting the structured correlation of the network inputs.
We extend the neural network design workflow by adding a feature selection step, where the type of preamble sequence is selected to provide input features that increase the networks' activity detection accuracy. During this step, we evaluate the DMLP and CNN with features generated by random complex normal and Bernoulli sequences, as well as deterministic ZC sequences [25], which present good correlation properties. We demonstrate that the feature selection (in our case, the choice of the preamble sequence type) is as important as the network architecture design, supporting the use of well-known and easy-to-generate sequences in ML-based algorithms for applications in wireless communications.
The proposed networks are subjected to a parameter selection step aiming to optimize their architectures. Then, offline training with the backpropagation algorithm and a representative dataset is performed to learn the network parameters. Differently from other works in the literature, we carry out a comprehensive analysis of the activity detection error via the receiver operating characteristic (ROC) and the detection error tradeoff (DET) metrics. Numerical results reveal that the DL algorithms achieve competitive activity detection accuracy with much lower running time than state-of-the-art CS-based algorithms. In particular, the feature selection demonstrates that the CNN benefits from the good correlation properties of the ZC sequences, achieving improved activity detection accuracy in the mMTC scenario. Moreover, the networks are robust to variations in the system parameters, demonstrating the generalization power of the DL models.

II. SYSTEM MODEL
In this section, we describe the system model, as well as formulate the device activity detection problem. We consider the uplink of a narrow-band system constituted by one cell with K machine-type devices, served by a single-antenna base station (BS), as depicted in Fig. 1. These devices access the network with probability p_a ≪ 1. For this reason, the number of active devices is much smaller than the total number of devices in the cell.
During the random access slot, the active devices transmit unique preamble sequences. These sequences identify the devices and are used at the BS to detect the set of active IoT devices. The preamble sequences have a length of L < K symbols due to the limited channel coherence time. For this reason, it is impossible to design mutually orthogonal sets of preambles. In our work, we consider three types of nonorthogonal sequences for the preambles. The nonorthogonal sequences result in interference on the signal transmitted by the set of active devices during a random access slot. Details on the used sequences are given in the next subsection. The preamble sequences are defined by the vectors a_k ∈ C^{L×1} for k = 1, . . . , K, and are normalized such that ‖a_k‖₂² = L, ∀k. The preamble matrix A ∈ C^{L×K}, which is known at the BS, contains in its columns the preamble sequences of all the K devices in the cell.
Let α_k ∈ {0, 1} for k = 1, . . . , K be the activity state indicator of device k, which is equal to 1 if the device is active and 0 otherwise. We consider the activity state indicators to be i.i.d. random variables following a Bernoulli distribution with parameter p_a, i.e.,

p(α_k) = p_a δ(α_k − 1) + (1 − p_a) δ(α_k),     (1)

where p(α_k) is the probability mass function of α_k and δ(u) is the Dirac delta function. The vector with the activity state indicators of all the K devices is named the activity descriptor, α = [α_1 · · · α_K]^T. The channel links between each device and the BS follow the Rayleigh fading model, with the channel coefficient associated with device k defined by h_k ∼ CN(0, 1). It is worth noting that the evaluation of the proposed deep learning-based solutions under different channel conditions, e.g., line-of-sight propagation or spatially correlated channel vectors, is a promising path for future investigation.
The received signal at the BS during the random access slot is equal to the superposition of the preambles transmitted by the active devices, which can be written as

y = Σ_{k=1}^{K} α_k h_k a_k + z,     (2)

where z ∼ CN(0, σ_z² I) is the additive white Gaussian noise. Considering x = [α_1 h_1 · · · α_K h_K]^T, we can rewrite the received signal in the compact form

y = A x + z.     (3)

The goal of the device activity detection problem is to determine the set of active devices from the received signal y and the preamble matrix A. This is equivalent to finding the indices of the nonzero entries of the vector x, namely the support of x,

supp(x) = {k ∈ {1, . . . , K} : x_k ≠ 0}.

As the number of active devices is much smaller than the total number of devices in the cell (sporadic activation, i.e., p_a ≪ 1), x has a sparse structure. For this reason, the problem of finding the support of x from y and A can be cast as a sparse support recovery problem.
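As a concrete illustration, the signal model above can be simulated in a few lines. The sketch below uses illustrative parameter values and a complex Gaussian preamble matrix; all variable names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values (not the paper's simulation setup)
K, L, p_a, sigma2_z = 40, 20, 0.1, 0.1

# Preamble matrix A with columns normalized so that ||a_k||^2 = L
A = (rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))) / np.sqrt(2)
A *= np.sqrt(L) / np.linalg.norm(A, axis=0)

# Sporadic activity indicators and Rayleigh channel coefficients
alpha = rng.random(K) < p_a
h = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
x = alpha * h                     # sparse: nonzero only for active devices

# Received signal y = A x + z
z = np.sqrt(sigma2_z / 2) * (rng.standard_normal(L) + 1j * rng.standard_normal(L))
y = A @ x + z

support = np.flatnonzero(x)       # the set the detector must recover
```

Since p_a ≪ 1, the vector x is sparse with high probability, which is what makes the support recovery formulation applicable.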
The sparse support recovery problem is an object of study in the compressed sensing field and has been addressed in many works using different approaches, e.g., see [26] and the references therein. In our work, the aim is to develop two DL algorithms for sparse support recovery in the context of device activity detection for mMTC. We demonstrate that a DL algorithm whose inputs are the real and imaginary parts of the received signal is sufficient to efficiently perform device activity detection.

A. Nonorthogonal Preamble Sequences
The preamble sequences play a key role in the device activity detection problem, as they dictate the interference patterns on the signal transmitted by the active devices during each random access slot. In addition, the inputs of the DL activity detection algorithms developed in the sequel are based on the symbols received at the BS, (3), a linear transformation of the preamble matrix. Hence, the learned neural network parameters depend directly on the preamble sequence features (cross- and auto-correlation functions), which control how the signals of the simultaneously active devices are interrelated.
We remark two desired properties of preamble sequences. First, the preamble sequences must be easy to generate, in a coordinated manner, at the device side. Such a property is desired since, in massive machine-type GF-RA, transmitting fully learned preamble sequences as in [4], [6], and [13] to a high number of devices in the cell creates a prohibitive communication overhead. Conversely, well-known sequences such as the ZC and random sequences can be generated systematically and in a coordinated manner at the device side with low overhead. The second property is the separability among the sequences in the set. The separability governs the interference levels between the signals of the simultaneously active devices. For this reason, choosing sets of sequences with high separability is important to obtain reasonable activity detection accuracy.
In the context of sparse support recovery using CS techniques, it is common to use preambles generated by sampling random distributions. However, we show that DL algorithms with deterministic preamble sequence sets that have good correlation properties can achieve reasonable activity detection accuracy. In the following, we present the three types of preamble sequence sets used in our work.
Random sequences: We use two types of random sequences for the preambles. Random sequences are useful, especially in cases where the number of mutually orthogonal preambles is not sufficient for all devices, e.g., in crowded mMTC applications, because it is easy to generate a large number of unique preambles. Additionally, matrices generated by sampling sub-Gaussian random distributions, such as the Bernoulli distribution, satisfy with high probability the restricted isometry property (RIP), a condition that guarantees signal reconstruction for many CS algorithms [27]. The first type of random sequence is obtained by sampling a circularly symmetric complex normal distribution with zero mean and unit variance,

ã_{l,k} ∼ CN(0, 1), for k = 1, . . . , K and l = 1, . . . , L.

We name this type of sequence normal. In order to meet the preamble normalization presented previously, we scale the normal preamble, obtaining

a_k = √L ã_k / ‖ã_k‖₂.

The second type of random sequence is the Bernoulli sequence. The Bernoulli sequences are real two-valued sequences obtained by sampling a symmetric random distribution of the type

Pr(a_{l,k} = +1) = Pr(a_{l,k} = −1) = 1/2, for k = 1, . . . , K and l = 1, . . . , L.

As each entry of a Bernoulli sequence has norm equal to 1, normalization is not required.

Zadoff-Chu sequences: The ZC sequences are polyphase sequences with elements defined by [25]

a_l = exp(−jπ r l(l + 1)/L), l = 0, . . . , L − 1,

for odd L, where j = √−1 and r ∈ {1, . . . , L − 1} is a number relatively prime to L, named the sequence root. A ZC sequence has the ideal auto-correlation property, i.e., its periodic auto-correlation value is equal to zero for all the shifted versions of the sequence. For this reason, a sequence and its cyclically shifted versions comprise a set of mutually orthogonal sequences. The ideal auto-correlation property holds only for sequences generated by a single root. On the other hand, sequences generated by different roots have constant cross-correlation magnitude equal to √L if the difference between the roots is relatively prime to L [28].
Therefore, a set of nonorthogonal sequences with a three-valued cross-correlation function can be generated by taking the shifted versions of multiroot ZC sequences. We use this set of nonorthogonal sequences to generate the ZC preambles. The ZC preambles are defined by the sequences and their shifted versions, considering R ⊂ {1, . . . , L − 1} the set of chosen roots. Given the number of devices in the cell and the preamble length, the minimum number of roots needed to generate unique preambles for all the K devices is equal to

N_r = ⌈K/L⌉.

It is worth mentioning that the set of nonorthogonal ZC preambles is composed of |R| ≥ N_r smaller subsets of orthogonal preambles. For this reason, allocating a subset of orthogonal preambles to devices with similar activation patterns is an efficient alternative to manage the interference levels.
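The correlation claims above can be checked numerically. A minimal sketch, assuming the odd-length ZC definition a_l = exp(−jπ r l(l+1)/L) and a prime length so that every root (and every root difference) is coprime to L:

```python
import numpy as np

def zadoff_chu(root, L):
    """ZC sequence of odd length L with root coprime to L."""
    n = np.arange(L)
    return np.exp(-1j * np.pi * root * n * (n + 1) / L)

L = 23  # prime length: every root in 1..L-1 is coprime to L
s1, s2 = zadoff_chu(1, L), zadoff_chu(2, L)

# Ideal periodic auto-correlation: a sequence is orthogonal
# to every nonzero cyclic shift of itself
shifted = np.roll(s1, 5)
assert abs(np.vdot(s1, shifted)) < 1e-9

# Different roots (difference coprime to L) give constant
# cross-correlation magnitude sqrt(L)
assert abs(abs(np.vdot(s1, s2)) - np.sqrt(L)) < 1e-9
```

The first assertion illustrates why single-root shifts form an orthogonal subset, and the second why multiroot sets are nonorthogonal with controlled interference.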

B. Sequences Performance in Activity Detection Problem
In order to give insight into the impact of the preamble design on the performance of the device activity detection scheme, we present a result on the distribution of the signal-to-interference-plus-noise ratio (SINR) of the received signal at the BS for the analyzed sequence types. The SINR gives an indirect measure of the activity detection accuracy, as it represents the ratio between the power of the signal of an active device and the power of the noise plus the interference generated by the simultaneously active devices. Fig. 2 depicts the empirical cumulative distribution function (CDF) of the SINR for the kth device using the normal, Bernoulli, and ZC sequences, considering three values of the activation probability, p_a ∈ {0.01, 0.1, 0.3}. The remaining setup parameter values used to generate this result are K = 40, L = 20, and σ_z² = 0.1, resulting in a signal-to-noise ratio (SNR) of 10 dB. Moreover, we define the SINR by correlating the received signal at the BS with the preamble of each active device, resulting in

SINR_k = L² |h_k|² / ( Σ_{j ∈ K_A \ {k}} |a_k^H a_j|² |h_j|² + L σ_z² ),

for k ∈ K_A, where K_A is the set of active devices during a specific random access slot. From Fig. 2, one can observe that decreasing the activation probability increases the average SINR values, since it reduces the number of simultaneously interfering devices during the transmission interval. The ZC sequences achieve the best average SINR values due to their good correlation properties. At the same time, the Bernoulli sequences marginally outperform the normal ones in terms of the SINR distribution.
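The correlator-output SINR described above can be computed empirically per active device. The expression below (signal power L²|h_k|², interference through preamble cross-correlations, noise power Lσ_z² since ‖a_k‖² = L) is our reconstruction of the correlation step described in the text, not a quote of the paper's equation:

```python
import numpy as np

def correlator_sinr(A, h, active, sigma2_z):
    """SINR after correlating y with each active device's preamble a_k.

    Correlating gives a_k^H y = L h_k + sum_j a_k^H a_j h_j + a_k^H z,
    so signal power is L^2 |h_k|^2, interference comes from the other
    active devices, and the noise power is L * sigma2_z.
    """
    L = A.shape[0]
    sinr = {}
    for k in active:
        signal = (L ** 2) * abs(h[k]) ** 2
        interf = sum(abs(np.vdot(A[:, k], A[:, j])) ** 2 * abs(h[j]) ** 2
                     for j in active if j != k)
        sinr[k] = signal / (interf + L * sigma2_z)
    return sinr
```

With mutually orthogonal preambles the interference term vanishes and the SINR reduces to L|h_k|²/σ_z², which matches the intuition that separability governs the interference level.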

III. DEEP LEARNING ALGORITHMS
In this section, we introduce the architectures of the DMLP and CNN algorithms devised for device activity detection. The two DL algorithms exploit different input features to perform device activity detection. In particular, the structure of the CNN exploits the correlation between the network inputs, benefiting from the good correlation properties of the ZC sequences.
Let S = {(y^(s), α^(s)) | s = 1, . . . , S} be a dataset containing the received signal and the activity descriptor of each random access slot. We denote the output of a DL algorithm for the sample (y^(s), α^(s)) as α̃^(s). Taking this into account and recalling that α^(s) ∈ {0, 1}^K, we can measure the quality of the estimate α̃^(s) w.r.t. the original activity descriptor α^(s) by calculating the binary cross-entropy function. Therefore, in our DL algorithms, the loss function is defined as the average binary cross-entropy function [29], calculated over all the S = |S| samples as

L(S) = −(1/S) Σ_{s=1}^{S} Σ_{k=1}^{K} [ α_k^(s) log α̃_k^(s) + (1 − α_k^(s)) log(1 − α̃_k^(s)) ].     (12)

The aim of the training procedure is to minimize L(S) by choosing the right parameters of f(v_0, θ). In this way, the trained DL network produces an accurate estimate of the activity descriptor.
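The average binary cross-entropy loss is straightforward to implement. A minimal sketch (the clipping constant is our own addition, for numerical safety near 0 and 1):

```python
import numpy as np

def bce_loss(alpha_true, alpha_hat, eps=1e-12):
    """Binary cross-entropy averaged over all entries (samples x devices)."""
    alpha_hat = np.clip(alpha_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(alpha_true * np.log(alpha_hat)
                    + (1.0 - alpha_true) * np.log(1.0 - alpha_hat))
```

For example, `bce_loss(np.array([1.0, 0.0]), np.array([0.9, 0.1]))` equals −log(0.9) ≈ 0.105, and the loss approaches zero as the soft outputs approach the true indicators.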

A. Deep Multilayer Perceptron
The DMLP algorithm, as depicted in Fig. 3, is implemented with densely connected layers. Its inputs are the real and imaginary parts of the received signal, while the output is an estimate of the activity descriptor. In the following, we describe in detail the layers of the DMLP.
The DMLP architecture has two densely connected hidden layers with N_n neurons each and one densely connected output layer with K neurons. The hidden layers have the ReLU activation function,

ReLU(u) = max(0, u),

while the output layer has the sigmoid activation function,

σ(u) = 1/(1 + e^{−u}).
The details of the DMLP layers are organized in Table I. It is worth noting that the numbers of inputs and outputs of the DMLP are tied, respectively, to the preamble length, L, and the number of devices, K. Hence, one needs to retrain the network from scratch in order to accommodate different numbers of devices or different preamble lengths. Now, we introduce the parameters of the DMLP layers, which are the weight matrices and the bias vectors. Given the dimensions of the inputs and outputs of each layer, we have the weight matrices W_1 ∈ R^{N_n×2L}, W_2 ∈ R^{N_n×N_n}, and W_3 ∈ R^{K×N_n} for the hidden layers 1 and 2 and the output layer, respectively. Similarly, the bias vectors are b_1 ∈ R^{N_n×1}, b_2 ∈ R^{N_n×1}, and b_3 ∈ R^{K×1}. Taking this into account, we define the set of DMLP parameters as

θ_DMLP = {W_1, b_1, W_2, b_2, W_3, b_3}.

Considering the DMLP parameters, we write the input-output relationship of the hidden layers as

v_i = ReLU(W_i v_{i−1} + b_i), i = 1, 2,

where v_i is the layer output, considering v_0 as the DMLP input, and the activation function is calculated for each entry of the input vector. The input-output relationship of the output layer is obtained following the same logic, substituting the layer parameters, as well as the ReLU activation, by the sigmoid function.
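The forward pass and the parameter count of this architecture can be cross-checked with a small NumPy sketch; the dimensions below are illustrative, not the paper's tuned values:

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def dmlp_forward(v0, W1, b1, W2, b2, W3, b3):
    """Two ReLU hidden layers followed by a sigmoid output layer."""
    v1 = relu(W1 @ v0 + b1)
    v2 = relu(W2 @ v1 + b2)
    return sigmoid(W3 @ v2 + b3)

# Illustrative dimensions: N_n neurons, preamble length L, K devices
Nn, L, K = 64, 20, 40
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((Nn, 2 * L)), np.zeros(Nn)
W2, b2 = rng.standard_normal((Nn, Nn)), np.zeros(Nn)
W3, b3 = rng.standard_normal((K, Nn)), np.zeros(K)

# Total trainable parameters: N_n^2 + 2*N_n*L + N_n*K + 2*N_n + K
n_params = sum(p.size for p in (W1, b1, W2, b2, W3, b3))

# Input: real and imaginary parts of the received signal, stacked
alpha_soft = dmlp_forward(rng.standard_normal(2 * L), W1, b1, W2, b2, W3, b3)
```

The counted `n_params` agrees with the closed-form expression Θ_DMLP(N_n, L, K), and the sigmoid output lies in [0, 1], ready for the hard decision module described next.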
With θ_DMLP in hand, it is possible to calculate the number of trainable parameters of the DMLP algorithm. Considering the dimensions of the weight matrices and the bias vectors, the number of trainable parameters is given by

Θ_DMLP(N_n, L, K) = N_n² + 2N_n L + N_n K + 2N_n + K.     (17)

Since the outputs of the DMLP are in the range [0, 1], a hard decision module with threshold parameter τ ≥ 0 is positioned at the output of the algorithm to calculate the activity descriptor in its original domain. Hence, the hard decision output is

α̂_k = 1 if α̃_k ≥ τ, and α̂_k = 0 otherwise.

During the device activity detection, two different kinds of error may occur. First, a false alarm (FA) occurs when an inactive device is detected as active. Second, a miss detection (MD) occurs when an active device is detected as inactive. The probabilities of FA and MD are calculated in terms of the hard decision threshold as

P_FA(τ) = Pr(α̂_k = 1 | α_k = 0),  P_MD(τ) = Pr(α̂_k = 0 | α_k = 1).

Given the imbalance in the number of active devices during each random access slot, the frequency of each type of error changes. The error probability as a function of τ is written in terms of the FA and MD probabilities as

P_e(τ) = (1 − p_a) P_FA(τ) + p_a P_MD(τ).

The hard decision module can be optimized in order to meet different design criteria, e.g., minimize a specific type of error, or a metric which combines the two types of error with different weights. In our work, we evaluate the algorithms adjusting the hard decision threshold such that

P_FA(τ*) = P_MD(τ*).     (22)

Since it is difficult to derive a closed-form expression for the output error of the proposed DL algorithms, the value of the hard decision threshold is computed near-optimally (τ*) using a numerical approach. In our work, we use the samples of the training dataset to adjust τ in order to reasonably meet the criterion of (22). This can be done by first predicting the activity descriptors with the trained neural networks to obtain the outputs {α̃^(s)}_{s=1}^S. Then, the threshold τ* that successfully recovers the largest number of samples is obtained by inspecting all the entries of the vectors {α̃^(s)}_{s=1}^S.
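A numerical threshold search of this kind can be sketched as follows. Here we scan the soft outputs themselves as candidate thresholds and pick the one balancing the false alarm and miss detection rates, which is one reasonable reading of the criterion described above (the function name and tie-breaking are ours):

```python
import numpy as np

def tune_threshold(alpha_true, alpha_soft):
    """Pick the threshold where the empirical FA and MD rates are
    (nearly) equal, scanning the soft outputs as candidates."""
    pos = alpha_soft[alpha_true == 1]       # scores of active devices
    neg = alpha_soft[alpha_true == 0]       # scores of inactive devices
    best_tau, best_gap = 0.5, np.inf
    for tau in np.unique(alpha_soft):
        p_fa = np.mean(neg >= tau)          # inactive flagged as active
        p_md = np.mean(pos < tau)           # active flagged as inactive
        gap = abs(p_fa - p_md)
        if gap < best_gap:
            best_tau, best_gap = tau, gap
    return best_tau
```

On well-separated scores the search lands between the two score clusters, yielding zero empirical error for both types.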

B. Convolutional Neural Network
Now, we introduce the CNN algorithm depicted in Fig. 4. Differently from the DMLP, the CNN algorithm is implemented with convolutional layers. In particular, we need to adjust the input dimension to match it with the number of outputs. In order to accomplish this while exploiting the good correlation properties of the ZC preamble sequences, we use a correlator stage at the input of the network, which produces the signal

ỹ = A^H y ∈ C^{K×1}.

This correlator stage can be seen as an additional layer with fixed weights instead of trainable parameters. Then, the inputs of the CNN are the real and imaginary parts of the correlated received signal, organized as two vectors of dimension K × 1.
The output is an estimate of the activity descriptor.
The CNN architecture has two hidden layers of the 1-D convolution type, each with N_f feature maps, filters of length N_w, and the ReLU activation function. The output layer has one feature map, filters of length N_w, and the sigmoid activation function. The layers' parameter values of the CNN algorithm are detailed in Table II. From Table II, one can see that the CNN architecture does not depend directly on system parameters such as the number of devices or the preamble length. In addition, differently from the DMLP, the CNN has a flexible structure that allows inputs of variable size, owing to the window-based computations across the layers. Hence, a trained CNN can be reused for other system configurations by only adjusting the correlator stage accordingly.
The parameters of the CNN layers are the filters and the bias values of each feature map. Let the superscript ℓ denote the index of the feature map. The parameters of the first hidden layer are W_1^ℓ ∈ R^(N_w×2) and b_1^ℓ ∈ R, ℓ = 1, …, N_f; the parameters of the second hidden layer are W_2^ℓ ∈ R^(N_w×N_f) and b_2^ℓ ∈ R, ℓ = 1, …, N_f; and the parameters of the output layer are W_3^1 ∈ R^(N_w×N_f) and b_3^1 ∈ R. With these definitions, the set of CNN parameters is
θ_CNN = {W_1^ℓ, b_1^ℓ, W_2^ℓ, b_2^ℓ, W_3^1, b_3^1 : ℓ = 1, …, N_f}.
Given the respective dimensions of the CNN parameters, the number of trainable parameters in the algorithm is
N_f(2N_w + 1) + N_f(N_f N_w + 1) + (N_f N_w + 1).
It is important to stress that, differently from the DMLP, the number of trainable parameters in the CNN does not depend explicitly on the number of devices or the preamble length. This is an advantage of the CNN, as its size does not grow rapidly with the scenario parameters. However, N_f and N_w must track changes in the scenario parameters in order to keep the device activity detection efficient. Let the convolution between M ∈ R^(n×p) and a filter N ∈ R^(N_w×p) be
[M ∗ N]_j = Σ_{w=1}^{N_w} Σ_{c=1}^{p} [M̃]_{j+w−1,c} [N]_{w,c}, j = 1, …, n,
where M̃ is a zero-padded version of M such that the output length equals n. The input-output relationship of the hidden layers is given by
v_i^ℓ = ReLU(V_{i−1} ∗ W_i^ℓ + b_i^ℓ 1), for i = 1, 2 and ℓ = 1, …, N_f,
where v_i^ℓ is the output of the ℓth feature map in layer i, and V_i is the matrix whose columns are the output vectors of all the feature maps, or channels, in layer i. The input-output relationship of the output layer is obtained by substituting the corresponding layer parameters and replacing the ReLU activation by the sigmoid. As in the DMLP, the CNN has a hard decision module to calculate the estimate of the activity descriptor in the original domain.
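As a sanity check on the architecture and the parameter count above, the following NumPy sketch runs the forward pass and counts the trainable parameters layer by layer. The concrete values of N_f, N_w, and the input length, as well as the "same" zero-padding scheme, are illustrative assumptions rather than the paper's exact settings (those are fixed by Table II):

```python
import numpy as np

# Illustrative values (assumptions): feature maps, filter length, input length
N_f, N_w, L = 40, 11, 71

rng = np.random.default_rng(0)

def conv1d_same(V, W, b):
    """1-D 'same' convolution: V is (n, p), W is (N_w, p), b is a scalar bias."""
    n, _ = V.shape
    Nw = W.shape[0]
    pad = Nw // 2
    Vp = np.pad(V, ((pad, Nw - 1 - pad), (0, 0)))   # zero-padded version of V
    return np.array([np.sum(Vp[j:j + Nw] * W) for j in range(n)]) + b

relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Layer parameters: one filter/bias pair per feature map
W1 = [rng.normal(size=(N_w, 2)) * 0.1 for _ in range(N_f)]; b1 = np.zeros(N_f)
W2 = [rng.normal(size=(N_w, N_f)) * 0.1 for _ in range(N_f)]; b2 = np.zeros(N_f)
W3 = rng.normal(size=(N_w, N_f)) * 0.1; b3 = 0.0

def cnn_forward(X):
    V1 = np.stack([relu(conv1d_same(X, W1[l], b1[l])) for l in range(N_f)], axis=1)
    V2 = np.stack([relu(conv1d_same(V1, W2[l], b2[l])) for l in range(N_f)], axis=1)
    return sigmoid(conv1d_same(V2, W3, b3))          # soft activity estimates in (0, 1)

X = rng.normal(size=(L, 2))           # correlator output: real and imaginary parts
probs = cnn_forward(X)

# Trainable-parameter count matching the dimensions above
n_params = N_f * (2 * N_w + 1) + N_f * (N_f * N_w + 1) + (N_f * N_w + 1)
```

Note that `n_params` depends only on N_f and N_w, in line with the observation that the CNN size does not grow with the number of devices or the preamble length.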

IV. NUMERICAL RESULTS
In this section, we present numerical results on the performance and complexity of both proposed DL activity detection algorithms and on the preamble sequence design, and we compare them with two baseline device activity detection techniques.
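Since the preamble types compared in this section include ZC sequences, a brief numerical check of their properties may be helpful. The sketch below uses the standard root-u ZC construction for an odd sequence length (an illustrative choice, not the paper's exact preamble setup) and exposes the constant amplitude and ideal periodic autocorrelation that motivate their use:

```python
import numpy as np

def zadoff_chu(u, N_zc):
    """Root-u ZC sequence of odd length N_zc (standard construction)."""
    n = np.arange(N_zc)
    return np.exp(-1j * np.pi * u * n * (n + 1) / N_zc)

z = zadoff_chu(u=1, N_zc=71)                     # length 71 is an illustrative choice
amplitude = np.abs(z)                            # constant modulus, equal to 1
peak = abs(np.vdot(z, z))                        # autocorrelation peak equals N_zc
offpeak = max(abs(np.vdot(z, np.roll(z, s))) for s in range(1, 71))  # vanishes
```

For prime N_zc, every nonzero cyclic shift is orthogonal to the sequence itself, which is the correlation property exploited by the correlator stage of the CNN algorithm.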

A. DL Training Procedure
The steps to train both the DMLP and the CNN algorithms for the device activity detection problem are described herein. The training procedure aims to optimize θ_DMLP and θ_CNN from a dataset S to obtain accurate activity detection, and it is carried out before the deployment of the device activity detection scheme. The dataset is obtained by sampling random distributions to generate the signals according to the definitions in Section II and evaluating (3). In our case, the dataset comprises S = 5 · 10^5 samples, of which S_tr = |S_tr| = 4.5 · 10^5 are used for training and S_val = |S_val| = 0.5 · 10^5 for validation. Each dataset is generated with a single realization of the preamble matrix. The DL algorithms are trained on the dataset samples using the adaptive moment estimation (ADAM) algorithm. The training stops if the loss function L(S_tr) in (12) does not improve for five consecutive epochs. Table III contains the information on the training setup of the DL algorithms. Table IV contains the average training time of the proposed DL algorithms for different preamble lengths. The training is carried out on a workstation equipped with an Nvidia GeForce 940MX GPU, an Intel Core i5-7200U CPU @ 2.5 GHz, and 8 GB of RAM.
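The binary cross-entropy loss of (12) and the patience-based stopping rule described above can be sketched as follows (the function names and the patience implementation are ours, not the paper's):

```python
import numpy as np

def bce(y_true, y_soft, eps=1e-12):
    """Binary cross-entropy averaged over devices and samples."""
    y_soft = np.clip(y_soft, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_soft) + (1 - y_true) * np.log(1 - y_soft)))

def stop_epoch(epoch_losses, patience=5):
    """Index of the epoch at which training stops: the loss has not
    improved for `patience` consecutive epochs."""
    best, wait = np.inf, 0
    for t, loss in enumerate(epoch_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return t
    return len(epoch_losses) - 1
```

For instance, a loss trace that stalls after the second epoch triggers the stop five epochs later, while a monotonically improving trace runs to the last epoch.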

B. DL Input Parameters Tuning
In the following, we describe the procedure to tune the architectural parameters of the DL algorithms. In order to attain high activity detection accuracy and prevent overfitting, we carefully tune the number of neurons in the hidden layers of the DMLP and the number of feature maps in the hidden layers of the CNN. The parameters are tuned by hand in the ranges N_n ∈ {80, 160, 320, 400, 800, 1600} and N_f ∈ {20, 40, 80}. The tuning procedure consists of generating and assessing the learning curves of the algorithms, which contain the loss values along the epochs, calculated by the binary cross-entropy function in (12) on the training and validation datasets. Ideally, the loss values calculated on the validation set must follow those calculated on the training set, indicating an improvement in the generalization capability of the algorithm. If the validation loss worsens, or the gap between the losses on the two datasets increases, there is no further gain in increasing the parameter value. Fig. 5 depicts the learning curves of the DL algorithms for varying architectural parameters.^3 At first glance, we see that, except for the case N_n = 800, the DL algorithms attain convergence. Analyzing the curves, we observe that the minimum training loss always decreases as the parameter values increase, as expected. Fig. 5(a) illustrates the learning curves of the DMLP for N_n ∈ {80, 160, 320}, while Fig. 5(c) depicts the validation loss for N_n ∈ {320, 400, 800}. Combining these results, one can infer that the DMLP overfits for N_n > 400. Furthermore, we observe that the validation loss improves only marginally from N_n = 320 to N_n = 400. Hence, we choose N_n = 320 for the DMLP algorithm. On the other hand, the validation loss of the CNN, depicted in Fig. 5(b), always decreases with the epochs. Despite that, the gap between the validation and training losses increases for N_f > 40. Considering these facts, we choose N_f = 40 for the CNN.
^3 We omit a few curves in order to preserve the readability of the result.
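The selection rule applied above (grow the parameter until the validation gain becomes marginal or the train-validation gap widens) can be mimicked by a simple heuristic. The tolerances and the loss summaries below are hypothetical, chosen only to illustrate the shape of the decision, not taken from the paper:

```python
def select_capacity(curves, gain_tol=0.005, gap_tol=0.05):
    """curves: {param: (min_train_loss, min_val_loss)} summarized from learning
    curves. Returns the smallest parameter value beyond which growth stops paying off."""
    params = sorted(curves)
    best = params[0]
    for prev, cur in zip(params, params[1:]):
        _, val_prev = curves[prev]
        train_cur, val_cur = curves[cur]
        gain = val_prev - val_cur     # validation improvement from growing the model
        gap = val_cur - train_cur     # generalization gap at the larger size
        if gain <= gain_tol or gap > gap_tol:
            break                     # marginal gain or overfitting onset: stop here
        best = cur
    return best

# Hypothetical loss summaries shaped like the DMLP tuning outcome
dmlp_curves = {80: (0.30, 0.32), 160: (0.25, 0.27), 320: (0.20, 0.22),
               400: (0.19, 0.218), 800: (0.10, 0.30)}
chosen = select_capacity(dmlp_curves)
```

With these illustrative numbers the rule lands on N_n = 320: the step to 400 yields only a marginal validation gain, mirroring the reasoning in the text.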

C. Performance: Error, FA, and MD Rates
Numerical results on the activity detection accuracy of the proposed DL activity detection algorithms for GF-RA protocols are assessed and compared with two baseline algorithms available in the literature: the least absolute shrinkage and selection operator (LASSO) method [30] and the AMP algorithm [31]. The adopted AMP algorithm is implemented with the complex soft thresholding denoising function [32], while the LASSO method is computed via the coordinate descent (CD) optimization algorithm [33]. Both the thresholding factor of the AMP and the regularization parameter of the LASSO are tuned for each evaluated scenario in order to ensure a fair comparison, in which the baseline algorithms achieve good performance-complexity tradeoffs. Three figures of merit are evaluated: the error rate, the FA rate, and the MD rate. It is worth mentioning that the numerical results in the following are generated using an independent evaluation dataset that contains samples different from those in the training dataset mentioned in Section IV-A. Table V summarizes the setup used to produce the evaluated scenarios and the respective numerical results. The scenario parameters are consistent with other recently published papers on ML-based approaches for GF-RA [4], [6]. In addition, all the signals are generated by sampling random distributions following the definitions in Section II.
Fig. 6 depicts the dependence of the error rate on the hard decision threshold (τ) of the DL algorithms for different values of the undersampling ratio. One can notice that there exists a value of τ that minimizes the error rate attained by both the DMLP and the CNN algorithms, emphasizing the importance of optimizing this parameter. Table VI presents the values of the near-optimal hard decision threshold (τ*) for each setting of the DL algorithms used to generate the numerical results in the sequel.
Fig. 7 depicts the error rate versus the undersampling ratio Δ = L/K for both proposed DL algorithms, as well as for the baseline algorithms. We evaluate each DL algorithm using the three types of sequences. Besides, we include the 95% confidence interval^4 of each point. The first observation in Fig. 7 is that the detection error rate improves as Δ increases, since longer preamble sequences provide more information to the activity detection algorithms. The DMLP algorithm using random sequences achieves better performance than the CNN under the same condition. However, the CNN with the ZC sequences achieves detection error rate values comparable to those of the DMLP. The CNN improves substantially in detection error rate with the ZC sequences in place of the random ones, owing to the correlator stage in the CNN combined with the good correlation properties of the ZC sequences. In addition, the structure of the CNN exploits the correlation among the samples of the network inputs. The activity detection accuracy of the CNN with either normal or Bernoulli random sequences is comparable, while the differences in the error rate of the DMLP across the types of sequences are nonnegligible. Besides, compared with both baseline algorithms, the DMLP achieves error rate levels comparable to those of the AMP for Δ ≥ 0.325. On the other hand, for low undersampling ratios, i.e., Δ ≤ 0.175, the DMLP and the CNN with ZC sequences significantly outperform the AMP algorithm.
^4 The confidence interval for a mean is computed by CI_{ε/2} = m ± z_{ε/2} s/√n, where 1 − ε is the confidence level, z_{ε/2} is the value such that the area to its right under the standard normal curve is ε/2, m is the sample mean, s is the sample standard deviation, and n is the number of samples. Notice that z_{ε/2} ≈ 1.96 for a confidence level of 95%.
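The three figures of merit and the search for the near-optimal threshold τ* can be sketched as follows. A plain grid search is used here as an assumption; the paper does not specify how τ* is obtained:

```python
import numpy as np

def detection_rates(y_true, y_soft, tau):
    """Error, false-alarm, and missed-detection rates at hard decision threshold tau."""
    y_hat = (y_soft >= tau).astype(int)
    err = float(np.mean(y_hat != y_true))
    fa = float(np.mean(y_hat[y_true == 0] == 1))    # inactive device flagged as active
    md = float(np.mean(y_hat[y_true == 1] == 0))    # active device missed
    return err, fa, md

def near_optimal_tau(y_true, y_soft, grid=np.linspace(0.01, 0.99, 99)):
    """Grid search for the threshold minimizing the error rate."""
    errs = [detection_rates(y_true, y_soft, t)[0] for t in grid]
    return float(grid[int(np.argmin(errs))])

# Toy example: two inactive and two active devices with soft activity scores
y_true = np.array([0, 0, 1, 1])
y_soft = np.array([0.10, 0.20, 0.80, 0.90])
tau_star = near_optimal_tau(y_true, y_soft)
```

Sweeping `tau` over [0, 1] with these functions also traces out the ROC and DET operating points discussed next.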
The LASSO method has the best activity detection accuracy for Δ > 0.175, at the expense of an extremely high computational complexity, as we demonstrate in Section IV-E. Moreover, the behavior of the error curves of the evaluated algorithms is similar for p_a < 0.1, with decreased error rates as a consequence of the reduced number of active devices and, therefore, of the level of interference.
Fig. 8 depicts the ROC curves of the DL algorithms using the normal, Bernoulli, and ZC sequence types for two values of the undersampling ratio. Notice that, since the ROC curves are intended to analyze the detection accuracy tradeoff, these results are generated by sweeping the hard decision threshold in the range τ ∈ [0, 1]. For Δ = 0.525, the CNN using the ZC sequences presents a substantial improvement in the MD rate compared with the random sequences. Again, this is due to the correlator stage at the CNN input, the good correlation properties of the ZC sequences, and the structure of the CNN. Additionally, the ROC curve of the CNN using the ZC sequences is close to that of the AMP algorithm. The LASSO method attains the best ROC curve among the techniques. On the other hand, for Δ = 0.175 the differences between the algorithms and types of sequences vanish, as all the ROC curves of the DL algorithms are comparable to that of the LASSO. At the same time, the ROC of the AMP degrades significantly as Δ decreases.
Fig. 9 depicts the DET curves of the proposed DL algorithms and the baseline algorithms for both values of the undersampling ratio. The DET curve plots the MD rate versus the FA rate with the axes warped by the inverse of the normal cumulative distribution function, which makes it more suitable than the ROC to analyze the tradeoff between the two types of error. The proposed DL algorithms cover a wide range of FA and MD rates, differently from the LASSO method and the AMP algorithm. Analyzing the DET curves, we see that the algorithms achieve lower FA rate values than MD ones. This fact is a consequence of the sporadic activity of the machine-type devices, which results in more inactive devices than active ones during the random access slots.
Fig. 10 depicts the error rate versus the activation probability for the DL and the baseline algorithms. We evaluate each DL algorithm using the three types of sequences and two values of preamble length. This figure of merit demonstrates the robustness of the DL activity detection algorithms in the face of variations in the scenario parameters, since the networks are trained for a single fixed activation probability, p_a = 0.1. As expected, the error rate increases with the activation probability, as higher numbers of active devices incur increased interference power in the received signal. Moreover, regarding the type of sequences, for Δ = 0.525 the best detection accuracy is again achieved with the ZC sequences. Both the DMLP and the CNN algorithms present similar results using such sequences. Next, the Bernoulli sequences marginally outperform the normal ones. In particular, the CNN improves significantly with the ZC sequence set, achieving detection error rates close to those of the DMLP. Lastly, when the undersampling ratio is reduced substantially, e.g., Δ = 0.175, the performance of the DMLP and the CNN algorithms under each of the three types of sequences is nearly identical for p_a ≥ 0.1.

E. Complexity: Running Time
We also evaluate the running time demanded by the algorithms to estimate the set of active devices. Since the neural networks are trained before the system deployment, the training has no impact on the complexity of the DL algorithms at inference time. In this result, we set (K, L) ∈ {(40, 7), (40, 21), (400, 71), (400, 211)}. The DL algorithms present extremely low running time values, on the order of 10 to 100 μs. The DMLP algorithm has a slightly shorter running time than the CNN. Compared with the baseline algorithms, the proposed algorithms result in running time values at least two orders of magnitude lower.
As expected, the complexity of all the algorithms increases with the number of devices, owing to the growth in the number of inputs. Despite that, the running time changes only marginally with the undersampling ratio, except for the LASSO method. This is due to the accelerated convergence of the CD algorithm caused by the additional information provided by longer preamble sequences.
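As a rough reproduction of the running-time measurement, one can time a DMLP-style dense forward pass for the largest configuration. The number of hidden layers, the layer sizes, and the use of `time.perf_counter` are our assumptions; absolute timings depend on the hardware:

```python
import time
import numpy as np

def dmlp_forward(x, weights, biases):
    """Dense forward pass: ReLU hidden layers, sigmoid output."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(W @ x + b, 0.0)
    W, b = weights[-1], biases[-1]
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

rng = np.random.default_rng(0)
K, L, N_n = 400, 211, 320                    # devices, preamble length, hidden neurons
sizes = [2 * L, N_n, N_n, K]                 # input: real and imaginary parts
weights = [rng.normal(size=(o, i)) * 0.01 for i, o in zip(sizes, sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]

x = rng.normal(size=2 * L)
t0 = time.perf_counter()
y = dmlp_forward(x, weights, biases)
elapsed = time.perf_counter() - t0           # single-shot inference time
```

A single inference is just a few small matrix-vector products, which is why the DL detectors run orders of magnitude faster than the iterative LASSO and AMP solvers.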

V. CONCLUSION
In this article, we propose two DL sparse support recovery algorithms to detect active devices in GF-RA protocols. We develop a DMLP algorithm for device activity detection based on densely connected layers and a CNN algorithm built with 1-D convolution layers. At the same time, we analyze the impact of the type of sequences used for preamble design on the activity detection accuracy. We evaluate the accuracy achieved with random preambles generated from complex normal and Bernoulli distributions, as well as with deterministic ZC sequences. The numerical results demonstrate that the DMLP reaches the best detection error rate values among the evaluated algorithms. Despite that, the CNN with the ZC preamble sequences in place of random ones achieves detection error rates comparable to those of the DMLP, owing to the good correlation properties of these sequences and to the structure of the CNN, which exploits the correlation among the network inputs. Regarding the computational complexity, both proposed DL algorithms present extremely low running times, at the expense of the training burden to determine suitable parameter values for the networks. Moreover, both proposed DL activity detection algorithms attain a promising and much better performance-complexity tradeoff than state-of-the-art techniques such as the LASSO method and the AMP algorithm.