Data-driven Radar Processing Using a Parametric Convolutional Neural Network for Human Activity Classification

Abstract: The paper proposes a data-driven pre-processing optimization for radar data using a parametric convolutional neural network. The proposed method is applied to human activity classification as a use case. Present radar-based activity recognition systems exploit micro-Doppler signatures by generating Doppler spectrograms or a temporal series of range-Doppler maps, followed by deep neural networks or machine learning approaches for classification. Those radar data representations are typically generated on the basis of short-time Fourier transformations. A Fourier transformation resolves the frequency space equally, which may be sub-optimal in some applications. Although deep convolutional neural networks (DCNNs) have been shown to implicitly learn features from raw sensor data in other fields, such as speech recognition, radar-based DCNNs still require pre-processing to develop a scalable and robust classification or regression application. In this paper, we propose a parametric convolutional neural network that mimics the radar pre-processing across fast-time and slow-time radar data through 2D sinc filter or 2D wavelet filter kernels to extract features for the classification of various human activities. During training, only the filter parameters of the 2D sinc filters or 2D wavelets are learned, leading to an optimized feature representation for the classification task. It is demonstrated that the proposed solution shows improved results compared to equivalent DCNN architectures that rely on Doppler spectrograms or radar data cubes as input data.


I. INTRODUCTION
People sensing and activity classification have increasing application potential in various areas, such as physical security, defense and surveillance. In the industrial and consumer space, human activity recognition finds applications in smart homes, human-machine interfaces and elderly fall-motion monitoring systems. Knowledge of the activity performed in a room can enable smart control of energy consumption, such as HVAC and lighting [1]-[3]. Furthermore, knowledge of the performed human activity facilitates truly ubiquitous smart home solutions by discerning the user's intent.
Most human activity recognition systems are based on cameras and computer vision approaches. These systems have the advantage that they are quite easy to implement and benefit from several years of research. As a result, the accuracy of such systems is quite high. However, camera systems suffer from a lack of privacy and are sensitive to illumination conditions, so they are not a favorable choice for smart home solutions. On the other hand, radar sensors have been shown to be an effective sensing modality for human activity classification [4], [9]-[17], [26]. Radar sensors are privacy-preserving and illumination-invariant, and they can be aesthetically concealed in the operating environment. Recent innovations in semiconductor technology have facilitated integration and antenna-in-package solutions, shrinking the radar sensor to a small form factor [18].
T. Stadelmayer is with the Friedrich-Alexander University Erlangen-Nuremberg, 91054 Erlangen, Germany, and Infineon Technologies AG, 85579 Neubiberg, Germany (e-mail: thomas.stadelmayer@fau.de; thomas.stadelmayer@infineon.com).
Radar sensors can sense and recognize human activities by utilizing micro-Doppler signatures [24], which are generated by non-rigid-body motions of moving targets. To classify different human activities efficiently, meaningful features have to be extracted from the micro-Doppler signatures. In [4], the authors extract features such as the total bandwidth of the Doppler signal and the normalized standard deviation of the Doppler signal strength, among others, from the Doppler spectrograms of different human activities. Handcrafted features are also used in more recent works. The authors in [5] propose the extraction of the median Doppler frequency, the total bandwidth of the Doppler signal and the standard deviation in signal strength, to name a few, whereas in [6] the envelope of the micro-Doppler signatures is used as the basis for classification. The classification result using handcrafted features depends strongly on their discriminative characteristics, and designing them requires expert knowledge. Extracting such features from Doppler spectrograms inevitably removes information from the signal. In order to remove as little information as possible while compressing the data to meaningful features, the use of principal component analysis (PCA) was proposed in [7] and [8]. Whereas PCA is a linear transformation, neural networks, which provide a non-linear transformation, have been widely used for feature extraction in the radar domain in recent years. In [22], the authors propose a deep auto-encoder based solution to sense elderly fall-motion from Doppler spectrograms, and in [17], the authors use various deep convolutional neural network architectures to learn from range spectrograms, Doppler spectrograms and radar data cubes for the classification of different activities.
However, the majority of papers extract features on the basis of short-time Fourier transformed (STFT) radar data, such as Doppler spectrograms or radar data cubes. This holds for the handcrafted feature approaches, for the dimensionality reduction approaches using PCA and for the deep learning based approaches. The fact that nearly all publications on radar-based human-motion recognition and classification so far are based on pre-processed data is underlined by the survey of Gurbuz and Amin [25].
An STFT resolves the frequency space with equal resolution. However, when classifying between different human activities, the frequency regions representative of the different activities are of higher interest. Therefore, in [23], the authors propose a deep deformable convolutional network that focuses on certain time-frequency areas in the Doppler spectrograms, which helps to handle the small inter-class differences and large intra-class variations of human fall-motion in a real-world situation. Further, in recent years, different approaches have emerged that feed the raw radar data provided by the ADC, referred to as raw ADC data in this paper, directly to the neural network. In [27], the authors train a long short-term memory (LSTM) network directly on time series of complex-valued raw ADC data, which, however, leads to a very high network complexity. An approach based on in-phase/quadrature (I/Q) trajectories is presented in [28]. First, the I/Q trajectories are transformed into low-resolution images, which are then classified using a DCNN. Thus, the network extracts features from the image rather than from the I/Q data directly. Further, in [29], a DCNN that includes a Fourier layer and uses raw ADC data as input is proposed. The Fourier layer is a convolutional layer with its kernels initialized to the Fourier coefficients. However, those weights are adapted during training, and thus after training the transformation in the Fourier layer most likely differs from a Fourier transformation. In audio signal processing, where the information is also encoded in the time-frequency domain, a neural network called SincNet that operates directly on the raw data is proposed in [30]. The core idea of SincNet is to restrict the first layer of the neural network to 1D band-pass sinc filters and to allow only their cutoff frequencies to be optimized during training.
This restriction puts prior knowledge into the system, so the network converges faster to its optimum solution. Further, the outcome of this layer remains physically interpretable, since only the parameters of the filters are optimized rather than each filter weight independently.
Inspired by the 1D SincNet in speech processing, we propose two DCNN architectures based on 2D sinc filters and 2D wavelet filters that directly learn joint fast-time/slow-time frequency features from the raw ADC radar data. The proposed model classifies different human activities using radar data captured by Infineon's 60-GHz frequency modulated continuous wave (FMCW) radar chipset BGT60TR13C. We demonstrate that the proposed trained DCNN architectures achieve a classification accuracy equal to or better than equivalent DCNNs that use STFT-based pre-processing in the form of Doppler spectrograms or radar data cubes as input. To the best of the authors' knowledge, this is the first paper that deals with data-driven radar pre-processing optimization, which allows the network to implicitly learn a better representation for the particular classification task, here human activity classification.
Fig. 1. Functional block diagram of the FMCW radar RF signal chain depicting 1TX, 1RX channel.
The rest of the paper is organized as follows. Section II presents the radar system design and system parameters. Section III presents the conventional signal processing, involving Doppler spectrogram and range-Doppler-time (data cube) processing, together with the contribution of this paper. Section IV presents the proposed solution and Section V the proposed architectures and learning. Section VI presents the results and discussion along with the associated setup, and Section VII concludes the paper.

II. RADAR SYSTEM DESIGN
The work in this paper is based on Infineon's BGT60TR13C FMCW radar chipset. Its operating frequency ranges from 57 GHz to 64 GHz with an adjustable chirp duration. Its block diagram is shown in Fig. 1. The transmit path consists of a voltage controlled oscillator (VCO) that is regulated by a phase locked loop (PLL) to a reference frequency of $f_{ref} = 80$ MHz. Highly linear frequency chirps between 57 GHz and 64 GHz are produced by adjusting the divider value and an additional tuning voltage ranging from 1 V to 4.5 V. In the receive path, the echo returning from the target object is down-converted with a replica of the transmitted frequency chirp, so that the baseband frequency spectrum can be sampled by the 12-bit analog-to-digital converter (ADC). Moreover, the receive path contains an intermediate frequency (IF) buffer amplifier and an analog IF filter that can be adjusted to the received frequency range. The radar chip is packaged in an embedded wafer level ball grid array package that includes four integrated patch antennas realized by a metal redistribution layer. Three of them are receive antennas with an antenna gain of 10 dBi, and one is the transmit antenna with a gain of 6 dBi. Consequently, the radar sensor contains three identically structured receive paths and one transmit path. The radio frequency (RF) signal is distributed to the receive paths by an active RF distribution network.
The up-chirp transmitted by the FMCW radar's ramp generator is reflected by a moving object and arrives at the receiver after a round-trip delay determined by the target's range from the radar and the velocity of the target. The received signal is mixed with the transmitted signal at the receiver, and the resulting signal is low-pass filtered, which performs the matched filtering operation. The phase of the resulting intermediate frequency (IF) signal due to a single point target is a function of the ramp start frequency $f_{min}$, the chirp bandwidth $B$, the chirp time $T_c$, and the round-trip propagation delay $\tau = 2(x + \sum_{i=1}^{I} v_i t)/c$ between the transmitted and received signal after reflection from a point target with range $x$ and radial velocity components $v_i$. The Doppler frequency relates to the radial velocity $v_i$ as $\nu_i = 2 v_i/\lambda = 2 v_i f_{min}/c$. The Doppler is represented by the centroid Doppler of the target's Doppler components, $\nu_c = \sum_{i=1}^{I} \nu_i$, and the signal additionally contains micro-Doppler components. The signal of an extended target is defined by the superposition of point target signals.
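For reference, the IF phase of a single point target can be derived for an ideal linear up-chirp as sketched below. This is the standard textbook deramping result under the assumption of a noise-free, perfectly linear chirp, and it is not necessarily the exact expression used by the authors. The symbols $\phi_T$ and $\phi_{IF}$ are introduced here for illustration only.

```latex
\begin{align}
\phi_T(t) &= 2\pi\left(f_{min}\,t + \frac{B}{2T_c}\,t^2\right), \\
\phi_{IF}(t) &= \phi_T(t) - \phi_T(t-\tau)
 = 2\pi\left(f_{min}\,\tau + \frac{B}{T_c}\,\tau\,t - \frac{B}{2T_c}\,\tau^2\right).
\end{align}
```

The term linear in $t$ is the beat frequency that encodes range, while the change of $f_{min}\tau$ from chirp to chirp carries the Doppler information.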
For our demonstrator, we configured the chip to transmit chirps with a pulse repetition interval (PRI) of $T_{PRI} = 1$ ms, resulting in an unambiguous maximum velocity of $v_{max} = 1.25$ m/s, which is sufficient in most indoor activity sensing applications. Figure 2 presents the chirp configuration. From the equidistant stream of chirps, individual slices across slow-time can be extracted seamlessly. This provides maximum flexibility in further processing.
The bandwidth is set to $B = 1$ GHz and the up-chirp time is set to $T_c = 64$ µs, which gives a range resolution of 15 cm. The maximum detectable unambiguous range is 9.6 m. The system parameters are provided in Tab. I.
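The quoted figures can be cross-checked with the usual FMCW relations. The sketch below is illustrative only: the 60 GHz carrier used for the wavelength and the 128 real-valued samples per chirp are assumptions that happen to reproduce the stated values, since Tab. I is not reproduced here, and the function names are hypothetical.

```python
C = 299_792_458.0  # speed of light in m/s


def range_resolution(bandwidth_hz):
    """Range resolution of an FMCW chirp: c / (2 B)."""
    return C / (2.0 * bandwidth_hz)


def max_unambiguous_velocity(carrier_hz, pri_s):
    """Maximum unambiguous radial velocity: lambda / (4 T_PRI)."""
    wavelength = C / carrier_hz
    return wavelength / (4.0 * pri_s)


def max_unambiguous_range(bandwidth_hz, samples_per_chirp):
    """Real-valued sampling yields samples_per_chirp / 2 usable range bins."""
    return (samples_per_chirp // 2) * range_resolution(bandwidth_hz)


if __name__ == "__main__":
    B = 1e9            # chirp bandwidth from the text
    T_PRI = 1e-3       # pulse repetition interval from the text
    F_CARRIER = 60e9   # assumed carrier frequency (not stated explicitly)
    N_S = 128          # assumed samples per chirp (not stated explicitly)
    print(range_resolution(B))                       # roughly 0.15 m
    print(max_unambiguous_velocity(F_CARRIER, T_PRI))  # roughly 1.25 m/s
    print(max_unambiguous_range(B, N_S))             # roughly 9.6 m
```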

III. CONVENTIONAL PIPELINE & CONTRIBUTIONS
The conventional signal pre-processing involves 1D moving target indication (MTI) filtering to remove the response from static targets as well as the Tx-Rx leakage, which affects the first few range bins. The reflections from stationary objects such as chairs, tables and walls can overwhelm the reflections from moving targets, limiting their visibility in the range-Doppler map (RDM) or Doppler spectrogram. Thus, an MTI filter is used to suppress the contribution of these stationary objects and of the leakage. Among several MTI filters, a simple 1D MTI filter subtracts the mean along fast-time to remove the Tx-Rx leakage that perturbs the first range bins, followed by mean subtraction along slow-time to remove the reflections due to static or zero-Doppler targets.
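The two mean subtractions described above can be sketched in a few lines of NumPy. The array layout (chirps along axis 0, samples along axis 1) and the function name `mti_filter` are assumptions made for this illustration.

```python
import numpy as np


def mti_filter(frame):
    """Simple 1D MTI filter on a (N_c, N_s) frame: chirps x samples per chirp."""
    frame = np.asarray(frame, dtype=float)
    # Remove the mean of every chirp (along fast-time): suppresses Tx-Rx leakage.
    frame = frame - frame.mean(axis=1, keepdims=True)
    # Remove the mean across chirps (along slow-time): suppresses static targets.
    frame = frame - frame.mean(axis=0, keepdims=True)
    return frame


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic frame: noise plus a constant offset and a static fast-time pattern.
    raw = rng.normal(size=(16, 32)) + 5.0 + np.linspace(0.0, 1.0, 32)
    filtered = mti_filter(raw)
    print(np.abs(filtered.mean(axis=0)).max(), np.abs(filtered.mean(axis=1)).max())
```

After filtering, the mean along both axes is zero up to floating-point error, so any component that is constant over slow-time no longer contributes to the zero-Doppler bin.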
The range information of the target is extracted by performing a first FFT after applying 1D windowing along fast-time, which is the intra-chirp time. The Doppler information of the target is extracted by monitoring the change of the target peak along slow-time, which is the inter-chirp time. One common approach is to apply an FFT along the fast-time as well as the slow-time dimension. The outcome of this operation is a two-dimensional matrix representing the received power spectrum over range and velocity, known as the RDM. The received and deramped IF data is stored in matrices of size $N_c \times N_s$, where $N_c$ is the number of chirps considered in a slice across slow-time and $N_s$ is the number of samples per chirp.
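A minimal sketch of this two-FFT processing is given below. The Hann window, the use of a one-sided FFT for real-valued ADC data and the function name `range_doppler_map` are assumptions for illustration. The source does not specify the window type.

```python
import numpy as np


def range_doppler_map(frame):
    """Magnitude RDM of a mean-removed, real-valued (N_c, N_s) frame.

    Returns an array of shape (N_c, N_s // 2 + 1): Doppler bins x range bins,
    with zero Doppler shifted to the centre row.
    """
    n_c, n_s = frame.shape
    window = np.outer(np.hanning(n_c), np.hanning(n_s))  # assumed 2D Hann window
    # FFT along fast-time (range); real input, so keep non-negative frequencies only.
    range_fft = np.fft.rfft(frame * window, axis=1)
    # FFT along slow-time (Doppler), then centre the zero-Doppler bin.
    doppler_fft = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)
    return np.abs(doppler_fft)


if __name__ == "__main__":
    n_c, n_s = 64, 128
    m = np.arange(n_c)[:, None]   # slow-time index
    n = np.arange(n_s)[None, :]   # fast-time index
    # Synthetic point target in range bin 10 and Doppler bin +5.
    frame = np.cos(2 * np.pi * (10 * n / n_s + 5 * m / n_c))
    rdm = range_doppler_map(frame)
    print(np.unravel_index(np.argmax(rdm), rdm.shape))
```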
In a conventional processing pipeline, the above pre-processing is followed by feature image generation, such as a Doppler spectrogram or a range-Doppler-time radar data cube, which then serves as input to a deep convolutional neural network (DCNN) or an LSTM network for classification.

A. Radar Data Cube
The radar data cube represents a time series of RDMs. Before the individual RDMs can be generated, the static targets must be removed from the signal as described in the previous section. This is done by first removing the mean within each chirp (fast-time) and then removing the mean across multiple chirps (slow-time). An RDM computed at the $k$-th slice across slow-time is obtained by applying a 2D STFT to the mean-removed ADC data. In this transform, $w(m, n)$ is the 2D weighting function along fast-time and slow-time, and $s(m, n, k)$ is the mean-removed ADC data of the $k$-th slow-time data slice. The indices $n$ and $m$ sweep along the fast-time and slow-time axes respectively, while $l$ and $p$ sweep along the range and Doppler axes respectively. $N_{st}$ and $N_{ft}$ are the FFT sizes along slow-time and fast-time respectively. Figure 3 presents the radar data cube for the walking and working activities at the slow-time data slices 0, 15 and 30. In the RDMs showing the walking activity, a person approaches the radar and then moves away from it. Most interesting for activity classification is that the motion in the range-Doppler domain and the spread of the target can be seen clearly. For a person working on the laptop, on the other hand, the signal is mainly visible in the zero-Doppler bin and hardly any variation can be detected, which makes it hard to assign the signal to a certain activity class.
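The definitions above suggest the following form for the RDM. This is a reconstruction under assumptions, not the authors' verbatim equation: it assumes zero-based indices, FFT sizes $N_{st}$ and $N_{ft}$ along slow-time and fast-time, and the magnitude as output. The exact normalization and index ranges may differ.

```latex
v_{RDM}(p, l, k) = \left| \sum_{m=0}^{N_{st}-1} \sum_{n=0}^{N_{ft}-1}
  w(m, n)\, s(m, n, k)\,
  e^{-j 2\pi \left( \frac{m p}{N_{st}} + \frac{n l}{N_{ft}} \right)} \right|
```

Here the slow-time index $m$ pairs with the Doppler index $p$ and the fast-time index $n$ pairs with the range index $l$, in line with the index conventions stated in the text.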

B. Doppler Spectrogram Features
The Doppler spectrogram is generated by marginalizing the radar data cube $v_{RDM}$ across the range dimension, that is, by summing $v_{RDM}(p, l, k)$ over all $N_{ft}$ range bins $l$, where $p$, $l$ and $k$ are the Doppler, range and slow-time data slice indices respectively. The Doppler spectrum at slow-time data slice $k$ contains both the Doppler components and the micro-Doppler components due to hand and leg movements while performing an activity. The Doppler spectra stacked across consecutive slow-time data slices are referred to as the Doppler spectrogram, which captures the instantaneous Doppler spectral content and its variation over time. Figure 4 presents the Doppler spectrograms of the different activities, namely empty room, walking, standing idle, arm movement, waving and working on the laptop.
Figure 5(a) presents the conventional pipeline, which involves explicit pre-processing and feature generation followed by a neural network such as a DCNN or LSTM for classification. The novel aspect of the proposed architecture is that it performs the pre-processing and feature generation implicitly in the neural network itself. Thus, the input to the neural network is the raw ADC data, as depicted in Fig. 5(b). The initial layer of the proposed DCNN learns 2D sinc filter kernels or 2D wavelet filter kernels, which are representative of the pre-processing and feature extraction. The capability of the DCNN to operate directly on the raw ADC data helps to reduce the computational complexity considerably and, in a practical implementation, eliminates the need for a digital signal processor (DSP) for pre-processing.
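The Doppler spectrogram construction described at the start of this subsection amounts to a sum over the range axis of the data cube. The sketch below assumes a cube laid out as (slices, Doppler bins, range bins). Both the layout and the function name are illustrative choices, not taken from the source.

```python
import numpy as np


def doppler_spectrogram(data_cube):
    """Collapse a (K, P, L) radar data cube to a (P, K) Doppler spectrogram.

    K: slow-time data slices, P: Doppler bins, L: range bins.
    """
    data_cube = np.asarray(data_cube)
    # Sum over the range axis, then put Doppler on rows and time on columns.
    return data_cube.sum(axis=2).T


if __name__ == "__main__":
    cube = np.arange(24).reshape(2, 3, 4)  # toy cube: 2 slices, 3 Doppler, 4 range bins
    print(doppler_spectrogram(cube))
```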

IV. PROPOSED PARAMETRIC CONVOLUTIONAL LAYER
Different activities can be distinguished by analyzing their unique range-velocity profiles. Some activities have very different profiles, such as walking and standing idle, but others differ only slightly. For example, working on the laptop and sitting idle on a chair differ only in the slight hand movements needed to control the laptop. Thus, a higher resolution in specific frequency bands is required in order to distinguish these actions accurately. However, when applying a 2D STFT, the whole observable range-velocity space is discretized into equal bins. By feeding the raw ADC radar data directly to a neural network, the network can learn filter kernels that extract more meaningful features than can be achieved by fixed pre-processing steps. However, unlike in computer vision or other domains, the features are not spatially localized in the raw ADC radar data. Therefore, a small filter kernel of size (3x3) or (5x5), as typically used in convolutional layers, is not able to extract meaningful features, and larger filter sizes such as (64x32) have to be considered. As a result, the number of trainable filter weights increases drastically, which leads to overfitting and to the optimization getting stuck in local minima. Constraining the filter kernels to a parametric filter function that is specifically designed for radar feature learning facilitates convergence while requiring only a small set of filter parameters. We therefore refer to the proposed layer as a parametric convolutional layer. In this way, by using prior knowledge of the radar signal, pre-processing is integrated into the neural network and can be optimized according to the training data. The benefit of integrating pre-processing into the neural network itself by learning the filter parameters of a set of time domain filters was already shown in [30] for 1D audio signal processing.
That paper proposes constraining the first convolutional layer to 1D band-pass filters, each defined as the difference of two low-pass sinc filters with different cutoff frequencies. During training, the lower and higher cutoff frequencies are optimized for the needs of the application. Replacing classical pre-processing by this layer shows improved results in speaker recognition. In this paper, the extension from 1D audio signals to 2D radar signals is proposed. Additionally, the Morlet wavelet is evaluated as a second possible parametric filter function.
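The band-pass construction as a difference of two low-pass sinc filters can be sketched as follows. The Hamming window, the function name and the example frequencies are assumptions chosen for illustration. Only the lower cutoff and the bandwidth would be trainable in a parametric layer.

```python
import numpy as np


def sinc_bandpass_1d(length, f_low, bandwidth, f_s):
    """Windowed band-pass FIR filter built from two ideal low-pass sinc filters."""
    k = np.arange(length) - (length - 1) / 2.0  # symmetric tap index
    t = k / f_s
    f_high = f_low + bandwidth
    # np.sinc(x) = sin(pi x) / (pi x); ideal low-pass with cutoff f has taps
    # (2 f / f_s) * sinc(2 f t).
    low_pass_high = 2.0 * f_high / f_s * np.sinc(2.0 * f_high * t)
    low_pass_low = 2.0 * f_low / f_s * np.sinc(2.0 * f_low * t)
    return (low_pass_high - low_pass_low) * np.hamming(length)  # assumed window


if __name__ == "__main__":
    h = sinc_bandpass_1d(129, f_low=100.0, bandwidth=100.0, f_s=1000.0)
    gain = np.abs(np.fft.rfft(h, 1024))
    freqs = np.fft.rfftfreq(1024, d=1.0 / 1000.0)
    for f in (20.0, 150.0, 300.0):
        print(f, gain[np.argmin(np.abs(freqs - f))])
```

The printed gain is close to one inside the 100 Hz to 200 Hz pass band and close to zero outside it, which is the rectangular frequency response the text refers to.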

A. 2D Sinc Filters
Besides applying an STFT, time domain band-pass filters can be used to analyze the frequency composition of a signal. Time domain band-pass filters make it possible to adjust the cutoff frequencies according to the needs of the application, and these can therefore be learned within a neural network. Thus, similar to the proposal of Ravanelli et al., a sinc filter is chosen as the parametric filter function. However, it has to be extended to the 2D radar domain. The 1D sinc filter is characterized by the filter length $K$, the sampling frequency $f_s$ of the signal, the lower cutoff frequency $f_l$, the bandwidth $b$ and the filter parameter index $k$. The parameters of this filter are the lower cutoff frequency $f_l$ and the bandwidth $b$, which implicitly defines the higher cutoff frequency. By defining a lower cutoff frequency and a bandwidth in the slow-time as well as in the fast-time direction, a 2D band-pass filter that is able to extract joint range and velocity features can be created. The 2D sinc filter is characterized by the filter lengths $N$ and $M$, the sampling frequencies $f^{st}_s$ and $f^{ft}_s$, the lower cutoff frequencies $f^{st}_l$ and $f^{ft}_l$, and the filter bandwidths $b^{st}$ and $b^{ft}$ in the slow-time and fast-time directions respectively. Furthermore, $w(n, m)$ is a 2D cosine weighting function, where $n$ sweeps along slow-time and $m$ along fast-time. An exemplary 2D sinc filter is shown in Fig. 6 in the time domain as well as in the frequency domain. In the frequency domain, the rectangular shape with clear cutoff frequencies can be seen. The first layer of a CNN is initialized according to the definition of the 2D sinc filters, and only the filter parameters are allowed to be learned during training.
The range-velocity profile is defined not only by the frequencies it is composed of but also by the change of these frequencies over time. When a signal is transformed to the frequency domain, the time information is lost. This can be overcome by windowing the time domain signal.
However, smaller window sizes mean higher time resolution at the cost of worse frequency resolution, and vice versa. Especially for time-varying signals, wavelets have several advantages over Fourier transformations, as they provide both time and frequency resolution [31]. Because radar signals are highly time-varying, the use of a 2D wavelet transformation based on Morlet wavelets is proposed. The filter parameters of the 2D Morlet wavelet that can be optimized by the neural network are the center frequency and the standard deviation of the wavelet. As with the previously introduced 2D sinc filters, the frequency area of interest can be adjusted via the center frequency. In addition, the time-frequency resolution can be optimized by changing the standard deviation of the Gaussian part of the wavelet. Because the wavelet is the product of a cosine and a Gaussian window function, its frequency response also has the shape of a Gaussian. That means it has no clear cutoff frequencies, as can be seen in Fig. 7.
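The filter definitions referred to in this section can be written out as follows. These expressions are reconstructions from the verbal descriptions (a difference of two low-pass sinc filters, a separable 2D extension with a weighting function, and a cosine multiplied by a Gaussian) and are offered as assumptions. The symbols $h$, $\psi$, $f_c$ and $\sigma$ as well as any normalization are illustrative and may differ from the authors' notation. Here $\mathrm{sinc}(x) = \sin(\pi x)/(\pi x)$, and the indices are centered around zero.

```latex
\begin{align}
h_{K}(k; f_l, b) &= 2\,(f_l + b)\,\mathrm{sinc}\!\left(2\,(f_l + b)\,\frac{k}{f_s}\right)
  - 2\,f_l\,\mathrm{sinc}\!\left(2\,f_l\,\frac{k}{f_s}\right), \\
h_{N,M}(n, m) &= w(n, m)\; h_{N}\!\left(n; f^{st}_l, b^{st}\right)\, h_{M}\!\left(m; f^{ft}_l, b^{ft}\right), \\
\psi_{N,M}(n, m) &= \exp\!\left(-\frac{n^2}{2\sigma_{st}^2} - \frac{m^2}{2\sigma_{ft}^2}\right)
  \cos\!\left(2\pi f^{st}_c \frac{n}{f^{st}_s}\right)
  \cos\!\left(2\pi f^{ft}_c \frac{m}{f^{ft}_s}\right).
\end{align}
```

In the second line each 1D factor uses the sampling frequency of its own axis. Per filter, the sinc variant has four trainable parameters (two lower cutoffs and two bandwidths) and the wavelet variant has four as well (two center frequencies and two standard deviations).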

V. ARCHITECTURE AND LEARNING
In this paper, two state-of-the-art DCNNs operating on pre-processed data and one state-of-the-art DCNN operating on raw input data are evaluated alongside two novel DCNN architectures that use raw ADC data as input. All architectures have several characteristics in common. First, all networks end with a softmax classifier layer of size 6, as six different actions are to be classified. Second, categorical cross-entropy is used as the loss function. Third, the RMSprop optimizer is used with a learning rate of $lr = 0.0001$, $\rho = 0.9$, $\epsilon = 10^{-8}$ and batches of size 128. Fourth, all unconstrained convolutional and dense layers are initialized using the 'Glorot' initialization scheme with a uniform weight distribution. Fifth, the common convolutional and dense layers use a rectified linear unit as activation. Sixth, after every common convolutional and dense layer, dropout with a rate of 0.2 is applied in order to prevent overfitting.

A. 2D SincNet
The 2D SincNet uses 2D sinc filter convolutions in the first convolutional layer, as described in Section IV-A. As parameters, this layer takes the filter lengths, the number of filters, the sampling frequencies, the padding mode and the stride for the slow-time and fast-time directions respectively. Although there are no separate filters for slow-time and fast-time, the number of filters in slow-time $N_{st}$ and the number of filters in fast-time $N_{ft}$ have to be provided explicitly. From these, the 2D sinc filters are generated such that they form an equally distributed grid in the 2D frequency domain.
The 2D sinc filter layer is followed by a max pooling layer with a pooling size of 8x2. Afterwards, a common two-dimensional convolutional layer using 50 filters of size 3x3 is applied. As mentioned above, the convolutional layer is followed by a dropout layer with a rate of 0.2. After the dropout, max pooling of size 4x2 is applied to decrease dimensionality. Then the tensor is flattened and fed into a dense layer of size 32, followed by the softmax classifier layer. The proposed network is depicted in Fig. 8.
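One way to initialize such a grid of 2D sinc filters is sketched below. It is a hedged illustration rather than the authors' implementation: the normalized frequency axis (sampling frequency of 1), the 8 x 8 split of the filters, the Hann weighting and all function names are assumptions. The filter lengths 65 and 33 are the ones stated later in the evaluation.

```python
import numpy as np


def sinc_bandpass(length, f_low, bandwidth):
    """1D band-pass taps on a normalized frequency axis (sampling frequency = 1)."""
    k = np.arange(length) - (length - 1) / 2.0
    f_high = f_low + bandwidth
    return 2 * f_high * np.sinc(2 * f_high * k) - 2 * f_low * np.sinc(2 * f_low * k)


def sinc_filter_bank_2d(n_st, n_ft, len_st=65, len_ft=33):
    """Return (n_st * n_ft, len_st, len_ft) filters tiling [0, 0.5] x [0, 0.5]."""
    b_st, b_ft = 0.5 / n_st, 0.5 / n_ft  # equal bandwidth per grid cell
    window = np.outer(np.hanning(len_st), np.hanning(len_ft))  # assumed weighting
    bank = []
    for i in range(n_st):
        h_st = sinc_bandpass(len_st, i * b_st, b_st)      # slow-time band
        for j in range(n_ft):
            h_ft = sinc_bandpass(len_ft, j * b_ft, b_ft)  # fast-time band
            bank.append(np.outer(h_st, h_ft) * window)
    return np.stack(bank)


if __name__ == "__main__":
    bank = sinc_filter_bank_2d(8, 8)
    print(bank.shape)
```

In a trainable layer, only the four numbers per filter (two lower cutoffs, two bandwidths) would be stored as weights, and the kernels would be regenerated from them in every forward pass.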

B. 2D WaveConvNet
The 2D WaveConvNet (WCN) is designed similarly to the 2D SincNet. The only difference is that the first convolutional layer is initialized with 2D Morlet wavelets, as described in Section IV-B, instead of 2D sinc filters. The parameters required for this layer are the filter lengths, the number of filters, the sampling frequencies, the padding mode and the stride for the slow-time and fast-time directions respectively. As with the 2D sinc filters, the number of filters in the slow-time direction $N_{st}$ and in the fast-time direction $N_{ft}$ have to be provided explicitly in order to distribute the frequency responses of the wavelets equally as a grid in the 2D frequency domain. Both time axes are normalized, as discussed in the previous section. As a result, the standard deviation is chosen to be 0.06 in both filter dimensions. In total, $N_{st} \times N_{ft}$ 2D wavelets are created. The trainable weights of this layer are the center frequencies and the standard deviations in the slow-time and fast-time dimensions. In the 2D WCN, the learnable weights are normalized as well.
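A single 2D Morlet-type kernel with normalized time axes might be generated as in the sketch below. The normalization of both axes to the interval [-0.5, 0.5], the unit of the center frequencies (cycles per normalized filter length), the example filter lengths and the function name are assumptions for illustration. Only the standard deviation of 0.06 is taken from the text.

```python
import numpy as np


def morlet_2d(len_st, len_ft, fc_st, fc_ft, sigma_st=0.06, sigma_ft=0.06):
    """2D Morlet-type kernel: Gaussian envelope times a cosine per axis."""
    n = np.linspace(-0.5, 0.5, len_st)[:, None]  # normalized slow-time axis
    m = np.linspace(-0.5, 0.5, len_ft)[None, :]  # normalized fast-time axis
    gauss = np.exp(-n**2 / (2 * sigma_st**2) - m**2 / (2 * sigma_ft**2))
    return gauss * np.cos(2 * np.pi * fc_st * n) * np.cos(2 * np.pi * fc_ft * m)


if __name__ == "__main__":
    kernel = morlet_2d(129, 33, fc_st=10.0, fc_ft=4.0)
    # The slow-time spectrum of the centre column is a Gaussian bump near fc_st.
    spectrum = np.abs(np.fft.rfft(kernel[:, 16]))
    print(kernel.shape, int(np.argmax(spectrum)))
```

A smaller standard deviation narrows the envelope in time and widens the Gaussian-shaped frequency response, which is the time-frequency trade-off the layer is allowed to tune.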

C. State-of-the-art Networks
DSNet: The DSNet is a typical state-of-the-art 2D DCNN architecture followed by a dense and a softmax layer. It uses pre-processed 2D Doppler spectrograms as input. It contains three common 2D convolutional layers with 4, 8 and 16 filters respectively, each with a kernel size of 3x3. After each convolution, a dropout layer with a dropout rate of 0.2 is used. Moreover, after the first two convolutional layers, max pooling of size 2x2 is used. Afterwards, the tensor is flattened and fed into a dense layer of size 64, followed by the softmax classifier. The DSNet architecture is shown in Fig. 9(a).
RDCNet: The RDCNet takes a temporal sequence of RDMs in the form of a three-dimensional radar data cube as input. Therefore, three 3D convolutional layers are used to extract information from the radar data cube. They all have a kernel size of 3x3x3 and use 4, 8 and 16 filter kernels respectively. Here too, a dropout layer with a rate of 0.2 is added after each convolutional layer to prevent overfitting. After the first two dropout layers, max pooling of size 2x2x4 is performed. Afterwards, the tensor is flattened and further processed by a dense layer of size 64 before it is classified by the final softmax layer. The RDCNet is sketched in Fig. 9(b).
2D ConvNet: The 2D ConvNet uses the same architecture as the 2D SincNet and the 2D WCN. Only the first layer is replaced by an unconstrained 2D convolutional layer with 'Glorot' weight initialization. No predefined time domain filters are used, so each filter weight can be learned individually.
To evaluate the approach presented in this paper, a dataset was recorded in a real-world environment. The radar was mounted on a tripod at a height of 1.20 m and placed in the corner of the room. The room has an area of about 20 m² and contains a table and chairs. The experimental setup is shown in Fig. 10. The dataset was chosen such that it covers fast-moving activities such as walking as well as slow-moving activities such as standing idle or working on the laptop. Thus, the challenge is to cover a large Doppler velocity range and simultaneously achieve a high Doppler resolution in certain regions in order to differentiate similar activities. The dataset contains five different human activities plus a recording of an empty room. To record the class "walking", a single person was allowed to walk around randomly. The class "idle" is split into two recordings: in the first, a person was standing in front of the radar, and in the second, the person was sitting at the table facing the radar. As the third activity, random arm movements while standing were recorded. This class is called "arm movements". To record the fourth class, called "waving", a person was waving their hand at different positions in the room while facing the radar. As the last class, working at the laptop while sitting at the table was recorded. The data was acquired in a clean environment, since the objective is mainly to differentiate activities that cover a wide Doppler frequency bandwidth and simultaneously require high Doppler resolution in certain frequency bands, rather than to handle disturbances.
Each activity was performed by the same person and recorded for about 18 minutes in total. Samples containing 2048 chirps with an overlap of 512 chirps are cut out of the recordings. Given that the chirp repetition time is 1 ms, each sample captures 2.048 s. For each sample, a Doppler spectrogram and a radar data cube are created as described in Sections III-B and III-A. This yields a dataset with raw ADC data, Doppler spectrograms and radar data cubes that are based on exactly the same chirps per sample. For each activity, about 700 samples are available. Due to slightly different recording times, the number of samples differs slightly between activities.
To evaluate the proposed approach, a filter length of 65 in the slow-time and 33 in the fast-time dimension was chosen for the 2D sinc filters. The same filter size is used for the unconstrained convolutional layer that replaces the 2D sinc filter layer in the 2D ConvNet. For the 2D wavelet convolution, the filter length in the slow-time direction was doubled in order to allow the Gaussian window function of the Morlet wavelet to expand. Moreover, "valid" padding is used in both dimensions to obtain the same output shape after the 2D sinc and the 2D wavelet layer. This makes it possible to substitute one layer for the other while keeping the remaining network the same. To reduce the computational load, a stride of 4 and 8 is used in the slow-time and fast-time dimension respectively. With the filter sizes specified, the final size of the networks can be stated. The composition of parameters per layer is shown in Tab. III for the DSNet and the RDCNet and in Tab. IV for the 2D SincNet, 2D WCN and 2D ConvNet. Since only velocity information has to be evaluated when processing Doppler spectrograms, the DSNet can be designed correspondingly smaller. The RDCNet uses a radar data cube as input data. Thus, 3D convolutions have to be used, which results in a higher number of parameters.
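The sample count and the output size of the first layer follow from simple sliding-window arithmetic, sketched below. The interpretation that consecutive samples share 512 chirps (hop of 1536 chirps) and the 128 samples per chirp used in the example are assumptions. The helper names are hypothetical.

```python
def count_samples(total_chirps, sample_len=2048, overlap=512):
    """Number of windows of sample_len chirps when neighbours share `overlap` chirps."""
    hop = sample_len - overlap
    if total_chirps < sample_len:
        return 0
    return (total_chirps - sample_len) // hop + 1


def conv_output_len(input_len, kernel_len, stride):
    """Output length of a convolution with 'valid' padding."""
    return (input_len - kernel_len) // stride + 1


if __name__ == "__main__":
    chirps_per_activity = 18 * 60 * 1000  # about 18 minutes at a 1 ms PRI
    print(count_samples(chirps_per_activity))        # on the order of 700 samples
    print(conv_output_len(2048, 65, 4))              # slow-time output length
    print(conv_output_len(128, 33, 8))               # fast-time, assuming 128 samples
```

Under these assumptions the window count for an 18 minute recording lands close to the roughly 700 samples per activity reported in the text.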
Moreover, the first layer of the 2D SincNet and of the 2D WCN has significantly fewer parameters than the corresponding unconstrained convolutional layer of the 2D ConvNet. This results from the fact that only the four filter parameters per filter are trained in the parametric convolutional layer. Since 64 2D sinc filters or wavelets are used, the parametric layer has just 64 · 4 = 256 parameters to optimize. The unconstrained convolutional layer in the 2D ConvNet, in contrast, has to learn every single filter weight. As a result, the network size is reduced by more than 50 %.
The proposed approach is a data-driven pre-processing optimization. Hence, besides the evaluation on the clean dataset, the proposed approach is also evaluated with regard to the impact of a limited amount of training data and the impact of known disturbances, such as a static 50 Hz power line frequency, which can be taken into account during training.
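The difference in first-layer size can be made concrete with a short count. The single input channel and the bias term of the unconstrained layer are assumptions, since the source does not state how the receive channels are fed to the network. The helper names are illustrative.

```python
def parametric_layer_params(n_filters, params_per_filter=4):
    """Trainable parameters when only the filter parameters are learned."""
    return n_filters * params_per_filter


def conv2d_params(n_filters, k_h, k_w, in_channels=1, bias=True):
    """Trainable parameters of an unconstrained 2D convolutional layer."""
    return n_filters * (k_h * k_w * in_channels + (1 if bias else 0))


if __name__ == "__main__":
    print(parametric_layer_params(64))   # 64 filters x 4 parameters
    print(conv2d_params(64, 65, 33))     # 64 filters of size 65 x 33 plus biases
```

With these assumptions the unconstrained first layer is larger than the parametric one by a factor of several hundred, which is consistent with the overall reduction in network size stated above.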

B. Clean Dataset
In order to give a proof of concept, the proposed idea was evaluated on the dataset described in the previous section. The dataset does not contain disturbances and provides enough training data.
1) Confusion Matrix Classification: For evaluation, a 5-fold cross-validation is performed. First, the dataset is split into 5 blocks, each of which is used once as the test set. Thus, the model is trained 5 times, leaving out a different test set each time. Finally, the results of the five runs are averaged. In this way, variations in the results due to unfortunate train and test data splits are reduced. All models are trained long enough to reach saturation: the DSNet for 100 epochs, the RDCNet for 50 epochs, the 2D ConvNet for 40 epochs and the proposed 2D SincNet and 2D WCN for 20 epochs. Afterwards, the accuracy and the F1 score are evaluated on the test part of the dataset. The obtained accuracies and F1 scores are averaged over all runs. Additionally, the standard deviation is calculated for both metrics. The results are shown in tab. V. The state-of-the-art approaches achieve accuracies of about 90.7 %, 95.6 % and 98.2 %, respectively, whereas the proposed architectures show improved accuracies of 99.2 % and 99.5 %. To analyze the classification results in more detail, the confusion matrix of the RDCNet, representing the state-of-the-art approaches, and the confusion matrix of the 2D WCN, representing the novel architectures, are shown in Fig. 11. The limitation of the state-of-the-art approaches becomes apparent when looking at the individual accuracy per class. While most actions are classified similarly well by the RDCNet and the 2D WCN, the RDCNet shows a large uncertainty between the classes "idle" and "working". The proposed filter-learning-based approaches do not show this limitation and therefore achieve better accuracy scores.
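The cross-validation procedure described above can be sketched as follows. This is a minimal illustration that assumes a plain shuffled split without stratification; the paper does not state how the blocks were formed, and the helper names `k_fold_splits` and `summarize` are hypothetical.

```python
import numpy as np


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each block serves once as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


def summarize(scores):
    """Mean and standard deviation of a metric over the runs."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std()


for fold, (train_idx, test_idx) in enumerate(k_fold_splits(700)):
    # A model would be trained on train_idx and scored on test_idx here.
    print(fold, len(train_idx), len(test_idx))
```

The per-fold accuracies and F1 scores would then be passed to `summarize` to obtain the averaged values and standard deviations of the kind reported in the results table.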
2) Learned filters: Before training is started, the sinc filters as well as the wavelets are initialized as described in sec. V-A. For each approach, cumulating the initial filters leads to an approximately uniform range and velocity gain over the whole space. During training, the filter parameters are iteratively optimized. As a result, the initial grid structure is dissolved. The cumulative gain of all filters after training is depicted in Fig. 12. The 2D sinc filters as well as the 2D wavelets have a band-pass characteristic. Therefore, the resulting cumulative gains look similar, except that the 2D wavelet gain is smoother owing to the smooth filter shape in the frequency domain. However, the resulting weights of the unconstrained convolutional layer are quite different and cannot be physically interpreted.
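The statement that the initial filters add up to an approximately uniform gain can be illustrated in one dimension. The sketch below is a simplification under stated assumptions: it uses a 1D bank of adjacent Hamming-windowed sinc band-pass filters on a uniform grid of cutoff frequencies, whereas the paper works with 2D filters initialized as described in its sec. V-A. The function names and the choice of 8 bands are hypothetical.

```python
import numpy as np


def sinc_bandpass(length, f_low, f_high):
    """Windowed-sinc band-pass; cutoffs in cycles per sample (0 to 0.5)."""
    n = np.arange(length) - (length - 1) / 2
    ideal = 2 * f_high * np.sinc(2 * f_high * n) - 2 * f_low * np.sinc(2 * f_low * n)
    return ideal * np.hamming(length)


def cumulative_gain(edges, length=65, n_fft=1024):
    """Sum of the magnitude responses of adjacent band-pass filters."""
    gain = np.zeros(n_fft // 2 + 1)
    for f_low, f_high in zip(edges[:-1], edges[1:]):
        h = sinc_bandpass(length, f_low, f_high)
        gain += np.abs(np.fft.rfft(h, n_fft))
    return gain


edges = np.linspace(0.0, 0.5, 9)  # assumed: 8 adjacent bands on a uniform grid
gain = cumulative_gain(edges)
print(gain.min(), gain.max())
```

Because neighbouring bands share their cutoff frequencies, the individual responses overlap in the transition regions and the summed gain stays close to one over the whole band. Once the cutoffs are moved independently during training, this regular structure no longer holds.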

C. Limited Dataset
An essential factor in deep learning applications is the dataset. A neural network can only learn the information contained in the dataset. Thus, data acquisition is an important task. However, acquiring representative data for the application is also time-consuming and challenging. If the dataset is too small, neural networks tend to overfit. Hence, it is evaluated how the proposed networks perform with a low number of training samples. The different networks are trained until convergence using 10, 25, 50, 100 and 250 samples per class. The training samples are randomly selected. To obtain more stable results, each network was trained five times for each number of training samples, using different randomly selected training samples every time. The mean F1-scores of the different models are stated in tab. VI as a function of the number of training samples and are additionally visualized in Fig. 13. Especially for a low number of training samples, the 2D SincNet and the 2D WCN show very good results. Since their first layers are constrained to a set of parametric filters known from signal processing, fewer parameters have to be trained, and good results are therefore already possible with a small amount of training data. Further, it is noticeable that the variance in the results of the 2D ConvNet is large, which shows that it may arrive at very good solutions but sometimes also gets stuck in a local minimum. In contrast, the 2D SincNet as well as the 2D WCN always converge to very similar scores due to the guidance of the sinc filters and wavelets.
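The sampling scheme of this experiment can be sketched as follows. The label array with 5 classes of 700 samples each is a hypothetical stand-in for the real dataset, and `sample_per_class` is an illustrative helper, not the authors' code.

```python
import numpy as np


def sample_per_class(labels, n_per_class, rng):
    """Randomly pick n_per_class sample indices from every class without replacement."""
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        candidates = np.flatnonzero(labels == cls)
        chosen.append(rng.choice(candidates, size=n_per_class, replace=False))
    return np.concatenate(chosen)


# Hypothetical label vector: 5 classes with 700 samples each.
labels = np.repeat(np.arange(5), 700)

for n_per_class in (10, 25, 50, 100, 250):
    for run in range(5):  # five repetitions with different random subsets
        rng = np.random.default_rng(run)
        train_idx = sample_per_class(labels, n_per_class, rng)
        # A model would be trained on train_idx and its F1-score recorded here.
```

Drawing a fresh subset in every repetition means that the spread of the resulting scores reflects both the choice of training samples and the randomness of training itself.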

D. Fixed Disturbance
Depending on the application, there are disturbances that are known in advance and can therefore be included in the training process. An example is a static 50 Hz interferer caused by the power grid when the radar is installed in a ceiling light. Thus, a 50 Hz sinusoidal signal at a level of −12 dB with respect to the maximum detectable signal power of the sensor was added to the dataset. The networks are then re-trained using the modified dataset.
The final accuracies and F1-scores are shown in tab. VII. The models using pre-processing are only slightly affected by the disturbance. Since the 50 Hz interference is almost static within a chirp, its influence is almost completely removed when the mean of the signal is subtracted before calculating the STFT. The architectures operating directly on the raw ADC data have to learn to suppress the disturbing frequency. The results show that the constrained networks, namely the 2D WCN and the 2D SincNet, perform better in suppressing the disturbance. Fig. 14 depicts how the sinc filters and wavelets, respectively, adapt to suppress the 50 Hz interference, which corresponds to a velocity of 0.25 m/s. The learned filters of the unconstrained 2D ConvNet still do not provide any physical interpretability.
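A minimal sketch of how such a disturbance could be injected into a raw ADC frame is given below. The chirp repetition time of 1 ms is taken from the text; the ADC sample rate, the number of samples per chirp and the full-scale amplitude are hypothetical placeholders, and the −12 dB level is interpreted as a power ratio relative to a full-scale sinusoid, i.e. an amplitude factor of 10^(−12/20).

```python
import numpy as np


def add_mains_interferer(adc, chirp_time=1e-3, sample_rate=2e6,
                         level_db=-12.0, full_scale=1.0, freq=50.0):
    """Add a sinusoidal interferer to a frame of shape (chirps, samples per chirp).

    sample_rate and full_scale are assumed values, not taken from the paper.
    """
    n_chirps, n_samples = adc.shape
    # Absolute time of every ADC sample: chirp start plus offset within the chirp.
    t = (np.arange(n_chirps)[:, None] * chirp_time
         + np.arange(n_samples)[None, :] / sample_rate)
    amplitude = full_scale * 10 ** (level_db / 20)
    return adc + amplitude * np.sin(2 * np.pi * freq * t)


frame = np.zeros((2048, 64))  # hypothetical empty frame: 2048 chirps, 64 samples each
disturbed = add_mains_interferer(frame)

# Along slow time the interferer appears as a narrow spectral line near 50 Hz.
spectrum = np.abs(np.fft.rfft(disturbed[:, 0]))
freqs = np.fft.rfftfreq(frame.shape[0], d=1e-3)
print(freqs[np.argmax(spectrum)])
```

With a slow-time sampling rate of 1 kHz the interferer lies well inside the unambiguous Doppler band, which is why it competes with slow human motion in the same frequency region.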
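The effect of mean subtraction on the interferer can be checked numerically. The sketch assumes that the mean is removed per chirp along fast time and uses a hypothetical ADC sample rate and number of samples per chirp; only the 1 ms chirp repetition time, the 50 Hz frequency and the −12 dB level come from the text.

```python
import numpy as np

CHIRP_TIME = 1e-3               # chirp repetition time from the text
SAMPLE_RATE = 2e6               # hypothetical ADC sample rate
N_CHIRPS, N_SAMPLES = 2048, 64  # N_SAMPLES is a hypothetical value

# Absolute sampling instants of every ADC sample in the frame.
t = (np.arange(N_CHIRPS)[:, None] * CHIRP_TIME
     + np.arange(N_SAMPLES)[None, :] / SAMPLE_RATE)
interferer = 10 ** (-12 / 20) * np.sin(2 * np.pi * 50.0 * t)

# Assumed pre-processing step: remove the mean of every chirp along fast time.
residual = interferer - interferer.mean(axis=1, keepdims=True)

suppression_db = 10 * np.log10(np.mean(residual ** 2) / np.mean(interferer ** 2))
print(f"residual interferer power: {suppression_db:.1f} dB")
```

Because a 50 Hz signal barely changes over the few tens of microseconds assumed for one chirp, almost all of its power sits in the per-chirp mean, and the residual is lower by several tens of dB under these assumed values. A network fed with raw ADC data does not get this step for free and has to learn an equivalent suppression.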

E. Discussion
The limitation of the state-of-the-art approaches has its origin in the pre-processing. The STFT used to generate the RDMs discretizes the range as well as the velocity domain uniformly. However, both activities "idle" and "working" contain only very slight movements. As a result, their features share similar range-Doppler bins, and the STFT-processed data of these actions is very similar, which complicates classification. Owing to their ability to learn filters, the 2D WCN as well as the 2D SincNet are able to mitigate this limitation by adapting the parameters of their filters so that the features of similar actions are clearly separated. Furthermore, the lack of range information becomes noticeable when predicting classes based on Doppler spectrograms. The missing range information is expected to have an even higher impact when analyzing activities of multiple humans simultaneously.
The 2D ConvNet is not limited by the use of pre-processed data. In theory, the 2D ConvNet can therefore achieve the same accuracy as the 2D SincNet and the 2D WCN with increased training effort, although this is not guaranteed. The reason is that the learned sinc filters as well as the wavelets lie within the search space of the unconstrained convolutional layer. However, as its number of parameters is significantly higher than that of the proposed networks, the learning speed is decreased and more training samples are required to learn the extraction of meaningful features from the raw data. This was shown in the experiments with a limited amount of training data. The 2D ConvNet achieves significantly lower F1-scores than the 2D SincNet and the 2D WCN and does not always converge to the same result, but shows a high variance in training.
In signal processing, static interferences can easily be canceled out. That is why the approaches using pre-processing were only slightly influenced by the 50 Hz interferer. However, when the raw ADC data is fed directly into the neural network, the mitigation has to be learned during training. Due to the guidance of pre-defined filters, the proposed 2D SincNet and 2D WCN show a faster convergence and a better mitigation of the interferer than their unconstrained counterpart, the 2D ConvNet.
Using a predefined set of filters such as the proposed sinc filters or wavelets makes the outcome of the convolutional layer physically interpretable, which helps in understanding the system. Moreover, through filter learning, the network focuses on the range-Doppler regions that are meaningful for the application.

VII. CONCLUSION
Human activity classification has several applications in surveillance, human-computer interfaces and smart homes. We present an activity classifier based on a parametric DCNN using 2D sinc filter kernels or 2D wavelet filter kernels that can seamlessly detect and classify human activities directly from raw ADC radar data. We compared the performance of the proposed DCNN with conventional DCNNs that use Doppler spectrograms or radar data cubes as input data and demonstrated that the proposed solution offers better classification accuracy. While the pre-processing steps in the conventional processing are fixed, the parametric DCNN is able to learn filter parameters specific to the activity classification task. Further, the filter parameters can adapt to suppress fixed disturbances in the signal. Additionally, we demonstrate that, due to the guidance by the parametric filters, the proposed networks perform better for a low number of training samples than the unconstrained DCNN. Our experiments show that the conventional non-parametric DCNN is unable to mimic the pre-processing operations due to the large search space and is thus not suitable for a robust solution. As future work, we aim to extend the proposed DCNN to simultaneously classify multiple activities from multiple targets in the field of view.