A Multimodal Data Processing System for LiDAR-Based Human Activity Recognition

Increasingly, the task of detecting and recognizing human actions has been delegated to neural networks processing camera or wearable sensor data. Because cameras are affected by lighting conditions and wearable sensors are scarce, neither modality alone can capture the data required to perform the task confidently. That being the case, range sensors, such as light detection and ranging (LiDAR), can complement the process to perceive the environment more robustly. Most recently, researchers have been exploring ways to apply convolutional neural networks to 3-D data. These methods typically rely on a single modality and cannot draw on information from complementary sensor streams to improve accuracy. This article proposes a framework that tackles human activity recognition by leveraging the benefits of sensor fusion and multimodal machine learning. Given both RGB and point cloud data, our method describes the activities being performed by subjects using regions with a convolutional neural network (R-CNN) and a 3-D modified Fisher vector network. Evaluation on a custom-captured multimodal dataset demonstrates that the model produces remarkably accurate human activity classification (90%). Furthermore, this framework can be used for sports analytics, understanding social behavior, surveillance, and, perhaps most notably, by autonomous vehicles (AVs) to support data-driven decision-making policies in urban areas and indoor environments.


I. INTRODUCTION
Human activity recognition (HAR) has engaged many to become one of the most popular and active research fields in machine vision. Motivated by the academic value and practical applications in sports analytics [1], human-machine interfaces [2], ambient-assisted living (AAL) [3], and autonomous vehicle (AV) research [4], [5], HAR is characterized by the actions taken by individuals in a group, or individuals acting alone. If a machine is to understand the actions taken by people, it needs to develop an intelligence that will allow it to reason about their behavior. To reason about the behavior of people, three core challenges must be addressed. The primary challenge is getting a machine to understand the complex interactions performed by people when they modify their behavior quickly depending on circumstances. These interactions are influenced by numerous factors, such as the distance between agents or the direction of travel. As such, the location and action of agents are equally important in facilitating a machine's ability to understand what people are doing.
The second challenge is the selection of attributes to be measured and the sensors that measure them. Being a popular and active research field, there are many different approaches. As such, sensor inaccuracy, constraints, or obtrusiveness need to be considered. For example, wearable sensors provide reliable information about the actions of an agent. Unfortunately, wearable sensors are not as prolific as they need to be. While smartphone accelerometers are part of the way there, they cannot capture as much information as a camera can. Similarly, the camera is excellent at capturing a wide field of view (FoV) but is prone to interference from lighting conditions. If each sensor has a shortcoming, the natural choice is to combine modalities, thus improving the chances of capturing all the data required to improve accuracy.
The third challenge is the sheer quantity of data required to train a classifier. Most contemporary supervised machine-learning (ML) methods require examples to learn. In principle, given unlimited data, most supervised ML paradigms should be able to classify any activity under any circumstances. However, there is a question of whether we can access the required data to achieve this task in practice. To that end, a framework that can utilize the data more intelligently is a significant challenge.
To address these challenges, we propose a novel CNN architecture that leverages multimodal ML and sensing redundancy benefits. The proposed framework uses a pretrained CNN (ResNet-50) [8] and a region proposal network (RPN) as an object detector [6]. The purpose of the object detector is to identify a region of interest (ROI), containing a person performing an activity, in an image. The corresponding ROI is translated and aligned to point cloud data before being segmented and classified. This approach extends research on a 3-D modified Fisher vector representation derived from a Gaussian mixture model (GMM) [7].
The proposed framework addresses the three challenges, showing that sensor data redundancy and multimodal ML enable us to use the data more intelligently, accommodate a diverse feature space, and reason jointly about agents' activities and location. Specifically, we developed a technique where a detected ROI is projected onto light detection and ranging (LiDAR) data, followed by a point cloud classification of an activity performed by a human, as depicted in Fig. 1.
The primary contribution of this work is a new, accurate, and reliable method of HAR, which leverages multiple sensing modalities. The secondary and tertiary contributions are the detection of an ROI and the segmentation of subjects captured in 3-D data. The remainder of this article is organized as follows. Section II provides an overview of the current state of the art in HAR. Section III details the proposed framework. Section IV reports on the comparative performance and visual results of the proposed framework. Section V discusses some of the limitations before concluding this article in Section VI.

II. RELATED WORK
The technology and sensors integral to HAR vary to a great degree. This research uses a camera and a LiDAR sensor to classify people performing different activities. The data collected by the camera is regular 2-D data, and the LiDAR collects 3-D or point cloud data. To that end, we use 3-D data and point cloud data interchangeably throughout this article.
Multimodal ML is a research field with one of its earliest applications in audio-visual speech recognition (AVSR) [8], extending to contemporary fields of emotion recognition [9], [10] and tracking multiple agents [11] or anomalies in traffic [12]. Like sensor data fusion, in the sense that it uses multiple sensor modalities, multimodal ML uses multiple ML algorithms, either in parallel or in series, to improve a framework [13]. Five core technical challenges need to be addressed in multimodal ML [14]. They can be described as: 1) representation, which is concerned with the difference between data types; 2) translation, which is the mechanism of moving data from one plane to another; 3) alignment, which indicates the proximal relationship between two different sensor streams; 4) fusion, which denotes the joining of sensor data; and 5) co-learning, which concerns how knowledge learned from one modality is transferred for use with another.
Despite numerous developments in ML, the majority are demonstrated in tightly controlled settings with well-defined outcomes, typically perform singular tasks, and generally use only one sensor stream [15]. Developing agents that make decisions using context information gathered from real-world environments remains an essential task, especially if we want a machine to understand what is going on around it.
The context of a scenario can be captured from various types of instruments, measurement techniques, and sensors. In multimodal sensing, multiple sensor modalities capture context information about the environment they are working in [16], [17]. Commonly referred to as sensor data fusion, the challenges can be categorized into two groups [5], [16]: 1) challenges at the acquisition level and 2) challenges due to the uncertainty of data sources.
One of the most prolific applications of multimodal ML and sensor data fusion is in the area of AAL. Typically used in hospitals, rehabilitation centers, and elderly care units [18], the algorithms are designed to determine what situations require urgent medical assistance. This process is usually done using multiple signals to determine a drift away from homeostasis [19] and identified using multimodal ML by combining multiple biometrics to deduce what has just occurred.
A closely related research field is multimodal ML for HAR and pedestrian detection [20]-[22]. Using the principles derived from AAL, it is expected that researchers could decrease road accidents and improve road safety. For example, Diederichs et al. [23] and Rehder et al. [24] provided an interesting application of multimodal ML for a pedestrian recognition system that matches the pedestrian's predicted intention with the driver's direction and, depending on the actions, brakes the vehicle to avoid a collision.
For the most part, research into HAR has focused on either RGB, RGB-D, or wearable sensor data. These techniques extract human silhouettes, track them in the temporal domain, and classify activities based on the patterns analyzed [12], or use velocity and acceleration data from an agent to make a classification. For example, research in [25]-[27] used microelectromechanical systems (MEMS) to facilitate a wearable sensor-based approach to activity recognition. Suitable for both indoor and outdoor environments, their small size and light weight make MEMS ideal for such scenarios. Furthermore, when combining multimodal ML methods in similar manners to researchers in [28], the argument for using wearable sensors becomes abundantly clear. Regrettably, though, MEMS have a long way to go to become ubiquitous. Even though smartphones are commonplace and can be used for this purpose, a single stream of low-level data, such as acceleration data, is insufficient to identify the more nuanced differences between walking with a phone and just walking [20], [29].
In some cases, a single stream of low-level data can be used to detect actions such as falls. However, in reality, the performance of such recognition systems is often significantly lower than the reported research results [30]. For example, Igual et al. [31] cross-validated different fall detection algorithms and found unfavorable results. These unfavorable results can be attributed to variations in device hardware and the relative position of sensors [32]. Clearly, when dealing with accelerometer data, the placement of sensors is paramount [33]-[38]. For example, falling is a desirable activity to recognize, and it is possible to do so with a single stream of low-level data from a smartphone; however, the type of sensor and the place where the user carries the device are significant factors influencing the algorithm's ability to determine the action accurately. Although it is possible to substitute a smartphone with wearable MEMS, we arrive back at the same problem of scarcity, making optical and range sensors an ideal substitute for the less than prolific wearable MEMS.
Wearable MEMS, such as the electrocardiogram (ECG) and photoplethysmogram (PPG), can provide reasonably reliable information. However, in practical applications like HAR, these signals suffer attenuation due to clothing artifacts [39], are easily influenced by diet [40], and provide erroneous data due to prescribed medication [41]. For example, artifactual signals caused by clothing can corrupt normal cardiac signals at the electrode. Dietary sodium, associated with unfavorable alterations in left ventricular mass, hypertension, and increased water retention, can significantly lower carotid-femoral pulse wave velocity (PWV). Conversely, medications such as L-Dopa, typically used in the treatment of Parkinson's patients, cause postural hypotension (the pooling of blood in the lower limbs), resulting in an inability to distinguish the diastolic peak of the PPG signal. This is not to say that wearable MEMS are not valuable sensors used to gather data, just that when applied to HAR they are scarce and can easily be influenced by external factors, resulting in inaccurate or erroneous data.
Herein lies the reason why some activity recognition researchers are focusing on RGB sensors rather than wearable MEMS. For example, Bagautdinov et al. [1] proposed a unified framework for HAR using a recurrent neural network (RNN) to identify human subjects with an imaging sensor. Researchers in [42] used RNNs and imaging data to perform the same task, except for single-subject activity recognition. Researchers in [43] presented a vision-based HAR system using a mobile camera and histograms of oriented gradients (HOG) extracted from the feature space.
Expanding on research using RGB, some scholars have made a further transition over to a data stream that can accommodate depth information and true color. Using two parallel CNNs, called ResNet [44], to classify activities, Mukherjee et al. [45] presented a method of capturing the motion information of a whole video by producing a dynamic image corresponding to the input video. Similarly, Cippitelli et al. [46], Feng et al. [47], and Arzani et al. [48] proposed an activity recognition algorithm exploiting skeleton data extracted from RGB-D sensors, an activity scene recognition method based on a 3-D skeleton sequence, and a probabilistic graphical model creating loopy subgraphs applicable for modeling simple and complex activities, respectively.
While most of the activity recognition systems have focused on wearable MEMS, RGB, or RGB-D data, most contemporary systems are equipped with a camera and range sensors such as LiDAR. These sensors often complement traditional cameras (RGB-imaging sensors) to perceive the environment more robustly. Spurred on by events, such as the 2004 and 2005 Defense Advanced Research Projects Agency (DARPA) Grand Challenges and the 2007 DARPA Urban Challenge, the configuration of LiDAR and camera has become commonplace in almost all AV design [49], [50].
It is not just in AV research that LiDAR has been adopted as a data-gathering device. AAL research frequently uses LiDAR to make it easier for impaired people to navigate [51], [52]. Moreover, low-cost versions of the same LiDAR sensors are starting to enter the commercial world [53], [54], making them more accessible. This trend has led some to believe that LiDAR data, fused with image data, is the future of photography [55]. By removing the sensor's mechanical components, the industry hopes to reduce the cost dramatically [27].
Concerning ML methods used for activity recognition, CNNs appear to be a popular choice [56]-[58]. Regrettably, point cloud data is not a natural input to a CNN. CNNs were designed to process RGB images or a similarly structured data format; therefore, adapting them to work on 3-D data is not a straightforward extension. Depending on the sensor FoV, point cloud data is unstructured, unordered, prone to missing points, and affected by noise [7].
Early attempts at training CNNs using point cloud data required transforming the 3-D data to a series of RGB images at multiple views. These networks learned the depth map of the scene rather than the 3-D objects [59]. Although some success was attained, information was lost in the transformation, and the accuracy of the network suffered [60], [61].
More recently, a deep-learning (DL) architecture called PointNet [62], [63] was proposed. During training, PointNet takes unordered data points as a set of functions and maps a point set onto a vector. When making a classification, the vector representation is checked against the patterns identified in the dataset.
The advantage of PointNet is an end-to-end network that can process 3-D data without translating it to a complex representation [63]. While PointNet is good at making a classification, it is void of a mechanism to facilitate detection and localization. Moreover, the process is limited by available memory and lacks an understanding of the relationship between data points.
From the reviewed material, we understand that the ML methods used for HAR vary between RNNs and CNNs, and that one of three different types of sensors is used to capture data. In terms of practicalities, we know that wearable sensors can produce some exemplary results but are not suitable considering the infrastructure requirements and the ease with which data can be influenced. While RGB or RGB-D sensors capture more information, they are susceptible to lighting conditions and, therefore, not reliable.
Therefore, a solution for classifying human activities needs to rely on the data that the sensors are providing, use it intelligently to rein in dimensionality, and adequately capture all the necessary attributes without the need for ubiquitous infrastructure. This approach would have to borrow from multimodal ML techniques and sensor data fusion, and, given the diverse environments humans interact with, the algorithm needs to remain robust across them.

III. METHOD
In this section, we present our method of HAR using multimodal ML and sensing redundancy. The proposed method is based upon an advanced CNN architecture. The proposed framework illustrated in Fig. 2 utilizes a robust object detector to identify the ROI and a point cloud classifier to identify the subject's actions. The process requires the 2-D ROI to be translated and aligned to the LiDAR data before the segmented point cloud data is classified into the activity being performed.

A. Object Detection
The principal component of the proposed architecture detects and localizes agents before extracting the ROI from the RGB data. We modified ResNet [44] by training an RPN and a region classification network (RCN) to extract the ROIs. Referred to as a Faster-RCNN [6], this technique performs the convolution operation once per region before generating the feature map for the data. An extension of the RCNN and the Fast RCNN [6], the Faster RCNN's base classification network was ResNet-50. The base network was trained on the ImageNet dataset [64] and fine-tuned using a subset (RGB images only) of the Loughborough London HAR (LboroLdnHAR) [3] dataset. This portion of the proposed framework was validated using the intersection over union (IoU) between the ground-truth bounding boxes and the predicted ROI. IoU measures the overlap between the annotated bounding boxes of the RGB images and the proposed ROI. Using pairs of anchors with sizes {30, 19; 60, 38; 120, 76}, the ROI was labeled positive (object of interest present) when the IoU was greater than 0.65 and negative (object of interest not present) when it was less than 0.35.
In this case, the score the classifier returns for a region behaves as a threshold for the detector. During training, the shortest side of all images was scaled to 246 pixels. Trained using stochastic gradient descent with a learning rate of 0.0001 and momentum of 0.9, the RCNN network used a patch size of 16 × 16 pixels.
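As a concrete illustration of the IoU-based labeling rule described above, the following minimal sketch labels a proposed ROI against a ground-truth box. The box format (x_min, y_min, x_max, y_max) and the `label_roi` helper are our assumptions for illustration; the 0.65 and 0.35 thresholds are those quoted in the text.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (empty if the boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

def label_roi(proposed, ground_truth, pos=0.65, neg=0.35):
    """Label a proposed ROI using the two IoU thresholds from the text."""
    score = iou(proposed, ground_truth)
    if score > pos:
        return "positive"   # object of interest present
    if score < neg:
        return "negative"   # object of interest not present
    return "ignored"        # between thresholds; not used for training
```

Proposals falling between the two thresholds contribute nothing to the RPN loss, which is the usual Faster-RCNN convention.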

B. Geometric Alignment
The process of geometric alignment requires knowledge about proximity, orientation, and sensor modality. In terms of modality, there are vast differences between sensors that capture RGB, RGB-D, and point cloud data. For example, point cloud data is a representation of a collection of 3-D coordinates in space. Predominantly, these coordinates can be captured in two ways: 1) 3-D scanning and 2) photogrammetry [65]. The data differs in many ways from the structure and order of RGB data [61]. For example, the CCD or CMOS sensor at the heart of the camera captures reflected light from a subject. The specific location on the sensor that the reflected light strikes is recorded along with the true color value of the data being captured. Conversely, point cloud data captures the location of an object in 3-D space. There is no sequence in which the ranges are recorded, and the scans vary in resolution, returning either spherical, cylindrical, or Cartesian coordinates of the objects they are sensing.
We address these issues by translating and aligning different sensor modalities. This process is performed in two stages. The foremost stage translates and aligns RGB-D data captured by the Kinect sensor to the point cloud data captured by the LiDAR. The second stage translates the combined 3-D data streams to the RGB camera data. It should be noted that in the first stage, the data types are comparable, so the process is relatively simple. In the latter, the process is more complicated since we are translating different kinds of data. Figs. 3 and 4 illustrate a plan and elevated view of the sensor setup. Knowing this setup is imperative so that we can find the corresponding data points between different sensor modalities.
For the process of geometric alignment, consider a point in space represented by a 3-D coordinate. To shift that point's origin requires a scalar for each coordinate. For this explanation, consider an object O 4.76 m (x) in front of the Kinect sensor, 2.75 m (y) to the right, and 0.9 m (z) below the horizontal axis.
If we know that the Kinect sensor is offset from the LiDAR by 0.45 m on the x-axis, 0 m on the y-axis, and 0.5 m on the z-axis, it is a matter of adding the offset to the corresponding coordinates to align the two data streams:

x_L = x_K + 0.45, y_L = y_K + 0, z_L = z_K + 0.5.

In the second stage of the process, we find the corresponding 3-D point for a specific pixel output by the RGB sensor. This process is somewhat more complex than the foremost stage, as the data types are not comparable. For this derivation, consider an object O at a distance (c) of 5.6 m from the LiDAR. In this case, the object O is identified at an azimuth angle (B') of 3° and a zenith angle of 99°. The vertical distance (b) between the LiDAR and the RGB sensor is 27 cm, and the RGB sensor is positioned (H_0 + z + b) 122 cm above the ground. Considering the distance c between the object O and the LiDAR sensor, we can describe B, the relative RGB angle at O, as

B = arctan( b sin A / (c − b cos A) ).

In this case, the RGB and LiDAR sensors are on the same longitudinal axis, and we know the LiDAR zenith angle A. As such, we can describe the RGB zenith angle C as

C = 180° − (A + B).

The distance a between object O and the RGB sensor can be described, by the law of cosines, as

a = sqrt(b² + c² − 2bc cos A).

The purpose of this operation is to translate the LiDAR data stream to the image data. We assume that the longitudinal axis of the camera and the RGB sensor are aligned; however, an offset can be accounted for should it be required.
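The two-stage alignment just described can be sketched numerically. This is a minimal sketch under assumed sign conventions: the offsets (0.45 m, 0 m, 0.5 m), range c = 5.6 m, baseline b = 27 cm, and zenith angle A = 99° are the values quoted in the text, while the triangle relations (law of tangents and law of cosines for the vertical LiDAR-RGB-object triangle) are our reconstruction of the omitted equations.

```python
import math

# Stage 1: shift Kinect coordinates into the LiDAR frame by adding the
# fixed sensor offsets (0.45 m in x, 0 m in y, 0.5 m in z).
def kinect_to_lidar(p, offset=(0.45, 0.0, 0.5)):
    return tuple(pi + oi for pi, oi in zip(p, offset))

# Stage 2: solve the vertical triangle LiDAR-RGB-object.
#   c: LiDAR-to-object range (m), b: LiDAR-to-RGB vertical baseline (m),
#   A_deg: LiDAR zenith angle (degrees).
# Returns B and C in degrees and the RGB-to-object range a in metres.
def lidar_to_rgb(c, b, A_deg):
    A = math.radians(A_deg)
    # Angle at the object between the rays to the two sensors.
    B = math.atan2(b * math.sin(A), c - b * math.cos(A))
    # Law of cosines for the RGB-to-object distance.
    a = math.sqrt(b ** 2 + c ** 2 - 2 * b * c * math.cos(A))
    # Angles of a triangle sum to 180 degrees.
    C = math.pi - A - B
    return math.degrees(B), math.degrees(C), a
```

With the worked values from the text (c = 5.6 m, b = 0.27 m, A = 99°), this sketch gives B ≈ 2.7°, C ≈ 78.3°, and a ≈ 5.65 m.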

C. Point Cloud Classification
Unlike 2-D RGB data, 3-D data does not contain valuable descriptors-HOG, hue, saturation, and variance (HSV), etc.-used by feature learning networks to classify the objects of interest. By its nature, 3-D data needs to be modified before the CNN can classify it. The proposed method for classifying point cloud data consists of two main modules. The first module converts 3-D data to the modified Fisher vector representation. The second module processes the modified Fisher vector representation in a CNN before making a classification. An in-depth description of the point cloud classifier used in this research can be found in [7].
1) Fisher Vector Representations: Intraclass variability describes variations in appearance between two views of the same class or variations in appearance between two instances of the same class. In activity recognition, for example, significant variations in appearance between two instances can occur. Fisher vectors are a way of dealing with intraclass variability and determine the likelihood of getting specific data given an underlying theory.
A 2-D representation of the Fisher vector is shown in Fig. 5. Formally, it is the expected value of the observed information. This can best be described as having a probability distribution with a very sharp peak in the ideal circumstance. Conversely, we can obtain a distribution where the curve is very broad. This circumstance means there is a good deal of likelihood over a broad range of points.
Using a GMM, we can quantify the rate of curvature by looking at the probable distribution of the different points. It is often referred to as the curvature vector. It describes how curved the likelihood function is around the maximum. In simpler terms, the bigger the Fisher vector, the more curved the distribution is, meaning the more constrained the data is for that scenario. Placing spherical Gaussians on a coarse grid provides structure, size, and the foundation for representing the image. If we compute each point derivative with respect to the Gaussian parameters and aggregate the results using three symmetric functions (maximum, minimum, and summation), we obtain a Fisher vector representation of a constant size that is invariant to permutations for different numbers of points.
For multiple points in a GMM, which consists of several Gaussian positioned on a grid, the Fisher vector representation can be interpreted as a statistically unique fingerprint of the 3-D data [66]. Fig. 6(a) depicts the spherical Gaussians superimposed on the point cloud data, and Fig. 6(b) shows the modified Fisher vector representation. It should be noted that the modified Fisher vector representation in Fig. 6(b) is for visualization purposes only.
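The grid-of-Gaussians idea above can be illustrated with a simplified sketch: per-point derivatives with respect to the Gaussian means are aggregated with the three symmetric functions (max, min, sum), yielding a fixed-size, permutation-invariant representation. The grid resolution and sigma here are illustrative assumptions, not the values used in [7], and the real 3DmFV also differentiates with respect to weights and covariances.

```python
import numpy as np

def mfv_representation(points, grid=3, sigma=0.25):
    """Toy modified-Fisher-vector sketch for an (N, 3) point array."""
    # Spherical Gaussian centres on a grid x grid x grid lattice in [-1, 1]^3.
    axis = np.linspace(-1, 1, grid)
    centres = np.stack(np.meshgrid(axis, axis, axis), -1).reshape(-1, 3)
    diff = points[:, None, :] - centres[None, :, :]      # (N, K, 3)
    # Soft assignment of each point to each Gaussian.
    w = np.exp(-np.sum(diff ** 2, -1) / (2 * sigma ** 2))
    w /= w.sum(1, keepdims=True) + 1e-12                 # normalise per point
    # Per-point derivative of the log-likelihood w.r.t. each mean.
    d_mu = w[..., None] * diff / sigma ** 2              # (N, K, 3)
    # Aggregate over points with the three symmetric functions.
    return np.concatenate([d_mu.max(0).ravel(),
                           d_mu.min(0).ravel(),
                           d_mu.sum(0).ravel()])
```

Because max, min, and sum are symmetric in their arguments, reordering the input points leaves the output unchanged, and the output length (3 × K × 3 values for K Gaussians) does not depend on the number of points.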

2) Classification of Fisher Vector Representation:
The main parts of the network used to classify the representation comprise several inception modules, max-pooling layers, and four fully connected layers [67], [68]. Training is done using backpropagation and standard SoftMax and cross-entropy loss with batch normalization. Dropout is placed on the fully connected layers to prevent excessive co-adapting and overfitting during training. Dropout refers to a technique of ignoring randomly selected neurons' responses during a forward or backward pass. Between the last max-pooling layer and the first fully connected layer, the network has approximately 4.6 million trained parameters.
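The dropout technique mentioned above can be sketched in a few lines. This is a minimal sketch of inverted dropout, the common formulation; the drop probability of 0.5 is an illustrative assumption, not a value reported for this network.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Zero each activation with probability p during training,
    rescaling survivors so the expected activation is unchanged;
    act as the identity at inference time."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p      # keep with probability 1 - p
    return x * mask / (1.0 - p)          # inverted-dropout rescaling
```

Because the surviving activations are rescaled by 1/(1 - p), no extra scaling is needed when the layer is switched off at inference.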
The overall objective of the inception module is to overcome the dimensionality of the multiplier effect. An inception module allows us to use a number of filters in a single module [68]. While this will make the network architecture wider and more complex, the process works remarkably well in view of what we are trying to achieve.
The penultimate element of the classification network is a max-pooling layer. The max-pooling layer allows us to reduce the dimensionality of the inception module further.
Max-pooling layers work by passing all output elements of the inception module through an n×n filter. The n×n filter passes over the data with a fixed stride. Taking the first n×n region, we calculate the max value for that region before passing it on to the next. This process is repeated, passing the filter over all data elements, using the stride value to shift the pooling filter over in increments.
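The pooling pass just described can be sketched directly. This is a minimal 2-D sketch; the filter size n = 2 and stride 2 are illustrative assumptions, not the values used in the network.

```python
import numpy as np

def max_pool2d(x, n=2, stride=2):
    """Slide an n x n max filter over a 2-D array with a fixed stride."""
    h = (x.shape[0] - n) // stride + 1
    w = (x.shape[1] - n) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Max value of the current n x n region.
            out[i, j] = x[i * stride:i * stride + n,
                          j * stride:j * stride + n].max()
    return out
```

For example, pooling a 4 × 4 feature map with n = 2 and stride 2 yields a 2 × 2 map, quartering the dimensionality while keeping the strongest response in each region.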
The final component in the classification network is four fully connected layers. The output of the previous layers is flattened into a single vector of values. These values represent the probability that a particular feature belongs to a specific class. For example, in HAR, when a person is running, the fully connected layers should identify features representing the action with a high probability.
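The final step, turning the flattened scores into per-class probabilities with SoftMax, can be sketched as follows. The `classify` helper is our illustrative assumption; the nine labels are the LboroLdnHAR activity classes used in this work.

```python
import numpy as np

LABELS = ["carrying boxes", "lying down", "pushing a board", "running",
          "sitting on a chair", "sitting on a stool",
          "standing while texting", "walking", "walking while texting"]

def softmax(z):
    """Map raw class scores to probabilities that sum to 1."""
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def classify(scores):
    """Return the most probable activity label and its probability."""
    p = softmax(np.asarray(scores, dtype=float))
    return LABELS[int(p.argmax())], float(p.max())
```

A strongly activated "running" feature thus dominates the output distribution, which is the behavior the fully connected layers are trained to produce.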

IV. EXPERIMENTAL RESULTS
This section is organized into six parts. The first section summarizes the LboroLdnHAR dataset developed in [3]. The second, third, and fourth sections report on the performance of the proposed framework, PointNet [62], and the deep neural-network (DNN) framework [3], [46], for HAR, respectively.
Comparative evaluation of the framework in this manner is essential to appraise the performance of the proposed process against state-of-the-art methods. It is not always clear how to do this, as what works for one framework will not necessarily work for another, particularly, when they process different data types. With this in mind, we have explained the metrics we use and the reason for using them.
In the fifth section, we benchmark the proposed framework against baseline methods of HAR using the SYSU-3D dataset before presenting some visible results of the proposed framework in the final section.

A. Dataset
Existing techniques for HAR are sensitive to viewpoint variations. This is a consequence of extracting features from viewpoint-dependent images. In contrast, we mine RGB data to identify a region in a point cloud scan. Since we segment and classify viewpoint-independent data, this technique is insensitive to viewpoint variations. To achieve this, we require a specific type of dataset to fulfill the requirements of multimodality while providing optical and range data from at least two sensor streams of a known location.
The LboroLdnHAR [3] dataset consists of 6712 hand-annotated point cloud, RGB-D, and RGB samples. Data were captured by three sensors: a VLP-16 LiDAR, a Kinect, and a 360° camera. The dataset contains data of nine typical human activities performed by 16 participants. Activities were chosen on the basis of realistic activities performed in an office environment. The activities were carrying boxes, lying down, pushing a board, running, sitting on a chair, sitting on a stool, standing while texting, walking, and walking while texting.
The mean period for each data capture was approximately 35 s. Subjects started and ended the data capture periods with a T pose-standing upright with arms outstretched. During the experiments, the RGB 360 • camera, RGB-D Kinect V2, and VLP-16 LiDAR sensor logged data at 30, 30, and 5 frames/s, respectively.
We split the dataset into three subsets. The first subset consisted of images with annotated ground-truth ROI labels and was used to train the object detector. The second subset contained aligned Kinect and LiDAR data. This subset of data was used to train the point cloud classifier. The final subset was a small number of the aligned and translated point cloud, RGB-D, and RGB samples not used during the training or validation of the proposed framework. This section of the dataset was used to demonstrate the performance of the proposed framework visually.

B. Proposed Framework Performance
In this section, we report on the performance of the individual components of the proposed framework. We chose to evaluate the proposed framework in this manner to assist in the selection of a point cloud classifier. Not all point cloud classifiers are suitable for activity recognition. Some do not retain the crucial relationships between different points that allow us to understand the system of rigid segments connected by articulated joints that enable humans to move. Similarly, when comparing the performance of the proposed framework, it would make little sense to compare against a framework that used wearable MEMS as the data is dramatically different. To that end, we chose to evaluate the individual components of the framework separately and compare the methods of classification against an alternative point cloud classifier and a DNN rather than a classifier that processes MEMS data.

1) Object Detection Performance:
The location and detection of agents were performed using an object detector. We trained and evaluated the object detector on the first subset of the LboroLdnHAR dataset [3]. To measure the detector's performance, we focused on the average precision (shown in Fig. 7), the log-average miss rate (shown in Fig. 8), and the F-score detailed in Table I. The average precision is a method of evaluation that incorporates the ability of the detector to make correct classifications, that is, precision, and the ability of the detector to find all relevant objects, that is, recall.
The log-average miss rate was attained by varying the thresholds on the detector confidence prediction and measuring the rate of change in the true and false positives. Generally speaking, the log-average miss rate returns the results of the object detector compared to the ground truth. This metric is another method of measuring the performance of the object detector [69], [70]. In this case, we found that the average precision and log-average miss rate for our detector were 0.95 and 0.1, respectively.
The F-score considers both the precision and the recall of the detector. Using binary classification in conjunction with the F-score allows us to determine the framework's accuracy [71]. Defined as two times the precision times the recall, divided by the precision plus the recall, it reaches its best value at 1 and its worst at 0. We determined the F-score, listed in Table I, to be 0.95 for the object detector. These metrics are frequently used to determine the performance of an object detector.

2) Point Cloud Classification Performance:
Table II reports the average score for each class when tested with the LboroLdnHAR dataset [3]. To measure the classifier's performance, we focused on the precision, recall, and F-score of each class. Fig. 9 shows the confusion matrix for the point cloud classifier. The y-axis is the output class, and the x-axis is the target class. The diagonal cells indicate true positives that are correctly classified; the off-diagonal cells indicate false positives that are incorrectly classified. The overall accuracy of the classifier was 0.903. The precision and recall were 0.895 and 0.894, respectively.
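The per-class scores quoted above all follow mechanically from the confusion matrix. A minimal sketch, using a hypothetical 3 × 3 matrix rather than the paper's data, with rows as the output class and columns as the target class to match the figure convention:

```python
# Hedged sketch: per-class precision, recall, F-score, and overall accuracy
# from a confusion matrix. The 3x3 matrix is illustrative only.
import numpy as np

def metrics(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                   # correctly classified counts
    precision = tp / cm.sum(axis=1)    # per output (row) class
    recall = tp / cm.sum(axis=0)       # per target (column) class
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f_score, accuracy

cm = [[45,  3,  2],
      [ 2, 40,  1],
      [ 3,  2, 47]]
precision, recall, f_score, accuracy = metrics(cm)
```

The reported overall precision and recall (0.895 and 0.894) are the averages of these per-class values across all activity classes.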

C. PointNet Performance
We compared the performance of the proposed method against PointNet [62], which was trained and tested with the LboroLdnHAR dataset [3]. Table III reports the average score for each class from PointNet. As in the previous section, we focused on the precision, recall, and F-score. Fig. 10 shows the confusion matrix for PointNet [62]. The overall accuracy of the classifier was 0.098. The precision and recall were 0.111 and 0.147, respectively. Of particular interest was the network's inability to classify subjects lying down, sitting on a chair, and sitting on a stool. While the network was able to make some predictions, its performance was substandard compared with the 3-D classification network.

D. DNN Performance
To make an additional comparison, we benchmarked the performance of the proposed method against a DNN framework for HAR reported in [3]. This framework is similar to that reported in [46] and is a frequently used method of identifying human activities from RGB-D data. The DNN framework was trained and tested with the LboroLdnHAR dataset [3]. Table IV reports the precision, recall, and F-score of each class from the DNN framework. Fig. 11 shows the confusion matrix for the DNN framework. The overall accuracy of the classifier was 0.835. The precision and recall were 0.808 and 0.825, respectively. Of particular interest was the network's performance in classifying subjects when moving. For example, the accuracy when classifying stationary subjects was remarkably higher than that for dynamic classes. This is evident in Fig. 11, where the framework frequently misclassifies walking, running, and carrying boxes.

E. Performance on the SYSU-3D Dataset
To benchmark the proposed framework, we trained and tested the network on the joint data from the SYSU-3D dataset. The SYSU-3D dataset consists of 40 subjects performing 12 activities: 1) drinking; 2) pouring; 3) calling phone; 4) playing phone; 5) wearing backpacks; 6) packing backpacks; 7) sitting chair; 8) moving chair; 9) taking out wallet; 10) taking from wallet; 11) mopping; and 12) sweeping. Fig. 12 shows the confusion matrix for the proposed framework when tested on the SYSU-3D dataset. The overall accuracy of the classifier was 0.954. The precision and recall were 0.956 and 0.958, respectively. Table V reports the precision, recall, and F-score of each class from the proposed framework. Table VI compares HAR techniques, indicating that the proposed framework outperforms the baseline methods (according to accuracy) by 8.5%.
Since the skeletal data provided in the SYSU-3D dataset can be regarded as noisy with large variations in viewpoint, the proposed framework brings significant performance improvement while showing how an activity recognition process can be insensitive to viewpoint variations.

F. Visual Results of the Proposed Framework
The purpose of the proposed algorithm is to assist an AV or AAL system with HAR. Due to the multimodal requirements of the proposed framework, the results displayed here are a subset taken from the LboroLdnHAR dataset [3], captured while the experimental platform was stationary. All samples were gathered from indoor scenarios under controlled lighting.
The results presented in Fig. 13 Scenario 1 (c), Scenario 2 (c), Scenario 3 (c), and Scenario 4 (c) indicate that the proposed HAR framework performed as desired. In all cases, the subject performing the activity was detected, segmented, and classified correctly.
For example, in Fig. 13 Scenario 1 (a), the image shows a subject carrying a box. The first part of the proposed framework, the object detector, identifies, with a high degree of confidence (99%), the location and region that the person occupies. Fig. 13 Scenario 1 (b) shows the 3-D ROI identified by the object detector in the 2-D image. Although there are unwanted artifacts located in the 3-D ROI, the activity being performed by the subject is correctly classified, as shown in Fig. 13 Scenario 1 (c).
In Fig. 13 Scenario 1 (a), there is a person, other than the subject, in the background. In this case, the object detector did not function as desired. This can be attributed to the design of the Faster R-CNN network, where input images of 360 × 360 pixels are reduced in size before being processed.
It is important to remember that, when processing images, the Faster R-CNN extracts features within its first few layers. When dealing with a small object in a small image, the features can, in effect, disappear in the middle of the network before detection occurs. In this case, the person in the background is not detected and, therefore, their activity is never classified in the latter part of the network. Notwithstanding the missed subject, the results show how multimodal ML can be successfully applied to HAR when using RGB data captured by a camera and point cloud data captured using a Kinect or LiDAR.
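The disappearing-feature effect can be quantified with simple arithmetic. Assuming a VGG-style backbone with an effective stride of 16 before region proposal (a common Faster R-CNN configuration, not a detail confirmed by this article), an object's footprint on the final feature map is its pixel width divided by the stride:

```python
# Back-of-the-envelope sketch of why a distant person can vanish inside
# the detector. The stride and the object widths are illustrative
# assumptions, not measurements from the paper.
def feature_cells(object_px, stride=16):
    """Approximate width, in feature-map cells, of an object."""
    return object_px / stride

near_subject = feature_cells(120)  # subject filling much of a 360 px image
far_person = feature_cells(9)      # distant person ~9 px across
```

An object narrower than one feature cell leaves almost no activation for the region proposal stage to anchor on, which is consistent with the background person being missed while the nearby subject is detected.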
A related issue occurs in Fig. 13 Scenario 2 (a). In this image, the subject is standing while texting. The object detector correctly identifies the ROI for the subject but misses the person behind the computer due to occlusion. The subject of interest, in contrast, was detected correctly, and the corresponding 3-D ROI was extracted from the 3-D data. Of particular interest is Fig. 13 Scenario 2 (b). In this image, the subject is standing upright with hands out front, clasping a phone. This ROI is passed to the classifier component of the proposed network before the correct result is overlaid onto the image shown in Fig. 13 Scenario 2 (c).
It is important to note that the phone in Fig. 13 Scenario 2 (b) occupies 19 3-D points, whereas the person sitting in the background in Fig. 13 Scenario 1 (a) occupies 79 pixels. Given that the object detector failed to identify the person in the background, this suggests that the point cloud classifier is adept at identifying point cloud features regardless of proximity.
A further example of a dynamic activity is presented in Fig. 13 Scenario 3 (a), where a subject is walking while texting. In this image, the subject was identified with 98% confidence. The corresponding 3-D ROI, segmented using the region identified by the object detector, is shown in Fig. 13 Scenario 3 (b). As in Fig. 13 Scenario 2 (b), the subject in this image had their hands out front, clasping a phone. This ROI was passed to the classifier component before the correct result was overlaid onto the image shown in Fig. 13 Scenario 3 (c). Fig. 13 Scenario 4 (a) shows a subject lying down. In this image, the area occupied by the subject, and some of the couch, was identified with 97% confidence. Fig. 13 Scenario 4 (b) shows the 3-D ROI identified by the object detector. In this image, the ROI contains the 3-D data of the subject and a portion of the couch. When passed to the classifier, the correct result was overlaid onto the image. Much like Fig. 13 Scenario 1, where the subject was carrying a box, the object with which the subject is performing the activity is incorporated into the classification process.

V. DISCUSSION
In Fig. 13, we presented four scenarios displaying the different sections of the proposed framework for HAR. In each case, the model outputs remarkably accurate results. Furthermore, we found that the proposed framework can make an accurate prediction from partial data with few points.
For example, in Fig. 13 Scenario 4 (b), the subject is in the supine position on a couch. The lack of point cloud data cannot be attributed to a fault in the instrument but to its principle of operation: the horizontal planes traced by the laser grow further apart with distance, so the greater the range, the greater the gap between planes.
In this case, the points on the subject contacted by the laser are reduced to a single plane, making the subject difficult to classify. Moreover, given the obscurity and the absence of data points, humans trying to annotate the same data would find it a difficult task.
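The geometry behind this sparsity is easy to quantify. For a spinning LiDAR whose scan planes are separated by a fixed vertical angle, the gap between adjacent planes grows linearly with range; the 2° spacing below is an assumed, typical value rather than this instrument's specification:

```python
# Sketch of the sampling geometry behind the sparse couch scan: the
# vertical gap between adjacent LiDAR scan planes grows with range.
# The 2-degree vertical spacing is an illustrative assumption.
import math

def plane_gap(range_m, delta_deg=2.0):
    """Vertical distance (m) between adjacent scan planes at a range."""
    return 2 * range_m * math.tan(math.radians(delta_deg) / 2)

gap_2m = plane_gap(2.0)   # ~0.07 m between planes at 2 m
gap_8m = plane_gap(8.0)   # ~0.28 m at 8 m
```

At several meters' range, the inter-plane gap approaches the thickness of a supine torso, so a subject lying down can plausibly intersect only a single plane.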
Notwithstanding the successes, we do observe a few intrinsic issues to heed. The primary and most pronounced issue concerns the frame rate of the LiDAR (5 Hz). While it is possible to increase the frame rate, doing so reduces the number of points captured in a single scan. Although the proposed framework shows resilience to frugal data captures, there is a point at which the network will misclassify. It is also possible to reduce the frame rate below 5 Hz and thus increase the data captured in a single frame, but doing so results in temporal ghosting of subjects moving at speed. When ghosting occurs, the RGB, point cloud, and RGB-D data fall out of alignment. However, this can be managed with careful selection of the instrument's frame rate.
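The frame-rate trade-off can be made concrete: a spinning LiDAR emits a roughly fixed number of points per second, so the points available in a single frame scale inversely with the frame rate. The 300 000 points-per-second figure below is an assumed, typical value, not this instrument's specification:

```python
# Points-per-frame budget for a spinning LiDAR. The point rate is an
# illustrative assumption; only the inverse scaling is the point here.
def points_per_frame(points_per_second, frame_rate_hz):
    return points_per_second / frame_rate_hz

dense = points_per_frame(300_000, 5)    # 5 Hz: 60,000 points per frame
sparse = points_per_frame(300_000, 20)  # 20 Hz: 15,000 points per frame
```

Raising the frame rate to reduce ghosting thus directly thins the per-frame cloud, which is the tension the text describes.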
We encountered other issues where the object detector failed to classify subjects at a distance from the camera. As noted in the results section, Fig. 13 Scenario 1 (a) showed a subject performing the task and a person sitting on the couch. In this scenario, the object detector missed the person on the couch but identified the subject of interest. This issue was attributed to the distance between the subject and the RGB sensor. It is believed that this issue can be overcome by using sensors that provide a high-resolution image patch for far-away objects.
In terms of the point cloud classifier, it is important to note that the primary objective of any supervised learning framework is to discover the pattern linking the inputs X to the outputs Y, given a dataset D of size N [59], [60]. In the simplest terms, each input is a vector whose dimensions represent the data we want the ML algorithm to learn from. Commonly referred to as features, this representation is the understanding of X that the algorithm develops during training. From the perspective of Y, the output can be anything, but the assumption is that it matches a categorical or nominal variable from the dataset used during training.
Generally, when using supervised learning methods to process context data, a CNN obtains superior results with respect to traditional feature learning algorithms. Of course, in scenarios where data are sparse, a CNN can be outperformed. For example, when a CNN is trained on one dataset and expected to classify unfamiliar data that it has not been trained on, it will fail. However, when a CNN has been trained to recognize a specific action, such as falling, the framework should outperform other forms of feature learning algorithms. Unfortunately, it is not always possible to acquire all the data needed to train a CNN to identify every specific action. To that end, when we were collecting data, we focused on indoor activities where humans had a limited attention span. Had we included additional activities in the dataset used to train the proposed framework, the results presented in Section IV suggest that they too would have been classified with a high degree of accuracy.
In terms of comparison to other point cloud classifiers and their suitability for 3-D feature learning, we looked at [62], [80], and [81]. The researchers in [81] converted the point cloud data to a voxel grid array. How well the CNN acts on a voxel grid depends on the resolution of the grid, coarse or fine. If applied to HAR, choosing a coarse grid leads to quantization and a substantial loss of information, while a fine grid increases the computational cost.
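The coarse-versus-fine trade-off can be demonstrated on a toy cloud (random points, not data from this work): a coarse grid collapses a thousand points into a few dozen occupied voxels, discarding their internal arrangement, while a fine grid multiplies the number of cells a 3-D CNN must process:

```python
# Illustration of voxel-grid quantization vs. computational cost,
# using a synthetic point cloud.
import numpy as np

def voxelize(points, voxel_size):
    """Return the set of occupied voxel indices for an Nx3 point cloud."""
    idx = np.floor(np.asarray(points) / voxel_size).astype(int)
    return {tuple(v) for v in idx}

rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 2.0, size=(1000, 3))  # 1000 points in a 2 m cube

coarse = voxelize(cloud, 0.5)    # at most (2/0.5)^3 = 64 cells
fine = voxelize(cloud, 0.05)     # up to (2/0.05)^3 = 64,000 cells
```

With the coarse grid, many points share a voxel and their relative pose, the very cue articulated-body recognition relies on, is lost; with the fine grid, the CNN input volume grows by three orders of magnitude.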
PointNet is another approach to classification that directly consumes unordered point clouds [62], [80]. Unordered, in this context, means that the model does not assume any spatial relationships between features. This is not the case for HAR, which assumes neighboring relations between the locations of different points. Consequently, PointNet performs poorly on HAR, regardless of how well it performs when classifying objects.
It should be noted that, although the primary component of the proposed framework is a point cloud classifier, point cloud classification is not the principal objective. The principal objective is to classify human activities robustly. With that in mind, we compared the proposed framework to a DNN framework for HAR [3], [46]. Although the DNN framework performed well, it has some intrinsic shortcomings. Primarily, the issues with the DNN framework concern the type of data processed. Sensors that capture RGB-D data are notoriously susceptible to light changes [82] and to interference from other sensors based upon structured light [83]. Consequently, the quality of light in the vision system's environment is a key factor in its success. This is especially the case for networks that use sensors like the Kinect. Networks that process RGB data are affected to a lesser degree.

VI. CONCLUSION
This article proposed a method of HAR using multimodal ML to process RGB, RGB-D, and LiDAR data. The algorithm consists of an object detector and a point cloud classification network. The object detector utilizes a Faster R-CNN to identify an ROI containing a person performing an activity. Translation and alignment of the data types allow for the efficient segmentation of the point cloud data. The corresponding 3-D ROI is converted to a modified Fisher vector representation before a CNN classifies the activity. Performing HAR this way removes the need for wearable sensors and works in both indoor and outdoor environments. The technical challenges of the proposed framework lie in the accurate alignment and translation of the different sensor modalities. A further technical challenge is the sensors' frequency of operation, which plays an integral role in perceiving the correct presence of subjects.
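The data flow summarized above can be sketched as follows. The stage functions are stand-ins (stubs) for the trained components described in the article (the Faster R-CNN detector, the calibration-based 3-D segmentation, and the Fisher vector plus CNN classifier); only the flow between stages is meant literally, and the returned box and label are placeholders:

```python
# High-level sketch of the pipeline's data flow; every stage body is a
# stub standing in for a trained component.
import numpy as np

def detect_people(rgb_image):
    """Stub for the Faster R-CNN detector: returns 2-D boxes (x, y, w, h)."""
    return [(100, 80, 60, 140)]            # one hypothetical detection

def segment_3d_roi(point_cloud, box_2d):
    """Stub for alignment-based segmentation: keep points projecting into the box."""
    mask = np.random.default_rng(0).random(len(point_cloud)) < 0.1
    return point_cloud[mask]

def classify_activity(roi_points):
    """Stub for the modified-Fisher-vector + CNN classifier."""
    return "carrying_box"                  # placeholder label

rgb = np.zeros((360, 360, 3), dtype=np.uint8)
cloud = np.random.default_rng(1).normal(size=(5000, 3))

activities = [classify_activity(segment_3d_roi(cloud, box))
              for box in detect_people(rgb)]
```

The key design choice this flow reflects is that detection happens in 2-D, where texture cues are strong, while classification happens in 3-D, where body articulation is preserved.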
The object detection portion of the proposed framework accurately identifies the presence of subjects in an image, with an average precision of 95% and a log-average miss rate of 0.1. The classification portion of the proposed framework classified the activities performed by the subjects with an accuracy of 90.3%. The precision and recall were 89.5% and 89.4%, respectively. Compared with an alternative point cloud classifier, the proposed framework significantly outperformed the next suitable alternative, PointNet, by a staggering 80.5 percentage points. Furthermore, it was found that the proposed framework outperformed a state-of-the-art HAR framework using a DNN to process RGB-D data by 6.9%.
To benchmark the proposed framework, we conducted a systematic evaluation using the SYSU-3D dataset. The proposed method outperformed all reported baseline methods, according to accuracy, by 8.5%. Since the skeletal data provided in the SYSU-3D dataset can be regarded as noisy, with large variations in viewpoint, the proposed framework brings a significant performance improvement. It should be noted that the SYSU-3D dataset is substantially larger than the LboroLdnHAR dataset, by approximately 100 000 samples. This difference in sample size can account for the varying performance of the proposed framework when trained on the different datasets.
The implication of this research is machines that can better recognize human activities, without the need for wearable sensors, in varying environmental conditions and from various viewpoints. Indirectly, this research will allow machines to recognize when a person needs help, AVs to derive data-driven decision-making policies, and holds the potential to advance human-computer interaction to a new level of understanding.