Joint Selection using Deep Reinforcement Learning for Skeleton-based Activity Recognition

Skeleton-based human activity recognition has attracted significant attention due to its wide range of applications. Skeleton data consists of the two- or three-dimensional coordinates of body joints. Not all of the joints are effective in recognizing an activity. In this paper, we propose a novel framework for identifying the relevant joints per activity and using them for the purpose of activity recognition. We propose to formulate the joint selection problem as a Markov Decision Process (MDP) and employ deep reinforcement learning to find the most informative joints per frame. The proposed joint selection method is a general framework that can be employed to improve existing human activity classification methods. Experimental results on two benchmark activity recognition datasets using three different classifiers demonstrate the effectiveness of the proposed joint selection method.


I. INTRODUCTION
Activity recognition is a challenging, yet very useful task in the field of computer vision. Its applications range from monitoring indoor and outdoor activities to human-robot interaction [1], [2]. With the prevalence of depth cameras such as the Microsoft Kinect, and the improvement of human pose estimation methods, skeleton data is easily accessible; therefore, skeleton-based activity recognition has become very popular [3], [4]. Skeleton data, which contains the two-dimensional (2D) or three-dimensional (3D) coordinates of the human body joints, is more beneficial than RGB data since it is robust to variations in environment lighting, background clutter, viewpoint and body scale.
When capturing skeleton data, often only the key body joints are considered; however, for a given activity, not all joints are equally important. Consider the two activities kick and throw as examples. For the activity kick, the lower body joints are important, while in the activity throw, the upper body joints play a more important role. Besides that, even within a single activity, the key joints may differ across temporal frames.
In this paper we propose a novel framework for selecting the key informative joints in video frames for the purpose of human activity recognition. The process of selecting key joints can also be considered as a hard spatial attention mechanism that generates frame descriptions for activity classification. The proposed framework, for the first time, formulates the joint selection problem as a Markov Decision Process (MDP) [5] and employs deep reinforcement learning (DRL) to find the optimal solution. Throughout this paper, we refer to the proposed DRL-based joint selection method as JSDRL. In JSDRL, each video frame is associated with its own distinct optimal joint set, which may vary both in membership and size across the video. This allows the joint set to optimally adapt to temporal variations. JSDRL is a general framework that can be employed to improve the recognition performance of human activity classification methods (e.g., decoupling GCN with DropGraph (DCGCN) [6], convolutional neural network (CNN) and long short-term memory (LSTM) based classifiers), as it only passes the relevant, informative joints to the classifiers. JSDRL also reduces the computational complexity of training a classifier, as it drops the irrelevant joints. In reinforcement learning (RL), an agent learns the best policy by interacting with the environment and receiving rewards or punishments. RL is an effective search tool when the proper search steps are unknown. In the joint selection scenario, the ground truth for the key joints is not available, i.e. there is no supervision indicating which joints are important. Therefore, it is unclear how to effectively explore spatial information over frames to choose which joints to use. As such, RL is a highly suitable tool for joint selection.
The rest of the paper is organized as follows: In Section II, works related to our method are reviewed. Section III explains the proposed method in detail. Section IV presents our experiments. The conclusion is drawn in Section V.

II. RELATED WORK

A. Activity recognition with skeleton data
There has been a lot of research on activity recognition with skeleton data, some of which focuses on extracting hand-crafted features [19], [9], [10], [34], [35], [36], [8], [30]. In [8], the three-dimensional relationships between body parts are modeled by translations and rotations, and the classification is then performed in the Lie algebra using the obtained representation. In [30], Weng et al. partitioned the action sequences into temporal windows and used them as video descriptors. Employing these descriptors and an extended version of the Naive Bayes Nearest Neighbor algorithm, they then performed activity recognition.
The great performance of deep learning based techniques in image understanding has encouraged researchers to employ deep learning for activity recognition. Such algorithms can be categorized into three groups: methods based on recurrent neural networks (RNN), convolutional neural networks (CNN), and graph-based networks.
The effectiveness of RNNs in modeling sequential data has made them a good choice for video analysis [13], [11], [20]. A two-stream RNN-based model is presented in [11], where both temporal dynamics and spatial information are captured by the two-stream network. Shahroudy et al. presented a part-aware LSTM method where each body part is considered separately [20]. In [40], body joints are first grouped into five parts, and each body part is then fed into an individual subnetwork. The outputs of the subnetworks are fused hierarchically into a single output used for recognition. Liu et al. extended LSTM to both the spatial and temporal domains and proposed a trust gate for dealing with noise [31]. An LSTM-based method is presented in [12] that learns soft spatial and temporal attention over skeleton data.
In CNN-based models, to fulfill the need for image-like input, the 3D coordinates of the joints are usually arranged as a pseudo-image [13]. Li et al. combined the position and velocity information of joints and used a two-stream CNN architecture for activity recognition [14]. In [38], a transformation is first applied to the data, and the transformed data is then fed to a CNN for robust feature extraction. Other CNN-based activity recognition methods for skeleton data can be found in [15] and [16].
The human body can intrinsically be modeled as a graph, with joints as nodes and bones as edges. Several graph-based methods have been proposed and achieved successful results in the skeleton-based activity recognition field. A spatial-temporal graph convolutional network is proposed in [17], which consists of several spatial-temporal graph convolutions to extract body skeleton features. Inspired by [17], other graph-based methods have been presented, such as [6], [7] and [18]. Yan et al. suggested a graph CNN which learns both spatial and temporal representations to improve the recognition performance and generalization ability [37].
All the deep learning based methods discussed above focus on developing networks that capture skeleton data features for the purpose of activity recognition. None of them focus on finding the most informative joints and discarding the irrelevant ones prior to recognition. This paper presents a novel technique for identifying the informative joints across the frames of a video and employing them for recognition.

B. Reinforcement learning in activity recognition
Inspired by the way humans learn to behave optimally in different environments, reinforcement learning algorithms try to learn how to achieve a complex goal through interaction with the environment and receiving rewards or punishments. The reward is designed based on the final goal(s) of the agent, and the agent's objective is to maximize the received reward. Several studies in the field of computer vision use reinforcement learning; for example, in [24], [26], and [25], RL is used for image recognition, visual tracking and face recognition, respectively. However, there are few studies on activity recognition, especially for skeleton-based data. In [22], multi-agent reinforcement learning is used to select key frames in videos, where each agent is responsible for selecting one frame. As a result, the number of selected frames is fixed, i.e. equal to the number of agents. Dong et al. proposed an RL-based method which finds the most relevant frames using an LSTM agent [23]. The methods presented in [22] and [23] are both designed for RGB data and are not applicable to skeleton data. In [27], the authors proposed an RL-based technique called deep progressive reinforcement learning (DPRL) to select key frames in skeleton videos. This method uses a graph representation of the data, and a graph CNN is used for reward generation. To the best of our knowledge, the method presented in [27] is the only study on skeleton-based activity recognition employing deep reinforcement learning. However, its focus is on frame selection, not joint selection.

III. PROPOSED METHOD
The proposed JSDRL method models the joint selection problem as an MDP and solves it with the well-known on-policy reinforcement learning algorithm, Monte Carlo policy gradient (i.e. REINFORCE) [28]. In a typical RL setting, an agent observes the current state of the environment, takes an action that changes the state, and receives a reward based on that action.
In this paper, we define the k-th step of our RL episode as T_k = (S_k, A_k, R_k), where S_k, A_k and R_k are respectively the state, action and reward at the k-th step; the full episode of the proposed RL system can be shown as {T_k}_{k=1}^K. At each step of the episode, the agent goes over all T frames of a given video. The Agent, State, Action, and Reward in the proposed joint selection framework are defined as follows:

Agent: The human skeleton can be considered as an ordered sequence of J joints. In this study, we propose to employ a Bidirectional LSTM (BiLSTM) network followed by a fully connected (FC) network as the agent. At frame t of T_k, the BiLSTM network takes the state S_k^t (where S_k = {S_k^t}_{t=1}^T) as input and then feeds its hidden layer, {h_j}_{j=1}^J, to the FC network. The agent outputs the vector {p_j^t}_{j=1}^J that is used to define the next action.
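As an illustration, the BiLSTM-plus-FC agent described above can be sketched in PyTorch as follows; the class name, layer sizes, and the per-joint feature dimension of six (3D coordinates plus 3D motion, matching the state definition) are our own assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """BiLSTM agent followed by an FC layer that outputs, per frame,
    the probability p_j^t of flipping each joint's selection bit."""

    def __init__(self, num_joints=25, feat_dim=6, hidden=128):
        super().__init__()
        # each joint contributes 3D coordinates + 3D motion = 6 features
        self.bilstm = nn.LSTM(num_joints * feat_dim, hidden, num_layers=3,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_joints)  # 2x for bidirectional

    def forward(self, states):
        # states: (batch, T, J * feat_dim), one flattened state per frame
        h, _ = self.bilstm(states)
        # sigmoid keeps the flip probabilities in [0, 1]
        return torch.sigmoid(self.fc(h))  # (batch, T, J)
```

The sigmoid output feeds directly into the per-joint Bernoulli action sampling.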
State: In skeleton-based human activity recognition, it has been shown that both joint locations and joint motions are informative components. Hence, we define the agent's state at frame t of T_k as S_k^t = {s_j^t}_{j=1}^J, where s_j^t = [s_{j,c}^t, s_{j,m}^t], s_{j,c}^t is the 3D coordinates of the j-th joint, and s_{j,m}^t is the j-th joint's 3D motion vector, i.e. s_{j,m}^t = s_{j,c}^t − s_{j,c}^{t−1}.
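The state construction above can be sketched as follows; the function name and the zero-motion convention for the first frame (which has no predecessor) are our assumptions:

```python
import numpy as np

def frame_states(coords):
    """Build per-frame states s_j^t = [s_{j,c}^t, s_{j,m}^t] from a
    (T, J, 3) array of 3D joint coordinates."""
    motion = np.zeros_like(coords)
    # s_{j,m}^t = s_{j,c}^t - s_{j,c}^{t-1}; frame 0 is assumed motionless
    motion[1:] = coords[1:] - coords[:-1]
    # concatenate coordinates and motion per joint: (T, J, 6)
    return np.concatenate([coords, motion], axis=-1)
```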
Action: We define a binary selection vector f_k^t = {f_{k,j}^t}_{j=1}^J showing the joints that are selected at frame t of T_k. If the j-th element of f_k^t is 1, i.e. f_{k,j}^t = 1, the j-th joint will be selected for frame t; otherwise it will not. We initialize the elements of F_k, k = 1, ..., K, with 1. The action taken at frame t of T_k, i.e. a_k^t, is a J-dimensional vector showing the adjustment needed to be applied to f_{k−1}^t to obtain f_k^t. We define two types of actions: 0 and 1, where 0 means no change is needed and 1 means flip the corresponding selection bit. The output of the FC network of the agent at the t-th frame, {p_j^t}_{j=1}^J, indicates the probability of changing the elements of f_{k−1}^t. The J elements of the action vector at frame t of T_k, a_k^t, are sampled from Bernoulli distributions as follows:

    a_{k,j}^t ~ Bernoulli(p_j^t),  j = 1, ..., J,    (1)

where a_{k,j}^t = 1 indicates flipping the j-th element of f_{k−1}^t to obtain the j-th element of f_k^t, i.e. if the j-th joint is selected (removed) in the previous step, it will be removed (selected) in the next step, and a_{k,j}^t = 0 means no change is needed. In this way, we allow removed joints to be selected again in the next step if they were erroneously removed from the selected joint set. This changing process can be written as

    f_k^t = f_{k−1}^t ⊕ a_k^t,    (2)

where ⊕ denotes the element-wise XOR operation. The total action set corresponding to the k-th step is A_k = {a_k^t}_{t=1}^T.

Reward: The reward reflects how good the action taken by the agent is with regard to the state. We generate the reward with a pre-trained classifier which takes the T frames with the selected joints as input, where the joints are selected by the agent. If the class label predicted by the classifier turns from the correct label to a wrong one, a strong punishment −Ω is enforced, and a strong reward of Ω is enforced if the turning goes the other way. Further, if the predicted class label does not change, but the confidence of the classifier towards predicting the correct class changes, a reward r_0 is given, which is defined as

    r_0 = P_k^l − P_{k−1}^l,    (3)

where P_k^l is the probability of correctly classifying the video as class l in T_k.
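The Bernoulli sampling and bit-flipping update described above can be sketched as follows; the function name and the fixed random seed (used only to make the sketch reproducible) are illustrative assumptions:

```python
import numpy as np

def step_selection(f_prev, p):
    """One action step: sample a ~ Bernoulli(p) element-wise and flip
    the corresponding bits of the previous selection vector f_{k-1}^t."""
    rng = np.random.default_rng(0)
    a = (rng.random(p.shape) < p).astype(int)  # a_j = 1 means "flip joint j"
    f_new = np.bitwise_xor(f_prev, a)          # flip where a = 1, keep where a = 0
    return a, f_new
```

Because the update is an XOR, a joint removed in one step can be re-selected in a later step, as described above.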
The reward at T_k, i.e. R_k, can be written as

    R_k = {  Ω,    if the predicted label turns from wrong to correct,
            −Ω,    if the predicted label turns from correct to wrong,
            r_0,   otherwise.    (4)

The goal of the agent is to learn a policy function by maximizing the expected reward

    R(θ) = E_{p_θ(a_{k,1:J}^{1:T})}[R_k],    (5)

where p_θ(a_{k,1:J}^{1:T}) is the probability distribution of the possible actions over the frames. In policy gradient algorithms, the policy is usually modeled with a function parameterized by θ, and in REINFORCE, which is a well-known policy gradient method [28], the gradient of the expected reward R(θ) w.r.t. the parameters θ is calculated as

    ∇_θ R(θ) = E_{p_θ}[ Σ_{t=1}^{T} Σ_{j=1}^{J} ∇_θ log π_θ(a_{k,j}^t | s_{k,j}^t) R_k ],    (6)

where π_θ is the policy function, a_{k,j}^t is the action taken by the agent at T_k for joint j in frame t, and s_{k,j}^t is the corresponding state.
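The piecewise reward above can be sketched as follows; the function signature is an illustrative assumption, while the default Ω = 10 follows the value reported in the implementation details:

```python
def step_reward(prev_correct, now_correct, p_now, p_prev, omega=10.0):
    """Piecewise reward R_k for one episode step.

    prev_correct / now_correct: whether the classifier predicted the
    correct label before / after the new joint selection.
    p_now / p_prev: classifier confidence P_k^l / P_{k-1}^l in the
    correct class l.
    """
    if prev_correct and not now_correct:
        return -omega            # strong punishment: correct -> wrong
    if now_correct and not prev_correct:
        return omega             # strong reward: wrong -> correct
    return p_now - p_prev        # r_0: confidence change, label unchanged
```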
To simplify Eq. (5), instead of taking the expectation over the action sequence, and since we obtain the reward only after observing all T frames, we approximate the gradient by averaging the gradients over the T frames and K steps:

    ∇_θ R(θ) ≈ (1 / (K T)) Σ_{k=1}^{K} Σ_{t=1}^{T} Σ_{j=1}^{J} ∇_θ log π_θ(a_{k,j}^t | s_{k,j}^t) R_k,    (7)

where R_k is the reward computed at the k-th step of the episode.
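The averaged gradient estimate can be implemented as a surrogate loss whose gradient matches the approximation above (minimizing the loss ascends the expected reward); the function name and array layout are our assumptions:

```python
import numpy as np

def reinforce_loss(log_probs, rewards):
    """Surrogate REINFORCE loss: -(mean of log pi(a|s) * R_k) over the
    (K, T, J) log-probabilities of the sampled actions and the (K,)
    step rewards R_k, treated as constants. Averaging over J as well
    only rescales the gradient by 1/J and does not change the optimum."""
    return -np.mean(log_probs * rewards[:, None, None])
```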
To reduce the variance and help the convergence of the algorithm, a constant baseline b, the average of the step rewards, is subtracted from the reward:

    ∇_θ R(θ) ≈ (1 / (K T)) Σ_{k=1}^{K} Σ_{t=1}^{T} Σ_{j=1}^{J} ∇_θ log π_θ(a_{k,j}^t | s_{k,j}^t) (R_k − b).    (8)

To make sure the agent selects at least one joint and does not select more than N joints, we propose to add two other terms to the loss function along with the REINFORCE loss:

    L = L_RF + α L_min + β L_max,    (9)

where L_RF is the REINFORCE loss corresponding to Eq. (8); L_min and L_max, computed from p̄, the average probability vector of the actions over the T frames and K steps, penalize selecting no joints and selecting more than N joints, respectively; N is the maximum number of selected joints; and α and β are two hyper-parameters that control the effect of their corresponding terms.

A block diagram of the proposed framework is shown in Fig. 1, and the pseudo code of the proposed JSDRL method is given in Algorithm 1. In summary, first the classifier is pre-trained on the original training data. Then, a video sequence is given to the agent's network (aka the policy network), an episode is completed, and the network is updated. This process is repeated for all epochs, where the classifier is re-trained every G epochs.

Algorithm 1: JSDRL
1: pre-train the classifier on the original training data
2: Count = 0
3: for epochs do
4:   for videos do
5:     Count += 1
6:     for K steps of episode do
7:       run the policy network
8:       find the action using Eq. (1)
9:       take the action and update the state (select joints)
10:      compute reward using Eq. (3) and Eq. (4)
11:    end for
12:    compute the average reward
13:    compute the loss (Eq. (9))
14:    update the policy network parameters
15:    if Count ≤ G then
16:      retrain the classifier
17:    end if
18:  end for
19: end for
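The baseline subtraction used to stabilize the gradient estimate can be sketched with a hypothetical helper (not the authors' code):

```python
def baseline_advantage(rewards):
    """Subtract the constant baseline b, the average of the K step
    rewards, from each reward to reduce gradient variance. The
    resulting advantages R_k - b always sum to zero."""
    b = sum(rewards) / len(rewards)
    return [r - b for r in rewards]
```

These centered rewards would replace R_k in the surrogate loss before updating the policy network.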

IV. EXPERIMENTS
To evaluate the performance of the proposed JSDRL method, we conducted experiments on two benchmark activity recognition datasets. To demonstrate the effectiveness of joint selection in activity recognition, we report recognition results with and without joint selection using three classifiers: a CNN-based, a BiLSTM-based and a graph-based classifier.

A. Data sets
NTU RGB+D Dataset (NTU) [20]: NTU is currently the largest activity recognition dataset, with 56,880 sequences and 4 million frames. The video samples belong to 60 classes, and there are two settings for train/test partitioning: Cross-Subject (CS) and Cross-View (CV). In the CS setting, the samples of 20 subjects are used for training and the remaining ones for testing. In the CV setting, the samples of camera views 2 and 3 are selected as the training set, and the samples captured by camera 1 are used as the test set. The number of skeleton joints captured for this dataset is 25, and there are either one or two subjects in each video.
UT-Kinect Dataset (UT) [29]: UT includes 200 sequences belonging to 10 classes. Each activity is performed twice by each of 10 subjects, and there are no interactive activities in the data, meaning there is only one subject in each video sample. There are 20 joints in each frame, and the leave-one-out cross-validation protocol is used to evaluate the proposed method on this dataset.

B. Implementation Details
We use a BiLSTM with 3 layers as the agent's network (i.e. the policy network), and the optimizer is Adam with an initial learning rate of 1e-4. The number of epochs and the values of K, Ω, α, and β are set to 20, 5, 10, 0.1 and 0.1, respectively. We divide the number of video samples by 5 and use that as the value of G. The value of N is set to half the number of available joints. The proposed method was implemented in PyTorch.
The effectiveness of the proposed JSDRL method is demonstrated using three different classifiers: the two basic classifiers, BiLSTM and CNN, and a state-of-the-art graph-based classifier specifically designed for skeleton-based human activity recognition, the decoupling graph convolutional network with DropGraph module (DCGCN) [6]. The DCGCN parameters are set to the default values suggested in the original paper. The BiLSTM classifier has 3 layers with a hidden layer size of 256 and is trained using the Adam optimizer. The CNN classifier has 2 convolutional layers followed by one fully connected layer, and its optimizer is also Adam.

C. Recognition Performance
The classification accuracies with and without joint selection, i.e. with and without applying JSDRL, on the two datasets are reported in Table I, where the best performance is shown in bold.
As can be seen, the proposed JSDRL method improves the classifiers' performance on both datasets. The average performance of each method over the three sets CS, CV and UT is shown in the last column. The average values confirm the improved performance of the proposed method compared to the without-joint-selection cases. This is achieved while, on average, almost 60% of the joints are eliminated, leading to a reduction in classification cost in both the training and testing phases.
In Tables II and III, the performance of JSDRL (with the DCGCN classifier) is compared with several state-of-the-art activity recognition methods. Table II shows the superior performance of the proposed method over its eight competitors on both the CS and CV settings of the NTU dataset. Table III shows that the proposed method outperforms ten state-of-the-art skeleton-based activity recognition classifiers on the UT dataset.
To visualize the performance of the proposed method, the selected joints for the two activities kick and phone call are depicted in Fig. 2. The intensity of the red color at each joint indicates how frequently that joint is selected over the video frames; e.g. in the activity phone call, the hand, thumb and fingertip joints are correctly selected in all frames, while the irrelevant foot and head joints are not selected in any frame. This figure demonstrates the effectiveness of the JSDRL method in selecting relevant joints.

D. Sensitivity to hyperparameter N
To investigate the sensitivity of the JSDRL method to the hyperparameter N, introduced in Eq. (9), we apply JSDRL+BiLSTM to the UT dataset for different N values, i.e. N ∈ {3, 6, 10, 12, 15, 20}. Note that J is equal to 20 in the UT dataset. The role of N in the loss function is to set an upper bound on the number of selected joints. The accuracy of activity recognition versus N is shown in Fig. 3. The figure shows that JSDRL retains high accuracy over a wide range of N values, demonstrating that the method is not overly sensitive to N, which is a desirable behavior.
V. CONCLUSION
In this paper, we proposed JSDRL, a novel joint selection framework that formulates the joint selection problem as a Markov Decision Process and finds the most informative joints in each frame of skeleton data using the popular policy gradient algorithm, REINFORCE. In JSDRL, each video frame is associated with its own distinct optimal joint set, which may vary both in membership and size across the video. This allows the joint set to optimally adapt to temporal variations. Employing reinforcement learning in the JSDRL method makes it possible to find the relevant joints, per frame, without requiring any extra labels. JSDRL can be used as a filtering block that identifies and filters out irrelevant joints prior to any sophisticated activity classification algorithm. This enhances classifier performance and reduces training time. We evaluated the JSDRL method on two benchmark skeleton-based activity recognition datasets employing three different classifiers. The experimental results demonstrated the effectiveness of JSDRL. Furthermore, the proposed JSDRL method outperforms several state-of-the-art skeleton-based activity recognition methods in terms of recognition accuracy.