Conditional Link Prediction of Category-Implicit Keypoint Detection

Keypoints of objects reflect their concise abstractions, while the corresponding connection links (CLs) build the skeleton by capturing the intrinsic relations between keypoints. Existing approaches are typically computationally intensive, inapplicable to instances belonging to multiple classes, and/or unable to simultaneously encode connection information. To address these issues, we propose an end-to-end category-implicit Keypoint and Link Prediction Network (KLPNet), the first approach to perform simultaneous semantic keypoint detection (for multi-class instances) and CL rejuvenation. In our KLPNet, a novel Conditional Link Prediction Graph is proposed to predict links among keypoints contingent on a predefined category. Furthermore, a Cross-stage Keypoint Localization Module (CKLM) is introduced to explore feature aggregation for coarse-to-fine keypoint localization. Comprehensive experiments conducted on three publicly available benchmarks demonstrate that our KLPNet consistently outperforms state-of-the-art approaches. Moreover, the experimental results on CL prediction show the effectiveness of our KLPNet under occlusion.


Introduction
Accurate semantic keypoint localization and detection is a basic prerequisite for numerous computer vision applications, including simultaneous localization and mapping [30], human pose estimation [17], hand key-joint estimation [25], etc. A connection link provides an additional semantic relation between each pair of keypoints and can be used for many semantic-level tasks. Nevertheless, prevailing keypoint detection methods, e.g. [21,31], mainly focus on multi-person human pose estimation, which aims at recognizing and localizing anatomical keypoints (or body joints) and human skeleton connections. On the other hand, existing object (or rigid-body) keypoint localization approaches, e.g. [29,41,32,28], consistently fail to explore CLs among keypoints. Few studies have tackled the problem of simultaneously inferring keypoints and encoding their semantic connection information for multi-class instances.

Figure 1: (a) KeypointNet [32]. (b) StarMap [42]. (c) Our proposed Keypoint and Link Prediction Network (KLPNet), a category-implicit approach. KLPNet is the first framework capable of finding connection links for multi-class object keypoints. KLPNet adopts a conditional embedding graph to implement link prediction based on features extracted from 2D images.

Inspired by several prevalent bottom-up approaches in the field of multi-person pose estimation, such as [4,15,26], which directly localize all keypoints from multiple instances and group them into persons to find skeleton connections, we ask what happens if a similar mechanism is applied to multi-class object keypoint detection and CL prediction. However, the previous methods cannot simply be grafted onto this task. The conventional approaches stack each heatmap as a particular class of keypoint for single-type pose estimation, especially for human pose estimation, whose nodes belong to only one category: person. When we move to multi-class rigid bodies, the conventional approaches become inefficient and costly, even infeasible, since each category contains numerous classes of keypoints. Consequently, two key factors should be addressed: (1) how to deal with multi-class instances and (2) how to encode the semantic keypoints and their connection links. Since geometric contextual relations are the key to identifying the keypoint-instance affiliation, these relations naturally form a graph consisting of nodes (keypoints) and edges (relations between keypoints). Besides, crucial semantic information, namely the category of the instance that a given keypoint belongs to, should also be considered for CL prediction on multi-class objects. We can then discover geometrically and semantically consistent keypoints across instances of different categories.
There are two mainstream approaches in the instance keypoint detection field: category-specific detection and category-agnostic detection. Category-specific methods, such as KeypointNet [32] in Fig. 1(a), typically treat each keypoint as an independent category with respect to a given classified target, which is extremely ineffective, costly, and practically inapplicable for building a graph for keypoint and CL prediction on objects with a varying number of parts. Among category-agnostic methods, such as SVK [9] and StarMap [42], StarMap in Fig. 1(b) introduces a hybrid representation of all keypoints, compresses those belonging to one target into a single heatmap as the same class, and then lifts them into 3D space to adjust their 2D locations. However, StarMap is also costly due to the massive 3D geometry information it requires, such as depth maps or multi-view consistency. Besides, the connection information is lost due to the lack of semantic properties. We therefore seek a novel, economical, yet powerful approach that works directly on 2D images.
To this end, we propose a category-implicit method, the Keypoint and Link Prediction Network (KLPNet), shown in Fig. 2, comprising a Deep Path Aggregation Detector (DPAD), a Cross-stage Keypoint Localization Module (CKLM), and a Conditional Link Prediction Graph Module (CLPGM). For the first time, we implement semantic keypoint detection on rigid bodies without converting 2D information to 3D space, by virtue of a conditional graph neural network. CLPGM recovers the links directly from the single heatmap and the implicit features of each target extracted by DPAD. In CKLM, a cross-stage feature aggregation scheme is proposed to resolve the ambiguous locations of the category-implicit keypoints on the single heatmap. Specifically, a Location Instability Strategy (LIS) is utilized in CLPGM to disentangle occlusion cases and feed the defective keypoint localizations back to the preceding module, CKLM.
The main contributions of this paper can be summarized as follows: (1) To the best of our knowledge, we present the first category-implicit, graph-embedding approach, KLPNet, to effectively infer keypoints for instances of multiple categories with a flexible number of semantic labels, and to further predict their conditional connection links. (2) We propose a keypoint data representation free of redundant 3D geometry information, whose locations can be adjusted through cross-stage feature aggregation in a coarse-to-fine manner. (3) A novel link prediction module, CLPGM, can enhance the node links based on the single heatmap and the extracted implicit features of each target, providing the geometric and semantic information for link recovery. An innovative strategy in CLPGM, LIS, is capable of disentangling cases with occlusions. (4) We explore a deep path aggregation detector to localize the targeted instances precisely.

Keypoint Estimation and Geometric Reasoning
The prominent line of keypoint detection on rigid bodies concentrates on feature extraction in a two-stage pipeline: first localize each object in the image, then solve a single-object pose estimation problem on the cropped target. Stacked hourglass [24] stacks hourglass modules, which down-sample and up-sample features with residual connections, to enhance pose estimation performance. Building on the stacked hourglass network, the Cascade Pyramid Network [5] addresses pose estimation with two sub-networks: GlobalNet and RefineNet. GlobalNet locates each keypoint on a heatmap that is easier to detect, and RefineNet explicitly addresses the 'hard' keypoints that require more context and processing than the nearby appearance features provide. The multi-stage pose estimation network (MSPN) [17] extends GlobalNet to multiple stages, aggregating features across stages to strengthen the information flow and ease training.
The conventional approaches stack each heatmap as a special class of keypoint for a single object type, such as the left-top, right-top, left-bottom, and right-bottom corners. With such well-defined semantic keypoints, moving to keypoint detection on multi-class objects (bus, chair, ship, etc.) makes it ineffective and costly to train N × C classes of keypoints, where N represents the total number of keypoints of each category and C is the number of categories. In addition, the value of N varies across categories. In terms of merging keypoints from multiple targets, consistent correspondences would have to be established between different keypoints across multiple target categories, which is difficult or sometimes impossible. Besides, category-specific keypoint encoding fails to capture both the intra-category part variations and the inter-category part similarities.
To solve the above-mentioned issues, category-agnostic approaches project the keypoints that belong to the same target to the same category on one heatmap, and then provide additional information to convert the 2D image to 3D space for pose estimation. StarMap [42] mixes all types of keypoints in a single heatmap for general keypoint detection. KeypointNet [32] uses a relative pose estimation loss to penalize the angular difference between the ground-truth rotation and the predicted rotation via orthogonal Procrustes analysis. Both approaches first convert 2D to 3D space and then adjust the keypoint locations with different predefined 3D models. Our approach generates one type of such general implicit keypoints with more explicit geometric properties and a top-class label. Besides, the geometric adjustment operates over multiple stages in 2D space, increasing cost-efficiency. We therefore pursue a novel approach that skips 3D estimation and localizes the keypoints more accurately.

Figure 2: Illustration of the proposed KLPNet, which is composed of DPAD, CKLM (detailed in Fig. 3), and CLPGM. After objects are allocated by DPAD, category-implicit nodes that belong to the same target are classified as one class on the single heatmap by cross-stage feature aggregation (CSFA, detailed in Fig. 4). F and H are down-sampling and up-sampling feature maps, with subscripts denoting their sizes. CLPGM works on the nodes with extracted features and corresponding labels to rejuvenate the node links. The Location Instability Strategy (LIS, detailed in Fig. 5) on the inferred nodes helps tackle occlusion issues and provides CKLM with feedback to rectify keypoint localization.

Graph Link Prediction
There is growing interest in Graph Neural Networks (GNNs) because of their flexible handling of body-joint relations. [14] introduced the variational graph autoencoder (VGAE) for unsupervised learning on graph-structured data. [16] proposed the Relational Variational Graph AutoEncoder (RVGAE) to predict concept relations within a graph consisting of concept and resource nodes. We propose a new graph structure, CLPGM, to predict the connections among keypoints with different object labels.

Deep Path Aggregation Detector
Inspired by PANet [20], we propose DPAD, a Deep Path Aggregation Detector that enhances the localization capability of the entire feature hierarchy by propagating strong responses of low-level patterns. We refer the reader to the supplementary material for its architecture. ResNeXt [39] is used as the backbone to generate feature maps at different levels, C3 ∼ C7. In addition to these feature maps generated by the FPN [18], two higher-level feature maps, C8 and C9, are created by down-sampling from C7. The augmented path starts from the lowest level and gradually approaches the top. From C3 to C9, each feature map is down-sampled by a factor of 2. {N3 ∼ N9} denote the newly generated feature maps corresponding to {C3 ∼ C9}. Each building block takes a higher-resolution feature map Ni and a coarser map Ci+1 through a lateral connection and generates the new feature map Ni+1. Furthermore, we adopt CIoU [40] to penalize the union area over the circumscribed rectangle's area in the IoU loss. CIoU improves the speed-accuracy trade-off for bounding box (BBox) regression, and suppresses redundant BBoxes to increase the robustness of the detector to occlusions.
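As a rough illustration, the bottom-up augmentation can be sketched with toy numpy arrays; the downsampling, lateral connection, and element-wise fusion below are simple stand-ins for the actual stride-2 and 1 × 1 convolutions, and all shapes and values are hypothetical:

```python
import numpy as np

def downsample(x):
    """Halve the spatial resolution (stand-in for a stride-2 conv)."""
    return x[:, ::2, ::2]

def lateral(x):
    """Stand-in for the 1x1 lateral convolution (identity here)."""
    return x

def bottom_up_path(C):
    """Build N3..N9 from C3..C9: each N_{i+1} fuses a downsampled N_i
    with the lateral map of C_{i+1}. C is a dict {level: (channels, H, W)}."""
    levels = sorted(C)
    N = {levels[0]: C[levels[0]]}                     # path starts at the lowest level
    for lo, hi in zip(levels, levels[1:]):
        N[hi] = downsample(N[lo]) + lateral(C[hi])    # element-wise fusion sketch
    return N

# Toy pyramid: C3 is 64x64 and each level halves the spatial size.
C = {l: np.ones((8, 64 >> (l - 3), 64 >> (l - 3))) for l in range(3, 10)}
N = bottom_up_path(C)
```

Element-wise addition replaces the concatenation used in the paper purely to keep the toy shapes aligned; the point is the direction of information flow, from fine to coarse.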

Cross-stage Keypoint Localization
After object targeting, CKLM generates detailed localizations of all category-implicit keypoints for each classified candidate.

Single-stage Mechanism
The backbone of the single-stage mechanism is ResNeXt-101 [39]. Shallow features have a high spatial resolution for localization but low semantic content for recognition. Deep feature layers, on the other hand, carry rich semantic information but lower spatial resolution due to the convolution and pooling layers. As shown in Fig. 3, both the spatial resolution and the semantic features from distinct layers are integrated to avoid information loss.

Figure 3: An overview of CKLM, which consists of three stages with a coarse-to-fine surveillance strategy. A cross-stage aggregation scheme is adopted between adjacent stages (detailed in Fig. 4). The coarse-to-fine surveillance strategy uses distinct Gaussian kernel sizes to boost keypoint localization performance, as demonstrated on the heatmaps. W, H, and C denote the width, height, and channel size of the features, respectively.
Since we want to allocate all keypoints from the same target to a single heatmap, they are merged into one class. This single-channel heatmap encodes the image locations of the underlying points, motivated by the use of one heatmap to encode occurrences of one keypoint across multiple persons. The keypoint heatmaps are combined with the confidence maps H and offset maps {O_x, O_y}. We adopt the binary cross-entropy loss to learn the confidence maps for each targeted category and the Smooth L1 loss to update the offset maps:
L_kd = Σ_c Σ_k (Θ · δ(H_ck, H*_ck) + Υ · ρ(O_xy, O*_xy)),

where L_kd is the keypoint detection loss, δ is the binary cross-entropy loss, ρ is the Smooth L1 loss, Θ and Υ indicate the corresponding weights, c indexes the targeted categories, and k indexes the keypoints under each targeted category. H* and O*_xy are the ground truth.
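A minimal numpy sketch of this composite loss, with the BCE term on the confidence map and the Smooth L1 term on the offset maps; the weight names theta/upsilon mirror Θ and Υ, and the toy maps below are hypothetical rather than real training data:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy, averaged over the heatmap."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def smooth_l1(pred, target):
    """Smooth L1: quadratic below 1, linear above."""
    d = np.abs(pred - target)
    return float(np.mean(np.where(d < 1, 0.5 * d ** 2, d - 0.5)))

def keypoint_loss(H, H_gt, O, O_gt, theta=1.0, upsilon=1.0):
    """L_kd sketch = theta * BCE(confidence) + upsilon * SmoothL1(offsets)."""
    return theta * bce(H, H_gt) + upsilon * smooth_l1(O, O_gt)

# Toy single-keypoint target at (8, 8) on a 16x16 confidence map.
H_gt = np.zeros((16, 16)); H_gt[8, 8] = 1.0
H = np.full((16, 16), 0.05); H[8, 8] = 0.9
O = np.zeros((2, 16, 16)); O_gt = np.zeros((2, 16, 16))
loss = keypoint_loss(H, H_gt, O, O_gt)
```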

Cross-stage Feature Aggregation among Multiple Stages
Feature aggregation can be regarded as an extended residual design in the single-stage mechanism, which helps alleviate the vanishing-gradient problem. After the first single-stage module, the single heatmap contains the most probable keypoint locations. However, the heatmap from the first stage is a coarse prediction with abundant noise, even though adequate features have already been extracted in that stage. Even small localization errors significantly affect keypoint detection performance. To filter the noise, another two stages are cascaded with refined surveillance, as shown in Fig. 3. Since a Gaussian kernel is used to generate the ground-truth heatmap for each keypoint, we use distinct kernel sizes, 7, 5, and 3, in the three stages. This strategy is based on the observation that the heatmaps estimated across the stages follow a similar coarse-to-fine progression. Fig. 4 shows the cross-stage feature aggregation scheme among multiple stages. f_l represents the features extracted in layer l. Since we aggregate features from neighbouring stages and layers, the coarse-to-fine approach is formulated as f̂_l = T(C(f_l, f_l^prev)), where T is the transmission layer, a 1 × 1 convolution operation, C is the concatenation operation, and f_l^prev denotes the features of layer l from the previous stage. The implementation details of CKLM are reported in the supplementary material.

Figure 4: Cross-stage feature aggregation scheme. After concatenation, a transmission layer is applied to the features obtained from the previous stage before feature aggregation.
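The concatenation C followed by the 1 × 1 transmission layer T can be sketched in numpy as a per-pixel linear map over stacked channels; the weights and feature values below are toy stand-ins, not the trained layer:

```python
import numpy as np

def transmission(x, w):
    """1x1 convolution: a per-pixel linear map over channels.
    x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def aggregate(f_prev_stage, f_curr, w):
    """Concatenate features from the previous stage with the current ones
    along the channel axis (the C-operation), then apply T."""
    cat = np.concatenate([f_prev_stage, f_curr], axis=0)
    return transmission(cat, w)

f_prev = np.ones((4, 8, 8))
f_curr = 2 * np.ones((4, 8, 8))
w = np.ones((4, 8)) / 8.0          # maps 8 concatenated channels back to 4
out = aggregate(f_prev, f_curr, w)
```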

Conditional Link Prediction Graph Module
Since all the predefined keypoints are localized on the heatmap without category information, we cannot connect them directly. To solve this issue, CLPGM is designed to determine the connections among the keypoints. In other words, given each target label, we want to find the adjacency matrix by inference on an unsupervised, learnable graph. Our proposed CLPGM is a pioneering attempt to explore connection links of category-implicit keypoints for multi-class objects.

Notations
We consider the keypoints under each targeted category as the nodes of an undirected, unweighted graph G = (V, E) with N = |V| nodes. Each keypoint is treated as an individual node of graph G. The features of each keypoint extracted from the previous stage serve as the node features and are summarized in an N × D matrix X_C with target label C, where D is the dimension of the node features. The diagonal elements of the graph's adjacency matrix A are set to 1, meaning that every node is connected to itself. The adjacency matrix A consists of several sub-matrices A_C. The stochastic latent variables z_i are summarized in an N × F matrix Z, where N is the number of keypoints, F is the depth of the features, and Z represents the embedding space.
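The notation above can be made concrete with a small numpy helper that builds the self-looped adjacency matrix A and the N × D feature matrix X for a toy sub-graph; the edge list and feature values are hypothetical:

```python
import numpy as np

def build_graph(num_nodes, edges, feats):
    """Adjacency A (undirected, self-loops on the diagonal) and
    node-feature matrix X of shape N x D."""
    A = np.eye(num_nodes)                 # every node connects to itself
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0           # undirected, unweighted
    X = np.asarray(feats, dtype=float)    # N x D
    return A, X

# Toy 4-keypoint ring (e.g. four corners of one part), D = 3 features each.
A, X = build_graph(4, [(0, 1), (1, 2), (2, 3), (3, 0)], np.random.rand(4, 3))
```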

Objective Function
CLPGM is built in a top-down manner. Given the training graph, we first disentangle all nodes into C types of labels, where C is the number of target categories. When training samples from CKLM arrive, the corresponding nodes with the same target label C are activated and their features X_C are updated. After retaining the remaining nodes with their features, the whole graph updates its weights and learns the connections. The objective of learning the graph is for CLPGM to learn node parameters θ_G that best fit the feature maps of the training images. Let us regard X_C as a distribution of "node activation entities". We consider the node response of each unit x ∈ X_C as the number of "node activation entities". F(x) = τ · max(f(x), 0) denotes the number of updated nodes, where f(x) is the normalized response value of x and τ is a constant parameter.
For a Gaussian mixture model, the distribution of the whole graph at each step is governed by q(A_C | X_C, θ_G), which indicates the compatibility of each part of the updated sub-adjacency matrix.
Each time B targets are detected in a training image, the corresponding sub-adjacency matrices are updated into the whole graph for further training. The features are projected into an embedding space Z after the first four layers of CLPGM. The feature information is encoded as

q(Z | X_C, A_C) = Π_{i=1}^{N} q(z_i | X_C, A_C), with q(z_i | X_C, A_C) = N(z_i | μ_i, diag(σ_i²)),

where μ and σ² are the matrices of expected values and variances. Each layer is defined as tanh(ÂXW_i). Denoting the target category prediction as c, we calculate the piecewise link loss as

L_link = −E_{q(Z | X_C, A_C)}[log p(A_C | Z)] + KL[q(Z | X_C, A_C) || p(Z)],

where KL[q(·) || p(·)] represents the Kullback-Leibler divergence between q(·) and p(·).
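Assuming the encoder follows the standard VGAE recipe of [14], a minimal numpy sketch of tanh(ÂXW) layers producing μ and log σ, plus a reparameterised sample Z, might look as follows; the weight shapes and inputs are illustrative only:

```python
import numpy as np

def gcn_layer(A_hat, X, W):
    """One graph-convolution layer as used in the text: tanh(A_hat X W)."""
    return np.tanh(A_hat @ X @ W)

def encode(A, X, W1, W_mu, W_sig, rng):
    """VGAE-style encoder sketch: a shared first layer, then separate layers
    for the mean and log-std of q(z_i | X, A); reparameterised sample Z."""
    deg = A.sum(1)
    A_hat = A / np.sqrt(np.outer(deg, deg))       # symmetric normalisation
    H1 = gcn_layer(A_hat, X, W1)
    mu = A_hat @ H1 @ W_mu                        # no nonlinearity on the outputs
    log_sig = A_hat @ H1 @ W_sig
    Z = mu + np.exp(log_sig) * rng.standard_normal(mu.shape)
    return Z, mu, log_sig

rng = np.random.default_rng(0)
A = np.eye(4); A[0, 1] = A[1, 0] = A[2, 3] = A[3, 2] = 1.0
X = rng.standard_normal((4, 3))
Z, mu, log_sig = encode(A, X, rng.standard_normal((3, 5)),
                        rng.standard_normal((5, 2)),
                        rng.standard_normal((5, 2)), rng)
```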

Rejuvenation
For a non-probabilistic variant of the C-GVAE model, we calculate the embeddings Z and the rejuvenated adjacency matrix Â as

Â_C = σ(Z_C Z_C^T),

where σ is the logistic sigmoid applied to the inner product of the embeddings and C is the targeted category. Thus, the final loss function of KLPNet can be formulated as

L = α · L_kd + β · L_link,

where α and β are predefined constant parameters.
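Under the usual graph-autoencoder reading, the rejuvenation step scores every node pair by the sigmoid of the inner product of their embeddings; a sketch with toy embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rejuvenate(Z):
    """Non-probabilistic decoder sketch: A_hat = sigmoid(Z Z^T); each entry
    is the probability of a link between two nodes."""
    return sigmoid(Z @ Z.T)

Z = np.array([[2.0, 0.0],    # nodes 0 and 1 embed similarly -> likely link
              [2.0, 0.0],
              [-2.0, 0.0]])  # node 2 embeds oppositely -> unlikely link
A_hat = rejuvenate(Z)
```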

Location Instability Strategy of Inference Nodes
When multiple targets occlude each other, the number of detected nodes in an area may exceed the predefined number. If these targets belong to distinct categories, the issue reduces to generating separate single heatmaps for each target category. However, if the targets share the same label, we need a location instability strategy to infer which nodes belong to each overlapped target. As illustrated in Fig. 5(a), an occlusion appears at the center of the image. After keypoint localization by CKLM, the generated heatmap is illustrated in Fig. 5(b). The red and yellow keypoints, named fixed nodes, are assigned to each target. However, the white category-implicit nodes in the intersection region (shaded part) are ambiguous as to which of the two monitors they belong. We first assume that category-implicit keypoints of targets with the same label share similar features; for example, if the features of two top-right monitor keypoints are highly similar, they most likely come from the same instance. Next, we observe that if certain adjoining nodes always trigger a node, then the distance between the inferred node and certain fixed nodes of the object part should not change greatly among targets with the same category label. During inference, partial implicit nodes and fixed nodes are used to complete the link prediction, where the total number of nodes should equal the predefined number for each target category. As shown in Fig. 5(d), two cases of predicted results are given, and the incorrect link prediction (red dashed line) is deleted. Thus, the implicit nodes are distinguished as inferred nodes and outliers, as illustrated in Fig. 5(c).
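The distance-stability reasoning can be sketched as follows: each ambiguous implicit node is matched to the target whose fixed node, shifted by the expected part offset, lies closest; nodes far from every prediction become outliers. The threshold, offset, and coordinates below are all hypothetical:

```python
import numpy as np

def assign_implicit_nodes(implicit, fixed_by_target, expected_offset, thresh=2.0):
    """Location-instability sketch: assign each implicit node to the target
    whose fixed node + expected offset it best matches; far nodes get -1."""
    labels = []
    for p in implicit:
        dists = [np.linalg.norm(p - (f + expected_offset)) for f in fixed_by_target]
        t = int(np.argmin(dists))
        labels.append(t if dists[t] < thresh else -1)   # -1 marks an outlier
    return labels

fixed = [np.array([0.0, 0.0]), np.array([10.0, 0.0])]  # one fixed node per monitor
offset = np.array([0.0, 5.0])                          # expected part offset (toy)
implicit = [np.array([0.2, 5.1]),    # near monitor 0's predicted location
            np.array([9.8, 4.9]),    # near monitor 1's predicted location
            np.array([50.0, 50.0])]  # far from both -> outlier
labels = assign_implicit_nodes(implicit, fixed, offset)
```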

Conclusion
In this work, we proposed a novel and effective approach for simultaneous multi-class object keypoint detection and connection link rejuvenation. The major contributions of this network include: 1) DPAD, a detector capable of localizing targeted instances accurately; 2) CLPGM, a novel link prediction module capable of recovering the node links directly from a single heatmap and the implicit features of each target extracted by DPAD; 3) LIS, an innovative strategy capable of handling occlusion issues; 4) KLPNet, the first end-to-end category-implicit keypoint and link prediction network. Extensive experiments demonstrated the robustness of our proposed KLPNet, proving the effectiveness of the multi-stage architecture and showing state-of-the-art performance on three publicly available datasets.

A. Deep Path Aggregation Detector
Multi-scale feature fusion aims to aggregate features at different resolutions. Formally, given a multi-scale feature P_li at layer li, we aim to design an appropriate approach to effectively aggregate different features and update a deeper layer with renovated features. The conventional Feature Pyramid Network (FPN) [18] aggregates multi-scale features in a top-down manner, but it is inherently limited by the one-way information flow. Thus, PANet [20] provides an extra bottom-up path aggregation network. Figure 7(a) illustrates the additional path with red arrows. The neurons in high layers strongly respond to entire objects, while other neurons are more likely to be activated by local texture and patterns. This manifests the necessity of augmenting a top-down path to propagate semantically strong features and enhance all features with reasonable classification capability. The coarser feature map C_li at layer li and the generated feature map N_li at layer li with higher resolution can be calculated as

N_li = Concat(U(N_{li+1}), g(C_li)),

where g represents the convolutional operations for feature processing, U is usually an upsampling procedure and D a downsampling procedure for resolution alignment, and Concat denotes the concatenation operation. Inspired by PANet [20], we design a deep path aggregation network, DPAD, to enhance the localization capability of the entire feature hierarchy by propagating strong responses of low-level patterns, as illustrated in Figure 7(b). ResNeXt [39] is utilized as the backbone network to generate feature maps at different levels, namely {C3, C4, C5, C6, C7}. Table 1 demonstrates the superiority of ResNeXt-101 considering both AP and FLOPs. In addition to these feature maps generated from the FPN, two higher-level feature maps, C8 and C9, are created by downsampling from C5. The augmented path starts from the lowest level and gradually approaches the top. From C3 to C9, the spatial size is gradually downsampled by a factor of 2.
{N3, N4, N5, N6, N7, N8, N9} denote the newly generated feature maps corresponding to {C3, C4, C5, C6, C7, C8, C9}. Each building block takes a higher-resolution feature map Ni and a coarser map Ci+1 through a lateral connection to generate a new feature map Ni+1 as follows:

N_li = Concat(D(N_{li−1}), g(C_li)), li ∈ {8, 9}.

Unlike PANet [20], we remove the mask branch and adopt CIoU [40] to penalize the union area over the circumscribed rectangle's area in the IoU loss. CIoU achieves better convergence speed and accuracy for the bounding box (BBox) regression problem.
We also tried deeper DPADs and compared their performance with other approaches. Table 7 shows the performance of DPAD* and DPAD†, two deeper DPADs. Relative to DPAD, the former has two additional higher-level feature maps, C10 and C11, and the latter has four, C10, C11, C12, and C13. From Table 7, DPAD* achieves a gain of 0.3% in AP, but its FLOPs increase by 0.5G. When the module is made even deeper, as in DPAD†, the AP actually decreases, indicating that the precision of the module does not increase with deeper feature layers. Two other approaches, NAS-FPN [8] and BiFPN [33], can achieve higher precision, but their FLOPs and parameter counts are also huge. Thus, we discuss how the object detector influences top-down keypoint detection in Figure 8.
As shown in Figure 8, the orange line is the trend of the object detector's performance, and the blue one illustrates the performance of top-down keypoint detection. When the object detector's precision is low, the top-down keypoint detection performance depends heavily on the object detector. However, once the object detector's AP exceeds 42.7%, the precision of top-down keypoint detection saturates; that is, it no longer relies heavily on the object detector. Thus, a trade-off between precision and cost is the key to designing an efficient top-down keypoint detector. Based on this observation, we adopt DPAD as the final object detector in our system.
The top row of Figure 10 makes clear the differences among the heatmaps produced with distinct Gaussian kernel sizes, 7, 5, and 3. The strategy is based on the observation that the heatmaps estimated across the stages follow a similar coarse-to-fine progression. The bottom row of Figure 10 shows illustrative predictions (yellow lines) and ground-truth annotations (blue lines). The pink circles in the left and center images mark the prediction errors, demonstrating that the proposed strategy refines localization accuracy gradually.
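The coarse-to-fine supervision can be illustrated by generating ground-truth heatmaps with the three kernel sizes; the kernel-to-sigma mapping below is an assumption made purely for illustration:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, kernel_size):
    """Ground-truth heatmap: a Gaussian blob at (cx, cy). The kernel size
    (7, 5, or 3 across the three stages) sets the spread, so later stages
    supervise with sharper, more precise targets."""
    sigma = kernel_size / 3.0               # illustrative mapping, not the paper's
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

coarse = gaussian_heatmap(32, 32, 16, 16, 7)   # first-stage target
fine = gaussian_heatmap(32, 32, 16, 16, 3)     # third-stage target
```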

B. Cross-Stage Keypoint Localization Module (CKLM)
In the paper, we discussed the performance of CKLM with respect to different numbers of single stages. The final CKLM module contains three stages to balance precision and cost. After a single stage, the single heatmap contains the most probable predefined keypoint locations. However, the heatmap from the first stage is a coarse prediction with abundant noise, even though adequate features have been extracted in that stage. To filter the noise, another two stages are cascaded with a coarse-to-fine surveillance strategy to boost keypoint localization performance. Since a Gaussian kernel is used to generate the ground-truth heatmap for each keypoint, we use distinct kernel sizes, 7, 5, and 3, in the three stages. The distributions of the heatmaps with these kernel sizes are demonstrated in Figure 9. Table 8 shows the performance of the 3-stage CKLM with disparate Gaussian kernel sizes; we employ a distinct setting for the 3-stage CKLM each time. As shown in Table 8, setting 1, whose kernel sizes are 7 in all three stages, achieves an AP of 75.0%. Shrinking the kernel size to 5 in all three stages decreases the AP by 0.4%, indicating that when the same kernel size is adopted everywhere, the larger one performs better. We conjecture that this is because the smaller region after the first stage negatively affects the detector's performance. Settings 3 and 4 were designed to test this speculation. When the kernel size in the first stage increases from 5 to 7 in setting 3, the AP rises by 0.5%; when the kernel size of the second stage is further increased to 7 in setting 4, the AP drops to 74.8%, though this is still better than the performance in setting 2. It shows that the kernel size in the second stage should be smaller than that in the first stage. We therefore reduced the kernel size in the third stage to further validate our hypothesis.
In setting 5, the kernel sizes in the three stages are set to 7, 5, and 3, respectively. This setting performs best, with the AP rising to 75.3%. Two further settings, 6 and 7, were tried to examine the performance trend with a smaller kernel size in the first stage; as expected, their AP is worse than that of setting 5. Finally, we adopt the best kernel-size setting, 7, 5, and 3, in our 3-stage CKLM. Table 9 shows the architecture of a single stage of our proposed CKLM. Each single stage has two main branches, a downsampling path and an upsampling path. Each path contains four corresponding layers: DS-1, DS-2, DS-3, and DS-4 for the downsampling path, and US-1, US-2, US-3, and US-4 for the upsampling path. The downsampling layers consist of several BottleNeck-4 and BottleNeck-3 blocks, while the upsampling layers encompass several

C. Conditional Link Prediction Graph Module (CLPGM)
In the paper, we construe the details of CLPGM, which includes the Location Instability Strategy (LIS). LIS is utilized to disentangle occlusion cases within the same category. When multiple targets occlude each other, the number of detected nodes in the overlapped area may exceed the predefined number. If these targets share the same label, LIS is needed to infer which nodes belong to each overlapped target. Since the details of LIS are already introduced in the paper, we only demonstrate our approach with more samples.
Table 10 illustrates the comparison among KLPNet, KLPNet*, and KLPNet† on ObjectNet3D+. From Table 10, KLPNet† achieves the best performance on distinct categories. Since our approach is the forerunner in link prediction for multi-class rigid bodies, it is hard to compare it with others either quantitatively or qualitatively. Here we visualize the conditional connection links to illustrate the qualitative performance. From Figure 11, our KLPNet† provides correct connection links in various cases, and the semantic information manifests itself well.

D. Loss Function
Recall that the total loss of the Keypoint and Link Prediction Network (KLPNet) is formulated as

L = α · L_kd + β · L_link,

where α and β are predefined constant parameters. Table 11 illustrates the performance of the whole model under different loss settings. We tried nine settings for the coefficients α and β. If the proportion of α is large, the precision of KLPNet is low; if it is tiny, good precision is not achievable either. Finally, the setting α = 0.3 and β = 0.7 achieves the best performance.

E. Other Applications
KLPNet can be utilized for keypoint detection on rigid bodies. In this section, we discuss some other applications based on our KLPNet.

Refined Object Detection
If CLPGM is removed from KLPNet, the category-implicit keypoints are still localized in the images, but without CLPGM we cannot connect them correctly due to semantic chaos. In this case, we hope to refine the bounding box into a polygonal area, which can encircle the target more tightly and accurately. The results are shown in Figure 12. However, we concede that this approach is only an attempt at refining object detection; some recent object detectors [36] can generate a more accurate and efficient area to encircle the target in the image.

Simultaneous Localization and Mapping (SLAM)
SLAM is the computational problem of constructing or updating the map of an unknown environment while simultaneously keeping track of an agent's location within it. SLAM [7][2] contains two subsystems: localization and mapping. Localization is not only the first step but also key to the success of the whole system's decisions. Current localization approaches [22][6][10] concentrate on pairing landmarks across two frames. We believe KLPNet offers a new perspective for localizing particular objects. For SLAM, details such as texture and color are not essential and can be ignored when localizing a target. Based on accurate detection and localization of semantic keypoints in the real world, it becomes feasible to determine the robot's location and build the real-world map. To accomplish this, we need to obtain coordinates in 3D space, which is discussed in the next part.

3D Reconstruction and Rendering
3D reconstruction [27] determines an object's 3D profile, and 3D rendering [13] is the final process of creating the actual 2D image or animation from the prepared scene. The first step in both fields is to project the 2D object into 3D space. Thus, we propose two latent approaches to implement the projection: KLPNet with a depth map and KLPNet with multi-view consistency, shown in Figure 13 and Figure 14 respectively. Figure 13 illustrates the first approach for converting 2D targets into 3D space. Two sensors are utilized to capture useful information: a color image and a depth map. Using KLPNet, the keypoints can be localized and connected on the 2D heatmap. The depth map provides the spatial distance of each pixel in the 2D image. After merging the heatmap and the depth map, it is conceivable to reconstruct the object in 3D space.
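Merging a detected keypoint with its depth value reduces to standard pinhole back-projection; a sketch assuming known camera intrinsics (the values below are illustrative, not calibrated):

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection: lift a 2D keypoint (u, v) with depth d to a
    3D point in the camera frame, given intrinsics (fx, fy, cx, cy)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# A keypoint at the principal point, 2 m away, lands on the optical axis.
p = backproject(320.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0)
```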
The second approach to building the 3D object utilizes multi-view consistency instead of a depth map. After obtaining the locations and connections of keypoints on two neighbouring 2D keyframes, the known rigid rotation (R) and translation (T) between the two views is provided as a supervisory signal. As shown in Figure 14, V1 and V2 are the two views that best match one another. A multi-view consistency loss can then be used to measure the discrepancy between the two sets of keypoints under the ground-truth transformation. Once the transformation is corrected, it is conceivable to reconstruct the object in 3D space.
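Such a multi-view consistency loss can be sketched as the mean distance between the second view's keypoints and the first view's keypoints mapped through the known (R, T); the transform and point sets below are toy values:

```python
import numpy as np

def consistency_loss(P1, P2, R, T):
    """Mean distance between keypoints of view 2 and keypoints of view 1
    mapped through the ground-truth rigid transform (R, T)."""
    P1_in_2 = (R @ P1.T).T + T
    return float(np.mean(np.linalg.norm(P1_in_2 - P2, axis=1)))

# 90-degree yaw plus a translation between the two views (toy values).
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T = np.array([1.0, 0.0, 0.0])
P1 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
P2 = (R @ P1.T).T + T          # a perfectly consistent second view
loss = consistency_loss(P1, P2, R, T)
```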

F. Blemish
During testing, we also found some failure cases, as shown in Figure 15. The top-left one is a desk with four additional legs, which leads our model to misjudge the four basic bottom nodes. The bottom-left one is a sofa-bed combination that baffles the model's node localization, since beds and sofas have different predefined numbers of nodes. The top-right contains an unusual desk and chair with more legs than the normal training samples; our model predicts all nodes on the chair correctly but fails to connect all the chair legs. We note that our model cannot self-adapt the node number: the predefined number of nodes per class limits the performance of the model. In the future, a novel supervised approach could be designed to depict more suitable edges for specific geometric patterns of objects.