A geometry-aware attention network for semantic segmentation of MLS point clouds

Abstract Semantic segmentation of mobile laser scanning (MLS) point clouds can provide meaningful 3D semantic information of urban facilities for various applications. However, it remains challenging to extract accurate 3D semantic information from MLS point cloud data due to its irregular 3D geometric structure in large-scale outdoor scenes. To this end, this study develops a geometry-aware attention point network (GAANet) that takes the geometric properties of the point cloud as a reference. Specifically, the proposed method first builds a graph-like region for each input point to establish the geometric correlation toward its neighbours for robustly describing local geometry-aware features. Thereafter, the method introduces a novel multi-head attention mechanism to efficiently learn local discriminative features on the constructed graphs, together with a feature combination operation to capture both local and global geometric dependencies inside the fused point features, which significantly facilitates the segmentation of small or incomplete 3D objects at point level. Finally, an adaptive loss function is appended to handle class imbalance for overall performance improvement. Validation experiments on two challenging benchmarks demonstrate the effectiveness and strong generalization ability of the proposed method, which achieves state-of-the-art performance with mean IoU of 65.09% and 95.20% on the Toronto-3D and Oakland 3-D MLS datasets, respectively.


Introduction
Recent advances in three-dimensional (3D) GIS modeling technologies applied in smart cities have shown an increasing demand for accurate 3D semantic information for the sophisticated analysis and interpretation of 3D urban scenes (Biljecki et al. 2014). With the rapid upgrading of 3D data acquisition equipment, the representative mobile laser scanning (MLS) system has the advantage of quickly collecting high-precision MLS point cloud data that can represent 3D urban scenes with detailed objects (Xu et al. 2021a), and has become increasingly desirable for urban studies. As a challenging task, semantic segmentation of MLS point clouds aims to classify each point in the point set to provide 3D semantic information of ground objects for urban scene understanding, contributing to many real-world applications such as autonomous vehicles (AVs) and intelligent transportation systems (ITS) (Broggi et al. 2013, Schreiber et al. 2013). Early research works on point cloud semantic segmentation tended to use traditional machine learning-based methods, which usually apply a conventional classifier to segment points based on manually designed features or rules (Weinmann et al. 2013, Xu et al. 2014). However, the MLS point clouds, as illustrated in Figure 1, are unevenly distributed in a large-scale outdoor scene; the scanned objects vary in shape and size and may suffer severe incompleteness caused by scanning angles or occlusion. These objective factors pose significant challenges for the semantic segmentation of MLS point clouds. Recently, deep learning for point cloud semantic segmentation (Zhao et al. 2018, Guo and Feng 2020, Zeng et al. 2022) has gradually flourished.
In contrast to the groups of hand-tuned feature descriptors defined in most conventional methods, deep learning-based methods achieve automatic and efficient feature extraction thanks to their advances in feature learning and their capacity for parameter sharing (Xu et al. 2021b). For this reason, in this paper we intend to accurately segment MLS point clouds of large-scale urban scenes using a deep learning-based method.
In particular, some geometric properties of MLS point clouds need to be seriously considered when using deep learning techniques. Apart from the uneven distribution, irregular structure and varied object sizes, the geometric correlation between adjacent points in a point cloud should likewise be stressed, especially when segmenting small or incomplete objects in complex urban scenes. In light of Tobler's First Law of Geography (TFL) (Tobler 1970), we observe that adjacent points belonging to the same category in a point cloud usually have similar geometric properties and strong geometric correlation in 3D space. This significant characteristic can be well leveraged to distinguish points of different categories in local regions of point clouds by modeling the geometric correlation between adjacent points and capturing the local discriminative features of each point from the constructed model. Hence, how to reliably model the geometric correlation between local points of the input point set and how to efficiently capture the local discriminative features of each point are the two key problems to be addressed when designing a deep neural network that takes the specific geometric properties of MLS point clouds as a reference.
Given the geometric properties and problems mentioned above, we propose a novel geometry-aware attention network (GAANet) for semantic segmentation of MLS point clouds and our main contributions include: (1) A weighted K-Nearest Neighbour (W-KNN) search algorithm is utilized to construct a graph-like region for each point, which robustly models the geometric correlation toward its neighbours through weighted edges. (2) A geometry-aware attention module (GAAM) is designed to allow the network to focus attention selectively on capturing discriminative features of each point by considering self-coordinate information and geometric correlation to the corresponding neighbours. (3) An adaptive loss function is introduced to balance the training trend between majority classes and minority classes for further improving segmentation accuracy over all classes.
The remainder of this paper is organized as follows. Section Related work describes previous related works. Section Methodology details the proposed method and makes a comparison with other existing methods. Section Experiments presents the experiments and the corresponding ablation studies, as well as gives a comprehensive analysis and discussion of the results. Section Conclusions draws a conclusion and provides future research directions.

Related work
Semantic segmentation of point clouds using deep learning has been a popular topic in recent years. We refer to related reviews (Bello et al. 2020, Wang et al. 2021, Guo et al. 2021b) and generally divide the existing methods into the following five categories based on the underlying data representations they leverage.

Multiview-based methods
Multiview-based methods render 3D point clouds into collections of feature images from multiple pre-defined views so that already matured 2D CNNs can be directly applied to the rendered data. Su et al. (2015) pioneered this direction by using a unified CNN architecture that learns to combine information of a single 3D shape from multiple views for category recognition. Kalogerakis et al. (2017) employed image-based convolutional modules to generate part-label confidence maps from multiple views and scales for part-based semantic reasoning on 3D shape representations. Bai et al. (2016) generated synthetic 2D views by picking various camera positions and fed them into a refined SegNet (Badrinarayanan et al. 2017) to perform the 3D semantic labelling task. Likewise, a similar multi-view transformation method (Luo et al. 2019) has also been applied to ground-based MLS point clouds for scene semantic understanding. Nevertheless, these methods inevitably cause spatial information loss and induce quantization errors during the conversion process with limited perspectives (Te et al. 2018).

Volumetric-based methods
Alternatively, point clouds can be straightforwardly converted into regular 3D grid data on which 3D convolutions can be operated. Wu et al. (2015) represented a geometric 3D shape using binary voxel grids, while Zhou and Tuzel (2018) utilized volumetric occupancy grids as the data representation to make 3D scene point clouds and shapes more distinguishable for object semantic recognition. However, operating 3D convolutions on sparsely occupied volumetric grids often requires high memory and computational resources as the voxel resolution increases. To this end, some research works sought to reduce computational consumption for high-resolution output. For example, octree (Riegler et al. 2017, Tatarchenko et al. 2017, Wang et al. 2017) and k-d tree (Klokov and Lempitsky 2017) structures provide efficient partitions that remedy the resolution issue and improve computational performance as well. However, such space-partition structures rely heavily on the subdivision of a 3D bounding volume, rather than on the local geometric surface, to enclose objects, and thus easily lead to unfilled or empty volumetric grids with an extra increase in space cost.

Point-based methods
To bridge the gap of applying deep learning network models to 3D point clouds without data conversion, PointNet (Qi et al. 2017a), a milestone in directly processing input point sets, achieves automatic feature extraction of each point for point cloud semantic segmentation. However, PointNet merely captures per-point features individually without considering local geometric correlation, which makes it fail to combine the geometric information among neighbouring points. To handle this issue, the subsequent PointNet++ (Qi et al. 2017b) utilizes an iterative sampling strategy followed by a compositional grouping operation to generate a set of point groups for local geometric feature learning. Encouraged by PointNet and PointNet++, many recent methods (Li et al. 2018, Xu et al. 2018, Wu et al. 2019) construct a powerful synthetical convolution module or a novel 3D convolution kernel for directly learning point features. Although these methods can efficiently perform feature extraction on non-uniform point clouds, they are mainly applicable to individual 3D shapes with simple structures and may perform poorly when processing real-world outdoor scenes with various kinds of noise.

Graph-based methods
Different from point-based methods that take the discrete points as input, graph-based methods construct a graph-like region for each input point and then conduct point feature learning on the constructed local graphs. Simonovsky and Komodakis (2017) took the first step in applying convolution-like operations on graph-structured point cloud data, and Zhang and Rabbat (2018) then proposed a Graph-CNN architecture to encode the local geometric structure of graph-like points by combining localized convolutions with two types of graph down-sampling operations. Further, DGCNN (Wang et al. 2019c) dynamically updates the local feature learning of each point by recomputing the graph at each layer in the feature dimension. Later, LDGCNN (Zhang et al. 2019) refined the network architecture of DGCNN by linking the hierarchical features from different layers to increase performance and decrease the model size. Although graph-like regions can inherently model the geometric correlation between each point and its neighbours, the geometric dependencies within them remain difficult for most existing methods to capture efficiently due to the diverse graph topologies.

Attention-based methods
Motivated by the advantage of the attention mechanism, which enables a network to focus on the critical parts of the input data for efficient feature learning (Chaudhari et al. 2021), some researchers apply it to graph-structured point cloud data to compensate for the insufficient ability of local geometric representations to capture the geometric dependencies of each point toward its neighbours. Wang et al. (2019a) proposed a novel graph attention convolution to dynamically learn geometric features from the local graph with a deformable kernel whose shape is determined by the distribution of attentional weights. Feng et al. (2020) introduced a local attention-edge convolution (LAE-Conv) to assign different attention coefficients to each edge of the local graph and then aggregated the learned edge attention features of each central point as a weighted sum over its neighbours. Furthermore, Guo et al. (2021a) employed a refined Transformer (Vaswani et al. 2017) architecture, implemented by embedding an offset-attention mechanism, to selectively learn the discrepancy features of geometric attributes between connected point pairs in the local graph. However, these methods pay attention only to the local feature extraction of each point and fail to consider its global geometric dependency toward the other input points in the whole point set, which often gives rise to false segmentation results, especially when segmenting small or incomplete 3D objects in complex outdoor scenes.
Although the aforementioned methods have achieved impressive performance in segmenting point clouds from a variety of perspectives, they share a common limitation: they cannot reliably establish the geometric correlation between local points in point clouds, nor fully exploit both the local and global geometric dependencies concealed in point features, to accurately segment 3D objects at point level in real outdoor environments. To fill this gap, we propose a novel deep neural network (GAANet) that robustly establishes the local geometric correlation between input points by utilizing an improved weighted K-NN algorithm to build a locally weighted graph for each point, and efficiently captures both the local and global geometric dependencies of each point by operating the designed attention module (GAAM) and a feature combination on the constructed local graphs. Moreover, we introduce an adaptive loss function to further optimize the network training process for overall performance improvement.

Methodology
In this section, we first present the general architecture of our proposed deep neural network for the semantic segmentation of MLS point clouds. Then, we detail its two key components including the geometry-aware attention module (GAAM) and the adaptive loss function.

Network architecture
Our proposed GAANet directly takes point cloud coordinates (x, y, z) as input and segments MLS point clouds into categorical object instances by assigning a semantic label to each point in an end-to-end manner. Figure 2 shows the overall network architecture and illustrates the whole processing workflow in four steps: (1) The input points are first aligned in a normative 3D space by multiplying them with a learnable transformation matrix to ensure an advantageous spatial pose for point feature learning. (2) The feature abstraction then maps the transformed point coordinates (N × 3) into a higher dimension (N × 1024) by employing three stacked GAAMs and two shared multilayer perceptron (MLP) layers to obtain higher-level and more abstract point features. (3) In the feature combination stage, a symmetric max-pooling produces the global feature, which is then repeated and concatenated with all local features generated by the preceding GAAMs to capture both local and global geometric dependencies of each point. (4) To predict a semantic label for each point, a softmax classifier computes the predicted probability score of each category based on the categorical feature (N × N_c), which contains significant semantic information over the N_c categories, and the final prediction is the category with the maximal score.
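The feature-combination stage above (symmetric max-pooling, repetition, and concatenation with the GAAM outputs) can be sketched in a few lines of numpy. The shapes and the function name are illustrative assumptions for exposition, not the authors' released code:

```python
import numpy as np

def feature_combination(local_feats, high_feats):
    """Sketch of step (3): a symmetric max-pooling produces one global
    feature, which is repeated per point and concatenated with the
    per-point local features from the preceding GAAMs."""
    # high_feats: (N, 1024) abstract per-point features
    # local_feats: list of (N, F_m) local feature maps
    global_feat = high_feats.max(axis=0)          # (1024,) symmetric max-pool
    n = high_feats.shape[0]
    repeated = np.tile(global_feat, (n, 1))       # (N, 1024) repeat for each point
    return np.concatenate(local_feats + [repeated], axis=1)

# toy usage with two local feature maps of 64 channels each
n = 4
fused = feature_combination([np.ones((n, 64)), np.ones((n, 64))],
                            np.random.rand(n, 1024))
```

The max-pooling is symmetric (order-invariant), so the global feature is unaffected by point ordering, which is the property that makes this combination valid for unordered point sets.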

Geometry-aware attention module
The designed geometry-aware attention module (GAAM) is constructed to learn local geometric features of each input point by attending over its neighbours in a graph-like region. As shown in Figure 3, the GAAM processes the input point features (N × C) in the point set to generate the significant point features (N × F × M) through the following two steps: local graph construction and a multi-head attention mechanism.

Local graph construction
In most graph-based methods (Simonovsky and Komodakis 2017, Zhang and Rabbat 2018, Wang et al. 2019c), the K-NN algorithm is often utilized to locate and keep the top-k nearest neighbours of each point in preparation for local graph construction. However, the normal K-NN indiscriminately selects the nearest neighbours of each point to construct diverse local graphs, which tends to cause feature contamination during the neighbouring feature embedding when central points are surrounded by points of other categories. To this end, a weighted K-NN (W-KNN) is designed to build a locally weighted graph for each point. As shown in Figure 4, different from the normal local graph (left), each edge of the central point in the locally weighted graph (right) is further multiplied by a weight coefficient which quantifies the geometric similarity between the two connected points, with the spatial distance between them as a reference. To formulate the construction of the locally weighted graph, suppose each point p_i in the point set P = {p_i}_{i=1}^N is a node and the directed distance toward its neighbour point p_ij is the edge e_ij; the local graph G_i can then be denoted as follows:

G_i = (V_i, E_i), V_i = {p_i} ∪ {p_ij}_{j=1}^k, E_i = {e_ij}_{j=1}^k    (1)

where V_i and E_i represent the set of nodes and the set of edges in the local graph G_i respectively, p_i is the i-th point of the point set P ∈ R^(N×3), with 3D coordinates (x_i, y_i, z_i) as feature dimensions, and p_ij is its j-th neighbouring point. As shown in the front part of Figure 3, the W-KNN first employs a normal K-NN to search the k nearest neighbours N_i = {p_ij}_{j=1}^k of the reference point p_i, and then the weight coefficient of each directed edge is calculated by a Gaussian kernel, which can be denoted as follows:

W_ij = exp(−‖p_i − p_ij‖²)    (2)

where W_ij is the weight coefficient of edge e_ij, ‖·‖ is the Euclidean distance, and p_i and p_ij are the same as described in Equation (1).
Each edge feature e_ij is then calculated and subsequently weighted by multiplying the corresponding weight coefficient to obtain the final weighted edge e^w_ij as follows:

e^w_ij = W_ij · (p*_ij − p*_i)    (3)

where p*_i and p*_ij are the feature values of p_i and p_ij, respectively. It is noteworthy that the weighting process of each edge in local graph construction takes into account the similarity of geometric properties between the two connected points. The higher the similarity, the greater the weight coefficient assigned to the corresponding edge, and the more similar neighbouring features will be fused into the central point in the subsequent neighbouring embedding, robustly enhancing the representation of local geometry-aware features and thus contributing to the local geometric feature extraction of each point from the constructed local graphs.
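The W-KNN construction described above can be sketched with numpy as follows. The Gaussian bandwidth `sigma` is an assumed hyper-parameter, since the exact kernel settings are not reported here:

```python
import numpy as np

def weighted_knn_edges(points, k, sigma=1.0):
    """Sketch of the W-KNN graph construction: for each point, find its
    k nearest neighbours, form directed edges, and weight each edge by a
    Gaussian kernel of the spatial distance between the two endpoints."""
    n = points.shape[0]
    # pairwise squared Euclidean distances, self-distance excluded
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbr_idx = np.argsort(d2, axis=1)[:, :k]            # (n, k) neighbour indices
    edges = points[nbr_idx] - points[:, None, :]       # (n, k, 3) directed edges e_ij
    # Gaussian weight per edge: higher similarity -> larger weight
    w = np.exp(-d2[np.arange(n)[:, None], nbr_idx] / (2 * sigma ** 2))
    weighted_edges = w[..., None] * edges              # (n, k, 3) weighted edges
    return nbr_idx, weighted_edges
```

A brute-force distance matrix is used for clarity; a KD-tree would be the practical choice for the millions of points in an MLS scene.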

Multi-Head attention mechanism
As shown in Figure 3, to efficiently facilitate point feature learning, we introduce a novel multi-head attention mechanism that concatenates all heads over the feature channels on the constructed local graph, allowing the network to jointly capture significant point features from different representation subspaces of each input point. The idea of the multi-head attention mechanism derives from Transformer attention (Vaswani et al. 2017); we refine its self-attention mechanism by combining a neighbouring embedding in each single head to make it well-suited for local feature learning that considers the 3D geometric structure of the point cloud. For better understanding, we start by introducing the single-head attention mechanism, which takes local graphs as input in our GAAM.
As illustrated in Figure 5, the single-head attention mechanism combines two operations, self-attention and neighbouring embedding, to attend to the feature learning of all input points and their neighbours, respectively. In more detail, for the input local graphs, the self-attention branch learns significant attention features of the central points by calculating an attention score matrix (N × N) that quantifies the correlation between central points in the feature dimension and thus assigns different importance to the point features, whilst the neighbouring embedding learns local geometric features of the central points by aggregating all weighted edges of each central point with a max-pooling operation in the feature dimension. Given each input point p_i in point set P and its corresponding edges E_i obtained in the previous step, we formulate the single-head attention mechanism from two branches: self-attention and neighbouring embedding.
As an initial step, the self-attention employs three linear layers to convert all input points into three new representations in preparation for the subsequent correlation calculation between different input points in the feature space, which can be formulated as follows:

(P_Q, P_K, P_V) = h_θ(P)    (4)

where P_Q, P_K and P_V are the query, key and value point matrices respectively, h_θ(·) is a parametric linear function that can be regarded as a single MLP layer, and θ is the set of learnable parameters of the filter used in the convolution operation (each of the three layers has its own θ). In this paper, we regard P_Q as the significant points among the input points, and P_K and P_V both as the input points. We can obtain the attention weights of each point in P_V by calculating the similarity between the features of P_Q and P_K, which helps the network locate and capture significant features of the input points. We then compute the correlations within point features by multiplying P_Q with the transposed P_K and apply a softmax to normalize the computed correlation matrix to obtain the attention score of each input point, which can be calculated as follows:

a_i = softmax(P_Q · P^T_Ki)    (5)

where a_i ∈ R^(N×1) is the attention score of the i-th point in point set P, P_Qi ∈ R^(1×F) is the i-th point feature in P_Q, P^T_Ki ∈ R^(F×1) is the i-th point feature of P^T_K, and T is the transpose operation.
To capture a significant self-attention feature, each input point feature is multiplied by the corresponding attention score as follows:

F_Pi = Σ_{j=1}^{N} a_ij ⊙ P_Vj    (6)

where ⊙ is the element-wise multiplication, F_P ∈ R^(N×F) is the output self-attention feature of the input point set P, P_Vi ∈ R^(1×F) is the i-th point feature in P_V, and a_i is the same as described in Equation (5). At the same time, to extract local geometric features of each point from the input local graphs, the neighbouring embedding starts by using a linear layer to process all edges of each central point and then aggregates all its edge features by a max-pooling operation, which can be formulated as follows:

F_Ei = max_{j=1,…,k} h_θ(e^w_ij)    (7)

where F_E ∈ R^(N×F) is the output neighbouring feature of the input points, k is the number of edges, max(·) is the max-pooling operation, h_θ(·) and θ are the same as described in Equation (4), and e^w_ij is the j-th weighted edge of the point p_i in edge set E_i. Finally, to obtain local discriminative point features, we fuse the self-attention features and the neighbouring features by an element-wise summation in each single-head attention mechanism, and the multi-head attention mechanism concatenates M single-head attention features to achieve the final geometry-aware feature of the input points, which can be denoted as follows:

F_s = concat_{m=1}^{M} (F^m_P + F^m_E)    (8)

where F_s ∈ R^(N×M×F) is the geometry-aware attention feature of the input points, F_P and F_E are the same as described in Equations (6) and (7) respectively, concat(·) is the concatenation operation in the feature dimension, and M is the total number of single-head attentions.
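As a rough illustration of one attention head (self-attention branch plus neighbouring embedding, fused by element-wise summation), the numpy sketch below uses stand-in weight matrices for the learned linear layers. It is a simplified reading of the mechanism under our own assumptions, not the reference implementation:

```python
import numpy as np

def single_head_gaam(feats, weighted_edge_feats, wq, wk, wv, we):
    """One attention head: a self-attention branch over all points and a
    neighbouring-embedding branch over the weighted edges, summed element-wise.
    wq/wk/wv/we stand in for the learned linear (single-MLP) layers."""
    q, k_, v = feats @ wq, feats @ wk, feats @ wv      # queries, keys, values (N, F)
    scores = q @ k_.T                                  # (N, N) correlation matrix
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = scores / scores.sum(axis=1, keepdims=True)  # row-wise softmax scores
    f_p = attn @ v                                     # attention-weighted value features (N, F)
    # neighbouring embedding: linear layer on each weighted edge, max-pool over k edges
    f_e = (weighted_edge_feats @ we).max(axis=1)       # (N, F)
    return f_p + f_e                                   # element-wise summation of the branches
```

A multi-head variant would run M copies of this function with independent weights and concatenate the outputs along the feature channel.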

Adaptive loss function
The loss function for semantic segmentation of point clouds should provide a quantitative measurement of the quality of the predicted points during network training, and the cross-entropy loss function (L_ce) is used by most existing deep learning networks in this task. However, as shown in Figure 6, when there is a serious class imbalance in the sample proportions of the experimental dataset, L_ce tends to bias the training toward the majority classes with a large sample proportion, as it allocates the same training weight to all classes; this can easily cause insufficient training of the minority classes and thus a decline in overall segmentation accuracy. Therefore, the proposed adaptive loss function aims to deal with the class imbalance and improve the overall performance over all classes. The reference cross-entropy loss function is defined as follows:

L_ce = −Σ_{i=1}^{K} y_i log(p_i)    (9)

where K is the number of categories, and y_i and p_i are the ground-truth label vector and the predicted label vector, respectively. To alleviate the class imbalance, we adopt a weight allocation strategy to dynamically balance the training trends between different classes, with the weight of each class calculated according to its sample proportion in the training dataset. The weighted cross-entropy loss function (L_wce) can be defined as follows:

L_wce = −Σ_{i=1}^{K} w_i y_i log(p_i), w_i = 1 − N_i / Σ_{j=1}^{K} N_j    (10)

where w_i is the weight coefficient of the i-th category, N_i is the number of points belonging to the i-th category, and K, y_i and p_i are the same as described in Equation (9). Based on L_wce, we further refer to the focal loss (Lin et al. 2020) and introduce a balance coefficient to optimize the training effect of hard classes.
The reference focal loss used for multi-class segmentation can be defined as follows:

L_fl = −Σ_{i=1}^{K} δ_i (1 − p_i)^β y_i log(p_i)    (11)

where δ_i can be directly defined as w_i in L_wce, (1 − p_i)^β is a modulating coefficient used to balance the training weights of different categories, β is a balance coefficient used to adjust the value of the modulating coefficient with β ≥ 0, and K, y_i and p_i are the same as described in Equation (10). The proposed adaptive loss function can be defined as follows:

L_a-ce = −Σ_{i=1}^{K} (w_i)^γ y_i log(p_i)    (12)

where w_i is the same as defined in Equation (10), γ is a balance coefficient used to adjust the value of w_i with γ ≥ 0, and K, y_i and p_i are the same as described in Equation (10). We can observe that if the sample proportion of the majority classes is greater than that of the minority classes, or a higher γ value is given, the samples of the majority classes will be assigned a smaller training weight in our L_a-ce, thus achieving a balancing effect similar to that of the modulating coefficient used in L_fl.
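A hedged numpy sketch of a class-balanced cross-entropy in the spirit of the adaptive loss above: the weight rule `w = (1 - proportion) ** gamma` is our assumption based on the description (majority classes receive smaller weights, sharpened by gamma), not necessarily the paper's exact definition:

```python
import numpy as np

def adaptive_ce_loss(probs, labels, class_counts, gamma=1.0):
    """Class-balanced cross-entropy sketch. probs: (N, K) predicted
    probabilities; labels: (N,) integer ground truth; class_counts: (K,)
    training-set point counts per class. Weight formula is an assumption."""
    proportions = class_counts / class_counts.sum()
    w = (1.0 - proportions) ** gamma                   # majority classes -> smaller weight
    onehot = np.eye(len(class_counts))[labels]         # (N, K) ground-truth vectors
    per_point = -(onehot * np.log(probs + 1e-12)).sum(axis=1)  # standard CE per point
    return (w[labels] * per_point).mean()              # weighted mean over the batch
```

With counts of 90 vs 10 points, a misclassified minority-class point contributes roughly nine times the loss of an equally misclassified majority-class point, which is the balancing behaviour the adaptive loss targets.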

Experiments
In this section, we conduct extensive experiments to evaluate the semantic segmentation performance of our proposed method on two challenging MLS point cloud datasets. Moreover, we also perform ablation studies to investigate the impact of different network design choices on the point-wise segmentation accuracy and more experimental results and analysis can be found in the Supplementary Material.

Description of datasets
To validate the effectiveness of our proposed GAANet, two benchmark MLS point cloud datasets are used in the experiments. The Toronto-3D (Tan et al. 2020) dataset contains four sections with approximately 80 million points in total, each of which covers almost 250 m of an urban scene with a high density of about 1,000 points/m², as shown in Figure 7. Each point in the dataset possesses several basic attributes such as position, color, intensity, GPS time and scan angle, and has been manually labeled into eight categories: road, road marking (rd_mk), natural objects, building, utility line (ut_line), pole, car and fence. In the experiment, similar to (Xu et al. 2021a), the second section is selected as the testing data while the others are used for training; the detailed number of labeled samples for each class is provided in Table 1. Another dataset, Oakland 3-D (Munoz et al. 2009), is similarly split into a training set and a testing set, with the detailed number of labeled points for each class presented in Table 2. Note that each point contains only coordinate information and has been categorized into facade, ground, pole, vegetation and wire. For a fair comparison, only the coordinates (x, y, z) of the point clouds are used as input during network training.

Implementation and metrics
The coordinates of the point clouds in the Toronto-3D dataset are stored in UTM format, so the original XY coordinates may suffer a loss of precision and distorted geometric features when taken directly as network input. To preserve more details of the input points, we subtract an offset of [627285, 4841948, 0] from the raw coordinates. In addition, we observe that in the Oakland 3-D dataset the number of original testing samples is much larger than the number of training samples, which fails to meet the data requirements of a deep learning network. Therefore, we select the original testing set as the training set instead. However, the entire training set cannot be input directly into the network because of the limited GPU memory. To handle this problem, we adopt a batch-entry strategy that takes every 4096 points in the training set as one data batch to feed into the network. The proposed network was implemented in the TensorFlow framework. During the training stage, the network was trained on a single NVIDIA RTX 2080Ti GPU and updated by the Adam optimizer (Kingma and Ba 2014) with the momentum set to 0.9. The initial learning rate was set to 0.001 and reduced by half every 300,000 steps. Limited by the GPU memory, the batch sizes were adjusted accordingly. It took approximately 100 epochs for the network to converge. During the testing stage, the trained model with the best overall accuracy was selected to conduct the point-wise prediction on the testing set.
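The offset subtraction and the 4096-point batch-entry strategy can be sketched as below; padding of a final partial batch and shuffling are omitted as implementation details not specified here:

```python
import numpy as np

def make_batches(points, offset=(627285.0, 4841948.0, 0.0), batch_points=4096):
    """Subtract the UTM offset to avoid float-precision loss, then split
    the cloud into fixed-size entries of 4096 points for the network.
    Trailing points that do not fill a batch are dropped in this sketch."""
    shifted = points - np.asarray(offset)               # recenter near the origin
    n_full = len(shifted) // batch_points               # number of complete batches
    return [shifted[i * batch_points:(i + 1) * batch_points] for i in range(n_full)]
```

Recentering matters because UTM easting/northing values around 6.27e5 and 4.84e6 leave few significant digits for sub-metre geometry in float32.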
For the evaluation of the semantic segmentation results, three standard metrics (Tan et al. 2020), including the intersection over union (IoU) of each class, the overall accuracy (OA) and the mean IoU (mIoU), are used in this paper and are defined as follows:

OA = Σ_{i=1}^{N_c} TP_i / N_p,  IoU_i = TP_i / (TP_i + FP_i + FN_i),  mIoU = (1/N_c) Σ_{i=1}^{N_c} IoU_i

where N_p and N_c are the total numbers of points and categories respectively, i denotes the i-th of the N_c categories, and TP_i, FP_i and FN_i are the numbers of true positives, false positives and false negatives, respectively. OA and mIoU evaluate the comprehensive quality of the semantic segmentation over all categories, and IoU measures the performance of each category.
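The three metrics can be computed directly from the per-class TP/FP/FN counts; a straightforward numpy sketch:

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Per-class IoU from TP/FP/FN counts, overall accuracy over all
    points, and mean IoU over all categories."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))   # true positives for class c
        fp = np.sum((pred == c) & (gt != c))   # false positives
        fn = np.sum((pred != c) & (gt == c))   # false negatives
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
    oa = np.mean(pred == gt)                   # overall accuracy
    miou = float(np.mean(ious))                # mean IoU over categories
    return oa, miou, ious
```

Note that classes absent from both prediction and ground truth are scored 0 here; evaluation protocols differ on whether such classes are skipped, which can shift the reported mIoU.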

Experimental results on Toronto-3D dataset
The experiments are first conducted on the Toronto-3D dataset, which contains large-scale urban scene point clouds of the real world. Several mainstream segmentation methods, including PointNet (Qi et al. 2017a), DGCNN (Wang et al. 2019c), GAPointNet (Chen et al. 2021), PointASNL (Yan et al. 2020) and MS-TGNet (Tan et al. 2020), are selected as comparison methods. Figure 8 shows the visual segmentation results of the different methods from a global view. As seen in Figure 8, compared with the other four methods, the segmentation results achieved by the proposed GAANet are more consistent with the ground truth labels and yield less misclassification, especially in the subregions marked by the black boxes. Figure 9 shows the visual segmentation results of the different methods from a local view. Rows A, B, C and D show the detailed segmentation results of the subregions marked in Figure 8, respectively. We can observe that in row A, some points in small or incomplete objects, such as the traffic lights, road markings and broken cars marked by the black circles, are misclassified as electric wires or roads by PointNet and DGCNN. This is possibly because these small or incomplete objects are usually attached to objects of other categories or overlap with each other, and the points in the mixed or overlapped areas have a similar geographical distribution and topological features. For example, GAPointNet and PointASNL are unable to accurately segment the complete road marking from the road, as the two occupy the same plane and have no obvious boundaries. Besides, for the points of the telegraph pole overlapped by the street tree in row B, PointNet and DGCNN fail to classify them into the right category due to the mixing of these two categories. Similarly, in rows C and D, they perform poorly in segmenting the broken car, road marking and electric wire as well.
By contrast, as seen from the visual segmentation results marked by the black circles in each row, it is obvious that the proposed GAANet performs better in the semantic segmentation of small or incomplete objects than the compared methods and presents a powerful resistance to nearby interference. In addition, Figure 10 further illustrates the superior segmentation performance of the proposed method from the view of the classification error map. We can observe that, compared with the other four methods, especially PointNet and DGCNN, the proposed GAANet yields fewer classification errors in general and obtains correct classifications in most detailed areas. Table 3 shows a quantitative comparison between the proposed GAANet and the other compared methods. As shown in Table 3, our network achieves the highest scores in both OA (92.65%) and mIoU (65.09%), which are 1.05% and 6.77% higher than those of the latest MS-TGNet, respectively. It also outperforms most of the compared methods in the IoU score of each class, especially in minority classes such as road marking, car and fence, demonstrating that the proposed GAANet achieves the best overall segmentation performance and can handle the problem of segmenting small or incomplete objects.
The main reasons for the above considerable results achieved by the proposed GAANet lie in the following three aspects. 1) Different from the normal local graphs constructed by DGCNN, GAPointNet and PointASNL, GAANet builds a locally weighted graph to robustly model the local geometric structure of each input point and provide a reliable neighbouring feature embedding for the subsequent feature extraction. 2) Compared with the simple MLP layer used by PointNet, GAANet utilizes a targeted geometry-aware attention module (GAAM) to perform efficient feature extraction on the locally weighted graph and also integrates a feature combination operation to capture the local and global dependencies of each point for the point-wise semantic prediction. 3) GAANet introduces an adaptive loss function to balance the training trend between the minority and majority samples based on their proportions in the training set, further improving overall segmentation performance.
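As an illustration of the first aspect, a locally weighted graph can be read as a k-nearest-neighbour graph whose edges carry Gaussian-kernel weights on Euclidean distance. The function below is a minimal NumPy sketch under that assumption; the function name, the neighbourhood size `k` and the bandwidth `sigma` are illustrative choices, not the paper's implementation.

```python
import numpy as np

def weighted_local_graph(points, k=16, sigma=1.0):
    """Build a k-NN graph per point and weight each edge with a
    Gaussian kernel on squared Euclidean distance (a sketch of a
    locally weighted graph; k and sigma are assumed parameters)."""
    n = points.shape[0]
    # pairwise squared distances, shape (n, n)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    # indices of the k nearest neighbours, excluding the point itself
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]
    # Gaussian edge weights: close neighbours get weights near 1
    w = np.exp(-d2[np.arange(n)[:, None], idx] / (2.0 * sigma ** 2))
    return idx, w
```

Under this reading, each point aggregates its neighbours with weights that decay smoothly with distance, which damps the influence of spurious far-away neighbours in unevenly sampled regions.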

Experimental results on Oakland 3-D dataset
To further validate the effectiveness and generalization ability of the proposed method, supplemental experiments are carried out on the Oakland 3-D dataset, and five state-of-the-art methods, including PointNet (Qi et al. 2017a), PointNet++ (Qi et al. 2017b), DGCNN (Wang et al. 2019c), PointASNL (Yan et al. 2020) and GAPointNet (Chen et al. 2021), are selected as comparative methods. For fair comparisons, the compared deep networks are implemented with their open-source codes and trained with the best hyper-parameter settings described in their papers. Figure 11 shows the visual segmentation results of the different methods. In Figure 11(a), we can observe that PointNet performs poorly in most categories due to the limitation of its network architecture. Specifically, as marked in the triangle and circle, it is unable to distinguish the small pole or the incomplete facade from the street tree. Besides, as seen from the rectangles in Figure 11(b,c), some of the points in the street tree overlapped with the wire are misclassified as wires by DGCNN and GAPointNet, most likely because they are incapable of robustly constructing a local graph for each input point that provides a reliable neighbouring feature embedding to identify such overlapped structures. Although PointASNL achieves a relatively satisfying result, as shown in Figure 11(d), some of the points in the incomplete facade marked in the circle are still misclassified as vegetation. This is probably because it adopts a sampling strategy on the input point set and fails to learn sufficiently significant geometric features of such incomplete structures from the limited sampled points. By contrast, the results achieved by our GAANet are more consistent with the ground truth labels, especially in the marked areas, which demonstrates that the proposed GAANet achieves state-of-the-art performance and can effectively alleviate the impact of overlap and incompleteness on the semantic segmentation of MLS point clouds.
Meanwhile, as given in Table 4, the overall accuracy (98.69%) and mean IoU (95.20%) achieved by our GAANet are higher than those of other compared methods.

Ablation studies
To further validate the performance of the proposed network, the following ablation studies are conducted on the Oakland 3-D dataset. All the ablated networks are trained and tested in the same experimental environment as described in the section 'Experimental results on Oakland 3-D dataset'.
(1) Removing the operation of weighting on local graphs: This operation allows each 3D point to establish robust geometric correlations toward its neighbours, enhancing its local geometric representation. Without this weighting, normal (unweighted) local graphs are fed into the network instead.
(2) Removing the multi-head attention mechanism in the designed GAAM: This attention mechanism allows the network to efficiently learn the locally significant features of each point in the input point set. After removing the multi-head attention, an MLP layer is employed as an alternative to conduct convolution operations on the locally weighted graphs.
(3-5) Altering the depth of the network: The proposed GAANet is mainly constructed from stacked GAAMs. To change the depth of the network, different numbers of GAAMs are used relative to the original GAANet. Table 5 compares the mean IoU scores of all the ablated networks, from which we can observe the following. 1) The greatest impact is caused by not using a Gaussian kernel to weight the local graphs: without it, the network fails to efficiently and robustly learn local geometric structures from complex objects in large-scale outdoor 3D scenes.
2) The removal of the multi-head attention mechanism in the designed GAAM shows the next greatest impact on performance, indicating that this attention mechanism is necessary to learn the local discriminative features of each point from the input point set.
3) The alteration of the depth of the network also has a considerable influence on performance; the comparative results demonstrate that a modest number of GAAMs should be used. From these ablation studies, we can observe how the proposed method unites different network design choices and components to achieve state-of-the-art performance.
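The multi-head attention removed in ablation (2) can be pictured as scaled dot-product attention applied per point over its k neighbour features, with the per-head outputs concatenated. The sketch below uses fixed random projections in place of learned weights and hypothetical names throughout; it is a schematic reading of attention on a local graph, not the GAAM's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_graph_attention(center, neighbors, heads=4, seed=0):
    """Toy multi-head attention for one point over its k neighbour
    features (shape (k, c)); per-head outputs are concatenated.
    Random projections stand in for learned Wq/Wk/Wv matrices."""
    k, c = neighbors.shape
    assert c % heads == 0, "feature dim must divide evenly across heads"
    d = c // heads
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(heads):
        Wq = rng.standard_normal((c, d))
        Wk = rng.standard_normal((c, d))
        Wv = rng.standard_normal((c, d))
        q = center @ Wq                        # query from the centre point, (d,)
        keys = neighbors @ Wk                  # keys from neighbours, (k, d)
        vals = neighbors @ Wv                  # values from neighbours, (k, d)
        attn = softmax(keys @ q / np.sqrt(d))  # attention over k neighbours
        out.append(attn @ vals)                # weighted sum of values, (d,)
    return np.concatenate(out)                 # concatenated heads, (c,)
```

Replacing this with a plain MLP, as in the ablation, removes the data-dependent weighting of neighbours, which is what lets attention emphasise the most discriminative edges of the local graph.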

Conclusions
In this paper, a geometry-aware attention network, GAANet, is proposed for the semantic segmentation of MLS point clouds. It can be directly applied to large-scale urban scene 3D point clouds to provide meaningful 3D semantic information of urban environments for various applications such as autonomous driving, urban high-definition (HD) mapping and urban 3D modeling. The proposed GAANet is an end-to-end point network constructed from the geometry-aware attention module, GAAM, which has the capability of learning the local discriminative features of each input point in the point set. Specifically, considering the inexplicit structure of MLS point clouds, the designed GAAM ingeniously leverages the geometric similarity of the point cloud to build a locally weighted graph for each point and employs a multi-head attention strategy to concatenate the local attention features of all single heads by simultaneously considering the self-coordinate information and the local geometric correlations between each point and its corresponding neighbours. By integrating three stacked GAAMs and a feature combination operation, GAANet is able to exploit both the local and global geometric dependencies inside the fused point features for the point-wise classification. In addition, by introducing an adaptive loss function, the proposed network can reduce the prediction errors caused by class imbalance, improving overall segmentation accuracy. The proposed GAANet has been evaluated and compared with other state-of-the-art methods on two large-scale scene MLS point cloud datasets. The experimental results show that the proposed method outperforms most prevalent methods both quantitatively and qualitatively, especially when segmenting small and incomplete objects that consist of few points in a complex 3D urban scene. Furthermore, the extensive experiments on two diverse datasets also demonstrate the powerful generalization capability of the proposed method.
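The adaptive loss function summarized above can be read, in its simplest form, as a cross-entropy whose per-class weights are derived from inverse class frequencies in the training set, so that rare classes contribute more to the gradient. The sketch below encodes that reading; the exact weighting scheme and function names are assumptions, not the paper's formula.

```python
import numpy as np

def adaptive_class_weights(labels, num_classes):
    """Inverse-frequency class weights (one plausible reading of an
    adaptive, class-balancing loss): rarer classes get larger weights.
    Weights are normalised so they average to 1 across classes."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    freq = counts / counts.sum()
    w = 1.0 / np.maximum(freq, 1e-8)  # guard against empty classes
    return w / w.sum() * num_classes

def weighted_cross_entropy(probs, labels, weights):
    """Mean per-point negative log-likelihood, scaled by the weight
    of each point's ground-truth class."""
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(weights[labels] * nll))
```

For example, if road points outnumber road-marking points 3:1 in a batch, the minority class receives three times the weight of the majority class under this scheme, counteracting the training bias toward large classes.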
Although the proposed GAANet has achieved relatively good results, there is still considerable scope for further improving the semantic segmentation accuracy of MLS point clouds. As some complex 3D structures are mingled with objects of various categories, such as a ground surface that contains both road and road-marking points, our network fails to accurately distinguish such mixed points by merely taking their coordinates as input, which may demand extra auxiliary information such as RGB and intensity. In future work, we will attempt to integrate more input information, including the 3D attributes of point clouds and other indirect information such as projected 2D point images, to help the network yield more accurate segmentation results.