Monocular Visual-IMU Odometry: A Comparative Evaluation of the Detector-Descriptor Based Methods

Visual odometry has been used in many fields, especially in robotics and intelligent vehicles. Since local descriptors are robust to background clutter, occlusion and other content variations, they have been receiving more and more attention in detector-descriptor based visual odometry. To our knowledge, however, there is no extensive, comparative evaluation investigating the performance of detector-descriptor based methods in the scenario of monocular visual-IMU (Inertial Measurement Unit) odometry. In this paper, we therefore perform such an evaluation under a unified framework. We select five typical routes from the challenging KITTI dataset, taking into account the length and shape of the routes as well as the impact of independent motions caused by other vehicles and pedestrians. On these five routes, we conduct five experiments in order to assess the performance of different combinations of salient point detector and local descriptor in various road scenes. The results obtained in this study potentially provide a series of guidelines for the selection of salient point detectors and local descriptors.


Introduction
Ego-motion estimation in real-world environments has been studied over the past decades. As one of the commonly-used methods for this problem, Visual Odometry (VO) estimates the pose of a vehicle by matching consecutive images captured using the onboard camera [28]. According to the camera involved, visual odometry can be divided into two categories: monocular and stereo [28]. However, the architecture of stereo visual odometry systems is normally complex, which limits their practical applications. Stereo visual odometry also tends to degenerate to a monocular system when the distance between objects and the camera is large. On the other hand, monocular visual odometry systems are simple and can be easily used in practical applications. In addition, the joint use of an Inertial Measurement Unit (IMU) and a camera (referred to as visual-IMU odometry) normally improves both the reliability and accuracy of motion estimation [19] because the two sensors are complementary [3]. Hence, the scope of this research is limited to the study of monocular visual-IMU odometry.
Since local descriptors are insensitive to occlusion, background clutter and other changes [23], they have been extensively applied to visual odometry [26], visual-SLAM (Simultaneous Localization and Mapping) [5] and visual tracking [9]. Local descriptors are normally extracted at the salient points detected in images in order to accelerate feature matching. In this context, salient point detection and feature extraction are key to detector-descriptor based visual odometry systems. As a result, an extensive evaluation of detectors and descriptors in a unified visual odometry framework is required in order to obtain guidelines for choosing them.
To the authors' knowledge, however, there is no research which extensively assesses the performance of salient point detectors and local descriptors for monocular visual-IMU odometry. In this paper, we therefore conduct an extensive, comparative evaluation of different combinations of detector and descriptor in the scenario of monocular visual-IMU odometry. The contributions of this paper are: (1) we design a unified evaluation framework based on five typical routes containing different road scenes and a well-established monocular visual-IMU odometry system [15]; and (2) we survey five salient point detectors and eight local descriptors (of which HOG [4], LIOP [34], LM [18] and LSSD [32] have not previously been applied to visual odometry) and perform a comparative evaluation of different combinations of detector and descriptor, which produces a set of useful benchmarks and insights.
The remainder of this paper is organized as follows. Related work is reviewed in Section 2. In Section 3, the details and implementation notes of the salient point detectors and local descriptors are described. The experiments are introduced in Section 4 and the results are reported in Section 5. Finally, conclusions are drawn in Section 6.

Related Work
In this section, we briefly review existing work on salient point detectors and local descriptors, their applications, and evaluation studies of these methods.

Salient Point Detectors
Salient points are normally used to avoid the heavy computational cost of matching all the pixels in two images. Harris and Stephens [13] proposed a corner detector using the image gradient matrix. Based on this detector, Mikolajczyk and Schmid [21] proposed the Harris-Laplace corner detector. The FAST (Features from Accelerated Segment Test) corner detector [27] was introduced based on a discretized circle of pixels surrounding the corner candidate point. Although corner points can be computed quickly, they are less distinctive. In contrast, the points detected using blob detectors are more distinctive and more reliably redetected [28]. These detectors include the Difference of Gaussian (DoG) detector [20] and the Fast Hessian detector [1]. In addition, Geiger et al. [11] proposed a combined blob and corner detector in order to capture both types of points.
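To make the gradient-matrix idea behind the Harris detector [13] concrete, the following sketch computes the corner response R = det(M) − k·tr(M)² on a synthetic image. It is a simplified illustration only: a box window replaces the Gaussian window of the original formulation, and thresholding and non-maximum suppression are omitted.

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2, where M is
    the locally averaged image gradient (structure) matrix."""
    Iy, Ix = np.gradient(img.astype(float))  # image gradients

    def smooth(a, r=2):
        # Box-filter averaging over a (2r+1) x (2r+1) window
        # (the original formulation uses a Gaussian window).
        out = np.zeros_like(a)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
        return out / (2 * r + 1) ** 2

    Ixx, Iyy, Ixy = smooth(Ix * Ix), smooth(Iy * Iy), smooth(Ix * Iy)
    det = Ixx * Iyy - Ixy ** 2
    trace = Ixx + Iyy
    return det - k * trace ** 2

# A synthetic bright square: its corners should score positive,
# edge points negative, and flat regions near zero.
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
R = harris_response(img)
```

This reproduces the qualitative behaviour exploited by the detector: corner points yield large positive responses, while straight edges yield negative ones.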

Local Descriptors
Local descriptors have been widely applied in computer vision due to their powerful representation abilities. Local descriptors, for example, the Scale-Invariant Feature Transform (SIFT) [20] and the Histogram of Oriented Gradients (HOG) [4], can be computed from local gradient histograms. As a faster alternative to SIFT, Bay et al. [1] introduced the Speeded-Up Robust Features (SURF) descriptor. Local descriptors can also be extracted in the form of filter responses [18] or image patches [33]. In addition, Shechtman and Irani [32] introduced the Local Self-Similarity Descriptor (LSSD), while Wang et al. [34] proposed the Local Intensity Order Pattern (LIOP) descriptor.

Detector-Descriptor Based Monocular Visual(-IMU) Odometry
The application of local descriptors can be found in many visual odometry tasks. Nister et al. [26] applied image patches extracted at Harris corner points to monocular visual odometry, while Bloesch et al. [2] used the FAST detector and multilevel patches for monocular visual-inertial odometry. As one of the most famous local descriptors, SIFT [20] has been used in monocular visual-IMU odometry systems [15], [24]. Nilsson et al. [25] also proposed a monocular visual-aided inertial navigation system using SURF [1]. However, these descriptors are normally extracted from gray level images. In order to exploit richer image characteristics, Dong et al. [7] applied three sets of multi-channel image patch features to monocular visual-IMU odometry.

Comparative Evaluations of Salient Point Detectors and Local Descriptors
Many evaluation studies have been conducted for computer vision tasks. Schmid et al. [31] compared salient point detectors under different scale, viewpoint, lighting and noise conditions. Mikolajczyk and Schmid further assessed different affine-invariant detectors [22] and descriptors [23]. Recently, Gauglitz et al. [9] compared different salient point detectors and local descriptors for visual tracking. An evaluation study of local descriptors was also performed in the field of geographic image retrieval [35].
On the other hand, similar comparative studies have also been performed for visual odometry tasks in indoor [30] and outdoor scenes [12], [16], [29]. However, only a small number of combinations of detector and descriptor were tested in these studies. In addition, the datasets used are not representative of road scenes. Therefore, we conduct a series of extensive evaluation experiments (covering more detectors and descriptors) based on a unified monocular visual-IMU odometry framework containing five particularly typical real-world routes. To our knowledge, this is the first extensive evaluation study in the scenario of monocular visual-IMU odometry.

Salient Point Detectors and Local Descriptors
We briefly review the salient point detectors and local descriptors tested in this study.
The parameters used for these methods can be found in the supplementary material.

Salient Point Detectors
The five salient point detectors examined in this study are described as follows.
Blob and Corner (Blob&Corner) Geiger et al. [11] first convolved blob and corner masks with an image. Then, non-maximum and non-minimum suppressions were applied to the response images. Four types of points are derived: "corner max", "corner min", "blob max" and "blob min".
Difference of Gaussian (DoG) Lowe [20] introduced a salient point detector based on local extrema in scale space. The input image is convolved with DoG functions at different scales, and the salient points are detected as the local extrema of the resulting responses across spatial locations and scales.
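As an illustration of the DoG principle, the sketch below builds a single-scale DoG response from two Gaussian blurs and checks that a blob produces an extremum at its centre. This is a simplified single-octave sketch under assumed parameter values (sigma = 2, scale factor k = 1.6); the full detector [20] searches for extrema across an entire scale-space pyramid.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with reflect padding."""
    r = int(3 * sigma) + 1
    x = np.arange(-r, r + 1)
    kern = np.exp(-x ** 2 / (2 * sigma ** 2))
    kern /= kern.sum()
    pad = np.pad(img, r, mode='reflect')
    # Convolve rows, then columns (Gaussian kernels are separable).
    tmp = np.apply_along_axis(lambda v: np.convolve(v, kern, 'valid'), 1, pad)
    return np.apply_along_axis(lambda v: np.convolve(v, kern, 'valid'), 0, tmp)

def dog_response(img, sigma=2.0, k=1.6):
    """Difference-of-Gaussian response at one scale: the difference
    between the image blurred at k*sigma and at sigma."""
    return gaussian_blur(img, k * sigma) - gaussian_blur(img, sigma)

# A Gaussian-shaped bright blob: the DoG response takes its (negative)
# extremum at, or immediately next to, the blob centre.
yy, xx = np.mgrid[0:41, 0:41]
img = np.exp(-((yy - 20) ** 2 + (xx - 20) ** 2) / (2 * 3.0 ** 2))
dog = dog_response(img)
cy, cx = np.unravel_index(np.argmin(dog), dog.shape)
```

The location of the extremum marks the blob centre, which is exactly the behaviour the detector exploits when localizing salient points.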
Fast Hessian As a scale-invariant salient point detector, Fast Hessian [1] was developed on the basis of the Hessian matrix. In order to reduce the computational cost, Bay et al. [1] used a set of box filters to approximate the Laplacian of Gaussian functions. The salient points in an image are obtained by detecting the local maxima of the determinant of the Hessian matrix over spatial locations and scales.
Features from Accelerated Segment Test (FAST) Rosten et al. [27] proposed the FAST detector. This detector operates on a circle of 16 pixels around the candidate corner point p. The point p is treated as a corner if there is a contiguous arc of at least nine pixels that are all darker than I_p − t or all brighter than I_p + t, where I_p is the intensity of p and t is a threshold. The FAST detector can be further accelerated by learning a decision tree in order to examine fewer pixels.
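The segment test described above can be sketched as follows. This is a minimal illustration of the n = 9, 16-pixel-circle variant; the learned decision tree and the non-maximum suppression of the full detector [27] are omitted, and the threshold value is an arbitrary choice for the example.

```python
import numpy as np

# Offsets of the 16-pixel Bresenham circle of radius 3 used by FAST.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(img, y, x, t=20, n=9):
    """Segment test: p is a corner if at least n contiguous circle pixels
    are all darker than I_p - t or all brighter than I_p + t."""
    p = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
    for sign in (-1, 1):  # check the darker and the brighter arc
        flags = [(v - p) * sign > t for v in ring]
        run = 0
        # Walk the circle twice so arcs that wrap around are handled.
        for f in flags + flags:
            run = run + 1 if f else 0
            if run >= n:
                return True
    return False

img = np.zeros((20, 20), dtype=np.uint8)
img[:10, :10] = 255  # bright square; (9, 9) is its corner
```

On this image the test fires at the corner of the square but not at an edge midpoint or in a flat region, matching the behaviour described above.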

Harris-Laplace
The Harris-Laplace detector [21] locates potential salient points in the scale space based on a multi-scale Harris corner detector. The key idea of the Harris-Laplace detector is to obtain the representative scale of a local pattern, which is the extremum of the Laplacian function across different scales. This scale is representative in a quantitative sense because it is the scale at which the similarity between the detector and the local image pattern is maximal.

Local Descriptors
In total, we tested eight different local descriptors in this study. We briefly introduce these below. For more details, please refer to the original publications.

Histogram of Oriented Gradients (HOG)
The HOG descriptor computes the occurrence of gradient orientations in the sub-regions of an image [4]. It first partitions the image into blocks which are further divided into cells. Then, a gradient orientation histogram is derived over each cell. The histograms obtained over each block are concatenated into a vector. In this study, we computed a 9-bin histogram from each 5×5 cell in the 15×15 block around a salient point.
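A minimal sketch of this computation, using the 15×15 block and 5×5 cells with 9 bins adopted in this paper. It is simplified relative to the full descriptor [4]: unsigned orientations, hard bin assignment without interpolation, and a single L2 normalization over the whole block.

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """9-bin histogram of unsigned gradient orientations (0-180 deg),
    weighted by gradient magnitude, for a single cell."""
    gy, gx = np.gradient(cell.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    hist = np.zeros(n_bins)
    idx = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    np.add.at(hist, idx.ravel(), mag.ravel())
    return hist

def hog_block(block, cell=5, n_bins=9):
    """Concatenate per-cell histograms of a 15x15 block (3x3 cells of
    5x5 pixels) and L2-normalize the result."""
    h, w = block.shape
    feats = [cell_histogram(block[y:y + cell, x:x + cell], n_bins)
             for y in range(0, h, cell) for x in range(0, w, cell)]
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-12)

# A horizontal intensity ramp: all gradient energy falls into the
# 0-degree bin of every cell.
block = np.tile(np.arange(15.0), (15, 1))
v = hog_block(block)
norm = float(np.linalg.norm(v))
```

The resulting vector has 3 × 3 × 9 = 81 dimensions for the configuration used here.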

Image Patches (IMGP)
The simplest description of a salient point is the image patch around this point. Extracting image patches only requires cropping the image at a given point. The non-warped image patches retain the original image characteristics [33]. In our experiments, the size of the image patches was set to 11×11 pixels.
Integral Channel Image Patches (ICIMGP) Dollár et al. [6] proposed a set of integral channels, including the gray level (or color) channel(s), the gradient magnitude channel and six gradient histogram channels. Dong et al. [7] first extracted the image patch around a point in each channel. Then, each patch was normalized separately. All patches were combined into a single ICIMGP feature vector. In this study, the size of the patches was set to 11×11 pixels.

Leung-Malik (LM) Filter Bank
The LM filter bank [18] contains 36 first- and second-order Gaussian derivative filters built at six orientations and three scales, eight Laplacian of Gaussian filters, and four Gaussian filters. We applied the LM filter bank at each salient point in this study.

Local Intensity Order Pattern (LIOP)
Given the image patch around a point, the LIOP descriptor [34] first partitions it into sub-regions using the overall ordinal data. Then, a LIOP is computed over the neighborhood of each pixel. The LIOPs contained in each sub-region are accumulated into an ordinal bin. The LIOP descriptor is obtained by combining the different ordinal bins.
Local Self-Similarity Descriptor (LSSD) Given an image, LSSD [32] first computes a correlation surface for each pixel by comparing its local neighborhood with the neighborhood of each pixel within a larger surrounding region. Then, the surface is partitioned into log-polar bins consisting of radial and angular bins. Finally, the descriptor is obtained by normalizing these bins and linearly stretching them into the range [0, 1].

Scale-Invariant Feature Transform (SIFT)
The SIFT descriptor [20] is a 128-dimensional histogram computed from the local gradient magnitudes and orientations in the neighborhood of a salient point.

Speeded-Up Robust Features (SURF)
The SURF descriptor [1] first estimates an orientation from the disk around a salient point. Then, a square neighborhood aligned with this orientation is derived. The neighborhood is further divided into 4×4 square sub-regions. The features computed from these sub-regions are concatenated into a 64-D feature vector. Compared to SIFT features [20], the lower dimensionality boosts the computation and matching speed.

Evaluation Experiments
In this study, a monocular visual-IMU odometry system [15] was used and five experiments were conducted using different routes. In each experiment, we tested different combinations of salient point detector and local descriptor. The GPS/IMU navigation unit data [10] was used as ground-truth, while the pure inertial method (referred to as IMU, whose navigation data was obtained by integrating acceleration and angular velocity) was used as a baseline. The Euclidean distance and the rotation angle were used to compute the position and orientation errors, respectively.

The Monocular Visual-IMU Odometry System
Hu and Chen [15] proposed a monocular visual-IMU odometry system (see Fig. 1 for the pipeline) based on the multi-state constraint Kalman filter (MSCKF) [24]. In this system, the trifocal geometry relationship [14] between three consecutive frames is used as the camera measurement. Hence, the estimation of the 3D positions of feature points is avoided. Also, the trifocal tensor model [14] is used to map the matched feature points between the first two frames into the third frame. A "bucketing" method [17] is further used to choose a subset of the matched points. Finally, the Random Sample Consensus (RANSAC) [8] method is applied in order to reject outlier points. We used the modified version [7] of the system [15], in which the feature matching and outlier rejection module was replaced with a self-adaptive scheme in order to prevent the system from crashing when insufficient inliers were returned. The feature matching algorithm introduced by Lowe [20] was utilized in this study.
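Lowe's matching algorithm [20] accepts a putative match only if its nearest neighbour is significantly closer than the second nearest. A minimal sketch with brute-force distances follows; the ratio value 0.8 follows Lowe's suggestion, though the system evaluated here may use different settings.

```python
import numpy as np

def ratio_match(desc1, desc2, ratio=0.8):
    """Lowe's ratio test: accept a match (i, j) only if the nearest
    neighbour j of descriptor i is sufficiently closer than the
    second-nearest neighbour."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)  # brute-force L2
        j, k = np.argsort(dists)[:2]               # two nearest
        if dists[j] < ratio * dists[k]:
            matches.append((i, int(j)))
    return matches

# Two query descriptors and three candidates: both queries have a
# clearly closest candidate, so both matches pass the ratio test.
d1 = np.array([[1.0, 0.0], [0.0, 1.0]])
d2 = np.array([[1.0, 0.05], [0.0, 1.0], [0.5, 0.5]])
matches = ratio_match(d1, d2)
```

Rejecting ambiguous nearest neighbours in this way removes many of the false matches that would otherwise have to be filtered by RANSAC downstream.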
Fig. 1. The pipeline of the monocular visual-IMU odometry system [15] used in this study.

Dataset and Ground-Truth
In order to assess the detectors and descriptors fairly and explicitly, we selected five typical routes (see Figs. 2 and 3) from the KITTI dataset [10] according to their length and shape, and the impact of the independent motion of other vehicles and pedestrians. All images included in these routes are real-world driving sequences with GPS/IMU ground-truth data. These images were captured at 10 fps using a recording platform equipped with multiple sensors [10]. We used the synchronized grayscale images in this study.

Performance Measures
Since measures based on the error of trajectory endpoints are usually misleading, we used the Root Mean Square Error (RMSE) measure computed from the position or orientation data. This measure has been extensively used for navigation and autonomous driving systems. The RMSE measure is defined as

RMSE = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left[(x_k - \hat{x}_k)^2 + (y_k - \hat{y}_k)^2\right]},

where (x_k, y_k) denotes the ground-truth data, (\hat{x}_k, \hat{y}_k) stands for the estimated data and N is the number of samples.
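A direct implementation of this measure for 2-D positions might look as follows (a sketch; the orientation RMSE is computed analogously from rotation angles):

```python
import math

def position_rmse(gt, est):
    """Root mean square error between ground-truth and estimated
    2-D positions, given as lists of (x, y) pairs."""
    assert len(gt) == len(est) and len(gt) > 0
    se = sum((x - xh) ** 2 + (y - yh) ** 2
             for (x, y), (xh, yh) in zip(gt, est))
    return math.sqrt(se / len(gt))

# One point off by 1 m, one point exact: RMSE = sqrt((1 + 0) / 2).
gt = [(0.0, 0.0), (1.0, 1.0)]
est = [(0.0, 1.0), (1.0, 1.0)]
err = position_rmse(gt, est)
```

Unlike an endpoint error, this measure penalizes deviations along the whole trajectory.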

Experimental Results
In this section, we report the position and orientation RMSE measures derived in the five experiments. (More figures are provided in the supplementary material.)

Route 1: Straight Line
Since Route 1 was gathered on an expressway, the average speed involved was high. Table 1 lists the overall position and orientation RMSE values computed between the estimated trajectories obtained using different methods and the ground-truth trajectory. Fig. 3(a) further shows the ground-truth trajectory and the estimated trajectories obtained using IMU and the best descriptor for each detector in this experiment. It can be seen from Table 1 that: (1) the joint use of Fast Hessian [1] and ICIMGP [7] yields the best performance; (2) ICIMGP [7] also achieves reasonable performance when used with the other detectors, except the DoG detector [20]; (3) the HOG [4] and LSSD [32] descriptors perform reasonably when combined with FAST [27], while SIFT [20], SURF [1] and LM [18] generate promising results when used with DoG [20]; (4) IMGP [33] and LIOP [34] do not provide good performance; in particular, LIOP performs worse than all its counterparts; and (5) the IMU method performs reasonably.

Route 2: Quarter Turn
The route used in this experiment is a simple quarter turn. Table 2 lists the overall position and orientation RMSE values computed between the ground-truth trajectory and the trajectories obtained using different methods. It can be seen that: SIFT [20] and SURF [1] yield reasonable performance; the performance of LIOP [34] is better than that obtained in Section 5.1 but is still worse than those of the other descriptors in most cases; LM [18] and LSSD [32] perform well when combined with the Blob&Corner detector [11]; IMGP [33] performs reasonably when used with the FAST [27] or DoG [20] detectors; and the performance of the IMU method is reasonable. In addition, the ground-truth trajectory and the trajectories obtained using IMU and the best descriptor for each salient point detector are shown in Fig. 3(b).

Table 2. The overall position and orientation RMSE values computed between the ground-truth trajectory and the trajectories obtained using different methods on Route 2.

Route 3: Multiple Quarter Turns
The route used in this experiment was captured in a residential area. Compared to Routes 1 and 2, this route is longer and more complicated. Table 3 lists the overall position and orientation RMSE values obtained using different methods. It can be observed that: (1) the best result is produced by the combination of Fast Hessian [1] and ICIMGP [7]; (2) HOG [4] also performs well, especially when used with the Harris-Laplace detector [21]; (3) the performance of IMGP [33] is even comparable to the best result when combined with DoG [20] and is reasonable when used with the other detectors; (4) LM [18] performs reasonably and yields its best performance when combined with the Harris-Laplace detector [21], while SIFT [20] and SURF [1] perform reasonably in most cases; (5) LIOP [34] produces better results than it did on Routes 1 and 2, and yields its best performance when used with Harris-Laplace [21]; (6) LSSD [32] provides reasonable performance when combined with Blob&Corner [11], DoG [20] or FAST [27]; and (7) the performance of the IMU method is worse than those of all the descriptors. In addition, the ground-truth trajectory and the trajectories obtained using IMU and the best descriptor for each salient point detector are shown in Fig. 3(c).

Table 3. The overall position and orientation RMSE values computed between the ground-truth trajectory and the trajectories obtained using different methods on Route 3.

Route 4: Multiple Curved Turns
The route used in this experiment contains several curved turns. Table 4(a) lists the overall position and orientation RMSE values derived using different methods. It can be seen that: (1) the joint use of Fast Hessian [1] and ICIMGP [7] achieves the best result; (2) SIFT [20] yields performance comparable to this result when used with FAST [27] and performs better than it did on Routes 1, 2 and 3; (3) HOG [4], SURF [1] and LM [18] perform reasonably, while LSSD [32] only produces reasonable performance when used with Blob&Corner [11], DoG [20] or FAST [27]; (4) IMGP [33] yields its best performance when combined with DoG [20] and also performs reasonably when used with the other detectors; and (5) the trajectories obtained using LIOP [34] suffer from drift except when it is used with Harris-Laplace [21], and are even worse than that obtained using IMU. Fig. 3(d) also shows the ground-truth trajectory and the trajectories derived using IMU and the best descriptor for each detector.

Route 5: Loop Line
A closed route is used in this experiment. Table 4(b) reports the overall position and orientation RMSE values computed between the trajectories obtained using different methods and the ground-truth data. As can be seen: (1) the combination of FAST [27] and ICIMGP [7] performs the best; (2) HOG [4] yields promising results except when it is used with Blob&Corner [11]; (3) LM [18], IMGP [33], SIFT [20] and SURF [1] generate reasonable performance, while LSSD [32] only yields reasonable performance when used with DoG [20] or FAST [27]; (4) LIOP [34] performs reasonably when combined with the Blob&Corner [11], DoG [20] or Harris-Laplace [21] detectors; and (5) the performance of IMU is the worst, although it can be improved by jointly using it with local descriptors. In addition, Fig. 3(e) shows the ground-truth trajectory and the trajectories obtained using IMU and the best descriptor for each salient point detector.

Summary
The performance of the descriptors varies when they are used with different detectors or on different routes. To summarize, a set of insights can be obtained as follows: (1) In the five experiments, the best result is always produced by ICIMGP [7], especially when it is used with the FAST [27] or Fast Hessian [1] detectors. This suggests that ICIMGP [7] is suitable for monocular visual-IMU odometry. These promising results can be attributed to the fact that ICIMGP [7] encodes richer image characteristics than its counterparts, which are normally extracted from gray level images. (2) The HOG [4] and LSSD [32] descriptors perform reasonably when they are used with FAST [27]. However, their performance varies when used with other detectors. (3) The DoG detector [20] is the best choice for IMGP [33]. In this case, IMGP [33] performs better than ICIMGP [7] on Routes 3 and 4. However, it does not yield promising results on the straight expressway (Route 1). A similar finding can be obtained for LIOP [34] when it is used with Harris-Laplace [21]. These results show that gray level image patches are not sufficient for use on the straight expressway and probably need to be combined with other image characteristics (see ICIMGP). (4) The LM [18], SIFT [20] and SURF [1] descriptors produce promising results when used with DoG [20], while their performance is not stable when used with the other detectors. Surprisingly, SURF [1] normally performs better when combined with DoG [20] than with Fast Hessian [1], even though the latter was proposed together with it. (5) According to the average position RMSE, Route 3 is the easiest (15.0±8.8) but Route 1 is the most difficult (65.2±79.2) for the detectors and descriptors tested here.
The above insights provide meaningful guidelines for choosing the salient point detector and local descriptor in monocular visual-IMU odometry applications.
We did not compare the computational speed of the different detectors and descriptors because they were implemented in different programming languages. However, the time cost of feature matching depends on the dimensionality of the local descriptors extracted at the same salient points. Table 5 lists the dimensionality of the eight local descriptors. It can be seen that the dimensionality of the ICIMGP [7] descriptor is the highest, yet it produced the best results in this study.

Conclusions and Future Work
In this paper, we first reviewed five salient point detectors and eight local descriptors. Then, we conducted a comparative evaluation study on different combinations of detector and descriptor using a unified monocular visual-IMU odometry framework and five typical routes [10]. To our knowledge, this is the first extensive comparative evaluation of salient point detectors and local descriptors for monocular visual-IMU odometry using these explicit types of routes. The experimental results can be used as a set of baselines in further research. The analysis of these results also provides a set of useful insights to the community, which can be used as guidelines for the selection of detector-descriptor combinations. However, the experiments presented in this paper are not exhaustive and only investigate different combinations of detector and descriptor using a single monocular visual-IMU odometry system. In the next stage of this study, we will tune the parameters of the detectors and descriptors and also test a different monocular visual odometry system in order to augment the results reported in this paper.

(The configurations of the routes can be found in the supplementary material.) These three factors are challenging for existing visual odometry systems. Specifically, (1) Route 1 (Straight Line) and Route 2 (Quarter Turn) are on urban roads, where other vehicles are present; (2) Route 3 (Multiple Quarter Turns) and Route 4 (Multiple Curved Turns) are in a residential area, and are longer and more complicated; and (3) Route 5 (Loop Line) is also in a residential area and is a closed path.

Fig. 2. Example images (corresponding images of the left color camera) of the five routes selected from the KITTI dataset [10].

Table 1. The overall position and orientation RMSE values computed between the ground-truth trajectory and the trajectories obtained using different methods on Route 1.

Table 4. The overall position and orientation RMSE values computed between the ground-truth trajectory and the trajectories obtained using different methods on (a) Route 4 and (b) Route 5.

Table 5. The dimensionality of the local descriptors examined in this paper.