Visual place recognition using HMM sequence matching

Visual place recognition and loop closure detection are critical for the global accuracy of visual Simultaneous Localization and Mapping (SLAM) systems. We present a place recognition algorithm that operates by matching local query image sequences to a database of image sequences. To match sequences, we calculate a matrix of low-resolution, contrast-enhanced image similarity probability values. The optimal sequence alignment, which can be viewed as a discontinuous path through the matrix, is found using a Hidden Markov Model (HMM) framework reminiscent of Dynamic Time Warping from speech recognition. The state transitions enforce local velocity constraints, and the most likely path sequence is recovered efficiently using the Viterbi algorithm. A rank reduction on the similarity probability matrix provides additional robustness in challenging conditions when scoring sequence matches. We evaluate our approach on seven outdoor vision datasets and show improved precision-recall performance against the recently published seqSLAM algorithm.


I. INTRODUCTION
Visual place recognition is a core component of many visual Simultaneous Localization and Mapping (vSLAM) systems. Correctly identifying previously visited locations enables incremental pose drift to be corrected using any number of graph-based loop closure techniques (e.g. g2o [1], COP-SLAM [2]). In this work, we focus on the place recognition component, i.e. determining whether a place has been visited before, with an eye towards systems that can provide robust performance under variations in lighting and other atmospheric conditions such as rain, fog and dust.
There has been a tremendous amount of work on this topic, with most approaches phrasing the problem as one of matching a sensed image against a database of previously viewed images (i.e. images and places are synonymous). The FAB-MAP algorithm [3], [4] remains a popular state-of-the-art algorithm, achieving robust image recall performance for outdoor image sequences up to 1000 kilometers. FAB-MAP and many of its competitors build from the visual Bag-of-Words (BoW) approach popularized in image retrieval [5], and earlier in document retrieval. These approaches rely on extracting scale-invariant image keypoints and descriptors, such as SIFT [6] and SURF [7], from an image. The descriptor vectors are quantized using a dictionary trained on prior data. FAB-MAP uses a probabilistic model of co-occurrences of visual word appearance, while the more traditional BoW approaches use the vector space model. The success of BoW approaches, however, is highly dependent on the quality of the visual vocabulary, and in turn the prior data, and on the reliability of extracting the same visual keypoints and descriptors in images with similar viewpoints. The latter is particularly problematic when there are large lighting variations and scene appearance changes, such as those due to fog.
Recently, [8], [9] presented sequence SLAM (seqSLAM), which achieves significant performance improvements over FAB-MAP under extreme lighting and atmospheric variations [9]. Image similarity is evaluated using the sum of absolute differences between contrast-enhanced, low-resolution images, without the need for image keypoint extraction. For a given query image, the matrix of image similarities between the local query image sequence and a database image sequence is constructed. The image recall score is the maximum sum of normalized similarity scores over pre-defined constant velocity paths (i.e. alignments between the query sequence and database sequence images) through the matrix, a process referred to as continuous Dynamic Time Warping (DTW). The contrast enhancement and matrix normalization steps are described in section II and are the key steps for achieving robust performance under extreme lighting or atmospheric changes. The underlying assumption of the continuous DTW is that the vehicle traverses a previously visited path in the environment at a constant multiple of its previous velocity. Recently, [10] extended the approach to use odometry to constrain the distance between database and query images, overcoming this restriction. The approach taken in this work is instead to improve the overall flexibility of the sequence alignment procedure, and not to rely on odometry, which may be unavailable or inaccurate.
DTW is reminiscent of approaches used to handle variable speaker speed in speech recognition [11]. There, alignment is solved by efficiently finding the minimum cost path through a similarity matrix using dynamic programming. Here, we propose a similar approach. By phrasing the sequence matching problem as a Hidden Markov Model, we can use the Viterbi algorithm [12] to efficiently and optimally align the sequences. We obtain much greater flexibility in state transition models, allowing for different velocity variation models, including discontinuous jumps, and produce a meaningful likelihood estimate as an output. The result, when compared to the continuous DTW used by seqSLAM, is that a much larger space of possible paths can be considered without increasing computational load. When coupled with the sequence scoring procedure presented, improved place recognition performance is demonstrated.
In section II, we provide an overview of sequence alignment and the seqSLAM algorithm, before presenting our HMM sequence alignment approach (section III) and place recognition scoring procedure (section IV). We compare the performance of the algorithm against seqSLAM on a range of challenging publicly available datasets, as detailed in sections V and VI. Finally, we conclude the paper in section VII.

A. Overview
The goal for a visual place recognition system is to identify a valid database image I_d corresponding to a current query image I_q. The database and query images may belong to different datasets, or may be from the same dataset. For the latter, the database images are those viewed before the current query image.
We use the same generalized sequence matching fundamentals as seqSLAM, as illustrated in Fig. 1. To evaluate a place recognition score between a query image I_q and a database image I_d, a matrix M_d of similarity values between the corresponding image sequences is constructed. A path through the matrix M_d is then found that maximizes some function of the values along the path (e.g. the sum of values). This path defines an alignment/matching between images in the query sequence and images in the database sequence. The place recognition score is based on the aligned sequence similarity, and not simply the one-to-one similarity between the individual images I_q and I_d. Using this sequence matching approach significantly improves the reliability of place recognition.

B. Sequence SLAM
As we compare our algorithm empirically to seqSLAM, we first provide a brief overview of that algorithm.
All images are first converted to low-resolution greyscale images and contrast enhanced. Contrast enhancement is performed by dividing a low-resolution image into multiple cells, each containing W × W pixels, and normalizing the pixel intensities in each cell to have zero mean and unit standard deviation. The contrast-enhanced values in each cell are therefore standard deviations (i.e. z-scores) from the cell mean.
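This per-cell normalization can be sketched as follows (a minimal sketch in Python/NumPy; the function name and the zero-variance guard are our own assumptions, not from the paper):

```python
import numpy as np

def contrast_enhance(img, w=8):
    """Patch-wise contrast enhancement: split a low-resolution greyscale
    image into w x w cells and z-score normalize each cell independently."""
    img = img.astype(float)
    out = np.zeros_like(img)
    rows, cols = img.shape
    for r in range(0, rows, w):
        for c in range(0, cols, w):
            cell = img[r:r + w, c:c + w]
            std = cell.std()
            # Guard against flat cells (zero variance) - an assumption,
            # since the paper does not discuss this edge case.
            out[r:r + w, c:c + w] = (cell - cell.mean()) / std if std > 0 else 0.0
    return out
```

After this step, every cell of the output has zero mean and unit standard deviation, so absolute pixel intensity (and hence global illumination) no longer influences the similarity computation.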
Fig. 2. The continuous DTW of seqSLAM uses a set of pre-defined constant velocity search lines within the limits V_min to V_max at step sizes of V_step. These lines are the set of potential paths through the matrix M_d.

For a query image sequence I_q and all database images, an initial matrix D of image similarity values is constructed. The similarity values are the sums of absolute differences of the contrast-enhanced images. The final similarity matrix M is obtained by applying a local normalization to each value in D:

M(q, d) = (D(q, d) − D̄_w) / σ_w,

where D̄_w and σ_w are the mean and standard deviation of the values D(q, d − w/2, ..., d + w/2) within a fixed-size window of width w = 10. This normalization provides a 'local best fit' metric within the neighborhood of w database images; see [8] for a more detailed discussion.
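The local normalization step can be sketched as follows (a minimal sketch; the row/column convention of rows as query images and columns as database images, plus the zero-variance guard, are our own assumptions):

```python
import numpy as np

def local_normalize(D, w=10):
    """Normalize each similarity value against the mean and standard
    deviation of its w-wide neighborhood along the database axis
    (rows = query images, columns = database images)."""
    M = np.zeros_like(D, dtype=float)
    n_db = D.shape[1]
    for d in range(n_db):
        lo, hi = max(0, d - w // 2), min(n_db, d + w // 2 + 1)
        window = D[:, lo:hi]
        mu = window.mean(axis=1)
        sigma = window.std(axis=1)
        # Guard against constant windows (zero standard deviation).
        sigma[sigma == 0] = 1.0
        M[:, d] = (D[:, d] - mu) / sigma
    return M
```

The effect is that a database image only scores well if it is a locally *best* match among its neighbors, which suppresses broad regions of uniformly high similarity.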
For each database image I_d, a continuous DTW is used to select the path through the matrix M_d, as illustrated in Fig. 2. The continuous DTW uses a set of pre-defined discretized constant velocity paths (i.e. straight lines) through the matrix between the limits V_min and V_max at step sizes of V_step. The sum of similarity values along each path is computed, and the maximum is selected as the place recognition score. Note that this continuous DTW should not be confused with the classical DTW algorithms used extensively in speech recognition [11].
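The constant-velocity line search can be sketched as follows (a minimal sketch; anchoring every line at the final query/database cell and the rounding of database indices are our own assumptions about the discretization):

```python
import numpy as np

def seqslam_score(M, v_min=1/1.5, v_max=1.5, v_step=0.02):
    """Score a local similarity matrix M (rows = n query images,
    columns = m database images) by sweeping constant-velocity lines,
    each anchored at the last query/database pair, and returning the
    best summed similarity, in the spirit of seqSLAM's continuous DTW."""
    n, m = M.shape
    best = -np.inf
    for v in np.arange(v_min, v_max + v_step, v_step):
        total = 0.0
        for q in range(n):
            # Database index on the line anchored at the final cell.
            d = m - 1 - int(round(v * (n - 1 - q)))
            if d < 0:
                d = 0
            total += M[q, d]
        best = max(best, total)
    return best
```

Because only straight lines are evaluated, the path cannot adapt to local velocity changes within a traverse; that restriction is what the HMM formulation in section III removes.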
To improve upon seqSLAM, we propose improvements to the place recognition score calculation, together with a probabilistic framework that allows more flexible sequence alignment.

III. SEQUENCE ALIGNMENT USING A HIDDEN MARKOV MODEL
The process of finding a path through a similarity matrix M_d is modeled using a Hidden Markov Model (HMM), and the most probable path is found using the Viterbi algorithm. Referring to Fig. 3, the HMM is parameterized as follows.
The observations Z_1:n = {Z_1, ..., Z_n} are the sequence of n query images I_q, and the state space S is the set of database images I_d. For each observation there is an unobserved hidden variable X corresponding to one of the database images in the state space. The optimal state sequence X* is the one maximizing the conditional probability over all possible state sequences X_1:n = {X_1, X_2, ..., X_n}:

X* = argmax_{X_1:n} P(X_1:n | Z_1:n).

Fig. 3. The Hidden Markov Model. The left shows the observations (query images) and states (database images) represented with respect to the similarity matrix M_d; the path connects the selected state for each observation. The right shows the trellis diagram with emission probabilities and state transfer probabilities labeled. For each observation Z there is a hidden variable X corresponding to a database image in the state space.
Here we have made the constraint that the path length n is fixed. In this scenario, the Viterbi algorithm can be used to efficiently find X* using dynamic programming. Defining μ_k^i as the largest joint probability of all the path combinations X_1:k ending at state X_k = i, the Viterbi algorithm uses the recursion

μ_k^i = E(i, Z_k) max_j [ A(j, i) μ_{k−1}^j ].

The maximum probability for a sequence of length n is then

μ(X*) = max_i μ_n^i.

The above equations find the value of the maximum conditional probability over all state sequences. Storing the argmax at each iteration and backtracking recovers the optimal state sequence X* (the path through the matrix).
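The recursion and backtracking can be sketched as follows (a minimal sketch of the standard Viterbi algorithm; matrix layout and variable names are our own choices):

```python
import numpy as np

def viterbi(E, A, pi):
    """Most probable state sequence for an HMM.
    E:  (m states x n observations) emission probabilities,
    A:  (m x m) transition matrix, A[i, j] = P(next state j | state i),
    pi: (m,) initial state distribution.
    Returns (best path as state indices, path probability)."""
    m, n = E.shape
    mu = pi * E[:, 0]                      # mu_1^i
    back = np.zeros((n, m), dtype=int)     # argmax predecessors
    for k in range(1, n):
        # For each state i: max over predecessors j of A[j, i] * mu[j].
        trans = A * mu[:, None]            # trans[j, i]
        back[k] = trans.argmax(axis=0)
        mu = E[:, k] * trans.max(axis=0)
    # Backtrack from the most probable final state.
    path = [int(mu.argmax())]
    for k in range(n - 1, 0, -1):
        path.append(int(back[k, path[-1]]))
    return path[::-1], float(mu.max())
```

Each iteration costs O(m²), so aligning an n-image query sequence against m database states is O(n m²) regardless of how many distinct paths the transition model admits, which is the efficiency argument made in the text. (For long sequences a log-probability formulation would avoid underflow; we omit that here for clarity.)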
In the remainder of this section we describe the selection of the emission values, initial state probabilities, and the state transfer probabilities.
1) Emission Matrix: Low-resolution contrast-enhanced images are created using the same procedure as seqSLAM, described in section II. Recalling that the contrast-enhanced values are z-scores, we use the following function to compute the similarity matrix M_d with values in the range 0 to 1:

M_d(i, t) = 1 − (1 / (N_r N_c)) Σ_{x,y} | Φ(Ĩ_t(x, y)) − Φ(Ĩ_i(x, y)) |,

where Φ is the cumulative distribution function of the standard normal distribution, Ĩ_t and Ĩ_i are the contrast-enhanced query and database images, and N_r and N_c are the number of low-resolution image rows and columns.
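One plausible reading of this Φ-based similarity can be sketched as follows (a minimal sketch built from the stated ingredients, i.e. the normal CDF applied to z-scores and a result in [0, 1]; the exact published form may differ):

```python
import numpy as np
from math import erf, sqrt

def image_similarity(zq, zd):
    """Similarity in [0, 1] between two contrast-enhanced (z-scored)
    low-resolution images: map z-scores through the standard normal
    CDF Phi, then take one minus the mean absolute difference."""
    phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))
    pq, pd = phi(zq), phi(zd)
    # Mean over all N_r x N_c pixels keeps the result in [0, 1].
    return 1.0 - np.abs(pq - pd).mean()
```

Mapping z-scores through Φ bounds each pixel comparison in [0, 1], so a few extreme pixels cannot dominate the similarity the way they can with a raw sum of absolute differences.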
The similarity matrix M_d is converted to the stochastic emission matrix E by normalizing the sum of values in each column to 1:

E(i, t) = M_d(i, t) / Σ_j M_d(j, t).

The emission matrix stores the conditional probability values P(Z_t | X_t = i).

2) Initial State Probabilities: Referring to Fig. 3, the start point of the path in each local similarity matrix M_d is the lower right corner, i.e. the observation Z_1 and the state S_{i=1}. The initial state probabilities are therefore

π_i = 1 for i = 1, and π_i = 0 otherwise.

3) State Transition/Transfer Probabilities: The state transition probabilities are the likelihoods of transitioning between states (database images) from one observation (query image) to the next. They are stored in the m × m state transition matrix A, where m is the number of states and

Σ_j A(i, j) = 1.     (10)

We use local velocity constraints to set the state transition matrix values using the function

A(i, j) ∝ f(j − i),     (11)

where f is a truncated Gaussian distribution with a flattened peak: f(v) = 1 for V_min ≤ v ≤ V_max, decaying as a Gaussian outside these limits and truncated to zero beyond them. The values are then normalized to satisfy (10). The state transition matrix A for m = 15 states computed using the velocity values V_min = 1/1.5 and V_max = 1.5 is shown in Fig. 4 for reference. Note again that the state transitions are defined only between successive observations, and therefore enforce only local velocity constraints.
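A transition matrix of this shape can be sketched as follows (a minimal sketch; the Gaussian width `sigma` and the truncation distance `v_trunc` are our own assumptions, as the paper does not state them):

```python
import numpy as np

def transition_matrix(m, v_min=1/1.5, v_max=1.5, sigma=0.5, v_trunc=3.0):
    """Build an m x m state transition matrix A where the likelihood of
    moving from database index i to j depends on the implied local
    velocity v = j - i: flat (1.0) inside [v_min, v_max], Gaussian
    fall-off outside, and zero beyond v_trunc from the flat region."""
    A = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            v = j - i
            if v_min <= v <= v_max:
                A[i, j] = 1.0                                   # flattened peak
            elif v_min - v_trunc < v < v_min:
                A[i, j] = np.exp(-(v - v_min) ** 2 / (2 * sigma ** 2))
            elif v_max < v < v_max + v_trunc:
                A[i, j] = np.exp(-(v - v_max) ** 2 / (2 * sigma ** 2))
    # Normalize rows so each state's outgoing probabilities sum to one,
    # satisfying the row-stochastic constraint.
    A /= A.sum(axis=1, keepdims=True)
    return A
```

Allowing small negative velocities (the vehicle pausing or the match slipping backwards) costs nothing extra in the Viterbi search, which is the flexibility argument made against the straight-line continuous DTW.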
As a secondary step we use a global velocity mask to limit the number of possible paths through the matrix, as illustrated in Fig. 5. It is constructed using the same values V_min and V_max, with unshaded regions having a value of 1, and the others 0. Any path through the matrix must lie within the bounds of this mask. To achieve this, any state transfer value in (5) is set to zero if the mask value at cell (t, j) or (t, i) is zero. This often requires a temporary re-normalization of the state transfer values A to satisfy (10).

IV. SEQUENCE SCORE

The HMM sequence alignment procedure in section III finds a path X* through the image similarity matrix M_d. A place recognition score must be evaluated using this path.
The simplest metric to use is the path probability μ(X*). However, this is particularly unreliable when the query or database images contain visual aliasing or limited appearance changes. In this scenario the probability values in the matrix M_d may all be large but ambiguous, with no discernible 'best' path. This is illustrated in Fig. 6(a), which shows the paths selected in two separate similarity matrices M_d. The left matrix is an incorrect place recognition result, but has a higher probability μ(X*) than the correct result on the right.
As discussed in section II, seqSLAM uses a local normalization of the similarity matrix values M to find a local best match. An alternate approach, used in [13], is to compute the eigen-decomposition of a square similarity matrix M and reconstruct the rank-reduced matrix omitting the first r eigenvalues². They argue that the first r rank-one matrices, having large eigenvalues, are dominant themes in the similarity matrix arising from visual aliasing.
We use a spectral decomposition similar to [13], but adapted to operate on the matrix M_d. The Singular Value Decomposition (SVD) of M_d = U Σ V^T is found, and the rank-reduced similarity matrix M̃_d computed as

M̃_d = U Σ̃ V^T,

where Σ̃ is the n × n matrix Σ with the first r singular values set to zero. In later experiments we use a heuristic value of r = 4. The place recognition score for the query image I_q and database image I_d is selected as the Gaussian-weighted sum of rank-reduced similarity values along the path X*:

S(I_q, I_d) = Σ_{k=1}^{n} g(k) M̃_d(X*_k, k),     (14)

where g(k) is a Gaussian weighting over the n path positions.

² Applied to a full square similarity matrix M where the query and database image sets are the same.
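The rank reduction and path scoring can be sketched as follows (a minimal sketch; centering the Gaussian weights on the final, most recent query image and the width `sigma` are our own assumptions, since the paper only states that the sum is Gaussian weighted):

```python
import numpy as np

def rank_reduced(M, r=4):
    """Remove the r most dominant rank-one components of M via SVD,
    suppressing repeated 'themes' caused by visual aliasing."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s[:r] = 0.0
    return U @ np.diag(s) @ Vt

def path_score(M_reduced, path, sigma=5.0):
    """Gaussian-weighted sum of rank-reduced similarities along an
    aligned path, where path[k] is the database index matched to
    query image k. Weights are centered on the last path position
    (an assumption about the published weighting)."""
    n = len(path)
    k = np.arange(n)
    w = np.exp(-(k - (n - 1)) ** 2 / (2 * sigma ** 2))
    vals = np.array([M_reduced[j, path[j]] for j in range(n)])
    return float((w * vals).sum())
```

Zeroing the leading singular values removes the broad, self-similar structure that every path passes through, so the remaining score reflects how distinctive the selected alignment is rather than how visually repetitive the environment is.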
V. EXPERIMENTS

Place recognition results using seqSLAM and our HMM-Viterbi algorithm were found for a range of outdoor vision datasets summarized in Table I. The datasets include 3 sequences from the KITTI visual odometry training set 3 , one sequence from the Málaga urban dataset [14] 4 , sequences from the St. Lucia multiple-times-of-day dataset [15] 5 , and sequences collected near our campuses in Pittsburgh and Qatar. Referring to Table I, the database and query for St. Lucia and Qatar are different image sets collected at different times. For all other datasets, the database and query are the same image sets. The results for seqSLAM were found using our Matlab implementation of the algorithm.
The same low-resolution image sizes were used by both algorithms, with a contrast-enhanced patch size of 8×8 pixels; for stereo datasets, only left images were used. A sequence length of n = 20 images was selected for all datasets, with velocity limits of V_max = 1.5 and V_min = 1/V_max ≈ 0.67. For seqSLAM we use the parameters reported in [8]: a step size of V_step = 0.02, a matrix normalization window size w = 10, and the maximum path score (sum of similarity values) over all database sequences to select the match for each query sequence. The HMM parameters and sequence scoring parameters given in sections III and IV were used. For the datasets where the query and database images are the same set of images, a query image I_q can only be matched to prior database images. We evaluate place recognition performance using precision-recall. For any dataset, the total number of possible loop closure events and true positives are identified by thresholding the relative position and orientation between candidate query and database image pairs. This is followed by a manual verification. Ground truth pose estimates for each frame in the KITTI datasets are provided. For the remaining datasets, GPS data is used to interpolate an approximate 2 Degree of Freedom (DoF) pose estimate for each frame.

VI. RESULTS AND DISCUSSION
The seqSLAM and HMM-Viterbi precision-recall results for all datasets are shown in Fig. 7, and the recall scores for selected precision values are provided in Table II. The lines connecting the place recognition results for each dataset at 99% precision using the HMM-Viterbi algorithm are displayed in Fig. 8.
For all datasets the HMM-Viterbi place recognition algorithm achieves an increased maximum recall rate, which is

Fig. 1. An image similarity matrix M computed for a query image sequence I_q and all database images. To compute a place recognition score for a database image I_d, the local matrix M_d is selected spanning the previous m database images. A path through the local matrix M_d is found, which aligns the query and database sequences, and the place recognition score is computed.

Fig. 4. A sample m × m state transition probability matrix A computed using (11). For display purposes the values are shown before normalizing the sum of each row to one.
Example sets of state transitions.

Fig. 5. The global velocity mask (a) with limits V_min and V_max. For each iteration of the Viterbi algorithm, the local state transitions are constrained to lie within the bounds of the global mask; see (b). This restricts the set of possible paths through the matrix to lie within the bounds of the mask.
(a) Original image similarity matrices M_d and paths: incorrect place recognition scenario (left) and correct (right). (b) The rank-reduced image similarity matrices M̃_d: incorrect place recognition scenario (left) and correct (right).
(c) Query and database images: incorrect place recognition scenario (left) and correct (right).

Fig. 6. The rank reduction used for improved place recognition scoring. The left columns are an incorrect place recognition scenario, and the right columns a correct scenario. (a) shows the paths computed using the original similarity matrices M_d, (b) the rank-reduced matrices M̃_d, and (c) the query and database images. The probability μ(X*) is larger for the incorrect result. The new matching score using the sum of rank-reduced values is larger for the correct result.

Fig. 6(b) shows the rank reduction applied to the original similarity matrices in Fig. 6(a). The score in (14) computed using the method described is larger for the true positive result. For a query image sequence I_q, the sequence alignment and place recognition score in (14) are computed for all database sequences. The database sequence having the largest score is selected as the match for I_q.

TABLE I. SUMMARY OF THE DATASETS USED IN THE EXPERIMENTS. THE NOTATIONS 'D' AND 'Q' REFER TO THE DATABASE SEQUENCE AND QUERY SEQUENCE, RESPECTIVELY. FOR THE NUMBER OF FRAMES, THE VALUES IN BRACKETS WERE THE ORIGINAL NUMBERS, AND THE VALUES WITHOUT BRACKETS THE NUMBER USED BY SELECTING EVERY kTH IMAGE. FOR THE RESOLUTION, THE VALUES IN BRACKETS WERE