Thermal Infrared Single-Pedestrian Tracking for Advanced Driver Assistance System

Tracking algorithms with low computational complexity and reliable performance are important in developing advanced driver assistance systems (DAS). This paper proposes a method of single-pedestrian tracking using thermal infrared cameras to meet the needs of DAS operating in nighttime and low-visibility conditions. The proposed algorithm uses the background-aware correlation filter (BACF) as the basic tracking framework. To address the problem that directly introducing convolutional features degrades tracking performance in the BACF framework, this paper proposes a fusion scheme that integrates handcrafted and convolutional features so as to exploit the advantages of both. The scheme combines the response maps of the convolutional and handcrafted features through fusion coefficients, improving on trackers based on either feature alone. To calculate the fusion coefficients, a novel approach is proposed that searches for the main peak and interference peaks of a response map by using local binary pattern values to locate all of its local maximum points. Experimental results show that the proposed algorithm outperforms nine existing competing tracking algorithms and can be used on vehicle platforms as a module of DAS to improve driving safety at night.


I. INTRODUCTION
IN ORDER to reduce the fatality rate in pedestrian-vehicle accidents and improve the level of road safety, both automobile manufacturers and the scientific community have made significant efforts and progress in the development of diverse types of safety systems [1], such as electronic stabilization programs and driver assistance systems (DAS) [2]. As a popular type of DAS, the vision-based driver assistance system (V-DAS) uses vision sensors to capture images with rich information and extracts useful information through advanced and complicated image processing [3]. Over the last two decades, several V-DASs with a diverse range of specific functions have entered the automotive market, such as monocular night vision systems and evasive pedestrian protection systems (PPS). PPSs typically detect and track both stationary and moving individuals in a region of interest with the help of images captured by forward-facing cameras, and then warn the driver to perform corresponding actions. In the most evolved PPSs, the tracking module plays a crucial role in several aspects [4], [5]: improving the real-time performance of the PPS, avoiding false detections over time, estimating target velocity, and making useful inferences about pedestrian behavior [6], [7] by analyzing the trajectory of the tracked target. In short, the tracking module builds a bridge between pedestrian detection, a basic function of the V-DAS, and activity understanding, one of its advanced functions [8]. Pedestrian tracking used in V-DAS has to face notorious challenges, such as deformation, occlusion, scale variation and temporary disappearance of the target caused by the unpredictable movement of the pedestrian, as well as image blur and background clutter caused by the ego-motion of the vehicle. It is well known that most pedestrian-vehicle accidents occur at night or in low-visibility conditions caused by bad weather (e.g., rain, snow and fog).
Therefore, it is imperative to develop PPSs that support driving at night and in bad weather. Since normal RGB cameras are unusable in environments with poor illumination, infrared cameras are widely used in automotive night vision. Depending on the wavelength, infrared cameras are divided into near-infrared (NIR) and far-infrared (FIR) cameras [9]. NIR cameras detect reflected light using a projector system at night, and their detectable wavelength band lies between 0.8 and 1.1 μm. FIR cameras, whose detectable wavelength band is 8-12 μm and which are commonly known as thermal infrared (TIR) cameras, are used at night or in bad weather; they capture the relative temperature of different objects and are good at distinguishing targets with high thermal energy, such as pedestrians or automobiles, from cold backgrounds such as asphalt or trees [10]. Although TIR cameras are more expensive than visible-light or NIR cameras, they can detect targets in bad weather because far-infrared rays are less susceptible to moisture than rays of other wavelength bands. Moreover, TIR cameras are robust to disturbing light, such as oncoming headlights. This advantage ensures that a target captured by a TIR camera remains distinguishable from the background when a pedestrian is in front of a car's headlights [9]. However, TIR images have severe shortcomings in comparison
with visible light images, e.g., lack of color, texture and edge information, low signal-to-noise ratio [11], etc. Moreover, TIR cameras also bring special challenges that increase the difficulty of tracking tasks, such as low resolution and thermal crossover of different objects on the road. Generic tracking algorithms estimate the trajectory of a target throughout a sequence of image frames when only the location of the target in the first frame is known [12]. In the field of intelligent transportation systems, visual tracking is used for road traffic surveillance and for self-driving vehicles, including V-DAS. Currently, with the help of the powerful computational resources of traffic control centers, tracking performance for road traffic surveillance has been greatly improved [13]. However, the dynamic background of V-DAS caused by the movement of the ego-vehicle poses a huge challenge to tracking tasks [14]. Moreover, limited on-board computing resources and the need for fast response to environmental changes also demand that the tracking algorithm of a V-DAS has low computational complexity and reliably high real-time performance.
According to the number of tracked objects, visual tracking tasks can be divided into two categories: single object tracking (SOT) and multiple object tracking (MOT) [15]. Considering that an MOT tracker can be realized by running multiple SOT trackers in parallel, we focus on SOT in this paper. To summarize, the following observations motivate the research in this paper: 1) The proposed tracking algorithm needs to meet the limits of the computational capacity available on-board and the high real-time requirement in support of timely situational awareness and follow-on decision making. 2) The proposed tracking algorithm must perform well when facing the challenges posed by the operational environment of V-DAS, such as occlusion, deformation, scale variation, background clutter, low resolution, fast motion, and motion blur. 3) Since most existing tracking algorithms are developed for visible light images, it is extremely significant, for the success of TIR target tracking, to design or select a tailored and promising scheme of feature representation that takes into account the characteristics of TIR images. Motivated by the fact that no existing tracking algorithm fulfills all these requirements, this work aims to develop a promising TIR pedestrian tracking system in the context of V-DAS and to address the three issues listed above. We use the background-aware correlation filter (BACF) [16], which inherits the excellent computational efficiency of the discriminative correlation filter (DCF) [17] and gains real negative samples from the background by enlarging the search area for filter training, as the basic tracking framework.
More specifically, the main contributions of this paper are summarized as follows: 1) To the best of the authors' knowledge, this work is the first attempt to integrate handcrafted and convolutional features in the BACF framework to develop a reliable TIR pedestrian tracker for the application of V-DAS. Experimental results demonstrate that the proposed tracking algorithm outperforms state-of-the-art methods on all the challenges that V-DAS often encounters while meeting the real-time requirement. 2) We propose a novel fusion scheme that combines response maps from different feature spaces to take advantage of the fact that convolutional features are complementary to handcrafted features, which are normally associated with physical properties. Specifically, a method to estimate the fusion coefficients is proposed according to the values and locations of the main peak and interference peaks of the different response maps. 3) In order to compute the fusion coefficients of the response maps, a novel approach is proposed to search for the main peak and interference peaks of a response map by identifying all of its local maximum points. This approach first analyzes the relationship between local binary pattern (LBP) values and local maximum points, and then computes the LBP value of each element of the response map to obtain the location of each local maximum point.

A. Tracking Framework
Visual tracking, as a basic issue of computer vision, is an attractive research field with a broad extent of applications. Recently, DCF and deep learning trackers have been the two main development directions in the field of visual tracking [18]. Overall, the computational complexity of deep learning trackers is considerably higher than that of DCF-based trackers [19]. Trackers based on DCF, which work by learning an optimal correlation filter used to detect and locate the target in the next frame, have obvious advantages in computational efficiency while providing excellent tracking performance. Therefore, they are especially suitable for diverse real-time applications, such as V-DAS. However, for the TIR target tracking tasks of V-DAS, although cyclic sampling greatly improves real-time performance, the negative samples it produces cannot truly reflect dynamic background information. As a result, the performance of many DCF-based trackers degrades in the presence of common challenges of V-DAS, e.g., cluttered background, occlusion, etc.
To utilize background information as much as possible, the correlation filters with limited boundaries (CFLB) tracker introduced the idea of training over a large area while keeping the effective filter support small [20]. The BACF tracker inherits this idea from CFLB and collects real negative samples from the background instead of virtual negative samples from cyclic sampling. Due to the importance of background information in tackling the challenges arising in V-DAS using thermal imaging, this paper uses the BACF as the basic tracking framework.
Although the search range of BACF is enlarged, the resolution of the target area in the feature space consequently becomes smaller than that of DCF-based trackers. This reduction of resolution means a loss of detail information to a certain extent. Therefore, it is important to construct a suitable approach to the feature representation of TIR targets to address this problem in the BACF framework.

B. TIR Target Tracking
TIR target tracking has made considerable progress in recent years. Zhang et al. used a local structure detector and a context relation model to learn the features for matching [21]. He et al. used complementary semantic features to construct a twofold matching network for TIR tracking [22]. Zhu et al. matched the tracking target by combining optical flow features [23]. Li et al. presented a tracker which selects target-aware features online [24]. Ding et al. investigated different features, including normalized grayscale and Histogram of Oriented Gradients (HOG), within the DCF framework [25]. The combination of features from different layers of VGGNet was used in the Multi-layer Convolutional Features for Thermal infrared tracking System (MCFTS) to represent the TIR target [26]. Zhang et al. used synthetic TIR images to train a Siamese network as the feature extractor and then combined it with the framework of DCF [27]. Zulkifley et al. presented a multiple-model fully convolutional neural network (CNN) which updates a small set of fully connected layers on top of a pre-trained CNN for TIR pedestrian tracking [28].
In summary, as a branch of visual tracking, most TIR trackers are developed in the existing visual tracking frameworks, and their difficulty lies in the feature representation for appearance modeling due to the lack of discriminating information. Therefore, the study of TIR target tracking usually focuses on feature representation of the tracked target.

C. Feature Representation for Tracking Tasks
Generally, features used in tracking tasks mainly contain two categories: handcrafted and deep features [29].
Handcrafted features are widely used for feature representation in tracking tasks, e.g., the HOG used in the original versions of discriminative scale space tracking (DSST) [30] and BACF. These trackers provide an appealing trade-off between competitive tracking performance and computational efficiency. However, since handcrafted features are built on inflexible assumptions about target structures and their motion in real-world scenarios, they cannot interpret the semantic information of the target or cope with significant appearance variation [19].
In recent years, deep learning-based trackers have attracted considerable interest in the field of visual tracking. These trackers can be divided into two categories according to whether they rely on pre-training or online training. Online training requires a platform with high computational performance, and the computational resources available on a vehicle can hardly meet this requirement. A pre-trained network usually plays the role of feature extractor in trackers, and the dominant feature extraction network is the CNN. Convolutional features obtained by a pre-trained CNN can represent objects more comprehensively and have a more powerful ability for object classification than handcrafted features. The deep spatially regularized discriminative correlation filter (deep SRDCF) [31], which combines activations from the convolutional layers of a pre-trained CNN with the SRDCF [32], has fully proven the effectiveness of convolutional features for target tracking in visible image sequences. Unfortunately, a similar scheme that combines convolutional features with the BACF leads to a decrease in tracking performance due to the reduced resolution of the feature map corresponding to the target region in BACF. Therefore, designing a scheme that incorporates the rich semantic information of convolutional features into the BACF framework is significant for improving tracking performance.

D. Fusion Scheme for Visual Tracking
In the field of visual tracking, fusion schemes can be roughly divided into three levels. The first is the tracker level. Biresaw et al. proposed a framework to fuse two trackers by using an online performance measure to identify the track quality of each tracker [33]. The second is the feature level. Currently, multi-feature combination or fusion is also an approach to feature representation for tracking tasks and has received increasing attention. Bertinetto et al. combined HOG and the color histogram to represent the target in the framework of DCF [34]. Danelljan et al. fused the outputs of multiple convolutional layers to replace the single-layer feature of the deep SRDCF [35]. Furthermore, Wang et al. combined intensity and edge information to represent the tracked target for TIR pedestrian tracking [36]. Ko et al. proposed to fuse local intensity distribution and texture features [37]. Compared with a single feature, the fusion of different features can provide a better overall description of a target. The last level is different from the others: specific to the framework of the correlation filter, fusion at this level operates on response maps, which is the approach adopted in this paper. Therefore, a tailored fusion scheme of response maps for TIR pedestrian tracking is proposed in this paper with a significantly improved performance.
In summary, this paper develops a reliable TIR tracking algorithm for V-DAS by designing a fusion scheme that combines convolutional features from a CNN pre-trained on TIR images with HOG features in the BACF framework. The proposed tracker offers three merits: (1) the proposed approach of feature representation and the fusion scheme improve tracking performance; (2) the fusion coefficients are estimated from the confidence of each response map, so the more reliable feature dominates the fused result; (3) the LBP-based search for the main peak and interference peaks keeps the computational cost compatible with real-time on-board operation.

III. PROPOSED METHOD
A bounding box from the output of a TIR pedestrian detector is used as the input of the tracking framework and the tracking results can be used as the input of the module of pedestrian behavior understanding. Due to the important role that the tracking task plays in the V-DAS, this paper focuses on this tracking framework. The framework of the proposed tracking algorithm consists of the feature representation, the fusion scheme, the filter generation, and the target detection as shown in Fig. 1.

A. Tracking Framework
For the feature map of a target area in the first frame x ∈ ℝ^(k1×k2×d), d is the number of feature channels, and k1 and k2 are determined by the size of the target area. In the framework of the correlation filter, the objective of tracking is to learn a filter combination h ∈ ℝ^(k1×k2×d) consisting of one filter h_i ∈ ℝ^(k1×k2) per feature channel i ∈ {1, 2, …, d}, vectorized as h_i ∈ ℝ^K with K = k1×k2. The filter combination h can be calculated by minimizing the following cost function [30]:

E(h) = (1/2) ‖y − Σ_{i=1}^{d} h_i * x_i‖² + (λ/2) Σ_{i=1}^{d} ‖h_i‖²  (1)

where * denotes the correlation operator, λ is a regularization coefficient to prevent overfitting [16], [30], and y ∈ ℝ^K is the desired correlation output, a Gaussian-shaped label matrix. Eq. (1) can be expressed by vectorizing x, h and y:

E(h) = (1/2) Σ_{j=1}^{K} ( y(j) − Σ_{i=1}^{d} h_i^T x_i[Δτ_j] )² + (λ/2) Σ_{i=1}^{d} ‖h_i‖²  (2)

where [Δτ_j] is the circular shift operator and x_i[Δτ_j] denotes the j-step discrete circular shift of the signal x_i. According to the definition of the circulant matrix, Eq. (2) can be rewritten as:

E(h) = (1/2) ‖y − Σ_{i=1}^{d} X_i h_i‖² + (λ/2) Σ_{i=1}^{d} ‖h_i‖²  (3)

where X_i ∈ ℝ^(K×K) is the circulant matrix whose j-th row is x_i[Δτ_j]^T. In order to reduce the boundary effect while expanding the search area (from k1×k2 to s1×s2), BACF introduces a binary cropping matrix P ∈ ℝ^(K×S) (S = s1×s2, S ≫ K) that extracts the K target-sized elements from a signal of size S, and an auxiliary filter g_i ∈ ℝ^S, where h_i = P g_i and PP^T = I. Therefore, the cost function of the BACF tracker, with x_i ∈ ℝ^S and y ∈ ℝ^S now defined over the enlarged search area, can be expressed as

E(h) = (1/2) Σ_{j=1}^{S} ( y(j) − Σ_{i=1}^{d} h_i^T P x_i[Δτ_j] )² + (λ/2) Σ_{i=1}^{d} ‖h_i‖²  (4)

For computational efficiency, Eq. (4) is expressed in the frequency domain as:

E(ĝ, h) = (1/2) ‖ŷ − X̂ ĝ‖² + (λ/2) ‖h‖²,  s.t.  ĝ = √S (F P^T ⊗ I_d) h  (5)

where F ∈ ℂ^(S×S) is the Fourier matrix, ^ denotes the discrete Fourier transform (DFT) of a signal (â = √S F a), X̂ = [diag(x̂_1)^T, …, diag(x̂_d)^T], ĝ = [ĝ_1^T, …, ĝ_d^T]^T, and ⊗ denotes the Kronecker product.
The Augmented Lagrangian Method (ALM) is used to solve (5), and the corresponding augmented Lagrangian function is defined as:

L(ĝ, h, ζ̂) = (1/2) ‖ŷ − X̂ ĝ‖² + (λ/2) ‖h‖² + ζ̂^T ( ĝ − √S (F P^T ⊗ I_d) h ) + (μ/2) ‖ĝ − √S (F P^T ⊗ I_d) h‖²  (6)

where ζ̂ ∈ ℂ^(dS) is the Lagrangian multiplier in the Fourier domain and μ is a penalty factor. The ĝ and h of (6) can be solved iteratively using the ADMM algorithm [16]. The final spatial filters g_i of size s1×s2 are obtained by transforming ĝ_i ∈ ℂ^S back to the spatial domain and reshaping.
In the following frame, z ∈ ℝ^(s1×s2×d) represents the feature map of a candidate patch centered around the predicted target location. Its response map R ∈ ℝ^(s1×s2) in the spatial domain can be computed by

R = IDFT( Σ_{i=1}^{d} ẑ_i ⊙ conj(ĝ_i) )  (7)

where IDFT(·) represents the inverse discrete Fourier transform and ⊙ denotes element-wise multiplication. The location of the main peak of R indicates the position of the tracked target in the current frame. Compared with DCF-based trackers, the BACF framework can search for and locate a target in a wider candidate area; more specifically, the candidate area increases from k1×k2 in DCF to s1×s2 in BACF.
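As a concrete illustration of Eq. (7), the sketch below computes a multi-channel correlation response in the Fourier domain with NumPy (not the authors' MATLAB implementation; shapes and data are illustrative):

```python
import numpy as np

def response_map(z, g):
    """Correlation response of Eq. (7): R = IDFT(sum_i z_hat_i (*) conj(g_hat_i)).

    z: (s1, s2, d) candidate-patch features; g: (s1, s2, d) learned filters.
    Returns the real-valued (s1, s2) response map whose main peak gives the
    predicted target location."""
    R_hat = np.zeros(z.shape[:2], dtype=complex)
    for i in range(z.shape[2]):
        R_hat += np.fft.fft2(z[..., i]) * np.conj(np.fft.fft2(g[..., i]))
    return np.real(np.fft.ifft2(R_hat))

# Sanity check: circularly correlating a patch with itself peaks at zero shift.
rng = np.random.default_rng(0)
z = rng.standard_normal((32, 32, 3))
R = response_map(z, z)
peak = np.unravel_index(np.argmax(R), R.shape)  # -> (0, 0)
```

Working in the Fourier domain is what keeps the per-frame detection cost of correlation-filter trackers at O(S log S) per channel rather than O(S²).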

B. Features Used in the Proposed Tracker
The original BACF tracker, employing only 31-channel HOG features with 4 × 4 cells, outperforms many trackers using convolutional features [16]. However, convolutional features are often complementary to handcrafted features, which are associated with physical properties. In order to further improve tracking performance, one of the purposes of this paper is to introduce convolutional features into the existing BACF framework.
In this paper, the pre-trained network used for convolutional feature extraction is derived from VGG-m-2048 [38], which has been widely used in a number of trackers, such as ECO [39]. Since the ImageNet-VGG-m-2048 from MatConvNet (https://www.vlfeat.org/matconvnet/pretrained/) is trained on visible light images, we retrained this network using TIR pedestrian images and non-pedestrian images, and refer to this new network as TIR-VGG-m-2048. The network contains five convolutional layers (Conv1 to Conv5) and three fully connected layers, and there is a 2 × 2 max pooling operation after each of Conv1, Conv2 and Conv5 [38]. The architecture of TIR-VGG-m-2048 and the feature map resolution of each convolutional layer are shown in Table I.
From Table I, we can observe that the resolution of the feature map output by Conv3 is 13 × 13. Because the input image of the network has a fixed size of 224 × 224 pixels [39], each element in a 13 × 13 feature map corresponds to roughly a 17 × 17-pixel sub-image of the input. Therefore, the outputs of shallower convolutional layers, with their higher resolution, can provide more precise localization and describe an object more comprehensively.
Moreover, since the search range s1×s2 of BACF is much larger than the range k1×k2 of traditional DCF-based trackers for the same target, the target area in a feature map from the same layer is much smaller in BACF than in other DCF-based trackers. Specifically, supposing that the sizes of the candidate regions in BACF and DCF are s1×s2 and k1×k2 respectively, the ratio of the sizes of the target region in the two feature maps x_BACF and x_DCF is

size(x_BACF) / size(x_DCF) = (k1 × k2) / (s1 × s2)  (8)

where size(·) is the size of the target region in the feature map. Based on the above considerations, this paper uses the outputs of the first convolutional layer of TIR-VGG-m-2048 as the convolutional feature.
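The effect of Eq. (8) can be made concrete with a small helper (a hedged sketch; the 5×-larger search area used in the example is illustrative, not a figure taken from the paper):

```python
def target_area_ratio(k, s):
    """Eq. (8): ratio of the target's share of the feature map in BACF vs. DCF.

    Both frameworks feed their candidate region to the same feature extractor,
    so enlarging the search window from k1 x k2 (DCF) to s1 x s2 (BACF)
    shrinks the target's share of the feature map by k1*k2 / (s1*s2).
    k = (k1, k2): target-sized DCF candidate region, in pixels.
    s = (s1, s2): enlarged BACF search region, in pixels."""
    (k1, k2), (s1, s2) = k, s
    return (k1 * k2) / (s1 * s2)

# With a roughly 5x-larger search area, the target occupies only about one
# fifth of the feature map that it would fill entirely under DCF.
ratio = target_area_ratio((64, 128), (143, 286))  # ~0.20
```

This is why the same convolutional layer yields coarser target detail under BACF than under DCF, motivating the choice of the highest-resolution layer, Conv1.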

C. Fusion Scheme of Response Maps
This paper proposes a scheme that first computes fusion coefficients and then obtains the fused response map by a linear weighted-sum method, so as to exploit the advantages of both convolutional and handcrafted features.
Generally, fusion coefficients represent the importance and confidence of the fused information. Previous research shows that the shape of the response map reflects the confidence of the tracking result to a certain extent [40]. According to the theory of the correlation filter, the response map is essentially a probability distribution map in which each element denotes the possibility that the tracked target is at that location. The higher the probability value, the higher the confidence of the tracking result. From Fig. 2, it can be intuitively observed that the confidence of the response map in Fig. 2(b) is lower than that of Fig. 2(a) due to the disappearance of the target in Fig. 2(b).
Generally, the confidence of a response map is characterized by three factors [40]: the response value of the main peak, the number of interference peaks, and the distance between the main peak and the interference peaks. Therefore, the first step of a fusion scheme is to search for the main peak and interference peaks of a response map. All peak points can be found by searching for local maximum points, and local maximum points can be determined using LBP values. According to the definition of the LBP descriptor [41], an element of a response map is a local maximum point when its LBP value is equal to 0. To conclude, the method to search for the main peak and interference peaks is shown in Fig. 3 and briefly described as follows: (1) the LBP value of each element of the response map is calculated; (2) local maximum points are obtained from the LBP matrix; (3) the local maximum point with the largest response value is the main peak, and the other local maximum points whose response values are greater than a certain percentage θ of the overall maximum response value are retained as the interference peaks.
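The LBP-based peak search can be sketched as follows (a minimal NumPy illustration; the 8-neighbour ordering and the ≥ comparison are one plausible convention, since the paper defers to the LBP definition in [41]):

```python
import numpy as np

def local_maxima_lbp(R):
    """Locate local maxima of a response map through LBP values.

    For each interior element, one LBP bit is set per 8-neighbour that is
    >= the centre value; an LBP value of 0 therefore means every neighbour
    is strictly smaller, i.e. the element is a local maximum."""
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                  (1, 1), (1, 0), (1, -1), (0, -1)]
    peaks = []
    for x in range(1, R.shape[0] - 1):
        for y in range(1, R.shape[1] - 1):
            lbp = 0
            for bit, (dx, dy) in enumerate(neighbours):
                if R[x + dx, y + dy] >= R[x, y]:
                    lbp |= 1 << bit
            if lbp == 0:
                peaks.append((x, y))
    return peaks

# Two well-separated Gaussian bumps: one main peak plus one interference peak.
xx, yy = np.mgrid[0:40, 0:40]
R = np.exp(-((xx - 10) ** 2 + (yy - 10) ** 2) / 8.0)
R = R + 0.5 * np.exp(-((xx - 30) ** 2 + (yy - 28) ** 2) / 8.0)
peaks = local_maxima_lbp(R)
main_peak = max(peaks, key=lambda p: R[p])  # -> (10, 10)
```

Because each element is only compared against its 8 neighbours, the search is a single O(s1·s2) pass, which matters for the per-frame budget of an on-board tracker.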
The second step of the fusion scheme is to estimate the confidence index e of a response map according to the main peak and the interference peaks. Considering the three factors mentioned above, we use the following formula to compute e:

e = [ (1/num) Σ_{i=1}^{num} ( R(x_M, y_M) − R(x_I^i, y_I^i) ) ] / [ (1/num) Σ_{i=1}^{num} ‖(x_M, y_M) − (x_I^i, y_I^i)‖ ]  (9)

where R is a response map, (x_M, y_M) is the location of the main peak, (x_I^i, y_I^i) are the locations of the interference peaks, and num is the number of interference peaks. The numerator of (9) measures the difference between the response value of the main peak and those of the interference peaks, and the denominator represents the distance between the main peak and the interference peaks. The third step of the fusion scheme is to compute the fusion coefficients. For two response maps R_h and R_d corresponding to HOG and convolutional features, with confidence indexes e_h and e_d respectively, the fusion coefficients are calculated by:

σ_h = e_h / (e_h + e_d),  σ_d = e_d / (e_h + e_d)  (10)

In the last step, the final response map R is computed by combining R_h and R_d linearly:

R = σ_h R_h + σ_d R_d  (11)

It is worth pointing out that there are two special situations when computing the fusion coefficients σ_h and σ_d: (1) the σ value of a response map is equal to 1 if and only if the num of that response map is 0; (2) if the num values of both response maps are 0, the confidence index e is calculated as

e = (s1 × s2) / Σ_{i,j} B(i, j)  (12)

where s1×s2 is the size of the response map and B is a binary image computed by segmenting the response map with a threshold of θ times the maximum response value. To summarize, the fusion process of response maps consists of searching the peaks of each response map via LBP values, computing the confidence index e of each map by (9) or (12), computing the fusion coefficients by (10), and fusing the response maps by (11).
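The confidence-weighted fusion can be sketched as below (a hedged reading of Eqs. (9)-(11); the peak locations are assumed to be supplied by the LBP search, and the toy maps are illustrative):

```python
import numpy as np

def confidence_index(R, main, interferers):
    """Confidence index e of a response map, a hedged reading of Eq. (9):
    mean drop from the main peak to the interference peaks divided by the
    mean main-to-interference distance. Sharp, isolated peaks give large e."""
    xm, ym = main
    drops = [R[xm, ym] - R[x, y] for (x, y) in interferers]
    dists = [np.hypot(xm - x, ym - y) for (x, y) in interferers]
    return float(np.mean(drops) / np.mean(dists))

def fuse(R_h, R_d, e_h, e_d):
    """Eqs. (10)-(11): confidence-weighted linear fusion of two response maps."""
    s_h = e_h / (e_h + e_d)
    s_d = e_d / (e_h + e_d)
    return s_h * R_h + s_d * R_d, (s_h, s_d)

# Toy maps: the HOG map's interference peak is weaker and farther away, so
# the HOG map receives the larger fusion coefficient.
R_h = np.zeros((11, 11)); R_h[5, 5] = 1.0; R_h[5, 9] = 0.4
R_d = np.zeros((11, 11)); R_d[5, 5] = 1.0; R_d[5, 7] = 0.8
e_h = confidence_index(R_h, (5, 5), [(5, 9)])   # (1.0 - 0.4) / 4 = 0.15
e_d = confidence_index(R_d, (5, 5), [(5, 7)])   # (1.0 - 0.8) / 2 = 0.10
R_fused, (s_h, s_d) = fuse(R_h, R_d, e_h, e_d)  # s_h = 0.6, s_d = 0.4
```

Because the coefficients are normalized per frame, whichever feature currently produces the cleaner response map dominates the fused result.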

D. Scale Estimation and Model Update
This paper employs the interpolation strategy [35] that implements scale estimation by applying the filter to multiple resolutions of the search area. As shown in Fig. 1, five search areas with different scales first return the corresponding response maps. Secondly, the fused response map is calculated at each scale. Finally, the interpolation strategy is employed to maximize the detection score of each fused response map, and the estimated scale is the one with the highest score.
Furthermore, an online update strategy with a learning rate is utilized to improve the robustness of the tracker within the framework of the correlation filter. Specifically, x̂_m(t), the feature map of the samples for training the filters at the t-th frame, is calculated as

x̂_m(t) = (1 − η(t)) x̂_m(t − 1) + η(t) x̂(t)  (13)

where x̂(t) is the feature map based on the detection result at the t-th frame and η is the learning rate. If the learning rate were a fixed value, wrong information caused by incorrect tracking results could lead to drift or even failure of the tracking task. Therefore, the learning rate η(t) is determined by (14).
η(t) = η if num ≤ 4, and η(t) = 0 otherwise  (14)

Eq. (14) denotes that if the number of interference peaks is more than 4, the filter is not updated.
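The update rule of Eqs. (13)-(14) amounts to a guarded exponential moving average; a minimal sketch (the feature map is flattened to a list of scalars purely for illustration):

```python
def update_model(x_prev, x_new, num_interference, eta=0.013, max_peaks=4):
    """Adaptive model update of Eqs. (13)-(14).

    x_m(t) = (1 - eta(t)) * x_m(t-1) + eta(t) * x(t), with eta(t) set to 0
    (i.e. no update) once the response map shows more than max_peaks
    interference peaks, so an unreliable detection cannot corrupt the filter."""
    eta_t = eta if num_interference <= max_peaks else 0.0
    return [(1.0 - eta_t) * p + eta_t * n for p, n in zip(x_prev, x_new)]

clean = update_model([1.0], [2.0], num_interference=2)  # updated: [1.013]
noisy = update_model([1.0], [2.0], num_interference=7)  # frozen:  [1.0]
```

Freezing the model under ambiguous responses is what prevents a single occluded or cluttered frame from dragging the filter toward the background.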

IV. EXPERIMENTAL RESULTS
The proposed tracking algorithm is extensively evaluated in this section on a public benchmark dataset and in real experiments. In the first three parts, in order to test and compare the performance of the algorithms quantitatively and fairly in TIR pedestrian tracking, the challenging public benchmark TIR dataset PTB-TIR is used [42], and the initial bounding box for tracking is given by the ground truth. In the last part, the proposed algorithm is implemented and tested on an automated vehicle operating in real road scenarios, and the initial bounding box is entered manually.
The key parameters of the proposed tracker are set as follows. TIR-VGG-m-2048 is trained with 10,000 positive samples containing TIR pedestrians and 10,000 negative samples. The cell size of HOG is 4×4. Following the parameter configuration of the original BACF [16], the regularization parameter λ = 0.001 and the learning rate η = 0.013; the number of scales is set to 5 with a scale step of 1.01. The desired output of the correlation filter is a 2-D Gaussian function with a bandwidth of √(wh)/16, where w×h is the size of the tracked target. For the ADMM optimization, we set the iteration numbers of the HOG-based filters and the convolutional-feature-based filters to 2 and 5, respectively. The threshold θ used to identify the interference peaks is set to 30%.
The proposed tracker is evaluated in MATLAB R2019a on a PC with an Intel(R) Core(TM) i3 CPU (3.4 GHz) and 6 GB of memory. To assess the practical applicability of the proposed tracking algorithm in V-DAS, we evaluate the trackers by one-pass evaluation (OPE), which runs the tracker throughout a test sequence with initialization from the ground-truth position in the first frame. In the first three parts, the tracking results on the PTB-TIR dataset are quantitatively evaluated using two standard evaluation metrics, the precision plot and the success plot [42].
The test dataset PTB-TIR consists of 60 TIR test sequences [42]. We test the tracking precision of the trackers separately on 8 challenges that are usually encountered in practical applications of V-DAS: deformation (60 test sequences), occlusion (39), scale variation (35), background clutter (52), low resolution (14), motion blur (22), fast motion (8), and thermal crossover (22).

A. Performance Comparison With Different Convolutional Features in DCF and BACF Frameworks
In this part, we evaluate the tracking performance of different convolutional features in the two frameworks, DCF and BACF. Trackers generally use the shallow outputs of a pre-trained network as the feature representation. Therefore, this part utilizes the outputs of Conv1 and Conv2 of TIR-VGG-m-2048 in the DCF and BACF frameworks to compare their tracking performances. In this experiment, the precision scores of the different features are presented and ranked in Fig. 4(a) for a given location error threshold of 20 pixels. Moreover, we use the area-under-curve (AUC) score shown in Fig. 4(b) for performance assessment in the success plot.
It can first be observed in Fig. 4 that the trackers using the Conv1 outputs as features outperform those using the Conv2 outputs in both the DCF and BACF frameworks. Furthermore, the tracking performance reduces significantly as the convolutional layer deepens in the BACF framework. More specifically, the distance precision score drops from 72.8% for the first layer to 48.3% for the second, and the success rate from 52.1% to 35.3%. Secondly, we also find that the DCF framework achieves better performance when the outputs of the same layer are used as the feature. This confirms our analysis in Section III-B: the tracking performance of BACF is expected to be lower than that of DCF when only convolutional features from the same layer are used, since the size of the target area in the feature map becomes smaller. Therefore, we use the outputs of Conv1 of TIR-VGG-m-2048 as the convolutional feature.

B. Ablation Studies
Although directly using convolutional features in the BACF has the deficiencies discussed above, in order to give full play to the advantages of convolutional features, we have proposed a fusion scheme to introduce them into the BACF framework, as described in Section III. In this subsection, we conduct experiments to demonstrate the contribution of the proposed tracker by comparing it with the original BACF tracker and with a tracker using another fusion scheme. We set the original BACF tracker using only the HOG feature as the Baseline method. To further validate the superiority of the proposed fusion scheme, we also implement a tracker using the concatenation scheme, which directly concatenates different feature maps along the channel direction, referred to as Concatenation [25].
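For contrast with the proposed response-map fusion, the Concatenation scheme can be sketched in a couple of lines (the channel counts and grid size are assumptions for illustration, not taken from the paper):

```python
import numpy as np

# Channel concatenation: both feature maps are resampled to a common spatial
# grid and stacked along the channel axis, so a single filter bank must be
# trained on d_h + d_c channels instead of d_h. Illustrative shapes:
# 31-channel HOG and a 96-channel Conv1 output on a 50 x 50 grid.
hog = np.zeros((50, 50, 31))
conv1 = np.zeros((50, 50, 96))
stacked = np.concatenate([hog, conv1], axis=2)  # shape (50, 50, 127)
```

With 127 channels instead of 31, the filter bank has roughly four times as many parameters while the number of training samples is unchanged, which is consistent with the over-fitting observed in Fig. 5.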
We first compare the Baseline method with the Concatenation method, which is widely used in the ECO [39], SAMF [43], and DSST [30] tracking algorithms. Fig. 5 shows that the tracking performance decreases when the concatenation scheme is applied in the framework of BACF. Since the concatenation scheme requires more filters than the Baseline method while the available training samples do not change, the increased feature dimension results in a higher probability of over-fitting. Consequently, the target localization accuracy cannot be improved.
Secondly, we validate the effectiveness of the proposed tracker by comparing it with the Baseline and Concatenation methods. The proposed tracker improves on the Baseline method by 2.9% and 0.8% in the precision and success plots, respectively. This shows that the proposed fusion scheme fully integrates the advantages of convolutional and handcrafted features to improve tracking performance, and thus successfully introduces convolutional features into the BACF framework.
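The response-map fusion underlying these results can be sketched as a coefficient-weighted sum followed by peak localization; the coefficients `w_hc` and `w_conv` are placeholders here, whereas the paper derives them from the LBP-based main-peak and interference-peak analysis:

```python
import numpy as np

# Hedged sketch of response-map fusion in a correlation-filter tracker.
# r_hc / r_conv: response maps from handcrafted and convolutional features.
# w_hc / w_conv: fusion coefficients (placeholders for the paper's
# LBP-derived values).
def fuse_responses(r_hc, r_conv, w_hc, w_conv):
    s = w_hc + w_conv
    fused = (w_hc / s) * r_hc + (w_conv / s) * r_conv
    # The target is located at the peak of the fused response map.
    dy, dx = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, (dy, dx)
```

When one feature's response is more reliable (larger coefficient), its peak dominates the fused map, so the tracker falls back gracefully to the stronger cue frame by frame.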

C. Comparative Experiments With Representative Trackers
In this subsection, we provide a comprehensive comparison of our proposed tracker with 9 state-of-the-art tracking methods: ECO [38], ASRCF [44], AutoTrack [45], DSST [30], MCFTS [26], HDT [46], SAMF [43], HCF [47], and KCF [48]. According to their feature representations, these trackers fall into three categories. HDT constructs a strong tracker by combining weak CNN trackers built on multiple convolutional layers. As an algorithm specifically developed for TIR target tracking, MCFTS is also compared with the proposed tracker. DSST, SAMF, and ECO likewise exploit multiple features to enhance tracking performance, but unlike our fusion scheme, they use concatenation to fuse features.
As shown in Fig. 6, the precision and success plots over all 60 videos in the PTB-TIR dataset demonstrate that the proposed algorithm performs favorably against the other 9 algorithms. As shown in the legend of Fig. 6(a), the proposed tracker achieves the best overall performance, e.g., a distance precision score of 83.5% when the location error threshold is set to 20 pixels, which is far better than those of most of the other 9 trackers. Moreover, the proposed tracker achieves the best overall performance in terms of AUC, with a score of 60.8%. It should be noted that the ECO tracker is recognized as one of the most successful trackers [17], and our algorithm performs slightly better. Compared with MCFTS, which is specifically developed for TIR target tracking, our tracker exhibits outstanding performance on TIR targets. Meanwhile, our tracker significantly outperforms the compared single-feature trackers, such as KCF and HDT. Table II summarizes the comparison of the precision scores of the 10 trackers on 8 different challenges, with the location error threshold set to 20 pixels. It is evident that our tracker achieves the best performance on all the challenges that a V-DAS often encounters. Therefore, the proposed tracker improves the performance of V-DAS across a wide range of driving conditions. The last row of Table II shows the computational speeds (frames per second) of all the compared trackers, illustrating that the computational burden of the proposed tracker is quite modest: among the trackers using convolutional features, it is slower only than ECO. In fact, the computational complexity is proportional to the size of the search area s₁×s₂ and the feature dimension d.
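The distance precision score reported above can be sketched as the fraction of frames whose predicted centre lies within the threshold of the ground-truth centre; variable names are ours, not the benchmark's:

```python
import numpy as np

# Sketch of the distance-precision metric behind Fig. 6 / Table II:
# the fraction of frames with centre-location error <= threshold (20 px).
def distance_precision(pred_centers, gt_centers, threshold=20.0):
    pred = np.asarray(pred_centers, dtype=float)
    gt = np.asarray(gt_centers, dtype=float)
    err = np.linalg.norm(pred - gt, axis=1)  # per-frame centre error
    return float(np.mean(err <= threshold))
```

Sweeping the threshold from 0 upward yields the precision plot; the score at 20 pixels is the single number quoted in the legend.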

D. Qualitative Evaluation in Real Road Scenarios
Our proposed tracker is also tested on a remotely controlled mobile vehicle in real road scenarios. As shown in Fig. 7, this platform mainly consists of a TIR camera for video capture, a portable computer for data processing, and power supplies fixed in the vehicle. The TIR camera is an FLIR A35, which measures temperatures over the wide range of −40 °C to 550 °C. It also has a high thermal sensitivity of less than 50 mK, capturing very small temperature differences to obtain clear TIR images with a resolution of 336×256 pixels. The mobile platform is a D1 robot developed by EAI Technologies. The test road, located on the Jiangjun Lu campus of Nanjing University of Aeronautics and Astronautics (NUAA), consists of three straight sections of about 300 meters each and two turning sections. The average speed of the test vehicle is 15 km/h and the top speed is 30 km/h. Three situations common in real scenarios are tested: tracking a walking pedestrian, a running pedestrian, and a pedestrian crossing the road. Each scenario contains 200 image frames, and the average processing speed of the proposed algorithm on this platform is 85 ms per frame with GPU acceleration (NVIDIA GeForce GTX 1060, 6 GB). The tests are performed on the campus road at night at an ambient temperature of 18 °C. Due to the lack of credible ground truth, we only perform qualitative evaluations for the experiments in real scenarios.
A pedestrian walking on the road usually moves slowly, shows inconspicuous changes in form, and undergoes significant scale variation as the relative distance between the vehicle and the pedestrian changes. Typical tracking results in Fig. 8(a) demonstrate that the proposed algorithm tracks the target reliably with small tracking error. A running pedestrian moves relatively fast, so the morphological changes and scale variation are more obvious, and these factors may blur the target. Typical tracking results in Fig. 8(b) show that our algorithm tracks the target successfully. A pedestrian crossing the road poses a high safety risk and is prone to collisions; in this case the pedestrian usually turns sideways, exhibits obvious morphological changes, and accelerates, which results in fast motion in the image sequence. Fig. 8(c) shows that our tracking algorithm tracks the target reliably while the pedestrian is crossing the road. These three experiments confirm that the algorithm proposed in this paper can be integrated into vehicle platforms as a module of V-DAS. However, since the thermal modality is less discriminative than visible-light imagery, the possibility of TIR tracking failure may increase when similar targets appear in the background. Therefore, the proposed tracker needs to be tested thoroughly in different and more complex real scenarios in the future.

V. CONCLUSION
This paper presents a TIR pedestrian tracking algorithm for pedestrian tracking in nighttime or low-visibility conditions, which can be implemented in V-DAS. By investigating and comparing the convolutional features of a pre-trained TIR-VGG-m-2048 model in the DCF and BACF frameworks, we reveal that using convolutional features alone in BACF is not conducive to improving tracking performance. Therefore, to take advantage of the strengths of both convolutional features and the BACF framework, this paper proposes a fusion scheme that combines convolutional features with handcrafted features within the BACF framework. To assess and fuse the response maps, this paper proposes a novel approach to searching for the main peak and the interference peaks of a response map by establishing the relationship between LBP values and local maximum points. Extensive tests have been conducted on public benchmarks covering the typical challenges of TIR tracking, as well as real-road experiments. Comparative experimental results show that the proposed tracking algorithm performs favorably on the PTB-TIR dataset against 9 state-of-the-art tracking algorithms. The experimental results in real road scenarios confirm the effectiveness of our tracker and show that it can be used in vehicle platforms as an important module of V-DAS.
Furthermore, we are exploring the extension of the proposed framework to reliable pedestrian tracking for vehicles operating under all weather conditions by combining RGB visible images with thermal images. This will also alleviate the difficulties in TIR multiple-object tracking caused by the low discrimination of TIR cameras.