A Collaborative Learning Tracking Network for Remote Sensing Videos

With the increasing accessibility of remote sensing videos, object tracking in remote sensing videos is attracting growing attention. However, accurately detecting and tracking objects in complex remote sensing scenes remains a challenge. In this article, we propose a collaborative learning tracking network for remote sensing videos, including a consistent receptive field parallel fusion module (CRFPF), a dual-branch spatial-channel co-attention (DSCA) module, and a geometric constraint retrack strategy (GCRT). Since it is difficult for general feed-forward networks to extract effective features from the small objects in remote sensing scenes, we propose the CRFPF-module, which establishes parallel branches with consistent receptive fields to extract features from shallow to deep layers separately and then fuses the hierarchical features adaptively. Since the objects and their background are difficult to distinguish, the proposed DSCA-module uses the spatial-channel co-attention mechanism to collaboratively learn the relevant information, which enhances the saliency of the objects and regresses precise bounding boxes. Considering the interference of similar objects, we design a GCRT-strategy that judges whether a false detection has occurred from the estimated motion trajectory and then recovers the correct object by weakening the feature response of the interference. The experimental results and theoretical analysis on multiple datasets demonstrate our proposed method’s feasibility and effectiveness. Code and net are available at https://github.com/Dawn5786/CoCRF-TrackNet.

Static remote sensing images can no longer meet the demand for dynamic detection of ground objects. Video satellites, in contrast, can obtain time-series dynamic images of observation areas, providing rich information for many applications such as traffic condition monitoring, rapid response to natural disasters, and military security [1]. Object tracking is one of the key technologies in video analysis and understanding applications [2]. Many advanced trackers for natural videos have been proposed [3], [4].
The emergence of very high-resolution (VHR) video satellites provides the possibility for remote sensing video tracking. Since remote sensing videos are taken from high altitudes, with wide viewing angles and complex scenes, the feature distribution of ground objects is vastly different from that in natural videos. Therefore, accurate real-time remote sensing video tracking remains a particular challenge.
Object tracking usually establishes an appearance model from the object marked in the first frame and then detects the designated object in subsequent frames. A general tracking framework consists of a search strategy, feature extraction, and an observation model. According to the feature extraction method, trackers can be mainly divided into traditional methods and deep learning methods.
In traditional methods, based on different ways of observation, there are generative and discriminant models [5]. Generative models usually extract object features to construct an appearance model, such as optical flow [6], Kanade Lucas Tomasi (KLT) [7], meanshift [8], etc. However, generative models do not make full use of background information and appearance changes. Discriminant models, such as tracking learning detection (TLD) [9], Struck [10], and MOSSE [11], instead compare the object and the background simultaneously through a discriminant function. Later, the kernel correlation filter (KCF) [12] introduced the fast Fourier transform for real-time online training, although the trained filters limit the generality of such trackers.
In deep learning methods, trackers use deep features with powerful representation instead of handcrafted features [13]-[15]. Early works, such as efficient convolution operators (ECO) [16] and hedged deep tracking (HDT) [17], directly use pretrained models to extract deep features. However, due to the poor generalizability of pretrained models, trainable deep trackers have been proposed: SiamFC [18] uses a siamese network to train a similarity metric for matching; CFNet [19] rewrites the correlation filter as a differentiable neural-network layer to train convolutional features; SiamRPN [20] constructs a siamese feature-extraction subnetwork and a region proposal subnetwork to improve tracking accuracy; ATOM [21] builds a two-stage tracker with a target classifier and a regression estimation network; DIMP [22] further utilizes a weight prediction module to initialize the classification network well.

Fig. 1. Overall process framework, including a consistent receptive field parallel fusion module (CRFPF), a dual-branch spatial-channel co-attention module (DSCA), and a geometric constraint retrack strategy (GCRT).
In recent years, some tracking methods peculiar to remote sensing videos have come into being. Du et al. [23] combined KCF [12] and a three-frame difference algorithm to build a strong tracker. Guo et al. [24] designed a correlation filter incorporated with a Kalman filter (CFKF) to correct the tracking trajectory of the moving target. Wang et al. [25] designed a Gabor filter on CSK [26] to enhance object features. Later, Hu et al. [27] extracted deep features with pretrained deep neural networks. PASIAM [28] uses a shallow siamese network to match object features and predict attention to deal with occlusion. Although the above studies have achieved good performance, research on remote sensing video tracking is still in its infancy. Some significant challenges remain, as follows.
1) Since satellite videos are taken from high altitudes, the objects of interest are usually small and carry little useful information. However, present trackers do not sufficiently extract a series of hierarchical features for such small objects in particular.

2) Due to the complex atmospheric medium in remote sensing videos (e.g., clouds and fog), objects usually resemble their surrounding background, which makes accurate tracking difficult. As far as we know, existing trackers cannot adaptively enhance objects against dynamic backgrounds.

3) When the object to be tracked moves near similar objects, the tracker is easily disturbed and drifts to the wrong object, which dramatically affects the success rate of trackers.

Based on the analysis above, we propose a deep collaborative learning tracking network for remote sensing videos. The overall framework of our network is shown in Fig. 1.

1) For small objects, we design parallel multiresolution branches with consistent receptive fields on the same-level layers, extract the hierarchical features of small objects from shallow to deep layers, and finally fuse them effectively with adaptive weights.
2) To reinforce object regions, we introduce a collaborative attention learning mechanism based on the cross-correlation of the input frames. It performs collaborative attention learning in the spatial and channel domains, thereby enlarging the difference between object and background, and an additional attention-loss is designed to further enhance the saliency of the objects.

3) For the tracked results, we design a retrack strategy that judges, through geometric constraints, whether an object has been falsely tracked, then weakens the feature response of the interfering objects and finally retracks the actual object.

The remainder of this article is organized as follows. Section II introduces the three modules of the proposed framework in detail; Section III contains the analysis of the experimental results and the effect of each module; finally, Section IV concludes this article.

A. Consistent Receptive Field Parallel Fusion Module
In remote sensing videos, the objects are usually small, which becomes the bottleneck of tracking accuracy. In recent works [29], feature pyramid structures are constructed to meet the challenge of small objects. However, they are usually designed for general recognition of multiple categories of objects at arbitrary scales, so they lack specificity for detecting small objects. Based on this, we construct the CRFPF-module.
The input of this module is an RGB image patch $I_1^0 \in \mathbb{R}^{M_1^0 \times M_1^0 \times 3}$, which is cropped around the center of the object.

1) Construction of Parallel Image Pyramid Input:
A single-resolution input makes it difficult to balance the extraction of deep and shallow features for extremely small objects. In order to extract richer hierarchical semantic representations, we construct a parallel image pyramid by multiresolution sampling of the original image patch $I_1^0$ and obtain a series of image patches $\{I_0^0, \ldots, I_k^0, \ldots, I_K^0\}$ with corresponding sample rates $\{\alpha_0, \ldots, \alpha_k, \ldots, \alpha_K\}$ defined in (1). These image patches correspond to the inputs of $K+1$ parallel convolution branches $\{B_{I_0}, \ldots, B_{I_k}, \ldots, B_{I_K}\}$. According to the principle of deep neural networks, a higher resolution branch tends to extract deep semantic features, while a lower resolution branch tends to extract shallow detailed features.
Among them, $B_{I_0}$ is a down-sampling branch with $\alpha_0 < 1$ to capture global location feature information from a lower resolution image patch.
The pyramid input is thus constructed by cropping a patch centered on the object and resampling it at each rate $\alpha_k$; a minimal sketch of this sampling process is given below.
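The sketch below is illustrative only and not the authors' released code: the sample rates and the use of bilinear interpolation are assumptions, and the helper build_pyramid is hypothetical (the paper's exact rates are defined by its (1)).

```python
# Illustrative sketch of the parallel image-pyramid input construction.
import torch
import torch.nn.functional as F

def build_pyramid(patch, sample_rates=(0.5, 1.0, 2.0, 4.0)):
    """patch: (1, 3, M, M) tensor cropped around the object center."""
    pyramid = []
    for alpha in sample_rates:      # alpha_0 < 1 gives the down-sampling branch B_I0
        size = max(1, int(round(patch.shape[-1] * alpha)))
        pyramid.append(F.interpolate(patch, size=(size, size),
                                     mode="bilinear", align_corners=False))
    return pyramid

# Example: a 72 x 72 crop (the input size used in Section III-B) yields four
# patches feeding the parallel branches B_I0, ..., B_I3 (K = 3).
patches = build_pyramid(torch.randn(1, 3, 72, 72))
```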

2) Feature Extraction With Consistent Receptive Field:
For the image pyramid input, different features are obtained through the parallel convolutional branches. The actual scene regions corresponding to the receptive fields of these features are usually inconsistent, which is not conducive to subsequent feature fusion. To prevent misalignment when fusing features from different branches, we aim to maintain a consistent receptive field on the same layer for all branches (see the yellow dotted lines in Fig. 2). Note that within a branch, a series of consecutive convolutions of fixed size forms a level layer, and one branch has $L$ level layers in Fig. 2. In this regard, we use dilated convolution [30], which changes the receptive field with the same convolution kernel size, to align the features. The dilation rates of the level layers in branch $B_{I_k}$ are recorded as $\{r_k^1, \ldots, r_k^l, \ldots, r_k^L\}$. Starting from branch $B_{I_1}$, its original RGB image patch $I_1^0 \in \mathbb{R}^{M_1^0 \times M_1^0 \times 3}$ passes through dilated convolution operations to generate the first level layer feature $F_1^1 \in \mathbb{R}^{M_1^1 \times M_1^1 \times 3}$ as in (2), where $Ker$ is the convolution kernel size and $r_1^1$ is the dilation rate of the first level layer in $B_{I_1}$.
Among the parallel branches, given the sampling rates $\alpha_k$ $(k = 0, 1, \ldots, K)$ and the dilation rate $r_1^1$ of branch $B_{I_1}$, the dilation rate $r_k^1$ of each branch in the first level layer can be calculated as in (3). To resist misalignment during the fusion of features from different branches, we keep a consistent receptive field on the same level layer for all branches. Therefore, the obtained $\{r_1^1, \ldots, r_K^1\}$ are substituted into (2) and $\{F_0^1, \ldots, F_K^1\}$ are determined. We design the above rules to ensure that features of same-level layers have a consistent receptive field among different branches; that is, they respond to the same real scene region of the original image patches.
For branch $B_{I_k}$, with $r_k^1$ obtained above, the receptive field of the $i$th level layer feature $F_k^i$ $(i = 1, \ldots, L)$ follows the recurrence formula in (4), where $\cdot$ denotes element-wise multiplication. In this process, the current level layer features of a lower resolution branch have the same dimensions as the next level layer features of its adjacent higher resolution branch (e.g., $F_{k-1}^{l-1}$ and $F_k^l$ in Fig. 2). So far, the features of all branches can be determined. The specific steps are listed as follows.
1) First, build a standard convolutional network branch $B_{I_1}$ as the feature extraction branch of the initial image patch $I_1^0$. The remaining branches are then built with the dilation rates derived above so that same-level features remain consistent in dimension. In addition, the two input branches (template frame and current frame) share weights to ensure that features are extracted in a unified feature space for both frames, as shown in Fig. 1.
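Since the exact recurrence of (4) is not reproduced above, the sketch below only shows the standard receptive-field arithmetic for stacked dilated convolutions, which is the mechanism the consistent-receptive-field rule relies on; the kernel sizes, strides, and dilation rates are illustrative numbers, not the values in Table II.

```python
# Generic receptive-field bookkeeping for a stack of dilated convolutions.
# Textbook arithmetic (not the paper's Eq. (4)): a dilated conv has an
# effective kernel of k + (k - 1)(r - 1), and the receptive field grows by
# (k_eff - 1) times the cumulative stride of the preceding layers.
def receptive_field(kernel_sizes, dilation_rates, strides):
    rf, jump = 1, 1
    for k, r, s in zip(kernel_sizes, dilation_rates, strides):
        k_eff = k + (k - 1) * (r - 1)
        rf += (k_eff - 1) * jump
        jump *= s
    return rf

# A branch fed with a 2x up-sampled patch can roughly match the receptive
# field (measured in original-image pixels) of the base branch by enlarging
# its dilation rate (illustrative numbers only).
rf_base = receptive_field([3, 3, 3], [1, 1, 1], [1, 1, 1])   # base-resolution branch
rf_up = receptive_field([3, 3, 3], [2, 2, 2], [1, 1, 1])     # 2x-resolution branch
print(rf_base, rf_up / 2)  # 7 vs 6.5: roughly the same extent in original pixels
```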

3) Adaptive Feature Fusion:
According to the principle of deep learning, objects of various sizes prefer different levels of features to different degrees, so fusing features with equal weights is not particularly suitable. Therefore, we propose an adaptive feature fusion scheme to obtain the final fused feature $F_{fuse}$ as in (5), where $\lambda_k$ represents the fusion weight of feature $F_k^L$, normalized from $\beta_k$ into the interval $[0, 1]$, and $\cdot$ denotes element-wise multiplication.
The proportion $\beta_k$ is determined by the object size $\sqrt{w \times h}$, the image patch size $M_1$, and the sample rate $\alpha_k$ as in (6). Given a certain object size, $\beta_k$ decreases as $\alpha_k$ increases; that is, a branch with a larger sample rate has a smaller fusion weight. Besides, given the sample rate of each branch, as the object size increases, the weight gap among the branches decreases and tends to even out. In this way, the smaller the object, the larger the proportion of shallow detail features, which is conducive to the precise detection of small objects. The function $\ln(\cdot)$ is used to prevent violent jitter of the fusion weights caused by object size changes.
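Because (6) is not reproduced here, the sketch below only assumes a softmax-style normalization of size-dependent scores with a logarithmic damping term; it reproduces the qualitative behaviour described above (a larger sample rate receives a smaller weight, and the weights even out as the object grows), and the scoring function fusion_weights is hypothetical.

```python
import math
import torch

def fusion_weights(obj_w, obj_h, patch_size, sample_rates):
    # Hypothetical scoring: beta_k shrinks as alpha_k grows, the gap between
    # branches narrows for larger objects, and ln(.) damps abrupt weight
    # changes when the estimated object size jitters.
    size = math.sqrt(obj_w * obj_h)
    betas = torch.tensor([math.log(1.0 + patch_size / (alpha * size))
                          for alpha in sample_rates])
    return torch.softmax(betas, dim=0)          # lambda_k, normalized into [0, 1]

def adaptive_fuse(features, lambdas):
    # features: list of (1, C, H, W) branch outputs resampled to a common shape.
    return sum(lam * feat for lam, feat in zip(lambdas, features))

lambdas = fusion_weights(obj_w=8, obj_h=12, patch_size=72,
                         sample_rates=(0.5, 1.0, 2.0, 4.0))
```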
In general, for the CRFPF-module, the multiresolution parallel branches fully extract the low-level detailed information and high-level semantic information of small objects from shallow to deep layers; the consistent receptive fields let features of different branches in the same level layer correspond to the same real scene range; and the adaptive fusion not only aggregates information from multiple layers but also reduces the deviation caused by feature misalignment.

B. Dual-Branch Spatial-Channel Co-Attention Module
To increase the saliency of objects, we propose the DSCA-module, composed of a target classifier and an intersection over union (IoU) regressor, both based on the collaborative attention (co-attention) mechanism.
1) Spatial Co-Attention Module: To highlight the object region saliency from the spatial structure, we construct a spatial co-attention module on the target classifier. The structure is shown in Fig. 3(a).
1) Feature Initialization: The inputs, $Z \in \mathbb{R}^{W \times H \times C}$ and $X \in \mathbb{R}^{W \times H \times C}$, are the fused features extracted by the CRFPF-module from the template frame and the current frame. They go through several convolution operations to generate the initial features $Z_S^0 \in \mathbb{R}^{W \times H \times C}$ and $X_S^0 \in \mathbb{R}^{W \times H \times C}$. 2) Generate Spatial Co-Attention Map: We calculate the spatial co-attention map $S_r$ describing the cross-correlation between the two initial features $Z_S^0$ and $X_S^0$, which are first flattened into $Z_S^1$ and $X_S^1$ with $N = W \times H$ spatial positions. We obtain the spatial co-attention map $S_r \in \mathbb{R}^{N \times N}$ as in (7). We design $S_r$ to measure the similarity of $Z_S^1$ and $X_S^1$ at corresponding spatial positions: a larger value in $S_r$ indicates higher similarity of the corresponding positions, and vice versa.
3) Feature Modulation: We modulate the initial current-frame feature $X_S^1$ with the spatial co-attention map $S_r$, based on the current-frame feature $X$, and calculate the cross-correlation feature $X_S^{co} \in \mathbb{R}^{W \times H \times C}$ as in (8), where the reshape operation maps its input back to the same dimensions as $X$. Therefore, according to the degree of cross-correlation between the template frame and the current frame, we enhance the object region signal and enlarge the difference in feature distribution between the object and its background. 4) Saliency Attention-Loss: Since the feature information of small objects is very limited, the feedback loss from the classifier alone is not enough to improve the saliency of the object. Therefore, we design an additional attention-loss to further focus on the neighborhood of the object, thereby further enlarging the gap between the object and the background. For this attention loss, the saliency mask $S_p \in \mathbb{R}^{W \times H \times 1}$ is generated as in (9), where $\mathrm{conv}(\cdot)$ represents several $1 \times 1$ convolution operations.
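As a rough illustration of the map generation, feature modulation, and saliency-mask steps described above, the sketch below computes an N x N co-attention map from flattened template and current-frame features, modulates the current-frame feature, and produces the mask; the 1 x 1 embeddings, the softmax normalization, and the residual connection are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class SpatialCoAttention(nn.Module):
    # Minimal sketch of the spatial co-attention front-end of the classifier;
    # shapes follow the text (N = W*H), but the details are assumptions.
    def __init__(self, channels):
        super().__init__()
        self.embed_z = nn.Conv2d(channels, channels, 1)   # template embedding
        self.embed_x = nn.Conv2d(channels, channels, 1)   # current-frame embedding
        self.mask_head = nn.Conv2d(channels, 1, 1)        # 1x1 conv producing S_p

    def forward(self, z, x):
        b, c, h, w = x.shape
        z1 = self.embed_z(z).flatten(2)                   # (B, C, N), template
        x1 = self.embed_x(x).flatten(2)                   # (B, C, N), current frame
        s_r = torch.softmax(z1.transpose(1, 2) @ x1 / c ** 0.5, dim=1)   # (B, N, N)
        x_co = (x1 @ s_r).view(b, c, h, w) + x            # modulated feature, original kept
        s_p = torch.sigmoid(self.mask_head(x_co))         # saliency mask S_p in [0, 1]
        return x_co * s_p, s_p                            # attended feature for the classifier

# Example: 64-channel 18x18 fused features from the CRFPF-module.
att = SpatialCoAttention(64)
feat, mask = att(torch.randn(2, 64, 18, 18), torch.randn(2, 64, 18, 18))
```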
The additional attention-loss function is given in (10), where $G$ is a binary label obtained from the ground-truth boxes: the values at the object location are set to 1 and the background to 0. $\eta$ is the percentage of positive values in the binary label template, used to prevent an imbalance between positive and negative samples, with $g \in G$ and $s_p \in S_p$. In the tracking process, we multiply the modulated feature $X_S^{co}$ with the saliency mask $S_p$ to obtain the final attention feature $X_S^C$ for the target classifier. The subsequent target classifier consists of several fully convolutional layers, and its loss is the least-squares function in (11), where $Y$ is the Gaussian sampling label centered at the object, $y \in Y$, $x_s^c \in X_S^C$, and $\mu$ controls the amount of regularization on $\omega$. The overall loss is given in (12), where $\xi \in [0, 1]$ balances the two losses.
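A hedged sketch of these losses follows. Since (10) is not reproduced above, the class-balanced cross-entropy below is only one plausible reading of the eta-weighted attention loss; the least-squares classifier loss and the xi-weighted combination follow the text (and the weights quoted in Section III-B).

```python
import math
import torch
import torch.nn.functional as F

def attention_loss(s_p, g):
    # g: binary label from the ground-truth box (1 = object, 0 = background).
    # eta: fraction of positives, used to re-balance the two classes -- one
    # plausible form of the eta-weighted loss, not necessarily Eq. (10).
    eta = g.mean().clamp(1e-6, 1 - 1e-6)
    pos = -(1 - eta) * (g * torch.log(s_p.clamp_min(1e-6))).mean()
    neg = -eta * ((1 - g) * torch.log((1 - s_p).clamp_min(1e-6))).mean()
    return pos + neg

def classifier_loss(scores, y, weights, mu=math.exp(-4)):
    # Least-squares loss against the Gaussian label Y, plus L2 regularization
    # on the classifier weights omega (mu = e^-4, as stated in Section III-B).
    return F.mse_loss(scores, y) + mu * sum((w ** 2).sum() for w in weights)

def spatial_loss(s_p, g, scores, y, weights, xi=0.1):
    # Overall loss: xi weights the attention loss and (1 - xi) the classifier
    # loss (xi = 0.1 is the value chosen in Section III-C).
    return xi * attention_loss(s_p, g) + (1 - xi) * classifier_loss(scores, y, weights)
```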
2) Channel Co-Attention Module: In the IoU predictor, each channel of the feature contributes differently to accurate prediction. To increase the proportion of beneficial channels under the guidance of the precise ground truth, we propose a channel co-attention module, as shown in Fig. 3(b).
1) Feature Initialization: The inputs are the ground-truth region $B_g$ within the template frame feature $Z$ and the proposal region $B_e$ within the current frame feature $X$, denoted as $Z(B_g)$ and $X(B_e)$. Both are fed through convolution operations and then precise region-of-interest pooling (PrPool) [31] as in (13) to obtain continuous down-sampling of the same dimension, yielding the initial features $Z_C^0, X_C^0 \in \mathbb{R}^{J \times J \times C}$, where $B$ represents either $B_g$ or $B_e$, $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of the upper-left and lower-right corners of $B$, and $f(x, y)$ is the value of the feature map at coordinates $(x, y)$. 2) Generate Channel Co-Attention Map: We flatten $Z_C^0$ and $X_C^0$ into $Z_C^1 \in \mathbb{R}^{J^2 \times C}$ and $X_C^1 \in \mathbb{R}^{J^2 \times C}$ along the channel axis and obtain the channel co-attention map $S_c \in \mathbb{R}^{C \times C}$ as in (14). The resulting $S_c$ reflects the degree of cross-correlation between the channels of $Z_C^0$ and $X_C^0$. We design $S_c$ to measure the similarity between channels: larger values indicate higher similarity, and vice versa. According to the cross-correlation scores, the channels that are highly similar to those of the ground-truth object are enhanced.
3) Channel Modulation: We generate the channel-modulated feature $X_C^{co} \in \mathbb{R}^{J \times J \times C}$ with $S_c$ as in (15), where the reshape operation maps its input back to the same dimensions as $X_C^0$. This not only adaptively adjusts the channel weights but also fully preserves the original feature information. 4) IoU Estimate: Given $X_C^{co}$, the IoU value is predicted by several fully connected layers, and the IoU regression loss is computed as in (16). The DSCA-module is thus complete. In this module, we introduce the cross-correlation between the template frame and the current frame and exploit this cross-correlation information to enhance the saliency of the object region in the spatial domain and to prefer high-quality features in the channel domain. Note that the target classification branch and the IoU regression branch focus on different domains and cannot be interchanged. In addition, we design a saliency attention-loss to further enlarge the difference in feature distribution between objects and similar backgrounds.
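The sketch below illustrates the channel co-attention front-end under stated assumptions: PrPool [31] is approximated with torchvision's roi_align, and the embedding and normalization choices are placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ChannelCoAttention(nn.Module):
    # Minimal sketch of the channel co-attention feeding the IoU regressor.
    def __init__(self, channels, pool_size=5):
        super().__init__()
        self.embed = nn.Conv2d(channels, channels, 3, padding=1)
        self.pool_size = pool_size

    def _pool(self, feat, boxes):
        # boxes: float tensor (B, 4) as (x1, y1, x2, y2) in feature-map coordinates.
        idx = torch.arange(len(boxes), dtype=feat.dtype, device=feat.device)
        rois = torch.cat([idx.unsqueeze(1), boxes], dim=1)            # (B, 5)
        return roi_align(self.embed(feat), rois,
                         (self.pool_size, self.pool_size), aligned=True)  # (B, C, J, J)

    def forward(self, z, x, gt_box, proposal_box):
        z0 = self._pool(z, gt_box)            # ground-truth region of the template frame
        x0 = self._pool(x, proposal_box)      # proposal region of the current frame
        b, c, j, _ = x0.shape
        z1 = z0.flatten(2).transpose(1, 2)    # (B, J*J, C)
        x1 = x0.flatten(2).transpose(1, 2)    # (B, J*J, C)
        s_c = torch.softmax(z1.transpose(1, 2) @ x1 / j, dim=1)       # (B, C, C) channel map
        x_co = (x1 @ s_c).transpose(1, 2).reshape(b, c, j, j) + x0    # modulated, original kept
        return x_co                           # fed to the fully connected IoU regressor
```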

C. Geometric Constraint Retrack Strategy
In remote sensing videos, objects are small and have little texture information, so their indiscernible appearance makes it difficult to retrieve them once lost. Therefore, we propose a retrack strategy based on geometric constraints to reduce false detections, as shown in Fig. 4.
Given the results of the previous $T$ frames $\{B^{t-1}, \ldots, B^{t-T}\}$, we judge whether the estimated result $B_{org}^t(x_c^t, y_c^t, w^t, h^t)$ of the current $t$th frame needs to be retracked, where $(x_c^t, y_c^t)$ is the center of the box and $w^t$ and $h^t$ are its width and height, respectively. For the box of the $t$th frame, we update the corresponding response of the mask $S_p$ by an attenuation factor $\varphi$ as in (17). We discuss $\varphi$ for retracking in three situations as in (18), where $R$ is the Euclidean distance between $(x_c^t, y_c^t)$ and $(x_c^{t-1}, y_c^{t-1})$, and $L$ and $A$ are the diagonal length and the shorter side length of the box, respectively. 1) $R > L$: In remote sensing videos, since the tracked object usually moves at low speed, the Euclidean distance it moves between two adjacent frames is usually less than the length of the object itself, that is, $L$ in (19).
Therefore, when $R > L$, we consider the estimated result of the $t$th frame unreliable and take $\varphi = e^{-R/L}$ for retracking. This attenuates the unreliable detection according to the distance: the larger $R$ is, the closer $\varphi$ is to 0 and the more severe the attenuation. In this way, unreliable detections are weakened so that the real object can be retracked.
2) A < R ≤ L: In this situation, we consider that objects may appear in two positional relationships as shown in Fig. 5(b) and (c). So we further use the angle relationship to make the determination.
We take the average angle change $\theta_T^t$ of the previous $T$ frames as a benchmark to predict the current angle change, as in (20), where $x_c^{t-i}$ and $y_c^{t-i}$ are the centers of the previous boxes. The current movement direction is given by (21), and the angle change $\theta$ of the current frame is then calculated as in (22). We use the angle $\theta_o$ between the diagonal and the long side of the object bounding box $B^t$ as the threshold of the angle change range, as in (23). Since small objects in our remote sensing scenes move at low speed, $\theta$ should be an acute angle less than $\theta_o$.
If $A < R \leq L$ and $\theta > \theta_o$ are satisfied at the same time, we start the retrack procedure with the attenuation factor $\varphi = e^{-R/A}$. Compared with $e^{-R/L}$ in the case of $R > L$, $\varphi = e^{-R/A}$ also decreases exponentially as $R$ increases; the difference is that it decays faster, which effectively suppresses interference even near the object. If neither of the above two situations occurs, we consider that there is no false detection and leave $\varphi$ unchanged at 1. The reason we use two exponential functions with different coefficients is to strike a balance between sensitively capturing false tracking results and avoiding over-sensitive false alarms. As shown in Fig. 6, if we used a uniform decay function $e^{-R/L}$ indiscriminately, then when $R \leq L$ the value of $\varphi$ would remain relatively high (e.g., $\varphi_M > \varphi_O = \varphi_N$). Therefore, we replace the attenuation function in the interval $A < R \leq L$ with $e^{-R/A}$, so that nearby interferers can also be effectively attenuated without affecting the magnitude of the long-distance attenuation. In this way, interference signals at different distances are effectively attenuated during retracking, and false alarms caused by the jitter of bounding boxes are avoided.
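A minimal sketch of the attenuation decision is given below. The helper gcrt_attenuation is hypothetical, and the angle deviation theta is assumed to be computed beforehand from the previous T frames using the paper's angle-prediction equations.

```python
import math

def gcrt_attenuation(prev_center, cur_center, box_w, box_h, angle_change):
    # Attenuation factor phi applied to the saliency mask S_p, following the
    # three cases described above. angle_change is the deviation theta of the
    # current movement direction from the trajectory-predicted one.
    r = math.dist(prev_center, cur_center)          # displacement R between frames
    l = math.hypot(box_w, box_h)                    # diagonal length L
    a = min(box_w, box_h)                           # shorter side A
    theta_o = math.atan2(a, max(box_w, box_h))      # angle between diagonal and long side

    if r > l:                                       # jump longer than the object itself
        return math.exp(-r / l)
    if a < r <= l and angle_change > theta_o:       # moderate jump plus abnormal turn
        return math.exp(-r / a)                     # faster decay for nearby distractors
    return 1.0                                      # trajectory consistent: no attenuation

# Example: a 6 x 12 pixel vehicle that jumped about 20 pixels between frames.
phi = gcrt_attenuation((100.0, 50.0), (118.0, 59.0), 6, 12, angle_change=0.2)
```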
With the above GCRT strategy, we use geometric constraints to weaken the response of the mask S p in the unreliable detection object region, thereby bringing out the correct real object during retracking.

III. EXPERIMENTAL RESULTS AND DISCUSSION
A. Data Description

1) IPIU Dataset: The dataset was acquired over the San Diego military port, USA, in 2017 by the Jilin-1 HD dynamic video satellite at 10 frames/s. The actual ground resolution is 0.91 m/pixel. The various vehicle objects in the scene are to be tracked, and 95% of the object sizes range from 5 × 8 pixels to 10 × 15 pixels. The scene contains bridges, roads, trees, buildings, and many similar vehicle objects. The limited effective target information and the complex scene cause great difficulties for object tracking.

2) RSSRAI Dataset: This dataset comes from the remote sensing video target tracking track competition in 2019 Remote Sensing Image Sparse Representation and Intelligent Analysis
Competition [32]. Frame sizes range from 220 × 223 pixels to 1348 × 1348 pixels. These video sequences are shot at a resolution of 1.13 m/pixel, and the object sizes range from 4 × 9 pixels to 18 × 19 pixels. The object signal strength is weak, so objects are difficult to distinguish from the background.
3) UAV123* Dataset: The public UAV123 dataset [33] contains a series of video sequences from an aerial drone viewpoint. Among all 123 sequences, we take vehicles as objects and select the sequences with the background clutter (BC) and low resolution (LR) attributes, forming the UAV123* dataset for our experiments. These sequences are characterized by a wide range of viewing angles, which leads to unstable object orientations.

B. Experimental Setup
The overall experiments are conducted on a workstation with an Intel Xeon CPU E5-2650 v4 @ 2.20 GHz and four NVIDIA TITAN Xp GPUs. The proposed tracker is implemented on the PyTorch deep learning platform.
1) Backbone Network Learning: We use the CRFPF-module proposed in Section II-A as our backbone network to extract and fuse features. Considering the balance between computing power and time cost, we take $K = 3$ in (1) in practice, that is, a total of four parallel branches $\{B_{I_0}, B_{I_1}, B_{I_2}, B_{I_3}\}$. We add a classifier head consisting of a 1 × 1 convolution layer and a 4 × 4 convolution layer after the backbone network [21], and we take $\mu = e^{-4}$ in (11). In order to adapt the backbone network to the remote sensing data, we use the GOT-10K [34] natural dataset and 25% of IPIU (denoted as IPIU$_T$) as the training set. For each pair of selected frames, we crop image patches whose size is several times (five times in our experiments) that of the object, centered on the bounding box, and resize them to 72 × 72 as inputs to the weight-sharing backbone. In each epoch, 60% of the samples are randomly selected from GOT-10K and the remaining 40% from the IPIU training set. To avoid the difficulty of converging when training multiple branches simultaneously, we adopt an alternate training strategy: when training a specific branch, we set the loss weights of the remaining branches to 0. Each branch is trained for 80 epochs with a learning rate of $2 \times 10^{-4}$.
2) Target Classifier Learning: The spatial co-attention module is trained online to adapt to the foreground-background discrimination of the current test sequence. During this training, we keep the backbone weights frozen and only update the weights of the dual-branch spatial co-attention module. For each test sequence, we use the ground truth of the object in the first template frame to generate a series of similar but slightly different enhanced frames. In the experiments, we generate 15 enhanced samples and then choose any two of them to form an input pair. The weights of the attention loss in (10) and the classifier loss in (11) are $\xi$ and $1 - \xi$, respectively; see Section III-C2 for details.

3) IoU Regression Learning:
We use image pairs with bounding box annotations to train the entire IoU prediction network. During this training, we keep the backbone weights frozen and only update the weights of the dual-branch channel co-attention module. We only use the above IPIU$_T$ as the source of image pairs, sampling pairs from each sequence with a maximum interval of 30 frames. Then, we apply displacements similar to [21] to expand the training data. During training, each batch includes 26 image pairs, the learning rate is $1 \times 10^{-3}$, and a total of 60 epochs are trained.

C. Hyperparameter Analysis

1) Dilation Rate $r_k$:
In Section II-A, the dilation rate $r_k$ is used to keep a consistent receptive field among the parallel branches. Different dilation rate settings correspond to different receptive fields, which has a great impact on the discriminative ability of the features for small objects. According to (3), given the dilation rate of a certain branch, the dilation rates of the remaining branches can be derived from the consistent receptive field principle. Therefore, we select the single branch $B_{I_2}$, which has a moderate number of network layers, for the experiments. We record the dilation rates of a group as $G_i = (r_i^1, r_i^2, r_i^3)$, where $r_i^1$, $r_i^2$, and $r_i^3$ represent the dilation rates of Level 1, Level 2, and Level 3 in branch $B_{I_2}$. We use the real area of the original input corresponding to a 1 × 1 region of the last feature map (denoted as the absolute receptive field) as a unified comparison index. From $G_1$ to $G_7$, the absolute receptive field gradually increases, as shown in Table I. The remaining parameters are kept consistent for fairness.
The results are shown in Fig. 7. As the absolute receptive field increases, the convergence loss first increases slightly, then decreases rapidly to a valley at 87, and finally rises slowly again. So we select $G_4$, which has the minimum convergence loss. The specific structure of each branch is shown in Table II.

2) Number of Branches K:
In Section II-A, $K$, the number of parallel branches, is an important hyperparameter for feature extraction.
Considering that the objects in our datasets range from 5 × 8 to 10 × 15 pixels in size, we design each branch as in Table II and conduct an overall test. The experimental results are shown in Table III.
At the beginning, the accuracy of the network increases with increasing $K$. When $K = 3$, the accuracy has risen to a stable level, and we consider that the network can already sufficiently extract the deep features of small objects. When $K > 3$, the accuracy is no longer improved, but the computational complexity keeps multiplying. Considering the balance between the number of parameters and the accuracy, we take $K = 3$. In fact, the value of $K$ can be adjusted according to the size of the objects, the magnitude of the training dataset, and the timeliness requirements of specific practical applications.
3) Spatial Loss Weight $\xi$: In Section II-B, $\xi$ in (12) is the weight that balances the losses in the total spatial loss $L_{spatial}$, and its value affects the convergence of the network. In order to cover as wide a range of $\xi$ as possible, we vary $\xi$ exponentially over the interval [0.001, 0.999]. During the experiment, the remaining parameter settings are kept the same. The experimental results are shown in Table IV. As we can see, the overall loss first decreases and then increases as $\xi$ increases from 0.001 and is lowest when $\xi = 0.1$. The convergence speed increases slowly as $\xi$ increases within the same order of magnitude. When the loss is dominated by $L_{cls}$ and assisted by $L_s$, the model has a stronger ability to distinguish between target and background, which also confirms the effectiveness of the introduced attention loss. Therefore, we finally set $\xi$ to 0.1.

D. Module Comparison
We conduct comparative experiments individually on each module proposed in Section II. To ensure the fairness of the experiments, only the settings of the module to be compared are changed while the rest remain consistent.
1) CRFPF-Module: In the experiment, Sin-B, Up-B, and Wh-B are backbone networks with different branch combinations. Sin-B represents a single branch, Up-B represents fusion of the non-downsampling branches, and Wh-B represents fusion of all branches (upsampling and downsampling). The letters in brackets indicate the fusion method: Add stands for additive fusion and Con stands for concatenate fusion. The specific structure of each branch is shown in Table II. The experimental results are shown in Table V and Fig. 8.

Sin-B Versus Up-B: As shown in Fig. 8, in both the target classifier and the IoU regression training, the training loss of Up-B converges to a lower value than that of Sin-B (note that Sin-B uses the single best-performing branch $B_{I_2}$). This is because Up-B fuses features of different levels from the non-downsampling branches, which is more comprehensive than the features of a single branch. For small objects, a single-resolution feature representation is usually insufficient, while the hierarchical features of multiple resolutions complement it and improve the overall performance. However, the training time of the network increases at the same time.
Up-B Versus Wh-B: Wh-B adds the down-sampling branch $B_{I_0}$ on top of Up-B. It can be seen from the experimental results that, for the target classifier, whether the fusion is additive or concatenate, the training convergence loss of Wh-B is lower than that of Up-B. This verifies that the introduced shallow global features have a positive effect on the target classifier. However, since the IoU regression uses precise position features obtained with PrPool, the addition of the global information in $B_{I_0}$ does not show an apparent advantage.
Add Fusion Versus Con Fusion: For the target classifier, since the features are consistent in spatial structure, additive fusion can strengthen the target information and performs better than concatenate fusion. Moreover, in terms of training time, the training efficiency of additive fusion is roughly double that of concatenate fusion.
In summary, we choose Wh-B (Add) as the optimal structure.
2) DSCA-Module: We conduct comparative experiments on the spatial co-attention target classifier and the channel co-attention IoU regression to verify the effect of our DSCA-module.
Spatial Co-Attention: Fig. 9 shows the changes in the feature map's response before and after adding the spatial co-attention module. By learning the correlation of the object between the current frame and the template frame, we strengthen the regions that are strongly correlated with the template object. Moreover, because the attention loss is specifically trained, the object region is further highlighted, while the response of the similar surrounding background is weakened. As shown in Fig. 9, compared with the original algorithm without co-attention, the peak of our response map is concentrated in the object region, and the surrounding interference is suppressed.
Channel Co-Attention: Fig. 10 shows the comparison of the tracking results before and after adding the channel co-attention module. According to the cross-correlation between the features of the current frame and the template frame, the channels that are highly similar to those of the ground-truth object are enhanced. In this way, channels containing more object information help the accurate IoU prediction and give a better understanding of the object boundary. For example, Fig. 10 shows all 64 channel features of the 12th frame of a particular sequence, together with the corresponding weights assigned by the channel co-attention. We can see that most channels with more significant object information receive higher weights. The final results in Fig. 10 show that the tracking result with the channel co-attention module is tighter and more precise.
3) GCRT-Strategy: Fig. 11 shows the effect of the GCRT-strategy on a fragment of a sequence. Our strategy uses geometric relations to determine whether the object is lost and retracks the correct object by weakening the response of the interfering object in the saliency mask. As shown in Fig. 11, after the interfering response is attenuated, the tracker recovers the correct object.

E. Ablation Studies and Algorithm Comparison
In this section, we conduct comparative experiments with several state-of-the-art methods as well as self-ablation experiments. For the sake of experimental fairness, we adopt the standard Precision and Success indicators of the single-object tracking field, following [35], to evaluate the performance of trackers.
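For reference, the sketch below implements the standard OTB-style definitions of these two metrics [35]; the conventional 20-pixel precision threshold and the IoU threshold grid are assumptions about the plotting setup, not values stated in this article.

```python
import numpy as np

def center_error(pred, gt):
    # pred, gt: (N, 4) boxes as (x, y, w, h); Euclidean distance between box centers.
    cp = pred[:, :2] + pred[:, 2:] / 2.0
    cg = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(cp - cg, axis=1)

def box_iou(pred, gt):
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def precision(pred, gt, thresh=20.0):
    # Fraction of frames whose center location error is within `thresh` pixels.
    return float((center_error(pred, gt) <= thresh).mean())

def success_auc(pred, gt, steps=21):
    # Area under the success plot: success rate over IoU thresholds in [0, 1].
    ious = box_iou(pred, gt)
    return float(np.mean([(ious >= t).mean() for t in np.linspace(0, 1, steps)]))
```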
We compare our tracker framework with several state-of-the-art trackers, including DCF [12], SAMF [36], DLSSVM [37], LCT [38], ECO [16], HDT [17], DIMP [22], and the baseline ATOM [21]. DCF is a representative of the discriminant kernel correlation filtering methods. SAMF uses a feature pyramid correlation filter to perform multiscale tracking. DLSSVM and LCT are algorithms that combine detection and tracking. ECO and HDT combine the powerful representation of deep learning. For the above algorithms, we adopt their official codes without changing the network structure. Fig. 13 shows the tracking results of our method compared with the other trackers on several videos. DIMP and ATOM are trainable end-to-end deep trackers; their frameworks are similar to ours, so we use consistent initialization with the same settings.
We carry out a quantitative ablation study taking ATOM as the baseline algorithm to verify each proposed component of our framework, including ATOM-A, ATOM-AB, and ATOM-ABC, where A represents the CRFPF-module, B the DSCA-module, and C the GCRT-strategy. The quantitative results are shown in Table VI. 1) Experiment Analysis on IPIU Dataset: DCF is a correlation filtering algorithm known for its speed advantage; as shown in Table VI, it tracks up to hundreds of frames/s. However, its shallow handcrafted features, that is, histogram of oriented gradients (HOG) and color names (CN), have low robustness to complex remote sensing scenes, so it is susceptible to interference from multiplicative background noise.
SAMF is a multiscale tracker. However, the object sizes in our remote sensing videos are small and their change is not apparent, so its tracking Success is not significantly improved. DLSSVM uses a traditional structured SVM kernel to determine the object, and the LCT algorithm also uses a redetector composed of an SVM to redetect hard negative samples. However, it is difficult for them to achieve stable performance in complex scenes due to their weak adaptive learning ability.
The ECO tracker, which combines shallow appearance information and deep semantic information, is significantly better than the HDT tracker with only convolutional features, despite the cost in speed. However, both directly use models pretrained on the ImageNet dataset, which limits their ability to learn from remote sensing videos and their scalability.
DIMP and ATOM allow trainable learning of remote sensing data characteristics. However, an overly deep forward network reduces the feature discrimination of small objects. When similar objects appear nearby, the tracker struggles to distinguish them, and it is almost impossible to retrieve the original object once a mistake occurs. Although DIMP has online update ability, low-quality features become a burden after tracking fails.
Compared with the baseline ATOM, ATOM-A merges deep and shallow features in parallel to represent small objects. It can effectively extract the deep features of small objects and fuse them with the shallow features, improving the robustness of the features while preserving the spatial detail information. The experimental results in Table VI confirm this improvement. Compared with ATOM-A, ATOM-AB improves the saliency of the object in the spatial domain, weakens the response of the similar surrounding background, and enhances the dominant features in the channel domain; therefore, its Precision outperforms that of ECO. However, because it still cannot effectively distinguish moving interfering objects, its Success does not reach the ideal level.
Our complete tracking framework, ATOM-ABC, further adopts the geometric constraint redetection strategy and achieves gains of 13.6 in Precision score and 15.9 in Success score compared with the baseline. The GCRT-strategy uses temporal context information to constrain accidental errors, alleviating the hard-to-recover plight in remote sensing scenarios.
In addition, the three proposed modules can also be effectively transferred to another baseline algorithm, DIMP, denoted as DIMP-A, DIMP-B, DIMP-C, and DIMP-ABC, respectively. They all bring performance improvements in Success and Precision, which demonstrates the extensibility of the modules. Among the methods with the same baseline, our methods (i.e., DIMP-ABC and ATOM-ABC) perform the best. This result indicates that the proposed CRFPF, DSCA, and GCRT have a synergistic effect in improving tracker performance.
For time complexity, the CRFPF-module with $K = 3$ has a complexity of about $T(9.27 \times 10^6)$, which remains within an order of magnitude of the baseline backbone. Compared with ATOM, our method tracks about six fewer frames/s. The contribution of each part is as follows: the CRFPF-module increases the speed by about five frames/s; in the DSCA-module, the cross-correlation calculation in the collaborative attention mechanism reduces it by about nine frames/s; and the GCRT-strategy reduces it by about two frames/s. Therefore, our proposed method keeps the overall speed within an order of magnitude of ATOM while ensuring Success and Precision.
2) Experiment Analysis on RSSRAI Dataset: The experimental results on the RSSRAI dataset are shown in Fig. 12(c) and (d). It can be seen that tracking methods fused with deep learning, such as ATOM and DIMP, are better than traditional correlation filter and SVM frameworks. This is mainly due to the background changes in some tracking sequences: for example, when a vehicle drives across the transition between a dark asphalt road and a bright bridge, the background suddenly changes. In this case, the tracking Success of DIMP, with its online learning capability, is higher than that of ATOM.
In addition, since the resolution of the RSSRAI dataset is 1.13 m/pixel, object blur is more serious. When the retracking strategy is used alone, the accuracy of the bounding boxes obtained during regression is affected, although the target object can be recovered, so Precision (i.e., 0.670) drops slightly. According to the visualized tracking results, when a vehicle turns at an intersection, its direction and shape differ considerably from the initial state, especially for vehicles with long bodies; therefore, the performance of ECO on RSSRAI is not as good as on IPIU. Moreover, the proposed ATOM-AB tracker uses the object information of the template frame to strengthen the similar target area of the frame to be tested, shields the interference of drastic background changes to a certain extent, and improves the Success by 5 percentage points. This factor limits the increase in Precision.
From the point of view of speed, although the image size of RSSRAI is larger than that of IPIU, our method still achieves a frame rate on the same level as the other deep learning trackers.
3) Experiment Analysis on UAV123* Dataset: The experimental results are shown in Fig. 12(e) and (f). Since UAV123* is shot at short range by drones, the images contain more spatial detail, and the shape and scale of the target change more diversely. The ATOM and DIMP trackers, with their IoU-Net branch, perform pixel-level regression on the rough box during tracking and therefore perform better than the ECO algorithm, which estimates over multiple scales. Owing to its online-updated classifier module, DIMP performs better than the baseline ATOM on the UAV123* dataset, so the Success of DIMP and its extensions is higher than that of ATOM. DCF captures object information through hand-designed features, so it has good rotation invariance and robustness to viewing-angle changes; since the objects captured in UAV123* exhibit angle variability, the results of DCF on UAV123* are better than on the previous two datasets. On the contrary, DLSSVM is sensitive to appearance and shooting angle, and its performance is not as good as the CF trackers.
In the ablation experiment, the object scenes captured in UAV123* are more complex and more vulnerable to interference from surrounding moving targets. Our ATOM-A can effectively extract hierarchical robust features from shallow to deep layers, and the overall ATOM-ABC with the GCRT-strategy can also effectively eliminate the interference of surrounding moving targets. Precision and Success are improved by about 0.28 and 0.33, respectively, on UAV123*.

IV. CONCLUSION
In this article, we propose a collaborative learning network for remote sensing video tracking. Experiments verify that the CRFPF-module can extract effective hierarchical features, especially for small objects; the DSCA-module collaboratively learns the object commonality between the template frame and the current frame and uses spatial-channel correlation to highlight weak target signals; and the GCRT-strategy provides a retracking scheme for the complex, large scenes of remote sensing, reducing the interference of similar moving objects. However, the training process of the DSCA-module relies on a large number of annotated tracking frames, which limits the flexibility of the strategy in practical applications; in addition, the efficiency of our algorithm is not particularly high. In the future, we will study how to lighten the structure of the backbone network, reduce the dependence on the amount of training data, and save computational costs.