Robust Cross-Scene Foreground Segmentation in Surveillance Video

Training a single deep model for large-scale cross-scene video foreground segmentation is challenging because off-the-shelf deep-learning-based segmentors rely on scene-specific structural information. This results in deep models that are scene-biased and evaluations that are scene-influenced. In this paper, we integrate dual modalities (the foreground's motion and appearance) and then eliminate features that are not representative of the foreground through attention-module-guided selective-connection structures. The model is trained end to end and achieves scene adaptation in a plug-and-play style. Experiments indicate that the proposed method significantly outperforms state-of-the-art deep models and background subtraction methods on un-trained scenes – LIMU and LASIESTA. Source code is available at: https://github.com/WeiZongqi/HOFAM


INTRODUCTION
Video foreground segmentation aims at discovering the visually distinctive moving foreground objects in a video and identifying all pixels covering these objects against the background. A video foreground segmentation model can serve as an important pre-processing component for many applications, for example image and video compression [1], visual tracking [2] and person re-identification [3]. In practice, however, training a single deep model for large-scale cross-scene video foreground segmentation is still a challenging issue, since off-the-shelf deep-learning-based segmentors rely on scene-specific structural information. Smoothly adapting to new scenes requires additional laborious annotation and training from scratch or fine-tuning of the model; otherwise the foreground, especially small objects, will be falsely segmented. Traditional unsupervised background subtraction methods [4,5,6] focus on building statistical models to suppress the interference of dynamic backgrounds, but they face a bottleneck in achieving accurate background updating. Approaches using CNNs to replace background subtraction were proposed in [7,8,9,10,11]. All the aforementioned methods are scene-specific and need to be trained from scratch for each new scene. DeepBS [12] and STAM [13] utilize a trained CNN to realize foreground segmentation across video scenes; for training data, they randomly select 5% of the samples, with corresponding ground truths, from each subset of the CDNet 2014 dataset. Their cross-scene segmentation is often so coarse that object boundaries and small objects are not well preserved. Semantic segmentation methods have made remarkable progress thanks to the development of convolutional neural networks; state-of-the-art methods include PSPNet [14], DeepLabV3+ [15], BFP [16] and CCL [17]. Although semantic segmentation approaches can provide high-level semantic annotation for each frame, they ignore the temporal relevance and motion cues that are essential for video foreground segmentation.
Essentially, foreground segmentation is an empirical task related to appearance, motion and scene attributes. An end-to-end feature descriptor provides a path for effective blending and fusion of appearance and motion features to filter the multifarious foreground patterns across scenes. Optical flow is an instantaneous motion cue that is less robust and inadequate for describing motion at the pixel level. In this paper, we address the following issues: 1) how to describe the foreground in a scene more comprehensively; 2) whether we can realize a plug-and-play foreground segmentation model that requires no extra training, even for a new scene. We solve these issues by integrating features from two modalities (the foreground's motion and appearance) and then eliminating features that are not representative of the foreground through attention-module-guided selective-connection structures. The proposed method is shown in Figure 1.

MODEL STRUCTURE
As shown in Figure 1, the proposed model combines both static appearance features and motion information, and integrates attention modules in the upsampling process to fuse the features of encoder and decoder.

Hierarchical Optical Flow
As an instantaneous motion field, optical flow lacks stability and sufficiency in representing motion. Optical flow computed from video frames with a long interval carries the long-term motion cues of an object, but the object's outline is imprecise. Optical flow computed from frames with a short interval gives accurate motion cues for the current frame, but is sometimes insufficient to describe the whole moving object, as in the first optical flow in Figure 1. Hierarchical Optical Flow (HOF), illustrated on the right of Figure 1, uses the current video frame together with interval frames of three different lengths to calculate three optical flows, whose hierarchical intervals complement each other. The specific steps are as follows: given the current frame at time T, take the frames at times T − τ1, T − τ2 and T − τ3 by setting the interval-length parameters τ1, τ2 and τ3; then calculate the optical flow at time T with respect to each, denoted Op(τ1), Op(τ2) and Op(τ3). We merge the three optical flows with different frame intervals into three channels as the hierarchical optical flow Hop(T). We use the state-of-the-art deep model SelFlow [18] to calculate the optical flow.
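The merging step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the real system uses SelFlow [18] as the flow estimator, so `compute_flow` here is a hypothetical stand-in (a simple frame difference) used only to show how the three interval flows are stacked into Hop(T).

```python
import numpy as np

def compute_flow(frame_a, frame_b):
    """Placeholder for a learned flow estimator such as SelFlow [18].
    Here a frame difference serves as a stand-in motion cue."""
    return np.abs(frame_a.astype(np.float32) - frame_b.astype(np.float32))

def hierarchical_optical_flow(frames, t, taus=(1, 5, 10)):
    """Merge flows computed at several frame intervals into one
    multi-channel tensor Hop(T)."""
    # One flow per interval length tau: frame T against frame T - tau
    flows = [compute_flow(frames[t], frames[t - tau]) for tau in taus]
    return np.stack(flows, axis=-1)  # shape (H, W, len(taus))
```

With the paper's settings (τ1 = 1, τ2 = 5, τ3 = 10), the result is a three-channel motion tensor aligned with frame T.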

Attention Module
The proposed model merges decoder and encoder features through a dense attention process during the decoding phase. In detail, high-level features provide global information that guides the attention modules to weight the low-level features that contribute to the prediction for the input image: the encoder features are re-weighted pixel-wise by the decoder layers and then concatenated with them.
In Figure 2, the decoding process goes from the previous decoding layer D_{i−1} to the next layer D_i. The inputs are E_i, Op_i and the previous decoder layer D_{i−1}; the output is the decoder layer D_i. To explain the operating mechanism of the attention module more clearly, we denote the process stages by B_upsampling, B_w and B_{e⊕op}. The specific process is as follows. Suppose we have obtained two feature-map tensors E_i ∈ R^{H×W×C} and Op_i ∈ R^{H×W×C} (H and W are the height and width of a single feature map, and C is the number of feature-map channels). To obtain D_i, we first concatenate the two corresponding feature maps E_i and Op_i from the two encoders. After concatenation the number of channels doubles from C to 2C, and B_{e⊕op} ∈ R^{H×W×C} is then obtained by convolution:

B_{e⊕op} = ReLU(conv0(E_i ⊕ Op_i)),

where conv0 is a convolution with kernel 3×3 and stride 1 used to extract appearance features and reduce channels, ⊕ is the concatenation operator and ReLU is the ReLU activation function.
On the decoding layer D_{i−1} ∈ R^{H/2×W/2×4C}, we apply an upsampling convolution to obtain B_upsampling ∈ R^{H×W×C}. Then the weighting-coefficient tensor B_w ∈ R^{H×W×C} (with values between 0 and 1) is obtained by convolution and activation:

B_w = σ(BN(conv1(B_upsampling))),

where σ is the Sigmoid activation function, conv1 is a convolution with kernel 3×3 and stride 1 that learns the weighting coefficients, and BN is batch normalization. B_w is then combined with the feature map B_{e⊕op} by pixel-wise multiplication to obtain the weighted feature map (Atten result). This step is the weighting operation of the decoder in the attention module.
After batch normalization, we obtain the original decoder feature from B_upsampling. We also apply Dropout to this original decoder feature: each node has a 50% probability of being suppressed during training, and the operation is removed at test time. The weighted encoder feature map and the original decoder feature are concatenated to obtain D_i ∈ R^{H×W×2C} for the current decoding layer i:

D_i = (B_w ⊙ B_{e⊕op}) ⊕ B_upsampling,

where ⊙ is the Hadamard product.
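One decoding step of the attention module can be sketched as below. This is a simplified NumPy illustration under stated assumptions, not the trained model: the paper's 3×3 convolutions are replaced by hypothetical 1×1 channel mixes, upsampling is nearest-neighbour with a channel slice standing in for the learned upsampling convolution, and batch normalization and Dropout are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    """Channel-wise linear map standing in for the paper's 3x3 convolutions."""
    return np.einsum('hwc,cd->hwd', x, w)

def attention_decode_step(E_i, Op_i, D_prev, w0, w1):
    """One decoding step D_{i-1} -> D_i with the attention module.
    E_i, Op_i: encoder features, shape (H, W, C).
    D_prev: previous decoder layer, shape (H/2, W/2, 4C)."""
    H, W, C = E_i.shape
    # B_e_op = ReLU(conv0(E_i concat Op_i)): fuse appearance and motion, 2C -> C
    B_e_op = relu(conv1x1(np.concatenate([E_i, Op_i], axis=-1), w0))
    # Upsample previous decoder layer (nearest-neighbour + slice as a stand-in
    # for the learned upsampling convolution 4C -> C)
    B_up = D_prev.repeat(2, axis=0).repeat(2, axis=1)[..., :C]
    # B_w = sigmoid(conv1(B_up)): per-pixel weighting coefficients in (0, 1)
    B_w = sigmoid(conv1x1(B_up, w1))
    # Hadamard product weights the fused encoder features, then concatenate
    atten = B_w * B_e_op
    return np.concatenate([atten, B_up], axis=-1)  # (H, W, 2C)
```

The design choice worth noting is that the decoder side (B_w) gates the encoder side (B_e_op) pixel by pixel, so global context decides which low-level details survive the skip connection.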

Loss Function
Focal Loss [19] was designed in RetinaNet for object detection to address the positive/negative sample-imbalance problem, and is based on the binary cross-entropy function. We define the area ratio between foreground and background in one frame as S(fg), and then define a within-class balance coefficient β as

β = t3 · min(1/S(fg), 50),

where t3 is a hyper-parameter. The minimum of 1/S(fg) and 50 is taken to prevent β from diverging in scenes with a vanishingly small foreground; the value 50 was set after sampling the small objects in the training scenes. The class-in scale focal (cis-focal) loss rescales the focal loss with β, where p is the probability predicted by the model, with foreground label y = 1 and background label y = 0; α is the balancing parameter between foreground and background pixel samples, and γ regulates the contribution of hard and easy samples (a hard sample yields a lower p). To train the model stably, the Manhattan distance (l1 loss) is also used as a regularizer during training, measured between the prediction p and the ground truth y: L_l1 = ||p − y||_1. The final loss is the weighted sum L = t1 · L_cisfocal + t2 · L_l1.
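A sketch of the loss is given below. Since the exact cis-focal formula is not reproduced in this text, the code follows the standard focal loss [19] with the foreground term rescaled by β; treat the precise form as an assumption. The default values t1 = 0.8, t2 = 0.2, t3 = 0.25, α = 0.75 and γ = 0 match the experiment settings.

```python
import numpy as np

def cis_focal_loss(p, y, s_fg, alpha=0.75, gamma=0.0, t3=0.25):
    """Assumed form of the class-in scale focal (cis-focal) loss.
    p: predicted foreground probabilities; y: binary labels (1 = foreground);
    s_fg: foreground/background area ratio S(fg) of the frame."""
    eps = 1e-7
    # beta = t3 * min(1/S(fg), 50): capped so tiny foregrounds cannot diverge
    beta = t3 * min(1.0 / max(s_fg, eps), 50.0)
    fg = -beta * alpha * y * (1.0 - p) ** gamma * np.log(p + eps)
    bg = -(1.0 - alpha) * (1.0 - y) * p ** gamma * np.log(1.0 - p + eps)
    return np.mean(fg + bg)

def total_loss(p, y, s_fg, t1=0.8, t2=0.2):
    """Final loss: t1 * cis-focal + t2 * l1 regularization."""
    l1 = np.mean(np.abs(p - y))
    return t1 * cis_focal_loss(p, y, s_fg) + t2 * l1
```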

EXPERIMENT
In this section, we evaluate the proposed network for foreground segmentation on three benchmark datasets, namely CDNet 2014 [20], LIMU [21] and LASIESTA [22]. Quantitative results in terms of average F-measure and visual results are evaluated and verified with the state-of-the-art methods.

Data Preparation and Experiment Setting
Following the training setting of DeepBS [12], we randomly select 5% of the samples, with their ground truths, from each subset of CDNet 2014 to train HOFAM. The remaining 95% of the samples in CDNet 2014 are used to test the model, with no overlap with the training set. The segmented foreground is obtained without any post-processing.
We carried out extensive hyper-parameter tuning in advance and compared many different settings. For the hierarchical optical flow we set τ1 = 1, τ2 = 5 and τ3 = 10. In the loss function we set t1 = 0.8, t2 = 0.2, t3 = 0.25, α = 0.75 and γ = 0. The training batch size is 16, and we train for 16000 epochs in total. Adam is used as the optimizer with beta1 = 0.95 and beta2 = 0.999. The learning rate is set to a small value of 5 × 10−5.
We divide the compared methods into three groups: (1) cross-scene deep models (single model), (2) scene-specific models (including deep models and background subtraction methods), and (3) semantic segmentation models. For the cross-scene deep models, STAM [13] and DeepBS [12] are trained in the same way as HOFAM. We also compare variants of our model without attention (HOFAM_noAtt) or without optical flow (HOFAM_noOp). The semantic segmentation models DeepLabV3+ [15] and PSPNet [14] are trained on ADE20K [23], because CDNet 2014 has no semantic annotation. Following the protocol recommended in [24], we define the classes {person, car, cushion, box, book, boat, bus, truck, bottle, van, bag, bicycle} as foreground.
Precision, Recall and F-measure are pixel-level segmentation metrics that accumulate all positive and negative pixels over all tested frames, but they ignore the scale of the foreground, which is unfair to small foregrounds. To evaluate the segmentation of small foregrounds more fairly, we additionally report a metric based on the Dice coefficient:

Mean Dice = (1/N) Σ_{i=1}^{N} 2·TP_i / ((TP + FN)_i + (TP + FP)_i),

where N is the number of frames that contain foreground, (TP + FN)_i is the number of ground-truth foreground pixels in frame i, and (TP + FP)_i is the number of predicted foreground pixels in frame i.
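The metric can be computed per frame and averaged, as in this short sketch (binary masks assumed; frames without ground-truth foreground are skipped, matching the definition above):

```python
import numpy as np

def mean_dice(preds, gts):
    """Mean Dice over frames that contain foreground:
    Dice_i = 2*TP_i / ((TP+FN)_i + (TP+FP)_i)."""
    scores = []
    for pred, gt in zip(preds, gts):
        if gt.sum() == 0:   # skip frames with no ground-truth foreground
            continue
        tp = np.logical_and(pred, gt).sum()
        # gt.sum() = TP+FN (truth pixels), pred.sum() = TP+FP (predicted pixels)
        scores.append(2.0 * tp / (gt.sum() + pred.sum()))
    return float(np.mean(scores))
```

Because each frame contributes equally regardless of its foreground area, a missed small object hurts this score as much as a missed large one.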

Ablation Experiments on CDNet 2014
In the ablation experiments, we verify the hierarchical optical flow, the attention module, and the class-in scale focal loss against the plain focal loss, in related combinations. As shown in Table 1, compared with the model without the attention module, the attention module brings an obvious improvement in F-measure and Mean Dice. In Figure 4 we visualize the intermediate results of the seventh attention module (Att7) in the decoder. Because the proposed attention module involves multi-layer and multi-channel processes, it is difficult to visualize the attention process directly and accurately with two-dimensional images; we therefore average the results of one layer to reveal the trend roughly. The Atten result (Att7) shows that the attention module highlights the area of the foreground object. B_w and B_{e⊕op} are the intermediate steps used to obtain the Atten result; the results of B_w (Att7) and B_{e⊕op} (Att7) suggest that they retain more of the original decoder and encoder feature distributions, with uncertainties and biases in appearance and optical flow. We also compare the three loss functions individually. Compared with the focal + l1 loss, the cis-focal + l1 loss brings an obvious improvement and achieves the best F-measure and Mean Dice scores; the improvement in Mean Dice is especially notable. As Figure 5 shows, the proposed loss performs better on small objects. We mark false positives in green and false negatives in red.

Results and Evaluation on CDNet 2014
Since the method proposed in this paper is trained on this dataset, the purpose of this experiment is not to test cross-scene segmentation capability but to compare the proposed single model against scene-specific models. As Table 2 shows, even though the single model is trained using only 5% of the training data of all scenes, it still outperforms the deep models and background subtraction models trained per scene.

Cross-Scene Segmentation Results on LIMU and LASIESTA
For cross-scene testing, the LIMU [21] and LASIESTA [22] datasets are used to verify cross-scene foreground segmentation. On LIMU (Table 3), HOFAM outperforms the other models on two subsets. On the CameraParameter subset, PSPNet achieves better results on person segmentation, and HOFAM ranks second with 0.7979. Overall, HOFAM achieves the best F-measure of 0.7981, while PSPNet ranks second with 0.7506 and STAM third with 0.7344. We visualize the results in Figure 6.

CONCLUSIONS
We propose a Hierarchical Optical Flow Attention Model (HOFAM) for cross-scene foreground segmentation, a task of practical significance. Comparison with state-of-the-art cross-scene deep models, scene-specific deep models, background subtraction methods and semantic segmentation models on the LIMU and LASIESTA benchmarks indicates promising generalization to unseen scenes without any additional training. Although it takes dual inputs, the framework remains a single model trained end to end. Future work will explore self-supervised learning of the attention models for specific training scenarios.