Inferred box harmonization and aggregation for degraded face detection in crowds

Since objects usually keep a certain distance from the surveillance camera, small object detection is a practical issue. Detecting small objects is also one of the remaining challenges in the computer vision community. Current detectors usually leverage a more robust backbone network, build one or more multi-scale feature pyramids, or define more precise anchor-box screening criteria. However, distinguishable features remain scarce because of appearance degradation and low resolution. In this paper, we leverage high-level context to enhance the capabilities of anchor-based detectors for small and crowded face detection. We first define a face co-occurrence prior based on density maps (FCP-DM) to explore extensive high-level contextual information. We then propose a score-size-specific non-maximum suppression (S3NMS) to replace the traditional non-maximum suppression at the end of anchor-based detectors. Our approach is plug-and-play and model-independent, and can be attached to existing anchor-based face detectors without extra learning. Compared to the prior art on the WIDER FACE hard set, our method increases Average Precision by 0.1%-1.3%, while on Crowd Face, a dataset we built for testing small and crowded face detection, it raises Average Precision by 1%-6%. Code and dataset are available online.


Introduction
For video surveillance in the open world, robust face detection is an essential component for handling various face-related tasks. Since faces are usually far from the surveillance camera, small face detection is a problem with practical needs. In recent years, renewed detection paradigms [14,9], strong backbones [16,21,15], and large-scale datasets [32,10] have jointly pushed the limit of face detection toward human cognition. However, because flexible mechanisms and abundant domain knowledge guide human cognition, humans retain an advantage in handling the challenges of low resolution [19]. In the computer vision community, a central issue of small object detection is the appearance degradation of a small object at low resolution: distinguishable features become scarce.
Anchor-based face detectors have achieved satisfactory performance on the WIDER FACE benchmark [32]. Many recent face detectors rely on features extracted from a deep Convolutional Neural Network (CNN). They obtain low-level features of the objects (such as texture and edge features) from the lower layers of the network and high-level features (such as semantic features) from the higher layers. However, for face detectors, the thorny issues in detecting degraded faces are caused by small size, defocus blur, and occlusion [36]. These blurred and low-resolution faces span only dozens of pixels or even fewer and contain limited feature information. When the standard spatial pooling process [36] is applied in a CNN, appearance features are degraded further. The problem is ill-posed for a low-resolution object, as a CNN can provide only very few low-level features at the lower layers and almost no high-level features of these faces at the higher layers. Therefore, aggregating more information from context becomes an inevitable choice.
Some works [9,36,39,21,31] have introduced contextual information for low-resolution face detection. In these methods, the contextual information of faces is usually employed in the form of low-level context via an augmented receptive field of the feature maps. Rich low-level context is helpful for detecting small objects and easy to implement [4,24], but an augmented receptive field still relies on a limited local area. Some current detectors leverage a more robust backbone network [14], build one or more multi-scale feature pyramids [9], or define more precise anchor-box screening criteria [21]. On the other hand, [1] shows that humans take longer and make more errors when detecting objects that violate their high-level context. Hence, object detection is expected to fit into a certain high-level scene context in order to approach human cognition.
We also argue that high-level contextual information is valuable for small object detection. Different from the traditional context, which relies on adjusting local receptive fields, we explore compositional semantics, namely the relationship among the confidence, quantity, and size of objects, as high-level contextual information and extend it to the whole scene. We present a universal strategy with density-map-based face co-occurrence priors (FCP-DM) and score-size-specific non-maximum suppression (S3NMS), independent of training paradigms, to directly replace the standard non-maximum suppression (NMS) post-processing step in anchor-based detectors. FCP-DM harmonizes the outputs of a detector according to crowd density estimation. It enhances the sensitivity and specificity of the detector by increasing true positives. S3NMS aggregates the bounding boxes, decreasing false positives and increasing true positives according to the score and size of the inferred face boxes. Figure 1 illustrates the proposed detection framework. We also collect a challenging face detection dataset with tiny faces to provide adequate samples that further expose the bottleneck of detecting crowded faces.
The contributions of this paper are as follows.
• We propose a general approach that uses high-level contextual information for small and crowded face detection.
• The proposed scheme reduces false positives and increases true positives according to the score, quantity, and size of the inferred face boxes under the guidance of crowd density estimation.
• The proposed scheme is suited to detecting multi-scale and low-resolution faces in crowded scenes and provides a refined structure to avoid arbitrary discarding or preservation of bounding boxes.
• It requires no extra training and is simple to implement.
The remainder of this paper is organized as follows. We discuss the related work in Section 2. We describe the problem formulation and the proposed FCP-DM and S3NMS in detail in Section 3. The experimental results are presented and discussed in Section 4, and the conclusions, limitations, and future work are presented in Section 5.

Anchor-based object detection model
Supervised training of a detection model requires bounding boxes and their class labels associated with the objects in images. However, it is not trivial for CNNs to directly predict an orderless set of arbitrary cardinality [33,25]. One commonly used strategy is to introduce anchors, which employ a divide-and-conquer strategy to match objects with convolutional features spatially. The anchor box was first introduced in Faster R-CNN [21] and serves as a reference at multiple scales and aspect ratios for object detection. During inference, anchors independently predict object bounding boxes, and the box with the highest classification score is retained after the non-maximum suppression (NMS) procedure. Anchor-based detection methods include the well-known FPN [14], RetinaNet [15], SSD [16], and YOLOv3 [20], all of which require additional post-processing, i.e., NMS. Anchor-free approaches, including CornerNet [11], CenterNet [5], and ExtremeNet [38], have shown great potential for cases of extreme object scales and aspect ratios. However, without the anchor box as a reference, direct regression of bounding boxes from convolutional features remains challenging. On the WIDER FACE benchmark [32], the most competitive methods are still anchor-based models. We continue to tap the potential of anchor-based methods, expecting to enhance their performance without additional training.

Using context in face detection
The idea of using context in object detection has been studied in many works. [19], [29], and [4] reviewed the contextual information used in contemporary methods and analysed its role in challenging object detection through empirical evaluation. For face detection specifically, the Hybrid Resolution Model (HR) [9] is a simple yet effective framework for finding small faces, demonstrating that both large context and scale-variant representations are crucial. It shows that massively large receptive fields can be effectively encoded as a foveal descriptor that captures both coarse context and high-resolution image features. Similarly, [39] pools ROI features around faces and bodies for detection, which improves overall performance. The methods mentioned above either build multi-scale feature pyramids or enlarge the receptive fields of feature maps to employ low-level context. For small face detection, however, these methods use context information inefficiently and are not as flexible as the human cognitive system, in which high-level context and domain knowledge help reduce decision time and disambiguate low-quality inputs. We expect to fit detections into a proper high-level scene context to enhance anchor-based face detectors.

Inferred box harmonization
The goal of Non-Maximum Suppression (NMS) [22] is to penalize false positive detections, and it has been an integral part of many object detection algorithms in computer vision [28,8,18,23]. Soft-NMS [2] argues that conventional NMS is too greedy: only the bounding box with the maximum score is selected, while all other bounding boxes that overlap it beyond a pre-defined threshold are suppressed. Soft-NMS instead suppresses a bounding box by reducing its score rather than removing it. In our preliminary experiments, however, Soft-NMS increases false positives because some redundant boxes cannot be deleted due to their high scores. More complex learning-based methods rely on a model-related learning process. Hosang [7] proposed a learning-based NMS to improve localization and occlusion handling. Tychsen-Smith [27] argued that many detection methods are designed to identify only a sufficiently accurate bounding box, rather than the best available one, and proposed fitness NMS. Although learning-based methods have achieved good performance in specific scenarios, they generalize poorly and adapt insufficiently across domains. We therefore aim to develop plug-and-play and model-independent paradigms that can be integrated into existing anchor-based detectors without extra learning.

Problem formulation and proposed approaches

Crowd density map
The density map was first used in the crowd counting literature. [37] proposes geometry-adaptive and fixed kernels with Gaussian convolution to generate a density map. [13] introduces a dilated convolutional neural network to improve the quality of the density map. [17] combines features obtained using multiple receptive field sizes and learns the importance of the features at each image location, which adaptively encodes the scale of the contextual information required to predict crowd density accurately. In our problem formulation, crowd density estimation is employed to derive face co-occurrence priors for harmonizing the outputs of a face detector.
A density map is also used in crowd analysis since it can exhibit the headcount, the head locations, and their spatial distribution. Given a set of $N$ training images $\{I_i\}_{1 \le i \le N}$ with corresponding ground-truth density maps $D_i^{gt}$, the goal of density map estimation is to learn a non-linear mapping $F$ that maps an input image $I_i$ to an estimated density map $D_i^{est}(I_i) = F(I_i)$ that is close to the ground truth $D_i^{gt}$ in terms of the $L_2$ norm. To represent the density maps, we associate with each image $I_i$ a set of 2D points $P_i^{est} = \{P_{i,j}\}_{1 \le j \le C_i}$ that denote the position of each human head in the scene, where $C_i$ is the headcount in image $I_i$. The corresponding estimated density map $D_i^{est}$ is obtained by a total probability formula via convolution with a Gaussian kernel $\mathcal{N}^{est}(p \mid \mu, \sigma^2)$. We have

$$\forall p \in I_i, \quad D_i^{est}(p \mid I_i) = \sum_{j=1}^{C_i} \mathcal{N}^{est}\left(p \mid \mu = P_{i,j}, \sigma^2\right), \tag{1}$$

where $\mu$ and $\sigma$ represent the mean and standard deviation of the normal distribution.
For each head point $P_{i,j}$ in a given image, we denote the distances to its $K$ nearest neighbors as $\{d_{i,j}^{k}\}_{1 \le k \le K}$. The average distance is therefore

$$\bar{d}_{i,j} = \frac{1}{K} \sum_{k=1}^{K} d_{i,j}^{k}. \tag{2}$$

A crowd density map cannot directly show the size of a head. However, since individuals are close to each other in a high-density crowd scene, it can roughly represent the head size: in crowded scenes, the head size is approximately equal to the distance between the centers of two neighboring individuals. The density estimation network we use is the Context-Aware Network (CAN) [17]. It combines features obtained using multiple receptive field sizes and learns the importance of each feature at each image location, adaptively encoding the scale of the contextual information required to predict crowd density accurately. This yields an algorithm that outperforms state-of-the-art crowd counting methods, especially when perspective effects are strong.
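To make the two quantities above concrete, the following minimal sketch builds a density map as a sum of normalized Gaussians centred on head points and computes the mean distance to the K nearest neighbours. It is not the CAN estimator [17] used in this paper. The function names, the brute-force neighbour search, and the choice of tying the Gaussian width to the neighbour distance through a factor `beta` (with an illustrative default of 0.3) are assumptions made only for the example.

```python
import numpy as np

def knn_mean_distance(points, k=3):
    """Mean distance from each point to its k nearest neighbours (brute force)."""
    pts = np.asarray(points, dtype=float)
    if len(pts) < 2:
        raise ValueError("at least two points are needed")
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    k = min(k, len(pts) - 1)
    return np.sort(dist, axis=1)[:, :k].mean(axis=1)

def density_map(points, shape, beta=0.3, k=3):
    """Sum of 2D Gaussians, one per head point given as (x, y).

    ASSUMPTION: sigma = beta * mean k-NN distance (a geometry-adaptive width);
    each Gaussian is renormalised on the pixel grid so that it sums to 1.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d_bar = knn_mean_distance(points, k)
    dmap = np.zeros(shape, dtype=float)
    for (px, py), d in zip(points, d_bar):
        sigma = beta * d
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        dmap += g / g.sum()                   # one head contributes a mass of 1
    return dmap
```

Because every kernel is normalized, summing the map over the whole image returns the headcount, and summing over a sub-region gives an estimate of the number of heads in that region.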

Co-occurrence of homogeneous faces for inferred box harmonization
In this part, we focus on using the face co-occurrence prior to optimize detectors in crowd scenarios. When the face size approaches the limit of imaging resolution, the face appearance is scarce and inadequate. A low-resolution face that is difficult for humans to recognize is also a challenge for a vision-based detector. General face detectors are highly dependent on appearance features, and the severe scarcity of information makes the problem essentially ill-posed, which directly degrades detection performance. However, this situation is the unavoidable norm in crowd-scene face detection. We utilize the co-occurrence of faces as a higher-level context to make detection more sensitive when a face is ambiguously or marginally visible in a crowd scene.
The face co-occurrence prior here refers to the harmonization of homogeneous faces: if many faces in an image have dominant scores, it is reasonable that inferred boxes similar in size to these faces have a high probability of being faces. According to the co-occurrence prior, we increase the scores of real faces with low scores after the inference phase of a detector.
We first send the image into the density estimation network to generate the density map $D_i^{est}$. From the perspective of making full use of the context, the contextual information in a broader perception area of a density map can provide more co-occurrence prior than the area just around the observed face. Hence, it is unreasonable to use Equation (2) to estimate face co-occurrence directly. However, using density maps to reconcile the results of face detection seems to be a chicken-and-egg paradox. At the least, how to use an inaccurate density map to adjust the result of face detection is a complicated interaction problem of heterogeneous information.
As mentioned earlier, a density map can represent the head size in a high-density crowd scene but not in a low-density one. Hence, we need to design an operator that intervenes in the inference in high-density areas and gives up intervention in low-density areas. We define a dense grid on image $I_i$ and generate blocks $A = \{A_i^n\}$ with 50% overlap to minimize border effects, where $n$ is the block index. The population in each block is estimated by integrating the values of the predicted density map over the block region $A_i^n$, whose area is denoted $a_i^n$. In the corresponding block, the average size of all the high-score faces is calculated and recorded as $BS_{avg}^n$.
Two constraints filter the inferred boxes for reconciliation. If the score $s_{x,y}$ of an inferred box exceeds the score threshold $s_t$, the inferred box is a candidate human face. The inferred boxes whose scores ultimately remain lower than $s_t$ are deleted. Boxes whose size lies in the range $\left((1-\gamma)BS_{avg}, (1+\gamma)BS_{avg}\right)$ are further selected as the inferred boxes for reconciliation. Their scores are updated by the reconciliation formula, in which $\sigma$ denotes the Sigmoid function. The proposed FCP-DM scheme is summarized in Algorithm 1.
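The block-level bookkeeping described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not Algorithm 1: the block side of 64 pixels, the use of the square root of the box area as the box "size", and all function names are hypothetical choices. The score update itself is deliberately left out, because the sketch only covers the block statistics and the size filter.

```python
import numpy as np

def overlapping_blocks(shape, block):
    """Yield (y0, y1, x0, x1) windows of side `block` with a stride of block/2.

    Trailing pixels that do not fill a whole stride are ignored in this sketch.
    """
    h, w = shape
    step = max(block // 2, 1)
    for y0 in range(0, max(h - block, 0) + 1, step):
        for x0 in range(0, max(w - block, 0) + 1, step):
            yield y0, min(y0 + block, h), x0, min(x0 + block, w)

def block_statistics(dmap, boxes, scores, block=64, s_t=0.5):
    """Per block: integrated head count, count per pixel, mean size of high-score boxes."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)   # (x1, y1, x2, y2)
    scores = np.asarray(scores, dtype=float)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    # ASSUMPTION: box size is measured as sqrt(width * height).
    size = np.sqrt((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]))
    stats = []
    for y0, y1, x0, x1 in overlapping_blocks(dmap.shape, block):
        count = float(dmap[y0:y1, x0:x1].sum())             # integrate the density map
        area = (y1 - y0) * (x1 - x0)
        inside = (cx >= x0) & (cx < x1) & (cy >= y0) & (cy < y1) & (scores >= s_t)
        bs_avg = float(size[inside].mean()) if inside.any() else float("nan")
        stats.append({"window": (y0, y1, x0, x1), "count": count,
                      "density": count / area, "bs_avg": bs_avg})
    return stats

def reconciliation_candidates(sizes, bs_avg, gamma=0.1):
    """Boolean mask of boxes whose size lies in ((1 - gamma) * bs_avg, (1 + gamma) * bs_avg)."""
    sizes = np.asarray(sizes, dtype=float)
    return (sizes > (1.0 - gamma) * bs_avg) & (sizes < (1.0 + gamma) * bs_avg)
```

A block whose `density` is low would simply be skipped, which mirrors the idea of giving up intervention in low-density areas; where to place that cut-off is not specified here.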

Score-size-specific NMS
NMS [22] is utilized as standard processing for object detection to partition bounding boxes into non-overlapping subsets. The final detections are obtained by averaging the coordinates of the detection boxes in set $B$. If $b_u$ and $b_v$ are two bounding boxes, the Intersection over Union (IoU) overlap refers to the standard Jaccard similarity used in NMS, which can be expressed as

$$IoU(b_u, b_v) = \frac{|b_u \cap b_v|}{|b_u \cup b_v|}. \tag{6}$$

Conventional NMS preserves the detection box with the maximum score and discards all other inferred boxes that overlap it beyond an IoU threshold. Specifically, if $IoU(b_u, b_v) > N_t$ (we use $N_t = 0.3$, the value adopted by most detectors), the box with the lower score is deleted directly. This principle is also effective in the multi-scale pyramid scheme, as more inferred boxes may be detected in different pyramid layers. However, it causes missed detections, because a face partly covered by another face, or two faces close to each other, may not be detected. As illustrated in Fig. 2, as the three models move close to each other, the face of the person in the middle is gradually covered, and the detection score decreases significantly. Meanwhile, because the maximum value fluctuates unstably (as shown in Fig. 2), the use of NMS aggravates the instability of the detection score.
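For reference, the IoU measure and the conventional greedy NMS described above can be sketched in a few lines. This is a generic illustration rather than the post-processing code of any particular detector; the function names and the `(x1, y1, x2, y2)` box layout are assumptions of the example.

```python
import numpy as np

def iou(box, others):
    """IoU (Jaccard similarity) between one box and an array of boxes, all (x1, y1, x2, y2)."""
    others = np.asarray(others, dtype=float).reshape(-1, 4)
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, n_t=0.3):
    """Greedy NMS: keep the top-scoring box, delete every box with IoU > n_t, repeat."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)               # indices sorted by descending score
    keep = []
    while order.size:
        m = order[0]
        keep.append(int(m))
        rest = order[1:]
        order = rest[iou(boxes[m], boxes[rest]) <= n_t]
    return keep
```

With two heavily overlapping boxes and one distant box, only the higher-scoring box of the overlapping pair and the distant box survive, which is exactly the behaviour that can discard a partly occluded neighbouring face.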
Fig. 2: Statistics of the maximum detection score of the HR [9] face detection model using a Hikvision surveillance camera. As the three models move close to each other, the face of the person in the middle is gradually covered, and the detection score decreases significantly.

Based on NMS, Soft-NMS [2] provides a chance to preserve overlapped and close objects by applying a penalizing function to the inferred scores. NMS is a non-continuous procedure that produces a penalty only when the IoU threshold $N_t$ is reached, which can lead to abrupt changes in the ranked score list of the inferred boxes. A continuous penalty function should impose no penalty when there is no overlap and a large penalty at a high overlap. When the overlap is low, it should increase the penalty gradually, and $b_u$ should not affect the scores of boxes with very low overlap. However, when the overlap of a box $b_v$ with $b_u$ becomes close to 1, $b_v$ should be significantly penalized. Taking this into consideration, Soft-NMS updates the pruning step with a Gaussian penalty function as follows,

$$s_v = s_v \, e^{-IoU(b_u, b_v)^2 / \delta}. \tag{7}$$

This update rule is applied in each iteration, and the scores of all the remaining detection boxes are updated. It suppresses an inferred box by reducing its score instead of just removing it.

However, both NMS and Soft-NMS ignore the role of the size factor in inferred box aggregation. Consider an extreme situation in which the areas of the two boxes are quite different, that is, $b_u \gg b_v$. From the definition in Equation (6), the intersection is much smaller than the union, so $IoU(b_u, b_v)$ cannot reach the threshold for deleting redundant boxes in NMS and Soft-NMS. In the inferred box aggregation process, a more reasonable way is to implement a retention operator among boxes of similar size. Based on IoU, we define the Area Consistency of Boxes (ACB) and adopt the constraint that $ACB(b_u, b_v)$ must lie in the range $1 \pm B_t$. Algorithm 2 summarizes the proposed S3NMS scheme. If $IoU(b_m, b_{x,y}) \ge N_t$ and $1 - B_t \le ACB(b_m, b_{x,y}) \le 1 + B_t$, where $b_m$ is the box with the highest score in $B$, it decays the scores using the continuous function $s_{x,y} = s_{x,y} \, e^{-IoU(b_m, b_{x,y})^2 / \delta}$. It uses NMS when the score of the bounding box is low and Soft-NMS when the score is high. A high-score box is more likely to be an occluded face, and Soft-NMS is used to re-identify such a case. For a low-score box, NMS prevents this non-face box from becoming a false positive. The scheme thus gives a chance to detect faces covered by other faces without causing false positives as Soft-NMS does.
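One possible reading of the rule above is sketched below. It should be taken as an interpretation, not as Algorithm 2. In particular, three points are assumptions of this example: ACB is taken to be the ratio of the two box areas, a box that overlaps $b_m$ is decayed only when its area is consistent and its score is at least a threshold `s_th`, and every other overlapping box is removed as in NMS. The default `delta=0.5` and the function names are likewise hypothetical.

```python
import numpy as np

def iou_pair(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def s3nms_sketch(boxes, scores, n_t=0.3, b_t=0.1, s_th=0.5, delta=0.5):
    """Return [(index, final_score), ...] in the order the boxes are selected."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float).copy()
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        m = max(remaining, key=lambda i: scores[i])   # current highest-scoring box b_m
        remaining.remove(m)
        kept.append((m, float(scores[m])))
        survivors = []
        for i in remaining:
            overlap = iou_pair(boxes[m], boxes[i])
            if overlap < n_t:
                survivors.append(i)                   # low overlap: left untouched
                continue
            acb = areas[i] / areas[m]                 # ASSUMPTION: ACB as an area ratio
            if abs(acb - 1.0) <= b_t and scores[i] >= s_th:
                # similar size and high score: Gaussian decay as in Soft-NMS
                scores[i] *= np.exp(-(overlap ** 2) / delta)
                survivors.append(i)
            # otherwise the box is discarded, as in conventional NMS
        remaining = survivors
    return kept
```

Under this reading, a high-scoring neighbour of similar size (a likely occluded face) is kept with a reduced score, while a low-scoring overlapping box is removed outright.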
Score-size-specific NMS is a compromise between NMS and Soft-NMS. It provides a fine-grained consideration of score and size to avoid arbitrary discarding or preservation of bounding boxes, which is essential in the multi-scale face detection task. A more detailed performance evaluation is given in the experiment section.

In the face detection literature, a widely used benchmark is WIDER FACE [32]. WIDER FACE contains 32,203 images with 393,703 faces, 40% of which are used for training, 10% for validation, and 50% for testing. According to the detection rate, the validation data are divided into three classes, "easy", "medium", and "hard", which gradually increase in difficulty across varied face detection scenes in open environments, including changes in size, occlusion, pose, and lighting, as well as background confusion.
Since the solution proposed in this paper mainly targets obscured small faces in crowd scenes, in addition to the commonly used WIDER FACE we prepare a new dataset, Crowd Face, which we collected from the Internet. It contains 34 images with 10731 annotated faces, and the maximum number of faces in an image is 1001. As illustrated in Figure 3, we measured the average size of objects (blue plots) and the average number of objects per image (orange plots). Compared with WIDER FACE, the faces in Crowd Face are much smaller (around 10 times smaller in average object size) and far more numerous (approximately 20 times more objects per image on average). As shown in Figure 5 and Appendix B, the Crowd Face dataset has many low-resolution, small, and obscured faces. It is a challenging dataset with hard samples, specifically for high-density face detection. Testing face detection algorithms on Crowd Face helps to expose the shortcomings of face detectors.

Experimental setting
In our experiments, the models used to verify the proposed methods are HR [9], EXTD [34], S3FD [35], LFFD [6], CAHR [30], PyramidBox [26], DSFD [12], and TinaFace [40]. All the models are trained on the WIDER FACE training set and tested on the WIDER FACE validation set and Crowd Face. We compare many different parameter settings and finally set $s_t = 0.5$, $\gamma = 0.1$ for FCP-DM and $S_{th} = 0.5$, $B_t = 0.1$ for S3NMS. Our experiments are run on a GTX 1080 with 16 GB RAM and a 12-core i7 CPU.

Experiments for face co-occurrence prior based on density map
In this part, the co-occurrence prior based on a density map is tested on Crowd Face, as the proposed method is mainly intended for detecting faces in high-density crowd scenes. We introduce density information into state-of-the-art anchor-based detectors and then combine them with our proposed algorithm. As illustrated in Figure 4, we integrate FCP-DM into the trained detectors HR [9], CAHR [30], EXTD [34], S3FD [35], PyramidBox [26], and DSFD [12], and compare their performance with that of the original detectors. The red curve in each figure represents the detector with our proposed density-map-based co-occurrence prior integrated. The blue curve below it represents the original detector without our proposed method. The results show that, at the same precision rate, detectors with FCP-DM achieve higher accuracy than the original detectors. Face co-occurrence priors increase true positives according to crowd density estimation. Figure 5 compares HR with the co-occurrence prior (cyan ellipses) and the original HR (magenta rectangles) in crowd scenes, where the proposed approach detects more true faces. This illustrates that the proposed method can help detectors find more true faces in crowd scenes with many low-resolution small faces.
S3NMS is a post-processing method that requires no additional training. We compare our approach with the post-processing methods NMS and Soft-NMS, which also need no model training, as shown in Table 1. The comparison illustrates that a fine-grained consideration of score and size is needed to remove redundant boxes. Figure 6

Ablation study on Crowd Face
As shown in Table 2, we perform ablation experiments on Crowd Face. We separately integrate NMS, score-size-specific NMS, and the co-occurrence prior based on density maps into HR [9], PyramidBox [26], EXTD [34], CAHR [30], and DSFD [12]. We first compare NMS with our proposed S3NMS, which shows that S3NMS achieves higher AP. Then, we integrate NMS and S3NMS, each together with FCP-DM, into the detectors. The results show that FCP-DM further improves performance and that S3NMS combined with FCP-DM achieves the best AP, higher than integrating only one of the two methods into the detectors. Together, FCP-DM and S3NMS increase the overall AP by around 1%-6% for the above detection models.
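Since the comparisons above are expressed in AP, a minimal sketch of how an average precision value can be computed from scored detections may be useful. It assumes that each detection has already been matched against the ground truth and flagged as a true or false positive. It uses the common all-point interpolation of the precision-recall curve, which is a generic choice for illustration and not the official WIDER FACE evaluation toolkit; the function name is hypothetical.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve with a monotone precision envelope."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # descending score
    tp = np.asarray(is_true_positive, dtype=bool)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(~tp)
    recall = cum_tp / float(num_gt)
    precision = cum_tp / (cum_tp + cum_fp)
    # make precision non-increasing in recall (all-point interpolation)
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))
```

In this formulation, a post-processing step raises AP either by turning missed faces into true positives (higher recall) or by removing or down-ranking false positives (higher precision at a given recall).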

Overall performance on WIDER FACE
In this part, we apply the proposed FCP-DM and S3NMS together to the detectors on the WIDER FACE dataset. We integrate them into the trained detectors LFFD [6], HR [9], CAHR [30], EXTD [34], S3FD [35], PyramidBox [26], DSFD [12], and TinaFace [40], and compare their performance with that of the original detectors. Table 3 shows that, for most face detectors, integrating the proposed approach yields better performance than the original method. This illustrates that our proposed score-size-specific NMS reduces false positives and increases true positives according to the score and size of the inferred face boxes. On the WIDER FACE hard set, our approach increases AP by 0.1%-1.3%, indicating its capability in challenging situations. The WIDER FACE easy set contains almost no high-density scenes, so the proposed FCP-DM freezes the density estimation output. Thus the proposed method maintains consistent results for most models and does not deteriorate the original performance in low-density scenarios.

Table 3: Performance of integrating score-size-specific NMS and co-occurrence priors into the trained detectors on WIDER FACE.

Conclusion, limitations, and future work
We proposed a general approach that uses high-level contextual information for small and crowded face detection. The proposed scheme reduces false positives and increases true positives according to the score, quantity, and size of the inferred face boxes under the guidance of crowd density estimation. It is suited to detecting multi-scale and low-resolution faces in crowded scenes and provides a refined structure to avoid arbitrary discarding or preservation of bounding boxes. It requires no extra training and is simple to implement.
The main limitation of this method is that it needs to rely on the performance of the density estimate network.The performance of this network is usually affected by the quality of training data, scene category, and task category.Therefore, the network will need to be retrained when changing specific tasks, such as vehicle density estimation.
We will explore the capability of the proposed framework for other dense and small object detection tasks, such as remote sensing scenes with rotated bounding boxes.
Appendix A: Results in a series of degenerate images in a classroom in NUAA (All the portrait rights are licensed.).

Fig. 3: Comparison of the benchmark dataset WIDER FACE and our Crowd Face dataset. Two quantities are measured for each dataset: average size of objects (blue plots) and average number of objects per image (orange plots).

Fig. 7: Statistics of the average detection score of HR [9] before and after cascading the proposed score-size-specific NMS. The experimental scene is the same as in Fig. 2.

Algorithm 1: Face co-occurrence prior for inferred box harmonization
Data: $B = \{b_{x,y}\}$, $S = \{s_{x,y}\}$, $A = \{A_i^n\}$, $D_i^{est}(x,y)$, $\gamma$, $s_t$. $B$ is the list of initial inferred boxes, $S$ contains the corresponding inferred scores, $A_i^n$ is the list of different density areas, and $D_i^{est}$ is the estimated density map.

Table 1 shows that S3NMS has the highest AP compared with NMS and Soft-NMS on the WIDER FACE hard set and the Crowd Face set.

Fig. 5: A comparison of low-resolution face detection on the Crowd Face dataset using our proposed method within the HR detector [9] (cyan ellipses) and the original HR (magenta rectangles).

Table 1: AP performance of NMS, Soft-NMS, and the proposed S3NMS for HR, CAHR, and PyramidBox on the WIDER FACE hard and Crowd Face sets.

Table 2: Ablation study of our proposed score-size-specific NMS and co-occurrence priors based on density maps for HR, PyramidBox, EXTD, CAHR, DSFD, and TinaFace on Crowd Face.
Dong Liang et al.