Object Detection in Hazy Environment Enhanced by Preprocessing Image Dataset with Synthetic Haze

Object detection in hazy environments has always been a difficult task in the autonomous driving field, and major breakthroughs are hard to achieve due to the lack of large-scale hazy image datasets with detailed labels. In this work, we present a simple and flexible algorithm that adds synthetic haze to the MS COCO training dataset, aiming to enhance the performance of object detection in haze when the synthesized hazy images are used for training. Our algorithm is inspired by the Multiple Linear Regression Dark Channel Prior (MLDCP): by applying Stochastic Gradient Descent (SGD) to the reversed MLDCP model, we obtain a general model that adds synthetic haze to haze-free images. We further evaluate the mean average precision (mAP) of Mask R-CNN when the network is trained on the Hazy-COCO training dataset and when hazy test images are preprocessed with existing single image dehazing algorithms.


Introduction
In recent years, computer vision-based systems have played an important role in a wide range of urban traffic applications, such as autonomous and assisted driving, traffic monitoring systems and security maintenance. Neural networks have made significant breakthroughs in fields such as computer vision [8] [11] [12] and speech recognition [22] [23] [24]. With the rapid development of object detection and recognition techniques, and the increasing number of traffic cameras, smart cities are becoming more intelligent and safer. However, the reduced performance caused by poor weather has to be taken into account, since most traffic is outdoor. In particular, images of hazy scenes captured by traffic surveillance and autonomous/assisted driving cameras suffer complicated, nonlinear and data-dependent noise. Reducing the effect of such degradation, so that it does not jeopardize the performance of object detection and recognition, has become a highly desirable task. Many efforts have been devoted to haze-removal algorithms, including state-of-the-art prior-based algorithms [9] [20] [19] and neural network based algorithms [12] [18] [1].

Background of dehazing algorithms
In computer vision, most haze-removal algorithms are designed based on the classical atmospheric scattering model:

I(x) = J(x)t(x) + A(1 − t(x)),  (1)

where I(x) is the observed intensity ("hazy image") and J(x) is the scene radiance ("haze-free image") to be recovered. A is the global atmospheric light, and t(x) is the medium transmission matrix. When the atmosphere is homogeneous, the transmission matrix t(x) can be defined as:

t(x) = e^(−β d(x)),  (2)

where β is the scattering coefficient of the atmosphere, and d(x) is the scene depth, which indicates the distance between the object and the camera. Most state-of-the-art haze removal algorithms estimate the global atmospheric light A and the transmission matrix t(x); the haze-free image J(x) can then be recovered via the inverse of the atmospheric scattering model (1):

J(x) = (I(x) − A) / t(x) + A.  (3)
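As a concrete illustration of the scattering model (1)–(3), the sketch below synthesizes a hazy image from a haze-free image and a depth map, then inverts the model to recover the scene. This is a minimal NumPy sketch; the function names and the default values of A and β are illustrative, not taken from the paper.

```python
import numpy as np

def synthesize_haze(J, d, A=0.8, beta=1.0):
    """Apply the atmospheric scattering model I = J*t + A*(1 - t),
    with transmission t = exp(-beta * d) per eq. (2)."""
    t = np.exp(-beta * d)            # medium transmission, eq. (2)
    t = t[..., np.newaxis]           # broadcast depth over RGB channels
    return J * t + A * (1.0 - t)     # hazy image, eq. (1)

def recover_scene(I, d, A=0.8, beta=1.0):
    """Invert the model to recover J when A and t are known, eq. (3)."""
    t = np.exp(-beta * d)[..., np.newaxis]
    return (I - A) / np.maximum(t, 1e-3) + A   # clamp t to avoid division by ~0
```

With an exact depth map and atmospheric light, the two functions are inverses of each other, which is what makes (3) a valid recovery formula.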

Overview of existing dehazing algorithms
Haze removal has always been a challenging problem, and significant progress has been made in the past few years. Some state-of-the-art haze removal algorithms exploit different ways to estimate the two core parameters, the transmission matrix t(x) and the global atmospheric light A, which can then be used to recover hazy images through (3). The analysis of polarization-filtered images [19] [20] has proved a good approach to recovering hazy scenes captured with different degrees of polarization. In [21], a Markov random field (MRF) framework is utilized to predict the airlight values and maximize the local contrast of a hazy image. To recover the depth information, the Color Attenuation Prior [25] creates a linear model of scene depth and learns its parameters by supervised learning.
He et al. proposed a widely recognized state-of-the-art haze removal algorithm called Dark Channel Prior (DCP) [9]. It builds on the observation that in most non-sky patches of clear images, at least one color channel has some pixels whose intensity is very low, even close to zero. They further proposed an explicit image filter called the guided filter [10] to refine the rough estimation of the transmission matrix t(x), making the estimation more accurate in general cases. However, DCP may fail when there is strong sunlight over a large local region: it underestimates t(x) in that region and overestimates t(x) in the hazy part, causing color shift and distortion in the restored images.
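The dark channel described above is simple to compute: a per-pixel minimum over the color channels followed by a local minimum filter. The following is a minimal NumPy sketch with a naive loop-based filter; the default 15-pixel patch follows common DCP practice, but the implementation details are ours, not from [9].

```python
import numpy as np

def dark_channel(image, patch=15):
    """Dark channel of an HxWx3 image in [0, 1]: min over RGB,
    then a local minimum filter over patch x patch windows."""
    mins = image.min(axis=2)                  # per-pixel minimum across channels
    pad = patch // 2
    padded = np.pad(mins, pad, mode='edge')   # replicate borders for the filter
    h, w = mins.shape
    out = np.empty_like(mins)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out
```

On clear outdoor images the result is close to zero almost everywhere; haze raises the dark channel, which is what DCP exploits to estimate t(x).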
With their rapid development and wide application in computer vision, Convolutional Neural Networks (CNNs) have proved more powerful than prior-based state-of-the-art algorithms in many fields. Several haze removal algorithms obtain a more accurate transmission matrix t(x) by training CNN models on hazy image datasets. Cai et al. proposed DehazeNet [1], a trainable end-to-end system that estimates the transmission matrix t(x). Ren et al. proposed a multiscale deep neural network, consisting of a coarse-scale net and a fine-scale net, to predict a holistic transmission map and refine it locally [18]. Instead of using a CNN to refine the estimation of the transmission matrix t(x), Li et al. proposed a lightweight CNN called AOD-Net [12] that generates the haze-free image directly. CNN-based haze removal algorithms generally outperform prior-based state-of-the-art algorithms. However, the increasingly complicated frameworks and the dependence on large, high-quality datasets make further significant breakthroughs difficult.
Considering the limited efficiency of prior-based algorithms and the bottleneck of CNN-based algorithms, some new methods use unsupervised learning to optimize state-of-the-art dehazing algorithms. Li et al. proposed a multiple linear regression haze-removal model [14] that uses unsupervised learning to optimize the rough estimations of the transmission matrix t(x) and the atmospheric light A in Dark Channel Prior [9]. In [6], a deep neural network is trained on real-world outdoor images by minimizing the Dark Channel Prior energy function.

Object detection model
Object detection models have developed rapidly in recent years. The progression from R-CNN to Fast R-CNN, from Fast R-CNN to Faster R-CNN, and from Faster R-CNN to Mask R-CNN shows that deep models are becoming faster while achieving higher accuracy. In [10], the authors proposed a cascade of AOD-Net dehazing and Faster R-CNN detection modules to detect objects in hazy images, and applied different combinations of more powerful dehazing and object detection modules in the cascade. Moreover, such a cascade can be subject to further joint optimization.
Fast R-CNN refers to the Fast Region-based Convolutional Network [5]. A Fast R-CNN network takes the entire image and a set of object proposals as input and generates a convolutional feature map. For each proposal, a fixed-length feature vector is extracted by a region of interest (RoI) pooling layer and fed into a sequence of fully connected layers. At the output, a softmax layer predicts the probability over K object classes, alongside the offset values for the bounding box.
Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN [17], extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.
Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition and classification. It efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance [8].
Domain adaptation has been widely used to enhance the performance of object detection, especially on images with poor visibility (hazy, snowy, rainy). Domain adaptation can be understood as adapting a classifier trained on a source domain to recognize instances from a new target domain [7]. The source domain is the domain of the training data, with images and full supervision (i.e., bounding boxes and object categories); the target domain is the domain of the test data, with only unlabeled images. Chen et al. [2] proposed a domain adaptive Faster R-CNN to improve cross-domain robustness: two domain adaptation components, on the image level and the instance level, are integrated into the Faster R-CNN model to alleviate the domain discrepancy. Inspired by the Domain Adaptive Faster R-CNN, [16] applied a similar approach to design a domain-adaptive Mask R-CNN (DMask-RCNN). The primary goal of DMask-RCNN is to make the features generated by the feature extraction network as domain-invariant as possible between the source and target domains. Specifically, DMask-RCNN places a domain-adaptive branch after the base feature extraction convolution layers of Mask R-CNN. The experimental results in [2] [16] show that these two domain adaptation methods outperform the basic Faster R-CNN and Mask R-CNN when detecting objects in hazy images.

Dataset
Image datasets have played an important role in computer vision challenges. The rapid development of deep learning models places higher demands on training and testing datasets. Recently, more image datasets have been presented, providing computer vision research with more challenging directions.

Object Recognition Datasets
Object recognition can be roughly described as developing from image classification to object detection and then to semantic segmentation, and all of these tasks place high requirements on image datasets. In an object classification dataset, each image carries binary labels indicating whether objects are present. Object detection not only identifies which class an object belongs to, but also locates it in the image; this requires the dataset to provide large-scale bounding box annotations locating all objects in the images. The PASCAL VOC datasets [4] contain 20 object categories spread over 11,000 images, with over 27,000 labeled object instance bounding boxes, and are widely used by object detection models. The ImageNet dataset [3] contains millions of cleanly labeled full resolution images organized in a densely populated hierarchical structure; deep learning algorithms have made great breakthroughs in both object classification and detection research by using this large-scale dataset. Semantic segmentation requires labeling each pixel in an image with a category, which means an enormous amount of work to build a large-scale dataset with detailed semantic scene labeling. The Microsoft COCO dataset [15] involves about 3.3 million images, over 2 million of which are labeled; it contains 91 common object categories and segments 2.5 million object instances. The MS COCO dataset has become widely adopted for training newer and more complex object detection models.

Image Restoration Datasets
The REalistic Single Image DEhazing (RESIDE) dataset [13] is the first large-scale dataset for benchmarking single image dehazing algorithms and includes both indoor and outdoor hazy images. Furthermore, RESIDE contains both synthetic and real-world hazy images, highlighting diverse data sources and image contents. It is divided into five subsets, each serving a different training or evaluation purpose, and its test sets address different evaluation viewpoints, including restoration quality (PSNR, SSIM and no-reference metrics), subjective quality (rated by humans), and task-driven utility.

Contribution
Even though several large-scale datasets have been presented for training object detection models, few datasets target the problem of object detection in haze. The RESIDE RTTS dataset contains only 4,322 labeled hazy images, far too few to train an object detection model compared with the hundreds of times more images in the MS COCO training dataset. In fact, it is already an extremely time-consuming task to create a dataset like MS COCO with detailed labels and segmentation instances, and the rarity of real-world hazy images makes it even more difficult to collect the same amount of high-quality images as the MS COCO training dataset. It is not worthwhile to spend so much effort building a dataset that can only be applied to a small research field. Our algorithm, based on the Multiple Linear Regression DCP model, handles this trade-off well: given an existing haze-free training dataset for an object detection model, it can easily add synthetic haze to all images. When an object detection model is trained with this new hazy dataset, its performance on hazy images is significantly enhanced. With our algorithm, the lack of a large-scale training dataset will no longer be an obstacle to solving the problem of object detection in haze. This should encourage more researchers to enter this field and further increase the precision of object detectors, and higher precision means more safety when object detection is performed in autonomous driving. A similar idea can also be applied to object detection in other adverse weather conditions, such as rain and snow.

Method
In [14], to optimize the rough estimations of the atmospheric light A and the transmission t(x) in DCP, a Multiple Linear Regression DCP (MLDCP) model is proposed. It adds a weight to each independent variable I(x)/t(x), A/t(x) and A in (3), together with a bias, transforming the haze-removal model (3) into a multiple linear regression model:

J(x) = ω_0 I(x)/t(x) + ω_1 A/t(x) + ω_2 A + b.  (4)

In [14], Stochastic Gradient Descent (SGD) is implemented to train model (4) with haze-free images and corresponding synthetic hazy images as the training dataset. The optimal weights and bias are then found by minimizing the mean-squared error (MSE) between the output images and the target haze-free images.
Since the multiple linear regression model helps DCP gain higher haze-removal performance, a similar idea can be applied to synthesize higher quality hazy images from haze-free images. In the MLDCP model [14], the weights and bias minimize the unexpected deviations caused by the rough estimations of the transmission matrix t(x) and the atmospheric light A. In our algorithm, the weights and bias are instead expected to maximally restore the deviations between haze-free images and synthetic hazy images. Adding them to the atmospheric scattering model (1) turns it into a multiple linear regression model:

I_ω(x) = ω_0 J(x)t(x) + ω_1 At(x) + ω_2 A + b.  (5)

The training dataset is the Outdoor Training Set (OTS) of the RESIDE dataset [13], which provides 8,970 outdoor haze-free images and about 30 synthetic hazy images for each haze-free image. During the training process, we refer to the synthetic images in OTS as the target images I and to the haze-free images as the input images J. The synthetic images generated from model (5) are referred to as the output images I_ω, and the deviations between the output images I_ω and the target images I are evaluated by the mean-squared error (MSE):

MSE(I_ω, I) = (1/N) Σ_x (I_ω(x) − I(x))².  (6)

We aim to find a candidate solution that minimizes this MSE, which we define as the loss (cost) function:

L(ω, b) = (1/N) Σ_x (I_ω(x) − I(x))².  (7)

When Stochastic Gradient Descent is implemented, we compute the derivative of the loss function with respect to each weight and the bias. We then update all three weights and the bias repeatedly, training on each image in the dataset, to travel down the slope of the loss function in steps until it reaches its lowest value:

ω_k ← ω_k − α ∂L/∂ω_k,  b ← b − α ∂L/∂b,  (8)

where α is the learning rate and k ∈ {0, 1, 2}. The weights and bias are all 3 × 1 parameters, adjusting the RGB color channels of all pixels respectively. In each iteration they are updated by the simplified equation

ω_k ← ω_k − α (2/N) Σ_x (I_ω(x) − I(x)) x_k,  (9)

where x_k in equation (9) equals J(x)t(x), At(x), and A from equation (5) for k = 0, 1, and 2, respectively.
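A single gradient step of this training loop can be sketched directly from (5), (8) and (9). The sketch below is simplified: it uses scalar weights on a single-channel image pair and a full-batch gradient over one image, whereas the paper uses 3 × 1 per-RGB-channel parameters and iterates over the whole OTS training set; the function names are ours.

```python
import numpy as np

def predict(J, t, A, w, b):
    """Reversed MLDCP model, eq. (5): I_w = w0*J*t + w1*A*t + w2*A + b."""
    return w[0] * J * t + w[1] * A * t + w[2] * A + b

def sgd_step(J, I_target, t, A, w, b, alpha=0.1):
    """One gradient step on the MSE loss (6)/(7) via eqs. (8)-(9)."""
    x = [J * t, A * t, np.full_like(J, A)]     # regressors x_0, x_1, x_2
    err = predict(J, t, A, w, b) - I_target    # I_w - I
    n = err.size
    for k in range(3):                         # weight updates, eq. (9)
        w[k] -= alpha * (2.0 / n) * np.sum(err * x[k])
    b -= alpha * (2.0 / n) * np.sum(err)       # bias update, eq. (8)
    return w, b
```

Iterating `sgd_step` over pairs of haze-free inputs and synthetic hazy targets drives the loss down; when the targets follow the physical model (1), the weights move toward (1, −1, 1) with a small bias.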
Since the OTS dataset provides 30 synthetic hazy images with different haze intensities for each haze-free image, we compute the optimal weights and bias separately for each haze intensity. We then add different intensities of haze to the original COCO dataset and train the Mask R-CNN model with each Hazy-COCO training dataset. Testing on the RTTS dataset and the UG2 test dataset, the highest mAP value is obtained when the haze intensity is 0.2, so we conclude that the synthetic haze is most similar to a real-world hazy environment at that intensity.
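This selection procedure amounts to a grid search over haze intensities, scored by test mAP. A minimal sketch, where `train_and_eval` is a hypothetical callback (not from the paper) that trains Mask R-CNN on the Hazy-COCO dataset of a given intensity and returns its mAP on the hazy test sets:

```python
def best_haze_intensity(intensities, train_and_eval):
    """Return the haze intensity whose trained detector scores the highest
    mAP; train_and_eval(intensity) -> mAP is supplied by the caller."""
    scores = {intensity: train_and_eval(intensity) for intensity in intensities}
    return max(scores, key=scores.get)
```

The drawback noted in the conclusion is visible here: every candidate intensity requires a full training run, which is why a cheaper selection criterion would be valuable.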

Experiment Setup
We have two tasks in our experiment. Firstly, we evaluate the accuracy of object detection in haze after dehazing the test dataset with the Multiple Linear Regression DCP model. Even though MLDCP achieves the best dehazing performance compared with some state-of-the-art algorithms and CNN dehazing models [14], it is not guaranteed to outperform other dehazing algorithms when applied to the problem of object detection in haze. Detection tests on the RESIDE RTTS dataset have been reported in [12] [13] [16], and the Domain-Adaptive Mask R-CNN model proposed in [16] has the best performance among several detectors. We therefore dehaze the RTTS test dataset with the MLDCP model first and concatenate this preprocessing with the Mask R-CNN and DMask-RCNN models to form new dehazing-detection cascades.
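The dehazing-detection cascade itself is simple function composition: each hazy test image is dehazed, then passed to the detector. A minimal sketch with placeholder `dehaze` and `detector` callables (MLDCP and Mask R-CNN / DMask-RCNN in our experiments; the function name is ours):

```python
def dehazing_detection_cascade(images, dehaze, detector):
    """Preprocess each hazy test image with a single-image dehazing
    algorithm, then run the object detector on the restored image."""
    return [detector(dehaze(image)) for image in images]
```

Because both stages are plugged in as callables, any dehazing algorithm and any detector can be combined without retraining either one, which is how the different cascades in Table 1 are formed.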

Experiment results of dehazed test dataset
When concatenating Mask R-CNN or DMask-RCNN2 with dehazing algorithms such as DCP [9], MSCNN [18] and MLDCP [14], the increase in mAP values in Table 1 shows that these dehazing algorithms are effective at enhancing the performance of object detection in hazy environments. Pre-dehazing the test dataset with the MLDCP model increases the detection precision of the DMask R-CNN2 module by 1.7%, the highest gain among the dehazing approaches. We can conclude that pre-dehazing test images helps increase object detection precision on hazy images. Somewhat surprisingly, AOD-Net [12] degrades the mAP of DMask R-CNN from 61.72% to 60.47%, even though AOD-Net outperforms several dehazing algorithms in [12]. This indicates that better dehazing performance in terms of PSNR and SSIM does not necessarily translate into higher accuracy for object detection in haze.

Conclusion
Inspired by the multiple linear regression model added to Dark Channel Prior in [14], we reverse the idea from dehazing hazy images to adding synthetic haze to haze-free images. This algorithm can be applied to any large-scale real-world image dataset, saving the considerable time and work of building a large-scale image dataset with synthetic or real haze and with detailed labels for object detection and segmentation. The test results presented in Section 5 demonstrate a significant enhancement in the precision of object detection in haze. For object detection in other extreme environments such as rain and snow, a similar idea can be explored to create large-scale rainy and snowy image datasets. As autonomous driving technology matures, object detection in these extreme environments deserves special attention, and higher detection accuracy makes autonomous driving safer. Our algorithm can be further improved in the future. After obtaining the optimal weights and bias for each haze intensity, we currently lack a proper criterion to measure which haze intensity is most similar to the real-world hazy environment; we have to train the Mask R-CNN model with Hazy-COCO training datasets of all the different haze intensities and compare their test performances, which is time-consuming. It would be a significant improvement if we could define a criterion to easily select the haze intensity closest to the real-world haze in the test dataset.

Figure 1. Examples of synthetic hazy images from the MS COCO dataset.

Secondly, we train the Mask R-CNN model with the new Hazy-COCO training dataset and evaluate its performance against the original Mask R-CNN model. During the training process, the backbone network is set to ResNet-101, and training starts from the pre-trained COCO weights. We tried different numbers of training layers to check the effect on test performance. The highest precision on the test datasets was obtained when we trained the head layers for 10 epochs and fine-tuned ResNet stage 4 and up for 60 epochs. The comparison of test performances between the Mask R-CNN model trained with the Hazy-COCO training dataset and the Mask R-CNN model trained with the original COCO training dataset is presented in Section 5.3.

Table 2.

In Table 2, the enhancement of object detection performance in haze is demonstrated on the UG2 Sample Dataset and the RTTS Dataset when the object detection model is trained with the optimized training dataset. On the RTTS and UG2 test datasets, the mAP values increase by 5% and 6% respectively when the Mask R-CNN model is trained with the Hazy-COCO training dataset. The significant increase of the mAP values on both test datasets shows how powerful the new Hazy-COCO training dataset is. If we additionally dehaze the RTTS test dataset with the MLDCP model, the mAP increases by only a further 0.07%; in contrast, the same dehazing step yields a 2% gain when the Mask R-CNN model is trained with the original COCO training dataset. We can conclude that the effects of preprocessing the training dataset and dehazing the test dataset with MLDCP partially overlap.