Deep Color Consistent Network for Low-Light Image Enhancement

Low-light image enhancement (LLIE) explores how to refine the illumination and obtain natural normal-light images. Current LLIE methods mainly focus on improving the illumination, but do not consider color consistency by reasonably incorporating color information into the LLIE process. As a result, a color difference usually exists between the enhanced image and the ground-truth. To address this issue, we propose a new deep color consistent network, termed DCC-Net, to retain color consistency for LLIE. A new "divide and conquer" collaborative strategy is presented, which can jointly preserve color information and enhance the illumination. Specifically, the decoupling strategy of our DCC-Net decouples each color image into two main components, i.e., a gray image plus a color histogram. The gray image is used to generate reasonable structures and textures, and the color histogram is beneficial for preserving color consistency; both are utilized to complete the LLIE task collaboratively. To match the color and content features, and to reduce the color consistency gap between the enhanced image and the ground-truth, we also design a new pyramid color embedding (PCE) module, which can better embed color information into the LLIE process. Extensive experiments on six real datasets show that the enhanced images of our DCC-Net are more natural and colorful, and perform favorably against the state-of-the-art methods.


Introduction
Low-light image enhancement (LLIE) is the task of refining the illumination to obtain natural normal-light images, which aims at improving the perception and visual quality of low-light images captured in poor illumination environments. Low-light images are rather common in reality, e.g., images captured in outdoor or indoor scenes with poor lighting conditions, which suffer from unclear contents and textures, low contrast and noise. These degradations not only have a negative effect on human perception, but are also detrimental to subsequent multimedia computing and computer vision tasks designed for high-quality images, for instance face recognition [3], object detection [25] and semantic segmentation [4].

Figure 1. Comparison of our DCC-Net and other deep LLIE methods in terms of PSNR/SSIM metrics. We clearly see that there is a large color gap between the enhanced images of RetinexNet, Zero-DCE++, KinD++ and EnlightenGAN, and the ground-truth image. In contrast, our DCC-Net can retain the color consistency effectively, and the enhanced image is more natural and colorful.
Traditional LLIE methods aim to build a model to refine the illumination and obtain the enhanced image, which can be roughly categorized into histogram equalization (HE)based and retinex-based methods. Traditional methods are relatively simple and easy, but they usually cannot restore the consistent colors and detailed textures.
With the impressive performance of deep neural networks (DNNs) in diverse high-level and low-level vision tasks [5,10,31,36], deep LLIE methods have also achieved great improvement [2,11,30]. Deep LLIE methods usually design a deep neural network equipped with different modules to reverse the degradation process. Compared to traditional methods that usually produce undesirable illumination and noise, deep LLIE methods can obtain better results due to the strong ability of DNNs. However, these methods tend to generate inconsistent colors, as can be seen in Figure 1: there is an obvious color difference between the generated images of RetinexNet, Zero-DCE++, KinD++ and EnlightenGAN and the ground-truth, while the result of our DCC-Net is more natural and conforms to the real colors. We ask: what makes the enhanced images lose color consistency?
We attempt to answer this question from two respects:

1) Different architectures. There are two popular ways to handle the LLIE task in current studies: a) end-to-end deep frameworks that directly process the low-light image to obtain a normal-light image; b) retinex-based frameworks that decompose the image into reflectance and illumination for further processing. Both paradigms focus on refining illumination while ignoring color consistency and naturalness. Thus, a color gap appears in the enhanced images.
2) Information mismatch. Color histogram describes color information globally, which does not contain any spatial information. As a result, we cannot find suitable color information for specific contents of images. The connection between the color and content features is therefore unable to be directly built. This kind of information mismatch will make the enhanced images unnatural and contain inconsistent colors.
We therefore propose a new "divide and conquer" collaborative strategy, which can jointly retain the color consistency and enhance the illumination. Generally, the main contributions of this paper are summarized as follows:

• Technically, we introduce a new strategy to retain the color consistency for LLIE, and propose a deep color consistent network termed DCC-Net to reduce the color difference between the enhanced image and the ground-truth. To the best of our knowledge, this is the first work to enhance the illumination of low-light images by directly exploring color consistency. Extensive experiments show that our DCC-Net can better enhance the illumination, and the enhanced images are more natural and consistent in color.
• To jointly retain the color consistency and enhance the illumination, DCC-Net adopts a decoupling strategy that decouples a color image into a gray image and a color histogram, which complete the LLIE task collaboratively. We design three sub-nets for DCC-Net, i.e., G-Net, C-Net and R-Net, as shown in Figure 2. G-Net aims at recovering the gray image, which offers rich structure and texture information. C-Net aims to learn the color distributions, which is conducive to color consistency. R-Net combines the gray image and color information to restore the normal-light image.
• To overcome the color histogram's lack of spatial information, we also design a pyramid color embedding (PCE) module that consists of six color embedding (CE) sub-modules arranged in a pyramid structure. CE matches the color and content features according to the affinities between them, so that the color information can be dynamically incorporated, which further reduces the color gap between the enhanced image and the ground-truth image.

Related work
In this section, we present a brief review of both traditional and deep LLIE methods.

Traditional LLIE Methods
HE-based methods. Based on various image priors, HE-based LLIE methods [17,24] focus on changing the dynamic range of the image to improve contrast, such as [1,17]. However, HE-based methods pay attention to enhancing the contrast rather than directly refining the illumination. Thus, the enhanced results may suffer from under-enhancement or over-enhancement.
Retinex-based methods. These methods [7,8,20] decompose an image into the pixel-wise product of reflectance and illumination, inspired by the retinex theory [15]:

S = R ⊙ I,

where S denotes an image, R and I denote the corresponding reflectance and illumination respectively, and ⊙ denotes element-wise multiplication. By further processing the reflectance and illumination, enhanced results can be obtained [7,8,12,13,20,27]. Since retinex-based methods aim at estimating the illumination, which is hand-crafted and depends on intensive parameter tuning, the final results often contain inconsistent colors and noise.
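As a toy illustration of this decomposition, the sketch below estimates the illumination as the per-pixel channel maximum and brightens it with a gamma curve before recomposing. Both heuristics (max-channel illumination, a fixed gamma) are our illustrative assumptions, not the estimators used by the cited methods.

```python
import numpy as np

def retinex_enhance(s, gamma=0.5, eps=1e-6):
    """Illustrative retinex-style enhancement: decompose S = R * I,
    brighten the illumination, and recompose.
    s: float image in [0, 1] of shape (H, W, 3)."""
    # Estimate illumination I as the per-pixel maximum over channels
    # (a common heuristic; real methods use more refined estimates).
    i = np.max(s, axis=2, keepdims=True)
    r = s / (i + eps)              # reflectance R = S / I
    i_enh = np.power(i, gamma)     # gamma < 1 brightens dark regions
    return np.clip(r * i_enh, 0.0, 1.0)
```

Note that this naive pipeline tends to amplify noise along with brightness, which is exactly the weakness the paragraph above attributes to hand-crafted retinex processing.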

Deep Learning-based LLIE Methods
Deep LLIE methods can usually outperform traditional methods due to the strong learning ability of DNNs. According to whether paired data are used, current deep LLIE methods can be divided into three categories, that is, supervised, unsupervised and semi-supervised methods.
Supervised deep LLIE methods. For supervised methods, all training data are paired. We can further divide this kind of methods into retinex-based methods and end-to-end methods. Retinex-based deep LLIE methods similarly use deep learning to decompose an image into reflectance and illumination. For example, Wei et al. [30] proposed

Figure 2. The overall framework of our DCC-Net. As can be seen, there are three sub-nets: G-Net, C-Net and R-Net, where G-Net aims at recovering the gray images with rich content information, C-Net focuses on learning the color distributions, and R-Net combines the gray image and color information to restore the natural and color-consistent normal-light images.
a RetinexNet with two stages, where the first stage decomposes an image into reflectance and illumination, and the second one adjusts the illumination map. Zhang et al. also presented two improved models based on RetinexNet, called KinD [35] and KinD++ [34]. Compared with RetinexNet, KinD and KinD++ contain three sub-networks: decomposition-net, restoration-net and adjustment-net.
End-to-end deep LLIE methods directly process the low-light image rather than decomposing it. For example, LLNet [21] is a pioneering work using a deep autoencoder for deep LLIE. Due to the strong representation ability of Convolutional Neural Networks (CNNs), Li et al. [19] further presented a CNN-based deep LLIE model. By estimating the global contents of the low-light image, a deep hybrid network was also developed to preserve the structural details in the enhanced image [26]. Supervised methods can achieve better enhancement of weakly illuminated images, but the color consistency between the enhanced image and ground-truth remains a difficult issue.
Unsupervised/semi-supervised deep LLIE methods. In reality, it is challenging or even impractical to obtain paired data, i.e., degraded and ground-truth images of the same scene. Therefore, unsupervised/semi-supervised LLIE methods have been studied to alleviate this problem. For example, Yang et al. [32] designed a deep recursive band network using paired and unpaired low/normal-light images to obtain a linear band representation of an enhanced normal-light image. Jiang et al. [11] presented an unsupervised LLIE method that employs a generative adversarial network (GAN) as the main framework. There are also several zero-shot methods whose input only contains the low-light images [6,18,33,38]. By training these zero-shot models with carefully-designed loss functions, the illumination of input low-light images can also be enhanced. Though these methods solve the LLIE problem without or with only partially paired data, the enhancement quality is usually limited.

Proposed Method
In this section, we introduce the framework (see Figure 2) and details of DCC-Net, which aims at preserving color consistency and naturalness when obtaining normal-light images. DCC-Net has three sub-nets (i.e., G-Net, C-Net, R-Net) and one pyramid color embedding (PCE) module.

Network Structure
G-Net. Given an input low-light image, the target of G-Net is to predict the gray image of the normal-light image, which contains rich structure and texture information without color information. This process is formulated as

G_pre = GNet(S_low),

where G_pre denotes the predicted gray image, S_low denotes the input low-light image and GNet denotes the transformation of G-Net. Specifically, G-Net employs an encoder-decoder pipeline, which is similar to the classic U-Net [28].
For G-Net, we use the l1 loss to reconstruct the gray image:

l_g = (1 / (H × W)) Σ_{x,y} |G_pre(x, y) − G_high(x, y)|,

where l_g denotes the gray image reconstruction loss, G_high denotes the gray image of the normal-light image, and H and W denote the height and width of the gray image G_high. Hence, G-Net does not consider color information and is devoted to recovering the textures and structures.

C-Net. The color histogram is a kind of color feature widely used in image retrieval systems [9]. A color histogram mainly describes the proportion of different colors in the entire image, without encoding the spatial position of the colors. In this paper, we calculate the color histogram in RGB color space. In particular, the color histogram of an image is a matrix of size N × 256, where N = 3 corresponds to the three color channels (i.e., R, G and B), and 256 is consistent with the range of pixel values.
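As a concrete sketch, the N × 256 histogram described above could be computed as follows. This is a NumPy illustration; the helper name `color_histogram` is ours, and normalizing each row to proportions is our assumption based on the "proportion of different colors" description.

```python
import numpy as np

def color_histogram(img):
    """Per-channel color histogram of an RGB image.
    img: uint8 array of shape (H, W, 3).
    Returns an N x 256 matrix (N = 3), where row c holds the
    proportion of pixels taking each value 0..255 in channel c."""
    hist = np.zeros((3, 256), dtype=np.float64)
    for c in range(3):
        counts = np.bincount(img[..., c].ravel(), minlength=256)
        hist[c] = counts / counts.sum()   # proportions, not raw counts
    return hist
```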
C-Net is designed based on the color histogram for color feature learning. The goal of C-Net is to obtain color features consistent with the normal-light image (see Figure 2). We also utilize an encoder-decoder pipeline for C-Net, which transforms the input low-light image into the predicted color histogram by the following formula:

C_pre = CNet(S_low),

where C_pre denotes the obtained color histogram and CNet denotes the calculation process of C-Net. To better reconstruct the color histogram, we also apply the l1 loss to constrain C-Net, which can be described as follows:

l_c = Σ |C_pre − C_high|,

where l_c is the color histogram reconstruction loss and C_high is the real color histogram of the normal-light image. Note that the color histogram cannot describe the contents and details of images. That is, C-Net pays all its attention to learning consistent color features, which is beneficial to enhancement.

R-Net. Based on the gray image and color histogram obtained by G-Net and C-Net, R-Net combines them to restore the normal-light image collaboratively. The input low-light image, predicted gray image and color histogram are transformed into the normal-light image by R-Net as follows:

S_pre = RNet(S_low, G_pre, C_pre),

where S_pre denotes the enhanced image. To reconstruct the normal-light image at the pixel level, we use the color image reconstruction loss l_r, which is defined as follows:

l_r = (1 / (N × H × W)) Σ |S_pre − S_high|,

where N, H and W denote the channel number, height and width of the normal-light image S_high. At the structure level, we employ the SSIM loss as a constraint:

l_ssim = 1 − SSIM(S_pre, S_high),

where the similarity function SSIM(·) is described as

SSIM(x, y) = ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2)),

where x, y ∈ R^{H×W×3} denote the two images to be measured, μ_x, μ_y ∈ R represent the mean values of the two images, σ_x², σ_y² ∈ R are the corresponding variances, σ_xy is their covariance, and c_1 and c_2 are two constant parameters that prevent the denominator from being zero. In addition, the total variation loss l_tv is also employed as a regularization term to retain the smoothness of the enhanced image.
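For intuition, the SSIM term can be sketched as follows. This simplified version computes a single set of statistics over the whole image, whereas practical SSIM implementations use local sliding windows; the constant values for c_1 and c_2 are illustrative assumptions.

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Simplified global SSIM between two images in [0, 1].
    Uses whole-image statistics (mean, variance, covariance);
    practical SSIM uses local windows."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()   # covariance term
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return num / den

def ssim_loss(x, y):
    """Structure-level loss: l_ssim = 1 - SSIM(x, y)."""
    return 1.0 - ssim_global(x, y)
```

Identical images give SSIM = 1 (loss 0), and the loss grows as structures diverge.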

Pyramid Color Embedding (PCE)
The PCE module is designed to properly embed the color information into R-Net, as shown in Figure 3. Clearly, PCE has six color embedding (CE) modules arranged in a pyramid structure. Specifically, CE achieves the dynamic embedding of color features. The main component of CE is the dual affinity matrix (DAM), which solves the information mismatch issue.
Dual affinity matrix. From G-Net and C-Net, we can obtain the corresponding gray image and color histogram, which provide rich structure and texture details, and color information, respectively. R-Net applies both of them to achieve better enhancement. Since the color histogram does not contain spatial information, simply concatenating them will cause inaccurate illumination in the enhanced image. Besides, the simple concatenation will also result in a mismatch between color information and contents, which may produce color differences in the enhanced images.

To solve the information mismatch issue and obtain better color information embedding, we present a new color embedding (CE) module, which can dynamically incorporate color features into R-Net according to the affinity between color and content features. The proposed dual affinity matrix (DAM) aims at computing an affinity matrix to match color and content features, and further prevent the enhanced image from producing inconsistent colors. Specifically, given color features C and content features F of size N × H × W, DAM first computes the negative Manhattan distance and the inner product between C and F for each position, which are formulated as follows:

M(x, y) = −‖F(x, y) − C(x, y)‖_1, P(x, y) = ⟨F(x, y), C(x, y)⟩,

where F(x, y), C(x, y) ∈ R^N denote the vectors of F and C at position (x, y), and M, P ∈ R^{H×W} are the Manhattan distance matrix and inner product matrix. Then, the dual affinity matrix A can be calculated as follows:

A = tanh(P) ⊙ (2 × sigmoid(M)),

where tanh(·) and sigmoid(·) are the tanh function and sigmoid function respectively. Note that M(x, y) ≤ 0 for each position (x, y), so that sigmoid(M) ∈ [0, 0.5]. Hence, we use 2 × sigmoid(M) to ensure that A ∈ [0, 1].

Color embedding. CE performs the dynamic embedding of color information, whose structure is given in Figure 3. After obtaining the dual affinity matrix A, CE computes the element-wise multiplication of A and the color features C.
The weighted color features are summed with the content features F to obtain the color-embedded features:

E = A ⊙ C + F,

where E denotes the output features used in the decoder of R-Net. There is also an upsampling operation that changes the resolution of the color features C before they are fed into the next CE as its input color features.
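A minimal NumPy sketch of DAM and CE under this formulation, assuming M is the negative Manhattan distance, A = tanh(P) ⊙ 2·sigmoid(M), E = A ⊙ C + F, and non-negative (post-ReLU) features so that A indeed stays in [0, 1]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_affinity(f, c):
    """Dual affinity matrix between content features f and color
    features c, both of shape (N, H, W). M <= 0 so 2*sigmoid(M)
    lies in (0, 1]; with non-negative features, tanh(P) lies in
    [0, 1), keeping the affinity A in [0, 1]."""
    m = -np.abs(f - c).sum(axis=0)          # (H, W), neg. Manhattan dist.
    p = (f * c).sum(axis=0)                 # (H, W), inner product
    return np.tanh(p) * 2.0 * sigmoid(m)    # affinity A

def color_embedding(f, c):
    """CE: weight color features by the affinity and add content
    features, i.e. E = A * C + F."""
    a = dual_affinity(f, c)                 # (H, W)
    return a[None, ...] * c + f             # broadcast over N channels
```

Positions where color and content features agree (small distance, large inner product) receive affinities near 1, so their color features are passed through almost unchanged; mismatched positions are suppressed.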
Pyramid structure. Given color features, we can use them to guide the enhancement process to obtain consistent colors. To fully explore the color information, we present PCE, which includes six CEs with a pyramid structure (see Figure 3). Given the color features C_i and content features F_i for the i-th CE, the features obtained by PCE in each layer from shallow to deep are described as follows:

E_i = CE(C_i, F_i),

where E_i denotes the output features and CE(·) denotes the transformation of CE. C_i is computed from the (i − 1)-th CE, whereas F_i is copied from the corresponding layer in the encoder of R-Net. The pyramid structure embeds color features into six layers. In other words, this progressive design can make full use of the color information. As a result, the enhanced image will be more consistent in colors.
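The pyramid loop might be sketched as below; the nearest-neighbor upsampling and the generic `ce` callback are stand-ins for the network's learned operations, not the actual modules.

```python
import numpy as np

def upsample2x(c):
    """Nearest-neighbor 2x upsampling for (N, H, W) features
    (a stand-in for the learned upsampling between CE levels)."""
    return c.repeat(2, axis=1).repeat(2, axis=2)

def pce(color_feats, content_feats, ce):
    """Pyramid color embedding: apply CE at each level from shallow
    to deep, upsampling the color features between levels.
    content_feats: list of (N, H_i, W_i) encoder features whose
    resolution doubles at each level; ce(f, c) -> embedded features."""
    outputs, c = [], color_feats
    for f in content_feats:
        outputs.append(ce(f, c))   # E_i = CE(C_i, F_i)
        c = upsample2x(c)          # feed upsampled colors to next CE
    return outputs
```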

Objective Function
The objective function of our DCC-Net is described as

l_total = λ_g l_g + λ_c l_c + λ_r l_r + λ_ssim l_ssim + λ_tv l_tv, (15)

where λ_g, λ_c, λ_r, λ_ssim, λ_tv are trade-off parameters. Specifically, l_g and l_c are used to recover the gray image and color histogram, respectively. l_r and l_ssim are utilized to reconstruct the normal-light image at the pixel and structure levels. l_tv can be regarded as a regularization term to prevent over-fitting and preserve smoothness.
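A direct transcription of Eq. (15), with default weights taken from the values reported in the experimental settings (λ_g = 1, λ_c = 2, λ_r = 2, λ_ssim = 2, λ_tv = 0.1):

```python
def total_loss(l_g, l_c, l_r, l_ssim, l_tv,
               lam_g=1.0, lam_c=2.0, lam_r=2.0, lam_ssim=2.0, lam_tv=0.1):
    """Weighted sum of the five DCC-Net loss terms (Eq. 15)."""
    return (lam_g * l_g + lam_c * l_c + lam_r * l_r
            + lam_ssim * l_ssim + lam_tv * l_tv)
```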

Experiments
In this section, we evaluate the LLIE performance of our DCC-Net on several datasets, and describe the comparison results with some related deep LLIE methods.

Experimental Settings
Evaluated datasets. For training, we use the LOL synthetic dataset and LOL real dataset [30]. Specifically, the LOL synthetic dataset contains 1,000 paired synthetic low/normal-light images. For testing without paired data, we also use the DICM, LIME, MEF, NPE [29] (85 images) and VV (24 images) datasets.

Evaluation metrics. To evaluate the performance of different LLIE methods, we use both full-reference and non-reference image quality evaluation metrics. For the LOL testing data with paired data, peak signal-to-noise ratio (PSNR), structural similarity (SSIM), mean absolute error (MAE), inference time and color-sensitive error (CSE) [37] are employed. For the DICM, LIME, MEF, NPE and VV datasets without paired data, only the naturalness image quality evaluator (NIQE) is used, as there is no ground-truth. Note that CSE is a new metric that can measure the color difference between two images [37]. To be specific, the greater the PSNR and SSIM, the better the enhancement; in contrast, the smaller the CSE, MAE and NIQE, the more realistic the refined images.
Compared methods. Since we mainly focus on deep LLIE methods, DCC-Net is compared with six state-of-the-art deep neural network-based LLIE methods.

Implementation details. We conduct all experiments on the PyTorch [23] platform in a Python environment with two NVIDIA GeForce RTX 2080Ti GPUs. All training and testing images are resized to 512×512 pixels. The Adam optimizer [14] is utilized with a batch size of 6. We train our DCC-Net for 400 epochs, where the learning rate is 0.0001 during the first 200 epochs and 0.00001 for the next 200 epochs. For the hyper-parameters of our DCC-Net, we empirically set λ_g = 1, λ_c = 2, λ_r = 2, λ_ssim = 2, λ_tv = 0.1.

Quantitative Enhancement Results
LOL dataset with paired data. We first evaluate each deep LLIE model on the LOL dataset. The numerical results are reported in Table 1. We can see that: (1) our DCC-Net obtains the greatest PSNR value and the smallest MAE value, i.e., its enhanced results are the closest to the ground-truth among all the compared methods; (2) for the SSIM metric, our DCC-Net is comparable to KinD and superior to the other methods, i.e., our DCC-Net can better restore the structures for LLIE; (3) compared with supervised methods, unsupervised methods are still weak in retaining the quality of enhanced images; (4) our DCC-Net obtains an obvious improvement on the CSE metric compared with other deep LLIE methods, where "ratio" indicates the ratio of each other method's result to that of our DCC-Net; since CSE directly measures the color difference between two images, these results show that our DCC-Net is effective at preserving color consistency; (5) the inference time of our DCC-Net is comparable to that of the other methods. As such, in terms of enhancement performance and inference time, DCC-Net obtains the best results with a relatively short inference time.
Datasets (DICM, LIME, MEF, NPE and VV) without paired data. We also conduct experiments on unpaired and real low-light images. Table 2 displays the quantitative image quality of the LLIE results in terms of the NIQE metric. In general, DCC-Net obtains better NIQE results than all compared methods. To be specific, KinD and KinD++ achieve preferable performance on the MEF dataset, slightly better than ours. For the other datasets, our DCC-Net is the best.

Visual Image Analysis and Evaluations
LOL dataset with paired data. Figure 4 shows several enhanced images from the LOL dataset. It is clear that our DCC-Net can prevent the enhanced image from exhibiting inaccurate colors. In most cases, there is an obvious color difference between the enhanced images of the compared methods and the ground-truth image. The light-refined images of KinD and KinD++ are usually over-enhanced. The resulting images of EnlightenGAN and our DCC-Net look better. In terms of the numerical PSNR/SSIM metrics, our proposed DCC-Net achieves the best enhancement results.
Datasets without paired data. We further exhibit the visual enhancement results on the DICM, LIME, MEF, NPE and VV datasets in Figures 5-7. We can find that: (1) Zero-DCE++ and Zero-DCE tend to generate over-enhanced images, which are full of white pixels and lose many details; (2) there is an obvious color gap in the enhanced results of RetinexNet, which makes them seem unreal; (3) for KinD, KinD++ and EnlightenGAN, the illumination-improved images lack naturalness; (4) in contrast, the enhanced images of our DCC-Net are more natural and colorful.

Ablation Study
We evaluate the effect of the network structure and PCE module on the performance of our DCC-Net.
Effectiveness of network structure. To demonstrate the effectiveness of the sub-networks G-Net and C-Net, we conduct the LLIE task on the LOL dataset with and without them. Figure 8 displays the LLIE results of the different models, where W/o G-Net and W/o C-Net represent our DCC-Net without G-Net and C-Net, respectively. We find that unreasonable colors are produced by W/o G-Net and W/o C-Net. Table 3 reports the quantitative results. We see that there is an obvious performance decline without G-Net or C-Net, which demonstrates the rationality and validity of the proposed "divide and conquer" collaborative strategy.
Effectiveness of PCE. As can be seen from Table 3, when PCE is removed from DCC-Net (denoted as W/o PCE), the PSNR and SSIM values are smaller than those of DCC-Net, i.e., PCE is important to ensure the performance. Similarly, W/o PCE yields a greater MAE value, which suggests that PCE is effective for enhancing the illumination. This is because PCE can effectively match the color and content features layer by layer, taking full advantage of the color information. From the third row of Figure 8, we see that the image enhanced without PCE contains an undesired yellow color, which is obviously inconsistent with the peripheral regions.

Conclusion
We have discussed the issues of retaining color consistency and naturalness for the LLIE task. Technically, we proposed a new "divide and conquer" collaborative strategy to retain color information and naturalness, and developed a deep color consistent network called DCC-Net. To be specific, two sub-nets are designed to learn a gray image and a color histogram from a low-light image, where the gray image offers rich content information and the color histogram provides color information. Since the color histogram does not consider spatial position, a new module, PCE, is further designed to match color and content features and progressively embed color information. Through the collaborative strategy, DCC-Net can jointly preserve color information and refine illumination. Extensive experiments show the superiority and effectiveness of DCC-Net for obtaining more natural and colorful normal-light images. In the future, we will investigate more effective networks to further improve the naturalness and color consistency for LLIE. Besides, how to quantitatively assess naturalness and image quality in terms of contents and color difference remains an open problem, which is also an interesting direction for future work.