Low-dose CT image denoising using deep convolutional neural networks with extended receptive fields

Reducing the radiation dose while preserving image quality comparable to that of standard-dose scans is an important topic in computed tomography (CT) imaging, because low-dose CT (LDCT) images are often strongly degraded by noise and artifacts. Recently, there has been considerable interest in using deep learning as a post-processing step to improve the quality of reconstructed LDCT images. This paper first provides an overview of learning-based LDCT image denoising methods, from early patch-based learning methods to state-of-the-art CNN-based ones, and then presents a novel CNN-based method. In the proposed method, preprocessing and post-processing techniques are integrated into a dilated convolutional neural network to extend receptive fields. Hence, distant pixels in the input image participate in enriching the feature maps of the learned model, leading to effective denoising. Experimental results show that the proposed method is lightweight, while its denoising effectiveness is competitive with well-known CNN-based models.


Introduction
The presence of noise adversely affects image quality as well as the performance of subsequent image analysis and processing tasks. Thus, denoising plays an important role in modern image processing systems. Although image denoising has been studied for a long time and numerous efficient methods exist, it remains an active area of research, as it is a testbed for a variety of high-level image processing tasks. In this paper, we are interested in a difficult denoising problem, namely denoising of low-dose computed tomography (LDCT) images.
X-ray computed tomography (CT), introduced in the early 1970s [1], is an ionizing radiation-based medical imaging technique and has become one of the most widely used imaging modalities in medicine. However, its major drawback is that the ionizing radiation can be harmful to the patient's health. It is therefore necessary to reduce the radiation dose. Unfortunately, it has been shown that reducing the radiation dose leads to increased noise levels and artifacts in the reconstructed images. These factors degrade the quality of CT images and thereby the diagnostic accuracy and the outcome of a CT examination. Thus, the reconstruction of high-quality CT images under low or ultra-low radiation dose conditions is a great challenge.
Numerous studies have been carried out on denoising and artifact removal for LDCT. The existing methods can be classified into two groups: (i) methods that filter noise and reduce artifacts during image reconstruction from raw projection data, and (ii) methods that remove noise and artifacts from reconstructed low-dose CT images.
For the first group, methods mostly use filters to suppress noise and artifacts in the raw data (sinogram). The filtered sinogram data are then used to reconstruct the image using an iterative reconstruction (IR) or filtered back-projection (FBP) method [2][3][4]. In these filters, the noise prior (e.g., Poisson noise) plays an important role in designing the denoising algorithm; in practice, however, exactly estimating the noise distribution and level is not easy. It has been shown that sinogram filtration and IR techniques can reconstruct LDCT images with quality equivalent to current clinical standards [5]. Note that the reconstruction techniques are embedded in the hardware of CT scanner systems, so the sinogram data are not available to users.
For the second group (post-processing methods, which do not rely on raw data), numerous digital image denoising methods have been adapted to LDCT images, such as nonlocal means filter-based methods [6][7][8], methods using sparse representation [9][10][11], filters in the wavelet domain [12,13], and the BM3D filter [14]. In these classical methods, noise is suppressed directly in the noisy image. Similar to sinogram filtration, they require knowledge about the types of source noise and the general properties of noise in CT images. Most existing methods assume that the distribution of noise in CT images can be approximated by a Gaussian distribution [15]. Trinh et al. [16] proposed a local Gaussian assumption to deal with the variation of noise levels among image regions. However, the noise in LDCT is in fact very complex and difficult to estimate exactly, which limits the LDCT image denoising performance of the classical methods.
Recently, deep convolutional neural networks (CNNs) have been used to denoise LDCT images with impressive results. It has been shown that CNN-based methods, for instance RED-CNN [17], WGAN-VGG [18], SAGAN [19], SMGAN [20], and FD-VGG [21], significantly outperform classical denoising methods. In this post-processing approach, a large dataset of low-dose and normal-dose image pairs (the images of a pair are taken at the same position of a patient) is required. Normal-dose CT (NDCT) images are much less noisy and of higher quality than LDCT images. The aim of CNN-based methods is to train a deep network to learn the mapping from the LDCT image to the NDCT image of a pair. The idea of using high-quality CT images for denoising LDCT images, however, was proposed much earlier [22][23][24]. An overview of learning-based LDCT image denoising methods is therefore warranted.
In this paper, we first review existing learning-based methods for LDCT image denoising, from naive methods to state-of-the-art ones. Typical learning-based methods are compared subjectively and objectively through experiments performed on LDCT images. Then, we present a novel CNN-based method that uses an extended receptive field CNN architecture.
In deep learning, the receptive field (RF) is defined as the size of the region in the input that produces a given feature [25]. Extending the RF helps the output of a CNN take more information from the input into account. In previous CNNs for image denoising, such as FFDNet [26] and FD-VGG [21], down-sampling is used to extend the RF, but the number of convolutional layers remains high (15 layers). SAR-DRN [27] deploys dilated convolution to extend the RF, but dilated convolution may generate artifacts.
We use both down-sampling and dilated convolution to extend the RF without increasing the number of convolutional layers. In our method, noisy images are re-arranged in an orderly manner into sequences of sub-images (preprocessing). A deep dilated residual CNN is proposed; it receives these sub-image sequences and learns the mapping from LDCT to NDCT images on a given training dataset. At the end of the network, a reconstruction stage produces the desired output image (post-processing). The dilated convolutions and the preprocessing step extend the RFs of the network so that more useful information can be taken into account for denoising.
The rest of the paper is organized as follows. In Sect. 2, an overview of existing example-based learning methods for LDCT image denoising is given, with details of the main typical methods. The proposed method is described in Sect. 3. Objective and subjective comparisons are given in Sect. 4. Finally, conclusions are drawn in the last section.

Overview of learning-based methods for LDCT image denoising
Generally, the goal of image denoising is to restore a clean image from its noisy observations. Unlike classical methods, which try to denoise directly by solving an inverse problem model (e.g., sparse representation, statistical filters, total variation), learning-based methods learn a mapping that represents the relationship between noisy and clean image pairs from given external datasets and use the trained mapping to denoise new noisy images. In this section, the main contributions of this approach and their applications to LDCT images (from naive methods to current state-of-the-art methods) are reviewed. Before going into the details of the existing learning-based methods, let us start with external patch-based denoising methods, which can be considered as a bridge between classical methods and learning-based methods.

External patch-based denoising methods
Recent CNN-based methods [17,18,21] represent the state of the art for LDCT image denoising. However, the idea of using NDCT images to denoise LDCT images was proposed early on in patch-based denoising methods [16,[22][23][24]. Although computation time is a drawback of these patch-based methods, their denoising effectiveness has been confirmed. External patch-based denoising can be considered as a bridge from classical denoising methods to learning-based methods, so this subsection briefly reviews the main elements of this approach.
The idea of an external patch-based denoising algorithm can be formulated as follows. A large noisy image is considered as an arranged set of overlapping small patches, and denoising is performed patch by patch. Given a patch of size $\sqrt{n} \times \sqrt{n}$ from a noisy image, represented by a vector $q \in \mathbb{R}^n$, the algorithm finds a set of similar (i.e., reference) patches $p_1, p_2, \ldots, p_k \in \mathbb{R}^n$ from external clean images and determines a mapping $F$ to obtain an estimate $\hat{p}$ of the unknown clean patch $p$ as

$$\hat{p} = F(q;\, p_1, p_2, \ldots, p_k). \tag{1}$$

Numerous patch-based image denoising methods using external clean images have been proposed [16,23,24]. In the non-local means (NLM) algorithm proposed in [23], the denoising function $F$ of (1) is defined based on the weighted average model of the NLM method of [28], but with weights computed using reference patches extracted from NDCT images. Trinh et al. [16,24] proposed to define $F$ as a sparse linear combination of reference patches, determined by solving a nonnegative sparse coding problem. These methods demonstrated that by using external patches, noise in LDCT images can be effectively suppressed while preserving subtle details, as compared to classical methods. Nguyen et al. [29,30] used image decomposition and sparse representation techniques to define $F$ in such a way that it maximizes the preservation of high-frequency components in LDCT images. Moreover, denoising with the help of external datasets of clean image patches has already been studied for other image types [31][32][33]. For example, Lou et al. [33] showed that by using targeted external databases, one can obtain an effective denoising method that significantly outperforms well-known classical methods such as NLM [28], KSVD [9], and BM3D [14].
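To make the formulation in (1) concrete, the following minimal sketch implements an external patch-based denoiser in the spirit of the NLM variant of [23]: for a noisy patch q, the k most similar patches are retrieved from a bank of clean external patches, and F is taken as their similarity-weighted average. The patch bank, the bandwidth h, and the value of k are illustrative assumptions, not the exact settings of the cited methods.

```python
import numpy as np

def denoise_patch_external(q, clean_bank, k=16, h=0.1):
    """Estimate a clean patch from a noisy patch q (flattened, shape (n,))
    as a weighted average of its k nearest neighbors in an external bank
    of clean patches (shape (M, n)). h is an illustrative bandwidth."""
    d2 = np.sum((clean_bank - q) ** 2, axis=1)  # distances to all candidates
    idx = np.argsort(d2)[:k]                    # k most similar clean patches
    w = np.exp(-d2[idx] / (h ** 2))             # NLM-style similarity weights
    w /= w.sum()                                # normalize the weights
    return w @ clean_bank[idx]                  # weighted average = F(q; p_1..p_k)
```

Running this over all overlapping patches of the noisy image and averaging the overlapping estimates yields the denoised image; the nearest-neighbor search over a large patch bank is what makes this approach slow in practice.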
The advantage of this approach is that it only needs clean patches from example datasets; it does not require corresponding noise-free and noisy patch pairs as in the learning-based methods. Even though external patch-based denoising methods have shown their effectiveness, this approach still requires an assumption of prior knowledge about the noise distribution. Global or local Gaussian noise is assumed in most existing methods, while noise in LDCT images is, in fact, very complex. Moreover, from model (1), one can see that the performance of external patch-based methods depends heavily on the quality of the retrieved reference patches (candidate patches for the desired patch) and on the mathematical model used to determine the denoising function $F$. Note that, for every noisy patch, determining the reference patches and the function $F$ is time-consuming; computational time is thus also a drawback of this approach.

Early learning-based denoising methods
The goal of the learning-based denoising approach is to answer the question: can we automatically learn a denoising procedure from given training examples consisting of pairs of noisy and noise-free (noiseless) images? Specifically, a training dataset $D$ is defined as

$$D = \{(n_i, p_i)\}_{i=1}^{N}, \tag{2}$$

where $n_i$ ($i = 1, 2, \ldots, N$) are the noisy images and $p_i$ are their corresponding noise-free images. In LDCT image denoising, the training set is often established from pairs of LDCT (noisy) and NDCT (noiseless) images taken at the same position of the same patient. Learning-based methods train a model that represents the mapping from the space of noisy images to the space of clean images and use this trained model to remove noise from new noisy images. One of the first learning-based denoising methods was given by Jain and Seung [34]. They designed a simple convolutional neural network of four layers, with 24 neurons per layer and without activation functions (e.g., ReLU, Leaky ReLU), to build a denoising model for natural images (Fig. 1). Due to the hardware limitations for training deep networks at the time and the simplicity of the model, its denoising performance could not compete with cleverly engineered algorithms such as KSVD [9] and BM3D [14]. Several other naive learning-based methods were proposed, for instance the nonlinear regression models in [22,35].
Burger et al. [36] proposed to use a deep learning model, namely a multi-layer perceptron (MLP), and obtained the best performing method at that time. The authors also demonstrated that this learning-based approach can denoise effectively both with and without known noise conditions.
Example-based learning approaches were applied to LDCT image denoising early on, for example in [22] using kernel ridge regression, or in [37] using Markov random fields. Although simple learning models were used, these works demonstrated the promise of the learning approach for LDCT image denoising.
In most early learning-based methods, denoising was performed on overlapping patches, and the final result was aggregated from the denoised patches. This makes it difficult to control subtle textures at pixels in overlapped regions. Nevertheless, the success of patch-based learning methods, e.g., the MLP method [36], opened the door for the numerous CNN-based state-of-the-art denoising methods introduced in the next subsection.

CNN-based LDCT denoising methods
The outstanding success of numerous CNN-based denoising methods for natural images [26,[38][39][40] and the promising results of early learning-based methods for LDCT image denoising have led to many state-of-the-art CNN-based LDCT image denoising methods [17][18][19][20]. Most CNN-based methods were developed and evaluated on open datasets such as the AAPM Low Dose CT Grand Challenge dataset [41] or, more recently, the LoDoPaB-CT dataset [42]. In the following, we describe several recent typical CNN-based methods for LDCT image denoising.
RED-CNN [17], which originates from [43], is one of the first CNN-based methods for LDCT image denoising. The method uses the U-net architecture [44], in which pooling layers are replaced by convolutions and unpooling layers by deconvolutions. The convolutional layers (encoder) extract coarse features of the LDCT image and thereby remove noise. The deconvolutional layers (decoder) tend to recover subtle details, which may be lost when the LDCT image passes through the convolutional layers. The symmetric skip connections help the network converge faster and keep more subtle details [17]. RED-CNN can greatly reduce noise and artifacts. However, since RED-CNN only uses the mean-squared-error (MSE) loss function, the denoised images are often over-smoothed. The architecture of RED-CNN is shown in Fig. 2.
FD-VGG [21] applies FFDNet, designed in [26] for natural image denoising, to LDCT image denoising. The structure of the network is shown in Fig. 3. The authors proposed a loss function defined as the combination of the MSE loss and a "perceptual" loss in order to improve the global quality of denoised images. The perceptual loss was also used in SACNN [45], together with a self-attention CNN. In addition, Shan et al. [46] proposed a convolutional encoder-decoder network with 2D and 3D configurations.
Other methods are based on generative adversarial networks (GANs), a deep generative model introduced in 2014 by Goodfellow et al. [47]. The use of GANs for image denoising has been widely explored in the literature. In LDCT image denoising, state-of-the-art methods such as WGAN [48], WGAN-VGG [18], SAGAN [19], and SMGAN [20] have demonstrated high performance for estimating NDCT images from LDCT images. The architecture of a GAN includes a generator and a discriminator: the generator generates NDCT images from LDCT images, and the discriminator discriminates the generated (fake) NDCT images from real NDCT images. Du et al. [49] applied the GAN framework with a visual attention mechanism to better preserve details in denoised images. As an example, Fig. 4 shows the architecture of WGAN-VGG [18], which includes three sub-networks: the generator, the discriminator, and a pre-trained VGG network. The pre-trained VGG network is used to extract feature maps of real and fake images; these feature maps are then used to calculate the VGG-based perceptual loss, which compares two images in feature space.
Recently, transformer-based networks have been applied to medical image processing problems with impressive results. These emerging architectures can integrate long-range spatial information, but they require powerful hardware for training [50][51][52].
Generally, CNN-based methods significantly outperform traditional methods in both objective and subjective comparisons. Effectively removing noise and artifacts while preserving subtle details is the biggest challenge in LDCT image reconstruction. For this purpose, existing CNN-based methods focus both on the architecture of the network and on the definition of the loss function.

Proposed denoising method
The similarity of non-local pixels (usually computed via patch comparison) plays a key role in the success of well-known image denoising methods, namely NLM [28] and BM3D [14]. In a CNN, similar pixels tend to have similar values in the feature maps. Our idea is therefore to extend the RFs in deep CNN models, which helps similar non-local pixels in images contribute more to denoising.

Network architecture
To extend the RFs, the proposed network embeds the preprocessing and post-processing techniques proposed in [26] into a dilated residual CNN (DRN) that combines dilated convolutions, denoted Dconv [53], with skip connections in a residual learning structure. The DRN architecture is inspired by SAR-DRN, proposed in [27].
The architecture of the proposed network is shown in Fig. 5, where k-Dconv denotes a dilated convolution with dilation factor k. It consists of nine layers: one preprocessing layer, followed by seven nonlinear mapping layers, and one post-processing layer. The preprocessing layer, L0, down-samples the noisy input image into arranged sub-images: an n × m image is decomposed into four sub-images of size n/2 × m/2, as shown in Fig. 6. These four sub-images are fed into the DRN, which consists of seven 3 × 3 dilated convolution layers, L1, L2, ..., L7, and two skip connections. The dilation factors of the 3 × 3 dilated convolutions from layer L1 to layer L7 are set to 1, 2, 3, 4, 3, 2, and 1, respectively, similar to [27]. The two skip connections connect layer L1 to layer L3 and layer L4 to layer L7. Unlike the original SAR-DRN network introduced in [27], layers L2 to L6 use dilated convolution, batch normalization, and ReLU. After layer L7, the DRN generates four sub-images, which are up-sampled by the up-sampling layer (L8) to obtain an estimate of the noise and artifacts in the input image (with the same size as the input image). Finally, this estimate is combined with the noisy input image through residual learning to obtain the denoised image.
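The following PyTorch sketch shows one possible realization of this architecture: pixel-unshuffle/shuffle operations play the roles of L0 and L8, seven 3 × 3 dilated convolutions with factors 1, 2, 3, 4, 3, 2, 1 form L1 to L7 with skip connections from L1 to L3 and from L4 to L7, and a final residual subtraction combines the noise estimate with the input. The channel width (64) and the exact points where the skip connections are summed are our assumptions; the authors' filter counts may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRNLDCT(nn.Module):
    """Sketch of the proposed dilated residual network (assumed details noted)."""

    def __init__(self, ch=64):
        super().__init__()
        dil = [1, 2, 3, 4, 3, 2, 1]  # dilation factors of layers L1..L7

        def dconv(cin, cout, d):
            # padding = dilation keeps the spatial size for a 3x3 kernel
            return nn.Conv2d(cin, cout, kernel_size=3, padding=d, dilation=d)

        # L1: dilated convolution + ReLU on the 4 stacked sub-images
        self.l1 = nn.Sequential(dconv(4, ch, dil[0]), nn.ReLU(inplace=True))
        # L2..L6: dilated convolution + batch normalization + ReLU
        self.l2, self.l3, self.l4, self.l5, self.l6 = (
            nn.Sequential(dconv(ch, ch, d), nn.BatchNorm2d(ch),
                          nn.ReLU(inplace=True))
            for d in dil[1:6])
        # L7: dilated convolution producing the 4 noise sub-images
        self.l7 = dconv(ch, 4, dil[6])

    def forward(self, x):                  # x: (B, 1, n, m), n and m even
        s = F.pixel_unshuffle(x, 2)        # L0: 4 sub-images of size n/2 x m/2
        f1 = self.l1(s)
        f3 = self.l3(self.l2(f1) + f1)     # skip L1 -> L3 (summed at L3 input)
        f4 = self.l4(f3)
        f6 = self.l6(self.l5(f4))
        noise = self.l7(f6 + f4)           # skip L4 -> L7 (summed at L7 input)
        noise = F.pixel_shuffle(noise, 2)  # L8: up-sample the noise estimate
        return x - noise                   # residual learning: denoised image
```

For a 512 × 512 slice, `DRNLDCT()(torch.randn(1, 1, 512, 512))` returns a denoised tensor of the same size.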
The use of the pre- and post-processing techniques together with dilated convolution strongly enlarges the RF while keeping the kernel size fixed. With the factor-2 down-sampling, at layer L1 the 1-Dconv operator with a 3 × 3 filter covers a region of size 6 × 6 in the input image, twice as large as a traditional convolution (without down-sampling). Figure 6 illustrates the effect of the 2-Dconv operator at layer L2 on the input image, using a 3 × 3 kernel with a dilation factor of 2: the 2-Dconv operator covers a region of size 14 × 14 in the input image. Therefore, distant pixels can contribute to the computation of the feature maps; that is, the extension of the RFs increases the information available to the feature maps in the network. These advantages help achieve effective denoising.
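The receptive-field sizes quoted above can be verified with a few lines of arithmetic: each 3 × 3 convolution with dilation factor d grows the receptive field by 2d in sub-image coordinates, and the factor-2 down-sampling doubles everything when mapped back to the input image. The small helper below (ours, for verification only) reproduces the 6 × 6 and 14 × 14 figures:

```python
def receptive_field(dilations, kernel=3, downscale=2):
    """Receptive field on the input image after each stride-1 dilated conv,
    applied to factor-2 down-sampled sub-images."""
    rf, sizes = 1, []
    for d in dilations:
        rf += d * (kernel - 1)        # growth in sub-image coordinates
        sizes.append(rf * downscale)  # map back to input-image coordinates
    return sizes

print(receptive_field([1, 2, 3, 4, 3, 2, 1]))
# -> [6, 14, 26, 42, 54, 62, 66]: the full network sees a 66 x 66 region
```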

Loss function
The loss function measures the difference between the predicted output (denoised LDCT images) and the ground truth (NDCT images). While the architecture determines the complexity of the model, the loss function controls how the denoising model is learned from the training dataset. Many loss functions for image restoration exist in the literature.
In this work, we use both the MSE loss and a VGG-based perceptual loss in the overall loss function. The per-pixel MSE loss often leads to fast convergence; however, it also leads to over-smoothed edges and details in denoised images, and the perceptual loss is applied to deal with these issues. Suppose the training set contains N image pairs, as in (2). The MSE loss is defined as

$$L_{\mathrm{MSE}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| F(n_i; \theta) - p_i \right\|_2^2, \tag{3}$$

where $\theta$ is the set of network parameters and $F(\cdot\,; \theta)$ denotes the network mapping. The perceptual loss is determined in the feature space as follows:

$$L_{\mathrm{P}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{w_i h_i d_i} \left\| \phi\big(F(n_i; \theta)\big) - \phi(p_i) \right\|_2^2, \tag{4}$$

where $\phi$ is a feature extractor and $w_i$, $h_i$, and $d_i$ are the width, the height, and the depth of the feature space, respectively. In this paper, we deploy the pre-trained VGG-19 network [54] for feature extraction.
To take advantage of both the MSE loss and the perceptual loss, we use a linear combination of them as the loss function for the proposed model:

$$L(\theta) = L_{\mathrm{MSE}}(\theta) + \lambda L_{\mathrm{P}}(\theta), \tag{5}$$

where $\lambda > 0$ balances the two terms.
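A possible PyTorch implementation of losses (3)-(5) is sketched below. The VGG-19 cut-off layer (up to the relu4_4 activation here), the absence of ImageNet input normalization, and the channel replication for single-channel CT slices are our assumptions, not details confirmed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

class PerceptualLoss(nn.Module):
    """VGG-19 feature-space MSE, in the spirit of Eq. (4)."""

    def __init__(self):
        super().__init__()
        # Keep layers up to the relu4_4 activation (assumed cut-off point)
        self.vgg = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:27].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)      # the feature extractor stays frozen

    def forward(self, pred, target):
        # CT slices are single-channel; replicate to the 3 channels VGG expects
        fp = self.vgg(pred.repeat(1, 3, 1, 1))
        ft = self.vgg(target.repeat(1, 3, 1, 1))
        return F.mse_loss(fp, ft)        # mean over w*h*d and over the batch

def total_loss(pred, target, perceptual, lam=0.1):
    """Combined loss of Eq. (5): MSE plus lambda times the perceptual term."""
    return F.mse_loss(pred, target) + lam * perceptual(pred, target)
```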

Experiments and performance evaluation
In this section, we present objective and subjective comparisons between the proposed method (DRN-LDCT) and well-known methods (BM3D [55] and FDSC2 [30]), as well as state-of-the-art CNN-based methods (WGAN-VGG, RED-CNN, and FD-VGG). All experiments were performed on the same dataset. To quantitatively evaluate the performance of the methods, we use three quality indices: PSNR (peak signal-to-noise ratio), SSIM (structural similarity) [56], and FSIM (feature similarity) [57]. PSNR measures the intensity difference between the denoised image and the ground-truth (noise-free) image; it cannot describe the subjective quality of the image, which is often very important for medical images. SSIM and FSIM better express the structural and feature similarity between the recovered image and the reference one, and thus better reflect how well the denoising methods preserve important information.
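For reference, PSNR and SSIM can be computed with scikit-image as sketched below; FSIM is not provided by scikit-image, so a third-party implementation (e.g., the piq package) would be needed for that index.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(denoised, ndct, data_range=None):
    """PSNR and SSIM between a denoised LDCT slice and its NDCT ground truth
    (both 2D float arrays). data_range defaults to the ground-truth range."""
    if data_range is None:
        data_range = float(ndct.max() - ndct.min())
    psnr = peak_signal_noise_ratio(ndct, denoised, data_range=data_range)
    ssim = structural_similarity(ndct, denoised, data_range=data_range)
    return psnr, ssim
```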

Dataset
The "2016 Low Dose CT Grand Challenge" database, supported by the National Institute of Health, the American Association of Physicist in Medicine, and Mayo Clinic [41], was used in this work for training the CNN models. This database contains 1 -mm and 3 -mm thickness CT slices of full-dose and simulated quarter acquired from 10 anonymous patients. In this work, we only use 3 -mm CT images. We randomly selected 900 quarter-full-dose image pairs from six patients to establish the training dataset. It consists of more than 600.000 patch pairs of size 64 × 64 (small image pairs) randomly cropped at the same positions in low-dose and full-dose image pairs. The use of patches helps to increase the number of samples for better training. Moreover, patchbased processing represents local details required for optimal denoising [17]. After the training phase, the model is used to map a LDCT image to a NDCT one. For validation and testing, we randomly selected 300 image pairs from four remaining patients. The full-dose CT images are considered as the ground-truths for computing the quality indices.

Parameter setting
For the existing CNN-based models (WGAN-VGG, RED-CNN, and FD-VGG), the parameters were set as recommended by the authors in the corresponding papers. For DRN-LDCT, the parameter λ in (5) was set to 0.1, and the learning rate was set to 10^-2 and halved after every 10 epochs. The number of epochs was set to 50. The Adam optimizer [58] with default hyper-parameter values was used for training the model parameters.
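Under the stated schedule, the training setup might look as follows in PyTorch, reusing the model and loss sketches above; `train_loader`, which yields batches of 64 × 64 LDCT/NDCT patch pairs, is assumed.

```python
import torch

model = DRNLDCT()                        # the network sketched in Sect. 3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # default betas/eps
# Halve the learning rate after every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
perceptual = PerceptualLoss()

for epoch in range(50):
    for ldct, ndct in train_loader:      # assumed loader of patch pairs
        optimizer.zero_grad()
        loss = total_loss(model(ldct), ndct, perceptual, lam=0.1)
        loss.backward()
        optimizer.step()
    scheduler.step()
```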

Objective and subjective comparisons
One hundred and fifty LDCT images from the testing dataset were used to evaluate the performance of the methods. The quality indices are computed by comparing the denoised LDCT images with their associated full-dose images. Experimental results are shown in Table 1 and Fig. 7. Specifically, Table 1 shows the average values of PSNR, SSIM, and FSIM computed on the testing dataset, and Fig. 7 presents basic statistical measurements of the PSNR, SSIM, and FSIM values of the methods over the 150 samples of the testing dataset. As can be seen, the quality indices of DRN-LDCT are comparable to those of RED-CNN and FD-VGG, and significantly higher than those of BM3D and WGAN-VGG.
To further illustrate the effectiveness of DRN-LDCT, Fig. 8 shows the denoising results of the different methods on an LDCT image of the abdominal region. This image contains two liver lesions, highlighted by two small rectangles. Visually, compared to BM3D, FDSC2, and WGAN-VGG, noise was better suppressed by RED-CNN and FD-VGG. The small dark signals in the highlighted regions were well preserved by DRN-LDCT and the other CNN-based methods.
Another experiment is shown in Fig. 9. Here, the noisy testing image is an abdominal LDCT image of another patient containing two liver lesions, marked by two small rectangles [41] (see Fig. 9). From a subjective comparison of the denoised images with the NDCT image, we can see that globally the denoising effectiveness of DRN-LDCT was equivalent to that of RED-CNN and FD-VGG, and slightly better than that of BM3D, FDSC2, and WGAN-VGG. In the regions of interest, DRN-LDCT seemed to outperform the other methods in preserving the structures of the larger lesion; moreover, the small dark point in the small highlighted region was also clearer in the image denoised by DRN-LDCT.

Table 2 shows the capacity and the average computational time of the different CNN-based methods under comparison. The computational time was recorded for testing images of size 512 × 512, on an Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz (8 cores) as well as on a GeForce RTX 3090 GPU. As can be seen, the capacity of DRN-LDCT was 0.75 MB, which is significantly smaller than that of RED-CNN (1.9 MB) and FD-VGG (2.0 MB), while the denoising effect was nearly equivalent. The computational time on CPU of DRN-LDCT was also the lowest [59].

Conclusions
This paper has presented a brief overview of learning-based methods for LDCT image denoising, from simple patch-based learning methods to modern CNN-based methods. In addition, we have proposed a competitive denoising method for LDCT images (DRN-LDCT), based on the idea of extending the RF in an end-to-end CNN architecture. Experimental results have shown that the proposed denoiser is lighter than state-of-the-art methods such as RED-CNN and FD-VGG, while its quality is equivalent. They have also demonstrated the effectiveness of DRN-LDCT compared with several leading state-of-the-art LDCT denoising methods. Future work may look into designing improved loss functions and incorporating attention mechanisms into the network architecture in order to focus more on useful features in medical images.