Underwater Motion Deblurring Based on Cascaded Attention Mechanism

The images captured in underwater scenes frequently suffer from blur effects due to insufficient light and the relative motion between the captured scenes and the imaging system, which severely hinders vision-based exploration and investigation of the ocean. In this article, we propose a feature pyramid attention network (FPAN) to remove the motion blur and restore blurry underwater images. FPAN incorporates cascaded attention modules into the feature pyramid network, enabling it to learn more discriminative information. To facilitate the training of FPAN, we construct a weighted loss function, which consists of a content loss, an adversarial loss, and a perceptual loss. The cascaded attention module and the weighted loss function enable our proposed FPAN to generate more realistic high-quality images from blurry underwater images. In addition, to deal with the lack of publicly available datasets for underwater image deblurring, we build two dedicated underwater deblurring datasets, namely the Underwater Convolutional Deblurring Dataset and the Underwater Multiframe Averaging Deblurring Dataset, to train and examine different deep learning-based networks. Finally, we conduct sea trial experiments on our autonomous underwater vehicle. Experimental results on the two underwater deblurring datasets demonstrate that our proposed method achieves satisfactory results, which validates the practical value of our proposed method in real-world applications.

The autonomous underwater vehicle (AUV) allows people to collect videos to perform visual ocean exploration in the undersea world [1], [58]. However, the images captured in underwater scenes frequently suffer from blur effects due to insufficient light and the relative motion between the captured scenes and the imaging system, which dramatically degrades image visibility and affects the performance of ocean tasks [2], [59]. Thus, removing the motion blur to improve image quality is of great significance.
Motion blur, which degrades image quality, is generally caused by camera shake or fast object motion [3]. Although image deblurring techniques have been developed for decades, most of them focus on providing solutions for blurry images captured on land. These methods either estimate blur kernels and image priors [4]-[7] or train a deep neural network to generate clear images directly from the blurry observations [8]-[12], [61]. The conventional optimization-based deblurring approaches treat the task as a deconvolution process, and clear images can be obtained by using priors. Fergus et al. [5] estimated blur from camera shake using Gaussian scale mixture priors. Krishnan et al. [13] regularized the blurry images using a normalized sparsity measure. Based on the dark channel prior of blurry images, Pan et al. [14] introduced a linear approximation of the minimum operator to compute the dark channel prior, which can be directly extended to nonuniform deblurring in practice. The conventional optimization-based deblurring algorithms have advanced image deblurring techniques to a certain extent, but their performance on blurry images with fewer corresponding features is unsatisfactory because these priors are usually designed under limited observations or restricted assumptions [3]. In addition, some researchers have attempted to combine conventional optimization-based methods with deep learning techniques, such as convolutional neural networks (CNNs), to estimate the blur kernels [15]-[18]. The majority of these blur kernel estimation approaches utilize a conventional optimization-based method in an iterative way, which has shown significant improvement over traditional deconvolution-based deblurring algorithms. However, they are commonly computationally expensive since they repeat a crucial step of a conventional optimization-based method many times. Compared with conventional optimization-based approaches, deep
learning-based image deblurring approaches usually obtain excellent deblurring performance and achieve real-time processing speed. To train a deep neural network, one can conduct abundant experiments to collect slightly or severely blurry images in diverse scenes on land. Different from the construction of land-based deblurring datasets, underwater images often suffer from low visibility (resulting in blur effects) because the light is scattered and absorbed when traveling through the water [1]. Moreover, acquiring clear images in underwater scenes is difficult. Thus, constructing an appropriate underwater deblurring dataset for the motion blur removal task is challenging. Besides dedicated image deblurring algorithms, image restoration methods [1], [2], [19], [40]-[42], [58], [59], [64]-[66] are also able to remove the blur in underwater images. They mainly consider imaging models where the light is attenuated in the water body. These methods have a significant effect on color restoration and improve image sharpness by removing slight blur in the image. However, they show limited ability in restoring severely blurry underwater images.
In this article, we propose a deep learning method based on the cascaded attention mechanism, namely the feature pyramid attention network (FPAN), to translate blurry images into clear ones. We also collect and provide two large-scale underwater deblurring datasets for training underwater image deblurring networks. Both datasets contain clear and blurry images. Meanwhile, the blurry underwater images collected by our AUV-based imaging system are processed using the proposed method to verify the network performance. We compare the proposed method with three conventional methods [1], [13], [19] and two state-of-the-art methods [8], [11], and the experimental results show that our proposed method performs more satisfactorily.
The main contributions of this article are summarized as follows.
1) We propose a deep learning network, which combines the cascaded attention module and the feature pyramid network (FPN), to remove motion blur and restore the brightness and sharpness of underwater images.
2) We collect and release two large-scale underwater deblurring datasets for researchers to advance the development of underwater image deblurring.
3) We conduct experiments on the two underwater deblurring datasets and evaluate the proposed method using real-world experiments on our AUV platform. The experiments show that our proposed method achieves satisfactory results.
The rest of this article is organized as follows. Section II briefly reviews the related works. Section III presents the details of the proposed network. Section IV demonstrates the experimental results using the images from the validation sets and the sea trial dataset. Section V presents the postprocessing. Finally, Section VI concludes this article.

II. RELATED WORK
In recent years, deep learning techniques have achieved great success in image transformation tasks, providing an end-to-end solution to translate distorted images into clear ones. Previous works [20]-[22] estimated rigid or nonrigid transformations between two images for tasks such as motion estimation and matching using Siamese networks. These networks usually need ground truth clear images, but the ground truth clear images are unknown in many application scenes. Later, the spatial transformer was proposed as a trainable module in classification networks to estimate parametric transformations [23]. To handle articulations, the method of nonparametric transformations was used in the form of shape representation [24]. Although methods similar to [23] and [24] with a convolutional variant can solve specific parametric transformation problems, several application scenes are too complex to be representable by a small number of bases [20]. Recently, based on the concept of the spatial transformer and the mapping relationship, Nah et al. [8] proposed a multiscale CNN to restore degraded images, and the network could restore blurry images on three different levels. Following this, Tao et al. [11] extended the multiscale CNN with long short-term memory to produce a scale-recurrent CNN for blind image deblurring, which generated promising deblurred results. Kupyn et al. [9] inherited the generative adversarial network (GAN) from [25] to construct DeblurGAN with the gradient penalty and the perceptual loss, which enable DeblurGAN to achieve satisfactory results. Building on the success of DeblurGAN, Kupyn et al.
[26] proposed DeblurGAN-v2, another substantial push on GAN-based motion deblurring frameworks. The end-to-end deep learning-based methods mentioned above show excellent performance in restoring blurry images with fewer artifacts than the conventional optimization-based methods [13]. In addition, deep learning-based methods do not need to estimate the blur kernel.
Besides removing image blur in an end-to-end way, deep learning-based methods can also be used as a core step to estimate the blur kernel. Schuler et al. [27] designed deep network architectures for blur kernel estimation by imitating the alternating minimization steps of the conventional optimization-based methods. To study the spectral properties of blurry images, Chakrabarti et al. [28] applied a deep CNN to predict the Fourier coefficients, and the estimated blur kernel was obtained from the coefficients in a projection way. In [29], a CNN is used to predict parametric blur kernels for motion-blurred images. Although these CNN-estimated blur kernel methods provide another solution for removing the blur in an image, they are not efficient enough since they repeat a step many times.
The training of deep neural networks is frequently a time-consuming task, and a commonly used network architecture (e.g., encoder-decoder) can usually solve many image translation issues, but the results are not impressive enough. In recent years, the attention mechanism has been widely utilized for efficiently training deep networks in computer vision tasks, which helps generate satisfactory results [30]-[32]. The principle of the attention mechanism is that the importance of different features can be weighed by learning an intermediate attention map and then applying the elementwise product to the attention map and the source feature map [33]. For the task of underwater image processing, the weak textures and features that are crucial in an image, such as an underwater object located in a low-visibility environment and suffering from motion blur, can be learned by an attention-based network.
In this article, we carry out research on underwater image deblurring in an end-to-end way. We aim to remove the underwater image blur induced by low visibility, object motion, and camera shaking. As the blur in the images is caused by multiple factors, and the object features are not conspicuous in these images, it is important to propose a network that can learn more robust features from the training data. The architecture of our network is inspired by Lin et al. [34], Mei et al. [30], and Kupyn et al. [26]. The FPN was proposed by Lin et al. [34] for object detection tasks and achieved satisfactory results. It is a structure containing bottom-up and top-down pathways. The bottom-up pathway is a common convolutional network for feature extraction; the spatial resolution is downsampled in this pathway, and semantic context information is extracted and compressed in the process. As for the top-down pathway, the FPN reconstructs spatial resolution from the semantically rich layers. Lateral connections are constructed between the bottom-up and top-down pathways in the FPN, which supplement high-resolution details and help localize objects. Inspired by this, Kupyn et al. [26] first introduced the idea of the FPN to the field of image restoration and enhancement. Later, Zhang et al. [35] proposed an attention mechanism in a deep network framework to train GANs, which shows excellent performance. Mei et al. [30] trained a model to address the problems of image denoising and image superresolution using the FPN and the attention mechanism. Our network inherits the structure of the FPN. We incorporate those priors and propose an FPAN. Different from previous work using one attention module to connect the network, we propose a cascaded attention network architecture, which allows our network to learn more details. The network architecture is introduced in the next section.

III. METHODOLOGY
Conventional methods formulate the image deblurring task as a deconvolution problem when the blur kernel is spatially invariant [16]. Let I_b(x) be the blurry image, I_c(x) be the latent clear image, K be the blur kernel, and N be the additive white Gaussian noise. The model can be defined as

I_b(x) = K ⊗ I_c(x) + N    (1)

where ⊗ denotes the convolution operation. Different from the conventional image deblurring methods, the deep learning-based methods provide a simple and direct mapping relationship between the blurry image I_b(x) and the latent clear image I_c(x), which can be expressed as

I_c(x) = f(I_b(x); θ)    (2)

where f is the complex deep CNN that transfers the blurry image to the latent clear image and θ is the parameter of the deep CNN. Existing deep learning frameworks, especially GANs [36]-[38], have achieved great success in image translation tasks. There are two competing networks in a standard GAN, namely the generator network and the discriminator network. The images generated by the generator network are fed into the discriminator network, and the discriminator network judges whether they are realistic images. However, this requires a large-scale dataset for training; hence, we construct datasets for training our GAN to achieve the mapping function f, so that we can easily obtain the latent clear image in an end-to-end way.
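The spatially invariant degradation model above can be sketched numerically. The following is a minimal NumPy illustration with toy values (the image, kernel, and noise level are not from the article):

```python
import numpy as np

def blur_and_noise(clear, kernel, noise_sigma=0.01, rng=None):
    """I_b = K (*) I_c + N: correlate a grayscale image with a blur
    kernel (symmetric here, so identical to convolution) and add
    white Gaussian noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    kh, kw = kernel.shape
    padded = np.pad(clear, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    blurred = np.zeros(clear.shape)
    for i in range(kh):                # slide the kernel over the image
        for j in range(kw):
            blurred += kernel[i, j] * padded[i:i + clear.shape[0],
                                             j:j + clear.shape[1]]
    return blurred + rng.normal(0.0, noise_sigma, clear.shape)

# A uniform 3x3 kernel spreads a single bright pixel over a 3x3 patch.
clear = np.zeros((7, 7))
clear[3, 3] = 1.0
kernel = np.full((3, 3), 1.0 / 9.0)
blurry = blur_and_noise(clear, kernel, noise_sigma=0.0)
```

Deblurring is the inverse of this operation; the conventional methods discussed above recover I_c by deconvolving with an estimated K, whereas the deep learning route learns the mapping f directly.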

A. Network Architecture
The pipeline of our proposed network is illustrated in Fig. 1, which includes multiple layers connected in sequence.
1) Generator Network: Considering the dark scene and the inconspicuous features of underwater images, the designed network should learn adequately complex information from the images. Based on this, we choose the FPN as the backbone of our generator network. Our proposed network takes a three-channel RGB (red, green, and blue) image as the input and outputs five feature maps at different scales. The bottom-up pathway for feature extraction is a 3-kernel-2-stride-1-padding convolutional network, and the channels are set to 3, 64, 128, 256, and 512. The features are transferred to the top-down pathway through the lateral connections, and the spatial resolution is reconstructed from the semantically rich layers. The channel numbers in the top-down pathway are the same as those in the bottom-up pathway. To restore the original image resolution, two upsampling layers and convolutional layers are added to reconstruct the spatial resolution. Then, a skip connection is used to learn the residual between the input image and the output image of the convolutional layer, and the final output image is obtained after the elementwise addition module.
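As a quick sanity check on the bottom-up pathway, the standard convolution output-size formula predicts how the feature-map scales arise. This is only a sketch; the layer count is inferred from the channel list above, and the exact layer arrangement in the article may differ:

```python
def bottomup_shapes(h, w, channels=(3, 64, 128, 256, 512)):
    """(channels, height, width) of each stage of the assumed
    3-kernel/2-stride/1-padding bottom-up convolution stack."""
    def conv(n):
        # standard formula: out = floor((n + 2*pad - k) / stride) + 1
        return (n + 2 * 1 - 3) // 2 + 1
    shapes = [(channels[0], h, w)]
    for c in channels[1:]:
        h, w = conv(h), conv(w)
        shapes.append((c, h, w))
    return shapes

# A 512x512 RGB input is halved at each stage: 512 -> 256 -> 128 -> 64 -> 32,
# giving five feature maps at different scales.
print(bottomup_shapes(512, 512))
```

Each halving trades spatial resolution for semantic context, which the top-down pathway and lateral connections then recombine.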
2) Cascaded Attention Mechanism: Although the FPN architecture alone can remove the blur, its performance in actual applications is limited. Therefore, we add convolutional block attention modules (CBAMs) to our generator network. As shown in Fig. 2, the CBAM consists of a channel attention module and a spatial attention module. The channel attention module exploits the interchannel relationships and focuses on "what" is meaningful given an intermediate feature map F, and it can be defined as

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))    (3)

where M_c(F) is the output channel attention map, MLP is the multilayer perceptron with one hidden layer, and σ is the sigmoid function. The spatial attention module utilizes the interspatial relationship and concentrates on "where" the informative area is in an image, which can be defined as

M_s(F) = σ(Conv([AvgPool(F); MaxPool(F)]))    (4)

where M_s(F) is the output spatial attention map and Conv is a 7 × 7 convolution operation. The attention mechanism is simple but effective for feedforward CNNs. Given an intermediate feature map, it sequentially infers attention maps along two separate dimensions, i.e., channel and space. The attention maps are then multiplied by the input feature map for adaptive feature refinement [39], which can be expressed as

F' = M_c(F) ⊗ F,  F_out = M_s(F') ⊗ F'    (5)

where F_out is the final output feature map from the CBAM and ⊗ is the elementwise multiplication. As the attention mechanism has an advantage in learning more texture and feature information, we add eight CBAMs to the FPN architecture to form a cascaded attention network. One CBAM is located between two different layers; therefore, in such a serial network, the next layer learns information that has already been processed by the attention module in the previous layer. Thus, (5) can be rewritten as

F'^i = M_c(F^i) ⊗ F^i,  F_out^i = M_s(F'^i) ⊗ F'^i    (6)

where i is the index of the ith CBAM, i ∈ {1, 2, 3, 4, 5, 6, 7, 8}. Moreover, the
convolution blocks and the additional layers with the same number of channels are connected by a 1 × 1 convolution layer, which allows the information processed by the attention mechanism to be used more fully. Finally, the generator network produces a deblurred image G_θ(F_out^i). Since we aim to restore blurry underwater images and overcome the challenges introduced by the dark underwater scene, the introduction of an attention mechanism such as the CBAM meets the need of the FPN to refine more image details. Taking the underwater camera mounted on our AUV platform as an example, the AUV's speed and the turbulence in the sea directly affect the degree of blur in the captured visual data. In this situation, a common CNN shows limited ability in removing the blur and refining the details. By using the attention mechanism, the proposed network can produce clear and bright images.
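One CBAM step (channel attention, then spatial attention) and the eight-module cascade can be sketched in NumPy as follows. The MLP weights, the stand-in for the 7 × 7 spatial convolution, and the tensor sizes here are all hypothetical toy values; a real implementation would use a deep learning framework:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(F, W0, W1, conv7):
    """One CBAM step on a (C, H, W) feature map. W0/W1 form the shared
    one-hidden-layer MLP of the channel attention; conv7 stands in for
    the 7x7 spatial convolution (a toy callable here)."""
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)       # one hidden ReLU layer
    # Channel attention: sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
    Mc = sigmoid(mlp(F.mean(axis=(1, 2))) + mlp(F.max(axis=(1, 2))))
    Fp = F * Mc[:, None, None]                         # refine "what"
    # Spatial attention on channel-pooled maps: sigmoid(Conv7x7([avg; max]))
    Ms = sigmoid(conv7(Fp.mean(axis=0), Fp.max(axis=0)))
    return Fp * Ms[None, :, :]                         # refine "where"

rng = np.random.default_rng(0)
C, H, W = 4, 6, 6
F = rng.normal(size=(C, H, W))
W0 = rng.normal(size=(C // 2, C))
W1 = rng.normal(size=(C, C // 2))
out = F
for _ in range(8):                                     # the eight-CBAM cascade
    out = cbam(out, W0, W1, lambda a, b: 0.5 * (a + b))
```

Because both attention maps lie in (0, 1), each cascade stage rescales rather than replaces the features, which is what lets later layers reuse information already refined by earlier modules.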
3) Discriminator Network: A discriminator is like a "judge" that distinguishes realistic clear images from the fake clear images generated by the generator. To make the discriminator more capable, we inherit the wisdom of Isola et al. [37]. They proposed a PatchGAN discriminator that takes advantage of both the local information and the global information in an image, generating sharper images than a standard discriminator. We take a further step and combine their discriminator with our proposed generator. In this way, our proposed network learns both global information and local information from the training data. As shown in Fig. 1, the input of the discriminator is a three-channel RGB image. The architecture of our discriminator is a 4-kernel-2-stride-2-padding convolutional network, and the channels are set to 3, 64, 128, 256, 512, and 1, respectively. Together with the generator network, the discriminator network is trained alternately on the dataset to address the min-max problem, which can be expressed as

min_θ max_D E_{I_c}[log D(I_c)] + E_{I_b}[log(1 − D(G_θ(I_b)))]    (7)

where θ is the learnable parameter of the generator network and E denotes the mean.
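The patch-level behavior of such a discriminator can be illustrated by tracing the spatial size through the convolution stack. This sketch assumes five 4-kernel/2-stride/2-padding layers, matching the channel list above; the actual layer arrangement may differ:

```python
def discriminator_patch_grid(n, num_layers=5):
    """Spatial size of the PatchGAN score map after an assumed stack of
    `num_layers` 4-kernel/2-stride/2-padding convolutions."""
    for _ in range(num_layers):
        n = (n + 2 * 2 - 4) // 2 + 1   # standard conv output-size formula
    return n

# Each entry of the resulting n x n map scores one local patch of the input,
# which is how PatchGAN combines local and global judgments.
print(discriminator_patch_grid(512))   # -> 17
```

A larger input simply yields a larger score grid, so the same discriminator weights apply at any resolution.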

B. Training Objective
An image translation GAN framework is notoriously hard to train. In previous works [26] and [36], a weighted loss function showed satisfactory performance in training a complex GAN mapping framework. We inherit the priors of the weighted loss function and propose a loss function aimed at improving the quality of blurry underwater images. It is a three-term loss function consisting of the content loss L_con, the adversarial loss L_adv, and the perceptual loss L_per. Among them, the content loss L_con alone can yield over-smoothened pixel-space outputs [36], [38]. As the underwater scenes are usually dark, and the camera or the object is in motion, the captured underwater images suffer from different degrees of blur, and fine details of the original underwater scenes cannot be reconstructed effectively. To reconstruct the blurry areas and the main features, L_con is utilized as the first term of our loss function, which is defined in pixel space as

L_con = (1/(WH)) Σ_x ||I_c(x) − G_θ(I_b)(x)||²    (8)

where W and H are the image width and height. However, L_con alone cannot generate satisfactory resultant images; they are still blurry and usually lack high-frequency details [20]. Hence, the relativistic average least-squares GAN [36] objective (the adversarial loss L_adv) is used to further improve the high-frequency details in the images. It has been proven in [17] that L_adv allows the network to learn sharper edges and more detailed textures by estimating the probability that the original image is more realistic than the blurry image reconstructed by the generator. Previous works [9], [20], and [38] introduced the perceptual loss L_per as a part of their loss functions. L_per measures the CNN feature-space differences between the generated images and the target images, which shows excellent performance in weakening or eliminating artifacts. To remove the inevitable artifacts in resultant images, we regard L_per as a suitable training objective in our proposed loss function. L_per is
defined as

L_per = (1/(W_j H_j C_j)) Σ ||φ_j(I_c) − φ_j(G_θ(I_b))||²    (9)

where φ_j denotes the feature map of the jth layer of a pretrained CNN and W_j, H_j, and C_j are its dimensions. All the loss functions mentioned above are used as metrics to compare the reconstructed images and the original ones during the training process. Thus, our loss function can be defined as

L = λ_c L_con + λ_a L_adv + λ_p L_per    (10)

where λ_c, λ_a, and λ_p are, respectively, the weighting parameters of the corresponding loss terms.
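The weighted combination can be sketched as follows. The MSE form of the content loss is an illustrative assumption standing in for L_con, while the default weights 0.5, 0.006, and 0.01 are those reported in the training setup of Section III-C:

```python
import numpy as np

def content_loss(gen, gt):
    """Pixel-space mean squared error; the MSE form is an assumption
    used here as a stand-in for the article's content loss L_con."""
    return float(np.mean((gen - gt) ** 2))

def total_loss(l_con, l_adv, l_per, lam_c=0.5, lam_a=0.006, lam_p=0.01):
    """Weighted three-term objective: lam_c*L_con + lam_a*L_adv + lam_p*L_per."""
    return lam_c * l_con + lam_a * l_adv + lam_p * l_per

# With all three terms equal to 1, the objective is 0.5 + 0.006 + 0.01.
gen, gt = np.zeros((2, 2)), np.ones((2, 2))
loss = total_loss(content_loss(gen, gt), 1.0, 1.0)
```

The small weights on the adversarial and perceptual terms keep the pixel-space term dominant, so training is anchored to the ground truth while the other terms sharpen edges and suppress artifacts.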

C. Training Datasets
Ground truth clear images cannot be obtained in underwater scenes. The synthesized underwater image datasets [62], [63] address this issue to some extent, and sufficient training data can be provided for deep learning-based CNNs. However, existing underwater datasets [40]-[42], [62], [63] mainly aim at addressing the issues of object recognition and image enhancement. To the best of our knowledge, available underwater deblurring datasets for training a deep deblurring neural network are rare. To produce sufficient training data, current mainstream works synthesize blurry images from clear images captured on land. The synthesis methods can be divided into two categories: 1) convolving clear images with real-world or generated blur kernels [18], [27], [28]; and 2) averaging consecutive clear frames from videos captured by a high-speed motion camera [8], [43]-[45]. The convolving-based method uses a simplified image formation model, and all pixels in the generated image share the same blur kernel trajectory. Thus, the synthetic images look different from real-world motion blur images and are more similar to those captured with the camera out of focus. To overcome the drawbacks of the convolving-based method, the averaging-based method adopts a multiframe accumulation strategy that is approximately equivalent to a long exposure.
1) Underwater Convolutional Deblurring Dataset (UCDD): Since cableless underwater robots with limited energy collect images in underwater scenes, the camera often cooperates with an auxiliary light source. The acquired images are often blurry and dim, resembling images captured under out-of-focus blur conditions. Inspired by the works in [4] and [6], we adopt a random-trajectory generation model to simulate realistic and complex blur kernels that have a blurring effect similar to that of images acquired with underwater motion platforms. This takes us a step further toward generating blurry and dim data, equivalent to images captured in a dark environment under shaking conditions. The blur kernels are simulated by applying subpixel interpolation to the trajectory vector. Each trajectory vector is a complex-valued vector corresponding to the discrete positions of an object following 2-D random motion in a continuous domain. In this process, a Markov process is used to generate the blur trajectory: the position of the next point of the blur trajectory is randomly generated based on the previous point's velocity and position, Gaussian perturbation, impulse perturbation, and a deterministic inertial component [9]. To render blurry images at different levels, we extract frames from the videos and set the exposure time to 0.5, 0.25, 0.125, and 0.0625 s to generate blurry images. This exposure time setting is appropriate for underwater visualization by an AUV, which often uses strobe lights in conjunction with cameras for visual image acquisition. For each exposure time, we generate the same number of blurry images. Examples of clear and blurry image pairs are shown in Fig. 3. In total, we generate 36 204 pairs of synthetic blurry images and the corresponding ground truth clear images. The UCDD is publicly available.¹
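A highly simplified sketch of the random-trajectory idea is given below. It keeps only the inertia and Gaussian-perturbation terms (no impulse perturbation or subpixel interpolation), and all parameter names and values are illustrative, not the article's:

```python
import numpy as np

def random_trajectory_kernel(size=17, steps=200, inertia=0.9,
                             sigma=0.3, seed=0):
    """Bin a Markov random-walk trajectory into a normalized blur
    kernel. Each step's velocity mixes the previous velocity (inertia
    term) with a Gaussian perturbation."""
    rng = np.random.default_rng(seed)
    pos = np.zeros((steps, 2))
    vel = np.zeros(2)
    for t in range(1, steps):
        vel = inertia * vel + sigma * rng.normal(size=2)  # Markov update
        pos[t] = pos[t - 1] + vel
    # Center the trajectory and histogram it onto the kernel grid.
    pos -= pos.mean(axis=0)
    idx = np.clip(np.round(pos + size // 2).astype(int), 0, size - 1)
    kernel = np.zeros((size, size))
    np.add.at(kernel, (idx[:, 0], idx[:, 1]), 1.0)
    return kernel / kernel.sum()

k = random_trajectory_kernel()
```

Convolving a clear frame with such a kernel (as in the degradation model of Section III) yields one synthetic blurry sample; varying the effective exposure time changes the trajectory length and hence the blur severity.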

2) Underwater Multiframe Averaging Deblurring Dataset (UMADD):
The pipeline of the averaging-based method contains underwater video collection, frame interpolation, and blur synthesis. The blurry images are generated by accumulating the clear image signal over the camera exposure [8], [44]. It can be approximately defined as averaging the pixel values at the same location in high-speed consecutive video frames

I_b = (1/T) ∫_0^T F(t) dt ≈ (1/N) Σ_{n=1}^{N} F[n]    (11)

where T is the exposure time of the camera, F(t) is the light signal at time t, N is the number of frames, and F[n] is the light signal of the nth clear image.
When recording these video frames, the camera should use a high-frame-rate mode to ensure that a large number of video frames are captured within the same exposure time. Meanwhile, special attention should be paid to the quality of each frame since we aim to average these clear frames to generate a blurry one. The GoPro 8 Hero Black camera can be set to a maximum of 240 fps when capturing a video, which can satisfy the need for capturing numerous video frames. However, for most high-speed cameras, high-frame-rate capture is achieved at the expense of video frame quality. Consumer-level cameras (including the GoPro Hero Black series) have limited computational ability in recording all light signals in the cell arrays during the readout time. This is strictly related to the exposure time, which leads to a tradeoff between noise and blur. Short exposures can reduce the blur at the cost of increased noise, whereas long exposures reduce the noise at the cost of increased blur [4]. Thus, we inherit the previous wisdom [44] and set the frame rate to 120 fps as a satisfactory compromise between the quality and quantity of the captured video. Then, an advanced video interpolation technique is applied to expand the frame rate from 120 to 1920 fps, which aims to make the blur more natural and smooth. When the object or camera moves very fast, the averaging operation on the video can produce unnatural results from two adjacent frames [46]. In this situation, the video interpolation technique can raise the frame rate to a high enough level to alleviate or eliminate these unnatural steps. In this article, an adaptive separable convolution video interpolation [47] is utilized to address the problem of unnatural steps and to aid nonlinear motion blur generation. Different from the standard optical flow method, the adaptive-separable-convolution-based video interpolation formulates frame interpolation as local separable convolution over input frames using pairs of 1-D
kernels, which can produce more visually pleasing frames [47]. After the video interpolation operation, we average 241 successive clear images to generate one blurry image and define the 121st clear image as the corresponding ground-truth image. For example, the first blurry image is the mean from the 1st frame to the 241st frame, and the second blurry image is the mean from the 241st frame to the 481st frame. The operation of averaging 241 frames simulates the maximum exposure time of the GoPro 8 Hero Black camera, since the camera can capture a maximum of 240 frames in 1 s. This is consistent with our goal of obtaining as much experimental data as possible. To improve the training efficiency, all the data were resized to 512 × 512 resolution for training and testing. For the training objective, the weights of the content loss L_con, the adversarial loss L_adv, and the perceptual loss L_per are set to 0.5, 0.006, and 0.01, respectively. The training objective is optimized to minimize the distance between the generated image and the ground truth. We trained the network with both UCDD and UMADD for 100 epochs; the initial learning rate is set to 0.0001, and the batch size is set to 4.
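The frame-averaging rule above (241-frame windows, the 121st frame as ground truth, windows starting at frames 1, 241, 481, ...) can be sketched as:

```python
import numpy as np

def synthesize_pairs(frames, window=241, stride=240):
    """Average `window` consecutive interpolated frames into one blurry
    image and take the middle frame (the 121st of 241) as the ground
    truth, with successive windows overlapping by one frame as in the
    text (frames 1-241, then 241-481, ...)."""
    pairs = []
    start = 0
    while start + window <= len(frames):
        chunk = frames[start:start + window]
        blurry = np.mean(chunk, axis=0)   # discrete form of (1/T) ∫ F(t) dt
        sharp = chunk[window // 2]        # middle frame of the window
        pairs.append((blurry, sharp))
        start += stride
    return pairs

# Toy frames whose pixel values equal their frame index:
# 481 frames yield two (blurry, sharp) pairs.
frames = [np.full((2, 2), float(i)) for i in range(481)]
pairs = synthesize_pairs(frames)
```

With constant-velocity toy "motion," the blurry image equals the middle frame, which is exactly why the middle frame is the natural ground-truth choice.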

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we evaluate the proposed network with both the validation sets and the sea trial dataset. The sea trial dataset (25 images in total) consists of realistic blurry underwater images collected by our AUV, covering underwater natural scenes and sediments. To demonstrate the effectiveness of our proposed network, we compare it with several representative methods proposed in recent years, including deep learning-based methods and conventional methods. These selected comparison methods were proposed to address the problem of image deblurring or restoration. Our experiments are implemented [50] with an Nvidia GTX 1070Ti GPU on the Ubuntu platform. In the testing stage, we test all the comparison methods on both the UCDD validation set (3316 images) and the UMADD validation set (1137 images). We also conduct experiments in Jiaozhou Bay and collect a sea trial dataset (25 images). Our AUV platform is equipped with one GoPro 8 Hero Black camera, which is mounted on the bottom of the AUV. The camera's field of view is directed toward the seafloor. Fig. 5 shows our AUV platform and the experimental site.
We evaluate our proposed method in qualitative and quantitative ways. The qualitative evaluations mainly depend on the assessment of image quality by the human visual system. As for the quantitative evaluations, two full-reference evaluation metrics and several nonreference evaluation metrics are used. The full-reference evaluation metrics are the structural similarity index metric (SSIM) [51] and the peak signal-to-noise ratio (PSNR) [52]. Several commonly used nonreference image quality evaluation metrics are employed to compare the performance of different methods in this article: the blind/referenceless image spatial quality evaluator (BRISQUE) [53], the naturalness image quality evaluator (NIQE) [54], the patch-based contrast quality index (PCQI) [55], and the underwater image quality measure (UIQM) [56]. The score of BRISQUE is based on a support vector regression model trained on an image database that contains images with different distortions (e.g., blurring, artifacts, and noise). It can intuitively represent the perceptual image quality and the blur recovery capability. NIQE is an evaluation metric that judges the natural state of an image globally; it is based on constructing a series of features to measure image quality and fitting these features to a multivariate Gaussian model. These features are extracted from simple and highly regular natural landscapes to measure the differences in the multivariate distribution of an image. For BRISQUE and NIQE, smaller scores indicate better image quality. As for PCQI, it provides accurate predictions of human perception of contrast variations using a metric based on an adaptive representation of local patch structure. UIQM is a dedicated underwater image quality metric obtained by assigning carefully calculated weights to UICM (color), UISM (sharpness), and UIConM (contrast). Higher scores of PCQI, UIQM, UICM, UISM, and UIConM indicate better image quality.
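As an example of the full-reference metrics, PSNR can be computed directly from the mean squared error (a standard definition, not specific to this article):

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """Full-reference PSNR in dB: 10*log10(peak^2 / MSE).
    Higher is better; identical images give infinity."""
    mse = np.mean((reference.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((4, 4))
b = np.full((4, 4), 16.0)
print(round(psnr(a, b), 2))   # MSE = 256, so 10*log10(255^2/256) ≈ 24.05
```

PSNR and SSIM require the ground-truth image, which is why the sea trial results must additionally rely on the nonreference metrics listed above.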

A. Ablation Study and Analysis
Fig. 6 shows the qualitative comparison results of the ablation study on the UCDD validation set. To verify the effectiveness of the attention mechanism, we start from the original FPN-based GAN for image deblurring and then add the attention mechanism to the FPN to form the FPAN. Instead of using 512 × 512 image pairs, we use 1280 × 720 image pairs to carry out the ablation study, because the effect of the attention module is more visible in high-resolution images.
As shown in Fig. 6, both the FPN and FPAN architectures are able to remove the blur in the images. Our FPAN outperforms FPN in removing the blur and improving the brightness, especially in the Fish group, Coral reef, and Octopus images in Fig. 6. From the qualitative results, we can confirm that the attention mechanism plays an important role in restoring object details in the blurry images (see the seafloor and fish in the Fish group), and the results generated by FPAN are much closer to the ground truth. We also report the quantitative results in Table I; the proposed FPAN ranks first in eight of nine metrics. For the PSNR metric, the quantitative results of FPAN and FPN are very close. Based on the ablation study, it can be confirmed that FPAN learns more information from the training data. Our data are collected at different depths in underwater environments, and the light suffers from different levels of attenuation due to the varying depths, which requires a strong network to extract weak textures and features. Thus, FPAN is suitable for addressing this ill-posed image translation task.

TABLE I
QUANTITATIVE COMPARISON OF DIFFERENT NETWORK ARCHITECTURES IN THE ABLATION STUDY USING NONREFERENCE METRICS
The values indicate the average scores of the images on the UCDD validation set. The values in bold represent the best results. The number in brackets refers to rankings 1 and 2 of a method on the metric.

B. Underwater Convolutional Deblurring Dataset (UCDD) Validation Set
The qualitative comparisons on the UCDD validation set are shown in Fig. 7, from which we can observe that the input images are much more blurry than the ground-truth images. Fu's method [19] shows a strong ability to compensate for the color but generates images with a significant fogging mask. Peng's method [1] can improve the image quality by removing color casts, although some color and brightness differences remain; however, it shows an unsatisfactory deblurring result even when the image blur is very slight (e.g., the Sediment and the Fish in Fig. 7). Krishnan's method [13] shows limited deblurring performance, as it generates slight artifacts in the resultant images. For the deep learning-based methods, the qualitative results of Nah's method [8] and Tao's method [11] are similar in removing the blur. Wang's method [60] can remove the "noise points" of a degraded image, which then looks globally smooth, but shows limited ability in removing the blur. The results of Kupyn's method [26] are close to those of Tao's method [11]. Mao's method [61] shows a competitive result in removing the blur on the UCDD validation set. Our proposed method generates high-quality images with much better visual appearance, as shown in Fig. 7. Beyond its excellent deblurring ability, our proposed method can even improve the brightness of the entire image compared with the other methods. We notice that some inevitable slight artifacts exist in the results of our proposed method (e.g., the left edge of the Fish group). This is reasonable since there is a brightness gradient in the image of the Fish group: it is dark-bright-dark from top to bottom. The FPN-based framework with the attention mechanism can learn such information and alleviate the artifact problem.
The divers made a significant effort to collect a wide range of underwater scenes and animals, so images with large illumination and darkness changes were inevitably captured. Nevertheless, the qualitative result of our proposed method is still the best among all the comparison methods.
In addition, we report the quantitative comparison of different methods on the validation set in Table II using both full-reference metrics and nonreference metrics. Our proposed method shows superior performance to the other methods; it ranks first in five of nine metrics. For the other four indicators, there is only a small gap between our results and the best results. For the full-reference metrics, our proposed method shows competitive performance. We obtain the highest score on UIQM, which is consistent with the qualitative analysis in terms of contrast, color, and sharpness. The BRISQUE metric reflects the ability to restore image distortions, and the NIQE metric evaluates the proximity of the restored images to natural underwater images; our results outperform the other methods on both. The PCQI metric is computed based on an adaptive representation of local patch structure to provide accurate predictions of the human perception of contrast variations [55]; Fu's method ranks first on it, and its resultant images are more consistent with human perception, so the score is reasonable.
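As described earlier, the UIQM score is a weighted linear combination of the UICM, UISM, and UIConM components. A minimal sketch, assuming the three component scores are already computed and using the linear weights reported in the original UIQM paper [56]:

```python
def uiqm(uicm, uism, uiconm, c1=0.0282, c2=0.2953, c3=3.5753):
    """Underwater image quality measure: weighted sum of the colorfulness
    (UICM), sharpness (UISM), and contrast (UIConM) components.
    The default weights are those reported in [56]."""
    return c1 * uicm + c2 * uism + c3 * uiconm
```

The large weight on UIConM explains why methods that improve contrast tend to score well on UIQM even when their colorfulness gains are modest.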

C. Underwater Multiframe Averaging Deblurring Dataset (UMADD) Validation Set
In this section, we test our proposed method on the UMADD validation set; the qualitative and quantitative results are shown in Fig. 8 and Table III, respectively. Although the generated blurry images are different from the images in UCDD, the qualitative results are similar to those in Fig. 7. The resultant images of our proposed method present superior perceptual quality over those of the other methods, and the restored images show good potential in improving the brightness and sharpness. As reported in Table III, Tao's method [11] and Mao's method [61] outperform other methods in terms of full-reference metrics since they are designed for removing the motion blur generated in an averaging multiframe way; however, they show limited ability on the nonreference metrics. On the UMADD validation set, our proposed method achieves first place in terms of BRISQUE, NIQE, UIConM, and UIQM, and also ranks in the top four for PCQI.

D. Sea Trial Dataset
In the sea trial scenario, a GoPro 8 Hero Black camera was fixed on the bottom of our AUV, which is powered by an onboard battery and propellers. The objects in the captured images are sediments, stones, and marine life, such as starfish and crabs. The frame rate is 240 frames per second. The sea trial dataset contains 25 real-world blurry underwater images, resized to 720 × 540 resolution. Typical examples from the sea trial dataset and the results of the different comparison methods are shown in Fig. 9. The qualitative results of the comparison methods are consistent with their qualitative results on the UCDD and UMADD validation sets. Meanwhile, we evaluate the performance of the different methods using the nonreference metrics on the sea trial dataset; the average score of each metric is shown in Table IV. Our proposed method still ranks first in three of six nonreference metrics, which is attributed to its excellent performance in removing blur in underwater images. We notice that the images of the sea trial dataset suffer from color degradation and image fogging. Peng's method [1] alleviates these issues, whereas the other methods show limited ability in solving them. Thus, we conduct image postprocessing using an advanced underwater image enhancement method, which is introduced in Section V.

E. Efficiency Test
We also report the processing time of the different methods on the sea trial dataset. All the experiments are conducted using the facility mentioned in Section III. The average testing time for the 25 images of the sea trial dataset is shown in Table V. Fu's method [19] is the most efficient in processing a blurry image. Our proposed method ranks fifth among the nine methods and outperforms Peng's method [1], Krishnan's method [13], Tao's method [11], and Kupyn's method [26]. The methods of Wang et al. [60], Mao et al. [61], and Nah et al. [8] process an image in an average time of less than 1 s. As for the conventional methods, Krishnan's method [13] and Peng's method [1] are very time-consuming. According to the above evaluation, the computational efficiency of deep learning algorithms is generally higher than that of traditional methods.
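The per-image time reported in Table V can be obtained by averaging wall-clock time over the dataset. A minimal sketch (illustrative only; the `process` callable stands in for any of the compared methods, and a warm-up pass is included since the first call to a deep model is typically slower due to initialization):

```python
import time

def average_runtime(process, images, warmup=1):
    """Average wall-clock seconds of `process` per image.

    A few warm-up calls are run first and excluded from timing,
    which matters when `process` wraps GPU inference.
    """
    for img in images[:warmup]:
        process(img)  # warm-up, not timed
    start = time.perf_counter()
    for img in images:
        process(img)
    return (time.perf_counter() - start) / len(images)
```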

V. POSTPROCESSING
The proposed underwater image deblurring framework can significantly improve the sharpness of underwater images. However, the images still suffer from inherent color distortion, and a CNN-based method cannot handle the blur effects and the color distortion well at the same time. Thus, we employ our own color restoration method [57] to address the color distortion issue and generate images of higher quality. As shown in Figs. 10 and 11, the image quality is greatly improved by using our proposed method together with the postprocessing approach.

VI. CONCLUSION
In this article, we proposed an end-to-end deep learning-based approach, FPAN, to remove underwater motion blur. By combining the FPN structure with the attention mechanism, FPAN demonstrates clearly superior perceptual quality in removing the blur and restoring the brightness of underwater images. Moreover, due to the lack of publicly available datasets for training deep deblurring networks, we provide two large-scale underwater deblurring datasets, namely UCDD and UMADD. The proposed method is verified on the validation sets and the sea trial dataset. Qualitative and quantitative experimental results show the effectiveness and robustness of our proposed method, which is not only suitable for removing motion blur but also has a strong ability to restore the brightness of underwater images.
Although our proposed method achieves satisfactory results, there are still some limitations. First, it cannot meet the real-time requirement and hence cannot be applied to real-time applications carried out by AUVs. Second, unexpected artifacts might appear, as mentioned above; this is because the model parameter tuning for the water environment requires further optimization. We will make improvements in future work.

Fig. 3. Examples of clear and blurry image pairs in UCDD. The exposure times for the clear and blurry images are 0.5 and 0.25 s in the first row, and 0.125 and 0.0625 s in the second row.

…to collect real-world blurry images containing the camera or the object motion. As both out-of-focus blur and motion blur exist in practice, we decided to inherit the above-mentioned approaches to propose high-quality underwater deblurring datasets for training deep neural networks. The construction of the underwater deblurring datasets takes due account of an AUV's operating environment and scenario; thus, the parameters of the datasets are configured in conjunction with an AUV's motion characteristics. Based on this, we propose two datasets, namely the Underwater Convolutional Deblurring Dataset (UCDD) and the Underwater Multiframe Averaging Deblurring Dataset (UMADD). Both datasets contain two image sequences of the same contents: one blurry image sequence with blur introduced by a shaking camera, and the corresponding clear image sequence. Our divers manually used a GoPro 8 Hero Black camera to capture 19 videos (120 frames per second in the linear mode) at 1920 × 1080 resolution in Bali. The captured videos take full account of content diversity and dynamic motion transformation, and the clear images are extracted from these videos. Both UCDD and UMADD are generated from these videos, and we describe the production of the datasets in detail in the following.

1) Underwater Convolutional Deblurring Dataset (UCDD): Since cableless underwater robots with limited energy collect images in underwater scenes, the camera often cooperates with an auxiliary light source. The acquired images are often blurry and dim, like images captured under out-of-focus imaging blur. Inspired by the works in [4] and [6], we propose a model of random trajectory generation to simulate realistic and complex blur kernels whose blurring effect is similar to that of acquiring images with underwater motion platforms. This takes us a further step to generate blurry and dim data, equivalent to images captured in a dark environment under shaking conditions. The blur kernels are simulated by applying subpixel interpolation to the trajectory vector. Each trajectory vector is a complex-valued vector corresponding to the discrete positions of an object following 2-D random motion in a continuous domain.
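The trajectory-to-kernel step described above can be sketched as follows. This is an illustrative reimplementation under our own simplifying assumptions (the step count, kernel size, and velocity-perturbation constants are arbitrary choices, not the parameters used to build UCDD): a complex-valued random trajectory is rasterized into a normalized blur kernel via bilinear subpixel interpolation.

```python
import numpy as np

def random_trajectory(steps=64, anxiety=0.1, seed=0):
    """2-D random motion as a complex-valued trajectory (continuous domain)."""
    rng = np.random.default_rng(seed)
    v, pos = 0j, [0j]
    for _ in range(steps - 1):
        # random impulse perturbs the velocity; a pull toward the origin keeps it bounded
        v += anxiety * (rng.standard_normal() + 1j * rng.standard_normal()) - 0.02 * pos[-1]
        pos.append(pos[-1] + v)
    return np.array(pos)

def trajectory_to_kernel(traj, size=17):
    """Rasterize a trajectory into a blur kernel with bilinear subpixel interpolation."""
    k = np.zeros((size, size))
    traj = traj - traj.mean()                       # center the path on the kernel
    scale = (size - 3) / (2 * np.abs(traj).max() + 1e-8)
    xs, ys = traj.real * scale + size / 2, traj.imag * scale + size / 2
    for x, y in zip(xs, ys):
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        fx, fy = x - x0, y - y0
        # distribute each subpixel sample over its four nearest pixels
        for dx, dy, w in ((0, 0, (1 - fx) * (1 - fy)), (1, 0, fx * (1 - fy)),
                          (0, 1, (1 - fx) * fy), (1, 1, fx * fy)):
            if 0 <= y0 + dy < size and 0 <= x0 + dx < size:
                k[y0 + dy, x0 + dx] += w
    return k / k.sum()  # normalize so convolution preserves brightness
```

Convolving a clear frame with such a kernel (and optionally darkening it) yields a blurry-and-dim counterpart in the spirit of the UCDD construction.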
Finally, we generate 2842 pairs of blurry and clear images at 1280 × 720 resolution. Examples of clear and blurry image pairs are shown in Fig. 4. The UMADD is publicly available².

D. Details of Training

We trained the network on the UCDD and UMADD datasets. As mentioned in the previous part, 36 204 image pairs are generated in UCDD. We split UCDD into 32 588 image pairs as the training set and 3616 image pairs as the validation set; the image pairs in these sets contain the same proportion of the four kinds of exposure time. For UMADD, we augmented the dataset by rotating the original images clockwise by 90°, 180°, and 270°. Finally, there are 11 368 pairs in the dataset, including 10 231 image pairs in the training set and 1137 image pairs in the validation set.
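The rotation-based augmentation of UMADD can be sketched as follows (an illustrative NumPy version; `np.rot90` rotates counterclockwise, but over the full set {0°, 90°, 180°, 270°} the resulting dataset is the same as with clockwise rotations). Note that the 2842 base pairs times four orientations give the 11 368 pairs reported above.

```python
import numpy as np

def augment_with_rotations(pairs):
    """Quadruple a dataset of (blurry, clear) pairs via 90-degree rotations.

    k = 0 keeps the original orientation; k = 1..3 add the three rotations,
    applied identically to both images so the pair stays aligned.
    """
    out = []
    for blurry, clear in pairs:
        for k in range(4):
            out.append((np.rot90(blurry, k), np.rot90(clear, k)))
    return out
```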

Fig. 5. Experimental scene in Jiaozhou Bay, Qingdao, China. Three GoPro 8 Hero Black cameras are mounted on the head of the AUV. The two areas marked by red rectangles are the head of the AUV and the GoPro 8 Hero Black camera used in the experiments, respectively.

Fig. 6. Qualitative comparison of different network architectures in the ablation study.

Fig. 7. Qualitative experimental results of different comparison approaches on the UCDD validation set.

Fig. 8. Qualitative experimental results of different comparison approaches on the UMADD validation set.

Fig. 9. Qualitative experimental results of different comparison approaches on the sea trial dataset.

Fig. 10. Typical experimental results on the UCDD validation set. (a) Input images. (b) Results of processing the input images using the color restoration method. (c) Results of processing the input images using our proposed method. (d) Results of postprocessing the deblurred images.

Fig. 11. Typical experimental results on the UMADD validation set. (a) Input images. (b) Results of processing the input images using the color restoration method. (c) Results of processing the input images using our proposed method. (d) Results of postprocessing the deblurred images.

TABLE II
QUANTITATIVE EXPERIMENTAL RESULTS OF DIFFERENT COMPARISON APPROACHES ON THE UCDD VALIDATION SET USING FULL-REFERENCE METRICS AND NONREFERENCE METRICS
The values indicate the average scores of the images. The values in bold represent the best results. The number in brackets refers to the rankings 1-9 of a method on the metric.

TABLE III
QUANTITATIVE EXPERIMENTAL RESULTS OF DIFFERENT COMPARISON APPROACHES ON THE UMADD VALIDATION SET USING FULL-REFERENCE METRICS AND NONREFERENCE METRICS
The values indicate the average scores of the images. The values in bold represent the best results. The number in brackets refers to the rankings 1-9 of a method on the metric.

TABLE IV
QUANTITATIVE EXPERIMENTAL RESULTS OF DIFFERENT COMPARISON APPROACHES ON THE SEA TRIAL DATASET USING NONREFERENCE METRICS
The values indicate the average scores of the images. The values in bold represent the best results. The number in brackets refers to the rankings 1-9 of a method on the metric.

TABLE V
AVERAGE PROCESSING TIME OF DIFFERENT METHODS FOR AN IMAGE IN THE SEA TRIAL DATASET
The values in bold represent the best results.