A Hardware-Aware Network for Real-World Single Image Super-Resolution

Most single image super-resolution (SISR) methods are developed on synthetic low-resolution (LR) and high-resolution (HR) image pairs, which are simulated by a predetermined degradation operation, such as bicubic downsampling. However, these methods only learn the inverse process of the predetermined operation, which fails to super-resolve real-world LR images whose true degradation deviates from the predetermined operation. To address this, we propose a novel super-resolution (SR) framework named the hardware-aware super-resolution (HASR) network, which first extracts hardware information, particularly the camera degradation information. The LR images are then super-resolved by integrating the extracted information. To evaluate the performance of the HASR network, we build a dataset named Real-Micron from real-world micron-scale patterns. The paired LR and HR images are captured by changing the objectives and registered using a developed registration algorithm. Transfer learning is employed when training on the Real-Micron dataset due to the limited amount of data. Experiments demonstrate that by integrating the degradation information, our proposed network achieves state-of-the-art performance for the blind SR task on both synthetic and real-world datasets.

Impact Statement-The proposed HASR method has significant impact on various areas, such as enhancing the accurate inspection of manufactured products for quality control and enhancing the resolution of medical images to enable more accurate diagnosis and healthcare. Current SR solutions neglect the uniqueness of each imaging system and hence cannot produce accurate HR images across different systems. Taking advantage of the known hardware information, HASR can differentiate low-resolution images across different imaging systems and produce HR images that are closer to the real-world scenario. Given sufficient training images, the proposed HASR method can overcome the physical optical limitation and generate higher-quality images. The proposed method improves the overall
performance by about 0.2 dB and 0.5 dB on the synthetic and the real-world datasets, respectively.

I. INTRODUCTION

High-resolution digital images are consistently preferred, whether for human satisfaction or for various downstream industrial applications. However, there are instances where obtaining images with the desired resolution is challenging due to limitations in imaging hardware. Factors like low-resolution (LR) cameras or unstable imaging conditions can result in a loss of image resolution. To address this issue, image super-resolution (SR) techniques are frequently employed. These SR techniques are designed to reconstruct high-resolution (HR) images from their LR counterparts. Image SR not only has the potential to enhance image details and realism [1] but also to overcome the limitations of imaging systems [2]. Recently, deep learning has paved the way for the development of numerous advanced SR algorithms that leverage large-scale datasets [3]-[5]. While these methods excel with artificially degraded LR images, like those created through techniques such as bicubic downsampling, they face challenges when dealing with real-world LR images. This decline in performance results from a domain gap between the training data and the data encountered during inference, particularly when the degradation kernel of real-world LR images differs from the one used for training.
There are typically two approaches to address the SR issue mentioned above: (1) generating LR images through multiple degradation models during training [6]-[8], and (2) learning the degradation kernel first and then using it for SR [9]-[11]. The first approach struggles with complex real-world degradations, while the second approach is more practical but often overlooks a critical piece of prior knowledge: the hardware information of the image acquisition devices.
Real-world degradations, stemming from factors like camera blur, sensor noise, sharpening artifacts, and image compression [6], are closely tied to the specific imaging system (camera) in use. Therefore, we posit that possessing prior knowledge of the image acquisition system can significantly enhance real-world SR, a common scenario in industry where the camera is known. Leveraging this prior knowledge and the supervised contrastive learning (SupCon) method [12], we can generate hardware representations and employ them to enhance the generation of SR images.
Our proposed hardware-aware super-resolution (HASR) network consists of two steps. In the first step, we aim to extract hardware representations. We hypothesize that, in relatively stable capture environments, images taken by the same camera share similar blur kernels, while those from different cameras exhibit distinct blur kernels. Initially, we considered querying specifications such as pixel resolution and sensor type and encoding this information into vectors. However, for efficient differentiation of images from different hardware setups, we adopted contrastive learning. This method groups image patches from the same camera and separates patches from different cameras, implicitly embedding the camera's hardware information. In the second step, we integrate this hardware information into the SR network using our proposed hardware-aware block (HAB), incorporating spatial and channel attention mechanisms. Detailed structures of the HASR network are provided in Fig. 1 and Section III.
Furthermore, obtaining real-world LR-HR image pairs is challenging, resulting in limited large-scale real-world SR datasets. We address this in two ways. First, we apply transfer learning to the HASR network by initially training the network on publicly available synthetic datasets and fine-tuning it with a small number of real-world samples. These synthetic datasets simulate degradation processes using isotropic Gaussian filters with additive Gaussian noise. Second, we introduce the Real-Micron dataset, containing micron-scale patterns and captured using three Basler CMOS cameras with objectives of various high magnification factors (see details in Section IV).
The contributions of this paper are as follows:

II. RELATED WORK
This section is divided into three parts: The first part surveys current solutions for the blind SR problem, the second part introduces contrastive learning and its variants, and the third part explores feature fusion methods.

A. Blind SR Methods
As discussed in the first section, there are two categories of blind SR methods. The first category includes methods that incorporate multiple degradation models in the network. For example, in [8], the authors proposed to concatenate an LR input image with its degradation map as a unified input to the SR model, allowing for feature adaptation according to the specific degradation and covering multiple degradation types in a single model. In [7], a kernel modeling super-resolution network (KMSR) was proposed, where the simulated LR images were generated by applying a specific blur kernel, chosen from a predetermined kernel pool, to HR images. Other methods, such as [6], [13], [14], built more generic training datasets with more kinds of realistic blur kernels. However, these methods had a significant drawback: they relied on predefined blur kernel pools and could not provide satisfactory results for images with degradations not covered by their pools.
The second category is to estimate the degradation kernel first and then super-resolve the LR images with the learned degradation kernel information. For instance, iterative kernel correction (IKC) [10] proposed to correct kernel estimation in an iterative way to gradually approach a satisfactory result. In [9], the authors introduced "KernelGAN", an image-specific Internal-GAN that estimated the SR kernel (downscaling kernel) that best preserved the distribution of patches across scales of the LR image. However, these methods were time-consuming due to the numerous iterations during inference. In [15], unsupervised contrastive learning was used to estimate the degradation process. The authors first learned abstract representations to distinguish the various degradations in the representation space rather than explicitly estimating the exact degradations. They then introduced a degradation-aware SR (DASR) network with flexible adaptation to various degradations based on the learned representations. A contrastive loss was used to conduct unsupervised degradation representation learning by contrasting positive pairs against negative pairs in the latent space. However, the degradation representation highly relied on the contents of the LR images because of the assumption that each image had a unique degradation kernel. In [16], an unsupervised way to imitate real-world LR images of an unknown downsampling process was proposed. The authors implemented a generative adversarial network [17] to generate LR images that had a similar distribution to the real-world LR images. Furthermore, to keep the generation process stable, a low-frequency loss (LFL) and an adaptive data loss (ADL) were utilized to keep the content consistent between the generated LR and the real-world LR images. However, balancing the data loss and the adversarial loss required great care. The authors also did not consider the kernel variances within the training data: the estimated degradation kernel was just an average over all the training data, which would be inaccurate if the training data came from different acquisition systems.

B. Contrastive Learning
Contrastive learning is a self-supervised learning method widely utilized in computer vision, natural language processing, and other domains. Intuitively, contrastive learning can be considered as learning by comparing. To learn the representations of the samples, contrastive learning compares the similarities among the samples: it aims to embed similar samples (positive examples) close to each other while pushing different samples (negative examples) away. In [18], a simple framework for contrastive learning of visual representations (SimCLR) was presented. SimCLR learned representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space. The paper showed that the authors' methods significantly outperformed previous techniques for self-supervised and semi-supervised learning on ImageNet. However, the batch size for SimCLR training was limited by hardware constraints such as GPU memory. To address this issue, MoCo [19] introduced a dynamic dictionary with a queue and a moving-averaged encoder, allowing for the creation of a large and consistent dictionary on-the-fly, which facilitated contrastive unsupervised learning. MoCo-V2 [20] built upon this approach by incorporating SimCLR's stronger data augmentation and MLP projection head, enabling it to achieve better results than SimCLR on a typical 8-GPU machine. Additionally, if labels are available, they can be integrated into the contrastive framework's similarity and dissimilarity definitions. The authors of [12] extended the self-supervised batch contrastive approach to the fully-supervised setting with two possible versions of the supervised contrastive (SupCon) loss. The SupCon loss offered benefits for robustness to natural corruptions and was more stable to hyperparameter settings such as optimizers and data augmentations.
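The NT-Xent objective that SimCLR optimizes can be sketched in a few lines. The following NumPy version is purely illustrative (the function name and the paired batch layout are our own assumptions, not the authors' implementation):

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss for a batch of 2N embeddings, where z[2k] and z[2k+1]
    are the two augmented views of the same sample (illustrative sketch)."""
    n = z.shape[0]
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize embeddings
    sim = z @ z.T / temperature                       # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    pos = np.arange(n) ^ 1                            # positive partner: 0<->1, 2<->3, ...
    # log-softmax over each row, then pick the positive's log-probability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n), pos].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))  # 4 samples x 2 views, 16-dim embeddings
loss = nt_xent_loss(z)
```

For random embeddings the loss hovers near log(2N − 1); pulling view pairs together drives it down.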

C. Feature fusion
As deep learning continues to evolve in handling multimodal data, the effective fusion of information across multiple modalities has been extensively explored. Multimodal information fusion is typically categorized into three main approaches: early (feature-based), late (decision-based), and hybrid fusion [21].
In the context of this paper, we exclusively focus on early fusion, where hardware information is treated as a supplementary component rather than an independent modality. Within early fusion, one straightforward technique involves the use of adaptive instance normalization (AdaIN) [22] to align the mean and variance of features from one modality with those from another. Attention mechanisms, widely employed in image super-resolution (SR) networks, have played a pivotal role in early fusion. In [23], a channel attention mechanism was proposed to adaptively rescale channel-wise features by considering interdependencies among channels. Additionally, in [24], the authors introduced the holistic attention network (HAN) to model the comprehensive interdependencies among layers, channels, and positions. In [25], an SR network based on a graph attention network (SRGAT) fully leveraged internal patch-recurrence within natural images. With the increasing adoption of transformer backbones, self-attention mechanisms are making their way into SR tasks as well. In [26], a multiscale hierarchical design, incorporating efficient Transformer blocks, was introduced to capture long-range pixel interactions, even for large images. This approach divides images into multiple patches that interact with each other through self-attention mechanisms within the transformer blocks. This paper focuses on investigating whether the fusion of hardware information improves SR performance. Thus, our exploration has been primarily centered on the application of attention mechanisms. We remain open to considering additional fusion methods in the future, with the anticipation that more effective solutions will be uncovered.
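As a concrete illustration of the AdaIN-style early fusion mentioned above, the sketch below aligns the per-channel mean and standard deviation of one feature map to another. The function name and the channel-first (C, H, W) layout are illustrative assumptions:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: shift/scale each channel of
    `content` (C,H,W) so its statistics match those of `style` (C,H,W)."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # normalize content, then re-scale to the style statistics
    return s_std * (content - c_mean) / (c_std + eps) + s_mean

rng = np.random.default_rng(1)
content = rng.normal(0.0, 1.0, size=(4, 8, 8))
style = rng.normal(3.0, 2.0, size=(4, 8, 8))
out = adain(content, style)
```

After the call, each channel of `out` carries the style map's first- and second-order statistics while retaining the content map's spatial structure.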

III. METHOD
This section begins by elucidating the rationale behind the use of hardware information. It then proceeds to offer a comprehensive overview of the HASR network, as illustrated in Fig. 1.

A. Motivation of using hardware information
Digital image acquisition systems play a pivotal role in a myriad of applications, capturing continuous real-world objects and generating a sampled image, denoted by I_LR. In these systems, a physical camera can be conceptually modeled as a continuous-space filter, followed by sampling on a lattice [27]. If a higher-resolution camera capable of producing the desired HR image I_HR exists, the transformation between the HR image and the LR image can be defined as a degradation function:

I_LR = (I_HR ⊗ k) ↓_s + n,    (1)

where k is a blur kernel, ↓_s denotes downsampling with scale factor s, and n is additive noise. Previous SR methods either predefined the degradation function [6], [7], [13], [14] or learned a degradation model for each LR image [15], [16]. However, in real-world scenarios, the degradation function is often more complex than the predefined ones, such as bicubic downsampling with an anti-aliasing filter. Additionally, training a degradation prediction model to estimate the degradation function for each LR image heavily relies on the patterns within the LR images. Consequently, the estimation may become inaccurate when applied to LR images with unseen patterns, which can deteriorate the SR results [28].
Considering that the degradation process originates from the image acquisition system, if we know that the images in a dataset come from similar image acquisition systems, it logically follows that these images should undergo the same degradation process. Furthermore, if we possess a dataset containing information about the image acquisition system for each image, we can harness contrastive learning to extract information about these image acquisition systems, inherently representing the various degradation processes. Our hypothesis posits that incorporating this learned information into the SR generation network will enhance SR performance. This approach eliminates the need for manually defining inaccurate degradation functions. Moreover, it defines different types of degradation functions based on the diversity of hardware information, rather than relying solely on individual LR images [15], [16], aligning it more closely with real-world scenarios. Therefore, the proposed SR algorithm can be represented as:

I_SR = F(I_LR, h),  h = E(I_LR),    (2)

where h is the feature map representing the degradation information of the current LR image acquisition system, acquired by the Degradation Information Extraction network E. Hence, two parts of the loss function are included in the training process, with the optimization represented by:

min ℒ_1 + λ ℒ_sc,    (3)

where ℒ_1 represents the pixel loss, ℒ_sc represents the supervised contrastive loss, and λ is a hyperparameter that controls the tradeoff between ℒ_1 and ℒ_sc.
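The blur-downsample-noise degradation model described above can be sketched directly in NumPy. The naive convolution, box kernel, and decimation here are illustrative stand-ins, not the exact pipeline used in the paper:

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive 'same' 2D convolution with reflect padding (illustration only)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="reflect")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

def degrade(hr, kernel, scale=2, noise_std=0.01, rng=None):
    """LR = (HR convolved with kernel), decimated by `scale`, plus Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    blurred = conv2d_same(hr, kernel)
    lr = blurred[::scale, ::scale]  # simple decimation stands in for resampling
    return lr + rng.normal(0.0, noise_std, size=lr.shape)

hr = np.ones((16, 16))
box = np.full((3, 3), 1.0 / 9.0)  # normalized box blur as a placeholder kernel
lr = degrade(hr, box, scale=2, noise_std=0.0)
```

A constant image blurred by a normalized kernel stays constant, so the decimated output is again all ones, which makes the pipeline easy to sanity-check.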

B. Network architecture
Our proposed SR algorithm has two stages: the Degradation Information Extraction stage and the hardware-aware super-resolution (HASR) stage. The first stage aims to extract a discriminative feature map from each LR image, while the second stage is responsible for performing the SR operation. The first stage is facilitated by a pretrained Degradation Information Extraction network, represented as the yellow block on the left side of Fig. 1. Within this initial stage, we use a simple 6-layer convolutional neural network as an encoder and the SupCon method to extract the degradation information. We then omit the two-layer fully connected (FC) projection part and employ the encoded feature map as the degradation representation. The complete procedure for Degradation Information Extraction is illustrated in Fig. 2, and we will delve into it shortly. The degradation representation obtained from the first stage and the LR feature map from the Shallow Feature Extraction block are combined within the Deep Feature Fusion block. The fusion operation is primarily executed by the proposed HAB. Finally, the super-resolved image is generated through the HR Image Reconstruction block, with the guidance of the hardware information. A detailed description of both stages is presented below.

1) Degradation Information Extraction:
The goal of degradation information learning is to extract a discriminative feature map from each LR image. Building on our previous hypothesis, feature maps originating from different acquisition systems will exhibit dissimilarity, whereas those from the same acquisition system will manifest similarity.
In this context, we construct our degradation information learning based on the framework of MoCo V2 [20]. The presence of a large dictionary containing a diverse set of negative samples plays a critical role in contrastive learning, as underscored in existing contrastive learning methods [18], [19]. MoCo V2 offers a spacious and consistent dictionary that decouples the dictionary size from the mini-batch size. This enriches the pool of negative samples during training, and the size of the dictionary is not limited by GPU memory.
Furthermore, we introduce positive examples not only by augmenting the anchor image, but also by augmenting images taken from the same acquisition system. Consequently, the LR image datasets in our model are distinctively labeled with the corresponding acquisition systems. The SupCon loss function used is:

ℒ_sc = Σ_{i∈I} (−1 / |P(i)|) Σ_{p∈P(i)} log [ exp(z_i · z_p / τ) / Σ_{a∈A(i)} exp(z_i · z_a / τ) ],    (5)

where I is the set of anchor samples in a batch, P(i) is the set of positives sharing anchor i's acquisition system, A(i) is the set of all samples in the batch except the anchor, z denotes the normalized embeddings, and τ is a temperature hyperparameter. When training is completed, as in classical contrastive learning methods [18], [20], the degradation representation h is used for the SR algorithm in this paper.
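For concreteness, here is an illustrative NumPy rendering of the SupCon loss with camera IDs as labels. This is a sketch under our own naming conventions, not the training code:

```python
import numpy as np

def supcon_loss(z, labels, temperature=0.07):
    """Supervised contrastive loss: positives are all other samples that
    share the anchor's label (here: the same acquisition system)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # normalize embeddings
    n = z.shape[0]
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                    # remove self-contrast
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    pos_mask = (labels[:, None] == labels[None, :]) & ~np.eye(n, dtype=bool)
    # average log-probability over each anchor's positive set
    per_anchor = np.where(pos_mask, log_prob, 0.0).sum(axis=1) / pos_mask.sum(axis=1)
    return -per_anchor.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
labels = [0, 0, 1, 1, 2, 2, 0, 1]  # camera IDs as supervision
loss = supcon_loss(z, labels)
```

Embeddings that cluster by camera yield a lower loss than random embeddings, which is exactly the behavior the degradation representation is trained toward.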
Discussion. The proposed degradation information learning does not require the ground-truth degradation process. Its goal is to learn the hidden distinctive characteristics of degraded images taken from different acquisition systems so that they can be distinguished. Such a good degradation representation can improve the SR network performance, as shown in Section IV.
2) HASR network: Given the degradation information extracted from the LR images, we can integrate this information into an SR network backbone through deep feature fusion. As shown in Fig. 1, our proposed HASR network mainly contains three components: shallow feature extraction, deep feature fusion, and HR image reconstruction.
A convolution layer is first utilized to extract the shallow feature map F_0 from I_LR, which can be represented by:

F_0 = Conv_{3×3}(I_LR),

where Conv_{3×3} denotes a convolution layer with filter size 3 × 3, input channel 3, and output channel C; C is a hyperparameter that decides the number of filters of the shallow feature extraction convolution layer. Next, the feature map F_0 and the degradation representation h go through multiple residual groups for deep feature fusion. Each residual group takes both the feature map from the previous residual group and the degradation representation h as inputs, and outputs the fused feature map:

F_i = RG_i(F_{i−1}, h),

where RG_i represents the i-th residual group. More details of the residual group will be presented later. Then, after the last residual group, the fused feature map F_G goes through a convolution layer and is summed with F_0 to create the dense feature map F_D by global residual learning:

F_D = Conv_{3×3}(F_G) + F_0.    (6)

Finally, the dense feature map F_D goes through the HR reconstruction decoder. To effectively upscale the dense feature map F_D, the decoder utilizes an efficient sub-pixel CNN (ESPCN) [29] followed by a single convolution layer to output the three-channel SR images:

I_SR = Conv_{3×3}(PS(F_D)),

where PS represents the pixel-shuffle operation with a scale factor of 2.
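The sub-pixel upscaling used by the reconstruction decoder can be illustrated with a NumPy pixel-shuffle. This mirrors the standard operation on a single channel-first image, not the paper's exact code:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r^2, H, W) -> (C, H*r, W*r), as in ESPCN sub-pixel upsampling."""
    c2, h, w = x.shape
    assert c2 % (r * r) == 0
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # -> (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# C=1, r=2: four channels interleave into one channel at double resolution
x = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)
y = pixel_shuffle(x, 2)
```

Each output 2×2 block gathers one pixel from each of the r² input channels, so the convolution layers can learn sub-pixel detail in the channel dimension before this cheap rearrangement.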
Residual Group: The residual group serves as a crucial component in deep feature fusion. The incorporation of multi-level skip connections allows abundant low-frequency information to be bypassed, enabling the main network to focus on learning high-frequency information. As shown in Fig. 1(a), each residual group comprises multiple HABs. The current residual group RG_i takes the fused feature map F_{i−1} from the previous residual group and the degradation information h as inputs. Then, F_{i−1} and h go through m HABs. Finally, the residual group outputs the fused feature map F_i with a long skip connection. It can be formulated as:

F_i = F_{i−1} + HAB_{i,m}(… HAB_{i,1}(F_{i−1}, h) …, h),

where HAB_{i,j} represents the j-th HAB; m is a hyperparameter that determines the number of HABs in each residual group.
Hardware-Aware Block: The detailed structure of the HAB is illustrated in Fig. 1, where F_{i,j} represents the output feature map of the j-th HAB of the i-th residual group, j ∈ {1, …, m}, and F_{i,0} = F_{i−1}. MLP represents a two-layer multilayer perceptron, R represents the reshape operation, and ⊗ represents element-wise multiplication. If the feature map F_{i,j−1} has the dimension ℝ^{C×H×W}, the degradation information travels through dual paths before the element-wise multiplication with the feature map. The first path contains two fully connected (FC) layers and a reshape operation that projects the degradation information to ℝ^{1×H×W} as the spatial attention values. The second path contains two FC layers that project the degradation information to ℝ^{C×1×1} as the channel attention values. During element-wise multiplication, the attention values are broadcast accordingly: spatial attention values are broadcast (copied) along the channel dimension, and vice versa. This parallel attention mechanism enables the network to extract more informative features from the degradation information.
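A minimal NumPy sketch of the dual-path broadcasting described above follows. How the two attended maps are combined is not fully specified in the text, so the summation here is an assumption, and the attention values are fixed constants rather than MLP outputs:

```python
import numpy as np

def hab_fuse(feat, h_spatial, h_channel):
    """Dual-path fusion sketch: spatial attention (1,H,W) and channel
    attention (C,1,1) broadcast over the feature map (C,H,W) during
    element-wise multiplication; the two attended maps are then summed
    (the combination rule is our assumption)."""
    return feat * h_spatial + feat * h_channel

C, H, W = 4, 6, 6
feat = np.ones((C, H, W))
sp = np.full((1, H, W), 0.5)   # spatial attention values, copied along channels
ch = np.full((C, 1, 1), 0.25)  # channel attention values, copied along H and W
out = hab_fuse(feat, sp, ch)
```

NumPy's broadcasting performs the "copying" described in the text implicitly: the size-1 axes of each attention tensor are expanded to match the feature map.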
Discussion. Current SR networks designed to handle multiple degradations, as seen in [8], [30], often combine degradation information with image feature maps and directly input them into the SR network. However, this direct integration using convolution may introduce interference due to the inherent domain gap between degradation information and image features, as highlighted in [10], [15]. In our approach, we utilize degradation information as attention values within dual paths, allowing us to effectively harness this information to adapt to specific degradation scenarios. The spatial attention path focuses on optimizing the connections between adjacent pixels in the image, guided by the degradation information. Meanwhile, the channel attention path is dedicated to optimizing the relationships between feature channels, again guided by the degradation information. Subsequently, by optimizing through these two attention paths, we combine their results to achieve the fusion of degradation information and deep feature maps. In Section IV, we also conduct an ablation study on our fusion method to empirically demonstrate its effectiveness.

IV. EXPERIMENTS
In this section, we introduce the super-resolution dataset named Real-Micron, created from real-world micron-scale patterns. We then present the experiment details and results on open-source synthetic datasets and real-world datasets, including DRealSR [31], ImagePairs [32], and the Real-Micron dataset. An ablation study is presented last.

A. Real-Micron Datasets
We collected sets of LR and HR images at multiple resolutions using combinations of three Basler cameras and three Mitutoyo objectives, building a dataset for learning and evaluating super-resolution models of real-world micron-scale patterns.

1) Setup of Image Acquisition:
The image acquisition system was mounted on an optical table to keep it as stable as possible, as shown in Fig. 3. An auto-focus algorithm [33] was applied during the acquisition process. The cameras and objectives could be easily unscrewed from the coaxial in-line assembly unit. The working distance could be adjusted by the translation stages and fine-tuned by the piezoelectric motion stage (PEMS).
Four different samples were captured by the acquisition system containing multiple cameras and objectives, including the US Air Force Hi-Resolution target and three different micro-scale circuits, as shown in Fig. 4. Different parts of each sample were captured by three different cameras. For each camera, images at three different resolutions were captured using objectives with 20×, 10×, and 5× magnifications. After image pair registration, the images captured by the 20×, 10×, and 5× objectives served respectively as the ground truth (GT), the two-times-downsampled LR images (LR-×2), and the four-times-downsampled LR images (LR-×4) of the super-resolution dataset. Furthermore, each LR image was labeled with the camera number, indicating which camera it came from. To reduce sensor noise, we captured n (= 10) consecutive images for each scene, as [34] did. Therefore, the raw images are computed by:

I_raw = (1/n) Σ_{i=1}^{n} I_i,

where I_i represents the i-th consecutive image. Each of these n consecutive images was captured under constant illumination and without interframe motion.
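The noise-reduction step above (averaging n = 10 static captures) can be sketched as follows; the scene and noise level are synthetic:

```python
import numpy as np

def average_frames(frames):
    """Average n consecutive captures of a static scene to suppress sensor noise."""
    return np.mean(frames, axis=0)

rng = np.random.default_rng(0)
clean = np.full((32, 32), 0.5)  # static scene, constant illumination
n = 10
frames = clean + rng.normal(0.0, 0.05, size=(n, 32, 32))  # per-frame sensor noise
raw = average_frames(frames)
```

With independent zero-mean noise, averaging n frames shrinks the noise standard deviation by a factor of √n, so the 10-frame average is roughly 3× cleaner than a single capture.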

2) Image Pair Registration:
To create pixel-wise aligned image pairs at different resolutions, we utilized an image pair registration algorithm. For the images acquired by each camera model, we implemented image registration between the 5× and 10× objectives and between the 10× and 20× objectives as the two-times-downsampling pixel-wise aligned pairs, and between the 5× and 20× objectives as the four-times-downsampling pixel-wise aligned pairs. However, obtaining pixel-wise aligned image pairs is not straightforward due to duplicate patterns and unstable luminance conditions in the circuit targets. As shown in Fig. 4, conventional image registration algorithms such as SIFT [35], SURF [36], and SuperGlue [37] cannot produce accurate results. To obtain accurate image pair registration of our dataset, we designed a coarse-to-fine registration algorithm that maximizes the structural similarity index measure (SSIM) between the transformed LR image and the HR image. Denote I_HR and I_LR as the HR and LR images to be registered. The final target of our algorithm is to maximize the objective function:

max_T SSIM( C(T(I_LR)), I_HR ),

where T is the affine transformation matrix, C is the cropping operation that makes the transformed I_LR the same size as I_HR, and SSIM(·,·) is the structural similarity index measure.
To find an accurate T, the point correspondences between I_HR and I_LR must also be accurate. We first implemented the registration algorithm in [38] to obtain the point correspondences, since it solves the problem of duplicate and deformable patterns. Then, given the scale factor from the magnification of the lenses, the other unknown parameters in T can be calculated from the point correspondences using the least squares method. Next, several cropped candidates are proposed based on the inverse transformation of I_HR. Due to the stability of our acquisition system, scale and translation are the principal transformations. Therefore, identifying the four corners of the transformed LR image T(I_LR) is enough for proposing the candidates. Last, the SSIM values are calculated to pick the best candidate.
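Since the scale is fixed by the lens magnification and translation dominates, the remaining parameters reduce to a simple least-squares fit over the point correspondences. The following NumPy sketch uses synthetic correspondences and our own function name, not the paper's MATLAB code:

```python
import numpy as np

def fit_scale_translation(pts_lr, pts_hr, scale):
    """With the scale known from lens magnification, the least-squares
    translation is the mean residual t = mean(p_hr - scale * p_lr)."""
    return (pts_hr - scale * pts_lr).mean(axis=0)

rng = np.random.default_rng(0)
pts_lr = rng.uniform(0, 100, size=(20, 2))          # matched points in the LR image
true_t = np.array([12.5, -7.0])                     # ground-truth offset (synthetic)
pts_hr = 2.0 * pts_lr + true_t + rng.normal(0, 0.1, size=(20, 2))  # noisy matches
t = fit_scale_translation(pts_lr, pts_hr, 2.0)
```

Averaging over many correspondences suppresses per-match localization noise, which is why the recovered translation is accurate even with imperfect matches.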
The detailed registration algorithm is included in the supplemental materials. The code was written in MATLAB 2022b and is available at https://github.com/cucum13er/Hardware-Aware-Super-Resolution/tree/main/Matlab_github. Fig. 5 shows examples of the registered image pairs from different cameras; conspicuous field of view (FOV) differences can be observed between the two cameras. We will present quantitative results in the next subsections to prove that the images taken from different cameras have different degradation processes. Note that the degradation differences among cameras are difficult to observe visually.

B. Experimental setup
To train the Degradation Information Extraction network, we first synthesized LR images according to (1). The simulation of the degradation process included blurring and downsampling.

Blur. The blurring process is a combination of optics and digital camera sensors, and a commonly used synthetic model of the kernel is the isotropic or anisotropic Gaussian filter [6]. For a Gaussian blur kernel k of size 2t + 1, the element at (i, j) ∈ [−t, t] is sampled from a 2D Gaussian distribution:

k(i, j) = (1/N) exp(−(1/2) x^T Σ^{−1} x),

where Σ is the covariance matrix, x = (i, j)^T is the coordinate, and N is the normalization factor. The covariance matrix can be further represented by:

Σ = R(θ) diag(σ_1², σ_2²) R(θ)^T,

where σ_1 and σ_2 are the standard deviations along the two principal axes, θ is the rotation angle, and R(θ) is the corresponding rotation matrix. We used isotropic Gaussian kernels with different σ in our synthetic experiments (σ_1 = σ_2 = σ) to simulate different image acquisition systems.

Downsampling. Downsampling is a basic operation for synthesizing LR images for the training of SR methods. There are various downsampling algorithms, such as nearest-neighbor interpolation, bilinear interpolation, and bicubic interpolation. To facilitate comparisons, we implemented downsampling using the bicubic interpolation algorithm in our synthetic experiments.
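The blur kernels above can be generated directly from the covariance parameterization; a NumPy sketch follows (the 21×21 size matches the experiments, while the function name is ours):

```python
import numpy as np

def gaussian_kernel(radius, sigma1, sigma2=None, theta=0.0):
    """2D (an)isotropic Gaussian blur kernel of size (2*radius+1)^2,
    normalized to sum to 1. Isotropic when sigma2 is omitted."""
    sigma2 = sigma1 if sigma2 is None else sigma2
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])                 # rotation by theta
    cov = R @ np.diag([sigma1**2, sigma2**2]) @ R.T  # covariance matrix
    inv = np.linalg.inv(cov)
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    coords = np.stack([xx, yy], axis=-1)            # (..., 2) grid coordinates
    expo = np.einsum("...i,ij,...j->...", coords, inv, coords)  # x^T Sigma^-1 x
    k = np.exp(-0.5 * expo)
    return k / k.sum()                              # normalization factor N

# isotropic kernel with sigma^2 = 2.0 on a 21x21 grid, as in the experiments
k = gaussian_kernel(radius=10, sigma1=np.sqrt(2.0))
```

Sweeping σ² over the five values used in the experiments produces the five simulated acquisition systems.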
To evaluate the performance of the degradation information, we applied five different isotropic Gaussian kernels with bicubic downsampling to the HR images in the synthetic experiment. Five different image acquisition systems were simulated by five 2D Gaussian blurring kernels with σ² set to [0.5, 1.0, 2.0, 3.0, 4.0], respectively. Following [10], the size of the Gaussian kernels was fixed to 21 × 21. We used the LARS optimizer [39] to train the degradation information network with the SupCon loss, with a batch size of 128 and 2 augmented views (see (5)). During training, each of the 128 LR image patches was first randomly selected from the different degradation processes and cropped to size 160 × 160. Data augmentation was then performed through random flipping and transposing. The initial learning rate was 0.4, and we performed 1000 iterations of training. We separated the training images of the DIV2K [40] dataset into 70%, 10%, and 20% splits as the training set, validation set, and one of the test sets, respectively. We also included Flickr2K [41], BSD100 [42], Set5 [43], Set14 [44], and Urban100 [45] as test sets.
We employed the same training process for the real-world datasets (DRealSR [31], ImagePairs [32], and Real-Micron) as for the synthetic datasets, the difference being that the real-world datasets already provide real LR images and distinct camera labels.
We evaluated our HASR model using both the synthesized LR-HR image pairs with known blurring kernels and downsampling methods and the real-world LR-HR image pairs with unknown degradation processes.
For synthetic experiments, we used training images from the DIV2K and Flickr2K datasets as the training set and the Set5, Set14, and Urban100 benchmark datasets as the testing sets. HR images were degraded into LR images using the same methods as we used to train the Degradation Information Extraction network. We trained the HASR network with a combination of the SupCon loss and the $\ell_1$ loss for 200 iterations, with a learning rate of $1 \times 10^{-4}$ for the SR part and $1 \times 10^{-9}$ for the degradation information part, decaying by half every 40 iterations. The hyperparameter $\lambda$ was set to 0.1, and we used the Adam [46] optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ for optimization. We used the registered image pairs from the DRealSR, ImagePairs, and Real-Micron datasets to conduct real-world experiments. DRealSR consists of real-world LR and HR images collected by zooming DSLR cameras. The dataset includes five DSLR cameras (Canon, Nikon, Olympus, Panasonic, and Sony), corresponding to five different acquisition systems for our Degradation Information Extraction network. ImagePairs used a beam splitter to capture the same scene with a low-resolution camera (LRC) and a high-resolution camera (HRC). The LRC serves as the sixth acquisition system for our Degradation Information Extraction network. For the ×2 experiments, we combined DRealSR and ImagePairs for training and testing. For the ×4 experiments, we only used DRealSR for training and testing since ImagePairs does not have ground-truth HR images. However, the Real-Micron dataset does not have enough training samples. Therefore, we implemented transfer learning to improve the model performance. We first separated the Real-Micron dataset into 80% and 20% as the training and testing datasets, respectively. Next, the best model we trained on the Real-Micron dataset for extracting the hardware information (MoCo-V2+SupCon) was selected to initialize the Degradation Information Extraction part of the HASR network. Then, the other part of the HASR network was initialized by the model we trained in the synthetic experiments (DIV2K and Flickr2K datasets). Finally, we trained the HASR network on the Real-Micron dataset for 40K iterations by freezing the Degradation Information Extraction part and partially freezing the residual groups. Specifically, we experimentally quantified the generality versus specificity of neurons in each residual group of the network by freezing the trainable parameters of different residual groups during fine-tuning. Further analysis is presented in the next subsections.
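The step learning-rate schedule and the weighted objective described above are simple to state precisely. A minimal sketch (helper names are ours, not from the released code):

```python
def lr_at(iteration, base_lr, decay_every=40, factor=0.5):
    """Step schedule: the learning rate halves every `decay_every` iterations."""
    return base_lr * factor ** (iteration // decay_every)

def total_loss(l1_loss, supcon_loss, lam=0.1):
    """Combined objective: L = L_1 + lambda * L_SupCon, with lambda = 0.1
    as in the paper's synthetic experiments."""
    return l1_loss + lam * supcon_loss
```

For example, the SR-part rate starts at 1e-4 and reaches 5e-5 after the first decay step.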
We conducted experiments using PyTorch and MMEditing [47]. NVIDIA RTX 3090 and RTX 2080 Ti GPUs were used for training and testing. The source code and pre-trained models are available at https://github.com/cucum13er/mmagic/tree/0.x.

C. Experiments on the Degradation Information Extraction Network
To evaluate the performance of the degradation information, we compared the supervised contrastive method to unsupervised contrastive methods, including SimCLR [18] and MoCo-V2 [20], and to the classic supervised method. For a fair comparison, we used the same backbones, ResNet-18 [48] and 6-layer CNNs [15], for all methods. For the performance evaluation, we added a classification head (a supervised linear classifier: two fully connected layers followed by SoftMax) to the backbone and loaded the pretrained weights into the backbone. We then froze the weights of the backbone and trained the whole network for a small number of epochs.
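The frozen-backbone evaluation amounts to a linear probe on fixed features. The sketch below illustrates the idea with a plain softmax classifier in NumPy; the paper's actual head is a two-FC-layer network trained in PyTorch, so this is a simplified stand-in:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def linear_probe(features, labels, n_classes, lr=0.1, epochs=300, seed=0):
    """Train only a linear classifier on frozen backbone features."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(features.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]           # one-hot targets
    for _ in range(epochs):
        P = softmax(features @ W + b)
        grad = features.T @ (P - Y) / len(labels)
        W -= lr * grad
        b -= lr * (P - Y).mean(axis=0)
    return W, b

def accuracy(features, labels, W, b):
    return float((np.argmax(features @ W + b, axis=1) == labels).mean())
```

If the backbone has learned discriminative degradation features, even this frozen-feature probe classifies the acquisition systems accurately.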
TABLE III presents a comparison of the classification performance using different methods for the isotropic Gaussian kernels. The results demonstrate that the supervised contrastive and the classic supervised methods outperform the unsupervised methods in this classification task due to their use of label information. As noted in [12], supervised contrastive learning can improve classifier accuracy and robustness. We therefore select this method to extract degradation information, as supported by the results in TABLE III. Surprisingly, simple 6-layer CNNs outperform ResNet-18 in all three methods because they can effectively represent degradation information, unlike the more complex ResNet-18, which has too many redundant trainable parameters. Additionally, limited training data and iterations can cause overfitting issues with ResNet-18.
Given that the 6-layer CNNs perform well in the synthetic experiments, we opted to use them to train on the real-world datasets. TABLE II presents the classification results for these real-world datasets. The supervised contrastive method with the MoCo-V2 structure achieves the best classification accuracy on average. We therefore finalized our Degradation Information Extraction network with 6-layer CNNs as the backbone, MoCo-V2 as the structure, and the SupCon loss as the loss function.
To further visualize the learned degradation information, we used the t-SNE method [49] to cluster LR images from both synthetic and real-world datasets. The degradation representations of those LR images were produced by the Degradation Information Extraction networks and then visualized. Fig. 6 shows the visualization results, where the first row includes the results of the synthetic dataset DIV2K with five different isotropic Gaussian blurring kernels, and the second row includes the results of the DRealSR dataset with five DSLR cameras and the results of the Real-Micron dataset with three different Basler cameras. The visualization results reveal that the feature vectors are well clustered by degradation kernel or camera. MoCo-V2 distinguishes the different categories better than the other algorithms, as demonstrated in TABLE III and TABLE II. Fig. 6 (h) is less distinguishable because the three Basler cameras have very similar specifications, making their degradation information quite similar.

D. Experiments on the HASR Network
We conducted simulation experiments on LR-HR pairs with known blurring kernels and downsampling methods, i.e., isotropic Gaussian blurring kernels with the bicubic downsampling method. We compared our CNN based HASR to several recent CNN based SR algorithms, including RDN [50], Real-ESRGAN [6], and DASR [15], using pretrained models. Furthermore, the adoption of stronger Transformer backbones has gained significant traction recently. To validate that the benefit of our proposed degradation information is not confined to a particular backbone of the SR generation network, we conducted experiments using a Transformer-based backbone as well. TABLE V shows the PSNR and SSIM comparison results, indicating that with the assistance of the degradation information, our CNN based HASR algorithm outperforms the other algorithms, especially when the LR images are heavily blurred by a greater $\sigma$ value. TABLE IV presents a comparison of PSNR and SSIM results among the CNN based HASR, SwinIR [51], and the Swin-Transformer based HASR. SwinIR represents a state-of-the-art Swin-Transformer based model for image restoration. Similar to TABLE V, we find that the inclusion of degradation information consistently enhances the quality of the SR results. Taking advantage of both the local self-attention mechanism and the shifted window scheme, the Swin-Transformer based HASR achieves the best performance across most test datasets.
We also conducted experiments on real-world LR-HR image pairs, using the DRealSR and ImagePairs datasets for the ×2 experiments and the DRealSR dataset for the ×4 experiments. We then used the Real-Micron dataset for another real-world evaluation. As shown in TABLE VI, our HASR algorithm consistently achieves higher PSNR and SSIM values compared to most other algorithms. It is worth noting that CDC [31] exhibits higher SSIM values in certain cases, as it dissects an image into three components (flat regions, edges, and corners) and reconstructs each component individually. In contrast, our proposed method is designed to reconstruct the entire image as a whole. However, an interesting avenue for future research could involve adapting CDC to incorporate degradation information. For the evaluation of the Real-Micron dataset, as stated in subsection IV.B, we initialized the HASR model by employing the pretrained Degradation Information Extraction network obtained from the Real-Micron dataset, along with the pretrained HASR network acquired from the synthetic experiments. We utilized the CNN based HASR in this evaluation due to the relatively small scale of the Real-Micron dataset.
TABLE VI and TABLE VII show the PSNR and SSIM results and confirm that our proposed HASR network achieves better quantitative evaluation results than other state-of-the-art algorithms. Additionally, Fig. 7 shows the SR visualization results on the Real-Micron and ImagePairs datasets, demonstrating that the proposed HASR network successfully reconstructs detailed textures and edges in the HR images, yielding better-looking SR outputs compared to other methods. While Real-ESRGAN produces sharper-looking details, it introduces some artifacts due to its adversarial model. The adversarial model prioritizes generating visually pleasing SR images over SR images faithful to the input LR images, resulting in a tradeoff between visual quality and quantitative performance. Note that the PSNR metric fundamentally disagrees with the subjective evaluation of human observers [1]. If users care more about quantitative performance in SR applications, e.g., using HASR for product pattern inspection and metrology in manufacturing processes, the SR results must be as close as possible to the ground truth rather than guessing a more visually pleasing image.
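For reference, the PSNR metric discussed above is defined directly from the mean squared error between the ground-truth HR image and the SR output. A minimal implementation:

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a ground-truth HR image
    `ref` and an SR output `test` of the same shape."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')                  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because PSNR only measures pixel-wise fidelity, a sharpened but hallucinated texture (as produced by adversarial models) can score worse than a slightly blurrier but faithful reconstruction, which is exactly the tradeoff noted above.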

E. Ablation Studies
We first evaluated the effectiveness of the degradation information in the network by conducting an ablation experiment using two different fusion methods, the CNN based HASR and EDSR [4] with Adaptive Instance Normalization (AdaIN) [22]. Then, we evaluated the effectiveness of the dual-path attention mechanism by conducting an ablation experiment using different fusion methods. Finally, we evaluated the effectiveness of transfer learning on the Real-Micron dataset.

1) Analysis on Degradation Information: To disregard the degradation information, we set $\lambda = 0$ for $\mathcal{L}_{sup}$ of the HASR network and compared the experimental results of this model to those of the previous HASR network. To explore the generalizability of the degradation information, we conducted an experiment on another SR architecture, EDSR with the AdaIN fusion method. For this experiment, we made specific modifications to the residual blocks of EDSR. Specifically, we used the degradation information projected by two FC layers as the style feature map for AdaIN, while the feature map from the original residual blocks served as the content feature map. These two feature maps were then combined using an AdaIN layer. Fig. 8 illustrates both the original and modified residual blocks. Similarly, we trained two models for this architecture with $\lambda = 0.1$ and $\lambda = 0$, respectively. TABLE IX displays the PSNR and SSIM results for these four models. It is evident that the inclusion of degradation information enhances the performance of both SR networks, confirming the effectiveness of this approach.
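The AdaIN operation used in this ablation normalizes the content feature map per channel and re-scales it with style statistics; here the style statistics would come from the projected degradation information. A NumPy sketch under illustrative shapes and names:

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization on a (C, H, W) content feature map.
    style_mean/style_std are per-channel statistics, here standing in for
    the two-FC-layer projection of the degradation information."""
    mu = content.mean(axis=(1, 2), keepdims=True)       # per-channel mean
    sigma = content.std(axis=(1, 2), keepdims=True)     # per-channel std
    normalized = (content - mu) / (sigma + eps)
    return style_std[:, None, None] * normalized + style_mean[:, None, None]
```

After the operation, each channel of the output carries the style statistics, which is how the degradation information is injected into the EDSR residual blocks in this variant.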
2) Analysis on Feature Fusion: To evaluate the effectiveness of the dual-path attention mechanism, we conducted experiments with different fusion approaches for the CNN based HASR network. Specifically, we compared the original HASR with single-path attention (either spatial or channel attention only) and with channel attention outside of the RCAB [23]. Readers can refer to the Supplemental Materials for more details. The method that performed the worst was the one where fusion occurred outside of the RCAB. This outcome can be attributed to the absence of degradation information during the deep feature extraction process, which occurs inside the RCAB. Similarly, methods employing a single path, be it the CA or SA path, exhibited worse performance. These single-path methods lack connections between adjacent pixels or feature channels, making them less effective than the proposed dual-path fusion method.
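The dual-path idea can be illustrated with a toy NumPy version: a channel-attention path re-weights channels from globally pooled statistics, a spatial-attention path re-weights pixel positions, and the two re-weighted maps are combined. The linear weights below stand in for the convolutional layers of the real HAB, so this is a structural sketch only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_path_attention(feat, w_ca, w_sa):
    """Toy dual-path fusion on a (C, H, W) feature map.
    w_ca: (C, C) weights for the channel-attention path;
    w_sa: scalar weight for the spatial-attention path."""
    C, H, W = feat.shape
    # CA path: attention over channels from global average pooling
    ca = sigmoid(w_ca @ feat.mean(axis=(1, 2)))          # (C,)
    ca_path = feat * ca[:, None, None]
    # SA path: attention over positions from a per-pixel channel mean
    sa = sigmoid(w_sa * feat.mean(axis=0))               # (H, W)
    sa_path = feat * sa[None, :, :]
    return ca_path + sa_path
```

The CA path models dependencies among feature channels while the SA path models dependencies among spatial positions; using only one of them discards the other kind of connection, matching the ablation result above.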

3) Analysis on Transfer Learning:
To evaluate the effectiveness of transfer learning on the Real-Micron dataset, we conducted two sets of experiments using the CNN based HASR network. First, we trained the HASR network using only the Real-Micron training data, with the degradation information part pretrained and the HASR part randomly initialized. Second, we trained the network using the same training data with both the degradation information and HASR parts pretrained. For the latter, we froze different residual groups in the models during training. Fig. 9 shows the PSNR and SSIM results of both transfer learning strategies.
The results indicate that transfer learning outperforms direct training from scratch when the weights of the first one, two, or three residual groups are frozen. This is reasonable due to two factors. First, the Real-Micron dataset has fewer LR-HR image pairs (397) than other public datasets like ImagePairs and DRealSR, making overfitting a potential issue when training from scratch. Second, initializing the HASR with the pretrained model (DIV2K+Flickr2K) improves the SR performance. However, since the pretrained model has a domain gap with the Real-Micron dataset, the best performance was achieved when unlocking the weights of the last and penultimate residual groups. This approach locks in the generic features learned from the pretrained model while providing enough learnable parameters for learning the features unique to the Real-Micron dataset.
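Partial freezing during fine-tuning simply means skipping the update of parameters in the frozen groups. A framework-agnostic sketch (group and parameter names are illustrative, not the network's actual names):

```python
import numpy as np

def sgd_step(params, grads, frozen_groups, lr=1e-4):
    """Update only the parameters whose residual group is not frozen,
    mimicking fine-tuning with the first residual groups locked.
    params/grads: dicts of NumPy arrays keyed like 'rg1.conv.weight'."""
    for name, p in params.items():
        group = name.split('.')[0]        # e.g. 'rg1.conv.weight' -> 'rg1'
        if group in frozen_groups:
            continue                      # frozen: keep the pretrained weights
        p -= lr * grads[name]             # plain SGD update for unfrozen groups
    return params
```

In PyTorch the same effect is usually achieved by setting `requires_grad = False` on the frozen groups' parameters before building the optimizer.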

V. CONCLUSION
In this study, we propose a blind SR method that can handle

A Hardware-Aware Network for Real-World Single Image Super-Resolutions
Rui Ma, Xian Du*
The contributions of this work include:
• Pioneering the utilization of hardware information to enhance SR generation.
• Introducing a novel supervised contrastive learning method for learning unknown degradation processes in various image acquisition systems.
• Empirically demonstrating that integrating prior hardware information significantly enhances SR generation.
• Presenting a real-world dataset featuring micron-scale patterns and containing precisely aligned HR and LR image pairs with different scale factors.
$$\mathcal{L}_{sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \qquad (5)$$

In this equation, $i \in I \equiv \{1 \cdots 2N\}$ represents the index of an arbitrary augmented sample, $z_\ell = Proj(Enc(\tilde{x}_\ell))$ represents the feature generated by the Degradation Information Extraction encoder and the projection network, the $\cdot$ symbol denotes the inner product, $\tau \in \mathbb{R}^+$ is a scalar temperature parameter, $A(i) \equiv I \setminus \{i\}$ represents all the indices except $i$, $P(i) \equiv \{p \in A(i) : \tilde{y}_p = \tilde{y}_i\}$ represents all the indices that have the same label as the $i$th augmented sample, and $|P(i)|$ is its cardinality. Fig. 2 serves as an illustration of (5). At the beginning of each training batch, a set of $N$ randomly sampled {image, acquisition system label} pairs, $\{x_k, y_k\}_{k=1 \cdots N}$, is selected. The corresponding training data comprises $2N$ pairs, $\{\tilde{x}_\ell, \tilde{y}_\ell\}_{\ell=1 \cdots 2N}$, where $\tilde{x}_{2k}$ and $\tilde{x}_{2k-1}$ represent two random augmentations or "views" of $x_k$ ($k = 1 \cdots N$), and $\tilde{y}_{2k-1} = \tilde{y}_{2k} = y_k$. Fig. 2 presents an example with $N = 6$, $i = 1$, $P(1) = \{2, 3, 4\}$, $A(1) = \{2, 3, \ldots, 12\}$, and the labels for the three acquisition systems (different cameras in Fig. 2) respectively {1, 2, 3}. Intuitively, for the $i$th augmented sample, all the other augmented samples with the same label are expected to be positive samples, while the remaining augmented samples are expected to be negative samples. This equation is simply an extension of the classical self-supervised contrastive loss that enables multiple positive examples in a batch of training data.
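The supervised contrastive loss described above can be computed directly from a batch of embeddings and labels. The sketch below is a plain NumPy transcription of (5) under our own naming (the paper's training uses a PyTorch implementation):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss of (5) over a batch of 2N embeddings.
    z: (2N, d) array of projected features; labels: (2N,) system labels."""
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # normalize embeddings
    sim = z @ z.T / tau                                 # pairwise similarities
    n = len(z)
    not_self = ~np.eye(n, dtype=bool)                   # A(i): all indices but i
    denom = (np.exp(sim) * not_self).sum(axis=1, keepdims=True)
    log_prob = sim - np.log(denom)                      # log softmax over A(i)
    positives = (labels[:, None] == labels[None, :]) & not_self   # P(i)
    # mean log-probability of positives per sample, summed over the batch
    per_sample = -(log_prob * positives).sum(axis=1) / positives.sum(axis=1)
    return per_sample.sum()
```

Embeddings that cluster by acquisition-system label yield a lower loss than the same embeddings with mismatched labels, which is the signal used to train the Degradation Information Extraction network.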
(b). The current HAB takes the fused feature map from the previous HAB and the degradation information $h$ as inputs. It involves a deep feature extraction module (DFEM) and a dual-path attention mechanism. The DFEM can consist of either CNN based or Transformer based feature extraction layers. For more details of the structure of the DFEM, readers can refer to our supplemental materials. The dual-path attention mechanism involves both channel attention (CA) and spatial attention (SA) paths. The output of the current HAB can be inferred by:

Fig. 7 .
Fig. 7. Qualitative comparison of our model with other works on × super-resolution on the Real-Micron dataset (top) and × super-resolution on the ImagePairs dataset (bottom).
various degradation processes of different image acquisition systems by extracting and integrating prior hardware information. Through the inclusion of the HAB, both the Transformer based and CNN based HASR networks outperform conventional approaches without relying on predefined or ground-truth degradation kernels. Results on both synthetic and real-world datasets demonstrate the effectiveness of the proposed method in handling blind SR problems. Future work will extend our method to more state-of-the-art SR frameworks such as CDC and verify the effectiveness of the degradation information in these frameworks. Additionally, the effective utilization of prior hardware knowledge to enhance image quality represents a promising avenue for exploration. Algorithms developed on the basis of such hardware information hold significant potential for practical applications. However, our HASR method may have limitations when handling input LR images acquired from hardware that significantly deviates from the training data. In such cases, the HASR network cannot accurately predict the unknown hardware degradation, resulting in a decline in SR performance. Moreover, obtaining labeled device sources to use as training data for the HASR method can be challenging, which adds to the difficulty of acquiring the necessary data.

Fig. 8.
Comparison of residual blocks in the original EDSR and EDSR with AdaIN fusion.

Fig. 9.
PSNR and SSIM comparison of transfer learning on the Real-Micron dataset.
$$y = \mathcal{D}(x) \qquad (1)$$

where $\mathcal{D}(\cdot)$ is a degradation function that amalgamates both filtering and down-sampling processes. The essence of the SR problem is to derive an estimated HR image $\hat{x}$ from $y$, effectively inverting the transformation in (1). Note that the SR problem is inherently ill-posed because multiple different HR images can yield the same LR result. To address this, it is transformed into an optimization problem.

TABLE I
Cameras and Lenses Used in Data Collection.

TABLE II
The classification results for real-world datasets.

TABLE IV
PSNR and SSIM comparison of Transformer based models on open-source synthetic datasets.

TABLE VI
PSNR and SSIM results on DRealSR and ImagePairs datasets.

TABLE V
PSNR and SSIM comparison of CNN based models on open-source synthetic datasets.

TABLE VII
PSNR and SSIM results on Real-Micron dataset.

TABLE IX
PSNR and SSIM comparisons with/without degradation information.

TABLE VIII
PSNR and SSIM results of different fusion methods.