Omnidirectional Video Super-Resolution Using Deep Learning

Omnidirectional Videos (or 360° videos) are widely used in Virtual Reality (VR) to facilitate immersive and interactive viewing experiences. However, the limited spatial resolution in 360° videos does not allow for each degree of view to be represented with adequate pixels, limiting the visual quality offered in the immersive experience. Deep learning Video Super-Resolution (VSR) techniques used for conventional videos could provide a promising software-based solution; however, these techniques do not tackle the distortion present in equirectangular projections of 360° video signals. An additional obstacle is the limited 360° video datasets to study. To address these issues, this paper creates a novel 360° Video Dataset (360VDS) with a study of the extensibility of conventional VSR models to 360° videos. This paper further proposes a novel deep learning model for 360° Video Super-Resolution (360° VSR), called Spherical Signal Super-resolution with a Proportioned Optimisation (S3PO). S3PO adopts recurrent modelling with an attention mechanism, unbound from conventional VSR techniques like alignment. With a purpose-built feature extractor and a novel loss-function addressing spherical distortion, S3PO outperforms most state-of-the-art conventional VSR models and 360° specific super-resolution models on 360° video datasets. A step-wise ablation study is presented to understand and demonstrate the impact of the chosen architectural sub-components, targeted training and optimisation.


I. INTRODUCTION
360° videos are increasingly popular, rapidly becoming the preferred format for multimedia in Virtual Reality (VR) [1], [2]. Also called omnidirectional, spherical or panoramic videos, 360° videos cover 360° of horizontal and 180° of vertical Field of View (FoV). 360° videos are primarily used for creating immersive experiences for their viewers by allowing up to six degrees of freedom of movement when interacting within the virtual environment. These videos are created using either a single camera with multiple sensors or multiple single-sensor cameras. The views captured from each sensor are stitched to create a single omnidirectional view. The spherical signal is then projected on a rectangular plane, by mapping the yaw and pitch, producing an EquiRectangular Projection (ERP). While other forms of projection for 360° signals exist, such as cube map projection, equirectangular is the most widely used projection [3] and is the format studied in this work. An example ERP frame is illustrated in Fig. 1, which shows the wide FoV and the distortion present in ERPs resulting from mapping the spherical signal on a rectangular plane.
For a similar viewing experience to 1080p High-Definition (HD) conventional video, a resolution of 3840 × 2160 (= [1920 × 1080] × 4) pixels is recommended by YouTube VR for 360° videos [4]. Similarly, in light of the wider view plane, 360° videos require ×8 more data to be transmitted than conventional videos for a similar level of perceptual quality [5]. To mimic human biological perception, 60 pixels are needed to represent each degree of view [6]. This means that, for true immersive impact, 360° of horizontal FoV needs to be represented by 21,600 pixels. Thus the key factor inhibiting the adoption of 360° videos in immersive contexts is the spatial resolution of the format. To bridge this gap, we explore the idea of enhancing 360° videos using deep learning-based video super-resolution (VSR) technology.
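The resolution gap described above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (60 pixels per degree, 360° × 180° FoV, a 3840-pixel-wide frame); the function names are illustrative, and applying the same density vertically is an assumption made here for completeness.

```python
# Pixel budget of an ERP frame at a given angular density.
PIXELS_PER_DEGREE = 60            # density quoted in the text for human-like perception
H_FOV_DEG, V_FOV_DEG = 360, 180   # full omnidirectional field of view


def required_resolution(ppd=PIXELS_PER_DEGREE):
    """Return (width, height) of an ERP frame at `ppd` pixels per degree.

    Using the same density vertically is an assumption of this sketch.
    """
    return H_FOV_DEG * ppd, V_FOV_DEG * ppd


def pixels_per_degree(width):
    """Horizontal pixel density of an ERP frame that is `width` pixels wide."""
    return width / H_FOV_DEG


width, height = required_resolution()
print(width, height)                       # 21600 10800
print(round(pixels_per_degree(3840), 2))   # 10.67
```

Under these figures, a 3840-pixel-wide ERP frame offers roughly a sixth of the angular density cited for human-like perception.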
Recent advances in VSR for conventional videos show competent results, specifically when enhancing low-resolution videos by up to ×4 higher resolutions [7], [8], [9], [10]. Such a software-based solution, built specifically for the 360° video context, could address the spatial resolution-related limitations in the 360° video domain. Therefore, in this work, we explore the super-resolution of 360° videos to achieve a ×4 spatial resolution enhancement of the EquiRectangular Projection (ERP) of these videos. To experiment with these issues, we create a novel 360° video dataset to measure model performance against the super-resolution task. The 360° Video Dataset (360VDS) introduced in this paper presents a large collection of 590 video clips in ERP format with a varied range of Spatial Index (SI) and Temporal Index (TI), as discussed in Section IV (360° Video Dataset). A new deep learning-based 360° Video Super-Resolution (360° VSR) model called Spherical Signal Super-resolution with a Proportioned Optimisation (S3PO) is then proposed to model the task of super-resolution specifically for panoramic videos, as discussed in Section III (Proposed 360° Video Super-Resolution Model). To address the limitations of existing VSR models, targeted recurrent modelling with a 360° Feature Extractor and optimisation using a novel loss function are also introduced.
Our empirical evaluation and analysis show that although conventional VSR models perform satisfactorily on 360° videos, they can be further improved with targeted modelling and training. Our proposed S3PO model outperforms most of the existing state-of-the-art conventional and 360° VSR models in quality evaluation metrics specific to both conventional and 360° signals. Furthermore, a detailed ablation study presented in Section VI (Ablation Study) demonstrates the comparative benefits of the strategic architectural choices, training with domain adaptation and optimisation of the S3PO model with the novel loss function.
The five-fold contributions in this paper are summarised as: 1) development of a new 360° video dataset with more diverse spatial and temporal contexts than the existing ones for benchmarking 360° [...] videos on existing and new 360° video datasets, discussed in Section V-B (Quantitative Evaluation) and V-C (Visual Comparison). 5) a detailed ablation study investigating the impact of architectural and learning choices for 360° VSR along with the influences of conventional vs. 360°-specific methodologies.

A. Conventional Video Super-Resolution
Deep learning has been widely adopted to show how high-resolution (HR) output can be generated with improved quality from low-resolution (LR) conventional video inputs [8], [9], [11]. The learning-based methods for conventional VSR typically consist of four key components, namely feature extraction, alignment, and fusion, followed by reconstruction and up-sampling. Among these, extracting relevant features from accurately aligned frames and fusing them are tasks of key significance for learning spatiotemporal correlation in a given temporal radius [7], [10]. Techniques such as Motion Estimation and Motion Compensation (MEMC), using optical flow followed by warping [7], or deformable convolution [12] are commonly used for explicit or implicit alignment, respectively. As an alternative, plain 2D/3D convolutions [13], [14] and Recurrent Neural Networks (RNNs) have been used to extract features from unaligned frames and, in doing so, learn the sequential nature of video.
Recurrent Neural Networks (RNNs) are gaining popularity in the conventional Video Super Resolution (VSR) space thanks to their proven ability in effective sequential modelling. Recent RNN-based models, such as Recurrent Residual Network (RRN) [16], Recurrent Structure-Detail Network (RSDN) [9], BasicVSR [10], Replenished Recurrency with Dual-Duct (R2D2) [17] and BasicVSR++ [18], have demonstrated the capacity to learn long-term inter-frame correlation within a given video and consequently improve super-resolution quality. The global information propagation in RNNs offers a time and cost-efficient alternative to their alignment-based counterparts.
There is an increasing trend of leveraging global information present in videos using bidirectional recurrent models [10], [18], [19]. These models make use of an all-frames-in approach to super-resolve the input videos. Although the all-frames-in approach has shown promising results [10], [18], it can be computationally expensive due to the concurrent processing of information bidirectionally (forward and backward) across a long temporal radius for a video [17]. Among bidirectional recurrent models, the order of information propagation is becoming increasingly complex, with not just two flows of information (forward and backward) but multiple layers of these flows and their crossovers, as seen in BasicVSR++ [18]. This limits the applicability of such models to offline settings only, as in most online settings not all frames are available at a given time.
More sophisticated usage of hybrid structures incorporating unidirectional recurrent models with replenishing information and multi-staged refinement is emerging to mitigate issues like vanishing gradient from unidirectional recurrent networks [17]. These models are showing promising results when compared to the bidirectional counterparts while still being applicable in both online and offline application settings.
However, the extensibility of any recurrent modelling, and its effectiveness for 360° VSR, has yet to be studied. In this work, we aim to address this gap by first testing state-of-the-art conventional VSR models on our newly created 360VDS dataset. We show that a purpose-built 360° VSR model can outperform conventional VSR technologies, which are not well-suited to the format because of the large FoV and the presence of distortion and horizontal cyclicity in EquiRectangular Projections (ERPs).

B. 360° Super-Resolution
Super-resolution technologies have been applied for omnidirectional images by pioneers like Fakour-Sevom et al. [20], Ozcinar et al. [21] and Nishiyama et al. [22]. These studies reveal that Single Image Super-Resolution (SISR) can be applied to panoramic images. In these approaches, existing SISR models for conventional images [23] have been fine-tuned by using either 360° image datasets [24], re-training to optimise distortion-aware loss functions [21], or by applying distortion maps on input images before being fed into the network [22].
However, there is limited work extending super-resolution to 360° video signals. Dasari et al. [25] proposed adopting VSR in 360° videos to mitigate the bandwidth-related requirements for an adaptive video streaming service. A micro-model for super-resolution was designed as part of a streaming system to enhance the spatial quality of compressed tiles by passing each of them through multiple convolution layers, followed by final deconvolution and upsampling. The Dasari et al. approach is based on bandwidth-related requirements for a streaming service; the video-specific properties were not considered in the super-resolution task, where the temporal correlation between video frames is a crucial factor to boost the visual quality for higher resolution output.
Liu et al. [15] focus on creating a dual network VSR model called Single and Multi-Frame Recurrent Network (SMFN). One pipeline is used for Single Image Super-Resolution (SISR), and the other for Multi-Image Super-Resolution (MISR). Approaching the 360° VSR problem by combining SISR and MISR limits SMFN in learning spatiotemporal properties in a local neighbourhood of limited temporal radius. SMFN has been shown to generate better super-resolution outcomes for 360° videos compared to conventional state-of-the-art VSR models, such as EDVR [8] and RBPN [7]. However, the MiG Panorama dataset, used in the study by Liu et al. [15], has only four clips in the test set, and most of the recent recurrent network-based VSR models have not been used in their comparison, although these models have proven to outperform EDVR and RBPN on several benchmark datasets due to their better sequential modelling ability. Figs. 2 and 3 give a glimpse of S3PO's performance on the MiG Panorama test set with four clips. As shown in these figures, S3PO outperforms most of the existing VSR models in both Structural Similarity Index Measure (SSIM) [26] and Weighted Spherically-SSIM (WS-SSIM) [27] metrics in each of the four clips of the MiG Panorama test set.

III. PROPOSED 360° VIDEO SUPER-RESOLUTION MODEL
A. Architectural Overview
Fig. 4 shows the architectural overview of our proposed Spherical Signal Super-resolution with a Proportioned Optimisation (S3PO) model. S3PO adopts recurrent modelling in combination with a sliding-window mechanism. The recurrence allows for sequential modelling by enabling global memory propagation across equirectangular video frames over time. For uninterrupted global memory propagation, the current target frame F_t is directly fused with the hidden state h_{t−1} and the super-resolved output HR_{t−1} from the previous timestamp, establishing a recurrent pathway. Such recurrent neural networks have a proven ability to model the sequential nature of continuous spatiotemporal data such as videos. However, the propagation of global memory diminishes over time in unidirectional recurrent models, limiting their temporal receptivity and long-term sequential modelling ability. Additionally, any sudden change or noise within a given timestamp can disrupt the flow of global memory and introduce undesirable memory features. To mitigate these issues, an information replenishment mechanism with a sliding window is used in the S3PO model. This allows highly correlated local features to be extracted and used for information replenishment in the recurrent network. To capture local features from a given sliding window, we introduce a novel 360° Feature Extractor specifically designed to extract joint features from three consecutive panoramic frames, F_{t−1}, F_t and F_{t+1}.
The S3PO model aims to replace the computationally intensive and error-prone conventional alignment steps with the sequential modelling ability of a recurrent architecture, as illustrated in Fig. 4. The 360° Feature Extractor with attention mechanism replenishes vanishing memory with highly correlated features extracted locally. The co-joint feature extraction in the proposed feature extractor, followed by the attention mechanism, allows S3PO to extract relevant features directly from unaligned frames without the need for explicit alignment and, further, to pay varied attention to these extracted features across both spatial and temporal dimensions. The extracted and fused features from both local and global propagation, as denoted in
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

B. 360° Feature Extractor
Alignment methods, such as MEMC, designed for conventional videos, do not consider the distortion present in EquiRectangular Projection (ERP) frames. Optical Flow estimation [28], primarily used for motion estimation, is an error-prone process, even for conventional videos, when there are large motion or intensity changes between consecutive frames. With the added complexity of distorted pixels, conventional methods cannot be directly applied for accurate flow estimation in the context of 360° videos. The objects in ERP frames have different levels of distortion at different latitudes. So, when an object moves between consecutive frames, distortion across different regions within the same object may also vary, making flow estimation between consecutive ERP frames challenging. Additionally, the spherical boundaries of 360° videos allow for cyclic motion. Although the geometric meaning of displacement is independent of the motion trajectory, cyclic motion means multiple possible paths between origin and target points in a rectangular plane, causing misleading or unsatisfactory estimation of flow using conventional methods [29], [30], [31].
Rather than using conventional alignment [28], we propose using learning-based technologies to directly extract relevant features from unaligned consecutive frames in 360° videos. Direct extraction and fusion of 2D features from video frames have proven to result in subpar performance when compared to features extracted from aligned frames in conventional videos [14]. The proposed 360° Feature Extractor addresses this limitation in the specific context of 360° videos using a two-staged feature extraction mechanism.
As shown in Fig. 5, the initial 2D features directly extracted from each EquiRectangular Projection (ERP) frame are refined with respect to joint features extracted from the combination of all three frames in the local neighbourhood. This enables the extraction of mutually correlated local features, represented by Correlated Feature in (1). For each of the neighbouring frames (F_{t−1} and F_{t+1}), the corresponding features are then further refined with respect to the target frame (F_t) features. This allows for meaningful feature extraction from neighbouring frames that correlate with the target frame. The final two sets of features extracted are then fused to generate a single set of local features represented by FinalFeature_t in (1). Each step involved in the first stage of feature extraction is outlined in (1), where Conv_{3×3} is the convolution operation with filter-size 3 × 3, ' ' represents the concatenation operation along the feature depth, '⊕' represents the element-wise addition operation and ReLU represents the Rectified Linear Unit. The single set of FinalFeature_t, obtained from the first stage for a given timestamp t, is then propagated to the second stage of the 360° Feature Extractor.
In the second stage, to cater for the varied spatial differences caused by distortions present in ERP frames, we propose to use a spatial attention mechanism [32] to allow the S3PO model to assign varied attention to different spatial regions within the distorted ERP frames. Additionally, a channel attention mechanism [32] is used to allow the model to differentiate between the features extracted from the temporally co-located frames. The attention mechanism involved in the second stage of feature extraction is outlined in (2), where '⊗' represents the element-wise multiplication operation and σ represents the sigmoid activation function. This stage allows the 360° Feature Extractor to learn to extract spatiotemporally correlated features representing the unique nature of 360° video frames while extracting mutually correlated features in a given local temporal radius. By combining and correlating the features from temporally co-located frames, the 360° Feature Extractor enables a comprehensive representation of the information in the given temporal radius, including motion information. This allows S3PO to be free from conventional alignment steps, which are known to be error-prone and computationally intensive in the context of conventional videos and thus do not extend well to the unique nature of 360° videos [30]. A single set of extracted features G_local, for the given timestamp t, is then propagated for further refinement with respect to the global memory.
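The gating idea behind the second stage (a sigmoid σ producing weights that rescale features through element-wise multiplication ⊗) can be illustrated with a minimal NumPy sketch. This is not the S3PO attention block: the pooling choices, the `w` projection matrix and all function names are assumptions made for illustration, and the learned convolutions of a real attention module are replaced here by simple averages.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w):
    """Rescale each channel of feat (C, H, W) by a gate in (0, 1).

    `w` is a hypothetical (C, C) learned projection, not taken from the paper.
    """
    pooled = feat.mean(axis=(1, 2))        # (C,) global average pooling
    gate = sigmoid(w @ pooled)             # (C,) one weight per channel
    return feat * gate[:, None, None]      # element-wise multiplication


def spatial_attention(feat):
    """Rescale each pixel of feat (C, H, W) by a gate in (0, 1).

    The channel-wise mean stands in for a learned convolution (assumption).
    """
    gate = sigmoid(feat.mean(axis=0, keepdims=True))   # (1, H, W)
    return feat * gate


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 6, 12))     # toy feature map
w = rng.standard_normal((8, 8)) * 0.1      # toy projection weights
out = spatial_attention(channel_attention(feat, w))
print(out.shape)   # (8, 6, 12)
```

Because every gate lies strictly between 0 and 1, attention in this form can only attenuate features, so regions or channels with larger gates are emphasised relative to the rest.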

C. Recurrent Global Fusion
The S3PO model learns the long-term temporal reliance between ERP frames by propagating global memory. For this, it uses hidden-state information h_{t−1} and the super-resolved output from the previous timestamp HR_{t−1}, as illustrated in Fig. 6. The current target frame F_t is directly fused with the hidden state and a ×4 space-to-depth transformed (using a pixel-unshuffle operation) version of HR_{t−1} to obtain a single set of fused features. This can be formally denoted by (3), where G_global is the finally fused features representing the global information flow in the recurrent architecture, and ↓ represents a pixel-unshuffle operation. The grouping of hidden state and output from the previous timestamp with the current LR target input allows the S3PO model to learn temporal correlations between frames across the video and allows long-term propagation of texture and context details, enabling the sequential modelling desired from an RNN for VSR.
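The ×4 space-to-depth step is what lets the previous HR output be stacked with LR-sized tensors. A minimal NumPy sketch of that rearrangement and of the grouping of the three inputs follows; the channel ordering mirrors the convention common in deep-learning frameworks, the 64-channel hidden state is a hypothetical size, and the learned convolutions that would follow the concatenation are omitted.

```python
import numpy as np


def pixel_unshuffle(x, r):
    """Space-to-depth: (C, H*r, W*r) -> (C*r*r, H, W)."""
    c, hr, wr = x.shape
    h, w = hr // r, wr // r
    x = x.reshape(c, h, r, w, r)        # split each spatial axis into (coarse, fine)
    x = x.transpose(0, 2, 4, 1, 3)      # move the fine offsets next to the channels
    return x.reshape(c * r * r, h, w)


def fuse_global(f_t, h_prev, hr_prev, r=4):
    """Stack the LR target frame, hidden state and unshuffled previous HR output."""
    return np.concatenate([f_t, h_prev, pixel_unshuffle(hr_prev, r)], axis=0)


rng = np.random.default_rng(0)
f_t = rng.random((3, 90, 120))          # LR target frame (480 x 360 HR, downscaled by 4)
h_prev = np.zeros((64, 90, 120))        # hidden state; 64 channels is a hypothetical size
hr_prev = rng.random((3, 360, 480))     # previous super-resolved output
g_in = fuse_global(f_t, h_prev, hr_prev)
print(g_in.shape)   # (115, 90, 120)
```

Each HR channel becomes 16 LR-sized channels, so no pixel of HR_{t−1} is discarded when it is brought down to the spatial size of F_t.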

D. Dual-Duct Refinement
The two sets of features G_local and G_global are the data propagation results respectively depicted in Figs. 5 and 6. They are forwarded for refinement through a residual convolution network. This network consists of ten residual blocks. Rather than building two completely independent pipelines, we build residual blocks where both the local and global ducts are leveraged via mutual information exchange, as shown in Fig. 7 and (4). Each duct consists of a 3 × 3 convolution operation followed by a ReLU activation and a further 3 × 3 convolution operation, as shown in Fig. 7. The features obtained from each duct are added element-wise with the respective identity feature obtained through the skip connection. Skip connections of features, in conjunction with the jointly learned refinement, allow for the construction of a deeper convolution network while enabling the network to be robust to vanishing gradient, effectively propagating both local and global information while mitigating error accumulation.
The two final sets of jointly refined features from the last residual block, FF_local and FF_global, are generated as shown in Fig. 7. These two sets of features are then used to create hidden-state information h_t for the given timestamp t to be used for the future timestamp t + 1, as defined in (5). The two finally refined features LF_t and GF_t, from the local and global information propagation respectively, are obtained from FF_local and FF_global respectively (as shown in (4)-(6)).

E. Upsampling
Following the refinement operation in the residual blocks, the two outcomes from the local and global ducts, namely LF_t and GF_t respectively, are subjected to a 3 × 3 convolution operation followed by a depth-to-space transformation in order to cast the ×4 feature depth to spatial data, as shown in Fig. 8. For the depth-to-space transformation, the S3PO model uses the pixel-shuffle operation. The two outputs then obtained, each with three channels, correspond to the super-resolved residues from the two groups of information propagation. Finally, the two outputs are concatenated along depth and passed through two 3 × 3 convolution layers to obtain the ×4 super-resolved residue. This spatially super-resolved residue is then added element-wise with a bi-linearly interpolated LR frame for the final HR reconstruction (F_t × 4) of the given F_t.
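The reconstruction path described above (depth-to-space, then adding the residue to an interpolated LR frame) can be sketched in NumPy as follows. The convolution layers are omitted, the averaging of the two residues is only a stand-in for the learned two-layer fusion, and the half-pixel-centre convention of the bilinear routine is an assumption; all names are illustrative.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Depth-to-space: (C*r*r, H, W) -> (C, H*r, W*r)."""
    crr, h, w = x.shape
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)      # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)


def bilinear_upsample(x, r):
    """Bilinear interpolation of (C, H, W) by factor r, half-pixel centres, edge clamp."""
    c, h, w = x.shape
    ys = np.clip((np.arange(h * r) + 0.5) / r - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * r) + 0.5) / r - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy, wx = ys - y0, xs - x0
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy)[:, None] + bot * wy[:, None]


rng = np.random.default_rng(0)
lf_t = rng.standard_normal((48, 90, 120)) * 0.01   # local-duct features (toy values)
gf_t = rng.standard_normal((48, 90, 120)) * 0.01   # global-duct features (toy values)
lr = rng.random((3, 90, 120))                      # LR target frame

# Averaging replaces the learned fusion convolutions (assumption of this sketch).
residue = 0.5 * (pixel_shuffle(lf_t, 4) + pixel_shuffle(gf_t, 4))
hr = bilinear_upsample(lr, 4) + residue
print(hr.shape)   # (3, 360, 480)
```

Predicting only a residue on top of an interpolated frame means the network has to model the missing high-frequency detail rather than the whole image.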

IV. 360° VIDEO DATASET
A. Data Collection
360° video is an emerging format in multimedia. There is, therefore, presently a lack of benchmark datasets that can be used for standard training and model evaluation. As discussed earlier in Section II-B (360° Super-Resolution), there are only two known works that have attempted super-resolution for 360° videos, and only one of those has created a dataset that could be used generally for 360° video super-resolution research. However, the MiG Panorama Video dataset [15] in question consists of only 204 360° video clips, and only four of these are publicly available and used for testing. This limits the diversity of motion, content and light conditions represented by the test set.
To address this, we create a new dataset of panoramic videos specifically designed for super-resolution, called the 360° Video Dataset (360VDS). Open-source datasets used in other areas of 360° video research are assembled to create the 360VDS, such as those used in quality assessment [33], compression [34] and salience modelling [35], along with those in surveys and literature studies [36], [37], [38]. Additionally, we also make use of the publicly available 360° video dataset from the Stanford VR lab called psych-360 [39].
In collecting this diverse range of open-source 360° videos, we ensure that the number of moving objects ranges from none to single to multiple objects. At the same time, we also ensure that the videos selected contain different camera motions: fixed, movement along a single axis, rotation, or a combination of these movements. The contents captured are also diverse, ranging from animals, trees/grasses, buildings, humans and day-to-day objects to synthetic content. With this collection, we ensure a diverse representation of content and contexts, variable lighting conditions and different types of motion. In total, 301 videos were collected with variable duration.

B. Dataset Formation
Each of the videos in the collection is then processed to detect scenes, followed by segregation into multiple clips, each representing a single shot. PySceneDetect [40] was used to identify jump cuts in the collected videos. We use an adaptive content detector to convert RGB videos to HSV and compare the rolling average difference across all channels between adjacent frames. Shot detection, identified in this way, avoids false detection that would otherwise result from large camera motion.
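Why a rolling-average comparison tolerates large camera motion can be seen in a small self-contained sketch. It is not PySceneDetect's implementation: the function names, the window width, the ratio threshold of 3.0 and the minimum score of 15.0 are all illustrative assumptions. A frame is flagged as a cut only when its change score stands out against its neighbours, so a uniformly high score during a fast pan does not trigger.

```python
import numpy as np


def content_score(prev_hsv, cur_hsv):
    """Mean absolute HSV difference between two frames of shape (H, W, 3)."""
    return float(np.mean(np.abs(cur_hsv.astype(float) - prev_hsv.astype(float))))


def adaptive_cuts(scores, window=2, ratio_threshold=3.0, min_score=15.0):
    """Return indices k where scores[k] spikes relative to its rolling neighbourhood.

    scores[k] is the content change between frame k-1 and frame k.
    All default parameter values are illustrative, not library defaults.
    """
    scores = np.asarray(scores, dtype=float)
    cuts = []
    for k in range(window, len(scores) - window):
        neighbours = np.concatenate(
            [scores[k - window:k], scores[k + 1:k + 1 + window]]
        )
        avg = neighbours.mean()
        ratio = scores[k] / avg if avg > 1e-6 else float("inf")
        if ratio >= ratio_threshold and scores[k] >= min_score:
            cuts.append(k)
    return cuts


fast_pan = [20.0] * 20            # steadily large change: camera motion, no cut
with_cut = list(fast_pan)
with_cut[10] = 80.0               # a single spike: jump cut

print(adaptive_cuts(fast_pan))    # []
print(adaptive_cuts(with_cut))    # [10]
```

A fixed threshold of, say, 15 on the raw score would have split the fast pan into twenty shots, which is the false-detection case the text refers to.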
In total, 1,133 clips were created from the 301 collected videos. The 1,133 clips were then manually filtered to remove any shots with no content, static content, or solely textual content. The 896 video clips obtained in this way were then further filtered to remove shots with highly similar content and scenes. The result is 590 videos. However, most of these videos are too long and too large, and therefore unsuitable for deep learning training and testing due to hardware demands.
To form the final benchmark dataset, up to 20 frames were extracted from each of the 590 videos. Additionally, all 590 videos were transformed to 480 × 360 pixels for consistent and uniform resolution. Resizing also helps resolve the size-related bottleneck we would otherwise face when training our recurrent model. The 590-video dataset is then split randomly to select 45 clips as a test set and the remaining 545 as the training set. Fig. 9 shows some randomly sampled frames from the final dataset. The 360VDS dataset will be publicly available at https://github.com/arbind95/360VSRS3POGitRepo for the community to use as a benchmark dataset in future research and development. To ensure that the 360VDS represents diversity, the results of spatial and temporal complexity analysis, as per the International Telecommunication Union's (ITU) standard guidelines [41], are presented in Fig. 10. It is evident that 360VDS represents the varied spatiotemporal complexity desired for training and evaluation in the VSR domain [42]. Furthermore, the four publicly available clips from the MiG Panorama test set are also analysed, as plotted in red in Fig. 10, to illustrate the need for a larger and more diverse test set.
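The spatial and temporal complexity measures can be sketched as follows, assuming the guideline cited as [41] follows the widely used ITU-T P.910 formulation: the spatial measure is the maximum over time of the standard deviation of the Sobel-filtered luminance frame, and the temporal measure is the maximum over time of the standard deviation of the difference between successive frames. The function names are illustrative, and border pixels are simply dropped by the Sobel step.

```python
import numpy as np


def sobel_magnitude(img):
    """Sobel gradient magnitude on the interior pixels of a 2-D luminance frame."""
    p = img.astype(float)
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)


def si_ti(frames):
    """Return (SI, TI) for luminance frames of shape (T, H, W), with T >= 2."""
    frames = np.asarray(frames, dtype=float)
    si = max(float(sobel_magnitude(f).std()) for f in frames)
    ti = max(float((frames[k] - frames[k - 1]).std()) for k in range(1, len(frames)))
    return si, ti


rng = np.random.default_rng(0)
static_flat = np.full((5, 32, 32), 128.0)            # no texture, no motion
busy = rng.integers(0, 256, size=(5, 32, 32))        # heavy texture and change

print(si_ti(static_flat))   # (0.0, 0.0)
si, ti = si_ti(busy)
print(si > 0 and ti > 0)    # True
```

Plotting one (SI, TI) point per clip gives the kind of scatter used to judge how much of the complexity plane a dataset covers.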
Additionally, a test set with clips ranging from high-resolution to ultra-high-resolution is created to evaluate the performance of VSR models on a higher-resolution restoration task which poses more computational challenges. The REDS [43] test set and UDM10 [44] dataset, widely used for benchmarking in conventional video super-resolution, consist of four and ten HD clips, respectively. Our newly created 360 Ultra High-Definition (360UHD) dataset consists of eight clips ranging from HD to 4K, therefore posing a similar benchmarking challenge to REDS and UDM10, but specifically designed for the 360° multimedia context. These clips are randomly selected from the 45 clips of the 360VDS test set without being transformed to 480 × 360 pixels. Their original high-resolution ground truth is kept intact, ranging from HD to 4K, as presented in Table III. Testing the efficacy of omnidirectional video super-resolution (VSR) models on high-resolution videos is of utmost importance for the majority of 360° multimedia applications, as the format requires a higher resolution to provide a truly seamless and immersive viewing experience. The process of super-resolving 360° high-resolution videos, therefore, presents a greater challenge compared to conventional VSR since the models must preserve and enhance an increased amount of detail. Consequently, it is highly recommended that any 360° VSR model be evaluated for super-resolving high-resolution videos. The details and evaluation results for each clip of the 360UHD test set are presented in Table III.

TABLE I ... [50] ON THE 360VDS TESTSET

TABLE II QUANTITATIVE COMPARISON USING PSNR, SSIM, WS-PSNR AND WS-SSIM ON MIG PANORAMA TESTSET

A. Experimental Setting
1) Domain Adaptation: The Vimeo90K [42] dataset, a widely used benchmark dataset in conventional VSR, is used to train the S3PO model initially for a generic initialisation of the super-resolution task. The size and diversity presented by Vimeo90K allow the S3PO model to learn the task of super-resolution, which is used as a knowledge base. This is then fine-tuned by retraining the S3PO model on 360VDS to allow for the specification of the generic initialisation in the context of 360° videos. Out-of-domain knowledge has proven to be effective as a generic initialisation for new domains [45]. Following the success of this approach in deep learning, we also make use of conventional VSR as a base initialisation to learn the task of 360° video super-resolution more effectively. Fig. 11 depicts the benefit of initialising the S3PO model with conventional VSR weights, which prevents early saturation of training and allows for further learning. The effectiveness of this approach is also demonstrated and discussed in Section VI (Ablation Study).
2) Training: The input LR frames were generated using both Blur Degradation (BD) and Bicubic Interpolation (BI). For BD, we use Gaussian blur with a standard deviation of σ = 1.6 and 4× down-sampling, as commonly used in the literature [46]. Using two forms of degradation allows the S3PO model to learn the super-resolution task for diverse scenarios and enables benchmarking against varied models.
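A minimal NumPy sketch of the BD pipeline (Gaussian blur with σ = 1.6 followed by 4× down-sampling) is given below. The 13-tap kernel (radius 6), the reflect padding and the choice of keeping every fourth pixel starting at index 0 are assumptions of this sketch rather than settings taken from the paper.

```python
import numpy as np


def gaussian_kernel1d(sigma=1.6, radius=6):
    """Normalised 1-D Gaussian kernel with 2 * radius + 1 taps."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()


def blur_downsample(frame, scale=4, sigma=1.6, radius=6):
    """Blur an (H, W) or (H, W, C) frame with a separable Gaussian, then subsample."""
    k = gaussian_kernel1d(sigma, radius)
    f = frame.astype(float)
    pad = [(radius, radius), (radius, radius)] + [(0, 0)] * (f.ndim - 2)
    f = np.pad(f, pad, mode="reflect")
    # Separable convolution: filter along rows' axis first, then along columns' axis.
    f = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, f)
    f = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, f)
    return f[::scale, ::scale]


rng = np.random.default_rng(0)
hr_frame = rng.random((360, 480, 3))        # one HR ERP frame (height, width, RGB)
lr_frame = blur_downsample(hr_frame)
print(lr_frame.shape)   # (90, 120, 3)
```

The blur acts as an anti-aliasing filter before decimation, which is why BD inputs look softer than BI inputs and are usually benchmarked separately.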
For the conventional VSR initialisation, weights were obtained by training the S3PO model on the conventional Vimeo90K [42] dataset with a batch size of 8, for 70 epochs using Smooth-L1 loss [47]. Following this, in order to allow

where $\psi_{i,j} = \cos\left(\frac{(i + 0.5 - \mathit{height}/2)\,\pi}{\mathit{height}}\right)$ (7)

3) Loss Function: Conventional Smooth-L1 loss behaves as either an L1 or an L2 loss, depending on a hyper-parameter β. It combines the advantages of L1 loss (steady gradients for large values) and L2 loss (less oscillation during updates when values are small). Thus, it is less sensitive to outliers and prevents exploding gradients in some cases. However, the loss functions used for conventional images and videos do not take into account the unique nature of EquiRectangular Projection (ERP) frames. Distortion across the latitude present in ERP frames can cause learning-based models to be easily influenced by high errors in prediction across polar regions.
Considering distortion, the International Telecommunication Union (ITU) pioneered the use of distortion maps to be applied on conventional Peak Signal-to-Noise Ratio (PSNR) for more accurate quality evaluations [50]. Following in these footsteps, we propose to use a weight map with a Smooth-L1 loss to account for the distortion in 360° videos during training to generate a super-resolver. To our knowledge, this is the first time a Weighted Spherically Smooth-L1 (WSS-L1) loss function, shown in (7), has been used for this purpose. In (7), width and height represent the horizontal and vertical resolution of the generated High-Resolution (HR) output, which is the same as that of the Ground Truth (GT); the weights within a given row (j) of the distortion map ψ remain the same, varying only along the latitude. An example distortion map of size 480 × 360 pixels is visualised with the help of a heatmap in Fig. 12. As seen in this figure, the use of a weight map allows the deep learning model to give more importance to equatorial regions with higher weight values, the area viewers focus on most within the wide FoV of 360° videos [51], [52]. This can be used as a standard loss function for future
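The effect of a latitude-dependent weight map on a Smooth-L1 loss can be illustrated with a short NumPy sketch. The cosine weight follows the ψ expression of (7); the piecewise Smooth-L1 form with threshold β is the standard one; the normalisation by the sum of weights and the default β = 1.0 are assumptions of this sketch, since the full expression of the WSS-L1 loss is not reproduced in the text.

```python
import numpy as np


def erp_weight_map(height, width):
    """Cosine-of-latitude weights: close to 1 at the equator, close to 0 at the poles."""
    rows = np.arange(height)
    w = np.cos((rows + 0.5 - height / 2) * np.pi / height)
    return np.repeat(w[:, None], width, axis=1)      # constant along each row


def smooth_l1(diff, beta=1.0):
    """Quadratic for |diff| < beta, linear otherwise."""
    a = np.abs(diff)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)


def wss_l1(pred, target, beta=1.0):
    """Weighted Smooth-L1 over (H, W) or (C, H, W) arrays.

    Normalising by the sum of weights is an assumption of this sketch.
    """
    h, w = pred.shape[-2:]
    psi = np.broadcast_to(erp_weight_map(h, w), pred.shape)
    return float((psi * smooth_l1(pred - target, beta)).sum() / psi.sum())


gt = np.zeros((360, 480))
polar_err = gt.copy()
polar_err[0, :] = 2.0            # error confined to the top (polar) row
equator_err = gt.copy()
equator_err[180, :] = 2.0        # same error on an equatorial row

print(wss_l1(gt, gt))                                    # 0.0
print(wss_l1(equator_err, gt) > wss_l1(polar_err, gt))   # True
```

The same pixel error contributes far less near the poles, where ERP stretches a small patch of the sphere over many pixels, so training is steered toward the equatorial band.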

B. Quantitative Evaluation
The performance of the proposed S3PO model, against other state-of-the-art conventional and 360°-specific VSR models, is evaluated on two test sets, namely the 360VDS test set with 45 clips and the MiG Panorama test set with four clips. For a fair comparison, and assessment with more recent conventional VSR models, we train and evaluate the S3PO model with two types of input degradation, namely Bicubic Interpolation (BI) and Blur Degradation (BD). To measure the quality of the generated super-resolved output, both conventional quality evaluation metrics (PSNR and SSIM) and 360°-specific quality evaluation metrics (Weighted Spherically-PSNR (WS-PSNR) [50] and WS-SSIM [27]) are used.
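As an illustration of how a 360°-specific metric differs from its conventional counterpart, the sketch below computes WS-PSNR for a single-channel ERP frame as a cosine-of-latitude weighted mean squared error, normalised by the sum of the weights. The function name `ws_psnr` and the 8-bit peak value of 255 are assumptions made for this example; the evaluation code used for the reported results may differ in colour handling and value range.

```python
import numpy as np


def ws_psnr(pred: np.ndarray, target: np.ndarray, max_value: float = 255.0) -> float:
    """WS-PSNR in dB for single-channel ERP frames of shape (height, width)."""
    height, width = target.shape
    rows = np.arange(height, dtype=np.float64)
    # Cosine-of-latitude weight per row, broadcast across all columns.
    weights = np.cos((rows + 0.5 - height / 2.0) * np.pi / height)[:, None]
    weights = np.broadcast_to(weights, (height, width))
    squared_error = (pred.astype(np.float64) - target.astype(np.float64)) ** 2
    weighted_mse = np.sum(weights * squared_error) / np.sum(weights)
    if weighted_mse == 0.0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value**2 / weighted_mse))
```

When the error is uniform over the frame, this value coincides with the conventional PSNR; the two diverge only when errors are concentrated at particular latitudes.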
On average, across the 45 clips in the test set of 360VDS, S3PO performs best across all evaluation metrics and types of degradation, as shown in Table I. The conventional VSR models in this table use the original degradation, as presented by the corresponding authors. RBPN [7] and RSDN [9] mostly perform second best to S3PO when degradation is BI and BD, respectively. For BD degradation, TGA [11] outperforms RSDN only on the WS-PSNR metric, and S3PO outperforms all the models on all metrics. Notably, on the BI degradation, BasicVSR [10] trained on REDS [43] achieves higher SSIM and WS-SSIM scores than RBPN, matching S3PO on SSIM and ranking second on WS-SSIM. Nevertheless, S3PO's superior performance is observed over all other models for BI degradation as well.
Furthermore, the S3PO model outperforms all super-resolution models across all evaluation metrics in the MiG Panorama test set, as presented in Table II, while BasicVSR [10] performs second best to S3PO. In this case, all the models presented make use of BI degradation. This demonstrates the effective performance of the S3PO model in the quality enhancement of 360° videos in different datasets representing varying input conditions.
A further evaluation and comparison of S3PO and the best-performing conventional VSR model, BasicVSR [10], is presented in Table III. The BasicVSR model trained on the conventional high-resolution dataset REDS [43] outperforms most of the existing conventional VSR models in super-resolving high-resolution conventional videos such as the REDS test set and the UDM10 [44] dataset. We evaluate the S3PO model and the BasicVSR model on the newly created 360 Ultra HD dataset, which poses a high-resolution benchmarking challenge in 360° video super-resolution. As shown in Table III, the S3PO model outperforms BasicVSR across all evaluation measures for all clips. Although BasicVSR is trained with the objective of super-resolving high-resolution videos, conventional VSR models such as BasicVSR are not well suited to the unique nature of 360° videos. Thus, our omnidirectional VSR model, S3PO, outperforms its conventional counterparts despite being trained only to super-resolve LR inputs of resolution 120 × 90 pixels to HR outputs of resolution 480 × 360 pixels. This provides further evidence of S3PO's robustness and superior ability to super-resolve diverse ranges of 360° videos in EquiRectangular Projection (ERP) form.

C. Visual Comparison
The qualitative performance of the proposed S3PO model, when compared to that of other models, can be inspected visually in Fig. 13. For this analysis, models using BD downsampling to generate LR inputs are considered. The enhanced spatial profile for a given LR input is compared with the visual quality of the corresponding spatial profile in the Ground Truth (GT). For each of the five sampled frames, representing five different clips in Fig. 13, it is evident that S3PO restores finer details across segments from various sections within the input frame, ranging from inner contents to edges. The visual details restored are improved significantly in reference to the GT when compared to other state-of-the-art super-resolution models. This further confirms the enhancement ability of S3PO resulting from the multi-fold benefits of targeted modelling and training.

D. Horizontal Cyclicity in ERP Frames
The horizontal cyclicity in EquiRectangular Projection (ERP) frames of 360° videos refers to the continuous scene that emerges when the left and right edges of the frame are stitched together, creating a seamless panoramic view [57], [58]. This property can lead to the loss of finer details and degradation of ERP frames when super-resolution techniques are applied, as the model may not be able to effectively capture the complete content of the scene across both edges.
To address this issue, a cyclic treatment methodology is considered, as depicted in Fig. 14. This method takes advantage of the geometric properties of the equirectangular projection by dividing a frame into two equal parts separated along the vertical axis, followed by stitching of the swapped left and right parts. This places the edge contents in the middle of the frame as part of a continuous scene, giving the model the opportunity to extract meaningful features for all contents across the frame, regardless of their location. The model is trained to extract features from both the original equirectangular frame and the frame subjected to the cyclic treatment, ensuring that each object or content is fed to the model as part of a continuous scene at least once.
The S3PO model is fine-tuned to extract features from the two variations of given input frames. The two sets of extracted features from the 360° feature extractor are added element-wise at the co-located positions across the two feature sets. The feature obtained in this way is then propagated for refinement without further methodological changes. With this approach, no architectural complexity is introduced, keeping the structure and size of the model constant.
As shown in Table IV, the cyclic treatment discussed above results in an improvement of PSNR and WS-PSNR results. S3PO-cyclic can be a useful variant for applications where higher pixel accuracy is desirable. At the same time, this also confirms that the proposed S3PO model with no cyclic overhead is adequately robust to the uniqueness of 360° videos.
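The half-swap at the heart of the cyclic treatment can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' code: `cyclic_treatment` and `fused_features` are hypothetical names, `extract` stands in for the 360° feature extractor, and "co-located" is interpreted here as the same array position in both feature sets.

```python
import numpy as np


def cyclic_treatment(frame: np.ndarray) -> np.ndarray:
    """Swap the left and right halves of an ERP frame shaped (..., height, width)."""
    width = frame.shape[-1]
    # A circular shift by half the width moves the left/right borders to the centre.
    return np.roll(frame, shift=width // 2, axis=-1)


def fused_features(frame: np.ndarray, extract) -> np.ndarray:
    """Add features of the original and the cyclically treated frame element-wise.

    `extract` is a hypothetical stand-in for the 360-degree feature extractor.
    Assumption: features are added at identical array positions.
    """
    return extract(frame) + extract(cyclic_treatment(frame))
```

For frames of even width, applying `cyclic_treatment` twice returns the original frame, which reflects the fact that the operation only re-centres the panorama and discards no content.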

VI. ABLATION STUDY
The design choices that jointly lead to the superior super-resolution performance of the proposed S3PO model, while also making it robust to the challenges posed by distortions in EquiRectangular Projection (ERP) frames, are examined in the two-staged ablation study presented in this section. For each ablation study, only one factor is investigated at a time while keeping the network size mostly similar in order to make a fair and factual comparison. All the models considered in this study are trained to super-resolve LR inputs generated from blur degradation.

A. 360°-Specific Characteristics
To understand the effectiveness of the proposed 360° feature extractor, we replace it with a conventional MEMC-based alignment step by making use of the SpyNet [28] flow estimation model. Removing the 360° feature extractor and replacing it with the pre-trained SpyNet model leads to a decrease in super-resolution performance across all evaluation metrics, denoted by 'w/o 360° Feature Extractor' in Table V. This signifies the superiority of the proposed feature extraction mechanism over the conventional alignment step in capturing coherent spatiotemporal details for 360° video super-resolution.
Furthermore, the attention mechanism within the 360° feature extractor is expected to allow the S3PO model to learn to pay varied attention to distorted spatial regions and diverse temporal contexts present in the given ERP frames. The results in Table V, denoted with 'w/o Attention', reveal that the removal of the attention mechanism leads to a decrement in all the performance measures. This confirms that the attention mechanism helps the network to better capture and weigh the contribution of different features to the final representation. As shown in Fig. 15, the visualization of the attention map illustrates a strong relationship between the distorted nature of the objects/contents and the corresponding attention maps. Evidently, the distortions caused to objects and contents in polar regions of LR inputs due to the equirectangular projection are represented well by the learnt attention weights. The heat maps of spatial attention weights help distinguish the varied degree of attention paid to the spatially distorted objects, while both spatial and channel attention weights ensure that the distorted nature is preserved in the learnt weights. These illustrations further showcase the learnt ability of the attention mechanism to preserve and account for the distortion present in ERP frames.

B. Information Propagation and Domain Adaptation
The S3PO model is trained to optimize the proposed Weighted Spherically Smooth-L1 (WSS-L1) loss as seen in (7). To understand the effectiveness of the WSS-L1 loss for the 360° video super-resolution task, we replace WSS-L1 loss with conventional Smooth-L1 loss. This resulted in a deterioration in performance, denoted with 'w/o Weighted Loss' in Table V, thus affirming the improved ability of the S3PO model in dealing with the uniqueness of ERP frames and preserving those characteristics during super-resolution.

Fig. 15. Heatmaps of attention weights obtained from the attention mechanism within the 360° Feature Extractor illustrating a strong correlation between learnt attention weights and the distorted regions of ERP frames (highlighted by bounding boxes). The heat maps of spatial attention weights help distinguish the varied degree of attention paid to the spatially distorted objects, while both spatial and channel attention weights ensure that the distorted nature is preserved in the learnt weights.
The balance between the L1 loss and L2 loss components in WSS-L1 loss is controlled by the β parameter as seen in (7). L1 loss measures the absolute difference between the predicted and ground truth values, while L2 loss measures the squared difference. A larger β value places more weight on the L2 loss component and is more sensitive to larger differences and outliers. A smaller β value places more weight on the L1 loss component, is more sensitive to smaller differences, and is likely to produce erratic gradients and an unstable convergence process.
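For reference, the role of β can be read directly from the commonly used Smooth-L1 formulation, which is the form implemented by PyTorch's `SmoothL1Loss` and is assumed here, not confirmed by the text, to be the per-pixel term underlying (7). For a residual x between prediction and ground truth:

```latex
\ell_\beta(x) =
\begin{cases}
\dfrac{0.5\,x^{2}}{\beta}, & |x| < \beta \\[6pt]
|x| - 0.5\,\beta, & \text{otherwise}
\end{cases}
\qquad
\frac{\partial \ell_\beta}{\partial x} =
\begin{cases}
x/\beta, & |x| < \beta \\[2pt]
\operatorname{sign}(x), & \text{otherwise}
\end{cases}
```

Under this form, residuals smaller than β fall in the quadratic (L2-like) branch and larger ones in the linear (L1-like) branch, so increasing β widens the L2 region and decreasing β widens the L1 region.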
To study the impact of β in balancing the contributions of L1 and L2 norms in WSS-L1 loss, the default β = 1 value was doubled and halved. Reducing the value of the β parameter by half places more weight on the L1 loss component and less weight on the L2 loss component, while doubling the value does the exact opposite. As shown in Table VI, both increasing and decreasing β lead to inferior performance compared to β = 1. The superior results across all evaluation metrics indicate that β = 1 is the best of the three tested settings, maintaining the desired balance between L1 and L2 norms in the proposed WSS-L1 loss. Thus β = 1 is used as the default parameter setting for all S3PO training presented in this paper.

The use of recurrent residues is intended to enable the S3PO model to capture temporal dependencies between adjacent frames in a video, thereby improving the quality of the super-resolved output. At the same time, the use of a dual-duct architecture, as seen in Fig. 7, allows for the mutual exchange between local information (induced by the 360° feature extractor) and global information (generated by recurrent residues) propagation. This means that information from both fine and coarse levels, as shown in Fig. 16, can be passed between different processing stages. This allows for the effective use of both local and global information, leading to a more accurate representation of the high-resolution frames.
The empirical results presented in Table VII show that the removal of the recurrent residues (hidden state h_{t−1} as seen in (3)) and of the mutual exchange (as seen in Eqs. (4)-(6)) between local and global information propagation leads to a significant reduction for all quality measures. This provides evidence of the practical advantages of these techniques in improving the accuracy and quality of the super-resolved output in the context of 360° videos. Similarly, the S3PO model without domain adaptation underperforms compared to the version with adaptation in all four evaluation metrics, as shown in Table VII. The benefits of domain adaptation in this context of video super-resolution can be attributed to the improved ability of the S3PO model to generalize to new, unseen data. By fine-tuning the S3PO model on the 360° video dataset, the model is better able to handle the unique challenges posed by this type of data while leveraging the base initialization for the super-resolution task from the conventional domain.

VII. CONCLUSION AND FUTURE WORK
The applicability of conventional Video Super-Resolution (VSR) models, with satisfactory outcomes on 360° videos, is demonstrated in this research. To ensure diverse training and test conditions, desirable for low-level computer vision systems specific to 360° multimedia, a novel dataset is assembled and described. Conventional VSR models are applicable to omnidirectional videos because the EquiRectangular Projection (ERP) frames are similar in format to conventional video frames. Nevertheless, the data within an ERP frame is unique and different from the conventional video frame because of the distortion present along the vertical axis and cyclic continuity along the horizontal axis in 360° videos.
Accounting for the ERP-specific properties, a novel 360° VSR model is proposed (the S3PO model) with an ERP-specific architecture, feature extractor and optimiser. The empirical evaluation and ablation study confirm the superiority of the super-resolution performance of the S3PO model, provisioned by the combined benefit of 360°-content-specific architectural sub-components, training with domain adaptation and optimisation with a distortion-aware loss. The S3PO model does not incorporate conventional VSR steps, such as alignment; nonetheless, it outperforms the state-of-the-art super-resolution models, including those that use alignment.
The S3PO model and the 360VDS dataset help define new opportunities for future 360° multimedia research. The application of implicit and explicit alignment techniques can be studied further as extension work, with appropriately attenuated alignments to account for the distortion and cyclicity in EquiRectangular Projection (ERP) frames. Additionally, the impact on Quality-of-Experience (QoE) resulting from quality enhancement using S3PO can be studied to better understand how the model affects user perception and consumption of 360° multimedia.

Fig. 1. Illustration of an EquiRectangular Projection (ERP) frame with corresponding wide Field-of-View (FoV) and the induced distortion map.

Fig. 4. Block diagram of the key architecture of the proposed Spherical Signal Super-resolution with a Proportioned Optimisation (S3PO) model.

Fig. 4 respectively as G_local and G_global, are then propagated through a network of dual-duct residual blocks. Each dual-duct residual block refines the features individually while still making use of mutual information exchange to allow both local and global memory to be refined with respect to each other. Such a network of dual-ducts allows different frequency details to be captured separately while still fostering meaningful feature representations mutually. The two sets of features obtained from the last residual block are subjected to pixel-shuffle operations to enable depth-to-space transformation of the obtained 360° feature maps. The resultant ×4 feature map is then used as residue, added element-wise to a bi-linearly interpolated target frame to produce the final ×4 super-resolved equirectangular frame HR_t (1) and (2) shown at the bottom of this page.
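The depth-to-space step and the residual addition described above can be sketched as follows. This is a simplified NumPy illustration, not the S3PO implementation: it handles a single feature tensor (how the outputs of the two ducts are combined is not specified here), it assumes the channel ordering used by PyTorch's `PixelShuffle`, and it takes the interpolated target frame as an input instead of performing the bilinear interpolation itself. The names `pixel_shuffle` and `reconstruct` are hypothetical.

```python
import numpy as np


def pixel_shuffle(features: np.ndarray, scale: int) -> np.ndarray:
    """Depth-to-space: (C * scale**2, H, W) -> (C, H * scale, W * scale)."""
    channels, height, width = features.shape
    out_channels = channels // (scale * scale)
    blocks = features.reshape(out_channels, scale, scale, height, width)
    # Interleave each scale x scale group of channels into a spatial block.
    blocks = blocks.transpose(0, 3, 1, 4, 2)
    return blocks.reshape(out_channels, height * scale, width * scale)


def reconstruct(features: np.ndarray, upsampled_frame: np.ndarray, scale: int = 4) -> np.ndarray:
    """Add the shuffled residue to an already interpolated (upsampled) frame."""
    return upsampled_frame + pixel_shuffle(features, scale)
```

With a scale of 4, a feature tensor of 16 channels at 90 × 120 becomes a single-channel 360 × 480 residue, matching the LR and HR resolutions used for training in this work.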

Fig. 9. Randomly sampled EquiRectangular Projection (ERP) frames with uniform resolution representing different video clips from the newly created 360° Video Dataset (360VDS) showcasing diverse contents and lighting conditions.

Fig. 10. Representation of the spatial and temporal complexity present in the assembled 360VDS dataset.

Fig. 11. Comparison of WSS-L1 loss when the S3PO model is trained with (and without) being initialised with conventional VSR weights.
research and related products specifically for 360 • signals in EquiRectangular Projection (ERP) form.

Fig. 13. Visually inspected qualitative performance comparison for super-resolution models using Blur Downsampling (BD). Super-resolved output for segments profiled from five different Low Resolution (LR) frames belonging to five different clips confirms S3PO's outstanding visual restoration ability in reference to the Ground Truth (GT).

Fig. 14. Illustration of steps in cyclic treatment applied to an ERP frame considering horizontal panoramic continuity of the scenes in 360° videos.

Fig. 16. Final features obtained from two ducts of S3PO model illustrating different frequency details represented mutually by the ducts for a given LR input.

TABLE III
QUANTITATIVE COMPARISON USING PSNR IN DB, SSIM, WS-SSIM AND WS-PSNR IN DB ON THE HIGH-RESOLUTION 360 ULTRA HIGH-DEFINITION

The initial learning rate is set to 1 × 10^−4 and decayed by a factor of 10 after every 10 epochs. Model training and testing are performed using two NVIDIA Tesla V100 GPUs. The PyTorch [49]-based source code and training weights will be made publicly available at https://github.com/arbind95/360VSRS3POGitRepo.
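The learning-rate schedule stated above amounts to a simple step decay, sketched below in plain Python. The function name `learning_rate` and the zero-based epoch indexing are assumptions made for this example; in PyTorch, the same behaviour is typically obtained with `torch.optim.lr_scheduler.StepLR` using `step_size=10` and `gamma=0.1`.

```python
def learning_rate(epoch: int, initial_lr: float = 1e-4, step: int = 10, factor: float = 10.0) -> float:
    """Step decay: divide the learning rate by `factor` after every `step` epochs.

    Assumption: epochs are counted from zero.
    """
    return initial_lr / (factor ** (epoch // step))
```

Under these assumptions, epochs 0 to 9 use 1 × 10^−4, epochs 10 to 19 use 1 × 10^−5, and so on.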

TABLE IV
IMPACT OF CYCLIC TREATMENT AS DISCUSSED IN SECTION V-D

TABLE V
S3PO WITH AND WITHOUT 360°-SPECIFIC CONSIDERATIONS

TABLE VI
IMPACT OF WSS-L1 LOSS β VALUES (SEE EQ. 7) ON S3PO RESULTS

TABLE VII
IMPACT OF RECURRENT RESIDUE, MUTUAL INFORMATION EXCHANGE AND DOMAIN ADAPTATION ON S3PO RESULTS