Panoramic Video Quality Assessment Based on Non-Local Spherical CNN

Panoramic video and stereoscopic panoramic video are essential carriers of virtual reality content, so establishing quality assessment models for them is crucial for the standardization of the virtual reality industry. However, evaluating the quality of panoramic video remains challenging. One reason is that the spatial information of the panoramic video is warped by the projection process, which conventional video quality assessment (VQA) methods handle poorly. Another reason is that traditional VQA methods struggle to capture the complex global temporal information in panoramic video. To address these problems, this paper presents an end-to-end neural network model to evaluate the quality of panoramic video and stereoscopic panoramic video. Compared with other panoramic video quality assessment methods, the proposed method combines spherical convolutional neural networks (CNN) and non-local neural networks, which can effectively extract the complex spatiotemporal information of panoramic video. We evaluate the method on two databases, VRQ-TJU and VR-VQA48. Experiments show the effectiveness of the different modules in our method, and our method outperforms other state-of-the-art methods.


I. INTRODUCTION
As a new means of simulation and interaction, virtual reality (VR) has attracted more and more attention in recent years [1]. Panoramic video and stereoscopic panoramic video are essential means of constructing virtual reality. Panoramic video has an unparalleled sense of realism and immersion, which can make viewers feel as if they are there. However, low-quality panoramic video can cause intense discomfort and even physical illness [2], [3]. The process of making panoramic video and stereoscopic panoramic video is complex [4], including shooting, stitching [5], blending, projection, encoding, etc. Each step distorts the original video and affects the quality of the panoramic video [6]. Therefore, to promote the standardization of panoramic video, it is imperative to carry out work on virtual reality video quality assessment (VRVQA). In traditional multimedia quality assessment, an algorithm typically extracts hand-crafted features and then uses a machine learning method to perform regression prediction. The two steps are usually designed separately. For feature extraction, the commonly used supporting theories are saliency [7]-[10] and the human visual system (HVS) [11]-[14]. Support vector regression (SVR) [15] is usually used in the regression step.
However, panoramic video is very different from ordinary video [16], [17]. Part of the production process and the quality assessment of panoramic video are shown in Fig. 1. When making a panoramic video, we first shoot with multiple panoramic cameras and then merge the captured videos into a sphere. For ease of transmission and encoding, the panoramic video is projected from the sphere onto the plane. The panoramic video is encoded and decoded, and then projected from the plane back onto the sphere for viewing. Therefore, the video data we need to process has usually been projected onto the plane, but projection bends the original shape of regular objects [18], so it is challenging for ordinary image quality assessment (IQA) and video quality assessment (VQA) methods to extract useful features. VRVQA must therefore handle the projection process of panoramic video specifically.
To solve the problem of projection, some methods have been proposed in VRVQA field. Xu et al. [19] proposed two kinds of objective assessment methods: non-content-based perceptual peak signal to noise ratio (NCP-PSNR) and content-based perceptual PSNR (CP-PSNR). The difference between the two is whether to predict the viewing direction of the person and calculate the difference, and then perform the weight mapping of the region for the PSNR operation. Yang et al. [20] used 3D convolutional neural networks (CNN) to evaluate the quality of local panoramic video blocks, and then assigned different weights to combine all video blocks to obtain the overall quality of the video. Sun et al. [21] proposed weighted-to-spherically-Uniform PSNR (WS-PSNR), which gives all pixels different weights in advance and then calculates the PSNR. Yu et al. [22] projected the pixels on the original panoramic video plane and the distorted panoramic video plane onto a sphere, and then performed a large number of uniform sampling on the spherical surface to calculate the PSNR. They proposed two indicators, S-PSNR and L-PSNR, which differ in whether they give higher weight to the equator. Zakharchenko et al. [23] proposed the Craster parabolic projection PSNR (CPP-PSNR), which projects all the panoramic video to the sphere using the CPP method.
Although the above methods have achieved excellent results, two problems have not been effectively solved. Firstly, feature extraction in the spatial domain has to be discussed. Most of the above VRVQA methods are the improvement of traditional quality assessment methods, lacking feature extraction methods for panoramic video and advanced techniques such as deep learning. Secondly, global time domain feature extraction in the time domain has to be discussed. Global time domain information refers to the relationship between the pixels of each frame and all the pixels of other frames, which is different from finding the local time domain information of current pixels and some pixels in other frames. Most of the above VRVQA methods do not consider the effect of the relevant information of pixels in different frames of the video on the quality assessment.
Based on this, we design a deep neural network for panoramic video, which can extract the information of the panoramic video spatial domain and global time domain effectively. We verify the effectiveness of the proposed method through a series of experiments. Our main contributions are below.
1) The proposed network can effectively extract panoramic video features. Compared with ordinary CNN, spherical CNN [24] can effectively extract the "deformed" features in the panoramic video, and has translation invariance, rotation invariance and scale invariance in panoramic video processing. Spherical CNN projects the panoramic image from the plane to the three-dimensional sphere, and extracts the relevant features on the sphere by convolution. Therefore, we use spherical CNN as the basis for the proposed method. Comparative experiments verify the effectiveness of spherical CNN in VRVQA.
2) The proposed network can make full use of the global temporal information of the input data. The non-local neural network module [25] makes the feature map in the neural network contain attention information, so the global time information of the panoramic video can be extracted together with the spherical CNN. Besides, the non-local neural network module uses a residual structure, which can be embedded in the spherical CNN while maintaining the same size of input and output, rather than evaluating the spatial domain and global time domain separately. Comparative experiments verify the effectiveness of the non-local module in VRVQA.
3) The model we designed can evaluate not only the quality of the panoramic video, but also the quality of the stereo panoramic video. The two differ only in the preprocessing part. Experiments show that our method can provide the best quality indicator in the field of VRVQA.
In the following sections, we elaborate on the characteristics of panoramic video and related works (Section II), analyze our method (Section III), evaluate our method through a large number of experiments (Section IV), and draw conclusions and discuss future directions (Section V).

II. BACKGROUND AND MOTIVATION
In this section, the characteristics of the space-time domain of the panoramic video and the solution ideas are introduced. Then, the description and progress of the VRVQA related work are listed separately.

A. Spatial Domain Characteristics of Panoramic Video
Projection is the essential reason for the difference between panoramic video and ordinary video. To facilitate transmission, the panoramic video must be projected onto a plane and projected back onto the sphere when viewed.
There are many ways to project panoramic video [26], such as equirectangular projection (ERP) [27], cylindrical equal-area projection (EAP) [27], cube map projection (CMP) [28], rotated sphere projection (RSP) [29], etc. Irrespective of the method used, projection inevitably distorts the original pixel distribution and shape. ERP has become the most commonly used projection method for panoramic video due to its simple processing. Similar to the projection process of the world map, ERP stretches each circle of latitude to the length of the equator. Because the longitude spans 2π and the latitude spans π, the projected plane has a 2 : 1 aspect ratio. This method stretches the two polar regions of the sphere significantly, so the warping deformation near the poles is more severe after projection. The formula of ERP is as follows:
$$x = (\lambda - \lambda_0)\cos\phi_0, \qquad y = \phi - \phi_0$$
where λ represents the longitude on the sphere, ϕ represents the latitude on the sphere, and λ_0 and ϕ_0 usually represent the longitude and latitude of the equatorial center in the panoramic video. x and y represent the horizontal and vertical coordinates in the plane, respectively. After projection, the original pixel distribution is deformed, and the degree of distortion differs from position to position. The closer to the two poles, the greater the distortion. The projection process of the panoramic video is shown in Fig. 2.
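The stretching behaviour of ERP described above can be illustrated with a short sketch. The pixel convention (origin at the top-left, image centre at longitude 0 and latitude 0) and the function names `erp_project` and `horizontal_stretch` are assumptions made for this example and are not taken from the paper.

```python
import math

def erp_project(lon, lat, width, height):
    """Map spherical (lon, lat), in radians, to ERP pixel coordinates.

    Assumed convention (not from the paper): lon in [-pi, pi),
    lat in [-pi/2, pi/2], origin at the top-left of the image,
    and the image centre at lon = 0, lat = 0.
    """
    x = (lon + math.pi) / (2.0 * math.pi) * width
    y = (math.pi / 2.0 - lat) / math.pi * height
    return x, y

def horizontal_stretch(lat):
    """Factor by which ERP widens the circle of latitude `lat` (radians).

    A circle of latitude has circumference proportional to cos(lat),
    but ERP draws it as wide as the equator.
    """
    return 1.0 / math.cos(lat)

width, height = 2560, 1280          # 2:1 aspect ratio
cx, cy = erp_project(0.0, 0.0, width, height)
stretch_60 = horizontal_stretch(math.radians(60.0))
stretch_85 = horizontal_stretch(math.radians(85.0))
```

Under these assumptions the equatorial centre lands in the middle of the image, content at 60 degrees latitude is drawn about twice as wide as on the sphere, and the factor grows without bound towards the poles.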
The distortion of pixels caused by projection will change with the position of the image, so the information distribution and features of the objects on the panoramic image will be quite different from the ordinary image. In general VQA method, the feature extraction operator obviously cannot adapt to the estimated deformation and extract effective features [30], and the current VRVQA field lacks the features designed for panoramic video. In order to solve this problem, a deep learning method is applied in this paper. Deep learning can automatically extract the features of panoramic video without artificial design and participation, and can extract higher-dimensional semantic information as the depth of the network increases [31]. Therefore, we design the VRVQA method based on CNN.
The same decoration in Fig. 3 has different degrees of warpage deformation at different positions after projection. This is because the displacement of an object in a spherical model belongs to three-dimensional rotation rather than translation. Therefore, the same convolution kernel is difficult to extract consistent effective features for the same object at different locations. It can be known that sparse connectivity, weight sharing, and pooling in CNN do not have translation invariance, scale invariance and rotation invariance in panoramic video [24]. For the above reasons, we must modify the convolution method in CNN, so we choose the spherical CNN that can effectively extract the features of the panoramic video.
The spherical CNN regards the spherical image as a three-dimensional manifold, expands the spherical surface into discrete three-dimensional Lie groups [32], and expresses the relationship of the special orthogonal group SO(3) in the CNN. In fact, spherical CNN can be understood as a convolution process for input signals on a three-dimensional manifold. The spherical CNN is explained in detail in Section III, Part B. Through the above method, the two-dimensional image is reconstructed into a three-dimensional manifold, that is, the panoramic image frame is projected back onto the sphere to solve the problem of pixel deformation, and then the features are extracted by the spherical CNN. The special orthogonal group SO(3) is expressed as follows:
$$SO(3) = \{ R \in \mathbb{R}^{3 \times 3} \mid R R^{T} = I,\ \det(R) = 1 \}$$
where R represents a 3 × 3 matrix, and the right side of the equation indicates that the matrix is orthogonal and its determinant is 1. In the experiment, R refers to the rotation matrix, not the learnable filter parameters.
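The two conditions that define SO(3), orthogonality and unit determinant, can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`is_rotation`, `rot_z`) and the tolerance are choices made for this illustration only.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def is_rotation(m, tol=1e-9):
    """True if m is orthogonal (m m^T = I) and det(m) = 1, i.e. m is in SO(3)."""
    prod = matmul(m, transpose(m))
    orthogonal = all(abs(prod[i][j] - (1.0 if i == j else 0.0)) <= tol
                     for i in range(3) for j in range(3))
    return orthogonal and abs(det3(m) - 1.0) <= tol

def rot_z(angle):
    """Rotation by `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

reflection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
```

A rotation about the z axis passes both checks, whereas the reflection is orthogonal but has determinant -1 and is therefore rejected.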

B. Global Time Domain Information Extraction of Panoramic Video
The extraction of global time domain information has always been a problem in the field of VQA. As a kind of video, panoramic video also needs to incorporate time domain information into the quality evaluation system. The previous work is divided into two categories according to whether the spatial domain information and time domain information are considered comprehensively.
The first is to extract the spatial domain and time domain information separately, and then comprehensively perform the quality assessment. Manasa et al. [33] used local optical flow statistics to measure the video time domain distortion to design a full reference VQA method. Zhu et al. [34] first obtained the characteristics of each frame of the video and then combined these features to learn the weight of the parameters from the main neural network. Ullah et al. [35] used long short-term memory (LSTM) to process the extracted spatial domain features, which can adequately express the information between the preceding and following frames.
The second is to extend the original 2D method to 3D, and then comprehensively consider the space-time domain information of the video. Li et al. [36] extended the two-dimensional discrete cosine transform (DCT) to three-dimensional, so that the spacetime domain information of the video was extracted simply and effectively. Similar to the previous work, Li et al. [37] used the 3D shearlet transform to extract the spatiotemporal information. Giannopoulos et al. [38] used 3D CNN to extend the process of convolution and pooling from 2D to 3D to complete the video quality assessment.
The above work has brought us a lot of inspiration, but considering the cooperation with spherical CNN, an easy-to-integrate deep learning method is our best choice. CNN imitates the human cognitive process from local to macro [39]. The bottom convolution is responsible for local information and the top convolution is accountable for combining local information to get global information. However, this idea can not be applied to all situations. For example, for the quality assessment of speech video, a convolution kernel covers only around the human head.
To evaluate the quality, we should pay attention not only to the distortion of the head, but also to the background behind the head, the distortion of the next frame and other related information [40]. The same distortion appears worse on a face than on the sky, and inter-frame flicker distortion is also worse than distortion that persists across frames [41]. It is difficult for a single convolution layer to extract globally related information. Because information is pooled and passed on layer by layer, a large amount of it is lost before multiple convolutional layers can assemble the global information, so CNN has limitations in extracting global information [42].
In order to resolve the contradiction between CNN and global time domain information, non-local neural networks are integrated into our proposed framework. The non-local neural network calculates the response of a certain location as the weighted sum of the features of all positions in the input feature mapping. When we use non-local neural networks to process video, the information of each point in the feature map contains information about other points, and the input and output shapes of the non-local neural network module are the same. It is easy to insert into the neural network and can effectively extract the global temporal information of video frames.

C. General Idea of VRVQA
The quality assessment of panoramic and stereo panoramic video is in its infancy, and the related results are less than other multimedia quality evaluation fields. In order to perform VRVQA, we first need to rely on the database. Zhang et al. [43], Xu et al. [19], Zhang et al. [44] and Yang et al. [20] improved subjective assessment methods according to the characteristics of the panoramic video itself, and established a panoramic video database or a stereo panoramic video database.
Based on this work, there are two main ideas for the objective quality assessment of the panoramic video. One way is to assign different weights to different areas of the panoramic video according to the pixel warping deformation or the different viewing directions of the viewer [19], [20]. As a representative method of this idea, WS-PSNR [21] is calculated according to the following formula:
$$\mathrm{WS\text{-}PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{WMSE}}\right), \qquad \mathrm{WMSE} = \frac{\sum_{i=0}^{w-1}\sum_{j=0}^{h-1} \big(y(i,j) - y'(i,j)\big)^2\, w(i,j)}{\sum_{i=0}^{w-1}\sum_{j=0}^{h-1} w(i,j)}, \qquad w(i,j) = \cos\frac{(j + 0.5 - N/2)\,\pi}{N}$$
where w(i, j) is the assigned weight, and y(i, j) and y'(i, j) are the reference pixel value and the test pixel value, respectively. MAX is the maximum pixel value, and h and w (without arguments) are the height and width of the image. N represents the total number of pixels per column. The other way is to re-project the planar panoramic video onto the sphere and then improve the accuracy of the quality assessment by changing the projection format or sampling method [22]. As a classic method under this idea, CPP-PSNR [23] converts the video projection format into the Craster parabolic projection format to reduce the degree of pixel distortion. The formula for the CPP projection transformation is as follows:
$$x = \sqrt{\frac{3}{\pi}}\, R\, \lambda \left(2\cos\frac{2\phi}{3} - 1\right), \qquad y = \sqrt{3\pi}\, R \sin\frac{\phi}{3}$$
where ϕ and λ are the elevation and azimuth of the spherical coordinates, and R is the spherical radius. These ideas have greatly inspired our work. The method proposed in this paper mainly belongs to the second idea.
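The weighting idea behind WS-PSNR can be sketched in a few lines. The cosine-of-latitude row weight used here is the form commonly stated for ERP content and is an assumption of this example, as are the function names; it is not a reference implementation of [21].

```python
import math

def erp_row_weights(height):
    """Per-row weights for an ERP image (assumed form: cosine of latitude).

    Rows near the equator get weights close to 1; rows near the poles,
    which ERP stretches the most, get weights close to 0.
    """
    return [math.cos((j + 0.5 - height / 2.0) * math.pi / height)
            for j in range(height)]

def ws_psnr(ref, test, max_value=255.0):
    """Weighted PSNR between two grayscale images given as [row][col] lists."""
    height, width = len(ref), len(ref[0])
    weights = erp_row_weights(height)
    num = 0.0   # weighted sum of squared errors
    den = 0.0   # sum of weights
    for j in range(height):
        for i in range(width):
            diff = ref[j][i] - test[j][i]
            num += weights[j] * diff * diff
            den += weights[j]
    wmse = num / den
    if wmse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / wmse)

# The same error placed near a pole and near the equator of a 4 x 4 image.
ref = [[100.0] * 4 for _ in range(4)]
polar = [row[:] for row in ref]
polar[0] = [110.0] * 4
equatorial = [row[:] for row in ref]
equatorial[1] = [110.0] * 4
```

In this toy case `ws_psnr(ref, polar)` is higher than `ws_psnr(ref, equatorial)`, because the error near the pole is down-weighted.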

III. PROPOSED METHOD
In this section, our method is described and deduced in detail. Fig. 4 describes the primary process of our method. It should be emphasized that the method proposed in this paper can not only evaluate the quality of the stereo panoramic video, but also evaluate the quality of the ordinary panoramic video.

A. Preprocessing
Since the amount of video data tends to be large, it is difficult to directly use the entire video as input to a deep learning network. The differential grayscale image can represent stereoscopic image information well while reducing the amount of data [45], so we perform a similar preprocessing. We convert the left and right views of the video to grayscale and subtract them according to the following formula:
$$x_i = \mathrm{gray}(V_{left})_i - \mathrm{gray}(V_{right})_i$$
where x is the output of the preprocessing, V_left and V_right represent the left and right views of the stereoscopic panoramic video frame, gray(·) denotes grayscale conversion, and i is the pixel position index. After the above processing, the grayscale difference map has a size of 2560 × 1280. To fit the input of the spherical CNN and reduce the number of parameters that need to be calculated, the grayscale difference map is downsampled to 1280 × 1280, which completes the preprocessing of the spatial domain.
For the preprocessing of the time domain, uniformly spaced sampling is used. Adjacent video frames contain too much redundant information because it is difficult for the naked eye to observe the changes among them. Following other video research areas [46], [47], we randomly select one frame as the starting frame of the training sample video, and then extract one frame every 8 frames. A total of 3 frames are extracted to form a video block, which is used as the input of the network. It should be noted that, due to the large amount of video data, it is difficult to read multiple consecutive frames as input during network training. The specific process is shown in Fig. 5.
Fig. 4. The method proposed in this paper. Each convolution layer has a dimension that indicates the size of the output, and the first number "4" in the dimension represents the batch_size. The orange module describes the SO3_NonlocalBlock, and the dashed line indicates that the SO3_NonlocalBlock module is not used. "FC" indicates the fully connected layer.
Based on the above preprocessing, a video block can be extracted every 24 frames. When processing a normal panoramic video, the input of the network becomes a grayscale image of the same size instead of a grayscale difference map, and the processing of the video block is the same.
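The preprocessing steps described in this subsection can be sketched as follows. The luma weights used for grayscale conversion (ITU-R BT.601) and all function names are assumptions of this example; the paper does not state which grayscale conversion it uses, and the downsampling step is omitted here.

```python
import random

def to_gray(rgb_frame):
    """Convert a frame of (r, g, b) tuples to grayscale.

    The BT.601 luma weights are an assumption; the paper does not
    specify the conversion.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_frame]

def difference_map(left_rgb, right_rgb):
    """Grayscale difference between the left and right views."""
    left, right = to_gray(left_rgb), to_gray(right_rgb)
    return [[l - r for l, r in zip(l_row, r_row)]
            for l_row, r_row in zip(left, right)]

def block_indices(num_frames, stride=8, frames_per_block=3, rng=random):
    """Pick a random start frame, then take one frame every `stride` frames."""
    span = stride * (frames_per_block - 1)
    start = rng.randrange(num_frames - span)
    return [start + k * stride for k in range(frames_per_block)]

# Example: a 2 x 2 stereo pair with identical views, and one sampled block.
view = [[(10, 20, 30), (40, 50, 60)], [(70, 80, 90), (15, 25, 35)]]
zero_map = difference_map(view, view)
indices = block_indices(300, rng=random.Random(0))
```

For an ordinary (non-stereo) panoramic video, `to_gray` alone would play the role of `difference_map`, in line with the text above.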

B. Spherical CNN
In an ordinary CNN, the output of the convolution operation is the inner product of the input feature map and the convolution kernel, which corresponds to the correlation operation in mathematics. Similarly, the convolution output in a spherical CNN is the inner product of the feature map and the rotated convolution kernel, where the feature map of the spherical CNN is treated as a signal on the special orthogonal group SO(3). The rotation group convolution in a spherical CNN can be expressed as follows:
$$[\phi \star f](R) = \langle L_R \phi, f \rangle = \int_{SO(3)} \sum_{k=1}^{n} \phi_k(R^{-1} Q)\, f_k(Q)\, dQ$$
where f and ϕ are signals SO(3) → R^n on the special orthogonal group. L_R is the rotation operator, defined as [L_R f](Q) = f(R^{-1} Q) on SO(3). dQ is the integration measure and can be expressed as dα sin(β) dβ dγ/(8π^2), where α, β, γ are the angles of the ZYZ Euler parameterization.
In fact, the calculation of the rotated feature map is equivalent to the inner product between the input feature map and the rotated filter.
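As a quick sanity check of the measure dQ = dα sin(β) dβ dγ/(8π^2) given above, the sketch below verifies numerically that it integrates to 1 over the whole group. The angle ranges (α and γ over an interval of length 2π, β over [0, π]) follow the usual ZYZ convention and are an assumption here, since the text does not state them.

```python
import math

def haar_total(n_beta=1000):
    """Integrate d_alpha * sin(beta) d_beta * d_gamma / (8 pi^2) over SO(3).

    The integrand depends only on beta, so the alpha and gamma integrals
    each contribute a factor 2 * pi; the beta integral uses the midpoint rule.
    """
    d_beta = math.pi / n_beta
    beta_integral = sum(math.sin((k + 0.5) * d_beta)
                        for k in range(n_beta)) * d_beta
    alpha_integral = 2.0 * math.pi
    gamma_integral = 2.0 * math.pi
    return alpha_integral * beta_integral * gamma_integral / (8.0 * math.pi ** 2)

total = haar_total()
```

The factor 8π^2 is exactly the volume (2π)(2)(2π) of the parameter domain under the sin(β) weight, which is why the normalized measure has total mass 1.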
Similar to the convolution of a signal on SO(3), the convolution of a signal on the spherical surface S^2 can be defined as:
$$[\phi \star f](R) = \langle L_R \phi, f \rangle = \int_{S^2} \sum_{k=1}^{n} \phi_k(R^{-1} x)\, f_k(x)\, dx$$
Spherical CNN uses the generalized Fourier transform (GFT) to reduce the complexity of SO(3) convolution. The formulas of the transform and the inverse transform are expressed as follows:
$$\hat{f}^{l} = \int_{X} f(x)\, \overline{U^{l}(x)}\, dx$$
$$f(R) = \sum_{l=0}^{b} (2l+1) \sum_{m=-l}^{l} \sum_{m'=-l}^{l} \hat{f}^{l}_{mm'}\, D^{l}_{mm'}(R)$$
where X refers to the manifold on which the input signal is defined, such as S^2 or SO(3). If the input is a signal on SO(3), X can be understood as the space spanned by the three angles (α, β, γ). U^l denotes the corresponding basis function, f is a function X → R, b is the bandwidth, and D denotes the Wigner D-functions.

C. Non-local Neural Networks
In order to extract the global time domain information of the video and handle long-range dependencies, the non-local neural network is embedded in the designed network. The mathematical formula for the non-local operation is as follows:
$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$
where i is the index of a location in the input feature map. In general, this position can be a point in time, in space, or in space-time. j is the index of all possible locations, and x is the input signal, which is usually a feature map. y is the output feature map, which has the same size as x. f is the pairwise function that calculates the correlation between the i-th position and all other positions, g is a unary function that transforms the input, and C(x) is the normalization function that keeps the overall information unchanged during the conversion. These functions can take many forms. We choose softmax as the f function. Since softmax contains the normalization process, the calculation of C is omitted, and a convolution operation is used for g. Because a 1 × 1 convolution is equivalent to matrix multiplication [25], the convolution operation is expressed as w · x, where w refers to the learnable parameters of the convolution kernel and T represents matrix transposition. The non-local operation can then be converted into:
$$y = \mathrm{softmax}\big(x^{T} w_\theta^{T} w_\phi\, x\big)\, g(x)$$
where w_θ and w_ϕ are the kernels of the two embedding convolutions. To keep the input and output sizes the same and make the module easy to configure in the network, a residual design is used, as shown in the formula:
$$z_i = w_z\, y_i + x_i$$
where y_i is the output of the non-local operation, x_i is the input, and w_z is the kernel of the output convolution. Taking an ordinary CNN as an example, the non-local operation is shown in Fig. 6. In Fig. 6, the input first passes through a convolution kernel of size 1 × 1 × 1, whose main purpose is to reduce the dimension, thereby reducing the computational complexity of the non-local block. It is worth noting that this module embodies the idea of the attention mechanism.
The softmax operation is equivalent to finding the normalized correlation between other pixels and the current pixel, and then multiplying the matrix after conv3, which applies an attention mechanism to each pixel of the input. Finally, conv4 restores the feature map to its original size and adds it to the input.
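The non-local operation with softmax attention and a residual connection, as described in this subsection, can be sketched on a flattened feature map. This is a toy illustration in plain Python, not the SO3_NonlocalBlock used in the paper; the weight names `w_theta`, `w_phi`, `w_g` and `w_z` follow the notation of [25] and each plays the role of a 1 × 1 convolution.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, w):
    """Apply a 1x1 convolution (a matrix multiply) at every position.

    x: list of N positions, each a list of C_in features.
    w: C_out x C_in weight matrix.
    """
    return [[sum(w_row[c] * pos[c] for c in range(len(pos))) for w_row in w]
            for pos in x]

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """z_i = w_z * sum_j softmax_j(theta_i . phi_j) * g_j + x_i."""
    theta, phi, g = linear(x, w_theta), linear(x, w_phi), linear(x, w_g)
    out = []
    for i, t in enumerate(theta):
        # Correlation of position i with every position j, normalized by softmax.
        scores = [sum(a * b for a, b in zip(t, p)) for p in phi]
        attn = softmax(scores)
        # Attention-weighted sum of the transformed features.
        y_i = [sum(attn[j] * g[j][c] for j in range(len(g)))
               for c in range(len(g[0]))]
        # Output convolution followed by the residual connection.
        z_i = linear([y_i], w_z)[0]
        out.append([z + xi for z, xi in zip(z_i, x[i])])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 positions, 2 channels
same = non_local_block(x, identity, identity, identity, zeros)
mixed = non_local_block(x, identity, identity, identity, identity)
```

With `w_z` set to zero the block reduces to the identity mapping, which shows why the residual form can be inserted into an existing network without changing the shape of its feature maps.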

D. Network Design and Training
The number of epochs is set to 200. An epoch is one pass of all the training data through the network, including forward computation and back propagation. Since the data of a whole epoch is too large to be loaded at once, we divide it into several smaller batches. The batch size (batch_size) is set to 4 because the input video is large and the hardware memory is limited. The learning rate is set to 1e-3. The number of channels in each layer is set to 4, 8, 16 and 1, and the bandwidth b is 640, 128, 32 and 8, respectively. Note that the bandwidth should be half of the input height H and width W.
ReLU is used as the activation function in this paper. The formula is as follows:
$$f(x) = \max(0, x)$$
where x refers to the input of the activation function layer. This paper uses Adam [48] as the optimizer. The formula is as follows:
$$V_{dp} \leftarrow \beta_1 V_{dp} + (1 - \beta_1)\, dp, \qquad S_{dp} \leftarrow \beta_2 S_{dp} + (1 - \beta_2)\, dp^2$$
$$V_{dp}^{c} = \frac{V_{dp}}{1 - \beta_1^{t}}, \qquad S_{dp}^{c} = \frac{S_{dp}}{1 - \beta_2^{t}}, \qquad p \leftarrow p - \alpha\, \frac{V_{dp}^{c}}{\sqrt{S_{dp}^{c}} + \varepsilon}$$
where V^c_dp and S^c_dp are the first-order and second-order moment estimates of the gradient with bias correction, respectively. α is the learning rate, β_1 and β_2 are the attenuation factors, dp is the gradient of the parameters p, t is the iteration index, and ε is the offset that prevents the denominator from being zero. In the experiment, (β_1, β_2) is (0.9, 0.99), ε is 1e-8, and α is 1e-3. For the loss function, we use the mean square error (MSE) with L2 regularization. The formula is shown below:
$$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum \lVert w \rVert_2^2$$
where L is the global loss and N is the number of samples, which equals batch_size in the neural network. y is the label, which is the subjective assessment score of the video. ŷ is the predicted value, which is the objective assessment score of the video. λ is the regularization coefficient, which is set to 0.01 in this paper. w denotes the parameters of all layers that the network needs to update. Since the network is not deep, no means of preventing over-fitting other than L2 regularization are needed to achieve good experimental results.
In the network training phase, we use 80% of the video dataset as the training set and 20% as the test set, and the split is made by random sampling. To ensure the validity of the data, we randomly re-divide the training set and test set each time we repeat the experiment, and take the median as the final result after repeating the experiment 50 times. When reading the training data, we randomly sample three eligible frames in the video as an input video block. The specific requirements are explained in the preprocessing section. Finally, we average the objective scores of 24 video blocks as the overall score of the panoramic video.
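The loss and the optimizer update described in this paragraph can be sketched for a single scalar parameter. The hyperparameter values are the ones stated in the text; the exact form of the regularization term (for example, whether it carries a factor 1/2) is not recoverable from the text, so the version below is an assumption, as are the function names.

```python
import math

def mse_l2_loss(labels, preds, weights, reg=0.01):
    """Mean squared error plus an L2 penalty on the weights (assumed form)."""
    mse = sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)
    return mse + reg * sum(w * w for w in weights)

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam update for a scalar parameter; `state` holds t, v and s."""
    state["t"] += 1
    # Exponential moving averages of the gradient and the squared gradient.
    state["v"] = beta1 * state["v"] + (1.0 - beta1) * grad
    state["s"] = beta2 * state["s"] + (1.0 - beta2) * grad * grad
    # Bias correction for the zero initialization of v and s.
    v_hat = state["v"] / (1.0 - beta1 ** state["t"])
    s_hat = state["s"] / (1.0 - beta2 ** state["t"])
    return param - lr * v_hat / (math.sqrt(s_hat) + eps)

state = {"t": 0, "v": 0.0, "s": 0.0}
updated = adam_step(1.0, 2.0, state)
loss = mse_l2_loss([3.0, 4.0], [3.0, 2.0], [1.0, 2.0])
```

On the first step the bias-corrected moments equal the gradient and its square, so the parameter moves by almost exactly the learning rate regardless of the gradient magnitude.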
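The evaluation protocol described above (a random 80/20 split, block scores averaged per video, and the median over repeated runs) can be sketched as follows. The function names and the use of rounding to size the training set are assumptions of this example.

```python
import random
import statistics

def split_dataset(video_ids, train_fraction=0.8, rng=None):
    """Randomly split video ids into a training list and a test list."""
    rng = rng or random.Random()
    ids = list(video_ids)
    rng.shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

def video_score(block_scores):
    """Overall score of a video: the mean of its block scores."""
    return sum(block_scores) / len(block_scores)

def final_result(per_run_metric):
    """Median of a metric (e.g. PLCC) over repeated random splits."""
    return statistics.median(per_run_metric)

train, test = split_dataset(range(377), rng=random.Random(0))
small_train, small_test = split_dataset(range(48), rng=random.Random(0))
```

With 377 and 48 videos this split yields 302/75 and 38/10 videos, respectively, which matches the counts reported in the experimental setup.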

IV. EXPERIMENTS

A. Datasets
Experiments are performed on the VRQ-TJU database [20] (see footnote 1) and the VR-VQA48 database [19] to verify the effectiveness of the proposed method.
The VRQ-TJU database contains a total of 377 stereoscopic panoramic videos, including 13 original video sources. These videos are distorted by H.264 and JPEG2000. Each distortion type is divided into five levels, and each distortion type has 182 videos. Besides, the database contains symmetric distortion and asymmetric distortion, 104 of which are symmetric distortions and 260 are asymmetric distortions. Mean opinion score (MOS) is in the range [1,5], the higher the score, the better the video quality.
The VR-VQA48 database contains a total of 48 panoramic videos, including 12 original video sources. These videos are distorted by H.265 distortion, and the degree of distortion is divided into three levels. The MOS value is in the range of [0, 100], and the higher the score, the better the video quality. In the VR-VQA48 database, the MOS value is often between 30 and 60.
To display the videos with different MOS values in the two databases more intuitively, we show some video frames in Fig. 7.
Footnote 1: https://pan.baidu.com/s/1QDEnDARDBXDTHRcdWyPkhA

B. Experimental Setups
For the VRQ-TJU dataset, 302 videos are used for training and 75 videos are used for testing. For the VR-VQA48 dataset, 38 videos are used for training and 10 videos are used for testing. It should be emphasized that the number of videos here is not the amount of data actually fed to the network, because 24 video blocks are extracted from each video. The combined performance on the two databases verifies that the method can evaluate the quality of both panoramic video and stereoscopic panoramic video. Since neither the data types nor the distortion types of the two databases overlap, cross-database experiments between them are not possible. Other panoramic video databases are not open source, so this paper does not involve cross-database experiments.
In the evaluation, the Pearson linear correlation coefficient (PLCC) and the Spearman rank order correlation coefficient (SROCC) are used to measure prediction accuracy. The closer the two values are to 1, the closer the predicted objective score is to the subjective score.
The proposed method is first tested on the two databases and then compared with other classical methods. To verify the superiority of the spherical CNN module over the ordinary CNN and the effectiveness of the non-local module, relevant comparative experiments are also designed. To ensure the reliability of the experimental method and the validity of the experimental data, we repeat each experiment 50 times. In similar experimental procedures, the final result is usually the average or the median of the 50 outcomes. However, the average is easily affected by outliers among the 50 values: a single large deviation can shift the mean considerably. This paper therefore uses the median as the final result.
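The two correlation measures can be computed as in the following minimal sketch, written in plain Python for illustration. SROCC is obtained as the Pearson correlation of the ranks, with tied values sharing their average rank; the function names are choices made for this example.

```python
import math

def pearson(a, b):
    """Pearson linear correlation coefficient (PLCC) of two sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(a, b):
    """Spearman rank order correlation coefficient (SROCC)."""
    return pearson(ranks(a), ranks(b))

mos = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]   # monotonic but not linear in MOS
plcc = pearson(mos, predicted)
srocc = spearman(mos, predicted)
```

In this toy case SROCC is 1 up to rounding because the ordering is preserved, while PLCC is slightly below 1 because the relationship is not linear.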

C. Performance Evaluation
This section compares the proposed method with other classical VRVQA methods in two databases to prove the effectiveness of the proposed method. Our experimental environment is based on Intel(R) Xeon(R) CPU E5-2620 v4 and NVIDIA GTX TI-TAN Xp GPU. The method we proposed is based on the PyTorch deep learning framework.
First, we validate the proposed method in the VRQ-TJU database and the VR-VQA48 database. In the VRQ-TJU database, the PLCC value and SROCC value of the proposed method are 0.939 and 0.924 respectively. In the VR-VQA48 database, the PLCC value and SROCC value of the proposed method are 0.891 and 0.877 respectively. Fig. 8 shows the variation of PLCC with the increase of epoch in the test set of , WS-PSNR, L-PSNR, S-PSNR, and CPP-PSNR to perform experiments in two databases. PSNR and SSIM are used as the most classic algorithms in IQA and VQA to compare with other VRVQA methods. The remaining comparison algorithms are commonly used in the field of VRVQA. The results of the experiment are shown in Table I. Among them, L-PSNR, S-PSNR and CPP-PSNR are implemented in C++, and other comparison methods are implemented in MATLAB. In Table I, we show the most advanced indicators in bold type. As can be seen from the table, our method achieves good results in both databases. In the VRQ-TJU database, our PLCC and SROCC lead 0.076 and 0.066 respectively. In the VR-VQA48 database, our PLCC and SROCC lead 0.184 and 0.24 respectively. Experiments show that our method can achieve good results in stereo panoramic video quality assessment and panoramic video quality assessment. Combining the performance of Fig. 8 and Table I, it can be found that the convergence index during training is basically consistent with the test set, indicating that the model has not been overfitted. Through experiments, the average time to train a video block in this method is 25.17 seconds, and the average time to test a video block is 0.072 seconds. Our model get the best performance with the right amount of complexity. The detailed comparison is shown in Table II. The statistical significance of the predictions is determined by comparing the SROCC values of each VRVQA method. 
We assume that the predicted scores follow a normal distribution, and the F-test is used to determine whether the proposed method is statistically superior to the other methods. At a significance level of 0.05, we calculate the result for each method using the 50 SROCC values. A value of "1" indicates that the algorithm (row) is better than the algorithm (column), a value of "0" indicates statistical equivalence between row and column, and a value of "−1" indicates that the algorithm (row) is not as good as the algorithm (column). The results of the F-test are shown in Fig. 9.
Fig. 9. Results of statistical significance comparison between SROCC values from the algorithms. "1" indicates that the algorithm (row) is better than the algorithm (column), "−1" that it is worse, and "0" that the two perform similarly.
Overall, the results for our method are all "1", indicating that our model is statistically superior to the other models.
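The significance decision described above can be illustrated with a minimal sketch of a two-sided variance-ratio F-test on two sets of SROCC values. This is not the authors' implementation: the function names `f_cdf` and `significantly_different` are hypothetical, the F-distribution CDF is obtained by simple numerical integration to keep the example dependency-free, and the sketch only decides whether two methods differ at the chosen significance level (the "0" versus non-"0" outcome), because the rule that assigns "1" or "−1" is not spelled out in this section.

```python
import math
from statistics import variance


def f_cdf(x, d1, d2, steps=2000):
    """CDF of the F distribution with (d1, d2) degrees of freedom.

    Uses the identity F_cdf(x) = I_u(d1/2, d2/2) with u = d1*x / (d1*x + d2)
    and integrates the beta density with Simpson's rule.
    Assumes d1 > 2 and d2 > 2 so the integrand vanishes at both endpoints.
    """
    if x <= 0:
        return 0.0
    a, b = d1 / 2, d2 / 2
    u = d1 * x / (d1 * x + d2)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def density(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp((a - 1) * math.log(t) + (b - 1) * math.log1p(-t) - log_beta)

    h = u / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(u)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return min(1.0, total * h / 3)


def significantly_different(srocc_a, srocc_b, alpha=0.05):
    """Two-sided F-test on the sample variances of two SROCC lists."""
    var_a, var_b = variance(srocc_a), variance(srocc_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("the F-test needs non-zero sample variances")
    # Put the larger variance in the numerator so that the statistic is >= 1.
    if var_a >= var_b:
        f_stat, d1, d2 = var_a / var_b, len(srocc_a) - 1, len(srocc_b) - 1
    else:
        f_stat, d1, d2 = var_b / var_a, len(srocc_b) - 1, len(srocc_a) - 1
    p_value = min(1.0, 2 * (1 - f_cdf(f_stat, d1, d2)))
    return p_value < alpha
```

With 50 SROCC values per method, both degrees of freedom are 49, and a `False` result corresponds to the "0" (statistically equivalent) entry in Fig. 9.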

D. Module Comparison Evaluation
In this part, we verify the contribution of the spherical CNN and the non-local modules to the proposed method, and confirm the superiority of the method in spatial-domain and time-domain assessment. To make the results more reliable, we conduct four experiments on the two databases, using spherical CNN + non-local modules, spherical CNN, ordinary CNN + non-local modules, and ordinary CNN. For convenience, we write ordinary CNN as CNN, spherical CNN as S2CNN, and the non-local module as Nol. To avoid introducing additional variables, we simply replace the corresponding structure without changing the hyperparameters or the overall network settings. To compare spherical CNN with ordinary CNN, we replace only the corresponding layers. To compare the effect of the non-local modules, we only add them to or omit them from the network. The experimental results are shown in Table III, with the best results in bold type. To show the contribution of the different modules more intuitively, we plot the data from the table in Fig. 10.
The experiments show that both the spherical CNN and the non-local neural networks contribute significantly to the proposed method; in particular, the spherical CNN significantly improves PLCC and SROCC. To demonstrate the correlation between the objective and subjective scores obtained by the different methods, we show some scatter plots in Fig. 11.
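For reference, the two correlation indices used throughout these experiments can be computed as in the following minimal sketch. PLCC is the Pearson linear correlation between predicted and subjective scores, and SROCC is the Pearson correlation of their ranks. The function names are illustrative, and the sketch applies no nonlinear mapping to the predictions before PLCC, since this section does not describe one.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SROCC)."""
    return pearson(average_ranks(x), average_ranks(y))
```

PLCC measures how linearly the predictions track the subjective scores, while SROCC only measures whether their ordering agrees, so a monotonic but nonlinear predictor can reach an SROCC of 1 with a PLCC below 1.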

E. Distortion Type Evaluation
VRQ-TJU contains a relatively large number of distortion types. To better evaluate the performance of our method on different distortion types, we divide the two databases into five parts according to distortion type: a symmetric distortion database, an asymmetric distortion database, an H.264 distortion database, a JPEG2000 distortion database, and an H.265 distortion database (i.e., VR-VQA48). Experiments are carried out on these five databases. The first four databases use the model trained on VRQ-TJU, and the last uses the model trained on VR-VQA48. The experimental data are shown in Table IV, with the best results in bold type.
The experiments show that the proposed method delivers good performance across the different distortion types, and that the symmetric distortion database yields better results than the other databases. To show the relationship between subjective and objective scores more intuitively, we present scatter plots of the two scores for the different distortion databases in Fig. 12.
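The per-distortion evaluation above amounts to grouping the test videos by distortion type and computing a correlation index within each group. The following is a minimal sketch of that grouping step only, assuming each result is available as a hypothetical `(distortion_type, subjective_score, predicted_score)` tuple; it does not cover which trained model produced the predictions.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def plcc_by_distortion(records):
    """Map each distortion type to the PLCC computed on its own subset.

    records: iterable of (distortion_type, subjective_score, predicted_score).
    The tuple layout is an assumption made for this illustration.
    """
    groups = defaultdict(lambda: ([], []))
    for distortion, subjective, predicted in records:
        groups[distortion][0].append(subjective)
        groups[distortion][1].append(predicted)
    return {name: pearson(subj, pred) for name, (subj, pred) in groups.items()}
```

Each subset needs at least two videos with differing scores, otherwise the correlation is undefined.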

1) Optimization of codec
Panoramic video is characterized by high resolution and a large amount of data, which poses considerable challenges for video coding and decoding [50]. Measuring the quality loss introduced by a codec helps optimize rate distortion and related tasks. Therefore, when coding and decoding panoramic video in the ERP format, the method proposed in this paper can be used to observe the loss of video quality and thus provide guidance for codec design.

2) Quality enhancement
Researchers have begun to study how to enhance the quality of virtual views to give viewers a better visual experience. For example, Rahaman et al. [51], [52] used Gaussian mixture modeling (GMM) to significantly enhance the quality of virtual views. However, in the final evaluation stage, such work often does not use advanced quality evaluation methods to verify the effectiveness of the proposed methods. The method proposed in this paper can assist verification in the field of quality enhancement and thereby help improve the accuracy of quality enhancement methods.

3) Standardization of hardware
At present, the virtual reality display helmets on the market vary in quality, and non-standard hardware devices greatly affect the user experience. The quality assessment method proposed in this paper can be used to quantify the performance of hardware during video viewing and thus help unify hardware manufacturing standards.

V. CONCLUSION
In this paper, we propose a deep learning based method that evaluates the quality of panoramic video and stereoscopic panoramic video end-to-end. The paper approaches the problem from two aspects, spatial domain assessment and global time domain assessment, and studies the characteristics of panoramic video. Extensive experiments have been carried out to verify the effectiveness of the proposed method. The results show that the spherical CNN is more suitable than an ordinary CNN for extracting panoramic video features, and that the non-local neural network module can effectively extract global time domain information.
Although the proposed method achieves good results, the non-local neural network module consumes a large amount of computing resources and storage space. Because of the large amount of data in panoramic video and the complexity of the network calculations, it is difficult to use some techniques that require parameter calculation, such as the GN layer [53]. In future work, we hope to design a more elegant time domain assessment strategy that minimizes complex parameter operations while maintaining network performance. Cross-database experiments will also be added to verify the generalization ability of the algorithm.
Qinggang Meng (Senior Member, IEEE) received the B.S. and M.S. degrees in electronic engineering from Tianjin University, Tianjin, China, and the Ph.D. degree in intelligent robotics from the Department of Computer Science at Aberystwyth University, Aberystwyth, U.K. He is currently a Professor with the Department of Computer Science, Loughborough University, Loughborough, U.K. His current research interests include biologically inspired learning algorithms and developmental robotics, service robotics, robot learning and adaptation, multi-UAV cooperation, human motion analysis and activity recognition, activity pattern detection, pattern recognition, artificial intelligence, and computer vision. Prof. Meng is on the editorial boards of several journals including IEEE TRANSACTIONS ON CYBERNETICS.