No-Reference Quality Assessment of Stereoscopic Videos With Inter-Frame Cross on a Content-Rich Database

With the wide application of stereoscopic video technology, the quality of stereoscopic video has attracted increasing attention. Objective stereoscopic video quality assessment (SVQA) is highly challenging but essential, particularly no-reference (NR) SVQA, where no reference information is available and large numbers of samples are required for the training and testing sets. However, to the best of our knowledge, the established stereoscopic video databases contain only a few samples, which makes them unsuitable for NR quality assessment and seriously hampers the development of NR-SVQA methods. To address these difficulties, we carry out a comprehensive subjective evaluation of stereoscopic video quality on our newly established TJU-SVQA databases, which contain varied content, mixed-resolution coding, and symmetrically/asymmetrically distorted stereoscopic videos. Furthermore, we propose a new inter-frame cross map to predict objective quality scores. We compare and analyze the performance of several state-of-the-art 2D and 3D quality evaluation methods on our new databases. The experimental results on our databases and on a public database demonstrate that the proposed method can robustly predict the quality of stereoscopic videos.


Thus, the development of subjective and objective visual quality models has become a hot research field [2]. However, assessing 3D video quality poses specific challenges beyond the inevitable distortions introduced during transformation, which already affect 3D images [3]. The main reasons are that video carries a much larger amount of information [4] and more complex content than images, which makes good prediction performance difficult to achieve.
Research on VQA can generally be divided into two types: subjective and objective quality evaluation [5]. In subjective quality evaluation, a sufficient number of subjects are asked to assess the degree of distortion of an image/video according to their viewing experience; averaging these opinions then produces a mean opinion score (MOS), which represents the real viewing experience of the tested image/video [6]. In fact, such subjective assessment is the best indicator of the perceived quality of an image/video. At present, several freely available 3D video quality assessment (3D-VQA) databases exist [7]-[9], but they contain few videos with limited content diversity. Especially with the arrival of the big data era and the wide deployment of 3D video display technologies, it is urgent to establish an SVQA database with a large number of samples and rich content.
Similar to image quality assessment (IQA), objective SVQA methods are generally classified into three types depending on their requirements for reference videos. Full-reference (FR) metrics require the pristine content for comparison with the test content; reduced-reference (RR) metrics require partial information about the pristine content; and NR metrics require only the processed/distorted content. Classical objective quality schemes such as peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [10] are FR methods, but their results are often inconsistent with subjective evaluation. To overcome this difficulty, new objective algorithms that account for stereo vision have been developed. Hewage et al. [11] introduced a 3D video quality algorithm that assesses stereo pairs and analyzes the edges and contours of the depth map. Given the characteristics of stereoscopic video, quality should be evaluated comprehensively across multiple domains; experimental results demonstrate that such an NR-SVQA metric can be highly consistent with subjective quality evaluation [12].
SVQA is particularly challenging because it is more complex than 2D-VQA. Owing to the pupillary distance and the slightly different visual angles of the two views, the scenes projected on the two retinas differ horizontally, which enables the human eyes to perceive depth. On the one hand, the additional dimension of 3D information (depth or disparity) raises a number of important issues. On the other hand, the motion in videos also leads to distortion in the temporal domain. In this paper, a content-rich SVQA database is constructed and an NR-SVQA metric with inter-frame cross information is put forward. The main contributions of this paper are as follows: 1) A content-rich SVQA database is established in two parts, TJU-SVQA database Phase I and Phase II, containing symmetrically and asymmetrically compressed videos created with varied content and mixed resolutions, respectively. We analyze the subjective experiment results to demonstrate the stability and rationality of the database.
2) We present a binocular-characteristic-inspired model, which leads to a markedly improved NR-SVQA model. This method uses a new inter-frame cross to effectively reflect the relationship between the degree of video distortion and the MOS, and it increases efficiency.
3) According to the properties of the extracted features, we adopt different methods for different views to obtain local quality scores. We first combine dictionary learning with support vector regression (SVR), which is utilized in the spatial, spatio-temporal and temporal domains. The experimental results show that our framework performs better than using dictionary learning or SVR alone.
The rest of the paper is organized as follows. The existing models of 2D/3D-VQA are reviewed in Section II. Section III describes the details of the TJU-SVQA database, including scene choice and descriptions, the subjective study design, and the analysis of the subjective evaluation results. Then, in Section IV, an objective SVQA model considering binocular characteristics is proposed. Section V reports the experimental results of the proposed metric on our TJU-SVQA database and on a public database; the analysis demonstrates the usefulness of the TJU-SVQA database and the effectiveness of the proposed algorithm. Finally, in Section VI, conclusions and ideas for optimizing the algorithm are presented.

II. RELATED WORK
Existing video quality evaluation mainly includes 2D-VQA and 3D-VQA, depending on the information source. Furthermore, according to the design concept of the quality assessment algorithm, 3D-VQA can be divided into two categories: methods directly extended from 2D, and 3D-VQA based on the human visual system (HVS).

A. 2D Video Quality Assessment
In recent years, 2D quality evaluation has been extensively explored and studied [13]. Objective VQA methods are used to automatically predict the perceptual quality of a video sequence [14], striving to be consistent with the subjective perception of the human eye. Some eminent FR-VQA methods include MOVIE [15], VQA based on SSIM [16], VADM [17], and optical-flow-based VQA [18]. Moreover, to imitate the processing of video signals by the HVS, the energy characteristics of video sequences have been described by statistical features from the 3D discrete cosine transform [19]. Compared with RR-VQA, NR-VQA is much more challenging but has greater potential for wide application. Saad et al. [20] proposed an NR-VQA method (Video BLIINDS) that constructs a spatio-temporal model in the DCT domain and exploits its statistical regularities. Mittal et al. [21] developed a completely blind VQA method, VIIDEO (video intrinsic integrity and distortion evaluation oracle), which requires no extra information about the distorted video, relying only on the intrinsic statistical regularities of natural videos. Considering that different features have different sensitivities to high-quality and low-quality videos, noticeable-distortion and blurring-artifact features were combined to guarantee accurate prediction [22]. In [23], the intensity and texture of the decoded sequence were combined as complementary information to measure the stall duration and stall count of distorted streaming video. Li et al. [24] extracted natural scene statistics of video blocks with the 3D shearlet transform and amplified the disparity between primary features to effectively perceive quality with a convolutional neural network.

B. 2D-to-3D Quality Assessment
The history of 3D-VQA is relatively short compared with that of 2D-VQA. First of all, few publicly available stereoscopic video databases existed in the past, which greatly hindered the study of subjective evaluation algorithms. Recently, a variety of 3D video datasets have emerged for different research purposes and requirements [25]. A brief overview of most of the SVQA databases can be found in [7]. However, those existing databases have a small number of samples, together with limited video content and resolutions.
According to their design principles, the existing SVQA schemes can be classified into two categories. An earlier type of SVQA is extended from 2D image/video metrics by averaging or weighting the quality scores of the stereo pair. In [25], Chen et al. predicted the content quality of stereo-pair sequences with a weighted average of PSNR and multiscale SSIM (MS-SSIM). In [26] and [27], IQA approaches (PSNR, SSIM) and the 2D-VQA approach VQM [28] were applied to the two views of the stereo video separately, and a weighted average was then taken to obtain the stereoscopic perceived quality. Experimental results demonstrated that VQM was more accurate than PSNR and SSIM. Because binocular rivalry is pronounced under asymmetric distortion, a new dynamic weighting scheme was proposed to predict systematic bias and significantly improve the performance of SVQA [7].

C. 3D-VQA Based on HVS
The other category of 3D-VQA metrics estimates the impact of compression distortion on visual perception from the perspective of stereo vision. In [29], a binocular energy framework with suppression and recurrent excitation was presented, in which a multivariate regression method maps the quality perception features to the MOS. To improve performance, Yu et al. [30] introduced a frame-extraction algorithm that judges motion intensity to reduce the volume of data processed. In [31], both views were represented by a joint descriptor built from a spatial-temporal structure metric, and the inter-view correlation between the 3D structure tensors of the two views was considered. In [32], regions of significance based on human perception were extracted; in that framework, region of interest and just noticeable distortion (JND) were used to measure the quality of the two views, and depth information was estimated in the 3D wavelet domain. In [33], Jin et al. presented an FR-SVQA method based on the 3D-DCT transform that stacks the matched blocks of the two views in the stereo video sequence into a 3D stack and then analyzes visual similarity through the 3D-DCT. Inspired by the Bayesian model, Shao et al. [34] decomposed the synthesized view into three terms and quantified the impact of color and depth distortions and their interactions to predict the quality of synthesized videos. Recently, an NR-SVQA method, the depth perception quality metric (DPQM), was developed by Chen et al. [8]. Natural scene statistics were employed on the fusion map of binocular summation, and autoregressive-prediction-based disparity entropy (ARDE) was applied to the suppression map of binocular difference, indicating the key factors affecting texture and parallax.

III. SUBJECTIVE STEREO VIDEO QUALITY ASSESSMENT DATABASE CREATION

We name our new stereoscopic video database the Tianjin University Stereoscopic Video Quality Assessment database (TJU-SVQA database). We first describe the choice of source stereo videos and then the subjective test conducted on the built database. Finally, we discuss the subjective experimental results.

A. Source Video Selection
To build the new stereoscopic video database, 20 pristine stereoscopic video sequences were collected from pre-existing subjective 3D video quality studies [9], [35], [36]: Barrier gate, Basket, Boxers, Hall, Lab, News report, Phone call, Soccer, Tree branches, Umbrella, Street, Champagne, Pantomime, Dog, Sofa, Classroom, Bike, Car, Balloons, and Kendo. All the source video sequences have a frame rate of 25 fps and are stored in uncompressed AVI format. All 20 source videos are used in both Phase I and Phase II. Sample frames for each source video are given in Fig. 1, and the details of all the source videos are exhibited in Table I.
The selection of the original videos plays an essential part in stereo video quality evaluation. The original videos should cover as many real-life scenes as possible. For the universality and completeness of our video database, the source videos were selected mainly according to video resolution, scene content, motion intensity and background complexity; the specific information is shown in Table I. In terms of shooting distance, 12 videos were taken at close range (less than 5 meters), 5 at medium range (between 5 and 10 meters) and 3 at long range (more than 10 meters). Our database therefore contains rich video content, which improves its versatility, and an objective quality evaluation model based on it generalizes better to real life. In addition, existing stereo video databases have small sample sizes, with most containing fewer than 10 original videos and about 100 distorted videos [9], [34], [37]. However, most NR-SVQA algorithms need a large number of samples to train their prediction models so as to accurately obtain test video quality scores. The TJU-SVQA database has a larger sample capacity and is thus more suitable for NR-SVQA methods.

B. Distortion Simulation
The source sequences (SRC) from TJU-SVQA were impaired by two compression degradations; H.264 distortions were introduced with the H.264/AVC video coder [38]. As listed in Table II, Phase I comprises eight symmetrically distorted conditions (SYD_00-SYD_07), including the unprocessed reference SYD_00, and each condition has 20 stereo pairs; therefore, there are 160 stereo videos in total in Phase I. As listed in Table III, Phase II comprises thirty-three asymmetrically distorted conditions (ASD_00-ASD_32), including the unprocessed reference ASD_00; therefore, there are 660 stereo videos in total in Phase II.

C. Subjective Test
Our subjective experiments were conducted in accordance with ITU-R BT.500 in the Lab for Image and Video Processing at Tianjin University. The evaluation was performed on an HD 3D TV with passive polarized glasses; a CHANGHONG 3D55C2000i monitor driven by an NVIDIA GTX 1080 graphics card was used as the test equipment. The viewing distance was set to six times the screen height, about 4.7 m. The details of the viewing conditions are given in Table IV. In Phase I, nineteen subjects aged between 18 and 38 participated in our subjective study; in Phase II, twenty-nine subjects aged between 18 and 40 participated. Before the subjective evaluation, a visual acuity test and a 3D vision test were performed to examine the subjects' capacity to view stereo content.
In the subjective experiment, in order to distinguish the quality of videos with different content scenes, we adopted the single-stimulus procedure with a 5-grade numerical categorical scale (SSNCS). A general introduction was given at the beginning of the test, covering the purpose, process and scoring criteria of the experiment. The participants were required to evaluate the video quality according to their overall 3D viewing experience. In the training phase for both Phase I and Phase II, we first selected three categories of videos from the stereoscopic video database (original videos, moderately damaged videos and severely damaged videos) as training sets, then played these videos and explained the evaluation criteria to the observers. The observers were asked to give high scores (close to 5) to the original stereo video sequences, intermediate scores to the moderately distorted sequences, and low scores to the severely distorted 3D videos. Subsequently, several videos were played at random to test each observer, until it was ensured that the observers fully understood the scoring criteria and had established their own scoring strategies.
In the formal subjective test phase, to limit the duration of each subjective evaluation session and keep it comfortable, we divided the video databases into several sessions that together cover all the stimuli. In Phase I, the 160 video sequences were divided into 4 sets, each containing 5 pristine and 35 corresponding degraded sequences (40 sequences per test). In Phase II, the 660 video sequences were divided into 10 sets, each containing 2 pristine and 64 corresponding degraded sequences (66 sequences per test). The order of stimuli was randomized during playback. When each video ended, the participants were requested to report a score between 1 and 5, with higher scores corresponding to higher quality; the instructor recorded the quality scores on another computer. To prevent visual fatigue and discomfort, each subject's test time was limited to 2 hours per day. After all the subjective results were collected, individual misjudgments were excluded (outside the 95% confidence interval) and the MOS values for each video were calculated.
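The screening-and-averaging step above can be sketched as follows. This is a simplified stand-in, not the paper's exact BT.500 procedure: here a rating is kept only if it lies within roughly the 95% interval (mean ± 1.96 standard deviations) of that video's score distribution, and the MOS is the mean of the retained ratings.

```python
import numpy as np

def mos_with_screening(scores, z=1.96):
    """MOS per video after discarding ratings outside mean +/- z * std
    of that video's score distribution (simplified outlier screening).

    scores: (n_subjects, n_videos) array of 1-5 ratings.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    std = scores.std(axis=0, ddof=1)
    keep = np.abs(scores - mean) <= z * std   # True where a rating is retained
    screened = np.where(keep, scores, np.nan)  # drop rejected ratings
    return np.nanmean(screened, axis=0)        # MOS over retained ratings
```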
In addition, to demonstrate the consistency between individual scores and the MOS, two indicators generally used to evaluate the agreement between subjective and objective values are employed: the Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SROCC). The higher the values of PLCC and SROCC, the better the performance; in particular, values approaching 1 indicate higher accuracy and stronger monotonicity. PLCC and SROCC were computed for each subject, and their average values are shown in Table V, where the high averages indicate good consistency between individual scores and the MOS.
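For reference, the two indicators can be computed as in the following minimal NumPy sketch; SROCC is implemented here as Pearson correlation on ranks, which matches the standard definition only when there are no tied values.

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def srocc(x, y):
    """Spearman rank-order correlation (tie-free case): PLCC on ranks."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return plcc(rank(x), rank(y))
```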

D. Results and Analysis
One indicator of a rational subjective evaluation study is a uniform distribution of MOS values. A histogram of MOS values is given in Fig. 2. It can be observed that the MOS values in both Phase I and Phase II are evenly spread over the range of 1 to 5, which indicates that the distorted videos in the TJU-SVQA database cover a wide range of visual perceptual quality and can be considered representative. Furthermore, original videos with different content scenes usually bring different viewing experiences to the audience, and the same distortion type and degree usually cause different amounts of damage to original videos with different content scenes. Therefore, we display the MOS distribution of the original videos with different content scenes in Fig. 3, and the MOS distribution of the same distortion applied to original videos with different content in Fig. 4. Fig. 3 shows that the quality scores of most original video sequences are within the range of 4 to 5, except for Dog and Lab, whose scores are close to 4 (3.95 and 3.90, respectively). Accordingly, the quality of the TJU-SVQA original sequences can be considered adequate for large-scale assessment activities.

IV. THE PROPOSED 3D-VQA ALGORITHM

According to human visual perception, we propose a comprehensive SVQA method based on the joint contribution of multiple domains of information, and a new inter-frame cross capturing spatio-temporal information is presented. The framework of the proposed algorithm is displayed in Fig. 5. In particular, the binocular summation/difference and inter-frame cross maps are produced first, and then the quality perception features are extracted from these sequences.
Since optical flow is sensitive to distortion, optical flow features are extracted in the temporal domain to measure the degree of distortion. Finally, the quality scores of each part are obtained through dictionary learning and SVR, and they are pooled into the global objective quality score.

A. Spatial Domain Analysis
In terms of HVS peculiarities, binocular vision can operate in several different 'modes' [40]. According to Li and Atick's theory [41] of efficient stereo coding, binocular signals are encoded into decorrelated binocular summation and difference signals, and gain control is performed on the summation and difference channels to optimize their sensitivities. The summation map of a distorted stereo video contains both additive impairments and detail losses. Detail losses refer to the damage of valuable information due to distortion, while additive impairments mean redundant information that does not appear in the pristine signal [42], [43]. Moreover, owing to the distance between the human eyes, the two views have horizontal parallax and depth perception is formed. As an alternative to an absolute disparity map, the difference map captures the distinction between the two views and can reflect the depth information of stereo vision [8]. Given the left view L and right view R, the binocular summation and difference maps are computed frame by frame as

S_i = L_i + R_i,  D_i = L_i - R_i,

where i is the frame index of the stereo video sequence. The purpose of feature extraction is to obtain quality perception features that effectively indicate the degree of distortion of the stereoscopic video. The spatial features are first obtained from the summation/difference maps. The human eye is selective in orientation and frequency when perceiving stimuli, and therefore the scale and orientation sensitivities of the receptive fields can be simulated by the statistics of multi-scale, multi-orientation filter responses [44].
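The frame-by-frame computation of the two maps can be sketched directly; this is a minimal illustration in which `left_frames`/`right_frames` are assumed to be sequences of equally sized grayscale frames.

```python
import numpy as np

def binocular_sum_diff(left_frames, right_frames):
    """Per-frame binocular summation and difference maps of a stereo video.

    left_frames, right_frames: iterables of equally sized 2-D arrays.
    Returns (summation_maps, difference_maps), one map per frame.
    """
    S, D = [], []
    for L, R in zip(left_frames, right_frames):
        L = np.asarray(L, dtype=float)
        R = np.asarray(R, dtype=float)
        S.append(L + R)  # summation: carries additive impairments + detail losses
        D.append(L - R)  # difference: stands in for an absolute disparity map
    return S, D
```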
Since Log-Gabor filters avoid direct-current interference and bandwidth limitations, we utilize them for feature extraction. The kernel of the Log-Gabor filters used in this study can be generated by

G_{j,l}(ω, θ) = exp(−(log(ω/ω_0))² / (2(log σ_γ)²)) · exp(−(θ − θ_l)² / (2σ_θ²)),

where ω_0 denotes the center frequency, θ_l = lπ/L, l = {0, 1, ..., L − 1} is the orientation angle, and σ_θ and σ_γ control the angular and radial bandwidths of the filter, respectively. In this paper, the Log-Gabor filter parameters are set as follows: σ_θ = 0.40, σ_γ = 0.65 and ω_0 = 1/(min_wavelength · mult^(j−1)), where min_wavelength = 3 and mult = 1.7 (the scaling factor between successive filters).
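A frequency-domain kernel with these parameters can be generated as below. This is a sketch using the common construction in which σ_γ acts as a ratio of the center frequency on a log axis; the paper's exact bandwidth convention and grid layout are assumptions.

```python
import numpy as np

def log_gabor_kernel(size, omega0, theta_l, sigma_gamma=0.65, sigma_theta=0.40):
    """Frequency-domain log-Gabor kernel on a size x size grid.

    size should be odd so the DC sample sits exactly at the centre.
    """
    coords = np.linspace(-0.5, 0.5, size)          # normalised frequencies
    y, x = np.meshgrid(coords, coords, indexing="ij")
    radius = np.hypot(x, y)
    radius[size // 2, size // 2] = 1.0             # avoid log(0) at DC
    # radial component, peaked at omega0 on a log-frequency axis
    radial = np.exp(-np.log(radius / omega0) ** 2 / (2 * np.log(sigma_gamma) ** 2))
    # angular component, peaked at orientation theta_l (wrapped difference)
    theta = np.arctan2(-y, x)
    dtheta = np.arctan2(np.sin(theta - theta_l), np.cos(theta - theta_l))
    angular = np.exp(-dtheta ** 2 / (2 * sigma_theta ** 2))
    G = radial * angular
    G[size // 2, size // 2] = 0.0                  # zero DC response
    return G
```

Multiplying this kernel with an image's FFT and inverting gives the complex responses whose real and imaginary parts are used below.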
The local magnitude as spatial information at a location on scale j with orientation l is given by

A_{j,l}(k_1, k_2) = sqrt(R_{j,l}(k_1, k_2)² + Im_{j,l}(k_1, k_2)²),

where (k_1, k_2) are coordinates in the spatial domain, j = {1, ..., J} and l = {1, ..., L} denote the scale and orientation, set to J = 4 and L = 6, respectively, and R_{j,l}(k_1, k_2) and Im_{j,l}(k_1, k_2) are the real and imaginary parts of the Log-Gabor response. In Oppenheim and Lim's study [45], a highly intelligible image was synthesized from the Fourier magnitude of one image and the phase of another, and it strongly resembled the image providing the phase. As the human eye is more sensitive to the phase information of an image [46], the phase information of each frame is also extracted by the Log-Gabor filter as spatial information. The local phase at location (k_1, k_2) on scale j with orientation l is given by

φ_{j,l}(k_1, k_2) = arctan(Im_{j,l}(k_1, k_2) / R_{j,l}(k_1, k_2)).

It is well known that the primary visual area (V1) is the first region in the visual cortex to directly receive visual signals from the retina. The local semantic information formed in V1 is strongly correlated with perceived quality, and a decline in visual quality is clearly reflected in changes of local information. Therefore, after computing the magnitude and phase information, an improved rotation-invariant LBP (LBP_{P,R}^{riu2}) [47] is applied to the extracted maps to study local structure descriptors effectively.
The basic patterns of LBP_{P,R}^{riu2} (those whose number of changes between adjacent codes is at most 2) are uniform patterns with very little spatial variation. Because of the high proportion of basic patterns under rotation-invariant binary coding, LBP_{P,R}^{riu2} encodes the nine basic patterns separately and groups all remaining patterns into one code:

LBP_{P,R}^{riu2} = Σ_{p=0}^{P−1} s(z_p − z_c), if U(LBP_{P,R}) ≤ 2; P + 1, otherwise,

where s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise, and the uniformity measure is defined as

U(LBP_{P,R}) = |s(z_{P−1} − z_c) − s(z_0 − z_c)| + Σ_{p=1}^{P−1} |s(z_p − z_c) − s(z_{p−1} − z_c)|.

The superscript riu2 in Eq. 6 indicates rotation-invariant uniform patterns with U of at most 2. z_c represents the central magnitude/phase, and z_p is the magnitude/phase of the p-th neighbor. According to the above equations, there are in total P + 2 patterns in LBP_{P,R}^{riu2}, coded from 0 to P + 1. We selected P = 8 neighborhood points and set the radius of interest to R = 1. After LBP_{P,R}^{riu2} coding, the occurrences of the various patterns are counted as

h(k) = Σ_{n=1}^{N} Σ_{m=1}^{M} δ(LBP_{P,R}^{riu2}(n, m), k), k ∈ [0, K],

where the input of the LBP_{P,R}^{riu2} operator is an N × M matrix, K is the maximum pattern code, and δ(a, b) = 1 if a = b and 0 otherwise. The statistics of the various patterns are regarded as the final magnitude/phase-based features. In addition, the magnitude, variance, and entropy of the stereoscopic video are considered as auxiliary global features for the spatial domain. The spatial features are listed in Table VI.
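The coding and histogram steps can be sketched as follows. This sketch uses nearest-pixel sampling of the circular neighborhood (standard implementations interpolate bilinearly), which is an assumed simplification for P = 8, R = 1.

```python
import numpy as np

def lbp_riu2(img, P=8, R=1):
    """Rotation-invariant uniform LBP (riu2) with nearest-pixel sampling."""
    img = np.asarray(img, dtype=float)
    H, W = img.shape
    angles = 2 * np.pi * np.arange(P) / P
    dy = np.rint(-R * np.sin(angles)).astype(int)  # neighbour offsets
    dx = np.rint(R * np.cos(angles)).astype(int)
    center = img[R:H - R, R:W - R]
    # s_p = s(z_p - z_c) for each neighbour p
    s = np.stack([(img[R + dy[p]:H - R + dy[p], R + dx[p]:W - R + dx[p]] >= center)
                  .astype(int) for p in range(P)])
    # uniformity U: number of 0/1 transitions around the circle
    U = np.abs(s[0] - s[-1]) + np.sum(np.abs(s[1:] - s[:-1]), axis=0)
    return np.where(U <= 2, s.sum(axis=0), P + 1)  # codes 0..P, else P+1

def lbp_histogram(code, P=8):
    """Occurrence counts of the P + 2 riu2 patterns."""
    return np.bincount(code.ravel(), minlength=P + 2)
```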

B. Spatio-Temporal Domain Analysis
Stereoscopic perception is based on binocular disparities. One scene, seen from slightly different perspectives, is sent to the lateral geniculate nucleus (LGN) through the retinal ganglion cells, and some layers of the primary visual cortex (V1) receive information from the LGN. However, when the frames of the two views at the same retinal location do not match, binocular rivalry occurs, a perceptual effect in which perception alternates between the different images presented to each eye. Binocular rivalry is not limited to V1; it continues into higher and more abstract levels of the visual pathway. When binocular rivalry occurs, the two images are not seen as overlapping but are perceived alternately at intervals [48]. Binocular rivalry can spread around highly distorted, weakly correlated regions [49], and it can make visual processing unstable and unpredictable and impair the ability of observers to direct attention to targets in the visual field [50].
Binocular rivalry and stereopsis are two core phenomena of binocular vision, involving independent, parallel pathways in the early stages of visual processing. Moreover, binocular vision is a simple combination of the outputs of the pathways mediating stereopsis and binocular rivalry [51]. In the previous subsection, the binocular summation and difference maps of the spatial domain were established according to the stereopsis phenomenon. Here, a quality prediction model for the spatio-temporal domain is proposed according to the binocular rivalry phenomenon.
In our research, one underlying assumption is that besides spatial and temporal information, human perception of stereoscopic video quality is also influenced by inter-frame information across time (spatio-temporal information) [15]. The traditional analysis of spatio-temporal information explores the relationship between adjacent frames in the left and right videos separately, without considering possible interactions between the two views. Based on binocular rivalry, a new inter-frame difference is proposed by extracting frames from the left and right sequences of the stereo pair alternately along the time axis, which we call the inter-frame cross. Concretely, the first frame of the left video is combined with the second frame of the right video to extract spatio-temporal information; then the second frame of the right video is combined with the third frame of the left video, and so on, forming the inter-frame cross. Moreover, stereoscopic video contains much more information than 2D video/images, so computational efficiency is a major challenge for SVQA. As shown in Fig. 6, in contrast to the previous inter-frame difference method, the inter-frame cross combines the left and right videos with the same amount of data as a single view, which halves the computation time while capturing the same amount of information. The inter-frame cross (IC) is given by

IC_i = |V_i − V'_{i+n}|,

where V denotes the left (right) view, V' the opposite view, and n is a constant; that is, the i-th frame of the left (right) view is crossed with the (i+n)-th frame of the right (left) view. The local feature distribution of the inter-frame cross is illustrated in Fig. 7. As in the spatial domain, the magnitude, variance, and entropy of the stereoscopic video are considered as auxiliary global information for the spatio-temporal domain. The spatio-temporal features are listed in Table VII.
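The alternating pairing can be sketched as below. The absolute difference and the exact lead/lag pattern are one plausible reading of the description above, not a confirmed specification.

```python
import numpy as np

def inter_frame_cross(left, right, n=1):
    """Inter-frame cross maps: the i-th frame of one view is differenced
    against the (i+n)-th frame of the other view, alternating which view
    leads along the time axis.

    left, right: sequences of equally sized frames (2-D arrays).
    """
    maps = []
    for i in range(len(left) - n):
        if i % 2 == 0:
            a, b = left[i], right[i + n]   # left frame i vs. right frame i+n
        else:
            a, b = right[i], left[i + n]   # right frame i vs. left frame i+n
        maps.append(np.abs(np.asarray(a, float) - np.asarray(b, float)))
    return maps
```

Note that only one map is produced per time step, which is where the halved computation relative to computing inter-frame differences in both views separately comes from.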
Furthermore, when light intended for one eye leaks into the other eye, perceptual crosstalk, a significant factor degrading the perceived quality of stereoscopic images/videos, may become visible. Crosstalk can occur with both spatial multiplexing and temporal multiplexing [52]. Based on the existing research in [53], we speculate that the proposed inter-frame cross model might reflect perceptual crosstalk to some extent; we will investigate this further in future work.

C. Temporal Domain Analysis
Because the structure of stereoscopic video frames changes along the timeline, stereoscopic video contains not only structure and depth information in the spatial domain but also temporal information. Therefore, when 3D-IQA methods that consider only spatial distortion are directly applied to predict stereoscopic video quality, the objective scores are inconsistent with the MOS. Stereoscopic video quality should be represented by quality perception features covering both the spatial and temporal domains. A significant part of SVQA is the analysis of temporal information distortion, and most previous approaches used motion magnitude as the temporal feature to measure temporal distortion directly. To relate the movement of brightness patterns to the movement of the frames, optical flow is applied to acquire the motion information between two frames taken at times t and t + Δt at every pixel position. In our study, the Horn-Schunck algorithm [54] is employed to obtain the motion vectors in the temporal domain; its constraint equation is

I_x u + I_y v + I_t = 0,

where I(x, y, t) represents the intensity at point (x, y) of the frame at time t, I_x, I_y and I_t are its partial derivatives, and (u, v) is the optical flow vector. Optical flow is smooth in undistorted natural video but loses this smoothness when distortion occurs.
In particular, the degree of distortion affects the magnitude and orientation of the optical flow. The statistical regularities of natural stereo videos are correlated with video quality not only in the spatial domain but also in the temporal domain. On the premise that local optical flow statistics are affected by distortions, and that deviations from pristine flow statistics indicate quality degradation, the temporal statistical characteristics based on optical flow [9] are introduced to analyze the degree of temporal distortion. To describe the visual impact of distortion on video, the optical flow is divided into several components ξ_i, as shown in Table VIII. Following the two assumptions in [55], statistical measures estimate the probabilities of characteristic orientation and characteristic speed for the above components [55]. The temporal features of the stereo video are obtained by averaging the statistics over frames and are listed in Table IX.
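A minimal sketch of the classical Horn-Schunck iteration is shown below; the derivative scheme (simple gradients from the first frame), the wrap-around neighbor averaging, and the parameter values are simplifications, not the paper's implementation.

```python
import numpy as np

def horn_schunck(I1, I2, alpha=1.0, n_iter=100):
    """Estimate optical flow (u, v) between frames I1 and I2 with the
    classical Horn-Schunck iteration (single scale, simplified derivatives)."""
    I1 = np.asarray(I1, float)
    I2 = np.asarray(I2, float)
    Ix = np.gradient(I1, axis=1)   # spatial derivatives
    Iy = np.gradient(I1, axis=0)
    It = I2 - I1                   # temporal derivative
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    # 4-neighbour average (wraps at the borders, fine for a sketch)
    avg = lambda f: (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                     np.roll(f, 1, 1) + np.roll(f, -1, 1)) / 4.0
    for _ in range(n_iter):
        u_bar, v_bar = avg(u), avg(v)
        # project the averaged flow back onto the constraint Ix*u + Iy*v + It = 0
        t = (Ix * u_bar + Iy * v_bar + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
        u = u_bar - Ix * t
        v = v_bar - Iy * t
    return u, v
```

The smoothness weight `alpha` trades off the brightness-constancy term against flow smoothness; per-frame flow statistics (magnitude, orientation) can then be pooled as temporal features.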

D. The Overall Stereo Video Quality Evaluation
Similar to the simple cells and complex cells encoding stereoscopic information into non-redundant binocular retinal information, dictionary learning is utilized on the basic quality feature vectors to eliminate redundant information from the underlying features. Instead of using dictionary learning or SVR alone to obtain the quality score, different schemes are used in different domains to estimate the stereo video quality. An exclusive overcomplete dictionary matrix D_SD for the binocular summation/difference channel is constructed by concatenating all the feature vectors to sparsely represent the spatial domain, i.e., D_SD = [f_SD,1, f_SD,2, ..., f_SD,n] ∈ R^{m×n}, where f_SD,i denotes the feature vector of the i-th sample and m is the dimension of the features. Similarly, an exclusive dictionary matrix D_C for the spatio-temporal domain is formed from the spatio-temporal features f_C,i, i.e., D_C = [f_C,1, f_C,2, ..., f_C,n] ∈ R^{r×n}, where r is the dimension of the spatio-temporal features. Dictionary learning then directly pairs the features of the training set with the corresponding MOS, so that each atom f_SD,i is associated with the human subjective score q_i; a similar dictionary learning model is obtained for the spatio-temporal view. To find the most approximate and sparsest representation of the feature vector of a test video, an l2-minimization based approach, which is less complicated than l1-minimization, is used: α_SD = arg min_α ||f_SD − D_SD α||_2^2 + λ||α||_2^2, where λ is a regularization parameter balancing the fidelity term and the regularization term. Following the above steps, the optimal sparse solution vector α_SD of f_SD is obtained. On the basis of the assumption that videos with approximate quality scores have similar feature distributions, the predicted result of the test video is quantified from the training videos as the coefficient-weighted average of their subjective scores, Q_SD = (Σ_i α_SD,i q_i) / (Σ_i α_SD,i), where Q_SD denotes the spatial quality and Q_C, obtained analogously, denotes the spatio-temporal quality.
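A compact sketch of the l2-regularized coding and score pooling follows. The closed-form ridge solution and the coefficient-magnitude-weighted average are our reading of the scheme, not the authors' exact formulation:

```python
import numpy as np

def sparse_code_l2(D, f, lam=0.1):
    """Solve min_a ||f - D a||_2^2 + lam ||a||_2^2, whose closed-form
    solution is (D^T D + lam I)^{-1} D^T f."""
    n = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ f)

def predict_quality(D, q_train, f_test, lam=0.1):
    """Pool the training MOS values q_train, weighting each by the
    magnitude of the corresponding representation coefficient."""
    a = np.abs(sparse_code_l2(D, f_test, lam))
    return float(a @ q_train / a.sum())
```

When the test feature nearly coincides with one dictionary atom, the coefficient vector concentrates on that atom and the predicted score approaches that atom's MOS.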
Simultaneously, an individual SVR is used to obtain the temporal quality score (Q_T). Since the temporal statistical feature vectors extracted in our framework are of low dimension, it is hard to accurately describe the temporal features of the test set using dictionary learning; therefore, SVR is applied to obtain the temporal quality scores. Specifically, the temporal feature vectors of the randomly generated training set and the corresponding MOS values are used to train the SVR, and the trained model is utilized to predict the temporal quality of the test set. Finally, the global stereoscopic video quality is predicted by integrating the above three parts, so the overall stereo video quality (Q_o) is computed as Q_o = μ Q_SD + η Q_C + γ Q_T, where the constraint is μ + η + γ = 1, and μ, η and γ are constants. The determination of the weight parameters is discussed in Section V.
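The three-part integration reduces to a convex combination of the domain scores. A trivial sketch, using as defaults the weights that Section V later reports as optimal (μ = η = 0.4, γ = 0.2):

```python
def overall_quality(q_sd, q_c, q_t, mu=0.4, eta=0.4, gamma=0.2):
    """Fuse spatial (Q_SD), spatio-temporal (Q_C) and temporal (Q_T)
    quality scores: Q_o = mu*Q_SD + eta*Q_C + gamma*Q_T, weights sum to 1."""
    assert abs(mu + eta + gamma - 1.0) < 1e-9
    return mu * q_sd + eta * q_c + gamma * q_t
```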

V. EXPERIMENTS
In addition to the TJU-SVQA database Phase I and Phase II, the presented NR-SVQA method is also executed on the NAMA3DS1-COSPAD1 stereo video database [9] to verify the accuracy and robustness of the algorithm. The NAMA3DS1-COSPAD1 database contains 100 symmetrically distorted stereo videos generated from 10 original 1920 × 1080 full HD 3D stereo videos, spanning five distortion categories: H.264/AVC compression, JPEG2000 compression, resolution reduction, sharpening, and downsampling with sharpening.

A. Algorithms and Performance
On the premise that the performance of the algorithm is not affected, to reduce time consumption the algorithm extracts one frame every four frames for processing. Thus, the constant in the inter-frame cross is defined as n = 4. During the training of the regression model, the database is randomly divided into non-overlapping training and testing sets: 80% of the database is used for training and the remaining 20% for testing. Consequently, two dictionaries, for the spatial and spatio-temporal domains, are formed directly from the feature vectors and corresponding MOS of the training set, and sparse representation is applied to the testing set. An SVR is trained with the temporal domain features of the training set to predict the effect of distortion on the temporal domain in the test set. We executed 1000 repetitions of the train-test process, and the medians of two performance indicators, PLCC and SROCC, are used to measure the final accuracy of the algorithm. The capability of the algorithm is verified with several experiments, and the outstanding results show that the predicted scores are highly consistent with the MOS.

TABLE X: THE PERFORMANCE OF THREE METHODS TO GET THE FINAL SCORE ON TJU-SVQA DATABASE PHASE I AND NAMA3DS1-COSPAD1 DATABASE
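The 1000-trial 80/20 protocol can be sketched as below; `predict_fn` stands in for the full DL + SVR pipeline and is an assumption of this sketch.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(predict_fn, features, mos, n_rep=1000, train_frac=0.8, seed=0):
    """Repeat a random train/test split n_rep times and return the
    median PLCC and SROCC of predict_fn's scores against MOS."""
    rng = np.random.default_rng(seed)
    plcc, srocc = [], []
    n = len(mos)
    for _ in range(n_rep):
        idx = rng.permutation(n)
        k = int(train_frac * n)
        tr, te = idx[:k], idx[k:]
        pred = predict_fn(features[tr], mos[tr], features[te])
        plcc.append(pearsonr(pred, mos[te])[0])
        srocc.append(spearmanr(pred, mos[te])[0])
    return np.median(plcc), np.median(srocc)
```

Reporting the median over many random splits reduces the sensitivity of the result to any single lucky or unlucky partition.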
To demonstrate the effectiveness of the presented algorithm, we compared it with five existing metrics on the TJU-SVQA database Phase I and Phase II, including two widely used 2D-IQA models (PSNR and SSIM), one 3D-IQA model (Cyclopean [55]) and two state-of-the-art 3D-VQA models (VQM [12] and DPQM [8]). For the 2D models, PSNR and SSIM are computed as the average of all frame scores for each view, and the two views' scores are averaged to obtain the final score. For the 3D-IQA model, all frame features are averaged to obtain the final quality result. Moreover, on the NAMA3DS1-COSPAD1 database, four quality evaluation methods are also employed to verify the robustness of the presented algorithm.

B. Parameter Optimization of Local Quality
To explore the characteristics of different visual features and optimize the performance of the algorithm, we analyze the different prediction methods quantitatively. In the process of predicting the final quality score, three schemes are considered: SVR, dictionary learning (DL), and the combination of the two (SVR + DL). The three regression schemes are evaluated on the TJU-SVQA database Phase I and the NAMA3DS1-COSPAD1 database. Table X lists the experimental results for the different predictor choices. From Table X, the presented metric (SVR + DL) achieves better performance on both evaluation criteria.
We further investigated the quality perception ability of each domain and explored the sensitivity of each domain to distortion. The quantitative results of the fitting degree between predicted scores of each domain and MOS are shown in Table XI. It can be observed that the quality of the spatial and spatio-temporal domains contributes similarly to the prediction and is more important than that of the temporal domain. Even so, temporal domain information is also valuable for quality evaluation and the local quality score of each domain does not exceed the composite quality score of three domains, which verifies the necessity of comprehensively considering the quality degradation of the three domains for visual quality perception.
The ability of the global features, magnitude and phase to predict quality was explored by taking the features of the spatio-temporal domain as an example. The experimental results are displayed in Table XII. The results show that the prediction performance of the composite features is better than that of any single feature type, which proves that each type of feature contributes to quality prediction, with magnitude being the most sensitive to stereo video quality. A single type of feature cannot comprehensively describe the quality of stereo video. To ensure the prediction accuracy of the model, it is necessary to combine various quality-sensitive features that do not interfere with each other. In addition, the performance of the inter-frame cross model is compared with that of the previous spatio-temporal domain model. The previous model extracts the above features from the inter-frame differences of the left and right views, and the stereo video quality score is obtained as a weighted average of the scores of the two views. The results are shown in Table XIII.
Since the inter-frame cross model is based on the binocular rivalry phenomenon, it can more accurately reflect the quality perception process of the human eyes. The previous model, by contrast, is not closely related to binocular perception, and its performance does not exceed that of the proposed model. To determine the three parameters of the combination, we assigned various values to each parameter and compared the resulting performance. Experiments to determine μ, η and γ in Eq. 15 are conducted, where μ, η and γ are the weights that balance the three domains. Each of the three parameters ranges from 0 to 1 with an interval of 0.1. We first adjusted parameters μ and η to find the optimal weight ratio of the spatial and spatio-temporal domains; the results are shown in Fig. 8(a). When the two parts are assigned equal values, the prediction accuracy is best. Under the constraints μ = η and μ + η + γ = 1, different weight values are then allocated to γ to achieve the best performance of the algorithm; the result is shown in Fig. 8(b). The proposed metric therefore achieves the best performance with μ = 0.4, η = 0.4 and γ = 0.2. This weight scheme also matches the contribution of each domain to the prediction performance reported in Table XI.
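The exhaustive search over the weights with the simplex constraint can be sketched as follows; scoring candidate combinations by PLCC is our illustrative choice for the sketch.

```python
import numpy as np
from itertools import product

def search_weights(q_sd, q_c, q_t, mos, step=0.1):
    """Grid-search mu and eta on [0, 1] with gamma = 1 - mu - eta >= 0,
    returning the weights whose fused score best correlates with MOS."""
    best_w, best_r = None, -np.inf
    grid = np.round(np.arange(0.0, 1.0 + 1e-9, step), 10)
    for mu, eta in product(grid, grid):
        gamma = 1.0 - mu - eta
        if gamma < -1e-9:      # outside the simplex
            continue
        q = mu * q_sd + eta * q_c + gamma * q_t
        r = np.corrcoef(q, mos)[0, 1]  # PLCC of fused score vs. MOS
        if r > best_r:
            best_w, best_r = (mu, eta, gamma), r
    return best_w, best_r
```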

C. Prediction Performance Evaluation
The experimental results on the TJU-SVQA database Phase I, Phase II and the NAMA3DS1-COSPAD1 database are shown in Table XIV and Table XV, respectively. The method with outstanding performance is highlighted in boldface. Overall, compared with the other metrics, the proposed method achieves better performance on both evaluation criteria on the individual databases, which strongly supports its effectiveness. To express more intuitively the consistency between the objective results of each algorithm and the MOS values, we display scatter plots of the predicted results of the above algorithms on the TJU-SVQA database Phase I in Fig. 9. To make the comparison more explicit, for the FR methods we randomly selected 20% of the samples in the database for demonstration, which is consistent with the test set size of the NR methods. From Fig. 9, the fitting degree between the predicted scores of the proposed method and the MOS is impressive. Since the FR methods we compared (PSNR, SSIM, Cyclopean) are extended from 2D-IQA and 3D-IQA to 3D-VQA, binocular vision characteristics and temporal information are ignored; therefore, their performance is not satisfactory.
Since predicting the quality of symmetric distortion is relatively less difficult than that of asymmetric distortion, and the video content in the NAMA3DS1-COSPAD1 database is less varied, the overall performance of the algorithms on the NAMA3DS1-COSPAD1 database is better than on the TJU-SVQA databases. Moreover, DPQM also performs impressively on the symmetric distortion databases, and in particular its SROCC indicator is superior to the others on the NAMA3DS1-COSPAD1 database. To further explore the performance of the proposed method and DPQM, we applied a new performance measure, P_G^Data, proposed in [56]; the higher the index, the better the prediction performance of the algorithm. We trained the prediction model corresponding to each algorithm on the TJU-SVQA database Phase I and verified the performance of each model on the NAMA3DS1-COSPAD1 database. It can be observed that the proposed method performs relatively better. This indicator requires training and validation sets, so the FR methods are not evaluated with it.
To evaluate the performance difference between any two algorithms, a significance test is conducted on the SROCC sequences of each pair of methods. A t-test [57] is applied to the 1000 SROCC values of each pair of methods on the TJU-SVQA databases, and the results are displayed in Fig. 10. A value of 1 indicates that the algorithm on the row performs better than the method on the column, while −1 indicates that the method on the column is superior; 0 indicates that the two algorithms are statistically indistinguishable and their performance is similar. Since only the NR methods require model training and multiple iterations, the significance test is carried out only for the three NR methods. The experimental results are consistent with Table XIV: the performance of the algorithms diverges, and the proposed algorithm performs better.
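The pairwise significance decision can be sketched with a two-sample t-test; the ±1/0 coding follows the description above, and the 0.05 significance level is our assumption.

```python
import numpy as np
from scipy.stats import ttest_ind

def significance(srocc_a, srocc_b, alpha=0.05):
    """Compare two sequences of SROCC values from repeated train/test
    trials: return 1 if method A is significantly better, -1 if B is,
    and 0 if the difference is not statistically significant."""
    t, p = ttest_ind(srocc_a, srocc_b)
    if p >= alpha:
        return 0
    return 1 if t > 0 else -1
```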
Compared with symmetric distortion, asymmetric distortion is more realistic and more difficult to predict. Since many methods cannot accurately reflect the binocular perception of asymmetric distortion, it is difficult for their predicted results to match the subjective scores, and their performance declines on the TJU-SVQA database Phase II. However, as shown in Table XIV, the proposed algorithm and VQM perform better on the TJU-SVQA database Phase II. This might be due to the fact that the number of stereoscopic video samples in Phase II is significantly larger than in Phase I, which makes the training sufficient and finally achieves better results in the prediction stage. To verify this conjecture, stereoscopic video samples equal in number to Phase I were randomly selected from Phase II during the training and testing phases. As shown in Table XVI, the experimental results show that with the same sample size the proposed algorithm performs better on the Phase I database, which demonstrates that the sample size has a significant influence on the experimental results.
Furthermore, to verify the predictive capacity of the proposed algorithm on individual distortion types, we divided Phase I and Phase II into two parts each according to the type of distortion (H.264 and JPEG2000). The training and testing process is the same as in the experiment on the whole database, and the results are shown in Table XIV. The proposed algorithm has a stable performance on the H.264 subsets, while DPQM and VQM perform remarkably on the JPEG2000 subsets.

D. Cross-Database Experiment
To illustrate the generalization ability of the presented method, we executed cross-database experiments on the above three databases. Since the FR methods do not require training, cross-database validation is conducted only on the three NR methods. In this paper, TJU-I/NAMTE means that the NR method is trained on the TJU-SVQA database Phase I and tested on the NAMA3DS1-COSPAD1 database. A total of six groups of experiments are presented: 1) TJU-I/TJU-II, 2) TJU-II/TJU-I, 3) TJU-I/NAMTE, 4) NAMTE/TJU-I, 5) TJU-II/NAMTE, 6) NAMTE/TJU-II. The results are displayed in Table XVII.
We observed that although the proposed approach performed relatively well, all algorithms show significant accuracy degradation in the cross-database experiments compared to the individual-database experiments, especially when the TJU-SVQA database Phase II is set as the test set. Since the TJU-SVQA database Phase II has a larger sample size and a richer variety of distortion types than the other two databases, a model trained on a smaller database cannot be adequately prepared for it and its accuracy decreases greatly. Meanwhile, the algorithms' performance is relatively high when the TJU-SVQA database Phase II is set as the training set, especially for VQM.

VI. CONCLUSION
In this paper, we built two stereo video quality assessment databases, namely TJU-SVQA Phase I and TJU-SVQA Phase II, both of which are freely available, making them practical for cross-database experiments and for comparison between different studies. The TJU-SVQA databases contain various kinds of content, symmetrically and asymmetrically distorted videos, and a large number of video samples, which makes them more in line with actual demand and more versatile. In addition, we implemented subjective quality assessment experiments on the two databases; the results show that the presented evaluation is well balanced over the variability of the sequences. We then proposed a comprehensive NR-SVQA model with an inter-frame cross map to verify the validity of the TJU-SVQA databases, leading to an improvement in the quality prediction performance for stereoscopic videos. In the future, three aspects could be taken into consideration: first, exploring the synthesis of the final quality score by combining the quality scores of each domain with dynamic weights; second, exploring novel models with good properties based on inter-frame information on the TJU-SVQA database; third, further studying the effect of video distortion on temporal information.