Full-Reference Stereoscopic Video Quality Assessment Using a Motion Sensitive HVS Model

Stereoscopic video quality assessment has become a major research topic in recent years. Existing stereoscopic video quality metrics are predominantly based on stereoscopic image quality metrics extended to the time domain via, for example, temporal pooling. These approaches do not explicitly consider the motion sensitivity of the Human Visual System (HVS). To address this limitation, this paper introduces a novel HVS model inspired by physiological findings characterising the motion sensitive response of complex cells in the primary visual cortex (V1 area). The proposed HVS model generalises previous HVS models, which characterised the behaviour of simple and complex cells but ignored motion sensitivity, by estimating optical flow to measure scene velocity at different scales and orientations. The local motion characteristics (direction and amplitude) are used to modulate the output of complex cells. The model is applied to develop a new type of full-reference stereoscopic video quality metric which uniquely combines non-motion sensitive and motion sensitive energy terms to mimic the response of the HVS. A tailored two-stage multi-variate stepwise regression algorithm is introduced to determine the optimal contribution of each energy term. The two proposed stereoscopic video quality metrics are evaluated on three stereoscopic video datasets. Results indicate that they achieve average correlations with subjective scores of 0.9257 (PLCC), 0.9338 and 0.9120 (SRCC), 0.8622 and 0.8306 (KRCC), and outperform previous stereoscopic video quality metrics including other recent HVS-based metrics.

metrics that are able to faithfully capture the complexity of human perception. Recently, there have been a number of research activities which explored how models of the Human Visual System (HVS) can be used to develop more robust stereoscopic image and video quality metrics [1]-[11]. However, these activities mostly considered stereoscopic video quality assessment as an extension of stereoscopic image quality assessment, relying on temporal pooling of stereoscopic image quality measures. A major drawback of this class of approaches is that they are unable to capture important spatio-temporal characteristics, such as the motion of objects in a scene, which require direct processing in the spatio-temporal domain. In contrast to previously reported research, this paper introduces a novel HVS model which directly encodes temporal complexity to mimic the spatio-temporal characteristics of human stereoscopic perception. The proposed HVS model generalises our previous model [8], which was also based on the HVS but excluded the influence of motion sensitivity. The novel model is applied here to stereoscopic video quality assessment, thereby demonstrating the importance of incorporating motion sensitivity in perceptual tasks. To the authors' knowledge, this is the first stereoscopic video quality metric based on a motion-sensitive HVS model.
Simple cells and complex cells are the main cell types in the primary visual cortex that are responsible for binocular vision in the HVS. Several physiological models have been proposed to mimic their properties [12], [13] and have been used to build image and video quality metrics [2], [8]. These models are based on the computation of binocular signals using analytical methods to estimate the perceptual quality of stereoscopic images and videos. The response to a stereoscopic input is encoded in the form of a binocular energy consisting of multiple objective scores capturing different perceptual characteristics. Physiological studies have identified motion sensitivity as an important characteristic of a significant proportion of complex cells [2]. However, to date, no model of complex cells with motion sensitivity has been developed for stereoscopic video quality assessment. This paper introduces a novel HVS model which incorporates motion sensitivity information in the computation of binocular energy, introducing new energy terms capturing motion-specific perceptual characteristics, and demonstrates its application and benefit for stereoscopic video quality assessment.
A fundamental challenge addressed in this paper relates to estimating and leveraging the level of motion present in stereoscopic videos to construct a reliable HVS model. The key insight is the introduction of a generalised complex cell model which is able to represent the behaviour of a variety of complex cells and their motion responses. This is achieved using an optical flow algorithm to extract pixel level motion information for each perceptual channel and utilising this information to modulate the response of each complex cell. This results in two types of complex cells: non-motion sensitive and motion sensitive complex cells. Non-motion sensitive complex cells respond to spatial orientation regardless of whether motion is present or not, similarly to the complex cells introduced in [8]; in contrast, motion sensitive complex cells respond to spatial orientation only in the presence of motion [14], [15]. Different velocity response functions are investigated to model the behaviour of these cells as a function of the amplitude of the motion at a given orientation and scale, taking into account minimum velocity requirements.
To validate the model and demonstrate its practical use, it is applied to build a novel stereoscopic video quality metric. The metric is built by pooling both motion sensitive and non-motion sensitive objective scores and performing a multi-variate regression on the pooled objective scores. In the case of the motion sensitive objective scores, the level of motion in each frame is taken into account during pooling. The high dimensionality of the proposed HVS model poses computational challenges in terms of extracting a robust regression model. To address this, a tailored two-stage regression approach is proposed. In the first stage, the most significant objective scores are selected by performing a regression separately on the non-motion sensitive and the motion sensitive objective scores. In the second stage, a regression is performed on the combined set of selected non-motion sensitive and motion sensitive objective scores, thereby reducing dimensionality. A comparison against state-of-the-art stereoscopic video quality metrics, including the Binocular Energy Video Quality Metric (BEVQM) [8], validates the benefit of accounting for motion sensitivity.
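The two-stage idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain greedy forward selection on squared error stands in for the tailored stepwise regression, and the names `fit_least_squares`, `residual_error`, `forward_select`, `two_stage_select`, `max_terms` and `min_gain` are hypothetical rather than taken from the paper.

```python
def fit_least_squares(rows, targets, ridge=1e-9):
    """Fit targets ~ intercept + rows via the normal equations; returns weights."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Tiny ridge term (an assumption) keeps the system solvable for collinear scores.
    a = [[sum(d[i] * d[j] for d in design) + (ridge if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    for col in range(k):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for j in range(col, k):
                a[r][j] -= factor * a[col][j]
            b[r] -= factor * b[col]
    weights = [0.0] * k
    for col in reversed(range(k)):  # back substitution
        tail = sum(a[col][j] * weights[j] for j in range(col + 1, k))
        weights[col] = (b[col] - tail) / a[col][col]
    return weights


def residual_error(scores, mos, columns):
    """Sum of squared errors when regressing mos on the chosen score columns."""
    rows = [[row[c] for c in columns] for row in scores]
    w = fit_least_squares(rows, mos)
    return sum((w[0] + sum(wi * x for wi, x in zip(w[1:], r)) - t) ** 2
               for r, t in zip(rows, mos))


def forward_select(scores, mos, candidates, max_terms, min_gain=1e-6):
    """Greedily add the score column that most reduces the squared error."""
    chosen = []
    best = residual_error(scores, mos, chosen)
    while len(chosen) < max_terms:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        trial = min(remaining,
                    key=lambda c: residual_error(scores, mos, chosen + [c]))
        error = residual_error(scores, mos, chosen + [trial])
        if best - error < min_gain:
            break
        chosen.append(trial)
        best = error
    return chosen


def two_stage_select(scores, mos, still_cols, motion_cols, max_terms):
    """Stage 1: select within each score family. Stage 2: select on the union."""
    shortlist = (forward_select(scores, mos, still_cols, max_terms)
                 + forward_select(scores, mos, motion_cols, max_terms))
    return forward_select(scores, mos, shortlist, max_terms)
```

Here `scores` holds one row of pooled objective scores per video and `mos` the matching subjective scores. Running the second stage only on the shortlist is what keeps the final regression low-dimensional.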
The novelty of the proposed method lies in accounting for both motion sensitive and non-motion sensitive complex cells of the HVS. Further, the temporal responses of these complex cell types are modelled differently. Modelling the HVS in this way is expected to result in a more accurate representation than one that neglects the true behaviour of the complex cells. The rest of the paper is organised as follows. Section II reviews the background on HVS modelling with a focus on motion sensitivity and the application to stereoscopic video quality assessment. Sections III and IV introduce the proposed HVS model and quality metrics. Section V evaluates the proposed approach against the state-of-the-art and discusses performance. Section VI concludes the paper by summarising the findings and discussing avenues for future research.

A. Physiology of the HVS
A neural tissue at the back of the eye called the retina receives images. It contains two layers with synaptic interconnections between the neurons and three layers of cell bodies. The images projected onto the retina are inverted and exhaustively pre-processed before being passed on to other parts of the brain. The visual cortex that processes this information is located at the back of the brain. The primary visual cortex (V1 area) is the largest part of the HVS, which receives signals from the Lateral Geniculate Nucleus (LGN) located in both hemispheres of the brain. There is a large variety of cell types in the visual cortex, responding to different kinds of stimuli, e.g. particular frequencies, colours or direction [16].
Fig. 1. Decomposition of an image into perceptual channels. Left: original image. Right: spatial-frequency bands of the image with 3 orientations and 3 decomposition levels considered, resulting in a total of 10 perceptual channels: V1, D1 and H1 correspond to the vertical, diagonal and horizontal orientations respectively at level 1 (similar notation is used for the orientations at levels 2 and 3); L is the low resolution residual.
Physiological experiments have shown that simple cells can be modelled using linear filters derived from their impulse responses measured in the visual cortex. An approximation of the impulse response using a Gabor wavelet has been shown in [17], where the ON and OFF regions of the receptive field correspond to the peaks and troughs of a two-dimensional Gabor function, respectively. These findings have resulted in many sampling functions for simple cells which allow an image to be decomposed into perceptual channels and image elements localised in the spatial and frequency domains, as shown in Fig. 1.
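To make the Gabor description concrete, the following minimal Python sketch evaluates a standard two-dimensional Gabor function (a Gaussian envelope multiplied by a cosine carrier). The parameter names `wavelength`, `theta`, `sigma`, `phase` and `aspect` are illustrative assumptions and are not taken from [17].

```python
import math


def gabor(x, y, wavelength, theta, sigma, phase=0.0, aspect=1.0):
    """Value of a 2D Gabor function at (x, y).

    Positive values play the role of ON regions and negative values the
    role of OFF regions of a simple cell receptive field.
    """
    # Rotate coordinates so the carrier oscillates along direction theta.
    xr = x * math.cos(theta) + y * math.sin(theta)
    yr = -x * math.sin(theta) + y * math.cos(theta)
    envelope = math.exp(-(xr * xr + (aspect * yr) ** 2) / (2.0 * sigma * sigma))
    carrier = math.cos(2.0 * math.pi * xr / wavelength + phase)
    return envelope * carrier


# Sample a 9x9 receptive field with a horizontally oscillating carrier.
kernel = [[gabor(x, y, wavelength=4.0, theta=0.0, sigma=3.0)
           for x in range(-4, 5)] for y in range(-4, 5)]
```

The centre of `kernel` is a peak (ON region), and the samples half a wavelength away along the x axis are negative (OFF regions).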
There is a variety of simple cells in the HVS: binocular and monocular cells with their respective types of receptive fields. Monocular information from the left and right retinas results in occluded information when each eye independently sees the world. Binocular vision of objects results from binocular simple cells organised in pairs with binocular receptive fields. These binocular cells are responsible for stereoscopic perception. There are several analytical models to describe simple cells, and the response of a pair of binocular simple cells is often represented as a complex cell [16], [17]. The spatial-frequency response based on size, amplitude, phase and orientation can be modelled using directional wavelets with the aim of representing the pairs of stereoscopic images using a set of complex functions.
The binocular energy is generated in the receptive fields of binocular complex cells. The spatial relationship between monocular receptive fields of complex cells and corresponding simple cells is described in [18], including the correspondence of amplitude, size, orientation and phase shift between simple and complex cells. Sensitivity to the orientation and spatial arrangement is however not inherited by complex cells from corresponding simple cells. Thus, the binocular energy generated by a complex cell depends on the disparity of position and the shift in phase between the simple cells.

B. Motion Sensitivity in the HVS
Direction of motion is one of the main features tested in physiological experiments. Different types of complex cells are sensitive to different directions of motion, as illustrated in Fig. 2. This is an important phenomenon for modelling motion sensitivity of the HVS. It has been observed that there is a lower velocity threshold at which the HVS response starts and an upper velocity threshold beyond which the response saturates. Physiological studies have shown that motion sensitivity is orientation and spatial frequency selective [2]. These notable findings are at the centre of the generalised complex cell architecture proposed in this paper.
Reference [19] showed that the velocity response of complex cells can be classified into three types: low pass, high pass and band pass responses. The first type primarily responds as a low pass filter in velocity and only a small proportion of complex cells are known to behave in this manner. The second type acts as a high pass filter in velocity and is the most common type of motion sensitive complex cells found in the HVS. It is worth noting that the cut-off velocity between these two types of filters does not occur at the same velocity threshold. The third type acts as a band pass filter and shares a maximum velocity with the second type. This type of complex cells is more common than the first type but less so than the second type. The response to velocity in all three types of complex cells can be observed to be approximately linear or uniform across the given range of operation. The study in [19] indicates that the predominant velocity response of the HVS can be modelled as a high pass filter with a linear slope or a sharp high pass filter with a uniform response after a threshold velocity. This is the model that will be implemented and evaluated in this paper.
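The response profiles described above can be written as simple functions of velocity. The sketch below is illustrative only: the function names and threshold arguments (`v_min`, `v_max`, `v_cut`, `v_low`, `v_high`) are assumptions, and no threshold values from [19] are implied.

```python
def high_pass_linear(v, v_min, v_max):
    """High-pass response with a linear slope.

    Zero below v_min, rising linearly to 1 at v_max, saturated above v_max.
    """
    if v <= v_min:
        return 0.0
    if v >= v_max:
        return 1.0
    return (v - v_min) / (v_max - v_min)


def high_pass_step(v, v_min):
    """Sharp high-pass response: uniform output once v reaches v_min."""
    return 1.0 if v >= v_min else 0.0


def low_pass_step(v, v_cut):
    """Low-pass response: uniform output up to the cut-off velocity."""
    return 1.0 if v <= v_cut else 0.0


def band_pass_step(v, v_low, v_high):
    """Band-pass response: uniform output between two velocities."""
    return 1.0 if v_low <= v <= v_high else 0.0
```

For example, `high_pass_linear(2.0, 1.0, 3.0)` lies halfway up the ramp, while `high_pass_step(2.0, 1.0)` already gives the full response.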
A number of HVS models incorporating motion sensitivity and with various degrees of complexity have been proposed. A motion model based on a simple spatio-temporal concept of motion is discussed in [15]. Motion detection is formulated in terms of detecting orientation in a three-dimensional space defined by x, y, and t; the orientation exists in space-time rather than just in space. Motion in particular is filtered using appropriately oriented impulse response filters chosen as quadrature pairs sensitive to the motion direction. The combination of the outputs of two linear filters has a phase-independent motion energy response.
If the filters' responses are squared and summed, the resulting signal gives a phase-independent measure of local motion energy within a given spatial-frequency band. The system built on these filters has motion-detecting properties with a motion response that is localised in space, time, and spatial frequency. Continuous motion, apparent motion, and motion illusions (fluted square wave and reverse phi) are basic phenomena perceived in this model. Spatio-temporal orientation can be considered as a local property of spatio-temporal stimuli and can be extracted with the same kind of simple mechanisms used for extracting spatial orientation.
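The squared-and-summed quadrature pair can be illustrated with a small numerical experiment in one spatial dimension plus time. This is a sketch under assumptions: Gabor-like filters with an isotropic Gaussian envelope stand in for the oriented filters of [15], and the names `quadrature_pair`, `motion_energy` and `drifting_grating` are hypothetical.

```python
import math


def quadrature_pair(k, velocity, sigma, radius):
    """Even and odd space-time filters tuned to a given velocity.

    Both share a Gaussian envelope; their carriers are 90 degrees apart.
    """
    even, odd = {}, {}
    for x in range(-radius, radius + 1):
        for t in range(-radius, radius + 1):
            g = math.exp(-(x * x + t * t) / (2.0 * sigma * sigma))
            theta = k * (x - velocity * t)  # space-time orientation
            even[(x, t)] = g * math.cos(theta)
            odd[(x, t)] = g * math.sin(theta)
    return even, odd


def motion_energy(stimulus, even, odd):
    """Square and sum the two filter responses."""
    r_even = sum(w * stimulus(x, t) for (x, t), w in even.items())
    r_odd = sum(w * stimulus(x, t) for (x, t), w in odd.items())
    return r_even * r_even + r_odd * r_odd


def drifting_grating(k, velocity, phase):
    """Sinusoidal grating moving at the given velocity."""
    return lambda x, t: math.cos(k * (x - velocity * t) + phase)


k = 2.0 * math.pi / 4.0  # spatial wavelength of 4 samples
even, odd = quadrature_pair(k, velocity=1.0, sigma=3.0, radius=9)
preferred = [motion_energy(drifting_grating(k, 1.0, p), even, odd)
             for p in (0.0, 1.0, 2.5)]
opposite = motion_energy(drifting_grating(k, -1.0, 0.0), even, odd)
```

The values in `preferred` are nearly identical regardless of the grating phase, while `opposite` is close to zero. The energy is therefore both phase-independent and direction selective.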
A two-stage physiological model for local image velocity representation in the Middle Temporal (MT) visual area is presented in [20]. Each neuron of the MT visual area computes a weighted sum of its inputs followed by half-wave rectification, squaring, and response normalisation. Despite its simplicity, the model can account for much of the physiology of MT neurons. However, the population of model neurons is unrealistically homogeneous unlike real neurons which are irregular in comparison. Further, there is a lack of realistic temporal dynamics as the model corresponds to steady-state firing rates. This model is required to compute an estimated velocity from the responses of the MT population for the perception of speed and direction of plaid patterns.
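The per-neuron computation described above (weighted sum, half-wave rectification, squaring and normalisation) can be sketched as follows. The divisive form of the normalisation and the constant `sigma` are assumptions made for illustration; `mt_stage` and `weight_rows` are hypothetical names.

```python
def mt_stage(inputs, weight_rows, sigma=0.1):
    """One stage of a weighted-sum / rectify / square / normalise model.

    inputs: activity of the previous stage.
    weight_rows: one row of weights per model neuron.
    sigma: assumed semi-saturation constant of the divisive normalisation.
    """
    linear = [sum(w * x for w, x in zip(row, inputs)) for row in weight_rows]
    # Half-wave rectification followed by squaring.
    rectified = [max(0.0, r) ** 2 for r in linear]
    # Divide each response by the pooled activity of the population.
    pool = sum(rectified) + sigma ** 2
    return [r / pool for r in rectified]
```

For instance, `mt_stage([1.0, 0.0], [[1.0, 0.0], [-1.0, 0.0]], sigma=1.0)` leaves the first neuron active and silences the second, whose weighted sum is negative.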
Physiological mechanisms have been used to derive a unified model of motion and stereo vision in [21] to explain phenomena pertaining to motion-stereo interaction. In one such phenomenon, when a moving target is viewed with a neutral-density filter over one eye, it appears displaced in depth. This phenomenon is known as Pulfrich's pendulum: when the target oscillates like a pendulum, it appears to move along an elliptical path. A demonstration of how computational modelling can help bridge the gap between physiology and perception confirms the importance of constructing computational theories of vision based on neurophysiology. However, the integrated model developed in [21] is not completely physiologically realistic.
A functional architecture of human visual motion perception is presented in [22], using four types of moving stimuli with luminance modulation, texture-contrast modulation, depth modulation and motion modulation. Seven experiments related to the four types of stimuli were conducted to determine a functional control chart. A first-order luminance system and a second-order texture-contrast system use independent motion-energy detectors, operate in parallel, and combine their outputs at an early stage. A third-order (feature-tracking) system receives inputs (features) from texture grabbers and from the lower-order motion systems. The strength of feature inputs to the third-order motion system is subject to top-down control: attention to particular features influences the features' strengths and thereby the perceived direction of motion. The high complexity of this architecture is a major concern.
Despite the above-mentioned works, there is a large gap between physiological models of motion sensitivity and their use in quality assessment tasks. Devising novel quality metrics which incorporate the motion sensitive information available in physiological models is therefore of primary importance to achieve reliable stereoscopic video quality assessments.
Sections II-C and II-D review the related stereoscopic image and video quality metrics with a focus on motion sensitivity.

C. Stereoscopic Image Quality Metrics
In [23], established 2D quality metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural SIMilarity (SSIM), Just Noticeable Difference (JND), Visual Information Fidelity (VIF) and Noise Quality Measure (NQM) were extended to measure the quality of stereoscopic content by averaging the left and right view scores, each obtained using the 2D quality metric. The authors observed a reduction in performance which was attributed to the fact that stereoscopic perception was not only affected by image content, but also by other attributes of stereopsis such as disparity.
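As a minimal illustration of this baseline, the sketch below averages a 2D metric over the two views, using PSNR as the example metric. The function names and the representation of an image as a list of rows are assumptions, not details from [23].

```python
import math


def psnr(reference, distorted, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images."""
    squared = [(r - d) ** 2
               for ref_row, dist_row in zip(reference, distorted)
               for r, d in zip(ref_row, dist_row)]
    mse = sum(squared) / len(squared)
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)


def stereo_score(ref_left, dist_left, ref_right, dist_right, metric=psnr):
    """Average a 2D metric over the left and right views."""
    return 0.5 * (metric(ref_left, dist_left) + metric(ref_right, dist_right))
```

Because each view is scored independently, the result is insensitive to disparity, which is consistent with the reduced performance the authors observed.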
Inclusion of disparity maps with 2D metrics was considered in [24] and [25]. Blur, JPEG and JPEG2000 impairments were applied symmetrically to left and right images in [24] to derive a measure of 3D perception. It was concluded that the 3D content of a disparity map could not be interpreted by 2D metrics based on a fidelity score combining disparity map score and average stereoscopic score. Further experiments in [26] confirmed the limitations of the usage of 2D metrics for stereoscopic image quality assessments. Depth information was incorporated in different ways without directly considering the special characteristics of 3D perception such as spatial masking affected by suppression.
In [27], the performance of applying Video SSIM (VSSIM) [28] and PSNR to colour+depth sequences was evaluated. Synthesized virtual views of compressed colour and depth sequences were objectively assessed with the quality metrics. Subjective experiments showed the relative importance of colour distortions over depth distortions and the need to devise quality metrics specifically targeting stereoscopic content. In [29], a good correlation with human perception was obtained when the depth maps were computed using stereoscopic images affected by low degrees of impairment. However, the correlation was found to degrade as the significance of the impairment increases.
A full-reference metric using a product of two quality scores based on a disparity map and an extracted Cyclopean view was presented in [16]. Both quality scores were quantified using SSIM and were named as monoscopic quality (number of binocular cues preserved in images) and stereoscopic quality (via disparity map comparison). The quality metric results were then correlated to human perception. However, the results were verified using a small scale subjective experiment and did not include colour perception.
In [9], Lv et al. propose a blind (no-reference) stereoscopic image quality assessment method based on learning the receptive fields' characteristics. In this work, dictionary learning for constructing a quality lookup was used to predict subjective scores. Although good performance is reported, limited ability for dealing with asymmetrical distortion is also noted as a weakness of this approach. A full reference metric was proposed in [11] by jointly considering binocular energy and contrast perception. The approach offers mechanisms for binocular fusion and rivalry with a high prediction accuracy of perceived stereoscopic image quality.
A perceptual stereoscopic image quality approach based on modelling the properties of the primary visual cortex was proposed in [10]. This was achieved by introducing a new feature encoding approach and a tailored similarity measure that was shown to achieve high correlation with subjective scores. An effective human binocular combination model for Cyclopean image was proposed in [30], where a full-reference stereoscopic image quality assessment model was built based on binocular summation and binocular difference channels.
The survey of the state-of-the-art in stereoscopic image quality perception reveals that high accuracy prediction of subjective quality can be achieved, particularly with the recent developments in HVS-based models that outperform earlier approaches. However, better tailored approaches are required for highly reliable assessment of the stereoscopic video quality.

D. Stereoscopic Video Quality Metrics
In [4], a comprehensive set of subjective experiments was performed with stereoscopic video sequences, which were encoded using both H.264/Advanced Video Coding (AVC) and High Efficiency Video Coding (HEVC) standards. Results of the subjective experiments on symmetrically and asymmetrically encoded stereoscopic videos were analysed using statistical techniques to reveal subjective scoring patterns. Structural distortion caused by compression was the main feature used in the metric introduced in this work. Measurement of asymmetric blur and content complexity were also used as objective measures. However, it does not consider ringing artefacts commonly present in wavelet based video codecs.
The metric presented in [31] quantifies the distortion in luminance and contrast using an approximation (variances) weighted by the mean of each pixel block to obtain the overall image distortion. The distortion on the block level is weighted to measure the frame level perceptual distortion. This metric does not account for chrominance. A stereoscopic video quality assessment method based on block-matching of left and right views via a 3D-DCT transform was proposed in [32]. However, this ignores masking effects due to motion.
In [33], spatio-temporal structural information was utilized by an algorithm which jointly represented and evaluated two views. In particular, the algorithm firstly selected salient pixels based on the results of a 3D Sobel filter. Then, the similarity of joint descriptors constructed from eigenvalues and eigenvectors of pixels in the left and right views was calculated at the pixel level. Finally, all of the local scores were pooled into one global score. This metric does not account for different degrees of the influence of salient pixels on HVS.
A novel Stereoscopic Video Quality Assessment (SVQA) metric was introduced in [34]. Based on the multiple visual masking characteristics of the HVS, it uses a stereoscopic just-noticeable difference model to compute the perceptual visibility for stereoscopic video. Using a stereoscopic visual attention model, stereoscopic visual saliency information was extracted first. Then, the quality maps were calculated from the similarity between the perceptual visibility of the original and distorted stereoscopic videos. The lack of integration between the two models is a major drawback of this metric.
A compound stereo-video quality metric composed of monoscopic and stereoscopic quality components was proposed in [16]. Distortions causing blur, noise and contrast change were considered as monoscopic cues whereas binocular depth was the only stereoscopic cue considered. The assessment framework was based on the SSIM quality index, and the limited range of perceptual measures considered is a major drawback.
A perceptual quality assessment metric using temporal complexity and disparity information for stereoscopic video was proposed in [35]. Temporal variance, disparity variance in intra-frames, disparity variance in inter-frames and disparity distribution of frame boundary areas were used to design a no-reference stereoscopic video quality perceptual model. When the disparity in the content was high, the estimation error increased due to the incomplete disparity estimation algorithm. Direction of disparity change was not considered in this work. The work in [36] introduced a quality assessment model based on the observed phenomenon that spatial frequency determines view domination in the HVS. Based on the binocular fusion process characterising 3D human perception, a full-reference metric was proposed for quality assessment of stereoscopic images in [2]. The Binocular Energy Quality Metric (BEQM) introduced there was modelled following a reproduction of the binocular signal generated by simple and complex cells. However, the computation of binocular energy for perceptual evaluation was poor due to the simplicity of the complex cell model. This metric was later extended to the video domain in [8] by introducing a more accurate complex cell model and an adaptive temporal pooling strategy to define the BEVQM. Despite correlating well with the subjective scores, the BEVQM lacks physiological plausibility in the time domain as it does not explicitly model motion sensitivity.
Despite considerable progress in stereoscopic video quality assessment, there is no metric making use of a motion sensitive HVS model. This paper addresses this gap by building on the recent research on HVS-based stereoscopic video quality assessment and generalising it to incorporate for the first time a physiologically inspired model of motion sensitivity. The importance of considering motion sensitivity in enhancing the accuracy of 2D video quality assessment modelling was highlighted in previous research [37], [38]. This importance is particularly magnified for realising a precise 3D video quality assessment model. As such, this paper makes its original research contribution by introducing motion sensitivity into 3D video quality assessment modelling.

A. Overview of the Model's Architecture
The proposed motion sensitive model aims to mimic the processing taking place in the primary visual cortex (V1 area) by modelling the response of simple and complex cells.
The key contribution is the introduction of a generalised complex cell architecture able to account for the behaviour of motion-sensitive complex cells as well as non-motion sensitive complex cells. The model generalises the earlier Extended Binocular Energy Model (EBEM) from [8]. A system diagram of the proposed motion sensitive model highlighting how it extends our previous model is shown in Fig. 3.
The earlier EBEM (shaded in Fig. 3) introduced a model of simple and complex cells to characterise the binocular response of the HVS. Temporal pooling of the objective scores obtained for each video frame was used to learn a metric to predict subjective perception of stereoscopic video quality. The model improved perceptual modelling compared to other approaches. However, the complex cell model lacked physiological plausibility as it ignored motion sensitivity.
In contrast, the model introduced in this paper generalises the previous model by incorporating motion response maps to modulate the output of complex cells according to perceived motion. This results in a more physiologically plausible model of the response of complex cells and allows computation of a new class of objective scores capturing motion sensitive perception of stereoscopic video quality. The final model is obtained by combining motion and non-motion sensitive objective scores and is shown to result in a significant increase in its ability to predict perceived stereoscopic video quality.
Motion response maps are computed for each perceptual channel by estimating the velocity seen by the channel and then applying a velocity response function characteristic of the type of complex cell considered. Fig. 4 summarises the key processing steps to compute the motion response maps. This generalisation allows a broad variety of motion sensitive complex cell behaviours to be modelled depending on the choice of velocity response function, while retaining the ability to model simpler non-motion sensitive complex cell behaviour using a constant velocity response function. The new architecture leads to two different types of binocular energy outputs: one modelling the non-motion sensitive response of complex cells (similar to that proposed in [8]), the other one modelling the response of motion sensitive complex cells. The remainder of this section describes the key steps in the processing pipeline.

B. Simple Cell Model
The simple cell model used in this paper is similar to that used in [2], [8]. Stereoscopic pairs of images are represented using complex functions $C_l(p, c)$ and $C_r(p, c)$ which denote the monocular signals in the left and right images at pixel $p$ and for a given perceptual channel $c$. These are defined as
\[ C_l(p, c) = A_l(p, c)\, e^{i \phi_l(p, c)}, \qquad C_r(p, c) = A_r(p, c)\, e^{i \phi_r(p, c)} \qquad (1) \]
where $A_l$ and $A_r$ denote the amplitudes while $\phi_l$ and $\phi_r$ denote the phases of the left and right signals. A brief description of the simple cell model computation is provided here, the reader being referred to [2], [8] for the full details.
A Complex Wavelet Transform (CWT) is used to model the spatial frequency response of the simple cells for both luminance and chrominance components. A dual-tree method [39] is used to analyse the image using two different Discrete Wavelet Transforms (DWTs). The real and imaginary parts of the CWT are computed by applying two filter pairs, each composed of a low-pass and a high-pass filter, with the first pair computing the real parts of the CWT and the second pair computing the imaginary parts.
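In the usual formulation of the dual-tree approach, the two real DWTs together define a single complex wavelet. The expression below states this standard relation for reference; the symbols $\psi_h$, $\psi_g$ and $\mathcal{H}$ are notation assumed here and do not appear in the paper.

```latex
\[
  \psi(t) = \psi_h(t) + i\,\psi_g(t),
  \qquad
  \psi_g(t) \approx \mathcal{H}\{\psi_h(t)\}
\]
```

Here $\psi_h$ and $\psi_g$ are the real wavelets of the first and second tree, and $\mathcal{H}$ denotes the Hilbert transform. Designing the second tree so that its wavelet approximates the Hilbert transform of the first is what makes the combined wavelet approximately analytic.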
A pre-processing step is used to convert the chrominance channels in a stereoscopic image into a colour space more representative of the HVS. CIE L*a*b* [40] is chosen where a single channel of luminance L* and two mutually orthogonal channels of chrominance a* and b* are used. With the intention to represent stereoscopic images using a set of complex functions, real and imaginary parts of the response to luminance are separated using the CWT on the luminance component. The chrominance response is computed using two DWTs as they are mutually orthogonal being real and imaginary parts of a complex function.
The bandelet transform is used to analyse the wavelet components due to its similar behaviour to simple cell characteristics [41]. The set of sub-bands obtained using the analysis are organised in a quadtree of variable size following the image geometry. An orientation is computed and assigned to each block as a dyadic square depending on the coefficients.

C. Generalised Complex Cell Model
The binocular energy is generated in the receptive fields of the binocular complex cells. The most common type of complex cell is known to perform a SUM-like operation on the responses of simple cells with similar orientation preference [42]. Another type of complex cell is known to perform a MAX-like operation [43]. Both types of operations have been modelled in [8], which defined the energy terms based on the functions $A_l(p, c)$ and $A_r(p, c)$ introduced in (1). For luminance the binocular signal is a complex function and for chrominance it is a real function. In this model, the different perceptual channels account for the different orientations and scales extracted as depicted in Fig. 1. Even though effective at modelling the orientation and scale sensitivities of the HVS, this model completely ignores motion sensitivity.
The proposed model generalises this earlier model by introducing the motion response maps $H_l(p, c)$ and $H_r(p, c)$ characterising the motion response at a pixel $p$ for a given perceptual channel $c$ in the left and right images respectively. Hence, the following energy terms are defined: In these equations, the binocular energy for a given channel $c$ is obtained by first weighting the amplitude of the monocular signals in the left and right images using their respective motion response maps at each pixel, and then summing the resulting binocular energies contributed by each pixel $p$ over the entire image. This allows complex cells to respond selectively to a particular motion. The motion response maps for the left and right images are defined respectively as
\[ H_l(p, c) = h(V_l(p, c)) \quad \text{and} \quad H_r(p, c) = h(V_r(p, c)) \qquad (6) \]
where $V_l(p, c)$ and $V_r(p, c)$ denote the velocity maps in the left and right images, and $h$ is the velocity response function of the type of complex cell considered. The velocity maps $V_l(p, c)$ and $V_r(p, c)$ represent the amplitude of the motion at pixel $p$ for a given perceptual channel $c$ in the left and right views respectively. This requires a dense estimate of scene motion characterising the displacement at each pixel in the pair of image frames. It should be noted that the amplitude of the motion at a given pixel is dependent on the perceptual channel considered since it depends on both scale and orientation. Velocity map estimation will be discussed in more detail in Section III-D.
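The weighting-then-summing procedure can be sketched in Python for a single channel. This is an illustration under clearly stated assumptions: the motion response maps follow (6), but the squared magnitude of the summed left and right signals is the classic binocular energy form used here as a stand-in and is not claimed to match the paper's SUM or MAX energy terms. All function names are hypothetical.

```python
import cmath


def motion_response_map(velocity_map, h):
    """Apply the velocity response function h at every pixel, as in (6)."""
    return [[h(v) for v in row] for row in velocity_map]


def weighted_binocular_energy(amp_l, phase_l, vel_l, amp_r, phase_r, vel_r, h):
    """Motion-weighted binocular energy of one perceptual channel.

    Assumed form: |H_l A_l exp(i phi_l) + H_r A_r exp(i phi_r)|^2 summed
    over all pixels. The paper's own energy terms may differ.
    """
    resp_l = motion_response_map(vel_l, h)
    resp_r = motion_response_map(vel_r, h)
    energy = 0.0
    for y in range(len(amp_l)):
        for x in range(len(amp_l[y])):
            left = resp_l[y][x] * amp_l[y][x] * cmath.exp(1j * phase_l[y][x])
            right = resp_r[y][x] * amp_r[y][x] * cmath.exp(1j * phase_r[y][x])
            energy += abs(left + right) ** 2
    return energy


def h_still(v):
    """Constant response: non-motion sensitive complex cell."""
    return 1.0


def h_moving(v):
    """Sharp high-pass response with an assumed threshold of 1.0."""
    return 1.0 if v >= 1.0 else 0.0
```

With `h_still` every pixel contributes, whereas with `h_moving` only pixels whose velocity reaches the threshold contribute. This is how the same architecture yields both non-motion sensitive and motion sensitive energies.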
The velocity response function h is specific to a given type of complex cell. In the case of a non-motion sensitive complex cell, this is a constant function. In the case of a motion sensitive complex cell, this is a motion dependent function with profile depending on the nature of the complex cell. The motion model considered in this paper is based on implementing a high-pass filter behaviour since previous research has identified this type of behaviour as predominant [19]. The definition of the velocity response function and its effect will be discussed in more detail in Section III-E.
A two layer architecture containing both motion-sensitive (characterised by a velocity response function $h_{motion}$) and non-motion sensitive (characterised by a constant velocity response function $h_{still}$) complex cell models is considered in this paper. The binocular energy scores obtained for the different SUM and MAX operations and the different perceptual channels can be concatenated into vectors $E_{still}$ and $E_{motion}$ in the case of the non-motion sensitive and motion-sensitive complex cell models respectively. To the authors' knowledge, there is no physiological evidence to suggest what proportion of the complex cell response is related to motion. Hence both motion sensitive and non-motion sensitive models are considered in equal proportion, and the contribution of the different types of complex cells will be learnt later on, together with the specific weight of each objective score, when building a metric. This results in a vector of binocular energy scores with four times as many elements as the number of perceptual channels considered (half of the binocular energy terms relating to motion sensitive complex cells, the other half to non-motion sensitive complex cells).
Similarly to [8], the proposed approach models the interactions between complex cell outputs using a Recurrent Excitation Model (REM) where the output of one complex cell is modulated by the output of another complex cell according to the physiological findings reported in [44]. In the proposed model, the two layers do not converge until a common REM combines them using a regression model to produce final binocular energy elements. This generalises the previous approach by allowing modulation across complex cells with different types of motion response as well as complex cells with the same motion response. The remainder of this section provides more detail on the velocity map estimation and the definition of the velocity response function.

D. Velocity Map Estimation
A per pixel measure of velocity for each perceptual channel c in both left and right images is required in order to weigh the contribution of each pixel when computing the binocular energy scores in (4) and (5) and thereby represent the orientation selectivity of motion sensitive complex cells. A two-stage approach is proposed to efficiently compute the velocity maps.
1) Multi-Scale Optical Flow Estimation: First, an optical flow algorithm is used to estimate the left and right motion vectors u l ( p, c) and u r ( p, c) at each pixel p and for each perceptual channel c. The dense optical flow algorithm proposed by Farnebäck [45] is used in this paper using both the previous and the next frame to estimate the motion vectors at any given frame. The algorithm was chosen for its computational efficiency and its robustness at the time our study was performed. Optical flow is calculated separately for the left and right views. The perceived optical flow is dependent on the scale considered. For example, optical flow induced by the motion of a high frequency texture may only be visible at high resolution, disappearing when the texture becomes blurred at the lower levels of resolution. Similarly, small scale motion may only be perceptible at the higher resolution levels as its amplitude may be too small to generate a response at the lower resolution levels. Hence a multi-scale approach is used to compute the optical flow at each scale (three scales corresponding to three decomposition levels are considered in this paper). To reduce computational complexity and improve the accuracy of the motion vectors, optical flow is computed on the luminance channel only. Therefore, each image requires only three optical flow computations at the different scales considered. Optical flow estimation may be prone to inaccuracies in the presence of rapid scene motion. To some extent, the proposed HVS architecture is resilient to such errors as it does not require a very precise estimate of motion as long as the algorithm is able to distinguish pixels associated with moving scene points from static scene points, especially when using a binary velocity response function as discussed in Section III-E. Also, summation over the image provides robustness by effectively weighting down the contribution of outlier pixels with inaccurate flow.
2) Multi-Channel Velocity Estimation: Second, the amount of motion in the left and right images at each pixel $p$ for a given perceptual channel $c$ is calculated in order to define the velocity maps. These are both scale and orientation dependent. The multi-channel image decomposition used in this paper considers three different orientations (horizontal, vertical and diagonal) at three scales and a low resolution residual. Denoting by $e_c$ the unit vector corresponding to the orientation and scale used in the perceptual channel $c$, the left and right velocity components at pixel $p$ in channel $c$ are obtained by projecting the motion vectors $u_l(p, c)$ and $u_r(p, c)$ onto $e_c$ in the case of the perceptual channels representing scale and orientation. This is illustrated in Fig. 5. For the last channel representing the low resolution residual, the velocity components in the left and right images are given by the magnitude of the motion vectors. The unit vectors $e_c$, with respect to which motion is measured, define the three orientations (horizontal, vertical and diagonal) and the three scales, with scale halved when moving from one level to the next. A total of only nine projections and one magnitude are required to compute all the velocity components for a given image. This avoids separate computation of the velocity components for the luminance and chrominance channels, which share the same velocity maps.

Fig. 5. [...] $V_l(p, c)$ and $V_r(p, c)$. For clarity, all indices have been omitted in the figure. One example is provided for each orientation in the decomposition.
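The nine projections and one magnitude mentioned above can be illustrated with a minimal Python sketch for a single pixel. It is an illustration under stated assumptions, not the authors' implementation: each velocity component is assumed to be the absolute value of the projection of the per-scale flow vector onto a unit direction, the diagonal direction is assumed to be $(1/\sqrt{2}, 1/\sqrt{2})$, and the flow used for the residual channel is passed in separately. The names `DIRECTIONS` and `velocity_components` are hypothetical.

```python
import math

# Assumed unit directions for the three orientations (hypothetical choice).
DIRECTIONS = {
    "horizontal": (1.0, 0.0),
    "vertical": (0.0, 1.0),
    "diagonal": (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)),
}


def velocity_components(flows_per_scale, residual_flow):
    """Return one velocity value per perceptual channel for a single pixel.

    flows_per_scale: list of (ux, uy) flow vectors, one per scale.
    residual_flow: (ux, uy) flow vector used for the low resolution residual.
    """
    components = {}
    for scale, (ux, uy) in enumerate(flows_per_scale, start=1):
        for name, (ex, ey) in DIRECTIONS.items():
            # Assumption: amplitude = |projection of the flow onto the direction|.
            components[(scale, name)] = abs(ux * ex + uy * ey)
    rx, ry = residual_flow
    # Residual channel: magnitude of the flow vector, with no orientation.
    components[("residual", "magnitude")] = math.hypot(rx, ry)
    return components


# Toy flow vectors at three scales, with amplitude halved at each level.
v = velocity_components([(3.0, 4.0), (1.5, 2.0), (0.75, 1.0)], (0.75, 1.0))
```

With three scales this yields ten values per pixel, matching the 3N + 1 subbands of the decomposition for N = 3.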

E. Velocity Response Function
The velocity response of a given type of complex cell is represented using the velocity response function h. The proposed formulation is generic and versatile in that it is able to model the behaviour of both non-motion sensitive and motion sensitive complex cells including the various types of responses discussed in the literature. This paper considers the most common type of motion sensitive complex cells which are known to behave as high pass filters. The exact profile of the velocity response function for this type of complex cell is unclear. Hence two different models are considered: a binary velocity response model and a linear velocity response model as illustrated in Fig. 6.
The binary velocity response model uses a binary velocity response function $h_{bin}$ to reject pixels with perceived velocity falling below a given threshold $V_{bin}$ and accept all other pixels with equal weight. This threshold is determined empirically as discussed in Section V-A. It is defined as follows: In contrast, the linear velocity response model uses a linear velocity response function $h_{lin}$ to reject pixels with small motion while weighting linearly the contribution of pixels exceeding the minimum threshold $V_{lin}$. It is defined as follows: The binary velocity response function is a simple yet effective way of distinguishing moving scene points from static ones, with the merit of being resilient to inaccuracies in optical flow estimation since it does not consider the exact amplitude of a given velocity component as long as it exceeds the minimum threshold. The linear velocity response function provides a finer-grained analysis by taking into account the actual motion amplitude when computing the binocular energy, but may be more sensitive to inaccuracies in optical flow estimation. Both models will be investigated to build the stereoscopic video quality metrics in Section V.
Both models require the use of a minimum motion threshold which must be appropriately set. The threshold needs to be larger than the noise level present in the input video and smaller than the level at which object motion becomes noticeable. This threshold must also be able to mitigate any noise in the measured velocity and to represent the lower limit of motion sensitivity in the velocity response. In this paper, a common threshold of 3 pixels, measured at the resolution of the first decomposition level (highest resolution image), was used for both models. The same threshold is used at all decomposition levels since scale changes are accounted for by appropriate changes in the unit vectors $e_c$ with respect to which motion vectors are expressed. Further, to address the variability in input video resolution, all input frames are normalised to a 512 × 512 pixel resolution, with reference to which the threshold is defined. Further discussion and analysis of the effect of the threshold and justification of the choice of value is provided in Section V-A.
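The three velocity response functions can be sketched in Python as follows. This is a hedged illustration only: the constant response and the rejection of velocities below the threshold follow the text, whereas the accepted weight of 1 in `h_bin` and the exact linear profile in `h_lin` (here simply proportional to the velocity above the threshold) are assumptions. The default of 3 pixels is the threshold reported in this paper.

```python
def h_still(v):
    """Constant response of a non-motion sensitive complex cell."""
    return 1.0


def h_bin(v, threshold=3.0):
    """Binary response: reject velocities below the threshold.

    Assumption: accepted pixels all receive a weight of 1.
    """
    return 0.0 if v < threshold else 1.0


def h_lin(v, threshold=3.0):
    """Linear response: reject velocities below the threshold.

    Assumption: accepted pixels are weighted by the velocity itself.
    """
    return 0.0 if v < threshold else v


# Motion response values for a few velocity amplitudes (in pixels).
velocities = [0.5, 2.9, 3.0, 7.5]
response_bin = [h_bin(v) for v in velocities]
response_lin = [h_lin(v) for v in velocities]
```

Under these assumptions, the binary function only separates moving from static pixels, while the linear function also carries the motion amplitude into the energy computation.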

IV. MOTION SENSITIVE BINOCULAR ENERGY QUALITY METRIC (MSBEQM)
This section introduces two full reference motion-sensitive stereoscopic video quality metrics based on the generalised HVS model introduced in the previous section.

A. Normalised Motion-Sensitive Objective Scores
Let us consider a given frame $t$ in a video sequence. The generalised HVS model can be used to calculate two sets of objective scores: $E_{still}(t)$, representing the behaviour of non-motion sensitive complex cells and obtained using a constant velocity response function $h_{still}(V) = 1$ for any $V$, and $E_{motion}(t)$, representing the behaviour of motion sensitive complex cells and obtained using velocity response function $h_{motion}$ (two types of velocity response functions $h_{bin}$ and $h_{lin}$ are considered). These can be concatenated into a single vector $E(t) = [E_{still}(t), E_{motion}(t)]$ containing all energy terms. Similarly, the non-motion sensitive and the motion sensitive objective scores for the same frame of the reference stereoscopic pair can be calculated and concatenated into a vector $E_{ref}(t)$ with the same dimension as $E(t)$. $E_{ref}(t)$ and $E(t)$ are then combined to compute a vector $X(t) = [X_1(t), X_2(t), \ldots, X_n(t)]$ of normalised objective scores for the given frame, defined in (11), where the symbol / denotes an element-wise vector division. Further, a vector $Z(t) = [X_1(t) \cdot X_2(t), \ldots, X_1(t) \cdot X_n(t), X_2(t) \cdot X_3(t), \ldots, X_2(t) \cdot X_n(t), \ldots, X_{n-1}(t) \cdot X_n(t)]$ is also introduced. It consists of the product of all pairs of normalised objective scores in $X(t)$ and is used to implement the REM which allows complex cells to mutually interact and modulate their outputs. Unlike the previous work, in which this effect was limited to non-motion sensitive complex cells, this allows different types of complex cells to modulate their outputs.
For an N-level spatial frequency decomposition in the simple cell model, 3N + 1 different spatial frequency subbands are obtained considering 3 orientations as illustrated in Fig. 1. Separate analyses are carried out on the luminance (L*) and the two chrominance channels (a* and b*), resulting in 2 × 3 × (3N + 1) objective scores (half of them representing SUM-like operations, while the other half modelling MAX-like operations). The proposed model considers both motion sensitive and non-motion sensitive objective scores thus doubling the number of objective scores and resulting in a total of n = 12 × (3N + 1). In this paper, N = 3 was used, as in [2], [8], which results in a total of n = 120 objective scores for the proposed MSBEQM (as opposed to 30 for the BEQM and 60 for the EBEQM) for each frame. The increased dimensionality when considering each frame poses a major challenge in terms of extracting a reliable model via regression. The remainder of this section introduces a robust approach which makes use of dimensionality reduction to build a metric.
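The bookkeeping behind these counts, together with the normalisation and the pairwise products used for the REM, can be sketched in a few lines of Python. The sketch assumes that the normalised scores are obtained by dividing each energy term of the assessed frame by the corresponding reference term; the direction of this element-wise division and all function names are assumptions made for illustration.

```python
from itertools import combinations


def num_objective_scores(levels):
    """n = 12 x (3N + 1): 3 colour channels x 2 operations (SUM, MAX)
    x 2 cell types (still, motion) x (3N + 1) subbands."""
    return 12 * (3 * levels + 1)


def normalised_scores(e, e_ref):
    """Element-wise division (assumed direction: assessed / reference)."""
    return [a / b for a, b in zip(e, e_ref)]


def recurrent_products(x):
    """Products of all pairs X_i * X_j with i < j, in the order used for Z(t)."""
    return [xi * xj for xi, xj in combinations(x, 2)]


n = num_objective_scores(3)            # number of objective scores for N = 3
x = normalised_scores([2.0, 3.0, 1.0], [4.0, 6.0, 4.0])
z = recurrent_products(x)              # pairwise products for the toy vector
n_pairs = n * (n - 1) // 2             # number of recurrent variables for n scores
```

For n = 120 the recurrent vector contains 7140 pairwise products, which gives a sense of why dimensionality reduction is needed before regression.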

B. Motion Response Weighted Temporal Pooling
The objective scores for the different frames are combined via temporal pooling based on Minkowski summation, which was found to be the most effective in [8] in the case of non-motion sensitive objective scores. Unlike the approach proposed in [8], which equally weights the contribution of all frames, a motion response weighted temporal pooling approach is proposed here. The key idea is to measure the extent to which a particular type of motion is represented within each frame and use this information to inform the choice of pooling weight. This measure of motion at a given frame is referred to hereafter as the motion support. More formally, the motion support at frame $t$ for a given perceptual channel $c$ is defined as the number of pixels with non-zero motion response, that is the number of pixels $p$ such that $h(V_l(p, c)) > 0$ in the case of the left view (and similarly in the case of the right view). Denoting the motion support by $w_h(t)$, the sequence of values taken by the $i$-th objective score over time, $X_i(t)$ with $t = 1, \ldots, f$, is pooled into a single motion-response weighted objective score $\bar{X}_i$. Here, $\beta$ is the Minkowski parameter, set to 0.66. This value was found to be optimal in the case of the BEVQM and was observed to work well in the generalised model proposed here. This generalises the previous approach by taking into account the motion support at each frame for a particular type of motion response. It should be noted that in the case of a non-motion sensitive complex cell, the generalised approach is equivalent to the traditional temporal pooling approach since the number of pixels with non-zero motion response is constant and equal to the total number of pixels.
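A minimal Python sketch of motion response weighted Minkowski pooling is given below. The exact pooling formula is not reproduced here, so the sketch assumes a weighted Minkowski mean of the form $(\sum_t w_h(t) X_i(t)^\beta / \sum_t w_h(t))^{1/\beta}$; this form, the handling of zero total support, and the function name are assumptions. Scores are assumed to be non-negative.

```python
def motion_weighted_minkowski(scores, supports, beta=0.66):
    """Pool per-frame scores X_i(t) using per-frame motion supports w_h(t).

    Assumed form: (sum_t w(t) * X(t)**beta / sum_t w(t)) ** (1 / beta).
    """
    total = sum(supports)
    if total == 0:
        # Assumed convention: no frame exhibits this motion response.
        return 0.0
    acc = sum(w * (x ** beta) for x, w in zip(scores, supports))
    return (acc / total) ** (1.0 / beta)


frame_scores = [1.0, 4.0, 9.0]
uniform = motion_weighted_minkowski(frame_scores, [1, 1, 1])
constant_support = motion_weighted_minkowski(frame_scores, [5, 5, 5])
motion_in_first_frame_only = motion_weighted_minkowski([1.0, 100.0], [10, 0])
```

With a constant motion support, as for a non-motion sensitive complex cell, the weights cancel and the result coincides with unweighted Minkowski pooling, in line with the equivalence noted above.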

C. Motion Sensitive Stereoscopic Video Quality Metric
Having computed the normalised objective scores and pooled them into objective scores reflecting both non-motion and motion sensitive complex cell behaviours, the next step is to identify a relationship expressing the subjective score $Y$ as a function of the pooled objective scores $\bar{X} = [\bar{X}_{still}, \bar{X}_{motion}] = [\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_n]$ and their recurrent variables $\bar{Z} = [\bar{X}_1 \cdot \bar{X}_2, \ldots, \bar{X}_1 \cdot \bar{X}_n, \bar{X}_2 \cdot \bar{X}_3, \ldots, \bar{X}_2 \cdot \bar{X}_n, \ldots, \bar{X}_{n-1} \cdot \bar{X}_n]$.
In [8], only one type of objective score was used. In this paper, however, there are two types of objective scores representing motion sensitivity and non-motion sensitivity. The resulting increase in the number of objective scores (120 in total) poses a challenge in identifying a metric, as it not only considerably increases run-time but also affects the convergence of the regression technique. To address this challenge, a two-stage approach is proposed where regression is first performed separately on the non-motion sensitive and the motion sensitive objective scores to select the meaningful objective scores for each type of motion response. In the second stage, regression is performed on the reduced set of objective scores which have been selected in the first stage. An overview of the approach is given in Fig. 3.

1) Initial Regression and Objective Scores Selection:
First, two separate multi-variate regressions are performed to determine the relationships predicting the subjective score from the non-motion and the motion sensitive coefficients respectively:

$$Y = k_{still} + a_{still} \bar{X}_{still} + b_{still} \bar{Z}_{still} \qquad (13)$$

$$Y = k_{motion} + a_{motion} \bar{X}_{motion} + b_{motion} \bar{Z}_{motion} \qquad (14)$$

In these equations, the vectors $a_{still}$, $b_{still}$, $a_{motion}$ and $b_{motion}$ and the constants $k_{still}$ and $k_{motion}$ denote the regression coefficients for the two models. As the HVS model requires cross relationships among objective outputs to meet the recurrent excitation in the complex cells model, the recurrent objective scores $\bar{Z}_{still} = [\bar{X}_1 \cdot \bar{X}_2, \ldots, \bar{X}_1 \cdot \bar{X}_{n/2}, \bar{X}_2 \cdot \bar{X}_3, \ldots, \bar{X}_2 \cdot \bar{X}_{n/2}, \ldots, \bar{X}_{n/2-1} \cdot \bar{X}_{n/2}]$ and $\bar{Z}_{motion} = [\bar{X}_{n/2+1} \cdot \bar{X}_{n/2+2}, \ldots, \bar{X}_{n/2+1} \cdot \bar{X}_n, \bar{X}_{n/2+2} \cdot \bar{X}_{n/2+3}, \ldots, \bar{X}_{n/2+2} \cdot \bar{X}_n, \ldots, \bar{X}_{n-1} \cdot \bar{X}_n]$ are required. Due to the number of components in the analysis, a suppression technique is required to remove unnecessary terms and thereby stabilise the regression model. Therefore, stepwise linear regression is used in place of ordinary linear regression due to its ability to suppress the least meaningful components from the analysis [46]. The stepwise regression results in models containing a relatively small number of non-zero coefficients (approximately 20 overall). Only these selected objective scores will be considered in the final regression stage, thus significantly reducing the pool of objective scores used in the final regression.
2) Final Regression on Selected Objective Scores: The final regression is performed by considering the selected objective scores for each type of motion. These are typically of significantly smaller size than the complete set of objective scores. The final multi-variate stepwise regression is performed to estimate the following relationship

$$Y = k + a \bar{X} + b \bar{Z} \qquad (15)$$

where the constant $k$ and the vectors $a$ and $b$ are the regression parameters defining the metric. All values in $a$ and $b$ corresponding to non-selected objective scores are constrained to be zero. The two-stage approach results in a metric with a significantly reduced number of non-zero coefficients and a reduction in computation time by several orders of magnitude compared to a classical single-stage approach which does not perform objective score selection. Objective score selection and the obtained metrics will be discussed in Section V.
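The structure of the two-stage selection can be illustrated with the following self-contained Python sketch. It is a simplification, not the stepwise procedure of [46]: it uses forward-only greedy selection, adding a term only if it reduces the residual sum of squares by more than a fraction `tol` of the total sum of squares, and it treats the columns of `X` as generic regressors standing in for both pooled scores and recurrent products. All names, the tolerance and the synthetic data are hypothetical.

```python
import math


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting.
    Collinear columns (singular A) are not handled in this sketch."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w


def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit of y on an
    intercept plus the columns `cols` of X (normal equations)."""
    rows = [[1.0] + [x[c] for c in cols] for x in X]
    k = len(cols) + 1
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Aty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    w = solve(AtA, Aty)
    return sum((t - sum(wi * ri for wi, ri in zip(w, r))) ** 2
               for r, t in zip(rows, y))


def forward_select(X, y, candidates, tol=0.05):
    """Greedy forward selection; assumes y is not constant."""
    selected, remaining = [], list(candidates)
    tss = rss(X, y, [])          # intercept-only fit
    current = tss
    while remaining:
        best = min(remaining, key=lambda c: rss(X, y, selected + [c]))
        new = rss(X, y, selected + [best])
        if (current - new) / tss <= tol:
            break                # no remaining term helps enough
        selected.append(best)
        remaining.remove(best)
        current = new
    return selected


def two_stage_select(X, y, still_cols, motion_cols, tol=0.05):
    # Stage 1: select separately within each group of regressors.
    stage1 = (forward_select(X, y, still_cols, tol)
              + forward_select(X, y, motion_cols, tol))
    # Stage 2: final selection restricted to the stage-1 survivors.
    return forward_select(X, y, stage1, tol)


# Synthetic data: six orthogonal regressors, of which only 0 and 3 matter.
T = 64
X = [[math.sin(2 * math.pi * (j + 1) * t / T) for j in range(6)] for t in range(T)]
y = [2.0 + 3.0 * x[0] - 2.0 * x[3] for x in X]
selected = two_stage_select(X, y, still_cols=[0, 1, 2], motion_cols=[3, 4, 5])
```

In this toy setting, the first stage retains one regressor per group and the second stage keeps both, which mirrors how the selection step shrinks the pool of candidates before the final regression in (15).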

V. RESULTS AND DISCUSSION
The proposed motion sensitive approach is evaluated using the ROMEO project dataset [4] and the publicly available datasets NAMA3DS1-CoSpaD1 [47] and Waterloo 3D-VQA [48]. All of the datasets consist of stereoscopic video sequences together with their associated subjective scores. The characteristics of these datasets are summarised in Table I and sample images from the different sequences from each dataset are shown in our supplementary report [49].
Two separate evaluations are conducted. First, the effects of the different approaches to model motion sensitivity and perform regression are evaluated and discussed. The analysis is performed in a leave-one-out fashion on the combined NAMA3DS1-CoSpaD1 and ROMEO datasets. Then, two motion sensitive binocular energy video quality metrics are built and evaluated against existing metrics to validate the importance of accounting for motion sensitivity. This is the core part of the evaluation which is performed by training and validating models on the NAMA3DS1-CoSpaD1 dataset and then evaluating the models independently on the ROMEO and Waterloo 3D-VQA datasets.

A. Evaluation of the Effects of Motion Sensitivity
We proposed different approaches to model motion sensitivity depending on the motion model used and the regression approach performed on the combined set of objective scores. This results in four possible combinations of methods. To better understand the effects of the different objective scores, the models built from purely non-motion sensitive and purely motion-sensitive objective scores are also evaluated. The following seven models are therefore considered:
• NoMo: Regression on only the non-motion sensitive objective scores (similar to the EBEQM),
• MoBin: Regression on only the motion sensitive objective scores with binary velocity response function,
• MoLin: Same as above with linear velocity response function,
• ComBin: Single-stage regression on combined non-motion sensitive and motion sensitive objective scores with binary velocity response function,
• ComLin: Same as above with linear velocity response function,
• SelBin: Two-stage regression on combined non-motion sensitive and motion sensitive objective scores with binary velocity response function,
• SelLin: Same as above with linear velocity response function.
All approaches are evaluated on the combined NAMA3DS1-CoSpaD1 and ROMEO datasets in a leave-one-out fashion where each sequence is excluded in turn and used for testing purposes while the other sequences are used for training.
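The leave-one-out protocol can be sketched in Python as follows. The `fit` and `predict` callables are hypothetical placeholders for whichever regression model is being evaluated; the toy model used in the example simply predicts the mean training score and is not one of the seven models above.

```python
def leave_one_out_predictions(samples, fit, predict):
    """samples: list of (features, subjective_score) pairs, one per sequence.

    Each sequence is held out in turn; a model is fitted on the remaining
    sequences and used to predict the score of the held-out one.
    """
    predictions = []
    for i, (features, _) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        model = fit(training)
        predictions.append(predict(model, features))
    return predictions


def fit_mean(training):
    """Toy model (assumption): the mean subjective score of the training set."""
    return sum(score for _, score in training) / len(training)


def predict_mean(model, features):
    """The toy model ignores the features and returns the stored mean."""
    return model


toy_samples = [((0.1,), 1.0), ((0.2,), 2.0), ((0.3,), 3.0)]
loo = leave_one_out_predictions(toy_samples, fit_mean, predict_mean)
```

The resulting list contains one prediction per sequence, each produced by a model that never saw that sequence, and can then be correlated with the subjective scores.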
First, to understand the effect of the choice of threshold value used in the velocity response function and to select an optimal value, the performances of the MoBin and MoLin methods are evaluated for different threshold values ranging from 1 to 10 pixels. For each value, the correlation between objective and subjective scores is used to measure the ability of the velocity response function to predict quality of experience based on motion alone. Results, shown in Fig. 7, indicate that a threshold of 3 pixels is optimal for both types of response functions and is therefore used in the rest of the paper.
Next, the performance of the different methods is evaluated by calculating the Pearson's Linear Correlation Coefficient (PLCC) between predicted scores and subjective scores over the entire set of test sequences. The performance of each method is shown in Fig. 8. Considering first the effect of the velocity response function, one can observe that the binary motion response model usually outperforms the linear model. The binary motion sensitive objective scores considered on their own appear to be better predictors than their linear counterpart as well as the non-motion sensitive objective scores as can be seen when comparing the performance of MoBin against MoLin and NoMo. This suggests that they are the most important type of objective scores. When combined with non-motion sensitive objective scores, the binary model remains a better predictor than the linear model, although the difference between the two becomes marginal. As for the effect of the regression method, it can be observed that the two-stage approach significantly improves performance compared to the single-stage approach for both types of velocity response functions. This can be attributed to improved convergence resulting from objective score selection. The detrimental effects of high dimensionality are evidenced by the poor performance of the single-stage regression approaches (ComBin and ComLin) which perform more poorly than the non-motion sensitive model alone (NoMo).
Finally, Table II provides some information on the average size of the models built by the two regression approaches together with their computational time. The single-stage regression approach performs regression over 120 objective scores which results in models with large numbers of objective scores (over 40) and slow convergence (over 2000 s). In contrast, the two-stage approach enables selection of only a small number of coefficients (about 20) in the first stage which translates into significantly smaller final models (less than 10 coefficients) and reduces computation time by two orders of magnitude (less than 20 s for the combined two stages).
Overall, the models using either a binary or a linear velocity response function with a two-stage selective regression method are the top performers, resulting in the highest correlation scores while at the same time being significantly more compact and faster to compute. They are therefore the methods of choice that will be used in Section V-B to build the metrics.

B. Metric Construction and Evaluation
To build the metrics and evaluate their performances, the NAMA3DS1-CoSpaD1 dataset [47] is used for training and validation, while the ROMEO and Waterloo 3D-VQA datasets are used for testing. This ensures that there is no overlap between training and testing datasets and enables evaluation under diverse datasets covering a broad range of scenes and motion activity levels. The NAMA3DS1-CoSpaD1 dataset is split into a training set (9 sequences) and a validation set (1 sequence) to build different models. The model achieving the best performance on the validation set is used to build the metric. Metrics are then evaluated by calculating the Pearson's Linear Correlation Coefficient (PLCC), Spearman's Rank Correlation Coefficient (SRCC) and Kendall's Rank Correlation Coefficient (KRCC) between the predicted scores and the subjective scores over the testing datasets.
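For reference, the three correlation coefficients can be computed with the short pure-Python sketch below. It is a generic illustration rather than the evaluation code used in this study: SRCC is computed as the Pearson correlation of average ranks, and KRCC is implemented as Kendall's tau-a, which is an assumption since the variant used is not specified here. The toy score vectors are hypothetical.

```python
import math


def plcc(x, y):
    """Pearson's linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def srcc(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return plcc(ranks(x), ranks(y))


def krcc(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.
    Tied pairs count as neither concordant nor discordant."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
subjective = [2.0, 1.0, 4.0, 3.0, 5.0]
p, s, k = plcc(predicted, subjective), srcc(predicted, subjective), krcc(predicted, subjective)
```

For this toy example, in which two adjacent pairs are swapped, PLCC and SRCC are both 0.8 and KRCC is 0.6, illustrating that KRCC typically takes lower values than the other two coefficients for the same data.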
Two variants of motion sensitive metric are proposed depending on whether a binary or a linear velocity response function is used. These are evaluated against state-of-the-art quality metrics. More specifically, the proposed metrics are compared against an image quality metric [28], a stereoscopic image quality metric [50], a video quality metric [51] and six stereoscopic video quality metrics [4], [8], [31], [33], [34]. The complete set of quality metrics evaluated and their key characteristics are listed as follows:
• SSIM: based on luminance, contrast and structural comparison [28],
• SSIM_Ddl: based on a global 2D image distortion measure and differences in disparity maps of stereo pairs [50],
• VQM: standardised method for objective evaluation of video quality [51],
• StSD: based on structural distortion, asymmetric blur and content complexity [4],
• PQM: based on distortions in luminance and contrast [31],
• 3D-STS: based on spatio-temporal structure [33],
• SJND-SVA: based on visual attention and just-noticeable difference models [34],
• BEVQMμ: based on a non-motion sensitive HVS model with temporal pooling using averaging [8],
• BEVQMβ: same as above with temporal pooling using Minkowski summation [8],
• MSBEQM bin : based on the proposed motion-sensitive HVS model with a binary velocity response function and two-stage regression,
• MSBEQM lin : same as above with a linear velocity response function.
In the case of monoscopic quality metrics such as SSIM and VQM, stereoscopic quality scores are obtained by applying the quality metric separately to the left and right inputs and then averaging the left and right quality scores obtained. For the image quality metrics, mean temporal pooling is also performed to obtain a score for the entire video. Fig. 9 shows the results obtained for the different quality metrics and for each sequence from the ROMEO and Waterloo 3D-VQA datasets based on PLCC. Further, Fig. 10 shows the average correlation on each dataset as well as the overall performance based on PLCC, SRCC and KRCC. In all cases, top performance has been highlighted in bold. Scatter plots showing the distribution of predicted scores against subjective scores are also provided in our supplementary report [49].

The image quality metrics, SSIM and SSIM_Ddl, are the two worst performers with average scores significantly lower than any video quality metric tested, whether stereoscopic or not. This confirms the importance of considering temporal information. The monoscopic video quality metric VQM performs less well than the binocular video metrics considered, except for StSD which performs less well on these datasets. This demonstrates the benefit of accounting for binocular visual effects to achieve a high performance stereoscopic video quality assessment.
Considering now the performance of the different stereoscopic video quality metrics, it can be observed that the top four performers are all based on HVS models, being either variants of the BEVQM or of the proposed MSBEQM. Looking more closely at the performance of these four methods, it can be seen that the two variants of the proposed MSBEQM metric outperform their non-motion sensitive BEVQM counterparts by a significant margin. These results confirm the importance of modelling the motion sensitivity of the HVS when devising a stereoscopic video quality metric.
The two variants of the MSBEQM perform similarly, with MSBEQM bin and MSBEQM lin achieving average correlations of 0.9257 (PLCC), 0.9338 and 0.9120 (SRCC), 0.8622 and 0.8306 (KRCC). This is in agreement with the results shown in Section V-A which suggested that the two models have similar performance. For the 'Beergarden2' sequence, a reduction in performance can be noted for MSBEQM lin compared to MSBEQM bin ; this may be due to the high level of texture detail present in the video, which may lead to complex optical flows.
The complete lists of non-zero coefficients for the MSBEQM bin and MSBEQM lin are given in Table III. These specify all the regression parameters defining the metric in (15). The listed coefficients specify the constant k and the non-zero entries of the vectors a and b identified by their indices. The coefficient indices in the range 1 to 60 refer to non-motion sensitive coefficients, while indices in the range 61 to 120 refer to motion sensitive coefficients. It can be observed that the MSBEQM bin and MSBEQM lin present some similarities in terms of the coefficients that are selected with, in particular, the third motion sensitive objective score playing the most significant role in both models.

VI. CONCLUSION AND FUTURE WORK
This paper introduced a motion sensitive HVS model based on physiological observations describing the response of complex cells to motion. The approach is based on a generalised model of complex cells whose behaviour is defined by a velocity response function. Depending on the choice of velocity response function, the model is able to describe the behaviour of a wide range of complex cells including both non-motion sensitive and motion sensitive types. Although this paper only implemented the predominant type of motion sensitive complex cells, known to behave as high pass filters, the approach is generalisable to other types of complex cells. This paper has demonstrated an application of the proposed model in stereoscopic content production. The model was used to define binocular energy terms capturing the non-motion sensitive and motion sensitive characteristics of each video frame. Temporal pooling and a two-stage regression approach were introduced to reduce dimensionality and improve the efficiency and accuracy of the estimation of a stereoscopic video quality metric. Two variants were proposed depending on whether a binary or a linear velocity response function is used to describe the behaviour of motion sensitive complex cells. Both metrics were evaluated on three stereoscopic video datasets containing a wide range of scenes and motion activity levels. The evaluation has shown that the two proposed metrics perform better than existing stereoscopic video quality metrics, including other HVS-based metrics, and are able to achieve average correlations with subjective scores of 0.9257 (PLCC), 0.9338 and 0.9120 (SRCC), 0.8622 and 0.8306 (KRCC).
Further advances in understanding the physiology of the HVS are likely to open up new avenues to extend this research. For instance, a better understanding of the proportion of complex cells with motion sensitivity and more precise models of their velocity response functions would help increase the accuracy of the proposed approach. The present study assumed a common velocity threshold for all complex cells; however, it may be beneficial to introduce cells with a variety of threshold values to better capture the effects of scene motion amplitude.
Furthermore, incorporating complex cells with different motion sensitivity responses such as low pass filter and band pass filter or even other types of non-linear velocity responses has the potential to increase the performance of the model and resulting metric. However, this is likely to also open up new computational challenges as the number of objective scores increases. Another interesting avenue for future research would be to extend the model by incorporating physiological findings modelling the response of other parts of the brain beyond the V1 area.
Another research direction would be to extend the approach by treating the stereoscopic video input as a 3D signal instead of two separate video streams, applying 3D image processing techniques such as the 3D transform to derive the objective scores. In this approach, motion sensitivity may be incorporated using scene flow instead of optical flow. Finally, it would be interesting to investigate the use of the proposed model in other application domains such as 3D video compression.

Janko Calic (Member, IEEE) is currently a Visiting Lecturer with the Centre for Vision, Speech and Signal Processing, University of Surrey, and a Senior Research and Development Engineer with the BBC Research and Development. His main areas of expertise are QoE in video systems and user aspects of multimedia communications and HCI. He regularly reviews research in the area of multimedia signal processing and quality of experience in multimedia systems for the leading international funding bodies and publishers.
Safak Dogan (Senior Member, IEEE) is currently a Senior Lecturer in multimedia technologies with the Institute for Digital Technologies, Loughborough University London. His main areas of expertise include 2D/3D digital media processing, media adaptation and delivery, transcoding, multimedia communication systems and networks, and media quality assessments. His recent research focuses on media clouds, smart, and autonomous systems. He has managed various EU-funded multinational collaborative research projects.
Jean-Yves Guillemaut (Member, IEEE) is currently a Senior Lecturer in 3D computer vision with the Centre for Vision, Speech and Signal Processing, University of Surrey. His main areas of expertise include 3D reconstruction, multimodal registration, camera calibration, free-viewpoint video, and stereoscopic content production. His current research focuses on developing novel video-based modeling techniques for the reconstruction of outdoor scenes and scenes with complex surface reflectance properties.