Discriminative and Geometrically Robust Zero-Watermarking Scheme for Protecting DIBR 3D Videos

Copyright protection of depth image-based rendering (DIBR) 3D videos is crucial due to the popularity of these videos. Despite the success of recent watermarking schemes, it is still challenging to ensure robustness against strong geometric attacks when both lossless quality and distinguishability of protected videos are required. In this paper, we propose a novel zero-watermarking scheme to improve the performance under strong geometric attacks while satisfying the other two requirements. In our scheme, CT-SVD-based features are extracted to ensure both distinguishability and robustness against signal processing and DIBR conversion attacks, while a SIFT-based rectification mechanism is designed to resist geometric attacks. Further, an attention-based fusion strategy is proposed to complement the robustness of rectified and unrectified CT-SVD features. Experimental results demonstrate that our scheme outperforms the existing zero-watermarking schemes in terms of distinguishability and robustness against strong geometric attacks such as rotation, cyclic translation and shearing.


INTRODUCTION
Three-dimensional (3D) videos provide better immersive experiences to viewers than traditional 2D videos; thus, they have become more and more popular in entertainment [1,2,3]. One typical format for storage and online distribution of 3D videos is the stereoscopic format, which contains two different 2D views captured by two parallel cameras for each frame. Another commonly used format is the depth image-based rendering (DIBR) format, which consists of a 2D view with its depth map for each frame. Compared with the stereoscopic format, the DIBR format saves storage and transmission bandwidth by efficiently compressing the depth map and thus is more widely adopted [2,3].

These authors contributed equally to this work. * To whom correspondence should be addressed. This research is supported by the National Natural Science Foundation of China (61602527, 61772555, 61772553, U1734208), the Natural Science Foundation of Hunan Province, China (2020JJ4746), and the Guangxi Key Laboratory of Trusted Software (KX202032).
Copyright protection of DIBR 3D videos is important. Compared with the protection of traditional 2D videos, it requires a unique characteristic: resistance to DIBR conversions. Simply put, watermarks need to be extractable from both the DIBR format and its converted stereoscopic format, because attackers can easily convert a DIBR format into a stereoscopic format for illegal distribution. Recently, many digital watermarking schemes have been designed exclusively for DIBR 3D videos, which can be categorized into three main classes based on their embedding strategies: (1) 2D-frame-based watermarking schemes exploit the DIBR-invariant characteristics of 2D frames for watermark embedding [4,5,6,7,8]; (2) depth-map-based watermarking schemes embed watermarks into depth maps based on unseen-visible or reversible watermarking [9,10,11]; and (3) zero-watermarking schemes utilize the relationship between 3D video features and the watermark information for copyright identification without direct watermark embedding [12,13,14,15].
Despite the success of the above-mentioned watermarking schemes, it is still challenging to ensure robustness, especially against geometric attacks, when both lossless quality and distinguishability of protected videos are required. The 2D-frame-based watermarking achieves sufficient robustness and distinguishability but causes irreversible distortion of image quality. The depth-map-based watermarking keeps the synthesized 3D videos distortion-free but is not sufficiently robust. Although the lossless zero-watermarking outperforms the other two categories of schemes in terms of both video quality and watermark robustness, none of the state-of-the-art zero-watermarking schemes ensures sufficient robustness against strong geometric attacks, such as rotation, cyclic-translation and shearing, while achieving distinguishability at the same time.
To address this challenge, in this paper we propose a novel zero-watermarking scheme based on contourlet transform (CT)-singular value decomposition (SVD) features and scale-invariant feature transform (SIFT)-based rectification for protecting DIBR 3D videos. In our proposed scheme, CT-SVD is performed on the 2D frames of DIBR 3D videos to extract discriminative and robust features, and a SIFT-based rectification is used to guarantee the watermarking robustness against strong geometric attacks. In addition, an attention-based fusion is designed to exploit the complementary robustness of both the rectified and the unrectified CT-SVD features, which further improves the performance of copyright identification.
Our contributions are highlighted as below:
• A novel zero-watermarking scheme is proposed to offer lossless copyright protection of DIBR 3D videos.
• CT-SVD features are designed to ensure not only the distinguishability but also the robustness against signal processing and DIBR 3D conversion attacks.
• A SIFT-based rectification mechanism is established to resist strong geometric attacks, such as large-degree rotation, large-scale cyclic-translation and shearing.
• An attention-based fusion is designed to offer an optimal copyright protection solution by exploiting the complementary robustness of rectified and unrectified CT-SVD features.
• Comprehensive experimental results demonstrate the superiority of the proposed scheme over state-of-the-art zero-watermarking schemes.

PROPOSED SCHEME
Our proposed zero-watermarking scheme includes two phases, which are a copyright registration phase and a copyright identification phase, as shown in Fig. 1.

Copyright registration phase
In this phase, 2D frame features of DIBR 3D videos are extracted and their ownership shares, representing the relationship between these features and corresponding watermarks, are created and stored in a certificate authority (CA) database for copyright identification.

Extraction of 2D frame feature
In our scheme, robust and discriminative features of 2D frames are extracted based on CT-SVD. The detailed steps are as follows. Firstly, the 2D frames are normalized based on spatio-temporal smoothing and resampling. Secondly, the normalized frames are averaged in the temporal domain to construct the temporally informative representative images (TIRIs). Thirdly, a three-level contourlet transform (CT) [16] is applied on non-overlapping blocks of the TIRIs, and the 6th and 7th directional subbands of the 2nd-level CT domain are selected. There are two reasons for this selection: (1) the coefficients in the 2nd-level CT domain ensure both the distinguishability and the robustness against signal processing attacks, and (2) the 6th and 7th directional subbands mainly contain horizontal edges and contours, as shown in Fig. 2, which are more robust against DIBR conversion. Fourthly, SVD is performed on the selected subbands and the first singular value in each diagonal matrix is selected. Finally, the feature of a DIBR 3D video is generated by binarizing these singular values of the different image blocks of the TIRIs based on their mean value.
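The block-wise SVD and binarization steps can be sketched as follows. For brevity the contourlet decomposition and subband selection are omitted, with the raw block standing in for the selected 2nd-level subband coefficients; this is a simplification for illustration, not the paper's exact feature.

```python
import numpy as np

def ctsvd_feature(tiri, grid=8):
    """Sketch of the CT-SVD feature for one TIRI: take the largest
    singular value of each block in a grid x grid partition, then
    binarize all values by their mean. The raw pixel block stands in
    for the selected contourlet subband coefficients."""
    h, w = tiri.shape
    bh, bw = h // grid, w // grid
    sv = []
    for i in range(grid):
        for j in range(grid):
            block = tiri[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            # first (largest) singular value of the block
            sv.append(np.linalg.svd(block, compute_uv=False)[0])
    sv = np.array(sv)
    return (sv > sv.mean()).astype(np.uint8)  # binarize by the mean value
```

With an 8 × 8 grid this yields 64 bits per TIRI.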

Generation of ownership share
After the feature extraction, an ownership share is generated by XORing the extracted feature and the binary watermark. The generated ownership share is then stored into the CA database for copyright identification.
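The share generation and its later use in identification reduce to two XOR operations; the arrays below are illustrative stand-ins for the extracted feature and the binary watermark.

```python
import numpy as np

# Hypothetical 1600-bit feature and binary watermark (values are illustrative).
rng = np.random.default_rng(0)
feature = rng.integers(0, 2, 1600, dtype=np.uint8)
watermark = rng.integers(0, 2, 1600, dtype=np.uint8)

ownership_share = feature ^ watermark       # registration: XOR, then store in the CA database
recovered = ownership_share ^ feature       # identification: XOR with the re-extracted feature
assert np.array_equal(recovered, watermark)  # lossless recovery when the feature is unchanged
```

Note that no data is embedded in the video itself, which is why the scheme is lossless.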

Copyright identification phase
In this phase, two rectified features and one unrectified feature are extracted from the 2D frames of a queried DIBR video. XOR operations are performed between these features and the corresponding ownership share to recover three watermarks. Finally, an attention-based fusion is utilized to exploit the complementary robustness of the rectified and unrectified features for copyright identification. The details of the procedure are given below.

Extraction of rectified feature
To resist geometric attacks, a SIFT-based rectification mechanism, including rotation, translation, and shearing rectifications, is designed. Firstly, we normalize the 2D frames of the queried DIBR video and those of the DIBR videos stored in the database. Then, we extract the SIFT points in the normalized 2D frames of both the queried and stored DIBR videos. Next, we match the SIFT points of the queried DIBR video to those of the stored ones. After that, we perform geometric rectifications in two parallel channels, as shown in Fig. 1(b). The first channel is a sequential connection of a rotation rectification and a translation rectification, while the second is a single shearing rectification. The three rectifications are described below.
The rotation rectification rotates the normalized 2D frames according to the factor expressed in eq.(1), where Δθ is the rotation rectification factor, v_q(i) and v_s(i) are the vectors obtained from the i-th pair of matched SIFT points in the 2D frames of the queried and stored videos, respectively, and N is the total number of matched SIFT points. The translation rectification moves the pixels in the normalized 2D frames according to the factors expressed in eq.(2), where Δx and Δy are the translation rectification factors. For the shearing rectification, we calculate the rectification factors of the shearing transformation according to eq.(4), and then rectify the normalized 2D frames by substituting the obtained factors into the shearing model in eq.(3).
Specifically, the translation rectification is performed after the rotation rectification because cyclic-translation attacks do not affect the result of the rotation rectification, whereas rotation attacks would lead to an incorrect translation rectification. Moreover, the shearing rectification is separated from the other two rectifications to avoid incorrect rectification caused by the interaction of shearing with the other two attacks.
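As one plausible reading of the rotation factor in eq.(1) (the exact formula is not reproduced in this excerpt), Δθ can be estimated as the mean signed angle between the matched SIFT vectors of the queried and stored frames:

```python
import numpy as np

def rotation_factor(vq, vs):
    """Estimate the rotation rectification factor as the mean signed
    angle between the i-th matched SIFT vectors of the queried (vq)
    and stored (vs) frames, each given as an (N, 2) array. This is an
    illustrative reading of eq.(1), not the paper's verbatim formula."""
    ang_q = np.arctan2(vq[:, 1], vq[:, 0])
    ang_s = np.arctan2(vs[:, 1], vs[:, 0])
    diff = ang_q - ang_s
    # wrap each angle difference into (-pi, pi] before averaging
    diff = (diff + np.pi) % (2 * np.pi) - np.pi
    return diff.mean()
```

Averaging over all N matched pairs makes the estimate tolerant of a few mismatched SIFT points.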
After the geometric rectifications, a rotation-translation rectified feature, denoted F1, and a shearing rectified feature, denoted F2, are extracted following the same extraction steps as in our copyright registration phase.

Extraction of unrectified feature
The CT-SVD feature of the unrectified 2D frames, denoted F3, is also extracted following the same steps as in the copyright registration phase. In this manner, the watermarking robustness can be further improved because some attacks, such as noise addition and DIBR conversion, may affect the results of SIFT point matching or rectification factor calculation.

Attention-based fusion
In our scheme, an attention-based fusion is designed to further enhance the performance of copyright identification by exploiting the complementary robustness of all the rectified and unrectified features. Three watermarks are recovered by XORing these features with the corresponding ownership share of each stored video, and three bit error rates (BERs) between the recovered watermarks and the original watermark are calculated. These BERs are fused by eq.(5) and eq.(6) to simultaneously satisfy the heterogeneity and monotonicity properties expressed in eq.(7). If any fused BER is smaller than a heuristic threshold, the queried DIBR 3D video is treated as an illegal copy of the video corresponding to this fused BER.
where f is the attention-based fusion function, BER1 and BER2 are the BERs to be fused, and ε is a constant that is set to 0.01 empirically in our study. BER1 and BER2 are obtained from the rectified features, BER3 is obtained from the unrectified feature, and BERf is the fusion result.
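Since the exact forms of eqs.(5)-(7) are not reproduced in this excerpt, the sketch below substitutes a deliberately simple monotone fusion (the plain minimum of the three BERs) together with a hypothetical decision threshold; it illustrates the decision flow, not the paper's actual fusion function.

```python
def fuse_bers(ber1, ber2, ber3, threshold=0.2):
    """Illustrative stand-in for the attention-based fusion. ber1 and
    ber2 come from the rectified features, ber3 from the unrectified
    one. The plain minimum is monotone in each input; the threshold
    value here is hypothetical, not taken from the paper."""
    fused = min(ber1, ber2, ber3)
    return fused, fused < threshold  # True means: treat as an illegal copy

# e.g. rotation attack broke the unrectified feature, but a rectified one survived
fused, is_copy = fuse_bers(0.31, 0.05, 0.28)
```

Taking the best of the three channels is what lets the rectified and unrectified features cover each other's failure cases.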

EXPERIMENTAL RESULTS

Experimental settings
The testing database of our study contains 50 DIBR 3D video clips with different frame numbers and frame sizes, including videos from the datasets of the MPEG 3DAV group [17], the Interactive Visual Media Group of Microsoft Research [18] and the Shenzhen Institutes of Advanced Technology (SIAT) [19], as well as 2D frames selected from existing movies, with corresponding depth maps calibrated based on the scheme in [20]. When implementing our proposed scheme, all the frames are normalized to 320 × 320 × 100 (height × width × frame number). A total of 25 TIRIs is generated, and each of them is divided into an 8 × 8 grid of non-overlapping blocks for CT-SVD feature extraction. As a result, the dimension of our extracted CT-SVD feature is 1600, which is the same as the size of our utilized watermark.
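The normalization arithmetic above can be sketched as follows (function and variable names are ours): with 100 normalized frames per clip and 25 TIRIs, each TIRI averages 4 consecutive frames, and an 8 × 8 grid of blocks per TIRI yields 25 × 64 = 1600 feature bits in total.

```python
import numpy as np

def make_tiris(frames, n_tiris=25):
    """Build TIRIs by averaging consecutive normalized frames in the
    temporal domain. `frames` has shape (frame_number, height, width);
    with 100 frames and 25 TIRIs, each TIRI averages 4 frames."""
    n = frames.shape[0]
    group = n // n_tiris
    return np.stack([frames[k * group:(k + 1) * group].mean(axis=0)
                     for k in range(n_tiris)])
```

Temporal averaging is what makes the TIRIs stable against frame dropping and other temporal attacks.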
Three state-of-the-art zero-watermarking schemes [13,14,15] are compared with our proposed scheme in terms of distinguishability, robustness, and overall performance in Sections 3.2, 3.3 and 3.4, respectively.

Comparison of distinguishability
In our study, the interBER is used to compare the distinguishability of different schemes. The interBER is defined as the BER between the genuine watermark and a fake watermark. The genuine watermark is generated by XORing the ownership share and the master share of the same video, while the fake watermark is generated by XORing the shares of different videos. A larger interBER thus indicates higher watermarking distinguishability. The comparison results are listed in Table 1. Here, F1, F2 and F3 indicate using the single CT-SVD feature with rotation-translation rectification, with shearing rectification, or without any rectification, respectively.
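The interBER computation itself is a simple mismatch rate; the arrays below are illustrative stand-ins for the recovered watermarks.

```python
import numpy as np

def ber(w1, w2):
    """Bit error rate: fraction of positions where two binary
    watermarks disagree."""
    return float(np.mean(w1 != w2))

# interBER: BER between a genuine watermark and a fake one recovered by
# XORing the shares of *different* videos (arrays here are illustrative).
rng = np.random.default_rng(1)
genuine = rng.integers(0, 2, 1600)
fake = rng.integers(0, 2, 1600)
inter_ber = ber(genuine, fake)  # near 0.5 for unrelated videos
```

For an uncorrelated fake watermark the expected interBER is 0.5, which is why values close to 0.5 signal good distinguishability.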
As shown in Table 1, the average interBER of our proposed scheme with the three features fused is 0.488, which is much larger than those of [13,14] and comparable with that of [15]. Moreover, our minimum interBER is 0.225, much larger than those of [13,14,15], which are 0.156, 0.073 and 0.124, respectively. These results demonstrate that our proposed scheme is superior to the other three schemes in terms of watermarking distinguishability. The reason is that we select the 6th and 7th subbands of the 2nd-level CT domain, which contain discriminative horizontal edges and contours, for the feature extraction. Although the distinguishability obtained by fusing the three features is naturally slightly worse than that of any single feature, the robustness of the fused result is much stronger, as shown in Section 3.3, leading to a better overall watermarking performance, as shown in Section 3.4.

Comparison of robustness
Then, we perform different attacks on all 50 videos and compare the robustness of different schemes in terms of the mean intraBER, which is defined as the BER between the watermarks recovered from the original and attacked videos. The attacks with their detailed parameters are shown in Table 2, and the comparison results are listed in Table 3. As shown in Table 3, our proposed scheme with the three features fused achieves an impressive performance in terms of the mean intraBER. Specifically, the average value of our mean intraBER is 0.039, which is much lower than that obtained with any single feature. These results indicate that the attention-based fusion further enhances the watermarking robustness. This is because the rectified features ensure strong robustness against rotation, cyclic-translation and shearing attacks due to the invariance of SIFT, while the unrectified feature enhances the robustness under noise addition and DIBR conversion attacks. Moreover, the average value of our mean intraBER is much lower than those of the other schemes, which are 0.147, 0.087 and 0.114. Specifically, for cyclic-translation and shearing attacks, our largest mean intraBER is 0.069, much better than those of the other three schemes, of which the smallest value is 0.160. For rotation attacks, our largest mean intraBER is 0.078, while the mean intraBERs of [13,15] are close to 0.4 and 0.2. For DIBR attacks, our largest mean intraBER is 0.066, while the smallest value of [14] is 0.113. These results demonstrate that our proposed scheme is more robust than the other three schemes. The reason for these results is threefold: (1) our well-designed CT-SVD feature ensures the robustness against signal processing and DIBR conversion attacks; (2) the SIFT-based rectification mechanism resists strong geometric attacks; and (3) the attention-based fusion exploits the complementary robustness of the rectified and unrectified features.

Comparison of overall performance
Finally, the overall performance of our proposed scheme is compared with that of the other schemes in terms of the false negative rate (FNR) under a fixed false positive rate (FPR); smaller FNR values represent better overall performance. In our experiments, the FPRs of the different schemes are set to 0 by defining their identification thresholds as their respective minimum interBERs. The results are listed in Table 3.
As shown in Table 3, most of the FNRs of our proposed scheme with the rectified and unrectified features fused are equal to 0, with an insignificant average value of 0.001. These values are much lower than those obtained with any single feature of our proposed scheme, which demonstrates the effectiveness of our designed attention-based fusion. Furthermore, our FNRs are much lower than those of the benchmark schemes [13,14,15], whose average values are as high as 0.341, 0.467 and 0.334, respectively. In particular, our FNRs remain 0 under cyclic-translation and shearing attacks, while those of the three benchmark schemes are close to 1. These results indicate that our proposed scheme outperforms the other three schemes in terms of overall performance.

CONCLUSION
In this paper, a novel zero-watermarking scheme is proposed for the copyright identification of DIBR 3D videos. The advantages of our proposed scheme include: (1) by using our well-designed CT-SVD feature, the watermarking robustness against signal processing attacks and DIBR attacks, as well as the watermarking distinguishability, are ensured simultaneously; (2) by establishing the SIFT-based rectification, strong geometric attacks are resisted; and (3) by designing the attention-based fusion, the complementary robustness of the rectified and unrectified CT-SVD features is exploited, which further improves the performance of copyright identification. The experimental results demonstrate that our scheme is superior to existing zero-watermarking schemes for 3D videos in terms of distinguishability and robustness against strong geometric attacks. Our future work is to extend this technique to protect medical volume images and other types of multimedia data.