DIBR Zero-Watermarking Based on Invariant Feature and Geometric Rectification

Despite the success of watermarking techniques for protecting depth image-based rendering (DIBR) 3-D videos, existing methods can hardly ensure robustness against geometric attacks, lossless video quality, and distinguishability between different videos simultaneously. In this article, we propose a novel zero-watermarking scheme to address this challenge. Specifically, we design CT-SVD features to ensure both distinguishability and robustness against signal processing and DIBR conversion attacks. In addition, a logistic–logistic chaotic system is utilized to encrypt the features for enhanced security. Moreover, a rectification mechanism based on salient map detection and SIFT matching is designed to resist geometric attacks. Finally, we establish an attention-based fusion mechanism to exploit the complementary robustness of the rectified and unrectified features. Experimental results demonstrate that our proposed method outperforms existing schemes in terms of losslessness, distinguishability, and robustness against geometric attacks.

Legal copyright protection of 3-D depth image-based rendering (DIBR) video content has become important owing to the emerging trend of using advanced immersive technologies in entertainment, business communications, and commercial campaigns. 1 3-D videos are typically stored in one of two formats for online distribution: 1) stereoscopic and 2) DIBR videos. Compared with the stereoscopic format, DIBR is the more common format owing to its lower storage and transmission bandwidth requirements, since depth maps can be compressed more effectively. 1 Unlike the protection of 2-D videos, copyright protection of DIBR 3-D videos poses a unique challenge: robustness against DIBR conversion attacks. Because illegal copies are often distributed by converting the original DIBR video into a stereoscopic format, watermarks need to be extractable not only from the DIBR format but also from its converted format. Traditional copyright protection techniques for 2-D videos can hardly satisfy this requirement: the pixels in the video frames are shifted horizontally during DIBR conversion, so the synchronization of watermark recovery is lost.
Digital watermarking is a key technique for protecting the copyright of video content and has wide applications in video coding standards. 2,3 DIBR watermarking methods can be categorized into three classes based on their embedding strategies: 1) 2-D frame-based watermarking (2-D-W), 2,4 2) depth map-based watermarking (DM-W), 5,6 and 3) zero-watermarking (ZR-W). 7-11 2-D-W directly embeds watermarks into the 2-D frames of the 3-D videos. This approach is intuitive, but the embedding process causes irreversible distortion of video quality. DM-W addresses this problem; however, it is not robust against various attacks, such as blurring, filtering, and noise addition. In contrast to the other approaches, ZR-W utilizes the relationship between features extracted from the 3-D videos and watermarks to losslessly and robustly protect the 3-D video copyright. Although ZR-W schemes outperform 2-D-W and DM-W algorithms in terms of video quality and watermark robustness, it remains challenging to ensure robustness against geometric attacks while simultaneously achieving video distinguishability.
To address this challenge, in this article we propose a novel ZR-W scheme based on horizontal shift-invariant features and a geometric rectification mechanism to protect DIBR 3-D videos. In this method, the contourlet transform and singular value decomposition (CT-SVD) are combined to extract horizontal shift-invariant and discriminative features. In addition, a logistic–logistic chaotic system (LLCS) is utilized to encrypt these features to improve the watermarking security. Furthermore, we design a geometric rectification mechanism using salient map detection and scale-invariant feature transform (SIFT)-based matching, which guarantees robustness against geometric attacks. Finally, an attention-based fusion is proposed to make full use of the complementary robustness of both the rectified and unrectified CT-SVD features, further improving the reliability and accuracy of copyright identification.
This article is an extended version of our previous work, 12 with the following improvements: 1) enhancing the geometric rectification mechanism by incorporating an RGB-D saliency detection model, improving both the rectification efficiency and the video distinguishability; 2) exploiting an LLCS to encrypt the designed features, thus improving the watermarking security; and 3) comparing with more state-of-the-art (SOTA) zero-watermarking methods under more types of attacks and with more testing data to demonstrate the superiority of our proposed method.

PROPOSED SCHEME
Our proposed method includes two phases: 1) a registration phase and 2) an identification phase. In the following, we describe each phase in detail.

Registration Phase
In the registration phase, features of 2-D frames are extracted and a certificate authority (CA) database is set up to store the ownership shares of 2-D frames for copyright identification, as illustrated in Figure 1. In our scheme, robust and discriminative features are extracted based on CT-SVD. The detailed steps are as follows.
1) Normalize the 2-D frames to 320 × 320 × 100 based on spatiotemporal smoothing with a Gaussian filter (window size 3, standard deviation 1). In this manner, the watermarking robustness against noise addition attacks is enhanced.
2) Resample the smoothed frames to yield the normalized frames. In detail, we use bilinear interpolation for spatial resampling and downsampling to a fixed frame number (100) for temporal resampling. In this manner, the watermarking robustness against scaling attacks is enhanced.
3) Divide the normalized frames into m groups, and construct temporally informative representative images (TIRIs) by averaging each group in the temporal domain. Exploiting the temporal properties of the 2-D frame sequence improves the robustness against noise addition attacks. In our work, m = 25.
4) Partition each TIRI into N × N nonoverlapping blocks, and perform a three-level CT 13 on each block. In our work, N = 8.
5) Select the sixth and seventh directional subbands of the second-level CT domain for feature extraction, for the following two reasons.
1) The selection of coefficients in the second-level CT domain ensures both the distinguishability and the robustness against signal processing attacks.
2) The selection of the sixth and seventh directional subbands enhances the robustness against DIBR conversion because these two subbands mainly contain horizontal edges and contours, as shown in Figure 2.
To generate ownership shares, which represent the mapping relationship between the video features and the watermark information, we employ chaotic mapping based on the LLCS 14 as follows.
1) Use the logistic maps defined in (1) to generate a chaotic sequence S = {s(i + L), 1 ≤ i ≤ 1600}, where u and v are the control parameters (0 ≤ u ≤ 10, 8 ≤ v ≤ 20), s(0) is the initial value of the chaotic system, n is the iteration number, s(n) is the output chaotic sequence, and ⌊·⌋ is the floor operation. u, v, and s(0) are used together as the secret key. These parameter settings strictly follow Pak and Huang 14 to ensure the effectiveness of the chaotic system. Because chaotic sequences generated after multiple iterations have better chaotic performance, L is set to a random integer larger than 1000 in our study.
2) Binarize the chaotic sequence S by its mean value to obtain a binary chaotic sequence BS = {bs(i), 1 ≤ i ≤ 1600}, where T_S denotes the average value of S.
3) Encrypt our designed CT-SVD features F = {f(i), 1 ≤ i ≤ 1600} by applying an exclusive-or (XOR) operation with the binary chaotic sequence BS, where ⊕ denotes the XOR operation and CF = {cf(i), 1 ≤ i ≤ 1600} is the encrypted feature. In this manner, the watermarking security is enhanced without affecting the robustness and distinguishability of the extracted features.
4) Generate the ownership shares O by applying an XOR operation between the encrypted feature CF and the watermark information W, where o(i) and w(i) are the ith bits of O and W, respectively.
5) Store O together with the corresponding secret key of the LLCS in the CA database for watermark recovery.
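As a rough, runnable sketch of this registration pipeline, the fragment below builds TIRIs from normalized frames, partitions one into an 8 × 8 grid of blocks, and then encrypts a binary feature vector and forms the ownership share. The contourlet transform, SVD, and the exact LLCS map of Pak and Huang are not reproduced; a plain logistic map stands in for the LLCS (an assumption), and the array sizes follow the text (100 frames of 320 × 320, m = 25, N = 8, 1600-bit features):

```python
import numpy as np

def build_tiris(frames, m=25):
    """Average each temporal group of normalized frames into a TIRI."""
    t, h, w = frames.shape                    # t is assumed divisible by m
    return frames.reshape(m, t // m, h, w).mean(axis=1)

def partition_blocks(tiri, n=8):
    """Split a TIRI into an n x n grid of nonoverlapping blocks."""
    h, w = tiri.shape
    bh, bw = h // n, w // n
    return tiri.reshape(n, bh, n, bw).transpose(0, 2, 1, 3).reshape(n * n, bh, bw)

def chaotic_bits(s0, u=3.99, length=1600, burn_in=1001):
    """Logistic-map stand-in for the LLCS: iterate past a burn-in (the role
    of L > 1000 in the text), then binarize the sequence by its mean."""
    s = s0
    for _ in range(burn_in):
        s = u * s * (1.0 - s)
    seq = np.empty(length)
    for i in range(length):
        s = u * s * (1.0 - s)
        seq[i] = s
    return (seq > seq.mean()).astype(np.uint8)

frames = np.random.rand(100, 320, 320)        # normalized 2-D frames
tiris = build_tiris(frames)                   # 25 TIRIs of 320 x 320
blocks = partition_blocks(tiris[0])           # 64 blocks of 40 x 40 (CT-SVD input)

features = np.random.randint(0, 2, 1600).astype(np.uint8)   # stand-in CT-SVD bits
watermark = np.random.randint(0, 2, 1600).astype(np.uint8)
bs = chaotic_bits(s0=0.3141592653589793)      # secret-key-driven chaotic bits
cf = features ^ bs                            # encrypted feature (step 3)
ownership = cf ^ watermark                    # ownership share (step 4), stored in CA
```

Recovery is the same XOR chain in reverse: re-extracting and re-encrypting the feature, then computing `ownership ^ cf`, returns the watermark bit for bit.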

Identification Phase
In this phase, a geometric rectification mechanism based on salient map detection and SIFT feature matching is designed to resist geometric attacks. Two rectified features and one unrectified feature are extracted from a query DIBR video. Then, the LLCS is applied to encrypt these features, and three watermarks are recovered by XOR operations between the encrypted features and the corresponding ownership share. Finally, the copyright ownership is identified by our attention-based fusion model, which exploits the complementary robustness of the rectified and unrectified features. The detailed processes are shown in Figure 3.
Our geometric rectification consists of two parallel channels, as shown in Figure 3: one channel chains a rotation rectification and a translation rectification, while the other performs a shearing rectification. Because rotation attacks would lead to incorrect translation rectification results, whereas cyclic-translation attacks have no effect on rotation rectification, the rotation rectification is performed first. In addition, the shearing rectification is kept separate from the other two because the interaction of shearing with rotation or cyclic-translation attacks would cause incorrect rectification. The three rectifications proceed as follows.
First, rotation rectification is applied according to the factor D_r, where V_q(i) and Ṽ_s(i) are the vectors obtained from the ith pair of matched SIFT points in the salient regions of the queried and stored videos, respectively, and N is the number of matched SIFT points.
Then, the pixels of the rotation-rectified frames are moved by translation rectification according to the factors D_x and D_y, where (x_q(k), y_q(k)) are the coordinates of the kth matched SIFT point of the queried DIBR video, (x_s(k), y_s(k)) are those of the stored video, and W and H are the width and height of the normalized 2-D frames, respectively.
Meanwhile, the shearing rectification is performed on the normalized 2-D video frames. We define the rectification factors D_a and D_b of shearing attacks, and the normalized 2-D frames are then rectified by substituting these obtained factors into the corresponding rectification equation.
The rectified feature extraction steps are as follows:
1) Normalize the 2-D frames of the query and stored DIBR videos.
2) Define a salient mask for geometric rectification.
Here, we propose to use a pretrained RGB-D salient map detection model, namely UC-Net, 15 to enhance both the efficiency of geometric rectification and the distinguishability of the rectified features. Specifically, we generate a single-channel saliency map by applying the pretrained UC-Net to the first frame of each DIBR video. This saliency map is then binarized to generate a mask so that the geometric rectification is applied only to the salient regions.
3) Extract the SIFT points in the salient regions of both the query and stored DIBR videos.
4) Match the SIFT points in the salient regions. In this manner, the rectification efficiency is improved, since the number of SIFT points in a salient region is much smaller than that in a whole video frame. In addition, the distinguishability of the rectified features is enhanced by further exploiting the salient information.
5) Rectify the normalized 2-D frames based on the rules described in (6)-(9).
6) Extract a rotation-translation-rectified feature, namely f_r1, and a shearing-rectified feature, namely f_r2, following the same extraction steps as in our registration phase.
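Steps 2)-5) above can be sketched as follows. The mask filtering and the averaging forms of D_r, D_x, and D_y are assumptions patterned on the text (mean angle difference and mean cyclic coordinate shift over matched SIFT pairs in the salient region); real SIFT extraction and matching are omitted:

```python
import numpy as np

def mask_keypoints(points, mask):
    """Keep only keypoints (x, y) inside the binary salient mask, so
    matching runs on far fewer candidates (steps 3 and 4)."""
    return [(x, y) for (x, y) in points if mask[int(y), int(x)]]

def rotation_factor(vq, vs):
    """Assumed form of D_r: mean angular difference between matched-point
    vectors of the queried (vq) and stored (vs) videos."""
    return float(np.mean(np.arctan2(vq[:, 1], vq[:, 0])
                         - np.arctan2(vs[:, 1], vs[:, 0])))

def translation_factors(pq, ps, w, h):
    """Assumed form of (D_x, D_y): mean cyclic shift between matched
    SIFT coordinates of the queried (pq) and stored (ps) videos."""
    dx = float(np.mean((pq[:, 0] - ps[:, 0]) % w))
    dy = float(np.mean((pq[:, 1] - ps[:, 1]) % h))
    return dx, dy

mask = np.zeros((320, 320), dtype=bool)
mask[80:240, 80:240] = True                   # binarized UC-Net saliency region
kept = mask_keypoints([(100.0, 120.0), (10.0, 10.0)], mask)  # drops the second point
```

With the matched point sets in hand, the recovered factors drive the inverse rotation, cyclic shift, and shearing transforms of (6)-(9).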
Because signal processing attacks (such as noise addition) and DIBR conversion could affect the SIFT matching results, we also extract CT-SVD features from the original query frames, namely f_u, following the same steps as in the registration phase, to further enhance the robustness.
In our scheme, an attention-based fusion is designed, inspired by Hua and Zhang, 16 to further enhance the performance of copyright identification by exploiting the complementary robustness of the rectified and unrectified features. The detailed steps of copyright identification are as follows: 1) Encrypt the rectified and unrectified features by the LLCS, following the same steps as in the registration phase.
2) Recover three watermarks by XORing the encrypted features with the corresponding ownership share of the stored video.
3) Fuse the bit error rates (BERs) of the recovered watermarks according to (10) and (11), which simultaneously satisfy the heterogeneity and monotonicity defined by Hua and Zhang, 16 where U is the attention-based fusion function, BER_1 and BER_2 are the BERs to be fused, BER_fr1 and BER_fr2 are obtained from the rectified features, BER_fu is obtained from the unrectified feature, and BER_fused is the fusion result. The constant in (11) is set to 0.01 empirically in our study. If any fused BER is smaller than our heuristic threshold, the query DIBR video is treated as an illegal copy.
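Since (10) and (11) are not reproduced above, the fusion below is only an illustrative attention-style rule that is monotone in each input and weights the smaller (more reliable) BER more heavily; the exponential weighting and the way the 0.01 constant is used are assumptions, not the paper's exact U:

```python
import numpy as np

def fuse_bers(ber1, ber2, eps=0.01):
    """Attention-style BER fusion sketch: each BER receives a weight that
    decays with its value, so the smaller (more trustworthy) BER dominates.
    This only mimics the monotonicity/heterogeneity behaviour described in
    the text; the paper's exact fusion function U differs."""
    w1, w2 = np.exp(-ber1 / eps), np.exp(-ber2 / eps)
    return (w1 * ber1 + w2 * ber2) / (w1 + w2)

# fuse the BERs of the two rectified features, then fold in the unrectified one
fused = fuse_bers(fuse_bers(0.05, 0.40), 0.30)
```

The nesting mirrors the text's idea that one reliable feature (here the 0.05 BER) should carry the decision even when the other recoveries are corrupted by attacks.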

Experimental Settings
The testing database in our experiments comprises 200 different DIBR 3-D video clips from open datasets provided by the MPEG 3DAV group, 17 the Interactive Visual Media Group of Microsoft Research, 18 and the Shenzhen Institute of Advanced Technology, 19 as well as DIBR videos generated by using the method described by Rzeszutek et al. 20 The watermark image is of size 40 × 40. To verify the robustness of our scheme, we apply the attacks listed in Table 1; attack examples are illustrated in Figure 4. We perform an extensive set of experiments and compare the obtained results with those yielded by five other state-of-the-art ZR-W algorithms 7-11 to evaluate the effectiveness of our proposed method.
All the experiments are run on an Intel Core i7-7700HQ CPU and a GeForce RTX 2080 Ti GPU, with all algorithms implemented in Python.

Comparison of Video Distinguishability
We first use the inter-BER metric to compare the distinguishability of our proposed method with the other five schemes. The inter-BER is defined as the BER value between the genuine and fake watermarks.
Here, the genuine watermark is generated by stacking the ownership share and the master share of the same video while fake watermarks are generated by stacking the ownership share with the master share of other videos. A higher inter-BER indicates higher distinguishability. The obtained results are given in Table 2.
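The genuine/fake inter-BER computation described here can be sketched directly; the share and feature arrays below are synthetic stand-ins for the stored ownership share and the extracted video features:

```python
import numpy as np

def ber(a, b):
    """Bit error rate between two binary arrays."""
    return float(np.mean(a != b))

rng = np.random.default_rng(0)
w = rng.integers(0, 2, 1600, dtype=np.uint8)        # registered watermark
f_same = rng.integers(0, 2, 1600, dtype=np.uint8)   # feature of the registered video
share = w ^ f_same                                  # stored ownership share
genuine = share ^ f_same                            # recovers w exactly -> inter-BER 0
f_other = rng.integers(0, 2, 1600, dtype=np.uint8)  # feature of a different video
fake = share ^ f_other                              # unrelated feature -> inter-BER near 0.5
```

A fake watermark built from an unrelated feature disagrees with the registered watermark on roughly half the bits, which is why a mean inter-BER close to 0.5 signals high distinguishability.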
As we can see from Table 2, the average inter-BER of our proposed method when fusing the three features is 0.487, which is larger than those of Cui et al.'s 8 and Wang et al.'s 9 methods and comparable to those of Liu et al.'s work. 7,10,11 Moreover, our minimum inter-BER is 0.207, much larger than those of the other five techniques. These results demonstrate that our proposed method achieves higher distinguishability than the other benchmarks. The reason for this superiority is twofold: the selection of CT subbands containing discriminative horizontal edges and contours ensures the feature distinguishability, and the deployment of RGB-D salient map detection further enhances the distinguishability of the two rectified features.

Comparison of Watermarking Robustness
Next, we apply the various attacks to all 200 videos and use the intra-BER metric to evaluate the robustness against the attacks. The intra-BER is defined as the BER value between the watermarks recovered from the original and attacked videos. A smaller mean intra-BER indicates stronger robustness. The comparison results are listed in Table 3. Here, bold font marks the best performance and underline the second-best performance.
As shown in Table 3, when fusing the three features, our proposed method achieves remarkable mean intra-BER values. Our proposed scheme is comparable to the best robustness performance under each attack, which no other benchmark method achieves. Moreover, the average of our mean intra-BER values is 0.041, smaller than those obtained by utilizing any single feature or by the other benchmark methods. These results demonstrate that our proposed scheme is superior to the other five schemes in terms of robustness, especially against geometric and DIBR conversion attacks. The reason for this superiority is threefold.

Comparison of Copyright Identification Performance
We then compare the different methods in terms of overall copyright identification performance, measured by the false negative rate P_fn and the false positive rate P_fp defined in Liu et al.'s work. 11 A smaller P_fn indicates better performance when P_fp is fixed. We set P_fp to 0.5% and compare the resulting P_fn under all attacks. The results are given in Table 4. Here, bold font marks the best performance and underline the second-best performance. We find that fusing the features in our proposed method gives better results than relying on either the rectified or the unrectified features alone.

Evaluation of the Improvement by Using Salient Map Detection
Because geometric rectification is the most time-consuming process in our proposed method, improving the rectification efficiency is important for real-world applications. In this section, we evaluate the improvement brought by applying salient map detection, compared to our previous work. 12 As shown in Table 5, the average processing time of geometric rectification without salient map detection is 0.41 s per video frame, whereas with salient map detection it is merely 0.14 s, comprising 0.08 s for salient map detection and 0.06 s for rectification. This demonstrates a nearly threefold improvement in rectification efficiency. In addition, the robustness remains comparable to the scheme without salient map detection, as in Liu et al.'s work. 12 At the same time, the distinguishability is improved by 1.8% in terms of its average value and by 2.9% in terms of its minimum value.

Evaluation of Watermarking Security
We evaluate the security performance of the LLCS in terms of key-sensitive-BER and encryption-BER values. The former is calculated between binarized chaotic sequences generated under different initial values (varied from 0 to 1 at an increment of 1/10^16 in our study), which indicates the sensitivity to tiny key differences. The latter is the BER between the extracted features and their encrypted form (using 200 random s(0) values in our study). An encryption function is considered more secure when both values are close to 0.5. The test results are listed in Table 6. As shown in Table 6, the average, maximum, and minimum values of the two BER metrics are all approximately 0.5, confirming the strong security brought by deploying the LLCS: even with the watermark information and ownership shares, an attacker cannot forge a valid watermark without the secret key of the LLCS, which our previous work 12 could not achieve.
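The key-sensitivity test can be sketched with a logistic-map stand-in (the paper's actual LLCS differs): two sequences whose initial values differ by roughly 10^-16 are binarized and compared, and a chaotic map drives their BER toward 0.5.

```python
import numpy as np

def binarized_chaos(s0, u=3.99, length=1600, burn_in=1000):
    """Logistic-map stand-in for the LLCS: iterate, discard a burn-in,
    then binarize the sequence by its mean."""
    s = s0
    for _ in range(burn_in):
        s = u * s * (1.0 - s)
    seq = np.empty(length)
    for i in range(length):
        s = u * s * (1.0 - s)
        seq[i] = s
    return (seq > seq.mean()).astype(np.uint8)

a = binarized_chaos(0.37)
b = binarized_chaos(0.37 + 1e-16)             # tiny key perturbation
key_sensitive_ber = float(np.mean(a != b))    # near 0.5 for a chaotic map
```

Because the positive Lyapunov exponent amplifies the 10^-16 perturbation exponentially during the burn-in, the two binarized sequences are effectively uncorrelated, which is exactly the key sensitivity the table reports.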

CONCLUSION
In this article, a novel ZR-W scheme based on horizontal shift-invariant features and geometric rectification is proposed for the copyright identification of DIBR 3-D videos. The advantages of our proposed scheme include the following: 1) By using our well-designed CT-SVD feature, the robustness against signal processing attacks and DIBR attacks as well as the video distinguishability are ensured simultaneously. 2) By establishing the geometric rectification, geometric attacks are resisted. 3) By introducing the RGB-D salient map detection model, the rectification efficiency is improved and the distinguishability of the rectified features is enhanced. 4) By designing the attention-based fusion, the complementary robustness of the rectified and unrectified CT-SVD features is exploited, which further improves the copyright identification performance. 5) By deploying the LLCS to encrypt the features, the watermarking security is ensured.
The experimental results demonstrate that our scheme outperforms the existing ZR-W schemes in terms of losslessness, distinguishability, and robustness against geometric attacks.