Joint compressive autoencoders for full-image-to-image hiding

Image hiding has received significant attention due to the need for enhanced multimedia services such as multimedia security and meta-information embedding for multimedia augmentation. Recently, deep learning-based methods have been introduced that are capable of significantly increasing the hidden capacity and supporting full-size image hiding. However, these methods suffer from the necessity to balance the errors of the modified cover image and the recovered hidden image. In this paper, we propose a novel joint compressive autoencoder (J-CAE) framework to design an image hiding algorithm that achieves full-size image hidden capacity with small reconstruction errors of the hidden image. More importantly, our approach addresses the trade-off problem of previous deep learning-based methods by mapping the image representations in the latent spaces of the joint CAE models. Thus, both the visual quality of the container image and the recovery quality of the hidden image can be improved simultaneously. Extensive experimental results demonstrate that our proposed method outperforms several state-of-the-art deep learning-based image hiding techniques in terms of imperceptibility and recovery quality of the hidden images while maintaining full-size image hidden capacity.


I. INTRODUCTION
Image hiding has become one of the most important information hiding technologies to embed secret messages in multimedia data. As highlighted in [1], hidden messages are not limited to authentication information such as watermarks but also can be meta-information including depth-map images, motion related images, and extended-colour images. Thus, image hiding techniques can diversify multimedia services such as multimedia authentication and multimedia augmentation by embedding hidden information.
Many image hiding methods have been proposed over the years. Classical image hiding approaches seek to embed messages in insignificant components of cover images so as to make it difficult to obtain the hidden messages perceptually. Popular methods include least significant bit (LSB) approaches [2], highly undetectable steganography (HUGO) [3], wavelet-obtained weight-based (WOW) methods [4], and universal wavelet relative distortion based (S-UNIWARD) techniques [5]. These methods mainly aim to decrease the detectability of the embedded information, but they hide only a relatively small number of bits such as text messages into a cover image.
Therefore, how to improve the hidden capacity remains one of the most important research aspects of data hiding. Many methods have been designed to enlarge the hidden capacity, such as diamond encoding based hiding [6], pixel pair matching based hiding [7], and image coding based hiding [8]. However, these methods can hardly hide a full-size hidden image into a cover image while ensuring acceptable human imperceptibility or image quality. Recently, deep learning (DL)-based methods have been exploited in order to further improve the hidden capacity by learning a non-linear mapping function to embed a same-size hidden image into a cover image [1], [9]-[12]. Using convolutional neural networks, these methods can achieve the required capacity of 24 bpp (bits-per-pixel) to support full-image-to-image hiding.
Although DL-based full-image-to-image algorithms have shown some success, they are constrained by a major challenge, namely that it is difficult to simultaneously reduce the residuals between container and cover images and the reconstruction error of the hidden image. The loss functions designed in these methods [1], [9]-[12] comprise two weighted terms to obtain a balance between the two optimisation goals, which in turn means that neither of these errors can be minimised to its optimal level due to this trade-off.
In this paper, we propose a novel joint compressive autoencoder (J-CAE)-based framework to tackle this challenge. In particular, we train two CAE models to represent cover and hidden images, respectively. Each model is designed to learn an effective latent space to represent its image gallery as well as to minimise the reconstruction error of gallery images. In the image hiding stage, we map the representations between the two latent spaces instead of embedding a hidden image into a cover image directly. This fundamentally solves the image quality trade-off problem of typical DL-based full-image-to-image hiding algorithms. The feature representations in both CAE models are binarised and a logistic-logistic chaotic mechanism is introduced to facilitate the mapping process. In addition, reconstruction performance is ensured through the advantages of the compressive autoencoder models. Possible applications of our proposed method include image authentication and storing image-related meta-information such as depth-map images, motion related images, and extended-colour images.
The main contributions of our work can be highlighted as follows: (i) our proposed image hiding algorithm fundamentally avoids the quality trade-off problem via joint deep autoencoder networks, providing excellent performance in both high hidden capacity and human imperceptibility; to our knowledge, this is the first approach to solve the quality trade-off problem for effective full-image-to-image hiding; (ii) the compressive autoencoder approach ensures high-quality recovery of the hidden image; and (iii) a logistic-logistic chaotic mechanism is employed for the mapping of representations in the latent spaces to further enhance the image hiding security.
The remainder of the paper is organised as follows. Related work is discussed in Section II. In Section III, we explain the details of our proposed J-CAE framework, while Section IV presents the experimental results to demonstrate that it outperforms current state-of-the-art DL-based full-image-to-image hiding methods. Finally, conclusions are drawn and future work is suggested in Section V.

II. RELATED WORK
In the following, we briefly describe related works and background on classical hiding methods with relatively high hidden capacity, DL-based full-image-to-image hiding methods and autoencoders.

A. Classical hiding methods with relatively high capacity
Conventional image hiding conceals information in a cover image with imperceptible modification. [2] proposes one of the earliest approaches of image hiding, where the secret information is embedded into the four least significant bits (LSB) of a cover image. [13] improves the LSB method by designing an optimal pixel adjustment process (OPAP) to reduce the distortion of the cover image. In [14], a method based on pixel LSB matching is introduced that uses a pair of pixels as an embedding unit to further enhance the image quality. [15] presents an exploiting modification direction (EMD) method with a 5-ary notational system to increase the capacity of LSB matching methods. Further, [16] enhances the basic EMD to a section-wise EMD method and [17] exploits eight modification directions to improve the EMD method by up to 4.5 bpp. As another enhanced version of EMD, [6] improves EMD by developing a diamond encoding (DE) and designing a 2k^2 + 2k + 1 notational system. In addition, [7] improves the DE method using an adaptive pixel pair matching function, in which more compact neighbourhood sets are provided to choose a better notational system for data embedding. In [18], a turtle shell-based hiding scheme is designed to improve the hidden capacity with good image quality. In [19], the performance is further improved by composing a reference matrix with a location table. [20] uses local binary pattern coding and OPAP together and yields a relatively high capacity. [8] employs the absolute moment block truncation coding (AMBTC) for information hiding and [21] further enhances this work by designing hybrid hiding strategies for different image blocks.
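The classical 4-LSB substitution of [2] can be sketched in a few lines of code. The helper names below are illustrative, operating on flat lists of 8-bit pixel values:

```python
def lsb_embed(cover: list[int], bits: list[int], k: int = 4) -> list[int]:
    """Embed k secret bits into the k least significant bits of each
    8-bit cover pixel (the classical 4-LSB substitution scheme)."""
    assert len(bits) == k * len(cover)
    stego = []
    for i, pixel in enumerate(cover):
        chunk = bits[i * k:(i + 1) * k]
        value = int("".join(map(str, chunk)), 2)        # k bits -> integer
        stego.append((pixel & ~((1 << k) - 1)) | value)  # replace k LSBs
    return stego

def lsb_extract(stego: list[int], k: int = 4) -> list[int]:
    """Recover the embedded bits from the k LSBs of each stego pixel."""
    bits = []
    for pixel in stego:
        bits.extend(int(b) for b in format(pixel & ((1 << k) - 1), f"0{k}b"))
    return bits
```

With k = 4 each pixel changes by at most 15 grey levels, which illustrates why higher-capacity LSB embedding trades off imperceptibility.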
Although the hidden capacity has been improved significantly by deploying these methods, it still cannot yield a 1:1 ratio of hidden capacity to cover information (24 bpp) for hiding a full-size hidden image into a cover image.

B. DL-based full-image-to-image hiding methods
Deep learning-based image hiding methods allow a full-size image to be hidden in another image. In [9], the first DL-based full-image-to-image hiding method is proposed, which relaxes the lossless recovery requirement of the hidden image, thus allowing a trade-off between the errors of the container image and the recovered hidden image; this is achieved by training three serial networks together. In [1], this approach is further improved by introducing image permutations to increase the difficulty of identifying the content of the hidden information.
[10] introduces a method that learns the mapping functions by using a novel block named separable convolution with residual block (SCR) and also employs a variance loss term for enhanced imperceptibility. In [12], two U-Net convolutional neural networks are trained together for hiding and extracting hidden images. The networks are iteratively updated to minimise the errors of container and hidden images. To further enhance data hiding imperceptibility, [11] only uses the Y channel of the cover image to embed hidden images, while adding the structure similarity index (SSIM) metric as an additional term to the loss function.
DL-based full-image-to-image hiding algorithms need to learn to balance the residuals between container images and cover images, and the reconstruction errors of the hidden images since these two terms form the constraints in the proposed frameworks.

C. Autoencoders
An autoencoder (AE) is a deep neural network model that learns, in an unsupervised manner, a non-linear encoder to represent images with compressed feature vectors and a decoder to project the vectors back to the original images with minimal quality loss. Since the first DL-based AE was shown to outperform linear principal component analysis (PCA) [22], a variety of AE models have been proposed for purposes including feature representation, denoising and compression. In [23], a variational autoencoder is proposed for approximate inference of images with continuous variables in a low-dimensional latent space. In [24], denoising adversarial autoencoders are trained by combining denoising with regularisation on the distribution of the latent space via adversarial learning. [25] trains a sparse autoencoder to extract sparse representations which are more distinctive for classification tasks, while in [26] a compressive autoencoder (CAE) is used to extract compressive feature representations as well as decode high-quality images.
The two distinctive advantages of AE algorithms are effective and compact feature representation in the learned latent space and remarkable reconstruction of images from their representations. In our work, we exploit the advantages of AE models to propose a novel image hiding framework. By designing a joint compressive autoencoder (J-CAE) model, the hidden capacity limitation of conventional methods and the container-recovery image quality balance problem in the DL-based full-image-to-image hiding methods are both successfully addressed.

III. PROPOSED J-CAE FRAMEWORK FOR IMAGE HIDING
Our proposed method comprises a hiding phase and an extraction phase. As illustrated in Fig. 1, during the hiding phase, two CAE models are trained based on cover images and hidden images, respectively. These two models have their own non-linear mapping functions to favour the reconstruction of their groups of images. The mapping of feature representations in these two latent spaces can be used to hide a hidden image into a cover image without a direct embedding process. In the extraction phase, a recovery CAE and a refining U-Net model are trained sequentially to improve the quality of the recovered images. In the following, we explain the hiding and extraction processes in detail.

A. Hiding phase
The hiding process has two stages (the two left blocks in Fig. 1): the first stage is dual CAE model training, while the second stage encompasses the feature representation mapping from the hidden image CAE representation to the cover image CAE representation.
1) Dual CAE training: Dual CAE model training is fundamental in our proposed approach. The two models represent the data distributions of two groups of images, cover images and hidden images. Note that the roles of the two image sets are interchangeable, since an image from either set can be encoded into its own latent space and then mapped to the other latent space. The architecture of the CAE model we adopt is illustrated in Fig. 2. It is composed of nine convolutional layers and two fully connected layers on both the encoder and the decoder side. Different from the model in [26], our designed CAE replaces the last two convolutional layers with two fully connected layers on the encoder side for improved image hiding performance, and the same modification is applied to the first two deconvolutional layers on the decoder side.
The loss function for training the Dual CAE models is

L_i = -log2 Q(E_i(x)) + α ||x - D_i(Q(E_i(x)))||², i ∈ {s, c},

where x represents the original image, E_i(·) the encoder and D_i(·) the decoder. Q(·) is a probability model to binarise the feature vector and α is a weight to balance the two terms. In general, the loss function encourages sparsity of the feature representation in the latent space and small reconstruction errors of the original image. In this manner, the separation of the two latent spaces removes the need for the balancing weights used in the loss functions of other state-of-the-art DL-based full-image-to-image hiding methods.
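The per-image objective can be sketched as follows. This is a toy illustration, not the trained model: `encoder`, `decoder` and `q_prob` are stand-ins for the learned networks and the probability model Q(·), and the straight-through estimator used to train through the binarisation is simplified to a hard threshold:

```python
import numpy as np

def dual_cae_loss(x, encoder, decoder, q_prob, alpha=0.1):
    """Sketch of the Dual-CAE objective: a rate term that encourages
    sparse, low-entropy binary codes plus a weighted reconstruction term."""
    z = encoder(x)                          # continuous latent vector
    b = (z > 0.5).astype(np.float32)        # binarised code Q(E(x))
    rate = -np.log2(np.clip(q_prob(b), 1e-9, 1.0)).sum()  # code-length term
    distortion = np.sum((x - decoder(b)) ** 2)            # reconstruction error
    return rate + alpha * distortion
```

In the full framework this loss is minimised independently for the cover-image CAE (i = c) and the hidden-image CAE (i = s), which is what decouples the two error terms.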
2) Joint mapping: After training the CAE models, a hidden image can be represented by a binarised encoded vector in its CAE model. Once the feature representation of the hidden image V_s is obtained, the corresponding feature representation of a random cover image V_c can be generated from the latent space of the other CAE model. The mapping relationship between V_s and V_c is given by

F = V_s ⊕ V_c,

where ⊕ indicates the element-wise XOR operation between the two binary vectors.
To further enhance the image hiding security, a chaotic system is applied to F as follows: (i) A 512-bit chaotic sequence H = {h_n, t+1 ≤ n ≤ t+512} is generated using a logistic-logistic system (LLS) [27], defined as

h_{n+1} = (u · h_n · (1 - h_n) · 2^v) mod 1,

with u and v two control parameters, 0 ≤ u ≤ 10 and 8 ≤ v ≤ 20. h_0 is the initial value of the LLS, n indicates the iteration number and h_n is the output of the LLS. Compared with a single logistic map, the LLS provides larger chaotic ranges and thus ensures better security.
(ii) The chaotic sequence H is then binarised to obtain a binary chaotic sequence BH = {bh_n, 1 ≤ n ≤ 512} as

bh_n = 1 if h_{n+t} > T, and bh_n = 0 otherwise,

where T is the average value of H and t is set to 100. (iii) Finally, the output K of the chaotic system is generated by

K = F ⊕ BH,

and K together with h_0 is used as the secret key for image hiding, transmitted to the receiver securely based on a private key distribution mechanism.
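Steps (i)-(iii) can be sketched as below. This is an illustrative sketch only: the iteration is an assumed 2^v-scaled logistic form standing in for the exact LLS of [27], and the combination K = F ⊕ BH is likewise an assumption consistent with the use of BH and K in the extraction phase:

```python
def chaotic_keystream(h0: float, u: float = 5.0, v: float = 14.0,
                      t: int = 100, length: int = 512) -> list[int]:
    """Iterate the assumed chaotic map, discard the first t transient
    outputs, then binarise the kept values against their mean T."""
    h, seq = h0, []
    for n in range(t + length):
        h = (u * h * (1.0 - h) * 2.0 ** v) % 1.0  # assumed LLS iteration
        if n >= t:                                 # keep h_{t+1} .. h_{t+length}
            seq.append(h)
    T = sum(seq) / len(seq)                        # threshold = average of H
    return [1 if x > T else 0 for x in seq]

def chaotic_output(f_bits: list[int], bh: list[int]) -> list[int]:
    """Step (iii), assumed form: K = F XOR BH, element-wise."""
    return [a ^ b for a, b in zip(f_bits, bh)]
```

Since BH is fully determined by h_0, transmitting only h_0 (plus K) lets the receiver regenerate the keystream, which is why h_0 serves as the secret key.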

B. Extraction phase
The extraction phase also proceeds in two stages (the two right blocks in Fig. 1).
1) Recovering CAE training: When extracting the hidden image from a container image, another CAE is trained. This model ensures recovery quality once the container image and key are transferred to the receiver side. The CAE is initialised with the encoder of the cover image CAE and the decoder of the hidden image CAE. The employed loss function is

L_r = ||x_s - D_r(Q(E_r(x_c)) ⊕ K ⊕ BH)||²,

where x_s is the original hidden image, x_c is the container image, E_r(·) represents the encoder converting the container image into its feature representation, D_r(·) represents the decoder recovering the hidden image, BH represents the binary chaotic sequence generated using the initial value h_0, and K is the output of the chaotic system. Both h_0 and K are obtained from the securely received secret key.
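The receiver-side code recovery implied by the XOR construction can be sketched as follows (assuming K = F ⊕ BH with F = V_s ⊕ V_c, so that the hidden code is V_s = V_c ⊕ K ⊕ BH):

```python
def recover_code(v_c: list[int], k: list[int], bh: list[int]) -> list[int]:
    """Receiver side: undo the joint mapping and the chaotic masking by
    element-wise XOR, recovering the hidden image's binary code V_s."""
    return [c ^ kk ^ b for c, kk, b in zip(v_c, k, bh)]
```

The recovered binary code is then fed to the decoder D_r(·) to reconstruct the hidden image.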
2) Refining U-Net training: In general, the quality of the hidden images recovered via the recovering CAE model is high due to the characteristics of the CAE. However, the recovered images tend to be smoother than the originals, a phenomenon that exists in most autoencoder models, so a refining network is used to further improve the recovery quality. In particular, we employ an adapted U-Net architecture, illustrated in Fig. 3, to refine the final recovery results. This network is composed of seven convolutional layers and seven deconvolutional layers and is capable of modelling the pattern of the lossy textural information caused by the CAE network.
The employed loss function for the U-Net is

L_u = ||x_s - x̂_s||²,

where x̂_s is the refined recovered hidden image and x_s the original hidden image.

IV. EXPERIMENTAL RESULTS
To evaluate our approach, we use three DL-based full-image-to-image hiding algorithms [1], [10], [12] as benchmarks and compare them with our proposed method. Both subjective and objective comparisons are performed to demonstrate the superiority of our proposed method.

A. Experimental setup
Our experimental dataset contains 12000 images, including facade images [28], face images [29] and aerial images [30]. All of them are resized to 128 × 128. The number of images used as cover images and hidden images are set equally, both including 4000 training images and 2000 test images.

B. Model parameters
Our J-CAE framework contains three CAE models and one U-Net model. The numbers of epochs are 4000 for the Dual CAEs and 2000 for the recovering CAE. The learning rates are 0.0001 for all three CAEs with a linear decay after 50% of the epochs. For the refining U-Net, the number of epochs is 800 and the learning rate is 0.00005 with a linear decay after 400 epochs. For the chaotic map system, u is set to 5, v is set to 14, and h_0 is a random number in [0, 1].

C. Subjective image quality evaluation
Six pairs of cover and hidden images are used as representative examples for a subjective comparison. Fig. 4 shows, for all four evaluated methods, the images together with the recovered images and the resulting residual images between the original and recovered images. As can be seen, our proposed method yields not only the best hiding imperceptibility but also the best reconstruction quality of the hidden images. On one hand, the residuals between recovered and original cover images of our method are much less significant than those of [1], [10], [12]. Our residuals are noise-like patterns that are not correlated with the content of the hidden images. In contrast, the structures of the hidden images can still be identified in the residual cover images of [10], while some information related to the secret images can still be found in the residual cover images of [12]. On the other hand, the residuals between recovered and original hidden images of our proposed J-CAE are insignificant and clearly the smallest among all methods.

D. Objective Image Quality Evaluation
For an objective evaluation, we employ the standard metrics of pixel error, peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM), which are typically used to evaluate image quality in the context of image hiding. A lower pixel error, a higher PSNR and a higher SSIM indicate better image quality and hence better hiding imperceptibility and recovery quality.
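The first two metrics can be sketched for 8-bit images as below; we assume here that pixel error denotes the mean absolute per-pixel difference (SSIM is more involved and is typically taken from a library such as scikit-image):

```python
import numpy as np

def pixel_error(x: np.ndarray, y: np.ndarray) -> float:
    """Mean absolute per-pixel error between two 8-bit images
    (assumed definition of the 'pixel error' metric)."""
    return float(np.mean(np.abs(x.astype(np.float64) - y.astype(np.float64))))

def psnr(x: np.ndarray, y: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Casting to float64 before subtracting avoids the unsigned-integer wrap-around that would otherwise corrupt both metrics.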
The results, as averages over the datasets, are given in Table I. As the table shows, the average pixel error for cover images is only 1.7881 for our proposed J-CAE algorithm, which is significantly smaller than for the other three methods. Similarly, the average PSNR (40.4039) and SSIM (0.9921) are much higher than those of the other algorithms. These results demonstrate the significantly better hiding imperceptibility of our proposed method. At the same time, our proposed method also yields much better results for the hidden images, with an average pixel error of 1.9547, PSNR of 39.5338 and SSIM of 0.9890, clearly outperforming the other three techniques and thus confirming the superior overall performance of our proposed J-CAE over existing deep learning-based methods.

E. Security analysis
RS Analysis [31], Chi Square Attack [32] and Difference Histogram Analysis [33] are standard methods to evaluate the security performance of classical image hiding methods with relatively high capacities [8], [20], [21]. We use StegExpose, a publicly available steganalysis toolkit [34] that combines pre-existing pixel-based steganalysis methods including RS Analysis, Chi Square Attack, Difference Histogram Analysis, Sample Pairs [35] and Primary Sets [36], to verify that we do not simply embed the information of the hidden image by LSB substitution and to evaluate our security performance in comparison with [1], [10] and [12]. The resulting ROC curves, which plot the true positive rate against the false positive rate, are shown in Fig. 5. Here, a true positive indicates a container image correctly detected, while a false positive means that an image without a hidden message is falsely classified as a container image. As evidenced by Fig. 5, the curve of our method is very close to that obtained by random guessing. The Area Under Curve (AUC) value of our method is 0.55, which is close to the 0.5 AUC value of a random classifier. This indicates that our J-CAE method does not just hide the information of the hidden image in the LSBs of the cover image and that it is very difficult to correctly identify true container images generated by our J-CAE method with a steganalysis tool. Our AUC value is also smaller than those of the other three methods, which are 0.62, 0.58 and 0.56, respectively, indicating that our method clearly outperforms [1] and [10] and confirming better security performance against this steganalysis tool. Although the result for [12] is comparable with ours here, the recovery quality of our proposed method is much better, as demonstrated in Table I.
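The AUC values reported here can be computed directly from steganalysis scores via the Mann-Whitney statistic; a minimal sketch (function and variable names are illustrative):

```python
def auc(scores_pos: list[float], scores_neg: list[float]) -> float:
    """AUC as the probability that a container image (positive) receives
    a higher steganalysis score than a clean image (negative); ties
    count as half a win (the Mann-Whitney U formulation)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC near 0.5 means the detector's scores barely separate container images from clean ones, which is the behaviour a secure hiding scheme aims for.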
It is obvious that the larger the hidden information is, the greater the potential distortion of the cover image will be. We cannot expect full-image-to-image hiding algorithms to achieve perfect undetectability against all recently developed machine steganalysis methods, because these algorithms yield a 1:1 ratio of hidden capacity to cover information. Although our proposed method is unsuitable for applications requiring perfect undetectability of hidden information, its value for data hiding applications such as image authentication and for data augmentation such as storing image-related meta-information remains intact, as analysed in [1]. For these applications, the main concerns become keeping visually noticeable distortions of the cover image insignificant and preventing discovery of the content of the hidden image. Regarding the former, the results in Fig. 4 and Table I have already demonstrated that the visual imperceptibility of our proposed method is remarkable and better than that of other DL-based full-image-to-image hiding algorithms. As for the latter, the employed logistic-logistic chaotic mechanism (LLS) of our proposed J-CAE fully ensures the security of the content of the hidden image, because no information about the hidden image can be obtained without the accurate initial value of the LLS.

V. CONCLUSIONS
In this paper, we have proposed a novel full-image-to-image hiding method based on joint compressive autoencoders. Our J-CAE approach is capable of not only achieving the same high hidden capacity as other DL-based full-image-to-image hiding methods but also of recovering hidden images with much smaller errors. Another advantage of our method is the significantly better visual imperceptibility achieved by mapping the feature representations. Our experimental results have demonstrated that our algorithm outperforms three state-of-the-art DL-based full-image-to-image hiding algorithms in both subjective and objective comparisons, while yielding better security performance. In future work, we will investigate adapting our proposed framework to a coverless image hiding scheme to further enhance security, and we will also evaluate it in the context of real-world applications.