Small Sample Image Segmentation by Coupling Convolutions and Transformers

Compared with natural image segmentation, small sample image segmentation tasks, such as medical image segmentation and defect detection, have been less studied. Recent studies have made efforts to bring together Convolutional Neural Networks (CNNs) and Transformers in a serial or interleaved architecture in order to incorporate long-range dependencies into the features extracted using CNNs. In this study, we argue that these architectures limit the capability of the combination of CNNs and Transformers. To this end, we propose a dual-stream small sample image segmentation network, namely, the Interactive Coupling of Convolutions and Transformers Based UNet (ICCT-UNet, code and models are available at: https://indtlab.github.io/projects/ICCTUNet), motivated by the success achieved using the UNet in the scenario of small sample image segmentation. Within this network, a CNN stream runs in parallel with a Transformer stream, and features are exchanged inside each block through the proposed Window-Based Multi-head Cross-Attention (W-MHCA) mechanism. To derive an overall segmentation, the features learned by the two streams are further fused using a Residual Fusion Module (RFM). Experimental results show that the ICCT-UNet outperforms, or at least performs comparably to, its counterparts on eight medical and defect data sets. These promising results should be attributed to the effective combination of the local and global features fulfilled by the proposed interactive coupling method.


I. INTRODUCTION
SEGMENTATION of natural images has been well studied [1], [2], [3], [4], [5], [6], [7]. However, this is not the case for image segmentation tasks with a small data set, for example, medical image segmentation and defect detection. Medical images captured by X-Ray, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Ultrasound, etc., have been widely used in clinical medicine. In practice, organ or lesion segmentation can assist clinicians in making a more precise diagnosis, designing a more appropriate surgical plan and proposing treatment strategies. On the other hand, defect detection plays an important role in Non-Destructive Testing (NDT), which is critical to the automatic production process and can significantly reduce the production cost [8].
Although a large number of CNN methods [9], [10], [11], [12], [13], [14], [15] have been proposed for those tasks, the progress is still slower than that of natural image segmentation. The inherent inductive bias helps these methods learn effective representations from a relatively small data set. However, the limited receptive field of those methods normally prevents them from capturing long-range dependencies [16]. This issue impairs the performance of the CNN methods. On the other hand, the Transformer [17] has been introduced into vision tasks and promptly became an alternative to CNNs [18], [19]. Thanks to the self-attention mechanism, Transformers can be used to extract context information. However, the strengths of Transformer methods cannot be sufficiently exploited when only a small data set is available [18].
The dilemma can be attributed to two challenges. (i) The lack of training images is the Achilles' heel of those methods, which requires that more powerful priors be encoded in the model while the model remains appropriately concise in order to avoid over-fitting [20]. (ii) Compared with the segmentation of natural images, the special image modality and visual content of medical or NDT images result in weakly discriminative semantic boundaries [21], [22], [23], which requires the network to preserve a precise representation of the local structure and, at the same time, use a large receptive field to aggregate context information.
It has been revealed that the characteristics of CNNs, such as finer local features, shift-invariance and hierarchical representation, can boost the performance of Transformers [24], [25]. Inspired by this finding, existing image segmentation methods, such as TransUNet [26] and MissFormer [27], attempted to alleviate the two challenges by bringing together CNNs and Transformers to introduce the inductive bias and enhance the ability to model long-range dependencies. However, the serial structure (see Fig. 1(a)) used by these methods limits the complementary interplay between the two sides.
Furthermore, an interleaved [28] (see Fig. 1(b)) or disentangled [29] structure has been used to overcome the limitations of the serial structure. However, because they ignore the distinct differences between the features learned by CNNs and Transformers [30], those structures may underutilize the potential of these features when they simplistically mix them together. Recently, efforts have also been made in image classification by performing feature communication between CNNs and Transformers [31], [32].
Therefore, we are motivated to propose a novel dual-stream image segmentation network, namely, the Interactive Coupling of Convolutions and Transformers Based UNet (ICCT-UNet) (see Figs. 1(c) and 2(a)), to effectively exploit both locality and globality for image segmentation on small data sets. A new basic block (see Fig. 2(c)) is designed, which comprises a parallel CNN sub-block and a Transformer sub-block, enabling feature exchange between them, to extract local and global features respectively. Within each block, the CNN sub-block is able to receive the global representation from the Transformer sub-block to increase its awareness of the global context; at the same time, the local features can be injected into the Transformer sub-block from the CNN sub-block to help it learn from a small data set. In this way, both the CNN and Transformer streams are able to facilitate the other side, and a complementary action is achieved.
Instead of fusing the outputs of both the sub-blocks into a single set of feature maps [29], we link the same type of sub-blocks across all the basic blocks, deriving two individual streams. Consequently, the potential of both CNNs and Transformers can be further exploited. We also design a Window-Based Multi-head Cross-Attention (W-MHCA) mechanism (see Fig. 2(d)) for feature exchange. Thanks to the W-MHCA, both streams can dynamically exchange features at a reasonable computational cost. Two classifiers are placed behind the final decoder block, which introduce supervision for the two streams and predict two logit maps, respectively. To exploit the features learned by the two streams, we further develop a Residual Fusion Module (RFM). This module fuses these features using residual learning and is appended to the ICCT-UNet. As a result, a third logit map is produced.
Our method is able to utilize the merits of CNNs and Transformers by interactively coupling them. To our knowledge, neither the proposed dual-stream network nor the W-MHCA has been applied to image segmentation before. Our contributions are fourfold.
• We introduce a novel dual-stream image segmentation network, i.e., ICCT-UNet, in which the CNN and Transformer streams exchange features and boost each other, to effectively exploit both locality and long-range dependencies. In addition, we design an RFM by leveraging residual learning to integrate the features extracted by the two streams, which normally produces a better prediction than either single stream.
• To effectively couple both streams at a relatively low memory and computational cost, we propose a W-MHCA mechanism, which outperforms the addition fusion, concatenation fusion and cross-attention [17] approaches.
• We build a 3D version of the proposed network, which can be applied to volumetric segmentation tasks.
• We demonstrate the effectiveness and generalization of the proposed method on five medical data sets and three defect data sets by experimentation. The results can be used as baselines by the community.

The remainder of this paper is organized as follows. We review the related literature in Section II. In Section III, our methodology is introduced. The experimental setup and results are reported in Sections IV and V respectively, while detailed ablation studies are performed in Section VI. Finally, we draw our conclusion in Section VII.

II. RELATED WORK

A. CNN-Based Methods
In [33], a series of Fully Convolutional Networks (FCNs) were evaluated in the scenario of small sample medical image segmentation. As one of the most popular medical image segmentation networks, UNet [9] and its variants [10], [11], [12], [13], [34], [35], [36], [37] have been extensively used. The symmetric U-shaped encoder-decoder structure greatly inspired the community. Motivated by the ResNet [38], Xiao et al. [10] introduced residual learning into the UNet by building a U-shaped network on top of residual blocks. In [36], a nested U-shaped network, i.e., UNet++, was proposed, which comprised multiple sub-UNets. Besides, several studies stacked two encoder-decoder networks, e.g., XNet [12] and DoubleU-Net [37], in order to achieve better performance. In [13], ERDUNet was developed on top of a Context Enhanced Encoder (CEE) module, to extract the global context and leverage the features extracted at different levels for finer results. Jin et al. [39] designed the DUNet by building a U-shaped network using the deformable convolution for retinal vessel segmentation.
The CNN techniques have also been widely explored in the field of defect detection. In [40], the features extracted using a pre-trained UNet [9] were used together with a random forest classifier for small defect detection. To separate crack pixels from the background, a fully convolutional structure was developed, namely, CrackNet [41]. Zou et al. [15] proposed a deep supervision based feature fusion framework, referred to as DeepCrack, for crack segmentation. Recently, Dong et al. [42] utilized an encoder trained using a synthetic data set for defect detection and classification. In [43], a multi-task framework was also introduced for these tasks, which exploited both an autoencoder and a one-class classifier.
Although the above methods have made great progress, the intrinsic local inductive bias of CNNs restricts the size of their receptive fields. As a result, their performance is limited.

B. Transformer-Based Methods
Compared with the CNN-based methods, the Transformer [17], which is built upon the self-attention mechanism, naturally owns the ability to capture long-range dependencies. Hence, Transformer-based methods usually showed better, or at least comparable, performance in high-level vision tasks [19], [24], [44], [45]. Chen et al. [26] conducted systematic experiments on Vision Transformer (ViT) [18] methods for medical image segmentation. It was shown that the vanilla ViT can only work with image patches because of the quadratic complexity of the self-attention mechanism. This issue resulted in the absence of locality and multi-scale features.
To address the issue, Cao et al. [46] proposed a pure Transformer-based UNet, namely, SwinUNet, on top of the Swin Transformer [24] blocks. Since self-attention was constrained within local windows, the SwinUNet reduced the computational cost. Due to the hierarchical representation, the performance was further improved. Transformer techniques were also used for defect detection. In [47], a Transformer network, i.e., CrackFormer, was developed using the proposed content-based self-attention blocks. To exploit the global context for road surface crack detection, Chen et al. [48] proposed the LECSFormer on top of the window-based self-attention and token rearrangement techniques.
These studies have shown the applicability of Transformers to different tasks. Since Transformers lack a strong inductive bias, their superiority over CNNs mainly depends on large-scale training data [49]. However, this is normally not the case for either medical image segmentation or defect detection. Although weights pretrained using natural images have been used to accelerate convergence [26], [46], the significant domain shift between these images and the medical or defective images limits the performance.

C. Hybrid Methods
To explore the merits of both CNNs and Transformers, emphasis has been put on integrating the two. It is an intuitive design to place Transformers behind CNNs, to compute global characteristics from the local features learned by CNNs. The TransUNet [26] utilized a ResNet [38] and a Transformer [18] as the encoder and bottleneck respectively. However, this network confined the Transformer to the smallest scale and neglected the multi-scale features learned by CNNs. The following studies [27], [50], [51], [52] were devoted to addressing this weakness of the TransUNet. Among these studies, the MCTrans [51], MissFormer [27] and ScaleFormer [52] adopted a similar design philosophy in which Transformers were used to model the scale-wise relationship, while the UTNet [50] inserted a self-attention module in each block, which led to an interleaved hybrid structure. Recently, Guo et al. [29] proposed the UNet-2022 by leveraging a disentangled structure to design the basic block. In [14], a hybrid network, namely, MDAL, was proposed by serially integrating convolutional blocks and the self-attention mechanism.

D. 3D Medical Image Segmentation
Some medical image modalities, e.g., MRI images, offer not only spatial data but also axial information. 3D or voxel segmentation typically leverages both sorts of data for the voxel-wise prediction. These methods normally employ 3D convolutional blocks within a U-shaped encoder-decoder structure [53], [54], [55], [56]. Although improved performance has been achieved using those methods, the locality limitation still remains because the aggregation scope of the 3D convolution, although multi-dimensional, is still local.
To address this issue, many studies introduced self-attention mechanisms to 3D medical image segmentation [28], [57], [58], [59]. In [58], Hatamizadeh et al. proposed UNETR, which combined a Transformer encoder and a CNN decoder. Zhou et al. [28] designed a hybrid encoder-decoder architecture, referred to as nnFormer, by stacking convolutional and Transformer blocks in an interleaved manner. It was shown that the joint use of the locality and long-range dependencies was beneficial for 3D medical image segmentation.
Guo et al. [29] disentangled the CNN and Transformer sub-blocks in order to balance the use of locality and globality. However, the unidirectional fusion mechanism lacked adaptability. Pioneering studies [30], [49] have shown distinct feature differences between CNNs and Transformers. In this context, an undifferentiated fusion may hinder their full utilization [60]. Recently, bi-directional CNN-Transformer communication has been applied to image classification [31], [32]. These studies mainly paid attention to the introduction of the global context, while ignoring the utilization of hierarchical representations, which are essential to image segmentation. In contrast, we propose a dual-stream encoder-decoder network (see Fig. 1(c)) by coupling CNNs and Transformers while enabling them to exchange features using a new Window-Based Multi-head Cross-Attention (W-MHCA) mechanism.

III. METHODOLOGY
The architecture of the proposed method is shown in Fig. 2(a), which comprises an Interactive Coupling of Convolutions and Transformers Based UNet (ICCT-UNet) and a Residual Fusion Module (RFM). As can be seen, the ICCT-UNet contains two parallel streams: CNNs and Transformers. Within these streams, the two sub-blocks in the same block can exchange features. Each stream produces a single logit map. To fuse the features learned by those streams, the RFM is appended to the ICCT-UNet and predicts a third logit map. Regarding the two streams and the RFM, three loss terms are computed respectively. The entire network can be trained end-to-end by summing up these terms. However, we found that they were difficult to weight. To further explore the capabilities of the two streams and the RFM, they can be trained separately by applying the stop-gradient technique to the beginning of the RFM. In addition, we implement a 3D variant of the proposed network, which can be used for 3D image segmentation.

A. ICCT-UNet
Within the U-shaped ICCT-UNet, the encoder contains a series of blocks, Enc_i (i ∈ {0, 1, 2, 3, 4}), while the decoder consists of a different set of blocks, Dec_j (j ∈ {3, 2, 1, 0}). The CNN stream is formulated with sub-blocks similar to those used by the UNet [9]. On the other hand, the Transformer stream is built on top of the sub-blocks that the SwinUNet [46] used. In both streams, the features computed using the two sub-blocks in the same block are exchanged through the Window-Based Multi-head Cross-Attention (W-MHCA) mechanism. An individual logit map is predicted by each stream.

1) Encoder:
To derive an initial representation for each stream, we use a stem block (see Fig. 2(b)) denoted as Enc_0, which contains two consecutive convolutional layers, each followed by a Batch Normalization (BN) layer and a ReLU activation function. As a result, the input image is transformed into a set of feature maps which have the same resolution as the image. It should be noted that this processing is different from that performed by the first block of the TransUNet [26] and SwinUNet [46], which down-samples the input. The feature maps are directly fed into the CNN stream. Meanwhile, they are flattened and processed using a linear projection. As a result, a set of initial patch embeddings is generated. They are then sent to the Transformer stream.
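To make the stem concrete, the following is a minimal PyTorch sketch of such a block. The layer names and the choice to project the embeddings to the same dimensionality as the feature maps are our own assumptions; it mirrors the description above, not the released code:

```python
import torch
import torch.nn as nn

class StemBlock(nn.Module):
    """Sketch of Enc_0: two Conv-BN-ReLU layers that keep the input
    resolution, plus a linear projection that turns the feature maps
    into initial patch embeddings for the Transformer stream."""

    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Hypothetical choice: embeddings share the channel count out_ch.
        self.proj = nn.Linear(out_ch, out_ch)

    def forward(self, x):                          # x: (B, in_ch, H, W)
        feats = self.convs(x)                      # (B, C, H, W) -> CNN stream
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = self.proj(tokens)                 # -> Transformer stream
        return feats, tokens
```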
Following the Enc_0, there are three additional blocks and a bottleneck, Enc_i (i ∈ {1, 2, 3, 4}), in the encoder, which learn representations at different semantic levels. In each block (see Fig. 2(c)), the CNN sub-block contains two units (unit(·)), each of which comprises a set of Conv-BN-ReLU operations, while the Transformer sub-block includes a Shifted Window-Based Multi-head Self-Attention (SW-MHSA) unit and a two-layer Multilayer Perceptron (MLP). (The W-MHSA is a zero-shift SW-MHSA.) Let $X_{CNN}^{i-1} \in \mathbb{R}^{C \times H \times W}$ denote the feature maps extracted by the CNN sub-block in the Enc_{i-1} and $\hat{X}_{CNN}^{i}$ represent the feature maps produced by the MaxPooling operation; the CNN sub-block in the Enc_i then computes its output $X_{CNN}^{i}$ by applying its two units to $\hat{X}_{CNN}^{i}$, with the W-MHCA result injected from the Transformer side. Likewise, let $Y_{Trans}^{i-1} \in \mathbb{R}^{HW \times C}$ denote the patch embeddings produced by the Transformer sub-block in the Enc_{i-1}, $\hat{Y}_{Trans}^{i}$ stand for the downsampled patch embeddings and $\tilde{Y}_{Trans}^{i}$ represent the intermediate result of the SW-MHSA unit; the Transformer sub-block in the Enc_i then computes its output $Y_{Trans}^{i}$ by passing $\tilde{Y}_{Trans}^{i}$, together with the W-MHCA result injected from the CNN side, through the MLP. In the bottleneck, i.e., Enc_4, the computation process is nearly the same as in the previous blocks except that, in the CNN sub-block, the number of channels of the output feature maps is equal to that of the input feature maps.
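As an illustration, the following PyTorch sketch wires one such encoder block. The exact point at which each W-MHCA result enters the sub-blocks is our assumption (the original display equations are not reproduced here), `swin_sub_block` stands in for the downsampling plus SW-MHSA/MLP sub-block, `WMHCA` is the cross-attention unit sketched under W-MHCA below, and the reshaping between channels-first maps and token grids is omitted:

```python
import torch
import torch.nn as nn

class ConvUnit(nn.Module):
    """One unit(.): Conv-BN-ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class EncoderBlock(nn.Module):
    """Schematic Enc_i: parallel CNN and Transformer sub-blocks that
    exchange features via two W-MHCA units, one per direction."""
    def __init__(self, ch, swin_sub_block, wmhca_t2c, wmhca_c2t):
        super().__init__()
        self.pool = nn.MaxPool2d(2)          # produces \hat{X}^i
        self.unit1 = ConvUnit(ch // 2, ch)
        self.unit2 = ConvUnit(ch, ch)
        self.trans = swin_sub_block          # downsample + SW-MHSA (+ MLP)
        self.t2c = wmhca_t2c                 # Trans -> CNNs: global context
        self.c2t = wmhca_c2t                 # CNNs -> Trans: local features

    def forward(self, x_cnn, y_trans):
        # NB: converting tokens to a grid (and back) for the W-MHCA calls
        # is elided; this only shows the assumed information flow.
        x = self.unit1(self.pool(x_cnn))     # intermediate CNN maps
        y = self.trans(y_trans)              # intermediate tokens \tilde{Y}^i
        x = self.unit2(x + self.t2c(x, y))   # inject global context
        y = y + self.c2t(y, x)               # inject locality
        return x, y
```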
2) Decoder: The decoder of the ICCT-UNet contains four blocks, Dec_j (j ∈ {3, 2, 1, 0}). In the Dec_0, both the CNN and Transformer sub-blocks are the same as the CNN sub-block contained in an encoder block. In terms of each stream, however, the sub-blocks in the Dec_3 to Dec_1 are the same as those comprised in the Enc_3 to Enc_1 respectively. Regarding each block, the input of the CNN sub-block or the Transformer sub-block is the concatenation of two sets of feature maps. One set is the result obtained by applying bilinear interpolation to the feature maps produced by the previous sub-block in the same stream, while the other set is generated by the sub-block at the same level in the corresponding stream of the encoder. In particular, a linear projection is used in the Transformer sub-blocks to adjust the dimension of the input. A 1 × 1 convolutional layer is applied to the output of each sub-block in the Dec_0 and the result is a logit map. To further reduce the possibility of over-fitting, we can apply the Spatial Dropout [61] technique to the ICCT-UNet. For each stream, we use a Spatial Dropout unit in the Dec_j (j ∈ {3, 2, 1, 0}).
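As a small illustration of this input assembly (the function name and the assumption that both inputs are channels-first feature maps are ours), the skip connection can be sketched as:

```python
import torch
import torch.nn.functional as F

def decoder_input(prev_feats, skip_feats):
    """Upsample the previous sub-block's output by bilinear interpolation
    and concatenate it with the same-level encoder features."""
    up = F.interpolate(prev_feats, scale_factor=2,
                       mode="bilinear", align_corners=False)
    return torch.cat([up, skip_feats], dim=1)  # channel-wise concatenation
```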
As shown in Fig. 2(a), the bi-directional interactions between the two streams take place in each block except the Enc_0 and Dec_0. Due to the local inductive bias of CNNs, they are easier to train than Transformers. Thus, the CNNs → Trans injection can accelerate the training of the Transformer stream. On the other hand, the Transformer sub-blocks are able to incorporate the global information into the CNN sub-blocks through the Trans → CNNs pathway, which helps the CNN sub-blocks overcome their limited receptive fields. Hence, the ICCT-UNet can overcome the data-scale limitation and learn from scratch with a small data set. Since it integrates both local characteristics and the global context, a more accurate segmentation can be derived.
3) W-MHCA: In essence, the feature exchange between the two streams can be treated as feature fusion. Given two sets of features, the commonly-used fusion methods include point-wise addition and concatenation followed by a weighted sum (e.g., a convolution), as shown in Figs. 3(a) and (b) respectively. Compared with the addition method, the concatenation method can aggregate more features using a large neighbourhood, but it requires more parameters.
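For reference, the two baseline fusion methods can be sketched in a few lines of PyTorch (the shapes and the 1 × 1 kernel choice are illustrative assumptions):

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 64, 56, 56
feats_a = torch.randn(B, C, H, W)   # e.g., CNN features
feats_b = torch.randn(B, C, H, W)   # e.g., Transformer tokens as a grid

# (a) Point-wise addition: parameter-free, weights both sources equally.
fused_add = feats_a + feats_b

# (b) Concatenation + convolution: a learned weighted sum over the
# doubled channel neighbourhood, at the cost of extra parameters.
fuse = nn.Conv2d(2 * C, C, kernel_size=1)
fused_cat = fuse(torch.cat([feats_a, feats_b], dim=1))
```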
In [31], a Feature Coupling Unit (FCU) was used to fill the semantic gap between the CNN and Transformer streams for image classification, in which feature fusion was fulfilled by addition. More sophisticatedly, a simplified version of cross-attention [17] was also utilized in order to fuse the features extracted using CNNs and Transformers [32]. As shown in Fig. 3(c), the stream which intakes the information from the other uses its own features as the query and utilizes the features of the other as the key and value. Compared with the addition and concatenation methods, cross-attention is able to dynamically fuse features because the fusion weights are computed according to the input. Since the projections in the cross-attention unit are shared by all tokens, fewer parameters are needed to aggregate a large number of features than the concatenation method requires.
The cross-attention computation was conducted across all the tokens at each resolution level [32]. Although this global computation works well for image classification, which only needs an image-level representation, it is unsuitable for dense prediction tasks because they use fine-grained representations to precisely localize objects. Since feature exchange is conducted at every level, the global computation requires a huge amount of computational resources at the high resolution levels. In addition, the global cross-attention mechanism may produce redundant features because irrelevant information is introduced. To address these issues, we propose a new Window-Based Multi-head Cross-Attention (W-MHCA) mechanism on top of an improved attention method [62]. This mechanism is able to efficiently bridge the CNN and Transformer streams (see Fig. 2(d)).
Given the features extracted in the CNN and Transformer sub-blocks of the same block, we split them into w × w non-overlapping windows. Then the multi-head cross-attention is computed within these windows. Compared with the global cross-attention mechanism [32], the W-MHCA has a reasonable computational complexity while reducing the redundant computation on irrelevant tokens.
When feature fusion is performed from the CNN stream to the Transformer stream, for example, the features produced in the CNN stream are used as the key and value while the features extracted in the Transformer stream are used as the query. Similar to the MHSA [18], the W-MHCA contains multiple attention heads. The results produced by these heads are fused using an output projection $W^O$. Let $X_{CNN}$ and $Y_{Trans} \in \mathbb{R}^{w^2 \times d}$ be the $d$-dimensional tokens within a window in the CNN and Transformer streams respectively, and $W_j^Q$, $W_j^K$ and $W_j^V$ be the query, key and value projections of the $j$-th attention head. The output of this head, i.e., $\hat{Y}_{Trans}^{j}$, can be computed as:

$$\hat{Y}_{Trans}^{j} = \mathrm{softmax}\!\left(\frac{(Y_{Trans} W_j^Q)(X_{CNN} W_j^K)^\top}{s}\right) X_{CNN} W_j^V,$$

where $s$ is the temperature term of the softmax function and is defined as a learnable parameter in order to avoid the attention deterioration problem [62]. Finally, the output of the W-MHCA with $h$ attention heads is computed as:

$$\hat{Y}_{Trans} = \mathrm{Concat}\!\left(\hat{Y}_{Trans}^{1}, \ldots, \hat{Y}_{Trans}^{h}\right) W^O.$$
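A minimal PyTorch sketch of the W-MHCA is given below. It follows the equations above; the window-partition helper, the per-head temperature initialization and the omission of the inverse window merge are our own simplifications, not the released implementation:

```python
import torch
import torch.nn as nn

def window_partition(x, w):
    """Split (B, H, W, d) features into (B*nW, w*w, d) non-overlapping
    windows; H and W are assumed divisible by w."""
    B, H, W, d = x.shape
    x = x.view(B, H // w, w, W // w, w, d)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, d)

class WMHCA(nn.Module):
    """Window-Based Multi-head Cross-Attention: the receiving stream
    supplies the query; the other stream supplies key and value; a
    learnable temperature s replaces the fixed sqrt(d) scaling [62]."""

    def __init__(self, dim, num_heads, window_size=7):
        super().__init__()
        self.h, self.hd, self.w = num_heads, dim // num_heads, window_size
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.out = nn.Linear(dim, dim)  # output projection W^O
        # One learnable temperature per head, initialized to sqrt(head_dim).
        self.s = nn.Parameter(torch.full((num_heads, 1, 1), self.hd ** 0.5))

    def forward(self, query_feats, kv_feats):
        # Both inputs: (B, H, W, dim), e.g., tokens reshaped to a grid and
        # CNN maps permuted to channels-last.
        q = window_partition(self.q(query_feats), self.w)
        k, v = self.kv(window_partition(kv_feats, self.w)).chunk(2, dim=-1)
        n = q.shape[1]                                      # tokens per window
        q = q.view(-1, n, self.h, self.hd).transpose(1, 2)  # (B*nW, h, n, hd)
        k = k.view(-1, n, self.h, self.hd).transpose(1, 2)
        v = v.view(-1, n, self.h, self.hd).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.s           # learnable scaling
        y = (attn.softmax(dim=-1) @ v).transpose(1, 2)      # (B*nW, n, h, hd)
        # Merging windows back to the (B, H, W, dim) grid is omitted here.
        return self.out(y.reshape(-1, n, self.h * self.hd))
```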

B. Residual Fusion Module
The two streams of the ICCT-UNet produce two different logit maps respectively. To jointly exploit the discriminant ability of these streams, we design a simple sub-network, referred to as the Residual Fusion Module (RFM), which performs residual learning on the concatenation of the features extracted at the end of the two streams. The RFM aims to fuse these features and generate a finer logit map.
Motivated by the ResNet [38], the computation of the RFM can be expressed as:

$$X_{out} = X_{in} + F(X_{in}),$$

where $X_{in}$ is the concatenation of the features extracted in the two streams of the decoder and $F(\cdot)$ stands for the RFM.
The RFM is also a U-shaped encoder-decoder network with skip connections. Each block contains a convolutional layer, a batch normalization layer and a ReLU activation function. Two additional 3 × 3 convolutional layers are placed in front of the first block of the encoder and behind the last block of the decoder respectively. The encoder comprises five blocks, in which the last block is used as the bottleneck. While the spatial resolution of the feature maps decreases, the number of channels is kept constant, i.e., 64. Similarly, the decoder consists of four blocks. Finally, we use a 1 × 1 convolutional layer to perform the pixel-wise classification. The result is a third logit map in addition to the two logit maps produced by the ICCT-UNet.
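The following PyTorch sketch assembles such a module under the stated constraints (a constant 64 channels, five encoder blocks with the last one as the bottleneck, four decoder blocks); the down/upsampling operators and the exact placement of the residual addition are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFM(nn.Module):
    """Sketch of the Residual Fusion Module: a small U-shaped ConvNet
    applied to the concatenated stream features, a residual connection
    (X_out = X_in + F(X_in)) and a 1x1 pixel-wise classifier."""

    def __init__(self, in_ch, num_classes, ch=64):
        super().__init__()
        def block(cin):
            return nn.Sequential(nn.Conv2d(cin, ch, 3, padding=1),
                                 nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.pre = nn.Conv2d(in_ch, ch, 3, padding=1)   # extra 3x3 conv
        self.enc = nn.ModuleList([block(ch) for _ in range(5)])   # last = bottleneck
        self.dec = nn.ModuleList([block(2 * ch) for _ in range(4)])
        self.post = nn.Conv2d(ch, ch, 3, padding=1)     # extra 3x3 conv
        self.cls = nn.Conv2d(ch, num_classes, kernel_size=1)

    def forward(self, x_in):
        x0 = self.pre(x_in)                  # project concat features to 64 ch
        x, skips = x0, []
        for i, enc in enumerate(self.enc):
            x = enc(x)
            if i < len(self.enc) - 1:        # keep skip, then downsample
                skips.append(x)
                x = F.max_pool2d(x, 2)
        for dec in self.dec:                 # upsample, fuse skip, convolve
            x = F.interpolate(x, scale_factor=2, mode="bilinear",
                              align_corners=False)
            x = dec(torch.cat([x, skips.pop()], dim=1))
        return self.cls(x0 + self.post(x))   # residual, then 1x1 classifier
```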
However, it is challenging to obtain the optimal weights to balance the end-to-end training processes of the ICCT-UNet and RFM. In particular, inappropriate weights may disturb the training of the ICCT-UNet. As a result, imperfect representations are learned by the two streams. Since these representations are fed into the RFM, its performance will be impaired. Although this issue can be alleviated by setting a small weight for the loss of the RFM, that may result in poor training of the RFM. To resolve this dilemma, we apply the stop-gradient operation to the beginning of the RFM. In this case, the ICCT-UNet and RFM can be trained independently while both are optimized appropriately.
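In PyTorch, this decoupling amounts to detaching the features before they enter the RFM; the training-step interface below (the model's outputs and the loss choice) is a hypothetical sketch:

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def training_step(model, rfm, images, target):
    """The two stream losses update the ICCT-UNet; the RFM loss, fed with
    detached features, only updates the RFM (stop-gradient)."""
    logits_cnn, logits_trans, feats = model(images)  # assumed interface
    logits_fused = rfm(feats.detach())               # gradients stop here
    return (ce(logits_cnn, target) + ce(logits_trans, target)
            + ce(logits_fused, target))
```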

C. 3D Variant
The aforementioned ICCT-UNet and RFM can be easily modified for 3D image segmentation tasks. We first replace the convolutional layers and batch normalization with the corresponding 3D versions. In this way, both the CNN stream and the RFM are reformulated. To reformulate the Transformer stream and W-MHCA, we then use local volumes to replace the local windows required for computing the self-attention and cross-attention mechanisms. As a result, a 3D variant of the proposed network is built.
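Concretely, the substitutions map onto the standard PyTorch 3D layers; these lines show the straightforward counterparts, not the released code:

```python
import torch.nn as nn

conv = nn.Conv3d(64, 64, kernel_size=3, padding=1)  # replaces nn.Conv2d
norm = nn.BatchNorm3d(64)                           # replaces nn.BatchNorm2d
pool = nn.MaxPool3d(2)                              # replaces nn.MaxPool2d
# Local windows become local volumes: the (S)W-MHSA and W-MHCA units
# attend within w x w x w token cubes instead of w x w windows.
```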

IV. EXPERIMENTAL SETUP
We performed a series of experiments on eight data sets. In this section, we will briefly introduce the data sets, experimental setup and implementation details.

A. Data Sets
For 2D medical image segmentation, our method was tested with four data sets, including Synapse [63], ACDC [22], ISIC 2018 [64] and BUSI [65]. The Synapse data set contains 30 abdominal CT scans in total. Both the Dice Score (DSC) and 95% Hausdorff Distance (HD95) were computed on eight main organs. The ACDC data set comprises 100 cases for segmentation of the Left Ventricle (LV), Right Ventricle (RV) and Myocardium (MYO). Only the DSC was calculated for this data set, following the previous work [26], [28], [46]. In total, 2,594 dermatologic images of skin lesions are included in the ISIC 2018 data set. Regarding the BUSI data set, 647 benign and malignant cancer ultrasound images were used. Both IoU and the F1 score were used as the performance measures for the ISIC and BUSI data sets. With regard to 3D medical image segmentation, we used three data sets, including Synapse [63], ACDC [22] and MSD [66]. The MSD data set contains 484 MRI images for brain tumor segmentation. We examined the segmentation results for the Whole Tumor (WT), Enhancing Tumor (ET) and Tumor Core (TC) in terms of the DSC and HD95 metrics.
For defect detection experiments, we evaluated our method on three data sets: CFD [67], MT [68] and KSDD [69]. The CFD data set includes 183 road surface crack images with annotated cracks. The MT data set comprises five sets of magnetic tile surface defect images and one set of defect-free images. We used 172 images of the blowhole and crack defects, following the original setup [68]. The KSDD data set contains 52 annotated defect images. We computed both the IoU and AUC values for the three data sets.
Regarding the Synapse [63], ACDC [22] and MSD [66] data sets, the training/testing split was kept the same as that used in the previous studies [26], [28], [46]. For the ISIC 2018 [64], BUSI [65] and three defect data sets, we split each of them into five folds and conducted cross-validation experiments. The average was computed across the results obtained in the five folds. For more details, please refer to Table III.

B. Experimental Setup
For the 2D image segmentation experiments, we resized the images to a resolution of 224 × 224 pixels. Only the cross-entropy loss was used to train both the ICCT-UNet and RFM. The AdamW [70] optimizer was utilized to optimize the network. We also employed the poly learning rate attenuation strategy [26], which is expressed as:

$$lr = lr_{init} \times \left(1 - \frac{iters}{iters\_total}\right)^{0.9},$$

where $iters$ means the number of iterations that have been completed and $iters\_total$ denotes the number of iterations required in the experiment. Table I shows the details of the experimental setup used for each data set.
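This schedule can be reproduced with a LambdaLR wrapper stepped once per iteration; the stand-in module, learning rate and iteration budget below are placeholders, and the exponent 0.9 follows the poly strategy of [26]:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 1, kernel_size=1)      # stand-in module
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

iters_total = 10_000                               # hypothetical budget
scheduler = LambdaLR(optimizer,
                     lr_lambda=lambda iters: (1 - iters / iters_total) ** 0.9)
# Calling scheduler.step() once per iteration (at most iters_total times)
# decays the factor from 1 towards 0 over the course of training.
```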
Regarding the 3D image segmentation experiments, we used the same training strategies as those used by the nnFormer [28], including the spacing and cropping methods, the combined cross-entropy and Dice loss, the SGD optimizer and the Deep Supervision technique. Please refer to Table II for more details.
To alleviate the potential overfitting problem, we used the same data augmentation operations as those utilized in [26], [46], including horizontal flip, vertical flip and random rotation, for the 2D image segmentation task, while we used the same data augmentation operations as those employed by the nnFormer [28] for the 3D image segmentation task. We also used Spatial Dropout [61] as an extra regularization method for training our network.

C. Implementation Details
In the remainder of this paper, we use ICCT-UNet-X to denote the proposed ICCT-UNet with a specific number of channels, X, in the feature maps extracted using the Enc_0. The predictions of the CNN stream, the Transformer stream and the RFM are denoted as ICCT-UNet-X-C, ICCT-UNet-X-T and ICCT-UNet-X-F, respectively. Considering the varying scale of the different data sets, we used the ICCT-UNet-64 for the Synapse [63] and ACDC [22] data sets and the ICCT-UNet-32 for the other data sets in the 2D segmentation experiments. In each encoder or decoder block, the numbers of attention heads in the SW-MHSA and W-MHCA were set to the same value. For the Enc_i (i ∈ {1, 2, 3, 4}), we used 4, 8, 16 and 32 attention heads in turn. In terms of the Dec_j (j ∈ {1, 2, 3}), 4, 8 and 16 attention heads were utilized in turn. The window size w was set to 7 for the SW-MHSA unit in the Transformer stream and all the W-MHCA units.
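Collected as a plain Python dictionary (our own format, not the authors' configuration files), the attention settings read:

```python
# Attention heads per block; SW-MHSA and W-MHCA share these values.
num_heads = {
    "Enc_1": 4, "Enc_2": 8, "Enc_3": 16, "Enc_4": 32,  # encoder + bottleneck
    "Dec_1": 4, "Dec_2": 8, "Dec_3": 16,               # decoder blocks
}
window_size = 7  # shared by the SW-MHSA unit and all W-MHCA units
```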
Regarding the 3D segmentation experiments, we replaced all the blocks and modules with the corresponding 3D versions. To reduce the computation and memory demand, the stride of the convolutions in the Enc_0 was set to 2, which generates smaller feature maps and fewer patch embeddings. Since we followed the experimental setup utilized by the nnFormer [28], the same hyperparameters were used for our Transformer stream to avoid tuning the network.
It should be noted that our Transformer stream is different from the nnFormer [28]. Specifically, we used the Patch-Merging [24] operation to connect different encoder blocks, and the locality information was introduced by the CNN stream through the W-MHCA unit, whereas the nnFormer used an overlapped convolution. Furthermore, the nnFormer used a skip-attention module [28] to fuse the features between the encoder and decoder while we used the simpler skip connection method [9]. In addition, the nnFormer used the transposed convolution in the decoder to upsample feature maps while we employed the naive trilinear interpolation.
All the image segmentation approaches tested in this study were implemented using Python 3.9.7, PyTorch 1.10.2 and Torchvision 0.11.3. The experiments were performed on an NVIDIA GeForce RTX 3090 graphics card.

V. EXPERIMENTAL RESULTS
In this section, we will report the results obtained using five medical and three defect data sets.

A. Medical Image Segmentation
The results obtained using different methods on the Synapse [63], ACDC [22], MSD [66], BUSI [65] and ISIC [64] data sets are reported in Tables IV, V, VI and VII. As can be seen, the three predictions of our method were superior, or at least comparable, to different baselines, including CNN methods (e.g., UNet [9] and XNet [12]), Transformer methods (e.g., SwinUNet [46]) and serial hybrid methods (e.g., TransUNet [26], ScaleFormer [52] and MISSFormer [27]), across the five data sets. Our 3D model outperformed its state-of-the-art counterpart, i.e., nnUNet [56], on three data sets. This model also outperformed the interleaved network, nnFormer [28], on the Synapse and MSD data sets and produced a comparable result on the ACDC data set. Compared with the disentangled model, UNet-2022 [29], which was trained using a different setup, our method achieved comparable performance. When our training setup was used, however, our method was superior to the UNet-2022. In addition, the fusion of the two streams normally generated a better result than that produced by each individual stream.
For the Synapse data set (see Table IV), our 3D model generated the best result. In particular, the Transformer stream outperformed the state-of-the-art method, nnFormer [28]. Our 2D model produced the second best result. Compared with the results of the UNet-2022 reported in [29], our method was superior on the Gallbladder, Spleen and Stomach images while being comparable on the Kidney (Right) and Liver images.
The proposed method also achieved results comparable to those produced by its counterparts on the ACDC data set (see Table V). It is noteworthy that both the TransUNet [26] and SwinUNet [46] used pre-trained weights as the initialization, even though they produced better results than our method on the LV images. In contrast, our method can be trained from scratch on the target data set. Besides, the average performance of the Transformer stream of our method was better than that of the SwinUNet [46], which has a similar architecture.
On the MSD data set (see Table VI), our method outperformed the state-of-the-art method, nnFormer [28]. Although this method has set a significantly high benchmark, we did not aim to design the most powerful volumetric segmentation approach. Nevertheless, our 3D model, derived by replacing the units with their 3D counterparts without further modifications, still achieved comparable results. The effectiveness and generalization of our method have thus been shown. When the BUSI and ISIC data sets (see Table VII) were used, our method always outperformed the baselines in terms of both IoU and the F1 score. In particular, each prediction of our method was the best. In contrast, the SwinUNet [46] struggled with segmenting the BUSI images.
It is noteworthy that the UNet-2022 used more data augmentation operations and a more complicated training/inference setup on the Synapse and ACDC data sets, including a multi-term loss function, a deep supervision method and a patch-fusion-based inference strategy. The performance gain of the UNet-2022 may be attributed to these techniques. In contrast, our training setup is much simpler and our model can perform inference over a whole image rather than a set of image patches. When our training/inference setup was used with the UNet-2022, a significant performance degradation was observed.

B. Defect Detection
Defect detection was also performed as a 2D image segmentation task. Regarding the three defect data sets, CFD [67], MT [68] and KSDD [69], the results obtained using different methods are shown in Table VIII. Again, the results produced by our method were better than, or at least comparable to, those derived using the baselines.

C. Visualization of Segmentation
In Fig. 4, we visualize the results obtained using six networks on one image from each of the eight data sets. It can be observed that our method was able to localize the organs and defects at various scales with higher accuracy than its counterparts. In particular, UNet [9] struggled to handle complex organ boundaries and tended to produce over-segmentation results. This observation should be attributed to the limited effective receptive field, which hinders UNet from capturing sufficient semantic context. Although the pure Transformer-based method, SwinUNet [46], could capture long-range dependencies, this method usually produced coarse results because it did not extract local structures well. In contrast, the hybrid method, TransUNet [26], achieved better performance by combining CNNs and Transformers. However, it was inferior to our method due to its serial structure.

Fig. 4. The results obtained on eight data sets, including Synapse [63], ACDC [22], MSD [66], BUSI [65], ISIC [64], CFD [67], MT [68] and KSDD [69], are shown in the above eight rows respectively (for the best viewing, please zoom in). Each row displays the results derived using (a) UNet [9], (b) TransUNet [26], (c) SwinUNet [46], (d) ICCT-UNet-T, (e) ICCT-UNet-C, (f) ICCT-UNet-F and (g) the ground-truth in turn.
In Fig. 5, we further visualize the feature maps extracted using five networks. As can be seen, the ICCT-UNet activated the foreground region better than its counterparts while ignoring the background. These findings should be attributed to the effective integration of the local features and long-range dependencies.

VI. ABLATION STUDIES
We also conducted extensive experiments to examine the effect of different factors on the performance of our method. For simplicity, only the Synapse [63] data set was used.

A. Effect of the Training Setup
To examine the effect of the training setup, we re-trained our network and seven baselines, including UNet [9], XNet [12], TransUNet [26], SwinUNet [46], MissFormer [27], ScaleFormer [52] and UNet-2022 [29], using our training setup five times. The results are reported in Table X. As can be seen, the baselines normally produced worse results using our training setup than those obtained using the original setups shown in Table IV. Also, they were outperformed by the proposed method. On the other hand, the DSC values derived using our method and the UNet-2022 were 82.32 and 81.43 respectively, when they were trained and tested using the training/inference setup of the UNet-2022. It is likely that our simple training setup accounted for the fact that the proposed method did not always produce the best result. However, we only focused on validating the effectiveness of the coupling of CNNs and Transformers for image segmentation rather than deriving the best performance by improving the training setup.

B. Effects of the Dual-Stream Structure and the W-MHCA Unit
To validate the effectiveness of our dual-stream structure and the W-MHCA unit, we built three additional feature exchange units by removing the W-MHCA unit or replacing it with the addition or concatenation fusion methods. To overcome the semantic gap between the two streams, the FCU [31] was incorporated into the fusion-based units. Given that the four units were applied to the encoder and decoder of the ICCT-UNet in different combinations, the results obtained using each stream are shown in Table IX.
When the W-MHCA unit was removed from the encoder and decoder, in essence, two independent single-stream networks were created. As can be observed, each of these produced the worst result. It can also be seen that the feature exchange in the encoder always boosted the performance of each stream no matter which feature exchange unit was applied. Given that the dual-stream encoder was utilized, the use of the dual-stream decoder improved the performance when the addition fusion or W-MHCA units were used. However, the best performance was produced when our W-MHCA unit was used for both the encoder and decoder. This indicates that the proposed dual-stream structure and the W-MHCA unit play important roles in image segmentation.
It has been demonstrated that Transformer-based methods can learn locality when trained with sufficient data and epochs [49], [71]. As shown in Table IX, the Transformer stream achieved a DSC value of 69.27 after it had been trained for 450 epochs. To investigate whether or not the poor result was due to insufficient training, we trained the stream for 1000 epochs and the DSC value rose to 74.0. However, this result was still inferior to that of our dual-stream model. By coupling the Transformer stream with a CNN stream, we not only boosted the performance of the Transformer stream but also shortened the training time. Thus, the effectiveness of the dual-stream structure was further indicated.

C. Effect of the Residual Fusion Module
For the purpose of fusing the features extracted at the two streams, we designed the Residual Fusion Module (RFM). To examine the effectiveness and necessity of this module, we replaced it with a convolutional layer. The modified network was re-trained, and the DSC values obtained using the CNN stream, the Transformer stream and the convolution fusion module were 83.95, 82.28 and 83.99 respectively. This indicates that applying a fusion approach to the features learned by the two streams has the potential to enhance the segmentation accuracy. Although the convolution fusion module could slightly improve the performance, the DSC value was still lower than the 84.60 produced by the RFM. This suggests that the RFM is effective and necessary for further improving the performance of the proposed ICCT-UNet.

D. Effect of the Model Size
In terms of the eight methods, the model sizes (i.e., numbers of parameters), computational complexity (i.e., FLOPs) and the average DSC values derived are compared in Table X. As can be seen, our model outperformed its counterparts of different sizes. Although a larger model may be useful for achieving better performance, the architecture of the network also plays an important role. To investigate the effect of the model size, we built a variant of the UNet [9] by enlarging the model size to 69.1M. The average DSC value obtained using this model was 80.11 ± 0.65, which was much worse than the value of 84.25 ± 0.75 derived using our method. This indicates that the superiority of our method should be attributed to the effective coupling of CNNs and Transformers rather than its model size. Although our method had the highest FLOPs, its inference time was approximately 0.02 seconds per 224 × 224 image. In this case, our method achieved a proper trade-off between computational complexity and effectiveness.

E. Effect of the Scale of the Training Set
To examine the effect of the scale of the training set, we re-trained the UNet [9] and our method using 1/4 and 1/2 of the Synapse [63] training set. The results are shown in Table XI. Compared with the results shown in Table IV, it can be seen that: (1) our ICCT-UNet-F model consistently outperformed the UNet under the two scales, which suggests the generalization and effectiveness of our method; (2) while the UNet achieved comparable results using the smaller training set, the gap between the results of the ICCT-UNet and UNet was large when the larger training set was used, which indicates that CNNs can be trained well on a relatively small data set but struggle to gain greater improvement due to their locality nature, while coupling Transformers to them was useful for boosting their performance; (3) it was difficult to obtain satisfying results using the Transformer stream when our method was trained on a small data set, which hindered the training of the CNN stream. However, the RFM was still helpful for achieving a promising result in this situation.

VII. CONCLUSION
To fulfill the small sample image segmentation task, we proposed a novel network, which comprises an Interactive Coupling of Convolutions and Transformers Based UNet (ICCT-UNet) and a Residual Fusion Module (RFM). Compared with the serial and interleaved networks, which simplistically stack CNNs and Transformers, and the disentangled networks, which unidirectionally fuse CNNs and Transformers, the ICCT-UNet was built on top of a parallel dual-stream architecture, which not only keeps each stream relatively independent but also enables them to interactively exchange features. We also developed a Window-Based Multi-head Cross-Attention (W-MHCA) mechanism. In contrast to the original cross-attention method, the W-MHCA can perform feature exchange at all resolutions with an acceptable computational cost while reducing redundant features. The RFM was used to predict a logit map by fusing the features extracted in the two streams. To our knowledge, neither the dual-stream network nor the W-MHCA unit has been applied to image segmentation before.
Experimental results showed that our method performed better than, or at least comparably to, the baselines (including not only the CNN-based and Transformer-based networks but also the serial, interleaved and disentangled hybrid networks) on eight data sets. Both the CNN and the Transformer streams normally outperformed their single-stream counterparts. Also, the result obtained using the RFM was usually better than that produced by a single stream. This suggests that a complementary action of the two streams has been achieved by our method. We believe that these promising results are due to the inherent capability of our method to effectively integrate the local structure extracted using the CNN stream and the global context captured using the Transformer stream, achieved through the proposed W-MHCA mechanism.
It is noteworthy that we mainly focused on investigating the effectiveness of the interactive coupling of CNNs and Transformers for image segmentation. This explains why we only used a training setup which was much simpler than that used by the state-of-the-art methods, e.g., UNet-2022. The key point is that we have shown that coupling CNNs and Transformers is useful for image segmentation.

Fig. 1. Comparison of different types of hybrid structures: (a) two serial structures, (b) the interleaved structure and (c) the dual-stream structure.


Fig. 2. Illustration of the proposed image segmentation method, which contains an ICCT-UNet and a Residual Fusion Module (RFM). (a) shows the network architecture, while (b) and (c) present the internal structure of the stem block and the basic block respectively, and (d) displays the pipeline of W-MHCA.

Fig. 5. Visualization of the ground-truth and the feature maps extracted from two Synapse [63] images using the decoders of five networks.

TABLE I
THE SETUP UTILIZED FOR THE SEVEN DATA SETS USED IN 2D IMAGE SEGMENTATION, INCLUDING THE BATCH SIZE, MAXIMAL EPOCHS, INITIAL LEARNING RATE, WEIGHT DECAY AND SPATIAL DROPOUT RATE

TABLE II
THE SETUP USED FOR THE THREE DATA SETS UTILIZED IN 3D MEDICAL IMAGE SEGMENTATION, INCLUDING THE SPACING, MEDIAN SHAPE, CROPPING SIZE, BATCH SIZE, MAXIMAL EPOCHS, INITIAL LEARNING RATE

TABLE IV
COMPARISON OF DIFFERENT METHODS ON THE Synapse [63] DATA SET. HERE, * IMPLIES THE RESULTS DERIVED USING A 3D SEGMENTATION MODEL AND † SUGGESTS THE RESULTS OBTAINED USING OUR TRAINING/INFERENCE SETUP. FOR EACH COLUMN, THE BEST, SECOND BEST AND THIRD BEST PERFORMANCES ARE HIGHLIGHTED IN RED, CYAN AND BLUE FONTS RESPECTIVELY. THIS CONTINUES IN TABLES V, VI, VII AND VIII

TABLE V
COMPARISON OF DIFFERENT METHODS ON THE ACDC [22] DATA SET

TABLE VI
COMPARISON OF DIFFERENT METHODS ON THE MSD [66] DATA SET. THE RESULTS SHOWN IN ROWS 1-7 ARE DERIVED FROM [28]

TABLE VII
COMPARISON OF DIFFERENT METHODS ON THE BUSI [65] AND ISIC 2018 [64] DATA SETS

TABLE IX
COMPARISON OF DIFFERENT FEATURE EXCHANGE UNITS, INCLUDING NONUSE (NONE), THE ADDITION FUSION (ADD), THE CONCATENATION FUSION (CAT) AND W-MHCA, ON THE Synapse [63] DATA SET

TABLE XI
EFFECT OF THE SCALE OF THE TRAINING SET ON THE UNET [9] AND ICCT-UNET WHEN THEY WERE TRAINED USING THE Synapse [63] DATA SET