Sketch-Supervised Histopathology Tumour Segmentation: Dual CNN-Transformer With Global Normalised CAM

Deep learning methods are now widely used to segment histopathology images with high-quality annotations. Compared with well-annotated data, coarse, scribble-like labelling is more cost-effective and easier to obtain in clinical practice. Coarse annotations provide only limited supervision, so employing them directly for segmentation network training remains challenging. We present a sketch-supervised method, called DCTGN-CAM, based on a dual CNN-Transformer network and a modified global normalised class activation map. By modelling global and local tumour features simultaneously, the dual CNN-Transformer network produces accurate patch-based tumour classification probabilities while training only on lightly annotated data. With the global normalised class activation map, more descriptive gradient-based representations of the histopathology images can be obtained, and tumour segmentation can be inferred with high accuracy. Additionally, we collect a private skin cancer dataset named BSS, which contains fine and coarse annotations for three types of cancer. To facilitate reproducible performance comparison, experts were also invited to provide coarse annotations on the public liver cancer dataset PAIP2019. On the BSS dataset, our DCTGN-CAM segmentation outperforms the state-of-the-art methods, achieving 76.68% IOU and 86.69% Dice scores on the sketch-based tumour segmentation task. On the PAIP2019 dataset, our method achieves a Dice gain of 8.37% compared with the U-Net baseline.


I. INTRODUCTION
Cancer is one of the most deadly diseases in the world. Despite tumour resection surgery, patients remain at high risk of recurrence. Pathologists create stained histology slides from samples of the resected tumour tissue to assess the effect of the pre-operation treatment regimen. The visual examination of histopathological images involves searching for specific medical features such as the tumour's shape, location and growth pattern [1]. In clinical practice, digital scanners [2] capture digitised whole slide images (WSIs), making the visual examination of histopathology slides easier and more flexible. Nevertheless, the time and effort required for pathologists to visually analyse the WSIs of every single case are enormous, given the large number of slides to be analysed and the limited availability of specialised pathologists. Besides, visual evaluations are inherently subject to inter-observer and intra-observer variability. The resulting inconsistent annotations may be unsatisfactory, with a negative impact on the actual diagnosis and treatment planning [3].
Recent improvements in computer vision open new avenues for (semi-)automatic analysis of digital WSIs, saving significant time and resources over manual analysis. Tumour segmentation in histopathological images is heavily dependent on the quality and quantity of annotated ground-truth boundaries, which, however, are costly and challenging to acquire. Specifically, tumour borders are complex, vague, and non-rigid, making them extremely difficult to define, even for experienced pathologists.
In a more practical scenario, pathologists tend to mark tumour regions with a rough outline instead of drawing out every detail around the tumour border [4]. However, in standard deep-learning methods, the learned models generate predictions whose quality matches that of the training labels: accurate boundaries can only be predicted by segmentation models trained with accurate, detailed annotations. It is therefore an unusual, yet highly demanded, capability to segment tumour regions accurately using models trained only on coarse labels. Weakly supervised methods try to solve model training with coarse annotations, including category-based, sketch-based, bounding-box-based, point-based, and interaction-based categories [5]. However, these methods only train the model on the coarse annotations without considering subsequent annotation refinement. Thus, in this article, we are motivated to boost supervision signals by refining coarse annotations. Additionally, fully convolutional networks, e.g., VGG [6], GoogleNet [7] and ResNet [8], have been adopted as the mainstream backbones to extract medical object features supervised by coarse annotations. Transformer-based methods have recently been proposed for the representation of global features and the segmentation of medical objectives on 2D and 3D images [9]. Several CNN-Transformer fusion studies have demonstrated the benefit of global features on convolutional neural network (CNN) structures. Inspired by these approaches, we propose a novel scheme that joins the strengths of CNNs and Transformers in the context of sketch-based supervision for segmenting tumour regions.
This research is committed to developing a tumour segmentation model that can learn from light annotations of coarse region boundaries and, once trained, is able to define accurate tumour boundaries with fine details on unseen histology images. To facilitate experiments and evaluation, we acquire two versions of annotations of tumour regions on our target datasets: a set of poor-quality labels (P-labels) and a set of fine-quality labels (F-labels). P-labels can be obtained with relatively light effort by pathologists and are used for training the models. In contrast, F-labels require significant time to prepare and, in this article, are only utilised as ground truth. The accuracy of F-labels is far better than that of the P-labels. Based on this scenario, we propose a framework that follows a sketch-supervised paradigm [10]. More specifically, it aims to generate accurate tumour region masks by models learned only from P-labels. The core of the framework entails a Dual CNN-Transformer network (DCT), supported by a Global Normalised class activation map (GN-CAM). The main idea of the Dual CNN-Transformer structure is to integrate the advantages of CNNs and Transformers, providing descriptive joint global and local tumour representations. This dual network structure forms the foundation for the sketch-based tumour segmentation task. Fig. 1 introduces the training and testing pipelines of our DCTGN-CAM method.
Our proposed method performs more satisfyingly than the popular methods in this area, and our final result is even better than the primitive P-label when the F-label is used as ground truth in the evaluation. The average improvement of our proposed method over the P-label is more than 20%, which demonstrates its success as a sketch-supervised framework as well as an effective way of improving poor-quality annotations. Our contributions are threefold:
- By calculating the intersection of cancer

II. RELATED WORK

A. Tumour Segmentation
Conventional image processing techniques do not work effectively when applied to tissue segmentation in histopathology images, as their manually designed features cannot capture semantic information. Instead of relying on manually crafted features, deep convolutional networks offer a more straightforward choice by training models to extract the most relevant and descriptive feature information. Patch-based classification networks achieve WSI segmentation with low computational complexity while sacrificing boundary smoothness [11], [12]. U-Net is one of the most widely used techniques in patch-based pathological image segmentation [13]. Its main idea is to capture global features on the contracting path and achieve accurate positioning on the expanding path. However, U-Net does not fully consider the local dependence among pixels. To address this issue, a fusion framework has been proposed to promote the accuracy of tumour edge segmentation [14]. Long-range dependencies can be modelled by conditional random fields, which can be exploited to post-process semantic segmentation predictions. However, this method is computationally intensive and requires a large number of expert annotations for training.

B. Weakly Supervised Segmentation
Generally, a sketch refers to sparse annotations that provide masks for small areas of pixels [15]. In existing methods, a selective pixel loss is usually applied to the annotated pixels. For model training, some studies attempt to expand sketches or reconstruct the entire mask [16]. A number of works employ conditional random fields in post-processing [17], [18] or as a trainable layer to refine segmentation results without relabelling [19]. These methods, however, are not effective in providing better supervision for model training. More recent methods for evaluating and refining segmentation masks have been developed, leading to more accurate predictions, such as the multi-scale attention gate proposed by Gabriele et al. [20] and the PatchGAN discriminator that leverages shape priors by Zhang et al. [21]. However, these methods require additional sources of mask data and are not applicable in more general scenarios.
Initial cues are essential for weakly supervised segmentation tasks since they provide reliable priors for generating segmentation maps. The class activation map (CAM) can be a good auxiliary, as it provides preliminary object localisation information [22]. It highlights class-specific regions that can serve as initial cues. In [22], the authors demonstrate that a CNN with a Global Average Pooling (GAP) layer has localisation capabilities despite not being explicitly trained for them. Our work in this article takes inspiration from two algorithms of this kind, namely CAM [22] and Grad-CAM [23], and we intend to resolve their existing issues in modelling global tumour representations at the whole-slide level.
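As a hedged, minimal sketch of the CAM idea in [22] (not the authors' implementation; the array shapes and names here are our own assumptions), the map for a class is a weighted sum over the final convolutional feature maps, using the fully connected weights that follow the GAP layer:

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """CAM sketch: weight the last conv layer's feature maps by the FC
    weights of the target class, sum over channels, then rectify and
    normalise to [0, 1].

    feature_maps: (C, H, W) activations from the final conv layer
    fc_weights:   (num_classes, C) weights of the FC layer after GAP
    """
    w = fc_weights[class_idx]                    # (C,) class-specific weights
    cam = np.tensordot(w, feature_maps, axes=1)  # (H, W) weighted channel sum
    cam = np.maximum(cam, 0.0)                   # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                    # normalise to [0, 1]
    return cam
```

The resulting heat map highlights the spatial locations the classifier relied on, which is exactly what serves as an initial cue.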

C. Attention Mechanism and Transformers
Attention mechanisms are designed to discover and exploit the key parts of a batch of data. Several existing works have applied Non-local modules to segmentation tasks. In [24], the authors introduce a global feature with the Non-local operation. Global-guided Local Affinity has been proposed to play a crucial role in modelling and capturing context information [25]. These non-local-based attention models, however, are memory-intensive. To reduce computation costs, several related works were proposed later. The attention in CCNet collects near and far information along criss-cross paths with low computational complexity [26]. Inspired by spatial attention and the Non-local block, GCNet uses a simplified non-local block with lower memory requirements [27]. However, these models are embedded into convolution and sampling layers. Sampling layers such as pooling always lose image details, causing poor performance. Additionally, self-attention blocks like the Non-local block can exploit global information integrated across channel and spatial dimensions, but with high computational complexity [28].
Recently, the Vision Transformer has emerged as a novel approach that integrates a Non-local-style block to facilitate attentive interaction among different patch tokens [29]. In medical image segmentation, several techniques have been developed to address the limitations associated with capturing both global semantic information and local contextual details [30]. The TransUNet method leverages the self-attention mechanism to compute global context [31], while SETR replaces the encoding component of conventional convolutional layers with Transformers, resulting in improved segmentation performance [32]. Swin-UNet, resembling the U-Net architecture, employs a hierarchical Swin Transformer with shifted windows as the encoder to extract contextual features [33]. The MSHT model adopts a multi-stage hybrid design, combining Transformer blocks with CNNs to enhance spatial features and leverage the global modelling capabilities of Transformers [34]. SegTransVAE synergistically utilises the strengths of CNNs, Transformers, and variational autoencoders [35]. Nevertheless, the inference time of hybrid CNN-Transformer or pure Transformer structures is longer than that of CNNs, due to the increased computational resources required by Transformer blocks. Inspired by the success of these approaches, we incorporate Transformers into our network design to leverage their capabilities in our proposed hybrid models. Additionally, we prioritise addressing the computational complexity to ensure efficient processing.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
Algorithm 1: Annotation Refinement by K-Means.
1: repeat
2:   Compute the cluster centroids of background C_1 and tumour C_2, where x_i is one pixel of the WSI; the kth cluster centroid is the vector of the feature means in the kth cluster.
3:   Assign each observation x_i in the unsupervised label Y_1 to the cluster whose centroid is closest.
4: until the cluster assignments no longer change.

III. METHODS
Fig. 1 shows the overall framework architecture. Using an unsupervised AR module, the coarse annotations are first refined as much as possible. Then, supervised by a binary cross-entropy loss, the proposed Dual CNN-Transformer Network (DCT) simultaneously trains a fully convolutional network and a Transformer for patch classification. Tumour segmentation masks for a test image are inferred based on the patch classification output from the DCT. Specifically, the proposed Global Normalised CAM (GN-CAM) calculates gradient-based heat maps derived from the final convolution layer of the DCT. To produce a whole heat map of the same size as the WSI, all individual heat-map patches are placed in order. Global normalisation models the global tumour information over the whole heat map and ensures precise marking of tumour boundaries. Lastly, noise is eliminated using a convolutional-CRF-driven eliminator.

A. Annotation Refinement
The P-label is the sketch-like coarse mask drawn by experts. Its boundaries are inexact, meaning that many non-tumour regions around the boundaries are likely to be included inside the mask, so some visual features learned during training come from these vague regions rather than only from genuine tumour tissue. In contrast, the F-label illustrates the tumour regions and boundaries accurately but requires significant time to prepare. In this article, F-labels are only utilised as ground truth for performance testing.
To relieve these data challenges and improve the training data quality, an AR module is designed based on the K-means clustering algorithm to refine the pixel memberships in the marginal regions of tumours marked by P-labels. Following this unsupervised process, pixels with similar visual features are grouped together, while regions of distinct colours are better delineated by the mask boundaries. When experts annotate the coarse tumour masks, they tend to do so slightly excessively, including all tumour regions inside the mask as well as some non-tumour tissues along the margin. Thus, the coarse P-label, denoted as Y_0, only roughly separates non-tumour regions from tumour regions. The AR module is designed to preliminarily improve the coarse masks by re-examining the tissue membership along the mask margin based on pixel colours. Unsupervised K-means clustering is applied to all pixels of the WSIs, creating a new set of labels Y_1 to represent the tumour boundaries. Nevertheless, the Y_1 label sometimes includes non-tumour pixels that have colour features similar to the tumour regions, while the original coarse annotation Y_0 normally does not include such regions unless they are in contact with the genuine tumour region. The region of Y_0 is usually much larger than the region of Y_1, while Y_1 may contain some outlier regions disconnected from the main tissue region. Thus, the refined tumour mask is obtained as the intersection of the two: Ŷ = Y_1 ∩ Y_0.
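The AR step above can be sketched as follows. This is a simplified illustration, not the exact implementation: we assume two-cluster K-means on raw RGB values, with the darkest and brightest colours as deterministic initial centroids, and we treat the cluster that overlaps the P-label most as the tumour cluster.

```python
import numpy as np

def refine_annotation(image, y0, n_iter=10):
    """AR sketch: cluster WSI pixels into background C1 / tumour C2 with
    K-means, take the resulting unsupervised mask Y1, and return the
    refined mask Y_hat = Y1 ∩ Y0.

    image: (H, W, 3) float pixel colours; y0: (H, W) bool coarse P-label.
    """
    pixels = image.reshape(-1, 3).astype(float)
    # deterministic initial centroids: darkest and brightest colour (assumption)
    centroids = np.stack([pixels.min(axis=0), pixels.max(axis=0)])
    for _ in range(n_iter):
        dist = np.linalg.norm(pixels[:, None, :] - centroids[None], axis=2)
        assign = dist.argmin(axis=1)        # nearest-centroid assignment
        for k in range(2):
            if np.any(assign == k):
                centroids[k] = pixels[assign == k].mean(axis=0)
    assign = assign.reshape(y0.shape)
    # the cluster that overlaps the coarse mask most is treated as "tumour"
    tumour_k = 1 if (assign[y0] == 1).mean() > 0.5 else 0
    y1 = assign == tumour_k
    return y1 & y0  # Y_hat = Y1 ∩ Y0
```

The intersection keeps only pixels that both the expert sketch and the colour clustering agree on, trimming the over-inclusive margin of Y_0 while discarding disconnected outliers of Y_1.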

B. DCT Network for Patch Classification
In existing unsupervised or weakly supervised, patch-classification-based segmentation methods, VGG is commonly used as a CNN backbone for classification. VGG [6] shows substantial improvement through the use of deeper convolutional layers and small kernels, and it is popular in patch-based classification and segmentation tasks in weakly supervised medical imaging. However, VGG has internal design limitations that lead to vanishing gradients and ignorance of long-term dependencies among pixels. Besides the VGG structure, U-Net is also commonly used as a CNN backbone to extract high-dimensional features. However, this pixel-wise segmentation structure may introduce more false-positive results when the annotations are coarse sketches, which is unacceptable for weakly supervised tumour segmentation tasks. Recently, several Transformer-based methods have attempted to describe global features effectively. Therefore, an intuitive idea is to exploit Transformers to compensate for the limitations of CNN structures. In light of the fact that most category-based weakly supervised approaches use VGG as their backbone, taking VGG as our base network makes it easy to demonstrate network improvements.
Specifically, the VGG network prioritises shallow features (colour) over high-level features (morphological structure) in the pathological image classification task [6]. This means that convolutional networks like VGG lack a global understanding of the whole image, whereas, for the classification of pathological image patches, the extraction of global semantic features is the key to cancer recognition at the boundary. As a result, VGG cannot classify cancer tissues accurately, especially around cancer borders with complex visual characteristics. Recently, the emergence of Transformers has shown a promising perspective for solving the problem of long-term dependence in the field of computer vision. To combine the strengths of CNNs and Transformers, we propose a dual CNN-Transformer network, namely the DCT net, which consists of two branches: a CNN branch and a Transformer branch. In the CNN branch, we substitute the usual fully convolutional block structures, as in VGG, with residual blocks to focus on local features. The Transformer branch is designed to extract global semantic features that complement the local visual representations. This dual-branch structure ensures robust and precise tumour classification by modelling the local details and the global tissue relationships simultaneously. Extracted from the coarsely annotated masks, some patch labels may be wrongly associated with meaningful visual features, hampering the learning of such features and subsequently affecting the decision power of the network. Therefore, we hope to separate the calculation process of each feature as much as possible. Our expectation is that each head can independently extract one type of global feature, such as global texture, global tumour colour distribution, or global tumour boundary. However, a Swin Transformer fuses the multi-head attention results for the normal window before calculating self-attention based on shifted windows. Consequently, each head cannot model one type of global feature
independently, resulting in redundant multi-head shift-window attention. To resolve this issue, we adjust the structure by using a stack of these sub-blocks to ensure the consistency of the feature representation, fusing the features of each head after the cascaded self-attention calculation on the two windows (a normal window and a shifted window), as shown in sub-figure (b) of Fig. 3. Our design ensures a consistent representation of global features.
Rather than flattening the patches x and mapping them with a trainable linear projection [29], we exploit a convolution operation E, shown in (1), to project the image patch into a high-dimensional space and to split it into smaller window-based patches simultaneously. The image patch x ∈ R^(H×W×3) is transformed into a sequence of patches x_p.
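As a hedged illustration of this windowed projection (the window size P and embedding dimension D below are hypothetical, and this is not the exact operation E from (1)): a convolution whose kernel size equals its stride is equivalent to flattening each P×P window and applying a single linear map.

```python
import numpy as np

def conv_patch_embed(x, weight):
    """Project an H×W×3 image patch into a sequence of window tokens with a
    strided convolution (kernel = stride = P), which equals flattening each
    P×P window and applying one linear projection.

    x:      (H, W, 3) image patch
    weight: (D, P, P, 3) convolution kernels, one per embedding channel
    returns (H//P * W//P, D) token sequence
    """
    D, P, _, _ = weight.shape
    H, W, _ = x.shape
    tokens = []
    for i in range(0, H - P + 1, P):
        for j in range(0, W - P + 1, P):
            window = x[i:i + P, j:j + P, :]            # one P×P window
            tokens.append(weight.reshape(D, -1) @ window.reshape(-1))
    return np.stack(tokens)
```

This shows why a single convolution can perform the splitting and the high-dimensional projection in one step.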

C. CAM Feature Extractor and Visualizer
Class Activation Mapping (CAM) [36] presents exemplary visualisation ability and has been explored in sketch-based learning. The aim of CAM is to show which features interest the target network and thus reveal the network's focus on every patch. Using CAM, we can see that the network's interest in the tumour region goes beyond colour values alone; thus, the output of CAM grants the final result a significant improvement in accuracy.
However, CAM can only process the tumour information within a patch. In patch-based segmentation, the proportion of a patch in a WSI is very small, so the relationship between patches is very important: it plays a fundamental role in defining tumour region boundaries. We need to consider the information both within and between patches. To this end, we design a Global Normalised class activation map (GN-CAM). Firstly, we calculate the CAM heat map in two forms: the class activation heat map inside the patches and that of the whole WSI. The heat values are fused as the final results of GN-CAM. By taking into account the overall picture of the WSI, this fused result captures the changes of detail within patches and suppresses noise in visually confusing regions, such as tumour boundaries.
As a first step, the global normalised CAM (GN-CAM) collects the gradient-guided information B_l flowing from the DCT-Net's last convolution layer and the output map of the DCT-Net, y_out. The lth and (l+1)th layers are the two convolutional layers inside the last fusion block of the DCT-Net. Denote i as the channel index of a feature map. According to the gradient backpropagation, let y_i^(l+1) be the ith feature map from the (l+1)th layer and B_i^(l+1) the ith guided gradient map from the (l+1)th layer. The gradient feature of the (l+1)th layer is calculated from these quantities, from which the guided gradient map B_l flowing out of the lth layer follows. We define an updateable queue O_1 to store all guided gradient-based features B from the same WSI. Then we decentralise the feature set B by a global normalisation and store the processed features in a new queue O_2. Additionally, every pixel B_(i,m,n)^l, m ∈ M, n ∈ N, is normalised locally at the patch level, where (m, n) is one position in the feature map B.
The locally normalised features from the same WSI are collected in a queue O_3. Each pair of corresponding normalised gradient maps from O_2 and O_3 is combined and output as the final segmentation result D.
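Since the exact fusion equations are not reproduced here, the following is only an assumed sketch of the queue-based fusion: O_2 holds globally decentralised maps (zero-mean over the whole WSI), O_3 holds per-patch min-max normalised maps, and corresponding entries are summed to give D.

```python
import numpy as np

def gn_cam_fuse(grad_maps):
    """Hedged GN-CAM fusion sketch (assumed form, not the paper's equations):
    combine globally decentralised maps (queue O2) with per-patch min-max
    normalised maps (queue O3) for all patches of one WSI.

    grad_maps: list of (H, W) guided-gradient maps B from one WSI (queue O1)
    returns:   list of (H, W) fused maps D
    """
    stack = np.stack(grad_maps)
    o2 = stack - stack.mean()            # global decentralisation over the WSI
    o3 = []
    for b in grad_maps:                  # patch-level (local) normalisation
        span = b.max() - b.min()
        o3.append((b - b.min()) / span if span > 0 else np.zeros_like(b))
    return [g + l for g, l in zip(o2, o3)]
```

The global term keeps heat values comparable across patches of the same WSI, while the local term preserves within-patch detail.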

D. Noise Eliminator
The final refinement of the tumour segmentation relies on convolutional CRFs [37]. Consider an input D (the probability map from the output of the GN-CAM) with shape [B, C, H, W], where B, C, H, W denote the batch size, number of classes, input height and width, respectively. Assume that two pixels u = (p, q) and v = (p + dp, q + dq) come from two conditionally independent distributions, where p and q are image coordinates. Let d(u, v) denote the Manhattan distance and t the filter size; all pixel pairs with d(u, v) > t have a pairwise potential of zero. The Gaussian kernel matrix k_g is defined over this neighbourhood. Denote E ∈ R^(B×C×H×W) as the final output of the CAM visualizer. A Gaussian kernel g can be calculated from the feature vectors e_1, ..., e_d by (13), with ω defined in terms of a learnable parameter δ_i. For a set of Gaussian kernels {g_1, ..., g_S}, where S is the number of kernels, we define the global kernel matrix G = Σ_{r=1}^{S} ω_r · g_r. With the combined message passing of all S kernels, the result Q is obtained, and the final tumour segmentation for one whole slide image is given by the matrix Q.
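A hedged, single-kernel sketch of the convolutional-CRF message passing (a naive loop version for clarity, not the efficient implementation of [37]; the kernel form and bandwidth `delta` are assumptions): each pixel aggregates the class probabilities of its neighbours within Manhattan distance t, weighted by a Gaussian on the feature difference.

```python
import numpy as np

def conv_crf_message_pass(prob, feats, t=1, delta=1.0):
    """One convolutional-CRF message-passing step (sketch): every pixel u
    aggregates the class probabilities of neighbours v with d(u, v) <= t
    (Manhattan distance; farther pairs have zero pairwise potential),
    weighted by a Gaussian kernel on the feature difference.

    prob:  (C, H, W) probability map D from the GN-CAM output
    feats: (d, H, W) per-pixel feature vectors e_1..e_d (e.g. colour)
    """
    C, H, W = prob.shape
    q = np.zeros_like(prob)
    for dp in range(-t, t + 1):
        for dq in range(-t, t + 1):
            if (dp == 0 and dq == 0) or abs(dp) + abs(dq) > t:
                continue  # skip self and pairs beyond the filter size t
            for p in range(H):
                for r in range(W):
                    pp, qq = p + dp, r + dq
                    if 0 <= pp < H and 0 <= qq < W:
                        diff = feats[:, p, r] - feats[:, pp, qq]
                        g = np.exp(-0.5 * np.dot(diff, diff) / delta ** 2)
                        q[:, p, r] += g * prob[:, pp, qq]
    return q
```

The combined kernel G of the text would sum S such weighted aggregations, each with its own learnable weight ω_r.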

A. Data Introduction

1) BSS Dataset:
The BSS dataset [38] is a private tumour dataset that has been adopted in our previous work. It contains 150 WSIs of squamous cell carcinoma, covering basal cell cancer (BCC), squamous papilloma (SP) and seborrheic keratosis cancer (SKC). All of the images in the BSS dataset are from the Second Affiliated Hospital of Zhejiang University, China. In total, it took around four years to collect the dataset, make the annotations and review the data. To protect the privacy of patients, all personal labels on the scan images have been removed. For each scan, the invited experts spent roughly 60 minutes marking the fine tumour labels (F-labels) and 5 minutes annotating the coarse tumour labels (P-labels). After that, we invited two senior experts to spend one week checking whether all tumour regions were marked in the scans of the BSS dataset.
Several patch samples cut from the WSIs are shown in Fig. 4. It is clear that the whole tumour lesion is composed of multiple lobules, as shown in Fig. 4. Each lobule is covered with squamous epithelial cells. The fibrous vascular tissue in the centre is infiltrated with inflammatory cells. There is obvious thickening of the squamous epithelium, vacuoles in the cytoplasm, and an increase in goblet cells. Inside the tumour cells, there is no obvious mitotic phase or nuclear heterogeneity.
2) PAIP2019 Dataset: The PAIP2019 dataset [39] has facilitated the development and benchmarking of cancer diagnosis and segmentation. This dataset contains 50 liver cancer WSIs with high-quality annotations and was first released by the PAIP Liver Cancer Segmentation Challenge, organised in conjunction with the Medical Image Computing and Computer Assisted Intervention Society (MICCAI 2019). Hepatocellular carcinoma (HCC) is a cancer of the internal organs, and most primary liver cancers are hepatocellular carcinomas. HCC scans contain a number of cellular and stromal components. The annotations published by PAIP2019 are of very high quality. However, in clinical practice, the majority of tumour WSIs do not have precise annotations. To simulate this common situation, we invited two tumour experts, who were also responsible for marking the private BSS dataset, to mark the 50 WSIs of PAIP2019 with coarse annotations (P-labels). The original high-quality annotations from the PAIP2019 dataset are used as F-labels in the evaluation process. The two versions of annotation are shown in Fig. 5.
To alleviate the potential impact of variations in the coarse labels, we adopted a meticulous approach during the data preparation phase. Specifically, we enlisted the expertise of four histopathologists, who independently labeled the tumor datasets using coarse annotations. We carefully considered subjective independence and authoritative annotation by inviting four experts to mark the BSS dataset and the remaining four experts to label the PAIP2019 dataset. The final coarse annotations were generated by aggregating the inputs from all four experts, resulting in a more comprehensive and representative labeling scheme. In Section IV, we present experimental results obtained from two different tumor datasets, namely the BSS dataset and the PAIP2019 dataset, in Tables III and IV, respectively. These datasets encompass a total of five tumor sub-categories, providing a diverse set of cases for evaluating the performance of our proposed method.

B. Evaluation Metrics
The segmentation inference results are evaluated using Recall, Specificity, Accuracy, IOU and Dice [40]. We assume that the positive samples are tumour pixels and the negative samples are normal tissue. Six quantities are defined to describe the relationship between the ground truth and the prediction results.
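For reference, the five metrics can be computed from binary masks as follows (a straightforward sketch, with tumour pixels as positives):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute Recall, Specificity, Accuracy, IOU and Dice from binary masks,
    treating tumour pixels as positives and normal tissue as negatives.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)       # tumour predicted as tumour
    tn = np.sum(~pred & ~gt)     # normal predicted as normal
    fp = np.sum(pred & ~gt)      # normal predicted as tumour
    fn = np.sum(~pred & gt)      # tumour predicted as normal
    return {
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "iou": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
    }
```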
To verify the generalizability of our proposed method, the segmentation performance is evaluated on both the private BSS tumour dataset and the public PAIP2019 tumour dataset. We evaluate five performance metrics for the four components of our proposed method.

C. Training Details
In Tables II, III, and IV, we compare four state-of-the-art (SOTA) pixel-wise medical image segmentation methods and three classification-based image segmentation methods on the task of sketch-based tumor segmentation. The classification-based methods require CAMs to visualize the tumor distribution within patches, whereas the pixel-wise methods do not rely on CAMs. To ensure a fair comparison of these networks, we provide the parameter settings for data preparation, model training, and inference.
For the data preparation in the experiments presented in Tables II, III, and IV, we randomly shuffled all WSIs from the BSS dataset. We allocated 20% of the WSIs for validation, 20% for testing, and the remaining 60% for training. The same data split ratio was applied to the resection scans of the PAIP2019 dataset. We saved all the patches in three sub-datasets for subsequent model training, validation, and testing. The number of split patches from each WSI ranged from 20,000 to 25,000 due to varying resolutions. Overall, the BSS dataset contained approximately 3 million patches, while the PAIP2019 dataset had 1.25 million patches, providing sufficient data support for model training and performance testing. For a detailed breakdown of the data preparation in the training, validation, and testing stages, please refer to Table I, which details the distribution of WSI images and patches in both datasets.
During model training, we utilized cross-entropy loss for all the networks listed in Tables II, III, and IV. The initial learning rate was set to 0.0001, and the optimizer was employed. Each method was trained for 500 epochs on an Nvidia RTX 3080 GPU. To ensure consistent input sizes, we divided all images into (512, 512) patches.
For model inference, we tested all the trained models on 600 K testing patches from the BSS dataset and 475 K patches from the PAIP2019 dataset. A comprehensive comparison of segmentation performance for all classification-based models is presented in Table II. Table III compares the tumor segmentation performance of both the pixel-wise and classification-based methods on the BSS dataset. Similarly, Table IV compares the tumor segmentation performances of these methods on the PAIP2019 dataset. Note that the (GN-)CAMs in Tables III and IV were utilized during model inference and did not require additional training.

D. Sketch-Based Tumour Segmentation Results on the BSS Dataset
1) The Annotation Refinement: We implement a series of ablation studies to measure the gain from our proposed AR module, as shown in Table II. The experimental results show that the AR module is robust and efficient across two CAMs (CAM and GN-CAM) and three types of models (VGG, VF, and DCT). The base VGG+CAM+NE combination achieves a Specificity of 98.90% after using the AR module. In particular, the segmentation precision increases from 83.97% to 88.29% after merely adding the AR module to the base experimental configuration. Furthermore, the AR module yields larger gains in the Recall, IOU, and Dice metrics with the proposed GN-CAM visualizer than with the base CAM. In detail, after using the AR module, the VGG+GN-CAM+NE combination achieves a Recall of 81.96% (+8.64%), an IOU of 70.83% (+5.39%), and a Dice of 82.62% (+3.92%).
2) The Dual CNN-Transformer Classification Network (DCT): The proposed DCT network is designed to classify whether a patch belongs to the "tumour" class. Table II shows the quantitative comparison between the existing methods and our proposed DCT. The proposed DCT network outperforms the compared VGG and VF methods, especially when the GN-CAM module is used simultaneously. The best sketch-based tumour segmentation performance, achieved by the AR+DCT+GN-CAM+NE configuration, reaches a Recall of 84.44%, Specificity of 97.85%, Accuracy of 96.23%, IOU of 71.33%, and Dice of 83.12%. This experimental evidence demonstrates the effectiveness of the proposed DCT for sketch-based tumour segmentation tasks. Compared with the VF-based method (VF+CAM) [38], our proposed method gains 14.55% in Recall, 14.04% in IOU and 10.76% in Dice, a significant improvement on the BSS dataset.
Fig. 6 further illustrates the qualitative segmentation analysis among the VGG, VF and our proposed DCT networks. We carefully draw three types of tumour predictions using the three different methods, respectively. The predictions of DCT are found to be closest to the F-labels. For example, in the SKC WSI, the magnified image in the upper-right corner shows that the enlarged tumour region looks like a "horse" in the green-box-selected area of the F-label. The shape and location of the "horse" are clearly distinguishable when using our DCT, but it is impossible to discern the "horse" body in the predictions of the VGG and VF methods. These qualitative results show that our proposed DCT is effective and outperforms the other networks on sketch-based tumour segmentation.
3) GN-CAM: Inspired by the CAM concept, we design a GN-CAM module to represent local and global tumour features and to refine the tumour location at the patch level. We aim to address a non-negligible challenge of patch-based segmentation tasks: without representing the global relationship between patches, the boundaries of the merged WSI prediction are not consistent. Our work generates and merges the output heat-map patches of GN-CAM globally. Although we only train a binary classification network with sketch supervision, the proposed GN-CAM is capable of inferring tumour locations and boundaries precisely at the inference stage. Fig. 7 provides a close look at the predicted heat maps of tumour patches. The heat maps illustrate accurate tumour location and region information on three types of patches after using the proposed GN-CAM module, showing that GN-CAM can produce precise tumour segmentation results based on the classification probabilities.
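The effect of global normalisation can be shown with a minimal numpy sketch. Under our reading of the method, all patch CAMs of one WSI share a single min/max instead of each patch being normalised by its own range; the raw CAM values below are toy inputs, not outputs of the actual classifier.

```python
import numpy as np

def per_patch_norm(cams):
    """Conventional CAM post-processing: each patch uses its own min/max."""
    return [(c - c.min()) / (c.max() - c.min() + 1e-8) for c in cams]

def global_norm(cams):
    """GN-CAM-style post-processing (our reading): one global min/max per WSI,
    so activations stay comparable across patch boundaries when the heat maps
    are stitched back together."""
    lo = min(c.min() for c in cams)
    hi = max(c.max() for c in cams)
    return [(c - lo) / (hi - lo + 1e-8) for c in cams]

# Two neighbouring patches: one strongly activated, one weakly activated.
strong = np.array([[4.0, 8.0]])
weak = np.array([[1.0, 2.0]])

pp = per_patch_norm([strong, weak])
gn = global_norm([strong, weak])
print(pp[1].max())   # 1.0 -> the weak patch looks as "hot" as the strong one
print(gn[1].max())   # ~0.14 -> globally, the weak patch stays weak
```

Per-patch normalisation exaggerates weak activations, producing inconsistent boundaries between adjacent patches; a shared global range avoids this, which matches the boundary improvement attributed to GN-CAM.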
The AR and GN-CAM modules clearly complement each other. Fig. 8 shows the segmentation improvement of the AR and GN-CAM modules under the same DCT classification network. Compared with AR+CAM and GN-CAM, our segmentation results have fewer false-positive pixels than GN-CAM alone, which means fewer normal tissues are classified as tumours thanks to the AR. Furthermore, the boundaries of our predicted tumour regions are more precise than those predicted by AR+CAM, indicating that the GN-CAM improves the boundary details effectively. Therefore, the AR and the GN-CAM work together to improve the tumour segmentation performance.
4) The Noise Eliminator: Patch-based segmentation poses another challenge: jagged edges always appear at patch boundaries, and noise in non-tumour areas often looks square or rectangular. As a result, taking the threshold-based results as the final tumour segmentation may lead to large visual errors. In our ablation study, the noise problem is significantly relieved by the noise eliminator module. Fig. 9 shows the predicted patches with a focus on the noise eliminator. The noise eliminator removes false-negative samples (filling voids within tumour tissue) and false-positive samples (eliminating isolated noise areas outside the large tumour areas). Compared with the intermediate results taken directly from the output of the CAM, the final results processed by the NE effectively improve the segmentation performance.
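The two NE operations named above (filling voids, removing isolated specks) can be sketched with standard morphological tools; the `min_area` threshold and function name are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy import ndimage

def noise_eliminate(mask, min_area=4):
    """NE sketch (our reading): fill holes inside tumour regions
    (false negatives) and drop small isolated connected components
    (false positives). `min_area` is an illustrative threshold."""
    filled = ndimage.binary_fill_holes(mask)
    labels, n = ndimage.label(filled)
    out = np.zeros_like(filled)
    for i in range(1, n + 1):
        comp = labels == i
        if comp.sum() >= min_area:
            out |= comp
    return out

# Toy prediction: a large tumour blob with a hole, plus a 1-pixel speck.
pred = np.zeros((10, 10), dtype=bool)
pred[2:8, 2:8] = True
pred[4, 4] = False          # false-negative void inside the tumour
pred[0, 9] = True           # isolated false-positive speck
clean = noise_eliminate(pred)
print(clean[4, 4], clean[0, 9])   # True False
```

In a full pipeline these operations would run on the thresholded whole-slide heat map, after the patch predictions are merged.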

E. Sketch-Based Tumour Segmentation Results on the PAIP2019 Dataset
Another ablation study compares our method with a non-classification method on the public PAIP2019 dataset, to further demonstrate the effectiveness of our approach. Table IV compares the P-label, U-Net and our method against the F-label, respectively. U-Net is a pixel-wise method commonly used for tumour segmentation [13]. The experimental results show that U-Net gains 3-12 % over the raw P-label, even though it is trained with supervision from the P-labels only. However, our method shows a large performance gain over U-Net under the same experimental configuration. Specifically, our proposed method outperforms U-Net by approximately 9 % in Recall, 4 % in Accuracy, 13 % in IOU, and 12 % in Dice.
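For reference, the overlap metrics used throughout these comparisons (IOU and Dice) can be computed from binary masks as follows; this is the standard definition, not code from the paper.

```python
import numpy as np

def iou_dice(pred, gt):
    """Intersection-over-Union and Dice coefficient for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return iou, dice

pred = np.array([[1, 1, 0, 0]])
gt = np.array([[1, 1, 1, 0]])
iou, dice = iou_dice(pred, gt)
print(round(iou, 2), round(dice, 2))   # 0.67 0.8
```

Note that Dice is always at least as large as IOU (Dice = 2·IOU/(1+IOU)), which is why the reported Dice scores exceed the IOU scores on both datasets.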
Although our proposed method performs well on the private dataset above, we still need to evaluate it on public datasets to verify its universality. Compared with the various types of skin cancer in the private dataset, the boundaries between the non-tumour and tumour tissues in PAIP2019 are relatively smooth and hard to recognise. Table IV and Fig. 10 show the systematic evaluation results of the existing methods on the PAIP2019 dataset. Our method (with the proposed modules) achieves better segmentation results on every evaluation metric, and still clearly outperforms the fully supervised network U-Net, demonstrating its success on sketch-supervised tumour segmentation tasks.

F. Clinic Application
It is noticeable that the P-labels of PAIP2019 are rough outlines drawn based on the tumour locations in the F-labels. Similarly, the P-labels of the BSS dataset also contain false-positive pixels only. Therefore, a premise for these good results is that the P-label should include all tumour regions.
Our method is also applicable in the clinic. A pathologist only needs to coarsely outline the tumour boundaries in a short amount of time; accurate tumour segmentation results can then be obtained with the proposed method. Our work enables doctors to automatically obtain more accurate cancer segmentation results at a lower labelling cost.

V. CONCLUSION
In this article, we propose a framework for sketch-supervised tumour segmentation in histopathology, called DCTGN-CAM. Annotations from experts are optimised by taking the intersection of the cancer regions found by unsupervised k-means and the sketch annotations. The dual-branch DCT classification network leverages tumour features comprehensively, and the parallel SWIN Transformer ensures the consistency of the global feature representation. With the Global-Normalised CAM, a whole-slide heat map is generated from the patch-based tumour classification predictions, combining local and global normalisation. A thorough analysis on two tumour datasets shows that DCTGN-CAM is superior to existing weakly supervised tumour segmentation methods. This work is valuable
and practical for computer-aided histopathology analysis. However, the multi-step design may hinder the feature flow or introduce noise effects, so an end-to-end approach might be more effective in future work. Additionally, the adaptability of the front-end trainable model to the back-end CAM remains to be studied. In the future, we will continue to optimise the CAM visualisation, design a lightweight dual CNN-Transformer structure, and study the adaptability of CAM visualisation in sketch-based segmentation tasks.
Sketch-Supervised Histopathology Tumour Segmentation: Dual CNN-Transformer With Global Normalised CAM. Yilong Li, Linyan Wang, Xingru Huang, Yaqi Wang, Le Dong, Ruiquan Ge, Huiyu Zhou, Juan Ye, and Qianni Zhang.

Fig. 1. Pipeline of the proposed sketch-supervised tumour segmentation method DCTGN-CAM. A Dual CNN-Transformer classification network (DCT) is trained on the tumour image patches and the refined P-label patches, processed by an annotation refinement (AR) module, and supervised by a binary classification (cross-entropy) loss. In the test stage, the testing image patches are passed through the trained DCT classification network. A GN-CAM visualizer combines the local and global tumour heat maps simultaneously to create accurate tumour boundaries. The final denoised tumour segmentation results are obtained by a noise eliminator (NE).

5: until the cluster assignments stop changing.
6: Obtain the refined annotation Ŷ by the intersection operation: Ŷ = Y1 ∩ Y0.

Fig. 2. Structure of the proposed Dual CNN-Transformer network (DCT). Subfigure (a) presents the details of the five stages inside the DCT network; each stage contains a local CNN block and a global Transformer block, except for the 5th stage. Subfigure (b) shows the residual connection of the CNN block. Subfigure (c) illustrates the sub-blocks inside the proposed Transformer block, including the patch partition, the patch embedding, the parallel SWIN encoder, the patch merging and the patch expanding; note that the patch partition and the patch embedding layers are only executed in the first stage. Subfigure (d) presents a simple but effective way to fuse the global and local tumour features of the image patches.
, where (H, W) is the resolution of the image patch, C is the adjusted channel number, and (P_H, P_W) is the resolution of each resulting window-based patch. We then add the random position embedding E_pos to these window-based patches through a 3D dropout operation and obtain the embedding z_0 by (2). The embedding z_0 is passed through a Parallel SWIN encoder. As shown in sub-figure (II) of Fig. 3, one part of our Parallel SWIN encoder consists of Window-based Self-Attention (WSA) and Multilayer Perceptron (MLP) sub-blocks; the other part consists of Shifted-Window-based Self-Attention (SWSA) and MLP sub-blocks. Additionally, LayerNorm (LN) layers are applied before each sub-block. In Parallel SWIN, the sub-blocks are computed exactly as in the SWIN Transformer, so we do not repeat the computational details here.
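The window partition described above can be sketched as follows; this is a generic implementation of the standard SWIN-style partition (padding and embedding omitted), not the authors' code.

```python
import numpy as np

def window_partition(x, ph, pw):
    """Split an (H, W, C) feature map into non-overlapping (ph, pw) windows,
    as done before window-based self-attention.

    Returns an array of shape (num_windows, ph * pw, C). Assumes
    H % ph == 0 and W % pw == 0 (padding is omitted for brevity)."""
    h, w, c = x.shape
    x = x.reshape(h // ph, ph, w // pw, pw, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/ph, W/pw, ph, pw, C)
    return x.reshape(-1, ph * pw, c)

x = np.arange(4 * 4 * 2).reshape(4, 4, 2).astype(float)
wins = window_partition(x, 2, 2)
print(wins.shape)   # (4, 4, 2): four 2x2 windows, channel dimension 2
```

Each window is then processed independently by the (shifted-)window self-attention sub-blocks, which keeps the attention cost linear in the number of windows.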

Fig. 3. Comparison between the existing SWIN Transformer and our proposed parallel SWIN Transformer block. Feature representation is continuous and independent in our parallel design, rather than applying local layer normalisation (LN) and a residual connection in each shifted-window-based/window-based multi-head self-attention module (SW/W-MSA). SW/WSA is the shifted-window-based/window-based self-attention module; MLP is the Multilayer Perceptron.

Fig. 4. Three types of tumour patch samples extracted from the WSIs in the private BSS dataset, which contains basal cell cancer (BCC), squamous papilloma (SP), and seborrheic keratosis cancer (SKC). These patches give a close look at the tumour shapes and colours of the different classes.

Fig. 5. Three WSI samples from the public dataset (PAIP2019 [39]). The first column shows the original images; the second and third columns present the corresponding poor labels (P-labels) and fine labels (F-labels) marked by experts.

Fig. 6. Qualitative segmentation results of the sketch-supervised methods. Three types of binary tumour classification networks (VGG, VF and ours) are trained and tested along with the AR, GN-CAM, and NE modules. Each network is trained three times with different types of images to distinguish the three types of skin tumours (BCC, SP and SKC). For a method i, M_i denotes the final tumour segmentation result and N_i denotes the visualised heat map. The tumour segmentation results of our proposed DCTGN-CAM method are the closest to the fine labels annotated by experts. The upper left corner of each image presents enlarged tumour segmentation results.

Fig. 7. Three types of predicted heat maps obtained by our proposed DCTGN-CAM on the BSS dataset. The blue areas in the heat maps indicate a relatively high probability of belonging to a tumour.

Fig. 8. Qualitative segmentation results of different visualizers in the sketch-supervised framework. Each method is trained three times with different types of images to distinguish the three types of skin tumours (BCC, SP and SKC). The tumour segmentation results of our proposed DCTGN-CAM method are the closest to the fine labels.

5) Comprehensive Analysis: We compare our proposed method to the existing methods in Table III to analyse their performance comprehensively. The P-labels are used to train the methods and the F-labels are used to evaluate them. Our method surpasses CNN-based networks such as U-Net and AttnUNet, Transformer-based networks like SwinU-Net, and hybrid CNN-Transformer networks including TransUnet and MSHT, in terms of Recall, Specificity, Accuracy, Intersection over Union (IoU), and Dice. Furthermore, among the pixel-wise approaches in Table III (U-Net, AttnUNet, TransUnet, and MSHT), the Transformer-based networks demonstrate superior performance compared to the pure CNN methods U-Net and AttnUNet. This suggests that the global context feature encoding provided by the Transformer is more suitable for the sketch-based tumour segmentation task than pure CNNs. The highest performance metrics are achieved by the classification-based segmentation methods, with 88.28 % Recall, 98.90 % Specificity, 97.08 % Accuracy, 76.68 % IoU, and 86.69 % Dice, and our proposed methods account for four out of the five best results. As all learnable networks are supervised by coarse labels, pixel-wise supervision introduces large false-positive errors; in this case, segmentation-based methods like U-Net are ineffective.

Fig. 9. Optimised segmentation results with the NE module in DCTGN-CAM. The first row shows the original patches; the second row shows the intermediate results processed only by the binary threshold; the third row presents the final tumour predictions processed by the NE.

Fig. 10. Qualitative segmentation results on the PAIP2019 dataset. Compared with the U-Net method, our method produces more accurate tumour boundaries for the sketch-supervised tumour segmentation task. The fifth row shows that the tumour regions have high responses after processing by our method.

TABLE I. A summary of the data statistics in the private dataset (BSS) and the public dataset (PAIP2019), including the number of WSIs and the number of image patches.

TABLE II. Performance comparison of tumour segmentation with different classification networks, visualisers, the annotation refinement (AR) and the noise eliminator (NE) on the BSS dataset (%).