A Multi-Organ Nucleus Segmentation Challenge

Generalized nucleus segmentation techniques can contribute greatly to reducing the time to develop and validate visual biomarkers for new digital pathology datasets. We summarize the results of the MoNuSeg 2018 Challenge, whose objective was to develop generalizable nuclei segmentation techniques in digital pathology. The challenge was an official satellite event of the MICCAI 2018 conference in which 32 teams with more than 80 participants from geographically diverse institutes participated. Contestants were given a training set with 30 images from seven organs with annotations of 21,623 individual nuclei. A test dataset with 14 images taken from seven organs, including two organs that did not appear in the training set, was released without annotations. Entries were evaluated based on the average aggregated Jaccard index (AJI) on the test set to prioritize accurate instance segmentation as opposed to mere semantic segmentation. More than half the teams that completed the challenge outperformed a previous baseline. Among the trends observed that contributed to increased accuracy were the use of color normalization as well as heavy data augmentation. Additionally, fully convolutional networks inspired by variants of U-Net, FCN, and Mask-RCNN were popularly used, typically based on ResNet or VGG base architectures. Watershed segmentation on predicted semantic segmentation maps was a popular post-processing strategy. Several of the top techniques compared favorably to an individual human annotator and can be used with confidence for nuclear morphometrics.


I. INTRODUCTION
EXAMINATION of H&E stained tissue under a microscope remains the mainstay of pathology. The popularity of H&E is due to its low cost and ability to reveal tissue structure and nuclear morphology, which is sufficient for primary diagnosis of several diseases including many cancers. Nuclear shapes and spatial arrangements often form the basis of the examination of H&E stained tissue sections. For example, grading of various types of cancer and risk stratification of patients is usually done by examining different types of nuclei on a tissue slide [1], [2]. Nuclear morphometric features and appearance, including the color of the surrounding cytoplasm, also help in identifying various types of cells such as epithelial (glandular), stromal, or inflammatory, which in turn give an idea of the glandular structure and disease presentation at low power [1]-[4]. Accurate segmentation of nuclei in H&E images therefore has high utility in digital pathology.
However, nucleus segmentation algorithms that work well on one dataset can perform poorly on a different dataset. There is far too much variation in the appearance of nuclei and their surroundings by organs, disease conditions, and even digital scanner brands or histology technicians. Examples of such variations are shown in Fig. 1, along with the problems of some common segmentation algorithms such as Otsu thresholding [5], marker controlled watershed segmentation [6]-[8] or open-source packages like Fiji [9] and Cell Profiler [10]. Segmentation based on machine learning should be able to do a better job, but that makes designing and refining nucleus segmentation algorithms for a new study a tedious task because annotations of thousands of nuclei are needed to train such segmentation models on datasets of interest. Algorithms that generalize to new datasets and organs that were not seen during training can reduce this effort substantially and contribute to rapid experimentation with new phenotypical (visual) biomarkers.
Fig. 1 (caption, beginning not recovered): Otsu thresholding [5] and Cell Profiler [10] give merged nuclei (under-segmentation). Marker controlled watershed segmentation [6] and Fiji [9] produce fragmented nuclei (over-segmentation). Segmented nuclei instances are shown in different colors in rows 2-5.
Until recently, one of the major challenges in training generalized nucleus segmentation models has been the unavailability of large multi-organ datasets with annotated nuclei. In 2017, Kumar et al. [11] released a dataset with more than 21,000 hand-annotated nuclei in H&E stained tissue images acquired at the commonly used 40× magnification, sourced from seven organs and multiple hospitals in The Cancer Genome Atlas (TCGA) [12]. Kumar et al. also introduced a metric called Aggregated Jaccard Index (AJI) that is more appropriate for evaluating algorithms on this instance segmentation problem than other popular metrics such as the Dice coefficient, which are better suited to semantic segmentation problems. This is because nucleus segmentation algorithms should not only tell the difference between nuclear and non-nuclear pixels, but should also be able to tell apart pixels belonging to two nuclei that touch or overlap each other. Additionally, they released a trained model that performed reasonably well on unseen organs from the test subset of images.
We organized the Multi-organ nucleus segmentation (MoNuSeg) Challenge at MICCAI 2018 to build upon Kumar et al.'s work by enlarging the dataset and by encouraging others to introduce new techniques for generalized nucleus segmentation. The participation was wide and several participants outperformed the previous benchmark [11] by a significant margin. In this paper we describe in detail the objectives of the competition, the released dataset, and the emerging trends of techniques that performed well on the challenge task. We hope that the algorithms described on the challenge webpage [13] will be of use to the computational pathology research community.
The rest of the paper is organized as follows. We describe the prior work on nucleus segmentation and dataset creation in Section II. We describe the dataset and competition rules in Section III. We present an organized summary of the techniques used by the challenge participants in Section IV. Finally, we discuss emerging trends in nucleus segmentation techniques in Section V.

II. BACKGROUND AND PRIOR WORK
In this section we describe the importance of H&E stained images in histopathology and provide details of some previous notable techniques and datasets for nucleus segmentation from H&E stained images.

A. Hematoxylin and Eosin (H&E) Stained Images
Pathologists usually observe tissue slides under a microscope at a specific resolution (ranging between 5× and 40×) to report their diagnoses including tumor grade, extent of spread, surgical margin, etc. Their assessment is primarily based on the appearance, size, shape, color and crowding of various nuclei (and glands) in epithelium and stroma. Stains are used to enhance the contrast between these tissue components to help a pathologist looking for specific nuclei and gland features. The combination of hematoxylin and eosin (H&E) is a frequently used, universal, and inexpensive staining scheme for general contrast enhancement of histologic structures of a tissue. Hematoxylin renders the nuclei dark blueish purple and the epithelium light purple, while eosin renders the stroma pink. Compared to the general use of H&E, immunohistochemical staining is more specialized as it targets proteins specific to certain disease states for visual identification.
With the advent of high resolution cameras mounted on microscopes, and more importantly, digital whole slide scanners, it is now possible to acquire whole slide images (WSIs) of the tissue sections for computer assisted diagnosis (CAD). However, the development of CAD systems requires automated extraction of rich information encoded in the pixels of WSIs. Recently, computer based assessment of tissue images has been used for tumor molecular sub-type detection [14], mortality or recurrence prediction [3], [15], and treatment effectiveness prediction [4]. Notably, nucleus detection and segmentation is often a first step for several such CAD systems that rely on nuclear morphometrics for disease state stratification and predictive modelling. Therefore, MoNuSeg 2018 focused on crowdsourcing techniques for nucleus segmentation in H&E stained images captured at 40× resolution.

B. Nucleus Segmentation Techniques
Prior to the advent of deep learning, approaches to segment nuclei relied on watershed segmentation, morphological operations (such as erosion, dilation, opening and closing), color-based thresholding, and variants of active contours [6], [7], [16]-[18]. These techniques were often complemented with a collection of pre-processing methods, such as contrast enhancement and deblurring to improve the 'image quality'. Additionally, several post-processing techniques, such as hole filling, noise removal, graph-cuts, etc., were also used to refine the outputs of the segmentation algorithms. However, these approaches do not generalize well across a wide spectrum of tissue images due to reasons such as (a) variations in nuclei morphologies of various organs and tissue types, (b) inter- and intra-nuclei color variations in crowded and chromatin-sparse nuclei, and (c) diversity in the quality of tissue images owing to the differences in image acquisition equipment and slide preparation protocols across hospitals and clinics.
There have been tremendous advances in recent years in the development of learning-based nucleus segmentation methods. Instead of relying on pre-determined algorithms for segmentation, machine learning methods derive data-driven algorithms that are trained in a supervised manner based on annotations of nuclear and non-nuclear pixels. This allows them to concentrate on relative differences between nuclear and non-nuclear pixels and their surrounding patches and overcome the aforementioned sources of intra-class variations for better generalized segmentation. The use of learning-based approaches started with the extraction of hand-crafted local features based on color and spatial filtering that were fed to traditional learning-based models such as random forests, support vector machines, etc. to segment nuclei and non-nuclei regions [1], [19], [20]. The selection of features depends on domain knowledge and trial-and-error for improving nucleus segmentation performance, and yet it is difficult to detect all nuclei with diverse appearances and crowding patterns.
To circumvent the constraints of hand-crafted features, representation learning algorithms, popularly known as deep learning techniques, have recently emerged. These methods, specifically the ones using convolutional neural networks (CNNs), have outperformed previous techniques in nucleus detection and segmentation tasks by significant margins [11], [21]-[24]. To use deep learning, the problem is often cast as one of semantic segmentation wherein a two-class probability map for nuclear and non-nuclear regions is usually computed. After semantic segmentation, sophisticated post-processing methods, such as graph partitioning [21] or the computation of the distance transform of the nuclear map followed by H-minima transform and region growing [22], are often used to obtain final nuclei shapes with the desired separation of touching and overlapping nuclei. Semantic segmentation of a third class of pixels (those on the nuclear boundaries, including those between two touching nuclei) has also been proposed specifically to refine the separation between the segmented touching and overlapping nuclei [11]. Deep generative models have also been used for accurate nuclei segmentation [25]. More recently, the nucleus segmentation problem has been formulated as a regression task to predict a distance map with respect to centroids or boundaries of nuclei using fully convolutional networks (FCNs) to achieve both segmentation and computational performance gains over previous deep learning based approaches [24]. More comprehensive reviews of state-of-the-art nucleus segmentation algorithms can be found in [26] and [27].
One of the major barriers to out-of-the-box (without re-training) application of state-of-the-art deep learning based nucleus segmentation algorithms was the lack of publicly available source code and trained models for previously published techniques, until Kumar et al. [11] and Naylor et al. [24] released their source code. The other major barrier was the lack of publicly available annotated datasets for benchmarking, which we address next.

C. Nucleus Segmentation Datasets
The success of machine learning and the development of state-of-the-art deep learning algorithms in computer vision can be attributed to the healthy competition enabled by publicly available consumer photography datasets such as ImageNet [28] and COCO [29] for object recognition in images. Unfortunately, we do not see similar progress in digital pathology image analysis as there is a dearth of labeled and annotated datasets for solving various tasks of interest to pathologists. For example, the CAMELYON dataset [30], which is one of the largest histopathology classification datasets, has 1,399 images, while ImageNet [28] has 14 million images. Similarly, CheXpert [31], which is one of the largest medical image segmentation datasets, has only 224,316 images. This is because labeling and annotating pathology images require expert knowledge and diligent work. However, there have been a few recent efforts dedicated to the release of hand-annotated H&E stained tissue slide images for nucleus segmentation as summarized in Table I. These datasets can also be downloaded from the challenge webpage [13]. Please note that we have not included datasets where the nuclei were annotated for detection alone in Table I because these cannot be used for the segmentation task. We also excluded datasets annotated for other specific objectives such as gland segmentation, mitosis detection, epithelial segmentation, and tumor type classification, as opposed to generalized nucleus segmentation. Most of the datasets listed in Table I focus on a specific organ with the exception of Kumar et al. [11] and Wienert et al. [18].
TABLE I: PUBLICLY AVAILABLE H&E STAINED TISSUE IMAGE DATASETS ANNOTATED FOR NUCLEUS SEGMENTATION

III. DATASET AND COMPETITION RULES
The objective of MoNuSeg 2018 was to encourage the development of learning based generalized nucleus segmentation techniques that work right out of the box (without re-training) on a diverse set of H&E-stained tissue images. The images therefore spanned a range of patients, organs, disease states, and sourcing hospitals with potentially different slide preparation and image acquisition methods. Training and testing datasets were carefully curated and the competition rules were crafted in accordance with these objectives.

A. Training Dataset
The training data of MoNuSeg 2018 was the same as that released previously by Kumar et al. [11], which comprised 30 tissue images, each of size 1000 × 1000, containing 21,623 hand-annotated nuclear boundaries. Each 1000 × 1000 image in this dataset was extracted from a separate whole slide image (WSI) (scanned at 40×) of an individual patient downloaded from TCGA [12]. The dataset represented 7 different organs, viz., breast, liver, kidney, prostate, bladder, colon and stomach, and included both benign and diseased tissue samples to ensure diversity of nuclear appearances. Furthermore, the training images came from 18 different hospitals, which introduced another source of appearance variation due to the differences in the staining practices and image acquisition equipment (scanners) across labs. Representative 1000 × 1000 sub-images from regions dense in nuclei were extracted from patient WSIs to reduce the computational burden of processing WSIs and increase participation. Only one crop per WSI and patient was included in the dataset to ensure diversity. The distribution of training images across organs is shown in Table II while patient and hospital details are available on the challenge webpage [13].
Both epithelial and stromal nuclei were manually annotated in the 1000 × 1000 sub-images using Aperio ImageScope ®. Annotations were performed on a 25" monitor with a 200× digital magnification such that each image pixel occupied 5×5 screen pixels to ensure clear visibility for annotating nuclear boundaries with a laser mouse. (Only annotations verified by a pathologist were considered.) For overlapping nuclei, each multi-nuclear pixel was assigned to the nucleus that appeared to be on top in the 3-D structure. The annotators were engineering students and the quality control was performed by an expert pathologist with years of experience in analyzing tissue sections. Specifically, each H&E image was included in a PowerPoint ® (Microsoft, Redmond WA, USA) slide at 300 dots per inch, along with the annotated boundaries overlaid in bright green. The slides were examined by a pathologist on a 25" monitor to point out missed nuclei, false nuclei, and nuclei with wrong boundaries. For each image, the numbers of each type of error were summed up and divided by the number of annotated nuclei to assess the quality of annotations. As shown in Supplementary Table S2, the error rate for each organ was smaller than 1%. The images and XML files containing pixel coordinates of the annotated nuclear boundaries were released for public use by [11]. The reasons that make this dataset ideal for training a generalized nucleus segmentation model are as follows:
1) It is the largest repository of hand-annotated nuclei, and it aptly represents a miscellany of nuclei shapes and sizes across multiple organs, disease states and patients. The inclusion of tissue sections from 18 hospitals further augments the richness of this dataset. From Table I, the only multi-organ alternative to it is Wienert et al. [18]. However, Wienert et al. [18] contains tissues from fewer organs, captured in a single hospital with a single scanner.
2) It extracted only one sub-image of 1000 × 1000 pixels per patient to maximize nuclear appearance variation. Other datasets mentioned in Table I extracted multiple sub-images from each patient and are thus limited in representing nuclear appearance diversity. For example, WSIs of only 10 and 11 patients were used in Irshad et al. [33] and Naylor et al. [24], respectively.
3) It provided coordinates of annotated nuclear boundaries in an XML format instead of binary masks. This is crucial for learning to separate touching and overlapping nuclei in any automatic nucleus segmentation algorithm. This helped several participants of MoNuSeg 2018 whose nucleus segmentation algorithms explicitly learned to recognize nuclear boundaries in addition to the usual foreground (nuclei pixels) and background (non-nuclei pixels) classes.
4) Its authors publicly released the source code of their generalized nucleus segmentation algorithm to catalyze natural competition among a newer generation of automatic nucleus segmentation algorithms.

B. Testing Dataset
A new testing set comprising 14 images, each of size 1000 × 1000 pixels, spanning 7 organs (viz. kidney, lung, colon, breast, bladder, prostate, brain), several disease states (benign and tumors at different stages), and approximately 7,223 annotated nuclei was prepared in the same manner as used for preparing the training data. As shown in Table II, lung and brain tissue images were exclusive to the test set which made it more challenging. More details about the test set are available in the "supplementary material" tab of the challenge webpage [13]. The annotations of the test set were not released to the participants. To formally conclude the challenge, with this paper, we are releasing the test annotations on the challenge webpage [13] to facilitate future research in the development of generalized nucleus segmentation algorithms.

C. Competition Metric and Results
Competitors were evaluated only once on the test set. Their latest submission before the deadline was considered as the final submission for evaluation. Average aggregated Jaccard Index (AJI) was used as the metric to evaluate nucleus segmentation performance of the competing algorithms because of its established advantages over other segmentation metrics [11], [24]. The value of AJI ranges between 0 and 1 (higher is better). Computing AJI involves matching every ground truth nucleus to one detected nucleus by maximizing the Jaccard index. The AJI is then equal to the ratio of the sums of the cardinalities of intersection and union of these matched ground truth and predicted nuclei. Additionally, all detected components that are not matched are added to the denominator. We reproduce Algorithm 1 detailing AJI computation from [11] with permission. The code for computing AJI is available on the challenge webpage [13].
Participants were asked to submit 14 segmentation output files (one for each of the 14 test images) to the challenge organizers. For each participant's submission, the organizers then computed 14 AJIs (one for each test image) as per Algorithm 1. If a participant did not submit the results for a particular testing image then an AJI value of zero was assigned for that particular image to that participant. The organizers then computed the average AJI (a-AJI) for each participant by averaging image level AJIs across the 14 test images. The participants were then ranked in descending order of a-AJI to obtain the final leaderboard shown in Table III. Table III also includes the 95% confidence intervals (CIs) around each participant's a-AJI. It is evident that the confidence intervals of the top five techniques exclude the a-AJI of the lower ranked techniques. To further assess the overall a-AJI based ranking scheme, we also computed organ level a-AJI (and confidence intervals), for each participant, by averaging image level AJIs across the number of images that belonged to a specific organ, as shown in Supplementary Table S3. From Supplementary Table S3, it is evident that (a) the top five techniques perform better than other techniques for each organ as well, (b) the organ is a larger contributor to the variability in performance among the top five techniques than the technique itself, and (c) techniques with a higher overall a-AJI perform better for more organs even among the top five techniques. Specifically, for instance, (a) no technique that is not among the top-five overall breaks into the top-five for more than two organs, (b) breast cancer images had AJIs that were lower by about 0.063 to 0.085 compared to those for bladder for the top-five techniques, and (c) the overall top-ranked technique is also the top-ranked one for all but one organ.

Algorithm 1 Aggregated Jaccard Index (AJI)
Input: A set of images with a combined set of annotated nuclei G_i indexed by i, and a segmented set of nuclei S_k indexed by k.
Output: Aggregated Jaccard Index A.
1: Initialize overall correct and union pixel counts: C ← 0; U ← 0
2: for each ground truth nucleus G_i do
3:     j ← argmax_k |G_i ∩ S_k| / |G_i ∪ S_k|
4:     Update pixel counts: C ← C + |G_i ∩ S_j|; U ← U + |G_i ∪ S_j|
5:     Mark S_j used
6: end for
7: for each segmented nucleus S_k do
8:     if S_k is not used then U ← U + |S_k|
9: end for
10: A ← C/U
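As an illustration of Algorithm 1 and of the a-AJI averaging described above, the following is a minimal NumPy sketch; it is not the organizers' released code. It assumes that each image is given as an integer instance label map (0 for background, positive integers for nuclei), which is a representation chosen here for convenience. The function names are hypothetical. It also assumes that a ground truth nucleus with no overlapping segmented nucleus contributes only its own pixel count to the union, a case that Algorithm 1 leaves implicit.

```python
import numpy as np


def aggregated_jaccard_index(gt, pred):
    """AJI for two instance label maps (0 = background, 1..N = nuclei)."""
    gt_ids = [i for i in np.unique(gt) if i != 0]
    pred_ids = [k for k in np.unique(pred) if k != 0]
    used = set()
    correct, union = 0, 0  # C and U in Algorithm 1
    for i in gt_ids:
        g = gt == i
        best_k, best_jaccard = None, -1.0
        # Assumption: with no overlapping S_k, only |G_i| is added to U.
        best_inter, best_union = 0, int(g.sum())
        # Only segmented nuclei overlapping G_i can have a non-zero Jaccard index.
        for k in np.unique(pred[g]):
            if k == 0:
                continue
            s = pred == k
            inter = int(np.logical_and(g, s).sum())
            uni = int(np.logical_or(g, s).sum())
            if inter / uni > best_jaccard:
                best_k, best_jaccard = k, inter / uni
                best_inter, best_union = inter, uni
        correct += best_inter
        union += best_union
        if best_k is not None:
            used.add(best_k)
    # Segmented nuclei that were never matched are added to the denominator.
    for k in pred_ids:
        if k not in used:
            union += int((pred == k).sum())
    return correct / union if union > 0 else 0.0


def average_aji(gt_maps, pred_maps):
    """a-AJI: mean of image-level AJIs; a missing submission (None) scores 0."""
    scores = [0.0 if p is None else aggregated_jaccard_index(g, p)
              for g, p in zip(gt_maps, pred_maps)]
    return float(np.mean(scores))


if __name__ == "__main__":
    gt = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 2, 2],
                   [0, 0, 2, 2]])
    pred = np.array([[1, 1, 0, 3],
                     [1, 0, 0, 0],
                     [0, 0, 2, 2],
                     [0, 0, 2, 2]])
    print(aggregated_jaccard_index(gt, pred))  # C = 7, U = 9
    print(average_aji([gt, gt], [gt, None]))
```

In the toy example, the spurious one-pixel detection (label 3) is not matched to any ground truth nucleus and therefore only enlarges the denominator, which is how AJI penalizes false positives.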

IV. SUMMARY OF SEGMENTATION TECHNIQUES
In this section we present a summary of the techniques used by the 32 teams who successfully completed the challenge. We describe the trends observed in pre-processing, data augmentation, modeling, task specification, optimization, and post-processing techniques used by the teams. Specific details of all algorithms are provided in the respective manuscripts submitted by participants as per challenge policies and are available on the challenge webpage [13] under the "manuscripts" tab.

A. Pre-Processing and Data Augmentation
Pre-processing techniques reduce unwanted variations among input images (from both the training and testing sets) so that the test data distribution is not very different from the training data distribution, by projecting both to the same low-dimensional manifold. On the other hand, data augmentation techniques increase the training data set size by introducing controlled random variations with the hope of creating a training data distribution that covers most of the test data distribution. There are several ways in which the participants altered the given images and their ground truth masks before passing them to the segmentation learning systems in order to increase test accuracy. We summarize some of the interesting trends observed in this challenge. These results are also summarized in Table III.
1) Color and Intensity Normalization: Among the data pre-processing techniques, color and intensity transformations were the most common. Approximately half the teams used color normalization techniques that were specifically developed for pathology images to reduce unwanted color variations between training and testing data. Structure Preserving Color Normalization (SPCN) by Vahadane et al. [35] was used by ten teams due to its demonstrated performance and code availability. Another seven teams used Macenko et al.'s color normalization scheme [36], out of which one used this technique in combination with another technique by Reinhard et al. [37].
Pixel intensity and RGB color transformations that are not specific to pathology were also used by approximately half of the teams. Most popular among this class of techniques were channel-wise mean subtraction, variance normalization (unit variance), and pixel-value range standardization. Six teams also used contrast enhancement or histogram equalization, among which CLAHE [38] was the most commonly used technique.
Among the unique techniques, one team used image sharpening to remove unwanted variations between training and testing data, one team concatenated HSV and L channels (of L, a*, b* color space) to the RGB channels, and one team used only the blue channel after color normalization of the RGB images.
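The generic intensity transformations mentioned above (channel-wise mean subtraction, unit-variance normalization, and pixel-value range standardization) can be sketched as follows. This is only an illustration with hypothetical function names. It assumes an H × W × 3 RGB array and statistics computed per image; the teams may equally have used dataset-wide statistics, which the text does not specify.

```python
import numpy as np


def standardize_channels(image, eps=1e-8):
    """Channel-wise mean subtraction and unit-variance scaling (per image)."""
    image = image.astype(np.float64)
    mean = image.mean(axis=(0, 1), keepdims=True)  # one mean per channel
    std = image.std(axis=(0, 1), keepdims=True)    # one std per channel
    return (image - mean) / (std + eps)


def rescale_range(image, low=0.0, high=1.0):
    """Linearly map all pixel values to the range [low, high]."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.full_like(image, low)
    return low + (image - lo) * (high - low) / (hi - lo)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tile = rng.integers(0, 256, size=(16, 16, 3)).astype(np.uint8)
    z = standardize_channels(tile)
    print(z.mean(axis=(0, 1)), z.std(axis=(0, 1)))
    r = rescale_range(tile)
    print(r.min(), r.max())
```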
2) Data Augmentation: Among data augmentation techniques, geometric transformations of the image grid were the most common. For example, rigid transformations of the images, such as rotation (especially by multiples of 90 degrees) and/or flipping, were used by all but four teams to increase the size of the training data. However, as can be seen in Table III, all of the top twelve teams by a-AJI also augmented the training set using affine transformations, while only five teams below that used this type of augmentation. Another transformation used by the participants was elastic deformation, but it was not very popular among the contestants due to the marginal gain it might afford over an affine transform, while being more complicated to implement. Another geometric transformation is image scaling, which was used by nine contestants.
Another popular set of augmentation techniques involves changing the pixel values while leaving the geometric structure intact. The most popular among these techniques was the addition of white Gaussian noise, which was used by several of the top performing teams. Another popular technique is color jitter or random HSV shifts, which was used by nine of the top twelve teams. Color jitter is opposite in spirit to color normalization in that it is used to present more color variations of the same input geometric structure to the learning machine with the hope that it will learn to focus on the geometric structure as opposed to the color of nuclei, which may vary between training and testing data sets. Random intensity (brightness) shifts were used by fewer participants, as were blurring by isotropic Gaussian filters of random widths and random image sharpening.
One interesting data augmentation technique that was used by team CMU-UIUC involved extracting the nuclei, augmenting them in-place, filling the holes in the background, and then pasting the nuclei back on to the reconstructed background.
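Two of the most common augmentations described in this subsection, rotation by multiples of 90 degrees with flipping and the addition of white Gaussian noise, can be sketched as follows. This is a hedged illustration rather than any team's pipeline: the function names, the noise standard deviation, and the flip probabilities are assumptions. The key point it shows is that geometric transforms must be applied identically to the image and its ground truth mask, whereas pixel-value transforms touch the image only.

```python
import numpy as np


def random_rot90_flip(image, mask, rng):
    """Apply the same random 90-degree rotation and flips to image and mask."""
    k = int(rng.integers(0, 4))  # number of 90-degree rotations
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    if rng.random() < 0.5:       # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:       # vertical flip
        image, mask = image[::-1], mask[::-1]
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)


def add_gaussian_noise(image, rng, sigma=5.0):
    """Add white Gaussian noise to a uint8 image; the mask is left unchanged."""
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask = np.arange(64).reshape(8, 8)
    image = np.stack([mask] * 3, axis=-1).astype(np.uint8)
    aug_image, aug_mask = random_rot90_flip(image, mask, rng)
    # The image and mask stay aligned after the geometric transform.
    print(np.array_equal(aug_image[..., 0], aug_mask))
    print(add_gaussian_noise(aug_image, rng).dtype)
```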

B. Specification of the Learning Task
The challenge of nucleus segmentation can be split into two tasks: distinguishing between nuclear and non-nuclear pixels (semantic segmentation) and separating touching nuclei (instance segmentation). The following were three principal types of outputs that the contestants produced using deep learning to meet these two challenges:
1) Binary class probability maps distinguish between pixels that belong to the core of any nucleus versus those that do not. The process of not including the outer periphery of the nuclei into the foreground class helps separate touching nuclei. The lost nuclear territory can later be gained back during post-processing.
2) Ternary class probability maps distinguish between nuclear core, non-nuclear, and nuclear boundary pixels. Nuclear pixels that are on a shared boundary of two touching nuclei are considered to belong to the third class, which has been shown to be useful in separating touching nuclei [11].
3) Distance maps estimate how far a nuclear pixel is from the centroid of a nucleus. Such a map can also distinguish between nuclear and non-nuclear pixels by assigning a fixed value to the latter, such as 0. This is a per-pixel regression problem while the previous two are classification problems. A variant of this distance map is to predict the distance from the boundary of the nucleus.
Most teams trained their models to predict variants of one or more of the three types of maps described above. One interesting departure from these three tasks was by Canon Medical Research Europe, who predicted a five-class probability map: one class for nuclear pixels, and the other four for their probability of belonging to one of the four Cartesian quadrants of a nucleus in order to separate touching nuclei.
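To make the output types above concrete, the sketch below derives a ternary target and a centroid distance target from an instance label map (0 for background, positive integers for nuclei). The function names are hypothetical, and two choices are assumptions made here for illustration: the boundary class is taken to be a one-pixel-wide inner rim defined by 4-connectivity, and the image border itself is not treated as a boundary.

```python
import numpy as np


def ternary_target(labels):
    """0 = non-nuclear, 1 = nuclear core, 2 = nuclear boundary."""
    padded = np.pad(labels, 1, mode="edge")
    center = padded[1:-1, 1:-1]
    rows, cols = padded.shape
    boundary = np.zeros(labels.shape, dtype=bool)
    # A pixel is on a boundary if any 4-neighbor carries a different label.
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        neighbor = padded[1 + dy:rows - 1 + dy, 1 + dx:cols - 1 + dx]
        boundary |= neighbor != center
    target = (labels > 0).astype(np.uint8)
    target[boundary & (labels > 0)] = 2
    return target


def centroid_distance_target(labels):
    """Distance of each nuclear pixel to its own nucleus centroid; 0 elsewhere."""
    dist = np.zeros(labels.shape, dtype=np.float64)
    for i in np.unique(labels):
        if i == 0:
            continue
        ys, xs = np.nonzero(labels == i)
        dist[ys, xs] = np.hypot(ys - ys.mean(), xs - xs.mean())
    return dist


if __name__ == "__main__":
    labels = np.zeros((5, 5), dtype=int)
    labels[1:4, 1:4] = 1  # a single 3 x 3 nucleus
    print(ternary_target(labels))
    print(centroid_distance_target(labels).round(2))
```

A binary core target follows from the ternary one by keeping only class 1. Note that in this raw form a pixel lying exactly on the centroid also receives the value 0; the sketch does not attempt to resolve that ambiguity.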

C. Model Architectures
All participants used deep convolutional neural networks. Twenty-one teams used variants of U-Net [39], of which the original U-Net architecture was used by 11 teams while six teams used base architectures inspired by VGGNet [40], and another 11 teams used architectures inspired by either Mask Region with CNN features (MRCNN) [41], FCN [42], DenseNet [43], or ResNet [44] with different depths. Eight teams used MRCNN [41] as the primary model (of which two also used U-Net), and two used FCN [42] (of which one also used U-Net). Among the remaining, four teams used their own custom models and architectures, and one each used VGGNet [40], Deep Layer Aggregation [45], PANet [46], and TernausNet [47]. A few teams used multiple architectures for ensembling. Two teams used two architectures each for two different tasks, for example one for semantic segmentation (binary classification between foreground and background pixels) and another for distance map prediction to separate touching nuclei. Notable innovations in model architectures tried by some of the top teams are described in Section IV-G.

D. Model Optimization
The choice of loss function depends on the desired output being predicted. Among various choices for the loss function, pixel-wise cross entropy was used by 28 teams for predicting binary or ternary probability maps, and it was by far the most popular loss function. Ten teams used Dice loss [48], and two teams used variants of it such as smooth Jaccard index loss or IOU (intersection over union) loss [49]. For regression problems, seven teams used a smooth L1 loss. Five teams used mean square error. In total, 16 teams used more than one loss function. Most teams trained their models end-to-end, except when an ensemble of more than one model was used, with the exception of team Yunzhi that used a cascade of two neural networks trained one after another.
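For reference, the three most frequently mentioned loss functions can be written in a few lines of NumPy. This sketch uses common textbook forms with hypothetical function names; the smoothing constant `eps` and the smooth L1 transition point `beta` are assumptions, and individual teams may have used different variants or weightings.

```python
import numpy as np


def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean pixel-wise cross entropy for a binary probability map."""
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(prob)
                          + (1 - target) * np.log(1 - prob)))


def soft_dice_loss(prob, target, eps=1e-7):
    """1 - soft Dice coefficient between a probability map and a mask."""
    inter = np.sum(prob * target)
    return float(1.0 - (2.0 * inter + eps)
                 / (np.sum(prob) + np.sum(target) + eps))


def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones (regression)."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(np.mean(loss))


if __name__ == "__main__":
    target = np.array([1.0, 0.0, 1.0, 0.0])
    prob = np.array([0.9, 0.2, 0.7, 0.1])
    print(binary_cross_entropy(prob, target))
    print(soft_dice_loss(prob, target))
    print(smooth_l1_loss(np.array([0.5, 2.0]), np.zeros(2)))
```

Teams that combined several losses would typically sum such terms, possibly with weights; the weights are not given in the text and are therefore omitted here.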

E. Post-Processing
For post-processing, watershed segmentation (WS) was used by 17 teams. The most popular way to apply WS was on the nuclear probability pixel map. Additionally, to separate touching nuclei several teams used a neural network to predict the location of a marker for each nucleus, such as by using a nuclear-core probability map, a distance map, or a vector map pointing to the nearest nuclear center. Cleaning up small or weakly detected nuclei was also a common theme. Non-maxima suppression and h-minima were commonly used along with a threshold to clean up false positives.
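The clean-up step mentioned above, thresholding a probability map and discarding small detections, can be sketched as follows. This illustrates only the removal of small components, not watershed segmentation itself. The function names, the threshold of 0.5, the minimum size, and the use of 4-connectivity are assumptions made for the example.

```python
from collections import deque

import numpy as np


def label_components(binary):
    """Label 4-connected components of a boolean map by breadth-first search."""
    labels = np.zeros(binary.shape, dtype=np.int32)
    height, width = binary.shape
    current = 0
    for y0, x0 in zip(*np.nonzero(binary)):
        if labels[y0, x0]:
            continue  # already reached from an earlier seed
        current += 1
        labels[y0, x0] = current
        queue = deque([(y0, x0)])
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < height and 0 <= nx < width
                        and binary[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    queue.append((ny, nx))
    return labels


def clean_instances(prob, threshold=0.5, min_size=3):
    """Threshold a probability map and drop components below min_size pixels."""
    labels = label_components(prob > threshold)
    cleaned = np.zeros_like(labels)
    next_id = 0
    for i in range(1, int(labels.max()) + 1):
        component = labels == i
        if component.sum() >= min_size:
            next_id += 1  # relabel surviving components consecutively
            cleaned[component] = next_id
    return cleaned


if __name__ == "__main__":
    prob = np.zeros((5, 5))
    prob[0:2, 0:2] = 0.9  # a plausible nucleus
    prob[4, 4] = 0.8      # an isolated false positive
    print(clean_instances(prob))
```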

F. Training and Testing Time
Training times ranged from 2 hours and 17 minutes on a single Nvidia 1080Ti GPU for team Junma to 42 hours for team Johannes Stegmaier on similar hardware. Testing times also varied widely, from 1 second per 1000 × 1000 image for team Unblockabulls on an Amazon Web Services GPU instance powered by an Nvidia K80 GPU to 2 minutes 58 seconds per image on an Nvidia Titan X GPU for team CVBLab.

G. Description of the Top-Five Techniques
We now describe the top-five techniques in more detail as examples of the innovations and diligence with which the participants tried to get robust generalization. Comparative results of the top-five techniques are shown in Fig. 2. Specific details about parameter settings of each algorithm can be found in their respective manuscripts available on the challenge webpage [13] under the "manuscripts" tab.

1) CUHK & IMSIGHT:
Extensive data augmentation based on random affine transforms, rotation, and color jitter was used. The nuclei segmentation task was split into nucleus segmentation and boundary segmentation. A contour information aggregation network (CIA-Net), inspired by FCN [42] and U-Net [39], was developed to segment nuclei and boundaries simultaneously, using ResNet50 [44] as the backbone architecture. A binary cross-entropy loss function that combined nucleus and boundary annotation errors was used to train the network.
This algorithm missed some of the smaller nuclei and over-segmented some larger nuclei (incorrectly splitting a large nucleus into multiple smaller nuclei), as shown in Fig. 2.
2) BUPT.J.LI: Images were color normalized and training data was augmented using random cropping, flipping, rotation, scaling, and noise addition. A deep layer aggregation [45] architecture was used to perform three tasks: (1) detect inside-nuclei pixels, (2) estimate the geometric center of the inside-nuclei pixels, and (3) estimate a center vector that pointed towards the estimated nucleus center for each inside-nuclei pixel. During inference, the detected nuclei centers and center vectors were used to assign inside-nuclei pixels to one of the overlapping or touching nuclei instances. Since nuclei boundary information was not explicitly used by the network, this technique produced overly smooth nuclei boundaries (Fig. 2), especially for nuclei with high curvature boundaries.
3) pku.hzq: Extensive data augmentation was used, such as flips, rotations, scaling, and noise addition. A U-Net [39] was then used to predict a ternary class map similar to Kumar et al. [11]. Additionally, an MRCNN [41] was used for top-down instance segmentation. Predictions from the two models were combined as an ensemble for both boundary and nucleus prediction. The ensembled nuclei center masks were then calculated by morphological erosion of the predicted nuclei pixels. A random walker was used to obtain instance segmentation masks from the ensembled semantic masks and center masks. From Fig. 2, it is evident that the boundaries for touching and overlapping nuclei were sometimes unnatural (and occasionally merged) due to pixel-level (semantic) ensembling of the boundary class predictions.
4) Yunzhi: For data preparation, contrast-limited adaptive histogram equalization (CLAHE) [38] was used. Data augmentation was done using mirror flipping, rotations that were multiples of 90 degrees, color jitter, Gaussian noise addition, and elastic deformation. For each pixel, the probability of it belonging to a nucleus or a nucleus boundary, and the unit vector to the center of the nucleus, were computed using two cascaded U-Nets [39]. The first U-Net predicted the inside-nuclei pixels and the unit vector to the center of the nucleus, which were then used in the subsequent U-Net to accurately predict nuclei boundaries. Delineation of touching and overlapping nuclei using this technique relied heavily on accurate estimation of the unit vector that pointed towards the center of a nucleus. Because this unit vector could not always be estimated precisely, the technique produced some over-segmentation and under-segmentation (incorrectly merging two touching or overlapping nuclei) errors (see Fig. 2).
5) Navid Alemi: A neural network predicted both foreground (nuclear core) and background (nuclear boundary) markers. The neural network was a multi-scale feature-sharing network that used extensive skip connections, and was dubbed SpaghettiNet. For training the marker head prediction, the network used a combination of weighted Dice and binary cross entropy loss. For predicting the boundaries, it used smooth Jaccard loss, and the boundary map was cleaned up using the Frangi vesselness filter [50]. Finally, marker-controlled watershed segmentation using the predicted markers and boundaries was employed to obtain the instance segmentation maps. Fig. 2 shows that this technique produced overly smooth boundaries with some over-segmentation and under-segmentation errors.
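The center-vector grouping described for BUPT.J.LI above can be sketched as follows. The exact assignment rule is not given in this section, so the sketch makes an assumption: each inside-nuclei pixel adds its predicted vector to its own position to "vote" for a center location, and is then assigned to the nearest detected center. The function name and the data layout (dictionaries keyed by pixel coordinates) are hypothetical.

```python
def assign_by_center_vectors(foreground, vectors, centers):
    """Assign each inside-nuclei pixel to one detected nucleus center.

    foreground: iterable of (row, col) inside-nuclei pixels
    vectors:    dict mapping pixel -> (d_row, d_col), assumed to be an offset
                pointing from the pixel to its nucleus center
    centers:    list of (row, col) detected nucleus centers
    Returns a dict mapping pixel -> index into `centers`.
    """
    labels = {}
    for (r, c) in foreground:
        d_row, d_col = vectors[(r, c)]
        vote_r, vote_c = r + d_row, c + d_col  # where this pixel places its center
        labels[(r, c)] = min(
            range(len(centers)),
            key=lambda k: (centers[k][0] - vote_r) ** 2
                          + (centers[k][1] - vote_c) ** 2,
        )
    return labels

# Usage: two adjacent pixels between two centers are split by their vectors.
centers = [(2, 2), (2, 6)]
vectors = {(2, 3): (0, -1), (2, 4): (0, 2)}
print(assign_by_center_vectors(vectors.keys(), vectors, centers))
```

Because adjacent pixels can point to different centers, touching nuclei can be separated without an explicit boundary prediction, which is consistent with the overly smooth boundaries noted for this technique.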

H. Ensemble of Top-Five Techniques
Unlike ensembling of semantic segmentation, where class probabilities or decisions can be averaged for each pixel location, ensembling of instance segmentation results is far from trivial. Because the literature on this topic is thin and unconvincing, we developed our own approach to generate the ensemble output of instances segmented by the top-five techniques. First, we looped over the instances of the top-ranked technique and identified the corresponding nuclei instances from the other four techniques on the basis of maximum overlapping pixels. Once the matched instances from all techniques were identified, the corresponding ensemble instance was computed through pixel-level majority voting, as would be done for semantic segmentation of a single nucleus. After looping over all nuclei instances predicted by the rank 1 technique, we incorporated the instances it had missed by looping over those instances of the rank 2 technique that did not overlap with any instance of the rank 1 technique. The process was repeated for the rank 3 technique, but not for the remaining two techniques, because the extra instances detected by those two would not have a majority vote from the top three techniques. This ensembling method gave an overall a-AJI of 0.693 (95% CI: 0.682-0.703), which is only marginally better than the individual results of the top-five teams.
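The ensembling procedure described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: instances are represented as sets of (row, col) pixels, the function names are hypothetical, and an instance of a lower-ranked technique is treated as "missed" when it shares no pixel with any instance of a higher-ranked technique.

```python
from collections import Counter

def best_match(instance, candidates):
    """Return the candidate with the largest pixel overlap, or None if none overlap."""
    best, best_overlap = None, 0
    for cand in candidates:
        overlap = len(instance & cand)
        if overlap > best_overlap:
            best, best_overlap = cand, overlap
    return best

def ensemble_instances(techniques):
    """techniques: list (ordered by rank) of lists of instances (pixel sets)."""
    n = len(techniques)
    majority = n // 2 + 1
    ensemble = []
    # Ranks beyond n - majority cannot gather a majority for a new instance.
    for rank in range(n - majority + 1):
        for inst in techniques[rank]:
            # Skip instances already covered by a higher-ranked technique.
            if any(inst & other for higher in techniques[:rank] for other in higher):
                continue
            votes = Counter(inst)  # one vote per pixel from this technique
            for other_rank, others in enumerate(techniques):
                if other_rank == rank:
                    continue
                match = best_match(inst, others)
                if match is not None:
                    votes.update(match)  # one vote per pixel of the matched instance
            merged = {px for px, v in votes.items() if v >= majority}
            if merged:
                ensemble.append(merged)
    return ensemble

# Usage with five toy techniques: nucleus A is seen by four of them, nucleus B
# only by ranks 2-4, and nucleus C only by ranks 4-5 (so it is dropped).
A, B, C = {(0, 0), (0, 1)}, {(5, 5)}, {(9, 9)}
techniques = [
    [A],
    [{(0, 0), (0, 1), (0, 2)}, B],
    [{(0, 1), (0, 2)}, B],
    [A, B, C],
    [C],
]
print(ensemble_instances(techniques))
```

With five techniques the majority threshold is three votes, so the loop visits only the top three ranks, which matches the stopping rule given in the text.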

I. Comparison to Inter-Human Agreement
We re-annotated all 14 test images and computed their a-AJI with the previous annotations. The re-annotation protocol was identical to the one used for creating the training set of MoNuSeg 2018, and the annotator was blinded to the previous test set annotations. The a-AJI between the new and old manual annotations across the 14 test images was 0.653 (95% CI: 0.639-0.667), to which the a-AJI of the top few techniques compares very favorably. This suggests that for nucleus segmentation in H&E images, machine performance is on par with human performance if the image quality is as good as that of the images used in this challenge.
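For readers who want to reproduce such a comparison, the following is a minimal sketch of the aggregated Jaccard index for a single image, assuming the commonly cited definition from Kumar et al. [11]: each reference nucleus is matched to the segmented nucleus with the highest IoU, intersections and unions are accumulated, and unmatched segmented nuclei are added to the union. Instances are represented as sets of (row, col) pixels. The handling of reference nuclei with no overlapping prediction is an assumption, and the challenge's evaluation code may differ in details.

```python
def aggregated_jaccard_index(ref_instances, seg_instances):
    """AJI between two lists of instances, each a set of (row, col) pixels."""
    intersection_sum = 0
    union_sum = 0
    used = set()  # indices of segmented instances matched at least once
    for ref in ref_instances:
        best_iou, best_k = 0.0, None
        for k, seg in enumerate(seg_instances):
            inter = len(ref & seg)
            if inter == 0:
                continue
            iou = inter / len(ref | seg)
            if iou > best_iou:
                best_iou, best_k = iou, k
        if best_k is None:
            # Assumption: a completely missed nucleus only enlarges the union.
            union_sum += len(ref)
        else:
            seg = seg_instances[best_k]
            intersection_sum += len(ref & seg)
            union_sum += len(ref | seg)
            used.add(best_k)
    # Segmented instances never matched count as false positives.
    for k, seg in enumerate(seg_instances):
        if k not in used:
            union_sum += len(seg)
    return intersection_sum / union_sum if union_sum else 0.0

# Usage: one slightly over-sized match, one missed nucleus, one false positive.
first = [{(0, 0), (0, 1)}, {(3, 3)}]
second = [{(0, 0), (0, 1), (0, 2)}, {(7, 7)}]
print(aggregated_jaccard_index(first, second))
```

Note that the measure is not symmetric in its two arguments, so when comparing two sets of manual annotations one of them has to be designated as the reference.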

V. DISCUSSION AND CONCLUSION
Some clear trends emerged from analyzing the top few techniques in Table III. Although earlier work suggested that color normalization can improve the performance of segmentation tasks [11], [51], it is becoming apparent that color augmentation (jitter) trains more robust segmentation models [52]. Most of the top techniques relied on heavy data augmentation including affine transformations, color jitter, and noise addition. ResNet [44] seems to be an architecture of choice for several top performers irrespective of how they formulated the learning task. This is because the residual skip connections in ResNet allow backpropagation of the gradient deep into the network without dilution. Most of the highly successful networks stuck to predicting pixel-wise class probabilities or using MRCNN [41] to predict instance maps. Watershed segmentation was among the most heavily utilized post-processing techniques. It was applied to the nuclear probability maps, most often coupled with a marker, where the marker was based on detecting the cores of individual nuclei. Some of these general trends corroborate those found in instance segmentation challenges on general photography images, such as the Common Objects in Context (COCO) Challenge [29].
Although the participating nuclei segmentation techniques reported significant improvement over the baseline method of [11], more improvements are possible and welcome. To further improve nuclei segmentation quality, the ambiguity at the boundaries of touching and overlapping nuclei needs to be better resolved. Additionally, new techniques should produce more accurate nuclei boundaries without smoothing out high curvature boundaries. Another direction to be investigated is that of developing techniques that are tolerant of errors in the ground truth annotation itself. The role of generative adversarial networks (GANs) in further improving nuclei segmentation performance should also be explored [25]. Given that the top techniques submitted to the MoNuSeg challenge had a-AJIs that were on par with that of a human annotator, it seems that it is time to put some of these techniques to use in nuclear morphometry based disease assessment studies to develop morphometric biomarkers. Finally, the robustness of the dataset and the techniques that have emerged as a part of the MoNuSeg challenge should be assessed for segmenting nuclei under multi-resolution and multi-stain settings. This can be achieved by conducting future competitions on datasets containing annotated nuclei from images obtained at multiple microscopic resolutions (e.g., 10×, 20×, 40×) and including annotated nuclei from images stained with different types of stains (e.g., multiple IHC stains).
X. Zhou is with the Department of Computer Science