Self-supervised Representation Learning for Reliable Robotic Monitoring of Fruit Anomalies

Data augmentation can be a simple yet powerful tool for autonomous robots to fully utilise available data for self-supervised identification of atypical scenes or objects. State-of-the-art augmentation methods arbitrarily embed "structural" peculiarities in typical images so that classifying these artefacts can guide the learning of representations for detecting anomalous visual signals. In this paper, however, we argue that learning such structure-sensitive representations can be suboptimal for some classes of anomaly (e.g., unhealthy fruits), which could be better recognised by a different type of visual element, such as "colour". We thus propose Channel Randomisation as a novel data augmentation method that restricts neural networks to learning an encoding of "colour irregularity" whilst predicting channel-randomised images, so as to ultimately build reliable fruit-monitoring robots that identify atypical fruit qualities. Our experiments show that (1) this colour-based alternative can better learn representations for consistently accurate identification of fruit anomalies across various fruit species, and (2) unlike with other methods, the validation accuracy can be used as a criterion for early stopping of training in practice, owing to the positive correlation between performance on the self-supervised colour-differentiation task and the subsequent detection rate of actual anomalous fruits. Moreover, the proposed approach is evaluated on a new agricultural dataset, Riseholme-2021, consisting of 3.5K strawberry images gathered by a mobile robot, which we share online to encourage active agri-robotics research.


I. INTRODUCTION
Agricultural mobile robots are expected to precisely assess the quality of crops from their sensory information, to autonomously perform targeted treatment of individual plants or harvest mature, healthy crops. To realise this autonomy, deep learning models could be adopted to classify visual input from robotic sensors by optimising their parameters on a large number of examples available in advance. In practice, however, collecting data of "atypical" qualities, e.g., fruits with disease or damage, can be challenging, mainly because of their rare occurrence. The One-class Classification (OC) paradigm [1], [2] has therefore been widely used in computer vision communities, in which classifiers are trained to maximise the utility of the available data from the "normal" class so as to later distinguish unseen instances of the "anomalous" class as well.
Self-supervised Learning (SL) has been introduced as a powerful method to effectively solve OC problems by augmenting training data to inject some level of unusual patterns, because classifying the artefacts can be an instructive proxy task for learning feature representations that are potentially informative for detecting anomalies at test time [3], [4], [5], [6]. Nonetheless, most successful SL tasks have been designed only for scenarios in which anomaly is mostly defined by structural differences (e.g., bent tips on screws, holes in hazelnut bodies, or missing wires in cable clusters in the MVTec AD dataset [7]) or by image samples outside a particular training class in large datasets such as ImageNet [8] or CIFAR-10 [9].

(All authors are with the Lincoln Agri-Robotics (LAR) Centre, University of Lincoln, UK. {tchoi, asalazargomez, gcielniak}@lincoln.ac.uk, 25393497@students.lincoln.ac.uk)
We argue that such representation learning techniques may provide only suboptimal performance for OC in agricultural domains, since anomalies in fruits, for example, tend to appear with little distinction in shape, whilst peculiar pigmentations (e.g., Fig. 1c) could instead be more useful visual cues for differentiation. As an alternative, in this paper we thus propose Channel Randomisation (CH-Rand), which augments each image of normal fruit by randomly permuting its RGB channels, with repetition allowed, to produce unnatural "colour" compositions in the augmented image. Whilst classifying these artefacts, the neural network automatically learns discriminative representations of irregular colour patterns, so that distance-based heuristics can later be employed in the learnt space to estimate the anomaly score of an input from its distance to the existing data points.
To validate the performance of our system in a realistic scenario, we also introduce Riseholme-2021, a new dataset of strawberry images, for which we operated a mobile robot (Fig. 1a) to collect 3,520 images of healthy and unhealthy strawberries at three distinct developmental stages, with possible occlusions (cf. Fig. 1b-1c). Our experiments are conducted not only on this set of strawberry images but also on the Fresh & Stale dataset with several other fruits, showing that CH-Rand gains the most reliable representations for detection of anomalous fruits compared to all other baselines, including self-supervised structure-learning methods (e.g., CutPaste [3]). We further support our design by demonstrating a high degree of correlation between success in the colour prediction task and the final performance in anomaly identification. Hence, CH-Rand does not require manually engineered criteria for early stopping, and the validation accuracy can simply be monitored during the proxy task to ensure precise detection of actual anomalies.

II. RELATED WORK

A. Anomaly Detection in Agricultural Domains
Perception models have also played a crucial role in agriculture in building the essential capabilities needed to eventually deploy fully autonomous robots on real farms. For instance, weeds are the targeted anomalies to detect in [10], [11], [12], and occlusions or dense fruit clusters are of interest in [13], [14]. More relevantly to our work, plant diseases are also important anomalies to detect, as in [15], [16], [17], in which networks were trained with annotated images to recognise diseased leaves. These methods were built upon supervised learning aided by manually annotated data, whereas our approach is designed to meet the practical assumption in OC that anomalous data may be unavailable during training. Hossain et al. [18] also utilised colour-based features to recognise anomalous leaves, but they depended only on human-engineered features, whilst ours trains deep neural networks.

B. One-class Classification Strategies
Due to the strict assumption in OC, generative model-based frameworks have been widely used. For instance, Deep Convolutional Autoencoders (DCAE) measure the reconstruction error, because novel data are more likely to cause higher errors [1], [19]. With DCAE as a backbone, Ruff et al. [2] introduced Deep Support Vector Data Description (DSVDD) to learn representations as dense as possible near a central vector c, so that atypical data points can be detected by their long distance from it. Generative Adversarial Networks (GANs) can also provide a large benefit by synthesising data to potentially model unavailable anomalous samples [1], [20], [21], [22]. For example, IO-GEN [1] utilises a trained DSVDD, replacing its c with synthetic data to perform multi-dimensional classification for complex datasets in place of a simplistic distance calculation.
1) Self-supervised Learning: The ultimate goal of SL is to gain useful representations in neural networks for future anomaly detection whilst identifying intentionally manipulated data as a pretext task. For example, inferring (1) geometric transformations such as rotation (ROT) applied to input images [23], (2) relative locations among regional patches [5], or (3) images with blank local masks embedded [4] has shown great success for OC. More recently, Li et al. [3] introduced CutPaste (CP), in which, unlike [4], local patches are extracted directly from the original images (cf. Fig. 2a) to keep the pretext task more challenging.
In fact, all these augmentations were motivated by modelling the typical structures of normal objects (e.g., defect-free screws) in order to detect odd shapes in anomalous examples (e.g., screws with bent tips) afterwards. We, however, argue that such structural differences may be less significant in differentiating between healthy and unhealthy fruits, and we propose to learn colour regularity instead as an alternative.

C. Channel Randomisation for SL
Although colourisation can be used to learn useful representations by colouring grayscale images [24], [25], a technique more relevant to CH-Rand is Channel Permutation (CH-Perm) [26], also called Channel Swap [27], in which five reorderings of the channels of an RGB image without repetition (i.e., RBG, GRB, GBR, BRG, and BGR) can be considered for augmentation. Channel Splitting (CH-Split) [27] is also related, in which the values of one randomly chosen channel (R, G, or B) are copied to the others, potentially producing three novel visuals. Lee et al. [27] applied this to their SL framework to encourage their models to "ignore" colour variations but learn the semantic coherence of a human action. Contrary to any of these techniques, CH-Rand is adopted here in the context of one-class classification to "reflect" colour regularities in representation learning. Moreover, CH-Rand can generate a larger set of 26 random channel sequences, including all of those that CH-Perm and CH-Split can generate, and in Section IV we investigate the benefits of using it.
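For concreteness, the counts of available channel sequences for each technique can be verified by enumeration; the following is a small illustrative sketch (the channel labels and variable names are ours):

```python
from itertools import product, permutations

channels = ("R", "G", "B")

# CH-Rand: any of the 3^3 = 27 channel sequences (repetition allowed)
# except the identity RGB, which would leave the image unchanged.
ch_rand = [p for p in product(channels, repeat=3) if p != channels]

# CH-Perm: reorderings without repetition, excluding the identity.
ch_perm = [p for p in permutations(channels) if p != channels]

# CH-Split: one channel copied into all three positions.
ch_split = [(c, c, c) for c in channels]

print(len(ch_rand), len(ch_perm), len(ch_split))  # 26 5 3

# CH-Rand subsumes both of the other variants.
assert set(ch_perm) <= set(ch_rand) and set(ch_split) <= set(ch_rand)
```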

III. METHODOLOGY
As in previous approaches [3], our framework comprises two modular processes for anomaly identification on fruit images: (1) self-supervised representation learning with data augmentation and (2) anomaly score estimation. In Section III-A, we first formalise our proposed CH-Rand augmentation, and then in Section III-B, we describe a heuristic method to calculate anomaly scores on the learnt representations.

A. Channel Randomisation
Our approach is motivated by observations of fruit anomalies that are unique compared to other types, e.g., local defects on industrial products in MVTec AD [7] or out-of-distribution samples in CIFAR-10 [9]. To be specific, as shown in Fig. 1b, fruits generally exhibit relatively high phenotypic variation in local structures, even within the same species and regardless of normality; nevertheless, healthy fruits at the same developmental stage all share a similar colour composition, which can change dramatically as the fruit becomes unhealthy, for example due to fungal infection, as displayed in Fig. 1c. Therefore, we design a novel augmentation method that restricts the neural network to learning representations that encode colour irregularity, to ultimately build a more reliable anomaly detector for agricultural robots.
CH-Rand is performed simply by computing a random permutation of colour channels, with repetition allowed, and applying it to the entire image. More formally, we generate an augmented image A ∈ R^{W×H×C} by executing CH-Rand on an original image I ∈ R^{W×H×C} of the normal class available during training, where W and H are the width and the height, respectively, and C is the number of channels, typically set to 3 for the RGB format. To augment a new input I, we first randomly build an arbitrary function π : χ → χ′ for permutation, where χ = {1, 2, ..., C} denotes the indices of the original channels, and χ′ ∈ P(χ) \ {∅}, with P returning the power set of its input. Each element a^c_{w,h} in A can then be determined as follows:

a^c_{w,h} = i^{π(c)}_{w,h},

for which π is fixed for every w, h, and c to apply the same channel assignment throughout a single augmentation process. Note here that the output sequence of π may use duplicate channel indices from χ by design, because |χ′| ≤ |χ|. Moreover, we keep drawing a new π until ∃c ∈ χ, π(c) ≠ c, to avoid the case of A = I. Consequently, 26 possible channel sequences exist for augmentation in the 3-channel format, whereas CH-Split and CH-Perm only have three and five possibilities, respectively. An example augmentation is presented in Fig. 2e, for which the channel sequence BBR has been generated from the RGB colour space.
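The procedure above can be sketched in a few lines of NumPy; the function name and the rejection loop are our own framing, assuming channel-last RGB arrays:

```python
import numpy as np

def ch_rand(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Channel Randomisation sketch: map each output channel c to a randomly
    chosen input channel pi(c), repetition allowed, redrawing pi until it
    differs from the identity so that the augmented image A != I."""
    C = image.shape[-1]
    while True:
        pi = rng.integers(0, C, size=C)   # pi: output channel -> input channel
        if np.any(pi != np.arange(C)):    # reject the identity mapping
            break
    return image[..., pi]                 # a^c_{w,h} = i^{pi(c)}_{w,h}

rng = np.random.default_rng(0)
I = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
A = ch_rand(I, rng)
assert A.shape == I.shape and not np.array_equal(A, I)
```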
Based upon this augmentation method, a classifier can be trained on the binary classification task of predicting whether input images are products of the augmentation. Inspired by [3], [23], we design the loss function below to train a deep neural network-based classifier f_Θ on a training dataset D:

L(D) = E_{I∈D} [ H(0, f_Θ(I)) + H(1, f_Θ(CHR(I))) ],

where CHR is the application of CH-Rand augmentation, and H is the binary cross-entropy function estimating the prediction error in classification. In implementation, we randomly sample a batch D′ ⊆ D at each iteration and feed one half with augmentation and the other half without.
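The loss can be sketched as follows, assuming a classifier that outputs a scalar probability of an image being augmented; the helper names (`bce`, `sl_loss`) and the dummy classifier are illustrative, not the paper's implementation:

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Element-wise binary cross-entropy H(y, y_hat)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def sl_loss(f, batch, augment, rng):
    """One iteration's loss: half the batch is left intact (label 0),
    the other half is channel-randomised (label 1)."""
    half = len(batch) // 2
    originals = list(batch[:half])
    augmented = [augment(img, rng) for img in batch[half:]]
    preds = np.array([f(img) for img in originals + augmented])
    labels = np.array([0.0] * len(originals) + [1.0] * len(augmented))
    return bce(labels, preds).mean()

# With an uninformative classifier predicting 0.5 everywhere, the loss
# equals -log(0.5) ~ 0.693 regardless of the augmentation used.
rng = np.random.default_rng(0)
batch = [rng.random((8, 8, 3)) for _ in range(4)]
loss = sl_loss(lambda img: 0.5, batch, lambda img, r: img[..., ::-1], rng)
```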

B. Anomaly Detection
For anomaly prediction, we use the feature representations g_θ learnt within the classifier f_Θ, i.e., g_θ is the output of an intermediate layer of the network. The anomaly score of a test input is then estimated with a distance-based heuristic on the learnt space, using the distances from its representation to its k nearest neighbours among the normal training representations. Our design is generic, so this scoring module can easily be replaced with other unsupervised techniques, such as Gaussian density estimators [30] or One-class SVM [31]. Yet, we use the k-neighbour heuristic since it has performed best in our tests.
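A minimal sketch of this k-neighbour scoring, assuming plain Euclidean distance over feature vectors (the paper does not spell out the exact aggregation, so taking the mean over the k distances is our assumption):

```python
import numpy as np

def anomaly_score(z, train_feats, k=5):
    """Mean Euclidean distance from a test representation z to its k
    nearest neighbours among the normal training representations."""
    dists = np.linalg.norm(train_feats - z, axis=1)
    return np.sort(dists)[:k].mean()

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.1, size=(200, 16))  # tight "normal" cluster
near = np.zeros(16)                            # resembles the normal data
far = np.full(16, 5.0)                         # far from every normal point
# A point far from the training data receives a much higher score.
assert anomaly_score(far, normal) > anomaly_score(near, normal)
```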

IV. EXPERIMENTS
We here present experimental results to demonstrate the performance of our proposed framework in vision-based monitoring of fruit anomalies. We first explain the fruit-image datasets for anomaly detection and the technical details used throughout the experiments in Section IV-A and Section IV-B, respectively. Section IV-C then shows quantitative results in comparison with other baselines, and based on these results, we examine in Section IV-D the relevance of the pretext task that CH-Rand generates to the subsequent detection of real unhealthy fruits. Lastly, Section IV-E includes ablation studies to discuss variants of CH-Rand.
A. Fruit Anomaly Detection Datasets

1) Riseholme-2021: For realistic evaluations, we first introduce Riseholme-2021, a new dataset of strawberry images, in which 3,520 images are available with manually annotated labels: Ripe, Unripe, Occluded, and Anomalous. This dataset was collected by deploying a commercial mobile robot, Thorvald, outdoors on the strawberry research farm on the Riseholme campus of the University of Lincoln, as depicted in Fig. 1a. In particular, the robot was configured to use a side-mounted RGB camera to take images of normal and anomalous strawberries at various growth stages under natural, variable light conditions, whilst navigating along the rows in polytunnels. Human experts then examined each image to manually crop regions centred around strawberries and annotate them with the respective labels. In real applications, fruit segmentation algorithms could be employed to automate the extraction of fruit-centred regions, but in this work, we allow humans to intervene in the loop to minimise potential negative impacts caused by errors in the segmentation process. More specifically, each image from Ripe (Unripe) contains a single ripe (unripe) strawberry, whereas several may appear overlapping one another in images of the Occluded class. Furthermore, some Occluded strawberries are covered by green stems. Anomalous cases also display single strawberries with anomalies, such as malformations, a lack of normal pigmentation, or clear signs of disease. Example images of each category are displayed in Fig. 1b-1c.
Table I shows basic statistics of the dataset, in which the "normal" categories, including Ripe, Unripe, and Occluded, have considerably more images (95.7%) than Anomalous (4.3%); this severe class imbalance provides a realistic testbed for anomaly detection. Riseholme-2021 is also provided with exclusive splits (Train, Val, and Test) containing 70%, 10%, and 20% of the "normal" images, respectively, and all Anomalous images are considered only during testing. To further encourage active research in agri-technology, we publish our Riseholme-2021 dataset online at https://github.com/ctyeong/Riseholme-2021.
2) Fresh & Stale: The Fresh & Stale dataset contains annotated fruit images of six different species collected in controlled environments, each labelled as either Fresh or Stale. We, however, discovered duplicate images that had been transformed with several methods, such as rotations and translations. We therefore keep only images with unique instances of fruit, and as a result, the size of the final dataset is significantly reduced, with Apple being the largest class at 231 normal and 327 anomalous instances. Moreover, we only utilise Apple, Orange, and Banana, since the other classes each have fewer than 50 examples after the removal of duplicates. We split the normal data into Train (40%), Val (10%), and Test (50%) to conduct tests along with the images of Stale. Also, the black pixels produced by the pre-transformations were converted to white to match the original background.

B. Implementation Details & Evaluation Protocols
Throughout the experiments, we deploy a deep-network classifier for SL consisting of 5 ConvLayers followed by 2 DenseLayers, in which the number of 3 × 3 convolutional filters increases incrementally (64, 128, 256, 512, and 512), each layer being followed by a BatchNorm layer and a 2 × 2 MaxPool layer, and the DenseLayers have 256 and 1 output nodes, respectively. Every layer uses LeakyReLU activations except the last, which uses a sigmoid function. Note that despite ResNet's successes in SL [3], [32], we did not discover any benefit from using it in this work, possibly because of the relatively small resolution of our images. Also, at each training iteration, every image is resized to 64 × 64 and processed with traditional augmentations before CH-Rand: horizontal/vertical flips and colour jitter changing the brightness, contrast, saturation, and hue. Normalisation is then applied to bound the pixel values.

As in previous works on OC [1], [2], [3], the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) is used as the performance indicator, and the AUC of the Precision-Recall (PR) curve is also reported as an additional metric, considering the highly imbalanced class distribution in Riseholme-2021 (cf. Table I). Each AUC is the average over three individual runs to mitigate the random effects of CH-Rand and weight initialisation in the networks.

(Fresh & Stale dataset: https://www.kaggle.com/raghavrpotdar/fresh-and-stale-images-of-fruits-and-vegetables)
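As a sanity check on the stated architecture, the spatial sizes can be traced through the five blocks, assuming stride-1 'same'-padded convolutions so that only the 2 × 2 max-pools shrink the feature maps (the padding scheme is not stated in the text, so this is an assumption):

```python
# Trace the feature-map size through the 5 conv blocks: each 3x3 conv
# (assumed stride 1, 'same' padding) preserves the spatial size, and
# each 2x2 max-pool halves it.
filters = [64, 128, 256, 512, 512]
size = 64                              # input images resized to 64 x 64
for _ in filters:
    size //= 2                         # one 2x2 max-pool per block
flattened = size * size * filters[-1]  # features entering the DenseLayers
print(size, flattened)                 # 2 2048
```

Under these assumptions, the first DenseLayer (256 nodes) would thus receive a 2048-dimensional flattened input.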
In fact, other representation learning frameworks [3], [29] prescribe a certain number of training iterations to achieve their best performance, although in practice such knowledge is unavailable in advance. Our approach, instead, is to regularly monitor the "validation accuracy" and stop training if the mean of the last five measurements exceeds .95, or once 1.5K epochs have passed, deploying the model with the maximum validation accuracy. These criteria apply to all SL-based methods in our experiments to examine the relevance of their pretext tasks to the final task of anomaly detection. More details of the hyperparameters, along with the code, are available online at https://github.com/ctyeong/CH-Rand.
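The stopping rule can be sketched as a small helper; the function name and defaults mirror the description above but are otherwise our own:

```python
def should_stop(val_accs, epoch, window=5, threshold=0.95, max_epochs=1500):
    """Early stopping as described above: stop once the mean of the last
    `window` validation accuracies exceeds `threshold`, or once the epoch
    budget is exhausted (the best-validation-accuracy model is then kept)."""
    if epoch >= max_epochs:
        return True
    return len(val_accs) >= window and sum(val_accs[-window:]) / window > threshold

# Example: five recent accuracies averaging above .95 trigger a stop.
assert should_stop([0.90, 0.96, 0.96, 0.96, 0.96, 0.96], epoch=200)
assert not should_stop([0.50] * 10, epoch=200)
```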

C. Comparative Results
We compare CH-Rand with the related methods mentioned in Section II: DCAE [1], DSVDD [2], IO-GEN [1], ROT [23], CP [3], and CH-Perm [26]. Note that for representation learning under SL, ROT and CP inject structural irregularities into images, whilst CH-Perm augments with randomly shuffled channels. A basic colour feature generator (HIST) is also considered, in which the number of pixels is counted within six unique ranges in each channel, producing a 6 × 6 × 6 dimensional colour histogram per input by combining the channel-wise ranges. Pretrained VGG16 [33] is also used to generate features, to investigate the utility of features learnt on ImageNet [8]. DOC [29] is also set up, which learns representations utilising an external benchmark dataset (e.g., CIFAR-10 [9]). Hyperparameter searches are conducted for each baseline to offer the best results on Riseholme-2021, albeit with initial configurations based on publicly available code; e.g., DCAE's performance improves dramatically with a smaller image size of 32 × 32. Since official source code is not available for CP, we have implemented it based on the details in the appendix of [3]. In particular, we adopt the deep classifiers and the k-neighbour-based detector described in Section III for ROT, CP, and CH-Perm, so as to focus only on the achieved representation power in comparison to ours. The only distinction for ROT is the use of four output nodes in the classifier to predict the four degrees of rotation (0°, 90°, 180°, and 270°) pre-applied to input images. Similarly, HIST and VGG16 run the same detector to discern anomalies on their representations. In addition, the results obtained with the best k ∈ {1, 5, 10} are presented, except for CH-Rand and CH-Perm. Note here that the high-performing models on Riseholme-2021 are then applied to Fresh & Stale without major modifications, to assess their general capacity across various agricultural environments. Lastly, CH-Rand uses representations g_θ at fc6, and discussions of this design are given in Section IV-E.
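The HIST baseline described above can be sketched with NumPy's joint histogram; normalising by the pixel count is our assumption, as the text only specifies the 6 × 6 × 6 binning:

```python
import numpy as np

def hist_feature(image, bins=6):
    """Count pixels jointly over `bins` ranges per channel, yielding a
    bins x bins x bins (here 6 x 6 x 6 = 216-dimensional) colour histogram."""
    pixels = image.reshape(-1, 3).astype(float)
    h, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    return h.ravel() / len(pixels)   # normalisation is our assumption

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
feat = hist_feature(img)
assert feat.shape == (216,)
```

The same distance-based detector can then be run directly on `feat`, as done for the learnt representations.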
1) Riseholme-2021: We test different combinations of normal classes against anomalies, as shown in Table II. Every method struggles more with Unripe than Ripe, and also with cases where more normal types are involved, since a larger variety of colours and shapes must be modelled. In particular, the notable failures of DSVDD, HIST, IO-GEN, VGG16, and DOC indicate the challenge of the task on the strawberry images when all the wild conditions are concerned.
Overall, however, SL-based approaches, such as ROT, CP, CH-Perm, and CH-Rand, demonstrate more robust performance across categories despite their relatively simple designs of data augmentation and self-supervision for learning. Still, significantly large drops in ROC are observed for ROT (.926 → .736) and CP (.911 → .736), compared to CH-Perm or CH-Rand (.922 → .790 in the worst model), when all normal subcategories are considered. Moreover, CP presents a PR at least 25% lower than CH-Rand's on Ripe & Unripe, and as the Occluded class is also added, the margin increases to larger than 30%. This trend supports our motivation (cf. Section III-A) that representations of shape could be less informative for identifying unhealthy fruits.
Table II also implies that although CH-Perm and CH-Rand are trained simply to identify unnatural colour patterns, their representations are not trivial features based on the frequencies of various colours, because they clearly outperform HIST. In particular, CH-Rand provides considerably better results than CH-Perm, especially in PR when more complex normal sets are involved, probably because its higher randomness in augmentation can simulate more realistic colour anomalies. Furthermore, similar observations hold for any value of k.
In addition, Fig. 3 visualises the representations in CH-Rand, in which the final features appear surprisingly useful for differentiating anomalies even though the anomalous class was unavailable for explicit learning. In particular, Fig. 3c suggests that the ambiguous appearances of anomalous samples are represented between the ripe and unripe examples, so the final detector can take advantage of this. For better understanding, some visual examples of successful and unsuccessful classification results are also shared online in the code repository.

D. Relevance of SL task
Table IV reveals the correlations between the validation accuracies during SL and the finally achieved ROCs, to examine the relevance of each pretext task to the downstream task of detecting real anomalous fruit images. CH-Rand leads to positive correlation coefficients on all datasets, whilst the others, including CH-Perm, show negative ones on Fresh & Stale. In other words, successful training on the CH-Rand task can ensure representations for precise detection later, but continued training with other augmentations may instead degrade the performance of the final detector, particularly on the Fresh & Stale dataset. Therefore, as designed in Section IV-B, the validation accuracy is a useful, practical criterion for early stopping, compared to manual searches for an optimal number of training iterations in other frameworks [3], [29].
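Such a correlation can be computed per training run as the Pearson coefficient over paired measurements; the numbers below are purely illustrative, not values from Table IV:

```python
import numpy as np

# Hypothetical checkpoints: accuracy on the colour-differentiation proxy
# task vs. the final anomaly-detection AUC-ROC (values are illustrative).
val_acc = np.array([0.62, 0.71, 0.80, 0.88, 0.93, 0.96])
auc_roc = np.array([0.55, 0.63, 0.70, 0.74, 0.78, 0.80])

r = np.corrcoef(val_acc, auc_roc)[0, 1]   # Pearson correlation coefficient
# A monotone relationship like this yields a strongly positive r.
assert r > 0.9
```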

E. Ablation Study
We here investigate the effects of the various randomisation methods and hyperparameters that define our augmentation techniques. To save computation time, we train models on only half of the training set of all normal classes in Riseholme-2021. Also, each image is resized to 32 × 32, and the utilised representations g_θ are always extracted at the conv5 layer unless mentioned otherwise, to focus on each parameter in turn. Moreover, k is set to 1 so that only the nearest neighbour in the training data is considered when calculating the anomaly score.
1) Pixel Selection: Although CH-Rand applies randomised channels across all pixels of an image, we here explore cases where only some of the n pixels are randomised. Figure 2 visualises several methods:
• Patch: pixels inside a random rectangular patch [3].
• SP.∆: a proportion ∆ of pixels selected sparsely at random, in a salt-and-pepper fashion.
• Th.∆: a proportion ∆ of pixels selected by intensity thresholding.
• Sobel: pixels inside a large segmented region from Sobel filter-based segmentation [28].
• All: all pixels, as proposed in Section III-A.
Table V reveals that CH-Rand works poorly when objectness is not taken into account: Patch, which may position a patch lying across multiple semantic objects, leads to the worst result. Similarly, SP.75, which sparsely augments 75% of the pixels, performs worse than Th.75, whose thresholding tends to keep the selected pixels on the same part of an object. Sobel also supports this idea with its high ROC.
Another key observation is that CH-Rand on more pixels produces better results. For instance, Th.∆ improves as ∆ increases, and finally, when all pixels are involved as designed in Section III-A, the highest ROC is achieved. Thus, all pixels are considered hereafter.
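The pixel-selection variants compared above can be sketched as boolean masks (the sizes and the fixed rectangle are illustrative); a masked variant of CH-Rand would then randomise channels only where the mask is true:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32
intensity = rng.random((H, W))            # stand-in grayscale intensities

# SP.75: 75% of pixels chosen uniformly at random (scattered selection).
sp_mask = rng.random((H, W)) < 0.75

# Th.75: the top 75% of pixels by intensity, i.e. thresholding, which
# tends to select spatially coherent regions of the same object.
th_mask = intensity >= np.quantile(intensity, 0.25)

# Patch: pixels inside one rectangle (here fixed for clarity).
patch_mask = np.zeros((H, W), dtype=bool)
patch_mask[4:20, 8:24] = True             # a 16 x 16 region
```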
2) Randomisation Variants & Input Size: We also explore the effect of different image sizes, in particular comparing CH-Rand with other channel-randomising methods such as CH-Split [27] and CH-Perm [26] (cf. Section II-C).
In Table VI, CH-Split leads to the lowest ROC, implying that its three possible channel sequences may provide only limited irregular patterns to learn. CH-Perm and CH-Rand each appear to work best with a size of 64 × 64, which is close to the average size of 63 × 66 in the dataset (cf. Table I). With that size, CH-Rand outperforms CH-Perm.
3) Layer Selection: In Table VI, further improvement is also observed for CH-Rand with representations at fc6, implying that the most discriminative representations are learnt there, offering the best features to the final fc7 layer. Note that we have consistently observed this tendency with CH-Rand, although CH-Perm did not gain any benefit.
Thus, based on all these findings, the best configuration for each model has been adopted in Section III and Section IV.

V. CONCLUSION & FUTURE WORK
We have shown the importance of learning representations of regular patterns in colour, rather than in structure, to reliably identify images of anomalous fruits. In particular, our CH-Rand method has demonstrated consistently accurate detection results on all tested types of fruit, compared to other baselines, which typically perform well only on some of them, whether modelling structural or colour regularities.
In addition, unlike with other methods, we have discovered positive correlations between success in the pretext task of CH-Rand and the performance of the finally built anomaly detector. Hence, the validation accuracy can be used as a useful criterion for early stopping during training in practice. For realistic scenarios of agricultural robots, we have also introduced a new image dataset, called Riseholme-2021, containing 3.5K images of strawberries with various levels of maturity and normality.
In future work, fine-grained detection could be developed to spatially identify local anomalies. We could also study a potential limitation of CH-Rand in cases where imaged fruits exhibit severe structural damage in appearance.

TABLE I :
Statistics of Riseholme-2021. In this work, all subcategories except Anomalous are included in the normal class.

TABLE II :
Average AUC-ROC and AUC-PR scores on Riseholme-2021, with standard deviations from three independent runs for each model. HIST, VGG16, ROT, and CP also employ the k-nearest-neighbour detector with the best k ∈ {1, 5, 10}.

TABLE III :
Average AUC-ROC scores on the Fresh & Stale dataset. Successful approaches from Riseholme-2021 are compared using their best k ∈ {1, 5, 10} for the k-neighbour detector.

2) Fresh & Stale: Table III shows that DCAE is not as effective as on Riseholme-2021, with highly varying ROCs across categories (.487∼.845), since it easily overfits the less complex images with controlled backgrounds. Interestingly, HIST works significantly better here than on Riseholme-2021, even outperforming ROT and CP, probably by taking advantage of the homogeneous colour patterns of the focal objects; consequently, visual signals such as black spots on bananas are easily identified simply by colour frequencies. The failure of the two SL methods re-emphasises the lower utility of structural features in detecting anomalous fruits. Also, CH-Perm struggles particularly with Orange and All, in which it loses even to HIST. CH-Rand, however, presents high performance across all fruit species.

TABLE IV :
Pearson correlation coefficients between AUC-ROC scores and validation accuracies measured during SL proxy tasks. "All" categories are considered for each dataset.

TABLE V :
Performance of CH-Rand on Riseholme-2021 depending on the selection of pixels.

TABLE VI :
Performance of CH-Split, CH-Perm, and CH-Rand on Riseholme-2021 with different input sizes. Utilising representations learnt at the fc6 layer is also considered.