Self-Aware SGD: Reliable Incremental Adaptation Framework for Clinical AI Models

Healthcare is dynamic as demographics, diseases, and therapeutics constantly evolve. This dynamic nature induces inevitable distribution shifts in populations targeted by clinical AI models, often rendering them ineffective. Incremental learning provides an effective method of adapting deployed clinical models to accommodate these contemporary distribution shifts. However, since incremental learning involves modifying a deployed or in-use model, it can be considered unreliable as any adverse modification due to maliciously compromised or incorrectly labelled data can make the model unsuitable for the targeted application. This paper introduces self-aware stochastic gradient descent (SGD), an incremental deep learning algorithm that utilises a contextual bandit-like sanity check to only allow reliable modifications to a model. The contextual bandit analyses incremental gradient updates to isolate and filter unreliable gradients. This behaviour allows self-aware SGD to balance incremental training and integrity of a deployed model. Experimental evaluations on the Oxford University Hospital datasets highlight that self-aware SGD can provide reliable incremental updates for overcoming distribution shifts in challenging conditions induced by label noise.

take inputs from medical devices, smartphones, wearables, and healthcare records to provide a wide range of clinical functionalities such as patient management and diagnosis. Modern deep learning has been the cornerstone of this AI revolution, expanding into a range of healthcare applications [2], [3] such as disease diagnosis [4], [5], patient monitoring [6], drug discovery [7], and organ transplant allocation [8].
Most of the prevalent deep learning solutions are static in nature and often become outdated if the characteristics of a targeted population change over time [9]. A significant distribution shift in patient variables can render a deployed model practically ineffective. Such shifts are anticipated in the healthcare domain as target populations, underlying disease epidemiology, and treatment protocols often evolve:
• Shifts in demographics: Populations evolve naturally over time due to various factors such as immigration [10], changes in fertility/mortality rates [11] and population aging [12].
• Shifts in epidemiology: Changes in risk factors and population behaviours can alter disease epidemiology. For example, increased obesity in England during the last two decades is expected to affect the prevalence of other chronic diseases [13].
• Shifts in pathology: Disease pathophysiology also changes over time, often due to changes in the risk factors of the disease. For example, the major risk factors for cervical cancer are oncogenic variants of Human Papillomavirus (HPV). Although a significant reduction has been witnessed in HPV due to widespread vaccination, other risk factors such as smoking, genetic predisposition, and sexual factors are still prevalent. As such, the nature of future cases is likely to change due to differences in the driving pathophysiology [14], [15].
• Shifts in disease management protocols and technology: Protocols, technologies, and tools for disease management evolve over time [16], [17]. These changes in disease management are driven by improved understanding of the disease, early diagnostic advancements, and progress in therapeutics, including personalised therapy.
The COVID-19 pandemic presents a clear case study to analyse the impact of the above factors on a population. We witnessed a demographic shift as the higher incidence of COVID-19 moved from the older to the younger population after the initial pandemic stages [18]. Changes in the management of COVID-19 patients also accompanied this population shift. As our understanding of COVID-19 improved and better preventative (e.g. vaccines) and therapeutic [19] options became available, the protocols were changed to account for these improvements. Another factor altering the nature of COVID-19 is the emergence of new viral variants differing in virulence, pathogenicity and susceptibilities towards vaccines and antivirals [20]. The different outcomes observed in COVID-19 patients over time reflected these trends [21].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
As underlying distributions of patient variables shift, models become uncalibrated, and performance degrades over time. A model trained for predicting respiratory deterioration in COVID-19 patients [22], [23] in the early stages of the pandemic may be less effective during the third or fourth wave. Hence, clinical models should be adapted/fine-tuned to stay effective over a prolonged period. Incremental learning [24] provides an effective solution to tackle such distribution shifts, wherein models are iteratively updated to reflect the changes in a target population (Fig. 1). Incrementally updating models over time is a broad topic and has been studied in various contexts over the years, such as online learning and continual learning. Online learning typically refers to methods for updating models on a potentially infinite incoming stream of training examples. It is generally assumed that all streaming examples belong to the same domain or exhibit no distribution shift [25]. Continual learning also deals with updating models on incoming data over time. However, it iteratively adapts a model to new classes (class-incremental) or domains (domain-incremental) while retaining information about the previously learned classes or domains [25]. Retaining information from previous domains (or previous characteristics of population) is less relevant in the scenario where the target population varies with time, and the updated model is only expected to work on the new population [26]. Hence, this paper mainly deals with such scenarios and considers simple iterative adaptation or fine-tuning of a deployed model to new populations as incremental learning (unless specified otherwise).
Although incremental learning has the potential to help in developing dynamic clinical deep learning models, it presents two major challenges to medical practitioners, regulators and AI researchers:
• Alterations to a deployed model: Incrementally training a clinical model is precarious as it modifies a deployed or in-use model. An inappropriate modification to critical clinical models, e.g. ICU resource allocation and disease diagnosis models, can hamper performance and may lead to devastating consequences. The primary reason for such performance degradation could be "label noise" that often manifests itself in the incremental data due to labelling errors arising from inaccuracies in diagnosis, coding of diagnoses, and documentation of clinical measurements [27]. Additionally, labels or incremental training data may be compromised by malicious actors. Since incremental data often arrives in streams or bursts, it is not always feasible to manually check the sanity of labels. If the deployed model is updated with data exhibiting label noise, it may forget the trained task and become unusable.
• Issues in regulation of iterative AI solutions: The current regulatory framework for software as a medical device (SaMD) adopted by healthcare regulators such as the Medicines & Healthcare products Regulatory Agency (MHRA) and the Food and Drug Administration (FDA) does not enable an iterative development approach, as it requires products to go through the regulatory certification route once a significant iteration has happened. Recently, the FDA launched a new regulatory pathway pilot termed the "PreCert" pathway [28] to regulate the developer of digital health solutions rather than the individual solution itself, facilitating iterative development. While this is a step towards effective implementation of adaptive digital health solutions, frameworks and tools for quality control over the incremental learning process are still lacking. Quality checks are imperative to ensure that newly incorporated information and updates lead to non-inferior performance of deployed models.
This paper aims to provide a framework for reliable incremental adaptation of clinical deep learning models that addresses the above-mentioned challenges to a great extent. To this end, we introduce self-aware stochastic gradient descent, which acts as a wrapper over standard stochastic gradient descent (SGD) and imposes a sanity check over gradients for reliable incremental learning. In the presence of label noise in incremental data, the proposed algorithm effectively balances the adaptation of a deployed clinical model against maintaining its integrity. Self-aware SGD exploits a deep neural network-based contextual bandit that analyses the magnitude and direction of a gradient update to predict its impact on the performance of the deployed model. A gradient update that results in performance deterioration is deemed harmful and filtered out. As a result, self-aware SGD assures that a model is updated reliably and robustly.
Apart from standard incremental adaptation, self-aware SGD is also compatible with state-of-the-art (replay-based) domain-incremental learning methods. By preserving the integrity of a deployed model during incremental learning, this paper exhibits a potential mechanism to create new iterative AI solutions for healthcare that assert quality control over themselves. Tools like the proposed algorithm can potentially be a key component of how regulators control and exert quality assurance over rolled-out incremental digital health solutions.
The major contributions of this paper are listed below:
• This paper proposes self-aware SGD, a reliable incremental learning algorithm that can provide effective incremental adaptation of deployed deep learning models under challenging label noise conditions without compromising their integrity.
• The proposed algorithm provides a blueprint for possible future studies dealing with clinical incremental learning applications. Unlike existing studies, this paper highlights the requirement of reliability and self-quality control in clinical incremental learning solutions.
• This paper provides evidence in favour of dynamic clinical models by utilising the Oxford University Hospitals (OUH) data, collected between 2016 and 2021, to show the inappropriateness of static clinical models in handling evolving target populations.
The rest of this paper is organised as follows: Section II provides a background of the existing incremental learning frameworks within the context of healthcare informatics. Section III describes the proposed self-aware SGD. Sections IV and V describe the experimental setup and analysis of the results, respectively. Finally, Section VI concludes this paper.

A. Incremental Learning
Incremental learning has seen limited exploration in healthcare applications. In [26], the authors proposed an adaptive risk prediction system that detects distribution shifts and adapts the models to cater to the changing population. Guo et al. [29] and Alves et al. [30] also tackled the problem of temporal shifts in the target population. However, instead of incremental learning, these studies considered domain adaptation for handling distribution shifts. In [31], the authors benchmarked state-of-the-art domain-incremental learning algorithms for longitudinal electronic health records or multivariate clinical time-series data and found that replay-based domain-incremental algorithms outperform their counterparts. Along similar lines, Kiyasseh et al. [32] had earlier explored replay-based domain-incremental learning on physiological signals. Domain-incremental learning has also been used in some studies to overcome distribution shifts in X-ray images that arise either due to differences in sensors or in the underlying patient population [33], [34].

B. Robust Deep Learning Training Under Label Noise
Training deep learning models using data with label noise is a well-studied problem in the deep learning literature. Efficient noise-robust mechanisms can isolate training signals from overwhelming noise and provide effective training. In [35], the authors proposed a meta-learning framework where a simple one-layer MLP is used to learn the importance of each sample in a training dataset. In theory, the importance of wrongly labelled examples will be lower, and they will have no impact on the overall training. Xu et al. [36] proposed an information-theoretic loss function that is robust to any pattern of label noise. In [37], the authors highlighted early learning regularisation to propose a label-noise robust training mechanism. In the early learning phase, the model mostly learns from correctly labelled examples. During the later phase, the model starts learning from noisy labels, resulting in feature interference. Building on this observation, the authors proposed a regularisation mechanism that tries to force the model outputs to be consistent with the model predictions obtained in the early training phase.
All the existing noise robust training mechanisms only deal with training models from scratch. These methods do not impose any integrity constraints on a deployed model and may degrade performance after incremental training (see Section V).

C. Comparison With Self-Aware SGD
Most of the existing incremental learning studies do not consider the reliability and regulatory concerns posed by modifications to a deployed clinical model during incremental adaptation. Unlike the existing methods, the proposed self-aware SGD is specially designed to strike an effective balance between maintaining the performance of the deployed model and adaptation to the evolving population. To the best of our knowledge, this paper is the first attempt at performing incremental learning under label noise.

III. PROPOSED SELF-AWARE SGD
This section elaborates the proposed self-aware SGD based incremental learning framework. Here, we first describe the problem statement. Then, we describe the concept of gradient consistency that forms the basic building block of the proposed self-aware SGD. Finally, self-aware SGD is presented.

A. Problem Statement
A deployed or in-use clinical deep learning model has to be updated using batches of incremental data that arrive in bursts. The aim is to design an incremental learning algorithm which ensures that applying incremental updates does not result in any catastrophic drop in the performance of the deployed model. The desired algorithm must work under the assumption that some gradient updates will be harmful to the deployed model (arising from batches with drastic label noise) while others will help the model adapt to the evolving population. It should filter out the harmful updates while applying the required incremental updates to adapt the model.

B. Gradient Consistency
A deployed model has already been trained for the targeted task. It has learned the features or semantics associated with each class. When this model is incrementally updated using a batch containing incorrectly labelled data, we are forcing the model to switch this semantic-class relationship. For example, suppose we are dealing with a model deployed for respiratory deterioration prediction, in which lower SpO2 is associated with respiratory deterioration. The gradient update computed from a wrongly labelled batch may force the model to associate lower SpO2 with no respiratory deterioration event. This implies that the incremental gradient update is not consistent with the historical gradient updates that have been used to train the deployed model.
All historical gradient updates (that resulted in the current state of the deployed model) can be summarised as the change in parameters from their initial random state to the current state. Suppose $\theta_0$ and $\theta_1$ represent the random and current state of the model parameters. Then, the historical gradient update can be defined as:

$$g_h = \theta_0 - \theta_1$$

[Algorithm 1: Training the bandit model. Surviving fragments of the listing: a loop over |B| batches that obtains rewards and features for training, a cross-entropy loss (step 8), and a gradient update with learning rate $\eta$ (step 12).]

At the current state of model parameters ($\theta_1$), an incremental gradient update, $\nabla\theta_1$, must be consistent with $g_h$ to preserve the integrity of the model. This consistency can be defined in terms of:
• Gradient norm: The norm of a consistent gradient update should be lower than that of an inconsistent one. A lower norm implies that the magnitude of the gradient is lower, and that it will not cause any significant change to the current model state [38].
• Cosine similarity: A higher cosine similarity between $g_h$ and $\nabla\theta_1$ implies that $\nabla\theta_1$ is going to update the parameters in a direction similar to the historical parameter updates. Hence, $\nabla\theta_1$ can be considered consistent. The opposite is true if the cosine similarity is lower. Since gradients are tensors (not vectors), we vectorise $\nabla\theta_1$ and $g_h$ to compute cosine similarity:

$$\cos(g_h, \nabla\theta_1) = \frac{\mathrm{vec}(g_h)^{\top}\,\mathrm{vec}(\nabla\theta_1)}{\lVert \mathrm{vec}(g_h) \rVert_2 \, \lVert \mathrm{vec}(\nabla\theta_1) \rVert_2}$$

To visualise gradient consistency in action, we computed gradients using batches with correctly and incorrectly labelled examples for a model trained/deployed for respiratory deterioration prediction (see Section IV). We corrupted batches with four different label noise probabilities, i.e. we flipped the label of each example (in each batch) with a uniform random probability of 0.2, 0.4, 0.6 and 0.8. Fig. 2(a) and (b) depict the difference between the gradient norm and cosine similarity of gradients computed from corrupted and "pure" batches. The analysis of this figure highlights a clear difference between correctly labelled batches and batches with label noise. Hence, both these properties can be used to create an automatic sanity check to identify and reject inconsistent gradients.

[Algorithm 2: Incremental learning with the trained bandit model. Surviving steps of the listing:
8: ... (cross-entropy loss)
9: $G_\theta = \nabla_\theta L$ (gradient)
10-12: ... (bandit prediction $P$)
13: if $P < 0.5$ then
14: skip remaining steps
15: else $\theta' = \theta - \alpha \nabla_\theta L$
16: $g_h = \theta_0 - \theta'$ (update historic gradient)
17: $\theta = \theta'$ (update $\theta$)
18: Return $\theta$ or $f_\theta()$]
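The two consistency measures described above can be illustrated with a minimal sketch in plain Python. The function names (`flatten`, `historical_gradient`, `consistency_features`) and the representation of parameters as nested lists are assumptions made for this example and do not come from the paper's implementation.

```python
import math


def flatten(tensors):
    """Vectorise a (possibly nested) list of parameter tensors into one flat list."""
    flat = []
    for t in tensors:
        if isinstance(t, (list, tuple)):
            flat.extend(flatten(t))
        else:
            flat.append(float(t))
    return flat


def historical_gradient(theta_0, theta_1):
    """g_h = theta_0 - theta_1, computed element-wise on the vectorised parameters."""
    return [a - b for a, b in zip(flatten(theta_0), flatten(theta_1))]


def gradient_norm(grad):
    """Euclidean norm of the vectorised gradient."""
    return math.sqrt(sum(g * g for g in flatten(grad)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectorised tensors (0.0 if either is all zeros)."""
    u, v = flatten(u), flatten(v)
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom > 0 else 0.0


def consistency_features(theta_0, theta_1, grad):
    """Return (gradient norm, cosine similarity with g_h) for one incremental gradient."""
    g_h = historical_gradient(theta_0, theta_1)
    return gradient_norm(grad), cosine_similarity(g_h, grad)
```

For instance, with `theta_0 = [[1.0, 0.0], [0.0]]` and `theta_1 = [[0.0, 0.0], [0.0]]`, a gradient `[[3.0, 4.0], [0.0]]` has norm 5 and cosine similarity 0.6 with the historical update, whereas a gradient pointing against $g_h$ yields a negative cosine similarity.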

C. Self-Aware SGD
Self-aware SGD can be seen as a combination of two stages: training the bandit model and performing reliable incremental learning using the trained bandit model. Following the nomenclature from the contextual bandit literature, these stages can be seen as analogous to exploration and exploitation. In the exploration phase, an agent learns to associate the actions and contexts with rewards. Hence, it can be considered as a training phase. On the other hand, an agent exploits the accumulated knowledge to take actions and obtain rewards in the exploitation phase. The details of these two stages with respect to incremental learning are discussed below:
• Training bandit model: The bandit model is trained or updated before each bout of incremental learning. As incremental data arrives in bursts of a few batches at a time, only these batches are analysed to update or train the bandit model. Since there are no labels (or rewards) to train the bandit model, the first step is to analyse gradient updates and their impact on the performance of the initial or deployed model over validation examples. The gradient updates are computed for each incremental batch, and each gradient update is processed to obtain its norm and cosine similarity with historical model updates. These gradient norms and cosine similarities are accumulated and are used as input features (context) to train the bandit model. The labels or rewards are generated for each batch or its corresponding gradient without affecting the initial model. For each gradient update, a copy of the initial model is created and the gradient update is applied to this copy. Then, we compute performance metrics such as the area under the ROC curve (AUROC) on the validation examples using the "updated copy" model. Based on the performance of the initial and updated model, we define the reward $R$ in terms of $r_\theta$, $r_{\theta'}$ and a threshold $\tau$. Here $r_\theta$ and $r_{\theta'}$ represent the AUROC obtained by the deployed/initial model and the incrementally updated copy of the initial model, respectively, and $\tau$ is a user-defined parameter used to reject the confusing gradients from this training process. In terms of distribution, these confusing gradients belong to the region overlapped by both correctly and incorrectly labelled batches (see Fig. 2(c)). After computing gradient properties and their corresponding rewards or labels, the bandit model is trained to predict whether an input gradient is consistent or inconsistent with the initial model. An update is regarded as inconsistent if it results in performance deterioration and vice versa.
• Reliable incremental adaptation using trained bandit model: Once the bandit model has been trained, it can be deployed as a wrapper over standard gradient descent (or any of its variants). During incremental training, the gradient is computed for a batch as in any standard deep learning framework. Then, this gradient is processed to compute its norm and cosine similarity with respect to historical gradient updates (as discussed earlier). The contextual bandit model takes these gradient properties as input and predicts the gradient consistency. The gradient update is only applied if it is deemed consistent, and hence the integrity of the deployed model is preserved.
Implementation details: Let $f_\theta()$ be an initial/deployed model, with $\theta$ defining its parameters or weights. Similarly, let $B_\phi()$ be a DNN acting as an agent or a bandit model. First, $B_\phi()$ is trained using the set of labelled incremental batches $X_{inc}$. Then, self-aware SGD utilises the trained bandit model, $B_\phi()$, to identify the consistent gradient updates for adapting the initial model $f_\theta()$. Algorithm 1 documents the process of training the bandit model and Algorithm 2 illustrates the process of deploying this bandit model for incremental learning.
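The two stages described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: parameters are flat lists of floats, `evaluate` is a hypothetical callable returning a validation score such as AUROC, `bandit` is any callable mapping a context to a probability, and the thresholding rule in `reward` (improvement above $\tau$ gives 1, deterioration below $-\tau$ gives 0, anything in between is discarded) is an assumed form, since the exact reward definition is not reproduced in the text. Only the 0.5 decision threshold is taken from Algorithm 2.

```python
import math


def features(theta_0, theta, grad):
    """Context for the bandit: (gradient norm, cosine similarity with g_h = theta_0 - theta)."""
    g_h = [a - b for a, b in zip(theta_0, theta)]
    norm_g = math.sqrt(sum(g * g for g in grad))
    norm_h = math.sqrt(sum(h * h for h in g_h))
    dot = sum(h * g for h, g in zip(g_h, grad))
    cos = dot / (norm_g * norm_h) if norm_g > 0 and norm_h > 0 else 0.0
    return norm_g, cos


def reward(r_deployed, r_updated, tau):
    """Assumed labelling rule: 1 = consistent, 0 = inconsistent, None = confusing (rejected)."""
    delta = r_updated - r_deployed
    if delta > tau:
        return 1
    if delta < -tau:
        return 0
    return None


def build_bandit_training_set(theta_0, theta, batch_grads, evaluate, lr, tau):
    """Exploration stage: label each incremental gradient using a copy of the model."""
    r_deployed = evaluate(theta)
    contexts, rewards = [], []
    for grad in batch_grads:
        # Apply the update to a copy, so the deployed parameters stay untouched.
        theta_copy = [t - lr * g for t, g in zip(theta, grad)]
        r = reward(r_deployed, evaluate(theta_copy), tau)
        if r is not None:
            contexts.append(features(theta_0, theta, grad))
            rewards.append(r)
    return contexts, rewards


def self_aware_step(theta_0, theta, grad, bandit, lr):
    """Exploitation stage: apply the update only if the bandit predicts P >= 0.5."""
    p = bandit(features(theta_0, theta, grad))
    if p < 0.5:
        return theta, False  # gradient deemed inconsistent, update skipped
    return [t - lr * g for t, g in zip(theta, grad)], True
```

In practice the contexts and rewards returned by `build_bandit_training_set` would be used to fit $B_\phi()$; the sketch leaves that fitting step out and accepts any callable as the trained bandit.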
Since the proposed algorithm is aware of the historical gradient direction and determines the nature of the gradient updates without any external stimulus, it is referred to as self-aware SGD in this paper. This algorithm works on the assumption that incremental data has a mix of corrupted and pure batches. In real life, we may encounter scenarios where all incremental batches are pure (best-case scenario), or all are corrupted (worst-case scenario). We can borrow a handful of batches from validation data to handle such cases. Since batches from validation data are pure, we can add artificial label noise to create their corrupted versions. These noisy and original validation examples can be added to the incremental data for training the bandit model.

D. Extending Self-Aware SGD for Domain-Incremental Learning
In domain-incremental learning, a model must retain information about the previous domains (or populations) while adapting to new domains. Self-aware SGD is compatible with replay-based domain-incremental methods (which are considered state-of-the-art).

IV. EXPERIMENTS
This section describes the dataset and experiments designed to evaluate the performance of self-aware SGD and self-aware SGD with replays.

A. Dataset Used
Patient records from the Infections in Oxfordshire Research Database (IORD) are used for evaluating the proposed framework. This data was collected from patients admitted to Oxford University Hospitals (OUH) between January 2016 and June 2021. Patients admitted between January 2016 and December 2019 exhibited various underlying conditions such as pneumonia, heart failure, and asthma. In contrast, the data between March 2020 and June 2021 was only collected from patients with PCR-confirmed COVID-19. To simulate an incremental learning setup, we temporally divide the data into six subsets: 2016 dataset, 2017 dataset, 2018 dataset, 2019 dataset, first COVID-19 dataset (March 2020 to July 2020), and second COVID-19 dataset (August 2020 to June 2021). The first COVID-19 dataset (COVID-1) corresponds to the first COVID-19 wave, whereas the second COVID-19 dataset (COVID-2) corresponds to the second and third waves.
Patient features are sampled at irregular time intervals reflecting ad hoc clinical measurements taken by hospital staff. Each sample is characterised by a 77-dimensional feature vector and a binary label (retrospectively generated) signifying respiratory deterioration within the next 24 hours [23]. Features include demographic characteristics, vital sign measurements, laboratory test results, and inspired oxygen concentration (FiO2). The detailed information regarding these features and data pre-processing can be found in [23] and in the supplementary document. Table I documents the number of patients, number of samples, and percentage of samples exhibiting respiratory deterioration in each sub-dataset after pre-processing.

B. Designed Experiments
We train a deep neural network for the task of respiratory deterioration prediction in the incremental learning setup depicted in Fig. 3. The following experiments were designed to evaluate the performance of self-aware SGD using this setup:
• Self-aware SGD vs. standard SGD for incremental adaptation: The performance of self-aware SGD is compared against standard SGD (normal training) in the presence of label noise for incremental adaptation or fine-tuning. The noisy conditions are simulated by randomly flipping labels of the incremental training batches. A training batch is selected with probability b, and the label of each example in a selected batch is flipped with probability p. We refer to b and p as batch and example probabilities, respectively. We used batch probabilities of 0.5 and 0.75, and a fixed example probability of 0.8, i.e. we randomly corrupt labels of approximately 50% and 75% of available batches. In each corrupted batch, labels of approximately 80% of randomly chosen examples are flipped.
• Comparison against existing noise robust algorithms: In this experiment, the performance of self-aware SGD is compared against well-known noise-tolerant DNN training frameworks. These methods include the determinant-based mutual information (DMI) loss function [36], early learning regularisation (ELR) [37] and meta-weight net [35] (discussed in Section II). We incrementally trained the model using self-aware SGD with replays in a label noise setup where the batch probability of 0.5 is used for label corruption.
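The label-noise simulation described above (batch probability b, example probability p) can be sketched in a few lines. The function name `corrupt_batches`, the `seed` argument and the representation of a batch as a list of binary labels are assumptions made for this illustration.

```python
import random


def corrupt_batches(batches, b, p, seed=0):
    """Simulate label noise on binary labels.

    Each batch is selected for corruption with probability b; within a
    selected batch, each label is flipped with probability p.
    Returns the noisy batches and a per-batch flag marking selected batches.
    """
    rng = random.Random(seed)
    noisy, selected = [], []
    for labels in batches:
        if rng.random() < b:
            noisy.append([1 - y if rng.random() < p else y for y in labels])
            selected.append(True)
        else:
            noisy.append(list(labels))  # untouched ("pure") batch
            selected.append(False)
    return noisy, selected
```

Calling `corrupt_batches(batches, b=0.5, p=0.8)` or `corrupt_batches(batches, b=0.75, p=0.8)` would correspond to the two noise settings used in the experiments.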

C. Models and Parameter Setting
A fully-connected neural network or DNN is trained for predicting respiratory deterioration events. The model consists of three fully-connected or dense layers having 308 (77 × 4), 231 (77 × 3) and 1 units, respectively. The first two dense layers are followed by the rectified linear activation function, and the last layer is followed by a sigmoid activation function. A dropout of 0.25 is used between dense layers to regularise the model. Binary cross-entropy is used as the loss function, and we use standard SGD with 0.9 momentum and 0.0001 learning rate as the optimiser to train the initial model on the 2016 dataset (see Fig. 3). A batch size of 512 examples is used across all experiments. Also, we train each model (initial or incremental) for 100 epochs and use early stopping to store only the best performing version of the model. During incremental adaptation using self-aware SGD, we also use the SGD optimiser with 0.9 momentum and 0.0001 learning rate.
We use a three-layered DNN consisting of layers with 8 nodes, 4 nodes and 1 node as the contextual bandit model. The first two layers are followed by rectified linear activation, whereas the last layer is followed by a sigmoid activation function. The loss function used to train this model is mean absolute error (MAE) [40], and Adam with a learning rate of 0.0001 is used as the optimiser. We used a batch size of 128 to train the bandit model across all experiments.
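To give a sense of the scale of the two networks, the following sketch counts the trainable parameters (weights plus biases) implied by the layer sizes above. The helper `dense_param_count` is introduced for this illustration only, and the bandit input size of 2 (gradient norm and cosine similarity) is an assumption based on the features described in Section III.

```python
def dense_param_count(layer_sizes):
    """Number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


n_features = 77
# Respiratory deterioration model: 77 -> 308 -> 231 -> 1
predictor_layers = [n_features, 4 * n_features, 3 * n_features, 1]
# Bandit model: 2 -> 8 -> 4 -> 1 (input size of 2 is an assumption)
bandit_layers = [2, 8, 4, 1]

predictor_params = dense_param_count(predictor_layers)  # 95635
bandit_params = dense_param_count(bandit_layers)        # 65
```

Under these assumptions the bandit adds only a few dozen parameters on top of a predictor with roughly 96 thousand, so the gate itself is small relative to the model it protects.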
For comparative analysis, we train the DNN with the determinant-based mutual information (DMI) loss function [36] using the SGD optimiser with 0.0001 learning rate and 0.9 momentum. In ELR [37], we used a regularisation coefficient of 7 (r = 7) and a weight factor of 0.7 (t = 0.7) for generating pseudo-labels. SGD with a learning rate of 0.001 and momentum of 0.9 was used as the optimiser. For meta-weight net, we used SGD with a 0.001 learning rate and 0.9 momentum as the optimiser to train the respiratory deterioration prediction model. To train the weighting MLP, SGD with a learning rate of 0.001 and zero momentum was used. All these parameters are chosen using hyperparameter tuning on validation examples.

[...] (Fig. 4(a)). This drop upholds our claim of the requirement of reliability in clinical incremental learning.

A. Self-Aware SGD vs. Standard SGD for Incremental Learning
• The performance of self-aware SGD is near-optimum across all datasets, and no performance drop is observed in any of the noisy conditions, unlike normal training. This shows that self-aware SGD can effectively identify and filter out "harmful" gradient updates during incremental training, preserving the integrity of the deployed model.
• [...] and there is little to no distribution shift between these datasets. This is illustrated in Fig. 5(a), which shows the empirical distributions of 8 different features from the 2016 and 2017 training datasets. However, the model trained on 2016 or incrementally updated using the 2017 and 2018 datasets demonstrates high but sub-optimal performance on 2019 test examples (Fig. 4(d)). Incrementally updating using 2019 training examples results in a significant performance improvement. This highlights that the features of the 2019 and previous datasets exhibit a small distribution shift such that the previous model is still effective but not optimum.
• A more significant change is observed between COVID-19 and the other datasets. The initial model, or the model incrementally adapted using the 2017, 2018 and 2019 datasets, is practically unusable on both COVID-19 datasets (Fig. 4(e) and (f)). However, after incremental adaptation using COVID-19 data, the performance on the COVID-19 datasets is significantly improved. This signifies a distribution shift between historical and COVID-19 data (see Fig. 5(b)).
• The improvement in performance after incremental adaptation using self-aware SGD (Fig. 4(d), (e) and (f)) under noisy label conditions highlights that it not only preserves the integrity of the model but also allows an effective incremental adaptation to overcome the distribution shifts induced by evolution of the underlying population.
• Paired t-tests are performed to analyse the statistical significance of the performance improvement of self-aware SGD over normal training across all experiments under label noise. The null hypothesis is that there is no statistical difference in scores, and we reject the null hypothesis if p < 0.005. Apart from the COVID datasets, the performance improvement achieved by the proposed method is statistically significant. In the case of the COVID datasets, the significant distribution shifts have rendered the models trained using the 2016 to 2018 datasets ineffective (even in no-noise scenarios). However, after adapting models to the COVID datasets, we again witness a significant improvement in the performance of self-aware SGD.
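As an illustration of the paired t-test mentioned above, the following sketch computes the test statistic and degrees of freedom from two lists of matched scores. The function name `paired_t_statistic` and the input format are assumptions for this example; the paper does not state how its tests were implemented.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched score lists.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the per-pair differences
    and sd is the sample standard deviation.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two paired scores are required")
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

The p-value is then obtained from the Student t distribution with the returned degrees of freedom, for example through `scipy.stats.ttest_rel`, which performs the whole test in one call.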

B. Comparison Against Existing Label-Noise Robust Deep Learning Methods
We compare the performance of self-aware SGD with existing label-noise robust deep learning methods under the simulated label-noise configurations. Fig. 6(a) and (b) depict the results of this comparison at batch probabilities of 0.5 and 0.75, respectively. The analysis of these figures highlights that self-aware SGD significantly outperforms existing methods in most experimental setups. The paired t-tests between scores obtained by self-aware SGD and the comparative methods at a batch probability of 0.75 highlight that the performance improvement by self-aware SGD is statistically significant in almost all scenarios (p < 0.005). At a batch probability of 0.5, the performance of self-aware SGD and DMI loss is statistically comparable across many experimental setups (p > 0.005). In spite of that, self-aware SGD exhibits noticeable improvement over the other comparative methods.
Although the existing methods exhibit better performance than normal training, they fail to preserve the integrity of deployed models and result in performance deterioration. This behaviour is expected as none of these methods are designed for incremental learning. Although these methods can isolate the relevant training signals from the noisy labels with some success, they impose no constraint on preserving the historical performance of the deployed models.

C. Catastrophic Forgetting During Incremental Learning
On analysing the performance of normal training under no label noise in Fig. 4(a), it is clear that catastrophic forgetting is not observed on incremental adaptation of the model using 2017 and 2018 data. However, a relative drop of 9.88% and 15% is observed after incremental training on the COVID-1 and COVID-2 datasets. This drop is due to the distribution shift between earlier datasets (2016-2018) and the COVID-19 datasets. Incremental training on new datasets interferes with previously learned features and results in this performance drop. Self-aware SGD also exhibits catastrophic forgetting. However, it undergoes less adaptation and exhibits less catastrophic forgetting than standard training, as most of the noisy incremental data is rejected. Similar catastrophic behaviour is observed in Fig. 4(b), (c) and (d) after adaptation with the COVID-19 datasets.

D. Self-Aware SGD With Replays for Domain Incremental Learning
The results of this experiment are depicted in Fig. 7. The analysis of this figure highlights that self-aware SGD with replays avoids the catastrophic forgetting witnessed in self-aware SGD. Incremental training with the COVID-19 datasets does not cause any significant drop in performance over the 2016 to 2019 datasets.

VI. CONCLUSION
This paper highlighted that clinical models must evolve with time to match the dynamic nature of diseases and target populations. A traditional static model cannot cope with the distribution shifts seen in the underlying clinical variables as the nature of population health and diseases changes, and hence becomes ineffective over time. This paper further argued that incremental learning may help in developing dynamic clinical AI solutions to tackle evolving populations, while bringing the reliability and regulatory concerns in incremental learning to the foreground. To alleviate these concerns, this paper conceptualised an incremental learning framework, i.e. self-aware SGD, with a sanity check over gradient updates to allow only reliable changes to a deployed model. The experimental results highlight that these sanity checks can indeed allow an incremental learning framework to preserve the integrity of deployed models even in the presence of extreme label noise. Hence, self-aware SGD and similar future algorithms can pave the way for exploiting incremental learning to develop reliable SaMD solutions.
In comparison to standard SGD, self-aware SGD is computationally expensive as it requires a trained bandit model. However, this computational overhead empowers the proposed framework to tackle challenging label-noise conditions. Apart from that, a major limitation of the current version of self-aware SGD is that it has only targeted prediction tasks (binary classification problems). Although prediction tasks form the bulk of automated decision support systems, it would be beneficial to extend the proposed framework to more generic and challenging settings such as multi-class classification and temporal segmentation. Future work will deal with extending this work to these use-cases using advanced reinforcement learning algorithms.