Multiple Myeloma Prognosis From PET Images: Deep Survival Losses and Contrastive Pretraining

Objective: Diagnosis and follow-up of multiple myeloma (MM) patients involve analyzing full-body positron emission tomography (PET) images. Toward assisting this analysis, there has been increased interest in machine learning methods linking PET radiomics with survival analysis. Despite deep learning's success in other fields, its adaptation to survival analysis faces several challenges. Our goal is to design a deep learning approach to predict the progression-free survival (PFS) of MM patients from PET lesion images. Methods: We study three aspects of such a deep learning approach: 1) Loss Function: we review existing losses and propose new ones for survival analysis based on contrastive triplet learning; 2) Pretraining: we conceive two pretraining strategies to cope with the relatively small datasets, based on patch classification and triplet loss embedding; and 3) Architecture: we study the contribution of spatial and channel attention modules. Results: Our approach is validated on data from two prospective clinical studies, improving the c-index over baseline methods, notably thanks to the channel attention module and the introduced pretraining methods. Conclusion and Significance: We propose for the first time an end-to-end deep learning approach, M2P2, to predict the PFS of MM patients from PET lesion images. We introduce two contrastive learning approaches, never used before for survival analysis.


I. INTRODUCTION
Multiple myeloma (MM) is a bone marrow cancer characterized by the clonal proliferation of plasma cells, with a high rate of relapse and a 5-year survival rate around 50% [1]. It accounts for 10% of all hematological malignancies, and the International Agency for Research on Cancer estimated a mortality rate of 1.39 per 100 000 and a global age-standardized incidence of 2.1 per 100 000 [2]. [18F]fluoro-2-deoxy-2-D-glucose positron emission tomography/computed tomography (FDG-PET/CT) is now an imaging technique recommended by the International Myeloma Working Group (IMWG) imaging guidelines for MM at diagnosis, for therapeutic assessment, and for minimal residual disease (MRD) detection. However, these guidelines are based on the visual assessment of PET/CT images. FDG-PET/CT allows extracting imaging biomarkers that can serve as a reliable tool for prognosis [3], with the aim of not delaying the initiation of treatment of high-risk patients and of avoiding the establishment of harmful bone lesions or renal impairment [4].
Therefore, in this article, we design a quantitative deep learning (DL) approach to predict patient risk from positron emission tomography (PET) images, as PET images at diagnosis have been shown to bring useful information, complementary to clinical biomarkers [5]. Our approach falls within the class of survival analysis methods, which aim at quantitatively linking patient data to disease progression over time. Survival models enable identifying biomarkers useful for splitting the patient population into high- and low-risk subpopulations. Early identification of the patient profiles most at risk would allow adapting their treatment accordingly [6].
Traditional survival methods rely on Cox proportional hazards regression and Kaplan-Meier curves [7]. However, the predictive power of such methods is limited [8]. Recently, random survival forests (RSFs) [9] have become the reference among learning approaches. Although DL has improved the state-of-the-art in many classical medical image analysis tasks, such as classification and segmentation, few approaches [10], [11], [12], [13] have succeeded in adapting DL to survival analysis, despite its importance in medical research.
There are several challenges in adapting convolutional neural networks (CNNs) to learn discriminative features for survival. First, survival data often suffer from censorship (i.e., patients for whom the event of interest has not been observed during follow-up), which requires a careful formulation of the loss function used for training the network. Second, prospective datasets, which control patient treatment and follow-up, allow drawing more meaningful medical conclusions but are often limited in the number of patients. Finally, PET images have a low resolution relative to the heterogeneity of the MM tumor, while small lesions make transfer learning from existing CNNs pretrained on natural images difficult. Both issues require specific architectural adaptations.
In this article, we propose a DL method for survival analysis, called M2P2 (MM Prognosis from PET images), addressing the challenges above. The approach predicts the prognosis of MM patients from the PET image of lesions at diagnosis. More specifically, the network predicts the progression-free survival (PFS) either in the form of a risk or a time to event.
To this end, we review existing survival losses that consider censorship, including the Cox function [10], a discrete survival loss [13], and conventional losses adapted to censorship, such as the combination of the mean squared error (MSE) loss and a ranking term [12]. Motivated by the success of contrastive learning in other domains [14], [15], we introduce two contrastive losses adapted to survival analysis. Unlike previous functions, our contrastive losses directly exploit the relationships between triplets of data points, bringing together the embedding representations of patients whose outcome (risk or time) is similar while pushing away dissimilar ones. The proposed losses can be adapted to handle censorship through the choice of triplets.
To deal with the next challenge, i.e., the relatively small dataset size of prospective studies, we propose to pretrain the network with pretext tasks, as it has been shown in other medical applications that pretraining with auxiliary losses (instead of training from scratch) improves the results in small-dataset regimes [16]. The two proposed pretext tasks are: 1) the binary (lesion/no lesion) classification of patches, or 2) the contrastive learning of an embedding space reflecting survival with a triplet loss [14]. Finally, to handle the lesions' small and variable size (see Appendix G), we designed a dedicated architecture considering spatial and channel attention from the convolutional block attention module (CBAM) method [17]. Indeed, attention has been successfully used in a large variety of medical applications, including reconstruction, segmentation, detection, or classification, but hardly ever for survival tasks [18], [19]. In the latter, only spatial attention was considered, whereas introducing a channel attention block was decisive in our study.
The contributions of this paper are the following: 1) In terms of the medical application, this work investigates for the first time DL approaches in the context of the survival analysis of MM patients from PET images.
To support the significance of our results, we rely on data from two prospective clinical studies collected over seven years. 2) We present a survey of loss functions suitable for survival analysis with censorship. The survey includes two contrastive learning approaches, which we adapt to censored data and survival analysis. An experimental study shows the feasibility of these new losses and points to the potential benefits of their use as regularizers or for representation learning. 3) To deal with the limited size of prospective datasets, we propose two effective pretraining strategies that avoid the need for further annotations while improving convergence and performance. One of them is based on the introduced contrastive loss with triplet ranking. 4) Finally, we show the benefit of including a channel attention module to improve performance and identify predictive CNN filters for survival.

II. RELATED WORK

Survival Analysis (From Statistical to DL Models):
In the medical literature, Kaplan-Meier (KM) curves are a standard method to estimate or compare the survival of different groups [20]. KM has the advantage of being simple to use and interpret, and does not require any hypotheses on the survival distributions. However, KM fails to evaluate the survival of individual patients according to their variables or to evaluate the impact of covariates on survival. Covariate impact analysis has mainly been done through Cox regression [21]. The linear Cox proportional hazards model assumes that each variable independently affects a baseline hazard $h_0(t)$ (a measure of risk at time $t$), multiplying it by a constant factor independent of time. The risk for patient $i$ is then modeled as $h(x_i, t) = h_0(t)\, e^{\beta^\top x_i}$, where $x_i \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^m$ are, respectively, the vector of input variables and their associated coefficients. Zhou and McArdle [8] show that the statistical power of Cox regression is affected by high censoring. In contrast, RSFs [9] handle censorship well, are not constrained by the proportionality assumption, and can be useful to detect interactions between variables. Hence, RSFs are now the reference method for survival analysis. RSFs were successfully implemented and tested in the context of MM by our team [5], [22].
In recent years, DL methods have outperformed traditional machine learning methods in many areas, including classification and segmentation [23], [24]. Early adaptations of DL to survival tasks extract deep features from images with a pretrained neural network and feed them to prediction models, such as Lasso Cox [25] or RSFs [26]. The survival problem has also been simplified to the classification of risk groups (e.g., low, middle, and high risk) [27], or to the regression of the time-to-event [26]. However, such formulations do not natively handle censored data and therefore require removing censored patients from the dataset. Both RSFs and DL are machine learning methods allowing for personalized treatment recommendations.
Survival Losses: The design of DL for survival analysis involves adapting the loss and the metric to censorship. The reference risk-predicting method by Faraggi and Simon [28] replaced Cox's linear $\beta^\top x_i$ term with a more flexible feedforward network. The loss used for training the network parameters is the negative log likelihood of the updated Cox model over the training population. Katzman et al. [10] revisit Faraggi-Simon's loss (in the so-called DeepSurv approach) in the context of deeper networks. Other variants stick to the Cox model but use the network either to learn representations that replace the input $x_i$ [29], or to directly predict the weights $\beta$ [30]. Recent alternatives include discrete survival models redefining the likelihood to predict a discrete survival time instead of a continuous risk [13], [31], [32], [33], [34], [35]. Time-predicting methods also include continuous formulations [12], where a network learns to predict a continuous time-to-event from regression MSE and patient ranking losses, both adapted to censorship. Inspired by [12], we extend the idea of ranking from pairwise patient comparisons to triplets. Although contrastive learning based on triplets has been shown successful for learning discriminant representations in classification [14] or regression [15] problems, it has, to our knowledge, not been used for survival analysis. Here, we adapt both discrete and continuous triplet-based losses [14], [15] to handle censorship, showing their feasibility as regularization losses for survival analysis, and their effectiveness for learning meaningful representations during a pretraining stage.
Survival Analysis on PET Images: Regarding the data type, Zhu et al. [36] were the first to adapt DeepSurv to convolutional layers (DeepConvSurv) and deal with image data as input. Zhu et al. extend DeepConvSurv in [37] and subsequent papers to handle very large whole-slide histopathological images. We are instead interested in the problems raised by the limited spatial resolution and the tradeoff between resolution and noise. Amyar et al. [27] and Li et al. [11] proposed 3-D CNN models for survival analysis from PET images. Amyar et al. [27] studied response to treatment by simplifying the survival problem to a classification without censorship and using an input layer of relatively large size (100 × 100 × 100). Li et al. [11] modified the DeepConvSurv model with an additional spatial pyramid pooling (SPP) layer [38] in order to deal with small lesions of multiple scales in the context of treatment response in colorectal cancer. We opt here for a custom network with an attention block, trained with censored data.
Network Pretraining: To handle the difficulty of training the large number of parameters of a CNN from few data, we consider pretraining the network with two pretext tasks, including a self-supervised one. Many pretext tasks for self-supervised pretraining with medical images have recently been proposed, such as the Jigsaw puzzle [39], or simply image reconstruction [16]. However, to our knowledge, we are among the first to use self-supervised pretraining with PET images. Contrastive pretraining has been demonstrated to be an effective self-supervised strategy for visual tasks [40], [41]. A preliminary version of the model was presented in [42]. Here, we extend the model, evaluating the interest of each attention block, adding the contrastive learning and different pretraining methods, and extending the experimental validation to additional losses.

III. METHOD
The input to our method is a dataset $\{(x_i, t_i, \delta_i)\}_{i=1}^{N}$ of $N$ patients, each consisting of a data vector $x_i$ associated with a target time-to-event $t_i$ and a binary value $\delta_i$ indicating censorship. The data vectors $x_i$ contain the 3-D image of the lesion with the highest SUVmax (maximum standardized uptake value) for each patient, identified manually on a full-body PET exam by a nuclear physician. The time $t_i$ is given in days. Considering this data, we design a CNN model that predicts the PFS, i.e., the time to (or risk of) the next disease progression, from a single input image. The parameters of the CNN are optimized by minimizing a survival loss function capable of handling censored data.
In the following, we describe the CNN model architecture (Section III-A). Then, in Section III-B, we present a unified description of existing survival losses, including the widely used Cox function and its discrete version, and a continuous regression loss combined with a pairwise ranking, as well as our novel adaptations of the triplet loss. Finally, in Section III-C, we introduce two pretraining strategies to improve the performance: exploiting the information within the lesion segmentation masks, or enhancing the CNN embedding with contrastive triplet loss learning.

A. CNN Model for Survival Analysis
The core of the risk prediction model is an architecture inspired by [11], with 3-D convolutional layers that learn radiomics features and dense layers that focus on the prediction task. We also consider two additional (optional) spatial and channel attention blocks [17], which allow guiding, and retrospectively evaluating, where the model puts the most importance. The model with channel attention is presented in Fig. 1.

1) Radiomics Feature Learning:
This block is a standard CNN transforming the 3-D image of a lesion $x_i$ into $C$ feature maps. It is composed of three convolutional blocks. Each block consists of several convolutional filters with dropout, average pooling, instance normalization, L2 regularization, and Leaky ReLU activations. Rather than using SPP as in the model of Li et al. [11], we manage the different scales of the lesions by resampling the bounding box around the lesion to a fixed-size cube of 36 × 36 × 36 voxels with cubic interpolation.
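To make the resampling step concrete, here is a minimal sketch using SciPy's cubic interpolation; the function name and the assumption that the input is a NumPy volume already cropped to the lesion bounding box are ours, not from the paper.

```python
# A minimal sketch of the fixed-size resampling step, assuming `lesion`
# is a NumPy volume cropped to the lesion bounding box. Names are
# illustrative; only the 36^3 target size comes from the paper.
import numpy as np
from scipy.ndimage import zoom

TARGET = (36, 36, 36)

def resample_to_cube(lesion: np.ndarray, target=TARGET) -> np.ndarray:
    """Resample a cropped lesion volume to a fixed cube with cubic interpolation."""
    factors = [t / s for t, s in zip(target, lesion.shape)]
    return zoom(lesion, factors, order=3)  # order=3 -> cubic interpolation
```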
2) CBAM Attention Model: An attention mechanism learns to weight the extracted features according to their relevance for a given task, usually with a positive effect on the performance. The mechanism is flexible in the sense that the weights adapt to the input image. We consider both spatial and channel attention blocks from the CBAM method after the feature extraction [17]. Channel attention was introduced as the squeeze-and-excitation (SE) block by Hu et al. [43]. SE adaptively adjusts the weight of each feature map in a convolutional block based on a global average pooling (GAP) operation. With the CBAM, Woo et al. [17] extended the SE block by considering an additional global max pooling, which has a complementary effect to the GAP. CBAM also attaches a spatial attention module to refine the feature responses. Spatial attention weights are calculated at each location of the feature maps, while channel attention operates over the filter responses. In practice, channel attention is computed by squeezing spatial information from the convolutional layers with the max and average pooling operations. The pooled values are passed through a shared multilayer perceptron (MLP) to predict the attention weights. The channel and spatial attention blocks are applied sequentially, since the objective is to guide the feature maps to have joint spatially localized responses from few filters. The output of channel attention is a weight vector with as many entries as feature maps (here 64); each entry weighs one full feature map, modulating its importance. The spatial attention matches the dimension of the feature maps and is applied through an elementwise multiplication with every feature map. A diagram of the attention is presented in Fig. 1. Following our experimental validation, the channel attention block is a key step, while the spatial attention block remains optional.
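The channel-attention computation described above can be sketched in a few lines of NumPy; the weight matrices W1 and W2 stand in for the shared MLP and are illustrative placeholders.

```python
# A minimal NumPy sketch of the CBAM channel-attention forward pass
# (global average + max pooling, shared MLP, sigmoid gating).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feats, W1, W2):
    """feats: (C, D, H, W) feature maps; W1: (C, C//r), W2: (C//r, C)."""
    avg = feats.mean(axis=(1, 2, 3))               # squeeze spatial dims (GAP), shape (C,)
    mx = feats.max(axis=(1, 2, 3))                 # global max pooling, shape (C,)
    shared = lambda v: np.maximum(v @ W1, 0) @ W2  # shared MLP with ReLU bottleneck
    weights = sigmoid(shared(avg) + shared(mx))    # one weight per feature map
    return feats * weights[:, None, None, None]    # rescale each channel
```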
3) Survival Prediction Output: The prediction layer consists of a fully connected layer and an output layer whose structure changes depending on the survival loss (described in Section III-B). The output is a single neuron predicting either the risk $h(x_i, \theta)$, in the case of the Cox loss and the ranking and Cox loss, or the time to event $\hat{t}_i$, in the case of ranking and MSE (RankDeepSurv). The output layer instead has $D$ neurons modeling the discrete time of the event, in the case of the discrete loss and the ranking and discrete loss.

B. Survival Losses

1) Cox Loss:
A common approach to train learning models for survival analysis tasks has been to derive a loss function from the Cox regression model. Cox's partial likelihood models the joint probability of the events in the dataset, assuming they are statistically independent. The probability of each event is computed as the ratio between the risk of patient $i$ and the cumulative risk of all individuals still at risk at time $t_i$:

$$L(\beta) = \prod_{i : \delta_i = 1} \frac{e^{\beta^\top x_i}}{\sum_{j \in R(t_i)} e^{\beta^\top x_j}} \qquad (1)$$

where $R(t_i)$ is the set of individuals still at risk at time $t_i$, and the product is done over the defined (i.e., uncensored) time events. Following [28], the linear model $\beta^\top x$ is replaced by the output of a neural network $h_\theta(x)$ parameterized by weights $\theta$, while the rest of the partial likelihood is kept. Computing the negative log likelihood from (1) leads to the following loss to optimize the parameters $\theta$:

$$l_{\text{Cox}}(\theta) = -\sum_{i : \delta_i = 1} \left( h_\theta(x_i) - \log \sum_{j \in R(t_i)} e^{h_\theta(x_j)} \right) \qquad (2)$$

with $l$ the loss, $h_\theta$ the risk function, and $x_i$ the data vector describing patient $i$. This loss function pushes the network to predict risks that explain the order of events in the dataset. Note that the output of the current patient depends on the output of all patients at risk at the event time. This loss, originally proposed for an MLP, was later used to train deep [10] and convolutional [11] networks.
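A minimal NumPy sketch of this negative log partial likelihood follows, assuming the network outputs are already available as a risk vector; batching and numerical-stability refinements are omitted.

```python
# A minimal NumPy sketch of the Cox loss in (2), assuming `risk` holds the
# network outputs h_theta(x_i). Ties in event times are ignored for simplicity.
import numpy as np

def cox_loss(risk, time, event):
    """risk, time: (N,) arrays; event: (N,) bools (True = uncensored)."""
    loss = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]  # patients still at risk at t_i
        loss -= risk[i] - np.log(np.exp(risk[at_risk]).sum())
    return loss / max(event.sum(), 1)
```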
2) Discrete Time Survival Loss: Gensheimer and Narasimhan [13] choose a more interpretable output, with a survival model, called nnet-survival, that predicts the time-to-event instead of the risk. The event time is defined over $D$ discrete time intervals $T = [\tau_1, \tau_2, \ldots, \tau_D]$. Accordingly, the output layer of the network is set to have $D$ neurons, each predicting the probability $p_{d,\theta}(x_i)$ of surviving interval $d \in T$. This probability is complementary to the risk: $p_{d,\theta}(x_i) = 1 - h_{d,\theta}(x_i)$. The likelihood is defined differently for uncensored and censored patients. For an uncensored individual, the likelihood of experiencing an event in interval $k$ is the risk $h_{k,\theta}(x_i)$ multiplied by the probability of surviving through intervals $1$ to $k-1$, that is, $h_{k,\theta}(x_i) \prod_{d=1}^{k-1} \left(1 - h_{d,\theta}(x_i)\right)$. Thus, the log likelihood for an uncensored individual is

$$\log h_{k,\theta}(x_i) + \sum_{d=1}^{k-1} \log\left(1 - h_{d,\theta}(x_i)\right).$$

For a censored individual, only the product over the survived times is kept:

$$\sum_{d=1}^{k} \log\left(1 - h_{d,\theta}(x_i)\right).$$

Summing over the $N$ individuals and expressing the loss in terms of the network output $p_{d,\theta}$ leads to the following loss for the discrete survival model:

$$l_{\text{disc}}(\theta) = -\sum_{i=1}^{N} \left[ \delta_i \log\left(1 - p_{k_i,\theta}(x_i)\right) + \sum_{d=1}^{k_i - \delta_i} \log p_{d,\theta}(x_i) \right]$$

where $k_i$ is the interval containing patient $i$'s event or censoring time. In addition to the interpretability of its output, this model is no longer proportional, nor does it depend on a baseline hazard $h_0$. Moreover, this choice makes the likelihood of each individual independent of the other patients, favoring training with small batch sizes, as opposed to the full dataset required by the Cox loss [13]. Finally, since the method provides times rather than risks, and since the time intervals are fixed beforehand, the output value range does not depend on the training value range. For the above reasons, the predicted values for a new individual are easier to interpret.
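The loss can be sketched as follows, assuming the network outputs the per-interval survival probabilities p[i, d]; the index conventions are illustrative.

```python
# A minimal NumPy sketch of the discrete survival loss, assuming p[i, d]
# is the probability of patient i surviving interval d (0-based), and k[i]
# the interval of the event or censoring.
import numpy as np

def discrete_survival_loss(p, k, event, eps=1e-7):
    """p: (N, D) survival probabilities; k: (N,) interval index; event: (N,) bools."""
    loss = 0.0
    for i in range(len(k)):
        # each patient survived intervals before k[i] (uncensored) or through k[i] (censored)
        last = k[i] if event[i] else k[i] + 1
        loss -= np.log(p[i, :last] + eps).sum()
        if event[i]:  # event in interval k[i]: add the log-risk term
            loss -= np.log(1.0 - p[i, k[i]] + eps)
    return loss / len(k)
```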
3) Continuous Regression and Pairwise Ranking: A straightforward way to formulate survival analysis is as the MSE regression of the time-to-event $t_i$ for every uncensored patient $i$. To allow for censored data and further constrain the problem, Jing et al. [12] proposed a network that predicts a survival time $\hat{t}_i$, trained with the RankDeepSurv loss consisting of an extended MSE loss $l_{\text{reg}}$ and a pairwise ranking regularization term $l_{\text{rank}}$, balanced with parameters $\alpha$ and $\lambda$:

$$l_{\text{RankDeepSurv}} = \alpha\, l_{\text{reg}} + \lambda\, l_{\text{rank}}.$$

The extended MSE is

$$l_{\text{reg}} = \frac{1}{|C_{\text{reg}}|} \sum_{i \in C_{\text{reg}}} \left(\hat{t}_i - t_i\right)^2$$

where $C_{\text{reg}}$ defines the set of admissible individuals, including uncensored patients and censored individuals whose predicted time $\hat{t}_i$ does not exceed the censored time $t_i$ (for these, the event is known to occur after $t_i$, so the prediction is provably too early). The ranking loss compares pairwise differences between the real and the predicted event times

$$l_{\text{rank}} = \frac{1}{|C_{\text{rank}}|} \sum_{(i,j) \in C_{\text{rank}}} \left[ (t_j - t_i) - (\hat{t}_j - \hat{t}_i) \right]^2$$

where only admissible pairs $C_{\text{rank}}$ are considered for training. $C_{\text{rank}}$ includes pairs where the smallest time, or both times, are uncensored, and whose ground-truth difference is larger than the predicted one. This loss also predicts time, but in a continuous fashion. One advantage of the ranking loss is that it includes more information about the censored patients by considering explicit pairwise relationships. However, the balancing parameters must be determined with care as they may have a great influence on the result.
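A minimal NumPy sketch of the combined loss follows, under the admissible-set definitions above; the quadratic form of both terms is our reading of [12], and the names are illustrative.

```python
# A minimal NumPy sketch of the RankDeepSurv loss, assuming `pred` holds
# the predicted times t_hat_i.
import numpy as np

def rank_deep_surv_loss(pred, time, event, alpha=1.0, lam=1.0):
    """pred, time: (N,) arrays; event: (N,) bools (True = uncensored)."""
    # l_reg: uncensored patients, plus censored ones whose prediction
    # undershoots the censoring time (the prediction is provably too small)
    reg_mask = event | (pred <= time)
    l_reg = ((pred[reg_mask] - time[reg_mask]) ** 2).mean()

    # l_rank: pairs (i, j) with t_i < t_j, i uncensored, whose predicted
    # gap underestimates the true gap
    l_rank, n_pairs = 0.0, 0
    for i in np.flatnonzero(event):
        for j in np.flatnonzero(time > time[i]):
            true_gap, pred_gap = time[j] - time[i], pred[j] - pred[i]
            if true_gap > pred_gap:
                l_rank += (true_gap - pred_gap) ** 2
                n_pairs += 1
    return alpha * l_reg + lam * l_rank / max(n_pairs, 1)
```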
4) Triplet Ranking Loss: Triplet losses have been shown to be effective for learning similarities in different applications [14], [15]. Here, we propose to adapt two variants to the survival analysis task.

The triplet loss can be seen as a ranking function defined over triplets instead of pairs, where a triplet is composed of an anchor $x^a$, a positive $x^p$, and a negative $x^n$ sample. A sample is positive if it belongs to the same class as the anchor, and negative otherwise. The goal is to build a meaningful feature space by bringing the anchor closer to the positive and away from the negative sample. Consider the feature vectors $f^a_k$, $f^p_k$, and $f^n_k$ to be the outputs of the flattened layer before the fully connected layers for entries $x^a_k$, $x^p_k$, and $x^n_k$. Then, Schroff et al. [14] define the triplet loss as

$$l_{\text{triplet}} = \sum_{k \in \tau} \max\left(0,\; \left\|f^a_k - f^p_k\right\|_2^2 - \left\|f^a_k - f^n_k\right\|_2^2 + \mu\right) \qquad (8)$$

where $\mu$ is a margin parameter enforcing a minimal separation between the negative and the positive samples, and $\tau$ the set of all possible triplets. Different methods for triplet generation (online or offline) and triplet selection (batch all or batch hard) have been proposed [14]. Our experiments are carried out with the online batch hard method, which is more efficient [14] and helps convergence. The triplet loss in (8) was designed for classification problems. Here, we propose adapting it to survival analysis by, first, discretizing the survival times $t_i$ into a fixed number of classes $y_i$ in order to determine positive and negative samples; and, second, defining the rules that verify the validity of triplets according to censorship. Thereby, we define a valid triplet as one whose censorship pattern cannot contradict the class relationships.
Concretely, the sample with the smallest observed class between the anchor and the negative must be uncensored, i.e., $\delta^q_k = 1$ with $q = a$ if $y^a_k < y^n_k$, and $q = n$ otherwise. These rules are applicable only for right censorship. The resultant triplet loss, called tripletSurv, will be used both in combination with a survival loss (2) and for pretraining (see Section III-C).
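For illustration, here is a minimal NumPy sketch of the margin-based triplet loss in (8), assuming valid triplets (per the censorship rules above) have already been selected; the names are ours.

```python
# A minimal NumPy sketch of the margin triplet loss in (8), applied to a
# batch of embeddings f; triplet validity is assumed checked upstream.
import numpy as np

def triplet_loss(f, triplets, mu=1.0):
    """f: (N, E) embeddings; triplets: list of (a, p, n) index tuples."""
    loss = 0.0
    for a, p, n in triplets:
        d_pos = np.sum((f[a] - f[p]) ** 2)    # squared distance anchor-positive
        d_neg = np.sum((f[a] - f[n]) ** 2)    # squared distance anchor-negative
        loss += max(0.0, d_pos - d_neg + mu)  # hinge with margin mu
    return loss / max(len(triplets), 1)
```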
Our second adaptation relies on the scale-varying triplet (SV-triplet) loss proposed by Im et al. [15], who consider continuous (age) differences instead of classes to generate the triplets. The loss also introduces a weighting function that adapts to the scale (age difference) and removes the need for a margin parameter $\mu$. Formally, the SV-triplet loss is

$$l_{\text{SV-triplet}} = \sum_{k \in \tau} w_k\, l_k \qquad (9)$$
with $\tau$ the set of valid triplets, i.e., those verifying $|t^a_k - t^p_k| < |t^a_k - t^n_k|$; $f^a_k$, $f^p_k$, and $f^n_k$ the embedding vectors of, respectively, the anchor, the positive, and the negative samples; and $t^a_k$, $t^p_k$, and $t^n_k$ their true survival times. The weights $w_k$ give priority to triplets where the anchor and positive samples have nearby event times:

$$w_k = \frac{1}{\left|\tilde{t}^a_k - \tilde{t}^p_k\right| + \epsilon}$$

where $\tilde{t}_k = (t_k - t_{\min})/(t_{\max} - t_{\min})$ are the normalized true event times, with $t_{\min}$ and $t_{\max}$ computed over the $N$ patients, and $\epsilon$ a small constant to prevent zero division. Finally, the individual loss for a triplet in (9) is

$$l_k = -\log \frac{e^{-D(f^a_k, f^p_k)}}{e^{-D(f^a_k, f^p_k)} + e^{-D(f^a_k, f^n_k)}}$$

the output of a softmax function computed over distances, with $D$ the Euclidean distance.
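The per-triplet weight and loss above can be sketched compactly; the functional forms below are our reading of [15] and the surrounding text, with illustrative names.

```python
# A minimal NumPy sketch of the SV-triplet terms: a softmax over embedding
# distances and a weight favoring anchor-positive pairs with nearby times.
import numpy as np

def sv_triplet_term(fa, fp, fn, ta, tp, t_min, t_max, eps=1e-6):
    d_pos = np.linalg.norm(fa - fp)  # Euclidean distances
    d_neg = np.linalg.norm(fa - fn)
    # per-triplet loss: softmax over (negated) distances
    l_k = -np.log(np.exp(-d_pos) / (np.exp(-d_pos) + np.exp(-d_neg)))
    # weight from normalized event-time proximity of anchor and positive
    ta_n = (ta - t_min) / (t_max - t_min)
    tp_n = (tp - t_min) / (t_max - t_min)
    w_k = 1.0 / (abs(ta_n - tp_n) + eps)
    return w_k * l_k
```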
To adapt (9) to survival, we modify the selection criterion $|t^a_k - t^p_k| < |t^a_k - t^n_k|$, and thus $\tau$, to consider right censorship: the accepted triplets are only those whose censored times cannot contradict this ordering. In practice, the sum in (9) is computed over a batch $\psi$ ($\psi \subset \tau$). As with the original triplet loss, the resultant SV-tripletSurv loss will be used in combination with a survival loss (2) or for pretraining (see Section III-C).

C. Model Pretraining
A widely used approach to counter the lack of annotated data, avoid overfitting, and speed up convergence is transfer learning, based on pretraining a network on large datasets such as ImageNet [44]. However, to be effective, the data for pretraining should bear at least some similarities with the actual task, e.g., in input size and type of texture. Since there are currently no large publicly available PET imaging datasets, we opt for pretraining on our own data, exploiting the already provided annotations: lesion segmentations and survival data. A first pretraining task is thus a binary lesion/no lesion classification. The second task builds a meaningful embedding space from the survival data thanks to the tripletSurv loss.

1) Binary Lesion Classification:
Despite the relatively small number of patients, our work analyzes annotated ROI images around lesions. Therefore, by decomposing each full-body PET image into 3-D patches, we can pretrain our network on the simpler binary lesion/no lesion classification task, determining a patch label according to the portion of the patch covered by the lesion. To build the training dataset, for each patient we take $N^i_{\text{patch}}$ patches, where the $N^i_{\text{patch}}/2$ positive patches are all patches from patient $i$ whose ratio between lesion voxels and background exceeds a threshold $r^+$. The remaining $N^i_{\text{patch}}/2$ negative patches are sampled randomly from the image. The training and validation sets collect positive and negative patches from different patients. We also ensure a consistent patientwise split between the pretraining and training stages. During pretraining, the output layer is replaced by two neurons with a softmax activation, and training is done with a cross-entropy loss.
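The patch-labeling rule can be sketched as follows, assuming a binary lesion mask as a NumPy array; the patch size and threshold value are illustrative parameters, not those of the study.

```python
# A minimal sketch of the positive/negative patch-labeling rule for the
# binary pretraining task.
import numpy as np

def patch_label(mask, corner, size=36, r_plus=0.1):
    """Label a cubic patch as positive if its lesion-voxel ratio exceeds r_plus."""
    x, y, z = corner
    patch_mask = mask[x:x + size, y:y + size, z:z + size]
    ratio = patch_mask.mean()  # fraction of lesion voxels in the patch
    return 1 if ratio > r_plus else 0
```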
2) Contrastive Pretraining: We propose a second type of pretraining task that focuses on refining the feature extraction. The architecture of Fig. 1 is kept, but only up to the first dense layer. We then train the 4096-dimensional output vector with one of the two survival-adapted triplet losses alone (tripletSurv or SV-tripletSurv). The idea is to learn a meaningful embedding space by enforcing the distances between feature vectors to reflect survival times. Such contrastive learning falls within the category of self-supervised pretraining approaches.

D. Our Final Model M2P2
After experimental validation (see Section V), the model chosen for predicting MM Prognosis from PET images, named hereafter M2P2, is a 3-D CNN with channel attention, with binary and contrastive pretraining followed by a final training stage with a Cox loss.

IV. EXPERIMENTAL SETUP

A. Dataset Description and Data Preprocessing
We evaluate our survival model on a dataset composed of two prospective multicentric MM studies (IMAJEM and EMN02/HO95), with, respectively, 87 and 65 patients considered among the whole dataset [45], [46]. The 3-D baseline PET images and survival times (up to seven years) are available for each patient (censorship rate of 45%). The most intense lesion of each patient was selected, and a global polygon around the lesion was drawn by a nuclear physician with a propagation method. No voxelwise segmentation step was involved, in order to avoid the effects of the segmentation being operator or algorithm dependent. As preprocessing, we draw a cuboidal bounding box tightly around each lesion's polygonal annotation. Then, as lesion size has been reported to be a noninformative feature [5], we resize each cropped bounding box to a fixed-size 36 × 36 × 36 cube before feeding it to the network. This favors focusing on the textural analysis and improves the performance. This size was chosen according to the distribution of the MM lesions' scale in our database. The resizing was needed since the lesions have widely varying sizes (from 3 × 3 × 3 to 32 × 40 × 53 voxels). Finally, we use data augmentation (rotation, translation, zoom, and flip) to obtain 15 or 30 different images per patient (see supplementary material Appendix F). This leads to, respectively, a total of 2280 or 4560 3-D images, split patientwise into four sets for cross-validation. Regarding the conventional methods used for comparison (cf. Section V-A), we compute radiomics features extracted from a tight cuboidal bounding box around each polygon surrounding a lesion (no further resizing), following the IBSI standardization initiative [47].
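A minimal sketch of such an augmentation pipeline with SciPy is given below; the parameter ranges are illustrative, not the exact values used in the study.

```python
# A minimal sketch of the augmentation pipeline (rotation, translation,
# zoom, flip); ranges are illustrative assumptions.
import numpy as np
from scipy.ndimage import rotate, shift, zoom

def augment(volume: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    v = rotate(volume, angle=rng.uniform(-15, 15), axes=(0, 1),
               reshape=False, order=1)                       # small in-plane rotation
    v = shift(v, shift=rng.uniform(-2, 2, size=3), order=1)  # random translation
    if rng.random() < 0.5:
        v = np.flip(v, axis=rng.integers(0, 3))              # random flip
    v = zoom(v, rng.uniform(0.9, 1.1), order=1)              # random zoom
    return resample_to_cube(v)  # re-fix the size to 36^3 (see the earlier sketch)
```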

B. Architecture and Optimization Details
The kernel sizes of the three convolutional blocks are, respectively, 3 × 3 × 3, 5 × 5 × 5, and 3 × 3 × 3, with padding and a stride of 1. Dropout, kernel regularization, and instance normalization were enforced in the convolutional layers to reduce overfitting. The Leaky ReLU negative slope is set to 0.1. A dropout layer is added after the first fully connected layer.
The α and λ of combined losses are chosen by hand so that the two loss terms have comparable scales.

C. Evaluation Metrics
Given the high censoring rate (45%) in our dataset, we stratify the censored patients between the training, validation, and test sets, in order to assess the prediction accuracy in terms of the c-index, which can handle censoring. The c-index measures whether the predicted risks respect the order of events:

$$c = \frac{\sum_{i \neq j} \delta_i\, \mathbf{1}\left[t_i < t_j\right] \mathbf{1}\left[h_\theta(x_i) > h_\theta(x_j)\right]}{\sum_{i \neq j} \delta_i\, \mathbf{1}\left[t_i < t_j\right]}.$$

To evaluate the model when the risk is replaced by a predicted time, we replace the c-index by $1 - $c-index. For all methods, we report the c-index with test-time augmentation (TTA) [48]. The presented mean c-index is first averaged over the TTA and then over the folds. We also evaluate the separation of the population into meaningful groups with Kaplan-Meier curves [7] (see supplementary material Fig. 2).
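For reference, a minimal NumPy sketch of Harrell's c-index under right censoring follows, with ties in predicted risk counted as half-concordant.

```python
# A minimal NumPy sketch of Harrell's c-index with right censoring:
# higher predicted risk should match earlier observed events.
import numpy as np

def c_index(risk, time, event):
    """risk, time: (N,) arrays; event: (N,) bools (True = uncensored)."""
    concordant, comparable = 0.0, 0
    for i in np.flatnonzero(event):               # i must be uncensored
        for j in np.flatnonzero(time > time[i]):  # j outlives i
            comparable += 1
            if risk[i] > risk[j]:
                concordant += 1
            elif risk[i] == risk[j]:
                concordant += 0.5                 # ties count half
    return concordant / max(comparable, 1)
```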

D. Experimental Plan
In the following section, we report the results of five sets of experiments. First, we compare the performance of our model against conventional survival analysis methods and against 2-D and 3-D CNN baselines (Section V-A). Then, we perform an ablation study on the attention components (Section V-B) and the pretraining strategies (Section V-C). For the latter, we consider pretraining either by binary lesion classification or with the proposed contrastive tripletSurv and SV-tripletSurv losses. Next, we compare the state-of-the-art survival losses (Section V-D), including the DeepSurv loss [10], the discrete survival loss [13], the RankDeepSurv loss from [12], and combinations of Cox and discrete survival with the pairwise ranking part of RankDeepSurv (Rank&cox and Rank&discrete). Finally, we also consider the Cox loss combined with the tripletSurv (tripletSurv&cox) and SV-tripletSurv (SV-tripletSurv&cox) losses. From the results of the previous experiments, we devise our M2P2 model and evaluate it for risk group separation (Section V-E). All the experiments were performed using Python 3 and TensorFlow 1.12.

V. RESULTS

A. Comparison to Conventional Methods
We first compare M2P2 against two baseline survival methods, Lasso Cox and a recent RSF approach [5] given radiomics features as input instead of the original lesion images. We also report the results of two baseline 2-D and 3-D CNNs following the architecture in Fig. 1 but without attention or pretraining and trained with a Cox loss. The results are presented in Table I. The 2-D and 3-D CNN consistently improve the c-index compared to the conventional Lasso Cox and RSFs methods, while our M2P2 method outperforms conventional and CNN baselines.

B. Attention Model
We evaluate our model with and without each attention block and report the results in Table II. The experiments in this section were done with the 3-D CNN, binary pretraining, and a Cox loss.
Channel attention alone improves the validation c-index over the model without attention or with both spatial and channel attention. Reversing the order of the spatial and channel attention did not significantly change the results, leading to a validation c-index of 0.606 (±0.012). We believe spatial attention may be redundant with the preprocessing (bounding box + interpolation).
An example of the channel attention output is presented in the supplementary material (Fig. 3).

C. Pretraining Strategies
We compare the pretraining strategies described in Section III-C: binary classification, and tripletSurv or SV-tripletSurv feature embeddings adapted to right censorship. We also combine the methods sequentially. Experiments rely on the 3-D CNN model with channel attention and a Cox loss. Table III shows the best results over the tested freezing strategies (no freezing and freezing one, two, or three layers). All methods work best without freezing, except for SV-tripletSurv, which performs better when freezing the first layer. The results presented in Table III show the interest of pretraining, with each individual strategy increasing the c-index. The best results are obtained with a sequence of binary and right-censored triplet pretraining, indicating that the binary pretraining is a better weight initialization for the feature embedding learned with the tripletSurv loss. The succession of these two pretraining steps increases the c-index values by 7.7%. The SV-tripletSurv pretraining improves the prediction only slightly. A reason could be the reduced number of triplets kept with the SV-tripletSurv loss compared to the tripletSurv loss, calling for larger databases. However, this loss produces the closest scores between training and validation, indicating that it effectively reduces overfitting.

D. Survival Losses Comparison
In this section, we compare the current state-of-the-art losses for survival (Cox loss, discrete survival loss, and RankDeepSurv), different combinations of these losses (pairwise ranking + Cox loss, pairwise ranking + discrete survival loss), as well as the novel adaptations of triplet losses to right censorship (tripletSurv + Cox loss and SV-tripletSurv + Cox loss). Unless stated otherwise, experiments were performed on the 3-D CNN with channel attention and binary pretraining. The results reported in Table IV show similar effectiveness among the different loss choices, demonstrating the feasibility of triplet survival losses. Adding a pairwise or a triplet ranking term to the Cox loss has no influence or leads to a slight performance decrease. Instead, considering the triplet loss as an additional pretraining stage followed by a simple Cox loss improves the results considerably, and it is thus retained as our best model, M2P2.

E. Risk Group Separation
We provide a more clinically oriented evaluation of our M2P2 model by means of survival curves. After risk prediction, we separated the validation set into two (good and bad) prognosis groups. The separation is computed by choosing the best split as determined by the log-rank test. The resultant Kaplan-Meier curves are shown in the supplementary material Appendix C. The mean p-value over the four folds is 3.40E−03 (±6.80E−03), and the p-values are well below 0.05 for all folds, implying that the separation is significant.
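Such a split-and-test evaluation can be sketched with the lifelines package; the median split shown here is illustrative, whereas the study selects the best split by log-rank test.

```python
# A minimal sketch of the group-separation check, assuming the lifelines
# package; the median-risk threshold is an illustrative simplification.
import numpy as np
from lifelines.statistics import logrank_test

def split_p_value(risk, time, event):
    thr = np.median(risk)  # illustrative split point
    lo, hi = risk <= thr, risk > thr
    res = logrank_test(time[lo], time[hi],
                       event_observed_A=event[lo], event_observed_B=event[hi])
    return res.p_value
```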

VI. CONCLUSION
Prior work on the analysis of PET images from haematological diseases has focused on the lesion detection task [49], or has formulated survival analysis as a classification problem, e.g., for T-cell lymphoma [49]. There have also been other initiatives to model the survival of MM patients from histopathological images, which present very different challenges [50]. Different from such prior work, our proposed approach predicts the PFS of MM patients directly from PET images while avoiding reducing the PFS prediction to a classification task. Preliminary results were published in the PRIME workshop [42]. This work deepens the modeling of the loss functions and extends the experimental validation.
Relying on a prospective clinical dataset, we show that CNNs can learn meaningful feature representations, removing the need to rely on conventional two-stage approaches based on generic texture descriptors and prediction methods, such as Lasso Cox or RSFs. Our results suggest that analyzing the image of the most intense lesion with our model and training strategies can help in determining high-risk patients at baseline.
From a technical point of view, we propose a simple 3-D CNN architecture as a backbone and study the influence of attention and pretraining steps as two effective strategies to adapt DL methods to clinical studies datasets. When focusing on lesions, we retain channel attention as an important means to select the most relevant learned filters. However, a key step is a pretraining stage composed of a sequential combination of a binary lesion-no lesion patch classification task and the newly proposed tripletSurv embedding.
We also revise the most common DL survival losses, try new combinations, and propose new adaptations of contrastive losses to consider right censorship. While all the losses are effective for the task, the combination with triplet ranking losses has a larger impact when used as a pretraining strategy instead of a penalization term. We argue the contrastive pretraining improves the learning of meaningful representations before focusing on the prediction for survival.
In the future, a more in-depth exploration of the learned filters, combined with the filter attention matrix, could help determine the most important filters and whether they relate to known radiomics features. Moreover, the bounding boxes around the identified lesions are manually and roughly drawn by experts but could, in the future, be combined with a semi-automatic segmentation algorithm. It is worth noting that the evaluations were done with a fourfold cross-validation, but without a hold-out set due to the limited size of the dataset. Future work includes validating our approach on an external dataset once another dataset with annotated MM PET images becomes available for survival analysis. Finally, a possible lever for improvement could be to replace the attention module with an SE network [43], since we ultimately keep only the channel attention part of the CBAM, or with a more elaborate combination of spatial and channel attention [51]. Our method also requires further research on fusion algorithms allowing the integration of complementary clinical data into our model.