Variation of normal tissue complication probability (NTCP) estimates of radiation-induced hypothyroidism in relation to changes in delineation of the thyroid gland.

Abstract Background. To examine the variations of risk-estimates of radiation-induced hypothyroidism (HT) from our previously developed normal tissue complication probability (NTCP) model in patients with head and neck squamous cell carcinoma (HNSCC) in relation to variability of delineation of the thyroid gland. Patients and methods. In a previous study for development of an NTCP model for HT, the thyroid gland was delineated in 246 treatment plans of patients with HNSCC. Fifty of these plans were randomly chosen for re-delineation for a study of the intra- and inter-observer variability of thyroid volume, Dmean and estimated risk of HT. Bland-Altman plots were used for assessment of the systematic (mean) and random [standard deviation (SD)] variability of the three parameters, and a method for displaying the spatial variation in delineation differences was developed. Results. Intra-observer variability resulted in a mean difference in thyroid volume and Dmean of 0.4 cm3 (SD ± 1.6) and -0.5 Gy (SD ± 1.0), respectively, and 0.3 cm3 (SD ± 1.8) and 0.0 Gy (SD ± 1.3) for inter-observer variability. The corresponding mean differences of NTCP values for radiation-induced HT due to intra- and inter-observer variations were insignificantly small, -0.4% (SD ± 6.0) and -0.7% (SD ± 4.8), respectively, but as the SDs show, for some patients the difference in estimated NTCP was large. Conclusion. For the entire study population, the variation in predicted risk of radiation-induced HT in head and neck cancer was small and our NTCP model was robust against observer variations in delineation of the thyroid gland. However, for the individual patient, there may be large differences in estimated risk which calls for precise delineation of the thyroid gland to obtain correct dose and NTCP estimates for optimized treatment planning in the individual patient.

Over the last decades, radiotherapy (RT) has developed considerably, both in the precision of treatment planning [including computed tomography (CT), magnetic resonance imaging (MRI) and positron emission tomography (PET)-CT simulation] and in delivery of the planned dose to the target area through image-guided therapy. This has made it possible to optimize target dose coverage and at the same time reduce the radiation dose to organs at risk (OARs) [1]. However, the precision of most of these new technologies relies on the delineation of the target volumes as well as the OARs. Variation in delineation is a limiting factor for the precision of treatment planning and consequently for prediction of the risk of late effects in the OARs.
The tolerance estimates of OARs are based on normal tissue complication probability (NTCP) models, and variation in OAR delineation might affect the estimated risk of developing a specific late effect. As pointed out by the Quantitative Analyses of Normal Tissue Effects in the Clinic (QUANTEC) group [2], lack of consistency in contouring OARs among investigators is one of the challenges in developing NTCP models.
Radiation-induced hypothyroidism (HT) is a well-known normal tissue late effect of RT in the neck area due to radiation of the thyroid gland [3,4]. In a previous study [3], an NTCP model for biochemical HT after definitive RT in patients with head and neck squamous cell carcinoma (HNSCC) was generated. The model is a mixture model, taking thyroid volume, thyroid mean dose (Dmean) and latency time of the normal-tissue reaction into account. Biochemical HT was defined as any elevated serum level of thyrotropin (TSH). The aim of this study was to evaluate the variability of estimated NTCP values from our mixture NTCP model after assessing the intra- and inter-observer variability in delineation of the thyroid gland.

Material and methods
In the previous NTCP model study, the thyroid gland was delineated in the RT plans of 246 patients with HNSCC, 203 of whom were finally included in the NTCP study. All delineations were made by oncologist A. The patients were treated at the Department of Oncology, Odense University Hospital, Denmark, during 2002-2010. All patients were treated according to the Danish Head and Neck Cancer Group (DAHANCA) guidelines [5] with definitive external beam RT. For further characteristics of patients included in the previous study, we refer to our publication [3]. The treatment planning CT scans were performed on a Siemens (Siemens Healthcare, Germany) or Philips CT scanner (Philips Healthcare, The Netherlands) with a 3-mm slice thickness (voxels 1 × 1 × 3 mm3). Treatment planning was carried out in Pinnacle3 (Philips Healthcare, The Netherlands), using the collapsed cone algorithm for dose calculation.
The thyroid glands were delineated without guidelines but according to departmental clinical practice. No dose constraints were applied to the thyroid gland during dose planning; however, according to current DAHANCA guidelines, doses to the larynx should be kept below 50 Gy for 2/3 of the organ as a secondary priority compared to PTVs.
Delineations made for the previous study are referred to as A1. Fifty of the 246 RT plans were randomly chosen for blinded re-delineation of the thyroid gland by A (A2, intra-observer variability) and by an experienced oncologist, B (inter-observer variability). The first 50 patients delineated in the initial study were excluded in order to avoid the potential problem of a learning curve during the initial delineations. Patients with abnormal conditions or deformations in the neck area were excluded from analysis. Altogether, four patients were excluded due to tracheostomy (n = 2), surgical implants in the cervical spine (n = 1), and severe deformation of the neck due to a previous accident (n = 1), leaving 46 treatment plans for evaluation. Of these 46 patients, 24 received an intravenous iodine-containing contrast injection, ioversol (Optiray), during the CT scan for treatment planning. Patient characteristics are shown in Table I.
The thyroid gland volume and Dmean from all three delineation sets were extracted from the dose-planning system. Intra-observer (A1-A2) and inter-observer (A1-B) variations in volume and Dmean were evaluated using Bland-Altman (BA) plots [6]. For all delineation pairs, the BA plots show the paired differences as a function of the averaged paired values; furthermore, the mean of all differences and the associated 95% limits of agreement (95% confidence limits, ± 1.96 × SD) are illustrated by horizontal lines. A Wilcoxon signed-rank test was used to test for systematic differences between the delineations. The random variation was calculated as the SD of the paired differences. Spearman's rank correlation was used to test for correlations between the paired differences and mean thyroid volume, Dmean and NTCP, respectively. The correlation between the (random) variation and mean thyroid volume was tested by dividing the paired differences into five equally sized groups, depending on mean thyroid size, and testing the SDs of these groups with Spearman's rank correlation.
For each patient, the overlap between the intra- and inter-observer delineations was calculated and expressed as the Sørensen-Dice similarity index (DSI). For two volumes (A and B), DSI is defined as:
$$\mathrm{DSI} = \frac{2\,|A \cap B|}{|A| + |B|}$$
For additional assessment of a potential spatial variation of differences between delineations, the spatially dependent minimum distance between the two contours of each pair was calculated over the entire surface of the gland. Because of its anatomy, the thyroid gland was divided into the left and right lobes for the mapping. This was done manually from a clinical perception of the anatomy, with the separation placed through the middle of the isthmus. The mapping was done separately for each lobe, and for both the intra- and inter-observer delineations, using the center of mass as origin and 100 measurement points in both the latitudinal and the longitudinal direction (10,000 surface points in total). The distance was measured, including the sign indicating whether the first contour was inside or outside the second contour.
Based on these spatially resolved values, population-averaged values (systematic deviations) as well as SDs (random deviations) could be calculated and visualized as a color surface plot of the "mean thyroid gland", using color to illustrate the local deviation. Both thyroid volume and Dmean from the different sets of delineations were entered into our previously developed multivariate NTCP model. The estimated NTCP risks for each patient were calculated and evaluated using BA plots.
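As an illustration of the overlap measure used in this section, the Sørensen-Dice similarity index of two delineated volumes, 2|A∩B| / (|A| + |B|), can be sketched in a few lines. This is a minimal sketch and not the code used in the study: it assumes each delineation has already been converted to a set of voxel index tuples, and the function name and toy volumes are hypothetical.

```python
def dice_similarity(a: set, b: set) -> float:
    """Sørensen-Dice index 2|A∩B| / (|A| + |B|) for two voxel sets."""
    if not a and not b:
        # Convention chosen for this sketch: two empty volumes agree perfectly.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical toy volumes: voxels are (x, y, z) index tuples in one slice.
first = {(x, y, 0) for x in range(4) for y in range(4)}      # 16 voxels
second = {(x, y, 0) for x in range(1, 5) for y in range(4)}  # 16 voxels, shifted one column

dsi = dice_similarity(first, second)  # 12 shared voxels -> 2 * 12 / 32
print(f"DSI = {dsi:.2f}")
```

A value of 1 means identical volumes and 0 means no overlap, so the index summarizes agreement in a single number but says nothing about where on the surface the two delineations differ.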
Statistics were done using STATA 13 and MATLAB version 2012b for the NTCP modeling. All tests were two-sided and a p-value < 0.05 was considered statistically significant.
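The Bland-Altman quantities used throughout (paired differences, paired averages, the mean difference as systematic variation, the SD of the differences as random variation, and the 95% limits of agreement) can be sketched as follows. This is a minimal sketch using only the Python standard library; it is not the STATA or MATLAB code used in the study, the function name is hypothetical, and the volumes are invented for illustration. The Wilcoxon signed-rank test is not reproduced here.

```python
from statistics import mean, stdev


def bland_altman(first, second):
    """Return Bland-Altman summary values for two paired measurement series."""
    differences = [a - b for a, b in zip(first, second)]
    averages = [(a + b) / 2 for a, b in zip(first, second)]
    bias = mean(differences)   # systematic variation (mean difference)
    sd = stdev(differences)    # random variation (sample SD of differences)
    return {
        "averages": averages,          # x-axis of the plot
        "differences": differences,    # y-axis of the plot
        "bias": bias,
        "sd": sd,
        "lower": bias - 1.96 * sd,     # lower 95% limit of agreement
        "upper": bias + 1.96 * sd,     # upper 95% limit of agreement
    }


# Hypothetical thyroid volumes (cm3) from two delineations of five plans.
a1 = [12.0, 15.5, 9.8, 20.1, 17.3]
a2 = [11.5, 15.0, 10.4, 19.0, 17.8]

result = bland_altman(a1, a2)
print(f"bias = {result['bias']:.2f} cm3, SD = {result['sd']:.2f} cm3")
print(f"limits of agreement: {result['lower']:.2f} to {result['upper']:.2f} cm3")
```

Plotting `differences` against `averages` with horizontal lines at `bias`, `lower` and `upper` gives the plot layout described above.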

Results
Median thyroid volumes, Dmean and estimated NTCP of the three sets of delineations are shown in Table II. Figure 1A and B show the Bland-Altman plots for variations in thyroid volume. No significant difference was found in average variability in delineated volumes. Assessed by Spearman's rank correlation, no significant correlation was found between the differences and averaged values of volume (no detectable slope of the average line in the BA plots). For random variation (deviation from the average line in the BA plots), Spearman's rank correlation did not show a significant intra-observer relation between the variation and the size of the gland (p = 0.188), but did show a significant inter-observer relation (p = 0.005). When assessing the intra- and inter-observer BA plots for the subgroups receiving contrast or no contrast, no clear difference in mean or random variability between the subgroups was found (Supplementary Figure 1 and Supplementary Table I, to be found online at http://informahealthcare.com/doi/abs/10.3109/0284186X.2014.1001034).
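The test of whether random variation grows with gland size (paired differences sorted by mean thyroid volume, split into five groups, and the group SDs correlated with group size by Spearman's rank correlation) can be sketched as below. This is a sketch under stated assumptions, not the authors' implementation: the function names and data are hypothetical, the last group absorbs any remainder when the number of patients is not divisible by five, and only the correlation coefficient is computed, not its p-value.

```python
from statistics import mean, stdev


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den  # undefined (division by zero) if either series is constant


def grouped_sd_trend(mean_volumes, differences, n_groups=5):
    """Correlate per-group SD of differences with per-group mean volume."""
    pairs = sorted(zip(mean_volumes, differences))
    size = len(pairs) // n_groups
    group_means, group_sds = [], []
    for g in range(n_groups):
        stop = (g + 1) * size if g < n_groups - 1 else len(pairs)
        chunk = pairs[g * size:stop]
        group_means.append(mean([v for v, _ in chunk]))
        group_sds.append(stdev([d for _, d in chunk]))
    return spearman_rho(group_means, group_sds)


# Hypothetical data: the spread of the differences widens with volume.
volumes = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
diffs = [-0.1, 0.1, -0.2, 0.2, -0.3, 0.3, -0.4, 0.4, -0.5, 0.5]
rho = grouped_sd_trend(volumes, diffs)
print(f"Spearman rho between group size and group SD: {rho:.2f}")
```

With only five groups the coefficient can take few distinct values, so a significance test on it has limited power; the sketch is meant to show the grouping step rather than to reproduce the reported p-values.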
Mean DSI was 0.88 (SD ± 0.03) for the intra-observer delineations and 0.85 (SD ± 0.04) for inter-observer delineations. Systematic spatial variations of delineations were assessed from the color surface plots of the lobes of the "mean thyroid gland". The plots displayed variability from the population mean values of  0.02 cm for the majority of the surfaces on both the intra- and inter-observer plots (Figure 2A). The most pronounced variability from the population mean values was found around the middle of the gland, especially on the anterior and lateral sides of the lobes and the medial parts of the gland (above the isthmus), with mean differences in delineations of  0.04-0.08 cm. Assessing the local random variation, the surface plots showed that the SD of the delineations was around 0.06-0.1 cm (intra-observer) and 0.06-0.12 cm (inter-observer) for the major areas of the lobes (Figure 2B). The highest random variations in delineations were in the caudal and medial parts of the lobes (including the isthmus), as well as in small areas cranially, with an SD around 0.12-0.16 cm (intra-observer) and 0.16-0.22 cm (inter-observer).
Figure 3A and B show the variations in Dmean derived from the different thyroid gland delineations. As demonstrated in Figure 3A and Table III, there is a small systematic difference in intra-observer Dmean, the mean difference being 0.5 Gy lower in A1 compared with A2 (p = 0.002). However, intra-observer random variation was slightly smaller than the inter-observer, SD ± 1.0 and ± 1.3 Gy, respectively. Assessed by Spearman's rank correlation, no correlation was found between the differences and averaged values of Dmean.
The intra- and inter-observer variations in predicted risk for development of HT were assessed by estimating individual NTCP values based on thyroid volumes and Dmean from the three delineation sets. In Figure 4A and B, the BA plots demonstrate the intra- and inter-observer variations, respectively. We found that intra- and inter-observer systematic variations in the estimated risk of developing HT were small, below 1.0%, and that random variations were equal to or less than 6.0%. Assessed by Spearman's rank correlation, no correlation was found between the differences and averaged values of the risk of developing HT.

Discussion
Several studies of observer variability in delineation of OARs in HNSCC [7-9] have demonstrated the importance of precise delineation of organ structures in dose planning. One study, analyzing the impact of different guidelines in estimating NTCP values for different late effects for swallowing organ structures [10], showed that differences in organ delineation may have a major impact on NTCP estimates. We studied the impact of intra- and inter-observer variability in delineation of the thyroid gland on estimation of the risk of developing radiation-induced HT using our previously published NTCP model.
When predicting the risk of developing HT by estimating individual NTCP values based on the thyroid volumes and Dmean from the three delineation sets, only small and insignificant mean differences in NTCP values were found, i.e. -0.4% within observer and -0.7% between observers. This shows that the applied NTCP model is robust towards small systematic differences in volume and Dmean caused by delineation inaccuracy and that NTCP estimates are consistent. Thus, the variability in delineation of the thyroid gland had a modest effect on the predicted mean NTCP for HT after RT. This implies that our previously published tolerance levels for the thyroid gland, for dose planning, are in general consistent [3].
In this single center study, both intra- and inter-observer variability for thyroid volume delineation on CT scans were within the same range without using specific guidelines for delineation. Consistent thyroid volumes were observed in three sets of delineations of the thyroid gland by two observers. When visually assessing thyroid volume on the BA plots, we could not discern a particular structure or any systematic differences, nor were there significant correlations between the mean differences and volume (Figure 1A and B). For the inter-observer delineations, the random variation was positively correlated with thyroid volume, and this might be expected when the absolute volume of the gland increases. With one exception, all thyroid glands in our study were in or close to the normal size range of 10-28 ml [11].
Three patients were identified as outliers from the BA plots due to a difference in delineated volume above 3.0 cm3. All three were found in the intra-observer set, two of whom were also outliers in the inter-observer set. When assessing the CT scans of these patients, we found that in one of the patients the scanning image was blurred and the thyroid difficult to visualize, the second had an irregular goiter, and in the third patient there was no obvious explanation for the delineation differences. For the first two patients, the differences might be explained by lack of intravenous contrast during the CT scan, although we did not find any difference in delineation variation between the subgroups receiving contrast or not when assessing the BA plots of the subgroups.
The mean DSI of 0.88 and 0.85 in our study is relatively high compared to other structures in the head and neck region that have been analyzed for observer variability [7]. Although we cannot rule out an impact of local delineation traditions, the consistency in delineation of the thyroid gland may be explained by the fact that the thyroid gland is relatively well defined on CT scans due to its well vascularized tissue [12] and high content of iodine [13,14]. This is also in accordance with Nygaard et al. [15], who found good reproducibility of thyroid volume on CT scans when assessing intra- and inter-observer variability in moderate-sized goiters.
As stated, DSI showed good agreement in delineated volume. However, this study extended the assessment of spatial variation by three-dimensional (3D) mapping of the thyroid gland, assessing the size of the local variation and identifying problematic areas that might need particular attention in delineation. The spatial variations in delineations, illustrated in Figure 2B, demonstrate that the random variation is greatest in the area around the isthmus (medial part) and in the caudal and cranial parts of the thyroid gland. This is in agreement with Brouwer et al. [8] in their study of 3D variations of five other head and neck OARs. When we assessed the size of the variation around the gland (Figure 2B), we found that the areas with the largest variation (in red) had a range of around 0.12-0.2 cm, reflecting a difference in delineation corresponding to one slice of the CT scan cranially or caudally (slice thickness 3 mm). In our opinion, this random variation is difficult to avoid and is not likely to be minimized with delineation guidelines. Variability might be reduced with a smaller slice thickness of the planning CT. Furthermore, delineations in multiple orientations and not only on the transverse CT slices might reduce variability, as also proposed by Brouwer et al. [8]. Automatic delineation has been proposed for reducing time in treatment planning and potentially reducing variability in delineation of other head and neck OARs [16,17]. However, both consistency and precision of the automatic delineations of the thyroid gland remain to be determined [18,19].
We found a small systematic, and statistically significant, difference in intra-observer Dmean, with a mean difference 0.5 Gy lower in A1 compared with A2. The clinical significance of this small difference is questionable. The variability in Dmean is related to the variability in delineated thyroid volumes, since Dmean is calculated from the delineated thyroid volumes. However, the relationship between variations in volume and Dmean is not straightforward, since the radiation dose to the thyroid gland is highly variable for head and neck patients and will depend on primary site, target area, total dose and exact dose distribution. Therefore, delineation variations in areas receiving considerably higher or lower doses than the mean thyroid dose will contribute to variability in Dmean. However, delineation uncertainties in regions in which the dose is close to the mean dose will not change Dmean significantly, as also stated by Lorenzen et al. [20]. Because of this, an outlier with respect to delineated volume may not be an outlier with respect to Dmean. Nelms et al. [7] studied the effect of inter-clinician (n = 32) variation in contouring in the head and neck by evaluating six OARs in one CT data set from a patient with oropharyngeal cancer. They found that variation was organ-specific and could be high, both for volumes and estimated doses. Feng et al. [21] studied the effect of observer variability on plan optimization. In their study regarding a set of OARs in patients with pharyngeal cancer, they found that substantial intersession variability in volume was apparent, but that dose differences were small, in accordance with our findings. However, the thyroid gland was not delineated in either of the two studies [7,21].
Figure 3. Bland-Altman plots demonstrating intra-observer (A) and inter-observer (B) variation in calculated thyroid Dmean.
We have addressed the methodological limitation of the low number of observers by analyzing the impact on NTCP estimates caused by extreme systematic variations. This was done by calculating the standard error of the observed differences in the studied patient group and then applying the extreme value of the 95% confidence interval of the observed mean value. The percentage difference was then obtained by dividing the extreme value of the confidence interval by the median observed value. This resulted in a systematic difference in delineated volume of ± 5% (both intra- and inter-observer) and a systematic difference in Dmean of ± 2% (intra-observer) and ± 1% (inter-observer), which in the most extreme combination (-5% volume and +2% Dmean) led to an estimated mean difference in NTCP of 3.4% with an SD of 3%. This indicates that the estimated NTCP, compared to the intrinsic uncertainty of the model, is almost unchanged for the whole population; however, there might be large changes for the individual patient.
It must be emphasized that the current study did observe large differences in estimated risk of HT in some individuals (up to 26%), leading to an SD of the NTCP values of up to ± 6%. A small thyroid volume and a high thyroid Dmean are highly significant and independent risk factors for development of radiation-induced HT [3,4]. Consequently, the variability in estimated NTCP relies both on the differences in delineated volume and Dmean and on the patients' actual thyroid volume and Dmean. Furthermore, the variability is dependent on the patients' risk of developing HT, i.e. on the steepness of the dose-response curve, which gives a larger ΔNTCP in the middle of the dose-response curve than at either end for the same difference in volume or Dmean. This can be seen in Figure 4A and B, where the random variation is higher in the areas with a 20-80% risk of developing HT than in the areas with lower or higher risk. When examining the three outliers according to thyroid volume mentioned earlier, one of these had a difference in NTCP of 26% (due to a small thyroid volume and an NTCP in the steep part of the dose-response curve), whilst the other two outliers had a  1% difference in NTCP. Thus, despite the small variation in mean differences of NTCP values in the study population, variation in organ delineation does have an impact on NTCP estimates, which still calls for a precise delineation of the thyroid gland to obtain correct dose and NTCP estimates for optimized treatment planning in the individual.
For the entire study population, the variation in predicted risk of radiation-induced HT in head and neck cancer was small and our NTCP model was robust towards observer variations in delineation of the thyroid gland. Therefore, our recommendations for tolerance dose levels for the thyroid gland to radiation treatment still hold true.
However, for the individual patient there may be pronounced differences in the estimated risk due to variation in volume and organ delineation, so it is of utmost importance to precisely define OARs for the individual patient in order to obtain correct dose and NTCP estimates for optimized treatment planning.
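The point made in the Discussion, that the same delineation-induced change in Dmean shifts the estimated risk most in the steep middle part of the dose-response curve, can be illustrated with a toy calculation. The sketch below is not the mixture NTCP model used in this work: it assumes a generic logistic dose-response curve in Dmean only, and the function names, the 45 Gy midpoint and the slope are hypothetical values chosen purely for illustration.

```python
import math


def toy_ntcp(dmean_gy, d50_gy=45.0, slope_per_gy=0.15):
    """Generic logistic dose-response curve (hypothetical parameters)."""
    return 1.0 / (1.0 + math.exp(-slope_per_gy * (dmean_gy - d50_gy)))


def delta_ntcp(dmean_gy, shift_gy=1.0):
    """Change in toy NTCP when Dmean shifts by shift_gy."""
    return toy_ntcp(dmean_gy + shift_gy) - toy_ntcp(dmean_gy)


# The same 1 Gy difference in Dmean at the low end, middle and high end of the curve.
low, mid, high = (delta_ntcp(d) for d in (15.0, 45.0, 75.0))
for label, value in (("low", low), ("middle", mid), ("high", high)):
    print(f"{label:>6} dose region: change in toy NTCP = {100 * value:.2f} percentage points")
```

In this toy curve the shift changes the estimated risk by several percentage points near the midpoint and by a fraction of a percentage point at either end, which mirrors the qualitative pattern described for Figure 4A and B.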