Accuracy of the ‘CUT’ Score for Assessing Malignancy in Bethesda 3 and 4 Thyroid Nodules in a North American Population: A Retrospective Study

Abstract
Background: The CUT score is a thyroid nodule malignancy risk assessment scoring system intended to guide surgeons in treating Bethesda 3 and 4 thyroid nodules. It is based on clinical (C) and ultrasonographic (U) features and a five-tiered cytology classification (T).
Purpose: The CUT score was created mainly to reduce unnecessary surgeries on these challenging thyroid nodules. Our study aimed to assess its utility in predicting thyroid malignancy in a North American population.
Materials and Methods: In a retrospective record review, the CUT score was applied to 219 Bethesda 3 and 4 thyroid nodules. A total of 203 Bethesda 3 and 16 Bethesda 4 nodules from patients treated between January 2015 and December 2019 at a single institution were assessed. A receiver operating characteristic (ROC) curve analysis was performed to evaluate the CUT score as a diagnostic test. Binary logistic regression analysis was performed. The analysis was repeated after stratification by body mass index to assess CUT score accuracy in obese and non-obese patients.
Results: Of 219 nodules analyzed, 148 were characterized as benign and 71 as malignant. Prevalence rates of malignancy were 29.6% (n = 60) and 68.8% (n = 11) in Bethesda 3 and 4 nodules, respectively. The mean CU (clinical, ultrasonography) score was 5.35 ± 1.38 in benign nodules versus 4.96 ± 1.5 in malignant nodules (p = 0.08). The area under the curve (AUC = 0.433) for the association of CUT scores with nodule malignancy was not significant (p = 0.13). The CUT score was not significant as a diagnostic test for nodule malignancy in obese (AUC = 0.45; p = 0.72) or non-obese patients (AUC = 0.39; p = 0.08).
Conclusion: The CUT score did not correlate with preoperative malignancy risk estimates in Bethesda 3 thyroid nodules and, therefore, may have limited utility as a predictor of malignancy in these thyroid nodules.


Introduction
The incidence of thyroid nodules has increased significantly and is reported to be as high as 68% when examined with high-resolution ultrasound (1). This may be due to an increased focus on screening programs and ultrasound use. According to the Bethesda system, Bethesda 3 nodules carry up to a 30% risk of malignancy, and Bethesda 4 nodules carry up to a 40% risk (2,3). Fine needle aspiration biopsy (FNAB) cannot provide a final diagnosis for these thyroid nodules (2,3). Many scoring systems have been developed to guide the surgeon in making the best decisions when treating thyroid nodules (4)(5)(6).
One method of risk assessment, the CUT score, was created in Italy. The CUT score is a meta-analysis-derived scoring system, where C stands for clinical features and U stands for suspicious ultrasonographic features (7). Each clinical and ultrasound element receives a value depending on the effect size computed in the CU meta-analysis. The sum of C + U is then given a score ranging from 0 to 10, linked to the nodule's cytology (T). CUT's creators have released a free smartphone app for calculating the CUT score, which was used in our study. Our study aimed to assess the CUT scoring system's diagnostic accuracy as a preoperative malignancy risk assessment for Bethesda 3 and 4 thyroid nodules in our institution's patient population.

Materials and methods
A retrospective review of medical records identified 219 Bethesda 3 and 4 thyroid nodules (203 Bethesda 3 nodules and 16 Bethesda 4 nodules) from January 2015 to December 2019 (Figure 1). The total number of nodules assessed was 644. Sample size calculation indicated that a minimum of 143 patients was required to achieve a power of 80%. All records were retrieved from the Tulane hospital outpatient endocrine surgery clinic database. The Tulane Institutional Review Board approved this study. All patient records were evaluated based on clinical history, examination, and ultrasonography. The features of each nodule were assessed by ultrasound and scored according to the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS), a reporting system for thyroid nodules proposed by the ACR (5). All ultrasonography was performed by a technician with 15 years of experience and interpreted by an endocrine surgeon with 20 years of experience in ultrasound and fine-needle aspiration (FNA). The ultrasound equipment was as follows: GE Logiq9 (serial no. 89054US1) with an M12L 14.0 MHz probe (serial no. 1065734YMA); images were obtained using 15 MHz linear transducers. Informed consent was obtained from the patient if the FNAB criteria were met based on the ACR TI-RADS and American Thyroid Association guidelines (2,7). The FNAB was done in the clinic under ultrasound guidance and local anesthesia (numbing xylocaine cream and 1% xylocaine with epinephrine injection), using 1.5-inch 25 G needles attached to 10 mL syringes. The short-axis technique was used under negative pressure suction and coring (8). A total of 4 to 5 consecutive passes were done for each nodule. After the aspiration, the needle was rinsed in CytoLyt solution for ThinPrep and sent to the histopathology lab for processing. The cytological reporting was categorized according to the Bethesda Thyroid Cytopathology Reporting System (9).
Using the CUT score created by Ianni F, 2015 (7), we mapped the Italian system's cytological classes to our Bethesda system, making class TIR3A = Bethesda 3 (atypia of undetermined significance or follicular lesion of undetermined significance) and TIR3B = Bethesda 4 (follicular neoplasm or suspicious for follicular neoplasm). The CUT score was applied retrospectively to all Bethesda class 3 and 4 nodules by calculating the C + U sum without knowledge of the final histopathology (7,10).

Statistical analysis
Malignant call rate represented the percentage of thyroid nodules that were found to be malignant postoperatively. The C + U score values were calculated for each thyroid nodule. The two-sided chi-square and Mann-Whitney U tests were applied to compare the benign and malignant groups' CUT scores. p values below 0.05 were considered significant. A receiver operating characteristic (ROC) curve analysis was performed to validate the CUT score as a diagnostic test within our population sample. Sensitivity, specificity, positive predictive value, negative predictive value, and accuracy were estimated for the best cutoff value selected by the ROC analysis. The area under the curve (AUC) and the p value were calculated. Binary logistic regression analysis was performed for each parameter of the CUT scoring system. The analysis was repeated after stratification according to body mass index (BMI) into obese (BMI 30 or more) and non-obese patients to assess the correlation of ultrasound and CUT scores with malignancy (11). Statistical analysis was performed in SPSS version 27.0.
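As a rough illustration of the quantities named above, the following Python sketch computes an AUC from its rank-based (Mann-Whitney) definition and the usual diagnostic metrics at a chosen cutoff. It is not the SPSS procedure used in the study; the function names, the "score above cutoff is test-positive" convention, and the score values are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def auc_mann_whitney(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """AUC as the probability that a malignant nodule outscores a benign one.

    Ties count as one half, so the result equals U / (n_pos * n_neg).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def diagnostic_metrics(
    scores_pos: Sequence[float], scores_neg: Sequence[float], cutoff: float
) -> Dict[str, float]:
    """Metrics when a score strictly above `cutoff` is called test-positive."""
    tp = sum(s > cutoff for s in scores_pos)
    fn = len(scores_pos) - tp
    fp = sum(s > cutoff for s in scores_neg)
    tn = len(scores_neg) - fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical C + U sums, invented for illustration only (not study data).
malignant = [4.0, 5.0, 6.5, 3.5, 5.5]
benign = [5.0, 6.0, 4.5, 5.5, 6.0, 4.0]

print(round(auc_mann_whitney(malignant, benign), 3))
print(diagnostic_metrics(malignant, benign, cutoff=5.0))
```

An AUC of 0.5 corresponds to chance-level discrimination; a value below 0.5, as with these made-up numbers, means that higher scores lean toward the benign group.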

Patient and thyroid nodule characteristics
The CUT score was retrospectively applied to thyroid nodules from 219 patients (177 women and 42 men), including 203 Bethesda 3 and 16 Bethesda 4 nodules. Of these, 148 were categorized as benign and 71 as malignant nodules. Prevalence rates of malignancy were 29.6% (n = 60) in Bethesda 3 and 68.8% (n = 11) in Bethesda 4 nodules. The mean ± SD age of patients was 52.9 ± 14.1 years (range, 18-87 years), and the mean BMI was 31.7 ± 9.2 kg/m². Of the nine sonographic features included in the CUT score, only a taller-than-wide shape was significantly associated with malignancy (n = 13, 12.8% and n = 12, 21.1%; p = 0.016) (Table 1).

Lack of difference in the CUT score between benign and malignant nodules
As depicted in Figure 2, there was no significant difference between the CUT score of malignant (4.96 ± 1.5) and benign nodules (5.35 ± 1.38), p = 0.08 (Figure 2A). In addition, subgroup analysis according to the FNA results showed no difference in CUT score between malignant and benign lesions in Bethesda 3 (p = 0.13) and Bethesda 4 nodules (p = 0.052) (Figure 2B). Similarly, BMI stratification revealed no significant difference in the CUT score between groups (Figure 2C).

Discussion
The size of thyroid nodules and suspicious ultrasound features (solid composition, microcalcifications, taller-than-wide shape, hypoechogenicity, irregular borders, and vascularity) are the leading indicators for biopsy (2). Bethesda 3 and 4 nodules carry a risk of malignancy between 6% and 40% (9). Surgical management of this category of thyroid nodules is debatable. Developing a scoring system that helps resolve this issue has been an area of research over the years. When creating any scoring system, the main factors to consider are feasibility, accuracy, and cost-effectiveness. The CUT score was created as a clinical-ultrasound evaluation system based on a meta-analysis of 41 studies conducted by an Italian team (7,12). The goal of creating the CUT score was to reduce unnecessary surgery for indeterminate thyroid nodules. We were interested in exploring how well the CUT score might predict thyroid malignancy in our hospital population. Our analysis yielded a notable result: we observed no significant predictive value of the CUT score in discriminating between benign and malignant thyroid nodules. The AUC for the CUT score was not significant when tested as a diagnostic test for our sample population. Similar findings were obtained after stratifying by the BMI of patients. The rationale for performing a subgroup analysis based on BMI stratification is the reported increased thyroid cancer incidence in the obese population in some cohorts (13)(14)(15)(16). We saw no significant difference in the mean C + U score between benign and malignant thyroid lesions. Prevalence rates of malignancy were 29.6% in Bethesda 3 and 68.8% in Bethesda 4 nodules. Multivariable analysis showed that only the solid echo structure of the nodules and height greater than width conferred a higher risk of malignancy.
We believe that the proposed CUT methodology may lack reliability and validity despite the need for a scoring system for thyroid malignancy that integrates clinical risk factors and ultrasound characteristics. Looking at the original study's results, we noticed that the odds ratios for the parameters in the CUT score, which sum to 51.5, were adjusted to approximately proportional values that sum to 10, which calls the accuracy of the scoring system into question (Supplemental figure). Furthermore, the pooled results of the studies included in the meta-analysis, published from 1989 to 2012, lacked subgroup analysis by ethnicity, raising the need to test the score in different ethnic groups (7,17). In the meta-analysis used to develop the score, significant heterogeneity among studies was found for 9 of the 12 parameters of the CUT score. In addition, each parameter relied on 4 to 17 studies. Thus, the generalizability of the pooled results is doubtful (Table 2). The same group validated the CUT score on 110 thyroid nodules from 103 patients in TIR3 grade, defined as indeterminate follicular lesions including atypical cells of indeterminate significance; the prevalence of malignancy was 25% (7). The CUT score showed high predictive performance with an AUC of 0.904, showing an inverse relationship between sensitivity and specificity (69% sensitivity and 96% specificity at C + U score >5, and 95% sensitivity and 60% specificity at C + U score >2.5). However, analysis of the diagnostic value of individual parameters in the scoring system revealed a non-significant effect for all clinical data and two ultrasonographic features, namely nodular size and intra-nodular vascularization, which raises concerns about the test's effectiveness (Table 2).
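To make the rescaling concern concrete, the sketch below shows what a purely proportional rescaling of odds ratios to a 10-point total would look like. It assumes simple proportional scaling, which may not be how the original authors derived their points; the function name, feature names, and odds ratios are hypothetical and are not the published CUT parameters.

```python
from typing import Dict


def rescale_to_total(odds_ratios: Dict[str, float], total: float = 10.0) -> Dict[str, float]:
    """Scale every weight by the same factor so the weights sum to `total`."""
    factor = total / sum(odds_ratios.values())
    return {name: value * factor for name, value in odds_ratios.items()}


# Hypothetical odds ratios, invented for illustration; they are not the
# published CUT parameters (which the text reports as summing to 51.5).
example_ors = {"feature_a": 6.0, "feature_b": 3.0, "feature_c": 1.5, "feature_d": 1.5}

points = rescale_to_total(example_ors)
for name, value in points.items():
    print(f"{name}: OR {example_ors[name]:.1f} -> {value:.2f} points")
```

Exact proportional scaling preserves the ratios between parameters; any later rounding of the points to convenient values changes those ratios, which is the kind of approximation questioned above.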
Five years after the original study established the CUT score, the same group re-evaluated the CUT score on 201 cytologically indeterminate thyroid nodules from another Italian cohort (10) categorized as TIR3A and TIR3B, of which the malignancy rate was 22.8%. The CUT score showed lower predictive power than in their earlier analysis, with an AUC of 0.714 (55.6% sensitivity, 76.9% specificity) in TIR3A and 0.744 (65.0% sensitivity, 78.3% specificity) in TIR3B, with accuracy ranging from 65% to 71.8%; thus, the CUT score misclassified approximately one-third of patients. The pre-test probability is the estimated probability that a person has a disease before a test is performed. It is based on a clinician's personal experience and local prevalence. In the Italian study (10), pre-test probability was 0.108 for Bethesda 3 and 0.321 for Bethesda 4. In our community, where the malignancy rate is higher, pre-test probability estimates were 0.296 for Bethesda 3 and 0.688 for Bethesda 4. We had a similar positive post-test probability (PPP) (0.24 vs. 0.21 for Bethesda 3 and 0.58 vs. 0.57 for Bethesda 4). In a more recent study published in 2021, Pinhas et al. validated the CUT score and concluded that it is applicable, but they also reported a higher cutoff than previously reported and the need for internal validation (18).
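For readers unfamiliar with the relation between pre-test and post-test probability, the sketch below applies the standard odds form of Bayes' theorem: post-test odds equal pre-test odds multiplied by the likelihood ratio. It is a generic illustration, not necessarily how the post-test probabilities quoted above were computed; the function names and the input values are hypothetical.

```python
def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical inputs chosen for illustration only (not values from either study).
lr_pos = positive_likelihood_ratio(sensitivity=0.80, specificity=0.80)
print(round(post_test_probability(0.25, lr_pos), 3))
```

A likelihood ratio of 1 leaves the probability unchanged, and a ratio below 1 lowers it, so a positive result from a weakly discriminating test shifts the estimate very little.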
Although ultrasound is considered a valuable tool in evaluating thyroid nodules, it cannot be used on its own to predict malignancy risk in these challenging thyroid nodules (19,20). Furthermore, its integration with clinical features did not markedly improve prediction. Brito et al. published a meta-analysis in 2014 that included 31 studies done between 1985 and 2012 (19). They concluded that the evidence for individual ultrasonographic features as predictors of malignancy is of low to moderate quality (19).
One year later, in 2015, Remonti et al. (20) published another meta-analysis that assessed 52 studies, including nine studies of indeterminate thyroid nodules. They performed a separate meta-analysis for the indeterminate thyroid nodules and concluded that the available data were insufficient to analyze these nodules' individual sonographic features (20). Over the past years, this topic has received increased attention, and many innovative  (17) suggested that the best rule-out tests for malignancy are the Afirma gene expression classifier and fluorodeoxyglucose positron emission tomography. At the same time, the most accurate rule-in test was BRAF mutation analysis. Ultrasound was used in creating the CUT scoring system because of its wide availability and lower cost compared with genetic testing, which may have contributed to the score's lower accuracy relative to more comprehensive genetic profile testing (2). In our practice at Tulane, we use molecular genetic testing to guide our approach in Bethesda 3 and 4 thyroid nodules (2). Nevertheless, an ultrasound-based scoring system would be more feasible, immediate, and lower in cost. As a limitation of our study, we acknowledge the need for further studies with larger sample sizes to evaluate the CUT score's accuracy. However, we note that our sample size was larger than those in many previous studies. Moreover, given our sample size and the calculated power, we could conclude that the CUT score was not a good predictive tool in Bethesda 3 thyroid nodules, with limited application of this conclusion to Bethesda 4 thyroid nodules (sample sizes: 203 Bethesda 3 and 16 Bethesda 4 nodules).

Conclusion
We observed that the CUT score, applied retrospectively to a sample of thyroid nodules, did not provide an accurate preoperative estimate of malignancy risk in Bethesda 3 thyroid nodules. Thus, the CUT score requires more rigorous validity testing in prospective multicenter studies to confirm its clinical utility.