Gene Expression-Based Recurrence Prediction of Hepatitis B Virus-Related Human Hepatocellular Carcinoma

Purpose: The poor prognosis of hepatocellular carcinoma (HCC) is, in part, due to the high rate of recurrence even after "curative resection" of tumors. Therefore, it is axiomatic that the development of an effective prognostic prediction model for HCC recurrence after surgery would, at minimum, help to identify in advance those who would most benefit from the treatment, and at best, provide new therapeutic strategies for patients with a high risk of early recurrence.
Experimental Design: For the prediction of the recurrence time in patients with HCC, gene expression profiles were generated in 65 HCC patients with hepatitis B infections.
Result: Recurrence-associated gene expression signatures successfully discriminated between patients at high risk and low risk of early recurrence (P = 1.9 × 10⁻⁶, log-rank test). To test the consistency and robustness of the recurrence signature, we validated its prognostic power in an independent HCC microarray data set. CD24 was identified as a putative biomarker for the prediction of early recurrence. Genetic network analysis suggested that SP1 and peroxisome proliferator-activated receptor-α might have regulatory roles for the early recurrence of HCC.
Conclusion: We have identified a gene expression signature that effectively predicted early recurrence of HCC independent of microarray platforms and cohorts, and provided novel biological insights.

Hepatocellular carcinoma (HCC) is one of the most common cancers in the world, accounting for an estimated 600,000 deaths annually (1). The resistance of HCC to existing treatments and the lack of biomarkers for early detection make it one of the most difficult cancers to treat. Surgical tumor resection, including liver transplantation, remains the only curative modality for HCC. Although advances in surgery and in preoperative and postoperative care have improved the survival of patients with HCC, the recurrence rate remains high even after curative resection of tumors. HCC recurrence is a serious complication following the resection of the primary tumor and occurs in 75% to 100% of patients within 5 years after surgery (2-4). In light of the dismal clinical outcome of this neoplasm, effective systems that can predict the likelihood of recurrence are much needed. Such systems would help in deciding therapeutic strategies for patients with HCC according to the predicted risk of recurrence.
Several attempts have been made to predict recurrence and prognostic outcomes based on single or multiple clinicopathologic features such as the severity of liver function impairment, age, tumor grade, size, microvascular invasion, portal vein thrombosis, and the presence of microsatellite regions (2, 5-7). Prognostic staging systems have also been proposed to stratify patients according to expected survival (8-10). However, their prognostic significance and clinical utility need to be further validated in large-scale studies (11, 12).
Recent gene expression profiling studies have successfully predicted recurrence, metastasis, or survival prognosis of HCCs (13-17). Even though these studies provide prognostic markers for clinical application, the lack of consistency and robustness of predictors generated from different microarray platforms remains one of the major obstacles to the clinical use of microarray-based predictors (18, 19). As the lack of reproducibility mainly comes from the heterogeneity of the patient cohorts and the differences in microarray platforms, it is important to identify a reliable and consistent predictor that is robust enough to overcome the variability introduced by different platforms or different patient cohorts.
In the present study, we examined the gene expression profiles of 65 patients with HCC associated with the same viral background of hepatitis B virus (HBV) infection and identified molecular markers that predict HCC prognostic subtypes of high risk and low risk of early recurrence. The robustness and consistency of the signature's predictive power were validated when it was applied to a completely independent patient cohort (15). This suggests that the signature would be accurate and promising for clinical application. Moreover, as all of the 65 patients were HBV positive, these gene expression profiles might chiefly help in the understanding of HBV-related hepatocarcinogenesis. Detailed functional analyses of the prognostic subtypes provide novel molecular insights into HCC recurrence mechanisms.

Patients and Methods
Patients. Between February 2001 and May 2005, we prospectively collected resected HCC specimens with a pathologically proven cirrhotic background from 65 patients who were chronically infected with HBV and had surgical treatment for HCC at Seoul National University Hospital. All patients were preoperatively evaluated with routine blood tests, α-fetoprotein, routine X-ray, abdominal ultrasonography, and two-phase spiral liver computed tomography scan. Space-occupying lesions in the liver remnant were examined by intraoperative ultrasonography; no distant metastases or space-occupying lesions were identified in the nonresected part of the liver of any of the individuals in this study. We excluded subjects who were positive in serologic tests for anti-HCV or anti-HIV (HCV3.2; Dong-A Pharmaceutical Co.; Greencross Life Science Corp.). Patients with other types of liver disease, such as autoimmune hepatitis, toxic hepatitis, primary biliary cirrhosis, or Budd-Chiari syndrome, were also excluded. The study protocol was approved by the institutional review board for the use of human subjects at the Seoul National University School of Medicine, and all participants provided written informed consent. We defined curative resection as complete excision of the tumor with clear microscopic margins and no residual tumors as indicated by computed tomography scan at 1 month after surgery. To assess tumor size and undertake pathologic examination, we sectioned the resected specimens using the slice with the largest diameter, which we then cut at intervals of 5 mm. Two experienced pathologists independently examined all samples for evidence of residual tumors at the surgical margin, tumor differentiation, stage, and presence of vascular invasion. Based on these examinations, all 65 patients were determined to have received "curative resection." Patients were followed up at least once every 3 months after surgery.
Microarray experiments and analysis. Total RNA was extracted from frozen tissues using TRIzol (Invitrogen) and then cleaned using an RNeasy Mini kit (Qiagen). Five micrograms of total RNA from the HCC tissues was used for labeling, and microarray hybridization was carried out on Human Genome U133A 2.0 chips (Affymetrix) according to the manufacturer's protocol. The fluorescent intensities were determined with a GeneChip scanner 3000 (Affymetrix), controlled by GCOS Affymetrix software.
Raw data were normalized using the Robust Multiarray Average method (20) and global median centering. Hierarchical clustering analyses of gene expression profiles were done based on centered correlation metric and average linkage method.
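The centering and clustering steps can be illustrated with a minimal standard-library sketch. It does not reproduce the Robust Multiarray Average step or the software actually used; the function names and the toy values are hypothetical, and "centered correlation" is taken here to mean one minus the Pearson correlation.

```python
from math import sqrt
from statistics import mean, median


def median_center(sample):
    """Subtract the array-wide median from every value of one sample."""
    m = median(sample)
    return [v - m for v in sample]


def correlation_distance(x, y):
    """1 - Pearson (centered) correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return 1.0 - num / den


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    clusters = {i: [i] for i in range(len(profiles))}
    pair = {(i, j): correlation_distance(profiles[i], profiles[j])
            for i in clusters for j in clusters if i < j}
    merges, next_id = [], len(profiles)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average linkage: mean of all between-cluster pairwise distances.
                d = mean([pair[min(p, q), max(p, q)]
                          for p in clusters[a] for q in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Toy data (hypothetical): four samples measured on five genes.
samples = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 11], [5, 4, 3, 2, 1], [9, 8, 6, 4, 1]]
merges = average_linkage([median_center(s) for s in samples])
```

In this toy input, samples 0 and 1 rise together and samples 2 and 3 fall together, so the two pairs merge first and are joined only in the last step. Because the correlation distance is invariant to adding a constant to a sample, median centering affects the scale of the values but not this particular clustering.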
Class prediction was done, and the misclassification rates of the classifiers were estimated by leave-one-out cross-validation, using different algorithms (compound covariate predictor, linear discriminant analysis, nearest centroid, k-nearest neighbor, and support vector machine) implemented in BRB-ArrayTools. Recurrence-free and overall survival probabilities were estimated with the Kaplan-Meier method, and significance was determined by the log-rank test. Statistical analyses were done using R/Bioconductor packages.
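As an illustration of leave-one-out cross-validation, the sketch below estimates the misclassification rate of a nearest centroid classifier on toy data. It is a simplified stand-in, not the BRB-ArrayTools implementation: the function names and values are hypothetical, and gene selection (which a full analysis would repeat inside each fold) is omitted.

```python
from statistics import mean


def centroid(rows):
    """Per-gene mean over a list of samples."""
    return [mean(col) for col in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid_predict(train_x, train_y, sample):
    """Assign the sample to the class whose centroid is closest."""
    centroids = {
        label: centroid([x for x, y in zip(train_x, train_y) if y == label])
        for label in set(train_y)
    }
    return min(centroids, key=lambda lab: squared_distance(centroids[lab], sample))


def loocv_misclassification_rate(xs, ys):
    """Hold out each sample once, train on the rest, and count errors."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)


# Toy data (hypothetical): two genes, three samples per risk group.
xs = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
      [-1.0, -1.1], [-0.8, -1.2], [-1.2, -0.9]]
ys = ["high", "high", "high", "low", "low", "low"]
clean_rate = loocv_misclassification_rate(xs, ys)

# Adding one sample that looks "high" but is labeled "low" produces one error.
noisy_rate = loocv_misclassification_rate(xs + [[0.9, 1.1]], ys + ["low"])
```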
For data integration with the independent data set, each data set was standardized independently by transforming the expression of each gene to a mean of 0 and SD of 1; the expression profiles were then pooled and treated as a single data set. Probes in each data set were matched with Entrez Gene identifiers. For genes tagged by multiple probes, the probe with the largest magnitude (i.e., sum of the squares of expression values in each sample) was selected as a representative probe.
Fig. 1. Identification of genes responsible for early recurrence of HCC. A, hierarchical clustering was done on the expression profile of recurrence genes that were identified by Cox proportional hazards analysis (P < 0.005, log-rank test). Before clustering, the average expression levels of each gene were set to 0. HCCs were labeled according to recurrence time as early recurrence (i.e., patients with tumor recurrence within a year after surgery, n = 15; red) and late recurrence (i.e., patients free of recurrence for >1 year, n = 25; blue). The patients who had been followed up for less than a year without recurrence were assigned as unclassifiable (n = 25, gray). B, Kaplan-Meier plot of recurrence-free survival of HCCs stratified by hierarchical clustering of expression profile of recurrence genes.
Fig. 2. Validation of recurrence genes by cross-platform comparison with an independent HCC data set. A, 150 out of the 628 SNU recurrence genes were found in the LEC data set. Hierarchical clustering was done on these gene expressions in the LEC data set (n = 139, recurrence information was available in 67 samples). Before clustering, the average expression levels of each gene were set to 0. B and C, Kaplan-Meier plot of recurrence-free survival (B) and overall survival (C) of HCCs (LEC) grouped by hierarchical clustering of the recurrence gene expression profile. D and E, independent prediction algorithms of compound covariate predictor (D) and linear discriminant analysis (E) were trained with SNU expression data and then applied to the LEC data set, respectively, and Kaplan-Meier plot analysis and log-rank test were done on the predicted classes. F, hierarchical clustering was done on an integrated data set comprised of both SNU and LEC data sets. Before integration, each data set was standardized independently by transforming the expression levels of each gene to a mean of 0 and SD of 1 (see Patients and Methods). G-I, Kaplan-Meier plots and log-rank test of recurrence-free survival of HCCs in SNU (G, n = 65), LEC (H, n = 139), and the overall integrated data (I, SNU + LEC, n = 204), respectively. HCCs were grouped based on hierarchical clustering of the expression profiles of the integrated data sets. Abbreviations: CCP, compound covariate predictor; LDA, linear discriminant analysis; SNU, data from Seoul National University; LEC, data from Laboratory of Experimental Carcinogenesis, National Cancer Institute, NIH.
Functional analysis of signatures. Once a gene set was identified as useful in the stratification of a patient's outcome, we attempted to gain insight into molecular mechanisms that might be involved in generating this hierarchy of patient outcome. For the functional analysis of gene sets, enrichment of the gene set was estimated by the cumulative hypergeometric P values of each biological process provided by the Gene Ontology Consortium. In order to obtain representative and significantly enriched terms, those terms with a level higher than two in the Gene Ontology hierarchy, including at least three genes, were considered in our calculation. Statistical significance was determined with a cutoff of P < 0.01.
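The cumulative hypergeometric P value used for Gene Ontology enrichment can be written out as a short sketch. The function names, the term names, and the gene labels below are hypothetical; the filter on Gene Ontology hierarchy level is not modeled, and "at least three genes" is assumed here to refer to genes shared between the term and the selected gene set.

```python
from math import comb


def hypergeom_tail(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(comb(annotated, k) * comb(population - annotated, selected - k)
               for k in range(overlap, upper + 1))
    return tail / total


def enriched_terms(term_to_genes, selected_genes, population_size,
                   min_genes=3, alpha=0.01):
    """Return {term: P} for terms passing the gene-count and P value cutoffs."""
    selected = set(selected_genes)
    results = {}
    for term, genes in term_to_genes.items():
        members = set(genes)
        overlap = len(selected & members)
        if overlap < min_genes:
            continue
        p = hypergeom_tail(population_size, len(members), len(selected), overlap)
        if p < alpha:
            results[term] = p
    return results


# Toy annotation (hypothetical gene labels g1..g7, population of 1,000 genes).
terms = {"cell motility": ["g1", "g2", "g3", "g4"], "other": ["g5", "g6"]}
hits = enriched_terms(terms, ["g1", "g2", "g3", "g7"], population_size=1000)
```

Here three of the four selected genes fall in a four-gene term out of 1,000 genes, which is far more than expected by chance, so only "cell motility" is reported.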
In another approach, we employed PathwayAssist software (Ariadne Genomics, version 3.0) as an independent pathway analysis tool to identify connections between differentially expressed genes. After constructing genetic networks, we sought to identify common regulators or common targets of the differentially expressed gene sets.
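PathwayAssist is a commercial tool whose internals are not described here, so the sketch below only illustrates the general idea of ranking "common regulators": count, for each regulator, how many genes of interest it targets. The function name and the regulator `REG_X` are hypothetical; the SP1 targets listed are those named in the Results.

```python
from collections import Counter


def rank_common_regulators(edges, gene_set):
    """edges: iterable of (regulator, target) pairs.
    Returns regulators sorted by the number of distinct targets in gene_set."""
    counts = Counter()
    seen = set()
    for regulator, target in edges:
        if target in gene_set and (regulator, target) not in seen:
            seen.add((regulator, target))
            counts[regulator] += 1
    return counts.most_common()


# Toy network: SP1 targets as named in the Results; REG_X is a placeholder.
edges = [
    ("SP1", "PLAUR"), ("SP1", "FGFR1"), ("SP1", "VIM"),
    ("SP1", "PDGFA"), ("SP1", "HK2"), ("SP1", "VIM"),  # duplicate edge ignored
    ("REG_X", "VIM"), ("REG_X", "GENE_NOT_SELECTED"),
]
overexpressed = {"PLAUR", "FGFR1", "VIM", "PDGFA", "HK2"}
ranking = rank_common_regulators(edges, overexpressed)
```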

Results
Identification of HCC recurrence signature. We examined the gene expression profile of 65 HBV-associated HCCs using Affymetrix U133A 2.0 chips. To assess the association of the expression of each gene feature with recurrence-free survival, a univariate Cox proportional hazard model was applied. A total of 628 gene features were selected as recurrence signatures that were highly correlated with the time to recurrence with strong statistical significance (P < 0.005, log-rank test) and differentially expressed across samples at nontrivial levels (SD > 0.3). Hierarchical clustering with these 628 recurrence signature genes subdivided HCC patients into two subtypes that appropriately reflect the difference in recurrence times between patients with HCC (Fig. 1A). Kaplan-Meier plot analysis and log-rank test showed a significant difference of recurrence-free survival between these two HCC subtypes (Fig. 1B, P = 1.9 × 10⁻⁶, log-rank test).
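The Kaplan-Meier estimate and the two-group log-rank test used to compare the subtypes can be sketched in plain Python. This is an illustration under stated assumptions, not the R/Bioconductor code used in the study: the function names and follow-up times are hypothetical, an event flag of 1 means recurrence and 0 means censoring, and the P value uses the chi-square distribution with 1 degree of freedom.

```python
from math import erfc, sqrt


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if d:
            surv *= 1 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value)."""
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp = 0.0
    var = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / var
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))


# Toy follow-up data in months (hypothetical).
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
chi2, p_value = logrank_test([1, 2, 3, 4], [1, 1, 1, 1],
                             [10, 11, 12, 13], [1, 1, 1, 1])
```

In the toy comparison every recurrence in the first group precedes every recurrence in the second, so the test statistic is large even with four patients per group.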
The absence of early recurrence during the first year after surgery is the gold standard for determining the success of curative resection. Before the microarray experiments, we subdivided the patients with HCC into early recurrence (i.e., HCCs that recurred within a year of curative surgery) and late recurrence (i.e., HCCs free of recurrence for >1 year) groups. As shown in Fig. 1A, most of the early recurrence samples were predicted to be in the high-risk group, with 82.5% accuracy, suggesting that our subtype classification might be helpful in planning adequate strategies for patient treatment.
Validation of recurrence signature with independent gene expression data. Having defined two distinct HCC subtypes that reflect significantly different clinical outcomes, we decided to test the robustness of the identified recurrence signature by applying six different class prediction methods (compound covariate predictor, linear discriminant analysis, k-nearest neighbor, nearest centroid, and support vector machine). Prediction of these two risk subtypes by six different class prediction algorithms showed between 83% and 97% mean prediction accuracy rates with significant leave-one-out misclassification rates (P < 0.01, based on 100 random permutations; Supplementary Table S1). These results strongly support the robustness of our recurrence signature.
For the validation of the prognostic reproducibility of this recurrence signature, we next applied our recurrence signature directly to an independent gene expression data set of patients with HCC [data from Laboratory of Experimental Carcinogenesis (LEC), National Cancer Institute, NIH; ref. 15]. Hierarchical clustering of the gene expression profile of the recurrence signature in the LEC data set could subdivide patients into two distinct subgroups with homogeneous expression patterns (Fig. 2A). Kaplan-Meier plot analysis and log-rank test of these HCC subtypes showed a significant difference of overall survival (P = 0.0001), as well as recurrence-free survival (P = 0.0018; Fig. 2B and C). This suggests that our recurrence signature is well conserved in the independent data set and is able to predict recurrence-free survival regardless of microarray platforms.
In addition to the use of hierarchical clustering, we applied two independent classification algorithms (compound covariate predictor and linear discriminant analysis) to validate the robustness of the gene expression signature that predicted the likelihood of early recurrence. Gene expression data from the Seoul National University (SNU) were used to train classification algorithms and those from LEC were used as the test set. During training, the number of genes used in the prediction was optimized to minimize misclassification errors during leave-one-out cross-validation. When applied to the LEC data set, both algorithms successfully identified early recurrence patients with statistical significance (Fig. 2D and E).
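The train-on-one-cohort, test-on-another design can be illustrated with a compact compound covariate predictor: each gene is weighted by its two-sample t statistic in the training set, and a new sample is classified by comparing its weighted sum with the midpoint of the two class means. This sketch is not the BRB-ArrayTools implementation; names and values are hypothetical, a pooled-variance t statistic is assumed, and the optimization of the number of genes is omitted.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (an assumption here)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def compound_score(weights, sample):
    return sum(w * x for w, x in zip(weights, sample))


def train_ccp(high, low):
    """high, low: lists of training samples. Returns (weights, threshold)."""
    weights = [t_statistic(h, l) for h, l in zip(zip(*high), zip(*low))]
    high_mean = mean([compound_score(weights, s) for s in high])
    low_mean = mean([compound_score(weights, s) for s in low])
    return weights, (high_mean + low_mean) / 2


def predict_ccp(weights, threshold, sample):
    return "high" if compound_score(weights, sample) > threshold else "low"


# Toy training cohort (hypothetical): gene 0 is up and gene 1 is down
# in the high-risk group.
train_high = [[2.0, 0.1], [2.2, -0.1], [1.8, 0.0]]
train_low = [[0.1, 2.0], [-0.1, 2.1], [0.0, 1.9]]
weights, threshold = train_ccp(train_high, train_low)

# Samples from a hypothetical independent test cohort.
test_predictions = [predict_ccp(weights, threshold, s)
                    for s in ([1.9, 0.2], [0.2, 1.8])]
```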
In another approach, we applied data integration by pooling both data sets. Hierarchical clustering of recurrence signatures in overall integrated HCCs (n = 204) showed two main clusters with homogeneous expression patterns across platforms (Fig. 2F), suggesting that the expression patterns of the recurrence signatures were well conserved in both data sets. Kaplan-Meier plot analysis of these HCC subtypes showed a significant difference of recurrence between the subgroups of each individual data set (SNU data set, P = 0.007; LEC data set, P = 0.005, respectively, log-rank test; Fig. 2G and H). Kaplan-Meier analysis on the overall integrated data set also successfully dissected subgroups based on the recurrence rate (P = 0.0003; Fig. 2I). These results strongly support the consistency and robustness of this recurrence signature across independent cohorts and the experimental platforms of the individual studies.
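The pooling step rests on the per-data-set standardization described in Patients and Methods (each gene scaled to a mean of 0 and SD of 1 within its own data set before merging). A minimal sketch follows; the function names and expression values are hypothetical, and the sample SD (n - 1 denominator) is an assumption, as the text does not state which estimator was used.

```python
from statistics import mean, stdev


def standardize_genes(dataset):
    """dataset: {gene_id: [expression per sample]}.
    Returns per-gene z-scores (mean 0, SD 1) within this data set."""
    standardized = {}
    for gene, values in dataset.items():
        m, s = mean(values), stdev(values)  # sample SD: an assumption
        standardized[gene] = [(v - m) / s for v in values]
    return standardized


def pool_datasets(dataset_a, dataset_b):
    """Standardize each data set on its own, then concatenate shared genes."""
    a = standardize_genes(dataset_a)
    b = standardize_genes(dataset_b)
    return {gene: a[gene] + b[gene] for gene in sorted(set(a) & set(b))}


# Toy data (hypothetical values on very different scales per platform).
snu = {"CD24": [1.0, 2.0, 3.0], "SP1": [5.0, 5.5, 6.0]}
lec = {"CD24": [10.0, 30.0, 20.0], "JUN": [0.0, 1.0, 2.0]}
pooled = pool_datasets(snu, lec)
```

After standardization the two CD24 profiles are on a common scale even though the raw values differ tenfold, which is what allows samples from both platforms to be clustered together.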
Clinicopathologic features and recurrence signature. The prognostic values of conventional clinicopathologic factors on the risk of recurrence have been widely studied. In agreement with previous reports, we identified patient age (21), tumor gross type (22), and tumor size (23) to be associated with the likelihood of HCC recurrence in univariate Cox proportional hazards analysis (Table 1). However, other clinicopathologic features such as serum α-fetoprotein, serum platelet count, differentiation, tumor grade, venous invasion, extranodal invasion, and adjuvant therapy (trans-arterial chemoembolization) were not associated with recurrence-free survival. Even though all the patients had a history of HBV infection, the serotype status of HBeAg and anti-HBe was not associated with recurrence-free survival (data not shown). The multivariate analysis, including all the clinicopathologic variables and the molecular subtype, showed that only the molecular subtype was significantly associated with tumor recurrence (hazard ratio, 12.54; 95% confidence interval, 3.59-43.76; P < 7.30 × 10⁻⁵; Table 1).
In an attempt to improve the prognostic usefulness of clinical features, we combined classifications of molecular subtypes with clinical features, hoping to predict tumor recurrence more precisely. Of the clinical features that showed significant association with recurrence in univariate Cox regression analysis, patient age and tumor gross type each predicted patients' outcomes better when combined with the recurrence signature than the recurrence signature did alone (Fig. 3A and B). In addition, combining tumor size with molecular subtype could predict patients' outcome more precisely by stratifying the patients assigned to the low-risk group (Fig. 3C). These results suggest that a combined application of certain clinical features with molecular subtypes would be a practical approach to define stratified recurrence-free survival groups.
Biological insights of HCC recurrence signature. To gain biological insight into the mechanisms underlying the different prognostic outcomes of these two molecular subtypes, the genes showing significant differences in expression between the subtypes were selected with a two-sample t test. A total of 937 genes showing significant differences in 10,000 permuted two-sample t tests (P < 0.001, false discovery rate < 1.46%) with a fold difference between subtypes greater than 1.4 were selected for the analysis of Gene Ontology composition. Functional enrichment analysis with Gene Ontology categories (Supplementary Table S2) showed a significant enrichment of metastasis-related functions including actin filament organization, regulation of cell migration, and cell motility. As expected, proliferation-related functions (cell proliferation, regulation of progression through cell cycle) and differentiation/development-related functions (cytoskeleton organization and biogenesis, cell fate determination, skeletal development) showed significant enrichment in the high-risk group. Of interest, notch signaling genes (JAG1, JAG2, and NOTCH2) were significantly up-regulated in the high-risk group, implying their functional roles in HCC recurrence. Inflammation-related functions (i.e., chemotaxis, humoral immune response) were also highly enriched in the high-risk group. Inflammation/immune response-related genes were reported to have an association with noncancerous hepatic tissues from patients with metastatic HCC (17), suggesting that their enrichment in the recurrence signature might be derived from noncancerous stromal cells promoting surveillance for HCC recurrence.
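The gene selection rule (a permutation-based two-sample t test combined with a fold-difference cutoff) can be sketched for a single gene as follows. The function names and values are hypothetical; a pooled-variance t statistic and log2-scale expression values are assumptions, and the false discovery rate estimate is not modeled.

```python
import random
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (an assumption here)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def permutation_p(a, b, n_permutations=10000, seed=0):
    """Fraction of label permutations with |t| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pooled_t(a, b))
    values = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(values)
        if abs(pooled_t(values[:len(a)], values[len(a):])) >= observed:
            hits += 1
    return hits / n_permutations


def fold_difference(a, b):
    """Fold difference of geometric means, assuming log2-scale values."""
    return 2 ** abs(mean(a) - mean(b))


# Toy log2 expression of one gene in each subtype (hypothetical values).
high_risk = [5.1, 5.3, 4.9, 5.2, 5.0]
low_risk = [1.0, 1.2, 0.9, 1.1, 0.8]
p = permutation_p(high_risk, low_risk, n_permutations=2000)
fold = fold_difference(high_risk, low_risk)
```

With only five samples per group the smallest attainable permutation P value is about 2/252, so the P < 0.001 cutoff used in the study is only reachable with cohort sizes like the one analyzed; the toy example is meant to show the mechanics rather than the thresholds.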
Notably, CD24 showed the highest fold difference of geometric mean (6.84-fold) and all six probes for CD24 (i.e., 208650_s_at, 208651_x_at, 216379_x_at, 209771_x_at, 209772_s_at, and 266_s_at) were significantly overexpressed in the high-risk group (P < 0.001; Supplementary Table S3). As shown in Fig. 4, CD24 expression levels between high-risk and low-risk groups were significantly different in the LEC as well as in the SNU data set (P < 0.001, two-tailed Student's t test for each data set). This concordant observation in both data sets identifies CD24 as a putative biomarker for the prediction of early recurrence.
Once the molecular subtype of HCC was validated for robustness of prognostic capacity, we examined the genetic network of differentially expressed genes between the subtypes. To identify the most prominent common regulatory genes, we carried out pathway analysis using PathwayAssist (Ariadne Genomics, version 3.0). We found that SP1 was the most prominent common regulator for the genes overexpressed in the high-risk group compared with the low-risk group (Fig. 5). Many of these SP1 targets, e.g., PLAUR (14, 16, 24, 25), FGFR1 (26), VIM (27), PDGFA (28), and HK2 (29), had previously been reported to be associated with HCC prognosis or metastasis, which strongly implicates SP1 as a critical factor in HCC recurrence. In contrast to SP1, peroxisome proliferator-activated receptor α (PPARα) was identified as a prominent common regulator for many of the down-regulated genes in the high-risk group. From these results, we suggest that SP1 and PPARα play critical roles in HCC recurrence.

Discussion
Many previous studies have shown successful analyses of gene expression profiles for the prognostic prediction of patients with cancer, but expectations for their clinical application have been overoptimistic, and such application is still premature. The lack of consistency and robustness of expression profiles is thought to be one of the major obstacles to clinical application. In this regard, external validation by comparing totally independent cross-platform and cross-site studies will help to identify robust predictors and reduce systematic biases derived from individual data sets. The cross-platform comparison of independent studies has its own limitations due to the heterogeneity and unavailability of the data sets to be combined. However, this approach is now regarded as one of the reliable ways to overcome the overfitting problem of microarray data.
In the present study, we examined the gene expression profiles of 65 patients with HCC to generate a genetic classifier that could identify the patients with a high risk of early recurrence following curative resection. The 628 gene features selected as genetic classifiers by a univariate Cox proportional hazard model could classify HCC patients into high-risk (n = 31) and low-risk (n = 34) subtypes of early recurrence of HCCs using hierarchical clustering analysis. Cross-platform analysis of this recurrence signature with independent data sets showed consistent stratification of HCC patients that appropriately reflects the risk of early recurrence, suggesting that it might be less prone to false findings and is independent of individual studies. Moreover, the HCC samples in our data set were collected from a homogeneous patient population with the same viral exposure (i.e., HBV), ethnicity, hospital care, and postoperative follow-up; therefore, the data set should be less confounded and more informative for the understanding of recurrence mechanisms.
CD24 was identified as a putative biomarker for classifying low-risk and high-risk groups of early recurrence in both SNU and LEC data sets (Fig. 4). Congruent with this finding, previous studies showed that the CD24 expression level is prognostic in many cancers (30-36), including HCC (37), although its prognostic role for recurrence has not been noted. CD24 is known to participate in the regulation of cell-to-cell and cell-to-matrix interactions, and its ligand, P-selectin, is associated with tumor metastasis by increasing cell spreading, adhesion, and proliferation (30). Therefore, we suggest that CD24 might be a good putative biomarker for the prediction of early recurrence of HCC.
Fig. 4. Expression levels of CD24 in high-risk and low-risk groups of early recurrence. High-risk and low-risk groups of early recurrence were assigned by hierarchical clustering of the recurrence signature in SNU (Fig. 1) and LEC data sets (Fig. 2A). The average expression levels of CD24 between high-risk and low-risk groups were compared in both data sets, respectively. Expression levels in the SNU data set represent relative values to the overall median of normalized log-transformed expression profile, whereas the expression levels of the LEC data set represent the fold changes of log-transformed expression levels between tumor samples and normal liver tissues. Statistical significance was estimated by two-tailed Student's t test (P < 0.001). Columns, mean; bars, SE.
Fig. 5. Common regulatory genes of differentially expressed genes between molecular subtypes of HCCs. Gene regulatory network of differentially expressed genes (P < 0.001, two-tailed Student's t test) between high-risk and low-risk groups were constructed using PathwayAssist software. Up-regulated (red) and down-regulated (green) differentially expressed genes in the high-risk group compared with nondifferentially expressed genes in the low-risk group (gray). Enlarged pictures of SP1 and PPARα indicated their common regulations of recurrence genes. Common regulators of TP53 and EGF (blue circle; details in Supplementary Fig. S1). References for genetic interactions for SP1 and PPARα are listed in the Supplementary Notes.
Genetic network analysis of this recurrence signature revealed SP1 and PPARα as prominent common regulators of genes that differed in expression between high-risk and low-risk groups. Many genes that regulate the cell cycle frequently contain proximal GC-rich promoter sequences, and their interactions with SP proteins and other transcription factors are critical for their expression (38). SP1 is associated with the prognosis of cancers, including pancreatic cancer (39), breast cancer (40), and gastric cancer (41), although the potential roles of SP1 in HCC prognosis remain unclear. In line with these studies, SP1 is likely to be a good candidate for further studies to elucidate its regulatory role in the recurrence mechanism of HCC.
PPARα agonists have been known to cause hepatocarcinogenesis in rats and mice, whereas humans seem to be resistant (42, 43). Human PPARα mRNA and functional receptor are expressed at a level <10% of that found in rats and mice, which may contribute to a difference in susceptibility to agonists between rodents and humans (44). Lower expression of human PPARα might be due to variant human PPARα mRNA species, in which exon 6 was deleted by alternative splicing, and in which the amounts of variants were reported to be up to 20% to 50% of the total PPARα mRNA in human tissues (44). These studies imply that the variants of PPARα might lead to different expressions of PPARα and its target genes between high-risk and low-risk groups. In tumor progression, the PPARα agonist, fenofibrate, was revealed to have antimetastatic potential in both human and mouse melanoma cells (45), suggesting that PPARα has a tumor suppressor role, at least in humans, besides its tumorigenic potential in rodents. From these findings, we could hypothesize that lower expression of PPARα (possibly related to variant PPARα) may affect the differences of recurrence potential between HCC subtypes. However, we cannot rule out the possibility that the deleterious loss of hepatic functions and subsequent depletion of lipid metabolism in the high-risk group could be related to the lower expression of PPARα (see Supplementary Table S2).
In addition to SP1 and PPARα, close examination of genetic networks revealed several prominent common regulators such as EGF and PTGS2, which have previously been well studied in association with HCC progression (refs. 46, 47; Fig. 5; Supplementary Fig. S1A and B). When we constructed a genetic network with the target genes of the differentially expressed genes between the two groups, FOS and JUN were revealed as common downstream targets of the overexpressed genes in the high-risk group (Supplementary Fig. S1C), which is consistent with a previous study that showed their regulatory role in HCCs with poor prognosis (15). When all the regulators and targets of the differentially expressed genes were pooled in the network, TP53 and TGFB1 were identified as commonly regulated and targeted genes, which have previously been shown to play critical roles in cancer progression (refs. 47, 48; Supplementary Fig. S1D and E). Taken together, these results suggest that concomitant disruption of multiple gene expression networks is required for HCCs to adopt an aggressive phenotype.
In conclusion, we generated a consistent and robust recurrence predictor independent of platforms and cohorts, which could successfully predict molecular subtypes of HCC that reflect the likelihood of early recurrence after curative resection. We also showed that the combined analysis of the molecular subtypes with clinicopathologic features could improve their prognostic utilities. In addition, our study provides substantial biological insights that prioritize the functional significance of SP1 and PPARα in HCC recurrence mechanisms and CD24 as a putative biomarker. We believe that our predictor profile can be helpful to clinicians in choosing a treatment modality for HCC patients who have a high risk of early recurrence.