Using Bayesian Evidence Synthesis Methods to Incorporate Real-World Evidence in Surrogate Endpoint Evaluation

Objective. Traditionally, validation of surrogate endpoints has been carried out using randomized controlled trial (RCT) data. However, RCT data may be too limited to validate surrogate endpoints. In this article, we sought to improve the validation of surrogate endpoints with the inclusion of real-world evidence (RWE).
Methods. We use data from comparative RWE (cRWE) and single-arm RWE (sRWE) to supplement RCT evidence for the evaluation of progression-free survival (PFS) as a surrogate endpoint to overall survival (OS) in metastatic colorectal cancer (mCRC). Treatment effect estimates from RCTs, cRWE, and matched sRWE, comparing antiangiogenic treatments with chemotherapy, were used to inform surrogacy patterns and predictions of the treatment effect on OS from the treatment effect on PFS.
Results. Seven RCTs, 4 cRWE studies, and 2 matched sRWE studies were identified. The addition of RWE to RCTs reduced the uncertainty around the estimates of the parameters for the surrogate relationship. The addition of RWE to RCTs also improved the accuracy and precision of predictions of the treatment effect on OS obtained using data on the observed effect on PFS.
Conclusion. The addition of RWE to RCT data improved the precision of the parameters describing the surrogate relationship between treatment effects on PFS and OS and the predicted clinical benefit of antiangiogenic therapies in mCRC.
Highlights
Regulatory agencies increasingly rely on surrogate endpoints when making licensing decisions, and for the decisions to be robust, surrogate endpoints need to be validated.
In the era of precision medicine, when surrogacy patterns may depend on the drug's mechanism of action and trials of targeted therapies may be small, data from randomized controlled trials may be limited.
Real-world evidence (RWE) is increasingly used at different stages of the drug development process.
When used to enhance the evidence base for surrogate endpoint evaluation, RWE can improve inferences about the strength of surrogate relationships and the precision of predicted treatment effect on the final clinical outcome based on the observed effect on the surrogate endpoint in a new trial. Careful selection of RWE is needed to reduce risk of bias.

Surrogate endpoints are often used when it takes too long, is too expensive, or is too difficult to observe treatment effects on the final clinical outcome of interest. 1 However, before surrogate endpoints can be used, for example, for regulatory approvals, they should be validated. 2,3 Surrogate endpoints can be validated based on 3 levels of association: 1) biological plausibility, 2) individual-level surrogacy, and 3) trial-level surrogacy. 4 However, identifying and validating potential surrogate endpoints can be difficult when data for such analysis are limited. Traditionally, surrogate endpoint evaluation has been carried out using only data from randomized controlled trials (RCTs). Shortages of RCT data are becoming more common as precision medicine evolves and treatments become more effective and are targeted to specific patient populations, leading to smaller cohorts of patients, where it takes longer to observe the treatment effect on the final outcome with reasonable precision. [5][6][7] This is due to fewer events recorded in patients receiving targeted therapies and thus high uncertainty around the effectiveness estimates and, as a consequence, around the estimates of association between the treatment effects on the surrogate endpoint and final outcome. It is therefore possible that a putative surrogate endpoint cannot be validated, and treatments may not be granted conditional approval based on treatment effects on the questionable surrogate endpoint, or, if regulatory approval is granted based on an unreliable surrogate endpoint, approval may be withdrawn at the re-evaluation stage when more data on the final outcome are collected, resulting in a waste of resources. 2 However, in recent years, there has been increased interest in the use of real-world evidence (RWE) at all stages of drug development. [8][9][10][11][12][13][14][15][16] While RWE is subject to a higher risk of bias compared with data from RCTs, RWE also has many advantages. 
It has the potential to increase the evidence base for decision making, often includes data recorded over longer follow-up times, and can be more generalizable to the target population. [17][18][19] The addition of RWE could improve validation of surrogate endpoints that could not be validated using RCT data alone.
In this article, we explored how RWE can be used to strengthen the evidence base for surrogate endpoint evaluation. We made use of comparative RWE (cRWE) and single-arm RWE (sRWE) to supplement RCT data on the effectiveness of antiangiogenic therapies in metastatic colorectal cancer (mCRC). We then investigated the impact of the addition of RWE on the estimates of the surrogate relationship between treatment effects on progression-free survival (PFS) and overall survival (OS).
The remainder of this article is structured as follows. Data sources and the statistical methods are described in the ''Methods'' section. The results are presented in the next section, which is followed by discussion and conclusions in the fourth and fifth sections, respectively.

Data Sources
RCTs. Data were obtained from a prior literature review conducted by Ciani et al., 20 which included treatment effect estimates from 11 RCTs in mCRC that assessed antiangiogenic treatments, such as bevacizumab, combined with various chemotherapies, such as FOLFOX. In the review, Ciani et al. 20 defined OS as time from randomization to time of death and PFS as time from randomization to tumor progression or death from any cause.
For this article, we extracted the treatment effects (logHRs) on PFS and OS. RCTs were included in the analysis only when the control arm was chemotherapy and the treatment arm was an antiangiogenic treatment plus chemotherapy.
Biostatistics Research Group, Department of Health Sciences, University of Leicester, UK (LW, AP, SB); GlaxoSmithKline R&D Centre, GlaxoSmithKline, Stevenage, UK (AP); Leicester Cancer Research Centre, University of Leicester, Leicester, UK (AT). The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Sylwia Bujkiewicz has served as a paid consultant, providing methodological advice, to NICE, Roche, and RTI Health Solutions; received payments for educational events from Roche; and received research funding from European Federation of Pharmaceutical Industries Associations (EEPIA) and Johnson & Johnson. Anne Thomas has served as a paid consultant, has received payment for lectures, presentations, speakers bureaus, manuscript writing, or educational events and received payment for expert testimony from Bristol Myers Squibb. All other authors have no conflicts of interest to declare. The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research was conducted in the Health Sciences Department at the University of Leicester. LW was funded by a National Institute for Health Research (NIHR) Predoctoral Fellowship (NIHR301013). The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript. SB was funded by Medical Research Council Methodology Research Panel (grant No. MR/R025223/1). The funding agreement ensured the authors' independence in designing the study, interpreting the data, writing, and publishing the report.
Comparative real-world evidence. We carried out a literature review to identify cRWE evaluating antiangiogenic treatments for mCRC. The following combinations of terms were used to search for studies published between 2000 and 2020 in the PubMed database: 1) ''metastatic colorectal cancer''; 2) ''cohort,'' ''cohort study,'' ''retrospective,'' or ''prospective''; 3) ''PFS''; 4) ''OS''; and 5) ''antiangiogenic'' or ''bevacizumab.'' Abstracts, titles, and, where necessary, full articles were screened, and studies that were not relevant were removed. LogHRs on PFS and OS and their corresponding standard errors were extracted from the remaining studies. To account for potential bias, cRWE studies were included only if they reported treatment effects adjusted for baseline characteristics or potential confounders.

Matching Single-Arm Studies
Unlike RCTs and cRWE, treatment effects cannot be extracted directly from sRWE, as single-arm studies do not make comparisons. To obtain relative treatment effects in the absence of individual participant data (IPD), sRWE studies were matched using aggregate-level data according to the method proposed by Schmitz et al. 22 The distance $D_{tot}[j,k]$ between any 2 single-arm studies $j$ and $k$ was determined as the weighted average of differences in covariates:
$$D_{tot}[j,k] = \frac{\sum_{c=1}^{n} w_c \, D_c[j,k]}{\sum_{c=1}^{n} w_c} \qquad (1)$$
where $n$ is the number of covariates, $w_c$ is the weight of covariate $c$, and $D_c[j,k]$ is the difference in covariate $c$ between studies $j$ and $k$, scaled to lie between 0 and 1. The 0 to 1 scaling ensures that all covariates have the same impact on the distance before the weights are applied. The covariate weights were decided based on rankings from a consensus statement. 23 This distance takes a value between 0 and 1, where smaller values indicate more similar studies. Distance measures between treatment arms for RCTs and cRWE were also calculated. There is no consensus on how small the distance measure should be for 2 single-arm studies to be considered sufficiently similar and thus suitable for matching. Therefore, the maximum distance measure observed between arms of RCTs was selected as the maximum allowable distance for matching single-arm studies. The maximum distance measure observed between arms in cRWE studies was selected as the maximum allowable distance for matching sRWE studies in a sensitivity analysis. Pairs of single-arm studies achieving a distance measure below the matching threshold were considered for matching, while pairs of single-arm studies above the matching threshold were not matched. Each single-arm study could be included only once (only in a single matched pair) to avoid double counting of data. Where multiple matches were possible, matches with the smallest distance measure were used.
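The matching rules described above (a threshold on the distance, each study used at most once, smallest distances preferred) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `match_studies`, the greedy smallest-first ordering, the study labels, and all distance values are hypothetical and are not taken from the paper.

```python
def match_studies(distances, threshold):
    """Greedily pair studies: smallest distance first, each study used once.

    `distances` maps a pair of study labels to its distance measure.
    The greedy ordering is an assumed way of resolving competing matches.
    """
    matched, used = [], set()
    for (j, k), d in sorted(distances.items(), key=lambda item: item[1]):
        # Keep the pair only if it is below the threshold and neither
        # study already belongs to a matched pair (avoids double counting).
        if d < threshold and j not in used and k not in used:
            matched.append((j, k, d))
            used.update((j, k))
    return matched


# Hypothetical distance measures between single-arm studies A-D.
example_distances = {
    ("A", "B"): 0.012,
    ("A", "C"): 0.025,
    ("B", "C"): 0.020,
    ("C", "D"): 0.028,
    ("B", "D"): 0.045,
}
pairs = match_studies(example_distances, threshold=0.030)
print(pairs)
```

With these made-up values, A is paired with B first, which rules out the other candidate pairs involving A or B, and C is then paired with D.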

Obtaining Treatment Effects for Matched Single-Arm Studies
WebPlotDigitizer was used to extract data from Kaplan-Meier curves from each arm of the matched sRWE studies. Kaplan-Meier curves from RCTs and cRWE were also digitized to compare digitized and reported logHRs. Data from risk tables, reporting the number of patients at risk in each arm at regular time intervals, were also extracted to improve the approximated IPD. 24 Where risk tables were not reported, the number of patients and total number of events in each arm were used.
Data extracted from Kaplan-Meier curves and risk tables were used in Stata to reconstruct IPD using the ipdfc command by Wei and Royston. 25 The Cox proportional hazards model with a single covariate for treatment arm was used to analyze the reconstructed IPD to obtain logHRs on PFS and OS for matched sRWE studies, cRWE, and RCTs.
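To make the final step concrete, the following is a minimal pure-Python sketch of fitting a Cox proportional hazards model with a single binary treatment covariate to (reconstructed) IPD. It is not the Stata workflow used in the analysis; the function name `cox_log_hr`, the Breslow handling of ties, the fixed number of Newton-Raphson iterations, and the toy data are all assumptions made for illustration.

```python
import math


def cox_log_hr(times, events, arm, iterations=25):
    """Newton-Raphson fit of a Cox model with one binary covariate (arm 0/1).

    Ties are handled with the Breslow approximation. Returns the logHR and
    an approximate standard error from the observed information.
    """
    beta, info = 0.0, 0.0
    for _ in range(iterations):
        score, info = 0.0, 0.0
        for t_i, e_i, x_i in zip(times, events, arm):
            if not e_i:
                continue  # censored subjects contribute only to risk sets
            s0 = s1 = 0.0
            for t_j, x_j in zip(times, arm):
                if t_j >= t_i:  # subject j is still at risk at time t_i
                    w = math.exp(beta * x_j)
                    s0 += w
                    s1 += x_j * w
            p = s1 / s0  # weighted share of the treated arm in the risk set
            score += x_i - p
            info += p * (1.0 - p)
        beta += score / info
    return beta, 1.0 / math.sqrt(info)


# Toy data: 4 control patients followed by 4 treated patients, all with events.
times = [1, 2, 3, 4, 2, 4, 6, 8]
events = [1] * 8
arm = [0, 0, 0, 0, 1, 1, 1, 1]
log_hr, se = cox_log_hr(times, events, arm)
print(log_hr, se)
```

In this toy data set the treated arm has longer event times, so the estimated logHR is negative; swapping the arm labels flips its sign.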

Meta-analytic Methods for Surrogate Endpoint Evaluation
Before candidate surrogate endpoints are used in evaluation of a new health technology, they should be validated, by evaluating the strength of the association pattern between the treatment effects on the surrogate and final clinical outcomes and by assessing their ability to predict treatment effects on the final outcome, given treatment effects on the surrogate endpoint. 26 The standard model for surrogate endpoint evaluation by Daniels and Hughes, 27 denoted here as D&H, and bivariate random-effects meta-analysis (BRMA) using the product normal formulation (PNF) were used as alternative methods to model the correlated treatment effects (logHRs) on the surrogate endpoint (PFS) and final outcome (OS) using a Bayesian framework. The models were applied to RCT data alone, RCTs and cRWE, and RCTs, cRWE, and matched sRWE. Sensitivity analyses to vague prior distributions were conducted for both models.
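D&H model. Daniels and Hughes proposed that the observed treatment effects measured on the surrogate endpoint ($Y_{1i}$) and final outcome ($Y_{2i}$) come from a bivariate normal distribution and estimate the underlying true effects on the surrogate and final outcomes ($\delta_{1i}$ and $\delta_{2i}$) from each study $i$, with corresponding within-study standard deviations ($\sigma_{1i}$ and $\sigma_{2i}$) and within-study correlation ($\rho_{wi}$):
$$\begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} \delta_{1i} \\ \delta_{2i} \end{pmatrix}, \begin{pmatrix} \sigma_{1i}^2 & \sigma_{1i}\sigma_{2i}\rho_{wi} \\ \sigma_{1i}\sigma_{2i}\rho_{wi} & \sigma_{2i}^2 \end{pmatrix} \right) \qquad (2)$$
The true effects measured on the surrogate endpoint ($\delta_{1i}$) are assumed to be independent in each study. It is also assumed that there is a linear relationship between the true treatment effects on the final outcome and the surrogate endpoint,
$$\delta_{2i} \mid \delta_{1i} \sim N\left(\lambda_0 + \lambda_1 \delta_{1i}, \; \psi_2^2\right) \qquad (3)$$
where the intercept ($\lambda_0$), slope ($\lambda_1$), and conditional variance ($\psi_2^2$) are used as criteria for evaluating the surrogate relationship.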
Daniels and Hughes considered a surrogate relationship perfect when the following conditions were met: (a) $\lambda_0 = 0$, (b) $\lambda_1 \neq 0$, and (c) $\psi_2^2 = 0$. These conditions state that 1) no treatment effect on the surrogate endpoint implies no treatment effect on the final outcome; 2) the slope is not zero, implying an association between treatment effects on the surrogate and final outcomes; and 3) treatment effects on the final outcome can be perfectly predicted by treatment effects on the surrogate endpoint. The 3 criteria set out by Daniels and Hughes ($\lambda_0 = 0$, $\lambda_1 \neq 0$, $\psi_2^2 = 0$) were used to evaluate the surrogate relationship between treatment effects on PFS and treatment effects on OS in mCRC when using the D&H model. To implement this model in a Bayesian framework, noninformative prior distributions were placed on the fixed effects, $\delta_{1i} \sim N(0, 10^4)$, and regression parameters, $\lambda_0, \lambda_1 \sim N(0, 10^4)$. To ensure a vague prior distribution on the conditional variance, a uniform prior distribution was placed on the conditional standard deviation, $\psi_2 \sim \mathrm{Unif}(0, 2)$. A sensitivity analysis was conducted using a prior distribution of $\psi_2 \sim \mathrm{Unif}(0, 100)$. A minimally informative prior distribution, $\rho_{wi} \sim \mathrm{Unif}(0, 1)$, was placed on the within-study correlation. In a prior publication, Papanikos 21 obtained IPD from 1 RCT 28 included in this analysis and used bootstrapping to obtain a within-study correlation of 0.52. To assess the robustness of the model, a sensitivity analysis was conducted assuming a within-study correlation of 0.52 for all studies.
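As a rough illustration of how the 3 criteria can be read off posterior summaries, the sketch below checks whether the 95% CrI for the intercept contains zero, whether the 95% CrI for the slope excludes zero, and whether the conditional variance is small. The function names, the tolerance `var_tolerance` for "close to zero", and all numbers are hypothetical; the paper does not specify a numerical cut-off for the conditional variance.

```python
def excludes_zero(interval):
    """True if a (lower, upper) interval lies entirely on one side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


def check_dh_criteria(intercept_cri, slope_cri, cond_var_upper, var_tolerance):
    """Summarize the Daniels and Hughes criteria from 95% CrIs.

    `var_tolerance` is an assumed, user-chosen cut-off, not a published rule.
    """
    return {
        "intercept_consistent_with_zero": not excludes_zero(intercept_cri),
        "slope_differs_from_zero": excludes_zero(slope_cri),
        "conditional_variance_small": cond_var_upper <= var_tolerance,
    }


# Hypothetical posterior summaries, for illustration only.
result = check_dh_criteria(
    intercept_cri=(-0.10, 0.12),
    slope_cri=(0.20, 0.90),
    cond_var_upper=0.05,
    var_tolerance=0.10,
)
print(result)
```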
BRMA (PNF). The D&H model does not estimate correlation or study-level $R^2$, which are often used to assess the strength of a surrogate relationship. [29][30][31] For example, the German Institute for Quality and Efficiency in Healthcare defined an acceptable surrogate endpoint by setting a lower bound for the confidence interval (CI) on the correlation coefficient to be 0.85. 32 To estimate the between-studies correlation and study-level $R^2$, we used BRMA PNF. The most popular form of the BRMA was described by van Houwelingen et al. 33 and Riley et al., 34 while Bujkiewicz et al. 35 proposed a parameterization of this model such that the between-studies model could be presented as a product of univariate conditional distributions in the so-called PNF. BRMA PNF has the same within-study model as the D&H model (2), but the between-studies model assumes exchangeability of the correlated true (random) treatment effects on both outcomes. In the PNF, the bivariate normal distribution is represented as a sequence of univariate conditional distributions:
$$\delta_{1i} \sim N\left(\eta_1, \psi_1^2\right), \qquad \delta_{2i} \mid \delta_{1i} \sim N\left(\eta_{2i}, \psi_2^2\right), \qquad \eta_{2i} = \lambda_0 + \lambda_1 \delta_{1i} \qquad (4)$$
where $\delta_{1i}$ and $\delta_{2i}$ are the true effects in the population, which are correlated, assumed exchangeable, and normally distributed, and $\eta_1$ is the mean true effect on the surrogate endpoint.
The parameters of the BRMA PNF model can be represented in terms of the parameters of the bivariate normal distribution using the following formulae: 36
$$\psi_1^2 = \tau_1^2, \qquad \psi_2^2 = \tau_2^2 - \lambda_1^2 \tau_1^2, \qquad \lambda_1 = \rho \, \frac{\tau_2}{\tau_1} \qquad (5)$$
where $\tau_1$ and $\tau_2$ are the between-studies heterogeneity parameters and $\rho$ is the between-studies correlation. To implement this model in a Bayesian framework, vague prior distributions were placed on the between-studies parameters, $\tau_1, \tau_2 \sim \mathrm{Unif}(0, 2)$ and $\rho \sim \mathrm{Unif}(-1, 1)$, and on the intercept, $\lambda_0 \sim N(0, 10^4)$, implying prior distributions on $\lambda_1$, $\psi_1^2$, and $\psi_2^2$ through rearranging the relationships (5). To assess the robustness of the model, sensitivity analyses were conducted assuming a within-study correlation of 0.52 for all studies and prior distributions of $\tau_1, \tau_2 \sim \mathrm{Unif}(0, 100)$ on the between-studies heterogeneity parameters.
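A small sketch may help show how priors on the heterogeneity parameters and correlation translate into the PNF parameters. It assumes the relationships take the standard bivariate normal form, that is, slope $\rho\tau_2/\tau_1$, first variance $\tau_1^2$, and conditional variance $\tau_2^2 - \lambda_1^2\tau_1^2$; the function name `pnf_from_bivariate` and the input values are hypothetical.

```python
def pnf_from_bivariate(tau1, tau2, rho):
    """Map between-studies SDs and correlation to PNF parameters.

    Assumes the standard bivariate normal identities; returns
    (lambda1, psi1_squared, psi2_squared).
    """
    lambda1 = rho * tau2 / tau1
    psi1_sq = tau1 ** 2
    # Equivalent to tau2^2 * (1 - rho^2): the variance left unexplained.
    psi2_sq = tau2 ** 2 - lambda1 ** 2 * tau1 ** 2
    return lambda1, psi1_sq, psi2_sq


# Hypothetical values: moderate heterogeneity, strong correlation.
lambda1, psi1_sq, psi2_sq = pnf_from_bivariate(tau1=0.2, tau2=0.1, rho=0.8)
print(lambda1, psi1_sq, psi2_sq)
```

Under these identities a correlation of exactly 1 or -1 drives the conditional variance to zero, which is why a perfect correlation and a zero conditional variance describe the same ideal surrogate relationship.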
In addition to the surrogacy criteria from the D&H model, a perfect surrogate relationship is defined in the BRMA PNF model when $\rho = \pm 1$. 26 This implies a perfect linear association between treatment effects on the surrogate endpoint and final outcome. The study-level $R^2$ in this random-effects model is equal to $\rho^2$. 31 The 3 criteria set out by Daniels and Hughes ($\lambda_0 = 0$, $\lambda_1 \neq 0$, $\psi_2^2 = 0$) and $\rho = \pm 1$ were used to evaluate the surrogate relationship between treatment effects on PFS and treatment effects on OS in mCRC when using the BRMA PNF model. In addition, we report the between-study correlation, with values of $\rho = \pm 1$ corresponding to a perfect surrogate relationship.
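Accounting for bias. Non-randomized studies are susceptible to additional risk of bias due to lack of randomization and unmeasured confounding, and this additional bias can manifest in different treatment effects between different study designs. 37 To account for potential systematic differences in treatment effects between RCTs, cRWE, and sRWE, the BRMA PNF model was extended in the following way. The between-studies model (4) remains the same for all studies, and the within-study model (2) remains the same for RCTs. However, the within-study models for cRWE and matched sRWE include bias terms, $\alpha_{ji}$ and $\beta_{ji}$, respectively, for the surrogate and final outcomes ($j = 1, 2$), where the bias terms can differ for each individual RWE study ($i$):
$$\begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} \delta_{1i} + \alpha_{1i} \\ \delta_{2i} + \alpha_{2i} \end{pmatrix}, \Sigma_i \right) \text{ for cRWE}, \qquad \begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} \delta_{1i} + \beta_{1i} \\ \delta_{2i} + \beta_{2i} \end{pmatrix}, \Sigma_i \right) \text{ for sRWE} \qquad (6)$$
where $\Sigma_i$ is the within-study covariance matrix from (2). The bias terms for cRWE and sRWE for each endpoint ($j = 1, 2$) are assumed to come, across studies, from a single normal distribution for each type of evidence, with means $\alpha_j$ and $\beta_j$, respectively, and a corresponding variance. Additional non-informative prior distributions were placed on the mean bias terms, $\alpha_1, \alpha_2 \sim N(0, 10^4)$ and $\beta_1, \beta_2 \sim N(0, 10^4)$.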
Cross-validation. The bivariate nature of the D&H and BRMA PNF models allows both validation of a surrogate endpoint and prediction of an unobserved treatment effect on the final outcome given observed treatment effects on the surrogate endpoint.
To assess the predictive value of a candidate surrogate endpoint, a ''take-one-out'' cross-validation procedure was conducted for the D&H and BRMA PNF models. The cross-validation was carried out with RCT data alone, RCTs and cRWE, and RCTs, cRWE, and matched sRWE. The take-one-out procedure is repeated as many times as there are studies in the data set. For example, when conducting cross-validation on RCTs alone, with the number of RCTs in the data set given by $n_{RCT}$, the cross-validation procedure is repeated $N = n_{RCT}$ times.
For each study $i$ ($i = 1, \ldots, N$), the treatment effect on the final outcome, $Y_{2i}$, was removed and assumed missing at random. The treatment effect on the final outcome was predicted from the treatment effect on the surrogate endpoint, conditional on data on both outcomes from all other studies in the meta-analysis. The mean predicted effect is equal to the mean predicted true effect from Markov chain Monte Carlo (MCMC) simulation. The variance of the predicted effect is $\sigma_{2i}^2 + \mathrm{var}\left(\delta_{2i} \mid Y_{1i}, \sigma_{1i}, Y_{1(-i)}, Y_{2(-i)}\right)$, where $Y_{1(-i)}$ and $Y_{2(-i)}$ are the observed treatment effects on the surrogate and final outcomes for the remaining studies not omitted in the $i$th iteration. 3,27 For a valid surrogate, we expect the 95% predicted interval to contain the true treatment effect in 95% of studies. However, the true treatment effects are unknown in the cross-validation, which limits the cross-validation to comparing the predicted treatment effect estimates with the observed (but assumed missing) treatment effects on the final outcome.
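The variance formula above can be illustrated with a short sketch that combines the within-study variance of the omitted study with the posterior variance of its predicted true effect. The function name `predicted_interval`, the use of a normal approximation (multiplier 1.96) to form the 95% predicted interval, and the toy MCMC draws are assumptions made for illustration only.

```python
import math
from statistics import fmean, variance


def predicted_interval(delta2_draws, sigma2, z=1.96):
    """Predicted effect on the final outcome for an omitted study.

    `delta2_draws` are MCMC draws of the predicted true effect and `sigma2`
    is the within-study standard deviation of the omitted study. The
    normal-approximation interval is an assumption of this sketch.
    """
    mean = fmean(delta2_draws)
    var = sigma2 ** 2 + variance(delta2_draws)  # within-study + posterior
    half_width = z * math.sqrt(var)
    return mean, var, (mean - half_width, mean + half_width)


# Toy draws standing in for a long MCMC chain.
draws = [-0.2, -0.1, 0.0, 0.1, 0.2]
mean, var, interval = predicted_interval(draws, sigma2=0.15)
print(mean, var, interval)
```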
To assess the accuracy of predictions on the final outcome, we compare the predicted treatment effect on OS ($\hat{Y}_{2j}$) to the observed treatment effect on OS ($Y_{2j}$) by summarizing the absolute discrepancy between these values. For a perfectly predicted study, the absolute discrepancy will be zero. If the absolute discrepancy decreased with the addition of RWE, this would indicate that the addition of RWE improves the accuracy of predictions. To assess the precision of predictions on the final outcome, we compare the width of the 95% predicted interval ($w_{\hat{Y}_{2j}}$) and the width of the 95% observed CI ($w_{Y_{2j}}$) by summarizing the ratio of these 2 values, $w_{\hat{Y}_{2j}} / w_{Y_{2j}}$. If the ratio of the widths decreased with the addition of RWE, this would indicate that the addition of RWE improves the precision of predictions.
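The two summary measures can be sketched as follows. The median is used as the summary because the results report a median absolute discrepancy; the function names, dictionary keys, and all study values below are hypothetical.

```python
from statistics import median


def absolute_discrepancy(observed, predicted):
    """Absolute difference between observed and predicted logHR on OS."""
    return abs(observed - predicted)


def width_ratio(predicted_interval, observed_ci):
    """Width of the 95% predicted interval relative to the observed 95% CI."""
    pred_width = predicted_interval[1] - predicted_interval[0]
    obs_width = observed_ci[1] - observed_ci[0]
    return pred_width / obs_width


def summarize(studies):
    """Median accuracy and precision measures across cross-validated studies."""
    discrepancies = [absolute_discrepancy(s["obs"], s["pred"]) for s in studies]
    ratios = [width_ratio(s["pred_int"], s["obs_ci"]) for s in studies]
    return median(discrepancies), median(ratios)


# Hypothetical cross-validation output for 3 studies (logHR scale).
studies = [
    {"obs": -0.10, "pred": -0.15, "obs_ci": (-0.30, 0.10), "pred_int": (-0.65, 0.35)},
    {"obs": -0.25, "pred": -0.20, "obs_ci": (-0.45, -0.05), "pred_int": (-0.60, 0.20)},
    {"obs": 0.05, "pred": -0.05, "obs_ci": (-0.15, 0.25), "pred_int": (-0.45, 0.35)},
]
median_discrepancy, median_ratio = summarize(studies)
print(median_discrepancy, median_ratio)
```

A ratio above 1 means the predicted interval is wider than the observed CI, which is expected because the prediction carries the extra uncertainty of the surrogate relationship.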

Software and Computing
All models were implemented using WinBUGS, 38 in which estimates were obtained using MCMC simulation with 150,000 iterations (including 50,000 burn-in). Convergence was checked via visual assessment of history, density, and autocorrelation plots. Posterior estimates are presented as means (for approximately normal posterior) or medians (for skewed posterior) with 95% credible intervals (CrI). R was used for data manipulation, to execute WinBUGS code using the R2WinBUGS package, 39 and to produce figures using the ggplot2 package.

Summary of Data
Of the 11 RCTs obtained from the prior literature review, 4 were excluded for not investigating the effect of antiangiogenics in combination with chemotherapy against chemotherapy alone. Overall, 7 RCTs were included in the analysis. Details of these studies can be seen in Table A1 in Appendix A.
The database search of PubMed returned 166 publications for cRWE studies and 145 publications for sRWE studies on the chemotherapy arm. After screening titles, abstracts, and, where appropriate, full articles, 7 cRWE studies comparing bevacizumab against chemotherapy remained, and 8 sRWE studies of chemotherapy alone remained. Of the 7 cRWE studies, 4 adjusted for covariates. However, none of the 4 cRWE studies adjusted for all of the covariates recommended in the consensus statement, and thus treatment effects from cRWE could still be subject to considerable bias. Details of the covariates adjusted for are available in Table B1 in Appendix B.
The consensus statement identified 14 characteristics to include in the recommended set of baseline characteristics and a further 22 characteristics to include in the suggested set of baseline characteristics. Details of these baseline characteristics can be found in Table C1 in Appendix C. However, only 5 covariates were reported by all sRWE studies, and of these, sex was the only covariate not included in the recommended set of baseline characteristics. Following the consensus statement ranking, sex was given weight 1 and all other covariates weight 2. The covariates selected for matching were treatment line (weight = 2, current mean treatment line scaled assuming range 1-3), age (weight = 2, median age scaled assuming range 18-100), performance score (weight = 2, mean Eastern Cooperative Oncology Group/World Health Organization score scaled assuming range 0-3), tumor location (weight = 2, proportion with colon tumor compared with rectum tumor), and sex (weight = 1, proportion of females). An example of calculating the distance measure is available in Appendix D. Table 1 shows the distance measures between the sRWE studies. A maximum distance measure of 0.030 was applied, as this was close to the maximum distance measure from RCTs (0.027). This resulted in an exploration of 5% (n = 7) of possible matches. In Table 1, possible matches are shaded and final matches (lowest distance measures) are in bold. Overall, 2 matched sRWE studies were included in the analysis. Figures E1  and E2 in Appendix E show the digitized Kaplan-Meier curves for these matched sRWE studies. Appendix F shows that matching was reasonably robust to changes in covariate weights. A maximum distance measure of 0.055 was applied as a sensitivity analysis, as this was the maximum distance measure calculated between arms of cRWE. This resulted in an exploration of 23% (n = 30) of possible matches, which can be seen in Table G1 in Appendix G.  
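To show how the stated weights and scaling ranges combine into a single distance, here is a minimal sketch. The weights and ranges are those listed above; the use of an absolute difference divided by the assumed range, the dictionary keys, and the two study profiles are hypothetical and are not the worked example from Appendix D.

```python
# Assumed ranges used for 0-1 scaling and covariate weights, as stated above.
RANGES = {"line": (1, 3), "age": (18, 100), "ps": (0, 3),
          "colon": (0, 1), "female": (0, 1)}
WEIGHTS = {"line": 2, "age": 2, "ps": 2, "colon": 2, "female": 1}


def scaled_difference(name, x_j, x_k):
    """Absolute difference in one covariate, scaled by its assumed range."""
    low, high = RANGES[name]
    return abs(x_j - x_k) / (high - low)


def distance(study_j, study_k):
    """Weighted average of the scaled covariate differences."""
    weighted = sum(
        WEIGHTS[c] * scaled_difference(c, study_j[c], study_k[c])
        for c in WEIGHTS
    )
    return weighted / sum(WEIGHTS.values())


# Hypothetical aggregate characteristics of 2 single-arm studies.
study_j = {"line": 1.0, "age": 62, "ps": 0.5, "colon": 0.70, "female": 0.40}
study_k = {"line": 1.2, "age": 66, "ps": 0.6, "colon": 0.65, "female": 0.45}
d = distance(study_j, study_k)
print(round(d, 4))
```

For these made-up profiles the distance is about 0.057, which would exceed both thresholds used in the analysis (0.030 and 0.055), so the pair would not be matched.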
Table 2 shows results from the D&H model, for which the code is available in Appendix H and history plots suggesting convergence are available in Appendix I. While there are small changes in the point estimates for the surrogacy parameters with the addition of cRWE and matched sRWE, the overall conclusions on evidence for a surrogate relationship do not change with the addition of RWE. There is evidence of a surrogate relationship regardless of the type of evidence used, as the 95% CrIs for the intercept contain zero, the 95% CrIs for the slope do not contain zero, and the conditional variances are close to zero. Figure 2 illustrates this relationship, highlighting that studies with larger treatment effects on PFS generally also have larger treatment effects on OS.

D&H Model
The addition of cRWE to RCTs improved the precision of all 3 estimates for the surrogate relationship while having a minimal impact on the point estimates. Using RCT data alone resulted in a conditional variance of 0.0089 (95% CrI: 0.0000, 0.11), while the addition of cRWE gave a conditional variance of 0.010 (95% CrI: 0.0002, 0.051). Thus, the addition of cRWE reduced [...] the conditional standard deviation. Column 4 of Table 2 shows that using a higher maximum value for matching sRWE studies (based on the highest distance for cRWE of 0.055), and thus including more matched sRWE studies, did not result in further improvements in precision of the surrogacy parameters.
Table 2 note. The last 2 columns provide the results from the D&H model using a lower matching threshold (0.030) and higher matching threshold (0.055) for matching sRWE. The last 2 rows provide the cross-validation results from the D&H model, where absolute discrepancy is the absolute difference between the observed logHR and the predicted logHR on OS, while $w_{\hat{Y}_{2j}} / w_{Y_{2j}}$ is the ratio of the width of the 95% predicted interval of the logHR on OS to the width of the 95% CI of the observed estimate of the logHR on OS.
Figure 2. Forest plots of HRs from RCTs, cRWE, and sRWE. Left panel: PFS; right panel: OS. Solid lines show the observed 95% confidence intervals, and dashed lines show 95% predicted intervals obtained from cross-validation using the D&H model. Blue shows RCTs, red shows cRWE, and gray shows matched sRWE. cRWE, comparative real-world evidence; D&H, Daniels and Hughes; HR, hazard ratio; OS, overall survival; PFS, progression-free survival; RCT, randomized controlled trial; sRWE, single-arm real-world evidence.
Table 3 shows there was weaker evidence for a surrogate relationship using the BRMA PNF compared with the D&H model. Although the between-study correlation was relatively high (0.74; using all evidence), the CrI corresponding to the correlation was wide (95% CrI: 0.065, 0.97).
Furthermore, the surrogacy criteria were not fully satisfied, as the 95% CrI for the slope contains zero when using RCTs alone and RCTs plus cRWE. This provides weaker evidence for an association between treatment effects on the surrogate endpoint and treatment effects on the final outcome. When including RCTs, cRWE, and matched sRWE, there was stronger evidence of an association, as shown by the positive value for the slope with a 95% CrI that excluded zero and by the conditional variance, which was reduced together with the upper end of its 95% CrI. However, the estimate for the slope was obtained with greater uncertainty compared with the slope estimated from the D&H model when using RCTs, cRWE, and matched sRWE. Such differences in results between the models could be a result of the assumption of random effects in the BRMA PNF model. When the assumption of normally distributed random effects is appropriate, greater borrowing of information can lead to more precise estimates. However, when this assumption is violated, the model can lead to overshrinkage of the true effects, thus potentially reducing the between-studies correlation. 56,57 This can be observed in Figure L1 in Appendix L, where the true effects (obtained from BRMA PNF) in the bubble plot are more shrunken toward the mean, in particular for PFS, compared with those obtained from the D&H model (depicted in Figure 1). This can lead to bias and reduce precision of estimates for surrogacy parameters.

BRMA (PNF)
Despite the slightly weaker evidence supporting a surrogate relationship compared with the D&H model, the addition of RWE generally improved the precision of estimates obtained from the BRMA PNF model. Using RCT data alone, the correlation was 0.75 (95% CrI: −0.25, 0.99), whereas when using all sources of evidence, the correlation was 0.74 (95% CrI: 0.065, 0.97). Thus, the addition of RWE reduced uncertainty by 27%, while there was little change in the point estimate. Tables J2 and K2 highlight that the BRMA PNF model is also robust to the assumptions about the within-study correlation and the choice of prior distributions for the between-studies heterogeneity parameters.
Table 3 note. The last 2 rows provide the cross-validation results from the BRMA PNF model, where the absolute discrepancy is the absolute difference between the observed logHR and the predicted logHR on OS, while $w_{\hat{Y}_{2j}} / w_{Y_{2j}}$ is the ratio of the width of the 95% predicted interval of the logHR on OS to the width of the 95% CI of the observed estimate of the logHR on OS.
Accounting for Bias
Table 3 shows results from the BRMA PNF model accounting for bias, described in the ''Accounting for Bias'' section. Relative to the model including RWE without accounting for bias, there was no improvement in the precision of the estimation of the intercept, slope, or conditional variance. In addition, the 95% CrI for the slope contained zero when using the model accounting for bias, providing weaker evidence for the association between treatment effects on the surrogate endpoint and treatment effects on the final outcome. However, relative to the model including RCT data only, when including RWE and accounting for bias, there was improvement in the precision of estimation of all surrogacy parameters while point estimates remained similar. For example, when using RCT data alone, the between-studies correlation was estimated to be 0.75 (95% CrI: −0.25, 0.99), but when RWE was included accounting for bias, the between-studies correlation was estimated at 0.73 (95% CrI: −0.22, 0.98). This indicated that the addition of RWE while accounting for bias resulted in a 3% improvement in precision of the between-studies correlation, while there was little change in the point estimate.
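The reported percentage improvements in precision can be reproduced, under one plausible reading, as the relative reduction in the width of the 95% CrI. The sketch below assumes that definition and takes the lower bounds of the RCT-only and bias-adjusted intervals as −0.25 and −0.22; the function name `width_reduction` is hypothetical.

```python
def width_reduction(before, after):
    """Relative reduction in interval width when moving from `before` to `after`."""
    width_before = before[1] - before[0]
    width_after = after[1] - after[0]
    return 1 - width_after / width_before


# 95% CrIs for the between-studies correlation reported in the text.
rct_only = (-0.25, 0.99)
all_evidence = (0.065, 0.97)      # RCTs + cRWE + matched sRWE
bias_adjusted = (-0.22, 0.98)     # RWE included, accounting for bias

print(round(100 * width_reduction(rct_only, all_evidence)))
print(round(100 * width_reduction(rct_only, bias_adjusted)))
```

Under this assumed definition the widths fall from 1.24 to 0.905 and from 1.24 to 1.20, which matches the 27% and 3% figures quoted in the text.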

Cross-validation
The last 2 rows of Table 2 show the results of cross-validation using the D&H model. The median absolute discrepancy between the predicted and observed treatment effects on OS decreased with the addition of cRWE and slightly increased with the further addition of sRWE. Table 2 also shows that the width of the predicted interval, relative to the observed CI, fell with the addition of cRWE (2.90 to 2.09) and sRWE (2.09 to 1.72), indicating that the precision of prediction improved with the addition of cRWE and matched sRWE. Cross-validation for the BRMA PNF model showed similar results (Table 3). Column 4 of Table 2 shows that relaxing the threshold for matching sRWE studies by using a higher maximum distance value, and thus including more matched sRWE studies in the analysis, resulted in slightly poorer accuracy and precision of predictions when using the D&H model.

Discussion
When existing clinical trial data are limited, surrogate endpoint validation may fail. As a result, new therapies may not receive conditional marketing authorization or, if approved, may still fail at the health technology assessment (HTA) decision-making stage when appraised by HTA agencies such as the National Institute for Health and Care Excellence (NICE). 2,3 In this article, we provide an approach for using RWE to strengthen the evidence base for surrogate endpoint validation.
When including RWE alongside RCT data, it is important to carefully consider the quality of studies to avoid excess bias. When selecting RWE, we aimed to minimize bias by using a strict matching threshold for sRWE and by including only cRWE studies that adjusted for covariates. In the motivating example, the inclusion of RWE improved the precision of estimation of the surrogacy parameters without substantially changing their point estimates. This implies that, in this example, inclusion of RWE reduced the uncertainty around the surrogacy parameter estimates without introducing bias into them. However, inclusion of RWE could result in substantially different point estimates for the surrogacy parameters, which could imply that the addition of RWE introduced bias into the surrogacy parameter estimates. In this case, any improvements in precision should be interpreted with caution, and the quality of the RWE included should be assessed to determine whether excess bias has been appropriately accounted for.
Despite the careful consideration given to the selection of RWE, this research has several limitations. Inclusion of sRWE relied on digitizing Kaplan-Meier curves. However, such curves are not always published, and thus, potentially useful studies would not be included. One method to overcome this issue is to extract median survival times for the surrogate and final outcomes and use an exponential hazard assumption, as proposed by Schmitz et al., 22 to obtain treatment effects. However, our preliminary analysis showed that the assumption of exponential hazard did not provide a good approximation in this case study.
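To make the exponential hazard assumption concrete: if survival in each arm follows an exponential distribution, the hazard is constant and equals ln(2) divided by the median survival time, so the HR reduces to the ratio of the control median to the treatment median. The sketch below shows only this standard relationship; it is not necessarily the exact procedure of Schmitz et al., it does not derive a standard error, and the function names and medians are hypothetical.

```python
import math


def exponential_hazard(median_survival):
    """Constant hazard implied by a median survival time under S(t) = exp(-hazard * t)."""
    return math.log(2) / median_survival


def log_hr_from_medians(median_treatment, median_control):
    """Log hazard ratio (treatment vs. control) under exponential survival in both arms."""
    return math.log(exponential_hazard(median_treatment) / exponential_hazard(median_control))


# Hypothetical medians in months (placeholders, not values from any included study)
log_hr = log_hr_from_medians(median_treatment=10.0, median_control=8.0)
print(round(math.exp(log_hr), 3))  # 0.8, i.e. the ratio 8.0 / 10.0
```

Because the approximation uses a single point on each survival curve, it can be poor when hazards are not constant over time, which is consistent with the preliminary finding reported above.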
Matching of sRWE was based on study-level covariates and thus was prone to bias, as patients were not randomized or matched at the individual level, and therefore, the exchangeability assumption may have been violated. 58 This bias was exacerbated by matching on only 5 covariates, when the consensus statement recommended 10 additional characteristics. However, these variables were not reported in the included studies. While acquisition of IPD would allow for use of complex methods such as propensity scoring to better adjust for measured covariates, IPD is often unavailable. For example, in HTA submissions, the manufacturer will have access to IPD from trials of their own technology; however, they are likely to have access to only aggregate data from studies of competitor technologies. Thus, in the absence of IPD, we propose matching sRWE based on aggregate-level data to permit inclusion of potentially useful single-arm studies.
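The idea of matching single-arm studies on aggregate-level covariates subject to a maximum distance can be sketched as follows. This is a hypothetical illustration only: the distance metric and matching algorithm used in the analysis are not specified in this section, so Euclidean distance on covariates already scaled to comparable units is assumed, and all names and values are invented.

```python
import math


def match_single_arm_studies(treated, controls, max_distance):
    """Pair each treated arm with its nearest control arm within max_distance.

    treated and controls map a study label to a tuple of study-level
    covariates (e.g., proportions), assumed to be on comparable scales.
    Euclidean distance is an assumption made for this sketch.
    """
    matches = []
    for t_label, t_cov in treated.items():
        best_label, best_dist = None, math.inf
        for c_label, c_cov in controls.items():
            dist = math.dist(t_cov, c_cov)
            if dist < best_dist:
                best_label, best_dist = c_label, dist
        # Keep the pair only if the closest control is close enough
        if best_label is not None and best_dist <= max_distance:
            matches.append((t_label, best_label, best_dist))
    return matches


# Invented study-level covariates (placeholders, not data from included studies)
treated = {"A": (0.60, 0.55), "B": (0.40, 0.90)}
controls = {"X": (0.62, 0.50), "Y": (0.10, 0.20)}

strict = match_single_arm_studies(treated, controls, max_distance=0.1)
relaxed = match_single_arm_studies(treated, controls, max_distance=0.5)
print(len(strict), len(relaxed))
```

With the strict threshold only the closest pair is retained, whereas the relaxed threshold admits an additional, less similar pair. This mirrors the trade-off described in the cross-validation results: a higher maximum distance adds studies but may also add bias. For simplicity, the sketch allows a control arm to be reused across pairs.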
The matched sRWE studies were analyzed according to the Cox proportional hazards model. It is possible that the assumption of proportional hazards was violated for some matched sRWE studies. However, all RCTs and cRWE studies in the analysis were analyzed using the log-rank test or Cox proportional hazards model, both of which assume proportional hazards. Visual inspection of Kaplan-Meier curves from RCTs and cRWE showed that some studies appeared to exhibit nonproportional hazards. Therefore, if accounting for nonproportional hazards present in sRWE studies, nonproportional hazards present in RCTs and cRWE studies should also be considered. This is a limitation often present in the meta-analysis of time-to-event outcomes, even when only RCT data are included. However, this issue is beyond the scope of this article and will be investigated in future work.
While RWE can increase the evidence base, it has limitations compared with RCTs. For example, despite using adjusted HRs for cRWE and matching for sRWE, RWE studies are likely to suffer from residual confounding. Furthermore, differences in data collection and evaluation of endpoints could make the treatment effects obtained from RCTs and RWE unsuitable for synthesis in a single meta-analysis. 59 For example, participants of RCTs are followed up for a predefined period of time, and progression is assessed using predefined quantitative measures. 60 However, in RWE, information is recorded as patients attend appointments and progression is based on clinical interpretation of imaging. Furthermore, RCTs frequently define time zero as the time of randomization, whereas RWE studies define time zero as the time of treatment initiation. 61 However, a study in oncology found that after adjusting for potential confounders, endpoints such as real-world PFS from RWE were similar to those observed in RCTs of participants given the same treatment. 61 While the BRMA PNF model was extended to account for potential systematic differences in treatment effects between data sources, all sources of evidence contributed the same weight to the model, which implicitly treats RCTs, cRWE, and matched sRWE as being of equivalent quality. However, RCTs have traditionally been considered the gold standard of evidence, as they provide more reliable sources of information about treatment effects compared with RWE. 62 Further methodological research is needed to account for such differences in quality.
Finally, while the methods detailed in this article can be used to investigate different surrogate endpoints in different disease areas, the improvements in precision observed in the motivating example of PFS as a surrogate endpoint for OS in mCRC cannot be guaranteed in other disease areas or surrogate endpoints. The degree of improvement in precision will depend on a number of factors, including the quantity and heterogeneity of available evidence and the true underlying surrogacy pattern.

Conclusions
RWE can be used to improve the precision of estimates for surrogate endpoint validation relative to using RCT data alone. The addition of RWE to RCT data also allows for more precise predictions to be made of the treatment effects on the final clinical outcome based on the treatment effect measured on the surrogate endpoint. When incorporated in a decision-modeling framework, such improved estimates can lead to cost-effectiveness estimates being obtained with reduced uncertainty, which can in turn lead to more robust policy decisions.

Research Data
All data generated or analyzed in this study and the code used to analyze the data are included in the published article and its supplementary files.