A meta-analytic approach to large-scale competing risks regression
In survival analysis, competing risks regression is necessary when estimating effects on specific causes. Regression implementations are computationally intensive: Kirby et al. (2013) showed that modelling a sample size of 100,000 can take 72 hours to compute. This paper illustrates an alternative to their proposed sampling strategy for improving speed: dividing the complete dataset into computationally manageable sub-groups, computing separate estimates, then combining these to estimate the overall effect. The approach is illustrated with an 80,000-case simulation, which shows that sub-groups of 4,000 reduce computation from 3 hours to less than 2 minutes without substantial loss in accuracy.
The National Cancer Registry contains detailed records on all tumours diagnosed in Ireland since 1994. There are currently over 500,000 records in the registry. Competing risks regression (Fine and Gray 1999) is often required to analyse effects of variables on cumulative incidence of cancer-specific death. Conducting a competing risk analysis on the complete registry is not computationally feasible.
Competing risk data was simulated using the method of Beyersmann et al. (2009). The configuration included 1 covariate and 2 event types, resulting in a sub-distribution hazard ratio (SHR) for cause A of 2.745. A dataset with 80,000 cases was simulated and systematically divided into sub-groups of sizes 1,000 to 20,000. Sub-group estimates were pooled using inverse-variance weighting and compared with the estimate from the complete data. PSHREG (Kohl et al. 2014) in SAS was used for all estimates.
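The pooling step can be sketched in Python as follows. This is a minimal illustration of fixed-effect inverse-variance weighting on the log-SHR scale, not the SAS code used in the study; the function names and the example inputs are hypothetical, and the assumption that pooling is done on the coefficient (log) scale is ours.

```python
import math

def pool_inverse_variance(log_estimates, standard_errors):
    """Fixed-effect inverse-variance pooling of log-scale estimates.

    Each sub-group coefficient is weighted by 1 / SE^2; the pooled
    standard error is sqrt(1 / sum of weights).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se

def shr_with_ci(log_shr, se, z=1.959964):
    """Back-transform a log-SHR and its SE to an SHR with a 95% CI."""
    return (
        math.exp(log_shr),
        math.exp(log_shr - z * se),
        math.exp(log_shr + z * se),
    )

if __name__ == "__main__":
    # Hypothetical sub-group results (illustrative values, not from the study):
    # one log-SHR coefficient and one standard error per sub-group.
    coefs = [1.02, 0.99, 1.01, 1.00]
    ses = [0.075, 0.074, 0.076, 0.075]
    pooled, pooled_se = pool_inverse_variance(coefs, ses)
    shr, lower, upper = shr_with_ci(pooled, pooled_se)
    print(f"Pooled SHR {shr:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

With equally sized sub-groups the standard errors are similar, so the pooled coefficient is close to the simple mean of the sub-group coefficients, and the pooled standard error shrinks roughly with the square root of the number of sub-groups.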
The SHR estimated on the complete dataset was 2.74, 95% CI (2.656, 2.838), computed in 178 minutes. All pooled sub-group estimates were within 0.45% of the SHR based on the complete data (coefficient: 0.45%, standard error: 0.01%). In this scenario, time to compute was shortest (79 seconds) with 20 groups of 4,000 cases, which was 135 times faster than the analysis on the complete dataset. This pooled estimate matched that from the complete data to 2 decimal places: 2.74, 95% CI (2.653, 2.835).
Dividing a large dataset into sub-groups drastically reduced computation time, with less than 0.5% error in all estimates. In this simulation, time to compute increased for sub-groups larger than 4,000. The optimal number of sub-groups depends on observed rates of events, competing risks and censoring, as well as computing hardware. The largest feasible sub-group size should be selected.