A Modified Random Survival Forests Algorithm for High Dimensional Predictors and Self-Reported Outcomes

We present an ensemble tree-based algorithm for variable selection in high-dimensional datasets, in settings where a time-to-event outcome is observed with error. This work is motivated by self-reported outcomes collected in large-scale epidemiologic studies, such as the Women’s Health Initiative. The proposed methods apply equally to imperfect outcomes that arise in other settings, such as data extracted from electronic medical records. To evaluate the performance of the proposed algorithm, we present results from simulation studies that consider both continuous and categorical covariates. We illustrate the approach by applying it to discover single nucleotide polymorphisms associated with incident Type 2 diabetes in the Women’s Health Initiative. A freely available R package, icRSF, has been developed to implement the proposed methods. Supplementary material for this article is available online.