On Data Integration Problems With Manifolds

Version 3 2018-11-06, 00:50

Version 2 2018-10-12, 15:24

Version 1 2018-06-18, 21:24

dataset

posted on 2018-11-06, 00:50 authored by Mark V. Culp, Kenneth J. Ryan, Prithish Banerjee, Michael Morehead

This article focuses on data integration problems where the predictor variables for some response variable partition into known subsets. This type of data is often referred to as multi-view data, and each subset of the predictors is called a data view. Accounting for data views can add practical value in terms of both interpretation and predictive performance. Many existing approaches for multi-view data rely on view-agreement principles, strong smoothness assumptions, or regularization penalties. The first two kinds of approach can be sensitive to modest noise in the response or predictor variables, while the regularization approach is linear and can usually be outperformed. We develop semiparametric data integration methods to span key tradeoffs, including the bias-variance tradeoff in prediction error, the possibility that the data may be fully viewed with no appreciable view relationships, and the use of sparse anchor point methods to detect and use manifolds (i.e., possibly nonelliptical structures) within views if they enhance performance. Theoretical results help justify the new technique, and its effectiveness and computational feasibility are demonstrated empirically. This new semiparametric methodology is available for public use through the supplemental R package mvltools. Additional supplementary material for this article is also available online.
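
To make the multi-view setup concrete, the following is a minimal sketch of how a predictor matrix might be split into known views. It illustrates only the partitioning idea described above, not the semiparametric method itself and not the mvltools interface. The function name split_into_views, the view names, and the toy data are all hypothetical and chosen for illustration.

```python
from typing import Dict, List, Sequence


def split_into_views(
    X: Sequence[Sequence[float]],
    views: Dict[str, List[int]],
) -> Dict[str, List[List[float]]]:
    """Split the columns of X into per-view sub-matrices.

    `views` maps a (hypothetical) view name to the column indices it owns.
    A ValueError is raised unless the views partition the columns, i.e.
    every predictor column belongs to exactly one view.
    """
    n_cols = len(X[0]) if X else 0
    assigned = sorted(j for cols in views.values() for j in cols)
    if assigned != list(range(n_cols)):
        raise ValueError("views must partition the predictor columns")
    return {
        name: [[row[j] for j in cols] for row in X]
        for name, cols in views.items()
    }


# Hypothetical toy data: 3 observations, 5 predictors in 2 views.
X = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0, 9.0, 10.0],
    [11.0, 12.0, 13.0, 14.0, 15.0],
]
views = {"view_a": [0, 1], "view_b": [2, 3, 4]}
by_view = split_into_views(X, views)
```

Each entry of by_view holds the same observations restricted to one view, which is the form a view-aware method would consume. The partition check reflects the assumption stated in the abstract that the subsets are known and non-overlapping.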