LINCSAnalytics: An integrated platform for efficient querying and computation across diverse LINCS signatures
The Library of Integrated Network-based Cellular Signatures (LINCS) program generates a wide variety of cell-based perturbation-response signatures using diverse assay technologies. A signature, defined as a specific cellular response to a given perturbation, can hence be expressed as a function of a set of parameters: the model system (typically a cell), the perturbation (e.g. a small molecule), the detected analytes (e.g. genes expressed in a transcriptional profiling assay), and additional experimental details (such as concentration and time). To use LINCS data effectively across a wide variety of scientific use cases, signatures need to be readily queryable, retrievable and accessible for computation as a function of all of these dimensions.
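The view of a signature as a function of its parameters can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `Condition`, the function `signature`, the field names and all values are hypothetical and do not reflect the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Condition:
    """Hypothetical key: the dimensions that identify one signature."""
    model_system: str        # e.g. a cell line
    perturbation: str        # e.g. a small molecule
    concentration_um: float  # experimental detail: concentration
    time_h: float            # experimental detail: time point


# Each condition maps to its detected analytes and their readouts.
SignatureStore = Dict[Condition, Dict[str, float]]


def signature(store: SignatureStore, condition: Condition) -> Dict[str, float]:
    """Return the analyte -> response mapping for one condition."""
    return store[condition]


# Made-up example data, for illustration only.
store: SignatureStore = {
    Condition("cell_A", "compound_X", 10.0, 24.0): {"gene_1": 1.8, "gene_2": -0.4},
    Condition("cell_A", "compound_X", 10.0, 6.0): {"gene_1": 0.3, "gene_2": -0.1},
}

late = signature(store, Condition("cell_A", "compound_X", 10.0, 24.0))
print(late["gene_1"])
```

Changing any one field of the key (here, the time point) selects a different signature, which is why every dimension needs to be queryable.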
Here we present a computational platform built on top of the open source Cloudera Hadoop platform, which allows the distributed storage and processing of large datasets through a number of dedicated modules. LINCS signature data and standardized entity metadata are stored in the Hadoop Distributed File System (HDFS). Apache Hive and Impala are responsible for the fast query and retrieval of any data point, while computation and modeling are available through Apache Spark and its R interface, sparklyr. Full access to the core of the platform is provided via a set of APIs, which also allow users to build and deploy custom applications. As an initial demonstration, we show a simple R Shiny application to interactively query and retrieve LINCS signatures for any dimension of interest.
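Retrieval by any combination of dimensions amounts to a filtered SQL query over a long-format table. The sketch below uses Python's built-in `sqlite3` as a stand-in for Hive or Impala so that it runs anywhere; the table name `signatures`, its columns, the helper `query_signatures` and the rows are assumptions made for illustration, not the platform's schema or API.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive/Impala layer.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE signatures (
           cell TEXT, perturbation TEXT, analyte TEXT,
           concentration_um REAL, time_h REAL, value REAL)"""
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO signatures VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("cell_A", "compound_X", "gene_1", 10.0, 24.0, 1.8),
        ("cell_A", "compound_X", "gene_2", 10.0, 24.0, -0.4),
        ("cell_B", "compound_X", "gene_1", 10.0, 24.0, 0.9),
        ("cell_B", "compound_Y", "gene_1", 1.0, 6.0, 0.2),
    ],
)

ALLOWED = {"cell", "perturbation", "analyte", "concentration_um", "time_h"}


def query_signatures(conn, **filters):
    """Return rows matching every given dimension=value filter."""
    unknown = set(filters) - ALLOWED
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    sql = ("SELECT cell, perturbation, analyte, concentration_um, time_h, value "
           "FROM signatures")
    if filters:
        # Column names are checked against ALLOWED; values are bound parameters.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    sql += " ORDER BY cell, perturbation, analyte, time_h"
    return conn.execute(sql, tuple(filters.values())).fetchall()


for row in query_signatures(conn, perturbation="compound_X", analyte="gene_1"):
    print(row)
```

Keeping one row per data point means a filter on cell, perturbation, analyte, concentration or time is the same kind of query.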
To enable the computational biology community to use LINCS data in their research via the LINCS Analytics platform, we have deployed an R package that allows users to retrieve the available data and metadata for any dimension of interest. It also supports on-the-fly aggregation of replicates and filtering by desired output values.
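Aggregating replicates and filtering by output value can be sketched as follows. This is not the R package's interface: the functions `aggregate_replicates` and `filter_by_value`, the field names and the data are hypothetical, and the mean is assumed as the aggregation statistic because the text does not specify one.

```python
from collections import defaultdict
from statistics import mean


def aggregate_replicates(rows, key_fields, value_field="value"):
    """Collapse rows that share the same key fields into their mean value."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        groups[key].append(row[value_field])
    aggregated = []
    for key, values in groups.items():
        record = dict(zip(key_fields, key))
        record[value_field] = mean(values)   # assumed statistic: mean
        record["n_replicates"] = len(values)
        aggregated.append(record)
    return aggregated


def filter_by_value(rows, min_abs, value_field="value"):
    """Keep rows whose absolute value is at least min_abs."""
    return [r for r in rows if abs(r[value_field]) >= min_abs]


# Made-up replicate measurements, for illustration only.
rows = [
    {"cell": "cell_A", "perturbation": "compound_X", "analyte": "gene_1", "value": 2.0},
    {"cell": "cell_A", "perturbation": "compound_X", "analyte": "gene_1", "value": 4.0},
    {"cell": "cell_A", "perturbation": "compound_X", "analyte": "gene_2", "value": 0.5},
]

merged = aggregate_replicates(rows, ["cell", "perturbation", "analyte"])
strong = filter_by_value(merged, min_abs=1.0)
print(strong)
```

In this sketch replicates are aggregated before the threshold is applied, so that the filter acts on the summarized response and not on individual measurements.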