
Expression vs genomics for predicting dependencies

dataset
posted on 2024-05-17, 14:29, authored by Broad DepMap

This dataset supports the "Gene expression has more power for predicting in vitro cancer cell vulnerabilities than genomics" preprint by Dempster et al. To generate the figure panels seen in the preprint using these data, use FigurePanelGeneration.ipynb. This study includes five datasets (citations and details in manuscript).

Achilles: the Broad Institute's DepMap public 19Q4 CRISPR knockout screens processed with CERES

Score: The Wellcome Sanger Institute's Project Score CRISPR knockout screens processed with CERES

RNAi: The DEMETER2-processed combined dataset which includes RNAi data from Achilles, DRIVE, and Marcotte breast screens.

PRISM: The PRISM pooled in vitro repurposing primary screen of compounds

GDSC17: In vitro cancer drug screens performed by Sanger


The files of most interest to a biologist are the Summary.csv files. If you are interested in trying machine learning, the files Features.hdf5 and Target.hdf5 contain the data prepared in a form convenient for standard supervised machine learning algorithms.


Some large files are stored in the binary hdf5 format for efficiency in storage space and read time. Each of these files contains three named hdf5 datasets: "dim_0" holds the row/index names as an array of strings, "dim_1" holds the column names as an array of strings, and "data" holds the matrix contents as a 2D array of floats. In Python, these files can be read in with:


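import h5py
import numpy as np
import pandas as pd


def read_hdf5(filename):
    """Read one of these hdf5 files into a pandas DataFrame."""
    src = h5py.File(filename, 'r')
    try:
        # Row and column names are stored as arrays of byte strings
        dim_0 = [x.decode('utf8') for x in src['dim_0']]
        dim_1 = [x.decode('utf8') for x in src['dim_1']]
        # Matrix contents as a 2D array of floats
        data = np.array(src['data'])
        return pd.DataFrame(index=dim_0, columns=dim_1, data=data)
    finally:
        # Close the file even if reading fails
        src.close()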
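To check the three-dataset layout without downloading anything, the following sketch writes a tiny file with "dim_0", "dim_1", and "data" and reads it back. The file name Toy.hdf5, the helper write_toy_hdf5, the row and column labels, and all values are invented for illustration. It assumes the labels are stored as byte strings, as the reader above implies; the real files may differ in string encoding details.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd


def read_hdf5(filename):
    """Same reader as above, repeated so this block runs on its own."""
    with h5py.File(filename, "r") as src:
        dim_0 = [x.decode("utf8") for x in src["dim_0"][:]]
        dim_1 = [x.decode("utf8") for x in src["dim_1"][:]]
        data = src["data"][:]
    return pd.DataFrame(index=dim_0, columns=dim_1, data=data)


def write_toy_hdf5(filename):
    """Hypothetical helper: write a tiny file in the same layout (made-up values)."""
    rows = np.array(["line_A", "line_B"], dtype="S")
    cols = np.array(["pert_1", "pert_2", "pert_3"], dtype="S")
    values = np.array([[0.1, -0.5, 0.0], [1.2, 0.3, -0.9]])
    with h5py.File(filename, "w") as dst:
        dst.create_dataset("dim_0", data=rows)
        dst.create_dataset("dim_1", data=cols)
        dst.create_dataset("data", data=values)


toy_path = os.path.join(tempfile.mkdtemp(), "Toy.hdf5")
write_toy_hdf5(toy_path)
toy = read_hdf5(toy_path)
print(toy)
```

The same read_hdf5 call should apply to the real files, for example Features.hdf5 and Target.hdf5, once they are downloaded.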


##################################################################

Files (not every dataset will have every type of file listed below):

##################################################################


AllFeaturePredictions.hdf5: Matrix of cell lines by perturbations, with values indicating the predicted viability using a model with all feature types.



ENAdditionScore.csv: A matrix of perturbations by number of features. Values indicate an elastic net model performance (Pearson correlation of concatenated out-of-sample predictions with the values given in Target.hdf5) using only the top X features, where X is the column header.


FeatureDropScore.csv: Perturbations and predictive performance for a model using all single gene expression features EXCEPT those that had greater than 0.1 feature importance in a model trained with all single gene expression features.


Features.hdf5: A very large matrix of all cell lines by all used CCLE cell features. Continuous features were z-scored. Cell lines missing mutation or expression data were dropped. Remaining NA values were imputed to zero. Feature types are indicated by the column name suffixes:

_Exp: expression

_Hot: hotspot mutation

_Dam: damaging mutation

_OtherMut: other mutation

_CN: copy number

_GSEA: ssGSEA score for an MSigDB gene set

_MethTSS: Methylation of transcription start sites

_MethCpG: Methylation of CpG islands

_Fusion: Gene fusions

_Cell: cell tissue properties
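Because the feature type is encoded in the column suffix, columns can be grouped by type with plain string handling. This is a minimal sketch: the helper group_by_feature_type and the example column names are hypothetical, and it assumes the suffix is everything after the last underscore in the column name.

```python
from collections import defaultdict


def group_by_feature_type(columns):
    """Map each suffix (text after the last underscore) to its column names."""
    groups = defaultdict(list)
    for name in columns:
        # rpartition splits on the LAST underscore, so underscores inside
        # the feature name itself (e.g. a gene set name) are left intact
        _, _, suffix = name.rpartition("_")
        groups[suffix].append(name)
    return dict(groups)


# Hypothetical column names; the real ones come from Features.hdf5
example_columns = [
    "GENE1_Exp",
    "GENE2_Exp",
    "GENE1_Hot",
    "GENE_SET_A_GSEA",
    "GENE3_CN",
]
feature_groups = group_by_feature_type(example_columns)
expression_only = feature_groups["Exp"]
print(expression_only)
```

With a DataFrame loaded from Features.hdf5, passing its column names to the helper and indexing the frame with one of the resulting lists would restrict a model to a single feature type.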


NormLRT.csv: the normLRT score for the given perturbation


RFAdditionScore.csv: similar to ENAdditionScore, but using a random forest model.


Summary.csv: A dataframe containing predictive model results. Columns:

model: Specifies the collection of features used (Expression, Mutation, Exp+CN, etc)

gene: The perturbation (column in Target.hdf5) examined. For the PRISM and GDSC17 datasets this is a compound rather than a gene.

overall_pearson: Pearson correlation of concatenated out-of-sample predictions with the values given in Target.hdf5

feature: the Nth most important feature, found by retraining the model with all cell lines (N = 0-9)

feature_importance: the feature importance as assessed by sklearn's RandomForestRegressor
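The overall_pearson column is described as a Pearson correlation of concatenated out-of-sample predictions. The sketch below illustrates that idea only: each sample is predicted by a model that never saw it, the held-out predictions from all folds are put back in sample order, and a single correlation is computed against the targets. The one-feature least-squares model, the interleaved fold assignment, the fold count, and the toy numbers are all assumptions made for illustration; the models and cross-validation scheme actually used are described in the manuscript.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept (placeholder for the real model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def concatenated_oos_pearson(feature, target, n_folds):
    """Predict every sample from a model trained without it, then correlate once."""
    predictions = [None] * len(target)
    for fold in range(n_folds):
        # Assumed fold scheme: sample i belongs to fold i % n_folds
        test_idx = [i for i in range(len(target)) if i % n_folds == fold]
        train_idx = [i for i in range(len(target)) if i % n_folds != fold]
        slope, intercept = fit_line(
            [feature[i] for i in train_idx], [target[i] for i in train_idx]
        )
        for i in test_idx:
            predictions[i] = slope * feature[i] + intercept
    # One correlation over all held-out predictions, not an average per fold
    return pearson(predictions, target)


# Made-up data: one feature and a noisy linear response
toy_feature = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
toy_target = [1.1, 2.9, 5.2, 6.8, 9.1, 11.2, 12.7, 15.3, 16.9, 19.0]
score = concatenated_oos_pearson(toy_feature, toy_target, n_folds=5)
print(score)
```

Computing one correlation over the concatenated predictions, rather than averaging a correlation per fold, avoids unstable estimates when individual folds contain few cell lines.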


Target.hdf5: A matrix of cell lines by perturbations, with entries indicating post-perturbation viability scores. Note that the scales of the viability effects are different for different datasets. See manuscript methods for details.


PerturbationInfo.csv: Additional drug annotations for the PRISM and GDSC17 datasets


ApproximateCFE.hdf5: A set of Cancer Functional Event cell features based on CCLE data, adapted from Iorio et al. 2016 (10.1016/j.cell.2016.06.017)


DepMapSampleInfo.csv: sample info from DepMap_public_19Q4 data, reproduced here as a convenience.


GeneRelationships.csv: A list of genes and their related (partner) genes, with the type of relationship (self, protein-protein interaction, CORUM complex membership, paralog).


OncoKB_oncogenes.csv: A list of genes that have non-expression-based alterations listed as likely oncogenic or oncogenic by OncoKB as of 9 May 2018.
