Expression vs genomics for predicting dependencies
This dataset supports the "Gene expression has more power for predicting in vitro cancer cell vulnerabilities than genomics" preprint by Dempster et al. To generate the figure panels seen in the preprint using these data, use FigurePanelGeneration.ipynb. This study includes five datasets (citations and details in manuscript).
Achilles: the Broad Institute's DepMap public 19Q4 CRISPR knockout screens processed with CERES
Score: The Wellcome Sanger Institute's Project Score CRISPR knockout screens processed with CERES
RNAi: The DEMETER2-processed combined dataset which includes RNAi data from Achilles, DRIVE, and Marcotte breast screens.
PRISM: The PRISM pooled in vitro repurposing primary screen of compounds
GDSC17: In vitro cancer drug screens performed by Sanger
The files of most interest to a biologist are the Summary.csv files. If you are interested in trying machine learning, the files Features.hdf5 and Target.hdf5 contain the data arranged in a convenient form for standard supervised machine learning algorithms.
Some large files are in the binary hdf5 format to save space and speed up loading. Each of these files contains three named hdf5 datasets: "dim_0" holds the row/index names as an array of strings, "dim_1" holds the column names as an array of strings, and "data" holds the matrix contents as a 2D array of floats. In Python, these files can be read in with:
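import h5py
import numpy as np
import pandas as pd

def read_hdf5(filename):
    """Read one of the hdf5 matrices into a pandas DataFrame."""
    # The context manager closes the file even if reading fails
    with h5py.File(filename, 'r') as src:
        # Row and column names are stored as byte strings
        dim_0 = [x.decode('utf8') for x in src['dim_0']]
        dim_1 = [x.decode('utf8') for x in src['dim_1']]
        data = np.array(src['data'])
    return pd.DataFrame(index=dim_0, columns=dim_1, data=data)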
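Once Features.hdf5 and Target.hdf5 are loaded as DataFrames, a typical next step for supervised learning is to line up the cell lines for one perturbation. The following is a minimal sketch, not the authors' pipeline. It assumes that Target may contain NaN for cell lines that were not screened, and that the two matrices do not necessarily share the same set of cell lines. The helper name make_xy and all values and identifiers in the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def make_xy(features, target, perturbation):
    """Return (X, y) restricted to cell lines that have features and a measured target."""
    # Drop cell lines with no measurement for this perturbation (assumed to be NaN)
    y = target[perturbation].dropna()
    # Keep only cell lines present in both matrices
    shared = features.index.intersection(y.index)
    return features.loc[shared], y.loc[shared]

# Tiny synthetic stand-ins for Features.hdf5 and Target.hdf5 (made-up values)
features = pd.DataFrame(
    [[0.5, 1.0], [-1.2, 0.0], [0.7, 1.0]],
    index=["LINE_A", "LINE_B", "LINE_C"],
    columns=["GENE1_Exp", "GENE2_Hot"],
)
target = pd.DataFrame(
    {"GENE3": [-0.9, np.nan, -0.1, 0.2]},
    index=["LINE_A", "LINE_B", "LINE_C", "LINE_D"],
)

X, y = make_xy(features, target, "GENE3")
print(X.shape, sorted(y.index))
```

With the real files, features and target would come from read_hdf5("Features.hdf5") and read_hdf5("Target.hdf5"), and X and y can be passed to any estimator that accepts a feature matrix and a target vector.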
##################################################################
Files (not every dataset will have every type of file listed below):
##################################################################
AllFeaturePredictions.hdf5: Matrix of cell lines by perturbations, with values indicating the predicted viability using a model with all feature types.
ENAdditionScore.csv: A matrix of perturbations by number of features. Values indicate an elastic net model performance (Pearson correlation of concatenated out-of-sample predictions with the values given in Target.hdf5) using only the top X features, where X is the column header.
FeatureDropScore.csv: Perturbations and predictive performance for a model using all single gene expression features EXCEPT those that had greater than 0.1 feature importance in a model trained with all single gene expression features.
Features.hdf5: A very large matrix of all cell lines by all CCLE cell features used. Continuous features were z-scored. Cell lines missing mutation or expression data were dropped. Remaining NA values were imputed to zero. Feature types are indicated by the column name suffixes:
_Exp: expression
_Hot: hotspot mutation
_Dam: damaging mutation
_OtherMut: other mutation
_CN: copy number
_GSEA: ssGSEA score for an MSigDB gene set
_MethTSS: Methylation of transcription start sites
_MethCpG: Methylation of CpG islands
_Fusion: Gene fusions
_Cell: cell tissue properties
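Because the feature type is encoded in the column suffix, restricting Features.hdf5 to one data type is a matter of string matching on column names. The sketch below shows one way to do this with pandas. The helper names and the toy column names (GENE1_Exp and so on) are hypothetical; the real column names may carry additional gene identifiers before the suffix.

```python
import pandas as pd

def select_feature_type(features, suffix):
    """Keep only the columns whose name ends with the given suffix, e.g. '_Exp'."""
    keep = [col for col in features.columns if col.endswith(suffix)]
    return features[keep]

def count_feature_types(features):
    """Count columns per suffix, taking the text after the last underscore."""
    suffixes = ["_" + col.rsplit("_", 1)[-1] for col in features.columns]
    return pd.Series(suffixes).value_counts()

# Toy matrix with made-up values standing in for Features.hdf5
features = pd.DataFrame(
    [[0.1, -0.3, 1.2, 0.0], [0.4, 0.9, -0.8, 1.0]],
    index=["LINE_A", "LINE_B"],
    columns=["GENE1_Exp", "GENE2_Exp", "GENE1_CN", "GENE3_Hot"],
)

exp_only = select_feature_type(features, "_Exp")
counts = count_feature_types(features)
print(list(exp_only.columns))
print(counts.to_dict())
```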
NormLRT.csv: the normLRT score for the given perturbation
RFAdditionScore.csv: similar to ENAdditionScore, but using a random forest model.
Summary.csv: A dataframe containing predictive model results. Columns:
model: Specifies the collection of features used (Expression, Mutation, Exp+CN, etc)
gene: The perturbation (column in Target.hdf5) examined. For the PRISM and GDSC17 datasets this is a compound rather than a gene.
overall_pearson: Pearson correlation of concatenated out-of-sample predictions with the values given in Target.hdf5
feature: the Nth most important feature, found by retraining the model with all cell lines (N = 0-9)
feature_importance: the feature importance as assessed by sklearn's RandomForestRegressor
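The model, gene, and overall_pearson columns are enough to compare feature sets per perturbation. The sketch below assumes that Summary.csv holds one row per (model, gene) pair, which should be checked against the actual file. The helper names and the toy values are hypothetical.

```python
import pandas as pd

def best_model_per_perturbation(summary):
    """For each perturbation, return the model with the highest overall_pearson."""
    best_rows = summary.groupby("gene")["overall_pearson"].idxmax()
    columns = ["gene", "model", "overall_pearson"]
    return summary.loc[best_rows, columns].reset_index(drop=True)

def model_comparison_table(summary):
    """Perturbations as rows, models as columns, overall_pearson as values."""
    return summary.pivot_table(index="gene", columns="model", values="overall_pearson")

# Toy stand-in for pd.read_csv("Summary.csv") with made-up scores
summary = pd.DataFrame({
    "model": ["Expression", "Mutation", "Expression", "Mutation"],
    "gene": ["GENE1", "GENE1", "GENE2", "GENE2"],
    "overall_pearson": [0.6, 0.2, 0.1, 0.3],
})

best = best_model_per_perturbation(summary)
table = model_comparison_table(summary)
print(best)
print(table)
```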
Target.hdf5: A matrix of cell lines by perturbations, with entries indicating post-perturbation viability scores. Note that the scales of the viability effects are different for different datasets. See manuscript methods for details.
PerturbationInfo.csv: Additional drug annotations for the PRISM and GDSC17 datasets
ApproximateCFE.hdf5: A set of Cancer Functional Event cell features based on CCLE data, adapted from Iorio et al. 2016 (10.1016/j.cell.2016.06.017)
DepMapSampleInfo.csv: sample info from DepMap_public_19Q4 data, reproduced here as a convenience.
GeneRelationships.csv: A list of genes and their related (partner) genes, with the type of relationship (self, protein-protein interaction, CORUM complex membership, paralog).
OncoKB_oncogenes.csv: A list of genes that have non-expression-based alterations listed as likely oncogenic or oncogenic by OncoKB as of 9 May 2018.
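The performance measure used in Summary.csv and the AdditionScore files, the Pearson correlation of concatenated out-of-sample predictions with the values in Target.hdf5, can be illustrated with a short NumPy sketch. This is not the authors' implementation: closed-form ridge regression is swapped in for the elastic net and random forest models, and the number of folds, the penalty, and the function name are assumptions made for illustration. See the manuscript methods for the actual procedure.

```python
import numpy as np

def out_of_sample_pearson(X, y, n_folds=5, ridge_penalty=1.0, seed=0):
    """Pearson correlation between y and concatenated held-out predictions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Randomly partition the cell lines into folds
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    predictions = np.empty(len(y))
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        X_train, y_train = X[train_mask], y[train_mask]
        # Ridge regression in closed form (a stand-in, not the manuscript's models)
        x_mean = X_train.mean(axis=0)
        y_mean = y_train.mean()
        Xc = X_train - x_mean
        gram = Xc.T @ Xc + ridge_penalty * np.eye(X.shape[1])
        coef = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
        # Each cell line is predicted only by a model that never saw it
        predictions[test_idx] = (X[test_idx] - x_mean) @ coef + y_mean
    # Correlate the concatenated held-out predictions with the observed values
    return np.corrcoef(predictions, y)[0, 1]

# Synthetic data: the target depends strongly on the first feature
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 5))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)
score = out_of_sample_pearson(X_demo, y_demo)
print(round(score, 3))
```

Concatenating the held-out predictions before computing a single correlation gives one score per perturbation, rather than an average of per-fold correlations computed on few cell lines each.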