Avana Historical Dataset
***Provided only for reproducibility purposes***
***This dataset was generated as a provisional internal dataset***
***A number of lines included in this dataset were since determined to fail quality control and therefore were not made public***
***Do not use this dataset for new analyses and instead use the most recent data available on depmap.org***
This Achilles dataset contains the results of genome-scale CRISPR knockout screens for 17673 in 578 cell lines. It was processed using the following steps:
- Sum raw readcounts by replicate and guide
- Remove the list of guides with suspected off-target activity
- Remove guides with pDNA counts less than one millionth of the pDNA pool
- Remove replicates that fail fingerprinting match to parent or derivative lines
- Remove replicates with total reads less than 15 million
- Remove replicates that do not have a Pearson coefficient > .7 with at least one other replicate for the line
- Calculate log2-fold-change from pDNA counts for each replicate
- Calculate the average SSMD for each cell line using guides targeting the Hart reference essentials and non-essentials, and remove those with values more positve than -0.5. See Hart et al., Mol. Syst. Biol, 2014.
- Run CERES. See http://www.biorxiv.org/content/early/2017/07/10/160861
- Identify pan-dependent genes as those for whom 90% of cell lines rank the gene above a given dependency cutoff. The cutoff is determined from the central minimum in a histogram of gene ranks in their 90th percentile least dependent line.
- For each CERES gene score, infer the probability that the score represents a true dependency or not. This is done using an EM step until covergence indepedently in each cell line. The dependent distribution is determined empirically from the scores of the pan-dependent genes. The null distribution is determined from unexpressed gene scores in those cell lines that have expression data available, and from the Hart non-essential gene list in the remainder.
*****************
Dataset contents:
*****************
Post-CERES files:
gene_effect - NumericalMatrix
CERES data normalized to positive controls.
Columns: genes in the format “HUGO (Entrez)”
Rows: cell lines in the format “ID_PRIMARYSITE”
gene_dependency - NumericalMatrix
Probability that knocking out the gene has a real depletion effect.
Columns: genes in the format “HUGO (Entrez)”
Rows: cell lines in the format “ID_PRIMARYSITE”
gene_fdr - NumericalMatrix
If the given gene and all genes scoring left of it in the given cell line are treated as dependencies, the matrix entry gives the fraction of those genes are not true dependencies (the FDR for that cell line).
Columns: genes in the format “HUGO (Entrez)”
Rows: cell lines in the format “ID_PRIMARYSITE”
guide_efficacy - Table
Columns:
sgrna (nucleotides)
efficacy - CERES inferred efficacy for the guide
pan_dependent_genes - Raw
List of genes identified as dependencies in all lines, one per line. The scores of these genes are used as the dependent distribution for inferring dependency probability.
Pre-CERES files:
essential_genes - Raw
List of genes used as positive controls, currently the 217 Hart panessentials in the format “HUGO (Entrez)”. Each entry is separated by a newline.
nonessential_genes - Raw
List of genes used as negative controls (Hart nonessentials) in the format “HUGO (Entrez)”. Each entry is separated by a newline.
raw_readcounts - NumericalMatrix
Summed counts for each replicate/PDNA
Columns: replicate/pDNA IDs
Rows: Guides (nucleotides)
logfold_change - NumericalMatrix
Post-QC log2-fold change (not ZMADed)
Columns: replicate IDs
Rows: Guides (nucleotides)
guide_gene_map - Table
Columns:
sgrna (nucleotides) - appears more than once
genome_alignment
gene (“HUGO (Entrez)”)
n_alignments (integer number of perfect matches for that guide)
copy_number - Table
Segmented copy number data for included lines
Columns:
cell_line_name (“ID_PRIMARYSITE”)
Chromosome (integer, X, Y)
Start (bp)
End (bp)
Num_Probes
Segment_Mean (logfold change from average)
replicate_map - Table
Columns:
replicate_ID (str)
cell_line_name (str): CCLE_name (“ID_PRIMARYSITE”)
pDNA_batch (int): indicates which processing batch the replicate belongs to and therefore which pDNA reference it should be compared with.
dropped_guides - Raw
Guides dropped for suspected off-target activity, separated by newlines.
Annotations:
sample_info - Table
Columns:
cell_line (“ID_PRIMARYSITE”)
n_replicates (int): number of replicates surviving QC
primary_tissue (str): from masterfile
histology (str): from masterfile
histology_subtype (str): from masterfile
type (str): from masterfile
tumor_type (str): from masterfile
cas9_activity (float): percentage score from masterfile
culture_medium (str): from GPP_CRISPR tab, Q
culture_code (str): from GPP_CRISPR tab, R
culture_type (str): “adherent” or none, from masterfile
clean_cell_line_name
cell_line_SSMD
aliases