This dataset is the result of 318 cancer cell lines screened with the genome-wide KY1.0/1.1 CRISPR KO library by the Sanger Institute, processed with the Achilles pipeline (except QC). The publication describing the experiment is "Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens," DOI 10.1038/s41586-019-1103-9. 

Read counts were downloaded from https://score.depmap.sanger.ac.uk/downloads on 8 May 2019. Only cell lines annotated by the authors as passing both QC steps in Supplementary Table 1 were retained. Additionally, only cell lines for which the Broad had copy number data as of 10 May 2019 were retained. The following steps were used to process the data:

- Remove cell lines failing QC according to Sanger's metrics
- Calculate reads per million (RPM) for each replicate and pDNA, adding a pseudocount of 1 to the RPM
- Calculate the log2 fold change from pDNA for each replicate
- Run CERES to generate gene-level scores
- Scale so the median of common essentials in each cell line is -1
- Center and scale each gene across datasets, so each gene has mean 0 and variance 1
- Remove the first five principal components of the resulting matrix and restore the prior means and variances of genes
- Scale again so the median of common essentials in each cell line is -1
- Identify pan-dependent genes as those for which 90% of cell lines rank the gene above a given dependency cutoff. The cutoff is determined from the central minimum in a histogram of gene ranks in their 90th percentile least dependent line.
- For each CERES gene score, infer the probability that the score represents a true dependency. This is done by iterating EM steps until convergence, independently in each cell line. The dependent distribution is given by the list of essential genes. The null distribution is determined from unexpressed gene scores in those cell lines that have expression data available, and from the Hart non-essential gene list in the remainder.
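
The RPM and log2 fold change steps above can be sketched as follows. This is a minimal illustration, not the Achilles pipeline code: the function names are hypothetical, and it assumes the pseudocount is added after RPM normalization, as the step list states.

```python
import math

def reads_per_million(counts, pseudocount=1.0):
    """Normalize raw guide counts to RPM, then add the pseudocount to the RPM."""
    total = sum(counts)
    return [c * 1e6 / total + pseudocount for c in counts]

def log2_fold_change(replicate_counts, pdna_counts):
    """Per-guide log2 ratio of replicate RPM to pDNA RPM (hypothetical helper)."""
    rep = reads_per_million(replicate_counts)
    ref = reads_per_million(pdna_counts)
    return [math.log2(r / p) for r, p in zip(rep, ref)]

# Toy counts for three guides: the first guide drops out in the replicate.
lfc = log2_fold_change([0, 200, 200], [100, 100, 200])
```

The pseudocount keeps the ratio finite for guides with zero reads, so a fully depleted guide gets a large negative value rather than an undefined one.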


*****************
Dataset contents:
*****************

## Post-CERES files:

README - Raw

gene_effect_unscaled - NumericalMatrix
CERES inferred effects of knocking out genes, without additional scaling. More negative effects produce more negative logfold changes.
Columns: genes in the format “HUGO (Entrez)”
Rows: cell lines (Broad IDs)
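
Column labels in the “HUGO (Entrez)” format can be split into symbol and ID with a small parser. The sketch below is an assumption about how a reader might do this; `parse_gene_label` is a hypothetical helper, not part of the dataset tooling.

```python
import re

# A symbol, one space, then an integer Entrez ID in parentheses.
GENE_LABEL = re.compile(r"^(?P<symbol>.+) \((?P<entrez>\d+)\)$")

def parse_gene_label(label):
    """Split a 'HUGO (Entrez)' label into (symbol, entrez_id)."""
    match = GENE_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not in 'HUGO (Entrez)' format: {label!r}")
    return match.group("symbol"), int(match.group("entrez"))

symbol, entrez_id = parse_gene_label("A1BG (1)")
```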

gene_effect - NumericalMatrix
CERES data with principal components strongly related to known batch effects removed, then shifted and scaled per cell line so the median nonessential KO effect is 0 and the median essential KO effect is -1.
Columns: genes in the format “HUGO (Entrez)”
Rows: cell lines (Broad IDs)
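
The per-cell-line shift and scale described for gene_effect can be written as a single linear transform. This is a sketch under stated assumptions: `scale_cell_line` and the gene labels in the example are hypothetical, and the actual pipeline may handle missing control genes differently.

```python
from statistics import median

def scale_cell_line(effects, essential_genes, nonessential_genes):
    """Shift and scale one cell line's scores so the nonessential median
    maps to 0 and the essential median maps to -1.

    effects: dict mapping gene label -> score for a single cell line.
    """
    med_ess = median(effects[g] for g in essential_genes if g in effects)
    med_non = median(effects[g] for g in nonessential_genes if g in effects)
    span = med_non - med_ess  # positive when essentials score lower
    return {gene: (score - med_non) / span for gene, score in effects.items()}

# Toy scores with made-up gene labels.
raw = {"E1 (1)": -2.0, "E2 (2)": -1.0, "N1 (3)": 0.5, "N2 (4)": 0.5, "X (5)": -0.5}
scaled = scale_cell_line(raw, ["E1 (1)", "E2 (2)"], ["N1 (3)", "N2 (4)"])
```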

gene_dependency - NumericalMatrix
Probability that knocking out the gene has a real depletion effect using gene_effect.
Columns: genes in the format “HUGO (Entrez)”
Rows: cell lines (Broad IDs)
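
One way to picture the dependency probability is as the posterior of a two-component mixture fit by EM within a single cell line. The sketch below is illustrative only: it assumes Gaussian densities fit to the control gene scores and lets EM update only the mixing weight, which may differ from the density estimates the actual pipeline uses. All names are hypothetical.

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def dependency_probabilities(scores, dependent_scores, null_scores,
                             tol=1e-6, max_iter=1000):
    """Posterior probability that each score comes from the dependent component.

    Assumption: both components are Gaussians fit to control gene scores and
    held fixed; EM iterates only on the mixing weight `weight`.
    """
    mu_d, sd_d = mean(dependent_scores), stdev(dependent_scores)
    mu_n, sd_n = mean(null_scores), stdev(null_scores)
    eps = 1e-300  # guards against 0/0 when both densities underflow
    dens_d = [gaussian_pdf(s, mu_d, sd_d) + eps for s in scores]
    dens_n = [gaussian_pdf(s, mu_n, sd_n) + eps for s in scores]

    def posterior(weight):
        return [weight * d / (weight * d + (1.0 - weight) * n)
                for d, n in zip(dens_d, dens_n)]

    weight = 0.5
    for _ in range(max_iter):
        new_weight = mean(posterior(weight))  # E-step, then M-step on the weight
        converged = abs(new_weight - weight) < tol
        weight = new_weight
        if converged:
            break
    return posterior(weight)

probs = dependency_probabilities(
    scores=[-1.0, 0.0],
    dependent_scores=[-1.2, -1.0, -0.8],
    null_scores=[-0.2, 0.0, 0.2],
)
```

In this toy example the score at -1.0 sits on the dependent controls and the score at 0.0 on the null controls, so their probabilities land near 1 and near 0 respectively.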

guide_efficacy - Table
Columns:
sgrna (nucleotides)
efficacy - CERES inferred efficacy for the guide

pan_dependent_genes - Table
List of genes identified as dependencies in all lines, one per line. The scores of these genes are used as the dependent distribution for inferring dependency probability.

## Pre-CERES files
essential_genes - Table
List of genes used as positive controls, currently the 217 Hart panessentials in the format “HUGO (Entrez)”. Each entry is separated by a newline.

nonessential_genes - Table
List of genes used as negative controls (Hart nonessentials) in the format “HUGO (Entrez)”. Each entry is separated by a newline.

logfold_change - NumericalMatrix
Post-QC log2-fold change (not ZMADed)
Columns: replicate IDs
Rows: Guides (nucleotides)

guide_gene_map - Table
Columns:
sgrna (nucleotides) - appears more than once
genome_alignment
gene (“HUGO (Entrez)”)
n_alignments (integer number of perfect matches for that guide)

copy_number - Table
Segmented copy number data for included lines
Columns:
Broad_ID
Chromosome (integer, X, Y)
Start (bp)
End (bp)
Num_Probes
Segment_Mean (logfold change from average)

replicate_map - Table
Columns:
replicate_ID (str)
DepMap_ID
pDNA_batch (int): indicates which processing batch the replicate belongs to and therefore which pDNA reference it should be compared with.
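
As an illustration of how replicate_map ties the columns of logfold_change to cell lines, the sketch below groups replicates by DepMap_ID and averages their per-guide values. The averaging is an assumption made for the example (CERES itself consumes replicate-level data), and the replicate IDs, guide sequences, and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def collapse_replicates(lfc_by_replicate, replicate_map):
    """Average per-guide log fold change over the replicates of each cell line.

    lfc_by_replicate: {replicate_ID: {sgrna: log fold change}}
    replicate_map: list of rows with 'replicate_ID' and 'DepMap_ID' keys.
    """
    replicates_of = defaultdict(list)
    for row in replicate_map:
        replicates_of[row["DepMap_ID"]].append(row["replicate_ID"])

    collapsed = {}
    for line, reps in replicates_of.items():
        guides = lfc_by_replicate[reps[0]].keys()
        collapsed[line] = {
            g: mean(lfc_by_replicate[r][g] for r in reps) for g in guides
        }
    return collapsed

# Made-up rows and values for two replicates of one cell line.
rep_map = [
    {"replicate_ID": "rep_a", "DepMap_ID": "ACH-000001", "pDNA_batch": 1},
    {"replicate_ID": "rep_b", "DepMap_ID": "ACH-000001", "pDNA_batch": 1},
]
lfc = {
    "rep_a": {"ACGT": -1.0, "TTGA": 0.5},
    "rep_b": {"ACGT": -2.0, "TTGA": 0.5},
}
per_line = collapse_replicates(lfc, rep_map)
```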