Local fitness landscape of the green fluorescent protein

dataset
posted on 14.03.2016, 16:05, authored by Dmitry Bolotin
Description

These files contain data from the article "Local fitness landscape of the green fluorescent protein". Raw sequencing data for this experiment are available at SRA (http://www.ncbi.nlm.nih.gov/sra) under BioProject PRJNA282342 (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA282342/). The files presented here are data sets obtained at different stages of analysis (as illustrated in the file "Data_low.png"). All files are tab-separated tables with a header in the first row. Some table cells may be empty (e.g. the list of mutations for wild-type).
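As a minimal sketch of reading such a table, the snippet below uses only the Python standard library. The helper name read_table and the two sample rows are illustrative assumptions, not part of the data set; the point is that empty cells arrive as empty strings and have to be handled explicitly.

```python
import csv
import io


def read_table(handle):
    """Read a tab-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical two-row excerpt; the values are illustrative only.
sample = (
    "aaMutations\tuniqueBarcodes\tmedianBrightness\tstd\n"
    "\t3\t3.7\t0.1\n"      # wild-type: empty mutation list
    "SK3R\t1\t3.6\t\n"     # single barcode: empty std
)

rows = read_table(io.StringIO(sample))
wild_type = [row for row in rows if row["aaMutations"] == ""]
single_barcode = [row for row in rows if row["std"] == ""]
```

For a real file, the same helper can be called on a handle opened with open(path, newline="").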

Please note the mutation notation used throughout the files. It is described in detail here: http://mixcr.readthedocs.org/en/latest/appendix.html#alignment-and-mutations-encoding. Briefly, all positions are zero-based (i.e. the first nucleotide has index 0) and the type of mutation (substitution, deletion or insertion) is indicated by the first letter of the mutation description. For example, SG101A is the substitution G>A at position 101. The reference avGFP sequence is provided in the file “avGFP_reference_sequence.fa”.
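The substitution notation can be parsed with a short regular expression. The sketch below handles substitutions only, in the form shown in the example above (S, original letter, zero-based position, new letter). The names Substitution and parse_substitutions are hypothetical, and scanning the field for every match is an assumption made so that the code does not depend on whether a separator is used between mutations. It is not meant for fields that contain insertions or deletions.

```python
import re
from typing import List, NamedTuple


class Substitution(NamedTuple):
    ref: str       # original letter
    position: int  # zero-based position
    alt: str       # replacement letter


# "*" is allowed so that stop codons in amino acid lists can be matched (assumption).
_SUBSTITUTION = re.compile(r"S([A-Z*])(\d+)([A-Z*])")


def parse_substitutions(field: str) -> List[Substitution]:
    """Return all substitutions found in a mutation list; empty list for wild-type."""
    return [
        Substitution(ref, int(position), alt)
        for ref, position, alt in _SUBSTITUTION.findall(field)
    ]


print(parse_substitutions("SG101A"))
# [Substitution(ref='G', position=101, alt='A')]
```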

File names and content


1. Final data sets: genotypes with corresponding log-brightness values

nucleotide_genotypes_to_brightness.tsv – processed file “barcodes_to_brightness.tsv”, with genotypes aggregated by their nucleotide sequence, and brightness information summarized (median and standard deviation) across all barcodes that share the same nucleotide genotype.

Columns:
nMutations – list of nucleotide mutations (see above for mutations notation); empty for wild-type,
aaMutations – list of amino acid mutations; empty for wild-type or genotypes with only synonymous substitutions,
uniqueBarcodes – number of unique barcodes sharing the same nucleotide genotype,
medianBrightness – median of log-brightness values across barcodes that share the same nucleotide genotype,
std – standard deviation of log-brightness values across barcodes that share the same nucleotide genotype; empty for genotypes represented by a single barcode.
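The aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the function name aggregate_by_genotype is hypothetical, each input record is assumed to come from a distinct barcode, and the use of the sample standard deviation (rather than the population one) is an assumption, since the description does not say which was used.

```python
import statistics
from collections import defaultdict


def aggregate_by_genotype(records):
    """Summarize (genotype, log_brightness) pairs, one pair per barcode."""
    groups = defaultdict(list)
    for genotype, log_brightness in records:
        groups[genotype].append(log_brightness)

    summary = {}
    for genotype, values in groups.items():
        summary[genotype] = {
            "uniqueBarcodes": len(values),
            "medianBrightness": statistics.median(values),
            # Sample standard deviation is an assumption; None stands for the
            # empty cell used when a genotype has a single barcode.
            "std": statistics.stdev(values) if len(values) > 1 else None,
        }
    return summary


# Hypothetical per-barcode values; "" denotes wild-type.
example = aggregate_by_genotype(
    [("", 3.7), ("", 3.9), ("", 3.8), ("SG101A", 3.5)]
)
```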


amino_acid_genotypes_to_brightness.tsv – processed file “barcodes_to_brightness.tsv”, with genotypes aggregated by their amino acid sequence, and brightness information summarized (median and standard deviation) across all barcodes that share the same amino acid genotype.

Columns:
aaMutations – list of amino acid mutations; empty for wild-type,
uniqueBarcodes – number of unique barcodes sharing the same amino acid genotype,
medianBrightness – median of log-brightness values across barcodes that share the same amino acid genotype,
std – standard deviation of log-brightness values across barcodes that share the same amino acid genotype; empty for genotypes represented by a single barcode.


2. Intermediate data set: estimated brightness values for each barcode.

For details of brightness estimation please see the protocol in the original paper.

barcodes_to_brightness.tsv – data set containing aggregated, cleaned and filtered data on genotypes with substitutions only (no indels).

Columns:
barcode – molecular barcode sequence of the genotype,
nMutations – list of nucleotide mutations (see above for mutations notation),
aaMutations – list of amino acid mutations,
brightness – log-brightness of the barcoded sequence.


3. Early data set: processed raw sequencing data

The populations.zip archive contains files with names of the form L{k}R{m}.tsv. The files contain aggregated read counts of barcodes for each sorted population, where {k} is the index of the sorting gate and {m} is the index of the replica. For example, the file L1R2.tsv contains counts for barcodes found in brightness population L1 in experimental replica R2 (see below for the median brightness values of the sorting gates).

Files with {k} = 0 (e.g. L0R1.tsv) contain results of sequencing of bacterial population before sorting.

Columns:
barcode – molecular barcode sequence (see protocol in original paper),
count – number of occurrences of this barcode in the sequences for a particular sorted population,
minQuality – minimal Phred quality for the barcode sequence.

Important: please see the “Normalization” section below, which describes how we translated read counts into the number of cells for each barcode.
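The file naming scheme of the populations.zip archive can be decoded with a small helper. This is only a sketch; the function name parse_population_name and the returned dictionary keys are hypothetical.

```python
import re

_POPULATION_NAME = re.compile(r"L(\d+)R(\d+)\.tsv")


def parse_population_name(name):
    """Split a name of the form L{k}R{m}.tsv into gate and replica indices."""
    match = _POPULATION_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected population file name: {name!r}")
    gate = int(match.group(1))
    replica = int(match.group(2))
    # Gate index 0 marks the bacterial population sequenced before sorting.
    return {"gate": gate, "replica": replica, "unsorted": gate == 0}


sorted_population = parse_population_name("L1R2.tsv")
unsorted_population = parse_population_name("L0R1.tsv")
```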


genotypes.tsv – contains processed Illumina MiSeq sequencing data of GFP genotypes for each barcode (genotype to barcode correspondence).

Columns:
barcode – molecular barcode sequence of the genotype,
minCoverage – minimal coverage of target GFP sequence by sequencing reads (see protocol in the paper),
meanCoverage – mean coverage of target GFP sequence by sequencing reads,
nMutations – list of nucleotide mutations (see above for mutations notation),
aaMutations – list of amino acid mutations for genotypes without indels, empty string (!) for genotypes with indels.


Information on data processing
The data processing workflow is outlined in the file “Data_low.png”. We processed data from an Illumina MiSeq sequencing run to reconstruct full-length sequences of GFP and relate each GFP sequence to the corresponding barcode. We then analyzed Illumina HiSeq sequencing of cell populations sorted by fluorescence-activated cell sorting, for each of the four replicas of the experiment. We counted the reads that each barcoded genotype has in each brightness population. We then fitted each barcode distribution with two Gaussian distributions using the values of logarithms of sorting gates medians. When aggregating information from replicas we eliminated barcodes that displayed too broad a distribution across the brightness populations or had conflicts between replicas. We saved the resulting filtered data into the file “barcodes_to_brightness.tsv”.

Normalization
A fixed number of cells with known barcodes (AAGTTCTAAATAACAATCCC, AATACCAGTAAGGACTTAA, TATGGTACTTAATTTACAGT, TATTTACGGGTATGACTGGG), about 1333 cells for each barcode, was added to every population after sorting. These cells passed through all sample preparation procedures together with the library, serving as a control for each sample in each replica. When analysing the sequencing data, we used these controls to translate the number of reads per barcode into the number of sorted cells. Barcodes with fewer than three cells across the population samples were later removed at the data filtering stage.
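One way such a conversion could look is sketched below. The exact procedure is not given here, so the details are assumptions: the reads-per-cell factor is taken as the mean read count of the control barcodes found in a sample divided by 1333, and the function name reads_to_cells and the sample counts are hypothetical.

```python
CONTROL_CELLS = 1333  # approximate number of cells added per control barcode


def reads_to_cells(read_counts, control_barcodes, control_cells=CONTROL_CELLS):
    """Convert per-barcode read counts of one sample into estimated cell numbers."""
    control_reads = [
        read_counts[barcode]
        for barcode in control_barcodes
        if barcode in read_counts
    ]
    if not control_reads or sum(control_reads) == 0:
        raise ValueError("no control barcode reads found in this sample")
    # Assumption: average the controls to estimate reads per sorted cell.
    reads_per_cell = (sum(control_reads) / len(control_reads)) / control_cells
    return {barcode: count / reads_per_cell for barcode, count in read_counts.items()}


# Hypothetical read counts for one population sample.
controls = ["AAGTTCTAAATAACAATCCC", "TATGGTACTTAATTTACAGT"]
counts = {
    "AAGTTCTAAATAACAATCCC": 2666,
    "TATGGTACTTAATTTACAGT": 2666,
    "ACGTACGTACGTACGTACGT": 10,  # illustrative library barcode
}
cells = reads_to_cells(counts, controls)
```

With these illustrative numbers each cell corresponds to two reads, so the library barcode with 10 reads is estimated at 5 cells.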

Estimation of brightness
For some of the barcodes a bimodal distribution of cells across the fluorescence gate populations was observed. These distributions were not reproduced across experimental replicas, indicating that they represent an artifact of the experimental procedure rather than inherent genotype properties. We fitted each barcode distribution within each replica with two Gaussian distributions using actual values of logarithms of sorting gate boundaries. Thus, the resulting distribution parameters were expressed in actual brightness logarithm values. We filtered out the cases where the log-value of fluorescence of the major Gaussian component was below 0.65, or its sigma exceeded 0.4. When aggregating information from replicas we eliminated barcodes for which fewer than three replicas belonged to the ±0.45-neighbourhood of the median value calculated across all replicas.
The following median values of brightness within sorting gates were used to estimate the brightness of the genotypes:
Replica 0 (from L1 to L8): 10751, 5970, 3190, 1372, 418, 179, 81, 20,
Replica 1 (from L1 to L8): 16278, 9189, 4942, 1817, 433, 179, 72, 20,
Replica 2 (from L1 to L8): 7984, 5914, 3207, 1337, 428, 160, 69, 20,
Replica 3 (from L1 to L8): 12989, 6864, 3522, 1377, 414, 147, 58, 20.
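The gate medians above and the replica-consistency rule can be sketched in code as follows. Several details are assumptions: the logarithm is taken as base 10, the ±0.45 neighbourhood is treated as inclusive, and the names GATE_MEDIANS, log_gate_medians and passes_replica_filter are hypothetical. The Gaussian fitting step itself is not reproduced here.

```python
import math
import statistics

# Median brightness within sorting gates L1..L8, copied from the list above.
GATE_MEDIANS = {
    0: [10751, 5970, 3190, 1372, 418, 179, 81, 20],
    1: [16278, 9189, 4942, 1817, 433, 179, 72, 20],
    2: [7984, 5914, 3207, 1337, 428, 160, 69, 20],
    3: [12989, 6864, 3522, 1377, 414, 147, 58, 20],
}


def log_gate_medians(replica):
    """Log-brightness of gate medians; base 10 is an assumption."""
    return [math.log10(value) for value in GATE_MEDIANS[replica]]


def passes_replica_filter(estimates, tolerance=0.45, min_replicas=3):
    """Keep a barcode if enough replica estimates lie near the cross-replica median."""
    if not estimates:
        return False
    center = statistics.median(estimates)
    close = sum(1 for value in estimates if abs(value - center) <= tolerance)
    return close >= min_replicas


# Hypothetical per-replica log-brightness estimates for two barcodes.
consistent = passes_replica_filter([3.7, 3.8, 3.6, 2.0])
inconsistent = passes_replica_filter([3.7, 2.0, 1.0, 4.5])
```

In the first example three of the four replicas lie within 0.45 of the median (3.65), so the barcode is kept; in the second none do, so it is dropped.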

Please see the original paper for a description of the level and structure of the noise in the final estimates of log-brightness.
