Biomarker Benchmark - GSE37745

Version 6 2016-03-17, 22:17
Version 5 2016-03-16, 16:40
Version 4 2016-02-23, 23:22
Version 3 2016-02-22, 17:57
Version 2 2016-02-04, 21:55
Version 1 2016-02-02, 22:28
dataset
posted on 2016-03-17, 22:17 authored by Anna Guyer, Stephen Piccolo

[NOTICE: This data set has been deprecated. Please see our new version of the data (and additional data sets) here: https://osf.io/mhk93 ]

" Background: Global gene expression profiling has been widely used in lung cancer research to identify clinically relevant molecular subtypes as well as to predict prognosis and therapy response. So far, the value of these multi-gene signatures in clinical practice is unclear and the biological importance of individual genes is difficult to assess as the published signatures virtually do not overlap.

Methods: Here we describe a novel single-institute cohort, including 196 non-small cell lung cancer (NSCLC) cases with clinical information and long-term follow-up, which was used as a training set to screen for single genes with prognostic impact. The top 450 gene probe sets identified using a univariate Cox regression model (significance level p<0.01) were tested in a meta-analysis including five publicly available independent lung cancer cohorts (n=860).

Results: The meta-analysis revealed that 17 probe sets were significantly associated with survival (p<0.0005) with a false discovery rate of 1%. The prognostic impact of one of these genes, the cell adhesion molecule 1 (CADM1), was confirmed by use of immunohistochemistry on a tissue microarray including 355 NSCLC samples. Low CADM1 protein expression was associated with shorter survival (p=0.028), with particular influence in the adenocarcinoma patient subgroup (p=0.002).

Conclusions: We were able to validate single genes with independent prognostic impact using a novel NSCLC cohort together with a meta-analysis approach. CADM1 was identified as an immunohistochemical marker with a potential application in clinical diagnostics."

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37745

We have included the gene-expression data, the outcome (class) being predicted, and any clinical covariates. Where gene-expression data were processed in multiple batches, we have provided batch information. Each data set is organized into a file set that contains all pertinent files for that data set. The gene-expression files were normalized with both the SCAN and UPC methods, as implemented in the SCAN.UPC package in Bioconductor (https://www.bioconductor.org/packages/release/bioc/html/SCAN.UPC.html). We summarized the data at the gene level using the BrainArray resource (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/20.0.0/ensg.asp) and used Ensembl gene identifiers. The class, clinical, and batch data were hand-curated to ensure consistency ("tidy data" formatting). In addition, the data files are formatted for easy import into the ML-Flex machine learning package (http://mlflex.sourceforge.net/).
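
As a rough illustration of how the gene-expression and class files might be combined for analysis, here is a minimal Python sketch. It assumes tab-delimited text with sample identifiers in the first column; that layout, the sample IDs (S1 to S3), the gene columns (GENE_A, GENE_B, standing in for Ensembl gene identifiers), and the "Class" column name are all hypothetical and may not match the actual files in this file set.

```python
import csv
import io

# Toy stand-ins for two tab-delimited files. All IDs, column names, and
# values here are hypothetical placeholders, not taken from the data set.
EXPRESSION_TSV = (
    "SampleID\tGENE_A\tGENE_B\n"
    "S1\t0.82\t-0.15\n"
    "S2\t-0.40\t0.33\n"
    "S3\t0.10\t0.05\n"
)
CLASS_TSV = (
    "SampleID\tClass\n"
    "S1\t1\n"
    "S2\t0\n"
)


def read_tsv(text):
    """Parse tab-delimited text into a dict keyed by the first column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    key = reader.fieldnames[0]
    return {
        row[key]: {k: v for k, v in row.items() if k != key}
        for row in reader
    }


def join_by_sample(expression, classes):
    """Keep only samples present in both tables and merge their columns."""
    shared = sorted(expression.keys() & classes.keys())
    return {s: {**expression[s], **classes[s]} for s in shared}


expression = read_tsv(EXPRESSION_TSV)
classes = read_tsv(CLASS_TSV)
merged = join_by_sample(expression, classes)

for sample_id, values in merged.items():
    print(sample_id, values)
```

Sample S3 has expression values but no class label, so the join drops it. Values are kept as strings; a real analysis would convert the expression columns to floats after loading, and would read from the downloaded files instead of inline text.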
