Data for "Feature selection methods affect the performance of scRNA-seq data integration and querying"

Published on by Luke Zappia

The continued accessibility of technologies for single-cell transcriptomics has led to the increased production of single-cell datasets and the development of computational methods for analysing them. Several efforts have now been made to bring together the available data to construct reference atlases that attempt to catalogue the cell types present in various tissues and organs. While these atlases have the potential to be valuable resources, their usefulness depends on the quality of integration of multiple datasets from different labs and experimental conditions to create the atlas, as well as the ability to map new query samples to the completed reference. Previous benchmarking studies have compared integration methods and found significant differences in performance and that methods generally perform better when applied to a subset of highly variable genes rather than the full feature set. While the field has converged on this approach, it is not clear how these genes should be selected to generate the best integrated reference or how feature selection affects the mapping of query datasets.


In this study we benchmark feature selection methods for single-cell integration and reference usage, including the most commonly used highly variable gene selection methods in popular packages such as Seurat and scanpy as well as alternative approaches. We investigate deep learning integration methods, which have been found to be among the top performers and have been used to construct existing large-scale atlases, as well as an alternative integration approach based on a corrected PCA space. To extend beyond the integration step to how a reference is used, we include evaluation metrics for assessing the quality of query mapping, label transfer and the detection of previously unseen populations. Our benchmark shows that highly variable feature selection is effective for producing high-quality integrated datasets, reinforcing current common practice. We also examine the relationship between the number of features selected and integration performance and compare batch-aware variants of feature selection methods, providing further guidance on how they can be applied.


Most single-cell studies now include several batches or conditions, but data integration continues to be a key challenge in single-cell data science. By evaluating feature selection in depth, we reinforce the effectiveness of current practices and provide further guidance on how they can best be employed. We expect these results to be informative for those working on large-scale tissue atlases through initiatives such as the Human Cell Atlas, as well as for analysts making use of atlases or integrating their own data to tackle specific biological questions.
