
Single-cell transcriptional mapping reveals genetic and non-genetic determinants of aberrant differentiation in AML - Analysis Notebooks

software
posted on 2025-10-26, 02:23 authored by Andy Zeng
<p dir="ltr">This includes the analysis notebooks and scripts that I used to complete the primary analysis in the paper. I briefly outline each section below; each corresponds to a separate directory.</p><h3><u>Establishing BoneMarrowMap</u></h3><p dir="ltr"><b>Folder: 0_Reference_atlas_setup</b> </p><p dir="ltr">This provides the notebooks used to establish the BoneMarrowMap atlas. The constituent datasets were pre-processed in [<u>0_BoneMarrowMap_Datasets_Preprocessing.ipynb</u>], then concatenated and subjected to doublet removal and feature selection in [<u>1_BoneMarrowMap_FeatureSelection_ParameterSearch.ipynb</u>]. Critically, extensive parameter optimization was performed to identify a UMAP embedding that matched biological priors. The final set of parameters used to generate the embedding for the resulting reference atlas is outlined in [<u>2_BoneMarrowMap_FinalEmbedding.Rmd</u>]. </p><p dir="ltr">After generating an embedding, cell type annotation was performed in a semi-supervised manner using Leiden clustering at various resolutions followed by manual curation of discordant cluster assignments. First, this clustering was performed on the entire dataset [<u>3_Clustering_GridSearch_LineageSplitting.ipynb</u>]. 
Subsequently, each lineage was analyzed independently to refine cell state annotations: [<u>4_HSPC_subset_annotation.ipynb</u>], [<u>5_MkEry_subset_annotation.ipynb</u>], [<u>6_Myeloid_subset_annotation.ipynb</u>], [<u>7_Lymphoid_subset_annotation.ipynb</u>], and [<u>8_TNK_subset_annotation.ipynb</u>].</p><p dir="ltr">A final round of annotations was performed to finalize the initial cell state assignments [<u>9_BoneMarrowMap_Combined_reannotation.ipynb</u>], and additional sanity checks were performed based on published cell state annotations and signature scores from sorted cord blood HSPC fractions [<u>10_CB_HSPC_SortedFraction_SignatureScoring.Rmd</u>] prior to establishing the final symphony reference object [<u>11_ProjectionSetup.Rmd</u>].</p><p dir="ltr">Furthermore, marker genes were identified for each cell state within the reference atlas [<u>12a_BoneMarrowMap_pseudobulkDE.R</u>] and [<u>12_BoneMarrowMap_MarkerGenes.Rmd</u>]. Other analyses to characterize the atlas were also performed, including construction of metacells [<u>12b_metacell_BoneMarrowMap.py</u>] in order to perform cNMF [<u>12c_cNMF_BoneMarrowMap.sh</u>] and pySCENIC [<u>12d_pySCENIC_BoneMarrowMap.sh</u>], as well as pseudotemporal ordering with monocle3 [<u>12e_monocle3_BoneMarrowMap.R</u>]. Variable genes and transcription factor regulons were identified across differentiation pseudotime for specific lineages [<u>12_BoneMarrowMap_PseudoTime_MarkersAUC.Rmd</u>] and [<u>13_SCENIC_AUCellMarkers.Rmd</u>].</p><h3><u>Normal Hematopoietic Projections</u></h3><p dir="ltr"><b>Folder: </b><b>1_Normal_projections</b> </p><p dir="ltr">This provides notebooks that I used to project scRNA-seq datasets from normal hematopoiesis onto BoneMarrowMap. 
This analysis notebook [<u>0_Normal_Projections.Rmd</u>] includes projection of xenografted HSPCs from a separate pre-print, published cord blood HSPC and Ery progenitor fractions profiled by bulk RNA-seq, and annotated datasets from van Galen 2019 and Triana 2021, as well as scRNA-seq from sorted fractions from Pellin 2019, Karamitros 2018, Belluschi 2018, Kaufmann 2021, Anjos-Afonso 2022, Lehnertz 2021, Zheng 2018, Psaila 2020, Roy 2021, and Zhang 2022, among others. A second notebook was used to validate our pseudotime projection estimates along HSPC differentiation using <i>de novo</i> pseudotime metrics from Roy 2021 [<u>1_RoyHSPC_Pseudotime_Validation.Rmd</u>].</p><p dir="ltr">For a simple scRNA-seq projection guide in normal hematopoiesis, please see the tutorial in my R package (<a href="https://github.com/andygxzeng/BoneMarrowMap" rel="noreferrer" target="_blank">https://github.com/andygxzeng/BoneMarrowMap</a>).</p><h3><u>Leukemia Projections</u></h3><p dir="ltr"><b>Folder: 2_Leukemia</b><b>_projections</b> </p><p dir="ltr">These are notebooks for scRNA-seq projection of AML samples onto BoneMarrowMap.</p><p dir="ltr">First, the Munich Leukemia Labs - St Jude Children’s Research Hospital (MLL-SJCRH) scRNA-seq data generated in this study was pre-processed from raw counts and leukemia cells were projected onto BoneMarrowMap [<u>0_MLL-SJCRH_samples_processing_mapping.Rmd</u>], followed by cell type classification and pseudotime prediction. Furthermore, transcriptional clusters within each sample, likely representing distinct clones, were visualized independently along hematopoiesis. Global clustering was then performed on the MLL-SJCRH dataset to separate normal T, B, and Plasma immune cells from patient-specific blasts in an unsupervised manner [<u>1_MLL-SJCRH_ImmuneClassification.Rmd</u>]. 
Next, inferCNV was performed and its results for each patient were visualized along the hematopoietic hierarchy [<u>2_MLL-SJCRH_inferCNV.Rmd</u>], along with expressed variants from cbsniffer [<u>3_MLL-SJCRH_expressed_variants.Rmd</u>].</p><p dir="ltr">After the in-house dataset was analyzed, published scRNA-seq datasets from diverse leukemia diagnoses, including acute lymphoblastic leukemia, blastic plasmacytoid dendritic cell leukemia, acute megakaryoblastic leukemia, and erythroleukemia, were projected onto BoneMarrowMap [<u>4_Leukemia_Projections.Rmd</u>]. Next, acute myeloid leukemia (AML) and mixed phenotype acute leukemia (MPAL) samples from twenty additional studies were projected onto BoneMarrowMap over three separate notebooks: [<u>5_AML_Projections_pt1.Rmd</u>], [<u>6_AML_Projections_pt2.Rmd</u>], and [<u>7_AML_Projections_pt3.Rmd</u>]. Results from each of these projections for each individual patient sample are visualized together using [<u>8_AML_Projection_Figures.Rmd</u>].</p><p dir="ltr">For a simple scRNA-seq projection guide in human leukemia, please see the tutorial in my R package (<a href="https://github.com/andygxzeng/BoneMarrowMap" rel="noreferrer" target="_blank">https://github.com/andygxzeng/BoneMarrowMap</a>).</p><h3><u>AML Composition Analysis</u></h3><p dir="ltr"><b>Folder: 3_AML_composition_analysis</b> </p><p dir="ltr">These are notebooks for performing composition analysis of single cell AML/MPAL datasets.</p><p dir="ltr">First, scRNA-seq projection results from the in-house MLL-SJCRH dataset and twenty additional published datasets were combined and filtered to retain confidently mapped single-cell transcriptomes from primary patient samples at diagnosis and relapse, resulting in data from >1.2 million cells from 318 unique patient samples [<u>0_CompositionAnalysis_setup.Rmd</u>]. 
This data was then ported to a Jupyter notebook where I normalized the composition data, collapsed the 37 fine cell states into 13 broader differentiation stages, and clustered the data to identify twelve recurrent patterns of differentiation in AML [<u>1_scAML_composition_analysis.ipynb</u>]. A complex heatmap was used to visualize differences between these patterns and associations with genomic alterations [<u>2_scAML_DifferentiationPatterns_ComplexHeatmap.Rmd</u>].</p><p dir="ltr">To characterize these 13 differentiation stages in AML and MPAL, I constructed pseudobulk profiles by collapsing cells from each differentiation stage within each patient sample, enabling differential expression analysis [<u>3_scAML_DiffStage_Pseudobulk_DE.R</u>] as well as scoring of LSC-derived signatures, among other analyses. I also performed a comparison against Quiescent LSPCs most enriched in engrafting LSC+ fractions from Zeng Nat Med 2022, finding that these LSC-like cells can span a range of stem and progenitor cell states [<u>3_DiffStage_Pseudobulk_Characterization.Rmd</u>].</p><h3><u>AML Differentiation Stage Quantification</u></h3><p dir="ltr"><b>Folder: 4_AML_DiffStage_Quantification</b> </p><p dir="ltr">This provides notebooks to quantify AML differentiation stages in bulk RNA-seq datasets. First, patient-level pseudobulks were constructed to identify genes that are correlated with the relative abundance of each lineage [<u>0_DiffStage_GeneCorr.ipynb</u>]. Second, differentially expressed genes specific to each differentiation stage were analyzed and subjected to adaptive thresholding to keep the top DE genes [<u>1_DiffStage_MarkerGeneSelection.Rmd</u>]. Third, starting from the intersection of correlated and DE genes, sparse regression models were trained to estimate differentiation stage abundance in logCPM-normalized patient-level pseudobulks from 302 scRNA-seq samples. 
Repeated cross-validation identified the best model parameters, and final models were trained to predict abundance for each of the thirteen differentiation stages using a total of 400 genes [<u>2_DiffStage_LASSOregression.Rmd</u>]. I subsequently use these models to quantify relative abundance in bulk RNA-seq datasets. I have found this approach to be more accurate at the level of each population than deconvolution approaches like CIBERSORTx, BayesPrism, DWLS, etc. (discussed in Iacobucci, Zeng, Gao, Garcia-Prat, <i>et al</i>, bioRxiv 2023).</p><p dir="ltr">Next, quantification was performed across five bulk AML cohorts (TCGA, BeatAML2, Leucegene, TARGET pediatric AML, and LUMC) based on batch-corrected data from Severens <i>et al</i> 2023. The authors put a lot of work into updating clinical annotations (including WHO/ICC 2022 classification) for samples in these cohorts, and I have corrected some minor inconsistencies in their genomic annotations [<u>3_Severens2024_AMLmap_Processing.Rmd</u>]. This dataset is used for the primary bulk RNA-seq quantification analysis and for associations with clinical subtype as well as genotype-to-phenotype mapping across >1,200 adult and pediatric AML patients [<u>4_Severens2024_AMLmap_DiffStage_Correlates.Rmd</u>]. Additional analyses were also performed based on sorted fraction bulk RNA-seq and a clinical cohort of acute erythroid leukemia [<u>5_DiffStage_LSC_AEL.Rmd</u>].</p><h3><u>AML Subclone Analysis</u></h3><p dir="ltr"><b>Folder: 5_AML_subclone_analysis</b> </p><p dir="ltr">This analysis relies on copy number variation calls from inferCNV and expressed mutation calls from cbsniffer, as well as Tapestri sc-Immunophenotype + Genotype, within the MLL-SJCRH dataset (see part 2 - Leukemia Projections) to define genetic subclones within the data and to map these subclones to different locations along the hematopoietic hierarchy. 
This focuses primarily on four illustrative examples to highlight principles for how genetic subclones can skew or block lineage decisions or induce maturation or self-renewal downstream of the primitive stem cell [<u>MLL-SJCRH_Subclone_Analysis.Rmd</u>].</p><h3><u>KMT2A-rearranged AML Sub-clustering</u></h3><p dir="ltr"><b>Folder: 6_KMT2A_subclustering</b> </p><p dir="ltr">This outlines composition analysis of KMT2A-rearranged AML samples, including subclustering based on AML differentiation stage to identify Early and Committed subtypes of KMT2A-r AML [<u>KMT2A_subclustering.Rmd</u>] and scoring of literature-derived signatures from canonical primitive LSCs versus KMT2A-specific committed LSCs.</p><h3><u>Co-existing LSC analysis</u></h3><p dir="ltr"><b>Folder: </b><b>7_coexisting_LSC_analysis</b> </p><p dir="ltr">These notebooks analyze scRNA-seq data generated from two primary AML samples from Princess Margaret Hospital (PMH) including unsorted bulk cells, CD34+CD38- sorted fractions, CD34-CD38+ sorted fractions, and xenografts produced by each of these two sorted fractions, in order to demonstrate co-existence of primitive and committed LSC fractions. These data were processed [<u>0_coexistingLSC_scRNA_preprocessing.Rmd</u>] and subject to inferCNV analysis to identify malignant cells [<u>1_coexistingLSC_inferCNV_Clustering.Rmd</u>]. After clustering of malignant cells to identify leukemic cell populations, xenografts produced from each of the CD34/CD38 fractions were compared against the primary sample [<u>2_coexistingLSC_Primary_vs_Xenograft_compare.Rmd</u>].</p>
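<p dir="ltr">For readers who want a feel for the composition step described under AML Composition Analysis (collapsing fine cell states into broader differentiation stages and normalizing per sample), the following is a minimal sketch only. The state and stage names, the mapping <code>FINE_TO_STAGE</code>, and the function <code>stage_proportions</code> are hypothetical, and simple per-sample proportions are assumed here; the normalization actually used in the notebooks may differ.</p>

```python
from collections import defaultdict

# Hypothetical mapping from fine cell states to broader differentiation stages.
# These labels are illustrative only and are NOT the atlas's actual annotations.
FINE_TO_STAGE = {
    "HSC": "HSC_MPP",
    "MPP": "HSC_MPP",
    "GMP": "GMP",
    "Monocyte": "Mono",
    "cDC": "Mono",
}


def stage_proportions(cell_labels, fine_to_stage):
    """Collapse per-cell fine labels into stages and normalize within each sample.

    cell_labels: iterable of (sample_id, fine_state) pairs, one pair per cell.
    Returns {sample_id: {stage: fraction}}, where fractions sum to 1 per sample.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sample_id, fine_state in cell_labels:
        stage = fine_to_stage[fine_state]  # raises KeyError on an unmapped state
        counts[sample_id][stage] += 1

    proportions = {}
    for sample_id, stage_counts in counts.items():
        total = sum(stage_counts.values())
        proportions[sample_id] = {s: n / total for s, n in stage_counts.items()}
    return proportions


# Toy input: one (sample, fine state) pair per projected cell.
cells = [
    ("P1", "HSC"), ("P1", "MPP"), ("P1", "GMP"), ("P1", "Monocyte"),
    ("P2", "Monocyte"), ("P2", "cDC"),
]
props = stage_proportions(cells, FINE_TO_STAGE)
```

<p dir="ltr">The resulting per-sample stage fractions are the kind of table that could then be clustered to look for recurrent differentiation patterns; the clustering itself is not sketched here.</p>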
