
Cannabis Pangenome Figures

figure
posted on 2025-02-13, 00:22 authored by Lillian Padgitt-Cobb

Abstract

Cannabis sativa is a globally significant seed-oil, fiber, and drug-producing plant species. However, a century of prohibition has severely restricted legal breeding and germplasm resource development, leaving potential hemp-based nutritional and fiber applications unrealized. Existing cultivars are highly heterozygous and lack competitiveness in the overall fiber and grain markets, relegating hemp to less than 200,000 hectares globally [1]. The relaxation of drug laws in recent decades has generated widespread interest in expanding and reincorporating cannabis into agricultural systems, but progress has been impeded by the limited understanding of genomics and breeding potential. No studies to date have examined the genomic diversity and evolution of cannabis populations using haplotype-resolved, chromosome-scale assemblies from publicly available germplasm. Here we present a cannabis pangenome, constructed with 181 new and 12 previously released genomes from a total of 156 biological samples from both male (XY) and female (XX) plants, including 42 trio-phased and 36 haplotype-resolved, chromosome-scale assemblies. We discovered widespread regions of the cannabis pangenome that are surprisingly diverse for a single species, with high levels of genetic and structural variation, and propose a novel population structure and hybridization history. Conversely, the cannabinoid synthase genes contain very low levels of diversity, despite being embedded within a variable region containing multiple pseudogenized paralogs and distinct transposable element arrangements. Additionally, we identified variants of acyl-lipid thioesterase (ALT) genes [2] that are associated with fatty acid chain length variation and the production of the rare cannabinoids tetrahydrocannabinol varin (THCV) and cannabidiol varin (CBDV). We conclude that the Cannabis sativa gene pool has only been partially characterized and that the existence of wild relatives in Asia remains likely, while its potential as a crop species remains largely unrealized.

1. United Nations. Commodities at a glance: Special issue on industrial hemp. Commod Glance (2023) doi:10.18356/9789210019958.

2. Pulsifer, I. P. et al. Acyl-lipid thioesterase1-4 from Arabidopsis thaliana form a novel family of fatty acyl-acyl carrier protein thioesterases with divergent expression patterns and substrate specificities. Plant Mol. Biol. 84, 549–563 (2014).

Figure 2. Transposable elements shape the cannabis pangenome. A) Percent of each genome assembly covered by TEs, grouped by population ID, across the pangenome. The y-axis is a Gaussian kernel density estimation (KDE). B) Across the pangenome, the age distribution of fragmented TEs (million years ago [mya]), with inset showing distribution within the last 100,000 years (thousand years ago [kya]). In the inset, the highest density occurs within 10 kya. C) Age distribution of intact TEs (million years ago [mya]), with inset showing distribution within the last 100,000 years (thousand years ago [kya]). In the inset, the highest density occurs within 10 kya. D) Average solo:intact ratio for Ty1-LTR elements in 78 chromosome-level, haplotype-resolved genomes, grouped by chromosome. E) Average solo:intact ratio for Ty3-LTR elements in 78 chromosome-level, haplotype-resolved genomes, grouped by chromosome. F) Average solo:intact ratio for Ty1-LTR elements in the sex chromosomes grouped according to boundary (PAR, X-specific region, or SDR). G) Average solo:intact ratio for Ty3-LTR elements in the sex chromosomes grouped according to boundary (PAR, X-specific region, or SDR). H) Genomic landscape plot for AH3Mb.chrY showing density of LTRs, methylation, CpG content, and transcripts across the length of the chromosome. I) Genomic landscape plot for AH3Mb.chrY showing the ratio of solo:intact Ty1-LTRs across the length of the chromosome. J) Visualization of whole genome alignments between AH3Ma.chrX and AH3Mb.chrY. The bracketed region with high similarity is the PAR, where recombination between X and Y chromosomes occurs. K) Genomic landscape plot for AH3Ma.chrX showing density of LTRs, methylation, CpG content, and transcripts across the length of the chromosome. L) Genomic landscape plot for AH3Ma.chrX showing the ratio of solo:intact Ty1-LTRs across the length of the chromosome.

Figure 4. The cannabinoid pathway is domesticated, but shows contrasting patterns of genetic diversity and synteny. A) Cannabinoid biosynthesis pathway and gene copy numbers across the pangenome, per assembly. B) Consensus maximum likelihood phylogeny of aligned coding sequences from cannabinoid synthases, with the proportion of 100 bootstrap replicates shown on branches where values are > 0.75. Each branch tip represents a distinct cluster of synthases sharing > 99% identity, drawn from 859 total synthases across all 193 pangenome samples. C) Summary of common cannabinoid synthase cassette arrangements, with number of occurrences in the pangenome shown at left. Full = full-length synthase gene models, and Partial = truncated, lower-stringency synthase alignments, likely representing pseudogenes. D) Synthase cassettes exhibit variation in synteny, as seen in a BUSCO-anchored local alignment of chromosome 7. Red triangle = THCAS cassettes; blue triangles = CBDAS cassettes; yellow triangle = CBCAS cassettes; gray triangles = low-stringency synthase matches (pseudogenes); gray and pink circles = BUSCOs. E) Maximum likelihood tree of helitron DNA TE sequences flanking (2 kb upstream or downstream) cannabinoid synthases in the 78 haplotype-resolved, chromosome-scale assemblies.

Extended Data Figure 1. A) Haplotype-specific expression for all tissue types from EH23, grouped by chromosome. Haplotype gene pairs are either syntenic or reciprocal best hits. Balanced and biased gene expression is assigned according to TPM difference. A difference of at least 5 TPM was required for a gene pair to be assigned as biased; otherwise, the pair was assigned as balanced (see also Supplemental Table 2 for counts by tissue type). B) LATE ELONGATED HYPOCOTYL (LHY) shows biased gene expression in EH23b foliage under 12 hours of light (12/12 hours). C) The copy of LHY with biased expression also belongs to an orthogroup with high entropy in different populations, with the largest difference in entropy separating feral and MJ. D) GO term enrichment of biased gene expression for all tissues in EH23a; and E) GO term enrichment of biased gene expression for all tissues in EH23b.

Extended Data Figure 2. Expression patterns in the flowers and leaves of male and female Ace High plants. A) Stacked bar chart showing the number of genes with balanced, biased, or exclusive expression in male and female tissues. Overall, for a gene to be considered expressed, a minimum average TPM value of 1.0 across tissue replicates was required, grouped by sex. For balanced expression, genes were required to have an average TPM of at least 1.0 in both sexes, grouped by tissue type, while also having a difference of less than 5 TPM between the sexes. For biased expression, a difference of >= 5 TPM between sexes was required for each tissue type. For exclusive expression, a gene was required to have an average TPM of at least 1.0 in one sex for a given tissue, without expression in the other sex for that tissue type. On average, approximately 90% of genes with balanced or biased expression are syntenic across tissues and sexes; in contrast, approximately 80% of genes with exclusive expression are syntenic. The main exception is exclusively expressed genes in female leaf tissue, in which approximately 90% of genes are syntenic. For this analysis, synteny is relative to the set of eight genomes with X and Y chromosomes, determined by genespace. B) Revigo TreeMap figure showing GO term enrichment among genes with biased and exclusive expression in male flowers. A variety of metabolic pathways are enriched, including pollen development. C) and D) Biased gene expression in male and female flowers across chromosomes X and Y. Scatter plots for chrX and chrY, respectively, of gene start positions on the x-axis. The y-axis shows the difference of log2 TPM between male and female flowers, specifically showing genes that have biased or exclusive expression in male flowers. Blue markers correspond to genes in the PAR; red markers correspond to genes in the X-specific region. E) and F) Biased expression of intact TEs in male and female flowers across chromosomes X and Y, respectively.
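The expression categories described for panel A can be sketched as a small classifier. This is only an illustration of the stated thresholds, not the authors' code. The function name, the label strings, and the order of the checks are assumptions, and "without expression in the other sex" is interpreted here as an average TPM below the 1.0 cutoff, which the caption does not state explicitly.

```python
from statistics import mean

MIN_TPM = 1.0    # minimum average TPM for a gene to count as expressed
BIAS_DIFF = 5.0  # TPM difference at or above which a gene is called biased


def classify_expression(male_tpm, female_tpm,
                        min_tpm=MIN_TPM, bias_diff=BIAS_DIFF):
    """Classify one gene in one tissue from per-replicate TPM values.

    Hypothetical helper: labels and check order are assumptions.
    """
    m = mean(male_tpm)    # average across male replicates
    f = mean(female_tpm)  # average across female replicates
    male_on = m >= min_tpm
    female_on = f >= min_tpm

    if not male_on and not female_on:
        return "not_expressed"
    # Assumption: "no expression" in the other sex means below min_tpm.
    if male_on and not female_on:
        return "exclusive_male"
    if female_on and not male_on:
        return "exclusive_female"
    # Both sexes expressed: split on the TPM difference.
    if abs(m - f) >= bias_diff:
        return "biased"
    return "balanced"


if __name__ == "__main__":
    print(classify_expression([10.0, 12.0], [0.0, 0.2]))
    print(classify_expression([10.0, 10.0], [3.0, 3.0]))
    print(classify_expression([4.0, 4.0], [2.0, 2.0]))
```

Checking exclusivity before bias is a design choice in this sketch: a gene at 11 TPM in one sex and 0.1 TPM in the other is reported as exclusive rather than biased.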

RGeneSupplementalFigure. Disease resistance genes across the cannabis pangenome. A) Circos plot showing the EH23a genome as an example of the chromosomal distribution of disease resistance gene analogs. Outer track (gold)=all categories of RGAs identified by drago2; middle track (blue)=receptor-like kinases; interior track=coiled-coil nucleotide binding site leucine-rich repeat genes. B) Violin plot showing numbers of resistance gene analogs per chromosome in scaffolded genomes. C) Maximum likelihood tree of CNL genes on chromosome 2 with similarity to a gene associated with powdery mildew resistance. D) Sequence tube map visualization of gene EH23a.chr2.v1.g115410 (EH23a.chr2:77164374-77165978).

Full Pangenome MLO Tree. High-level view of population groupings for XM_030647777.1. We mapped XM_030647777.1 to the full set of the 193 genomes with minimap2. We then collected all gene models that overlapped the hit with bedtools intersect and aligned corresponding proteins with mafft.
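The step of collecting gene models that overlap the mapped hit can be illustrated with a plain-Python stand-in for an interval overlap query. This is not bedtools itself and not the authors' pipeline. The function name, coordinates, and gene IDs below are hypothetical, and intervals are assumed to be 0-based and half-open, as in BED files.

```python
def overlapping_genes(hit, genes):
    """Return IDs of gene models that overlap a mapped hit.

    hit:   (chrom, start, end)
    genes: iterable of (gene_id, chrom, start, end)
    Coordinates are assumed 0-based, half-open (BED-style).
    """
    chrom, h_start, h_end = hit
    return [
        gene_id
        for gene_id, g_chrom, g_start, g_end in genes
        # Two half-open intervals overlap when each starts before the other ends.
        if g_chrom == chrom and g_start < h_end and h_start < g_end
    ]


if __name__ == "__main__":
    # Hypothetical hit and gene models, for illustration only.
    hit = ("chr2", 100, 200)
    genes = [
        ("g1", "chr2", 50, 100),   # ends where the hit starts: no overlap
        ("g2", "chr2", 150, 250),  # overlaps the hit
        ("g3", "chr1", 150, 250),  # different chromosome
        ("g4", "chr2", 199, 300),  # overlaps by one base
    ]
    print(overlapping_genes(hit, genes))
```

In practice this query would be run per genome, with the proteins of the returned gene models passed on to the alignment step.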

Ideograms. Ideograms for the scaffolded assemblies, including the 370 bp centromere marker.

Funding

NSF Postdoctoral Fellowship in Biology

Directorate for Biological Sciences

Tang Genomics Fund

To develop high-quality genome assemblies of heterozygous cassava varieties and new tools for pangenome analyses to serve breeding programs that need this detailed genomic understanding for more efficient breeding

Bill & Melinda Gates Foundation

Research Institution(s)

Salk Institute for Biological Studies

Contact email

tmichael@salk.edu

I confirm there is no human personally identifiable information in the files or description shared

  • Yes

I confirm the files and description shared may be publicly distributed under the license selected

  • Yes

Competing Interest Statement

S.C. was a co-founder of Oregon CBD. A.R.G. and A.T. were employees of Oregon CBD. R.C.L. is a stakeholder in Saint Vrain Research LLC, which manufactures hemp-based products. T.P.M. is a founder of the carbon sequestration company CQuesta.