Robust 3D modelling reveals spatiosyntenic properties of animal genomes

: Animal genomes are organized into chromosomes that are remarkably conserved in their gene content, forming distinct evolutionary units (macrosynteny). We developed a novel three-dimensional chromosomal modelling approach to show that syntenic signals are reflected in conserved three-dimensional networks, encompassed within interaction spheres. We show evidence for evolutionary constraints that could not be surmised by genomic sequence alone, thereby underlining the importance of three-, rather than just two-, dimensional organization

Our approach utilizes a model consisting of beads on a string, with bead diameters dependent upon the resolution for which the Hi-C data relates (Figure 1a).Beads are initially randomly dispersed and after multiple cycles imposed by the algorithm converge to stable chromosome topologies.During each cycle, beads slowly adjust their positions dependent upon simple physical forces in conjunction with the Hi-C constraints (see Figure 1 and Methods for algorithmic details).As Hi-C quality is highly variable between species, we developed a partitioning approach where we subdivide the chromosomes into interaction spheres (Figure 1a).This partitioning approach allows us to confidently profile in a species-dependent manner interactions within each 'interaction sphere' (IntSph) (Supplementary Figure 1C-E).Outliers on a per-chromosome basis may reflect actual chromosomal dynamics or assembly issues.
With the models and orthology information (Methods) we have asked whether any of the orthologous co-localization at gene pair, macro-and microsyntenic levels are reflected in the three-dimensional structures.For this, we profiled the Euclidean (3D) and genomic (2D) distances between orthologous bins along the chromosomes of two evolutionary divergent species -the chordate amphioxus and the mollusc blood clam (Supplementary Figure 2).The two species span the ancient bilaterian split in the animal phylogeny (over 500 million years ago 8 ) providing for the highest evolutionary conservation signal.They also show relatively good quality of Hi-C data, which is still lacking for many of the invertebrate clades and will become crucial for further elaboration of the results presented here.To this end, we provide chromosomal models for other species in our analysis (Supplementary Figure 3), however where Hi-C resolution can still be improved.
We find that ortholog gene pairs, if they are located on the same chromosome in both species, tend to localize within a partial 3D vicinity of around 200-300nm.While this reflects overall chromosomal folding, in randomly assigned orthologies such mid-distance interactions (when normalized by the genomic distance) were depleted (Supplementary Figure 4).To further investigate this result in the context of chromosomal homologies, we identified homologous chromosomes between the two distantly related animals (Supplementary Figure 5a).Previous studies have shown patterns of largely one-to-one correspondence between animal chromosomes, with individual genes retaining their chromosomal identity (yet in a scrambled order) in multiple species 3,9 .Conversely, around half of orthologous genes move and disperse across multiple animal chromosomes, resulting in an off-diagonal dotted pattern.Genomes can thus be partitioned into genes which stay within the homologous chromosomes (ancestral linkage groups, ALGs) and those that disperse or move to other chromosomes (losing their ancestral linkage group identity, nonALG).We tested whether genes that are maintained in their ALG identity, and thus define the macrosyntenic pattern, are more likely to come into interactions with each other than dispersed genes.
We investigated contact density between bins within each so-called ALG IntSph (consisting of at least one ALG) and nonALG IntSph (lacking the presence of any ALG genes).Similarly to orthologous gene pairs, we found that the ALG IntSphs display more contacts between the ALG gene containing bins (Figure 1b) than nonALG IntSphs and were forming more contacts within 200-300nm range (Figure 1c).This observation is striking in the context that ALG-ALG contacts were less abundant than nonALG-nonALG contacts (altogether 32,352 ALG-ALG contacts compared to 92,632 nonALG-nonALG in amphioxus, Supplementary Figure 5b,c).Interestingly, we find that while this pattern is overall conserved between homologous chromosomes (Supplementary Figure 5c), some chromosomes stand out.In particular, chromosomes 3 and 4 in amphioxus are particularly enriched for nonALG-nonALG contacts (contact score 28 and 25 for chromosome 3 and 4, respectively) while having a very low contact score for ALG-ALG contacts (12 and 10, respectively).These chromosomes have been shown to undergo recent fusions 6 .Together, this suggests presence of evolutionary constraints to maintain global chromosomal organization.
To further understand properties of local genome organization, we assessed 3D contacts between known microsyntenic regions (defined here as local conserved gene clusters of 3 or more genes, Methods).We observed that IntSphs consisting of at least one microsynteny tend to have significantly higher number of contacts (mean 7.68 and median 6.0 interacting partners) compared to IntSphs with only randomly sampled microsyntenies 5 (mean 6.74 and median 5.0 interacting partners, Figure 2a -p-value 6.5*10 -4 and 3.9*10 -10 for amphioxus and blood clam, respectively).This suggests that local microsyntenic linkages also help form 3D interaction hubs.While little is known about the functionality of the majority of observed microsyntenic linkages, we investigated the folding around well studied and often syntenic homeobox genes [10][11][12] .genes [10][11][12] .While the Hox cluster itself forms a tight microsynteny in most animal genomes examined to-date, genes from the proposed ancestral SuperHox and NK clusters 13,14 are usually dispersed along their respective chromosomes (Figure 2b,c).While amphioxus 3D model found evidence for co-localization of the proposed SuperHox cluster genes within a single 'sphere' of 75-100nm, in blood clam the genes appeared more dispersed (Supplementary Figure 6).Nevertheless, our analysis showed that 14 genes were located in spatial proximity (within a single IntSph of radius 50nm) of the proposed SuperHox cluster genes in both blood clam and amphioxus and thus form shared conserved interactors (Figure 2b).The region of the mouse genome spanning three of these genes (atp5g3, atf2 and the jazf2 pseudogene) has been found in the local vicinity of and interacting with the promoters of hoxd genes and regulate their transcription 15 .This contrasts with their syntenic state in amphioxus and blood clam: while atp5g3, atf2 and jazf2 are located on the same chromosome as the hox genes in these animals, they are not in the genomic vicinity of the hox genes.Our inference of their interactions in invertebrates thus suggests importance of the retained topological, rather than local syntenic, interaction around this cluster 11 .Similarly, we can also detect six shared genes in the NK chromosomal network (Figure 2c), however their function is largely unknown.In spite of the substantial gene scrambling and disruption of many microsyntenic linkages between homologous chromosomes, this analysis points to the existence of evolutionary 3D constraints upon chromosomal organization that can only be detected via 3D modelling approach.
We have presented a robust method to quantify shapes of animal chromosomes.Since our 'interaction sphere' approach allows for approximation to compensate for Hi-C quality, the method can be readily applied to an increasing number of species for which Hi-C data sets are accumulating.Our study focuses on implementation of this method for two phylogenetically divergent bilaterian species, as well as describing its application in other species, which is currently impeded by the Hi-C quality.Combined with comparative genomic analyses of gene linkages at several scales of genomic organization, we show evidence for the existence of conserved 3D constraints on genome folding.In particular, we show evidence that genes within both macro-and microsyntenies display more 3D contacts than non-syntenic (and randomly sampled) regions.Our novel methodology paves the way for further detection of topologically and evolutionarily conserved genomic regions (spatiosynteny), providing testable hypotheses for their functional profiling.

Orthology assignment
Genomes, annotations and protein sequences of 80 species were obtained from the databases summarized in Supplementary table 1 Only the longest isoform of each gene was retained.Orthogroups were identified using Orthofinder v2.4.1 16 , in conjunction with diamond 0.9.36 17 , and MCL 14.137 18 .Identification of Hierarchical Orthogroups at the root node (root node HOGs) is based on orthogroup gene trees and species trees that were built using both Mafft 7.427 19 and FastTree 2.1.1120 .For all the following analyses, we consider genes belonging to the same root node HOG to be orthologs.

Idenitfication of microsyntenic blocks
Microsyntenic blocks were inferred using methods described in 5 .In order to determine which microsyntenic blocks of Achatina fulica, Anadara broughtonii and Branchiostoma floridae were present in the Nephrozoan Last Common Ancestor (Nephrozoan blocks), we used two cladograms (Supplementary File X).One was based on the Nephrozoa hypothesis and monophyletic deuterostomes 21,22 , while the other was based on the Xenacoelomorpha hypothesis with paraphyletic deuterostomes (after ref. 23 ).Since both methods produced broadly similar sets of blocks (overlap coefficient of 85 to 98 % for the three species), only the first set was kept for subsequent analyses.Nephrozoan blocks were defined as blocks found in at least two species of each ingroup or of one ingroup and the outgroup.To obtain background information, we employed two block randomization methods.For each observed block, 100 random blocks were sampled either across the whole genome, as described in ref. 5 , or only across the same chromosome bearing the observed block.

Macrosynteny dotplots and assignment of genes to ancestral linkage groups
Pairwise ortholog oxford plots for all possible species pairs between A. fulica, A. broughtonii, B. floridae and Rhopilema esculentum were generated, using the previously identified root node HOGs, but retaining only one-to-one orthologs for each species pair.Chromosome homologies were assessed with a Fisher exact test against a null model of gene permutation (i.e. if p < 0.05, chromosomes were considered to be homologous), with Benjamini-Hochberg correction for false discovery rate for the multiple tests done for each species pair.Another set of ortholog oxford plots between all the possible species pairs between A. fulica, A. broughtonii and B. foridae were also built.Here, one-to-one orthologs were defined based on a reciprocal best hit approach, using B. floridae as a reference.Only the proteins of B. floridae with a reciprocal best hit in A. fulica and A. broughtonii were retained (4866 groups of 3 proteins each).Each group was assigned to the same BLG as the B. floridae protein they comprised according to ref. 6 .

Annotation of homeodomain proteins
The homeodomain HMM profile from Pfam (http://pfam.xfam.org/family/pf00046)was used to query the proteomes of B. floridae, A. broughtonii and A. fulica, using hmmsearch from the HMMER 3.3 package 24 with a 0.1 e-value threshold.Homeodomain candidates were then used to query the HomeoDB2 database 25 using the blastp utility from the BLAST+ package.If at least 8 out of the 10 top hits of a query were of the Antennapedia-class (ANTP), they were considered as ANTP candidates.
ANTP candidates were then aligned to the homeodomains of the ANTP class from HomeoDB2 using Mafft 7.427 with default parameters.The alignment was trimmed using the gappyout algorithm of trimAl 26 and used to infer a phylogenetic tree with FastTree 2.1.1120 .This tree was used to isolate the B. floridae, A. broughtonii and A. fulica orthologs to the members of the SuperHox (hox genes, dlx, en, evx, gbx, hhex, meox, mnx, nedx and ro), the ParaHox (cdx, gsx and pdx), the NK (nk genes, lbx, lcx, msx, tlx and ventx) and NK2 cluster (msxlx, nk2.1, nk2.2) (reviewed in ref. 14 ) In addition, a second round of search for othologs was performed.To this end, all the proteins of B. floridae, A. broughtonii and A. fulica found in the same Orthofinder root node HOGs as the already isolated SuperHox, ParaHox, NK and NK2 cluster genes were recovered.These candidates were used as queries against the BLAST nr database.If the top hits were annotated members members of the SuperHox, ParaHox, NK or NK2 clusters, they were added to the existing list of putative orthologs.

Hi-C analysis
Hi-C sequencing libraries were downloaded from NCBI Sequence Read Archive using SRA Toolkit 27 .Hi-C data of A. fulica 28 (accession ID SRX5181756), B. floridae 6 (accession ID SRX3274438) and A. broughtonii 29 (accession ID SRX5337861), together with their reference genomes were used for preprocessing.Sequence quality was evaluated using FastQC 0.11.8 (https://github.com/s-andrews/FastQC)and HiC-Pro 2.11.1 software 30 was used to generate interaction matrices between 150kb windows of each chromosome.Bowtie2 31 and samtools 1.11 32 , as a part of HiC-Pro, were utilized to map the Hi-C reads to the reference genome assemblies.Specific parameters for HiC-Pro run are listed in Table 1.

3D modelling and SASA analysis
3D model generation The 3D structure of individual chromosomes was constructed using a home-built C++ software, motivated by studies Chrom3D 33 and NucDynamics 34,35 .Each chromosome has a beads-on-a-string representation and starts with a randomized conformation.Then, the time evolution of chromosome conformation is governed by the Newton equation of motion, with forces (detailed below) implemented to characterize the chromosome structural integrity (→    ), volume exclusion between spatially overlapping genomic sites (→    ), drag by nucleoplasm (− →   ), and genomically distant interactions suggested by Hi-C ( ⃗  − ).

The dynamics
The dynamics of a coarse-grained chromatin bead  is governed by the following Newtonian equation of motion: where  ⃗  and  ⃗  are the instantaneous acceleration and velocity of the bead, respectively;  is the mass of the bead;  is the drag coefficient;  ⃗   ,  ⃗   , and  ⃗  ℎ are forces implemented in the model to characterise the mutual volume exclusion between beads, the interaction between genomically consecutive beads, and the interaction between genomically distant beads with high Hi-C frequency.Computationally, Verlet integration is applied to calculate the trajectories of chromosome beads over time.

The volume exclusion force
The volume exclusion between any two spatially overlapping beads is assumed linearly elastic.The contribution of this force to a bead  is described by the following equation: where   is the spring constant reflecting the incompressibility of genetic content within the beads in contact;  , is the distance between the centre of two consecutively connected beads i and j;  0 is the rest length of the linearly elastic spring (in our case  0 2 *   );  ^, is a unit vector pointing from bead  to bead .

The chromatin tension force
The interaction between two genomically consecutive beads is assumed linearly elastic.The contribution of this force to a bead  is described by the following equation: where   is the spring constant of the inter-bead 'chromatin' linker,  ,+1 is the distance between the centre of two consecutively connected beads  and  + 1;  2 is the rest length of the linearly elastic spring;  ^,+1 is a unit vector pointing from bead  to bead  + 1.

The Hi-C restraint force
The interaction between genomically distant beads is also assumed linearly elastic.The contribution of this force to a bead  is described by the following equation: where  − is a constant reflective of the constraint strength implied by HiC and applies to any pairs of coarse-grained beads that have pairwise Hi-C frequency greater than a threshold value, namely,  , >  0 ;  −0 is the rest length of the linearly elastic spring;  ^, is a unit vector pointing from bead  to bead .

Data preparation and modelling
Normalized HiC-Pro sparse matrix was parsed into matrices of separate chromosomes, containing only intra-chromosomal contacts without any scaffold interactions.We neglected all inter-chromosomal interactions due to the multi-cell nature of the Hi-C experiment.Such matrices were then used as an input to our chromatin reconstruction program with specific cut-offs to filter interaction frequencies used as reconstruction constraints.
High number of contact restraints results in very dense and compact chromatin models.Such structures might be lacking desired biological relevance, however there is no current knowledge about optimal number of contacts per selected genomic region, therefore we used a mean value of interaction frequency as a Hi-C cut-off and all interactions below selected Hi-C threshold were neglected.The total number of cis-contacts per chromosome together with number of used constraints for reconstruction with different Hi-C threshold is shown in Table 2,3 and 4, for blood clam, amphioxus and jellyfish, respectively.Structural measurements of mapped syntenic blocks (such as SASA, coverage, depth) can be, to some extent, affected by the number of constraints and thus compactness of chromatin model.We keep the selection criteria for Hi-C threshold consistent among individual chromosomes to mitigate this impact.We reconstructed chromosomal scaffolds of blood clam and amphioxus and jellyfish with different interaction frequency (IF) cut-off thresholds (Table 2-4) and compare the models.Overall, the geometry and fines topology of the models was very similar between different IF cut-off threshold (data not shown).The higher the number of constraints utilized to build a model, the more densely packed the chromosomal scaffold was.Since the quality of the Hi-C datasets utilized is variable, we selected models from mean IF cut-off to be further analysed, in order to capture majority of interactions obtained from Hi-C.All chromosomes were reconstructed with three replicates and each model was initialized with different conformation based on principles of self-avoiding random walk (SAWR).The reconstruction process starts after initialization, when the pseudo-energy of 3D chromosome conformations, calculated as the sum of kinetic and potential energy in the system, is monitored throughout the simulation as an indicator of convergence of the system.This is then accompanied by root-mean square (RMSD) analysis across all the time-point structures towards the final structure.We ran the reconstruction algorithm for 10,000-time steps and the final chromosome structure of each replicate run was then taken for further analysis.In order to validate correlation of our models with Hi-C IF map, we calculated cosine similarity between IF contacts, which were selected as restraints for 3D modelling, and Euclidean distance of corresponding genomic position in the model (Supplementary Figure 1b).In addition, we calculated the proportion of satisfied/violated contacts (Supplementary Figure 1a).Model analysis and gene mapping Genes and microsynteny locations were mapped onto the chromosome models within the 150kb-resolution beads.Due to low resolution, some 150kb-regions include multiple ALGs and nonALGs together.If at least one ALG is present per 150kb-bead, we treat this region as ALGs content only.

Interaction sphere (IntSph) analysis
In order to identify spatio-functional units within chromosome scaffolds, we performed sliding Interaction Sphere (IntSph) analysis; an imaginary sphere with specific radius moving along the chromosome scaffold and detecting spatial contacts within (Figure 1a, Supplementary Figure 1c-e).
The smaller the radius of IntSph, the more the linear contacts are dominant as interacting partners within IntSph.The larger the radius, the more genomically long-range interactions can be included as demonstrated in Supplementary Figure 1c.To measure the IntSph occupancy we defined 'contact density' as the number of interacting beads within defined radius of IntSph.

Figure 1 .
Figure 1.Hi-Chrom modelling pipeline and its utilization for macrosynteny interactions.A) Schematic of 3D modelling tools and analysis pipeline; first row depicts the workflow of Hi-C pre-processing and modelling.Second row shows mapping of genomic features and their consequent structural analysis.B) Distribution of ALG or nonALG contact density within the interaction sphere (radius 50nm) for blood clam (left) and amphioxus (right), respectively.C) Density hexplot depicting relationship between genomic and Euclidean distance of ALG or nonALG contact pairs for amphioxus and blood clam, respectively.

Figure 2 .
Figure 2. Genes in microsynteny form local 3D interaction hubs.A) Distribution of observed vs random microsynteny contact number [defined as] within the interaction sphere (radius 50nm) for amphioxus (left) and blood clam (right), respectively.B, C) Interaction network of SuperHox (green, C) and NK (purple, D) clusters in amphioxus and blood clam, respectively.Circled numbers correspond to putative shared interactors of SuperHox/NK clusters between amphioxus and blood clam.

Table 2 .
Hi-C constraints applied for single chromosome model reconstruction from mean/median and 35 th percentile IF threshold cut-off in amphioxus.

Table 3 .
Hi-C constraints applied for single chromosome model reconstruction from mean/median and 35 th percentile IF threshold cut-off in blood clam.

Table 4 .
Hi-C constraints applied for single chromosome model reconstruction from mean/median and 35 th percentile IF threshold cut-off in jellyfish.