Identifying high-confidence capture Hi-C interactions using CHiCANE

The ability to identify regulatory interactions that mediate gene expression changes through distal elements, such as risk loci, is transforming our understanding of how genomes are spatially organized and regulated. Capture Hi-C (CHi-C) is a powerful tool to delineate such regulatory interactions. However, primary analysis and downstream interpretation of CHi-C profiles remains challenging and relies on disparate tools with ad-hoc input/output formats and specific assumptions for statistical modeling. Here we present a data processing and interaction calling toolkit (CHiCANE), specialized for the analysis and meaningful interpretation of CHi-C assays. In this protocol, we demonstrate applications of CHiCANE to region capture Hi-C (rCHi-C) and promoter capture Hi-C (pCHi-C) libraries, followed by quality assessment of interaction peaks, as well as downstream analysis specific to rCHi-C and pCHi-C to aid functional interpretation. For a typical rCHi-C/pCHi-C dataset this protocol takes up to 3 d for users with a moderate understanding of R programming and statistical concepts, although this is dependent on dataset size and compute power available. CHiCANE is freely available at https://cran.r-project.org/web/packages/chicane. The capture Hi-C assay and its variants require specialized statistical methods for identification of high-confidence contacts. This protocol presents a versatile computational pipeline for detecting interactions and performing downstream analyses.


Introduction
Chromosome conformation capture (3C) methods are a set of techniques for identifying interactions between chromosomal regions based on chromatin conformation in the cell. The initial 3C methodology has been expanded to include chromosome conformation capture-on-chip (4C), chromosome conformation capture carbon copy (5C) and Hi-C 1-3 . Hi-C utilizes proximity ligation and biotinylation of chromatin followed by high-throughput sequencing to produce a genome-wide interaction map 1,4 . The advantage of Hi-C in comparison to other 3C methods is that it provides genome-wide coverage of interactions; however, high-resolution genome-wide coverage comes at very high sequencing costs 5 . To achieve relatively low cost, high-resolution interaction data for a set of regions of interest we and others have further combined 3C or Hi-C with a capture enrichment step (Capture-C, Next Generation Capture-C or Capture Hi-C) [6][7][8][9] . This targeted approach offers a powerful tool to delineate spatial and functional relationships at specific regions of the genome (rCHi-C) such as those identified by genome wide association studies (GWAS) [10][11][12] and/ or particular functional elements such as promoters (pCHi-C, NG Capture C) [7][8][9]13 . For example, rCHi-C has been used to identify putative target genes at breast cancer 10 , colorectal cancer 11 and rheumatoid arthritis 12 GWAS loci; pCHi-C has successfully enabled linkage between genetic variants and their target genes prioritizing new disease-associated genes and pathways 9,13 . As an emerging tool, CHi-C offers a promising system to generate high-resolution interaction maps of selected loci. However, CHi-C approaches generate data matrices representing unbalanced many-to-all interaction profiles between baits and target fragments (also referred to as a noncaptured end), respectively, where baits are pre-defined but the target fragments could be anywhere in the genome. This is inherently different to Hi-C protocols, where the interaction profiles are not biased by pre-defined baits so both baits and other ends could be anywhere in the genome. Therefore, it remains unclear how best to model CHi-C data and handle the unique error profile of this methodology.
In this protocol we delineate a comprehensive workflow beginning with the computational steps required to process raw sequencing data from CHi-C libraries using HiCUP 14 or similar approaches (Fig. 1a,b) and the creation of input data needed for interaction calling and meaningful interpretation using CHiCANE (Fig. 1c). Finally, multiple downstream analyses are demonstrated to annotate candidate interaction peaks, enabling biological interpretation (Fig. 1d). We also cover the many optional steps and parameters available to tailor this protocol for various experimental needs. Our holistic CHi-C analysis protocol presents researchers with a broad toolkit for processing CHi-C data through to biologically interpretable output.

Development of the protocol
Identifying chromosomal regions that interact more than expected by chance (interaction peaks) requires an accurate model for the background rate of interactions. The capture enrichment step in CHi-C further complicates this model compared to a typical Hi-C experiment, as the capture dynamics will differ depending on whether one or both ends of the interacting fragment pair are targeted in the capture enrichment step. To address this, we developed CHiCANE specifically for the analysis of CHi-C data 6,10 ; our method has also been tested in independently published CHi-C research 12 . We have now extended CHiCANE to support a range of statistical distributions and genomic features to fit a wide range of CHi-C protocols [8][9][10]12 . CHiCANE is freely available as an R package that includes a flexible toolkit that allows users to tailor the settings to different experimental designs and library preparation protocols.
This protocol is designed to handle the most frequent tasks involved in CHi-C analysis from preprocessing, through intermediate data manipulation, on to interaction peak calling and, finally, subsequent downstream analysis of interaction peaks. The protocol starts with an overview of preprocessing of sequencing data generated from CHi-C libraries using a combination of third-party tools 14,15 to create alignment BAM files. Using CHiCANE, BAM files are processed into interaction tables by removing all reads where neither end maps to a capture bait as well as those in which both ends map to the same fragment. Reads are identified by the restriction fragment they map to (requiring an overlap of ≥51% of the read), and the number of reads linking each combination of restriction fragments is calculated. To identify interaction peaks, CHiCANE models the expected number of reads linking two restriction fragments as a function of the distance between the loci and the 'interactibility' 16 of the bait fragment; that is, its inherent propensity to interact with other fragments.
Let Y ij denote the number of reads linking bait i and target fragment j, d ij denote the distance between the two fragments and t i denote the number of reads linking the bait i with another fragment in trans. If the target fragment j is not a bait, the number of reads linking the target fragment j with another fragment in trans should be zero. Y ij is assumed to follow a distribution with mean parameter μ ij , where If the target end fragment j is also a bait, terms are included to adjust for the trans interactions of the fragment j, as well as an additional term capturing statistical interaction between the two trans counts, expressed as a product of the trans read count of the two fragments.
cis interactions are sorted by the distance between the interacting fragments and regression models are fitted in a piecewise linear fashion with an approximately equal number of di-tags in each fitted model. As a result, the distance range (minimum to maximum distance between interacting BEDTools Intersect in silico digest with array probes

Model read count per restriction fragment pair
Bait-to-bait Bait-to-other fragments) in each model will vary. trans interactions do not have a distance term d ij , and are fitted separately from the rest. To account for the different capture dynamics between bait-to-bait and bait-to-other interactions, these are also modeled separately.

High quality CHi-C interactions
After an expected value μ ij has been obtained for each combination of fragments, a P-value for the observed counts y ij exceeding what is expected by random chance is estimated as: The probability P(Y ij ≥ y ij ) depends on how the distribution of the counts conditional on the expected value, Y ij |μ ij , is modeled. In order to allow for overdispersion, under default settings CHiCANE assumes a negative binomial model for the counts, with E(Y ij ) = exp(μ ij ). The negative binomial is similar to the Poisson model for counts, but includes an overdispersion term that makes it more appropriate for datasets with high variance. After fitting β 0 , β 1 , β 2 and the overdispersion term by maximum likelihood over all pairs of fragments, we use the fitted model to estimate a P-value for each pair. Finally, a multiple testing correction (Benjamini & Hochberg method) is applied to identify pairs of fragments (interaction peaks) that interact more than expected by chance. To give users flexibility, CHiCANE supports several different statistical distributions for the expected number of reads linking any two chromosomal loci. The quality of the model fitting process is evaluated using rootograms 17 , which are a convenient way of assessing fit of regression models for counts data. Rootograms are similar to histograms of observed counts data with the expected counts fit overlaid. However, in rootogram representation, the top of the histogram bars is shifted to match the expected counts fit. As a result, the histogram bars are shifted above or below the reference line at y = 0. The resulting deviation from the reference line can be intuitively interpreted as over-or underprediction by the model. This protocol also shows how the identified interaction peaks are subject to a range of downstream analyses depending upon the CHi-C protocol used and the research question. These downstream analytics could help unravel novel biology and guide users to meaningful interpretation of CHi-C libraries.

Applications and adaptations of the method
While CHiCANE was conceived to call interaction peaks from rCHi-C experiments, it is also applicable to pCHi-C data. As CHiCANE utilizes BEDTools 15 to process inputs into interaction data tables, it accepts BED files as the first two inputs, a baitmap and a restriction digest of the reference genome. Our suggested methods for creating these files are outlined in the pre-processing section of the protocol steps below; however, any other method of obtaining BED files with the necessary coordinates is suitable and usable by CHiCANE. Similarly, pre-processing of sequencing data can be handled in numerous ways. HiCUP 14 is a user-friendly pipeline commonly used in pre-processing of CHi-C libraries which includes mapping and pairing of sequencing reads and creating alignment BAM files. HiCUP also includes a script for generating an in silico digest of the reference genome, which is used for filtering out common experimental artefacts. This in silico digest along with the BAM files created by HiCUP are readily consumable by CHiCANE as starting inputs. Another useful mapping pipeline from which CHiCANE can utilize output is that from Arima Genomics (arimagenomics.com), which we have found to be useful in processing libraries generated using Arima Hi-C kits as well as Dovetail's Omni-C kits (dovetailgenomics.com) when combined with the target enrichment protocols. BAM files from either of these pipelines or any other method of mapping and pairing reads to produce BAM files can be used as CHiCANE input. The flexible toolkit built into CHiCANE is designed to be adaptable to a wide variety of CHi-C and sequencing library processing protocols.
This protocol can be easily adapted to any species with appropriate input and annotation files. However, users must take into consideration species-specific topological architecture and, number and length of chromosomes when interpreting distribution of cis and trans interactions.

Comparison with other methods
In addition to CHiCANE, there exist several other interaction calling methods [18][19][20] that have been used to analyze different CHi-C protocols. Of these, GOTHiC 20 was developed for Hi-C experiments. GOTHiC uses a cumulative binomial test to estimate the probability of observing the given read counts between two loci by chance alone, and does not adjust for the distance between the interacting fragments. CHiCAGO 19 models the expected number of reads as the sum of a Poisson distribution and a negative binomial distribution as well as adjusting for distance, while ChiCMaxima 18 does not use statistical modeling but rather calls interactions based on local maxima within a sliding distance window.
The absence of distance adjustment in GOTHiC means that this method is likely to call a greater number of short cis interactions (<10 kb), which might be appropriate for discovery-focused studies collating global interaction profiles; however, this strategy inevitably risks an increased type I error rate 21 . CHiCAGO was developed specifically for pCHi-C experiments and might also be appropriate for rCHi-C data. Early data from ChiCMaxima 18 suggests good performance in short cis regions of <20 kb, which is the default window size for this method. The differences between these interaction calling algorithms will, inevitably, lead to differences in the number of calls and the median distances between interacting fragments called by each caller. On the assumption that GWAS risk loci harbour causal variants, some of which influence gene expression via long-range interactions with their target genes, CHiCANE was developed to identify such interaction peaks. To our understanding, CHiCANE is an appropriate choice for prioritizing high-quality rCHi-C interaction peaks in mid-range (100 kb-5 Mb). We show this in our recent rCHi-C analysis using CHiCANE 10 , where the median distance between interacting fragments at called interaction peaks varied from 350 kb to 1.5 Mb depending on the cell line. Furthermore, CHiCANE also serves as an exploratory toolkit to test a range of statistical distributions for any CHi-C dataset in order to find an appropriate model that best describes the data. A future benchmarking study comparing these interaction-calling methods could substantially enhance our understanding of the pros and cons of these methods.

Experimental design
A typical analysis of CHi-C libraries can start with a single sample. Once the CHi-C library has been sequenced, the analysis can begin using the guidelines detailed in this protocol. It is, however, recommended to perform an experiment in replicates in order to minimize noise and elevate true signal from the libraries. Furthermore, it is also useful to perform a second pass reciprocal capture Hi-C experiment to validate prioritized hits from the first experiment 9,12 . All of these libraries are processed the same way as detailed in this protocol. The pre-processing steps encompass sequence alignments and quality assessment of sequenced libraries. Once these data have been generated, CHiCANE has a number of functions to transform the aligned reads into R data tables. Next, a user can customize the model fitting process that best explains the data. Here we provide an explanation of key concepts that drive the model fitting process, starting with the choice of distribution. CHiCANE implements a wrapper function chicane(), which takes most of the parameters discussed below, unless stated otherwise.

Negative binomial
The default negative binomial distribution models the read counts as Y~NB(μ, θ), where μ is the mean and θ is a dispersion parameter. Under this model, the variance of the counts conditional on the mean μ is given by: The maximum likelihood estimates (MLEs) of the coefficients b β i (see 'Development of the protocol') are used to provide an estimate of the expected number of counts for each combination of restriction fragments. The MLE of the dispersion parameter θ is used as the dispersion parameter (size) in the negative binomial model.

Poisson
Unlike the negative binomial, the Poisson distribution does not allow the variance of the counts to exceed their mean. This model assumes that Y~Poisson(μ), giving Var(Y) = μ. In practice, CHi-C counts often show more variability than can be explained by the Poisson model, leading to false positive interaction calls.

Truncated negative binomial
When fitting the model without including unobserved combinations of fragments, which is CHi-CANE's default, there are no counts of zero in the data. If a user would like to accommodate for this in the statistical model, a truncated negative binomial distribution can be specified, which is implemented in CHiCANE by using the GAMLSS R package 22 . This model assigns P(Y = 0) = 0, and assumes the probabilities are proportional to the negative binomial probabilities for values y > 0.

Truncated Poisson
Similar to the truncated negative binomial, the truncated Poisson distribution is supported by CHiCANE. This model assumes the probabilities are proportional to the Poisson probabilities for values y > 0.

Inclusion of zero counts
CHiCANE's default settings only include interactions that occurred at least once, as for most experiments including zeros will result in impractically large files. As mentioned above, specifying a truncated distribution through the 'distribution' parameter is one way to account for the zero counts; however, this will slow down model fitting. Alternatively, CHiCANE includes a parameter 'include. zeros' for which a user can specify 'cis' to include all possible interactions on the same chromosome or 'all' to include all possible interactions. Note that the much larger file that will result from using either of these options will also increase processing time. Inclusion of zeros in the model will affect the quality of the model fit. For instance, when the proportion of zeros is very high this will result in higher residual variation, hence the expected (predicted) values might not be accurate. Fitting both models, with and without zeros, and assessing the quality of the model fits could help guide the choice of an appropriate model (see 'Assessing model fits' below).
Adjusting for custom covariates Chromosome conformation capture (3C) assays are affected by systematic biases such as GC content or restriction fragment length 23 , which, in Hi-C libraries, are corrected during raw data normalization 24 . For CHi-C libraries, global correction of count matrices is not suitable due to locus-specific capture and sequencing. Furthermore, as CHi-C is being increasingly considered in a disease context 6,10,12,13 such as cancer, using cancer cell lines will present additional biases in the interaction profiles due to structural variations (e.g., copy number changes, insertions and deletions). To examine and adjust for the effect of disease-specific biases or genomic properties in general, CHiCANE offers adjustment for covariates that capture local properties of interacting fragments. Any column in the data can be included as an adjustment term by simply passing it to the 'adjustment.terms' parameter. For example, if a user would like to adjust for the chromosome of the target fragment, 'adjustment. terms' can be set to 'target.chr'. Furthermore, CHiCANE supports expressions, such as log transformations, as well as multiple adjustment terms in the form of a vector passed to the 'adjustment. terms' parameter. Additionally, any custom covariate not already present in the interactions data, such as GC content, can be added to the data table that results from prepare.data() function to preprocess interaction data before running the chicane() function on the interactions data table.
Filtering baits and/or targets Some experiments take the approach of filtering out bait or target fragments with a low 6 or high 19 degree of 'interactibility', as inferred through the trans counts. CHiCANE's default is to include all baits and targets when fitting the model; however, it does also support filtering the fragments to be used in the model. There are two parameters for this purpose, 'bait.filters' and 'target.filters', which allow a user to set the proportion of fragments that should be considered to fall outside of a minimum and/or maximum limit. Each parameter accepts a vector of length two, specifying the 'lower limit' and 'upper limit' as proportions of the number of bait or target fragments.

Merging biological replicates
For experiments with biological replicates, CHiCANE provides a helper function compare.replicates() to assess replicate concordance, which compares pairwise correlation between replicates further stratified by interaction distance bins. If replicates demonstrate concordance, they can be merged by CHiCANE either through the chicane() function itself or via the prepare.data() function to preprocess the interaction data. There are two options for how CHiCANE merges the replicates: 'sum' is the default, which simply sums the counts of the replicates; otherwise, 'weighted-sum' can be used, which will adjust for differences in library sizes between the replicates. These options are specified through the parameter 'replicate.merging.method'. As 3C libraries feature variable levels of technical noise 25 and limited reproducibility amongst replicates 21 , we recommend merging the replicates to improve signal-to-noise ratio and increase the power to detect interaction peaks. However, where the in-depth examination of replicates is desired, CHiCANE can be run on individual replicates separately and their interaction peaks can be subsequently assessed for overlap.
Assessing model fits CHiCANE will output model fit statistics and plots to the directory provided to the 'interim.data. directory' parameter, which can be used to assess the fitted models. As the model is fitted in a piecewise linear fashion, the data is sorted by distance bins and fit separately within each bin, then merged by concatenating the results from the individual model fits. The longer distance bins can be expected to have fewer interaction counts informing the model fits. CHiCANE tries to split data based on distance quantiles, i.e., 100 equally sized groups; however, in some cases the resulting datasets are too small for model fitting, as would be the case when starting with a small dataset (<50 observations per group). In such a case, the number of distance bins are halved until all resulting datasets are large enough to estimate stable parameters of the model. This default behavior can be overridden with the 'distance.bins' parameter. trans interactions are fitted separately from cis interactions. Model fitting is performed through parallel processing and run time would be reduced if multiple cores are specified through the parameter 'cores' in the chicane() function.

Selecting FDR threshold
The output from CHiCANE is a data table containing all interactions sorted by FDR (Benjamini & Hochberg method) adjusted P-value (q-value). This table, including all interactions, is useful in some circumstances, such as the inclusion of a track displaying total counts in the helper function create. locus.plot() or in simulating background interactions with the stratified.enrichment.sample() function. However, the full table with all interactions will be quite large so for further analysis users can select only meaningful interactions by filtering for those below a desired q-value threshold. We recommend filtering for q-value < 0.05.

Limitations
Prioritizing putative target genes and credible variants from GWAS for labour-intensive functional studies requires a conservative approach to prioritize high-confidence interaction peaks. In addition, it is arguable that the 'nearest gene' or 'nearest expressed gene' to a GWAS signal will a priori be considered as a putative target gene. Consequently, for post-GWAS purposes, short cis interaction peaks (<10 kb) are of less interest than those that map further afield. According to these criteria, CHiCANE performs well as it is a conservative caller, focusing on calling high confidence, mid-range (100 kb-5 Mb) interaction peaks. CHiCANE might, however, be less well suited for generating comprehensive maps of tissue-specific regulatory elements using a pCHi-C approach. The (presumably) low type I error rate with CHiCANE might come at the cost of a relatively high type II error rate. For the purposes of cataloging all putative regulatory elements in pCHi-C, elements that map both proximally (<10 kb) and distally (≥10 kb) to the gene(s) that they regulate are equally important and including a higher rate of false positives could be a price worth paying in order to avoid false negatives.

Materials Equipment
Hardware • A commodity computer with at least four cores or high-performance computing (HPC) environment. A recommended 32 GB of RAM is appropriate to process a CHi-C dataset with two replicates each yielding up to~50 million interactions. Disk space requirements are proportional to the size of sequencing data (~2.5 × the size of gzipped fastq files) Software and data • Linux or Unix-based operating system, including Mac OSX • R version 3.4.0 or higher (https://www.r-project.org/) or, equivalently, RStudio, which is an integrated development environment (IDE) for R (https://www.rstudio.com/) • Perl version 5.10 or higher (used by HiCUP, see below; https://www.perl.org/get.html) • Alignment tool of choice for pre-processing. We recommend Bowtie2 26  install.packages('chicane', dependencies = TRUE) R Studio users can also install packages through R Studio's 'Package' management tab.

Procedure
Pre-processing data • Timing ≥2 h depending on library size (as a representative example, the T-47D libraries used in this study took~5 h with one CPU core assigned to each replicate) c CRITICAL Initial pre-processing steps are performed in command line (not R) except where otherwise specified. 1 CHiCANE is designed to function with any BAM files from any processing pipeline. HiCUP is an example of one such pipeline that can be used to obtain mapped and filtered read pairs as well as an enzyme-digested genome. This pipeline takes FASTQ files with sequencing data as the input and the output is Bowtie2-aligned 26 BAM files with corresponding QC summary reports. It also provides a script for generating an in silico digest of the genome, which is required for HiCUP filtering if necessary.
(Note: this only needs to be done once for each genome build/restriction enzyme combination.) A brief example for running the pipeline is given in the code below; see HiCUP documentation for further details at https://www.bioinformatics.babraham.ac.uk/projects/hicup/read_the_docs/html/index.html.
perl hicup --config /path/to/hicup_config.conf While a user can run HiCUP for processing libraries generated using an Arima Hi-C kit, an alternative is Arima's own mapping pipeline, which is also useful for enzyme agnostic protocols such as DNase Hi-C, which is not currently supported by HiCUP. The Arima pipeline is found at: https://github.com/ ArimaGenomics/mapping_pipeline HiC-Pro 28 is another useful toolkit for Hi-C data processing, which has stated compatibility with the Arima protocol and DNase protocols as well as traditional restriction enzyme digest protocols. Users can run the HiC-Pro workflow in a piecewise fashion to obtain BAM files to be used as the input to CHiCANE. HiC-Pro provides a utility for creating the fragment digest also needed as input (Step 2). HiC-Pro scripts and documentation are available at https://github.com/nservant/HiC-Pro 2 Create fragments digest BED file. CHiCANE includes a helper function to convert the digested genome output from HiCUP into a BED file, convert.hicup.digest.bed(). Otherwise, CHiCANE can use any BED file containing the coordinates of the restriction enzyme digested fragments of the desired genome build. Throughout this paper we use BED format (which contains three tab-separated columns without a header: chromosome, start and end) for a number of operations. 3 Create a baitmap. Intersect the digested fragments BED file (Step 2) with a BED file containing the coordinates of the probes which were included on the capture array probe using the bedtools 'intersect' command. We recommend to then limit these results to a ≥20 bp overlap and sort for only unique fragments to speed up processing as follows: bedtools intersect -wo -a /path/to/fragments.bed -b /path/to/capture_ regions.bed \ > /path/to/fragments_overlap_full.bed awk '$7>=20 {print $1 "\t" $2 "\t" $3}' /path/to/fragments_overlap_ full.bed \ | uniq > /path/to/captured_fragments.bed 4 (Optional) Create an interaction data object. Rather than running CHiCANE as a wrapper function on the BAM and BED files directly, the user can pre-process data by running the prepare.data() function on the BAM and BED files to obtain a data table of all interactions containing details of the fragments and the number of reads linking the two fragments. This step is useful if a user would like to adjust for a covariate that is not already present in the data table (e.g., GC content) and will also speed up the model fitting, although the pre-processing will take some additional time. Another use case is to obtain interaction data on individual replicates for the replicate comparison with CHiCANE's compare. replicates() helper function before running interaction calling with the replicate merging option. Pass the BAM(s) (Step 1) as the 'bam' parameter, the baitmap (Step 3) as the 'baits' parameter, and the restriction digested fragments (Step 2) as the 'fragments' parameter.
interaction.data <-prepare.data( bam = '/path/to/sample.bam', baits = '/path/to/baitmap.bed', fragments = '/path/to/captured_fragments.bed' ) write .table( interaction.data, CRITICAL STEP This step is performed in an R environment. 5 (Optional) Compare biological replicates. Users can compare biological replicates to evaluate replicate concordance by using the CHiCANE helper function compare.replicates(), which calculates the correlation coefficient between interaction counts for replicates in a pairwise manner across different interaction distance bins. To make this comparison, first run the prepare.data() function (Step 4) on each of the replicates individually to get the interaction data files for comparison. The compare. replicates() function then requires the paths to these replicates' interaction files (saved in the previous step) passed to the 'interaction.data' parameter, and the path to the directory where the user would like the plots to be saved passed as the 'output.directory' parameter, as shown below: compare.replicates( interaction.data = c('/path/to/interaction_data_rep1.txt', '/path/ to/interaction_data_rep2.txt'), output.directory = '/path/to/plotting_directory' ) c CRITICAL STEP This step is performed in an R environment. 6 (Optional) Using a four-cutter enzyme (e.g., DpnII) or multiple restriction enzymes (arimagenomics. com) rather than a six-cutter enzyme (e.g., HindIII) might improve the resolution of a library by generating shorter (DpnII) and/or more random restriction fragments. However, for the same amount of data (sequencing yield in gigabases or mapped reads) there will be fewer reads (on average) mapping to each pair of interacting fragments. As the number of di-tags at any pair of fragments will influence the power of the analysis, at a given sequencing depth these libraries might lack power for a single restriction fragment-based analysis. If sequencing to a greater depth is not an option, the interaction data from these libraries could be further processed by merging neighbouring fragments in a predefined window size. To determine the window size, the fragment-wise counts distribution of an experiment should be assessed (either directly from the interaction data table in Step 4 or by assessing model fit plots after successful completion of Step 7). The size of this window is likely driven by the sequencing depth of the library and, therefore, the power to detect significant interactions. As a rule of thumb, it is useful to try a few thresholds varying between 1 kb and 5 kb and assess the impact of the varying thresholds on the counts distribution per fragment. Users can accomplish this by creating a fragments BED file (Step 2) based on the desired window size. This file simply specifies fragments according to the window size, then this window-based fragments file is in turn intersected with the bait coordinates to create a baitmap (Step 3) which is also based on the desired window size. These window size-based BED files can then be input as normal when running CHiCANE. Similarly, the baitmap and fragments BED files could also be designed to cluster interactions within loci as specified (i.e., replacing window-based coordinates with region-based coordinates for regions desired to be clustered together as the fragments BED file (Step 2) and also making the baitmap (Step 3) by intersecting baits with this region-based fragments BED file as detailed above for a window size-based BED file).
CHiCANE interaction calling • Timing ≥2 h, depending on the library size and number of CPU cores assigned (as a representative example, the T-47D libraries used in this study took~3 h with four CPU cores assigned) c CRITICAL CHiCANE interaction calling steps are performed in an R environment. 7 The main function chicane() is typically run as a wrapper call on the BAM and BED files (Steps 1-3) directly, as it will internally convert each BAM file to a text file then use BEDTools to overlap this file with the captured fragments and identify the restriction fragments corresponding to each read. Alternatively, chicane() can be run in a stepwise fashion on a pre-processed interaction data If replicate merging was performed through prepare.data(), the above code is replaced with the last block of R code shown in Step 7. 9 (Optional) While the negative binomial distribution is the default, CHiCANE supports several different distributions which can be specified with the 'distribution' parameter. Alternative distributions supported can be specified as: 'poisson', 'truncated-poisson', or 'truncated-negativebinomial', for example: chicane.results <-chicane( bam = '/path/to/sample.bam', baits = '/path/to/baitmap.bed', fragments = '/path/to/captured_fragments.bed', distribution = 'poisson' ) 10 (Optional) As the default, only interactions that are detected at least once are included in the data.
To also include zero counts, use the 'include.zeros' parameter: include.zeros = 'cis' includes all zero counts for bait/target combinations on the same chromosome; alternatively, include.zeros = 'all' includes all possible combinations. The example below shows inclusion of cis zeros: chicane.results <-chicane( bam = '/path/to/sample.bam', baits = '/path/to/baitmap.bed', fragments = '/path/to/captured_fragments.bed', include.zeros = 'cis' ) 11 (Optional) CHiCANE tries to split data into an optimal number of distance bins based on the size of the dataset, such that the resulting datasets are large enough for the model to be fit. In order to override this default behavior, set the desired number of distance bins with the 'distance.bins' parameter. For example, the following call will result in the count model fitted separately in each of the 50 data bins (trans interactions are fitted separately from cis interactions). Specifying multiple CPU cores through the 'cores' parameter would run multiple model fits in parallel to decrease the runtime.
chicane.results <-chicane( bam = '/path/to/sample.bam', baits = '/path/to/baitmap.bed', fragments = '/path/to/captured_fragments.bed', distance.bins = 50 ) 12 (Optional) By default, CHiCANE includes all baits and targets when fitting the model; however, for users wanting to filter out fragments with low and/or high 'interactibility', the chicane() call supports 'bait.filters' and 'target.filters' parameters. These parameters accept a vector of length two, the first element corresponding to the lower limit as a proportion of fragments and the second element corresponding to the upper limit, also as a proportion of fragments. Note that filtering the fragments will affect multiple testing correction by changing the number of tests performed. In the following example the baits with the lowest 30% or highest 10% of trans counts will be removed before fitting the model. Interpretation and post-processing • Timing 3-10 min for each of the options below.
c CRITICAL Interpretation and post-processing steps are performed in R except where indicated otherwise. 17 (Optional) CHiCANE calls can readily be converted to a BED file by simply selecting columns [3][4][5] for target fragments and/or 6-8 for bait fragments. These columns are useful for performing bedtools 'intersect' with a number of genomic data formats for annotation and enrichment testing of target and/or bait fragments. For instance, to annotate the genes to which the target fragments map, a BED file (e.g., interaction_calls_significant_targets.bed) can be made from the interaction calls' target fragments (chr, start, end), which can then be intersected with a genome annotation file of choice. The following calls are an example of how to pre-process the human refGene file (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz) into a BED file to then intersect the CHiCANE interaction peaks' target fragments BED file with the annotation BED file made from refGene. The bedtools intersect -wao option shown below keeps data from both BED files in the output and reports the number of base pairs by which the intersected files overlap. Furthermore, data for which the -a file (interaction_calls_significant_targets.bed in the example below) has no overlap with the -b file (refgGene.bed in the example) is still reported with overlap = 0; see bedtools intersect documentation for details: https://bedtools.readthedocs.io/en/ latest/content/tools/intersect.html.
zcat refGene.txt.gz | awk '{print $3 "\t" $5 "\t" $6 "\t" $13}' \ | sort -u > refGene.bed ▓ bedtools intersect -wao -a /path/to/interaction_calls_significant_ targets.bed \ -b /path/to/refGene.bed \ > /path/to/interaction_calls_significant_targets_genes.bed Similarly, a user can intersect both target and/or bait fragments (columns 3-8) of interaction peaks with Variant Call Format (VCF) files containing variant calls, such as somatic variants in a specific cancer type. Furthermore, BED files are used for many disparate datasets that can be used for enrichment testing, such as copy number aberrations (CNAs) or ChIP-Seq (e.g., histone marks, CTCF binding sites), any of which could also be used with a bedtools 'intersect' command to find overlap with CHiCANE BED data for enrichment. In addition to the command line interface of bedtools, a user can access bedtools through an R environment using the package bedR 29 . An important consideration when using the bedtools 'intersect' command is to ensure that the chromosome format of both files is identical. For example, VCFs containing somatic variation calls from the Pan-Cancer Analysis of Whole Genomes (PCAWG) 30 study have only the number of the chromosome, e.g., '1' rather than 'chr1', in which case the 'chr' should be prepended to the VCF files' chromosome column for bedtools intersect to function properly. As an additional consideration for intersecting with a VCF file, a user might want to limit the reported fragments to those with an overlap that can be achieved with the -wo option (in place of -wao detailed above. -wo only includes those fragments with overlap > 0, i.e., the -wo option in this example reports only fragments that do contain a reported mutation in a VCF file) as follows: bedtools intersect -wo -a /path/to/interaction_calls_significant_ targets.bed \ -b /path/to/variants.vcf \ > /path/to/interaction_calls_significant_targets_mutations.bed c CRITICAL STEP The commands in this section are run on the command line. 18 (Optional) To examine functional associations between a locus of interest and the rCHi-C identified target genes (Step 16), a user can perform an eQTL analysis to model the relationship between a GWAS SNP and mRNA abundance of its putative target gene in a cohort of interest. This step requires accessing matched genotype data and mRNA abundance profiles. This relationship is modeled directly in R by fitting a linear model as shown below: where 'mrna.abundance' is a continuous variable and 'genotype' is either a categorical variable (e.g., 'A/A', 'A/G' and 'G/G') or a discrete variable encoding genotypes as 0, 1 and 2.
As variation in mRNA abundance is often only in part explained by the genotype, we recommend that genotype be associated with the residual mRNA abundance by regressing out the effect of other covariates, which may also explain variance in mRNA abundance. For instance, a gene's mRNA abundance could be influenced by its dosage and/or promoter methylation. Hence, where copy number data and DNA methylation data are also available, an appropriate eQTL model in R is specified as: lm(mrna.abundance~genotype + copy.number.state + methylation) where 'copy.number.state' is either a categorical variable (e.g., 'loss', 'neutral' and 'gain') or a discrete variable specifying the total number of gene copies (e.g., 0, 1, 2, 3, …), and 'methylation' is either a continuous variable (e.g., methylation beta values) or a categorical variable (e.g., 'methylated' and 'unmethylated'). In practice this is run multiple times (at least 100 times) to generate a large number of background distributions in order to infer a stable estimate of background enrichment. Once random partitions are created, save them as separate BED files and overlap them with the BED file of histone marks using the bedtools 'intersect' command below. Note that the '-c' option reports a count of the number of overlaps including 0 for -a entries with no overlap with -b:

(Optional) For genome-wide visualization of interactions and integration
bedtools intersect -a /path/to/interaction_calls_random_targets.bed \ -b /path/to/histone_coordinates.bed \ -c > /path/to/interaction_calls_random_targets_histones.bed c CRITICAL STEP BEDTools commands are run on the command line. The proportion of interactions overlapping with histone marks across multiple randomly-sampled background datasets is summarized as the mean or median, which is then compared with the observed enrichment of histone marks in the original dataset. 22 (Optional) TAD boundaries are another useful data type that can be stored in a BED file; however, users should intersect them with the full-length coordinates of an interaction rather than individual fragments to test if the interaction is within or across TAD boundaries. To obtain a BED file (e.g., interaction_calls_full_length.bed) for this purpose from CHiCANE calls, first limit interactions to cis interactions only by removing any calls with 'NA' in the 'distance' column. Then for each cis interaction add to a data table the chromosome (column 3), the minimum of the fragment starts (columns 4 and 7) and the maximum of the fragment ends (columns 5 and 8). Write these three columns as a BED file to correspond to the coordinates of each interaction. The following is a bedtools call suited to intersect the resulting full interaction coordinates BED file with TAD coordinates, where '-f 1.0' ensures complete overlap between the coordinates and the '-c' option reports a count of the number of overlaps: bedtools intersect -a /path/to/interaction_calls_full_length.bed \ -b /path/to/TAD_coordinates.bed -c -f 1.0 \ > /path/to/interaction_calls_full_length_TADs.bed c CRITICAL STEP BEDTools commands are run on the command line. 23 (Optional) Since the negative binomial and Poisson models are nested in structure with Poisson being a special case of negative binomial, these models can be quantitatively compared by assessing their log-likelihood estimates. Of the two models, the one with the higher log-likelihood estimate can be considered better. In order to estimate the statistical significance of the difference between the two likelihood estimates, apply a likelihood ratio test of the two estimates; however, this should be done on matched distance bins. For instance, to test whether the log-likelihood estimate for the first distance bin of the negative binomial model is significantly superior to the Poisson model using the likelihood ratio test, use the following R code: # extract the log-likelihood estimates and degrees of freedom (df) from # the model fit statistics file of the first distance bin and initialise # variables loglik.nb <--100.123 df.nb <-4 loglik.poisson <--120.456 df.poisson <-3 ▓ # estimate the difference in log-likelihood estimates and compare with # the chi-squared distribution to estimate a p-value models.difference <-2*(loglik.nb -loglik.poisson) p.value <-pchisq(models.difference, df = df.nb -df.poisson, lower. tail = FALSE) 24 (Optional) The choice of compulsory and optional steps (above) will vary across research groups. Some users may also require bespoke extensions to these steps. Once an appropriate combination of these steps has been identified and implemented, users might want to consider automating these steps through a workflow engine such as Common Workflow Language (https://github.com/ common-workflow-language), Nextflow 33 or Snakemake 34 . These workflow engines will help create a reusable workflow that can run in an automated and parallel fashion. Further, these workflows are easy to share and deploy to HPC and cloud environments, enabling reproducible research.
Troubleshooting Occasionally, the model fitting procedure will result in numerical errors as the result of a lack of convergence. This is most often due to a lack of overdispersion in the data leading to an unstable estimate of θ. If the variance is approximately equal to the mean, the θ term goes to infinity and the model fails to converge. When the analysis results in a numerical warning or fails to converge, the chicane() function will attempt to diagnose if the problem was due to a lack of overdispersion. This is done by fitting a negative binomial regression model with 25 iterations, and using the resulting fitted values and dispersion parameter to examine the variance of the negative binomial. If the estimated variance at the point with the largest fitted value exceeds the corresponding estimate of the Poisson variance (i.e., the fitted value at that point) by <0.1%, a Poisson distribution is fit instead. Please follow the log statements printed by the chicane() call to be aware of when this happens. Other potential errors and corresponding solutions are listed in Table 1. Steps 17-23: Any of these analyses will take up to 10 min to run. However, optimization of plots might take longer as numerous iterations are often required to create publication quality plots.

Assessing alignments and quality of CHi-C library
If using the HiCUP pipeline for processing of sequencing data (Steps 1-3), a report will be generated for each sample. This report includes an assessment of the truncation of reads at the restriction enzyme sites (conducted as part of the pipeline) and subsequent alignment of truncated reads. It also reports the percentage of di-tags that were deemed as 'valid' by the HiCUP filter as well as those for each of the common artifacts removed by the filter. Additionally, a distribution of di-tag lengths is shown and de-duplication results are presented and broken down by cis-close, cis-far and trans. In general, in a good library, the majority of both forward and reverse reads should result in unique alignments (ideally >70%) as well as successful pairing of read pairs (see HiCUP summary reports at https://doi.org/10.5281/zenodo.4073433). A very low number of unique alignments or read pairing indicates poor libraries. Another indicator of a good library is a high proportion (>50%) of valid ditags and a high proportion of cis interactions (>70% of interaction peaks). However, these suggested proportions should be interpreted alongside the overall sequencing depth and coverage per captured fragment. Further statistics on library and sequencing characteristics are detailed in the HiCUP report, and guidelines on interpreting these reports are covered in the HiCUP 14 documentation. 'Caught a warning -checking for dispersion problems' There is a lack of overdispersion in the data, which is necessary for negative binomial model CHiCANE will automatically catch and attempt to diagnose the problem. If this is found to be the case, a Poisson distribution is fit instead.
HPC output files will indicate if this occurred

Assessing interaction calls of CHiCANE on region CHi-C data
The primary output from CHiCANE is the set of interaction calls sorted by FDR adjusted P-value (qvalue), in the form of a data table object. This output can then be filtered for the desired q-value threshold, region(s) of interest or simply cis interactions excluding bait-to-bait and trans interactions. Using a sample with two replicates (T-47D, an estrogen receptor (ER) + breast cancer cell line) from one of our previous studies 10 , we merged the replicates (Step 8) and performed interaction calling (Step 7), resulting in 21,766 interaction peaks (q-value < 0.05). Examining a well-known breast cancer risk locus 6,35 in a rCHi-C experiment as an example, the cis CHiCANE calls at the 2q35 locus, which contains a causal variant rs4442975 (Tag SNP: rs13387042), can then be filtered. The head() of the file with most relevant columns selected for easy readability is shown in  Table 3.
The following analyses and plots represent many of the common approaches for testing the integrity and subsequent interpretations stemming from CHiCANE's interaction peaks.   When CHiCANE is run with the 'interim.data.dir' option (Step 14), the model fit for each data partition will be saved as a rootogram plot along with the summary of model parameters. The plots for the model fits of the first four distance bins from the T-47D library ( Fig. 2; all model fit plots are at https://doi.org/10.5281/zenodo.4073433) demonstrate a higher number of counts in the smaller distance bins, with frequency decreasing as distance length increases (see 'Assessing model fits' under 'Experimental design'). These rootograms show the observed counts as histogram bins (gray bars) aligned to the top of CHiCANE's expected counts distribution in red. A good fit would represent minimal deviation from the horizontal reference line (at zero). Histogram bins above and below the reference line indicate over-and underprediction by the CHiCANE model, respectively. These data show good quality fits with CHiCANE's default settings and recommended negative binomial distribution. However, for rootograms displaying poor model fits, showing significant deviation from the reference line, users might wish to model their data using one of the alternative distributions (Poisson, truncated negative binomial, truncated Poisson) provided by CHiCANE (Step 9) and/or parameters filtering for unusual counts of bait and target fragments (Steps 10-12).
Examining the proportion of intra-and inter-chromosomal interaction peaks (q-value < 0.05) demonstrates that the vast majority are cis, non-bait-to-bait interactions (Fig. 3a). A sufficiently high proportion of cis interactions compared to trans interactions (Fig. 3a) is an indication of a good quality library and vice versa. However, this interpretation should be contextualized considering the number of chromosomes and genome size of the species being investigated. For instance, a species with a higher number of chromosomes is likely to generate more trans interactions and therefore the trans proportion might be higher compared to the trans proportion of the data shown here (Fig. 3a). Next, by separating cis interaction peaks into distance bins one can see that the distance between the bait and target fragments is mostly in the range of 100 kb-5 Mb (Fig. 3b,c). This is expected given that the underlying libraries are based on a rCHi-C protocol and CHiCANE's peak detection prioritizes high-confidence mid-range interactions.

Assessing replicate concordance and merging multiple replicates
If an experiment involves multiple replicates of each sample, it is beneficial to compare the replicates to one another to ensure consistency before deciding to merge them. CHiCANE includes functions to first process the reads from the individual replicates into interaction data (prepare.data(), Step 4) and to then compare the interactions found in each replicate to one another by distance bins (compare. replicates(), Step 5). Using our T-47D replicates, these example plots show that the replicates are highly correlated in the short-to mid-range interaction distance bins with decreasing correlation as the interaction distance increases (Fig. 4). Replicates in this experiment were merged using the default parameter 'sum', which does not adjust for the differences in the library sizes. If replicate concordance looks poor and the library sizes are known to be unbalanced in size, replicate merging should be performed with the parameter 'weighted-sum', which adjusts the counts relative to the library sizes (Step 8). However, it is worth noting that the poor concordance in replicates is likely to emerge from one or more poor libraries, and individual replicates should be assessed for quality separately, starting with the HiCUP report followed by evaluating the model fitting performed by CHiCANE.

Annotating interaction peaks
In order to interpret the non-captured end (target fragments) of the interaction peaks, a helpful starting point is to annotate them with known features of the reference genome, e.g., gene names. Using the interaction peaks called by CHiCANE in the T-47D library we made a BED file of the target fragments and overlapped them with the BED file of refGene using 'bedtools intersect -wao' (Step 17).
Here we show annotation of the target fragments at the 2q35 locus (top hit for each unique gene, Table 4). The final column in the table represents the number of base pairs that overlap between the two BED files provided. It can also be useful to limit the intersect to fragments that map to the transcription start site (TSS) in order to focus on the promoter region. Table 4 shows example interaction peaks called by CHiCANE from the 2q35 locus of the T-47D library, annotated (target fragments only) with gene names using bedtools intersect. Results shown are the top hit for each unique gene. While gene annotation is useful for some interacting fragments, the majority of interacting fragments (and GWAS risk loci) do not map to coding regions; it is important, therefore, to develop tools for annotating non-coding regions. This can be done by intersecting either or both of the interaction fragments with genome wide features, such as ChIP-Seq data, or disease-specific aberrations, such as somatic mutations. For instance, it is understood that spatial organization contributes to the selection of somatic alterations in cancer [36][37][38] . Hence, annotating target fragments with resources of somatic alterations can help unravel novel disease biology. Here we intersected our T-47D 2q35 target fragments with somatic variants derived from whole-genome sequencing data of the 211 breast cancer samples in the PCAWG 30 study (Step 17), finding fragments with somatic mutations (head of single/multi-nucleotide variants (SNV/MNVs) in Table 5 Table 2). This same method of annotation could be generalized to any genome-wide germline or somatic alterations such as copy number aberrations or chromosomal translocations. Table 5

Functional interpretation of target regions
Another informative tool in determining the biological relevance of interactions is eQTL analysis, which measures the association between genotype and mRNA abundance of a putative target gene. For example, at the 2q35 locus we 6 and others 35 found that the causal variant rs4442975 (Tag SNPs: rs13387042/rs6721996) is an eQTL for IGFBP5, which we confirm here using ER + breast cancers (T-47D is an ER + cell line) in the TCGA breast cancer cohort (P = 0.015, Fig. 5) (Step 18). eQTL analysis can also be useful to discriminate between multiple putative targets of a SNP, as the common practice of nearest gene association can lead to false positives 39 . eQTL analysis could be further refined by extending the model with matched copy number and DNA methylation data of the target gene, as these two variables are known to explain variance in mRNA abundance of a gene (Step 18).

Genome-wide visualization of interaction peaks
Another important aspect to gleaning meaningful interpretation of non-coding interacting fragments is to integrate these with genome annotations in a genome browser. This can be accomplished using Table 5 | Somatic mutations overlapping with 2q35 target fragments   target_chr  target_start  target_end  var_chr  var_loc  ref  alt   chr2  216500994  216504205  chr2  216503389  C  T  chr2  216504732  216510109  chr2  216505045  G  A  chr2  216504732  216510109  chr2  216505361  C  T  chr2  216504732  216510109  chr2  216505362  C  T  chr2  216510573  216512460  chr2  216511062  C  G  chr2  216512460  216517104  chr2  Viewing the 2q35 region in our T-47D example, we can clearly see that multiple significant interactions have been called between several baits in the region chr2:217030000-217070000 and several target fragments in an upstream region, chr2:216680000-216710000, corresponding to the location of the promoter of IGFBP5 (Fig. 6a). Viewing another example of 16q12.2 risk locus for breast cancer in the Epigenome Browser confirms interactions between the risk SNP (rs17817449) to distal fragments in IRX3 10 (Extended Data Fig. 1a). A further example of a breast cancer risk locus in the 14q24.1 region demonstrates previously identified interactions between the risk SNP (rs2588809) and  Interaction peaks (q-value < 0.05) are shown as arcs, where the height of the arc is proportional to the -log 10 q-value of the interaction peak. Only CHiCANE interaction peaks with q-values < 0.05 entirely within boundaries of the locus are shown. ZFP36L1, which was also shown to be a significant eQTL in ER + breast cancers 10 (Extended Data Fig.  1b). Having successfully loaded data into the Epigenome Browser, a wide range of public genome annotation tracks are freely available to aid interpretation of interaction peaks. A key resource readily available in Epigenome Browser is the ENCODE 40 collection of epigenetic marks, expression and transcription regulators across a wide range of human cell types.
For programmatic visualization of interaction peaks, the R library Gviz 41 provides a highly customisable range of options for incorporating additional data and annotation tracks. We have included a basic implementation of such a plot suited for analysis of CHi-C interactions as a helper function, create.locus.plot() (Step 20). As an example of the output from this plotting function, again looking at interactions in the 2q35 region of T-47D, we combine generic ideogram and genome axis tracks with a gene region track from the refGene GTF file to give context to the FDR filtered CHiCANE interactions at our selected locus. Furthermore, we add the counts track from the full (unfiltered) interaction call file, and an annotation track displaying the baits used in the capture portion of the experiment (Fig. 6b). For complex multi-track plotting, users might also want to consider using the Gviz library directly, or the BioCircos 42 R library.

Assessing interaction calls of CHiCANE on promoter CHi-C data
All of the aforementioned analyses were done on a sample with two replicates from a region CHi-C study; however, CHiCANE as a CHi-C toolkit is also well suited to promoter CHi-C analysis. As such, we also examined the megakaryocytes (MK) dataset (four replicates) from the Javierre pCHi-C study 9 to demonstrate results for pCHi-C data. Following replicate merging (Step 8) and interaction calling (Step 7), this dataset results in 82,830 interaction peaks (q-value < 0.05). As a representative example, the head() of cis interactions for chromosome 9 of MK libraries with the most relevant columns selected for readability are shown in Table 6. CHiCANE's default settings using the negative binomial distribution demonstrated good model fitting to the observed counts data. (First four model fit plots Extended Data Fig. 2, all model fit plots are at https://doi.org/10.5281/zenodo.4073433.) In analyzing the interaction peaks by type we see here again in this pCHi-C data that the vast majority are nonbait-to-bait, cis interactions within the mid-range distance of 100 kb-5 Mb (Extended Data Fig. 3). Table 6 shows example top cis interaction peaks called by CHiCANE from chromosome 9 of the MK library using the head() function in R, which defaults to the top six records. Selected columns are shown in the table. The full table is at https://doi.org/10.5281/zenodo.4073433.

Enrichment of enhancer marks
Histone modifications are associated with epigenetic regulation of transcription 43,44 . As such, examining histone marks that colocalize with interacting fragments can be helpful in determining both locus-specific properties, e.g., enhancer-promoter contacts, as well as global properties, such as enrichment of histone marks. Using the MK dataset, which had histone mark data available from the International Human Epigenome Consortium 45 for the same cell type, we performed enrichment analysis of four histone marks (H3K27ac, H3K36me3, H3K4me1 and H3K4me3). These marks are associated with active enhancers, promoters and transcribed regions. As the baits in the MK dataset were designed to capture promoters, baits and all bait-to-bait interactions were excluded such that the non-bait interacting fragments (target fragments) of each interaction peak were isolated from CHiCANE calls. The coordinates of the target fragments can be overlapped with the coordinates of histone marks using the bedtools 'intersect' command. A similar analysis can then be repeated on random sets of interactions created with the stratified.enrichment.sample() function (Step 21), creating an estimate of the background distribution of enrichment of a given mark. Resulting data can be examined either for the presence of a specific histone mark in a region of interest or for global enrichment or depletion of histone marks against the estimated background. Here we show results for the full MK dataset, where fold change is estimated between the proportion of interactions overlapping with a histone mark in our observed dataset and likewise mean proportion estimated from 100 distance-matched randomly sampled datasets (Fig. 7). A fold change of >1 shows that the observed data is enriched for the tested enhancer mark compared to the randomly sampled dataset. A similar analysis could be repeated for coordinates of other key regulatory features of interest, such as the Polycomb-mediated transcriptional repression mark H3K27me3, or the binding motifs of chromatin conformation mediator CTCF transcription factor and RAD21 double-strand break repair protein.

Overlap with topologically associating domains
TADs are key subunits of chromatin compartmentalization and define the regulatory landscape 46 . Genes tend to be coregulated when they are within the same TAD boundaries and hence TADs serve as a useful chromatin feature to evaluate functional activity of CHi-C interactions. A BED file containing coordinates of TAD boundaries can be intersected with a BED file with full-length coordinates of interaction peaks (Step 22) or TAD boundaries can be included as features in a locus plot created with the create.locus.plot() function (Step 20). To include TADs in a locus plot, we set the 'genomic.features' parameter to a BED file with the TAD coordinates for MK 9 and assess chromatin interactions for the chr9:103500000-106500000 locus alongside TAD boundaries. This region demonstrates interactions falling largely within a TAD region rather than crossing the TAD boundaries (Fig. 8). TAD boundaries are known for enrichment of CTCF binding proteins 47 , and any aberration or disruption in the orientation of CTCF sites can lead to changes in TAD boundaries 46 .  Therefore, we also recommend intersecting with CTCF binding site coordinates from the matching cell type or one with similar genetic background, as this can further help to prioritize truly functional interaction peaks.

Exploring the impact of covariates and statistical distributions
This section is intended for advanced users who have an understanding of statistical models for count data (see 'Experimental design'). As CHiCANE was conceived as a toolkit for processing and modeling of CHi-C count data, it provides a number of additional parameters to fully interrogate CHi-C libraries beyond the default settings. Of these, two key parameters are: (i) 'adjustment.terms', which allows for adjustment for potentially confounding variables (Step 13), and (ii) 'distribution', which serves to fit alternative distributions to model count data (Step 9). Adjusting the model is often necessary if either the known global properties of the genome act as potential biases (e.g., high genome instability 48,49 , heterogeneity in GC content 23 ), or when annotation of the resulting interaction peaks highlights potential biases (e.g., selection of longer non-captured/target fragments). These scenarios pose a question as to whether the observed interaction peaks are truly independent of known or observed confounders. In order to adjust the model fitting process, thereby identifying interaction peaks that are independent of potential confounders, the chicane() call can be parameterized using 'adjustment.terms' specifying any number of additional variables (Step 13). Using the pCHi-C MK library, we exemplify this parameter by adjusting the default negative binomial model with three variables separately: % GC content, length of interacting fragments and presence/absence of an interaction within a TAD boundary (Fig. 9a). Compared to the default settings (blue bars), adjusting for the bait and target fragment lengths resulted in a substantially reduced number of interaction peaks in the 1-10 Mb distance bin (green bar). This might not be of concern; however, it does shed light on variables that might bias our interpretations. Next, we explore the impact of the choice of the underlying distribution to model the observed counts data. Although not directly applicable to the MK library, which can be appropriately modeled with the negative binomial distribution (Extended Data Fig. 2), we examined the impact on CHiCANE's interaction calling when using the Poisson and truncated Poisson distributions. The resulting interaction peaks showed substantial differences compared to the default negative binomial model, with both the Poisson and truncated Poisson models having identified an approximately tenfold higher number of interaction peaks across all distance bins (Fig. 9b). These numbers are likely to contain a greater proportion of false positives given that the negative binomial model's fits were reasonable (Extended Data Fig. 2). However, these data might be useful for a discovery focused study where a low type II error rate is desired at the cost of a high type I error rate, provided the model fits for these models explain the data adequately well. These alternative model fits might also be appropriate when the default negative binomial distribution does not explain the data very well. During the model fitting process, if the count data exhibit a lack of overdispersion, CHiCANE automatically switches to the Poisson distribution (see Troubleshooting). Quantitative assessment of the quality of model fits between the default negative binomial and Poisson distributions can be done by comparing the fitted log-likelihood estimates of the matched distance bins (e.g., see model fit rootograms in 'Assessing model fitting of CHiCANE and resulting interaction peaks'), with the higher estimate indicating a better fit. To test the statistical significance of the difference between the two log-likelihood estimates for a given distance bin, the likelihood ratio test statistic can be used (Step 23).

Code availability
The CHiCANE R package is freely available through CRAN: https://cran.r-project.org/web/packages/ chicane. | Testing different models. Assessment of interaction peaks (q-value < 0.05) by distance bins using a range of statistical models applied through CHiCANE to the Javierre MK library. a, Bar plots showing the number of interaction peaks (q-value < 0.05) across distance bins when data was modeled using CHiCANE's default negative binomial distribution (blue), which allows for overdispersion, further adjusted for % GC content of bait and target fragments (orange), lengths of bait and target fragments (green) and TADs specified as whether both bait and target fragments were within a TAD boundary (brown). b, Bar plots showing the number of interaction peaks (q-value < 0.05) across distance bins when data was modeled using CHiCANE's default negative binomial distribution (blue), which allows for overdispersion. Interaction peaks identified using additional models based on Poisson distribution (orange) and truncated Poisson distribution (green) fitted using CHiCANE are also shown. Although CHiCANE supports truncated negative binomial distribution, these data were not generated due to memory requirements for this relatively large MK library.