The chimeric DNA ligation products generated by Hi-C represent pairwise chromatin interactions, or physical 3D contacts, within the nucleus. Several different methods can then be employed to analyze the resulting contact maps to identify chromosomal structural patterns and their biological interpretations. Many of these data analysis approaches also apply to 3C-sequencing or other equivalent data.
=== Read mapping ===
Hi-C data produced by deep sequencing take the form of a traditional FASTQ file, and the reads can be aligned to the genome of interest using sequence alignment software (e.g. Bowtie, bwa). Long-read aligners often support chimeric alignment and can be directly applied to long-read Hi-C data. Short-read Hi-C alignment is more challenging. Notably, Hi-C generates ligation junctions of varying sizes, but the exact position of the ligation site is not measured. Several pipelines, including HiC-Pro, HIPPIE, HiCUP, and TADbit, map the two portions of a paired-end read separately in case they match distinct genomic positions, thus addressing the challenge of reads that span the ligation junction. Other pipelines (e.g. the 4D Nucleome Data Portal) often align short Hi-C reads with an alignment algorithm capable of chimeric alignment, such as bwa-mem, chromap, or dragmap. This procedure calls alignment once and is simpler than iterative mapping.
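As a toy illustration of mapping the two portions of a junction-spanning read separately, the sketch below splits a read at a putative ligation-junction motif so that each side could be passed to an aligner on its own. The function name `split_at_junction`, the `GATCGATC` motif (which assumes digestion at GATC sites followed by fill-in and blunt-end ligation), and the `min_len` cutoff are assumptions made for this example and are not taken from any of the pipelines named above.

```python
# Hypothetical helper, not part of any named pipeline.
# Assumption: GATC restriction sites with fill-in, so the junction reads GATCGATC.
JUNCTION = "GATCGATC"


def split_at_junction(read, junction=JUNCTION, min_len=20):
    """Split a read at the first junction motif.

    Returns the segments that are long enough to align. A read without
    the motif is returned whole, as a single-element list.
    """
    pos = read.find(junction)
    if pos == -1:
        return [read]
    half = len(junction) // 2
    left = read[: pos + half]   # ends with one copy of the restriction site
    right = read[pos + half:]   # starts with the other copy
    return [seg for seg in (left, right) if len(seg) >= min_len]


if __name__ == "__main__":
    example = "A" * 30 + JUNCTION + "C" * 30
    for segment in split_at_junction(example):
        print(len(segment), segment)
```

Each returned segment would then be aligned independently. Segments shorter than `min_len` are dropped here because very short sequences rarely map uniquely.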
=== Fragment assignment and filtering ===
The mapped reads are then each assigned a single genomic alignment location according to the 5' mapped position in the genome.

After binning, Hi-C data are stored in a symmetrical matrix format. QuASAR offers additional quality assessment: it compares replicate scores of the samples (provided that replicates are included in the experiment) to find the maximum usable resolution. Some publications have also scored interaction frequencies at the single-fragment level, where a higher coverage can be achieved even with a lower number of reads. HiCPlus, a tool developed by Zhang et al. in 2018, is able to impute Hi-C matrices similar to the original ones using only 1/16 of the original reads.

Matrix-balancing normalization assumes that all loci have equal visibility and attempts to balance the symmetrical matrix accordingly, by equalizing the sum of every row and column in the matrix. Algorithms of this kind include the Knight-Ruiz matrix-balancing approach and iterative correction and eigenvector decomposition (ICE) normalization.

Various methods exist to statistically characterize the properties of loci pairs separated by a given distance, but discrete binning and fitting continuous functions are two common ways to analyze the distance-dependent interaction frequencies between datapoints.

Tools such as the HiTC R package build on the original 2009 approach. Although each has its own differences and optimizations, their base protocols still rely on principal component analysis.
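The binning, balancing, and distance-dependence steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used by any named tool: the function names are hypothetical, contacts are assumed to lie on a single chromosome, and the balancing loop uses a damped (square-root) rescaling rather than the exact Knight-Ruiz or ICE procedure.

```python
import numpy as np


def bin_contacts(pairs, n_bins, bin_size):
    """Accumulate (pos1, pos2) contacts into a symmetric bin-by-bin matrix."""
    m = np.zeros((n_bins, n_bins))
    for p1, p2 in pairs:
        i, j = p1 // bin_size, p2 // bin_size
        m[i, j] += 1
        if i != j:
            m[j, i] += 1  # mirror off-diagonal contacts to keep symmetry
    return m


def balance(matrix, n_iter=200, tol=1e-6):
    """Rescale rows and columns until all non-empty rows have equal sums.

    Assumption: a damped variant of iterative balancing, chosen for
    simplicity; empty rows are left untouched.
    """
    m = np.asarray(matrix, dtype=float).copy()
    keep = m.sum(axis=1) > 0
    if not keep.any():
        return m
    for _ in range(n_iter):
        sums = m.sum(axis=1)
        scale = np.ones_like(sums)
        scale[keep] = sums[keep] / sums[keep].mean()
        if np.abs(scale - 1).max() < tol:
            break
        root = np.sqrt(scale)
        m /= np.outer(root, root)  # symmetric update preserves symmetry
    return m


def expected_by_distance(m):
    """Mean contact frequency at each bin separation (discrete binning)."""
    return np.array([np.diagonal(m, d).mean() for d in range(len(m))])


if __name__ == "__main__":
    raw = bin_contacts([(0, 25), (12, 13), (25, 3), (5, 14)], n_bins=3, bin_size=10)
    print(raw)
    print(balance(raw).sum(axis=1))
    print(expected_by_distance(raw))
```

Dividing by the square root of the row-sum ratio on both sides keeps the matrix symmetric at every step and avoids the overshoot that a full correction can cause on small matrices.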
=== 4. Topologically associating domains (TADs) ===
TADs are sub-Mb structures that may harbor gene-regulatory features, such as local promoter-enhancer interactions. Thus, TADs represent regulatory microenvironments and usually show up on a Hi-C map as blocks of highly self-interacting regions, in which interaction frequencies within the region are significantly higher than interaction frequencies between two adjacent regions. One approach is to calculate the average interaction frequency crossing over each bin within some predetermined genomic range. The resulting value is referred to as the insulation score and can be thought of as the average of a square sliding along the diagonal of the matrix (Crane et al.). In some methods, resolution-specific domains can be identified and a consensus set of domains conserved across resolutions can be calculated. Tools such as MrTADFinder, 3DNetMod, and Matryoshka have also been developed to achieve better computing performance on higher-resolution datasets.
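The insulation score described above can be sketched as follows. This is a minimal version under stated assumptions: the function name `insulation_score` and the `window` parameter (the side of the sliding square, in bins) are chosen for this example, and no normalization or boundary-calling step from the original method is reproduced.

```python
import numpy as np


def insulation_score(m, window):
    """Average contact frequency crossing over each bin.

    For bin i, the score is the mean of the square linking the `window`
    bins upstream of i to the `window` bins downstream of i. Bins too
    close to either end of the matrix are left as NaN.
    """
    m = np.asarray(m, dtype=float)
    n = len(m)
    score = np.full(n, np.nan)
    for i in range(window, n - window):
        score[i] = m[i - window:i, i + 1:i + 1 + window].mean()
    return score


if __name__ == "__main__":
    # Toy map: two self-interacting blocks of five bins each.
    toy = np.ones((10, 10))
    toy[:5, :5] = 10
    toy[5:, 5:] = 10
    print(insulation_score(toy, window=2))
```

On the toy map, the score dips at the bins flanking the block boundary, which is the signature used to place candidate TAD boundaries at local minima.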
=== 5. Point interactions ===
Biologically, regulatory interactions usually occur at a much smaller scale than TADs, and two genomic elements can activate or inhibit the expression of a gene across a distance as small as 1 kb. Therefore, point interactions are important in interpreting Hi-C maps, and are expected to appear as local enrichments in contact probability. However, current methodologies for the identification of point interactions are all implicit in nature, in that they do not specify what a point interaction should look like. Instead, point interactions are identified as outliers with higher interaction frequencies than expected within the Hi-C matrix, given that the background model consists only of the strongest signals, such as the distance-decay functions. The background model can be estimated and constructed using both local signal distributions and global (i.e. chromosome-wide or genome-wide) approaches. Many of the aforementioned bioinformatics packages incorporate algorithms to identify point interactions. In short, the significance of each individual pairwise interaction is calculated, and significantly high outliers are corrected for multiple testing before they are recognized as truly informative point interactions. It is helpful to complement identified point interactions with additional evidence, such as analysis of enrichment scores and biological replicates, to indicate that these interactions are indeed of biological significance.

== Uses ==